
A Linearly Convergent Algorithm for Distributed Principal Component Analysis

Arpita Gang and Waheed U. Bajwa. Preliminary versions of some of the results reported in this paper were presented at the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, May 12–17, 2019 [1]. The authors are with the Department of Electrical and Computer Engineering, Rutgers University–New Brunswick, NJ 08854 (Emails: {arpita.gang, waheed.bajwa}@rutgers.edu). This work was supported in part by the National Science Foundation under Awards CCF-1453073, CCF-1907658, and OAC-1940074, by the Army Research Office under Awards W911NF-17-1-0546 and W911NF-21-1-0301, and by the DARPA Lagrange Program under ONR/NIWC Contract N660011824020.
Abstract

Principal Component Analysis (PCA) is the workhorse tool for dimensionality reduction in this era of big data. While often overlooked, the purpose of PCA is not only to reduce data dimensionality, but also to yield features that are uncorrelated. Furthermore, the ever-increasing volume of data in the modern world often requires storage of data samples across multiple machines, which precludes the use of centralized PCA algorithms. This paper focuses on the dual objective of PCA, namely, dimensionality reduction and decorrelation of features, but in a distributed setting. This requires estimating the eigenvectors of the data covariance matrix, as opposed to only estimating the subspace spanned by the eigenvectors, when data is distributed across a network of machines. Although a few distributed solutions to the PCA problem have been proposed recently, convergence guarantees and/or communications overhead of these solutions remain a concern. With an eye towards communications efficiency, this paper introduces a feedforward neural network-based one time-scale distributed PCA algorithm termed Distributed Sanger’s Algorithm (DSA) that estimates the eigenvectors of the data covariance matrix when data is distributed across an undirected and arbitrarily connected network of machines. Furthermore, the proposed algorithm is shown to converge linearly to a neighborhood of the true solution. Numerical results are also provided to demonstrate the efficacy of the proposed solution.

Index Terms:
Dimensionality reduction, distributed feature learning, generalized Hebbian learning, principal component analysis

I Introduction

The modern era of machine learning involves leveraging massive amounts of high-dimensional data, which can have large computational and storage costs. To combat the complexities arising from the high dimensionality of data, dimensionality reduction and feature learning techniques play a pivotal and necessary role in information processing. The most common and widely used technique for this task is Principal Component Analysis (PCA) [2], which, in the simplest of terms, transforms data into uncorrelated features that aid conversion of data from a high-dimensional space to a low-dimensional space while retaining maximum information. At the same time, the sheer volume of available data makes it difficult to manage at a single location. There is an increasing number of scenarios where data is distributed across different locations, either due to storage constraints or by its inherent nature, as in the Internet-of-Things [3]. This aspect of modern-world data has led researchers to explore distributed algorithms that can process information across different locations/machines [4]. These issues have motivated us to study and develop algorithms for distributed PCA that are efficient in terms of computations and communications among multiple machines, and that can also be proven to converge at a fast rate.

When the data is available at a single location, one of the goals of PCA is to find a $K$-dimensional subspace, given by the column space of a matrix $\mathbf{X}\in\mathbb{R}^{d\times K}$, such that the zero-mean data samples $\mathbf{y}\in\mathbb{R}^{d}$ $(d\gg K)$ retain maximum information when projected onto $\mathbf{X}$. In other words, when reconstructed as $\mathbf{X}\mathbf{X}^{\mathrm{T}}\mathbf{y}$ (subject to $\mathbf{X}^{\mathrm{T}}\mathbf{X}=\mathbf{I}$), the data samples should have minimum reconstruction error. It can be shown that this minimal-error solution is given by the projection of data onto the subspace spanned by the eigenvectors of the data covariance matrix. This implies that, for dimensionality reduction, learning any basis of that subspace is sufficient; this is referred to as the subspace learning problem. But while simple dimensionality reduction does not necessarily need uncorrelated features, most downstream machine learning tasks like classification, pattern matching, regression, etc., work more efficiently when the data features are uncorrelated. In the case of image coding, e.g., PCA is known as the Karhunen–Loève transform [5], wherein images are compressed by decorrelating neighbouring pixels. With this goal in mind, one needs to find the specific directions that not only have maximum variance, but that also lead to uncorrelated features when data is projected onto them. These specific directions are given by the eigenvectors (also called the principal components) of the data covariance matrix, and not just any set of orthogonal basis vectors spanning the same space. Mathematically, along with minimum reconstruction error, the other goal of PCA is to ensure that the off-diagonal entries of $\mathbb{E}[\mathbf{X}^{\mathrm{T}}\mathbf{y}\mathbf{y}^{\mathrm{T}}\mathbf{X}]$ are zero (i.e., the data gets decorrelated), while finding the eigenspace of the covariance matrix $\mathbb{E}[\mathbf{y}\mathbf{y}^{\mathrm{T}}]$.

As explained above, the true and complete purpose of PCA is served when the search for the optimal solution ends with the specific set of eigenvectors of the data covariance matrix, and not just with the subspace they span. In principle, the principal components can be recovered from any other basis of the subspace by performing a singular value decomposition (SVD) of the obtained basis. However, the SVD has high computational complexity, which makes it an expensive step for big data. The traditional iterative solutions for PCA were developed precisely to avoid the cost of SVD, and hence reverting to it defeats the whole purpose.

Thus, even though the problem of dimensionality reduction of data has many optimal solutions (corresponding to all the sets of basis vectors spanning the $K$-dimensional space), our goal is to find only the ones that give the eigenvectors as the basis. In terms of the optimization geometry of the PCA problem, in which one tries to minimize the mean-squared reconstruction error under an orthogonality constraint, the objective is a non-convex strict-saddle function. In a strict-saddle function, all stationary points except the local minima are strict saddles, i.e., their Hessians have at least one negative eigenvalue, which helps in escaping these saddle points [6, 7]. Moreover, in the case of PCA the local minima coincide with the global minima. These geometric aspects make PCA, despite being non-convex, a "nice and solvable" problem whose optimal solution can be reached efficiently. However, note that the set of global minima contains, along with the set of eigenvectors as a basis, all other possible bases that are rotated with respect to the eigenvectors. Our goal is not to find just any global minimum, but to look into the very particular subset in which the basis is not rotated.

A very popular tool that has been used to learn features of data, and hence compress it, is the autoencoder. It was shown in [8] that the globally optimal weights of an autoencoder with minimum reconstruction error are the principal components of the covariance matrix of the input data. In [9], Oja described how using the Hebbian rule to update the weights of a linear neural network extracts the first principal component from the input data. Several other Hebbian-based rules like Rubner's model, the APEX model [10], the Generalized Hebbian Algorithm (GHA) [11], etc., were proposed to extend this idea of training a neural network for finding the first eigenvector to the extraction of the first $K$ principal components (eigenvectors) of the input covariance matrix. Given its parallelization potential, a feedforward linear neural network-based solution for PCA is therefore very attractive.

The other aspect of modern-day data is, as mentioned earlier, its massive size. Collating the huge amount of raw data is usually prohibitive due to communications overhead and/or privacy concerns. These reasons have encouraged researchers over the last couple of decades to develop algorithms that can solve various problems for non-collocated data. The algorithms developed to deal with such scenarios can be broadly classified into two categories: (1) setups where a central entity/server is required to coordinate among the various data centers to yield the final result, and (2) setups where the data is scattered over an arbitrary network of interconnected data centers with no central coordinating node. The authors in [12, 3] discuss these different setups and the algorithms developed for each of them in more detail. The second scenario is more generic, and algorithms developed for such setups can usually be easily modified to apply to the first scenario. The terms distributed and decentralized are used interchangeably for both setups in the literature. In this paper, we focus on the latter scenario of arbitrarily connected networks and henceforth call it the distributed setup. Our goal, therefore, is to solve the PCA problem in a distributed manner when data is scattered over a network of interconnected nodes, such that all the nodes in the network eventually agree with each other and converge to the true principal components of the distributed data.

I-A Relation to Prior Work

PCA was developed to find simpler models of smaller dimensions that can approximately fit some data. Seminal work was done by Pearson [13], who aimed at fitting a line to a set of points, and by Hotelling [2], who proposed a method for the classical PCA problem of decorrelating the features of a given set of data points (observations) by finding the principal components. Later, fast iterative methods like the power method, the Lanczos algorithm, and orthogonal iterations [14] were proposed, which were proven to converge to the eigenvectors at a linear rate in the case of symmetric matrices. Many other iterative methods proposed over the last few decades are based on the well-known Hebbian learning rule [15], such as Oja's method [9], the generalized Hebbian algorithm [11], APEX [10], etc. An analysis of Oja's algorithm is provided in [16], which shows that in the deterministic setting convergence to the first eigenvector is guaranteed at a linear rate under some conditions on the step size and initial estimates. The work in [17] extended the analysis to the generalized Hebbian case for convergence to the first $K$ principal eigenvectors for a specific choice of step size.

While ways to solve PCA in the centralized case, when data is available at a single location, have been around for nearly a century, distributed solutions are relatively recent. Within our distributed setup, where we assume a network of arbitrarily connected nodes with no central server, the data distribution can be broadly classified into two types: (1) distribution by features, and (2) distribution by samples. The PCA algorithms for these two kinds of distribution are significantly different. While both are completely distributed, the first kind [18, 19, 20, 21] involves estimating only one (or a subset) of feature(s) at each node. In this paper, we focus on finding the eigenvectors when the distribution is by samples, which requires estimation of the whole set of eigenvectors at each node of the network. For this type of distribution, the power method was adapted for the distributed setup as a subroutine in [22, 23, 24] to extract the first principal component of the global covariance matrix. Such methods make use of an explicit consensus loop [25] after each power iteration to ensure that the nodes (approximately) agree with each other. While this is a novel approach that accurately reaches the required solution at the nodes (albeit with a small error due to the consensus iterations), its two time-scale nature makes it relatively slow in terms of communications efficiency. Furthermore, finding multiple principal components with these approaches requires a sequential procedure in which subsequent components are determined from the covariance matrix residue left after projection onto estimates of the higher-order components. In contrast, the work in [26] focuses on finding the top eigenvector in the distributed setup for the streaming data case. A detailed review of various distributed PCA algorithms that exist for different setups is provided in [27].

Next, note that distributed optimization-based algorithms for non-convex problems have been studied only recently, and those dealing with constrained problems are even fewer. In [28], it is shown that a distributed algorithm for an unconstrained non-convex problem converges to a stationary solution at a sublinear rate. The methods proposed in [29, 30] deal with non-convex objective functions in a distributed setup when the constraint set is convex, and [31] works with convex approximations of non-convex problems. Thus, none of these methods are directly applicable to the distributed PCA problem in our setup.

Finally, we proposed an efficient distributed PCA solution in [1] for a network in which the data is split sample-wise among the interconnected nodes. In this paper, we extend the preliminary work in [1] and provide a detailed mathematical analysis of the proposed algorithm, along with exact convergence rates and extensive numerical experiments.

I-B Our Contributions

The main contributions of this paper are (1) a novel algorithm for distributed PCA, (2) theoretical guarantees for the proposed distributed algorithm with a linear convergence rate to a small neighborhood of the true PCA solution, and (3) experimental results to further demonstrate the efficacy of the proposed algorithm.

Our focus in this paper is to solve the distributed PCA problem so as to find a solution that not only enables dimensionality reduction, but that also provides uncorrelated features of data distributed over a network. That is, our goal is to estimate the true eigenvectors of the covariance matrix of the data distributed across an arbitrarily connected network, not just a subspace spanned by them. We also focus on providing a solution that is efficient in terms of communications between the interconnected nodes of an arbitrary network. To that end, we propose a distributed algorithm that is based on the generalized Hebbian algorithm (GHA) proposed by Sanger [11], wherein the nodes perform local computations along with information exchange with their directly connected neighbors, similar to the idea followed in the distributed gradient descent (DGD) approach in [32]. The local computations do not involve the calculation of any gradient; instead, we use a "pseudo gradient," which we henceforth call the Sanger direction. In our proposed solution, termed the Distributed Sanger's Algorithm (DSA), we have also done away with the need for explicit consensus iterations to make the nodes agree with each other, thereby making it a one time-scale solution that is more communications efficient. Theoretical guarantees are also provided for our proposed distributed PCA algorithm when using a constant step size. The analysis shows that, when using a constant step size $\alpha$, the DSA solution reaches an $\mathcal{O}(\alpha)$-neighborhood of the optimal solution at a linear rate when the error metric is the angles between the estimated vectors and the true eigenvectors. (Our results can also be extrapolated to guarantee exact convergence with a decaying step size, albeit at a slower-than-linear rate.) We also provide experimental results and comparisons with centralized orthogonal iteration [14], centralized GHA [11], a sequential distributed power method-based approach, and distributed projected gradient descent. The results support our claims and analysis.

To the best of our knowledge, this is the first solution for distributed PCA that uses a Hebbian update, achieves network agreement without the use of explicit consensus iterations, and still provably reaches the globally optimum solution (within an error margin) at all nodes at a linear rate.

I-C Notation and Organization

The following notation is used in this paper. Scalars and vectors are denoted by lower-case and lower-case bold letters, respectively, while matrices are denoted by upper-case bold letters. The operator $|\cdot|$ denotes the absolute value of a scalar quantity. The superscript in $\mathbf{a}^{(t)}$ denotes the time (or iteration) index, while $a^{t}$ denotes the exponentiation operation. The superscript $(\cdot)^{\mathrm{T}}$ denotes the transpose operation, $\|\cdot\|_{F}$ denotes the Frobenius norm of matrices, while both $\|\cdot\|$ and $\|\cdot\|_{2}$ denote the $\ell_{2}$-norm of vectors. Given a matrix $\mathbf{A}$, both $a_{ij}$ and $(\mathbf{A})_{ij}$ denote its entry at the $i^{th}$ row and $j^{th}$ column, while $\mathbf{a}_{j}$ denotes its $j^{th}$ column.

The rest of the paper is organized as follows. In Section II, we describe and mathematically formulate the distributed PCA problem, while Section III describes the proposed distributed algorithm, which is based on the generalized Hebbian algorithm. In Section IV, we derive a general result for a modified generalized Hebbian algorithm that aids in the convergence analysis of the proposed distributed algorithm, while convergence guarantees for the proposed algorithm are provided in Section V. We provide numerical results in Section VI to show the efficacy of the proposed method and provide concluding remarks in Section VII. The statements and proofs of some auxiliary lemmas, which are needed for the proofs of the main lemmas that are used within the convergence analysis in Sections IV and V, are given in Appendix A, while Appendices B–F contain the formal statements and proofs of the main lemmas.

II Problem Formulation

Principal Component Analysis (PCA) aims at finding the basis of a low-dimensional space that can decorrelate the features of data points and also retain maximum information. More formally, for a random vector $\mathbf{y}\in\mathbb{R}^{d}$ with $\mathbb{E}[\mathbf{y}]=\mathbf{0}$, PCA involves finding the top-$K$ eigenvectors of the covariance matrix $\boldsymbol{\Sigma}:=\mathbb{E}[\mathbf{y}\mathbf{y}^{\mathrm{T}}]$. The zero-mean assumption is made here without loss of generality, as the mean can be subtracted in case the data is not centered. Mathematically, PCA can be formulated as

$$\mathbf{X}=\underset{\mathbf{X}\in\mathbb{R}^{d\times K}}{\operatorname{arg\,min}}\ \mathbb{E}\big[\|\mathbf{y}-\mathbf{X}\mathbf{X}^{\mathrm{T}}\mathbf{y}\|_{2}^{2}\big]\qquad\text{such that}\qquad\forall l\neq q,\ \big(\mathbb{E}\big[\mathbf{X}^{\mathrm{T}}\mathbf{y}\mathbf{y}^{\mathrm{T}}\mathbf{X}\big]\big)_{lq}=0. \tag{1}$$

The constraint $\big(\mathbb{E}\big[\mathbf{X}^{\mathrm{T}}\mathbf{y}\mathbf{y}^{\mathrm{T}}\mathbf{X}\big]\big)_{lq}=0,\ \forall l\neq q,$ ensures that $\mathbf{X}$ decorrelates the features of $\mathbf{y}$. Now, $\mathbb{E}\big[\mathbf{X}^{\mathrm{T}}\mathbf{y}\mathbf{y}^{\mathrm{T}}\mathbf{X}\big]=\mathbf{X}^{\mathrm{T}}\mathbb{E}\big[\mathbf{y}\mathbf{y}^{\mathrm{T}}\big]\mathbf{X}=\mathbf{X}^{\mathrm{T}}\boldsymbol{\Sigma}\mathbf{X}$, and it is straightforward to see that this quantity is diagonal only if $\mathbf{X}$ contains the eigenvectors of $\boldsymbol{\Sigma}$. This explains why the search for a solution of PCA ends with the eigenvectors and not just the subspace spanned by them. In practice, we do not have access to $\boldsymbol{\Sigma}$, and so a covariance matrix estimated from the samples of $\mathbf{y}$ is used instead. Specifically, for a dataset with $N$ samples $\{\mathbf{y}_{l}\}_{l=1}^{N}$, or equivalently, for a data matrix $\mathbf{Y}:=\begin{bmatrix}\mathbf{y}_{1},\mathbf{y}_{2},\dots,\mathbf{y}_{N}\end{bmatrix}$, the sample covariance matrix can be written as $\mathbf{C}=\frac{1}{N}\mathbf{Y}\mathbf{Y}^{\mathrm{T}}$ such that $\boldsymbol{\Sigma}:=\mathbb{E}[\mathbf{C}]$. The true solution for PCA is then obtained by finding the eigenvectors of the covariance matrix $\mathbf{C}$, which are also the left singular vectors of the data matrix $\mathbf{Y}$. The empirical form of (1) is thus

$$\mathbf{X}=\underset{\mathbf{X}\in\mathbb{R}^{d\times K}}{\operatorname{arg\,min}}\ f(\mathbf{X})=\underset{\mathbf{X}\in\mathbb{R}^{d\times K}}{\operatorname{arg\,min}}\ \|\mathbf{Y}-\mathbf{X}\mathbf{X}^{\mathrm{T}}\mathbf{Y}\|_{F}^{2}\qquad\text{such that}\qquad\forall l\neq q,\ \big(\mathbf{X}^{\mathrm{T}}\mathbf{Y}\mathbf{Y}^{\mathrm{T}}\mathbf{X}\big)_{lq}=0. \tag{2}$$

In the literature, however, PCA is usually posed with a 'relaxed' orthogonality constraint $\mathbf{X}^{\mathrm{T}}\mathbf{X}=\mathbf{I}$ instead of $\big(\mathbf{X}^{\mathrm{T}}\mathbf{Y}\mathbf{Y}^{\mathrm{T}}\mathbf{X}\big)_{lq}=0,\ \forall l\neq q$, as follows:

$$\mathbf{X}=\underset{\mathbf{X}\in\mathbb{R}^{d\times K},\,\mathbf{X}^{\mathrm{T}}\mathbf{X}=\mathbf{I}}{\operatorname{arg\,min}}\ f(\mathbf{X})=\underset{\mathbf{X}\in\mathbb{R}^{d\times K},\,\mathbf{X}^{\mathrm{T}}\mathbf{X}=\mathbf{I}}{\operatorname{arg\,min}}\ \|\mathbf{Y}-\mathbf{X}\mathbf{X}^{\mathrm{T}}\mathbf{Y}\|_{F}^{2}. \tag{3}$$

The optimization formulation in (3) with this constraint will only lead to a subspace spanned by the eigenvectors of $\mathbf{C}$ as the solution, thus actually making it a Principal Subspace Analysis (PSA) formulation. In other words, although the formulation (3) gives a solution on the Stiefel manifold, the actual PCA formulation (2) requires the solution to lie in a very specific subset of that manifold corresponding to the eigenvectors of $\mathbf{C}$. The accuracy of the solutions given by the PCA and PSA formulations will be the same when measured in terms of the principal angles between the subspace estimates and the true subspace spanned by the eigenvectors of the covariance matrix. Specifically, if $\mathbf{X}=\begin{bmatrix}\mathbf{x}_{1},\cdots,\mathbf{x}_{K}\end{bmatrix}$ is an estimate of the basis of the space spanned by the eigenvectors $\mathbf{Q}=\begin{bmatrix}\mathbf{q}_{1},\cdots,\mathbf{q}_{K}\end{bmatrix}$, then the principal angles between $\mathbf{Q}$ and $\mathbf{X}$ given by either (2) or (3) will be the same. A more suitable measure of accuracy for any PCA solution is therefore the set of angles between $\mathbf{x}_{i}$ and $\mathbf{q}_{i}$ for all $i=1,\ldots,K$, which motivates us to judge the efficacy of any solution with respect to this metric instead of the principal subspace angles.
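To make this error metric concrete, the following minimal NumPy sketch (an illustration only; all variable and function names are hypothetical) computes the per-eigenvector angles between the columns of an estimate $\mathbf{X}$ and the true eigenvectors $\mathbf{Q}$ of a sample covariance matrix. Since eigenvectors are only defined up to sign, the absolute value of the inner product is used.

```python
import numpy as np

def eigenvector_angles(X, Q):
    """Angle (degrees) between each estimated column x_i and the
    corresponding true eigenvector q_i; the sign ambiguity is ignored."""
    cosines = np.abs(np.sum(X * Q, axis=0)) / (
        np.linalg.norm(X, axis=0) * np.linalg.norm(Q, axis=0))
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

# Example: ground-truth eigenvectors of a sample covariance matrix.
rng = np.random.default_rng(0)
d, N, K = 10, 1000, 3
Y = rng.standard_normal((d, N)) * np.arange(1, d + 1)[:, None]  # data from a zero-mean model
C = (Y @ Y.T) / N                                               # sample covariance
eigvals, eigvecs = np.linalg.eigh(C)
Q = eigvecs[:, ::-1][:, :K]                 # top-K eigenvectors (descending order)

X = Q + 0.05 * rng.standard_normal((d, K))  # a perturbed estimate
print(eigenvector_angles(X, Q))             # small angles indicate an accurate PCA solution
```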

In the distributed setup considered in this paper, we consider a network of $M$ nodes such that the undirected graph $\mathcal{G}:=(\mathcal{V},\mathcal{E})$ describing the network is connected. Here, $\mathcal{V}=\{1,2,\dots,M\}$ is the set of nodes and $\mathcal{E}$ is the set of edges, i.e., $(i,j)\in\mathcal{E}$ if there is a direct path between $i$ and $j$. The set of neighbors of any node $i$ is denoted by $\mathcal{N}_{i}$. Under the setup of samples being distributed over the $M$ nodes, let us assume that the $i^{th}$ node has a data matrix $\mathbf{Y}_{i}$ containing $N_{i}$ samples such that $N=\sum_{i=1}^{M}N_{i}$. Thus, each node has access only to a local covariance matrix $\mathbf{C}_{i}=\frac{1}{N_{i}}\mathbf{Y}_{i}\mathbf{Y}_{i}^{\mathrm{T}}$ instead of the global covariance matrix, but one can see that $N\mathbf{C}=\sum_{i=1}^{M}N_{i}\mathbf{C}_{i}$. In this setting, a straightforward approach might be for each node to find its own solution independent of the data at all the other nodes. While this might seem viable, this approach has major drawbacks. Recall that the sample covariance $\mathbf{C}$ approximates the population covariance $\boldsymbol{\Sigma}$ at a rate of $\mathcal{O}(f(N^{-1}))$, where $f$ is some function (depending on the distribution) of the number of samples $N$ [33]. Since the local data has fewer samples than the global data, working with the local covariance matrix $\mathbf{C}_{i}$ alone instead of somehow using the whole data leads to a larger error in the estimation of the eigenvectors. Also, since uniform sampling from the underlying data distribution is not guaranteed in distributed setups, the samples at a node may come from a narrow part of the entire distribution, and thus be biased away from the true distribution. This invites the need for the nodes to collaborate amongst themselves in a way that all the data is utilized to find estimates of the eigenvectors at each node while ensuring that all the nodes agree with each other. Thus, for the distributed setting, the PCA problem in (1) can be rewritten as

$$\mathbf{X}=\underset{\mathbf{X}\in\mathbb{R}^{d\times K}}{\operatorname{arg\,min}}\ \sum_{i=1}^{M}f_{i}(\mathbf{X})=\underset{\mathbf{X}\in\mathbb{R}^{d\times K}}{\operatorname{arg\,min}}\ \sum_{i=1}^{M}\|\mathbf{Y}_{i}-\mathbf{X}\mathbf{X}^{\mathrm{T}}\mathbf{Y}_{i}\|_{F}^{2}\quad\text{such that}\quad\forall l\neq q,\ \Big(\mathbf{X}^{\mathrm{T}}\big(\sum_{i=1}^{M}\mathbf{Y}_{i}\mathbf{Y}_{i}^{\mathrm{T}}\big)\mathbf{X}\Big)_{lq}=0. \tag{4}$$

It is easy to see that $\sum_{i=1}^{M}f_{i}(\mathbf{X})=f(\mathbf{X})$. Also, in a distributed setup, each node $i$ maintains its own copy $\mathbf{X}_{i}$ of the variable $\mathbf{X}$ due to the difference in the local information (local data) they carry. Thus, all nodes need to agree with each other to ensure the entire network reaches the same true solution. Hence, the true distributed PCA objective is written as

$$\underset{\mathbf{X}_{i}\in\mathbb{R}^{d\times K}}{\operatorname{arg\,min}}\ \sum_{i=1}^{M}\|\mathbf{Y}_{i}-\mathbf{X}_{i}\mathbf{X}_{i}^{\mathrm{T}}\mathbf{Y}_{i}\|_{F}^{2}\quad\text{such that}\quad\forall j\in\mathcal{N}_{i},\ \mathbf{X}_{i}=\mathbf{X}_{j}\quad\text{and}\quad\forall l\neq q,\ \Big(\mathbf{X}_{i}^{\mathrm{T}}\big(\sum_{i=1}^{M}\mathbf{Y}_{i}\mathbf{Y}_{i}^{\mathrm{T}}\big)\mathbf{X}_{i}\Big)_{lq}=0. \tag{5}$$

Note that (1)–(5) are non-convex optimization problems due to the non-convexity of the constraint set. One possible approach to the PCA problem is to instead solve a convex relaxation of the original non-convex function [34, 35]. The issue with these solutions is that they require $O(d^{2})$ memory and computation, which can be prohibitive in high-dimensional settings. In addition, due to the $O(d^{2})$ iterate size, these solutions are not ideal for distributed settings. Also, these formulations, without any further constraints, will not necessarily give a basis that is the set of dominant eigenvectors; instead, they might end up giving a rotated basis, as explained earlier, thereby not completing the task of decorrelating the features. Hence, in this paper we use an algebraic method based on GHA for neural network training, which has $O(dK)$ memory and computation requirements, to solve the distributed PCA problem. Our goal is to converge to the true eigenvectors of the global covariance matrix $\mathbf{C}$ at every node of the network. As noted earlier in Section I-A, distributed variants of the power method exist in the literature [22, 24, 23] that can find the dominant eigenvector, but these methods employ two time-scale approaches that involve several consensus averaging rounds for each iteration of the power method. Such two time-scale approaches can be expensive in terms of communications cost. In this paper, we propose a one time-scale method that finds the top $K$ eigenvectors of the global sample covariance matrix $\mathbf{C}$ at each node through local computations and information exchange with neighbors. The proposed method also converges linearly up to a neighborhood of the true solution when the error metric considered is the angle between the estimates and the true eigenvectors.
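As a quick sanity check on the sample-wise data split described above, the following sketch (a minimal illustration using NumPy; the split and all variable names are our own assumptions) partitions a data matrix across $M$ nodes and verifies that $N\mathbf{C}=\sum_{i=1}^{M}N_{i}\mathbf{C}_{i}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, M = 8, 600, 4
Y = rng.standard_normal((d, N))               # full data matrix (columns are samples)
C = (Y @ Y.T) / N                             # global sample covariance

# Split the samples unevenly across M nodes.
splits = np.sort(rng.choice(np.arange(1, N), size=M - 1, replace=False))
Y_nodes = np.split(Y, splits, axis=1)
C_nodes = [(Yi @ Yi.T) / Yi.shape[1] for Yi in Y_nodes]   # local covariances C_i

# Check N*C == sum_i N_i * C_i (up to floating-point error).
lhs = N * C
rhs = sum(Yi.shape[1] * Ci for Yi, Ci in zip(Y_nodes, C_nodes))
print(np.allclose(lhs, rhs))                  # True
```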

III The Proposed Algorithm

In [11], Sanger proposed the generalized Hebbian algorithm (GHA) to train a neural network and find the eigenvectors of the input autocorrelation matrix (same as the covariance matrix for zero-mean input). The outputs of such a network, when the weights are given by the eigenvectors, are the uncorrelated features of the input data that allow data reconstruction with minimal error, hence serving the true purpose of PCA. The algorithm was originally developed to tackle the centralized PCA problem in the case of streaming data, where a new data sample $\mathbf{y}_{t},\ t=1,2,\ldots,$ arrives at each time instance.

In this paper we consider a batch setting, but the alignment of GHA with our basic goal of finding the eigenvectors motivates us to leverage it for our distributed setup. The rationale behind extrapolating the streaming case to a distributed batch setting is simple: since $\mathbb{E}[\mathbf{y}_{t}\mathbf{y}_{t}^{\mathrm{T}}]=\mathbb{E}[\mathbf{Y}_{i}\mathbf{Y}_{i}^{\mathrm{T}}]=\boldsymbol{\Sigma}$, the sample-wise distributed data setting can be seen as a mini-batch variant of the streaming data setting. In the context of neural network training, our approach can be viewed as training a network at each node with a mini-batch of samples in a way that all nodes end up with the same trained network whose weights are given by the eigenvectors of the autocorrelation matrix of the entire batch of samples.

The GHA iterate given in [11] has the following update for the matrix of eigenvectors (i.e., the neural network weight matrix) $\mathbf{X}$ when the $t^{th}$ sample $\mathbf{y}_{t}$ arrives at the input of the neural network:

$$\mathbf{X}^{(t+1)}=\mathbf{X}^{(t)}+\alpha_{t}\Big[\mathbf{y}_{t}\mathbf{y}_{t}^{\mathrm{T}}\mathbf{X}^{(t)}-\mathbf{X}^{(t)}\boldsymbol{\mathcal{U}}\Big((\mathbf{X}^{(t)})^{\mathrm{T}}\mathbf{y}_{t}\mathbf{y}_{t}^{\mathrm{T}}\mathbf{X}^{(t)}\Big)\Big], \tag{6}$$

where $\boldsymbol{\mathcal{U}}:\mathbb{R}^{K\times K}\rightarrow\mathbb{R}^{K\times K}$ is an operator that sets all the elements below the diagonal to zero and $\alpha_{t}$ is the step size. For $K=1$, and defining $\boldsymbol{\Sigma}_{t}=\mathbf{y}_{t}\mathbf{y}_{t}^{\mathrm{T}}$, it was shown in [9] that the term $(\mathbf{X}^{(t)})^{\mathrm{T}}\mathbf{y}_{t}\mathbf{y}_{t}^{\mathrm{T}}\mathbf{X}^{(t)}=(\mathbf{X}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{X}^{(t)}$ is the consequence of a power series approximation of Oja's rule in lieu of the explicit normalization used in the case of the power method. In the case of $K>1$, $\boldsymbol{\mathcal{U}}\big((\mathbf{X}^{(t)})^{\mathrm{T}}\mathbf{y}_{t}\mathbf{y}_{t}^{\mathrm{T}}\mathbf{X}^{(t)}\big)=\boldsymbol{\mathcal{U}}\big((\mathbf{X}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{X}^{(t)}\big)$ helps combine Oja's algorithm with the Gram–Schmidt orthogonalization step as follows:

$$\begin{aligned}
\mathbf{X}^{(t)}\boldsymbol{\mathcal{U}}\Big((\mathbf{X}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{X}^{(t)}\Big)
&=\mathbf{X}^{(t)}\boldsymbol{\mathcal{U}}\Bigg(\begin{bmatrix}(\mathbf{x}_{1}^{(t)})^{\mathrm{T}}\\ \vdots\\ (\mathbf{x}_{K}^{(t)})^{\mathrm{T}}\end{bmatrix}\boldsymbol{\Sigma}_{t}\begin{bmatrix}\mathbf{x}_{1}^{(t)}&\cdots&\mathbf{x}_{K}^{(t)}\end{bmatrix}\Bigg)\\
&=\mathbf{X}^{(t)}\boldsymbol{\mathcal{U}}\Bigg(\begin{bmatrix}(\mathbf{x}_{1}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{1}^{(t)}&(\mathbf{x}_{1}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{2}^{(t)}&\ldots&(\mathbf{x}_{1}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{K}^{(t)}\\ (\mathbf{x}_{2}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{1}^{(t)}&(\mathbf{x}_{2}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{2}^{(t)}&\ldots&(\mathbf{x}_{2}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{K}^{(t)}\\ \vdots&\vdots&\ddots&\vdots\\ (\mathbf{x}_{K}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{1}^{(t)}&(\mathbf{x}_{K}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{2}^{(t)}&\ldots&(\mathbf{x}_{K}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{K}^{(t)}\end{bmatrix}\Bigg)\\
&=\mathbf{X}^{(t)}\begin{bmatrix}(\mathbf{x}_{1}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{1}^{(t)}&(\mathbf{x}_{1}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{2}^{(t)}&\ldots&(\mathbf{x}_{1}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{K}^{(t)}\\ 0&(\mathbf{x}_{2}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{2}^{(t)}&\ldots&(\mathbf{x}_{2}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{K}^{(t)}\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&(\mathbf{x}_{K}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{K}^{(t)}\end{bmatrix}\\
&=\begin{bmatrix}(\mathbf{x}_{1}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{1}^{(t)}\,\mathbf{x}_{1}^{(t)}&\sum_{p=1}^{2}(\mathbf{x}_{p}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{2}^{(t)}\,\mathbf{x}_{p}^{(t)}&\ldots&\sum_{p=1}^{K}(\mathbf{x}_{p}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{K}^{(t)}\,\mathbf{x}_{p}^{(t)}\end{bmatrix}. \tag{7}
\end{aligned}$$

Thus, for any $k=1,\ldots,K$, the term involving $\boldsymbol{\mathcal{U}}(\cdot)$ in (6) includes an implicit normalization term $(\mathbf{x}_{k}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{k}^{(t)}\,\mathbf{x}_{k}^{(t)}$ as well as an orthogonalization term $\sum_{p=1}^{k-1}(\mathbf{x}_{p}^{(t)})^{\mathrm{T}}\boldsymbol{\Sigma}_{t}\mathbf{x}_{k}^{(t)}\,\mathbf{x}_{p}^{(t)}$, which, analogous to the Gram–Schmidt orthogonalization procedure, forces the estimate $\mathbf{x}_{k}^{(t)}$ to be orthogonal to all the estimates $\mathbf{x}_{p}^{(t)},\ p=1,\ldots,k-1$. Another important thing to note about the GHA is that, in order to estimate the dominant $K$ eigenvectors, it only requires the corresponding top $K$ eigenvalues to be distinct (and nonzero); in other words, it does not require the covariance matrix to be non-singular.

In the deterministic setting, where we have the full batch of data instead of a new sample at every instance, this iterate changes to

$$\mathbf{X}^{(t+1)}=\mathbf{X}^{(t)}+\alpha_{t}\Big[\mathbf{C}\mathbf{X}^{(t)}-\mathbf{X}^{(t)}\boldsymbol{\mathcal{U}}\Big((\mathbf{X}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{X}^{(t)}\Big)\Big]=\mathbf{X}^{(t)}+\alpha_{t}\boldsymbol{\mathcal{H}}(\mathbf{C},\mathbf{X}^{(t)}). \tag{8}$$

Here, we term $\boldsymbol{\mathcal{H}}:\mathbb{R}^{d\times d}\times\mathbb{R}^{d\times K}\rightarrow\mathbb{R}^{d\times K}$, $\boldsymbol{\mathcal{H}}(\mathbf{C},\mathbf{X}^{(t)}):=\mathbf{C}\mathbf{X}^{(t)}-\mathbf{X}^{(t)}\boldsymbol{\mathcal{U}}\big((\mathbf{X}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{X}^{(t)}\big)$, the Sanger direction. An iterate similar to (8) has been proven in [17] to converge globally for a very specific choice of step sizes that depend on the iterate itself. A straightforward extension of that scheme to the distributed case is not possible, as it would lead to different step sizes at different nodes of the network, making it difficult to establish convergence guarantees. Hence, to adapt this iterative method to our distributed setup, we use the typical combine-and-update strategy used widely in the literature on distributed algorithms, such as [32, 36, 37, 38]. The main contributions of such works lie in showing that the resulting distributed algorithms achieve consensus (i.e., all nodes eventually have the same iterate values) and, in addition, that the consensus value is the same as the centralized solution. The convergence guarantees for these methods are, however, mainly restricted to convex and strongly convex problems. Our distributed version of (8) for PCA, which is non-convex, is based on similar combine-and-update principles.
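For concreteness, a minimal NumPy sketch of the Sanger direction and the batch iterate (8) with a small constant step size is given below (an illustration only, with hypothetical function names; the triangular operator $\boldsymbol{\mathcal{U}}$ is realized with `np.triu`, and the constant step size is borrowed heuristically from the bound in Theorem 1 rather than from the exact scheme analyzed in [17]).

```python
import numpy as np

def sanger_direction(C, X):
    """Sanger direction H(C, X) = C X - X * uppertri(X^T C X), cf. (8)."""
    return C @ X - X @ np.triu(X.T @ C @ X)

def batch_gha(C, K, alpha, num_iter=20000, seed=0):
    """Batch GHA iterate (8) with a constant step size alpha."""
    rng = np.random.default_rng(seed)
    X, _ = np.linalg.qr(rng.standard_normal((C.shape[0], K)))  # orthonormal initialization
    for _ in range(num_iter):
        X = X + alpha * sanger_direction(C, X)
    return X

# Example: recover the top-K eigenvectors of a synthetic sample covariance.
rng = np.random.default_rng(1)
d, N, K = 10, 2000, 3
Y = rng.standard_normal((d, N)) * np.linspace(3.0, 0.5, d)[:, None]
C = (Y @ Y.T) / N
lam1 = np.linalg.eigvalsh(C).max()
X = batch_gha(C, K, alpha=1.0 / (3 * lam1 * (2 * K - 1)))   # step size bound from Theorem 1
Q = np.linalg.eigh(C)[1][:, ::-1][:, :K]                    # ground truth, descending order
print(np.abs(np.sum(X * Q, axis=0)))                        # column-wise |cos(angle)| close to 1
```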

Specifically, node $i$ at iteration $t$ carries a local copy $\mathbf{X}_{i}^{(t)}$ of the estimate of the eigenvectors of the global covariance matrix $\mathbf{C}$. In the combine step, each node $i$ exchanges its iterate values with its immediate neighbors $j\in\mathcal{N}_{i}$, where $\mathcal{N}_{i}$ denotes the neighborhood of node $i$, and then takes a weighted sum of the received iterates along with its local iterate. This sum is then updated independently at each node using its respective local information. Since node $i$ in the network only has access to its local sample covariance $\mathbf{C}_{i}$, the update is in the form of a local Sanger direction given as

$$\boldsymbol{\mathcal{H}}_{i}(\mathbf{C}_{i},\mathbf{X}_{i}^{(t)})=\mathbf{C}_{i}\mathbf{X}_{i}^{(t)}-\mathbf{X}_{i}^{(t)}\boldsymbol{\mathcal{U}}\Big((\mathbf{X}_{i}^{(t)})^{\mathrm{T}}\mathbf{C}_{i}\mathbf{X}_{i}^{(t)}\Big). \tag{9}$$

Algorithm 1 Distributed Sanger's Algorithm (DSA)

Input: $\mathbf{Y}_{1},\mathbf{Y}_{2},\dots,\mathbf{Y}_{M}$, $[w_{ij}]$, $\alpha$, $K$
Initialize: $\forall i,\ \mathbf{X}_{i}^{(0)}\leftarrow\mathbf{X}_{\text{init}}$, where $\mathbf{X}_{\text{init}}\in\mathbb{R}^{d\times K}$ and $\mathbf{X}_{\text{init}}^{\mathrm{T}}\mathbf{X}_{\text{init}}=\mathbf{I}$

for $t=1,2,\dots$ do
     Communicate $\mathbf{X}_{i}^{(t-1)}$ from each node $i$ to its neighbors
     Estimate of eigenvectors at node $i$: $\mathbf{X}_{i}^{(t)}\leftarrow\sum_{j\in\mathcal{N}_{i}\cup\{i\}}w_{ij}\mathbf{X}_{j}^{(t-1)}+\alpha\boldsymbol{\mathcal{H}}_{i}(\mathbf{X}_{i}^{(t-1)})$
end for

Return: $\mathbf{X}_{i}^{(t)},\ i=1,2,\dots,M$

The details of the proposed distributed PCA algorithm, called the Distributed Sanger's Algorithm (DSA), are given in Algorithm 1. The weight matrix $\mathbf{W}=[w_{ij}]$ in this algorithm is a doubly stochastic matrix conforming to the network topology [25], in the sense that for $i\neq j$, $w_{ij}\neq 0$ when $(i,j)\in\mathcal{E}$ and $w_{ij}=0$ otherwise. Also, $\forall i,\ w_{ii}\neq 0$, i.e., there is a self loop at each node. Note that connectivity of the network, as discussed in Section II, is a necessary condition for convergence of DSA. The connectivity assumption, in turn, ensures that the Markov chain underlying the graph $\mathcal{G}$ is aperiodic and irreducible, which implies that the second-largest (in magnitude) eigenvalue of $\mathbf{W}$, $\beta=\max\{|\lambda_{2}(\mathbf{W})|,|\lambda_{M}(\mathbf{W})|\}$, is strictly less than $1$. While DSA shares algorithmic similarities with first-order distributed optimization methods [32, 39] that use the combine-and-update strategy, our challenge is characterizing its convergence behavior due to the non-convex and constrained nature of the distributed PCA problem. To this end, we first provide a general result in Section IV, where we prove the convergence of a modified form of GHA. Then we utilize that result, along with some linear-algebraic tools and additional lemmas provided in the appendices, to characterize the dynamics of the distributed setup in Section V and prove the convergence of the proposed algorithm.
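The following sketch is a toy simulation of the combine-and-update iteration in Algorithm 1 on a small synthetic network (the ring topology, Metropolis weights, synthetic data, and all variable names are our own assumptions, not part of the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, M, n_per_node = 10, 3, 5, 400

# Sample-wise split: node i holds Y_i with n_per_node columns (samples).
Y_nodes = [rng.standard_normal((d, n_per_node)) * np.linspace(3.0, 0.5, d)[:, None]
           for _ in range(M)]
C_nodes = [(Yi @ Yi.T) / Yi.shape[1] for Yi in Y_nodes]
C_global = sum(Yi @ Yi.T for Yi in Y_nodes) / (M * n_per_node)

# Ring graph with Metropolis weights: doubly stochastic, conforms to the topology.
W = np.zeros((M, M))
for i in range(M):
    for j in ((i - 1) % M, (i + 1) % M):
        W[i, j] = 1.0 / 3.0            # 1 / (max(deg_i, deg_j) + 1); every degree is 2
    W[i, i] = 1.0 - W[i].sum()

def sanger_direction(C, X):
    """Local Sanger direction H_i(C_i, X_i), cf. (9)."""
    return C @ X - X @ np.triu(X.T @ C @ X)

lam1 = np.linalg.eigvalsh(C_global).max()
alpha = W.diagonal().min() / (3 * lam1 * (2 * K - 1))   # step size bound from Theorem 2
X_init, _ = np.linalg.qr(rng.standard_normal((d, K)))
X_nodes = [X_init.copy() for _ in range(M)]

for _ in range(30000):
    combined = [sum(W[i, j] * X_nodes[j] for j in range(M)) for i in range(M)]
    X_nodes = [combined[i] + alpha * sanger_direction(C_nodes[i], X_nodes[i])
               for i in range(M)]

Q = np.linalg.eigh(C_global)[1][:, ::-1][:, :K]    # eigenvectors of the global covariance
print(np.abs(np.sum(X_nodes[0] * Q, axis=0)))      # per-column |cos(angle)| close to 1 at node 0
```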

IV Convergence Analysis of a Modified GHA

Let $\mathbf{X}^{(t)}=\begin{bmatrix}\mathbf{x}_{1}^{(t)}&\mathbf{x}_{2}^{(t)}&\cdots&\mathbf{x}_{K}^{(t)}\end{bmatrix}\in\mathbb{R}^{d\times K}$, $K\leq d$, be an estimate of the $K$-dimensional subspace spanned by the eigenvectors of the covariance matrix $\mathbf{C}$ after $t$ iterations, and let $\mathbf{q}_{l},\ l=1,\ldots,d$, be the eigenvectors of $\mathbf{C}$ with corresponding eigenvalues $\lambda_{l}$. On expanding (8) using (7), it is clear that the GHA update equation for the estimate of the $k^{th}$ eigenvector using a constant step size $\alpha$ is as follows:

$$\mathbf{x}_{k}^{(t+1)}=\mathbf{x}_{k}^{(t)}+\alpha\Big(\mathbf{C}\mathbf{x}_{k}^{(t)}-(\mathbf{x}_{k}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}\,\mathbf{x}_{k}^{(t)}-\sum_{p=1}^{k-1}\mathbf{x}_{p}^{(t)}(\mathbf{x}_{p}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}\Big). \tag{10}$$

We now slightly modify (10) by replacing $\mathbf{x}_{p}^{(t)}$ for $p<k$ with the true eigenvectors $\mathbf{q}_{p}$. We term the resulting update equation the modified GHA and note that it is not an algorithm in the true sense of the term, as it cannot be implemented because of its dependence on the true eigenvectors $\mathbf{q}_{p}$. The sole purpose of this modified GHA is to help in our ultimate goal of providing a convergence guarantee for the DSA algorithm. The update equation of the modified GHA for "estimation" of the $k^{th}$ eigenvector of $\mathbf{C}$, $k=1,\ldots,K$, has the form

$$\mathbf{x}_{k}^{(t+1)}=\mathbf{x}_{k}^{(t)}+\alpha\Big(\mathbf{C}\mathbf{x}_{k}^{(t)}-(\mathbf{x}_{k}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}\,\mathbf{x}_{k}^{(t)}-\sum_{p=1}^{k-1}\mathbf{q}_{p}\mathbf{q}_{p}^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}\Big). \tag{11}$$

Note that, similar to the original GHA, this modified GHA assumes that $\mathbf{C}$ has $K$ distinct eigenvalues, i.e., $\lambda_{1}>\lambda_{2}>\ldots>\lambda_{K}>\lambda_{K+1}\geq\cdots\geq\lambda_{d}\geq 0$. Now, since $\mathbf{q}_{l},\ l=1,\ldots,d$, are the eigenvectors of a real symmetric matrix, they form a basis for $\mathbb{R}^{d}$ and can be used to expand any $\mathbf{x}_{k}^{(t)}$ as

$$\mathbf{x}_{k}^{(t)}=\sum_{l=1}^{d}z_{k,l}^{(t)}\mathbf{q}_{l}, \tag{12}$$

where $z_{k,l}^{(t)}$ is the coefficient corresponding to the eigenvector $\mathbf{q}_{l}$ in the expansion of $\mathbf{x}_{k}^{(t)}$. Multiplying both sides of (11) by $\mathbf{q}_{l}^{\mathrm{T}}$ and using the fact that $\mathbf{q}_{l}^{\mathrm{T}}\mathbf{q}_{l^{\prime}}=0$ for $l\neq l^{\prime}$, we get

$$z_{k,l}^{(t+1)}=z_{k,l}^{(t)}+\alpha\Big(\mathbf{q}_{l}^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}-\mathbf{q}_{l}^{\mathrm{T}}\Big(\sum_{p=1}^{k-1}\mathbf{q}_{p}\mathbf{q}_{p}^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}\Big)-(\mathbf{x}_{k}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}\,z_{k,l}^{(t)}\Big).$$

This gives

$$z_{k,l}^{(t+1)}=z_{k,l}^{(t)}-\alpha(\mathbf{x}_{k}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}\,z_{k,l}^{(t)},\quad\text{for}\quad l=1,\ldots,k-1, \tag{13}$$
$$\text{and}\quad z_{k,l}^{(t+1)}=z_{k,l}^{(t)}+\alpha\big(\lambda_{l}-(\mathbf{x}_{k}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}\big)z_{k,l}^{(t)},\quad\text{for}\quad l=k,\ldots,d. \tag{14}$$
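As a quick numerical illustration of the coefficient dynamics (13)–(14), the sketch below (a toy setup with hypothetical names; not part of the paper's experiments) runs the modified GHA update (11) for a chosen $k$ and tracks the coefficients $z_{k,l}^{(t)}=\mathbf{q}_{l}^{\mathrm{T}}\mathbf{x}_{k}^{(t)}$, which should drive $z_{k,k}^{(t)}$ to $\pm 1$ and all other coefficients to $0$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, k = 8, 3, 2                        # track the k-th eigenvector estimate (1-indexed)
A = rng.standard_normal((d, 4 * d))
C = (A @ A.T) / (4 * d)                  # synthetic covariance matrix
lam, Q = np.linalg.eigh(C)
lam, Q = lam[::-1], Q[:, ::-1]           # eigenpairs in descending order

alpha = 1.0 / (3 * lam[0] * (2 * K - 1))     # step size bound from Theorem 1
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                       # unit-norm initialization

for t in range(20000):
    # Modified GHA update (11): lower-order terms use the true eigenvectors q_p.
    proj_lower = sum(Q[:, p] * (Q[:, p] @ C @ x) for p in range(k - 1))
    x = x + alpha * (C @ x - (x @ C @ x) * x - proj_lower)

z = Q.T @ x                               # coefficients z_{k,l}
print(np.round(z, 4))                     # |z_{k,k}| ~ 1, all other entries ~ 0
```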

It has been shown in [16] that the update equation given by

$$\mathbf{x}_{1}^{(t+1)}=\mathbf{x}_{1}^{(t)}+\alpha\big(\mathbf{C}\mathbf{x}_{1}^{(t)}-(\mathbf{x}_{1}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{x}_{1}^{(t)}\,\mathbf{x}_{1}^{(t)}\big)$$

for $k=1$ converges to $\pm\mathbf{q}_{1}$ at a linear rate under a certain condition on the step size $\alpha$. Specifically, it was proven that $(z_{1,1}^{(t)})^{2}\rightarrow 1$ and $\sum_{l=2}^{d}(z_{1,l}^{(t)})^{2}\leq b_{1}\rho_{1}^{t}$, where $b_{1}>0$ is some constant and $\rho_{1}=\big(\frac{1+\alpha\lambda_{2}}{1+\alpha\lambda_{1}}\big)^{2}<1$. Here, we extend the proof to a general $k$ and show that the update equation given in the form of (11) for any $k=1,\ldots,K$, $K<d$, converges to the $k^{th}$ dominant eigenvector.

Theorem 1.

Suppose $\alpha\leq\frac{1}{3\lambda_{1}(2K-1)}$, where $\lambda_{1}$ is the largest eigenvalue of $\mathbf{C}$ and $K$ is the number of eigenvectors to be estimated, $\mathbf{q}_{k}^{\mathrm{T}}\mathbf{x}_{k}^{(0)}\neq 0$, and $\|\mathbf{x}_{k}^{(0)}\|=1$ for all $k$. Then the modified GHA iterate for $\mathbf{x}_{k}^{(t)}$ given by (11) converges at a linear rate to the eigenvector $\pm\mathbf{q}_{k}$ corresponding to the $k^{th}$ largest eigenvalue $\lambda_{k}$ of the covariance matrix $\mathbf{C}$.

Proof.

The convergence of $\mathbf{x}_{k}^{(t)}$ to $\mathbf{q}_{k}$ requires convergence of the lower-order coefficients $z_{k,1}^{(t)},\ldots,z_{k,k-1}^{(t)}$ and the higher-order coefficients $z_{k,k+1}^{(t)},\ldots,z_{k,d}^{(t)}$ to $0$, and convergence of $z_{k,k}^{(t)}$ to $\pm 1$. Now,

$$\begin{aligned}
|\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}| &=\Big|\lambda_{k}-\sum_{l=1}^{d}\lambda_{l}(z_{k,l}^{(t)})^{2}\Big|=\Big|\lambda_{k}-\lambda_{k}(z_{k,k}^{(t)})^{2}-\sum_{l=1}^{k-1}\lambda_{l}(z_{k,l}^{(t)})^{2}-\sum_{l=k+1}^{d}\lambda_{l}(z_{k,l}^{(t)})^{2}\Big|\\
&\geq|\lambda_{k}-\lambda_{k}(z_{k,k}^{(t)})^{2}|-\Big|\sum_{l=1}^{k-1}\lambda_{l}(z_{k,l}^{(t)})^{2}\Big|-\Big|\sum_{l=k+1}^{d}\lambda_{l}(z_{k,l}^{(t)})^{2}\Big|\\
\text{or,}\quad\lambda_{k}|1-(z_{k,k}^{(t)})^{2}| &\leq\Big|\sum_{l=1}^{k-1}\lambda_{l}(z_{k,l}^{(t)})^{2}\Big|+\Big|\sum_{l=k+1}^{d}\lambda_{l}(z_{k,l}^{(t)})^{2}\Big|+|\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}|. \tag{15}
\end{aligned}$$

Thus, convergence of the lower-order and the higher-order coefficients to $0$, along with convergence of the term $|\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}|$, will also imply the convergence of $z_{k,k}^{(t)}$ to $\pm 1$. To this end, Lemma 5 in the appendix proves linear convergence of the lower-order coefficients $z_{k,1}^{(t)},\ldots,z_{k,k-1}^{(t)}$ to $0$ by showing $\sum_{l=1}^{k-1}(z_{k,l}^{(t+1)})^{2}<a_{1}\gamma^{t+1}$ for some constants $a_{1}>0$, $\gamma<1$. Furthermore, Lemma 6 in the appendix shows that $\sum_{l=k+1}^{d}(z_{k,l}^{(t+1)})^{2}\leq a_{2}\rho_{k}^{t+1}$, where $a_{1},a_{2}>0$ and $\gamma,\rho_{k}<1$, thereby proving linear convergence of the higher-order coefficients to $0$. Finally, Lemma 7 in the appendix shows that $|\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{\mathrm{T}}\mathbf{C}\mathbf{x}_{k}^{(t)}|\leq ta_{4}(\delta^{t+1}+\max\{\delta^{t},\gamma_{1}^{t}\})$, where $a_{4}>0$ and $\delta,\gamma_{1}<1$. The formal statements and proofs of Lemma 5, Lemma 6, and Lemma 7 are given in Appendix B, Appendix C, and Appendix D, respectively.

Thus,

$$\begin{aligned}
\lambda_{k}|1-(z_{k,k}^{(t)})^{2}| &\leq\Big|\sum_{l=1}^{k-1}\lambda_{l}(z_{k,l}^{(t)})^{2}\Big|+\Big|\sum_{l=k+1}^{d}\lambda_{l}(z_{k,l}^{(t)})^{2}\Big|+ta_{4}(\delta^{t+1}+\max\{\delta^{t},\gamma_{1}^{t}\})\\
&=\sum_{l=1}^{k-1}\lambda_{l}(z_{k,l}^{(t)})^{2}+\sum_{l=k+1}^{d}\lambda_{l}(z_{k,l}^{(t)})^{2}+ta_{4}(\delta^{t+1}+\max\{\delta^{t},\gamma_{1}^{t}\})\\
&<\lambda_{1}\Big(\sum_{l=1}^{k-1}(z_{k,l}^{(t)})^{2}+\sum_{l=k+1}^{d}(z_{k,l}^{(t)})^{2}\Big)+ta_{4}(\delta^{t+1}+\max\{\delta^{t},\gamma_{1}^{t}\})\\
&<\lambda_{1}(a_{1}\gamma^{t}+a_{2}\rho_{k}^{t})+ta_{4}(\delta^{t+1}+\max\{\delta^{t},\gamma_{1}^{t}\}).
\end{aligned}$$

Clearly, $\lim_{t\rightarrow\infty}|1-(z_{k,k}^{(t)})^{2}|=0$. Therefore, Theorem 1 shows that with an update equation of the form (11), the iterates $\mathbf{x}_{k}^{(t)}$ converge linearly to the eigenvectors $\mathbf{q}_{k}$ of the covariance matrix $\mathbf{C}$. ∎

V Convergence Analysis of Distributed Sanger’s Algorithm (DSA)

With the analysis of the modified GHA in hand, let us proceed to analyze the proposed DSA algorithm. The iterate of DSA at node $i$ for the estimate of the dominant $K$-dimensional eigenspace ($K\leq d$) is given as

$$\mathbf{X}_{i}^{(t+1)}=\sum_{j\in\mathcal{N}_{i}\cup\{i\}}w_{ij}\mathbf{X}_{j}^{(t)}+\alpha\boldsymbol{\mathcal{H}}_{i}(\mathbf{C}_{i},\mathbf{X}_{i}^{(t)})=\sum_{j\in\mathcal{N}_{i}\cup\{i\}}w_{ij}\mathbf{X}_{j}^{(t)}+\alpha\Big(\mathbf{C}_{i}\mathbf{X}_{i}^{(t)}-\mathbf{X}_{i}^{(t)}\boldsymbol{\mathcal{U}}\big((\mathbf{X}_{i}^{(t)})^{\mathrm{T}}\mathbf{C}_{i}\mathbf{X}_{i}^{(t)}\big)\Big), \tag{16}$$

where $\mathbf{X}_{i}^{(t)}=\begin{bmatrix}\mathbf{x}_{i,1}^{(t)}&\mathbf{x}_{i,2}^{(t)}&\cdots&\mathbf{x}_{i,K}^{(t)}\end{bmatrix}\in\mathbb{R}^{d\times K}$ is an estimate of the $K$-dimensional subspace of the global covariance matrix $\mathbf{C}$ at the $i^{th}$ node after $t$ iterations, $\boldsymbol{\mathcal{H}}_{i}(\mathbf{C}_{i},\mathbf{X}_{i}^{(t)})$ is the local Sanger direction, and $w_{ij}\geq 0$ is the weight that node $i$ assigns to $\mathbf{X}_{j}^{(t)}$ based on the connectivity between nodes $i$ and $j$, as mentioned before. The Sanger direction and the update equation for the estimate of the $k^{th}$ eigenvector are thus given as

$$\boldsymbol{\mathcal{H}}_{i}(\mathbf{C}_{i},\mathbf{x}_{i,k}^{(t)})=\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{\mathrm{T}}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\,\mathbf{x}_{i,k}^{(t)}-\sum_{p=1}^{k-1}(\mathbf{x}_{i,p}^{(t)})^{\mathrm{T}}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\,\mathbf{x}_{i,p}^{(t)} \tag{17}$$
$$\text{and}\quad\mathbf{x}_{i,k}^{(t+1)}=\sum_{j\in\mathcal{N}_{i}\cup\{i\}}w_{ij}\mathbf{x}_{j,k}^{(t)}+\alpha\Big(\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{\mathrm{T}}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\,\mathbf{x}_{i,k}^{(t)}-\sum_{p=1}^{k-1}\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{\mathrm{T}}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\Big). \tag{18}$$

Now, let the average of $\mathbf{x}_{1,k}^{(t)},\mathbf{x}_{2,k}^{(t)},\ldots,\mathbf{x}_{M,k}^{(t)}$ after the $t^{th}$ iteration be denoted by $\bar{\mathbf{x}}_{k}^{(t)}=\frac{1}{M}\sum_{i=1}^{M}\mathbf{x}_{i,k}^{(t)}$. Taking the average of (18) over all the nodes $i=1,\ldots,M$ gives

$$\begin{aligned}
\bar{\mathbf{x}}_{k}^{(t+1)} &=\bar{\mathbf{x}}_{k}^{(t)}+\frac{\alpha}{M}\sum_{i=1}^{M}\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})\\
&=\bar{\mathbf{x}}_{k}^{(t)}+\frac{\alpha}{M}\sum_{i=1}^{M}\boldsymbol{\mathcal{H}}_{i}(\bar{\mathbf{x}}_{k}^{(t)})+\frac{\alpha}{M}\sum_{i=1}^{M}\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})-\frac{\alpha}{M}\sum_{i=1}^{M}\boldsymbol{\mathcal{H}}_{i}(\bar{\mathbf{x}}_{k}^{(t)})=\bar{\mathbf{x}}_{k}^{(t)}+\frac{\alpha}{M}\sum_{i=1}^{M}\boldsymbol{\mathcal{H}}_{i}(\bar{\mathbf{x}}_{k}^{(t)})+\alpha\mathbf{h}_{k}^{(t)}\\
&=\bar{\mathbf{x}}_{k}^{(t)}+\frac{\alpha}{M}\Big(\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}-(\bar{\mathbf{x}}_{k}^{(t)})^{\mathrm{T}}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}\,\bar{\mathbf{x}}_{k}^{(t)}-\sum_{i=1}^{M}\sum_{p=1}^{k-1}\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{\mathrm{T}}\mathbf{C}_{i}\bar{\mathbf{x}}_{k}^{(t)}\Big)+\alpha\mathbf{h}_{k}^{(t)},
\end{aligned}$$

where $\mathbf{h}_{k}^{(t)}=\frac{1}{M}\sum_{i=1}^{M}\big(\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})-\boldsymbol{\mathcal{H}}_{i}(\bar{\mathbf{x}}_{k}^{(t)})\big)$. We present the analysis of the DSA algorithm by first proving convergence of the average $\bar{\mathbf{x}}_{k}^{(t)}$ to a neighborhood of the eigenvector $\mathbf{q}_{k}$ of the global covariance matrix $\mathbf{C}$ while using a constant step size. Then, with the help of Lemma 8 in Appendix E, which proves that the deviation of the iterates $\mathbf{x}_{i,k}^{(t)}$ at each node from the average $\bar{\mathbf{x}}_{k}^{(t)}$ is upper bounded, we prove that the iterates at each node also converge to a neighborhood of the true solution. It is noteworthy that the analysis of DSA does not require additional constraints on the eigenvalues of $\mathbf{C}_{i}$; i.e., similar to GHA, we only require the top $K$ eigenvalues of $\mathbf{C}$ to be distinct and non-zero.

The complete proof of convergence of DSA is done by induction. First, we show the convergence of $\mathbf{x}_{i,1}^{(t)}$ to an $\mathcal{O}(\alpha)$ neighborhood of $\mathbf{q}_{1}$, and then we analyze the rest of the eigenvector estimates $\mathbf{x}_{i,k}^{(t)},\ k=2,\ldots,K$, by assuming that the higher-order estimates have converged.

Case I for Induction – $k=1$: The iterate for the dominant eigenvector is

$$\mathbf{x}_{i,1}^{(t+1)}=\sum_{j\in\mathcal{N}_{i}\cup\{i\}}w_{ij}\mathbf{x}_{j,1}^{(t)}+\alpha\big(\mathbf{C}_{i}\mathbf{x}_{i,1}^{(t)}-((\mathbf{x}_{i,1}^{(t)})^{\mathrm{T}}\mathbf{C}_{i}\mathbf{x}_{i,1}^{(t)})\mathbf{x}_{i,1}^{(t)}\big). \tag{19}$$
Theorem 2.

Suppose $\alpha\leq\frac{\min_{i}w_{ii}}{3\lambda_{1}(2K-1)}$, where $\lambda_{1}$ is the largest eigenvalue of $\mathbf{C}$ and $K$ is the number of eigenvectors to be estimated, $\mathbf{q}_{1}^{\mathrm{T}}\mathbf{x}_{i,1}^{(0)}\neq 0$, and $\|\mathbf{x}_{i,1}^{(0)}\|=1$. Then the DSA iterate for $\mathbf{x}_{i,1}^{(t)}$ given by (19) converges at a linear rate to an $\mathcal{O}(\alpha)$ neighborhood of the eigenvector $\pm\mathbf{q}_{1}$ corresponding to the largest eigenvalue $\lambda_{1}$ of the global covariance matrix $\mathbf{C}$ at every node of the network.

Proof.

We know that

$$\|\mathbf{x}_{i,1}^{(t)}-\mathbf{x}_{1}^{*}\|\leq\|\mathbf{x}_{i,1}^{(t)}-\bar{\mathbf{x}}_{1}^{(t)}\|+\|\bar{\mathbf{x}}_{1}^{(t)}-\mathbf{x}_{1}^{*}\|,\quad\text{where }\mathbf{x}_{1}^{*}=\pm\mathbf{q}_{1}. \tag{20}$$

The term $\|\mathbf{x}_{i,1}^{(t)}-\bar{\mathbf{x}}_{1}^{(t)}\|$ is a measure of consensus in the network, and we prove in Lemma 8 in Appendix E that this difference decreases linearly until it reaches a level of $\mathcal{O}(\alpha)$. More precisely,

𝐱i,1(t)𝐱¯1(t)b1(βt+α1β),\displaystyle\|\mathbf{x}_{i,1}^{(t)}-\bar{\mathbf{x}}_{1}^{(t)}\|\leq b_{1}\big{(}\beta^{t}+\frac{\alpha}{1-\beta}\big{)}, (21)

where β=max{|λ2(𝐖)|,|λM(𝐖)|}\beta=\max\{|\lambda_{2}(\mathbf{W})|,|\lambda_{M}(\mathbf{W})|\}. In particular, it is well known that for a connected graph β<1\beta<1. Now, the average iterate of DSA for the estimate of the dominant eigenvector (k=1)(k=1) is

𝐱¯1(t)\displaystyle\bar{\mathbf{x}}_{1}^{(t)} =𝐱¯1(t1)+αM(𝐂𝐱¯1(t1)(𝐱¯1(t1))T𝐂𝐱¯1(t1)𝐱¯1(t1))+α𝐡1(t1).\displaystyle=\bar{\mathbf{x}}_{1}^{(t-1)}+\frac{\alpha}{M}(\mathbf{C}\bar{\mathbf{x}}_{1}^{(t-1)}-(\bar{\mathbf{x}}_{1}^{(t-1)})^{T}\mathbf{C}\bar{\mathbf{x}}_{1}^{(t-1)}\bar{\mathbf{x}}_{1}^{(t-1)})+\alpha\mathbf{h}_{1}^{(t-1)}.

Thus,

𝐱¯1(t)𝐱1\displaystyle\bar{\mathbf{x}}_{1}^{(t)}-\mathbf{x}_{1}^{*} =𝐱¯1(t1)+αM(𝐂𝐱¯1(t1)(𝐱¯1(t1))T𝐂𝐱¯1(t1)𝐱¯1(t1))𝐱1+α𝐡1(t1)\displaystyle=\bar{\mathbf{x}}_{1}^{(t-1)}+\frac{\alpha}{M}(\mathbf{C}\bar{\mathbf{x}}_{1}^{(t-1)}-(\bar{\mathbf{x}}_{1}^{(t-1)})^{T}\mathbf{C}\bar{\mathbf{x}}_{1}^{(t-1)}\bar{\mathbf{x}}_{1}^{(t-1)})-\mathbf{x}_{1}^{*}+\alpha\mathbf{h}_{1}^{(t-1)}
or,𝐱¯1(t)𝐱1\displaystyle\text{or,}\quad\|\bar{\mathbf{x}}_{1}^{(t)}-\mathbf{x}_{1}^{*}\| =𝐱¯1(t1)+αM(𝐂𝐱¯1(t1)(𝐱¯1(t1))T𝐂𝐱¯1(t1)𝐱¯1(t1))𝐱1+α𝐡1(t1)\displaystyle=\|\bar{\mathbf{x}}_{1}^{(t-1)}+\frac{\alpha}{M}(\mathbf{C}\bar{\mathbf{x}}_{1}^{(t-1)}-(\bar{\mathbf{x}}_{1}^{(t-1)})^{T}\mathbf{C}\bar{\mathbf{x}}_{1}^{(t-1)}\bar{\mathbf{x}}_{1}^{(t-1)})-\mathbf{x}_{1}^{*}+\alpha\mathbf{h}_{1}^{(t-1)}\|
or,𝐱¯1(t)𝐱1\displaystyle\text{or,}\quad\|\bar{\mathbf{x}}_{1}^{(t)}-\mathbf{x}_{1}^{*}\| 𝐱¯1(t1)+αM(𝐂𝐱¯1(t1)(𝐱¯1(t1))T𝐂𝐱¯1(t1)𝐱¯1(t1))𝐱1+α𝐡1(t1).\displaystyle\leq\|\bar{\mathbf{x}}_{1}^{(t-1)}+\frac{\alpha}{M}(\mathbf{C}\bar{\mathbf{x}}_{1}^{(t-1)}-(\bar{\mathbf{x}}_{1}^{(t-1)})^{T}\mathbf{C}\bar{\mathbf{x}}_{1}^{(t-1)}\bar{\mathbf{x}}_{1}^{(t-1)})-\mathbf{x}_{1}^{*}\|+\alpha\|\mathbf{h}_{1}^{(t-1)}\|. (22)

We saw in Section IV that an iterate of the form

𝐱¯1(t)=𝐱¯1(t1)+αM(𝐂𝐱¯1(t1)(𝐱¯1(t1))T𝐂𝐱¯1(t1)𝐱¯1(t1))\displaystyle\bar{\mathbf{x}}_{1}^{(t)}=\bar{\mathbf{x}}_{1}^{(t-1)}+\frac{\alpha}{M}(\mathbf{C}\bar{\mathbf{x}}_{1}^{(t-1)}-(\bar{\mathbf{x}}_{1}^{(t-1)})^{T}\mathbf{C}\bar{\mathbf{x}}_{1}^{(t-1)}\bar{\mathbf{x}}_{1}^{(t-1)})

converges linearly to 𝐱1=±𝐪1\mathbf{x}_{1}^{*}=\pm\mathbf{q}_{1} for certain conditions on the step size and the initial point. Thus,

𝐱¯1(t)𝐱1\displaystyle\|\bar{\mathbf{x}}_{1}^{(t)}-\mathbf{x}_{1}^{*}\| \displaystyle\leq ρ1𝐱¯1(t1)𝐱1+α𝐡1(t1),whereρ1=1+αMλ21+αMλ1.\displaystyle\rho_{1}\|\bar{\mathbf{x}}_{1}^{(t-1)}-\mathbf{x}_{1}^{*}\|+\alpha\|\mathbf{h}_{1}^{(t-1)}\|,\quad\text{where}\quad\rho_{1}=\frac{1+\frac{\alpha}{M}\lambda_{2}}{1+\frac{\alpha}{M}\lambda_{1}}.

The term 𝐡1(t1)\mathbf{h}_{1}^{(t-1)} in the above equation appears due to the distributed nature of the algorithm and can be bounded separately. Specifically, we prove in Lemma 9, whose formal statement and proof are given in Appendix F, that

𝐡1(t1)9λ1b1(βt1+α1β).\displaystyle\|\mathbf{h}_{1}^{(t-1)}\|\leq 9\lambda_{1}b_{1}\big{(}\beta^{t-1}+\frac{\alpha}{1-\beta}\big{)}.

Thus,

𝐱¯1(t)𝐱1\displaystyle\|\bar{\mathbf{x}}_{1}^{(t)}-\mathbf{x}_{1}^{*}\| \displaystyle\leq ρ1𝐱¯1(t1)𝐱1+9αλ1b1(βt1+α1β)\displaystyle\rho_{1}\|\bar{\mathbf{x}}_{1}^{(t-1)}-\mathbf{x}_{1}^{*}\|+9\alpha\lambda_{1}b_{1}\big{(}\beta^{t-1}+\frac{\alpha}{1-\beta}\big{)}
\displaystyle\leq ρ1(ρ1𝐱¯1(t2)𝐱1+9αλ1b1βt2+9αλ1b1(α1β))+9αλ1b1βt1+9αλ1b1(α1β)\displaystyle\rho_{1}\Big{(}\rho_{1}\|\bar{\mathbf{x}}_{1}^{(t-2)}-\mathbf{x}_{1}^{*}\|+9\alpha\lambda_{1}b_{1}\beta^{t-2}+9\alpha\lambda_{1}b_{1}\big{(}\frac{\alpha}{1-\beta}\big{)}\Big{)}+9\alpha\lambda_{1}b_{1}\beta^{t-1}+9\alpha\lambda_{1}b_{1}\big{(}\frac{\alpha}{1-\beta}\big{)}
\displaystyle\leq ρ1t𝐱¯1(0)𝐱1+9αλ1b1r=0t1(ρ1β1)rβt1+11ρ19αλ1b1(α1β).\displaystyle\rho_{1}^{t}\|\bar{\mathbf{x}}_{1}^{(0)}-\mathbf{x}_{1}^{*}\|+9\alpha\lambda_{1}b_{1}\sum_{r=0}^{t-1}(\rho_{1}\beta^{-1})^{r}\beta^{t-1}+\frac{1}{1-\rho_{1}}9\alpha\lambda_{1}b_{1}\big{(}\frac{\alpha}{1-\beta}\big{)}.

Since ρ1,β<1\rho_{1},\beta<1, we have the following two cases:

1. \rho_{1}\leq\beta\implies\rho_{1}\beta^{-1}\leq 1. Then, \sum_{r=0}^{t-1}(\rho_{1}\beta^{-1})^{r}\beta^{t-1}\leq\sum_{r=0}^{t-1}\beta^{t-1}=t\beta^{t-1}.

2. \rho_{1}>\beta. Then \sum_{r=0}^{t-1}(\rho_{1}\beta^{-1})^{r}\beta^{t-1}=\beta^{t-1}+\rho_{1}\beta^{t-2}+\cdots+\rho_{1}^{t-1}<\rho_{1}^{t-1}+\cdots+\rho_{1}^{t-1}=t\rho_{1}^{t-1}.

Therefore,

𝐱¯1(t)𝐱1ρ1t𝐱¯1(0)𝐱1+c1tmax{ρ1t1,βt1}+c11ρ1(α1β),wherec1=9αλ1b1.\|\bar{\mathbf{x}}_{1}^{(t)}-\mathbf{x}_{1}^{*}\|\leq\rho_{1}^{t}\|\bar{\mathbf{x}}_{1}^{(0)}-\mathbf{x}_{1}^{*}\|+c_{1}t\max\{\rho_{1}^{t-1},\beta^{t-1}\}+\frac{c_{1}}{1-\rho_{1}}\big{(}\frac{\alpha}{1-\beta}\big{)},\quad\text{where}\quad c_{1}=9\alpha\lambda_{1}b_{1}. (23)

Consequently, from (21) and (23), we get

𝐱i,1(t)𝐱1\displaystyle\|\mathbf{x}_{i,1}^{(t)}-\mathbf{x}_{1}^{*}\| \displaystyle\leq b1(βt+α1β)+ρ1t𝐱¯1(0)𝐱1+c1tmax{ρ1t1,βt1}+c11ρ1(α1β)\displaystyle b_{1}(\beta^{t}+\frac{\alpha}{1-\beta})+\rho_{1}^{t}\|\bar{\mathbf{x}}_{1}^{(0)}-\mathbf{x}_{1}^{*}\|+c_{1}t\max\{\rho_{1}^{t-1},\beta^{t-1}\}+\frac{c_{1}}{1-\rho_{1}}\big{(}\frac{\alpha}{1-\beta}\big{)}
=\displaystyle= ρ1t𝐱¯1(0)𝐱1+b1βt+c1tmax{ρ1t1,βt1}+(c11ρ1+b1)(α1β).\displaystyle\rho_{1}^{t}\|\bar{\mathbf{x}}_{1}^{(0)}-\mathbf{x}_{1}^{*}\|+b_{1}\beta^{t}+c_{1}t\max\{\rho_{1}^{t-1},\beta^{t-1}\}+(\frac{c_{1}}{1-\rho_{1}}+b_{1})\big{(}\frac{\alpha}{1-\beta}\big{)}.

This proves that 𝐱i,1(t)\mathbf{x}_{i,1}^{(t)} converges to a neighborhood of 𝐱1=𝐪1\mathbf{x}_{1}^{*}=\mathbf{q}_{1} or 𝐱1=𝐪1\mathbf{x}_{1}^{*}=-\mathbf{q}_{1} at a linear rate. ∎

Case II for Induction – 1<kK1<k\leq K: For the remainder of the eigenvectors, we proceed with the proof of convergence by induction. Since we have already proven the base case, we can assume there exist constants ci,p>0c_{i,p}>0 and θi,p<1\theta_{i,p}<1 such that

1. \|\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{T}-\mathbf{q}_{p}\mathbf{q}_{p}^{T}\|\leq c_{i,p}(\theta_{i,p}^{t}+\frac{\alpha}{1-\beta}),\ \forall p=1,\ldots,k-1, and

2. \|\mathbf{x}_{i,p}^{(t)}\|^{2}\leq 3,\ p=1,\ldots,k-1,\ i=1,\ldots,M.

Using the inequality in 1) above, we can write {\mathbf{x}}_{i,p}^{(t)}({\mathbf{x}}_{i,p}^{(t)})^{T}=\mathbf{q}_{p}\mathbf{q}_{p}^{T}+{\boldsymbol{\phi}}_{i,p}^{(t)},\ p=1,\ldots,k-1, such that \|{\boldsymbol{\phi}}_{i,p}^{(t)}\|\leq{c}_{i,p}({\theta}_{i,p}^{t}+\frac{\alpha}{1-\beta}). This therefore implies \frac{\alpha}{M}\sum_{i=1}^{M}\sum_{p=1}^{k-1}{\mathbf{x}}_{i,p}^{(t)}({\mathbf{x}}_{i,p}^{(t)})^{T}\mathbf{C}_{i}\bar{\mathbf{x}}_{k}^{(t)}=\frac{\alpha}{M}\sum_{i=1}^{M}\sum_{p=1}^{k-1}(\mathbf{q}_{p}\mathbf{q}_{p}^{T}+{\boldsymbol{\phi}}_{i,p}^{(t)})\mathbf{C}_{i}\bar{\mathbf{x}}_{k}^{(t)}=\frac{\alpha}{M}\sum_{p=1}^{k-1}\mathbf{q}_{p}\mathbf{q}_{p}^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}+\alpha\bar{\boldsymbol{\psi}}_{k}^{(t)}, where \bar{\boldsymbol{\psi}}_{k}^{(t)}=\frac{1}{M}\sum_{i=1}^{M}\sum_{p=1}^{k-1}{\boldsymbol{\phi}}_{i,p}^{(t)}\mathbf{C}_{i}\bar{\mathbf{x}}_{k}^{(t)}.

Thus, we have

𝝍¯k(t)\displaystyle\|\bar{\boldsymbol{\psi}}_{k}^{(t)}\| \displaystyle\leq 1Mi=1Mp=1k1λ1ϕi,p(t)𝐱¯k(t)1Mi=1Mp=1k13λ1ci,p(θi,pt+α1β)\displaystyle\frac{1}{M}\sum_{i=1}^{M}\sum_{p=1}^{k-1}\lambda_{1}\|{\boldsymbol{\phi}}_{i,p}^{(t)}\|\|\bar{\mathbf{x}}_{k}^{(t)}\|\leq\frac{1}{M}\sum_{i=1}^{M}\sum_{p=1}^{k-1}\sqrt{3}\lambda_{1}{c}_{i,p}({\theta}_{i,p}^{t}+\frac{\alpha}{1-\beta}) (24)
\displaystyle\leq 1M3λ1(k1)Mc¯(θ¯t+α1β)=3λ1(k1)c¯(θ¯t+α1β),\displaystyle\frac{1}{M}\sqrt{3}\lambda_{1}(k-1)M\bar{c}(\bar{\theta}^{t}+\frac{\alpha}{1-\beta})=\sqrt{3}\lambda_{1}(k-1)\bar{c}(\bar{\theta}^{t}+\frac{\alpha}{1-\beta}),

where c¯=maxi,p{ci,p}\bar{c}=\max_{i,p}\{{c}_{i,p}\} and θ¯=maxi,p{θi,p}<1\bar{\theta}=\max_{i,p}\{{\theta}_{i,p}\}<1.

Consequently,

𝐱¯k(t+1)\displaystyle\bar{\mathbf{x}}_{k}^{(t+1)} =\displaystyle= 𝐱¯k(t)+αM(𝐂𝐱¯k(t)(𝐱¯k(t))T𝐂𝐱¯k(t)𝐱¯k(t)i=1Mp=1k1𝐱i,p(t)(𝐱i,p(t))T𝐂i𝐱¯k(t))+α𝐡k(t)\displaystyle\bar{\mathbf{x}}_{k}^{(t)}+\frac{\alpha}{M}\big{(}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}-(\bar{\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}\bar{\mathbf{x}}_{k}^{(t)}-\sum_{i=1}^{M}\sum_{p=1}^{k-1}{\mathbf{x}}_{i,p}^{(t)}({\mathbf{x}}_{i,p}^{(t)})^{T}\mathbf{C}_{i}\bar{\mathbf{x}}_{k}^{(t)}\big{)}+\alpha\mathbf{h}_{k}^{(t)}
=\displaystyle= 𝐱¯k(t)+αM(𝐂𝐱¯k(t)(𝐱¯k(t))T𝐂𝐱¯k(t)𝐱¯k(t)i=1Mp=1k1𝐪p𝐪pT𝐂i𝐱¯k(t))+\displaystyle\bar{\mathbf{x}}_{k}^{(t)}+\frac{\alpha}{M}\big{(}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}-(\bar{\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}\bar{\mathbf{x}}_{k}^{(t)}-\sum_{i=1}^{M}\sum_{p=1}^{k-1}{\mathbf{q}}_{p}{\mathbf{q}}_{p}^{T}\mathbf{C}_{i}\bar{\mathbf{x}}_{k}^{(t)}\big{)}+
αMi=1Mp=1k1(𝐪p𝐪pT𝐱i,p(t)(𝐱i,p(t))T)𝐂i𝐱¯k(t)+α𝐡k(t)\displaystyle\frac{\alpha}{M}\sum_{i=1}^{M}\sum_{p=1}^{k-1}({\mathbf{q}}_{p}{\mathbf{q}}_{p}^{T}-{\mathbf{x}}_{i,p}^{(t)}({\mathbf{x}}_{i,p}^{(t)})^{T})\mathbf{C}_{i}\bar{\mathbf{x}}_{k}^{(t)}+\alpha\mathbf{h}_{k}^{(t)}
=\displaystyle= 𝐱¯k(t)+αM(𝐂𝐱¯k(t)(𝐱¯k(t))T𝐂𝐱¯k(t)𝐱¯k(t)p=1k1𝐪p𝐪pT𝐂𝐱¯k(t))αMi=1Mp=1k1ϕi,p(t)𝐂i𝐱¯k(t)+α𝐡k(t)\displaystyle\bar{\mathbf{x}}_{k}^{(t)}+\frac{\alpha}{M}\big{(}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}-(\bar{\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}\bar{\mathbf{x}}_{k}^{(t)}-\sum_{p=1}^{k-1}{\mathbf{q}}_{p}{\mathbf{q}}_{p}^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}\big{)}-\frac{\alpha}{M}\sum_{i=1}^{M}\sum_{p=1}^{k-1}\boldsymbol{\phi}_{i,p}^{(t)}\mathbf{C}_{i}\bar{\mathbf{x}}_{k}^{(t)}+\alpha\mathbf{h}_{k}^{(t)}
=\displaystyle= 𝐱¯k(t)+αM(𝐂𝐱¯k(t)(𝐱¯k(t))T𝐂𝐱¯k(t)𝐱¯k(t)p=1k1𝐪p𝐪pT𝐂𝐱¯k(t))α𝝍¯k(t)+α𝐡k(t).\displaystyle\bar{\mathbf{x}}_{k}^{(t)}+\frac{\alpha}{M}\big{(}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}-(\bar{\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}\bar{\mathbf{x}}_{k}^{(t)}-\sum_{p=1}^{k-1}{\mathbf{q}}_{p}{\mathbf{q}}_{p}^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t)}\big{)}-\alpha\bar{\boldsymbol{\psi}}_{k}^{(t)}+\alpha\mathbf{h}_{k}^{(t)}. (25)

We can now proceed with the final theorem that characterizes the convergence behavior of DSA.

Theorem 3.

Suppose αminiwii3λ1(2K1)\alpha\leq\frac{\min_{i}w_{ii}}{3\lambda_{1}{(2K-1)}}, where λ1\lambda_{1} is the largest eigenvalue of 𝐂\mathbf{C} and KK is the number of eigenvectors to be estimated, 𝐪kT𝐱i,k(0)0\mathbf{q}_{k}^{T}\mathbf{x}_{i,k}^{(0)}\neq 0 and 𝐱i,k(0)=1,k=2,,K\|\mathbf{x}_{i,k}^{(0)}\|=1,\forall k=2,\dots,K. Then the DSA iterate for 𝐱i,k(t)\mathbf{x}_{i,k}^{(t)} given by (18) converges at a linear rate to an 𝒪(α)\mathcal{O}(\alpha) neighborhood of the eigenvector 𝐪k\mathbf{q}_{k} corresponding to the kthk^{th} largest eigenvalue λk\lambda_{k} of the global covariance matrix 𝐂\mathbf{C} at each node of the network.

Proof.

We know

𝐱i,k(t)𝐱k𝐱i,k(t)𝐱¯k(t)+𝐱¯k(t)𝐱k,where 𝐱k=±𝐪k.\|\mathbf{x}_{i,k}^{(t)}-\mathbf{x}_{k}^{*}\|\leq\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|+\|\bar{\mathbf{x}}_{k}^{(t)}-\mathbf{x}_{k}^{*}\|,\quad\text{where $\mathbf{x}_{k}^{*}=\pm\mathbf{q}_{k}$}. (26)

Also, from Lemma 8 in the appendix we know that

𝐱i,k(t)𝐱¯k(t)bk(βt+α1β).\displaystyle\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|\leq b_{k}(\beta^{t}+\frac{\alpha}{1-\beta}\big{)}.

Now, the average iterate of DSA for estimating the kthk^{th} eigenvector is

𝐱¯k(t)\displaystyle\bar{\mathbf{x}}_{k}^{(t)} =𝐱¯k(t1)+αM(𝐂𝐱¯k(t1)((𝐱¯k(t1))T𝐂𝐱¯k(t1))𝐱¯k(t1)p=1k1𝐪p𝐪pT𝐂𝐱¯k(t1))+α𝐡k(t1)+α𝝍¯k(t1)\displaystyle=\bar{\mathbf{x}}_{k}^{(t-1)}+\frac{\alpha}{M}(\mathbf{C}\bar{\mathbf{x}}_{k}^{(t-1)}-((\bar{\mathbf{x}}_{k}^{(t-1)})^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t-1)})\bar{\mathbf{x}}_{k}^{(t-1)}-\sum_{p=1}^{k-1}{\mathbf{q}}_{p}{\mathbf{q}}_{p}^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t-1)})+\alpha\mathbf{h}_{k}^{(t-1)}+\alpha\bar{\boldsymbol{\psi}}_{k}^{(t-1)}
or,𝐱¯k(t)𝐱k\displaystyle\text{or,}\quad\|\bar{\mathbf{x}}_{k}^{(t)}-\mathbf{x}_{k}^{*}\| 𝐱¯k(t1)+αM(𝐂𝐱¯k(t1)((𝐱¯k(t1))T𝐂𝐱¯k(t1))𝐱¯k(t1)p=1k1𝐪p𝐪pT𝐂𝐱¯k(t1))𝐱k+\displaystyle\leq\|\bar{\mathbf{x}}_{k}^{(t-1)}+\frac{\alpha}{M}(\mathbf{C}\bar{\mathbf{x}}_{k}^{(t-1)}-((\bar{\mathbf{x}}_{k}^{(t-1)})^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t-1)})\bar{\mathbf{x}}_{k}^{(t-1)}-\sum_{p=1}^{k-1}{\mathbf{q}}_{p}{\mathbf{q}}_{p}^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t-1)})-\mathbf{x}_{k}^{*}\|+
α𝐡k(t1)+α𝝍¯k(t1).\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\alpha\|\mathbf{h}_{k}^{(t-1)}\|+\alpha\|\bar{\boldsymbol{\psi}}_{k}^{(t-1)}\|.

We know from the discussion in Section IV that for an iterate of the form

𝐱¯k(t)\displaystyle\bar{\mathbf{x}}_{k}^{(t)} =𝐱¯k(t1)+αM(𝐂𝐱¯k(t1)((𝐱¯k(t1))T𝐂𝐱¯k(t1))𝐱¯k(t1)p=1k1𝐪p𝐪pT𝐂𝐱¯k(t1)),\displaystyle=\bar{\mathbf{x}}_{k}^{(t-1)}+\frac{\alpha}{M}(\mathbf{C}\bar{\mathbf{x}}_{k}^{(t-1)}-((\bar{\mathbf{x}}_{k}^{(t-1)})^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t-1)})\bar{\mathbf{x}}_{k}^{(t-1)}-\sum_{p=1}^{k-1}{\mathbf{q}}_{p}{\mathbf{q}}_{p}^{T}\mathbf{C}\bar{\mathbf{x}}_{k}^{(t-1)}),

there exists a constant ρk<1\rho_{k}^{{}^{\prime}}<1 such that 𝐱¯k(t)𝐱kρk𝐱¯k(t1)𝐱k\|\bar{\mathbf{x}}_{k}^{(t)}-\mathbf{x}_{k}^{*}\|\leq\rho_{k}^{{}^{\prime}}\|\bar{\mathbf{x}}_{k}^{(t-1)}-\mathbf{x}_{k}^{*}\|. Thus,

𝐱¯k(t)𝐱kρk𝐱¯k(t1)𝐱k+α𝐡k(t1)+α𝝍¯k(t1).\displaystyle\|\bar{\mathbf{x}}_{k}^{(t)}-\mathbf{x}_{k}^{*}\|\leq\rho_{k}^{{}^{\prime}}\|\bar{\mathbf{x}}_{k}^{(t-1)}-\mathbf{x}_{k}^{*}\|+\alpha\|\mathbf{h}_{k}^{(t-1)}\|+\alpha\|\bar{\boldsymbol{\psi}}_{k}^{(t-1)}\|.

Now, the term 𝐡k(t1)\|\mathbf{h}_{k}^{(t-1)}\| was bounded in Lemma 9 in the appendix as

𝐡k(t1)3(k+2)λ1bk(βt1+α1β).\displaystyle\|\mathbf{h}_{k}^{(t-1)}\|\leq 3(k+2)\lambda_{1}b_{k}\big{(}\beta^{t-1}+\frac{\alpha}{1-\beta}\big{)}. (27)

Thus, using (24) and (27), we can write

𝐱¯k(t)𝐱k\displaystyle\|\bar{\mathbf{x}}_{k}^{(t)}-\mathbf{x}_{k}^{*}\| \displaystyle\leq ρk𝐱¯k(t1)𝐱k+α(3(k+2)λ1bk(βt1+α1β))+α(3λ1(k1)c¯(θ¯t1+α1β))\displaystyle\rho_{k}^{{}^{\prime}}\|\bar{\mathbf{x}}_{k}^{(t-1)}-\mathbf{x}_{k}^{*}\|+\alpha(3(k+2)\lambda_{1}b_{k}(\beta^{t-1}+\frac{\alpha}{1-\beta}))+\alpha(\sqrt{3}\lambda_{1}(k-1)\bar{c}(\bar{\theta}^{t-1}+\frac{\alpha}{1-\beta}))
\displaystyle\leq ρk𝐱¯k(t1)𝐱k+ckmax{βt1,θ¯t1}+ckα1β,ck=max{α(3(k+2)λ1bk),α(3λ1(k1)c¯)}\displaystyle\rho_{k}^{{}^{\prime}}\|\bar{\mathbf{x}}_{k}^{(t-1)}-\mathbf{x}_{k}^{*}\|+c_{k}\max\{\beta^{t-1},\bar{\theta}^{t-1}\}+c_{k}\frac{\alpha}{1-\beta},\quad c_{k}=\max\{\alpha(3(k+2)\lambda_{1}b_{k}),\alpha(\sqrt{3}\lambda_{1}(k-1)\bar{c})\}
\displaystyle\leq ρk(ρk𝐱¯k(t2)𝐱k+ckmax{βt2,θ¯t2}+ckα1β)+ckmax{βt1,θ¯t1}+ckα1β\displaystyle\rho_{k}^{{}^{\prime}}\Big{(}\rho_{k}^{{}^{\prime}}\|\bar{\mathbf{x}}_{k}^{(t-2)}-\mathbf{x}_{k}^{*}\|+c_{k}\max\{\beta^{t-2},\bar{\theta}^{t-2}\}+c_{k}\frac{\alpha}{1-\beta}\Big{)}+c_{k}\max\{\beta^{t-1},\bar{\theta}^{t-1}\}+c_{k}\frac{\alpha}{1-\beta}
\displaystyle\leq ρkt𝐱¯k(0)𝐱k+ckr=0t1(ρkmax{β,θ¯}1)rmax{β,θ¯}t1+ck1ρk(α1β)\displaystyle\rho_{k}^{{}^{\prime}t}\|\bar{\mathbf{x}}_{k}^{(0)}-\mathbf{x}_{k}^{*}\|+c_{k}\sum_{r=0}^{t-1}(\rho_{k}^{{}^{\prime}}\max\{\beta,\bar{\theta}\}^{-1})^{r}\max\{\beta,\bar{\theta}\}^{t-1}+\frac{c_{k}}{1-\rho_{k}^{{}^{\prime}}}\big{(}\frac{\alpha}{1-\beta}\big{)}
\displaystyle\leq ρkt𝐱¯k(0)𝐱k+cktmax{ρkt1,βt1,θ¯t1}+ck1ρk(α1β).\displaystyle\rho_{k}^{{}^{\prime}t}\|\bar{\mathbf{x}}_{k}^{(0)}-\mathbf{x}_{k}^{*}\|+c_{k}t\max\{\rho_{k}^{{}^{\prime}t-1},\beta^{t-1},\bar{\theta}^{t-1}\}+\frac{c_{k}}{1-\rho_{k}^{{}^{\prime}}}\big{(}\frac{\alpha}{1-\beta}\big{)}.

Consequently, from (26) and Lemma 8 we get

𝐱i,k(t)𝐱k\displaystyle\|\mathbf{x}_{i,k}^{(t)}-\mathbf{x}_{k}^{*}\| \displaystyle\leq bk(βt+α1β)+ρkt𝐱¯k(0)𝐱k+cktmax{ρkt1,βt1,θ¯t1}+ck1ρk(α1β)\displaystyle b_{k}\big{(}\beta^{t}+\frac{\alpha}{1-\beta}\big{)}+\rho_{k}^{{}^{\prime}t}\|\bar{\mathbf{x}}_{k}^{(0)}-\mathbf{x}_{k}^{*}\|+c_{k}t\max\{\rho_{k}^{{}^{\prime}t-1},\beta^{t-1},\bar{\theta}^{t-1}\}+\frac{c_{k}}{1-\rho_{k}^{{}^{\prime}}}\big{(}\frac{\alpha}{1-\beta}\big{)}
=\displaystyle= \rho_{k}^{{}^{\prime}t}\|\bar{\mathbf{x}}_{k}^{(0)}-\mathbf{x}_{k}^{*}\|+b_{k}\beta^{t}+c_{k}t\max\{\rho_{k}^{{}^{\prime}t-1},\beta^{t-1},\bar{\theta}^{t-1}\}+(\frac{c_{k}}{1-\rho_{k}^{{}^{\prime}}}+b_{k})\big{(}\frac{\alpha}{1-\beta}\big{)}.

This proves that 𝐱i,k(t)\mathbf{x}_{i,k}^{(t)} converges to a neighborhood of 𝐱k=𝐪k\mathbf{x}_{k}^{*}=\mathbf{q}_{k} or 𝐱k=𝐪k\mathbf{x}_{k}^{*}=-\mathbf{q}_{k} at a linear rate. ∎

It is noteworthy that if decaying step sizes αt\alpha_{t} are used such that αt0\alpha_{t}\rightarrow 0 as tt\rightarrow\infty (instead of constant α\alpha), the convergence will be exact but not linear. The rate in that case will be dominated by the rate of decay of αt\alpha_{t}.
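As an illustration of this remark, a diminishing schedule can be substituted for the constant step size in the local update sketched after (19); the specific schedule below is only an example and is not prescribed by our analysis.

def decaying_step(alpha0, gamma, t):
    # example diminishing step size: alpha_t -> 0 as t -> infinity
    return alpha0 / (1.0 + gamma * t)

# at iteration t: X_i = dsa_local_update(X_i, neighbor_X, w_i, C_i, decaying_step(alpha0, gamma, t))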

VI Experimental Results

In this section, we provide results that demonstrate the efficacy of the proposed DSA algorithm. The need for collaboration between the nodes of a network is a vital part of any distributed algorithm, as already pointed out in Section II. We first verify this necessity, along with the effect of the step size on DSA, through a set of experiments. In these experiments, the weight matrix \mathbf{W} that conforms to the underlying graph topology is generated using the Metropolis constant edge-weight approach [40]. The performance of DSA in comparison to some baseline methods is also evaluated in additional experiments. We provide experimental results for DSA on synthetic and real data and compare the results with the centralized generalized Hebbian algorithm (GHA) [11], centralized orthogonal iteration (OI) [14], distributed projected gradient descent (DPGD), and the sequential distributed power method (SeqDistPM). For both centralized methods, all the data is assumed to be at a single location, the difference being that GHA uses the Hebbian update whereas OI uses the well-known orthogonal iterations to estimate the top K eigenvectors of the covariance matrix \mathbf{C}. DPGD involves two significant steps per iteration. The first is a distributed gradient descent step at every node i, given by \sum_{j\in\mathcal{N}_{i}\cup\{i\}}w_{ij}\mathbf{X}_{j}+\alpha\nabla f_{i}(\mathbf{X}_{i}) as in [32], using the trace-maximization objective f_{i}(\mathbf{X}_{i})=\text{Trace}(\mathbf{X}_{i}^{\mathrm{T}}\mathbf{C}_{i}\mathbf{X}_{i}). This is followed by a projection step to enforce the orthogonality constraint \mathbf{X}_{i}^{\mathrm{T}}\mathbf{X}_{i}=\mathbf{I}. The orthogonalization is accomplished using QR decomposition, an approach that ensures projection onto the Stiefel manifold [41] and whose computational complexity is \mathcal{O}(K^{2}d) at each node in each iteration. In contrast, SeqDistPM implements the distributed power method [22, 24] K times, estimating one eigenvector at a time and subtracting its contribution from the covariance matrix before estimating subsequent eigenvectors. Note that SeqDistPM requires a finite number T_{c} of consensus iterations per iteration of the power method. Assuming the cost of communicating one \mathbb{R}^{d\times K} matrix across the network from nodes to their neighbors to be one unit, the communication cost of SeqDistPM is T_{c}/K per iteration of the power method. The error metric used for comparison and reporting of the results is the average squared sine of the angles between the estimated and true eigenvectors, i.e., if \mathbf{x}_{i,k} is the estimate of the k^{th} eigenvector at the i^{th} node and \mathbf{q}_{k} is the true k^{th} eigenvector, then the average error across all nodes is calculated as follows:

E=1MKi=1Mk=1K(1(𝐱i,kT𝐪k𝐱i,k)2).E=\frac{1}{MK}\sum_{i=1}^{M}\sum_{k=1}^{K}\bigg{(}1-\big{(}\frac{\mathbf{x}_{i,k}^{T}\mathbf{q}_{k}}{\|\mathbf{x}_{i,k}\|}\big{)}^{2}\bigg{)}. (28)
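The metric (28) can be computed directly from the per-node estimates; the following is a small sketch (names illustrative), where Q holds the unit-norm true eigenvectors obtained, e.g., from an eigendecomposition of \mathbf{C}.

import numpy as np

def average_angle_error(X_nodes, Q):
    # X_nodes: list of M arrays of shape (d, K), one per node; Q: (d, K) true eigenvectors (unit norm)
    M = len(X_nodes)
    d, K = Q.shape
    err = 0.0
    for X_i in X_nodes:
        for k in range(K):
            x = X_i[:, k]
            err += 1.0 - (x @ Q[:, k] / np.linalg.norm(x)) ** 2   # squared sine of the angle
    return err / (M * K)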

VI-A Synthetic Data

Figure 1: The role of collaboration in the distributed PCA problem and the effect of changing the step size on the performance of DSA. (a) Demonstration of the need for collaboration among the nodes in a network for the PCA problem; (b) effect of varying the step size \alpha on the performance of DSA. The distributed setup corresponds to an Erdos–Renyi graph (p=0.5) with M=10 nodes, while the dimension of data is d=10 and the number of estimated eigenvectors is K=3.

We first show results that emphasize the need for collaboration among the nodes. To that end, we generate N=10,000 independent and identically distributed (i.i.d.) samples drawn from a multivariate Gaussian distribution with an eigengap \Delta_{K}=\frac{\lambda_{K+1}}{\lambda_{K}}=0.8 and dimension d=10. These samples are distributed equally among the M=10 nodes of an Erdos–Renyi network (with connectivity probability p=0.5), implying that each node has 1,000 samples. The number of eigenvectors estimated is K=3 and a constant step size of \alpha=0.1 is used for this experiment. Figure 1(a) shows the effect of using GHA at a node without collaboration with other nodes versus DSA, which in simple terms embodies GHA plus collaboration in the network. The blue line indicating GHA in the figure is the result of using all the data in a centralized manner. It is clear that the lack of any communication between nodes increases the eigenvector estimation error by a significant factor. In Figure 1(b), we use the same setup and parameters to show the effect of different step sizes on our proposed DSA algorithm. It is evident that if the step size is too small, convergence becomes significantly slow, while if it is too large, the final error is larger. Hence, a careful choice of the step size is required for DSA, as characterized by its convergence analysis.
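A sketch of the synthetic data generation used in such experiments is given below; the particular eigenvalue profile is an assumption made only to realize a prescribed ratio \Delta_{K}=\lambda_{K+1}/\lambda_{K} with distinct top-K eigenvalues, and is not specified in the paper.

import numpy as np

def synthetic_nodes(d=10, K=3, gap=0.8, N=10000, M=10, seed=0):
    rng = np.random.default_rng(seed)
    top = np.linspace(1.0, 0.9, K)                          # distinct top-K eigenvalues (illustrative)
    rest = gap * top[-1] * np.linspace(1.0, 0.5, d - K)     # enforces lambda_{K+1}/lambda_K = gap
    lam = np.concatenate([top, rest])
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))        # random orthonormal eigenvectors
    C = Q @ np.diag(lam) @ Q.T                              # population covariance
    Y = rng.multivariate_normal(np.zeros(d), C, size=N)     # N i.i.d. Gaussian samples
    return np.array_split(Y, M), Q[:, :K]                   # equal shards and the true top-K eigenvectors

Each node i then forms its local sample covariance from its own shard before running DSA.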

Figure 2: Comparison between the performances of DSA, DPGD and SeqDistPM for K=1 and \Delta_{K}=0.8 in terms of communications efficiency, i.e., decrease in average estimation error as a function of the number of data units communicated throughout the network. (a) Erdos–Renyi network; (b) cyclic network; (c) star network.

Next, we compare DSA with the distributed methods DPGD and SeqDistPM to demonstrate its communications efficiency. For that purpose, we generate synthetic data with different eigengaps \Delta_{K}\in\{0.6,0.8\}. We simulate the distributed setup for Erdos–Renyi (p=0.5), star, and cycle graph topologies with M=10 nodes. The data is generated so that each node has 1,000 i.i.d. samples (N_{i}=1000) drawn from a multivariate Gaussian distribution for d=20, i.e., the total number of samples generated is 10,000. The dimension of the subspace to be estimated is taken to be K\in\{1,5\}. We use T_{c}=50 as the number of consensus iterations per power iteration for SeqDistPM throughout our experiments. The reported results are averages over 10 Monte-Carlo trials. Figure 2 shows the performance of the different algorithms for the estimation of the most dominant eigenvector for different network topologies. It is clear that for K=1 SeqDistPM outperforms both DSA and DPGD in terms of communications efficiency because it is essentially the distributed power method, which is shown in [24, 22] to perform well for K=1. Even though DSA and DPGD have the same performance in terms of communications cost, it is important to remember that DPGD requires an additional QR normalization step per communications round. Next, Figure 3 shows a comparison between the three algorithms when the top-5 eigenvectors are estimated, i.e., K=5. It is clear that while estimating multiple eigenvectors, DSA slightly outperforms DPGD without performing explicit QR normalization, and it also has much better communications efficiency than SeqDistPM. The error for SeqDistPM is significantly high in the beginning because of the sequential estimation: while the first (most dominant) eigenvectors are being estimated, the remaining estimates are still at their initial values and hence contribute significant error even when the already-estimated eigenvectors have low error. After a sufficiently large number of communications rounds, SeqDistPM eventually reaches a lower final error than DSA, but this comes at the expense of slower convergence as a function of communications cost. It should also be noted that SeqDistPM lacks a formal convergence analysis and has two time scales that need to be adjusted, as both contribute to the final error. Finally, the benefits of DSA over DPGD are twofold. First, DSA reaches a similar or better error floor without explicit QR normalization, thus saving \mathcal{O}(K^{2}d) computations per iteration; and second, the convergence guarantees for gradient descent-based algorithms for non-convex problems like PCA are limited, usually establishing convergence to a stationary solution at a sub-linear rate.
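For reference, the DPGD baseline described above can be sketched as follows; the gradient of the trace objective is taken as 2\mathbf{C}_{i}\mathbf{X}_{i}, and the QR-based retraction is the projection step discussed in the text (names illustrative).

import numpy as np

def dpgd_local_update(X_i, neighbor_X, w_i, C_i, alpha):
    # distributed gradient step on f_i(X) = Trace(X^T C_i X), followed by QR projection
    X_mix = sum(w_i[j] * Xj for j, Xj in neighbor_X.items())
    X_gd = X_mix + alpha * (2.0 * C_i @ X_i)      # gradient ascent step on the trace objective
    Q, _ = np.linalg.qr(X_gd)                     # O(K^2 d) projection onto the Stiefel manifold
    return Q

The extra QR factorization per communications round is precisely the \mathcal{O}(K^{2}d) overhead that DSA avoids.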

Figure 3: Comparison between DSA, DPGD, and SeqDistPM for K=5 in terms of communications efficiency. (a) Erdos–Renyi network, \Delta_{K}=0.6; (b) cyclic network, \Delta_{K}=0.6; (c) star network, \Delta_{K}=0.6; (d) Erdos–Renyi network, \Delta_{K}=0.8; (e) cyclic network, \Delta_{K}=0.8; (f) star network, \Delta_{K}=0.8.

VI-B Real-World Data

Along with the synthetic data experiments, we provide some experiments with the real-world datasets MNIST [42] and CIFAR-10 [43]. For the distributed setup in this case, we use an Erdos–Renyi graph with M=20 nodes and p=0.5. Both datasets have 60,000 samples, so that the number of samples per node is N_{i}=3000. The data dimension for MNIST is d=784 and a constant step size of \alpha=0.1 was used. The plots in Figure 4(a) and Figure 4(b) show the results for K\in\{10,40\} for MNIST. Similar plots are shown for CIFAR-10 in Figure 5(a) and Figure 5(b), where the dimension d for CIFAR-10 is 1024, the number of estimated eigenvectors is K\in\{10,20\}, and a constant step size of \alpha=0.7 is used. For these real-world datasets, we exclude the comparison with SeqDistPM since it is evident that this method incurs a much higher communications cost when estimating a larger number of eigenvectors.

Figure 4: Comparison between DSA, OI, GHA, and DPGD for the MNIST dataset as a function of the number of algorithmic iterations. (a) MNIST, K=10; (b) MNIST, K=40.
Figure 5: Comparison between DSA, OI, GHA, and DPGD for the CIFAR-10 dataset as a function of the number of algorithmic iterations. (a) CIFAR-10, K=10; (b) CIFAR-10, K=20.

VII Conclusion

In this paper, we proposed and analyzed a new distributed Principal Component Analysis (PCA) algorithm that, as opposed to distributed subspace learning methods, facilitates both dimensionality reduction and data decorrelation in a distributed setup. Our main contribution in this regard was a detailed convergence analysis to prove that the proposed distributed method linearly converges to a neighborhood of the eigenvectors of the global covariance matrix. We also provided numerical results to demonstrate the communications efficiency and overall effectiveness of the proposed algorithm.

In terms of future work, an obvious extension would be a distributed algorithm that enables exact convergence to the PCA solution at a linear rate. Note that the use of a diminishing step size α\alpha along with the analysis in this paper already guarantees that DSA can converge exactly to the PCA solution. However, this exact convergence guarantee comes at the expense of a slow convergence rate. We instead expect to combine ideas from this work as well as ideas such as gradient tracking from the literature on distributed optimization [36, 31, 44] to develop a linearly convergent, exact algorithm for distributed PCA in the future. Another possible future direction involves developing an algorithm for distributed PCA that does not require the top KK eigenvalues to be distinct. We also leave the case of multiple eigenvector estimation from distributed, streaming data, as in [26], for future work.

Appendix A Statements and Proofs of Auxiliary Lemmas

A-A Statement and Proof of Lemma 1

Lemma 1.

Assume 𝐱k(0)=1,k\|\mathbf{x}_{k}^{(0)}\|=1,\leavevmode\nobreak\ \forall k. If the step size is bounded above as α13λ1(2K1)\alpha\leq\frac{1}{3\lambda_{1}{(2K-1)}}, where λ1\lambda_{1} is the largest eigenvalue of 𝐂\mathbf{C} and KK is the number of eigenvectors to be estimated, then

t,𝐱k(t)<3and(𝐱k(t))T𝐂𝐱k(t)<1α.\forall t,\quad\|\mathbf{x}_{k}^{(t)}\|<\sqrt{3}\quad\text{and}\quad(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}<\frac{1}{\alpha}. (29)
Proof.

From (11), we know the iterate for kthk^{th} eigenvector estimate is

𝐱k(t+1)\displaystyle\mathbf{x}_{k}^{(t+1)} =\displaystyle= 𝐱k(t)+α(𝐂𝐱k(t)(𝐱k(t))T𝐂𝐱kt𝐱k(t)p=1k1𝐪p𝐪pT𝐂𝐱k(t))\displaystyle\mathbf{x}_{k}^{(t)}+\alpha\bigg{(}\mathbf{C}\mathbf{x}_{k}^{(t)}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{t}\mathbf{x}_{k}^{(t)}-\sum_{p=1}^{k-1}\mathbf{q}_{p}\mathbf{q}_{p}^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\bigg{)}
=\displaystyle= 𝐱k(t)+α(𝐂𝐱k(t)(𝐱k(t))T𝐂𝐱kt𝐱k(t)p=1k1λp𝐪p𝐪pT𝐱k(t))\displaystyle\mathbf{x}_{k}^{(t)}+\alpha\big{(}\mathbf{C}\mathbf{x}_{k}^{(t)}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{t}\mathbf{x}_{k}^{(t)}-\sum_{p=1}^{k-1}\lambda_{p}\mathbf{q}_{p}\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)}\big{)}
=\displaystyle= 𝐱k(t)+α(𝐂~k𝐱k(t)(𝐱k(t))T𝐂𝐱k(t)𝐱k(t)),\displaystyle\mathbf{x}_{k}^{(t)}+\alpha\big{(}\tilde{\mathbf{C}}_{k}\mathbf{x}_{k}^{(t)}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\mathbf{x}_{k}^{(t)}\big{)},

where 𝐂~k=𝐂p=1k1λp𝐪p𝐪pT\tilde{\mathbf{C}}_{k}=\mathbf{C}-\sum_{p=1}^{k-1}\lambda_{p}\mathbf{q}_{p}\mathbf{q}_{p}^{T}. Notice that 𝐂~k2=𝐂2p=1k1λp2𝐪p𝐪pT\tilde{\mathbf{C}}_{k}^{2}=\mathbf{C}^{2}-\sum_{p=1}^{k-1}\lambda_{p}^{2}\mathbf{q}_{p}\mathbf{q}_{p}^{T}. Hence,

𝐱k(t+1)2\displaystyle\|\mathbf{x}_{k}^{(t+1)}\|^{2} =𝐱k(t)+α(𝐂~k𝐱k(t)(𝐱k(t))T𝐂𝐱k(t)𝐱k(t))2\displaystyle=\|\mathbf{x}_{k}^{(t)}+\alpha\big{(}\tilde{\mathbf{C}}_{k}\mathbf{x}_{k}^{(t)}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\mathbf{x}_{k}^{(t)}\big{)}\|^{2}
=𝐱k(t)2+α2𝐂~k𝐱k(t)(𝐱k(t))T𝐂𝐱k(t)𝐱k(t)2+2α(𝐱k(t))T(𝐂~k𝐱k(t)(𝐱k(t))T𝐂𝐱k(t)𝐱k(t))\displaystyle=\|\mathbf{x}_{k}^{(t)}\|^{2}+\alpha^{2}\|\tilde{\mathbf{C}}_{k}\mathbf{x}_{k}^{(t)}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\mathbf{x}_{k}^{(t)}\|^{2}+2\alpha(\mathbf{x}_{k}^{(t)})^{T}(\tilde{\mathbf{C}}_{k}\mathbf{x}_{k}^{(t)}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\mathbf{x}_{k}^{(t)})
=𝐱k(t)2+α2((𝐱k(t))T𝐂~k2𝐱k(t)+((𝐱k(t))T𝐂𝐱k(t))2𝐱k(t)22(𝐱k(t))T𝐂𝐱k(t)(𝐱k(t))T𝐂~k𝐱k(t))\displaystyle=\|\mathbf{x}_{k}^{(t)}\|^{2}+\alpha^{2}\big{(}(\mathbf{x}_{k}^{(t)})^{T}\tilde{\mathbf{C}}_{k}^{2}\mathbf{x}_{k}^{(t)}+((\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}\|\mathbf{x}_{k}^{(t)}\|^{2}-2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\mathbf{x}_{k}^{(t)})^{T}\tilde{\mathbf{C}}_{k}\mathbf{x}_{k}^{(t)}\big{)}
+2α((𝐱k(t))T𝐂~k𝐱k(t)(𝐱k(t))T𝐂𝐱k(t)𝐱k(t)2)\displaystyle+2\alpha\big{(}(\mathbf{x}_{k}^{(t)})^{T}\tilde{\mathbf{C}}_{k}\mathbf{x}_{k}^{(t)}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\|\mathbf{x}_{k}^{(t)}\|^{2}\big{)}
=𝐱k(t)2+α2((𝐱k(t))T(𝐂2p=1k1λp2𝐪p𝐪pT)𝐱k(t)+((𝐱k(t))T𝐂𝐱k(t))2𝐱k(t)22(𝐱k(t))T𝐂𝐱k(t)×\displaystyle=\|\mathbf{x}_{k}^{(t)}\|^{2}+\alpha^{2}\big{(}(\mathbf{x}_{k}^{(t)})^{T}(\mathbf{C}^{2}-\sum_{p=1}^{k-1}\lambda_{p}^{2}\mathbf{q}_{p}\mathbf{q}_{p}^{T})\mathbf{x}_{k}^{(t)}+((\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}\|\mathbf{x}_{k}^{(t)}\|^{2}-2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\times
(𝐱k(t))T(𝐂p=1k1λp𝐪p𝐪pT)𝐱k(t))+2α((𝐱k(t))T(𝐂p=1k1λp𝐪p𝐪pT)𝐱k(t)(𝐱k(t))T𝐂𝐱k(t)𝐱k(t)2)\displaystyle(\mathbf{x}_{k}^{(t)})^{T}(\mathbf{C}-\sum_{p=1}^{k-1}\lambda_{p}\mathbf{q}_{p}\mathbf{q}_{p}^{T})\mathbf{x}_{k}^{(t)}\big{)}+2\alpha\big{(}(\mathbf{x}_{k}^{(t)})^{T}(\mathbf{C}-\sum_{p=1}^{k-1}\lambda_{p}\mathbf{q}_{p}\mathbf{q}_{p}^{T})\mathbf{x}_{k}^{(t)}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\|\mathbf{x}_{k}^{(t)}\|^{2}\big{)}
=𝐱k(t)2+α2((𝐱k(t))T𝐂2𝐱k(t)p=1k1λp2(𝐪pT𝐱k(t))2+((𝐱k(t))T𝐂𝐱k(t))2(𝐱k(t)22)\displaystyle=\|\mathbf{x}_{k}^{(t)}\|^{2}+\alpha^{2}\big{(}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}^{2}\mathbf{x}_{k}^{(t)}-\sum_{p=1}^{k-1}\lambda_{p}^{2}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}+((\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}(\|\mathbf{x}_{k}^{(t)}\|^{2}-2)
+2(𝐱k(t))T𝐂𝐱k(t)p=1k1λp(𝐪pT𝐱k(t))2)+2α((𝐱k(t))T𝐂𝐱k(t)(1𝐱k(t)2)p=1k1λp(𝐪pT𝐱k(t))2).\displaystyle+2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\sum_{p=1}^{k-1}\lambda_{p}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}\big{)}+2\alpha\big{(}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(1-\|\mathbf{x}_{k}^{(t)}\|^{2})-\sum_{p=1}^{k-1}\lambda_{p}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}\big{)}. (30)

We now split our analysis into three cases based on the range of values of 𝐱k(t)2\|\mathbf{x}_{k}^{(t)}\|^{2}.

Case \@slowromancapi@: Let 𝐱k(t)21\|\mathbf{x}_{k}^{(t)}\|^{2}\leq 1. Then we see from (30) that

𝐱k(t+1)2\displaystyle\|\mathbf{x}_{k}^{(t+1)}\|^{2} \displaystyle\leq 1+α2(λ12+2λ1p=1k1λp)+2αλ11+α2(λ12+2λ1p=1k1λ1)+2αλ1\displaystyle 1+\alpha^{2}(\lambda_{1}^{2}+2\lambda_{1}\sum_{p=1}^{k-1}\lambda_{p})+2\alpha\lambda_{1}\leq 1+\alpha^{2}(\lambda_{1}^{2}+2\lambda_{1}\sum_{p=1}^{k-1}\lambda_{1})+2\alpha\lambda_{1}
\displaystyle\leq 1+α2λ12(2K1)+2αλ12K1=(1+αλ12K1)2\displaystyle 1+\alpha^{2}\lambda_{1}^{2}(2K-1)+2\alpha\lambda_{1}\sqrt{2K-1}=(1+\alpha\lambda_{1}\sqrt{2K-1})^{2}
\displaystyle\leq 2(1+α2λ12(2K1))2(1+19(2K1))2(1+19)<3.\displaystyle 2(1+\alpha^{2}\lambda_{1}^{2}(2K-1))\leq 2(1+\frac{1}{9(2K-1)})\leq 2(1+\frac{1}{9})<3.

Case \@slowromancapii@: Now suppose 1<𝐱k(t)221<\|\mathbf{x}_{k}^{(t)}\|^{2}\leq 2. Then from (30) we have

𝐱k(t+1)2\displaystyle\|\mathbf{x}_{k}^{(t+1)}\|^{2} \displaystyle\leq 2+α2(2λ12+2λ1p=1k12λp)2+α2(2λ12+2λ1p=1k12λ1)\displaystyle 2+\alpha^{2}(2\lambda_{1}^{2}+2\lambda_{1}\sum_{p=1}^{k-1}2\lambda_{p})\leq 2+\alpha^{2}(2\lambda_{1}^{2}+2\lambda_{1}\sum_{p=1}^{k-1}2\lambda_{1})
\displaystyle\leq 2(1+19(2K1))2(1+19)<3,using similar steps as Case \@slowromancapi@.\displaystyle 2(1+\frac{1}{9(2K-1)})\leq 2(1+\frac{1}{9})<3,\quad\text{using similar steps as Case \@slowromancap i@}.

Case \@slowromancapiii@: Finally suppose 2<𝐱k(t)2<32<\|\mathbf{x}_{k}^{(t)}\|^{2}<3. Then from (30) we get

𝐱k(t+1)2\displaystyle\|\mathbf{x}_{k}^{(t+1)}\|^{2} <\displaystyle< 3+α2((𝐱k(t))T𝐂2𝐱k(t)p=1k1λp2(𝐪pT𝐱k(t))2+((𝐱k(t))T𝐂𝐱k(t))2(𝐱k(t)22)\displaystyle 3+\alpha^{2}\big{(}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}^{2}\mathbf{x}_{k}^{(t)}-\sum_{p=1}^{k-1}\lambda_{p}^{2}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}+((\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}(\|\mathbf{x}_{k}^{(t)}\|^{2}-2)
+2(𝐱k(t))T𝐂𝐱k(t)p=1k1λp(𝐪pT𝐱k(t))2)+2α((𝐱k(t))T𝐂𝐱k(t)(1𝐱k(t)2)p=1k1λp(𝐪pT𝐱k(t))2).\displaystyle+2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\sum_{p=1}^{k-1}\lambda_{p}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}\big{)}+2\alpha\big{(}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(1-\|\mathbf{x}_{k}^{(t)}\|^{2})-\sum_{p=1}^{k-1}\lambda_{p}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}\big{)}.

To show that 𝐱k(t+1)2<3\|\mathbf{x}_{k}^{(t+1)}\|^{2}<3, we have to show

α2((𝐱k(t))T𝐂2𝐱k(t)p=1k1λp2(𝐪pT𝐱k(t))2+((𝐱k(t))T𝐂𝐱k(t))2(𝐱k(t)22)\displaystyle\alpha^{2}\big{(}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}^{2}\mathbf{x}_{k}^{(t)}-\sum_{p=1}^{k-1}\lambda_{p}^{2}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}+((\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}(\|\mathbf{x}_{k}^{(t)}\|^{2}-2)
2(𝐱k(t))T𝐂𝐱k(t)p=1k1λp(𝐪pT𝐱k(t))2)+2α((𝐱k(t))T𝐂𝐱k(t)(1𝐱k(t)2)p=1k1λp(𝐪pT𝐱k(t))2)0\displaystyle 2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\sum_{p=1}^{k-1}\lambda_{p}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}\big{)}+2\alpha\big{(}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(1-\|\mathbf{x}_{k}^{(t)}\|^{2})-\sum_{p=1}^{k-1}\lambda_{p}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}\big{)}\leq 0
\displaystyle\Leftrightarrow\quad α2(𝐱k(t))T𝐂𝐱k(t)(𝐱k(t)21)+2p=1k1λp(𝐪pT𝐱k(t))2(𝐱k(t))T𝐂2𝐱k(t)p=1k1λp2(𝐪pT𝐱k(t))2+((𝐱k(t))T𝐂𝐱k(t))2(𝐱k(t)22)+2(𝐱k(t))T𝐂𝐱k(t)p=1k1λp(𝐪pT𝐱k(t))2.\displaystyle\alpha\leq\frac{2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\|\mathbf{x}_{k}^{(t)}\|^{2}-1)+2\sum_{p=1}^{k-1}\lambda_{p}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}}{(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}^{2}\mathbf{x}_{k}^{(t)}-\sum_{p=1}^{k-1}\lambda_{p}^{2}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}+((\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}(\|\mathbf{x}_{k}^{(t)}\|^{2}-2)+2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\sum_{p=1}^{k-1}\lambda_{p}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}}. (31)

We now find a lower bound of the right hand side of (31). Note that

2(𝐱k(t))T𝐂𝐱k(t)(𝐱k(t)21)+2p=1k1λp(𝐪pT𝐱k(t))22(𝐱k(t))T𝐂𝐱k(t)(𝐱k(t)21)\displaystyle 2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\|\mathbf{x}_{k}^{(t)}\|^{2}-1)+2\sum_{p=1}^{k-1}\lambda_{p}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}\geq 2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\|\mathbf{x}_{k}^{(t)}\|^{2}-1) (32)
and (𝐱k(t))T𝐂2𝐱k(t)p=1k1λp2(𝐪pT𝐱k(t))2+((𝐱k(t))T𝐂𝐱k(t))2(𝐱k(t)22)+2(𝐱k(t))T𝐂𝐱k(t)p=1k1λp(𝐪pT𝐱k(t))2\displaystyle(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}^{2}\mathbf{x}_{k}^{(t)}-\sum_{p=1}^{k-1}\lambda_{p}^{2}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}+((\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}(\|\mathbf{x}_{k}^{(t)}\|^{2}-2)+2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\sum_{p=1}^{k-1}\lambda_{p}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}
(𝐱k(t))T𝐂2𝐱k(t)+((𝐱k(t))T𝐂𝐱k(t))2(𝐱k(t)22)+2(𝐱k(t))T𝐂𝐱k(t)p=1k1λp(𝐪pT𝐱k(t))2.\displaystyle\leq(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}^{2}\mathbf{x}_{k}^{(t)}+((\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}(\|\mathbf{x}_{k}^{(t)}\|^{2}-2)+2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\sum_{p=1}^{k-1}\lambda_{p}(\mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)})^{2}. (33)

Now, \frac{(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}}{(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}^{2}\mathbf{x}_{k}^{(t)}} is a generalized Rayleigh quotient whose maximum and minimum values are the largest and smallest eigenvalues of the generalized eigenvalue problem \mathbf{C}\mathbf{y}=\lambda\mathbf{C}^{2}\mathbf{y}. Since the eigenvectors of \mathbf{C} and \mathbf{C}^{2} are the same, the largest and smallest eigenvalues of the generalized problem are \frac{1}{\lambda_{d}} and \frac{1}{\lambda_{1}}, respectively, where \lambda_{1} and \lambda_{d} are the largest and smallest eigenvalues of \mathbf{C}. Thus, (\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}^{2}\mathbf{x}_{k}^{(t)}\leq\lambda_{1}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}. Also, since \mathbf{q}_{p}^{T}\mathbf{x}_{k}^{(t)}\leq\|\mathbf{q}_{p}\|\|\mathbf{x}_{k}^{(t)}\| and \|\mathbf{q}_{p}\|=1, the right-hand side of (33) is upper bounded by

λ1(𝐱k(t))T𝐂𝐱k(t)+((𝐱k(t))T𝐂𝐱k(t))2(𝐱k(t)22)+2(𝐱k(t))T𝐂𝐱k(t)p=1k1λp𝐱k(t)2\displaystyle\lambda_{1}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}+((\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}(\|\mathbf{x}_{k}^{(t)}\|^{2}-2)+2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\sum_{p=1}^{k-1}\lambda_{p}\|\mathbf{x}_{k}^{(t)}\|^{2}
=\displaystyle= (𝐱k(t))T𝐂𝐱k(t)(λ1+(𝐱k(t))T𝐂𝐱k(t)(𝐱k(t)22)+2p=1k1λp𝐱k(t)2)\displaystyle(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\lambda_{1}+(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\|\mathbf{x}_{k}^{(t)}\|^{2}-2)+2\sum_{p=1}^{k-1}\lambda_{p}\|\mathbf{x}_{k}^{(t)}\|^{2})
\displaystyle\leq (𝐱k(t))T𝐂𝐱k(t)(λ1+λ1𝐱k(t)2(𝐱k(t)22)+2p=1k1λ1𝐱k(t)2)\displaystyle(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\lambda_{1}+\lambda_{1}\|\mathbf{x}_{k}^{(t)}\|^{2}(\|\mathbf{x}_{k}^{(t)}\|^{2}-2)+2\sum_{p=1}^{k-1}\lambda_{1}\|\mathbf{x}_{k}^{(t)}\|^{2})
=\displaystyle= λ1(𝐱k(t))T𝐂𝐱k(t)(1+𝐱k(t)42𝐱k(t)2+2(k1)𝐱k(t)2)\displaystyle\lambda_{1}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(1+\|\mathbf{x}_{k}^{(t)}\|^{4}-2\|\mathbf{x}_{k}^{(t)}\|^{2}+2(k-1)\|\mathbf{x}_{k}^{(t)}\|^{2})
=\displaystyle= λ1(𝐱k(t))T𝐂𝐱k(t)((𝐱k(t)21)2+2(k1)𝐱k(t)2)\displaystyle\lambda_{1}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}((\|\mathbf{x}_{k}^{(t)}\|^{2}-1)^{2}+2(k-1)\|\mathbf{x}_{k}^{(t)}\|^{2})
=\displaystyle= λ1(𝐱k(t))T𝐂𝐱k(t)(𝐱k(t)21)((𝐱k(t)21)+2(k1)𝐱k(t)2(𝐱k(t)21))\displaystyle\lambda_{1}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\|\mathbf{x}_{k}^{(t)}\|^{2}-1)((\|\mathbf{x}_{k}^{(t)}\|^{2}-1)+2(k-1)\frac{\|\mathbf{x}_{k}^{(t)}\|^{2}}{(\|\mathbf{x}_{k}^{(t)}\|^{2}-1)})
<\displaystyle< λ1(𝐱k(t))T𝐂𝐱k(t)(𝐱k(t)21)((31)+2(k1)2),since𝐱k(t)2(𝐱k(t)21)<2\displaystyle\lambda_{1}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\|\mathbf{x}_{k}^{(t)}\|^{2}-1)((3-1)+2(k-1)2),\quad\text{since}\frac{\|\mathbf{x}_{k}^{(t)}\|^{2}}{(\|\mathbf{x}_{k}^{(t)}\|^{2}-1)}<2
=\displaystyle= 2λ1(𝐱k(t))T𝐂𝐱k(t)(𝐱k(t)21)(2k1)2λ1(𝐱k(t))T𝐂𝐱k(t)(𝐱k(t)21)(2K1).\displaystyle 2\lambda_{1}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\|\mathbf{x}_{k}^{(t)}\|^{2}-1)(2k-1)\leq 2\lambda_{1}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\|\mathbf{x}_{k}^{(t)}\|^{2}-1)(2K-1).

Hence, we have that the right hand side of (31) exceeds

2(𝐱k(t))T𝐂𝐱k(t)(𝐱k(t)21)2λ1(𝐱k(t))T𝐂𝐱k(t)(𝐱k(t)21)(2K1)=1λ1(2K1)>13λ1(2K1).\displaystyle\frac{2(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\|\mathbf{x}_{k}^{(t)}\|^{2}-1)}{2\lambda_{1}(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}(\|\mathbf{x}_{k}^{(t)}\|^{2}-1)(2K-1)}=\frac{1}{\lambda_{1}(2K-1)}>\frac{1}{3\lambda_{1}(2K-1)}.

Thus, if α13λ1(2K1)\alpha\leq\frac{1}{3\lambda_{1}(2K-1)}, then 𝐱k(t)2<3\|\mathbf{x}_{k}^{(t)}\|^{2}<3.

Next,

0(𝐱k(t))T𝐂𝐱k(t)λ1𝐱k(t)2<3λ13(2K1)λ11α.0\leq(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\leq\lambda_{1}\|\mathbf{x}_{k}^{(t)}\|^{2}<3\lambda_{1}\leq 3(2K-1)\lambda_{1}\leq\frac{1}{\alpha}. (34)

Hence, (𝐱k(t))T𝐂𝐱k(t)<1α(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}<\frac{1}{\alpha}. ∎
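As a side note, the bounds of Lemma 1 can be checked numerically for the centralized iterate (11); the sketch below uses a randomly generated covariance matrix and is only an illustration, not part of the proof.

import numpy as np

def check_lemma1_bounds(d=10, K=3, T=2000, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, 5 * d))
    C = A @ A.T / (5 * d)                              # a generic sample covariance
    lam, Q = np.linalg.eigh(C)
    lam, Q = lam[::-1], Q[:, ::-1]                     # eigenvalues in descending order
    alpha = 1.0 / (3 * lam[0] * (2 * K - 1))           # step-size bound of Lemma 1
    for k in range(K):
        x = rng.standard_normal(d)
        x /= np.linalg.norm(x)                         # unit-norm initialization
        for _ in range(T):
            Cx = C @ x
            defl = Q[:, :k] @ (Q[:, :k].T @ Cx)        # deflation with the true top (k-1) eigenvectors
            x = x + alpha * (Cx - (x @ Cx) * x - defl)
            assert np.linalg.norm(x) < np.sqrt(3) and x @ C @ x < 1.0 / alpha
    return True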

A-B Statement and Proof of Lemma 2

Lemma 2.

Suppose 𝐪kT𝐱k(0)=zk,k(0)0\mathbf{q}_{k}^{T}\mathbf{x}_{k}^{(0)}=z_{k,k}^{(0)}\neq 0 and (𝐱k(t))T𝐂𝐱k(t)<1α(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}<\frac{1}{\alpha}, then

(𝐱k(t))T𝐂𝐱k(t)>min{(13αλ1)2λm,(𝐱~k(0))T𝐂𝐱~k(0)},t.(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}>\min\{(1-3\alpha\lambda_{1})^{2}\lambda_{m},(\tilde{\mathbf{x}}_{k}^{(0)})^{T}\mathbf{C}\tilde{\mathbf{x}}_{k}^{(0)}\},\quad\forall t.
Proof.

We know 0(𝐱k(t))T𝐂𝐱k(t)λ1𝐱k(t)2<3λ1using Lemma 10\leq(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}\leq\lambda_{1}\|\mathbf{x}_{k}^{(t)}\|^{2}<3\lambda_{1}\leavevmode\nobreak\ \text{using Lemma\leavevmode\nobreak\ \ref{lemma:bounded_iterate_rayleigh}}. Let λm,m>K\lambda_{m},m>K be the smallest non-zero eigenvalue of 𝐂\mathbf{C}. Now, if λm(𝐱k(t))T𝐂𝐱k(t)<3λ1\lambda_{m}\leq(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}<3\lambda_{1}, then

(𝐱k(t+1))T𝐂𝐱k(t+1)\displaystyle({\mathbf{x}}_{k}^{(t+1)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t+1)} =l=1dλl(zk,l(t+1))2\displaystyle=\sum_{l=1}^{d}\lambda_{l}(z_{k,l}^{(t+1)})^{2}
=l=1k1λl(zk,l(t+1))2+l=kdλl(zk,l(t+1))2\displaystyle=\sum_{l=1}^{k-1}\lambda_{l}(z_{k,l}^{(t+1)})^{2}+\sum_{l=k}^{d}\lambda_{l}(z_{k,l}^{(t+1)})^{2}
=l=1k1λl(1α(𝐱k(t))T𝐂𝐱k(t))2(zk,l(t))2+l=kdλl(1+α(λl(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2\displaystyle=\sum_{l=1}^{k-1}\lambda_{l}(1-\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})^{2}(z_{k,l}^{(t)})^{2}+\sum_{l=k}^{d}\lambda_{l}(1+\alpha(\lambda_{l}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}
(1α(𝐱k(t))T𝐂𝐱k(t))2l=1k1λl(zk,l(t))2+(1+α(λm(𝐱k(t))T𝐂𝐱k(t)))2l=kdλl(zk,l(t))2\displaystyle\geq(1-\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})^{2}\sum_{l=1}^{k-1}\lambda_{l}(z_{k,l}^{(t)})^{2}+(1+\alpha(\lambda_{m}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}))^{2}\sum_{l=k}^{d}\lambda_{l}({z}_{k,l}^{(t)})^{2}
>(1α(𝐱k(t))T𝐂𝐱k(t))2l=1k1λl(zk,l(t))2+(1α(𝐱k(t))T𝐂𝐱k(t)))2l=kdλl(zk,l(t))2\displaystyle>(1-\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})^{2}\sum_{l=1}^{k-1}\lambda_{l}(z_{k,l}^{(t)})^{2}+(1-\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}))^{2}\sum_{l=k}^{d}\lambda_{l}({z}_{k,l}^{(t)})^{2}
=(1α(𝐱k(t))T𝐂𝐱k(t))2l=1dλl(zk,l(t))2\displaystyle=(1-\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})^{2}\sum_{l=1}^{d}\lambda_{l}(z_{k,l}^{(t)})^{2}
>(13αλ1)2(𝐱k(t))T𝐂𝐱k(t)(13αλ1)2λm.\displaystyle>(1-3\alpha\lambda_{1})^{2}({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}\geq(1-3\alpha\lambda_{1})^{2}\lambda_{m}. (35)

Also, from (12), we have 𝐱k(t)=l=1dzk,l(t)𝐪l=l=1k1zk,l(t)𝐪l+l=kdzk,l(t)𝐪l.\mathbf{x}_{k}^{(t)}=\sum_{l=1}^{d}z_{k,l}^{(t)}\mathbf{q}_{l}=\sum_{l=1}^{k-1}z_{k,l}^{(t)}\mathbf{q}_{l}+\sum_{l=k}^{d}z_{k,l}^{(t)}\mathbf{q}_{l}. Let l=1k1zk,l(t)𝐪l=𝐱k(t)\sum_{l=1}^{k-1}z_{k,l}^{(t)}\mathbf{q}_{l}=\mathbf{x}_{k}^{{}^{\prime}(t)} and l=kdzk,l(t)𝐪l=𝐱~k(t)\sum_{l=k}^{d}z_{k,l}^{(t)}\mathbf{q}_{l}=\tilde{\mathbf{x}}_{k}^{(t)}. Thus, (𝐱k(t))T𝐂𝐱k(t)=(𝐱~k(t))T𝐂𝐱~k(t)+(𝐱k(t))T𝐂𝐱k(t)(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}=(\tilde{\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}\tilde{\mathbf{x}}_{k}^{(t)}+(\mathbf{x}_{k}^{{}^{\prime}(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{{}^{\prime}(t)}.

Now, if (𝐱~k(t))T𝐂𝐱~k(t)(𝐱k(t))T𝐂𝐱k(t)<λm(\tilde{\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}\tilde{\mathbf{x}}_{k}^{(t)}\leq(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}<\lambda_{m} then

(𝐱k(t+1))T𝐂𝐱k(t+1)(𝐱~k(t+1))T𝐂𝐱~k(t+1)\displaystyle({\mathbf{x}}_{k}^{(t+1)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t+1)}\geq(\tilde{\mathbf{x}}_{k}^{(t+1)})^{T}\mathbf{C}\tilde{\mathbf{x}}_{k}^{(t+1)} =l=kdλl(zk,l(t+1))2\displaystyle=\sum_{l=k}^{d}\lambda_{l}(z_{k,l}^{(t+1)})^{2}
=l=kdλl(1+α(λl(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2\displaystyle=\sum_{l=k}^{d}\lambda_{l}(1+\alpha(\lambda_{l}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}
(1+α(λm(𝐱k(t))T𝐂𝐱k(t)))2l=kdλl(zk,l(t))2>(𝐱~k(t))T𝐂𝐱~k(t).\displaystyle\geq(1+\alpha(\lambda_{m}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}))^{2}\sum_{l=k}^{d}\lambda_{l}({z}_{k,l}^{(t)})^{2}>(\tilde{\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}\tilde{\mathbf{x}}_{k}^{(t)}. (36)

Combining (35) and (36), we have

(𝐱k(t))T𝐂𝐱k(t)>min{(13αλ1)2λm,(𝐱~k(0))T𝐂𝐱~k(0)}.\displaystyle(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}>\min\{(1-3\alpha\lambda_{1})^{2}\lambda_{m},(\tilde{\mathbf{x}}_{k}^{(0)})^{T}\mathbf{C}\tilde{\mathbf{x}}_{k}^{(0)}\}. (37)

A-C Statement and Proof of Lemma 3

Lemma 3.

Assume 𝐱i,k(0)=1\|\mathbf{x}_{i,k}^{(0)}\|=1. If the step size is bounded above as αwii3λ1(2K1)\alpha\leq\frac{w_{ii}}{3\lambda_{1}{(2K-1)}}, where λ1\lambda_{1} is the largest eigenvalue of 𝐂\mathbf{C} and KK is the number of eigenvectors to be estimated, then

\|\mathbf{x}_{i,k}^{(t)}\|<\sqrt{3}\quad\text{and}\quad(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}<\frac{1}{\alpha},\quad\forall k,t. (38)
Proof.

We have

𝐱i,k(t+1)\displaystyle\mathbf{x}_{i,k}^{(t+1)} =\displaystyle= j𝒩i{i}wij𝐱j,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,kt𝐱i,k(t)p=1k1𝐱i,p(t)(𝐱i,p(t))T𝐂i𝐱i,k(t)).\displaystyle\sum_{j\in\mathcal{N}_{i}\cup\{i\}}w_{ij}\mathbf{x}_{j,k}^{(t)}+\alpha\big{(}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{t}\mathbf{x}_{i,k}^{(t)}-\sum_{p=1}^{k-1}\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\big{)}. (39)

Hence,

𝐱i,k(t+1)\displaystyle\|\mathbf{x}_{i,k}^{(t+1)}\| \displaystyle\leq wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))+αp=1k1𝐱i,p(t)(𝐱i,p(t))T𝐂i𝐱i,k(t)+jiwij𝐱j,k(t)\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha\big{(}{\mathbf{C}}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\big{)}\|+\alpha\sum_{p=1}^{k-1}\|\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\|+\sum_{j\neq i}\|w_{ij}\mathbf{x}_{j,k}^{(t)}\|
\displaystyle\leq wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))+αp=1k1λ1(𝐱i,p(t))𝐱i,k(t)𝐱i,p(t)+jiwij𝐱j,k(t)\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha\big{(}{\mathbf{C}}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\big{)}\|+\alpha\sum_{p=1}^{k-1}\lambda_{1}\|(\mathbf{x}_{i,p}^{(t)})\|\|\mathbf{x}_{i,k}^{(t)}\|\|\mathbf{x}_{i,p}^{(t)}\|+\sum_{j\neq i}w_{ij}\|\mathbf{x}_{j,k}^{(t)}\|
=\displaystyle= wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))+αp=1k1λ1(𝐱i,p(t))2𝐱i,k(t)+jiwij𝐱j,k(t)\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha\big{(}{\mathbf{C}}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\big{)}\|+\alpha\sum_{p=1}^{k-1}\lambda_{1}\|(\mathbf{x}_{i,p}^{(t)})\|^{2}\|\mathbf{x}_{i,k}^{(t)}\|+\sum_{j\neq i}w_{ij}\|\mathbf{x}_{j,k}^{(t)}\|
\displaystyle\leq wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))+3αλ1p=1k1𝐱i,k(t)+jiwij𝐱j,k(t)\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha\big{(}{\mathbf{C}}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\big{)}\|+3\alpha\lambda_{1}\sum_{p=1}^{k-1}\|\mathbf{x}_{i,k}^{(t)}\|+\sum_{j\neq i}w_{ij}\|\mathbf{x}_{j,k}^{(t)}\|
=\displaystyle= wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))+3(k1)αλ1𝐱i,k(t)+jiwij𝐱j,k(t).\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha\big{(}{\mathbf{C}}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\big{)}\|+3(k-1)\alpha\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}\|+\sum_{j\neq i}w_{ij}\|\mathbf{x}_{j,k}^{(t)}\|.

Now,

wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)((𝐱i,k(t))T𝐂i𝐱i,k(t))𝐱i,k(t))2\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha(\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}-((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})\mathbf{x}_{i,k}^{(t)})\|^{2}
=wii2𝐱i,k(t)2+α2𝐂i𝐱i,k(t)((𝐱i,k(t))T𝐂i𝐱i,k(t))𝐱i,k(t)2+2αwii(𝐱i,k(t))T(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))\displaystyle=w_{ii}^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}+\alpha^{2}\|\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}-((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})\mathbf{x}_{i,k}^{(t)}\|^{2}+2\alpha w_{ii}(\mathbf{x}_{i,k}^{(t)})^{T}(\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)})
=wii2𝐱i,k(t)2+2αwii(𝐱i,k((t)))T𝐂i𝐱i,k(t)(1𝐱i,k(t)2)+α2(𝐱i,k(t))T𝐂i2𝐱i,k(t)+α2((𝐱i,k(t))T𝐂i𝐱i,k(t))2(𝐱i,k(t)22).\displaystyle=w_{ii}^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}+2\alpha w_{ii}(\mathbf{x}_{i,k}^{((t))})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(1-\|\mathbf{x}_{i,k}^{(t)}\|^{2})+\alpha^{2}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}^{2}\mathbf{x}_{i,k}^{(t)}+\alpha^{2}((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2).

Case \@slowromancapi@: Let us assume 𝐱i,kt21,i\|\mathbf{x}_{i,k}^{t}\|^{2}\leq 1,\forall i. Then, we have

wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))2\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha\big{(}{\mathbf{C}}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\big{)}\|^{2} \displaystyle\leq (wii+αλ1)2(wii+wii3(2K1))2.\displaystyle(w_{ii}+\alpha\lambda_{1})^{2}\leq\big{(}w_{ii}+\frac{w_{ii}}{3(2K-1)}\big{)}^{2}.

Thus,

𝐱i,kt+1\displaystyle\|\mathbf{x}_{i,k}^{t+1}\| \displaystyle\leq wii(1+13(2K1))+3(k1)3(2K1)+(1wii)\displaystyle w_{ii}\big{(}1+\frac{1}{3(2K-1)}\big{)}+\frac{3(k-1)}{3(2K-1)}+(1-w_{ii})
<\displaystyle< 13(2K1)+k12K1+1=k0.672(K0.5)+1\displaystyle\frac{1}{3(2K-1)}+\frac{k-1}{2K-1}+1=\frac{k-0.67}{2(K-0.5)}+1
\displaystyle\leq K0.672(K0.5)+1<1.5<3.\displaystyle\frac{K-0.67}{2(K-0.5)}+1<1.5<\sqrt{3}.

Case \@slowromancapii@: Now, suppose 1𝐱i,kt2<2,i1\leq\|\mathbf{x}_{i,k}^{t}\|^{2}<2,\forall i. Then, we get

wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))22wii2+2α2λ12<2(wii+αλ1)2.\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha\big{(}{\mathbf{C}}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\big{)}\|^{2}\leq 2w_{ii}^{2}+2\alpha^{2}\lambda_{1}^{2}<2(w_{ii}+\alpha\lambda_{1})^{2}.

Thus, if we need 𝐱i,k(t+1)3\|\mathbf{x}_{i,k}^{(t+1)}\|\leq\sqrt{3}, the following condition should be met:

𝐱i,kt+1\displaystyle\|\mathbf{x}_{i,k}^{t+1}\| 2wii(1+αλ1)+3(k1)αλ12+(1wii)23\displaystyle\leq\sqrt{2}w_{ii}(1+\alpha\lambda_{1})+3(k-1)\alpha\lambda_{1}\sqrt{2}+(1-w_{ii})\sqrt{2}\leq\sqrt{3}
\displaystyle\Leftrightarrow\quad 2+2wiiαλ1+3(k1)αλ123\displaystyle\sqrt{2}+\sqrt{2}w_{ii}\alpha\lambda_{1}+3(k-1)\alpha\lambda_{1}\sqrt{2}\leq\sqrt{3}
\displaystyle\Leftrightarrow\quad 2αλ1+3(k1)αλ1232\displaystyle\sqrt{2}\alpha\lambda_{1}+3(k-1)\alpha\lambda_{1}\sqrt{2}\leq\sqrt{3}-\sqrt{2}
\displaystyle\Leftrightarrow\quad 2αλ1(3k2)322αλ1(3K2)32\displaystyle\sqrt{2}\alpha\lambda_{1}(3k-2)\leq\sqrt{3}-\sqrt{2}\Leftrightarrow\quad\sqrt{2}\alpha\lambda_{1}(3K-2)\leq\sqrt{3}-\sqrt{2}
\displaystyle\Leftrightarrow\quad α322λ1(3K2)=1.51λ1(3K2)=0.225λ1(3K2).\displaystyle\alpha\leq\frac{\sqrt{3}-\sqrt{2}}{\sqrt{2}\lambda_{1}(3K-2)}=\frac{\sqrt{1.5}-1}{\lambda_{1}(3K-2)}=\frac{0.225}{\lambda_{1}(3K-2)}.

Since 0.225λ13(2K1)<0.225λ1(3K2)\frac{0.225}{\lambda_{1}3(2K-1)}<\frac{0.225}{\lambda_{1}(3K-2)}, if α0.2253λ1(2K1),then𝐱i,k(t+1)3\alpha\leq\frac{0.225}{3\lambda_{1}(2K-1)},\quad\text{then}\quad\|\mathbf{x}_{i,k}^{(t+1)}\|\leq\sqrt{3}.

Case \@slowromancapiii@: Finally, suppose 2𝐱i,k(t)23,i2\leq\|\mathbf{x}_{i,k}^{(t)}\|^{2}\leq 3,\forall i. We then have the following: jiwij𝐱j,ktjiwij3=(1wii)3\sum_{j\neq i}w_{ij}\|\mathbf{x}_{j,k}^{t}\|\leq\sum_{j\neq i}w_{ij}\sqrt{3}=(1-w_{ii})\sqrt{3}.

Now, if we desire 𝐱i,k(t+1)3\|\mathbf{x}_{i,k}^{(t+1)}\|\leq\sqrt{3}, then we need

wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))+3(k1)αλ1𝐱i,k(t)+jiwij33\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha\big{(}{\mathbf{C}}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\big{)}\|+3(k-1)\alpha\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}\|+\sum_{j\neq i}w_{ij}\sqrt{3}\leq\sqrt{3}
\displaystyle\Leftrightarrow\quad wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))+3(k1)αλ1𝐱i,k(t)+(1wii)33\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha\big{(}{\mathbf{C}}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\big{)}\|+3(k-1)\alpha\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}\|+(1-w_{ii})\sqrt{3}\leq\sqrt{3}
\displaystyle\Leftrightarrow\quad wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))33(k1)αλ1𝐱i,k(t)(1wii)3\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha\big{(}{\mathbf{C}}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\big{)}\|\leq\sqrt{3}-3(k-1)\alpha\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}\|-(1-w_{ii})\sqrt{3}
\displaystyle\Leftrightarrow\quad wii𝐱i,k(t)+α(𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t))23wii263(k1)wiiαλ1𝐱i,k(t)+9α2λ12(k1)2𝐱i,k(t)2.\displaystyle\|w_{ii}\mathbf{x}_{i,k}^{(t)}+\alpha\big{(}{\mathbf{C}}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\big{)}\|^{2}\leq 3w_{ii}^{2}-6\sqrt{3}(k-1)w_{ii}\alpha\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}\|+9\alpha^{2}\lambda_{1}^{2}(k-1)^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}.

Therefore, we need

wii2𝐱i,k(t)2+2αwii(𝐱i,k((t)))T𝐂i𝐱i,k(t)(1𝐱i,k(t)2)+α2(𝐱i,k(t))T𝐂i2𝐱i,k(t)+α2((𝐱i,k(t))T𝐂i𝐱i,k(t))2(𝐱i,k(t)22)\displaystyle w_{ii}^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}+2\alpha w_{ii}(\mathbf{x}_{i,k}^{((t))})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(1-\|\mathbf{x}_{i,k}^{(t)}\|^{2})+\alpha^{2}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}^{2}\mathbf{x}_{i,k}^{(t)}+\alpha^{2}((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)
3wii263(k1)wiiαλ1𝐱i,k(t)+9α2λ12(k1)2𝐱i,k(t)2\displaystyle\leq 3w_{ii}^{2}-6\sqrt{3}(k-1)w_{ii}\alpha\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}\|+9\alpha^{2}\lambda_{1}^{2}(k-1)^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}
\displaystyle\Leftrightarrow\quad 3w_{ii}^{2}+2\alpha w_{ii}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(1-\|\mathbf{x}_{i,k}^{(t)}\|^{2})+\alpha^{2}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}^{2}\mathbf{x}_{i,k}^{(t)}+\alpha^{2}((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)
3wii263(k1)wiiαλ1𝐱i,k(t)+9α2λ12(k1)2𝐱i,k(t)2\displaystyle\leq 3w_{ii}^{2}-6\sqrt{3}(k-1)w_{ii}\alpha\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}\|+9\alpha^{2}\lambda_{1}^{2}(k-1)^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}
\displaystyle\Leftrightarrow\quad 2\alpha w_{ii}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(1-\|\mathbf{x}_{i,k}^{(t)}\|^{2})+\alpha^{2}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}^{2}\mathbf{x}_{i,k}^{(t)}+\alpha^{2}((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)
63(k1)wiiαλ1𝐱i,k(t)+9α2λ12(k1)2𝐱i,k(t)2\displaystyle\leq-6\sqrt{3}(k-1)w_{ii}\alpha\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}\|+9\alpha^{2}\lambda_{1}^{2}(k-1)^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}
\displaystyle\Leftrightarrow\quad α2(𝐱i,k(t))T𝐂i2𝐱i,k(t)+α2((𝐱i,k(t))T𝐂i𝐱i,k(t))2(𝐱i,k(t)22)9α2λ12(k1)2𝐱i,k(t)2\displaystyle\alpha^{2}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}^{2}\mathbf{x}_{i,k}^{(t)}+\alpha^{2}((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)-9\alpha^{2}\lambda_{1}^{2}(k-1)^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}
\displaystyle\leq 2\alpha w_{ii}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)-6\sqrt{3}(k-1)w_{ii}\alpha\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}\|
\displaystyle\Leftrightarrow\quad\alpha\leq\frac{2w_{ii}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)-6\sqrt{3}(k-1)w_{ii}\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}\|}{(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}^{2}\mathbf{x}_{i,k}^{(t)}+((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)-9\lambda_{1}^{2}(k-1)^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}}. (40)

We now find the lower bound of the right-hand side of (40). Note that

(𝐱i,k(t))T𝐂i2𝐱i,k(t)+((𝐱i,k(t))T𝐂i𝐱i,k(t))2(𝐱i,k(t)22)9λ12(k1)2𝐱i,k(t)2\displaystyle(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}^{2}\mathbf{x}_{i,k}^{(t)}+((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)-9\lambda_{1}^{2}(k-1)^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}
λ1(𝐱i,k(t))T𝐂i𝐱i,k(t)+((𝐱i,k(t))T𝐂i𝐱i,k(t))2(𝐱i,k(t)22)9λ12(k1)2𝐱i,k(t)2,since(𝐱i,k(t))T𝐂i2𝐱i,k(t)λ1(𝐱i,k(t))T𝐂i𝐱i,k(t)\displaystyle\leq\lambda_{1}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}+((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)-9\lambda_{1}^{2}(k-1)^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2},\quad\text{since}\quad(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}^{2}\mathbf{x}_{i,k}^{(t)}\leq\lambda_{1}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}
λ1(𝐱i,k(t))T𝐂i𝐱i,k(t)+λ1(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t)2(𝐱i,k(t)22)9λ12(k1)2𝐱i,k(t)2\displaystyle\leq\lambda_{1}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}+\lambda_{1}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\|\mathbf{x}_{i,k}^{(t)}\|^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)-9\lambda_{1}^{2}(k-1)^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}
=λ1(𝐱i,k(t))T𝐂i𝐱i,k(t)(𝐱i,k(t)21)29λ12(k1)2𝐱i,k(t)2\displaystyle=\lambda_{1}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)^{2}-9\lambda_{1}^{2}(k-1)^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}
λ1(k1)(𝐱i,k(t))T𝐂i𝐱i,k(t)(𝐱i,k(t)21)29λ12(k1)2𝐱i,k(t)2\displaystyle\leq\lambda_{1}(k-1)(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)^{2}-9\lambda_{1}^{2}(k-1)^{2}\|\mathbf{x}_{i,k}^{(t)}\|^{2}
<λ1(k1)(𝐱i,k(t)21)((𝐱i,k(t))T𝐂i𝐱i,k(t)(𝐱i,k(t)21)9λ1(k1))since𝐱i,k(t)2𝐱i,k(t)21>1\displaystyle<\lambda_{1}(k-1)(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)\Big{(}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)-9\lambda_{1}(k-1)\Big{)}\quad\text{since}\quad\frac{\|\mathbf{x}_{i,k}^{(t)}\|^{2}}{\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1}>1

and,

\displaystyle 2w_{ii}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)-6\sqrt{3}(k-1)w_{ii}\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}\|\geq 2w_{ii}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)-18(k-1)w_{ii}\lambda_{1}
\displaystyle=2w_{ii}((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)-9(k-1)\lambda_{1}).

Thus, the right-hand side of (40) exceeds

\displaystyle\frac{2w_{ii}((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)-9(k-1)\lambda_{1})}{\lambda_{1}(k-1)(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)\Big{(}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)-9\lambda_{1}(k-1)\Big{)}}=\frac{2w_{ii}}{\lambda_{1}(k-1)(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-1)}\geq\frac{w_{ii}}{\lambda_{1}(k-1)}>\frac{w_{ii}}{3\lambda_{1}(2K-1)}.

This proves that if \alpha\leq\min\{\frac{w_{ii}}{3\lambda_{1}(2K-1)},\frac{0.225}{3\lambda_{1}(2K-1)}\}, then \|\mathbf{x}_{i,k}^{(t+1)}\|\leq\sqrt{3}. ∎
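As an illustration of the step-size condition just derived, the following self-contained numerical sketch (not part of the paper's analysis or experiments) runs the per-node update \mathbf{x}_{i,k}^{(t+1)}=\sum_{j}w_{ij}\mathbf{x}_{j,k}^{(t)}+\alpha\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)}) for the first eigenvector (k=1, so the deflation terms vanish) and checks that every local iterate satisfies \|\mathbf{x}_{i,k}^{(t)}\|\leq\sqrt{3}. The ring network, lazy Metropolis weights, synthetic covariances, and the use of the largest local dominant eigenvalue in place of \lambda_{1} are all arbitrary, assumed choices.

```python
# Illustrative sanity check of the norm bound just established, for k = 1: with
# alpha <= min{w_ii, 0.225} / (3 * lambda_1 * (2K - 1)), every local iterate of
# the update x_i <- sum_j w_ij x_j + alpha * H_i(x_i) keeps ||x_i|| <= sqrt(3).
# The network, data sizes, and number of iterations are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
M, d, K = 5, 8, 3                         # nodes, dimension, eigenvectors sought

# Symmetric, doubly stochastic mixing matrix for a ring graph (lazy Metropolis weights).
W = 0.5 * np.eye(M)
for i in range(M):
    W[i, (i + 1) % M] += 0.25
    W[i, (i - 1) % M] += 0.25

# Local sample covariances; lambda_1 is taken as the largest local dominant eigenvalue.
C = []
for _ in range(M):
    A = rng.standard_normal((d, 20))
    C.append(A @ A.T / 20)
lam1 = max(np.linalg.eigvalsh(Ci)[-1] for Ci in C)

alpha = min(W.diagonal().min(), 0.225) / (3 * lam1 * (2 * K - 1))

# k = 1, so Sanger's direction reduces to C_i x_i - (x_i^T C_i x_i) x_i.
X = rng.standard_normal((M, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)          # unit-norm initialization
for t in range(200):
    H = np.stack([C[i] @ X[i] - (X[i] @ C[i] @ X[i]) * X[i] for i in range(M)])
    X = W @ X + alpha * H
    assert np.all(np.linalg.norm(X, axis=1) <= np.sqrt(3) + 1e-12)

print("largest local iterate norm after the run:", np.linalg.norm(X, axis=1).max())
```

Under these assumed choices the assertion is expected to hold throughout the run, consistent with the induction argument above.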

A-D Statement and Proof of Lemma 4

Lemma 4.

The norm of Sanger’s direction 𝓗i(𝐱i,k(t))\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)}) is bounded as

𝓗i(𝐱i,k(t))23λi,12(3k2)(3k+1),k=1,,K.\displaystyle\|\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})\|^{2}\leq 3\lambda_{i,1}^{2}(3k-2)(3k+1),\forall k=1,\ldots,K. (41)
Proof.

We know

𝓗i(𝐱i,k(t))\displaystyle\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)}) =𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t)p=1k1𝐱i,p(t)(𝐱i,p(t))T𝐂i𝐱i,k(t)\displaystyle=\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}-\sum_{p=1}^{k-1}\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}
=(𝐈p=1k1𝐱i,p(t)(𝐱i,p(t))T)𝐂i𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t)=𝐂~i(t)𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t)\displaystyle=(\mathbf{I}-\sum_{p=1}^{k-1}\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{T})\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}=\tilde{\mathbf{C}}_{i}^{(t)}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}
𝓗i(𝐱i,k(t))2\displaystyle\|\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})\|^{2} =𝐂~i(t)𝐱i,k(t)(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t)2\displaystyle=\|\tilde{\mathbf{C}}_{i}^{(t)}\mathbf{x}_{i,k}^{(t)}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}\|^{2}
=(𝐱i,k(t))T(𝐂~i(t))T𝐂~i(t)𝐱i,k(t)+((𝐱i,k(t))T𝐂i𝐱i,k(t))2(𝐱i,k(t)22)+(𝐱i,k(t))T𝐂i𝐱i,k(t)p=1k1(𝐱i,k(t))T𝐱i,p(t)(𝐱i,p(t))T𝐂i𝐱i,k(t).\displaystyle=(\mathbf{x}_{i,k}^{(t)})^{T}(\tilde{\mathbf{C}}_{i}^{(t)})^{T}\tilde{\mathbf{C}}_{i}^{(t)}\mathbf{x}_{i,k}^{(t)}+((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)+(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\sum_{p=1}^{k-1}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}.

Next, notice that 𝐂~i(t)=𝐂ip=1k1𝐱i,p(t)(𝐱i,p(t))T𝐂i\|\tilde{\mathbf{C}}_{i}^{(t)}\|=\|\mathbf{C}_{i}-\sum_{p=1}^{k-1}\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{T}\mathbf{C}_{i}\|. Thus,

𝐂~i(t)\displaystyle\|\tilde{\mathbf{C}}_{i}^{(t)}\| \displaystyle\leq 𝐂i+p=1k1𝐱i,p(t)(𝐱i,p(t))T𝐂iλi,1+p=1k13λi,1=λi,1+3(k1)λi,1=λi,1(3k2).\displaystyle\|\mathbf{C}_{i}\|+\sum_{p=1}^{k-1}\|\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{T}\|\|\mathbf{C}_{i}\|\leq\lambda_{i,1}+\sum_{p=1}^{k-1}3\lambda_{i,1}=\lambda_{i,1}+3(k-1)\lambda_{i,1}=\lambda_{i,1}(3k-2).

We, therefore, get

(𝐱i,k(t))T(𝐂~i(t))T𝐂~i(t)𝐱i,k(t)λmax((𝐂~i(t))T)(𝐱i,k(t))T𝐂~i(t)𝐱i,k(t)=(𝐂~i(t))(𝐱i,k(t))T𝐂~i(t)𝐱i,k(t)λi,1(3k2)(𝐱i,k(t))T𝐂~i(t)𝐱i,k(t).\displaystyle(\mathbf{x}_{i,k}^{(t)})^{T}(\tilde{\mathbf{C}}_{i}^{(t)})^{T}\tilde{\mathbf{C}}_{i}^{(t)}\mathbf{x}_{i,k}^{(t)}\leq\lambda_{\max}((\tilde{\mathbf{C}}_{i}^{(t)})^{T})(\mathbf{x}_{i,k}^{(t)})^{T}\tilde{\mathbf{C}}_{i}^{(t)}\mathbf{x}_{i,k}^{(t)}=\|(\tilde{\mathbf{C}}_{i}^{(t)})\|(\mathbf{x}_{i,k}^{(t)})^{T}\tilde{\mathbf{C}}_{i}^{(t)}\mathbf{x}_{i,k}^{(t)}\leq\lambda_{i,1}(3k-2)(\mathbf{x}_{i,k}^{(t)})^{T}\tilde{\mathbf{C}}_{i}^{(t)}\mathbf{x}_{i,k}^{(t)}.

Thus,

𝓗i(𝐱i,k(t))2\displaystyle\|\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})\|^{2} λi,1(3k2)(𝐱i,k(t))T𝐂~i(t)𝐱i,k(t)+((𝐱i,k(t))T𝐂i𝐱i,k(t))2(𝐱i,k(t)22)+(𝐱i,k(t))T𝐂i𝐱i,k(t)p=1k1(𝐱i,k(t))T𝐱i,p(t)(𝐱i,p(t))T𝐂i𝐱i,k(t)\displaystyle\leq\lambda_{i,1}(3k-2)(\mathbf{x}_{i,k}^{(t)})^{T}\tilde{\mathbf{C}}_{i}^{(t)}\mathbf{x}_{i,k}^{(t)}+((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)+(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\sum_{p=1}^{k-1}(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}
λi,1(3k2)(𝐱i,k(t))T𝐂~i(t)𝐱i,k(t)+((𝐱i,k(t))T𝐂i𝐱i,k(t))2(𝐱i,k(t)22)+(𝐱i,k(t))T𝐂i𝐱i,k(t)p=1k1𝐱i,k(t)2𝐱i,p(t)2λi,1\displaystyle\leq\lambda_{i,1}(3k-2)(\mathbf{x}_{i,k}^{(t)})^{T}\tilde{\mathbf{C}}_{i}^{(t)}\mathbf{x}_{i,k}^{(t)}+((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)+(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\sum_{p=1}^{k-1}\|\mathbf{x}_{i,k}^{(t)}\|^{2}\|\mathbf{x}_{i,p}^{(t)}\|^{2}\lambda_{i,1}
𝓗i(𝐱i,k(t))2\displaystyle\|\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})\|^{2} λi,1(3k2)(𝐱i,k(t))T(𝐈p=1k1𝐱i,p(t)(𝐱i,p(t))T)𝐂i𝐱i,k(t)+((𝐱i,k(t))T𝐂i𝐱i,k(t))2(𝐱i,k(t)22)+\displaystyle\leq\lambda_{i,1}(3k-2)(\mathbf{x}_{i,k}^{(t)})^{T}(\mathbf{I}-\sum_{p=1}^{k-1}\mathbf{x}_{i,p}^{(t)}(\mathbf{x}_{i,p}^{(t)})^{T})\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}+((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)+
(𝐱i,k(t))T𝐂i𝐱i,k(t)p=1k1𝐱i,k(t)2𝐱i,p(t)2λi,1\displaystyle(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\sum_{p=1}^{k-1}\|\mathbf{x}_{i,k}^{(t)}\|^{2}\|\mathbf{x}_{i,p}^{(t)}\|^{2}\lambda_{i,1}
λi,1(3k2)(𝐱i,k(t))T𝐂i𝐱i,k(t)+λi,1(3k2)p=1k1𝐱i,k(t)2𝐱i,p(t)2λi,1+\displaystyle\leq\lambda_{i,1}(3k-2)(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}+\lambda_{i,1}(3k-2)\sum_{p=1}^{k-1}\|\mathbf{x}_{i,k}^{(t)}\|^{2}\|\mathbf{x}_{i,p}^{(t)}\|^{2}\lambda_{i,1}+
((𝐱i,k(t))T𝐂i𝐱i,k(t))2(𝐱i,k(t)22)+(𝐱i,k(t))T𝐂i𝐱i,k(t)p=1k1𝐱i,k(t)2𝐱i,p(t)2λi,1\displaystyle((\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)})^{2}(\|\mathbf{x}_{i,k}^{(t)}\|^{2}-2)+(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\sum_{p=1}^{k-1}\|\mathbf{x}_{i,k}^{(t)}\|^{2}\|\mathbf{x}_{i,p}^{(t)}\|^{2}\lambda_{i,1}
λi,1(3k2)3λi,1+λi,1(3k2)p=1k19λi,1+9λi,12+λi,13p=1k19λi,1=3λi,12(3k2)(3k+1).\displaystyle\leq\lambda_{i,1}(3k-2)3\lambda_{i,1}+\lambda_{i,1}(3k-2)\sum_{p=1}^{k-1}9\lambda_{i,1}+9\lambda_{i,1}^{2}+\lambda_{i,1}3\sum_{p=1}^{k-1}9\lambda_{i,1}=3\lambda_{i,1}^{2}(3k-2)(3k+1).
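The bound (41) also lends itself to a simple numerical spot check. The sketch below is illustrative only: it draws random vectors satisfying the norm condition \|\mathbf{x}_{i,p}^{(t)}\|^{2}\leq 3 used in the proof and verifies that \|\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})\|^{2} stays below 3\lambda_{i,1}^{2}(3k-2)(3k+1); the covariance, dimension, and number of trials are arbitrary choices.

```python
# Illustrative spot check of Lemma 4: whenever ||x_{i,p}||^2 <= 3 for all p,
# the squared norm of Sanger's direction is at most
# 3 * lambda_{i,1}^2 * (3k - 2) * (3k + 1).  All sizes below are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
d, k, trials = 10, 3, 2000

A = rng.standard_normal((d, 40))
Ci = A @ A.T / 40                               # local covariance C_i
lam_i1 = np.linalg.eigvalsh(Ci)[-1]             # lambda_{i,1}
bound = 3 * lam_i1**2 * (3 * k - 2) * (3 * k + 1)

worst = 0.0
for _ in range(trials):
    # Draw x_{i,1}, ..., x_{i,k} with squared norm at most 3.
    V = rng.standard_normal((k, d))
    V *= np.sqrt(3) * rng.uniform(size=(k, 1)) / np.linalg.norm(V, axis=1, keepdims=True)
    xk, Xp = V[-1], V[:-1]
    H = Ci @ xk - (xk @ Ci @ xk) * xk - Xp.T @ (Xp @ (Ci @ xk))
    worst = max(worst, float(H @ H))

print(f"largest ||H_i||^2 seen: {worst:.3f}   stated bound: {bound:.3f}")
assert worst <= bound
```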

Appendix B Statement and Proof of Lemma 5

Lemma 5.

Let \eta=\min\{(1-3\alpha\lambda_{1})^{2}\lambda_{m},(\tilde{\mathbf{x}}_{k}^{(0)})^{T}\mathbf{C}\tilde{\mathbf{x}}_{k}^{(0)}\} and suppose \eta<(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}<\frac{1}{\alpha}. Then the following is true for \gamma=1-\alpha\eta and some constant a_{1}>0:

l=1k1(zk,l(t+1))2<a1γt+1.\sum_{l=1}^{k-1}(z_{k,l}^{(t+1)})^{2}<a_{1}\gamma^{t+1}. (42)
Proof.

For l=1,,k1l=1,\ldots,k-1, we know from (13)

zk,l(t+1)\displaystyle{z}_{k,l}^{(t+1)} =\displaystyle= (1α(𝐱k(t))T𝐂𝐱k(t))zk,l(t)\displaystyle(1-\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}){z}_{k,l}^{(t)}
or,(zk,l(t+1))2\displaystyle\text{or,}\quad({z}_{k,l}^{(t+1)})^{2} =\displaystyle= (1α(𝐱k(t))T𝐂𝐱k(t))2(zk,l(t))2.\displaystyle(1-\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})^{2}({z}_{k,l}^{(t)})^{2}.

Let min{(13αλ1)2λm,(𝐱~k(0))T𝐂𝐱~k(0)}=η\min\{(1-3\alpha\lambda_{1})^{2}\lambda_{m},(\tilde{\mathbf{x}}_{k}^{(0)})^{T}\mathbf{C}\tilde{\mathbf{x}}_{k}^{(0)}\}=\eta. Since 0<η<(𝐱k(t))T𝐂𝐱k(t)<1α0<\eta<({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}<\frac{1}{\alpha} (from (34) and (37)), we have 0<1α(𝐱k(t))T𝐂𝐱k(t)<1αη<10<1-\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}<1-\alpha\eta<1. Therefore,

\displaystyle\sum_{l=1}^{k-1}({z}_{k,l}^{(t+1)})^{2}<\gamma\sum_{l=1}^{k-1}({z}_{k,l}^{(t)})^{2}<\gamma^{t+1}\sum_{l=1}^{k-1}({z}_{k,l}^{(0)})^{2}=a_{1}\gamma^{t+1},\quad\text{since}\quad(1-\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})^{2}<(1-\alpha\eta)^{2}\leq 1-\alpha\eta=\gamma. (43)
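Lemma 5 can be visualized by iterating the coordinate recursions (13) and (14) directly in the eigenbasis of \mathbf{C}. The sketch below is purely illustrative: the spectrum, step size, and initialization are arbitrary, and since \lambda_{m} and \tilde{\mathbf{x}}_{k}^{(0)} are not reconstructed here, the smallest value of (\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)} observed along the run is used as a stand-in for \eta.

```python
# Illustrative check of Lemma 5: iterate recursions (13)-(14) in the eigenbasis
# of C and verify that the energy in coordinates l < k stays below a1 * gamma^(t+1)
# with gamma = 1 - alpha * eta.  Here eta is taken as the smallest Rayleigh
# quotient observed along the run (a stand-in for the eta of the lemma); the
# spectrum, step size, and initialization are arbitrary choices.
import numpy as np

rng = np.random.default_rng(2)
d, k, T = 8, 3, 150
lam = np.sort(rng.uniform(0.5, 2.0, size=d))[::-1]      # eigenvalues of C
alpha = 0.3 / lam[0]                                     # so that 3 * alpha * lambda_1 < 1

z = rng.standard_normal(d)
z /= np.linalg.norm(z)                                   # ||x_k^{(0)}|| = 1
a1 = np.sum(z[:k - 1]**2)

rayleigh, energy = [], []
for t in range(T):
    q = float(np.sum(lam * z**2))                        # x_k^T C x_k
    rayleigh.append(q)
    z = np.where(np.arange(d) < k - 1,
                 (1 - alpha * q) * z,                    # recursion (13), l < k
                 (1 + alpha * (lam - q)) * z)            # recursion (14), l >= k
    energy.append(np.sum(z[:k - 1]**2))

gamma = 1 - alpha * min(rayleigh)
assert all(energy[t] < a1 * gamma**(t + 1) + 1e-15 for t in range(T))
print("residual energy in the first k-1 coordinates:", energy[-1])
```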

Appendix C Statement and Proof of Lemma 6

Lemma 6.

Suppose zk,k(0)0z_{k,k}^{(0)}\neq 0 and (𝐱k(t))T𝐂𝐱k(t)<1α(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}<\frac{1}{\alpha}. Then the following is true for ρk=(1+αλk+11+αλk)2<1\rho_{k}=\Big{(}\frac{1+\alpha\lambda_{k+1}}{1+\alpha\lambda_{k}}\Big{)}^{2}<1 and some constant a2>0a_{2}>0:

l=k+1d(zk,l(t+1))2a2ρkt+1.\sum_{l=k+1}^{d}({z}_{k,l}^{(t+1)})^{2}\leq a_{2}\rho_{k}^{t+1}. (44)
Proof.

For l=k,,dl=k,\ldots,d we know from (14) that zk,l(t+1)=(1+α(λl(𝐱k(t))T𝐂𝐱k(t)))zk,l(t){z}_{k,l}^{(t+1)}=(1+\alpha(\lambda_{l}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})){z}_{k,l}^{(t)}. If (𝐱k(t))T𝐂𝐱k(t)<1α({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}<\frac{1}{\alpha}, we have 1+α(λl(𝐱k(t))T𝐂𝐱k(t))>αλl0,l=k,,d1+\alpha(\lambda_{l}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})>\alpha\lambda_{l}\geq 0,\forall l=k,\ldots,d.

Thus, we have for l=k+1,\ldots,d,

(zk,l(t+1)zk,k(t+1))2\displaystyle\Bigg{(}\frac{{z}_{k,l}^{(t+1)}}{{z}_{k,k}^{(t+1)}}\Bigg{)}^{2} =\displaystyle= (1+α(λl(𝐱k(t))T𝐂𝐱k(t))1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t)zk,k(t))2\displaystyle\Bigg{(}\frac{1+\alpha(\lambda_{l}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})}{1+\alpha(\lambda_{k}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})}\Bigg{)}^{2}\Bigg{(}\frac{{z}_{k,l}^{(t)}}{{z}_{k,k}^{(t)}}\Bigg{)}^{2}
=\displaystyle= (1α(λkλl)1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t)zk,k(t))2\displaystyle\Bigg{(}1-\frac{\alpha(\lambda_{k}-\lambda_{l})}{1+\alpha(\lambda_{k}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})}\Bigg{)}^{2}\Bigg{(}\frac{{z}_{k,l}^{(t)}}{{z}_{k,k}^{(t)}}\Bigg{)}^{2}
\displaystyle\leq (1α(λkλl)1+αλk)2(zk,l(t)zk,k(t))2\displaystyle\Big{(}1-\frac{\alpha(\lambda_{k}-\lambda_{l})}{1+\alpha\lambda_{k}}\Big{)}^{2}\Big{(}\frac{{z}_{k,l}^{(t)}}{{z}_{k,k}^{(t)}}\Big{)}^{2}
=\displaystyle= (1+αλl1+αλk)2(zk,l(t)zk,k(t))2(1+αλk+11+αλk)2(zk,l(t)zk,k(t))2\displaystyle\Big{(}\frac{1+\alpha\lambda_{l}}{1+\alpha\lambda_{k}}\Big{)}^{2}\Big{(}\frac{{z}_{k,l}^{(t)}}{{z}_{k,k}^{(t)}}\Big{)}^{2}\leq\Big{(}\frac{1+\alpha\lambda_{k+1}}{1+\alpha\lambda_{k}}\Big{)}^{2}\Big{(}\frac{{z}_{k,l}^{(t)}}{{z}_{k,k}^{(t)}}\Big{)}^{2}
=\displaystyle= ρk(zk,l(t)zk,k(t))2,ρk=(1+αλk+11+αλk)2<1.\displaystyle\rho_{k}\Big{(}\frac{{z}_{k,l}^{(t)}}{{z}_{k,k}^{(t)}}\Big{)}^{2},\quad\rho_{k}=\Big{(}\frac{1+\alpha\lambda_{k+1}}{1+\alpha\lambda_{k}}\Big{)}^{2}<1.

Therefore, for l=k+1,\ldots,d, ({z}_{k,l}^{(t+1)})^{2}\leq\rho_{k}^{t+1}\Big{(}\frac{{z}_{k,l}^{(0)}}{{z}_{k,k}^{(0)}}\Big{)}^{2}({z}_{k,k}^{(t+1)})^{2}. Since \|{\mathbf{x}}_{k}^{(t+1)}\|^{2}\leq 3 and \|{\mathbf{x}}_{k}^{(0)}\|=1, we have ({z}_{k,k}^{(t+1)})^{2}\leq 3 and ({z}_{k,l}^{(0)})^{2}\leq 1. Also, since {z}_{k,k}^{(0)}\neq 0 by assumption, there exists some \tilde{\eta}>0 such that ({z}_{k,k}^{(0)})^{2}>\tilde{\eta}. Thus, we can write

l=k+1d(zk,lt+1)2\displaystyle\sum_{l=k+1}^{d}({z}_{k,l}^{t+1})^{2} \displaystyle\leq ρkt+1l=k+1d3η~=a2ρkt+1.\displaystyle\rho_{k}^{t+1}\sum_{l=k+1}^{d}\frac{3}{\tilde{\eta}}=a_{2}\rho_{k}^{t+1}. (45)
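The contraction at the heart of Lemma 6 can be checked in the same way. The following sketch, under the same illustrative assumptions (arbitrary spectrum, step size, and initialization), iterates recursions (13) and (14) and verifies that (z_{k,l}^{(t)}/z_{k,k}^{(t)})^{2} decays at least as fast as \rho_{k}^{t} for every l>k.

```python
# Illustrative check of Lemma 6: along recursions (13)-(14), the ratios
# (z_{k,l} / z_{k,k})^2 for l > k contract by at least
# rho_k = ((1 + alpha*lambda_{k+1}) / (1 + alpha*lambda_k))^2 per iteration.
# Spectrum, step size, and initialization are arbitrary choices.
import numpy as np

rng = np.random.default_rng(3)
d, k, T = 8, 3, 150
lam = np.sort(rng.uniform(0.5, 2.0, size=d))[::-1]
alpha = 0.3 / lam[0]
rho_k = ((1 + alpha * lam[k]) / (1 + alpha * lam[k - 1]))**2   # lam[k-1] is lambda_k

z = rng.standard_normal(d)
z /= np.linalg.norm(z)
ratio0 = (z[k:] / z[k - 1])**2                                  # ratios at t = 0

for t in range(1, T + 1):
    q = float(np.sum(lam * z**2))
    assert q < 1 / alpha                                        # hypothesis of the lemma
    z = np.where(np.arange(d) < k - 1,
                 (1 - alpha * q) * z,
                 (1 + alpha * (lam - q)) * z)
    assert np.all((z[k:] / z[k - 1])**2 <= rho_k**t * ratio0 + 1e-15)

print("energy in coordinates l > k at the end:", float(np.sum(z[k:]**2)))
```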

Appendix D Statement and Proof of Lemma 7

Lemma 7.

Suppose (\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}<\frac{1}{\alpha} and (\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}>\min\{(1-3\alpha\lambda_{1})^{2}\lambda_{m},(\tilde{\mathbf{x}}_{k}^{(0)})^{T}\mathbf{C}\tilde{\mathbf{x}}_{k}^{(0)}\}. Then there exist constants 0<\delta,\gamma_{1}<1 and a_{4}>0 such that

|\lambda_{k}-({\mathbf{x}}_{k}^{(t+1)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t+1)}|\leq(t+1)a_{4}(\delta^{t+1}+\max\{\delta^{t},\gamma_{1}^{t}\}). (46)
Proof.
(𝐱k(t+1))T𝐂𝐱k(t+1)\displaystyle({\mathbf{x}}_{k}^{(t+1)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t+1)} =l=1k1λl(1α(𝐱k(t))T𝐂𝐱k(t))2(zk,l(t))2+l=kdλl(1+α(λl(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2\displaystyle=\sum_{l=1}^{k-1}\lambda_{l}(1-\alpha(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}({z}_{k,l}^{(t)})^{2}+\sum_{l=k}^{d}\lambda_{l}(1+\alpha(\lambda_{l}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}
=l=1k1λl(1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2+l=kdλl(1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2\displaystyle=\sum_{l=1}^{k-1}\lambda_{l}(1+\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}+\sum_{l=k}^{d}\lambda_{l}(1+\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}
+l=1k1λl(1α(𝐱k(t))T𝐂𝐱k(t))2(zk,l(t))2l=1k1λl(1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2\displaystyle+\sum_{l=1}^{k-1}\lambda_{l}(1-\alpha(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}({z}_{k,l}^{(t)})^{2}-\sum_{l=1}^{k-1}\lambda_{l}(1+\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}
+l=k+1dλl(1+α(λl(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2l=k+1dλl(1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2\displaystyle+\sum_{l=k+1}^{d}\lambda_{l}(1+\alpha(\lambda_{l}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}-\sum_{l=k+1}^{d}\lambda_{l}(1+\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}
=l=1dλl(1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2+P(t)\displaystyle=\sum_{l=1}^{d}\lambda_{l}(1+\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}+P^{(t)}
=(1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2(𝐱k(t))T𝐂𝐱k(t)+P(t),\displaystyle=(1+\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}+P^{(t)},

where

P(t)\displaystyle P^{(t)} =l=1k1λl(1α(𝐱k(t))T𝐂𝐱k(t))2(zk,l(t))2l=1k1λl(1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2\displaystyle=\sum_{l=1}^{k-1}\lambda_{l}(1-\alpha(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}({z}_{k,l}^{(t)})^{2}-\sum_{l=1}^{k-1}\lambda_{l}(1+\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}
+l=k+1dλl(1+α(λl(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2l=k+1dλl(1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2(zk,l(t))2.\displaystyle+\sum_{l=k+1}^{d}\lambda_{l}(1+\alpha(\lambda_{l}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}-\sum_{l=k+1}^{d}\lambda_{l}(1+\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({z}_{k,l}^{(t)})^{2}.

Now,

λk(𝐱k(t+1))T𝐂𝐱k(t+1)\displaystyle\lambda_{k}-({\mathbf{x}}_{k}^{(t+1)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t+1)} =λk(1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2(𝐱k(t))T𝐂𝐱k(t)P(t)\displaystyle=\lambda_{k}-(1+\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}-P^{(t)}
=λk(1+α2(λk(𝐱k(t))T𝐂𝐱k(t))2+2α(λk(𝐱k(t))T𝐂𝐱k(t)))(𝐱k(t))T𝐂𝐱k(t)P(t)\displaystyle=\lambda_{k}-(1+\alpha^{2}(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}+2\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}-P^{(t)}
=λk(𝐱k(t))T𝐂𝐱k(t)(α2(λk(𝐱k(t))T𝐂𝐱k(t))2+2α(λk(𝐱k(t))T𝐂𝐱k(t)))(𝐱k(t))T𝐂𝐱k(t)P(t)\displaystyle=\lambda_{k}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}-(\alpha^{2}(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}+2\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}-P^{(t)}
=λk(𝐱k(t))T𝐂𝐱k(t)(λk(𝐱k(t))T𝐂𝐱k(t))(α2(λk(𝐱k(t))T𝐂𝐱k(t))+2α)(𝐱k(t))T𝐂𝐱k(t)P(t)\displaystyle=\lambda_{k}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}-(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})(\alpha^{2}(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})+2\alpha)({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}-P^{(t)}
=(λk(𝐱k(t))T𝐂𝐱k(t))(1α2(λk(𝐱k(t))T𝐂𝐱k(t))(𝐱k(t))T𝐂𝐱k(t)2α(𝐱k(t))T𝐂𝐱k(t))P(t).\displaystyle=(\lambda_{k}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})(1-\alpha^{2}(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}-2\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})-P^{(t)}.

Let us denote V(t)=|λk(𝐱k(t))T𝐂𝐱k(t)|V^{(t)}=|\lambda_{k}-({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}|. Then,

V(t+1)\displaystyle V^{(t+1)} \displaystyle\leq V(t)|1α2(λk(𝐱k(t))T𝐂𝐱k(t))(𝐱k(t))T𝐂𝐱k(t)2α(𝐱k(t))T𝐂𝐱k(t)|+|P(t)|\displaystyle V^{(t)}|1-\alpha^{2}(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}-2\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}|+|P^{(t)}|
\displaystyle\leq V(t)max{(1α(𝐱k(t))T𝐂𝐱k(t))2,α2λk(𝐱k(t))T𝐂𝐱k(t)}+|P(t)|.\displaystyle V^{(t)}\max\{(1-\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)})^{2},\alpha^{2}\lambda_{k}({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}\}+|P^{(t)}|.

Also from (34) and (37), 0<\alpha\eta<\alpha({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}<1\text{ and }\alpha^{2}\lambda_{k}({\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}{\mathbf{x}}_{k}^{(t)}<\alpha\lambda_{k}. Denote \delta=\max\{(1-\alpha\eta)^{2},\alpha\lambda_{k}\}. Since \alpha\lambda_{k}<\alpha\lambda_{1}<1, we have 0<\delta<1. Thus,

V(t+1)δV(t)+|P(t)|.V^{(t+1)}\leq\delta V^{(t)}+|P^{(t)}|. (47)

Next, we bound |P(t)||P^{(t)}| as follows:

|P(t)|\displaystyle|P^{(t)}| =|l=1k1λl((1α(𝐱k(t))T𝐂𝐱k(t))2(1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2)(zk,l(t))2\displaystyle=|\sum_{l=1}^{k-1}\lambda_{l}((1-\alpha(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})^{2}-(1+\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2})({z}_{k,l}^{(t)})^{2}
+l=k+1dλl((1+α(λl(𝐱k(t))T𝐂𝐱k(t)))2(1+α(λk(𝐱k(t))T𝐂𝐱k(t)))2)(zk,l(t))2|\displaystyle+\sum_{l=k+1}^{d}\lambda_{l}((1+\alpha(\lambda_{l}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2}-(1+\alpha(\lambda_{k}-(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)}))^{2})({z}_{k,l}^{(t)})^{2}|
=|l=1k1λl(αλk)(2+αλk2α(𝐱k(t))T𝐂𝐱k(t))(zk,l(t))2\displaystyle=|\sum_{l=1}^{k-1}\lambda_{l}(-\alpha\lambda_{k})(2+\alpha\lambda_{k}-2\alpha(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})({z}_{k,l}^{(t)})^{2}
+l=k+1dλlα(λlλk)(2+α(λk+λl)2α(𝐱k(t))T𝐂𝐱k(t))(zk,l(t))2|\displaystyle+\sum_{l=k+1}^{d}\lambda_{l}\alpha(\lambda_{l}-\lambda_{k})(2+\alpha(\lambda_{k}+\lambda_{l})-2\alpha(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})({z}_{k,l}^{(t)})^{2}|
l=1k1λl|(αλk)(2+αλk2α(𝐱k(t))T𝐂𝐱k(t))(zk,l(t))2|\displaystyle\leq\sum_{l=1}^{k-1}\lambda_{l}|(-\alpha\lambda_{k})(2+\alpha\lambda_{k}-2\alpha(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})({z}_{k,l}^{(t)})^{2}|
+l=k+1dλl|α(λlλk)(2+α(λk+λl)2α(𝐱k(t))T𝐂𝐱k(t))(zk,l(t))2|\displaystyle+\sum_{l=k+1}^{d}\lambda_{l}|\alpha(\lambda_{l}-\lambda_{k})(2+\alpha(\lambda_{k}+\lambda_{l})-2\alpha(\mathbf{x}_{k}^{(t)})^{T}\mathbf{C}\mathbf{x}_{k}^{(t)})({z}_{k,l}^{(t)})^{2}|
l=1k1λlαλk(2+αλk)(zk,l(t))2+l=k+1dλlα(λkλl)(2+α(λk+λl))(zk,l(t))2\displaystyle\leq\sum_{l=1}^{k-1}\lambda_{l}\alpha\lambda_{k}(2+\alpha\lambda_{k})({z}_{k,l}^{(t)})^{2}+\sum_{l=k+1}^{d}\lambda_{l}\alpha(\lambda_{k}-\lambda_{l})(2+\alpha(\lambda_{k}+\lambda_{l}))({z}_{k,l}^{(t)})^{2}
<l=1k1λlαλk(2+αλk)(zk,l(t))2+l=k+1dλl(2αλk+α2λk2)(zk,l(t))2\displaystyle<\sum_{l=1}^{k-1}\lambda_{l}\alpha\lambda_{k}(2+\alpha\lambda_{k})({z}_{k,l}^{(t)})^{2}+\sum_{l=k+1}^{d}\lambda_{l}(2\alpha\lambda_{k}+\alpha^{2}\lambda_{k}^{2})({z}_{k,l}^{(t)})^{2}
<l=1k13λ1(zk,l(t))2+l=k+1d3λ1(zk,l(t))2,since αλk<1 and λl<λ1\displaystyle<\sum_{l=1}^{k-1}3\lambda_{1}({z}_{k,l}^{(t)})^{2}+\sum_{l=k+1}^{d}3\lambda_{1}({z}_{k,l}^{(t)})^{2},\quad\text{since $\alpha\lambda_{k}<1$ and $\lambda_{l}<\lambda_{1}$}
=3λ1(l=1k1(zk,l(t))2+l=k+1d(zk,l(t))2)<3λ1(a1γt+a2ρkt)using Lemma 5 and 6\displaystyle=3\lambda_{1}(\sum_{l=1}^{k-1}({z}_{k,l}^{(t)})^{2}+\sum_{l=k+1}^{d}({z}_{k,l}^{(t)})^{2})<3\lambda_{1}(a_{1}\gamma^{t}+a_{2}\rho_{k}^{t})\quad\text{using Lemma\leavevmode\nobreak\ \ref{lemma:coeff_decay_lower} and \leavevmode\nobreak\ \ref{lemma:coeff_decay_upper}}
a3γ1t,wherea3=max{3λ1a1,3λ1a2}andγ1=max{γ,ρk}.\displaystyle\leq a_{3}\gamma_{1}^{t},\quad\text{where}\quad a_{3}=\max\{3\lambda_{1}a_{1},3\lambda_{1}a_{2}\}\quad\text{and}\quad\gamma_{1}=\max\{\gamma,\rho_{k}\}.

So from (47), we have V(t+1)δV(t)+a3γ1tδt+1V(0)+a3r=0t(δγ11)rγ1tV^{(t+1)}\leq\delta V^{(t)}+a_{3}\gamma_{1}^{t}\leq\delta^{t+1}V^{(0)}+a_{3}\sum_{r=0}^{t}(\delta\gamma_{1}^{-1})^{r}\gamma_{1}^{t}. Since γ1,δ<1\gamma_{1},\delta<1, we have the following two cases:

  1. \delta\leq\gamma_{1}\implies\delta\gamma_{1}^{-1}\leq 1. Then, \sum_{r=0}^{t}(\delta\gamma_{1}^{-1})^{r}\gamma_{1}^{t}\leq\sum_{r=0}^{t}\gamma_{1}^{t}=(t+1)\gamma_{1}^{t}.

  2. \delta>\gamma_{1}. Then \sum_{r=0}^{t}(\delta\gamma_{1}^{-1})^{r}\gamma_{1}^{t}=\gamma_{1}^{t}+\delta\gamma_{1}^{t-1}+\cdots+\delta^{t}\leq\delta^{t}+\cdots+\delta^{t}=(t+1)\delta^{t}.

Thus,

\displaystyle V^{(t+1)}\leq\delta^{t+1}V^{(0)}+(t+1)a_{3}\max\{\delta^{t},\gamma_{1}^{t}\}\leq(t+1)a_{4}(\delta^{t+1}+\max\{\delta^{t},\gamma_{1}^{t}\}),

where a4=max{V(0),a3}a_{4}=\max\{V^{(0)},a_{3}\}. ∎
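The two inequalities that drive this proof, namely V^{(t+1)}\leq\delta V^{(t)}+|P^{(t)}| and |P^{(t)}|\leq 3\lambda_{1}\sum_{l\neq k}(z_{k,l}^{(t)})^{2}, can also be checked numerically along recursions (13) and (14). The sketch below is illustrative only; as before, the spectrum, step size, and initialization are arbitrary, and the smallest observed Rayleigh quotient stands in for \eta.

```python
# Illustrative check of the two inequalities driving Lemma 7 along recursions
# (13)-(14): V^(t+1) <= delta * V^(t) + |P^(t)| with
# delta = max{(1 - alpha*eta)^2, alpha*lambda_k}, and |P^(t)| <= 3*lambda_1 times
# the energy outside coordinate k.  eta is replaced by the smallest observed
# Rayleigh quotient; spectrum, step size, and initialization are arbitrary.
import numpy as np

rng = np.random.default_rng(4)
d, k, T = 8, 3, 200
lam = np.sort(rng.uniform(0.5, 2.0, size=d))[::-1]
alpha = 0.3 / lam[0]

z = rng.standard_normal(d)
z /= np.linalg.norm(z)

zs, qs = [], []
for t in range(T):
    zs.append(z.copy())
    qs.append(float(np.sum(lam * z**2)))
    z = np.where(np.arange(d) < k - 1,
                 (1 - alpha * qs[-1]) * z,
                 (1 + alpha * (lam - qs[-1])) * z)
qs.append(float(np.sum(lam * z**2)))

delta = max((1 - alpha * min(qs))**2, alpha * lam[k - 1])
for t in range(T):
    V_t, V_next = abs(lam[k - 1] - qs[t]), abs(lam[k - 1] - qs[t + 1])
    P_t = qs[t + 1] - (1 + alpha * (lam[k - 1] - qs[t]))**2 * qs[t]   # identity from the proof
    off_k = float(np.sum(zs[t]**2) - zs[t][k - 1]**2)
    assert V_next <= delta * V_t + abs(P_t) + 1e-12
    assert abs(P_t) <= 3 * lam[0] * off_k + 1e-12

print("final eigenvalue-estimate error:", abs(lam[k - 1] - qs[-1]))
```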

Appendix E Statement and Proof of Lemma 8

Lemma 8.

The deviation of an iterate at a node from the average is bounded from above as

𝐱i,k(t)𝐱¯k(t)bk(βt+α1β),k=1,,K,\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|\leq b_{k}(\beta^{t}+\frac{\alpha}{1-\beta}\big{)},\forall k=1,\ldots,K, (48)

where \beta=\max\{|\lambda_{2}(\mathbf{W})|,|\lambda_{M}(\mathbf{W})|\}<1 is the second-largest eigenvalue magnitude of \mathbf{W} and b_{k}>0 is some constant.

Proof.

We stack the iterates 𝐱i,k(t)\mathbf{x}_{i,k}^{(t)} and 𝓗i(𝐱i,k(t))\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)}) as

𝐱k(t)=[𝐱1,k(t)𝐱2,k(t)𝐱M,k(t)]Md𝓗(𝐱k(t))=[𝓗1(𝐱1,k(t))𝓗2(𝐱2,k(t))𝓗M(𝐱M,k(t))]Md𝐱avg,k(t)=[𝐱¯k(t)𝐱¯k(t)𝐱¯k(t)]Md.\mathbf{x}_{k}^{(t)}=\begin{bmatrix}\mathbf{x}_{1,k}^{(t)}\\ \mathbf{x}_{2,k}^{(t)}\\ \vdots\\ \mathbf{x}_{M,k}^{(t)}\\ \end{bmatrix}\in\mathbb{R}^{Md}\quad\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(t)})=\begin{bmatrix}\boldsymbol{\mathcal{H}}_{1}(\mathbf{x}_{1,k}^{(t)})\\ \boldsymbol{\mathcal{H}}_{2}(\mathbf{x}_{2,k}^{(t)})\\ \vdots\\ \boldsymbol{\mathcal{H}}_{M}(\mathbf{x}_{M,k}^{(t)})\\ \end{bmatrix}\in\mathbb{R}^{Md}\quad{\mathbf{x}}_{avg,k}^{(t)}=\begin{bmatrix}\bar{\mathbf{x}}_{k}^{(t)}\\ \bar{\mathbf{x}}_{k}^{(t)}\\ \vdots\\ \bar{\mathbf{x}}_{k}^{(t)}\\ \end{bmatrix}\in\mathbb{R}^{Md}.

The next network-wide iterate (as a stacked vector) can then be written as 𝐱k(t)=(𝐖𝐈)𝐱k(t1)+α𝓗(𝐱k(t1))\mathbf{x}_{k}^{(t)}=(\mathbf{W}\otimes\mathbf{I})\mathbf{x}_{k}^{(t-1)}+\alpha\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(t-1)}), where \otimes denotes the Kronecker product. The ttht^{th} iterate can thus be written as

𝐱k(t)=(𝐖t𝐈)𝐱k(0)+αs=0t1(𝐖t1s𝐈)𝓗(𝐱k(s)).\mathbf{x}_{k}^{(t)}=(\mathbf{W}^{t}\otimes\mathbf{I})\mathbf{x}_{k}^{(0)}+\alpha\sum_{s=0}^{t-1}(\mathbf{W}^{t-1-s}\otimes\mathbf{I})\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(s)}).

Since \mathbf{W}=[w_{ij}] is a symmetric and doubly stochastic mixing matrix, its largest eigenvalue is 1 with corresponding eigenvector \mathbf{1}_{M}, the column vector of all 1's, which is also a left eigenvector of \mathbf{W}; that is, \mathbf{W}\mathbf{1}_{M}=\mathbf{1}_{M}\text{ and }\mathbf{1}_{M}^{T}\mathbf{W}=\mathbf{1}_{M}^{T}. Also, since the squared norm of Sanger's direction at every node is bounded (Lemma 4), it is easy to see that \|\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(t)})\|^{2}\leq 3M\lambda_{1}^{2}(3k-2)(3k+1). Now,

𝐱i,k(t)𝐱¯k(t)𝐱k(t)𝐱avg,k(t)=𝐱k(t)1M((𝟏M𝟏MT)𝐈)𝐱k(t)\displaystyle\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|\leq\|\mathbf{x}_{k}^{(t)}-{\mathbf{x}}_{avg,k}^{(t)}\|=\|\mathbf{x}_{k}^{(t)}-\frac{1}{M}((\mathbf{1}_{M}\mathbf{1}_{M}^{T})\otimes\mathbf{I})\mathbf{x}_{k}^{(t)}\|
=(𝐖t𝐈)𝐱k(0)+αs=0t1(𝐖t1s𝐈)𝓗(𝐱k(s))1M((𝟏M𝟏MT)𝐈)((𝐖t𝐈)𝐱k(0)+αs=0t1(𝐖t1s𝐈)𝓗(𝐱k(s)))\displaystyle=\|(\mathbf{W}^{t}\otimes\mathbf{I})\mathbf{x}_{k}^{(0)}+\alpha\sum_{s=0}^{t-1}(\mathbf{W}^{t-1-s}\otimes\mathbf{I})\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(s)})-\frac{1}{M}((\mathbf{1}_{M}\mathbf{1}_{M}^{T})\otimes\mathbf{I})((\mathbf{W}^{t}\otimes\mathbf{I})\mathbf{x}_{k}^{(0)}+\alpha\sum_{s=0}^{t-1}(\mathbf{W}^{t-1-s}\otimes\mathbf{I})\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(s)}))\|
=(𝐖t𝐈)𝐱k(0)+αs=0t1(𝐖t1s𝐈)𝓗(𝐱k(s))1M((𝟏M𝟏MT)𝐈)𝐱k(0)αs=0t1(1M((𝟏M𝟏MT)𝐈)𝓗(𝐱k(s)))\displaystyle=\|(\mathbf{W}^{t}\otimes\mathbf{I})\mathbf{x}_{k}^{(0)}+\alpha\sum_{s=0}^{t-1}(\mathbf{W}^{t-1-s}\otimes\mathbf{I})\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(s)})-\frac{1}{M}((\mathbf{1}_{M}\mathbf{1}_{M}^{T})\otimes\mathbf{I})\mathbf{x}_{k}^{(0)}-\alpha\sum_{s=0}^{t-1}(\frac{1}{M}((\mathbf{1}_{M}\mathbf{1}_{M}^{T})\otimes\mathbf{I})\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(s)}))\|
=((𝐖t1M(𝟏M𝟏MT))𝐈)𝐱k(0)+αs=0t1((𝐖t1s1M(𝟏M𝟏MT))𝐈)𝓗(𝐱k(s))\displaystyle=\|((\mathbf{W}^{t}-\frac{1}{M}(\mathbf{1}_{M}\mathbf{1}_{M}^{T}))\otimes\mathbf{I})\mathbf{x}_{k}^{(0)}+\alpha\sum_{s=0}^{t-1}((\mathbf{W}^{t-1-s}-\frac{1}{M}(\mathbf{1}_{M}\mathbf{1}_{M}^{T}))\otimes\mathbf{I})\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(s)})\|
((𝐖t1M(𝟏M𝟏MT))𝐈)𝐱k(0)+αs=0t1((𝐖t1s1M(𝟏M𝟏MT))𝐈)𝓗(𝐱k(s))\displaystyle\leq\|((\mathbf{W}^{t}-\frac{1}{M}(\mathbf{1}_{M}\mathbf{1}_{M}^{T}))\otimes\mathbf{I})\mathbf{x}_{k}^{(0)}\|+\alpha\|\sum_{s=0}^{t-1}((\mathbf{W}^{t-1-s}-\frac{1}{M}(\mathbf{1}_{M}\mathbf{1}_{M}^{T}))\otimes\mathbf{I})\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(s)})\|
((𝐖t1M(𝟏M𝟏MT))𝐈)𝐱k(0)+αs=0t1((𝐖t1s1M(𝟏M𝟏MT))𝐈)𝓗(𝐱k(s))\displaystyle\leq\|((\mathbf{W}^{t}-\frac{1}{M}(\mathbf{1}_{M}\mathbf{1}_{M}^{T}))\otimes\mathbf{I})\|\|\mathbf{x}_{k}^{(0)}\|+\alpha\sum_{s=0}^{t-1}\|((\mathbf{W}^{t-1-s}-\frac{1}{M}(\mathbf{1}_{M}\mathbf{1}_{M}^{T}))\otimes\mathbf{I})\|\|\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(s)})\|
=βt𝐱k(0)+αs=0t1βt1s𝓗(𝐱k(s))βt3M+α3Mλ12(3k2)(3k+1)s=0t1βt1s\displaystyle=\beta^{t}\|\mathbf{x}_{k}^{(0)}\|+\alpha\sum_{s=0}^{t-1}\beta^{t-1-s}\|\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(s)})\|\leq\beta^{t}\sqrt{3M}+\alpha\sqrt{3M\lambda_{1}^{2}(3k-2)(3k+1)}\sum_{s=0}^{t-1}\beta^{t-1-s}
βt3M+α3Mλ12(3k2)(3k+1)1β\displaystyle\leq\beta^{t}\sqrt{3M}+\frac{\alpha\sqrt{3M\lambda_{1}^{2}(3k-2)(3k+1)}}{1-\beta}
3Mλ1(3k2)(3k+1)(βt+α1β)=bk(βt+α1β),wherebk=λ13M(3k2)(3k+1).\displaystyle\leq\sqrt{3M}\lambda_{1}\sqrt{(3k-2)(3k+1)}\big{(}\beta^{t}+\frac{\alpha}{1-\beta}\big{)}=b_{k}\big{(}\beta^{t}+\frac{\alpha}{1-\beta}\big{)},\quad\text{where}\quad b_{k}=\lambda_{1}\sqrt{3M}\sqrt{(3k-2)(3k+1)}.
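Lemma 8 can be illustrated end to end by running the stacked recursion \mathbf{x}_{k}^{(t)}=(\mathbf{W}\otimes\mathbf{I})\mathbf{x}_{k}^{(t-1)}+\alpha\boldsymbol{\mathcal{H}}(\mathbf{x}_{k}^{(t-1)}) on a small synthetic network and comparing the per-node deviation from the network average against b_{k}(\beta^{t}+\frac{\alpha}{1-\beta}). The sketch below is illustrative only, with arbitrary assumed choices: k=K=1, a ring graph with lazy Metropolis weights, random local covariances, and \lambda_{1} taken as the largest local dominant eigenvalue.

```python
# Illustrative end-to-end check of Lemma 8 for k = K = 1: run
# x^(t) = (W kron I) x^(t-1) + alpha * H(x^(t-1)) on a ring of M nodes and verify
# max_i ||x_i^(t) - xbar^(t)|| <= b_k * (beta^t + alpha / (1 - beta)).
# Network, data, and horizon are arbitrary choices.
import numpy as np

rng = np.random.default_rng(5)
M, d, k = 6, 10, 1

W = 0.5 * np.eye(M)
for i in range(M):
    W[i, (i + 1) % M] += 0.25
    W[i, (i - 1) % M] += 0.25
beta = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]       # second-largest eigenvalue magnitude

C = []
for _ in range(M):
    A = rng.standard_normal((d, 30))
    C.append(A @ A.T / 30)
lam1 = max(np.linalg.eigvalsh(Ci)[-1] for Ci in C)

alpha = min(W.diagonal().min(), 0.225) / (3 * lam1 * (2 * k - 1))
b_k = lam1 * np.sqrt(3 * M) * np.sqrt((3 * k - 2) * (3 * k + 1))

X = rng.standard_normal((M, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
for t in range(1, 301):
    H = np.stack([C[i] @ X[i] - (X[i] @ C[i] @ X[i]) * X[i] for i in range(M)])
    X = W @ X + alpha * H
    dev = np.linalg.norm(X - X.mean(axis=0), axis=1).max()
    assert dev <= b_k * (beta**t + alpha / (1 - beta)) + 1e-12

print("final maximum deviation from the network average:", dev)
```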

Appendix F Statement and Proof of Lemma 9

Lemma 9.

Suppose \|\mathbf{x}_{i,k}^{(t)}\|^{2}\leq 3 and \|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|\leq b_{k}(\beta^{t}+\frac{\alpha}{1-\beta}). Then the following is true \forall k=1,\ldots,K:

𝐡k(t)3(k+2)λ1bk(βt+α1β).\displaystyle\|\mathbf{h}_{k}^{(t)}\|\leq 3(k+2)\lambda_{1}b_{k}(\beta^{t}+\frac{\alpha}{1-\beta}). (49)
Proof.

We have

𝓗i(𝐱i,k(t))𝓗i(𝐱¯k(t))\displaystyle\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})-\boldsymbol{\mathcal{H}}_{i}(\bar{\mathbf{x}}_{k}^{(t)})
=𝐂i(𝐱i,k(t)𝐱¯k(t))(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐱i,k(t)+(𝐱¯k(t))T𝐂i𝐱¯k(t)𝐱¯k(t)p=1k1(𝐱i,p(t)(𝐱i,p(t))T𝐂i𝐱i,k(t)𝐱i,p(t)(𝐱i,p(t))T𝐂i𝐱¯k(t))\displaystyle=\mathbf{C}_{i}(\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)})-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{x}_{i,k}^{(t)}+(\bar{\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}_{i}\bar{\mathbf{x}}_{k}^{(t)}\bar{\mathbf{x}}_{k}^{(t)}-\sum_{p=1}^{k-1}({\mathbf{x}}_{i,p}^{(t)}({\mathbf{x}}_{i,p}^{(t)})^{T}\mathbf{C}_{i}{\mathbf{x}}_{i,k}^{(t)}-{\mathbf{x}}_{i,p}^{(t)}({\mathbf{x}}_{i,p}^{(t)})^{T}\mathbf{C}_{i}\bar{\mathbf{x}}_{k}^{(t)})
=(𝐂i(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐈)(𝐱i,k(t)𝐱¯k(t))((𝐱i,k(t)+𝐱¯k(t))T𝐂i(𝐱i,k(t)𝐱¯k(t)))𝐱¯k(t)p=1k1𝐱i,p(t)(𝐱i,p(t))T𝐂i(𝐱i,k(t)𝐱¯k(t))\displaystyle=(\mathbf{C}_{i}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{I})(\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)})-((\mathbf{x}_{i,k}^{(t)}+\bar{\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}_{i}(\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}))\bar{\mathbf{x}}_{k}^{(t)}-\sum_{p=1}^{k-1}{\mathbf{x}}_{i,p}^{(t)}({\mathbf{x}}_{i,p}^{(t)})^{T}\mathbf{C}_{i}(\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)})
𝓗i(𝐱i,k(t))𝓗i(𝐱¯k(t))\displaystyle\|\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})-\boldsymbol{\mathcal{H}}_{i}(\bar{\mathbf{x}}_{k}^{(t)})\|
𝐂i(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐈𝐱i,k(t)𝐱¯k(t)+|(𝐱i,k(t)+𝐱¯k(t))T𝐂i(𝐱i,k(t)𝐱¯k(t))|𝐱¯k(t)+p=1k1𝐱i,p(t)(𝐱i,p(t))T𝐂i(𝐱i,k(t)𝐱¯k(t))\displaystyle\leq\|\mathbf{C}_{i}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{I}\|\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|+|(\mathbf{x}_{i,k}^{(t)}+\bar{\mathbf{x}}_{k}^{(t)})^{T}\mathbf{C}_{i}(\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)})|\|\bar{\mathbf{x}}_{k}^{(t)}\|+\sum_{p=1}^{k-1}\|{\mathbf{x}}_{i,p}^{(t)}({\mathbf{x}}_{i,p}^{(t)})^{T}\mathbf{C}_{i}(\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)})\|
𝐂i(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐈𝐱i,k(t)𝐱¯k(t)+𝐱i,k(t)+𝐱¯k(t)𝐂i𝐱i,k(t)𝐱¯k(t)𝐱¯k(t)+p=1k1𝐱i,p(t)(𝐱i,p(t))T𝐂i𝐱i,k(t)𝐱¯k(t)\displaystyle\leq\|\mathbf{C}_{i}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{I}\|\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|+\|\mathbf{x}_{i,k}^{(t)}+\bar{\mathbf{x}}_{k}^{(t)}\|\|\mathbf{C}_{i}\|\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|\|\bar{\mathbf{x}}_{k}^{(t)}\|+\sum_{p=1}^{k-1}\|{\mathbf{x}}_{i,p}^{(t)}({\mathbf{x}}_{i,p}^{(t)})^{T}\mathbf{C}_{i}\|\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|
𝐂i(𝐱i,k(t))T𝐂i𝐱i,k(t)𝐈𝐱i,k(t)𝐱¯k(t)+𝐱¯k(t)(𝐱i,k(t)+𝐱¯k(t))𝐂i𝐱i,k(t)𝐱¯k(t)+p=1k1𝐱i,p(t)(𝐱i,p(t))T𝐂i𝐱i,k(t)𝐱¯k(t)\displaystyle\leq\|\mathbf{C}_{i}-(\mathbf{x}_{i,k}^{(t)})^{T}\mathbf{C}_{i}\mathbf{x}_{i,k}^{(t)}\mathbf{I}\|\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|+\|\bar{\mathbf{x}}_{k}^{(t)}\|(\|\mathbf{x}_{i,k}^{(t)}\|+\|\bar{\mathbf{x}}_{k}^{(t)}\|)\|\mathbf{C}_{i}\|\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|+\sum_{p=1}^{k-1}\|{\mathbf{x}}_{i,p}^{(t)}({\mathbf{x}}_{i,p}^{(t)})^{T}\|\|\mathbf{C}_{i}\|\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|
3λ1𝐱i,k(t)𝐱¯k(t)+3(23)λ1𝐱i,k(t)𝐱¯k(t)+p=1k13λ1𝐱i,k(t)𝐱¯k(t)\displaystyle\leq 3\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|+\sqrt{3}(2\sqrt{3})\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|+\sum_{p=1}^{k-1}3\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|
=3(k+2)λ1𝐱i,k(t)𝐱¯k(t)3(k+2)λ1bk(βt+α1β).\displaystyle=3(k+2)\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|\leq 3(k+2)\lambda_{1}b_{k}(\beta^{t}+\frac{\alpha}{1-\beta}).

Thus,

𝐡k(t)\displaystyle\|\mathbf{h}_{k}^{(t)}\| 1Mi=1M𝓗i(𝐱i,k(t))𝓗i(𝐱¯k(t))3(k+2)λ1bk(βt+α1β).\displaystyle\leq\frac{1}{M}\sum_{i=1}^{M}\|\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})-\boldsymbol{\mathcal{H}}_{i}(\bar{\mathbf{x}}_{k}^{(t)})\|\leq 3(k+2)\lambda_{1}b_{k}(\beta^{t}+\frac{\alpha}{1-\beta}). (50)
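The intermediate inequality above, \|\boldsymbol{\mathcal{H}}_{i}(\mathbf{x}_{i,k}^{(t)})-\boldsymbol{\mathcal{H}}_{i}(\bar{\mathbf{x}}_{k}^{(t)})\|\leq 3(k+2)\lambda_{1}\|\mathbf{x}_{i,k}^{(t)}-\bar{\mathbf{x}}_{k}^{(t)}\|, acts as a Lipschitz-type bound and is easy to probe numerically. The sketch below is illustrative only, using a random covariance and random vectors with squared norm at most 3.

```python
# Illustrative check of the Lipschitz-type bound used in Lemma 9:
# ||H_i(x) - H_i(xbar)|| <= 3*(k+2)*lambda_1 * ||x - xbar|| whenever all
# vectors involved have squared norm at most 3.  Sizes are arbitrary choices.
import numpy as np

rng = np.random.default_rng(6)
d, k, trials = 10, 3, 2000

A = rng.standard_normal((d, 40))
Ci = A @ A.T / 40
lam1 = np.linalg.eigvalsh(Ci)[-1]

def sanger(x, Xp):
    """Sanger's direction H_i(x) given the previous iterates stacked in Xp."""
    return Ci @ x - (x @ Ci @ x) * x - Xp.T @ (Xp @ (Ci @ x))

worst_ratio = 0.0
for _ in range(trials):
    V = rng.standard_normal((k + 1, d))
    V *= np.sqrt(3) * rng.uniform(size=(k + 1, 1)) / np.linalg.norm(V, axis=1, keepdims=True)
    Xp, x, xbar = V[:k - 1], V[k - 1], V[k]
    lhs = np.linalg.norm(sanger(x, Xp) - sanger(xbar, Xp))
    rhs = 3 * (k + 2) * lam1 * np.linalg.norm(x - xbar)
    worst_ratio = max(worst_ratio, float(lhs / rhs))

print("largest observed ratio lhs/rhs:", worst_ratio)
assert worst_ratio <= 1.0
```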
