Distributed Banach-Picard Iteration: Application to Distributed EM and Distributed PCA
Abstract
In recent work, we proposed a distributed Banach-Picard iteration (DBPI) that allows a set of agents, linked by a communication network, to find a fixed point of a locally contractive (LC) map that is the average of individual maps held by said agents. In this work, we build upon the DBPI and its local linear convergence (LLC) guarantees to make several contributions. We show that Sanger’s algorithm for principal component analysis (PCA) corresponds to the iteration of an LC map that can be written as the average of local maps, each map known only to the agent holding the corresponding subset of the data. Similarly, we show that a variant of the expectation-maximization (EM) algorithm for parameter estimation from noisy and faulty measurements in a sensor network can be written as the iteration of an LC map that is the average of local maps, each available at just one node. Consequently, via the DBPI, we derive two distributed algorithms – distributed EM and distributed PCA – whose LLC guarantees follow from those that we proved for the DBPI. The verification of the LC condition for EM is challenging, as the underlying operator depends on random samples; thus, the LC condition is probabilistic in nature.
Index Terms:
Distributed Computation, Banach-Picard Iteration, Fixed Points, Distributed EM, Distributed PCA, Consensus.

I Introduction
Parameter estimation from noisy data and dimensionality reduction are two of the most fundamental tasks in signal processing and data analysis. In many scenarios, such as sensor networks and IoT, the underlying data is distributed among a collection of agents that cooperate to jointly solve the problem, i.e., find a consensus solution without sharing the data or sending it to a central unit [1]. This paper addresses the two problems mentioned above in a distributed setting, proposing and analysing two algorithms that are instances of the distributed Banach-Picard iteration (DBPI), which we have recently introduced and proved to enjoy local linear convergence [2]. In this introductory section, after presenting the formulations and motivations of the two problems considered, we review the DBPI and summarize our contributions and the tools that are used for the convergence proofs.
I-A Problem Statement: Dimensionality Reduction
Dimensionality reduction aims at representing high-dimensional data in a lower dimensional space, which can be crucial to reduce the computational complexity of manipulating and processing this data, and is a core task in modern data analysis, machine learning, and related areas. The standard linear dimensionality reduction tool is principal component analysis (PCA), which allows expressing a high-dimensional dataset in the basis formed by the top eigenvectors of its sample covariance matrix. PCA first appeared in the statistics community at the beginning of the 20th century [3] and became one of the workhorses of statistical data analysis, with dimensionality reduction being a notable application. Nowadays, with the ever-increasing collection of data by spatially dispersed agents, developing algorithms for distributed PCA constitutes a relevant area of research – see, e.g., [4, 5, 6, 7, 8, 9, 10, 11, 12] (master-slave communication architecture), and [13, 14, 15, 16, 17, 18, 19, 20] (arbitrarily meshed network communication architecture). For a recent and comprehensive review on these works, see, e.g., [21]; for a very recent work see [22].
In the (arbitrarily meshed network) distributed setting, consider a set of agents linked by an undirected and connected communication graph; the nodes are the agents, and the edges represent the communication channels between the agents. Each agent holds a finite set of points in , , and the agents seek to collectively find the top eigenvectors (i.e., the eigenvectors associated with the largest eigenvalues) of
(assumed to be positive definite, i.e., ), where , i.e., the sum of the cardinalities of each , and
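Since the displayed equations are not reproduced in this version, the following NumPy sketch uses symbol names of our own choosing; it illustrates the object the agents seek: the top-m eigenvectors of the (assumed positive definite) sample covariance of the pooled data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 500, 5, 2
# Synthetic pooled dataset with a dominant 2-dimensional subspace
# (the column scales play the role of decreasing standard deviations).
Y = rng.standard_normal((N, d)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

C = Y.T @ Y / N                  # sample covariance (data assumed zero-mean)
vals, vecs = np.linalg.eigh(C)   # eigenvalues in ascending order
X = vecs[:, ::-1][:, :m]         # top-m eigenvectors, one per column
```

In the distributed setting, of course, C is never formed centrally: each agent only holds the points in its own subset, and the algorithms discussed below reach the same subspace by iterating on local quantities.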
I-B Problem Statement: Distributed Parameter Estimation with Noisy and Faulty Measurements
Consider a collection of spatially distributed sensors monitoring the environment, a common scenario for information processing or decision making tasks; see, e.g., [23, 24, 25, 26, 27, 28, 29, 30]. Often, these sensors communicate wirelessly, maybe in a harsh environment, which may result in faulty communications or sensor malfunctions [31]. The setup is modelled as follows: agents, linked by an undirected and connected communication graph, each holding an independent observation given by
(1)
In (1), is a fixed and unknown parameter, each is assumed to be known only at agent , are independent and identically distributed (i.i.d.) zero-mean Gaussian random variables with variance , and are i.i.d. Bernoulli random variables () with . This formulation models a scenario where sensor measures the parameter with probability and, with probability , it senses only noise, indicating a transducer failure [31]. The agents seek to collectively estimate , treating and as nuisance (or latent) parameters. Observe that if the binary variables were not random, but fixed and known, the model could be regarded as a (distributed) linear regression problem. However, the randomness of these variables, which accounts for potential sensor failures, introduces an extra layer of difficulty.
A decentralized algorithm, rather than one where each sensor sends its data to a central node, is potentially more robust to faulty wireless communications that may render a sensor useless. Moreover, a decentralized algorithm can yield considerable energy savings [23], a very desirable feature.
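The measurement model (1) can be simulated as in the sketch below; the symbol names (theta_true, A, h, w) are ours, chosen to match the description: h_n = 1 means the transducer works, h_n = 0 means the sensor measures only noise.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])   # fixed, unknown parameter
p, sigma2 = 0.8, 0.1                      # mixing probability, noise variance

A = rng.standard_normal((N, d))           # regressor known only at agent n
h = (rng.random(N) < p).astype(float)     # i.i.d. Bernoulli(p) variables
w = rng.normal(0.0, np.sqrt(sigma2), N)   # i.i.d. zero-mean Gaussian noise
y = h * (A @ theta_true) + w              # observations held by the agents
```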
I-C Distributed Banach-Picard Iteration (DBPI)
Our recent work [2] addressed a general distributed setup where agents, linked by a communication network, collaborate to collectively find an attractor of a map that can be implicitly represented as an average of local maps, i.e.,
where is the map held by agent . As defined in [2], an attractor of is a fixed point thereof, , satisfying
(2)
where is the spectral radius of the Jacobian of at . Moreover, the map is not assumed to have a symmetric Jacobian and no global structural assumptions (e.g., Lipschitzianity or coercivity) are made.
The main contributions of [2] are a distributed algorithm to find – DBPI – and the proof that it enjoys the local linear convergence of its centralized counterpart, i.e., of the (standard) Banach-Picard iteration: .
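For concreteness, the (centralized) Banach-Picard iteration can be sketched as follows; the map H below is an illustrative affine example of our choosing, whose Jacobian has spectral radius 0.5 < 1, so the attractor condition (2) holds and the iterates converge linearly to the fixed point.

```python
import numpy as np

# Illustrative attractor: H(x) = J x + b with rho(J) = 0.5 < 1, so the
# Banach-Picard iteration x <- H(x) converges linearly to the fixed point.
J = np.array([[0.3, 0.2],
              [0.0, 0.5]])
b = np.array([1.0, 2.0])
H = lambda x: J @ x + b
x_star = np.linalg.solve(np.eye(2) - J, b)   # unique fixed point of H

x = np.zeros(2)
for _ in range(100):
    x = H(x)                                 # standard Banach-Picard iteration
```

The DBPI reproduces this behaviour when H is only implicitly available as an average of local maps.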
I-D Contributions and Related Work
In this work, we propose addressing the distributed inference problems described in Subsections I-A and I-B using two instantiations of DBPI. More concretely, we propose:
-
1.
A distributed algorithm for PCA, which results from considering a map that can be implicitly written as an average of local maps and that has as a fixed point the solution to the PCA problem.
-
2.
An algorithm that stems from formulating the problem described in Subsection I-B as a fixed point of a map induced by the stationary equations of the corresponding maximum likelihood estimation (MLE) criterion. This map corresponds to the iterations of a slightly modified EM algorithm for a mixture of linear regressions [32].
The guarantees of local linear convergence for these distributed algorithms involve verifying condition (2) for the maps inducing them, which allows invoking the results from [2]. Consequently, a great portion of this paper is devoted to proving that (2) holds for these maps, which is far from trivial.
The distributed PCA problem (see [21] for a review of distributed PCA) described in Subsection I-A was addressed in [33], where an algorithm termed accelerated distributed Sanger’s algorithm (ADSA) was proposed. The authors consider a “mini-batch variant” of Sanger’s algorithm (SA, see [34]) and, inspired by [35], arrive at ADSA. Although no proof of convergence was presented in [33], very recent work by the same authors proves convergence of their algorithm [36]. Our contributions in this context are twofold: we show that ADSA is recovered by applying DBPI to SA, and that condition (2) holds for SA; thus, the guarantees of local convergence follow directly as a consequence of the results in [2]. We mention that no computer simulations of ADSA are presented in this work, since these can be found in [33].
The problem presented in Subsection I-B was addressed in [31], where (1) is regarded as a finite mixture model [37]. To estimate , the authors proposed a distributed version of the expectation-maximization (EM [38]) algorithm, termed diffusion-averaging distributed EM (DA-DEM). However, DA-DEM, very much in the spirit of [39, 40], uses a diminishing step-size to achieve convergence, leading to a sublinear convergence rate. Our contribution is an algorithm for this problem that extends a slightly modified version of the centralized EM algorithm to distributed settings. The key challenge is to show that we can “expect” condition (2) to hold, and we dedicate a considerable amount of effort to this endeavor. We use the term “expect”, since the operator underlying DBPI depends on the observed samples and, therefore, the existence of an attractor is a probabilistic question. Finally, we compare our algorithm with DA-DEM through Monte Carlo simulations, confirming the linear convergence rate of our algorithm and the sublinear convergence of DA-DEM.
There is considerable work on the “probabilistic linear convergence” of EM [41], [42], [43]. However, neither the results in [41], nor those in [42] encompass the mixture model underlying (1). The mixture of regressions presented in [43] bears some similarity with the model underlying (1), but it is not the same: in [43], is fixed at and (rather than ), thus there are no measurements that are just noise. Furthermore, [43] is primarily concerned with statistical guarantees for the error with respect to the ground truth, while we address the goal of establishing (2).
As mentioned in [31], there are two other relevant works on distributed EM, namely, [44] and [45]. However (see [31]), both these works address a different problem of Gaussian mixture density estimation. Moreover, in the case of [44], the algorithm demands a cyclic network topology, and, in [45], the algorithm imposes a higher computational load on each node, since it is based on the alternating direction method of multipliers.
To summarize, we show that ADSA [33] is an instance of DBPI and propose an algorithm to solve the mixture model underlying (1), also as an instance of DBPI. Consequently, their corresponding guarantees of local linear convergence result from the attractor condition (2) for the map underlying the corresponding centralized counterparts. We compare DA-DEM and our proposed algorithm through numerical Monte Carlo simulations, and the results confirm the linear convergence of our algorithm and the sublinear convergence of DA-DEM.
I-E Organization of the Paper
Section II briefly reviews the DBPI proposed in [2] and the main convergence result therein proved. The characterization of the fixed points of the “mini-batch” variant of Sanger’s algorithm, as well as the attractor condition, are presented in Section III. Section IV describes the centralized variant of EM underlying the proposed distributed algorithm for the problem in Subsection I-B, presents the verification of the attractor condition, and reports the results of simulations comparing DBPI with DA-DEM.
I-F Notation
The set of real dimensional vectors with positive components is denoted by . Matrices and vectors are denoted by upper and lower case letters, respectively. The spectral radius of a matrix is denoted by and its Frobenius norm by . Given a map , , and denote, respectively, the Jacobian of at and the differential of at . Given a vector , denotes its th component; given a matrix , denotes the element on the th row and th column and its transpose. The -dimensional identity matrix is denoted by , is the -dimensional vector of ones, and is the matrix of zeros. Whenever convenient, we will denote a vector with two stacked blocks, , simply as . Given a square matrix , is an upper triangular matrix of the same dimension as and whose upper triangular part coincides with that of . Given a norm , denotes the closed ball of center and radius with respect to . Random variables and vectors are denoted by upper case letters and, for random variable , the probability density (or mass) function of is denoted by . The probability density of a Gaussian of mean and variance is denoted by .
II Review of Distributed Banach-Picard Iteration
Consider a network of agents, where the interconnection structure is represented by an undirected connected graph: the nodes correspond to the agents and an edge between two agents indicates they can communicate (are neighbours). In the scenario considered in [2], each agent holds an operator , and the goal is to compute a fixed point of the average operator
(3)
Each agent is restricted to performing computations involving and communicating with its neighbours.
Our only assumption about is the existence of a locally attractive fixed point , i.e., satisfying (2).
Let be the map on defined, for (with held by agent ) by
(4)
and let , where is the so-called Metropolis weight matrix associated with the communication graph [46]. The algorithm proposed in [2] is presented in Algorithm 1, where .
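The Metropolis weight matrix can be built from the graph alone; the sketch below is our own implementation of the usual definition in [46]: it assigns W_ij = 1/(1 + max(deg_i, deg_j)) to each edge and puts the remaining mass on the diagonal.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis weights for an undirected graph given as a 0/1 adjacency matrix."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # self-weight: remaining mass
    return W

# Path graph on 4 nodes: 0 - 1 - 2 - 3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
W = metropolis_weights(adj)
```

The resulting W is symmetric and doubly stochastic, which is what the consensus step of Algorithm 1 requires.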
III Distributed PCA
III-A Algorithm
We obtain a distributed algorithm for solving the PCA problem described in Subsection I-A as an instantiation of DBPI by introducing a map with a fixed point at the desired solution. Moreover, the guarantees of local linear convergence follow as a result of verifying (2).
III-B Convergence: Main Results
The convergence analysis amounts to verifying the attractor condition (2) for , thus establishing, as a corollary of the results in [2], the local linear convergence of Algorithm 1 with each in (4) defined as in (6) (equivalently, ADSA).
We start with the following lemma (proved in Appendix D) showing that the solution sought in the PCA problem is a fixed point of (as defined in (5)).
Lemma 1.
Let . If satisfies
(7)
then, each column of is either or a unit-norm eigenvector of . Moreover, the columns are orthogonal, i.e., is diagonal with the diagonal elements being either one or zero.
The following theorem guarantees that the Banach-Picard iteration of has local linear convergence to its fixed points.
Theorem 1.
Let be the eigenvalues of . Suppose that is a matrix such that (where denotes the th column of and is as defined in (5)), for , and . Then, there exists such that, for ,
Remark 1.
The invertibility of that is assumed in the statements of Lemma 1 and Theorem 1 is not very restrictive. In fact, if rather than , then satisfies and has the same eigenvectors as .
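To make the map concrete: the Sanger iteration discussed above has the form X ← X + μ(CX − X·ut(XᵀCX)), with ut(·) the upper-triangular part defined in the notation section. Since (5)-(6) are not reproduced in this version, the following NumPy sketch should be read as our reconstruction of that map; for a diagonal C the top eigenvectors are canonical basis vectors, so convergence is easy to check.

```python
import numpy as np

def ut(A):
    """Upper-triangular part (diagonal included), as in the notation section."""
    return np.triu(A)

# Our reconstruction of the Sanger map: F(X) = X + mu * (C X - X ut(X^T C X)).
C = np.diag([5.0, 3.0, 1.0])    # eigenvectors: the canonical basis vectors
mu, m = 0.05, 2
rng = np.random.default_rng(2)
X = 0.1 * rng.standard_normal((3, m))

for _ in range(3000):
    X = X + mu * (C @ X - X @ ut(X.T @ C @ X))
```

The iterates approach (up to sign) the two top eigenvectors, with orthonormal columns, in agreement with Lemma 1 and Theorem 1.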
III-C Proof of Theorem 1
First, note that , where
This implies , and, as a consequence, each eigenvalue of is of the form , with being an eigenvalue of . The idea is to show that these eigenvalues of enjoy a key property: they are real-valued and negative, . This property means that, for sufficiently small , we have . To establish this key property, we divide the proof of Theorem 1 into two lemmas: Lemma 2 and Lemma 3.
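The key property can be checked numerically on a toy example of our choosing: if D has real, negative eigenvalues, the eigenvalues of I + μD are 1 + μλ, and for small μ the spectral radius drops strictly below 1.

```python
import numpy as np

# Illustrative Jacobian factor with a real, negative spectrum.
D = np.diag([-4.0, -1.0, -0.25])
mu = 0.1                              # small enough: mu < 2 / max |lambda|
M = np.eye(3) + mu * D                # eigenvalues are 1 + mu * lambda
rho = max(abs(np.linalg.eigvals(M)))  # spectral radius
```

Here rho = 1 − 0.1 · 0.25 = 0.975 < 1; the eigenvalue closest to zero dictates the linear convergence rate.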
Lemma 2 will show that the eigenvalues of coincide with those of the linear map from to given by
(8)
where
(9)
(10)
and
(11)
Lemma 3 will show that the eigenvalues of (8) are real and negative.
Lemma 2.
Proof.
From the rules of matrix differential calculus (see [47] and [48]), the differential of at , denoted by , is the linear map
(12)
Observe that is a linear map, thus the composition rule for differentials further yields
(13)
By assumption, with given by (9), thus combining this with (12) and (13), , which we denote by to simplify the notation, is given by
The eigenvalues of coincide with those of , under the identification between Jacobians and differentials (see [47]); hence, we will study the eigenvalues of the latter.
Let be an extension of to an orthonormal basis of eigenvectors of , i.e., and , where is given by (10). To understand the eigenvalues of , consider the linear map given by
(14)
and observe that is an invertible linear map (in fact ). Eigenvalues are invariant under a similarity transformation, hence, the eigenvalues of coincide with those of which, after renaming by , just amounts to the linear map
In the proof of the following lemma it is crucial that the eigenvalues are in decreasing order.
Lemma 3.
Proof.
Let be an eigenvector (note that is in fact a matrix) of (15) associated with the eigenvalue , i.e.,
(16)
next, we show that .
Consider a block partition of of the form
where and are, respectively, and matrices. The eigenvalue matrix equation (16) induces the following system of matrix equations
(17)
(18)
where .
There are two non-mutually-exclusive cases to consider: or (, by virtue of being an eigenvector).
Case 1
Case 2
Suppose that . This case splits into two: either or . If , then (17) and the “upper triangularization” operation yield
(19)
which, after dividing by , yields . If , then,
Next, notice that if , then can be assumed to be , since, otherwise, we could deal with it as in (19), with the roles of and reversed, to conclude . Hence, assuming , we obtain, after division by , that . Finally, if , then . ∎
IV Parameter Estimation with Noisy and Faulty Measurements
IV-A Roadmap
This is a rather long section, hence the need for a roadmap. The analysis of (1) is simplified if the measurements are identically distributed, rather than merely independent; therefore, we start by introducing a probability distribution on the vectors and a joint model on .
To estimate ( and are treated as nuisance parameters), we consider the stationary equations imposed by equating to zero the gradient of the log-likelihood function, a necessary condition satisfied by the maximum likelihood estimator (MLE). Once the particular form of the stationary equations is realized, we reformulate them as a fixed point equation of the form that naturally suggests the Banach-Picard iteration
Observing that the map cannot be written as an average of local maps, we switch to the map , which can be implicitly written as an average of local maps. With this map, we arrive at a distributed algorithm by considering the map (see section II) arising from and appealing to Algorithm 1.
IV-B Identically distributed observations
Let be an unknown and fixed vector which we term the ground truth.
The agents’ measurements are assumed to be independent (see (1)); however, they are not identically distributed, given the presence of the vectors in (1). To address this issue, let , , and be, respectively, a binary random variable, a random vector, and a real random variable. Suppose the joint density on factors as
(20)
where
and
(21)
Instead of assuming that the are fixed as in [31], we assume that each sensor has a measurement , where was drawn from (20), but agent has no knowledge of . After marginalization, the joint density of is given by
(22)
which is a mixture model [49].
To estimate , the agents seek such that
(23)
where is the log-likelihood of , i.e.,
(24)
Since does not depend on ,
in other words, (23) is a necessary condition satisfied by the MLE corresponding to the log-likelihood function , thus independent of .
IV-C Gradient of and the centralized algorithm
Before explicitly writing the stationary equations corresponding to (23), we introduce the responsibility functions [49],
(25)
Notice that , the posterior probability that the observation was not a result of measuring only noise.
For reasons that will be clear later, the following set of equalities, which can be easily verified, will be convenient:
(26)
Using (26), (23) can be explicitly written as
(27)
(28)
(29)
where
If the matrix is invertible, then (27)-(29) can be written as a fixed point equation (the invertibility of this matrix is assumed throughout the rest of the paper; in fact, if is sufficiently large, greater than , this happens with probability one). This constitutes the motivation for the centralized algorithm that we suggest next (see Algorithm 2) and from which we will derive the distributed version; observe that it is the Banach-Picard iteration motivated by (27)-(29).
(30)
(31)
(32)
IV-D Distributed Algorithm
Although is an average of local maps, the map is not, due to the matrix inversion in (30). As a consequence, (30)-(32) cannot be directly extended to a distributed algorithm. However, switching the order of and results in a map that can be implicitly written as an average of local maps. In fact, instead of the iteration , consider the iteration
where and .
Let
and it follows that , where
(33)
To conclude, the distributed algorithm we suggest amounts to Algorithm 1, with defined, for , as in (4), and as in (33). Additionally, following [31], we suggest the initialization
(34)
Some remarks are due:
- a)
- b)
- c)
-
Expanding (32) yields
and, if the update rule is modified according to
then, a straightforward manipulation recovers the EM algorithm presented in [31]. Moreover, the EM algorithm derived in [31] is still amenable to a distributed implementation using Algorithm 1. However, we found it easier to prove (2) for this variant of EM than for the standard EM.
IV-E Convergence Analysis
The proof of local linear convergence of the centralized variant of EM, i.e., Algorithm 2, is not trivial. In fact, this question is probabilistic in nature, because updates (30)-(32) depend on observations that are, in turn, samples from a probability distribution. Before presenting the main convergence result (Theorem 3), we need to introduce some definitions and only one mild assumption that is instrumental in the proof of Lemma 4 below: the Fisher information at , given by,
(35)
is non-singular.
Let denote the map underlying the Banach-Picard iteration (30)-(32) (the subscript emphasizes that depends on the observations). A straightforward manipulation, using (26), shows that
where
Before stating the main result, we introduce the “infinite sample” map, i.e.,
where
and
The next lemma shows that is a “natural” map to consider.
Lemma 4.
The “infinite sample map”, i.e., , has the following properties
- a)
-
For fixed , converges in probability to ;
- b)
-
The ground truth is a fixed point of , i.e. ;
- c)
-
The attractor condition (2) holds for at , i.e.,
Proof.
The proof of b) amounts to a straightforward verification and the proof of a) follows from the weak law of large numbers; hence, we will focus on the proof of c), which relies on the principle of missing information (see [50], page 101) and on the assumption that the Fisher information (35) is non-singular.
Under suitable regularity conditions (see appendix A) that hold for the model (20),
Additionally, a simple calculation reveals that coincides with the Fisher information of the complete data model (20), i.e.,
The non-singularity assumption on (see (35)), together with the principle of missing information (see [50], page 101), implies that
All these observations show that
(36)
and Theorem 7.7.3 of [51], together with , implies that
concluding that is an attractor of . ∎
We recall that the goal of this section is to show that the probability that has an attractor approaches as tends to infinity; the strategy is to derive this from the existence of an attractor of , i.e., c) in Lemma 4. Pointwise convergence in probability, i.e., a) in Lemma 4, is not enough to arrive at this result. In fact, the proof is built on a stronger notion that is a probabilistic version of uniform, rather than pointwise, convergence of maps. This is the content of Theorem 2 below.
Observe that if is a fixed point of , then
so let
Remark 2.
Note that the maps and coincide with the Jacobian maps and only at fixed points.
The uniform convergence in probability is expressed in the next theorem, whose proof can be found in appendix C. For the statement, recall (see the notation section) that is the closed ball of center and radius , with respect to the metric induced by the norm .
Theorem 2.
Let and be any norm. With , , , and as defined above,
(37)
(38)
both in probability, as .
We now state the main convergence result.
Theorem 3.
There exists and a norm such that
where is the induced matrix norm (the measurability of the maps in this theorem is a consequence of Proposition 7.32 in [52]).
Before presenting the proof, we explain why Theorem 3 encapsulates the notion that, with probability approaching , the map has an attractor. Let
Remark 3.
Informally, observe that the set is the set of “samples” where the ball is invariant under , i.e.,
and that the set is, from remark 2, the set of “samples” where the Jacobian of satisfies (2) at a fixed point. By noting that a continuous map from a compact convex set into itself has a fixed point (Brouwer’s fixed point theorem), it follows that if is in , then has a fixed point. Moreover, if is in , then has a fixed point satisfying (2). All of this is made precise below.
The statement of Theorem 3 is that the (non-random) sequences and both tend to . The inequalities
imply that
Now note that, if both inequalities hold, namely
(39)
(40)
then (39), together with Brouwer’s fixed point theorem (see [53], page 180), implies that has a fixed point in (this idea is loosely inspired by [54], page 69). Moreover, at a fixed point it holds (see remark 2) that , so (40) implies that
This explains why Theorem 3 expresses the notion that we can “expect” (2) to hold for . In fact, from the above, the event
contains the event , and the probability of this last event approaches 1.
Proof of Theorem 3
Let be any norm. Then
(41)
From Lemma 4 c),
From the proof of Ostrowski’s Theorem (see [55], page 300), there exists a norm on , an open neighborhood of , and , such that
- 1)
-
, for ;
- 2)
-
, where here the norm is the induced matrix norm.
Choose sufficiently small such that
- i)
-
;
- ii)
-
, for ,
where the validity of ii) follows from the compactness of and the continuity of .
IV-F Simulations
In this section, we compare our algorithm with the one from [31] (DA-DEM) through Monte Carlo simulations. The parameters generated once and fixed throughout all Monte Carlo runs were: , , a unit-norm vector , , and an undirected connected graph on nodes with connectivity radius (points were randomly deployed on the unit square; two points were then connected by an edge if their distance was less than ).
Each Monte Carlo run consisted in
- 1)
-
Generating a data set: each was independently sampled from a Gaussian with zero mean and covariance ; the variance of the noise was generated according to
with and where SNR is the signal-to-noise ratio (we experimented with ). Finally, each was sampled according to (see (20)), with , , , and .
- 2)
The performance metrics consisted in finding a fixed point using the centralized operators as follows.
We first computed
(44)
(45)
where: ; ; corresponds to the map arising from the standard EM algorithm derived in [31]. In fact, as seen in [31], the EM algorithm can be written as
where
and
We ran the algorithms, with initialization as in (44) and (45), given by
until we found and satisfying
The error at iteration of the distributed algorithms was then computed as
where is the projection onto the average, i.e., (as mentioned before, and were treated as nuisance parameters).
The number of Monte Carlo tests was and the errors at iteration were averaged out of for each and . The results for two different SNR values are shown in logarithmic scale in Figures 1 and 2.
[Figures 1 and 2: average error versus iteration, in logarithmic scale, for the two SNR values.]
The simulations show, as expected from the theory, that Algorithm 1 converges linearly and clearly outperforms the algorithm from [31], which, given its diminishing step-size, is bound to converge only sub-linearly. Moreover, both algorithms require just one round of communications per iteration.
V Conclusion
This article builds upon the distributed Banach-Picard iteration and its convergence properties established in [2] to make two main contributions. First, we provided a proof of local linear convergence for the distributed PCA algorithm suggested in [33], thereby filling a gap left by that work. Second, starting from the distributed Banach-Picard iteration, we proposed a distributed algorithm for solving the parameter estimation problem from noisy and faulty measurements that had been addressed in [31]. Unlike the algorithm in [31], which uses diminishing step sizes, thus exhibiting a sublinear convergence rate, the proposed instance of the distributed Banach-Picard iteration is guaranteed to have local linear convergence. Numerical experiments confirm the theoretical advantage of the proposed method with respect to that from [31].
Appendix A Regularity Conditions
Theorem 4.
Let be a compact set containing . Then, for all ,
(46)
(47)
where , are dummy variables in (i.e., consider partial differentiation with respect to the components of ), and where
are polynomials.
We start with a proof of (47). Note that satisfies the following differential equations
We deduce that
(48)
where and are polynomials and is a dummy variable in .
The chain rule of differentiation, the quotient rule of differentiation and the form (48) imply that
(49)
The result now follows easily from , the compactness of and the identity (49).
The inequality (46) follows from (49) and the form of the gradient of in (26). This concludes the proof of Theorem 4.
An immediate corollary of Theorem 4 is that the absolute values of the partial derivatives of both and are dominated by functions whose expectation exists and is finite; this is the content of the following result.
Theorem 5.
Let be a polynomial in variables. Then
exists and is finite.
To prove this theorem, observe that is a sum of elements of the form
and, hence, it is enough to show that
exists and is finite. This last fact is an easy consequence of the existence of absolute non-central and central moments of Gaussians and, therefore, we will skip the proof.
Appendix B Auxiliary Results and definitions
Theorem 6 ([56], page 2129).
Let be a matrix of functions of an observation and the parameter . If the are i.i.d., is compact, is continuous at each and there is with for all , where exists and is finite, then is continuous and
in probability.
Let be a sequence of random vectors. We use the notation , to denote that converges to in probability, i.e., if, for every , the non-random sequence
converges to .
If is uniformly bounded in probability, i.e., if, for every , there exists , such that
we denote this by (see [54] for more details and also for the calculus with the and ).
Appendix C Proof of Theorem 2
We give only a sketch of the proof of (38) (the proof of (37) is analogous). Observe that
which implies that
(50)
From Theorem 6 (see appendix B),
(51)
(52)
in probability; these are consequences of Theorem 6, by noting that:
- a)
-
, where is the maximum of on and where we used the fact that ;
- b)
-
on , for some map not depending on for which exists and is finite (see appendix A).
Since all norms on a finite-dimensional space are equivalent, (51)-(52) also hold if the Frobenius norm is replaced by any other norm.
Taking the supremum over on on both sides of (50), we obtain, from (51)-(52), that
where the definitions of and can be found in appendix B.
Putting everything together, we conclude that
where the equality follows from the calculus rules with and [54].
Appendix D Proof of Lemma 1
Suppose satisfies (7). Throughout this proof, denotes the th column of . Consider the equation imposed by the first column, , i.e.,
and multiply both sides by , which yields
From the two equalities
we conclude that either or is a unit-norm eigenvector of .
Considering the second column, we prove that is either zero or a unit-norm eigenvector of that is orthogonal to . Observe that
(53)
Now recall that either or is a unit-norm eigenvector of . If , then (53) reduces to
and the result follows as in the case of . If , then it is a unit-norm eigenvector of and, hence, there exists such that and (53) reduces to
(54)
Multiply on the left by and use to obtain
If , we are done. If not, then and, returning to (54), it holds that
This establishes the claim for and . Proceeding as for the second column, an induction argument over the columns establishes the result.
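The conclusion of the lemma, that nonzero columns of a solution are unit-norm, mutually orthogonal eigenvectors, is easy to check numerically on the canonical example of such a matrix: the top eigenvectors of a symmetric positive semidefinite matrix (a sketch; the matrix `C` below merely stands in for the covariance matrix of the PCA setting, and equation (7) itself is not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(2)

# Random symmetric PSD matrix playing the role of the data covariance.
A = rng.standard_normal((5, 5))
C = A @ A.T

eigvals, V = np.linalg.eigh(C)  # eigenvalues in ascending order, orthonormal columns
R = V[:, -3:]                   # top-3 eigenvectors as columns

# Each column is unit-norm...
norms = np.linalg.norm(R, axis=0)
# ...the columns are mutually orthogonal...
gram = R.T @ R
# ...and each column satisfies the eigenvector equation C r = lambda r.
residual = C @ R - R @ np.diag(eigvals[-3:])
```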
Acknowledgment
This work was partially funded by the Portuguese Fundação para a Ciência e Tecnologia (FCT), under grants PD/BD/135185/2017 and UIDB/50008/2020. The work of João Xavier was supported in part by FCT, through the LARSyS project, under FCT Project UIDB/50009/2020, and by Project HARMONY PTDC/EEI-AUT/31411/2017 (funded by Portugal 2020 through FCT under Contract AAC n.º 2/SAICT/2017–031411; IST-ID funded by POR Lisboa under Grant LISBOA-01-0145-FEDER-031411).
References
- [1] R. Olfati-Saber, J. Fax, and R. Murray, “Consensus and cooperation in networked multi-agent systems,” Proceedings of the IEEE, vol. 95, pp. 215–233, 2007.
- [2] F. Andrade, M. Figueiredo, and J. Xavier, “Distributed Picard iteration,” submitted, available at arXiv:2104.00131, 2021.
- [3] K. Pearson, “On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.
- [4] Y. Qu, G. Ostrouchov, N. Samatova, and A. Geist, “Principal component analysis for dimension reduction in massive distributed data sets,” in Proceedings of IEEE International Conference on Data Mining (ICDM), vol. 1318, no. 1784, 2002, p. 1788.
- [5] Y. Liang, M.-F. F. Balcan, V. Kanchanapally, and D. Woodruff, “Improved distributed principal component analysis,” Advances in Neural Information Processing Systems, vol. 27, pp. 3113–3121, 2014.
- [6] R. Kannan, S. Vempala, and D. Woodruff, “Principal component analysis and higher correlations for distributed data,” in Conference on Learning Theory. PMLR, 2014, pp. 1040–1057.
- [7] C. Boutsidis, D. P. Woodruff, and P. Zhong, “Optimal principal component analysis in distributed and streaming models,” in Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, 2016, pp. 236–249.
- [8] D. Garber, O. Shamir, and N. Srebro, “Communication-efficient algorithms for distributed stochastic principal component analysis,” in International Conference on Machine Learning. PMLR, 2017, pp. 1203–1212.
- [9] Z.-J. Bai, R. H. Chan, and F. T. Luk, “Principal component analysis for distributed data sets with updating,” in International Workshop on Advanced Parallel Processing Technologies. Springer, 2005, pp. 471–483.
- [10] H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson, “Distributed clustering using collective principal component analysis,” Knowledge and Information Systems, vol. 3, no. 4, pp. 422–448, 2001.
- [11] H. Qi, T.-W. Wang, and J. D. Birdwell, “Global principal component analysis for dimensionality reduction in distributed data mining,” Statistical data mining and knowledge discovery, pp. 327–342, 2004.
- [12] F. N. Abu-Khzam, N. F. Samatova, G. Ostrouchov, M. A. Langston, and A. Geist, “Distributed dimension reduction algorithms for widely dispersed data.” in IASTED PDCS, 2002, pp. 167–174.
- [13] A. Scaglione, R. Pagliari, and H. Krim, “The decentralized estimation of the sample covariance,” in 2008 42nd Asilomar Conference on Signals, Systems and Computers. IEEE, 2008, pp. 1722–1726.
- [14] Y.-A. Le Borgne, S. Raybaud, and G. Bontempi, “Distributed principal component analysis for wireless sensor networks,” Sensors, vol. 8, no. 8, pp. 4821–4850, 2008.
- [15] M. E. Yildiz, F. Ciaramello, and A. Scaglione, “Distributed distance estimation for manifold learning and dimensionality reduction,” in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009, pp. 3353–3356.
- [16] W. Suleiman, M. Pesavento, and A. M. Zoubir, “Performance analysis of the decentralized eigendecomposition and ESPRIT algorithm,” IEEE Transactions on Signal Processing, vol. 64, no. 9, pp. 2375–2386, 2016.
- [17] S. B. Korada, A. Montanari, and S. Oh, “Gossip PCA,” ACM SIGMETRICS Performance Evaluation Review, vol. 39, no. 1, pp. 169–180, 2011.
- [18] L. Li, A. Scaglione, and J. H. Manton, “Distributed principal subspace estimation in wireless sensor networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 725–738, 2011.
- [19] I. D. Schizas and A. Aduroja, “A distributed framework for dimensionality reduction and denoising,” IEEE Transactions on Signal Processing, vol. 63, no. 23, pp. 6379–6394, 2015.
- [20] S. X. Wu, H.-T. Wai, A. Scaglione, and N. A. Jacklin, “The power-Oja method for decentralized subspace estimation/tracking,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 3524–3528.
- [21] S. X. Wu, H.-T. Wai, L. Li, and A. Scaglione, “A review of distributed algorithms for principal component analysis,” Proceedings of the IEEE, vol. 106, no. 8, pp. 1321–1340, 2018.
- [22] A. Gang, B. Xiang, and W. Bajwa, “Distributed principal subspace analysis for partitioned big data: Algorithms, analysis, and implementation,” IEEE Transactions on Signal and Information Processing over Networks, vol. 7, pp. 699–715, 2021.
- [23] A. G. Dimakis, S. Kar, J. M. Moura, M. G. Rabbat, and A. Scaglione, “Gossip algorithms for distributed signal processing,” Proceedings of the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
- [24] J.-J. Xiao, A. Ribeiro, Z.-Q. Luo, and G. B. Giannakis, “Distributed compression-estimation using wireless sensor networks,” IEEE Signal Processing Magazine, vol. 23, no. 4, pp. 27–41, 2006.
- [25] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Randomized gossip algorithms,” IEEE Transactions on Information Theory, vol. 52, no. 6, pp. 2508–2530, 2006.
- [26] S. Barbarossa and G. Scutari, “Decentralized maximum-likelihood estimation for sensor networks composed of nonlinearly coupled dynamical systems,” IEEE Transactions on Signal Processing, vol. 55, no. 7, pp. 3456–3470, 2007.
- [27] T. Zhao and A. Nehorai, “Information-driven distributed maximum likelihood estimation based on Gauss-Newton method in wireless sensor networks,” IEEE Transactions on Signal Processing, vol. 55, no. 9, pp. 4669–4682, 2007.
- [28] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, “Consensus in ad hoc WSNs with noisy links—Part I: Distributed estimation of deterministic signals,” IEEE Transactions on Signal Processing, vol. 56, no. 1, pp. 350–364, 2007.
- [29] S. S. Stanković, M. S. Stanković, and D. M. Stipanović, “Decentralized parameter estimation by consensus based stochastic approximation,” IEEE Transactions on Automatic Control, vol. 56, no. 3, pp. 531–543, 2010.
- [30] A. H. Sayed, “Diffusion adaptation over networks,” in Academic Press Library in Signal Processing. Elsevier, 2014, vol. 3, pp. 323–453.
- [31] S. S. Pereira, R. López-Valcarce, and A. Pages-Zamora, “Parameter estimation in wireless sensor networks with faulty transducers: A distributed EM approach,” Signal Processing, vol. 144, pp. 226–237, 2018.
- [32] S. Faria and G. Soromenho, “Fitting mixtures of linear regressions,” Journal of Statistical Computation and Simulation, vol. 80, no. 2, pp. 201–225, 2010.
- [33] A. Gang, H. Raja, and W. U. Bajwa, “Fast and communication-efficient distributed PCA,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 7450–7454.
- [34] T. D. Sanger, “Optimal unsupervised learning in a single-layer linear feedforward neural network,” Neural networks, vol. 2, no. 6, pp. 459–473, 1989.
- [35] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- [36] A. Gang and W. Bajwa, “A linearly convergent algorithm for distributed principal component analysis,” available at arXiv:2101.01300, 2021.
- [37] G. McLachlan and D. Peel, Finite Mixture Models. John Wiley & Sons, 2004.
- [38] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society (Series B), vol. 39, no. 1, pp. 1–38, 1977.
- [39] S. Kar and J. M. Moura, “Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise,” IEEE Transactions on Signal Processing, vol. 57, no. 1, pp. 355–369, 2008.
- [40] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
- [41] R. A. Redner and H. F. Walker, “Mixture densities, maximum likelihood and the EM algorithm,” SIAM review, vol. 26, no. 2, pp. 195–239, 1984.
- [42] R. Sundberg, “Maximum likelihood theory for incomplete data from an exponential family,” Scandinavian Journal of Statistics, vol. 1, no. 2, pp. 49–58, 1974.
- [43] S. Balakrishnan, M. Wainwright, and B. Yu, “Statistical guarantees for the EM algorithm: From population to sample-based analysis,” Annals of Statistics, vol. 45, no. 1, pp. 77–120, 2017.
- [44] R. D. Nowak, “Distributed EM algorithms for density estimation and clustering in sensor networks,” IEEE Transactions on Signal Processing, vol. 51, no. 8, pp. 2245–2253, 2003.
- [45] P. A. Forero, A. Cano, and G. B. Giannakis, “Distributed clustering using wireless sensor networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 707–724, 2011.
- [46] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004.
- [47] J. R. Magnus and H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2019.
- [48] H. Lütkepohl, Handbook of Matrices. Wiley, 1996.
- [49] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
- [50] G. J. McLachlan and T. Krishnan, The EM algorithm and Extensions. John Wiley & Sons, 2007, vol. 382.
- [51] R. Horn and C. Johnson, Matrix Analysis. Cambridge University Press, 2012.
- [52] D. Bertsekas and S. Shreve, Stochastic Optimal Control: the Discrete-time Case. Athena Scientific, 1996.
- [53] J. Borwein and A. S. Lewis, Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, 2010.
- [54] A. Van der Vaart, Asymptotic Statistics. Cambridge University Press, 2000, vol. 3.
- [55] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables. SIAM, 2000.
- [56] W. K. Newey and D. McFadden, “Large sample estimation and hypothesis testing,” Handbook of Econometrics, vol. 4, pp. 2111–2245, 1994.