This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Distributed Banach-Picard Iteration: Application to Distributed EM and Distributed PCA

Francisco Andrade, Mário A. T. Figueiredo, , and João Xavier The authors are with the Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal.F. Andrade ([email protected]) and M. Figueiredo ([email protected]) are also with the Instituto de Telecomunicações, Lisbon, Portugal.J. Xavier ([email protected]) is also with the Laboratory for Robotics and Engineering Systems, Institute for Systems and Robotics, Lisbon, Portugal.

Distributed Banach-Picard Iteration: Application to Distributed Parameter Estimation and PCA

Francisco Andrade, Mário A. T. Figueiredo, , and João Xavier The authors are with the Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal.F. Andrade ([email protected]) and M. Figueiredo ([email protected]) are also with the Instituto de Telecomunicações, Lisbon, Portugal.J. Xavier ([email protected]) is also with the Laboratory for Robotics and Engineering Systems, Institute for Systems and Robotics, Lisbon, Portugal.
Abstract

In recent work, we proposed a distributed Banach-Picard iteration (DBPI) that allows a set of agents, linked by a communication network, to find a fixed point of a locally contractive (LC) map that is the average of individual maps held by said agents. In this work, we build upon the DBPI and its local linear convergence (LLC) guarantees to make several contributions. We show that Sanger’s algorithm for principal component analysis (PCA) corresponds to the iteration of an LC map that can be written as the average of local maps, each map known to each agent holding a subset of the data. Similarly, we show that a variant of the expectation-maximization (EM) algorithm for parameter estimation from noisy and faulty measurements in a sensor network can be written as the iteration of an LC map that is the average of local maps, each available at just one node. Consequently, via the DBPI, we derive two distributed algorithms – distributed EM and distributed PCA – whose LLC guarantees follow from those that we proved for the DBPI. The verification of the LC condition for EM is challenging, as the underlying operator depends on random samples, thus the LC condition is of probabilistic nature.

Index Terms:
Distributed Computation, Banach-Picard Iteration, Fixed Points, Distributed EM, Distributed PCA, Consensus.

I Introduction

Parameter estimation from noisy data and dimensionality reduction are two of the most fundamental tasks in signal processing and data analysis. In many scenarios, such as sensor networks and IoT, the underlying data is distributed among a collection of agents that cooperate to jointly solve the problem, i.e., find a consensus solution without sharing the data or sending it to a central unit [1]. This paper addresses the two problems above mentioned in a distributed setting, proposing and analysing two algorithms that are instances of the distributed Banach-Picard iteration (DBPI), which we have recently introduced and proved to enjoy local linear convergence [2]. In this introductory section, after presenting the formulations and motivations of the two problems considered, we review the DBPI and summarize our contributions and the tools that are used for the convergence proofs.

I-A Problem Statement: Dimensionality Reduction

Dimensionality reduction aims at representing high-dimensional data in a lower dimensional space, which can be crucial to reduce the computational complexity of manipulating and processing this data, and is a core task in modern data analysis, machine learning, and related areas. The standard linear dimensionality reduction tool is principal component analysis (PCA), which allows expressing a high-dimensional dataset on the basis formed by the top eigenvectors of its sample covariance matrix. PCA first appeared in the statistics community in the beginning of the 20th century [3] and became one of the workhorses of statistical data analysis, with dimensionality reduction being a notable application. Nowadays, with the ever increasing collection of data by spatially dispersed agents, developing algorithms for distributed PCA constitutes a relevant area of research – see, e.g., [4, 5, 6, 7, 8, 9, 10, 11, 12] (master-slave communication architecture), and [13, 14, 15, 16, 17, 18, 19, 20] (arbitrarily meshed network communication architecture). For a recent and comprehensive review on these works, see, e.g., [21]; for a very recent work see [22].

In the (arbitrarily meshed network) distributed setting, consider a set of NN agents linked by an undirected and connected communication graph; the nodes are the agents, and the edges represent the communication channels between the agents. Each agent nn holds a finite set of points in d\mathbb{R}^{d}, 𝐘nd\mathbf{Y}_{n}\subseteq\mathbb{R}^{d}, and the agents seek to collectively find the top mm eigenvectors111This means the mm eigenvectors associated to the largest mm eigenvalues. of

C=1Mn=1NCn,\displaystyle C=\frac{1}{M}\sum_{n=1}^{N}C_{n},

(assumed to be positive definite, i.e., C0C\succ 0), where M=n=1N|𝐘n|M=\sum_{n=1}^{N}|\mathbf{Y}_{n}|, i.e., the sum of the cardinalities of each 𝐘n\mathbf{Y}_{n}, and

Cn=y𝐘nyyT.\displaystyle C_{n}=\sum_{y\in\mathbf{Y}_{n}}yy^{T}.

I-B Problem Statement: Distributed Parameter Estimation with Noisy and Faulty Measurements

Consider a collection of spatially distributed sensors monitoring the environment, a common scenario for information processing or decision making tasks see, e.g., [23, 24, 25, 26, 27, 28, 29, 30]. Often, these sensors communicate wirelessly, maybe in a harsh environment, which may result in faulty communications or sensor malfunctions [31]. The setup is modelled as follows: NN agents, linked by an undirected and connected communication graph, each holding an independent observation given by

Yn=ZnhnTμ+Wn,n=1,,N.\displaystyle Y_{n}=Z_{n}h_{n}^{T}\mu^{\star}+W_{n},\quad n=1,\ldots,N. (1)

In (1), μd\mu^{\star}\in\mathbb{R}^{d} is a fixed and unknown parameter, each hndh_{n}\in\mathbb{R}^{d} is assumed to be known only at agent nn, {Wj}n=1N\{W_{j}\}_{n=1}^{N} are independent and identically distributed (i.i.d.) zero-mean Gaussian random variables with variance (σ)2(\sigma^{\star})^{2}, and {Zn}n=1N\{Z_{n}\}_{n=1}^{N} are i.i.d. Bernoulli random variables (Zn{0,1}Z_{n}\in\{0,1\}) with fZn(zn|p)=(p)zn(1p)1znf_{Z_{n}}(z_{n}|p^{\star})=(p^{\star})^{z_{n}}(1-p^{\star})^{1-z_{n}}. This formulation models a scenario where sensor nn measures the parameter μ\mu^{\star} with probability pp^{\star} and, with probability 1p1-p^{\star}, it senses only noise, indicating a transducer failure [31]. The agents seek to collectively estimate μ\mu^{\star}, treating pp^{\star} and σ\sigma^{\star} as nuisance (or latent) parameters. Observe that if the binary variables ZnZ_{n} were not random, but fixed and known, the model could be regarded as a (distributed) linear regression problem. However, the randomness introduces an extra layer of difficulty accounting for potential sensor failures.

A decentralized algorithm, rather than one where each sensor sends its data to a central node, is potentially more robust to faulty wireless communications that may render a sensor useless. Moreover, a decentralized algorithm can yield considerable energy savings [23], a very desirable feature.

I-C Distributed Banach-Picard Iteration (DBPI)

Our recent work [2] addressed a general distributed setup where NN agents, linked by a communication network, collaborate to collectively find an attractor xx^{\star} of a map HH that can be implicitly represented as an average of local maps, i.e.,

H=1Nn=1NHn,H=\frac{1}{N}\sum_{n=1}^{N}H_{n},

where HnH_{n} is the map held by agent nn. As defined in [2], an attractor xx^{\star} of HH is a fixed point thereof, H(x)=xH(x^{\star})=x^{\star}, satisfying

ρ(𝐉H(x))<1,\displaystyle\rho\big{(}\mathbf{J}_{H}(x^{\star})\big{)}<1, (2)

where ρ(𝐉H(x))\rho\big{(}\mathbf{J}_{H}(x^{\star})\big{)} is the spectral radius of the Jacobian of HH at xx^{\star}. Moreover, the map HH is not assumed to have a symmetric Jacobian and no global structural assumptions (e.g., Lipschitzianity or coercivity) are made.

The main contributions of [2] are a distributed algorithm to find xx^{\star} – DBPI – and the proof that it enjoys the local linear convergence of its centralized counterpart, i.e., of the (standard) Banach-Picard iteration: xk+1=H(xk)x^{k+1}=H(x^{k}).

I-D Contributions and Related Work

In this work, we propose addressing the distributed inference problems described in Subsections I-A and I-B using two instantiations of DBPI. More concretely, we propose:

  1. 1.

    A distributed algorithm for PCA, which results from considering a map that can be implicitly written as an average of local maps and that has as a fixed point the solution to the PCA problem.

  2. 2.

    An algorithm that stems from formulating the problem described in Subsection I-B as a fixed point of a map induced by the stationary equations of the corresponding maximum likelihood estimation (MLE) criterion. This map corresponds to the iterations of a slightly modified EM algorithm for a mixture of linear regressions [32].

The guarantees of local linear convergence for these distributed algorithms involve verifying condition (2) for the maps inducing them, which allows invoking the results from [2]. Consequently, a great portion of this paper is devoted to proving that (2) holds for these maps, which is far from trivial.

The distributed PCA problem (see [23] for a review of distributed PCA) described in Subsection I-A was addressed in [33] , where an algorithm termed accelerated distributed Sanger’s algorithm (ADSA) was proposed. The authors consider a “mini-batch variant” of Sanger’s algorithm (SA, see [34]) and, inspired by [35], arrive at ADSA. Although no proof of convergence was presented in [33], very recent work by the same authors proves convergence of their algorithm [36]. Our contributions in this context are twofold: we show that ADSA is recovered by applying DBPI to SA, and that condition (2) holds for SA, thus, the guarantees of local convergence follow directly as a consequence of the results in [2]. We mention that no computer simulations of ADSA are presented in this work, since these can be found in [33].

The problem presented in Subsection I-B was addressed in [31], where (1) is regarded as a finite mixture model [37]. To estimate μ\mu^{\star}, the authors proposed a distributed version of the expectation-maximization (EM [38]) algorithm, termed diffusion-averaging distributed EM (DA-DEM). However, DA-DEM, very much in the spirit of [39, 40], uses a diminishing step-size to achieve convergence, leading to a sublinear convergence rate. Our contribution is an algorithm for this problem that extends a slightly modified version of the centralized EM algorithm to distributed settings. The key challenge is to show that we can “expect” condition (2) to hold, and we dedicate a considerable amount of effort to this endeavor. We use the term “expect”, since the operator underlying DBPI depends on the observed samples and, therefore, the existence of an attractor is a probabilistic question. Finally, we compare our algorithm with DA-DEM through Monte Carlo simulations, confirming the linear convergence rate of our algorithm and the sublinear convergence of DA-DEM.

There is considerable work on the “probabilistic linear convergence” of EM [41], [42], [43]. However, neither the results in [41], nor those in [42] encompass the mixture model underlying (1). The mixture of regressions presented in [43] bears some similarity with the model underlying (1), but it is not the same: in [43], pp is fixed at 1/21/2 and Zn{1,1}Z_{n}\in\{-1,1\} (rather than {0,1}\{0,1\}), thus there are no measurements that are just noise. Furthermore, [43] is primarily concerned with statistical guarantees for the error with respect to the ground truth, while we address the goal of establishing (2).

As mentioned in [31], there are two other relevant works on distributed EM, namely, [44] and [45]. However (see [31]), both these works address a different problem of Gaussian mixture density estimation. Moreover, in the case of [44], the algorithm demands a cyclic network topology, and, in [45] the algorithm requires higher computational load on each node, since it is based on alternating direction method of multipliers.

To summarize, we show that ADSA [33] is an instance of DBPI and propose an algorithm to solve the mixture model underlying (1), also as an instance of DBPI. Consequently, their corresponding guarantees of local linear convergence result from the attractor condition (2) for the map underlying the corresponding centralized counterparts. We compare DA-DEM and our proposed algorithm through numerical Monte Carlo simulations, and the results confirm the linear convergence of our algorithm and the sublinear convergence of DA-DEM.

I-E Organization of the Paper

Section II briefly reviews the DBPI proposed in [2] and the main convergence result therein proved. The characterization of the fixed points of the “mini-batch” variant of Sanger’s algorithm, as well as the attractor condition, are presented in Section III. Section IV describes the centralized variant of EM underlying the proposed distributed algorithm for the problem in Subsection I-B, presents the verification of the attractor condition, and reports the results of simulations comparing DBPI with DA-DEM.

I-F Notation

The set of real nn dimensional vectors with positive components is denoted by >0n\mathbb{R}^{n}_{>0}. Matrices and vectors are denoted by upper and lower case letters, respectively. The spectral radius of a matrix AA is denoted by ρ(A)\rho(A) and its Frobenius norm by AF\|A\|_{F}. Given a map HH, 𝐉H(x)\mathbf{J}_{H}(x), and dH(x)dH(x) denote, respectively, the Jacobian of HH at xx and the differential of HH at xx. Given a vector vv, vsv_{s} denotes its ssth component; given a matrix AA, AstA_{st} denotes the element on the ssth row and ttth column and ATA^{T} its transpose. The dd-dimensional identity matrix is denoted by IdI_{d}, 𝟏d\mathbf{1}_{d} is the dd-dimensional vector of ones, and 𝟎m,n\mathbf{0}_{m,n} is the m×nm\times n matrix of zeros. Whenever convenient, we will denote a vector with two stacked blocks, [vT,uT]T[v^{T},u^{T}]^{T}, simply as (v,u)(v,u). Given a square matrix AA, 𝒰(A)\mathcal{U}(A) is an upper triangular matrix of the same dimension as AA and whose upper triangular part coincides with that of AA. Given a norm \|\cdot\|, B¯δ,θ\bar{B}^{\|\cdot\|}_{\delta,\theta} denotes the closed ball of center θ\theta and radius δ\delta with respect to \|\cdot\|. Random variables and vectors are denoted by upper case letters and, for random variable YY, the probability density (or mass) function of YY is denoted by fYf_{Y}. The probability density of a Gaussian of mean μ\mu and variance σ2\sigma^{2} is denoted by 𝒩(|μ,σ2)\mathcal{N}(\cdot|\mu,\sigma^{2}).

II Review of Distributed Banach-Picard Iteration

Consider a network of NN agents, where the interconnection structure is represented by an undirected connected graph: the nodes correspond to the agents and an edge between two agents indicates they can communicate (are neighbours). In the scenario considered in [2], each agent n{1,,N}n\in\{1,...,N\} holds an operator Hn:ddH_{n}:\mathbb{R}^{d}\to\mathbb{R}^{d}, and the goal is to compute a fixed point of the average operator

H=1Nn=1NHn.\displaystyle H=\frac{1}{N}\sum_{n=1}^{N}H_{n}. (3)

Each agent nn is restricted to performing computations involving HnH_{n} and communicating with its neighbours.

Our only assumption about HH is the existence of a locally attractive fixed point xx^{\star}, i.e., satisfying (2).

Let RR be the map on dN\mathbb{R}^{dN} defined, for z=[z1T,,zNT]Tz=[z_{1}^{T},\ldots,z_{N}^{T}]^{T} (with zjdz_{j}\in\mathbb{R}^{d} held by agent jj) by

R(z)=[(H1(z1)z1)T,,(HN(zN)zN)T]T,\displaystyle R(z)=\Big{[}\big{(}H_{1}(z_{1})-z_{1}\big{)}^{T},\ldots,\big{(}H_{N}(z_{N})-z_{N}\big{)}^{T}\Big{]}^{T}, (4)

and let W=W~IdW=\tilde{W}\otimes I_{d}, where W~\tilde{W} is the so-called Metropolis weight matrix associated to the communication graph [46]. The algorithm proposed in [2] is presented in Algorithm 1, where α>0\alpha\in\mathbb{R}_{>0}.

Algorithm 1 Distributed Banach-Picard Iteration (DBPI)
1:  Initialization:
z0\displaystyle z^{0} dN,\displaystyle\in\mathbb{R}^{dN},
z1\displaystyle z^{1} =Wz0+αR(z0),\displaystyle=Wz^{0}+\alpha R(z^{0}),
2:  Update:
zk+2=(I+W)zk+1I+W2zk+α(R(zk+1)R(zk)).z^{k+2}=(I+W)z^{k+1}-\frac{I+W}{2}z^{k}+\alpha\big{(}R(z^{k+1})-R(z^{k})\big{)}.

Informally, in [2], we show that α\alpha can be chosen such that if zkz^{k} gets sufficiently close to 𝟏x\mathbf{1}\otimes x^{\star}, then it converges to 𝟏x\mathbf{1}\otimes x^{\star} at least linearly (the precise statement and proof can be found in [2]). Notice that zz being equal to 𝟏x\mathbf{1}\otimes x^{\star} means that all agents are in consensus, holding a copy of the fixed point xx^{\star}.

III Distributed PCA

III-A Algorithm

We obtain a distributed algorithm for solving the PCA problem described in section I-A as an instantiation of DBPI by introducing a map HH with a fixed point at the desired solution. Moreover, the guarantees of local linear convergence follow as a result of verifying (2).

The “mini batch variant” of Sanger’s algorithm (SA) proposed in [33] and inspired by [34] is the Banach-Picard iteration

Xk+1=H(Xk),\displaystyle X^{k+1}=H(X^{k}),

where H:d×md×mH:\mathbb{R}^{d\times m}\to\mathbb{R}^{d\times m} is given by

H(X)=X+η(CXX𝒰(XTCX)),\displaystyle H(X)=X+\eta\Big{(}CX-X\mathcal{U}\big{(}X^{T}CX\big{)}\Big{)}, (5)

and 𝒰\mathcal{U} was defined in Subsection I-F. Observe that HH can be written as an average of local maps, i.e.,

H=1Nn=1NHn,\displaystyle H=\frac{1}{N}\sum_{n=1}^{N}H_{n},

where Hn:d×md×mH_{n}:\mathbb{R}^{d\times m}\to\mathbb{R}^{d\times m} is defined by

Hn(X)=X+η(NMCnXX𝒰(XTNMCnX)).\displaystyle H_{n}(X)=X+\eta\Big{(}\frac{N}{M}C_{n}X-X\mathcal{U}\big{(}X^{T}\frac{N}{M}C_{n}X\big{)}\Big{)}. (6)

Let RR be the map on (d×m)N\mathbb{R}^{(d\times m)N} defined, for z=[z1T,,zNT]Tz=[z_{1}^{T},\ldots,z_{N}^{T}]^{T}, as in (4), with HnH_{n} as in (6). The distributed algorithm presented in [33], named ADSA, is exactly the DBPI, i.e., Algorithm 1, with this choice of RR.

III-B Convergence: Main Results

The convergence analysis amounts to verifying the attractor condition (2) for HH, thus establishing, as a corollary of the results in [2], the local linear convergence of Algorithm 1 with each HnH_{n} in (4) defined as in (6) (equivalently, ADSA).

We start with the following lemma (proved in Appendix D) showing that the solution sought in the PCA problem is a fixed point of HH (as defined in (5)).

Lemma 1.

Let C0C\succ 0. If Xd×mX^{\star}\in\mathbb{R}^{d\times m} satisfies

CX=X𝒰((X)TCX)\displaystyle CX^{\star}=X^{\star}\mathcal{U}((X^{\star})^{T}CX^{\star}) (7)

then, each column of XX^{\star} is either 0 or a unit-norm eigenvector of CC. Moreover, the columns are orthogonal, i.e., (X)TX(X^{\star})^{T}X^{\star} is diagonal with the diagonal elements being either one or zero.

The following theorem guarantees that the Banach-Picard iteration of HH has local linear convergence to its fixed points.

Theorem 1.

Let λ1>>λm>λm+1λd>0\lambda_{1}>\ldots>\lambda_{m}>\lambda_{m+1}\geq\ldots\geq\lambda_{d}>0 be the eigenvalues of CC. Suppose that XX^{\star} is a d×md\times m matrix such Cxi=λixiCx^{\star}_{i}=\lambda_{i}x^{\star}_{i} (where xix_{i}^{\star} denotes the iith column of XX^{\star} and CC is as defined in (5)), for i=1,,mi=1,\ldots,m, and (X)TX=Im(X^{\star})^{T}X^{\star}=I_{m}. Then, there exists η\eta^{\star} such that, for 0<η<η0<\eta<\eta^{\star},

ρ(𝐉H(X))<1.\displaystyle\rho\big{(}\mathbf{J}_{H}(X^{\star})\big{)}<1.
Remark 1.

The invertibility of CC that is assumed in the statements of Lemma 1 and Theorem 1 is not a big restriction. In fact, if C0C\succeq 0 rather than C0C\succ 0, then C~=C+ϵI\tilde{C}=C+\epsilon I satisfies C~0\tilde{C}\succ 0 and has the same eigenvectors as CC.

III-C Proof of Theorem 1

First, note that H(X)=I+ηS(X)H(X)=I+\eta S(X), where

S(X)=CX𝒰(XTCX).\displaystyle S(X)=CX-\mathcal{U}\big{(}X^{T}CX\big{)}.

This implies 𝐉H(X)=I+η𝐉S(X)\mathbf{J}_{H}(X^{\star})=I+\eta\mathbf{J}_{S}(X^{\star}), and, as a consequence, each eigenvalue of 𝐉H(X)\mathbf{J}_{H}(X^{\star}) is of the form 1+ηβ1+\eta\beta, with β\beta being an eigenvalue of 𝐉S(X)\mathbf{J}_{S}(X^{\star}). The idea is to show that these eigenvalues β\beta of 𝐉S(X)\mathbf{J}_{S}(X^{\star}) enjoy a key property: they are real-valued and negative, β<0\beta<0. Such property means that, for sufficiently small η>0\eta>0, we have |1+ηβ|<1|1+\eta\beta|<1. To establish this key property, we divide the proof of Theorem 1 in two lemmas: Lemma 2 and Lemma 3.

Lemma 2 will show that the eigenvalues of 𝐉S(X)\mathbf{J}_{S}(X^{\star}) coincide with those of the linear map from d×m\mathbb{R}^{d\times m} to d×m\mathbb{R}^{d\times m} given by

WD^WWDA𝒰(DATW+WTAD),\displaystyle W\to\hat{D}W-WD-A\mathcal{U}(DA^{T}W+W^{T}AD), (8)

where

D=diag(λ1,,λm),\displaystyle D=\mbox{diag}(\lambda_{1},\ldots,\lambda_{m}), (9)
D^=(λ1,,λm,λm+1,,λd),\displaystyle\hat{D}=(\lambda_{1},\ldots,\lambda_{m},\lambda_{m+1},\ldots,\lambda_{d}), (10)

and

A=[Im𝟎dm,m].\displaystyle A=\begin{bmatrix}I_{m}\\ \mathbf{0}_{d-m,m}\end{bmatrix}. (11)

Lemma 3 will show that the eigenvalues of (8) are real and negative.

Lemma 2.

Let XX^{\star} satisfy the conditions of Theorem 1. The eigenvalues of 𝐉S(X)\mathbf{J}_{S}(X^{\star}), where

S:d×m\displaystyle S:\mathbb{R}^{d\times m} d×m\displaystyle\to\mathbb{R}^{d\times m}
X\displaystyle X CX𝒰(XTCX)\displaystyle\to CX-\mathcal{U}(X^{T}CX)

coincide with those of the linear map given by

WD^WWDA𝒰(DATW+WTAD),\displaystyle W\to\hat{D}W-WD-A\mathcal{U}(DA^{T}W+W^{T}AD),

with DD, D^\hat{D}, and AA given by, respectively, (9), (10), and (11).

Proof.

From the rules of matrix differential calculus (see [47] and [48]), the differential of SS at XX, denoted by dS(X)dS(X), is the linear map

dXCdX(dX)𝒰(XTCX)Xd(𝒰(XTCX))(X).\displaystyle dX\to CdX-(dX)\mathcal{U}(X^{T}CX)-Xd\big{(}\mathcal{U}(X^{T}CX)\big{)}(X). (12)

Observe that 𝒰\mathcal{U} is a linear map, thus the composition rule for differentials further yields

d(𝒰(XTCX))(X)=𝒰((dX)TCX+XTCdX).\displaystyle d\big{(}\mathcal{U}(X^{T}CX)\big{)}(X)=\mathcal{U}\big{(}(dX)^{T}CX+X^{T}CdX\big{)}. (13)

By assumption, CX=XDCX^{\star}=X^{\star}D with DD given by (9), thus combining this with (12) and (13), dS(X)dS(X^{\star}), which we denote by S^\hat{S} to simplify the notation, is given by

S^(dX)\displaystyle\hat{S}(dX) =CdX(dX)D\displaystyle=CdX-(dX)D
X𝒰((dX)TXD+D(X)TdX).\displaystyle\;\;\;\;-X^{\star}\mathcal{U}\big{(}(dX)^{T}X^{\star}D+D(X^{\star})^{T}dX\big{)}.

The eigenvalues of 𝐉S(X)\mathbf{J}_{S}(X^{\star}) coincide with those of dS(X)=S^dS(X^{\star})=\hat{S}, under the identification between Jacobians and differentials (see [47]); hence, we will study the eigenvalues of the latter.

Let X^\hat{X}^{\star} be an extension of XX^{\star} to an orthonormal basis of eigenvectors of CC, i.e., (X^)TX^=Id(\hat{X}^{\star})^{T}\hat{X}^{\star}=I_{d} and CX^=X^D^C\hat{X}^{\star}=\hat{X}^{\star}\hat{D}, where D^\hat{D} is given by (10). To understand the eigenvalues of S^\hat{S}, consider the linear map given by

V(dX)X^dX\displaystyle V(dX)\to\hat{X}^{\star}dX (14)

and observe that VV is an invertible linear map (in fact V1(dX)=(X^)TdXV^{-1}(dX)=(\hat{X}^{\star})^{T}dX). Eigenvalues are invariant under a similarity transformation, hence, the eigenvalues of S^\hat{S} coincide with those of V1S^VV^{-1}\circ\hat{S}\circ V which, after renaming dXdX by WW, just amounts to the linear map

WD^WWDA𝒰(DATW+WTAD),\displaystyle W\to\hat{D}W-WD-A\mathcal{U}(DA^{T}W+W^{T}AD),

with DD, D^\hat{D}, and AA given by, respectively, (9), (10), and (11). ∎

In the proof of the following lemma it is crucial that the eigenvalues are in decreasing order.

Lemma 3.

Let DD, D^\hat{D}, and AA be defined, respectively, by (9), (10), and (11). The eigenvalues of the linear map from d×m\mathbb{R}^{d\times m} to d×m\mathbb{R}^{d\times m} defined by

WD^WWDA𝒰(DATW+WTAD)\displaystyle W\to\hat{D}W-WD-A\mathcal{U}(DA^{T}W+W^{T}AD) (15)

are real and negative.

Proof.

Let ZZ be an eigenvector (note that ZZ is in fact a matrix) of (15) associated to the eigenvalue β\beta, i.e.,

D^ZZDA𝒰(DATZ+ZTAD)=βZ;\displaystyle\hat{D}Z-ZD-A\mathcal{U}(DA^{T}Z+Z^{T}AD)=\beta Z; (16)

next, we show that β<0\beta<0.

Consider a block partition of ZZ of the form

Z=[Z~Z¯],\displaystyle Z=\begin{bmatrix}\tilde{Z}\\ \bar{Z}\end{bmatrix},

where Z~\tilde{Z} and Z¯\bar{Z} are, respectively, m×mm\times m and (dm)×m(d-m)\!\times\!m matrices. The eigenvalue matrix equation (16) induces the following system of matrix equations

DZ~Z~D𝒰(DZ~+Z~TD)\displaystyle D\tilde{Z}-\tilde{Z}D-\mathcal{U}(D\tilde{Z}+\tilde{Z}^{T}D) =βZ~,\displaystyle=\beta\tilde{Z}, (17)
D¯Z¯Z¯D\displaystyle\bar{D}\bar{Z}-\bar{Z}D =βZ¯,\displaystyle=\beta\bar{Z}, (18)

where D¯=diag(λm+1,,λd)\bar{D}=\mbox{diag}(\lambda_{m+1},\ldots,\lambda_{d}).

There are two non-mutually-exclusive cases to consider: Z~0\tilde{Z}\neq 0 or Z¯0\bar{Z}\neq 0 (Z0Z\neq 0, by virtue of being an eigenvector).

Case 1

Suppose that Z¯st0\bar{Z}_{st}\neq 0. Then, (18) implies that

λm+sZ¯stλtZ¯st=βZ¯st,\displaystyle\lambda_{m+s}\bar{Z}_{st}-\lambda_{t}\bar{Z}_{st}=\beta\bar{Z}_{st},

and, hence, β=λm+sλt<0\beta=\lambda_{m+s}-\lambda_{t}<0.

Case 2

Suppose that Z~st0\tilde{Z}_{st}\neq 0. This case splits in two: either s>ts>t or sts\leq t. If s>ts>t, then (17) and the “upper triangularization” operation yields

λsZ~stλtZ~st=βZ~st,\displaystyle\lambda_{s}\tilde{Z}_{st}-\lambda_{t}\tilde{Z}_{st}=\beta\tilde{Z}_{st}, (19)

which, after dividing by Z~st\tilde{Z}_{st}, yields λsλt=β<0\lambda_{s}-\lambda_{t}=\beta<0. If sts\leq t, then,

βZ~st\displaystyle\beta\tilde{Z}_{st} =λsZ~stλtZ~st𝒰(DZ~+Z~TD)st\displaystyle=\lambda_{s}\tilde{Z}_{st}-\lambda_{t}\tilde{Z}_{st}-\mathcal{U}(D\tilde{Z}+\tilde{Z}^{T}D)_{st}
=λsZ~stλtZ~stλsZ~stλtZ~ts\displaystyle=\lambda_{s}\tilde{Z}_{st}-\lambda_{t}\tilde{Z}_{st}-\lambda_{s}\tilde{Z}_{st}-\lambda_{t}\tilde{Z}_{ts}
=λt(Z~st+Z~ts).\displaystyle=-\lambda_{t}(\tilde{Z}_{st}+\tilde{Z}_{ts}).

Next, notice that if s<ts<t, then Z~ts\tilde{Z}_{ts} can be assumed to be 0, since, otherwise, we could deal with it as in (19) with the roles of ss and tt reversed to conclude β<0\beta<0. Hence, assuming Z~ts=0\tilde{Z}_{ts}=0, we obtain, after division by Z~st\tilde{Z}_{st}, that β=λt<0.\beta=-\lambda_{t}<0. Finally, if s=ts=t, then β=2λt<0.\beta=-2\lambda_{t}<0.

IV Parameter estimation with noisy measurements

IV-A Roadmap

This is a rather long section, hence the need for a road map. The analysis of (1) is simplified if the measurements are identically distributed besides being just independent and, therefore, we start by introducing a probability distribution on the vectors hnh_{n} and a joint model on (Y,H)(Y,H).

To estimate μ\mu^{\star} (pp^{\star} and σ\sigma^{\star} are treated as nuisance parameters), we consider the stationary equations imposed by equating to zero the gradient of the log-likelihood function, a necessary condition satisfied by the maximum likelihood estimator (MLE). Once the particular form of the stationary equations is realized, we reformulate them as a fixed point equation of the form g1g2(θ)=θg_{1}\circ g_{2}(\theta^{\star})=\theta^{\star} that naturally suggests the Banach-Picard iteration

θk+1=g1g2(θk).\displaystyle\theta^{k+1}=g_{1}\circ g_{2}(\theta^{k}).

Observing that the map g1g2g_{1}\circ g_{2} cannot be written as an average of local maps, we switch to the map H=g2g1H=g_{2}\circ g_{1}, which can be implicitly written as an average of local maps. With this map, we arrive at a distributed algorithm by considering the map RR (see section II) arising from HH and appealing to Algorithm 1.

Finally, we observe that the existence of a fixed point of H=g2g1H=g_{2}\circ g_{1} satisfying (2) follows from the existence of a fixed point of g1g2g_{1}\circ g_{2} satisfying (2). The final part of the section is thus devoted to verifying (2) for the map g1g2g_{1}\circ g_{2}, and to a numerical simulation comparing Algorithm 1 and DA-DEM from [31].

IV-B Identically distributed observations

Let θ=(μ,p,(σ)2)Θ=d×(0,1)×(0,+)\theta^{\star}=(\mu^{\star},p^{\star},(\sigma^{\star})^{2})\in\Theta=\mathbb{R}^{d}\times(0,1)\times(0,+\infty) be an unknown and fixed vector which we term the ground truth.

The agents’ measurements are assumed to be independent (see (1)); however, they are not identically distributed, given the presence of the vectors hnh_{n} in (1). To address this issue, let Z{0,1}Z\in\{0,1\}, HdH\in\mathbb{R}^{d}, and YY\in\mathbb{R} be, respectively, a binary random variable, a random vector, and a real random variable. Suppose the joint density on (Y,H,Z)(Y,H,Z) factors as

fY,H,Z(y,h,z|θ)=fH(h)fZ(z|p)fY|H,Z(y|h,z,μ,(σ)2),\displaystyle f_{Y,H,Z}\big{(}y,h,z|\theta^{\star}\big{)}=f_{H}(h)f_{Z}(z|p^{\star})f_{Y|H,Z}\big{(}y|h,z,\mu^{\star},(\sigma^{\star})^{2}\big{)}, (20)

where

fH(h)\displaystyle f_{H}(h) =𝒩(h|0,Id),\displaystyle=\mathcal{N}(h|0,I_{d}),
fZ(z|p)\displaystyle f_{Z}(z|p^{\star}) =(p)z(1p)1z,\displaystyle=(p^{\star})^{z}(1-p^{\star})^{1-z},

and

fY|H,Z(y|h,z,θ)=𝒩(y|hTμ,(σ)2)z𝒩(y|0,(σ)2)1z.\displaystyle f_{Y|H,Z}\big{(}y|h,z,\theta^{\star}\big{)}=\mathcal{N}\big{(}y|h^{T}\mu^{\star},(\sigma^{\star})^{2}\big{)}^{z}\mathcal{N}\big{(}y|0,(\sigma^{\star})^{2}\big{)}^{1-z}. (21)

Instead of assuming that the hnh_{n} are fixed as in [31], we assume that each sensor nn has a measurement (yn,hn)(y_{n},h_{n}), where (yn,hn,zn)(y_{n},h_{n},z_{n}) was drawn from (20), but agent nn has no knowledge of znz_{n}. After marginalization, the joint density of Y,HY,H is given by

fY,H(y,h|θ)=fH(h)fY|H(y|h,θ)\displaystyle f_{Y,H}\big{(}y,h|\theta^{\star})=f_{H}(h)f_{Y|H}(y|h,\theta^{\star}) (22)
=\displaystyle= fH(h)(p𝒩(y|hTμ,(σ)2)+(1p)𝒩(y|0,(σ)2)),\displaystyle f_{H}(h)\Big{(}p^{\star}\mathcal{N}\big{(}y|h^{T}\mu^{\star},(\sigma^{\star})^{2}\big{)}+(1-p^{\star})\mathcal{N}\big{(}y|0,(\sigma^{\star})^{2}\big{)}\Big{)},

which is a mixture model [49].

To estimate μ\mu^{\star}, the agents seek θΘ\theta\in\Theta such that

1Nn=1Nθϕ(yn,hn,θ)=0,\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}\phi(y_{n},h_{n},\theta)=0, (23)

where ϕ\phi is the log-likelihood of (Y,H)(Y,H), i.e.,

ϕ(y,h,θ)\displaystyle\phi(y,h,\theta) =log(fY,H(y,h|θ)\displaystyle=\log(f_{Y,H}(y,h|\theta) (24)
=log(fH(h))+log(fY|H(y|h,θ)).\displaystyle=\log\big{(}f_{H}(h)\big{)}+\log\big{(}f_{Y|H}(y|h,\theta)\big{)}.

Since fH(h)f_{H}(h) does not depend on θ\theta,

θϕ(y,h,θ)=θlog(fY|H(y|h,θ));\nabla_{\theta}\phi(y,h,\theta)=\nabla_{\theta}\log\big{(}f_{Y|H}(y|h,\theta)\big{)};

in other words, (23) is a necessary condition satisfied by the MLE corresponding to the log-likelihood function logfY|H(y|h,θ)\log f_{Y|H}(y|h,\theta), thus independent of fHf_{H}.

IV-C Gradient of ϕ\phi and the centralized algorithm

Before explicitly writing the stationary equations corresponding to (23), we introduce the responsibility functions [49],

r(y,h,θ)=p𝒩(y|hTμ,σ2)p𝒩(y|hTμ,σ2)+(1p)𝒩(y|0,σ2).r(y,h,\theta)=\frac{p\mathcal{N}(y|h^{T}\mu,\sigma^{2})}{p\mathcal{N}(y|h^{T}\mu,\sigma^{2})+(1-p)\mathcal{N}(y|0,\sigma^{2})}. (25)

Notice that r(y,h,θ)=(z=1|y,h,θ)r(y,h,\theta)=\mathbb{P}(z=1|y,h,\theta), the posterior probability that the observation yy was not a result of measuring only noise.

For reasons that will be clear later, the following set of equalities, which can be easily verified, will be convenient:

σ2μϕ(y,h,θ)\displaystyle\sigma^{2}\nabla_{\mu}\phi(y,h,\theta) =r(y,h,θ)(yhTμ)h\displaystyle=r(y,h,\theta)(y-h^{T}\mu)h (26)
p+p(1p)ϕp(y,h,θ)\displaystyle p+p(1-p)\frac{\partial\phi}{\partial p}(y,h,\theta) =r(y,h,θ)\displaystyle=r(y,h,\theta)
σ2+2(σ2)2ϕσ2(y,h,θ)\displaystyle\sigma^{2}+2(\sigma^{2})^{2}\frac{\partial\phi}{\partial\sigma^{2}}(y,h,\theta) =r(y,h,θ)(yhTμ)2\displaystyle=r(y,h,\theta)(y-h^{T}\mu)^{2}
+(1r(y,h,θ))y2.\displaystyle+\big{(}1-r(y,h,\theta)\big{)}y^{2}.

Using (26), (23) can be explicitly written as

(1Nn=1NΓ(yn,hn,θ))μ\displaystyle\Big{(}\frac{1}{N}\sum_{n=1}^{N}\Gamma(y_{n},h_{n},\theta)\Big{)}\mu =1Nn=1Nψ(yn,hn,θ)\displaystyle=\frac{1}{N}\sum_{n=1}^{N}\psi(y_{n},h_{n},\theta) (27)
p\displaystyle p =1Nn=1Nr(yn,hn,θ)\displaystyle=\frac{1}{N}\sum_{n=1}^{N}r(y_{n},h_{n},\theta) (28)
σ2\displaystyle\sigma^{2} =1Nn=1Nγ(yn,hn,θ),\displaystyle=\frac{1}{N}\sum_{n=1}^{N}\gamma(y_{n},h_{n},\theta), (29)

where

Γ(y,h,θ)\displaystyle\Gamma(y,h,\theta) =r(y,h,θ)hhT,\displaystyle=r(y,h,\theta)hh^{T},
ψ(y,h,θ)\displaystyle\psi(y,h,\theta) =r(y,h,θ)yh,\displaystyle=r(y,h,\theta)yh,
γ(y,h,θ)\displaystyle\gamma(y,h,\theta) =r(y,h,θ)(yhTμ)2+(1r(y,h,θ))y2.\displaystyle=r(y,h,\theta)(y-h^{T}\mu)^{2}+\big{(}1-r(y,h,\theta)\big{)}y^{2}.

If the matrix 1Nn=1NΓ(yn,hn,θ)\frac{1}{N}\sum_{n=1}^{N}\Gamma(y_{n},h_{n},\theta) is invertible, then (27)-(29) can be written as a fixed point equation.222The invertibility of this matrix is assumed throughout the rest of the paper. In fact, if NN is sufficiently large - greater than dd - this happens with probability one. This constitutes the motivation for the centralized algorithm that we suggest next (see Algorithm 2) and from which we will derive the distributed version; observe that it is the Banach-Picard iteration motivated by (27)-(29).

Another way to write (30)-(32) (see Algorithm 2 below) is θk+1=g1g2(θk)\theta^{k+1}=g_{1}\circ g_{2}(\theta^{k}), where

g2(θ)=1N(\displaystyle g_{2}(\theta)=\frac{1}{N}\Big{(} n=1NΓ(yn,hn,θ),n=1Nψ(yn,hn,θ),\displaystyle\sum_{n=1}^{N}\Gamma(y_{n},h_{n},\theta),\sum_{n=1}^{N}\psi(y_{n},h_{n},\theta),
n=1Nr(yn,hn,θ),n=1Nγ(yn,hn,θ))\displaystyle\sum_{n=1}^{N}r(y_{n},h_{n},\theta),\sum_{n=1}^{N}\gamma(y_{n},h_{n},\theta)\Big{)}

and

g1(Γ,ψ,p,σ2)=\displaystyle g_{1}(\Gamma,\psi,p,\sigma^{2})= (Γ1ψ,p,σ2).\displaystyle\big{(}\Gamma^{-1}\psi,p,\sigma^{2}\big{)}.
Algorithm 2 Centralized variant of EM
  Initialization:
θ0=(μ0,p0,(σ0)2)Θ\displaystyle\theta^{0}=\big{(}\mu^{0},p^{0},(\sigma^{0})^{2}\big{)}\in\Theta
  Update: θk+1=(μk+1,pk+1,(σk+1)2)\theta^{k+1}=\big{(}\mu^{k+1},p^{k+1},(\sigma^{k+1})^{2}\big{)}, where
μk+1\displaystyle\mu^{k+1} =(1Nn=1NΓ(yn,hn,θk))11Nn=1Nψ(yn,hn,θk)\displaystyle=\Big{(}\frac{1}{N}\sum_{n=1}^{N}\Gamma(y_{n},h_{n},\theta^{k})\Big{)}^{-1}\frac{1}{N}\sum_{n=1}^{N}\psi(y_{n},h_{n},\theta^{k}) (30)
pk+1\displaystyle p^{k+1} =1Nn=1Nr(yn,hn,θk)\displaystyle=\frac{1}{N}\sum_{n=1}^{N}r(y_{n},h_{n},\theta^{k}) (31)
(σk+1)2\displaystyle(\sigma^{k+1})^{2} =1Nn=1Nγ(yn,hn,θk).\displaystyle=\frac{1}{N}\sum_{n=1}^{N}\gamma(y_{n},h_{n},\theta^{k}). (32)

IV-D Distributed Algorithm

Although, g2g_{2} is an average of local maps, the map g1g2g_{1}\circ g_{2} is not, due to the matrix inversion in (30). As a consequence, (30)-(32) cannot be directly extended to a distributed algorithm. However, switching the order of g1g_{1} and g2g_{2} results in a map that can be implicitly written as an average of local maps. In fact, instead of the iteration θk+1=g1g2(θk)\theta^{k+1}=g_{1}\circ g_{2}(\theta^{k}), consider the iteration

zk+1=H(zk),\displaystyle z^{k+1}=H(z^{k}),

where H=g2g1H=g_{2}\circ g_{1} and z=(Γ,ψ,p,σ2)z=(\Gamma,\psi,p,\sigma^{2}).

Let

Gn(θ)=(\displaystyle G_{n}(\theta)=\Big{(} Γ(yn,hn,θ),ψ(yn,hn,θ),\displaystyle\Gamma\big{(}y_{n},h_{n},\theta\big{)},\psi\big{(}y_{n},h_{n},\theta\big{)},
r(yn,hn,θ),γ(yn,hn,θ)),\displaystyle r\big{(}y_{n},h_{n},\theta\big{)},\gamma\big{(}y_{n},h_{n},\theta\big{)}\Big{)},

and it follows that H=1Nn=1NHnH=\frac{1}{N}\sum_{n=1}^{N}H_{n}, where

Hn(z)=Gng1(z).\displaystyle H_{n}(z)=G_{n}\circ g_{1}(z). (33)

To conclude, the distributed algorithm we suggest amounts to Algorithm 1, with R:(d2+d+2)N(d2+d+2)NR:\mathbb{R}^{(d^{2}+d+2)N}\to\mathbb{R}^{(d^{2}+d+2)N} defined, for z=(z1T,,zNT)Tz=(z_{1}^{T},\ldots,z_{N}^{T})^{T}, as in (4), and HnH_{n} as in (33). Additionally, following [31], we suggest the initialization

zn0=m=1NW~nmGm(ynhnhnThn,12,yn22).\displaystyle z_{n}^{0}=\sum_{m=1}^{N}\tilde{W}_{nm}G_{m}\big{(}\frac{y_{n}h_{n}}{h_{n}^{T}h_{n}},\frac{1}{2},\frac{y_{n}^{2}}{2}\big{)}. (34)

Some remarks are due:

a)

The existence of a fixed point of g1g2g_{1}\circ g_{2} satisfying (2) is addressed in section IV-E;

b)

The existence of a fixed point of g2g1g_{2}\circ g_{1} satisfying (2) follows from the existence of a fixed point of g1g2g_{1}\circ g_{2} satisfying (2), by the chain rule;

c)

Expanding (32) yields

(σk+1)2\displaystyle(\sigma^{k+1})^{2} =1Nn=1Nr(yn,hn,θk)(ynhnTμk)2\displaystyle=\frac{1}{N}\sum_{n=1}^{N}r(y_{n},h_{n},\theta^{k})(y_{n}-h_{n}^{T}\mu^{k})^{2}
+(1r(yn,hn,θk))yn2,\displaystyle+\big{(}1-r(y_{n},h_{n},\theta^{k})\big{)}y_{n}^{2},

and, if the update rule is modified according to

(σk+1)2\displaystyle(\sigma^{k+1})^{2} =1Nn=1Nr(yn,hn,θk)(ynhnTμk+1)2\displaystyle=\frac{1}{N}\sum_{n=1}^{N}r(y_{n},h_{n},\theta^{k})(y_{n}-h_{n}^{T}\mu^{k+1})^{2}
+(1r(yn,hn,θk))yn2,\displaystyle+\big{(}1-r(y_{n},h_{n},\theta^{k})\big{)}y_{n}^{2},

then, a straightforward manipulation recovers the EM algorithm presented in [31]. Moreover, the EM algorithm derived in [31] is still amenable to a distributed implementation using Algorithm 1. However, we found it easier to prove (2) for this variant of EM, than for the standard EM.

IV-E Convergence Analysis

The proof of local linear convergence of the centralized variant of EM, i.e., Algorithm 2, is not trivial. In fact, this question is probabilistic in nature, because updates (30)-(32) depend on observations that are, in turn, samples from a probability distribution. Before presenting the main convergence result (Theorem 3), we need to introduce some definitions and only one mild assumption that is instrumental in the proof of Lemma 4 below: the Fisher information at θ\theta^{\star}, given by,

I(θ)=𝔼θ[θϕ(y,h,θ)(θϕ(y,h,θ))T],\displaystyle\mbox{I}(\theta^{\star})=\mathbb{E}_{\theta^{\star}}\Big{[}\nabla_{\theta}\phi(y,h,\theta^{\star})\big{(}\nabla_{\theta}\phi(y,h,\theta^{\star})\big{)}^{T}\Big{]}, (35)

is non-singular.

Let TN=g1g2T_{N}=g_{1}\circ g_{2} denote the map underlying the Banach-Picard iteration (30)-(32).333The subscript NN emphasizes that TNT_{N} depends on NN observations. A straightforward manipulation, using (26), shows that

TN(θ)=θ+(AN(θ))11Nn=1Nθϕ(yn,hn,θ),\displaystyle T_{N}(\theta)=\theta+\big{(}A_{N}(\theta)\big{)}^{-1}\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}\phi(y_{n},h_{n},\theta),

where

AN(θ)=[1Nn=1N1σ2Γ(yn,hn,θ)𝟎𝟎𝟎1p(1p)0𝟎012(σ2)2].\displaystyle A_{N}(\theta)=\begin{bmatrix}\frac{1}{N}\sum_{n=1}^{N}\frac{1}{\sigma^{2}}\Gamma(y_{n},h_{n},\theta)&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\frac{1}{p(1-p)}&0\\ \mathbf{0}&0&\frac{1}{2(\sigma^{2})^{2}}\end{bmatrix}.

Before stating the main result, we introduce the “infinite sample” map, i.e.,

T(θ)=θ+(A(θ))1(θ),\displaystyle T(\theta)=\theta+\big{(}A(\theta)\big{)}^{-1}\mathcal{L}(\theta),

where

A(θ)=[𝔼θ[1σ2Γ(y,h,θ)]𝟎𝟎𝟎1p(1p)0𝟎012(σ2)2],\displaystyle A(\theta)=\begin{bmatrix}\mathbb{E}_{\theta^{\star}}\Big{[}\frac{1}{\sigma^{2}}\Gamma(y,h,\theta)\Big{]}&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\frac{1}{p(1-p)}&0\\ \mathbf{0}&0&\frac{1}{2(\sigma^{2})^{2}}\end{bmatrix},

and

(θ)=𝔼θ[ϕ(y,h,θ)].\mathcal{L}(\theta)=\mathbb{E}_{\theta^{\star}}\big{[}\nabla\phi(y,h,\theta)\big{]}.

The next lemma shows that TT is a “natural” map to consider.

Lemma 4.

The “infinite sample map”, i.e., TT, has the following properties

a)

For fixed θ\theta, TN(θ)T_{N}(\theta) converges in probability to T(θ)T(\theta);

b)

The ground truth θ\theta^{\star} is a fixed point of TT, i.e. T(θ)=θT(\theta^{\star})=\theta^{\star};

c)

The attractor condition (2) holds for TT at θ\theta^{\star}, i.e.,

ρ(𝐉T(θ))<1.\displaystyle\rho\big{(}\mathbf{J}_{T}(\theta^{\star})\big{)}<1.
Proof.

The proof of b) amounts to a straightforward verification and the proof of a) follows by the weak law of large numbers, hence, we will focus on the proof of c) which relies on the principle of missing information (see [50], page 101), and the assumption on the Fisher Information condition, i.e., (35).

Under suitable regularity conditions (see appendix A) that hold for the model (LABEL:JointOnYandH),

𝔼θ[θ2ϕ(y,h,θ)]=I(θ).\mathbb{E}_{\theta^{\star}}\Big{[}\nabla^{2}_{\theta}\phi(y,h,\theta^{\star})\Big{]}=-\mbox{I}(\theta^{\star}).

Additionally, a simple calculation reveals that A(θ)A(\theta^{\star}) coincides with the Fisher information of the complete data model (20), i.e.,

A(θ)=Ic(θ).A(\theta^{\star})=\mbox{I}_{c}(\theta^{\star}).

The non-singularity assumption on I(θ)\mbox{I}(\theta^{\star}) (see (35)), together with the principle of missing information (see [50], page 101), implies that

0I(θ)Ic(θ).0\prec\mbox{I}(\theta^{\star})\preceq\mbox{I}_{c}(\theta^{\star}).

All these observations show that

𝐉T(θ)=I(Ic(θ))1I(θ),\displaystyle\mathbf{J}_{T}(\theta^{\star})=I-\big{(}\mbox{I}_{c}(\theta^{\star})\big{)}^{-1}\mbox{I}(\theta^{\star}), (36)

and, Theorem 7.7.3 of [51], together with 0I(θ)Ic(θ)0\prec\mbox{I}(\theta^{\star})\preceq\mbox{I}_{c}(\theta^{\star}), implies that

ρ(𝐉T(θ))<1,\displaystyle\rho\big{(}\mathbf{J}_{T}(\theta^{\star})\big{)}<1,

concluding that θ\theta^{\star} is an attractor of TT. ∎

We recall that the goal of this section is to show that the probability that TNT_{N} has an attractor approaches 11 as NN tends to infinity; the strategy is to derive this from the existence of an attractor of TT, i.e., c) in Lemma 4. Pointwise convergence in probability, i.e., a) in Lemma 4, is not enough to arrive at this result. In fact, the proof is built on a stronger notion that is a probabilistic version of uniform, rather than pointwise, convergence of maps. This is the content of Theorem 2 below.

Observe that if θN\theta_{N} is a fixed point of TNT_{N}, then

𝐉TN(θN)=I+(AN(θN))11Nn=1Nθ2ϕ(yn,hn,θN),\displaystyle\mathbf{J}_{T_{N}}(\theta_{N})=I+\big{(}A_{N}(\theta_{N})\big{)}^{-1}\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}^{2}\phi(y_{n},h_{n},\theta_{N}),

so let

TN(θ)\displaystyle T_{N}^{\prime}(\theta) =I+(AN(θ))11Nn=1Nθ2ϕ(yn,hn,θ),\displaystyle=I+\big{(}A_{N}(\theta)\big{)}^{-1}\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}^{2}\phi(y_{n},h_{n},\theta),
T(θ)\displaystyle T^{\prime}(\theta) =I+(A(θ))1𝔼θ[θ2ϕ(y,h,θ)].\displaystyle=I+\big{(}A(\theta)\big{)}^{-1}\mathbb{E}_{\theta^{\star}}\Big{[}\nabla^{2}_{\theta}\phi(y,h,\theta)\Big{]}.
Remark 2.

Note that the maps TN(θ)T_{N}^{\prime}(\theta) and T(θ)T^{\prime}(\theta) only coincide with the jacobian maps 𝐉TN(θ)\mathbf{J}_{T_{N}}(\theta) and 𝐉T(θ)\mathbf{J}_{T}(\theta) at fixed points.

The uniform convergence in probability is expressed in the next theorem, whose proof can be found in appendix C. For the statement, recall (see the notation section) that B¯δ,θ\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}} is the closed ball of center θ\theta^{\star} and radius δ\delta, with respect to the metric induced by the norm \|\cdot\|.

Theorem 2.

Let δ>0\delta>0 and \|\cdot\| be any norm. With TNT_{N}, TNT_{N}^{\prime}, TT^{\prime}, and TT as defined above,

supθB¯δ,θTN(θ)T(θ)0\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}(\theta)-T(\theta)\big{\|}\to 0 (37)
supθB¯δ,θTN(θ)T(θ)0\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}^{\prime}(\theta)-T^{\prime}(\theta)\big{\|}\to 0 (38)

both in probability, as NN\to\infty.

We now state the main convergence result.

Theorem 3.

There exists δ>0\delta>0 and a norm \|\cdot\| such that

θ(supθB¯δ,θTN(θ)θδ)\displaystyle\mathbb{P}_{\theta^{\star}}\Big{(}\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}(\theta)-\theta^{\star}\|\leq\delta\Big{)} 1and\displaystyle\to 1\quad\text{and}
θ(supθB¯δ,θTN(θ)<1)\displaystyle\mathbb{P}_{\theta^{\star}}\Big{(}\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}^{\prime}(\theta)\|<1\Big{)} 1,\displaystyle\to 1,

where TN(θ)\|T_{N}^{\prime}(\theta)\| is the induced matrix norm.444The measurability of the maps in this Theorem are a consequence of Proposition 7.32 in [52].

Before presenting the proof, we explain why Theorem 3 encapsulates the notion that, with probability approaching 11, the map TNT_{N} has an attractor. Let

𝒜N\displaystyle\mathcal{A}_{N} ={(𝐲,𝐡)N×dN:supθB¯δ,θTN(θ)θδ}\displaystyle=\big{\{}(\mathbf{y},\mathbf{h})\in\mathbb{R}^{N}\times\mathbb{R}^{dN}:\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}(\theta)-\theta^{\star}\|\leq\delta\big{\}}
N\displaystyle\mathcal{B}_{N} ={(𝐲,𝐡)N×dN:supθB¯δ,θTN(θ)<1}.\displaystyle=\big{\{}(\mathbf{y},\mathbf{h})\in\mathbb{R}^{N}\times\mathbb{R}^{dN}:\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}^{\prime}(\theta)\|<1\big{\}}.
Remark 3.

Informally, observe that the set 𝒜N\mathcal{A}_{N} is the set of “samples” where the ball B¯δ,θ\bar{B}_{\delta,\theta^{\star}}^{\|\cdot\|} is invariant under TNT_{N}, i.e,

TN(B¯δ,θ)B¯δ,θ,T_{N}\big{(}\bar{B}_{\delta,\theta^{\star}}^{\|\cdot\|}\big{)}\subseteq\bar{B}_{\delta,\theta^{\star}}^{\|\cdot\|},

and that the set N\mathcal{B}_{N} is, from remark 2, the set of “samples” where the Jacobian of TNT_{N} satisfies (2) at a fixed points. By noting that a continuous map from a convex compact space into itself has a fixed point (Brouwer’s fixed point theorem), it follows that if (𝐲,𝐡)(\mathbf{y},\mathbf{h}) is in 𝒜N\mathcal{A}_{N}, then TNT_{N} has a fixed point. Moreover, if (𝐲,𝐡)(\mathbf{y},\mathbf{h}) is in 𝒜NN\mathcal{A}_{N}\cap\mathcal{B}_{N} then TNT_{N} has a fixed point satisfying (2). All of this is made precise below.

The statement of Theorem 3 is that the (non-random sequences) θ(𝒜N)\mathbb{P}_{\theta^{\star}}(\mathcal{A}_{N}) and θ(N)\mathbb{P}_{\theta^{\star}}(\mathcal{B}_{N}) both tend to 11. The inequalities

θ(𝒜N)+θ(N)1\displaystyle\mathbb{P}_{\theta^{\star}}(\mathcal{A}_{N})+\mathbb{P}_{\theta^{\star}}(\mathcal{B}_{N})-1 θ(𝒜NN)θ(𝒜N)\displaystyle\leq\mathbb{P}_{\theta^{\star}}(\mathcal{A}_{N}\cap\mathcal{B}_{N})\leq\mathbb{P}_{\theta^{\star}}(\mathcal{A}_{N})

imply that

θ(𝒜NN)1.\displaystyle\mathbb{P}_{\theta^{\star}}(\mathcal{A}_{N}\cap\mathcal{B}_{N})\to 1.

Now note that, if both inequalities hold, namely

supθB¯δ,θTN(θ)θδ\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}(\theta)-\theta^{\star}\|\leq\delta (39)
supθB¯δ,θTN(θ)<1,\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}^{\prime}(\theta)\|<1, (40)

then (39), together with Brouwer’s fixed point theorem (see [53], page 180) implies that TNT_{N} has a fixed point θN\theta_{N} in B¯δ,θ\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}} (this idea is loosely inspired by [54], page 69). Moreover, being a fixed point, at a θN\theta_{N} it holds (see remark 2) that TN(θN)=𝐉TN(θN)T^{\prime}_{N}(\theta_{N})=\mathbf{J}_{T_{N}}(\theta_{N}), so, (40) implies that

ρ(𝐉TN(θN))𝐉TN(θN)supθB¯δ,θTN(θ)<1.\displaystyle\rho\big{(}\mathbf{J}_{T_{N}}(\theta_{N})\big{)}\leq\big{\|}\mathbf{J}_{T_{N}}(\theta_{N})\|\leq\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}^{\prime}(\theta)\|<1.

This explains why Theorem 3 expresses the notion that we can “expect” (2) to hold for TNT_{N}. In fact, from the above, the event

𝒞N={(𝐲,𝐡):TN has a fixed point satisfying (2)}\mathcal{C}_{N}=\big{\{}(\mathbf{y},\mathbf{h}):T_{N}\text{ has a fixed point satisfying \eqref{JacobianCondition}}\big{\}}\\

contains the event 𝒜NN\mathcal{A}_{N}\cap\mathcal{B}_{N}, and the probability of this last event is approaching 1.

Proof of Theorem 3

Let \|\cdot\| be any norm. Then

TN(θ)θTN(θ)T(θ)+T(θ)θ.\displaystyle\|T_{N}(\theta)-\theta^{\star}\|\leq\|T_{N}(\theta)-T(\theta)\|+\|T(\theta)-\theta^{\star}\|. (41)

From Lemma 4 c),

ρ(𝐉T(θ))<1.\displaystyle\rho\big{(}\mathbf{J}_{T}(\theta^{\star})\big{)}<1.

From the proof of Ostrowski’s Theorem (see [55], page 300), there exists a norm \|\cdot\| on d+2\mathbb{R}^{d+2}, an open neighborhood 𝒱\mathcal{V} of θ\theta^{\star}, and λ<1\lambda<1, such that

1)

T(θ)θλθθ\|T(\theta)-\theta^{\star}\|\leq\lambda\|\theta-\theta^{\star}\|, for θ𝒱\theta\in\mathcal{V};

2)

𝐉T(θ)<1\|\mathbf{J}_{T}(\theta^{\star})\|<1, where here the norm is the induced matrix norm.

Choose δ\delta sufficiently small such that

i)

B¯δ,θ𝒱\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}\subseteq\mathcal{V};

ii)

T(θ)=𝐉T(θ)β<1\|T^{\prime}(\theta)\|=\|\mathbf{J}_{T}(\theta)\|\leq\beta<1, for θB¯δ,θ\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}},

where the validity of ii) follows from the compactness of B¯δ,θ\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}} and the continuity of TT^{\prime}.

Now, for any θB¯δ,θ\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}, (41) implies that

TN(θ)θTN(θ)T(θ)+λδ\displaystyle\|T_{N}(\theta)-\theta^{\star}\|\leq\|T_{N}(\theta)-T(\theta)\|+\lambda\delta

and, hence,

supθB¯δ,θTN(θ)θsupθB¯δ,θTN(θ)T(θ)+λδ.\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}(\theta)-\theta^{\star}\big{\|}\leq\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}(\theta)-T(\theta)\big{\|}+\lambda\delta. (42)

A similar reasoning shows that

supθB¯δ,θTN(θ)supθB¯δ,θTN(θ)T(θ)+β.\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}^{\prime}(\theta)\|\leq\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}^{\prime}(\theta)-T^{\prime}(\theta)\big{\|}+\beta. (43)

To conclude, we appeal to Theorem 2, and we show that it implies the result. Let ϵ1=(1λ)δ\epsilon_{1}=(1-\lambda)\delta and ϵ2=1β2\epsilon_{2}=\frac{1-\beta}{2}. From the properties of λ\lambda and β\beta, it holds that ϵ1>0\epsilon_{1}>0 and 0<ϵ2<10<\epsilon_{2}<1. By the definition of convergence in probability, it holds that

θ(supθB¯δ,θTN(θ)T(θ)ϵ1)1\displaystyle\mathbb{P}_{\theta^{\star}}\Big{(}\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}(\theta)-T(\theta)\|\leq\epsilon_{1}\Big{)}\to 1
θ(supθB¯δ,θTN(θ)T(θ)ϵ2)1.\displaystyle\mathbb{P}_{\theta^{\star}}\Big{(}\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}^{\prime}(\theta)-T^{\prime}(\theta)\|\leq\epsilon_{2}\Big{)}\to 1.

From (42), (43), and the forms of ϵ1\epsilon_{1} and ϵ2\epsilon_{2}, we conclude the result.

IV-F Simulations

In this section, we compare our algorithm with the one from [31] (DA-DEM) through Monte Carlo simulations. The parameters generated once and fixed throughout all Monte Carlo runs were: d=3d=3, N=100N=100, a unit-norm vector μd\mu^{\star}\in\mathbb{R}^{d}, p=0.7p^{\star}=0.7, and an undirected connected graph on NN nodes with connectivity radius rc=0.18r_{c}=0.18 555NN points were randomly deployed on the unit square; two points were then connected by an edge if their distance was less than rcr_{c}..

Each Monte Carlo run consisted in

1)

Generating a data set: each hnh_{n} was independently sampled from a Gaussian with zero mean and covariance I3I_{3}; the variance of the noise (σ)2(\sigma^{\star})^{2} was generated according to

(σ)2=𝐇F2N×SNR,\displaystyle(\sigma^{\star})^{2}=\frac{\|\mathbf{H}\|_{F}^{2}}{N\times\mbox{SNR}},

with 𝐇T=[h1hN]\mathbf{H}^{T}=[h_{1}\ldots h_{N}] and where SNR is the signal to noise ratio (we experimented with SNR{10dB,20dB}\text{SNR}\in\{10\text{dB},20\text{dB}\}). Finally, each yny_{n} was sampled according to fY|Hf_{Y|H} (see (LABEL:JointOnYandH)), with hnh_{n}, μ\mu^{\star}, pp^{\star}, and (σ)2(\sigma^{\star})^{2}.

2)

Computing 1000010000 iterations of the algorithm proposed in [31], with ρ{2,3,4}\rho\in\{2,3,4\}, and of Algorithm 1, with α{0.001,0.005,0.01}\alpha\in\{0.001,0.005,0.01\}. Both algorithms were initialized according to (34).

The performance metrics consisted in finding a fixed point using the centralized operators as follows.

We first computed

θ0(α)=1Nn=1Ng1(zn10000(α))\displaystyle\theta^{0}(\alpha)=\frac{1}{N}\sum_{n=1}^{N}g_{1}\big{(}z_{n}^{10000}(\alpha)\big{)} (44)
θ0(ρ)=1Nn=1Ng^1(zn10000(ρ)),\displaystyle\theta^{0}(\rho)=\frac{1}{N}\sum_{n=1}^{N}\hat{g}_{1}\big{(}z_{n}^{10000}(\rho)\big{)}, (45)

where: α{0.001,0.005,0.01}\alpha\in\{0.001,0.005,0.01\}; ρ{2,3,4}\rho\in\{2,3,4\}; g^1\hat{g}_{1} corresponds to the map arising from the standard EM algorithm derived in [31]. In fact, as seen in [31], the EM algorithm can be written as

θk+1=g^1g2^(θk),\displaystyle\theta^{k+1}=\hat{g}_{1}\circ\hat{g_{2}}(\theta^{k}),

where

g^2(θ)=1N(\displaystyle\hat{g}_{2}(\theta)=\frac{1}{N}\Big{(} n=1NΓ(yn,hn,θ),n=1Nψ(yn,hn,θ),\displaystyle\sum_{n=1}^{N}\Gamma(y_{n},h_{n},\theta),\sum_{n=1}^{N}\psi(y_{n},h_{n},\theta),
n=1Nr(yn,hn,θ),n=1Nyn2)\displaystyle\sum_{n=1}^{N}r(y_{n},h_{n},\theta),\sum_{n=1}^{N}y_{n}^{2}\Big{)}

and

g^1(Γ,ψ,p,a)=\displaystyle\hat{g}_{1}(\Gamma,\psi,p,a)= (Γ1ψ,p,aψTΓ1ψ).\displaystyle\big{(}\Gamma^{-1}\psi,p,a-\psi^{T}\Gamma^{-1}\psi\big{)}.

We ran the algorithms, with initialization as in (44) and (45), given by

θk+1(α)=g1g2(θk(α))\displaystyle\theta^{k+1}(\alpha)=g_{1}\circ g_{2}\big{(}\theta^{k}(\alpha)\big{)}
θk+1(ρ)=g^1g^2(θk(ρ)),\displaystyle\theta^{k+1}(\rho)=\hat{g}_{1}\circ\hat{g}_{2}\big{(}\theta^{k}(\rho)\big{)},

until we found θ(α)\theta^{\star}(\alpha) and θ(ρ)\theta^{\star}(\rho) satisfying

θ(α)g1g2(θ(α))1010\displaystyle\Big{\|}\theta^{\star}(\alpha)-g_{1}\circ g_{2}\big{(}\theta^{\star}(\alpha)\big{)}\Big{\|}\leq 10^{-10}
θ(ρ)g^1g^2(θ(ρ))1010.\displaystyle\Big{\|}\theta^{\star}(\rho)-\hat{g}_{1}\circ\hat{g}_{2}\big{(}\theta^{\star}(\rho)\big{)}\Big{\|}\leq 10^{-10}.

The error at iteration kk of the distributed algorithms was then computed as

1Nn=1Nπ1g1((znk(α))θ(α)\displaystyle\frac{1}{N}\sum_{n=1}^{N}\Big{\|}\pi_{1}\circ g_{1}\big{(}(z_{n}^{k}(\alpha)\big{)}-\theta^{\star}(\alpha)\Big{\|}
1Nn=1Nπ1g^1((znk(ρ))θ(ρ),\displaystyle\frac{1}{N}\sum_{n=1}^{N}\Big{\|}\pi_{1}\circ\hat{g}_{1}\big{(}(z_{n}^{k}(\rho)\big{)}-\theta^{\star}(\rho)\Big{\|},

where π1\pi_{1} is the projection onto the average, i.e., π1(μ,p,σ2)=μ\pi_{1}(\mu,p,\sigma^{2})=\mu (as mentioned before, pp and σ2\sigma^{2} were treated as nuisance parameters).

The number of Monte Carlo tests was 100100 and the errors at iteration kk were averaged out of 100100 for each α\alpha and ρ\rho. The results for two different SNR values are shown in logarithmic scale in Figures 1 and 2.

Refer to caption
Figure 1: The figure shows the result of the Monte Carlo simulation of the error with respect to each optimum for an SNR=10\text{SNR}=10dB and a connectivity radius of 0.180.18. The dashed curves correspond to the algorithm from [31] with parameter ρ{2,3,4}\rho\in\{2,3,4\} and the non-dashed curves correspond to the DBPI algorithm with parameter α{0.001,0.005,0.01}\alpha\in\{0.001,0.005,0.01\}.
Refer to caption
Figure 2: The figure shows the result of the Monte Carlo simulation of the error with respect to each optimum for an SNR=20\text{SNR}=20dB and a connectivity radius of 0.180.18. The dashed curves correspond to the algorithm from [31] with parameter ρ{2,3,4}\rho\in\{2,3,4\} and the non-dashed curves correspond to the DBPI algorithm with parameter α{0.001,0.005,0.01}\alpha\in\{0.001,0.005,0.01\}.

The simulations show, as expected from the theory, that Algorithm 1 converges linearly and clearly outperforms the algorithm from [31], which, given its diminishing step-size, is bound to converge only sub-linearly. Moreover, both algorithms require just one round of communications per iteration.

V Conclusion

This article builds upon the distributed Banach-Picard algorithm and its convergence properties provided in [2] to make two main contributions: we provided a proof of local linear convergence for the distributed PCA algorithm suggested in [33], thereby filling a gap left by that work; starting from the distributed Banach-Picard iteration, we proposed a distributed algorithm for solving the parameter estimation problem from noisy and faulty measurements that had been addressed in [31]. Unlike the algorithm in [31], which uses diminishing step sizes, thus exhibiting sublinear convergence rate, the proposed instance of the distributed Banach-Picard iteration is guaranteed to have local linear convergence. Numerical experiments confirm the theoretical advantage of the proposed method with respect to that from [31].

Appendix A Regularity Conditions

Theorem 4.

Let KΘK\subset\Theta be a compact set containing θ\theta^{\star}. Then, for all θK\theta\in K,

|iϕi1x1ikxk(y,h,θ)|𝒫i1x1ikxkϕ(|y|,|h1|,,|hd|)\displaystyle\Big{|}\frac{\partial^{i}\phi}{\partial^{i_{1}}x_{1}\ldots\partial^{i_{k}}x_{k}}(y,h,\theta)\Big{|}\leq\mathcal{P}^{\phi}_{\partial^{i_{1}}x_{1}\ldots\partial^{i_{k}}x_{k}}(|y|,|h_{1}|,\ldots,|h_{d}|) (46)
|iri1x1ikxk(y,h,θ)|𝒫i1x1ikxkr(|y|,|h1|,,|hd|),\displaystyle\Big{|}\frac{\partial^{i}r}{\partial^{i_{1}}x_{1}\ldots\partial^{i_{k}}x_{k}}(y,h,\theta)\Big{|}\leq\mathcal{P}^{r}_{\partial^{i_{1}}x_{1}\ldots\partial^{i_{k}}x_{k}}(|y|,|h_{1}|,\ldots,|h_{d}|), (47)

where j=1Kij=i1\sum_{j=1}^{K}i_{j}=i\geq 1, x1,,xkx_{1},\ldots,x_{k} are dummy variables in {μ1,,μd,p,σ2}\{\mu_{1},\ldots,\mu_{d},p,\sigma^{2}\} (i.e. consider partial diferentiation with respect to the components of θ\theta), and where

𝒫i1x1ikxkrand𝒫i1x1ikxkϕ\displaystyle\mathcal{P}^{r}_{\partial^{i_{1}}x_{1}\ldots\partial^{i_{k}}x_{k}}\quad\text{and}\quad\mathcal{P}^{\phi}_{\partial^{i_{1}}x_{1}\ldots\partial^{i_{k}}x_{k}}

are polynomials.

We start with a proof of (47). Note that rr satisfies the following differential equations

rμi(y,h,θ)\displaystyle\frac{\partial r}{\partial\mu_{i}}(y,h,\theta) =(1r(θ))r(θ)yhTμσ2μi,i=1,,d,\displaystyle=\big{(}1-r(\theta)\big{)}r(\theta)\frac{y-h^{T}\mu}{\sigma^{2}}\mu_{i},\quad i=1,\ldots,d,
rp(y,h,θ)\displaystyle\frac{\partial r}{\partial p}(y,h,\theta) =1p(1p)r(θ)(1r(θ)),\displaystyle=\frac{1}{p(1-p)}r(\theta)\big{(}1-r(\theta)\big{)},
rσ2(y,h,θ)\displaystyle\frac{\partial r}{\partial\sigma^{2}}(y,h,\theta) =((yhTμ)22(σ2)2y22(σ2)2)r(θ)(1r(θ)).\displaystyle=\Big{(}\frac{(y-h^{T}\mu)^{2}}{2(\sigma^{2})^{2}}-\frac{y^{2}}{2(\sigma^{2})^{2}}\Big{)}r(\theta)\big{(}1-r(\theta)\big{)}.

We deduce that

rλ(y,h,θ)=𝒫~λr(r(θ),y,h,θ)𝒬~λr(p,σ2),\displaystyle\frac{\partial r}{\partial\lambda}(y,h,\theta)=\frac{\tilde{\mathcal{P}}_{\lambda}^{r}(r(\theta),y,h,\theta)}{\tilde{\mathcal{Q}}_{\lambda}^{r}(p,\sigma^{2})}, (48)

where 𝒫~λr\tilde{\mathcal{P}}_{\lambda}^{r} and 𝒬~λr\tilde{\mathcal{Q}}_{\lambda}^{r} are polynomials and λ\lambda is a dummy variable in {μ1,,μd,p,σ2}\{\mu_{1},\ldots,\mu_{d},p,\sigma^{2}\}.

The chain rule of differentiation, the quotient rule of differentiation and the form (48) imply that

iri1x1ikxk(y,h,θ)=𝒫~i1x1ikxkr(r(θ),y,h,θ)𝒬~i1x1ikxkr(p,σ2).\displaystyle\frac{\partial^{i}r}{\partial^{i_{1}}x_{1}\ldots\partial^{i_{k}}x_{k}}(y,h,\theta)=\frac{\tilde{\mathcal{P}}_{\partial^{i_{1}}x_{1}\ldots\partial^{i_{k}}x_{k}}^{r}(r(\theta),y,h,\theta)}{\tilde{\mathcal{Q}}_{\partial^{i_{1}}x_{1}\ldots\partial^{i_{k}}x_{k}}^{r}(p,\sigma^{2})}. (49)

The result now follows easily from 0<r(θ)10<r(\theta)\leq 1, the compactness of KK and the identity (49).

The inequality (46) follows from (49) and the form of the gradient of ϕ\phi in (26). This concludes the proof of Theorem 4.

An immediate corollary of Theorem 4 is that the absolute values of the partial derivatives of both ϕ\phi and rr are dominated by functions whose expectation exists and is finite; this is the content of the following result.

Theorem 5.

Let 𝒫\mathcal{P} be a polynomial in d+1d+1 variables. Then

𝔼θ[𝒫(|y|,|h1|,,|hd|)]\displaystyle\mathbb{E}_{\theta^{\star}}\big{[}\mathcal{P}(|y|,|h_{1}|,\ldots,|h_{d}|)\big{]}

exists and is finite.

To prove this theorem, observe that 𝒫(|y|,|h1|,,|hd|)\mathcal{P}(|y|,|h_{1}|,\ldots,|h_{d}|) is a sum of elements of the form

b|y|n0|h1|n1|hd|nd,\displaystyle b|y|^{n_{0}}|h_{1}|^{n_{1}}\ldots|h_{d}|^{n_{d}},

and, hence, it is enough to show that

𝔼θ[|y|n0|h1|n1|hd|nd]\mathbb{E}_{\theta^{\star}}\big{[}|y|^{n_{0}}|h_{1}|^{n_{1}}\ldots|h_{d}|^{n_{d}}\big{]}

exists and is finite. This last fact is an easy consequence of the existence of absolute non-central and central moments of Gaussians and, therefore, we will skip the proof.

Appendix B Auxiliary Results and definitions

Theorem 6 ([56], page 2129).

Let a(z,θ)a(z,\theta) be a matrix of functions of an observation zz and the parameter θ\theta. If the z1,,zNz_{1},\ldots,z_{N} are i.i.d., Ω\Omega is compact, a(zi,θ)a(z_{i},\theta) is continuous at each θ\theta and there is d(z)d(z) with a(z,θ)Fd(z)\|a(z,\theta)\|_{F}\leq d(z) for all θΩ\theta\in\Omega, where 𝔼[d(z)]\mathbb{E}[d(z)] exists and is finite, then 𝔼[a(z,θ)]\mathbb{E}[a(z,\theta)] is continuous and

supθΩ1Nj=1Na(zj,θ)𝔼[a(z,θ)]F0\displaystyle\sup_{\theta\in\Omega}\Big{\|}\frac{1}{N}\sum_{j=1}^{N}a(z_{j},\theta)-\mathbb{E}[a(z,\theta)]\Big{\|}_{F}\to 0

in probability.

Let XnX_{n} be a sequence of random vectors. We use the notation Xn=oP(1)X_{n}=o_{P}(1), to denote that XnX_{n} converges to 0 in probability, i.e., if, for every ϵ>0\epsilon>0, the non-random sequence

(Xnϵ)\displaystyle\mathbb{P}(\|X_{n}\|\leq\epsilon)

converges to 11.

If XnX_{n} is uniformly bounded in probability, i.e., if, for every ϵ>0\epsilon>0, there exists M(ϵ)>0M(\epsilon)>0, such that

(Xn>M(ϵ))<ϵ,n,\displaystyle\mathbb{P}\big{(}\|X_{n}\|>M(\epsilon)\big{)}<\epsilon,\quad\forall n,

we denote this by Xn=OP(1)X_{n}=O_{P}(1) (see [54] for more details and also for the calculus with the OP(1)O_{P}(1) and oP(1)o_{P}(1)).

Appendix C Proof of Theorem 2

We give only a sketch of the proof of (38) (the proof of (37) is analogous). Observe that

TN(θ)T(θ)=\displaystyle T_{N}^{\prime}(\theta)-T^{\prime}(\theta)=
(A(θ))1(A(θ)AN(θ))(AN(θ))11Nn=1Nθ2ϕ(yn,hn,θ)\displaystyle\big{(}A(\theta)\big{)}^{-1}\Big{(}A(\theta)-A_{N}(\theta)\Big{)}\big{(}A_{N}(\theta)\big{)}^{-1}\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}^{2}\phi(y_{n},h_{n},\theta)
+(A(θ))1(1Nn=1Nθ2ϕ(yn,hn,θ)𝔼θ[θ2ϕ(y,h,θ)]),\displaystyle+\big{(}A(\theta)\big{)}^{-1}\Big{(}\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}^{2}\phi(y_{n},h_{n},\theta)-\mathbb{E}_{\theta^{\star}}\big{[}\nabla^{2}_{\theta}\phi(y,h,\theta)\big{]}\Big{)},

which implies that

TN(θ)T(θ)A(θ)1×\displaystyle\|T_{N}^{\prime}(\theta)-T^{\prime}(\theta)\|\leq\big{\|}A(\theta)\big{\|}^{-1}\times (50)
(\displaystyle\Big{(} A(θ)AN(θ)AN(θ)11Nn=1Nθ2ϕ(yn,hn,θ)\displaystyle\big{\|}A(\theta)-A_{N}(\theta)\big{\|}\big{\|}A_{N}(\theta)\big{\|}^{-1}\Big{\|}\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}^{2}\phi(y_{n},h_{n},\theta)\Big{\|}
+\displaystyle+ 1Nn=1Nθ2ϕ(yn,hn,θ)𝔼θ[θ2ϕ(y,h,θ)]).\displaystyle\Big{\|}\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}^{2}\phi(y_{n},h_{n},\theta)-\mathbb{E}_{\theta^{\star}}\big{[}\nabla^{2}_{\theta}\phi(y,h,\theta)\big{]}\Big{\|}\Big{)}.

From Theorem 6 (see appendix B),

supθB¯δ,θAN(θ)A(θ)F0,\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}A_{N}(\theta)-A(\theta)\big{\|}_{F}\to 0, (51)
supθB¯δ,θ1Nn=1Nθ2ϕ(yn,hn,θ)𝔼θ[θ2ϕ(y,h,θ)]F0,\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}^{2}\phi(y_{n},h_{n},\theta)-\mathbb{E}_{\theta^{\star}}[\nabla^{2}_{\theta}\phi(y,h,\theta)]\big{\|}_{F}\to 0, (52)

in probability; these are consequences of Theorem 6, by noting that:

a)

1σ2Γ(y,h,θ)FMhhTF\|\frac{1}{\sigma^{2}}\Gamma(y,h,\theta)\|_{F}\leq M\|hh^{T}\|_{F}, where MM is the maximum of 1σ2\frac{1}{\sigma^{2}} on B¯δ,θ\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}} and where we used the fact that |r(y,h,θ)|1|r(y,h,\theta)|\leq 1;

b)

θ2ϕ(y,h,θ)Fg(y,h)\|\nabla^{2}_{\theta}\phi(y,h,\theta)\|_{F}\leq g(y,h) on B¯δ,θ\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}, for some map gg not depending on θ\theta for which 𝔼θ[g(y,h)]\mathbb{E}_{\theta^{\star}}[g(y,h)] exists and is finite (see appendix A).

Since all norms are equivalent, (51)-(52) also holds if the Frobenius norm is replaced by any other norm.

Taking the supremum over on B¯δ,θ\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}} on both sides of (50), we obtain, from (51)-(52), that

supθB¯δ,θTN(θ)T(θ)\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}^{\prime}(\theta)-T^{\prime}(\theta)\big{\|}
\displaystyle\leq oP(1)supθB¯δ,θAN(θ)1supθB¯δ,θ1Nn=1Nθ2ϕ(yn,hn,θ)\displaystyle o_{P}(1)\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\|A_{N}(\theta)\|^{-1}\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}^{2}\phi(y_{n},h_{n},\theta)\big{\|}
+\displaystyle+ oP(1),\displaystyle o_{P}(1),

where the definitions of oP(1)o_{P}(1) and OP(1)O_{P}(1) can be found in appendix B.

From (51) and (52), together with the compactness of B¯δ,θ\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}, we can deduce (proof omitted) that

supθB¯δ,θAN(θ)1\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\|A_{N}(\theta)\|^{-1} =OP(1)\displaystyle=O_{P}(1)
supθB¯δ,θ1Nn=1Nθ2ϕ(yn,hn,θ)\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}^{2}\phi(y_{n},h_{n},\theta)\big{\|} =OP(1).\displaystyle=O_{P}(1).

Putting everything together, we conclude that

supθB¯δ,θTN(θ)T(θ)\displaystyle\sup_{\theta\in\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}}\big{\|}T_{N}^{\prime}(\theta)-T^{\prime}(\theta)\big{\|}\leq
oP(1)OP(1)OP(1)+oP(1)=oP(1),\displaystyle o_{P}(1)O_{P}(1)O_{P}(1)+o_{P}(1)=o_{P}(1),

where the equality follows from the calculus rules with OP(1)O_{P}(1) and oP(1)o_{P}(1) [54].

As mentioned, the proof of (37) is entirely analogous; just note that, in order to use Theorem 6, we need to check that θϕ(y,h,θ)g~(y,h)\|\nabla_{\theta}\phi(y,h,\theta)\|\leq\tilde{g}(y,h) on B¯δ,θ\bar{B}^{\|\cdot\|}_{\delta,\theta^{\star}}, for some map g~\tilde{g} not depending on θ\theta, for which 𝔼[g~(y,z)]\mathbb{E}[\tilde{g}(y,z)] exists and is finite; this is again a consequence of the results proved in appendix A.

Appendix D Proof of Lemma 1

Suppose XX^{\star} satisfies (7). Throughout this proof, xix_{i}^{\star} denotes the iith column of XX^{\star}. Consider the equation imposed by the first column, x1x^{\star}_{1}, i.e.,

Cx1=((x1)TCx1)x1,\displaystyle Cx^{\star}_{1}=\big{(}(x^{\star}_{1})^{T}Cx^{\star}_{1}\big{)}x^{\star}_{1},

and multiply both sides by (x1)T(x^{\star}_{1})^{T}, which yields

((x1)TCx1)(1x12)=0.\displaystyle\big{(}(x^{\star}_{1})^{T}Cx^{\star}_{1}\big{)}\big{(}1-\|x^{\star}_{1}\|^{2}\big{)}=0.

From the two equalities

((x1)TCx1)x1\displaystyle\big{(}(x^{\star}_{1})^{T}Cx^{\star}_{1}\big{)}x^{\star}_{1} =Cx1,\displaystyle=Cx^{\star}_{1},
((x1)TCx1)(1x12)\displaystyle\big{(}(x^{\star}_{1})^{T}Cx^{\star}_{1}\big{)}\big{(}1-\|x^{\star}_{1}\|^{2}\big{)} =0,\displaystyle=0,

we conclude that either x1=0x^{\star}_{1}=0 or x1x^{\star}_{1} is a unit-norm eigenvector of CC.

Considering the second column, we prove that x2x^{\star}_{2} is either zero or a unit-norm eigenvector of CC that is orthogonal to x1x^{\star}_{1}. Observe that

Cx2=((x1)TCx2)x1+((x2)TCx2)x2.\displaystyle Cx^{\star}_{2}=\big{(}(x^{\star}_{1})^{T}Cx^{\star}_{2}\big{)}x^{\star}_{1}+\big{(}(x^{\star}_{2})^{T}Cx^{\star}_{2}\big{)}x^{\star}_{2}. (53)

Now recall that x1=0x^{\star}_{1}=0 or x1x^{\star}_{1} is a unit-norm eigenvector of CC. If x1=0x^{\star}_{1}=0, then (53) reduces to

Cx2=((x2)TCx2)x2\displaystyle Cx^{\star}_{2}=\big{(}(x^{\star}_{2})^{T}Cx^{\star}_{2}\big{)}x^{\star}_{2}

and the result follows as in the case of x1x^{\star}_{1}. If x10x^{\star}_{1}\neq 0, then it is a unit-norm eigenvector of CC and, hence, there exists β\beta such that (x1)TC=β(x1)T(x_{1}^{\star})^{T}C=\beta(x_{1}^{\star})^{T} and (53) reduces to

Cx2=β((x1)Tx2)x1+((x2)TCx2)x2.\displaystyle Cx^{\star}_{2}=\beta\big{(}(x^{\star}_{1})^{T}x^{\star}_{2}\big{)}x^{\star}_{1}+\big{(}(x^{\star}_{2})^{T}Cx^{\star}_{2}\big{)}x^{\star}_{2}. (54)

Multiply on the left by (x1)T(x^{\star}_{1})^{T} and use x12=1\|x^{\star}_{1}\|^{2}=1 to obtain

((x2)TCx2)(x1)Tx2=0.\displaystyle\big{(}(x^{\star}_{2})^{T}Cx^{\star}_{2}\big{)}(x^{\star}_{1})^{T}x^{\star}_{2}=0.

If x2=0x^{\star}_{2}=0, we are done. If not, then 0=(x1)Tx20=(x^{\star}_{1})^{T}x^{\star}_{2} and, returning to (54), it holds that

Cx2=((x2)TCx2)x2.\displaystyle Cx^{\star}_{2}=\big{(}(x^{\star}_{2})^{T}Cx^{\star}_{2}\big{)}x^{\star}_{2}.

This establishes the claim for x1x^{\star}_{1} and x2x^{\star}_{2}. Proceeding as we did for the second column, it is possible to construct a proof by induction establishing the result.

Acknowledgment

This work was partially funded by the Portuguese Fundação para a Ciência e Tecnologia (FCT), under grants PD/BD/135185/2017 and UIDB/50008/2020. The work of João Xavier was supported in part by the Fundação para a Ciência e Tecnologia, Portugal, through the Project LARSyS, under Project FCT Project UIDB/50009/2020 and Project HARMONY PTDC/EEI-AUT/31411/2017 (funded by Portugal 2020 through FCT, Portugal, under Contract AAC n 2/SAICT/2017–031411. IST-ID funded by POR Lisboa under Grant LISBOA-01-0145-FEDER-031411).

References

  • [1] R. Olfati-Saber, J. Fax, and R. Murray, “Consensus and cooperation in networked multi-agent systems,” Proceedings of the IEEE, vol. 95, pp. 215–233, 2007.
  • [2] F. Andrade, M. Figueiredo, and J. Xavier, “Distributed Picard iteration,” submitted, available at arXiv:2104.00131, 2021.
  • [3] K. Pearson, “On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.
  • [4] Y. Qu, G. Ostrouchov, N. Samatova, and A. Geist, “Principal component analysis for dimension reduction in massive distributed data sets,” in Proceedings of IEEE International Conference on Data Mining (ICDM), vol. 1318, no. 1784, 2002, p. 1788.
  • [5] Y. Liang, M.-F. F. Balcan, V. Kanchanapally, and D. Woodruff, “Improved distributed principal component analysis,” Advances in Neural Information Processing Systems, vol. 27, pp. 3113–3121, 2014.
  • [6] R. Kannan, S. Vempala, and D. Woodruff, “Principal component analysis and higher correlations for distributed data,” in Conference on Learning Theory.   PMLR, 2014, pp. 1040–1057.
  • [7] C. Boutsidis, D. P. Woodruff, and P. Zhong, “Optimal principal component analysis in distributed and streaming models,” in Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, 2016, pp. 236–249.
  • [8] D. Garber, O. Shamir, and N. Srebro, “Communication-efficient algorithms for distributed stochastic principal component analysis,” in International Conference on Machine Learning.   PMLR, 2017, pp. 1203–1212.
  • [9] Z.-J. Bai, R. H. Chan, and F. T. Luk, “Principal component analysis for distributed data sets with updating,” in International Workshop on Advanced Parallel Processing Technologies.   Springer, 2005, pp. 471–483.
  • [10] H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson, “Distributed clustering using collective principal component analysis,” Knowledge and Information Systems, vol. 3, no. 4, pp. 422–448, 2001.
  • [11] H. Qi, T.-W. Wang, and J. D. Birdwell, “Global principal component analysis for dimensionality reduction in distributed data mining,” Statistical data mining and knowledge discovery, pp. 327–342, 2004.
  • [12] F. N. Abu-Khzam, N. F. Samatova, G. Ostrouchov, M. A. Langston, and A. Geist, “Distributed dimension reduction algorithms for widely dispersed data.” in IASTED PDCS, 2002, pp. 167–174.
  • [13] A. Scaglione, R. Pagliari, and H. Krim, “The decentralized estimation of the sample covariance,” in 2008 42nd Asilomar Conference on Signals, Systems and Computers.   IEEE, 2008, pp. 1722–1726.
  • [14] Y.-A. Le Borgne, S. Raybaud, and G. Bontempi, “Distributed principal component analysis for wireless sensor networks,” Sensors, vol. 8, no. 8, pp. 4821–4850, 2008.
  • [15] M. E. Yildiz, F. Ciaramello, and A. Scaglione, “Distributed distance estimation for manifold learning and dimensionality reduction,” in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2009, pp. 3353–3356.
  • [16] W. Suleiman, M. Pesavento, and A. M. Zoubir, “Performance analysis of the decentralized eigendecomposition and esprit algorithm,” IEEE Transactions on Signal Processing, vol. 64, no. 9, pp. 2375–2386, 2016.
  • [17] S. B. Korada, A. Montanari, and S. Oh, “Gossip pca,” ACM SIGMETRICS Performance Evaluation Review, vol. 39, no. 1, pp. 169–180, 2011.
  • [18] L. Li, A. Scaglione, and J. H. Manton, “Distributed principal subspace estimation in wireless sensor networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 725–738, 2011.
  • [19] I. D. Schizas and A. Aduroja, “A distributed framework for dimensionality reduction and denoising,” IEEE Transactions on Signal Processing, vol. 63, no. 23, pp. 6379–6394, 2015.
  • [20] S. X. Wu, H.-T. Wai, A. Scaglione, and N. A. Jacklin, “The power-oja method for decentralized subspace estimation/tracking,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 3524–3528.
  • [21] S. X. Wu, H.-T. Wai, L. Li, and A. Scaglione, “A review of distributed algorithms for principal component analysis,” Proceedings of the IEEE, vol. 106, no. 8, pp. 1321–1340, 2018.
  • [22] A. Gang, B. Xiang, and W. Bajwa, “Distributed principal subspace analysis for partitioned big data: Algorithms, analysis, and implementation,” IEEE Transactions on Signal and Information Processing over Networks, vol. 7, pp. 699–715, 2021.
  • [23] A. G. Dimakis, S. Kar, J. M. Moura, M. G. Rabbat, and A. Scaglione, “Gossip algorithms for distributed signal processing,” Proceedings of the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
  • [24] J.-J. Xiao, A. Ribeiro, Z.-Q. Luo, and G. B. Giannakis, “Distributed compression-estimation using wireless sensor networks,” IEEE Signal Processing Magazine, vol. 23, no. 4, pp. 27–41, 2006.
  • [25] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Randomized gossip algorithms,” IEEE transactions on information theory, vol. 52, no. 6, pp. 2508–2530, 2006.
  • [26] S. Barbarossa and G. Scutari, “Decentralized maximum-likelihood estimation for sensor networks composed of nonlinearly coupled dynamical systems,” IEEE Transactions on Signal Processing, vol. 55, no. 7, pp. 3456–3470, 2007.
  • [27] T. Zhao and A. Nehorai, “Information-driven distributed maximum likelihood estimation based on gauss-newton method in wireless sensor networks,” IEEE Transactions on Signal Processing, vol. 55, no. 9, pp. 4669–4682, 2007.
  • [28] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, “Consensus in ad hoc wsns with noisy links—part i: Distributed estimation of deterministic signals,” IEEE Transactions on Signal Processing, vol. 56, no. 1, pp. 350–364, 2007.
  • [29] S. S. Stanković, M. S. Stankovic, and D. M. Stipanovic, “Decentralized parameter estimation by consensus based stochastic approximation,” IEEE Transactions on Automatic Control, vol. 56, no. 3, pp. 531–543, 2010.
  • [30] A. H. Sayed, “Diffusion adaptation over networks,” in Academic Press Library in Signal Processing.   Elsevier, 2014, vol. 3, pp. 323–453.
  • [31] S. S. Pereira, R. López-Valcarce, and A. Pages-Zamora, “Parameter estimation in wireless sensor networks with faulty transducers: A distributed EM approach,” Signal Processing, vol. 144, pp. 226–237, 2018.
  • [32] S. Faria and G. Soromenho, “Fitting mixtures of linear regressions,” Journal of Statistical Computation and Simulation, vol. 80, no. 2, pp. 201–225, 2010.
  • [33] A. Gang, H. Raja, and W. U. Bajwa, “Fast and communication-efficient distributed PCA,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 7450–7454.
  • [34] T. D. Sanger, “Optimal unsupervised learning in a single-layer linear feedforward neural network,” Neural networks, vol. 2, no. 6, pp. 459–473, 1989.
  • [35] W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
  • [36] A. Gang and W. Bajwa, “A linearly convergent algorithm for distributed principal component analysis,” available at arXiv:2101.01300, 2021.
  • [37] G. McLachlan and D. Peel, Finite Mixture Models.   John Wiley & Sons, 2004.
  • [38] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society (Series B), vol. 29, pp. 1––37, 1977.
  • [39] S. Kar and J. M. Moura, “Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise,” IEEE Transactions on Signal Processing, vol. 57, no. 1, pp. 355–369, 2008.
  • [40] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [41] R. A. Redner and H. F. Walker, “Mixture densities, maximum likelihood and the EM algorithm,” SIAM review, vol. 26, no. 2, pp. 195–239, 1984.
  • [42] R. Sundberg, “Maximum likelihood theory for incomplete data from an exponential family,” Scandinavian Journal of Statistics, vol. 1, no. 2, pp. 49–58, 1974.
  • [43] S. Balakrishnan, M. Wainwright, and B. Yu, “Statistical guarantees for the EM algorithm: From population to sample-based analysis,” Annals of Statistics, vol. 45, no. 1, pp. 77–120, 2017.
  • [44] R. D. Nowak, “Distributed em algorithms for density estimation and clustering in sensor networks,” IEEE transactions on signal processing, vol. 51, no. 8, pp. 2245–2253, 2003.
  • [45] P. A. Forero, A. Cano, and G. B. Giannakis, “Distributed clustering using wireless sensor networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 707–724, 2011.
  • [46] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004.
  • [47] J. R. Magnus and H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics.   John Wiley & Sons, 2019.
  • [48] H. Lutkepohl, Handbook of Matrices.   Wiley, 1996.
  • [49] C. Bishop, Pattern Recognition and Machine Learning.   Springer, 2006.
  • [50] G. J. McLachlan and T. Krishnan, The EM algorithm and Extensions.   John Wiley & Sons, 2007, vol. 382.
  • [51] R. Horn and C. Johnson, Matrix Analysis.   Cambridge University Press, 2012.
  • [52] D. Bertsekas and S. Shreve, Stochastic Optimal Control: the Discrete-time Case.   Athena Scientific, 1996.
  • [53] J. Borwein and A. S. Lewis, Convex Analysis and Nonlinear Optimization: Theory and Examples.   Springer, 2010.
  • [54] A. Van der Vaart, Asymptotic Statistics.   Cambridge University Press, 2000, vol. 3.
  • [55] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables.   SIAM, 2000.
  • [56] W. K. Newey and D. McFadden, “Large sample estimation and hypothesis testing,” Handbook of Econometrics, vol. 4, pp. 2111–2245, 1994.