
Fast and Secure Distributed
Nonnegative Matrix Factorization

Yuqiu Qian, Conghui Tan, Danhao Ding, Hui Li, and Nikos Mamoulis Yuqiu Qian is with Tencent, Shenzhen, China. E-mail: [email protected]. Conghui Tan is with WeBank, Shenzhen, China. E-mail: [email protected]. Danhao Ding is with Department of Computer Science, University of Hong Kong, Hong Kong SAR, China. E-mail: [email protected]. Hui Li is with School of Informatics, Xiamen University, Xiamen, Fujian, China. E-mail: [email protected]. He is the corresponding author. Nikos Mamoulis is with Department of Computer Science and Engineering, University of Ioannina, Ioannina, Epirus, Greece. E-mail: [email protected].
Abstract

Nonnegative matrix factorization (NMF) has been successfully applied in several data mining tasks. Recently, there has been increasing interest in the acceleration of NMF, due to its high cost on large matrices. On the other hand, the privacy issue of NMF over federated data deserves attention, since NMF is prevalently applied in image and text analysis, which may involve leveraging private data (e.g., medical images and records) across several parties (e.g., hospitals). In this paper, we study the acceleration and security problems of distributed NMF. Firstly, we propose a distributed sketched alternating nonnegative least squares (DSANLS) framework for NMF, which utilizes a matrix sketching technique to reduce the size of nonnegative least squares subproblems with a convergence guarantee. For the second problem, we show that DSANLS with modification can be adapted to the security setting, but only for one or a limited number of iterations. Consequently, we propose four efficient distributed NMF methods in both synchronous and asynchronous settings with a security guarantee. We conduct extensive experiments on several real datasets to show the superiority of our proposed methods. The implementation of our methods is available at https://github.com/qianyuqiu79/DSANLS.

Index Terms:
Distributed Nonnegative Matrix Factorization, Matrix Sketching, Privacy

1 Introduction

Nonnegative matrix factorization (NMF) is a technique for discovering nonnegative latent factors and/or performing dimensionality reduction. Unlike general matrix factorization (MF), NMF restricts the two output matrix factors to be nonnegative. Specifically, the goal of NMF is to decompose a huge matrix $M\in\mathbb{R}_{+}^{m\times n}$ into the product of two matrices $U\in\mathbb{R}_{+}^{m\times k}$ and $V\in\mathbb{R}_{+}^{n\times k}$ such that $M\approx UV^{\top}$. $\mathbb{R}_{+}^{m\times n}$ denotes the set of $m\times n$ matrices with nonnegative real values, and $k$ is a user-specified dimensionality, where typically $k\ll m,n$. Nonnegativity is inherent in the feature space of many real-world applications, where the resulting factors of NMF can have a natural interpretation. Therefore, NMF has been widely used in a range of fields including text mining [1], image/video processing [2], recommendation [3], and analysis of social networks [4].

Modern data analysis tasks apply to big matrix data with increasing scale and dimensionality. Examples [5] include community detection in a billion-node social network, background separation on a 4K video in which every frame has approximately 27 million rows, and text mining on a bag-of-words matrix with millions of words. The volume of data is anticipated to increase in the ‘big data’ era, making it impossible to store the whole matrix in main memory throughout NMF. Therefore, there is a need for high-performance and scalable distributed NMF algorithms. On the other hand, there has been a surge of work on privacy-preserving data mining over federated data [6, 7] in recent years. In contrast to traditional research on privacy, which emphasizes protecting individual information from a single institution, federated data mining deals with multiple parties. Each party possesses its own confidential dataset(s) and the union of data from all parties is utilized to achieve better performance on the target task. Due to the prevalent use of NMF in image and text analysis, which may involve leveraging private data (e.g., medical images and records) across several parties (e.g., hospitals), the privacy issue of NMF over federated data deserves attention. To address the aforementioned challenges of NMF (i.e., high performance and privacy), we study the acceleration and security problems of distributed NMF in this paper.

First of all, we propose the distributed sketched alternating nonnegative least squares (DSANLS) framework for accelerating NMF. The state-of-the-art distributed NMF is MPI-FAUN [8], a general framework that iteratively solves the nonnegative least squares (NLS) subproblems for $U$ and $V$. The main idea behind MPI-FAUN is to exploit the independence of local updates for rows of $U$ and $V$, in order to minimize the communication requirements of matrix multiplication operations within the NMF algorithms. Unlike MPI-FAUN, our idea is to speed up distributed NMF in a new, orthogonal direction: by reducing the problem size of each NLS subproblem within NMF, which in turn decreases the overall computation cost. In a nutshell, we reduce the size of each NLS subproblem by employing a matrix sketching technique: the matrices involved in the subproblem are multiplied by a specially designed random matrix at each iteration, which greatly reduces their dimensionality. As a result, the computational cost of each subproblem drops significantly.

However, applying matrix sketching comes with several issues. First, although the size of each subproblem is significantly reduced, sketching involves matrix multiplication, which brings computational overhead. Second, unlike in a single-machine setting, data is distributed to different nodes in a distributed environment. Nodes may have to communicate extensively in a poorly designed solution. In particular, each node only retains part of both the input matrix and the generated approximate matrices, causing difficulties due to data dependencies in the computation process. Besides, the generated random matrices should be the same for all nodes in every iteration, while broadcasting the random matrix to all nodes brings severe communication overhead and can become the bottleneck of distributed NMF. Furthermore, after reducing each original subproblem to a sketched random new subproblem, it is not clear whether the algorithm still converges and whether it converges to stationary points of the original NMF problem.

Our DSANLS overcomes these problems. Firstly, the extra computation cost due to sketching is reduced with a proper choice of the random matrices. Then, the same random matrices used for sketching are generated independently at each node, thus there is no need for transferring them among nodes during distributed NMF. Having the complete random matrix at each node, an NMF iteration can be done locally with the help of a matrix multiplication rule with proper data partitioning. Therefore, our matrix sketching approach reduces not only the computational overhead, but also the communication cost. Moreover, due to the fact that sketching also shifts the optimal solution of each original NMF subproblem, we propose subproblem solvers paired with theoretical guarantees of their convergence to a stationary point of the original subproblems.

To provide solutions to the problem of secure distributed NMF over federated data, we first show that DSANLS with modification can be adapted to this security setting, but only for one or a limited number of iterations. Therefore, we design new methods called Syn-SD and Syn-SSD in the synchronous setting. They are later extended to Asyn-SD and Asyn-SSD in the asynchronous (i.e., client/server) setting, respectively. Syn-SSD improves the convergence rate of Syn-SD without incurring much extra communication cost. It also reduces computational overhead by sketching. All proposed algorithms are secure with a guarantee. The secure distributed NMF problem is hard in nature: the parties involved should not be able to infer the confidential information of others during the process. To the best of our knowledge, we are the first to study NMF over federated data.

In summary, our contributions are as follows:

  • DSANLS is the first distributed NMF algorithm that leverages matrix sketching to reduce the problem size of each NLS subproblem and can be applied to both dense and sparse input matrices with a convergence guarantee.

  • We propose a novel and specially designed subproblem solver (proximal coordinate descent), which helps DSANLS converge faster. We also discuss the use of projected gradient descent as subproblem solver, showing that it is equivalent to stochastic gradient descent (SGD) on the original (non-sketched) NLS subproblem.

  • For the problem of secure distributed NMF, we propose efficient methods, Syn-SD and Syn-SSD, in the synchronous setting and later extend them to the asynchronous setting. Through sketching, their computation cost is significantly reduced. They are the first secure distributed NMF methods for federated data.

  • We conduct extensive experiments using several (dense and sparse) real datasets, which demonstrate the efficiency and scalability of our proposals.

The remainder of the paper is organized as follows. Sec. 2 provides the background and discusses the related work. Our DSANLS algorithm with detailed theoretical analysis is presented in Sec. 3. Our proposed algorithms for secure distributed NMF problem in both synchronous and asynchronous settings are presented in Sec. 4. Sec. 5 evaluates all algorithms. Finally, Sec. 6 concludes the paper.

Notations. For a matrix $A$, we use $A_{i:j}$ to denote the entry at the $i$-th row and $j$-th column of $A$. Besides, either $i$ or $j$ can be omitted to denote a column or a row, i.e., $A_{i:}$ is the $i$-th row of $A$, and $A_{:j}$ is its $j$-th column. Furthermore, $i$ or $j$ can be replaced by a subset of indices. For example, if $I\subset\{1,2,\dots,m\}$, $A_{I:}$ denotes the sub-matrix of $A$ formed by all rows in $I$, whereas $A_{:J}$ is the sub-matrix of $A$ formed by all columns in a subset $J\subset\{1,2,\dots,n\}$.

2 Background and Related Work

In Sec. 2.1, we first illustrate NMF and its security problem in a distributed environment. Then we elaborate on previous works which are related to this paper in Sec. 2.2.

2.1 Preliminary

2.1.1 NMF Algorithms

Generally, NMF can be defined as an optimization problem [9] as follows:

\min_{U\in\mathbb{R}_{+}^{m\times k},\,V\in\mathbb{R}_{+}^{n\times k}}\left\|M-UV^{\top}\right\|_{F},\qquad(1)

where $\|X\|_{F}=\left(\sum_{ij}x_{ij}^{2}\right)^{1/2}$ is the Frobenius norm of $X$. Problem (1) is hard to solve directly because it is non-convex. Therefore, almost all NMF algorithms leverage two-block coordinate descent schemes (shown in Alg. 1): they optimize over one of the two factors, $U$ or $V$, while keeping the other fixed [10]. By fixing $V$, we can optimize $U$ by solving a nonnegative least squares (NLS) subproblem:

\min_{U\in\mathbb{R}_{+}^{m\times k}}\left\|M-UV^{\top}\right\|_{F}.\qquad(2)

Similarly, if we fix $U$, the problem becomes:

\min_{V\in\mathbb{R}_{+}^{n\times k}}\left\|M^{\top}-VU^{\top}\right\|_{F}.\qquad(3)
Algorithm 1 Two-Block Coordinate Descent: Framework of Most NMF Algorithms

Input: $M$
Parameter: Iteration number $T$

1: initialize $U^{0}\geq 0$, $V^{0}\geq 0$
2: for $t=0$ to $T-1$ do
3:     $U^{t+1}\leftarrow$ update($M$, $U^{t}$, $V^{t}$)
4:     $V^{t+1}\leftarrow$ update($M$, $U^{t+1}$, $V^{t}$)
5: return $U^{T}$ and $V^{T}$

Within two-block coordinate descent schemes (exact or inexact), different subproblem solvers are proposed. The first widely used update rule is Multiplicative Updates (MU) [11, 9]. MU is based on the majorization-minimization framework and its application guarantees that the objective function monotonically decreases [11, 9]. Another extensively studied method is alternating nonnegative least squares (ANLS), which represents a class of methods where the subproblems for $U$ and $V$ are solved exactly following the framework described in Alg. 1. ANLS is guaranteed to converge to a stationary point [12] and has been shown to perform very well in practice with active set [13, 14], projected gradient [15], quasi-Newton [16], or accelerated gradient [17] methods as the subproblem solver. Therefore, we focus on ANLS in this paper.

2.1.2 Secure Distributed NMF

The secure distributed NMF problem is meaningful and has practical applications. Suppose two hospitals $A$ and $B$ have different clinical records, $M_{1}$ and $M_{2}$ (i.e., matrices), for the same set of phenotypes. For legal or commercial reasons, neither hospital may reveal personal records to the other directly. For the purpose of phenotype classification, NMF can be applied independently (i.e., $M_{1}\approx U_{1}V_{1}^{\top}$ and $M_{2}\approx U_{2}V_{2}^{\top}$). However, since $M_{1}$ and $M_{2}$ have the same schema for phenotypes, the concatenated matrix $M=[M_{1},M_{2}]$ can be taken as input for NMF, resulting in better user (i.e., patient) latent representations $V_{1}$ and $V_{2}$ by sharing the same item (i.e., phenotype) latent representation $U$:

M=\begin{bmatrix}M_{1}&M_{2}\end{bmatrix}\approx\begin{bmatrix}UV_{1}^{\top}&UV_{2}^{\top}\end{bmatrix}=U\cdot\begin{bmatrix}V_{1}^{\top}&V_{2}^{\top}\end{bmatrix}.\qquad(4)

Throughout the factorization process, a secure distributed NMF should guarantee that party $A$ only has access to $M_{1}$, $U$ and $V_{1}$, and party $B$ only has access to $M_{2}$, $U$ and $V_{2}$. It is worth noting that the above problem of distributed NMF with two parties can be straightforwardly extended to $N$ parties. The requirement over federated data in secure distributed NMF is actually the so-called $t$-private protocol (shown in Definition 1 with $t=N-1$), which derives from secure function evaluation [18]. In this paper, we use it to assess whether a distributed NMF is secure.

Definition 1.

($t$-private protocol). All $N$ parties follow the protocol honestly, but they are also curious about inferring other parties’ private information based on their own data (i.e., honest-but-curious). A protocol is $t$-private if any $t$ parties who collude at the end of the protocol learn nothing beyond their own outputs.

Note that a single matrix transpose operation transforms a column-concatenated matrix to a row-concatenated matrix. Without loss of generality, we only consider the scenario that matrices are concatenated along rows in this paper.

The secure distributed NMF problem is hard in nature. Party $A$ needs to solve the NMF problem together with party $B$ to obtain $U$ and $V_{1}$, while at the same time party $A$ should not be able to infer $V_{2}$ or $M_{2}$ during the whole process. This security requirement makes the problem fundamentally different from traditional distributed NMF, whose data partitioning already violates the security requirement.

2.2 Related Work

In the sequel, we briefly review three research areas which are related to this paper.

2.2.1 Accelerating NMF

Parallel NMF algorithms are well studied in the literature [19, 20]. However, unlike in a parallel, single-machine setting, data sharing and communication have considerable cost in a distributed setting. Therefore, we need specialized NMF algorithms for handling massive-scale data in a distributed environment. The first method in this direction [21] is based on the MU algorithm. It mainly focuses on sparse matrices and applies a careful partitioning of the data in order to maximize data locality and parallelism. Later, CloudNMF [22], a MapReduce-based NMF algorithm similar to [21], was implemented and tested on large-scale biological datasets. Another distributed NMF algorithm [23] leverages block-wise updates for local aggregation and parallelism. It also performs frequent updates using the most recently updated data whenever possible, which is more efficient than traditional concurrent counterparts. Apart from MapReduce implementations, Spark is also attracting attention for its advantage in iterative algorithms, e.g., using MLlib [24]. Finally, there are implementations using X10 [25] and on GPU [26].

The most recent and related work in this direction is MPI-FAUN [5, 8], which is the first implementation of NMF using MPI for interprocessor communication. MPI-FAUN is flexible and can be utilized for a broad class of NMF algorithms that iteratively solve NLS subproblems, including MU, HALS, and ANLS/BPP. MPI-FAUN exploits the independence of local update computation for rows of $U$ and $V$ to apply communication-optimal matrix multiplication. In a nutshell, the full matrix $M$ is split across a two-dimensional grid of processors and multiple copies of both $U$ and $V$ are kept at different nodes, in order to reduce the communication between nodes during the iterations of NMF algorithms.

2.2.2 Matrix Sketching

Matrix sketching is a technique that has been previously used in numerical linear algebra [27], statistics [28] and optimization [29]. Its basic idea is as follows. Suppose we need to find a solution $x$ to the equation $Ax=b$, where $A\in\mathbb{R}^{m\times n}$ and $b\in\mathbb{R}^{m}$. Instead of solving this equation directly, in each iteration of matrix sketching, a random matrix $S\in\mathbb{R}^{d\times m}$ ($d\ll m$) is generated, and we instead solve the sketched problem $(SA)x=Sb$. Obviously, a solution to the original equation is also a solution to the sketched equation, but not vice versa. However, the problem size has now decreased from $m\times n$ to $d\times n$. With a properly generated random matrix $S$ and an appropriate method to solve the sketched subproblem, it can be guaranteed that we progressively approach the solution of the original equation by iteratively applying this sketching technique.
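To make the idea concrete, here is a minimal NumPy sketch (sizes and names are illustrative, not from the paper) showing that for a consistent system a solution of the original equation also solves the much smaller sketched system; for general problems, the sketch is redrawn and the subproblem re-solved iteratively, as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 2000, 50, 100                       # d << m (illustrative sizes)
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true                                # consistent system A x = b

S = rng.standard_normal((d, m)) / np.sqrt(d)  # Gaussian sketching matrix S
x_hat, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)  # solve the d x n sketched system
print(np.allclose(x_hat, x_true))             # True: x_true also solves (SA) x = Sb
```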

To the best of our knowledge, there is only one previous work [30] which incorporates dual random projection into the NMF problem, in a centralized environment, sharing similar ideas with SANLS, the centralized version of our DSANLS algorithm. However, Wang et al. [30] did not provide an efficient subproblem solver, and their method was less effective than non-sketched methods in practical experiments. Besides, data sparsity was not taken into consideration in their work. Furthermore, no theoretical guarantee was provided for NMF with dual random projection. In short, SANLS is not the same as [30], and DSANLS is much more than a distributed version of [30]. The methods that we propose in this paper are efficient in practice and have strong theoretical guarantees.

2.2.3 Secure Matrix Computation on Federated Data

In federated data mining, parties collaborate to perform a data processing task on the union of their unencrypted data, without leaking their private data to other participants [31]. A surge of work in the literature studies federated matrix computation algorithms, such as privacy-preserving gradient descent [32, 33], eigenvector computation [34], singular value decomposition [35, 36], k-means clustering [37], and spectral clustering [38] over partitioned data on different parties. Secure multi-party computation (MPC) is applied to preserve the privacy of the parties involved (e.g., secure addition, secure multiplication and secure dot product) [39, 37]. These secure MPC protocols compute an arbitrary function among $n$ parties and tolerate up to $t<n/2$ corrupted parties, at a cost of $\Omega(n)$ per bit [40, 41]. Such protocols are too generic for a specific task like secure NMF. Our proposed protocol does not incorporate costly MPC multiplication protocols while tolerating up to $n-1$ corrupted (static, honest-but-curious) parties. Recently, Kim et al. [6] proposed a federated method to learn phenotypes across multiple hospitals with alternating direction method of multipliers (ADMM) tensor factorization; and Feng et al. [7] developed a privacy-preserving tensor decomposition framework for processing encrypted data in a federated cloud setting.

3 DSANLS: Distributed Sketched ANLS

In this section, we describe our DSANLS method for accelerating NMF in a general distributed environment.

3.1 Data Partitioning

Assume there are $N$ computing nodes in the cluster. We partition the row indices $\{1,2,\dots,m\}$ of the input matrix $M$ into $N$ disjoint sets $I_{1},I_{2},\dots,I_{N}$, where $I_{r}\subset\{1,2,\dots,m\}$ is the subset of rows assigned to node $r$, as in [21]. Similarly, we partition the column indices $\{1,2,\dots,n\}$ into disjoint sets $J_{1},J_{2},\dots,J_{N}$ and assign column set $J_{r}$ to node $r$. The numbers of rows and columns in each node are nearly the same in order to achieve load balancing, i.e., $\left|I_{r}\right|\approx m/N$ and $\left|J_{r}\right|\approx n/N$ for each node $r$. The factor matrices $U$ and $V$ are also assigned to nodes accordingly, i.e., node $r$ stores and updates $U_{I_{r}:}$ and $V_{J_{r}:}$, as shown in Fig. 1(a).
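For illustration, a small NumPy sketch of this partitioning (toy sizes, single process; the index sets $I_r$ and $J_r$ are formed with a simple contiguous split, which is one possible choice):

```python
import numpy as np

m, n, k, N = 12, 9, 3, 3                       # toy sizes
rows = np.array_split(np.arange(m), N)         # I_1, ..., I_N (row index sets)
cols = np.array_split(np.arange(n), N)         # J_1, ..., J_N (column index sets)

rng = np.random.default_rng(0)
M = np.abs(rng.standard_normal((m, n)))
r = 1                                          # data held by node r:
M_Ir = M[rows[r], :]                           #   row block    M_{I_r:}
M_Jr = M[:, cols[r]]                           #   column block M_{:J_r}
U_Ir = np.abs(rng.standard_normal((len(rows[r]), k)))  # local factor block U_{I_r:}
V_Jr = np.abs(rng.standard_normal((len(cols[r]), k)))  # local factor block V_{J_r:}
```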

Figure 1: Partitioning data to $N$ nodes with node $r$’s data shaded. (a) DSANLS. (b) Secure NMF.

Data partitioning in distributed NMF differs from that in parallel NMF. Previous works on parallel NMF [19, 20] choose to partition $U$ and $V$ along the long dimension, but we adopt the row-partitioning of $U$ and $V$ as in [21]. To see why, take the $U$-subproblem (2) as an example and observe that it is row-independent in nature, i.e., the $r$-th row block of its solution $U_{I_{r}:}$ is given by:

U_{I_{r}:}=\operatorname*{arg\,min}_{U_{I_{r}:}\in\mathbb{R}_{+}^{|I_{r}|\times k}}\left\|M_{I_{r}:}-U_{I_{r}:}V^{\top}\right\|^{2}_{F},\qquad(5)

and thus can be solved independently without referring to any other row blocks of $U$. The same holds for the $V$-subproblem. In addition, no communication is needed concerning $M$ when solving (5), because $M_{I_{r}:}$ is already present at node $r$.

On the other hand, solving (5) requires the entire $V$ of size $n\times k$, meaning that every node needs to gather $V$ from all other nodes. This process can easily become the bottleneck of a naive distributed ANLS implementation. As we will explain shortly, our DSANLS algorithm alleviates this problem, since we use a sketched matrix of reduced size instead of the original complete matrix $V$.

3.2 SANLS: Sketched ANLS

To better understand DSANLS, we first introduce Sketched ANLS (SANLS), i.e., the centralized version of our algorithm. Recall from Sec. 2.1.1 that at each step of ANLS, either $U$ or $V$ is fixed and we solve a nonnegative least squares problem (2) or (3) over the other variable. Intuitively, it is unnecessary to solve this subproblem with high accuracy, because we may not have reached the optimal solution for the fixed variable yet. Hence, when the fixed variable changes in the next step, this accurate solution from the previous step will no longer be optimal and will have to be re-computed. Our idea is to apply matrix sketching to each subproblem, in order to obtain an approximate solution at a much lower computational and communication cost.

Specifically, suppose we are at the $t$-th iteration of ANLS, and our current estimates of $U$ and $V$ are $U^{t}$ and $V^{t}$ respectively. We must solve subproblem (2) in order to update $U^{t}$ to a new matrix $U^{t+1}$. We apply matrix sketching to the residual term of subproblem (2). The subproblem now becomes:

\min_{U\in\mathbb{R}_{+}^{m\times k}}\left\|MS^{t}-U\left(V^{t\top}S^{t}\right)\right\|_{F}^{2},\qquad(6)

where $S^{t}\in\mathbb{R}^{n\times d}$ is a randomly-generated matrix. Hence, the problem size decreases from $n\times k$ to $d\times k$. $d$ is chosen to be much smaller than $n$, in order to sufficiently reduce the computational cost. (Footnote 1: However, we should not choose an extremely small $d$; otherwise the sketched subproblem would become so small that it can hardly represent the original subproblem, preventing NMF from converging to a good result. In practice, we can set $d=0.1n$ for medium-sized matrices and $d=0.01n$ for large matrices if $m\approx n$. When $m$ and $n$ differ a lot, e.g., $m\ll n$ without loss of generality, we should not apply the sketching technique to the $V$-subproblem (since solving the $U$-subproblem is much more expensive) and simply choose $d=m\ll n$.) Similarly, we transform the $V$-subproblem into:

\min_{V\in\mathbb{R}_{+}^{n\times k}}\left\|M^{\top}S^{\prime t}-V\left(U^{t\top}S^{\prime t}\right)\right\|_{F}^{2},\qquad(7)

where $S^{\prime t}\in\mathbb{R}^{m\times d^{\prime}}$ is also a random matrix, with $d^{\prime}\ll m$.

3.3 DSANLS: Distributed SANLS

Now, we come to our proposal: the distributed version of SANLS, called DSANLS. Since the $U$-subproblem (6) and the $V$-subproblem (7) have the same nature, we restrict our attention to the $U$-subproblem here. The first observation about subproblem (6) is that it is still row-independent, thus node $r$ only needs to solve:

\min_{U_{I_{r}:}\in\mathbb{R}^{|I_{r}|\times k}_{+}}\left\|\left(MS^{t}\right)_{I_{r}:}-U_{I_{r}:}\left(V^{t\top}S^{t}\right)\right\|_{F}^{2}.\qquad(8)

For simplicity, we denote:

A^{t}_{r}\triangleq\left(MS^{t}\right)_{I_{r}:}\quad\text{and}\quad B^{t}\triangleq V^{t\top}S^{t},\qquad(9)

and the subproblem (8) can be written as:

\min_{U_{I_{r}:}\in\mathbb{R}^{|I_{r}|\times k}_{+}}\left\|A^{t}_{r}-U_{I_{r}:}B^{t}\right\|_{F}^{2}.\qquad(10)

Thus, node $r$ needs to know matrices $A^{t}_{r}$ and $B^{t}$ in order to solve the subproblem.

For $A^{t}_{r}$, by applying matrix multiplication rules, we get $A^{t}_{r}=\left(MS^{t}\right)_{I_{r}:}=M_{I_{r}:}S^{t}$. Therefore, if $S^{t}$ is stored at node $r$, $A^{t}_{r}$ can be computed without any communication. On the other hand, computing $B^{t}=V^{t\top}S^{t}$ requires communication across the whole cluster, since the rows of $V^{t}$ are distributed across different nodes. Fortunately, if we assume that $S^{t}$ is again stored at all nodes, we can compute $B^{t}$ in a much cheaper way. Following block matrix multiplication rules, we can rewrite $B^{t}$ as:

B^{t}=V^{t\top}S^{t}=\left[\left(V^{t}_{J_{1}:}\right)^{\top}\ \cdots\ \left(V^{t}_{J_{N}:}\right)^{\top}\right]\left[\begin{array}{c}S^{t}_{J_{1}:}\\ \vdots\\ S^{t}_{J_{N}:}\end{array}\right]=\sum_{r=1}^{N}\left(V^{t}_{J_{r}:}\right)^{\top}S^{t}_{J_{r}:}.\qquad(11)

Note that the summand $\bar{B}^{t}_{r}\triangleq\left(V^{t}_{J_{r}:}\right)^{\top}S^{t}_{J_{r}:}$ is a matrix of size $k\times d$ and can be computed locally. As a result, communication is only needed for summing up the matrices $\bar{B}^{t}_{r}$ of size $k\times d$ using an MPI All-Reduce operation, which is much cheaper than transmitting the whole $V^{t}$ of size $n\times k$.

Now, the only remaining problem is the transmission of $S^{t}$. Since $S^{t}$ can be dense, and even larger than $V^{t}$, broadcasting it across the whole cluster can be quite expensive. However, it turns out that we can avoid this. Recall that $S^{t}$ is a randomly-generated matrix; each node can generate exactly the same matrix if we use the same pseudo-random generator and the same seed. Therefore, we only need to broadcast the random seed, which is just an integer, at the beginning of the whole program. This ensures that each node generates exactly the same random number sequence and hence the same random matrices $S^{t}$ at each iteration.
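A minimal illustration of the shared-seed trick (illustrative sizes; the seed handling is a sketch, not the actual MPI implementation): after the seed is broadcast once, every node regenerates an identical $S^{t}$ locally, so the random matrices themselves never travel over the network.

```python
import numpy as np

n, d = 1000, 50          # illustrative sizes
seed = 42                # broadcast once at the beginning of the program

def sketch_at_node(seed, t):
    # every node derives the t-th sketching matrix from the shared seed
    rng = np.random.default_rng([seed, t])
    return rng.standard_normal((n, d)) / np.sqrt(d)

S_on_node_0 = sketch_at_node(seed, t=3)
S_on_node_1 = sketch_at_node(seed, t=3)
assert np.array_equal(S_on_node_0, S_on_node_1)   # identical S^t without transferring it
```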

In short, the communication cost of each node is reduced from $\mathcal{O}(nk)$ to $\mathcal{O}(dk)$ by adopting our sketching technique for the $U$-subproblem. Likewise, the communication cost of each $V$-subproblem is decreased from $\mathcal{O}\left(mk\right)$ to $\mathcal{O}\left(d^{\prime}k\right)$. The general framework of our DSANLS algorithm is listed in Alg. 2.

Algorithm 2 Distributed SANLS on Node $r$

Input: $M_{I_{r}:}$ and $M_{:J_{r}}$
Parameter: Iteration number $T$

1: Initialize $U^{0}_{I_{r}:}\geq 0$, $V^{0}_{J_{r}:}\geq 0$
2: Broadcast the random seed
3: for $t=0$ to $T-1$ do
4:     Generate random matrix $S^{t}\in\mathbb{R}^{n\times d}$
5:     Compute $A^{t}_{r}\leftarrow M_{I_{r}:}S^{t}$
6:     Compute $\bar{B}^{t}_{r}\leftarrow\left(V^{t}_{J_{r}:}\right)^{\top}S^{t}_{J_{r}:}$
7:     All-Reduce: $B^{t}\leftarrow\sum_{i=1}^{N}\bar{B}^{t}_{i}$
8:     Update $U^{t+1}_{I_{r}:}$ by solving $\min_{U_{I_{r}:}}\|A^{t}_{r}-U_{I_{r}:}B^{t}\|$
9:     Generate random matrix $S^{\prime t}\in\mathbb{R}^{m\times d^{\prime}}$
10:    Compute $A^{\prime t}_{r}\leftarrow\left(M_{:J_{r}}\right)^{\top}S^{\prime t}$
11:    Compute $\bar{B}^{\prime t}_{r}\leftarrow\left(U^{t}_{I_{r}:}\right)^{\top}S^{\prime t}_{I_{r}:}$
12:    All-Reduce: $B^{\prime t}\leftarrow\sum_{i=1}^{N}\bar{B}^{\prime t}_{i}$
13:    Update $V^{t+1}_{J_{r}:}$ by solving $\min_{V_{J_{r}:}}\|A^{\prime t}_{r}-V_{J_{r}:}B^{\prime t}\|$
14: return $U^{T}_{I_{r}:}$ and $V^{T}_{J_{r}:}$
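To make the communication pattern of Alg. 2 concrete, the following single-process NumPy sketch simulates one $U$-update across $N$ nodes; the All-Reduce is emulated by an explicit sum over the node-local summands, and the nonnegative least-squares step is replaced by a crude least-squares-plus-projection placeholder (the real implementation is C++/MPI with the solvers of Sec. 3.5).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, d, N = 600, 400, 10, 40, 4
M = np.abs(rng.standard_normal((m, n)))
U = np.abs(rng.standard_normal((m, k)))
V = np.abs(rng.standard_normal((n, k)))
rows = np.array_split(np.arange(m), N)          # I_r
cols = np.array_split(np.arange(n), N)          # J_r

S = rng.standard_normal((n, d)) / np.sqrt(d)    # same S^t on every node (shared seed)
# each node r computes its local k x d summand (V_{J_r:})^T S_{J_r:}
B_bar = [V[cols[r], :].T @ S[cols[r], :] for r in range(N)]
B = sum(B_bar)                                  # emulated All-Reduce: B^t = V^{t,T} S^t
for r in range(N):                              # each node, independently:
    A_r = M[rows[r], :] @ S                     # A^t_r = M_{I_r:} S^t (no communication)
    # sketched local subproblem min_{U_{I_r:} >= 0} ||A_r - U_{I_r:} B||_F,
    # here approximated by unconstrained least squares + projection
    X, *_ = np.linalg.lstsq(B.T, A_r.T, rcond=None)
    U[rows[r], :] = np.maximum(X.T, 0.0)
```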

3.4 Generation of Random Matrices

A key problem in Alg. 2 is how to generate the random matrices $S^{t}\in\mathbb{R}^{n\times d}$ and $S^{\prime t}\in\mathbb{R}^{m\times d^{\prime}}$. Here we focus on generating a random $S^{t}\in\mathbb{R}^{n\times d}$ satisfying Assumption 1. The reason for choosing such a random matrix is that the corresponding sketched problem is equivalent to the original problem in expectation; we will prove this in Sec. 3.5.

Assumption 1.

Assume the random matrices are normalized and have bounded variance, i.e., there exists a constant $\sigma^{2}$ such that $\mathbb{E}\left[S^{t}S^{t\top}\right]=I$ and $\mathbb{V}\left[S^{t}S^{t\top}\right]\leq\sigma^{2}$ for all $t$, where $I$ is the identity matrix.

Different options exist for such matrices, which have different computation costs for forming the sketched matrices $A^{t}_{r}=M_{I_{r}:}S^{t}$ and $\bar{B}^{t}_{r}=\left(V^{t}_{J_{r}:}\right)^{\top}S^{t}_{J_{r}:}$. Since $M_{I_{r}:}$ is much larger than $V^{t}_{J_{r}:}$ and thus computing $A^{t}_{r}$ is more expensive, we only consider the cost of constructing $A^{t}_{r}$ here.

The most classical choice for a random matrix is one with i.i.d. Gaussian entries having mean 0 and variance $1/d$. It is easy to show that $\mathbb{E}\left[S^{t}S^{t\top}\right]=I$. Besides, a Gaussian random matrix has bounded variance because the Gaussian distribution has finite fourth-order moment. However, since each entry of such a matrix is totally random and thus no special structure exists in $S^{t}$, matrix multiplication will be expensive. That is, given $M_{I_{r}:}$ of size $|I_{r}|\times n$, computing its sketched matrix $A^{t}_{r}=M_{I_{r}:}S^{t}$ requires $\mathcal{O}(|I_{r}|nd)$ basic operations.

A seemingly better choice for $S^{t}$ is a subsampling random matrix. Each column of such a random matrix is uniformly sampled from $\{e_{1},e_{2},\dots,e_{n}\}$ without replacement, where $e_{i}\in\mathbb{R}^{n}$ is the $i$-th canonical basis vector (i.e., a vector whose $i$-th element is 1 and all others are 0). We can easily show that such an $S^{t}$ also satisfies $\mathbb{E}\left[S^{t}S^{t\top}\right]=I$ and that the variance $\mathbb{V}\left[S^{t}S^{t\top}\right]$ is bounded, but this time constructing the sketched matrix $A^{t}_{r}=M_{I_{r}:}S^{t}$ only requires $\mathcal{O}\left(|I_{r}|d\right)$. Besides, a subsampling random matrix preserves the sparsity of the original matrix. Hence, a subsampling random matrix would be favored over a Gaussian random matrix by most applications, especially for very large-scale or sparse problems. On the other hand, we observed in our experiments that a Gaussian random matrix can result in a faster per-iteration convergence rate, because each column of the sketched matrix $A^{t}_{r}$ contains entries from multiple columns of the original matrix and thus is more informative. Hence, it is better to use a Gaussian matrix when the sketch size $d$ is small and thus an $\mathcal{O}(|I_{r}|nd)$ complexity is acceptable, or when the network speed of the cluster is poor, so that we should trade more local computation cost for less communication cost.

Although we only test two representative types of random matrices (i.e., Gaussian and subsampling random matrices), our framework is readily applicable to other choices, such as the subsampled randomized Hadamard transform (SRHT) [42, 43] and count sketch [44, 45]. The choice of random matrices is not the focus of this paper and is left for future investigation.
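A short NumPy sketch contrasting the two tested choices; note that for the subsampling matrix, $M_{I_{r}:}S^{t}$ reduces to selecting $d$ columns of $M_{I_{r}:}$, which preserves sparsity and costs only $\mathcal{O}(|I_{r}|d)$ (the explicit $\sqrt{n/d}$ normalization shown here is an assumed convention for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50
M_Ir = rng.standard_normal((120, n))           # a node's row block M_{I_r:} (illustrative)

# Gaussian sketch: dense S^t, O(|I_r| n d) multiplication
S_gauss = rng.standard_normal((n, d)) / np.sqrt(d)
A_gauss = M_Ir @ S_gauss

# Subsampling sketch: columns of S^t are (scaled) canonical basis vectors,
# so M_{I_r:} S^t is just column selection -- O(|I_r| d), sparsity-preserving
idx = rng.choice(n, size=d, replace=False)
A_sub = M_Ir[:, idx] * np.sqrt(n / d)          # scaling assumed so that E[S^t S^{t,T}] = I
```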

3.5 Solving Subproblems

Before describing how to solve subproblem (10), let us make an important observation. As discussed in Sec. 2.2.2, the sketching technique has been applied to solving linear systems before. However, the situation is different in matrix factorization. Note that for the distributed matrix factorization problem we usually have:

\min_{U_{I_{r}:}\in\mathbb{R}_{+}^{|I_{r}|\times k}}\left\|M_{I_{r}:}-U_{I_{r}:}V^{t\top}\right\|_{F}^{2}\neq 0.\qquad(12)

Consider the sketched subproblem (10), which can be equivalently written as:

\min_{U_{I_{r}:}\in\mathbb{R}_{+}^{|I_{r}|\times k}}\left\|\left(M_{I_{r}:}-U_{I_{r}:}V^{t\top}\right)S^{t}\right\|_{F}^{2},\qquad(13)

where the non-zero entries of the residual matrix $\left(M_{I_{r}:}-U_{I_{r}:}V^{t\top}\right)$ are scaled by the matrix $S^{t}$ at different levels. As a consequence, the optimal solution is shifted by the sketching. This fact alerts us that in SANLS we should use the sketched subproblem (10) only to step towards the true optimal solution when updating $U^{t+1}$, and avoid converging to the solution of the sketched subproblem itself.

3.5.1 Projected Gradient Descent

A natural method is to use one step of projected gradient descent for the sketched subproblem (we apply only one step here precisely to avoid the solution shift):

U^{t+1}_{I_{r}:}=\max\left\{U^{t}_{I_{r}:}-\eta_{t}\left.\nabla_{U_{I_{r}:}}\left\|A^{t}_{r}-U_{I_{r}:}B^{t}\right\|_{F}^{2}\right|_{U_{I_{r}:}=U^{t}_{I_{r}:}},\ 0\right\}=\max\left\{U^{t}_{I_{r}:}-2\eta_{t}\left[U^{t}_{I_{r}:}B^{t}B^{t\top}-A^{t}_{r}B^{t\top}\right],\ 0\right\},\qquad(14)

where $\eta_{t}>0$ is the step size and $\max\{\cdot,\cdot\}$ denotes the entry-wise maximum operation. In the gradient descent step (14), the computational cost mainly comes from two matrix multiplications: $B^{t}B^{t\top}$ and $A^{t}_{r}B^{t\top}$. Note that $A^{t}_{r}$ and $B^{t}$ are of sizes $|I_{r}|\times d$ and $k\times d$ respectively, thus the gradient descent step takes $\mathcal{O}\left(kd(|I_{r}|+k)\right)$ in total.

To better understand the nature of this algorithm, we further expand the gradient:

\nabla_{U_{I_{r}:}}\left\|A^{t}_{r}-U_{I_{r}:}B^{t}\right\|_{F}^{2}=2\left[U_{I_{r}:}B^{t}B^{t\top}-A^{t}_{r}B^{t\top}\right]\stackrel{(9)}{=}2\left[U_{I_{r}:}\left(V^{t\top}S^{t}\right)\left(V^{t\top}S^{t}\right)^{\top}-\left(M_{I_{r}:}S^{t}\right)\left(V^{t\top}S^{t}\right)^{\top}\right]=2\left[U_{I_{r}:}V^{t\top}\left(S^{t}S^{t\top}\right)V^{t}-M_{I_{r}:}\left(S^{t}S^{t\top}\right)V^{t}\right].\qquad(15)

By taking the expectation of the above equation and using the fact that $\mathbb{E}\left[S^{t}S^{t\top}\right]=I$, we have:

\mathbb{E}\left[\nabla_{U_{I_{r}:}}\left\|A^{t}_{r}-U_{I_{r}:}B^{t}\right\|_{F}^{2}\right]=2\left[U_{I_{r}:}V^{t\top}V^{t}-M_{I_{r}:}V^{t}\right]=\nabla_{U_{I_{r}:}}\left\|M_{I_{r}:}-U_{I_{r}:}V^{t\top}\right\|_{F}^{2},\qquad(16)

which means that the gradient of the sketched subproblem is equal to the gradient of the original problem in expectation. Therefore, such a step of gradient descent can be interpreted as a (generalized) stochastic gradient descent (SGD) [46] step on the original subproblem. Thus, according to the theory of SGD, we naturally require the step sizes $\left\{\eta_{t}\right\}$ to be diminishing, i.e., $\eta_{t}\rightarrow 0$ as $t$ increases.
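A compact NumPy rendering of the one-step projected gradient update (14) on node $r$; the step-size schedule and toy inputs below are illustrative choices.

```python
import numpy as np

def projected_gradient_step(U_r, A_r, B, eta):
    """One projected gradient step (14) on the sketched subproblem.

    U_r : |I_r| x k  current block U^t_{I_r:}
    A_r : |I_r| x d  sketched data A^t_r = M_{I_r:} S^t
    B   : k x d      sketched factor B^t = V^{t,T} S^t
    """
    grad = 2.0 * (U_r @ (B @ B.T) - A_r @ B.T)   # gradient of ||A_r - U_r B||_F^2
    return np.maximum(U_r - eta * grad, 0.0)     # entry-wise projection onto >= 0

# illustrative usage with a safe, diminishing step size eta_t
rng = np.random.default_rng(0)
U_r = np.abs(rng.standard_normal((8, 3)))
A_r, B = rng.standard_normal((8, 5)), rng.standard_normal((3, 5))
L = 2.0 * np.linalg.norm(B @ B.T, 2)             # Lipschitz constant of the gradient
for t in range(10):
    U_r = projected_gradient_step(U_r, A_r, B, eta=1.0 / (L * (t + 1)))
```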

3.5.2 Proximal Coordinate Descent

However, it is well known that the gradient descent method converges slowly, whereas the coordinate descent method, namely the HALS method for NMF, is quite efficient [10]. Still, precisely because of its very fast convergence, HALS should not be applied to the sketched subproblem directly, since it would converge towards the sketched solution, which is shifted away from the true optimal solution. Therefore, we would like to develop a method which resembles HALS but does not converge towards the solutions of the sketched subproblems.

To achieve this, we add a regularization term to the sketched subproblem (10). The new subproblem becomes:

\min_{U_{I_{r}:}\in\mathbb{R}_{+}^{|I_{r}|\times k}}\left\|A^{t}_{r}-U_{I_{r}:}B^{t}\right\|_{F}^{2}+\mu_{t}\left\|U_{I_{r}:}-U^{t}_{I_{r}:}\right\|_{F}^{2},\qquad(17)

where $\mu_{t}>0$ is a parameter. Such regularization is reminiscent of the proximal point method [47], and the parameter $\mu_{t}$ controls the step size in the same way as $1/\eta_{t}$ does in projected gradient descent. We therefore require $\mu_{t}\rightarrow+\infty$ to enforce the convergence of the algorithm, e.g., $\mu_{t}=t$.

At each step of proximal coordinate descent, only one column of $U_{I_{r}:}$, say $U_{I_{r}:j}$ where $j\in\{1,2,\dots,k\}$, is updated:

\min_{U_{I_{r}:j}\in\mathbb{R}_{+}^{|I_{r}|}}\bigg\|A^{t}_{r}-U_{I_{r}:j}B^{t}_{j:}-\sum_{l\neq j}U_{I_{r}:l}B^{t}_{l:}\bigg\|_{F}^{2}+\mu_{t}\left\|U_{I_{r}:j}-U^{t}_{I_{r}:j}\right\|_{2}^{2}.\qquad(18)

It is not hard to see that the above problem is still row-independent, which means that each entry of the vector $U_{I_{r}:j}$ can be solved independently at each node. For example, for any $i\in I_{r}$, the solution for $U^{t+1}_{i:j}$ is given by:

U^{t+1}_{i:j}=\operatorname*{arg\,min}_{U_{i:j}\geq 0}\bigg\|\left(A^{t}_{r}\right)_{i:}-U_{i:j}B^{t}_{j:}-\sum_{l\neq j}U_{i:l}B^{t}_{l:}\bigg\|_{2}^{2}+\mu_{t}\left\|U_{i:j}-U^{t}_{i:j}\right\|^{2}_{2}=\max\left\{\frac{\mu_{t}U_{i:j}^{t}+\left(A^{t}_{r}\right)_{i:}B_{j:}^{t\top}-\sum_{l\neq j}U_{i:l}B^{t}_{l:}B^{t\top}_{j:}}{B^{t}_{j:}B^{t\top}_{j:}+\mu_{t}},0\right\}.\qquad(19)

At each step of coordinate descent, we choose the column $j$ from $\{1,2,\dots,k\}$ successively. When updating column $j$ at iteration $t$, the columns $l<j$ have already been updated and thus $U_{I_{r}:l}=U^{t+1}_{I_{r}:l}$, while the columns $l>j$ are old, so $U_{I_{r}:l}=U^{t}_{I_{r}:l}$.

The complete proximal coordinate descent algorithm for the $U$-subproblem is summarized in Alg. 3. When updating column $j$, computing the matrix-vector multiplication $A^{t}_{r}B_{j:}^{t\top}$ takes $\mathcal{O}(d|I_{r}|)$. The whole inner loop takes $\mathcal{O}\left(k\left(d+|I_{r}|\right)\right)$ because one vector dot product of length $d$ is required for computing each summand and the summation itself needs $\mathcal{O}\left(k|I_{r}|\right)$. Considering that there are $k$ columns in total, the overall complexity of coordinate descent is $\mathcal{O}\left(k\left(\left(k+d\right)|I_{r}|+kd\right)\right)$. Typically, we choose $d>k$, so the complexity can be simplified to $\mathcal{O}\left(kd\left(|I_{r}|+k\right)\right)$, which is the same as that of gradient descent.

Since proximal coordinate descent is much more efficient than projected gradient descent, we adopt it as the default subproblem solver within DSANLS.

Algorithm 3 Proximal Coordinate Descent for Local Subproblem (10) on Node $r$

Parameter: $\mu_{t}>0$

1: for $j=1$ to $k$ do
2:     $T\leftarrow\mu_{t}U_{I_{r}:j}^{t}+A^{t}_{r}B_{j:}^{t\top}$
3:     for $l=1$ to $j-1$ do
4:         $T\leftarrow T-\left(B^{t}_{l:}B_{j:}^{t\top}\right)U^{t+1}_{I_{r}:l}$
5:     for $l=j+1$ to $k$ do
6:         $T\leftarrow T-\left(B^{t}_{l:}B_{j:}^{t\top}\right)U^{t}_{I_{r}:l}$
7:     $U^{t+1}_{I_{r}:j}\leftarrow\max\left\{T/\left(B_{j:}^{t}B_{j:}^{t\top}+\mu_{t}\right),0\right\}$
8: return $U_{I_{r}:}^{t+1}$
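A direct NumPy transcription of Alg. 3 for one local update (the toy call at the end and the choice of $\mu_t$ are illustrative):

```python
import numpy as np

def proximal_coordinate_descent(U_r, A_r, B, mu):
    """One pass of Alg. 3 over the k columns of the local block U_{I_r:}.

    U_r : |I_r| x k  current iterate U^t_{I_r:}
    A_r : |I_r| x d  sketched data A^t_r
    B   : k x d      sketched factor B^t
    mu  : proximal parameter mu_t (required to grow with t, e.g. mu_t = t)
    """
    U_new = U_r.copy()
    G = B @ B.T                                 # Gram matrix: G[l, j] = B_{l:} B_{j:}^T
    k = U_r.shape[1]
    for j in range(k):
        T = mu * U_r[:, j] + A_r @ B[j, :]      # mu_t U^t_{I_r:j} + A^t_r B_{j:}^T
        for l in range(k):
            if l == j:
                continue
            # columns l < j already hold new values, columns l > j still hold old ones
            T = T - G[l, j] * U_new[:, l]
        U_new[:, j] = np.maximum(T / (G[j, j] + mu), 0.0)
    return U_new

# illustrative usage
rng = np.random.default_rng(0)
U_r = np.abs(rng.standard_normal((8, 3)))
A_r, B = rng.standard_normal((8, 5)), rng.standard_normal((3, 5))
U_next = proximal_coordinate_descent(U_r, A_r, B, mu=1.0)
```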

3.6 Theoretical Analysis

3.6.1 Complexity Analysis

We now analyze the computational and communication costs of our DSANLS algorithm, when using subsampling random sketch matrices. The computational complexity at each node is:

\mathcal{O}\big(\overbrace{d}^{\text{generating }S^{t}}+\overbrace{|I_{r}|d}^{\text{constructing }A^{t}_{r}\text{ and }B^{t}}+\overbrace{kd(|I_{r}|+k)}^{\text{solving subproblem}}\big)=\mathcal{O}\left(kd(|I_{r}|+k)\right)\approx\mathcal{O}\left(kd\left(\frac{m}{N}+k\right)\right).\qquad(20)

Moreover, as we have shown in Sec. 3.3, the communication cost of DSANLS is $\mathcal{O}\left(kd\right)$.

On the other hand, for a classical implementation of distributed HALS [48], the computational cost is:

\mathcal{O}\left(kn\left(|I_{r}|+k\right)\right)\approx\mathcal{O}\left(kn\left(\frac{m}{N}+k\right)\right),\qquad(21)

and the communication cost is $\mathcal{O}\left(kn\right)$ due to the all-gathering of $V^{t}$’s.

Comparing the above quantities, we observe an $n/d\gg 1$ speedup of our DSANLS algorithm over HALS in both computation and communication. However, we empirically observed that DSANLS has a slower per-iteration convergence rate (i.e., it needs more iterations to converge). Still, as we will show in the next section, in practice DSANLS is superior to alternative distributed NMF algorithms, after taking all factors into account.

3.6.2 Convergence Analysis

Here we provide theoretical convergence guarantees for the proposed SANLS and DSANLS algorithms. We show that SANLS and DSANLS converge to a stationary point.

To establish the convergence result, we first need Assumption 2.

Assumption 2.

Assume all the iterates $U^{t}$ and $V^{t}$ have uniformly bounded norms, i.e., there exists a constant $R$ such that $\|U^{t}\|_{F}\leq R$ and $\|V^{t}\|_{F}\leq R$ for all $t$.

We experimentally observed that this assumption holds in practice, as long as the step sizes used are not too large. Besides, Assumption 2 can also be enforced by imposing additional constraints, such as:

U_{i:l}\leq\sqrt{2\|M\|_{F}}\quad\text{and}\quad V_{j:l}\leq\sqrt{2\|M\|_{F}}\quad\forall i,j,l,\qquad(22)

with which we have $R=\max\{m,n\}k\sqrt{2\|M\|_{F}}$. Such constraints can be easily handled by both our projected gradient descent and regularized coordinate descent solvers. Lemma 1 shows that imposing such extra constraints does not prevent us from finding the global optimal solution.

Lemma 1.

If the optimal solution to the original problem (1) exists, there is at least one global optimal solution in the domain (22).

Based on Assumption 1 (see Sec. 3.4) and Assumption 2, we can now formally state our main convergence result:

Theorem 1.

Under Assumptions 1 and 2, if the step sizes satisfy $\sum_{t=1}^{\infty}\eta_{t}=\infty$ and $\sum_{t=1}^{\infty}\eta_{t}^{2}<\infty$ for projected gradient descent, or $\sum_{t=1}^{\infty}1/\mu_{t}=\infty$ and $\sum_{t=1}^{\infty}1/\mu_{t}^{2}<\infty$ for regularized coordinate descent, then SANLS and DSANLS with either subproblem solver will converge to a stationary point of problem (1) with probability 1.

The proofs of Lemma 1 and Theorem 1 can be found in Appendices A and B.

4 Secure Distributed NMF

In this section, we provide our solutions to the problem of secure distributed NMF over federated data.

4.1 Extend DSANLS to Secure Setting

DSANLS and all lines of work discussed in Sec. 2.2.1 store copies of $M$ across a two-dimensional grid of nodes (shown in Fig. 1(a)), and exploit the independence of local update computation for rows of $U$ and $V$ to apply communication-optimal matrix multiplication. They cannot be applied directly to the secure distributed NMF setting. The reason is that, in the secure distributed NMF setting (shown in Fig. 1(b)), each node stores only one column block of $M$, while the other blocks cannot be disclosed to it.

Nevertheless, DSANLS can be adapted to this secure setting with modification, but only for one or a limited number of iterations. The reason is illustrated in Theorem 2. In the modified DSANLS algorithm, each node still takes charge of updating $U_{I_{r}:}$ and $V_{J_{r}:}$ as before, but only one block $M_{:J_{r}}$ of $M=[M_{1},M_{2},\dots,M_{N}]$ is stored at node $r$. Thus, the $V$-subproblem is exactly the same as in DSANLS. In contrast, we need an MPI All-Reduce operation to sum the local products $M_{:J_{r}}S^{t}_{J_{r}:}$ from all nodes before each iteration of the $U$-subproblem, so that each node has access to the fully sketched matrix $MS^{t}$ to solve the sketched $U$-subproblem. Note that here the random matrix $S^{t}$ not only helps reduce the communication cost from $\mathcal{O}(mn)$ to $\mathcal{O}(md)$ with a smaller NLS problem, but also conceals the full matrix $M$ in each iteration.
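A single-process sketch of this modification (illustrative sizes, All-Reduce emulated by an explicit sum): each node contributes only the product of its own column block with the corresponding rows of the shared $S^{t}$, so the full $MS^{t}$ is assembled without any node revealing its block of $M$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d, N = 300, 400, 30, 4
cols = np.array_split(np.arange(n), N)                        # J_1, ..., J_N
M_blocks = [rng.standard_normal((m, len(J))) for J in cols]   # node r holds only M_{:J_r}

S = rng.standard_normal((n, d)) / np.sqrt(d)                  # identical on all nodes (shared seed)
local = [M_blocks[r] @ S[cols[r], :] for r in range(N)]       # m x d contribution of node r
MS = sum(local)                                               # emulated All-Reduce -> full M S^t
# every node can now solve the sketched U-subproblem with MS,
# without ever observing another node's block of M
```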

Theorem 2.

$M$ cannot be recovered using only information about $MS$ (or $SM$) and $S$.

Proof.

If $S$ were a square invertible matrix, then given $MS$ (or $SM$) and $S$, we would be able to recover $M$ by $M=(MS)S^{-1}$ (or $M=S^{-1}(SM)$). However, the numbers of rows and columns of $S$ are highly imbalanced and $S$ is not a square matrix. Therefore $M$ cannot be recovered using only information about $MS$ (or $SM$) and $S$. ∎

However, NMF is an iterative algorithm (shown in Alg. 1), and secure computation within a limited number of iterations cannot guarantee acceptable accuracy for practical use, due to the following reason:

Theorem 3.

$M$ can be recovered after enough iterations.

Proof.

We can view $M\cdot S=MS$ as a system of linear equations with a variable matrix $M$ and constant matrices $S$ and $MS$. Each row of $M$ can then be solved for by a standard Gaussian elimination solver, given a sufficient number of $(S, MS)$ pairs. ∎
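A small numerical illustration of Theorem 3 (sizes illustrative): once an honest-but-curious party has accumulated sketches whose columns span $\mathbb{R}^{n}$, it can recover $M$ by solving the resulting linear system.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 20, 50, 10
M = rng.standard_normal((m, n))                  # the secret matrix

S_list, MS_list = [], []
for t in range(6):                               # 6 * d = 60 >= n accumulated equations
    S = rng.standard_normal((n, d))
    S_list.append(S)
    MS_list.append(M @ S)                        # what is observed at each iteration

S_all = np.hstack(S_list)                        # n x 60
MS_all = np.hstack(MS_list)                      # m x 60
# solve M S_all = MS_all for M (row-wise, via least squares / Gaussian elimination)
M_rec = np.linalg.lstsq(S_all.T, MS_all.T, rcond=None)[0].T
print(np.allclose(M_rec, M))                     # True: M is recovered after enough iterations
```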

Theorem 3 suggests that the DSANLS algorithm suffers from a dilemma between information disclosure and unacceptable accuracy, making it impractical for real applications. Therefore, we need new practical solutions to secure distributed NMF.

4.2 Synchronous Framework

A straightforward solution to secure distributed NMF is that each node solves a local NMF problem with a local copy of $U$ (denoted as $U_{(r)}$ for node $r$). Periodically, nodes communicate with each other and update their local copies of $U$ to the aggregation of all local copies $U_{(j)},j\in\{1,\cdots,N\}$, using an All-Reduce operation. We name this method Syn-SD, since it works in the synchronous setting. The detailed algorithm is shown in Alg. 4. Within the inner iterations, every node maintains its own copy of $U$ (i.e., $U_{(r)}$) by solving the regular NMF problem. Every $T_{2}$ rounds, the different local copies of $U$ are averaged across nodes as $\sum_{j=1}^{N}U_{(j)}/N$. Note that $U_{(r)}$ is one copy of the whole matrix $U$ stored locally at node $r$, while $V_{J_{r}:}$ is the corresponding part of the matrix $V=[V_{J_{1}:},V_{J_{2}:},\dots,V_{J_{N}:}]$ stored at node $r$.

In Syn-SD, the local copy $U_{(r)}$ at node $r$ is periodically updated to a uniform aggregation of the local copies from all nodes. A small number of inner iterations $T_{2}$ incurs a large communication cost caused by All-Reduce. A larger $T_{2}$ may lead to slow convergence, since each node does not share any information about its local copy $U_{(r)}$ within the inner iterations.

To improve the efficiency of data exchange, we incorporate matrix sketching into Syn-SD and propose an improved version called Syn-SSD. In Syn-SSD, information about the local copies is shared across cluster nodes more frequently, with communication overhead roughly the same as in Syn-SD. As shown in Alg. 5, the sketched version $S^{t}U_{(r)}$ of the local copy $U_{(r)}$ is exchanged within each inner iteration. There are two advantages of applying matrix sketching: (1) Since the sketched matrix has a much smaller size, the All-Reduce operation causes much less communication cost, making it affordable at higher frequency. (2) Solving a sketched NLS problem also reduces the computation cost, due to the reduced problem size when solving for $U_{(r)}$ and $V_{J_{r}:}$ at each node. It is worth noting that $S_{1}^{t}$ is exactly the same at each node, thanks to the use of the same seed and generator; the same holds for $S_{2}^{t}$, but $S_{1}^{t}$ and $S_{2}^{t}$ are not necessarily equal. With such a constraint, the algorithm is equivalent to NMF in a single-machine environment and convergence can be guaranteed.

Algorithm 4 Syn-SD: Secure Distributed NMF on node $r$

Input: $M_{:J_{r}}$
Parameter: Iteration numbers $T_{1},T_{2}$

1: initialize $U_{(r)}^{0}\geq 0$, $V_{J_{r}:}^{0}\geq 0$
2: for $t_{1}=0$ to $T_{1}-1$ do
3:     for $t_{2}=1$ to $T_{2}$ do
4:         $t\leftarrow t_{1}\times T_{2}+t_{2}$
5:         $U_{(r)}^{t}\leftarrow$ update($M_{:J_{r}}$, $U_{(r)}^{t-1}$, $V_{J_{r}:}^{t-1}$)
6:         $V_{J_{r}:}^{t}\leftarrow$ update($M_{:J_{r}}$, $U_{(r)}^{t}$, $V_{J_{r}:}^{t-1}$)
7:     All-Reduce: $U_{(r)}^{t}\leftarrow\frac{\sum_{j=1}^{N}U_{(j)}^{t}}{N}$
8: return $U_{(r)}^{t}$ and $V_{J_{r}:}^{t}$
Algorithm 5 Syn-SSD: Secure Sketched Distributed NMF on node $r$

Input: $M_{:J_{r}}$
Parameter: Iteration numbers $T_{1},T_{2}$

1: initialize $U_{(r)}^{0}\geq 0$, $V_{J_{r}:}^{0}\geq 0$
2: for $t_{1}=0$ to $T_{1}-1$ do
3:     for $t_{2}=1$ to $T_{2}$ do
4:         $t\leftarrow t_{1}\times T_{2}+t_{2}$
5:         Generate random matrix $S_{1}^{t}$
6:         $U_{(r)}^{t}\leftarrow$ update($M_{:J_{r}}S_{1}^{t}$, $U_{(r)}^{t-1}$, $V_{J_{r}:}^{t-1}S_{1}^{t}$)
7:         Generate random matrix $S_{2}^{t}$
8:         All-Reduce: $\overline{SU}^{t}\leftarrow\frac{\sum_{j=1}^{N}S_{2}^{t}U_{(j)}^{t}}{N}$
9:         $V_{J_{r}:}^{t}\leftarrow$ update($S_{2}^{t}M_{:J_{r}}$, $\overline{SU}^{t}$, $V_{J_{r}:}^{t-1}$)
10:    All-Reduce: $U_{(r)}^{t}\leftarrow\frac{\sum_{j=1}^{N}U_{(j)}^{t}}{N}$
11: return $U_{(r)}^{t}$ and $V_{J_{r}:}^{t}$
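A single-process sketch of the key exchange in Alg. 5 (sizes illustrative, All-Reduce emulated by a sum): the nodes average the sketched copies $S_{2}^{t}U_{(j)}$ rather than the full copies, shrinking the All-Reduce payload from $m\times k$ to $d\times k$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, d, N = 5000, 20, 100, 4
U_local = [np.abs(rng.standard_normal((m, k))) for _ in range(N)]   # U_(1), ..., U_(N)

S2 = rng.standard_normal((d, m)) / np.sqrt(d)   # same S_2^t on every node (shared seed)
# emulated All-Reduce of the sketched copies: payload is d x k instead of m x k
SU_bar = sum(S2 @ U_local[r] for r in range(N)) / N
print(SU_bar.shape)                             # (d, k), used when updating V_{J_r:}
```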

It is straightforward to see that Syn-SD and Syn-SSD satisfy Definition 1 and are $(N-1)$-private protocols, since $V_{J_{r}:}$ and $M_{:J_{r}}$ are only seen by node $r$.

4.3 Asynchronous Framework

In Syn-SD and Syn-SSD, each node must stall at the synchronization barrier until all participating nodes reach it before the All-Reduce operation. However, highly imbalanced data in real federated data mining scenarios may cause a severe workload imbalance problem. The synchronization barrier forces nodes with low workload to halt, making synchronous algorithms less efficient. In this section, we study secure distributed NMF in an asynchronous (i.e., server/client architecture) setting and propose corresponding asynchronous algorithms.

First of all, we extend the idea of Syn-SD to the asynchronous setting and name the new method Asyn-SD. In Asyn-SD, the server (Alg. 6) takes full charge of updating and broadcasting $U^{t}$. Once it receives $U^{t}_{(r)}$ from client node $r$, the server updates $U^{t}$ locally and returns the latest version of $U^{t}$ to client node $r$ for further computation. Note that the server may receive local copies of $U^{t}$ from clients in an arbitrary order. Consequently, we cannot use the same All-Reduce operation as in Syn-SD. Instead, $U^{t}$ on the server side is updated as the weighted sum of the current $U^{t}$ and the newly received local copy $U^{t}_{(r)}$ from client node $r$. Here the relaxation weight $\omega^{t}$ asymptotically converges to 0, so a converged $U^{t}$ is guaranteed on the server side. Our experiments in Sec. 5 suggest that this relaxation does not harm factorization convergence.
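A minimal sketch of the server-side relaxation update used in Alg. 6 below ($\rho$ is the relaxation parameter; everything else is illustrative):

```python
import numpy as np

def server_update(U, U_r, t, rho=10.0):
    """Asyn-SD / Asyn-SSD server step: blend an incoming client copy into U.

    The relaxation weight w_t = rho / (rho + t) decays to 0 as the update
    counter t grows, so U stabilizes as more client updates arrive.
    """
    w = rho / (rho + t)
    return (1.0 - w) * U + w * U_r
```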

Algorithm 6 Asyn-SD, Asyn-SSD: Server part

Parameter: Relaxation parameter $\rho$

1: initialize $U^{0}\geq 0$
2: $t\leftarrow 0$ ▷ $t$ is the update counter.
3: while not stopping do
4:     Receive $U^{t}_{(r)}$ from client node $r$
5:     $\omega^{t}\leftarrow\frac{\rho}{\rho+t}$ ▷ $\omega^{t}$ is the relaxation weight.
6:     $U^{t}\leftarrow(1-\omega^{t})U^{t}+\omega^{t}U^{t}_{(r)}$
7:     Send $U^{t}$ back to client node $r$
8:     $t\leftarrow t+1$
9: return $U^{t}$
Algorithm 7 Asyn-SD, Asyn-SSD: Client part of node $r$

Input: $M_{:J_{r}}$
Parameter: Iteration number $T$

1: initialize $V_{J_{r}:}^{0}\geq 0$
2: while Server not stopping do
3:     Receive $U$ from server
4:     $U_{(r)}^{0}\leftarrow U$
5:     for $t=1$ to $T$ do
6:         $V_{J_{r}:}^{t}\leftarrow$ update($M_{:J_{r}}$, $U_{(r)}^{t-1}$, $V_{J_{r}:}^{t-1}$)
7:         $U_{(r)}^{t}\leftarrow$ update($M_{:J_{r}}$, $U_{(r)}^{t-1}$, $V_{J_{r}:}^{t}$) ▷ For Asyn-SSD, replace this line with Lines 5-6 of Alg. 5.
8:     Send $U_{(r)}^{T}$ to server
9: return $V_{J_{r}:}^{T}$

On the other hand, the client nodes of Asyn-SD (Alg. 7) behave similarly to the nodes in Syn-SD. Clients locally solve the standard NMF problem for $T$ iterations, and then update their local $U^{t}_{(r)}$ by communicating only with the server node. Unlike Syn-SD, Asyn-SD does not have a global synchronization barrier. Client nodes in Asyn-SD independently exchange their local copies $U^{t}_{(r)}$ with the server, without an All-Reduce operation.

Similarly, Syn-SSD can be extended to its asynchronous version Asyn-SSD. However, the algorithm for clients is more constrained and conservative in sketching. Note that the random sketching matrices $S_{1}$ and $S_{2}$ (in Alg. 5) should be the same across the nodes involved in the same summation in order to have a meaningful summation of sketched matrices. However, enforcing the same $S_{2}^{t}$ for updating the sketched $U$ would result in a synchronous All-Reduce operation. Therefore, $U$ cannot be sketched in asynchronous algorithms and we only consider sketching $V_{J_{r}:}$ in Asyn-SSD (Line 7 in Alg. 7). The server part of Asyn-SSD is the same as Asyn-SD in Alg. 6.

Similar to the synchronous versions, Asyn-SD and Asyn-SSD satisfy Definition 1 and are $(N-1)$-private protocols, since $V_{J_{r}:}$ and $M_{:J_{r}}$ are only seen by node $r$.

5 Experimental Evaluation

This section includes an experimental evaluation of our algorithms on both dense and sparse real data matrices. The implementation of our methods is available at https://github.com/qianyuqiu79/DSANLS.

5.1 Setup

TABLE I: Statistics of datasets
Dataset #Rows #Columns Non-zero values Sparsity
BOATS 216,000 300 64,800,000 0%
MIT CBCL FACE 2,429 361 876,869 0%
MNIST 70,000 784 10,505,375 80.86%
GISETTE 13,500 5,000 8,770,559 87.01%
Reuters (RCV1) 804,414 47,236 60,915,113 99.84%
DBLP 317,080 317,080 2,416,812 99.9976%

We use several (dense and sparse) real datasets, as in Qian et al. [49], for evaluation. They correspond to different NMF tasks, including video analysis, image processing, text mining and community detection. Their statistics are summarized in Tab. I.

We conduct our experiments on a Linux cluster with 16 nodes. Each node contains an 8-core Intel® Core™ i7-3770 CPU @ 1.60GHz and 16 GB of memory. Our algorithms are implemented in C++ using the Intel® Math Kernel Library (MKL) and the Message Passing Interface (MPI). By default, we use 10 nodes and set the factorization rank $k$ to 100. We also report the impact of different node numbers (2-16) and values of $k$ (20-500). We use $\mu_{t}=\alpha+\beta t$ [50], perform a grid search for $\alpha$ and $\beta$ in the range {0.1, 1, 10} for each dataset, and report the best results. Because the use of Gaussian random matrices is too slow on the large datasets RCV1 and DBLP, we only use subsampling random matrices for them.

For the general acceleration of NMF, we assess DSANLS with subsampling and Gaussian random matrices, denoted by DSANLS/S and DSANLS/G, respectively, using proximal coordinate descent as the default subproblem solver. As mentioned in [5, 8], it is unfair to compare with a Hadoop implementation. We only compare DSANLS with MPI-FAUN (https://github.com/ramkikannan/nmflibrary), i.e., its MPI-FAUN-MU, MPI-FAUN-HALS, and MPI-FAUN-ABPP implementations, which is the first and the state-of-the-art C++/MPI implementation with MKL and Armadillo. For the parameters p_c and p_r in MPI-FAUN, we use the optimal values for each dataset, according to the recommendations in [5, 8].

For the problem of secure distributed NMF, we evaluate all proposed methods: Syn-SD, Syn-SSD with sketching on UU (denoted as Syn-SSD-U), Syn-SSD with sketching on VV (denoted as Syn-SSD-V), Syn-SSD with sketching on both UU and VV (denoted as Syn-SSD-UV), Asyn-SD, and Asyn-SSD with sketching on VV (denoted as Asyn-SSD-V), using proximal coordinate descent as the default subproblem solver. We do not list secure building block methods as baselines, since the communication overhead of these multi-round handshake protocols is heavy and it would be unfair to compare them with MPI-based methods. For example, the matrix sum protocol described by Duan and Canny [39] incurs 5X communication overhead compared to an MPI All-Reduce operation.

We use the relative error of the low rank approximation compared to the original matrix to measure the effectiveness of different NMF approaches. This error measure has been widely used in previous work [5, 8, 51] and is formally defined as MUVF/MF\left\|M-UV^{\top}\right\|_{F}/\left\|M\right\|_{F}.
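
For reference, this error measure is straightforward to compute; the snippet below is a dense NumPy sketch (for the sparse datasets, one would avoid materializing UV^⊤ explicitly).

import numpy as np

def relative_error(M, U, V):
    # ||M - U V^T||_F / ||M||_F, the measure reported in the experiments of this section
    return np.linalg.norm(M - U @ V.T) / np.linalg.norm(M)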

5.2 Evaluation on Accelerating General NMF

5.2.1 Performance Comparison

Figure 2: Relative error over time for general distributed NMF on (a) BOATS, (b) FACE, (c) MNIST, (d) GISETTE, (e) RCV1, and (f) DBLP.
Figure 3: Reciprocal of per-iteration time as a function of cluster size for general distributed NMF on (a) FACE, (b) MNIST, (c) RCV1, and (d) DBLP.
Figure 4: Relative error over time for general distributed NMF, varying the value of k: (a) k=20, (b) k=50, (c) k=200, (d) k=500.
Figure 5: Relative error per iteration of different subproblem solvers for general distributed NMF on (a) BOATS, (b) FACE, (c) GISETTE, and (d) RCV1.
Figure 6: Relative error over time for uniform workload in secure distributed NMF on (a) BOATS, (b) FACE, (c) MNIST, and (d) GISETTE.
Figure 7: Relative error over time for imbalanced workload in secure distributed NMF on (a) BOATS, (b) FACE, (c) MNIST, and (d) GISETTE.
Figure 8: Reciprocal of per-iteration time for uniform workload in secure distributed NMF on (a) BOATS, (b) FACE, (c) MNIST, and (d) GISETTE.
Figure 9: Reciprocal of per-iteration time for imbalanced workload in secure distributed NMF on (a) BOATS, (b) FACE, (c) MNIST, and (d) GISETTE.

Since our proposed DSANLS significantly reduces the time per iteration compared to MPI-FAUN, we compare methods in terms of relative error over wall-clock time. Fig. 2 shows the relative error over time for DSANLS and the MPI-FAUN implementations of MU, HALS, and ANLS/BPP on the 6 real public datasets. Observe that DSANLS/S performs best on all 6 datasets, although DSANLS/G has a faster per-iteration convergence rate. MU converges relatively slowly and usually reaches a worse final result; on the other hand, HALS may oscillate in the early rounds (HALS does not guarantee a monotonic decrease of the objective function), but converges quite fast and to a good solution. Surprisingly, although ANLS/BPP is considered the state-of-the-art NMF algorithm, it does not perform well on any of the 6 datasets. As we will see, this is due to its high per-iteration cost.

5.2.2 Scalability Comparison

We vary the number of nodes used in the cluster from 2 to 16 and record the average time for 100 iterations of each algorithm. Fig. 3 shows the reciprocal of per-iteration time as a function of the number of nodes used. All algorithms exhibit good scalability on all datasets (nearly a straight line), except for FACE (i.e., Fig. 3(a)). FACE is the smallest dataset, with only 361 columns, while kk is set to 100 by default. When n/Nn/N is smaller than kk, the complexity is dominated by kk; hence, increasing the number of nodes does not reduce the computational cost, but may increase the communication overhead. In general, we can observe that DSANLS/S has the lowest per-iteration cost among all algorithms, while DSANLS/G has a cost similar to MU and HALS. ANLS/BPP has the highest per-iteration cost, which explains its bad performance in Fig. 2.

5.2.3 Performance Varying the Value of kk

Although tuning the factorization rank kk is outside the scope of this paper, we compare the performance of DSANLS with MPI-FAUN while varying the value of kk from 20 to 500 on RCV1. Observe from Fig. 4 and Fig. 2(e) (k=100k=100) that DSANLS outperforms the state-of-the-art algorithms for all values of kk. Naturally, the relative error of all algorithms decreases as kk increases, but they also take longer to converge.

5.2.4 Comparison with Projected Gradient Descent

In Sec. 3.5, we claimed that our proximal coordinate descent approach (denoted as DSANLS-RCD) is faster than projected gradient descent (also presented in the same section, denoted as DSANLS-PGD). Fig. 5 confirms the difference in the convergence rate of the two approaches regardless of the random matrix generation approach.

5.3 Evaluation on Secure Distributed NMF

5.3.1 Performance Comparison for Uniform Workload

In Fig. 6, we show the relative error over time for secure distributed NMF algorithms on the 4 real public datasets, with a uniform partition of columns. Syn-SSD-UV performs best on BOATS, FACE and GISETTE. As we will see in the next section, this is because the per-iteration cost of Syn-SSD-UV is significantly reduced by sketching. On MNIST, Syn-SSD-U and Syn-SSD-V achieve better convergence in terms of relative error. Syn-SD and Asyn-SD converge relatively slowly and usually reach a worse final result; on the other hand, Asyn-SSD-V converges slowly but consistently produces better results than Syn-SD and Asyn-SD.

5.3.2 Performance Comparison for Imbalanced Workload

To evaluate the performance of the different methods when the workload is imbalanced, we conduct experiments on a skewed partition of the input matrix. Among the 10 worker nodes, node 0 is assigned 50%50\% of the columns, while the remaining columns are partitioned uniformly among the other nodes. The error measure is the same as in the uniform workload case.

It can be observed that under imbalanced workload, asynchronous algorithms generally outperform synchronous algorithms. Asyn-SSD-V gives the best result in terms of relative error over time, except on FACE, where Asyn-SD slowly converges to the best result. Unlike the uniform workload case in Fig. 6, the sketching method Syn-SSD-UV does not perform well under imbalanced workload. Syn-SD is basically inapplicable on BOATS, MNIST and GISETTE due to its slow speed. On the sparse datasets MNIST and GISETTE, Syn-SSD-V and Syn-SSD-U can converge to a good result, but they do not produce satisfactory results on the dense datasets BOATS and FACE.

5.3.3 Scalability Comparison

We vary the number of nodes used in the cluster from 2 to 16 and record the average time for 100 iterations of each algorithm. Fig. 8 shows the reciprocal of per-iteration time as a function of the number of nodes for uniform workload. All algorithms exhibit good scalability on all datasets (nearly a straight line), except for FACE (i.e., Fig. 8(b)). FACE is the smallest dataset, with 361 columns and 2,429 rows. When n/Nn/N is smaller than k=100k=100, the time consumed by the subproblem solvers is dominated by the communication overhead; hence, increasing the number of nodes does not reduce the per-iteration time. In general, we can observe that Syn-SSD-UV has the lowest per-iteration time of all algorithms, and also the best scalability, as indicated by the steepest slope. Synchronous averaging has the highest per-iteration cost, explaining its bad performance in the uniform workload experiments in Fig. 6.

In imbalanced workload settings, it is not surprising that asynchronous algorithms outperform synchronous algorithms with respect to scalability, as shown in Fig. 9. Synchronization barriers before All-Reduce operations severely affect the scalability of synchronous algorithms, resulting in a nearly flat curve for per-iteration time. The per-iteration time of Syn-SSD-UV is satisfactory when the cluster size is small, but it does not improve significantly when more nodes are deployed. In contrast, asynchronous algorithms demonstrate decent scalability as the number of nodes grows. The short average iteration time of Asyn-SD and Asyn-SSD-V, shown in Fig. 9, also explains their superior performance over their synchronous counterparts in Fig. 7.

In conclusion, with an overall evaluation of convergence and scalability, Syn-SSD-UV should be adopted for secure distributed NMF under uniform workload, while Asyn-SSD-V is a more reasonable choice for secure distributed NMF under imbalanced workload.

6 Conclusion

In this paper, we studied the acceleration and security problems of distributed NMF. Firstly, we presented a novel distributed NMF algorithm, DSANLS, that can be used for scalable analytics of high-dimensional matrix data. Our approach follows the general framework of ANLS, but utilizes matrix sketching to reduce the problem size of each NLS subproblem. We discussed and compared two different approaches for generating random matrices (i.e., Gaussian and subsampling random matrices). We presented two subproblem solvers for our general framework and theoretically proved that our algorithm is convergent. We analyzed the per-iteration computational and communication cost of our approach as well as its convergence, showing its superiority compared to the state-of-the-art. Secondly, we designed four efficient distributed NMF methods in both synchronous and asynchronous settings with a security guarantee. They are the first distributed NMF methods over federated data, where the data from all parties are utilized together in NMF for better performance, while the data of each party remains confidential and no individual information is leaked to other parties during the process. Finally, we conducted extensive experiments on several real datasets to show the superiority of our proposed methods. In the future, we plan to study applications of DSANLS to dense or sparse tensors and to consider more practical designs of asynchronous algorithms for secure distributed NMF.

7 Acknowledgment

This work was supported by the National Natural Science Foundation of China (no. 61972328).

Appendix A Proof of Lemma 1

Proof of Lemma 1.

Suppose (U,V)(U^{*},V^{*}) is the global optimal solution but fails to satisfy Eq. 22 in the paper. If there exist indices i,j,li,j,l such that Ui:lVj:l>2MFU^{*}_{i:l}\cdot V^{*}_{j:l}>2\|M\|_{F}, then

MUVF2\displaystyle\left\|M-U^{*}V^{*\top}\right\|_{F}^{2} (Ui:lVj:lMi:j)2>(2MFMF)2\displaystyle\geq\left(U^{*}_{i:l}\cdot V^{*}_{j:l}-M_{i:j}\right)^{2}>\left(2\|M\|_{F}-\|M\|_{F}\right)^{2} (23)
MF2.\displaystyle\geq\|M\|_{F}^{2}.

However, simply choosing U=0U=0 and V=0V=0 will yield a smaller error MF2\|M\|_{F}^{2}, which contradicts the fact that (U,V)(U^{*},V^{*}) is optimal. Therefore, if we define αl=maxiUi:l\alpha_{l}=\max_{i}U^{*}_{i:l} and βl=maxjVj:l\beta_{l}=\max_{j}V^{*}_{j:l}, we must have αlβl2MF\alpha_{l}\cdot\beta_{l}\leq 2\|M\|_{F} for each ll. Now we construct a new solution (U¯,V¯)(\overline{U},\overline{V}) by:

U¯i:l=Ui:lβl/αlandV¯j:l=Vj:lαl/βl.\overline{U}_{i:l}=U^{*}_{i:l}\cdot\sqrt{\beta_{l}/\alpha_{l}}\quad\text{and}\quad\overline{V}_{j:l}=V^{*}_{j:l}\cdot\sqrt{\alpha_{l}/\beta_{l}}. (24)

Note that

U¯i:lαlβl/αl=αlβl2MF,\displaystyle\overline{U}_{i:l}\leq\alpha_{l}\cdot\sqrt{\beta_{l}/\alpha_{l}}=\sqrt{\alpha_{l}\cdot\beta_{l}}\leq\sqrt{2\|M\|_{F}}, (25)
V¯j:lβlαl/βl=αlβl2MF,\displaystyle\overline{V}_{j:l}\leq\beta_{l}\cdot\sqrt{\alpha_{l}/\beta_{l}}=\sqrt{\alpha_{l}\cdot\beta_{l}}\leq\sqrt{2\|M\|_{F}},

so (U¯,V¯)(\overline{U},\overline{V}) satisfies Eq. 22 in the paper. Besides,

MU¯V¯F2=i,j(Mi:jlU¯i:lV¯j:l)2\displaystyle\left\|M-\overline{U}\,\overline{V}^{\top}\right\|_{F}^{2}=\sum_{i,j}\Big{(}M_{i:j}-\sum_{l}\overline{U}_{i:l}\overline{V}_{j:l}\Big{)}^{2} (26)
=\displaystyle= i,j(Mi:jlUi:lβl/αlVj:lαl/βl)2\displaystyle\sum_{i,j}\Big{(}M_{i:j}-\sum_{l}U^{*}_{i:l}\cdot\sqrt{\beta_{l}/\alpha_{l}}\cdot V^{*}_{j:l}\cdot\sqrt{\alpha_{l}/\beta_{l}}\Big{)}^{2}
=\displaystyle= i,j(Mi:jlUi:lVj:l)2=MUVF2,\displaystyle\sum_{i,j}\Big{(}M_{i:j}-\sum_{l}U^{*}_{i:l}\cdot V^{*}_{j:l}\Big{)}^{2}=\|M-U^{*}V^{*\top}\|_{F}^{2},

which means that (U¯,V¯)(\overline{U},\overline{V}) is also an optimal solution. In short, for any optimal solution of Eq. 1 outside the domain of Eq. 22, there exists a corresponding global optimal solution inside that domain, which further means that at least one optimal solution lies in the domain of Eq. 22. ∎
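
A quick numerical check of the rescaling step in Eq. 24 (a NumPy sketch with arbitrary illustrative sizes) confirms that it leaves the product UV^⊤, and hence the objective, unchanged, while equalizing the column maxima of the two factors:

import numpy as np

rng = np.random.default_rng(1)
U, V = rng.random((6, 3)), rng.random((5, 3))
alpha, beta = U.max(axis=0), V.max(axis=0)            # alpha_l = max_i U_il, beta_l = max_j V_jl
U_bar = U * np.sqrt(beta / alpha)                     # scale column l of U by sqrt(beta_l / alpha_l)
V_bar = V * np.sqrt(alpha / beta)                     # scale column l of V by the reciprocal factor
assert np.allclose(U @ V.T, U_bar @ V_bar.T)          # same product, hence same objective value
assert np.allclose(U_bar.max(axis=0), np.sqrt(alpha * beta))  # column maxima become sqrt(alpha_l * beta_l)
assert np.allclose(V_bar.max(axis=0), np.sqrt(alpha * beta))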

Appendix B Proof of Theorem 1

For simplicity, we denote f(U,V)=MUVF2f(U,V)=\|M-UV^{\top}\|_{F}^{2}, f~S=MSU(VS)F2\tilde{f}_{S}=\|MS-U(V^{\top}S)\|_{F}^{2}, and f~S=MSV(US)F2\tilde{f}^{\prime}_{S^{\prime}}=\|M^{\top}S^{\prime}-V(U^{\top}S^{\prime})\|_{F}^{2}. Let GtG^{t}, G~t\tilde{G}^{t}, GtG^{\prime t} and G~t\tilde{G}^{\prime t} denote the corresponding gradients, i.e.,

GtUf(U,Vt)|U=Ut,G~tUf~St(U,Vt)|U=Ut,\displaystyle G^{t}\triangleq\nabla_{U}\left.f(U,V^{t})\right|_{U=U^{t}},\quad\tilde{G}^{t}\triangleq\nabla_{U}\left.\tilde{f}_{S^{t}}(U,V^{t})\right|_{U=U^{t}}, (27)
GtVf(Ut+1,V)|V=Vt,G~tVf~St(Ut+1,V)|V=Vt.\displaystyle G^{\prime t}\triangleq\nabla_{V}\left.f(U^{t+1},V)\right|_{V=V^{t}},\quad\tilde{G}^{\prime t}\triangleq\nabla_{V}\left.\tilde{f}^{\prime}_{S^{\prime t}}(U^{t+1},V)\right|_{V=V^{t}}.

Besides, let

Δt1ηt(UtUt+1)andΔt1ηt(VtVt+1).\Delta^{t}\triangleq\frac{1}{\eta_{t}}\left(U^{t}-U^{t+1}\right)\quad\text{and}\quad\Delta^{\prime t}\triangleq\frac{1}{\eta_{t}}\left(V^{t}-V^{t+1}\right). (28)
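
The following small NumPy sketch illustrates these quantities for a scaled column-subsampling sketch S (so that E[S S^⊤] = I); the sizes and the step size η are arbitrary illustrative choices. It also checks the elementwise form of Δ^t used later in Eq. 63.

import numpy as np

rng = np.random.default_rng(2)
m, n, k, d, eta = 8, 10, 3, 4, 0.1
M, U, V = rng.random((m, n)), rng.random((m, k)), rng.random((n, k))

idx = rng.choice(n, size=d, replace=False)                    # subsampled column indices
S = np.zeros((n, d)); S[idx, np.arange(d)] = np.sqrt(n / d)   # scaling gives E[S S^T] = I

G = 2 * (U @ V.T - M) @ V                             # full gradient G^t w.r.t. U
G_tilde = 2 * (U @ V.T - M) @ (S @ S.T) @ V           # sketched gradient (cf. Eq. 51)
U_next = np.maximum(U - eta * G_tilde, 0.0)           # projected gradient step on U
Delta = (U - U_next) / eta                            # Delta^t as defined in Eq. 28
assert np.allclose(Delta, np.minimum(U / eta, G_tilde))   # elementwise identity of Eq. 63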

B.1 Preliminary Lemmas

To prove Theorem 1, we need the following lemmas (which are proved in Sec. B.3):

Lemma 2.

Under Assumptions 1 and 2, conditioned on UtU^{t} and VtV^{t}, G~t\tilde{G}^{t} and G~t\tilde{G}^{\prime t} are unbiased estimators of GtG^{t} and GtG^{\prime t} respectively with uniformly bounded variance.

Lemma 3.

Assume XX is a nonnegative random variable with mean μ\mu and variance σ2\sigma^{2}, and c0c\geq 0 is a constant. Then we have

𝔼[min{X,c}]min{c,μ2}(14σ24σ2+μ2).\mathbb{E}\left[\min\{X,c\}\right]\geq\min\left\{c,\frac{\mu}{2}\right\}\cdot\left(1-\frac{4\sigma^{2}}{4\sigma^{2}+\mu^{2}}\right). (29)
Lemma 4.

Define the function

ϕ(x,y,z)=min{|xy|,y2/2}(14z24z2+y2)0.\phi(x,y,z)=\min\left\{|xy|,y^{2}/2\right\}\cdot\left(1-\frac{4z^{2}}{4z^{2}+y^{2}}\right)\geq 0. (30)

Conditioned on UtU^{t} and VtV^{t}, there exists a uniform constant σ2>0\sigma^{\prime 2}>0 such that

𝔼[Gi:ltΔi:lt]ϕ(Ui:lt/ηt,Gi:lt,σ2)\mathbb{E}[G^{t}_{i:l}\cdot\Delta^{t}_{i:l}]\geq\phi\left(U^{t}_{i:l}/\eta_{t},G^{t}_{i:l},\sigma^{\prime 2}\right) (31)

and

𝔼[Gj:ltΔj:lt]ϕ(Vj:lt/ηt,Gj:lt,σ2)\mathbb{E}[G^{\prime t}_{j:l}\cdot\Delta^{\prime t}_{j:l}]\geq\phi\left(V^{t}_{j:l}/\eta_{t},G^{\prime t}_{j:l},\sigma^{\prime 2}\right) (32)

for any i,j,li,j,l.

Lemma 5 (Supermartingale Convergence Theorem [52]).

Let YtY_{t}, ZtZ_{t} and WtW_{t}, t=0,1,t=0,1,\dots, be three sequences of random variables and let t\mathcal{F}_{t}, t=0,1,t=0,1,\dots, be sets of random variables such that tt+1\mathcal{F}_{t}\subset\mathcal{F}_{t+1}. Suppose that

  1. The random variables YtY_{t}, ZtZ_{t} and WtW_{t} are nonnegative, and are functions of the random variables in t\mathcal{F}_{t}.

  2. For each tt, we have
     𝔼[Yt+1|t]YtZt+Wt.\mathbb{E}[Y_{t+1}|\mathcal{F}_{t}]\leq Y_{t}-Z_{t}+W_{t}. (33)

  3. There holds, with probability 1, t=0Wt<\sum_{t=0}^{\infty}W_{t}<\infty.

Then we have t=0Zt<\sum_{t=0}^{\infty}Z_{t}<\infty, and the sequence YtY_{t} converges to a nonnegative random variable YY, with probability 1.

Lemma 6 ([53]).

For two nonnegative scalar sequences {at}\{a_{t}\} and {bt}\{b_{t}\}, if t=0at=\sum_{t=0}^{\infty}a_{t}=\infty and t=0atbt<\sum_{t=0}^{\infty}a_{t}b_{t}<\infty, then

lim inftbt=0.\liminf_{t\rightarrow\infty}b_{t}=0. (34)

Furthermore, if |bt+1bt|Bat|b_{t+1}-b_{t}|\leq B\cdot a_{t} for some constant B>0B>0, then

limtbt=0.\lim_{t\rightarrow\infty}b_{t}=0. (35)

B.2 Proof of Theorem 1

Proof of Theorem 1.

Let us first focus on projected gradient descent. By conditioning on UtU^{t} and VtV^{t}, we have

f(U^{t+1},V^{t}) = \left\|M-U^{t+1}V^{t\top}\right\|_{F}^{2} = \left\|M-\left(U^{t}-\eta_{t}\Delta^{t}\right)V^{t\top}\right\|_{F}^{2}
= \left\|\left(M-U^{t}V^{t\top}\right)+\eta_{t}\Delta^{t}V^{t\top}\right\|_{F}^{2}
= \left\|M-U^{t}V^{t\top}\right\|_{F}^{2}+2\eta_{t}\left(M-U^{t}V^{t\top}\right)\cdot\left(\Delta^{t}V^{t\top}\right)+\eta_{t}^{2}\|\Delta^{t}V^{t\top}\|_{F}^{2}
= f(U^{t},V^{t})+2\eta_{t}\left(M-U^{t}V^{t\top}\right)\cdot\left(\Delta^{t}V^{t\top}\right)+\eta_{t}^{2}\|\Delta^{t}V^{t\top}\|_{F}^{2}. (36)

For the second term of Eq. 36, note that

2\left(M-U^{t}V^{t\top}\right)\cdot\left(\Delta^{t}V^{t\top}\right) = 2\,\mathrm{tr}\left[\left(M-U^{t}V^{t\top}\right)V^{t}\Delta^{t\top}\right]
= -\mathrm{tr}\left[G^{t}\Delta^{t\top}\right] = -\sum_{i,l}G^{t}_{i:l}\cdot\Delta^{t}_{i:l}. (37)

By taking expectation and using Lemma 4, we obtain:

\mathbb{E}\left[2\left(M-U^{t}V^{t\top}\right)\cdot\left(\Delta^{t}V^{t\top}\right)\right] = -\sum_{i,l}\mathbb{E}\left[G^{t}_{i:l}\cdot\Delta^{t}_{i:l}\right]
\leq -\sum_{i,l}\phi\left(U^{t}_{i:l}/\eta_{t},G^{t}_{i:l},\sigma^{\prime 2}\right). (38)

For simplicity, we will use the notation

Φ(Ut/ηt,Gt)i,lϕ(Ui:lt/ηt,Gi:lt,σ2).\Phi(U^{t}/\eta_{t},G^{t})\triangleq\sum_{i,l}\phi\left(U^{t}_{i:l}/\eta_{t},G^{t}_{i:l},\sigma^{\prime 2}\right). (39)

For the third term of Eq. 36, we can bound it in the following way:

ΔtVtF2\displaystyle\|\Delta^{t}V^{t\top}\|_{F}^{2}\leq ΔtF2VtF2G~tF2VtF2\displaystyle\|\Delta^{t}\|_{F}^{2}\cdot\|V^{t}\|_{F}^{2}\leq\|\tilde{G}^{t}\|_{F}^{2}\cdot\|V^{t}\|_{F}^{2} (40)
=\displaystyle= 2(UtVtM)(StSt)VtF2VtF2\displaystyle\left\|2(U^{t}V^{t\top}-M)(S^{t}S^{t\top})V^{t}\right\|_{F}^{2}\cdot\|V^{t}\|_{F}^{2}
\displaystyle\leq 4MUtVtF2StStF2VtF4\displaystyle 4\|M-U^{t}V^{t\top}\|_{F}^{2}\cdot\|S^{t}S^{t\top}\|^{2}_{F}\cdot\|V^{t}\|_{F}^{4}
\displaystyle\leq 8(MF2+UtF2VtF2)StStF2VtF4\displaystyle 8\left(\|M\|_{F}^{2}+\|U^{t}\|_{F}^{2}\cdot\|V^{t}\|_{F}^{2}\right)\cdot\|S^{t}S^{t\top}\|^{2}_{F}\cdot\|V^{t}\|_{F}^{4}
\displaystyle\leq 8(MF2+R4)R4StStF2,\displaystyle 8\left(\|M\|_{F}^{2}+R^{4}\right)R^{4}\cdot\|S^{t}S^{t\top}\|^{2}_{F},

where in the last inequality we have applied Assumption 2. If we take expectation, we have

𝔼ΔtVtF2\displaystyle\mathbb{E}\|\Delta^{t}V^{t\top}\|_{F}^{2}\leq 8(MF2+R4)R4𝔼StStF2\displaystyle 8\left(\|M\|_{F}^{2}+R^{4}\right)R^{4}\cdot\mathbb{E}\|S^{t}S^{t\top}\|^{2}_{F} (41)
\displaystyle\leq 8(MF2+R4)R4(𝔼[StSt]2+𝕍[StSt])\displaystyle 8\left(\|M\|_{F}^{2}+R^{4}\right)R^{4}\cdot\left(\left\|\mathbb{E}[S^{t}S^{t\top}]\right\|^{2}+\mathbb{V}[S^{t}S^{t\top}]\right)
\displaystyle\leq 8(MF2+R4)R4(n+σ2),\displaystyle 8\left(\|M\|_{F}^{2}+R^{4}\right)R^{4}\cdot\left(n+\sigma^{2}\right),

where the mean-variance decomposition has been applied in the second inequality, and Assumption 1 was used in the last line. For convenience, we will use

Γ8(MF2+R4)R4(n+σ2)0\Gamma\triangleq 8\left(\|M\|_{F}^{2}+R^{4}\right)R^{4}\cdot\left(n+\sigma^{2}\right)\geq 0 (42)

to denote this constant later on.

By combining all results, we can rewrite Eq. 36 as

𝔼[f(Ut+1,Vt)]f(Ut,Vt)ηtΦ(Ut/ηt,Gt)+ηt2Γ.\mathbb{E}\left[f(U^{t+1},V^{t})\right]\leq f(U^{t},V^{t})-\eta_{t}\Phi\left(U^{t}/\eta_{t},G^{t}\right)+\eta_{t}^{2}\Gamma. (43)

Likewise, conditioned on Ut+1U^{t+1} and VtV^{t}, we can prove a similar inequality for VV:

𝔼[f(Ut+1,Vt+1)]f(Ut+1,Vt)ηtΦ(Vt/ηt,Gt)+ηt2Γ,\mathbb{E}\left[f(U^{t+1},V^{t+1})\right]\leq f(U^{t+1},V^{t})-\eta_{t}\Phi\left(V^{t}/\eta_{t},G^{\prime t}\right)+\eta_{t}^{2}\Gamma^{\prime}, (44)

where Γ0\Gamma^{\prime}\geq 0 is also some uniform constant. From the definition, it is easy to see that both Φ(Ut/ηt,Gt)\Phi\left(U^{t}/\eta_{t},G^{t}\right) and Φ(Vt/ηt,Gt)\Phi\left(V^{t}/\eta_{t},G^{\prime t}\right) are nonnegative. Along with the condition t=0ηt2<\sum_{t=0}^{\infty}\eta_{t}^{2}<\infty, we can apply the Supermartingale Convergence Theorem (Lemma 5) with

Y2t=f(Ut,Vt),Y2t+1=f(Ut+1,Vt),\displaystyle Y_{2t}=f(U^{t},V^{t}),\quad Y_{2t+1}=f(U^{t+1},V^{t}), (45)
Z2t=Φ(Ut/ηt,Gt),Z2t+1=Φ(Vt/ηt,Gt),\displaystyle Z_{2t}=\Phi\left(U^{t}/\eta_{t},G^{t}\right),\quad Z_{2t+1}=\Phi\left(V^{t}/\eta_{t},G^{\prime t}\right),
W2t=Γηt2,W2t+1=Γηt2,\displaystyle W_{2t}=\Gamma\eta_{t}^{2},\quad W_{2t+1}=\Gamma^{\prime}\eta_{t}^{2},

and then conclude that both {f(Ut+1,Vt)}\{f(U^{t+1},V^{t})\} and {f(Ut,Vt)}\{f(U^{t},V^{t})\} converge to the same value, and besides:

t=0ηt[Φ(Ut/ηt,Gt)+Φ(Vt/ηt,Gt)]<,\sum_{t=0}^{\infty}\eta_{t}\left[\Phi\left(U^{t}/\eta_{t},G^{t}\right)+\Phi\left(V^{t}/\eta_{t},G^{\prime t}\right)\right]<\infty, (46)

with probability 1. In addition, it is not hard to verify that |Φ(Ut+1/ηt+1,Gt+1)Φ(Ut/ηt,Gt)|Cηt\left|\Phi\left(U^{t+1}/\eta_{t+1},G^{t+1}\right)-\Phi\left(U^{t}/\eta_{t},G^{t}\right)\right|\leq C\cdot\eta_{t} for some constant CC because of the boundedness of the gradients. Then, by Lemma 6, we obtain that

limtΦ(Ut/ηt,Gt)=limti:lϕ(Ui:lt/ηt,Gi:lt,σ2)0.\lim_{t\rightarrow\infty}\Phi\left(U^{t}/\eta_{t},G^{t}\right)=\lim_{t\rightarrow\infty}\sum_{i:l}\phi\left(U^{t}_{i:l}/\eta_{t},G^{t}_{i:l},\sigma^{\prime 2}\right)\rightarrow 0. (47)

Since each summand in the above is nonnegative, this equation further implies

limtϕ(Ui:lt/ηt,Gi:lt,σ2)0\lim_{t\rightarrow\infty}\phi\left(U^{t}_{i:l}/\eta_{t},G^{t}_{i:l},\sigma^{\prime 2}\right)\rightarrow 0 (48)

for all ii and ll. By looking into the definition of ϕ\phi in Eq. 30, it is not hard to see that ϕ(Ui:lt/ηt,Gi:lt,σ2)0\phi\left(U^{t}_{i:l}/\eta_{t},G^{t}_{i:l},\sigma^{\prime 2}\right)\rightarrow 0 if and only if min{Ui:lt/ηt,|Gi:lt|}0\min\left\{U^{t}_{i:l}/\eta_{t},\,\big{|}G^{t}_{i:l}\big{|}\right\}\rightarrow 0. Considering ηt>0\eta_{t}>0 and ηt0\eta_{t}\rightarrow 0, we can conclude that

limtmin{Ui:lt,|Gi:lt|}0\lim_{t\rightarrow\infty}\min\left\{U^{t}_{i:l},\,\big{|}G^{t}_{i:l}\big{|}\right\}\rightarrow 0 (49)

for all i,li,l, which means either the gradient Gi:ltG^{t}_{i:l} converges to 0, or Ui:ltU^{t}_{i:l} converges to the boundary 0. In other words, the projected gradient at (Ut,Vt)(U^{t},V^{t}) w.r.t UU converges to 0 as tt\rightarrow\infty. Likewise, we can prove

limtmin{Vj:lt,|Gj:lt|}0,\lim_{t\rightarrow\infty}\min\left\{V^{t}_{j:l},\,\big{|}G^{\prime t}_{j:l}\big{|}\right\}\rightarrow 0, (50)

in a similar way, which completes the proof of projected gradient descent.

The proof for regularized coordinate descent is similar to that for projected gradient descent, and hence we only include a proof sketch here. The key is to establish an inequality similar to Eq. 36, but with the difference that just one column rather than the whole UU or VV is changed every time. Take U:1U_{:1} as an example. An important observation is that when projection does not happen, we can rewrite (19) in the paper as U:1t+1=U:1tG~:1/(τt+Bj:tBj:t)U^{t+1}_{:1}=U^{t}_{:1}-\tilde{G}_{:1}/(\tau_{t}+B^{t}_{j:}B^{t\top}_{j:}), which means that the moving direction of regularized coordinate descent is the same as that of projected gradient descent, but with step size 1/(τt+Bj:tBj:t)1/(\tau_{t}+B^{t}_{j:}B^{t\top}_{j:}). Since both the expectation and variance of Bj:tBj:tB^{t}_{j:}B^{t\top}_{j:} are bounded, we have 1/(τt+Bj:tBj:t)1/τt1/(\tau_{t}+B^{t}_{j:}B^{t\top}_{j:})\approx 1/\tau_{t} when τt\tau_{t} is large. For these two reasons, we can write down an inequality similar to Eq. 36. The remaining proof follows the one for projected gradient descent. ∎

B.3 Proof of Preliminary Lemmas

Proof of Lemma 2.

Since the proof for G~t\tilde{G}^{\prime t} is similar to that for G~t\tilde{G}^{t}, here we only focus on the latter.

First, let us write down the definition of GtG^{t} and G~t\tilde{G}^{t}:

Gt=2(UtVtM)Vt\displaystyle G^{t}=2(U^{t}V^{t\top}-M)V^{t} (51)
G~t=2(UtVtM)(StSt)Vt.\displaystyle\tilde{G}^{t}=2(U^{t}V^{t\top}-M)(S^{t}S^{t\top})V^{t}.

Therefore,

𝔼[G~t]=\displaystyle\mathbb{E}[\tilde{G}^{t}]= 𝔼[2(UtVtM)(StSt)Vt]\displaystyle\mathbb{E}\left[2(U^{t}V^{t\top}-M)(S^{t}S^{t\top})V^{t}\right] (52)
=\displaystyle= 2(UtVtM)𝔼[StSt]Vt\displaystyle 2(U^{t}V^{t\top}-M)\,\mathbb{E}[S^{t}S^{t\top}]\,V^{t}
=\displaystyle= 2(UtVtM)IVt=2(UtVtM)Vt=Gt,\displaystyle 2(U^{t}V^{t\top}-M)\,I\,V^{t}=2(U^{t}V^{t\top}-M)V^{t}=G^{t},

which means G~t\tilde{G}^{t} is an unbiased estimator of GtG^{t}. Besides, its variance is uniformly bounded because

𝕍[G~t]\displaystyle\mathbb{V}[\tilde{G}^{t}]\leq 𝕍[2(UtVtM)i:(StSt)V:lt]\displaystyle\mathbb{V}\left[2(U^{t}V^{t\top}-M)_{i:}(S^{t}S^{t\top})V^{t}_{:l}\right] (53)
\displaystyle\leq 4MUtVtF2𝕍[StSt]VtF2\displaystyle 4\|M-U^{t}V^{t\top}\|_{F}^{2}\cdot\mathbb{V}[S^{t}S^{t\top}]\cdot\|V^{t}\|_{F}^{2}
\displaystyle\leq 8(MF2+UtF2VtF2)VtF2𝕍[StSt]\displaystyle 8\left(\|M\|_{F}^{2}+\|U^{t}\|_{F}^{2}\|V^{t}\|_{F}^{2}\right)\cdot\|V^{t}\|_{F}^{2}\cdot\mathbb{V}[S^{t}S^{t\top}]
\displaystyle\leq 8(MF2+R4)R2σ2,\displaystyle 8\left(\|M\|_{F}^{2}+R^{4}\right)R^{2}\cdot\sigma^{2},

where both Assumptions 1 and 2 are applied in the last line. ∎
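
As an informal numerical check of Lemma 2, averaging the sketched gradient over many independent draws of a scaled column-subsampling S recovers the full gradient; the NumPy sketch below uses arbitrary small sizes.

import numpy as np

rng = np.random.default_rng(3)
m, n, k, d, trials = 6, 40, 2, 8, 20000
M, U, V = rng.random((m, n)), rng.random((m, k)), rng.random((n, k))
G = 2 * (U @ V.T - M) @ V                 # full gradient G^t (Eq. 51)

avg = np.zeros_like(G)
for _ in range(trials):
    idx = rng.choice(n, size=d, replace=False)
    S = np.zeros((n, d)); S[idx, np.arange(d)] = np.sqrt(n / d)   # E[S S^T] = I
    avg += 2 * (U @ V.T - M) @ (S @ S.T) @ V                      # one sketched gradient draw
print("max deviation from G:", np.max(np.abs(avg / trials - G)))  # shrinks as trials grow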

Proof of Lemma 3.

In this proof, we will use Cantelli’s inequality:

Pr(Xμ+λ)1σ2σ2+λ2λ<0.\mathrm{Pr}(X\geq\mu+\lambda)\geq 1-\frac{\sigma^{2}}{\sigma^{2}+\lambda^{2}}\quad\forall\lambda<0. (54)

When μ=0\mu=0, it is easy to see that the right-hand-side of Eq. 29 is 0. Considering that the left-hand-side is the expectation of a nonnegative random variable, Eq. 29 obviously holds in this case.

When μ>0\mu>0 and μ2c\mu\geq 2c, by using the fact that XX is nonnegative, we have

𝔼[min{X,c}]cPr(Xc).\mathbb{E}\left[\min\{X,c\}\right]\geq c\cdot\mathrm{Pr}(X\geq c). (55)

Now we can apply Cantelli’s inequality to bound Pr(Xc)\mathrm{Pr}(X\geq c) with λ=cμ<cμ/20\lambda=c-\mu<c-\mu/2\leq 0, and obtain:

𝔼[min{X,c}]\displaystyle\mathbb{E}\left[\min\{X,c\}\right] c(1σ2σ2+(μc)2)\displaystyle\geq c\cdot\left(1-\frac{\sigma^{2}}{\sigma^{2}+(\mu-c)^{2}}\right) (56)
c(1σ2σ2+(μμ/2)2)\displaystyle\geq c\cdot\left(1-\frac{\sigma^{2}}{\sigma^{2}+(\mu-\mu/2)^{2}}\right)
=c(14σ24σ2+μ2),\displaystyle=c\cdot\left(1-\frac{4\sigma^{2}}{4\sigma^{2}+\mu^{2}}\right),

where in the second inequality we used the fact cμ/2c\leq\mu/2 again.

When μ>0\mu>0 but μ<2c\mu<2c, we have:

𝔼[min{X,c}]𝔼[min{X,μ/2}].\mathbb{E}\left[\min\{X,c\}\right]\geq\mathbb{E}\left[\min\{X,\mu/2\}\right]. (57)

Now we can apply the inequality in Eq. 56 from the previous part with c=μ/2c=\mu/2, and thus

𝔼[min{X,c}]𝔼[min{X,μ/2}]μ2(14σ24σ2+μ2),\mathbb{E}\left[\min\{X,c\}\right]\geq\mathbb{E}\left[\min\{X,\mu/2\}\right]\geq\frac{\mu}{2}\cdot\left(1-\frac{4\sigma^{2}}{4\sigma^{2}+\mu^{2}}\right), (58)

which completes the proof. ∎
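
A small Monte-Carlo sanity check of Lemma 3 (not part of the proof): with X drawn from an exponential distribution, whose mean μ and variance μ² are known, the empirical value of E[min{X,c}] stays above the bound in Eq. 29 for several choices of c.

import numpy as np

rng = np.random.default_rng(4)
mu = 2.0
X = rng.exponential(mu, size=1_000_000)   # nonnegative, mean mu, variance mu^2
sigma2 = mu ** 2
for c in [0.5, 1.0, 5.0]:
    lhs = np.minimum(X, c).mean()                                     # E[min{X, c}]
    rhs = min(c, mu / 2) * (1 - 4 * sigma2 / (4 * sigma2 + mu ** 2))  # bound of Eq. 29
    print(f"c={c}: E[min(X,c)] = {lhs:.3f} >= bound = {rhs:.3f}")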

Proof of Lemma 4.

We only focus on GtG^{t} and Δt\Delta^{t}. We first show that

Gi:ltG~i:lt0G^{t}_{i:l}\cdot\tilde{G}^{t}_{i:l}\geq 0 (59)

for any random matrix StS^{t}. Note that

Gi:lt=2(UtVtM)i:V:lt\displaystyle G^{t}_{i:l}=2(U^{t}V^{t\top}-M)_{i:}V^{t}_{:l} (60)
G~i:lt=2(UtVtM)i:(StSt)V:lt.\displaystyle\tilde{G}^{t}_{i:l}=2(U^{t}V^{t\top}-M)_{i:}(S^{t}S^{t\top})V^{t}_{:l}.

Hence it suffices to show that a(StSt)bab0a^{\top}(S^{t}S^{t\top})b\cdot a^{\top}b\geq 0 holds for any vectors aa and bb:

a(StSt)bab\displaystyle a^{\top}(S^{t}S^{t\top})b\cdot a^{\top}b =tr(a(StSt)bba)\displaystyle=\mathrm{tr}\left(a^{\top}(S^{t}S^{t\top})bb^{\top}a\right) (61)
=tr(aa(StSt)bb)0,\displaystyle=\mathrm{tr}\left(aa^{\top}(S^{t}S^{t\top})bb^{\top}\right)\geq 0,

where the first equality is because AB=tr(AB)A\cdot B=\mathrm{tr}(AB^{\top}), the second equality is due to the cyclic permutation invariance of the trace, and the last inequality is because aaaa^{\top}, bbbb^{\top} and StStS^{t}S^{t\top} are all positive semi-definite matrices.

Now, let us consider the relationship between Δt\Delta^{t} and G~t\tilde{G}^{t}:

Δt=1ηt(UtUt+1)=1ηt(Utmax{UtηtG~t, 0}),\Delta^{t}=\frac{1}{\eta_{t}}\left(U^{t}-U^{t+1}\right)=\frac{1}{\eta_{t}}\left(U^{t}-\max\left\{U^{t}-\eta_{t}\tilde{G}^{t},\,0\right\}\right), (62)

from which it can be shown that

Δi:lt=min{Ui:lt/ηt,G~i:lt}.\Delta^{t}_{i:l}=\min\left\{U^{t}_{i:l}/\eta_{t},\,\tilde{G}^{t}_{i:l}\right\}. (63)

When Gi:lt=0G^{t}_{i:l}=0, it is easy to see that both sides of Eq. 31 become 0, and hence Eq. 31 holds.

When Gi:lt>0G^{t}_{i:l}>0, from Eq. 59 we know that G~i:lt0\tilde{G}^{t}_{i:l}\geq 0 regardless of the choice of StS^{t}. From Lemma 2 we know that

𝔼[G~i:lt]=Gi:lt\mathbb{E}[\tilde{G}^{t}_{i:l}]=G^{t}_{i:l} (64)

and there exists a constant σ20\sigma^{\prime 2}\geq 0 such that

𝕍[G~i:lt]σ2.\mathbb{V}[\tilde{G}^{t}_{i:l}]\leq\sigma^{\prime 2}. (65)

Since Ui:ltU^{t}_{i:l} is a nonnegative constant here, we can apply Lemma 3 to Eq. 63 and conclude

𝔼[Δi:lt]\displaystyle\mathbb{E}[\Delta^{t}_{i:l}]\geq min{Ui:lt/ηt,Gi:lt/2}(14𝕍[G~i:lt]4𝕍[G~i:lt]+(Gi:lt)2)\displaystyle\min\left\{U^{t}_{i:l}/\eta_{t},G^{t}_{i:l}/2\right\}\cdot\left(1-\frac{4\mathbb{V}[\tilde{G}^{t}_{i:l}]}{4\mathbb{V}[\tilde{G}^{t}_{i:l}]+\big{(}G^{t}_{i:l}\big{)}^{2}}\right) (66)
\displaystyle\geq min{Ui:lt/ηt,Gi:lt/2}(14σ24σ2+(Gi:lt)2),\displaystyle\min\left\{U^{t}_{i:l}/\eta_{t},G^{t}_{i:l}/2\right\}\cdot\left(1-\frac{4\sigma^{\prime 2}}{4\sigma^{\prime 2}+\big{(}G^{t}_{i:l}\big{)}^{2}}\right),

from which Eq. 31 is obvious.

When Gi:lt<0G^{t}_{i:l}<0, also from Eq. 59 we know that G~i:lt0\tilde{G}^{t}_{i:l}\leq 0. Since Ui:ltU^{t}_{i:l} is a nonnegative constant here, we always have

Δi:lt=min{Ui:lt/ηt,G~i:lt}=G~i:lt.\Delta^{t}_{i:l}=\min\left\{U^{t}_{i:l}/\eta_{t},\,\tilde{G}^{t}_{i:l}\right\}=\tilde{G}^{t}_{i:l}. (67)

Therefore, by taking expectation and using Lemma 2, we obtain

𝔼[Δi:lt]=𝔼[G~i:lt]=Gi:lt,\mathbb{E}[\Delta^{t}_{i:l}]=\mathbb{E}[\tilde{G}^{t}_{i:l}]=G^{t}_{i:l}, (68)

and thus

𝔼[Gi:ltΔi:lt]=(Gi:lt)2>(Gi:lt)22(14σ24σ2+(Gi:lt)2)\mathbb{E}\left[G^{t}_{i:l}\cdot\Delta^{t}_{i:l}\right]=\left(G^{t}_{i:l}\right)^{2}>\frac{\big{(}G^{t}_{i:l}\big{)}^{2}}{2}\cdot\left(1-\frac{4\sigma^{\prime 2}}{4\sigma^{\prime 2}+\big{(}G^{t}_{i:l}\big{)}^{2}}\right) (69)

for any constant σ\sigma^{\prime}, which means that Eq. 31 holds. ∎

References

  • Pauca et al. [2004] V. P. Pauca, F. Shahnaz, M. W. Berry, and R. J. Plemmons, “Text mining using non-negative matrix factorizations,” in SDM, 2004, pp. 452–456.
  • Kotsia et al. [2007] I. Kotsia, S. Zafeiriou, and I. Pitas, “A novel discriminant non-negative matrix factorization algorithm with applications to facial image characterization problems,” IEEE Trans. Information Forensics and Security, vol. 2, no. 3-2, pp. 588–595, 2007.
  • Gu et al. [2010] Q. Gu, J. Zhou, and C. H. Q. Ding, “Collaborative filtering: Weighted nonnegative matrix factorization incorporating user and item graphs,” in SDM, 2010, pp. 199–210.
  • Zhang and Yeung [2012] Y. Zhang and D. Yeung, “Overlapping community detection via bounded nonnegative matrix tri-factorization,” in KDD, 2012, pp. 606–614.
  • Kannan et al. [2016] R. Kannan, G. Ballard, and H. Park, “A high-performance parallel algorithm for nonnegative matrix factorization,” in PPOPP, 2016, pp. 9:1–9:11.
  • Kim et al. [2017] Y. Kim, J. Sun, H. Yu, and X. Jiang, “Federated tensor factorization for computational phenotyping,” in KDD, 2017, pp. 887–895.
  • Feng et al. [2018] J. Feng, L. T. Yang, Q. Zhu, and K.-K. R. Choo, “Privacy-preserving tensor decomposition over encrypted data in a federated cloud environment,” IEEE Trans. Dependable Sec. Comput., 2018.
  • Kannan et al. [2018] R. Kannan, G. Ballard, and H. Park, “MPI-FAUN: an mpi-based framework for alternating-updating nonnegative matrix factorization,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 3, pp. 544–558, 2018.
  • Lee and Seung [2000] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in NIPS, 2000, pp. 556–562.
  • Gillis [2014] N. Gillis, “The why and how of nonnegative matrix factorization,” arXiv Preprint, 2014. [Online]. Available: https://arxiv.org/abs/1401.5226
  • Daube-Witherspoon and Muehllehner [1986] M. E. Daube-Witherspoon and G. Muehllehner, “An iterative image space reconstruction algorthm suitable for volume ect,” IEEE Trans. Med. Imaging, vol. 5, no. 2, pp. 61–66, 1986.
  • Grippo and Sciandrone [2000] L. Grippo and M. Sciandrone, “On the convergence of the block nonlinear gauss-seidel method under convex constraints,” Oper. Res. Lett., vol. 26, no. 3, pp. 127–136, 2000.
  • Kim and Park [2008] H. Kim and H. Park, “Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method,” SIAM J. Matrix Analysis Applications, vol. 30, no. 2, pp. 713–730, 2008.
  • Kim and Park [2011] J. Kim and H. Park, “Fast nonnegative matrix factorization: An active-set-like method and comparisons,” SIAM J. Scientific Computing, vol. 33, no. 6, pp. 3261–3281, 2011.
  • Lin [2007] C. Lin, “Projected gradient methods for nonnegative matrix factorization,” Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.
  • Zdunek and Cichocki [2006] R. Zdunek and A. Cichocki, “Non-negative matrix factorization with quasi-newton optimization,” in ICAISC, vol. 4029, 2006, pp. 870–879.
  • Guan et al. [2012] N. Guan, D. Tao, Z. Luo, and B. Yuan, “Nenmf: An optimal gradient method for nonnegative matrix factorization,” IEEE Trans. Signal Processing, vol. 60, no. 6, pp. 2882–2898, 2012.
  • Naor and Nissim [2001] M. Naor and K. Nissim, “Communication preserving protocols for secure function evaluation,” in STOC, 2001, pp. 590–599.
  • Kanjani [2007] K. Kanjani, “Parallel non negative matrix factorization for document clustering,” CPSC-659 (Parallel and Distributed Numerical Algorithms) course. Texas A&M University, Tech. Rep, 2007.
  • Robila and Maciak [2006] S. A. Robila and L. G. Maciak, “A parallel unmixing algorithm for hyperspectral images,” in Intelligent Robots and Computer Vision XXIV: Algorithms, Techniques, and Active Vision, vol. 6384, 2006, p. 63840F.
  • Liu et al. [2010] C. Liu, H. Yang, J. Fan, L. He, and Y. Wang, “Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce,” in WWW, 2010, pp. 681–690.
  • Liao et al. [2014] R. Liao, Y. Zhang, J. Guan, and S. Zhou, “Cloudnmf: A mapreduce implementation of nonnegative matrix factorization for large-scale biological datasets,” Genomics, Proteomics & Bioinformatics, vol. 12, no. 1, pp. 48–51, 2014.
  • Yin et al. [2014] J. Yin, L. Gao, and Z. M. Zhang, “Scalable nonnegative matrix factorization with block-wise updates,” in ECML/PKDD (3), ser. Lecture Notes in Computer Science, vol. 8726, 2014, pp. 337–352.
  • Meng et al. [2016] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar, “Mllib: Machine learning in apache spark,” J. Mach. Learn. Res., vol. 17, pp. 34:1–34:7, 2016.
  • Grove et al. [2014] D. Grove, J. Milthorpe, and O. Tardieu, “Supporting array programming in X10,” in ARRAY@PLDI, 2014, pp. 38–43.
  • Mejía-Roa et al. [2015] E. Mejía-Roa, D. Tabas-Madrid, J. Setoain, C. García, F. Tirado, and A. D. Pascual-Montano, “Nmf-mgpu: non-negative matrix factorization on multi-gpu systems,” BMC Bioinformatics, vol. 16, pp. 43:1–43:12, 2015.
  • Gower and Richtárik [2015] R. M. Gower and P. Richtárik, “Randomized iterative methods for linear systems,” SIAM J. Matrix Analysis Applications, vol. 36, no. 4, pp. 1660–1690, 2015.
  • Pilanci and Wainwright [2016] M. Pilanci and M. J. Wainwright, “Iterative hessian sketch: Fast and accurate solution approximation for constrained least-squares,” J. Mach. Learn. Res., vol. 17, pp. 53:1–53:38, 2016.
  • Pilanci and Wainwright [2017] M. Pilanci and M. J. Wainwright, “Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence,” SIAM Journal on Optimization, vol. 27, no. 1, pp. 205–245, 2017.
  • Wang and Li [2010] F. Wang and P. Li, “Efficient nonnegative matrix factorization with random projections,” in SDM, 2010, pp. 281–292.
  • Lindell and Pinkas [2000] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” in CRYPTO, vol. 1880, 2000, pp. 36–54.
  • Wan et al. [2007] L. Wan, W. K. Ng, S. Han, and V. C. S. Lee, “Privacy-preservation for gradient descent methods,” in KDD, 2007, pp. 775–783.
  • Han et al. [2010] S. Han, W. K. Ng, L. Wan, and V. C. S. Lee, “Privacy-preserving gradient-descent methods,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 6, pp. 884–899, 2010.
  • Pathak and Raj [2010] M. A. Pathak and B. Raj, “Privacy preserving protocols for eigenvector computation,” in PSDML, vol. 6549, 2010, pp. 113–126.
  • Han et al. [2009] S. Han, W. K. Ng, and P. S. Yu, “Privacy-preserving singular value decomposition,” in ICDE, 2009, pp. 1267–1270.
  • Chen et al. [2017] S. Chen, R. Lu, and J. Zhang, “A flexible privacy-preserving framework for singular value decomposition under internet of things environment,” in IFIPTM, vol. 505, 2017, pp. 21–37.
  • Sakuma and Kobayashi [2010] J. Sakuma and S. Kobayashi, “Large-scale k-means clustering with user-centric privacy-preservation,” Knowl. Inf. Syst., vol. 25, no. 2, pp. 253–279, 2010.
  • Lin and Jaromczyk [2011] Z. Lin and J. W. Jaromczyk, “Privacy preserving spectral clustering over vertically partitioned data sets,” in FSKD, 2011, pp. 1206–1211.
  • Duan and Canny [2008] Y. Duan and J. F. Canny, “Practical private computation and zero-knowledge tools for privacy-preserving distributed data mining,” in SDM.   SIAM, 2008, pp. 265–276.
  • Beerliová-Trubíniová and Hirt [2008] Z. Beerliová-Trubíniová and M. Hirt, “Perfectly-secure MPC with linear communication complexity,” in TCC, vol. 4948, 2008, pp. 213–230.
  • Damgård and Nielsen [2007] I. Damgård and J. B. Nielsen, “Scalable and unconditionally secure multiparty computation,” in CRYPTO, vol. 4622, 2007, pp. 572–590.
  • Ailon and Chazelle [2006] N. Ailon and B. Chazelle, “Approximate nearest neighbors and the fast johnson-lindenstrauss transform,” in STOC, 2006, pp. 557–563.
  • Lu et al. [2013] Y. Lu, P. S. Dhillon, D. P. Foster, and L. H. Ungar, “Faster ridge regression via the subsampled randomized hadamard transform,” in NIPS, 2013, pp. 369–377.
  • Clarkson and Woodruff [2013] K. L. Clarkson and D. P. Woodruff, “Low rank approximation and regression in input sparsity time,” in STOC, 2013, pp. 81–90.
  • Pham and Pagh [2013] N. Pham and R. Pagh, “Fast and scalable polynomial kernels via explicit feature maps,” in KDD, 2013, pp. 239–247.
  • Nemirovski et al. [2009] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM J. on Optim., vol. 19, no. 4, pp. 1574–1609, 2009.
  • Rockafellar [1976] R. T. Rockafellar, “Monotone operators and the proximal point algorithm,” SIAM J. Control Optim., vol. 14, no. 5, pp. 877–898, 1976.
  • Fairbanks et al. [2015] J. P. Fairbanks, R. Kannan, H. Park, and D. A. Bader, “Behavioral clusters in dynamic graphs,” Parallel Computing, vol. 47, pp. 38–50, 2015.
  • Qian et al. [2018] Y. Qian, C. Tan, N. Mamoulis, and D. W. Cheung, “DSANLS: accelerating distributed nonnegative matrix factorization via sketching,” in WSDM, 2018, pp. 450–458.
  • Boyd et al. [2003] S. Boyd, L. Xiao, and A. Mutapcic, “Subgradient methods,” Stanford University, 2003. [Online]. Available: https://web.stanford.edu/class/ee392o/subgrad_method.pdf
  • Kim et al. [2014] J. Kim, Y. He, and H. Park, “Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework,” J. Global Optimization, vol. 58, no. 2, pp. 285–319, 2014.
  • Neveu [1975] J. Neveu, Discrete-parameter martingales.   Elsevier, 1975, vol. 10.
  • Mairal [2013] J. Mairal, “Stochastic majorization-minimization algorithms for large-scale optimization,” in NIPS, 2013, pp. 2283–2291.
Yuqiu Qian is currently an applied researcher at Tencent. Her research interests include data engineering and machine learning with applications in recommender systems. She received her B.Eng. degree in Computer Science from the University of Science and Technology of China (2015), and her PhD degree in Computer Science from the University of Hong Kong (2019).
Conghui Tan is currently a researcher at WeBank. His research interests include machine learning and optimization algorithms. He received his B.Eng. degree in Computer Science from the University of Science and Technology of China (2015), and his PhD degree in System Engineering from the Chinese University of Hong Kong (2019).
Danhao Ding is currently a PhD candidate in the Department of Computer Science, University of Hong Kong. His research interests include high performance computing and machine learning. He received his B.Eng. degree in Computing and Data Analytics from the University of Hong Kong (2016).
Hui Li is currently an assistant professor in the School of Informatics, Xiamen University. His research interests include data mining and data management with applications in recommender systems and knowledge graphs. He received his B.Eng. degree in Software Engineering from Central South University (2012), and his MPhil and PhD degrees in Computer Science from the University of Hong Kong (2015, 2018).
Nikos Mamoulis received his diploma in computer engineering and informatics in 1995 from the University of Patras, Greece, and his PhD in computer science in 2000 from HKUST. He is currently a faculty member at the University of Ioannina. Before that, he was a professor at the Department of Computer Science, University of Hong Kong. His research focuses on the management and mining of complex data types.