Learning nonnegative matrix factorizations from compressed data
Abstract.
We propose a flexible and theoretically supported framework for scalable nonnegative matrix factorization. The goal is to find nonnegative low-rank components directly from compressed measurements, accessing the original data only once or twice. We consider compression through randomized sketching methods that can be adapted to the data, or can be oblivious. We formulate optimization problems that only depend on the compressed data, but which can recover a nonnegative factorization which closely approximates the original matrix. The defined problems can be approached with a variety of algorithms, and in particular, we discuss variations of the popular multiplicative updates method for these compressed problems. We demonstrate the success of our approaches empirically and validate their performance in real-world applications.
1. Introduction
Low-rank approximations are arguably the main tool for simplifying and interpreting large, complex datasets. Methods based on the singular value decomposition of the data matrix deliver deterministic, useful results via polynomial-time algorithms. However, for nonnegative data, spatial localization and interpretability of the features can be boosted by additionally making the factors element-wise nonnegative [23]. In the standard form, given a nonnegative matrix $\mathbf{X} \in \mathbb{R}^{m \times n}_{\ge 0}$ and a target rank $r$, the Nonnegative Matrix Factorization (NMF) problem is the task of finding matrices $\mathbf{W} \in \mathbb{R}^{m \times r}_{\ge 0}$ and $\mathbf{H} \in \mathbb{R}^{r \times n}_{\ge 0}$ solving the problem
(1.1) $\min_{\mathbf{W} \ge 0,\ \mathbf{H} \ge 0} \ \|\mathbf{X} - \mathbf{W}\mathbf{H}\|_F^2.$
From the component matrices $\mathbf{W}$ and $\mathbf{H}$, one can obtain a soft clustering of the data with additional interpretability of the components compared to the SVD results [23]. NMF has become a standard analysis tool across many application areas, such as topic modeling and text mining [2, 40], image processing [23], hyperspectral unmixing [18, 9], genomics [34], and others. It is amenable to numerous extensions, such as incorporating semi-supervision [24], tree-like hierarchy of the clusters [13, 21], or additional information about the domain [40, 12].
The problem (1.1) is NP-hard to solve in general [38], and it never possesses a unique solution. Despite these challenges, several iterative algorithms have been developed to solve the NMF problem, including multiplicative updates (MU), alternating least squares (ALS), alternating nonnegative least squares (ANLS), and hierarchical alternating least squares (HALS). See [11, Section 3] for a discussion of these methods.
When the data matrix is large, it is challenging or impossible to store it, and even harder to use it within the optimization algorithms searching for the NMF decomposition. Note that the resulting factors $\mathbf{W}$ and $\mathbf{H}$ collectively contain only $(m+n)r$ entries, which is much less than the total of $mn$ entries in $\mathbf{X}$ when $r \ll \min(m,n)$. Thus, it can be possible to store and process the resulting factors even when processing the whole data matrix is challenging.
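For a concrete sense of scale (the sizes below are illustrative and not taken from the paper), storing the factors instead of the full matrix gives
$$(m+n)\,r \;=\; (10^{5} + 10^{5})\cdot 20 \;=\; 4\times 10^{6} \qquad\text{versus}\qquad mn \;=\; 10^{5}\cdot 10^{5} \;=\; 10^{10},$$
a reduction in storage by a factor of $2500$.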
This motivates us to apply a sketch-and-solve approach to the NMF problem. This means that we first compute a linear function of our input matrix $\mathbf{X}$, known as a sketch, and then find a good factorization based only on the sketch and the linear function that produced it, known as the sketching map. If the size of the sketch is much smaller than $mn$, the second task may be significantly more practical when there are memory limitations. The practice of using sketches is also applicable to matrix factorization problems in other settings, such as when different entries of $\mathbf{X}$ are revealed over time. In certain streaming applications, it has been shown that there is little benefit to considering nonlinear sketching functions as opposed to linear sketches [25].
A standard and well-developed application of the linear sketch-and-solve approach is the simpler problem of linear regression [33, 5, 29]. Wide classes of random matrices, defined obliviously to the underlying data, can be used as linear sketching operators, and deterministic conditions related to the geometry-preservation properties of the sketching operators have been formulated [26, 43]. Another prominent example of the sketch-and-solve approach is randomized SVD algorithms. To find a low-rank factorization of a given matrix from its sketched measurements, the sketch should retain spectral properties of the matrix rather than being data-oblivious. In [14], a simple procedure for forming refined data-adapted sketches via just one pass over the data, the randomized rangefinder algorithm, was proposed.
In this work, we develop the theory showing why and how these data-adapted sketches are useful in finding nonnegative low-rank components; we also consider the cases when random sketches, independent of the data, can be used.
One way to sketch a structured object (in our case, a matrix) is to vectorize it and apply a random linear map to the resulting vector. This includes standard compressive sensing approaches for low-rank matrix recovery such as in [3, 7]. Another way, which is the focus of this work, is to consider sketches that take the form of left and right matrix products with the data, e.g., $\mathbf{S}\mathbf{X}$ or $\mathbf{S}\mathbf{X}\mathbf{R}^\top$ for appropriately sized matrices $\mathbf{S}$ and $\mathbf{R}$. Column and row sketching have been used successfully for matrix factorization and approximation problems [37, 8, 44], and its higher-order analogue, modewise sketching, was used to speed up tensor low-rank fitting [19, 15]. An advantage of this approach is in the compact sketching matrices $\mathbf{S}$ and $\mathbf{R}$, which together contain on the order of $(m+n)d$ elements for sketch size $d$, compared to the $d\cdot mn$ entries of a linear sketch applied to a vectorization of the matrix in $\mathbb{R}^{mn}$. Another advantage is in preserving the matrix structure of the data throughout the sketching, which is crucial in this work for integrating the compressed matrices within learning algorithms, such as the multiplicative updates algorithm.
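Both sketching modes can be formed in a single pass over the data. The following minimal NumPy sketch illustrates them; the variable names and sizes are ours, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 500, 400, 20            # data dimensions and sketch size (illustrative)
X = rng.lognormal(size=(m, n))    # nonnegative data matrix

# One-sided (row) sketch: compress the m rows of X down to d rows.
S = rng.normal(size=(d, m)) / np.sqrt(d)
SX = S @ X                        # d x n; the only data-dependent object kept

# Two-sided sketch: compress rows and columns simultaneously.
R = rng.normal(size=(d, n)) / np.sqrt(d)
X_left, X_right = S @ X, X @ R.T  # d x n and m x d sketches
```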
1.1. Contributions and outline
The idea of making the NMF problem scalable through randomized sketching has been considered before; in Section 2, we review the relevant related work. What was missing in the prior works is the crucial link between the algorithmic outcomes of compressed problems and their fitness for the original problem. Establishing such a connection is challenging partially due to the limited theoretical guarantees for the convergence of NMF algorithms (which is essentially inevitable due to the NP-hardness of the problem). We approach this challenge in the following way: (1) we define the compressed problems such that we can provably compare their optimal solutions with the optimal solutions to the uncompressed problem, and (2) we propose efficient algorithms for these compressed problems. Due to (1), solving the compressed problems well also means getting good solutions for the original problem.
In Section 3, we formulate optimization problems which depend only on sketches, but whose optimal solutions approximately solve the original NMF problem. In addition, these problems are formulated to be amenable to efficient solvers. We propose three types of such problems: (I) for two-sided row and column sketching, (II) for one-sided sketching with orthogonal data-adapted measurement matrices, and (III) for one-sided sketching with approximately orthogonal (e.g., random) data-oblivious measurement matrices.
The proposed compressed problem with row and column sketching is the two-sided problem (3.1), stated in Section 3.
Theorem 3.1 guarantees that if $\mathbf{X}$ has an exact nonnegative factorization of rank $r$, then the solution to this problem is also exact as long as the sketching dimension is at least $r$. Crucially, the sketching matrices $\mathbf{S}$ and $\mathbf{R}$ can be generic. We explain how to solve this problem in the sketched space despite the inclusion of the regularizer terms involving orthogonal projections. Empirically, as shown in Section 5, this problem can be employed in a simplified form with zero regularization, and it is suitable for data with approximately low nonnegative rank: if the sketching matrices are generic (in particular, not designed to approximate the range of the data), the two-sided method should be employed for tight recovery.
One-sided sketching can be more convenient for some types of data and is also more compact. Indeed, iteratively solving the two-sided problem requires storing and using both the left and right sketches, whereas sketching on one side takes half the space for the same amount of compression. The proposed one-sided compressed problem formulations are (II) and (III), stated precisely in Section 3.
So, what is required from the sketching matrices to work successfully within one-sided compression, (II) or (III)? Theorem 3.4 is stated for sketching matrices with orthonormal rows: in this case, the regularizer involving the orthogonal projection onto the sketch range can be conveniently incorporated into the efficient solvers. Otherwise, we can use the simplified regularizer without the projection operator, Theorem 3.10, where the resulting loss depends on the sketching matrix being approximately orthogonal (to the extent easily satisfied by generic random matrices, as described in Remark 3.11). We note that in the one-sided cases, nonzero regularization is crucial both theoretically and empirically (Figure 5).
Informally, both Theorems 3.4 and 3.10 state that when we find $\mathbf{W}$ and $\mathbf{H}$ solving the compressed problems stated above, the error depends on (a) how well an existent (unknown) solution of rank $r$ fits the uncompressed problem, (b) how well the sketching matrix approximates the range of the data, and (c) how close to orthonormal the rows of the sketching matrix are. In particular, this explains and quantifies how orthogonal data-adapted measurements (e.g., constructed via the randomized rangefinder algorithm [14]) are useful in finding nonnegative low-rank components. By Corollary 3.6, in this case the solution of the compressed problem above is exact for original data that admits an exact rank-$r$ nonnegative decomposition, provided the sketch size slightly oversamples $r$. Compared to that, data-oblivious one-sided measurements incur additional loss, both theoretically and empirically; however, they can be advantageous for other reasons. For example, they can be re-generated when needed without any access to the data, and they do not require an additional pass over the data to form them.
In Section 4, we move from comparing the optimal values to solving the three compressed problems above. We formulate new variants of the multiplicative updates (MU) algorithm for each of them and show that the losses are nonincreasing during their iterations in Corollaries 4.3, 4.4, and 4.5, respectively. These corollaries follow from a more general result, Theorem 4.1, formulated for a generic loss function with sketching. We also briefly discuss using the projected gradient descent method on our compressed objectives.
One special technical challenge for using MU on the compressed objectives is that random sketching violates nonnegativity of the compressed components, which is the property that ensures the validity of the step sizes used in the MU algorithm. To address this challenge, we further generalize the defined compressed objective functions to include small shifts involving the all-ones vector. These shifts result in corrections that restore the required nonnegativity inside the recovery algorithms (the relevant theory is developed in Subsection 3.4).
2. Related work on scalable NMF
A number of randomized methods were proposed to improve the scalability of NMF, most of them in the form of heuristics.
First, several works propose an iterative random sketching approach, which requires sampling new random sketching matrices as the algorithm progresses. Such works include the methods involving random projection streams [45, 46] that allow for Johnson-Lindenstrauss-type sketching matrices (random and oblivious to the data) but require multiple sketches and passes over the initial data. Similarly, in [28, 27] compressed and distributed versions of different descent methods for NMF use different compression matrices across iterations, as opposed to a single compression. Our focus is on the setting that requires one or two passes over the original data in the preprocessing stage, while the algorithm that searches for nonnegative factors works solely on the compressed data.
In [35], the factor matrices are additionally assumed to be sparse and are recovered with compressed sensing techniques. We do not make additional assumptions on the structure of the factor matrices.
Data-adapted sketching with randomized SVD techniques was used in [6] in the context of the hierarchical alternating least squares algorithm, although no theoretical justification of the proposed methods was given. Recently, these ideas were extended to a symmetric variant of the NMF problem in [16]. Additionally, in [49], a randomized SVD approximation is integrated into alternating multiplicative updates in a way that saves space, also without theoretical guarantees.
The two works most closely related to ours are [42] and [36]. Both of these papers derive compressed objective functions and seek to apply semi-NMF methods to iteratively improve a nonnegative factorization. A semi-NMF is a factorization in which one factor is entrywise nonnegative and the other factor is not constrained. Both papers apply the semi-NMF multiplicative updates from [4], and the latter also considers other update methods, including updates based on the alternating direction method of multipliers (ADMM). Although the updates of [4] do possess guarantees not to increase their corresponding semi-NMF objectives, neither [42] nor [36] shows whether these guarantees can be extended to their NMF objectives. Thus, neither the validity of the derived objective functions nor the convergence of the proposed iterative methods on the original NMF problem was theoretically justified. A crucial motivation of this work is to create a connection between the algorithms working on the compressed problems and their performance with respect to the solution of the original problem. We achieve this with new variants of the standard NMF algorithms (different from those in [36, 42]) for the newly formulated compressed objective functions. We also provide a numerical comparison between the methods in Section 5.
2.1. Notations
Throughout the text, matrices and vectors are denoted with bold letters. We denote the Frobenius matrix norm by $\|\cdot\|_F$ and the spectral (operator) norm by $\|\cdot\|$. The notation $\mathbf{A} \ge \mathbf{0}$ means that the matrix is element-wise nonnegative; we simply write $\mathbf{A} \ge 0$ when the size of the matrix is clear from the context. Element-wise positive and negative parts of vectors and matrices are denoted by $[\,\cdot\,]_+$ and $[\,\cdot\,]_-$, respectively. Further, $\odot$ denotes element-wise multiplication and $\oslash$ denotes element-wise division. $\mathbf{P}_{\mathbf{A}}$ is the linear projection operator onto the column space of a tall matrix $\mathbf{A}$, and the projection onto the orthogonal complement is $\mathbf{P}^{\perp}_{\mathbf{A}} = \mathbf{I} - \mathbf{P}_{\mathbf{A}}$.
3. Compressed problems with reliable solutions
We formulate optimization problems analogous to (1.1), which do not require storing the entire data matrix and instead use sketched measurements. This is achieved by the use of carefully chosen regularization terms. In this section, we prove that the formulated problems are guaranteed to approximately solve the original NMF problem. In the next section, we show that the standard NMF solvers are easily extendable to these new regularized objective functions.
3.1. Two-sided compression
First, we show that if a matrix has an exact low-rank nonnegative matrix factorization, then one can recover an exact factorization using linear measurements on both sides.
Theorem 3.1.
Suppose $\mathbf{X}$ has an exact nonnegative factorization $\mathbf{X} = \mathbf{W}_*\mathbf{H}_*$, where $\mathbf{W}_* \in \mathbb{R}^{m\times r}_{\ge 0}$ and $\mathbf{H}_* \in \mathbb{R}^{r\times n}_{\ge 0}$ are both full-rank. Let $\mathbf{S}$ and $\mathbf{R}$ be sketching matrices of sizes $r \times m$ and $r \times n$, respectively, such that $\mathbf{S}\mathbf{X}\mathbf{R}^\top$ is invertible (note that this condition holds generically, i.e., for all but a Lebesgue measure-zero set of matrices). For any positive regularization weights, let $(\mathbf{W}, \mathbf{H})$ be an optimal solution of the compressed problem
(3.1)
where $\ge 0$ is understood entry-wise. Then $\mathbf{W}\mathbf{H} = \mathbf{X}$.
Remark 3.2.
We can similarly take the sketching matrices $\mathbf{S}$ and $\mathbf{R}$ of any sizes $d_1 \times m$ and $d_2 \times n$, respectively, with $d_1, d_2 \ge r$.
Remark 3.3 (Implementation considerations).
This and further objective functions are formulated to allow for memory-efficient computations. For example, in the above objective function (3.1), one need not store $\mathbf{X}$ and can store the sketches $\mathbf{S}\mathbf{X}$ and $\mathbf{X}\mathbf{R}^\top$ instead. Likewise, one does not need to store or compute the orthogonal projection matrix, which is of the full ambient dimension, since one can instead compute
where $\mathbf{Q}$ is a matrix whose columns form an orthonormal basis of the columns of the corresponding sketch, so that the projection can be applied as $\mathbf{Q}(\mathbf{Q}^\top \cdot)$. One can use a similar trick to compute the other projection term via an analogous orthonormal matrix. (Efficient ways to find orthogonal bases of the column/row span are well known; see, e.g., the discussion in [14, Section 4.1].) Thus, one can compute the objective function of (3.1) with a total storage cost proportional to the sizes of the sketches and the factors, by storing only the sketched matrices and the orthonormal bases. This and similar considerations are crucial in Section 4, when we study computationally efficient iterative algorithms that solve the optimization problems on compressed data defined here.
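As a minimal illustration of this memory-saving trick (a sketch in our notation, with variable names chosen for this example), the orthogonal projection onto the column space of a sketch is applied through its thin QR factor rather than by forming the full projector:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d, r = 2000, 1500, 60, 10
X = rng.lognormal(size=(m, n))            # nonnegative data
R = rng.normal(size=(d, n)) / np.sqrt(d)  # right sketching matrix

XR = X @ R.T                   # m x d column sketch of X (stored instead of X)
Q, _ = np.linalg.qr(XR)        # orthonormal basis of col(XR), shape m x d

W = rng.random((m, r))
# Projection of W onto col(XR) without forming the m x m projector Q Q^T:
PW = Q @ (Q.T @ W)             # costs O(m d r) instead of O(m^2 r)
P_perp_W = W - PW              # complementary projection used in the regularizers
```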
Proof.
First, we show that the matrices $\mathbf{W}_*, \mathbf{H}_*$ with $\mathbf{X} = \mathbf{W}_*\mathbf{H}_*$ are not only feasible for the problem (3.1) but also give zero objective value. Indeed, since $\mathbf{S}\mathbf{X}\mathbf{R}^\top = (\mathbf{S}\mathbf{W}_*)(\mathbf{H}_*\mathbf{R}^\top)$ is invertible, it follows that the square matrices $\mathbf{S}\mathbf{W}_*$ and $\mathbf{H}_*\mathbf{R}^\top$ are also invertible. So, from $\mathbf{X}\mathbf{R}^\top = \mathbf{W}_*(\mathbf{H}_*\mathbf{R}^\top)$ we can compute $\mathbf{W}_* = \mathbf{X}\mathbf{R}^\top(\mathbf{H}_*\mathbf{R}^\top)^{-1}$ and similarly $\mathbf{H}_* = (\mathbf{S}\mathbf{W}_*)^{-1}\mathbf{S}\mathbf{X}$. Then we have
This implies
and similarly for the remaining terms, so these matrices give zero objective value.
3.2. One-sided compression: orthogonal sketching matrices
Note that the described method requires measurements on both sides (taking either of the sketching matrices to be the identity results in the necessity to work with the full matrix $\mathbf{X}$). Now, we will show that it can be enough to measure the matrix on one side only.
Theorem 3.4.
(Orthogonal sketching matrix.) Let $\mathbf{X} \ge 0$ be any matrix and let $\mathbf{S}$ be a matrix with orthonormal rows (i.e., $\mathbf{S}\mathbf{S}^\top = \mathbf{I}$). Let $\mathbf{W}_*, \mathbf{H}_*$ give a solution to the original NMF problem (1.1) of rank $r$. Suppose $\mathbf{W}, \mathbf{H}$ solve a compressed NMF problem with the same rank $r$, that is,
(3.5)
where the objective depends on the data only through the sketch $\mathbf{S}\mathbf{X}$ and $\ge 0$ is understood entry-wise. Then $\mathbf{W}, \mathbf{H}$ satisfy
(3.6)
where the terms on the right-hand side quantify the original NMF error and the range-approximation quality of the sketch.
We note that this and further results do not require the data matrix to have an exact nonnegative low-rank factorization as in Theorem 3.1 (although if it does, the first term in the bound for the loss vanishes). Before we start the proof, let us recall a simple corollary of the Pythagorean theorem and the triangle inequality, to be used several times below: for any matrices and an orthogonal projection operator,
(3.7)
Indeed, this follows from
Proof of Theorem 3.4.
First, note that
(3.8)
since $\mathbf{W}, \mathbf{H}$ minimize the objective of (3.5) over all nonnegative matrices of the appropriate sizes. Now, using the orthonormality of the rows of $\mathbf{S}$, observe that for any matrix
So, using identity (3.7) for the matrices and , we can estimate
for Using identity (3.7) for the matrices and , we can estimate the term in parentheses as
for Combining the estimates and regrouping, we get
With this choice of constants, this concludes the proof of Theorem 3.4. ∎
Theorem 3.4 shows that the solution to the compressed NMF problem (3.5) will work for the original uncompressed problem (1.1) as long as the original NMF error and the range-approximation error of the sketch are small. Luckily, with just one more pass over the original data, one can obtain such sketching matrices using the standard approaches of randomized linear algebra, such as those in [14].
Theorem 3.5 ("Randomized rangefinder algorithm loss", [14]).
Let $k, p \ge 2$ be integers such that $k + p \le \min(m, n)$. Let $\mathbf{X}$ be an $m \times n$ matrix and $\mathbf{G}$ be an $n \times (k+p)$ standard Gaussian matrix. Then
$$\mathbb{E}\, \big\| (\mathbf{I} - \mathbf{P}_{\mathbf{X}\mathbf{G}})\, \mathbf{X} \big\|_F \ \le\ \Big(1 + \frac{k}{p-1}\Big)^{1/2} \Big(\sum_{j > k} \sigma_j^2\Big)^{1/2},$$
where $\sigma_j$ is the $j$-th largest singular value of $\mathbf{X}$ and $\mathbf{P}_{\mathbf{X}\mathbf{G}}$ denotes the orthogonal projection onto the column space of $\mathbf{X}\mathbf{G}$.
Corollary 3.6 (Data-adapted one-sided sketches).
Suppose the matrix $\mathbf{X}$ has an approximate nonnegative factorization, that is, there exist nonnegative factors of rank $r$ whose product approximates $\mathbf{X}$ well in Frobenius norm.
Proof.
Remark 3.7.
A high-probability deviation bound for the loss is also known [14, Theorem 10.7]. It implies a high-probability estimate for (3.9) in a straightforward way. Instead of Gaussian initial sketching, one can employ a subsampled random Fourier transform [14] or other cost-efficient matrices [31, 10].
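A minimal sketch of the data-adapted route via the randomized rangefinder of [14] (the function and variable names below are ours):

```python
import numpy as np

def rangefinder(X, sketch_size, rng):
    """One-pass randomized rangefinder of [14]: returns Q with orthonormal
    columns whose span approximates the range (column space) of X."""
    G = rng.normal(size=(X.shape[1], sketch_size))  # Gaussian test matrix
    Y = X @ G                                       # single pass over the data
    Q, _ = np.linalg.qr(Y)                          # orthonormal basis, m x sketch_size
    return Q

rng = np.random.default_rng(2)
X = rng.lognormal(size=(1000, 800))
Q = rangefinder(X, sketch_size=30, rng=rng)
# Data-adapted one-sided sketching matrix S = Q.T; the stored compressed data:
SX = Q.T @ X                                        # 30 x 800, kept in place of X
```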
It is easy to see that the oversampling condition can be relaxed to any positive amount of oversampling by suitably increasing the constant factor. Notwithstanding this factor, we see that if $\mathbf{X}$ has an exact NMF decomposition of rank $r$, then the error of the optimal solution to the compressed problem must also be zero, comparable with the result of Theorem 3.1.
3.3. One-sided compression: nonorthogonal sketching matrices
The orthogonality assumption on the sketching matrix can be relaxed to having approximately orthogonal rows, such as those of appropriately normalized random matrices. This case is more than a straightforward extension of Theorem 3.4 because of the following computational challenge: if the sketching matrix does not have orthogonal rows, the orthogonal projection operator onto its row space does not have a nicely decomposable form. Theorem 3.10 below shows how to avoid having projection matrices in the regularizer term.
Definition 3.8 (Approximately orthogonal matrices).
For a positive constant $\varepsilon < 1$, we call a matrix $\varepsilon$-approximately orthogonal if its singular values lie in the interval $[1 - \varepsilon,\, 1 + \varepsilon]$.
The convenience of the definition above stems from the following simple observation.
Lemma 3.9.
If the matrix $\mathbf{S}$ is $\varepsilon$-approximately orthogonal, then for any matrix of compatible size we have
(3.10)
Proof.
For a positive semidefinite matrix , let denote the unique positive semidefinite matrix such that . Then, if is a compact SVD decomposition of , and . This implies and
∎
The next theorem justifies solving a compressed NMF problem with a simplified regularization term:
Theorem 3.10.
(Approximately orthogonal sketching matrix.) Let $\mathbf{X} \ge 0$ be any matrix and let $\mathbf{S}$ be $\varepsilon$-approximately orthogonal, with $\varepsilon < 1$. Let $\mathbf{W}_*, \mathbf{H}_*$ give a solution to the original NMF problem (1.1) of rank $r$. Suppose $\mathbf{W}, \mathbf{H}$ solve the following compressed NMF problem with the same rank $r$:
(3.11)
Then $\mathbf{W}, \mathbf{H}$ satisfy
(3.12)
where the constant depends only on $\varepsilon$.
Proof.
By optimality, for any matrix for some nonnegative and of the appropriate size (to be the scaled version of as specified below) we have
Approximate orthogonality in the form of (3.10) applied to the matrices and allows to orthogonalize this inequality:
where denotes the matrix with orthogonal rows such that . Indeed, this implies for any So,
(3.13)
with .
We will further rearrange the optimality condition using the following identity based on completion of the square on both sides of (3.13): for any matrices , of appropriate size,
Using this identity for and on the left and on the right of (3.13), we obtain
Cancelling common terms, letting , we have
To estimate the loss on the uncompressed problem, we use (3.7) with the matrices , and to get
using the two observations above. ∎
Remark 3.11.
We conclude the section with a discussion of the result of Theorem 3.10. We note that:
- Theorem 3.10 shows that it is possible to regularize the compressed NMF problem without the projection operator and to find rescaled factors. Note that the rescaling does not affect the learned nonnegative low-rank structure.
- The property (3.10) is significantly more relaxed than standard geometry-preservation properties, such as the Johnson-Lindenstrauss, oblivious subspace embedding, or restricted isometry properties. The latter will not be satisfied for, e.g., a random Gaussian matrix and arbitrary nonnegative rank-$r$ matrices (as needed within Theorem 3.10), unless there is no compression.
- While it is easy to guarantee approximate orthogonality with generic matrices not learned from the data (see the short numerical check after this list), the range-approximation term is still part of the error bound. So data-oblivious one-sided compression is in general not expected to result in exact recovery even if the data matrix admits an exact nonnegative factorization.
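To illustrate Definition 3.8, here is a quick numerical check (ours, not from the paper) that an appropriately scaled Gaussian matrix is approximately orthogonal in the above sense:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 100, 5000
S = rng.normal(size=(d, m)) / np.sqrt(m)     # scaling puts singular values near 1
sv = np.linalg.svd(S, compute_uv=False)
eps = max(1.0 - sv.min(), sv.max() - 1.0)
print(f"singular values in [{sv.min():.3f}, {sv.max():.3f}]; eps ~ {eps:.3f}")
# For a d x m Gaussian matrix scaled by 1/sqrt(m), the singular values concentrate
# in roughly [1 - sqrt(d/m), 1 + sqrt(d/m)], here about [0.86, 1.14].
```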
3.4. Nonnegativity in compression
In the next section, we discuss approaches to solving the compressed nonnegative matrix factorization problems. In particular, we consider variations of the multiplicative updates algorithm to iteratively minimize the objective functions formulated in this section. The multiplicative updates algorithm is valid due to the fact that the matrices involved in the iterative process are nonnegative. This convenient property is destroyed by sketching unless the sketching matrix is element-wise nonnegative. Since this is not expected to hold either for approximately orthonormal random sketches or for the data-adapted sketching matrices coming from the randomized rangefinder algorithm, to overcome this issue it suffices to add extra penalty terms of the form
(3.14)
where $\mathbf{1}$ denotes a vector of all ones of the appropriate dimension.
Corollary 3.12.
Suppose $\mathbf{X}$ has an exact nonnegative factorization $\mathbf{X} = \mathbf{W}_*\mathbf{H}_*$, where $\mathbf{W}_* \in \mathbb{R}^{m \times r}_{\ge 0}$ and $\mathbf{H}_* \in \mathbb{R}^{r \times n}_{\ge 0}$ are both full-rank. Let $\mathbf{S}$ and $\mathbf{R}$ be generic random matrices of sizes $r \times m$ and $r \times n$, respectively. If, for some large enough shift parameters, $\mathbf{W}, \mathbf{H}$ solve
(3.15)
then $\mathbf{W}\mathbf{H} = \mathbf{X}$.
Proof.
Note that
and similarly for the other added term. Then the statement follows directly from Theorem 3.1. ∎
When we do not assume that $\mathbf{X}$ has nonnegative rank exactly $r$, adding the regularizer of the form (3.14) is still possible under an additional condition, essentially imposing that the column sums of $\mathbf{X}$ and of its optimal rank-$r$ nonnegative approximation approximately match.
Corollary 3.13.
Let $\mathbf{X}$ be a nonnegative matrix and let $\mathbf{S}$ be a matrix with orthonormal rows. Let $\mathbf{W}_*, \mathbf{H}_*$ give a solution to the original NMF problem (1.1) of rank $r$. Additionally assume that the vector $\mathbf{1}^\top\mathbf{X} \oslash \mathbf{1}^\top(\mathbf{W}_*\mathbf{H}_*)$ has entries close to one, where $\oslash$ denotes element-wise division and $\mathbf{1}$ is the vector of all ones. If $\mathbf{W}, \mathbf{H}$ solve
(3.16)
then $\mathbf{W}, \mathbf{H}$ satisfy
For example, if we have that one can bound
Proof.
Let $\mathbf{D}$ be the diagonal matrix with nonzero entries given by the column-sum ratios above. Note that the correspondingly rescaled pair is some (not necessarily optimal) feasible point of (3.16), and so we have
(3.17)
Then, for
∎
4. Methods that solve compressed problems
In this section we define iterative methods that solve the formulated optimization problems directly, without referring to the original data or any matrices of the large uncompressed size.
4.1. General convergence for sketched multiplicative updates
Multiplicative updates has been one of the most popular algorithms for NMF since its introduction in [22]. In this section, we show how to modify the classical multiplicative updates algorithm for the various objectives we have derived in earlier sections.
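For reference, here is a minimal NumPy sketch of the classical (uncompressed) multiplicative updates of [22], which the compressed variants below modify; the function and variable names are ours:

```python
import numpy as np

def mu_nmf(X, r, n_iter=200, eps=1e-12, seed=0):
    """Classical Lee-Seung multiplicative updates for min ||X - W H||_F^2 over W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W, H = rng.random((m, r)), rng.random((r, n))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)    # multiplicative steps preserve nonnegativity
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H
```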
To this end, we prove a general theorem for multiplicative updates with multiple minimization terms. We will see in Section 4.2 that imposing nonnegativity conditions on the sums (4.2), rather than the stronger conditions on the individual terms, matters: it allows us to put realistic assumptions on our regularization terms that include orthogonal projection operators.
Theorem 4.1.
Consider an objective function in the generic form
(4.1)
where the coefficient matrices are fixed and $\ge 0$ is understood entry-wise, and where the variables $\mathbf{W}$ and $\mathbf{H}$ are $m \times r$ and $r \times n$ matrices, respectively. If all six matrices
(4.2)
are entry-wise nonnegative, then the objective (4.1) is nonincreasing under the updates
Remark 4.2.
(Implementation considerations) Note that the matrix products appearing in the updates can be computed with a multiplication order chosen so that no intermediate matrix is larger than the sketches and the factors; in particular, one never forms an $m \times n$ matrix. Similar procedures can be used for the other terms appearing in the updates of Theorem 4.1.
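A small illustration (ours) of this ordering point: with a sketch $\mathbf{S}$ of size $d \times m$ and factors $\mathbf{W}, \mathbf{H}$, a product such as $\mathbf{S}^\top\mathbf{S}\mathbf{W}(\mathbf{H}\mathbf{H}^\top)$ can be evaluated without ever forming an $m \times m$ or $m \times n$ intermediate:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, d, r = 5000, 4000, 50, 10
S = rng.normal(size=(d, m)) / np.sqrt(d)
W, H = rng.random((m, r)), rng.random((r, n))

# Bad order: (S.T @ S) would materialize an m x m matrix.
# Good order: evaluate inside-out so every intermediate stays small.
HHt = H @ H.T            # r x r
T1 = W @ HHt             # m x r
T2 = S @ T1              # d x r
out = S.T @ T2           # m x r; the largest intermediate is m x r
```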
The proof of the theorem follows the standard approach that justifies the validity of the multiplicative updates algorithm for the NMF problem (e.g., [22]), where the nonnegativity assumption ensures its validity on the sketched problem.
Proof of Theorem 4.1.
Let us consider the step updating the matrix . Let be the vectorization of denoted as . Define
where denotes matrix Kronecker product. Define to be all of the vectors and stacked together vertically. Similarly, is a stack of all of the matrices and .
Using the mixed Kronecker matrix-vector product property $(\mathbf{A} \otimes \mathbf{B})\,\mathrm{vec}(\mathbf{C}) = \mathrm{vec}(\mathbf{B}\mathbf{C}\mathbf{A}^\top)$, which holds for any appropriately sized matrices, we can rewrite the objective function (4.1) as
Step 1: Define quadratic majorizing function. Consider the function
(4.3)
where the matrix is a diagonal matrix with the diagonal , recall that represents elementwise division. We claim that majorizes , i.e. and . It is clear that . We can write
from the comparison of (4.3) with the Taylor decomposition for the quadratic function at . So, to check that , it is sufficient to show that the matrix is positive semidefinite. Equivalently, it is sufficient to show that
is positive semidefinite. Indeed, for any vector of the appropiate size, , recall that defines element-wise product.
Step 2: The matrix is positive semidefinite. We will check positive semidefiniteness of the matrix directly. Consider any vector of the appropriate size. Then
Now observe
Thus, by the entry-wise nonnegativity of , , and , the matrix is also entrywise nonnegative. Since is also nonnegative,
This shows that is positive semidefinite, and therefore, majorizes .
Step 3: The updates minimize the majorizing function. To conclude, observe that
(4.4)
In matrix form, this corresponds exactly to the update for given by
From majorization property, we have
and thus, the iterates do not increase the objective.
The updates for the other factor also do not increase the objective by a similar argument. The stated nonnegativity conditions ensure that the matrices appearing in the updates remain nonnegative throughout the iterations. ∎
4.2. Multiplicative updates for solving regularized compressed problems
Now, we demonstrate how the general framework (4.1) applies to the compressed problems from Section 3.
Corollary 4.3 (Two-sided updates).
Let $\mathbf{X}$ be a nonnegative matrix and let $\mathbf{S}$, $\mathbf{R}$ be generic compression matrices such that $\mathbf{S}\mathbf{X}\mathbf{R}^\top$ is invertible. Form the matrices whose columns constitute orthonormal bases of the column space of $\mathbf{X}\mathbf{R}^\top$ and of the row space of $\mathbf{S}\mathbf{X}$, respectively. For large enough nonnegativity correction parameters, if the shifted matrices entering the updates are entry-wise nonnegative, then the two-sided compressed objective is nonincreasing under the corresponding multiplicative updates.
Proof.
Consider the setting of Theorem 4.1, with the matrices
Clearly, we have that the matrices and are nonnegative. Then to apply Theorem 4.1, we must check that , , and are entry-wise nonnegative. First, we calculate
Thus to ensure entry-wise nonnegativity of both sums, we need
Similarly, for
we need to ensure the analogous entry-wise nonnegativity conditions. ∎
Corollary 4.4 (One-sided updates for orthogonal ).
If the sketching matrix $\mathbf{S}$ has orthonormal rows and the nonnegativity correction term is chosen large enough, then the objective
(4.5)
with respect to the variables $\mathbf{W}, \mathbf{H}$ is nonincreasing under the updates
(4.6)
Here, the notation is as in Theorem 3.4. Note that Corollary 3.13 shows that the optimal solution of (4.5) results in a good solution for the uncompressed NMF problem as long as the original NMF error and the range-approximation term are small.
Proof.
The proof of the next corollary for not necessarily orthogonal sketching matrices is similar to the one above and is omitted for brevity.
Corollary 4.5 (One-sided updates for nonorthogonal sketching matrices).
If $\mathbf{S}$ is an arbitrary matrix and the nonnegativity correction term is chosen large enough, then the corresponding shifted objective
with respect to the variables $\mathbf{W}, \mathbf{H}$ is nonincreasing under the updates
Here, $\mathbf{1}$ is a vector of all ones of the appropriate dimension.
4.3. Solving compressed problems with projected gradient descent
Multiplicative updates is a popular approach for finding nonnegative factorizations due to its simplicity and good convergence properties. However, other standard methods, such as alternating nonnegative least squares, hierarchical alternating least squares, and projected gradient descent, among others, can also be run on the compressed problems. For comparison and as an additional example, we consider the projected gradient descent (GD) method on compressed data.
For an arbitrary loss function $f(\mathbf{W}, \mathbf{H})$, nonnegative projected GD can be defined as a gradient step followed by projection onto the nonnegative orthant,
$$\mathbf{W} \leftarrow \big[\mathbf{W} - \eta\, \nabla_{\mathbf{W}} f\big]_+, \qquad \mathbf{H} \leftarrow \big[\mathbf{H} - \eta\, \nabla_{\mathbf{H}} f\big]_+,$$
where $\eta$ is the step size. For the sake of concreteness, we give an example of the updates for one of our formulated objective functions. The projected gradient descent updates for the objective of Theorem 3.4 are
(4.7)
We can similarly derive updates for our other objective functions. A disadvantage of this method is that it possesses no guarantee of convergence or even a nonincreasing property. Empirically, we see (Figure 3 below) that on some datasets projected GD shows competitive performance and that its convergence is indeed not monotonic, unlike the convergence of the regularized compressed MU algorithm.
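As a sketch of how such updates can be carried out purely on compressed data, the following NumPy code implements projected GD for an objective of the form $\|\mathbf{S}\mathbf{X} - \mathbf{S}\mathbf{W}\mathbf{H}\|_F^2 + \lambda\,\|(\mathbf{I} - \mathbf{S}^\top\mathbf{S})\mathbf{W}\mathbf{H}\|_F^2$, our reading of the regularized objective of Theorem 3.4 for a sketching matrix with orthonormal rows; the step size, regularization weight, and names are illustrative:

```python
import numpy as np

def projected_gd_compressed(Xs, S, r, lam=1.0, eta=1e-3, n_iter=500, seed=0):
    """Projected GD for f(W,H) = ||Xs - S W H||_F^2 + lam*||(I - S^T S) W H||_F^2,
    where Xs = S X is the stored sketch and S has orthonormal rows."""
    rng = np.random.default_rng(seed)
    d, n = Xs.shape
    m = S.shape[1]
    W, H = rng.random((m, r)), rng.random((r, n))
    for _ in range(n_iter):
        SW = S @ W                      # d x r
        HHt = H @ H.T                   # r x r
        # gradient w.r.t. W (never forms an m x n matrix)
        gW = -2 * S.T @ (Xs @ H.T) + 2 * S.T @ (SW @ HHt) \
             + 2 * lam * (W @ HHt - S.T @ (SW @ HHt))
        # gradient w.r.t. H
        gH = -2 * SW.T @ Xs + 2 * (SW.T @ SW) @ H \
             + 2 * lam * (W.T @ W @ H - SW.T @ SW @ H)
        W = np.maximum(W - eta * gW, 0.0)   # gradient step + projection onto >= 0
        H = np.maximum(H - eta * gH, 0.0)
    return W, H
```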
5. Experiments
We experiment with three datasets coming from various domains. The 20 Newsgroups dataset ("20News") [30] is a standard dataset for text classification and topic modeling tasks. It is a collection of newsgroup articles divided into 20 subtopics within the general topics of religion, sales, science, sports, tech, and politics. The Olivetti faces dataset ("Faces") [32] is a standard image dataset containing 400 grayscale facial images. It is often used in the literature as an example for different factorization methods, including NMF [47, 48]. Finally, we construct a synthetic dataset with random nonnegative data of exactly low nonnegative rank. Specifically, we let $\mathbf{W}$ and $\mathbf{H}$ be matrices whose entries are distributed like standard lognormal random variables and define $\mathbf{X} = \mathbf{W}\mathbf{H}$. The dimensions of the datasets are reported in Table 1.
Dataset | $m$ (rows) | $n$ (columns)
---|---|---
Synthetic | 1000 | 1000
Faces [32] | 400 | 4096
20News [30] | 11314 | 101322
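A minimal sketch of the synthetic data construction described above (the rank value below is illustrative; the paper matches it to the target rank):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, r = 1000, 1000, 20                 # sizes from Table 1; the rank is illustrative
W_true = rng.lognormal(size=(m, r))      # standard lognormal entries, hence nonnegative
H_true = rng.lognormal(size=(r, n))
X = W_true @ H_true                      # nonnegative data with nonnegative rank at most r
```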
All experiments were run on a cluster with four 2.6 GHz Intel Skylake CPU cores and an NVIDIA V100 GPU. Compressed methods were implemented using the JAX library to take advantage of GPU acceleration. The uncompressed methods that we compare against were not implemented to use GPU acceleration, since in applications the full data matrix may be too large to store on a GPU; in our case, the 20News data matrix was too large to store as a dense matrix on our GPU.
To measure the convergence of the methods, besides the interpretability of the topics or images, we use the relative error metric $\|\mathbf{X} - \mathbf{W}\mathbf{H}\|_F / \|\mathbf{X}\|_F$ and the scale-invariant cosine similarity metric (normalized dot product) between $\mathbf{X}$ and $\mathbf{W}\mathbf{H}$, defined as $\langle \mathbf{X}, \mathbf{W}\mathbf{H} \rangle_F / \big(\|\mathbf{X}\|_F\, \|\mathbf{W}\mathbf{H}\|_F\big)$.
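Both metrics can be computed directly from the factors; a small sketch (ours) for reference:

```python
import numpy as np

def relative_error(X, W, H):
    """||X - W H||_F / ||X||_F computed from the factors."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)

def cosine_similarity(X, W, H):
    """<X, W H>_F / (||X||_F ||W H||_F); equals 1 for a perfect fit up to scaling."""
    WH = W @ H
    return float(np.sum(X * WH)) / (np.linalg.norm(X) * np.linalg.norm(WH))
```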
5.1. Synthetic data: exact recovery from compressed measurements is achievable
The synthetic data has exactly low nonnegative rank. The theoretical results suggest the possibility of exact recovery from the compressed data in two cases: (a) if two-sided compression is employed (Theorem 3.1), or (b) if the compression matrix is such that the projection of the data onto the orthogonal complement of the sketch range is zero, for example, if the compression matrix is learned via the randomized rangefinder algorithm (one-sided data-adapted compression, Corollary 3.6). Two-sided compression requires twice the memory for the same sketch size but works with generic (data-oblivious) measurements, which can be advantageous for various reasons, as they do not require access to the data or another pass over the data.
[Figure 1: relative error versus iteration for uncompressed MU, compressed MU from one-sided data-adapted sketches, and compressed MU from two-sided oblivious sketches on the synthetic dataset.]
In Figure 1, we compare the nonnegative factors found by MU from the full data $\mathbf{X}$, from the one-sided data-adapted sketches, and from the two-sided oblivious (i.i.d. Gaussian) linear measurements. We take the target rank $r = 20$ and the sketch size $d = 20$ (so the sketching matrices have shape $20 \times 1000$). In this case, the matrix $\mathbf{X}$ has one million elements, whereas the total number of elements in $\mathbf{W}$ and $\mathbf{H}$ combined is only forty thousand. This represents a memory ratio of $4\%$ for the one-sided method. For the two-sided method, the total number of elements in the two sketching matrices and the two sketches is eighty thousand, representing a memory ratio of $8\%$.
We employ the compressed MU algorithm as in Corollary 4.4 for the one-sided case and as in Corollary 4.3 for the two-sided case. The nonnegativity correction parameter is chosen to be the smallest value that makes the shifted compressed matrices nonnegative, and similarly for the corresponding terms in the two-sided case. We see that the algorithm achieves a near-exact fit, eventually reducing the relative error below our stopping threshold. The one-sided method and the oblivious two-sided method seem to converge at a similar rate as the uncompressed method, albeit after a "burn-in" phase.
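A small sketch (ours) of this "minimal shift" rule: pick the correction just large enough that the shifted compressed quantities become entry-wise nonnegative, as required by the step-size argument behind the MU updates.

```python
import numpy as np

def minimal_shift(M):
    """Smallest lam >= 0 such that M + lam is entry-wise nonnegative."""
    return max(0.0, -float(M.min()))

rng = np.random.default_rng(6)
S = rng.normal(size=(20, 1000)) / np.sqrt(20)   # oblivious sketch: has negative entries
X = rng.lognormal(size=(1000, 500))
SX = S @ X
lam = minimal_shift(SX)
SX_shifted = SX + lam                           # entry-wise nonnegative, usable inside MU
```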
5.2. Synthetic data: effect of compression size
In Figure 2, we compare the compressed multiplicative updates for different amounts of compression. The target rank is fixed and the sketch size varies. We employ the compressed MU algorithm as in Corollaries 4.3, 4.4, and 4.5. We choose the compression matrix to be (a) a random Gaussian matrix, (b) a random matrix with orthogonal rows, or (c) the result of the randomized rangefinder procedure (Theorem 3.5). For the two-sided method (d), we choose both compression matrices to be random Gaussian matrices; for the one-sided compression, a single compression matrix of the same size is used. We report the cosine similarity metric.
We show that on an "easy" synthetic problem, oblivious one-sided measurements can be used: the compressed MU algorithm results in a good, although not perfect, recovery (Figure 2 (a,b)), with the limiting loss depending on the amount of compression. The one-sided data-adapted compression and the two-sided random compression attain exact reconstruction at different rates depending on the amount of compression (Figure 2 (c,d)). Not surprisingly, in every case, less compression leads to faster or better convergence.
[Figure 2: cosine similarity versus iteration for (a) Gaussian one-sided, (b) orthogonal one-sided, (c) data-adapted one-sided, and (d) two-sided Gaussian compression, for several sketch sizes.]
5.3. Real-world data: performance comparison
To showcase the proposed methods, we include recovery from two-sided and one-sided sketches. For the two-sided method, we solve a simplified version of the two-sided objective (3.15) with zero regularization. We take the sketching matrices to be random Gaussian matrices in the oblivious case, or construct them according to the randomized rangefinder procedure as in Corollary 3.6 in the adaptive case. For the data-adapted one-sided method, we take the rangefinder-based sketching matrix and solve the problem (4.5). Then, we include recovery via projected gradient descent (GD), as described in Section 4.3, with a fixed step size.
We also compare our proposed methods with the "NMF with random projections" approach proposed in [42] and in [36]. These works adapted the updates of [4] to the compressed NMF setting, resulting in the updates
(5.1)
and similarly for the other factor; see equations (3.28) and (3.29) of [42]. The work [42] proposes to use the updates (5.1) with Gaussian sketching matrices; we denote these updates "WL" in the legends. The work [36] also proposed using the randomized rangefinder procedure [14], as in our Corollary 3.6, to choose the sketching matrices; we denote this method "TS" in the legends. Observe that these iterations are approximately two times faster than those of MU.
[Figure 3: relative error and cosine similarity versus iteration for the compared methods on (a,b) the Faces dataset and (c,d) the 20News dataset.]
In Figure 3 (a,b), we study the performance of compressed nonnegative matrix factorization for the Faces dataset. In terms of memory, the data matrix contains $400 \cdot 4096 \approx 1.6 \times 10^6$ elements, whereas the stored compressed matrices contain far fewer, giving a small memory ratio in the one-sided case and roughly twice that in the two-sided case. On the Faces dataset, our methods, especially the data-adapted versions, attain almost the quality of MU on the uncompressed data while using only a small fraction of the memory.
In Figure 3 (c,d), we study the performance of compressed nonnegative matrix factorization for the 20News dataset. 20News is a "hard" dataset for NMF: even on the full uncompressed data, NMF achieves only a moderate cosine similarity (which can nevertheless be enough for meaningful topic modeling applications), and our compressed MU from data-adapted measurements achieves a comparable cosine similarity while using only a small fraction of the memory required for the uncompressed MU with one-sided compression. Indeed, the number of elements in the data matrix (including zeros) is $11314 \cdot 101322 \approx 1.1 \times 10^9$, while the total number of elements in the stored sketches, factors, and auxiliary matrices is orders of magnitude smaller.
Since it might be infeasible to even run an uncompressed problem from a memory perspective, we do not specifically focus on time performance here. However, we note that while the uncompressed NMF typically requires fewer iterations to learn the factorization, its iterations are considerably slower. In Figure 3 (c,d), it would take several hours to run the uncompressed experiment for the full number of iterations, while the other methods take at most several minutes, so we show only the first iterations of uncompressed NMF. For the Faces dataset (Figure 3 (b)), compressed MU and GD run to the plotted number of iterations in seconds, and substantially fewer iterations would have been enough for approximately the same quality; the uncompressed method is considerably slower and would require much more time to run for the same number of iterations.
5.4. Real-world data: interpretable low rank decompositions
An important property of nonnegative low-rank factorizations is that they provide interpretable components. In Figure 4, we briefly demonstrate that the components learned from the compressed data are also interpretable. That is, we show the columns of the fitted factor reshaped back into images, in the same setup as in Figure 3 (b), for the one-sided data-adapted measurements.
[Figure 4: columns of the fitted factor reshaped into facial images, learned from one-sided data-adapted measurements.]
5.5. Real-world data: choosing regularization parameter
We have chosen a fixed regularization parameter for the one-sided experiments above. Here, we demonstrate that it is important, empirically as well as theoretically, to add a nonzero regularization term in the one-sided compression case. In Figure 5, we consider the compressed MU algorithm from one-sided data-adapted measurements for the 20News data with several regularization weights. We can see that nonzero regularization has a beneficial effect on performance, whereas omitting it compromises the convergence. At the same time, too large a regularization weight can slow down the convergence or result in a slightly worse limiting loss.
[Figure 5: effect of the regularization weight on the convergence of compressed MU with one-sided data-adapted measurements on the 20News dataset.]
6. Conclusion and future directions
In this paper, we propose several formulations of the NMF problem that (a) work using only compressed initial data, where the compression is done with linear maps (sketches) that access the initial data only once or twice; (b) have optimal solutions that are provably compatible with the optimal solutions of the uncompressed NMF problem; and (c) are supplemented with memory-efficient algorithms that solve the compressed problems without returning to the initial large data or forming any matrices of the original size. The convergence of these algorithms is proved in the form standard for NMF-related algorithms, that is, by showing that the objective function (which we provably connect to the original uncompressed objective) does not increase under the updates. We supplement the theoretical results with experiments showing comparable nonnegative factorization performance using only a small fraction of the initial data, on synthetic, text, and image datasets.
There are multiple avenues of future work stemming from our approach. For the two-sided measurements, we currently do not have a theorem that holds for data matrices with approximately low nonnegative rank, as in the one-sided case. The experimental evidence clearly suggests that similar, more general results should hold in the two-sided case. Also, it would be interesting to explain theoretically why the two-sided compression is less sensitive to the regularization (in practice, we use zero regularization in the two-sided experiments, which significantly simplifies the updates from Corollary 4.3).
Then, it is important to study the scalable versions of other nonnegative factorization algorithms besides multiplicative updates and projected gradient descent. We focused on multiplicative updates because of their relative popularity and simplicity, but perhaps other methods may be better adapted to sketched problems. A related question is to get theoretical guarantees for the methods proposed in [36] that empirically show comparable performance and typically faster convergence than our proposed algorithms.
Further, it is natural and meaningful to extend the framework to compressed versions of higher-order (tensor) problems. It has been recently shown [20, 39] that nonnegative tensor factorizations result in more interpretable decompositions for naturally higher-order data, such as temporal or multi-agent data, than their matrix NMF counterparts. At the same time, the tensor methods are even more computationally demanding and would benefit from scalable approximations. Scalable versions of other NMF-based algorithms, such as semi-supervised versions [1] or sparse NMF [20, 17], are also of interest.
References
- [1] M. Ahn, R. Grotheer, J. Haddock, L. Kassab, A. Kryshchenko, K. Leonard, S. Li, A. Madushani, T. Merkh, D. Needell, E. Sizikova, and C. Wang. Semi-supervised nonnegative matrix factorization models for topic modeling in learning tasks. In Proc. 53rd Asilomar Conference on Signals, Systems and Computers, 2020.
- [2] Y. Chen, H. Zhang, R. Liu, Z. Ye, and J. Lin. Experimental explorations on short text topic mining between LDA and NMF based schemes. Knowledge-Based Systems, 163:1–13, 2019.
- [3] M. A. Davenport and J. Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.
- [4] C. H. Ding, T. Li, and M. I. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):45–55, 2008.
- [5] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
- [6] N. B. Erichson, A. Mendible, S. Wihlborn, and J. N. Kutz. Randomized nonnegative matrix factorization. Pattern Recognition Letters, 104:1–7, 2018.
- [7] H. Fawzi and P. A. Parrilo. Lower bounds on nonnegative rank via nonnegative nuclear norms. Mathematical Programming, 153:41–66, 2015.
- [8] M. Fazel, E. Candes, B. Recht, and P. Parrilo. Compressed sensing and robust recovery of low rank matrices. In 2008 42nd Asilomar Conference on Signals, Systems and Computers, pages 1043–1047. IEEE, 2008.
- [9] X.-R. Feng, H.-C. Li, R. Wang, Q. Du, X. Jia, and A. Plaza. Hyperspectral unmixing based on nonnegative matrix factorization: A comprehensive review. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15:4414–4436, 2022.
- [10] M. Ghashami, E. Liberty, J. M. Phillips, and D. P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.
- [11] N. Gillis. The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines, 12, 2014.
- [12] N. Gillis and F. Glineur. Nonnegative factorization and the maximum edge biclique problem. arXiv preprint arXiv:0810.4225, 2008.
- [13] R. Grotheer, L. Huang, Y. Huang, A. Kryshchenko, O. Kryshchenko, P. Li, X. Li, E. Rebrova, K. Ha, and D. Needell. COVID-19 literature topic-based search via hierarchical NMF. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020. Association for Computational Linguistics, 2020.
- [14] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
- [15] C. A. Haselby, M. A. Iwen, D. Needell, M. Perlmutter, and E. Rebrova. Modewise operators, the tensor restricted isometry property, and low-rank tensor recovery. Applied and Computational Harmonic Analysis, 66:161–192, 2023.
- [16] K. Hayashi, S. G. Aksoy, G. Ballard, and H. Park. Randomized algorithms for symmetric nonnegative matrix factorization. arXiv preprint arXiv:2402.08134, 2024.
- [17] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of machine learning research, 5(9), 2004.
- [18] R. Huang, X. Li, and L. Zhao. Spectral–spatial robust nonnegative matrix factorization for hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing, 57(10):8235–8254, 2019.
- [19] M. A. Iwen, D. Needell, E. Rebrova, and A. Zare. Lower memory oblivious (tensor) subspace embeddings with fewer random bits: modewise methods for least squares. SIAM Journal on Matrix Analysis and Applications, 42(1):376–416, 2021.
- [20] L. Kassab, A. Kryshchenko, H. Lyu, D. Molitor, D. Needell, E. Rebrova, and J. Yuan. Sparseness-constrained nonnegative tensor factorization for detecting topics at different time scales. Frontiers in Applied Mathematics and Statistics, accepted.
- [21] D. Kuang and H. Park. Fast rank-2 nonnegative matrix factorization for hierarchical document clustering. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 739–747, 2013.
- [22] D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in neural information processing systems, 13, 2000.
- [23] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
- [24] H. Lee, J. Yoo, and S. Choi. Semi-supervised nonnegative matrix factorization. IEEE Signal Processing Letters, 17(1):4–7, 2009.
- [25] Y. Li, H. L. Nguyen, and D. P. Woodruff. Turnstile streaming algorithms might as well be linear sketches. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 174–183, 2014.
- [26] M. W. Mahoney et al. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011.
- [27] Y. Qian, C. Tan, D. Ding, H. Li, and N. Mamoulis. Fast and secure distributed nonnegative matrix factorization. IEEE Transactions on Knowledge and Data Engineering, 2020.
- [28] Y. Qian, C. Tan, N. Mamoulis, and D. W. Cheung. DSANLS: Accelerating distributed nonnegative matrix factorization via sketching. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 450–458, 2018.
- [29] G. Raskutti and M. W. Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. Journal of Machine Learning Research, 17(213):1–31, 2016.
- [30] J. Rennie. 20 Newsgroups, 2008.
- [31] A. K. Saibaba and A. Miedlar. Randomized low-rank approximations beyond Gaussian random matrices. arXiv preprint arXiv:2308.05814, 2023.
- [32] F. Samaria and A. Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of 1994 IEEE Workshop on Applications of Computer Vision, pages 138–142. IEEE, 1994.
- [33] T. Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 143–152. IEEE, 2006.
- [34] C. Shao and T. Höfer. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics, 33(2):235–242, 2017.
- [35] V. Sharan, K. S. Tai, P. Bailis, and G. Valiant. Compressed factorization: Fast and accurate low-rank factorization of compressively-sensed data. In International Conference on Machine Learning, pages 5690–5700. PMLR, 2019.
- [36] M. Tepper and G. Sapiro. Compressed nonnegative matrix factorization is fast and accurate. IEEE Transactions on Signal Processing, 64(9):2269–2283, 2016.
- [37] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher. Practical sketching algorithms for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications, 38(4):1454–1485, 2017.
- [38] S. A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2010.
- [39] J. Vendrow, J. Haddock, and D. Needell. Neural nonnegative CP decomposition for hierarchical tensor analysis. In 2021 55th Asilomar Conference on Signals, Systems, and Computers, pages 1340–1347. IEEE, 2021.
- [40] J. Vendrow, J. Haddock, E. Rebrova, and D. Needell. On a guided nonnegative matrix factorization. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3265–3269. IEEE, 2021.
- [41] R. Vershynin. Compressed Sensing: Theory and Applications, chapter Introduction to the non-asymptotic analysis of random matrices. Cambridge University Press, 2011.
- [42] F. Wang and P. Li. Efficient nonnegative matrix factorization with random projections. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 281–292. SIAM, 2010.
- [43] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
- [44] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335–366, 2008.
- [45] F. Yahaya, M. Puigt, G. Delmaire, and G. Roussel. Gaussian compression stream: Principle and preliminary results, 2020.
- [46] F. Yahaya, M. Puigt, G. Delmaire, and G. Roussel. Random projection streams for (weighted) nonnegative matrix factorization. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3280–3284. IEEE, 2021.
- [47] T. Zhang, B. Fang, Y. Y. Tang, G. He, and J. Wen. Topology preserving non-negative matrix factorization for face recognition. IEEE Transactions on Image Processing, 17(4):574–584, 2008.
- [48] Y. Zhao, H. Wang, and J. Pei. Deep non-negative matrix factorization architecture based on underlying basis images learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6):1897–1913, 2019.
- [49] G. Zhou, A. Cichocki, and S. Xie. Fast nonnegative matrix/tensor factorization based on low-rank approximation. IEEE Transactions on Signal Processing, 60(6):2928–2940, 2012.