
A fast two-stage algorithm for non-negative matrix factorization in streaming data

Ran Gu, Qiang Du, and Simon J. L. Billinge. Ran Gu is with the School of Statistics and Data Science, Nankai University, Tianjin 300071, China. Qiang Du and Simon J. L. Billinge are with the Department of Applied Physics and Applied Mathematics, Fu Foundation School of Engineering and Applied Sciences, Columbia University, New York, NY 10027, USA. Qiang Du is also with the Data Science Institute, Columbia University, New York, NY 10027, USA. Simon J. L. Billinge is also with the Condensed Matter Physics and Materials Science Department, Brookhaven National Laboratory, Upton, NY 11973, USA. This research was supported by NSF-DMR 1922234 (for R.G., Q.D. and S.B.) and NSF CCF 1740833 (for Q.D.). Manuscript received January 2nd, 2021; revised
Abstract

In this article, we study algorithms for nonnegative matrix factorization (NMF) in applications involving streaming data. Exploiting the continual nature of the data, we develop a fast two-stage algorithm for highly efficient and accurate NMF. In the first stage, an alternating non-negative least squares (ANLS) framework is used, in combination with an active set method and a warm-start strategy for the solution of the subproblems. In the second stage, an interior point method is adopted to accelerate the local convergence. The convergence of the proposed algorithm is proved. The new algorithm is compared with existing algorithms in benchmark tests using both real-world and synthetic data. The results demonstrate the advantage of our algorithm in finding high-precision solutions.

Index Terms:
Interior point method, active set method, nonnegative matrix factorization, streaming data.

I Introduction

Nonnegative matrix factorization (NMF) [26, 19] refers to the approximate factorization of a matrix into the product of two nonnegative matrices of low rank, $M\approx XY$. It has become one of the most popular multi-dimensional data processing tools in various applications such as signal processing [2], biomedical engineering [27], pattern recognition [6], computer vision and image engineering [2].

Lee and Seung initiated the study of NMF and presented a method in 1999 [19]. Their method makes all decomposed components non-negative and achieves non-linear dimension reduction at the same time. Their multiplicative update rule, developed for NMF in [20], has been a popular method due to its simplicity of implementation.

A commonly used optimization formulation of $M\approx XY$ takes the squared Euclidean distance (SED) as the objective function, that is,

$$\min_{X\in\mathbb{R}^{n\times k},\,Y\in\mathbb{R}^{k\times m}}\ \frac{1}{2}\|XY-M\|_F^2\quad\text{s.t.}\quad X\geq 0,\ Y\geq 0 \qquad (1)$$

Many studies of NMF based on the above formulation have focused on the use of different optimization approaches like the alternating non-negative least squares (ANLS) [22, 15, 11, 13], coordinate descent methods [5, 21], and alternating direction method of multipliers (ADMM) [12]. A comprehensive survey on various NMF models and many existing NMF algorithms can be found in [30].

The NMF problem has been shown to be non-convex and NP-hard [29]. The algorithms studied in the literature can in general only guarantee finding a local minimum of the cost function, rather than a global one. Although Arora et al. [1] presented a polynomial-time algorithm for constant $k$, the order of its complexity is too high for it to be applied in practice. Nevertheless, in many data mining applications, high quality local minima found in a short time are all that is desired [30].

In this study, we are motivated by the application of NMF to problems involving streaming data. Here, streaming data means continually generated data with continuous distributions, such as those from the real-time monitoring of reaction products in chemistry experiments and materials synthesis [23, 28]. For example, Zhao et al. [32] measured X-ray diffraction data during the nucleation and growth of zeolite-supported Ag nanoparticles through reduction of Ag-exchanged mordenite (MOR), and then processed the data into pair distribution function (PDF) measurements [7]. In these data, each PDF is approximately represented by a vector of dimension $n$, recorded at $m$ time instances in total. We should note the key features of the continual nature of these types of continuously distributed data: at any fixed time, the PDF is continuous in the distance variable; at any fixed distance, the PDF value is also continuous in time; moreover, the spatially distributed data at later times are generated progressively following the data earlier in time. In the AgMOR data used in [32], the dimensions were $n=3000$ and $m=36$, where $n$ is the length of each measurement and $m$ is the number of measurements. Three materials were anticipated to be present in the reaction, which means $k=3$. The focus of our study here is on streaming data in the particular regime $n\gg m\gg k$, with $k$ very small. This reflects the high dimensionality of the data in an individual measurement (very large $n$) for systems with a relatively small number of components $k$.

Our goal is to obtain high-precision local solutions for streaming data of relatively small scale, with $n\leq 5000$, $m\leq 200$, and $k\leq 10$. Based on our observations, ANLS reduces the objective function rapidly in the first few iterations, but converges slowly near a local minimizer. The interior point method has a better local convergence rate, but requires much more computation per iteration. Thus, we propose to combine ANLS with an interior point method in a two-stage algorithm. In the first stage, an active set method is used within ANLS; due to the continuity of streaming data, the number of active set changes is small. In the second stage, a line search interior point method is adopted to achieve fast local convergence. Exploiting the structure of streaming data, we organize the computation of the interior point algorithm so that the cost scales like $O(nm^2k^3)$ per iteration. The proposed two-stage algorithm combines the advantages of ANLS and the interior point method for streaming data.

The organization of this paper is as follows. Section 2 introduces the first stage of our proposed algorithm, which is based on ANLS with an active set method for streaming data. Section 3 proposes the second stage of the algorithm, a line search interior point method, and discusses its convergence. Section 4 gives the whole framework of our proposed two-stage algorithm for solving NMF on streaming data. Section 5 demonstrates the efficiency of our proposed algorithm by numerical tests. Section 6 concludes the paper and discusses future work.

II ANLS framework and active set method

We first briefly review the ANLS framework for solving (1). In ANLS, the variables are divided into two groups, which are updated alternately as outlined in Algorithm 1.

Algorithm 1 Alternating nonnegative least squares (ANLS)

Repeat until stopping criteria are satisfied

$$\min_{X\in\mathbb{R}^{n\times k}}\ \frac{1}{2}\|XY-M\|_F^2\quad\text{s.t.}\quad X\geq 0.$$
$$\min_{Y\in\mathbb{R}^{k\times m}}\ \frac{1}{2}\|XY-M\|_F^2\quad\text{s.t.}\quad Y\geq 0.$$

end

Note that each subproblem can be split into a series of nonnegative least squares problems

$$\min_x\ \frac{1}{2}\|Cx-d\|_2^2\quad\text{s.t.}\quad x\geq 0 \qquad (2)$$

For $C=X$, $d$ ranges over the columns of $M$, while for $C=Y^T$, $d$ ranges over the rows of $M$.

Although the original problem in (1) is nonconvex, the subproblems in algorithm 1 are convex quadratic problems whose optimal solutions can be found in polynomial time. In addition, the convergence of algorithm 1 to a stationary point of (1) has been proved [10].
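As a concrete illustration, the following is a minimal Python sketch of the ANLS framework, using SciPy's nnls routine as the subproblem solver (the function name anls, the fixed iteration count, and the random initialization are our simplifications, not the implementation used in the paper):

```python
import numpy as np
from scipy.optimize import nnls

def anls(M, k, iters=100, seed=0):
    # Alternating nonnegative least squares (Algorithm 1): fix one
    # factor and solve nonnegative least squares for the other.
    rng = np.random.default_rng(seed)
    n, m = M.shape
    X, Y = rng.random((n, k)), rng.random((k, m))
    for _ in range(iters):
        # X-subproblem: row i of X solves min ||Y^T x - M[i,:]||, x >= 0.
        for i in range(n):
            X[i], _ = nnls(Y.T, M[i])
        # Y-subproblem: column j of Y solves min ||X y - M[:,j]||, y >= 0.
        for j in range(m):
            Y[:, j], _ = nnls(X, M[:, j])
    return X, Y
```

In practice the loop would terminate on a stopping criterion such as the KKT violation (10) rather than a fixed iteration count.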

On the basis of the ANLS framework, many algorithms for NMF have been developed, such as the active set method [15], projected gradient method [22], projected Newton method [9], projected quasi-Newton methods [14, 31], Nesterov's gradient method [11], and a method combined with the Barzilai-Borwein stepsize [13].

The active set method solves the subproblems exactly. Kim and Park [15] introduced an algorithm based on ANLS and the active set method. The constrained least squares problem in matrix form with multiple right-hand side vectors is decomposed into several independent nonnegative least squares problems with a single right-hand side vector (2), which are solved by the active set method of Lawson and Hanson [18]. Later, Kim and Park [16] proposed an active-set-like algorithm based on the ANLS framework. In their algorithm, the individual least squares problems are solved by a block principal pivoting method, and columns that share a common active set are grouped together to avoid redundant Cholesky factorizations when solving the linear systems.

Due to the continuity of streaming data, when a single least squares problem (2) has been solved, the same $C$ and a similar $d$ appear in the subsequent least squares problem, resulting in similar solutions with, most likely, the same active set.

Therefore, we choose the active set method of Lawson and Hanson [18] to solve the least squares problem (2), using the solution and its active set as the initial guess for the subsequent least squares problem. This strategy, known as a warm start or continuation technique, is widely used in many algorithms and can considerably improve efficiency [24, 8]. We do not need to do any re-grouping, because the right-hand side vectors are naturally grouped by the continuity of the data, and we simply solve them one by one sequentially. In fact, the active set of the solution of the subsequent least squares problem usually changes little or not at all. When we solve the first system of equations in a new least squares subproblem, the Cholesky factorization performed in the previous step can be reused to avoid redundant computation. Numerically, we observe that for the streaming data under consideration the active set changes only on very few occasions or not at all. Consequently, the first stage of our algorithm is a combination of the ANLS framework, the active set method, and the warm start strategy; a warm-started solver is sketched below.
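A minimal sketch of such a warm-started active-set solver for (2), assuming dense NumPy arrays (the function name nnls_warm_start and the tolerance defaults are ours; a full implementation would also reuse the Cholesky factorization across calls as described above):

```python
import numpy as np

def nnls_warm_start(C, d, x0=None, tol=1e-10, max_iter=500):
    # Active-set NNLS in the spirit of Lawson-Hanson [18], warm-started
    # from a previous solution x0: min 0.5*||C x - d||^2 s.t. x >= 0.
    n = C.shape[1]
    x = np.zeros(n) if x0 is None else np.maximum(x0, 0.0)
    P = x > tol                              # passive (free) set
    for _ in range(max_iter):
        z = np.zeros(n)
        if P.any():
            z[P] = np.linalg.lstsq(C[:, P], d, rcond=None)[0]
        # If the passive solution has negative entries, step back toward
        # the last feasible x and move blocking coordinates out of P.
        while P.any() and z[P].min() < -tol:
            blocking = P & (z < -tol)
            alpha = np.min(x[blocking] / (x[blocking] - z[blocking]))
            x = x + alpha * (z - x)
            P &= x > tol
            z = np.zeros(n)
            if P.any():
                z[P] = np.linalg.lstsq(C[:, P], d, rcond=None)[0]
        x = np.maximum(z, 0.0)
        w = C.T @ (d - C @ x)                # negative gradient
        w[P] = -np.inf                       # inspect active coords only
        if w.max() <= tol:
            return x                         # KKT conditions hold
        P[np.argmax(w)] = True               # free most violated index
    return x
```

When the right-hand sides are processed sequentially, the solution for one column is passed as x0 for the next; the passive set then rarely changes, and most calls terminate after a single optimality check.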

III A line search interior point method for NMF

Interior-point methods have proved as successful for nonlinear optimization as for linear programming. In this section, we propose a line search interior point method for solving NMF, and analyze its global convergence and computational cost.

III-A Algorithm

Given that the linear independence constraint qualification (LICQ) holds for the NMF problem, the KKT conditions [17] for the problem can be written as

$$\begin{array}{c}(XY-M)Y^T=R\\ X^T(XY-M)=S\\ \langle R,X\rangle=0\\ \langle S,Y\rangle=0\\ X\geq 0,\ Y\geq 0,\ R\geq 0,\ S\geq 0\end{array} \qquad (3)$$

We denote $x=\mathrm{vec}(X^T)$, $y=\mathrm{vec}(Y)$, $r=\mathrm{vec}(R^T)$, $s=\mathrm{vec}(S)$, where $\mathrm{vec}$ transforms a matrix into a vector by stacking its columns. We also define the inverse operations $\mathrm{mat}(x)=X^T$, $\mathrm{mat}(y)=Y$, $\mathrm{mat}(r)=R^T$, $\mathrm{mat}(s)=S$.

Applying Newton's method to this nonlinear system in the variables $x$, $y$, $r$, $s$, we obtain

$$\begin{pmatrix}Q_1 & C^T & -I & \\ C & Q_2 & & -I\\ \mathrm{Diag}(r) & & \mathrm{Diag}(x) & \\ & \mathrm{Diag}(s) & & \mathrm{Diag}(y)\end{pmatrix}\begin{pmatrix}\Delta x\\ \Delta y\\ \Delta r\\ \Delta s\end{pmatrix}=\begin{pmatrix}r-graX\\ s-graY\\ \mu e-\mathrm{Diag}(r)x\\ \mu e-\mathrm{Diag}(s)y\end{pmatrix} \qquad (4)$$

with $\mu=0$, where $graX=\mathrm{vec}(Y(XY-M)^T)$, $graY=\mathrm{vec}(X^T(XY-M))$, $e$ is a vector of ones, $\mathrm{Diag}(x)$ denotes the diagonal matrix with diagonal entries given by $x$,

$$Q_1=\begin{pmatrix}YY^T & & \\ & \ddots & \\ & & YY^T\end{pmatrix},\qquad Q_2=\begin{pmatrix}X^TX & & \\ & \ddots & \\ & & X^TX\end{pmatrix},$$

$$C=\begin{pmatrix}X_1^TY_1^T & \cdots & X_n^TY_1^T\\ \vdots & \ddots & \vdots\\ X_1^TY_m^T & \cdots & X_n^TY_m^T\end{pmatrix}+\begin{pmatrix}(X_1Y_1-M_{11})I & \cdots & (X_nY_1-M_{n1})I\\ \vdots & \ddots & \vdots\\ (X_1Y_m-M_{1m})I & \cdots & (X_nY_m-M_{nm})I\end{pmatrix} \qquad (5)$$

$X_i$ is the $i$th row of $X$, and $Y_j$ is the $j$th column of $Y$.

Let $\mu$ be strictly positive; then the variables $x$, $y$, $r$ and $s$ are forced to take positive values. The trajectory $(x(\mu),y(\mu),r(\mu),s(\mu))$ is called the primal-dual central path. The variables are updated by

$$x=x+\alpha_1\Delta x,\quad y=y+\alpha_1\Delta y,\quad r=r+\alpha_2\Delta r,\quad s=s+\alpha_2\Delta s \qquad (6)$$

where $\alpha_1\in(0,\alpha_1^{\max}]$, $\alpha_2\in(0,\alpha_2^{\max}]$, and

$$\begin{array}{rl}\alpha_1^{\max}=&\max\{\alpha\in(0,1]:x+\alpha\Delta x\geq(1-\tau)x,\ y+\alpha\Delta y\geq(1-\tau)y\}\\ \alpha_2^{\max}=&\max\{\alpha\in(0,1]:r+\alpha\Delta r\geq(1-\tau)r,\ s+\alpha\Delta s\geq(1-\tau)s\}\end{array} \qquad (7)$$

with $\tau\in(0,1)$. Condition (7) is called the fraction to the boundary rule [25], which is used to prevent the variables from approaching their lower bound of 0 too quickly. In this work, we choose $\tau=0.9$.
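The largest admissible step length can be computed coordinatewise; a small sketch assuming NumPy vectors (the helper name max_step is ours):

```python
import numpy as np

def max_step(v, dv, tau=0.9):
    # Largest alpha in (0, 1] with v + alpha*dv >= (1 - tau)*v, i.e. no
    # coordinate loses more than a fraction tau of its current value.
    neg = dv < 0
    if not np.any(neg):
        return 1.0
    return float(min(1.0, np.min(-tau * v[neg] / dv[neg])))
```

Then $\alpha_1^{\max}$ is the minimum of max_step(x, Δx) and max_step(y, Δy), and similarly for $\alpha_2^{\max}$ with $(r,\Delta r)$ and $(s,\Delta s)$.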

A predictor, or probing, strategy [25] can also be used to determine the parameter $\mu$. We calculate a predictor (affine scaling) direction

$$(\Delta x^{aff},\Delta y^{aff},\Delta r^{aff},\Delta s^{aff})$$

by setting $\mu=0$. We probe this direction by letting $(\alpha_1^{aff},\alpha_2^{aff})$ be the longest step lengths that can be taken along the affine scaling direction before violating the nonnegativity conditions $(x,y,r,s)\geq 0$. Explicit formulas for these step lengths are given by (7) with $\tau=1$. We then define $\mu^{aff}$ to be the value of complementarity along the (shortened) affine scaling step, that is,

$$\mu^{aff}=\left[(x+\alpha_1^{aff}\Delta x)^T(r+\alpha_2^{aff}\Delta r)+(y+\alpha_1^{aff}\Delta y)^T(s+\alpha_2^{aff}\Delta s)\right]/(nk+mk) \qquad (8)$$

and a heuristic choice of $\mu$ is then defined by the update $\mu=\sigma\mu$ with

$$\sigma=\min\left\{\left(\frac{\mu^{aff}}{(x^Tr+y^Ts)/(nk+mk)}\right)^3,\ 0.99\right\}. \qquad (9)$$
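A sketch of this predictor heuristic, reusing max_step from the previous sketch with tau = 1 for the probing step (the function name centering_sigma is ours):

```python
def centering_sigma(x, y, r, s, dxa, dya, dra, dsa):
    # Probe the affine-scaling direction (mu = 0) with tau = 1, then
    # set the centering parameter sigma as in (8)-(9).
    a1 = min(max_step(x, dxa, tau=1.0), max_step(y, dya, tau=1.0))
    a2 = min(max_step(r, dra, tau=1.0), max_step(s, dsa, tau=1.0))
    N = x.size + y.size                                   # nk + mk
    mu_aff = ((x + a1 * dxa) @ (r + a2 * dra)
              + (y + a1 * dya) @ (s + a2 * dsa)) / N      # (8)
    mu_cur = (x @ r + y @ s) / N
    return min((mu_aff / mu_cur) ** 3, 0.99)              # (9)
```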

We propose a two-level nested loop algorithm for the interior point search. In the inner loop, the parameter $\mu$ is fixed; in the outer loop, we gradually reduce $\mu$ to 0. We break the inner loop using the following error function, which is based on the perturbed KKT system:

$$E(x,y,r,s;\mu)=\max\{\|((graX-r)^T,(graY-s)^T)^T\|,\ \|((r^T\mathrm{Diag}(x),\ s^T\mathrm{Diag}(y))-\mu e^T)^T\|\} \qquad (10)$$

To guarantee the global convergence of the algorithm, we apply a line search approach. We first consider the barrier problem associated with the interior-point method,

$$\min_{X,Y}\ \frac{1}{2}\|XY-M\|_F^2-\mu\sum\log(X_{ik})-\mu\sum\log(Y_{kj})$$

We use the barrier function itself as the exact merit function, which can be written as

$$\phi(x,y)=\frac{1}{2}\|\mathrm{mat}(x)^T\mathrm{mat}(y)-M\|_F^2-\mu\sum\log(x_i)-\mu\sum\log(y_j)$$

In the algorithm, we use an Armijo line search to ensure a sufficient decrease of the merit function.
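A sketch of this backtracking search, assuming a callable phi that evaluates the merit function and a precomputed directional derivative slope $=(\Delta x^T,\Delta y^T)\nabla\phi(x_k,y_k)$ (the function name and the cap on halvings are ours):

```python
def armijo_alpha1(phi, x, y, dx, dy, a_max, slope, eta=0.5, max_halvings=30):
    # Backtracking line search for (13): start from the fraction-to-the-
    # boundary limit a_max and halve alpha until sufficient decrease
    # holds. slope must be negative for a descent direction.
    f0 = phi(x, y)
    a = a_max
    for _ in range(max_halvings):
        if phi(x + a * dx, y + a * dy) <= f0 + eta * a * slope:
            break
        a *= 0.5
    return a
```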

Since the Hessian of the merit function is not necessarily convex, we approximate

$$\begin{pmatrix}Q_1 & C^T\\ C & Q_2\end{pmatrix}$$

by a positive definite matrix to guarantee that the direction $(\Delta x,\Delta y)$ is always a descent direction of $\phi(x,y)$, so that such a line search can be implemented. Notice that the original NMF problem is a least squares problem, so we can use the Hessian of the classical Gauss-Newton algorithm. This Hessian matrix is

$$\begin{pmatrix}Q_1 & \bar{C}^T\\ \bar{C} & Q_2\end{pmatrix}$$

and it is guaranteed to be positive semi-definite. Here

$$\bar{C}=\begin{pmatrix}X_1^TY_1^T & \cdots & X_n^TY_1^T\\ \vdots & \ddots & \vdots\\ X_1^TY_m^T & \cdots & X_n^TY_m^T\end{pmatrix}$$

which is the first term of $C$. For further safeguarding, we add a diagonal matrix $\rho I$ to this Hessian, where $\rho$ is a small positive constant. We then obtain the primal-dual direction by solving

$$\begin{pmatrix}Q_1+\rho I & \bar{C}^T & -I & \\ \bar{C} & Q_2+\rho I & & -I\\ \mathrm{Diag}(r) & & \mathrm{Diag}(x) & \\ & \mathrm{Diag}(s) & & \mathrm{Diag}(y)\end{pmatrix}\begin{pmatrix}\Delta x\\ \Delta y\\ \Delta r\\ \Delta s\end{pmatrix}=\begin{pmatrix}r-graX\\ s-graY\\ \mu e-\mathrm{Diag}(r)x\\ \mu e-\mathrm{Diag}(s)y\end{pmatrix}. \qquad (11)$$

By eliminating $\Delta r$ and $\Delta s$ in (11), we have

$$\begin{pmatrix}R_1 & \bar{C}^T\\ \bar{C} & R_2\end{pmatrix}\begin{pmatrix}\Delta x\\ \Delta y\end{pmatrix}=-\nabla\phi(x,y), \qquad (12)$$

where

$$\begin{array}{c}R_1=Q_1+\rho I+\mathrm{Diag}(x)^{-1}\mathrm{Diag}(r),\\ R_2=Q_2+\rho I+\mathrm{Diag}(y)^{-1}\mathrm{Diag}(s),\\ \nabla\phi(x,y)=(graX^T-\mu(x^{-1})^T,\ graY^T-\mu(y^{-1})^T)^T,\end{array}$$

where $\mathrm{Diag}(x)^{-1}e$ is abbreviated as $x^{-1}$. We write the above system compactly as

$$Bp=-\nabla\phi(x,y),$$

where $B$ denotes the coefficient matrix and $p$ the direction $(\Delta x^T,\Delta y^T)^T$. Since $B$ is positive definite, the inner product $p^T\nabla\phi(x,y)=-p^TBp<0$, which means that $p$ is a descent direction.

Summing up the approaches above, we present the whole interior point method in Algorithm 2. It contains two loops, with the parameter $\mu$ fixed in the inner loop. Due to the Armijo line search (13), the stopping criterion of the inner loop is satisfied within finitely many iterations. Further, the parameters $\mu$ and $\epsilon_\mu$ are reduced gradually, and by the definition of the error function $E$, the solutions of the two-loop algorithm satisfy the KKT system (3) within the error $\epsilon_{TOL}$. In practice, the barrier stop tolerance can be defined as

$$\epsilon_\mu=\mu.$$

The complete convergence theorem and its proof are given in Theorem 1.

Algorithm 2 A line search interior point method

Initialize: Choose $x_0,y_0,r_0,s_0>0$; select an initial barrier parameter $\mu>0$, parameters $\eta,\sigma\in(0,1)$, and decreasing tolerances $\epsilon_\mu\downarrow 0$ and $\epsilon_{TOL}$. Set $k=0$.
Repeat until $E(x_k,y_k,r_k,s_k;0)\leq\epsilon_{TOL}$
Repeat until $E(x_k,y_k,r_k,s_k;\mu)\leq\epsilon_\mu$
   Compute the primal-dual direction by solving (11).
   Compute $\alpha_1^{\max}$ and $\alpha_2^{\max}$ using (7).
   Backtrack step lengths $\alpha_1=\frac{1}{2^t}\alpha_1^{\max}$, $\alpha_2=\alpha_2^{\max}$
   to find the smallest integer $t\geq 0$ satisfying

$$\phi(x_k+\alpha_1\Delta x,\ y_k+\alpha_1\Delta y)\leq\phi(x_k,y_k)+\eta\alpha_1(\Delta x^T,\Delta y^T)\nabla\phi(x_k,y_k) \qquad (13)$$

   Compute $(x_{k+1},y_{k+1},r_{k+1},s_{k+1})$ using (6).
   Set $k:=k+1$.
end
Compute the parameter $\sigma$ using (8) and (9) and update $\mu=\sigma\mu$.
end

Theorem 1.

Suppose that all the sequences $\{x_k\}$, $\{y_k\}$, $\{r_k\}$, $\{s_k\}$ generated by Algorithm 2 are bounded. Then Algorithm 2 stops within finitely many iterations.

Proof.

We consider the inner loop first and show that, for a given $\mu>0$, the criterion $E(x_k,y_k,r_k,s_k;\mu)\leq\epsilon_\mu$ is satisfied within finitely many iterations.

Based on the Armijo line search rule (13), the value of $\phi(x_k,y_k)$ decreases monotonically. Hence $x_k$ and $y_k$ are bounded below by a strictly positive constant that depends on $\mu$, and the smallest eigenvalue of the coefficient matrix of (12) is bounded below by a positive constant. Furthermore, using the boundedness of $(x_k,y_k)$, we obtain that $(\Delta x,\Delta y)$ is bounded. From the lower bound on $x_k,y_k$, the boundedness of $(\Delta x,\Delta y)$ and (7), we have

$$\inf\alpha_1^{\max}>0. \qquad (14)$$

According to the stepsize rule (13), we have

$$\alpha_1(\Delta x^T,\Delta y^T)\nabla\phi(x_k,y_k)\rightarrow 0$$

We next aim to prove that

$$(\Delta x^T,\Delta y^T)\nabla\phi(x_k,y_k)\rightarrow 0 \qquad (15)$$

To prove this by contradiction, we suppose that

$$(\Delta x^T,\Delta y^T)\nabla\phi(x_k,y_k)\nrightarrow 0 \qquad (16)$$

This means that there exist a subsequence $\mathcal{T}$ and a constant $a>0$ such that

$$-(\Delta x^T,\Delta y^T)\nabla\phi(x_k,y_k)>a \qquad (17)$$

for $k\in\mathcal{T}$. Due to the boundedness of $\{(x_k,y_k)\}_{k\in\mathcal{T}}$, there exists a subsequence $\mathcal{T}_1\subset\mathcal{T}$ such that $\{(x_k,y_k)\}_{k\in\mathcal{T}_1}$ converges to some $(\bar{x},\bar{y})$. By (16) and (17), we have

$$\{\alpha_1^k\}_{k\in\mathcal{T}_1}\rightarrow 0$$

According to (14),

$$\alpha_1^k<\inf\alpha_1^{\max}$$

when $k\in\mathcal{T}_1$ is large enough. For simplicity, we redefine $\mathcal{T}_1$ to be the subsequence satisfying this condition. By the stepsize rule, condition (13) is violated by $\alpha_1=2\alpha_1^k$, so we have

$$\frac{\phi(x_k+2\alpha_1^k\Delta x,\ y_k+2\alpha_1^k\Delta y)-\phi(x_k,y_k)}{2\alpha_1^k}>\eta(\Delta x^T,\Delta y^T)\nabla\phi(x_k,y_k) \qquad (18)$$

Taking the limit of the above inequality, we obtain

$$(\Delta\bar{x}^T,\Delta\bar{y}^T)\nabla\phi(\bar{x},\bar{y})\geq\eta(\Delta\bar{x}^T,\Delta\bar{y}^T)\nabla\phi(\bar{x},\bar{y})$$

Since $0<\eta<1$, it follows that $(\Delta\bar{x}^T,\Delta\bar{y}^T)\nabla\phi(\bar{x},\bar{y})\geq 0$. On the other hand, $(\Delta x^T,\Delta y^T)\nabla\phi(x_k,y_k)<0$ for all $k$. Therefore $(\Delta\bar{x}^T,\Delta\bar{y}^T)\nabla\phi(\bar{x},\bar{y})=0$, which contradicts (16). Hence (15) is established.

From (12) and (15), it follows that

$$\nabla\phi(x_k,y_k)\rightarrow 0$$

and

$$(\Delta x,\Delta y)\rightarrow 0.$$

Then according to (11), we obtain

$$\Delta r\rightarrow\mu x_k^{-1}-r_k,\qquad \Delta s\rightarrow\mu y_k^{-1}-s_k$$

For an arbitrary cluster point $(\bar{x},\bar{y})$ and any $\delta_\mu>0$, there exists a constant $\kappa$ such that

$$\|(x_\kappa,y_\kappa)-(\bar{x},\bar{y})\|\leq\delta_\mu$$

and

$$\begin{array}{c}\|\Delta r+r_k-\mu x_k^{-1}\|\leq\delta_\mu,\\ \|\Delta s+s_k-\mu y_k^{-1}\|\leq\delta_\mu,\\ \|(\Delta x,\Delta y)\|\leq\delta_\mu,\\ \|\nabla\phi(x_k,y_k)\|\leq\delta_\mu\end{array}$$

for all $k\geq\kappa$. Due to the boundedness of $(x_k,y_k)$, $\alpha_2^k$ can reach 1 for some $k:=\bar{k}\leq\kappa+T-1$, where $T$ is a constant. Then it follows from

$$\|(\bar{x},\bar{y})-(x_{\bar{k}},y_{\bar{k}})\|\leq T\delta_\mu$$

that

$$\|r_{\bar{k}}-\mu x_{\bar{k}}^{-1}\|\leq c\delta_\mu,\qquad \|s_{\bar{k}}-\mu y_{\bar{k}}^{-1}\|\leq c\delta_\mu$$

where $c$ is a constant. Therefore, for a given $\epsilon_\mu>0$, taking $\delta_\mu$ sufficiently small, there exists a $k$ such that $E(x_k,y_k,r_k,s_k;\mu)\leq\epsilon_\mu$.

We denote the sequence satisfying the inner loop stopping criterion by

$$\{(x_k,y_k,r_k,s_k)\}_{k\in\mathcal{S}}.$$

To prove the theorem by contradiction, we suppose that no point satisfies the outer loop stopping criterion $E(x_k,y_k,r_k,s_k;0)\leq\epsilon_{TOL}$. Due to the boundedness, $\{(x_k,y_k,r_k,s_k)\}_{k\in\mathcal{S}}$ has a cluster point. For any cluster point $(\bar{x},\bar{y},\bar{r},\bar{s})$, let $\mathcal{T}_2\subset\mathcal{S}$ be a subsequence converging to it. Taking limits on both sides of

$$E(x_k,y_k,r_k,s_k;\mu)\leq\epsilon_\mu,\qquad k\in\mathcal{T}_2,$$

and recalling that $\mu\rightarrow 0$ and $\epsilon_\mu\rightarrow 0$ in the outer loop, we obtain

$$E(\bar{x},\bar{y},\bar{r},\bar{s};0)=0.$$

Thus Algorithm 2 stops at some $k\in\mathcal{T}_2$, a contradiction. Therefore, Algorithm 2 stops within finitely many iterations. ∎

III-B Computation

In general, the computational cost of each iteration of an interior point method is cubic in the problem size, which is impractical for large scale problems. However, for the special NMF problem under consideration here, the amount of computation can be greatly reduced, so that the proposed method can be applied to practical problems involving streaming data. In the following, we analyze the computational cost of Algorithm 2.

First of all, computing the gradient of (1) requires $O(nmk)$ flops.

The main computational cost of Algorithm 2 is computing the primal-dual direction (11). One can solve (12) to obtain $(\Delta x,\Delta y)$ first, and then compute $(\Delta r,\Delta s)$ at low cost.

We rewrite (12) as

$$\begin{pmatrix}\bar{Q}_1 & \bar{C}^T\\ \bar{C} & \bar{Q}_2\end{pmatrix}\begin{pmatrix}\Delta x\\ \Delta y\end{pmatrix}=\begin{pmatrix}b_1\\ b_2\end{pmatrix} \qquad (19)$$

In order to minimize the computational cost, we first decompose $\bar{Q}_1$. A factorization

$$\bar{Q}_1=P^TP$$

can be obtained by Cholesky factorization or eigenvalue decomposition. Since the matrix $\bar{Q}_1$ consists of $n$ positive definite diagonal blocks of size $k\times k$, one can obtain $P$ and $P^{-1}$ within $O(nk^3)$ flops. Then (19) is equivalent to

$$\begin{pmatrix}I & P^{-T}\bar{C}^T\\ \bar{C}P^{-1} & \bar{Q}_2\end{pmatrix}\begin{pmatrix}P\Delta x\\ \Delta y\end{pmatrix}=\begin{pmatrix}P^{-T}b_1\\ b_2\end{pmatrix}.$$

Then we solve Δy\Delta y from

$$(\bar{Q}_2-(\bar{C}P^{-1})(\bar{C}P^{-1})^T)\Delta y=b_2-(\bar{C}P^{-1})(P^{-T}b_1) \qquad (20)$$

and compute Δx\Delta x by

$$\Delta x=P^{-1}(P^{-T}b_1-(\bar{C}P^{-1})^T\Delta y).$$

We need $O(nmk^2)$ flops to construct $\bar{C}$ and $O(nmk^3)$ flops to compute $\bar{C}P^{-1}$; exploiting the fact that each block of $\bar{C}$ is rank one, $\bar{C}P^{-1}$ can be computed within $O(nmk^2)$ flops. The dominant computation is forming $(\bar{C}P^{-1})(\bar{C}P^{-1})^T$, which costs $O(nm^2k^3)$ flops; using the rank-one structure of the blocks of $\bar{C}$ again, this can be reduced to $O(nm^2k^2)$ flops. When solving (20) by Cholesky factorization, since the coefficient matrix is of size $mk\times mk$, the computational cost is $O(m^3k^3)$ flops. Other computations, such as forming the right-hand side of (20) and computing $\Delta x$, cost $O(nmk^2)$ flops. The elimination procedure is sketched below.
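A dense sketch of this elimination, assuming NumPy arrays for the blocks of (19) (the function name is ours; the paper's implementation additionally exploits the block-diagonal structure of $\bar{Q}_1$ and the rank-one blocks of $\bar{C}$ to reach the flop counts above):

```python
import numpy as np

def solve_reduced_system(Q1, Cbar, Q2, b1, b2):
    # Schur-complement solve of (19): factor Q1 = P^T P, eliminate
    # Delta x, solve (20) for Delta y, then back-substitute.
    L = np.linalg.cholesky(Q1)            # Q1 = L L^T, so P = L^T
    P = L.T
    CP = np.linalg.solve(P.T, Cbar.T).T   # Cbar P^{-1}
    Pb = np.linalg.solve(P.T, b1)         # P^{-T} b1
    dy = np.linalg.solve(Q2 - CP @ CP.T, b2 - CP @ Pb)   # (20)
    dx = np.linalg.solve(P, Pb - CP.T @ dy)
    return dx, dy
```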

To sum up, the computational cost of Algorithm 2 in each iteration is $O((n+mk)m^2k^2)$. Compared with the $O(nmk)$ cost of a gradient evaluation, this is at most $O(mk^2)$ times larger. Since $n\gg m\gg k$ and $k$ is usually very small for our streaming data, the computational complexity is entirely acceptable.

IV A two-stage algorithm for NMF on streaming data

In this section, we propose a practical algorithm with fast convergence for solving NMF on streaming data. It combines the ANLS framework with the active set method and the interior point method proposed in the previous sections.

In the early stage of the algorithm, we use the ANLS framework with the active set method, which reduces the value of the objective function rapidly. We use a relative step tolerance, i.e. a relative lower bound on the size of a step,

$$\|(x_k,y_k)-(x_{k+1},y_{k+1})\|\leq\epsilon_{STOL}(1+\|(x_k,y_k)\|),$$

as the stopping criterion of this stage. If the algorithm attempts to take a step smaller than the step tolerance, the iterations end.

At the end of the first stage, the algorithm is ready to enter the second stage, the interior point method. However, the solutions from the active set method usually contain zero elements, which is incompatible with the strictly interior points required by the interior point method. Meanwhile, we also need to provide initial dual variables $(r,s)$ for the interior point method. To address these issues, we first raise each primal variable smaller than $\rho_0$ to $\rho_0$ by

$$(x,y):=\max\{(x,y),\rho_0\} \qquad (21)$$

where ρ0\rho_{0} is a small positive constant. We can choose

$$\rho_0=10^{-6}\max\{(x,y)\}. \qquad (22)$$

Next, we initialize the dual variables by

$$r=\max\{|graX|\}e,\qquad s=\max\{|graY|\}e \qquad (23)$$

We set the parameters

$$\mu=\frac{x^Tr+y^Ts}{mk+nk} \qquad (24)$$

and

$$\rho=\epsilon_{TOL}. \qquad (25)$$
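These stage-transition rules amount to a few lines; a sketch assuming NumPy vectors and the gradients graX, graY defined earlier (the function name is ours):

```python
import numpy as np

def init_second_stage(x, y, graX, graY, eps_tol=1e-6):
    # Lift primal variables off the boundary and build initial dual
    # variables and parameters, following (21)-(25).
    rho0 = 1e-6 * max(x.max(), y.max())               # (22)
    x, y = np.maximum(x, rho0), np.maximum(y, rho0)   # (21)
    r = np.full(x.shape, np.abs(graX).max())          # (23)
    s = np.full(y.shape, np.abs(graY).max())
    mu = (x @ r + y @ s) / (x.size + y.size)          # (24)
    rho = eps_tol                                     # (25)
    return x, y, r, s, mu, rho
```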

After these preparations, the algorithm enters its second stage by implementing Algorithm 2. The Hessian matrix is approximated in Algorithm 2 by the positive definite matrix

$$\begin{pmatrix}Q_1+\rho I & \bar{C}^T\\ \bar{C} & Q_2+\rho I\end{pmatrix}, \qquad (26)$$

so in order to further speed up convergence, we can change $\bar{C}$ back to $C$ at the right time. A heuristic way to decide when to switch to

$$\begin{pmatrix}Q_1+\rho I & C^T\\ C & Q_2+\rho I\end{pmatrix} \qquad (27)$$

is to monitor $\sigma$, given in (9). A small $\sigma$ implies that the predictor step generates a point close to the boundary, which is thus likely to be close to a local minimum. Therefore, when

$$\sigma\leq\sigma_c,$$

we switch to (27), where $\sigma_c$ is a user-supplied constant; we set $\sigma_c=0.01$ in our tests. The computation of the primal-dual direction is similar to that using $\bar{C}$. The difference is that each block of $C$ is no longer rank one, so the computational cost is $O(nm^2k^3)$ flops. Since (27) is not guaranteed to be positive definite, the primal-dual direction computed with (27) may not be a descent direction. We therefore check the negativity of

$$(\Delta x^T,\Delta y^T)\nabla\phi(x_k,y_k)$$

in (13) before we carry out the line search. If we fail to obtain a descent direction, we switch back to the positive definite Hessian (26).

All of the above elements are combined in our two-stage algorithm, outlined below in Algorithm 3.

Algorithm 3 A fast two-stage algorithm

Initialize: Choose initial $X,Y$.
Implement Algorithm 1 with the active set method.
Update variables by (21), (22) and (23).
Set parameters by (24) and (25), $\eta=0.5$, select the tolerance $\epsilon_{TOL}$, and let $flag=0$.
Repeat until $E(x_k,y_k,r_k,s_k;0)\leq\epsilon_{TOL}$
$\epsilon_\mu:=\mu$
Repeat until $E(x_k,y_k,r_k,s_k;\mu)\leq\epsilon_\mu$
   If $flag=1$,
    compute the primal-dual direction by solving (11) with Hessian (27).
    If $(\Delta x^T,\Delta y^T)\nabla\phi(x_k,y_k)\geq 0$,
     set $flag=0$ and compute the primal-dual direction by solving (11).
   Else,
    compute the primal-dual direction by solving (11).
   Compute $\alpha_1^{\max}$ and $\alpha_2^{\max}$ using (7).
   Backtrack step lengths $\alpha_1=\frac{1}{2^t}\alpha_1^{\max}$, $\alpha_2=\alpha_2^{\max}$ to find
   the smallest integer $t\geq 0$ satisfying (13).
   Compute $(x_{k+1},y_{k+1},r_{k+1},s_{k+1})$ using (6).
   Set $k:=k+1$.
End
Compute the parameter $\sigma$ using (8) and (9) and update $\mu=\sigma\mu$.
If $\sigma\leq 0.01$, set $flag=1$; else $flag=0$.
End

V Numerical tests

We test our two-stage algorithm and compare it with other ANLS-based methods, including NeNMF [11], QRPBB [13] and ANLS-BPP [16]. All tests are performed using MATLAB 2020a on a computer with a 64-bit system and a 2.70 GHz CPU. Comparisons are done on both synthetic data sets and real-world problems.

V-A Stopping criterion

The KKT conditions of (1) are given in (3), and the error function $E(x,y,r,s;0)$ defined in (10) measures their violation. Therefore, we set

$$E(x,y,r,s;0)\leq\epsilon_{TOL}$$

to be the stopping criterion, where

$$\epsilon_{TOL}=10^{-6}.$$

The ANLS-based algorithms do not generate the dual variables $r$ and $s$; for them, we adopt the reasonable definition

$$r=\max\{graX,0\},\qquad s=\max\{graY,0\}$$
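For reference, a sketch of how the KKT violation (10) can be evaluated for the ANLS-based methods, assuming NumPy vectors for the gradients graX, graY (the function name is ours):

```python
import numpy as np

def kkt_violation(graX, graY, x, y, mu=0.0):
    # Error function (10), with the dual variables of the ANLS-based
    # methods taken as r = max(graX, 0), s = max(graY, 0).
    r, s = np.maximum(graX, 0.0), np.maximum(graY, 0.0)
    e1 = np.linalg.norm(np.concatenate([graX - r, graY - s]))
    e2 = np.linalg.norm(np.concatenate([r * x, s * y]) - mu)
    return max(e1, e2)
```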

We also limit the maximum CPU time in those cases where some algorithms cannot reach the given accuracy.

V-B Streaming AgMOR PDF data

Zhao et al. [32] measured X-ray diffraction data during the nucleation and growth of zeolite-supported Ag nanoparticles through reduction of Ag-exchanged mordenite (MOR), and processed the data with pair distribution function (PDF) analysis. In the field of chemistry, mathematical tools are increasingly used to analyze measured data. Chapman et al. [4] used principal component analysis (PCA) and some post-processing to analyze the data given in [32], and obtained 3 principal components. Since a PDF is a distribution-type function, NMF is intuitively well suited to such data.

We simply remove the negativity of the raw data by shifting each PDF up by the opposite of its original minimum value (a one-line sketch is given below). Then we perform the NMF algorithms on the data. The size of the data is

$$n=3000,\qquad m=36,$$

and in the algorithms, $k$ is set to 3 based on the analysis of [4].
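A sketch of the preprocessing shift, assuming the PDFs are stored as the columns of a NumPy array M (the helper name is ours):

```python
import numpy as np

def make_nonnegative(M):
    # Shift each PDF (a column of M) up by the opposite of its minimum
    # value, so that all entries are nonnegative before factorization.
    return M - np.minimum(M.min(axis=0, keepdims=True), 0.0)
```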

We use the same initial point for each algorithm. The relationship between the KKT violation $E$ and CPU time for the four algorithms is shown in Fig. 1. It can be seen that the performances of NeNMF, ANLS-BPP and our 2-STAGE algorithm are similar in the early stage, while QRPBB shows the most pronounced decrease of the error function value. After that, the error function value of 2-STAGE rises, because our algorithm begins to enter the second stage and the variables change; some oscillations occur later, caused by the instability of the early iterations of the interior point method. After several iterations, 2-STAGE becomes stable and converges quickly to meet the termination criterion. Compared with NeNMF and ANLS-BPP, QRPBB is consistently more accurate; even so, it cannot reach the given criterion in a short time.

Figure 1: CPU time (s) versus tolerance values on AgMOR Data

Next, we generate 10 different random initial points and run all algorithms from each of them. The results are shown in Tab. I. Because none of the three algorithms compared with 2-STAGE can meet the stopping criterion in a short time, we set the maximum CPU time to 20 seconds. Tab. I presents the average, minimum and maximum CPU time, KKT violation $E$, and objective function value for each algorithm. It can be seen that our algorithm always attains the smallest KKT violation and the smallest objective function value. It has a great advantage in finding a high-precision solution within a short amount of time.

Algorithm cpu(s):avrg(min,max) E:avrg(min,max) f:avrg(min,max)
NeNMF 20.01(20.00,20.03) 6.64(1.72,16.32)e-2 5.37274(5.37242,5.37329)e+2
QRPBB 20.02(20.00,20.03) 2.19(1.48,2.94)e-3 5.37236(5.37213,5.37263)e+2
ANLS-BPP 20.02(20.02,20.05) 1.37(0.75,2.14)e-2 5.37236(5.37206,5.37264)e+2
2-STAGE 5.03(3.55,8.88) 3.58(1.24,6.30)e-7 5.37205(5.37205,5.37205)e+2
TABLE I: Experimental results on AgMOR data

V-C Yale face database

The Yale face database is widely used for face recognition tests [3]. It contains 165 grayscale images of 15 individuals, with 11 images per subject, one per facial expression or configuration: lighting (center-light, left-light and right-light), with/without glasses, and facial expressions (normal, happy, sad, sleepy, surprised, and wink). The original image size is $320\times 243$ pixels; to reduce the computational burden, the images were downsampled to $64\times 64$ pixels. Because face images also have some degree of continuity, we regard the data as streaming data to test our algorithm. The data size is $n=4096$, $m=165$.

We gradually increase $m$ to explore the influence of $m$ on the algorithms. We select the face images of the first $i$ individuals, $i=1,2,4,8,15$, i.e. $m=11,22,44,88,165$. In the following tests, we fix $k=3$. We limit the CPU time to a maximum of 60 seconds for each algorithm and generate 10 different random initial points for each sample. We see from Fig. 2, which shows examples with $m=44$ and $m=165$, that the first stage of 2-STAGE is not as efficient as the other algorithms. The main reason is that the continuity of the face data is not as strong as that of streaming data, so the active set changes frequently, resulting in a large computational cost. In Fig. 2 (top), the second stage of 2-STAGE, the interior point method, shows fast local convergence: its slope is significantly steeper than the others. However, in Fig. 2 (bottom), the slope of the second stage of 2-STAGE is similar to the others. The reason is that as $m$ increases, the computational cost of the interior point method grows faster than that of the other algorithms. Therefore, we see from Tab. II that, as anticipated, the larger $m$ is, the less efficient 2-STAGE is compared with the other algorithms.

Figure 2: Experimental tests on Yale face data, $m=44$ (top) and $m=165$ (bottom).
size(n,m,k) Algorithm cpu(s):avrg(min,max) E:avrg(min,max) f:avrg(min,max)
(4096,11,3) NeNMF 44.94(30.50,66.53) 125(9.77e-7,1254) 1.40186(1.40104,1.40918)e+7
QRPBB 19.10(12.05,40.28) 8.20(3.39,9.99)e-7 1.40511(1.40104,1.40918)e+7
ANLS-BPP 23.14(16.42,49.08) 9.80(9.50,9.98)e-7 1.40185(1.40104,1.40918)e+7
2-STAGE 13.30(6.83,24.33) 4.39(1.16,9.46)e-7 1.40520(1.40104,1.41008)e+7
(4096,22,3) NeNMF 36.36(7.28,65.63) 1.13(9.47e-7,11.34) 3.13573(3.12627,3.14786)e+7
QRPBB 6.30(2.41,17.23) 7.32(2.63,9.95)e-7 3.13324(3.12627,3.14785)e+7
ANLS-BPP 19.90(2.89,60.03) 0.79(9.48e-7,5.78) 3.13772(3.12627,3.14786)e+7
2-STAGE 9.81(5.45,18.28) 3.54(1.79,7.37)e-7 3.12893(3.12627,3.14785)e+7
(4096,44,3) NeNMF 63.76(61.03,67.06) 2.71(2.04,6.17)e-6 8.20694(8.20694,8.20694)e+7
QRPBB 6.32(4.97,7.06) 8.52(2.49,9.99)e-7 8.20694(8.20694,8.20694)e+7
ANLS-BPP 10.06(8.34,12.14) 9.85(9.44,9.99)e-7 8.20694(8.20694,8.20694)e+7
2-STAGE 16.05(11.08,31.14) 5.05(1.02,9.02)e-7 8.20694(8.20694,8.20694)e+7
(4096,88,3) NeNMF 62.86(61.05,64.81) 2.09(1.35,4.19)e-6 1.97873(1.97753,1.98962)e+8
QRPBB 4.97(3.14,10.58) 8.37(5.42,9.96)e-7 1.97945(1.97753,1.98962)e+7
ANLS-BPP 9.10(7.25,16.50) 4.53(1.32,9.81)e-7 1.97874(1.97753,1.98962)e+7
2-STAGE 14.76(7.86,43.48) 5.05(1.02,9.02)e-7 1.97753(1.97753,1.97753)e+7
(4096,165,3) NeNMF 63.99(60.17,67.20) 1.54(1.35,1.89)e-6 4.06305(4.06305,4.06305)e+8
QRPBB 5.91(5.30,6.78) 5.01(4.79,9.58)e-7 4.06305(4.06305,4.06305)e+8
ANLS-BPP 11.46(10.81,12.36) 9.60(9.40,9.85)e-7 4.06305(4.06305,4.06305)e+8
2-STAGE 29.78(16.58,61.78) 4.10(1.19,8.39)e-7 4.06305(4.06305,4.06305)e+8
TABLE II: Experimental results on Yale face database

V-D Synthetic data

In order to test more problems at different scales, we artificially synthesized data for testing.

We note that the focus of this paper is on streaming data. Chapman et al. [4] used the PCA method to study the AgMOR data and found that the data has three dominant components (singular values), which means that the original data is a rank-3 matrix plus noise. According to this characteristic, we construct the artificial data as follows. First, we decide the problem size $(n,m,k)$. Then, we generate a random matrix $X$ of size $n\times k$ whose elements are uniformly distributed on $[0,1]$. In the same way, we generate a random matrix $Y$ of size $k\times m$. Next, we compute $M=XY$ and add Gaussian noise, with expectation 0 and standard deviation 0.1, to each element of $M$. Finally, we change the negative elements in $M$ to zeros. A sketch of this construction is given below.
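```python
import numpy as np

def make_synthetic(n, m, k, sigma=0.1, seed=0):
    # M = XY plus elementwise Gaussian noise (mean 0, std sigma),
    # with negative entries set to zero. Function name and seed
    # handling are our conventions, not the paper's.
    rng = np.random.default_rng(seed)
    X = rng.random((n, k))          # uniform on [0, 1]
    Y = rng.random((k, m))
    M = X @ Y + sigma * rng.standard_normal((n, m))
    return np.maximum(M, 0.0)
```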

In this set of tests, each algorithm is limited to a maximum CPU time of 60 seconds, and we generate 10 different random initial points per sample. The results are shown in Tab. III. On average, 2-STAGE performs best of the four algorithms, followed by QRPBB; except for $m=50$, $k=6$, where ANLS-BPP exceeds QRPBB. In some cases, other algorithms may be better than 2-STAGE; for example, when $m=100$ and $k=6$, the minimum time of QRPBB is much less than the maximum time of 2-STAGE. Generally speaking, the performance of 2-STAGE in obtaining high-precision solutions truly stands out.

size(n,m,k) Algorithm cpu(s):avrg(min,max) E:avrg(min,max) f:avrg(min,max)
(2000,50,3) NeNMF 60.01(60.00,60.03) 1.74(0.70,3.57)e-5 4.61662(4.61662,4.61662)e+2
QRPBB 29.36(20.17,48.14) 8.46(5.60,9.80)e-7 4.61662(4.61662,4.61662)e+2
ANLS-BPP 58.48(47.89,60.05) 1.69(1.00,3.29)e-6 4.61662(4.61662,4.61662)e+2
2-STAGE 3.31(2.45,4.75) 3.07(1.16,5.86)e-7 4.61662(4.61662,4.61662)e+2
(2000,50,4) NeNMF 47.80(40.88,60.02) 1.19(1.00,2.31)e-6 4.56019(4.56019,4.56019)e+2
QRPBB 22.89(12.78,34.22) 7.30(2.67,9.99)e-7 4.56019(4.56019,4.56019)e+2
ANLS-BPP 28.03(23.44,32.61) 9.99(9.98,10.00)e-7 4.56019(4.56019,4.56019)e+2
2-STAGE 7.41(4.22,11.78) 3.25(1.09,8.91)e-7 4.56019(4.56019,4.56019)e+2
(2000,50,5) NeNMF 49.81(40.92,56.80) 9.99(9.98,10.00)e-7 4.45449(4.45449,4.45449)e+2
QRPBB 25.14(20.61,35.91) 5.99(1.80,8.86)e-7 4.45449(4.45449,4.45449)e+2
ANLS-BPP 24.78(21.67,28.31) 9.99(9.97,9.99)e-7 4.45449(4.45449,4.45449)e+2
2-STAGE 6.98(6.20,7.91) 3.93(1.69,8.31)e-7 4.45449(4.45449,4.45449)e+2
(2000,50,6) NeNMF 60.02(60.00,60.03) 5.18(0.38,13.11)e-3 4.36121(4.36121,4.36121)e+2
QRPBB 60.01(60.00,60.03) 2.19(0.02,8.31)e-4 4.36121(4.36121,4.36121)e+2
ANLS-BPP 56.07(20.38,60.05) 5.35(0.10,32.81)e-5 4.36309(4.36121,4.38005)e+2
2-STAGE 14.46(6.31,25.81) 4.74(1.18,7.77)e-7 4.36121(4.36121,4.36121)e+2
(2000,100,3) NeNMF 17.14(14.88,19.91) 9.96(9.93,10.00)e-7 9.57337(9.57337,9.57337)e+2
QRPBB 5.84(4.86,7.70) 7.81(1.76,9.98)e-7 9.57337(9.57337,9.57337)e+2
ANLS-BPP 15.10(14.27,15.97) 9.98(9.94,10.00)e-7 9.57337(9.57337,9.57337)e+2
2-STAGE 4.05(2.95,5.33) 4.71(1.00,9.92)e-7 9.57337(9.57337,9.57337)e+2
(2000,100,4) NeNMF 26.86(23.59,31.17) 9.98(9.97,10.00)e-7 9.54230(9.54230,9.54230)e+2
QRPBB 14.53(13.17,15.63) 5.75(2.37,8.99)e-7 9.54230(9.54230,9.54230)e+2
ANLS-BPP 24.60(22.36,26.70) 9.98(9.96,10.00)e-7 9.54230(9.54230,9.54230)e+2
2-STAGE 7.46(5.36,8.91) 4.82(1.01,9.44)e-7 9.54230(9.54230,9.54230)e+2
(2000,100,5) NeNMF 36.04(31.03,41.45) 9.98(9.97,10.00)e-7 9.44857(9.44857,9.44857)e+2
QRPBB 17.60(16.17,20.45) 7.02(4.22,9.12)e-7 9.44857(9.44857,9.44857)e+2
ANLS-BPP 27.54(25.70,29.58) 9.98(9.96,10.00)e-7 9.44857(9.44857,9.44857)e+2
2-STAGE 11.77(9.69,17.59) 4.59(1.27,8.06)e-7 9.44857(9.44857,9.44857)e+2
(2000,100,6) NeNMF 60.02(60.00,60.03) 7.43(0.03,28.52)e-3 9.35451(9.35451,9.35452)e+2
QRPBB 57.20(37.47,60.03) 2.54(0.07,12.20)e-5 9.35451(9.35451,9.35451)e+2
ANLS-BPP 60.03(60.02,60.05) 1.14(0.04,4.44)e-4 9.35451(9.35451,9.35451)e+2
2-STAGE 25.54(14.44,52.39) 3.05(1.04,9.33)e-7 9.35451(9.35451,9.35451)e+2
TABLE III: Experimental results on synthetic data

VI Conclusions

In this paper we focused on solving NMF for streaming data. We presented a fast two-stage algorithm, where the first stage is the ANLS framework with an active set method, which benefits from the continuity of streaming data, and the second stage is a line search interior point method, which benefits from $n\gg m\gg k$. In addition, we proved the global convergence of the proposed line search interior point method. The first stage reduces the value of the objective function rapidly, and the second stage converges to a local solution quickly due to its Newton-type direction. We tested the proposed algorithm on several real and synthetic data sets and observed that, compared with other algorithms, ours is more effective at finding high-precision local solutions.

The active set method in the first stage does not reach the expected speed, even when tested on continuous data. We think this may be caused by limitations of the underlying code implementation in MATLAB. On the other hand, we find that the transition between the two stages may induce instability: the solution of the active set method cannot be used directly as the initial guess of the interior point method, and the required modifications affect stability. At present, the parameters used to generate the starting point are selected carefully to avoid this instability. In the future, we will work to find a more stable transition technique.

Considering that, in addition to the basic NMF model, there are other variants of NMF, such as constrained NMFs and structured NMFs, our algorithm has the potential to be applied to more problems through extension. This will be our consideration in the future.

References

  • [1] S. Arora, R. Ge, R. Kannan, and A. Moitra, “Computing a nonnegative matrix factorization—provably,” SIAM Journal on Computing, vol. 45, no. 4, pp. 1582–1611, 2016.
  • [2] I. Buciu, “Non-negative matrix factorization, a new tool for feature extraction: theory and applications,” International Journal of Computers, Communications and Control, vol. 3, no. 3, pp. 67–74, 2008.
  • [3] D. Cai, X. He, J. Han, and H.-J. Zhang, “Orthogonal laplacianfaces for face recognition,” IEEE transactions on image processing, vol. 15, no. 11, pp. 3608–3614, 2006.
  • [4] K. W. Chapman, S. H. Lapidus, and P. J. Chupas, “Applications of principal component analysis to pair distribution function data,” Journal of Applied Crystallography, vol. 48, no. 6, pp. 1619–1626, 2015.
  • [5] A. Cichocki and A.-H. Phan, “Fast local algorithms for large scale nonnegative matrix and tensor factorizations,” IEICE transactions on fundamentals of electronics, communications and computer sciences, vol. 92, no. 3, pp. 708–721, 2009.
  • [6] A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari, Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation.   John Wiley & Sons, 2009.
  • [7] T. Egami and S. J. L. Billinge, Underneath the Bragg peaks: structural analysis of complex materials, 2nd ed.   Amsterdam: Elsevier, 2012. [Online]. Available: Link
  • [8] D. Goldfarb and S. Ma, “Convergence of fixed-point continuation algorithms for matrix rank minimization,” Foundations of Computational Mathematics, vol. 11, no. 2, pp. 183–210, 2011.
  • [9] P. Gong and C. Zhang, “Efficient nonnegative matrix factorization via projected newton method,” Pattern Recognition, vol. 45, no. 9, pp. 3557–3565, 2012.
  • [10] L. Grippo and M. Sciandrone, “On the convergence of the block nonlinear gauss–seidel method under convex constraints,” Operations research letters, vol. 26, no. 3, pp. 127–136, 2000.
  • [11] N. Guan, D. Tao, Z. Luo, and B. Yuan, “Nenmf: An optimal gradient method for nonnegative matrix factorization,” IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2882–2898, 2012.
  • [12] D. Hajinezhad, T.-H. Chang, X. Wang, Q. Shi, and M. Hong, “Nonnegative matrix factorization using admm: Algorithm and convergence analysis,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2016, pp. 4742–4746.
  • [13] Y. Huang, H. Liu, and S. Zhou, “Quadratic regularization projected barzilai–borwein method for nonnegative matrix factorization,” Data mining and knowledge discovery, vol. 29, no. 6, pp. 1665–1684, 2015.
  • [14] D. Kim, S. Sra, and I. S. Dhillon, “Fast newton-type methods for the least squares nonnegative matrix approximation problem,” in Proceedings of the 2007 SIAM international conference on data mining.   SIAM, 2007, pp. 343–354.
  • [15] H. Kim and H. Park, “Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method,” SIAM journal on matrix analysis and applications, vol. 30, no. 2, pp. 713–730, 2008.
  • [16] J. Kim and H. Park, “Fast nonnegative matrix factorization: An active-set-like method and comparisons,” SIAM Journal on Scientific Computing, vol. 33, no. 6, pp. 3261–3281, 2011.
  • [17] H. W. Kuhn and A. W. Tucker, “Nonlinear programming,” in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability.   Berkeley, Calif.: University of California Press, 1951, pp. 481–492. [Online]. Available: Link
  • [18] C. L. Lawson and R. J. Hanson, Solving least squares problems.   SIAM, 1995, vol. 15.
  • [19] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, p. 788, 1999.
  • [20] ——, “Algorithms for non-negative matrix factorization,” in Advances in neural information processing systems, 2001, pp. 556–562.
  • [21] L. Li and Y.-J. Zhang, “Fastnmf: highly efficient monotonic fixed-point nonnegative matrix factorization algorithm with good applicability,” Journal of Electronic Imaging, vol. 18, no. 3, p. 033004, 2009.
  • [22] C.-J. Lin, “Projected gradient methods for nonnegative matrix factorization,” Neural computation, vol. 19, no. 10, pp. 2756–2779, 2007.
  • [23] C.-H. Liu, C. J. Wright, R. Gu, S. Bandi, A. Wustrow, P. K. Todd, D. O’Nolan, M. L. Beauvais, J. R. Neilson, P. J. Chupas, K. W. Chapman, and S. J. L. Billinge, “Validation of non-negative matrix factorization for assessment of atomic pair-distribution function (pdf) data in a real-time streaming context,” J. Appl. Crystallogr., Jan 2020, arXiv:2010.11807 [cond-mat.mtrl-sci]. [Online]. Available: Link
  • [24] Y.-F. Liu, M. Hong, and E. Song, “Sample approximation-based deflation approaches for chance sinr-constrained joint power and admission control,” IEEE Transactions on Wireless Communications, vol. 15, no. 7, pp. 4535–4547, 2016.
  • [25] J. Nocedal and S. Wright, Numerical optimization.   Springer Science & Business Media, 2006.
  • [26] P. Paatero and U. Tapper, “Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
  • [27] S. Sra and I. S. Dhillon, Nonnegative matrix approximation: Algorithms and applications.   Computer Science Department, University of Texas at Austin, 2006.
  • [28] P. K. Todd, A. Wustrow, R. D. McAuliffe, M. J. McDermott, G. T. Tran, B. C. McBride, E. D. Boeding, D. O’Nolan, C.-H. Liu, S. S. Dwaraknath, K. W. Chapman, Simon J. L. Billinge, K. A. Persson, A. Huq, G. M. Veith, and J. R. Neilson, “Defect-accommodating intermediates yield selective low-temperature synthesis of YMnO3 polymorphs,” Inorg. Chem., vol. 59, p. 13639–13650, Aug 2020. [Online]. Available: Link
  • [29] S. A. Vavasis, “On the complexity of nonnegative matrix factorization,” SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2009.
  • [30] Y.-X. Wang and Y.-J. Zhang, “Nonnegative matrix factorization: A comprehensive review,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1336–1353, 2012.
  • [31] R. Zdunek and A. Cichocki, “Non-negative matrix factorization with quasi-newton optimization,” in International conference on artificial intelligence and soft computing.   Springer, 2006, pp. 870–879.
  • [32] H. Zhao, T. M. Nenoff, G. Jennings, P. J. Chupas, and K. W. Chapman, “Determining quantitative kinetics and the structural mechanism for particle growth in porous templates,” The Journal of Physical Chemistry Letters, vol. 2, no. 21, pp. 2742–2746, 2011.