
Adaptive Weighted Nonnegative Matrix Factorization for Robust Feature Representation

Tingting Shen Junhang Li Can Tong Qiang He Chen Li Yudong Yao Yueyang Teng [email protected] College of Medicine and Biological Information Engineering, Northeastern University, Shenyang 110169, China Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA
Abstract

Nonnegative matrix factorization (NMF) has been widely used for dimensionality reduction in machine learning. However, traditional NMF does not handle outliers properly, which makes it sensitive to noise. To improve the robustness of NMF, this paper proposes an adaptive weighted NMF that assigns a weight to each data point to reflect its importance, thereby decreasing the sensitivity of the algorithm to noisy data. This is very different from existing robust NMF methods, which rely on slowly growing similarity measures. Specifically, two strategies are proposed: a fuzzier weighted technique and an entropy weighted regularized technique, both of which lead to iterative solutions with a simple form. Experimental results on several real datasets with noise show that the new methods produce more robust feature representations than existing methods. Our code is available at https://github.com/WRNMF/RobustNMF.git.

keywords:
entropy regularizer, fuzzier, low-dimensional representation, nonnegative matrix factorization (NMF), robustness.
journal: Neurocomputing

1 Introduction

With the rapid development of computer science, we are accumulating more and more large-scale, high-dimensional data, which brings new challenges for data analysis. Analyzing such high-dimensional, high-throughput datasets directly is very difficult, so dimensionality reduction has become an essential step. Many methods address this problem, including vector quantization (VQ) [1], singular value decomposition (SVD) [2], principal component analysis (PCA) [3], independent component analysis (ICA) [4] and nonnegative matrix factorization (NMF) [5]. Among these methods, NMF is attracting more and more attention and has been widely used in many fields, such as pattern recognition [6], image processing [7] and data mining [8].

Traditional NMF decomposes a high-dimensional nonnegative matrix into the product of two low-dimensional nonnegative matrices, respectively called the base matrix and the representation matrix [5]. In recent decades, theoretical research on NMF and its practical applications have been carried out intensively, and many variants have been developed for different tasks. For example, Ding et al. [9] proposed SemiNMF, which relaxes the nonnegativity constraints on the original and base matrices to expand the application scope of NMF; they also restricted the base matrix to convex combinations of the data points, yielding the ConvexNMF method. Guillamet et al. [10] proposed weighted NMF (WNMF), which introduces a weight matrix to improve the ability of NMF to represent positive local data.

Most NMF variants use the square of the Euclidean distance or the Kullback-Leibler (KL) divergence to measure the similarity between the high-dimensional nonnegative matrix and the product of the two low-dimensional nonnegative matrices. However, these measures make the results sensitive to outliers. To improve the robustness of NMF, Xing et al. [11] designed the NMFL2 method, which replaces the squared Euclidean distance with the Euclidean distance so that the influence of outliers is decreased. Du et al. [12] adopted the Huber function to measure the approximation quality, exploiting its connection to the $L_{2}$-norm and the $L_{1}$-norm. As can be seen, these NMF methods rely on similarity measures that are insensitive to outliers to obtain a robust feature representation.

This paper proposes a totally different type of robust NMF, which assigns an adaptive weight to each data point to represent its outlier degree; naturally, an outlier should receive less attention than a normal point. This is implemented by two methods, a fuzzier weighted technique and an entropy weighted regularized technique: the former uses a power hyperparameter to control the weight distribution, while the latter uses an entropy term to regularize the weights. Both are solved by the Lagrange multiplier method, yielding two solutions with a simple form. Experiments on several datasets with synthetic noise show that the proposed methods consistently achieve better performance than existing methods.

2 Methodology

$\mathbf{Notation}$: In this paper, matrices are denoted by capital letters. For a matrix $A$, the $(i,j)$-th element is denoted $A_{ij}$; $A^{T}$ denotes the transpose of $A$; the Frobenius norm is written $\|\cdot\|_{F}$; the symbols $\odot$ and $\oslash$ denote element-wise multiplication and division of two matrices, respectively; $A\geq 0$ means that all elements of $A$ are greater than or equal to 0.

2.1 Related works

NMF decomposes a nonnegative matrix $X\in R^{M\times N}$ into the product of two low-dimensional nonnegative matrices: the base matrix $W\in R^{M\times L}$ and the representation matrix $H\in R^{L\times N}$, where $L\ll\min\{M,N\}$. It can be formulated as:

$X=WH$ (1)

As is well known, this equation is generally difficult to solve exactly and may even have no solution, so we turn to an optimization formulation to obtain the best approximation. There are many criteria to measure the similarity between $X$ and $WH$; in general, the square of the Frobenius norm or the Kullback-Leibler (KL) divergence is used. In this paper, the square of the Frobenius norm is adopted, and the formulation is:

$\min\ F_{1}(W,H)=\left\|X-WH\right\|_{F}^{2}$
$\mathrm{s.t.}\ W\geq 0,\ H\geq 0$ (2)

Although the objective function is a convex quadratic function in either $W$ or $H$ alone, it is not jointly convex in both. Thus, it is minimized alternately with respect to $W$ and $H$, where a constructed auxiliary function is the key to deriving the iterative update rules.

Definition.

(Auxiliary function) If the function $G(h,h^{\prime})$ satisfies the following conditions:

$G(h,h^{\prime})\geq F(h)$ (3)
$G(h^{\prime},h^{\prime})=F(h^{\prime})$ (4)

where $h^{\prime}$ is a given value; then, $G(h,h^{\prime})$ is an auxiliary function of $F(h)$ at $h^{\prime}$.

Then, we easily draw the following conclusion:

Lemma.

If $G(h,h^{\prime})$ is an auxiliary function of $F(h)$, then under the update rule:

$h^{*}=\arg\min_{h}G(h,h^{\prime})$ (5)

the function $F(h)$ does not increase.

Proof.

The conditions satisfied by the auxiliary function make the proof straightforward because:

$F(h^{*})\leq G(h^{*},h^{\prime})\leq G(h^{\prime},h^{\prime})=F(h^{\prime})$ (6)

It can be observed that the original function decreases whenever the auxiliary function is minimized. We now derive the update rules for NMF. Consider $W$ first, with $W^{t}>0$ and $H>0$ fixed. Let $\xi_{ijk}=W^{t}_{ik}H_{kj}/(W^{t}H)_{ij}$; clearly, $\xi_{ijk}\geq 0$ and $\sum_{k=1}^{L}\xi_{ijk}=1$. Then, the auxiliary function is:

$f_{1}(W,W^{t})=\sum_{i=1}^{M}\sum_{j=1}^{N}\sum_{k=1}^{L}\xi_{ijk}\left(X_{ij}-\frac{W_{ik}H_{kj}}{\xi_{ijk}}\right)^{2}$ (7)

This function is separable, and its minimization is equivalent to solving a set of univariate optimization problems. Taking the partial derivative of Eq. (7) and setting it to zero yields the following update rule:

$W\leftarrow W\odot(XH^{T})\oslash(WHH^{T})$ (8)

Similarly, the update rule for $H$ is obtained:

$H\leftarrow H\odot(W^{T}X)\oslash(W^{T}WH)$ (9)
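The multiplicative updates in Eqs. (8) and (9) translate directly into matrix operations. The following is a minimal NumPy sketch; the function name, random initialization, fixed iteration count and the small constant added to avoid division by zero are our own illustrative choices and are not part of the paper.

```python
import numpy as np

def nmf_multiplicative(X, L, n_iter=200, eps=1e-10, seed=0):
    """Minimal sketch of the classical multiplicative updates, Eqs. (8)-(9)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, L)) + eps   # random positive initialization
    H = rng.random((L, N)) + eps
    for _ in range(n_iter):
        # Eq. (8): W <- W .* (X H^T) ./ (W H H^T)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        # Eq. (9): H <- H .* (W^T X) ./ (W^T W H)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return W, H
```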

2.2 Proposed method

Different from the previous methods, which use similarity measures that grow slowly for outliers, we introduce an optimizable weight $Q_{j}$ to express the importance of each data point, so that an outlier receives less attention than a normal point:

$\min\ F_{2}(W,H,Q)=\sum_{i=1}^{M}\sum_{j=1}^{N}Q_{j}[X_{ij}-(WH)_{ij}]^{2}$
$\mathrm{s.t.}\ W\geq 0,\ H\geq 0,\ Q\geq 0,\ \sum_{j=1}^{N}Q_{j}=1$ (10)

We first consider the solution for $Q$ in an alternating optimization manner. Obviously, for fixed $W$ and $H$, the solution is $Q_{j}=1$ if $Z_{j}=\min\{Z_{1},Z_{2},...,Z_{N}\}$ and $Q_{j}=0$ otherwise, where $Z_{j}=\sum_{i=1}^{M}(X-WH)_{ij}^{2}$: since the objective is linear in $Q$ over the probability simplex, it is minimized by placing all the weight on the column with the smallest residual. The derivation is straightforward and analogous to that of the membership degrees in k-means [13]. This means that only one $Q_{j}$ equals 1 and the others are 0, i.e., only one data point is involved in NMF, which is clearly incompatible with real problems. In fact, we generally expect the weights to lie in the range $[0,1]$ rather than being exactly 0 or 1, which is more consistent with fuzzy reasoning. To address this issue, we propose the following two methods.

2.2.1 Fuzzier weighted robust NMF (FWRNMF)

We introduce a hyperparameter $p>1$, representing the fuzziness level, to smooth $Q_{j}$ so that the weights fall within $[0,1]$. The specific model is as follows:

$\min\ F_{3}(W,H,Q)=\sum_{i=1}^{M}\sum_{j=1}^{N}Q_{j}^{p}[X_{ij}-(WH)_{ij}]^{2}$
$\mathrm{s.t.}\ W\geq 0,\ H\geq 0,\ Q\geq 0,\ \sum_{j=1}^{N}Q_{j}=1$ (11)

We again minimize the objective function alternately with respect to $Q$, $W$ and $H$. First, $Q$ is solved by constructing the Lagrange function:

$L(Q,\lambda)=\sum_{i=1}^{M}\sum_{j=1}^{N}Q_{j}^{p}[X_{ij}-(WH)_{ij}]^{2}-\lambda\left(\sum_{j=1}^{N}Q_{j}-1\right)$ (12)

where $\lambda$ is the Lagrange multiplier.

By setting the gradients of Eq. (12) with respect to $\lambda$ and $Q_{j}$ to zero, we obtain the following system of equations:

$\frac{\partial L}{\partial\lambda}=\sum_{j=1}^{N}Q_{j}-1=0$ (13)
$\frac{\partial L}{\partial Q_{j}}=pQ_{j}^{p-1}\sum_{i=1}^{M}[X_{ij}-(WH)_{ij}]^{2}-\lambda=0$ (14)

From Eq. (14), we know that:

$Q_{j}=\sqrt[p-1]{\frac{\lambda}{p}}\ \sqrt[p-1]{\frac{1}{\sum_{i=1}^{M}[X_{ij}-(WH)_{ij}]^{2}}}$ (15)

Substituting Eq. (15) into Eq. (13), we have:

$\sqrt[p-1]{\frac{\lambda}{p}}=\frac{1}{\sum_{j=1}^{N}\sqrt[p-1]{\frac{1}{\sum_{i=1}^{M}[X_{ij}-(WH)_{ij}]^{2}}}}$ (16)

Substituting this back into Eq. (15), we obtain:

$Q_{j}=\frac{\sqrt[p-1]{\frac{1}{\sum_{i=1}^{M}[X_{ij}-(WH)_{ij}]^{2}}}}{\sum_{l=1}^{N}\sqrt[p-1]{\frac{1}{\sum_{i=1}^{M}[X_{il}-(WH)_{il}]^{2}}}}$ (17)

Then, we can solve for $W$ and $H$ with $Q$ fixed, similarly to traditional NMF. For example, we can construct the following auxiliary function for $W$:

$f_{2}(W,W^{t})=\sum_{i=1}^{M}\sum_{j=1}^{N}\sum_{k=1}^{L}Q_{j}^{p}\xi_{ijk}\left(X_{ij}-\frac{W_{ik}H_{kj}}{\xi_{ijk}}\right)^{2}$ (18)

Setting the partial derivative of $f_{2}(W,W^{t})$ to zero yields the following update rule:

$W\leftarrow W\odot(XQ^{p}H^{T})\oslash(WHQ^{p}H^{T})$ (19)

where $Q^{p}=\mathrm{diag}([Q_{1}^{p},Q_{2}^{p},...,Q_{N}^{p}])$.

Similarly, we can easily obtain the update rule for $H$:

$H\leftarrow H\odot(W^{T}X)\oslash(W^{T}WH)$ (20)

The update rules for $W$ and $H$ are similar to those of the existing WNMF methods in [14, 15]. The optimization of FWRNMF is summarized in Algorithm 1:

Algorithm 1 FWRNMF
1: Input: the nonnegative matrix $X$, the reduced dimension $L$ and the hyperparameter $p$;
2: Output: the weight matrix $Q$, the base matrix $W$ and the representation matrix $H$;
3: Randomly initialize $W>0$ and $H>0$;
4: while not converged do
5:     Update $Q$ by Eq. (17);
6:     Update $W$ by Eq. (19);
7:     Update $H$ by Eq. (20);
8: end while
9: return $Q$, $W$ and $H$.
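For concreteness, the following NumPy sketch illustrates Algorithm 1. It is not the authors' released MATLAB code; the function name, default hyperparameters, fixed iteration count and the small constant used to avoid division by zero are our own assumptions.

```python
import numpy as np

def fwrnmf(X, L, p=2.0, n_iter=200, eps=1e-10, seed=0):
    """Illustrative sketch of Algorithm 1 (FWRNMF); requires p > 1."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, L)) + eps
    H = rng.random((L, N)) + eps
    for _ in range(n_iter):
        # Column-wise residuals Z_j = sum_i (X - WH)_{ij}^2
        Z = np.sum((X - W @ H) ** 2, axis=0) + eps
        # Eq. (17): Q_j proportional to Z_j^{-1/(p-1)}, normalized to sum to 1
        Q = Z ** (-1.0 / (p - 1.0))
        Q /= Q.sum()
        Qp = Q ** p                      # diagonal entries of diag(Q^p)
        # Eq. (19): W <- W .* (X Q^p H^T) ./ (W H Q^p H^T)
        XQ = X * Qp                      # right-multiplication by diag(Q^p)
        WHQ = (W @ H) * Qp
        W *= (XQ @ H.T) / (WHQ @ H.T + eps)
        # Eq. (20): H <- H .* (W^T X) ./ (W^T W H)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return Q, W, H
```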

2.2.2 Entropy weighted robust NMF (EWRNMF)

We apply an information entropy term to regularize the objective function of NMF so that the weights lie in the range $[0,1]$ instead of being exactly 0 or 1. The information entropy reflects the uncertainty of the weights.

$\min\ F_{4}(W,H,Q)=\sum_{i=1}^{M}\sum_{j=1}^{N}Q_{j}[X_{ij}-(WH)_{ij}]^{2}+\gamma\sum_{j=1}^{N}Q_{j}\ln(Q_{j})$
$\mathrm{s.t.}\ W\geq 0,\ H\geq 0,\ Q\geq 0,\ \sum_{j=1}^{N}Q_{j}=1$ (21)

where $\gamma>0$ is a given hyperparameter that smooths the weights. The first term of the objective function is the weighted sum of errors, and the second term is the negative entropy of the weights. The original objective function in Eq. (10) results in only one data point being involved in the feature representation, whereas the entropy regularizer encourages more points to participate.

We construct the Lagrange function of Eq. (21) with respect to $Q$ as:

$L(Q,\lambda)=\sum_{i=1}^{M}\sum_{j=1}^{N}Q_{j}[X_{ij}-(WH)_{ij}]^{2}+\gamma\sum_{j=1}^{N}Q_{j}\ln(Q_{j})-\lambda\left(\sum_{j=1}^{N}Q_{j}-1\right)$ (22)

where $\lambda$ is again the Lagrange multiplier.

By setting the gradients of Eq. (22) with respect to $\lambda$ and $Q_{j}$ to zero, we obtain the following system of equations:

$\frac{\partial L}{\partial\lambda}=\sum_{j=1}^{N}Q_{j}-1=0$ (23)
$\frac{\partial L}{\partial Q_{j}}=\sum_{i=1}^{M}[X_{ij}-(WH)_{ij}]^{2}+\gamma\ln Q_{j}+\gamma-\lambda=0$ (24)

From Eq. (24), we know that:

$Q_{j}=e^{\frac{\lambda-\gamma}{\gamma}}\,e^{-\frac{\sum_{i=1}^{M}[X_{ij}-(WH)_{ij}]^{2}}{\gamma}}$ (25)

Substituting Eq. (25) into Eq. (23), we have:

$e^{\frac{\lambda-\gamma}{\gamma}}=\frac{1}{\sum_{j=1}^{N}e^{-\frac{\sum_{i=1}^{M}[X_{ij}-(WH)_{ij}]^{2}}{\gamma}}}$ (26)

Substituting this expression into Eq. (25), we find that:

$Q_{j}=\frac{e^{-\frac{\sum_{i=1}^{M}[X_{ij}-(WH)_{ij}]^{2}}{\gamma}}}{\sum_{l=1}^{N}e^{-\frac{\sum_{i=1}^{M}[X_{il}-(WH)_{il}]^{2}}{\gamma}}}$ (27)

Then, we can solve for $W$ and $H$ with $Q$ fixed, similarly to FWRNMF, where $Q$ in the update rules denotes $\mathrm{diag}([Q_{1},Q_{2},...,Q_{N}])$:

$W\leftarrow W\odot(XQH^{T})\oslash(WHQH^{T})$ (28)
$H\leftarrow H\odot(W^{T}X)\oslash(W^{T}WH)$ (29)

EWRNMF is summarized in Algorithm 2:

Algorithm 2 EWRNMF
1: Input: the nonnegative matrix $X$, the reduced dimension $L$ and the hyperparameter $\gamma$;
2: Output: the weight matrix $Q$, the base matrix $W$ and the representation matrix $H$;
3: Randomly initialize $W>0$ and $H>0$;
4: while not converged do
5:     Update $Q$ by Eq. (27);
6:     Update $W$ by Eq. (28);
7:     Update $H$ by Eq. (29);
8: end while
9: return $Q$, $W$ and $H$.
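The entropy weights in Eq. (27) form a softmax over the negative column residuals. The sketch below is illustrative (the helper name and the stabilization shift are our own choices); $W$ and $H$ are then updated by Eqs. (28) and (29) exactly as in the FWRNMF sketch, with $\mathrm{diag}(Q)$ in place of $\mathrm{diag}(Q^{p})$.

```python
import numpy as np

def ewrnmf_weights(X, W, H, gamma):
    """Illustrative computation of the entropy-regularized weights, Eq. (27).

    Subtracting min(Z) before exponentiating improves numerical stability;
    the shift cancels in the normalization (a softmax over -Z/gamma).
    """
    Z = np.sum((X - W @ H) ** 2, axis=0)   # column-wise residuals
    E = np.exp(-(Z - Z.min()) / gamma)     # stabilized exponentials
    return E / E.sum()                     # weights Q_j sum to 1
```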

2.3 Extensions

Obviously, the proposed techniques are independent of the particular measure between $X$ and $WH$, so they can easily be extended to the KL-divergence, the $\alpha$-divergence, Orthogonal NMF, ConvexNMF, etc., which demonstrates their good compatibility. We omit the derivations since they are very similar to those above.

3 Experiments

3.1 Experimental description

Experiments were performed on an HP Compaq PC with a 2.90-GHz Core i7-10700 CPU and 16 GB of memory, and all the methods were implemented in MATLAB. We compared the performance of the two proposed methods with traditional NMF (denoted EucNMF), ConvexNMF, NMFL2 and HuberRNMF on nine public datasets: Yale, ORL, GTFD, Winequality-red, Page-blocks, Balance, Thyroid, WDBC and Dimdata. The important details of these datasets are shown in Table 1. To assess the robustness of the methods, we add Gaussian noise to the data as follows:

$X_{ij}=X_{ij}+cN(0,X_{ij})$ (30)

where $c$ controls the noise level and $N(0,X_{ij})$ follows a Gaussian distribution with mean 0 and variance $X_{ij}$. This means that the noise is related to the scale of the data, with larger entries receiving stronger noise, which is reasonable in practical applications.
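As an illustration (not the authors' code), the noise model of Eq. (30) can be simulated as below. Clipping negative values to zero is our own assumption to keep the corrupted matrix nonnegative, since Eq. (30) does not state how negative entries are handled.

```python
import numpy as np

def add_scaled_gaussian_noise(X, c, seed=0):
    """Sketch of the noise model in Eq. (30): variance of the noise equals X_ij."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=np.sqrt(X), size=X.shape)  # std = sqrt(X_ij)
    Xn = X + c * noise
    return np.maximum(Xn, 0.0)  # assumption: clip negatives to keep X nonnegative
```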

Table 1: Description of the datasets.
Dataset Samples Dimensions Classes
Yale [14] 165 $32\times 32$ 15
ORL [15] 400 $32\times 32$ 40
GTFD^1 750 $32\times 32$ 50
Winequality-red^2 4898 12 11
Page-blocks^2 5473 10 5
Balance^2 625 4 3
Thyroid^2 7200 21 3
WDBC^2 569 32 2
Dimdata^3 4192 14 2
1. ftp://ftp.ee.gatech.edu/pub/users/hayes/facedb/
2. http://archive.ics.uci.edu/ml/datasets/
3. http://pages.cs.wisc.edu/~olvi/

After obtaining the new feature representations, we cluster them with k-means and then evaluate the clustering results. Clustering accuracy (ACC) [16], [17] and normalized mutual information (NMI) [18], [19] are used to evaluate the clustering performance.
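With the column-per-sample convention used in Section 2, this clustering step can be sketched as follows; the snippet assumes scikit-learn is available and is illustrative rather than the authors' MATLAB pipeline.

```python
from sklearn.cluster import KMeans

def cluster_representation(H, n_clusters, seed=0):
    """Cluster the columns of H (one column per sample) with k-means."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(H.T)   # transpose so that samples are rows
```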

Given the set of ground-truth class labels $y$ and the obtained cluster labels $y^{\prime}$, the clustering accuracy is defined as:

$ACC=\frac{\sum_{i=1}^{N}\delta(y_{i},map(y^{\prime}_{i}))}{N}$ (31)

where:

$\delta(a,b)=\begin{cases}1,&a=b\\0,&\text{otherwise}\end{cases}$

and $map(\cdot)$ is a permutation mapping function that maps the obtained cluster labels to the real labels. The higher the ACC value, the better the clustering performance.
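The mapping $map(\cdot)$ is commonly obtained with the Hungarian algorithm; the sketch below uses SciPy's linear_sum_assignment and is an illustrative implementation rather than the one used by the authors.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC as in Eq. (31), with map(.) found by the Hungarian algorithm."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # Contingency table: rows = predicted clusters, columns = true classes
    counts = np.zeros((clusters.size, classes.size), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            counts[i, j] = np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(-counts)   # maximize matched counts
    return counts[row, col].sum() / y_true.size
```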

NMI is used to calculate the agreement between the two data distributions and is defined as follows:

$NMI(y,y^{\prime})=\frac{MI(y,y^{\prime})}{\max(H(y),H(y^{\prime}))}$ (32)

where $H(y)$ is the entropy of $y$. $MI(y,y^{\prime})$ quantifies the amount of information shared between the two random variables $y$ and $y^{\prime}$ and is defined as:

$MI(y,y^{\prime})=\sum_{y_{i}\in y,\,y^{\prime}_{j}\in y^{\prime}}p(y_{i},y^{\prime}_{j})\log\left(\frac{p(y_{i},y^{\prime}_{j})}{p(y_{i})p(y^{\prime}_{j})}\right)$ (33)

where $p(y_{i})$ and $p(y^{\prime}_{j})$ are the probabilities that a data point selected from the dataset belongs to cluster $y_{i}$ and cluster $y^{\prime}_{j}$, respectively, and $p(y_{i},y^{\prime}_{j})$ is the joint probability that an arbitrarily selected data point belongs to both clusters concurrently. NMI ranges from 0 to 1, and a larger NMI indicates better clustering performance.
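For the max normalization in Eq. (32), NMI can be computed, for example, with scikit-learn as sketched below; this is an illustrative choice and not necessarily the implementation used in the paper.

```python
from sklearn.metrics import normalized_mutual_info_score

def nmi_max(y_true, y_pred):
    """NMI with max(H(y), H(y')) normalization, matching Eq. (32)."""
    return normalized_mutual_info_score(y_true, y_pred, average_method="max")
```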

Because NMF does not have a unique solution, we performed 10 random initializations and report the averaged ACC and NMI for a credible comparison. We also set the reduced dimensionality equal to the number of clusters, which is a usual practice.

3.2 Clustering performance

We first compare all the methods on all the datasets, where the hyperparameters of HuberRNMF, FWRNMF and EWRNMF are selected optimally from $cutoff\in\{10^{i},i=-4,-3,...,3,4\}$, $p\in\{1.5,2,...,10.5,11\}$ and $\gamma\in\{10^{i},i=-4,-3,...,3,4\}$, respectively. The noise level is set to $c=0.05$; we use this noise level in all following experiments unless otherwise specified. The same optimal hyperparameter strategy is also used in the following experiments unless otherwise stated. The clustering results are shown in Tables 2 and 3. Either FWRNMF or EWRNMF always exhibits better performance than the other methods, while NMFL2 and HuberRNMF are generally better than traditional NMF and ConvexNMF. This demonstrates the feasibility and effectiveness of the proposed methods. In general, neither FWRNMF nor EWRNMF establishes total supremacy over the other.

Table 2: ACC on the datasets (%).
Dataset EucNMF ConvexNMF NMFL2 HuberRNMF FWRNMF EWRNMF
Yale 39.82 30.76 40.91 41.06 41.91 41.61
ORL 66.07 26.06 66.35 67.99 69.02 67.92
GTFD 62.56 49.32 63.76 64.73 66.85 64.17
Winequality-red 30.29 28.99 29.66 30.44 42.40 42.40
Page-blocks 34.17 34.99 38.71 36.67 89.71 89.71
Balance 51.70 51.34 50.78 51.93 51.66 54.55
Thyroid 54.04 55.83 54.65 54.19 92.56 92.56
WDBC 87.69 83.68 87.70 88.58 89.01 89.69
Dimdata 52.45 54.26 52.32 52.45 69.74 67.42
Table 3: NMI on the datasets (%).
Dataset EucNMF ConvexNMF NMFL2 HuberRNMF FWRNMF EWRNMF
Yale 45.23 38.04 46.20 46.31 47.05 47.15
ORL 83.48 50.87 83.66 84.40 84.84 84.34
GTFD 85.48 73.93 85.60 86.05 86.71 85.90
Winequality-red 10.56 8.47 10.53 10.63 9.88 10.79
Page-blocks 12.16 6.78 13.27 12.57 15.33 12.55
Balance 10.78 13.01 10.11 11.73 12.84 17.48
Thyroid 1.60 1.55 1.76 1.61 3.61 2.41
WDBC 51.18 34.76 51.20 54.32 54.54 54.57
Dimdata 0.18 0.84 0.16 0.18 16.54 14.33

3.3 Hyperparameter selection

We examine the ability of FWRNMF and EWRNMF to improve the performance of traditional NMF on the Yale and WDBC datasets for dimensionality reduction and k-means clustering, with the hyperparameters of FWRNMF and EWRNMF taken from the grids given above. As shown in Figures 1 and 2, regardless of the value of the hyperparameter $p$ or $\gamma$, the proposed methods provide better performance than traditional NMF and achieve consistently superior clustering results over a wide range of hyperparameters.

Figure 1: Clustering performance versus the hyperparameter $p$ on the Yale dataset: (a) ACC and (b) NMI.
Figure 2: Clustering performance versus the hyperparameter $\gamma$ (the abscissa is the value of $i$ in $\gamma=10^{i}$, $i=-4,-3,...,3,4$) on the WDBC dataset: (a) ACC and (b) NMI.

3.4 Clustering results versus noise level

We investigate the robustness of the proposed methods with respect to different noise levels on two datasets, GTFD and Balance, adding Gaussian noise to each dataset by Eq. (30) with the noise level $c$ ranging from 0.02 to 0.1. The results are shown in Figures 3 and 4. The advantage of FWRNMF is particularly prominent on GTFD, while EWRNMF performs well on Balance. Furthermore, regardless of the dataset, one of the two proposed methods always obtains the best results, which shows strong robustness.

Figure 3: Clustering performance versus noise level on the GTFD dataset: (a) ACC and (b) NMI.
Figure 4: Clustering performance versus noise level on the Balance dataset: (a) ACC and (b) NMI.

3.5 Clustering results versus cluster number

We study the relationship between the evaluation measures and the cluster number on the Yale and GTFD datasets, with the cluster number ranging from 2 to 10. The experimental procedure is as follows:
1) Randomly select k categories as a subset for the experiment.
2) Randomly initialize $W$ and $H$, obtain the new representations, and cluster them by k-means; hyperparameters are selected as described above.
3) Repeat steps 1) and 2) 10 times to obtain an average result.

It can be seen in Figures 5 and 6 that one of our methods is always better than the competing methods. In detail, EWRNMF is the best on the Yale dataset and FWRNMF is the best on the GTFD dataset; however, neither of the proposed methods is always the best.

Figure 5: Clustering performance versus cluster number on the Yale dataset: (a) ACC and (b) NMI.
Figure 6: Clustering performance versus cluster number on the GTFD dataset: (a) ACC and (b) NMI.

3.6 Visualization

We now visualize the results of the NMF methods for a qualitative comparison. The Dimdata dataset is selected, and we reduce the original data to two dimensions since it contains two clusters. The separability of the new representation ($H$) can be directly observed in Figure 7. It is easy to see by eye that the new representations obtained by FWRNMF and EWRNMF have good separability for a center-distance-based clustering method such as k-means. This explains why they obtain excellent results in Tables 2 and 3 in terms of both ACC and NMI.

Figure 7: Visualization of different methods on the Dimdata dataset which are obtained by (a) EucNMF, (b) ConvexNMF, (c) NMFL2, (d) HuberRNMF, (e) FWRNMF and (f) EWRNMF.
Figure 8: Objective function on the ORL dataset which are obtained by (a) EucNMF, (b) ConvexNMF, (c) NMFL2, (d) HuberRNMF, (e) FWRNMF and (f) EWRNMF.

3.7 Convergence speed and part-based learning

From the theoretical analysis, we know that the objective functions of the proposed methods are monotonically decreasing, but due to the nonconvexity of the objective functions, strict convergence cannot be guaranteed, which is a common characteristic of NMF methods [20]. Thus, we investigate the convergence speed of the NMF methods. Figure 8 shows the objective functions on the ORL dataset obtained by EucNMF, ConvexNMF, NMFL2, HuberRNMF, FWRNMF and EWRNMF. The convergence of FWRNMF is relatively slow, whereas EWRNMF shows excellent performance; nevertheless, both methods essentially converge within 200 iterations. We also show the base images in Figure 9. As can be seen, ConvexNMF identifies more global faces, while the other methods present similar sparse base images that are difficult to distinguish from each other.

Figure 9: Some base images with respect to the ORL dataset which are obtained by (a) EucNMF, (b) ConvexNMF, (c) NMFL2, (d) HuberRNMF, (e) FWRNMF and (f) EWRNMF.

4 Conclusion

This paper proposes two new methods for obtaining robust feature representations in the NMF framework by introducing two types of adaptive weights for each sample. This is very different from existing robust NMF methods, which use slowly growing similarity measures. The adaptive weights can automatically distinguish outliers from normal points and assign different importance according to the outlier degree. The experimental results show that the proposed methods are flexible and effective.

However, both new methods require an additional hyperparameter to smooth the weights. In the future, we will develop an automatic method to determine the hyperparameters.

Acknowledgement

This work was supported by the Fundamental Research Funds for the Central Universities of China (N180719020).

References

  • Gersho and Gray [2012] A. Gersho, R. M. Gray, Vector Quantization and Signal Compression, volume 159, Springer Science and Business Media, 2012.
  • Hu et al. [2017] C. Hu, X. Lu, M. Ye, W. Zeng, Singular value decomposition and local near neighbors for face recognition under varying illumination, Pattern Recognition 64 (2017) 60–83.
  • Rose [2010] W. Rose, Principal component analysis, Wiley Interdisciplinary Reviews: Computational Statistics 24 (2010) 433–459.
  • Hyvärinen and Oja [2000] A. Hyvärinen, E. Oja, Independent component analysis: algorithms and applications, Neural Networks 13 (2000) 411–430.
  • Lee and Seung [1999] D. D. Lee, H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.
  • Gong et al. [2014] C. Gong, D. Tao, K. Fu, J. Yang, Fick’s law assisted propagation for semisupervised learning, IEEE Transactions on Neural Networks and Learning Systems 26 (2014) 2148–2162.
  • Li et al. [2018] Z. Li, J. Tang, X. He, Robust structured nonnegative matrix factorization for image representation, IEEE Transactions on Neural Networks and Learning Systems 29 (2018) 1947–1960.
  • Tepper and Sapiro [2016] M. Tepper, G. Sapiro, Compressed nonnegative matrix factorization is fast and accurate, IEEE Transactions on Signal Processing 64 (2016) 2269–2283.
  • Ding et al. [2008] C. H. Ding, T. Li, M. I. Jordan, Convex and semi-nonnegative matrix factorizations, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2008) 45–55.
  • Guillamet et al. [2003] D. Guillamet, J. Vitria, B. Schiele, Introducing a weighted non-negative matrix factorization for image classification, Pattern Recognition Letters 24 (2003) 2447–2454.
  • Xing et al. [2018] L. Xing, H. Dong, W. Jiang, K. Tang, Nonnegative matrix factorization by joint locality-constrained and $\ell_{2,1}$-norm regularization, Multimedia Tools and Applications 77 (2018) 3029–3048.
  • Du et al. [2012] L. Du, X. Li, Y. D. Shen, Robust nonnegative matrix factorization via half-quadratic minimization, in: 2012 IEEE 12th International Conference on Data Mining, IEEE, 2012, pp. 201–210.
  • MacQueen [1967] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, Oakland, CA, USA, 1967, pp. 281–297.
  • Belhumeur et al. [1997] P. N. Belhumeur, J. P. Hespanha, D. J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 711–720.
  • Kuang et al. [2015] D. Kuang, S. Yun, H. Park, Symnmf: nonnegative low-rank approximation of a similarity matrix for graph clustering, Journal of Global Optimization 62 (2015) 545–574.
  • Liu et al. [2011] H. Liu, Z. Wu, X. Li, D. Cai, T. S. Huang, Constrained nonnegative matrix factorization for image representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2011) 1299–1311.
  • Plummer and Lovász [1986] M. D. Plummer, L. Lovász, Matching Theory, Elsevier, 1986.
  • Cai et al. [2005] D. Cai, X. He, J. Han, Document clustering using locality preserving indexing, IEEE Transactions on Knowledge and Data Engineering 17 (2005) 1624–1637.
  • Shahnaz et al. [2006] F. Shahnaz, M. W. Berry, V. P. Pauca, R. J. Plemmons, Document clustering using nonnegative matrix factorization, Information Processing and Management 42 (2006) 373–386.
  • Lin [2007] C. J. Lin, On the convergence of multiplicative update algorithms for nonnegative matrix factorization, IEEE Transactions on Neural Networks 18 (2007) 1589–1596.