Adaptive Weighted Nonnegative Matrix Factorization for Robust Feature Representation
Abstract
Nonnegative matrix factorization (NMF) has been widely used for dimensionality reduction in machine learning. However, traditional NMF does not properly handle outliers and is therefore sensitive to noise. In order to improve the robustness of NMF, this paper proposes an adaptive weighted NMF, which introduces weights to emphasize the different importance of each data point and thus decreases the algorithm's sensitivity to noisy data. It is very different from existing robust NMF methods, which rely on similarity measures that grow slowly for large residuals. Specifically, two strategies are proposed to achieve this: a fuzzier weighted technique and an entropy weighted regularized technique, and both lead to iterative solutions with a simple form. Experimental results on several real datasets with added noise show that the new methods provide more robust feature representations than existing methods. We make our code available at https://github.com/WRNMF/RobustNMF.git.
Keywords: entropy regularizer, fuzzier, low-dimensional representation, nonnegative matrix factorization (NMF), robustness

1 Introduction
With the rapid development of computer science, we are accumulating more and more large-scale and high-dimensional data, which brings new challenges for data analysis. It is very difficult to analyze many high-dimensional and high-throughput datasets directly. Consequently, dimensionality reduction has become an essential step. There are many methods for this problem, including vector quantization (VQ) [1], singular value decomposition (SVD) [2], principal component analysis (PCA) [3], independent component analysis (ICA) [4] and nonnegative matrix factorization (NMF) [5], etc. Among these methods, NMF is attracting more and more attention and has been widely used in many fields, such as pattern recognition [6], image processing [7] and data mining [8].
The traditional NMF decomposes a high-dimensional nonnegative matrix into the product of two low-dimensional nonnegative matrices, respectively called the base and the representation [5]. In recent decades, theoretical research on NMF and its practical applications have been carried out intensively, and many variants have been developed for different tasks. For example, Ding et al. [9] proposed SemiNMF, which relaxes the nonnegativity constraints on the original and base matrices to expand the application scope of NMF. They also restricted the base matrix to convex combinations of the data points, which leads to the ConvexNMF method. Guillamet et al. [10] proposed the weighted NMF method (WNMF), which introduces a weight matrix to improve the ability of NMF to represent positive local data.
Most NMF variants use the square of the Euclidean distance or the Kullback-Leibler (KL) divergence to measure the similarity between the high-dimensional nonnegative matrix and the product of the two low-dimensional nonnegative matrices. However, both measures are sensitive to outliers. In order to improve the robustness of NMF, Xing et al. [11] designed the NMFL2 method, which uses the Euclidean distance in place of the squared Euclidean distance so that the influence of outliers is decreased. Du et al. [12] adopted the Huber function to measure the quality of the approximation, exploiting its connection to the $\ell_1$-norm and the $\ell_2$-norm. As can be seen, these NMF methods rely on similarity measures that are insensitive to outliers to obtain a robust feature representation.
This paper proposes a different type of robust NMF, which assigns an adaptive weight to each data point to represent its outlier degree; certainly, an outlier should receive less attention than a normal point. The idea is implemented by two methods, a fuzzier weighted technique and an entropy weighted regularized technique, in which one utilizes a power hyperparameter to control the weight distribution and the other uses an entropy term to regularize the weights. Both models are solved by the Lagrange multiplier method, yielding solutions with a simple form. Experiments on several datasets with synthetic noise show that the proposed methods consistently achieve better performance than several existing methods.
2 Methodology
Notations: In this paper, matrices are denoted by capital letters. For a matrix $A$, the $(i,j)$-th element is indicated as $A_{ij}$; $A^{T}$ denotes the transpose of $A$; the Frobenius norm is represented as $\|A\|_F$; the symbols $\odot$ and $\oslash$ mean the item-by-item multiplication and division of two matrices, respectively; $A \geq 0$ means that all elements of $A$ are equal to or larger than 0.
2.1 Related works
NMF decomposes a nonnegative matrix $X \in \mathbb{R}_{+}^{d \times n}$ into the product of two low-dimensional nonnegative matrices: one is the base matrix $W \in \mathbb{R}_{+}^{d \times r}$ and the other is the representation matrix $H \in \mathbb{R}_{+}^{r \times n}$, where $r \ll \min(d, n)$. It can be formulated as:

$$X \approx WH, \quad \text{s.t. } W \geq 0, \; H \geq 0. \qquad (1)$$
As is well known, an exact factorization is difficult to obtain and may not even exist, so we turn to an optimization formulation to achieve the best approximation. There are many criteria to measure the similarity between $X$ and $WH$; in general, the square of the Frobenius norm and the Kullback-Leibler (KL) divergence are used. In this paper, the square of the Frobenius norm is adopted, and the formulation is:

$$\min_{W \geq 0,\, H \geq 0} \|X - WH\|_F^2. \qquad (2)$$
Although the objective function is a convex quadratic function in either $W$ or $H$ alone, it is not jointly convex in both. Thus, it is minimized alternately with respect to $W$ and $H$, in which a constructed auxiliary function plays the key role in deriving the iterative update rules.
Definition.
(Auxiliary function) If the function $G(h, h')$ satisfies the following conditions:
$$G(h, h') \geq F(h), \qquad (3)$$
$$G(h, h) = F(h), \qquad (4)$$
where $h'$ is a given value, then $G(h, h')$ is the auxiliary function of $F(h)$ at $h'$.
Then, we easily draw the following conclusion:
Lemma.
If $G(h, h')$ is an auxiliary function of $F(h)$, then under the update rule:

$$h^{(t+1)} = \arg\min_{h} G\big(h, h^{(t)}\big), \qquad (5)$$

the function $F$ does not increase.
Proof.
The conditions satisfied by the auxiliary function make this proof immediate, because:

$$F\big(h^{(t+1)}\big) \leq G\big(h^{(t+1)}, h^{(t)}\big) \leq G\big(h^{(t)}, h^{(t)}\big) = F\big(h^{(t)}\big). \qquad (6)$$
∎
It can be observed that the original function decreases whenever the auxiliary function is minimized. Now we derive the update rules for NMF. Consider $H$ first, where $X$ and $W$ are fixed. The objective in Eq. (2) separates over the columns of $H$; for a column $h$ of $H$ and the corresponding column $x$ of $X$, let $F(h) = \|x - Wh\|_2^2$; certainly, $\nabla F(h) = 2(W^{T}Wh - W^{T}x)$ and $\nabla^{2} F(h) = 2W^{T}W$. Therefore, the auxiliary function is as below:

$$G\big(h, h^{(t)}\big) = F\big(h^{(t)}\big) + \big(h - h^{(t)}\big)^{T} \nabla F\big(h^{(t)}\big) + \big(h - h^{(t)}\big)^{T} K\big(h^{(t)}\big) \big(h - h^{(t)}\big), \qquad (7)$$

where $K\big(h^{(t)}\big)$ is the diagonal matrix with entries $K_{aa}\big(h^{(t)}\big) = \big(W^{T}Wh^{(t)}\big)_a / h^{(t)}_a$.
The function $G$ is separable, and its minimization is equivalent to solving a set of univariate optimization problems. Taking the partial derivative of Eq. (7) and setting it to zero yields the following update rule:

$$H_{ij} \leftarrow H_{ij} \frac{\big(W^{T}X\big)_{ij}}{\big(W^{T}WH\big)_{ij}}. \qquad (8)$$
Similarly, the update of $W$ is obtained:

$$W_{ij} \leftarrow W_{ij} \frac{\big(XH^{T}\big)_{ij}}{\big(WHH^{T}\big)_{ij}}. \qquad (9)$$
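As a concrete illustration, the alternating multiplicative updates of Eqs. (8) and (9) can be sketched in a few lines of NumPy. This is a minimal sketch under our own assumptions (random initialization, a fixed iteration count, and a small constant to avoid division by zero), not the authors' implementation, whose experiments were run in MATLAB.

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-10, seed=0):
    """Minimal NMF with Frobenius loss and the multiplicative updates of Eqs. (8)-(9)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W = rng.random((d, r))               # base matrix, d x r
    H = rng.random((r, n))               # representation matrix, r x n
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)      # Eq. (8)
        W *= (X @ H.T) / (W @ H @ H.T + eps)      # Eq. (9)
    return W, H
```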
2.2 Proposed method
Different from the previous methods that use a similarity measure with slow growth for outliers, we introduce an optimizable weight $s_j$ to express the importance of each data point $x_j$, so that an outlier receives less attention than a normal point:

$$\min_{W, H, s} \sum_{j=1}^{n} s_j \|x_j - Wh_j\|_2^2, \quad \text{s.t. } W \geq 0,\; H \geq 0,\; \sum_{j=1}^{n} s_j = 1,\; s_j \geq 0. \qquad (10)$$

We first consider the solution to $s$ in an alternating optimization manner. Obviously, for fixed $W$ and $H$, the solution is $s_j = 1$ if $j = \arg\min_k e_k$ and $s_j = 0$ otherwise, where $e_j = \|x_j - Wh_j\|_2^2$. The derivation is very easy and is similar to that of the membership degrees in k-means [13]. It demonstrates the simple fact that only one $s_j$ equals 1 and the others are 0, i.e., only a single data point is involved in NMF. Such a phenomenon is certainly incompatible with real problems. In fact, we would rather have weights in the range $[0, 1]$ instead of exactly 0 or 1, which is more consistent with a fuzzy-logic way of thinking. To address this issue, we propose two methods as follows.
2.2.1 Fuzzier weighted robust NMF (FWRNMF)
We introduce a hyperparameter $m > 1$, representing the fuzzy level, to smooth the weights so that they fall within $[0, 1]$. The specific model is as follows:

$$\min_{W, H, s} \sum_{j=1}^{n} s_j^{m} \|x_j - Wh_j\|_2^2, \quad \text{s.t. } W \geq 0,\; H \geq 0,\; \sum_{j=1}^{n} s_j = 1,\; s_j \geq 0. \qquad (11)$$
We still alternately minimize the objective function with respect to $s$, $W$ and $H$. First, $s$ is solved. We construct the Lagrange function as:

$$L(s, \lambda) = \sum_{j=1}^{n} s_j^{m} e_j + \lambda \Big( \sum_{j=1}^{n} s_j - 1 \Big), \qquad (12)$$

where $e_j = \|x_j - Wh_j\|_2^2$ and $\lambda$ is the Lagrange multiplier.
By setting the gradients of Eq. (12) with respect to $\lambda$ and $s_j$ to zero, we obtain the following system of equations:

$$\frac{\partial L}{\partial \lambda} = \sum_{j=1}^{n} s_j - 1 = 0, \qquad (13)$$
$$\frac{\partial L}{\partial s_j} = m\, s_j^{m-1} e_j + \lambda = 0. \qquad (14)$$
From Eq. (14), we know that:

$$s_j = \left( \frac{-\lambda}{m\, e_j} \right)^{\frac{1}{m-1}}. \qquad (15)$$
Substituting Eq. (15) into Eq. (13), we have:
$$\left( \frac{-\lambda}{m} \right)^{\frac{1}{m-1}} = \frac{1}{\sum_{k=1}^{n} \left( 1/e_k \right)^{\frac{1}{m-1}}}. \qquad (16)$$
Substituting this back into Eq. (15), we obtain:

$$s_j = \frac{\left( 1/e_j \right)^{\frac{1}{m-1}}}{\sum_{k=1}^{n} \left( 1/e_k \right)^{\frac{1}{m-1}}}. \qquad (17)$$
Then, we can solve $W$ and $H$ with fixed $s$, which is similar to the traditional NMF method. For example, for the $j$-th column of $H$ we can construct the following auxiliary function:

$$G\big(h_j, h_j^{(t)}\big) = F_j\big(h_j^{(t)}\big) + \big(h_j - h_j^{(t)}\big)^{T} \nabla F_j\big(h_j^{(t)}\big) + \big(h_j - h_j^{(t)}\big)^{T} K_j\big(h_j^{(t)}\big) \big(h_j - h_j^{(t)}\big), \qquad (18)$$

where $F_j(h_j) = s_j^{m} \|x_j - Wh_j\|_2^2$ and $K_j\big(h_j^{(t)}\big)$ is the diagonal matrix with entries $s_j^{m} \big(W^{T}Wh_j^{(t)}\big)_a / \big(h_j^{(t)}\big)_a$.
Setting the partial derivative of Eq. (18) to zero yields the following update rule:

$$H_{ij} \leftarrow H_{ij} \frac{\big(W^{T}XS\big)_{ij}}{\big(W^{T}WHS\big)_{ij}}, \qquad (19)$$

where $S = \operatorname{diag}\big(s_1^{m}, s_2^{m}, \ldots, s_n^{m}\big)$.
Similarly, we can also easily obtain the update rule for $W$ as follows:

$$W_{ij} \leftarrow W_{ij} \frac{\big(XSH^{T}\big)_{ij}}{\big(WHSH^{T}\big)_{ij}}. \qquad (20)$$
The update rules for $W$ and $H$ are similar to those of the existing WNMF methods in [14, 15]. Optimizing FWRNMF thus amounts to alternating the weight update of Eq. (17) with the multiplicative updates of Eqs. (19) and (20) until convergence, as sketched below.
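The following NumPy sketch gives one possible reading of this alternating scheme: the per-sample errors $e_j$, the closed-form fuzzy weights of Eq. (17), and the weighted multiplicative updates of Eqs. (19)-(20) with $S = \operatorname{diag}(s_1^m, \ldots, s_n^m)$. Initialization, iteration count, the default fuzzifier and the small eps safeguard are our assumptions, not the authors' released code.

```python
import numpy as np

def fwrnmf(X, r, m=2.0, n_iter=200, eps=1e-10, seed=0):
    """Sketch of FWRNMF: fuzzy weights (Eq. (17)) plus weighted updates (Eqs. (19)-(20))."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W = rng.random((d, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        # Per-sample reconstruction errors e_j = ||x_j - W h_j||_2^2
        e = np.sum((X - W @ H) ** 2, axis=0) + eps
        # Eq. (17): closed-form fuzzy weights on the simplex
        s = (1.0 / e) ** (1.0 / (m - 1.0))
        s /= s.sum()
        sm = s ** m                                          # diagonal of S = diag(s_1^m, ..., s_n^m)
        H *= (W.T @ (X * sm)) / (W.T @ W @ (H * sm) + eps)   # Eq. (19)
        W *= ((X * sm) @ H.T) / (W @ (H * sm) @ H.T + eps)   # Eq. (20)
    return W, H, s
```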
2.2.2 Entropy weighted robust NMF (EWRNMF)
We apply an information-entropy regularizer to the objective function of NMF to obtain weights in the range $[0, 1]$ instead of exactly 0 or 1. The information entropy indicates the uncertainty of the weights:

$$\min_{W, H, s} \sum_{j=1}^{n} s_j \|x_j - Wh_j\|_2^2 + \gamma \sum_{j=1}^{n} s_j \ln s_j, \quad \text{s.t. } W \geq 0,\; H \geq 0,\; \sum_{j=1}^{n} s_j = 1,\; s_j \geq 0, \qquad (21)$$

where $\gamma > 0$ is a given hyperparameter that smooths the weights. The first term of the objective function is the weighted sum of errors, and the second term is the negative entropy of the weights. The original objective function in Eq. (10) results in only one data point being involved in the feature representation, while the entropy regularizer stimulates more points to participate.
We construct the Lagrange function of Eq. (21) with respect to $s$ as:

$$L(s, \lambda) = \sum_{j=1}^{n} s_j e_j + \gamma \sum_{j=1}^{n} s_j \ln s_j + \lambda \Big( \sum_{j=1}^{n} s_j - 1 \Big), \qquad (22)$$

where $\lambda$ is still the Lagrange multiplier.
By setting the gradients of Eq. (22) with respect to $\lambda$ and $s_j$ to zero, we obtain the following system of equations:

$$\frac{\partial L}{\partial \lambda} = \sum_{j=1}^{n} s_j - 1 = 0, \qquad (23)$$
$$\frac{\partial L}{\partial s_j} = e_j + \gamma \left( \ln s_j + 1 \right) + \lambda = 0. \qquad (24)$$
From Eq. (24), we know that:
$$s_j = \exp\!\left( -\frac{e_j + \gamma + \lambda}{\gamma} \right). \qquad (25)$$
Substituting Eq. (25) into Eq. (23), we have:
$$\exp\!\left( -\frac{\gamma + \lambda}{\gamma} \right) = \frac{1}{\sum_{k=1}^{n} \exp\!\left( -e_k/\gamma \right)}. \qquad (26)$$
Substituting this expression into Eq. (25), we find that:

$$s_j = \frac{\exp\!\left( -e_j/\gamma \right)}{\sum_{k=1}^{n} \exp\!\left( -e_k/\gamma \right)}. \qquad (27)$$
Then, we can solve $W$ and $H$ with fixed $s$, which is similar to FWRNMF:

$$H_{ij} \leftarrow H_{ij} \frac{\big(W^{T}XS\big)_{ij}}{\big(W^{T}WHS\big)_{ij}}, \qquad (28)$$
$$W_{ij} \leftarrow W_{ij} \frac{\big(XSH^{T}\big)_{ij}}{\big(WHSH^{T}\big)_{ij}}, \qquad (29)$$

where $S = \operatorname{diag}\big(s_1, s_2, \ldots, s_n\big)$.
Optimizing EWRNMF thus amounts to alternating the weight update of Eq. (27) with the multiplicative updates of Eqs. (28) and (29) until convergence, as sketched below.
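The only step that differs from the FWRNMF sketch above is the weight update: Eq. (27) is a softmax over the negative scaled errors. A minimal sketch of that step is shown below, assuming the same weighted multiplicative updates as before but with $S = \operatorname{diag}(s_1, \ldots, s_n)$; shifting by the minimum error before exponentiating is a standard numerical safeguard and does not change the result.

```python
import numpy as np

def ewrnmf_weights(X, W, H, gamma):
    """Eq. (27): entropy-regularized weights, a softmax over -e_j / gamma."""
    e = np.sum((X - W @ H) ** 2, axis=0)     # per-sample errors e_j
    z = np.exp(-(e - e.min()) / gamma)       # shift by min(e) for numerical stability
    return z / z.sum()
```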
2.3 Extensions
Obviously, the proposed weighting techniques do not depend on the particular measure between $X$ and $WH$, so they can easily be extended to the KL-divergence, the $\beta$-divergence, Orthogonal NMF, ConvexNMF, etc., which demonstrates their good compatibility. We omit the derivations since they are very similar to those above.
3 Experiments
3.1 Experimental description
Experiments were performed on an HP Compaq PC with a 2.90-GHz Core i7-10700 CPU and 16 GB of memory, and all the methods were implemented in MATLAB. We compared the performance of the two proposed methods with the traditional NMF (denoted EucNMF), ConvexNMF, NMFL2 and HuberRNMF on nine public datasets: Yale, ORL, GTFD, Winequality-red, Page-blocks, Balance, Thyroid, WDBC and Dimdata. The important details of these datasets are shown in Table 1. We add Gaussian noise to the data to examine the robustness of the methods, as below:
$$\tilde{x}_{ij} = x_{ij} + \epsilon_{ij}, \quad \epsilon_{ij} \sim \mathcal{N}\big(0, (\delta x_{ij})^2\big), \qquad (30)$$

where $\delta$ controls the noise level and $\epsilon_{ij}$ follows a Gaussian distribution with mean 0 and variance $(\delta x_{ij})^2$. This implies that the noise relates to the scale of the data, and a bigger $\delta$ corresponds to higher noise, which is reasonable in practical applications.
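For concreteness, the noise injection of Eq. (30) might be implemented as follows. The entry-wise scaling of the noise standard deviation and the final clipping to keep the matrix nonnegative are our reading of the description above, so treat both as assumptions.

```python
import numpy as np

def add_scaled_gaussian_noise(X, delta, seed=0):
    """Add Gaussian noise whose scale follows the data, as described for Eq. (30)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(X.shape) * (delta * X)   # std of entry (i, j) is delta * x_ij
    return np.maximum(X + noise, 0.0)                    # keep the input nonnegative for NMF
```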
Dataset | Samples | Dimensions | Classes
---|---|---|---
Yale [14] | 165 | | 15
ORL [15] | 400 | | 40
GTFD1 | 750 | | 50
Winequality-red2 | 4898 | 12 | 11
Page-blocks2 | 5473 | 10 | 5
Balance2 | 625 | 4 | 3
Thyroid2 | 7200 | 21 | 3
WDBC2 | 569 | 32 | 2
Dimdata3 | 4192 | 14 | 2
1 ftp://ftp.ee.gatech.edu/pub/users/hayes/facedb/
2 http://archive.ics.uci.edu/ml/datasets/
3 http://pages.cs.wisc.edu/~olvi/
After obtaining the new feature representations, we cluster them with k-means and then evaluate the clustering results. Clustering accuracy (ACC) [16], [17] and normalized mutual information (NMI) [18], [19] are used to evaluate the performance of these clustering results.
Given the set of ground-truth class labels $y$ and the obtained cluster labels $c$, the clustering accuracy is defined as:

$$\mathrm{ACC} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big\{ y_i = \mathrm{map}(c_i) \big\}, \qquad (31)$$
where $\mathbb{1}\{\cdot\}$ is the indicator function, which equals 1 when its argument holds and 0 otherwise, and $\mathrm{map}(\cdot)$ is a permutation mapping function that maps the obtained cluster labels to the real labels. The higher the ACC value is, the better the clustering performance.
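The optimal permutation mapping in Eq. (31) can be found with the Hungarian algorithm on the class-cluster confusion matrix. The sketch below is a common way to compute ACC and is not necessarily the evaluation script used by the authors.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC of Eq. (31): best one-to-one mapping of cluster labels onto class labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # counts[i, j] = number of points with class i assigned to cluster j
    counts = np.zeros((classes.size, clusters.size), dtype=int)
    for i, c in enumerate(classes):
        for j, k in enumerate(clusters):
            counts[i, j] = np.sum((y_true == c) & (y_pred == k))
    row, col = linear_sum_assignment(-counts)   # maximize the matched counts
    return counts[row, col].sum() / y_true.size
```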
NMI is used to calculate the agreement between the two data distributions and is defined as follows:
$$\mathrm{NMI}(Y, C) = \frac{\mathrm{MI}(Y, C)}{\max\big(H(Y), H(C)\big)}, \qquad (32)$$

where $H(\cdot)$ denotes the entropy of a label set. $\mathrm{MI}(Y, C)$ quantifies the amount of information shared between the two random variables (i.e., $Y$ and $C$) and is defined as:

$$\mathrm{MI}(Y, C) = \sum_{y_i \in Y} \sum_{c_j \in C} p(y_i, c_j) \log_2 \frac{p(y_i, c_j)}{p(y_i)\, p(c_j)}, \qquad (33)$$

where $p(y_i)$ and $p(c_j)$ are the probabilities that a data point selected from the dataset belongs to the clusters $y_i$ and $c_j$, respectively, and $p(y_i, c_j)$ is the joint probability that an arbitrarily selected data point belongs to both clusters $y_i$ and $c_j$ concurrently. NMI ranges from 0 to 1, and the larger the NMI is, the better the clustering performance.
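NMI as defined in Eqs. (32)-(33) is available in scikit-learn; different normalizations of the mutual information exist ('max', 'geometric', 'arithmetic', 'min'), so the exact variant used in the paper is an assumption here. A toy usage example:

```python
from sklearn.metrics import normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]          # ground-truth class labels (toy example)
y_pred = [1, 1, 0, 0, 2, 2]          # cluster labels, e.g., from k-means
# average_method='max' normalizes MI(Y, C) by max(H(Y), H(C)).
nmi = normalized_mutual_info_score(y_true, y_pred, average_method='max')
print(nmi)   # 1.0: the clustering matches the classes up to relabeling
```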
Because NMF does not have a unique solution, we randomly initialized each method 10 times and report the averaged ACC and NMI for a credible comparison. We also set the reduced dimensionality equal to the number of clusters, which is a usual practice.
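A sketch of this evaluation protocol (reduced dimension equal to the number of clusters, k-means on the learned representation, results averaged over 10 random initializations) might look as follows; it reuses the clustering_accuracy helper sketched above, and the helper names are ours rather than the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def evaluate(X, y_true, factorize, n_runs=10):
    """Average ACC/NMI over runs; `factorize(X, r, seed=...)` returns at least (W, H)."""
    r = np.unique(y_true).size                    # reduced dimension = number of clusters
    accs, nmis = [], []
    for seed in range(n_runs):
        W, H = factorize(X, r, seed=seed)[:2]
        y_pred = KMeans(n_clusters=r, n_init=10, random_state=seed).fit_predict(H.T)
        accs.append(clustering_accuracy(y_true, y_pred))
        nmis.append(normalized_mutual_info_score(y_true, y_pred, average_method='max'))
    return float(np.mean(accs)), float(np.mean(nmis))
```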
3.2 Clustering performance
We first compare all the methods on all the datasets, in which the hyperparameters of HuberRNMF, FWRNMF and EWRNMF are selected optimally from their respective candidate sets. In the following experiments, we always use the same noise level and the same optimal-hyperparameter strategy unless otherwise stated. Clustering results are shown in Tables 2 and 3. We can see that either FWRNMF or EWRNMF always exhibits better performance than the other methods, while NMFL2 and HuberRNMF are generally better than the traditional NMF and ConvexNMF. This demonstrates the feasibility and effectiveness of the proposed methods. In general, neither FWRNMF nor EWRNMF establishes total supremacy over the other.
Method | EucNMF | ConvexNMF | NMFL2 | HuberRNMF | FWRNMF | EWRNMF |
---|---|---|---|---|---|---|
Yale | 39.82 | 30.76 | 40.91 | 41.06 | 41.91 | 41.61 |
ORL | 66.07 | 26.06 | 66.35 | 67.99 | 69.02 | 67.92 |
GTFD | 62.56 | 49.32 | 63.76 | 64.73 | 66.85 | 64.17 |
Winequality-red | 30.29 | 28.99 | 29.66 | 30.44 | 42.40 | 42.40 |
Page-blocks | 34.17 | 34.99 | 38.71 | 36.67 | 89.71 | 89.71 |
Balance | 51.70 | 51.34 | 50.78 | 51.93 | 51.66 | 54.55 |
Thyroid | 54.04 | 55.83 | 54.65 | 54.19 | 92.56 | 92.56 |
WDBC | 87.69 | 83.68 | 87.70 | 88.58 | 89.01 | 89.69 |
Dimdata | 52.45 | 54.26 | 52.32 | 52.45 | 69.74 | 67.42 |
Method | EucNMF | ConvexNMF | NMFL2 | HuberRNMF | FWRNMF | EWRNMF |
---|---|---|---|---|---|---|
Yale | 45.23 | 38.04 | 46.20 | 46.31 | 47.05 | 47.15 |
ORL | 83.48 | 50.87 | 83.66 | 84.40 | 84.84 | 84.34 |
GTFD | 85.48 | 73.93 | 85.60 | 86.05 | 86.71 | 85.90 |
Winequality-red | 10.56 | 8.47 | 10.53 | 10.63 | 9.88 | 10.79 |
Page-blocks | 12.16 | 6.78 | 13.27 | 12.57 | 15.33 | 12.55 |
Balance | 10.78 | 13.01 | 10.11 | 11.73 | 12.84 | 17.48 |
Thyroid | 1.60 | 1.55 | 1.76 | 1.61 | 3.61 | 2.41 |
WDBC | 51.18 | 34.76 | 51.20 | 54.32 | 54.54 | 54.57 |
Dimdata | 0.18 | 0.84 | 0.16 | 0.18 | 16.54 | 14.33 |
3.3 Hyperparameter selection
We inspect the ability of FWRNMF and EWRNMF to improve on the traditional NMF for dimensionality reduction and k-means clustering on the Yale and WDBC datasets. The hyperparameters of FWRNMF and EWRNMF are still selected optimally, as above. As shown in Figures 1 and 2, regardless of the value of the hyperparameter ($m$ for FWRNMF or $\gamma$ for EWRNMF), the proposed methods indeed provide better performance than the traditional NMF and achieve consistently superior clustering results over a wide range of hyperparameter values.
[Figures 1 and 2: clustering performance of NMF, FWRNMF and EWRNMF versus the hyperparameter value on the Yale and WDBC datasets.]
3.4 Clustering results versus noise level
We investigate the robustness of the proposed methods with regard to different noise levels on two datasets, GTFD and Balance, adding Gaussian noise to each dataset by Eq. (30) with the noise level $\delta$ ranging from 0.02 to 0.1. The results are shown in Figures 3 and 4. It can be seen that the effect of FWRNMF is particularly prominent on GTFD, while EWRNMF performs well on Balance. Furthermore, on both datasets, one of the two proposed methods always obtains the best result, which shows strong robustness.
[Figures 3 and 4: clustering results versus the noise level on the GTFD and Balance datasets.]
3.5 Clustering results versus cluster number
We study the relationship between the evaluation measures and the cluster number on the Yale and GTFD datasets, where the cluster number $k$ ranges from 2 to 10. The experimental details are described as follows:
1) Randomly select k categories as a subset for the following experiment.
2) Randomly initialize $W$ and $H$, obtain the new representations, and then cluster them by k-means. Hyperparameters are selected according to the above instructions.
3) Repeat 1) and 2) 10 times to obtain an average result.
It can be seen in Figures 5 and 6 that both of our methods are better than the other methods. In detail, EWRNMF is the best on the Yale dataset and FWRNMF is the best on the GTFD dataset; however, neither of the proposed methods is always the best.
[Figures 5 and 6: clustering results versus the cluster number on the Yale and GTFD datasets.]
3.6 Visualization
We directly visualize the results of the NMF methods for a qualitative comparison. The Dimdata dataset is selected, and we reduce the original data to two dimensions since it includes two clusters. The separability of the new representation $H$ can be directly observed in Figure 7. It is easy to see by eye that the new representations obtained by FWRNMF and EWRNMF have good separability when a center-distance-based clustering method such as k-means is considered. This explains why they obtain such excellent results in Tables 2 and 3 in terms of both accuracy and NMI.
[Figure 7: two-dimensional representations of the Dimdata dataset obtained by the compared NMF methods.]
3.7 Convergence speed and part-based learning
From the theoretical analysis, we know that the objective functions of the proposed methods are monotonically nonincreasing, but due to the nonconvexity of the objective functions, strict convergence cannot be guaranteed, which is commonplace for NMF methods [20]. Thus, we investigate the convergence speed of the NMF methods. Figure 8 shows the objective function values on the ORL dataset obtained by EucNMF, ConvexNMF, NMFL2, HuberRNMF, FWRNMF and EWRNMF. The convergence speed of the proposed FWRNMF is relatively slow, but EWRNMF shows excellent performance. However, both methods essentially converge within 200 iterations. We also show the base images in Figure 9. As can be seen, ConvexNMF identifies more global faces, while the other methods present similarly sparse base images that are difficult to distinguish from each other.
[Figures 8 and 9: objective function values versus iterations on the ORL dataset, and the base images learned by each method.]
4 Conclusion
This paper proposes two new methods to obtain robust feature representations for the NMF problem by introducing two types of adaptive weights for each sample. This is very different from existing robust NMF methods, which use slowly growing similarity measures. The adaptive weights can automatically distinguish outliers from normal points and assign different importance based on the outlier degree. The experimental results show that the proposed methods are flexible and effective.
However, both new methods require an additional hyperparameter to smooth the weights. In the future, we will develop an automatic method for determining the hyperparameters.
Acknowledgement
This work was supported by the Fundamental Research Funds for the Central Universities of China (N180719020).
References
- Gersho and Gray [2012] A. Gersho, R. M. Gray, Vector Quantization and Signal Compression, volume 159, Springer Science and Business Media, 2012.
- Hu et al. [2017] C. Hu, X. Lu, M. Ye, W. Zeng, Singular value decomposition and local near neighbors for face recognition under varying illumination, Pattern Recognition 64 (2017) 60–83.
- Rose [2010] W. Rose, Principal component analysis, Wiley Interdisciplinary Reviews: Computational Statistics 24 (2010) 433–459.
- Hyvärinen and Oja [2000] A. Hyvärinen, E. Oja, Independent component analysis: algorithms and applications, Neural Networks 13 (2000) 411–430.
- Lee and Seung [1999] D. D. Lee, H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.
- Gong et al. [2014] C. Gong, D. Tao, K. Fu, J. Yang, Fick’s law assisted propagation for semisupervised learning, IEEE Transactions on Neural Networks and Learning Systems 26 (2014) 2148–2162.
- Li et al. [2018] Z. Li, J. Tang, X. He, Robust structured nonnegative matrix factorization for image representation, IEEE Transactions on Neural Networks and Learning Systems 29 (2018) 1947–1960.
- Tepper and Sapiro [2016] M. Tepper, G. Sapiro, Compressed nonnegative matrix factorization is fast and accurate, IEEE Transactions on Signal Processing 64 (2016) 2269–2283.
- Ding et al. [2008] C. H. Ding, T. Li, M. I. Jordan, Convex and semi-nonnegative matrix factorizations, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2008) 45–55.
- Guillamet et al. [2003] D. Guillamet, J. Vitria, B. Schiele, Introducing a weighted non-negative matrix factorization for image classification, Pattern Recognition Letters 24 (2003) 2447–2454.
- Xing et al. [2018] L. Xing, H. Dong, W. Jiang, K. Tang, Nonnegative matrix factorization by joint locality-constrained and $\ell_{2,1}$-norm regularization, Multimedia Tools and Applications 77 (2018) 3029–3048.
- Du et al. [2012] L. Du, X. Li, Y. D. Shen, Robust nonnegative matrix factorization via half-quadratic minimization, in: 2012 IEEE 12th International Conference on Data Mining, IEEE, 2012, pp. 201–210.
- MacQueen [1967] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, Oakland, CA, USA, 1967, pp. 281–297.
- Belhumeur et al. [1997] P. N. Belhumeur, J. P. Hespanha, D. J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 711–720.
- Kuang et al. [2015] D. Kuang, S. Yun, H. Park, Symnmf: nonnegative low-rank approximation of a similarity matrix for graph clustering, Journal of Global Optimization 62 (2015) 545–574.
- Liu et al. [2011] H. Liu, Z. Wu, X. Li, D. Cai, T. S. Huang, Constrained nonnegative matrix factorization for image representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2011) 1299–1311.
- Plummer and Lovász [1986] M. D. Plummer, L. Lovász, Matching Theory, Elsevier, 1986.
- Cai et al. [2005] D. Cai, X. He, J. Han, Document clustering using locality preserving indexing, IEEE Transactions on Knowledge and Data Engineering 17 (2005) 1624–1637.
- Shahnaz et al. [2006] F. Shahnaz, M. W. Berry, V. P. Pauca, R. J. Plemmons, Document clustering using nonnegative matrix factorization, Information Processing and Management 42 (2006) 373–386.
- Lin [2007] C. J. Lin, On the convergence of multiplicative update algorithms for nonnegative matrix factorization, IEEE Transactions on Neural Networks 18 (2007) 1589–1596.