Partially Shared Semi-supervised Deep Matrix Factorization with Multi-view Data

Haonan Huang2, Naiyao Liang23, Wei Yan2, Zuyuan Yang25, Weijun Sun24 {mrhaonan, naiyaogdut}@aliyun.com, [email protected], {yangzuyuan, gdutswj}@gdut.edu.cn 2Guangdong Key Laboratory of IoT Information Technology, Guangdong University of Technology, Guangzhou, China 3Key Laboratory of iDetection and Manufacturing-IoT, Ministry of Education, Guangzhou, China 4Guangdong-Hong Kong-Macao Joint Laboratory for Smart Discrete Manufacturing, Guangzhou, China 5Author to whom any correspondence should be addressed.

Abstract

Since many real-world data can be described from multiple views, multi-view learning has attracted considerable attention. Various methods have been proposed and successfully applied to multi-view learning, typically based on matrix factorization models. Recently, it is extended to the deep structure to exploit the hierarchical information of multi-view data, but the view-specific features and the label information are seldom considered. To address these concerns, we present a partially shared semi-supervised deep matrix factorization model (PSDMF). By integrating the partially shared deep decomposition structure, graph regularization and the semi-supervised regression model, PSDMF can learn a compact and discriminative representation through eliminating the effects of uncorrelated information. In addition, we develop an efficient iterative updating algorithm for PSDMF. Extensive experiments on five benchmark datasets demonstrate that PSDMF can achieve better performance than the state-of-the-art multi-view learning approaches. The MATLAB source code is available at https://github.com/libertyhhn/PartiallySharedDMF.

Index Terms:

Multi-view learning; Deep matrix factorization; Semi-supervised learning; Partially shared structure.

I Introduction

In practical applications, real-world data can be described from different views, that is the so called multi-view data. For instance, an image can be described by several characteristics, e.g., shape, color, texture and so on. Because multi-view representation learning can exploit the implicit-dependent structure of multiple views and improve the performance of the learning tasks. In recent years, multi-view representation learning has attracted increasing research attention in machine learning [1, 2, 3].

Over the past decade, lots of researches on multi-view learning have emerged. In particular, Non-negative Matrix Factorization (NMF) [4, 5] as one of the most popular high-dimensional data processing algorithms, have been widely used for clustering multi-view data [7, 6]. NMF-based multi-view clustering methods have shown to generate superior clustering results which are easy to interpret [8, 9]. Considering the relations between view-specific (uncorrelated) and common information (correlated), partially shared NMF-based multi-view learning approaches are proposed [10, 11].

Refer to caption — Figure 1: The general framework of the Deep Multi-View Clustering. Data of the same class (*shape*) will become more compact and generate more discriminative representation as the number of decomposition layers increases.

Nevertheless, most existing methods are single layer structures, which are hard to extract hierarchical structural information [12] of multi-view data. With the development of deep learning, Wang et al. proposed DCCA to extract the hidden hierarchical information in 2-view data [13]. Zhao et al. [14] extended single-view deep matrix factorization [15] into multi-view and proposed graph regularized deep multi-view clustering, which can eliminate the interference in multi-view data. Huang et al. [16] designed a novel robust deep multi-view clustering model to learn the hierarchical semantics without hyperparameters. For clarity, the general deep matrix factorization framework for multi-view data as illustrated in Fig. 1. However, most existing deep matrix factorization multi-view methods are merely considering the common information and ignoring the effect of view-specific information in each individual view. Besides, they are formulated as an unsupervised learning problem and inapplicable when partially labeled data is available. Actually, researchers have found that integrating such label information can produce a considerable improvement in learning performance [19, 18, 17].

To this end, we propose a novel deep multi-view clustering methods called Partially Shared Semi-supervised Deep Matrix Factorization (PSDMF). In our method, correlated and uncorrelated features of multi-view data are both considered by partially shared approaches, in which the latent representation of each view is divided into the common part and the view-specific part. A robust sparse regression term with $L_{2,1}$ norm is adopted to integrate the partial label information of the labeled data. Besides, to respect the intrinsic geometric relationship and avoid the parameters problem in different views, we apply graph regularization and auto-weighted strategy to PSDMF. An efficient iterative updating algorithm with pre-training scheme is designed for PSDMF. We summarize our major contributions as follows:

•

This paper proposes an semi-supervised deep multi-view clustering model to improve the performance of traditional unsupervised deep matrix factorization methods by using label regression learning approach.
•

To respect the common information and view-specific features of multi-view data, we propose a partially shared deep matrix factorization method to jointly exploit two kinds of information and learn a comprehensive final layer representation of multi-view data.
•

Local invariant graph regularization and the auto-weighted strategy are introduced to preserve the intrinsic geometric structure in each view and further boost the quality of output representation.

The rest of this paper is organized as follows. In Section II, we give a brief review of related works and describe the proposed PSDMF. Section III present an efficient algorithm to solve the optimization problem. In Section IV, we report the experimental results on five real-world datasets. Finally, Section V concludes this paper. Table. I summarizes the general notations in this article for reader’s convenience.

TABLE I: Notations.

Notation	Description
$P$	The number of views
$m$	The number of layers
$k$	The layer sizes
$N$	The number of samples
$N_{l}$	The number of labeled samples
$N_{u}$	The number of unlabeled samples
$K_{s}$	The dimension of view-specific encoding matrix
$K_{c}$	The dimension of shared encoding matrix
$\mathbf{X}^{p}$	The data matrix of the $p$ -th view
$\mathbf{U}_{i}^{p}$	The hidden matrix in the $i$ -th layer of $p$ -th view
$\mathbf{V}$	The partially shared factor latent representation matrix
$\mathbf{V}_{m}^{p}$	The $m$ -th layer partially shared factor of the $p$ -th view
$\mathbf{W}$	The regression weight matrix

II Methodology

II-A Overview of Deep Matrix Factorization

Semi-NMF is not only a useful data dimensionality reduction technique but also beneficial to data clustering [20]. Motivated by deep neural network structures, Trigeorgis et al. [15] proposed a multi-layer structure Semi-NMF called Deep Semi-NMF to exploit data complex hierarchical information with implicit lower-level hidden attributes. The Deep Semi-NMF decomposes dataset $\mathbf{X}$ hierarchically, and the process can be formulated as:

$\displaystyle\mathbf{X}$	$\displaystyle\approx\mathbf{U}_{1}\mathbf{V}_{1}^{+}$	(1)
$\displaystyle\mathbf{X}$	$\displaystyle\approx\mathbf{U}_{1}\mathbf{U}_{2}\mathbf{V}_{2}^{+}$
	$\displaystyle\vdots$
$\displaystyle\mathbf{X}$	$\displaystyle\approx\mathbf{U}_{1}\ldots\mathbf{U}_{m}\mathbf{V}_{m}^{+}$

where $m$ is the numbers of layers, $\mathbf{U}_{i}$ is the $i$ -th layer hidden matrix, $\mathbf{V}_{i}^{+}$ denotes the $i$ -th layer low-dimensional representation matrix and we use the notation $+$ to state that a matrix contains only non-negative elements. The loss function of Deep Semi-NMF is:

\min_{\mathbf{U}_{i},\mathbf{V}_{m}}\left\|\mathbf{X}-\mathbf{U}_{1}\cdot\cdot\cdot\mathbf{U}_{m}\mathbf{V}_{m}\right\|_{F}^{2}\text{ s.t. }\mathbf{V}_{m}\geq 0.

(2)

The Semi-NMF and Deep Semi-NMF mentioned above can be regarded as single view algorithms. For multi-view data, Zhao et al. [14] presented a new Deep Semi-NMF framework with graph regularization (DMVC) which can eliminate the negative interference in the multi-source data and obtain an effective consensus representation in the final layer. Given data $\mathbf{X}$ consisting of $P$ views, denoted by $\mathbf{X}=\{\mathbf{X}^{1},\mathbf{X}^{2},...,\mathbf{X}^{P}\}$ , the loss function of DMVC is:

$\displaystyle\min_{\mathbf{Z}_{i}^{p},\mathbf{H}_{i}^{p},\mathbf{H}_{m},\alpha^{p}}$	$\displaystyle\sum_{p=1}^{P}\left(\alpha^{p}\right)^{r}\left(\left\\|\mathbf{X}^{p}-\mathbf{U}_{1}^{p}\mathbf{U}_{2}^{p}\ldots\mathbf{U}_{m}^{p}\mathbf{V}_{m}\right\\|_{\mathrm{F}}^{2}\right.$	(3)
	$\displaystyle\left.\quad+\beta\operatorname{tr}\left(\mathbf{V}_{m}\mathbf{L}^{p}\mathbf{V}_{m}^{T}\right)\right)$
s.t.	$\displaystyle\mathbf{V}_{i}^{p}\geq 0,\mathbf{V}_{m}\geq 0,\sum_{p=1}^{P}\alpha^{p}=1,\alpha^{p}\geq 0$

where $\alpha^{p}$ denotes the weighting coefficient of the $p$ -th view and $r$ is a important hyperparameter to control the weights distribution. $\mathbf{L}^{p}$ is the graph Laplacian of the graph for view $p$ and $\beta$ is used to adjust the contribution of the graph constrains. The details of how to construct a graph matrix will be discussed in the next subsection. However, as an unsupervised method, DMVC cannot make use of partial prior knowledge of data (e.g. labels). Besides, DMVC only concerns common representation of multi-view data and ignores the view-specific features. In this paper, we propose a novel semi-supervised deep matrix factorization model to address such challenging problems.

II-B Partially Shared Semi-supervised Deep Matrix Factorization

To make full use of priori knowledge and learn low-dimensional factors with powerful discrimination, motivated by recently proposed label regression learning technique [10, 17], we combine the following $L_{2,1}$ -norm regularized regression in our model:

\min\left\|\mathbf{W}^{T}\mathbf{V}_{l}-\mathbf{Y}\right\|_{F}^{2}+\gamma\|\mathbf{W}\|_{2,1}.

(4)

where $\mathbf{V}_{l}\in\mathbb{R}^{K\times N_{l}}$ denotes the labeled part of representation matrix $\mathbf{V}$ and $\mathbf{V}_{u}\in\mathbb{R}^{K\times N_{u}}$ is the unlabeled part as shown in Fig. 2 (i.e., $\mathbf{V}=[\mathbf{V}_{l},\mathbf{V}_{u}]$ ). $\mathbf{W}\in\mathbb{R}^{K\times C}$ is the regression coefficient matrix and the $L_{2,1}$ -norm is used to enforce it sparse in rows. Consider we have a priori knowledge that those samples share the same label, a binary weight matrix $\mathbf{Y}\in\mathbb{R}^{C\times N}$ is constructed by following rule:

\mathbf{Y}_{cn}=\left\{\begin{array}[]{ll}{1}&{\text{ if }n\text{-th data point belongs to the }c\text{-th class}}\\ {0}&{\text{ otherwise }}\end{array}\right.

(5)

The effectiveness of graph regularization technique has been shown in recent research work [28, 14] and it is able to keep the geometric structure of data when the dimension is changed. Similar to the DMVC, we construct a local graph Laplacian matrix to preserve the local geometrical structure of the each view $\mathbf{X}^{p}$ . As introduced in [14], the graph binary weight matrix $\mathbf{S}^{p}$ is constructed in k-nearest neighbor (k-NN) fashion. Formally, the regularization term $\mathcal{R}^{p}$ is calculated:

$\displaystyle\mathcal{R}^{p}$	$\displaystyle=\frac{1}{2}\sum_{j,q=1}^{n}\left\\|\mathbf{v}_{j}^{p}-\mathbf{v}_{q}^{p}\right\\|^{2}\mathbf{S}_{jq}^{p}$	(6)
	$\displaystyle=\sum_{j=1}^{n}\mathbf{v}_{j}^{p}\mathbf{v}_{j}^{p}\mathbf{D}_{jj}^{p}-\sum_{j,q=1}^{n}\mathbf{v}_{j}^{p}\mathbf{v}_{l}^{p}\mathbf{S}_{jq}^{p}$
	$\displaystyle=\operatorname{Tr}\left(\mathbf{V}^{p}\mathbf{D}^{p}(\mathbf{V}^{p})^{T}\right)-\operatorname{Tr}\left(\mathbf{V}^{p}\mathbf{S}^{p}(\mathbf{V}^{p})^{T}\right)$
	$\displaystyle=\operatorname{Tr}\left(\mathbf{V}^{p}\mathbf{L}^{p}(\mathbf{V}^{p})^{T}\right)$

where $\mathbf{D}_{jj}^{p}=\sum_{q}\mathbf{S}_{jq}^{p}$ . $\mathbf{L}^{p}=\mathbf{D}^{p}-\mathbf{S}^{p}$ denotes the graph Laplacian matrix for each view data.

Different from the existing deep matrix factorization multi-view methods, we use partially shared strategy to jointly exploit view-specific and common features. The partially shared latent representation matrix $\mathbf{V}\in\mathbb{R}^{K\times N}$ and $K=K_{s}\times P+K_{c}$ . The common factor ratio $\lambda=K_{c}/(K_{s}+K_{c})$ . The final $m$ -th layer partially shared factor $\mathbf{V}_{m}^{p}\in\mathbb{R}^{(K_{s}+K_{c})\times N}$ is divided into four parts: the labeled and unlabeled view-specific encoding matrix (i.e. $\mathbf{V}_{sl}^{p}\in\mathbb{R}^{K_{s}\times N_{l}}$ and $\mathbf{V}_{su}^{p}\in\mathbb{R}^{K_{s}\times N_{u}}$ ), the labeled and unlabeled shared encoding matrix (i.e. $\mathbf{V}_{cl}\in\mathbb{R}^{K_{c}\times N_{l}}$ and $\mathbf{V}_{cu}\in\mathbb{R}^{K_{c}\times N_{u}}$ ). We propose a general partially shared deep matrix factorization framework as follows:

	$\displaystyle\min\sum_{p=1}^{P}\left\\|\mathbf{X}^{p}-\left[\mathbf{U}_{1s}^{p},\mathbf{U}_{1c}^{p}\right]\cdot\cdot\cdot\left[\mathbf{U}_{ms}^{p},\mathbf{U}_{mc}^{p}\right]\left[\begin{array}[]{c}\mathbf{V}_{sl}^{p},\mathbf{V}_{su}^{p}\\ \mathbf{V}_{cl},\mathbf{V}_{cu}\end{array}\right]\right\\|_{F}^{2}$		(7)
	$\displaystyle\text{ s.t. }\mathbf{V}_{sl}^{p},\mathbf{V}_{su}^{p},\mathbf{V}_{cl},\mathbf{V}_{cu}\geq 0$		(7)

where $\mathbf{U}_{is}^{p}$ and $\mathbf{U}_{ic}^{p}$ denote the hidden view-specific matrix and common matrix, respectively. To simplify the problem, we use $\mathbf{V}_{m}^{p}=[\mathbf{V}_{s}^{p};\mathbf{V}_{c}]=\left[[\mathbf{V}_{sl}^{p},\mathbf{V}_{su}^{p}];[\mathbf{V}_{cl},\mathbf{V}_{cu}]\right]$ , $\mathbf{U}_{i}^{p}=[\mathbf{U}_{is}^{p};\mathbf{U}_{ic}^{p}]$ , by combining Eq.(4), Eq.(6) and Eq.(7), we get the cost function of PSDMF as:

$\displaystyle\min_{\mathbf{U}_{i}^{p},\mathbf{V}_{m}^{p},\mathbf{W}}O=$	$\displaystyle\sum_{p=1}^{P}\left(\alpha^{p}\left\\|\mathbf{X}^{p}-\mathbf{U}_{1}^{p}\cdot\cdot\cdot\mathbf{U}_{m}^{p}\mathbf{V}_{m}^{p}\right\\|_{F}^{2}\right.$	(8)
	$\displaystyle\left.+\mu\operatorname{tr}\left(\mathbf{V}_{m}^{p}\mathbf{L}^{p}\left(\mathbf{V}_{m}^{p}\right)^{T}\right)\right)$
	$\displaystyle+\beta\left(\left\\|\mathbf{W}^{T}\mathbf{V}_{l}-\mathbf{Y}\right\\|_{F}^{2}+\gamma\\|\mathbf{W}\\|_{2,1}\right)$
	$\displaystyle\text{ s.t. }\mathbf{V}_{m}^{p}\geq 0,\forall p$

where $\alpha^{p}$ is the weighting coefficient for the $p$ -th view and it has a great influence on the effect of the model. In DMVC [14], the weights distribution smooth is determined by parameter $r$ , but $r$ needs to be searched in a large range manually, which makes it to adjust. To avoid this problem, inspired by [25, 26], we use the auto-weighted strategy that obtains the value of $\alpha^{p}$ based on the distance between the data and the decomposition matrices, as follows:

\alpha^{p}=\frac{1}{2\left\|\mathbf{X}^{p}-\mathbf{U}_{1}^{p}\cdot\cdot\cdot\mathbf{U}_{m}^{p}\mathbf{V}_{m}^{p}\right\|_{F}}

(9)

III Optimization

To expedite the approximation of the matrices in our proposed model, we conduct the pre-training [24] by decomposing each view $\mathbf{X}^{p}\approx\mathbf{U}_{1}^{p}\mathbf{V}_{1}^{p}$ by minimizing the Semi-NMF $\left\|\mathbf{X}^{p}-\mathbf{U}_{1}^{p}\mathbf{V}_{1}^{p}\right\|_{F}^{2}$ , where $\mathbf{U}_{1}^{p}\in\mathbb{R}^{M^{p}\times k_{1}}$ and $\mathbf{V}_{1}^{p}\in\mathbb{R}^{k_{1}\times N}$ . Then the factor matrix $\mathbf{V}_{1}^{p}$ is further decomposed as $\mathbf{V}_{1}^{p}\approx\mathbf{U}_{2}^{p}\mathbf{V}_{2}^{p}$ by minimizing the $\left\|\mathbf{V}_{2}^{p}-\mathbf{U}_{2}^{p}\mathbf{V}_{2}^{p}\right\|_{F}^{2}$ , where $\mathbf{U}_{2}^{p}\in\mathbb{R}^{k_{1}\times k_{2}}$ , $\mathbf{V}_{1}^{p}\in\mathbb{R}^{k_{2}\times N}$ and $k_{1},k_{2}$ denote the dimensions for layer 1 and layer 2. The process will be repeated till all of the layers are pre-trained. The optimization of Semi-NMF can be derived following a similar process as described in [20]. To save the space, we omit the updating rules here.

Update rule for hidden matrix $\mathbf{U}_{i}^{p}$ : By fixing $\mathbf{V}_{m}^{p}$ and $\mathbf{W}$ , we minimize the objective function (8) with respect to $\mathbf{U}_{i}^{p}$ . Letting the partial derivative $\partial O/\mathbf{U}_{i}^{p}=0$ , we can obtain

\mathbf{U}_{i}^{p}=\left(\Phi_{i-1}^{T}\Phi_{i-1}\right)^{-1}\left(\Phi_{i-1}^{T}\mathbf{X}^{p}\mathbf{U}_{im}^{T}\right)\\ \left(\mathbf{U}_{im}\mathbf{U}_{im}^{T}\right)^{-1}

(10)

where $\Phi_{i-1}=\mathbf{U}_{1}^{p}\mathbf{U}_{2}^{p}\ldots\mathbf{U}_{i-1}^{p}$ and $\mathbf{U}_{im}=\mathbf{U}_{i+1}^{p}\ldots\mathbf{U}_{m}^{p}\mathbf{V}_{m}^{p}$ .

Update rule for regression weight matrix $\mathbf{W}$ : Following [27], the derivative of objective function (8) with respect to $\mathbf{W}$ is as follows:

\frac{\partial O}{\mathbf{W}}=2\beta\left(\mathbf{V}_{l}(\mathbf{V}_{l}^{T}\mathbf{W}-\mathbf{Y}^{T})+\gamma\mathbf{E}\mathbf{W}\right)

(11)

where $\mathbf{E}$ is a diagonal matrix with $e_{ii}=1/2\|\mathbf{w}_{i}\|_{2}$ . According to the optimization theory, set $\partial O/\mathbf{W}=0$ , then we can obtain the update rule for $\mathbf{W}$ is

\mathbf{W}=\left(\mathbf{V}_{l}\mathbf{V}_{l}^{T}+\gamma\mathbf{E}\right)^{-1}\mathbf{V}_{l}\mathbf{Y}^{T}.

(12)

Update rule for partially shared factor matrix $\mathbf{V}_{m}^{p}$ : As illustrated in Fig. 2, $\mathbf{V}_{m}^{p}$ can be divided into four parts: $\mathbf{V}_{sl}^{p},\mathbf{V}_{su}^{p},\mathbf{V}_{cl}$ and $\mathbf{V}_{cu}$ . Similarly, each data matrix $\mathbf{X}^{p}$ is divided into two parts $\mathbf{X}_{l}^{p}$ and $\mathbf{X}_{u}^{p}$ ; each Laplacian Graph matrix $\mathbf{L}^{p}$ is divided into two parts $\mathbf{L}_{l}^{p}$ and $\mathbf{L}_{u}^{p}$ .

For the constraint $\mathbf{V}_{m}^{p}\geq 0$ , we introduce the Lagrangian multiplier $\eta$ as follows:

	$\displaystyle\mathcal{L}\left(\mathbf{V}_{m}^{p}\right)$	$\displaystyle=\sum_{p=1}^{P}\left(\alpha^{p}\left\\|\mathbf{X}^{p}-\Phi_{m}\mathbf{V}_{m}^{p}\right\\|_{\mathrm{F}}^{2}+\mu\operatorname{tr}\left(\mathbf{V}_{m}^{p}\mathbf{L}^{p}(\mathbf{V}_{m}^{p})^{T}\right)\right.$		(13)
		$\displaystyle\left.+\beta\left\\|\mathbf{W}^{T}\mathbf{V}_{l}-\mathbf{Y}\right\\|-tr(\eta^{T}\mathbf{V}_{m}^{p})\right)$		(13)

where $\Phi_{m}=\mathbf{U}_{1}^{p}\mathbf{U}_{2}^{p}\ldots\mathbf{U}_{m}^{p}$ . Accordingly, $\Phi_{m}$ is divided into two parts $\Phi_{ms}$ and $\Phi_{mc}$ . For the convenience of writing, we denote that $\mathbf{A}^{p}=[\Phi_{ms},\Phi_{mc}][\mathbf{V}_{sl}^{p};\mathbf{V}_{cl}]$ , $\mathbf{B}^{p}=[\Phi_{ms},\Phi_{mc}][\mathbf{V}_{su}^{p};\mathbf{V}_{cu}]$ .

The zero gradient condition of $\mathcal{L}\left(\mathbf{V}_{m}^{p}\right)$ with respect to $\mathbf{V}_{sl}^{p},\mathbf{V}_{su}^{p},\mathbf{V}_{cl}$ and $\mathbf{V}_{cu}$ , respectively, we have

$\displaystyle\frac{\partial\mathcal{L}}{\partial\mathbf{V}_{sl}^{p}}=$	$\displaystyle\alpha^{p}\Phi_{ms}^{T}\left(\mathbf{A}^{p}-\mathbf{X}_{l}^{p}\right)+\mu\mathbf{V}_{s}^{p}\mathbf{L}_{l}^{p}+\beta\mathbf{F}_{s}^{p}-\eta_{sl}=0$	(14)
$\displaystyle\frac{\partial\mathcal{L}}{\partial\mathbf{V}_{su}^{p}}=$	$\displaystyle\alpha^{p}\Phi_{ms}^{T}\left(\mathbf{B}^{p}-\mathbf{X}_{u}^{p}\right)+\mu\mathbf{V}_{s}^{p}\mathbf{L}_{u}^{p}-\eta_{su}=0$
$\displaystyle\frac{\partial\mathcal{L}}{\partial\mathbf{V}_{c1}}=$	$\displaystyle\sum_{p=1}^{P}\alpha^{p}\Phi_{mc}^{T}\left(\mathbf{A}^{p}-\mathbf{X}_{l}^{p}\right)+\mu\mathbf{V}_{c}^{p}\mathbf{L}_{l}^{p}+\beta\mathbf{F}_{c}-\eta_{cl}=0$
$\displaystyle\frac{\partial\mathcal{L}}{\partial\mathbf{V}_{cu}}=$	$\displaystyle\sum_{p=1}^{P}\alpha^{p}\Phi_{mc}^{T}\left(\mathbf{B}^{p}-\mathbf{X}_{u}^{p}\right)+\mu\mathbf{V}_{c}^{p}\mathbf{L}_{u}^{p}-\eta_{cu}=0$

where $\mathbf{F}=\mathbf{W}(\mathbf{W}^{T}\mathbf{V}_{l}-\mathbf{Y})=[\mathbf{F}_{s}^{1};...;\mathbf{F}_{s}^{p};\mathbf{F}_{c}]$ . Following a similar proof to [14], using the Karush-Kuhn-Tucker condition, we can formulate the updating rules for $\mathbf{V}_{m}^{p}$ :

		$\displaystyle\mathbf{V}_{sl}^{p}=\mathbf{V}_{sl}^{p}\odot$		(15)
		$\displaystyle\sqrt{\frac{\alpha^{p}[\Phi_{ms}^{T}\mathbf{A}^{p}]^{-}+\alpha^{p}[\Phi_{ms}^{T}\mathbf{X}_{l}^{p}]^{+}+\mu[\mathbf{V}_{s}^{p}\mathbf{L}_{l}^{p}]^{-}+\beta[\mathbf{F}_{s}^{p}]^{-}}{\alpha^{p}[\Phi_{ms}^{T}\mathbf{A}^{p}]^{+}+\alpha^{p}[\Phi_{ms}^{T}\mathbf{X}_{l}^{p}]^{-}+\mu[\mathbf{V}_{s}^{p}\mathbf{L}_{l}^{p}]^{+}+\beta[\mathbf{F}_{s}^{p}]^{+}}}$
		$\displaystyle\mathbf{V}_{su}^{p}=\mathbf{V}_{su}^{p}\odot$
		$\displaystyle\sqrt{\frac{\alpha^{p}[\Phi_{ms}^{T}\mathbf{B}^{p}]^{-}+\alpha^{p}[\Phi_{ms}^{T}\mathbf{X}_{u}^{p}]^{+}+\mu[\mathbf{V}_{s}^{p}\mathbf{L}_{u}^{p}]^{-}}{\alpha^{p}[\Phi_{ms}^{T}\mathbf{B}^{p}]^{+}+\alpha^{p}[\Phi_{ms}^{T}\mathbf{X}_{u}^{p}]^{-}+\mu[\mathbf{V}_{s}^{p}\mathbf{L}_{u}^{p}]^{+}}}$
		$\displaystyle\mathbf{V}_{cl}=\mathbf{V}_{cl}\odot$
		$\displaystyle\sqrt{\frac{\sum\limits_{p=1}^{P}\alpha^{p}\left([\Phi_{mc}^{T}\mathbf{A}^{p}]^{-}+[\Phi_{ms}^{T}\mathbf{X}_{l}^{p}]^{+}\right)+\mu[\mathbf{V}_{c}^{p}\mathbf{L}_{l}^{p}]^{-}+\beta[\mathbf{F}_{c}]^{-}}{\sum\limits_{p=1}^{P}\alpha^{p}\left([\Phi_{ms}^{T}\mathbf{A}^{p}]^{+}+[\Phi_{ms}^{T}\mathbf{X}_{l}^{p}]^{-}\right)+\mu[\mathbf{V}_{c}^{p}\mathbf{L}_{l}^{p}]^{+}+\beta[\mathbf{F}_{c}]^{+}}}$
		$\displaystyle\mathbf{V}_{cu}=\mathbf{V}_{cu}\odot$
		$\displaystyle\sqrt{\frac{\sum\limits_{p=1}^{P}\alpha^{p}[\Phi_{mc}^{T}\mathbf{B}^{p}]^{-}+\sum\limits_{p=1}^{P}\alpha^{p}[\Phi_{ms}^{T}\mathbf{X}_{u}^{p}]^{+}+\mu[\mathbf{V}_{c}^{p}\mathbf{L}_{u}^{p}]^{-}}{\sum\limits_{p=1}^{P}\alpha^{p}[\Phi_{ms}^{T}\mathbf{B}^{p}]^{+}+\sum\limits_{p=1}^{P}\alpha^{p}[\Phi_{ms}^{T}\mathbf{X}_{u}^{p}]^{-}+\mu[\mathbf{V}_{c}^{p}\mathbf{L}_{u}^{p}]^{+}}},$

where $[\mathbf{H}]^{+}$ and $[\mathbf{H}]^{-}$ denote a matrix that all the negative elements are replaced by $0$ and all the positive elements are replaced by $0$ , respectively. That is,

\forall k,j[\mathbf{H}]_{kj}^{+}=\frac{|\mathbf{H}_{kj}|+\mathbf{H}_{kj}}{2},[\mathbf{H}]_{kj}^{-}=\frac{|\mathbf{H}_{kj}|-\mathbf{H}_{kj}}{2}.

(16)

Algorithm 1 Optimization algorithm of PSDMF

0: Multi-view Data

\{\mathbf{X}^{p}\}_{p=1}^{P}

, parameters

\mu,\beta,\gamma,\lambda,K

, layer sizes

k

0: Partially shared latent representation

\mathbf{V}

, the regression weight matrix

\mathbf{W}

1: Initialize

2: Construct partially label matrix

\mathbf{Y}

via Eq. (5)

3: for all layers in each view do

(\mathbf{U}_{i}^{p},\mathbf{V}_{i}^{p})

\leftarrow

Semi-NMF

(\mathbf{V}_{i-1}^{p},\boldsymbol{k}_{i})

5: end for

6: while not converged do

7: Update

\mathbf{W}

via Eq. (12)

8: for

p=1

P

9: Update

\alpha^{p}

via Eq. (9)

10: for

i=1

m

11: Update

\mathbf{U}_{i}^{p}

via Eq. (10)

12: end for

13: Update

\mathbf{V}_{m}^{p}

via Eq. (15)

14: end for

15: end while

TABLE II: Results on five datasets( mean

\pm

standard deviation). Higher value indicates better performance and the highest values are in boldface.

Datasets		DMVC	GMC	lLSMC	LMVSC	DICS	MvSL	PSLF	GPSNMF	Ours
Extended	ACC	50.34 $\pm$ 0.07	43.38 $\pm$ 0.00	53.41 $\pm$ 1.29	36.62 $\pm$ 0.00	47.54 $\pm$ 3.58	19.64 $\pm$ 1.81	43.79 $\pm$ 4.90	69.56 $\pm$ 8.04	87.38 $\pm$ 3.45
Yale B	NMI	49.97 $\pm$ 0.14	44.90 $\pm$ 0.00	53.40 $\pm$ 0.64	28.09 $\pm$ 0.00	50.05 $\pm$ 5.11	11.82 $\pm$ 2.88	32.14 $\pm$ 4.85	60.44 $\pm$ 9.12	83.53 $\pm$ 2.44
	Purity	50.49 $\pm$ 0.07	43.69 $\pm$ 0.00	53.45 $\pm$ 1.26	42.77 $\pm$ 0.00	51.39 $\pm$ 3.82	20.52 $\pm$ 1.66	44.10 $\pm$ 4.36	69.56 $\pm$ 8.04	87.38 $\pm$ 3.45
Prokaryotic	ACC	53.50 $\pm$ 2.23	49.55 $\pm$ 0.00	52.31 $\pm$ 6.01	57.53 $\pm$ 0.00	56.07 $\pm$ 11.76	35.68 $\pm$ 2.51	46.49 $\pm$ 6.63	63.45 $\pm$ 2.59	66.83 $\pm$ 2.94
	NMI	3.28 $\pm$ 0.66	19.34 $\pm$ 0.00	16.49 $\pm$ 8.59	13.37 $\pm$ 0.00	12.63 $\pm$ 9.88	1.05 $\pm$ 0.41	1.69 $\pm$ 1.02	18.46 $\pm$ 5.48	21.08 $\pm$ 4.21
	Purity	57.31 $\pm$ 0.65	58.44 $\pm$ 0.00	63.08 $\pm$ 4.15	62.94 $\pm$ 0.00	60.14 $\pm$ 5.16	57.03 $\pm$ 0.07	57.46 $\pm$ 0.73	63.45 $\pm$ 2.59	66.53 $\pm$ 2.94
Caltech101	ACC	54.60 $\pm$ 0.43	69.20 $\pm$ 0.00	51.98 $\pm$ 2.61	69.47 $\pm$ 0.00	50.95 $\pm$ 10.61	46.42 $\pm$ 4.14	76.88 $\pm$ 3.09	89.77 $\pm$ 2.54	90.24 $\pm$ 3.08
-7	NMI	37.60 $\pm$ 5.79	65.95 $\pm$ 0.00	53.22 $\pm$ 2.22	46.68 $\pm$ 0.00	53.08 $\pm$ 7.71	51.59 $\pm$ 2.81	40.94 $\pm$ 7.75	74.53 $\pm$ 4.10	75.52 $\pm$ 5.36
	Purity	79.55 $\pm$ 4.64	88.47 $\pm$ 0.00	85.61 $\pm$ 1.64	77.07 $\pm$ 0.00	85.28 $\pm$ 3.70	83.73 $\pm$ 1.49	77.56 $\pm$ 4.12	90.38 $\pm$ 2.33	91.12 $\pm$ 2.03
Caltech101	ACC	51.72 $\pm$ 1.72	45.64 $\pm$ 0.00	45.03 $\pm$ 4.16	42.08 $\pm$ 0.00	35.75 $\pm$ 3.83	42.01 $\pm$ 2.86	55.75 $\pm$ 10.35	77.33 $\pm$ 3.34	79.74 $\pm$ 3.73
-20	NMI	52.18 $\pm$ 1.49	48.09 $\pm$ 0.00	61.21 $\pm$ 1.14	49.64 $\pm$ 0.00	55.99 $\pm$ 2.58	60.48 $\pm$ 1.45	44.85 $\pm$ 5.77	67.48 $\pm$ 3.52	71.24 $\pm$ 3.06
	Purity	70.23 $\pm$ 0.92	55.49 $\pm$ 0.00	76.95 $\pm$ 0.79	50.84 $\pm$ 0.00	69.32 $\pm$ 2.47	76.16 $\pm$ 1.17	62.55 $\pm$ 4.94	77.81 $\pm$ 2.60	79.74 $\pm$ 2.84
MSRCV1	ACC	40.76 $\pm$ 2.37	74.76 $\pm$ 0.00	69.91 $\pm$ 1.79	64.76 $\pm$ 0.00	51.91 $\pm$ 6.61	78.07 $\pm$ 5.83	82.96 $\pm$ 4.51	72.17 $\pm$ 4.03	86.24 $\pm$ 5.69
	NMI	27.60 $\pm$ 3.17	70.09 $\pm$ 0.00	62.37 $\pm$ 2.71	58.81 $\pm$ 0.00	50.48 $\pm$ 11.29	68.32 $\pm$ 4.69	75.04 $\pm$ 3.49	67.41 $\pm$ 4.34	76.14 $\pm$ 3.42
	Purity	45.14 $\pm$ 2.41	79.05 $\pm$ 0.00	72.86 $\pm$ 1.38	69.05 $\pm$ 0.00	58.09 $\pm$ 8.68	78.52 $\pm$ 5.73	82.96 $\pm$ 4.51	74.92 $\pm$ 4.44	86.24 $\pm$ 4.45

Until now, we have all the update rules done. Updates iterations repeatedly until convergence. The overall optimization process of PSDMF is outlined in Algorithm 1, where the ”Semi-NMF” procedure performs the pre-train as described earlier. Once the optimal partially shared latent representation matrix $\mathbf{V}$ and regression weight matrix $\mathbf{W}$ are obtained, cluster label $y=\arg\max_{c}y_{c,i}$ where $\boldsymbol{y}_{i}=\mathbf{W}^{T}\boldsymbol{v}_{i}$ .

IV Experiments

To comparatively study the performance of PSDMF, we consider several state-of-the-art baselines as stated bellow:
Multi-view clustering via deep matrix factorization (DMVC) [14], graph-based multi-view clustering (GMC) [21], latent multi-view subspace clustering (lLSMC) [23], large-scale multi-view subspace clustering (LMVSC) [22], multi-view discriminative learning via joint non-negative matrix factorization (DICS) [18], graph-regularized multi-view semantic subspace learning (MvSL) [19], partially shared latent factor learning (PSLF) [10] and semi-supervised multi-view clustering with graph-regularized partially shared non-negative matrix factorization (GPSNMF) [17]. Among them, four methods are unsupervised (i.e., DMVC, GMC, lLSMC, LMVSC) and the others are semi-supervised.

IV-A Datasets and Evaluation Metric

We perform experiments on five benchmark datasets: Extended Yale B, Prokaryotic, MSRCV1, Caltech101-7 and 20. More detailed information about these data is shown in Table III.

TABLE III: Statistics of datasets used in experiments.

Dataset	Dimension of Views	Sizes	Classes
Extended Yale B	2500/3304/6750	650	10
Prokaryotic	393/3/438	551	4
Caltech101-7/20	48/40/254/1984/512/928	1474/2386	7/20
MSRCV1	24/576/512/256/254	210	7

For evaluation metrics, we use Accuracy (ACC), Normalized Mutual Information (NMI) and Purity to comprehensively measure the clustering performance in our experiments. The formal definitions of these metrics, omitted here to save space, can be found in [9].

IV-B Experiment Setup

Following [17], for PSLF, GPSNMF and PSDMF, the dimension of partially shared latent representation $K$ and the common factor ratio $\lambda$ are set to 100 and 0.5, respectively. Thus, $K_{c}+K_{s}\times P=100$ and $K_{c}/(K_{s}+K_{c})=0.5$ . In addition, they all predicate the labels with regression matrix $\mathbf{W}$ . Other than these three algorithms, all the above methods are evaluated by the classic clustering algorithm, K-means. In terms of semi-supervised methods (i.e. DICS, MvSL, PSLF, GPSNMF and PSDMF), we denote $10\%$ as the proportions of the labeled data. Following [14], the layer sizes of deep structure models (i.e. DMVC, PSDMF) are set as $[100,50]$ .

For all compared methods, 10 repeated times are run to reduce the randomness caused by initialization, and report their average scores and standard deviations. As to the compared methods, source codes are obtained from the authors’ websites, and we set their parameters to the optimal value if they have.

IV-C Performance Evaluation

The performances of multi-view clustering is shown in Table II. From the experiment results, we have drawn some conclusions as following: In general, semi-supervised algorithms are superior to unsupervised algorithms. For example, our PSDMF significantly improves the performance compared with DMVC all over the five datasets, even when only $10\%$ label information is used. For the dataset Extended Yale B, we raise the performance bar by around $18\%$ in ACC, $13\%$ in NMI and $18\%$ in Purity. On average, we improve the state-of-the-art GPSNMF by more than $16\%$ , which means multi-layer structure methods have clustering advantages over single-layer methods. Through the partially shared deep representation, PSDMF eliminates the influence of undesirable factors by considering both view-specific and common features, and keeps the important information in the final representation layer. In different datasets experiments, our model achieves better performances than the state-of-the-art methods, which demonstrate the robustness of ours.

IV-D Parameters Sensitivity

There are three nonnegative essential parameters (i.e. $\mu,\beta,\gamma$ ) in our proposed model. $\mu$ controls the smoothness of the partially shared factor. $\gamma$ and $\beta$ controls the sparsity of the weight matrix $\mathbf{W}$ and balances the relation between the regression term and data reconstruction, respectively. Following [10, 17], $\gamma=10$ usually obtains better performance. As is shown in Fig. 3, we test the sensitivity of the parameters $\mu$ and $\beta$ in terms of clustering results, which are tuned in the range of $\{0.001,0.005,0.01,0.05,0.1,0.5\}$ and $\{0.01,0.1,1,10\}$ , respectively. We can observe that the parameters affect the experimental results in different datasets, which indicates that $\mu$ and $\beta$ play an important role in PSDMF. In practice, we fix $\mu=0.1$ and $\beta=10$ as default in our experiments.

V conclusion

In this paper, we introduced a novel semi-supervised deep matrix factorization model for multi-view learning, called PSDMF, which is able to learn a comprehensive partially shared latent final-layer representation for multi-view data. Through the partially shared multi-layer matrix factorization structure, our method is capable of exploiting view-specific and common features among different views, simultaneously. Benefitting from the label regression term, it can incorporate information from the labeled data. Furthermore, we utilize graph regularization and auto-weighted strategy to preserve the intrinsic geometric structure of the data. An iterative optimization algorithm is developed to deal with the PSDMF. Experimental results show that the proposed model achieves superior performances.

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61722304, 61801133, and 61803096, in part by the Guangdong Science and Technology Foundation under Grant Nos. 2019B010118001, 2019B010121001, and 2019B010154002, National Key Research and Development Project, China under Grant No. 2018YFB1802400.

References

[1] S. Sun, “A survey of multi-view machine learning,” Neural computing and applications, vol. 23, no. 7–8, pp. 2031–2038, 2013
[2] Y. Li, M. Yang, Z. Zhang, “A survey of multi-view representation learning,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 10, pp. 1863–1883, 2018.
[3] R. Zhang, F. Nie, X. Li and X. Wei, “Feature selection with multi-view data: A survey,” Information Fusion, vol. 50, pp. 158–167, 2019.
[4] D. D. Lee, H. S. Seung, “Algorithms for non-negative matrix factorization,” in NIPS, pp. 556–562, 2001.
[5] Z. Yang, Y. Xiang, K. Xie and Y. Lai, “Adaptive method for nonsmooth nonnegative matrix factorization,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 4, pp. 948–960, 2017.
[6] X. Zhang, L. Zong, X. Liu and H. Yu, “Constrained NMF-based multi-view clustering on unmapped data,” in AAAI, pp. 3174–3180, 2015.
[7] J. Liu, C. Wang, J. Gao and J. Han, “Multi-view clustering via joint nonnegative matrix factorization,” in SDM, pp. 252–260, 2013.
[8] L. Zong, X. Zhang, L. Zhao, H. Yu and Q. Zhao, “Multi-view clustering via multi-manifold regularized non-negative matrix factorization,” Neural Netw., vol. 88, pp. 74–89, 2017.
[9] Z. Yang, N. Liang, W. Yan, Z. Li and S. Xie, “Uniform distribution non-negative matrix factorization for multiview clustering,” IEEE Trans. Cybern., in press, 2020.
[10] J. Liu, Y. Jiang, Z. Li, Z. H. Zhou and H. Lu, “Partially shared latent factor learning with multiview data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 6, pp. 1233–1246, 2015.
[11] L. Zhao, T. Yang, J. Zhang, Z. Chen, Y. Yang and Z. J. Wang, ”Co-learning non-negative correlated and uncorrelated features for multi-view data,” IEEE Trans. Neural Netw. Learn. Syst., in press, 2020.
[12] Y. Meng, R. Shang, F. Shang, L. Jiao, S. Yang and R. Stolkin, “Semi-supervised graph regularized deep NMF with bi-orthogonal constraints for data representation,” IEEE Trans. Neural Netw. Learn. Syst., in press, 2019.
[13] W. Wang, R. Arora, K. Livescu and J. Bilmes, “On deep multi-view representation learning,” in ICML, pp. 1083–1092, 2015.
[14] H. Zhao, Z. Ding and Y. Fu, “Multi-view clustering via deep matrix factorization,” in AAAI, pp. 2921–2927, 2017.
[15] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. W. Schuller, “A deep matrix factorization method for learning attribute representations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 3, pp. 417–429, 2017.
[16] S. Huang, Z. Kang and Z. Xu, “Auto-weighted multi-view clustering via deep matrix decomposition,” Pattern Recognition, vol. 97, no. 107015, 2020.
[17] N. Liang, Z. Yang, Z. Li, S. Xie and C. Su, “Semi-supervised multi-view clustering with Graph-regularized Partially Shared Non-negative Matrix Factorization,” Knowledge-Based Systems, vol. 190, no. 105185, 2020.
[18] Z. Zhang, Z. Qin, P. Li, Q. Yang and J. Shao, “Multi-view discriminative learning via joint non-negative matrix factorization,” in Int. Conf. Database Systems for Advanced Applications, pp. 542–557, 2018.
[19] J. Peng, P. Luo, Z. Guan and J. Fan, ”Graph-regularized multi-view semantic subspace learning,” Int. Jour. Mach. Learn. Cybern., vol. 10, no. 5, pp. 879–895, 2019.
[20] C. H. Ding, T. Li, and M. I. Jordan, “Convex and semi-nonnegative matrix factorizations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 45–55, 2010.
[21] H. Wang, Y. Yang and B. Liu, “GMC: Graph-based multi-view clustering,” IEEE Trans. Knowl. Data Eng., vol. 32, no. 6, pp. 1116–1129, 2019.
[22] Z. Kang, W. Zhou, Z. Zhao, J. Shao, M. Han and Z. Xu, “Large-scale multi-view subspace clustering in linear time,” in AAAI, 2020.
[23] C. Zhang, H. Fu, Q. Hu, X. Cao, Y. Xie, D. Tao and D. Xu, “Generalized latent multi-view subspace clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 1, pp. 86–99, 2020.
[24] W. Zhao, C. Xu, Z. Guan and Y. Liu, “Multiview concept learning via deep matrix factorization,” IEEE Trans. Neural Netw. Learn. Syst., in press, 2020.
[25] F. Nie, G. Cai, J. Li and X. Li, “Auto-weighted multi-view learning for image clustering and semi-supervised classification,” IEEE Trans. Image Process., vol. 27, no. 3, pp. 1501–1511, 2018.
[26] Z. Kang, X. Lu, J. Yi and Z. Xu, “Self-weighted multiple kernel learning for graph-based clustering and semi-supervised classification,” in IJCAI, pp. 2312–2318, 2018.
[27] F. Nie, X. Cai, H. Huang and C. Ding, “Efficient and robust feature selection via joint l2, 1-norms minimization,” in NIPS, pp. 1813–1821, 2010.
[28] D. Cai, X. He, J. Han and T.S. Huang, “Graph regularized nonnegative matrix factorization for data representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33 no. 8, pp. 1548–1560, 2011.