
An adaptive weighted self-representation method for incomplete multi-view clustering

Lishan Feng ([email protected]), Guoxu Zhou ([email protected]), Jingya Chang ([email protected], corresponding author)

1. School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou 510520, China
2. School of Automation, Guangdong University of Technology, Guangzhou 510006, China
3. School of Mathematics and Statistics, Center for Mathematics and Interdisciplinary Sciences, Guangdong University of Technology, Guangzhou 510520, China
Abstract

For multi-view data collected in practice, some elements may be missing because of human or machine error. Incomplete multi-view clustering (IMC) groups such incomplete multi-view data according to the characteristics of the various views of the instances. Recently, IMC has attracted much attention and many related methods have been proposed. However, the existing approaches still leave room for improvement in the following aspects: (1) Current methods only consider the differences among views, while the differing influences of individual instances, as well as the distinction between missing values and observed values, are ignored. (2) The updating scheme for the weighting matrix in adaptive weighted algorithms usually relies on an optimization sub-problem whose optimal solution may not be easy to obtain. (3) The adaptive weighted subspace algorithms that can recover the incomplete data are anchor-based. The randomness of the anchor matrix may cause unreliability. To tackle these limitations, we propose an adaptive weighted self-representation (AWSR) subspace method for IMC. The AWSR method tunes the weighting matrix adaptively according to the views of different instances and the recovery process of the missing values. Low-rank and smoothness constraints on the representation matrix make the subspace reveal the underlying features of the dataset accurately. We also analyze the convergence of the block coordinate descent method for our optimization model theoretically. Numerical experiments on five real-world datasets show that the AWSR method is effective and delivers superior results compared with eight other widely used approaches in terms of clustering accuracy (ACC), normalized mutual information (NMI) and Purity.

keywords:
Multi-view clustering, Incomplete data, Subspace learning, Self-representation learning

1 Introduction

The rapid development of science and technology provides many ways to obtain information about an object. One can receive news from different news organizations or media. The same semantic content can be expressed in different languages through a smartphone application. However, in reality, due to human or machine error, such as lost files, anonymization, equipment malfunctions, technical failures and so on, it is common for part of the collected information to be missing. Data missing can be divided into two categories: value missing and view missing. Value missing means that some elements of the features are absent [5]. View missing indicates the complete absence of some features [21, 5], which is a special case of value missing. In this paper, we concentrate on the view-missing incomplete multi-view clustering (IMC) problem.

1.1 Methods from MVC (multi-view clustering) to IMC and the auto-weighted techniques

In this subsection, we review subspace methods for MVC and approaches for dealing with missing data in IMC. Finally, we summarize the auto-weighted techniques used for MVC.

Among MVC methods, the class of subspace clustering methods is widely studied and used [16, 39, 8, 45]. These methods first map the original data into a low-dimensional subspace, and then obtain the clustering results by applying a spectral clustering approach to the subspace. One kind of subspace clustering approach assumes that the features of items in the dataset are correlated and that each item can be represented numerically by the other items. In this context, self-representation subspace clustering methods arise [21, 31]. Another kind of subspace clustering method for MVC is the anchor-based approach, which chooses a small set of points called anchors as the basis of the feature space and generates the subspace of the given data from the anchors [12, 30, 11]. Moreover, Zhao et al. [44] proposed the Reinforced Tensor Graph Neural Network (RTGNN) method to learn multi-view graph data, which employs tensor decomposition to obtain the graph structure features of each view in a common feature space.

Compared to MVC, the missing data in IMC create difficulties for clustering. One way of dealing with missing data is to ignore it or simply fill it with a constant. Deng et al. introduced a missing index matrix to utilize the complete instances and formulated a projection model for IMC [7]. Hu et al. removed the effect of missing data by assigning 0 to it [14]. Zhao et al. suggested using the average vector of the non-missing views to estimate the missing instances [42]. These methods fail to explore the implicit information in the missing data and may lead to inaccurate clustering results. The other way of dealing with missing data is to recover it. A popular technique for restoring missing data is to combine matrix completion and the clustering task throughout the clustering process. Yin et al. reconstructed the incomplete data by adding a regularization term to keep the estimated matrix close enough to the existing data [37]. Liu et al. distinguished the complete and incomplete views and recovered the missing data via a self-representation model [21].

Different views may have different influences on the clustering result. For example, when clustering leaf data, the view derived from the color of a leaf may be more important than the view given by its shape. Treating all features equally may therefore harm clustering. Hence, a weighting matrix is used to adjust the importance of different views [28, 8]. Zhao et al. proposed a constant diagonal matrix to measure the importance of different views [42]. A majority of the weighted MVC methods are auto-weighted: the diagonal elements of the weighting matrix are restricted to the unit interval and sum to one. For subspace MVC methods, the weight is commonly multiplied by the representation error of the corresponding view [36, 34, 29, 30, 12, 22, 33, 18, 23, 15]. The weight is then updated by solving a constrained sub optimization problem whose stationary point always has a closed form. In another case, a Frobenius norm regularization term on the weights is added to smooth the weight distribution [26, 43]. In this case, the sub optimization problem for updating the weights becomes complex.

1.2 Motivation and contribution

Although great progress has been made in the area of IMC, current auto-weighted subspace IMC methods still face challenges. 1) The weighting mechanism multiplies the weight by the total representation error of the corresponding view, so the features of different instances are treated equally. However, the significance of different features varies from instance to instance. For example, the color feature plays different roles when clustering leaves and when clustering persons. Moreover, the known and unknown elements within the same feature are not distinguished and are treated equally. 2) The adaptive updating scheme of the weighting matrix relies on the optimal solution of a constrained optimization problem, whose global solution is not easy to find. 3) The existing auto-weighted subspace IMC methods that can recover the incomplete data are anchor-based methods. The randomness of the anchor matrix induces unreliability of the clustering results.

In order to address the above limitations, we propose an adaptive weighted self-representation subspace clustering (AWSR) method for the IMC problem. The framework of this approach is intuitively shown in Figure 1.

Figure 1: Overview of the algorithm. Here is an example of three views, including RGB, Text and Shape modalities. From left to right, first, feature vectors are extracted from the incomplete multi-view data 𝐗\mathbf{X}. Then the low rank and smoothness properties are imposed on the consensus coefficient matrix 𝐙\mathbf{Z}. Meanwhile, auto-weighted self-representation learning and missing data imputation are jointly performed to obtain the consensus representation matrix. Finally, cluster assignments are given based on the resulting affinity matrix.

In the AWSR model, the construction of the consensus representation matrix is designed in conjunction with the imputation of the missing data. The weighting matrix is adaptively updated according to the values assigned to the missing data and the features of different instances. Unlike traditional methods, we also impose low-rank and smoothness constraints on the representation matrix. The representation matrix, the missing data and the weighting matrix are updated alternately until an optimal point is reached.

In terms of the IMC problem, the contributions of our paper include:

(1) To the best of our knowledge, the proposed AWSR method is one of the first attempts to employ a self-representation subspace algorithm with an auto-weighted matrix that obtains the representation matrix and recovers the incomplete data simultaneously. A novel updating scheme for the weighting matrix is given, which not only avoids solving a sub optimization problem but also treats the values of missing data and existing data differently. Both the low-rank and smoothness properties of the representation matrix are considered.

(2) In the computation process, we introduce a splitting variable that enhances the convexity of the objective function in the optimization model and makes the model easy to solve by the block coordinate descent (BCD) algorithm. In theory, we prove the global convergence of the iteration sequence.

(3) The numerical experiments show the superiority of our method over eight other state-of-the-art approaches for IMC on five benchmark datasets.

2 Notations and related work

Table 1: Summary of the notations.
Notation  Description
\mathbf{X}_{i}  The data matrix of the ith view.
\mathbf{Z}  The consensus coefficient matrix for all views.
\mathbf{W}^{(i)}  The weighting matrix of the ith view.
\mathbf{Z}_{i}  The coefficient matrix of the ith view.
v  The number of views.
\|\cdot\|_{F}  The Frobenius norm of a matrix.
\|\cdot\|_{*}  The nuclear norm of a matrix.
\mathrm{Tr}(\cdot)  The trace of a matrix.
\mathbf{I}  The identity matrix.
\varOmega  The feasible regions of the variables.
\otimes  The Kronecker product.
\text{diag}(\mathbf{Z})=0  The diagonal elements of matrix \mathbf{Z} are zero.

First, we introduce the notation and preliminary knowledge. In this paper, matrices are denoted by bold capital letters, while bold lowercase letters and lowercase letters represent vectors and scalars, respectively. The nuclear norm of a matrix \mathbf{Z} is defined as \left\|\mathbf{Z}\right\|_{*}=\sum_{i}^{n}\sigma_{i}, where \sigma_{i} is the ith singular value of \mathbf{Z}. The Frobenius norm of \mathbf{Z} is \left\|\mathbf{Z}\right\|_{F}=(\sum_{i=1}^{n}\sum_{j=1}^{n}\mathbf{Z}_{ij}^{2})^{\frac{1}{2}}.
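As a quick sanity check on these definitions, the following minimal sketch (an illustration added here, not part of the original derivation) evaluates both norms with NumPy and verifies them against their explicit formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 5))

nuclear = np.linalg.norm(Z, ord="nuc")    # sum of the singular values of Z
frobenius = np.linalg.norm(Z, ord="fro")  # square root of the sum of squared entries

# The explicit definitions used in the text.
assert np.isclose(nuclear, np.linalg.svd(Z, compute_uv=False).sum())
assert np.isclose(frobenius, np.sqrt((Z ** 2).sum()))
```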

Definition 1 (Subspace clustering [24]).

Consider a collection of samples \mathbf{X}=[\mathbf{x}_{1},\cdots,\mathbf{x}_{n}]. Suppose that \mathbf{X} is drawn from a union of k subspaces \{S_{i}\}_{i=1}^{k} with dimensionalities \{D_{i}\}_{i=1}^{k}. The objective of subspace clustering is to segment the samples according to the underlying subspaces from which they are drawn.

The subspace here can be described as S_{i}=\{x\in\mathbb{R}^{D_{i}}:x=\nu_{i}+B_{i}y\}, where \nu_{i} is an arbitrary point in S_{i}, B_{i} is a basis of S_{i}, and y is a low-dimensional representation of x [32]. For example, the data points in the partition part of Figure 1 belong to three subspaces of the space \mathbb{R}^{3}.

One of the most important subspace clustering approaches is the self-representation subspace method [21]. When the number of instances is large, the data of one instance can be represented by the data of the others. The representation matrix characterizes the latent subspace. As a surrogate of the original dataset, the latent subspace usually inherits some properties of the original data, and certain constraints, such as sparsity, low rank or smoothness, are usually imposed on the representation matrix. The task of self-representation subspace multi-view clustering is to solve

\min\limits_{\mathbf{Z}_{*},\{\mathbf{Z}_{i}\}_{i=1}^{v}}\ \sum\limits_{i=1}^{v}\ell(\mathbf{X}_{i}-\mathbf{X}_{i}\mathbf{Z}_{i})+\lambda\mathcal{R}(\mathbf{Z}_{*},\{\mathbf{Z}_{i}\}_{i=1}^{v}),\quad\text{s.t.}\ \varOmega(\mathbf{Z}_{*},\{\mathbf{Z}_{i}\}_{i=1}^{v}),    (1)

where \ell is a loss function measuring the representation error, \mathcal{R} is the regularization term, and \mathbf{Z}_{*} is the consensus coefficient matrix of all views. \varOmega defines the feasible regions of the variables. The regularization term is usually determined by the expected properties of the subspace.

For a complete single-view dataset, if the subspace is low rank, the low rank self-representation model is [19]

\min\limits_{\mathbf{Z}_{1}}\ \left\|\mathbf{Z}_{1}\right\|_{*},\quad\text{s.t.}\ \mathbf{X}_{1}=\mathbf{X}_{1}\mathbf{Z}_{1},\ \text{diag}(\mathbf{Z}_{1})=0.    (2)

In order to better capture the within-cluster affinities and the smoothness property, the self-representation model becomes [24]

\min\limits_{\mathbf{Z}_{1}}\ \left\|\mathbf{X}_{1}-\mathbf{X}_{1}\mathbf{Z}_{1}\right\|_{F}^{2}+\lambda\left\|\mathbf{Z}_{1}\right\|_{F}^{2},\quad\text{s.t.}\ \text{diag}(\mathbf{Z}_{1})=0.    (3)

Following the framework of (Eq.(3)), the complete multi-view subspace clustering problem can be written as

\min\limits_{\mathbf{Z}_{i}}\ \frac{1}{v}\sum\limits_{i=1}^{v}\left\|\mathbf{X}_{i}-\mathbf{X}_{i}\mathbf{Z}_{i}\right\|_{F}^{2}+\lambda\left\|\mathbf{Z}_{i}\right\|_{F}^{2},\quad\text{s.t.}\ \text{diag}(\mathbf{Z}_{i})=0.    (4)
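For concreteness, the following sketch solves the per-view problem of Eq.(3)/(4). It is an assumed illustration: the data are synthetic, and the diag(\mathbf{Z}_{1})=0 constraint is only approximated by zeroing the diagonal after the unconstrained ridge solve rather than enforced exactly.

```python
import numpy as np

def self_representation(X, lam):
    """Unconstrained closed form of min_Z ||X - XZ||_F^2 + lam ||Z||_F^2,
    i.e. (X^T X + lam I) Z = X^T X, with the diagonal zeroed afterwards."""
    n = X.shape[1]
    G = X.T @ X
    Z = np.linalg.solve(G + lam * np.eye(n), G)
    np.fill_diagonal(Z, 0.0)  # crude surrogate for the diag(Z) = 0 constraint
    return Z

rng = np.random.default_rng(1)
X1 = rng.standard_normal((50, 120))   # one view: 50 features, 120 instances
Z1 = self_representation(X1, lam=0.1)
print(Z1.shape)                        # (120, 120)
```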

3 The proposed AWSR model

When the data are incomplete, the missing data matrix is explicitly expressed and involved in the optimization process. Liu et al. proposed the following model [21]:

\min\limits_{\mathbf{Z},\{\mathbf{X}_{i}^{(m)}\}_{i=1}^{v},\mathbf{F}}\ \frac{1}{v}\sum\limits_{i=1}^{v}\left\|[\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]-[\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]\mathbf{Z}\right\|_{F}^{2}+\lambda\left\|\mathbf{Z}\right\|_{F}^{2}+\gamma\left\|\mathbf{Z}-\mathbf{F}^{\top}\mathbf{F}\right\|_{F}^{2},\quad\text{s.t.}\ \text{diag}(\mathbf{Z})=0,\ \mathbf{F}\mathbf{F}^{\top}=\mathbf{I}_{k},    (5)

where \mathbf{X}_{i}^{(m)} stores the missing data of the ith view and \mathbf{X}_{i}^{(o)} stores the observed data of the ith view. In this model, all views share a consensus coefficient matrix \mathbf{Z}.

Since different views influence the clustering result differently, it has become common to include self- or auto-weighted techniques in MVC models. Two general auto-weighted models are widely used [12, 34, 26, 43]:

\min\limits_{\mathbf{Z}_{*},\{\mathbf{Z}_{i}\}_{i=1}^{v}}\ \sum\limits_{i=1}^{v}\mathrm{w}_{i}\,\ell(\mathbf{X}_{i}-\mathbf{X}_{i}\mathbf{Z}_{i})+\lambda\mathcal{R}(\mathbf{Z}_{*},\{\mathbf{Z}_{i}\}_{i=1}^{v}),\quad\text{s.t.}\ \varOmega(\mathbf{Z}_{*},\{\mathbf{Z}_{i}\}_{i=1}^{v},\mathbf{W}),    (6)

and

\min\limits_{\mathbf{Z}_{*},\{\mathbf{Z}_{i}\}_{i=1}^{v}}\ \sum\limits_{i=1}^{v}\mathrm{w}_{i}\,\ell(\mathbf{X}_{i}-\mathbf{X}_{i}\mathbf{Z}_{i})+\lambda_{1}\mathcal{R}(\mathbf{Z}_{*},\{\mathbf{Z}_{i}\}_{i=1}^{v})+\lambda_{2}\mathcal{R}(\mathbf{W}),\quad\text{s.t.}\ \varOmega(\mathbf{Z}_{*},\{\mathbf{Z}_{i}\}_{i=1}^{v},\mathbf{W}).    (7)

The vector \mathbf{w}=(\mathrm{w}_{1},\ldots,\mathrm{w}_{v}) collects the diagonal entries of the diagonal weighting matrix \mathbf{W}. The regularization term \mathcal{R}(\mathbf{W}) is the Frobenius norm in [26, 43].

As pointed out in [8], when coping with various real-life applications, auto-weighting should be more adaptive and automatic. Existing weighting matrices concentrate only on the differences among the influences of different views; the differences among instances, as well as the dissimilarities between missing values and observed values, are neglected. Besides, the weighting matrix is updated by solving a sub optimization problem, which may be troublesome. To overcome these difficulties, we suggest a new way to produce the weighting matrix automatically. Initially, the weighting matrix is a diagonal matrix whose diagonal elements are

\mathbf{W}_{jj}^{(i)}=\begin{cases}1,&\text{if the }j\text{th item is observed in the }i\text{th view},\\ 0,&\text{otherwise}.\end{cases}    (8)

Because a missing view is totally unknown at first, the zero diagonal elements of this weighting matrix make the representation error caused by the corresponding missing view zero. At the beginning, the values of the data in the missing views are set to zero. During the iteration process, some elements of a missing view are recovered while others remain zero. In this case, the influence of the recovered data should be taken into account. Therefore, the elements of the weighting matrix at the positions where the missing data are recovered are set to

\mathbf{W}_{mj}^{(i)}=\begin{cases}\alpha_{mj}^{(i)},&\text{if }\mathbf{X}_{i,mj}\text{ is recovered or nonzero},\\ 0,&\text{otherwise}.\end{cases}    (9)

This updating scheme means that if the mth element of the jth instance in the ith missing view, \mathbf{X}_{i,mj}, is recovered, then \mathbf{W}_{mj}^{(i)}=\alpha_{mj}^{(i)}. Here \alpha_{mj}^{(i)} is nonzero and determined by either the value of the missing element \mathbf{X}_{i,mj} or an empirical constant. On the other hand, if the nth element of the jth instance in the ith missing view, \mathbf{X}_{i,nj}, is zero, then \mathbf{W}_{nj}^{(i)}=0. We illustrate the influence tuned by \mathbf{W}_{mj}^{(i)} and \mathbf{W}_{nj}^{(i)} in detail. Let \mathbf{E}^{i}=\mathbf{X}_{i}-\mathbf{X}_{i}\mathbf{Z}. The matrix product

\mathbf{E}^{i}(\mathbf{W}^{(i)})^{\top}=\left(\begin{array}{ccc}\cdots&\mathbf{E}^{i}_{1j}&\cdots\\ &\vdots&\\ \cdots&\mathbf{E}^{i}_{pj}&\cdots\end{array}\right)\left(\begin{array}{ccccc}\cdots&\vdots&\cdots&\vdots&\cdots\\ \cdots&\mathbf{W}_{mj}^{(i)}&\cdots&\mathbf{W}_{nj}^{(i)}&\cdots\\ \cdots&\vdots&\cdots&\vdots&\cdots\end{array}\right)

implies that \mathbf{W}_{mj}^{(i)}=\alpha_{mj}^{(i)} affirms the effect of the jth instance in the ith view on the clustering result, while \mathbf{W}_{nj}^{(i)}=0 suppresses it. Therefore, by utilizing the adaptive weighting matrix we ensure that the recovered data influence the clustering result while the influence of the unrecovered data is eliminated, and the influences of different instances in the same view are distinguished. Furthermore, this updating scheme avoids solving an optimization sub-problem for \mathbf{W}^{(i)}.
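The sketch below gives one possible reading of the rule in Eq.(8)-(9). It is an assumption-laden illustration: \mathbf{W}^{(i)} is stored element-wise with the same shape as \mathbf{X}_{i}, observed instances receive weight 1, imputed (now nonzero) entries of missing instances receive \alpha (set to 1 in Section 5.4.1), and entries that are still zero keep weight 0.

```python
import numpy as np

def update_weight(X_i, missing_mask, alpha=1.0):
    """Return W^{(i)} with the same shape as X_i.

    X_i          : i-th view with missing instances zero-filled / imputed.
    missing_mask : boolean array, True where the entry belongs to a missing instance.
    """
    W = np.zeros_like(X_i, dtype=float)
    W[~missing_mask] = 1.0                    # Eq.(8): observed entries keep weight 1
    recovered = missing_mask & (X_i != 0)     # Eq.(9): entries that have been recovered
    W[recovered] = alpha
    return W                                  # still-zero missing entries keep weight 0
```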

In order to obtain a low-rank and smooth coefficient matrix, the nuclear norm and the Frobenius norm are employed as regularization terms, which also capture the data correlation structure in the same space and help to solve the minimization problem effectively. Finally, the overall model is

\min\limits_{\mathbf{Z},\{\mathbf{X}_{i}^{(m)}\}_{i=1}^{v}}\ \sum\limits_{i=1}^{v}\left\|([\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]-[\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]\mathbf{Z}){\mathbf{W}^{(i)}}^{\top}\right\|_{F}^{2}+\gamma\left\|\mathbf{Z}\right\|_{*}+\frac{\lambda}{2}\left\|\mathbf{Z}\right\|_{F}^{2},\quad\text{s.t.}\ \text{diag}(\mathbf{Z})=0,    (10)

where \gamma,\lambda are positive trade-off parameters and \mathbf{X}_{i}^{(o)} and \mathbf{X}_{i}^{(m)} refer to the observed and missing entries of the ith data view, respectively.

In summary, the entire framework of our method is as follows. First, we construct the adaptive weighted linearly constrained optimization model (Eq.(10)) based on the subspace representation approach. Then, problem (Eq.(10)) is solved iteratively by the BCD method. Finally, the Laplacian (affinity) matrix is generated from the coefficient matrix \mathbf{Z} and the spectral clustering approach [4] is employed to cluster the given objects.
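A minimal sketch of the last two steps of this pipeline (assuming scikit-learn is available for the spectral clustering step and that the coefficient matrix Z has already been learned) is:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_Z(Z, k):
    """Affinity L = (|Z| + |Z^T|)/2 followed by spectral clustering into k clusters."""
    L = 0.5 * (np.abs(Z) + np.abs(Z.T))
    sc = SpectralClustering(n_clusters=k, affinity="precomputed", random_state=0)
    return sc.fit_predict(L)
```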

4 Computation of the AWSR model

In this section, we describe the computation process for the optimization model (Eq.(10)). First, we note that the nonzero elements of the weighting matrix \mathbf{W} are determined empirically according to the numerical results. In order to enhance the convexity of the objective function and obtain a separable optimization model, we introduce a splitting variable \mathbf{J} and convert model (Eq.(10)) into

\min\limits_{\mathbf{J},\mathbf{Z},\{\mathbf{X}_{i}^{(m)}\}_{i=1}^{v}}\ \gamma\left\|\mathbf{Z}\right\|_{*}+\frac{\lambda}{2}\left\|\mathbf{Z}\right\|_{F}^{2}+\frac{\alpha}{2}\left\|\mathbf{J}-\mathbf{Z}\right\|_{F}^{2}+\sum\limits_{i=1}^{v}\left\|([\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]-[\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]\mathbf{J}){\mathbf{W}^{(i)}}^{\top}\right\|_{F}^{2},\quad\text{s.t.}\ \text{diag}(\mathbf{Z})=0.    (11)

Problems (Eq.(10)) and (Eq.(11)) are equivalent when \alpha>0 is large enough. The strong convexity of the objective function implies that the solution is stable and unique. We apply an alternating strategy [20] to the computation of the above optimization problem.

4.1 Sub-problem for J:

To update \mathbf{J}, we fix the values of all other variables. The sub-problem related to \mathbf{J} is

\min\limits_{\mathbf{J}}\ \psi(\mathbf{J})=\sum\limits_{i=1}^{v}\left\|([\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]-[\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]\mathbf{J}){\mathbf{W}^{(i)}}^{\top}\right\|_{F}^{2}+\frac{\alpha}{2}\left\|\mathbf{J}-\mathbf{Z}\right\|_{F}^{2}.    (12)

Problem (Eq.(12)) is equivalent to

\min\limits_{\mathbf{J}}\ \psi(\mathbf{J})=\mathrm{Tr}\Big(\sum\limits_{i=1}^{v}\mathbf{X}_{i}\mathbf{J}{\mathbf{W}^{(i)}}^{\top}\mathbf{W}^{(i)}\mathbf{J}^{\top}{\mathbf{X}_{i}}^{\top}-\mathbf{X}_{i}\mathbf{J}{\mathbf{W}^{(i)}}^{\top}\mathbf{W}^{(i)}{\mathbf{X}_{i}}^{\top}-\mathbf{X}_{i}{\mathbf{W}^{(i)}}^{\top}\mathbf{W}^{(i)}{\mathbf{J}}^{\top}{\mathbf{X}_{i}}^{\top}\Big)+\frac{\alpha}{2}\mathrm{Tr}(\mathbf{J}^{\top}\mathbf{J}-\mathbf{J}^{\top}\mathbf{Z}-\mathbf{Z}^{\top}\mathbf{J}),    (13)

in which \mathbf{X}_{i}=[\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]. Taking the derivative of the objective function in (Eq.(13)) with respect to \mathbf{J} and setting it to zero, we get

\sum\limits_{i=1}^{v}{\mathbf{X}_{i}}^{\top}\mathbf{X}_{i}\mathbf{J}{\mathbf{W}^{(i)}}^{\top}\mathbf{W}^{(i)}+\frac{\alpha}{2}\mathbf{J}-\sum\limits_{i=1}^{v}{\mathbf{X}_{i}}^{\top}\mathbf{X}_{i}{\mathbf{W}^{(i)}}^{\top}\mathbf{W}^{(i)}-\frac{\alpha}{2}\mathbf{Z}=0.    (14)

Since the matrix equation \mathbf{A}\mathbf{X}\mathbf{B}=\mathbf{C} is equivalent to [25]

(\mathbf{B}^{\top}\otimes\mathbf{A})\,\text{vec}(\mathbf{X})=\text{vec}(\mathbf{C}),

where \otimes denotes the Kronecker product, (Eq.(14)) is equivalent to

\left[\sum\limits_{i=1}^{v}{\mathbf{W}^{(i)}}^{\top}\mathbf{W}^{(i)}\otimes{\mathbf{X}_{i}}^{\top}\mathbf{X}_{i}+\frac{\alpha}{2}\mathbf{I}\right]\text{vec}(\mathbf{J})=\text{vec}(\mathbf{C}).    (15)

Here \mathbf{C}=\sum\limits_{i=1}^{v}{\mathbf{X}_{i}}^{\top}\mathbf{X}_{i}{\mathbf{W}^{(i)}}^{\top}\mathbf{W}^{(i)}+\frac{\alpha}{2}\mathbf{Z}. The coefficient matrix of the linear system (Eq.(15)) is positive definite. Hence, \text{vec}(\mathbf{J}) can be computed by the classical conjugate gradient method, and we obtain the matrix \mathbf{J} by rearranging \text{vec}(\mathbf{J}).
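The sketch below is an assumed reimplementation (not the authors' code) of this step: it runs conjugate gradients directly on the matrix equation (Eq.(14)), so the Kronecker product in (Eq.(15)) is never formed; the linear operator is applied in product form and inner products are taken in the Frobenius sense.

```python
import numpy as np

def solve_J(Xs, Ws, Z, alpha, iters=200, tol=1e-8):
    """Solve sum_i X_i^T X_i J W_i^T W_i + (alpha/2) J = C  (Eq.(14)) for J."""
    XtX = [X.T @ X for X in Xs]          # n x n Gram matrices of the views
    WtW = [W.T @ W for W in Ws]          # n x n weighted Gram matrices
    A = lambda J: sum(G @ J @ H for G, H in zip(XtX, WtW)) + 0.5 * alpha * J
    C = sum(G @ H for G, H in zip(XtX, WtW)) + 0.5 * alpha * Z

    J = np.zeros_like(Z)
    R = C - A(J)                         # residual of the matrix equation
    P = R.copy()
    rs = np.sum(R * R)
    for _ in range(iters):
        AP = A(P)
        step = rs / np.sum(P * AP)
        J += step * P
        R -= step * AP
        rs_new = np.sum(R * R)
        if np.sqrt(rs_new) < tol:
            break
        P = R + (rs_new / rs) * P
        rs = rs_new
    return J
```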

4.2 Sub-problem for \mathbf{X}_{i}^{(m)}:

To update \mathbf{X}_{i}^{(m)}, we fix the values of all other variables. The sub-problem related to \mathbf{X}_{i}^{(m)} is

\min\limits_{\mathbf{X}_{i}^{(m)}}\ \psi(\mathbf{X}_{i}^{(m)})=\sum\limits_{i=1}^{v}\left\|([\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]-[\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]\mathbf{J}){\mathbf{W}^{(i)}}^{\top}\right\|_{F}^{2}.    (16)

Problem (Eq.(16)) can be transformed into

\min\limits_{\mathbf{X}_{i}^{(m)}}\ \psi(\mathbf{X}_{i}^{(m)})=\mathrm{Tr}\big([\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]\,\mathbf{B}_{i}\,{[\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]}^{\top}\big),\quad\text{for}\ i=1,\ldots,v,    (17)

where

\mathbf{B}_{i}={\mathbf{W}^{(i)}}^{\top}\mathbf{W}^{(i)}-\mathbf{J}{\mathbf{W}^{(i)}}^{\top}\mathbf{W}^{(i)}-{\mathbf{W}^{(i)}}^{\top}\mathbf{W}^{(i)}\mathbf{J}^{\top}+\mathbf{J}{\mathbf{W}^{(i)}}^{\top}\mathbf{W}^{(i)}\mathbf{J}^{\top}.    (18)

Because \mathbf{X}_{i} is divided into observed and missing data, i.e., \mathbf{X}_{i}=[\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}], we rewrite \mathbf{B}_{i} in block form. Then (Eq.(17)) is equivalent to

\min\limits_{\mathbf{X}_{i}^{(m)}}\ \psi(\mathbf{X}_{i}^{(m)})=\mathrm{Tr}\left([\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]\left(\begin{array}{cc}\mathbf{B}_{i}^{(oo)}&\mathbf{B}_{i}^{(om)}\\ \mathbf{B}_{i}^{(mo)}&\mathbf{B}_{i}^{(mm)}\end{array}\right){[\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]}^{\top}\right).    (19)

Dropping the terms that do not depend on \mathbf{X}_{i}^{(m)}, problem (Eq.(19)) can further be written as

\min\limits_{\mathbf{X}_{i}^{(m)}}\ \psi(\mathbf{X}_{i}^{(m)})=\mathrm{Tr}\big(\mathbf{X}_{i}^{(o)}(\mathbf{B}_{i}^{(om)}+{\mathbf{B}_{i}^{(mo)}}^{\top}){\mathbf{X}_{i}^{(m)}}^{\top}+\mathbf{X}_{i}^{(m)}\mathbf{B}_{i}^{(mm)}{\mathbf{X}_{i}^{(m)}}^{\top}\big).    (20)

From the definition of \mathbf{B}_{i}, we have

\mathbf{B}_{i}^{(mm)}=\left(\mathbf{W}^{(i)}-\mathbf{J}\mathbf{W}^{(i)}\right)^{(m)}\left[\left(\mathbf{W}^{(i)}-\mathbf{J}\mathbf{W}^{(i)}\right)^{(m)}\right]^{\top},    (21)

in which (\mathbf{W}^{(i)}-\mathbf{J}\mathbf{W}^{(i)})^{(m)} refers to the columns of \mathbf{W}^{(i)}-\mathbf{J}\mathbf{W}^{(i)} corresponding to the weighted missing entries of the ith view. Taking the derivative of the objective function in (Eq.(20)) with respect to \mathbf{X}_{i}^{(m)} and setting it to zero, we get

\mathbf{X}_{i}^{(m)}\mathbf{B}_{i}^{(mm)}=-\mathbf{X}_{i}^{(o)}\mathbf{B}_{i}^{(om)}.    (22)

Because \mathbf{B}_{i}^{(mm)} is a positive semi-definite matrix, we calculate its eigenvalue decomposition \mathbf{B}_{i}^{(mm)}=\mathbf{U}_{i}\mathbf{\Lambda}_{i}\mathbf{V}_{i}^{\top}. The numerical solution of the equation system (Eq.(22)) is

\mathbf{X}_{i}^{(m)}=-\mathbf{X}_{i}^{(o)}\mathbf{B}_{i}^{(om)}\mathbf{V}_{i}\mathbf{\Lambda}_{i}^{+}{\mathbf{U}_{i}}^{\top}.    (23)

Here \mathbf{\Lambda}_{i}^{+} is the pseudo-inverse of \mathbf{\Lambda}_{i}.
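A minimal sketch (assumed reimplementation) of this closed-form update, computing the pseudo-inverse of the positive semi-definite block \mathbf{B}_{i}^{(mm)} through its eigenvalue decomposition as in Eq.(23), is:

```python
import numpy as np

def update_missing(X_obs, B_om, B_mm, eps=1e-10):
    """Solve X_m B_mm = -X_obs B_om (Eq.(22)) via the pseudo-inverse of B_mm."""
    vals, U = np.linalg.eigh(B_mm)                     # B_mm = U diag(vals) U^T
    inv_vals = np.where(vals > eps, 1.0 / vals, 0.0)   # pseudo-inverse of the spectrum
    B_mm_pinv = (U * inv_vals) @ U.T
    return -X_obs @ B_om @ B_mm_pinv                   # Eq.(23)
```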

4.3 Sub-problem for \mathbf{Z}:

To update \mathbf{Z}, we fix the values of all other variables and solve the following sub-problem:

\min\limits_{\mathbf{Z}}\ \psi(\mathbf{Z})=\gamma\left\|\mathbf{Z}\right\|_{*}+\frac{\lambda}{2}\left\|\mathbf{Z}\right\|_{F}^{2}+\frac{\alpha}{2}\left\|\mathbf{J}-\mathbf{Z}\right\|_{F}^{2},\quad\text{s.t.}\ \text{diag}(\mathbf{Z})=0.    (24)

Rearranging the objective function in (Eq.(24)), we obtain the equivalent model

\min\limits_{\mathbf{Z}}\ \psi(\mathbf{Z})=\frac{\gamma}{\lambda+\alpha}\left\|\mathbf{Z}\right\|_{*}+\frac{1}{2}\left\|\mathbf{Z}-\frac{\alpha}{\lambda+\alpha}\mathbf{J}\right\|_{F}^{2},\quad\text{s.t.}\ \text{diag}(\mathbf{Z})=0.    (25)

Denote the Lagrange function

\mathcal{L}(\mathbf{Z},\mathbf{y})=\psi(\mathbf{Z})+\langle\mathbf{y},\text{diag}(\mathbf{Z})\rangle.

Then the dual function of the Lagrange multiplier \mathbf{y} is g(\mathbf{y})=\inf\limits_{\mathbf{Z}}\mathcal{L}(\mathbf{Z},\mathbf{y}). The optimization problem (Eq.(25)) is convex and satisfies the weak Slater condition, so strong duality holds, which means

\sup\limits_{\mathbf{y}}\inf\limits_{\mathbf{Z}}\mathcal{L}(\mathbf{Z},\mathbf{y})=\mathcal{L}(\mathbf{Z}^{*},\mathbf{y}^{*})=\inf\limits_{\mathbf{Z}}\sup\limits_{\mathbf{y}}\mathcal{L}(\mathbf{Z},\mathbf{y}).    (26)

We employ Uzawa's algorithm [2] to solve the dual problem of (Eq.(25)). The gradient of g(\mathbf{y}) is given by

\nabla g(\mathbf{y})=\frac{\partial\mathcal{L}(\tilde{\mathbf{Z}},\mathbf{y})}{\partial\mathbf{y}}=\text{diag}(\tilde{\mathbf{Z}}),    (27)

where \tilde{\mathbf{Z}}=\arg\min\limits_{\mathbf{Z}}\mathcal{L}(\mathbf{Z},\mathbf{y}). Uzawa's algorithm proceeds iteratively via the following scheme:

\begin{cases}\mathcal{L}(\mathbf{Z}^{t},\mathbf{y}^{t-1})=\min\limits_{\mathbf{Z}}\mathcal{L}(\mathbf{Z},\mathbf{y}^{t-1}),\\ \mathbf{y}^{t}=\mathbf{y}^{t-1}+\delta_{k}\,\text{diag}(\mathbf{Z}^{t}).\end{cases}    (28)

The parameter \delta_{k} is a step size along the gradient direction. Next, we consider the minimization problem \min\limits_{\mathbf{Z}}\mathcal{L}(\mathbf{Z},\mathbf{y}^{t-1}) in (Eq.(28)). Substituting the formula in (Eq.(25)) for \psi(\mathbf{Z}), we obtain

\mathcal{L}(\mathbf{Z},\mathbf{y}^{t-1})=\psi(\mathbf{Z})+\langle\mathbf{y}^{t-1},\text{diag}(\mathbf{Z})\rangle=\psi(\mathbf{Z})+\langle\mathbf{Y}^{t-1},\mathbf{Z}\rangle.    (29)

Here \mathbf{Y} is a matrix of the same size as \mathbf{Z}; its diagonal elements equal \mathbf{y} and its off-diagonal elements are zero. Furthermore, we have the equivalent model

\arg\min\limits_{\mathbf{Z}}\mathcal{L}(\mathbf{Z},\mathbf{y}^{t-1})=\arg\min\limits_{\mathbf{Z}}\ \frac{\gamma}{\lambda+\alpha}\left\|\mathbf{Z}\right\|_{*}+\frac{1}{2}\left\|\mathbf{Z}-\frac{\alpha}{\lambda+\alpha}\Big(\mathbf{J}+\frac{\mathbf{Y}^{t-1}}{\alpha}\Big)\right\|_{F}^{2}.    (30)

Based on (Eq.(30)), the minimization problem in (Eq.(28)) can be solved efficiently by the singular value thresholding operator [3], and Uzawa's process becomes

\begin{cases}\mathbf{Z}^{t}=\mathcal{D}_{\frac{\gamma}{\lambda+\alpha}}\left(\frac{\alpha}{\lambda+\alpha}\Big(\mathbf{J}+\frac{\mathbf{Y}^{t-1}}{\alpha}\Big)\right),\\ \mathbf{y}^{t}=\mathbf{y}^{t-1}+\delta_{k}\,\text{diag}(\mathbf{Z}^{t}),\end{cases}    (31)

where

\mathcal{D}_{\frac{\gamma}{\lambda+\alpha}}\left(\frac{\alpha}{\lambda+\alpha}\Big(\mathbf{J}+\frac{\mathbf{Y}^{t-1}}{\alpha}\Big)\right)=\mathbf{U}\mathcal{D}_{\frac{\gamma}{\lambda+\alpha}}(\mathbf{\Sigma})\mathbf{V}^{\top},\qquad\mathcal{D}_{\frac{\gamma}{\lambda+\alpha}}(\mathbf{\Sigma})=\text{diag}\{\max(0,\sigma_{i}-\tfrac{\gamma}{\lambda+\alpha})\},

and \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top} is the singular value decomposition of the matrix \frac{\alpha}{\lambda+\alpha}(\mathbf{J}+\frac{\mathbf{Y}^{t-1}}{\alpha}) with \mathbf{\Sigma}=\text{diag}(\sigma_{i}).
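A minimal sketch (assumed reimplementation) of one Uzawa iteration from Eq.(31), i.e. a singular value thresholding step on \mathbf{Z} followed by the dual ascent step on the diag(\mathbf{Z})=0 constraint, is:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding operator D_tau(M)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def uzawa_step(J, y, alpha, lam, gamma, delta):
    """One step of Eq.(31); y is the dual variable for diag(Z) = 0."""
    Y = np.diag(y)                                     # y on the diagonal, zeros elsewhere
    Z = svt(alpha / (lam + alpha) * (J + Y / alpha), gamma / (lam + alpha))
    y_new = y + delta * np.diag(Z)                     # dual ascent step
    return Z, y_new
```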

When the iteration converges, we obtain the consensus coefficient matrix \mathbf{Z}. Finally, we apply the spectral clustering technique [4] to the affinity matrix \mathbf{L}=(|\mathbf{Z}|+|\mathbf{Z}^{\top}|)/2 to compute the cluster assignments. The procedure for solving our model (Eq.(11)) is summarized in Algorithm 1.

Algorithm 1 An adaptive weighted self-representation method for incomplete multi-view clustering (AWSR)
1: Input: initial weighting matrices \mathbf{W}_{0}^{(i)}, incomplete multi-view data \{[\mathbf{X}_{i}^{(o)},\mathbf{X}_{i}^{(m)}]\}_{i=1}^{v}, initial matrices \mathbf{Z}_{0}, \mathbf{J}_{0}, the number of clusters k, parameters \gamma, \lambda, \alpha, error=1e-3;
2: while abs((obj(t-1)-obj(t))/obj(t)) > error do
3:     Update \mathbf{J} by solving Eq.(15);
4:     Update \{\mathbf{X}_{i}^{(m)}\}_{i=1}^{v} by Eq.(23);
5:     Update \mathbf{Z} by Eq.(31);
6:     t = t + 1;
7: end while
8: Obtain the affinity matrix \mathbf{L}=(|\mathbf{Z}|+|\mathbf{Z}^{\top}|)/2;
9: Apply the spectral clustering method to the affinity matrix \mathbf{L};
10: Output: clustering result \mathcal{C}.

4.4 Convergence and complexity analysis

In this section, we prove that if the iteration sequence generated by the AWSR algorithm is infinite, it converges to a stationary point globally. Also, we demonstrate the convergence property of the sequences of objective function values numerically.

Let \tilde{\mathbf{X}}^{(m)}=\{\mathbf{X}_{1}^{(m)},\mathbf{X}_{2}^{(m)},\ldots,\mathbf{X}_{v}^{(m)}\}. First, we establish the properties of \{\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)}\} in the following lemma.

Lemma 1.

The sequence \{\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)}\} generated by Algorithm 1 satisfies the following properties:

(1) The sequence of function values f(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)}) is monotonically decreasing. In particular, we have

f(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t+1)},\tilde{\mathbf{X}}^{(m)(t+1)})\leqslant f(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})-\frac{\alpha}{2}\left\|\mathbf{J}^{(t+1)}-\mathbf{J}^{(t)}\right\|_{F}^{2}-\frac{\lambda}{2}\left\|\mathbf{Z}^{(t+1)}-\mathbf{Z}^{(t)}\right\|_{F}^{2}.    (32)

(2) \lim\limits_{t\rightarrow\infty}f(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})=c for some constant c.

(3) As t\rightarrow+\infty, \mathbf{J}^{(t+1)}-\mathbf{J}^{(t)}\rightarrow 0, \mathbf{Z}^{(t+1)}-\mathbf{Z}^{(t)}\rightarrow 0 and \tilde{\mathbf{X}}^{(m)(t+1)}-\tilde{\mathbf{X}}^{(m)(t)}\rightarrow 0.

(4) The sequences \{\mathbf{J}^{(t)}\}, \{\mathbf{Z}^{(t)}\} and \{\tilde{\mathbf{X}}^{(m)(t)}\} are bounded.

Proof.

(1). The function f(\mathbf{J},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)}) of \mathbf{J} in the tth update of \mathbf{J} (Eq.(12)) is \alpha-strongly convex. Therefore, we have

f(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})\leqslant f(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})-\frac{\alpha}{2}\left\|\mathbf{J}^{(t+1)}-\mathbf{J}^{(t)}\right\|_{F}^{2}.    (33)

The function f(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)}) of \tilde{\mathbf{X}}^{(m)} in the tth update of \tilde{\mathbf{X}}^{(m)} (Eq.(16)) is convex, so the following inequality holds:

f(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t+1)})\leq f(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)}).    (34)

Similarly, the function f(\mathbf{J}^{(t+1)},\mathbf{Z},\tilde{\mathbf{X}}^{(m)(t+1)}) of \mathbf{Z} is \lambda-strongly convex in the sub-problem (Eq.(24)), and we have

f(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t+1)},\tilde{\mathbf{X}}^{(m)(t+1)})\leqslant f(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t+1)})-\frac{\lambda}{2}\left\|\mathbf{Z}^{(t+1)}-\mathbf{Z}^{(t)}\right\|_{F}^{2}.    (35)

Summing (Eq.(33)), (Eq.(34)) and (Eq.(35)) yields the inequality

f(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t+1)},\tilde{\mathbf{X}}^{(m)(t+1)})\leqslant f(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})-\frac{\lambda}{2}\left\|\mathbf{Z}^{(t+1)}-\mathbf{Z}^{(t)}\right\|_{F}^{2}-\frac{\alpha}{2}\left\|\mathbf{J}^{(t+1)}-\mathbf{J}^{(t)}\right\|_{F}^{2},    (36)

which means the function values f(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)}) decrease as t increases.

(2). Because f(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)}) is non-negative and monotonically decreasing, it converges as t tends to infinity. That is,

\lim\limits_{t\rightarrow+\infty}f(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})=c\ \text{for some constant }c.

(3). Based on (Eq.(36)), we have

\sum\limits_{t=0}^{+\infty}\Big[\frac{\lambda}{2}\left\|\mathbf{Z}^{(t+1)}-\mathbf{Z}^{(t)}\right\|_{F}^{2}+\frac{\alpha}{2}\left\|\mathbf{J}^{(t+1)}-\mathbf{J}^{(t)}\right\|_{F}^{2}\Big]\leqslant\sum\limits_{t=0}^{+\infty}\big[f(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})-f(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t+1)},\tilde{\mathbf{X}}^{(m)(t+1)})\big]\leq f(\mathbf{J}^{(0)},\mathbf{Z}^{(0)},\tilde{\mathbf{X}}^{(m)(0)}).    (37)

Hence, \mathbf{J}^{(t+1)}-\mathbf{J}^{(t)}\rightarrow 0 and \mathbf{Z}^{(t+1)}-\mathbf{Z}^{(t)}\rightarrow 0. In addition, the rule for updating \tilde{\mathbf{X}}^{(m)(t)} indicates

f(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t+1)})\leq f(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)}),

and then we obtain \tilde{\mathbf{X}}^{(m)(t+1)}-\tilde{\mathbf{X}}^{(m)(t)}\rightarrow 0 from (Eq.(17)).

(4). Since the value f(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)}) is bounded and each term in (Eq.(11)) is nonnegative, the sequences \{\mathbf{J}^{(t)}\}, \{\mathbf{Z}^{(t)}\} and \{\tilde{\mathbf{X}}^{(m)(t)}\} are all bounded. ∎

Theorem 1.

Suppose \{(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})\} is the infinite iteration sequence produced by Algorithm 1. Then there exists (\mathbf{J}^{(*)},\mathbf{Z}^{(*)},\tilde{\mathbf{X}}^{(m)(*)}) such that

\lim\limits_{t\rightarrow\infty}(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})=(\mathbf{J}^{(*)},\mathbf{Z}^{(*)},\tilde{\mathbf{X}}^{(m)(*)})    (38)

with \text{diag}(\mathbf{Z}^{(*)})=0 and

\lim\limits_{t\rightarrow\infty}\|\nabla f(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})\|=0.    (39)
Proof.

First we prove that the sequence \{(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})\} converges. Because (\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)}) is bounded, there exists at least one accumulation point (\mathbf{J}^{(*)},\mathbf{Z}^{(*)},\tilde{\mathbf{X}}^{(m)(*)}) such that a sub-sequence \{\mathbf{J}^{(t_{j})},\mathbf{Z}^{(t_{j})},\tilde{\mathbf{X}}^{(m)(t_{j})}\} converges to it, i.e.,

\mathbf{J}^{(t_{j})}\rightarrow\mathbf{J}^{(*)},\quad\mathbf{Z}^{(t_{j})}\rightarrow\mathbf{Z}^{(*)},\quad\tilde{\mathbf{X}}^{(m)(t_{j})}\rightarrow\tilde{\mathbf{X}}^{(m)(*)}.

On the other hand, we have

\mathbf{J}^{(t+1)}-\mathbf{J}^{(t)}\rightarrow 0,\quad\mathbf{Z}^{(t+1)}-\mathbf{Z}^{(t)}\rightarrow 0,\quad\tilde{\mathbf{X}}^{(m)(t+1)}-\tilde{\mathbf{X}}^{(m)(t)}\rightarrow 0

from Lemma 1. Thus, the sequence \{(\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})\} converges and (Eq.(38)) holds. By Theorem 4.4 of [3], the solution \mathbf{Z}^{(t)} of sub-problem (Eq.(24)) converges to \mathbf{Z}^{(*)} satisfying \text{diag}(\mathbf{Z}^{(*)})=0.

Next we show that the partial derivative sequences with respect to \mathbf{J}, \mathbf{Z} and \tilde{\mathbf{X}}^{(m)} converge to zero. It can be deduced from the solution process of the sub-problems that

\nabla f_{\mathbf{J}}(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)})=0,\quad\nabla f_{\tilde{\mathbf{X}}^{(m)}}(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t+1)})=0,\quad\nabla f_{\mathbf{Z}}(\mathbf{J}^{(t+1)},\mathbf{Z}^{(t+1)},\tilde{\mathbf{X}}^{(m)(t+1)})=0.    (40)

Since the sequence \{\mathbf{J}^{(t)},\mathbf{Z}^{(t)},\tilde{\mathbf{X}}^{(m)(t)}\} converges, the conclusion (Eq.(39)) follows immediately. ∎

Time complexity: We analyze the time complexity by counting the elementary operations performed in each sub-problem. (1) For the sub-problem of \mathbf{J}, the time- and space-consuming Kronecker product in (Eq.(15)) is never formed explicitly; the operator is applied in the product form of (Eq.(14)) within the conjugate gradient iterations. Since we employ the preconditioned conjugate gradient method, the algorithm usually terminates quickly. Its computational complexity is O(n^{3}+n^{2}d_{i}), where d_{i} denotes the number of features of the ith view. (2) The solution of \mathbf{X}_{i}^{(m)} has a closed form whose computational complexity is O(n_{m}^{3}+d_{i}nn_{m}), where n_{m} denotes the number of unobserved entries in the ith view. (3) Because the minimization problem in (Eq.(28)) is solved analytically, Uzawa's algorithm for computing \mathbf{Z} in the sub-problem (Eq.(24)) needs only a few iterations in practice; the computational complexity of obtaining \mathbf{Z} is O(n^{3}). (4) The computational complexity of the spectral clustering in the last step is O(n^{3}). In summary, the total time complexity of the AWSR method is O(n^{3}). Space complexity: The matrices involved are the ith view data \mathbf{X}_{i}\in\mathbb{R}^{d_{i}\times n}, the self-representation matrix \mathbf{Z}\in\mathbb{R}^{n\times n}, the weighting matrix \mathbf{W}^{(i)}\in\mathbb{R}^{d_{i}\times n} and the splitting variable \mathbf{J}\in\mathbb{R}^{n\times n}, for i=1,\ldots,v. Thus the space complexity is O(n^{2}).

5 Experiment

In this section, we compare the performance of our AWSR method with that of other widely used methods on five datasets.

5.1 Datasets

The proposed AWSR method, as well as the other approaches, is evaluated on five benchmark datasets: ORL (http://cam-orl.co.uk/facedatabase.html), Still (https://www.di.ens.fr/willow/research/stillactions/), BBCSport (http://mlg.ucd.ie/datasets/bbc.html), Olympics (http://mlg.ucd.ie/aggregation/) and Leaves (https://github.com/cswanghao/gbs/blob/). For the construction of missing views, we follow the protocol in [21]. First, n_{m}=[n\cdot r] instances of the first view are deleted, where n is the total number of instances and r is the missing ratio. Then n_{m} instances of the second view are deleted, excluding those already missing in the first view, so that every instance is observed in at least one view. Finally, n_{m} instances arbitrarily chosen from the other views are deleted. In the numerical experiments, we consider missing ratios of 0.1, 0.2, 0.3, 0.4 and 0.5. The details of each dataset are as follows.
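The following sketch gives one assumed reading of this protocol (the handling of the later views is simplified); it only produces the missing-instance mask, with True marking an instance that is unobserved in a view.

```python
import numpy as np

def make_missing_mask(n, n_views, ratio, seed=0):
    """missing[v, j] is True when instance j is missing in view v."""
    rng = np.random.default_rng(seed)
    n_m = int(round(n * ratio))
    missing = np.zeros((n_views, n), dtype=bool)
    missing[0, rng.choice(n, n_m, replace=False)] = True
    for v in range(1, n_views):
        # only pick instances that would still be observed in at least one view
        safe = np.where(missing.sum(axis=0) < n_views - 1)[0]
        missing[v, rng.choice(safe, min(n_m, len(safe)), replace=False)] = True
    return missing
```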

ORL: The dataset consists of 400 images of 40 different people taken at different times, under varying lighting, with different facial expressions (eyes open/closed, smiling/not smiling) and facial details (with/without glasses). Following [27] and [38], we use part of the images to generate the datasets ORL1 and ORL2. ORL1 is a two-view dataset with dimensions 1024 and 288, respectively. ORL2 has three views, with dimensions 4096, 3304 and 6750, respectively.

Still: The dataset [6] consists of 467 images of six different actions: grabbing, running, walking, throwing, squatting and kicking. In the experiments, Still is a three-view dataset with dimensions 200, 200 and 200.

BBCSport: The dataset [9] consists of 737 sports articles published on the BBC Sport website by journalists in five areas: athletics, cricket, football, rugby and tennis. In the experiments, we only select a subset of BBCSport containing news reports from 116 journalists. BBCSport is a four-view dataset with dimensions 1991, 2063, 2113 and 2158.

Olympics: The dataset [10] contains images of 464 athletes or organizers in 28 different sports at the London 2012 Summer Olympic Games. In the experiments, Olympics is a two-view dataset with dimensions 4942 and 3097, respectively.

Leaves: This dataset [1] contains 1600 leaf images covering 100 plant species. Each leaf is described by shape, fine-scale edge and texture histogram features. In the experiments, Leaves is a three-view dataset with dimensions 64, 64 and 64.

Statistical information about the datasets is presented in Table 2. The "Number of Clusters" and "Dimension of Each View" are taken from the dataset descriptions. For example, the ORL dataset consists of 400 images of 40 people, so its number of clusters is 40.

Table 2: Statistics of the datasets.
Dataset  Number of Clusters  Dimension of View 1  View 2  View 3  View 4
ORL1  40  1024  288  -  -
Still  6  200  200  200  -
BBCSport  5  1991  2063  2113  2158
Olympics  28  4942  3097  -  -
Leaves  100  64  64  64  -
ORL2  40  4096  3304  6750  -

5.2 Methods for comparison

In this subsection, we provide a brief introduction to the eight other algorithms and two baselines that we compare with our proposed algorithm.

(1) LSRs (single-best baseline) [19] first filled the missing data with random values and then applied the LSR algorithm to each view.

(2) In LSRc (concatenated baseline) [19], the missing data were filled with arbitrary values and the data of all views were concatenated; the LSR algorithm was then run on the concatenated data.

(3) IMG [41] factorized data of each view to learn the latent subspaces independently. Meanwhile, it utilized the graph Laplacian term to regularize the latent subspaces of each view.

(4) DAIMC [13] was a weighted non-negative matrix factorization based method. It employed the existing instance alignment information to learn the consistent latent feature matrix of all views. Besides, in order to reduce the influence of missing entries, the 2,1\ell_{2,1}-norm was applied to build the consistency basis matrix of all views.

(5) AGL [35] first performed a low rank representation of the graph for each view with spectral constraints. Then it built a consensus representation for all views using the co-regularization term.

(6) AWGF [40] factorized a weighted non-negative matrix to obtain the feature matrix of each view and then constructed the graph of each view. The feature extraction and the graph were fused into a large framework according to a certain weight.

(7) PLR [17] obtained the consensus feature matrix of all views based on non-negative matrix factorization. Then it imposed 2,1\ell_{2,1}-norm constraints on the basis matrix of each view and finally imposed regularization constraints on the local graph to obtain the shared feature matrix.

(8) IMSR [21] performed feature extraction and employed missing-data imputation together with a low-rank regularized consensus coefficient matrix via self-representation learning to obtain the complete consensus coefficient matrix.

(9) CAMVSC [31] employed an auto-weighted technique and constructed a self-representation subspace model to cluster complete multi-view datasets.

(10) SIMC_ADC [12] proposed an anchor point learning method to obtain the consensus coefficient matrix for large-scale multi-view clustering problems.

5.3 Evaluation metrics

Accuracy (ACC), normalized mutual information (NMI) and Purity are used to measure clustering performance. Generally speaking, higher values indicate better clustering. Next, we introduce the definitions of these metrics. The ACC is calculated by

ACC=\frac{TP+TN}{TP+TN+FP+FN},

where a true positive (TP) means that two similar documents are placed in the same cluster, a true negative (TN) means that two dissimilar documents are placed in different clusters, a false negative (FN) means that two similar documents are placed in different clusters, and a false positive (FP) means that two dissimilar documents are placed in the same cluster. The mutual information (MI) is

MI(\mathcal{C},\varTheta)=\sum\limits_{c_{i}\in\mathcal{C}}\sum\limits_{\omega_{k}\in\varTheta}P(c_{i},\omega_{k})\log\frac{P(c_{i},\omega_{k})}{P(c_{i})P(\omega_{k})},

where P(\cdot) is a probability function. If \mathcal{C} and \varTheta are independent of each other, the mutual information equals zero. In the clustering experiments, we use the normalized mutual information (NMI),

NMI(\mathcal{C},\varTheta)=\frac{MI(\mathcal{C},\varTheta)}{\max(H(\mathcal{C}),H(\varTheta))},

where H(\cdot) is the entropy function, defined as H(\varTheta)=-\sum\limits_{k}P(\omega_{k})\log P(\omega_{k}). The metric Purity is given by

Purity(\mathcal{C},\varTheta)=\frac{1}{n}\sum\limits_{k=1}^{K}\max\limits_{j}|\omega_{k}\cap c_{j}|,

where n is the total number of samples, \varTheta=\{\omega_{1},\omega_{2},\ldots,\omega_{K}\} is the set of predicted clusters and \mathcal{C}=\{c_{1},c_{2},\ldots,c_{J}\} is the set of ground-truth classes.
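The sketch below implements the three metrics as defined above (ACC in its pairwise TP/TN form, NMI normalized by the maximum entropy, and Purity from the contingency matrix); it is a standard implementation assumed here, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def pairwise_acc(y_true, y_pred):
    """(TP + TN) / (TP + TN + FP + FN) over all pairs of samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    same_true = np.equal.outer(y_true, y_true)
    same_pred = np.equal.outer(y_pred, y_pred)
    iu = np.triu_indices(len(y_true), k=1)             # each unordered pair once
    return np.mean(same_true[iu] == same_pred[iu])

def nmi(y_true, y_pred):
    return normalized_mutual_info_score(y_true, y_pred, average_method="max")

def purity(y_true, y_pred):
    C = contingency_matrix(y_true, y_pred)             # rows: classes, columns: clusters
    return C.max(axis=0).sum() / len(y_true)
```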

5.4 Numerical experiments

In this section, we explain how the parameters are chosen, then present the numerical results of the proposed AWSR method compared with other prevailing methods, and finally plot the sequence of objective function values to verify the convergence of the computational algorithm.

5.4.1 Parameter selection

For the elements \alpha_{mj}^{(i)} of the weighting matrix, we assign the constant 1 based on experience. The main idea for determining the parameters in the objective function is to compare the performance of the AWSR method under different candidate values and choose the best ones.

First, we list the candidate values of the parameters. We roughly estimate the range that may contain a desirable value for each parameter. All candidate values are expressed as 2^{q}. In Table 3, we list the values of the exponent q for each parameter on the different datasets. For example, the values assigned to the parameter \gamma on the ORL1 dataset are 2^{3.1}, 2^{3.3} and 2^{3.5}, because the expression [3.1:0.2:3.5] indicates that the exponent q takes values from 3.1 to 3.5 with a common difference of 0.2 between successive values. The other candidate parameters are read from Table 3 in the same way.

Table 3: The values of the exponent q for the different parameters (2^{q}).
Dataset  q for \lambda=2^{q}  q for \gamma=2^{q}  q for \alpha=2^{q}
ORL1  [-2.9:0.1:-1.7]  [3.1:0.2:3.5]  [7.3:0.1:8]
Still  [-1.9:0.05:-1.7]  [3.1:0.2:3.5]  [7.2:0.1:7.7]
BBCSport  [-10:2:10]  [2.5:1:4.5]  [7.3:0.1:8]
Olympics  [-1.8:0.1:-1.7]  [3.1:0.2:3.5]  [7.2:0.1:8]
Leaves  [-2.3:0.1:-1.6]  [3.1:0.2:3.5]  [7.4:0.1:8]
ORL2  [-3:1:0]  [2:1:6]  [6.7:0.05:6.85]

Next, we take the Olympics dataset as an example to illustrate the selection process. As shown in the 'Olympics' row of Table 3, the parameter \lambda has 2 candidates, 2^{-1.8} and 2^{-1.7}, while \gamma may take the values 2^{3.1}, 2^{3.3} and 2^{3.5}. The parameter \alpha has 9 choices, from 2^{7.2} to 2^{8}. Thus there are 54 combinations of (\lambda,\gamma,\alpha). We sort these 54 combinations in lexical order, so that each number from 1 to 54 corresponds to one combination of (\lambda,\gamma,\alpha). For each combination, we run the AWSR method with missing ratios 0.1, 0.2, 0.3 and 0.4. Figure 2 shows the average values of ACC, NMI and Purity, with the number on the horizontal axis being the index of the corresponding combination. The model performs best at the 36th combination, which is \lambda=2^{-1.7}, \gamma=2^{3.1} and \alpha=2^{8.0}. A sketch of this grid search is given after Figure 2.

Figure 2: Comparison of numerical results of different parameter combinations on Olympics.
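A minimal sketch of this grid search is given below; the evaluation helper run_awsr_average_score is hypothetical and stands for running AWSR at the four missing ratios and averaging ACC, NMI and Purity.

```python
import itertools
import numpy as np

def run_awsr_average_score(lam, gamma, alpha):
    """Placeholder for the hypothetical AWSR evaluation over the four missing ratios."""
    return 0.0  # replace with the actual evaluation

lams   = 2.0 ** np.arange(-1.8, -1.65, 0.1)   # Olympics row of Table 3
gammas = 2.0 ** np.arange(3.1, 3.6, 0.2)
alphas = 2.0 ** np.arange(7.2, 8.05, 0.1)

best = None
for idx, (lam, gamma, alpha) in enumerate(itertools.product(lams, gammas, alphas), 1):
    score = run_awsr_average_score(lam, gamma, alpha)
    if best is None or score > best[0]:
        best = (score, idx, lam, gamma, alpha)   # index idx matches the lexical order
```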

5.4.2 Numerical results

Complete datasets: First, we compare the performance of our proposed algorithm with the CAMVSC algorithm on the complete datasets ORL2, BBCSport and Leaves. The experimental results are shown in Table 4. Since the CAMVSC algorithm is designed specifically for complete multi-view clustering, it outperforms the AWSR method on these complete datasets.

Table 4: Performance comparison between the algorithm proposed and CAMVSC algorithm on complete dataset ORL2, BBCSport and Leaves.
Dataset algorithm ACC NMI Purity
ORL2 AWSR 80.90 91.16 84.30
CAMVSC 82.45 91.73 85.33
BBCSport AWSR 80.17 75.89 90.52
CAMVSC 92.24 82.19 92.24
Leaves AWSR 77.80 89.47 80.30
CAMVSC 91.38 81.05 91.38
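For reference, the three metrics reported throughout this section can be computed from predicted and ground-truth labels in the standard way: ACC via an optimal (Hungarian) matching of cluster labels, NMI via scikit-learn, and Purity via majority voting. The following is a minimal sketch, not the exact evaluation code used for the tables.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between clusters and classes (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    hits = np.zeros((clusters.size, classes.size))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            hits[i, j] = np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(-hits)      # maximize the matched counts
    return hits[row, col].sum() / y_true.size

def purity(y_true, y_pred):
    """Purity: each cluster votes for its majority ground-truth class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        total += counts.max()
    return total / y_true.size

# Toy example: a perfect clustering up to label permutation scores 1.0 on all three metrics.
y_true, y_pred = [0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]
print(clustering_accuracy(y_true, y_pred),
      normalized_mutual_info_score(y_true, y_pred),
      purity(y_true, y_pred))
```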

Incomplete datasets: In [21], the IMSR algorithm is shown to be superior to LSRs, LSRc, IMG, DAIMC, AGL, AWGF and PLR on incomplete datasets. Therefore, we first compare the AWSR algorithm only with the IMSR algorithm and the state-of-the-art SIMC_ADC algorithm as the missing ratio changes from 0.1 to 0.5. The results are shown in Figure 3 and Table 5 respectively. In Figure 3, the values of ACC, NMI and Purity computed by AWSR and IMSR are plotted; the blue lines correspond to AWSR and the red ones to IMSR. In Table 5, the superior values are in bold type. In most cases, AWSR performs better than IMSR and SIMC_ADC.
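The incomplete versions of the datasets are obtained by discarding views at the prescribed missing ratio. As an illustration only, the sketch below builds a view-missing indicator matrix under our simplifying assumption that the stated fraction of (instance, view) entries is removed uniformly at random while every instance keeps at least one observed view; the exact masking protocol may differ.

```python
import numpy as np

def missing_view_mask(n_samples, n_views, missing_ratio, seed=0):
    """Boolean (n_samples, n_views) mask; True marks an observed view."""
    rng = np.random.default_rng(seed)
    mask = np.ones((n_samples, n_views), dtype=bool)
    n_remove = int(round(missing_ratio * n_samples * n_views))
    removed = 0
    for idx in rng.permutation(n_samples * n_views):   # random order over (sample, view) pairs
        if removed == n_remove:
            break
        i, v = divmod(int(idx), n_views)
        if mask[i].sum() > 1:          # keep at least one observed view per instance
            mask[i, v] = False
            removed += 1
    return mask

# Example: 300 instances, 3 views, missing ratio 0.3
mask = missing_view_mask(300, 3, 0.3)
print(mask.mean())                     # fraction of observed view entries, about 0.7
```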

Figure 3: Performance comparison of AWSR and IMSR methods on datasets Leaves, BBCSport, Olympics, ORL1 and Still.
Table 5: Comparison between AWSR method and the SIMC_ADC method on incomplete datasets with various missing ratios. The best results are shown in bold.
Metric Dataset Algorithm Missing ratio: 0.1 0.2 0.3 0.4 0.5
ACC ORL1 SIMC_ADC 51.73 45.85 37.80 29.23 30.80
AWSR 65.38 55.15 47.90 43.73 41.98
Olympics SIMC_ADC 56.51 51.44 45.60 39.01 34.24
AWSR 76.57 73.69 68.32 57.74 44.29
Leaves SIMC_ADC 52.62 42.84 36.36 32.03 27.36
AWSR 66.25 56.48 46.31 39.71 33.99
BBCSport SIMC_ADC 52.33 49.31 48.97 48.53 45.07
AWSR 79.31 76.55 71.53 61.46 53.97
Still SIMC_ADC 32.63 31.32 29.13 29.79 29.14
AWSR 33.02 31.18 29.87 28.56 27.20
NMI ORL1 SIMC_ADC 69.64 65.20 57.63 50.17 52.90
AWSR 78.01 72.43 67.13 62.94 64.22
Olympics SIMC_ADC 65.84 61.25 55.27 49.80 45.86
AWSR 83.35 81.36 78.19 71.35 65.33
Leaves SIMC_ADC 74.71 69.21 64.70 61.45 58.07
AWSR 81.48 75.47 69.72 66.00 60.00
BBCSport SIMC_ADC 28.65 23.47 23.71 22.88 20.61
AWSR 73.60 68.46 63.41 50.11 39.41
Still SIMC_ADC 11.93 10.54 9.74 9.21 8.70
AWSR 12.08 10.25 8.97 7.80 6.67
Purity ORL1 SIMC_ADC 54.39 48.78 40.48 31.38 32.95
AWSR 68.25 57.70 50.73 46.48 45.18
Olympics SIMC_ADC 64.38 58.92 52.97 39.09 43.82
AWSR 83.95 82.18 78.71 72.72 62.85
Leaves SIMC_ADC 54.73 45.10 38.68 39.94 29.16
AWSR 68.27 58.63 48.20 41.96 39.79
BBCSport SIMC_ADC 58.71 55.35 52.67 52.50 48.19
AWSR 89.40 86.38 84.40 72.07 64.40
Still SIMC_ADC 35.56 34.28 33.00 32.35 31.50
AWSR 35.25 33.43 32.29 30.90 29.27

Moreover, we also compare the AWSR method with all of the above-mentioned approaches. The clustering results averaged over missing ratios 0.1, 0.2, 0.3, 0.4 and 0.5 are reported in Table 6, where the best metric value for each problem is shown in bold. For each dataset, the overall performance of the proposed AWSR algorithm on ACC, NMI and Purity is better than that of the other algorithms, except for the NMI metric on the ORL1 dataset, on which the proposed algorithm ranks second.

Table 6: Comparison of the performance after collation. The average values of the clustering results provided with missing ratio being 0.1, 0.2, 0.3, 0.4 and 0.5 are given. The best results are shown in bold.
Metric Dataset LSRs LSRc IMG DAIMC AGL AWGF PLR IMSR SIMC_ADC AWSR
ACC ORL1 41.50 32.20 35.30 56.10 43.20 61.27 60.50 64.63 39.08 65.10
Still 29.85 27.54 29.46 32.29 28.22 29.38 31.65 33.09 30.60 33.26
BBCSport 60.17 38.97 43.90 73.28 67.24 37.29 70.17 76.72 48.84 78.10
Olympics 60.52 59.53 26.47 59.87 67.03 43.09 64.87 73.17 45.36 74.37
Leaves 41.39 21.96 37.36 48.33 47.88 44.34 47.63 51.81 38.24 61.02
NMI ORL1 60.98 53.31 52.55 73.73 63.80 76.79 75.83 78.57 59.08 77.35
Still 9.56 6.56 10.13 10.60 7.91 8.90 10.52 11.56 10.02 12.01
BBCSport 42.88 17.16 23.04 60.49 60.10 14.51 59.51 69.06 23.86 70.7
Olympics 66.45 69.67 35.12 72.44 74.28 55.92 77.09 82.16 55.60 82.94
Leaves 65.06 50.75 60.75 71.59 71.32 56.40 70.61 73.49 65.63 75.63
Purity ORL1 43.35 34.70 40.06 59.70 45.65 65.58 63.55 67.57 41.59 67.61
Still 32.38 29.98 31.15 34.90 31.13 31.76 34.30 35.20 33.34 35.35
BBCSport 68.10 47.07 45.10 81.21 78.28 43.19 80.00 86.38 53.48 87.75
Olympics 65.82 69.09 32.88 70.78 72.76 53.32 75.30 82.37 51.68 83.86
Leaves 42.69 22.85 40.23 50.68 50.13 46.91 49.85 54.03 40.32 63.51

We also visualize the results of Table 6 for the different metrics on each dataset in Figure 4.

Figure 4: Visualization histogram of the results in Table 6.

Time comparison: In terms of computational efficiency, the AWSR method is not as efficient as the IMSR and SIMC_ADC algorithms. Taking Olympics, BBCSport and Still as examples, Table 7 reports the computation time of AWSR, IMSR and SIMC_ADC when the missing ratio is 0.1.

Table 7: Comparison of time (seconds).
Method Olympics BBCSport Still
AWSR 6.596 1.656 5.159
IMSR 2.166 0.378 0.724
SIMC_ADC 2.174 1.097 1.439

5.4.3 Sequence of objective function values

To verify the convergence property of the BCD algorithm, we plot the objective function values on the Still, Leaves, ORL1 and Olympics datasets in Figure 5. Because the missing entries are initialized to zero, the first objective function value in each figure is smaller than the value at the second iterate, where some of the recovered entries become nonzero. Afterwards the objective function value decreases and eventually converges.
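The curves in Figure 5 are obtained by simply recording the objective value after every outer iteration. The generic sketch below illustrates such a loop together with a relative-change stopping test; `objective` and `bcd_update` stand in for the AWSR objective and one sweep of block updates, demonstrated here on a toy two-block quadratic rather than the actual subproblems.

```python
def run_bcd(x0, objective, bcd_update, max_iter=200, tol=1e-8):
    """Outer loop: record objective values and stop when the relative decrease is small."""
    x, history = x0, [objective(x0)]
    for _ in range(max_iter):
        x = bcd_update(x)
        history.append(objective(x))
        if abs(history[-2] - history[-1]) <= tol * max(1.0, abs(history[-2])):
            break
    return x, history

# Toy two-block example: f(x, y) = (x-1)^2 + (y+2)^2 + 0.5(x-y)^2,
# with each block minimized exactly in turn (closed-form coordinate updates).
f = lambda v: (v[0] - 1) ** 2 + (v[1] + 2) ** 2 + 0.5 * (v[0] - v[1]) ** 2
def update(v):
    x = (2 + v[1]) / 3.0               # argmin over x with y fixed
    y = (x - 4) / 3.0                  # argmin over y with the new x fixed
    return (x, y)

_, history = run_bcd((0.0, 0.0), f, update)
# Plotting `history` gives a convergence curve analogous to Figure 5 (monotone here,
# since the toy problem has no missing-value filling step that raises the first value).
```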

Figure 5: The convergence curves of the objective function value on datasets Still, Leaves, ORL1 and Olympics.

6 Conclusion

For the IMC problem, we introduce an adaptive weighted self-representation (AWSR) model. It adjusts the weighting matrix on the basis of the variations of views and the recovery process of the missing data. The proposed AWSR model is concise and can be solved by the traditional BCD algorithm, whose convergence property is proved. Numerical experiments on five real-world incomplete datasets demonstrate the effectiveness of AWSR and its superiority over eight other widely-used approaches. In addition, we compare its performance with CAMVSC on three complete datasets; our method can be applied to both complete and incomplete datasets, whereas CAMVSC can only be applied to complete ones.

Although the AWSR method is effective for IMC, some issues deserve further investigation. In theory, the elements of the weighting matrix are determined by the differences among instances and between missing and observed entries; in the experiments, however, the entries of the weighting matrix are adjusted empirically. A more principled and flexible updating strategy should be developed. In addition, high-efficiency optimization algorithms are needed for large-scale datasets.


Acknowledgments This work was supported by the National Natural Science Foundation of China [grant No.12326302, No.62073087].


Data availability The data used to support the findings of this study are available from the corresponding author upon request.

Declarations

  • Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

  • Author contributions Conceptualization: [Lishan Feng], [Jingya Chang]; Software: [Lishan Feng]; Validation: [Lishan Feng], [Guoxu Zhou], [Jingya Chang]; Formal analysis: [Lishan Feng]; Data curation: [Lishan Feng]; Writing - original draft: [Lishan Feng]; Preparation: [Lishan Feng]; Visualization: [Lishan Feng]; Project administration: [Lishan Feng]; Investigation: [Guoxu Zhou], [Jingya Chang]; Resources: [Guoxu Zhou], [Jingya Chang]; Supervision: [Guoxu Zhou], [Jingya Chang]; Funding acquisition: [Guoxu Zhou], [Jingya Chang]; Methodology: [Jingya Chang]; Writing - review & editing: [Jingya Chang]. All authors have read and agreed to the published version of the manuscript.

References

  • Beghin et al [2010] Beghin T, Cope JS, Remagnino P, et al (2010) Shape and texture based plant leaf classification. In: International conference on advanced concepts for intelligent vision systems, Springer, pp 345–353
  • Bramble et al [1997] Bramble JH, Pasciak JE, Vassilev AT (1997) Analysis of the inexact uzawa algorithm for saddle point problems. SIAM Journal on Numerical Analysis 34(3):1072–1092
  • Cai et al [2010] Cai JF, Candès EJ, Shen Z (2010) A singular value thresholding algorithm for matrix completion. SIAM Journal on optimization 20(4):1956–1982
  • Cai et al [2018] Cai Y, Jiao Y, Zhuge W, et al (2018) Partial multi-view spectral clustering. Neurocomputing 311:316–324
  • Chao et al [2022] Chao G, Wang S, Yang S, et al (2022) Incomplete multi-view clustering with multiple imputation and ensemble clustering. Applied Intelligence pp 1–11
  • Delaitre et al [2010] Delaitre V, Laptev I, Sivic J (2010) Recognizing human actions in still images: a study of bag-of-features and part-based representations. In: BMVC 2010-21st British Machine Vision Conference
  • Deng et al [2023] Deng S, Wen J, Liu C, et al (2023) Projective incomplete multi-view clustering. IEEE Transactions on Neural Networks and Learning Systems
  • Fang et al [2023] Fang U, Li M, Li J, et al (2023) A comprehensive survey on multi-view clustering. IEEE Transactions on Knowledge and Data Engineering
  • Greene and Cunningham [2006] Greene D, Cunningham P (2006) Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of the 23rd international conference on Machine learning, pp 377–384
  • Greene and Cunningham [2013] Greene D, Cunningham P (2013) Producing a unified graph representation from multiple social network views. In: Proceedings of the 5th annual ACM web science conference, pp 118–121
  • Guo et al [2023] Guo W, Wang Z, Chi Z, et al (2023) Scalable one-stage multi-view subspace clustering with dictionary learning. Knowledge-Based Systems 259:110092
  • He et al [2023] He Wj, Zhang Z, Wei Y (2023) Scalable incomplete multi-view clustering with adaptive data completion. Information Sciences p 119562
  • Hu and Chen [2019] Hu M, Chen S (2019) Doubly aligned incomplete multi-view clustering. arXiv preprint arXiv:1903.02785
  • Hu et al [2021] Hu Y, Luo C, Wang B, et al (2021) Complete/incomplete multi-view subspace clustering via soft block-diagonal-induced regulariser. IET Computer Vision 15(8):618–632
  • Khan et al [2023] Khan MA, Khan GA, Khan J, et al (2023) Adaptive weighted low-rank sparse representation for multi-view clustering. IEEE Access
  • Li et al [2022] Li Z, Tang C, Zheng X, et al (2022) High-order correlation preserved incomplete multi-view subspace clustering. IEEE Transactions on Image Processing 31:2067–2080
  • Lian et al [2021] Lian H, Xu H, Wang S, et al (2021) Partial multiview clustering with locality graph regularization. International Journal of Intelligent Systems 36(6):2991–3010
  • Liang et al [2022] Liang N, Yang Z, Xie S (2022) Incomplete multi-view clustering with sample-level auto-weighted graph fusion. IEEE Transactions on Knowledge and Data Engineering 35(6):6504–6511
  • Liu et al [2012] Liu G, Lin Z, Yan S, et al (2012) Robust recovery of subspace structures by low-rank representation. IEEE transactions on pattern analysis and machine intelligence 35(1):171–184
  • Liu et al [2021a] Liu J, Liu X, Yang Y, et al (2021a) Hierarchical multiple kernel clustering. In: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI, pp 2–9
  • Liu et al [2021b] Liu J, Liu X, Zhang Y, et al (2021b) Self-representation subspace clustering for incomplete multi-view data. In: Proceedings of the 29th ACM International Conference on Multimedia, pp 2726–2734
  • Liu et al [2023] Liu M, Yang Z, Li L, et al (2023) Auto-weighted collective matrix factorization with graph dual regularization for multi-view clustering. Knowledge-Based Systems 260:110145
  • Liu and Lin [2023] Liu SS, Lin L (2023) Adaptive weighted multi-view clustering. In: Conference on Health, Inference, and Learning, PMLR, pp 19–36
  • Lu et al [2012] Lu CY, Min H, Zhao ZQ, et al (2012) Robust and efficient subspace segmentation via least squares regression. In: Computer Vision – ECCV 2012. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 347–360
  • Merino [1992] Merino DI (1992) Topics in matrix analysis. The Johns Hopkins University
  • Ren et al [2018] Ren P, Xiao Y, Xu P, et al (2018) Robust auto-weighted multi-view clustering. In: IJCAI, pp 2644–2650
  • Samaria and Harter [1994] Samaria FS, Harter AC (1994) Parameterisation of a stochastic model for human face identification. In: Proceedings of 1994 IEEE workshop on applications of computer vision, IEEE, pp 138–142
  • Shi et al [2022] Shi S, Nie F, Wang R, et al (2022) Self-weighting multi-view spectral clustering based on nuclear norm. Pattern Recognition 124:108429
  • Shu et al [2022] Shu X, Zhang X, Gao Q, et al (2022) Self-weighted anchor graph learning for multi-view clustering. IEEE Transactions on Multimedia
  • Sun et al [2021] Sun M, Zhang P, Wang S, et al (2021) Scalable multi-view subspace clustering with unified anchors. In: Proceedings of the 29th ACM International Conference on Multimedia, pp 3528–3536
  • Tang et al [2022] Tang K, Cao L, Zhang N, et al (2022) Consistent auto-weighted multi-view subspace clustering. Pattern Analysis and Applications 25(4):879–890
  • Vidal [2011] Vidal R (2011) Subspace clustering. IEEE Signal Processing Magazine 28(2):52–68
  • Wan et al [2023] Wan X, Liu X, Liu J, et al (2023) Auto-weighted multi-view clustering for large-scale data. arXiv preprint arXiv:2303.01983
  • Wang et al [2022] Wang S, Wang Y, Le W (2022) Adaptive weight structure representation for multi-view subspace clustering. In: 2022 9th International Conference on Dependable Systems and Their Applications (DSA), pp 918–925, 10.1109/DSA56465.2022.00129
  • Wen et al [2018] Wen J, Xu Y, Liu H (2018) Incomplete multiview spectral clustering with adaptive graph learning. IEEE transactions on cybernetics 50(4):1418–1429
  • Wen et al [2019] Wen J, Zhang Z, Xu Y, et al (2019) Unified embedding alignment with missing views inferring for incomplete multi-view clustering. In: Proceedings of the AAAI conference on artificial intelligence, pp 5393–5400
  • Yin and Jiang [2023] Yin J, Jiang J (2023) Incomplete multi-view clustering based on self-representation. Neural Processing Letters pp 1–15
  • Zhang et al [2015] Zhang C, Fu H, Liu S, et al (2015) Low-rank tensor constrained multiview subspace clustering. In: Proceedings of the IEEE international conference on computer vision, pp 1582–1590
  • Zhang et al [2017] Zhang C, Hu Q, Fu H, et al (2017) Latent multi-view subspace clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4279–4287
  • Zhang et al [2020] Zhang P, Wang S, Hu J, et al (2020) Adaptive weighted graph fusion incomplete multi-view subspace clustering. Sensors 20(20):5755
  • Zhao et al [2016] Zhao H, Liu H, Fu Y (2016) Incomplete multi-modal visual data grouping. In: IJCAI, pp 2392–2398
  • Zhao et al [2022a] Zhao L, Zhang J, Yang T, et al (2022a) Incomplete multi-view clustering based on weighted sparse and low rank representation. Applied Intelligence 52(13):14822–14838
  • Zhao et al [2023] Zhao M, Yang W, Nie F (2023) Auto-weighted orthogonal and nonnegative graph reconstruction for multi-view clustering. Information Sciences 632:324–339
  • Zhao et al [2022b] Zhao X, Dai Q, Wu J, et al (2022b) Multi-view tensor graph neural networks through reinforced aggregation. IEEE Transactions on Knowledge and Data Engineering 35(4):4077–4091
  • Zhuge et al [2017] Zhuge W, Hou C, Jiao Y, et al (2017) Robust auto-weighted multi-view subspace clustering with common subspace representation matrix. PloS one 12(5):e0176769