
MetaViewer: Towards A Unified Multi-View Representation

Ren Wang
Shandong University
[email protected]
   Haoliang Sun
Shandong University
[email protected]
   Yuling Ma
Shandong Jianzhu University
[email protected]
   Xiaoming Xi
Shandong Jianzhu University
[email protected]
   Yilong Yin
Shandong University
[email protected]
Abstract

Existing multi-view representation learning methods typically follow a specific-to-uniform pipeline, extracting latent features from each view and then fusing or aligning them to obtain the unified object representation. However, the manually pre-specified fusion functions and the view-private redundant information mixed in the features potentially degrade the quality of the derived representation. To overcome these issues, we propose a novel multi-view learning framework based on bi-level optimization, where the representation is learned in a uniform-to-specific manner. Specifically, we train a meta-learner, namely MetaViewer, to learn the fusion and model the view-shared meta representation in the outer-level optimization. Starting from this meta representation, view-specific base-learners are then required to rapidly reconstruct the corresponding views in the inner level. MetaViewer is eventually updated by observing these uniform-to-specific reconstruction processes over all views, and thus learns an optimal fusion scheme that separates and filters out view-private information. Extensive experimental results on downstream tasks such as classification and clustering demonstrate the effectiveness of our method.

1 Introduction

Multi-view representation learning mines a unified representation from multiple views of the same entity [27, 55, 39, 58]. Each view, acquired by a different sensor or source, contains both view-shared consistency information and view-specific information. The view-specific information consists of complementary and redundant components, where the former can be considered a supplement to the consistency information, while the latter is highly specific and may be detrimental to the unified representation [8, 16]. Therefore, a high-quality representation is required to retain the consistency and complementary information, as well as to filter out the view-private redundant parts [48].

Figure 1: (a), (b) and (c) show three multi-view learning frameworks following the specific-to-uniform pipeline, where the unified representation is obtained by fusing or concatenating view-specific features. (d) illustrates our uniform-to-specific manner, where a meta-learner learns to fuse by observing the reconstruction from the unified representation to specific views.

Given the data containing two views $x_{1}$ and $x_{2}$, the prevailing multi-view representation methods typically follow a specific-to-uniform pipeline and can be roughly characterized as:

H := f(x_{1}; W_{f}) \circ g(x_{2}; W_{g}),   (1)

where $f$ and $g$ are encoding (or embedding [28]) functions that map the original view data into the corresponding latent features with the trainable parameters $W_{f}$ and $W_{g}$. These latent features are subsequently aggregated into the unified representation using the designed aggregation operator $\circ$. With different aggregation strategies, existing approaches can be further subdivided into the joint, alignment, and combined shared-and-specific (S&S) representations [3, 27, 23].
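
For concreteness, the following is a minimal PyTorch sketch of the specific-to-uniform pipeline in Eq. (1), assuming two toy MLP encoders and concatenation as the aggregation operator $\circ$; the widths and the BDGP-like input dimensions (1,750-D and 79-D, cf. Tab. 1) are illustrative only, and any joint, alignment, or S&S operator could take the place of the final concatenation.

```python
import torch
import torch.nn as nn

# Toy view-specific encoders f(.; W_f) and g(.; W_g)
f = nn.Sequential(nn.Linear(1750, 256), nn.ReLU())
g = nn.Sequential(nn.Linear(79, 256), nn.ReLU())

# A toy batch of two-view data (dimensions borrowed from BDGP in Tab. 1)
x1, x2 = torch.randn(32, 1750), torch.randn(32, 79)

# The aggregation operator instantiated as concatenation: H = f(x1) "o" g(x2)
H = torch.cat([f(x1), g(x2)], dim=1)
```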

Joint representation focuses on the integration of complementary information by directly fusing latent features, where $\circ$ is instantiated as a fusion strategy such as a graph-based module [41, 34], a neural network [33, 10], or other elaborate functions [37, 18, 7]. Alignment representation instead seeks to align the view-specific features to retain the consistency information, specifying $\circ$ as an alignment operator measured by distance [11, 26], similarity [14, 25], or correlation [1, 44, 24]. As a trade-off, S&S representation explicitly splits the latent features into shared and specific parts and only aligns the shared part [50, 21, 30, 23]. Fig. 1 (a)-(c) illustrate these three branches of the specific-to-uniform paradigm.

Despite demonstrating promising results, this aggregation paradigm inherently suffers from potential risks in two aspects: (1) The derived unified representation is usually the concatenation or fusion of learned latent features under manually pre-specified rules, which makes these methods hard to apply generally in practice, since the appropriate fusion scheme varies significantly with the downstream task and the training views [36]. (2) Even when a well-performing fusion scheme is found, the view-private redundant information mixed in the latent features still degrades the quality of the fused unified representation. Several studies have noticed the second issue and attempted to distinguish the redundant information from view features via multi-level feature modeling [48] or matrix factorization [57]. However, recent works indicate that view-specific information cannot be automatically separated at the feature level [35, 23]. In addition, the first issue has received little attention.

In this paper, we propose a novel multi-view representation learning framework based on bi-level-optimization meta-learning. In contrast to the specific-to-uniform pipeline, our model emphasizes learning the unified representation in a uniform-to-specific manner, as illustrated in Fig. 1 (d). In detail, we build a meta-learner, namely MetaViewer, to learn the fusion and model a unified meta representation in the outer-level optimization. Based on this meta representation, view-specific base-learners are required to rapidly reconstruct the corresponding views in the inner level. MetaViewer is eventually updated by observing the reconstruction processes over all views, thus learning an optimal fusion rule and addressing the first issue. On the other hand, the rapid reconstruction from the uniform representation to specific views in the inner-level optimization essentially models the information that cannot be fused, i.e., the view-private parts, solving the second issue. After alternate training, the resulting meta representation is close to each view, indicating that it contains as much view-shared information as possible while avoiding the hindrance of redundant information. Extensive experiments on multiple benchmarks validate the performance of our MetaViewer. The unified meta representation learned from multiple views achieves comparable performance to state-of-the-art methods in downstream tasks such as clustering and classification. The core contributions of this paper are as follows.

  1. We propose a novel insight for multi-view representation learning, where the unified representation is learned in a uniform-to-specific manner.

  2. Our MetaViewer achieves data-driven fusion of view features in a meta-learning paradigm. To the best of our knowledge, this could be the first meta-learning-based work in multi-view representation scenarios.

  3. MetaViewer decouples the modeling of view-shared and view-private information via bi-level optimization, alleviating the hindrance of redundant information.

  4. Extensive experimental results validate the performance of our approach in several downstream tasks.

2 Related Work

2.1 Multi-view learning

Multi-view representation learning is not a new topic and has been widely used in downstream tasks such as retrieval, classification, and clustering. This work focuses on multi-view representation within the unsupervised deep learning scope, and related works can be summarized into two main categories [51]. One is the deep extension of traditional methods, with representative examples including deep canonical correlation analysis (DCCA) [1] and its variants [44, 40, 54]. DCCA aims to discover nonlinear mappings of two views into a common space in which their canonical correlations are maximally preserved. These methods benefit from a sound theoretical foundation, but usually impose strict restrictions on the number and form of views.

Another alternative is the multi-view deep network. Early deep approaches attempted to build different architectures for handling multi-view data, such as CNN-based [12, 53, 38] and GAN-based models [49, 19]. Some recent approaches focus on better parameter constraints using mutual information [2, 8], contrastive information [28, 52], etc. Most of these existing methods follow the specific-to-uniform pipeline; in contrast, our MetaViewer learns from uniform to specific. The most related work is MFLVC [48], which separates view-private information from latent features at the parameter level. The essential difference is that we model the view-private information in the inner-level optimization, allowing the outer level to observe the modeling process and further meta-learn the optimal fusion scheme.

Figure 2: The overall framework of MetaViewer, containing (a) the meta-split of multi-view data and three modules: (b) the embedding module, (c) the representation module, and (d) the self-supervised module. These modules are trained with bi-level optimization. The inner level (dark gray arrows) learns view-specific reconstruction on the support set, and the outer level (red arrows) updates the entire model to learn the fusion scheme and the unified representation $H$ for downstream tasks by validating on the query set.

2.2 Meta-learning

Optimization-based meta-learning is a classic application of bi-level optimization, designed to learn task-level knowledge so as to quickly handle new tasks [20, 22]. A typical work, MAML [13], learns a set of initialization parameters that can solve different tasks within a few update steps. A similar meta paradigm has been used to learn other manually designed parts, such as the network structure [29], the optimizer [45, 59], and even sample weights [36, 32]. Similarly, we meta-learn the fusion of multi-view features for a unified representation. There also exist works that consider both meta-learning paradigms and multi-view data [42, 15, 31]. However, they are dedicated to exploiting the rich information contained in multiple views to improve the performance of a meta-learner on few-shot tasks or in self-supervised scenarios. Instead, we train a meta-learner to derive high-quality shared representations from multi-view data via bi-level optimization. To the best of our knowledge, this is the first work to learn multi-view representations with a meta-learning paradigm.

3 MetaViewer

Consider an unlabeled multi-view dataset $\mathcal{D}=\{x_{i}\in\mathbb{R}^{d_{x}}\}_{i=1}^{N}$, where $N$ is the number of samples and each sample entity $x_{i}=\{x_{i}^{1},x_{i}^{2},\dots,x_{i}^{V}\}$ contains $V$ views. The view-incomplete scenario means that partial views of some samples are missing or unavailable, i.e., $\mathcal{D}_{inc}=\{\mathcal{D}_{c},\{\mathcal{D}_{u}^{v}\}_{v=1}^{V}\}$, where $\mathcal{D}_{c}$ and $\mathcal{D}_{u}^{v}$ denote the subsets of samples with complete views and with an unavailable $v$-th view, respectively. Our goal is to learn a unified high-quality representation $H$ for each entity by observing the available views and filtering out view-private information as much as possible. The overall framework of our MetaViewer is shown in Fig. 2, including three main modules and a bi-level optimization process. The outer level trains a meta-learner to learn an optimal fusion function and derive the unified representation $H$, while the inner level reconstructs the original views from $H$ in a few update steps, which explicitly models and separates the view-private information and ensures the representation quality. In the following subsections, we first introduce the entire structure of MetaViewer and then elaborate on the bi-level optimization process.

3.1 The entire structure

Embedding module aims to transform each heterogeneous view into a latent feature space in which the transformed view embeddings share the same dimension. To this end, we construct a view-specific embedding function $f_{v}$ for each view, where $v=1,2,\dots,V$. Given the $v$-th view data $x^{v}$ of the entity $x$, the corresponding embedding $z^{v}$ can be computed by

z^{v} = f_{v}(x^{v}, \phi_{f_{v}}),   (2)

where $z^{v}\in\mathbb{R}^{d}$ and $f_{v}$ is typically instantiated as a multi-layer neural network with learnable parameters $\phi_{f_{v}}$.
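
As a minimal sketch, the embedding module can be read as one small MLP per view that maps heterogeneous inputs to a common $d$-dimensional space; the input dimensions below follow the two Handwritten views in Tab. 1, while the hidden width and $d$ are illustrative assumptions.

```python
import torch.nn as nn

d = 256                      # common embedding dimension (assumed)
view_dims = [240, 216]       # e.g., the two Handwritten views in Tab. 1

# embed[v](x_v) plays the role of f_v(x^v, phi_fv) in Eq. (2), producing z^v in R^d
embed = nn.ModuleList([
    nn.Sequential(nn.Linear(d_x, 512), nn.ReLU(), nn.Linear(512, d))
    for d_x in view_dims
])
```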

Representation learning module maps the obtained embeddings to view representations, and consists of view-specific base-learners $\{b_{v}\}_{v=1}^{V}$ and a view-shared meta-learner $m$ (i.e., MetaViewer). The former learn a representation for each view embedding, while the latter takes all embeddings as input and outputs the unified representation $H$ that is ultimately used for downstream tasks. Meanwhile, the base-learners are required to be initialized from the parameters of the meta-learner to learn the view-shared meta representation (see Sec. 3.2), and thus the two types of learners should be structurally shared rather than individually designed. To meet these two requirements simultaneously, MetaViewer is implemented as a channel-oriented 1-d convolutional layer (C-Conv) with a non-linear activation (e.g., ReLU [17]), as shown in Fig. 2 (c).

On the one hand, as a meta-learner, we first concatenate the embeddings at the channel level, i.e., the number of channels in the concatenated feature equals the number of views, and then train MetaViewer to learn the fusion of cross-view information $H\in\mathbb{R}^{d_{h}}$ by

H = m(z^{cat}, \omega),   (3)

where $z^{cat}\in\mathbb{R}^{d\times V}$ denotes the concatenated embedding and $\omega$ is the parameter of MetaViewer. On the other hand, the base-learners can be initialized and trained to learn the $v$-th view representation $h_{base}^{v}$ via

h_{base}^{v} = b_{v}(z^{v}, \theta_{b_{v}}(\omega_{sub})),   (4)

where $h_{base}^{v}\in\mathbb{R}^{d_{h}}$, and $\theta_{b_{v}}(\omega_{sub})$, or $\theta_{b_{v}}(\omega)$ for short, means that the base-learner's parameters $\theta_{b_{v}}$ are initialized from the channel(view)-related part $\omega_{sub}$ of MetaViewer's parameters. Note that this sub-network mechanism also provides a convenient way to handle incomplete views.
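
The sketch below gives one hedged reading of this module: MetaViewer $m$ is a channel-oriented Conv1d over the $V$ stacked view embeddings (32 kernels of size 3, as in the ablation of Sec. 4.4) followed by ReLU and a channel average down to a $d_{h}$-dimensional $H$, and $\omega_{sub}$ is taken to be the channel-$v$ slice of the convolution weight; the exact reduction to $d_{h}$ and the slicing rule are our assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

V, d = 2, 256                # number of views and embedding dim (from the sketch above)
meta = nn.Conv1d(in_channels=V, out_channels=32, kernel_size=3, padding=1)

def fuse(z_list):
    """H = m(z^cat, omega): stack the V embeddings as channels and fuse them (Eq. (3))."""
    z_cat = torch.stack(z_list, dim=1)            # (B, V, d) concatenated embedding
    return torch.relu(meta(z_cat)).mean(dim=1)    # (B, d_h) unified representation H

def omega_sub(v):
    """Channel(view)-related part of omega used to initialise theta_bv in Eq. (4)."""
    return meta.weight[:, v:v + 1, :]             # shape (32, 1, 3), kept in the autograd graph
```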

Self-supervised module conducts pretext tasks to provide effective supervision objectives for model training, represented by different heads. Typically, the reconstruction head $r$ achieves the reconstruction objective by re-mapping the representation back to the original view space, i.e.,

x_{rec}^{v} = r_{v}(z_{rec}^{v}, \phi_{r_{v}}),   (5)

where $x_{rec}^{v}\in\mathbb{R}^{d_{x}}$ is the reconstruction result and $r_{v}$ is the reconstruction function with learnable parameters $\phi_{r_{v}}$. Beyond that, we can also attach a contrastive or correlation head to the meta-learner to mine the associations across views. Similar self-supervised objectives [40, 54, 28, 52] have been extensively studied in multi-view learning and are not the focus of this work.
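
A matching sketch of the reconstruction heads $r_{v}$ in Eq. (5) is a symmetric MLP decoder per view that maps a representation back to the original view space; setting the decoder input dimension to the embedding dimension from the sketches above is an assumption for illustration.

```python
import torch.nn as nn

d_h = 256   # representation dimension fed to the decoder (assumed equal to d above)

# recon[v](h) plays the role of r_v(z_rec^v, phi_rv), producing x_rec^v in R^{d_x}
recon = nn.ModuleList([
    nn.Sequential(nn.Linear(d_h, 512), nn.ReLU(), nn.Linear(512, d_x))
    for d_x in view_dims     # view_dims as defined in the embedding sketch
])
```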

3.2 Training via bi-level optimization

Now we have constructed the entire structure, which can be trained end-to-end to derive the unified multi-view representation, even for incomplete views, in the specific-to-uniform manner like most existing approaches. However, data-driven fusion and view-private redundant information still cannot be handled well in this way. We therefore turn to the opposite, uniform-to-specific way using a bi-level optimization process inspired by the meta-learning paradigm. The inner level focuses on training the view-specific modules for the corresponding views, and the outer level updates the meta-learner to find the optimal fusion rule by observing the learning over all views. Before the detailed description, we introduce a meta-learning-style split of the multi-view data.

Meta-split of multi-view data. Consider a batch of multi-view samples $\{\mathcal{D}_{batch}^{v}\}_{v=1}^{V}$ drawn from $\mathcal{D}$. For bi-level updating, we randomly and proportionally divide it into two disjoint subsets, marked as the support set $S$ and the query set $Q$, respectively. As shown in Fig. 2 (a), the support set is used in the inner level for learning view-specific information, so the sample attributes within it can be ignored. In contrast, the query set retains both view and sample attributes for training the meta-learner in the outer-level optimization. This meta-split, which decouples views from samples, naturally transfers to data with incomplete views $\mathcal{D}_{inc}=\{\mathcal{D}_{c},\{\mathcal{D}_{u}^{v}\}_{v=1}^{V}\}$, where the incomplete subsets $\{\mathcal{D}_{u}^{v}\}_{v=1}^{V}$ serve as the support set and the complete part $\mathcal{D}_{c}$ is left as the query set.
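
A minimal sketch of this meta-split, assuming a random proportional split of an aligned batch into a support set $S$ and a query set $Q$ (the support ratio is the hyper-parameter studied in Fig. 5 (a)):

```python
import torch

def meta_split(batch_views, support_ratio=0.5):
    """batch_views: list of V tensors, each of shape (B, d_x^v), aligned by sample index."""
    B = batch_views[0].size(0)
    perm = torch.randperm(B)
    n_s = int(B * support_ratio)
    s_idx, q_idx = perm[:n_s], perm[n_s:]
    # Support set: consumed view by view in the inner level, so sample alignment is not needed
    S = [x[s_idx] for x in batch_views]
    # Query set: keeps both view and sample attributes for the outer-level update
    Q = [x[q_idx] for x in batch_views]
    return S, Q
```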

Figure 3: Intuitively, the specific-to-uniform paradigm observes view features and learns a representation that falls in a reconstructable space. MetaViewer observes the reconstruction process and seeks a unified meta representation that is as close as possible to each view.
Algorithm 1 The framework of our MetaViewer.

Require: Training dataset $\mathcal{D}$, meta parameters $\omega$, base parameters $\{\theta_{v}\}_{v=1}^{V}$, view-specific parameters $\{\phi_{v}\}_{v=1}^{V}$, the number of views $V$, and the number of inner-level iteration steps $T$.

1:  Initialize $\omega$, $\{\phi_{v}\}_{v=1}^{V}$;
2:  while not done do
3:     # Outer level
4:     Sample and meta-split a batch from $\mathcal{D}$:
5:     $\{\mathcal{D}_{batch}^{v}\}_{v=1}^{V}=\{S,Q\}$.
6:     for $t=1,\dots,T$ do
7:        for $v=1,\dots,V$ do
8:           # Inner level
9:           Initialize $\theta_{v}=\theta_{v}(\omega)$, $\tilde{\phi}_{v}=\phi_{v}$;
10:          Optimize $\theta_{v}(\omega)$ and $\tilde{\phi}_{v}$ via Eq. 6.
11:        end for
12:     end for
13:     Optimize $\omega$ via Eq. 8.
14:     Optimize $\{\phi_{v}\}_{v=1}^{V}$ via Eq. 8.
15:  end while

Inner-level optimization. Without loss of generality, take the inner-level update after the $o$-th outer-level optimization as an example. Let $\omega^{o}$ be the latest parameters of MetaViewer, and let $\phi_{v}^{o}=\{\phi_{f_{v}}^{o},\phi_{r_{v}}^{o}\}$ denote, for brevity, the latest parameters of the embedding and self-supervised modules. We first initialize the base-learner from the meta-learner, i.e., $\theta_{b_{v}}^{0}=\omega^{o}$, and make a copy $\tilde{\phi}_{v}$ of $\phi_{v}^{o}$. Note that the copy means gradients with respect to $\tilde{\phi}_{v}$ are not back-propagated to $\phi_{v}^{o}$ and vice versa. Thus, $\theta_{b_{v}}^{0}$ and $\tilde{\phi}_{v}$ form the initial state for the inner-level optimization. Let $\mathcal{L}_{inner}^{v}$ be the inner-level loss function with respect to the $v$-th view; the corresponding update goal is

\theta_{b_{v}}^{\ast}(\omega),\ \tilde{\phi}_{v}^{\ast} = \mathop{\arg\min}\ \mathcal{L}_{inner}^{v}\left(\theta_{b_{v}}(\omega^{o}), \tilde{\phi}_{v}; S^{v}\right).   (6)

Considering a gradient descent strategy (e.g., SGD [4]), we can further write the update process of $\theta_{b_{v}}$ as:

\theta_{b_{v}}^{i} = \theta_{b_{v}}^{i-1} - \beta\frac{\partial\mathcal{L}_{inner}^{v}}{\partial\theta_{b_{v}}^{i-1}},\ \dots,\ \theta_{b_{v}}^{0} = \omega^{o},   (7)

where $\beta$ and $i$ denote the learning rate and the iterative step of the inner-level optimization, respectively.
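
A hedged sketch of this inner-level update for view $v$ is given below; it uses torch.autograd.grad with create_graph=True so that the returned $\theta_{b_{v}}^{\ast}$ remains a differentiable function of $\omega$, reuses the module sketches above, and omits the update of the copied $\tilde{\phi}_{v}$ for brevity.

```python
import torch
import torch.nn.functional as F

def inner_step(theta0, embed_v, recon_v, S_v, beta=1e-2, steps=1):
    theta = theta0                                                 # theta_bv^0 = omega_sub
    for _ in range(steps):                                         # T inner iterations
        z = embed_v(S_v).unsqueeze(1)                              # (n_S, 1, d) view embedding
        h = torch.relu(F.conv1d(z, theta, padding=1)).mean(dim=1)  # base-learner b_v forward
        x_rec = recon_v(h)                                         # reconstruction head r_v
        loss = F.mse_loss(x_rec, S_v)                              # L_inner^v, Eq. (11)
        grad, = torch.autograd.grad(loss, theta, create_graph=True)
        theta = theta - beta * grad                                # Eq. (7), still a function of omega
    return theta                                                   # theta*_bv(omega)
```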

Outer-level optimization. After several inner-level updates, we obtain a set of optimal view-specific parameters on the support set. The outer level then updates the meta-learner, embedding, and head modules by training on the query set. With the loss function $\mathcal{L}_{outer}$, the outer-level optimization goal is

\omega^{\ast}, \{\phi_{v}^{\ast}\}_{1}^{V} = \mathop{\arg\min}\ \mathcal{L}_{outer}\left(\theta^{\ast}(\omega), \{\phi_{v}\}_{1}^{V}; Q\right).   (8)

By alternately optimizing Eq. 6 and Eq. 8, we end up with the optimal meta parameters $\omega^{\ast}$ and a set of view-specific parameters $\{\phi_{v}^{\ast}\}_{1}^{V}$. For a test sample $x_{test}$, the corresponding representation is derived by sequentially feeding it through the embedding functions and the meta-learner. The overall procedure of our MetaViewer is summarized in Alg. 1.
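
Putting the pieces together, a hedged sketch of Alg. 1 for the MVer-R variant reads as follows; it reuses embed, recon, meta, fuse, omega_sub, meta_split, and inner_step from the earlier sketches, and the optimizer choice and the exact query-set reconstruction path are assumptions guided by Eqs. (8) and (10) rather than the released implementation.

```python
import torch
import torch.nn.functional as F

params = list(meta.parameters()) + list(embed.parameters()) + list(recon.parameters())
opt = torch.optim.Adam(params, lr=1e-3)                # outer-level learning rate (Sec. 4.1)

for batch_views in loader:                             # `loader` yields a list of V aligned view tensors
    S, Q = meta_split(batch_views)                     # meta-split of the batch
    outer_loss = 0.0
    for v in range(len(batch_views)):                  # inner level, per view
        theta_v = inner_step(omega_sub(v), embed[v], recon[v], S[v])   # Eqs. (6)-(7)
        # Outer level: validate the adapted reconstruction path on the query set (Eq. (8))
        z_q = embed[v](Q[v]).unsqueeze(1)
        h_q = torch.relu(F.conv1d(z_q, theta_v, padding=1)).mean(dim=1)
        outer_loss = outer_loss + F.mse_loss(recon[v](h_q), Q[v])
    opt.zero_grad()
    outer_loss.backward()                              # gradients reach omega through theta*_v(omega), Eq. (10)
    opt.step()

H = fuse([embed[v](x_v) for v, x_v in enumerate(Q)])   # unified representation for downstream tasks
```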

Datasets #Views #Classes #Samples (train, val, test) View Dimensions
BDGP 2 5 2500 (1500, 500, 500) 1750; 79
Handwritten 2 10 2000 (1200, 400, 400) 240; 216
RGB-D 2 50 500 (300, 100, 100) 12288; 4096
Fashion-MV 3 10 10000 (6000, 2000, 2000) 784; 784; 784
MSRA 6 7 210 (126, 42, 42) 1302; 48; 512; 100; 256; 210
Caltech101-20 6 20 2386 (1425, 469, 492) 48; 40; 254; 1984; 512; 928
Table 1: The attributes for all datasets used in our experiments.
Datasets Metrics DCCA [1] DCCAE [44] MIB [8] MFLVC [48] DCP [28] MVer-R MVer-C
BDGP ACC 0.4640 0.5180 0.6940 0.8800 0.7820 0.8280 0.9040
NMI 0.4163 0.5793 0.5565 0.8397 0.7800 0.7979 0.8627
ARI 0.3347 0.3208 0.4865 0.8504 0.6725 0.6156 0.8925
Handwritten ACC 0.5725 0.6300 0.6325 0.6400 0.6625 0.7500 0.8625
NMI 0.6980 0.7504 0.6758 0.6453 0.7056 0.7853 0.7896
ARI 0.5215 0.5929 0.5216 0.4885 0.5610 0.6721 0.7225
RGB-D ACC 0.5100 0.4800 0.5000 0.5300 0.5200 0.5300 0.5700
NMI 0.8299 0.8158 0.8113 0.8331 0.8204 0.8241 0.8497
ARI 0.5202 0.4834 0.5127 0.5407 0.5264 0.5304 0.5707
Fashion-MV ACC 0.7070 0.7105 0.5720 0.8320 0.6260 0.8080 0.8540
NMI 0.8042 0.8112 0.7383 0.8875 0.6838 0.8813 0.8876
ARI 0.6180 0.6234 0.4762 0.7893 0.5430 0.7505 0.8007
MSRA ACC 0.3333 0.3571 0.3095 0.7143 0.6429 0.7143 0.7018
NMI 0.2997 0.3285 0.2471 0.6796 0.6801 0.7007 0.6126
ARI 0.3341 0.3471 0.3035 0.6891 0.6308 0.6893 0.6029
Caltech101-20 ACC 0.3862 0.3659 0.3598 0.3659 0.3679 0.4187 0.4512
NMI 0.5088 0.5224 0.4700 0.5836 0.4437 0.5852 0.6086
ARI 0.2273 0.2525 0.2218 0.2687 0.2350 0.2919 0.3500
Table 2: Clustering results of all methods on six datasets. Bold and underline denote the best and second-best results, respectively.

3.3 Specific-to-uniform versus uniform-to-specific

We discuss the difference between the specific-to-uniform and uniform-to-specific paradigms through the update of the fusion parameters $\omega$ with a reconstruction loss $\mathcal{L}_{rec}^{v}$. Using the same structure described in Sec. 3.1, the specific-to-uniform paradigm generally optimizes $\omega$ by minimizing the reconstruction losses over all views, i.e., $\omega^{\ast}=\mathop{\arg\min}\sum_{v=1}^{V}\mathcal{L}_{rec}^{v}\left(r_{v}(m(z^{v},\omega),\phi_{r_{v}}),x^{v}\right)$, and $\omega$ is updated by (with SGD)

\omega \leftarrow \omega - \alpha\sum_{v=1}^{V}\nabla_{\omega}\mathcal{L}^{v}_{rec}(\omega).   (9)

The optimal $\omega^{\ast}$ observes all views and derives the unified representation $H$, that is, from the particular to the general. In contrast, the update of $\omega$ in our uniform-to-specific paradigm can be written as

\omega \leftarrow \omega - \alpha\sum_{v=1}^{V}\nabla_{\omega}\mathcal{L}^{v}_{rec}\left(\theta^{\ast}_{b_{v}}(\omega)\right).   (10)

Note that $\theta^{\ast}_{b_{v}}(\omega)$ contains the optimization process of each view in the inner level as in Eq. (6), which means that the optimal $\omega^{\ast}$ is updated by observing the reconstruction from the unified representation to specific views. Fig. 3 intuitively illustrates the difference between these two manners.
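
For intuition, the per-view term in Eq. (10) can be unrolled for a single inner step ($T=1$, so $\theta_{b_{v}}^{\ast}(\omega)=\omega-\beta\nabla_{\omega}\mathcal{L}_{inner}^{v}(\omega;S^{v})$); the expansion below is the standard chain rule for such bi-level updates rather than a derivation given in the paper:

```latex
% Chain-rule expansion of one term of Eq. (10) under a single inner step (T = 1)
\nabla_{\omega}\mathcal{L}_{rec}^{v}\!\left(\theta^{\ast}_{b_{v}}(\omega)\right)
  = \left(I - \beta\,\nabla^{2}_{\omega}\mathcal{L}_{inner}^{v}(\omega; S^{v})\right)
    \nabla_{\theta}\mathcal{L}_{rec}^{v}(\theta)\Big|_{\theta=\theta^{\ast}_{b_{v}}(\omega)} .
```

Unlike Eq. (9), the gradient that reaches $\omega$ is therefore shaped by how each view is reconstructed from the shared starting point, i.e., by the inner reconstruction objective on the support set.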

3.4 The instances of the objective function

Our uniform-to-specific framework emphasizes learning from reconstruction in the inner level, thus $\mathcal{L}_{inner}^{v}$ is specified as the reconstruction loss [56, 28]

\mathcal{L}_{inner}^{v} = \mathcal{L}_{rec}^{v}(S^{v}, S_{rec}^{v}) = \|S^{v} - S_{rec}^{v}\|_{F}^{2}.   (11)

The parameters updated in the outer level, in contrast, can be constrained by richer self-supervised feedback, as mentioned in Sec. 3.1. Here we provide two instances of the outer-level loss function to demonstrate how MetaViewer can be extended with different learning objectives.

MVer-R adopts the same reconstruction loss as the inner level, i.e., $\mathcal{L}_{outer}=\sum_{v}\mathcal{L}_{rec}^{v}(Q^{v},Q_{rec}^{v})$, which is the purest implementation of MetaViewer.

MVer-C additionally utilizes a contrastive objective, where the similarities of views belonging to the same entity (i.e., positive pairs) should be maximized and those of different entities (i.e., negative pairs) should be minimized, i.e., $\mathcal{L}_{outer}=\sum_{v}\left(\mathcal{L}_{rec}^{v}+\sum_{v^{\prime}\neq v}\mathcal{L}_{con}^{v,v^{\prime}}\right)$. Following previous work [48, 19, 52], the contrastive loss $\mathcal{L}_{con}^{v,v^{\prime}}$ is formed as

\mathcal{L}_{con}^{v,v^{\prime}} = -\frac{1}{N_{Q}}\sum_{i=1}^{N_{Q}}\log\frac{e^{d(q_{i}^{v},q_{i}^{v^{\prime}})/\tau}}{\sum_{j=1,j\neq i}^{N_{Q}}e^{d(q_{i}^{v},q_{j}^{v})/\tau}+\sum_{j=1}^{N_{Q}}e^{d(q_{i}^{v},q_{j}^{v^{\prime}})/\tau}},   (12)

where $q_{i}^{v}$ is the $v$-th view of the $i$-th query sample, and $d$ is a similarity metric (e.g., cosine similarity [6]). $N_{Q}$ and $\tau$ denote the number of query samples and the temperature parameter, respectively. Note that the derived meta representation can also be used in contrastive learning as an additional novel view.
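
The following is a minimal sketch of this contrastive term, assuming cosine similarity as $d(\cdot,\cdot)$ and row-wise query representations $q^{v}, q^{v^{\prime}}$ of shape $(N_{Q}, d_{h})$; the temperature value is illustrative.

```python
import torch.nn.functional as F

def contrastive_loss(q_v, q_vp, tau=0.5):
    """L_con^{v,v'} of Eq. (12) with d(.,.) = cosine similarity."""
    q_v, q_vp = F.normalize(q_v, dim=1), F.normalize(q_vp, dim=1)
    sim_within = q_v @ q_v.t() / tau                 # d(q_i^v, q_j^v) / tau
    sim_cross = q_v @ q_vp.t() / tau                 # d(q_i^v, q_j^{v'}) / tau
    pos = sim_cross.diag()                           # positive pairs: same entity i across views
    # Denominator: same-view pairs with j != i, plus all cross-view pairs j
    neg_within = sim_within.exp().sum(1) - sim_within.diag().exp()
    denom = neg_within + sim_cross.exp().sum(1)
    return -(pos - denom.log()).mean()
```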

4 Experiments

In this section, we present extensive experimental results to validate the quality of the unified representation derived by our MetaViewer. The remainder of the experiments is organized as follows: Subsection 4.1 lists the datasets, compared methods, and implementation details. Subsection 4.2 compares the performance of our method with classical and state-of-the-art methods on two common downstream scenarios, clustering and classification. Comparisons with manually designed fusion and ablation studies are shown in Subsections 4.3 and 4.4, respectively.

Datasets Metrics DCCA [1] DCCAE [44] MIB [8] MFLVC [48] DCP [28] MVer-R MVer-C
BDGP ACC 0.9840 0.9865 0.8900 0.9820 0.9720 0.9860 0.9800
Precision 0.9842 0.9863 0.9005 0.9822 0.9726 0.9871 0.9859
F-score 0.9840 0.9850 0.8884 0.9820 0.9720 0.9859 0.9802
Handwritten ACC 0.8825 0.9000 0.7900 0.9400 0.9725 0.9700 0.9775
Precision 0.8920 0.9048 0.8390 0.9420 0.9730 0.9708 0.9790
F-score 0.8805 0.8992 0.7852 0.9401 0.9724 0.9700 0.9775
RGB-D ACC 0.3000 0.2400 0.3300 0.4400 0.3700 0.5100 0.5600
Precision 0.2110 0.1600 0.2850 0.4609 0.2887 0.5365 0.5520
F-score 0.2204 0.1691 0.2737 0.4181 0.3078 0.4873 0.5278
Fashion-MV ACC 0.8490 0.8535 0.8680 0.9650 0.8925 0.9685 0.9770
Precision 0.8522 0.8597 0.8680 0.9652 0.8206 0.9637 0.9678
F-score 0.8354 0.8384 0.8655 0.9649 0.8290 0.9648 0.9707
MSRA ACC 0.2381 0.2429 0.3619 0.6905 0.9048 0.9371 0.9270
Precision 0.2053 0.2204 0.2498 0.7129 0.9153 0.9393 0.9317
F-score 0.2422 0.2357 0.2773 0.6895 0.9037 0.9391 0.9277
Caltech101-20 ACC 0.7154 0.7154 0.7272 0.8537 0.9248 0.9228 0.9216
Precision 0.4527 0.6057 0.6164 0.7183 0.8941 0.8946 0.9068
F-score 0.3981 0.4325 0.5247 0.6907 0.8458 0.8421 0.8572
Table 3: Classification results of all methods on six datasets. Bold and underline denote the best and second-best results, respectively.

4.1 Experimental Setup

Datasets. To comprehensively evaluate the effectiveness of our MetaViewer, we adopt six multi-view benchmarks in the experiments. All datasets are scaled to [0, 1] and split into training, validation, and test sets in the ratio of 6:2:2, as shown in Tab. 1. More details of the datasets are given in Appendix A.

  • BDGP is an image and text dataset, corresponding to 2,500 drosophila embryo images in 5 categories. Each image is described by a 79-D textual feature vector and a 1,750-D visual feature vector [5].

  • Handwritten contains 2,000 handwritten digit images from 0 to 9. Two types of descriptors, i.e., the 240-D pixel averages in 2×3 windows and the 216-D profile correlations, are selected as two views [56].

  • RGB-D contains visual and depth images of 300 distinct objects across 50 categories [60, 43]. Two views are obtained by flattening the 64×64×3 color images and the 64×64 depth images.

  • Fashion-MV is an image dataset that contains 10 categories with a total of 30,000 fashion products. It has three views, each of which consists of 10,000 gray images sampled from the same category [47].

  • MSRA [46] consists of 210 scene recognition images from seven classes with six views, namely CENTRIST, CMT, GIST, HOG, LBP, and SIFT.

  • Caltech101-20 is a subset of the Caltech101 image set [9], consisting of 2,386 images of 20 subjects. Six features are used, including Gabor, Wavelet Moments, CENTRIST, HOG, GIST, and LBP.

Compared methods. We compare the performance of MetaViewer with five representative multi-view learning methods, including two classical methods (DCCA [1] and DCCAE [44]) and three state-of-the-art methods (MIB [8], MFLVC [48], and DCP [28]). Among them, DCCA and DCCAE are deep extensions of traditional correlation strategies. MIB is a typical generative method with mutual information constraints. DCP learns a unified representation from both complete and incomplete views. In particular, MFLVC also notices the view-private redundant information and designs a multi-level feature network for clustering tasks.

Implementation details. For a fair comparison, all methods are trained from scratch and share the same backbone listed in Appendix B. We concatenate the latent features of all views in the compared methods to obtain the unified representation $H$ with the same dimension $d_{h}=256$, and verify their performance on clustering and classification tasks using K-means and a linear SVM, respectively. For MetaViewer, we train 2,000 epochs on all benchmarks and set the batch size to 32 for RGB-D and MSRA and 256 for the others. The learning rates in the outer and inner levels are set to $10^{-3}$ and $10^{-2}$, respectively. All experiments are implemented with the PyTorch library on a single RTX 3090.
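
For reference, a hedged sketch of this evaluation protocol on the learned representation $H$ (K-means for clustering and a linear SVM for classification) is given below; the scikit-learn calls and the .npy dumps of $H$ and the labels are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, normalized_mutual_info_score

# Hypothetical dumps of the 256-d unified representation H and the ground-truth labels
H_train, H_test = np.load("H_train.npy"), np.load("H_test.npy")
y_train, y_test = np.load("y_train.npy"), np.load("y_test.npy")

# Clustering: K-means on H (e.g., 10 clusters for Handwritten), scored by NMI
pred = KMeans(n_clusters=10, n_init=10).fit_predict(H_test)
print("NMI:", normalized_mutual_info_score(y_test, pred))

# Classification: linear SVM trained on H, scored by accuracy
clf = LinearSVC().fit(H_train, y_train)
print("ACC:", accuracy_score(y_test, clf.predict(H_test)))
```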

4.2 Performance on downstream tasks

Clustering results. Tab. 2 lists the results of the clustering task, where performance is measured by three standard evaluation metrics, i.e., Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). A higher value of these metrics indicates better clustering performance. It can be observed that (1) our MVer-C variant significantly outperforms the other compared methods on all benchmarks except MSRA; (2) the second-best results appear between MVer-R and MFLVC, both of which explicitly separate the view-private information; (3) a larger number of categories and views is the main reason for the degradation of clustering performance, and our MetaViewer improves most significantly in such scenarios (e.g., Fashion-MV and Caltech101-20).

Classification results. Tab. 3 lists the results of the classification task, where three common metrics are used, including Accuracy, Precision, and F-score. A higher value indicates better classification performance. Similar to the clustering results, the two variants of MetaViewer significantly outperform the compared methods. It is worth noting that (1) DCP learns a unified generic representation and therefore achieves the second-best results instead of MFLVC; (2) the number of categories is the main factor affecting classification performance, and our method obtains the most significant improvement on the RGB-D dataset with 50 classes. More results, including incomplete views, are deferred to Appendix C.

4.3 Comparison with manually designed fusion.

As mentioned in Sec. 3.3, MetaViewer essentially learns to learn an optimal fusion function that filters out the view-private information. To verify this, we compare it with commonly used fusion strategies [27], including sum, maximum, concatenation, a linear layer, and C-Conv. The former three are specified fusion rules without trainable parameters, while the remaining two are trainable fusion layers trained in the specific-to-uniform manner. Tab. 4 lists the clustering results and an additional MSE score on the Handwritten dataset with the same embedding and reconstruction network (see Tab. 1). We observe that (1) trainable fusion layers outperform the hand-designed rules, and our MetaViewer yields the best performance; (2) the MSE scores listed in the last column indicate that the quality of the unified representation cannot be measured or guaranteed by the reconstruction constraint alone, due to the view-private redundant information mixed in the view-specific latent features.

Figure 4: Effect of meta-learner architectures with different depth, width, and kernel size on classification accuracy.

4.4 Ablation Studies

Meta-learner structures. We implement the meta-learner as a channel-level convolution layer in this work. Albeit simple, this layer can be considered a universal approximator for almost any continuous function [36], and thus can fit a wide range of conventional fusion functions. To investigate the effect of network depth, width, and convolution kernel size on the performance of the representation, we alternately fix the 32 kernels and the 1×3 kernel size and show the classification results on the Handwritten data in Fig. 4. It is clear that (1) the meta-learner works well with just a shallow structure, as shown in Fig. 4 (a), instead of gradually overfitting to the training data as the network deepens or widens, and (2) our MetaViewer is stable and insensitive to these hyper-parameters within reasonable ranges.

Meta-split ratios. Fig. 5 (a) shows the impact of the meta-split described in Sec. 3.2 on classification performance, where the proportion of the support set is varied from 0.1 to 0.9 in steps of 0.1, and the rest is the query set. In addition to the single views, we also compare the sum and concat. fusion as baselines. MetaViewer consistently surpasses all baselines over the experimental proportions. In addition, the fusion baselines depend more on the better-performing view at lower proportions, and instead become unstable as the available query samples decrease.

Inner-level update steps. Another hyper-parameter is the number of iteration steps in the inner-level optimization. More iterations mean a larger gap from the learned meta representation to the specific view space, i.e., coarser modeling of the view-private information. Fig. 5 (b) shows the classification results with various steps, where n steps means that the inner-level optimization is updated n times throughout the training. MetaViewer achieves the best results when using 1 step and remains stable within 15 steps.

Figure 5: Effect of (a) different meta-division ratios and (b) the number of inner-loop iterations on classification accuracy.
Strategies Rules ACC↑ NMI↑ ARI↑ MSE↓
Sum $z^{x}+z^{y}$ 69.25 71.89 59.02 1.84
Max $\max(z^{x},z^{y})$ 80.75 73.93 63.75 -
Concat. $cat[z^{x},z^{y}]$ 78.75 72.02 61.52 1.77
Linear $l(z^{x},z^{y},\theta_{l})$ 85.00 77.40 69.71 4.74
C-Conv $m(z^{x},z^{y},\omega)$ 69.75 65.21 51.33 2.37
MetaViewer meta-learning 86.25 78.96 72.25 2.45
Table 4: Clustering results on the Handwritten dataset.

5 Conclusion

This work introduced a novel meta-learning perspective for multi-view learning, and proposed a meta-learner, namely MetaViewer, to derive a high-quality unified representation for downstream tasks. In contrast to the prevailing specific-to-uniform pipeline, MetaViewer observes the reconstruction process from unified representation to specific views and essentially learns an optimal fusion function that separates and filters out meaningless view-private information. Extensive experimental results on clustering and classification tasks demonstrate the performance of the unified representation we meta-learned.

References

  • [1] Galen Andrew, Raman Arora, Jeff A. Bilmes, and Karen Livescu. Deep canonical correlation analysis. In ICML, volume 28, pages 1247–1255, 2013.
  • [2] Philip Bachman, R. Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In NeurIPS, pages 15509–15519, 2019.
  • [3] Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE TPAMI, 41(2):423–443, 2019.
  • [4] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier and Gilbert Saporta, editors, COMPSTAT, pages 177–186, 2010.
  • [5] Xiao Cai, Hua Wang, Heng Huang, and Chris H. Q. Ding. Joint stage recognition and anatomical annotation of drosophila gene expression patterns. Bioinform., 28(12):16–24, 2012.
  • [6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, volume 119, pages 1597–1607, 2020.
  • [7] Yanhua Cheng, Xin Zhao, Rui Cai, Zhiwei Li, Kaiqi Huang, and Yong Rui. Semi-supervised multimodal deep learning for RGB-D object recognition. In IJCAI, pages 3345–3351, 2016.
  • [8] Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust representations via multi-view information bottleneck. In ICLR, 2020.
  • [9] Li Fei-Fei, Robert Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. CVIU, 106(1):59–70, 2007.
  • [10] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, pages 1933–1941, 2016.
  • [11] Fangxiang Feng, Xiaojie Wang, and Ruifan Li. Cross-modal retrieval with correspondence autoencoder. In ACM MM, pages 7–16, 2014.
  • [12] Yifan Feng, Zizhao Zhang, Xibin Zhao, Rongrong Ji, and Yue Gao. GVCNN: group-view convolutional neural networks for 3d shape recognition. In CVPR, pages 264–272, 2018.
  • [13] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, volume 70, pages 1126–1135, 2017.
  • [14] Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, and Tomás Mikolov. Devise: A deep visual-semantic embedding model. In NeurIPS, pages 2121–2129, 2013.
  • [15] Kuiliang Gao, Bing Liu, Xuchu Yu, and Anzhu Yu. Unsupervised meta learning with multiview constraints for hyperspectral image small sample set classification. IEEE TIP, 31:3449–3462, 2022.
  • [16] Yu Geng, Zongbo Han, Changqing Zhang, and Qinghua Hu. Uncertainty-aware multi-view representation learning. In AAAI, pages 7545–7553, 2021.
  • [17] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, volume 15, pages 315–323, 2011.
  • [18] Suriya Gunasekar, Makoto Yamada, Dawei Yin, and Yi Chang. Consistent collective matrix completion under joint low rank structure. In AISTATS, volume 38, 2015.
  • [19] Kaveh Hassani and Amir Hosein Khas Ahmadi. Contrastive multi-view representation learning on graphs. In ICML, volume 119, pages 4116–4126, 2020.
  • [20] Timothy M. Hospedales, Antreas Antoniou, Paul Micaelli, and Amos J. Storkey. Meta-learning in neural networks: A survey. IEEE TPAMI, 44(9):5149–5169, 2022.
  • [21] Junlin Hu, Jiwen Lu, and Yap-Peng Tan. Sharable and individual multi-view metric learning. IEEE TPAMI, 40(9):2281–2288, 2018.
  • [22] Mike Huisman, Jan N. van Rijn, and Aske Plaat. A survey of deep meta-learning. Artif. Intell. Rev., 54(6):4483–4541, 2021.
  • [23] Xiaodong Jia, Xiao-Yuan Jing, Xiaoke Zhu, Songcan Chen, Bo Du, Ziyun Cai, Zhenyu He, and Dong Yue. Semi-supervised multi-view deep discriminant representation learning. IEEE TPAMI, 43(7):2496–2509, 2021.
  • [24] Xiao-Yuan Jing, Fei Wu, Xiwei Dong, Shiguang Shan, and Songcan Chen. Semi-supervised multi-view correlation feature learning with application to webpage classification. In AAAI, pages 1374–1381, 2017.
  • [25] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE TPAMI, 39(4):664–676, 2017.
  • [26] Dongge Li, Nevenka Dimitrova, Mingkun Li, and Ishwar K. Sethi. Multimedia content processing through cross-modal association. In ACM MM, pages 604–611, 2003.
  • [27] Yingming Li, Ming Yang, and Zhongfei Zhang. A survey of multi-view representation learning. IEEE TKDE, 31(10):1863–1883, 2019.
  • [28] Yijie Lin, Yuanbiao Gou, Xiaotian Liu, Jinfeng Bai, Jiancheng Lv, and Xi Peng. Dual contrastive prediction for incomplete multi-view representation learning. IEEE TPAMI, 2022.
  • [29] Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting Cheng, and Jian Sun. Metapruning: Meta learning for automatic neural network channel pruning. In ICCV, pages 3295–3304, 2019.
  • [30] Shirui Luo, Changqing Zhang, Wei Zhang, and Xiaochun Cao. Consistent and specific multi-view subspace clustering. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, AAAI, pages 3730–3737, 2018.
  • [31] Yao Ma, Shilin Zhao, Weixiao Wang, Yaoman Li, and Irwin King. Multimodality in meta-learning: A comprehensive survey. KBS, 250:108976, 2022.
  • [32] Yuren Mao, Zekai Wang, Weiwei Liu, Xuemin Lin, and Pengtao Xie. Metaweighting: Learning to weight tasks in multi-task learning. In ACL, pages 3436–3448, 2022.
  • [33] Niall McLaughlin, Jesús Martínez del Rincón, and Paul Miller. Recurrent convolutional network for video-based person re-identification. In CVPR, pages 1325–1334, 2016.
  • [34] Feiping Nie, Guohao Cai, Jing Li, and Xuelong Li. Auto-weighted multi-view learning for image clustering and semi-supervised classification. IEEE TIP, 27(3):1501–1511, 2018.
  • [35] Mathieu Salzmann, Carl Henrik Ek, Raquel Urtasun, and Trevor Darrell. Factorized orthogonal latent spaces. In AISTATS, volume 9, pages 701–708, 2010.
  • [36] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-weight-net: Learning an explicit mapping for sample weighting. In NeurIPS, pages 1917–1928, 2019.
  • [37] Nitish Srivastava and Ruslan Salakhutdinov. Multimodal learning with deep boltzmann machines. JMLR, 15(1):2949–2980, 2014.
  • [38] Kai Sun, Jiangshe Zhang, Junmin Liu, Ruixuan Yu, and Zengjie Song. DRCNN: dynamic routing convolutional neural network for multi-view 3d object recognition. IEEE TIP, 30:868–877, 2021.
  • [39] Shiliang Sun, Wenbo Dong, and Qiuyang Liu. Multi-view representation learning with deep gaussian processes. IEEE TPAMI, 43(12):4453–4468, 2021.
  • [40] Zhongkai Sun, Prathusha Kameswara Sarma, William A. Sethares, and Yingyu Liang. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In AAAI, pages 8992–8999, 2020.
  • [41] Hong Tao, Chenping Hou, Feiping Nie, Jubo Zhu, and Dongyun Yi. Scalable multi-view semi-supervised classification via adaptive regression. IEEE TIP, 26(9):4283–4296, 2017.
  • [42] Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J. Lim. Multimodal model-agnostic meta-learning via task-aware modulation. In NeurIPS, pages 1–12, 2019.
  • [43] Shiye Wang, Changsheng Li, Yanming Li, Ye Yuan, and Guoren Wang. Self-supervised information bottleneck for deep multi-view subspace clustering. CoRR, abs/2204.12496, 2022.
  • [44] Weiran Wang, Raman Arora, Karen Livescu, and Jeff A. Bilmes. On deep multi-view representation learning. In ICML, volume 37, pages 1083–1092, 2015.
  • [45] Olga Wichrowska, Niru Maheswaranathan, Matthew W. Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. In ICML, volume 70, pages 3751–3760, 2017.
  • [46] Jinglin Xu, Junwei Han, and Feiping Nie. Discriminatively embedded k-means for multi-view clustering. In CVPR, pages 5356–5364, 2016.
  • [47] Jie Xu, Yazhou Ren, Huayi Tang, Zhimeng Yang, Lili Pan, Yang Yang, and Xiaorong Pu. Self-supervised discriminative feature learning for multi-view clustering. CoRR, abs/2103.15069, 2021.
  • [48] Jie Xu, Huayi Tang, Yazhou Ren, Liang Peng, Xiaofeng Zhu, and Lifang He. Multi-level feature learning for contrastive multi-view clustering. In CVPR, pages 16030–16039, 2022.
  • [49] Fei Xue, Xin Wu, Shaojun Cai, and Junqiu Wang. Learning multi-view camera relocalization with graph neural networks. In CVPR, pages 11372–11381, 2020.
  • [50] Xiaowei Xue, Feiping Nie, Sen Wang, Xiaojun Chang, Bela Stantic, and Min Yao. Multi-view correlated feature learning by uncovering shared component. In AAAI, pages 2810–2816, 2017.
  • [51] Xiaoqiang Yan, Shizhe Hu, Yiqiao Mao, Yangdong Ye, and Hui Yu. Deep multi-view learning methods: A review. Neurocomputing, 448:106–129, 2021.
  • [52] En Yu, Zhuoling Li, and Shoudong Han. Towards discriminative representation: Multi-view trajectory contrastive learning for online multi-object tracking. In CVPR, pages 8824–8833, 2022.
  • [53] Tan Yu, Jingjing Meng, and Junsong Yuan. Multi-view harmonized bilinear network for 3d object recognition. In CVPR, pages 186–194, 2018.
  • [54] Yun-Hao Yuan, Jin Li, Yun Li, Jipeng Qiang, Yi Zhu, Xiaobo Shen, and Jianping Gou. Learning canonical f-correlation projection for compact multiview representation. In CVPR, pages 19238–19247, 2022.
  • [55] Changqing Zhang, Huazhu Fu, Qinghua Hu, Xiaochun Cao, Yuan Xie, Dacheng Tao, and Dong Xu. Generalized latent multi-view subspace clustering. IEEE TPAMI, 42(1):86–99, 2020.
  • [56] Changqing Zhang, Yeqing Liu, and Huazhu Fu. Ae2-nets: Autoencoder in autoencoder networks. In CVPR, pages 2577–2585, 2019.
  • [57] Liang Zhao, Tao Yang, Jie Zhang, Zhikui Chen, Yi Yang, and Z. Jane Wang. Co-learning non-negative correlated and uncorrelated features for multi-view data. IEEE TNNLS, 32(4):1486–1496, 2021.
  • [58] Qinghai Zheng, Jihua Zhu, and Zhongyu Li. Collaborative unsupervised multi-view representation learning. IEEE TCSVT, 32(7):4202–4210, 2022.
  • [59] Wenqing Zheng, Tianlong Chen, Ting-Kuei Hu, and Zhangyang Wang. Symbolic learning to optimize: Towards interpretability and scalability. In ICLR, 2022.
  • [60] Pengfei Zhu, Binyuan Hui, Changqing Zhang, Dawei Du, Longyin Wen, and Qinghua Hu. Multi-view deep subspace clustering networks. CoRR, abs/1908.01978, 2019.