Instance-Variant Loss with Gaussian RBF Kernel for 3D Cross-modal Retrieval
Abstract.
3D cross-modal retrieval is gaining attention in the multimedia community. Central to this topic is learning a joint embedding space to represent data from different modalities, such as images, 3D point clouds, and polygon meshes, so as to extract modality-invariant and discriminative features. Hence, the performance of cross-modal retrieval methods heavily depends on the representational capacity of this embedding space. Existing methods treat all instances equally, applying the same penalty strength to instances with varying degrees of difficulty and ignoring the differences between instances. This can result in ambiguous convergence or local optima, severely compromising the separability of the feature space. To address this limitation, we propose an Instance-Variant loss that assigns different penalty strengths to different instances, improving the separability of the embedding space. Specifically, we assign each instance a penalty weight that is positively related to its intra-class distance. Simultaneously, we reduce the cross-modal discrepancy between features by learning a shared weight vector for same-class data from different modalities. By leveraging the Gaussian RBF kernel to evaluate sample similarity, we further propose an Intra-Class loss function that minimizes the intra-class distance among same-class instances. Extensive experiments on three 3D cross-modal datasets show that our proposed method surpasses recent state-of-the-art approaches.
1. Introduction
As 3D models become increasingly prevalent in CAD, VR/AR, and autonomous driving applications, the efficient and accurate retrieval of 3D models has gained growing attention within the multimedia community. This area has received widespread interest as the foundation for numerous downstream tasks, such as robot navigation, scene understanding, 3D modeling, and animation (Han et al., 2019). 3D cross-modal retrieval aims to reduce cross-modal discrepancy and learn modality-invariant (minimizing intra-class distance) and discriminative features (maximizing inter-class distance) among multi-modal data. In comparison to 2D cross-modal retrieval (image-text retrieval (Wei et al., 2023; Zhang et al., 2022), sketch-based image retrieval (Sain et al., 2021; Tian et al., 2021)), 3D cross-modal retrieval has to consider the representation and structure of 3D models and utilizes more multi-modal data, including images, point clouds, meshes, and multi-view grayscale images for 3D models as query domains for retrieval.

The key challenge in 3D cross-modal retrieval is reducing the substantial cross-modal discrepancy between 3D models and images. Existing cross-modal retrieval methods focus on learning a common embedding space that bridges the heterogeneous gap and enables features from different modalities to be compared. The common embedding space aims to minimize the intra-class distance within the same category and maximize the inter-class distance between different categories. Recently, researchers have employed contrastive loss (Lin et al., 2021; Hu et al., 2022), cross-modal center loss (Jing et al., 2021), softmax cross-entropy loss (Liang et al., 2021), and other contrastive learning and metric learning methods to learn a common embedding space that characterizes multi-modal data features. To minimize the intra-class distance, researchers have used the cross-modal center and cross-entropy losses to learn a common embedding space for all modalities by calculating the class center (mean of the features) for each class (Jing et al., 2021). Others have employed contrastive learning to enforce inter-modal centroid alignment, reducing the discrepancy between various data modalities (Hu et al., 2022).
Although these methods have made significant progress, most still treat all instances of all classes from all modalities equally and overlook differences among instances, leading to ambiguous convergence and suboptimal performance (Wei et al., 2022). Cross-modal retrieval tasks improve the network's representation ability in the common space by optimizing the margin between the intra-class and inter-class distances. As illustrated in Fig. 1, the contribution of different instances to the network varies considerably. Applying equal penalty strength to instances with varying degrees of difficulty can result in ambiguous convergence or local optima, which can severely compromise the separability of the feature space. Furthermore, the inherent differences in data representation and structure between 3D models and images make learning a separable embedding space challenging. This significant modality discrepancy makes it hard for the network to learn modality-invariant features, and the retrieval performance in 3D cross-modal tasks consequently suffers from ambiguous convergence. Therefore, it is crucial to design an effective loss that weights the penalty strength at the instance level for 3D cross-modal retrieval, considering the varying contributions of instances to the network and assigning different penalty strengths to different instances.
To address the aforementioned issues, we propose the Instance-Variant loss for 3D cross-modal retrieval. The most intuitive motivation is to assign different penalty strengths to different instances: by adding to the softmax output a learnable weight coefficient that is positively related to the instance's intra-class distance, we ensure that instances that are more challenging to distinguish (those with larger intra-class distances) receive stronger penalties, thereby improving the network's effectiveness. Simultaneously, because the penalty strength is related to the intra-class distance of each instance, the network also adapts to imbalanced data distributions and imbalanced multimodal optimization. Unlike the classical softmax loss, we add different penalty strengths for different instances and learn a common weight vector for all modalities. This approach effectively alleviates the modal optimization imbalance problem and reduces the discrepancy between modalities.
Under the constraint of the softmax loss, we can map all modality data onto a shared unit hypersphere (Liu et al., 2017) by normalizing the weight vectors and instance features. The unit hypersphere can better minimize intra-class distance and maximize inter-class distance. It also helps reduce the discrepancy between various data sources, enhances the semantic information of the learned data, and improves the robustness of retrieval tasks. Inspired by the Gaussian RBF kernel function (Borodachov et al., 2019), we propose an Intra-Class loss that minimizes the distance between all samples of the same class on the unit hypersphere. Unlike the center loss, our Intra-Class loss does not require finding class centers. The essence of the Gaussian kernel is to measure the similarity between samples: similar samples can be better clustered together in a space that describes similarity and then become linearly separable. Using the Intra-Class loss minimizes the intra-class distance and also evaluates the multi-modal shared unit hypersphere (embedding space) learned by the network.
By utilizing the proposed Instance-Variant loss and Intra-Class loss, the network can assign distinct penalty strengths to different instances while minimizing the intra-class distance, effectively reducing the inseparability of the feature space caused by ambiguous convergence. To validate the effectiveness of the proposed Instance-Variant and Intra-Class loss, we jointly train the framework with the cross-entropy loss function for the 3D cross-modal retrieval task, aiming to extract modality-invariant and discriminative features. Our approach significantly outperforms recent state-of-the-art methods in 3D cross-modal and uni-modal retrieval tasks. The primary contributions of this paper can be summarized as follows:
• We propose the Instance-Variant loss, which effectively assigns different penalty strengths to different instances, enhancing the network's effectiveness by focusing on more challenging instances and promoting better feature space separability.
• We introduce the Intra-Class loss based on the Gaussian RBF kernel function, aiming to minimize the intra-class distance among all instances in the shared embedding space. This approach minimizes the intra-class distance and evaluates the multi-modal shared embedding space learned by the network.
• The proposed Instance-Variant loss learns a shared weight vector for data from different modalities, effectively mitigating cross-modal discrepancy and enhancing cross-modal retrieval performance.
• Our approach significantly outperforms the recent state-of-the-art methods on three datasets (Pix3D, ModelNet40, MI3DOR) for 3D cross-modal and uni-modal retrieval tasks. This demonstrates the effectiveness of the proposed Instance-Variant loss and Intra-Class loss.
2. Related Works
3D Cross-modal Retrieval. There has been growing interest in 3D cross-modal retrieval algorithms, including image-based 3D shape retrieval (IBSR), 3D model-based shape retrieval, and 3D cross-modal mutual retrieval. Image-based approaches represent a 3D shape as a set of 2D views captured from pre-defined viewpoints of the 3D shape (Huang et al., 2022; Zhou et al., 2022; Xu et al., 2020). The advantage of using a set of 2D view representations is that they can directly employ existing powerful CNNs for feature extraction (Liu et al., 2023; Su et al., 2015) and reduce the domain gap between 3D models and images. Lin et al. (Lin et al., 2021) used contrastive learning to realize instance-level 3D shape retrieval based on a single image. Great progress has been made in IBSR tasks, but some limitations remain, such as the query domain being limited to images and the lack of global spatial and local geometric information about the 3D model. Model-based shape retrieval typically represents 3D models as polygon meshes (Hanocka et al., 2019; Feng et al., 2019) or point clouds (Qi et al., 2017a; Wang et al., 2019b; Qi et al., 2017b). PointNet (Qi et al., 2017a) developed a point-wise operation and a symmetric function to solve the permutation variance issue. DGCNN (Wang et al., 2019b) proposed a dynamic graph convolutional neural network with EdgeConv using the K nearest neighbor points. MeshNet (Feng et al., 2019) and MeshCNN (Hanocka et al., 2019) were designed to learn features directly from the mesh by modeling the geometric relations of the mesh faces of the object. The advantage of model-based approaches is that they can exploit the global spatial and local geometric information of 3D models to obtain representative 3D descriptors. However, they have primarily focused on uni-modal retrieval without further exploring cross-modal retrieval.
3D cross-modal mutual retrieval comprehensively considers the advantages and disadvantages of the aforementioned methods. These approaches leverage feature extraction networks such as DGCNN, ResNet, and MeshNet to extract features from point clouds, images, and polygon meshes. By measuring the intra-class and inter-class distances in a shared embedding space, cross-modal mutual retrieval can achieve mutual retrieval between any two modalities. Jing et al. (Jing et al., 2021) pioneered the field of 3D cross-modal (mutual) retrieval, using a cross-modal center loss to reduce the discrepancy between modalities and minimize the intra-class distance of samples. Chen et al. (Chen et al., 2021) utilized a multimodal contrastive prototype loss to accomplish semi-supervised 3D cross-modal retrieval. 3D cross-modal mutual retrieval has far-reaching implications: it enables cross-modal retrieval from real (projected) images, point clouds, and mesh data. Using 3D cross-modal (mutual) retrieval technology, researchers can quickly find corresponding texture materials for 3D models and rapidly convert between different 3D model representations. This paper aims to use the proposed Instance-Variant loss and Intra-Class loss to achieve cross-modal mutual retrieval between real images and 3D models.
Metric learning. Metric learning focuses on learning a distance metric that encourages semantically relevant instances to be close to each other. In previous literature, numerous metric learning approaches (Wang et al., 2019a; Zheng et al., 2020; Sun et al., 2020) have been developed for various tasks. Wang et al. (Wang et al., 2019a) proposed a multi-similarity loss for collecting and weighting informative pairs. Sun et al. (Sun et al., 2020) introduced a circle loss to weight different similarity scores. However, these approaches were developed for uni-modal retrieval tasks and usually cannot accurately capture the relationships among cross-modal components in the presence of the modality gap. Frome et al. (Frome et al., 2013) attempted to project images and sentences into a common embedding space, where an unweighted triplet loss was used to encourage relevant semantic instances to cluster together. Wei et al. (Wei et al., 2022, 2021) introduced a universal weighting metric learning framework to sample informative pairs and assign proper weight values to them based on their similarity scores. However, these methods cannot adapt to category-level cross-modal retrieval. Constrained by the limitations of existing datasets, it is difficult for researchers to construct sample pairs or triplets. Hence, in this paper, we develop a novel metric learning method for 3D cross-modal retrieval that modifies its penalty strength directly according to the intra-class distance of each instance, changing its contribution to the loss function.
3. Proposed Approach
In this section, we will explain our proposed Instance-Variant loss and the Intra-Class loss based on the Gaussian RBF kernel function. In subsection 3.1, we will present the problem description and preliminaries. Subsections 3.2 and 3.3 will cover the mathematical formulation and analysis of the Instance-Variant and Intra-Class losses, respectively.
3.1. Problem Statement and Preliminaries
Assume the dataset $\mathcal{D}$ contains $n$ instances of $c$ categories drawn from $M$ modalities; the $i$-th instance $O_i$ is a set consisting of its $M$ modality samples $\{s_i^m\}_{m=1}^{M}$ together with a semantic label $y_i$ and a weight vector $W_{y_i}$. Formally:
(1)  $\mathcal{D}=\{O_i\}_{i=1}^{n},\quad O_i=\big(\{s_i^m\}_{m=1}^{M},\, y_i,\, W_{y_i}\big)$
Generally, the modality instances are in different representation spaces, and their similarities (distances) cannot be directly measured. The 3D cross-modal retrieval task aims to learn a projection function $f^m(\cdot\,;\theta^m)$ for each modality $m$, where $\theta^m$ is a learnable parameter, $v_i^m = f^m(s_i^m;\theta^m)$ is the projected feature in the common representation space, and $V_y$ denotes the set of all features of class $y$. To obtain optimal retrieval performance, we aim to ensure that the distance between same-class instances (the intra-class distance) is smaller than the distance between instances from different classes (the inter-class distance) by a significant margin $\lambda$:
(2)  $D(v_i^{m_1}, v_j^{m_2}) + \lambda \le D(v_i^{m_1}, v_k^{m_3}), \quad \forall\, y_i = y_j,\; y_i \ne y_k,$ where $D(\cdot,\cdot)$ denotes a distance in the common space.
To better understand the proposed Instance-Variant loss, we will first briefly review the original softmax, A-softmax (Liu et al., 2017), and AM-softmax (Wang et al., 2018). The formulation of the original softmax loss is given by
(3)  $\mathcal{L}_{softmax} = -\dfrac{1}{n}\sum_{i=1}^{n}\log\dfrac{e^{W_{y_i}^{T}f_i}}{\sum_{j=1}^{c}e^{W_{j}^{T}f_i}}$
where $f_i$ is the input of the last fully connected layer ($f_i$ denotes the feature of the $i$-th sample), and $W_j$ is the $j$-th column of the last fully connected layer. $W_{y_i}^{T}f_i$ is also called the target logit (Pereyra et al., 2017) of the $i$-th sample, and $W_j$ is also called the weight vector of the $j$-th category. The relationship between the weight vector and the features' mean vector (class center) is described in Figure 6 of (Wang et al., 2017). In the A-softmax loss, the authors proposed to normalize the weight vectors (making $\|W_j\|$ equal to 1) and to generalize the target logit from $\|f_i\|\cos(\theta_{y_i})$ to $\|f_i\|\cos(m\theta_{y_i})$, where $\theta_{y_i}$ is the angle between $f_i$ and $W_{y_i}$ and $m \ge 1$ is a margin factor. In the AM-softmax, the authors proposed to normalize both the weight vectors and the instances' features ($\|W_j\| = \|f_i\| = 1$), while the target logit becomes $\cos(\theta_{y_i}) - m$.
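To make the preceding review concrete, the sketch below (in PyTorch) shows how normalizing both the class weight vectors and the features turns the logits into cosine similarities, and how AM-softmax subtracts an additive margin from the target logit; the margin and scale values are typical illustrative choices, not values used in this paper.

```python
import torch
import torch.nn.functional as F

def am_softmax_logits(features, weights, labels, margin=0.35, scale=30.0):
    """Cosine logits with an additive margin on the target class.

    features: (N, d) instance features; weights: (C, d) class weight vectors.
    Both are L2-normalized so that every logit becomes a cosine similarity.
    """
    f = F.normalize(features, dim=1)           # ||f_i|| = 1
    w = F.normalize(weights, dim=1)            # ||W_j|| = 1
    cos = f @ w.t()                            # (N, C) cosine similarities
    onehot = F.one_hot(labels, num_classes=w.size(0)).float()
    # target logit becomes cos(theta_y) - m; non-target logits stay cos(theta_j)
    return scale * (cos - margin * onehot)

# usage: loss = F.cross_entropy(am_softmax_logits(feat, W, y), y)
```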
3.2. Instance-Variant Loss
The softmax loss function has been extensively applied in uni-modal retrieval tasks, as it not only identifies the optimal plane to separate distinct data classes but also learns the ideal weight vector for each category. In comparison to the class center (feature mean), the weight vector is better suited for metric learning problems involving hard samples. Nevertheless, in contrast to uni-modal retrieval tasks, cross-modal retrieval tasks must also diminish the cross-modal discrepancy. We innovatively learn a shared weight vector for data across various modalities (a single $W_j$ is shared by all $M$ modalities of class $j$). Throughout the weight vector's iterative updates, it acquires the features of all modal data, effectively addressing the imbalance in modality optimization. This paper assumes that the norms of $W_j$ and the features are normalized to 1 if not specified. Given the extracted feature $f_i^m$ of the $i$-th instance in modality $m$, the new cross-modal softmax loss takes the following form.
(4)  $\mathcal{L}_{cms} = -\dfrac{1}{nM}\sum_{i=1}^{n}\sum_{m=1}^{M}\log\dfrac{e^{W_{y_i}^{T}f_i^{m}}}{\sum_{j=1}^{c}e^{W_{j}^{T}f_i^{m}}}$
To simplify subsequent derivations, we can rewrite Equation 4 as Equation 5.
(5)  $\mathcal{L}_{cms} = \dfrac{1}{nM}\sum_{i=1}^{n}\sum_{m=1}^{M}\log\Big(1+\sum_{j\neq y_i}e^{W_{j}^{T}f_i^{m}-W_{y_i}^{T}f_i^{m}}\Big)$
To ensure that the intra-class distance is smaller than the inter-class distance, the target similarity $W_{y_i}^{T}f_i^{m}$ must exceed every non-target similarity $W_{j}^{T}f_i^{m}$ with $j \ne y_i$; under normalized weights and features these terms are exactly the cosine similarities between an instance and the class weight vectors. Thus, in the following we refer to $W_{y_i}^{T}f_i^{m}$ as the intra-class term and to $W_{j}^{T}f_i^{m}$ ($j \ne y_i$) as the inter-class terms of Equation 5.
If the network applies the same penalty strength to instances, ambiguous convergence (local optima) severely compromises the feature space’s separability. To address the limitation, we consider enhancing the optimization flexibility by allowing each instance to learn at its own pace, depending on its current optimization status. The Instance-Variant loss takes the following form.
(6)

For compactness, we rewrite Equation 6 as Equation 7:

(7)

Under the same conditions affecting loss convergence, such as the input data and the optimizer, taking the partial derivatives of Equations 5 and 7 with respect to the intra-class term shows that both losses are monotonically decreasing with respect to it. If we assume that both losses are optimized to the same value and that all training features can be perfectly classified (consistent with the L-softmax assumption (Liu et al., 2016)):

(8)

When both losses are optimized to the same value and all training features can be perfectly classified, the inter-class terms of Equation 7 will be smaller than those of Equation 5, which means Equation 7 will have better performance than Equation 5 (proved in the Appendix). During the training optimization process, an instance receives a stronger penalty when it is difficult to distinguish (its intra-class distance is large or its inter-class distance is small). Compared to the traditional softmax loss, we introduce the concept of hard negative mining. Moreover, we assess both the intra-class and inter-class distances to define hard instances, effectively enhancing the separability of the feature space. We can also use a hyperparameter to scale the penalty, adapting to different data distributions in various datasets. Simultaneously, researchers can improve the retrieval performance of the network by modifying only these hyperparameters. The proposed Instance-Variant loss can take the following form.

(9)

where the scaling coefficients are hyperparameters and the temperature coefficient (Hinton et al., 2015) controls the sharpness of the softmax.
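Since the exact form of Equations 6-9 is not reproduced above, the following PyTorch sketch only illustrates the two ingredients the section describes: a single normalized weight matrix shared by every modality, and a per-instance penalty weight that grows with the intra-class distance. The specific weighting function (1 + alpha*(1 - cos(theta_y))), the detached weight, and the hyperparameter values are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceVariantLossSketch(nn.Module):
    """Shared-weight cosine softmax with an instance-dependent penalty weight."""

    def __init__(self, num_classes, feat_dim, temperature=0.05, alpha=2.0):
        super().__init__()
        # one weight vector per class, shared by image, point-cloud, and mesh features
        self.W = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.temperature = temperature   # softmax temperature
        self.alpha = alpha               # scales the instance-variant weight

    def forward(self, features, labels):
        f = F.normalize(features, dim=1)                 # ||f_i^m|| = 1
        w = F.normalize(self.W, dim=1)                   # ||W_j|| = 1
        cos = f @ w.t()                                  # (N, C) cosine logits
        target_cos = cos.gather(1, labels[:, None]).squeeze(1)
        # penalty grows as the instance moves away from its class weight vector
        iv_weight = (1.0 + self.alpha * (1.0 - target_cos)).detach()
        per_instance = F.cross_entropy(cos / self.temperature, labels,
                                       reduction="none")
        return (iv_weight * per_instance).mean()
```

Because the same module (and hence the same weight matrix) is applied to the image, point-cloud, and mesh features of a batch, the shared weight vector of each class accumulates gradients from every modality, which is how the cross-modal discrepancy is reduced.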
3.3. Gaussian RBF Intra-Class loss

Under the constraint of the softmax loss, we can map all modality data onto a shared unit hypersphere by normalizing the weight vectors and instance features. We aim for the intra-class metric to be asymptotically correct and empirically reasonable with a finite number of points. Moreover, the essence of the Gaussian RBF kernel is to measure the similarity between samples: similar samples can be better clustered together in a space that describes their similarity and subsequently become linearly separable. To achieve this, we consider the Gaussian Radial Basis Function (RBF) kernel
(10)  $G_t(u,v)=e^{-t\|u-v\|_2^2}, \quad t>0,$
and define the Intra-Class loss as the negative log-likelihood of the average pairwise Gaussian RBF:
(11)  $\mathcal{L}_{intra} = -\log\dfrac{1}{|P|}\sum_{(i,j)\in P} e^{-t\|v_i-v_j\|_2^2},$ where $P$ denotes the set of same-class feature pairs (across all modalities) in a batch.
The objective of this loss function is to minimize the distance between any two instances of the same class from different modalities, thereby reducing the intra-class distance among all data within the same class. Compared to existing intra-class loss functions (Xiao et al., 2022; Jing et al., 2021), our Intra-Class loss directly minimizes the distance between pairs of instances without learning class centers, reducing the network's complexity and the cross-modal discrepancy. We discuss further mathematical properties in the appendix.
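A minimal sketch of this Intra-Class loss is given below; it assumes the pair set is formed from all same-class features (across modalities) within a batch, and the bandwidth t is an illustrative value.

```python
import torch

def intra_class_rbf_loss(features, labels, t=2.0, eps=1e-8):
    """Negative log of the average pairwise Gaussian RBF over same-class pairs.

    features: (N, d) L2-normalized embeddings from all modalities in the batch;
    labels: (N,) class indices; t is the RBF bandwidth.
    """
    sq_dist = torch.cdist(features, features, p=2).pow(2)     # (N, N) squared distances
    same_class = labels[:, None].eq(labels[None, :])
    self_pairs = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    mask = same_class & ~self_pairs                           # keep i != j, same class
    if mask.sum() == 0:
        return features.new_zeros(())
    kernel = torch.exp(-t * sq_dist[mask])                    # Gaussian RBF similarities
    return -torch.log(kernel.mean() + eps)
```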
Table 1. Performance (mAP, %) of 3D uni-modal and cross-modal retrieval on ModelNet40 (reported as mAP-v1/v2/v4), MI3DOR, Pix3D-9, and Pix3D-4; DSCMR (Zhen et al., 2019) is compared under ModelNet40 mAP-v4.

Source | Target | v1 CMCL | v1 Ours | v2 CMCL | v2 Ours | v4 CMCL | v4 DSCMR | v4 Ours | MI3DOR CMCL | MI3DOR Ours | Pix3D-9 CMCL | Pix3D-9 Ours | Pix3D-4 CMCL | Pix3D-4 Ours
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Image | Image | 82.06 | 83.55 | 86.00 | 88.13 | 90.23 | 82.31 | 90.50 | 75.26 | 77.24 | 72.97 | 77.99 | 85.97 | 91.38
Image | Mesh | 85.58 | 86.20 | 87.31 | 88.64 | 89.59 | 77.30 | 89.94 | 78.24 | 79.78 | 73.79 | 81.64 | 86.78 | 90.86
Image | Point | 85.23 | 85.77 | 86.79 | 88.16 | 89.04 | 74.33 | 89.48 | 79.69 | 80.15 | 74.48 | 80.15 | 84.16 | 91.43
Mesh | Image | 83.58 | 85.21 | 85.96 | 87.76 | 88.11 | 76.18 | 88.84 | 79.14 | 82.14 | 87.50 | 92.83 | 94.74 | 94.81
Point | Image | 82.29 | 85.23 | 85.18 | 87.87 | 87.11 | 73.74 | 89.03 | 79.40 | 81.82 | 74.54 | 83.72 | 86.22 | 93.09
Mesh | Mesh | 88.51 | 89.30 | —— | —— | —— | 74.84 | —— | 85.59 | 87.05 | 83.36 | 90.75 | 90.54 | 94.72
Mesh | Point | 87.37 | 88.14 | —— | —— | —— | 70.21 | —— | 85.74 | 86.59 | 88.53 | 89.05 | 95.22 | 95.77
Point | Point | 87.04 | 88.51 | —— | —— | —— | 70.80 | —— | 86.32 | 86.63 | 72.58 | 81.47 | 87.77 | 92.23
Point | Mesh | 87.58 | 89.04 | —— | —— | —— | 71.59 | —— | 85.91 | 86.85 | 85.30 | 89.16 | 92.43 | 92.54
Mean | | 85.47 | 86.77 | 86.97 | 88.39 | 88.29 | 74.59 | 89.20 | 81.70 | 83.14 | 79.23 | 85.20 | 89.31 | 92.98
Table 2. Retrieval performance (mAP, %) on ModelNet40 with different combinations of the loss functions.

Loss | | | | | | | |
---|---|---|---|---|---|---|---
Image2Image | 77.65 | 80.55 | 79.73 | 82.95 | 80.97 | 83.55 | 81.61 |
Image2Mesh | 82.26 | 84.84 | 84.54 | 85.82 | 84.90 | 86.20 | 83.05 |
Image2Point | 76.91 | 84.09 | 83.86 | 84.86 | 83.83 | 85.77 | 81.17 |
Mesh2Mesh | 86.78 | 88.56 | 83.86 | 88.66 | 89.28 | 89.30 | 84.60 |
Mesh2Image | 82.13 | 83.19 | 82.67 | 84.93 | 83.97 | 85.21 | 82.29 |
Mesh2Point | 81.35 | 87.30 | 87.23 | 86.75 | 87.77 | 88.14 | 81.57 |
Point2Point | 71.97 | 87.70 | 87.24 | 87.52 | 87.57 | 88.51 | 79.95 |
Point2Image | 75.56 | 82.68 | 81.93 | 84.64 | 83.34 | 85.23 | 80.44 |
Point2Mesh | 80.83 | 88.03 | 87.88 | 88.39 | 88.32 | 89.04 | 81.96 |
Mean | 79.49 | 85.22 | 84.84 | 86.06 | 85.55 | 86.77 | 81.85 |
4. Experiments
Datasets. To validate our proposed method, we perform experiments on three datasets: ModelNet40 (Wu et al., 2015), MI3DOR (Zhou et al., 2019), and Pix3D (Sun et al., 2018), together with Pix3D-4 (a four-category subset of Pix3D created by (Lin et al., 2021)). The ModelNet40 dataset is a 3D object benchmark containing 12,311 CAD models from 40 categories, with 9,843 used for training and 2,468 for testing; it provides three modalities: image, point cloud, and mesh. MI3DOR is a large-scale dataset for 2D-to-3D tasks with 21,000 images and 7,690 models from 21 categories, with 3,842 models used for training and 3,848 for testing. Pix3D is a large-scale dataset of real images and ground-truth models with precise 2D-3D alignment; it contains 395 models and 16,913 images from 9 categories, with 313 models used for training and 82 for testing. Pix3D-4 keeps only the categories that contain more than 300 non-occluded and non-truncated samples, yielding 322 models from 4 categories (bed, chair, sofa, table), with 257 for training and 65 for testing. Both the MI3DOR and Pix3D datasets provide only 3D models (complex meshes in OBJ format); we therefore sampled the 3D models from both datasets to obtain mesh data with 1,024 faces and point-cloud data with 2,048 points.
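For reference, the snippet below sketches one standard way to obtain a fixed-size point cloud from a triangle mesh via area-weighted surface sampling; it is an illustrative preprocessing example, not the exact pipeline used to build the MI3DOR and Pix3D point clouds.

```python
import numpy as np

def sample_points_from_mesh(vertices, faces, num_points=2048, seed=0):
    """Area-weighted uniform sampling of points on a triangle mesh surface.

    vertices: (V, 3) float array; faces: (F, 3) integer vertex indices.
    """
    rng = np.random.default_rng(seed)
    tri = vertices[faces]                                     # (F, 3, 3) triangle corners
    # triangle areas via the cross product of two edge vectors
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    idx = rng.choice(len(faces), size=num_points, p=areas / areas.sum())
    # random barycentric coordinates for each chosen face
    u, v = rng.random(num_points), rng.random(num_points)
    flip = u + v > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    w = 1.0 - u - v
    points = (u[:, None] * tri[idx, 0] + v[:, None] * tri[idx, 1]
              + w[:, None] * tri[idx, 2])
    return points.astype(np.float32)                          # (num_points, 3)
```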
4.1. Implementation Details
We propose an end-to-end framework for cross-modal retrieval tasks based on the proposed Instance-Variant loss, Intra-Class loss, and cross-entropy loss. The overview of the 3D cross-modal retrieval framework is illustrated in Fig. 2. For 2D image feature extraction, we utilize ResNet18 (He et al., 2016) as the backbone network with four convolution blocks, all with 3 × 3 kernels, where the numbers of kernels are 64, 128, 256, and 512, respectively. DGCNN (Wang et al., 2019b) is employed as the backbone network to capture point cloud features; it contains four EdgeConv blocks with kernels set to 64, 64, 64, and 128. MeshNet (Feng et al., 2019) is used to extract mesh features. The shared multimodal feature encoder consists of two fully connected layers (512→256→C, where C is the number of classes). The three proposed loss functions are used jointly to train the network to learn discriminative and modality-invariant features.
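The shared multimodal head described above can be sketched as follows; the backbones are assumed to output 512-dimensional features, and the ReLU between the two fully connected layers is an assumption.

```python
import torch.nn as nn

class SharedMultimodalHead(nn.Module):
    """Shared two-layer encoder (512 -> 256 -> C) applied to every modality."""

    def __init__(self, num_classes, in_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes))

    def forward(self, img_feat, pc_feat, mesh_feat):
        # the same weights process features from all three backbones
        return tuple(self.encoder(x) for x in (img_feat, pc_feat, mesh_feat))
```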
Training details. We implemented our network using PyTorch (Paszke et al., 2019). For all three datasets, the network is trained with an SGD optimizer and a learning rate of 0.01. The learning rate is reduced by 90% every 20,000 iterations for ModelNet40 and MI3DOR, and every 4,000 iterations for Pix3D. The temperature parameter is fixed across all three datasets, while the remaining hyperparameters are set separately for ModelNet40, MI3DOR, and Pix3D. Since our work is the first to employ real-world datasets such as MI3DOR and Pix3D for 3D cross-modal retrieval tasks, we lack baseline methods for comparison. As a result, we have chosen the Cross-Modal Center Loss (abbreviated as CMCL) (Jing et al., 2021) as our baseline and conducted the corresponding experiments on the aforementioned datasets. The CMCL training details on MI3DOR and Pix3D are the same as those on ModelNet40.
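A minimal sketch of this optimizer setup, assuming that "reduced by 90%" means multiplying the learning rate by 0.1 and that SGD uses a standard momentum of 0.9 (the momentum value is not stated in the text):

```python
import torch

def build_optimizer(model, lr=0.01, decay_every=20000, gamma=0.1):
    """SGD with a step decay measured in iterations (20,000 for ModelNet40/MI3DOR,
    4,000 for Pix3D); call scheduler.step() once per training iteration."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=decay_every, gamma=gamma)
    return optimizer, scheduler
```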
Evaluation Metrics. The evaluation results for all experiments are presented with the Mean Average Precision (mAP) score, a classical performance evaluation criterion for cross-modal retrieval tasks (Jing et al., 2021; Chen et al., 2021; Zeng et al., 2021). The mAP for the retrieval task measures whether the retrieved data belong to the same class as the query (relevant) or not (irrelevant). Given a query and a set of R corresponding retrieved data (R top-ranked data), the Average Precision is defined as:
(12)  $AP = \dfrac{1}{T}\sum_{r=1}^{R} P(r)\,\delta(r)$
where $T$ is the number of relevant items in the retrieved set, $P(r)$ represents the precision of the top $r$ retrieved items, and $\delta(r)$ is an indicator function whose value is one if the $r$-th retrieved item is relevant (relevant here means belonging to the same category as the query). The mAP is then calculated by averaging the AP values over all queries.
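Equation 12 can be computed directly from the binary relevance labels of a ranked retrieval list, as in the following sketch:

```python
import numpy as np

def average_precision(relevance):
    """AP for one query, given binary relevance of the R top-ranked items."""
    relevance = np.asarray(relevance, dtype=np.float64)
    if relevance.sum() == 0:
        return 0.0
    precision_at_r = np.cumsum(relevance) / (np.arange(len(relevance)) + 1)
    return float((precision_at_r * relevance).sum() / relevance.sum())

def mean_average_precision(relevance_lists):
    """mAP: the mean of AP over all queries."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))

# example: ranked results for one query, 1 = same class as the query
# average_precision([1, 0, 1, 1, 0]) -> (1/1 + 2/3 + 3/4) / 3 ≈ 0.806
```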
4.2. 3D Cross-modal Retrieval Task
To evaluate the effectiveness of the proposed loss functions, we conduct experiments on the ModelNet40, MI3DOR, and Pix3D datasets with three modalities: images, point clouds, and meshes. To thoroughly examine the quality of the learned features, we perform two retrieval tasks: uni-modal retrieval and cross-modal retrieval. The performance of our method on 3D uni-modal and cross-modal retrieval tasks is shown in Table 1. Since CMCL is the only existing method proposed for 3D cross-modal retrieval and was evaluated only on the ModelNet40 dataset, we reproduce CMCL on the MI3DOR and Pix3D datasets to evaluate retrieval performance from real images. Our jointly trained method significantly outperforms the state-of-the-art method on all retrieval tasks and datasets. In particular, when the query data consists of real images, our method shows an even larger performance gain over CMCL, indicating that our approach can effectively handle real datasets with challenging negative examples. This can be attributed to the Instance-Variant loss assigning different penalty strengths to different instances, improving the separability of the embedding space. Our method obtains significantly better performance on all retrieval pairs across all datasets, showcasing its strong generalization ability.
4.3. Impact of Loss Function
The three components of our proposed loss function are as follows: the cross-entropy loss for each modality in the label space, denoted as $\mathcal{L}_{ce}$; the Instance-Variant loss in the shared embedding space, denoted as $\mathcal{L}_{iv}$; and the Intra-Class loss, denoted as $\mathcal{L}_{intra}$. We also consider the softmax loss without the instance-variant weight, denoted as $\mathcal{L}_{sm}$. To further investigate the impact of each component, we evaluate different combinations of the loss functions, including optimization with individual losses, joint optimization with $\mathcal{L}_{ce}$, $\mathcal{L}_{sm}$, and $\mathcal{L}_{intra}$, and joint optimization with $\mathcal{L}_{ce}$, $\mathcal{L}_{iv}$, and $\mathcal{L}_{intra}$. These networks are trained with the same settings and hyper-parameters, and their performance is shown in Table 2. As illustrated in Table 2: A) The combination of $\mathcal{L}_{ce}$, $\mathcal{L}_{iv}$, and $\mathcal{L}_{intra}$ achieves the best performance on all cross-modal and uni-modal retrieval tasks. B) As the baseline, the cross-entropy loss alone achieves a relatively high mAP because the head shared by the three modalities forces the network to learn similar representations in the common space for different modalities of the same class. C) The Instance-Variant loss can be used independently and achieves fairly good retrieval results; when combined with the Intra-Class loss, the retrieval performance on some tasks surpasses that of the CMCL method. D) $\mathcal{L}_{intra}$ improves the performance of both uni-modal and cross-modal retrieval. E) $\mathcal{L}_{iv}$ performs better than $\mathcal{L}_{sm}$.
Table 3. Impact of batch size on ModelNet40 retrieval performance (mAP, %).

Batch size | 32 | 64 | 96 | 128
---|---|---|---|---
Image2Image | 81.78 | 83.34 | 83.45 | 83.55 |
Image2Mesh | 85.16 | 86.23 | 86.18 | 86.20 |
Image2Point | 83.81 | 85.29 | 85.51 | 85.77 |
Mesh2Mesh | 89.57 | 89.59 | 89.47 | 89.30 |
Mesh2Image | 84.91 | 85.49 | 85.33 | 85.21 |
Mesh2Point | 87.39 | 87.95 | 87.89 | 88.14 |
Point2Point | 86.77 | 87.49 | 88.02 | 88.51 |
Point2Image | 83.93 | 84.84 | 85.11 | 85.23 |
Point2Mesh | 88.07 | 88.61 | 88.83 | 89.04 |
Mean | 85.71 | 86.54 | 86.64 | 86.77 |
Table 4. Unseen-domain retrieval performance (mAP, %): trained on MI3DOR real images, tested on MI3DOR multi-view images (img→view) and on the Pix3D-9 dataset (MI3DOR→Pix3D-9).

Method | img→view (CMCL) | img→view (Ours) | MI3DOR→Pix3D-9 (CMCL) | MI3DOR→Pix3D-9 (Ours)
---|---|---|---|---
Image2Image | 24.77 | 30.24 | 54.15 | 54.29 |
Image2Mesh | 25.82 | 31.69 | 53.03 | 48.27 |
Image2Point | 25.92 | 31.29 | 51.14 | 53.08 |
Mesh2Mesh | 85.59 | 87.05 | 76.72 | 76.75 |
Mesh2Image | 22.45 | 25.28 | 46.32 | 42.22 |
Mesh2Point | 85.71 | 86.59 | 48.24 | 48.91 |
Point2Point | 86.19 | 86.63 | 77.75 | 80.32 |
Point2Image | 21.81 | 24.16 | 46.07 | 48.55 |
Point2Mesh | 85.80 | 86.85 | 64.17 | 64.57 |
Mean | 51.56 | 54.42 | 57.51 | 57.44 |
4.4. Impact of Batch Size


Few researchers have discussed the difference between the weight vector and the class center (Wang et al., 2017, 2018). These works compared the distributions of weight vectors and class centers before and after full optimization, concluding that the two distributions overlap when the network is fully optimized (Wang et al., 2017). It has also been shown that batch size affects the retrieval performance of the center loss (Jing et al., 2021; Wen et al., 2016). Consequently, we further investigated the impact of batch size on the weight-vector computation and the Intra-Class loss. To analyze this impact, we conduct experiments on the ModelNet40 dataset with different batch sizes (32, 64, 96, 128). All networks are trained with the same number of epochs and hyper-parameters. As shown in Table 3, changing the batch size does not significantly affect retrieval performance, indicating that the weight vector and the Intra-Class loss are insensitive to the amount of data within a batch. This also implies that the proposed loss functions can achieve comparable retrieval results with fewer resources. Meanwhile, we noticed that the performance of some retrieval tasks improved with smaller batch sizes. We speculate that the primary factors are hyperparameter selection and imbalanced modality optimization leading to imperfect feature extraction (Peng et al., 2022).
4.5. 3D Cross-modal Unseen Retrieval Task
Considering the real-world application of cross-modal retrieval, we may encounter new data domains or distributions that do not appear in the training set. Therefore, we train the retrieval network on the real images of the MI3DOR dataset and test it on the multi-view images of MI3DOR and on the Pix3D dataset. The main reason for choosing MI3DOR as the benchmark is that it contains both real images corresponding to the models and multi-view images of the models; at the same time, its data distribution is more balanced than that of Pix3D, so we do not need to worry about bias in the networks used for testing. As illustrated in Table 4, our network has better generalization ability than the CMCL method. However, it does not fully address the challenge of unseen-domain retrieval, as the network's performance declines significantly. We discuss more experimental results on unseen retrieval in the appendix.
4.6. Qualitative Visualization
T-SNE Feature Embedding Visualization. Fig. 3 demonstrates distinct clusters for each modality, highlighting the proposed approach’s effectiveness in discriminating class samples. Furthermore, the combined features across modalities confirm the learned common space can capture modality-invariant representations.
3D Cross-Modal Retrieval Visualization. Fig. 4 displays the cross-modal retrieval samples for six queries from the ModelNet40 and MI3DOR datasets. Cosine distance measures data similarity across different modalities using normalized features for each query. The figure shows that instances with similar appearances are closer in feature space despite different modalities, proving the network learned modality-invariant features. More experiment results will be illustrated in the appendix.
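For completeness, a typical way to produce such a t-SNE plot from the learned embeddings is sketched below; the perplexity, initialization, and color map are default or illustrative choices rather than the exact settings used for Fig. 3.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, out_path="tsne.png"):
    """Project (N, d) embeddings to 2D with t-SNE and color points by class id."""
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=4, cmap="tab20")
    plt.axis("off")
    plt.savefig(out_path, dpi=300, bbox_inches="tight")
```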
5. Conclusion
This paper introduces the Instance-Variant loss and the Intra-Class loss for 3D cross-modal retrieval. The Instance-Variant loss effectively assigns different penalty strengths to different instances, enhancing the network's effectiveness by focusing on more challenging instances and promoting better feature space separability. The Intra-Class loss, based on the Gaussian RBF kernel, minimizes the intra-class distance among all instances in the shared embedding space and also serves to evaluate the shared embedding space learned by the network. The proposed Instance-Variant loss learns a shared weight vector for data from different modalities, effectively mitigating the cross-modal discrepancy and enhancing cross-modal retrieval performance. Extensive experiments have been conducted on 3D cross-modal retrieval tasks, and the proposed framework significantly outperforms the state-of-the-art methods on the ModelNet40, MI3DOR, and Pix3D datasets. In future work, we will explore potential techniques to further improve the robustness and effectiveness of the proposed losses and to enhance unseen-domain retrieval performance.
References
- Borodachov et al. (2019) Sergiy V Borodachov, Douglas P Hardin, and Edward B Saff. 2019. Discrete energy on rectifiable sets. Vol. 3. Springer.
- Chen et al. (2021) Zhimin Chen, Longlong Jing, Yang Liang, Yingli Tian, and Bing Li. 2021. Multimodal Semi-Supervised Learning for 3D Objects. In 32nd British Machine Vision Conference 2021, BMVC 2021, Online, November 22-25, 2021. BMVA Press, 381.
- Feng et al. (2019) Yutong Feng, Yifan Feng, Haoxuan You, Xibin Zhao, and Yue Gao. 2019. MeshNet: Mesh Neural Network for 3D Shape Representation. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 8279–8286.
- Frome et al. (2013) Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, and Tomás Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States. 2121–2129.
- Han et al. (2019) Xian-Feng Han, Hamid Laga, and Mohammed Bennamoun. 2019. Image-based 3D object reconstruction: State-of-the-art and trends in the deep learning era. IEEE transactions on pattern analysis and machine intelligence 43, 5 (2019).
- Hanocka et al. (2019) Rana Hanocka, Amir Hertz, Noa Fish, Raja Giryes, Shachar Fleishman, and Daniel Cohen-Or. 2019. MeshCNN: a network with an edge. ACM Trans. Graph. 38, 4 (2019), 90:1–90:12.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 770–778.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [stat.ML]
- Hu et al. (2022) Nian Hu, Heyu Zhou, Xiangdong Huang, Xuanya Li, and An-An Liu. 2022. A feature transformation framework with selective pseudo-labeling for 2D image-based 3D shape retrieval. IEEE Transactions on Circuits and Systems for Video Technology 32, 11 (2022), 8010–8021.
- Huang et al. (2022) Jingjia Huang, Wei Yan, Ge Li, Thomas H. Li, and Shan Liu. 2022. Learning Disentangled Representation for Multi-View 3D Object Recognition. IEEE Trans. Circuits Syst. Video Technol. 32, 2 (2022), 646–659.
- Jing et al. (2021) Longlong Jing, Elahe Vahdani, Jiaxing Tan, and Yingli Tian. 2021. Cross-Modal Center Loss for 3D Cross-Modal Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. Computer Vision Foundation / IEEE, 3142–3151.
- Liang et al. (2021) Shuang Liang, Weidong Dai, and Yichen Wei. 2021. Uncertainty Learning for Noise Resistant Sketch-Based 3D Shape Retrieval. IEEE Trans. Image Process. 30 (2021), 8632–8643.
- Lin et al. (2021) Ming-Xian Lin, Jie Yang, He Wang, Yu-Kun Lai, Rongfei Jia, Binqiang Zhao, and Lin Gao. 2021. Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 11385–11395.
- Liu et al. (2017) Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. 2017. SphereFace: Deep Hypersphere Embedding for Face Recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 6738–6746.
- Liu et al. (2016) Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. 2016. Large-Margin Softmax Loss for Convolutional Neural Networks. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 (JMLR Workshop and Conference Proceedings, Vol. 48). JMLR.org, 507–516.
- Liu et al. (2023) Zhenguang Liu, Shuang Wu, Shuyuan Jin, Shouling Ji, Qi Liu, Shijian Lu, and Li Cheng. 2023. Investigating Pose Representations and Motion Contexts Modeling for 3D Motion Prediction. IEEE Trans. Pattern Anal. Mach. Intell. 45, 1 (2023), 681–697.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. 8024–8035.
- Peng et al. (2022) Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, and Di Hu. 2022. Balanced Multimodal Learning via On-the-fly Gradient Modulation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 8228–8237.
- Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing Neural Networks by Penalizing Confident Output Distributions. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings. OpenReview.net.
- Qi et al. (2017a) Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. 2017a. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 77–85.
- Qi et al. (2017b) Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. 2017b. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 5099–5108.
- Sain et al. (2021) Aneeshan Sain, Ayan Kumar Bhunia, Yongxin Yang, Tao Xiang, and Yi-Zhe Song. 2021. StyleMeUp: Towards Style-Agnostic Sketch-Based Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. Computer Vision Foundation / IEEE, 8504–8513.
- Su et al. (2015) Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik G. Learned-Miller. 2015. Multi-view Convolutional Neural Networks for 3D Shape Recognition. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE Computer Society, 945–953.
- Sun et al. (2018) Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B. Tenenbaum, and William T. Freeman. 2018. Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. Computer Vision Foundation / IEEE Computer Society, 2974–2983.
- Sun et al. (2020) Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. 2020. Circle Loss: A Unified Perspective of Pair Similarity Optimization. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 6397–6406.
- Tian et al. (2021) Jialin Tian, Xing Xu, Zheng Wang, Fumin Shen, and Xin Liu. 2021. Relationship-Preserving Knowledge Distillation for Zero-Shot Sketch Based Image Retrieval. In MM ’21: ACM Multimedia Conference, Virtual Event, China, October 20 - 24, 2021. ACM, 5473–5481.
- Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
- Wang et al. (2018) Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. 2018. Additive Margin Softmax for Face Verification. IEEE Signal Process. Lett. 25, 7 (2018), 926–930.
- Wang et al. (2017) Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. 2017. NormFace: L2 Hypersphere Embedding for Face Verification. In Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, Qiong Liu, Rainer Lienhart, Haohong Wang, Sheng-Wei "Kuan-Ta" Chen, Susanne Boll, Yi-Ping Phoebe Chen, Gerald Friedland, Jia Li, and Shuicheng Yan (Eds.). ACM, 1041–1049.
- Wang et al. (2019a) Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R. Scott. 2019a. Multi-Similarity Loss With General Pair Weighting for Deep Metric Learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 5022–5030.
- Wang et al. (2019b) Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. 2019b. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 38, 5 (2019), 146:1–146:12.
- Wei et al. (2021) Jiwei Wei, Xing Xu, Zheng Wang, and Guoqing Wang. 2021. Meta Self-Paced Learning for Cross-Modal Matching. In MM ’21: ACM Multimedia Conference, Virtual Event, China, October 20 - 24, 2021. ACM, 3835–3843.
- Wei et al. (2023) Jiwei Wei, Yang Yang, Xing Xu, Jingkuan Song, Guoqing Wang, and Heng Tao Shen. 2023. Less is Better: Exponential Loss for Cross-Modal Matching. IEEE Transactions on Circuits and Systems for Video Technology (2023), 1–1.
- Wei et al. (2022) Jiwei Wei, Yang Yang, Xing Xu, Xiaofeng Zhu, and Heng Tao Shen. 2022. Universal Weighting Metric Learning for Cross-Modal Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 44, 10 (2022), 6534–6545.
- Wen et al. (2016) Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. 2016. A Discriminative Feature Learning Approach for Deep Face Recognition. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII (Lecture Notes in Computer Science, Vol. 9911). Springer, 499–515.
- Wu et al. (2015) Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3D ShapeNets: A deep representation for volumetric shapes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. IEEE Computer Society, 1912–1920.
- Xiao et al. (2022) Degui Xiao, Jiazhi Li, Jianfang Li, Shiping Dong, and Tao Lu. 2022. IHEM Loss: Intra-Class Hard Example Mining Loss for Robust Face Recognition. IEEE Trans. Circuits Syst. Video Technol. 32, 11 (2022), 7821–7831.
- Xu et al. (2020) Yongzhe Xu, Jiangchuan Hu, Kanoksak Wattanachote, Kun Zeng, and Yongyi Gong. 2020. Sketch-Based Shape Retrieval via Best View Selection and a Cross-Domain Similarity Measure. IEEE Trans. Multim. 22, 11 (2020), 2950–2962.
- Zeng et al. (2021) Zhixiong Zeng, Ying Sun, and Wenji Mao. 2021. MCCN: Multimodal Coordinated Clustering Network for Large-Scale Cross-modal Retrieval. In MM ’21: ACM Multimedia Conference, Virtual Event, China, October 20 - 24, 2021. ACM, 5427–5435.
- Zhang et al. (2022) Kun Zhang, Zhendong Mao, Quan Wang, and Yongdong Zhang. 2022. Negative-Aware Attention Framework for Image-Text Matching. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 15640–15649.
- Zhen et al. (2019) Liangli Zhen, Peng Hu, Xu Wang, and Dezhong Peng. 2019. Deep Supervised Cross-Modal Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 10394–10403.
- Zheng et al. (2020) Wenzhao Zheng, Jiwen Lu, and Jie Zhou. 2020. Deep Metric Learning via Adaptive Learnable Assessment. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 2957–2966.
- Zhou et al. (2019) Heyu Zhou, An-An Liu, and Weizhi Nie. 2019. Dual-level Embedding Alignment Network for 2D Image-Based 3D Object Retrieval. In Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21-25, 2019. ACM, 1667–1675.
- Zhou et al. (2022) Yaqian Zhou, Yu Liu, Heyu Zhou, Zhiyong Cheng, Xuanya Li, and An-An Liu. 2022. Learning Transferable and Discriminative Representations for 2D Image-Based 3D Model Retrieval. IEEE Trans. Circuits Syst. Video Technol. 32, 10 (2022), 7147–7159.