Let Images Give You More:
Point Cloud Cross-Modal Training for Shape Analysis
Abstract
Although recent point cloud analysis achieves impressive progress, the paradigm of representation learning from a single modality is gradually meeting its bottleneck. In this work, we take a step towards more discriminative 3D point cloud representation by fully taking advantage of images, which inherently contain richer appearance information, e.g., texture, color, and shade. Specifically, this paper introduces a simple but effective point cloud cross-modality training (PointCMT) strategy, which utilizes view-images, i.e., rendered or projected 2D images of the 3D object, to boost point cloud analysis. In practice, to effectively acquire auxiliary knowledge from view images, we develop a teacher-student framework and formulate the cross-modal learning as a knowledge distillation problem. PointCMT eliminates the distribution discrepancy between different modalities through novel feature and classifier enhancement criteria and effectively avoids potential negative transfer. Note that PointCMT improves the point-only representation without any architecture modification. Extensive experiments verify significant gains on various datasets with appealing backbones, i.e., equipped with PointCMT, PointNet++ and PointMLP achieve state-of-the-art performance on two benchmarks, with 94.4% and 86.7% accuracy on ModelNet40 and ScanObjectNN, respectively. Code will be made available at https://github.com/ZhanHeshen/PointCMT.
1 Introduction
As a fundamental 3D representation, point clouds have attracted increasing attention for various applications, e.g., self-driving [2, 33, 34], robotics perception [7, 13, 6], etc. Generally, a point cloud consists of sparse and unordered points in 3D space, which is significantly different from a 2D image with its dense and regular pixel array. Prior studies treat the understanding of 2D images and 3D point clouds as two separate problems, and both have their own merits and drawbacks. Concretely, rich color and fine-grained texture are easily obtained in 2D images, but they are ambiguous for depth and shape sensing; previous works extract features from images through convolutional neural networks (CNNs). In contrast, point clouds are superior in providing spatial and geometric information but only preserve sparse and textureless features; several prior studies process unstructured point clouds through local aggregation operators [34, 43]. It is natural to raise the question: can we use the rich information hidden in 2D images to boost 3D point cloud shape analysis?
To address the above issue, one straightforward way is to leverage the benefits of both images and point clouds, i.e., fusing information from the two complementary representations with task-specific designs [58, 24, 10, 32, 42]. However, utilizing the additional image representation requires designing a multi-modal network, which takes the extra image inputs in both the training and inference phases. Moreover, exploiting extra images is usually computation-intensive, and paired images are often unavailable during inference. Thus, multi-modal learning meets its bottleneck in many aspects.

This paper tries to ease the barrier of cross-modal learning between images and point clouds. Inspired by knowledge distillation (KD), which transfers knowledge from a teacher model to a student one, we formulate the cross-modal learning as a KD problem, conducting alignment between sample representations learned from images and point clouds. However, previous KD approaches usually assume that the training data used by the teacher and student are from the same distribution [17]. Since sparse and unordered point clouds represent visual information differently from images, naive feature alignment between the two representations tends to cause limited gains or negative transfer in the cross-modal scenario. To this end, we design a novel framework for cross-modal KD and propose the point cloud cross-modal training strategy, i.e., PointCMT in Figure 1 (a), which distills features derived from images into the point cloud representation. Specifically, multiple view images for each 3D object can be generated either by rendering the CAD model or by perspective projection of the point cloud from different viewpoints. These free auxiliary images are fed into the image network to obtain global representations of the object. Besides, feature and classifier enhancements are conducted between the point cloud and image features, in which the newly proposed criteria effectively avoid negative transfer between different modalities, i.e., directly applying [17] hampers the performance on ModelNet40. After training, the model achieves higher performance while taking only point clouds as input, without architecture modification.
Compared with multi-modal approaches, our solution has the following preferable properties: 1) Generality: PointCMT can be integrated with arbitrary point cloud analysis models without structural modification. 2) Effectiveness: It significantly boosts the performance of several baselines, e.g., PointNet++ [34] improves from 93.4% to a state-of-the-art 94.4% overall accuracy on ModelNet40, as shown in Figure 1 (b). 3) Efficiency: It only utilizes auxiliary image data in the training stage; after training, the enhanced 3D model infers without image inputs. 4) Flexibility: Extensive experiments illustrate that PointCMT performs well even without colorized and dedicatedly rendered images, i.e., it can also greatly improve the performance when using images directly projected from sparse and textureless point clouds. Thus, it provides an alternative solution to enhance point cloud shape analysis when additional rendered images are not accessible.
In summary, our contributions are: 1) We formulate cross-modal learning on point cloud analysis as a knowledge distillation problem, where we utilize the merits of texture- and color-aware 2D images to acquire a more discriminative point cloud representation. 2) We propose the point cloud cross-modal training (PointCMT) strategy with corresponding criteria to boost point cloud models during the training stage. 3) Extensive experiments on several datasets verify the effectiveness of our approach, where PointCMT greatly boosts several baseline models, including the state-of-the-art, e.g., PointNet++ [34] trained with PointCMT gains 1.0% and 4.4% accuracy improvements on ModelNet40 [48] and ScanObjectNN, respectively. Even based upon PointMLP [31], it increases the accuracy by 1% to 86.7% on the ScanObjectNN dataset.
2 Related Works
3D Shape Recognition Based on Point Clouds. This stream of methods directly processes raw point clouds as input (also called point-based methods). It was pioneered by PointNet [33], which approximates a permutation-invariant set function using a per-point Multi-Layer Perceptron (MLP) followed by a max-pooling layer. Later on, point-based methods focused on designing local aggregation operators for local feature extraction. Specifically, they generally sample multiple sub-points from the original point cloud and then aggregate neighboring features of each sub-point through local aggregation operators, including point-wise MLPs [34, 31], adaptive weights [44, 46, 27] and pseudo-grid based methods [39, 20]. More recently, there have been attempts to utilize non-local operators [54] or Transformers [60, 14] to capture long-range dependencies. This paper also follows the paradigm of point-based methods to conduct point cloud shape analysis.
3D Shape Recognition Based on Images. Since point clouds are irregular and unordered, some works consider projecting the 3D shapes into multiple images from different viewpoints (also called view-based methods) and then leverage the well-developed 2D CNNs to process 3D data. One seminal work of multi-view learning is MVCNN [38]. It extracts per-view features with a shared CNN in parallel, then aggregates via a view-level max-pooling layer. Most follow-up works propose more effective modules to aggregate the view-level features. For instance, some of them enhance the aggregated feature by considering similarity among views [11, 56] while others focus on the viewpoint relation [45, 22]. The above methods usually utilize ad-hoc rendered images for each 3D shape, including shade and texture for the surface mesh. Therefore, they generally achieve higher performance than point-based methods using sparse point clouds as input. Recently, [12] proposes a simple but effective method (SimpleView) through directly projecting sparse point clouds onto image planes, achieving comparable performance with point-based methods. Inspired by view-based methods, this paper takes advantage of extracted image features from the view-based method, which are utilized as prior knowledge to boost point cloud shape analysis.
Knowledge Distillation. Knowledge distillation (KD) aims at compressing a large network (teacher) into a compact one (student) while boosting the performance of the student at the same time. The concept was first introduced by Hinton et al. [17], which trains a student using the softened logits of a teacher as targets. Over the past few years, several subsequent approaches [23, 1, 5, 40, 9, 61] have used different criteria to align the sample representations between the teacher and student. However, almost all existing works assume that the training data used by the teacher and student networks come from the same distribution. Our experiments illustrate that new biases and negative transfer are introduced in the distillation process if cross-modal data from different distributions (e.g., features extracted from unordered point clouds and regular grid images) are used directly with previous KD methods.
Cross-Modal Knowledge Transfer. Cross-modal knowledge transfer in computer vision is a relatively new field that aims to utilize additional modalities at the training stage to enhance the model's performance on the target modality at inference. Recently, 3D-to-2D knowledge transfer approaches adopt geometry-aware 3D features from point clouds to enhance the performance of 2D tasks through a contrastive manner [18] or feature alignment [30]. Other approaches attempt to transfer priors from images to enhance 3D point cloud related tasks, some of which are designed for specific tasks. Concretely, [28, 26] propose image-to-point contrastive pre-training, [50] inflates 2D convolution kernels into 3D ones, and [57, 59, 53] independently apply cross-modal training for visual grounding, captioning and semantic segmentation. Inspired by but different from the above, we are the first to conduct image-to-point knowledge distillation for point cloud analysis.

3 Methodology
3.1 Problem Statement
Let $P \in \mathbb{R}^{N \times 3}$ and $y$ be the point cloud and ground-truth label of a 3D object. Its corresponding view-image counterparts can be denoted as $I = \{I_v\}_{v=1}^{V}$ with $I_v \in \mathbb{R}^{H \times W \times 3}$, where $N$, $V$ and $H \times W$ are the number of points, the number of view-images and the image size, respectively. View images can be obtained by rendering the 3D CAD model [38] or by perspective projection of the raw point cloud [12]. We denote $T$ and $S$ as the image and point cloud analysis networks, respectively, and regard them as the teacher and student in traditional knowledge distillation (KD). We split each network into two parts: (i) Encoders (i.e., feature extractors $E_I$ and $E_P$), whose outputs at the last layer are the global feature representations $f_I$ and $f_P$. (ii) Classifiers, which project the feature representations $f_I$ and $f_P$ into class logits through $z_I = g_I(f_I)$ and $z_P = g_P(f_P)$, where $g$ denotes the classifier.
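To make this split concrete, a minimal PyTorch-style sketch is given below; the wrapper and layer sizes are illustrative assumptions rather than the released implementation, and the feature dimension is assumed to match the image feature so that the two modalities can later share the CMPG and the point classifier.

```python
import torch
import torch.nn as nn

class PointStudent(nn.Module):
    """Generic point cloud analysis network split into an encoder E_P and a classifier g_P."""
    def __init__(self, encoder: nn.Module, feat_dim: int = 512, num_classes: int = 40):
        super().__init__()
        self.encoder = encoder                       # E_P: (B, N, 3) -> (B, feat_dim)
        self.classifier = nn.Sequential(             # g_P: (B, feat_dim) -> (B, num_classes)
            nn.Linear(feat_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, points: torch.Tensor):
        f_p = self.encoder(points)   # global feature f_P
        z_p = self.classifier(f_p)   # class logits z_P
        return f_p, z_p
```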
Since we formulate the cross-modal learning as a KD problem, its goal in the learning process is to distill the prior knowledge from the image into the point cloud features, obtaining an image-enhanced ideal feature $f_P^*$. During the knowledge distillation, we parameterize the teacher and student networks with $\theta_T$ and $\theta_S$, and denote the knowledge of the teacher network as $K_T$. From the Bayesian perspective, a neural network can be viewed as a probabilistic model; taking the point cloud analysis model as an example, given an input point cloud $P$, the network assigns an output probability $p(y \mid P; \theta_S)$ with the parameters $\theta_S$. Therefore, if we want the student network to be guided by image knowledge on the input sample, our goal can be further reformulated as maximizing the probability $p(K_T \mid P; \theta_S)$, which can also be used to measure the ability of the student network to extract features with cross-modal information. To find the lower bound of the above probability, we define the discrepancy $\epsilon$ between the theoretically discriminative features $f_I^*$ and $f_P^*$ as
(1)
where $f_I^*$ and $f_P^*$ are the ideal features in the respective modalities. In Lemma 1, we give the lower bound of the cross-modal learning formulated as a KD problem. The proof is provided in the supplementary material.
Lemma 1: By the definition above, $p(K_T \mid P; \theta_S)$ is bounded below by the quantity $B$, where $B$ is given by
(2)
In Lemma 1, one term of $B$ measures the compatibility of knowledge distillation of the image networks and can be viewed as a constant when the architectures are determined. Another term is also a constant when the parameters and the architecture of the point cloud analysis model are fixed, e.g., when using a pre-trained image network. Therefore, the lemma ensures that during knowledge distillation, one can maximize $p(K_T \mid P; \theta_S)$ by minimizing the discrepancy $\epsilon$, which gives a theoretical guarantee for the KD problem.
If we adopt previous KD studies in the cross-modal scenario, they minimize $\epsilon$ by making the student directly approximate the teacher's features [23, 1, 5] or predicted logits [17]. However, in a common KD problem, the teacher and student are generally trained on the same dataset with an identical distribution [17]. Moreover, the teacher generally achieves better performance than the student. In contrast, the image and point cloud analysis models tend to learn different feature representations and logit distributions, which are generally complementary, so direct alignment may cause limited gains or even negative transfer. Moreover, previous KD approaches treat encoders and classifiers as a whole architecture since the teacher and student networks generally have the same components. However, the point cloud convolutions in the encoder are significantly different from 2D CNNs, while the two networks share the same classifier design.
To this end, we propose PointCMT to solve the cross-modal KD problem. The workflow of PointCMT is demonstrated in Figure 2 (a) and summarized in Algorithm 1. Specifically, PointCMT consists of three stages. In Stage I (Section 3.2), we train the image encoder and classifier using view-images and ground-truth labels. In Stage II (Section 3.3), we train the cross-modal point generator (CMPG) on image features. In Stage III (Section 3.4), we independently align the features and logits of the two modalities, enhancing both the encoder and the classifier of the point cloud analysis network.
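The three stages can be summarized with the following training-loop sketch, assuming the encoder/classifier interfaces above; the optimizers, epoch counts, and data-loader fields loosely mirror Section 4 but are placeholders, not the official training script.

```python
import torch
import torch.nn.functional as F

def train_pointcmt(teacher, cmpg, student, loader, emd_loss, epochs=(50, 50, 300)):
    """Sketch of the three PointCMT stages; `loader` yields (views, points, label)."""
    # Stage I: supervise the image (teacher) network with view-images and labels.
    opt = torch.optim.SGD(teacher.parameters(), lr=0.01, momentum=0.9)
    for _ in range(epochs[0]):
        for views, _, label in loader:
            _, z_i = teacher(views)
            loss = F.cross_entropy(z_i, label)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage II: freeze the teacher; fit CMPG to reconstruct point clouds from f_I.
    teacher.requires_grad_(False)
    opt = torch.optim.Adam(cmpg.parameters())
    for _ in range(epochs[1]):
        for views, points, _ in loader:
            with torch.no_grad():
                f_i, _ = teacher(views)
            loss = emd_loss(cmpg(f_i), points)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage III: train the student on point clouds only, adding the feature and
    # classifier enhancement losses of Section 3.4; teacher and CMPG stay frozen.
    cmpg.requires_grad_(False)
```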
3.2 Learning Image Priors
For each 3D object, we use multiple view-images (i.e., rendered color images or projected images from raw point cloud) as additional data, and the whole process can be described as:
$f_I = A\big(E_I(I_1), \dots, E_I(I_V)\big), \qquad z_I = g_I(f_I)$   (3)
Inspired by view-based learning approaches [38], the view-images in $I$ flow into a shared-weight image feature extractor $E_I$ to obtain a series of view-level vectors. By aggregating these vectors via an aggregation function $A$, we obtain a global feature representation $f_I$ from all images, which integrates shape information from multiple views. Finally, an image classifier maps the above global feature to prediction logits through $z_I = g_I(f_I)$, which are supervised by the ground-truth label $y$ through a cross-entropy loss.
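Following MVCNN [38] and the implementation details in Section 4.1, the teacher can be sketched as a shared ResNet-18 applied to each view with view-wise max-pooling as the aggregation function $A$; the head and torchvision version (>= 0.13 for the `weights` argument) are assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class ImageTeacher(nn.Module):
    """Multi-view image network: shared 2D CNN E_I + view-wise max-pooling A + classifier g_I."""
    def __init__(self, num_classes: int = 40, pretrained: bool = True):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1" if pretrained else None)
        backbone.fc = nn.Identity()                   # expose the 512-d global feature
        self.encoder = backbone                       # shared across all views
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, views: torch.Tensor):
        B, V, C, H, W = views.shape                   # views: (B, V, 3, H, W)
        per_view = self.encoder(views.reshape(B * V, C, H, W)).reshape(B, V, -1)
        f_i = per_view.max(dim=1).values              # aggregation A: view-wise max-pool
        z_i = self.classifier(f_i)                    # prediction logits z_I
        return f_i, z_i
```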
3.3 Cross-Modal Point Generator
A cross-modal point generator (CMPG) can be seen as a nonlinear transformation that maps the global feature representation acquired from images into the Euclidean space. Thus, it avoids the potential negative transfer caused by directly aligning cross-modal features from different distributions. In order to better learn image priors in the point cloud analysis network, we pre-train the CMPG on the image feature $f_I$ and fix it in the distillation stage. Figure 2 (b) illustrates the pre-training stage of CMPG. It takes the image feature $f_I$ as input and reconstructs a point cloud $\hat{P}$, which is supervised by the original point cloud $P$ through the Earth Mover's distance (EMD) [37]:
$\mathcal{L}_{\mathrm{EMD}}(P, \hat{P}) = \min_{\phi:\, P \rightarrow \hat{P}} \frac{1}{|P|} \sum_{x \in P} \| x - \phi(x) \|_{2}$   (4)
where $\phi: P \rightarrow \hat{P}$ is a bijection, i.e., for each point $x \in P$, $\phi$ finds a sole point correspondence in $\hat{P}$. After pre-training, CMPG reconstructs a point cloud from an image-related feature representation.
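A minimal sketch of CMPG and a brute-force EMD follows. The paper only states that CMPG is a three-layer MLP, so the widths are illustrative; the exact Hungarian matching below is O(M^3) and the official code presumably relies on an approximate CUDA EMD instead.

```python
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

class CMPG(nn.Module):
    """Cross-modal point generator: a three-layer MLP mapping a global feature to M points."""
    def __init__(self, feat_dim: int = 512, num_points: int = 1024):
        super().__init__()
        self.num_points = num_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_points * 3),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.mlp(feat).reshape(-1, self.num_points, 3)

def emd_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Exact EMD via the Hungarian algorithm; the optimal bijection phi is recomputed per sample."""
    losses = []
    for p, t in zip(pred, target):                    # iterate over the batch
        cost = torch.cdist(p, t)                      # pairwise point distances
        row, col = linear_sum_assignment(cost.detach().cpu().numpy())
        row = torch.as_tensor(row, device=cost.device)
        col = torch.as_tensor(col, device=cost.device)
        losses.append(cost[row, col].mean())          # mean distance under phi
    return torch.stack(losses).mean()
```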
3.4 Image Priors Assisted Training
During the training stage, three objectives are optimized:
Classification Loss. In our PointCMT, arbitrary point cloud analysis models can be adopted. Generally, such a model consists of a point encoder and a classifier: the point encoder takes a point cloud as input and generates the point cloud feature representation $f_P$, which is then fed into the classifier to obtain class logits $z_P$. Finally, the cross-entropy loss $\mathcal{L}_{\mathrm{cls}}$ is used as the criterion.
Feature Enhancement Loss. Unlike previous KD methods that directly align the features of the teacher and student, we first transform the cross-modal features into the Euclidean space. As shown in Figure 2 (c), the pre-trained CMPG independently transforms $f_P$ and $f_I$, obtaining two point clouds $\hat{P}_P$ and $\hat{P}_I$, respectively. After that, an EMD loss between the two point clouds is used as the objective:
$\mathcal{L}_{\mathrm{FE}} = \min_{\phi:\, \hat{P}_P \rightarrow \hat{P}_I} \frac{1}{|\hat{P}_P|} \sum_{x \in \hat{P}_P} \| x - \phi(x) \|_{2}$   (5)
where $\phi: \hat{P}_P \rightarrow \hat{P}_I$ is a bijection. Compared with a traditional $\ell_2$ loss, the EMD distance naturally solves an assignment problem for permutation-invariant point sets. For all but a zero-measure subset of point set pairs, the optimal bijection is unique and invariant under infinitesimal movement of the points. Thus, EMD is differentiable almost everywhere.
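Assuming the `CMPG` module and `emd_loss` sketched in Section 3.3, the feature enhancement term can be written as follows; treating the image branch as a fixed target is our reading of Figure 2 (c).

```python
import torch

def feature_enhancement_loss(f_p: torch.Tensor, f_i: torch.Tensor,
                             cmpg: torch.nn.Module) -> torch.Tensor:
    """Eq. (5): map both global features to point clouds with the frozen CMPG and compare by EMD."""
    with torch.no_grad():
        p_img = cmpg(f_i)           # \hat{P}_I generated from the frozen teacher feature
    p_pnt = cmpg(f_p)               # \hat{P}_P: gradients flow back into the point encoder
    return emd_loss(p_pnt, p_img)   # emd_loss as sketched in Section 3.3
```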
Classifier Enhancement Loss. In addition to supervising the point feature extractor through the above feature enhancement, we further propose a constraint on the point classifier (Figure 2 (a)). Specifically, the image feature $f_I$ generated by the teacher network is fed into the point classifier, where the gradient only back-propagates into the point classifier. This classifier enhancement enables the point classifier to handle the image feature during distillation: it aligns the output logits of the point classifier when taking the image and point features as inputs. The constraint is modified from the distillation loss of Hinton et al. [17], shown in Equation (6), with the difference that in our case the two sets of logits come from the same classifier:
$\mathcal{L}_{\mathrm{KD}} = \mathrm{KL}\big(\sigma(z_I / \tau) \,\|\, \sigma(z_P / \tau)\big)$   (6)
where $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the KL divergence, $\sigma$ is the softmax function and $\tau$ is a temperature. In contrast, our proposed classifier enhancement is more suitable for the cross-modal scenario, where a great discrepancy exists between the image and point cloud features. Concretely, the loss for classifier enhancement can be written as
$\mathcal{L}_{\mathrm{CE}} = \mathrm{KL}\big(\sigma(g_P(f_I) / \tau) \,\|\, \sigma(g_P(f_P) / \tau)\big)$   (7)
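A possible implementation of the classifier enhancement loss is sketched below. The temperature value and the exact gradient flow (feeding the detached image feature to the point classifier so that this branch updates the classifier only) are our interpretation of Figure 2 (a), not the released code.

```python
import torch
import torch.nn.functional as F

def classifier_enhancement_loss(f_p: torch.Tensor, f_i: torch.Tensor,
                                point_classifier: torch.nn.Module,
                                tau: float = 4.0) -> torch.Tensor:
    """Eq. (7): both sets of logits come from the same point classifier g_P."""
    z_from_img = point_classifier(f_i.detach())   # gradient reaches the classifier only
    z_from_pnt = point_classifier(f_p)            # gradient reaches encoder and classifier
    log_p = F.log_softmax(z_from_pnt / tau, dim=1)
    log_q = F.log_softmax(z_from_img / tau, dim=1)
    # KL(image-branch distribution || point-branch distribution), standard tau^2 scaling.
    return F.kl_div(log_p, log_q, reduction="batchmean", log_target=True) * tau * tau
```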
Final Loss. The final loss is a weighted sum of the above three losses, $\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_1 \mathcal{L}_{\mathrm{FE}} + \lambda_2 \mathcal{L}_{\mathrm{CE}}$, where $\lambda_1$ and $\lambda_2$ are weights that adjust the contribution of each loss.
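Combining the three objectives, one Stage-III training step could look as follows; the loss weights and module interfaces are placeholders carried over from the earlier sketches.

```python
import torch
import torch.nn.functional as F

def stage3_step(points, views, label, student, teacher, cmpg,
                lambda_fe: float = 1.0, lambda_ce: float = 1.0):
    """One Stage-III step: classification + feature enhancement + classifier enhancement."""
    with torch.no_grad():
        f_i, _ = teacher(views)                   # frozen teacher feature f_I
    f_p, z_p = student(points)                    # point feature f_P and logits z_P
    loss = (F.cross_entropy(z_p, label)
            + lambda_fe * feature_enhancement_loss(f_p, f_i, cmpg)
            + lambda_ce * classifier_enhancement_loss(f_p, f_i, student.classifier))
    return loss
```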
4 Experiment

4.1 Shape Classification on ModelNet40
We first evaluate PointCMT on the synthetic ModelNet40 dataset [48], a large-scale 3D CAD model benchmark.
Dataset Description and Processing. ModelNet40 is composed of 9,843 training models and 2,468 test models in 40 classes. For the input of the 3D network, we use the point clouds provided by the official dataset, the same as PointNet [33]. For the input of the image network, we use the 20 rendered view images per CAD model provided by RotationNet [22], rendered at a fixed resolution. Since these renderings consider both the mesh surfaces and illumination, they provide more information to the 3D network. Selected samples of point clouds and corresponding multi-view images are shown in Figure 3.
Method | Input | #Points | mAcc(%) | OA(%) | Speed | Param.
---|---|---|---|---|---|---
PointNet [33] | pnt | 1k | 86.0 | 89.2 | - | 3.47M |
PointNet++ [34] | pnt, nor | 5k | - | 91.9 | - | 1.47M |
PointCNN [25] | pnt | 1k | 88.0 | 92.5 | - | - |
PointConv [47] | pnt, nor | 1k | - | 92.5 | 80† | 18.6M |
KPConv [39] | pnt | 7k | - | 92.9 | 10† | 15.2M |
PointASNL [54] | pnt, nor | 1k | - | 93.2 | - | - |
PosPool [29] | pnt | 5k | - | 93.2 | - | - |
Point Transformer [60] | pnt | 1k | 90.6 | 93.7 | - | - |
GBNet [36] | pnt | 1k | 91.0 | 93.8 | 112† | 8.4M |
GDANet [51] | pnt | 1k | - | 93.8 | 14† | 0.9M |
SimpleView [12] | pnt | 1k | - | 93.9 | 2208 | 1.64M |
CurveNet [49] | pnt | 1k | - | 94.2 | 15† | 2.0M |
PointMLP [31] | pnt | 1k | 91.4 | 94.5 | 139 | 12.6M |
DGCNN [43] (baseline) | pnt | 1k | 90.2 | 92.9 | 518 | 1.68M |
RS-CNN [27] (baseline) | pnt | 1k | 89.3 | 92.9 | 2174 | 1.17M |
PointNet++ [34] (baseline) | pnt | 1k | 90.1 | 93.4* | 300 | 1.62M |
DGCNN w/ PointCMT | pnt | 1k | 90.8 (+0.6) | 93.5 (+0.6) | 518 | 1.68M |
RS-CNN w/ PointCMT | pnt | 1k | 90.1 (+0.8) | 93.8 (+0.9) | 2174 | 1.17M |
PointNet++ w/ PointCMT | pnt | 1k | 91.2 (+1.1) | 94.4 (+1.0) | 300 | 1.62M |
Implementation. For the image network, we use ResNet-18 [16] pre-trained on ImageNet [8] as the feature extractor. Following MVCNN [38], we obtain the global shape feature by applying view-wise max-pooling to the view-level features. Finally, a fully-connected layer outputs the classification logits. We train this network with SGD using a learning rate of 0.01 and a batch size of 128 for 50 epochs. After that, we fix the image network and train the CMPG with Adam and a batch size of 32 for 50 epochs. In practice, CMPG consists of a three-layer MLP. For the point cloud analysis models, DGCNN [43] and RS-CNN [27] are trained with the training strategies provided in their official code, while PointNet++ [34] is trained with the strategy of RS-CNN [27], following [12], for better performance.
Comparison with State-of-the-arts. The classification results on ModelNet40 are shown in Table 1, where the overall accuracy (OA) and class mean accuracy (mAcc) are compared. The upper part of the table lists current state-of-the-art methods, among which we use PointNet++ [34], RS-CNN [27] and DGCNN [43] as our baselines. We do not use PointMLP [31] as a baseline since it cannot robustly reproduce its highest result on ModelNet40, an issue mentioned in its open-sourced code. Among models trained from scratch, PointMLP [31] achieves the highest accuracy. As shown in the lower part of the table, after training with PointCMT, the performance of all baselines is greatly boosted, i.e., a 1.0% improvement for PointNet++, 0.9% for RS-CNN and 0.6% for DGCNN. We also compare with several open-sourced methods in terms of parameters and testing speed. As shown in the last two columns, although PointMLP gains 0.1% higher overall accuracy, its network contains far more parameters (12.6M vs. 1.62M) and runs at only 46% of the speed of PointNet++. In contrast, PointCMT performs well on lightweight models, which shows its great potential for real-time applications, e.g., scene parsing in autonomous driving.
4.2 Shape Classification on ScanObjectNN
Though ModelNet40 is the most widely used benchmark for point cloud analysis, it may not meet realistic requirements due to its synthetic nature. To this end, we also conduct experiments on the real-world ScanObjectNN benchmark [41].
Method | OBJ_ONLY mAcc(%) | OBJ_ONLY OA(%) | PB_T50_RS mAcc(%) | PB_T50_RS OA(%)
---|---|---|---|---
3DmFV [3] | - | 73.8 | 58.1 | 63.0 |
PointNet [33] | - | 79.2 | 63.4 | 68.2 |
SpiderCNN [52] | - | 79.5 | 69.8 | 73.7 |
PointNet++ [34] | - | 84.3 | 75.4 | 77.9 |
DGCNN [43] | - | 86.2 | 73.6 | 78.1 |
PointCNN [25] | - | 85.5 | 75.1 | 78.5 |
DRNet [35] | - | - | 78.0 | 80.3 |
GBNet [36] | - | - | 77.8 | 80.5 |
SimpleView [12] | 86.2 | 89.0 | - | 80.8 |
PRANet [4] | - | - | 79.1 | 82.1 |
MVTN [15] | - | - | - | 82.8 |
PointNet++ [34] (baseline) | 85.4±0.2 | 87.4±0.1 | 75.5±0.3 | 79.2±0.2
PointMLP [31] (baseline) | 89.1±0.3 | 92.2±0.3 | 83.9±0.5 | 85.4±0.3
PointNet++ w/ PointCMT | 89.0±0.3 (+3.7) | 91.6±0.2 (+4.3) | 79.9±0.3 (+4.4) | 83.1±0.2 (+3.9)
PointMLP w/ PointCMT | 91.8±0.2 (+2.6) | 93.2±0.3 (+1.0) | 84.4±0.4 (+0.4) | 86.4±0.3 (+1.0)
Dataset Description and Processing. ScanObjectNN collects 2,902 objects from the real-world indoor scene datasets ScanNet [7] and SceneNN [19], categorized into 15 classes. Several variants are provided, of which the most challenging is PB_T50_RS, which introduces perturbed objects (11,416 and 2,882 samples for training and testing) via random translation, shift, rotation and scaling. Due to background, noise, and occlusions, this benchmark poses significant challenges to existing point cloud analysis methods. Furthermore, since the PB_T50_RS variant only preserves the spatial coordinates (XYZ) of each object while other information, such as RGB, is discarded, we also compare on the original 2,902 objects (OBJ_ONLY), which include additional RGB information. On both of the above variants, we only use depth images obtained by perspective projection of the raw point clouds as additional inputs, as shown in the last column of Figure 3. Section 4.6 discusses more results using projection with additional color information.
Implementation. All view-images for ScanObjectNN are obtained by projecting the raw point clouds, and following the setting of [12] we generate only six images for PB_T50_RS and OBJ_ONLY. We train the image network from scratch with a batch size of 32 and the SGD optimizer for 1,000 epochs. The training strategy of CMPG is the same as on ModelNet40. For point cloud models trained both from scratch and with PointCMT, we use the SGD optimizer for 1,000 epochs with a batch size of 32.
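For reference, the sketch below generates textureless depth views by perspective projection with NumPy; the number of views, resolution, camera distance, and depth encoding are illustrative assumptions and differ in detail from the SimpleView-style [12] projection used in the paper.

```python
import numpy as np

def project_depth_views(points: np.ndarray, num_views: int = 6, size: int = 128) -> np.ndarray:
    """Project an (N, 3) point cloud into `num_views` depth images of shape (size, size)."""
    pts = points - points.mean(axis=0)
    pts = pts / (np.linalg.norm(pts, axis=1).max() + 1e-9)       # fit into the unit sphere
    images = np.zeros((num_views, size, size), dtype=np.float32)
    for v in range(num_views):
        theta = 2.0 * np.pi * v / num_views                      # rotate around the up-axis
        rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                        [0.0, 1.0, 0.0],
                        [-np.sin(theta), 0.0, np.cos(theta)]])
        p = pts @ rot.T
        z = p[:, 2] + 2.0                                        # camera 2 units away, looking along +z
        u = np.clip(((p[:, 0] / z + 1.0) * 0.5 * (size - 1)).astype(int), 0, size - 1)
        w = np.clip(((p[:, 1] / z + 1.0) * 0.5 * (size - 1)).astype(int), 0, size - 1)
        for ui, wi, zi in zip(u, w, z):
            if images[v, wi, ui] == 0 or zi < images[v, wi, ui]: # keep the closest point
                images[v, wi, ui] = zi
    return images
```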
Comparison with State-of-the-arts. The results are shown in Table 2, where PointNet++ [34] and current state-of-the-art PointMLP [31] are chosen as our baselines. PointCMT significantly improves the performance on both class mean accuracy (mAcc) and the overall accuracy (OA), even on state-of-the-art methods. Specifically, although background, noise, and occlusions exist on PB_T50_RS dataset, PointCMT still improves the overall accuracy of PointNet++ by 3.9%. Moreover, PointCMT also achieves state-of-the-art results on OBJ_ONLY dataset. Note that there is no auxiliary information provided in images, and all view images are generated through the perspective projection of points coordinates. Nevertheless, PointCMT still dramatically increases the mAcc of PointMLP from 89.4% to 92.0% (+2.6%).
Data percentage | Train from scratch | w/ PointCMT |
---|---|---|
2% | 73.3 | 75.2 (+1.9) |
5% | 82.1 | 83.5 (+1.4) |
10% | 85.1 | 87.9 (+2.8) |
20% | 88.4 | 89.3 (+0.9) |
4.3 Data Efficient Learning
We evaluate our approach under limited-data scenarios in Table 3. Here, we only sample a small fraction of the training data in each category of ModelNet40 and evaluate on the entire test set. PointCMT shows an even more significant gap compared to PointNet++ trained from scratch when using a small subset of the training data. Especially with only 2% and 10% of the training data, we achieve about 1.9% and 2.8% improvements, respectively. This result illustrates that PointCMT provides stronger guidance for point cloud models in the low-data regime.
4.4 Ablation Study
The ablation results on three datasets are summarized in Table 4, in which we use PointNet++ as the baseline. We first test the effectiveness of feature enhancement (FE) and classifier enhancement (CE) in PointCMT. The results demonstrate that using only FE already significantly boosts the performance, i.e., it increases the overall accuracy by 0.4% on ModelNet40 and 3.1% on ScanObjectNN (PB_T50_RS). Using only classifier enhancement (CE) improves the accuracy by around 0.6% and 2.9%, respectively. Finally, when we use both FE and CE during the training phase, we achieve the best results of 94.4% and 83.3%, respectively.
Model | FE | CE | ModelNet40 | OBJ_ONLY | PB_T50_RS |
---|---|---|---|---|---|
PointNet++ | ✗ | ✗ | 93.4 | 87.5 | 79.4
 | ✓ | ✗ | 93.8 (+0.4) | 89.2 (+1.7) | 82.5 (+3.1)
 | ✗ | ✓ | 94.0 (+0.6) | 91.3 (+3.8) | 82.3 (+2.9)
 | ✓ | ✓ | 94.4 (+1.0) | 91.8 (+4.3) | 83.3 (+3.9)
4.5 Comparison with Knowledge Distillation Methods
To further verify the effectiveness of our method compared with the typical teacher-student architecture and other knowledge distillation schemes, we compare PointCMT with representative knowledge transfer approaches in Table 5. Among them, Hinton et al. [17] is the pioneering study on knowledge distillation, while Huang et al. [21] and Yang et al. [55] are recent works. As shown in the table, directly applying the distillation of [17] between the two modalities causes negative transfer on ModelNet40. This phenomenon does not appear on ScanObjectNN, since view images projected from point clouds may have a smaller gap to the point cloud modality than images rendered from CAD models. Nevertheless, the other KD techniques only achieve marginal improvements compared with PointCMT.
View-images | ModelNet40 | OBJ_ONLY |
---|---|---|
Rendered from CAD | 94.4 | - |
Projection | 94.0 | 91.8 |
Projection w/ color | - | 90.7 |
4.6 Different View-image Generation
In this section, we compare different view-image generation strategies. As shown in Figure 3, multiple view-image types can be used in our framework, and we compare the results in Table 6. As illustrated in the table, images rendered from CAD models bring larger improvements than plain projections, since the former provide additional shade and texture information. In contrast, we find that using additional colors on the OBJ_ONLY dataset cannot boost the performance. The reason is that the OBJ_ONLY dataset only contains 2,902 objects, and the image network overfits more easily when using color information.
5 Conclusion
In this work, we propose a point cloud cross-modal training strategy named PointCMT. By exploiting a dedicated training architecture and suitable criteria, PointCMT significantly boosts the performance of point cloud analysis methods on several benchmarks, outperforming previous methods by a large margin. We believe that our work can be applied to a broader range of scenarios in the future, such as 3D semantic segmentation and object detection. Meanwhile, our method provides an alternative solution for understanding 3D scenes where texture details are severely missing, improving performance through image priors and knowledge transfer.
Acknowledgment
This work was supported in part by NSFC-Youth 61902335, by the Basic Research Project No. HZQB-KCZYZ-2021067 of Hetao Shenzhen HK S&T Cooperation Zone, by the National Key R&D Program of China with grant No. 2018YFB1800800, by Shenzhen Outstanding Talents Training Fund, by Guangdong Research Project No. 2017ZT07X152 and No. 2019CX01X104, by the Guangdong Provincial Key Laboratory of Future Networks of Intelligence (Grant No. 2022B1212010001), by NSFC 61931024 & 81922046, by NSFC-Youth 62106154, by the Zelixir Biotechnology Company Fund, by Tencent Open Fund, and by ITSO at CUHKSZ.
References
- [1] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9163–9171, 2019.
- [2] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 9297–9307, 2019.
- [3] Yizhak Ben-Shabat, Michael Lindenbaum, and Anath Fischer. 3dmfv: Three-dimensional point cloud classification in real-time using convolutional neural networks. IEEE Robotics and Automation Letters, 3(4):3145–3152, 2018.
- [4] Silin Cheng, Xiwu Chen, Xinwei He, Zhe Liu, and Xiang Bai. Pra-net: Point relation-aware network for 3d point cloud analysis. IEEE Transactions on Image Processing, 30:4436–4448, 2021.
- [5] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4794–4802, 2019.
- [6] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019.
- [7] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
- [8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
- [9] Xiang Deng and Zhongfei Zhang. Comprehensive knowledge distillation with causal intervention. Advances in Neural Information Processing Systems, 34, 2021.
- [10] Khaled El Madawi, Hazem Rashed, Ahmad El Sallab, Omar Nasr, Hanan Kamel, and Senthil Yogamani. Rgb and lidar fusion based 3d semantic segmentation for autonomous driving. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 7–12. IEEE, 2019.
- [11] Yifan Feng, Zizhao Zhang, Xibin Zhao, Rongrong Ji, and Yue Gao. Gvcnn: Group-view convolutional neural networks for 3d shape recognition. In Proc. IEEE Conf. Comput. Vision Pattern Recognition, June 2018.
- [12] Ankit Goyal, Hei Law, Bowei Liu, Alejandro Newell, and Jia Deng. Revisiting point cloud shape classification with a simple and effective baseline. In International Conference on Machine Learning, pages 3809–3820. PMLR, 2021.
- [13] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9224–9232, 2018.
- [14] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–199, 2021.
- [15] Abdullah Hamdi, Silvio Giancola, and Bernard Ghanem. Mvtn: Multi-view transformation network for 3d shape recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2021.
- [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [17] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. NeurIPS Workshops, 2014.
- [18] Ji Hou, Saining Xie, Benjamin Graham, Angela Dai, and Matthias Nießner. Pri3d: Can 3d priors help 2d representation learning? arXiv preprint arXiv:2104.11225, 2021.
- [19] Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. Scenenn: A scene meshes dataset with annotations. In 2016 Fourth International Conference on 3D Vision (3DV), pages 92–101. IEEE, 2016.
- [20] Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 984–993, 2018.
- [21] Zhen Huang, Xu Shen, Jun Xing, Tongliang Liu, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, and Xian-Sheng Hua. Revisiting knowledge distillation: An inheritance and exploration framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3579–3588, 2021.
- [22] Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5010–5019, 2018.
- [23] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. Advances in neural information processing systems, 31, 2018.
- [24] Georg Krispel, Michael Opitz, Georg Waltner, Horst Possegger, and Horst Bischof. Fuseseg: Lidar point cloud segmentation fusing multi-modal data. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1874–1883, 2020.
- [25] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31, 2018.
- [26] Zhenyu Li, Zehui Chen, Ang Li, Liangji Fang, Qinhong Jiang, Xianming Liu, Junjun Jiang, Bolei Zhou, and Hang Zhao. Simipu: Simple 2d image and 3d point cloud unsupervised pre-training for spatial-aware visual representations. AAAI 2021, 2021.
- [27] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8895–8904, 2019.
- [28] Yueh-Cheng Liu, Yu-Kai Huang, Hung-Yueh Chiang, Hung-Ting Su, Zhe-Yu Liu, Chin-Tang Chen, Ching-Yu Tseng, and Winston H Hsu. Learning from 2d: Pixel-to-point knowledge transfer for 3d pretraining. arXiv preprint arXiv:2104.04687, 2021.
- [29] Ze Liu, Han Hu, Yue Cao, Zheng Zhang, and Xin Tong. A closer look at local aggregation operators in point cloud analysis. In European Conference on Computer Vision, pages 326–342. Springer, 2020.
- [30] Zhengzhe Liu, Xiaojuan Qi, and Chi-Wing Fu. 3d-to-2d distillation for indoor scene parsing. CVPR, 2021.
- [31] Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. Rethinking network design and local geometry in point cloud: A simple residual MLP framework. In International Conference on Learning Representations, 2022.
- [32] Gregory P Meyer, Jake Charland, Darshan Hegde, Ankit Laddha, and Carlos Vallespi-Gonzalez. Sensor fusion for joint 3d object detection and semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
- [33] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
- [34] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
- [35] Shi Qiu, Saeed Anwar, and Nick Barnes. Dense-resolution network for point cloud classification and segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3813–3822, 2021.
- [36] Shi Qiu, Saeed Anwar, and Nick Barnes. Geometric back-projection network for point cloud classification. IEEE Transactions on Multimedia, 2021.
- [37] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
- [38] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proc. ICCV, 2015.
- [39] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Francois Goulette, and Leonidas J. Guibas. Kpconv: Flexible and deformable convolution for point clouds. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
- [40] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. arXiv preprint arXiv:1910.10699, 2019.
- [41] Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019.
- [42] Sourabh Vora, Alex H Lang, Bassam Helou, and Oscar Beijbom. Pointpainting: Sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4604–4612, 2020.
- [43] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. In ACM Transactions on Graphics (TOG), 2019.
- [44] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.
- [45] Xin Wei, Ruixuan Yu, and Jian Sun. View-gcn: View-based graph convolutional network for 3d shape analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1850–1859, 2020.
- [46] Wenxuan Wu, Zhongang Qi, and Li Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2019.
- [47] Wenxuan Wu, Zhongang Qi, and Li Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2019.
- [48] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
- [49] Tiange Xiang, Chaoyi Zhang, Yang Song, Jianhui Yu, and Weidong Cai. Walk in the cloud: Learning curves for point clouds shape analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 915–924, 2021.
- [50] Chenfeng Xu, Shijia Yang, Bohan Zhai, Bichen Wu, Xiangyu Yue, Wei Zhan, Peter Vajda, Kurt Keutzer, and Masayoshi Tomizuka. Image2point: 3d point-cloud understanding with pretrained 2d convnets. arXiv preprint arXiv:2106.04180, 2021.
- [51] Mutian Xu, Junhao Zhang, Zhipeng Zhou, Mingye Xu, Xiaojuan Qi, and Yu Qiao. Learning geometry-disentangled representation for complementary understanding of 3d object point cloud. arXiv preprint arXiv:2012.10921, 2, 2021.
- [52] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pages 87–102, 2018.
- [53] Xu Yan, Jiantao Gao, Chaoda Zheng, Chao Zheng, Ruimao Zhang, Shuguang Cui, and Zhen Li. 2dpass: 2d priors assisted semantic segmentation on lidar point clouds. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
- [54] Xu Yan, Chaoda Zheng, Zhen Li, Sheng Wang, and Shuguang Cui. Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5589–5598, 2020.
- [55] Jing Yang, Brais Martinez, Adrian Bulat, and Georgios Tzimiropoulos. Knowledge distillation via softmax regression representation learning. In International Conference on Learning Representations, 2020.
- [56] Ze Yang and Liwei Wang. Learning relationships for multi-view 3d object recognition. In Proc. IEEE Int. Conf. Comput. Vision, pages 7505–7514, 2019.
- [57] Zhengyuan Yang, Songyang Zhang, Liwei Wang, and Jiebo Luo. Sat: 2d semantics assisted training for 3d visual grounding. In ICCV, 2021.
- [58] Haoxuan You, Yifan Feng, Rongrong Ji, and Yue Gao. Pvnet: A joint convolutional network of point cloud and multi-view for 3d shape recognition. In Proc. ACM Int. Conf. Multimedia, MM ’18, pages 1310–1318, New York, NY, USA, 2018. ACM.
- [59] Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Zhen Li, and Shuguang Cui. X-trans2cap: Cross-modal knowledge transfer using transformer for 3d dense captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- [60] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021.
- [61] Helong Zhou, Liangchen Song, Jiajie Chen, Ye Zhou, Guoli Wang, Junsong Yuan, and Qian Zhang. Rethinking soft labels for knowledge distillation: A bias-variance tradeoff perspective. arXiv preprint arXiv:2102.00650, 2021.
Supplementary Material
A Theoretical Proof
We provide the detailed theoretical proof for the lemma proposed in the main manuscript. Define the discrepancy $\epsilon$ between the discriminative image and point cloud features $f_I^*$ and $f_P^*$ as
(8)
Lemma 1: By the definition above, $p(K_T \mid P; \theta_S)$ is bounded below by the quantity $B$, where $B$ is given by
(9)
Proof of Lemma 1:
(10)
For a successful distillation, the predicted probability of the distilled model should be greater than or equal to that of the original point cloud network, i.e., the distilled model outperforms the original one. Combining this with the definition of $\epsilon$ yields the bound stated in Lemma 1.
B Additional Experiments
B.1 Comparison with Different Normalization
As mentioned in the manuscript, the discrepancy between the two modalities makes the KD problem challenging, which is the initial motivation of our PointCMT. In this section, we analyze different strategies for eliminating this discrepancy. Specifically, we design two normalization strategies:
Normalize-I: We assume the features from the image and point cloud networks follow two Gaussian distributions, and normalize the mean and standard deviation of the image features to bring them closer to the point cloud features. For every batch of paired image and point cloud data with batch size $B$, let $F_P = \{f_P^i\}_{i=1}^{B}$ and $F_I = \{f_I^i\}_{i=1}^{B}$ be the feature pairs from the point cloud and image networks. Let $\mathrm{mean}(F_I)$, $\mathrm{std}(F_I)$ be the mean and standard deviation of $F_I$, and $\mathrm{mean}(F_P)$, $\mathrm{std}(F_P)$ be the mean and standard deviation of $F_P$. We normalize the image features through:
$\hat{f}_I^{\,i} = \frac{f_I^{\,i} - \mathrm{mean}(F_I)}{\mathrm{std}(F_I)} \cdot \mathrm{std}(F_P) + \mathrm{mean}(F_P)$   (11)
Normalize-II: We regard the features as vectors in a feature space and normalize their norms to the same scale. Let $\mathrm{norm}(F_P)$ and $\mathrm{norm}(F_I)$ be the mean $\ell_2$-norms of $F_P$ and $F_I$, where $\|\cdot\|_2$ denotes the 2-norm. We normalize the image features through:
$\hat{f}_I^{\,i} = f_I^{\,i} \cdot \frac{\mathrm{norm}(F_P)}{\mathrm{norm}(F_I)}$   (12)
Method | ModelNet40 |
---|---|
Baseline | 93.4 |
Hinton et al. [17] | 93.1 (-0.3) |
w/ Normalize-I | 93.4 (+0.0) |
w/ Normalize-II | 93.5 (+0.1) |
PointCMT (ours) | 94.4 (+1.0) |
During the experiment, we train PointNet++ with the above normalization strategies through the KD loss:
(13)
where the normalization operations mentioned above are applied to the image features. As shown in Table 7, exploiting these normalization strategies slightly improves the performance on ModelNet40 and eliminates the negative transfer. Nevertheless, our PointCMT still achieves clearly superior performance over such naive normalization.
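The two baselines can be sketched as below; whether the batch statistics are computed per feature channel or globally is our assumption, and the normalized image features would then replace the original ones in the [17]-style loss of Equation (13).

```python
import torch

def normalize_i(f_img: torch.Tensor, f_pnt: torch.Tensor) -> torch.Tensor:
    """Normalize-I (Eq. 11): match the batch mean/std of image features to the point features."""
    mu_i, std_i = f_img.mean(dim=0), f_img.std(dim=0) + 1e-6
    mu_p, std_p = f_pnt.mean(dim=0), f_pnt.std(dim=0)
    return (f_img - mu_i) / std_i * std_p + mu_p

def normalize_ii(f_img: torch.Tensor, f_pnt: torch.Tensor) -> torch.Tensor:
    """Normalize-II (Eq. 12): rescale image features so their mean L2 norm matches the point features'."""
    scale = f_pnt.norm(dim=1).mean() / (f_img.norm(dim=1).mean() + 1e-6)
    return f_img * scale
```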
Method | ModelNet40 |
---|---|
Baseline | 93.4 |
PointCMT w/o pre-train | 94.2 (+0.8) |
PointCMT | 94.4 (+1.0) |
B.2 Effect of Pre-trained Image Network
In Table 8, we report the results on ModelNet40 without a pre-trained image feature extractor. As shown in the table, when the image feature extractor is trained from scratch, PointCMT still attains 94.2% overall accuracy, i.e., only a 0.2% drop. Moreover, the results on ScanObjectNN are already obtained without a pre-trained image feature extractor, since the view-images used on that dataset come from perspective projection.
B.3 Performance of Image Networks
In this section, we report the performance of the image networks pre-trained in Stage I. As shown in Table 9, by exploiting view-images rendered from CAD models, the image network achieves a very high overall accuracy of 97.0% on ModelNet40, boosting PointNet++ via PointCMT by 1.0%. Using only projected images, the image network obtains 93.8% on ModelNet40; nevertheless, it still increases the performance of PointNet++ by 0.6%. Moreover, we find that in the cross-modal KD scenario, the teacher does not always perform better than the student. Still, PointCMT effectively learns the complementary information from the teacher and improves the performance of point cloud analysis approaches. For the ScanObjectNN dataset, using additional color information makes the image network overfit on the OBJ_ONLY subset and thus hampers performance.
Image Networks | ModelNet40 | OBJ_ONLY | PB_T50_RS |
---|---|---|---|
Rendered from CAD | 97.0 | - | - |
Gains on PointNet++ | (+1.0) | - | - |
Projection | 93.8 | 89.0 | 80.8 |
Gains on PointNet++ | (+0.6) | (+4.3) | (+3.9) |
Projection w/ color | - | 87.5 | - |
Gains on PointNet++ | - | (+3.2) | - |
B.4 Training Speed
In this section, we report the training speed of PointCMT. As shown in Table 10, the additional training stages I (image encoder and classifier) and II (CMPG) introduce little extra cost in the entire training phase, owing to the small number of epochs for Stage I and the few parameters of CMPG. For the speed evaluation of Stage III, since the image network has already been trained, we fix the pre-trained network and generate object features offline, so they can be directly exploited in Stages II and III without repeatedly forwarding the image network.
Stage I (Image Network) | Stage II (CMPG) | Stage III (PointNet++)
---|---|---
27.35ms / 4.36h | 2.30ms / 0.46h | 10.64s / 36.38h |
B.5 Part Segmentation
To further illustrate the benefit of PointCMT, we conduct two experiments for part segmentation on the ShapeNetPart dataset, as shown in Table 11: (a) PointNet++ with an encoder pre-trained from scratch on ModelNet40; (b) PointNet++ with an encoder pre-trained with PointCMT on ModelNet40. As shown in the table, utilizing the encoder pre-trained with PointCMT effectively improves the performance, especially on the more challenging Class avg IoU metric. Here, Instance avg IoU and Class avg IoU denote the IoU averaged over all instances and over classes, respectively.
Method | Instance avg IoU | Class avg IoU
---|---|---|
Pre-trained PointNet++ w/o PointCMT | 85.3 | 82.0 |
Pre-trained PointNet++ w/ PointCMT | 85.6 (+0.3) | 82.6 (+0.6) |