Transfer Learning with Point Transformers
Abstract
Point Transformers are near state-of-the-art models for classification, segmentation, and detection tasks on point cloud data. They use a self-attention mechanism to model long-range spatial dependencies between point sets. In this project we explore two things: the classification performance of these attention-based networks on the ModelNet10 dataset, and transfer learning, where we fine-tune the ModelNet10-trained model to classify the 3D MNIST dataset. We also train the model from scratch on 3D MNIST to compare the fine-tuned and from-scratch models. We observe that, since the two datasets differ substantially in their underlying distributions, the transfer-learned model does not outperform the from-scratch model in this case. The transfer-learned model does, however, converge faster, since it has already learned lower-level features such as edges and corners from ModelNet10.
1 Introduction
Point clouds are generated using specialised devices, such as laser scanners or depth cameras, that emit signals and measure the time the signals take to bounce back. This timing is used to estimate the distance (depth) of different parts of an object, and a large collection of such points forms a point cloud.
Point clouds offer a comprehensive and precise depiction of the three-dimensional realm. They accurately capture the form, arrangement, and spatial connections of objects and surroundings, exhibiting a high level of fidelity. As a result, point clouds find utility in applications that demand intricate geometric data, including fields like robotics, autonomous navigation, virtual reality, and augmented reality.
Point clouds also aid object recognition in 3D space: by looking at the arrangement of points, it is possible to identify objects or certain features of objects. Point cloud segmentation involves assigning semantic labels to individual points in a point cloud. PointNet [7] is one of the most commonly used point cloud classification networks; it uses multi-layer perceptrons and pooling layers to extract features. PointNet++ [8] improves upon this architecture by introducing hierarchical groupings of points into local regions, allowing PointNet to be applied at different scales. This enables the model to capture both local and contextual features, improving classification performance. Further, graph-based architectures like DGCNN [10] use k-nearest-neighbour graphs to correlate points and apply graph convolutions to extract features. Relation-Shape CNN [6] captures the geometric relationship between points by considering their position and orientation with respect to their neighbours. PointCNN [5] addresses the problem of unordered and irregularly sampled point clouds by introducing a learnable permutation-invariant weighting function, which allows features to be extracted in an order-agnostic manner and facilitates better classification.
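To make the PointNet recipe concrete, below is a minimal PyTorch sketch of the idea (a shared pointwise MLP followed by a symmetric max-pool), with illustrative layer sizes rather than those of [7].

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style classifier: a shared pointwise MLP followed
    by a symmetric max-pool, so the output is invariant to the ordering
    of the input points. Layer sizes are illustrative, not those of [7]."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Shared MLP applied independently to every point: (B, N, 3) -> (B, N, 256)
        self.pointwise = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.head = nn.Linear(256, num_classes)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3)
        feats = self.pointwise(points)          # per-point features
        global_feat = feats.max(dim=1).values   # order-invariant pooling
        return self.head(global_feat)           # class logits
```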
Transformers are a class of deep learning models that use an attention mechanism to process information in a structured way [9, 2]. The Point Transformer [13] uses attention to capture relationships between points. The model incorporates positional encodings to include spatial information and applies self-attention to capture how points in the point cloud interact with each other. The transformer family of models is particularly appropriate for point cloud processing because the self-attention operator, which is at the core of transformer networks, is in essence a set operator: it is invariant to the permutation and cardinality of the input elements. The application of self-attention to 3D point clouds is therefore quite natural, since point clouds are essentially sets embedded in 3D space.
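This set-operator property is easy to check numerically: plain scaled dot-product self-attention without positional encodings is permutation-equivariant, i.e. permuting the input points permutes the outputs identically, so a global pooling on top becomes permutation-invariant. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    # x: (num_points, d); identity projections keep the sketch minimal
    d = x.shape[-1]
    attn = F.softmax(x @ x.T / d**0.5, dim=-1)  # (N, N) attention weights
    return attn @ x

x = torch.randn(16, 8)                 # a "point cloud" of 16 points
perm = torch.randperm(16)
out, out_perm = self_attention(x), self_attention(x[perm])
assert torch.allclose(out[perm], out_perm, atol=1e-6)   # equivariance
assert torch.allclose(out.max(0).values, out_perm.max(0).values, atol=1e-6)  # pooled invariance
```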
Transformer-based networks have been shown to outperform the models discussed above on a variety of tasks such as classification and segmentation [13]. We focus on the Point Transformer v1 architecture, which uses vector attention. We train this model on the ModelNet10 dataset, where it achieves good classification accuracy.
We further explore transfer learning: we train the network on the ModelNet10 dataset and then fine-tune it on the 3D MNIST dataset, comparing the performance of this model with one trained from scratch on 3D MNIST.
2 Related Work
Point clouds in 3D space lack any specific order and are irregularly distributed, resembling sets. Learning-based methods for handling point clouds fall into three main categories: projection-based networks, voxel-based networks, and point-based networks.
PointNet [7] directly handles unprocessed point cloud data for the purposes of 3D object classification and segmentation. PointNet exhibits a remarkable ability to capture both local and global geometric features, all while accommodating point cloud inputs that lack any predefined order or alignment.
The work in [8] builds upon PointNet to create PointNet++. PointNet++ enhances representation-learning capability by applying a set of PointNet modules in a nested fashion, capturing multi-scale features for improved object classification and segmentation.
The authors of [14] introduce VoxelNet, an end-to-end learning framework designed for 3D object detection. VoxelNet works with point clouds that have been transformed into volumetric representations. Using a combination of convolutional and fully connected layers, it effectively detects objects within the point cloud data, surpassing previous methods on several benchmark datasets.
Another set of approaches connects the points in the cloud to form a graph and performs message passing on this graph. Methods like DGCNN [10], PointWeb, ECC, SPG, KCNet, and others use different techniques to conduct graph convolutions or leverage contextual relationships within the point set. Some methods explore continuous convolutions applied directly to the 3D point set without quantization: for example, PCCN represents convolutional kernels as MLPs, SpiderCNN defines kernel weights using polynomial functions, and Spherical CNN addresses 3D rotation equivariance.
The introduction of Transformer and self-attention models has brought significant advancements to machine translation, natural language processing, and recommendation systems [3, 4, 9]. These developments have also influenced the use of self-attention networks in 2D image recognition, where scalar dot-product self-attention has been applied within local image patches. Zhao et al. [13] have introduced a range of vector self-attention operators.
The concept of self-attention is particularly relevant in the context of this study because it inherently operates on sets. Positional information is incorporated as attributes of the elements, treating them as a set. Since 3D point clouds essentially consist of points with positional attributes, the self-attention mechanism appears to be well-suited for this type of data. Consequently, a Point Transformer layer is developed to apply self-attention specifically to 3D point clouds.
3 Dataset
We train our network on ModelNet10 [11], an extensive dataset used to train and assess 3D deep learning models. It encompasses a wide range of 3D computer-aided design (CAD) models spanning ten object categories such as chairs, tables, and lamps, with diverse shapes, sizes, and appearances. We then apply transfer learning to a 3D MNIST dataset obtained from Kaggle (https://www.kaggle.com/datasets/daavoo/3d-mnist?resource=download), which is generated from the 2D MNIST dataset.
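For reference, loading this Kaggle dataset might look as follows; the file name full_dataset_vectors.h5 and the X_train/y_train/X_test/y_test keys reflect the usual layout of that download and are assumptions for this sketch, not part of our method.

```python
import h5py
import numpy as np

# Assumed layout of the Kaggle download: an HDF5 file holding flattened
# 16x16x16 voxel grids under the keys X_train/y_train/X_test/y_test.
with h5py.File("full_dataset_vectors.h5", "r") as f:
    x_train = np.asarray(f["X_train"])   # (N, 4096) flattened voxel grids
    y_train = np.asarray(f["y_train"])   # (N,) digit labels 0-9
    x_test = np.asarray(f["X_test"])
    y_test = np.asarray(f["y_test"])

x_train = x_train.reshape(-1, 16, 16, 16)  # back to a voxel grid if needed
```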
4 Methodology
Aiming to accurately categorise point clouds without using CNNs and without making assumptions about the ordering of the data points, we opted for a transformer-based model that uses vector attention for classification. We build upon the work of Zhao et al. [13]. The authors argue that 3D point clouds are essentially sets of points in space, so self-attention, which operates on sets of elements and incorporates positional information, is a natural tool. They used three datasets to demonstrate the utility of point transformers in 3D deep learning: for 3D semantic segmentation, Stanford Large-Scale 3D Indoor Spaces (S3DIS) [1]; for 3D shape classification, the ModelNet40 dataset [11]; and for object part segmentation, ShapeNetPart [12].
We use the same model. The input to the model is a labelled point cloud, and the output is a prediction score for each possible class. The architecture of the self-attention transformer block is shown below.

In Figure 1, φ, ψ, and α are pointwise feature transformations, such as linear projections or MLPs; δ is a position encoding function, and ρ is a normalization function such as softmax.
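Written out, the vector self-attention computed by the Point Transformer layer of [13] is

```latex
y_i = \sum_{x_j \in \mathcal{X}(i)}
      \rho\Bigl(\gamma\bigl(\varphi(x_i) - \psi(x_j) + \delta\bigr)\Bigr)
      \odot \bigl(\alpha(x_j) + \delta\bigr)
```

where X(i) is the set of k nearest neighbours of point x_i, γ is an MLP that produces per-channel attention weights, ⊙ denotes the elementwise product, and δ encodes the relative position of the two points.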

In Figure 2, the Point Transformer layer is combined with transition-down blocks to form the full classification network.
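As a rough, unbatched sketch of what a transition-down block does (the implementation in [13] is batched and uses batch normalization; this version is simplified for readability): subsample the points with farthest point sampling, group each sampled point with its k nearest neighbours, transform the features pointwise, and max-pool each neighbourhood onto its sampled point.

```python
import torch
import torch.nn as nn

def farthest_point_sample(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """Greedy farthest point sampling. xyz: (N, 3); returns m point indices."""
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = torch.randint(n, (1,)).item()
    for i in range(m):
        idx[i] = farthest
        dist = torch.minimum(dist, ((xyz - xyz[farthest]) ** 2).sum(-1))
        farthest = int(dist.argmax())
    return idx

class TransitionDown(nn.Module):
    """Sketch of a transition-down block: FPS subsampling, kNN grouping,
    a pointwise transform, then max-pooling each neighbourhood onto its
    sampled centre point."""
    def __init__(self, dim_in: int, dim_out: int, k: int = 16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(dim_in, dim_out), nn.ReLU())

    def forward(self, xyz, feats, m):
        # xyz: (N, 3), feats: (N, C) -> returns (m, 3) centres, (m, dim_out) features
        centers = farthest_point_sample(xyz, m)
        d = torch.cdist(xyz[centers], xyz)            # (m, N) pairwise distances
        knn = d.topk(self.k, largest=False).indices   # (m, k) neighbour indices
        grouped = self.mlp(feats)[knn]                # (m, k, dim_out)
        return xyz[centers], grouped.max(dim=1).values
```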
We train our network on the ModelNet10 dataset and then fine-tune it on the 3D MNIST dataset. Additionally, we retrain the entire network from scratch on the 3D MNIST dataset and compare the results with those obtained through transfer learning.
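A sketch of this fine-tuning setup, reusing the TinyPointNet sketch above as a stand-in for the full Point Transformer; the head attribute name and the checkpoint file modelnet10_best.pt are ours, for illustration only.

```python
import torch
import torch.nn as nn

# TinyPointNet (defined earlier) stands in for the full Point Transformer;
# the checkpoint file name is hypothetical.
model = TinyPointNet(num_classes=10)
model.load_state_dict(torch.load("modelnet10_best.pt"))

# Swap the classification head: 10 ModelNet categories -> 10 digit classes.
model.head = nn.Linear(model.head.in_features, 10)

# Fine-tune the pretrained backbone with a smaller learning rate than the new head.
optimizer = torch.optim.Adam([
    {"params": [p for n, p in model.named_parameters() if not n.startswith("head")],
     "lr": 1e-4},
    {"params": model.head.parameters(), "lr": 1e-3},
])
criterion = nn.CrossEntropyLoss()
```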

5 Experimental Evaluation
We observed a training accuracy of 87.7% while training the network on ModelNet10. Using the best checkpoint, we then apply fine-tuning, a transfer learning technique, and evaluate the model's performance on the 3D MNIST dataset.
Figure 4 depicts the accuracies obtained while training the Point Transformer model on the ModelNet10 dataset. As expected, the model learns features well on this dataset, and accuracy increases overall.

Figure 5 depicts the accuracies obtained while retraining the Point Transformer model from scratch on the 3D MNIST dataset. The network does not seem to learn optimal parameters for this dataset. Although a randomly initialised ten-class classifier would be expected to achieve about 10% accuracy, we obtain roughly 25%. This implies that the model is learning some features, but not enough to provide good classification performance.

Figure 6 depicts the accuracies obtained while fine-tuning the ModelNet10-trained Point Transformer model on the 3D MNIST dataset for 15 epochs. The fine-tuning approach does not give good results on 3D MNIST, which can be explained by the distribution shift between ModelNet10 and 3D MNIST. However, the fine-tuned model converges faster than the model trained from scratch, which suggests that transfer learning is still effective at reusing lower-level features such as edges and corners.

The table below summarises the results obtained on the 3D MNIST dataset. We observe that the model gives similar results for both retraining and fine-tuning. However, fine-tuning reaches this performance in half as many epochs, which indicates that transfer learning is effective to some extent.
Evaluation metrics on 3D MNIST:

| Epochs | Method | Accuracy (%) | F1 Score (%) |
|---|---|---|---|
| 15 | Fine-tuning | 26.0 | 14.2 |
| 30 | Retraining | 24.6 | 11.6 |
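For clarity, the accuracy and F1 values above can be computed as follows; we assume F1 is macro-averaged over the ten digit classes, and the labels below are placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([3, 1, 4, 1, 5, 9, 2, 6])   # placeholder ground-truth digits
y_pred = np.array([3, 1, 4, 0, 5, 9, 2, 5])   # placeholder predictions

accuracy = 100 * accuracy_score(y_true, y_pred)
macro_f1 = 100 * f1_score(y_true, y_pred, average="macro")
print(f"Accuracy: {accuracy:.1f}%  F1: {macro_f1:.1f}%")
```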
We draw some inferences from these tabulated results. Transfer learning relies on the assumption that the training data and the target data have similar underlying distributions. If the target data is out of distribution (OOD), differing significantly from the source distribution, the knowledge transferred from the source may not be relevant or valuable. Consequently, the model may struggle to adapt and perform poorly when faced with the novel data distribution.
Since the source and target datasets are too dissimilar, transfer learning was not effective here: the model was not able to generalise the knowledge learned from ModelNet10 and apply it to classify 3D MNIST. Evidently, the two datasets require fundamentally different features for classification.
To analyse classification on 3D MNIST further, we create a new model specifically for this dataset based on a basic MLP architecture. Figure 7 tabulates the metrics obtained with the MLP-based model on 3D MNIST. Reverting to a simpler model made up of four dense layers and two fully-connected layers gives better results. While the evidence is inconclusive, the attention-based model does not appear to learn useful features from the 3D MNIST dataset; this needs to be explored further.
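For illustration, such a dense baseline on the flattened 16×16×16 voxel vectors might look like the following sketch; the layer widths are assumptions, not our exact configuration.

```python
import torch.nn as nn

# A simple dense baseline on flattened 16x16x16 voxel inputs (4096 features).
# Layer widths are illustrative, not the exact configuration reported above.
mlp_baseline = nn.Sequential(
    nn.Linear(4096, 1024), nn.ReLU(),
    nn.Linear(1024, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),  # ten digit classes
)
```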

References
- [1] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1534–1543, 2016.
- [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
- [3] Vinayak Gupta and Srikanta Bedathur. Proactive: Self-attentive temporal point process flows for activity sequences. In KDD, 2022.
- [4] Vinayak Gupta, Srikanta Bedathur, and Abir De. Learning temporal point processes for efficient retrieval of continuous time event sequences. In AAAI, 2022.
- [5] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31, 2018.
- [6] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8895–8904, 2019.
- [7] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
- [8] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.
- [9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
- [10] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog), 38(5):1–12, 2019.
- [11] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
- [12] Li Yi, Vladimir G Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (ToG), 35(6):1–12, 2016.
- [13] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021.
- [14] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018.