Upsampling Autoencoder for Self-Supervised Point Cloud Learning

Cheng Zhang, Jian Shi, Xuan Deng, Zizhao Wu
Hangzhou Dianzi University, Hangzhou China
{zhangcheng828,wuzizhao}@foxmail.com,@hdu.edu.cn Corresponding author.

Abstract

In computer-aided design (CAD) community, the point cloud data is pervasively applied in reverse engineering, where the point cloud analysis plays an important role. While a large number of supervised learning methods have been proposed to handle the unordered point clouds and demonstrated their remarkable success, their performance and applicability are limited to the costly data annotation. In this work, we propose a novel self-supervised pre-training model for point cloud learning without human annotations, which relies solely on upsampling operation to perform feature learning of point cloud in an effective manner. The key premise of our approach is that upsampling operation encourages the network to capture both high-level semantic information and low-level geometric information of the point cloud, thus the downstream tasks such as classification and segmentation will benefit from the pre-trained model. Specifically, our method first conducts the random subsampling from the input point cloud at a low proportion e.g., 12.5%. Then, we feed them into an encoder-decoder architecture, where an encoder is devised to operate only on the subsampled points, along with a upsampling decoder is adopted to reconstruct the original point cloud based on the learned features. Finally, we design a novel joint loss function which enforces the upsampled points to be similar with the original point cloud and uniformly distributed on the underlying shape surface. By adopting the pre-trained encoder weights as initialisation of models for downstream tasks, we find that our UAE outperforms previous state-of-the-art methods in shape classification, part segmentation and point cloud upsampling tasks. Code will be made publicly available upon acceptance.

1 Introduction

Analysis and classification of 3D point cloud is an important problem in computer vision, graphics and CAD communities, due to its wide applications in robot manipulation [52], autonomous driving [47], reverse engineering [5], etc. In view of the great success of deep learning on other computer vision tasks, many endeavors have been made to adapt the deep learning technologies to the analysis of 3D point clouds [65, 48, 49, 35, 32, 75, 73, 22, 83, 13, 64] including PointNet [48], VoxNet [84], etc. However, most of the existing works are supervised and rely on large-scale and accurate annotated dataset, which hinder their applicability.

Unsupervised pre-training, on the other hand, has emerged as a promising alternative to address the shortcoming of the above supervised methods within an effective architecture. Depending on the models their used, existing unsupervised approaches for learning on point clouds can be roughly classified into two categories: reconstruction-based and generation-based methods. The generation-based methods [23, 2] typically employed the Generative Adversarial Networks (GANs) [2] or Variational Auto-Encoders (VAEs)[23] to learn feature representations in an unsupervised framework. The reconstruction-based models [54] usually adopted a framework that trains an encoder to learn shape representations by reconstructing the input data via a decoder. Though these models have been demonstrated to be effective in certain applications, they usually fail to acquire high-level structure information [10].

Refer to caption — Figure 1: Overview of UAE. During pre-training, we first generate subsampled points for each input by randomly sampling, then train a encoder-decoder model to upsample these points by 8 $\times$ . Finally, we use the learned pre-training weights as an initialization for various downstream tasks.

Recently, self-supervised pre-training [56, 21, 54, 51, 80, 10], a class of unsupervised pre-training methods, is gaining certain interest in the point cloud learning paradigm. The premise of self-supervised learning is to generate the results from unlabeled data in a supervised manner by leveraging the underlying structure in the data. By adopting the pre-trained self-supervised models for point cloud learning, these approaches are capable of predicting high-level structure information and with good performance. For example, some self-supervised pre-training methods[72, 29, 17] have employed the contrastive learning framework to capture high-level geometric information (e.g., shape representation) by modeling shape similarity and dissimilarity between two or more views. OcCo [61] has utilized occlusion completion as a pre-training task to learn pre-train weights for point cloud analysis. However, according to our knowledge, all the existing self-supervised pre-training approaches are not efficient enough, as the majority of them need a careful treatment of negative pairs by either relying on large batch sizes [72, 80], memory banks [30], data pre-processing [61] or customized mining strategies [17, 30] to retrieve the negative pairs. Furthermore, their performance critically depends on complicated 3D data augmentations, e.g., cuboid [80], shape disorganizing [12] and shearing [19].

In this work, we propose a novel self-supervised pre-training model based on upsampling auto-encoder, namely UAE, aiming to reach a simple yet effective model, which can be applied to a wide range of downstream point cloud analysis tasks such as 3D object classification, semantic segmentation and point cloud upsampling. The key observation is that upsampling model that is accurate needs to understand the structure information of the point cloud, and thus a pre-trained upsampling model can facilitate some downstream tasks of point clouds due to the high-level structure information it inherently owned. We note that some works have explored the upsampling or completion strategies for self-supervised point cloud learning, however as mentioned former, these approaches require either negative sampling or data augmentation operation. Our UAE is inspired by the masked auto-encoder (MAE) [25] for image analysis, offering a simpler and more effective upsampling architecture than theirs by providing a solution that subsamples random points from the input point cloud and reconstructs the points that are not sampled in the point space directly. Specifically, our method first conducts the random subsampling from the input point cloud at a low proportion. Then, we feed them into an asymmetric encoder-decoder architecture, where the encoder is devised to operate only on the subsampled points, along with an upsampling decoder is adopted to reconstruct the original point cloud based on the learned features. Finally, we design a novel joint loss function that enforces the upsampled points to be similar with the original point cloud and uniformly distributed on the underlying shape surface.

We pre-train our UAE network with ShapeNet dataset [8] and evaluate its performance in various downstream point cloud learning tasks, including shape classification, shape segmentation and point cloud upsampling. Our experiments have demonstrated that in all these tasks, our pre-trained model offers better performance than the same network trained with the labeled data of downstream tasks. And comparing with other SOTA unsupervised pre-training methods, our UAE obtains the best performance in nearly all these tasks.

Our contributions can be summarized as follows:

•

presents a novel self-supervised pre-training method for point cloud learning based on pre-trained upsampling model.
•

investigates a novel upsampling architecture, where a joint loss function is introduced to enforce the upsampled points to be similar to the original point cloud.
•

achieves significant performance in shape classification, part segmentation and point cloud upsampling tasks.

2 Related works

2.1 Supervised Learning on 3D Point Clouds

In recent years, deep learning on 3D point cloud has attracted increasing attention of researchers. As 3D points are irregularly sampled and have quite a few special properties such as irregularity and permutation invariance, the traditional neural networks, i.e., convolutionl neural networks (CNNs) in the 2D field, cannot be directly applied to point clouds. Therefore, previous works attempt to convert point clouds into a regular grid structure and apply 3DCNN to feature learning [84, 73, 67]. However, the computational consumption and memory of 3DCNN increase cubically with the increase of voxel’s resolution [41].

Another mainstream method is directly operating the unordered point clouds [4, 39, 81, 22, 83, 62, 78, 9, 55]. For instance, Qi et al. [48] pioneered this series of works and proposed a novel neutral architecture, PointNet, which can directly handle irregular and unordered 3D points by stacking multi-layer perceptrons (MLPs) and capturing global shape with max-pooling. Later, Qi et al. [49] proposed PointNet++ to overcome the disadvantage that PointNet fails to capture local structures by developing a hierarchical grouping architecture at different set abstraction level. Subsequently, PointCNN [35], PointConv [68], RSCNN [40] and DGCNN [65] also focus on local structures of point cloud and further improve the quality of captured features. However, these models mentioned above are supervised and rely on large-scale labeled point cloud dataset. In this paper, we propose a novel unsupervised framework, namely UAE for point cloud analysis.

2.2 Unsupervised Learning on 3D Point Clouds

There are several prior attempts [23, 1, 3, 18, 24, 76] on learning shape-specific invariant representations of a point cloud based on unsupervised reconstruction models. These methods discover valuable information of 3D point cloud by reconstructing the input data, which has been shown to effectively learn shape-specific invariant representations. For example, FoldingNet [76] proposed an end-to-end auto-encoder to get the codeword that can represent the high-dimensional embedding of a point cloud, and the fully-connected decoder was replaced with the folding-based decoder. GraphTER [19] proposed a novel unsupervised learning method to capture intrinsic patterns of point cloud structure under both global and local transformations. However, because of lacking sufficient semantic supervision, this kind of methods are insufficient to understand the high-level semantic information of point clouds.

Witnessing the great success of self-supervised on computer vision tasks, many researchers have endeavored to apply it to the field of self-supervised point cloud analysis. Xie et al. [72] pioneered this line of works by proposing PointContrast to build representations of scene-level point cloud which relies on the complete 3D reconstruction of a scene with point-wise correspondences between the different views of a point cloud. However, these point-wise correspondences need to post-process the input data by registering the different depth maps into a single 3D scene. Later, Zhang et al.[80] proposed DepthContrast to side-step the necessity of registered point clouds or correspondences, by considering each depth map as an instance and discriminating between them, even if they come from the same scene. Yet, most of them require careful treatment of negative key samples by either relying on large batch sizes [72, 80], memory banks or customized mining strategies [17] to find the negative key samples. Additionally, the performance of these methods critically depends on the choice of data augmentations [72, 21, 19, 29], limiting the applicability of the approach. As a comparison, in this work, we investigate a useful pretext task for learning shape-specific invariant representations.

Some researchers have also exploited the self-supervised pre-training model, for example, OcCo [61] is investigated by using occlusion completion as a pre-training task to learn an initialization for point cloud models. Compared with OcCo, our model has two advantages: firstly, our UAE does not require extra data-preprocessing to generate occluding point clouds based on different viewpoints; and secondly, our method uses point cloud upsampling as a pre-training task and design a novel joint loss function to capture better shape information by enforcing the upsampled points uniformly distributed on the underlying shape surface.

2.3 Self-Supervised Learning for Images

Self-supervised learning has attracted increasing attention since the cost of accurate annotated dataset is extraordinarily expensive [42, 45, 53, 59]. There are several classes of methods for learning representations such as clustering [6, 7], GANs [15, 43], pretext tasks [14, 44] etc. Recently, contrastive learning [46, 16, 71, 58, 26, 11] have been proposed to learn the unsupervised representation of 2-dimensional natural images. It can be considered as a pretext task where the goal is to maximize the representational similarity between positive key samples and dissimilarity between negative key samples for the input query. The positive key sample is generated with a random data augmentation module, which, given the input, generates a pair of random views of the input. Other inputs in the same batch are often used as negative key samples. Generally, contrastive and related methods strongly depend on data augmentation and the sampling of negative pairs [25, 20], which limit their applicability. In contrast to these methods, we pre-train the UAE on a simple but meaningful upsampling objective, aiming to capture high-level semantic information within a simple and effective framework.

3 Method

Suppose that the original point cloud with $N$ points is denoted by $\mathcal{X}\in\mathbb{R}^{N\times C}$ . In the simplest setting of $C$ = 3, each point contains 3D coordinates. We aim to train a model that is capable of unsupervised learning the point cloud representations. Towards this goal, we propose upsampling autoencoder (UAE), a simple self-supervised learning approach that reconstructs the original point cloud from a small number of subsampled points. Unlike classical self-supervised methods, our UAE is suggested to capture high-level semantic information without any data augmentation and negative pairs. The overall structure of UAE is shown in Figure 2.

In what follows, we begin by describing the strategy that makes up our subsampled points. Then we detail how to learn a encoder $\varphi(\cdot)$ and an upsampling decoder $\phi(\cdot)$ . And finally, we discuss the design of our joint loss function.

3.1 Subsampling

Given a point cloud $\mathcal{X}\in\mathbb{R}^{N\times 3}$ , we first subsample $\mathcal{X}$ into a lower resolution subset $\mathcal{P}\in\mathbb{R}^{rN\times 3}$ , where $r$ denotes that subsampling ratio. There exists a variety of sampling methods: 1) Random Sampling, where the probability of sampling each point in the point cloud follows a uniform distribution; 2) Farthest Point Sampling, where each point to be sampled is as far away as possible from points in the sampled set; 3) Local Sampling, where the points are sampled from a local part of the point cloud. In this paper, we propose to randomly sample a subset of points at a low proportion uniformly (e.g., $r=0.125$ ), which largely eliminates redundancy, thus creating a rather challenging task that cannot be easily handled by extrapolation from subsampled neighboring points [25]. This highly sparse point cloud is more conducive for us to design an effective encoder to capture high-level semantic information.

3.2 Encoder

The encoder $\varphi(\cdot)$ takes the coordinates of the subsampled point clouds $\mathcal{P}$ as input and outputs high-dimensional features. In contrast to previous self-supervised methods, where the global shape representation is learned by the encoder, we perform point-wise feature extraction on subsampled point cloud ${\mathcal{P}}$ , where each point can learn its representation by upsampling decoding the original point cloud to reveal the local structure around it. These representations will gather global shape information about the original point cloud as we randomly subsample points into different subsets (or groups) through training iterations while capturing the local structures under point-wise upsampling decoding.

Generally, any deep learning-based network that takes point clouds as the input and outputs high-dimensional features can be adopted as the encoder $\varphi(\cdot)$ of UAE. However, as 3D points have some special properties such as irregularity and permutation invariance, we can not directly leverage the convolutional neural network (CNN). Therefore, we deploy the dynamic graph CNN (DGCNN [65]) to overcome this limitation. In particular, we adopt the EdgeConv layer in DGCNN as our basic feature extraction block. By performing EdgeConv, our encoder can aggregate the features of neighbor points to the center point and update the feature of the center point.

Specifically, given an input subsampled points $\mathcal{P}=\{p_{i}\}_{i=1}^{rN}\subseteq\mathbb{R}^{C}$ , we first apply K-Nearest Neighbor (KNN) algorithm to construct a local graph in the feature space. The KNN algorithm based on the feature space can efficiently and effectively find two points with the most similar semantics, i.e., the points on the two wings of the airplane, which have similar semantics. Then, the EdgeConv layer performs feature transformation on the $rN$ points and output the updated point features of $p_{i}$ . The output feature of a point $p_{i}$ is

f_{i}=\max_{p_{j}\in\mathcal{N}(p_{i})}{\rm ReLU}(h_{\theta}(p_{i},p_{i}-p_{j})),

(1)

where $p_{j}\in\mathcal{N}(p_{i})$ denotes that point $p_{j}$ belongs to the neighborhood of point $p_{i}$ , $h_{\theta}$ denotes that the learnable parameters of MLP.

By stacking four EdgeConv layers, each point is aggregated with local neighboring features hops away. Furthermore, as the subsampling operation enlarges the distance between neighboring points, performing EdgeConv for the subsampled points will capture longer-range dependencies and thus facilitates to extract high-level semantic information that is invariant under random sampling.

3.3 Decoder

The decoder $\phi(\cdot)$ takes the point-wise features as input and outputs the upsampled point cloud $\tilde{\mathcal{P}}$ . To upsample a point cloud, Li et al. proposed PU-Net [77] to duplicate the point features and then employ separate MLPs to process each copy independently. However, the expanded features would be too similar to the inputs, so affecting the upsampling quality. In recent years, wang et al. proposed MPU [66] to break a $16\times$ upsampling network into four successive $2\times$ subnets to progressively upsample points in multiple steps. MPU preserves superior upsampling details but usually requires a more complex training process.

Inspired by the PU-GAN [33], we design a novel upsampling decoder to expand the point features, which is mainly composed of feature up and feature down blocks. As shown in Figure 2, we first upsample the point features $F_{in}$ (after a MLP layer) by $\frac{1}{r}\times$ to generate $F_{up}$ and downsample it to generate $F_{down}$ ; then, instead of directly constructing the original point cloud $\mathcal{X}$ , we adopt residual learning to regress the per-point feature offset by calculating the difference between $F_{up}$ and $F_{down}$ ; Ultimately, we feed them into a feature up block and a MLP layer to restore the original point cloud $\mathcal{X}$ . Such a strategy that utilizes feature offset to self-correct the expanded features has two advantages: firstly, it facilitates the production of fine-grained features while avoiding tedious multi-step training; secondly, the features of the same object can be completely different with respect to rigid transformations. In the following, we detail the design choices of feature up and feature down blocks.

Feature up block. To upsample the point features $\frac{1}{r}$ times, we adopt the commonly-used variation expansion operator [33, 34] by duplicating $F_{in}$ with $\frac{1}{r}$ copies and concatenating with a regular 2D grid. However, such operator may introduce redundant information or extra noise [34]. To rectify these problems, we propose to make use of the offset-attention [22] as the global refinement unit by considering the overall shape structure. The reason behind is that, compared with the widely-used self-attention unit [33, 34, 79], the offset-attention is generally more robust because it works by replacing the attention feature with the offset between the input of self-attention module and attention feature. The pipeline of this block is illustrated in Figure 3 (top).

Feature down block. To downsample the expanded features, we propose a novel GCN-based feature down block, illustrated in Figure 3 (bottom). Given the expanded features $F_{up}\in\mathbb{R}^{N\times D}$ , our method works in two steps. First, we reshape $F_{up}$ of shape $N\times D$ to $rN\times\frac{1}{r}\times D$ and use one layer of EdgeConv to downsample $F_{up}$ to shape $rN\times D$ using learnable parameters. Second, we feed them into a set of MLPs to regress the point features $F_{down}$ .

In contrast to previous works [33, 77, 66], our feature down block leverages GCNs, which are common modules for feature extraction. To the best of our knowledge, we are the first to introduce a GCN-based feature downsampling block. Our GCN design choice stems from the fact that GCNs enable our feature down block to encode spatial information from point neighborhoods and learn new features from the latent space rather than simply using Convs.

3.4 Joint loss function

To encourage the upsampled points to be similar to the original point cloud and uniformly distributed on the underlying shape surface, a novel joint loss function was proposed to train our UAE in an end-to-end fashion. Next, we will detail the design of this loss function.

Reconstruction loss. Our UAE reconstructs the original point cloud $\mathcal{X}$ by predicting coordinate for each unsampled point. Each element in the decoder’s output is a vector of coordinate values representing a point’s spatial location. We formulate our objective function to encourage the geometric consistency between $\tilde{\mathcal{P}}$ and $\mathcal{X}$ in the point space:

\mathcal{L}_{CD}(\tilde{\mathcal{P}},\mathcal{X})=\frac{1}{|\tilde{\mathcal{P}}|}\sum_{\tilde{p}\in\tilde{\mathcal{P}}}\min_{x\in\mathcal{X}}||\tilde{p}-x||_{2}+\frac{1}{|\mathcal{X}|}\sum_{x\in\mathcal{X}}\min_{\tilde{p}\in\tilde{\mathcal{P}}}||x-\tilde{p}||_{2},

(2)

where $\mathcal{L}_{CD}(\cdot)$ means the Chamfer Distance (CD) to measure the average closest point distance between two point sets, $||\cdot||_{2}$ denotes the L2 distance between two points.

Repulsion loss. Since the reconstruction loss alone can not ensure that the upsampled points over the underlying object surfaces will be uniformly distributed, which is important for capturing high-level shape information. To solve this issue, we propose to use the repulsion loss [77] as the constraint term of uniformity distribution, which is represented as:

\mathcal{L}_{rep}=\sum^{N}_{i=0}\sum_{\tilde{p_{j}}\in\mathcal{N}(\tilde{p_{i}})}\delta(||\tilde{p_{j}}-\tilde{p_{i}}||)\omega(||\tilde{p_{j}}-\tilde{p_{i}}||),

(3)

where $\mathcal{N}(\tilde{p_{i}})$ is the point set of the k-nearest neighbors of point $\tilde{p_{i}}$ , $N$ is the number of upsampled points and $||\cdot||$ is the L2-norm. $\delta(m)=-m$ is the repulsion term, which is a decreasing function to penalize $\tilde{p_{i}}$ if $\tilde{p_{i}}$ is located too close to other points in $\mathcal{N}(\tilde{p_{i}})$ . We further use the fast-decaying weight function $\omega(m)=e^{-m^{2}/h^{2}}$ ( $h$ as the finite support radius [37]) to penalize $\tilde{p_{i}}$ only when it is too close to its neighboring points.

Overall loss function. The overall loss function can be calculated as the weighted sum of all the terms described above:

\mathcal{L}_{total}=\alpha\mathcal{L}_{CD}+\beta\mathcal{L}_{rep},

(4)

where $\alpha$ and $\beta$ are the weighting factors for the loss functions of reconstruction and repulsion, respectively. Following [77], we set $\alpha=100$ and $\beta=1$ .

4 Experiments

In this section, we detail the pre-training setups of UAE and evaluate it for different downstream tasks: classification, part segmentation, and point cloud upsampling.

4.1 Experimental setups

Pre-training Dataset. We train our UAE on ShapeNet dataset [8], which consists of 57, 448 synthetic 3D CAD models organized into 55 categories with a further 203 subcategories, organized according to WordNet synsets. For pre-training we use the normalized version of ShapeNet, where all shapes are consistently aligned and normalized to fit inside a unit cube.

Architecture. The network architecture used for pre-training is shown in Fig 4. we deploy four EdgeConv layers whose dimensions are [64,64,128,256] and one MLP (648) layer as the encoder. The number of nearest neighbors set to 20 for all EdgeConv layers. After the four EdgeConv layers, we concatenate the output of these layers to get a 64+64+128+256=512 dimensional point cloud features and feed them into one MLP layer with input channels of 512 and output channels of 648. Similar to previous works [33, 50], the upsampling decoder is used to obtain the reconstructed point cloud, where $D$ is 128.

Pre-training Hyperparameters. We uniformly sample 2, 048 points on ShapeNet dataset for our UAE pre-training. We use the Adam optimizer with no weight decay (L2 regularisation). The learning rate is set to 1e-3 initially and is decayed by 0.7 every 10 epochs. We pre-train the models for 120 epochs. The batch size is 32, and the momentum of batch normalization is 0.9. Note that, all the experiments are trained on an NVIDIA RTX3090 GPU.

Evaluation Metrics. For the classification task on ModelNet10 and ModelNet40 datasets, we use the overall accuracy (OA) as the metric. On ShapeNet Part dataset, we evaluate our scheme with part classification accuracy and mean Intersection over-Union (mIoU). For each sample, IoU is computed for each part that belongs to that object category. The mean of all part IoUs is regarded as the IoU for that sample. For point cloud upsampling task, we use the Chamfer distance (CD), Hausdorff distance (HD), and point-to-surface distance (P2F) w.r.t ground truth meshes as evaluation metrics. The smaller the metrics, the better the performance.

Model	Reference	ModelNet40	ModelNet10
Supervised Learning
PointNet [48]	CVPR2016	89.2	$-$
PointNet++ [49]	NIPS2017	91.9	$-$
PointConv [69]	CVPR2019	92.5	$-$
RGCNN [57]	MM2018	90.5	$-$
PointCNN [35]	NIPS2018	92.2	$-$
SpiderCNN [74]	ECCV2018	92.4	$-$
PointWeb [82]	CVPR2019	92.3	$-$
DGCNN [65]	TOG2018	92.9	$-$
Unsupervised Transfer Learning
FoldingNet [76]	CVPR2018	88.4	94.4
MAP-VAE [23]	ICCV2019	90.1	94.8
MID-FC [63]	TOG2020	90.3	$-$
GSIR [10]	ICCV2021	90.3	$-$
RS-DGCNN [54]	NIPS2019	90.6	94.5
GLR-RSCNN [51]	CVPR2020	91.3	94.2
GraphTER [19]	CVPR2020	92.0	$-$
PointDis [38]	$-$	92.3	95.3
SSC-RSCNN [12]	ICCV2021	92.4	95.0
Ours-DGCNN	$-$	92.9	95.6
Supervised Fine-tuning
DepthContrast [80]	CVPR2021	91.3	$-$
MID-FC	TOG2020	93.1	$-$
GLR-RSCNN	CVPR2020	92.2	94.8
ParAE-DGCNN	ICCV2021	92.9	$-$
SSC-RSCNN	ICCV2021	93.0	95.5
Ours-DGCNN	$-$	93.2	95.7

Table 1: Shape Classification Results on ModelNet40 and ModelNet10. The quantitative results of SOTA unsupervised and supervised fine-tuning methods. “Unsupervised Transfer Learning” denotes the parameters of the pre-trained encoder are fixed on downstream tasks, and “Supervised Fine-tuning” denotes the pre-trained encoders are fine-tuned on target tasks.

4.2 Shape Classification

Model	mIoU	Areo	Bag	Cap	Car	Chair	Ear Phone	Guitar	Knife	Lamp	Laptop	Motor	Mug	Pistol	Rocket	Skate Board	Table
Supervised Learning
KDNet[31]	82.3	80.1	74.6	74.3	70.3	88.6	73.5	90.2	87.2	81.0	94.9	57.4	86.7	78.1	51.8	69.9	80.3
PointNet	83.7	83.4	78.7	82.5	74.9	89.6	73.0	91.5	85.9	80.8	95.3	65.2	93.0	81.2	57.9	72.8	80.6
PointNet++[49]	85.1	82.4	79.0	87.7	77.3	90.8	71.8	91.0	85.9	83.7	95.3	71.6	94.1	81.3	58.7	76.4	82.6
P2Sequence[39]	85.1	82.6	81.8	87.5	77.3	90.8	77.1	91.1	86.9	83.9	95.7	70.8	94.6	79.3	58.1	75.2	82.8
DGCNN	85.2	84.0	83.4	86.7	77.8	90.6	74.7	91.2	87.5	82.8	95.7	66.3	94.9	81.1	63.5	74.5	82.6
Unsupervised Transfer Learning
LGAN [2]	57.0	54.1	48.7	62.6	43.2	68.4	58.3	74.3	68.4	53.4	82.6	18.6	75.1	54.7	37.2	46.7	66.4
MAP-VAE [23]	68.0	62.7	67.1	73.0	58.5	77.1	67.3	84.8	77.1	60.9	90.8	35.8	87.7	64.2	45.0	60.4	74.8
Graph-TER[19]	81.9	81.7	68.1	83.7	74.6	88.1	68.9	90.6	86.6	80.0	95.6	56.3	90.0	80.8	55.2	70.7	79.1
MID-FC [63]	84.2	80.4	82.5	89.0	80.0	89.9	80.7	90.5	85.7	77.8	95.9	73.4	94.8	81.1	56.7	81.8	82.4
Ours-DGCNN	85.0	83.5	82.4	86.9	77.9	90.4	75.6	91.0	86.9	81.0	95.1	68.9	94.7	81.4	62.5	73.1	82.7
Supervised fine-tuning
SSC-RSCNN	85.2	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$
Self-Sup [54]	85.3	84.1	84.0	85.8	77.0	90.9	80.0	91.5	87.0	83.2	95.8	71.6	94.0	82.6	60.0	77.9	81.8
PointDis [38]	85.3	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$
OcCo [61]	85.5	84.4	77.5	83.4	77.9	91.0	75.2	91.6	88.2	83.5	96.1	65.5	94.4	79.6	58.0	76.2	82.8
MID-FC [63]	85.5	83.6	82.9	91.3	81.6	90.4	81.5	91.8	87.1	79.3	95.7	68.7	95.2	83.6	68.3	82.7	83.2
Ours-DGCNN	85.6	84.7	80.9	86.2	82.3	91.3	76.2	91.2	89.6	82.1	96.7	68.9	93.7	83.8	65.3	79.5	83.1

Table 2: We present the part segmentation evaluation results on ShapeNet Part, where mIoU refers to mean Intersection-over-Union measure.

Dataset. We utilize ModelNet40 [70] and ModelNet10 [70] for shape classification task. We follow the same data split protocols of PointNet-based methods [65] for these two datasets. For ModelNet40, the train set has 9, 840 models and the test set has 2, 468 models, and the dataset consists of 40 categories. For ModelNet10, 3, 991 models are for fine-tuning and 908 models for testing. It contains 10 categories. We follow the experimental configuration [48]: (1) we uniformly sample 1,024 points from the mesh faces for each model; (2) the point cloud is re-scaled to fit the unit sphere; and (3) the (x,y,z) coordinates and the normal of the sampled points are used in the experiment. During the training process, randomly scaling and perturbing the objects are adopted as the data augmentation strategy in our experiment.

Implementation Details. The global max pooling and average pooling layer are deployed in our classification head to acquire a 1,296-dimensional global feature vector. Three layers of linear projection with dropout ratio of 50% are used to get the final classification score.

For unsupervised transfer learning, we fix the parameters of encoder and only train the classification head. During training, a random translation in [-0.2, 0.2], a random anisotropic scaling in [0.67, 1.5] and a random input dropout were applied to augment the input data. During testing, no data augmentation or voting methods were used. The batch sizes were 32, 200 training epochs were used and the initial learning rates were 10^-3, with a cosine annealing schedule to adjust the learning rate at every epoch.

For supervised fine-tuning, we fully fine-tuned the pre-trained model on ModelNet40 and ModelNet10 datasets.

Results. The classification results are presented in Table 1. When utilizing DGCNN as the encoder, our method outperforms most of previous unsupervised counterparts and the results on ModelNet10 and ModelNet40 are comparable to certain fully-supervised models. Since the pre-training of the encoder is based on different datasets, the results demonstrate that our framework has a strong generalizability, which is regarded as a significant application of self-supervised representation learning. Notably, most of the classes are unseen by the model during ShapeNet pre-training. Thus, the superior performance further demonstrates that our model has a good ability to generalize to novel classes.

To the best of our knowledge, the most important application of self-supervised learning methods is to make full use of a large number of unlabeled data and boost the performance of supervised learning methods. Thus, we pre-train the encoder with our framework on ShapeNet and fine-tune the weights on downstream classification tasks and compare the results with other supervised fine-tuning methods. As demonstrated in Table 1, the proposed method outperforms all other supervised fine-tuning methods on ModelNet10 and ModelNet40 datasets, which justifies the superiority of our UAE.

4.3 Part Segmentation

Dataset. We use the large-scale 3D dataset ShapeNet Parts [36] as the experiment bed. ShapeNet Parts contains 16,880 models (14,006 models are used for training, and 2874 models are used for testing), each of which is labeled with two to six parts, and the entire dataset has 50 different part labels. We sample 2,048 points from each model as input, with a few pointsets having six labeled parts. We directly adopt the same train–test split strategy similar to DGCNN [65] in our experiment.

Implementation Details. The global max pooling and average pooling layer are also deployed to acquire a 1,296-dimensional global feature vector. And we design a segmentation head to propagate these features to each point hierarchically. Like shape classification task, three layers of linear projection with dropout ratio of 50% are also used to get the final classification score of each point.

We optimize our networks via SGD with batch size 32. The learning rate of unsupervised transfer learning setting starts from 10^-2 and decreases to 10^-4, and the learning rate of supervised fine-tuning setting decays from 10^-1 to 10^-3.

For unsupervised transfer learning, we fix the parameters of encoder and only train the segmentation head. For supervised fine-tuning, we fully fine-tuned the pre-trained model on ShapeNet Part dataset.

Results. As shown in Table 2, the proposed UAE outperforms all other unsupervised point cloud learning models by a large margin, which indicates that our pre-trained model captures more effective semantic information that can transfer well to downstream segmentation tasks. Some results are visualized in Figure 5. We also conduct the object part segmentation experiments under supervised fine-tuning strategy and make comparisons with previous excellent models, such as PointDis, MID-FC and OcCo in Table 2. As shown, our UAE model (Ours-DGCNN) fine-tuned on 100% labeled samples achieves state-of-the-art performance. Compared to the supervised learning framework DGCNN, our pre-trained model (Ours-DGCNN) achieves remarkable performance improvements, which demonstrates the advantage of our pre-training strategy. We can conclude that pre-training with our framework on unlabeled data significantly boosts the performance and can be regarded as a strong initializer for supervised models, which is a critical application of self-supervised learning.

4.4 Point Cloud Upsampling

Raw point clouds data acquired from depth cameras or LiDAR scanners are often sparse, noisy and non-uniform, which hinders shape analysis and 3D reconstruction. Point cloud upsampling is thus significant to the subsequent CAD applications such as analysis, rendering or general processing. We evaluate our UAE on the point cloud upsampling task over the PU-GAN’s dataset.

Dataset. For point cloud upsampling task, we use PU-GAN’s dataset [33] as the experiment bed. The PU-GAN’s dataset has 147 CAD models that were collected from the released datasets of PU-Net [77] and MPU [66], as well as from the Visionair repository [60], covering a rich variety of objects, ranging from simple and smooth models (e.g., Icosahedron) to complex and high-detailed objects (e.g., Statue). Following previous work [33], we randomly select 120 models for training and use the rest for testing.

Implementation Details. We demonstrate a pre-training strategy to evaluate whether the unsupervised pre-training with our model helps improve the performance. We first pre-train our UAE on ShapeNet dataset in an unsupervised fashion, and then employ the learned representation from the pre-trained encoder as an initialization. Eventually, we build our upsampling version of UAE where we only replace the feature extractor of PU-GAN with our pre-trained encoder. We evaluate the effectiveness of our model by comparing the results of PU-GAN with our initialization and other upsampling models in a supervised fashion.

For training, we randomly sample 512 points from each point cloud in the PU-GAN’s dataset and upsample them to 2,048 points. For testing, we randomly sample 2048 points and upsample them to 8196 points. The quality of the upsampled point cloud is measured by the CD, HD and P2F between the original and upsampled point clouds.

Results. We qualitatively and quantitatively compared the unsupervised pre-training with our method with the randomly initialized PU-GAN and other state-of-the-art point set upsampling methods: EAR [28], PU-Net [77], MPU [66] and PU-GCN [50]. Table 3 shows the quantitative comparison results. Our method (Ours (transfer)) achieves the lowest value consistently for most evaluation metrics in an unsupervised manner. Particularly, the supervised fine-tuning with our method (Ours (fine-tune)) outperforms all previous point set upsampling models. The gain is significant in 4 $\times$ upsamling ratio which means that our pre-trained model can capture more effective semantic information. Figure 6 shows the point set upsampling results on PU-GAN’s dataset.

Model	Supervised	P2F	CD	HD
Model	Supervised	(10^-3)	(10^-3)	(10^-3)
EAR [76]	yes	5.82	0.52	7.37
PU-Net [77]	yes	6.84	0.72	8.94
MPU [66]	yes	3.96	0.49	6.11
PU-GAN [33]	yes	2.33	0.28	4.61
PU-GCN [50]	yes	2.72	0.25	1.88
Ours (transfer)	no	2.25	0.24	4.35
Ours (finetune)	yes	2.16	0.22	4.28

Table 3: We present quantitative comparisons of our method with the upsampling SOTA approaches.

4.5 Ablation Study

Sampling strategy	ModelNet40	ShapeNet
Farthest point sampling	92.64	84.83
Local sampling	92.46	85.04
Random sampling	93.27	85.62

Table 4: Supervised finetuning results (% and mIoU) on ModelNet40 and ShapeNet Part with different sampling strategies (the subsampling ratio is 12.5%), where we can see that the random sampling strategy works the best.

Subsampling ratio ( $r$ )	ModelNet40	ShapeNet
5.0%	91.58	83.94
12.5%	93.27	85.62
25.0%	92.69	85.22
50.0%	92.56	84.94
100.0%	92.19	84.76

Table 5: Supervised finetuning results (% and mIoU) on ModelNet40 and ShapeNet Part datasets with different subsampling rates.

Loss function	ModelNet40	ShapeNet
CD	92.13	84.53
EMD	92.26	84.71
EMD + RL	93.18	85.54
CD+RL (Ours)	93.27	85.62

Table 6: Ablation study on various loss functions. CD: replace our joint loss function with chamfer distance loss in our architecture. EMD: replace the joint loss function with Earth’s moving distance loss in our architecture. EMD + RL: replace our joint loss function with Earth’s moving distance and the repulsion losses.

The following studies, which are used to investigate the determining factors of our framework, are conducted on both ModelNet40 and ShapeNetPart datasets.

Impact of subsampling strategy. To examine the effectiveness of various subsampling strategies, we conduct a detailed ablation on shape classification and part segmentation tasks. Table 4 presents the results with three types of subsampling methods. We observe that the random sampling achieves the best performance, improving by 0.63%/0.79% on average over Farthest Point Sampling (FPS), and 0.81%/0.58% over local sampling (LS). The reason is that FPS and LS retain too many geometric details (see Figure 7), resulting in the created task being easily solved by extrapolation from neighboring points. In contrast, the task created by random sampling with a low sampling ratio is harder than that of FPS and LS, which provides a more challenging goal for our model. Moreover, random sampling uniformly selects $rN$ points from the original $N$ points. Its computational complexity is $\mathcal{O}(1)$ , which is agnostic to the total number of input points. Compared with FPS and LS, random sampling has the highest computational efficiency, regardless of the scale of input point clouds [27]. Based on these characteristics in both computational time and memory, we can conclude that our UAE is suitable for training very large models.

Effect of subsampling ratio. We conduct an ablation study to analyze the setting of subsampling ratio $r$ . The results are shown in Table 5. The best performance is achieved when $r$ is set to 12.5%. When the ratio becomes smaller, the lack of key points will result in a performance decline. On the contrary, when the ratio increases, a lot of noises at the local boundary lead to the deformation of feature extraction ability, and then reduces the accuracy of the model. Meanwhile, the task with high subsampling ratio (25% or 50%) can be easily solved by our UAE, which is not conducive to capture high-level semantic information.

Impact of loss function. We also investigate the options of loss functions. The results are shown in Table 6. From this table, we can see that compared with the CD loss, EMD loss achieves better results (+0.13%/+0.18%) because it can better encourage the output points to be located close to the underlying object surfaces [77]. However, in Table 6, the results of the two tasks are close, the time complexity of EMD loss is $\mathcal{O}(N^{2})$ [33], which means that it needs more pretraining time, especially when $N$ is very large. Furthermore, we can see that the performance of the model is significantly improved after adding repulsion loss, especially under the combination of “CD + RL” where the results from the output points are marked semantically and distinguished in spatial position to reduce noise.

5 Conclusion and Future work

In this work, we presented UAE, a new framework for self-supervised point cloud learning. UAE learns high-level semantic information by upsampling a sparse point cloud uniformly within a simple and effective framework. We showed state-of-the-art results of our method on various downstream tasks including shape classification, object segmentation and point cloud upsampling. In the future, it would be interesting to further exploit other applications, and take one closer step to the harsh real-world setting, i.e., limited annotations. In addition, we will continue to study the point cloud pre-training methods on large-scale datasets, and focus on finding an efficient way to take advantage of the large-scale point data.

6 Acknowledgements

This work was partially supported by the Zhejiang Provincial Natural Science Foundation of China (LGF21F20012), the National Natural Science Foundation of China (No.61602139), and the Graduate Scientific Research Foundation of Hangzhou Dianzi University (CXJJ2021082, CXJJ2021083).

References

[1] I. Achituve, H. Maron, and G. Chechik. Self-supervised learning for domain adaptation on point clouds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 123–133, 2021.
[2] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. International Conference on Machine Learning, 2018.
[3] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. In International Conference on Machine Learning, pages 40–49. PMLR, 2018.
[4] M. Atzmon, H. Maron, and Y. Lipman. Point convolutional neural networks by extension operators. ACM Transactions on Graphics, pages 71.1–71.12, 2018.
[5] R. Bénière, G. Subsol, G. Gesquière, F. L. Breton, and W. Puech. A comprehensive process of reverse engineering from 3d meshes to CAD models. Comput. Aided Des., 45(11):1382–1393, 2013.
[6] P. Bojanowski and A. Joulin. Unsupervised learning by predicting noise. In International Conference on Machine Learning, pages 517–526. PMLR, 2017.
[7] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In Europeon Conference on Computer Vision, pages 132–149, 2018.
[8] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[9] C. Chen, G. Li, R. Xu, T. Chen, M. Wang, and L. Lin. Clusternet: Deep hierarchical cluster network with rigorously rotation-invariant representation for point cloud analysis. Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[10] H. Chen, S. Luo, X. Gao, and W. Hu. Unsupervised learning of geometric sampling invariant representations for 3d point clouds. In International Conference on Computer Vision, pages 893–903, 2021.
[11] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607, 2020.
[12] Y. Chen, J. Liu, B. Ni, H. Wang, J. Yang, N. Liu, T. Li, and Q. Tian. Shape self-correction for unsupervised point cloud understanding. In International Conference on Computer Vision, pages 8382–8391, 2021.
[13] Z. Cheng, H. Wan, X. Shen, and Z. Wu. Patchformer: A versatile 3d transformer based on patch attention. CoRR, abs/2111.00207, 2021.
[14] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision, pages 1422–1430, 2015.
[15] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. International Conference on Learning Representations, 2016.
[16] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence, pages 1734–1747, 2015.
[17] B. Du, X. Gao, W. Hu, and X. Li. Self-contrastive learning with hard negative sampling for self-supervised point cloud learning. In ACM Multi media, pages 3133–3142, 2021.
[18] M. Gadelha, R. Wang, and S. Maji. Multiresolution tree networks for 3d point cloud processing. In Europeon Conference on Computer Vision, pages 103–118, 2018.
[19] X. Gao, W. Hu, and G.-J. Qi. Graphter: Unsupervised learning of graph transformation equivariant representations via auto-encoding node-wise transformations. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 7163–7172, 2020.
[20] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. Á. Pires, Z. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko. Bootstrap your own latent - A new approach to self-supervised learning. In Advances in Neural Information Processing Systems, 2020.
[21] J. Gu and S. Yeung. Staying in shape: Learning invariant shape representations using contrastive learning. Conference on Uncertainty in Artificial Intelligence, 2021.
[22] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–199, 2021.
[23] Z. Han, X. Wang, Y. Liu, and M. Zwicker. Multi-angle point cloud-vae: Unsupervised feature learning for 3d point clouds from multiple angles by joint self-reconstruction and half-to-half prediction. In International Conference on Computer Vision, pages 10441–10450, 2019.
[24] K. Hassani and M. Haley. Unsupervised multi-task feature learning on point clouds. In International Conference on Computer Vision, pages 8160–8171, 2019.
[25] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners, 2021.
[26] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 9729–9738, 2020.
[27] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham. Randla-net: Efficient semantic segmentation of large-scale point clouds. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 11105–11114. Computer Vision Foundation / IEEE, 2020.
[28] H. Huang, S. Wu, M. Gong, D. Cohen-Or, U. Ascher, and H. Zhang. Edge-aware point set resampling. ACM transactions on graphics (TOG), 32(1):1–12, 2013.
[29] J. Jiang, X. Lu, W. Ouyang, and M. Wang. Unsupervised representation learning for 3d point cloud data. arXiv preprint arXiv:2110.06632, 2021.
[30] L. Jiang, S. Shi, Z. Tian, X. Lai, S. Liu, C.-W. Fu, and J. Jia. Guided point contrastive learning for semi-supervised point cloud semantic segmentation. In International Conference on Computer Vision, pages 6423–6432, 2021.
[31] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017.
[32] L. Landrieu and M. Boussaha. Point cloud oversegmentation with graph-structured deep metric learning. Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[33] R. Li, X. Li, C. Fu, D. Cohen-Or, and P. Heng. PU-GAN: A point cloud upsampling adversarial network. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 7202–7211, 2019.
[34] R. Li, X. Li, P.-A. Heng, and C.-W. Fu. Point cloud upsampling via disentangled refinement. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 344–353, June 2021.
[35] Y. Li, R. Bu, M. Sun, and B. Chen. Pointcnn: Convolution on x-transformed points. Advances in Neural Information Processing Systems, 2018.
[36] Y. Li, V. G. Kim, D. Ceylan, I. C. Shen, M. Yan, S. Hao, C. Lu, Q. Huang, A. Sheffer, and L. Guibas. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6cd):210.1–210.12, 2016.
[37] Y. Lipman, D. Cohen-Or, D. Levin, and H. Tal-Ezer. Parameterization-free projection for geometry reconstruction. In ACM SIGGRAPH, page 22–es, 2007.
[38] F. Liu, G. Lin, and C. Foo. Point discriminative learning for unsupervised representation learning on 3d point clouds. CoRR, abs/2108.02104, 2021.
[39] X. Liu, Z. Han, Y. S. Liu, and M. Zwicker. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. AAAI, 2018.
[40] Y. Liu, B. Fan, S. Xiang, and C. Pan. Relation-shape convolutional neural network for point cloud analysis. Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[41] Z. Liu, H. Tang, Y. Lin, and S. Han. Point-voxel cnn for efficient 3d deep learning. Advances in Neural Information Processing Systems, 2019.
[42] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Europeon Conference on Computer Vision, pages 181–196, 2018.
[43] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In International Conference on Machine Learning, pages 2391–2400, 2017.
[44] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Europeon Conference on Computer Vision, pages 69–84, 2016.
[45] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
[46] A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[47] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3d object detection from RGB-D data. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 918–927. Computer Vision Foundation / IEEE Computer Society, 2018.
[48] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Conference on Computer Vision and Pattern Recognition (CVPR), pages 652–660, 2017.
[49] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 2017.
[50] G. Qian, A. Abualshour, G. Li, A. K. Thabet, and B. Ghanem. PU-GCN: point cloud upsampling using graph convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 11683–11692, 2021.
[51] Y. Rao, J. Lu, and J. Zhou. Global-local bidirectional reasoning for unsupervised representation learning of 3d point clouds. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5385, 2020.
[52] R. B. Rusu, Z. C. Marton, N. Blodow, M. E. Dolha, and M. Beetz. Towards 3d point cloud based object maps for household environments. Robotics Auton. Syst., 56(11):927–941, 2008.
[53] R. Salakhutdinov and G. Hinton. Deep boltzmann machines. In AI-STATS, pages 448–455, 2009.
[54] J. Sauder and B. Sievers. Self-supervised deep learning on point clouds by reconstructing space. Advances in Neural Information Processing Systems, 32:12962–12972, 2019.
[55] Q. Shi, S. Anwar, and N. Barnes. Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion. Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[56] C. Sun, Z. Zheng, X. Wang, M. Xu, and Y. Yang. Point cloud pre-training by mixing and disentangling. arXiv preprint arXiv:2109.00452, 2021.
[57] G. Te, W. Hu, Z. Guo, and A. Zheng. Rgcnn: Regularized graph cnn for point cloud segmentation. in ACM Multimedia, 2018.
[58] Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. In Europeon Conference on Computer Vision, pages 776–794, 2020.
[59] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, pages 1096–1103, 2008.
[60] Visionair. http://www.infra-visionair.eu/. Accessed: 2019-07-24., 2019.
[61] H. Wang, Q. Liu, X. Yue, J. Lasenby, and M. J. Kusner. Unsupervised point cloud pre-training via occlusion completion. In International Conference on Computer Vision, pages 9782–9792, 2021.
[62] L. Wang, Y. Huang, Y. Hou, S. Zhang, and J. Shan. Graph attention convolution for point cloud semantic segmentation. Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[63] P.-S. Wang, Y.-Q. Yang, Q.-F. Zou, Z. Wu, Y. Liu, and X. Tong. Unsupervised 3d learning for shape analysis via multiresolution instance discrimination. ACM Trans. Graphic, 2020.
[64] W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. International Conference on Computer Vision, 2021.
[65] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics, 38(5), 2018.
[66] Y. Wang, S. Wu, H. Huang, D. Cohen-Or, and O. Sorkine-Hornung. Patch-based progressive 3d point set upsampling. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5958–5967, 2019.
[67] Z. Wang and F. Lu. Voxsegnet: Volumetric cnns for semantic part segmentation of 3d shapes. IEEE Transactions on Visualization and Computer Graphics, pages 1–1, 2019.
[68] W. Wu, Z. Qi, and L. Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. Conference on Computer Vision and Pattern Recognition (CVPR), pages 9621–9630, 2019.
[69] W. Wu, Z. Qi, and L. Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[70] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[71] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3733–3742, 2018.
[72] S. Xie, J. Gu, D. Guo, C. R. Qi, L. Guibas, and O. Litany. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In Europeon Conference on Computer Vision, pages 574–591. Springer, 2020.
[73] Q. Xu, X. Sun, C.-Y. Wu, P. Wang, and U. Neumann. Grid-gcn for fast and scalable point cloud learning. Conference on Computer Vision and Pattern Recognition (CVPR), pages 5661–5670, 2020.
[74] Y. Xu, T. Fan, M. Xu, Z. Long, and Q. Yu. Spidercnn: Deep learning on point sets with parameterized convolutional filters. Europeon Conference on Computer Vision, 2018.
[75] X. Yan, C. Zheng, Z. Li, S. Wang, and S. Cui. Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling. Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[76] Y. Yang, C. Feng, Y. Shen, and D. Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 206–215, 2018.
[77] L. Yu, X. Li, C. Fu, D. Cohen-Or, and P. Heng. Pu-net: Point cloud upsampling network. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2790–2799, 2018.
[78] C. Zhang, H. Chen, H. Wan, P. Yang, and Z. Wu. Graph-pbn: Graph-based parallel branch network for efficient point cloud learning. Graphical Models, page 101120, 2021.
[79] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning, pages 7354–7363, 2019.
[80] Z. Zhang, R. Girdhar, A. Joulin, and I. Misra. Self-supervised pretraining of 3d features on any point-cloud. arXiv preprint arXiv:2101.02691, 2021.
[81] H. Zhao, L. Jiang, C. W. Fu, and J. Jia. Pointweb: Enhancing local neighborhood features for point cloud processing. Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[82] H. Zhao, L. Jiang, C.-W. Fu, and J. Jia. Pointweb: Enhancing local neighborhood features for point cloud processing. Conference on Computer Vision and Pattern Recognition (CVPR), pages 5565–5573, 2019.
[83] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun. Point transformer. International Conference on Computer Vision, 2021.
[84] Y. Zhou and O. Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. International Conference on Computer Vision, pages 4490–4499, 2018.