
PointCG: Self-supervised Point Cloud Learning via Joint Completion and Generation

Yun Liu, Peng Li, Xuefeng Yan, Liangliang Nan, Bing Wang, Honghua Chen, Lina Gong, Wei Zhao, and Mingqiang Wei, Senior Member, IEEE. Yun Liu, Peng Li, Honghua Chen, Lina Gong, Wei Zhao and Mingqiang Wei are with the School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China, and also with the Shenzhen Institute of Research, Nanjing University of Aeronautics and Astronautics, Shenzhen, China (e-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]). Xuefeng Yan is with the School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China, and also with the Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, China (e-mail: [email protected]). Liangliang Nan is with the Urban Data Science section, Delft University of Technology, Delft, Netherlands (e-mail: [email protected]). Bing Wang is with the Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, Hong Kong, China (e-mail: [email protected]).
Abstract

The core of self-supervised point cloud learning lies in setting up appropriate pretext tasks, to construct a pre-training framework that enables the encoder to perceive 3D objects effectively. In this paper, we integrate two prevalent methods, masked point modeling (MPM) and 3D-to-2D generation, as pretext tasks within a pre-training framework. We leverage the spatial awareness and precise supervision offered by these two methods to address their respective limitations: ambiguous supervision signals and insensitivity to geometric information. Specifically, the proposed framework, abbreviated as PointCG, consists of a Hidden Point Completion (HPC) module and an Arbitrary-view Image Generation (AIG) module. We first capture visible points from arbitrary views as inputs by removing hidden points. Then, HPC extracts representations of the inputs with an encoder and completes the entire shape with a decoder, while AIG is used to generate rendered images based on the visible points’ representations. Extensive experiments demonstrate the superiority of the proposed method over the baselines in various downstream tasks. Our code will be made available upon acceptance.

Index Terms:
PointCG, self-supervised learning, hidden point completion, arbitrary-view image generation, point clouds

1 Introduction

Self-supervised representation learning (SSRL) aims to fully exploit the statistical and structural knowledge inherent in unlabeled datasets, enabling the encoder of the pre-training model to extract informative and discriminative representations. The pre-trained encoder can be subsequently applied to various downstream tasks such as classification, segmentation, and object detection [1, 2]. The core of SSRL lies in the design of appropriate pretext tasks aimed at aiding the encoder in achieving a full perception and understanding of the inputs.

Figure 1: Qualitative and quantitative comparison of models using different pretext tasks. Chamfer Distance (CD) and Structural Similarity Index (SSIM) are employed as the quantitative metrics. For the masked point modeling (MPM) task, we utilize the method proposed in Point-MAE [3] with the inputs of visible points from arbitrary views (see Sec. 3.1). For the 3D-to-2D generation task, we define the pretext task as generating images from arbitrary views. The result of the model using only MPM exhibits group clustering at the edges, while our method yields sharpened and clear edges that closely align with the ground truth. The model relying solely on 3D-to-2D generation fails to capture three-dimensional structural information, while our method can effectively preserve the geometric structure. Directly combining both tasks generates point clouds and images superior to using only MPM or 3D-to-2D generation (Direct Combination) but with lower Linear-SVM accuracy.

Based on the tasks employed, existing self-supervised pre-training methods can be broadly classified into two paradigms: contrastive learning and generative learning, both of which have attained great success in processing 2D images [4, 5, 6] and 3D point clouds [7, 3, 8, 9, 10, 11]. Compared to contrastive learning, generative learning is considered a more data-efficient pre-training method, capable of capturing the patterns of the inputs with relatively limited data volume [12]. Therefore, it is highly favored in the context of data scarcity within the field of 3D vision, where masked point modeling [7, 3, 8, 13, 10] and 3D-to-2D generation [11, 14] stand out as two representative generative learning methods. Among them, masked point modeling drives the model to predict arbitrary missing parts based on the remaining points. Accomplishing this task requires a thorough understanding of the spatial properties and global-local context of point clouds. 3D-to-2D generation employs a cross-modal pretext task which translates a 3D object point cloud into diverse forms of 2D rendered images (e.g., silhouette, depth, contour). Pre-training with pixel-wise precise supervision drives the backbone to perceive the fine-grained edge details of 3D objects.

However, both of the above methods have their own limitations. As revealed in [15, 16, 17], due to the irregularity of point clouds, commonly used point set similarity metrics (e.g., Chamfer Distance and Earth Mover’s Distance) in masked point modeling cannot provide explicit point-to-point supervision between ground truth and generated point clouds. The lack of precise correspondence results in limited feature representation capability of the pre-trained backbone network. Conversely, 3D-to-2D generation [11, 14] alleviates the issue of insufficient supervision signals by utilizing regular 2D images as the generation objective, offering pixel-wise precise supervision. However, relying solely on images from limited views as ground truth may overlook the structural information from occluded point sets, diminishing the backbone’s perception of the spatial properties of point clouds. As shown in Fig. 1, masked point modeling exhibits subpar performance in reconstructing some challenging areas (e.g., edges) due to the lack of point-to-point supervision. Besides, 3D-to-2D generation yields images lacking three-dimensional structural information, attributed to the lack of explicit geometric guidance. These observations collectively indicate the models’ inadequate perception of the inputs, consequently reducing their performance on downstream tasks.

Based on the aforementioned analysis, an intuitive method is to combine these two pretext tasks to retain their individual merits while compensating for their respective limitations. However, as shown in Fig. 1, while the model directly combining both tasks outperforms those relying solely on MPM or 3D-to-2D generation in generating high-quality point clouds or images, its Linear-SVM accuracy is lower (88.41% vs. 90.72% and 91.13%). We argue that the encoder’s involvement in both tasks can lead to confusion when generating content for two modalities concurrently. Furthermore, to accomplish both tasks, the model shifts its training focus toward the decoder, which is typically discarded after pre-training. This phenomenon diminishes the feature extraction capability of the encoder, ultimately reducing the Linear-SVM accuracy.

Figure 2: Visualization of the unmasked points (a), the masked points (b), the completed point cloud composed of green unmasked points and gray masked points (c), and the completed point cloud in blue with overlapping points highlighted in red (d).

To address these issues, we propose PointCG, a framework that effectively integrates masked point modeling and 3D-to-2D generation tasks. This framework incorporates a Hidden Point Completion (HPC) module and an Arbitrary-view Image Generation (AIG) module. Existing MAE-based MPM methods often employ a random masking strategy based on Farthest Point Sampling (FPS) and K-Nearest Neighbor (KNN) techniques. However, the inputs of unmasked patches (Fig. 2 (a)) preserve the overall shape of an object and exhibit substantial overlap with the target points (highlighted in red in Fig. 2 (d)). The leakage of overall structure and point location information enables the model to reconstruct the object without a holistic comprehension of the entire structure, which limits the learning capacity of the encoder during pre-training. To overcome this limitation, we select the visible points from arbitrary views by removing hidden points as input and introduce the HPC module to complete the point clouds. For the 3D-to-2D generation task, we employ the arbitrary-view image generation as the pretext task, which generates the image from an arbitrary view based on the representations of visible points extracted by the encoder. Furthermore, the cross-modal feature alignment is introduced to align the feature spaces of point clouds and images, which enables simultaneous content generation across both modalities and refocuses the training on the encoder. Specifically, we extract features from both the input point clouds and their corresponding rendered 2D images, encouraging feature proximity for the same instance while maintaining feature separation for different instances.

Through the effective integration of HPC and AIG, the pre-trained encoder achieves a comprehensive understanding of 3D objects and can extract high-quality 3D representations. We evaluate our model and the proposed modules with a variety of downstream tasks and ablation studies. We further demonstrate that informative representations can be effectively learned from the restricted points, and such representations facilitate effortless masked point modeling and arbitrary-view image generation.

2 Related Work

2.1 Self-supervised Representation Learning

Self-supervised representation learning aims to derive robust and general representations from unlabeled datasets, which can be broadly classified into two categories based on the types of pretext tasks: contrastive learning and generative learning.

Contrastive learning-based methods (e.g., BYOL [18], SimSiam [6], DINO [19], STRL [9], CrossPoint [20]) define the augmented views of a sample as positive samples, while considering other instances as negative samples, thereby constructing discriminative tasks. Generative learning-based methods (e.g., GPT [21], Point-BERT [7], Point-MAE [3], OcCo [10], MaskFeat3D [22]) are based on the intuition that effective feature abstractions contain sufficient information to reconstruct the original geometric structures [11]. In the point cloud processing community, where 3D assets are relatively scarce, generative learning has garnered widespread attention due to its data efficiency [4, 3, 13]. Among them, MAE stands out as one of the representative paradigms. It involves masking a substantial portion of the input data, followed by the use of an encoder to extract informative representations and a decoder to reconstruct explicit features (e.g., pixels or points) or implicit features (e.g., discrete tokens). Taking Point-BERT [7], MaskFeat3D [22], and IAE [23] as examples, these methods utilize the visible groups as input after masking and reconstruct targets such as the positions of the masked points, surface normals and surface variations, or the implicit features of the masked points. However, after random masking [3] or partial occlusion [13], the visible groups often retain the overall structure of the object (Fig. 7), and there are substantial overlap regions between input and target patches (Fig. 2). The leakage of overall structure and point location information reduces the difficulty of reconstructing masked patches, thus limiting the learning and inference capabilities of the encoder.

To avoid the leakage of the object’s overall shape and minimize overlap, we simulate scanners to capture visible points from arbitrary views as input. Our approach is conceptually aligned with OcCo [10], which employs the Z-Buffer algorithm [24] to select visible points from multiple views and subsequently completes the original point clouds with an encoder-decoder architecture. The Z-Buffer algorithm, originally designed for rendering, relies on two assumptions: the points satisfy sampling criteria (e.g., the Nyquist condition), and the points are associated with normals (or the normals can be estimated) [25]. In contrast, we seek a visibility computation with rigorous theoretical support that requires neither normal estimation, point rendering, nor surface reconstruction. Therefore, we employ the Hidden Point Removal (HPR) operator to compute visibility in a more robust manner.

2.2 Cross-modal Learning

Recently, cross-modal learning has been a popular research topic, aiming at extracting informative representations from multiple modalities such as images, audio, and point clouds. It has the potential to enhance the performance of various tasks, including visual recognition, speech recognition, and point cloud analysis.

In point cloud analysis, a variety of methods have been proposed for cross-modal learning, such as CrossPoint [20], PointMCD [26], TAP [14], and PointVST [11]. CrossPoint [20] establishes cross-modal contrastive learning between images and point clouds, demonstrating that the correspondence between images and points can enhance 3D object understanding. PointMCD [26] obtains a powerful point encoder by aligning the multi-view visual and geometric descriptors generated by a pretrained image encoder and a learnable point encoder. Both CrossPoint [20] and PointMCD [26] are based on the contrastive paradigm and rely heavily on extensive 3D-2D paired data. Generative methods, such as TAP [14] and PointVST [11], generate images from specific views based on the input point clouds. These methods use regular 2D images as generation objectives to provide precise supervision.

In this paper, we follow the generative learning paradigm and propose a unified pre-training framework with two complementary pretext tasks: hidden point completion and arbitrary-view image generation. The spatial awareness provided by 3D completion addresses the geometric insensitivity inherent in image supervision, as shown in the second line of column one in Fig. 1. Additionally, we demonstrate the mutual enhancement between the two pretext tasks through various experiments.

3 Methodology

Figure 3: Overview of PointCG. PointCG integrates two prevalent methods, masked point modeling (MPM) and 3D-to-2D generation, as pretext tasks within a pre-training framework. In detail, we first capture visible points with the HPR [25] operator. Then we utilize an encoder-decoder architecture to extract features from these visible points and complete the original point clouds through the hidden point completion (HPC) module. The arbitrary-view image generation (AIG) module generates images based on the aligned representations of visible points. Note that the input images for feature alignment are randomly selected and do not need to match the target images of image generation.

As illustrated in Fig. 3, PointCG mainly consists of a hidden point completion (HPC) module and an arbitrary-view image generation (AIG) module. Specifically, we begin by selecting the visible points from arbitrary views as the inputs (Sec. 3.1), and then introduce an asymmetric Transformer-based encoder-decoder architecture for extracting representations and completing hidden points (Sec. 3.2). Finally, we generate arbitrary-view images (Sec. 3.3) based on the aligned representations extracted by the encoder. In the following, we will delve into the details of these modules.

3.1 Data Organization

Given a complete point cloud $P=\{p_{i}\in\mathbb{R}^{3}\mid 1\leq i\leq N\}$, we randomly select the camera position $C=[azimuth, elevation, distance]$, where $distance$ is fixed at 1.0, and $azimuth$ and $elevation$ are randomly chosen within the range of $[0,2\pi]$. The HPR operator [25] is employed to determine whether $p_{i}$ is visible from $C$. It mainly consists of two steps: inversion transformation and convex hull construction.

Inversion transformation. We employ spherical flipping [27] to reflect each point $p_{i}\in P$ onto a spherical surface along the ray from $C$ through $p_{i}$, obtaining $\hat{p}_{i}\in\hat{P}$, as illustrated in Fig. 4 (a).

Figure 4: Visualization of the original points $P$ in blue, the points after spherical flipping $\hat{P}$ in light blue, and the points visible from $C$ in magenta.

Convex hull construction. The points visible from $C$, after inversion onto the spherical surface, are situated on the convex hull of $\hat{P}\cup\{C\}$. Therefore, we need to compute the collection of triangular faces that make up the convex hull. Then we extract all vertices of the convex hull (magenta points in $\hat{P}$ in Fig. 4 (b)) and project them back onto the original point cloud to obtain the visible points $P_{v}$ (magenta points in $P$ in Fig. 4 (b)). The remaining points of the original point cloud are hidden from $C$, denoted as $P_{h}$.
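For concreteness, the following is a minimal NumPy/SciPy sketch of the HPR operator described above (spherical flipping followed by a convex hull test). The radius factor, the random point cloud, and the exact camera placement on the unit sphere are illustrative assumptions rather than the settings used in our pre-training.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hidden_point_removal(points, camera, radius_factor=100.0):
    """Indices of points visible from `camera`, following the HPR operator [25]."""
    q = points - camera[None, :]                            # move the camera to the origin
    norms = np.linalg.norm(q, axis=1, keepdims=True)
    radius = radius_factor * norms.max()                    # flipping-sphere radius (assumed factor)
    q_flipped = q + 2.0 * (radius - norms) * (q / norms)    # spherical flipping
    # Points whose flipped images lie on the convex hull of (flipped set + camera) are visible.
    hull = ConvexHull(np.vstack([q_flipped, np.zeros((1, 3))]))
    visible = hull.vertices[hull.vertices < len(points)]    # drop the camera vertex
    return np.sort(visible)

# Toy usage: a random cloud and a camera at distance 1.0, as in Sec. 3.1.
rng = np.random.default_rng(0)
points = rng.uniform(-0.5, 0.5, size=(2048, 3))
azimuth, elevation = rng.uniform(0.0, 2.0 * np.pi, size=2)
camera = np.array([np.cos(elevation) * np.cos(azimuth),
                   np.cos(elevation) * np.sin(azimuth),
                   np.sin(elevation)])
visible_idx = hidden_point_removal(points, camera)
```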

3.2 Hidden Point Completion

For each input, we employ FPS and KNN to divide the visible points $P_{v}$ into patches with $v$ centers. Simultaneously, we extract $h$ central points from the hidden points $P_{h}$ and retrieve the $k$ nearest neighbor points from the complete point cloud as the target patches $P_{GT}$. Then, the visible patches are projected into tokens $T^{v}\in\mathbb{R}^{v\times d}$ with a lightweight PointNet [28], where $d$ is the dimension of the features. Subsequently, a learnable Multi-Layer Perceptron (MLP) is adopted to embed the visible and hidden centers into positional tokens denoted as $T_{L}^{v}$ and $T_{L}^{h}$, respectively. Finally, we extract representations $T_{E}$ with an encoder and capture the tokens $T_{D}$ with a decoder for completing the original point clouds:

$$T_{E}=\mathrm{Encoder}(T^{v},\,T_{L}^{v}),\qquad T_{E}\in\mathbb{R}^{v\times g\times d}$$ (1)
$$T_{D}=\mathrm{Decoder}(\mathrm{Cat}(T_{E},T_{H}),\,\mathrm{Cat}(T_{L}^{v},T_{L}^{h})),\qquad T_{H}\in\mathbb{R}^{h\times d},\;T_{D}\in\mathbb{R}^{(v+h)\times d}$$ (2)

where $T_{H}$ represents the hidden tokens, initialized by duplicating a learnable token of dimension $d$. We concatenate the visible points’ features $T_{E}$ and $T_{H}$, as well as the positional tokens $T_{L}^{v}$ and $T_{L}^{h}$, as the inputs of the decoder. Based on the outputs $T_{D}$ of the decoder, we reconstruct the $k$ nearest neighbors of the $h$ center points with a reconstruction head composed of a fully connected (FC) layer:

$$P_{pre}=\mathrm{Reshape}(\mathrm{FC}(T_{D})),\qquad P_{pre}\in\mathbb{R}^{h\times k\times 3}$$ (3)

where $P_{pre}$ denotes the predicted hidden point patches.
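As a reference for how the patches above could be formed, the sketch below groups a point set with FPS and KNN and builds the target patches from the hidden centers; the group counts (64 centers, 32 neighbors) and the local-frame normalization are assumptions for illustration, not the exact configuration of PointCG.

```python
import numpy as np

def farthest_point_sampling(points, n_centers, rng=np.random):
    """Greedy FPS: indices of n_centers well-spread points."""
    centers = np.zeros(n_centers, dtype=np.int64)
    centers[0] = rng.randint(len(points))
    dist = np.full(len(points), np.inf)
    for i in range(1, n_centers):
        d = np.linalg.norm(points - points[centers[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        centers[i] = np.argmax(dist)
    return centers

def knn_patches(centers, source, k):
    """For each center, gather its k nearest neighbours from `source`."""
    d = np.linalg.norm(centers[:, None, :] - source[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]
    return source[idx] - centers[:, None, :]        # normalise patches to local frames (assumption)

visible = np.random.rand(1200, 3)                   # P_v from the HPR step
hidden = np.random.rand(848, 3)                     # P_h
complete = np.concatenate([visible, hidden], axis=0)

vis_centers = visible[farthest_point_sampling(visible, 64)]
vis_patches = knn_patches(vis_centers, visible, 32)          # encoder inputs (v patches)
hid_centers = hidden[farthest_point_sampling(hidden, 64)]
target_patches = knn_patches(hid_centers, complete, 32)      # P_GT: neighbours from the complete cloud
```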

Loss function. The Chamfer distance [29] is employed as the reconstruction loss:

$$\mathcal{L}_{CD}=\frac{1}{|P_{pre}|}\sum_{\hat{x}\in P_{pre}}\min_{x\in P_{GT}}\|\hat{x}-x\|_{2}^{2}+\frac{1}{|P_{GT}|}\sum_{x\in P_{GT}}\min_{\hat{x}\in P_{pre}}\|\hat{x}-x\|_{2}^{2}$$ (4)

where $P_{GT}\in\mathbb{R}^{h\times k\times 3}$ denotes the reconstruction targets.
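A compact PyTorch sketch of Eq. (4) is given below; it treats the patch dimension as the batch dimension, and the tensor shapes in the example are placeholders.

```python
import torch

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance of Eq. (4); pred: (B, M, 3), gt: (B, N, 3)."""
    d = torch.cdist(pred, gt, p=2) ** 2                 # pairwise squared distances (B, M, N)
    pred_to_gt = d.min(dim=2).values.mean(dim=1)        # nearest GT point for each prediction
    gt_to_pred = d.min(dim=1).values.mean(dim=1)        # nearest prediction for each GT point
    return (pred_to_gt + gt_to_pred).mean()

# Example: h predicted patches of k points each against their targets.
loss_cd = chamfer_distance(torch.rand(64, 32, 3), torch.rand(64, 32, 3))
```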

3.3 Arbitrary-view Image Generation

3.3.1 Feature Alignment

To shift the pre-training focus towards enhancing the encoder for better 3D understanding, we employ the feature alignment module to build correspondence between images and point clouds in the feature space.

During pre-training, a pre-trained CLIP visual module [30] $f$ is used to extract features from the rendered image $I_{i}$. Then, the image features $f(I_{i})$ and the 3D features $T_{i}\in\mathrm{Max}(T_{E})$ are projected into the invariant space with functions $g$ and $h$, respectively, resulting in $\mathcal{H}_{i}=g(f(I_{i}))$ and $\mathcal{Z}_{i}=h(T_{i})$.

Loss function. In the invariant space, we aim to maximize the similarity between 𝒵i\mathcal{Z}_{i} and i\mathcal{H}_{i} when they correspond to the same objects. The cross-modal instance discrimination loss CM\mathcal{L}_{CM} can be formulated as:

$$\mathcal{L}_{CM}=\frac{1}{2M}\sum^{M}_{i=1}\big[l(i,\mathcal{Z},\mathcal{H})+l(i,\mathcal{H},\mathcal{Z})\big],$$
$$l(i,\mathcal{Z},\mathcal{H})=-\log\frac{\exp(s(\mathcal{Z}_{i},\mathcal{H}_{i})/\tau)}{\sum^{M}_{k=1,k\neq i}\exp(s(\mathcal{Z}_{i},\mathcal{Z}_{k})/\tau)+\sum^{M}_{k=1}\exp(s(\mathcal{Z}_{i},\mathcal{H}_{k})/\tau)}$$ (5)

where $M$ is the mini-batch size, $\tau$ is the temperature coefficient, and $s(\cdot)$ denotes the cosine similarity function.
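The loss in Eq. (5) can be written in a few lines of PyTorch; the sketch below assumes the projection heads $g$ and $h$ have already produced the mini-batch features, and the temperature value 0.07 is an assumed default.

```python
import torch
import torch.nn.functional as F

def cross_modal_loss(z, p_img, tau=0.07):
    """Symmetrised instance-discrimination loss of Eq. (5).
    z: (M, D) projected point-cloud features, p_img: (M, D) projected image features."""
    z, p_img = F.normalize(z, dim=1), F.normalize(p_img, dim=1)   # cosine similarity = dot product

    def one_direction(a, b):
        sim_ab = a @ b.t() / tau                        # cross-modal similarities (M, M)
        sim_aa = a @ a.t() / tau                        # intra-modal similarities (M, M)
        eye = torch.eye(len(a), dtype=torch.bool, device=a.device)
        negatives = torch.cat([sim_aa.masked_fill(eye, float('-inf')), sim_ab], dim=1)
        pos = sim_ab.diagonal()                         # matching pair for each instance
        return -(pos - torch.logsumexp(negatives, dim=1)).mean()

    return 0.5 * (one_direction(z, p_img) + one_direction(p_img, z))

loss_cm = cross_modal_loss(torch.randn(16, 256), torch.randn(16, 256))
```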

3.3.2 Image Generation

AIG generates rendered images from arbitrary views based on the visible points’ representations extracted by the encoder.

In pre-training, we randomly select a rendered image as the target and capture the corresponding view parameters $L_{C}$ as the input. $L_{C}$ comprises the azimuth $\phi$, elevation $\lambda$, and distance $\rho$, described as $L_{C}=\mathrm{Cat}(\phi,\lambda,\rho)$. To enhance flexibility, we apply several learnable transformation layers to obtain the positional embedding tokens $T^{C}_{L}$. Then, we concatenate $T_{E}$ with $T^{C}_{L}$ and encode the combined representation using an MLP $\mathcal{G}_{\theta}$: $T^{I}_{G}=\mathcal{G}_{\theta}(\mathrm{Cat}(T_{E},T^{C}_{L}))$.
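One possible reading of this step is sketched below in PyTorch: the camera parameters are embedded with a small MLP, broadcast across the encoder tokens, and fused by a second MLP. The layer widths and the choice to concatenate along the feature dimension are assumptions, since the exact fusion layout is not spelled out above.

```python
import torch
import torch.nn as nn

class ViewConditioning(nn.Module):
    """Fuse encoder tokens T_E with an embedding of the view parameters L_C."""

    def __init__(self, dim=384):
        super().__init__()
        self.view_embed = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, t_e, view_params):
        # t_e: (B, V, dim) encoder tokens; view_params: (B, 3) = (phi, lambda, rho).
        t_c = self.view_embed(view_params).unsqueeze(1).expand(-1, t_e.size(1), -1)
        return self.fuse(torch.cat([t_e, t_c], dim=-1))      # T_G^I: (B, V, dim)

t_g = ViewConditioning()(torch.randn(2, 64, 384), torch.rand(2, 3))
```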

Figure 5: The image generator consists of several deconvolutional residual blocks, generating the image from the view of $L_{C}$.

Finally, we design an image generator (Fig. 5) to generate rendered images based on $T^{I}_{G}$.

Specifically, we start by reshaping $T^{I}_{G}$ from its original vectorized representation to a 2D feature map. This feature map is then passed through a series of deconvolutional residual blocks and three parallel convolutional blocks to generate the rendered image $G_{I}^{p}$.
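The sketch below illustrates one way such a generator could be laid out in PyTorch: the fused tokens are pooled and projected to a 7x7 feature map, upsampled by five deconvolutional residual blocks to 224x224, and finished by three parallel convolution heads (read here as one head per RGB channel). The channel widths, the pooling step, and the per-channel heads are assumptions, not the exact architecture of Fig. 5.

```python
import torch
import torch.nn as nn

class DeconvResBlock(nn.Module):
    """Upsample by 2x with a transposed convolution, then apply a residual refinement."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1)
        self.body = nn.Sequential(
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.up(x)
        return self.act(x + self.body(x))

class ImageGenerator(nn.Module):
    def __init__(self, dim=384):
        super().__init__()
        self.to_map = nn.Linear(dim, 256 * 7 * 7)          # token pooling -> 7x7 map (assumption)
        self.blocks = nn.Sequential(                        # 7 -> 14 -> 28 -> 56 -> 112 -> 224
            DeconvResBlock(256, 128), DeconvResBlock(128, 64), DeconvResBlock(64, 32),
            DeconvResBlock(32, 16), DeconvResBlock(16, 16))
        self.heads = nn.ModuleList([nn.Conv2d(16, 1, 3, padding=1) for _ in range(3)])

    def forward(self, t_g):
        x = self.to_map(t_g.mean(dim=1)).reshape(-1, 256, 7, 7)
        x = self.blocks(x)
        return torch.sigmoid(torch.cat([head(x) for head in self.heads], dim=1))  # (B, 3, 224, 224)

image = ImageGenerator()(torch.randn(2, 64, 384))
```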

Loss function. We utilize the $\mathcal{L}_{1}$ loss as the content loss for image generation:

$$\mathcal{L}_{1}=\frac{1}{n}\sum_{i=1}^{n}|GT_{I}-G_{I}^{p}|$$ (6)

where $G_{I}^{p}$ and $GT_{I}$ represent the predicted and GT images, respectively, while $n$ denotes the number of sample points. Besides, we incorporate the multi-scale frequency reconstruction (MSFR) loss [31] as an auxiliary loss alongside the content loss $\mathcal{L}_{1}$ to reduce the differences in the frequency space. The MSFR loss measures the $\mathcal{L}_{1}$ distance between multi-scale GT and predicted images in the frequency domain:

$$\mathcal{L}_{MSFR}=\frac{1}{n}\sum_{i=1}^{n}|\mathcal{F}(GT_{I})-\mathcal{F}(G_{I}^{p})|$$ (7)

where $\mathcal{F}$ denotes the fast Fourier transform (FFT) that transfers the image signal to the frequency domain. MSFR can effectively maintain contrast in high-frequency regions, complementing the ability of $\mathcal{L}_{1}$ to preserve colors and luminance [32]. The image generation loss is given by:

$$\mathcal{L}_{G}=\alpha\mathcal{L}_{1}+\beta\mathcal{L}_{MSFR}$$ (8)

where the contributions of $\mathcal{L}_{1}$ and $\mathcal{L}_{MSFR}$ can be adjusted by modifying the values of $\alpha$ and $\beta$.
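A minimal PyTorch sketch of Eq. (8) follows, with the MSFR term computed as the $\mathcal{L}_{1}$ distance between 2D FFTs of the images at a few scales. The scale set (1, 2, 4) is an assumption; the weights follow the best setting reported later ($\alpha=1.0$, $\beta=0.2$).

```python
import torch
import torch.nn.functional as F

def msfr_loss(pred, gt, scales=(1, 2, 4)):
    """L1 distance between the 2D FFTs of predicted and GT images at several scales."""
    loss = 0.0
    for s in scales:
        p = F.avg_pool2d(pred, s) if s > 1 else pred        # build the multi-scale pyramid
        g = F.avg_pool2d(gt, s) if s > 1 else gt
        loss = loss + (torch.fft.fft2(p) - torch.fft.fft2(g)).abs().mean()
    return loss / len(scales)

def image_generation_loss(pred, gt, alpha=1.0, beta=0.2):
    """Eq. (8): weighted sum of the L1 content loss and the MSFR loss."""
    return alpha * F.l1_loss(pred, gt) + beta * msfr_loss(pred, gt)

loss_g = image_generation_loss(torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224))
```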

Our loss function during pre-training is formulated as:

$$\mathcal{L}=\omega\mathcal{L}_{CD}+\phi\mathcal{L}_{CM}+\psi\mathcal{L}_{G}$$ (9)

where $\mathcal{L}_{CD}$ enforces 3D completion, $\mathcal{L}_{CM}$ introduces 3D-2D correspondence, and $\mathcal{L}_{G}$ ensures image generation.

4 Experiment

In this section, we present extensive experiments to demonstrate the effectiveness of our method. We begin by introducing the pre-training process on ShapeNet55 [33], and then showcase its performance on 3D completion tasks in Sec. 4.1. In Sec. 4.2, Sec. 4.3, and Sec. 4.4, we follow previous works to conduct experiments on object classification, part segmentation, and semantic segmentation, and Sec. 4.5 reports indoor 3D object detection. Finally, we validate the effectiveness of our modules through various ablation studies in Sec. 4.6.

In the following tables, "Pre-T" indicates whether the model is initialized with a pre-trained model, while "Rep." signifies that the result is reproduced with the official code. Please note that we reproduce the experiments of Point-MAE [3] and Point-M2AE [8] with their official codes, and all settings are consistent with our experimental configuration.

Pre-training. We pre-train the encoder on ShapeNet55 [33], which contains 52,470 clean 3D models covering 55 common object categories. The number of input points $N$ is 2,048, and the rendered images have a size of $224\times 224\times 3$. The encoder and decoder include 12 and 4 standard Transformer blocks, respectively. Each Transformer block has 384 hidden dimensions with 6 heads. We employ the AdamW optimizer [34] and cosine learning rate decay [35]. The initial learning rate is set to 0.001, and the weight decay is 0.05.
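For reference, the optimizer and schedule above correspond to a setup like the following PyTorch snippet; the stand-in model and the number of epochs are placeholders, since the epoch count is not stated in this section.

```python
import torch

model = torch.nn.Linear(384, 384)      # placeholder for the PointCG network
epochs = 300                           # assumed; not reported in this section

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... forward pass, total loss of Eq. (9), loss.backward(), optimizer.step() ...
    scheduler.step()
```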

4.1 3D Completion

Completion based on visible points from random views. To assess the effectiveness of our self-supervised model initialized with pre-trained weights, we randomly select an instance from synthetic dataset ModelNet40 [36] and real-world dataset ScanObjectNN [37] separately and reconstruct the original point clouds. The visualization results of Point-MAE [3] and our model are shown in Fig. 6.

Compared to Point-MAE, our method not only completes the chair’s pivot axis and the five-pronged base with greater fidelity but also obtains smoother surface structures. Our method achieves remarkable performance in reconstructing both synthetic and real-world data with visible points from arbitrary views.

Figure 6: 3D completion with pre-trained models of Point-MAE [3] and our method, based on the unmasked points from ModelNet40 [36] (top row) and ScanObjectNN [37] (bottom row).

Completion based on grouped patches and partial points. To demonstrate the robustness and generalizability of our method in the 3D completion task, we devise two methods to obtain grouped patches and partial points as inputs. The reconstruction results are visualized in Fig. 7.

Figure 7: 3D completion with pre-trained models of Point-MAE [3] and our model based on partial points from ModelNet40 [36].

In the first method, we obtain 2, 8, and 26 central points via FPS, and then acquire 32 neighboring points with KNN to form groups [3, 8], denoted by ‘G2’, ‘G8’, and ‘G26’. In the second method, we randomly select one group consisting of 64, 256, or 832 points, denoted as ‘P64’, ‘P256’, and ‘P832’, respectively.

For ‘G2’ and ‘G8’, both Point-MAE and our method can complete the overall structure, but our method excels in recovering finer geometric details and sharper edges. The ‘G26’ column retains sufficient information, and the results are satisfactory for both methods. For ‘P64’ and ‘P256’, our method successfully completes entire structures and captures many local details. In contrast, the results of Point-MAE appear quite blurry. Although ‘G26’ and ‘P832’ have the same number of input points, the reconstruction quality of Point-MAE based on ‘P832’ is significantly lower than that based on ‘G26’.

As shown, the inputs obtained by random masking retain objects’ structural information while exposing the coordinates of target points. MAE methods employing a random masking strategy, such as Point-MAE, exhibit poorer reconstruction performance when the inputs are partial and lack complete structural integrity. Our method, however, excels at extracting informative representations and demonstrates strong inference capability, leading to superior reconstruction performance, even from highly partial point data.

4.2 Shape Classification

To assess the discrimination of the representations extracted by the pre-trained encoder, we validate the encoder on the shape classification task using the ModelNet40 [36] and ScanObjectNN [37] datasets.

Shape classification on synthetic data. ModelNet40 [36] contains 12,311 clean 3D CAD models, covering 40 object categories. We fine-tune the pre-trained encoder, and the results are presented in Tab. I. Our method achieves 94.03% with global fine-tuning, surpassing the reproduced version of Point-MAE (Rep.) (93.21%) by 0.82% and the publicly released accuracy by 0.23%. To validate the effectiveness of our architecture, we incorporate Point-M2AE as the backbone and evaluate its classification performance on ModelNet40. The classification results outperform the outcomes of the reproduced Point-M2AE model.

TABLE I: Accuracy of shape classification on ModelNet40.
Methods Pre-T Venue/Year Acc. (Vote)
DGCNN [38] - ACM/2019 92.9
RSCNN [39] - CVPR/2019 93.6
PointTransformer [40] - ICCV/2021 93.7
DGCNN+OcCo [10] Y ICCV/2021 93.0
DGCNN+STRL [9] Y ICCV/2021 93.1
DGCNN+MAE3D [13] Y TMM/2023 93.4
Point-BERT [7] Y CVPR/2022 93.2
MaskPoint [41] Y ECCV/2022 93.8
Point-MAE [3] Y ECCV/2022 93.8
Joint-MAE [42] Y CoRR/2023 94.0
Point-MAE (Rep.) Y ECCV/2022 93.21
PointCG (Point-MAE) Y - 94.03
Point-M2AE [8] Y NeurIPS/2022 94.0
Point-M2AE (Rep.) Y NeurIPS/2022 93.59
PointCG (Point-M2AE) Y - 94.11

In addition, we freeze the parameters of our pre-trained model and validate it with a Linear-SVM classifier in Tab. II.

Our method outperforms Point-MAE [3] and Point-M2AE [8] by margins of +1.46% and +0.29%, respectively. The results highlight the superior quality of the 3D representation learned by our method.
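The Linear-SVM protocol amounts to fitting a linear classifier on the frozen global features of the pre-trained encoder. A minimal sketch with scikit-learn is shown below; the random arrays stand in for the extracted ModelNet40 train/test features, and the regularization constant is an assumed value.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholders for features extracted by the frozen encoder (ModelNet40 splits).
train_feats, train_labels = np.random.rand(9843, 384), np.random.randint(0, 40, 9843)
test_feats, test_labels = np.random.rand(2468, 384), np.random.randint(0, 40, 2468)

svm = LinearSVC(C=0.01)                       # C is an assumed, tunable value
svm.fit(train_feats, train_labels)
accuracy = (svm.predict(test_feats) == test_labels).mean()
```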

TABLE II: Accuracy of Linear-SVM on ModelNet40.
Methods Pre-T Acc.
FoldingNet [43] - 88.4
DGCNN+Jigsaw [44] Y 90.6
DGCNN+OcCo [10] Y 89.2
DGCNN+CrossPoint [20] Y 91.2
FoldingNet+PointMCD [26] Y 89.8
Point-BERT [7] Y 87.4
Joint-MAE [42] Y 92.4
Point-MAE [3] Y 91.0
PointCG (Point-MAE) Y 92.26
Point-M2AE [8] Y 92.9
Point-M2AE (Rep.) Y 92.63
PointCG (Point-M2AE) Y 92.92

Shape classification on the real-world data. Evaluating a pre-trained model’s performance on real-world datasets is crucial, as real-world scenes tend to be more complex than synthetic ones. We follow the common practice to evaluate our model on three variants: ‘OBJ-BG’, ‘OBJ-ONLY’, and ‘PB-T50-RS’ of ScanObjectNN [37].

To further validate the effectiveness of our design, we follow PointVST [11] with DGCNN as the encoder to construct a pre-training network (DGCNN+PointCG) and evaluate it on the ‘PB-T50-RS’ variant. Additionally, we reproduce TAP [14] and PointVST [11] using their official pretrained models.

As presented in Tab. III, our method outperforms CrossPoint [20], as well as the reproduced TAP [14] and PointVST [11], all using DGCNN [38] as the encoder. When utilizing Point-MAE or Point-M2AE as the backbone, the classification results exhibit a significant improvement over the reproduced Point-MAE and Point-M2AE. These results underscore the discriminative power of the representations extracted by the encoder, even in complex real-world scenes.

TABLE III: Accuracy of shape classification on ScanObjectNN.
Methods OBJ-BG OBJ-ONLY PB-T50-RS
DGCNN [38] 82.8 86.2 78.1
DGCNN+MAE3D [13] 87.7 88.4 86.2
DGCNN+CrossPoint [20] - - 86.2
DGCNN+TAP [14] - - 86.6
DGCNN+TAP [14] (Rep.) - - 86.54
DGCNN+PointVST [11] - - 89.3
DGCNN+PointVST [11] (Rep.) - - 87.6
DGCNN+PointCG - - 87.90
Transformer+OcCo [10] 84.85 85.54 78.79
Transformer+TAP [14] 90.36 89.50 85.67
Point-BERT [7] 87.43 88.12 83.07
Joint-MAE [42] 90.94 88.86 86.07
Point-MAE [3] 90.02 88.29 85.18
PointCG (Point-MAE) 91.16 88.99 86.47
Point-M2AE [8] 91.22 88.81 86.43
Point-M2AE (Rep.) 90.87 88.12 85.39
PointCG (Point-M2AE) 91.19 88.72 86.41

Few-shot Learning. Following previous works [45, 7, 10, 3], we conduct few-shot learning experiments using the pre-trained model on ModelNet40 [36]. We adopt the $n$-way, $m$-shot setting, where $n$ denotes the number of classes randomly selected from the dataset, and $m$ represents the number of objects randomly sampled for each class. This yields $n\times m$ objects for training. For evaluation, we randomly select 20 unseen objects from each of the $n$ classes.
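A small NumPy sketch of this episode construction is given below; the dummy labels merely stand in for the ModelNet40 annotations.

```python
import numpy as np

def sample_episode(labels, n_way, m_shot, n_test=20, rng=None):
    """One n-way, m-shot episode: m training and n_test unseen objects per sampled class."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    train_idx, test_idx = [], []
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])
        train_idx.extend(idx[:m_shot])
        test_idx.extend(idx[m_shot:m_shot + n_test])
    return np.array(train_idx), np.array(test_idx)

labels = np.random.randint(0, 40, size=12311)        # dummy stand-in for ModelNet40 labels
train_idx, test_idx = sample_episode(labels, n_way=5, m_shot=10)
```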

The results with settings of $n\in\{5,10\}$ and $m\in\{10,20\}$ are presented in Tab. IV. As shown, our method consistently outperforms the baselines in nearly all few-shot settings, with minimal deviation. This highlights the robustness and generalization of the representations extracted by the PointCG encoder, even in data-limited scenarios.

TABLE IV: Few-shot object classification on ModelNet40. We conduct 10 independent experiments for each setting and report mean accuracy (%) with standard deviation.
Methods 5-way, 10-shot 5-way, 20-shot 10-way, 10-shot 10-way, 20-shot
Transformer [46] 87.8 ± 5.2 93.3 ± 4.3 84.6 ± 5.5 89.4 ± 6.3
Transformer+OcCo [10] 94.0 ± 3.6 95.9 ± 2.3 89.4 ± 5.1 92.4 ± 4.6
Point-BERT [7] 94.6 ± 3.1 96.3 ± 2.7 91.0 ± 5.4 92.7 ± 5.1
Point-MAE [3] 96.3 ± 2.5 97.8 ± 1.8 92.6 ± 4.1 95.0 ± 3.0
PointCG (Point-MAE) 96.7 ± 2.1 98.0 ± 1.3 93.1 ± 3.6 95.8 ± 2.6
Point-M2AE [8] 96.8 ± 1.8 98.3 ± 1.4 92.3 ± 4.5 95.0 ± 3.0
PointCG (Point-M2AE) 97.0 ± 1.9 98.4 ± 1.6 92.8 ± 3.8 95.5 ± 2.9

4.3 Part Segmentation

The task of part segmentation aims to predict more fine-grained class labels for every instance. We conduct part segmentation on ShapeNetPart [47], which comprises 16,881 samples shared by 16 categories, annotated with 50 parts in total. As illustrated in Tab. V, our method achieves competitive results and outperforms others in eleven categories.

TABLE V: Part segmentation results on ShapeNetPart. We report the mean IoU across all part categories $mIoU_{C}$ (%) and the mean IoU across all instances $mIoU_{I}$ (%), as well as the IoU (%) for each category.
Methods $mIoU_{C}$ $mIoU_{I}$ aeroplane bag cap car chair earphone guitar knife lamp laptop motorbike mug pistol rocket skateboard table
Transformer [7] 83.42 85.1 82.9 85.4 87.7 78.8 90.5 80.8 91.1 87.7 85.3 95.6 73.9 94.9 83.5 61.2 74.9 80.6
Point-BERT [7] 84.11 85.6 84.3 84.8 88.0 79.8 91.0 81.7 91.6 87.9 85.2 95.6 75.6 94.7 84.3 63.4 76.3 81.5
Point-MAE [3] 84.19 86.1 84.3 85.0 88.3 80.5 91.3 78.5 92.1 87.4 86.1 96.1 75.2 94.6 84.7 63.5 77.1 82.4
PointCG (Point-MAE) 84.48 86.2 84.4 86.1 88.4 80.7 91.3 81.2 91.8 88.3 85.9 95.9 75.7 94.9 85.1 63.7 76.5 81.8

Visualization of part segmentation. Fine-grained part segmentation holds immense practical value. To highlight the clear advantage of our method in this task, we visualize the results and compare them with Point-MAE [3] in Fig. 8.

Figure 8: Visualization of the part segmentation results on the test set of ShapeNetPart.

As depicted in the third line, our method accurately segments the fuselage and tail fin of the airplane, along with the earphone and headband. This reveals the capability of our method to capture discriminative features of points belonging to distinct sections within the same instance.

4.4 Semantic Segmentation

Large-scale indoor datasets introduce more complexities as they cover larger scenes in real-world environments with noise and outliers. We evaluate the performance of our pre-trained model on the 3D semantic segmentation task using the Stanford Large-Scale 3D Indoor Spaces (S3DIS) [48] dataset. S3DIS includes data from 6 indoor areas, comprising a total of 272 rooms. We fine-tune the pre-trained model on Areas 1-5 and evaluate it on Area 6. The results for each category are shown in Tab. VI. Our method outperforms Point-MAE [3] across all categories except ‘beam’ and ‘board’. The results underscore our model’s capability to extract contextual and semantic information, which is crucial for producing fine-grained segmentation outcomes.

TABLE VI: Semantic segmentation results on S3DIS. The mean class-wise intersection over union ($mIoU_{C}$, %), mean class-wise accuracy ($mAcc$, %), overall accuracy ($oAcc$, %), and the IoU (%) of each class are reported.
Methods $oAcc$ $mAcc$ $mIoU_{C}$ ceiling floor wall beam column window door table chair sofa bookcase board clutter
Point-MAE [3] 90.29 82.18 73.14 94.4 96.8 77.6 82.0 68.5 78.4 79.0 73.3 82.9 45.4 52.6 55.5 64.6
PointCG (Point-MAE) 92.22 85.17 76.58 96.5 98.4 81.6 81.9 82.0 82.9 83.5 75.1 84.7 49.6 57.5 52.6 69.1

In reference to the semantic segmentation experiments of STRL [9], we fine-tune our pre-trained model on one area among Areas 1-5, followed by evaluation on Area 6. We extend the experiments of Point-MAE [3] based on the pre-trained model released in the official code and present the mean IoU across all class categories ($mIoU$, %) and the classification accuracy (Acc, %) in Tab. VII. Our model exhibits a significant improvement in accuracy and $mIoU$ compared to STRL [9] and Point-MAE [3]. These results demonstrate the capability of our model to extract contextual and semantic information, leading to fine-grained segmentation results.

TABLE VII: Semantic segmentation on S3DIS.
Fine-tuning Area Method Acc. mIoU
Area 1 (3687 samples) STRL [9] 85.28 59.15
Point-MAE [3] 89.03 71.92
PointCG (Point-MAE) 90.29 74.28
Area 2 (4440 samples) STRL [9] 72.37 39.21
Point-MAE [3] 76.72 47.13
PointCG (Point-MAE) 78.31 48.95
Area 3 (1650 samples) STRL [9] 79.12 51.88
Point-MAE [3] 84.09 64.29
PointCG (Point-MAE) 85.21 65.63
Area 4 (3662 samples) STRL [9] 73.81 39.28
Point-MAE [3] 77.34 45.15
PointCG (Point-MAE) 78.07 47.30
Area 5 (6852 samples) STRL [9] 77.28 49.53
Point-MAE [3] 80.56 51.46
PointCG (Point-MAE) 81.79 54.04

4.5 Indoor 3D object detection

To validate the effectiveness of our method in scene-level prediction tasks, we conduct object detection experiments on ScanNetV2 [49], a 3D indoor scene dataset with rich annotations, including 1,513 scenes across 18 object classes. The dataset includes semantic labels, per-point instances, and both 2D and 3D bounding boxes. Following TAP [14], we adopt 3DETR [50] as a baseline, construct the PointCG network, and pre-train on the object-level dataset ShapeNet55 [33].

As shown in Tab. VIII, our method demonstrates superior performance compared to both the baseline 3DETR [50] and TAP [14] in both the $AP_{0.25}$ and $AP_{0.5}$ metrics. This improvement indicates that the encoder trained by PointCG effectively captures discriminative information and generalizes well to complex scenes, even when pre-trained with object-level datasets.

TABLE VIII: Scene-level object detection on ScanNetV2 [49]. Average precision at the 0.25 IoU threshold ($AP_{0.25}$) and the 0.5 IoU threshold ($AP_{0.5}$) is reported.
Methods Pre-T $AP_{0.25}$ $AP_{0.5}$
VoteNet - 58.6 33.5
3DETR - 62.1 37.9
3DETR+TAP ShapeNet 63.0(+0.9) 41.4(+3.5)
3DETR+PointCG ShapeNet 63.21(+1.11) 42.17(+4.27)

4.6 Ablation Study

To investigate the architectural designs of our method, we conduct comprehensive ablation studies with Point-MAE as the backbone model and elucidate the individual contribution of each module.

Effectiveness of the components. As shown in Tab. IX, we validate the effectiveness of each module by enhancing and replacing modules on the baseline. We adopt Point-MAE [3] for comparison and utilize hidden point completion (HPC) as the baseline (a). In (b), we add the feature alignment module with the pre-trained Vit-B/16 [30]. Based on (b), we incorporate the arbitrary-view image generation (AIG) module, constituting our PointCG, denoted as (c). As shown in Tab. IX, while the inclusion of the feature alignment module on top of HPC does not substantially improve classification accuracy, excluding this module from PointCG yields a diminished Linear-SVM accuracy of 88.41%.

TABLE IX: Ablation studies on the introduced modules. Shape classification based on pre-trained models.
Methods Linear-SVM. Acc. Acc+Vote
Point-MAE (Rep.) [3] - 92.22 93.21
(a) HPC 91.15 92.76 93.47
(b) + Feature Alignment 91.23 92.99 93.51
(c) + AIG 92.26 93.52 94.03
(d) PointCG (Vit-B/32) 92.21 93.05 93.97
(e) PointCG (ResNet50 [51]) 91.01 92.75 93.39
(f) PointCG (Grayscale image) 91.75 93.19 93.52
(g) PointCG (depth map) 91.09 92.78 93.43

We further replace Vit-B/16 with Vit-B/32 (d) and ResNet50 [51] (e). While Vit-B/32 outperforms Vit-B/16 in the image domain, it does not improve the accuracy of classification. The model with ResNet50 [51] exhibits comparatively poorer performance.

To assess the impact of color in rendered images, we extract grayscale images from the rendered ones and pre-train with them in (f). This operation leads to a decrease in shape classification. Grayscale images may potentially lose structural or finer details inherent in the original images. In experiment (g), we extract depth maps from point clouds following PointCLIP [52] and pre-train with them instead of rendered images. The classification result shows poor performance.

Training and inference time. Tab. X presents pre-training and inference times for the classification task of each module. The results demonstrate that the pre-training time is notably longer than that of the baseline, whereas the inference time for classification remains comparable.

The pre-training of HPC requires approximately 397 s per epoch. Compared to the baseline (Point-MAE), the primary time consumption arises from the data organization module, which is responsible for obtaining visible points from arbitrary views as inputs. Among the components, the 3D completion module has the shortest pre-training duration, while the feature alignment and AIG modules demand considerably more time.

TABLE X: Ablation studies on the introduced modules. Training and inference times of each model.
Methods Epoch time (s) Inference time (s)
Point-MAE (Rep.) [3] 54.73-57.59 36.85
HPC 395.18-399.26 36.65
PointCG W/O AIG 413.96-419.47 37.65
PointCG W/O Feature Alignment 405.52-411.22 35.67
PointCG W/O 3D Completion 520.97-524.73 36.79
PointCG (Point-MAE) 534.56-538.72 37.37
Figure 9: Visualization of the 3D completion results based on the pre-trained models of PointCG without AIG (W/O AIG) and PointCG on ModelNet40 [36].

Visualization of 3D Completion. To validate the positive impact of AIG on the 3D completion task, we exclude AIG from PointCG (W/O AIG) and pre-train the network. The 3D completion results of this variant and PointCG are shown in Fig. 9. As depicted, the edges of the aircraft wings are sharper with our method. Without AIG, the completed edges exhibit clustered point groups, which we attribute to relying solely on 3D completion, whose targets are point clusters rather than precise pixel-level supervision.

Figure 10: Visualization of the generated images based on pre-trained models. We present the original rendered images in row (a) and corresponding grayscale one in row (b). The row (c) shows the results of PointCG without 3D completion. The results based on PointCG are shown in row (d).

Visualization of the generated images. To validate the provision of geometric structural information by 3D completion, we showcase the results of image generation after excluding the completion module and compare them with PointCG’s results in Fig. 10. As depicted in row (c), the outcomes exhibit significant artifacts and lack local structural information. Specifically, in the case of the aircraft tail wing, only the wing’s shape is generated, omitting volumetric structural details. In contrast, our method in row (d) successfully predicts the accurate shapes of the objects.

Camera perspectives. To identify more suitable inputs and predicted images, we design experiments with different camera positions as inputs, and the results are reported in Tab. XI. (a) takes the left-view image as input and the front-view as target, and (b) takes the left-view image as input and an arbitrary-view image as target. In (c), we use images from three different views (front, left, and top views) as inputs and an arbitrary-view image as target. Both the input and target of (d) are from arbitrary views, yielding the optimal results. Therefore, we utilize two arbitrary-view images as the input and target, respectively.

TABLE XI: Ablation studies on camera views. Pre-training and shape classification with images from different views.
Input/Prediction Linear-SVM. Acc. Acc+Vote
(a) Left view /front view 91.13 92.66 93.31
(b) Left view /arbitrary view 91.65 92.83 93.68
(c) Front, left, and top views/arbitrary view 92.22 93.03 93.40
(d) Arbitrary view/arbitrary view 92.26 93.52 94.03

Pre-training with more complete inputs. We posit that if the inputs of 3D completion contain more structural information and more overlap areas with the targets, the completion task will be accomplished more easily. This leads to a reduced training intensity for the backbone, thereby diminishing the backbone’s perception of 3D objects. We design this study based on different inputs. The inputs in (a) are derived from a single arbitrary view. The inputs for (b) and (c) involve the addition of two and eight extra patches, respectively, to the input of (a). The points from two arbitrary views serve as the inputs for (d). Results are reported in Tab. XII.

TABLE XII: Ablation study on the completeness of inputs. We pre-train the model based on inputs with varying levels of completeness.
Input Linear-SVM. Acc. Acc+Vote
(a) One arbitrary view 92.26 93.52 94.03
(b) One arbitrary view + 2 patches 91.73 93.14 93.55
(c) One arbitrary view + 8 patches 91.41 92.85 93.25
(d) Two arbitrary views 91.02 92.75 93.27

In cases (b) and (c), the accuracy of the Linear-SVM during pre-training decreases as more patches are included in the inputs. The classification results via fine-tuning also show a decline. In case (d), inputs from two views offer more structural information about the input objects, which leads to a significant decrease in shape classification. This experiment reveals that as additional structural information is progressively included in the input, shape classification accuracy steadily decreases. This indicates that excessive exposure to object structure within the inputs hinders the model’s learning ability.

Image generation losses. AIG significantly impacts the backbone’s perception of 3D objects through precise supervision between the ground truth and the generated images. The choice of loss also affects the quality of the generated images (as shown in Fig. 11). We conduct experiments to examine the performance of various loss functions for supervision, as outlined in Tab. XIII.

TABLE XIII: Ablation study on the loss functions of image generation. We pre-train with various loss functions and subsequently fine-tune with shape classification.
Generation Loss Linear-SVM Acc. Acc+Vote
$\mathcal{L}_{1}$ 90.32 93.07 93.43
$\mathcal{L}_{2}$ 91.13 92.97 93.40
$1.0\mathcal{L}_{1}+1.0\mathcal{L}_{2}$ 91.97 93.35 93.76
$\mathcal{L}_{1}+(1.0-\mathcal{L}_{MS\_SSIM})$ 91.73 93.23 93.61
$0.8\mathcal{L}_{1}+0.2\mathcal{L}_{MSFR}$ 91.65 93.48 93.81
$1.0\mathcal{L}_{1}+0.2\mathcal{L}_{MSFR}$ 92.26 93.52 94.03

The $\mathcal{L}_{2}$ loss penalizes large errors more heavily and is more tolerant of small errors. In contrast, the $\mathcal{L}_{1}$ loss does not excessively penalize large errors. The classification results of the model with the $\mathcal{L}_{1}$ loss are superior to those with the $\mathcal{L}_{2}$ loss.

The multi-scale structural similarity (MS_SSIM) index preserves the contrast in high-frequency regions. $\mathcal{L}_{1}$ is effective in preserving colors and luminance [32], but does not produce quite the same contrast as MS_SSIM. To leverage the benefits of both loss functions, we utilize their combination: $\mathcal{L}_{G}=\alpha\mathcal{L}_{1}+\beta\mathcal{L}_{MS\_SSIM}$. However, the resulting classification quality is slightly lower than that of the other settings. As shown in the table, the model with $\mathcal{L}_{G}=1.0\mathcal{L}_{1}+0.2\mathcal{L}_{MSFR}$ achieves the best classification results.

Figure 11: Visualization of the generated images with different losses. We present the original rendered images in row (a) and the corresponding grayscale versions in row (b). Rows (c) and (d) are generated with the mixed losses of $1.0\mathcal{L}_{1}+1.0(1.0-\mathcal{L}_{MS\_SSIM})$ and $1.0\mathcal{L}_{1}+0.2\mathcal{L}_{MSFR}$, respectively.

Visualization of the generated images. To assess a more suitable image generation loss for our model, we visualize the generated images with different losses in Fig. 11. Despite minor artifacts near the edges or corners, our model consistently produces complete and structurally clear images in line (d). This indicates that our model possesses the capability to capture geometric structures and stereoscopic knowledge about 3D objects, as well as the ability to infer the occluded points from arbitrary views.

Quantitative results of 2D image generation. While Figs. 10 and 11 provide visual examples of the generated images, they lack rigorous quantitative results to substantiate any significant improvements over baseline methods in terms of preserving 3D structure in 2D images. To address this, we present Tab. XIV, which provides quantitative evaluations of the generated images under various architectures and image generation losses.

We employ Mean Squared Error (MSE), Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Normalized Mutual Information (NMI) as our primary image evaluation metrics. Clearly, the configuration of PointCG with $1.0\mathcal{L}_{1}+0.2\mathcal{L}_{MSFR}$ (d) achieves the best performance across all metrics. By contrast, PointCG without 3D Completion (a) exhibits the poorest results. This indicates that the 3D completion module significantly enhances the quality of the generated images.
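For reference, the first three metrics can be computed on a predicted/GT image pair as in the sketch below (scikit-image >= 0.19 for the channel_axis argument); the NMI computation is omitted, and the arrays shown are placeholders.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

gt = np.random.rand(224, 224, 3)                                     # placeholder GT rendering
pred = np.clip(gt + 0.05 * np.random.randn(224, 224, 3), 0.0, 1.0)   # placeholder prediction

mse = np.mean((gt - pred) ** 2)
psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
```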

TABLE XIV: Quantitative results of the generated images based on different architectures and image generation losses.
Methods/Generation Loss MSE \downarrow PSNR \uparrow SSIM \uparrow NMI \uparrow
(a) PointCG W/O 3D Completion 0.054 30.099 0.795 0.485
(b) $1.0\mathcal{L}_{1}+1.0\mathcal{L}_{2}$ 0.038 31.599 0.831 0.515
(c) $\mathcal{L}_{1}+(1.0-\mathcal{L}_{MS\_SSIM})$ 0.042 31.519 0.827 0.509
(d) $1.0\mathcal{L}_{1}+0.2\mathcal{L}_{MSFR}$ 0.034 32.106 0.856 0.558

5 Conclusion

In this paper, we propose PointCG, a unified framework with hidden point completion and arbitrary-view image generation for self-supervised point cloud learning. Completion and generation based on partial points prompt the encoder to extract high-quality representations that capture 3D structural intricacies while alleviating ambiguous supervision. As a result, our method achieves notable enhancements over baseline methods and outperforms similar methods in classification and reconstruction tasks on real datasets. We expect our pre-trained models will benefit a wide range of 3D tasks, including 3D object detection, semantic segmentation, and visual grounding.

While performing well across multiple tasks, there is much room for improving PointCG. This includes expanding the modality of images to incorporate other formats such as language or audio. Furthermore, while we focus on instance-level tasks, scene-level understanding is crucial in real-world applications. Therefore, our future studies may involve delving into applications in scene understanding and investigating interactions among multiple modalities for 3D reasoning.

References

  • [1] Y. Liu, X. Yan, Z. Li, Z. Chen, Z. Wei, and M. Wei, “Pointgame: Geometrically and adaptively masked autoencoder on point clouds,” IEEE Trans. Geosci. Remote. Sens., vol. 61, pp. 1–12, 2023.
  • [2] L. Gu, X. Yan, P. Cui, L. Gong, H. Xie, F. L. Wang, J. Qin, and M. Wei, “PointSee: Image enhances point cloud,” IEEE Trans. Vis. Comput. Graph., vol. 0, no. 0, pp. 1–18, 2023.
  • [3] Y. Pang, W. Wang, F. E. H. Tay, W. Liu, Y. Tian, and L. Yuan, “Masked autoencoders for point cloud self-supervised learning,” in Computer Vision - ECCV, 2022, pp. 604–621.
  • [4] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. B. Girshick, “Masked autoencoders are scalable vision learners,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 979–15 988.
  • [5] K. He, H. Fan, Y. Wu, S. Xie, and R. B. Girshick, “Momentum contrast for unsupervised visual representation learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9726–9735.
  • [6] X. Chen and K. He, “Exploring simple siamese representation learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 750–15 758.
  • [7] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-bert: Pre-training 3d point cloud transformers with masked point modeling,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19 291–19 300.
  • [8] R. Zhang, Z. Guo, P. Gao, R. Fang, B. Zhao, D. Wang, Y. Qiao, and H. Li, “Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training,” in Advances in Neural Information Processing Systems, 2022, pp. 27 061–27 074.
  • [9] S. Huang, Y. Xie, S. Zhu, and Y. Zhu, “Spatio-temporal self-supervised representation learning for 3d point clouds,” in IEEE/CVF International Conference on Computer Vision, 2021, pp. 6515–6525.
  • [10] H. Wang, Q. Liu, X. Yue, J. Lasenby, and M. J. Kusner, “Unsupervised point cloud pre-training via occlusion completion,” in IEEE/CVF International Conference on Computer Vision, 2021, pp. 9762–9772.
  • [11] Q. Zhang and J. Hou, “Pointvst: Self-supervised pre-training for 3d point clouds via view-specific point-to-image translation,” IEEE Transactions on Visualization and Computer Graphics, 2023.
  • [12] Z. Qi, R. Dong, G. Fan, Z. Ge, X. Zhang, K. Ma, and L. Yi, “Contrast with reconstruct: Contrastive 3d representation learning guided by generative pretraining,” in International Conference on Machine Learning, ICML, ser. Proceedings of Machine Learning Research, vol. 202, 2023, pp. 28 223–28 243.
  • [13] J. Jiang, X. Lu, L. Zhao, R. Dazeley, and M. Wang, “Masked autoencoders in 3d point cloud representation learning,” CoRR, vol. abs/2207.01545, 2022.
  • [14] Z. Wang, X. Yu, Y. Rao, J. Zhou, and J. Lu, “Take-a-photo: 3d-to-2d generative pre-training of point cloud models,” CoRR, vol. abs/2307.14971, 2023.
  • [15] W. Feng, J. Zhang, H. Cai, H. Xu, J. Hou, and H. Bao, “Recurrent multi-view alignment network for unsupervised surface registration,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2021, pp. 10 297–10 307.
  • [16] J. Sauder and B. Sievers, “Self-supervised deep learning on point clouds by reconstructing space,” in NeurIPS, 2019, pp. 12 942–12 952.
  • [17] Y. Zeng, Y. Qian, Z. Zhu, J. Hou, H. Yuan, and Y. He, “Corrnet3d: Unsupervised end-to-end learning of dense correspondence for 3d point clouds,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, 2021, pp. 6052–6061.
  • [18] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. Á. Pires, Z. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, “Bootstrap your own latent - A new approach to self-supervised learning,” in Advances in Neural Information Processing Systems, 2020, pp. 21 271–21 284.
  • [19] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in IEEE/CVF International Conference on Computer Vision, 2021, pp. 9630–9640.
  • [20] M. Afham, I. Dissanayake, D. Dissanayake, A. Dharmasiri, K. Thilakarathna, and R. Rodrigo, “Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9892–9902.
  • [21] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • [22] S. Yan, Y. Yang, Y. Guo, H. Pan, P. Wang, X. Tong, Y. Liu, and Q. Huang, “3d feature prediction for masked-autoencoder-based point cloud pretraining,” in The Twelfth International Conference on Learning Representations, ICLR.   OpenReview.net, 2024.
  • [23] S. Yan, Z. Yang, H. Li, L. Guan, H. Kang, G. Hua, and Q. Huang, “Implicit autoencoder for point cloud self-supervised representation learning,” CoRR, vol. abs/2201.00785, 2022.
• [24] W. Straßer, “Schnelle Kurven- und Flächendarstellung auf graphischen Sichtgeräten,” Ph.D. dissertation, 1974.
  • [25] S. Katz, A. Tal, and R. Basri, “Direct visibility of point sets,” ACM Trans. Graph., vol. 26, no. 3, p. 24, 2007.
  • [26] Q. Zhang, J. Hou, and Y. Qian, “Pointmcd: Boosting deep point cloud encoders via multi-view cross-modal distillation for 3d shape recognition,” IEEE Transactions on Multimedia, 2023.
  • [27] S. Katz, G. Leifman, and A. Tal, “Mesh segmentation using feature point and core extraction,” Vis. Comput., vol. 21, no. 8-10, pp. 649–658, 2005.
  • [28] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 77–85.
  • [29] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3d object reconstruction from a single image,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 2463–2471.
  • [30] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, vol. 139, 2021, pp. 8748–8763.
  • [31] S. Cho, S. Ji, J. Hong, S. Jung, and S. Ko, “Rethinking coarse-to-fine approach in single image deblurring,” in IEEE/CVF International Conference on Computer Vision, 2021, pp. 4621–4630.
  • [32] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Trans. Computational Imaging, vol. 3, no. 1, pp. 47–57, 2017.
  • [33] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “Shapenet: An information-rich 3d model repository,” CoRR, vol. abs/1512.03012, 2015.
  • [34] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019, pp. 1–8.
  • [35] ——, “SGDR: stochastic gradient descent with warm restarts,” in 5th International Conference on Learning Representations, ICLR, 2017.
  • [36] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912–1920.
  • [37] M. A. Uy, Q. Pham, B. Hua, D. T. Nguyen, and S. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data,” in IEEE/CVF International Conference on Computer Vision, 2019, pp. 1588–1597.
  • [38] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph CNN for learning on point clouds,” ACM Trans. Graph., vol. 38, no. 5, pp. 146:1–146:12, 2019.
  • [39] Y. Liu, B. Fan, S. Xiang, and C. Pan, “Relation-shape convolutional neural network for point cloud analysis,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8895–8904.
  • [40] H. Zhao, L. Jiang, J. Jia, P. H. S. Torr, and V. Koltun, “Point transformer,” in IEEE/CVF International Conference on Computer Vision, 2021, pp. 16 239–16 248.
  • [41] H. Liu, M. Cai, and Y. J. Lee, “Masked discrimination for self-supervised learning on point clouds,” in Computer Vision - ECCV, vol. 13662, 2022, pp. 657–675.
  • [42] Z. Guo, X. Li, and P. Heng, “Joint-mae: 2d-3d joint masked autoencoders for 3d point cloud pre-training,” CoRR, vol. abs/2302.14007, 2023.
  • [43] Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud auto-encoder via deep grid deformation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 206–215.
  • [44] J. Sauder and B. Sievers, “Self-supervised deep learning on point clouds by reconstructing space,” in Advances in Neural Information Processing Systems, 2019, pp. 12 942–12 952.
  • [45] C. Sharma and M. Kaul, “Self-supervised few-shot learning on point clouds,” in Advances in Neural Information Processing Systems, 2020, pp. 7212–7221.
  • [46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
  • [47] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. J. Guibas, “A scalable active framework for region annotation in 3d shape collections,” ACM Trans. Graph., vol. 35, no. 6, pp. 210:1–210:12, 2016.
  • [48] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. K. Brilakis, M. Fischer, and S. Savarese, “3d semantic parsing of large-scale indoor spaces,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 1534–1543.
  • [49] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 2432–2443.
  • [50] I. Misra, R. Girdhar, and A. Joulin, “An end-to-end transformer model for 3d object detection,” in 2021 IEEE/CVF International Conference on Computer Vision, ICCV.   IEEE, 2021, pp. 2886–2897.
  • [51] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [52] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, “Pointclip: Point cloud understanding by CLIP,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8542–8552.