PVT: Point-Voxel Transformer for Point Cloud Learning

Cheng Zhang¹, Haocheng Wan^1∗, Xinyi Shen², Zizhao Wu¹
¹Hangzhou Dianzi University, Hangzhou China
²University College London, London UK
{zhangcheng828,wanhaocheng2022,xinyishen2018,wuzizhao}@foxmail.com, @163.com, @hdu.edu.cn These authors contributed equally.Corresponding author.

Abstract

The recently developed pure Transformer architectures have attained promising accuracy on point cloud learning benchmarks compared to convolutional neural networks. However, existing point cloud Transformers are computationally expensive because they waste a significant amount of time on structuring irregular data. To solve this shortcoming, we present the Sparse Window Attention (SWA) module to gather coarse-grained local features from non-empty voxels. The module not only bypasses the expensive irregular data structuring and invalid empty voxel computation, but also obtains linear computational complexity with respect to voxel resolution. Meanwhile, we leverage two different self-attention variants to gather fine-grained features about the global shape according to different scale of point clouds. Finally, we construct our neural architecture called Point-Voxel Transformer (PVT), which integrates these modules into a joint framework for point cloud learning. Compared with previous Transformer-based and attention-based models, our method attains a top accuracy of 94.1% on the classification benchmark and 10 $\times$ inference speedup on average. Extensive experiments also validate the effectiveness of PVT on semantic segmentation benchmarks. Our code and pretrained model are avaliable at https://github.com/HaochengWan/PVT.

Keywords Point Cloud Learning, Transformer, Neural Network, 3D Vision.

1 Introduction

Point cloud learning has been receiving increasing attention from both industry and academia due to its wide applications including autonomous driving, robotics, AR/VR, etc. In these settings, sensors like LIDAR produce irregular and unordered sets of points that correspond to object surfaces. However, how to capture semantics directly and quickly from these data remains a challenge for point cloud learning.

Most existing point cloud learning methods can be classified into two categories in terms of data representations: voxel-based models and point-based models. The Voxel-based models generally rasterize point clouds onto regular grids and apply 3D convolution for feature learning [67, 30, 46]. These models are computationally efficient due to their excellent memory locality, but the inevitable information loss degrades the fine-grained localization accuracy [35]. By contrast, point-based models naturally preserve the accuracy of point location, but are generally computationally intensive [41, 4, 18].

Recently, Transformer architecture has drawn great attention in natural language processing and 2D vision because of its superior capability in capturing long-range dependencies. Powered by Transformer [40] and its variants [26, 8], point-based models have applied self-attention (the core unit of Transformer) to extract features from point clouds and improve performance significantly [14, 9, 65, 53]. However, most of them suffer from the time-consuming process of sampling and aggregating features from irregular points, which becomes the efficiency bottleneck [27].

In this paper, we study how to design an efficient and high-performance Transformer architecture while avoiding the shortcoming of previous point cloud Transformers. We observe that voxel-based models have regular data locality and can efficiently encode coarse-grained features, while point-based networks preserve the accuracy of location information with the flexible fields and can effectively aggregate fine-grained features. Inspired by this, we propose a novel point cloud learning architecture, namely, Point-Voxel Transformer (PVT), which combines the ideas of voxel-based and point-based models. Our network focuses on how to fully exploit the potential of the two models mentioned above in Transformer architecture, capturing useful discriminative features from 3D data.

To investigate this, we conduct self-attention (SA) computation in voxels to obtain efficient learning pattern while performing SA in points to preserve the accuracy of location information with the flexible fields. However, directly performing SA computation to voxels is infeasible, mainly owing to two facts. First, non-empty voxels are sparsely distributed in the voxel domain and only account for a small proportion of total voxels [54]. Second, the original SA computation leads to quadratic complexity with respect to the number of voxels, making it unsuitable for various 3D vision tasks, particularly for the dense prediction tasks of object detection and semantic segmentation. To tackle these issues, we design a sparse window attention (SWA) module which only has linear computational complexity by computing SA locally within the non-overlapping 3D window. A key design element of SWA lies in its sparse attention computing, in which a GPU-based Rule Book stores the non-empty voxel indexes. On the basis of this strategy, we can bypass the invalid computation of empty voxels and retain the original 3D shape structure. For cross-window information interaction, inspired by Swin Transformer [26], we propose to apply shifted window to enlarge the receptive field. In addition, we leverage two different SA variants to capture the global information according to different scales of point clouds. For small-scale point clouds, we propose relative-attention (RA), which extends the SA mechanism to consider representations of the relative position, or distances between points. The advantage of RA is that the absolute coordinates of the same object can be completely different with rigid transformations; thus, injecting relative position representations (RPR) in our structure is generally more robust. We observe that our network obtains significant improvements over not using this RPR term without adding extra training parameters. Nevertheless, with tens of thousands of points (e.g., SemanticKITTI [3]) as inputs, directly applying the RA module in points incurs unacceptable $\mathcal{O}(N^{2})$ memory consumption, where $N$ is the input point number. Thus, for large-scale point clouds, we perform External Attention (EA), a linear attention variant to avoid the $\mathcal{O}(N^{2})$ computational complexity of RA module.

Based on these modules, We propose a two-branch PVT that mainly consists of the voxel branch and the point branch (Figure 1). As illustrated in Figure 4, by behaving SWA in the voxel branch and performing RA (or EA) in the point branch, our method disentangles the coarse-grained local feature aggregation and the fine-grained global context transformation so that each branch can be implemented efficiently and accurately.

By using the PVT, we construct PVT networks for various 3D tasks. These networks can capably serve as a general-purpose backbone for point cloud learning. In particular, we conduct the classiﬁcation experiment on the ModelNet40 [47] dataset and obtain the state-of-the-art accuracy of 94.1% (no voting), while being on average 10 $\times$ faster than its Transformer baselines. On ShapeNet Part [24], S3DIS [1], and SemanticKITTI [3] datasets, our model also obtains strong performance (86.6%, 69.2%, and 64.9% mIoU, respectively).

The main contributions are summarized as following:

•

We propose PVT, the first Transformer-based approach, to deeply incorporate the advantages from both point-based and voxel-based networks to our knowledge.
•

We propose an efficient local attention module, named SWA, which achieves linear computational complexity with respect to the input voxel size and bypasses the invalid empty voxel computation.
•

Extensive experiments show that our PVT model attains competitive results on general point cloud learning tasks with 10 $\times$ latency speed-up than its baselines.

2 Related works

2.1 Voxel-based models on points

To leverage the success of convolutional neural networks on 2D images, many researchers have endeavored to project 3D point clouds directly onto voxels and perform 3D convolution for information aggregation [6, 21]. However, the memory and computational consumption of such full-voxel-based models increases cubically with respect to the voxel’s resolution. To tackle this shortcoming, O-CNN [43], OctNet [33], and kd-Net [17] constructed tree structures for 3D voxels to bypass the invalid computation of the empty voxels. Although such methods are efficient in data structuring, geometric details are lost during the projection because multiple bucketing points into the same voxel leads to indistinguishable features for learning.

Compared with most voxel-based 3D models, our model is more efficient and effective for the following reasons: 1) Instead of using 3D convolutions, we propose SWA to handle the sparsity characteristic of voxels that have a linear computational complexity with respect to voxel resolution and bypasses the invalid computation of empty voxels. 2) As we employ RA in the point domain, an inherent advantage is that we can avoid feature loss during voxelization, maintaining the geometric details of a point cloud.

2.2 Point-based models for point cloud learning

Instead of voxelization, it is possible to make a neutral network that consumes directly on point clouds. [29] proposed PointNet, the pioneering work that learns directly on sparse and unstructured point clouds. Inspired by PointNet, previous works introduced sophisticated neural modules to learn per-point features. These models can be generally classified as: 1) neighbouring feature pooling [63, 16], 2) graph message passing [45, 59, 42], and 3) attention-based or Transformer-based models [34, 60, 55, 14, 65, 9].

Refer to caption — Figure 1: Model Architecture: The model for part segmentation take as input N points and feeds them into 3 stacked PVT blocks to learn a semantically rich and discriminative representation for each point, followed by a MLP layer to generate the output feature. After that, we leverage the max-pooling and repeating operators to extract an effective global feature representing the entire point cloud. Note that shortcut connections are used to extract multi-scale features and one MLP layer (1280) to aggregate multi-scale features, where we concatenate features from previous layers to get a 64+64+128+1024=1280-dimensional point cloud. Finally, we predict the final point-wise segmentation scores for the input point cloud and the part label of a point is also determined as the one with maximal score. PVT block (grey box): The PVT block is composed of two branches. The upper branch is *voxel*-based for aggregating local features and the lower is *point*-based for capturing global features. We can effectively fuse two branches because they are providing complementary information. $R$ and $w$ denote the voxel resolution size and 3d window size, respectively. $\bigoplus$ : addition, $\bigotimes$ : concatenation.

Owing to the sparsity and irregularity of point clouds, the methods that directly consume points have achieved state-of-the-art performance. However, the cost of data structuring of such methods has become the computation burden, especially on the large-scale point clouds dataset [27, 51]. In this study, we handle this shortcoming by using the SWA module.

2.3 Self-attention and Transformer

[2] proposed a neural machine translation method with an attention mechanism, in which attention weight is computed through the hidden state of an RNN. Then [25] further proposed self-attention to visualize and interpret sentence embeddings. Subsequent works employed self-attention layers to replace some or all the spatial convolution layers, such as Transformer for machine translation [40]. [7] proposed the bidirectional transformers (BERT), which is one of the most powerful models in the NLP field.

Given the success of self-attention and Transformer architectures in NLP, researchers have applied them to vision tasks [15, 32, 62]. For instance, [8] proposed an image recognition network, ViT, which directly applied a Transformer architecture on image patches and achieved better performance than the traditional convolutional neural networks. [26] recently introduced Swin Transformer to incorporate inductive bias for spatial locality, hierarchy and translation invariance.

2.4 Transformers on point cloud

Recently, plenty of researchers have attempted to explore Transformer-based architectures for point cloud learning. [9] proposed PT¹ to extract global features by introducing the dot-product SA mechanism. [14] proposed offset-attention to calculate the offset difference between the SA features and the input features by element-wise subtraction. Moreover, [19] proposed PT² to build local vector attention in neighborhood point sets and achieved significant progress. However, they wasted a high percentage of the total time on structuring the irregular data, which becomes the efficiency bottleneck. In this work, we study how to solve this shortcoming, so as to design an efficient Transformer-based architecture for point cloud analysis.

There also exists some Transformer-based architecture for various point cloud processing tasks. For instance, [11] proposed the SE(3)-Transformer, a variant of the self-attention module for 3D point clouds and graphs, which is equivariant under continuous 3D roto-translations. Late, [57] and [49] have attempted to explore Transformer-based architectures for point cloud completion and achieved significant performance on all completion tasks. Recently, [58] proposed Point-BERT to unleash the scalability and generalization of Transformers for 3D point cloud representation learning. Extensive experiments also demonstrate that Point-BERT significantly improves the performance of standard point cloud Transformers. Different from these Transformers, in this paper, we study how to design an efficient and high-performance Transformer architecture for point cloud learning, making it suitable for edge devices with limited computational resources or real-time autonomous driving scenarios.

3 Method

Overview

In contrast to the previous point cloud Transformers [14, 9, 65], we design the network as efficient as possible with high accuracy so that it can be widely used on various 3D tasks.

We introduce an efficient neural architecture for point cloud learning, namely, PVT. Formally, given a point cloud embedding $F\in\mathbb{R}^{N\times D}$ , our PVT is designed to map the input features $F$ to a new set of point feature $F^{\prime}\in\mathbb{R}^{N\times D}$ . As illustrated in Figure 1, the PVT consists of two main branches: a voxel branch and a point branch. We leverage the voxel branch to map the inputs $F$ to $F_{local}\in\mathbb{R}^{N\times D}$ , which aggregates local features in the voxel domain. However, full voxel method will inevitably encounter information loss during voxelization. Thus, we utilize the point branch to map the inputs $F$ to $F_{global}\in\mathbb{R}^{N\times D}$ , which can directly extract global features for each individual point. With both local features and aggregated global context, we can efficiently fuse two branches into an addition layer as both provide complementary information.

Below we detail the two branches in Sections 3.1 and 3.2. Section 3.3 details our feature fusion module, and Section 3.4 discusses the relationship between the proposed model and the prior works.

3.1 Voxel branch

This branch aims to effectively capture local information, which can bypass expensive sampling and neighbor points querying. Specifically, it contains three steps: voxelization, feature aggregation, and devoxelization.

The voxelized and devoxelized methods map the input point cloud $\mathcal{P}$ to a new set of voxel features $V$ and transform the voxel-wise features back to the point-wise features $F_{local}$ (Readers can refer to [27] for more details). In this work, we apply the standard Transformer architecture [40] to perform feature aggregation on regular 3D voxels, which can improve accuracy significantly. However, the standard Transformer architecture conducts global-SA which leads to quadratic complexity with respect to the number of voxels, making it unsuitable for many 3D vision problems requiring dense prediction, such as semantic segmentation.

Window Attention: To obtain efficient modeling power, we propose to compute SA within local 3D windows, which are arranged to evenly partition the voxel space in a non-overlapping manner. Assume that each local 3D window contains $W\times W\times W$ voxels and $R$ denotes the voxel resolution. The computational complexity of a global-SA and a window-based one on $R^{3}$ voxels are:

	$\displaystyle\Omega(Global\textbf{-}SA)=4R^{3}D^{2}+2(R^{3})^{2}D$		(1)
	$\displaystyle\Omega(Window\textbf{-}SA)=4R^{3}D^{2}+2B^{3}R^{3}D$		(2)

where the former is quadratic to the number of voxels $R^{3}$ , the latter is linear when $W$ is fixed, $D$ is the dimension of features. In summary, global-SA computation is generally unaffordable for a large voxel resolution, whereas the local window-SA is scalable.

Sparse Window Attention: Different from 2D pixels densely placed on an image plane, non-empty voxels only account for a small proportion of total voxels. Inspired by [13, 54], we design Sparse Window Attention to handle the sparsity characteristic of voxels. As shown in Figure 2, for each querying index $v_{i}$ , we first use Window Attention to determine all neighboring voxel indices in the attention field, and then adopt a Hash table to get the hashed integer neighboring voxel indices. Last, a GPU-based Rule Book stores the hashed voxel indices as keys, and the corresponding indices for the non-empty voxel array as values. Finally, we can perform Window Attention to gather the coarse-grained local features. Note that all the steps can be conducted in parallel on GPUs by assigning each querying voxel $v_{i}$ a separate CUDA thread.

The SWA lacks information interaction across windows, which may limit the representation power of our model. Thus, we extend the shifted 2D window mechanism of Swin Transformer [26] to 3D windows for the purpose of introducing cross-window information interaction while maintaining the efficient computation of non-overlapping SWA. Readers can refer to Swin Transformer for more details

As shown in Figure 1, with the cyclic shifted window partitioning approach, the step of feature aggregation of this branch can be described as follows:

Algorithm 1 Pseudo-code for the Voxel Branch

Input: $P$ , an array with shape [B, N, C]. $F$ , an array with shape [B, N, D].
Output: $F_{local}$ , an array with shape [B, N, D]
F_1 = voxelize(P, F) ${\color[rgb]{0,1,1}\#shape=[B,R^{3},D]}$
F_1 = cyclic_shift(F_1)
F_2 = SWA(layernorm(F_1)) + F_1
F_2 = MLP(layernorm(F_2)) + F_2
F_2 = reverse_cyclic_shift (F_2)
F_3 = SWA(layernorm(F_2)) + F_2
$F_{local}$ = MLP(layernorm(F_3)) + F_3

3.2 Point branch

The voxel branch gathers the neighborhood information with low resolution. However, in order to capture long-range dependencies, low-resolution voxel branch alone is limited. To this end, we attempt to employ the following self-attention on the entire point cloud for global context aggregation, which is computed as:

	$\displaystyle F_{sa}$	$\displaystyle=softmax(QK^{T})V,\quad F_{sa}\in\mathbb{R}^{N\times D}$		(3)
	$\displaystyle F_{global}$	$\displaystyle=MLP(F_{sa})+F,\quad F_{global}\in\mathbb{R}^{N\times D}$		(4)

where (Q, K, V) $\in\mathbb{R}^{N\times D}$ is generated by shared linear transformations and the input features $F$ and are all ordered independent. Moreover, softmax and weighted sum are both permutation-independent operators. Thus, the self-attention computation is permutation-invariant, making it well-suited to handle the irregular, disordered 3D points.

Relative Attention: Self-attention mentioned above fails to incorporate relative position representations in its structure, whereas such ability is very important to 3D visual tasks. For example, the absolute coordinates of the same object can be completely different with rigid transformations. Therefore, injecting relative position representations are generally more robust. In this study, we design our relative-attention to embed relative position representations, which are not well studied in prior point cloud learning works.

First of all, by embedding RPR into the scaled dot-product self-attention module, Eq 3 can be re-formulated as:

\displaystyle F_{ra}=softmax(QK^{T}+B)V

(5)

where $B\in\mathbb{R}^{N\times N}$ is the relative representations bias.

Suppose that the original point cloud with $N$ points is denoted by $\mathcal{P}=\{p_{i}\}_{i=1}^{N}\subseteq\mathbb{R}^{3}$ . We compute the relative position $B$ as follows:

\displaystyle B_{p_{i},p_{j},m}=p_{i,m}-p_{j,m},\quad m\in\{x,y,z\}.

(6)

To map relative coordinates to the corresponding position encoding, we maintain three learnable look-up tables $t_{x},t_{y},t_{z}\in\mathbb{R}^{L\times N\times N}$ corresponding to the x, y and z axis, respectively. As the relative coordinates are continuous floating-point numbers, we uniformly quantize the range of $B_{p_{i},p_{j},m}$ into $L$ discrete parts and map the relative coordinates $B_{p_{i},p_{j},m}$ to the indices of the tables as:

\displaystyle idx_{i,j,m}=\lfloor\frac{B_{p_{i},p_{j},m}+s_{max}}{s_{quad}}\rfloor

(7)

where $s_{max}$ is the maximum size of point cloud coordinates and $s_{quad}=\frac{2\cdot s_{max}}{L}$ is the quantization size, and $\lfloor\cdot\rfloor$ denotes floor rounding.

We look up the tables to retrieve corresponding embedding with the index and sum them up to obtain the position encoding of

\displaystyle B_{p_{i},p_{j}}=\sum_{m=1}^{3}t_{m}[idx_{i,j,m}],

(8)

where $t[idx]$ indicates the $idx$ -th entry of the learnable look-up table $t$ , and $B_{p_{i},p_{j}}$ means the relative position encoding between $p_{i}$ and $p_{j}$ .

External Attention: Although the RA module has demonstrated significant performance on small-scale point clouds, it is unsuitable for large-scale point clouds (e.g., SemanticKITTI [3]) due to its unacceptable $\mathcal{O}(N^{2})$ memory consumption. Therefore, in this work, we perform External Attention computation on large-scale point clouds. External Attention, is a novel attention mechanism based on two external, small, learnable, shared memories, which can be implemented easily by simply using two cascaded linear layers and two normalization layers. It has linear complexity and implicitly considers the correlations between all data samples, making it suitable for large-scale point clouds.

3.3 Feature fusion

We effectively fuse the outputs of two branches with an addition as they are providing complementary information:

\displaystyle F^{\prime}=F_{local}+F_{global},\quad F^{\prime}\in\mathbb{R}^{N\times D}

(9)

3.4 Relationship to prior works

	Layer Type	Time Complexity per Layer
The-	3D Convolutions	$\mathcal{O}(k\cdot R^{3}\cdot D^{2})$
voxel-	Window-attention	$\mathcal{O}(v\cdot R^{3}\cdot D)$
branch	SWA	$\mathcal{O}(r^{2}\cdot v\cdot R^{3}\cdot D)$
The-	1D Convolutions	$\mathcal{O}(k\cdot N\cdot D^{2})$
point-	Relative-attention	$\mathcal{O}(N^{2}\cdot D)$
branch	External-attention	$\mathcal{O}(N\cdot D)$

Table 1: Per-layer complexity for different layer types.

R

is the resolution of voxels,

v=W\times W\times W

is the number of voxels in a same 3D window,

r

denotes the proportion of non-empty voxels in a local 3D window,

N

is the number of points,

D

is the representation dimension, and

k

is the kernel size of convolutions.

The proposed PVT is related to several prior works which includes PVCNN [27], PCT [14], PT¹[9], PT²[65],.

Although we are inspired by the idea of PVCNN [27], our PVT is different in several ways: 1) in the voxel branch, PVCNN uses a 3D convolution to gather local information while we employ SWA computation within each local window, which is highly efficient. Per-layer complexity for different layer types are shown in Table 1, for large-scale point clouds, approximately 10% of the voxels are non-empty ( $r=0.1$ ) [51] which means that our SWA can save significant computation power compared with window-attention. 2) in the point branch, PVCNN uses 1D convolution, which is computation efficient but lacks the global context modeling capability. By performing RA (or EA) computation on the entire points, our method gathers the global modeling power.

Unlike prior Transformer-based 3D models that need to gather the downsampled points and find the corresponding neighbors by using expensive FPS and $k$ -NN in point domain, our approach does not require explicitly identify which point is the farthest and what are in the neighboring set. Instead, the voxel branch observes regular data and learns to capture local features using SWA. Additionally, the point branch only needs to perform RA (or EA) on the entire point cloud, which also does not require to find the neighboring points. Thus, our approach obtains highly efficiency than PCT, PT¹ and PT² (see Table 4).

4 Experiments

We now discuss the PVT that can be constructed from PVT blocks for different point cloud learning tasks: shape classification, and object and semantic segmentation. The performance is quantitatively evaluated with four metrics, including overall accuracy (OA), average precision (AP), the intersection over union (IoU), and mean IoU (mIoU). For the sake of fair comparison with baselines, we report the measured latency on a GTX 2080 GPU to reﬂect the efﬁciency but evaluate other indicators on a GTX 3090 GPU.

Model. The architecture used for the segmentation task is shown in Figure 1. In our settings, dropout with keep probability of 0.5 is used in the last two linear layers. All layers include ReLU and batch normalization. In addition, for other point cloud learning tasks, we use the similar architecture as in segmentation. Readers can refer to our source code for more derails.

Baselines. We select four models as the comparison baselines, including the strong attention-based model PointASNL and the three powerful Transformer-based networks PT¹, PT², and PCT.

4.1 Shape Classification

OA $<$ 92.5
Model	Input	Points	OA(%)	Latency
PointNet	xyz	16 $\times$ 1024	89.2	13.6ms
PointNet++ [31]	xyz,nor	16 $\times$ 1024	91.9	35.3ms
SO-Net [22]	xyz,nor	8 $\times$ 2048	90.9	$-$
PointGrid [21]	xyz,nor	16 $\times$ 1021	92.0	$-$
SpiderCNN [52]	xyz,nor	8 $\times$ 1024	92.4	82.6ms
PointCNN [23]	xyz	16 $\times$ 1024	92.2	221.2ms
PointWeb [64]	xyz,nor	16 $\times$ 1024	92.3	$-$
PVCNN [27]	xyz,nor	16 $\times$ 1024	92.4	24.2ms
OA $>$ 92.5
KPConv [39]	xyz	16 $\times$ 6500	92.9	120.5ms
DGCNN [45]	xyz	16 $\times$ 1024	92.9	85.8ms
LDGCNN [59]	xyz	16 $\times$ 1024	92.7	$-$
PointASNL [53]	xyz,nor	16 $\times$ 1024	93.2	923.6ms
PT¹ [9]	xyz,nor	16 $\times$ 1024	92.8	320.6ms
PT² [65]	xyz,nor	8 $\times$ 1024	93.7	530.2ms
PCT [14]	xyz	16 $\times$ 1024	93.2	92.4ms
PVT (Ours)	xyz	16 $\times$ 1024	93.7	45.5ms
PVT (Ours)	xyz,nor	16 $\times$ 1024	94.1	48.6ms

Table 2: Results on ModelNet40 [48]. Compared with previous Transformer-based models, our PVT achieves the state-of-the-art accuracy with 10

\times

measured speed-up on average.

Input	PT¹	PT²	PCT	Ours
16 $\times$ 512	247.5	430.7	76.4	31.5
16 $\times$ 1024	320.6	530.2	92.4	45.5
8 $\times$ 2048	460.5	720.4	145.4	71.2

Table 3: The speed comparison of different models when the number of input points varies.

Data. We conduct the classification experiment on the ModelNet40 [47] dataset. ModelNet40 includes 12,311 CAD models from 40 different object classes, in which 9843 models are used for training and the rest for testing. We follow the experimental configuration of PointNet, i.e., for each model, we uniformly sample 1024 points with 3 channels (or 6) of spatial location (and normal) as input; the point cloud is re-scaled to fit the unit sphere.

Setting. We utilize random translation, random anisotropic scaling and random input dropout strategies to augment the input points data during training. During testing, no data augmentation or voting methods were used. For classification on ModelNet40, the SGD optimizer was used for 200 epochs with the batch size 32. We set the initial learning rate to 0.01 and adopt a cosine annealing schedule to adjust the learning rate at every epoch.

Results. The results are shown in Tables 2 and 3, we can see that our PVT outperforms most previous models. Compared with its baselines, such as PCT, PT¹ and PT², PointASNL, our PVT not only achieves state-of-the-art accuracy of 94.1%, but also has the best speed-accuracy trade-off (10 $\times$ faster on average). Figure 3 also provides an accuracy plot under equal-epoch setting. As can be seen, our method outperforms all Transformer-based methods, being the fastest and most accurate towards convergence.

Model	Params	FLOPs	SDA( $\%$ )	OA( $\%$ )
PointNet	3.47M	0.45G	0.0	89.2
PointNet++(SSG)	1.48M	1.68G	43.5	90.7
PointNet++(MSG)	1.74M	4.09G	47.6	91.9
DGCNN	1.81M	2.43G	57.2	92.9
PointASNL	3.98M	5.92G	39.8	93.1
PT¹	21.1M	5.05G	32.5	92.8
PT²	9.14M	17.1G	65.4	93.7
PCT	2.88M	2.17G	24.6	93.2
PVT(Ours)	2.76M	1.93G	9.1	94.1

Table 4: Computational resource requirements. SDA means the rate of total runtime on structuring the sparse data.

4.2 Analysis of computational requirements

Now, we analyze the computational requirements of PVT and several other baselines by comparing the floating point operations required (FLOPs) and number of parameters (Params) in Table 4. PVT has the lowest memory requirements with only 2.45M parameters, and also puts a low load on the processor of only 1.62G FLOPs, yet delivers highly accurate results of 94.1%. These characteristics make it suitable for deployment on a mobile device. In addition, PT² has attained top accuracy on various point cloud learning benchmarks, but its shortcomings of slow inference time and high computing cost are also obvious. In the summary in Table 4, PVT only spends 6.3% of the total runtime on structuring the irregular data, which is much lower than previous Transformer-based models. Overall, PVT has the best accuracy and the lowest computational and memory requirements compared with its baselines.

4.3 Object part segmentation

Model	pIoU	Areo	Bag	Cap	Car	Chair	Ear Phone	Guitar	Knife	Lamp	Laptop	Motor	Mug	Pistol	Rocket	Skate Board	Table
# Shapes		2690	76	55	898	3758	69	787	392	1547	451	202	184	283	66	152	5271
PointNet	83.7	83.4	78.7	82.5	74.9	89.6	73.0	91.5	85.9	80.8	95.3	65.2	93.0	81.2	57.9	72.8	80.6
P2Sequence	85.1	82.6	81.8	87.5	77.3	90.8	77.1	91.1	86.9	83.9	95.7	70.8	94.6	79.3	58.1	75.2	82.8
PointASNL	86.1	84.1	84.7	87.9	79.7	92.2	73.7	91.0	87.2	84.2	95.8	74.4	95.2	81.0	63.0	76.3	83.2
RS-CNN	86.2	83.5	84.8	88.8	79.6	91.2	81.1	91.6	88.4	86.0	96.1	73.7	94.1	83.4	60.5	77.7	83.6
PT¹	85.9	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$
PCT	86.4	85.0	82.4	89.0	81.2	91.9	71.5	91.3	88.1	86.3	95.8	64.6	95.8	83.6	62.2	77.6	73.7
PVT (Ours)	86.6	85.3	82.1	88.7	82.1	92.4	75.5	91.0	88.9	85.6	95.4	76.2	94.7	84.2	65.0	75.3	81.7

Table 5: Results of part segmentation on ShapeNet. pIoU means part-average Intersection-over-Union.

Model	mIoU	Ceiling	Floor	Wall	Bean	Column	Window	Door	Chair	Table	Bookcase	Sofa	Board	Clutter
PointNet	41.09	88.80	97.33	69.80	0.05	3.92	46.26	10.76	52.61	58.93	40.28	5.85	26.38	33.22
PointNet++	50.04	90.79	96.45	74.12	0.02	5.77	43.59	25.39	69.22	76.94	21.45	55.61	49.34	41.88
PointNet++-CE	51.56	92.28	96.87	74.77	0.02	7.04	46.78	25.42	69.13	79.18	26.67	53.39	54.61	44.03
DGCNN	47.08	92.42	97.46	76.03	0.37	12.00	51.59	27.01	64.85	68.58	7.67	43.76	29.44	40.83
PVCNN	56.12	91.23	97.54	77.13	0.29	13.02	51.72	26.74	68.52	75.48	28.64	53.29	27.21	41.92
BPM [12]	61.43	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$
PointASNL [12]	62.60	94.31	98.42	79.13	0.00	26.71	55.21	66.21	83.32	86.83	47.64	68.32	56.41	52.12
IAF-Net [50]	64.60	91.41	98.60	81.80	0.00	34.90	62.00	54.70	79.70	86.90	49.90	72.40	74.80	52.10
PVT(Ours)	68.21	91.18	98.76	86.23	0.31	34.21	49.90	61.45	81.62	89.85	48.20	79.96	76.45	54.67

Table 6: Indoor scene segmentation results on the S3DIS dataset, evaluated on Area5. From this table, we can see that the proposed PVT outperforms most of previous 3D models in some categories signiﬁcantly,

Model	Latency	mIoU	Car	Bicycle	MotorCycle	Truck	Other-vehicle	Person	Bicyclist	Motorcyclist	Road	Parking	Sidewalk	Other-ground	Building	Fence	vegetation	Trunk	Terrain	Pole	Traffic-sign
PointNet^∗	500	14.6	46.3	1.3	0.3	0.1	0.8	0.2	0.2	0.0	61.6	15.8	35.7	1.4	41.4	12.9	31.0	4.6	17.6	2.4	3.7
PointNet++^∗	5900	20.1	53.7	1.9	0.2	0.9	0.2	0.9	1.0	0.0	72.0	18.7	41.8	5.6	62.3	16.9	46.5	13.8	30.0	6.2	8.9
PVCNN^∗	146	39.0	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$
TangentConv^∗	3000	40.9	83.9	63.9	33.4	15.4	83.4	90.8	15.2	2.7	16.5	12.1	79.5	49.3	58.1	23.0	28.4	8.1	49.0	35.8	28.5
PointASNL	7260	46.8	87.4	74.3	24.3	1.8	83.1	87.9	39.0	0.0	25.1	29.2	84.1	52.2	70.6	34.2	57.6	0.0	43.9	57.8	36.9
RandLA-Net^∗	880	53.9	94.2	26.0	25.8	40.1	38.9	49.2	48.2	7.2	90.7	60.3	73.7	20.4	86.9	56.3	81.4	61.3	66.8	49.2	47.7
KPConv^∗	3560	58.8	96.0	30.2	42.5	33.4	44.3	61.5	61.6	11.8	88.8	61.3	72.7	31.6	90.5	64.2	84.8	69.2	69.1	56.4	47.4
BAAF	6880	59.9	90.9	74.4	62.2	23.6	89.8	95.4	48.7	31.8	35.5	46.7	82.7	63.4	67.9	49.5	55.7	53.0	60.8	53.7	52.0
SPVCNN ^∗	110	63.7	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$
PVT(Ours)	142	64.9	97.7	76.9	59.8	30.5	91.8	94.4	49.2	34.3	42.3	62.1	81.3	65.6	70.9	47.5	68.7	86.2	70.8	52.1	54.0

Table 7: Results of outdoor scene segmentation on SemanticKITTI.

*

: results directly taken from [36]

Data. The model is also evaluated on ShapeNet Parts [24]. It is composed of a total of 16,880 models (14,006 models are used for training, the rest for testing), each of which has 2 to 6 parts and the whole dataset is labeled in 50 different parts. A total 2048 points are sampled from each model as input and only few point sets have six labeled parts. We directly adopt the same train-test split as PointNet in our experiment.

Setting. The same training setting as in our classification task was adopted. For this task on ShapeNet Part, we train our model using the SGD optimizer for 200 epochs with the batch size 16. The initial learning rate was set to 0.1, with a cosine annealing schedule to adjust the learning rate at every epoch.

Results. Table 5 lists the class-wise segmentation results. Part-average IoU is used to evaluate our model, which is given both overall and for each object category. The results show that our PVT makes an improvement of 2.9% over PointNet. Compared with all baselines, PVT attains the top pIoU with 86.6%.

Visualization. In Figure 4, we illustrate output features from the point and voxel branches respectively, where a warmer color represents larger magnitude. As we can see, the voxel branch captures large, continuous parts while the point-based counterpart captures global shape details (e.g., table legs, airplane wings, and tail).

Model	mIoU(%)	OA(%)	mAcc(%)
PointNet	47.6	78.5	66.2
G+RCU [10]	49.7	81.1	$-$
TangentConv [37]	52.8	$-$	$-$
RSNet [28]	56.5	$-$	66.5
3P-RNN [56]	56.3	86.9	$-$
SGPN [44]	50.4	$-$	$-$
SPGraph [20]	62.1	85.5	73.0
HAPGN [5]	62.9	85.8	$-$
PVCNN	63.2	85.8	72.5
ShellNet [61]	66.8	87.1	$-$
PVT(Ours)	69.2	88.3	76.2

Table 8: Results of semantic segmentation of 3D indoor scenes on S3DIS, evaluated with 6-fold cross-validation. From this table, we can see that the proposed PVT outperforms most of previous 3D models.

4.4 Indoor Scene Segmentation

Data. To further assess our network, we conduct semantic segmentation task on S3DIS dataset [1], which includes 273 million points from six indoor areas of three different buildings. Each point is annotated with a semantic label from 13 classes—e.g., beam, bookcase, chair, column, and window. We follow [38] and [29] and uniformly sampled points from blocks of area size $1m\times 1m$ , where each point is represented by a 9D vector (XYZ, RGB, and normalized spatial coordinates).

Setting. During the training process, we generate training data by randomly sampling 4,096 points from each block on-the-fly. Following [45], we utilize Area-5 as the test scene and all the other areas for training. Note that the position and RGB information of points are used to as the input features. The same training setting as in our classification task is adopted, but requires 50 epochs with the batch size 8 to fulfil the experiment.

Results. The results are presented in Tables 6 and Table 8. From Tables 6, we can see that our PVT attains mIoU of 67.3%, which outperforms MLPs-based frameworks such as PointNet [29] and PointNet++ [31], graph-based methods such as DGCNN [45], attention-based models such as PointASNL [53]. Moreover, PVT attains mIoU of 69.2% under 6-fold cross-validation, substantially outperforming most prior models.

Visualization. Fig.5 shows the visualization of PVT’s predictions. The predictions of our network are very close to the ground truth. PVT captures detailed semantic structure in complex 3D scenes, which is important in our network.

4.5 Outdoor semantic Segmentation

Data. To further evaluate our network, we conduct outdoor semantic segmentation task on SemanticKITTI [3] dataset, which includes 43,552 densely labeled LIDAR scans belonging to 21 sequences. Each scan is a large-scale point cloud with $\sim 10^{5}$ points and spanning up to 160×160×20 m in 3D space. Officially, the sequences $00\sim 07$ and $09\sim 10$ (19,130 scans) are used for training, the sequence 08 (4071 scans) for validation, and the sequences $11\sim 21$ (20,351 scans) for online testing. The raw 3D points only have 3D coordinates without color information.

Setting. Based on UNet, we build our backbone network for large-scale point clouds segmentation with a stem, four down-sampling and four up-sampling stages, and the dimensions of these nine stages are 32, 64, 64, 128, 256, 256, 128, 64, and 64, respectively. As for the voxel branch, the voxel resolution is 0.05 m for segmentation experiments. We train PVT for 100 epochs on a single GeForce RTX 3090 GPU with the batch size 8. In addition, the Adam optimizer is employed to minimize the overall loss; the learning rate starts from 0.01 and decays with a rate of 0.5 after every 10 epochs.

Results. As shown in Table 7, PVT outperforms the previous state-of-the-art point-based model BAAF by 5.0% in mIoU with 48× measured speedup. Compared with strong attention-based PointASNL and point-based KPConv, our PVT achieves +18.1% and +6.1% mIoU improvements, with 51× and 25× measured speedup respectively.

5 Ablation Studies

In this section, we conduct extensive ablation study to analyze the effectiveness of different components of PVT block. The results of the ablation study are summarized in Tables 9 and 10.

Model	PB	VB	shifting	NPB	OA(%)	Latency
A	✗	✓	✓	3	92.8	33.5
B	✓	✗	✓	3	92.3	25.2
C	✓	✓	✗	3	93.0	47.9
D	✓	✓	✓	2	93.1	36.4
E	✓	✓	✓	4	93.6	72.5
PVT ^full	✓	✓	✓	3	94.1	48.6

Table 9: Ablation study on ModelNet40. PB and VB mean the point-based branch and the voxel-based branch; NPB denotes the number of PVT blocks; shifting means all self-attention modules adopt the cyclic shifted box partitioning method.

Ablation	ModelNet40(OA)	ShapeNet(mIoU)
MLP	92.6	85.3
EdgeConv	93.1	85.8
w/o rel. pos	93.2	86.3
PVT^full	94.1	86.6

Table 10: Ablation study on the relative-attention and relative position bias on two benchmarks. MLP: replace relative-attention with MLP layer in our architecture. EdgeConv: replace relative-attention with EdgeConv layer in our architecture. no rel. pos: the relative attention without an additional relative position bias term (see Eq 5).

Impact of the voxel-based and point-based branches. We set two baselines: A and B. Model A only encodes global context features by the point branch, and Model B only encodes local features by the voxel one. As reported in Table 9, the Baseline model A obtains a low accuracy of 92.8% on classification benchmarks, and model B gets 92.3%. When we combine local and global features (PVT^full), there is a notable improvement in both tasks. This means our network can take advantage of the combination of two branches, which provide richer information about the points.

Effect of the cyclic shifted windows scheme. Ablation of the shifted window method on classification is reported in Table 9 (Model C). The network with the shifted window partitioning (PVT^full) outperforms the Model C without shifting at each layer by +1.0% OA on ModelNet40. The results indicate the effectiveness and efficiency of using cyclic shifted window to build cross-window information interaction in the preceding layers.

Impact of the number of PVT blocks. we validate the impact of the PVT block by controlling the number of PVT blocks and report the results in Table 9. From this table, we can conclude the following: on one hand, reducing the number of PVT blocks can save latency, for example, compared with PVT^full, Model D saves 25% latency but incurs a loss on accuracy; on the other hand, increasing the number of PVT blocks from PVT^full can hardly support Model E accuracy benefit but leads to an increase on latency. To balance between speed and accuracy, we adopt 3 PVT blocks as our full model.

Effect of Relative-attention. We validate the impact of RA module used in our network. The results are shown in Table 10. We set two baselines. ”MLP” is a no-attention baseline that replaces RA with a MLP layer. ”EdgeConv” is a more advanced no-attention baseline that replaces RA with a EdgeConv layer. EdgeConv performs feature aggregating at each point and enables each point to exchange information with its neighboring points, but does not leverage attention mechanisms. We can see that the RA module achieves better results than the no-attention baselines. The performance gap between RA and MLP baselines is significant: 94.1% vs. 92.6% and 86.6% vs. 85.3%, an respectivel improvement of 1.5 and 1.3 absolute percentage points. Compared with EdgeConv baseline, our RA module also achieves improvements of 0.9 and 0.9 absolute percentage points, respectively.

Effect of relative position representations. Finally, we also investigate the impact of RPR used in the RA module. As demonstrated in Table 10, our PVT with RPR yields +0.8% OA/+0.4% mIoU on ModelNet40 and ShapeNet in relation to those without this term respectively, indicating the effectiveness of the RPR.

6 Conclusion and Future Work

In this work, we present PVT for efficient point cloud learning. PVT is found to surpass previous Transformer-based and attention-based models in efficiency. It achieves this by deeply combining the advantages from both voxel-based and point-based networks. To reduce the computation cost, we design a GPU-based SWA computing method that has linear computational complexity with respect to voxel resolution and bypasses the invalid empty voxel computations. In addition, we study two different self-attention variants to gather fine-grained features about the global shape according to different scales of point clouds.

In the future, we expect to promote the primary structure for other research areas, such as point cloud pretraining, generation, and completion.

References

[1] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese. Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
[2] Bahdanau, Dzmitry, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. in ICLR, 2014.
[3] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 9296–9306. IEEE, 2019.
[4] C. Chen, G. Li, R. Xu, T. Chen, M. Wang, and L. Lin. Clusternet: Deep hierarchical cluster network with rigorously rotation-invariant representation for point cloud analysis. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[5] C. Chen, S. Qian, Q. Fang, and C. Xu. Hapgn: Hierarchical attentive pooling graph network for point cloud segmentation. IEEE Transactions on Multimedia, 23:2335–2346, 2021.
[6] C. Choy, J. Y. Gwak, and S. Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[7] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. in NAACL-HLT, 2018.
[8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. in ICLR, 2021.
[9] N. Engel, V. Belagiannis, and K. Dietmayer. Point transformer. CoRR, abs/2011.00931, 2020.
[10] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe. Exploring spatial context for 3d semantic segmentation of point clouds. 2017 IEEE International Conference on Computer Vision Workshop (ICCVW), 2017.
[11] F. Fuchs, D. E. Worrall, V. Fischer, and M. Welling. Se(3)-transformers: 3d roto-translation equivariant attention networks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[12] J. Gong, J. Xu, X. Tan, J. Zhou, Y. Qu, Y. Xie, and L. Ma. Boundary-aware geometric encoding for semantic segmentation of point clouds. Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[13] B. Graham, M. Engelcke, and L. van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, June 18-22, pages 9224–9232. Computer Vision Foundation / IEEE Computer Society, 2018.
[14] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–199, Apr 2021.
[15] H. Hu, Z. Zhang, Z. Xie, and S. Lin. Local relation networks for image recognition. IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[16] Q. Huang, W. Wang, and U. a. Neumann. Recurrent slice networks for 3d segmentation of point clouds. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[17] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017.
[18] S. Lan, R. Yu, G. Yu, and L. S. Davis. Modeling local geometric structure of 3d point clouds using geo-cnn. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[19] L. Landrieu and M. Boussaha. Point cloud oversegmentation with graph-structured deep metric learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[20] L. Landrieu and M. Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[21] T. Le and D. Ye. Pointgrid: A deep network for 3d shape understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[22] J. Li, B. M. Chen, and G. H. Lee. So-net: Self-organizing network for point cloud analysis. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[23] Y. Li, R. Bu, M. Sun, and B. Chen. Pointcnn: Convolution on x-transformed points. Advances in Neural Information Processing Systems (NeurIPS), 2018.
[24] Y. Li, V. G. Kim, D. Ceylan, I. C. Shen, M. Yan, S. Hao, C. Lu, Q. Huang, A. Sheffer, and L. Guibas. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6cd):210.1–210.12, 2016.
[25] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. in ICLR, 2017.
[26] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[27] Z. Liu, H. Tang, Y. Lin, and S. Han. Point-voxel cnn for efficient 3d deep learning. Advances in Neural Information Processing Systems (NeurIPS), 2019.
[28] R. Mehta and T. Arbel. Rs-net: Regression-segmentation 3d cnn for synthesis of full resolution missing brain mri in the presence of tumours. International Workshop on Simulation and Synthesis in Medical Imaging, pages 119–129, 2018.
[29] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
[30] C. R. Qi, H. Su, M. Niebner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
[31] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems (NeurIPS), 2017.
[32] P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens. Stand-alone self-attention in vision models. CoRR, abs/1906.05909, 2019.
[33] G. Riegler, A. O. Ulusoys, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. IEEE Computer Society, 2016.
[34] Q. Shi, S. Anwar, and N. Barnes. Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[35] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[36] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, and S. Han. Searching efficient 3d architectures with sparse point-voxel convolution. In European Conference on Computer Vision, 2020.
[37] M. Tatarchenko, J. Park, V. Koltun, and Q. Y. Zhou. Tangent convolutions for dense prediction in 3d. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[38] L. Tchapmi, C. Choy, I. Armeni, J. Y. Gwak, and S. Savarese. Segcloud: Semantic segmentation of 3d point clouds. 2017 International Conference on 3D Vision (3DV), 2017.
[39] H. Thomas, C. R. Qi, J. E. Deschaud, B. Marcotegui, and L. J. Guibas. Kpconv: Flexible and deformable convolution for point clouds. IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 2017.
[41] C. Wang, B. Samari, and K. Siddiqi. Local spectral graph convolution for point set feature learning. in ECCV, 2018.
[42] L. Wang, Y. Huang, Y. Hou, S. Zhang, and J. Shan. Graph attention convolution for point cloud semantic segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[43] P. S. Wang, Y. Liu, Y. X. Guo, C. Y. Sun, and X. Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. Acm Transactions on Graphics, 36(4):72, 2017.
[44] W. Wang, R. Yu, Q. Huang, and U. Neumann. Sgpn: Similarity group proposal network for 3d point cloud instance segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2569–2578, 2018.
[45] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics, 38(5), 2018.
[46] Z. Wang and F. Lu. Voxsegnet: Volumetric cnns for semantic part segmentation of 3d shapes. IEEE Transactions on Visualization and Computer Graphics, pages 1–1, 2019.
[47] Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao. 3d shapenets for 2.5d object recognition and next-best-view prediction. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2014.
[48] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[49] P. Xiang, X. Wen, Y. Liu, Y. Cao, P. Wan, W. Zheng, and Z. Han. Snowflakenet: Point cloud completion by snowflake point deconvolution with skip-transformer. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 5479–5489. IEEE, 2021.
[50] M. Xu, Z. Zhou, J. Zhang, and Y. Qiao. Investigate indistinguishable points in semantic segmentation of 3d point cloud. Proceedings of the AAAI Conference on Artificial Intelligence, abs/2103.10339, 2021.
[51] Q. Xu, X. Sun, C.-Y. Wu, P. Wang, and U. Neumann. Grid-gcn for fast and scalable point cloud learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5661–5670, 2020.
[52] Y. Xu, T. Fan, M. Xu, Z. Long, and Q. Yu. Spidercnn: Deep learning on point sets with parameterized convolutional filters. ECCV, 2018.
[53] X. Yan, C. Zheng, Z. Li, S. Wang, and S. Cui. Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[54] Y. Yan, Y. Mao, and B. Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10), 2018.
[55] B. Yang, J. Wang, R. Clark, Q. Hu, S. Wang, A. Markham, and N. Trigoni. Learning object bounding boxes for 3d instance segmentation on point clouds. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[56] X. Ye, J. Li, H. Huang, L. Du, and X. Zhang. 3d recurrent neural networks with context fusion for point cloud semantic segmentation. Proceedings of the European Conference on Computer Vision (ECCV), pages 403–417, 2018.
[57] X. Yu, Y. Rao, Z. Wang, Z. Liu, J. Lu, and J. Zhou. Pointr: Diverse point cloud completion with geometry-aware transformers. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 12478–12487. IEEE, 2021.
[58] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. CoRR, abs/2111.14819, 2021.
[59] K. Zhang, M. Hao, J. Wang, C. D. Silva, and C. Fu. Linked dynamic graph cnn: Learning on point cloud via linking hierarchical features. arXiv:1904.10014 [cs], 2019.
[60] W. Zhang and C. Xiao. Pcan: 3d attention map learning using contextual information for point cloud based retrieval. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[61] Z. Zhang, B. S. Hua, and S. K. Yeung. Shellnet: Efficient point cloud convolutional neural networks using concentric shells statistics. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[62] H. Zhao, J. Jia, and V. K. and. Exploring self-attention for image recognition. Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[63] H. Zhao, L. Jiang, C. W. Fu, and J. Jia. Pointweb: Enhancing local neighborhood features for point cloud processing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[64] H. Zhao, L. Jiang, C.-W. Fu, and J. Jia. Pointweb: Enhancing local neighborhood features for point cloud processing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5565–5573, 2019.
[65] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun. Point transformer. IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[66] Q.-Y. Zhou, J. Park, and V. Koltun. Open3D: A modern library for 3D data processing. arXiv:1801.09847, 2018.
[67] Y. Zhou and O. Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. IEEE/CVF International Conference on Computer Vision (ICCV), pages 4490–4499, 2018.