
Investigating Attention Mechanism in 3D Point Cloud Object Detection

Shi Qiu*,1,2, Yunfan Wu*,1, Saeed Anwar1,2,3 and Chongyi Li4
1Australian National University, 2Data61-CSIRO, Australia
3University of Technology Sydney, 4Nanyang Technological University
{shi.qiu, yunfan.wu, saeed.anwar}@anu.edu.au, [email protected]
* denotes equal contributions.
Abstract

Object detection in three-dimensional (3D) space attracts much interest from academia and industry since it is an essential task in AI-driven applications such as robotics, autonomous driving, and augmented reality. As the basic format of 3D data, the point cloud can provide detailed geometric information about the objects in the original 3D space. However, due to 3D data’s sparsity and unorderedness, specially designed networks and modules are needed to process this type of data. The attention mechanism has achieved impressive performance in diverse computer vision tasks; however, it is unclear how attention modules would affect the performance of 3D point cloud object detection and what sort of attention modules could fit with the inherent properties of 3D data. This work investigates the role of the attention mechanism in 3D point cloud object detection and provides insights into the potential of different attention modules. To achieve that, we comprehensively investigate classical 2D attentions and novel 3D attentions, including the latest point cloud transformers, on the SUN RGB-D and ScanNetV2 datasets. Based on the detailed experiments and analysis, we summarize the effects of different attention modules. This paper is expected to serve as a reference source for attention-embedded 3D point cloud object detection. The code and trained models are available at: https://github.com/ShiQiu0419/attentions_in_3D_detection.

1 Introduction

3D data such as point clouds provides more detailed geometric and structural information for scene understanding than images. As a result, the applications of 3D data have attracted more and more attention in recent years. Among these applications, 3D point cloud object detection is an essential function that is highly desired in robotics, autonomous driving, and augmented reality. Due to the irregular format of 3D point cloud data, the pipelines of 3D object detection [24, 23, 22] differ from conventional 2D object detection methods [9, 31, 12], raising new challenges.

Figure 1: The performances (IoU threshold = 0.25) of using different attention modules in the VoteNet [23] backbone for 3D point cloud object detection. (a) Overall results (%) on the SUN RGB-D [34] dataset. (b) Overall results (%) on the ScanNetV2 [5] dataset.

Many frameworks [25, 35, 4, 36] have been proposed to process the irregular point cloud data for object detection. For example, VoxNet [21] voxelizes the point cloud data and uses 3D CNNs to detect the objects, but it usually suffers from expensive computation costs. To leverage the strong performance of 2D detectors, some methods [3, 17] project the point cloud data to the front view or bird’s eye view (BEV) and lift existing 2D detectors for 3D region proposals. However, as the occlusion problem remains challenging in 2D detection, this issue could also be problematic in 3D space.

PointNet [25] and PointNet++ [26] are proposed to process point cloud data directly without voxelization or projection into BEVs. This track of methods achieves excellent performance in classification [27] and semantic segmentation [30] tasks. Although these methods can effectively process the unordered point cloud data, they cannot be directly used for 3D object detection due to the point cloud’s inherent sparsity. In 2D images, the center of an object is likely to lie inside the object’s pixels, while the center of an object in a 3D point cloud is usually in empty space, since only the surfaces of an object can be captured by 3D scanners [2, 15]. Thus, it is hard to aggregate the context around an object center, which leads to difficulties in detecting 3D objects in point clouds.

To address this problem, VoteNet [23] generates seed points that are close to object centers via a PointNet++ backbone and then produces region proposals after voting and clustering the seed points. Moreover, VoteNet is an end-to-end trainable network that significantly simplifies the 3D object detection pipeline by omitting 2D-to-3D conversions and extra pre/post-processing steps. Although VoteNet is efficient in 3D object detection, its effectiveness highly relies on the point features learned from the backbone network. In detail, the learned point features not only affect the selection of seed points, but are also involved in the subsequent voting process, since the offsets to the object centers are estimated entirely from them. Therefore, extracting high-quality point features is a critical factor for the success of such a 3D object detection pipeline.
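
To make the voting step concrete, the sketch below shows one way such a voting module could look in PyTorch; the class name, feature dimension, and layer sizes are illustrative assumptions and are not taken from the released VoteNet code.

```python
import torch
import torch.nn as nn

class VotingSketch(nn.Module):
    """Hypothetical sketch of Hough voting: a shared MLP predicts, for every seed
    point, a 3D offset toward an object center plus a feature residual."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, 1), nn.BatchNorm1d(feat_dim), nn.ReLU(inplace=True),
            nn.Conv1d(feat_dim, 3 + feat_dim, 1),
        )

    def forward(self, seed_xyz, seed_feat):
        # seed_xyz: (B, 3, M) seed coordinates; seed_feat: (B, C, M) seed features
        out = self.mlp(seed_feat)
        offset, feat_res = out[:, :3], out[:, 3:]
        vote_xyz = seed_xyz + offset      # votes are expected to land near object centers
        vote_feat = seed_feat + feat_res  # refined features travel with each vote
        return vote_xyz, vote_feat
```

Since both the offsets and the feature residuals are regressed from the seed features alone, any improvement in the backbone’s features directly propagates to the quality of the votes.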

Feature learning is considered a fundamental problem in all computer vision tasks, especially those using Convolutional Neural Networks (CNNs). Following the success of the attention mechanism [37] in language-related topics, attention, as a basic function for extracting and refining features, can significantly improve the performance of many computer vision tasks. Basically, regular attention modules used in image feature learning can be categorized into three main types based on their operating domains: spatial attention [39, 14], channel attention [13], and mixed attention [41, 8]. More recently, an increasing number of attention modules have been proposed for 3D point cloud analysis, including self-attention [43, 7, 28] and transformer [10, 51] methods. Although the attention mechanism has shown its effectiveness in point cloud classification [28] and semantic segmentation [45], few attention modules are utilized in point cloud object detection since it is unclear whether the attention mechanism also fits this task. As explained above, the learned point features in VoteNet [23] are critical for 3D object detection. Thus, taking the recent VoteNet [23] as a basic pipeline, we investigate the advantages and disadvantages of different attention modules in 3D point cloud object detection.

To thoroughly investigate the attention mechanism’s effects on 3D point cloud object detection, we analyze five classical 2D attention modules [13, 39, 14, 41, 8] and five novel 3D attention modules [43, 7, 28, 10, 51] (more details are provided in Section 3). Furthermore, we conduct experiments on two widely used 3D object detection benchmarks, the SUN RGB-D [34] and ScanNetV2 [5] datasets, under different metrics such as the ones shown in Figure 1. In general, our main contributions can be summarized as follows:

  • We push the VoteNet pipeline towards better performance for 3D point cloud object detection by integrating attention mechanisms into it.

  • We are the first to comprehensively evaluate the performances of ten recent attention modules for 3D point cloud object detection on SUN RGB-D and ScanNetV2 datasets.

  • We concretely summarize the effects and characteristics of different types of attention modules and provide novel insights and inspiration to facilitate the understanding of the attention mechanism for 3D point cloud object detection.

2 Related Work

Point Cloud Networks. Due to the rapid development of 3D sensors, point cloud data can be easily collected using LiDAR scanners and RGB-D cameras. To better understand the information contained in point cloud data, different CNN-based point cloud networks have been invented for machine perception. To be specific, early methods [35, 50, 21] attempt to convert raw point clouds into a particular intermediate representation (e.g., images or voxels) according to projective relations, then apply regular 2D/3D CNN operations to learn high-dimensional features for the subsequent analysis. However, the intermediate representations of point cloud data usually encounter the problem of geometric information loss, resulting in inaccurate predictions.

To avoid this problem, current methods tend to exploit the usage of the Multi-Layer Perceptron (MLP), which was originally proposed in PointNet [25]. In practice, the MLP is implemented as 1×1 convolutions followed by a Batch Normalization (BN) layer and an activation function. Particularly, the MLP can cope with the sparsity and unorderedness of raw point cloud data because all points share the same learnable weights. Moreover, PointNet++ [26] extends the basic MLP operation to aggregate local features from pre-defined point neighborhoods via a symmetric function like max-pooling. Since VoteNet [23] takes PointNet++ as the backbone for feature learning, we mainly use the MLP as the essential convolutional operation in the attention mechanism for 3D point cloud object detection.
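
As a minimal illustration of this shared-MLP idea (our own naming, not code taken from PointNet or VoteNet), a point-wise MLP layer can be written in PyTorch as:

```python
import torch
import torch.nn as nn

def shared_mlp(in_dim, out_dim):
    """A shared point-wise MLP: a 1x1 convolution followed by batch normalization
    and ReLU, applied identically to every point and hence order-invariant."""
    return nn.Sequential(
        nn.Conv1d(in_dim, out_dim, kernel_size=1),
        nn.BatchNorm1d(out_dim),
        nn.ReLU(inplace=True),
    )

# Usage: point features are stored as (batch, channels, num_points).
points = torch.randn(8, 3, 1024)    # raw xyz coordinates of 1024 points
feats = shared_mlp(3, 64)(points)   # (8, 64, 1024): one 64-d feature per point
```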

3D Object Detection. The aim of object detection in 3D space is to predict the class label and 3D bounding box of each object. In general, the standard approaches for 3D object detection can be categorized into two streams [11]: region proposal-based methods and single-shot methods.

Region proposal-based methods usually follow a two-stage design, where the first stage generates region proposals and the second stage decides the class label of each proposal. More concretely, the region proposal-based methods fall into three tracks: multi-view based methods [3, 17, 19], segmentation-based methods [32, 38, 49], and frustum-based methods [24, 44, 40]. In addition, the single-shot methods can be more efficient than the region proposal-based methods since they directly estimate the class probabilities and regress the bounding boxes. According to the way of processing raw 3D data, the single-shot methods can be further divided into BEV-based [47, 46] methods, discretization-based [52, 33, 18] methods, and point-based [48] methods. Theoretically, VoteNet can be recognized as a region proposal-based method [11]; meanwhile, it also shares the efficiency of point-based methods that directly use point cloud data as input.

Attention Mechanism. Initially, the attention mechanism was intended to imitate the human vision system by focusing on the features most relevant to the target rather than the whole scene containing irrelevant context. Many methods have been introduced to estimate attention (weight) maps in order to re-weight the original feature map learned from the CNNs. As for image-related tasks, the attention map can be generated according to spatial [39, 14] or channel-related [13, 1, 6] information, while some methods [8, 41] incorporate both for better information integration. In addition, point cloud networks tend to utilize a self-attention [37] structure, which can estimate long-range dependencies regardless of a specific order between the elements. In practice, we can leverage the basic form of self-attention to calculate either point-wise relations [43, 7] or channel-wise affinities [28] in a wide range of point cloud analysis problems [29] such as classification, semantic segmentation, and object detection.

In this work, we adopt ten standard attention modules covering the main types of existing designs used in both 2D images and 3D point clouds. By applying them to the backbone of VoteNet, we can comprehensively study the role of the attention mechanism in 3D point cloud object detection.

Figure 2: An overview of the VoteNet [23] pipeline, including the structure of our attentional backbone.

3 Approach

As shown in Figure 2, the VoteNet pipeline consists of a backbone that learns point features, a voting module that estimates object centers, and a prediction module that regresses bounding boxes and class labels.

More concretely, the last two rows of Figure 2 compare the detailed structures of VoteNet’s original backbone and our attentional backbone. In general, the Set Abstraction (SA) [26] layer and the Feature Propagation (FP) [26] layer act as the encoder and decoder in the backbone, respectively. Following the regular usage in image-related CNNs, an attention module is placed after each encoder (SA) and decoder (FP) layer of the backbone. Moreover, to generate only a few seed points (e.g., 1024) from all input points (e.g., 20,000), we adopt the official implementation of VoteNet (https://github.com/facebookresearch/votenet), which leverages four SA layers for down-sampling but only two FP layers for up-sampling.
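
The insertion itself is simple; the sketch below (with placeholder names, not the released code) shows how any attention module with a features-in/features-out interface could wrap an SA or FP layer of the backbone:

```python
import torch.nn as nn

class AttentiveLayer(nn.Module):
    """Wrap a Set Abstraction (SA) or Feature Propagation (FP) layer so that an
    attention module refines its output features; `layer` and `attention` are
    placeholders for a PointNet++ layer and any attention module studied here."""
    def __init__(self, layer, attention):
        super().__init__()
        self.layer = layer
        self.attention = attention

    def forward(self, *inputs):
        xyz, features = self.layer(*inputs)  # the wrapped layer is assumed to return (coordinates, features)
        features = self.attention(features)  # refine the (B, C, N) feature map
        return xyz, features
```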

Figure 3: The detailed structures of the different attention modules investigated in this work. (2D attentions: Non-Local [39], Criss-cross [14], SE [13], CBAM [41], Dual-Attention [8]; 3D attentions: A-SCN [43], Point-Attention [7], CAA [28], Offset-Attention [10], Point-Transformer [51].)

3.1 2D Attentions

The Non-local [39] block is a spatial attention module that represents each pixel by a weighted sum of features. Particularly, the weights are estimated as the long-range dependencies (i.e., inner products) between pixels, by which the learned feature maps can be further enhanced with both local and global information.
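
As a reference, a simplified adaptation of the Non-local block to point features of shape (B, C, N) could look as follows (a sketch under our assumptions, not the original 2D implementation):

```python
import torch
import torch.nn as nn

class NonLocalPoint(nn.Module):
    """Simplified non-local (spatial self-attention) block on point features."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = channels // reduction
        self.q = nn.Conv1d(channels, mid, 1)
        self.k = nn.Conv1d(channels, mid, 1)
        self.v = nn.Conv1d(channels, channels, 1)
        self.out = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                    # x: (B, C, N)
        q = self.q(x).permute(0, 2, 1)       # (B, N, C')
        k = self.k(x)                        # (B, C', N)
        attn = torch.softmax(q @ k, dim=-1)  # (B, N, N) point-wise dependencies
        v = self.v(x).permute(0, 2, 1)       # (B, N, C)
        y = (attn @ v).permute(0, 2, 1)      # weighted sum over all points
        return x + self.out(y)               # residual connection
```

The N×N attention map is the main memory cost, which is exactly what the criss-cross design below tries to reduce.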

Criss-cross attention [14] is another spatial attention module, which exploits the pixels on a criss-cross path to efficiently obtain contextual information for a given pixel. Compared to the Non-local block, criss-cross attention saves memory while achieving comparable performance.

The Squeeze-and-Excitation (SE) block [13] is a channel attention module that refines features by exploiting the inter-dependencies between channels. At first, the SE block squeezes the spatial information and generates channel-wise descriptors; then, it learns weights for the different channels by exciting the descriptors with convolutions and activations.
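
Adapted to point features of shape (B, C, N), the SE idea reduces to squeezing over the point dimension and exciting the channels, roughly as in the sketch below (an illustrative adaptation, not the official code):

```python
import torch.nn as nn

class SEPoint(nn.Module):
    """Squeeze-and-Excitation on point features: pool over points, re-weight channels."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)  # squeeze: (B, C, N) -> (B, C, 1)
        self.fc = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(self.pool(x))  # per-channel weights in (0, 1)
        return x * w               # excite: re-weight the channels
```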

The Convolutional Block Attention Module (CBAM) [41] is a mixed attention module that consists of two sequential attention blocks. To be specific, a channel attention block is leveraged to capture the channel-wise information based on the inter-channel relationship of features. In practice, it utilizes the sum of space-wise average-pooled and max-pooled descriptors to encode a channel attention map. In addition, the spatial attention block exploits the inter-spatial relationship of features for complementary context. In contrast to the channel block, the spatial attention block applies the average-pooling and max-pooling operations along the channels, then concatenates the pooled features to generate a spatial attention map.

Moreover, Dual-Attention [8] is another mixed attention structure that exploits both channel-wise and spatial-wise long-range dependencies of features to enhance their discriminative ability. Notably, this structure utilizes a position attention module and a channel attention module in parallel. To be more specific, the position attention module models the dependencies between positions in a feature map in a manner similar to the Non-local [39] block. In contrast, the channel attention module directly estimates the dependencies between channels without any convolution. Finally, the outputs of the two modules are aggregated using simple element-wise summation to generate the final feature map.

Table 1: The results of Average Precision on SUN RGB-D [34] dataset. (IoU threshold = 0.25)
  Method bed table sofa chair toilet desk dresser nightstand bookshelf bathtub mAP
baseline [23] 83.3 49.8 64.1 74.1 89.3 23.8 26.4 60.7 30.9 72.8 57.5
2D Attentions Non-local [39] 84.7 51.4 62.9 75.1 89.4 24.3 28.8 61.8 28.0 76.6 58.3
Criss-cross [14] 82.9 49.8 62.1 74.1 85.9 24.2 27.3 60.2 28.1 67.2 56.2
SE [13] 84.2 50.7 65.0 75.3 90.6 26.8 32.3 63.4 31.6 76.5 59.6
CBAM [41] 84.8 50.7 64.1 74.5 90.4 25.8 33.7 65.9 28.8 72.0 59.1
Dual-attn [8] 79.7 44.5 54.3 67.4 86.5 18.6 23.8 45.8 18.1 67.1 50.6
3D Attentions A-SCN [43] 81.8 48.9 63.8 74.0 88.3 24.5 26.7 57.5 24.9 65.4 55.6
Point-attn [7] 84.4 49.0 61.9 73.8 87.4 25.7 24.6 56.0 28.2 73.1 56.4
CAA [28] 83.7 50.2 63.4 74.9 89.7 25.7 30.6 64.7 27.5 77.6 58.8
Point-trans [51] 83.9 50.4 63.7 75.2 86.6 26.3 28.1 62.5 35.8 72.2 58.5
Offset-attn [10] 82.8 49.8 60.5 73.0 86.5 23.6 27.1 56.5 25.6 71.2 55.7
 
Table 2: The results of Recall on SUN RGB-D [34] dataset. (IoU threshold = 0.25)
  Method bed table sofa chair toilet desk dresser nightstand bookshelf bathtub AR
baseline [23] 95.2 85.5 89.5 86.7 97.4 78.8 81.0 87.8 68.6 90.4 86.1
2D Attentions Non-local [39] 95.4 84.6 89.2 87.2 96.0 79.7 81.9 90.6 63.2 92.3 86.0
Criss-cross [14] 93.4 84.2 89.0 86.7 94.7 78.3 82.4 89.8 66.2 84.6 84.9
SE [13] 94.5 85.6 89.2 86.9 99.3 80.4 82.4 89.8 69.2 90.4 86.8
CBAM [41] 95.9 84.7 90.1 86.7 97.4 79.1 83.8 90.2 68.6 86.5 86.3
Dual-attn [8] 92.1 80.9 86.1 84.1 95.4 77.2 79.2 83.1 66.2 84.6 82.9
3D Attentions A-SCN [43] 94.1 83.3 88.4 87.3 96.7 78.8 77.3 85.4 67.6 80.8 84.0
Point-attn [7] 94.8 83.6 88.9 86.3 95.4 78.7 78.2 88.2 62.5 86.5 84.3
CAA [28] 94.1 84.7 89.7 86.8 97.4 79.3 80.6 89.8 65.9 90.4 85.9
Point-trans [51] 93.4 84.5 89.4 86.1 94.7 77.4 80.6 89.4 71.9 90.4 85.8
Offset-attn [10] 94.1 83.5 87.8 86.1 97.4 78.9 78.2 88.2 64.9 86.5 84.6
 

3.2 3D Attentions

Attentional ShapeContextNet (A-SCN) [43] introduces a self-attention-based module to exploit the shape context-driven features in the 3D point cloud. By comparing the query and key matrices, the attention map is estimated as the point-wise similarities. The output is then calculated as a matrix product between the attention map and the value matrix, together with an additional skip connection from the value matrix.

The Point-Attention [7] module also follows the basic structure of self-attention to capture more shape-related features and long-range correlations from the point space of local point graphs. Additionally, it applies a skip-connection to strengthen the relationship between the input and output.

Channel Affinity Attention (CAA) [28] estimates the attention map between channels by calculating channel-wise affinities within a self-attention structure. Specifically, it utilizes a compact channel-wise comparator block and a channel affinity estimator block to compute the similarity matrix and the affinity matrix.
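
Since the attention map here is computed between channels rather than between points, its size is C×C and independent of the number of points. A generic channel-wise self-attention sketch of this flavor (a simplified stand-in for the CAA design, not its released code) is:

```python
import torch
import torch.nn as nn

class ChannelSelfAttentionSketch(nn.Module):
    """Generic channel-wise self-attention on point features of shape (B, C, N)."""
    def __init__(self, channels):
        super().__init__()
        self.out = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                                        # x: (B, C, N)
        affinity = torch.softmax(x @ x.transpose(1, 2), dim=-1)  # (B, C, C) channel affinities
        return x + self.out(affinity @ x)                        # recombine channels, residual
```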

Recently, inspired by the success of transformers [16] in 2D images, researchers have also developed transformer-based networks for point cloud analysis, which heavily exploit the attention mechanism as the basic point feature learning module. For example, Offset-Attention [10] is proposed to estimate the offsets between the input features and the attention features, which are calculated from a self-attention structure. Mainly, Offset-Attention leverages the robustness of relative coordinates to transformations and the effectiveness of the Laplacian matrix in graph convolution.
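
A rough sketch of the offset-attention idea, simplified from the descriptions in [10] (in particular, the paper’s specific normalization of the attention map is omitted here), is:

```python
import torch
import torch.nn as nn

class OffsetAttentionSketch(nn.Module):
    """Refine features with the offset between the input and its self-attention output."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv1d(channels, channels // 4, 1)
        self.k = nn.Conv1d(channels, channels // 4, 1)
        self.v = nn.Conv1d(channels, channels, 1)
        self.lbr = nn.Sequential(nn.Conv1d(channels, channels, 1),
                                 nn.BatchNorm1d(channels), nn.ReLU(inplace=True))

    def forward(self, x):                                                     # x: (B, C, N)
        attn = torch.softmax(self.q(x).transpose(1, 2) @ self.k(x), dim=-1)   # (B, N, N)
        sa = self.v(x) @ attn.transpose(1, 2)                                 # attended features
        return self.lbr(x - sa) + x                                           # offset + residual
```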

Moreover, Point Transformer [51] is designed to take advantage of the local geometric relations between the center point and its neighbors. Using basic MLP operations, the Point Transformer block can effectively aggregate a local feature for each point based on the learned attention weights for its neighbors. With the help of rich local and geometric context, this method achieves outstanding performances in both point cloud classification and segmentation tasks.
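
Because the official code was not available at the time of writing (see Section 4.2), the sketch below illustrates one plausible reading of the design: each point attends to its k nearest neighbours with vector attention and a learned positional encoding on the relative coordinates. The layer sizes and names are our assumptions, not necessarily matching the original or our reproduced implementation.

```python
import torch
import torch.nn as nn

class PointTransformerSketch(nn.Module):
    """Reduced sketch of local vector attention over k-nearest-neighbour groups."""
    def __init__(self, channels, k=16):
        super().__init__()
        self.k = k
        self.q = nn.Linear(channels, channels)
        self.kv = nn.Linear(channels, 2 * channels)
        self.pos = nn.Sequential(nn.Linear(3, channels), nn.ReLU(), nn.Linear(channels, channels))
        self.attn = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(), nn.Linear(channels, channels))

    def forward(self, xyz, feat):            # xyz: (B, N, 3), feat: (B, N, C)
        B, N, C = feat.shape
        idx = torch.cdist(xyz, xyz).topk(self.k, largest=False).indices    # (B, N, k) neighbours
        nb_xyz = torch.gather(xyz.unsqueeze(1).expand(B, N, N, 3), 2,
                              idx.unsqueeze(-1).expand(B, N, self.k, 3))   # (B, N, k, 3)
        nb_feat = torch.gather(feat.unsqueeze(1).expand(B, N, N, C), 2,
                               idx.unsqueeze(-1).expand(B, N, self.k, C))  # (B, N, k, C)
        k_nb, v_nb = self.kv(nb_feat).chunk(2, dim=-1)                     # keys and values
        pos = self.pos(xyz.unsqueeze(2) - nb_xyz)                          # relative-position encoding
        w = torch.softmax(self.attn(self.q(feat).unsqueeze(2) - k_nb + pos), dim=2)
        return (w * (v_nb + pos)).sum(dim=2)                               # (B, N, C)
```

The local neighbour search (a brute-force torch.cdist in this sketch) is also the reason behind the longer training and inference times reported in Section 4.5.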

Table 3: The results of Average Precision on ScanNetV2 [5] dataset. (IoU threshold = 0.25)
  Method cabinet bed chair sofa table door window bookshelf picture counter desk curtain refrigerator shower-curtain toilet sink bathtub garbagebin mAP
baseline [23] 38.5 88.7 87.8 90.4 58.9 45.0 36.7 44.8 4.9 50.0 61.4 39.1 50.4 59.1 97.3 48.7 91.3 39.0 57.3
2D Attentions Non-local [39] 33.3 86.6 87.5 85.4 58.7 42.4 33.4 47.9 3.6 50.9 66.5 40.3 51.4 65.9 96.9 57.2 92.9 34.4 57.5
Criss-cross [14] 37.7 86.7 86.3 85.2 60.3 40.8 34.6 45.1 5.0 57.8 71.5 40.2 43.9 61.1 94.0 47.4 88.9 34.8 56.7
SE [13] 35.3 88.6 87.5 86.7 59.4 44.7 35.3 57.5 5.6 49.6 70.8 47.1 49.2 61.9 95.7 50.4 92.9 36.3 58.6
CBAM [41] 39.2 88.7 87.9 89.5 60.1 48.2 38.5 49.4 5.3 51.9 69.6 42.5 54.3 61.7 93.3 49.0 88.4 38.6 58.7
Dual-attn [8] 34.7 88.2 86.5 84.4 56.4 42.2 27.0 41.3 2.6 51.7 66.1 37.2 46.3 56.6 98.3 46.2 85.4 33.4 54.7
3D Attentions A-SCN [43] 37.3 85.7 88.2 87.9 58.2 41.3 31.8 46.8 3.5 50.9 67.9 35.8 49.6 61.3 96.6 53.2 83.9 37.8 56.5
Point-attn [7] 31.8 87.4 84.0 88.4 58.5 38.2 31.5 41.2 2.2 61.2 69.1 29.6 50.7 49.5 97.3 46.6 83.9 33.4 54.7
CAA [28] 36.4 88.5 88.7 89.7 60.0 44.5 38.6 48.4 4.4 49.3 69.8 39.0 43.1 60.4 94.3 53.0 91.3 37.2 57.6
Point-trans [51] 39.0 84.5 88.3 88.3 63.0 44.5 39.5 53.4 6.6 52.6 70.2 41.6 46.8 63.1 97.4 48.4 91.6 44.9 59.1
Offset-attn [10] 38.0 88.1 87.2 89.9 58.5 43.2 27.5 50.2 6.8 59.6 69.9 39.5 50.6 61.5 95.8 51.1 87.2 38.9 58.0
 
Table 4: The results of Recall on ScanNetV2 [5] dataset. (IoU threshold = 0.25)
  Method cabinet bed chair sofa table door window bookshelf picture counter desk curtain refrigerator shower-curtain toilet sink bathtub garbagebin AR
baseline [23] 76.3 95.1 91.9 99.0 82.0 72.4 63.8 84.4 23.0 84.6 93.7 71.6 93.0 78.6 98.3 64.3 96.8 70.6 80.0
2D Attentions Non-local [39] 74.2 93.8 91.8 97.9 84.0 71.3 60.6 81.8 19.4 82.7 94.5 79.1 93.0 96.4 98.3 74.5 96.8 65.7 80.9
Criss-cross [14] 73.9 95.1 91.3 97.9 82.9 68.5 61.0 85.7 18.0 86.5 93.7 73.1 94.7 78.6 94.8 67.3 93.5 67.4 79.1
SE [13] 72.6 95.1 92.0 96.9 82.9 71.3 64.5 85.7 21.2 80.8 96.1 77.6 91.2 92.9 96.6 63.3 96.8 67.7 80.3
CBAM [41] 77.2 95.1 92.1 97.9 84.3 70.7 69.1 87.0 19.8 82.7 94.5 76.1 96.5 96.4 96.6 65.3 90.3 65.8 81.0
Dual-attn [8] 73.4 95.1 92.3 97.9 81.7 71.3 56.0 85.7 18.0 86.5 93.7 74.6 98.2 92.9 100.0 65.3 93.5 67.0 80.2
3D Attentions A-SCN [43] 75.8 95.1 93.0 97.9 83.7 71.5 62.8 84.4 19.4 82.7 93.7 71.6 98.2 92.9 100.0 71.4 90.3 66.8 80.6
Point-attn [7] 71.2 93.8 90.4 97.9 82.6 70.0 61.7 84.4 17.6 84.6 95.3 74.6 96.5 85.7 100.0 66.3 90.3 66.6 79.4
CAA [28] 74.2 93.8 92.1 99.0 84.0 71.5 65.6 83.1 21.2 84.6 95.3 73.1 94.7 92.9 94.8 67.3 96.8 66.2 80.6
Point-trans [51] 75.8 95.1 92.5 95.9 82.6 70.0 64.9 83.1 25.2 84.6 94.5 77.6 91.2 89.3 98.3 65.3 93.5 66.8 80.3
Offset-attn [10] 73.9 95.1 92.1 97.9 81.4 71.7 56.7 80.5 20.7 86.5 96.1 73.1 98.2 92.9 98.3 67.3 90.3 65.5 79.9
 

4 Experiments

4.1 Datasets

We evaluate the performances of different attention modules on two datasets, SUN RGB-D [34] and ScanNetV2 [5], which are both captured from real-world indoor scenes using RGB-D cameras. To be more specific:

  • SUN RGB-D: There are 5,285 training and 5,050 testing RGB-D images in the dataset, where each object is precisely annotated with a bounding box and one of 35 semantic classes. According to the provided camera parameters, the original data and annotated bounding boxes can be converted into 3D point clouds. Following a widely used experimental setting [23], we only use the 3D coordinates as input and report the average precision (AP) and recall of the ten most common classes, together with the overall metrics of mean average precision (mAP) and average recall (AR).

  • ScanNetV2: The original dataset contains reconstructed meshes annotated with 18 object categories, where 1,201 samples are used for training and 312 samples for validation. The point cloud data is sampled from the vertices of the reconstructed meshes. Besides its usage in segmentation, we take the 3D coordinates as input and predict the bounding box and category of each object, following the same evaluation protocol as [23].

4.2 Implementation

In general, all attention modules in our work are adopted and slightly modified from the available official implementations. For 2D attentions, we regard the spatial domain of a 2D image as the point domain of a point cloud, following the relation H × W = N × 1, where H and W are the height and width of an image and N is the number of points. Moreover, to achieve a stable training process in 3D cases, we replace the original 1×1 convolutions in the 2D attentions with MLP operations where necessary. For fair comparisons, the reduction factor in all attention modules is empirically set to 8. As for the recent Point Transformer [51], whose code is not released yet, we reproduce its structure according to the descriptions in the paper.
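
For instance, under this mapping a 2D attention module expecting an image-shaped tensor can be reused on point features with a simple reshape (a hypothetical usage example; `some_2d_attention` is a placeholder, not an actual module name in our code):

```python
import torch

feat_pts = torch.randn(8, 128, 1024)     # (B, C, N) point features
feat_img = feat_pts.unsqueeze(-1)        # (B, C, N, 1), i.e. H = N and W = 1
# refined = some_2d_attention(feat_img)  # placeholder: any (B, C, H, W) attention module
# refined_pts = refined.squeeze(-1)      # back to (B, C, N) point features
```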

The implementations are realized in PyTorch on a single Tesla P100 GPU under CUDA and a Linux operating system. All experiments adopt similar training settings, such as a learning rate of 0.001, a batch size of 8, and a total of 180 training epochs. Following the default configurations in [23], the number of input points is 20,000 for the SUN RGB-D [34] dataset and 40,000 for the ScanNetV2 [5] dataset. Coupled with this paper, we will release the source code of all deployed attention modules.

4.3 Experimental Results

By applying different attention modules in our attentional backbone, we conduct the experiments of 3D point cloud object detection on SUN RGB-D [34] dataset and ScanNetV2 [5] dataset, respectively.

Tables 1 and 2 present the detailed detection results on the SUN RGB-D [34] dataset under the metrics of average precision (AP), mean average precision (mAP), recall, and average recall (AR). Although different attention modules show different effects, as shown in Table 1, the SE [13] method achieves the best overall result (59.6% mAP) among all tested attention modules and significantly exceeds the baseline’s result (57.5% mAP) by 2.1%. In terms of each category’s AP, the SE [13] method achieves the highest values in four out of ten object categories. Meanwhile, the results in Table 2 also verify the outstanding performance of SE [13] under the recall metrics. In comparison to the complicated self-attention methods, a compact attention structure like SE [13] or CBAM [41] is able to benefit the point cloud detection task effectively and efficiently. Moreover, we observe that channel-related information plays a crucial role in the attention mechanism for point clouds, since SE [13], CBAM [41], and CAA [28] are more effective than the spatial attention modules.

Table 5: Overall evaluations of 3D point cloud object detection results using SUN RGB-D [34] and ScanNetV2 [5] datasets. (mAP: mean average precision; AR: average recall; the value behind “@” denotes the IoU threshold.)
  Method SUN RGB-D [34] ScanNetV2 [5]
mAP@0.50 AR@0.50 mAP@0.50 AR@0.50
baseline [23] 33.1 51.1 33.7 49.9
2D Attentions Non-local [39] 31.4 49.7 34.6 49.5
Criss-cross [14] 33.1 50.0 33.8 49.2
SE [13] 34.5 52.1 35.8 51.4
CBAM [41] 34.9 53.1 37.1 52.5
Dual-attn [8] 24.4 42.1 30.2 47.2
3D Attentions A-SCN [43] 30.1 48.2 33.1 48.7
Point-attn [7] 32.2 49.7 30.8 46.7
CAA [28] 33.3 51.4 35.1 50.4
Point-trans [51] 34.3 51.3 38.0 53.5
Offset-attn [10] 30.6 48.2 36.0 50.4
 

In Tables 3 and 4, we further compare the performances of different attention modules under the more challenging condition of more input points and more object categories in the ScanNetV2 [5] dataset. Even though SE [13] and CBAM [41] still provide relatively good performances under both precision- and recall-related metrics, the Point Transformer [51] achieves the best overall result (59.1% mAP) among all ten tested attention modules. Mainly, the advantages of the Point Transformer [51] can be attributed to two aspects: on the one hand, it incorporates more local context for each point rather than a single feature representation learned from a shared MLP; on the other hand, its attention map is estimated from the geometric relations in 3D space, while most of the other methods only calculate dependencies in the feature space.

4.4 Overall Evaluations

In addition to the average precision and recall evaluated under an IoU threshold of 0.25 in Section 4.3, we provide the overall evaluations, mean average precision (mAP) and average recall (AR), under an IoU threshold of 0.5.

In general, Table 5 shows results similar to those in Tables 1 and 2, where the SE [13] and CBAM [41] methods achieve better mAP and AR scores under both IoU thresholds of 0.25 and 0.5 on the SUN RGB-D dataset. Moreover, it is worth noting that when the IoU threshold is set to 0.5, the Point Transformer [51] achieves improvements of 4.3% mAP and 3.6% AR over the baseline results, since it can integrate more local information for the object detection task in the dense point cloud scenes of the ScanNetV2 [5] dataset.

Apart from VoteNet [23], we conduct more experiments by utilizing our attentional backbone in BoxNet [23] and MLCVNet [42]. In general, the attention modules show similar effects with these pipelines as with VoteNet. More experimental data can be found in the supplementary material.

Table 6: Model complexity of VoteNet [23] with different attention modules, evaluated on the ScanNetV2 [5] dataset. (The number of parameters is counted for the first attention module in the backbone.)
  Method model size (MB) training time (s/epoch) inference time (s/epoch) # parameters (×10³/attention)
baseline [23] 11.0 43.8 35.0 -
2D Attentions Non-local [39] 13.0 48.2 35.9 8.5
Criss-cross [14] 16.0 54.6 35.2 20.8
SE [13] 11.9 44.2 35.1 4.1
CBAM [41] 11.5 45.7 36.4 4.1
Dual-attn [8] 15.9 50.6 36.7 21.0
3D Attentions A-SCN [43] 16.0 48.5 35.9 20.8
Point-attn [7] 16.0 48.6 35.6 20.8
CAA [28] 34.7 47.2 36.7 106.6
Point-trans [51] 25.8 88.1 38.7 100.1
Offset-attn [10] 19.5 50.1 35.3 35.6
 
(a) Input [34]; (b) Baseline [23]; (c) Non-local [39]; (d) Criss-cross [14]; (e) SE [13]; (f) CBAM [41]; (g) Dual-Attention [8]; (h) A-SCN [43]; (i) Point-Attention [7]; (j) CAA [28]; (k) Offset-Attention [10]; (l) Point-Transformer [51]
Figure 4: Visualizations of the votes generated from the VoteNet [23] backbone when leveraging different attention modules. The input point cloud scene (from the SUN RGB-D [34] dataset) contains 3 objects to detect, where the ground-truth bounding boxes are drawn as white frames. Ideally, as many of the generated votes (yellow points) as possible should lie around the centroids (red points) of the detected objects.

4.5 Model Complexity

Table 6 presents some reference data regarding model complexity. By adding different attention modules to the backbone of VoteNet [23], the model size of the whole network increases to different extents depending on the number of parameters in each attention module. Although some attentions, e.g., the Point Transformer [51] and CAA [28], can achieve relatively higher performances, they require more computational resources such as longer training time or larger memory consumption. In terms of inference time, all tested models perform at a similar level. However, as the Point Transformer [51] needs an additional local neighbor searching operation compared to other methods, the efficiency of its inference process is affected. Alternatively, we may integrate more global perception in both spatial and channel domains, as in CBAM [41] and SE [13], to better balance effectiveness and efficiency.
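
For reference, per-module parameter counts of the kind reported in Table 6 can be cross-checked with a generic PyTorch snippet (not tied to our released code):

```python
def count_attention_parameters(module):
    """Count the trainable parameters of a single attention module; dividing by
    1e3 gives the scale (x10^3) used in the last column of Table 6."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)
```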

4.6 Visualization

Recall that in the VoteNet pipeline, the most critical usage of its backbone is to generate the votes (yellow points) that are expected to approach the centroids (red points) of detected objects. Therefore, the generated votes can intuitively indicate the quality of the backbone’s output.

To this end, in Figure 4, we compare the votes generated by different attentional backbones. In the sub-figure of the baseline, we can see that the votes easily attach to the centroid of the most significant object (the middle one), while there are fewer votes around the centroids of the two smaller objects. As for the sub-figure of the SE method, it can be clearly observed that more votes are centralized at the centroids of the two smaller objects (especially the left one), providing more confident estimations of the detected objects’ bounding boxes. To further visualize the effects of different attention modules, in the supplementary material, we also compare the point features learned by our attentional backbone.

5 Insights

From the experimental results and our analysis, we obtain several interesting observations and insights into the attention mechanism in 3D point cloud object detection:

  • 1) The self-attention modules are not preferable for processing 3D point cloud data. On the one hand, self-attention requires high computational resources. On the other hand, the effectiveness of the point-wise long-range dependencies used in self-attention modules is relatively limited, as such an operation may cause redundancies in representing large-scale 3D point cloud data.

  • 2) The compact attention structures like SE [13] and CBAM [41] enable both effective and efficient 3D point cloud feature refinement. This is achieved by capturing a global perception from a broad perspective in the feature space.

  • 3) Comparing spatial attention modules with channel attention modules, we find that channel-related information is more important than spatial information when embedded into attention modules for point cloud feature representations.

  • 4) As reflected by the Point Transformer [51] results, incorporating more local context can better represent complex point cloud scenes, thus leading to better 3D point cloud object detection performance.

6 Conclusion

This paper proposes an attentional backbone used in the VoteNet pipeline for 3D point cloud object detection. By integrating different standard 2D and 3D attention modules, we compare their effects across various metrics and datasets. Based on the experiments and visualizations, we summarize the effects of the attention mechanism in 3D point cloud object detection. Moreover, we provide our insights on how to effectively leverage the attention mechanism for point cloud feature representation. In addition to presenting a benchmark evaluating the performances of different attention modules, we expect our preliminary findings to help future research in designing reliable and transparent attention structures for more point cloud analysis works.

References

  • [1] Saeed Anwar and Nick Barnes. Real image denoising with feature attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3155–3164, 2019.
  • [2] François Blais et al. Review of 20 years of range sensor development. Journal of electronic imaging, 13(1):231–243, 2004.
  • [3] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
  • [4] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019.
  • [5] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
  • [6] Pengfei Fang, Jieming Zhou, Soumava KUMAR Roy, Pan Ji, Lars Petersson, and Mehrtash T Harandi. Attention in attention networks for person retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [7] Mingtao Feng, Liang Zhang, Xuefei Lin, Syed Zulqarnain Gilani, and Ajmal Mian. Point attention network for semantic segmentation of 3d point clouds. Pattern Recognition, 107:107446, 2020.
  • [8] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3146–3154, 2019.
  • [9] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [10] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. Pct: Point cloud transformer. Computational Visual Media, 7:187–199, 2021.
  • [11] Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. Deep learning for 3d point clouds: A survey. IEEE transactions on pattern analysis and machine intelligence, 2020.
  • [12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [13] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • [14] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 603–612, 2019.
  • [15] Michel Jaboyedoff, Thierry Oppikofer, Antonio Abellán, Marc-Henri Derron, Alex Loye, Richard Metzger, and Andrea Pedrazzini. Use of lidar in landslide investigations: a review. Natural hazards, 61(1):5–28, 2012.
  • [16] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169, 2021.
  • [17] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE, 2018.
  • [18] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12697–12705, 2019.
  • [19] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 641–656, 2018.
  • [20] Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8778–8785, 2019.
  • [21] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
  • [22] Charles R Qi, Xinlei Chen, Or Litany, and Leonidas J Guibas. Imvotenet: Boosting 3d object detection in point clouds with image votes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4404–4413, 2020.
  • [23] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9277–9286, 2019.
  • [24] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 918–927, 2018.
  • [25] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
  • [26] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
  • [27] Shi Qiu, Saeed Anwar, and Nick Barnes. Dense-resolution network for point cloud classification and segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3813–3822, 2021.
  • [28] Shi Qiu, Saeed Anwar, and Nick Barnes. Geometric back-projection network for point cloud classification. IEEE Transactions on Multimedia, 2021.
  • [29] Shi Qiu, Saeed Anwar, and Nick Barnes. Pnp-3d: A plug-and-play for 3d point clouds. arXiv preprint arXiv:2108.07378, 2021.
  • [30] Shi Qiu, Saeed Anwar, and Nick Barnes. Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1757–1767, 2021.
  • [31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28:91–99, 2015.
  • [32] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 770–779, 2019.
  • [33] Vishwanath A Sindagi, Yin Zhou, and Oncel Tuzel. Mvx-net: Multimodal voxelnet for 3d object detection. In 2019 International Conference on Robotics and Automation (ICRA), pages 7276–7282. IEEE, 2019.
  • [34] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015.
  • [35] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.
  • [36] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pages 6411–6420, 2019.
  • [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [38] Sourabh Vora, Alex H Lang, Bassam Helou, and Oscar Beijbom. Pointpainting: Sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4604–4612, 2020.
  • [39] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [40] Zhixin Wang and Kui Jia. Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1742–1749. IEEE, 2019.
  • [41] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
  • [42] Qian Xie, Yu-Kun Lai, Jing Wu, Zhoutao Wang, Yiming Zhang, Kai Xu, and Jun Wang. Mlcvnet: Multi-level context votenet for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10447–10456, 2020.
  • [43] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional shapecontextnet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4606–4615, 2018.
  • [44] Danfei Xu, Dragomir Anguelov, and Ashesh Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 244–253, 2018.
  • [45] Xu Yan, Chaoda Zheng, Zhen Li, Sheng Wang, and Shuguang Cui. Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5589–5598, 2020.
  • [46] Bin Yang, Ming Liang, and Raquel Urtasun. Hdnet: Exploiting hd maps for 3d object detection. In Conference on Robot Learning, pages 146–155. PMLR, 2018.
  • [47] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.
  • [48] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11040–11048, 2020.
  • [49] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1951–1960, 2019.
  • [50] Tan Yu, Jingjing Meng, and Junsong Yuan. Multi-view harmonized bilinear network for 3d object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 186–194, 2018.
  • [51] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, and Vladlen Koltun. Point transformer. arXiv preprint arXiv:2012.09164, 2020.
  • [52] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.