On-the-fly Point Feature Representation for Point Clouds Analysis
Abstract.
Point cloud analysis is challenging due to the unique characteristics of point clouds: unorderedness, sparsity, and irregularity. Prior works attempt to capture local relationships through convolution operations or attention mechanisms, exploiting the geometric information from coordinates implicitly. These methods, however, are insufficient to describe explicit local geometry, e.g., curvature and orientation. In this paper, we propose On-the-fly Point Feature Representation (OPFR), which captures abundant geometric information explicitly through the Curve Feature Generator module. This is inspired by the Point Feature Histogram (PFH) from the computer vision community. However, the utilization of vanilla PFH encounters great difficulties when applied to large datasets and dense point clouds, as it demands considerable time for feature generation. In contrast, we introduce the Local Reference Constructor module, which approximates the local coordinate systems based on triangle sets. Owing to this, our OPFR only requires an extra 1.56 ms for inference (65× faster than vanilla PFH) and 0.012M more parameters, and it can serve as a versatile plug-and-play module for various backbones, particularly the MLP-based and Transformer-based backbones examined in this study. Additionally, we introduce the novel Hierarchical Sampling module, aimed at enhancing the quality of triangle sets and thereby ensuring the robustness of the obtained geometric features. Our proposed method improves overall accuracy (OA) on ModelNet40 from 90.7% to 94.5% (+3.8%) for classification and OA on S3DIS Area-5 from 86.4% to 90.0% (+3.6%) for semantic segmentation, building upon the PointNet++ backbone. When integrated with the Point Transformer backbone, we achieve state-of-the-art results on both tasks: 94.8% OA on ModelNet40 and 91.7% OA on S3DIS Area-5.
1. Introduction
Point cloud analysis for robotics and automation applications (Chen et al., 2024; Cheng et al., 2023; Zhao et al., 2021a; Sheng et al., 2022; Wiesmann et al., 2022; Zhao and Lee, 2022; Sheng et al., 2023; Han et al., 2024; Li and Zhao, 2024; Li et al., 2024) has garnered substantial attention in recent years, driven by advancements in sensor technologies like LiDAR and photogrammetry. This growing interest is attributable to two key advantages: 1) point clouds can accurately represent complex objects with a modest number of points, and 2) they can be quickly created using 3D scanning devices. Compared to 2D image data, point clouds provide a more powerful 3D sparse representation containing abundant geometry and layout information of the environment.
Deep learning technology (Krizhevsky et al., 2012; He et al., 2016) has achieved significant improvements in various image processing tasks. However, typical deep learning techniques require highly regular input data formats, so unordered and irregular point clouds make it challenging to apply image processing techniques directly. PointNet (Qi et al., 2017a), the pioneering network architecture that works directly with point clouds, overcomes the challenges of unordered and irregular inputs. It uses a point-wise shared-MLP followed by a pooling operation to extract global features from point clouds, but the global pooling operation leads to the loss of valuable local information. PointNet++ (Qi et al., 2017b) further proposes set abstraction (SA) to process local regions hierarchically. This step aggregates features from neighboring points, thereby capturing local information. However, it still learns from individual points without incorporating local relationships (Liu et al., 2019), which could hinder the model from leveraging the inherent geometric structures of point clouds.
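To make this limitation concrete, the following minimal PyTorch sketch (ours, not the official PointNet code) shows the shared-MLP-plus-global-pooling pattern described above: each point is processed independently, so the max-pooled global feature retains no neighborhood structure.

```python
import torch
import torch.nn as nn

class PointNetGlobal(nn.Module):
    """Minimal PointNet-style encoder: a point-wise shared MLP followed by a
    global max pooling. Each point is mapped independently, so the pooled
    feature keeps no local neighborhood structure."""

    def __init__(self, in_dim=3, feat_dim=1024):
        super().__init__()
        # 1x1 convolutions realize an MLP shared across all N points
        self.shared_mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1),
        )

    def forward(self, xyz):                        # xyz: (B, 3, N)
        per_point = self.shared_mlp(xyz)           # (B, feat_dim, N)
        return per_point.max(dim=2).values         # order-invariant global feature
```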
Local geometric structures are vital for understanding point clouds. In an effort to capture this information, some prior works attempt to learn local relationships from convolutions (Li et al., 2018; Wu et al., 2019; Jiang et al., 2018), attention (Guo et al., 2021; Zhao et al., 2021b; Yu et al., 2022), or graphs (Yang et al., 2018; Wang et al., 2019; Zhang et al., 2019). However, these methods require a huge amount of labelled data to learn local geometry implicitly (Ran et al., 2022), while obtaining large amounts of labelled 3D annotations is difficult. Recently, RepSurf (Ran et al., 2022) has emerged as a novel approach that explicitly learns geometric information based on the umbrella surface (Foorginejad and Khalili, 2014), which is a triangle set (in this paper, a triangle set refers to a collection of connected or disconnected triangles; the connected ones form a surface) with connected triangles formed by k nearest neighbors (k-NNs). While triangle sets are effective in capturing location and orientation information, they often fall short in incorporating curvature knowledge, which is essential for accurate point cloud recognition (Sun et al., 2016; Czerniawski et al., 2016). Moreover, as depicted in the supplementary material, in certain k-NNs the points may come from different surfaces of the object. These “noisy points” can distort the k-NN triangle sets, significantly impacting the quality of the obtained geometric features (Ran et al., 2022).
To integrate curvature information explicitly, we draw inspiration from the Point Feature Histogram (PFH) (Rusu et al., 2008b), a notable hand-crafted feature descriptor for capturing regional curvature knowledge. PFH exploits the histogram of curvature angles within local neighborhoods to characterize individual points. As shown in Fig. 2, these angles are calculated between the normal vectors and the local coordinate systems, which demands substantial computing resources. Moreover, many point cloud datasets (Uy et al., 2019; Geiger et al., 2012) lack normal vectors and necessitate additional normal estimation (Hoffman and Jain, 1987; Hoppe et al., 1992; Sanchez et al., 2020). Normal estimation poses significant computational challenges, particularly for dense point clouds, while its accuracy degenerates considerably for sparse point clouds. These limitations can potentially lead to the breakdown of the vanilla PFH approach, further underscoring the challenges of its direct integration with deep learning models.
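To see why normal estimation dominates the cost of vanilla PFH, consider this minimal NumPy/SciPy sketch of PCA-based normal estimation in the spirit of Hoppe et al. (1992); the function name and the neighborhood size k=16 are our assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_normals(points, k=16):
    """PCA-based normal estimation in the spirit of Hoppe et al. (1992):
    the normal at each point is the eigenvector of its local covariance
    matrix with the smallest eigenvalue. The per-point eigendecomposition
    is what makes this step costly on dense clouds."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)               # (N, k) neighbor indices
    nbrs = points[idx]                             # (N, k, 3)
    centered = nbrs - nbrs.mean(axis=1, keepdims=True)
    cov = np.einsum('nki,nkj->nij', centered, centered) / k  # (N, 3, 3)
    _, eigvecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    return eigvecs[:, :, 0]                        # direction of least variance
```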
In view of PFH’s potential and limitations, we explore curvature information and propose On-the-fly Point Feature Representation (OPFR), which includes a Local Reference Constructor module and a Curve Feature Generator module. This provides an efficient way to leverage explicit curvature knowledge without the prerequisite of normal estimation; instead, it relies on the quality of triangle sets. Accordingly, we propose the novel Hierarchical Sampling module to mitigate the distortion of triangle sets that occurs in the naive k-NN approach. Our sampling method is robust against noisy points by employing a hierarchical sampling strategy together with farthest point sampling. As a result, it significantly improves the obtained geometric features. These innovations confer the following properties:
• Curvature Awareness. The usage of curvature information remains underexplored by prior works. Our proposed OPFR gains the capability to explicitly capture not only location and orientation knowledge, but also curvature geometry, via the Curve Feature Generator module.
• Computational Efficiency. Vanilla PFH is computationally expensive due to normal estimation. Our proposed OPFR introduces the Local Reference Constructor module, which approximates the local coordinate systems based on triangle sets to overcome the computational bottlenecks.
• Robustness. Naive k-NN sampling causes distortion of triangle sets, which compromises the obtained geometric features. In contrast, our proposed OPFR presents the Hierarchical Sampling module to enhance the quality of triangle sets, ensuring robust geometric features for noisy points.
Moreover, our OPFR is backbone-agnostic, making it compatible with different 3D point cloud analysis architectures. We demonstrate its model-agnostic nature by adapting two representative backbones: PointNet++ (Qi et al., 2017b) and Point Transformer (Zhao et al., 2021b). It serves as an efficient plug-and-play module and achieves substantial performance improvements; empirical results demonstrate its compatibility with different backbones. When incorporated with the Point Transformer backbone, our OPFR achieves state-of-the-art performance for both point cloud classification and semantic segmentation tasks.
Figure 1. To construct the novel On-the-fly Point Feature Representation (OPFR), we propose three modules: the Hierarchical Sampling strategy, the Local Reference Constructor, and the Curve Feature Generator.
2. Related Work
2.1. Deep Learning on Point Clouds
Many prior works (Qi et al., 2017a; Zaheer et al., 2017; Qi et al., 2017b; Duan et al., 2019; Zhao et al., 2019; Liu et al., 2020; Nezhadarya et al., 2020) learn from raw point clouds via careful network designs. PointNet (Qi et al., 2017a) pioneers this line of research by handling the coordinates of each point with a shared-MLP and consolidating the final representation with a global pooling operation. However, the global pooling operation makes it deficient in preserving local structures. PointNet++ (Qi et al., 2017b) is an extension of the original PointNet architecture, which applies PointNet to multiple subsets of the point cloud. It further leverages a hierarchical feature learning paradigm to capture local structures. However, PointNet++ still processes points individually in each local region, neglecting explicit consideration of relationships between centroids and their neighbors.
As PointNet++ establishes the hierarchical point cloud analysis framework, the focus of many works has shifted towards the development of local feature extractors, including convolution-based (Li et al., 2018; Jiang et al., 2018; Mao et al., 2019; Komarichev et al., 2019; Wu et al., 2019), attention-based (Guo et al., 2021; Zhao et al., 2021b; Yu et al., 2022), and graph-based (Yang et al., 2018; Te et al., 2018; Wang et al., 2019; Zhang et al., 2019; Xu et al., 2020) approaches. PointCNN (Li et al., 2018) learns an X-transformation from input point clouds, which attempts to re-organize inputs into a canonical order; subsequently, it utilizes vanilla convolution operations to extract local features. Point Transformer (Zhao et al., 2021b) replaces conventional shared-MLP modules with Transformer (Vaswani et al., 2017) blocks, serving as feature extractors within localized patch processing. DGCNN (Wang et al., 2019) utilizes dynamic graph structures to enhance feature learning and capture relationships between points. However, these works rely heavily on the learnability of feature extractors, potentially missing inherent local shape information. More recently, RepSurf (Ran et al., 2022) leverages triangle sets with connected triangles, formed by k nearest neighbors (k-NNs), to learn location- and orientation-aware representations from geometric features explicitly. Although location and orientation features are explicitly injected into the network architecture in RepSurf, the usage of curvature information still remains underexplored. Moreover, RepSurf relies on naive k-NNs to produce triangle sets and obtain geometric features, which are vulnerable to noisy points (Ran et al., 2022).
2.2. Hand-crafted Designs on Point Clouds
Figure 2. The detailed procedure of the Point Feature Histogram (PFH) from the traditional 3D computer vision community. The whole process decomposes into two stages: 1) using angles to capture curvature information between neighbors; 2) deriving the point representation via histogram operations over the neighbors’ angles.
Many works in 3D computer vision attempt to build sophisticated feature descriptors (Scovanner et al., 2007; Rusu et al., 2008b, 2009), which help to understand point clouds through hand-crafted features. Point Feature Histogram (PFH) (Rusu et al., 2008b), one of the feature descriptors, is commonly used in computer vision tasks like object recognition, registration and model retrieval (Rusu et al., 2008a; Himmelsbach et al., 2009; Li et al., 2016). It develops point cloud representations by summarizing the distribution of certain geometric attributes within a local neighborhood around each point.
We depict the workflow of PFH derivation for one point of interest in Fig. 2. The whole process decomposes into two steps. Firstly, for each point pair within the k nearest neighbors (k-NNs) of the point of interest, curvature features are characterized using angles calculated from normal vectors and relative positions. Secondly, for each angle, we compute its histogram within the k-NNs. The histograms of the different angles are concatenated together, yielding the final PFH representation. Unfortunately, many point cloud datasets (Uy et al., 2019; Geiger et al., 2012) collected in real-world scenarios lack normal vectors, and estimating normal vectors (Wold et al., 1987; Hoppe et al., 1992) for sparse point clouds often leads to significant deviations from ground truth. Furthermore, PFH calculation involves establishing local coordinate systems and constructing curvature features, which is computationally expensive. As a result, the practical application of vanilla PFH is limited.
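The following NumPy sketch illustrates the two stages for a single query point, assuming normals are available. For brevity it pairs the query point with each of its neighbors and histograms each angle separately, matching the description above; full PFH considers all pairs within the neighborhood. All function and variable names are ours.

```python
import numpy as np

def pfh_descriptor(p_i, n_i, nbr_pts, nbr_nrms, bins=5):
    """Stage 1: Darboux-frame angle features per point pair.
    Stage 2: one histogram per angle, concatenated into the descriptor."""
    feats = []
    for p_t, n_t in zip(nbr_pts, nbr_nrms):
        d = p_t - p_i
        d_hat = d / (np.linalg.norm(d) + 1e-12)
        u = n_i                      # frame axis 1: the source normal
        v = np.cross(u, d_hat)       # frame axis 2
        w = np.cross(u, v)           # frame axis 3
        alpha = np.dot(v, n_t)       # angle features relating the target
        phi = np.dot(u, d_hat)       # normal to the local frame
        theta = np.arctan2(np.dot(w, n_t), np.dot(u, n_t))
        feats.append((alpha, phi, theta))
    feats = np.asarray(feats)
    hists = [np.histogram(feats[:, c], bins=bins)[0] for c in range(3)]
    return np.concatenate(hists)     # length 3 * bins
```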
3. Methodology
The pipeline of On-the-fly Point Feature Representation (OPFR) is depicted in Fig. 1, where we illustrate the OPFR generating process for the right corner of a table (highlighted in pink), one of our points of interest. Firstly, we propose the Hierarchical Sampling module, which takes each point in the point cloud as input and outputs several clusters (highlighted in blue) and corresponding centroids (highlighted in orange). This hierarchical sampling strategy improves the quality of the triangle set for each point, thereby facilitating the development of the subsequent geometric features. Then, for each point pair within the clusters, we design the Curve Feature Generator module to generate geometric features, including location, orientation, and curvature. The inclusion of explicit curvature information allows us to more effectively capture the local geometry surrounding these point pairs. To enhance efficiency and enable on-the-fly processing, we present the Local Reference Constructor module. It approximates a local coordinate system (highlighted in red) for each point pair using adjacent points from the triangle sets. Lastly, the obtained geometric features are fed into a shared-MLP followed by a pooling operation, constituting our final OPFR representation. The resultant OPFR representation, along with the point coordinates, can be directed into various point cloud analysis backbones, e.g., PointNet++ (Qi et al., 2017b) and Point Transformer (Zhao et al., 2021b), for end-to-end training.
3.1. Hierarchical Sampling
As mentioned earlier, the connected triangle sets produced by naive k nearest neighbors (k-NNs) are susceptible to noisy points (Ran et al., 2022), leading to significant distortion. Given that our Local Reference Constructor module inherently relies on triangle sets to approximate local reference frames, we propose the novel Hierarchical Sampling module to alleviate this distortion issue. For each individual point, the Hierarchical Sampling module generates several clusters, forming a triangle set. Specifically, we firstly conduct the k-NN algorithm to select the nearby points (highlighted in purple). Secondly, we utilize the farthest point sampling (Eldar et al., 1997) algorithm to identify surface centroids (highlighted in orange) among these nearby points. Lastly, for each centroid, we retrieve its nearest neighbors (highlighted in blue). The selected neighbors are used to further develop geometric features. The detailed implementation is presented in Algorithm 1.
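A minimal PyTorch sketch of this three-step procedure for a single query point is given below; the function names and the per-point formulation are ours (a batched implementation follows the same logic), not Algorithm 1 verbatim.

```python
import torch

def knn_idx(query, pts, k):
    """Indices of the k nearest points in `pts` for each row of `query`."""
    return torch.cdist(query, pts).topk(k, largest=False).indices

def farthest_point_sampling(pts, m):
    """Greedy FPS: repeatedly pick the point farthest from those chosen."""
    chosen = torch.zeros(m, dtype=torch.long)
    dist = torch.full((pts.shape[0],), float('inf'))
    for i in range(1, m):
        dist = torch.minimum(dist, (pts - pts[chosen[i - 1]]).norm(dim=1))
        chosen[i] = dist.argmax()
    return chosen

def hierarchical_sampling(pts, p_idx, k_cand, n_centroid, k_nbr):
    """1) k-NN candidates around the query point, 2) FPS picks well-spread
    surface centroids, 3) each centroid gathers its own neighbors,
    yielding one cluster per centroid."""
    cand = pts[knn_idx(pts[p_idx:p_idx + 1], pts, k_cand)[0]]    # (k_cand, 3)
    centroids = cand[farthest_point_sampling(cand, n_centroid)]  # (n_centroid, 3)
    clusters = cand[knn_idx(centroids, cand, k_nbr)]             # (n_centroid, k_nbr, 3)
    return centroids, clusters
```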
As illustrated in Fig. 1, the Hierarchical Sampling module is designed to decouple the right corner of the table into distinct clusters (e.g., table top and table leg). These clusters exhibit simpler geometric structures, allowing the resultant triangle sets to better approximate the original local surface. Therefore, compared to the naive k-NN approach, our hierarchical sampling scheme effectively relieves the distortion issue of triangle sets and ensures robustness against the original noisy points. As a result, it greatly enhances the development of the subsequent geometric features. We provide additional visualization examples comparing triangle sets generated by Hierarchical Sampling and those produced by k-NN sampling in the supplementary material.
3.2. Local Reference Constructor
A local reference frame is a local system of Cartesian coordinates at each point (Melzi et al., 2019), which provides a reference for understanding local structures. Denote a point set as P = {p_i}_{i=1}^{N} and its normal vector set as N = {n_i}_{i=1}^{N}. Assume p_i is our point of interest, and the objective is to extract geometric features for point p_i. Then, the local reference frame (Rusu et al., 2008b) between p_i and a neighboring point p_j is defined as:
(1)   $\mathbf{u} = \mathbf{n}_i,\qquad \mathbf{v} = \mathbf{u}\times\frac{p_j - p_i}{\lVert p_j - p_i\rVert},\qquad \mathbf{w} = \mathbf{u}\times\mathbf{v}$
Although Equ. 1 achieves the construction of local reference frames, it comes with two major problems. Firstly, it relies on normal vectors, which are often unavailable in many benchmarks (Uy et al., 2019; Geiger et al., 2012) and real-life scenarios. Although normal estimation (Hoppe et al., 1992) is feasible, its computational cost escalates significantly for dense point clouds, and its accuracy diminishes considerably for sparse point clouds. Secondly, it involves multiple sequential cross-product operations, which cannot be effectively parallelized in terms of tensor operations. This leads to inevitable computational overhead.
To circumvent normal estimation and overcome the computational bottlenecks, we design approximated local reference frames through the Local Reference Constructor (LRCon) module. Within each cluster generated by the Hierarchical Sampling module, we establish point pairs between the centroid and its neighboring points. For each point pair, the LRCon module leverages two adjacent neighbors along with their cross-product to serve as the approximate local reference frame. Denote the number of neighbors as K and the neighbors of centroid c as N(c) = {p_1, …, p_K}. Based on this setting, we can construct the approximated local reference frame for the point pair (c, p_j), which is defined as:
(2)   $\mathbf{u}_j = \frac{p_j^{cw} - c}{\lVert p_j^{cw} - c\rVert},\qquad \mathbf{v}_j = \frac{p_j^{ccw} - c}{\lVert p_j^{ccw} - c\rVert},\qquad \mathbf{w}_j = \mathbf{u}_j\times\mathbf{v}_j$
where p_j^{cw} and p_j^{ccw} are the points most adjacent to p_j in the neighbor set N(c), clockwise and counterclockwise respectively. To maintain the consistency of the local frame orientation, we apply the clockwise cross-product (Ran et al., 2022) to compute w_j. When setting up the approximated local reference frames, the LRCon module essentially finds the adjacent neighbors from the corresponding triangles in the triangle set, i.e., the two triangles sharing the edge (c, p_j). As mentioned earlier, the Hierarchical Sampling module boosts the quality of triangle sets, ensuring robustness against noisy points. This implicitly guarantees the reliability of the approximated local reference frames.
This approximation scheme allows us to establish local reference frames that are independent of the normal vectors of point clouds. By re-ordering the neighbors in N(c) based on their projected angles in the xy-plane, we can efficiently derive the approximated reference frames through tensor operations. Notably, with the integration of the LRCon module, our OPFR only requires an additional 1.56 ms for inference, making it 65× faster than vanilla PFH. Furthermore, since the LRCon module eliminates the need for normal estimation, it is compatible with point clouds of varying densities.
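Under our reading of the module, the approximation for one cluster can be sketched as follows: neighbors are sorted by azimuth in the xy-plane so that the clockwise and counterclockwise neighbors of each point become its adjacent entries, after which a single batched cross product yields all frames at once.

```python
import torch

def approx_local_frames(c, nbrs):
    """Approximate frames for all pairs (c, p_j) in one cluster without
    normals: sort neighbors by their projected angle in the xy-plane so
    that adjacent entries are the cw/ccw neighbors, then compute all
    frames with a single batched cross product."""
    rel = nbrs - c                                       # (K, 3)
    order = torch.atan2(rel[:, 1], rel[:, 0]).argsort()  # azimuth ordering
    nbrs, rel = nbrs[order], rel[order]
    unit = lambda t: t / (t.norm(dim=1, keepdim=True) + 1e-12)
    u = unit(rel.roll(1, dims=0))       # direction to the clockwise neighbor
    v = unit(rel.roll(-1, dims=0))      # direction to the counterclockwise neighbor
    w = unit(torch.cross(u, v, dim=1))  # frame axis from the cross product
    return nbrs, (u, v, w)              # re-ordered neighbors, frames per pair
```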
3.3. Curve Feature Generator
We propose to approximate the local curve at point x_0 by excluding high-order derivatives using a Taylor series (Swokowski, 1979):
(3)   $f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2}(x - x_0)^2$
Intuitively, the derivatives f′(x_0) and f″(x_0) reflect how the local curve is oriented and bent near point x_0, respectively. From the Taylor approximation, it can be observed that first-order derivative information alone is inadequate for accurately characterizing local curves.
Table 1. Classification results on ModelNet40 and ScanObjectNN.
Method | Input | ModelNet40 OA | ModelNet40 mAcc | ScanObjectNN OA | ScanObjectNN mAcc | #Params | FLOPs†
PointNet (Qi et al., 2017a) | 1k pnts | 89.2 | 86.0 | 68.2 | 63.4 | 3.47M | 0.45G |
DGCNN (Wang et al., 2019) | 1k pnts | 92.9 | 90.2 | 78.1 | 73.6 | 1.82M | 2.43G |
KPConv (Thomas et al., 2019) | 7k pnts | 92.9 | - | - | - | 14.3M | - |
MVTN (Hamdi et al., 2021) | multi-view | 93.8 | 92.0 | 82.8 | - | 4.24M | 1.78G |
RPNet (Ran et al., 2021) | 1k pnts∗ | 94.1 | - | - | - | 2.70M | 3.90G |
CurveNet (Xiang et al., 2021) | 1k pnts | 94.2 | - | - | - | 2.14M | 0.66G |
RepSurf-U (Ran et al., 2022) | 1k pnts | 94.4 | 91.4 | 84.3 | 81.3 | 1.483M | 1.77G |
RepSurf-U∘ (Ran et al., 2022) | 1k pnts | - | - | 86.0 | 83.1 | 6.806M | 4.84G |
PointMLP (Ma et al., 2022) | 1k pnts | 94.1 | 91.5 | 85.4 | 83.9 | 12.6M | 31.4G |
PointTrans. V2 (Wu et al., 2022) | 1k pnts∗ | 94.2 | 91.6 | - | - | - | - |
PointNeXt (Qian et al., 2022) | 1k pnts | 93.2 | 90.8 | 87.7 | 85.8 | 4.5M | 6.5G |
SPoTr (Park et al., 2023) | 1k pnts | 93.2 | 90.8 | 88.6 | 86.8 | 3.3M | 12.3G |
PointNet++ (Qi et al., 2017b) | 1k pnts | 90.7 | 88.4 | 77.9 | 75.4 | 1.475M | 1.7G |
PointNet++ & OPFR (ours) | 1k pnts | 94.5 (+3.8) | 91.6 (+3.2) | 85.7 (+7.8) | 83.8 (+8.4) | 1.487M | 1.85G
PointNet++ & OPFR∘ (ours) | 1k pnts | 94.6 (+3.9) | 91.8 (+3.4) | 88.5 (+10.6) | 86.6 (+11.2) | 8.42M | 5.9G
PointTrans. (Zhao et al., 2021b) | 1k pnts | 93.7 | 90.6 | 82.3 | 80.7 | 5.187M | 0.29G |
PointTrans. & OPFR (ours) | 1k pnts | 94.8 (+1.1) | 92.0 (+1.4) | 88.1 (+5.8) | 86.3 (+5.6) | 5.190M | 0.33G
∗: w/ normal vector. ∘: w/ double channels and deeper networks. †: FLOPs measured from 1024 input points.
To exploit second-order curvature information, we propose the Curve Feature Generator (CFGen) module. This module processes input point pairs along with their approximated local reference frames, generating geometric features that encompass location, orientation, and curvature. Denote the approximate local reference frame for (c, p_j) as (u_j, v_j, w_j). The location and orientation can be naturally (Ran et al., 2022) characterized by the relative position p_j − c and the frame cross-product w_j, respectively. Furthermore, we propose a curvature proxy for point clouds, which approximates the definition of curvature (Serrano and Suceava, 2015) from differential geometry; we provide the theoretical analysis in the supplementary material. The curvature proxy is defined as:
(4)   $\kappa_j = (\mathbf{u}_j, \mathbf{v}_j, \mathbf{w}_j) \odot \frac{p_j - c}{\lVert p_j - c\rVert}$
where ⊙ is the entry-wise dot product. Note that the curvature proxy approximates the limit definition of curvature from differential geometry, making it inherently curvature-aware. Intuitively, κ_j effectively captures how the surface is curved along the three reference frame axes in terms of normalized angles.
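A short sketch of how the three CFGen feature groups could be assembled per point pair, reusing `approx_local_frames` from the LRCon sketch above; the concatenated layout is our assumption.

```python
import torch

def cfgen_features(c, nbrs):
    """Location (relative position), orientation (frame axis w), and the
    curvature proxy (entry-wise dot products of the normalized relative
    position with the three frame axes) for every pair (c, p_j)."""
    nbrs, (u, v, w) = approx_local_frames(c, nbrs)        # frames per pair
    rel = nbrs - c                                        # (K, 3) location
    rel_n = rel / (rel.norm(dim=1, keepdim=True) + 1e-12)
    curv = torch.stack([(rel_n * a).sum(dim=1) for a in (u, v, w)], dim=1)
    return torch.cat([rel, w, curv], dim=1)               # (K, 9) per-pair features
```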
Table 2. Semantic segmentation results on S3DIS.
Method | 6-fold mIoU | 6-fold mAcc | 6-fold OA | Area-5 mIoU | Area-5 mAcc | Area-5 OA | #Params | FLOPs†
PointNet (Qi et al., 2017a) | 47.6 | 66.2 | 78.5 | 41.1 | 48.9 | - | 1.7M | 4.1G |
KPConv (Thomas et al., 2019) | 70.6 | 79.1 | - | 67.1 | 72.8 | - | 14.9M | - |
RPNet (Ran et al., 2021) | 70.8 | - | - | - | - | - | 2.4M | 5.1G |
RepSurf (Ran et al., 2022) | 74.3 | 82.6 | 90.8 | 68.9 | 76.0 | 90.2 | 0.976M | 6.7G |
PointTrans. V2 (Wu et al., 2022) | - | - | - | 71.6 | 77.9 | 91.1 | - | - |
PointNeXt-B (Qian et al., 2022) | 71.5 | - | 88.8 | 67.3 | - | 89.4 | 3.8M | 8.9G |
PointNeXt-XL (Qian et al., 2022) | 74.9 | - | 90.3 | 70.5 | - | 90.6 | 41.6M | 84.8G |
Superpoint Trans. (Sun et al., 2023) | 76.0 | 85.5 | 90.4 | 68.9 | 77.3 | 89.5 | 0.21M | - |
ConDaFormer∗ (Duan et al., 2024) | - | - | - | 72.6 | 78.4 | 91.6 | - | - |
PointNet++ (Qi et al., 2017b) | 59.9 | 66.1 | 87.5 | 56.0 | 61.2 | 86.4 | 0.969M | 7.2G |
PointNet++ & OPFR (ours) | 74.6 (+14.7) | 83.0 (+16.9) | 90.5 (+3.0) | 69.1 (+13.1) | 76.9 (+15.7) | 90.0 (+3.6) | 0.979M | 7.5G
PointTrans. (Zhao et al., 2021b) | 73.5 | 81.9 | 90.2 | 70.4 | 76.5 | 90.8 | 7.768M | 5.8G
PointTrans. & OPFR (ours) | 76.9 (+3.4) | 85.6 (+3.7) | 92.0 (+1.8) | 72.6 (+2.2) | 78.6 (+2.1) | 91.7 (+0.9) | 7.771M | 6.4G
∗: w/o test-time augmentation. †: FLOPs measured from 15,000 input points.
Table 3. Per-class mIoU comparison on S3DIS Area-5.
Method | ceiling | floor | wall | beam | column | window | door | chair | table | bookcase | sofa | board | clutter | mIoU
PointNet++ (Qi et al., 2017b) | 91.47 | 98.18 | 82.19 | 0.00 | 17.99 | 57.75 | 64.64 | 79.70 | 87.82 | 67.11 | 69.76 | 65.29 | 50.79 | 56.0 |
PointNet++ & OPFR (ours) | 93.13 | 98.37 | 85.38 | 0.00 | 41.50 (+23.51) | 62.32 | 71.56 | 80.37 | 89.86 | 77.25 | 72.67 | 68.18 | 57.12 | 69.1
PointTrans. (Zhao et al., 2021b) | 93.71 | 98.00 | 86.78 | 0.00 | 36.35 | 64.79 | 73.40 | 83.30 | 89.84 | 68.80 | 73.32 | 74.33 | 58.17 | 70.4
PointTrans. & OPFR (ours) | 93.68 | 98.11 | 88.20 | 0.00 | 55.16 (+18.81) | 69.02 | 73.53 | 83.68 | 90.43 | 75.57 | 79.71 | 75.67 | 62.06 | 72.6
3.4. On-the-fly Point Feature Representation
The Point Feature Histogram (PFH) (Rusu et al., 2008b) utilizes histogram operations to aggregate regional geometric features and generate the final representation for each point. We argue that these predefined transformation functions are task-agnostic, making the final representations fit poorly to specific tasks. To this end, motivated by PointNet++ (Qi et al., 2017b), we employ a shared-MLP to learn the final representations from point clouds. The proposed OPFR representation for point p_i is therefore defined as:
(5)   $\mathbf{F}_i = \mathcal{A}\big(\{\phi(\mathbf{f}_j)\}_{j}\big)$
where 𝒜 is a pooling operation (e.g., sum), φ is a shared-MLP, and f_j are the explicit geometric features obtained from the CFGen module for one point pair (c, p_j). By feeding the OPFR representation F_i along with the coordinate p_i into the backbone, the whole learning process can be achieved through end-to-end training.
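The aggregation in Equ. 5 can be sketched as a small PyTorch module. The input width (9, matching our CFGen sketch) is an assumption; the three layers and 30 OPFR dimensions follow the implementation details in Sec. 4.

```python
import torch
import torch.nn as nn

class OPFRHead(nn.Module):
    """Equ. 5 as a module: a shared MLP lifts each pair feature f_j, and a
    sum pooling over a point's pairs yields its OPFR representation."""

    def __init__(self, in_dim=9, out_dim=30):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim), nn.BatchNorm1d(out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, pair_feats):            # (P, M, in_dim): M pairs per point
        P, M, D = pair_feats.shape
        lifted = self.mlp(pair_feats.reshape(P * M, D)).reshape(P, M, -1)
        return lifted.sum(dim=1)              # (P, out_dim): sum pooling
```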
Figure 3. Visualization of the first-channel OPFR values, demonstrating the curvature-aware nature of the proposed OPFR representation.
In Fig. 3, the three-view drawing depicts the first-channel OPFR values for an airplane. The blue hues represent areas with smaller OPFR values, typically the airplane body, while the red hues indicate larger OPFR values, primarily associated with the airplane wings. This color differentiation underlines that our OPFR is sensitive to curvature variation across the airplane’s structure, demonstrating the curvature-aware property of OPFR. We provide more visualization examples in the supplementary material. Additionally, it is important to highlight that, as shown in Fig. 4, our OPFR outperforms vanilla PFH by a large margin with the help of the learnable shared-MLP. Furthermore, introducing the shared-MLP adds only 0.012M learnable parameters, which is negligible for most popular backbones (Qi et al., 2017b; Zhao et al., 2021b).
4. Experiments
We evaluate our OPFR on two primary tasks: point cloud classification and semantic segmentation. We choose two representative point cloud understanding models, PointNet++ (Qi et al., 2017b) and Point Transformer (Zhao et al., 2021b), as our backbones to evaluate the effectiveness and compatibility of OPFR representations across different backbone architectures. Additionally, we carry out ablation studies to demonstrate the effectiveness of our OPFR network designs and quantitatively evaluate the efficiency and quality of OPFR feature representations. Moreover, due to space constraints, we present qualitative results in the supplementary material.
Implementation details. For the Hierarchical Sampling module, we set the number of candidate centroids and the number of selected centroids empirically. The shared-MLP consists of three layers with 30 OPFR dimensions, followed by a sum pooling operation. These choices are determined via empirical studies, which will be further discussed in Sec. 4.3. Following RepSurf (Ran et al., 2022), we set the number of neighbors considering the trade-off between performance and efficiency. We use the CrossEntropy loss with label smoothing (Szegedy et al., 2016) at a ratio of 0.3 for both tasks. We provide more implementation details in the supplementary material.
4.1. Classification
We evaluate our OPFR on two commonly used benchmarks for point cloud classification: ModelNet40 (Wu et al., 2015) and ScanObjectNN (Uy et al., 2019).
Experimental setups. Following RepSurf (Ran et al., 2022), we implement two versions to integrate OPFR with PointNet++ (Qi et al., 2017b): one standard version and one scaled-up version. The scaled-up version doubles the channels of the standard version and exploits deeper networks. If not specified, we default to the standard version. We also apply the channel de-differentiation design (Ran et al., 2022) when integrating with PointNet++. We adopt the Adam (Kingma and Ba, 2014) optimizer with default parameters to train our models for 250 epochs with a batch size of 64 and an initial learning rate of 0.002, and apply an exponential learning rate decay scheme with a decay rate of 0.7. The whole training and testing process is conducted on one NVIDIA Quadro P5000 16GB GPU. For evaluation metrics, we use overall accuracy (OA) and mean per-class accuracy (mAcc). For efficiency metrics, we use the number of learnable parameters (#Params) and floating point operations (FLOPs). For a fair comparison, we calculate FLOPs from 1024 input points, and utilize single-scale grouping (SSG) set abstraction (Qi et al., 2017b) for all PointNet++-based (Qi et al., 2017b; Ran et al., 2022; Qian et al., 2022; Ma et al., 2022) methods.
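For reference, the classification setup above translates roughly into the following PyTorch configuration; `model` and `train_loader` are placeholders, and the decay interval (every 20 epochs, the common PointNet++-style choice) is our assumption, since the text only specifies the 0.7 decay rate.

```python
import torch
import torch.nn as nn

# `model` and `train_loader` are placeholders for the OPFR-augmented
# backbone and the ModelNet40/ScanObjectNN loader (batch size 64).
criterion = nn.CrossEntropyLoss(label_smoothing=0.3)  # smoothing ratio from Sec. 4
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.7)

for epoch in range(250):
    for points, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(points), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # exponential 0.7 decay, applied every 20 epochs here
```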
Classification on ModelNet40. ModelNet40 (Wu et al., 2015) is a synthetic object classification benchmark, which contains 9,843 training samples and 2,468 testing samples, totalling 12,311 unique CAD models from 40 object categories. The experimental results are presented in Tab. 1. The results reveal that our OPFR significantly improves the PointNet++ (Qi et al., 2017b) backbone by 3.8% OA and 3.2% mAcc, with just 0.012M more parameters and 0.15G more FLOPs. The scaled-up OPFR further attains a slight improvement of 0.1% OA and 0.2% mAcc. Moreover, when integrated with the transformer-based backbone, Point Transformer (Zhao et al., 2021b), our OPFR achieves the state-of-the-art 94.8% OA and 92.0% mAcc (+1.1% OA and +1.4% mAcc).
Classification on ScanObjectNN. ScanObjectNN (Uy et al., 2019) is a challenging, real-world object classification benchmark. It is composed of 2,902 point cloud samples from 15 categories, including occlusion and background. Following the typical protocol (Qi et al., 2017b; Ran et al., 2022; Park et al., 2023), we verify our OPFR on the hardest variant (PB_T50_RS) of ScanObjectNN. In Tab. 1, the proposed OPFR achieves 85.7% OA and 83.8% mAcc (+7.8% OA and +8.4% mAcc) on the PointNet++ backbone, which outperforms RepSurf (Ran et al., 2022) by a large margin of 1.4% OA and 2.5% mAcc with comparable model size. Our result surpasses PointMLP (Ma et al., 2022) by 0.3% OA as well, while utilizing 9× fewer parameters. Furthermore, we scale up our proposed OPFR and achieve 88.5% OA and 86.6% mAcc, which demonstrates a superiority of 0.8% OA and 0.8% mAcc compared with the state-of-the-art MLP-based backbone, PointNeXt (Qian et al., 2022). Our result is also comparable to the prior state-of-the-art transformer-based backbone, SPoTr (Park et al., 2023), with around 2.1× fewer FLOPs. When integrated with Point Transformer, our OPFR attains a notable improvement of 5.8% OA and 5.6% mAcc, with only 0.003M more parameters and 0.04G more FLOPs.
Table 4. Ablation study of OPFR modules on ScanObjectNN.
Method | OA | mAcc
PointNet++ & OPFR (ours) | 85.68 | 83.81
w/o Hierarchical Sampling strategy | −1.17 | −0.91
w/o Curve Feature Generator | −2.02 | −1.76
w/o shared-MLP | −1.53 | −1.34
4.2. Semantic Segmentation
We evaluate our proposed OPFR representations on a challenging benchmark, S3DIS (Armeni et al., 2016), for the semantic segmentation task.
Experimental setups. When integrating with PointNet++ (Qi et al., 2017b), we apply the channel de-differentiation design (Ran et al., 2022). We adopt AdamW (Loshchilov and Hutter, 2017) with default parameters to train our models for 100 epochs with a batch size of 8 and an initial learning rate of 0.006. Here, we employ a multi-step learning rate decay scheme, decaying at epochs [60, 80] with a decay rate of 0.1. The whole training and testing process is conducted on two NVIDIA A40 48GB GPUs. For evaluation metrics, we use the mean of class-wise intersection over union (mIoU), the mean of class-wise accuracy (mAcc), and overall accuracy (OA). For a fair comparison, we calculate FLOPs from 15,000 input points (Qian et al., 2022) and omit test-time augmentation (Duan et al., 2024).
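The corresponding segmentation setup, again with `model` as a placeholder:

```python
import torch
import torch.nn as nn

# `model` is a placeholder for the OPFR-augmented segmentation backbone.
criterion = nn.CrossEntropyLoss(label_smoothing=0.3)   # as for classification
optimizer = torch.optim.AdamW(model.parameters(), lr=0.006)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 80], gamma=0.1)          # decay at epochs 60 and 80
# Train for 100 epochs with batch size 8, stepping the scheduler once per epoch.
```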
Semantic Segmentation on S3DIS. S3DIS (Armeni et al., 2016) encompasses 271 scenes distributed across 6 indoor areas, with each individual point classified into one of 13 semantic labels. Following a common protocol (Tchapmi et al., 2017; Qi et al., 2017b), we evaluate the presented approach in two modes: (a) Area 5 is withheld during training and used for testing, and (b) 6-fold cross-validation. In Tab. 2, our proposed OPFR considerably enhances PointNet++ (Qi et al., 2017b) by 14.7%/16.9%/3.0% (mIoU/mAcc/OA) on the S3DIS 6-fold benchmark. Our result is comparable to PointNeXt-XL (Qian et al., 2022), with around 40× fewer parameters and 11× fewer FLOPs. When integrated with Point Transformer (Zhao et al., 2021b), the performance of OPFR exceeds the previous state-of-the-art Superpoint Transformer (Sun et al., 2023) by 0.9%/0.1%/1.6% (mIoU/mAcc/OA) on S3DIS 6-fold. Meanwhile, on S3DIS Area-5, our OPFR attains mIoU/mAcc/OA of 72.6%/78.6%/91.7% (+2.2%/+2.1%/+0.9%), surpassing the prior state-of-the-art ConDaFormer (Duan et al., 2024).
Furthermore, as shown in Tab. 3, we present quantitative segmentation results for each semantic class on S3DIS Area-5 in terms of mIoU. In Tab. 3, the top performance gain comes from the most challenging column semantic class, for both PointNet++ and Point Transformer backbones. Among all classes, columns exhibit a distinct columnar structure, which consists of two or three planes in the S3DIS dataset. This multi-plane structure can be effectively captured by the different clusters generated from the proposed Hierarchical Sampling module, which facilitates the recognition of the column pattern with greater ease. We provide detailed qualitative results in the supplementary material.
4.3. Ablation Study
We ablate some critical designs of our standard OPFR with the PointNet++ (Qi et al., 2017b) backbone on the ModelNet40 (Wu et al., 2015) and ScanObjectNN (Uy et al., 2019) datasets for an insightful exploration.
Effectiveness of different OPFR modules. As shown in Tab. 4, when we remove the Hierarchical Sampling module, the Curve Feature Generator module, and the 3-layer shared-MLP, the overall accuracy (OA) decreases by 1.17%, 2.02%, and 1.53%, and the mean accuracy (mAcc) drops by 0.91%, 1.76%, and 1.34%, respectively. From this empirical study, we can confirm that explicit geometric features are crucial for 3D object understanding, and that the shared-MLP is necessary as well to enhance the semantics of the obtained geometric features. Furthermore, owing to the Hierarchical Sampling module, we can effectively relieve the distortion of triangle sets, thereby improving the quality of geometric features. Additionally, we argue that the Hierarchical Sampling module can be applied to RepSurf (Ran et al., 2022) to handle the distorted triangle sets from k nearest neighbors. Due to space limits, we provide this ablation study in the supplementary material.
Table 5. Ablation study of the OPFR network architecture on ScanObjectNN.
Pooling | BN | #(OPFR dims) | #(layers) | OA
max | ✓ | 30 | 3 | 85.47 |
avg | ✓ | 30 | 3 | 85.55 |
sum | ✓ | 30 | 3 | 85.68 |
sum | ✗ | 30 | 3 | 85.32 |
sum | ✓ | 30 | 3 | 85.68 |
sum | ✓ | 10 | 3 | 85.32 |
sum | ✓ | 30 | 3 | 85.68 |
sum | ✓ | 64 | 3 | 85.44 |
sum | ✓ | 128 | 3 | 84.54 |
sum | ✓ | 30 | 1 | 83.34 |
sum | ✓ | 30 | 2 | 84.89 |
sum | ✓ | 30 | 3 | 85.68 |
sum | ✓ | 30 | 4 | 85.42 |
sum | ✓ | 30 | 5 | 84.50 |
Designs of OPFR network architecture. We ablate the designs of the OPFR network architecture in terms of the pooling operation and the shared-MLP in Tab. 5. Empirical results demonstrate that the combination of summation pooling, batch normalization, and a three-layer shared-MLP with 30 OPFR dimensions outperforms the other options. From our experiments, we hypothesize that the network tends to encounter overfitting issues as we increase the number of OPFR dimensions and shared-MLP layers.
Table 6. Sensitivity of the Hierarchical Sampling hyper-parameters in terms of OA on ScanObjectNN (rows and columns vary the two centroid hyper-parameters).
OA | – | – | – | –
– | 85.31 | 85.52 | 85.41 | 85.32
– | 85.43 | 85.68 | 85.55 | 85.42
– | 85.51 | 85.66 | 85.51 | 85.40
– | 85.47 | 85.61 | 85.52 | 85.46
Sensitivity of hyper-parameters. In the Hierarchical Sampling module, we are required to determine the number of surface centroid candidates, the number of selected surface centroids, and the number of neighbors. Following the RepSurf (Ran et al., 2022) design, we fix the number of neighbors and explore the relation between the two centroid hyper-parameters in terms of overall accuracy (OA) in Tab. 6. Generally speaking, our OPFR is relatively insensitive to the choices of hyper-parameters. As the number of centroid candidates increases, there is an initial rise in overall accuracy, subsequently followed by a slight decline. We hypothesize that this phenomenon is attributable to the inherent trade-off between exploration and concentration: when the candidate pool is small, we are unable to capture the local region of the point cloud effectively; conversely, when it is too large, we move far from the original point, leading to the deviation of the obtained geometric features. Furthermore, our OPFR is insensitive to the number of selected centroids. We hypothesize that this behavior primarily stems from the fact that the clusters may overlap with each other. To avoid computational overhead, we consider a moderate setting an ideal choice.
Table 7. Efficiency comparison of plug-and-play feature representations.
Method | #(Extra Params) | Infer Speed
PointNet++ & PFH (Rusu et al., 2009) | - | 102ms |
PointNet++ & RepSurf (Ran et al., 2022) | 0.008M | 1.12ms |
PointNet++ & OPFR (ours) | 0.012M | 1.56ms |
Efficiency of OPFR representations. As shown in Tab. 7, we evaluate the efficiency of our OPFR representations in terms of the number of extra parameters and inference speed. Empirically, although vanilla PFH introduces no extra learnable parameters, it requires 102 ms per input sample to generate the final representation, rendering it impractical for online network training. The main computational bottleneck lies in the estimation of point cloud normal vectors (Wold et al., 1987; Hoppe et al., 1992). Our novel Local Reference Constructor module eliminates the need for normal estimation and overcomes these computational overheads: we achieve an inference speed of 1.56 ms (65× faster) with a marginal increase of 0.012M parameters. Therefore, OPFR can serve as a versatile plug-and-play module for various backbones. Furthermore, the efficiency of our OPFR is close to that of the previous state-of-the-art plug-and-play feature representation, RepSurf (Ran et al., 2022), with only 0.004M more parameters and 0.44 ms more inference time.
Figure 4. Ablation study on the quality of OPFR representations: model performance comparison between PFH, RepSurf, and our proposed OPFR.
Quality of OPFR representations. As shown in Fig. 4, we compare the performance of PFH (Rusu et al., 2008b), RepSurf (Ran et al., 2022), and the proposed OPFR using the PointNet++ (Qi et al., 2017b) backbone. All of them are injected into PointNet++ as extra features. By incorporating vanilla PFH, overall accuracy (OA) and mean accuracy (mAcc) are enhanced by 2.0% and 2.6% on ModelNet40, and by 4.7% and 5.3% on ScanObjectNN, emphasizing the effectiveness of regional curvature knowledge. This gain further escalates to 3.8% and 3.2% on ModelNet40, and 7.8% and 8.4% on ScanObjectNN, in OA and mAcc respectively, when equipped with the proposed OPFR. This demonstrates the significance of the shared-MLP, which enriches the obtained geometric features. Furthermore, compared with the previous state-of-the-art feature representation, RepSurf, our OPFR outperforms it dramatically on ScanObjectNN, with a considerable margin of 1.4% higher OA and 2.5% higher mAcc. We attribute this to the use of explicit curvature knowledge and a robust sampling strategy, both underexplored in RepSurf.
5. Conclusion
We propose On-the-fly Point Feature Representation (OPFR), a novel plug-and-play module for various backbones. It explicitly captures local geometry, including location, orientation, and curvature, through the Curve Feature Generator module. We further develop the Local Reference Constructor module to improve efficiency and enable on-the-fly processing. Additionally, we introduce the Hierarchical Sampling module to mitigate the distortion of triangle sets that occurs with naive nearest-neighbor sampling, thereby enhancing the robustness of the obtained geometric features. We evaluate the proposed OPFR on the ModelNet40 and ScanObjectNN benchmarks for point cloud classification, and on S3DIS for semantic segmentation. For both PointNet++ and Point Transformer backbones, our OPFR achieves state-of-the-art results on different benchmarks. The comprehensive empirical results demonstrate the backbone-agnostic nature of our proposed method. We believe that our work can prompt consideration of how to better leverage geometric knowledge in network architecture designs for understanding point clouds.
Acknowledgements.
This research work is supported by the Agency for Science, Technology and Research (A*STAR) under its MTC Programmatic Funds (Grant No. M23L7b0021).

References
- Armeni et al. (2016) Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 2016. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1534–1543.
- Chen et al. (2024) Huizhou Chen, Jiangyi Wang, Yuxin Li, Na Zhao, Jun Cheng, and Xulei Yang. 2024. Improving 3D Occupancy Prediction through Class-balancing Loss and Multi-scale Representation. arXiv preprint arXiv:2405.16099 (2024).
- Cheng et al. (2023) Zhongyao Cheng, Cen Chen, Ziyuan Zhao, Peisheng Qian, Xiaoli Li, and Xulei Yang. 2023. COCO-TEACH: A Contrastive Co-Teaching Network For Incremental 3D Object Detection. In 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 1990–1994.
- Czerniawski et al. (2016) Thomas Czerniawski, Mohammad Nahangi, Carl Haas, and Scott Walbridge. 2016. Pipe spool recognition in cluttered point clouds using a curvature-based shape descriptor. Automation in Construction 71 (2016), 346–358.
- Duan et al. (2024) Lunhao Duan, Shanshan Zhao, Nan Xue, Mingming Gong, Gui-Song Xia, and Dacheng Tao. 2024. ConDaFormer: Disassembled Transformer with Local Structure Enhancement for 3D Point Cloud Understanding. Advances in Neural Information Processing Systems 36 (2024).
- Duan et al. (2019) Yueqi Duan, Yu Zheng, Jiwen Lu, Jie Zhou, and Qi Tian. 2019. Structural relational reasoning of point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 949–958.
- Eldar et al. (1997) Yuval Eldar, Michael Lindenbaum, Moshe Porat, and Yehoshua Y Zeevi. 1997. The farthest point strategy for progressive image sampling. IEEE Transactions on Image Processing 6, 9 (1997), 1305–1315.
- Foorginejad and Khalili (2014) A Foorginejad and K Khalili. 2014. Umbrella curvature: a new curvature estimation method for point clouds. Procedia Technology 12 (2014), 347–352.
- Geiger et al. (2012) Andreas Geiger, Philip Lenz, and Raquel Urtasun. 2012. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition. IEEE, 3354–3361.
- Guo et al. (2021) Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. 2021. Pct: Point cloud transformer. Computational Visual Media 7 (2021), 187–199.
- Hamdi et al. (2021) Abdullah Hamdi, Silvio Giancola, and Bernard Ghanem. 2021. Mvtn: Multi-view transformation network for 3d shape recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1–11.
- Han et al. (2024) Yucheng Han, Na Zhao, Weiling Chen, Keng Teck Ma, and Hanwang Zhang. 2024. Dual-Perspective Knowledge Enrichment for Semi-supervised 3D Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 2049–2057.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
- Himmelsbach et al. (2009) Michael Himmelsbach, Thorsten Luettel, and H-J Wuensche. 2009. Real-time object classification in 3D point clouds using point feature histograms. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 994–1000.
- Hoffman and Jain (1987) Richard Hoffman and Anil K Jain. 1987. Segmentation and classification of range images. IEEE transactions on pattern analysis and machine intelligence 5 (1987), 608–620.
- Hoppe et al. (1992) Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and Werner Stuetzle. 1992. Surface reconstruction from unorganized points. In Proceedings of the 19th annual conference on computer graphics and interactive techniques. 71–78.
- Jiang et al. (2018) Mingyang Jiang, Yiran Wu, Tianqi Zhao, Zelin Zhao, and Cewu Lu. 2018. Pointsift: A sift-like network module for 3d point cloud semantic segmentation. arXiv preprint arXiv:1807.00652 (2018).
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- Komarichev et al. (2019) Artem Komarichev, Zichun Zhong, and Jing Hua. 2019. A-cnn: Annularly convolutional neural networks on point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 7421–7430.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012).
- Li and Zhao (2024) Linfeng Li and Na Zhao. 2024. End-to-End Semi-Supervised 3D Instance Segmentation with PCTeacher. In 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 5352–5358.
- Li et al. (2016) Peng Li, Jian Wang, Yindi Zhao, Yanxia Wang, and Yifei Yao. 2016. Improved algorithm for point cloud registration based on fast point feature histograms. Journal of Applied Remote Sensing 10, 4 (2016), 045024–045024.
- Li et al. (2018) Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. 2018. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems 31 (2018).
- Li et al. (2024) Yicong Li, Na Zhao, Junbin Xiao, Chun Feng, Xiang Wang, and Tat-seng Chua. 2024. LASO: Language-guided Affordance Segmentation on 3D Object. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14251–14260.
- Liu et al. (2019) Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. 2019. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8895–8904.
- Liu et al. (2020) Ze Liu, Han Hu, Yue Cao, Zheng Zhang, and Xin Tong. 2020. A closer look at local aggregation operators in point cloud analysis. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. Springer, 326–342.
- Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
- Ma et al. (2022) Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. 2022. Rethinking network design and local geometry in point cloud: A simple residual MLP framework. arXiv preprint arXiv:2202.07123 (2022).
- Mao et al. (2019) Jiageng Mao, Xiaogang Wang, and Hongsheng Li. 2019. Interpolated convolutional networks for 3d point cloud understanding. In Proceedings of the IEEE/CVF international conference on computer vision. 1578–1587.
- Melzi et al. (2019) Simone Melzi, Riccardo Spezialetti, Federico Tombari, Michael M Bronstein, Luigi Di Stefano, and Emanuele Rodola. 2019. Gframes: Gradient-based local reference frame for 3d shape matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4629–4638.
- Nezhadarya et al. (2020) Ehsan Nezhadarya, Ehsan Taghavi, Ryan Razani, Bingbing Liu, and Jun Luo. 2020. Adaptive hierarchical down-sampling for point cloud classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12956–12964.
- Park et al. (2023) Jinyoung Park, Sanghyeok Lee, Sihyeon Kim, Yunyang Xiong, and Hyunwoo J Kim. 2023. Self-positioning point-based transformer for point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21814–21823.
- Qi et al. (2017a) Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 652–660.
- Qi et al. (2017b) Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems 30 (2017).
- Qian et al. (2022) Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. 2022. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in Neural Information Processing Systems 35 (2022), 23192–23204.
- Ran et al. (2022) Haoxi Ran, Jun Liu, and Chengjie Wang. 2022. Surface representation for point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18942–18952.
- Ran et al. (2021) Haoxi Ran, Wei Zhuo, Jun Liu, and Li Lu. 2021. Learning inner-group relations on point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15477–15487.
- Rusu et al. (2009) Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. 2009. Fast point feature histograms (FPFH) for 3D registration. In 2009 IEEE international conference on robotics and automation. IEEE, 3212–3217.
- Rusu et al. (2008a) Radu Bogdan Rusu, Nico Blodow, Zoltan Csaba Marton, and Michael Beetz. 2008a. Aligning point cloud views using persistent feature histograms. In 2008 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 3384–3391.
- Rusu et al. (2008b) Radu Bogdan Rusu, Zoltan Csaba Marton, Nico Blodow, and Michael Beetz. 2008b. Persistent point feature histograms for 3D point clouds. In Proc 10th Int Conf Intel Autonomous Syst (IAS-10), Baden-Baden, Germany. 119–128.
- Sanchez et al. (2020) Julia Sanchez, Florence Denis, David Coeurjolly, Florent Dupont, Laurent Trassoudaine, and Paul Checchin. 2020. Robust normal vector estimation in 3D point clouds through iterative principal component analysis. ISPRS Journal of Photogrammetry and Remote Sensing 163 (2020), 18–35.
- Scovanner et al. (2007) Paul Scovanner, Saad Ali, and Mubarak Shah. 2007. A 3-dimensional sift descriptor and its application to action recognition. In Proceedings of the 15th ACM international conference on Multimedia. 357–360.
- Serrano and Suceava (2015) Isabel M Serrano and Bogdan D Suceava. 2015. A medieval mystery: Nicole Oresme’s concept of curvitas. Notices of the AMS 62, 9 (2015).
- Sheng et al. (2022) Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Jianqiang Huang, Xian-Sheng Hua, Min-Jian Zhao, and Gim Hee Lee. 2022. Rethinking IoU-based optimization for single-stage 3D object detection. In European Conference on Computer Vision. Springer, 544–561.
- Sheng et al. (2023) Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Min-Jian Zhao, and Gim Hee Lee. 2023. PDR: Progressive depth regularization for monocular 3D object detection. IEEE Transactions on Circuits and Systems for Video Technology 33, 12 (2023), 7591–7603.
- Sun et al. (2023) Jiahao Sun, Chunmei Qing, Junpeng Tan, and Xiangmin Xu. 2023. Superpoint transformer for 3d scene instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 2393–2401.
- Sun et al. (2016) Junhua Sun, Jie Zhang, and Guangjun Zhang. 2016. An automatic 3D point cloud registration method based on regional curvature maps. Image and vision computing 56 (2016), 49–58.
- Swokowski (1979) Earl William Swokowski. 1979. Calculus with analytic geometry. Taylor & Francis.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2818–2826.
- Tchapmi et al. (2017) Lyne Tchapmi, Christopher Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. 2017. Segcloud: Semantic segmentation of 3d point clouds. In 2017 international conference on 3D vision (3DV). IEEE, 537–547.
- Te et al. (2018) Gusi Te, Wei Hu, Amin Zheng, and Zongming Guo. 2018. Rgcnn: Regularized graph cnn for point cloud segmentation. In Proceedings of the 26th ACM international conference on Multimedia. 746–754.
- Thomas et al. (2019) Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. 2019. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF international conference on computer vision. 6411–6420.
- Uy et al. (2019) Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. 2019. Revisiting Point Cloud Classification: A New Benchmark Dataset and Classification Model on Real-World Data. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
- Wang et al. (2019) Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. 2019. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (tog) 38, 5 (2019), 1–12.
- Wiesmann et al. (2022) Louis Wiesmann, Rodrigo Marcuzzi, Cyrill Stachniss, and Jens Behley. 2022. Retriever: Point cloud retrieval in compressed 3D maps. In 2022 International Conference on Robotics and Automation (ICRA). IEEE, 10925–10932.
- Wold et al. (1987) Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and intelligent laboratory systems 2, 1-3 (1987), 37–52.
- Wu et al. (2019) Wenxuan Wu, Zhongang Qi, and Li Fuxin. 2019. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. 9621–9630.
- Wu et al. (2022) Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Hengshuang Zhao. 2022. Point transformer v2: Grouped vector attention and partition-based pooling. Advances in Neural Information Processing Systems 35 (2022), 33330–33342.
- Wu et al. (2015) Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1912–1920.
- Xiang et al. (2021) Tiange Xiang, Chaoyi Zhang, Yang Song, Jianhui Yu, and Weidong Cai. 2021. Walk in the cloud: Learning curves for point clouds shape analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 915–924.
- Xu et al. (2020) Qiangeng Xu, Xudong Sun, Cho-Ying Wu, Panqu Wang, and Ulrich Neumann. 2020. Grid-gcn for fast and scalable point cloud learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5661–5670.
- Yang et al. (2018) Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. 2018. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 206–215.
- Yu et al. (2022) Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. 2022. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19313–19322.
- Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. 2017. Deep sets. Advances in neural information processing systems 30 (2017).
- Zhang et al. (2019) Kuangen Zhang, Ming Hao, Jing Wang, Clarence W de Silva, and Chenglong Fu. 2019. Linked dynamic graph cnn: Learning on point cloud via linking hierarchical features. arXiv preprint arXiv:1904.10014 (2019).
- Zhao et al. (2019) Hengshuang Zhao, Li Jiang, Chi-Wing Fu, and Jiaya Jia. 2019. Pointweb: Enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5565–5573.
- Zhao et al. (2021b) Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. 2021b. Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision. 16259–16268.
- Zhao et al. (2021a) Na Zhao, Tat-Seng Chua, and Gim Hee Lee. 2021a. Few-shot 3d point cloud semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8873–8882.
- Zhao and Lee (2022) Na Zhao and Gim Hee Lee. 2022. Static-dynamic co-teaching for class-incremental 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 3436–3445.