
Laplacian Unit: Adaptive Local Detail-Preserving Filtering for 3D Point Cloud Understanding

Haoyi Xiu [email protected]; [email protected] Xin Liu [email protected] Weimin Wang [email protected] Kyoung-Sook Kim [email protected] Takayuki Shinohara [email protected] Qiong Chang [email protected] Masashi Matsuoka [email protected] Department of Architecture and Building Engineering, Tokyo Institute of Technology, Tokyo, Japan Artificial Intelligence Research Center, AIST, Tokyo, Japan DUT-RU International School of Information Science and Engineering, Dalian University of Technology, Dalian, China Digital Architecture Research Center, AIST, Tokyo, Japan Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
Abstract

Analyzing point clouds is challenging, as people need to infer the underlying structure of the scanned surface from sampled points that are generally unordered and unstructured. In this study, we propose a new building block called the Laplacian unit for 3D point cloud understanding, which facilitates learning through adaptive local detail-preserving filtering. As a core component of the Laplacian unit, the discrete point Laplacian (DPL) extends the discrete Laplace operator by adaptively mining spatial and feature correlations to cope with the complex structure of point clouds. In contrast to the popular framework in which Laplacians are treated as a smoothness penalty for regularization, we construct an architectural unit that performs learned nonlinear filtering using the calibrated DPL. We show that the lightweight property of the Laplacian unit enables it to be integrated into multiple positions of a network with little computational overhead. Extensive experiments demonstrate that Laplacian units provide consistent performance improvements for networks with different types of operators and computational complexity across multiple point cloud understanding tasks, including 3D point cloud classification, part segmentation, and indoor and outdoor scene segmentation. Furthermore, networks equipped with Laplacian units achieve state-of-the-art performance in 3D point cloud classification (ScanObjectNN) and part segmentation (ShapeNet Part) tasks. The code will be publicly available.

keywords:
3D point cloud, 3D deep learning, discrete Laplace operator
MSC:
[2010] 00-01, 99-00
journal: Neurocomputing

1 Introduction

A 3D point cloud is essentially a set of points irregularly distributed on the surface of scanned objects in 3D space. The ever-growing capacity of scanning hardware enables 3D scanners to capture high-quality point clouds in a cost-effective manner. Therefore, an increasing number of point cloud datasets have become available to research communities, which has triggered active research on data-driven 3D point cloud understanding and related applications such as remote sensing [zhu2017deep, shinohara2020fwnet] and autonomous driving [cui2021deep, qi2018frustum, saleh2021fast]. Compared with conventional machine learning algorithms, deep neural networks (DNNs) can learn more discriminative descriptions of data and perform exceedingly well in various research domains [lecun2015deep]. In particular, convolutional neural networks (CNNs) have shown great success in 2D image understanding [krizhevsky2012imagenet, he2016deep], which has motivated researchers to apply CNNs to 3D point clouds. Because of the irregular structure of 3D point clouds, early DNN-based point cloud analyses project point clouds onto 2D or 3D regular grids, that is, 2D depth or bird’s eye view (BEV) images [kanezaki2018rotationnet, su2015multi] or 3D voxels [maturana2015voxnet, zhou2018voxelnet], thereby making the well-established convolution for regular grids applicable to point clouds. However, such approaches are considered suboptimal, as they tend to lose fine details due to the dimensionality reduction.
To resolve the aforementioned issue, the network needs to be directly applicable to raw point clouds while being insensitive to the input point order. Motivated by such observations, PointNet [qi2017pointnet] applies pointwise multi-layer perceptrons (MLPs) and symmetric functions (e.g., max-pooling) to the input point clouds; therefore, points are processed in a lossless manner with the entire network being invariant to the input point order. Subsequent work, termed PointNet++ [qi2017pointnet++], extended PointNet through repetitive applications of "small" PointNets on local subsets of points, resembling convolution in image processing.

The great success of CNNs is rooted in their ability to model local dependencies [krizhevsky2012imagenet], which are shared across the data domain [defferrard2016convolutional]. Specifically, convolutional filters are applied locally to extract features in which rich local spatial correlations are considered. Although raw point clouds can be processed directly owing to PointNet, the exploitation of spatial correlations remains challenging, as point clouds lack explicit structure, e.g., pre-defined local neighbors. Therefore, to replicate the success of CNNs, much effort has been devoted to incorporating local spatial correlations into network architectures. Depending on how the methods incorporate spatial correlations into local operations, three general categories of local aggregation methods are produced [liu2020closer]. The pointwise MLP–based method considers spatial correlations by directly concatenating the point positions or edge features to the input features [qi2017pointnet++, wang2019dynamic]. In other words, point positions are considered as a part of the input feature. The second type of method, called the adaptive weight–based method, computes the spatial correlation adaptively using sub-networks that predict convolution filters dynamically by taking as input the point positions or features [wang2018deep, liu2019relation, li2018pointcnn, wu2019pointconv]. On the other hand, the pseudo-grid–based method incorporates spatial correlations by constructing artificial kernel points onto which raw features are projected based on spatial proximity [hua2018pointwise, atzmon2018point, thomas2019kpconv, mao2019interpolated, tatarchenko2018tangent]. To achieve further enhancement, some works focus on developing more sophisticated local aggregations [jiang2018pointsift, zhao2019pointweb, hu2020randla, fan2021scf], while others enhance different aspects of the networks, such as global context modeling and sampling [yan2020pointasnl, xu2020grid, qiu2021semantic, qiu2021geometric].

Figure 1: Point cloud and its underlying shape. The point cloud loses significant structural information, which makes it extremely difficult to infer the underlying shape.

Although various local aggregations have been developed, we consider that ambiguities contained in the point cloud compared with the underlying objects still pose severe challenges to 3D point cloud understanding. As shown in Fig. 1, a point cloud lacks structural information such as connectivity between points due to the sampling process. As a result, it is extremely difficult for a machine (even for a human) to recognize the underlying shape correctly.

Therefore, we believe that introducing a stronger inductive bias leads to easier optimization and more discriminative representations. One possible way to introduce such an inductive bias is to learn from the differential properties of points residing in a restricted region, which precisely approximates the underlying surface. As observed in Fig. 2, even though the object is not recognizable by looking at the coarse shape, points in a local region are likely to reside on the same underlying surface. Furthermore, differential coordinates are known to contain local details, such as size, orientation, and curvature [sorkine2006differential]. Thus, exploiting such a differential representation results in local detail-preserving learning.

Figure 2: The differential representation of point clouds. Although a point cloud cannot be recognized by its coarse shape, the points in a local region likely reside on the same surface (right).

One popular choice for analyzing local differential properties is the family of discrete Laplace operators (Laplacians). The Laplacian is often used as a smoothness penalty for regularization [kalofolias2016learn] or for smoothing the surface via the mean curvature flow [desbrun1999implicit]. In addition, the linear nature of the Laplacian makes it attractive for analyses involving voluminous data, for example, deep learning–based analysis.

In this study, we propose a new building block for point cloud understanding, termed the Laplacian unit. The core ingredient of the Laplacian unit, the discrete point Laplacian (DPL), extends the conventional Laplacian by adaptively mining spatial/channel correlations to model the complex structure of irregular point clouds. The standalone DPL is a powerful local structure–aware operator; however, it has no access to global information by definition. Motivated by the assumption that the combination of precise local details and coarse global structures benefits the optimization, we inject the DPL into a filtering framework [szeliski2010computer] in which local details and global contexts are fused seamlessly. Furthermore, we apply a pointwise nonlinear calibration function that enhances/suppresses the DPL in a data-dependent manner for better adaptation to varying shapes. Consequently, the Laplacian unit performs a local-global fusion through a nonlinear filtering process.

The resulting Laplacian unit comprises a generic local detail-preserving operator (DPL), a calibration function, and a local-global fusion step. An overview of the Laplacian unit is shown in Fig. 3. The DPL can be easily extended by incorporating various methods to compute spatial/feature correlations. To understand its fundamental behavior, an extremely simple and efficient version of the Laplacian unit is proposed. The Laplacian unit is a lightweight module that can be easily integrated into various network types at multiple positions. We construct LU-Nets, a family of powerful models based on Laplacian units, to investigate the effectiveness of Laplacian units with respect to varying local aggregation methods and types of architectures. The performance of the LU-Nets is measured across a range of point cloud understanding tasks, including object classification, object part segmentation, and indoor and outdoor scene segmentation. We empirically demonstrate that Laplacian units can consistently boost a number of networks with different types of operations and complexity. Specifically, the LU-Nets achieve state-of-the-art performance in object classification on ScanObjectNN [uy2019revisiting] and object part segmentation on ShapeNet Part [yi2016scalable], demonstrating their effectiveness. Moreover, the design choices of the Laplacian unit are verified in the ablation study. Because Laplacian units are closely related to the notions of smoothness [dong2016learning] and filtering [szeliski2010computer], we hypothesize that their behavior can be monitored by inspecting the smoothness of the learned features. To this end, we perform quantitative and qualitative analyses based on feature smoothness to provide an intuitive understanding of the behavior of the Laplacian units.

We summarize the major contributions of the paper as follows:

  • 1.

    We propose the Laplacian unit, a new building block for 3D point cloud understanding, which realizes local detail-preserving learning through adaptive nonlinear filtering;

  • 2.

    We propose LU-Nets, a powerful family of models in which Laplacian units are integrated at multiple positions of different models;

  • 3.

    We verify the effectiveness of the Laplacian unit through extensive experiments on four fundamental tasks, including 3D object classification, object part segmentation, and indoor and outdoor scene segmentation.

  • 4.

    We investigate the impact of Laplacian units by performing comparative experiments using various LU-Nets with different complexities and local aggregation operators.

  • 5.

    We examine the effectiveness of the design choices of the Laplacian unit via an ablation study.

  • 6.

    We provide an intuitive understanding of the behavior of Laplacian units by performing quantitative and qualitative analyses based on smoothness.

Figure 3: Overview of the proposed Laplacian unit. Given an input point cloud, a query point (blue point) in the global feature space is converted into a local feature space. Then, the Laplacian unit takes the query point and its neighbors (gray points) as input to produce local detail-preserving features using the discrete point Laplacian (DPL) operator. A red shadow in the spatial relation represents the strength of the spatial relation between the query point and a neighbor, whereas a colored rectangle in the channel relation implies the strength of channel activation of a point after a pointwise transformation. Then, the DPL is produced by spatially aggregating the features of neighboring points to the query point. Subsequently, an adaptive calibration function is applied to the DPL to enhance/suppress useful/useless features. Finally, the query point in the global feature space is fused with the calibrated DPL to produce the updated query point (green point). \bigotimes denotes matrix multiplication, whereas \bigoplus represents element-wise addition.

2 Related Works

2.1 Deep learning on 3D point clouds

Projection-based Methods

Early attempts to apply deep learning to 3D point clouds project raw point clouds onto regular 2D (view) or 3D (voxel) grids to enable grid-based convolution operations. View-based methods [su2015multi, kanezaki2018rotationnet, feng2018gvcnn] project point clouds onto several 2D planes from different viewpoints. Multi-view images are subsequently processed using 2D CNNs. In contrast, voxel-based methods [maturana2015voxnet, zhou2018voxelnet, graham20183d, choy20194d] project point clouds onto 3D regular voxel grids and apply 3D convolutions. The performance of view-based methods relies heavily on the choice of projection planes, whereas voxel-based methods suffer from substantial memory consumption. Moreover, fine-grained geometrical details are lost owing to the conversions.

Point-based methods

Point-based methods, in contrast to projection-based methods, operate directly on irregular point clouds without any conversion. Point clouds are naturally unordered; hence, such methods must respect permutation invariance, which ensures that the results are insensitive to permutations of points. In general, point-based methods can be roughly categorized based on the type of local aggregation operator they use [liu2020closer]. Pioneered by PointNet [qi2017pointnet], pointwise MLP–based methods [qi2017pointnet++, zhang2019shellnet, lan2019modeling, wang2019dynamic] update the feature of a query point using a series of shared MLPs followed by a symmetric aggregation function (e.g., max-pooling), which guarantees that the entire process is permutation-invariant. Given a query point, neighborhood points and their associated features are collected through a neighborhood search algorithm such as k-nearest neighbor (kNN) or radius search. Subsequently, features that are either the input features of points or the concatenation of different features (e.g., positions, input features, or edge features) [qi2017pointnet++, wang2019dynamic, cui2021geometric] are transformed by shared MLPs. Updated query point features are obtained by spatially summarizing the transformed neighborhood features using a symmetric aggregation function. In contrast, inspired by the success of convolution in image processing, numerous efforts have been made to realize convolution on 3D point clouds. In general, these methods can be categorized into two types depending on how they generate convolution filters: adaptive weight–based and pseudo-grid–based methods.
Adaptive weight–based methods dynamically update the weights of convolution filters by transforming the input positions/features. The generated weights are applied to the input features for the convolution. For instance, spatial features such as relative positions (relative to query positions) [wu2019pointconv, wang2018deep] and Euclidean distance [liu2019relation] are used to generate filters. Moreover, some studies use additional edge features [simonovsky2017dynamic] as input, whereas others utilize the attention mechanism [wang2019graph, zhao2019pointweb] to further modulate the generated filters. In contrast, pseudo-grid–based methods build artificial convolution kernels by mimicking standard 2D image convolutions. Specifically, the input features are projected onto artificial kernel points, which are additional "grid points" with associated weight matrices (convolution filters), using similarity-based methods such as trilinear interpolation [mao2019interpolated], a Gaussian kernel [shen2018mining, atzmon2018point], or linear correlation [thomas2019kpconv]. Subsequently, the convolution is performed on the kernel points onto which the point cloud features are projected.

Based on these basic local aggregation operators, some studies have focused on enhancing network capability by improving the downsampling procedures [xu2020grid, yang2019modeling], injecting rotation invariance [fan2021scf], or performing careful supervision [gong2021omni].

2.2 Discrete Laplace operators

Discrete Laplace operators (Laplacians) can be considered discretizations of the continuous Laplace–Beltrami operator [sorkine2006differential]. Laplacians are ubiquitous. For instance, they are used as a smoothness penalty for regularization [smola2003kernels, zhou2004regularization, dong2016learning], with applications to point cloud segmentation [rabbani2006segmentation, landrieu2017structured], denoising [zeng20193d, dinesh2020point], and 3D mesh generation [wang2018pixel2mesh]. The application of the Laplacian most related to our study is mesh smoothing or fairing. The purpose of mesh smoothing is to improve the quality of a mesh by optimizing the vertex positions. Although a variety of mesh smoothing techniques exist [owen1998survey], we are particularly interested in Laplacian smoothing [taubin1995signal, desbrun1999implicit]. Laplacian smoothing generalizes the notion of Fourier analysis to meshes of arbitrary connectivity, where the eigenvalues and eigenvectors of the Laplacian matrix are considered natural vibration modes and natural frequencies, respectively [taubin1995signal]. Therefore, iterative application of the resulting linear operator, which is a low-pass filter, attenuates the frequencies of a given mesh and hence makes it smoother. Laplacian smoothing, with little modification, can be applied to almost any primitive geometric shape. Its simplicity and efficiency have made Laplacian smoothing attractive for research communities and industries. However, Laplacian smoothing tends to filter out too many frequencies, thus often leading to shrinkage or over-smoothing. Over-smoothing smooths out the characteristic edges or mixes up the features of neighboring vertices [li2018deeper], making the output undesirable for subsequent processing. While various heuristics have been proposed [taubin1995signal, desbrun1999implicit] to combat over-smoothing, these techniques either require careful tuning of the parameters or do not apply to high-dimensional features.

In contrast, although the Laplacian unit adopts a Laplacian-based operator as its core, the over-smoothing problem is converted into an adaptive filtering problem, which is handled via data-driven learning. Furthermore, we allow the Laplacian unit to perform not only per-point smoothing but also sharpening, with the implicit assumption that each point requires a different degree of smoothing/sharpening.

3 Laplacian Unit

In this section, we present the overall design of the Laplacian unit by providing detailed formulations of its components along with the rationales behind such designs. Next, an efficient instantiation of the Laplacian unit is presented, which is extensively used in this study. Then, the relationships to closely related works are discussed in detail. Furthermore, we present a powerful family of models called Laplacian unit–enhanced networks (LU-Nets) for point cloud understanding.

Figure 4: The computation flow of the Laplacian unit. k-nearest neighbor (kNN) search is performed on the input point cloud to form local regions centered on each query point. Subsequently, local features are exploited considering spatial (w_{ij}) and feature (\mathcal{M}) relations, followed by an aggregation function (\mathcal{A}). The resulting vector is then calibrated by \mathcal{T} and fused to the input by element-wise addition (\bigoplus).

3.1 Discrete Point Laplacian

Let X=\{x_{i}\}_{i=1}^{n}\in\mathbb{R}^{n\times d} denote the feature vectors of a point cloud, where n is the total number of input points and d is the feature dimension. We define the differential coordinates of the query point x_{i} as the difference between x_{i} and the centroid of its neighbors:

\delta x_{i}=x_{i}-\frac{1}{|\mathcal{N}(x_{i})|}\sum_{j\in\mathcal{N}(x_{i})}x_{j}, (1)

where x_{j}\in\mathcal{N}(x_{i}) is a neighboring point of x_{i} that resides in a local region centered on x_{i}. \mathcal{N}(x_{i}) can be defined using the k-nearest neighbor (kNN) search if the point cloud has a uniform density, or radius search otherwise. An illustration of the differential representation is shown in Fig. 2.
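As a concrete illustration, the differential coordinates of Eq. 1 can be computed in a few lines of NumPy. This is a minimal sketch under our own assumptions (a brute-force kNN and a uniform-density point cloud); the function name is hypothetical and not part of the paper.

```python
import numpy as np

def differential_coordinates(points, k=8):
    """Compute delta x_i = x_i - centroid of its k nearest neighbors (Eq. 1).

    A brute-force kNN sketch; real pipelines would use a spatial index
    (e.g., a KD-tree) instead of the O(n^2) distance matrix below."""
    diff = points[:, None, :] - points[None, :, :]   # pairwise displacements
    dist2 = (diff ** 2).sum(-1)                      # squared distances (n, n)
    np.fill_diagonal(dist2, np.inf)                  # exclude the query point itself
    knn_idx = np.argsort(dist2, axis=1)[:, :k]       # indices of the k neighbors
    centroids = points[knn_idx].mean(axis=1)         # neighborhood centroids
    return points - centroids                        # delta x_i per point
```

For points sampled evenly along a line, an interior point coincides with its neighborhood centroid, so its differential coordinate vanishes; boundary points do not.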

Compared with the absolute coordinates that encode the global spatial layout, the differential coordinate representation encapsulates precise local characteristics such as smoothness [dong2016learning] and curvature [taubin1995signal], which can potentially serve as a powerful tool for learning the underlying surface. Motivated by the above observation, we proceed to construct the operator based on this representation. Eq. 1 can be rewritten as

-\delta x_{i}=\frac{1}{|\mathcal{N}(x_{i})|}\sum_{j\in\mathcal{N}(x_{i})}(x_{j}-x_{i}), (2)

where the right-hand side of the equation is the definition of the discrete Laplacian (umbrella operator) [taubin1995signal]. Therefore, the differential representation has an intimate relationship with the discrete Laplacian, which is expected to preserve fine-grained local details. Intuitively, the operator measures the deviation from the local average by aggregating pairwise differences between the query point and neighboring points; thus, it is naturally related to the definition of smoothness [bruna2013spectral]. Although this operator can be computed quickly, considering that points are scattered in \mathbb{R}^{3} with variable densities and measurement errors, the spatial correlations between points should be taken into account before the aggregation. Furthermore, we expect that a vector (x_{j}-x_{i})\in\mathbb{R}^{d} encodes not only the spatial displacement but also semantic differences. Therefore, more discriminative local features can be produced by exploiting channel correlations. To this end, we define a generic operator, called the discrete point Laplacian (DPL), as

\Delta x_{i}=\mathcal{A}(\{w_{ij}\cdot\mathcal{M}(x_{j}-x_{i})\,|\,x_{j}\in\mathcal{N}(x_{i})\}), (3)

where w_{ij}\in\mathbb{R}_{\geq 0} denotes the spatial weight for each pairwise difference, and \mathcal{M}\colon\mathbb{R}^{d_{in}}\to\mathbb{R}^{d_{mid}} is a mapping that individually transforms the pairwise difference x_{j}-x_{i} to exploit inter-channel correlations. Then, the DPL of x_{i} is obtained by aggregating each transformed vector using the aggregation function \mathcal{A}. Note that the DPL is permutation-invariant because all operations involved are either pointwise transformations or symmetric functions, making it well suited for point cloud analysis.
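A hedged sketch of the generic DPL of Eq. 3, where the concrete choices are ours: the spatial weights w_{ij} default to uniform, the channel mapping \mathcal{M} is a single matrix, and the aggregation \mathcal{A} is a weighted sum.

```python
import numpy as np

def dpl(x, knn_idx, W_m, weights=None):
    """Discrete point Laplacian (Eq. 3), an illustrative instantiation.

    x: (n, d) features; knn_idx: (n, k) neighbor indices;
    W_m: (d, d_mid) matrix standing in for the channel mapping M;
    weights: optional (n, k) spatial weights w_ij (uniform if None)."""
    n, k = knn_idx.shape
    diffs = x[knn_idx] - x[:, None, :]        # pairwise differences x_j - x_i
    mapped = diffs @ W_m                      # exploit inter-channel correlations
    if weights is None:
        weights = np.full((n, k), 1.0 / k)    # uniform weights -> umbrella operator
    return (weights[..., None] * mapped).sum(axis=1)   # aggregation A
```

With W_m set to the identity and uniform weights, this reduces to the discrete Laplacian of Eq. 2, i.e., the neighborhood mean minus the query feature.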

3.2 Formulating local-global fusion as nonlinear filtering

By definition, the DPL is a purely local operator that injects local detail awareness into networks. Merely relying on the fine-grained but primitive local geometry, however, may not be optimal for tasks such as scene understanding, which requires understanding the global layout of 3D objects. Therefore, an organic method for local-global feature fusion is necessary for modeling complex semantic relationships. For this purpose, we propose encapsulating the DPL into a linear filtering framework [szeliski2010computer]. Linear filtering updates the feature of a given point using a weighted sum of neighboring points in a local region. A linear filter can be defined as

x_{i}^{\prime}=\sum_{j\in\mathcal{N}(x_{i})}w_{ij}x_{j}, (4)

where x_{i}^{\prime} is the updated feature. Alternatively, Eq. 4 can be rewritten as

x_{i}^{\prime}=\sum_{j\in\mathcal{N}(x_{i})}w_{ij}x_{i}+\sum_{j\in\mathcal{N}(x_{i})}w_{ij}(x_{j}-x_{i}). (5)

We adopt a convex combination of neighboring features, that is, \sum_{j\in\mathcal{N}(x_{i})}w_{ij}=1, w_{ij}\geq 0. Consequently, the resulting equation becomes

x_{i}^{\prime}=x_{i}+\sum_{j\in\mathcal{N}(x_{i})}w_{ij}(x_{j}-x_{i}), (6)

where the output becomes the sum of the input and a discrete Laplacian. Replacing the discrete Laplacian with the DPL defined in Eq. 3, we obtain the following DPL-based linear filtering:

x_{i}^{\prime}=x_{i}+\Delta x_{i}. (7)

The raw input x_{i} represents the feature in the global/absolute coordinate system, whereas the DPL is computed locally in the differential coordinate system; hence, the output x_{i}^{\prime} encapsulates both local and global characteristics. The aforementioned formulation of local-global fusion has two benefits. First, it is an efficient way to achieve local-global fusion because the fusion operation involves only a summation without any additional parameters; second, if we move x_{i} to the left-hand side of Eq. 7, we obtain the following equation:

x_{i}^{\prime}-x_{i}=\Delta x_{i}, (8)

which coincides precisely with the residual learning framework [he2016deep]. Preconditioning the optimization of the DPL by residual learning is intuitively sound, as \Delta x_{i} measures the local variations with reference to x_{i}. Therefore, we expect this framework to inherit benefits such as easier optimization than optimizing an unreferenced mapping [he2016deep] via residual learning.
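The equivalence between the linear filter of Eq. 4 and the residual form of Eq. 6 under convex weights can be checked numerically; the random features and weights below are, of course, only illustrative.

```python
import numpy as np

# Numerical check that Eq. 4 equals Eq. 6 when the weights w_ij form a
# convex combination (non-negative, summing to 1) over the neighborhood.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                  # 5 neighborhood features, d = 3
w = rng.random(5)
w /= w.sum()                                 # convex weights: sum to 1, w_ij >= 0
xi = x[0]                                    # treat x[0] as the query point

filtered = (w[:, None] * x).sum(axis=0)                  # Eq. 4: weighted sum
residual = xi + (w[:, None] * (x - xi)).sum(axis=0)      # Eq. 6: x_i + Laplacian
assert np.allclose(filtered, residual)
```

The two forms agree because \sum_j w_{ij}(x_j - x_i) = \sum_j w_{ij} x_j - x_i once the weights sum to one, which is exactly the algebraic step from Eq. 5 to Eq. 6.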

An execution of Eq. 7 adjusts x_{i} toward its centroid. In other words, all features move closer to the local (weighted) averages after one execution. Hence, a potential issue of Eq. 7 is over-smoothing [li2018deeper], which adversely affects the performance by mixing up the features in different clusters. On the other hand, the desired amount of smoothing is usually unknown and potentially depends on various conditions such as resolutions, query positions, and the categories to which the input point clouds belong. Therefore, the DPL \Delta x_{i} must be manipulated adaptively with respect to each point. To this end, we introduce a nonlinear calibration function \mathcal{T} to the DPL:

\widetilde{\Delta x_{i}}=\mathcal{T}(\Delta x_{i}), (9)

where \mathcal{T}\colon\mathbb{R}^{d_{mid}}\to\mathbb{R}^{d_{in}}. \mathcal{T} learns to calibrate the DPL in a nonlinear manner so that the desired degree of smoothing/sharpening can be obtained in a single step. As a result, the proposed Laplacian unit is defined as

x_{i}^{\prime}=x_{i}+\widetilde{\Delta x_{i}}, (10)

where local and global features are combined via nonlinear filtering.
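Putting Eqs. 9 and 10 together, one Laplacian-unit update can be sketched as follows. For brevity the calibration \mathcal{T} is reduced to a plain ReLU here (the instantiation in Sec. 3.3 uses BatchNorm followed by ReLU, which requires batch statistics omitted in this sketch), and the spatial weights are uniform.

```python
import numpy as np

def laplacian_unit(x, knn_idx, W_m):
    """One Laplacian-unit update (Eq. 10): x_i' = x_i + T(DPL(x_i)).

    x: (n, d) features; knn_idx: (n, k) neighbor indices;
    W_m: (d, d) channel-mixing matrix standing in for M."""
    diffs = x[knn_idx] - x[:, None, :]     # pairwise differences x_j - x_i
    dpl = (diffs @ W_m).mean(axis=1)       # uniform spatial weights + average
    calibrated = np.maximum(dpl, 0.0)      # ReLU in place of BatchNorm-ReLU
    return x + calibrated                  # residual local-global fusion (Eq. 10)
```

Note how the ReLU makes the update one-sided per channel: points whose DPL is negative in a channel keep their global feature unchanged there, while positive local deviations are added back.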

3.3 Instantiation

As a generic operator, there are numerous choices for each component of the DPL (Eq. 3). In this study, we focus on analyzing its fundamental influence by constructing an extremely simple and lightweight version of the Laplacian unit. The concrete formulation is presented in Table 1. Specifically, we set the combination of w_{ij} and \mathcal{A} as an average operator. This is because modern DNNs for point clouds often adopt farthest point sampling [qi2017pointnet++] as the downsampling method; thus, intermediate points are almost uniformly distributed during the forward propagation. Therefore, the average operator is a reasonable choice under such circumstances. For the function \mathcal{M}, a simple linear transformation W_{\mathcal{M}}\in\mathbb{R}^{d_{in}\times d_{in}} is adopted to exploit the feature correlation. To facilitate the training while enabling the function to be selective to useful features, we formulate the calibration function \mathcal{T} as a sequential application of batch normalization [ioffe2015batch] (BatchNorm) and a rectified linear unit (ReLU). We expect the ReLU function to retain useful features while eliminating the impact of harmful features by pushing them to zero.

$w_{ij}$  $\mathcal{A}$  $\mathcal{M}$  $\mathcal{T}$
Form  $1$  $\frac{1}{|\mathcal{N}(x_{i})|}\sum_{j\in\mathcal{N}(x_{i})}$  $W_{\mathcal{M}}$  BatchNorm–ReLU
Table 1: The simple instantiation used in this study. $W_{\mathcal{M}}$ indicates a linear transformation. BatchNorm is batch normalization [ioffe2015batch] and ReLU is a rectified linear unit.
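As a concrete illustration, the instantiation in Table 1 can be sketched in a few lines of NumPy. This is an inference-time sketch under simplifying assumptions, not our actual implementation: BatchNorm is replaced by its inference-time per-channel affine form (`gamma`, `beta`), and the weights are placeholders rather than trained parameters.

```python
import numpy as np

def laplacian_unit(x, neighbors, W_M, gamma, beta):
    """Simple Laplacian-unit instantiation (Table 1), inference-time sketch.
    x: (N, C) features; neighbors: (N, k) integer indices into x;
    W_M: (C, C) linear map M; gamma, beta: (C,) affine stand-in for BatchNorm.
    Returns filtered features of shape (N, C)."""
    diffs = x[neighbors] - x[:, None, :]               # (N, k, C): x_j - x_i per neighbor
    dpl = diffs.mean(axis=1) @ W_M                     # average aggregation, then M
    calibrated = np.maximum(gamma * dpl + beta, 0.0)   # affine + ReLU: the function T
    return x + calibrated                              # residual fusion (Eq. 10)
```

Note how the ReLU passes positive Laplacian responses while pushing negative ones to zero, realizing the selectivity discussed above.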

In terms of the additional computational overhead introduced by this particular formulation, let $S$ denote the number of Laplacian units integrated into a network, and let $C_{s}$ be the dimension of the feature to which the $s$-th Laplacian unit is applied. The additional parameters introduced by the Laplacian units can be calculated as

$\sum_{s=1}^{S}(C_{s}^{2}+2C_{s}).$ (11)

For this particular case, the $C_{s}^{2}$ term refers to the number of parameters of $\mathcal{M}$, while $2C_{s}$ denotes the parameters of the function $\mathcal{T}$. As can be seen, the number of additional parameters is small; hence, multiple Laplacian units can be applied without adding much computational overhead.
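For instance, with the per-stage channel sizes of a hypothetical four-stage encoder (64, 128, 256, and 512, chosen here purely for illustration), Eq. 11 gives roughly 0.35M extra parameters:

```python
# Parameter overhead of Eq. 11: each Laplacian unit at a stage with C_s channels
# adds C_s^2 weights for the linear map M and 2*C_s for BatchNorm's gamma/beta.
def laplacian_unit_params(channels):
    return sum(c * c + 2 * c for c in channels)

extra = laplacian_unit_params([64, 128, 256, 512])  # illustrative channel sizes
```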

Although simple, we show in Sec. 4 that the above formulation performs consistently well and provides significant performance improvements across a range of tasks. While more sophisticated variants of the Laplacian unit can be crafted, we focus on analyzing the fundamental impacts of the simple instantiation of the Laplacian unit. Therefore, we leave a thorough exploration of possible formulations for future work.

3.4 Relationships to Previous Works

In this section, we discuss the relationships between the Laplacian unit and closely related methods.

Edge feature augmentation

The difference term $x_{j}-x_{i}$ is frequently referred to as the edge feature in graph-based methods [wang2019dynamic, guo2020pct], where it is used as an additional input feature for local operations. As shown in Table 2, in its simplest form, the edge feature–augmented method updates the feature of a query point by aggregating the nonlinearly transformed neighborhood features (usually via max-pooling), where each feature is a concatenation of $x_{j}-x_{i}$ and $x_{i}$. In other words, the fusion of local and global features is performed before the aggregation, and the local feature is restricted to the pairwise difference. In addition, in some cases [wang2019dynamic], edge feature–augmented methods perform neighborhood searches in the feature space instead of the 3D space.

Method  $x_{i}^{\prime}$  Learned
Edge feature augmentation  $\textrm{max}(\{\phi(\textrm{concat}(x_{j}-x_{i},x_{i}))\,|\,x_{j}\in\mathcal{N}_{d}(x_{i})\})$  $\phi$
Laplacian smoothing [taubin1995signal]  $x_{i}+\frac{\lambda}{|\mathcal{N}(x_{i})|}\sum_{x_{j}\in\mathcal{N}(x_{i})}(x_{j}-x_{i}),\ 0<\lambda<1$  -
Table 2: Closely related methods. Learned shows the learnable components of each method. $\phi$ indicates a series of MLPs. $x_{j}\in\mathcal{N}_{d}(x_{i})$ denotes the nearest neighbors of $x_{i}$, where $d$ denotes the feature space in which the neighborhood search is performed.

We believe that the pairwise difference is not sufficiently informative for local-global fusion because it merely describes the difference between two connected points. Performing kNN in the feature space can enlarge the receptive field; however, it ignores the natural 3D layout of objects in a complex scene. Moreover, the feature-space kNN is much more computationally expensive than its 3D-space counterpart. Furthermore, the nonlinear transformations are applied to the concatenation of $x_{i}$ and $x_{j}-x_{i}$, which doubles the required parameters and complicates the optimization because the function is no longer a residual function. In contrast, the Laplacian unit treats the local feature as a patch instead of a pairwise difference, in which local properties are sufficiently described, thereby making the local feature more discriminative. We expect that the local-global fusion becomes easier when local features are more meaningful. The neighborhood search of Laplacian units is performed in 3D space; thus, it is fully aware of the natural 3D layout. Moreover, the Laplacian unit belongs to the residual learning framework, which is known to facilitate optimization.

To compare the effectiveness of Laplacian units and edge feature augmentation, we compare the network equipped with Laplacian units with the edge feature–augmented network (denoted as edge feature augmentation in Table 11) in Sec. 5.3.

Laplacian Smoothing

As shown in Table 2, Laplacian smoothing [taubin1995signal] formulates mesh smoothing as low-pass filtering using the discrete Laplacian. It receives 3D coordinates as input and generates 3D displacement vectors that are used to move the vertices.

The dampening factor $\lambda$ is manually tuned as a hyperparameter. In addition, the process is often repeated several times (the number of repetitions is also decided manually) to obtain the desired amount of smoothness.
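A minimal sketch of this classic procedure, assuming an unweighted neighborhood as in the second row of Table 2, is:

```python
import numpy as np

def laplacian_smooth(points, neighbors, lam=0.5, iterations=3):
    """Classic Laplacian smoothing: each point moves a fraction lam toward the
    centroid of its neighbors, repeated a manually chosen number of times.
    points: (N, 3); neighbors: (N, k) integer indices; 0 < lam < 1."""
    p = points.copy()
    for _ in range(iterations):
        delta = p[neighbors].mean(axis=1) - p   # discrete Laplacian per point
        p = p + lam * delta                     # low-pass filtering step
    return p
```

Both $\lambda$ and the number of iterations must be chosen by hand here, in contrast to the single learned calibration step of the Laplacian unit.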

In comparison with Laplacian smoothing, we design the Laplacian unit as an architectural unit that is optimized along with the backbone network. While the input is restricted to 3D coordinates for Laplacian smoothing, the proposed Laplacian unit takes arbitrary-dimensional vectors as input and returns vectors of the same dimension. Unlike Laplacian smoothing, the Laplacian unit considers both spatial and feature relations to model the complex structure of point clouds. Furthermore, the Laplacian unit performs smoothing and sharpening simultaneously by using a learned calibration function, whereas the dampening factor (a scalar) of Laplacian smoothing is handled in a hand-crafted manner. Such a calibration method is likely to be suboptimal and only realizes low-pass filtering. Moreover, while Laplacian smoothing needs to be applied multiple times to obtain the desired smoothing effect, Laplacian units aim to obtain the desired amount of smoothness by a single nonlinear filtering step.

Because Laplacian smoothing is not directly applicable to the deep learning framework, we formulate a learned Laplacian smoothing, where $\lambda$, as shown in Table 2, is optimized with the network (denoted as learned Laplacian smoothing in Table 11). Note that we make the original Laplacian smoothing stronger by removing the constraint $0<\lambda<1$, which makes it possible to perform low-pass and high-pass filtering adaptively. The effectiveness of the Laplacian unit and learned Laplacian smoothing is then compared in Sec. 5.3.

3.5 Constructing Laplacian Unit–Enhanced Networks (LU-Nets)

Modern DNNs for point cloud analysis can be differentiated by their use of local aggregation operators [liu2020closer] and network architectures (e.g., residual and nonresidual networks). To investigate the effectiveness of the Laplacian unit on various types of networks, we construct a family of models, LU-Nets, for point cloud analysis. In this section, we first introduce three local aggregation operators that can be used as ingredients in different models [liu2020closer]. We then elaborate on the construction of the LU-Nets.

3.5.1 Local aggregation methods

Refer to caption
Figure 5: Three local aggregation methods used in this study. $C$ and $D$ denote the input and output feature dimensions, respectively. $m$ denotes the number of neighbors, and $K$ represents the number of kernel points used in the pseudo-grid–based methods.

A local aggregation method updates the input feature of a point cloud by transforming neighbor features by MLPs or convolution, followed by aggregation using a symmetric function (e.g., max-pooling, average-pooling, or sum-pooling). In the following sections, we describe in detail the local aggregation methods used in this study. The computation flow for each method is shown in Fig. 5.

Pointwise MLP

The pioneering and most representative work on pointwise MLPs is PointNet++ [qi2017pointnet++]. Specifically, given a query point and its neighbors, PointNet++ applies pointwise MLPs to the concatenation of positions and features. Subsequently, the features of neighboring points are aggregated by max-pooling to update the query point feature. Although simple, it has demonstrated competitive and robust performance in a range of tasks. Therefore, it is chosen as our instantiation of the pointwise MLP method. An illustration of the pointwise MLP method is presented in Fig. 5 (top).
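A simplified sketch of this aggregation follows, with one shared linear+ReLU layer standing in for the MLP stack; `W` and `b` are illustrative placeholders, not trained weights:

```python
import numpy as np

def pointwise_mlp_aggregate(pos, feat, neighbors, W, b):
    """PointNet++-style local aggregation (sketch): a shared MLP is applied to
    each neighbor's concatenated (relative position, feature), then max-pooled.
    pos: (N, 3); feat: (N, C); neighbors: (N, k) indices; W: (3+C, D); b: (D,)."""
    rel = pos[neighbors] - pos[:, None, :]              # relative neighbor positions
    h = np.concatenate([rel, feat[neighbors]], axis=-1) # (N, k, 3+C)
    h = np.maximum(h @ W + b, 0.0)                      # shared linear + ReLU per neighbor
    return h.max(axis=1)                                # symmetric max-pooling
```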

Adaptive weight

Adaptive weight–based methods extend regular convolution by producing convolution weights adaptively from 3D positions or features. We choose PointConv [wu2019pointconv] (without the inverse density scale) as our instantiation of the adaptive weight method. Unlike some studies that incur huge spatial complexity (e.g., [simonovsky2017dynamic]) or only realize depth-wise convolution [liu2019relation, liu2020closer], there exists an efficient variant of PointConv in which the computational cost is greatly reduced without compromising the performance. This property is attractive because modern networks need to apply convolutions recursively in many layers. A graphical description is provided in Fig. 5 (middle).

Pseudo-grid

Pseudo-grid–based methods faithfully extend regular grid convolution to the irregular setting by introducing artificial kernels onto which input features are projected. Then, convolution is performed on the projected features using a weight matrix associated with each kernel point. Pseudo-grid–based methods mainly differ from each other in projection techniques and grid point dispositions. In this study, we construct a KPConv [thomas2019kpconv]-like operator as our instantiation of the pseudo-grid–based method because of its simplicity and effectiveness. The naive implementation of the original KPConv incurs huge memory consumption. Therefore, we create an efficient variant of KPConv by reducing the kernel weight dimension from $\mathbb{R}^{d_{in}\times d_{out}}$ to $\mathbb{R}^{d_{mid}}$, where $d_{in}$, $d_{out}$, and $d_{mid}$ are the input, output, and middle feature dimensions, respectively, analogous to those described in Wu et al. [wu2019pointconv]. Subsequently, following Wu et al. [wu2019pointconv], we apply the change-of-summation-order technique to reduce the memory consumption. The resulting efficient variant of KPConv is sufficient for our purpose, as the major characteristic of the method, that is, the use of fixed kernel points for the convolution, is retained.

3.5.2 Constructing Laplacian Unit–Enhanced Feature Encoders and Decoders

As a compositional unit, the Laplacian unit can be easily integrated into modern network architectures for point cloud analysis. A typical network for point cloud analysis adopts a multi-resolution structure. The network progressively downsamples an input point cloud with increasing depth to capture fine-to-coarse characteristics. In addition, successive downsampling ensures that the same amount of computation in a layer can process an input of a wider scale in the next layer, thereby making the algorithm computationally efficient.

One of the key concepts behind the multi-resolution structure is that the optimal resolution or scale for the task at hand is unknown; thus, aggregations from various resolutions or scales are generally beneficial. Therefore, we advocate that Laplacian units be appended whenever the resolution changes.

Laplacian Unit–enhanced Set Abstraction Levels

A classic example that uses a multi-resolution structure for feature encoding is the set abstraction (SA) level [qi2017pointnet++]. Similar to strided convolution in 2D CNNs, an SA extracts high-dimensional features while reducing (abstracting) the resolution. Specifically, representative points are selected via a downsampling process to be the center/query points for local aggregations. Subsequently, for each center point, a neighborhood search is performed to obtain its neighbors, which are used for the local aggregation. The representative points and their associated features resulting from local aggregations are passed to the next SA.
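The downsampling inside an SA is typically farthest point sampling [qi2017pointnet++]; a greedy NumPy sketch (starting deterministically from the first point, which is an implementation choice rather than part of the algorithm) is:

```python
import numpy as np

def farthest_point_sampling(points, m):
    """points: (N, 3); returns indices of m representative points chosen greedily
    so that each new point is farthest from the already selected set."""
    selected = [0]                                        # deterministic start
    dist = np.linalg.norm(points - points[0], axis=1)     # distance to selected set
    for _ in range(m - 1):
        nxt = int(dist.argmax())                          # farthest remaining point
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(selected)
```

Each new representative maximizes its distance to the already selected set, which yields the near-uniform intermediate point distributions assumed in Sec. 3.3.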

Refer to caption
Figure 6: Laplacian Unit–enhanced set abstraction levels (LU-SAs) and the Laplacian Unit–enhanced feature propagation level (LU-FP) used to construct models in this study. Basic (B) LU-SA performs a single-scale local aggregation followed by a Laplacian unit. Scale-enhanced (S) LU-SA upgrades the basic one by aggregating features from multiple scales, benefiting from a larger receptive field. Depth-enhanced LU-SA, on the other hand, augments the basic one by converting it to a residual block, where the network depth is increased. Any local aggregation operator mentioned in Sec. 3.5.1 can be injected into the LU-SAs.

In this study, to seamlessly integrate Laplacian units into modern frameworks, we propose Laplacian unit–enhanced SA levels (LU-SAs) by integrating Laplacian units into classic SAs. Specifically, a Laplacian unit is appended to the local aggregation in the SA if downsampling is performed. As shown in Fig. 6 (left), we construct three types of LU-SAs to further investigate the impact of Laplacian units on different types of networks. Note that any local aggregation operator mentioned in Sec. 3.5.1 can be used in LU-SAs. Basic (B) LU-SA represents the most basic type, where the original single-scale grouping SA [qi2017pointnet++] is augmented by a Laplacian unit. However, aggregations from a single scale are hardly sufficient, as each point likely has its own optimal scale. Therefore, scale-enhanced (S) LU-SA extends multiscale grouping SA [qi2017pointnet++] by appending an MLP after the concatenation of local aggregations from three different scales. The MLP is used to obtain multiscale features by combining the features from individual scales. The resulting multiscale features are then passed to a Laplacian unit for further refinement. Depth-enhanced (D) LU-SA, in contrast, provides discriminative features by increasing the network depth, which is known to be beneficial for performance improvements [he2016deep]. In particular, the first local aggregation is performed while the resolution remains unchanged. Subsequently, downsampling is performed, followed by the second local aggregation. The input feature is added to the output of the local aggregation using identity mapping. The resulting features are passed to a Laplacian unit for further enhancement.
In other words, D is a Laplacian unit–enhanced residual block.

Laplacian Unit–enhanced Feature Propagation Levels

A task such as semantic segmentation requires the resolution of the output to be the same as that of the input. Therefore, the abstracted points and features after successive applications of SA need to be upsampled to the original resolution. Inspired by U-Net [ronneberger2015u], the feature propagation (FP) level [qi2017pointnet++] was proposed to upsample both features and points via interpolation. Specifically, each FP reconstructs the high-resolution point features from the low-resolution ones using trilinear interpolation. Subsequently, the features of the same level in the encoder (a stack of SAs) are concatenated to the interpolated features using a skip connection so that each point possesses bidirectional features derived from the upsampling and downsampling processes. As a result, the full-resolution point cloud is reconstructed, where each point possesses individual features that travel through the encoder-decoder hierarchy. In this study, we propose a Laplacian unit–enhanced feature propagation level (LU-FP) by augmenting the original FP with a Laplacian unit appended at the end. Consequently, the refined features are passed to the next LU-FP. An illustration of the LU-FP is shown in Fig. 6 (right).
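The interpolation step of an FP can be sketched as inverse-distance weighting over the three nearest low-resolution points, following the convention of PointNet++ [qi2017pointnet++]; this sketch uses brute-force distances for clarity, not an efficient neighbor search:

```python
import numpy as np

def interpolate_features(hi_pos, lo_pos, lo_feat, k=3, eps=1e-8):
    """Upsample low-resolution features to high-resolution points.
    hi_pos: (N, 3); lo_pos: (M, 3); lo_feat: (M, C). Returns (N, C)."""
    d = np.linalg.norm(hi_pos[:, None, :] - lo_pos[None, :, :], axis=-1)  # (N, M)
    idx = np.argsort(d, axis=1)[:, :k]                  # k nearest low-res points
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + eps)  # inverse-distance weights
    w = w / w.sum(axis=1, keepdims=True)                # normalize per hi-res point
    return (lo_feat[idx] * w[..., None]).sum(axis=1)    # weighted feature average
```

In the full FP, the interpolated features would then be concatenated with the encoder's skip features before the shared MLP.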

Constructing Laplacian Unit–Enhanced Networks (LU-Nets)
Refer to caption
Figure 7: Network architectures used in this study. Each type of LU-SA and LU-FP introduced in Sec. 3.5.2 is stacked to construct the corresponding model. LU-Net basic (B) handles only a single scale per LU-SA, whereas LU-Net scale-enhanced (S) considers three scales per LU-SA. Therefore, S is assumed to have more representation power than B. By contrast, LU-Net depth-enhanced (D) boosts the performance by making the network deeper using Laplacian unit–enhanced residual blocks.

In this study, several different models are constructed to investigate the impact of Laplacian units on different types of networks. Specifically, the above-mentioned LU-SAs are used to construct the corresponding models, which are shown in Fig. 7. Note that only LU-Net scale-enhanced (S) is used for the classification task. LU-Net basic (B) is constructed by stacking four LU-SA (B)s and LU-FPs. The output of the last LU-FP is fed to an MLP for the per-point classification. The model considers a single scale per layer and serves as the baseline architecture. LU-Net (S) also stacks four LU-SA (S)s and LU-FPs. Unlike LU-Net (B), it takes into account multiple scales per layer, thus being more powerful than LU-Net (B). For the classification branch, the output of the last LU-SA (S) is transformed by an MLP, followed by the concatenation of max- and average-pooled features. The resulting features are passed to the last MLP to produce the classification scores. For the segmentation branch, the additional average-pooled features from each LU-SA (S) are concatenated to form a multiscale global feature. The global features are transformed by an MLP and concatenated to the output of the last LU-FP. Similar to LU-Net (B), the final point predictions are produced through an MLP. LU-Net depth-enhanced (D) stacks five LU-SA (D)s to construct a deeper model than LU-Net (B) and (S). The network depth is effectively increased by stacking two local aggregations per LU-SA (D). The final predictions are generated by transforming the output of the last LU-FP using an MLP.

4 Experiments

In this section, we present the configurations and results of each experiment. The four fundamental tasks on which the effectiveness of Laplacian units is evaluated are object classification on ScanObjectNN [uy2019revisiting], object part segmentation on ShapeNet Part [yi2016scalable], indoor scene segmentation on Stanford Large-Scale 3D Indoor Spaces (S3DIS) [armeni20163d], and outdoor scene segmentation on Semantic3D [hackel2017semantic3d]. We are particularly interested in the capability of the Laplacian unit as a compositional module, that is, the relative performance improvements brought by the Laplacian units. The results demonstrate that LU-Nets generally outperform their counterparts in which no Laplacian units are integrated. Model S is used as the default choice for benchmarking because it achieves competitive performance compared with model D while having lower computational complexity (the computational complexity is analyzed in detail in Sec. 5.2).

4.1 Implementation Details

All experiments are performed using PyTorch on a server with an NVIDIA V100 GPU. The default algorithm used in the neighborhood search of Laplacian units is kNN, where $k$ is set to 20 for classification and 16 for segmentation.
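For reference, this kNN search can be expressed as a brute-force NumPy routine; practical implementations use GPU batching or spatial indexing, so this sketch only fixes the semantics:

```python
import numpy as np

def knn_indices(query, points, k):
    """query: (Q, 3); points: (N, 3). Returns (Q, k) indices of the k nearest
    points (in Euclidean distance) for each query point."""
    d = np.linalg.norm(query[:, None, :] - points[None, :, :], axis=-1)  # (Q, N)
    return np.argsort(d, axis=1)[:, :k]                  # k smallest distances
```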

4.2 Object Classification on ScanObjectNN

We evaluate the effectiveness of the Laplacian unit using ScanObjectNN [uy2019revisiting].

We advocate ScanObjectNN because it consists of real-world 3D scans, which makes it more challenging than datasets composed of synthetic CAD models [wu20153d]. There are 15k objects in the dataset, where each object is categorized into one of the 15 classes. As they are real-world scans, each point cloud includes measurement errors, certain occlusions, and background points. We use the hardest set of the dataset and adopt the official train-test split, where 80% of the data are randomly sampled for training and the remaining 20% are used for testing.

Configuration

The performance is measured using overall accuracy (OA). We use the Adam [kingma2015adam] optimizer and train the model for 250 epochs with a batch size of 32. The initial learning rate is set to 0.001 and decayed by a factor of 10 when it plateaus. We adopt the same training strategy as in previous work [uy2019revisiting] for fair comparison. Specifically, 1,024 points are randomly sampled from 2,048 points as input. The input data are augmented by random rotation (rot.) and Gaussian jittering (jit.). Moreover, to further investigate the effectiveness of Laplacian units under different training strategies, we adopt another training strategy in which the number of points is increased to 2,048. Random scaling (scale.) and random translation (trans.) are used for data augmentation.

Result of object classification
Method #point Aug. OA \uparrow\downarrow
PointNet [qi2017pointnet] 1,024 rot. & jit. 68.2 -
PointNet++ [qi2017pointnet++] 1,024 rot. & jit. 77.9 -
DGCNN [wang2019dynamic] 1,024 rot. & jit. 78.1 -
PointCNN [li2018pointcnn] 1,024 rot. & scale 78.5 -
BGA-PN++ [uy2019revisiting] 1,024 rot. & jit. 80.2 -
BGA-DGCNN [uy2019revisiting] 1,024 rot. & jit. 79.7 -
SimpleView [goyal2020revisiting] 1,024 rot. & jit. & im.scale & crop 80.5 -
Ours (PM) 1,024 rot. & jit. 80.2 -
Ours (PM) + LU 1,024 rot. & jit. 78.7 \downarrow1.5
Ours (AW) 1,024 rot. & jit. 78.1 -
Ours (AW) + LU 1,024 rot. & jit. 79.9 \uparrow1.8
Ours (PG) 1,024 rot. & jit. 79.9 -
Ours (PG) + LU 1,024 rot. & jit. 81.2 \uparrow1.3
Ours (PM) 2,048 trans. & scale 83.2 -
Ours (PM) + LU 2,048 trans. & scale 84.5 \uparrow1.3
Ours (AW) 2,048 trans. & scale 81.1 -
Ours (AW) + LU 2,048 trans. & scale 82.1 \uparrow1.0
Ours (PG) 2,048 trans. & scale 79.6 -
Ours (PG) + LU 2,048 trans. & scale 81.2 \uparrow1.6
Table 3: Result of object classification. PM, AW, and PG represent pointwise MLP, adaptive weight, and pseudo-grid, respectively. LU denotes the Laplacian unit. \uparrow\downarrow denotes the relative performance obtained by adding Laplacian units. Bold text indicates the best performance.

The results are presented in Table 3. Under the same training strategy as in previous work [uy2019revisiting], significant improvements are observed when Laplacian units are integrated into AW and PG. In particular, PG + LU achieves state-of-the-art performance by surpassing all previous works (bold text in the middle part of Table 3), demonstrating the effectiveness of Laplacian units. Note that it even outperforms networks that adopt additional fine-grained ground truth information (BGA-PN++ and BGA-DGCNN) and the network that uses additional data augmentations (SimpleView). We speculate that the inter-point relations exploited by convolution-based methods (AW and PG) are further enhanced by Laplacian units. In contrast, the performance of PM drops when the Laplacian unit is added. Unlike convolution-based methods, the MLP-based method does not explicitly infer inter-point relations using positional information; thus, we conjecture that the training strategy is not sufficient for the MLP-based method to obtain, during training, such an inductive bias, which is injected by design in convolution-based methods.

The performance of the methods using the second training strategy is shown in the bottom part of Table 3. Under this strategy, we observe that the Laplacian units consistently provide performance improvements. Specifically, in contrast to the first training strategy, Laplacian units successfully augment PM, achieving the best performance among all networks (bold text at the bottom of Table 3).

4.3 Object Part Segmentation

We use the ShapeNet Part dataset [yi2016scalable] to evaluate the effectiveness of the Laplacian unit on the object part segmentation task. The dataset contains a total of 16,880 models, of which 14,006 are used for training and 2,874 for testing. It contains 16 object categories and 50 parts, with each model annotated into two to six parts. We use the data provided by Qi et al. [qi2017pointnet++].

Configuration

We use 2,048 randomly sampled points with their surface normal features as input. The input data are augmented by random anisotropic scaling and random translation. We train the models for 150 epochs. We use the SGD optimizer with an initial learning rate of 0.1, which is decayed by a factor of 10 when it plateaus. As a common practice, we use voting for post-processing, following [thomas2019kpconv]. The performance metrics used in this task are the instance-wise average intersection over union (Ins. mIoU) [qi2017pointnet++] and the class-wise average IoU (Cat. mIoU).

Result of object part segmentation
Method Ins. mIoU \uparrow\downarrow (Ins. mIoU)
PointNet++ [qi2017pointnet++] 85.1 -
DGCNN [wang2019dynamic] 85.2 -
PointCNN [li2018pointcnn] 86.1 -
PointConv [wu2019pointconv] 85.7 -
KPConv [thomas2019kpconv] 86.4 -
Point Transformer [zhao2021pointtransformer] 86.6 -
CurveNet [xiang2021walk] 86.8 -
PointNet [qi2017pointnet] 83.7 -
PointNet + LU 85.1 \uparrow1.4
RSCNN [liu2019relation] 86.2 -
RSCNN + LU 86.7 \uparrow0.5
Ours (PM) 86.7 -
Ours (PM) + LU 86.9 \uparrow0.2
Ours (AW) 86.5 -
Ours (AW) + LU 86.8 \uparrow0.3
Ours (PG) 86.5 -
Ours (PG) + LU 86.9 \uparrow0.4
Table 4: Result of object part segmentation. PM, AW, and PG represent pointwise MLP, adaptive weight, and pseudo-grid, respectively. LU denotes the Laplacian unit. \uparrow\downarrow denotes the relative performance obtained by adding Laplacian units. Bold text indicates the best performance.
Refer to caption
Figure 8: Qualitative results of the part segmentation task. The red rectangles show the improved predictions after the network is equipped with Laplacian units. A network with Laplacian units tends to produce smoother predictions.

As shown in Table 4, integrating Laplacian units into the network architecture is beneficial for all networks. Specifically, Laplacian units provide a significant performance gain ranging from 0.2 to 0.4 mIoU points over the backbones, which leads PM and PG to achieve state-of-the-art performance. We conjecture that employing Laplacian units explicitly enables each point to recognize how it differs from others; thus, the predictions become more boundary-aware. As shown in Fig. 8, assisted by Laplacian units, the networks are able to recognize inter-part boundaries. In addition, to verify the effect of the Laplacian units on existing architectures, Laplacian units are integrated into PointNet [qi2017pointnet] and RSCNN [liu2019relation]. PointNet serves as a representative of basic architectures, whereas RSCNN serves as a cutting-edge architecture. The results are shown in the middle rows of Table 4. Laplacian units again lift their performance significantly, which demonstrates their effectiveness on both basic and advanced networks.

4.4 Indoor Scene Segmentation

We evaluate the performance of the Laplacian unit on the indoor semantic segmentation task using the Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset [armeni20163d]. In total, six indoor environments containing 272 rooms are included. Each point is labeled with one of 13 categories. Similar to [tchapmi2017segcloud], we use Area 5 for testing and the other areas for training, which is a more principled way to assess generalizability.

Configuration

During training, following [zhao2019pointweb], we randomly sample one point from all scenes and extract a 1m×1m pillar centered at this point’s horizontal coordinates. Subsequently, 4,096 points are randomly sampled from the points contained in the extracted pillar. These points are taken as an element of a mini-batch. During testing, we regularly slide a pillar with a stride of 0.5m to ensure that every point is tested at least once. The input feature consists of 3D coordinates, RGB, and 3D coordinates normalized with respect to the maximum coordinates of the room. We use the SGD optimizer and train the models for 100 epochs, where one epoch is set to 1.5k iterations. The batch size is set to 32. The initial learning rate is set to 0.1 and decayed by a factor of 10 every 25 epochs. Following [thomas2019kpconv], we augment the input with random vertical rotation, random anisotropic scaling, Gaussian jittering, and random color dropout. The performance metrics are the point average IoU (mIoU), overall accuracy (OA), and class-averaged accuracy (mAcc).

Result of indoor scene segmentation
Method mIoU OA mAcc \uparrow\downarrow (mIoU)
PointCNN [li2018pointcnn] 57.3 85.9 63.9 -
SPGraph [landrieu2018large] 58.0 86.4 66.5 -
PointWeb [zhao2019pointweb] 60.3 87.0 66.6 -
Minkowski [choy20194d] 65.4 - 71.7 -
KPConv deform [thomas2019kpconv] 67.1 - 72.8 -
RFCR [thomas2019kpconv] 68.7 - - -
Point Transformer [zhao2021pointtransformer] 70.4 90.8 76.5 -
PointNet [qi2017pointnet] 41.1 - 49.0 -
PointNet + LU 52.0 83.8 59.3 \uparrow10.9
KPConv rigid [thomas2019kpconv] 65.4 - 70.9 -
KPConv rigid + LU 68.3 90.1 75.1 \uparrow2.9
Ours (PM) 65.6 89.0 72.0 -
Ours (PM) + LU 66.2 89.1 72.0 \uparrow0.6
Ours (AW) 63.1 88.3 69.6 -
Ours (AW) + LU 65.8 89.6 71.8 \uparrow2.7
Ours (PG) 63.0 88.3 69.3 -
Ours (PG) + LU 65.3 89.2 71.5 \uparrow2.3
Table 5: Result of indoor scene segmentation. PM, AW, and PG represent pointwise MLP, adaptive weight, and pseudo-grid, respectively. LU denotes the Laplacian unit. \uparrow\downarrow denotes the relative performance obtained by adding Laplacian units. Bold text denotes the best performance.
Refer to caption
Figure 9: The qualitative results of the indoor segmentation task. Red rectangles capture the improved predictions after integrating Laplacian units into the network. The network without Laplacian units produces rugged predictions, whereas the ones generated by the LU-Net are smoother and more boundary-aware. Green points in the correction maps (images in the last column) denote the predictions corrected by adding Laplacian units, whereas red points indicate that correctly predicted points are misclassified by adding Laplacian units. The blue points show that the predictions remain the same before and after adding Laplacian units.

The results are listed in Table 5. Consistent relative performance improvements are observed by inserting Laplacian units into the backbone networks. Notably, although PM achieves good performance, it is further boosted by Laplacian units by 0.6 mIoU points. For AW and PG, Laplacian units provide an even greater performance gain. To investigate the effect of Laplacian units on existing architectures, we integrate multiple units into both basic (PointNet [qi2017pointnet]) and advanced (KPConv (rigid) [thomas2019kpconv]) methods. The results (middle rows of Table 5) reveal that Laplacian units can boost them significantly. Notably, with Laplacian units, KPConv (rigid) even outperforms its strong variant KPConv (deform), which is based on deformable convolution, indicating that the information encoded by Laplacian units is more beneficial than tailoring receptive fields. Laplacian units make use of local smoothness statistics explicitly inside the network, which forces networks to be more boundary-aware and produce smooth predictions within the boundary. This speculation is further verified by the qualitative results shown in Fig. 9. In some cases, LU-Net is able to detect a subtle object that is missed completely by the counterpart that has no units integrated (third row of Fig. 9). In addition, LU-Net also produces smoother predictions, as shown in the first, second, and last rows of Fig. 9.

4.5 Outdoor Scene Segmentation

We use the Semantic3D dataset [hackel2017semantic3d] to evaluate the effectiveness of the Laplacian unit in an outdoor scene segmentation task. The dataset is an online benchmark that contains four billion points measured by a terrestrial LiDAR sensor. It provides a range of diverse urban scenes, including eight classes in total. Fifteen training scans are distributed. We use 13 of them for training and two for validation. We use the data from the reduced-8 challenge as our test data because they are less biased by objects close to the sensor [thomas2019kpconv]. The test scores are obtained by submitting the generated predictions to the evaluation server. We follow the data preparation procedure of [thomas2019kpconv].

Configuration

During the training, we randomly sample 8,192 points from a sphere (3m radius) sampled regularly from a scene. For testing, we regularly take spheres from the scene so that each point is tested at least once. We use the SGD optimizer and train models for 150 epochs, where one epoch is set to 500 iterations. The batch size is set to 32. The initial learning rate is set to 0.01 and decayed by a factor of 10 every 30 epochs. The same augmentation strategy as for indoor scene segmentation is used. The performance is measured in terms of the point average IoU (mIoU) and overall accuracy (OA).

Method mIoU OA \uparrow\downarrow
ShellNet [zhang2019shellnet] 69.3 93.2 -
SPGraph [landrieu2018large] 73.2 94.0 -
KPConv [thomas2019kpconv] 74.6 92.9 -
RG-Net [truong2019fast] 74.7 94.5 -
RandLA-Net [hu2020randla] 77.4 94.8 -
SCF-Net [fan2021scf] 77.6 94.7 -
RFCR [gong2021omni] 77.8 94.3 -
Ours (PM) 72.7 94.0 -
Ours (PM) + LU 73.1 94.1 \uparrow0.4
Ours (AW) 72.8 93.5 -
Ours (AW) + LU 75.1 94.5 \uparrow2.3
Ours (PG) 71.2 91.9 -
Ours (PG) + LU 73.7 93.9 \uparrow2.5
Table 6: Result of outdoor scene segmentation. PM, AW, and PG represent pointwise MLP, adaptive weight, and pseudo-grid, respectively. LU denotes the Laplacian unit. \uparrow\downarrow denotes the relative performance obtained by adding Laplacian units.

The results of the outdoor scene segmentation are presented in Table 6. In general, Laplacian units improve the performance of the backbones consistently. Similar to the indoor segmentation case, a greater performance improvement is observed for both convolution-based methods compared with that of PM. This might suggest that Laplacian units favor convolution-based methods, in which spatial correlations are well exploited, over the MLP-based method. In outdoor scenes, many artificial objects consist of simple planes. Therefore, we believe that the Laplacian unit would be effective for classifying such objects because it enforces smooth predictions within object boundaries by explicitly considering inter-point differences. For instance, despite the gap in mIoU between Ours (AW) + LU and RFCR [gong2021omni], Ours (AW) + LU outperforms RFCR by 0.2 OA points. This reveals that Ours (AW) + LU performs better in geometrically simple classes with more samples, for example, ground and buildings.

5 Analyzing Laplacian Unit

In this section, we first investigate the influence of Laplacian units on various architectures. To this end, we perform a model strength analysis in which the performance before and after applying Laplacian units to various types of models is compared. Next, the computational complexity of the model is analyzed. Then, the design choices of Laplacian units are validated by an ablation study. Finally, quantitative and qualitative analyses are performed to provide an intuitive understanding of the behavior of Laplacian units. All experiments are conducted on part segmentation tasks using the ShapeNet Part dataset because the task has sufficient complexity. The training configuration follows that used in the part segmentation task. The Ins. mIoU is used as the performance metric.

5.1 Model Strength Analysis

In this experiment, we evaluate the impact of the Laplacian unit on different types of models. The differences in the models include the type of local aggregation method (please refer to Sec 3.5.1 for more details) and the network architecture employed (please see Sec 3.5.2 for more details).

Pointwise MLP Adaptive weight Pseudo-grid
Type Bef. Aft. \uparrow\downarrow Bef. Aft. \uparrow\downarrow Bef. Aft. \uparrow\downarrow
B 86.3 86.6 \uparrow0.3 86.3 86.8 \uparrow0.5 86.1 86.6 \uparrow0.5
S 86.7 86.9 \uparrow0.2 86.5 86.8 \uparrow0.3 86.5 86.9 \uparrow0.4
D 86.5 86.7 \uparrow0.2 86.8 87.0 \uparrow0.2 86.3 86.7 \uparrow0.4
Table 7: Result of model strength analysis. Bef. denotes the performance before adding Laplacian units, whereas Aft. shows the performance after adding Laplacian units. \uparrow\downarrow denotes the relative performance obtained by adding Laplacian units.

The results are presented in Table 7. In general, Laplacian units successfully improve the performance of all models with varying types and complexity. In particular, relative improvements ranging from 0.2 to 0.5 are provided by integrating Laplacian units into the networks. As expected, without Laplacian units, the stronger models (S and D) consistently outperform the weaker models (B) (first column of each local aggregation operator in Table 7). We observe that the relative improvements provided by Laplacian units to weaker backbones are consistently greater than those provided to stronger ones. Therefore, it seems that Laplacian units offer a greater gain for less complex models. Nevertheless, the stronger models can still be significantly enhanced by the Laplacian units. For instance, model D of the adaptive weight (bold text in Table 7) even surpasses the best performance reported in Table 4.

5.2 Computational Complexity

In this experiment, we analyze the computational complexity of each model in the object part segmentation task. Specifically, the space complexity and time complexity of models B, S, and D are evaluated. We are particularly interested in the additional computational overhead incurred by adding Laplacian units.

The space complexity of each model is listed in Table 8. The stronger models (S and D) have approximately three times more parameters than the weaker models. Furthermore, more Laplacian units are added to stronger models than to weaker models. In general, even though multiple Laplacian units are added to the models, the relative increase in the parameters is still tolerable. Specifically, the increase in the number of parameters ranges from 3% to 10% for stronger models and reaches approximately 68% for tiny models (e.g., PM of the basic column). The large relative increase in the parameters for tiny models is expected because of the extremely small model sizes.

Basic (B) Scale-enhanced (S) Depth-enhanced (D)
Bef. Aft. #LU Bef. Aft. #LU Bef. Aft. #LU
PM 0.73 1.23 7 2.62 3.14 8 4.68 6.50 9
AW 6.04 6.54 18.55 19.07 18.27 20.08
PG 6.04 6.54 18.55 19.07 18.27 20.08
Table 8: The number of parameters of different models (million). The Bef. and Aft. columns indicate the number of parameters before and after adding Laplacian units to the models, respectively. #LU denotes the number of Laplacian units inserted into the models.
Basic Scale-enhanced Depth-enhanced
Bef. Aft. #LU Bef. Aft. #LU Bef. Aft. #LU
PM 0.04 0.06 7 0.07 0.10 8 0.10 0.15 9
AW 0.04 0.06 0.07 0.10 0.12 0.17
PG 0.04 0.07 0.10 0.13 0.19 0.24
Table 9: Inference speed of different models (seconds per batch). The batch size is set to 32. The Bef. and Aft. columns indicate the inference speed before and after adding Laplacian units to the models, respectively. #LU denotes the number of Laplacian units inserted into the models.

In terms of the inference speed per mini-batch, the models show a relative increase of 50% to 75%. We conjecture that such an increase mainly comes from the neighborhood search (kNN in this study), which is performed in every Laplacian unit. In practice, the neighborhood search can be shared and combined with the previous/next local aggregation, which can greatly speed up the inference.
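The sharing strategy mentioned above could look like the following sketch, which is our own illustration rather than the authors' implementation: the kNN indices of each resolution level are computed once and reused by both the local aggregation and every Laplacian unit at that level.

```python
def knn_indices(points, k):
    """Brute-force kNN: for each point, the indices of its k nearest
    neighbors (excluding itself)."""
    n = len(points)
    idx = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(points[i], points[j])))
        idx.append(order[:k])
    return idx

class NeighborhoodCache:
    """Compute the kNN graph of each resolution level once and share it
    among the operators at that level instead of searching repeatedly."""
    def __init__(self, k=16):
        self.k = k
        self._cache = {}

    def get(self, level, points):
        if level not in self._cache:
            self._cache[level] = knn_indices(points, self.k)
        return self._cache[level]
```

Since the point set at a given level is fixed for a forward pass, the second and later lookups are dictionary reads, which is where the conjectured speed-up would come from.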

5.3 Ablation Studies

We verify the design choices of the Laplacian unit by performing ablation experiments. Furthermore, the impact of the insertion positions is evaluated via a level analysis. For this experiment, the pointwise MLP-based model B is used because we are only interested in the relative performance changes caused by each component. Note that we do not perform the voting post-processing for the same reason.

Laplacian unit components

The results for the components in Eq. 10 are listed in Table 10. With all components present, the network achieves the best performance, suggesting that the combination of all components is indeed vital. The performance drops by 0.1 mIoU points when \mathcal{T} is removed. In this case, the Laplacian unit becomes a linear filter; thus, its representational power is weakened compared with that of the nonlinear filter. However, the performance drops significantly (0.6 mIoU points) when \mathcal{M} is removed, revealing that the channel relations of the raw DPL need to be exploited to become useful. The significance of the local-global fusion is evident, as the performance sharply drops by 0.7 mIoU points when the global feature, x_{i}, is removed from the formulation. Finally, the performance drops dramatically by 2.6 mIoU points when only the raw DPL remains, verifying our design choices.
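To make the roles of \mathcal{T}, \mathcal{M}, and the residual x_{i} concrete, the following sketch implements one plausible reading of Eq. 10, x_{i}^{\prime}=\mathcal{T}(\mathcal{M}(\mathrm{DPL}_{i})+x_{i}), with uniform neighbor weights, a toy per-channel affine map standing in for the learned \mathcal{M}, and ReLU as \mathcal{T}. All of these are simplifying assumptions, not the paper's learned modules.

```python
def laplacian_unit(features, neighbors, w=1.0, b=0.0):
    """Sketch of x_i' = T(M(DPL_i) + x_i).

    features : list of d-dimensional feature vectors (lists of floats)
    neighbors: neighbors[i] = indices of point i's neighborhood
    DPL_i    : mean of (x_j - x_i) over the neighborhood (uniform weights)
    M        : toy per-channel affine map w * (.) + b
    T        : ReLU non-linearity
    """
    out = []
    for i, xi in enumerate(features):
        nbr = neighbors[i]
        # Raw discrete point Laplacian: mean neighbor feature minus x_i.
        dpl = [sum(features[j][c] for j in nbr) / len(nbr) - xi[c]
               for c in range(len(xi))]
        # M(DPL) + x_i, then T.
        out.append([max(0.0, (w * d + b) + x) for d, x in zip(dpl, xi)])
    return out
```

Dropping \mathcal{T} turns the unit into a linear filter, and with a single scalar weight it degenerates toward classical Laplacian smoothing, which mirrors the ablation rows of Table 10 and the comparison in Table 11.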

Component mIoU Neighborhood size mIoU
w/ all 86.2 4 86.0
w/o \mathcal{T} 86.1 8 86.4
w/o \mathcal{M} 85.6 16 86.2
w/o x_{i} 85.5 32 86.3
w/o x_{i}, \mathcal{T}, and \mathcal{M} 83.6 64 86.4
Table 10: Result of the ablation analysis (left: Laplacian unit components; right: neighborhood size).
Neighborhood size

We examine the sensitivity of the performance to the size of the local neighborhood used in the DPL. The results are presented in Table 10. Among the tested sizes, we observe stable performance, except for the smallest size (4). Interestingly, most sizes outperform the default size of 16, indicating that the performance may be further improved by searching for the best size for the task at hand.

Method mIoU \uparrow\downarrow
None 85.8 -
Laplacian unit (Ours) 86.2 \uparrow 0.4
Edge feature augmentation 85.8 0
Learned Laplacian smoothing 85.9 \uparrow 0.1
Table 11: Comparisons with closely related methods.
Comparisons with related works

We further compared the proposed Laplacian units with closely related works, including edge feature-augmented networks and learned Laplacian smoothing. Specifically, the edge feature-augmented network concatenates the x_{j}-x_{i} term to the input feature during local aggregation, which is exactly equal to the first row of the x_{i}^{\prime} column in Table 2. On the other hand, as mentioned in Sec 3.4, we convert the dampening factor \lambda into a learnable network parameter to make the original Laplacian smoothing [taubin1995signal] learnable. The results are listed in Table 11. We empirically verify that Laplacian units provide the most significant relative improvement over the baseline (first row of Table 11). A minor increase in performance is observed by adopting learnable Laplacian smoothing, whereas edge feature augmentation does not affect the performance. Therefore, careful handling of x_{j}-x_{i} is crucial for achieving good performance.

Agg. Level No Full 1 2 3 4 5 6 7
PM 85.8 86.2 85.9 85.8 85.8 85.8 86.0 85.8 85.9
AW 85.6 86.4 85.6 85.7 85.8 85.9 85.8 85.9 85.9
PG 85.5 86.1 85.5 85.5 85.5 85.6 85.7 85.7 85.8
Table 12: Result of level (resolution) analysis. No denotes the performance of the network without Laplacian units, while Full means that Laplacian units are integrated at all levels. The following numbers 1–7 represent the insertion level, where only one Laplacian unit is integrated. The vertical line between 4 and 5 indicates the end of the encoder.
Impact of inserting positions

We investigate the impact of the insertion positions (resolutions) of Laplacian units. Specifically, only one unit is inserted into each position to analyze the impact of individual positions. The results are listed in Table 12. In general, we can still observe performance improvements when only one Laplacian unit is integrated at a certain level. For AW and PG, it seems that more improvements are obtained by inserting into deeper positions, while for PM peak values appear in both the shallow and deep levels. In contrast, the improvements are rather limited compared with the full version, in which units are inserted at all positions. Therefore, the effect of Laplacian units can be accumulated from different positions in the network, which verifies our integration strategy.

5.4 Visualization

Refer to caption
Figure 10: The relative smoothness measured at each Laplacian unit. Examples are provided using models trained on the part segmentation task. The orange line indicates the end of the encoder, whereas the blue line indicates the threshold below which the Laplacian unit performs smoothing and above which it performs sharpening. Evidently, Laplacian units learn to perform both smoothing and sharpening while being class-dependent.
Refer to caption
Figure 11: Features before and after applying the Laplacian unit. Whether the Laplacian unit smoothens or sharpens is determined by the measured relative smoothness of the level. A point is color-coded using the Euclidean norm of its feature. After smoothing, the isolated peak features (hot-colored ones) disappear, and the distributions become more uniform. In contrast, isolated peak features appear after sharpening. Nevertheless, in general, smoothing and sharpening are performed simultaneously at the point level (see the airplane in the second row).

In this section, we analyze the outputs of Laplacian units to facilitate an intuitive understanding of their behavior. For this purpose, quantitative and qualitative analyses are conducted.

Configuration

We use pre-trained models, i.e., models (S) from the part segmentation task, for the following analyses. Because the Laplacian unit performs nonlinear filtering on the input, we hypothesize that the behavior of the Laplacian unit can be tracked by monitoring the smoothness of the features. The smoothness [dong2016learning] can be defined as

\frac{1}{2}\sum_{i,j}W_{l,ij}\|y_{l,i}-y_{l,j}\|^{2}=\textrm{tr}(Y_{l}^{\top}L_{l}Y_{l}) (12)

where Y_{l}=\{y_{l,i}\}_{i=0}^{n_{l}}\in\mathbb{R}^{n_{l}\times d_{l}} is the matrix of feature vectors of the point cloud at level l, which are normalized to a unit hypersphere (the maximum norm of the feature vectors is one), W_{ij}\in\mathbb{R}_{\geq 0} denotes the weight of the edge between y_{l,i} and y_{l,j}, L_{l}=D_{l}-W_{l} is the graph Laplacian, and D_{l,ii}=\sum_{j}W_{l,ij} is the diagonal weighted degree matrix. The sparse weight matrix W_{l} is a kNN graph (k=16 in this study), where W_{ij}=1 when there exists an edge between points i and j. Because we are interested in the smoothness of the output features with reference to the input features, we define the relative smoothness to quantify this change:

RS=\frac{\textrm{tr}(Y_{l,out}^{\top}L_{l}Y_{l,out})}{\textrm{tr}(Y_{l,in}^{\top}L_{l}Y_{l,in})}, (13)

where Y_{l,in} and Y_{l,out} are the features before and after applying the Laplacian unit at level l, respectively. The Laplacian unit smooths the input when RS is larger than one, whereas it sharpens the feature when RS is smaller than one.
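Eq. 12 and Eq. 13 can be evaluated directly from a kNN graph. The sketch below is our own minimal implementation with illustrative function names: it computes the pairwise form of the smoothness, which equals \textrm{tr}(Y^{\top}LY) for a symmetric W, and the resulting RS.

```python
def knn_weights(points, k):
    """Binary kNN adjacency W: W[i][j] = 1 if j is among i's k nearest."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(points[i], points[j])))
        for j in order[:k]:
            W[i][j] = 1.0
    return W

def smoothness(Y, W):
    """0.5 * sum_ij W_ij * ||y_i - y_j||^2, which equals tr(Y^T L Y)
    for symmetric W with L = D - W."""
    n = len(Y)
    return 0.5 * sum(W[i][j] * sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                     for i in range(n) for j in range(n))

def relative_smoothness(Y_in, Y_out, W):
    """RS of Eq. 13: ratio of output to input smoothness on the same graph."""
    return smoothness(Y_out, W) / smoothness(Y_in, W)
```

On a toy graph, a constant output drives the numerator to zero: a smaller value of the Eq. 12 quantity simply means that the features vary less across the edges of the graph.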

Result of smoothness analysis

The RS for each level using the ShapeNet dataset is calculated, and some examples are shown in Fig. 10. The results show that the Laplacian unit performs smoothing and sharpening adaptively depending on the input classes and the levels at which it is integrated. Furthermore, the results suggest that the networks in general tend to prefer smoother features in the encoder levels, whereas sharper features are produced in the decoder levels. We conjecture that the smoother features effectively summarize coarse shape information from local details in the encoder, whereas the interpolated features are sharpened for pointwise prediction in the decoder. Moreover, Laplacian units for different models perform similarly for some classes, whereas they perform differently for other classes. This suggests that the network types, that is, the local aggregation operations used in the network, influence the behavior of Laplacian units differently. The qualitative results are shown in Fig. 11. In the first row, the features with relatively large magnitudes before applying Laplacian units are smoothed, and hence, the feature distributions appear more uniform. In contrast, sharpening makes feature distributions more variable, as isolated extreme features appear at several positions of the shapes after applying Laplacian units. Nevertheless, we observe that smoothing and sharpening occur simultaneously at the point level in practice. For instance, the wings of the airplane in the second row of Fig. 11 become smoother while some points with high responses appear on the body, which further verifies the flexible functionality of Laplacian units.

6 Conclusion

We propose a new building block for deep learning-based 3D point cloud understanding, called the Laplacian unit, that learns the complex structure via local detail-preserving adaptive filtering. Furthermore, a discrete point Laplacian (DPL) is proposed to effectively model the local spatial and feature relations. The DPL is encapsulated in a nonlinear filtering framework in which seamless local-global feature fusion is realized. The resulting Laplacian unit is a generic and lightweight module that can be integrated into various models. An extremely simple and efficient version of the Laplacian unit is proposed to investigate its fundamental influence on performance. A strong family of models called Laplacian unit-enhanced networks (LU-Nets) for 3D point cloud understanding tasks is proposed. Three major categories of local aggregation operators, along with three distinct network architectures, are prepared to construct various models, which are used to investigate the effectiveness of the Laplacian unit on different types of models. Extensive experiments on four fundamental tasks show that Laplacian units significantly improve the performance of the backbone network in most cases. In particular, the LU-Nets achieve state-of-the-art performance in the classification (ScanObjectNN) and part segmentation (ShapeNet Part) tasks. Furthermore, we validate that Laplacian units provide consistent improvements regardless of model complexity and type. The quantitative and qualitative analyses based on smoothness show that the Laplacian unit performs adaptive filtering at the point level, which verifies our design motivations.