
NoiseTrans: Point Cloud Denoising with Transformers

Guangzhe Hou, Guihe Qin, Minghui Sun, Yanhua Liang, Jie Yan, Zhonghan Zhang
Jilin University, Changchun, Jilin, China
Abstract

Introduction: Point clouds obtained from capture devices or 3D reconstruction techniques are often noisy and interfere with downstream tasks.
Objectives: The paper aims to recover the underlying surface of noisy point clouds.
Methods: We design a novel model, NoiseTrans, which uses a transformer encoder architecture for point cloud denoising. Specifically, we exploit the structural similarity within point-based point clouds with the help of the transformer's core self-attention mechanism. By expressing the noisy point cloud as a set of unordered vectors, we convert point clouds into point embeddings and employ the transformer to generate clean point clouds. To make the transformer preserve details when sensing the point cloud, we design Local Point Attention to prevent the point cloud from being over-smoothed. In addition, we propose sparse encoding, which enables the transformer to better perceive the structural relationships of the point cloud and improves denoising performance.
Results: Experiments show that our model outperforms state-of-the-art methods in various datasets and noise environments.

keywords:
point clouds, transformer, denoising, point embedding, local point attention

1 Introduction

With the increasing availability of 3D scanning equipment and the development of 3D reconstruction techniques, point clouds are becoming more readily available. In industry, point clouds are also widely used in areas such as autonomous driving and robotics [6]. However, point clouds can be corrupted by noise during acquisition due to limitations of the equipment and blurred matching of reconstruction techniques. This noise can lead to instability of the underlying structure and severely affect the downstream understanding task. At the same time, the denoising task also faces many serious challenges due to the disorderly and irregular nature of point clouds [27]. Therefore, the task of point cloud denoising is crucial to point cloud processing. A well-designed denoising method should restore the basic structure of the point cloud while retaining as much detail as possible.

Point cloud denoising methods can be divided into two different categories: non-deep learning and deep learning. Non-deep learning methods have been developed over decades, producing a number of approaches [9, 4, 1, 40, 24, 12, 15, 44, 16] for different models. Depending on the implementation, they are generally classified as local surface fitting, non-local means, sparse coding, and graph-based methods. These methods provide a variety of ideas for point cloud denoising that remain of lasting value. Nevertheless, non-deep learning methods are still hampered by their reliance on prior knowledge and hyper-parameter settings.

In recent years, a number of point-based deep learning denoising methods [30, 14, 26, 22, 23] have also been proposed with remarkable results. In earlier work [30, 14, 26], deep learning methods typically used graph convolution or multi-layer perceptrons to extract features from point clouds to predict the true locations of noisy points. More recently, distinctive methods such as DMR [22] and Score [23] have been proposed. They constructed the point cloud as a mathematical model to remove noise. The emergence of these methods has also greatly enriched problem-solving possibilities.

With the success of DETR [3] on object detection tasks, Transformer has been rapidly developed in the field of computer vision. The transformer-based models have achieved outstanding results on various image tasks. Particularly in the low-level task, a range of models such as IPT [5] and Uformer [37] have achieved notable performance. In the 3D vision domain, PT [45] creatively used Transformer for the classification and semantic segmentation of point-based point clouds. Inspired by this success, we incorporate transformer into point-based denoising tasks and achieve state-of-the-art results. To the best of our knowledge, we are the first to use the transformer to solve point-based point cloud denoising tasks.

We propose the NoiseTrans framework, the main idea of which is to use a self-attention mechanism to capture the global semantic relevance of structural information, and the permutation invariance of Transformer allows us to disregard the order of points. To enable the transformer structure to obtain structural organization and to improve denoising performance, we designed sparse encoding. Further, considering the issue of surface detail preservation, we added the Local Point Attention module. Extensive experiments demonstrate that NoiseTrans with global semantic relevance achieves state-of-the-art results with different datasets and different noise types. The main contributions of this paper are summarized as follows:

  1. We propose a novel transformer-based point cloud denoising framework, named NoiseTrans, which is well suited to point-based point cloud data corrupted by noise in various scenarios.

  2. We propose a point embedding module with Local Point Attention, which extracts local features of the point cloud at multiple scales and emphasizes edge weights to preserve detail.

  3. We propose a learnable Sparse Encoding which is permutation invariant and provides more information on structural relationships for the denoising task.

  4. We compare our method with existing methods on raw and synthetic point cloud datasets corrupted with different types of noise and achieve state-of-the-art performance.

This paper is organised as follows. Section 2 reviews related work. Section 3 describes our architecture in detail. Section 4 presents the results of various experiments. Section 5 concludes the paper.

2 Related work

2.1 Non-deep learning methods

There has been a long history of research on point cloud denoising methods. Point cloud denoising has become particularly important in recent years as technologies such as remote sensing, robotics, and autonomous driving continue to develop. Non-deep learning point cloud denoising methods can generally be divided into four categories: local-surface-fitting based methods [4, 1, 17], non-local means based methods [46, 34], sparsity-based methods [40, 24, 32] and graph-based methods [15, 44, 16].

The most representative local surface fitting method is MLS [1], which assumes that the surface of the point cloud is smooth and fits a plane to the points for denoising. In addition, other surface fitting methods, such as jet-fitting with re-projection [4] and various forms of bilateral filters [17], have achieved remarkable results. However, these methods can over-smooth or over-sharpen the surface at high noise levels.

Non-local means-based methods, such as [46, 34], are derived from 2D image denoising: they detect similar structures in non-local regions of the point cloud and then remove the noise through filtering. These methods can retain more surface features and avoid excessive smoothing. However, they require a large number of parameters to be set, which makes them difficult to apply to different scenarios.

Sparsity-based methods [40, 24, 32] generally first solve an optimization problem with sparsity constraints to reconstruct the normals and then update the coordinates of the points according to the reconstructed normals. MRPCA [24] is a representative example, but its performance degrades at high noise levels.

Graph-based methods [15, 44, 16] define the point cloud as a graph, with the points as nodes, and then denoise it through graph filters [44, 16]. In particular, GLR [44] uses the graph Laplacian regularizer as a filter for denoising and is highly effective. Under high noise levels, however, the structure of the graph becomes unstable, which degrades the denoising effect.

2.2 Point-based deep learning methods

In recent years, most of the significant advances in visual recognition have been achieved by deep learning models, especially deep Convolutional Neural Networks (CNNs). However, directly processing point clouds with off-the-shelf image-based methods is not straightforward, since point clouds are essentially irregular and unordered and thus unsuitable for convolutional networks designed for gridded features. With progress in 3D object classification, especially the proposal of PointNet [27] and PointNet++ [28], point-based deep learning methods have become feasible, and deep-learning-based point cloud denoising has gained wider attention.

PointCleanNet [30], which employs PCPNet [13] (a variant of PointNet [27]) as its backbone, predicts the displacement vector from each noisy point to the object surface according to the local features of that point. Because of the multiple iterations in the denoising process, shrinkage of the point cloud occurs. TotalDn [14] proposed unsupervised point cloud denoising for the first time, introducing the prior that the denser a region of the point cloud is, the closer it is to the real surface; its loss function uses neighbors to determine the location of ground truth points without considering the noisy points themselves. This method works well for flatter surfaces at lower noise levels. Due to the similarity between point clouds and graphs in 3D space, graph convolutional networks [26] have also been used for denoising with positive results. Recently, DMR [22] has shifted attention to the physical characteristics of point clouds and reconstructs the underlying manifold through a down-up sampling network to denoise the point cloud. Meanwhile, Score [23] regards the noisy point cloud as the distribution obtained by convolving the underlying manifold with a noise distribution and updates the position of each point by gradient ascent on the likelihood of that distribution. This approach, however, can create outliers.

2.3 Transformer methods

Transformer [36] was originally proposed as a sequence-to-sequence model for machine translation [35] and consists of an encoder and a decoder. Each encoder block consists mainly of a multi-head self-attention module and a feed-forward network (FFN). Compared to the encoder blocks, decoder blocks additionally insert cross-attention modules between the multi-head self-attention modules and the FFN [18]. With the continuous development of natural language processing (NLP) techniques, transformer-based pre-training models have been proposed one after another. Among them, pre-trained models such as BERT [8] and the GPT series [29, 2] have achieved advanced performance in various tasks. As a result, the transformer has become the architecture of choice for NLP.

Inspired by the great success of the self-attention mechanism in NLP, the transformer has also been transplanted to the computer vision field [5, 21, 20]. ViT [10] proposes an image transformer that takes image patches as input and achieves advanced results on image classification tasks. Transformer-based models have also achieved impressive performance on tasks such as object detection [3, 19], semantic segmentation [47] and target tracking [7]. Beyond that, the transformer provides new perspectives for point-based 3D point cloud processing, e.g. [45] and [43].

Unlike the denoising methods described above, our NoiseTrans is designed on the basis of the transformer encoder architecture. We incorporate the core self-attention mechanism into the point-based point cloud denoising method in order to achieve optimal performance.

3 Approach

In this section, we propose a network architecture for point cloud denoising based on the characteristics of point cloud noise and present our design considerations.

3.1 Overview

Currently, deep learning denoising methods generally fit clean surfaces either through local features [30, 14, 26] or by abstracting point clouds into mathematical models [22, 23]. In contrast, we focus on the global relationships within the point cloud. Previous work [46] has shown that synthetic and raw point clouds tend to exhibit self-similarity; extracting and exploiting this similar geometry and overall structure provides a powerful means of denoising. Meanwhile, extensive experiments have shown that the transformer's self-attention mechanism can capture long-range dependencies between elements, whether in natural language processing [36, 8, 2] or computer vision [45, 21, 10]. This is exactly what we need for the denoising task. Moreover, its permutation invariance is naturally suited to unordered point clouds. Further, the point-to-point correspondence of the transformer structure is convenient and crucial for our denoising task.

Figure 1: Illustration of the proposed point cloud denoising framework. The different colors of the token show the degree of effect on the red token.

Based on the above analysis, we propose a transformer-based point cloud denoising method, NoiseTrans, which is shown in Figure 1. Our model consists of three parts: 1) Point embedding module: we characterize local point cloud feature sets at different scales as embeddings of points, and add our Local Point Attention to the representation to make the model more sensitive to edge bulges, which preserves more detail and prevents the point cloud from being over-smoothed. 2) Transformer module: we propose Sparse Encoding in the transformer to perceive the structural relationships of the point cloud; sparsity is added to the encoding as a reference for sensitivity to outliers. 3) Output header module: we design a tail output for the denoising task, with residual connections added to facilitate the identification of noise.

Note that the downsampling layer, which is often used in transformer models [5, 45, 21], is discarded from our network structure, because we aim to recover surfaces that have been corrupted by noise. Moreover, our network places no requirement on the number of input points, so it can be applied to a wide range of scenes. We show more details in the following sections.

3.2 Point embedding

The purpose of this section is to obtain the input for the Transformer module. For the transformer to work effectively on point clouds, the first step is to convert the point cloud into a vector sequence. One of the most straightforward solutions is to use the coordinates as a set of vector inputs. However, this approach leads to the loss of local features. To enable each point to probe the geometric features of the local area in which it is located, we design a point embedding module. This also compensates for the deficiency of the transformer in local feature extraction [37, 21].

Figure 2: Illustration of the point embedding module.

In 2D images, convolution kernels of various sizes are usually designed to obtain the corresponding perceptual fields to extract features at different scales. Inspired by this experience, we choose to use the number of neighbor points to emulate the convolution kernels. As shown in Figure 2, we design three feature extraction units with different numbers of neighbors in a parallel manner to capture features at various scales.

In the feature extraction unit, we construct a patch of the K nearest neighbor points for each point in Euclidean space, followed by a convolution layer that generates high-dimensional features. The central point aggregates the patch as its own local features. After several layers of feature aggregation, the outputs of the three feature extraction units are concatenated to obtain a vector sequence containing the local features of the point cloud. We call the above process Point Embedding, and the resulting vectors can be fed directly to the transformer.

Formally, given the features $\mathbf{F}^{l}_{p}=\{\mathbf{f}_{i}^{l}\}_{i=1}^{N}\in\mathbb{R}^{N\times d^{l}}$ in the $l$-th layer, where $N$ denotes the number of points, the features of the $(l+1)$-th layer can be expressed as:

\mathbf{f}_{i}^{l+1}=G_{l}\left(\mathbf{F}^{l}_{p}\right)=\max_{j\in N(i)}\left(\mathbf{g}^{l}_{\theta}\left(\mathbf{f}_{i}^{l},\,\mathbf{f}_{j}^{l}-\mathbf{f}_{i}^{l}\right)\right) \qquad (1)

where $\mathbf{g}_{\theta}(\cdot)$ represents a non-linear function with $\theta$ as a learnable parameter, $N(i)$ denotes the neighbors of point $i$, and $\max(\cdot)$ is an aggregation function.

To obtain local features at different scales, we set different values of $K$ in the three feature extraction units. The outputs are then concatenated and passed through a weight-shared multi-layer perceptron, which can be defined as:

\mathbf{F}_{p.emb}=h_{\varphi}\left(\left[\left[\mathbf{F}^{1}_{p1},\cdots,\mathbf{F}^{4}_{p1}\right],\cdots,\left[\mathbf{F}^{1}_{p3},\cdots,\mathbf{F}^{4}_{p3}\right]\right]\right) \qquad (2)

where $\mathbf{F}^{i}_{p1}$, $\mathbf{F}^{i}_{p2}$ and $\mathbf{F}^{i}_{p3}$ represent the $i$-th layer outputs of the three feature extraction units, $h_{\varphi}(\cdot)$ represents a weight-shared multi-layer perceptron with $\varphi$ as a learnable parameter, and $\left[\cdots\right]$ represents the concatenation operation.
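To make the above concrete, the following PyTorch sketch shows one possible implementation of a feature extraction unit and the multi-scale point embedding of Eqs. (1)-(2). The neighborhood sizes (8, 16, 24), the channel widths, and the choice of a 1x1 convolution with BatchNorm and ReLU for $\mathbf{g}_{\theta}$ are illustrative assumptions rather than the exact configuration of NoiseTrans.

```python
import torch
import torch.nn as nn


def knn_indices(xyz, k):
    # xyz: (B, N, 3); returns (B, N, k) indices of the k nearest neighbours
    # of every point in Euclidean space (the point itself is excluded).
    dist = torch.cdist(xyz, xyz)                                  # (B, N, N)
    return dist.topk(k + 1, largest=False).indices[..., 1:]       # drop self


def gather_neighbors(feats, idx):
    # feats: (B, N, C), idx: (B, N, k) -> (B, N, k, C) neighbour features.
    B, N, C = feats.shape
    k = idx.shape[-1]
    expanded = feats.unsqueeze(1).expand(B, N, N, C)
    return torch.gather(expanded, 2, idx.unsqueeze(-1).expand(B, N, k, C))


class FeatureAggregation(nn.Module):
    """One aggregation layer of a feature extraction unit (Eq. 1): a shared
    non-linear function g_theta over the edge feature (f_i, f_j - f_i),
    followed by a max over the K neighbours."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.g_theta = nn.Sequential(
            nn.Conv2d(2 * in_dim, out_dim, kernel_size=1),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(),
        )

    def forward(self, feats, idx):
        neigh = gather_neighbors(feats, idx)                      # (B, N, k, C)
        center = feats.unsqueeze(2).expand_as(neigh)
        edge = torch.cat([center, neigh - center], dim=-1)        # (f_i, f_j - f_i)
        edge = edge.permute(0, 3, 1, 2)                           # (B, 2C, N, k)
        out = self.g_theta(edge).max(dim=-1).values               # max over neighbours
        return out.permute(0, 2, 1)                               # (B, N, out_dim)


class PointEmbedding(nn.Module):
    """Three parallel units with different neighbourhood sizes, each stacking
    four aggregation layers; all layer outputs are concatenated and fused by
    the weight-shared MLP h_phi (Eq. 2). Widths are illustrative assumptions."""

    def __init__(self, ks=(8, 16, 24), dims=(32, 64, 64, 128), emb_dim=256):
        super().__init__()
        self.ks = ks
        self.units = nn.ModuleList()
        for _ in ks:
            layers, d_in = nn.ModuleList(), 3
            for d_out in dims:
                layers.append(FeatureAggregation(d_in, d_out))
                d_in = d_out
            self.units.append(layers)
        self.h_phi = nn.Linear(len(ks) * sum(dims), emb_dim)

    def forward(self, xyz):                                       # xyz: (B, N, 3)
        collected = []
        for k, unit in zip(self.ks, self.units):
            idx = knn_indices(xyz, k)
            f = xyz
            for layer in unit:
                f = layer(f, idx)
                collected.append(f)
        return self.h_phi(torch.cat(collected, dim=-1))           # (B, N, emb_dim)
```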

Figure 3: Actual figure (top left) and schematic figure (bottom right). The orange points in the top left figure indicate clean points and the dark blue points indicate noisy points. As can be seen, it is almost impossible to identify the bulges in the noise points. In the lower right figure, the red point is easily recognized as noise causing the bulge edge to disappear. We increase the weights of the yellow points so as to retain the bulged detail.

However, there are still problems with the above method. A key issue in the denoising task is the balance between removing noise and preserving protruding edges. This problem is not specifically addressed in most deep learning methods, which can lead to the network recognizing protruding edges as noise, especially after several iterations, resulting in a loss of detail. Figure 3 illustrates this problem.

To alleviate this situation, we introduce Local Point Attention in the feature extraction. It gives more weight to points that lie on the same bump, thus preventing the object surface from being over-smoothed.

Specifically, we compare the features of neighbor points at different scales to get the structural features where they are located. After concatenation, they are multiplied by a learnable matrix to obtain the weights. Our formula can be expressed as follows:

\mathbf{a}_{i}^{l+1}=A_{l}\left(\mathbf{F}^{l}_{p}\right)=\operatorname{sigmoid}\left(\mathbf{W}^{l}_{a}\cdot Agg\left(\mathbf{f}^{l}_{i},\,\mathbf{f}^{l}_{j}\right)\right) \qquad (3)

where $Agg(\cdot)$ represents the aggregation function. Then equation (1) can be rewritten as:

\mathbf{f}_{i}^{l+1}=G_{l}\left(\mathbf{F}^{l}_{p}\right)=\max_{j\in N(i)}\left(\mathbf{g}^{l}_{\theta}\left(\left(\mathbf{f}_{i}^{l},\,\mathbf{f}_{j}^{l}-\mathbf{f}_{i}^{l}\right)\odot\mathbf{a}^{l}_{i}\right)\right) \qquad (4)

where $\mathbf{a}_{i}^{l}$ denotes, for the $i$-th point at the $l$-th layer, the surface-similarity weights of its neighbors relative to itself. We show the specific effects in the ablation experiments.
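A minimal sketch of the Local Point Attention of Eq. (3) is shown below. The choice of max pooling for $Agg(\cdot)$ and an output width matching the edge feature of Eq. (4) are assumptions, since the text does not fix them.

```python
import torch
import torch.nn as nn


class LocalPointAttention(nn.Module):
    """Sketch of Eq. (3): per-point weights computed from the centre feature and
    an aggregate of its neighbours, squashed by a sigmoid. Max pooling for
    Agg(.) and the 2C output width (matching the edge feature of Eq. (4))
    are our assumptions."""

    def __init__(self, in_dim):
        super().__init__()
        self.w_a = nn.Linear(2 * in_dim, 2 * in_dim)   # learnable matrix W_a

    def forward(self, center, neigh):
        # center, neigh: (B, N, k, C) centre features broadcast over the
        # neighbours and the neighbour features themselves.
        agg = torch.cat([center, neigh], dim=-1).max(dim=2).values   # Agg over neighbours
        return torch.sigmoid(self.w_a(agg))                          # a_i, shape (B, N, 2C)
```

Inside the aggregation layer, Eq. (4) then multiplies the edge feature (the concatenation of the centre feature and the neighbour difference) elementwise by these weights, broadcast over the neighbors, before applying $\mathbf{g}_{\theta}$.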

3.3 Sparse encoding

An advantage of the transformer is that each token can attend to information at any location [41]. Our model is built on the standard transformer and implements the point cloud denoising task. However, the transformer [36] was originally designed for sequence modeling, and its self-attention mechanism only takes into account the semantic similarities between individual points. To make it better at capturing the structural relationships between points, we propose Sparse Encoding.

When describing the structure of a point cloud, the true directional relationships between points are only available in non-Euclidean space, which is difficult to compute for point clouds in Euclidean space. While preserving the permutation invariance of the overall model, we use learnable coordinate differences as an approximation to the directions in non-Euclidean space. It is worth noting that we only make direction predictions for neighbor points. This allows the central points to identify the spatial structure while preventing the errors that occur at longer distances.

For point cloud noise, the sparsity of the points is an important reference for judging the size and location of the noise. We discard counting points as a criterion for evaluating sparsity and choose a more dynamic approach instead: we use the learnable inverse of the neighbor distance to evaluate the sparsity of points in range. This method not only reflects the number of neighbors but also, to some extent, resembles the ‘degree’ in a graph.

Specifically, given a coordinate set of points $\mathbf{X}=\left\{\mathbf{x}_{i}\right\}_{i=1}^{N}\in\mathbb{R}^{N\times d}$, where $N$ represents the number of points in the set and $d$ represents the dimensionality of the vectors (generally $d=3$), the encoding we design can be expressed as follows:

\mathbf{encoding}_{i}=M_{\nu}\left(Concat\left(\mathbf{x}_{i}-\mathbf{x}_{j},\,\left(\left\|\mathbf{x}_{i}-\mathbf{x}_{j}\right\|_{2}^{2}\right)^{-1}\right)\right) \qquad (5)

where $M_{\nu}$ represents a weight-shared multi-layer perceptron with $\nu$ as a learnable parameter and $j$ indexes the neighbors of point $i$. In our experiments, we set the number of neighbor points to 3.
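The sparse encoding of Eq. (5) can be sketched in PyTorch as follows; the hidden width of $M_{\nu}$ and the sum over the three neighbors (the reduction is left implicit above) are assumptions.

```python
import torch
import torch.nn as nn


class SparseEncoding(nn.Module):
    """Sketch of Eq. (5). For each point we take its k nearest neighbours
    (k = 3 as in the paper), form the coordinate difference x_i - x_j and the
    inverse squared distance, and map the pair through a shared MLP M_nu.
    The per-neighbour terms are summed here (an assumption), which keeps the
    encoding independent of neighbour ordering."""

    def __init__(self, emb_dim=256, k=3):
        super().__init__()
        self.k = k
        self.m_nu = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, emb_dim))

    def forward(self, xyz):
        # xyz: (B, N, 3) noisy point coordinates.
        B, N, _ = xyz.shape
        dist = torch.cdist(xyz, xyz)                                   # (B, N, N)
        idx = dist.topk(self.k + 1, largest=False).indices[..., 1:]    # (B, N, k), skip self
        neigh = torch.gather(xyz.unsqueeze(1).expand(B, N, N, 3), 2,
                             idx.unsqueeze(-1).expand(B, N, self.k, 3))
        diff = xyz.unsqueeze(2) - neigh                                # x_i - x_j
        inv_sq = 1.0 / (diff.pow(2).sum(-1, keepdim=True) + 1e-8)      # sparsity term
        enc = self.m_nu(torch.cat([diff, inv_sq], dim=-1))             # (B, N, k, emb_dim)
        return enc.sum(dim=2)                                          # one encoding per point
```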

Our point embedding and sparse encoding together form the input to our transformer. The purpose of point cloud denoising is to recover the noisy points to the underlying surface of the object, a task that maps one set to another with the same properties. To reduce the number of parameters and the computation time, we therefore use only the encoder part of the transformer and omit the decoder. Our module has a total of six transformer layers, each with a Pre-LN structure [39]. To suit the organization of the point cloud, we redesign the Feed-Forward Network: we use a weight-shared multi-layer perceptron to further extract features from the point cloud.

\mathbf{F}^{l+1}_{o}=MLP\big(LN\big(MSA\left(LN\left(\mathbf{F}^{l}_{o}\right)\right)+\mathbf{F}^{l}_{o}\big)\big)+\mathbf{F}^{l}_{o} \qquad (6)
\mathbf{F}^{0}_{o}=\mathbf{F}_{p.emb}+\mathbf{encoding}

where $\mathbf{F}^{l}_{o}$ represents the output of layer $l$, and $LN$ represents Layer Normalization. The multi-head attention mechanism plays a particularly crucial role in our task, describing the relationships between the points in different dimensions. Note that the order of the inputs does not change in the above process, and the output has the same dimension and structure as the input, which greatly helps our loss function and the completeness of the point cloud. Since the calculations involved are permutation invariant, a different order of inputs does not change the final output; as mentioned above, this is naturally suited to unordered point clouds.
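A sketch of one Pre-LN layer following Eq. (6), with the feed-forward part replaced by a point-wise weight-shared MLP, is given below; the embedding width, number of heads, and hidden width are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PreLNEncoderLayer(nn.Module):
    """One Pre-LN transformer layer following Eq. (6); the FFN is a point-wise
    weight-shared MLP. Widths and head count are illustrative assumptions."""

    def __init__(self, dim=256, heads=8, ffn_dim=512):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))

    def forward(self, f):
        # f: (B, N, dim); the point order is never changed, so the layer is
        # permutation equivariant over the input points.
        h = self.ln1(f)
        h = self.msa(h, h, h, need_weights=False)[0] + f       # MSA(LN(F)) + F
        return self.mlp(self.ln2(h)) + f                       # MLP(LN(.)) + F, as written in Eq. (6)


# The transformer module stacks six such layers on F^0 = F_emb + encoding.
encoder = nn.Sequential(*[PreLNEncoderLayer() for _ in range(6)])
```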

3.4 Output header

We specifically design an output header module to handle the point cloud denoising task. The module consists of a densely connected multi-layer perceptron containing four linear layers, each followed by a GELU nonlinearity. The module uses a residual connection so that the network predicts the noise displacement rather than the coordinates themselves, which benefits denoising. The final output of our network is the denoised coordinate of each point. The computational process can be expressed as follows:

\mathbf{Y}=P_{\epsilon}(\mathbf{F}_{o})+\mathbf{X} \qquad (7)

where $\mathbf{X}$ is the input coordinate set of the noisy point cloud and $\mathbf{F}_{o}$ represents the output of the transformer module. The output $\mathbf{Y}$ represents the denoised coordinates.
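The output header and Eq. (7) can be sketched as follows; the hidden widths are assumptions, and the dense connections are simplified to a plain stack for brevity.

```python
import torch
import torch.nn as nn


class OutputHeader(nn.Module):
    """Sketch of the output header and Eq. (7): four linear layers, each
    followed by a GELU, regress a per-point displacement that is added back
    to the noisy input coordinates through a residual connection. The hidden
    widths are our assumptions, and the dense connections described in the
    text are simplified to a plain stack here."""

    def __init__(self, in_dim=256):
        super().__init__()
        self.p_eps = nn.Sequential(
            nn.Linear(in_dim, 256), nn.GELU(),
            nn.Linear(256, 128), nn.GELU(),
            nn.Linear(128, 64), nn.GELU(),
            nn.Linear(64, 3))

    def forward(self, f_o, x):
        # f_o: (B, N, in_dim) transformer output; x: (B, N, 3) noisy coordinates.
        return self.p_eps(f_o) + x      # Y = P_eps(F_o) + X
```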

3.5 Loss function

Choosing an appropriate loss function is particularly critical for the denoising task as it directly affects the denoising performance. We consider the differences between synthetic datasets and raw point clouds to design a two-part loss function in supervised training.

To quantify the distance between the denoised point cloud and the ground truth point cloud, we adopt the Chamfer Distance (CD) [11] as our loss function, which has been shown to be effective in previous work [30, 26, 22]. We make some changes to the CD to make it easier to visualize during training. The exact expression is as follows:

\operatorname{Loss}_{CD}=\sum_{\mathbf{y}\in\mathbf{Y}}\sum_{\mathbf{x}\in\mathbf{X}}\min_{\substack{\mathbf{y}\in\mathbf{Y}\\ \mathbf{x}\in\mathbf{X}}}\|\mathbf{x}-\mathbf{y}\|_{2}^{2} \qquad (8)

Nevertheless, previous work [30, 26] has shown that using only a relative minimum-distance loss leads to inferior visual results, and this was confirmed again in our experiments: particularly after several iterations, the point cloud can appear filamentous or clustered. To encourage points to move away from each other, we choose an ‘absolute’ distance function as an additional loss. Note that, thanks to the point-to-point mapping of our network and the uniform distribution of points in the training set, we are able to choose this approach. Our specific loss function is as follows:

\operatorname{Loss}_{AD}=\sum_{i=1}^{N}\left\|\mathbf{x}_{i}-\mathbf{y}_{i}\right\|_{2}^{2} \qquad (9)

where $\mathbf{x}$ represents the input coordinates and $\mathbf{y}$ represents the output coordinates. Overall, our loss function is as follows:

\operatorname{Loss}=\alpha\cdot\operatorname{Loss}_{CD}+\beta\cdot\operatorname{Loss}_{AD} \qquad (10)

The above losses are evaluated on uniformly distributed point clouds. However, raw point clouds are generally non-uniform, so we reduce the weight of the ‘absolute’ distance term, setting $\alpha=0.9$ and $\beta=0.1$ in our experiments. By minimizing the overall loss function, we ensure that the denoised points are restored as closely as possible to the underlying surface.
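A minimal sketch of the combined loss is given below. Pairing the ‘absolute’ term with a one-to-one reference cloud and averaging (rather than summing) over points are our reading of the notation above, not a verbatim transcription of the implementation.

```python
import torch


def denoising_loss(pred, ref, alpha=0.9, beta=0.1):
    """Sketch of Eqs. (8)-(10). `pred` is the denoised output Y; `ref` is the
    reference cloud it is compared with, assumed to be paired one-to-one with
    the output for the 'absolute' term. Averaging over points (for scale
    stability) is our assumption."""
    # pred, ref: (B, N, 3)
    d = torch.cdist(pred, ref).pow(2)                       # (B, N, N) squared distances
    cd = d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)
    ad = (pred - ref).pow(2).sum(dim=-1).mean(dim=1)        # paired squared distances
    return (alpha * cd + beta * ad).mean()
```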

4 Experiments

In this section, we evaluate our method on various datasets and compare it with other state-of-the-art methods quantitatively and qualitatively.

4.1 Setup

4.1.1 Dataset

We use the training set from [22] and extract 100 different meshes from the ModelNet40 [38] dataset for training. We sample 10k-20k points from each mesh by Poisson sampling. Similar to previous work [22, 23], we normalize the point cloud to the unit sphere, perturb it with Gaussian noise with standard deviations from 1% to 4% of the radius of the bounding sphere, and then split it into patches of 1k points as input.

For testing, we select 20 meshes from each of the ModelNet40 [38] and PU-Net [42] datasets. We sample 10k points for each mesh and perturb them with Gaussian noise with standard deviations of 1%, 2%, and 3% of the diagonal of the bounding box, respectively. To validate the effectiveness of our network structure, and because of the greater object detail in the ModelNet40 dataset, we carry out the ablation experiments on ModelNet40 and the quantitative analysis on PU-Net. Furthermore, to validate our model under different noise types and on raw point clouds, we add different types of noise to the data and use the Paris-rue-Madame [33] dataset as the raw point cloud benchmark.
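The normalization and noise perturbation step can be sketched as follows; Poisson sampling and patch splitting are omitted, and the function name is purely illustrative.

```python
import torch


def make_noisy_sample(points, std_ratio=0.02):
    """Normalize a point cloud to the unit sphere and perturb it with Gaussian
    noise whose standard deviation is a fraction of the bounding-sphere radius
    (e.g. 2% here). Patch splitting is omitted from this sketch."""
    center = points.mean(dim=0)
    radius = (points - center).norm(dim=1).max()
    unit = (points - center) / radius                       # unit-sphere normalization
    noisy = unit + torch.randn_like(unit) * std_ratio       # additive Gaussian noise
    return noisy, unit                                      # noisy input, clean reference
```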

4.1.2 Metric

To quantitatively analyze the effect of denoising, we adopt two evaluation methods often used in previous work: the Chamfer Distance (CD) [11] and the Point-to-Mesh Distance (P2M) [31]. Where the Chamfer Distance describes the sum of the shortest distances of points between two point clouds:

\operatorname{CD}=\frac{1}{\left|\mathbf{S}_{1}\right|}\sum_{\mathbf{x}\in\mathbf{S}_{1}}\min_{\mathbf{y}\in\mathbf{S}_{2}}\|\mathbf{x}-\mathbf{y}\|_{2}^{2}+\frac{1}{\left|\mathbf{S}_{2}\right|}\sum_{\mathbf{y}\in\mathbf{S}_{2}}\min_{\mathbf{x}\in\mathbf{S}_{1}}\|\mathbf{y}-\mathbf{x}\|_{2}^{2} \qquad (11)

where $\mathbf{S}_{1}$ and $\mathbf{S}_{2}$ represent two different point clouds, and $\mathbf{x}$ and $\mathbf{y}$ represent the coordinates of the points. The Point-to-Mesh Distance, on the other hand, evaluates the denoising effect by calculating the average of the distances between the points of the denoised point cloud and the triangles of the clean mesh.

4.1.3 Implementation details

We implement our model in PyTorch [25]. We use the Adam optimizer, with the decay set to 0.001 and the smoothing constants set to 0.9 and 0.999, respectively. For the 30K patches used as training input, we train for 200 epochs with an initial learning rate of 0.0005, which is decreased to half of its original value after 50 epochs. During testing, when the noise level was high, we iterated the denoiser, following previous methods [30, 26, 22], to obtain better performance: one iteration for 1%-2% noise and two iterations for 3% noise.
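For reference, the optimizer and schedule described above could be set up as follows. Interpreting the 0.001 decay as Adam weight decay and the halving as a single step at epoch 50 are our assumptions; `model`, `train_loader`, and `denoising_loss` (the Section 3.5 sketch) are placeholders.

```python
import torch

# Illustrative optimizer and schedule matching the reported hyper-parameters;
# `model` and `train_loader` are assumed to be defined elsewhere.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.9, 0.999), weight_decay=1e-3)
# Learning rate halved once after epoch 50 (our reading); use
# StepLR(optimizer, step_size=50, gamma=0.5) if the halving repeats.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.5)

for epoch in range(200):
    for noisy, clean in train_loader:
        optimizer.zero_grad()
        loss = denoising_loss(model(noisy), clean)   # loss sketch from Section 3.5
        loss.backward()
        optimizer.step()
    scheduler.step()
```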

4.2 Comparison to state-of-the-art

4.2.1 Quantitative results

Table 1: Comparison of denoising algorithms. Each data in the table is evaluated on 20 point clouds of different shapes selected from the PU-Net dataset.
Noise level 1% 2% 3%
Metric CD P2M CD P2M CD P2M
Noisy 3.544 1.233 7.620 4.188 12.838 8.655
MRPCA 3.016 1.022 3.764 1.126 5.103 2.033
GLR 2.945 1.029 3.695 1.289 4.872 2.109
PCNet 3.462 1.105 7.352 3.572 12.851 8.542
DMR 4.425 1.626 4.968 2.022 5.885 2.708
Score 2.427 0.429 3.545 1.027 4.795 1.988
Ours 2.288 0.360 3.251 0.843 4.070 1.470

In this section, we quantitatively compare our method with classical non-deep-learning point cloud denoising methods, namely the sparsity-based MRPCA [24] and the graph-based GLR [44], and with state-of-the-art deep learning methods, namely PointCleanNet [30] (a pioneer of deep learning denoising), DMR [22], and Score [23]. To more comprehensively validate the effectiveness and robustness of our method, we add Gaussian noise, uniform noise, and Laplace noise to the test set and feed the noisy test set directly to each algorithm, computing the CD and P2M between the output point cloud and the ground truth point cloud. As shown in Table 1, in the range of 1% to 3% standard deviation of Gaussian noise, our method is significantly better than previous deep learning and non-deep-learning methods, and the advantage persists at large noise standard deviations.

Table 2: Comparison of denoising algorithms in different noises. We sampled 10K points from each point cloud.
Noise Laplacian noise Uniform noise
Metrics CD P2M CD P2M
Noisy 4.070 2.301 5.420 3.409
PCNet 3.899 1.986 5.134 3.124
DMR 4.013 2.208 5.242 3.281
Score 2.275 0.812 4.200 2.457
Ours 2.126 0.752 3.801 2.100

Although only Gaussian noise is used in training, Table 2 shows that under uniform noise and Laplace noise our method also has a significant advantage over other methods.

4.2.2 Qualitative results

Figure 4: Visual comparison of denoising methods. Blue points are on the surface of the object, the brighter the color, the further points are from the surface.
Figure 5: Visual comparison of denoising methods with different noises. Where (a) denotes uniform noise and (b) denotes Laplacian noise.

Figure 4 shows the distribution of points on the object surface after denoising by the current state-of-the-art DMR and Score methods and by our method under different levels of Gaussian noise. Points are colored according to their distance from the clean object surface: blue indicates proximity to the surface, and brighter colors indicate greater distance. It is clear that our method outperforms DMR, with output points closer to the object surface, and, unlike Score, our method does not produce outliers. The visual improvement is most pronounced at low noise levels, and at particularly high noise levels our method still brings the noisy points as close to the object surface as possible. We also evaluate the methods under different noise types; as shown in Figure 5, our method works well even for noise types not seen by the model.

Figure 6: Visualisation of raw point cloud denoising. Our method performs well in terms of noise removal while preserving the integrity of the vehicle’s A-pillar.

We further compare the above methods on a raw point cloud, using the Paris-rue-Madame [33] dataset as a reference. As the clean point cloud is unknown, the deviation from noise cannot be quantified, so we can only analyze this dataset qualitatively. The Paris-rue-Madame dataset consists of two files, each with 10 million points. We take only the coordinates of each point and, given the limited computing power of our GPU, divide the 10 million points into 1000 parts, normalize each to the unit sphere, feed them to the algorithm, and stitch the outputs back together. For the qualitative analysis, we take a portion of the dataset and show the results after one round of iteration in Figure 6. It can be seen that our method retains more detail than DMR, and the denoising effect is better than Score, in that the denoised points are closer to the object surface and the denoised surface is smoother.

4.2.3 Ablation studies

Table 3: Comparison of ablation studies. In the first experiment, we replace the self-attention layer with a convolution layer. In the third experiment, we use learnable point coordinates as the position embedding.
Component 1% 2% 3%
No self-attention 2.501 3.567 4.452
No pos. encoding 2.416 3.350 4.119
Coor. encoding 2.315 3.328 4.088
Attention+Sparse 2.288 3.251 4.070
Figure 7: Visual Ablation Experiment of Local Point Attention. (a): clean (b): noisy (c): no local point attention (d): ours. It can be observed that after three iterations, the flaps of aircraft (c) have significantly disappeared, while aircraft (d) is still present.

In this section, we focus on showing the soundness of our network design. Table 3 reports the impact of each designed component on the final denoising performance. The first experiment (first row) discards the self-attention layer and uses a convolution layer instead, to verify the effect of the self-attention mechanism. The second and third experiments (second and third rows) change the positional encoding in the transformer module (the second experiment uses no positional encoding) to verify the effectiveness of our position encoding design. We use point clouds of 10K points with 1%-3% Gaussian noise in these experiments.

Each module of our design contributes to the denoising performance. It is worth noting that the role of self-attention increases as the noise grows, whereas the position encoding plays a larger role in the less noisy cases. We believe that when the noise is high, positional offsets lead to an uncontrolled increase in the distances between points, weakening the effect of positional encoding. This problem can be mitigated by using positional encodings at different scales.

To verify the effectiveness of our Local Point Attention, we conduct experiments on the ModelNet40 [38] dataset. Figure 7 qualitatively shows the results. At a 2% noise level, after 3 iterations, the bulge at the tail of the aircraft wing becomes smaller or even disappears; this is significantly improved after adding the Local Point Attention.

5 Conclusion

In this paper, we propose a new denoising architecture, NoiseTrans. We formulate the point cloud as a set of unordered vectors and successfully merge the point cloud denoising task with the transformer through a number of technical innovations. Different from previous work, our model extracts local features at diverse scales and captures the semantic relationships between points and structural features with the help of the transformer encoder. To preserve details during denoising, we introduce changes to the embedding vector. Numerous experiments show that our model achieves state-of-the-art performance on different datasets.

At present, there is still no standard dataset for the point cloud denoising task, which also causes evaluation discrepancies. We hope to witness the availability of large standard datasets for denoising in the future, allowing further refinement of our approach.

References

  • Alexa et al. [2001] Alexa, M., Behr, J., Cohen-Or, D., Fleishman, S., Levin, D., Silva, C.T., 2001. Point set surfaces, in: Proceedings Visualization, 2001. VIS’01., IEEE. pp. 21–29.
  • Brown et al. [2020] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al., 2020. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901.
  • Carion et al. [2020] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End-to-end object detection with transformers, in: European conference on computer vision, Springer. pp. 213–229.
  • Cazals and Pouget [2005] Cazals, F., Pouget, M., 2005. Estimating differential quantities using polynomial fitting of osculating jets. Computer Aided Geometric Design 22, 121–146.
  • Chen et al. [2021a] Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., Gao, W., 2021a. Pre-trained image processing transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12299–12310.
  • Chen et al. [2020] Chen, S., Liu, B., Feng, C., Vallespi-Gonzalez, C., Wellington, C., 2020. 3d point cloud processing and learning for autonomous driving: Impacting map creation, localization, and perception. IEEE Signal Processing Magazine 38, 68–86.
  • Chen et al. [2021b] Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H., 2021b. Transformer tracking, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8126–8135.
  • Devlin et al. [2018] Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 .
  • Digne and De Franchis [2017] Digne, J., De Franchis, C., 2017. The bilateral filter for point clouds. Image Processing On Line 7, 278–287.
  • Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 .
  • Fan et al. [2017] Fan, H., Su, H., Guibas, L.J., 2017. A point set generation network for 3d object reconstruction from a single image, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605–613.
  • Fleishman et al. [2003] Fleishman, S., Drori, I., Cohen-Or, D., 2003. Bilateral mesh denoising, in: ACM SIGGRAPH 2003 Papers, pp. 950–953.
  • Guerrero et al. [2018] Guerrero, P., Kleiman, Y., Ovsjanikov, M., Mitra, N.J., 2018. Pcpnet learning local shape properties from raw point clouds, in: Computer Graphics Forum, Wiley Online Library. pp. 75–85.
  • Hermosilla et al. [2019] Hermosilla, P., Ritschel, T., Ropinski, T., 2019. Total denoising: Unsupervised learning of 3d point cloud cleaning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 52–60.
  • Hu et al. [2020] Hu, W., Gao, X., Cheung, G., Guo, Z., 2020. Feature graph learning for 3d point cloud denoising. IEEE Transactions on Signal Processing 68, 2841–2856.
  • Hu et al. [2021] Hu, W., Hu, Q., Wang, Z., Gao, X., 2021. Dynamic point cloud denoising via manifold-to-manifold distance. IEEE Transactions on Image Processing 30, 6168–6183.
  • Huang et al. [2013] Huang, H., Wu, S., Gong, M., Cohen-Or, D., Ascher, U., Zhang, H., 2013. Edge-aware point set resampling. ACM transactions on graphics (TOG) 32, 1–12.
  • Lin et al. [2021] Lin, T., Wang, Y., Liu, X., Qiu, X., 2021. A survey of transformers. arXiv preprint arXiv:2106.04554 .
  • Liu et al. [2020] Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., Pietikäinen, M., 2020. Deep learning for generic object detection: A survey. International journal of computer vision 128, 261–318.
  • Liu et al. [2019] Liu, X., Han, Z., Liu, Y.S., Zwicker, M., 2019. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8778–8785.
  • Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022.
  • Luo and Hu [2020] Luo, S., Hu, W., 2020. Differentiable manifold reconstruction for point cloud denoising, in: Proceedings of the 28th ACM international conference on multimedia, pp. 1330–1338.
  • Luo and Hu [2021] Luo, S., Hu, W., 2021. Score-based point cloud denoising, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4583–4592.
  • Mattei and Castrodad [2017] Mattei, E., Castrodad, A., 2017. Point cloud denoising via moving rpca, in: Computer Graphics Forum, Wiley Online Library. pp. 123–137.
  • Paszke et al. [2019] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al., 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32.
  • Pistilli et al. [2020] Pistilli, F., Fracastoro, G., Valsesia, D., Magli, E., 2020. Learning robust graph-convolutional representations for point cloud denoising. IEEE Journal of Selected Topics in Signal Processing 15, 402–414.
  • Qi et al. [2017a] Qi, C.R., Su, H., Mo, K., Guibas, L.J., 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660.
  • Qi et al. [2017b] Qi, C.R., Yi, L., Su, H., Guibas, L.J., 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems 30.
  • Radford et al. [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al., 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 9.
  • Rakotosaona et al. [2020] Rakotosaona, M.J., La Barbera, V., Guerrero, P., Mitra, N.J., Ovsjanikov, M., 2020. Pointcleannet: Learning to denoise and remove outliers from dense point clouds, in: Computer Graphics Forum, Wiley Online Library. pp. 185–203.
  • Ravi et al. [2020] Ravi, N., Reizenstein, J., Novotny, D., Gordon, T., Lo, W.Y., Johnson, J., Gkioxari, G., 2020. Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501 .
  • Schoenenberger et al. [2015] Schoenenberger, Y., Paratte, J., Vandergheynst, P., 2015. Graph-based denoising for time-varying point clouds, in: 2015 3DTV-Conference: The True Vision-Capture, Transmission and Display of 3D Video (3DTV-CON), IEEE. pp. 1–4.
  • Serna et al. [2014] Serna, A., Marcotegui, B., Goulette, F., Deschaud, J.E., 2014. Paris-rue-madame database: a 3d mobile laser scanner dataset for benchmarking urban detection, segmentation and classification methods, in: 4th International Conference on Pattern Recognition, Applications and Methods ICPRAM 2014.
  • Sun et al. [2015] Sun, Y., Schaefer, S., Wang, W., 2015. Denoising point sets via l0 minimization. Computer Aided Geometric Design 35, 2–15.
  • Sutskever et al. [2014] Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27.
  • Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems 30.
  • Wang et al. [2021] Wang, Z., Cun, X., Bao, J., Liu, J., 2021. Uformer: A general u-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106 .
  • Wu et al. [2015] Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J., 2015. 3d shapenets: A deep representation for volumetric shapes, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920.
  • Xiong et al. [2020] Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., Liu, T., 2020. On layer normalization in the transformer architecture, in: International Conference on Machine Learning, PMLR. pp. 10524–10533.
  • Xu et al. [2015] Xu, L., Wang, R., Zhang, J., Yang, Z., Deng, J., Chen, F., Liu, L., 2015. Survey on sparsity in geometric modeling and processing. Graphical Models 82, 160–180.
  • Ying et al. [2021] Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., Liu, T.Y., 2021. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems 34.
  • Yu et al. [2018] Yu, L., Li, X., Fu, C.W., Cohen-Or, D., Heng, P.A., 2018. Pu-net: Point cloud upsampling network, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2790–2799.
  • Yu et al. [2021] Yu, X., Rao, Y., Wang, Z., Liu, Z., Lu, J., Zhou, J., 2021. Pointr: Diverse point cloud completion with geometry-aware transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12498–12507.
  • Zeng et al. [2019] Zeng, J., Cheung, G., Ng, M., Pang, J., Yang, C., 2019. 3d point cloud denoising using graph laplacian regularization of a low dimensional manifold model. IEEE Transactions on Image Processing 29, 3474–3489.
  • Zhao et al. [2021] Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V., 2021. Point transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16259–16268.
  • Zheng et al. [2010] Zheng, Q., Sharf, A., Wan, G., Li, Y., Mitra, N.J., Cohen-Or, D., Chen, B., 2010. Non-local scan consolidation for 3d urban scenes. ACM Trans. Graph. 29, 94–1.
  • Zheng et al. [2021] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al., 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6881–6890.