
Hausdorff Point Convolution with Geometric Priors

Pengdi Huang Shenzhen University, China Liqiang Lin Shenzhen University, China Fuyou Xue Shenzhen University, China Kai Xu National University of Defense Technology, China Danny Cohen-Or Tel Aviv University, Israel Hui Huang Shenzhen University, China
Abstract

Developing point convolutions that extract deep features from irregular point clouds remains challenging. Current methods evaluate the response by computing point set distances that account only for the spatial alignment between two point sets, but not for their underlying shapes. Without a shape-aware response, it is hard to characterize the 3D geometry of a point cloud efficiently with a compact set of kernels. In this paper, we advocate the use of Hausdorff distance as a shape-aware distance measure for calculating point convolutional responses. The technique we present, coined Hausdorff Point Convolution (HPC), is shape-aware. We show that HPC enables powerful point feature learning with a rather compact set of only four types of geometric priors as kernels. We further develop an HPC-based deep neural network (HPC-DNN). Task-specific learning can be achieved by tuning the network weights for combining the shortest distances between input and kernel point sets. We also realize hierarchical feature learning by designing a multi-kernel HPC for multi-scale feature encoding. Extensive experiments demonstrate that HPC-DNN outperforms strong point convolution baselines (e.g., KPConv), achieving a 2.8% mIoU boost on S3DIS [1] and a 1.5% boost on SemanticKITTI [3] for the semantic segmentation task.

1 Introduction

Analysis of large-scale scanned scenes is drawing increasing attention, driven by advances in deep neural networks. Point clouds acquired by common depth scanners, e.g., LiDAR and RGBD cameras, have challenging characteristics: they are noisy, incomplete, sparse, irregular, and unordered. These characteristics prevent conventional convolutions from being applied to extract effective per-point features.

Figure 1: In Hausdorff Point Convolution, the response of a local neighborhood ($Q_1$ or $Q_2$) to a geometric kernel ($G$) is computed based on Hausdorff distance, which is a shape-aware distance measure. Such shape-aware convolution facilitates powerful point feature learning with a compact set of geometric priors as convolutional kernels.

Early attempts at applying convolutions on point clouds bypassed the irregularity problem by mapping the data to predefined regular grids [38, 24]. More advanced techniques use point convolution methods based on implicit or geometric kernel functions [36, 18, 32]. For example, KPConv [33] adopts an explicit kernel point set and achieves state-of-the-art results. A crucial design choice herein is how to compute the response between the input and kernel point sets. Existing methods usually evaluate the response by computing a point set distance. The distance measures used by existing techniques account only for the spatial alignment between two point sets, but not for their underlying shapes. Without a shape-aware response, it is hard to characterize the 3D geometry of a point cloud efficiently with a compact set of kernels.

In this paper, we advocate the use of Hausdorff distance as a shape-aware distance measure and show that it is particularly suited for computing the convolutional response between a point cloud and a kernel point set; see Figure 1. We introduce a new point convolution in which the response of the local neighborhood against a geometric kernel is computed based on the Hausdorff distance measure, hence named Hausdorff Point Convolution (HPC). Since HPC is shape-aware, we are able to achieve powerful point feature learning with a particularly compact set of only four types of geometric priors as kernels. A key characteristic of Hausdorff distance is its robustness to noise, outliers, and irregular point density, making HPC highly preferable for feature learning on typical raw scan data. Another property of HPC is that it is permutation-invariant and can handle data at arbitrary scales. Furthermore, rotation invariance can be achieved by adopting a rotationally symmetric kernel shape.

We develop an HPC-based deep neural network, referred to as HPC-DNN, adopting KPConv [33] as the basic architecture. To make HPC operations learnable within the network, we decompose HPC into a distributive function and a shortest distance matrix between the point sets of the local neighborhood and a geometric kernel. The distributive function is then used to aggregate the shortest distances, weighted by the learnable parameters and the input features, which is essentially a weighted Hausdorff distance. Task-specific learning is achieved by tuning the network weights for shortest distance aggregation. To realize hierarchical feature learning, we design multi-kernel HPCs for multi-scale feature learning.

We have implemented HPC-DNN with PyTorch and evaluated it on a classic task of point cloud processing and understanding, i.e., semantic segmentation of scanned point cloud scenes. Experiments show that HPC-DNN outperforms strong baseline point convolutions (e.g., KPConv) significantly. In particular, it attains a 2.8% mIoU improvement on S3DIS [1] and a 1.5% improvement on SemanticKITTI [3] for the semantic segmentation task.

2 Related Work

The analysis of unstructured point clouds is widely regarded as a difficult and ill-posed problem. In particular, feature extraction is a fundamental and important task [29, 26, 27]. Recently, with the emergence of neural networks, new methods for point cloud analysis have been introduced [22, 23, 19]. These methods build upon the ability of neural networks to learn from data. The key challenge in deploying neural networks is that point clouds are irregular and unordered, so conventional convolutions cannot be adopted. Previous works can be roughly categorized into three groups: i) grid-based, ii) implicit, and iii) explicit point convolution. We briefly discuss them as follows.

Grid-based convolution on point clouds.

To bypass the data irregularity, grid-based methods project or transform the data into regular grids, on which traditional convolution methods can be applied [31, 21]. These methods are mainly designed for individual objects. For example, in SFCNN [24], point clouds are projected onto a grid sphere, and local and global features are then learned by a multi-layer perceptron (MLP) architecture. FoldingNet [38] proposes a two-step folding operation to construct a mapping between a point set and a 2D grid. To obtain higher-level features, some works adopt the VoxelNet framework. In [39], point clouds are voxelized and a shape attention regional proposal network is trained to learn the spatial occupancy of objects in the horizontal and vertical directions. Grid-based convolution is limited in resolution due to its heavy computation cost, which can be relieved with efficient data structures [25, 15]. Nonetheless, this line of methods is, in general, sensitive to noise and tends to overlook small-scale, yet visually meaningful, shape details. For larger and more complex scenes, specific designs are needed to deal with incomplete data and outliers.

Implicit point convolution.

The pioneering work of PointNet [22] opened an avenue of research focusing on direct convolution on 3D point clouds, aiming either to improve the neighborhood structure [23, 7, 30] or to enhance the convolutional filters [28, 9, 37]. Atzmon et al. [2] propose a unique volume-based point convolution, which consists of two operators, extension and restriction, mapping point cloud functions to volumetric ones and vice versa. Hu et al. [10] propose a local spatial encoding block to group relative coordinates and input features, and then extract neighboring features. Li et al. [19] present PointCNN, which uses an MLP to learn an $\mathcal{X}$-transformation that canonizes the point cloud features, thus achieving permutation invariance and offering hierarchical convolutions. PointConv [36] constructs a location-related weight function for continuous convolution and re-weights it by a learned point density factor. Tatarchenko et al. [32] propose TangentConv, which uses Gaussian kernels as implicit kernel metric functions. DGCNN [35] adopts EdgeConv, which constructs a local graph on neighboring points; its graph information function is instantiated by a fully connected layer. All these methods locally organize the geometric features [20, 16] and then use MLPs to obtain the final high-level features; we refer to them as implicit point convolutions. These implicit convolutions learn permutation or rotation invariance through MLP layers, and are thus sensitive to the training data quality and the convergence of network training.

Explicit point convolution.

Deep feature extraction by analyzing local neighborhoods has received intensive attention, but not much focus has been given to the development of explicit reference shapes to promote the expressiveness of deep features. Recently, KPConv [33] introduced a point kernel based method for point cloud convolution, which achieved state-of-the-art performance on classic point cloud datasets. However, this method simply mimics the form of a two-dimensional grid operation and uses a prescribed kernel function. JSENet [11] adopts KPConv as its backbone network for point convolution and proposes a deep network that fuses region and edge information for joint learning of semantic segmentation and edge detection. Taking a perspective that blends both geometry and topology, we define a permutation-invariant and scale-invariant geometric convolution for the analysis of unordered point clouds. By combining feature mapping and geometric convolution, we construct a convolutional neural network that learns to weight the input shape features. Moreover, exploiting the diversity of shapes, multiple kernels are designed to jointly extract hierarchical features.

3 Method

Figure 2: An overview of the proposed HPC-DNN. HPC-DNN adopts a multi-kernel Hausdorff-based point convolution (HPC) and extracts hierarchical features from the large-scale input point cloud scene. The multi-scale convolution responses of different kernels are merged in each layer to perform scene semantic segmentation.

Overview

We propose Hausdorff point convolution (HPC), a new point convolution based on Hausdorff distance. In HPC, there are four types of convolutional kernels whose parameters are learned for downstream tasks such as point cloud segmentation; see Figure 2. The feature response between input points and kernel points is computed based on Hausdorff distance, which essentially measures the similarity between the two point sets. The feature responses of multiple kernels at the same scale (layer) are combined to form a powerful representation. Deep networks are formed by stacking multi-scale HPCs, which learn hierarchical feature representations and finally produce an output.

3.1 Definition of HPC

Given an input feature vector $F=f(v)$ and a kernel vector $G=g(v)$, the discrete convolution is written as:

F*G=\sum_{u\in{U}}f(u)g(v-u)=\langle f(u),g(v-u)\rangle, (1)

where $*$ denotes the convolution operation and $\langle\cdot,\cdot\rangle$ is the vector inner product. Convolution is essentially a “sliding inner product” between the input feature $f(u)$ and the flipped kernel $g(v-u)$. The inner product measures the similarity between two vectors. The convolution response thus reflects a “sliding similarity” between the “shapes” of $F$ and $G$.

In the 2D convolutional layers of a CNN, similarity is measured between a 2D feature map and a 2D filter for feature extraction. We wish to extend this concept of convolution to deep feature learning on 3D point clouds, and give a generalized definition of point convolution. Given a point $p_{o}\in P$, its neighboring point set $Q$, and a kernel point set $G$, we use a function $T$ to calculate the similarity response $\hbar$ between them:

\hbar=T(Q,G),\quad Q=\{p_{i}\mid\|p_{o}-p_{i}\|\leq r\}, (2)

where $r$ is the query radius. Hausdorff distance can be used for computing shape similarity. Given the neighboring point set $Q$ and a kernel point set $G$, the Hausdorff distance $H(Q,G)$ is adopted as the convolution function $T$:

H(Q,G)=\max(h(Q,G),h(G,Q)), (3)

where $h(G,Q)$ and $h(Q,G)$ are called the narrow (directed) Hausdorff distances, defined as follows:

\begin{split}h(G,Q)&=\max_{g\in{G}}\min_{q\in{Q}}\|g-q\|,\\ h(Q,G)&=\max_{q\in{Q}}\min_{g\in{G}}\|g-q\|,\end{split} (4)

where $\|g-q\|$ is the Euclidean distance between points $g$ and $q$. Obviously, $H(G,Q)=H(Q,G)\leq r$, but $h(G,Q)\neq h(Q,G)$ in general.
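
To make the measure concrete, here is a minimal PyTorch sketch of Eqs. (3) and (4); the function name and tensor shapes are our own illustrative choices, not part of the paper's released code.

```python
import torch

def hausdorff(Q: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """H(Q, G) per Eqs. (3)-(4), for Q of shape (m, 3) and G of shape (n, 3)."""
    D = torch.cdist(Q, G)             # (m, n) pairwise Euclidean distances
    h_QG = D.min(dim=1).values.max()  # h(Q, G): worst neighbor-to-kernel match
    h_GQ = D.min(dim=0).values.max()  # h(G, Q): worst kernel-to-neighbor match
    return torch.max(h_QG, h_GQ)      # symmetric Hausdorff distance, Eq. (3)
```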

If the kernel point set $G$ is defined in a spherical space, $G=\{g_{i}\mid\|g_{i}\|\leq r\}$, the neighborhood point set $Q$ and the kernel point set $G$ do not necessarily need a rigid registration, which means that the Hausdorff distance measurement can be performed directly on the two point sets.

In fact, Eq. (3) satisfies the important properties of point cloud convolutions described in [19]. For $Q=[q_{1},\ldots,q_{n}]$ and its permuted counterpart $Q^{\prime}=[q_{\pi_{1}},\ldots,q_{\pi_{n}}]$, it is permutation-invariant: $H(Q,G)=H(Q^{\prime},G)$. Besides, it is scale-invariant after normalization, since $H(Q,G)=H(rQ,rG)/r$ with the query radius $r$ being a given constant. Since $H(G,Q)\leq r$, $1/r$ serves as the normalization factor. This means that a kernel shape can calculate neighboring shape responses at any scale or query radius. Figure 3 shows the result of applying Eq. (3) to a 3D indoor scene. It can be seen that the vertical line kernel causes a notable response on pole-like structures, while the plane kernel has a notable response on the ground. This result is very similar to the neuron activation of feature maps in 2D convolutional neural networks.
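
Both invariances are easy to check numerically with the sketch above (the random point sets here are for illustration only):

```python
m, n = 128, 15
Q, G = torch.randn(m, 3), torch.randn(n, 3)

# Permutation invariance: H(Q, G) = H(Q', G).
perm = torch.randperm(m)
assert torch.allclose(hausdorff(Q, G), hausdorff(Q[perm], G))

# Scale invariance after normalization: H(rQ, rG)/r = H(Q, G).
r = 2.5
assert torch.allclose(hausdorff(r * Q, r * G) / r, hausdorff(Q, G))
```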

Figure 3: Hausdorff distance metric on an indoor scene. The point color (dark to bright) indicates the Hausdorff distance value (low to high) between neighborhood points and the given kernel.
Figure 4: A point convolutional layer of the proposed HPC-DNN. The inputs are a query point, its neighboring points with corresponding features, and a given kernel point set; the output is a feature vector for the query point.

3.2 HPC Layer

As a network layer, the Hausdorff convolution layer should contain learnable weights that can be optimized via back propagation. Instead of computing the Hausdorff response directly, we opt to assign a weight to each shortest distance, which is adjusted automatically according to the pose and distribution of the neighboring points and their input features. We split the Hausdorff distance operator into two parts: the shortest distance set and the distributive function (see definition below). The distance between a neighboring point $q_{i}$ and a kernel point set $G$ is defined as $d(q_{i},G)=\min_{g\in{G}}\|g-q_{i}\|$, and similarly the distance between a kernel point $g_{i}$ and a neighboring point set $Q$ is $d(g_{i},Q)=\min_{q\in{Q}}\|g_{i}-q\|$. Let $D_{g}=[d(g_{1},Q),\ldots,d(g_{n},Q)]$ denote the set of distances from each point in $G$ to the set $Q$, and likewise $D_{q}=[d(q_{1},G),\ldots,d(q_{m},G)]$, where $n=|G|$ and $m=|Q|$. The Hausdorff convolution between $Q$ and $G$ can therefore be written as:

\begin{split}H(Q,G)&=\max(\max(D_{g}),\max(D_{q}))\\ &=\max([D_{g},D_{q}]).\end{split} (5)

We call a function $f$ satisfying $f(\{f(D_{g})\}\cup\{f(D_{q})\})=f(D_{g}\cup D_{q})$ a distributive function; $[D_{g},D_{q}]=D_{g}\cup D_{q}$ is the shortest distance set. It can be seen that the $\max$ function is a distributive function. However, the $\max$ operator is sensitive to outliers. To overcome this issue, many improved variants of the Hausdorff distance have been proposed [5]. Among these variants, we find the cumulative form of the Hausdorff distance [13] more appropriate for pattern matching. This modified Hausdorff distance is:

h(Q,G)=\sum_{q\in{Q}}\min_{g\in{G}}\|g-q\|. (6)

In this case, Eqs. (3) and (4) also adopt the accumulation operation instead of the max operation. Letting $\mathrm{sum}$ denote the accumulation of the elements in a set, we have:

\begin{split}H(Q,G)&=\mathrm{sum}(\mathrm{sum}(D_{g}),\mathrm{sum}(D_{q}))\\ &=\mathrm{sum}([D_{g},D_{q}]).\end{split} (7)

Obviously, the accumulation function is also a distributive function. We construct a shortest distance matrix $D_{\text{min}}$ to store the shortest distance set $D_{g}\cup D_{q}$:

D_{\text{min}}(i,j)=\begin{cases}\|q_{i}-g_{j}\|,&\|q_{i}-g_{j}\|\in D_{g}\cup D_{q}\\ 0,&\text{otherwise}\end{cases} (8)

where $1\leq i\leq m$ and $1\leq j\leq n$. The size of $D_{\text{min}}$ is $n\times m$.
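
As an illustration, Eq. (8) can be realized by masking a pairwise distance matrix. The helper below is our own sketch; it keeps rows indexed by neighbor points, i.e., it returns the transpose of $D_{\text{min}}$ under the $n\times m$ convention above.

```python
def shortest_distance_matrix(Q: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """Sparse matrix of Eq. (8): only the shortest distances survive."""
    D = torch.cdist(Q, G)                             # (m, n) pairwise distances
    m, n = D.shape
    keep = torch.zeros_like(D, dtype=torch.bool)
    keep[torch.arange(m), D.argmin(dim=1)] = True     # d(q_i, G) terms, i.e., D_q
    keep[D.argmin(dim=0), torch.arange(n)] = True     # d(g_j, Q) terms, i.e., D_g
    return torch.where(keep, D, torch.zeros_like(D))  # at most m + n nonzeros
```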

The point convolutional layer weights the elements of $D_{\text{min}}$, and the input feature of each point can also be regarded as a distance weighting constant. Inspired by KPConv [33], we formulate the computation of the Hausdorff convolution based on the shortest distance matrix $D_{\text{min}}$. For input features $F_{\text{in}}\in\mathbb{R}^{m\times c_{\text{in}}}$ and output features $F_{\text{out}}\in\mathbb{R}^{c_{\text{out}}}$, the convolution formula of the point convolutional layer is:

F_{\text{out}}=f(D_{\text{min}}F_{\text{in}}W), (9)

where $W$ is the weight matrix of size $n\times c_{\text{in}}\times c_{\text{out}}$, mapping the features from $c_{\text{in}}$ input channels to $c_{\text{out}}$ output channels. The shortest distance matrix is sparse: $\|D_{\text{min}}\|_{0}\leq n+m$. Therefore, multiplying $D_{\text{min}}$ by the weight matrix $W$ amounts to weighting each shortest distance as $w_{ji}D_{\text{min}}(i,j)$. Since the distributive functions $\max$, $\mathrm{sum}$, and $\min$ are all differentiable, the Hausdorff convolutional layer is also differentiable. Figure 4 shows the architecture of an HPC layer. As can be seen, the shortest distance matrix is normalized before weighting; normalization removes the scale factor: $\bar{D}_{\text{min}}(i,j)=D_{\text{min}}(i,j)/r$. In addition, since the Hausdorff distance measures the maximum dissimilarity between two shapes, each nonzero value is replaced by $1-\bar{D}_{\text{min}}(i,j)$ to obtain a similarity response.
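
Putting the pieces together, one HPC layer might look as follows. This is a minimal sketch with $\mathrm{sum}$ as the distributive function; the class name, initialization, and single-query formulation are our own simplifications, not the authors' implementation.

```python
class HPCLayer(torch.nn.Module):
    """One Hausdorff point convolution, Eq. (9), with sum-aggregation."""
    def __init__(self, n_kernel_pts: int, c_in: int, c_out: int, radius: float):
        super().__init__()
        self.radius = radius
        # Learnable weights W of size (n, c_in, c_out), as in Eq. (9).
        self.W = torch.nn.Parameter(0.01 * torch.randn(n_kernel_pts, c_in, c_out))

    def forward(self, Q, G, F_in):
        # Q: (m, 3) neighbors, G: (n, 3) kernel points, F_in: (m, c_in).
        D_min = shortest_distance_matrix(Q, G)       # (m, n)
        D_bar = D_min / self.radius                  # remove the scale factor
        S = torch.where(D_min > 0, 1.0 - D_bar,
                        torch.zeros_like(D_bar))     # similarity response
        H = S.t() @ F_in                             # (n, c_in) weighted features
        return torch.einsum('ni,nio->o', H, self.W)  # (c_out,): sum over n and c_in
```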

Hausdorff convolution is a general form of point cloud convolution. From the viewpoint of Hausdorff convolution, KPConv [33] is equivalent to computing a one-way similarity using a self-defined measurement (kernel) function. PointConv [36] and RSCNN [20] are equivalent to matching with a kernel containing only a single point at the center; PointConv uses density information to weight distances, while RSCNN adopts the maximum distance.

3.3 HPC with Multiple Kernels

Similar to common CNNs, we require multiple kernels (kernel point sets $G$) to facilitate powerful feature learning. This involves the design of kernel point generation and the multi-kernel network structure.

Kernel point generation.

The choice of the shape of the kernel point sets is important. They should be geometric primitives (e.g., planar patches, cylindrical patches, quadric patches, etc.), and meanwhile they should ideally be omnipresent on the 3D surfaces of common objects and scenes. In this work, we find that the following four types of kernel shapes suffice: points, lines, planes, and spherical patches, encompassing primitives from 0D to 3D. Further, we find that the Hausdorff response is more robust if $G$ has a rotationally symmetric structure:

H(Q,G)=H(RQ,G), (10)

where $R\in SO(3)$ is a rotation transformation applied to the neighborhood point set $Q$. Point and spherical kernels thus bring rotation invariance to the calculation, which qualifies the Hausdorff distance for retrieving the same kernel in varying poses. To form the kernel point sets, we sample the primitive shapes using the farthest point sampling (FPS) algorithm. Figure 3 shows the four kernel shapes adopted in our method.
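
For illustration, the kernel points of a spherical patch can be generated by densely sampling the primitive and thinning with FPS. Both helper names and the dense-candidate strategy below are our own assumptions, not necessarily the paper's exact procedure; line and plane kernels can be sampled analogously from a segment or a disk.

```python
def farthest_point_sampling(pts: torch.Tensor, k: int) -> torch.Tensor:
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    idx = [0]
    dist = torch.cdist(pts[:1], pts).squeeze(0)  # distances to the chosen set
    for _ in range(k - 1):
        nxt = int(dist.argmax())
        idx.append(nxt)
        dist = torch.minimum(dist, torch.cdist(pts[nxt:nxt + 1], pts).squeeze(0))
    return pts[idx]

def sphere_kernel(k: int = 15, n_candidates: int = 4096) -> torch.Tensor:
    cand = torch.randn(n_candidates, 3)
    cand = cand / cand.norm(dim=1, keepdim=True)  # dense samples on the unit sphere
    return farthest_point_sampling(cand, k)       # k well-spread kernel points
```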

Figure 5: The architecture of a multi-kernel Hausdorff point convolution layer. For a query point with input feature $F_{\text{in}}$, the convolution results of the $K$ kernels are linearly accumulated into the output feature $F_{\text{out}}$.

Multi-kernel network structure.

With multiple kernel point sets, we want each convolution feature component to contribute to the final encoding. Therefore, a multi-kernel feature encoding $\tilde{F}_{out}\in\mathbb{R}^{c_{out}}$ groups the $K$ Hausdorff point convolution results $F_{out}\in\mathbb{R}^{c_{out}}$:

\tilde{F}_{out}=\mathrm{ReLU}(\sum_{k=1}^{K}\mathrm{ReLU}(F_{out}(G_{k}))), (11)

where $K$ is the number of kernels and $\mathrm{ReLU}$ is the rectified linear unit activation function. Figure 5 shows the architecture of the multi-kernel Hausdorff point convolution. The different kernel features are grouped by summation; compared with concatenation, summing requires fewer parameters and allows the common parts of the features to mutually reinforce each other.
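
A sketch of this aggregation, reusing the `HPCLayer` above (the class name and structure are again our own):

```python
class MultiKernelHPC(torch.nn.Module):
    """Eq. (11): sum the ReLU'ed responses of K kernels, then apply ReLU again."""
    def __init__(self, kernels, c_in: int, c_out: int, radius: float):
        super().__init__()
        self.kernels = kernels  # list of K kernel point sets, each of shape (n_k, 3)
        self.layers = torch.nn.ModuleList(
            [HPCLayer(g.shape[0], c_in, c_out, radius) for g in kernels])

    def forward(self, Q, F_in):
        out = sum(torch.relu(layer(Q, g, F_in))
                  for layer, g in zip(self.layers, self.kernels))
        return torch.relu(out)  # Eq. (11)
```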

4 Results and Evaluation

Our network, HPC-DNN, is implemented based on KP-FCNN [33]. The architecture consists of an encoder and a decoder. The encoder contains five convolutional layers, each containing two HPC layers or two multi-kernel HPC layers. For a feature of dimension $c_{in}$, the input and output dimensions of the first convolution are $c_{in}$ and $2c_{in}$, respectively, while the input and output dimensions of the second convolution are both $2c_{in}$. The query radius $r$ of the point neighborhood doubles at every layer. The decoder, following the encoder, performs deconvolution, implementing point feature propagation based on nearest-neighbor upsampling. We retain the shortcut structure of KP-FCNN in the convolutional layers; for multi-kernel HPCs, the shortcut structure is removed to keep the convolutional layer concise.
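
The encoder schedule can be summarized as below; the base width and radius are placeholders, since the text specifies only the doubling pattern, and we assume each layer's input width equals the previous layer's output width.

```python
def encoder_schedule(c0: int = 64, r0: float = 0.05, n_layers: int = 5):
    """Per-layer conv dimensions and query radius; c0 and r0 are assumed values."""
    layers, c, r = [], c0, r0
    for _ in range(n_layers):
        layers.append({'conv1': (c, 2 * c), 'conv2': (2 * c, 2 * c), 'radius': r})
        c, r = 2 * c, 2 * r  # feature width and query radius double each layer
    return layers
```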

We implement our network with PyTorch and perform training/testing on a server with 22-core Intel(R) Xeon(R) E5-2699 2.20GHz CPUs and 9 Quadro GV100 GPUs. We conduct scene semantic segmentation on large-scale point clouds of indoor and outdoor scenes for evaluation.

4.1 Segmentation on S3DIS

For indoor scene segmentation, we use the public indoor point cloud dataset S3DIS [1] to compare our proposed HPC-DNN with KPConv [33], JSENet [11], and other state-of-the-art point convolution methods. S3DIS contains 271 indoor rooms, encompassing offices, corridors, etc. The 3D data were acquired with an RGBD scanner, and the dense point clouds carry RGB color information. Following convention, we use Areas 1-4 and Area 6 for training, and Area 5 for testing. Segmentation accuracy is measured by mean intersection over union (mIoU).

Table 1 shows that our HPC-DNN exhibits a significant performance improvement for scene segmentation compared to the baselines. We achieve 66.7% mIoU with single-kernel HPC and 68.2% with multi-kernel HPC, both higher than KPConv [33] under the same configuration. In particular, multi-kernel HPC-DNN achieves the best performance among all methods. Due to the high scanning quality of the S3DIS scenes, the geometry of the objects is relatively complete, and the multi-kernel HPC-DNN can effectively capture and enhance the geometric features of the semantic targets.

Table 1: Semantic scene segmentation results on S3DIS.
      Method       mIoU
      TangentConv [32]       52.6
      PointNet++ [23]       53.4
      DGCNN [35]       56.1
      PointCNN [19]       57.3
      PointNet [22]       57.8
      ParamConv [34]       58.3
      PointWeb [41]       60.3
      HPEIN [14]       61.9
      SPGraph [17]       62.1
      MVPNet [12]       62.4
      Point2Node [8]       63.0
      MinkowskiNet [4]       65.4
      KPConv [33]       65.4
      JSENet [11]       67.7
      HPC-DNN       66.7
      Multi-kernel HPC-DNN       68.2
Figure 6: Visualization of feature maps on point clouds after point convolution with different kernels.

4.2 Segmentation on SemanticKITTI

SemanticKITTI [3] is a large-scale outdoor point cloud dataset built on the KITTI vision benchmark [6]. The data sequences in SemanticKITTI are composed of continuous frames of point clouds captured by LiDAR scanners, without color information. Although a single frame contains about 100K points, they are usually scattered over a space of 160×160 meters, so the point clouds are very sparse. Besides, due to the circular scanning nature of LiDAR, the density of the point clouds is heavily anisotropic, which further increases the difficulty of processing and understanding. We use sequences 0–7 and 9–10 for training, and sequence 8 for testing. We again use mIoU to evaluate the prediction accuracy over the 19 categories of the dataset. Our method is compared to KPConv [33], PolarNet [40], and other baselines. The hyper-parameters, e.g., the query radius of each layer and the number of sampled points, are kept consistent across the different deep networks.

Table 2 shows the scene segmentation results of our networks and the baselines. HPC-DNN and multi-kernel HPC-DNN obtain mIoU scores of 59.6% and 60.3%, respectively, outperforming the state-of-the-art KPConv. Although the point clouds of this dataset contain much missing data, HPC is still able to capture as much geometric information as possible, leading to improved understanding results.

Table 2: Semantic scene segmentation results on SemanticKITTI.
      Method       mIoU
      PointNet++ [23]       20.3
      RSCNN [20]       47.2
      PolarNet [40]       58.2
      KPConv [33]       58.8
      HPC-DNN       59.6
      Multi-kernel HPC-DNN       60.3

4.3 Ablation Study

Distributive functions.

We analyze the performance of our HPC-DNN with different distributive functions (see the definition in Section 3.2). The experiment was carried out on the S3DIS dataset using a spherical kernel, with the $\max$, $\min$, and $\mathrm{sum}$ functions. As shown in Table 3, the accuracy of the $\mathrm{sum}$ function is 66.65%, which is better than the 65.05% of $\min$ and the 65.41% of $\max$. Although these three functions are all metric functions recommended by Dubuisson and Jain [5], the cumulative form of $\mathrm{sum}$ is preferable in overcoming the outlier issue.

Table 3: The effect of different distributive functions $f$.
      $f$ function       mIoU
      $\max$ function       65.41
      $\min$ function       65.05
      $\mathrm{sum}$ function       66.65

Shortest distance matrix.

In Section 3.2, we use the shortest distance matrix to capture the correlation between the kernel and the target shape. The shortest distance matrix is a sparsification of the neighborhood distance matrix, with only the shortest distances kept and the rest set to 0. Here, we compare the shortest distance matrix with the full neighborhood distance matrix on S3DIS. Table 4 shows that the shortest distance matrix performs much better. Although it contains less information than the neighborhood distance matrix, the results demonstrate that it effectively retains the key information of the point neighborhood and avoids interference from redundant data.

Table 4: Analysis of the shortest distance factor: with and without constructing the shortest distance matrix.
      Distance type       mIoU
      All distances between $G$ and $Q$       61.84
      The shortest distance       66.65

4.4 Impact of Kernels

Kernel shape.

We test the effect of different kernel shapes on the S3DIS dataset. The kernel shapes tested include a 3D sphere, a 2D plane, a 1D straight line, and a center point. The number of points in each kernel is 15, and all points are generated by sampling the parametric shape with farthest point sampling. The results are shown in Table 5. The spherical kernel achieves the best performance of 66.65%; the planar kernel, straight line kernel, and center point kernel obtain 64.02%, 62.16%, and 44.70%, respectively. The good performance of the spherical kernel is mainly attributed to its rotation invariance in calculating geometric distances. Since the point kernel concentrates at the center, it is less informative.

Table 5: The impact of kernels in different shapes.
      Kernel prior shape       mIoU
      Sphere       66.65
      Plane       64.02
      Straight Line       62.16
      Center point       44.70

Number of kernels.

We also analyze the impact of the number of kernels, again on S3DIS. The numbers of kernels tested are $K=1$, $K=2$, and $K=4$. The results are shown in Table 6. As can be seen, the multi-kernel networks improve significantly over the single-kernel version: two kernels bring a performance improvement of 1.2%, and four kernels 1.6%. This shows that the features extracted by the multi-kernel HPC effectively complement each other and provide a more powerful feature description of the point clouds. Of course, multi-kernel features contain redundancy, which explains the diminishing improvement when four kernels are used.

Table 6: The results of different numbers of kernels.
      Number of kernels       mIoU
      Kernel (Sphere)       66.65
      Kernels (Sphere + Plane)       67.85
      Kernels (All Four)       68.23
Figure 7: A gallery of HPC-DNN segmentation results on the S3DIS dataset.
Figure 8: A gallery of HPC-DNN segmentation results on the challenging SemanticKITTI dataset.

4.5 Visualization of Results

We visualize the feature maps of our HPC-DNN after convolution, in particular the features extracted with different kernel shapes. Besides, we visualize the segmentation results of HPC-DNN on S3DIS and SemanticKITTI.

Figure 6 shows the feature maps of the same layer with different kernels on an S3DIS point cloud scene. In some feature channels, the features extracted by each kernel show distinct characteristics. In two different feature channels, the spherical kernel produces similar convolution responses on walls in the same direction (left sub-figure) and in arbitrary directions (right sub-figure), respectively, showing that the network has learned directional information with the spherical kernel. Meanwhile, the planar kernel strengthens edge structures, the vertical line kernel responds consistently to vertical walls with the same feature distribution, and the center point kernel has a local strengthening effect on the points associated with the input features.

Figure 7 shows scene segmentation results on S3DIS. Our method attains accurate segmentation results and is especially good at distinguishing objects with distinct geometry. The segmentation errors mainly come from semantic objects on facades that have similar planar shapes. Figure 8 shows the labels predicted by our network on SemanticKITTI. As can be seen, small objects in road scenes, e.g., vehicles, are well distinguished by our method.

5 Discussion and Future Work

We have presented a method for 3D point cloud convolution using the Hausdorff distance between the neighborhoods of query points and a set of kernel points. The convolution operation is shape-aware and allows powerful feature learning with a small set of geometric priors/kernels. Based on this geometric convolution operator, we extend conventional regular CNNs to HPC-DNN, using multiple kernels representing different geometric priors. We further extend our approach by defining a weighted Hausdorff distance, which fuses features from adjacent levels with local geometric information in the network. Evaluations on the point cloud scene datasets S3DIS and SemanticKITTI demonstrate that our method is effective, scoring about 2% higher than strong baselines.

Our HPC-DNN has limitations similar to those of conventional CNNs: the number of kernels, i.e., convolution filters, is limited by the capacity of the deep network. Furthermore, the kernel points represent prescribed geometric priors; their shapes are not learnable, only their shortest distance weighting is learned. For extremely sparse point cloud data, the geometry is hard to extract, hindering the extraction of effective features by HPC-DNN. In the future, we would like to explore different architectures based on HPC and, in the longer term, the possibility of learning the kernel geometries. The challenge is to keep the valuable properties of permutation and scale invariance.

References

  • [1] Iro Armeni, Sasha Sax, Amir R. Zamir, and Silvio Savarese. Joint 2D-3D-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
  • [2] Matan Atzmon, Haggai Maron, and Yaron Lipman. Point convolutional neural networks by extension operators. ACM Trans. on Graphics (Proc. SIGGRAPH), 37(4):1–12, 2018.
  • [3] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In Proc. Int. Conf. on Computer Vision, pages 9297–9307, 2019.
  • [4] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. arXiv preprint arXiv:1904.08755, 2019.
  • [5] M-P Dubuisson and Anil K Jain. A modified Hausdorff distance for object matching. In Proceedings of the 12th International Conference on Pattern Recognition, volume 1, pages 566–568. IEEE, 1994.
  • [6] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 3354–3361, 2012.
  • [7] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 9224–9232, 2018.
  • [8] Wenkai Han, Chenglu Wen, Cheng Wang, Xin Li, and Qing Li. Point2Node: Correlation learning of dynamic-node for point cloud feature modeling. In Proc. AAAI Conf. on Artificial Intelligence, pages 10925–10932, 2020.
  • [9] Pedro Hermosilla, Tobias Ritschel, Pere-Pau Vázquez, Àlvar Vinacua, and Timo Ropinski. Monte Carlo convolution for learning on non-uniformly sampled point clouds. ACM Trans. on Graphics (Proc. SIGGRAPH Asia), page 235, 2018.
  • [10] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, and Andrew Markham. RandLA-Net: Efficient semantic segmentation of large-scale point clouds. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 11108–11117, 2020.
  • [11] Zeyu Hu, Mingmin Zhen, Xuyang Bai, Hongbo Fu, and Chiew-Lan Tai. JSENet: Joint semantic segmentation and edge detection network for 3D point clouds. arXiv preprint arXiv:2007.06888, 2020.
  • [12] Maximilian Jaritz, Jiayuan Gu, and Hao Su. Multi-view PointNet for 3D scene understanding. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3995–4003. IEEE, 2019.
  • [13] Oliver Jesorsky, Klaus J Kirchberg, and Robert W Frischholz. Robust face detection using the Hausdorff distance. In International Conference on Audio- and Video-Based Biometric Person Authentication, pages 90–95. Springer, 2001.
  • [14] Li Jiang, Hengshuang Zhao, Shu Liu, Xiaoyong Shen, Chi-Wing Fu, and Jiaya Jia. Hierarchical point-edge interaction network for point cloud semantic segmentation. In Proc. Int. Conf. on Computer Vision, pages 10433–10441, 2019.
  • [15] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3D point cloud models. In Proc. Int. Conf. on Computer Vision, pages 863–872, 2017.
  • [16] Shiyi Lan, Ruichi Yu, Gang Yu, and Larry S Davis. Modeling local geometric structure of 3D point clouds using Geo-CNN. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 998–1008, 2019.
  • [17] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 4558–4567, 2018.
  • [18] Huan Lei, Naveed Akhtar, and Ajmal Mian. Octree guided CNN with spherical kernels for 3D point clouds. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 9631–9640, 2019.
  • [19] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. PointCNN: Convolution on X-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
  • [20] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 8895–8904, 2019.
  • [21] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IROS, pages 922–928. IEEE, 2015.
  • [22] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 652–660, 2017.
  • [23] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
  • [24] Yongming Rao, Jiwen Lu, and Jie Zhou. Spherical fractal convolutional neural networks for point cloud recognition. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 452–460, 2019.
  • [25] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning deep 3D representations at high resolutions. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 3577–3586, 2017.
  • [26] Radu Bogdan Rusu, Gary Bradski, Romain Thibaux, and John Hsu. Fast 3D recognition and pose using the viewpoint feature histogram. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2155–2162. IEEE, 2010.
  • [27] Samuele Salti, Federico Tombari, and Luigi Di Stefano. SHOT: Unique signatures of histograms for surface and texture description. Computer Vision and Image Understanding, 125:251–264, 2014.
  • [28] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 3693–3702, 2017.
  • [29] Bastian Steder, Radu Bogdan Rusu, Kurt Konolige, and Wolfram Burgard. NARF: 3D range image features for object recognition. In IROS, volume 44, 2010.
  • [30] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SplatNet: Sparse lattice networks for point cloud processing. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 2530–2539, 2018.
  • [31] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proc. Int. Conf. on Computer Vision, pages 945–953, 2015.
  • [32] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3D. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 3887–3896, 2018.
  • [33] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J. Guibas. KPConv: Flexible and deformable convolution for point clouds. In Proc. Int. Conf. on Computer Vision, pages 6411–6420, 2019.
  • [34] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 2589–2597, 2018.
  • [35] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. ACM Trans. on Graphics, 38(5):1–12, 2019.
  • [36] Wenxuan Wu, Zhongang Qi, and Li Fuxin. PointConv: Deep convolutional networks on 3D point clouds. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 9621–9630, 2019.
  • [37] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. SpiderCNN: Deep learning on point sets with parameterized convolutional filters. In Proc. Euro. Conf. on Computer Vision, pages 87–102, 2018.
  • [38] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. FoldingNet: Point cloud auto-encoder via deep grid deformation. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 206–215, 2018.
  • [39] Yangyang Ye, Houjin Chen, Chi Zhang, Xiaoli Hao, and Zhaoxiang Zhang. SARPNET: Shape attention regional proposal network for LiDAR-based 3D object detection. Neurocomputing, 2019.
  • [40] Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Zerong Xi, Boqing Gong, and Hassan Foroosh. PolarNet: An improved grid representation for online LiDAR point clouds semantic segmentation. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 9601–9610, 2020.
  • [41] Hengshuang Zhao, Li Jiang, Chi-Wing Fu, and Jiaya Jia. PointWeb: Enhancing local neighborhood features for point cloud processing. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pages 5565–5573, 2019.