
MaskRange: A Mask-classification Model for Range-view based LiDAR Segmentation

Yi Gu, Yuming Huang, Chengzhong Xu, Hui Kong
University of Macau
{yc17463, yc17907, huikong, czxu}@um.edu.mo
Abstract

Range-view based LiDAR segmentation methods are attractive for practical applications due to their direct inheritance from efficient 2D CNN architectures. In the literature, most range-view based methods follow the per-pixel classification paradigm. Recently, in the image segmentation domain, another paradigm formulates segmentation as a mask-classification problem and has achieved remarkable performance. This raises an interesting question: can the mask-classification paradigm benefit range-view based LiDAR segmentation and outperform the per-pixel paradigm? To answer this question, we propose a unified mask-classification model, MaskRange, for range-view based LiDAR semantic and panoptic segmentation. Along with the new paradigm, we also propose a novel data augmentation method to deal with the overfitting, context-reliance, and class-imbalance problems. Extensive experiments are conducted on the SemanticKITTI benchmark. Among all published range-view based methods, our MaskRange achieves state-of-the-art performance with 66.10 mIoU on semantic segmentation and promising results with 53.10 PQ on panoptic segmentation, both with high efficiency. Our code will be released.

Keywords: LiDAR segmentation, Mask-classification, Data augmentation, Autonomous vehicle and robot

1 Introduction

Figure 1: PQ and mIoU vs. single-frame inference latency on SemanticKITTI [1]. Our MaskRange can achieve remarkable performance on both semantic and panoptic segmentation tasks with 25 FPS.

Light Detection and Ranging (LiDAR) segmentation plays a significant role in autonomous driving and robot navigation by providing agents with a full understanding of their surroundings. Following the development of image segmentation [2], most existing LiDAR-based methods [3, 4, 5, 6] formulate LiDAR segmentation as a per-point classification problem.

Very recently, in the image domain, some methods [7, 8, 9] revive another paradigm, where image segmentation is formulated as a mask-classification problem by disentangling this task into two parallel branches: partitioning the image and classifying each partition. In a sense, the mask-classification paradigm is more anthropomorphic. Moreover, these methods have shown remarkable performance on both semantic- and instance-level segmentation tasks in a unified manner.

This naturally raises an interesting question: can the mask-classification paradigm benefit LiDAR segmentation? To the best of our knowledge, no prior work has investigated this question. To fill this gap, we propose a novel mask-classification model for LiDAR-based segmentation, dubbed MaskRange. We choose the range-view-based representation due to its efficiency and similarity to the image format. We build our model on the meta mask-classification architecture [7, 8], with run-time and memory consumption in mind.

Intuitively, this adaptation to LiDAR’s range-view representation should be readily effective for segmentation. However, due to issues caused by overfitting, context reliance, and class imbalance, the performance is not as good as expected. The empirical results cannot even match our per-pixel baseline. To tackle these problems, we propose a novel data augmentation scheme, named Weighted Paste Drop (WPD). WPD incurs negligible cost and dramatically improves the performance of our model. On the SemanticKITTI [1] benchmark (accessed June 2022), our MaskRange achieves 66.10 mIoU (rank 1 among range-view based methods) on the semantic segmentation track and 53.10 PQ on the panoptic segmentation track. Notably, our model only takes 40 ms (25 FPS) for both tasks on a single NVIDIA RTX 2080Ti GPU. See Figure 1 for a comparison.

It is worthwhile to highlight that, rather than focusing on one particular network structure, our work aims at introducing a more general paradigm that can potentially be applied to different LiDAR data representations. As a typical example, the range-view representation is adopted to demonstrate the superiority of the mask-classification paradigm. We hope this work can provide some insights into the mask-classification based LiDAR segmentation paradigm. In summary, our main contributions are as follows:

(i) To the best of our knowledge, MaskRange is the first LiDAR segmentation method based on the mask-classification paradigm. We will release our code and pre-trained models.

(ii) We propose a novel data augmentation method to reduce the adverse effects caused by the overfitting, context-reliance, and class-imbalance problems.

(iii) Our method is a unified framework that can be applied to both semantic and panoptic segmentation tasks. Correspondingly, we propose a unified weighted focal loss to re-balance the training data. We achieve the best performance in range-view-based LiDAR semantic segmentation and promising results in panoptic segmentation on the SemanticKITTI benchmark.

2 Related Work

2.1 LiDAR Segmentation

According to the LiDAR data representation, LiDAR-based segmentation algorithms can be categorized into three types [10], i.e., point-, voxel-, and projection-based methods. Point-based methods [5, 11, 12] extract features directly from the raw point cloud, but suffer from inefficient neighbor searching as well as computational and memory limitations. Voxel-based methods [13, 14, 15] partition the space into a finite number of regions to accelerate feature extraction, but suffer from serious information loss when the resolution is reduced. Projection-based methods [6, 16, 17, 18, 19] also incur information loss, but can benefit from efficient 2D CNN architectures. Due to their efficiency, projection-based methods are more attractive for practical applications. Our work belongs to this category, specifically to the range-view-based methods.

2.2 Mask-classification Image Segmentation

Different from the conventional per-pixel classification paradigm, mask-classification methods formulate the segmentation problem from another perspective by predicting a set of binary masks and assigning a single class to each mask in parallel. Following the DETR [20] framework, MaskFormer [7] uses a set of learnable tokens to query the dense embeddings for mask prediction and shows promising results on both semantic and panoptic segmentation tasks. Similarly, K-Net [9] also unifies segmentation tasks through dot products between kernels and feature maps. Mask2Former [8] further improves the components of MaskFormer and outperforms most specialized architectures on all considered tasks and datasets. Our work utilizes the MaskFormer [7] meta architecture and aims to achieve LiDAR segmentation in a unified manner.

2.3 3D Data Augmentation

Data augmentation is an important way to enhance model performance by increasing the diversity of training sets. In the 3D area, simple data augmentation schemes such as random rotation, translation, point dropping, and flipping are widely used in common practice [5, 11, 21]. PointMixup [22] and PointCutMix [23] extend the 2D augmentation methods Mixup [24] and CutMix [25], respectively, to 3D point clouds. These methods show better performance than the original simple augmentation methods, but they cannot generally be applied to large-scale point cloud understanding. Mix3D [26] proposes an effective method that simply mixes two data items and can significantly improve out-of-context generalization. To alleviate the class-imbalance problem, RPVNet [10] and SECOND [27] augment data by pasting rare-class instances into the scene while considering plausibility and realism. Our Weighted Paste Drop data-augmentation scheme is similar to these works but more efficient and unified. We compare our method with the above-mentioned methods in subsection 3.2.

3 MaskRange

The goal of LiDAR semantic segmentation is to predict the semantic label of each point. For panoptic segmentation, points are classified into two types: things and stuff. A unique id is required for each point belonging to things [28]. Unlike most methods designed specifically for a single task, in this study we aim to achieve both tasks in a unified manner.

We choose the range view as a proxy representation for LiDAR scans due to its efficiency and similarity to the image format. Generally, there are two ways to get the range-view representation: spherical projection [6] and scan unfolding [29]. Scan unfolding can recover a denser representation but is less convenient for data augmentation. Since proper data augmentation is necessary to increase the diversity of training data, we choose the spherical projection to get the range image:

\left(\begin{array}{c} u \\ v \end{array}\right)=\left(\begin{array}{c} \frac{1}{2}\left[1-\arctan(y,x)\,\pi^{-1}\right]W \\ \left[1-\left(\arcsin(zr^{-1})+f_{up}\right)\frac{1}{f}\right]H \end{array}\right), (1)

where W and H are the width and height of the range image, respectively, and f = f_{up} + f_{down} is the LiDAR’s vertical field-of-view. The range value r = \sqrt{x^2 + y^2 + z^2} is calculated from the point coordinates [x, y, z]^T, and (u, v)^T are the image coordinates in the range view. Following previous works [6, 18], the input to our network is an (H, W, 5) tensor R_in with channels (x, y, z, rem, r), where rem is the remission or intensity value.
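For concreteness, a minimal NumPy sketch of this projection is given below. The function name, defaults, and the closest-point tie-breaking are our own choices rather than released code; following common range-view implementations, the row index is computed from the pitch angle and the magnitude of the downward field-of-view.

```python
import numpy as np

def spherical_projection(points, remission, H=64, W=2048,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) point cloud into an (H, W, 5) range image.

    Defaults assume a 64-beam sensor as in SemanticKITTI; the channel
    order (x, y, z, rem, r) follows the paper."""
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(abs(fov_down_deg))
    fov = fov_up + fov_down                        # f = f_up + f_down

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)             # per-point range
    pitch = np.arcsin(z / np.maximum(r, 1e-8))

    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * W   # column index
    v = (1.0 - (pitch + fov_down) / fov) * H         # row index
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int64)

    # Many-to-one mapping: write far points first so near points win.
    order = np.argsort(r)[::-1]
    image = np.zeros((H, W, 5), dtype=np.float32)
    image[v[order], u[order]] = np.stack(
        [x, y, z, remission, r], axis=1)[order]
    return image
```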

3.1 Architecture

We build up our model based on the mask-classification meta architecture [7, 8] with consideration of both efficiency and accuracy. The meta architecture consists of three components: a Backbone, a Pixel Decoder, and a Transformer Decoder. Figure 2 shows an overview of MaskRange. Each component is introduced in detail in the following paragraphs.

Figure 2: The illustration of the MaskRange architecture. The red dashed lines denote the auxiliary losses in the original MaskFormer [7], while the red solid lines denote our auxiliary losses. This extra supervision makes our FID-like decoder perform better.

Backbone. The backbone extracts low-resolution features from the input R_in. Without loss of generality, we use a modified ResNet-34 [30, 31, 18] as our backbone to extract global features from a range image (see supplementary material Section A for more details). Note that we do not utilize any pre-trained parameters, for two reasons. First, we hope our method is general enough that different types of backbones can be incorporated into it, and pre-trained models may not exist for these backbones. Second, ImageNet [32] pre-trained models are not suitable for our range images since the data distribution is quite different.

Pixel Decoder. The pixel decoder produces high-resolution per-pixel embeddings. Since we have no pre-trained models, our backbone must be trained from scratch while the pixel decoder and transformer decoder are optimized simultaneously, so training the network would take a relatively long time. To speed up the feed-forward time and the convergence rate, we utilize Fully Interpolation Decoding (FID) [18] as our decoder module. The FID module is a parameter-free decoder that only contains bilinear upsampling operations and facilitates the optimization of our network. To further improve performance, we replace the bilinear upsampling in FID with Data-dependent Upsampling (DUpsampling) [33] to get the high-resolution feature maps. Note that we design our decoder in the FID format only for simplicity and efficiency; other decoders can also be used.
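DUpsampling can be realized as a 1×1 convolution that predicts, for each low-resolution cell, the features of an r×r block of high-resolution pixels, followed by a depth-to-space rearrangement. The sketch below is our own reading of [33] (module and parameter names are ours), not an official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DUpsampling(nn.Module):
    """Data-dependent upsampling: a 1x1 conv predicts, per low-res
    location, an r x r block of high-res feature vectors, which are
    then rearranged spatially (depth-to-space)."""

    def __init__(self, in_channels, out_channels, scale):
        super().__init__()
        self.scale = scale
        self.proj = nn.Conv2d(in_channels, out_channels * scale * scale,
                              kernel_size=1)

    def forward(self, x):
        x = self.proj(x)                        # (B, C*r*r, h, w)
        return F.pixel_shuffle(x, self.scale)   # (B, C, h*r, w*r)

# e.g. upsample the 8x256 backbone output x4 back to 64x2048 (r = 8):
up = DUpsampling(in_channels=128, out_channels=128, scale=8)
x4 = torch.randn(1, 128, 8, 256)
print(up(x4).shape)  # torch.Size([1, 128, 64, 2048])
```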

Transformer Decoder. For DETR-like frameworks [20, 7], the transformer decoder acts like first asking what objects are in the image, then finding where these objects are. To achieve that, a set of learnable object embeddings are used to interact with the image features. The outputs of the transformer decoder are the corresponding class predictions and mask embeddings. The binary mask predictions can be obtained via a simple matrix multiplication operation. We do not incorporate more advanced designs (e.g., Mask2Former [8]) into our architecture for simplicity. With this architecture, we can achieve the semantic segmentation and panoptic segmentation tasks in a unified manner. We refer readers to MaskFormer [7] for more details and better comparison.
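The following PyTorch sketch illustrates the two output branches and the MaskFormer-style semantic inference; all shapes and names here are illustrative, and MaskFormer additionally passes the query embeddings through a small MLP before the dot product, which we omit:

```python
import torch

B, Q, C = 2, 100, 128          # batch, number of queries, embedding dim
K, H, W = 19, 64, 2048         # classes, range-image size

query_embed = torch.randn(B, Q, C)      # transformer decoder output
pixel_embed = torch.randn(B, C, H, W)   # pixel decoder output

# Per-query class prediction (K classes + 1 "no object" class).
cls_head = torch.nn.Linear(C, K + 1)
class_logits = cls_head(query_embed)                   # (B, Q, K+1)

# Binary mask prediction: dot product between each query embedding
# and every per-pixel embedding.
mask_logits = torch.einsum("bqc,bchw->bqhw", query_embed, pixel_embed)

# Semantic inference (MaskFormer-style): marginalize over queries,
# dropping the "no object" class.
sem = torch.einsum("bqk,bqhw->bkhw",
                   class_logits.softmax(-1)[..., :-1],
                   mask_logits.sigmoid())              # (B, K, H, W)
```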

3.2 Weighted Paste Drop Data Augmentation

To prevent model overfitting, simple data augmentation schemes are widely used in common practice, including random rotation, translation, flipping, and point dropping [17, 18, 19]. We also train our model directly with these common data augmentation schemes. However, the performance of our MaskRange cannot even match that of the per-pixel baseline (CENet [31]). We attribute the unsatisfactory results (see supplementary material Section B for experimental analysis) to three problems: overfitting, context reliance, and class imbalance.

Overfitting. We restrict the overfitting problem here to issues related to the amount and diversity of training data. Some previous works [34, 35] have pointed out that transformer architectures usually require more training data than convolutional neural networks. However, the publicly available LiDAR segmentation datasets are relatively limited and small, and the common data augmentation schemes cannot satisfy the requirements of MaskRange.

Context Reliance. “Context” generally refers to the strong configuration rules present in man-made environments [26]. For example, with context knowledge, it can be inferred that traffic signs tend to be at the roadside, not within a parking lot. However, relying too heavily on contextual cues may result in poor generalization to rare or unseen situations [26, 36]. This problem is particularly acute for MaskRange since the transformer decoder only queries the global context embeddings to find objects. Intuitively, paying more attention to local geometric information can help deal with the context-reliance problem. Unfortunately, most LiDAR segmentation methods adopt encoder-decoder architectures that aggregate context information without considering the balance between global context priors and local geometric information.

Class Imbalance. Generally, the number of points corresponding to “road” is significantly larger than that of “pedestrian”. Thus, the network is biased toward the high-frequency classes and may perform badly on some rare classes. This problem is widespread in LiDAR-based segmentation datasets [6] and is usually dealt with by adding statistical weights to the cross-entropy loss, i.e., weighted cross entropy [6, 17, 18, 19]. However, this simple method cannot handle the class-imbalance problem well.

To deal with the aforementioned problems, we propose a novel data-augmentation method consisting of three meta operations: “Weighted”, “Paste” and “Drop”. Initially, two frames are randomly selected from the dataset (denoted as the first frame and the second frame) and the common data augmentation is applied. The paste operation first selects the long-tail objects from the second frame, then adds them to the first frame. The drop operation selects the non-long-tail class points in the first frame, then deletes these points. The weighted operation assigns a probability to each paste and drop.

Our WPD data augmentation significantly enlarges the size and diversity of the dataset, thereby alleviating the overfitting problem. The core idea behind WPD’s treatment of the context-reliance problem is to weaken the role of the context priors. Our “paste” operation adds unusual or even impossible scenes to the training set. For example, we can “paste” a “trunk” in the middle of a road, a configuration that never appears in the original dataset and cannot be created by common data augmentation. This means that the “paste” operation weakens the context priors. To further reduce the context bias, we “drop” the points carrying strong context information; for example, we hope our network can recognize cars without the road background. For the class-imbalance problem, we drop the high-frequency classes, such as road and car, with high probability and paste the less frequent classes frequently.
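A minimal sketch of one WPD step on a pair of point clouds is given below. It assumes class-level Bernoulli decisions for paste and drop, which is our reading of the description above; paste_prob and drop_prob are per-class probabilities (see Appendix D.1 in the supplementary material):

```python
import numpy as np

def weighted_paste_drop(scan_a, labels_a, scan_b, labels_b,
                        paste_prob, drop_prob, rng=np.random):
    """One WPD step on two point clouds (a sketch of Sec. 3.2, not the
    released code). paste_prob / drop_prob map class id -> probability;
    paste_prob is nonzero only for long-tail classes, drop_prob only
    for non-long-tail classes."""
    # Drop: remove non-long-tail points of the first frame per class.
    keep = np.ones(len(scan_a), dtype=bool)
    for cls, p in drop_prob.items():
        if rng.random() < p:
            keep &= labels_a != cls
    scan_a, labels_a = scan_a[keep], labels_a[keep]

    # Paste: copy long-tail points of the second frame into the first.
    take = np.zeros(len(scan_b), dtype=bool)
    for cls, p in paste_prob.items():
        if rng.random() < p:
            take |= labels_b == cls
    scan = np.concatenate([scan_a, scan_b[take]], axis=0)
    labels = np.concatenate([labels_a, labels_b[take]], axis=0)
    return scan, labels
```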

(a) raw image, I.P. and our paste.
(b) raw image, I.P. and our paste.
(c) raw image, Mix3D, our WPD.
Figure 3: Qualitative comparisons of our WPD with Mix3D [26] and Instance Paste [10, 36] (abbreviated as I.P.). Figures 3(a) and 3(b) show that our paste better handles the orientation problem and the sparsity problem, respectively. Figure 3(c) shows that our WPD is more realistic than Mix3D. We highlight the augmented areas with red bounding boxes for better comparison.

Note that, different from Mix3D [26], which pastes all components, we only paste the long-tail objects and drop the non-long-tail objects. As Figure 3(c) shows, after the spherical projection, it is hard to distinguish the points mixed in by Mix3D from the original points of the first frame. Though Mix3D [26] can “mislead” the context priors, it also confuses the model. In addition, Mix3D cannot handle the class-imbalance problem. RPVNet [10] and SECOND [27] also copy long-tail instances from a database and paste them above ground-class points. However, their methods are time-consuming and unrealistic (see Figures 3(a) and 3(b)). Our paste combines actual scanning points, which inherently accounts for the orientation of the LiDAR sensor and the sparsity of the scanned points at different distances.

3.3 Loss Function

Our loss function contains a classification loss and binary mask losses. The classification loss L_cls is the cross-entropy loss. The binary mask losses contain the focal loss L_focal [37], the Lovász loss L_ls [38] and the boundary loss L_bd [39]. Our total loss is

L = \lambda_1 L_{cls} + \lambda_2 L_{focal} + \lambda_3 L_{ls} + \lambda_4 L_{bd}, (2)

where λ_1, λ_2, λ_3 and λ_4 are hyper-parameters that balance these losses.

Different from other works [17, 18, 19, 31] which utilize the weighted cross-entropy loss, we utilize the weighted focal loss to re-balance the training set since the statistical weights have been largely changed by our data augmentation. Our weighted focal loss can be computed as

L_{focal}(p_t) = -\alpha_i\beta_i(1-p_t)^{\gamma}\log(p_t), \quad \text{with} \quad \alpha_i = \frac{1}{f_i+\epsilon} \quad \text{and} \quad \beta_i = \frac{ins_i}{sem_i}, (3)

where p_t is the predicted probability of the corresponding label, γ is the focusing parameter, and (1 - p_t)^γ is the modulating factor [37]. f_i stands for the proportion of class i in terms of the number of points, and ε is a small constant that prevents division by zero. ins_i and sem_i are the numbers of instance and semantic segments of class i, respectively. Thus, α_i is the semantic re-balance factor and β_i is the panoptic re-balance factor. For semantic segmentation, β_i equals 1 since each class has only one “instance”. For panoptic segmentation, β_i equals 1 for stuff classes and is larger than 1 for thing classes. This unified weighted focal loss provides a general improvement on both semantic and panoptic segmentation tasks.
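A per-pixel PyTorch sketch of Eq. 3 follows (ours, not the released implementation); note that the re-balance factors are indexed by the target class:

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, target, alpha, beta, gamma=2.0):
    """Eq. 3 as a per-pixel loss sketch.

    logits: (N, K) class scores; target: (N,) int64 labels;
    alpha, beta: (K,) re-balance factors (beta is all-ones for
    semantic segmentation)."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()
    w = (alpha * beta)[target]                    # alpha_i * beta_i
    loss = -w * (1.0 - pt) ** gamma * log_pt      # focal modulation
    return loss.mean()
```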

In MaskFormer [7], auxiliary losses are added to every transformer decoder layer and the per-pixel embeddings in the final pixel decoder layer. Our FID-like pixel decoder is not fully decoded, which may result in lower performance with only supervision on the final layer. Thus, inspired by deep supervision [40, 41, 31], we add auxiliary losses on the feature embeddings in intermediate pixel decoder layers (as illustrated in Figure 2).
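A sketch of this deep supervision scheme, assuming lightweight per-scale classifier heads on the intermediate pixel-decoder features (the head design and names are our assumptions, not the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Each intermediate pixel-decoder feature map gets a small classifier
# head; its prediction is upsampled to full resolution and supervised
# with the same label map as the final output.
num_classes = 19
aux_heads = nn.ModuleList([nn.Conv2d(128, num_classes, 1)
                           for _ in range(3)])

def deep_supervision_loss(aux_feats, labels, criterion):
    """aux_feats: intermediate features, e.g. [x2, x3, x4];
    labels: (B, H, W) ground truth at full range-image resolution;
    criterion: any per-pixel loss, e.g. the weighted focal loss."""
    loss = 0.0
    for head, feat in zip(aux_heads, aux_feats):
        logits = head(feat)
        logits = F.interpolate(logits, size=labels.shape[-2:],
                               mode="bilinear", align_corners=False)
        loss = loss + criterion(logits, labels)
    return loss
```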

4 Experiments

We evaluate our MaskRange on the public SemanticKITTI [1] benchmark, which consists of 43551 LiDAR scans from 22 sequences. We use sequences 00 to 10 (except 08) as the training set and sequence 08 as the validation set, and present the implementation details in Sec. 4.1. We compare our results on the test set (sequences 11 to 21) with other state-of-the-art methods on both semantic (Sec. 4.2) and panoptic (Sec. 4.3) segmentation tasks. We conduct extensive ablation studies (Sec. 4.4) to analyze the effectiveness of each component of our method.

4.1 Implementation Details

Data augmentation. Initially, two frames are randomly selected and the common data augmentation schemes (random point dropping, flipping, rotation, and translation) are applied to them. Then we paste the long-tail objects of the second frame into the first frame and drop the non-long-tail class points in the first frame. We set motorcyclist, bicyclist, bicycle, person, motorcycle, traffic-sign, other-vehicle, truck, pole, other-ground, and trunk as long-tail classes according to the point frequency of each class (the detailed statistical results are presented in supplementary material Section C). The other classes are set as non-long-tail classes. We assign probabilities to the paste and drop operations, known as the weighted operation. These weights are obtained as in our weighted focal loss (Eq. 3), i.e., α_iβ_i, and rescaled to [0, 1] by α_iβ_i / max(α_1β_1, α_2β_2, …, α_kβ_k), where k is the number of classes.

Training settings. The input resolution of our range image is 64×2048. CENet [31] is used as our pixel-level module, which provides the backbone and the pixel decoder. We use 4 transformer decoder layers with 100 queries by default. We use AdamW [42] as the optimizer and the poly [43] learning rate schedule to optimize our model. The initial learning rate of the transformer decoder parameters is set to 0.0001; for the backbone and pixel decoder, it is 0.001. The batch size is set to 16. All models are trained for 150 epochs on 4 A100 GPUs.

Inference and post-processing. At inference time, we use the same strategy as proposed in MaskFormer [7]. With the predictions in the range view, we further project these predictions back to the 3D point cloud. Following previous works [6, 17, 19], we employ a KNN post-processing to alleviate the “shadow” effect. The inference latency is measured on a platform with an AMD Ryzen 9 3950X CPU and an RTX 2080Ti GPU.
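For illustration, a simplified dense variant of the KNN cleanup is sketched below; the actual module [6] restricts the neighbor search to a window in the range image and gates neighbors by range difference, which is much faster than this naive version:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_postprocess(points, point_labels, k=5, num_classes=20):
    """Simplified KNN label cleanup: re-label every 3D point by the
    majority vote of its k nearest neighbors' predicted labels.
    points: (N, 3); point_labels: (N,) back-projected predictions."""
    idx = NearestNeighbors(n_neighbors=k).fit(points).kneighbors(points)[1]
    votes = point_labels[idx]                       # (N, k) labels
    # Majority vote per point via one-hot counting.
    counts = np.zeros((len(points), num_classes), dtype=np.int32)
    for j in range(k):
        counts[np.arange(len(points)), votes[:, j]] += 1
    return counts.argmax(axis=1)
```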

4.2 Semantic Segmentation

Table 1: Quantitative comparison on the SemanticKITTI [1] test set. We compare our MaskRange with recent SOTA point/voxel-based and projection-based methods. The best results among the compared point/voxel-based and projection-based methods are highlighted in bold separately. The second-best results among the compared projection-based methods are highlighted in blue. The FPS measurements marked with * are tested by us on a single RTX 2080Ti GPU.
Category | Method | Mean IoU | Car | Bicycle | Motorcycle | Truck | Other-vehicle | Person | Bicyclist | Motorcyclist | Road | Parking | Sidewalk | Other-ground | Building | Fence | Vegetation | Trunk | Terrain | Pole | Traffic-sign | FPS (Hz)

Point / voxel RandLA-Net [12] 53.9 94.2 26.0 25.8 40.1 38.9 49.2 48.2 7.2 90.7 60.3 73.7 20.4 86.9 56.3 81.4 61.3 66.8 49.2 47.7 1.9
KPConv [44] 58.8 96.0 30.2 42.5 33.4 44.3 61.5 61.6 11.8 88.8 61.3 72.7 31.6 90.5 64.2 84.8 69.2 69.1 56.4 47.4 -
KPRNet [45] 63.1 95.5 54.1 47.9 23.6 42.6 65.9 65.0 16.5 93.2 73.9 80.6 30.2 91.7 68.4 85.7 69.8 71.2 58.7 64.1 0.3
JS3C-Net [46] 66.0 95.8 59.3 52.9 54.3 46.0 69.5 65.4 39.9 88.9 61.9 72.1 31.9 92.5 70.8 84.5 69.8 67.9 60.7 68.7 2.1 (V100)
SPVNAS [15] 67.0 97.2 50.6 50.4 56.6 58.0 67.4 67.1 50.3 90.2 67.6 75.4 21.8 91.6 66.9 86.1 73.4 71.0 64.3 67.3 3.9 (1080Ti)
Cylinder3D [13] 67.8 97.1 67.6 64.0 59.0 58.6 73.9 67.9 36.0 91.4 65.1 75.5 32.3 91.0 66.5 85.4 71.8 68.5 62.6 65.6 5.6
Projection RangeNet++ [6] 52.2 91.4 25.7 34.4 25.7 23.0 38.3 38.8 4.8 91.8 65.0 75.2 27.8 87.4 58.6 80.5 55.1 64.6 47.9 55.9 12
SqueezeSegv3 [16] 55.9 92.5 38.7 36.5 29.6 33.0 45.6 46.2 20.1 91.7 63.4 74.8 26.4 89.0 59.4 82.0 58.7 65.4 49.6 58.9 6
SalsaNext [17] 59.5 91.9 48.3 38.6 38.9 31.9 60.2 59.0 19.4 91.7 63.7 75.8 29.1 90.2 64.2 81.8 63.6 66.5 54.3 62.1 24
FIDNet [18] 59.5 93.9 54.7 48.9 27.6 23.9 62.3 59.8 23.7 90.6 59.1 75.8 26.7 88.9 60.5 84.5 64.4 69.0 53.3 62.8 31*
Lite-HDSeg [19] 63.8 92.3 40.0 55.4 37.7 39.6 59.2 71.6 54.1 93.0 68.2 78.3 29.3 91.5 65.0 78.2 65.8 65.1 59.5 67.7 20
CENet [31] 64.7 91.9 58.6 50.3 40.6 42.3 68.9 65.9 43.5 90.3 60.9 75.1 31.5 91.0 66.2 84.5 69.7 70.0 61.5 67.6 32*
MaskRange (ours) 66.1 94.2 56.0 55.7 59.2 52.4 67.6 64.8 31.8 91.7 70.7 77.1 29.5 90.6 65.2 84.6 68.5 69.2 60.2 66.6 25*

For semantic segmentation, we use mean Intersection over Union (mIoU) [47] as the evaluation metric to compare our MaskRange with recently published works. As shown in Table 1, among all projection-based methods, our MaskRange achieves the best performance. Compared with point/voxel-based methods, our performance is still competitive with only 40 ms runtime cost.

4.3 Panoptic Segmentation

Table 2: Comparison of LiDAR panoptic segmentation on SemanticKITTI test set.
Method | PQ | PQ† | RQ | SQ | PQ^Th | RQ^Th | SQ^Th | PQ^St | RQ^St | SQ^St | mIoU | FPS (Hz)
RangeNet++ [6] + PP [48] 37.1 45.9 47.0 75.9 20.2 25.2 75.2 49.3 62.8 76.5 52.4 2.4
LPSAD [49] 38.0 47.0 48.2 76.5 25.6 31.8 76.8 47.1 60.1 76.2 50.9 11.8
KPConv [44] + PP [48] 44.5 52.5 54.4 80.0 32.7 38.7 81.5 53.1 65.9 79.0 58.8 1.9
Panoster [50] 52.7 59.9 64.1 80.7 49.4 58.5 83.3 55.1 68.2 78.8 59.9 -
Panoptic-PolarNet [51] 54.1 60.7 65.0 81.4 53.3 60.6 87.2 54.8 68.1 77.2 59.5 11.6
DS-Net [52] 55.9 62.5 66.7 82.3 55.1 62.8 87.2 56.5 69.5 78.7 61.6 3.2
EfficientLPS [53] 57.4 63.2 68.7 83.0 53.1 60.5 87.8 60.5 74.6 79.5 61.4 4.7 (Titan RTX)
MaskRange (ours) 53.1 59.2 64.6 81.2 44.9 53.0 83.5 59.1 73.1 79.5 61.8 25

For panoptic segmentation, we use Panoptic Quality (PQ) [54] as the main metric, and we also report Recognition Quality (RQ) and Segmentation Quality (SQ) for better comparison. These metrics are calculated separately for thing and stuff classes. Besides, we report PQ† [55], which uses the IoU as the PQ for stuff classes. As shown in Table 2, among all listed methods, our MaskRange is the most efficient one at 25 FPS and achieves comparable performance with 53.1 PQ. Benefiting from the mask-classification paradigm and our unified WPD augmentation, we can extend MaskRange to panoptic segmentation without any bounding-box supervision or extra hyperparameter tuning.

4.4 Ablation Study

We conduct ablation studies on the WPD data augmentation and the upsampling and deep supervision strategies. We also compare our weighted focal loss (WF) with the weighted cross-entropy loss (WCE) to show the effectiveness of our loss functions. We choose CENet [31] as the per-pixel baseline and train it with the official code. All ablation studies are conducted at the 64×2048 resolution.

Data augmentation. We first study the effectiveness of our WPD data augmentation. To show its superiority, we also compare our method with Mix3D [26] and Instance Paste (I.P.) [10]. As shown in Table 3 (rows 3-6), our WPD outperforms the compared methods by at least 1.2 mIoU. It can be seen that, with the common data augmentation, our model cannot even match CENet [31] (row 1). When incorporating Mix3D [26], the performance is improved by a large margin. This motivates us to investigate better data augmentation methods. Thus, we propose WPD, which alleviates the overfitting, context-reliance, and class-imbalance problems simultaneously. We refer readers to the supplementary materials Sections B and D for more details.

We also add our WPD to CENet (row 2), but the improvement is limited (64.25% − 62.70% = 1.55%). In contrast, the improvement attributed to WPD in our MaskRange is 66.80% − 59.10% = 7.70%. This suggests that our MaskRange is more expressive. Compared with CENet, MaskRange introduces only 1.113 M extra parameters and 2.2 more GFLOPs. Another possible explanation is that the transformer decoder collects global information from the image features to generate class predictions. This setup reduces the need for heavy context aggregation in the per-pixel module [7] and helps this module focus more on shape and geometry information.

WCE vs. WF. Since the statistical weights have been largely changed by our WPD data augmentation, we use the focal loss to re-weight the cross-entropy loss. Additionally, we introduce extra re-balance factors into the original focal loss. Rows 6-7 of Table 3 show that our weighted focal loss is superior to the weighted cross-entropy loss.

Deep supervision strategy. We compare our improved implementation of deep supervision (denoted DeepB) with the default deep supervision (DeepA). As illustrated in Figure 2, the red dashed lines represent DeepA and the red solid lines represent DeepB. As shown in rows 7 and 8 of Table 3, the mIoU of DeepB is slightly higher than that of DeepA. Moreover, DeepB converges faster than DeepA. Thus, we choose DeepB as our deep supervision strategy.

Upsampling strategy. The interpolation-based upsampling (IU) is efficient but may result in coarse predictions. With data-dependent upsampling (DU), we achieve better performance by 0.24 mIoU (rows 8-9) with only 2 ms extra runtime cost.

We also conduct ablations on the re-balance strategies, comparing the no-balance, class-balance, and unified-balance strategies on both semantic and panoptic segmentation tasks. We refer readers to the supplementary material Section E for more details.

Table 3: Ablation studies of LiDAR semantic segmentation on SemanticKITTI validation set.
Method | common | Mix | I.P. | WPD | WCE | WF | DeepA | DeepB | IU | DU | mIoU (%)
CENet [31] | ✓ | – | – | – | – | – | – | – | – | – | 62.70
CENet [31] | ✓ | – | – | ✓ | – | – | – | – | – | – | 64.25
MaskRange | ✓ | – | – | – | ✓ | – | ✓ | – | ✓ | – | 59.10
MaskRange | ✓ | ✓ | – | – | ✓ | – | ✓ | – | ✓ | – | 65.60
MaskRange | ✓ | – | ✓ | – | ✓ | – | ✓ | – | ✓ | – | 63.32
MaskRange | ✓ | – | – | ✓ | ✓ | – | ✓ | – | ✓ | – | 66.80
MaskRange | ✓ | – | – | ✓ | – | ✓ | ✓ | – | ✓ | – | 67.41
MaskRange | ✓ | – | – | ✓ | – | ✓ | – | ✓ | ✓ | – | 67.53
MaskRange | ✓ | – | – | ✓ | – | ✓ | – | ✓ | – | ✓ | 67.77

5 Conclusion

In this work, we investigate a unified LiDAR segmentation method (MaskRange) based on the mask-classification paradigm with the range-view LiDAR representation. With a novel Weighted Paste Drop data augmentation, our MaskRange achieves state-of-the-art performance on semantic segmentation and promising results on panoptic segmentation with high efficiency. Our method is flexible and general enough to be readily applied to other LiDAR segmentation methods.

References

  • Behley et al. [2019] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9297–9307, 2019.
  • Long et al. [2015] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • Guo et al. [2020] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun. Deep learning for 3d point clouds: A survey. IEEE transactions on pattern analysis and machine intelligence, 43(12):4338–4364, 2020.
  • Li et al. [2020] Y. Li, L. Ma, Z. Zhong, F. Liu, M. A. Chapman, D. Cao, and J. Li. Deep learning for lidar point clouds in autonomous driving: A review. IEEE Transactions on Neural Networks and Learning Systems, 32(8):3412–3432, 2020.
  • Qi et al. [2017] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
  • Milioto et al. [2019] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss. Rangenet++: Fast and accurate lidar semantic segmentation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4213–4220. IEEE, 2019.
  • Cheng et al. [2021] B. Cheng, A. Schwing, and A. Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34, 2021.
  • Cheng et al. [2022] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.
  • Zhang et al. [2021] W. Zhang, J. Pang, K. Chen, and C. C. Loy. K-net: Towards unified image segmentation. Advances in Neural Information Processing Systems, 34:10326–10338, 2021.
  • Xu et al. [2021] J. Xu, R. Zhang, J. Dou, Y. Zhu, J. Sun, and S. Pu. Rpvnet: A deep and efficient range-point-voxel fusion network for lidar point cloud segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16024–16033, 2021.
  • Qi et al. [2017] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
  • Hu et al. [2020] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11108–11117, 2020.
  • Zhu et al. [2021] X. Zhu, H. Zhou, T. Wang, F. Hong, Y. Ma, W. Li, H. Li, and D. Lin. Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9939–9948, 2021.
  • Choy et al. [2019] C. Choy, J. Gwak, and S. Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019.
  • Tang et al. [2020] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, and S. Han. Searching efficient 3d architectures with sparse point-voxel convolution. In European Conference on Computer Vision, pages 685–702. Springer, 2020.
  • Xu et al. [2020] C. Xu, B. Wu, Z. Wang, W. Zhan, P. Vajda, K. Keutzer, and M. Tomizuka. Squeezesegv3: Spatially-adaptive convolution for efficient point-cloud segmentation. In European Conference on Computer Vision, pages 1–19. Springer, 2020.
  • Cortinhal et al. [2020] T. Cortinhal, G. Tzelepis, and E. E. Aksoy. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds. In ISVC (2), volume 12510 of Lecture Notes in Computer Science, pages 207–222. Springer, 2020.
  • Zhao et al. [2021] Y. Zhao, L. Bai, and X. Huang. Fidnet: Lidar point cloud semantic segmentation with fully interpolation decoding. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4453–4458. IEEE, 2021.
  • Razani et al. [2021] R. Razani, R. Cheng, E. Taghavi, and L. Bingbing. Lite-hdseg: Lidar semantic segmentation using lite harmonic dense convolutions. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 9550–9556. IEEE, 2021.
  • Carion et al. [2020] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
  • Aksoy et al. [2020] E. E. Aksoy, S. Baci, and S. Cavdar. Salsanet: Fast road and vehicle segmentation in lidar point clouds for autonomous driving. In 2020 IEEE intelligent vehicles symposium (IV), pages 926–932. IEEE, 2020.
  • Chen et al. [2020] Y. Chen, V. T. Hu, E. Gavves, T. Mensink, P. Mettes, P. Yang, and C. G. Snoek. Pointmixup: Augmentation for point clouds. In European Conference on Computer Vision, pages 330–345. Springer, 2020.
  • Zhang et al. [2021] J. Zhang, L. Chen, B. Ouyang, B. Liu, J. Zhu, Y. Chen, Y. Meng, and D. Wu. Pointcutmix: Regularization strategy for point cloud classification. arXiv preprint arXiv:2101.01461, 2021.
  • Zhang et al. [2018] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR (Poster). OpenReview.net, 2018.
  • Yun et al. [2019] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019.
  • Nekrasov et al. [2021] A. Nekrasov, J. Schult, O. Litany, B. Leibe, and F. Engelmann. Mix3d: Out-of-context data augmentation for 3d scenes. In 2021 International Conference on 3D Vision (3DV), pages 116–125. IEEE, 2021.
  • Yan et al. [2018] Y. Yan, Y. Mao, and B. Li. SECOND: sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
  • Li et al. [2022] Z. Li, W. Wang, E. Xie, Z. Yu, A. Anandkumar, J. M. Alvarez, P. Luo, and T. Lu. Panoptic segformer: Delving deeper into panoptic segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1280–1289, 2022.
  • Triess et al. [2020] L. T. Triess, D. Peter, C. B. Rist, and J. M. Zöllner. Scan-based semantic segmentation of lidar point clouds: An experimental study. In 2020 IEEE Intelligent Vehicles Symposium (IV), pages 1116–1121. IEEE, 2020.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Cheng et al. [2022] H. Cheng, X. Han, and G. Xiao. Cenet: Toward concise and efficient lidar semantic segmentation for autonomous driving. In IEEE International Conference on Multimedia and Expo, 2022.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. Ieee, 2009.
  • Tian et al. [2019] Z. Tian, T. He, C. Shen, and Y. Yan. Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3126–3135, 2019.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Dosovitskiy et al. [2021] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR. OpenReview.net, 2021.
  • Shetty et al. [2019] R. Shetty, B. Schiele, and M. Fritz. Not using the car to see the sidewalk–quantifying and controlling the effects of context in classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8218–8226, 2019.
  • Lin et al. [2017] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
  • Berman et al. [2018] M. Berman, A. Rannen Triki, and M. B. Blaschko. The lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4413–4421, 2018.
  • Bokhovkin and Burnaev [2019] A. Bokhovkin and E. Burnaev. Boundary loss for remote sensing imagery semantic segmentation. In International Symposium on Neural Networks, pages 388–401. Springer, 2019.
  • Lee et al. [2015] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Artificial intelligence and statistics, pages 562–570. PMLR, 2015.
  • Wang et al. [2015] L. Wang, C.-Y. Lee, Z. Tu, and S. Lazebnik. Training deeper convolutional networks with deep supervision. arXiv preprint arXiv:1505.02496, 2015.
  • Loshchilov and Hutter [2019] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR (Poster). OpenReview.net, 2019.
  • Chen et al. [2018] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2018.
  • Thomas et al. [2019] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pages 6411–6420, 2019.
  • Kochanov et al. [2020] D. Kochanov, F. K. Nejadasl, and O. Booij. Kprnet: Improving projection-based lidar semantic segmentation. arXiv preprint arXiv:2007.12668, 2020.
  • Yan et al. [2021] X. Yan, J. Gao, J. Li, R. Zhang, Z. Li, R. Huang, and S. Cui. Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3101–3109, 2021.
  • Everingham et al. [2015] M. Everingham, S. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
  • Lang et al. [2019] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12697–12705, 2019.
  • Milioto et al. [2020] A. Milioto, J. Behley, C. McCool, and C. Stachniss. Lidar panoptic segmentation for autonomous driving. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8505–8512. IEEE, 2020.
  • Gasperini et al. [2021] S. Gasperini, M.-A. N. Mahani, A. Marcos-Ramiro, N. Navab, and F. Tombari. Panoster: End-to-end panoptic segmentation of lidar point clouds. IEEE Robotics and Automation Letters, 6(2):3216–3223, 2021.
  • Zhou et al. [2021] Z. Zhou, Y. Zhang, and H. Foroosh. Panoptic-polarnet: Proposal-free lidar point cloud panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13194–13203, 2021.
  • Hong et al. [2021] F. Hong, H. Zhou, X. Zhu, H. Li, and Z. Liu. Lidar-based panoptic segmentation via dynamic shifting network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13090–13099, 2021.
  • Sirohi et al. [2021] K. Sirohi, R. Mohan, D. Büscher, W. Burgard, and A. Valada. Efficientlps: Efficient lidar panoptic segmentation. IEEE Transactions on Robotics, 2021.
  • Kirillov et al. [2019] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019.
  • Porzi et al. [2019] L. Porzi, S. R. Bulo, A. Colovic, and P. Kontschieder. Seamless scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8277–8286, 2019.
  • Hurtado et al. [2020] J. V. Hurtado, R. Mohan, and A. Valada. Mopt: Multi-object panoptic tracking. arXiv preprint arXiv:2004.08189, 2020.

Supplementary Materials

Appendix A Network Details

Backbone
layer | k | s | chns_in | chns_out | output | output size | activation
conv1 | 3 | 1 | 5 | 64 | – | 64×2048 | LeakyReLU
conv2 | 3 | 1 | 64 | 128 | – | 64×2048 | LeakyReLU
conv3 | 3 | 1 | 128 | 128 | x_0 | 64×2048 | LeakyReLU
Basic block ×3 | 3 | 1 | 128 | 128 | x_1 | 64×2048 | LeakyReLU
Basic block ×4 | 3 | 2 | 128 | 128 | x_2 | 32×1024 | LeakyReLU
Basic block ×6 | 3 | 2 | 128 | 128 | x_3 | 16×512 | LeakyReLU
Basic block ×3 | 3 | 2 | 128 | 128 | x_4 | 8×256 | LeakyReLU
Pixel Decoder
operation | chns | input | output | resolution
↑ | 128 | x_2 | x_2↑ | 64×2048
↑ | 128 | x_3 | x_3↑ | 64×2048
↑ | 128 | x_4 | x_4↑ | 64×2048
Concatenate | 640 | (x_0, x_1, x_2↑, x_3↑, x_4↑) | cat | 64×2048
Conv2d | 256 | cat | embeddings | 64×2048
Conv2d | 128 | embeddings | pixel_embeddings | 64×2048
Transformer Decoder
layer | operations | input | output | activation
layer1 | self-attention, cross-attention, FFN | x_4, queries | o_1 | ReLU
layer2 | self-attention, cross-attention, FFN | o_1, queries | o_2 | ReLU
layer3 | self-attention, cross-attention, FFN | o_2, queries | o_3 | ReLU
layer4 | self-attention, cross-attention, FFN | o_3, queries | o_4 | ReLU
Table 4: MaskRange network details. Here k is the kernel size, s is the stride, chns_in and chns_out are the numbers of input and output channels, and ↑ denotes the interpolation or data-dependent upsampling operation.

Appendix B Overfitting, Class Imbalance and Context Reliance Analysis

Overfitting. We train our MaskRange and CENet [31] with the same common data augmentation. Table 5 shows that the performance of MaskRange on the training set is better than that of CENet, but worse on the validation set. This is a typical overfitting phenomenon. With our WPD, the performance gap is reduced.

Table 5: Best performance on the training and validation sets.
Methods train set validation set
CENet [31] 66.8 62.7
MaskRange 89.6 59.1
MaskRange + WPD 84.6 67.7

Class Imbalance.

Table 6: Class-wise comparison on SemanticKITTI [1] validation set.
Method | Mean IoU | Car | Bicycle | Motorcycle | Truck | Other-vehicle | Person | Bicyclist | Motorcyclist | Road | Parking | Sidewalk | Other-ground | Building | Fence | Vegetation | Trunk | Terrain | Pole | Traffic-sign

MaskRange + common_aug 59.10 95.94 14.53 39.00 60.89 48.96 68.00 73.84 0.0 95.09 47.13 82.95 0.15 89.50 60.79 87.46 69.12 75.35 64.65 48.85
MaskRange + WPD 67.77 97.21 59.08 67.46 87.33 65.87 79.37 89.82 4.19 95.65 49.68 84.01 12.08 89.85 61.78 85.94 70.40 72.06 66.03 49.81

We analyze the imbalance problem from an empirical perspective in this section. We compare the class-wise performance of MaskRange with common data augmentation and with WPD, and present the results in Table 6. Since the other-ground and motorcyclist classes are seldom present in the training set, MaskRange still performs badly on these two classes no matter which augmentation method is used. This extreme example illustrates the imbalance problem. Table 6 also shows the effectiveness of our method, where the IoU of the bicycle and motorcycle classes is improved by at least 28 points.

Context Reliance. We compare the qualitative results of our MaskRange with common data augmentation and WPD. Figure 4 shows the results on the validation set.

Figure 4: Illustration of context reliance problem. With the common data augmentation, the predictions of MaskRange rely heavily on the context information. From the third row of the figure, the person and motorcycle on the road can be easily mis-classified into bicyclist and other-vehicle, respectively, since the bicyclist and other-vehicle tend to be on the road. With our WPD, MaskRange can correctly classify these objects and segment them precisely.

To further demonstrate the context-reliance problem, we crop the original point cloud by dropping the points with z (height) < −1.5. In this way, most of the ground points with strong context information are dropped, and we run inference on the remaining points. Figure 5 shows that with the common data augmentation scheme, MaskRange can correctly predict most regions. However, with the ground points removed, some regions are wrongly predicted due to the loss of context information. Our WPD effectively alleviates the context-reliance problem and helps MaskRange obtain reasonable predictions without ground points.

Figure 5: The qualitative comparisons with ground removal.

Appendix C Statistical Results

We define the normalized weights for each class as:

w_i = \begin{cases} w_{s,i}, & \text{in semantic} \\ w_{p,i}, & \text{in panoptic} \end{cases}, \quad w_{s,i} = \frac{\alpha_i}{\max\limits_{k}\alpha_k}, \quad w_{p,i} = \frac{\alpha_i\beta_i}{\max\limits_{k}\alpha_k\beta_k}, (4)

where α_i = 1/(f_i + ε), β_i = ins_i/sem_i, k ∈ {1, 2, …, K}, and K is the number of total classes. f_i is provided by SemanticKITTI [1], and we obtain the class-wise statistical results by directly counting sem_i and ins_i on the training set. The detailed statistical results are shown in Table 7.
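A short NumPy sketch of Eq. 4 (the value of ε is our assumption, not reported in the paper):

```python
import numpy as np

def normalized_weights(f, ins, sem, eps=1e-3):
    """Eq. 4: per-class normalized weights.

    f, ins, sem: (K,) arrays of point-frequency proportion, instance
    count, and semantic-segment count (the rows of Table 7)."""
    alpha = 1.0 / (f + eps)                       # semantic factor
    beta = ins / sem                              # panoptic factor
    w_semantic = alpha / alpha.max()
    w_panoptic = (alpha * beta) / (alpha * beta).max()
    return w_semantic, w_panoptic
```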

Table 7: Class-wise statistical results on the SemanticKITTI [1] training set. f_i stands for the proportion of class i. ins_i and sem_i are the numbers of instance and semantic segments of class i, respectively. α_i is the semantic re-balance factor and β_i is the panoptic one. w_{s,i} and w_{p,i} are the normalized weights of class i for semantic and panoptic segmentation, respectively.
Category | Quantity | Car | Bicycle | Motorcycle | Truck | Other-vehicle | Person | Bicyclist | Motorcyclist | Road | Parking | Sidewalk | Other-ground | Building | Fence | Vegetation | Trunk | Terrain | Pole | Traffic-sign

Semantic | f_i | 4.26e-2 1.66e-4 3.98e-4 2.16e-3 1.81e-3 3.38e-4 1.27e-4 3.75e-5 1.99e-1 1.47e-2 1.44e-1 3.9e-3 1.33e-1 7.24e-2 2.67e-1 6.04e-3 7.81e-2 2.86e-3 6.16e-4
| α_i | 22.93 857.56 715.11 315.96 356.25 747.62 887.22 963.89 5.01 63.62 6.9 203.88 7.48 13.63 3.73 142.15 12.64 259.37 618.97
| w_{s,i} | 0.02 0.89 0.74 0.33 0.37 0.78 0.92 1 0.01 0.07 0.01 0.21 0.01 0.01 0 0.15 0.01 0.27 0.64
| sem_i | 17784 3471 2872 2264 5063 4721 1176 552 19130 7705 18052 5257 17116 18751 19130 17124 18534 18617 13224
Panoptic | ins_i | 168431 6584 3444 2575 7499 8039 1492 559 19130 7705 18052 5257 17116 18751 19130 17124 18534 18617 13224
| β_i | 9.47 1.89 1.19 1.13 1.48 1.7 1.26 1.01 1 1 1 1 1 1 1 1 1 1 1
| α_iβ_i | 217.16 1620.79 850.98 357.04 527.24 1270.95 1117.9 973.53 5.01 63.62 6.9 203.88 7.48 13.63 3.73 142.15 12.64 259.37 618.97
| w_{p,i} | 0.13 1 0.53 0.22 0.33 0.78 0.69 0.6 0 0.04 0 0.13 0 0.01 0 0.09 0.01 0.16 0.38

For semantic segmentation, we define the classes with w_{s,i} > t as long-tail classes. For panoptic segmentation, the long-tail classes are those with w_{p,i} > t. Here, t is the threshold that separates the long-tail and non-long-tail classes and is empirically set to 0.1.

In this way, we set motorcyclist, bicyclist, bicycle, person, motorcycle, traffic-sign, other-vehicle, truck, pole, other-ground, and trunk as long-tail classes for semantic segmentation, and bicycle, person, bicyclist, motorcyclist, motorcycle, traffic-sign, other-vehicle, truck, pole, car, and other-ground as long-tail classes for panoptic segmentation (Figure 6).

Figure 6: Statistical results of class-wise normalized weights.

Appendix D Data Augmentation details

D.1 WPD Probability

The “weighted” operation assigns probabilities to the “paste” and “drop” operations.

We only “paste” long-tail classes, with probability p_i for class i:

p_i = \begin{cases} w_{s,i} - t, & \text{in semantic} \\ w_{p,i} - t, & \text{in panoptic} \end{cases}. (5)

And we “drop” the non-long-tail classes with probability d_i:

d_i = \begin{cases} t - w_{s,i}, & \text{in semantic} \\ t - w_{p,i}, & \text{in panoptic} \end{cases}. (6)

Here, t = 0.1 is the threshold that separates the long-tail and non-long-tail classes.
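Reusing the normalized_weights sketch from Appendix C, the paste and drop probabilities of Eqs. 5-6 for the semantic case can be derived as follows (f, ins, and sem are the per-class arrays from Table 7):

```python
t = 0.1                                    # long-tail threshold
w_s, w_p = normalized_weights(f, ins, sem)  # from the Appendix C sketch
# Long-tail classes (w > t) are pasted; the rest are dropped.
paste_prob = {i: w_s[i] - t for i in range(len(w_s)) if w_s[i] > t}
drop_prob = {i: t - w_s[i] for i in range(len(w_s)) if w_s[i] <= t}
```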

D.2 Ablation study

We also conduct ablation studies on the weighted paste, the weighted drop, and the combined weighted paste drop. The results are presented in Table 8.

Table 8: Ablation study on the data augmentation for LiDAR semantic segmentation on the SemanticKITTI validation set. WP: weighted paste. WD: weighted drop. WPD: weighted paste and drop.
Method | WP | WD | WPD | mIoU (%)
MaskRange | ✓ | – | – | 66.09
MaskRange | – | ✓ | – | 64.80
MaskRange | – | – | ✓ | 67.77

D.3 Visual comparisons

We sample some visual results to show the sparsity and orientation problems (Figures 7 and 8). We also compare the augmented range images with those of Mix3D [26] (Figure 9).

Figure 7: Sparsity problem visualization. The first to third rows are the original image, instance paste [10], and our paste, respectively. The regions in white boxes are the augmented areas. Instance paste adds objects above random ground points, which may place faraway objects at nearby locations and thus causes the sparsity problem. Our method pastes the actual scanning points, which inherently avoids this problem.
Figure 8: Orientation problem visualization. The first to third rows are the original image, instance paste [10], and our paste, respectively. The regions in white boxes are the augmented areas. Instance paste does not consider the orientation problem and places objects with random orientations, which may confuse the model. Our method pastes the actual scanning points, which inherently avoids this problem.
(a) raw image, Mix3D, our WPD
(b) raw image, Mix3D, our WPD
(c) raw image, Mix3D, our WPD
Figure 9: Mix3D [26] vs. our WPD.

Appendix E Class Balance vs. Unified Balance

Table 9: Ablation study of the re-balance strategies on the SemanticKITTI [1] validation set.
Task | N | C | U | mIoU (%)
Semantic | ✓ | – | – | 64.43
Semantic | – | ✓ | – | 66.74
Semantic | – | – | ✓ | 66.74
Task | N | C | U | PQ (%)
Panoptic | ✓ | – | – | 48.33
Panoptic | – | ✓ | – | 49.43
Panoptic | – | – | ✓ | 51.15

We investigate the effectiveness of the re-balance designs by comparing three different strategies:

No balance (N): Paste + Drop + focal loss

Class balance (C): Weighted + Paste + Drop + focal loss + α_i

Unified balance (U): Weighted + Paste + Drop + focal loss + α_i + β_i

We report results on both semantic and panoptic segmentation in Table 9. Compared with the no-balance strategy, the class-balance strategy shows better performance on both tasks. However, the class-balance strategy is designed only for semantic segmentation. Considering that our MaskRange is a unified architecture, this motivates us to deal with the imbalance problem of semantic and panoptic segmentation in a unified manner. Thus, we propose the unified-balance strategy, which achieves better performance on panoptic segmentation and the same performance as the class-balance strategy on semantic segmentation.

Appendix F Post-processing

Projection-based methods suffer from the “shadow” effect [6], which is caused by the blurry CNN mask and the many-to-one mapping. To alleviate this effect, a KNN (K-nearest-neighbor) post-processing module was proposed in [6], and we adopt it as well. Furthermore, a simple but effective post-processing method for our MaskRange can be inserted before the KNN post-processing.

In real-world applications, LiDAR frames are usually segmented frame by frame. As shown in Figure 10, some predictions can be mis-classified and result in a “flickering” effect over time. The “flickering” is caused by the largely different observation views and also exists in per-pixel classification methods.

Different from per-pixel classification methods, our MaskRange can accumulate prediction errors through the introduced masks: if a query is mis-classified, the whole corresponding region is wrongly predicted. At the inference stage, the object queries are fixed and the classification results of these queries tend to be consistent across continuous frames. We can utilize this property to further improve the performance.

(a) Before temporal filter.
(b) After temporal filter.
Figure 10: The illustration of the temporal-filter post-processing. The semantic predictions and mask predictions (from the same query) of three continuous frames are shown in Figures 10(a) and 10(b). As shown in Figure 10(a), a bicyclist is mis-classified as a person because of the low confidence of its mask prediction. After the temporal-filter post-processing (Figure 10(b)), the confidence is improved by exploiting temporal consistency, and the corresponding region is corrected.

We denote the classification output of query i in the current frame t as \hat{C}_{t,i}. We set a temporal window centered at frame t with size K + L + 1, where K is the number of previous frames and L is the number of future frames. To alleviate the “flickering” effect, we simply apply a temporal filter to the class predictions of these queries within the window. For example, the average filter can be computed as:

\bar{\hat{C}}_{t,i} = \frac{1}{K+L+1}\sum_{t'=t-K}^{t+L}\hat{C}_{t',i}. (7)

Note that our filtering post-processing is easy to implement and incurs almost negligible runtime and memory cost. It exploits the continuous input stream during inference when applied to real-world scene understanding and requires no sequential training data. It can be applied together with the KNN post-processing and does not have to rely on future information (simply set L to 0).
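A causal sketch of this filter (L = 0), with our own class and buffer names:

```python
from collections import deque
import torch

class TemporalQueryFilter:
    """Causal average filter of Eq. 7 with L = 0: average each query's
    class logits over the current and the previous K frames. A sketch,
    not the released implementation."""

    def __init__(self, K=2):
        self.window = deque(maxlen=K + 1)  # holds at most K+1 frames

    def __call__(self, class_logits):      # (Q, num_classes + 1)
        self.window.append(class_logits)
        return torch.stack(tuple(self.window)).mean(dim=0)

# Usage: call once per frame on the transformer decoder's class output.
filt = TemporalQueryFilter(K=2)
for _ in range(3):
    smoothed = filt(torch.randn(100, 20))
```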

Appendix G More Results

Table 10 shows the results of our temporal filter post-processing.

Table 10: Results of MaskRange with temporal filter post-processing on the SemanticKITTI [1] test set.
Methods Semantic (mIoU)
KNN 66.10
KNN+temporal 66.34

The class-wise panoptic results are presented in Table 11.

Table 11: Class-wise PQ scores of LiDAR panoptic segmentation on the SemanticKITTI [1] test set. R.Net, P.P., and KPC refer to RangeNet++, PointPillars, and KPConv, respectively.
Method | car | truck | bicycle | motorcycle | other-vehicle | person | bicyclist | motorcyclist | road | sidewalk | parking | other-ground | building | vegetation | trunk | terrain | fence | pole | traffic-sign | PQ
R.Net [6] + P.P. [48] 66.9 6.7 3.1 16.2 8.8 14.6 31.8 13.5 90.6 63.2 41.3 6.7 79.2 71.2 34.6 37.4 38.2 32.8 47.4 37.1
KPC [44] + P.P. [48] 72.5 17.2 9.2 30.8 19.6 29.9 59.4 22.8 84.6 60.1 34.1 8.8 80.7 77.6 53.9 42.2 49.0 46.2 46.8 44.5
LPSAD [49] 76.5 7.1 6.1 23.9 14.8 29.4 29.7 17.2 90.4 60.1 34.6 5.8 76.0 69.5 30.3 36.8 37.3 31.3 45.8 38.0
PanopticTrackNet [56] 70.8 14.4 17.8 20.9 27.4 34.2 35.4 7.9 91.2 66.1 50.3 10.5 81.8 75.9 42.0 44.3 42.9 33.4 51.1 43.1
Panoster [50] 84.0 18.5 36.4 44.7 30.1 61.1 69.2 51.1 90.2 62.5 34.5 6.1 82.0 77.7 55.7 41.2 48.0 48.9 59.8 52.7
EfficientLPS [53] 85.7 30.3 37.2 47.7 43.2 70.1 66.0 44.7 91.1 71.1 55.3 16.3 87.9 80.6 52.4 47.1 53.0 48.8 61.6 57.4
MaskRange (ours) 70.8 37.8 15.8 38.4 37.2 42.5 59.5 57.3 88.7 71.3 50.1 9.76 87.4 79.1 54.3 44.1 47.8 52.3 65.2 53.1