Statement of Changes
Dear Editor and Reviewers:
Thank you for the valuable feedback on our previous submission (RA-L 20-2498). This revised version includes several major changes:
1. A new outlier removal algorithm. We design a new principle-guided, parameter-free outlier removal algorithm that works with any combination of LiDAR and camera resolutions. We believe this algorithm has the potential to become a general preprocessing step for many LiDAR-camera systems, and in the experiment section we show that it can also improve the performance of self-supervised learning.
2. Evaluation on sparser LiDAR data. In the new discussion section, we add an evaluation, comparison, and visualization of our non-learning method on sparser 32-line and 16-line LiDAR data.
3. Revised introduction. We substantially revise the introduction to further clarify the motivation and the two major contributions of the paper.
4. Other writing and formatting changes, including citing previously missing related papers and shortening the description of the surface normal calculation.
5. A supplementary file with additional supporting material. In this PDF we add three extra pages after the replies to the reviewers' questions. The supplementary material consists of three parts: statistical information illustrating the intuition behind the model, a panoptic segmentation study partially supporting the sharing-surface assumption, and several more examples of the identified outlier mask.
Thank you all again for the comments, and we hope this revision meets your expectations.
Best,
Yiming Zhao
Reviewer: 1
Comments:
1. The authors’ main argument is that classical image processing can
achieve similar performance to self-supervised learning [1] on LiDAR
completion - this is partially true (when using a relatively small and
simple dataset like KITTI) and has been demonstrated by prior work [2],
so it is not exactly a novel contribution.
Response:
Thank you for pointing this out. In the revised version, we modify the motivation to make it clearer. We were originally working on self-supervised depth completion, and Ku's paper [2] intrigued us: why can such a simple method built from a few image processing operators achieve performance similar to self-supervised depth completion? Notably, the first LiDAR depth completion paper [3] investigates several traditional non-learning solutions, and their performance is uniformly limited. After thinking about this in depth, we identified two key components: outliers and local surface geometry. For a non-learning method, accurately removing outliers is challenging, while the local surface geometry indicates what information determines the value of an empty pixel. The sharing-surface assumption clarifies the working condition of the non-learning solution, namely that the point cloud should contain the geometric information of the objects in the scene.
[1] Ma, Fangchang, Guilherme Venturelli Cavalheiro, and Sertac Karaman. "Self-supervised sparse-to-dense: Self-supervised depth completion from LiDAR and monocular camera." 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.
[2] Ku, Jason, Ali Harakeh, and Steven L. Waslander. "In defense of classical image processing: Fast depth completion on the CPU." 2018 15th Conference on Computer and Robot Vision (CRV). IEEE, 2018.
[3] Uhrig, Jonas, et al. "Sparsity invariant CNNs." 2017 International Conference on 3D Vision (3DV). IEEE, 2017.
2. The proposed method is not precisely ”label-free” as advertised -
the authors still require the ground truth labels for parameter tuning.
Automatic machine learning is replaced with more tedious human learning
(i.e., manual parameter tuning).
Response:
We fully agree with this point. To address it, we design a completely new outlier removal algorithm in the revised version. The new algorithm is based on three observations; please see the paper for details (Section III.A, with Figure 2 and Figure 3). Although the new method still defines a local patch, the patch size is determined by the LiDAR and camera resolutions. We consider this acceptable, since the resolutions are easy to obtain from the sensor specifications: for example, whether the LiDAR is 32-line or 64-line, and what the image resolution is.
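As a rough illustration of how a patch size can follow from the sensor resolutions alone, the sketch below estimates how many image rows fall between two adjacent LiDAR beams; the formula and the example numbers are our own illustration here, not the exact rule used in the paper.

```python
import math

def local_patch_size(img_height, v_fov_deg, num_beams):
    """Illustrative only: approximate image rows spanned by one LiDAR beam gap.

    img_height : image rows covered by the LiDAR's vertical field of view
    v_fov_deg  : LiDAR vertical field of view in degrees (assumed value)
    num_beams  : number of LiDAR scan lines (e.g. 16, 32, 64)
    """
    px_per_deg = img_height / v_fov_deg            # assume roughly linear mapping
    beam_gap_deg = v_fov_deg / (num_beams - 1)     # angular spacing between beams
    patch = math.ceil(px_per_deg * beam_gap_deg)   # rows spanned by one beam gap
    return patch if patch % 2 == 1 else patch + 1  # odd size for a centered window

# Example with assumed values: a 64-beam scanner covering ~26.8 degrees of a 375-row image.
print(local_patch_size(img_height=375, v_fov_deg=26.8, num_beams=64))
```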
3. Manual parameter tuning is non-scalable - the authors only
demonstrate quantitative results on only one dataset (specifically, one
single sensor placement and density). I don’t believe such approach can
generalize to more complex scenes or lidars with lower density - I’m
happy to be convinced otherwise, if the authors can replicate the
”accuracy vs. input density” experiment in [7] and show that the
proposed method is robust even with much sparser inputs.
Response:
Thank you for this comment. The new outlier removal algorithm adapts automatically to different LiDAR and camera resolutions, so there is no parameter tuning in the current model. As a geometry model, the sharing-surface assumption implies that the point cloud should contain the geometric information of the scene; intuitively, if the point cloud is too sparse, the assumption will break down. To verify this, we follow your suggestion and compare against a self-supervised baseline while simulating sparser 32-line and 16-line LiDAR. The comparison results and several qualitative examples are shown in Figure 6; please see Section V.A for details. Our solution performs better for both 32-line and 64-line input. Considering that 32-line and 64-line LiDARs are the more popular choices for autonomous cars (e.g., nuScenes and Waymo) and that the trend is toward even denser 128-line or solid-state LiDAR, we believe our method will be valuable in a wide range of cases.
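One common way to simulate sparser LiDAR is to subsample scan lines by elevation; the sketch below is such an illustration (with assumed field-of-view values), not necessarily the exact simulation protocol we followed.

```python
import numpy as np

def subsample_beams(points, num_beams_in=64, keep_every=2,
                    fov_up_deg=2.0, fov_down_deg=-24.8):
    """Illustrative sketch: bin points into elevation 'rings' and keep every
    keep_every-th ring, e.g. 64 -> 32 lines (keep_every=2) or 64 -> 16 (keep_every=4).

    points : (N, 3) array of x, y, z coordinates in the LiDAR frame
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    elev = np.degrees(np.arctan2(z, np.sqrt(x**2 + y**2)))    # elevation angle per point
    frac = (fov_up_deg - elev) / (fov_up_deg - fov_down_deg)  # 0 at top beam, 1 at bottom
    ring = np.clip(np.round(frac * (num_beams_in - 1)).astype(int), 0, num_beams_in - 1)
    return points[(ring % keep_every) == 0]
```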
4. Deep-learning-based methods have the ability to utilize the RGB
information, scale up with more data, and additionally to extrapolate
beyond regions with lidar scans. In comparison, the proposed
model-based method is not able to make use of the color information,
does not scale, and can only interpolate between neighboring pixels.
Response:
We fully agree with these advantages of deep learning. This paper does not argue against deep learning, which has already shown its power on many robotic perception tasks; instead, it presents a different view of depth completion from the geometric side. In our first version, we actually showed how to extend the geometry model with deep learning, but considering that this is a short letter, we prefer to focus on one point and explain it clearly, so we keep only the non-learning geometric part in the current version.
5. The assumption that neighboring pixels share the same surface is
strong and not always correct. I expect more in-depth discussion
regarding its limitations
Response:
We would like to clarify that the assumption is that most empty pixels share the same surface with their nearest valid point. This implies that the point cloud should contain the geometric information of the objects. Viewed from the opposite side, if even the closest point comes from a different surface or object, then the point cloud is simply too sparse and can only provide depth measurements for isolated pixels. We add more discussion in Section V.A, together with the performance on sparser LiDAR.
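For concreteness, the assumption can be written as a local plane constraint. The notation below is our shorthand for this response, not necessarily the formulation used in the paper: the empty pixel q has camera ray direction r_q, and its nearest valid point P_0 carries a unit surface normal n.

```latex
\begin{align*}
  \mathbf{n}^{\top}\!\left(P_q - P_0\right) = 0, \qquad P_q = d_q\,\mathbf{r}_q
  \quad\Longrightarrow\quad
  d_q = \frac{\mathbf{n}^{\top} P_0}{\mathbf{n}^{\top}\mathbf{r}_q},
  \qquad \mathbf{n}^{\top}\mathbf{r}_q \neq 0 .
\end{align*}
```

When the ray is nearly parallel to the local surface, the denominator approaches zero, which corresponds to the depth-discontinuity case where the assumption is weakest.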
6. This half-page discussion can be condensed to a few lines.
Response:
Thank you for the suggestion. We remove most of the description of how we calculate the surface normal and keep only the final equation, so this part now takes about two paragraphs in the revised version. Readers can understand the computation from our paper alone, and can consult the original reference if they want more details.
7. The outlier removal algorithm relies on the assumption that a local
patch comes from the same plane. A side effect of such strong
assumption is that thin-line structures are doomed to be treated as
outliers. I expect more discussion.
Response:
Thank you for the suggestion. The new outlier removal algorithm does not rely on this assumption; we discuss the exceptions of the new algorithm in Section III.A.
8. Results: despite the fact that the proposed method seems to marginally
outperform prior work (e.g., [7][9]) on quantitative metrics, there are
severe blurriness, non-continuous object boundaries, as well as
floating points in the depth predictions (observed from the video). To
a large extent these undesirable visual artifacts exemplify the imposed
over-simplifying assumption of local planarity, and are more convincing
than the quantitative metrics (especially given the fact that KITTI
ground truth isn’t dense and can be error-prone). I strongly suggest
evaluating the proposed method (and some of the main comparison
methods) on a simulated dataset with perfect and dense ground truth for
further verification.
Response:
We understand your concern, but we cannot provide this experiment in the revised version. We first explain why, and then try to address the concern with other materials.
First, the reason we rely on a benchmark is that a well-recognized benchmark provides a fair comparison and lets researchers focus on the method design. The KITTI dataset is collected in the real world and contains many challenges that do not exist in synthetic data, such as occluded outliers, missing points on transparent or reflective surfaces, and the uneven point distribution produced by the LiDAR. The observed non-continuous object boundaries and floating points can be further processed with blurring operators and noise removal algorithms if better visualization is the only goal. You can check the video presented here: ().
Second, this suggestion is not feasible for us within the revision cycle. RA-L is a letter and we have only 30 days to prepare the revision, so we would rather use this period to resolve the more common and important issues, such as outlier removal and performance on sparser LiDAR.
We understand that the major concern is the sharing-surface assumption. Besides additional illustration in the revised version, we provide supplementary materials attached at the end of this file, and we hope they help resolve your concern.
9. Minor comments Equation (1) and (2): not that it matters, but for consistency and
correctness the left-hand-sides of the equations should have been
column vectors, e.g.,
Response:
Changed, thank you.
Reviewer: 2
Comments:
1. It must be mentioned that the submission has a significant number of
typos, technical imprecisions and strange choices of word.
These issues don’t invalidate the rest of the paper and its messages,
but should definitely be addressed. I’ve added a more detailed
description of them in a separate file. There are however a few minor
points that are lacking in their discussion and at least a major one,
in my perception.
Response:
Thank you very much for your valuable time and feedback, especially the extra PDF file, which is very helpful for improving the writing quality. Since we make substantial changes in this revised version to incorporate all the feedback, some of the writing issues no longer exist; for the remaining ones, we follow your suggestions.
2. The authors suggest the use of a different method for computing normals
based on previous work by Baldino and that choice seems to be uncommon
in the literature for the depth completion task. This is an interesting
point raised by them and to further solidify it, it would be valuable
to expand more on the consequences of this choice. They mention it’s
not as intensive computationally compared to neural networks, but is
there a price to pay in terms of accuracy? Another points that is
somewhat lacking is the discussion of the limitations in the
assumptions used to develop the algorithm: for instance, the fact that
near edges the assumption of similar depth fails due to
discontinuities.
Response:
When a neural network is used to predict surface normals, the RGB image is usually part of the input. In contrast, we calculate the surface normal directly from the LiDAR point cloud using Baldino's method, which again implies that the point cloud should contain the geometric information of the surroundings. For a non-learning geometry model, this condition is the foundation that makes our method work, so in the revised version we mention it explicitly in the introduction; it is also why the sharing-surface assumption makes sense. Moreover, we add a new discussion section before the conclusion, where we discuss the sharing-surface assumption and the geometric information contained in the point cloud by investigating the model performance under various levels of LiDAR sparsity.
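To illustrate why no RGB input is needed, a generic range-image normal computation is sketched below, using cross products of the horizontal and vertical tangent vectors on an organized point image. This is only a generic stand-in; Baldino's formulation differs in its details.

```python
import numpy as np

def range_image_normals(points):
    """points: (H, W, 3) organized LiDAR point image. Returns (H, W, 3) unit normals.
    Border pixels wrap around here; a real implementation would treat them explicitly."""
    dx = np.roll(points, -1, axis=1) - points   # vector to the right neighbour
    dy = np.roll(points, -1, axis=0) - points   # vector to the lower neighbour
    n = np.cross(dx, dy)                        # normal of the local tangent plane
    norm = np.linalg.norm(n, axis=2, keepdims=True)
    return n / np.clip(norm, 1e-8, None)
```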
3. In a similar fashion, one would ask if their outlier
removal strategy isn’t too simplistic and if a slightly more elaborate,
but still simple scheme (e.g. RANSAC) could provide better results.
Based on responses to previous reviews, the author suggests they tried
different methods, so perhaps expanding on that could be informative.
Response:
Thank you for the suggestion. In the revised version, we develop a completely new outlier removal algorithm whose major improvement is being parameter-free. The previous algorithm still required manually choosing the patch size and threshold, even though the performance was not sensitive to those two values. The new algorithm only needs the LiDAR and camera resolutions, which can be read from the sensor datasheets, so it can be used seamlessly with any camera-LiDAR combination, such as a 32-line or 64-line LiDAR with different image resolutions. We put the intuition, observations, discussion, algorithm, and two visualization samples in Section III.A.
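To convey the intuition only, a naive patch-based occlusion test is sketched below. This simplified stand-in still carries hand-set parameters, whereas the algorithm in Section III.A derives its decision rule from the sensor resolutions; the sketch is not the algorithm itself.

```python
import numpy as np

def occlusion_outlier_mask(sparse_depth, patch=7, rel_gap=0.1):
    """sparse_depth: (H, W) projected LiDAR depth, 0 at empty pixels.
    Flags points that lie far behind the closest return in their neighbourhood,
    i.e. background points visible to the LiDAR but occluded from the camera."""
    r = patch // 2
    mask = np.zeros(sparse_depth.shape, dtype=bool)
    for y, x in zip(*np.nonzero(sparse_depth)):
        win = sparse_depth[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
        closest = win[win > 0].min()
        if sparse_depth[y, x] > (1.0 + rel_gap) * closest:
            mask[y, x] = True
    return mask
```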
4. Apply proposed outlier removal algorithm on existing self-supervised learning methods.
Response:
Thank you for this great suggestion; we also think adding this experiment further increases the contribution of the paper. We use the open-source code of SparseToDense [1] for Table I. The first row is the performance obtained with the provided self-supervised pre-trained weights. In the second row, we still use the pre-trained weights but simply remove the outliers from the input before feeding it into the network, and we observe an improvement in three of the indicators. The last row is the performance after training from scratch with the outlier removal algorithm applied, where all four indicators improve. The improvement is not dramatic; the network itself may already be able to partially smooth the noisy supervision, or the hyper-parameters may need further tuning. Nevertheless, we believe Table I shows the potential of our method to further empower other models.
| Method | RMSE (mm) | MAE (mm) | iRMSE (1/km) | iMAE (1/km) |
| SparseToDense | 1343.33 | 358.57 | 4.27 | 1.64 |
| outlier removal + pre-trained weights | 1318.92 | 360.65 | 4.14 | 1.61 |
| outlier removal + train from scratch | 1321.42 | 331.19 | 4.07 | 1.60 |
https://github.com/fangchangma/self-supervised-depth-completion
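For clarity, the preprocessing behind the second and third rows of Table I amounts to the minimal sketch below: the network is left unchanged, and only the flagged LiDAR returns are zeroed out before entering it. The function name is ours for illustration; the actual evaluation uses the code base linked above.

```python
import numpy as np

def clean_sparse_input(sparse_depth, outlier_mask):
    """sparse_depth: (H, W) depth map with zeros at empty pixels.
    outlier_mask: boolean (H, W), True where a point is flagged as occluded."""
    cleaned = sparse_depth.copy()
    cleaned[outlier_mask] = 0.0   # removed points become ordinary missing measurements
    return cleaned
```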
Reviewer: 3
Comments:
1. I have two concerns. First, the novelty is limited. The proposed
methods is a combination of two existing methods, surface normal
estimation [11] and distance transform [33]. Secondly, the performance
gap is still large with existing supervised depth completion methods
( 500 mm RMSE w/ RGB and 300 mm RMSE w/o RGB compared to [8, 22]),
while being 3 times slower (0.06s/frame vs 0.02s/frame).
Response:
We substantially revise the introduction section, and we hope the revised version better conveys the novelty and contribution. There are two major contributions in this paper. The first is a general, parameter-free outlier removal algorithm that removes incorrect points projected from occluded regions due to the slightly different viewpoints of the LiDAR and camera. The second is the surface geometry model, which explains the relationship between an empty pixel and its nearby points and gives an explicit mathematical equation for the depth value of the empty pixel. Surface normal estimation and the distance transform are simply the two techniques we consider most suitable for providing the information the model needs in practice, so they are implementation choices rather than the claimed novelty. Regarding the second point, we agree that the performance is not yet comparable with state-of-the-art supervised solutions. However, in this deep learning era, we propose a solution purely from the geometric view with several unique merits, and we believe the ideas proposed here will contribute to related methods and tasks.
2. It would be useful to know the influence on the amount of LiDAR
points while having a comparison with learning based methods. I expect
the performance to drop drastically of the proposed approach when the
amount of input points is reduced, more so than with learning based
methods. The approximations made by the authors will break down at some
point after all. Most learning based methods additionally use the RGB
image to extract object cues. RGB information becomes more and more
useful when the sparsity increases. So, is there a way to leverage the
RGB image with the proposed method? This would be beneficial for low
cost LiDARs (e.g. with only 8 lasers).
Response:
Thank you for your suggestion. To address this common concern about sparser input LiDAR, we add a new discussion section (Section V) before the conclusion. Following the settings of previous work, we simulate sparser 32-line and 16-line LiDAR and compare our model's performance with a self-supervised method. We also visualize several result samples with inputs of various sparsity.
Generally speaking, as a geometry model, our method implies a working condition: the point cloud should contain the geometric information of the surroundings. If the point cloud is so sparse that it only provides depth measurements for a handful of pixels, the LiDAR-only geometry model will not give reasonable performance. This explains why our method outperforms the baseline for 32-line and 64-line LiDAR, while the self-supervised learning method starts to take the lead for 16-line LiDAR. One can even consider the extreme case of no LiDAR measurements at all: self-supervised depth completion degenerates into self-supervised depth prediction, whereas the geometry method fails completely.
We do not report performance on 8-line LiDAR for two reasons. First, the trend is already clear in Figure 6 of the paper. Second, 8-line LiDAR is relatively less common than the other resolutions: most autonomous driving systems use 32-line (nuScenes), 64-line (Waymo), or even denser solid-state LiDAR, which is the primary target of this paper. Although some indoor robots use 16-line LiDAR, we prefer to concentrate on the more popular LiDAR settings for autonomous cars.
How to integrate the RGB image into this non-learning pipeline is an interesting point, and we agree that RGB information becomes especially valuable for sparser point clouds. We tried some ideas, such as the SLIC superpixel method, and obtained slightly better performance, but we feel that adding this part would be incremental and might distract readers in a short letter. We therefore leave this direction as future work.
3. Minor: I think it would be better if the authors add a caption to
each sub-image (e.g. predicted depth, error, surface normals, etc.) in
the provided video.
Response:
Sure, we will do this. We are preparing more demo videos, including night driving on other datasets, and we will replace the demo video if the paper is accepted.
We provide supporting materials in the next three pages.
Supplementary materials
Part 1. Statistical Information
Here we investigate the statistics of the KITTI validation set in Fig. 1. For each pixel on the sparse input image plane, we calculate the distance to its nearest valid point; if the pixel itself has a value, the distance is 0. The y-axis is the percentage of pixels at each distance, and we can see that most pixels are close to their nearest neighbor. We also explore using the nearest value as the initial guess for empty pixels, shown as the blue bars. To further explore the effect of outliers, we replace the values of the sparse input with the corresponding ground truth and compute the nearest-initial error again, shown as the orange bars.
We can draw several points from these statistics:
a. Most empty pixels have a nearby valid point.
b. Using the value of the nearest point as the initial guess incurs only a small error.
c. Removing outliers significantly reduces the nearest-initial error.
All of these points support the sharing-surface assumption, the local surface geometry model, and the importance of the outlier removal algorithm.
[Fig. 1: distribution of pixel distances to the nearest LiDAR point, with nearest-initial errors before and after outlier removal]
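For reference, the statistics in Fig. 1 can be reproduced with a Euclidean distance transform along the lines of the sketch below (our own outline, not the exact evaluation script).

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def nearest_point_stats(sparse_depth, gt_depth=None):
    """sparse_depth: (H, W) LiDAR depth map, 0 at empty pixels.
    gt_depth: optional (H, W) ground truth, 0 where undefined.
    Returns per-pixel distance to the nearest valid pixel, the 'nearest initial'
    depth copied from that pixel, and (optionally) its error against ground truth."""
    empty = sparse_depth == 0
    dist, (iy, ix) = distance_transform_edt(empty, return_indices=True)
    nearest_init = sparse_depth[iy, ix]   # valid pixels point to themselves (distance 0)
    if gt_depth is None:
        return dist, nearest_init
    valid = gt_depth > 0
    err = np.abs(nearest_init - gt_depth)[valid]
    return dist, nearest_init, err
```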
Part 2. Sharing Surface Assumption and Panoptic Segmentation
Due to the lack of surface normal ground truth, it is hard to verify the sharing-surface assumption directly. Therefore, we use panoptic segmentation to check whether the nearest point comes from the same object, which we believe partially supports the assumption. In Fig. 2, all samples are taken from the KITTI depth completion validation set. We use the state-of-the-art Panoptic-DeepLab as the panoptic segmentation model, train it on Cityscapes, and then fine-tune it on the recently released KITTI-360. All black pixels on the depth image are empty pixels whose nearest point comes from a different object, and we can see that these pixels lie almost entirely on object edges. This evidence at least shows that most empty pixels share the same object as their nearest point, which partially supports the sharing-surface assumption.
[Fig. 2: empty pixels whose nearest point falls on a different panoptic segment, shown in black]
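The check behind Fig. 2 can be sketched as follows; this is our own illustration of the procedure, with the panoptic ids coming from the Panoptic-DeepLab predictions described above.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def cross_object_pixels(sparse_depth, panoptic_ids):
    """sparse_depth: (H, W) LiDAR depth, 0 at empty pixels.
    panoptic_ids: (H, W) integer segment/instance id per pixel.
    Returns True at empty pixels whose nearest LiDAR return lies on a different
    panoptic segment, i.e. the pixels rendered in black in Fig. 2."""
    empty = sparse_depth == 0
    _, (iy, ix) = distance_transform_edt(empty, return_indices=True)
    return empty & (panoptic_ids[iy, ix] != panoptic_ids)
```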
Part 3. More Outlier Mask Visualization
Here we visualize more samples of the outlier mask generated by our outlier removal algorithm in Fig. 3. Most outlier points lie near the boundary between foreground objects and the background.
[Fig. 3: additional examples of the identified outlier mask]