

Semantic Dense Reconstruction with Consistent Scene Segments

Yingcai Wan1,4∗, Yanyan Li2,4∗, Yingxuan You3, Cheng Guo1,4, Lijin Fang1 and Federico Tombari2,5. 1Faculty of Robot Science and Engineering, Northeastern University, Shenyang, China. (wan.yc.meta,guochengrobot)@gmail.com, [email protected]. 2Dept. of Computer Science, Technical University of Munich, Munich, Germany. (yanyan.li,federico.tombari)@tum.de. 3Key Laboratory of Machine Perception, Peking University, Beijing, China. [email protected]. 4MetaSpatie Tech. 5Google Inc. ∗Authors contributed equally.
Abstract

In this paper, a method for dense semantic 3D scene reconstruction from an RGB-D sequence is proposed to solve high-level scene understanding tasks. First, each RGB-D pair is consistently segmented into 2D semantic maps based on a camera-tracking backbone that propagates objects’ labels with high probabilities from full scans to the corresponding segments of partial views. Then a dense 3D mesh model of the unknown environment is incrementally generated from the input RGB-D sequence. Benefiting from the consistent 2D semantic segments and the 3D model, a novel semantic projection block (SP-Block) is proposed to extract deep feature volumes from 2D segments of different views. Moreover, these semantic volumes are fused with deep volumes from a point-cloud encoder to produce the final semantic segmentation. Extensive experimental evaluations on public datasets show that our system achieves accurate 3D dense reconstruction and state-of-the-art semantic prediction performance simultaneously.

I Introduction

Scene understanding systems enable robots and smart devices to interact intelligently with unknown environments by providing spatial and semantic information about the 3D instances in a scene. The majority of 3D semantic segmentation methods [1, 2, 3], however, rely on the assumption that a complete and accurate 3D model is available as input, which is hard to obtain under realistic settings. How to build a complete system for 3D scene understanding directly from RGB-D images therefore remains an open problem in the computer vision community.

Figure 1: Input (a) RGB-D input and outputs (b) consistent masks, (c) dense reconstruction, and (d) semantic instance map of the system. Consistent semantic masks (b) and a dense mesh model (c) support our network to predict semantic instances (d).

Given a single RGB image, 2D semantic segmentation algorithms [4, 5, 6, 7] achieve impressive performance. These methods, however, are sensitive to viewpoint changes and suffer from inconsistent predictions across different views. To address these issues, deep neural networks have been applied to 3D semantic segmentation on point clouds, such as PointNet++ [1], MCCNN [8] and MinkowskiNet [9, 10]. The primary focus of these works lies in analyzing point clouds obtained from complete dense reconstructions, while the geometric and semantic details captured during the reconstruction process are ignored. Recently, several 2D-3D joint end-to-end methods [11, 12] combining 3D knowledge (geometry and shape) with detailed 2D information (texture and color) were proposed to improve the accuracy of semantic segmentation. Following this 2D-3D fusion strategy, BPNet [3] proposes a bidirectional projection module to improve both 2D and 3D semantic segmentation performance from RGB images and point clouds. However, these attempts do not fully leverage the complementary 2D information. Furthermore, they assume that the input 3D scenes are already completely reconstructed at high quality, rather than building dense models from real RGB-D images.

To build a complete scene semantic estimation pipeline from images, the pioneering benchmarks ScanNet [13] and Matterport3D [14] perform dense reconstruction and semantic segmentation at the same time; different from them, our tracking and dense mapping modules run on a CPU. SceneGraphFusion [15] also starts from images and obtains 3D semantic models. Nevertheless, that system relies on ground-truth camera poses and builds an instance point cloud with a geometric instance segmentation method [16], which limits prediction performance when instances are not fused accurately.

Compared to existing 3D segmentation networks [9, 3] and the scene-graph generation approach [15], our dense semantic reconstruction method provides a more complete pipeline that takes continuous RGB-D frames as input and outputs dense semantic 3D models. Based on 2D segmentation algorithms [6, 7] and our camera trackers [17, 18], we maintain a sparse semantic map that stores the probability of each object. Since semantic predictions from partial views are not as reliable as those from full views, the sparse semantic map is used to correct erroneous 2D segments in such cases. After obtaining camera poses, consistent 2D semantic masks, and a dense 3D reconstruction model, the proposed SP-Block extracts multi-scale features from the consistent 2D object segments and projects the deep feature channels into volumes, which are fused with those extracted by the encoder of MinkowskiNet [9] after the domain transformation (DoT) operation, as shown in Figure 5. Compared with MinkowskiNet [9] and BPNet [3], our 2D semantic masks from different views provide accurate 2D object regions that make the 3D semantic predictions more accurate.

Figure 2: Pipeline of the proposed system, which is fed by sequential RGB-D pairs and generates 2D consistent semantic masks, a dense mesh model, and a 3D semantic segmentation result.

The contributions of this paper are summarized as follows:

  • A 2D consistent object prediction strategy based on 2D segments and a sparse semantic map is proposed to achieve accurate and consistent 2D semantic masks.

  • The SP-Block is built to extract and project multi-scale deep features from 2D semantic maps, which are transformed into the feature domain of point clouds by the DoT operation.

  • The proposed architecture containing tracking, meshing, and segmentation modules extends traditional SLAM methods to a multi-task scene understanding system.

II Related work

II-A Object instance segmentation

Early 2D semantic segmentation methods, including Faster R-CNN [5] and Mask R-CNN [4], make instance mask predictions before object semantic recognition. More recent networks [19, 20, 21] are anchor-based approaches that predict box offsets relative to a collection of fixed boxes. Although semantic instance segmentation has achieved reliable results, more and more segmentation tasks also demand efficiency. YOLACT [6] is a real-time (more than 30 fps) instance segmentation algorithm, later extended to YOLACT++ [7] by incorporating deformable convolutions into the backbone network. However, these approaches process single images, leading to inconsistent scene interpretation due to illumination changes, occlusions and other variations over time. To address this problem, video instance segmentation (VIS) [22] tracks object instances of interest in a video sequence, but such methods require the target information to be specified in the first frame. Differently, in this paper each RGB-D pair is segmented in both geometric and semantic manners to obtain correct boundaries. Moreover, we build a global sparse semantic map in real time to maintain consistent 2D semantic segments.

II-B 2D-3D segmentation

3D scenes are usually represented by point clouds, since this unstructured type of data is efficient and contains rich geometric information compared with 2D images. 3D ShapeNet [23] is one of the first works in this area, training a 3D convolutional deep belief network on a ground-truth shape database. Inspired by ShapeNet, PointNet [24] and PointNet++ [1] exploit a more efficient representation of 3D surfaces. To address spatio-temporal perception tasks, MinkowskiNet [9] employs 4-dimensional sparse convolutional layers and achieves efficient and accurate semantic segmentation. In addition to extracting deep features from point clouds, researchers have also shown that joint 2D and 3D features complement each other in local areas and yield better performance. 3DMV [11] is a joint 3D-multi-view approach built on the core idea of combining spatial and RGB features in an end-to-end architecture. Based on 3DMV, 3D-SIS [12] fuses 2D color images with 3D geometry features by projecting deep features from 2D RGB views into the voxel grid for instance segmentation. BPNet [3] enables bidirectional feature interaction between 2D and 3D CNNs at multiple pyramid levels via its bidirectional projection module. In this paper, after obtaining 2D consistent semantic maps from the system, the SP-Block is proposed to capture features from those 2D semantic segments and transform the generated feature volumes into the feature domain encoded by MinkowskiNet [9].

II-C Scene understanding from RGB-D sequences

Based on RGB-D images, tracking and mapping methods [25] are used to build global 3D maps that serve as the backbone of complete scene understanding systems. Multi-feature-based trackers [18, 17] achieve robust estimations in indoor scenes, but their mapping parts aim to maintain sparse features for reducing camera pose drift rather than reconstructing dense maps. In contrast, KinectFusion [26] and BundleFusion [27] focus on dense reconstruction and rely on GPUs for on-the-fly 3D scene reconstruction. In our system, both tracking and dense mapping are implemented with multiple threads on a CPU, which keeps the hardware requirements low.

Texturing 2D semantic maps onto dense maps is a traditional way to build online semantic reconstruction systems. SemanticFusion [28] is a pioneering 3D semantic segmentation system, which incrementally fuses semantic surfels labeled by convolutional neural networks. PanopticFusion [29] extends the 2D-to-3D mapping framework to TSDF voxels, where semantic labels come from pixel-wise panoptic prediction networks. These 2D-to-3D architectures are straightforward to follow, but they depend on the performance of 2D semantic segmentation methods, which limits further improvement. Instead of using 2D semantic results, SceneGraphFusion [15] estimates a semantic scene graph based on a 3D instance map, but its performance likewise relies on the instance fusion method. Rather than pasting 2D labels onto the dense model or capturing deep features from point clouds only, we make the 2D object labels consistent and let our 3D segmentation network segment the mesh model in one shot, which tends to be more accurate since global structures and local details are considered together.

III System Overview

In this section, we introduce the main modules of the pipeline. As shown in Figure 2, the system is divided into three parts: 1) camera tracking and dense mapping; 2) semantic label propagation; and 3) two-branch semantic segmentation.

III-A Tracking and mapping

To achieve robust pose estimation in general indoor environments, the camera tracking module used in this paper is our Manhattan-SLAM [17], which exploits point, line, and plane features as well as the spatial constraints between them, including parallel and perpendicular planes. Note that this tracker can be replaced by other tracking methods.

For the mapping part, we propose two main changes. First, compared with the surfel map generated by Manhattan-SLAM, 3D objects’ semantic labels are obtained from the geometric and semantic segments of every keyframe and fused into a global sparse semantic map, as shown in Figure 4(e), to support consistent 2D semantic mask generation. Second, a dense mapping module is added to produce a smooth dense mesh map. Since sparse maps cannot provide enough information for robots, our system incrementally generates a dense mesh map on a CPU. When a new keyframe is generated by the tracking thread, we make use of the estimated camera pose and the RGB-D pair to build a dense TSDF [30, 31] model. After that, the marching cubes method [32] is used to extract a smooth surface from the voxels.
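To make this mapping step concrete, the following is a minimal sketch of incremental TSDF fusion followed by marching cubes, written with Open3D as a stand-in for our CPU mapping module; the voxel size, truncation distance, and camera intrinsics are illustrative assumptions, not the system's actual parameters.

# Hedged sketch: incremental TSDF fusion + marching cubes on keyframes,
# using Open3D in place of the paper's CPU mapping implementation.
import open3d as o3d

volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.01,                      # assumed voxel size (m)
    sdf_trunc=0.04,                         # assumed truncation distance (m)
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

# assumed intrinsics for a 640x480 sensor
intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, 525.0, 525.0, 319.5, 239.5)

def integrate_keyframe(color_img, depth_img, T_cw):
    """Fuse one keyframe; T_cw is the world-to-camera pose from the tracker."""
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color_img, depth_img, depth_trunc=4.0, convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, T_cw)

# After processing the sequence, extract the smooth surface (marching cubes).
mesh = volume.extract_triangle_mesh()
mesh.compute_vertex_normals()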

III-B 2D Consistent semantic segmentation

The consistent semantic segmentation strategy in this architecture is another important module, responsible for providing stable 2D semantic instance predictions across different views. In this module, two segmentation branches, a learned one [6] (Figure 4(b)) and a geometric one [16] (Figure 4(c&d)), handle the RGB and depth maps, respectively. The geometric branch segments object regions based on 3D shape by computing normals from depth maps, while the learned one predicts semantic instance masks directly after training convolutional neural networks on large datasets. To remove boundary noise from the semantic labels, only regions present in both maps are accepted by the system.

Images that capture only part of an object are still useful for 3D semantic segmentation, since they provide additional detail and boundary information. Such partial scans, however, pose a significant challenge to 2D semantic segmentation networks. To keep the segments consistent, we take advantage of the camera poses and the global sparse semantic map to correct these ill-posed results.

III-C 2D-3D semantic segmentation network

In this module, an encoder-decoder network is implemented for the final 3D dense semantic segmentation task. In the encoder, the proposed SP-Block is connected with the original encoder of MinkowskiNet to build deep embedding features, which are then decoded into semantic predictions.

Benefiting from the SP-Block, deep features from 2D and 3D domains can be fused to predict 3D semantic segments from the 3D dense reconstruction.

Figure 3: Consistent semantic segmentation performance in the ICL sequence. (a) Couch and Chair predicted by YOLACT [6]; (b) Couch and Couch generated by our method; (c) TV and Oven predicted by YOLACT [6]; (d) TV and TV generated by our method.

IV Odometry based consistent 2D semantic segmentation

Inconsistent semantic segmentation prediction between different RGB images of the same scene is a common issue in semantic segmentation methods [6, 7]. To solve this issue, an incremental joint 2D segmentation strategy is proposed to achieve sharp and consistent segments from each keyframe.

IV-A Segments from a single RGB-D pair

In this paper, each RGB image is fed to YOLACT [6] to segment instances and predict objects’ labels. The network produces two types of outputs, a label map $R_{rgb}$ and a probability map $R_{p}$: the first one encodes the index of each detected object, as shown in Figure 4(a), while the three channels of the second map (see Figure 4(b)) store the corresponding probabilities.

Since boundaries of semantic masks generated from an RGB image are commonly noisy, we extract areas with discontinuous depth information from the corresponding depth map. Given depth maps, geometric shape segmentation methods [2, 16] are used to segment the scene into different instances according to normal-edge analysis. As shown in Figure 4, the TV is segmented from the wall since the normal map detects the discontinuity region between them. Therefore, a filtered segmentation map $R^{*}$ is obtained by

$R^{*}=R_{rgb}\cdot R_{d}$ (1)

where $R_{d}$ is a binary map in which instance-covered pixels are set to 1. Note that segments extracted from the RGB image are removed if they do not exist in the geometric map; a segment is accepted only when it appears in both the semantic and geometric images.
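As an illustration of this filtering, the sketch below masks the learned instance map with the geometric map; the overlap-ratio threshold and the array layout are assumptions made for this example, not part of the original system.

# Hedged sketch of the segment filtering around Eq. (1).
import numpy as np

def filter_semantic_segments(R_rgb, R_d, min_overlap=0.5):
    """R_rgb: HxW int map, instance index per pixel (0 = background).
    R_d: HxW binary map from geometric segmentation (1 = instance-covered).
    min_overlap is an assumed ratio deciding when a segment 'exists' geometrically."""
    R_star = np.zeros_like(R_rgb)
    for idx in np.unique(R_rgb):
        if idx == 0:
            continue
        seg = (R_rgb == idx)
        # drop learned segments that do not exist in the geometric map
        if (seg & (R_d > 0)).sum() < min_overlap * seg.sum():
            continue
        # keep only the region covered by both the semantic and geometric maps
        R_star[seg & (R_d > 0)] = idx
    return R_star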

Figure 4: Sparse semantic instance map building and consistent 2D semantic image generation. (a) Scenes; (b) geometric segmentation map; (c) semantic segmentation map; (d) probability map; (e) sparse semantic instance map.

IV-B Semantic propagation

To keep objects captured in different views consistent, we maintain a sparse semantic point-cloud map containing different geometric landmarks [17, 18] and semantic objects.

Each 3D object in the map needs to be initialized and updated during the whole tracking process. First, the 3D objects $O_{i}, i\in(1,n)$, where $n$ is the number of 3D objects saved in the map, are re-projected into a new keyframe to obtain 2D re-projection regions $o_{i}^{rp}$. We compute the IoU (Intersection over Union) between $o_{i}^{rp}$ and the semantic labels $o_{j}, j\in(1,m)$, where $m$ is the number of detected objects in $R^{*}$. If the IoU is larger than a threshold $t_{iou}=0.4$, we further check the probability of $o_{j}$: $o_{i}^{rp}$ and $o_{j}$ are matched when both the index and the probability of $o_{j}$ agree with $o_{i}^{rp}$. Otherwise, we check the probabilities of the remaining semantic labels in $R^{*}$; if they exceed the threshold $t_{p1}=0.9$, those objects are treated as new objects and fused into the map.

In the object fusion process, when the re-projected information matches the current keyframe’s semantic labels, we also update the probability of $O_{i}$. If the probabilities of those 2D semantic segments exceed $t_{p1}$, the probability of $O_{i}$ is increased, since the object is reconfirmed in a different view. If new segments satisfy the IoU requirement with high probability but carry a different semantic label, the weight of the related 3D object $O_{i}$ is decreased, and the object is removed from the map when its weight falls below $t_{p2}=0.7$.
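A hedged sketch of this propagation and fusion test is given below; the thresholds follow the text, while the dictionary data structures, the weight increments, and the label-correction line are illustrative assumptions rather than the system's actual implementation.

# Hedged sketch of semantic label propagation between the sparse map and a keyframe.
import numpy as np

T_IOU, T_P1, T_P2 = 0.4, 0.9, 0.7   # thresholds from the text

def iou(mask_a, mask_b):
    """IoU between two boolean HxW numpy masks."""
    union = (mask_a | mask_b).sum()
    return (mask_a & mask_b).sum() / union if union > 0 else 0.0

def propagate(map_objects, detections):
    """map_objects: dicts with 'label', 'weight', and a re-projected 2D 'mask';
    detections: dicts with 'label', 'prob', 'mask' from the filtered map R*."""
    for det in detections:
        handled = False
        for obj in map_objects:
            if iou(obj['mask'], det['mask']) < T_IOU:
                continue
            handled = True
            if det['prob'] > T_P1:
                if det['label'] == obj['label']:
                    obj['weight'] = min(1.0, obj['weight'] + 0.1)  # reconfirmed (assumed step)
                else:
                    obj['weight'] -= 0.1                           # confident but conflicting label
                    det['label'] = obj['label']                    # illustrative 2D label correction
            break
        if not handled and det['prob'] > T_P1:
            map_objects.append({'label': det['label'],
                                'weight': det['prob'],
                                'mask': det['mask']})              # fuse as a new object
    # prune objects whose confidence dropped below t_p2
    map_objects[:] = [o for o in map_objects if o['weight'] >= T_P2]
    return map_objects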

V 2D-3D joint semantic segmentation

Given consistent 2D semantic images, camera poses, and a dense 3D model in the same coordinate frame, the encoder-decoder architecture shown in Figure 5 is introduced in this section to predict the semantic segmentation.

V-A Region-of-interest selection

3DMV [11] and 3D-SIS [12] propose differentiable projection layers that map 2D features to corresponding 3D voxels. However, this operation introduces noise into the 3D branch when features from the 2D domain have no matches in the 3D model. Therefore, we re-project [3] each 3D point $P=[x,y,z]$ of the model onto the 2D images to detect intersections, which are represented as binary masks $B_{k}$ by

$B_{k}=KT_{kw}P\ \ocircle\ R_{k}^{*}$ (2)

where $\ocircle$ denotes the and operation between the $k^{th}$ semantic input image $R_{k}^{*}$ and the image re-projected from the 3D model, $T_{kw}$ is the 6-DoF pose from the world frame to the $k^{th}$ camera coordinate frame, and $K$ is the intrinsic matrix.
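A possible NumPy realization of this region selection is sketched below; the function name and the choice of rounding projected pixels to the nearest integer are our assumptions for illustration.

# Hedged sketch of Eq. (2): project model points into view k and AND the
# result with the consistent semantic mask R*_k to keep only matched voxels.
import numpy as np

def region_mask(points_w, T_kw, K, R_star_k):
    """points_w: Nx3 model points (world frame); T_kw: 4x4 world-to-camera pose;
    K: 3x3 intrinsic matrix; R_star_k: HxW semantic/binary mask of keyframe k."""
    H, W = R_star_k.shape
    pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])
    pts_c = (T_kw @ pts_h.T).T[:, :3]                # points in camera coordinates
    z = pts_c[:, 2]
    valid = z > 1e-6                                  # keep points in front of the camera
    uv = (K @ pts_c.T).T
    u = np.round(uv[:, 0] / np.where(valid, z, 1.0)).astype(int)
    v = np.round(uv[:, 1] / np.where(valid, z, 1.0)).astype(int)
    inside = valid & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    B_k = np.zeros(len(points_w), dtype=bool)
    B_k[inside] = R_star_k[v[inside], u[inside]] > 0  # the "and" with R*_k
    return B_k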

Figure 5: 2D-3D semantic segmentation network. The 2D semantic segmentation input is associated with 3D voxels, and the corresponding relationship mask is established through back projection.

V-B Semantic projection block

As shown in Figure 5, the embedding $\widetilde{F}_{3D}$ is constructed from two branches: $F_{3D}$ from the encoder of MinkowskiNet [9] and $\hat{F}_{3D}^{V}$ from the semantic projection block introduced in this section.

First, we extract deep feature pyramids from the multi-view semantic images using ResNet-18 [33]. Four levels of feature maps $F_{2D}^{L_i}, i\in[1\dots 4]$ are extracted from each image. To maintain compatibility with the deep features from the encoder of the 3D network [9], we project each feature channel of $F_{2D}^{L_i}$ to a shape of $N\times 1$ based on the camera pose and intrinsic matrix, where $N$ is the number of voxels in the 3D model. Therefore, each level’s feature maps (with $C$ channels) are transformed to a shape of $N\times C$. Then, at the same level of the $V$ views, following [3] we concatenate those transformed features along the channel direction to obtain $\hat{F}_{3D}^{v,l}$ with size $N\times C\times V$.

Then the DoT operation, composed of four 3D sparse convolutional layers and a sparse max-pooling layer, is applied to aggregate the feature volumes $\hat{F}_{3D}^{v,l}$ from the different views:

$\hat{F}_{3D}^{V}=\sum_{v=1}^{V}\sum_{l=1}^{L}DoT(\hat{F}_{3D}^{v,l})$ (3)

where $L$ is the number of feature levels and $V$ is the number of views. Via the DoT operation, $\hat{F}_{3D}^{v,l}$ is transformed to the same size as $F_{3D}$, and the spatial and semantic information from different views is fused.

Finally, the corresponding levels of the encoder features $F_{3D}$ from MinkowskiNet and $\hat{F}_{3D}^{V}$ from our SP-Block are fused by a concatenation operation,

$\widetilde{F}_{3D}=F_{3D}\oplus\hat{F}_{3D}^{V}$ (4)

where $\widetilde{F}_{3D}$ is the fused embedding that is fed to the decoder to obtain the semantic prediction results.
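The sketch below illustrates the SP-Block idea in plain PyTorch for a single pyramid level: 2D features are gathered at the pixels each voxel projects to, masked by $B_k$, aggregated over views by a dense stand-in for the sparse-convolution DoT layers, and concatenated with the 3D branch. The class name, the use of linear layers instead of sparse convolutions, and the single-level aggregation are assumptions for illustration only.

# Hedged PyTorch sketch of the 2D-to-3D projection and fusion in the SP-Block.
import torch
import torch.nn as nn

class SPBlockSketch(nn.Module):
    def __init__(self, c2d, c3d):
        super().__init__()
        # dense aggregation used here in place of the sparse DoT layers (assumption)
        self.dot = nn.Sequential(nn.Linear(c2d, c3d), nn.ReLU(),
                                 nn.Linear(c3d, c3d))

    def forward(self, feats_2d, pix_idx, valid, f3d):
        """feats_2d: (V, C2d, H, W) one pyramid level from V views
        pix_idx:  (V, N, 2) long tensor, pixel (u, v) each of the N voxels projects to
        valid:    (V, N) bool mask B_k from the region-selection step
        f3d:      (N, C3d) features from the 3D (point-cloud) encoder branch"""
        V, C, H, W = feats_2d.shape
        N = f3d.shape[0]
        gathered = feats_2d.new_zeros(V, N, C)
        for view in range(V):
            u, v = pix_idx[view, :, 0], pix_idx[view, :, 1]
            g = feats_2d[view, :, v, u].t()                    # (N, C2d) per-voxel 2D features
            gathered[view] = g * valid[view].unsqueeze(1).float()  # zero out unmatched voxels
        f2d = self.dot(gathered).sum(dim=0)                    # aggregate over views -> (N, C3d)
        return torch.cat([f3d, f2d], dim=1)                    # fused embedding ~F_3D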

V-C Decoder and training

The decoder used in this paper comes from MinkowskiNet [9] and generates a semantic label for every 3D point; readers can refer to MinkowskiNet [9] for details. Furthermore, we use the conventional cross-entropy loss [34] to supervise the whole encoder-decoder pipeline. During the training process, we use the ground-truth 2D semantic labels and the 3D reconstruction model of each scene as the input of our network.
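A minimal sketch of this supervision is shown below, assuming a model object that wraps the encoder-decoder and a train_loader that yields voxelized scenes with per-voxel labels; the ignore_index choice is an assumption, while the SGD settings follow Section V-D.

# Hedged sketch of per-voxel cross-entropy supervision with the reported SGD settings.
import torch

criterion = torch.nn.CrossEntropyLoss(ignore_index=-1)  # ignore unlabeled voxels (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for voxels, labels in train_loader:      # labels: (N,) NYU40 class ids per voxel
    logits = model(voxels)               # (N, num_classes) decoder output
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()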

V-D Implementation details

The network is implemented in PyTorch, using the SGD optimizer with a learning rate of 0.01 and a momentum of 0.9. Furthermore, the network is trained on a machine with 4 NVIDIA GeForce 2080 Ti GPUs and 64 GB RAM, with a batch size of 12 for 100 training epochs.

The size of the RGB-D images fed to our tracking system is $480\times 640$, while the semantic images for the SP-Block are downsampled to $240\times 320$. The channel sizes of the four layers in the feature pyramid are 512, 256, 128 and 96, respectively. After the DoT operation, the channel sizes of the four layers in $\hat{F}_{3D}^{V}$ are 256, 128, 128 and 96, respectively.

Figure 6: A qualitative comparison of 3D semantic segmentation between MinkowskiNet and our method: (a) dense map, (b) MinkowskiNet [9], (c) ours, (d) ground truth. Pink boxes highlight the differences. Classes are color-coded according to the palette at the bottom of the figure.

VI Experiments

In this section, the performance of the system, including dense mapping and 3D segmentation, is evaluated on public datasets and compared with state-of-the-art methods.

VI-A Dataset

VI-A1 ScanNetV2

The ScanNetV2 [35] dataset includes 1513 sequences (around 2.5 million RGB-D frames) of indoor scenes and provides ground-truth annotations for training, validation, and testing directly on the 3D reconstructions. The sequences are split into training, validation, and testing sets, where the semantic labels follow the NYU40 [36] convention.

VI-A2 3RScan

The 3RScan dataset [37] is a large indoor RGB-D dataset in which changing environments are scanned multiple times. It contains 1482 RGB-D scans of 478 environments with ground-truth annotations for instance-level semantic segmentation, dense mapping, and scene semantic segments.

VI-A3 ICL-NUIM

The synthetic ICL-NUIM dataset is designed for benchmarking tracking and mapping methods; it provides living-room and office-room scenes with ground-truth poses. Ground-truth maps are only provided for the four living-room sequences.

Seq.  ElasticFu [38]  BundleFu [27]  InfiniTAM [39]  Ours
lr2   0.8             0.7            0.1             0.7
lr3   2.8             0.8            2.8             0.7
TABLE I: Reconstruction error (cm) for the Living Room sequences.

VI-B Accuracy of dense reconstruction

First, the reconstruction accuracy of the different approaches is evaluated. ElasticFusion [38], BundleFusion [27] and InfiniTAM [39] are dense reconstruction methods; the first reconstructs surfel-based models, while the others, like ours, produce dense mesh models. Since the lr2 and lr3 sequences cover a complete room, we compare the reconstruction errors of these methods on them. As shown in Table I, our method is more robust across the different sequences, and it only requires a CPU for on-the-fly dense reconstruction.

Moreover, we qualitatively compare our reconstruction results with the ground truth of ScanNetV2, which is built with BundleFusion. As shown in Figure 7(a), our method reconstructs the chair completely. Benefiting from our accurate pose estimation module and the smooth dense reconstruction strategy, the refrigerator is reconstructed more accurately than in the ground truth. The corresponding semantic segmentation results are shown in Figure 7(b).

Method bath bed bkshf cab chair cntr curt desk door floor other pic fridge shower sink sofa table toilet wall window
SF [28] 59.8 46.8 32.0 35.7 46.9 33.2 46.9 34.7 35.7 72.2 34.7 21.7 34.3 28.8 47.2 43.7 37.8 65.5 58.3 29.5
SR [40] 69.7 52.6 31.2 31.7 64.0 24.0 30.3 26.1 30.9 80.6 33.3 7.3 56.3 23.6 46.2 58.3 51.6 73.3 66.9 21.1
PF [29] 57.5 67.0 48.4 44.8 66.7 35.8 53.3 42.0 35.6 81.0 40.3 30.2 47.0 50.8 52.6 61.3 54.8 82.1 65.7 45.9
PsF [41] 65.6 61.2 65.7 48.6 68.4 41.7 54.9 48.9 47.5 87.1 43.7 25.7 41.8 34.5 53.4 59.8 54.0 78.9 70.6 47.0
FPC [42] 85.4 82.3 6.4 60.9 75.1 56.0 64.8 58.2 64.8 91.9 46.4 40.6 64.2 51.7 63.5 77.9 68.9 87.0 83.8 56.3
SPV [43] 73.4 78.5 79.1 60.5 80.6 59.3 70.4 59.9 60.5 91.1 57.8 35.0 57.5 75.2 61.3 72.6 64.4 86.4 80.5 61.7
BPNet [3] 83.0 80.1 78.2 61.8 89.0 61.9 58.5 65.7 57.1 93.8 53.3 23.7 44.2 61.7 65.2 79.4 72.7 89.5 81.7 59.3
Ours 83.4 79.2 76.9 61.0 88.9 57.9 58.3 64.2 58.3 93.9 53.2 24.3 44.4 65.1 63.4 80.2 71.4 89.0 81.4 58.5
Ours+ 84.9 81.2 80.2 65.5 89.9 61.4 59.1 70.5 58.7 93.9 58.0 25.6 48.3 62.3 65.9 82.8 75.5 88.5 82.6 61.0
TABLE II: The Quantitative Accuracy Comparison of the Final Semantic Segmentation Results on the ScanNetV2 Validation dataset. We use bold and blue numbers to mark the best and second results per instance, respectively.
Method recon. voxel mIoU mAcc
SF [28] Y - 42.2 47.4
SR [40] Y - 44.0 65.6
PF [29] Y - 53.1 68.7
PsF [41] Y - 55.0 70.3
FPC [42] Y - 67.2 77.0
SPV [43] Y 1cm 68.3 79.6
BPNet [3] N 5cm 67.1 88.3
Ours Y 5cm 67.4 84.0
Ours+ Y 5cm 69.8 88.8
TABLE III: 3D semantic segmentation mIoU and mAcc results on the ScanNetV2 validation set. 'Y' means that a method includes a dense reconstruction function, '-' marks unknown settings, and '*' marks results taken from [43].

VI-C 3D semantic segmentation results

Following the common evaluation metrics of previous works, the standard mean Intersection over Union (mIoU) and mean Accuracy (mAcc) are used to evaluate the performance of our network. Our 3D semantic segmentation results are shown in Tables II and III, where we compare our network with state-of-the-art pipelines. Similar to our method, SF [28], SR [40], SPV [43] and PF [29] are semantic reconstruction systems based on RGB-D images, while BPNet [3] deals with point clouds.
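For reference, a small sketch of how these metrics can be computed from per-point predictions and ground-truth labels is given below; the confusion-matrix formulation is standard, and treating negative labels as unannotated points is an assumption of this example.

# Hedged sketch: mIoU and mAcc over the 20 ScanNetV2 classes from a confusion matrix.
import numpy as np

def miou_macc(pred, gt, num_classes=20):
    """pred, gt: 1D integer arrays of per-point class ids; gt < 0 means unannotated."""
    mask = gt >= 0
    conf = np.bincount(num_classes * gt[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(float)
    iou = tp / np.maximum(conf.sum(0) + conf.sum(1) - tp, 1)  # per-class IoU
    acc = tp / np.maximum(conf.sum(1), 1)                     # per-class accuracy
    return iou.mean(), acc.mean()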

Figure 7: Dense reconstruction results: (a) dense reconstruction; (b) semantic segmentation. Top: ScanNet ground truth. Bottom: ours.

Semantic segmentation results for the 20 object classes are listed in Table II. FPC [42] achieves good predictions in 5 classes, especially the bath, bed and wall instances; however, it fails to recognize the bookshelves present in the scenes. Compared with these methods, our network achieves more robust performance and obtains better results for most object classes. Specifically, we obtain the best results for the chair, desk, sofa and toilet classes, since our 2D semantic masks based on YOLACT [6] can detect those objects in the RGB images, which shows that the proposed SP-Block improves the segmentation results. Similar to our method, BPNet is a 2D-3D joint method, but it feeds the network with RGB images. According to the results of Ours+, which uses the 2D semantic masks provided by ScanNetV2 in place of YOLACT's predictions, the performance of our network improves sharply and surpasses BPNet in the majority of the categories listed in Table II.

Furthermore, we compare the mIoU and mAcc of the different methods in Table III, where the BPNet and our results use a voxel size of 5 cm while SPV uses 1 cm voxels. When the voxel size becomes smaller, more detailed information is preserved in the models, which tends to produce more accurate predictions; however, it also requires more intensive computation and more demanding hardware.

To verify the generality of our algorithm, the network is also tested on the 3RScan dataset. As shown in Figure 8(b), our approach accurately predicts the table and chair in the scene, whereas MinkowskiNet fails to detect them.

Figure 8: Results on the RIO dataset: (a) scene from RIO; (b) segmentation results from MinkowskiNet (top) and ours (bottom).

VII Conclusion

In this work, we present a complete scene understanding system that starts from RGB-D sequences, builds a dense mesh map incrementally, and segments the map semantically. First, we propose a sparse semantic instance map to support consistent 2D semantic mask generation. Moreover, the proposed SP-Block extracts deep features from the 2D semantic views and projects those features into the point-cloud feature domain. Extensive qualitative and quantitative results show that the proposed method achieves complete and state-of-the-art performance in this area.

Since consistent 2D semantic masks can improve 3D segmentation results, our future research plan is to build a semantic deep visual odometry in an end-to-end architecture, with a particular focus on video-based consistent 2D segmentation.

References

  • [1] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” arXiv preprint arXiv:1706.02413, 2017.
  • [2] F. Furrer, T. Novkovic, M. Fehr, A. Gawel, M. Grinvald, T. Sattler, R. Siegwart, and J. Nieto, “Incremental object database: Building 3d models from multiple partial observations,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.
  • [3] W. Hu, H. Zhao, L. Jiang, J. Jia, and T.-T. Wong, “Bidirectional projection network for cross dimension scene understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • [4] A. O. Vuola, S. U. Akram, and J. Kannala, “Mask-rcnn and u-net ensembled for nuclei segmentation,” in 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019).
  • [5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, 2017.
  • [6] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “Yolact: Real-time instance segmentation,” in ICCV, 2019.
  • [7] ——, “Yolact++: Better real-time instance segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [8] P. Hermosilla, T. Ritschel, P.-P. Vázquez, À. Vinacua, and T. Ropinski, “Monte carlo convolution for learning on non-uniformly sampled point clouds,” vol. 37, no. 6, 2018, pp. 1–12.
  • [9] C. Choy, J. Gwak, and S. Savarese, “4d spatio-temporal convnets: Minkowski convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [10] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, and S. Han, “Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution,” in European Conference on Computer Vision (ECCV), 2020.
  • [11] A. Dai and M. Nießner, “3dmv: Joint 3d-multi-view prediction for 3d semantic scene segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [12] J. Hou, A. Dai, and M. Nießner, “3d-sis: 3d semantic instance segmentation of rgb-d scans,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [13] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
  • [14] A. Chang, A. Dai, T. Funkhouser, M. Halber, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” in 2017 International Conference on 3D Vision (3DV), 2017.
  • [15] S. C. Wu, J. Wald, K. Tateno, N. Navab, and F. Tombari, “Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences,” 2021.
  • [16] K. Tateno, F. Tombari, and N. Navab, “Real-time and scalable incremental segmentation on dense slam,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  • [17] R. Yunus, Y. Li, and F. Tombari, “Manhattanslam: Robust planar tracking and mapping leveraging mixture of manhattan frames,” arXiv preprint arXiv:2103.15068, 2021.
  • [18] Y. Li, R. Yunus, N. Brasch, N. Navab, and F. Tombari, “Rgb-d slam with structural regularities,” in 2021 IEEE international conference on Robotics and automation (ICRA).
  • [19] Z. Tian, C. Shen, X. Wang, and H. Chen, “Boxinst: High-performance instance segmentation with box annotations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • [20] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision, 2020.
  • [21] C. Zhu, Y. He, and M. Savvides, “Feature selective anchor-free module for single-shot object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [22] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, “End-to-end video instance segmentation with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • [23] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [24] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
  • [25] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE transactions on robotics, vol. 33, no. 5, 2017.
  • [26] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, and A. W. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in IEEE International Symposium on Mixed and Augmented Reality, 2012.
  • [27] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt, “Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration,” ACM Transactions on Graphics (ToG), 2017.
  • [28] J. Mccormac, A. Handa, A. Davison, and S. Leutenegger, “Semanticfusion: Dense 3d semantic mapping with convolutional neural networks,” IEEE, 2016.
  • [29] G. Narita, T. Seno, T. Ishikawa, and Y. Kaji, “Panopticfusion: Online volumetric semantic mapping at the level of stuff and things,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
  • [30] Q.-Y. Zhou and V. Koltun, “Dense scene reconstruction with points of interest,” ACM Transactions on Graphics (ToG), 2013.
  • [31] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger, “Real-time 3d reconstruction at scale using voxel hashing,” ACM Transactions on Graphics (ToG), 2013.
  • [32] W. E. Lorensen and H. E. Cline, “Marching cubes: A high resolution 3d surface construction algorithm,” ACM siggraph computer graphics, 1987.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
  • [34] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, 2015.
  • [35] A. Dai, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [36] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in ECCV, 2012.
  • [37] J. Wald, A. Avetisyan, N. Navab, F. Tombari, and M. Nießner, “Rio: 3d object instance re-localization in changing indoor environments,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
  • [38] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger, “Elasticfusion: Real-time dense slam and light source estimation,” The International Journal of Robotics Research, vol. 35, no. 14, 2016.
  • [39] V. A. Prisacariu, O. Kähler, S. Golodetz, M. Sapienza, T. Cavallari, P. H. Torr, and D. W. Murray, “Infinitam v3: A framework for large-scale 3d reconstruction with loop closure,” arXiv preprint arXiv:1708.00783, 2017.
  • [40] J. Jeon, J. Jung, J. Kim, and S. Lee, “Semantic reconstruction: Reconstruction of semantically segmented 3d meshes via volumetric semantic fusion,” Computer Graphics Forum, 2018.
  • [41] Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, “Real-time progressive 3d semantic segmentation for indoor scenes,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).
  • [42] J. Zhang, C. Zhu, L. Zheng, and K. Xu, “Fusion-aware point convolution for online semantic 3d scene segmentation,” IEEE, 2020.
  • [43] S.-S. Huang, Z.-Y. Ma, T.-J. Mu, H. Fu, and S.-M. Hu, “Supervoxel convolution for online 3d semantic segmentation,” ACM Transactions on Graphics (TOG), 2021.