One at a Time: Progressive Multi-step Volumetric Probability Learning for
Reliable 3D Scene Perception
Abstract
Numerous studies have investigated the pivotal role of reliable 3D volume representation in scene perception tasks, such as multi-view stereo (MVS) and semantic scene completion (SSC). They typically construct 3D probability volumes directly with geometric correspondence, attempting to fully address the scene perception tasks in a single forward pass. However, such a single-step solution makes it hard to learn accurate and convincing volumetric probability, especially in challenging regions like unexpected occlusions and complicated light reflections. Therefore, this paper proposes to decompose the complicated 3D volume representation learning into a sequence of generative steps to facilitate fine and reliable scene perception. Considering the recent advances achieved by strong generative diffusion models, we introduce a multi-step learning framework, dubbed VPD, dedicated to progressively refining the Volumetric Probability in a Diffusion process. Specifically, we first build a coarse probability volume from the input images with off-the-shelf scene perception baselines, which then serves as the basic geometry prior condition for a 3D diffusion UNet to progressively achieve accurate
probability distribution modeling. To handle the corner cases in challenging areas, a Confidence-Aware Contextual Collaboration (CACC) module is developed to correct the uncertain regions for reliable volumetric learning based on multi-scale contextual contents. Moreover, an Online Filtering (OF) strategy is designed to maintain representation consistency for stable diffusion sampling. Extensive experiments are conducted on scene perception tasks including multi-view stereo (MVS) and semantic scene completion (SSC), to validate the efficacy of our method in learning reliable volumetric representations. Notably, for the SSC task, our work stands out as the first to surpass LiDAR-based methods on the SemanticKITTI dataset.

1 Introduction
Obtaining a dependable 3D representation is of critical importance in the realm of computer vision, particularly for tasks involving 3D scene perception, such as multi-view stereo (MVS) (Yao et al. 2018; Chen et al. 2020; Zhang et al. 2020) and semantic scene completion (SSC) (Li et al. 2023b; Miao et al. 2023; Li et al. 2023a). Existing 2D-based approaches implicitly learn 3D features by harnessing contextual information (Mayer et al. 2016; Wang et al. 2020, 2021), but they often struggle with precise geometric modeling due to the inherent ambiguity of 2D representations. On the other hand, some researchers have sought to enforce geometric constraints by utilizing 3D probability volumes to model correspondences across various depth hypothesis planes, an approach that has attracted growing attention (Yin, Darrell, and Yu 2019; Gu et al. 2020; Ding et al. 2022).
Nevertheless, many complex real-world scenarios, characterized by incomplete observations and intricate reflection conditions, pose substantial challenges when striving for precise geometric modeling. Existing 3D probability volume-based approaches have made strides by devising sophisticated architectures (Gu et al. 2020; Chen et al. 2020; Ding et al. 2022) and refining loss functions (Peng et al. 2022; Wang et al. 2022b) to acquire reliable probability volumes. However, these methods generally resolve the problem with a single-step approximation solution, imposing a substantially heavy burden on the learning process. To mitigate these learning challenges, another line of research has introduced GRU-based architectures (Yao et al. 2019; Wang et al. 2022a; Xu et al. 2023) to facilitate the acquisition of a dependable 3D volumetric representation through iterative refinement. Nevertheless, these approaches typically rely on 2D convolutional GRU mechanisms, which are susceptible to cumulative errors (Li et al. 2018; Mao and Sejdić 2022). This, in turn, motivates us to explore the potential of iterative refinement in the context of 3D volumetric probability.
Based on the above analysis, we propose a Volumetric Probability Diffusion (VPD) framework, which progressively models the volumetric probability and thus achieves reliable geometry estimation in the MVS and SSC tasks. As depicted in Fig. 1, the core idea is to devise a multi-step learning scheme that models the probability volumes and progressively refines them. Inspired by the powerful probability distribution modeling capabilities exhibited by generative diffusion models (Saharia et al. 2022; Müller et al. 2023), we propose a progressive optimization paradigm based on the diffusion process for reliable probability volume modeling. To leverage the geometry prior extracted from the input images with pre-trained models, our VPD is conditioned on the extracted coarse volumes and contextual features to guide the diffusion process. Specifically, the coarse volumes are employed as the basic geometry prior, which is concatenated with the noisy input volume of the diffusion framework as the prior volume condition. Despite the effectiveness of the prior volume condition in high-confidence regions, the low-confidence mismatch issue in challenging regions (e.g., non-Lambertian surfaces, thin structures, and reflections) still exists, which impairs the learning of the probability distribution approximation to the target volumes. Therefore, we further introduce a Confidence-Aware Contextual Collaboration (CACC) module to correct the uncertain regions of the predicted 3D volumes with rich contextual information. In detail, CACC first prunes the 3D volumes using confidence-aware filtering. Next, fine-grained features and geometric details are retrieved from multi-scale contextual contents to complement the information in the low-confidence regions of the volumes. Moreover, to avoid perturbations in the diffusion sampling process, we introduce an Online Filtering (OF) strategy to maintain the consistency of the representations for stable diffusion sampling. In summary, the main contributions of this paper are summarized as follows:
• We pinpoint the limitation of single-step strategies, and correspondingly propose a novel Volumetric Probability Diffusion (VPD) framework, which fully exploits the strong generative ability of diffusion models for fine and reliable volumetric representation.
• We propose a Confidence-Aware Contextual Collaboration (CACC) module to enhance the reliability of volumetric learning in VPD. Additionally, we develop an Online Filtering (OF) strategy to maintain representation consistency during the reverse sampling process.
• Extensive experiments validate the effectiveness of our approach. We achieve state-of-the-art results on various scene perception tasks, including 1) MVS: DTU (Aanæs et al. 2016), BlendedMVS (Yao et al. 2020) and ScanNet (Dai et al. 2017); 2) SSC: SemanticKITTI (Behley et al. 2019). Notably, to the best of our knowledge, VPD is the first camera-based method that surpasses LiDAR-based methods on SemanticKITTI.
2 Related Works
2.1 Learning-based 3D Scene Perception
With the development of learning-based methods, the quality of 3D representation for scene perception has been steadily improved (Deng 2021; Okae et al. 2021; Xie et al. 2023). Recently, stereo matching has been explored in semantic scene completion (SSC) (Li et al. 2023b, a). In StereoScene (Li et al. 2023a), a stereo volume constructor is proposed to generate a geometric cost volume to enhance the understanding of 3D scenarios. For multi-view stereo (MVS), a set of images is employed to construct 3D cost volumes with epipolar constraints (Gu et al. 2020; Ding et al. 2022; Peng et al. 2022). CasMVSNet (Gu et al. 2020) employs cascade cost volumes at different scales to form a coarse-to-fine depth estimation framework. TransMVSNet (Ding et al. 2022) leverages global context information with a feature matching transformer to exploit long-range aggregation across input images. UniMVSNet (Peng et al. 2022) proposes a unification representation for both regression and classification that is supervised with a unified focal loss. Different from all the previous methods that try to approximate the ground truth in a single step, we propose to formulate depth estimation as progressive distribution modeling, which decomposes the issue into multiple steps to further improve performance in challenging scenarios.

2.2 Denoising Diffusion Models
Denoising diffusion models (DDMs) are a novel class of generative models derived from nonequilibrium thermodynamics (Sohl-Dickstein et al. 2015) and have achieved astounding results in the field of computer vision (Luo and Hu 2021; Rombach et al. 2022; Ramesh et al. 2021). DiffRF (Müller et al. 2023) adopts a set of posed images as additional conditions for radiance field synthesis with a rendering loss to resolve ambiguities. Different from the one-to-many mapping in the generation process of DiffRF, we employ the diffusion process as a one-to-one mapping, leveraging geometry prior for accurate and reliable 3D scene perception. DiffuStereo (Shao et al. 2022) leverages an iterative diffusion model to obtain highly accurate depth maps for automatic high-quality human reconstruction from sparse-view inputs as conditions. However, DiffuStereo directly refines the depth maps generated by the off-the-shelf algorithms, without fully exploring the geometric constraints in the matching process. In contrast, we propose volumetric probability diffusion (VPD) to make full use of the correspondence distribution across different depth hypothesis planes, which is more advisable because the diffusion process excels at modeling distributions.
3 Methodology
In this work, we formulate the 3D perception in MVS and SSC tasks as multi-step conditional volumetric probability learning, and propose Volumetric Probability Diffusion (VPD). As shown in Figure 2, given input images, we first construct diffusion conditions with coarse probabilistic volumes and multi-scale contextual features extracted from off-the-shelf scene perception baselines. Next, we progressively estimate a refined volume over multiple steps by diffusing a noisy volume with the constructed conditions. The refined volumes are finally fed into the task-specific head to generate depth maps in MVS or occupancy grids in SSC. Please refer to the Supplementary Material for the details on the task-specific head. In detail, the proposed VPD mainly consists of the following components:
I. A Volumetric Diffusion model that is implemented with a 3D UNet (Ronneberger, Fischer, and Brox 2015). In the forward process, the target volumes are constructed from ground truth depth maps. In the reverse process, an Online Filtering (OF) strategy is further developed to maintain the unique peak distribution in the estimated volumes.
II. The diffusion conditions including the basic prior volume condition and the contextual feature condition constructed with the Confidence-Aware Contextual Collaboration (CACC) module.
3.1 Volumetric Diffusion
The standard generative diffusion models aim to form one-to-many mappings with a forward and reverse process. In our scenario, we employ a volumetric diffusion model to learn the parametric approximation to the target volume based on the guidance of conditions.
In the forward process, we construct a target unimodal volume $V_0$ from the ground truth depth map $D^{gt}$, and progressively corrupt the target volume to $V_T$ in $T$ time steps. In the reverse process, the 3D diffusion UNet estimates a refined volume to approximate the target volume from the noisy input volume $V_t$, and we consider conditions $c$ to guide the estimation.
Volumetric Gaussian Forward Process.
Given a ground truth depth map $D^{gt}$, we first construct the target volume following the unimodal projection (Ding et al. 2022; Peng et al. 2022) along the depth dimension as the diffusion input:

$V_0 = \mathcal{P}_{uni}(D^{gt}). \quad (1)$
We gradually add noise to $V_0$ to generate the noisy volume $V_T$ over $T$ steps following a discrete-time Markov chain. Given the distribution of $V_0$, the forward process can be characterized as:

$q(V_t \mid V_0) = \mathcal{N}\big(V_t;\ \sqrt{\bar{\alpha}_t}\,V_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big), \quad (2)$

where $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$ and $\alpha_t$ is the pre-defined noise schedule coefficient. $\mathcal{N}$ and $\mathbf{I}$ denote the normal distribution and the identity matrix, respectively.
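To make the forward process concrete, the following PyTorch-style sketch shows one way the unimodal target volume and its noised version $V_t$ could be produced; the triangular unimodal profile and the helper names (`unimodal_projection`, `q_sample`) are illustrative assumptions rather than the exact implementation.

```python
import torch

def unimodal_projection(depth_gt, hypotheses):
    """Project a GT depth map (B, H, W) onto D hypothesis planes (B, D, H, W),
    yielding a distribution peaked at the plane closest to the GT depth.
    The triangular profile is an illustrative choice for the unimodal shape."""
    dist = (hypotheses - depth_gt.unsqueeze(1)).abs()                 # (B, D, H, W)
    interval = (hypotheses[:, 1:] - hypotheses[:, :-1]).abs().mean()  # mean plane spacing
    weight = torch.clamp(1.0 - dist / interval, min=0.0)
    return weight / weight.sum(dim=1, keepdim=True).clamp(min=1e-8)

def q_sample(v0, t, alphas_cumprod):
    """Closed-form forward diffusion q(V_t | V_0) from Eq. (2).
    v0: (B, D, H, W) target volume; t: (B,) long tensor of timesteps."""
    noise = torch.randn_like(v0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    v_t = a_bar.sqrt() * v0 + (1.0 - a_bar).sqrt() * noise
    return v_t, noise
```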
Iterative Conditional Reverse Process.
The conditional reverse sampling process is dedicated to iteratively denoising $V_T$ with conditions to recover $V_0$. Each step of the reverse process can be defined as a conditional distribution transition (Saharia et al. 2022), which is formulated as:

$p_{\theta}(V_{t-1} \mid V_t, c) = \mathcal{N}\big(V_{t-1};\ \mu_{\theta}(V_t, t, c),\ \sigma_t^{2}\mathbf{I}\big), \quad (3)$

where $\mu_{\theta}$ represents the reverse function, and $c$ denotes the two conditions of the diffusion model (the prior volume condition and the contextual feature condition).
Online Filtering. Since VPD is dedicated to approximating the target volume with a unimodal distribution, we propose to filter the predicted $V_{t-1}$ online at each iteration to suppress the perturbation caused by the generated multi-modal representation before sending it to the next reverse sampling step. Our implementation can be formally written as:

$\tilde{V}_{t-1} = \mathcal{P}_{uni}\big(\mathrm{WTA}(V_{t-1})\big), \quad (4)$

where $\mathcal{P}_{uni}$ denotes the unimodal projection, the same as the GT volume construction in Section 3.1, and $\mathrm{WTA}$ represents the Winner-Takes-All operation (Cheng et al. 2020), which maintains the unique peak along the depth dimension.
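A minimal sketch of the reverse loop with Online Filtering is given below; realizing WTA as an argmax and the unimodal re-projection as a one-hot placement along the depth axis are simplifying assumptions, and `model` stands in for the conditional 3D diffusion UNet.

```python
import torch

def online_filtering(volume):
    """Online Filtering (Eq. 4): keep a single peak per pixel along the depth axis.
    volume: (B, D, H, W). WTA is realized as an argmax, and the unimodal
    re-projection is a simple one-hot placement (an illustrative choice)."""
    idx = volume.argmax(dim=1, keepdim=True)          # Winner-Takes-All index
    filtered = torch.zeros_like(volume)
    return filtered.scatter_(1, idx, 1.0)             # unimodal (one-hot) volume

@torch.no_grad()
def reverse_sampling(model, v_t, conditions, timesteps):
    """Iterative conditional reverse process with OF applied between steps."""
    for t in reversed(timesteps):                     # e.g. 4 reverse steps
        v_t = model(v_t, t, conditions)               # one conditional denoising transition
        v_t = online_filtering(v_t)                   # suppress multi-modal peaks
    return v_t
```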
3.2 Condition Construction
In this section, we introduce the condition construction in VPD. As shown in Figure 2, we extract coarse probabilistic volumes and multi-scale contextual features from the input images with the off-the-shelf MVS or SSC baselines (corresponding to the Feature Net and the Volume Net in Figure 2). Next, we employ them as the prior volume condition and the contextual feature condition, respectively, to constrain the learning of the distribution transition in the diffusion process.
Coarse Volume Probabilization.
We employ Coarse Volume Probabilization (CVP) to construct the coarse probabilistic volume, which is concatenated with the input noisy volume as the basic prior volume condition of the diffusion model. Given a coarse cost volume $C$ from the baseline networks, we apply softmax along the depth dimension for each pixel $p$ in space to implement the volume probabilization, which is formally written as:

$P_{j}(p) = \dfrac{\exp\big(C_{j}(p)\big)}{\sum_{i=1}^{D}\exp\big(C_{i}(p)\big)}, \quad (5)$

where $C_{j}(p)$ represents the cost value of depth hypothesis plane $j$ ($j = 1, \dots, D$), and $D$ denotes the number of depth hypothesis planes. For multi-view stereo (MVS), we adopt regularized cost volumes (Gu et al. 2020; Long et al. 2021; Ding et al. 2022; Peng et al. 2022) as the volume $C$, while the geometric cost volumes (Li et al. 2023a) are employed for semantic scene completion (SSC).
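Since Eq. (5) is a per-pixel softmax over the depth hypothesis planes, the probabilization itself is a one-liner; the sketch below assumes a cost volume laid out as (B, D, H, W), with the depth planes on the channel axis.

```python
import torch
import torch.nn.functional as F

def coarse_volume_probabilization(cost_volume):
    """Coarse Volume Probabilization (Eq. 5): softmax over the D depth
    hypothesis planes, independently for every pixel.
    cost_volume: (B, D, H, W) regularized (MVS) or geometric (SSC) cost volume."""
    return F.softmax(cost_volume, dim=1)

# The resulting probabilistic volume is concatenated with the noisy input volume
# to form the prior volume condition; the concatenation axis below is an
# illustrative assumption about the tensor layout.
# prior_condition = torch.cat([noisy_volume, coarse_volume_probabilization(cost)], dim=1)
```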

Confidence-Aware Contextual Collaboration.
Although the CVP provides basic geometry prior, it is still hard to achieve compelling results, especially in challenging regions like occlusions, reflections, textureless regions, etc. Thus, we propose a Confidence-Aware Contextual Collaboration (CACC) module to further apply continuous refinement with the contextual feature condition on the estimated volumes.
The overall structure of CACC is shown in Figure 2. Given a depth volume $V^{k}$ in the $k$-th downsample block of the 3D UNet (i.e., the diffusion model) and the corresponding-scale contextual features $F_{c}^{k}$ from the feature extraction networks, our goal is to retrieve reliable multi-scale contextual information from $F_{c}^{k}$, and refine $V^{k}$ according to the confidence information along the spatial dimension. It is worth noting that we directly obtain the multi-scale contextual features from the off-the-shelf baseline networks for computational efficiency.
Specifically, we first form a confidence map $M$ by taking the highest probability value among all depth hypothesis planes across the depth dimension. Next, we reverse the values in $M$ to obtain the query $Q$ for cross attention, which measures the matching uncertainty in $V^{k}$:

$M = \mathrm{WTA}(V^{k}), \quad Q = 1 - M, \quad (6)$

where $\mathrm{WTA}$ denotes the Winner-Takes-All operation along the depth dimension. To generate the key $K$ and value $\mathcal{V}$, we apply deformable convolution on the corresponding contextual features for efficient geometric transformation modeling and receptive field adaptation. For each location point $p$ on the contextual features $F_{c}^{k}$, the process is formulated as:

$[K(p),\ \mathcal{V}(p)] = \mathrm{Split}\Big(\sum_{i=1}^{n} w_{i}\cdot F_{c}^{k}(p + p_{i} + \Delta p_{i})\Big), \quad (7)$

where $w_{i}$ and $\Delta p_{i}$ denote the deformable weight and the learnable offset, respectively, and $p_{i}$ enumerates the $n$ sampling locations of the convolution kernel. $\mathrm{Split}$ represents splitting the input into two halves along the feature channel. To reduce the computation cost, we adopt linear attention (Kitaev, Kaiser, and Levskaya 2020; Shen et al. 2021) as:

$F_{cac} = \sigma_{row}(Q)\,\big(\sigma_{col}(K)^{\top}\mathcal{V}\big), \quad (8)$

where $F_{cac}$ represents the confidence-aware context, and $\sigma_{row}$ and $\sigma_{col}$ are softmax operations along each row and column of the input matrix, respectively. In this way, the relevant information of the contextual features is retrieved according to the matching uncertainty of the depth volume $V^{k}$.
Subsequently, we implement element-wise multiplication between the depth volume $V^{k}$ and the confidence map $M$ to obtain a filtered volume. To match $V^{k}$ in dimension, $F_{cac}$ is projected into a 3D contextual volume following the lift operation (Philion and Fidler 2020) before being added to the filtered volume. Finally, the refined volume $\tilde{V}^{k}$ is constructed by element-wise summation between the filtered volume and the contextual volume:

$\tilde{V}^{k} = V^{k} \odot M + \mathrm{Lift}(F_{cac}), \quad (9)$

where $\odot$ denotes element-wise multiplication. Note that CACC is applied to each downsample block of the 3D UNet with different dimension sizes. Through the refinement operation of CACC on the depth volume, the volumetric distribution in high-confidence regions is retained, while that in low-confidence regions is optimized with multi-scale contexts.
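The sketch below outlines one possible realization of CACC at a single UNet scale: the confidence map and uncertainty query of Eq. (6), deformable key/value generation in the spirit of Eq. (7) via `torchvision.ops.DeformConv2d`, the efficient (linear) attention of Eq. (8), and the confidence-aware fusion of Eq. (9). The query embedding, the channel sizes, and the 1x1 lift projection are assumptions made for a self-contained example, not the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class CACC(nn.Module):
    """Sketch of Confidence-Aware Contextual Collaboration at one UNet scale.
    volume: (B, D, H, W) depth volume; context: (B, C, H, W) contextual features."""

    def __init__(self, feat_ch, num_depth, k=3):
        super().__init__()
        self.q_embed = nn.Conv2d(1, feat_ch, 1)              # embed the uncertainty map into a query (assumption)
        self.offset = nn.Conv2d(feat_ch, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(feat_ch, 2 * feat_ch, k, padding=k // 2)
        self.lift = nn.Conv2d(feat_ch, num_depth, 1)          # simplified stand-in for the lift operation

    def forward(self, volume, context):
        conf = volume.max(dim=1, keepdim=True).values          # confidence map M, Eq. (6)
        query = self.q_embed(1.0 - conf)                       # uncertainty-driven query Q

        kv = self.deform(context, self.offset(context))        # deformable features, Eq. (7)
        key, value = kv.chunk(2, dim=1)                        # split into K and V along channels

        # Efficient (linear) attention, Eq. (8): normalize Q over feature rows and
        # K over spatial columns, then aggregate the global context K^T V.
        q = F.softmax(query.flatten(2), dim=1)                 # (B, C, HW)
        k = F.softmax(key.flatten(2), dim=2)                   # (B, C, HW)
        v = value.flatten(2)                                   # (B, C, HW)
        global_ctx = torch.einsum('bcn,bdn->bcd', k, v)        # (B, C, C)
        f_cac = torch.einsum('bcn,bcd->bdn', q, global_ctx)    # (B, C, HW)
        f_cac = f_cac.reshape(context.shape)

        # Confidence-aware fusion, Eq. (9): keep confident regions of the volume,
        # complement low-confidence regions with lifted contextual information.
        return volume * conf + self.lift(f_cac)
```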
In Figure 3, we visualize the confidence map $M$, the uncertainty map $Q$, and the estimated depth maps without and with CACC. It can be seen that the model without CACC struggles to achieve compelling results in challenging regions (e.g., object boundaries and low-texture regions). The confidence map and the uncertainty map highlight the regions with poor estimation, which are effectively refined by retrieving information from the contextual features with CACC.
3.3 Training Objective
In this work, we adopt an end-to-end joint training pipeline for the whole framework, and our training objective is to optimize the volumetric diffusion model for target volume approximation. Different loss functions are applied to achieve this objective according to the representations of the coarse probabilistic volumes (see Section 3.2), which is consistent with the baseline networks (Gu et al. 2020; Peng et al. 2022; Ding et al. 2022; Long et al. 2021; Li et al. 2023a).
Regression Loss. For the regression representation, we implement implicit supervision on the output of VPD. Specifically, the estimated probabilistic depth volume is first regressed into a 2D depth map $D$, then an $\ell_1$ loss (Gu et al. 2020) is applied between the estimated depth map and the ground truth $D^{gt}$:

$\mathcal{L}_{reg} = \frac{1}{|\mathbf{P}_{v}|}\sum_{p\in\mathbf{P}_{v}}\big|D(p) - D^{gt}(p)\big|, \quad (10)$

where $|\mathbf{P}_{v}|$ denotes the number of labeled pixels $p$ with valid ground truth. In this way, the depth volume predicted by the volumetric diffusion model is implicitly supervised throughout the training process.
Classification Loss. For the classification representation, we apply a focal loss (Ding et al. 2022) to directly supervise the discrete volumetric distribution. The function adopts a tunable focusing parameter $\gamma$ to help focus on hard samples and prevent overfitting, which is formally defined as:

$\mathcal{L}_{cls} = \sum_{p\in\mathbf{P}_{v}} -\big(1 - P^{(j^{*})}(p)\big)^{\gamma}\log\big(P^{(j^{*})}(p)\big), \quad (11)$

where $\mathbf{P}_{v}$ and $j^{*}$ denote the labeled pixels with valid ground truth and the depth hypothesis closest to the ground truth, respectively. $P^{(j^{*})}(p)$ represents the prediction on depth hypothesis $j^{*}$.
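Hedged sketches of the regression loss of Eq. (10) and the classification (focal) loss of Eq. (11) are given below; the tensor layouts and helper names are assumptions.

```python
import torch

def regression_loss(depth_pred, depth_gt, valid_mask):
    """Masked L1 regression loss, Eq. (10). valid_mask marks labeled pixels."""
    diff = (depth_pred - depth_gt).abs() * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1.0)

def classification_loss(prob_volume, gt_index, valid_mask, gamma=2.0):
    """Focal loss over depth hypotheses, Eq. (11). gt_index holds, per pixel,
    the hypothesis plane closest to the GT depth; gamma is the focusing parameter.
    prob_volume: (B, D, H, W); gt_index: (B, H, W) long; valid_mask: (B, H, W)."""
    p_gt = torch.gather(prob_volume, 1, gt_index.unsqueeze(1)).squeeze(1).clamp(min=1e-6)
    loss = -((1.0 - p_gt) ** gamma) * torch.log(p_gt)
    return (loss * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)
```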
Unification Loss. For the unification representation, we adopt the unified focal loss (Peng et al. 2022) for continuous supervision on the estimated volume:

$\mathcal{L}_{uni} = \sum_{j=1}^{D} \begin{cases} \alpha\,\big|q^{+} - S^{+}(\hat{q}_{j})\big|^{\gamma}\,\mathrm{BCE}\big(S^{+}(\hat{q}_{j}),\ q^{+}\big), & q_{j} > 0, \\ (1-\alpha)\,\big(S^{+}(\hat{q}_{j})\big)^{\gamma}\,\mathrm{BCE}\big(S^{-}(\hat{q}_{j}),\ 1\big), & q_{j} = 0, \end{cases} \quad (12)$

where $\hat{q}$ and $q$ denote the estimated unity value and the continuous target, respectively. $\alpha$ and $\gamma$ are tunable parameters for sample balance. $\mathrm{BCE}$ represents the binary cross-entropy. $S$ is a sigmoid-like function; $S^{+}$ and $S^{-}$ represent the same sigmoid-like function with different inputs of $\hat{q}$ and $-\hat{q}$, respectively, and $q^{+}$ denotes the positive target.

Table 1: Quantitative results on the DTU test set.

| Method | Abs Rel | Abs | Sq Rel | Th8 | Th20 | δ < 1.25 | δ < 1.25² |
|---|---|---|---|---|---|---|---|
| MVSNet (Yao et al. 2018) | 0.0139 | 11.5502 | 2.0383 | 0.1378 | 0.0932 | 0.9845 | 0.9966 |
| CasMVSNet (Gu et al. 2020) | 0.0097 | 7.4381 | 1.6300 | 0.0872 | 0.0570 | 0.9887 | 0.9976 |
| UniMVSNet (Peng et al. 2022) | 0.0095 | 7.2756 | 1.3163 | 0.0837 | 0.0547 | 0.9858 | 0.9934 |
| TransMVSNet (Ding et al. 2022) | 0.0094 | 7.2096 | 1.2712 | 0.0842 | 0.0541 | 0.9905 | 0.9982 |
| TransMVSNet (Ding et al. 2022) + Ours | 0.0067 | 4.9416 | 0.9918 | 0.0510 | 0.0333 | 0.9918 | 0.9984 |
Table 2: Zero-shot generalization from DTU to the BlendedMVS validation set.

| Method | Abs Rel | Abs | δ < 1.25 |
|---|---|---|---|
| MVSNet (Yao et al. 2018) | 0.0915 | 2.6554 | 0.9135 |
| CasMVSNet (Gu et al. 2020) | 0.0665 | 1.7102 | 0.9349 |
| UniMVSNet (Peng et al. 2022) | 0.0825 | 1.8744 | 0.9320 |
| TransMVSNet (Ding et al. 2022) | 0.0657 | 1.9216 | 0.9402 |
| CasMVSNet (Gu et al. 2020) + Ours | 0.0404 | 1.4122 | 0.9604 |
| UniMVSNet (Peng et al. 2022) + Ours | 0.0496 | 1.3128 | 0.9425 |
| TransMVSNet (Ding et al. 2022) + Ours | 0.0376 | 1.2267 | 0.9596 |
Table 3: Quantitative results on the ScanNet test set.

| Method | Abs Rel | Abs | Sq Rel | RMSE | δ < 1.25 |
|---|---|---|---|---|---|
| MVDepth (Wang and Shen 2018) | 0.1167 | 0.2301 | 0.0596 | 0.3236 | 0.8453 |
| DPSNet (Im et al. 2019) | 0.1200 | 0.2104 | 0.0688 | 0.3139 | 0.8640 |
| DELTAS (Sinha et al. 2020) | 0.0915 | 0.1710 | 0.0327 | 0.2390 | 0.9147 |
| NRGBD (Liu et al. 2019) | 0.1013 | 0.1657 | 0.0502 | 0.2500 | 0.9160 |
| PairNet (Wang et al. 2022a) | 0.0895 | 0.1709 | 0.0615 | 0.2734 | 0.9172 |
| ESTD (Long et al. 2021) | 0.0812 | 0.1505 | 0.0298 | 0.2199 | 0.9313 |
| ESTD (Long et al. 2021) + Ours | 0.0753 | 0.1497 | 0.0237 | 0.2149 | 0.9483 |
4 Experiments
We evaluate the proposed Volumetric Probability Diffusion (VPD) on the 3D scene perception tasks of multi-view stereo (MVS) and semantic scene completion (SSC).
4.1 Multi-view Stereo (MVS)
Datasets.
DTU (Aanæs et al. 2016) is a large-scale indoor dataset, which consists of 124 different scenes with 7 different illumination conditions. We split the dataset into training, validation, and test set following the setting of MVSNet (Yao et al. 2018). BlendedMVS (Yao et al. 2020) dataset is a synthetic dataset that consists of 106 training scans and 7 validation scans. ScanNet (Dai et al. 2017) is an RGB-D video dataset that consists of more than 1600 scans, annotated with depth maps and camera poses.
Implementation Details.
Our model is implemented on the PyTorch platform with 4 NVIDIA A100 GPUs. We train our model for 16 epochs on the DTU dataset and 7 epochs on the ScanNet dataset, respectively. For the BlendedMVS dataset, we run tests using the model trained on the DTU dataset to evaluate the generalization ability. The learning rate decays following the same strategy as the baseline networks. During training, the batch size is set to 12 and we adopt Adam as the optimizer. The number of diffusion forward steps is set to 1000 and we adopt 4 iterations in the reverse process.

Performance.
For quantitative evaluation, we adopt standard metrics (Eigen, Puhrsch, and Fergus 2014; Long et al. 2021; Cai, Ji, and Xu 2023), including the absolute relative error (Abs Rel), absolute error (Abs), square relative error (Sq Rel), root mean square error in linear scale (RMSE), threshold distance error (Th), and inlier ratios ($\delta < 1.25^{i}$). As reported in Table 1, our method shows significant improvements compared to TransMVSNet, reducing Abs from 7.2096 to 4.9416.
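For reference, the listed metrics can be computed as in the sketch below; the convention that Th counts the fraction of valid pixels whose absolute error exceeds the threshold (in mm) is an assumption based on the table trends.

```python
import torch

def depth_metrics(pred, gt, mask, thresh_mm=(8.0, 20.0)):
    """Standard depth evaluation metrics over valid pixels (mask > 0):
    Abs Rel, Abs, Sq Rel, RMSE, threshold errors and inlier ratios (delta < 1.25^i)."""
    p, g = pred[mask > 0], gt[mask > 0]
    abs_err = (p - g).abs()
    metrics = {
        "abs_rel": (abs_err / g).mean().item(),
        "abs": abs_err.mean().item(),
        "sq_rel": ((p - g) ** 2 / g).mean().item(),
        "rmse": ((p - g) ** 2).mean().sqrt().item(),
    }
    for t in thresh_mm:                               # fraction of pixels with error > t
        metrics[f"th{int(t)}"] = (abs_err > t).float().mean().item()
    ratio = torch.max(p / g, g / p)
    for i in (1, 2):                                  # inlier ratios
        metrics[f"delta{i}"] = (ratio < 1.25 ** i).float().mean().item()
    return metrics
```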
We evaluate the zero-shot generalization ability of our method from DTU to BlendedMVS validation set without any fine-tuning. As shown in Table 2, our method has a notable performance gain compared to baseline networks, which demonstrates that our approach generalizes well across different datasets without post-processing. Table 3 shows quantitative results on the ScanNet test set. ESTD (Long et al. 2021) with VPD outperforms other methods in terms of accuracy, which indicates that our method also has strong modeling capability for temporal cost volumes.
Moreover, we visualize qualitative results on the DTU test set and the BlendedMVS validation set in Figure 4. Our approach significantly enhances the outcomes of the baseline models, generating more complete depth maps with heightened accuracy, particularly in challenging areas like object boundaries and repetitive patterns.
Table 4: Semantic scene completion results on SemanticKITTI (per-class IoU and mIoU).

| Method | Input | road (15.30%) | sidewalk (11.13%) | parking (1.12%) | other-grnd (0.56%) | building (14.1%) | car (3.92%) | truck (0.16%) | bicycle (0.03%) | motorcycle (0.03%) | other-veh. (0.20%) | vegetation (39.3%) | trunk (0.51%) | terrain (9.17%) | person (0.07%) | bicyclist (0.07%) | motorcyclist (0.05%) | fence (3.90%) | pole (0.29%) | traf.-sign (0.08%) | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MonoScene (Cao and de Charette 2022) | Mono | 54.70 | 27.10 | 24.80 | 5.70 | 14.40 | 18.80 | 3.30 | 0.50 | 0.70 | 4.40 | 14.90 | 2.40 | 19.50 | 1.00 | 1.40 | 0.40 | 11.10 | 3.30 | 2.10 | 11.08 |
| VoxFormer-S (Li et al. 2023b) | Stereo | 53.90 | 25.30 | 21.10 | 5.60 | 19.80 | 20.80 | 3.50 | 1.00 | 0.70 | 3.70 | 22.40 | 7.50 | 21.30 | 1.40 | 2.60 | 0.20 | 11.10 | 5.10 | 4.90 | 12.20 |
| VoxFormer-T (Li et al. 2023b) | Stereo-T | 54.10 | 26.90 | 25.10 | 7.30 | 23.50 | 21.70 | 3.60 | 1.90 | 1.60 | 4.10 | 24.40 | 8.10 | 24.20 | 1.60 | 1.10 | 0.00 | 13.10 | 6.60 | 5.70 | 13.41 |
| SSCNet (Song et al. 2017) | LiDAR | 51.15 | 30.76 | 27.12 | 6.44 | 34.53 | 24.26 | 1.18 | 0.54 | 0.78 | 4.43 | 35.25 | 1.18 | 29.01 | 0.25 | 0.25 | 0.78 | 19.87 | 13.10 | 6.73 | 16.14 |
| StereoScene (Li et al. 2023a) | Stereo | 61.90 | 31.20 | 30.70 | 10.70 | 24.20 | 22.80 | 2.80 | 3.40 | 2.40 | 6.10 | 23.80 | 8.40 | 27.00 | 2.90 | 2.20 | 0.50 | 16.50 | 7.00 | 7.20 | 15.36 |
| StereoScene (Li et al. 2023a) + Ours | Stereo | 61.76 | 32.41 | 20.39 | 11.11 | 24.43 | 32.12 | 7.33 | 3.74 | 2.53 | 9.26 | 25.87 | 8.89 | 37.68 | 2.02 | 0.99 | 0.00 | 10.09 | 11.72 | 7.53 | 16.31 |
4.2 Semantic Scene Completion (SSC)
Datasets.
SemanticKITTI (Behley et al. 2019) is a popular semantic scene completion dataset, which contains 22 outdoor driving scenes. SemanticKITTI holds LiDAR annotations that are voxelized into a 256×256×32 grid of 0.2m voxels. The target voxel grids are labeled with 21 classes (1 free, 1 unknown, and 19 semantics). Following (Li et al. 2023a), we only adopt the RGB images of the dataset as inputs.
Implementation Details.
We extend StereoScene (Li et al. 2023a) with our proposed VPD for the SSC evaluation. More specifically, the geometric cost volume in StereoScene is leveraged as the coarse volume. The whole model is trained on SemanticKITTI for 30 epochs, and AdamW is adopted as the training optimizer following (Li et al. 2023a).
Performance.
For quantitative evaluation, we adopt mIoU (mean Intersection over Union) for the SSC task. We compare our method with other state-of-the-art SSC networks: (1) camera-based methods including MonoScene (Cao and de Charette 2022), VoxFormer (Li et al. 2023b) and StereoScene (Li et al. 2023a), and (2) the LiDAR-based method SSCNet (Song et al. 2017). As shown in Table 4, our method surpasses StereoScene by 0.95 in terms of mIoU, which demonstrates that applying VPD effectively improves the accuracy of depth estimation and thereby enhances the performance of semantic scene completion. It is worth noting that our method even surpasses the LiDAR-based SSCNet in terms of mIoU. Figure 5 visualizes the qualitative results: our method produces more accurate and complete results in large-scale driving scenarios compared with StereoScene.
Table 5: Ablation study of VPD components on the DTU test set with CasMVSNet as the baseline.

| CVP | CACC | OF | Abs Rel | Abs |
|---|---|---|---|---|
|  |  |  | 0.0097 | 7.4381 |
| ✓ |  |  | 0.0083 | 6.3111 |
| ✓ | ✓ |  | 0.0078 | 5.9829 |
| ✓ | ✓ | ✓ | 0.0075 | 5.7275 |
4.3 Ablation Study
We conduct extensive ablation studies on the DTU test set with different model settings. We extend CasMVSNet with VPD under different settings for the evaluation.
Effect of Model Settings.
For the experiment in the first row, we adopt the basic CasMVSNet as the baseline setting. The reverse iterations are set to 4 unless otherwise stated. As shown in Table 5, the VPD framework brings a significant improvement with the basic prior volume condition of CVP. CACC and OF further enhance the depth estimation performance with reliable volumetric learning, reducing Abs from 6.3111 to 5.9829 and from 5.9829 to 5.7275, respectively.
Table 6: Efficiency analysis: run-time and memory consumption on an NVIDIA A100 GPU.

| Method | Abs | Run-time (s) | Memory (GB) |
|---|---|---|---|
| MVSNet (Yao et al. 2018) | 11.55 | 0.75 | 12.74 |
| CasMVSNet (Gu et al. 2020) | 7.44 | 0.41 | 6.32 |
| CasMVSNet (Gu et al. 2020) + Ours (4 Steps) | 5.73 | 0.69 | 8.11 |
| TransMVSNet (Ding et al. 2022) | 7.21 | 0.87 | 8.27 |
| TransMVSNet (Ding et al. 2022) + Ours (1 Step) | 6.27 | 0.96 | 10.48 |
| TransMVSNet (Ding et al. 2022) + Ours (2 Steps) | 5.40 | 1.04 | 10.48 |
| TransMVSNet (Ding et al. 2022) + Ours (4 Steps) | 4.94 | 1.24 | 10.48 |
| TransMVSNet (Ding et al. 2022) + Ours (6 Steps) | 4.79 | 1.39 | 10.48 |
| TransMVSNet (Ding et al. 2022) + Ours (8 Steps) | 4.72 | 1.52 | 10.48 |
Efficiency Analysis.
We report the running time and memory consumption of several schemes built upon different baselines equipped with our proposed VPD on an NVIDIA A100 GPU, detailed in Table 6. As shown in the table, the proposed VPD delivers satisfactory performance improvements with relatively slight increments in time consumption, while the memory consumption of our proposed method remains constant regardless of the number of reverse steps. In addition, the performance gain beyond 4 reverse steps is marginal despite the additional running time; thus we adopt 4 steps as the default setting to balance efficiency and effectiveness.
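For reproducibility, run-time and peak-memory figures of the kind reported in Table 6 are commonly measured as in the sketch below; this is a generic benchmarking routine, not the authors' exact protocol.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, sample_inputs, warmup=3, runs=10):
    """Measure average forward time and peak GPU memory for one inference pass."""
    for _ in range(warmup):
        model(*sample_inputs)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(*sample_inputs)
    torch.cuda.synchronize()
    avg_time = (time.time() - start) / runs
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return avg_time, peak_gb
```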
5 Conclusion
In this work, we propose a novel framework of Volumetric Probability Diffusion (VPD) for 3D scene perception tasks including MVS and SSC. Different from previous single-step approximation solutions, we employ multi-step generative diffusion to progressively model volumetric probability for more reliable estimation. Specifically, we introduce a Confidence-Aware Contextual Collaboration (CACC) module to correct the uncertain regions for reliable target volume approximation. In the sampling process, we develop an Online Filtering (OF) strategy to maintain consistency in estimated volume representations. Our method achieves state-of-the-art performance on multiple MVS/SSC benchmarks and even surpasses the LiDAR-based method with only camera-based inputs.
6 Acknowledgements
This paper is supported in part by NSFC under Grant 62302246 and ZJNSFC under Grant LQ23F010008.
Supplementary Material of VPD
Appendix A Implementation of Task-Specific Head
We report the implementation details for the task-specific head as follows.
For multi-view stereo (MVS), the task-specific head is constructed following different baseline networks (Gu et al. 2020; Peng et al. 2022; Ding et al. 2022; Long et al. 2021). Specifically, in CasMVSNet (Gu et al. 2020) and ESTD (Long et al. 2021), the estimated depth map is computed with differentiable soft-argmin operation along the depth direction of the estimated volume. In UniMVSNet (Peng et al. 2022), unity regression is employed to regress the depth map, which selects the optimal hypothesis and calculates the offset to ground-truth depth. TransMVSNet (Ding et al. 2022) leverages the Winner-Takes-All (WTA) (Cheng et al. 2020) operation along the depth dimension to get the estimated depth map.
For semantic scene completion (SSC), the estimated volume will be upsampled and processed with a softmax layer to get the semantic occupancy prediction following StereoScene (Li et al. 2023a).
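As a reference, minimal sketches of the two task-specific heads described above are given below; the tensor layouts are assumptions, and `ssc_head` only mirrors the upsample-then-softmax behavior rather than the full StereoScene head.

```python
import torch
import torch.nn.functional as F

def soft_argmin_depth(prob_volume, depth_hypotheses):
    """Differentiable soft-argmin regression used by the MVS head (CasMVSNet/ESTD style):
    expected depth under the per-pixel probability distribution over hypothesis planes.
    prob_volume: (B, D, H, W); depth_hypotheses: (B, D, H, W) or (D,)."""
    if depth_hypotheses.dim() == 1:
        depth_hypotheses = depth_hypotheses.view(1, -1, 1, 1)
    return (prob_volume * depth_hypotheses).sum(dim=1)

def ssc_head(volume_logits, target_size):
    """SSC head sketch: upsample the estimated volume to the output grid and apply
    softmax over semantic classes (channel dimension), following StereoScene.
    volume_logits: (B, num_classes, X, Y, Z); target_size: output grid size (X', Y', Z')."""
    up = F.interpolate(volume_logits, size=target_size, mode='trilinear', align_corners=False)
    return F.softmax(up, dim=1)
```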
Appendix B Architectural Details on Diffusion Unet
As described in Section 3 of the main paper, we adopt a 3D UNet (Ronneberger, Fischer, and Brox 2015; Müller et al. 2023) as the architecture of the volumetric diffusion model. Specifically, the 3D UNet consists of three downsample and upsample blocks. Each downsample block contains two residual layers (He et al. 2016) and a CACC module for the contextual feature condition guidance, while each upsample block contains two residual layers. The CACC takes contextual features as inputs to refine the estimated depth volumes in the 3D UNet. The downsample blocks adopt 3D convolutions with stride two for downsampling, and the upsample blocks use trilinear interpolation for upsampling. We employ Group Norm (Wu and He 2018) and SiLU (Elfwing, Uchibe, and Doya 2018) in each block for normalization and activation, respectively. Moreover, the basic prior volume condition is generated from the given cost volume, which is concatenated with the input noisy volume before feeding into the 3D UNet.
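A compact sketch of the building blocks described above is shown below; the channel widths, the group size of GroupNorm, and the omission of the timestep embedding and CACC injection are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock3D(nn.Module):
    """Residual layer with GroupNorm and SiLU, as used in each UNet block."""
    def __init__(self, ch, groups=8):
        super().__init__()
        self.block = nn.Sequential(
            nn.GroupNorm(groups, ch), nn.SiLU(), nn.Conv3d(ch, ch, 3, padding=1),
            nn.GroupNorm(groups, ch), nn.SiLU(), nn.Conv3d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.block(x)

class DownBlock3D(nn.Module):
    """Downsample block: two residual layers followed by a stride-2 3D convolution.
    In the full model, a CACC module refines the features at this scale."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.res = nn.Sequential(ResBlock3D(in_ch), ResBlock3D(in_ch))
        self.down = nn.Conv3d(in_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, x):
        return self.down(self.res(x))

def upsample3d(x):
    """Upsample blocks use trilinear interpolation before their residual layers."""
    return F.interpolate(x, scale_factor=2, mode='trilinear', align_corners=False)
```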


[Figure 7 legend: semantic class colors — bicycle, car, motorcycle, truck, other vehicle, person, bicyclist, motorcyclist, road, parking, sidewalk, other ground, building, fence, vegetation, trunk, terrain, pole, traffic sign]
Appendix C Additional Qualitative Results
We provide additional qualitative results for MVS and SSC. The results further demonstrate the effectiveness of our approach in enhancing the 3D scene perception performance.
Multi-view Stereo (MVS). We visualize more qualitative results on the DTU (Aanæs et al. 2016) test set and the BlendedMVS (Yao et al. 2020) validation set in Figure 6. For the tests on the BlendedMVS dataset, we use the model trained on the DTU dataset to evaluate the generalization ability. As described in Section 3 of the main paper, the initial coarse prior of the VPD framework is continuously improved with the proposed CACC module and OF strategy during the diffusion process. Therefore, compared with the baseline networks, our method generates more complete depth maps and achieves more accurate results in challenging regions (e.g., object boundaries in column 1, repetitive pattern regions in columns 2 and 4).
Semantic Scene Completion (SSC). We visualize more qualitative results on the SemanticKITTI validation set in Figure 7. Compared to StereoScene (Li et al. 2023a), our method shows obvious advancement on small moving objects (e.g., trucks in row 1, cars in row 3 and row 4) and generates more accurate 3D scene layout (e.g., roads in row 2 and row 5).
Appendix D Evaluation of CACC and OF on Baselines
We conduct additional experiments by directly applying CACC and OF to the cost volume of CasMVSNet, as shown in Table 7. The improvements of this implementation are clearly smaller than those obtained by applying CACC and OF within our proposed framework, which we attribute to the fact that CACC and OF are only fully effective within our multi-step generative process.
Table 7: Directly applying CACC and OF to the cost volume of CasMVSNet.

| Method | Abs Rel | Abs |
|---|---|---|
| CasMVSNet | 0.0097 | 7.4381 |
| CasMVSNet + CACC + OF | 0.0091 | 7.1218 |
| CasMVSNet + VPD | 0.0075 | 5.7275 |
References
- Aanæs et al. (2016) Aanæs, H.; Jensen, R.; Vogiatzis, G.; Tola, E.; and Dahl, A. 2016. Large-Scale Data for Multiple-View Stereopsis. International Journal of Computer Vision.
- Behley et al. (2021) Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Gall, J.; and Stachniss, C. 2021. Towards 3D LiDAR-based semantic scene understanding of 3D point cloud sequences: The SemanticKITTI Dataset. The International Journal of Robotics Research, 40(8-9).
- Behley et al. (2019) Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; and Gall, J. 2019. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In ICCV.
- Cai, Ji, and Xu (2023) Cai, C.; Ji, P.; and Xu, Y. 2023. RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo. CVPR.
- Cao and de Charette (2022) Cao, A.-Q.; and de Charette, R. 2022. Monoscene: Monocular 3d semantic scene completion. In CVPR.
- Chen et al. (2020) Chen, P.-H.; Yang, H.-C.; Chen, K.-W.; and Chen, Y.-S. 2020. Mvsnet++: Learning depth-based attention pyramid features for multi-view stereo. IEEE Transactions on Image Processing, 29.
- Cheng et al. (2020) Cheng, X.; Zhong, Y.; Harandi, M.; Dai, Y.; Chang, X.; Li, H.; Drummond, T.; and Ge, Z. 2020. Hierarchical neural architecture search for deep stereo matching. NeurIPS, 33.
- Dai et al. (2017) Dai, A.; Chang, A. X.; Savva, M.; Halber, M.; Funkhouser, T.; and Nießner, M. 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR.
- Deng (2021) Deng, L. L. T. 2021. RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching. 3DV.
- Ding et al. (2022) Ding, Y.; Yuan, W.; Zhu, Q.; Zhang, H.; Liu, X.; Wang, Y.; and Liu, X. 2022. Transmvsnet: Global context-aware multi-view stereo network with transformers. In CVPR.
- Eigen, Puhrsch, and Fergus (2014) Eigen, D.; Puhrsch, C.; and Fergus, R. 2014. Depth map prediction from a single image using a multi-scale deep network. NeurIPS, 27.
- Elfwing, Uchibe, and Doya (2018) Elfwing, S.; Uchibe, E.; and Doya, K. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks.
- Gu et al. (2020) Gu, X.; Fan, Z.; Zhu, S.; Dai, Z.; Tan, F.; and Tan, P. 2020. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In CVPR.
- He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
- Im et al. (2019) Im, S.; Jeon, H.-G.; Lin, S.; and Kweon, I. S. 2019. Dpsnet: End-to-end deep plane sweep stereo. arXiv preprint arXiv:1905.00538.
- Kitaev, Kaiser, and Levskaya (2020) Kitaev, N.; Kaiser, L.; and Levskaya, A. 2020. Reformer: The Efficient Transformer. In ICLR.
- Li et al. (2023a) Li, B.; Sun, Y.; Jin, X.; Zeng, W.; Zhu, Z.; Wang, X.; Zhang, Y.; Okae, J.; Xiao, H.; and Du, D. 2023a. StereoScene: BEV-Assisted Stereo Matching Empowers 3D Semantic Scene Completion. arXiv preprint arXiv:2303.13959.
- Li et al. (2018) Li, Y.; Yu, R.; Shahabi, C.; and Liu, Y. 2018. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In ICLR.
- Li et al. (2023b) Li, Y.; Yu, Z.; Choy, C.; Xiao, C.; Alvarez, J. M.; Fidler, S.; Feng, C.; and Anandkumar, A. 2023b. VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion. CVPR.
- Liu et al. (2019) Liu, C.; Gu, J.; Kim, K.; Narasimhan, S. G.; and Kautz, J. 2019. Neural RGB→D sensing: Depth and uncertainty from a video camera. In CVPR.
- Long et al. (2021) Long, X.; Liu, L.; Li, W.; Theobalt, C.; and Wang, W. 2021. Multi-view depth estimation using epipolar spatio-temporal networks. In CVPR, 8258–8267.
- Luo and Hu (2021) Luo, S.; and Hu, W. 2021. Diffusion probabilistic models for 3d point cloud generation. In CVPR, 2837–2845.
- Mao and Sejdić (2022) Mao, S.; and Sejdić, E. 2022. A review of recurrent neural network-based methods in computational physiology. IEEE Transactions on Neural Networks and Learning Systems.
- Mayer et al. (2016) Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; and Brox, T. 2016. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR.
- Miao et al. (2023) Miao, R.; Liu, W.; Chen, M.; Gong, Z.; Xu, W.; Hu, C.; and Zhou, S. 2023. OccDepth: A Depth-Aware Method for 3D Semantic Scene Completion. arXiv preprint arXiv:2302.13540.
- Müller et al. (2023) Müller, N.; Siddiqui, Y.; Porzi, L.; Bulò, S. R.; Kontschieder, P.; and Nießner, M. 2023. DiffRF: Rendering-Guided 3D Radiance Field Diffusion. In CVPR.
- Okae et al. (2021) Okae, J.; Li, B.; Du, J.; and Hu, Y. 2021. Robust Scale-Aware Stereo Matching Network. IEEE Transactions on Artificial Intelligence.
- Peng et al. (2022) Peng, R.; Wang, R.; Wang, Z.; Lai, Y.; and Wang, R. 2022. Rethinking depth estimation for multi-view stereo: A unified representation. In CVPR.
- Philion and Fidler (2020) Philion, J.; and Fidler, S. 2020. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV.
- Ramesh et al. (2021) Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; and Sutskever, I. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning, 8821–8831. PMLR.
- Rombach et al. (2022) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution image synthesis with latent diffusion models. In CVPR, 10684–10695.
- Ronneberger, Fischer, and Brox (2015) Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In MICCAI.
- Saharia et al. (2022) Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D. J.; and Norouzi, M. 2022. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Shao et al. (2022) Shao, R.; Zheng, Z.; Zhang, H.; Sun, J.; and Liu, Y. 2022. Diffustereo: High quality human reconstruction via diffusion-based stereo using sparse cameras. In ECCV 2022.
- Shen et al. (2021) Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; and Li, H. 2021. Efficient attention: Attention with linear complexities. In WACV.
- Sinha et al. (2020) Sinha, A.; Murez, Z.; Bartolozzi, J.; Badrinarayanan, V.; and Rabinovich, A. 2020. Depth estimation by learning triangulation and densification of sparse points for multi-view stereo. arXiv preprint arXiv:2003.08933.
- Sohl-Dickstein et al. (2015) Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; and Ganguli, S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML. PMLR.
- Song et al. (2017) Song, S.; Yu, F.; Zeng, A.; Chang, A. X.; Savva, M.; and Funkhouser, T. 2017. Semantic scene completion from a single depth image. In CVPR.
- Wang et al. (2022a) Wang, F.; Galliani, S.; Vogel, C.; and Pollefeys, M. 2022a. IterMVS: iterative probability estimation for efficient multi-view stereo. In CVPR.
- Wang et al. (2021) Wang, F.; Galliani, S.; Vogel, C.; Speciale, P.; and Pollefeys, M. 2021. PatchmatchNet: Learned Multi-View Patchmatch Stereo. In CVPR.
- Wang and Shen (2018) Wang, K.; and Shen, S. 2018. Mvdepthnet: Real-time multiview depth estimation neural network. In 3DV.
- Wang et al. (2020) Wang, Q.; Shi, S.; Zheng, S.; Zhao, K.; and Chu, X. 2020. Fadnet: A fast and accurate network for disparity estimation. In ICRA.
- Wang et al. (2022b) Wang, X.; Zhu, Z.; Huang, G.; Qin, F.; Ye, Y.; He, Y.; Chi, X.; and Wang, X. 2022b. MVSTER: epipolar transformer for efficient multi-view stereo. In ECCV.
- Wu and He (2018) Wu, Y.; and He, K. 2018. Group normalization. In ECCV.
- Xie et al. (2023) Xie, B.; Li, B.; Zhang, Z.; Dong, J.; Jin, X.; Yang, J.; and Zeng, W. 2023. NaviNeRF: NeRF-based 3D Representation Disentanglement by Latent Semantic Navigation. In ICCV.
- Xu et al. (2023) Xu, G.; Wang, X.; Ding, X.; and Yang, X. 2023. Iterative Geometry Encoding Volume for Stereo Matching. In CVPR.
- Yao et al. (2018) Yao, Y.; Luo, Z.; Li, S.; Fang, T.; and Quan, L. 2018. Mvsnet: Depth inference for unstructured multi-view stereo. In ECCV.
- Yao et al. (2019) Yao, Y.; Luo, Z.; Li, S.; Shen, T.; Fang, T.; and Quan, L. 2019. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In CVPR.
- Yao et al. (2020) Yao, Y.; Luo, Z.; Li, S.; Zhang, J.; Ren, Y.; Zhou, L.; Fang, T.; and Quan, L. 2020. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR.
- Yin, Darrell, and Yu (2019) Yin, Z.; Darrell, T.; and Yu, F. 2019. Hierarchical discrete distribution decomposition for match density estimation. In CVPR, 6044–6053.
- Zhang et al. (2020) Zhang, Y.; Chen, Y.; Bai, X.; Yu, S.; Yu, K.; Li, Z.; and Yang, K. 2020. Adaptive unimodal cost volume filtering for deep stereo matching. In AAAI.