
EDNet: Efficient Disparity Estimation with Cost Volume Combination and Attention-based Spatial Residual

Songyan Zhang2, Zhicheng Wang12, Qiang Wang3, Jinshuo Zhang2, Gang Wei2, Xiaowen Chu3
*Corresponding author.
2CAD Research Center, Tongji University.
3Department of Computer Science, Hong Kong Baptist University.
{spyder, zhichengwang, zhangjinshuo, weigang}@tongji.edu.cn, {qiangwang, chxw}@comp.hkbu.edu.hk
Abstract

Existing state-of-the-art disparity estimation works mostly leverage the 4D concatenation volume and construct a very deep 3D convolution neural network (CNN) for disparity regression, which is inefficient due to the high memory consumption and slow inference speed. In this paper, we propose a network named EDNet for efficient disparity estimation. Firstly, we construct a combined volume which incorporates contextual information from the squeezed concatenation volume and feature similarity measurement from the correlation volume. The combined volume can then be aggregated by 2D convolutions, which are faster and require less memory than 3D convolutions. Secondly, we propose an attention-based spatial residual module to generate attention-aware residual features. The attention mechanism is applied to provide intuitive spatial evidence about inaccurate regions with the help of error maps at multiple scales and thus improve the residual learning efficiency. Extensive experiments on the Scene Flow and KITTI datasets show that EDNet outperforms the previous 3D CNN based works and achieves state-of-the-art performance with significantly faster speed and less memory consumption.

I Introduction

Accurate and fast depth estimation is of great significance to many applications like robot navigation, 3D reconstruction and autonomous driving. Instead of regressing depth from a single-view RGB image, stereo matching conducts correspondence analysis between pixels of stereo images and computes the disparity $d$ for each pixel. Depth can then be calculated as $\frac{fB}{d}$, where $f$ is the camera's focal length and $B$ is the distance between the two camera centers, also called the baseline in stereo vision.
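As a hypothetical numerical example, a camera with focal length $f = 720$ pixels and baseline $B = 0.54$ m observing a pixel with disparity $d = 36$ pixels yields a depth of $\frac{fB}{d} = \frac{720 \times 0.54}{36} \approx 10.8$ m; larger disparities correspond to closer objects.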

While traditional methods based on hand-crafted feature extraction and matching cost aggregation tend to fail in textureless and repetitive regions of the images, convolutional neural networks (CNNs) have been widely adopted to overcome these difficulties in stereo matching. Several recent methods [1, 2, 3, 4] have achieved state-of-the-art performance by constructing a 4D concatenation volume that is followed by 3D convolution blocks for aggregation. Although the 4D concatenation cost volume preserves rich contextual information in conjunction with the strong regularization ability of 3D convolutions, it significantly increases the computation cost and usually cannot support real-time disparity inference. Moreover, the concatenation volume incorporates no feature similarity measurement, which means that the model has to learn correspondence from scratch. Alternatively, DispNetC [5] formed a low-cost correlation layer with 2D convolutions to conduct correspondence analysis between the left and right feature maps. The following works [6, 7, 8] adopted a similar method as it keeps a good balance between speed and accuracy. However, as the correlation map is produced with only a single feature channel for each disparity level, the performance is less competitive. This raises the question of how to make full use of the complementary advantages of the concatenation volume and the correlation volume.

Refer to caption
Figure 1: The first row visualizes the residual learning process from scale 3 to scale 2. The residual_scale2 is learned from the disparity_scale3 for correcting the disparity_scale2. With our proposed modules, sharp edges and overall structures can be recovered. Other state-of-the-art methods fail to generate accurate disparity estimates in low-texture regions, as shown in the second row. Please pay attention to the regions pointed to by the red arrow.

Since ResNet [9] has revealed that the residual convolution block can improve training efficiency by learning a residual mapping instead of the desired underlying one, it has been adopted as a popular approach to refine disparity estimation [10, 11, 6, 12]. Specifically for stereo matching, learning an additive correction to the coarse disparity map is easier and more efficient than directly learning the fine-grained one. However, some works failed to provide the residual learning module with the fitting error information [13, 14], or computed the estimated error at only one scale [10, 8, 15] but adopted it to learn the disparity maps at multiple scales. The error map from a single scale cannot provide precise error information at other scales, which makes the residual learning method less effective. Furthermore, even if the error map is provided at each corresponding scale [11], the conventional residual learning method has no explicit spatial guidance about where to intervene. As the regions with inaccurate estimation deserve more attention, we argue that residual learning could be more efficient if spatial attention about the learning errors is provided.

To address the above issues, we propose EDNet which is composed of a combined volume to generate robust feature representations, and an attention-based residual module to learn the disparity refinement. Firstly, the proposed combined volume alleviates the information loss by employing the squeezed concatenation volume and preserves the feature similarity measurement with the correlation volume. We then adopt 2D convolutions for further aggregation so that the significant memory consumption and computational complexity of 3D CNNs can be avoided. Secondly, inspired by the attention mechanism, we adopt a spatial attention module to generate the attention-aware residual features. Therefore, the residual learning module can have intuitive spatial evidence about inaccurate regions to compute a specific correction. We follow the coarse-to-fine strategy and compute the attention-aware residuals across different scales. With the error maps provided at each scale, the residual module can learn a corresponding correction accordingly and improve the learning efficiency. As shown in Figure 1, our network can generate an accurate and continuous disparity map even in low-texture regions. The contributions of our work can be summarized as follows:

  • We propose a low-cost but effective method to aggregate the 3D correlation features and 4D concatenation volume together by constructing a combined volume, which can be further processed by fast 2D convolutions. Compared with others, our correspondence analysis can preserve both the contextual information and feature similarities even with 2D convolutions.

  • We design an Attention-based Residual (AR) module to learn the disparity refinement at each scale. In the AR module, the attention mechanism is applied to the concatenated maps of RGB image, estimated disparity and estimated error to improve the learning efficiency.

  • Compared to those existing methods based on 3D CNNs, our proposed EDNet achieves state-of-the-art accuracy on the public Scene Flow [5] and KITTI [16, 17] datasets with up to 76.5% runtime memory reduction and 45× inference speed acceleration.

The rest of the paper is organized as follows. We introduce some related studies about stereo matching based on CNNs in Section II. Section III introduces the methodology and implementation of our proposed EDNet. We demonstrate our experimental results in Section IV. We finally conclude the paper in Section V.

II Related Works

A classical stereo matching pipeline consists of four steps [18]. In recent years, CNNs have drawn great attention and been introduced to tackle the stereo matching task [19]. In this section, we briefly discuss those common mechanisms for computing the matching cost with CNNs and review the approaches with the residual learning method.

II-A Matching Cost Computation

CNN based matching cost computation methods make a great contribution to stereo matching accuracy. There are two popular approaches for matching cost computation. The first one uses a layer of 2D [20] or 1D [5] convolutional operations, called a correlation layer. Such an inner product between feature vectors is adopted in [21, 22, 6, 7]. Liang et al. [6] build a correlation volume for initial disparity estimation, followed by a disparity refinement module that learns through feature constancy. Wang et al. [10] make some modifications and propose a point-wise correlation volume to preserve fast computation. Another popular method to compute the matching cost is to form a 4D volume by concatenating the corresponding features from the opposite stereo images across each disparity level. 3D convolutions are then applied to aggregate features and regress disparity. This method was first introduced in [1]. Chang et al. [2] improve Kendall's approach [1] by designing a spatial pyramid pooling module [23] so that correspondence estimation can benefit from image features with rich object context information. Guo et al. [3] combine the concatenation volume with a group-wise correlation volume and improve the accuracy with 3D convolutions. The best performance on the Scene Flow dataset comes from [24], which introduces the idea of DenseNet [25] to further improve PSMNet [2]. Zhang et al. propose GANet [4] with two guided aggregation layers and fifteen 3D convolutions to achieve state-of-the-art performance.

II-B Residual Learning for Stereo Matching

The residual learning concept was proposed by He et al. [9]; it turns out to be an efficient way to train a CNN model and has been adopted by many works. In the stereo matching task, the residual learning strategy is widely used for refining disparity estimation [6, 10, 26, 12]. Pang et al. [8] present a cascade residual learning scheme and adopt a two-stage CNN, in which the second stage refines the estimation by producing residual signals. Stucker et al. [27] build a U-Net [28] based network specifically to enhance the reconstruction by regressing a residual correction. In order to meet the need for real-time inference, [29] adopts a residual learning strategy to flexibly output disparity estimates according to the requirements of the application. Song et al. [11] aggregate edge information for residual learning and thus construct a multi-task network for edge detection and stereo matching.

Refer to caption
Figure 2: An overview of our proposed EDNet. We exploit the architecture of DispNetC as the backbone. The attention-based spatial residual module and combined volume are proposed for accuracy and efficiency improvement. In order to provide better visualization, the skip connections between the encoder and decoder networks in DispNetC and some other data flow are omitted here.

III Methodology

III-A Network Architecture

The architecture of our proposed EDNet is shown in Figure 2. We exploit the structure of DispNetC [5] as the backbone with extensive modifications. For feature extraction, the last left and right feature maps of conv3 from the weight-sharing encoder network are used to form the combined volume, which is composed of a squeezed concatenation volume and a correlation volume, as discussed in Section III-B. The details of our proposed cost volume combination method can be found in the bottom left corner of Figure 2. 2D convolutions are then employed to aggregate the combined volume and regress the disparity. In the decoder part, we follow the coarse-to-fine strategy to refine the disparity progressively. The spatial attention module is applied in order to generate attention-aware residual features, which will be introduced in Section III-C. The stacked hourglass module of PSMNet [2] is used for residual regression but is implemented with 2D convolutions. The attention-based spatial residual module is illustrated in the bottom right corner of Figure 2. Different from DispNetC [5], which has 6 scales of output, we reduce the disparity prediction to 4 scales, removing the predictions at 1/16 and 1/32 of full resolution.

III-B Cost Volume Combination

Previous works simply build a correlation volume [5, 8, 10] or a 4D concatenation volume [1, 3, 2], which is followed by 2D or 3D convolutions for aggregation. However, a single cost volume cannot meet the need of preserving contextual information and feature similarity at the same time. GwcNet [3] improves the performance by combining the group-wise correlation and concatenation volumes, but 3D convolutions are required for aggregation, which leads to higher memory consumption and more complex computation. To this end, we propose to combine the correlation volume and the 4D concatenation volume in a more efficient way, aiming to take advantage of both cost volumes.

Given a pair of stereo features $\textbf{f}_{L}$ and $\textbf{f}_{R}$, we follow the 1D correlation in DispNetC [5] to calculate the correspondence at each disparity level $d$. The correlation volume is computed as:

$$\textbf{C}_{corr}(d,x,y)=\frac{1}{N}\langle\textbf{f}_{L}(x-d,y),\textbf{f}_{R}(x,y)\rangle \qquad (1)$$

where $\langle x_{1},x_{2}\rangle$ is the inner product of two feature vectors $x_{1}$ and $x_{2}$, and $N$ is the channel number of the input features. The shape of the correlation volume is $N\times D\times H\times W$, where $N$ denotes the batch size, $D$ is the estimated disparity range and the spatial size is $H\times W$. Then we construct the 4D concatenation volume by concatenating the left and right feature maps, i.e.,

$$\textbf{C}_{concat}(d,x,y)=\mathrm{Concat}\{\textbf{f}_{L}(x-d,y),\textbf{f}_{R}(x,y)\} \qquad (2)$$

After obtaining the concatenation volume with the shape $N\times D\times C\times H\times W$, we use three 3D convolutions for aggregation and compress it into one channel. The aggregated concatenation volume now has the shape $N\times D\times 1\times H\times W$. It is then squeezed into $N\times D\times H\times W$, the same shape as the correlation volume. The correlation volume and the squeezed concatenation volume are finally concatenated to form the combined volume. In this way, both the contextual information and the feature similarity measurement are incorporated in the combined volume. Further aggregation can be done by 2D convolutions instead of 3D convolutions, which are more efficient. The final combined volume is formed as:

$$\textbf{C}_{comb}(x,y)=\mathrm{Concat}\{\textbf{C}_{corr}(x,y),\textbf{C}_{concat}(x,y)\} \qquad (3)$$
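As a concrete illustration, the following PyTorch sketch builds the correlation volume of Eq. (1), the concatenation volume of Eq. (2), the three-layer 3D-convolution squeeze, and the combination of Eq. (3). The channel widths of the 3D convolutions and the module interface are assumptions made for illustration, not the released implementation, and the shifting convention follows the equations as written.

```python
import torch
import torch.nn as nn


class CombinedCostVolume(nn.Module):
    """Sketch of the cost-volume combination (Eqs. 1-3)."""

    def __init__(self, feat_channels, max_disp):
        super().__init__()
        self.max_disp = max_disp
        # three 3D convolutions that aggregate the concatenation volume
        # and compress it to a single channel (widths are assumptions)
        self.squeeze = nn.Sequential(
            nn.Conv3d(2 * feat_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(32, 1, 3, padding=1),
        )

    def forward(self, feat_l, feat_r):
        n, c, h, w = feat_l.shape
        corr = feat_l.new_zeros(n, self.max_disp, h, w)
        # concatenation volume stored channel-first for Conv3d: (N, 2C, D, H, W)
        concat = feat_l.new_zeros(n, 2 * c, self.max_disp, h, w)
        for d in range(self.max_disp):
            l = feat_l[:, :, :, : w - d]            # f_L(x - d, y)
            r = feat_r[:, :, :, d:]                 # f_R(x, y)
            corr[:, d, :, d:] = (l * r).mean(dim=1)  # Eq. (1)
            concat[:, :c, d, :, d:] = l              # Eq. (2)
            concat[:, c:, d, :, d:] = r
        squeezed = self.squeeze(concat).squeeze(1)   # (N, D, H, W)
        return torch.cat((corr, squeezed), dim=1)    # Eq. (3): (N, 2D, H, W)
```

The resulting (N, 2D, H, W) tensor can then be aggregated with ordinary 2D convolutions, as described above.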

III-C Attention-based Spatial Residual

The normal residual learning method lacks the spatial evidence about where the errors occur. We propose an attention-based spatial residual module to guide the residual learning process to pay more attention to those inaccurate regions across the whole spatial space. According to the estimated disparity $\hat{d}^{s}$ at scale $s$, a synthesized left image $\tilde{I}_{L}^{s}$ can be obtained by warping the right image $I_{R}^{s}$, i.e.,

$$\tilde{I}_{L}^{s}(x,y)=I_{R}^{s}(x+\hat{d}^{s}(x,y),y) \qquad (4)$$
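The warping of Eq. (4) can be realized with a differentiable bilinear sampler. A minimal sketch using torch.nn.functional.grid_sample is given below; the sign convention is taken directly from Eq. (4), and the function name is illustrative.

```python
import torch
import torch.nn.functional as F


def warp_right_to_left(img_r, disp):
    """Synthesize the left view from the right image following Eq. (4):
    I~_L(x, y) = I_R(x + d(x, y), y).

    img_r: (N, 3, H, W) right image; disp: (N, 1, H, W) estimated disparity.
    """
    n, _, h, w = img_r.shape
    # base pixel grid
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img_r.device, dtype=img_r.dtype),
        torch.arange(w, device=img_r.device, dtype=img_r.dtype),
        indexing="ij",
    )
    x_src = xs.unsqueeze(0) + disp.squeeze(1)   # sample at x + d, per Eq. (4)
    y_src = ys.unsqueeze(0).expand_as(x_src)
    # normalize coordinates to [-1, 1] for grid_sample
    grid = torch.stack(
        (2.0 * x_src / (w - 1) - 1.0, 2.0 * y_src / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(img_r, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```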

With the warped left image and the target left image, we can obtain the error map $E_{L}^{s}=|\tilde{I}_{L}^{s}-I_{L}^{s}|$. A spatial attention module with three layers of 2D convolutions, which are 1×1, 3×3 and 1×1 respectively, is applied. The spatial attention feature map $\textbf{f}_{a}^{s}$ is compressed into one channel followed by the sigmoid function to compute the spatial attention vector, whose size is $N\times 1\times H\times W$. The input of the error map and the color stereo images enables the spatial attention module to learn an attention distribution on blurry object boundaries and mismatched pixels. Akin to FADNet [10], CRL [8] and FlowNet2 [15], the input to the spatial attention module is the concatenation of the stereo images, the error map and the estimated disparity map.

We follow DispNetC [5] to preserve both the high-level information and the local information by skip connections. The concatenation of the 'upconvolution' feature maps from the decoder network and the corresponding feature maps from the encoder network is then concatenated with the input of the attention module to form the residual feature maps $\textbf{f}_{r}^{s}\in\mathbb{R}^{H\times W\times C}$. The final attention-aware residual features $\textbf{f}_{ar}^{s}\in\mathbb{R}^{H\times W\times C}$ at scale $s$ are computed by multiplying the attention vector and the residual features $\textbf{f}_{r}^{s}$, i.e.,

$$\textbf{f}_{ar}^{s}=\textbf{f}_{r}^{s}\otimes\sigma(\textbf{f}_{a}^{s}) \qquad (5)$$

where $\sigma(\cdot)$ denotes the sigmoid function. The attention-aware residual features are then input to the stacked hourglass module for residual regression. The stacked hourglass module has the same encoder-decoder structure as in PSMNet [2] but is implemented with 2D convolutions.
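A minimal sketch of the spatial attention gating described above and in Eq. (5), assuming a hypothetical intermediate channel width of 32; only the 1×1, 3×3, 1×1 convolution stack, the single-channel compression and the sigmoid gating follow the text.

```python
import torch
import torch.nn as nn


class SpatialAttentionResidual(nn.Module):
    """Sketch of the attention-based spatial residual gating, Eq. (5)."""

    def __init__(self, attn_in_channels):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv2d(attn_in_channels, 32, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),   # compress to a single channel
        )

    def forward(self, attn_input, residual_feat):
        # attn_input: concatenation of stereo images, error map and disparity
        # residual_feat: residual feature maps f_r^s at the same scale
        gate = torch.sigmoid(self.attention(attn_input))   # (N, 1, H, W)
        return residual_feat * gate                        # attention-aware f_ar^s
```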

The attention-based spatial residual module is repeated 3 times as we increase the resolution of the disparity map progressively. Different from the aforementioned works [10, 8], we compute error maps across multiple scales as the error information changes accordingly after the correction. Therefore, we provide the residual learning module with constantly updated error maps across multiple scales instead of a single error map at full resolution.

III-D Multi-scale Residual Learning

Instead of building a cascade architecture with residual refinement at a second stage [15, 8, 10], we simply replace the direct disparity estimation with a residual estimation over all scales except the smallest scale $S$, at which the initial disparity is computed. The multi-scale residual outputs are denoted as $\{r^{s}\}_{s=0}^{S-1}$, where 0 represents the scale of full resolution. For the remaining $S$ scales, the estimated disparity at the previous scale is first upsampled to the current scale as $\hat{d}_{up}^{s}$ using bilinear interpolation and then added to the residual for refinement. The final predicted disparity $\hat{d}^{s}$ at scale $s$ is produced as:

$$\hat{d}^{s}=\hat{d}_{up}^{s}+r^{s},\quad 0\leq s\leq S-1 \qquad (6)$$
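A sketch of one refinement step of Eq. (6); the ×2 rescaling of disparity values after bilinear upsampling is a common convention and an assumption here, as the text does not state it explicitly.

```python
import torch.nn.functional as F


def refine_disparity(disp_coarse, residual):
    """One coarse-to-fine step, Eq. (6): upsample the previous estimate and
    add the residual predicted at the current scale.

    disp_coarse: (N, 1, H, W) disparity from the previous (coarser) scale.
    residual:    (N, 1, 2H, 2W) residual r^s predicted at the current scale.
    """
    disp_up = F.interpolate(disp_coarse, scale_factor=2,
                            mode="bilinear", align_corners=False)
    # Rescaling disparity magnitudes by the resolution ratio (x2) is a common
    # convention when upsampling; the paper does not state it explicitly.
    disp_up = disp_up * 2.0
    return disp_up + residual
```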

III-E Loss Function

Given the output disparity maps at different scales, we adopt the pixel-wise smooth L1 loss to train our EDNet at scale $s$:

$$L^{s}(d^{s},\hat{d}^{s})=\frac{1}{N}\sum_{i=1}^{N}\mathrm{smooth}_{L_{1}}(d_{i}^{s}-\hat{d}_{i}^{s}) \qquad (7)$$

where $N$ is the number of pixels of the disparity map, $\hat{d}_{i}^{s}$ is the $i$-th element of the predicted disparity $\hat{d}^{s}$, and $d^{s}$ represents the ground truth disparity. The smooth L1 loss function is:

$$\mathrm{smooth}_{L_{1}}(x)=\begin{cases}0.5x^{2}, & \text{if } |x|<1\\ |x|-0.5, & \text{otherwise}\end{cases} \qquad (8)$$

The final loss function is a combination of losses over all scales, i.e.,

$$L=\sum_{s=0}^{S}\lambda^{s}L^{s}(d^{s},\hat{d}^{s}) \qquad (9)$$

where $\lambda^{s}$ is a scalar for adjusting the loss weight at scale $s$.
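A sketch of the training objective of Eqs. (7)-(9); how the ground truth is matched to the lower-resolution outputs is not specified in the text, so the downsampling below is one reasonable choice.

```python
import torch.nn.functional as F


def multi_scale_loss(pred_disps, gt_disp, weights):
    """Weighted smooth-L1 loss over all output scales, Eqs. (7)-(9).

    pred_disps: list of (N, 1, H_s, W_s) predictions, index 0 = full resolution.
    gt_disp:    (N, 1, H, W) full-resolution ground truth.
    weights:    per-scale weights lambda^s, e.g. [1.0, 0.8, 0.8, 0.6].
    """
    total = 0.0
    for weight, pred in zip(weights, pred_disps):
        # Downsample the ground truth to the prediction's resolution; the
        # paper does not specify this step, so it is one simple choice.
        # Sparse ground truth (KITTI) would additionally require a validity mask.
        gt_s = F.interpolate(gt_disp, size=pred.shape[-2:],
                             mode="bilinear", align_corners=False)
        total = total + weight * F.smooth_l1_loss(pred, gt_s)
    return total
```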

IV Performance Evaluation

IV-A Datasets and Evaluation Metrics

Three public datasets are used for training and testing EDNet. The Scene Flow dataset [5] consists of 39,824 pairs of synthetic stereo RGB images (35,454 pairs for training and 4,370 pairs for testing) with a full resolution of 960×540. Both KITTI 2012 [17] and KITTI 2015 [16] are datasets of real scenes with a full resolution of 1242×375. The ground truth of these two datasets is generated by LiDAR, so only sparse ground truth is available. We evaluate our model on the Scene Flow dataset with the end-point error (EPE), 1-pixel error and 3-pixel error. The end-point error is the mean disparity error in pixels, while the 1-pixel error and 3-pixel error measure the percentage of pixels whose EPE is larger than 1 pixel and 3 pixels, respectively. The official metrics (e.g., D1-all) are reported for evaluation on the KITTI 2012 and KITTI 2015 datasets.
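For reference, the evaluation metrics above can be computed as in the following sketch; the helper name and mask handling are illustrative, and the validity mask is needed for the sparse KITTI ground truth.

```python
import torch


def disparity_metrics(pred, gt, valid_mask=None):
    """End-point error and >1px / >3px error rates, as described above.

    pred, gt: disparity maps of the same shape; valid_mask (bool) restricts
    the evaluation to pixels that have ground truth.
    """
    err = (pred - gt).abs()
    if valid_mask is not None:
        err = err[valid_mask]
    epe = err.mean().item()
    err_1px = (err > 1.0).float().mean().item() * 100.0   # percentage
    err_3px = (err > 3.0).float().mean().item() * 100.0
    return epe, err_1px, err_3px
```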

| Method | EPE | >1px (%) | >3px (%) |
| --- | --- | --- | --- |
| EDNet-NRS | 1.67 | 27.9 | 9.7 |
| EDNet-NRCo | 1.89 | 29.9 | 10.7 |
| EDNet-NR | 1.63 | 27.4 | 9.8 |
| EDNet-NA | 1.04 | 12.7 | 5.4 |
| EDNet-NS | 1.07 | 13.1 | 5.8 |
| EDNet-F | 1.00 | 12.2 | 5.4 |

TABLE I: Evaluation of EDNet with different settings. We compute the end-point error (EPE), 1-pixel error and 3-pixel error on the Scene Flow dataset. We use "Co", "S", "R", "A" to denote the correlation volume, squeezed concatenation volume, normal residual, and attention-based residual respectively; "N" stands for "not applied", so each model's configuration is encoded in its name. "EDNet-F" represents the model with all our proposed components.

IV-B Implementation Details

We implemented our EDNet in PyTorch [30] and trained the model with the Adam optimizer ($\beta_1=0.9$, $\beta_2=0.999$). For the Scene Flow dataset, raw images are randomly cropped to 320×640 as input. Training is performed on 2 NVIDIA RTX 2080 Ti GPUs for 70 epochs with a batch size of 8 (4 on each GPU). We follow the training strategy of AANet [12]: the initial learning rate is set to 0.001 and halved every 10 epochs after the 20th epoch. The loss weights are set to $\lambda_0=1.0$, $\lambda_1=\lambda_2=0.8$, $\lambda_3=0.6$. The crop size for KITTI is set to 256×512. Due to the insufficient number of training samples in both KITTI 2012 and KITTI 2015, the pre-trained Scene Flow model is fine-tuned on the mixed KITTI 2012 and KITTI 2015 training sets for the first 1000 epochs, followed by another 400 epochs of training on each dataset separately to obtain the respective submission results. We use a constant learning rate of 0.0001 for the KITTI datasets. Inspired by [31], which suggests that searching for correspondence at a coarse scale can be beneficial, especially in low-texture or textureless regions, the loss weights are set to $\lambda_0=0.6$, $\lambda_1=\lambda_2=0.8$, $\lambda_3=1.0$. For all datasets, color normalization is applied using the ImageNet [32] mean ([0.485, 0.456, 0.406]) and standard deviation ([0.229, 0.224, 0.225]) for data pre-processing. The maximum disparity is set to 192 pixels.
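A sketch of the Scene Flow optimization setup described above, under one reading of the learning-rate schedule; the placeholder model stands in for the full EDNet network.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1)  # placeholder module, not the real EDNet

# Adam with beta1 = 0.9, beta2 = 0.999 and an initial learning rate of 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

def lr_factor(epoch):
    # constant learning rate for the first 20 epochs, then halved every
    # 10 epochs (one reading of the schedule described above)
    return 1.0 if epoch < 20 else 0.5 ** ((epoch - 20) // 10 + 1)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```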

IV-C Ablation Study

To validate the effectiveness of each component in EDNet, we evaluate our model with different configurations on the Scene Flow dataset. All the experimental results are obtained after 10 epochs of training. As shown in Table I, removing either the combined volume or the attention-based spatial residual leads to a clear drop in performance.

Refer to caption
Figure 3: Visualization of the ablation study results on the Scene Flow dataset. Our proposed combined volume and spatial residual module bring great performance improvements, especially in the bounding box areas.

Cost Volume Combination: As shown in Figure 3, models without the combined volume suffer from inaccurate and discontinuous disparity estimation, especially in low-texture regions. A possible reason is that a single cost volume cannot avoid information loss, which leads to less robust feature representations. As shown in Table I, the EPE decreases from 1.07 for EDNet-NS to 1.00 for EDNet-F. The comparison among EDNet-NRS, EDNet-NRCo and EDNet-NR validates the effectiveness of our proposed combined volume as well.

Refer to caption
Figure 4: Comparison among models with different numbers of error maps. The multi-scale error maps speed up the convergence and achieve a lower loss. “n scale(s)” indicates that the error map is applied to the residual features from scale_0 to scale_n.

Attention-based Spatial Residual: Comparing EDNet-NR and EDNet-NA in Table I shows that the residual learning module brings a significant improvement in accuracy, about a 36% decrease in EPE. The comparison between EDNet-NA and EDNet-F in Table I demonstrates that the learning efficiency can be further improved with the help of the attention mechanism. The visualization of EDNet-NA and EDNet-F in Figure 3 illustrates that more details, such as sharper object boundaries, can be recovered with the attention-based spatial residual.

Multi-scale Error Maps: Experiments are conducted to show that our multi-scale error map mechanism is of great importance to residual learning. We remove the error map from the residual features as well as from the input of the attention module, varying the range from scale 2 only up to all scales. As shown in Figure 4, models with error maps at multiple scales achieve a lower EPE loss with a faster convergence speed. Such a performance gain comes at a low cost of extra computation. Multi-scale error maps enable the residual module to learn from the error at the corresponding scale and thus better exploit the ability of residual learning.

| Method | PSMNet [2] | GANet [4] | GwcNet [3] | Bi3D [33] | DispNetC [5] | AANet+ [12] | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EPE | 1.09 | 0.84 | 0.76 | 0.73 | 1.68 | 0.72 | 0.63 |
| Time (s) | 0.453 | 3.302 | 0.254 | OOM | 0.025 | 0.068 | 0.059 |

TABLE II: EPE values on the Scene Flow dataset for several state-of-the-art methods. Our method achieves the best score with a competitive inference speed. The inference time is tested on a single NVIDIA RTX 2080 Ti GPU at a resolution of 576×960 for a fair comparison. OOM denotes out of memory.

IV-D Experimental Results

In this subsection, we compare our method with existing state-of-the-art methods in terms of inference speed, memory consumption and accuracy on the Scene Flow, KITTI 2015 and KITTI 2012 datasets. We also evaluate the model generalization on the Middlebury 2014 dataset [34].

Refer to caption
Figure 5: Visual comparison of our EDNet and some other top-performing methods. Please enlarge the bounding box areas for more details.
| Method | Noc fg (%) | Noc all (%) | All fg (%) | All all (%) | Time (s) |
| --- | --- | --- | --- | --- | --- |
| GANet [4] | 3.37 | 1.73 | 3.82 | 1.93 | 0.36 |
| GCNet [1] | 5.58 | 2.61 | 6.16 | 2.87 | 0.9 |
| PSMNet [2] | 4.31 | 2.14 | 4.62 | 2.32 | 0.41 |
| GwcNet [3] | 3.49 | 1.92 | 3.93 | 2.11 | 0.32 |
| SegStereo [7] | 3.70 | 2.08 | 4.07 | 2.25 | 0.6 |
| MC-CNN [35] | 7.64 | 3.33 | 8.88 | 3.89 | 67 |
| HD3 [36] | 3.43 | 1.87 | 3.63 | 2.02 | 0.14 |
| CSN [37] | 3.55 | 1.78 | 4.03 | 1.59 | 0.6 |
| DeepPruner-B [38] | 3.18 | 1.95 | 3.56 | 2.15 | 0.18 |
| Bi3D [33] | 3.11 | 1.79 | 3.48 | 1.95 | 0.48 |
| Ours | 3.33 | 2.31 | 3.88 | 2.53 | 0.05 |
| AANet [12] | 4.93 | 2.32 | 5.39 | 2.55 | 0.075 |
| DeepPruner-F [38] | 3.43 | 2.35 | 3.91 | 2.59 | 0.06 |
| DispNetC [5] | 3.72 | 4.05 | 4.41 | 4.34 | 0.06 |
| FADNet [10] | 3.07 | 2.59 | 3.50 | 2.82 | 0.05 |
| MADNet [13] | 8.41 | 4.27 | 9.20 | 4.66 | 0.02 |
| Ours | 3.33 | 2.31 | 3.88 | 2.53 | 0.05 |

TABLE III: Benchmark results on the KITTI 2015 test set. “Noc” and “All” indicate the percentage of outliers averaged over ground truth pixels of non-occluded and all regions respectively. “fg” and “all” indicate the percentage of outliers averaged over the foreground and all ground truth pixels respectively. The first group lists top-performing methods and the second group lists fast models.
Refer to caption
Figure 6: Results of the disparity prediction for KITTI 2015 testing data. The leftmost column shows left images of the stereo pairs. The rest five columns show the disparity maps predicted by DispNetC [5], PSMNet [2], GwcNet [3], AANet[12] and our EDNet, as well as their error maps.

Scene Flow Dataset: As shown in Table II, our proposed EDNet not only outperforms all the competing state-of-the-art methods with the lowest EPE but also runs significantly faster: approximately 7.5× faster than PSMNet [2], 4× faster than GwcNet [3], and 55× faster than GANet [4]. Compared with DispNetC [5] and StereoNet [26], EDNet improves the performance by 60% and 40% respectively. Figure 5 gives a visual comparison between our EDNet and other outstanding methods. Sharper object boundaries and more continuous disparity maps can be generated by EDNet, indicating the value of our proposed approach.

KITTI Datasets: We divide the comparison into two groups, as shown in Table III and Table IV. First, compared with other top-performing methods, our EDNet still achieves competitive results when evaluating non-occluded pixels while running considerably faster according to the benchmark. Then we compare our EDNet with previous real-time models. Experimental results in Table III and Table IV show that our work produces more precise estimations. To stress the efficiency of our proposed method, we compare the computational complexity, memory consumption as well as the inference speed with some popular 3D convolution based models. Table V shows that our model requires less memory: approximately 50% less than PSMNet [2], 40% less than GwcNet [3], 60% less than GANet [4] and 70% less than Bi3D [33]. Moreover, our EDNet has the lowest computational complexity in terms of FLOPs (the lower the better) and runs significantly faster: approximately 7× faster than PSMNet [2], 17× faster than Bi3D [33] and 45× faster than GANet [4]. Figure 6 visualizes the disparity and error maps of our method and other state-of-the-art works on the KITTI 2015 dataset.

| Method | Out-Noc (%) | Out-All (%) | Avg-Noc (px) | Avg-All (px) | Time (s) |
| --- | --- | --- | --- | --- | --- |
| GANet [4] | 2.18 | 2.79 | 0.5 | 0.5 | 0.36 |
| GCNet [1] | 2.71 | 3.46 | 0.6 | 0.7 | 0.90 |
| PSMNet [2] | 2.44 | 3.01 | 0.5 | 0.6 | 0.41 |
| GwcNet [3] | 2.16 | 2.71 | 0.5 | 0.5 | 0.32 |
| SegStereo [7] | 2.66 | 3.19 | 0.5 | 0.6 | 0.60 |
| MC-CNN [35] | 3.90 | 5.45 | 0.8 | 1.0 | 100 |
| Ours | 2.97 | 3.67 | 0.5 | 0.6 | 0.05 |
| AANet [12] | 2.90 | 3.60 | 0.5 | 0.6 | 0.06 |
| StereoNet [26] | 4.91 | 6.02 | 0.8 | 0.9 | 0.015 |
| DispNetC [5] | 7.36 | 8.70 | 0.9 | 1.0 | 0.06 |
| FADNet [10] | 3.98 | 4.61 | 0.6 | 0.7 | 0.05 |
| Ours | 2.97 | 3.67 | 0.5 | 0.6 | 0.05 |

TABLE IV: Benchmark results on the KITTI 2012 test set. Both the percentage of pixels with an error larger than 2 pixels (“Out”) and the overall EPE (“Avg”) are reported over non-occluded (“Noc”) and all (“All”) regions.
| Method | Memory (GB) | GFLOPs | Time (s) |
| --- | --- | --- | --- |
| PSMNet [2] | 4.83 | 937.9 | 0.393 |
| GANet [4] | 6.53 | 1936.98 | 2.43 |
| GwcNet [3] | 4.27 | 899.99 | 0.272 |
| Bi3D [33] | 10.74 | 4212.05 | 0.899 |
| Ours | 2.52 | 162.92 | 0.053 |

TABLE V: Comparisons of runtime, running memory and computational cost. All results are tested on a single NVIDIA RTX 2080 Ti GPU at a resolution of 1248×384.
Refer to caption
Figure 7: Evaluation of model generalization on the Middlebury 2014 dataset. Our EDNet produces smoother and more continuous disparity maps. Sharper object boundaries as well as better overall structures can be recovered. Please zoom in on the bounding box areas for a closer comparison.

Middlebury 2014 Dataset: We evaluate the model generalization on the Middlebury 2014 dataset without additional training. All the models are pretrained on Scene Flow and fine-tuned on KITTI. As observed from Figure 7, the performance of our EDNet is superior to that of the others, as it produces smoother and more continuous disparity estimates in low-texture regions. Moreover, our EDNet preserves better overall object structures and generates sharper edges. These generalization results show that our model adapts better to unseen scenes.

V Conclusion

We have presented an efficient architecture called EDNet with the proposed combined volume and attention-based residual module. We show that the combination of the correlation volume and the squeezed 4D concatenation volume is of great importance to robust feature representations, especially in ill-posed regions. Besides, the spatial attention module together with multi-scale error maps greatly improves the efficiency of residual learning. Extensive experimental results on the KITTI and Scene Flow datasets demonstrate the superiority of our method compared with previous state-of-the-art methods. Future work includes applying our method to other depth-related tasks, e.g., 3D reconstruction and robot navigation.

References

  • [1] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 66–75.
  • [2] J. Chang and Y. Chen, “Pyramid stereo matching network,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 5410–5418.
  • [3] X. Guo, K. Yang, W. Yang, X. Wang, and H. Li, “Group-wise correlation stereo network,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3268–3277.
  • [4] F. Zhang, V. Prisacariu, R. Yang, and P. H. S. Torr, “Ga-net: Guided aggregation net for end-to-end stereo matching,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [5] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4040–4048.
  • [6] Z. Liang, Y. Feng, Y. Guo, H. Liu, W. Chen, L. Qiao, L. Zhou, and J. Zhang, “Learning for disparity estimation through feature constancy,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 2811–2820.
  • [7] G. Yang, H. Zhao, J. Shi, Z. Deng, and J. Jia, “Segstereo: Exploiting semantic information for disparity estimation,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [8] J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan, “Cascade residual learning: A two-stage convolutional neural network for stereo matching,” in 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), 2017, pp. 878–886.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [10] Q. Wang, S. Shi, S. Zheng, K. Zhao, and X. Chu, “FADNet: A fast and accurate network for disparity estimation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA 2020), 2020, pp. 101–107.
  • [11] X. Song, X. Zhao, H. Hu, and L. Fang, “Edgestereo: A context integrated residual pyramid network for stereo matching,” in Asian Conference on Computer Vision (ACCV), 2018.
  • [12] H. Xu and J. Zhang, “Aanet: Adaptive aggregation network for efficient stereo matching,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1956–1965.
  • [13] A. Tonioni, F. Tosi, M. Poggi, S. Mattoccia, and L. D. Stefano, “Real-time self-adaptive deep stereo,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 195–204.
  • [14] Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth, optical flow and camera pose,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 1983–1992.
  • [15] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1647–1655.
  • [16] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3061–3070.
  • [17] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
  • [18] D. Scharstein, R. Szeliski, and R. Zabih, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” in Proceedings IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001), 2001.
  • [19] J. Žbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, vol. 17, no. 65, pp. 1–32, 2016.
  • [20] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v. d. Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2758–2766.
  • [21] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang, “A deep visual correspondence embedding model for stereo matching costs,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 972–980.
  • [22] Y. Feng, Z. Liang, and H. Liu, “Efficient deep learning for stereo matching with larger image patches,” in 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), 2017.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
  • [24] G. Nie, M. Cheng, Y. Liu, Z. Liang, D. Fan, Y. Liu, and Y. Wang, “Multi-level context ultra-aggregation for stereo matching,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3278–3286.
  • [25] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269.
  • [26] S. Khamis, S. Fanello, C. Rhemann, A. Kowdle, J. Valentin, and S. Izadi, “Stereonet: Guided hierarchical refinement for real-time edge-aware depth prediction,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [27] C. Stucker and K. Schindler, “Resdepth: Learned residual stereo reconstruction,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 707–716.
  • [28] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
  • [29] Y. Wang, Z. Lai, G. Huang, B. H. Wang, L. van der Maaten, M. Campbell, and K. Q. Weinberger, “Anytime stereo image depth estimation on mobile devices,” in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 5893–5900.
  • [30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
  • [31] M. D. Menz and R. D. Freeman, “Stereoscopic depth processing in the visual cortex: a coarse-to-fine mechanism,” Nature Neuroscience, pp. 59–65, 2003.
  • [32] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [33] A. Badki, A. Troccoli, K. Kim, J. Kautz, P. Sen, and O. Gallo, “Bi3d: Stereo depth estimation via binary classifications,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1597–1605.
  • [34] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling, “High-resolution stereo datasets with subpixel-accurate ground truth,” in German Conference on Pattern Recognition (GCPR), 2014.
  • [35] J. Žbontar and Y. LeCun, “Computing the stereo matching cost with a convolutional neural network,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1592–1599.
  • [36] Z. Yin, T. Darrell, and F. Yu, “Hierarchical discrete distribution decomposition for match density estimation,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6037–6046.
  • [37] X. Gu, Z. Fan, S. Zhu, Z. Dai, F. Tan, and P. Tan, “Cascade cost volume for high-resolution multi-view stereo and stereo matching,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2492–2501.
  • [38] S. Duggal, S. Wang, W. Ma, R. Hu, and R. Urtasun, “Deeppruner: Learning efficient stereo matching via differentiable patchmatch,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 4383–4392.