Dynamic Fusion Network For Light Field Depth Estimation
Abstract
Focus-based methods have shown promising results for the task of depth estimation. However, most existing focus-based depth estimation approaches depend on the maximal sharpness of the focal stack, and out-of-focus information in the stack poses challenges for this task. In this paper, we propose a dynamic multi-modal learning strategy that incorporates RGB data and the focal stack in one framework. Our goal is to deeply excavate the spatial correlation in the focal stack with the proposed spatial-correlation perception module, and to dynamically fuse the multi-modal information between the RGB data and the focal stack in an adaptive way with the proposed multi-modal dynamic fusion module. The success of our method is demonstrated by achieving state-of-the-art performance on two datasets. Furthermore, we test our network on a set of differently focused images generated by a smart-phone camera, showing that the proposed method not only breaks the limitation of using light field data alone, but also opens a path toward practical depth estimation on common consumer-level camera data. The code is available at: https://github.com/OIPLab-DUT/Light-Field-for-Depth-Estimation.
1 Introduction
Depth estimation is a crucial step for understanding geometric relations within a scene. Accurate and reliable depth information plays an important role in computer vision including object tracking [9, 18], scene understanding [10], virtual reality (VR) [37, 4], autonomous driving [3] and pose estimation [26, 28].

Depth estimation methods can be broadly classified into active and passive acquisition. In contrast to active techniques, which send a controlled energy beam and detect the reflected energy [20, 2], passive techniques are image-based and more in accord with human visual perception of depth: humans use a great variety of vision-based passive depth cues such as texture, edges, size perspective, binocular disparity, motion parallax, occlusion effects and variations in shading. Monocular depth estimation [31, 6, 23], as a low-cost, convenient and efficient passive technique, has attracted much interest lately. However, depth estimation from a single image of a generic scene is an ill-posed problem, due to the inherent ambiguity of mapping an intensity or color measurement to a depth value. On the other hand, inspired by the analogy to human depth perception, multi-view depth estimation has achieved great success, including binocular and multi-view stereo [29, 17, 25]. The similarities and correspondences established between pixels across images yield far superior results. However, these approaches are sensitive to the imaging system and require careful alignment and calibration in the setup.
The light field enables a unique capability of post-capture refocusing. Fig. 1 shows an example of the light field. A stack of focal slices is generated by refocusing at different depths, and these slices contain abundant spatial parallax information. Furthermore, focusness information caters to human visual fixation, which allows our eyes to maximize the focus given to an object in a scene. In spite of these promising characteristics, only a few studies have documented the efficacy of focusness information for depth estimation. Early works mainly determine the depth of a pixel by measuring its sharpness or focus across the images of the focal stack [27, 33, 22]. Later on, several approaches based on deep learning have been proposed [11]. These methods use convolutional neural networks (CNNs) to extract effective focusness information for depth estimation instead of hand-crafted features.
While these methods demonstrate that focusness information is useful for depth estimation, there is still large room for improvement in two key aspects: (1) How do we deeply excavate the spatial correlation between focal slices to obtain useful focusness information? Since different focal slices are focused at different depths, the spatial correlation between them is closely related to the depth variation of the objects in a scene. Most previous focus-based depth estimation networks use a standard 2D CNN to learn filters that extend across the entire focal stack. However, the spatial correlation between focal slices is likely to be ignored, so the focusness information is not well captured. (2) How do we effectively fuse focusness features and RGB information to reduce information loss in the depth map? While focusness information in the focal slices provides implicit depth cues that lead to better depth estimation, out-of-focus areas with unknown sharpness are prone to information loss, leading to inaccurate depth estimates. As shown in Fig. 1, the result of DDFF [11], which is based on focal slices, loses some detail and contains a lot of noise. Considering that the RGB image contains high-quality sharpness and can compensate for missing data in the out-of-focus areas of the focal slices, we believe that combining multi-modal information improves the accuracy of depth maps. However, most recent methods perform data fusion with manually designed schemes, such as sum fusion, weighted fusion and concatenation fusion. These methods are unable to take full advantage of the multi-modal information between RGB images and focal slices.
Our core insight is that we can leverage RGB data and the focal stack to learn an estimation model of depth by deeply excavating the spatial correlation in the focal stack and fusing multi-modality cues between RGB image and focal slices. Concretely, our contributions are mainly three-fold:
- We propose a spatial-correlation perception module (SCPM) for correlating focus and depth. Based on the observation that different focal slices possess focus areas of multiple scales and are focused at different depths, a pyramid ConvGRU is designed to excavate the spatial correlation between focal slices and to sequentially pass multi-scale focusness information along the depth direction.
- We propose a multi-modal dynamic fusion module (MDFM) in which multi-modal features are fused in an adaptive manner. This fusion strategy allows the filter parameters to change dynamically with the input focusness features during the convolution with RGB features, thereby avoiding information loss in the depth map.
- We demonstrate the effectiveness of the proposed model on two light field datasets [19]. The results show that our approach achieves superior performance over state-of-the-art approaches. Moreover, we validate our model on the Mobile Phone dataset [32], which contains focal slices captured by a smart-phone camera. This is a positive step towards practical application of our model with common consumer-level cameras.
2 Related Work
The related works can be divided into two categories: light field depth estimation using traditional approaches and learning-based approaches.
2.1 Traditional Approaches
There is a wide range of methods for light field depth estimation. The field can be roughly divided into methods based on EPI analysis, multi-view stereo matching and focus measures. [40] proposed a 4D light-field depth estimation method via epipolar plane image analysis and locally linear embedding. [15] employed a sparse decomposition to leverage the depth-orientation relationship on epipolar plane images. Another popular approach is total variation regularization: [21] proposed a nonlinear Total Variation (TV) based method that recovers the 3D shape of an object by diffusing several initial depth maps obtained through different focus measures. [24] proposed a non-convex minimization scheme to determine depth maps based on prior knowledge. [13] applied a preconditioned alternating direction method of multipliers (PADMM) with a new cost function to generate a noise-free depth map. Jeon et al. [14] proposed an algorithm to estimate multi-view stereo correspondences with sub-pixel accuracy using a cost volume. [35] developed an occlusion-aware depth estimation algorithm to deal with occlusion edges.
Although the above traditional methods can produce good depth maps, they require careful presetting and parameter tuning for depth estimation. Furthermore, they tend to generalize poorly across scenes.
2.2 Learning-based Approaches
Recently, deep learning, in particular Convolutional Neural Networks (CNNs), has successfully broken the bottleneck of traditional methods in a wide range of fields, such as super-resolution [39], novel view generation [16] and material recognition [36]. For depth estimation, [30] introduced a fully-convolutional neural network for highly accurate depth estimation. [1] predicted depth from a single focal slice using deep neural networks by exploiting dense overlapping patches. [11] presented the first end-to-end learning method to compute depth maps from the focal stack. [31] estimated scene depth from a single image, using the information provided by a camera's aperture as supervision. They introduced two differentiable aperture rendering functions that use the input image and predicted depths to simulate the depth-of-field effects caused by real camera apertures, and then trained a monocular depth estimation network end-to-end.
These methods demonstrate that focusness cues can greatly contribute to depth estimation, achieving superior results. However, there is still large room for improvement in terms of spatial-correlation excavation in the focal stack and multi-modal fusion between the RGB image and the focal stack. In this paper, we propose a deep-learning-based method that dynamically incorporates an RGB image and a focal stack to address these challenges.
3 Method
3.1 The Overall Architecture

In this paper, our goal is to excavate the spatial correlation among focal slices and dynamically fuse RGB information and focusness features for focus-based depth estimation. The overall framework is described in Figure 1. First, the RGB image is fed into the RGB stream, which consists of a SENet-154 encoder [12] pretrained on ImageNet [7], a decoder that employs four upsampling layers to gradually up-scale the final encoder feature, and a refinement module that concatenates the decoder and encoder features along the channel dimension followed by three convolutional layers. Second, all focal slices are fed into a focal stream that generates focusness information, relying on the proposed spatial-correlation perception module (SCPM) designed to excavate the spatial correlation among the focal slices. Last, we propose a multi-modal dynamic fusion module (MDFM) that dynamically fuses the multi-modal information from the RGB data and the focal stack in an adaptive way to produce an accurate depth map. The details of the two modules are discussed in the following two sections.
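To make the data flow concrete, the following is a minimal PyTorch-style sketch of how the two streams and the fusion module could be wired together. The module interfaces, constructor arguments and tensor shapes are our own assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Hypothetical wiring of the two-stream framework described above."""
    def __init__(self, rgb_stream, focal_stream, fusion):
        super().__init__()
        self.rgb_stream = rgb_stream      # SENet-154 encoder + decoder + refinement
        self.focal_stream = focal_stream  # slice encoder + SCPM (pyramid ConvGRU)
        self.fusion = fusion              # MDFM: dynamic multi-modal fusion

    def forward(self, rgb, focal_stack):
        # rgb:         (B, 3, H, W)
        # focal_stack: (B, 12, 3, H, W), 12 slices focused at different depths
        r = self.rgb_stream(rgb)            # RGB features
        h = self.focal_stream(focal_stack)  # focusness features from the SCPM
        depth = self.fusion(r, h)           # adaptive multi-modal fusion -> depth
        return depth
```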
3.2 Spatial-Correlation Perception Module (SCPM)
Considering that the stack of focal slices in a scene possesses focus areas of multiple scales and is focused at different depths, we aim to correlate the depth information with multi-scale focusness features. To this end, we propose a spatial-correlation perception module (SCPM), in which the spatial correlation between different focal slices is excavated by the proposed pyramid ConvGRU. In this way, multi-scale focusness information in different slices can be transferred along the depth direction.
Specifically, we first feed the 12 focal slices $I_1, \dots, I_{12}$ into four 5×5 convolutional layers to encode focusness features $f_i$. This procedure can be defined as:
$f_i = \mathcal{F}(I_i; \theta), \quad i = 1, \dots, 12 \qquad (1)$
where the slices inherit the angular and spatial resolution of the captured light field, $I_i$ represents the $i$-th focal slice, $\theta$ denotes the parameters of the encoder layers and $\mathcal{F}(\cdot)$ is a learned mapping function. Then, in order to excavate the spatial correlation between different focal slices, we draw ideas from recent works [8, 38, 43], which use a typical ConvGRU to capture short- and long-term temporal dependencies. In our work, we treat the focusness features as a feature-map sequence and design a pyramid ConvGRU. In order to pass multi-scale focusness features corresponding to focus areas of multiple scales, the proposed pyramid ConvGRU uses an atrous spatial pyramid pooling (ASPP) module [5] for each of the gates instead of a single convolution, which encodes multi-scale focusness information by applying atrous convolution with multiple parallel filters of different rates and fields-of-view. The dilation rates are 1, 3 and 5, respectively. The formulation of the proposed pyramid ConvGRU is:
$z_i = \sigma(W_z * f_i + U_z * h_{i-1}) \qquad (2)$
$r_i = \sigma(W_r * f_i + U_r * h_{i-1}) \qquad (3)$
$\tilde{h}_i = \tanh\big(W_h * f_i + U_h * (r_i \odot h_{i-1})\big) \qquad (4)$
$h_i = (1 - z_i) \odot h_{i-1} + z_i \odot \tilde{h}_i \qquad (5)$

where all $W$ and $U$ are model parameters to be learned (implemented here as the pyramid ASPP gates), $\sigma$ is the sigmoid function, and $\odot$ and $*$ denote element-wise multiplication and convolution, respectively. The pyramid ConvGRU takes the $i$-th focusness feature $f_i$ and the previous output $h_{i-1}$ as input, and outputs a new feature $h_i$ by combining $h_{i-1}$ and a candidate state $\tilde{h}_i$ weighted by the output of an update gate $z_i$. The update and reset gates, $z_i$ and $r_i$, selectively update multi-scale focusness information from the input focusness feature $f_i$ and the previous output $h_{i-1}$, automatically learning the spatial correlation from neighboring slices. Note that the updated focusness feature does not vanish as information is transferred; it is passed on to the next slice.
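The recurrence above can be sketched in PyTorch as follows, under our reading of Eqs. (2)-(5): each gate is realized as a small ASPP block with dilation rates 1, 3 and 5 in place of a single convolution. The channel widths, the 1×1 fusion inside the ASPP block and the zero-initialized hidden state are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class ASPPGate(nn.Module):
    """Parallel atrous convolutions (rates 1, 3, 5) fused by a 1x1 conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in (1, 3, 5)
        ])
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

class PyramidConvGRUCell(nn.Module):
    """One recurrent step over the focal-slice dimension (Eqs. 2-5)."""
    def __init__(self, ch):
        super().__init__()
        self.gate_z = ASPPGate(2 * ch, ch)   # update gate
        self.gate_r = ASPPGate(2 * ch, ch)   # reset gate
        self.cand   = ASPPGate(2 * ch, ch)   # candidate state

    def forward(self, f_i, h_prev):
        z = torch.sigmoid(self.gate_z(torch.cat([f_i, h_prev], dim=1)))
        r = torch.sigmoid(self.gate_r(torch.cat([f_i, h_prev], dim=1)))
        h_tilde = torch.tanh(self.cand(torch.cat([f_i, r * h_prev], dim=1)))
        return (1 - z) * h_prev + z * h_tilde  # new hidden state h_i

# Usage: iterate over the encoded focal slices.
# cell = PyramidConvGRUCell(64)
# h = torch.zeros_like(features[0])
# for f in features:        # features: list of (B, 64, H, W) focusness maps
#     h = cell(f, h)
```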
3.3 Multi-Modal Dynamic Fusion Module (MDFM)

As RGB images and focal slices imply different depth information, we consider that fusing the two modalities is important for the depth estimation task. However, existing fusion schemes usually rely on manually designed operations, including sum fusion, weighted fusion and concatenation fusion. These static fusion schemes are not suitable for our task: the RGB and focusness features are not equivalent quantities, so such fusion is prone to information loss. As shown in Fig. 4(a), these static fusion methods process the entire image with fixed network parameters; the convolution kernel does not change with the input features, ignoring the relationship between multi-modal features. Therefore, a more proper and effective strategy should be considered. To this end, we introduce a multi-modal dynamic fusion module (MDFM) that fuses the RGB features and focusness features in an adaptive manner. In this module, a filter that varies with the focusness features is used to convolve the RGB information, thereby avoiding information loss.

Specifically, our module consists of two steps: 1) we choose an adaptive kernel [33] instead of a spatially invariant convolution; the standard spatial convolution $W$ is adapted at each pixel using the focusness features via the adaptive kernel, so that when the focusness features change, the parameters of the convolution kernel also change; 2) we apply the generated adaptive convolution kernel to the RGB features, making the whole network dynamically fuse the multi-modal information to obtain an accurate depth map.
Given the output $h$ of the spatial-correlation perception module and the RGB-stream feature $r$, we have:
$d_i = \sum_{j \in \Omega(i)} K(h_i, h_j)\, W[\mathbf{p}_i - \mathbf{p}_j]\, r_j + b \qquad (6)$
where $\Omega(i)$ indexes the spatial neighborhood of pixel $i$ with 2D spatial offsets, $i$ and $j$ represent pixel coordinates, $K(\cdot,\cdot)$ is the adaptive kernel computed from the focusness features, $W$ is a standard spatial convolution, $r$ represents the output of the RGB stream, $b$ is a bias term, and $d$ represents the final depth prediction map. Before the filtering operation, the filter has a pre-defined form and depends on the content of the focusness features; in this way, the prediction depends on both the RGB features and the reliable focusness information. Finally, we refine the depth map with two consecutive convolutional layers. As shown in Fig. 4(b), at test time, compared with traditional multi-modal fusion whose parameters do not change with the input features, our model dynamically selects filter parameters for convolution based on the content of the focusness features and adaptively selects multi-modal features for fusion, avoiding information loss.
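A simplified sketch of the dynamic fusion idea is given below: a small network predicts a per-pixel k×k kernel from the focusness features, and this kernel is applied to the RGB features via `unfold`. This mirrors the spirit of Eq. (6) but is our own simplification (a single kernel shared across channels, softmax normalization), not the released MDFM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFusion(nn.Module):
    """Per-pixel filters predicted from focusness features convolve RGB features."""
    def __init__(self, ch, k=3):
        super().__init__()
        self.k = k
        # Predict one k*k kernel per output pixel (shared across channels here).
        self.kernel_net = nn.Conv2d(ch, k * k, 3, padding=1)
        self.out = nn.Conv2d(ch, 1, 3, padding=1)  # refine to a 1-channel depth map

    def forward(self, rgb_feat, focus_feat):
        B, C, H, W = rgb_feat.shape
        k = self.k
        # Kernels conditioned on the focusness features (softmax keeps them normalized).
        kernels = F.softmax(self.kernel_net(focus_feat), dim=1)   # (B, k*k, H, W)
        patches = F.unfold(rgb_feat, k, padding=k // 2)           # (B, C*k*k, H*W)
        patches = patches.view(B, C, k * k, H, W)
        fused = (patches * kernels.unsqueeze(1)).sum(dim=2)       # (B, C, H, W)
        return self.out(fused)
```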
3.4 Loss Functions
In this section, we define the reconstruction error between the predicted depth maps and the ground truth. The total loss is:
$L = L_{depth} + \lambda L_{grad} + \mu L_{normal} \qquad (7)$
where $\lambda$ and $\mu$ are weighting coefficients. We use a logarithm of the depth errors as $L_{depth}$ to minimize the difference between the estimated depth map $d$ and its ground truth $g$:
$L_{depth} = \frac{1}{N}\sum_{i=1}^{N} \ln\big(|d_i - g_i| + \alpha\big) \qquad (8)$
where $\alpha$ is a hyper-parameter. In order to deal with the edge distortion caused by CNN training, we also consider a loss on the gradients of the depth:
$L_{grad} = \frac{1}{N}\sum_{i=1}^{N} \big(|\nabla_x e_i| + |\nabla_y e_i|\big) \qquad (9)$
where $e = d - g$, $\nabla_x e_i$ is the spatial derivative of $e$ computed at the $i$-th pixel with respect to $x$, and similarly for $y$. To handle small depth structures and further improve the fine details of the depth maps, we consider yet another loss, which measures the accuracy of the surface normals of an estimated depth map with respect to its ground truth:
$L_{normal} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{\langle n_i^{d},\, n_i^{g}\rangle}{\sqrt{\langle n_i^{d}, n_i^{d}\rangle\,\langle n_i^{g}, n_i^{g}\rangle}}\right) \qquad (10)$
where $\langle \cdot, \cdot \rangle$ denotes the inner product of vectors, and $n_i^{d}$ and $n_i^{g}$ denote the surface normals of the estimated depth map and its ground truth at pixel $i$, respectively.
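The three loss terms can be sketched as follows. The value of $\alpha$, the weighting coefficients and the forward-difference gradient operator are placeholders we assume for illustration, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def spatial_grad(x):
    """Forward differences along x and y (simple stand-in for the paper's operator)."""
    gx = x[..., :, 1:] - x[..., :, :-1]
    gy = x[..., 1:, :] - x[..., :-1, :]
    return F.pad(gx, (0, 1)), F.pad(gy, (0, 0, 0, 1))

def depth_losses(d, g, alpha=0.5, lam=1.0, mu=1.0):
    """d, g: (B, 1, H, W) predicted / ground-truth depth. alpha, lam, mu are placeholders."""
    e = d - g
    l_depth = torch.log(e.abs() + alpha).mean()                    # Eq. (8)

    ex, ey = spatial_grad(e)
    l_grad = (ex.abs() + ey.abs()).mean()                          # Eq. (9)

    dx, dy = spatial_grad(d)
    gx, gy = spatial_grad(g)
    n_d = torch.stack([-dx, -dy, torch.ones_like(d)], dim=-1)      # surface normals
    n_g = torch.stack([-gx, -gy, torch.ones_like(g)], dim=-1)
    cos = F.cosine_similarity(n_d, n_g, dim=-1)
    l_normal = (1.0 - cos).mean()                                  # Eq. (10)

    return l_depth + lam * l_grad + mu * l_normal                  # Eq. (7)
```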
4 Experiments
4.1 Experiments Setup
4.1.1 Datasets.
To demonstrate the effectiveness of the proposed approach, we conduct experiments on the DUT-LFDD dataset, the LFSD dataset and a Mobile Phone dataset. Note that although some synthetic light field (LF) datasets [12] provide multi-view images and corresponding depth maps, models trained on such synthetic data have trouble generalizing to real-world data. Furthermore, the focal stack cannot be correctly synthesized from the multi-view images provided in HCI, because the changes between views are not translational but rotational.
DUT-LFDD: This dataset contains 967 real-world light-field samples covering a large variety of indoor and outdoor scenes. We randomly select 630 samples for training and the remaining 337 samples for testing. All images are captured by a commercial Lytro Illum camera in real-life scenes and include many challenging cases. Each light field sample consists of an RGB image, a focal stack with 12 slices focused at different depths, and a depth image corresponding to the RGB image.
LFSD: This dataset, proposed by [19], contains 100 light fields captured by a Lytro camera, including 60 indoor and 40 outdoor scenes. Each light field consists of an RGB image, focal slices and a depth map.
Mobile Phone dataset: This dataset was provided by previous researchers [32], who continuously captured images of size 640×360 pixels using a Samsung Galaxy S3 phone during auto-focusing. The dataset provides focal stacks and RGB images of different scenes (number of frames in parentheses): plants (23), bottles (31), fruits (30), metals (33), window (27), telephone (33), etc. For each scene, we choose 12 focal slices and an RGB image to evaluate our model.
To prevent overfitting, we augment the training set with the following operations (a paired-augmentation sketch follows the list):
- Flipping: we only consider horizontal flipping (i.e., mirroring) of images, with a probability of 0.5.
- Rotating: the RGB image, the focal stack and the depth image are rotated by a random degree r ∈ [-5, 5].
- Color Jitter: the brightness, contrast and saturation values of the sample are randomly scaled by c ∈ [0.6, 1.4].
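A sketch of this paired augmentation, which applies identical geometric transforms to the RGB image, every focal slice and the depth map, might look as follows; the exact implementation details (e.g., sharing one jitter factor across the sample) are our own assumptions.

```python
import random
import torchvision.transforms.functional as TF

def augment(rgb, focal_slices, depth):
    """Apply identical geometric transforms to RGB, every focal slice, and depth."""
    if random.random() < 0.5:                     # horizontal flip, p = 0.5
        rgb = TF.hflip(rgb)
        focal_slices = [TF.hflip(s) for s in focal_slices]
        depth = TF.hflip(depth)

    angle = random.uniform(-5, 5)                 # random rotation in [-5, 5] degrees
    rgb = TF.rotate(rgb, angle)
    focal_slices = [TF.rotate(s, angle) for s in focal_slices]
    depth = TF.rotate(depth, angle)

    c = random.uniform(0.6, 1.4)                  # one photometric factor per sample
    def jitter(img):
        img = TF.adjust_brightness(img, c)
        img = TF.adjust_contrast(img, c)
        return TF.adjust_saturation(img, c)
    rgb = jitter(rgb)
    focal_slices = [jitter(s) for s in focal_slices]
    return rgb, focal_slices, depth
```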
4.1.2 Evaluation Metrics.
We adopt seven metrics for comprehensive evaluation: Mean Relative error (Abs Rel), Squared Relative error (Sq Rel), Root Mean Squared error (RMSE), Root Mean Squared log error (RMSE log), and Accuracy with thresholds δ < 1.25, 1.25² and 1.25³. They are universally agreed, standard metrics for evaluating a depth estimation model and are well explained in the literature [41, 34, 42].
Table 1: Ablation study of the SCPM.

| Methods | RMSE | RMSE log | Abs Rel | Sq Rel | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|
| 2DCNN | 0.3667 | 0.0501 | 0.1879 | 0.1015 | 0.7102 | 0.9435 | 0.9930 |
| GRU | 0.3648 | 0.0492 | 0.1821 | 0.0976 | 0.7159 | 0.9444 | 0.9931 |
| SCPM | 0.3457 | 0.0453 | 0.1707 | 0.0864 | 0.7342 | 0.9516 | 0.9952 |
- Mean relative error (Abs Rel): $\frac{1}{N}\sum_{i=1}^{N}\frac{|d_i - g_i|}{g_i}$
- Root mean squared error (RMSE): $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(d_i - g_i)^2}$
- Root mean squared log error (RMSE log): $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(\log d_i - \log g_i)^2}$
- Squared relative error (Sq Rel): $\frac{1}{N}\sum_{i=1}^{N}\frac{(d_i - g_i)^2}{g_i}$
- Accuracy with threshold: percentage of pixels such that $\max\!\left(\frac{d_i}{g_i}, \frac{g_i}{d_i}\right) = \delta < thr$, for $thr \in \{1.25, 1.25^2, 1.25^3\}$.

A minimal implementation sketch of these metrics is given below.
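The sketch assumes the prediction and ground truth share the same shape and that the ground-truth depth is strictly positive.

```python
import torch

def eval_metrics(d, g, eps=1e-8):
    """Standard depth metrics for prediction d and ground truth g (g > 0)."""
    d, g = d.flatten(), g.flatten()
    abs_rel = ((d - g).abs() / g).mean()
    sq_rel = ((d - g) ** 2 / g).mean()
    rmse = torch.sqrt(((d - g) ** 2).mean())
    rmse_log = torch.sqrt((torch.log(d + eps) - torch.log(g + eps)).pow(2).mean())
    ratio = torch.max(d / g, g / d)
    acc = [(ratio < 1.25 ** k).float().mean() for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, acc
```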
4.1.3 Implementation Details.
We implement our network in the PyTorch framework with one Nvidia GTX 1080Ti GPU. We train the RGB stream and the focal stream using the Adam optimizer with an initial learning rate of 0.0001, reduced to 10% of its value every 5 epochs. We set β₁ = 0.9, β₂ = 0.999, and use a weight decay of 0.0001. The encoder module in the RGB stream is initialized with a model pretrained on the ImageNet dataset; the other layers are randomly initialized. The batch size is 1 and the maximum number of epochs is set to 80.
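A minimal sketch of this training setup, assuming a `model`, a paired `train_loader` and the loss sketch from Sec. 3.4, could look like this:

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

def train(model, train_loader, loss_fn, epochs=80):
    """Training loop with the schedule described above (hyper-parameters from the paper)."""
    optimizer = Adam(model.parameters(), lr=1e-4,
                     betas=(0.9, 0.999), weight_decay=1e-4)
    scheduler = StepLR(optimizer, step_size=5, gamma=0.1)  # lr -> 10% every 5 epochs
    for _ in range(epochs):                                # batch size 1 in the paper
        for rgb, focal_stack, depth_gt in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(rgb, focal_stack), depth_gt)
            loss.backward()
            optimizer.step()
        scheduler.step()
```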

4.2 Ablation Studies
4.2.1 Effect of SCPM.
The SCPM is proposed to excavate the spatial correlation between different focal slices. To verify its effectiveness, we first replace the SCPM with 7 convolutional layers (denoted as 2DCNN in Table 1). Table 1 shows that the SCPM improves RMSE by nearly 2% over 2DCNN. We attribute this improvement to the recurrent structure retaining the spatial relationship between different focal features. We then compare our pyramid ConvGRU with a standard ConvGRU by replacing the former with the latter (denoted as GRU). As Table 1 shows, the performance improvements come from our pyramid ConvGRU. These improvements are reasonable since our model can adapt to focus areas of different sizes, unlike the standard ConvGRU. Furthermore, in Fig. 5 we visualize features from randomly selected focal slices after passing them through the pyramid ConvGRU: (a)-(c) and (e)-(g) show the first three of the 12 focal slices of each scene, while (d) and (h) show the output of the last pyramid ConvGRU step in each scene. We can see from Fig. 5 that the feature maps of different focal slices contain different details, which are passed on by our pyramid ConvGRU.
Table 2: Comparison of the MDFM with static fusion schemes.

| Dataset | Methods | RMSE | RMSE log | Abs Rel | Sq Rel | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| Our dataset | SUM | 0.6600 | 0.1564 | 0.3777 | 0.3211 | 0.4186 | 0.6838 | 0.9184 |
| | Weight | 0.4877 | 0.0921 | 0.2603 | 0.1652 | 0.5190 | 0.8584 | 0.9749 |
| | Concatenate | 0.4268 | 0.0705 | 0.1917 | 0.1065 | 0.6250 | 0.9122 | 0.9790 |
| | Ours | 0.3457 | 0.0453 | 0.1707 | 0.0864 | 0.7342 | 0.9516 | 0.9952 |
| LFSD | SUM | 0.7037 | 0.1788 | 0.3977 | 0.3484 | 0.3971 | 0.6325 | 0.8889 |
| | Weight | 0.4667 | 0.0829 | 0.2422 | 0.1501 | 0.5777 | 0.8823 | 0.9756 |
| | Concatenate | 0.4284 | 0.0697 | 0.1858 | 0.1042 | 0.6570 | 0.9081 | 0.9752 |
| | Ours | 0.3612 | 0.0550 | 0.1796 | 0.0901 | 0.6973 | 0.9339 | 0.9874 |
4.2.2 Effect of MDFM.
The MDFM is proposed to dynamically fuse the multi-modal information between RGB features and focusness features. To demonstrate its effectiveness, we compare the MDFM with a variety of conventional fusion methods, including sum fusion (SUM), weighted fusion (Weight) and concatenation fusion (Concatenate), by replacing the fusion block in our framework. The visual comparison is shown in Figure 4. It can be seen that the quality of the depth map improves by a large margin with the MDFM. Especially in regions of depth discontinuities, the MDFM recovers the depth more accurately than concatenation fusion and preserves more structural information than the other static fusion methods. Numerically, as shown in Table 2, our proposed MDFM reduces RMSE by nearly 8% compared with concatenation fusion.
Table 3: Quantitative comparison with state-of-the-art methods on our dataset and LFSD (∗ marks traditional methods; "-" means the result is unavailable).

| Dataset | Methods | RMSE | RMSE log | Abs Rel | Sq Rel | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| Our dataset | Ours | 0.3457 | 0.0453 | 0.1707 | 0.0864 | 0.7342 | 0.9516 | 0.9952 |
| | DDFF | 0.5255 | 0.1042 | 0.2666 | 0.1834 | 0.4944 | 0.8202 | 0.9667 |
| | EPINET | 0.4974 | 0.0959 | 0.2324 | 0.1434 | 0.5010 | 0.8375 | 0.9837 |
| | VDFF∗ | 0.7192 | 0.1887 | 0.3887 | 0.3808 | 0.4040 | 0.6593 | 0.8505 |
| | PADMM∗ | 0.4730 | 0.0912 | 0.2253 | 0.1509 | 0.5891 | 0.8560 | 0.9577 |
| | LF∗ | 0.6897 | 0.1436 | 0.3835 | 0.3790 | 0.4913 | 0.7549 | 0.8783 |
| | LFOCC∗ | 0.6233 | 0.1432 | 0.3109 | 0.2510 | 0.4524 | 0.7464 | 0.9127 |
| LFSD | Ours | 0.3612 | 0.0550 | 0.1796 | 0.0901 | 0.6973 | 0.9339 | 0.9874 |
| | DDFF | 0.4255 | 0.0717 | 0.2128 | 0.1204 | 0.6185 | 0.8916 | 0.9860 |
| | VDFF∗ | 0.5747 | 0.1258 | 0.3320 | 0.2660 | 0.4730 | 0.7823 | 0.9359 |
| | PADMM∗ | 0.4238 | 0.0722 | 0.2153 | 0.1336 | 0.6536 | 0.8880 | 0.9770 |
| | LF∗ | - | - | - | - | - | - | - |
| | LFOCC∗ | - | - | - | - | - | - | - |
| | EPINET | - | - | - | - | - | - | - |
4.3 Comparison with State-of-the-arts
We compare our method with six other state-of-the-art light field depth estimation methods, including both deep-learning-based methods and traditional methods (marked with ∗). The deep-learning-based methods are DDFF [11] and EPINET [30]; the traditional methods are VDFF∗ [24], PADMM∗ [13], LF∗ [14] and LFOCC∗ [35]. For fair comparison, the results of the competing methods are generated with the authors' released code. To verify the generalization and applicability of our network, we conduct experiments on three datasets.

4.3.1 Quantitative Evaluation.
As shown in Table 3, our method clearly outperforms the other state-of-the-art methods on our dataset in terms of all seven evaluation metrics. Moreover, we apply the model parameters trained on our dataset directly to the LFSD dataset for testing, and our method still achieves significant advantages, e.g., the best Sq Rel and RMSE scores. Note that the LFSD dataset only provides focal slices, RGB images and depth maps; therefore, we only compare our method with the three focus-based methods on LFSD.
4.3.2 Qualitative Evaluation.
Figure 5 provides some challenging examples comparing our method with other state-of-the-art methods. It can be seen that our method achieves accurate predictions when the foreground and background are similar, when some areas are textureless, and when surfaces are smooth.
We further illustrate the visual results of our method on the LFSD dataset in Figure 6. Compared with traditional methods, our network retains more detailed information and better maintains the structure of the depth map. We can also observe that the results of DDFF contain a lot of noise. This is because DDFF uses a standard 2D CNN to learn filters that extend across the entire focal stack. In contrast, our model preserves the spatial correlation between focal slices and effectively fuses the focusness features and RGB information to reduce information loss.

4.4 Mobile Phone Results
Common consumer-level cameras can also capture focal stacks. To explore the practical application of our network, we test it on the focal stacks captured by a Samsung Galaxy S3 phone, provided by the authors of [32]. For ease of testing, we choose 12 focal slices and an RGB image from each scene. Figure 9 provides a qualitative comparison of different deep-learning-based methods. The proposed method recovers high-quality depth maps with less noise and sharper object boundaries, even though the model was not trained on this dataset. These impressive results show that the proposed method can be used in further daily-life applications. Moreover, our model is not limited to light field cameras, because a smart-phone camera, the dominant modality of image capture, can capture focal stacks by manipulating the focus distance programmatically.

5 Conclusion
In this paper, we propose a multi-modal learning framework that incorporates RGB data and the focal stack for focus-based depth estimation. Our SCPM excavates the spatial correlation between different focal slices and sequentially passes multi-scale focusness information along the depth direction using the proposed pyramid ConvGRU. The MDFM dynamically fuses the RGB features and the focusness features in an adaptive manner, in which a filter that varies with the focusness features is convolved with the RGB information. Our experiments show that the proposed method achieves superior performance, especially in challenging scenes, and our extensive evaluation shows that the proposed network can be applied successfully to data from common consumer-level cameras.
References
- [1] Anwar, S., Hayder, Z., Porikli, F.: Depth estimation and blur removal from a single out-of-focus image. In: BMVC. vol. 1, p. 2 (2017)
- [2] Areann, M.C., Bosch, T., Lescure, M., et al.: Laser ranging: a critical review of usual techniques for distance measurement [j]. Opt. Eng 40(1) (2001)
- [3] Chen, C., Seff, A., Kornhauser, A., Xiao, J.: Deepdriving: Learning affordance for direct perception in autonomous driving. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2722–2730 (2015)
- [4] Chen, C.F., Bolas, M., Rosenberg, E.S.: Rapid creation of photorealistic virtual reality content with consumer depth cameras. In: 2017 IEEE Virtual Reality (VR) (2017)
- [5] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40(4), 834–848 (2017)
- [6] Chen, Y., Zhao, H., Hu, Z.: Attention-based context aggregation network for monocular depth estimation. arXiv preprint arXiv:1901.10137 (2019)
- [7] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)
- [8] Eom, C., Park, H., Ham, B.: Temporally consistent depth prediction with flow-guided memory units. IEEE Transactions on Intelligent Transportation Systems (2019)
- [9] Ghasemi, A., Vetterli, M.: Scale-invariant representation of light field images for object recognition and tracking. In: Computational Imaging XII. vol. 9020, p. 902015. International Society for Optics and Photonics (2014)
- [10] Hazirbas, C., Ma, L., Domokos, C., Cremers, D.: Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In: Asian conference on computer vision. pp. 213–228. Springer (2016)
- [11] Hazirbas, C., Soyer, S.G., Staab, M.C., Leal-Taixé, L., Cremers, D.: Deep depth from focus. In: Asian Conference on Computer Vision. pp. 525–541. Springer (2018)
- [12] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7132–7141 (2018)
- [13] Javidnia, H., Corcoran, P.: Application of preconditioned alternating direction method of multipliers in depth from focal stack. Journal of Electronic Imaging 27(2), 023019 (2018)
- [14] Jeon, H.G., Park, J., Choe, G., Park, J., Bok, Y., Tai, Y.W., So Kweon, I.: Accurate depth map estimation from a lenslet light field camera. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1547–1555 (2015)
- [15] Johannsen, O., Sulc, A., Goldluecke, B.: What sparse light field coding reveals about scene structure. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3262–3270 (2016)
- [16] Kalantari, N.K., Wang, T.C., Ramamoorthi, R.: Learning-based view synthesis for light field cameras. ACM Transactions on Graphics (TOG) 35(6), 193 (2016)
- [17] Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., Bry, A.: End-to-end learning of geometry and context for deep stereo regression. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 66–75 (2017)
- [18] Li, J., Hu, W., Yang, J.: Adnet: Appearance and depth features network for visual object tracking (2019)
- [19] Li, N., Ye, J., Ji, Y., Ling, H., Yu, J.: Saliency detection on light field. In: CVPR. pp. 2806–2813 (2014)
- [20] Liebe, C.C., Padgett, C., Chang, J.: Three dimensional imaging utilizing structured light. In: 2004 IEEE Aerospace Conference Proceedings (IEEE Cat. No. 04TH8720). vol. 4, pp. 2647–2655. IEEE (2004)
- [21] Mahmood, M.T.: Shape from focus by total variation. In: IVMSP 2013. pp. 1–4. IEEE (2013)
- [22] Mahmood, M.T., Choi, T.S.: Nonlinear approach for enhancement of image focus volume in shape from focus. IEEE Transactions on image processing 21(5), 2866–2873 (2012)
- [23] Mal, F., Karaman, S.: Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). pp. 1–8. IEEE (2018)
- [24] Moeller, M., Benning, M., Schönlieb, C., Cremers, D.: Variational depth from focus reconstruction. IEEE Transactions on Image Processing 24(12), 5369–5378 (2015)
- [25] Pang, J., Sun, W., Ren, J.S., Yang, C., Yan, Q.: Cascade residual learning: A two-stage convolutional neural network for stereo matching. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 887–895 (2017)
- [26] Park, K., Patten, T., Prankl, J., Vincze, M.: Multi-task template matching for object detection, segmentation and pose estimation using depth images. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 7207–7213. IEEE (2019)
- [27] Pertuz, S., Puig, D., Garcia, M.A.: Analysis of focus measure operators for shape-from-focus. Pattern Recognition 46(5), 1415–1432 (2013)
- [28] Porzi, L., Penate-Sanchez, A., Ricci, E., Moreno-Noguer, F.: Depth-aware convolutional neural networks for accurate 3d pose estimation in rgb-d images. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2017)
- [29] Roberts, R., Sinha, S.N., Szeliski, R., Steedly, D.: Structure from motion for scenes with large duplicate structures. In: CVPR 2011. pp. 3137–3144. IEEE (2011)
- [30] Shin, C., Jeon, H.G., Yoon, Y., So Kweon, I., Joo Kim, S.: Epinet: A fully-convolutional neural network using epipolar geometry for depth from light field images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4748–4757 (2018)
- [31] Srinivasan, P.P., Garg, R., Wadhwa, N., Ng, R., Barron, J.T.: Aperture supervision for monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6393–6401 (2018)
- [32] Suwajanakorn, S., Hernandez, C., Seitz, S.M.: Depth from focus with your mobile phone. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3497–3506 (2015)
- [33] Thelen, A., Frey, S., Hirsch, S., Hering, P.: Improvements in shape-from-focus for holographic reconstructions with regard to focus operators, neighborhood-size, and height value interpolation. IEEE Transactions on Image Processing 18(1), 151–157 (2008)
- [34] Tosi, F., Aleotti, F., Poggi, M., Mattoccia, S.: Learning monocular depth estimation infusing traditional stereo knowledge. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9799–9809 (2019)
- [35] Wang, T.C., Efros, A.A., Ramamoorthi, R.: Occlusion-aware depth estimation using light-field cameras. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3487–3495 (2015)
- [36] Wang, T.C., Zhu, J.Y., Hiroaki, E., Chandraker, M., Efros, A.A., Ramamoorthi, R.: A 4d light-field dataset and cnn architectures for material recognition. In: European Conference on Computer Vision. pp. 121–138. Springer (2016)
- [37] Wozniak, P., Capobianco, A., Javahiraly, N., Curticapean, D.: Depth sensor based detection of obstacles and notification for virtual reality systems. In: International Conference on Applied Human Factors and Ergonomics. pp. 271–282. Springer (2019)
- [38] Yao, R., Zhang, Y., Gao, C., Zhou, Y., Zhao, J., Liang, L.: Lightweight video object segmentation based on convgru. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 441–452. Springer (2019)
- [39] Yoon, Y., Jeon, H.G., Yoo, D., Lee, J.Y., Kweon, I.S.: Light-field image super-resolution using convolutional neural network. IEEE Signal Processing Letters 24(6), 848–852 (2017)
- [40] Zhang, Y., Lv, H., Liu, Y., Wang, H., Wang, X., Huang, Q., Xiang, X., Dai, Q.: Light-field depth estimation via epipolar plane image analysis and locally linear embedding. IEEE Transactions on Circuits and Systems for Video Technology 27(4), 739–747 (2016)
- [41] Zhao, S., Fu, H., Gong, M., Tao, D.: Geometry-aware symmetric domain adaptation for monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9788–9798 (2019)
- [42] Zheng, C., Cham, T.J., Cai, J.: T2net: Synthetic-to-realistic translation for solving single-image depth estimation tasks. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 767–783 (2018)
- [43] Zhu, X., Dai, J., Zhu, X., Wei, Y., Yuan, L.: Towards high performance video object detection for mobiles. arXiv preprint arXiv:1804.05830 (2018)