Efficient Light Field Reconstruction
via Spatio-Angular Dense Network

Zexi Hu, Henry Wing Fung Yeung, Xiaoming Chen, Yuk Ying Chung, and Haisheng Li This work was supported by the Research Foundation for Advanced Talents of Beijing Technology and Business University (No. 19008021181), National Natural Science Foundation of China (No. 61877002), Beijing Natural Science Foundation and Fengtai Rail Transit Frontier Research Joint Fund (No. L191009) and Scientific Research Program of Beijing Municipal Education Commission (No. KZ202110011017).Zexi Hu is with the School of Computer Science and Engineering, Beijing Technology and Business University, China and the School of Computer Science, University of Sydney, Australia. (Email: [email protected])Henry Wing Fung Yeung and Yuk Ying Chung are with the School of Computer Science, University of Sydney, Australia. (Email: [email protected], [email protected])Xiaoming Chen and Haisheng Li are with the School of Computer Science and Engineering, Beijing Technology and Business University, China. (Email: [email protected], [email protected])Corresponding author: Xiaoming Chen.

Abstract

As an image sensing instrument, light field images can supply extra angular information compared with monocular images and have facilitated a wide range of measurement applications. Light field image capturing devices usually suffer from the inherent trade-off between the angular and spatial resolutions. To tackle this problem, several methods, such as light field reconstruction and light field super-resolution, have been proposed but leaving two problems unaddressed, namely domain asymmetry and efficient information flow. In this paper, we propose an end-to-end Spatio-Angular Dense Network (SADenseNet) for light field reconstruction with two novel components, namely correlation blocks and spatio-angular dense skip connections to address them. The former performs effective modeling of the correlation information in a way that conforms with the domain asymmetry. And the latter consists of three kinds of connections enhancing the information flow within two domains. Extensive experiments on both real-world and synthetic datasets have been conducted to demonstrate that the proposed SADenseNet’s state-of-the-art performance at significantly reduced costs in memory and computation. The qualitative results show that the reconstructed light field images are sharp with correct details and can serve as pre-processing to improve the accuracy of related measurement applications.

Index Terms:

Light field reconstruction, light field imaging, deep learning, image processing, convolutional neural network.

I Introduction

As an emerging image sensing instrument, light field (LF) cameras can capture a set of images from different perspectives. This feature offers advantages in vision-based measurement tasks. For example, researchers proposed using LF camera [1, 2, 3, 4] for more accurate face detection that is more robust to spoof attacks. Similar achievement in measurement accuracy has been seen in material recognition [5, 6] where the subjects can’t be easily distinguished in regular 2D images [7], and salient object detection in complex scenarios [8, 9]. The abundant information provided by LF instruments also facilitates depth measurement and 3D measurement with promising accuracy [10, 11, 12, 13] compared with other types of image sensors such as stereo vision [14] and structured light [15]. The emergence of LF cameras, e.g., Raytrix R series [16, 17] and Lytro Illum [18], has enabled LF instruments to apply to a wide range of consumer- and industrial-grade applications, such as measurement in industrial quality control [17, 19]. However, due to the limited capacity of micro-lens inside the sensor, these cameras inherently suffer from a low resolution and a trade-off between the angular and spatial resolutions.

For mitigating the trade-off, there are two major solutions, namely LF reconstruction and LF spatial super-resolution (LFSR). The former focuses on upsampling in the angular domain, i.e. the number of sub-aperture images (SAI), while the latter aims at increasing the resolution of the spatial domain, i.e. the spatial resolution of each SAI. This paper focuses on the former category to reconstruct a densely sampled LF from a sparsely sampled one. With the recent success of deep learning in image processing, learning-based methods are also introduced to LF reconstruction and have achieved superior performance [20, 21]. However, these methods still suffer from problematic spatio-angular features extraction which limits their performance. It stems from two aspects, namely Domain Asymmetry and Inefficient Information Flow.

Regarding Domain Asymmetry, it has been discovered in [22] that the information in the spatial and angular domains has distinct natures as the spatial domain contains the regular 2D image information while the angular domain encodes the disparity information between adjacent SAIs. It is also obvious that these two domains are spaces of enormously different sizes. Specifically, given a LF image $X\in\mathbb{R}^{U\times V\times W\times H}$ , the spatial space $W\times H$ is usually much larger than the angular space $U\times V$ , i.e. $U\times V\ll W\times H$ . This issue becomes more prominent in light field reconstruction where the high-frequency information in the angular domain is largely damaged when downsampling into the sparsely sampled LF image. In other words, the angular domain becomes further smaller and sparser. Therefore, it is unreasonable to process the two domains, which are asymmetrical in natures and volumes, in the same manner. Still, the existing methods manipulate the 4D LF data in a symmetrical manner. These methods majorly fall into three categories: 4D convolution filters [21, 23], epipolar-plane images (EPI) based methods [22, 24, 25, 26, 10, 27] and pseudo-4D convolution filters [5, 21, 23].

A 4D convolution filter processes the LF image straightforwardly as it convolves the spatial and angular domain simultaneously. However, it is proved to be inefficient as it requires expensive computation and large memory consumption [5, 21]. As an alternative, the EPI-based methods have been proposed which perform independent super-resolution on EPI slices. An EPI slice can be obtained from the 4D LF by fixing 2 specific coordinates, with one lies on the spatial domain and the other on the angular domain. An illustration of this idea is depicted in Fig. 1(a). As EPI slices are essentially 2D images with patterns that reflect the correlation information in the LF image, these methods decompose the 4D LF reconstruction task into a series of 2D sub-tasks. Another alternative is pseudo-4D convolution filters which decompose a 4D convolution into a spatial and an angular convolution separately. Typical examples are the interleaved 2D angular and spatial convolutions to simulate 4D filters for material recognition [5] and the spatio-angular separable (SAS) convolution for LF reconstruction and super-resolution [21, 23]. Illustrations of EPI-based and pseudo-4D methods are demonstrated in Fig. 1 (b) and (c) correspondingly.

Even though the aforementioned three options have achieved remarkable progress in extracting spatio-angular features, the vast majority of them treat the spatial and angular dimensions symmetrically against the domain asymmetry. One of the very few exceptions is [22] where the issue was addressed by its "blur-restoration-deblur" framework which downsamples the spatial domain for conforming with the asymmetry at the expense of losing spatial high-frequency information rendering the method sub-optimal. In this paper, we argue that the operations performed in the spatial and angular domains should be of distinct natures accordingly. The convolution operation performed in the spatial domain is modeling the local spatial features such as edges and corners, whereas the one performed in the angular domain is modeling the disparities information. Therefore, we have hypothesized that the symmetrical pattern hampers the spatio-angular feature representation and we proposed correlation blocks which comprise an uneven number of spatial and angular convolutions to model correlation information of LF images asymmetrically. Unlike [22], our approach doesn’t involve any down-sampling or other operations that would explicitly abandon any valuable information in any domain but fully convolves on both domains. We elaborate this component in Section III-B and validate our hypothesis in the ablation study in Section IV-D2.

Refer to caption — Figure 1: The illustrations of the 4D convolution alternatives. (a) EPI, (b) Interleaved 2D convolution, (c) SAS convolution. They all process the spatio-angular information in a symmetrical way.

As for Inefficient Information Flow, most computer vision methods have suffer from it as they opted to increase network depth [28, 29] to extract deeper features. Obviously, its downside is a huge number of trainable parameters and high computation cost. The other defect is the gradient-vanishing problem [30], especially in shallow layers, as networks go deeper, resulting in training difficulties. These negative effects can be amplified in the realm of LF images due to its 4D high volume [5, 21], leading to model inefficiency when training and testing. For example, in [21], Yeung et al. have proposed a network that outperforms Kalantari et al. [20] by 0.3 db PSNR with 4 SAS convolutions. Yet, the performance gain is diminishing as more SAS convolutions are added, with a further 0.3 db and 0.6 db improvement at 8 and 16 SAS convolutions, which requires 1.5 and 2 times of model size respectively. Although it has achieved state-of-the-art performance, the 16-SAS-convolution network contains around 1.5 million parameters and takes more than 8 days to fully converge, rendering the model difficult to train and impractical to apply to memory-constrained devices.

Inspired by the success of dense skip connections which have been widely studied and exploited in image processing [30, 31, 32, 33, 34], we propose to use dense skip connections to enhance the information flow. Contrary to the previous dense networks, we propose spatio-angular dense skip connections specially designed for LF images which consist of three kinds of connections, namely angular, spatial, and image skip connections, to supply distinct types of information and reinforce the spatio-angular feature representation.

Lastly, with these two components, correlation blocks and spatio-angular dense skip connections, we propose a simple yet efficient end-to-end Spatio-Angular Dense Network (SADenseNet). Extensive experiments are conducted on both real-world and synthetic datasets to demonstrate SADenseNet’s superior performance of LF reconstruction at substantially lower computational and memory costs compared with state-of-the-art methods. A series of ablation studies are also presented to verify the proposed components’ effectiveness, and we also perform SADenseNet as pre-processing that yields a positive effect for depth estimation to prove its potential for more accurate LF measurement application.

The major contributions of this paper are summarized in the following five aspects:

•

Correlation blocks are proposed to model correlation information based on the study of domain asymmetry.
•

Spatio-angular dense skip connections are proposed to enhance the information flow within spatial and angular domains.
•

With the proposed components, we design a simple yet efficient end-to-end network, Spatio-Angular Dense Network (SADenseNet), for light field reconstruction.
•

SADenseNet is evaluated by extensive experiments on both real-world and synthetic datasets to verify its superior performance and efficiency in computation and memory usage compared with previous state-of-the-art methods.
•

An experiment on depth estimation is conducted to prove SADenseNet can improve the accuracy of LF’s measurement applications.

II Related Works

II-A Light Field Image Processing

To process LF images, early methods [35, 36, 37, zhang_light_2015] explicitly estimate the disparity or depth and then obtain the reconstructed SAIs by warping the input SAIs and blending. Various depth estimation methods are proposed, e.g., phase-based estimation [zhang_light_2015] and EPI-based estimation [37]. Different blending methods are also introduced, e.g., soft-blending [38] and learning-based blending [39]. These methods are overly dependent on the quality of the disparity or depth maps which the existence of noise in them causes undesirable artifacts. Such approaches are also disadvantaged when handling occluded areas as the warping process has no information about the invisible parts.

In recent years, significant progress in computer vision has been achieved by the deep learning-based methods [28, 40, 41, 32], and this technique has been applied to LF reconstruction. The first deep learning-based LF reconstruction method has been proposed by Kalantari et al. [20] which designed a disparity network to estimate the disparity and reconstruct the intermediate SAIs by warping the input SAIs with the calculated disparity. After that, the input and intermediate SAIs are fed into a color network to obtain the refined SAIs. This method has achieved outstanding performance with the deep features extracted by the deep neural networks. However, the reconstruction quality is limited due to the artifacts brought by the explicit disparity estimation and warping process. Moreover, each SAI needs to be individually reconstructed, which leads to duplicated calculations. Similarly, Zhou et al. [42] proposed an encoder-decoder network to estimate the disparity to synthesize views by warping. With the help of its modified ResNet-50 to extract expressive representation and three sub-networks for disparity estimation, noise filtering, and view rendering respectively, robustness is gained against input noise.

Wang et al. proposed an EPI-based method to decompose the 4D task into 3D sub-tasks and train two networks separately for vertical and horizontal reconstruction [27]. Heber et al. [26, 10] designed a U-shaped network to extract shape information by EPIs while Shin et al. [25] proposed a EPI-based method to compute disparity maps from LF with a four-branch network. A common drawback of EPI-based methods is they ignore either one [27, 25, 26] or two dimensions [22, 10] of the 4D LF in their optimization and cannot be jointly optimized in all dimensions. Therefore, they cannot process the LF data in a global scope.

In [21], Yeung et al. proposed an end-to-end network that removes explicit disparity manipulation and reconstructs SAIs at a single forward propagation. This method has achieved groundbreaking performance with its SAS convolution that enables joint optimization in all dimensions. However, it has endured the same defect of stacking convolutional layers for obtaining deeper feature representation as other computer vision tasks [28, 43]. Different from decomposing 4D convolution [21], Meng et al. [44] pursued to fully exploit high-dimensional LF information and proposed aperture group batch normalization to ease the training of 4D convolutional network. In addition to pixel-wise loss function in the spatial space, they proposed an angular loss to optimize the error in the EPI space to preserve the correlation information of adjacent viewpoints. However, suffering from heavy 4D convolution operations, it was still running at a large cost of computation and memory, and the high-dimensional information was yet to be fully exploited as the performance gain was not proportional to its costs. Besides pure learning-based methods, Chandramouli et al. [45] proposed a generative model to tackle the issue that learning-based models are confined to the observation model they have been trained on. Despite its limited performance, the fashion is inspiring with possibilities of LF models with generalization abilities.

There are some other methods [46, 47] that perform LF reconstruction but in a compressive manner. Their compression performance is proved to be promising, but they are essentially a different task from regular LF reconstruction as the full LF image is involved to be encoded into a small volume of information instead of a limited set of SAIs. Although the above methods and the regular LF reconstruction are not working in exactly the same application scenarios, it would be interesting to see where their performance ceilings are located.

In this paper, our proposed SADenseNet has a similar pipeline as [21] that contains two parts, namely the feature extraction and the SAI reconstruction. However, we ditch the refinement work as SADenseNet’s feature extraction is so strong with the help of correlation blocks and spatio-angular dense skip connections that post-processing components are unnecessary. Rather than inefficient explicit warping process [20] and costly 4D convolution operations [44], we pursue the potential of decomposed convolution operations and information flow enhancement.

II-B Feature Representation

In order to acquire powerful feature representation, stacking convolutional layers is the most straightforward solution since more parameters lead to higher non-linearity and more complicated mapping functions. However, heavier computation costs and higher risks of over-fitting are also incurred. To this end, some studies have discovered more efficient network structures.

The residual connection is one of the widely used structures firstly proposed by He et al. [48] to force layers to learn residue of a mapping function instead of the mapping function itself. Such a technique eases the optimization and improves the feature representation, and can be applied to other computer vision tasks such as single image super-resolution (SISR) [29, 49, 50] and visual tracking [51, 52].

On the other hand, dense skip connections have been proposed to mitigate the gradient-vanishing problem while aggregating hierarchical features of low- and high-frequencies [30]. Tai et al. [31] proposed a dense connected network to aggregate short-term and long-term memories for image restoration. In SISR, Tong et al. designed local and global dense skip connections to aid the feature extraction [32]. Haris et al. [33] designed a projection unit to learn down-sampling as well as up-sampling. Li et al. [34] introduced feedback units containing projection groups to form a recurrent neural network (RNN) to enforce a curriculum learning strategy. The basic units of the latter two methods [33, 34] are reinforced with dense skip connections to gain richer feature representation. Among these, [31] and [32] have similar architectures with our SADenseNet as they perform dual dense skip connections. Nevertheless, the critical difference is that their duality is for extracting homogeneous features while ours are for spatio-angular features from the two distinct domains.

II-C Video-related Tasks

Video-related tasks share similarities with LF tasks as they are both extracting cross-domain representation and processing multiple images or frames that are highly correlated to each other. A powerful spatio-temporal feature representation is a key to success as well. In terms of decomposition of high dimensions, similar methods can be found in pseudo-3D-based P3D [53, 54] and R2D [55]. Different from the pseudo-4D convolution in LF, the pseudo-3D convolutions decompose a 3D convolution into two 2D convolutions operating in the spatial and temporal spaces separately. Convolution decomposition has been proved to achieve better performance than the original 3D convolution achieved by the extra non-linearity and the reduced risk of over-fitting.

III Proposed Method

III-A Overview

Let us consider $\mathcal{F}(X)$ as the LF reconstruction function that maps the input SAIs $X$ to the reconstructed SAIs $\hat{Y}$ , hence

\begin{split}\hat{Y}=\mathcal{F}(X)&\quad X\in\mathbb{R}^{U\times V\times W\times H}\\ &\quad\hat{Y}\in\mathbb{R}^{\mathcal{N}\times W\times H}\end{split}

(1)

where $(U,V)$ of $X$ and ${\mathcal{N}}$ of $\hat{Y}$ denote the input and output angular resolution respectively, i.e. the number of SAIs, while $(W,H)$ is the spatial resolution, i.e. the resolution of every single SAI. The input SAIs $X$ , which is a sparsely sampled LF image, and the reconstructed SAIs $\hat{Y}$ can finally form a densely sampled LF image ${\hat{I}\in\mathbb{R}^{\overline{U}\times\overline{V}\times W\times H}}$ . Therefore,

\overline{U}\times\overline{V}=U\times V+\mathcal{N}.

The objective of this paper is to train an end-to-end network to learn $\mathcal{F}(X)$ in Eq. 1 for estimating $\hat{Y}$ that approximates the ground truth $Y$ . The proposed SADenseNet comprises two components, correlation feature extraction, and SAI reconstruction, and an illustration is given in Fig. 2. The input SAIs $X$ are fed into correlation feature extraction containing a series of correlation blocks, which will be described in Section III-B. In addition to the correlation blocks, spatio-angular dense skip connections are exploited through the inter- and intra-correlation-blocks to enhance the information flow, which will be elaborated in Section III-C. With the extracted correlation features, SAI reconstruction will finally reconstruct SAIs $\hat{Y}$ , which will be explained in Section III-D.

III-B Correlation Blocks

For addressing the aforementioned domain asymmetry problem in Section I, we propose correlation blocks that process the spatial and angular information asymmetrically. The structure is demonstrated in Fig. 3.

The component is based on the SAS module proposed in [21, 23]. However, motivated by the fact that the spatial space is usually substantially larger compared to the angular space, i.e. $U\times V\ll W\times H$ , we presume that more spatial operations are needed than angular operations. Therefore, in each block, the 4D LF tensor of size $(U,V,W,H,C_{1})$ , where $C_{1}$ is the number of input channels, is firstly reshaped into the spatial mode of size $(U\times V,W,H,C_{1})$ to flatten the angular dimensions. Subsequently, a series of spatial convolutional layers, denoted as $[\mathcal{C}^{\mathcal{S}}_{1},\mathcal{C}^{\mathcal{S}}_{2},\dotsb,\mathcal{C}^{\mathcal{S}}_{n_{\mathcal{S}}}]$ , are operated on the reshaped tensor of kernel size $(1,k_{W},k_{H},C_{1},C_{1})$ to extract the spatial features, where $n_{\mathcal{S}}$ is the number of spatial convolutional layers¹¹1In this paper, the last two values of the size of 3D convolution kernels denote the input and output channels correspondingly.. Likewise, the output of spatial convolution series is reshaped into the angular mode of size $(U,V,W\times H,C_{1})$ and convolved by an angular convolutional layer of kernel size $(k_{U},k_{V},1,C_{1},C_{2})$ to extract angular features that model the correlation information, where $C_{2}$ is the number of output channels. Lastly, the tensor is reshaped back into the original shape with new channels $(U,V,W,H,C_{2})$ for the upcoming operation. Hence, a correlation block can be formulated as

	$\displaystyle\mathcal{H}(x)=\mathcal{A}(\mathcal{S}(x))$		(2)
	$\displaystyle\mathcal{A}(x)=\mathcal{C}^{\mathcal{A}}(x)$		(3)
	$\displaystyle\mathcal{S}(x)=\mathcal{C}^{\mathcal{S}}_{n_{\mathcal{S}}}(\mathcal{C}^{\mathcal{S}}_{n_{\mathcal{S}}-1}(\dotsb(\mathcal{C}^{\mathcal{S}}_{2}(\mathcal{C}^{\mathcal{S}}_{1}(x)))\dotsb)$		(4)

where $\mathcal{S}$ and $\mathcal{A}$ denote the spatial and angular feature extraction while $\mathcal{C}^{\mathcal{S}}(x)$ and $\mathcal{C}^{\mathcal{A}}(x)$ denote the corresponding convolution functions.

Such an asymmetrical processing method suits LF image data compared with the existing 4D convolution alternatives because the spatial information is sufficiently convolved to extract informative features with an enlarged receptive field in the spatial domain before the subsequent angular convolution.

In order to acquire deeper correlation features, a series of correlation blocks are employed consecutively as

F_{i}=\begin{cases}\mathcal{H}(F_{i-1}),&\text{if}\ i>1,\\ \mathcal{H}(X),&\text{if}\ i=1.\end{cases}

(5)

where $F_{i}$ denotes the output tensor of $i$ -th correlation block.

The angular convolutions of correlation blocks are operated on the coarse-to-fine spatial features, thus they can extract correlation features hierarchically. The correlation extraction can be regarded as implicit disparity modeling but without explicit dependency on the accuracy of disparity. We denote the number of correlation blocks as $n_{CB}$ and set both $(k_{U},k_{V})$ and $(k_{W},k_{H})$ to $(3,3)$ across all the correlation blocks. Finally, the output of the last correlation block $F_{n_{CB}}$ will be fed to the following SAI reconstruction process.

III-C Spatio-Angular Dense Skip Connections

For further enhancing the feature representation extraction process, we propose to improve the information flow with spatio-angular dense skip connections consisting of three types of connections, namely spatial and angular dense skip connections, and image skip connections.

The spatial dense skip connections are intra-correlation-block connections that concatenate the output of shallow spatial convolutional layers to the output of the subsequent layers within a correlation block, i.e. the spatial convolutional layers will receive the features from the preceding layers. Consequently, the spatial convolution function in Eq. 4 can be revised as

	$\displaystyle\mathcal{S}(x)=\tilde{\mathcal{C}}^{\mathcal{S}}_{n_{S}}(x)$		(6)
	$\displaystyle\tilde{\mathcal{C}^{\mathcal{S}}_{i}}(x)=[\mathcal{C}^{\mathcal{S}}_{i}(\tilde{\mathcal{C}}^{\mathcal{S}}_{i-1}(x)),\tilde{\mathcal{C}}^{\mathcal{S}}_{i-1}(x),\dotsb,\tilde{\mathcal{C}}^{\mathcal{S}}_{2}(x),\tilde{\mathcal{C}}^{\mathcal{S}}_{1}(x)]$		(7)

where $\tilde{\mathcal{C}^{\mathcal{S}}_{i}}(x)$ represents the $i$ -th densely connected spatial convolutional layer and $[\cdot]$ indicates concatenation. The spatial dense skip connections are demonstrated in Fig. 2 and 3 as blue arrows connecting blue tensors under the spatial mode.

The densely connected structure comes with three critical benefits. Firstly, the feature tensors are reinforced with both shallow and deep information, forming hierarchical representations. This is contrary to the previous methods [21] where the shallow features are absent in high-level processing. Secondly, it facilitates the optimization of the network, especially for the early layers, since every layer will access the gradients directly in back-propagation and the gradient vanishment problem is also greatly alleviated. Last but not least, the high trainable parameter utilization leads to model efficiency and prevents over-fitting.

While the spatial dense skip connections strengthen the spatial representation within individual correlation blocks, we explore to bring this benefit to the angular domain by adopting angular dense skip connections that concatenate the output of the preceding correlation blocks $F_{i}$ to the latter ones. As angular convolutions are connected with their counterparts in the previous correlation blocks, the angular dense skip connections play a role as inter-correlation-block information flow as depicted in Fig. 2 as the red arrows.

Furthermore, inspired by the practice of appending raw input to the intermediate layers in video processing [41] and optical flow [56], we introduce image skip connections to provide input $X$ as primitive features for all the blocks. This design is very similar to [41, 56] as these two methods implicitly estimate the motion between frames and inject raw image information into the intermediate motion feature for further refinement, while in our case, the correlation blocks implicitly model the LF disparity and the raw image offers complementary information for the angular convolutions. The image skip connections are depicted as yellow arrows connecting the input LF and the correlation blocks in Fig. 2.

Accordingly, the function of correlation blocks in Eq. 5 are revised into

F_{i}=\begin{cases}[\mathcal{H}(F_{i-1}),F_{i-1},\dotsb,F_{1},X],&\text{if}\ i>1,\\ \mathcal{H}(X),&\text{if}\ i=1\end{cases}

(8)

where $[\cdot]$ indicates concatenation. The output of the last correlation block $F_{n_{CB}}$ aggregates the feature maps of all the correlation blocks to form hierarchical spatio-angular feature representation. It will be fed to the following SAI reconstruction process.

In [30], the number of feature maps increased by dense skip connections is referred to as the growth rate. For simplicity, we keep the growth rates in both domain consistent across the network. Therefore,

r_{\mathcal{S}}=r_{CB}=C_{1}=C_{2}

where $r_{\mathcal{S}}$ and $r_{CB}$ are growth rates of the spatial and angular dense connections respectively. Consequently, the number of feature maps of the last correlation block $F_{n_{CB}}$ and the last spatial convolution $\tilde{\mathcal{C}}^{\mathcal{S}}_{n_{S}}(x)$ in each correlation block will grow to $r_{CB}\times n_{{CB}}$ and $r_{\mathcal{S}}\times n_{\mathcal{S}}$ respectively. In the experiments of Section IV, the growth rates are set to 32, which are only half of the feature map number used in [21], by virtue of the efficiency of densely connected architecture.

III-D SAI Reconstruction

Before reconstruction using the extracted hierarchical correlation features, we follow the practice in [32] to employ a bottleneck layer to aggregate the hierarchical features and reduce the number of the feature maps to 96. Then, the reduced features are reshaped into the angular mode, zero-padded, and fed into an angular convolutional layer with a kernel size of $(U,V,1,96,\mathcal{N})$ . It shrinks the angular resolution from $(U\times V)$ to $(1\times 1)$ while producing $\mathcal{N}$ channels, which correspond to the $\mathcal{N}$ reconstructed SAIs. In Fig. 2, as $(U\times V)=(2\times 2)$ and $\mathcal{N}=60$ , the kernel size of the angular convolution is $(2,2,1,96,60)$ .

TABLE I: Overall comparison with the state-of-the-art on the real-world datasets. The table is divided into two parts: 1. The 2nd to 4th columns are PSNR and SSIM on four datasets. The number of image samples in a particular dataset is shown in parentheses. 2. The last two columns display the number of parameters and running speed. Bold scores indicate the best results. The last row is the differences between SADenseNet and Yeung et al. [21].

Method	30 Scenes (30)	EPFL (118)	Occlusions (43)	Reflective (31)	# Parameters	Speed/s
Kalantari et al. [20]	38.31/0.9755	38.76/0.9586	31.82/0.8973	35.91/0.9415	1,644,204	721.05
HDDRNet [44]	37.52/0.9664	38.58/0.9547	32.36/0.9071	36.32/0.9443	16,558,848	1434.65
NoisyLFRecon [42]	38.90/0.9776	39.01/0.9639	32.12/0.9070	36.36/0.9455	20,277,551	32.92
Yeung et al. [21]	39.16/0.9782	39.53/0.9641	32.66/0.9073	36.44/0.9458	1,498,752	38.05
SADenseNet(Ours)	40.31/0.9836	40.54/0.9706	33.76/0.9269	37.15/0.9521	1,134,140	12.82
Difference	+1.15/+0.0054	+1.01/+0.0065	+1.10/+0.0196	+0.71/+0.0063	75.67%	$\approx 3\times$ faster

TABLE II: Performance comparison with the state-of-the-art on the synthetic HCI dataset [57]. PSNR and SSIM are given. Bold scores indicate the best results. The differences between SADenseNet and Yeung et al. [21] are presented in the last row.

Method	Buddha	Mona	Average
Kalantari et al. [20]	42.73/0.9844	42.42/0.9858	42.58/0.9851
Wu et al. [24]	43.20/0.9963	44.37/0.9977	43.79/0.9981
Yeung et al. [21]	43.77/0.9872	45.67/0.9920	44.72/0.9896
SADenseNet(Ours)	45.82/0.9921	46.84/0.9932	46.33/0.9927
Difference	+2.05/+0.0049	+1.17/+0.0012	+1.61/+0.0031

IV Experiments

IV-A Implementation and Evaluation Details

The proposed SADenseNet is implemented using the deep learning library Keras [58] with Tensorflow [59] backend. The network is trained and tested on a PC with an Intel Core i7-6700K 8-core 4.00GHz CPU, an Nvidia GTX 1080 Ti GPU, and 32GB RAM. The source code and trained models are publicly available at https://huzexi.github.io.

In regard to training, to conduct a fair comparison, we follow the protocol of [21] to train the network with the training set proposed by Kalantari et al. [20], which contains 100 samples. The mini-batch size is set to 2 with a spatial size of $(128\times 128)$ . The training process is iterated with an Adam optimizer [60] and the learning rate is set to $1e-4$ . We follow the conventional practice to process the luminance channel of the YCbCr color space. The other two channels, namely Cb and Cr, are acquired by up-sampling the angular resolution using bicubic interpolation. During training, data augmentation is applied to improve the generalization of the network. We follow the strategies in [25] to randomly flip and rotate the spatial and angular dimensions simultaneously. As a result, the training data can be reused 8 times.

$n_{CB}$ and $n_{\mathcal{S}}$ are set to 6 and 5 correspondingly as the default values, and the network is trained by minimizing the Mean Squared Error (MSE) loss function as follows:

\min_{\hat{Y}}{\sum_{n}^{\mathcal{N}}\sum_{x}^{W}\sum_{y}^{H}(\hat{Y}(n,x,y)-Y(n,x,y))^{2}}

(9)

where $n$ iterates over the reconstructed SAIs and $(x,y)$ indicates a spatial location.

IV-B Comparison with State-of-the-art Methods on Real-world Images

Firstly, we compare SADenseNet’s performance on real-world images with the state-of-the-art methods: Kalantari et al. [20], HDDRNet [44], NoisyLFRecon [42] and Yeung et al. [21].

Experiments are conducted on four real-world LF datasets, namely 30 Scenes [20], EPFL [61], Occlusions [62] and Reflective [62] which are captured by Lytro Illum cameras [18]. These four datasets contain 30, 118, 43, and 31 LF images correspondingly. The LF images that have appeared in the 100-sample training set are removed for fairness. In each LF image, the angular resolution is $(14\times 14)$ and the spatial resolution is $(376\times 541)$ . During the evaluation, only the central $(8\times 8)$ SAIs are adopted as the rest are dark and noisy, 22 pixels of the four borders in the spatial resolution are also shaved as in [20]. The reconstruction is performed from $(2\times 2)$ SAIs to $(8\times 8)$ SAIs. The reconstruction quality is measured by Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) in RGB color space.

The results are shown in Table I. It can be observed that SADenseNet has outperformed the other methods observably. Compared with the best of them, Yeung et al. [21], SADenseNet outperforms by more than 1.00 dB PSNR in all datasets except the Reflective dataset where our proposed approach has a reconstruction advantage of 0.71 dB PSNR.

For a better understanding of the performance gains, selected visual results from the four test datasets are presented in Fig. 4. It is visible that our error maps are basically clearer than the other prior arts. Concretely, the reconstructed edges of the objects are more complete in IMG_1528, e.g., the lamp pole behind the leaves in the red bounding box and the sign above the vehicles in the blue bounding box. The lamp pole example poses a major challenge to LF reconstruction as it is occluded by the leaves in some SAIs. Such a phenomenon can also be observed at the fence border in the red bounding box of Occlusions_29 and the English letter behind the leaves in the blue bounding box of Reflective_12. This can be attributed to the abundant spatio-angular representation in handling the appearance that is visible in limited SAIs. Similar improvement can also be observed in the reconstruction of complicated details, e.g., the flower cores in the blue and red boxes of Mirabelle_Prune_Tree, and the windows in the blue bounding box of Occlusions_29, which are entirely blurred in the result of most methods. In Reflective_12, it is very obvious that the method of HDDRNet [44] and Yeung et al. [21] incorrectly reconstructs a large area of shadow at the bottom left corner while SADenseNet successfully reconstructs it with the long-term information supplied by the dense skip connections [31] and enlarged spatial receptive fields of the correlation blocks while HDDRNet [44] and Yeung et al. [21] confine the correlation information and falsely renders the shadow across all SAIs.

On the other hand, to demonstrate the model’s efficiency, we compare the memory cost in terms of the number of trainable parameters and the computation cost in terms of the testing speed, reported in Table I. The speed tests are operated in CPU-only mode. The results demonstrate that the method of Kalantari et al. [20] and NoisyLFRecon [42] are plagued by their explicit disparity estimation and warping process. Also, their SAI synthesis functions must be performed separately for each SAI as the intermediate information cannot be shared ending up with a very low processing speed. Regarding the memory cost, HDDRNet [44] and NoisyLFRecon [42] employ a relatively huge model as the former hires 4D convolution operators and the latter adopts a modified ResNet-50 which has taken more than 19 million parameters. These two means aim at extracting an expressive feature representation but lead to a huge model instead of improvement in reconstruction performance. On the other hand, the end-to-end separable-convolution-based methods, [21] and SADenseNet, reconstruct SAIs in one propagation at a substantially faster speed. Moreover, compared with [21], our SADenseNet requires only 3/4 of parameters and is nearly 3 times faster. The achievement can be attributed to the correlation blocks with efficient convolutions and spatio-angular dense skip connections that improve information flow.

IV-C Comparison with State-of-the-art Methods on Synthetic Images

The performance of SADenseNet is also evaluated on the synthetic HCI dataset [57]. We follow the practice of Wu et al. [24] and Yeung et al. [21] to calculate PSNR and SSIM on the Y channel only over the $(3\times 3)$ to $(9\times 9)$ reconstruction task. The results on Buddha and Mona are reported in Table II. SADenseNet has produced competitive results exceeding Yeung et al. [21] by 1.61 dB PSNR averagely. In regard to SSIM, SADenseNet has obtained the closest results to the ones reported by Wu et al. in [24] while outperforming its PSNR significantly likewise.

The reconstruction quality of the two testing images has been visualized in Fig. 5. The most significant difference in Buddha is located at the edges of the bricks on the floor where SADenseNet does not produce the blurring artifacts as [21]. Similar improvement has been observed in Mona where artifacts are produced around the leaf border in [21] but not seen in SADenseNet.

IV-D Ablation Studies

To further study the nature of SADenseNet, a series of ablation studies are conducted in this section. We keep the previous configuration in Section IV-B to conduct the studies on real-world datasets.

TABLE III: Comparison of performance with different

n_{\mathcal{S}}

. The baseline variant is underlined.

$n_{\mathcal{S}}$	30 Scenes	# Parameters	Speed/s
6	40.34/0.9839	1,466,108	16.25
5	40.31/0.9836	1,134,140	12.82
4	40.15/0.9831	857,468	9.96
3	40.01/0.9824	636,092	7.30
2	39.80/0.9815	470,012	5.72
1	38.28/0.9787	359,228	4.18

TABLE IV: Comparison of performance with different

n_{\mathcal{A}}

. The baseline variant is underlined.

$n_{\mathcal{A}}$	30 Scenes	# Parameters	Speed/s
1	40.31/0.9836	1,134,140	12.82
2	40.27/0.9834	1,189,628	13.48
3	40.19/0.9834	1,245,116	14.26

TABLE V: Comparison of performance with different

n_{CB}

. The baseline variant is underlined.

$n_{CB}$	30 Scenes	# Parameters	Speed/s
7	40.31/0.9835	1,346,588	14.98
6	40.31/0.9836	1,134,140	12.82
5	40.21/0.9834	930,908	10.52
4	40.03/0.9824	736,892	8.36
3	39.91/0.9819	552,092	6.48
2	39.61/0.9804	376,508	4.30
1	38.35/0.9735	210,240	2.41

IV-D1 Spatio-angular Dense Skip Connections

In Section III-C, we have introduced three kinds of dense skip connections to enhance the information flow. To discover the contribution of these connections, we create several variants with or without them. For simplicity, spatial, angular and image connections are denoted as S, A and I correspondingly. 8 variants, each processed a unique combination of connections, are reported in Table VI. SADenseNet_ISA is effectively the full version of SADenseNet. $n_{CB}$ and $n_{\mathcal{S}}$ remain 6 and 5 correspondingly for all the variants. In Table VI, it is obvious that reconstruction quality deteriorates if any connection is removed. For a better comparison of the connections’ contribution, we visualize these variants’ performance against model sizes and speed in Fig. 7. We set the full version, SADenseNet_ISA, as the baseline in this study, and the results of the prior arts, Kalantari et al. [20] and Yeung et al. [21], are also plotted.

TABLE VI: Performance comparison of SADenseNet variants on the real-world datasets. The table is divided into three parts: 1. The 2nd to 4th columns are the properties of the variants. 2. The 5th to 8th columns are PSNR and SSIM evaluated on four datasets. 3. The last two columns display the number of parameters and running speed. The baseline variant is underlined.

Name	Spatial	Angular	Image	30 Scenes (30)	EPFL (118)	Occlusions (43)	Reflective (31)	# Parameters	Speed/s
SADenseNet_None				38.76/0.9776	39.25/0.9642	32.63/0.9123	36.25/0.9453	394,844	4.9
SADenseNet_I			✓	39.85/0.9818	40.17/0.9683	33.30/0.9221	36.90/0.9494	396,860	5.17
SADenseNet_S	✓			39.67/0.9815	39.89/0.9678	33.19/0.9199	36.79/0.9492	947,804	11.15
SADenseNet_A		✓		39.99/0.9822	40.21/0.9686	33.40/0.9207	36.99/0.9495	579,164	6.35
SADenseNet_SA	✓	✓		40.10/0.9828	40.30/0.9692	33.49/0.9241	37.08/0.9496	1,132,124	12.31
SADenseNet_IA		✓	✓	40.18/0.9831	40.42/0.9701	33.67/0.9274	37.06/0.9507	581,180	6.42
SADenseNet_IS	✓		✓	40.14/0.9832	40.32/0.9699	33.53/0.9238	37.05/0.9500	949,820	11.79
SADenseNet_ISA	✓	✓	✓	40.31/0.9836	40.54/0.9706	33.76/0.9267	37.15/0.9521	1,134,140	12.82

In terms of reconstruction quality, while all types of connections outperform SADenseNet_None significantly, the biggest performance gain is achieved by the angular connections as SADenseNet_A surpasses SADenseNet_S and SADenseNet_I by 0.32 dB and 0.14 dB in 30 Scenes.

It is observed that the three kinds of dense skip connections are complementary as combining any two or three of them leads to further performance gains. Besides, all variants have outperformed the prior arts by at least 0.40 dB PSNR except SADenseNet_None which is without any skip connections inducing drastic decline, proving the importance of the three kinds of skip connections.

In Fig. 7 (a) and (b), in terms of the x-axis, it is observed that all the SADenseNet variants have fewer parameters and higher speeds. While the prior arts take 32.92 seconds in [42], 38.05 seconds in [21] and even more than 700 seconds in [20, 44] to reconstruct a full LF image, SADenseNet variants take less than half of the time. It is worth noticing that while SADenseNet_ISA has achieved the best reconstruction quality with a large model size similar to Yeung et al. [21], the sub-optimal variant SADenseNet_IA is capable of achieving approximate performance, 40.18 dB PSNR, 0.13 dB lower than SADenseNet_ISA, but it uses only 0.58 million parameters, which is about half of SADenseNet_ISA, manifesting to be an economic option under a tight memory constraint. The speed of SADenseNet_IA is about 2 $\times$ faster than the variants with spatial connections as it gets rid of the relatively costly spatial connections. Meanwhile, SADenseNet_I is also a considerable choice to pursue a lightweight model under a tighter condition which takes a similar number of parameters and running speed with SADenseNet_None but achieves 39.85 dB PSNR, 1.09 dB higher.

Last but not least, the training process is plotted in Fig. 6 for demonstrating how these connections improve the information flow and benefit the convergence of the network during training. For simplicity, we only plot the first $5\times 10^{5}$ iterations with PSNR. The curves appear more stable when multiple connections are used, i.e. SADenseNet_ISA, SADenseNet_SA, SADenseNet_IA, and SADenseNet_IS, while the ones with either spatial connections, angular connections, or none have encountered sudden drops in PSNR. We deduce that the improvement of training stability is derived from the combination of spatial and angular dense connected information flow. However, SADenseNet_I is the only exception having a stable curve, suggesting that this type of connection is complementary to the extracted spatio-angular features while training. This impact is also verified by comparing SADenseNet_IA and SADenseNet_IS with SADenseNet_A and SADenseNet_S. It is also witnessed that SADenseNet_ISA, trained with $1.5\times 10^{5}$ iterations in around 14 hours, has already outperformed the best result of Yeung et al. [21] which requires 10 days of training on a similar GPU.

With the analysis from the test scenario and the training process, we can draw a conclusion that the proposed three kinds of dense skip connections can enhance the information flow of SADenseNet to facilitate a quick and stable convergence when training and improve the reconstruction performance while the model remains efficient in computation and memory usage.

IV-D2 Domain Asymmetry

As described in Section I, we hypothesize the domain asymmetry in the spatial and angular domains. In this section, we validate this hypothesis by evaluating the performance with different $n_{\mathcal{S}}$ on 30 Scenes [20] dataset. The number of correlation blocks $n_{CB}$ is fixed to 6 and we set $n_{\mathcal{S}}=5$ as the baseline. As shown in Table V, the performance reduces gradually as $n_{\mathcal{S}}$ decreases. It is worth noticing that when $n_{\mathcal{S}}$ decreases to 1, a correlation block virtually degrades to a SAS convolution as in [21], which achieves the worst results of 38.28 dB, declining by 2.03 dB, suggesting it hampers the performance by treating the two domains symmetrically. Moreover, when adding more spatial convolutions to the baseline, the performance gain is marginal. When $n_{\mathcal{S}}=6$ , PSNR increases by merely 0.03 dB but 29% more parameters and about 22% lower speed are incurred. This suggests that the influence of deeper spatial features has saturated.

On the other hand, we modify the correlation blocks to append more angular convolutions. The number of angular convolutions in each block is denoted as $n_{\mathcal{A}}$ , and the other variables $n_{CB}$ and $n_{\mathcal{S}}$ are fixed to 6 and 5. In each block, only the last angular convolution is densely connected with other blocks. Such a modification seems likely to obtain deeper angular features, however, the results in Table V disagree with the speculation. Harmful effects are recorded that when $n_{\mathcal{A}}=2$ and $3$ , the PSNR declines by 0.04 dB and 0.12 dB compared with the baseline, let alone the incurred computation and memory cost, proving that employing more angular convolutions is in vain.

Given the results on $n_{\mathcal{S}}$ and $n_{\mathcal{A}}$ , we can conclude that domain asymmetry exists in LF data, and the correlation blocks are proved effective to extract spatio-angular features with asymmetrical operations.

IV-D3 Correlation Blocks

To verify the effectiveness of stacking correlation blocks, we evaluate the performance by varying the number of blocks $n_{CB}$ . As demonstrated in Table V, the performance is rising as $n_{CB}$ increases and reaches the highest point at $n_{CB}=6$ , which is used as the baseline. When $n_{CB}$ increases to 7, the performance slightly drops, which suggests the performance gain brought by a deeper architecture has been saturated, and adding more correlation blocks will lead to deterioration.

Remarkably, the $n_{CB}=2$ variant has outperformed Yeung et al. [21] by 0.45 dB PSNR with only 33% parameters and $8.8\times$ faster speed, another convincing proof of the architecture’s effectiveness and another option to obtain a lightweight but powerful model by employing fewer correlation blocks.

IV-E Depth Estimation

Depth estimation is one of the important LF measurement applications. As SAIs are reconstructed, we further investigate the depth maps estimated from the densely sampled LF to demonstrate SADenseNet’s reconstruction quality from another perspective. Fig. 8 gives some selected images’ depth maps generated from the ground truth, the densely-sampled LF reconstructed by Yeung et al. [21] and SADenseNet using the estimation method of [13]. It can be observed that SADenseNet has produced visibly better depth maps than Yeung et al. [21] with high-quality object edges in Chain-link_Fence_2. It is interesting that in Cars and IMG_1411, depth generated on SADenseNet’s reconstructed SAIs is more satisfactory than ground truth. In Cars, the car on the left should be located between the metal plate in the foreground and the building in the background. Both ground truth and Yeung et al. [21] misclassify it as having a similar depth to the metal plate. The misclassification mainly happens in textureless areas such as the sky in the top-left corner of Cars and to the left of the plant in IMG_1411. However, depth maps estimated with SADenseNet do not suffer from this problem.

The depth predictions in Fig. 8 suggests that the region of the textureless areas and the foreground objects become more distinguishable to the depth estimation algorithm [13] after being pre-processed by our proposed method. One plausible explanation for such improvement is that the prior learned by our proposed method is able to refine the input LF during reconstruction before feeding into the depth estimation algorithm. In conclusion, SADenseNet not only produces high reconstruction quality but also supplies complementary information to improve the accuracy of depth estimation and other measurement applications.

IV-F Limitations

Even though SADenseNet has demonstrated superior performance in the task of LF reconstruction, it doesn’t necessarily outperform other methods across all samples. While there are 222 real-world samples tested in the previous evaluation, SADenseNet fails 14 of them, in most of which reflective surfaces are presented. Two of them are visualized in Fig. 9 with NoisyLFRecon [42] and Yeung et al. [21] as the representatives of disparity-based and separable-convolution-based methods, respectively.

In reflective_29, it is observed that SADenseNet is outperformed by Yeung et al. [21] by 0.5 db PSNR, and in the error maps the reflective surface of the steel pot is reconstructed with distorted artifacts in the red box which doesn’t happen in the results of the other two methods. In the other example, Water_Drops, in spite of the higher PSNR and SSIM achieved by Yeung et al. [21] and SADenseNet, the flaw can be easily spotted in their error maps that the reflective water drops are not correctly reconstructed, while NoisyLFRecon [42] is not confused by these reflections with a visibly clearer error map. Given these results, we presumed that disparity-based methods are more robust to the scenes with reflective surfaces, which remains a problem for our method to study in future research.

V Conclusion

In this paper, We have studied domain asymmetry in LF images and proposed correlation blocks to extract spatio-angular feature representation in an asymmetrical manner. For further enhancing the spatio-angular representation, we have applied spatial and angular dense skip connections to construct a compact information flow. Moreover, additional image skip connections are utilized to complement the correlation features with raw image information, forming our final Spatio-Angular Dense Network (SADenseNet). Experiments on four real-world datasets and a synthetic dataset have demonstrated its state-of-the-art performance at significantly lower costs. Ablation studies have verified the benefits of the skip connections and the efficacy of the proposed asymmetrical processing in the spatial and angular domains using the correlation blocks. Furthermore, we have found that some lightweight variants of our proposed method, i.e. the model without spatial dense skip connections or that employing only two correlation blocks, can still achieve impressive performance with very few resources when there are some constrained conditions. At last, we have performed depth estimation as a typical LF measurement task on the reconstructed LF images, and the result has proved SADenseNet’s promising potential of improving LF-related measurement performance.

References

[1] R. Raghavendra, K. B. Raja, and C. Busch, “Presentation attack detection for face recognition using light field camera,” IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 1060–1075, 2015.
[2] Z. Ji, H. Zhu, and Q. Wang, “LFHOG: A discriminative descriptor for live face detection from light field image,” in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 1474–1478.
[3] H. Sellahewa and S. A. Jassim, “Image-quality-based adaptive face recognition,” IEEE Transactions on Instrumentation and Measurement, vol. 59, no. 4, pp. 805–813, 2010.
[4] L. Fang and S. Li, “Face recognition by exploiting local gabor features with multitask adaptive sparse representation,” IEEE Transactions on Instrumentation and Measurement, vol. 64, no. 10, pp. 2605–2615, 2015.
[5] T.-C. Wang, J.-Y. Zhu, E. Hiroaki, M. Chandraker, A. A. Efros, and R. Ramamoorthi, “A 4d light-field dataset and cnn architectures for material recognition,” in Computer Vision – ECCV 2016. Cham: Springer International Publishing, 2016, pp. 121–138.
[6] Z. Lu, H. W. F. Yeung, Q. Qu, Y. Y. Chung, X. Chen, and Z. Chen, “Improved image classification with 4D light-field and interleaved convolutional neural network,” Multimedia Tools and Applications, vol. 78, no. 20, pp. 29 211–29 227, 2019-10.
[7] G. Song, K. Song, and Y. Yan, “Edrnet: Encoder–decoder residual network for salient object detection of strip steel surface defects,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 12, pp. 9709–9719, 2020.
[8] H. Sheng, S. Zhang, X. Liu, and Z. Xiong, “Relative location for light field saliency detection,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 1631–1635.
[9] M. Zhang, W. Ji, Y. Piao, J. Li, Y. Zhang, S. Xu, and H. Lu, “Lfnet: Light field fusion network for salient object detection,” IEEE Transactions on Image Processing, vol. 29, pp. 6276–6287, 2020.
[10] S. Heber, W. Yu, and T. Pock, “Neural EPI-Volume Networks for Shape from Light Field,” in Proceedings of the IEEE International Conference on Computer Vision, vol. 2017-Octob, Oct. 2017, pp. 2271–2279.
[11] H.-G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y.-W. Tai, and I. S. Kweon, “Depth from a Light Field Image with Learning-Based Matching Costs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 297–310, Feb. 2019.
[12] T.-C. Wang, A. A. Efros, and R. Ramamoorthi, “Occlusion-aware depth estimation using light-field cameras,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3487–3495.
[13] J. Chen, J. Hou, Y. Ni, and L.-P. Chau, “Accurate light field depth estimation with superpixel regularization over partially occluded regions,” IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 4889–4900, 2018.
[14] H.-Y. Lin, C.-L. Tsai, and V. L. Tran, “Depth measurement based on stereo vision with integrated camera rotation,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–10, 2021.
[15] E. Lilienblum and A. Al-Hamadi, “A structured light approach for 3-d surface reconstruction with a stereo line-scan system,” IEEE Transactions on Instrumentation and Measurement, vol. 64, no. 5, pp. 1258–1266, 2015.
[16] “Raytrix – 3D light field camera technology.” [Online]. Available: https://raytrix.de/
[17] C. Heinze, S. Spyropoulos, S. Hussmann, and C. Perwaß, “Automated robust metric calibration algorithm for multifocus plenoptic cameras,” IEEE Transactions on Instrumentation and Measurement, vol. 65, no. 5, pp. 1197–1205, 2016.
[18] “Lytro.” [Online]. Available: https://www.lytro.com/
[19] Z. Niu, X. Zhang, J. Ye, L. Yea, R. Zhu, and X. Jiang, “Adaptive phase correction for phase measuring deflectometry based on light field modulation,” IEEE Transactions on Instrumentation and Measurement, pp. 1–1, 2021.
[20] N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi, “Learning-based view synthesis for light field cameras,” ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2016), vol. 35, no. 6, p. 193, 2016.
[21] H. Wing Fung Yeung, J. Hou, J. Chen, Y. Y. Chung, and X. Chen, “Fast Light Field Reconstruction With Deep Coarse-To-Fine Modeling of Spatial-Angular Clues,” in The European Conference on Computer Vision (ECCV), Sep. 2018.
[22] G. Wu, M. Zhao, L. Wang, Q. Dai, T. Chai, and Y. Liu, “Light field reconstruction using deep convolutional network on EPI,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2017, 2017, p. 2.
[23] H. W. F. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Y. Chung, “Light Field Spatial Super-Resolution Using Deep Efficient Spatial-Angular Separable Convolution,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2319–2330, 2019.
[24] G. Wu, Y. Liu, L. Fang, Q. Dai, and T. Chai, “Light Field Reconstruction Using Convolutional Network on EPI and Extended Applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[25] C. Shin, H.-G. Jeon, Y. Yoon, I. S. Kweon, and S. J. Kim, “Epinet: A fully-convolutional neural network using epipolar geometry for depth from light field images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4748–4757.
[26] S. Heber, W. Yu, and T. Pock, “U-shaped Networks for Shape from Light Field,” in Procedings of the British Machine Vision Conference 2016, vol. 1, 2016, pp. 37.1–37.12.
[27] Y. Wang, F. Liu, Z. Wang, G. Hou, Z. Sun, and T. Tan, “End-to-end View Synthesis for Light Field Imaging with Pseudo 4dcnn,” in The European Conference on Computer Vision (ECCV), Sep. 2018.
[28] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” International Conference on Learning Representations (ICRL), pp. 1–14, 2015.
[29] J. Kim, J. K. Lee, and K. M. Lee, “Accurate Image Super-Resolution Using Very Deep Convolutional Networks,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
[30] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[31] Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: A Persistent Memory Network for Image Restoration,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4539–4547.
[32] T. Tong, G. Li, X. Liu, and Q. Gao, “Image Super-Resolution Using Dense Skip Connections,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 4809–4817.
[33] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1664–1673.
[34] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu, “Feedback Network for Image Super-Resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3867–3876.
[35] S. Pujades, F. Devernay, and B. Goldluecke, “Bayesian view synthesis and image-based rendering principles,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3906–3913.
[36] Z. Zhang, Y. Liu, and Q. Dai, “Light field from micro-baseline image pair,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3800–3809.
[37] S. Wanner and B. Goldluecke, “Variational light field analysis for disparity estimation and super-resolution,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 3, pp. 606–619, 2014.
[38] E. Penner and L. Zhang, “Soft 3d reconstruction for view synthesis,” ACM Transactions on Graphics (TOG), vol. 36, no. 6, p. 235, 2017.
[39] H. Zheng, M. Ji, H. Wang, Y. Liu, and L. Fang, “CrossNet: An End-to-end Reference-based Super Resolution Network using Cross-scale Warping,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 88–104.
[40] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters - Improve semantic segmentation by global convolutional network,” in Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-Janua, 2017, pp. 1743–1751.
[41] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi, “Real-time video super-resolution with spatio-temporal networks and motion compensation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4778–4787.
[42] W. Zhou, J. Shi, Y. Hong, L. Lin, and E. E. Kuruoglu, “Robust dense light field reconstruction from sparse noisy sampling,” Signal Processing, vol. 186, p. 108121, 2021.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[44] N. Meng, H. K.-H. So, X. Sun, and E. Lam, “High-dimensional dense residual convolutional neural network for light field reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[45] P. Chandramouli, K. V. Gandikota, A. Gorlitz, A. Kolb, and M. Moeller, “A Generative Model for Generic Light Field Reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[46] “Deep spatial-angular regularization for compressive light field reconstruction over coded apertures,” in Computer Vision – ECCV 2020. Cham: Springer International Publishing, 2020, pp. 278–294.
[47] Y. Inagaki, Y. Kobayashi, K. Takahashi, T. Fujii, and H. Nagahara, “Learning to capture light fields through a coded aperture camera,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 418–434.
[48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[49] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3147–3155.
[50] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in European Conference on Computer Vision, 2018, pp. 286–301.
[51] Y. Song, C. Ma, L. Gong, J. Zhang, R. W. H. Lau, and M.-H. Yang, “CREST: Convolutional Residual Learning for Visual Tracking,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2555–2564.
[52] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, and S. Maybank, “Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[53] Z. Qiu, T. Yao, and T. Mei, “Learning Spatio-Temporal Representation With Pseudo-3d Residual Networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5533–5541.
[54] S. Li, F. He, B. Du, L. Zhang, Y. Xu, and D. Tao, “Fast Spatio-Temporal Residual Network for Video Super-Resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 522–10 531.
[55] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.
[56] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 1647–1655.
[57] S. Wanner, S. Meister, and B. Goldluecke, “Datasets and benchmarks for densely sampled 4d light fields.” in VMV. Citeseer, 2013, pp. 225–226.
[58] F. Chollet et al., “Keras,” https://keras.io, 2015.
[59] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th $USENIX$ Symposium on Operating Systems Design and Implementation $OSDI$ 16), 2016, pp. 265–283.
[60] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[61] M. Rerábek and T. Ebrahimi, “New Light Field Image Dataset,” 8th International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–2, 2016.
[62] “Stanford Lytro Light Field Archive.” [Online]. Available: http://lightfields.stanford.edu/LF2016.html

Ground truth	HDDRNet	NoisyLFRecon	Yeung et al.	SADenseNet (Ours)


IMG_1528	25.35/0.8926	31.15/0.9540	31.30/0.9543	34.18/0.9715


Mirabelle_Prune_Tree	31.05/0.9416	31.16/0.9509	30.63/0.9497	34.20/0.9743


Occlusions_29	34.58/0.9666	37.65/0.9800	36.33/0.9767	39.31/0.9872


Reflective_12	32.83/0.9494	34.98/0.9662	33.14/0.9635	36.10/0.9744

	Central SAI	Depth from ground truth	Depth from Yeung et al. [21]	Depth from SADenseNet
Cars
IMG_1411
Chain-link_Fence_2


(a) PSNR versus numbers of parameters.	(b) PSNR versus speed.

Efficient Light Field Reconstruction via Spatio-Angular Dense Network

Abstract

Index Terms:

I Introduction

II Related Works

II-A Light Field Image Processing

II-B Feature Representation

II-C Video-related Tasks

III Proposed Method

III-A Overview

III-B Correlation Blocks

III-C Spatio-Angular Dense Skip Connections

III-D SAI Reconstruction

IV Experiments

IV-A Implementation and Evaluation Details

IV-B Comparison with State-of-the-art Methods on Real-world Images

IV-C Comparison with State-of-the-art Methods on Synthetic Images

IV-D Ablation Studies

IV-D1 Spatio-angular Dense Skip Connections

IV-D2 Domain Asymmetry

IV-D3 Correlation Blocks

IV-E Depth Estimation

IV-F Limitations

V Conclusion

References

Efficient Light Field Reconstruction
via Spatio-Angular Dense Network