Spatial-Angular Attention Network for
Light Field Reconstruction
Abstract
Typical learning-based light field reconstruction methods rely on constructing a large receptive field by deepening their networks to capture correspondences between input views. In this paper, we propose a spatial-angular attention network that perceives non-local correspondences in the light field and reconstructs a high-angular-resolution light field in an end-to-end manner. Motivated by the non-local attention mechanism [1, 2], a spatial-angular attention module tailored to the high-dimensional light field data is introduced to compute the response of each query pixel from all the positions on its epipolar plane, and to generate an attention map that captures correspondences along the angular dimension. A multi-scale reconstruction structure is then proposed to efficiently implement the non-local attention in the low-resolution feature space, while also preserving the high-frequency components in the high-resolution feature space. Extensive experiments demonstrate the superior performance of the proposed spatial-angular attention network for reconstructing sparsely-sampled light fields with non-Lambertian effects.
Index Terms:
Light field reconstruction, deep learning, attention mechanism.
I Introduction
Through capturing both the intensities and directions of sampled light rays, a light field enables high-quality view synthesis without the need for complex and heterogeneous information such as geometry and texture. More importantly, benefiting from light field rendering technology [3], photorealistic views can be rendered in real time regardless of the scene complexity or non-Lambertian effects. This high-quality rendering usually requires a densely-sampled light field (DSLF), where the disparity between adjacent views should be less than one pixel. However, typical DSLF capture either suffers from a long acquisition time (e.g., the DSLF gantry system [3]) or falls into the well-known resolution trade-off problem. That is, due to the limitation of the sensor resolution [4], the light field is sparsely sampled either in the angular domain [5] or the spatial domain [6].
Recently, a more promising way is to quickly capture a sparsely-sampled (in the angular domain) light field and then apply direct reconstruction or depth-based view synthesis methods [7, 8] built on advanced deep learning techniques. On the one hand, typical learning-based reconstruction methods [9, 10, 11] employ multiple convolutional layers to map the low-angular-resolution light field to the DSLF. But due to the limited receptive range of convolutional filters [12], these networks fail to collect enough information (i.e., the spatial-angular correspondences) when dealing with large disparities, leading to aliasing effects in the reconstructed light field. On the other hand, depth-based view synthesis methods [7, 13, 14] address the large disparity problem through plane sweep (depth estimation), and then synthesize novel views using learning-based prediction. However, these methods require depth consistency along the angular dimension, and thus often fail to handle the depth ambiguity caused by non-Lambertian effects.

In this paper, we propose a Spatial-Angular Attention Network, termed SAA-Net, to achieve DSLF reconstruction from a sparse input. The proposed SAA-Net perceives correspondences on the Epipolar Plane Image (EPI) in a non-local fashion, addressing the aforementioned non-Lambertian and large-disparity issues in a unified framework. Specifically, the SAA-Net consists of two parts: a spatial-angular attention module (Sec. IV-B) and a U-net backbone with multi-scale reconstruction (Sec. IV-A). Motivated by the non-local attention mechanism in [1, 2], for each pixel in the input light field, the Spatial-Angular Attention Module (termed SAAM for short) computes the responses of pixels from all the positions on the epipolar plane, and generates an attention map that records the correspondences along the angular dimension, as shown in Fig. 1 (top). The correspondence information in the attention map is then applied to guide the reconstruction in the angular dimension via matrix multiplication and channel-to-angular pixel shuffling.
To efficiently perform the non-local attention, we propose a convolutional neural network with a multi-scale reconstruction structure. The network follows the basic architecture of the U-net, i.e., an encoder-decoder structure with skip connections. The encoder compresses the input light field in the spatial dimensions and removes redundant information before the SAAM. Rather than simply reconstructing the light field at the end of the network, we propose a multi-scale reconstruction structure that performs deconvolution along the angular dimension in each skip connection branch, as shown in Fig. 1 (top). The proposed multi-scale reconstruction structure maintains view consistency in the low spatial resolution feature space while preserving fine details in the high spatial resolution feature space.
For network training, we propose a spatial-angular perceptual loss that is specifically designed for the high-dimensional light field data (Sec. V). Rather than computing a high-level feature loss [17, 18] by feeding each view of the light field into a 2D CNN (e.g., the commonly-used VGG [19]), we pre-train a 3D auto-encoder that considers the consistency in both the spatial and angular dimensions of the light field. In summary, we make the following contributions (the source code is available at https://github.com/GaochangWu/SAAN):
- A spatial-angular attention module that perceives correspondences non-locally on the epipolar plane;
- A multi-scale reconstruction structure for efficiently performing the non-local attention in the low spatial resolution feature space while also preserving high frequencies;
- A spatial-angular perceptual loss specifically designed for high-dimensional light field data.
We demonstrate the superiority of the SAA-Net by performing extensive evaluations on various light field datasets. The proposed network produces high-quality DSLF reconstructions in challenging cases with both non-Lambertian effects and large disparities, as illustrated in Fig. 1 (bottom).
II Related Work
II-A Light Field Reconstruction
First, we briefly review the major works on light field reconstruction (or view synthesis), grouped by whether depth information is explicitly used.
Depth image-based view synthesis. Typically, this kind of approach first estimates the depth of a scene, then warps and blends the input views to synthesize a novel view [20, 8]. Conventional light field depth estimation approaches follow the pipeline of stereo matching [21], i.e., cost computation, cost aggregation (or cost volume filtering), disparity regression and post-refinement. The main difference is that a light field converts disparity from a discrete space into a continuous space [22], delivering various depth cues that enable light field depth estimation with different strategies, such as structure tensor-based local direction estimation [22], depth from correspondence [23], depth from defocus [24, 25] and depth from parallelogram cues [26]. Moreover, some learning-based approaches combine the aforementioned depth estimation pipeline with 2D convolution-based feature extraction, 3D convolution-based cost volume aggregation and depth regression [27]. For novel view synthesis, input views are warped to the novel viewpoints with sub-pixel accuracy using bilinear interpolation and blended in different manners, e.g., total variation optimization [22], soft blending [20] and learning-based synthesis [28].
Recently, researchers have mainly focused on maximizing the quality of synthesized views using deep learning techniques. Flynn et al. [13] proposed a learning-based method to synthesize novel views using the predicted probabilities and colors for each depth plane. Kalantari et al. [7] further employed a sequential network setting to infer depth (disparity) and color, and optimized the model via end-to-end training. Following the sequential network setting, Meng et al. [29] developed a confidence estimation network between the depth and color networks to infer pixel-wise blending weights. Shi et al. [8] proposed to blend the warped views at both the pixel level and the feature level. Jin et al. proposed to use a regular sampling pattern (four corner views) [30] and a flexible sampling pattern [31] for light field depth estimation, and then perform the reconstruction using spatial-angular alternating refinement. Ko et al. [32] introduced a dynamic blending filter that generates the filter coefficients adaptively according to the warped views. Different from the sequential network settings mentioned above, Zhou et al. [33] proposed a learning-based Multi-Plane Image (MPI) representation that infers a novel view by alpha blending of different images. Mildenhall et al. [14] further proposed to use multiple MPIs to synthesize a local light field.
Depth image-based view synthesis approaches handle the large correspondence gap in a sparsely-sampled light field by using depth estimation and warping. However, the estimated scene depth relies on the Lambertian assumption, and thus these approaches suffer from depth ambiguity when addressing non-Lambertian effects, as demonstrated in Fig. 8 (second case). In this paper, we instead address the large correspondence gap with a non-local attention mechanism that captures long-range correspondences. Since we do not rely on depth information, the proposed method shows higher reconstruction quality on non-Lambertian cases.
Reconstruction without explicit depth. This kind of approach treats light field reconstruction as the approximation of the plenoptic function. In the Fourier domain, sparse sampling in the angular dimension produces overlaps between the original spectrum and its replicas, leading to aliasing effects [10]. Classical approaches [34, 35] design a reconstruction filter (usually in a wedge shape) to extract the original signal while filtering out the aliasing high-frequency components. For instance, Vagharshakyan et al. [36] utilized an adapted discrete shearlet transform in the Fourier domain to remove the high-frequency spectra that introduce aliasing effects. Shi et al. [37] performed DSLF reconstruction as an optimization for sparsity in the continuous Fourier domain.

Inspired by the success of deep learning in computer vision, depth-independent plenoptic reconstruction approaches [9, 10] have been widely investigated using deep convolutional networks. Specifically, Zhu et al. [38] proposed an auto-encoder that combines convolutional layers and convLSTM layers [39]. To explicitly address the aliasing effects, Wu et al. [10] took advantage of the clear texture structure of the EPI and proposed a “blur-restoration-deblur” framework. However, when applying a large blur kernel for large disparities, this approach fails to recover the high-frequency details, leading to blurry results. Liu et al. [40] further applied a multi-stream network that takes 3D EPIs in different directions as input. In addition to extracting slices from the plenoptic function, Yeung et al. [11] directly fed the entire 4D light field into a pseudo 4D convolutional network, and proposed a novel spatial-angular alternating convolution to iteratively refine the angular dimensions of the light field. Jin et al. [41] further extended the spatial-angular alternating convolution to the problem of compressive light field reconstruction. Wang et al. [42] applied pseudo 4D convolution to reconstruct the two angular dimensions of the input light field sequentially. Wu et al. [43] proposed an evaluation network for EPIs with different shear amounts, termed the sheared EPI structure, and further boosted its performance with an end-to-end optimization framework [44]. With this structure, the networks implicitly use depth information to select a well-reconstructed EPI. However, the performance of these networks is limited by the finite receptive field of the convolutional neurons, especially when handling the large disparity problem.
II-B Attention Mechanism
Attention was first introduced to imitate the mechanism of human perception, which mainly focuses on salient parts [45, 46, 47]. Vaswani et al. [48] showed that the attention mechanism is able to solve the long-term dependency problem even without using convolution operations or recurrent cells. An attention mechanism is typically embedded within a conventional network backbone, such as a VGG-net [19], a ResNet [49, 50] or even a simple multi-layer perceptron (MLP) [51]. It encourages the network to focus on the salient parts by assigning an adaptive weight (attention map) to the extracted features.
Hu et al. [52] and Woo et al. [50] proposed to use global pooling (max-pooling or average-pooling) followed by an MLP to aggregate the entire information in the spatial dimension. This attention mechanism enables the network to focus on certain channels of the feature maps. However, the global pooling operation decimates the high frequencies in the feature maps [53], which can be unacceptable, especially for a reconstruction task. Alternatively, Vaswani et al. [48] proposed to use a weighted average of the responses from all the positions with respect to a certain position for Natural Language Processing (NLP), which is called self-attention. Wang et al. [1] further bridged self-attention from NLP to more general tasks in computer vision, such as video classification. More concretely, a high-dimensional feature map, e.g., a 3D tensor with spatial-temporal dimensions, is reshaped into its 2D form for the matrix multiplication. Different from convolution, self-attention allows each element in the feature map to interact with any other element regardless of distance (range), and is thus also termed non-local attention.
Rather than using the non-local attention mechanism, Wang et al. [54, 55] proposed a parallax attention module to calculate the correspondence between two stereo images along the epipolar line. Tsai et al. [56] introduced an attention module in the angular dimension to weight the contribution of each view in a light field. Guo et al. [57] applied a pixel-wise attention map using convolution layers for adaptively blending feature maps from different frequency components in a light field.
Compared with the attention modules in [54, 55, 56, 57], the major difference in this paper is that the proposed attention is calculated non-locally on the 2D epipolar plane for each pixel, enabling the network to capture spatial-angular correspondences. This also distinguishes our method from existing non-local attention mechanisms [1, 2, 48], which perform non-local attention across the entire data dimensions. Another significant difference is that we embed a channel-to-angular pixel shuffling operation into the non-local attention mechanism to explicitly achieve the light field reconstruction task. To the best of our knowledge, our method is the first to apply non-local attention to light field reconstruction.
III Problem Analysis and Motivation
In this section, we empirically show that the performance of a learning-based light field reconstruction method is closely related to the perception range of its neurons (or convolutional filters) on the epipolar plane, especially when handling a light field with large disparities.
Deep neural networks have proven to be a powerful technique for solving ill-posed inverse problems [58]. For the light field reconstruction task, both the network structure and the disparity range of the scene are crucial to the reconstruction performance. Since the disparity range of the scene is fixed once the light field is acquired, most deep learning-based light field reconstruction methods [9, 7, 10, 11] pursue a more appropriate architecture for better performance. Specifically, the depth-based view synthesis methods convert the feature maps into a physically meaningful depth map, while the depth-independent methods directly convert feature maps to novel views. Essentially, both kinds of approaches adopt convolution operations to generate responses (feature maps) among corresponding pixels.
To quantitatively measure the capability of correspondence capturing, we apply the concept of receptive field introduced in [12, 59]. The receptive field measures the number of pixels that are connected to a particular filter in the CNN, i.e., the number of correspondence pixels perceived by a certain convolutional filter.
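To make the notion concrete, the following is a small Python sketch (not part of the paper) that computes the theoretical receptive field of a stack of convolutional layers from their kernel sizes and strides, using the standard recurrence for stacked convolutions.

```python
# A small sketch (not from the paper) that computes the theoretical receptive
# field of a stack of convolutions from kernel sizes and strides, following
# the recurrence r_l = r_{l-1} + (k_l - 1) * j_{l-1}, j_l = j_{l-1} * s_l,
# where j_l is the cumulative stride ("jump") after layer l.

def theoretical_receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from input to output."""
    r, j = 1, 1                # receptive field and cumulative stride at the input
    for k, s in layers:
        r += (k - 1) * j       # each layer widens the field by (k - 1) input jumps
        j *= s
    return r

# Example: ten 3x3 convolutions with stride 1 only reach 21 pixels on the EPI,
# which cannot cover the displacement produced by a large disparity.
print(theoretical_receptive_field([(3, 1)] * 10))  # -> 21
```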
We analyse the reconstruction qualities of two networks with the same structure (U-net) and the same number of parameters (around 120K) but different receptive fields, as illustrated in Fig. 2. For a scene with small disparity (about 3 pixels in the demonstrated example), networks with either small receptive field ( pixels) or large receptive field ( pixels) can reconstruct high angular resolution light fields (EPIs) with view consistency, as shown in Fig. 2(a) and Fig. 2(b). However, for a scene with large disparity (about 9 pixels), the network with small receptive field is not able to collect enough information from corresponding pixels on the epipolar plane, as shown clearly at the top of Fig. 2(c). Since the actual size of the receptive field can be smaller than its theoretical size [59], the actual receptive field might not be able to cover the disparity range of the input light field, leading to severe aliasing effects in the reconstructed result, as shown at the bottom of Fig. 2(c). In contrast, the network with a large receptive field can produce high quality result (Fig. 2(d)).

Due to the limited parameter budget, it is intractable to expand the receptive field by pursuing a deeper network or a larger filter size. The fundamental idea of our proposed light field reconstruction method is to capture the correspondences non-locally across the spatial and angular dimensions of the light field. We achieve this with two designs: 1) a spatial-angular attention module that captures non-local correspondences between any two pixels on the epipolar plane; and 2) an encoder-decoder network that reduces the redundancies in the light field to efficiently achieve the non-local perception.
IV Spatial-Angular Attention Network
In this section, we first introduce the overall architecture of the proposed SAA-Net for light field reconstruction, and then describe the proposed spatial-angular attention module in detail. The input of the SAA-Net is a 3D light field slice with two spatial dimensions and one angular dimension, i.e., or . By splitting light fields into 3D slices, the proposed network can be applied to both 3D light fields from a single-degree-of-freedom gantry system and 4D light fields from plenoptic cameras and camera array systems.
For a 4D light field , we adopt a hierarchical reconstruction strategy similar to that in [10]. The strategy first reconstructs 3D light fields using slices and , and then uses the synthesized 3D light fields to reconstruct the final 4D light field.
IV-A Network Architecture
We propose a network with a Multi-Scale Reconstruction (MSR) structure to maintain view consistency (i.e., continuity in the angular dimension) in the low spatial resolution feature space while preserving fine details in the high spatial resolution feature space. As shown in Fig. 3(a), the backbone of the proposed SAA-Net follows the encoder-decoder structure with skip connections (also known as U-net). However, the proposed SAA-Net has two particular differences: 1) in each skip connection, we use a deconvolution layer along the angular dimension before feeding the feature maps to the decoder part; 2) in the encoder part, we use strided convolution with stride only in the spatial dimensions of the light field. Table I provides the detailed configuration of the proposed SAA-Net.
The encoder part of the SAA-Net generates multi-scale light field features and reduces the redundant information in the spatial dimensions to save the computational and GPU memory costs of the non-local perception. We use two convolutional layers (3D) with strides and to downsample the spatial resolution of the light field features with ratios 4 and 2 along the width and height dimensions, respectively. Before each downsampling, two 3D convolutional layers with filter sizes and (width, height and angular) are employed to take the place of a single convolutional layer with filter size , reducing parameters without performance degradation [11].
The skip connections copy the feature layers before each downsampling layer in the encoder, as shown in Fig. 3(a). For each skip connection, a deconvolution layer (also known as transposed convolution layer) is applied to upsample the feature map in the angular dimension, followed by a convolution. Since the angular information mainly concentrates on the 2D EPI for reconstructing a 3D light field , the filter size in each deconvolution layer in the skip connection is set to .
The decoder part of the SAA-Net upsamples the feature maps from the spatial-angular attention module (Sec. IV-B) by using two deconvolution layers with strides and in the spatial dimensions (width and height), respectively. The decoder also receives information from the skip connections by concatenating the features from the corresponding levels along the channel dimension [60], as shown in Fig. 3(a). We then use two 3D convolutional layers with filter sizes and to compress the channel numbers at each level of the decoder. This can be considered as blending the light field features from different reconstruction scales. Note that all the reconstructions (upsampling operations) in the angular dimension are implemented in the skip connections and the spatial-angular attention module, where the latter is introduced in the following subsection.
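To illustrate the multi-scale reconstruction idea, the following is a minimal PyTorch-style sketch of one encoder level together with its skip-connection branch. The kernel sizes, strides, channel counts and tensor layout are illustrative assumptions (the actual configuration is given in Table I, and the paper's implementation uses TensorFlow).

```python
import torch
import torch.nn as nn

class EncoderLevelWithSkip(nn.Module):
    """One encoder level of the multi-scale reconstruction structure (illustrative).

    Feature tensors are laid out as (batch, channel, width, height, angular).
    Downsampling acts only on the spatial axes; the skip branch upsamples the
    angular axis with a transposed convolution before being passed to the decoder.
    """

    def __init__(self, c_in=24, c_out=48, alpha=4):
        super().__init__()
        self.feat = nn.Sequential(  # two 3D convolutions before each downsampling
            nn.Conv3d(c_in, c_in, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(c_in, c_in, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # strided convolution: stride 2 along width/height, stride 1 along angular
        self.down = nn.Conv3d(c_in, c_out, kernel_size=3, stride=(2, 2, 1), padding=1)
        # skip branch: transposed convolution that upsamples only the angular axis
        self.skip_deconv = nn.ConvTranspose3d(c_in, c_in, kernel_size=(1, 1, alpha),
                                              stride=(1, 1, alpha))
        self.skip_conv = nn.Conv3d(c_in, c_in, kernel_size=3, padding=1)

    def forward(self, x):
        f = self.feat(x)
        skip = self.skip_conv(self.skip_deconv(f))  # angularly upsampled, full spatial res
        down = self.down(f)                         # spatially downsampled features
        return down, skip


# toy usage: 4 input views, spatial size 64x64
down, skip = EncoderLevelWithSkip()(torch.randn(1, 24, 64, 64, 4))
print(down.shape, skip.shape)  # (1, 48, 32, 32, 4) and (1, 24, 64, 64, 16)
```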
IV-B Spatial-Angular Attention Module
Inspired by the non-local attention mechanism in [1, 2], we propose a Spatial-Angular Attention Module (SAAM) to disentangle the disparity information in the light field. The main differences between the proposed SAAM and the previous non-local attention [1, 2] are as follows: 1) since the disparity information is encoded in the EPI, the non-local attention is performed on the 2D epipolar plane rather than in the entire 3D space; 2) we model the light field reconstruction with pixel shuffling [61], which disentangles the reconstructed angular information from the channel dimension.
A straightforward choice to perform spatial-angular attention is to embed the attention module in each resolution scale of the U-net. However, implementing non-local perception in the full resolution light field (feature map) is intractable due to the high computation complexity and GPU memory cost. Alternatively, we insert the proposed SAAM between the encoder and decoder, i.e., to perform non-local operation in the low-resolution feature space, as shown in Fig. 3.
Since the features in a 3D CNN form a 5D tensor (i.e., batch, width, height, angular and channel), we first apply two convolution layers with kernel size to produce two feature layers and with size of . The channel number is set to (i.e., in our implementation) for computational efficiency. Then the feature layers and are reshaped into 3D tensors and of size and , respectively. In this way, we merge the angular and width dimensions ( and or and in a light field) together to implement the non-local perception on the epipolar plane.
We apply batch-wise matrix multiplication between and and use a softmax function to produce an attention map, as illustrated in Fig. 3(b). The attention map is composed of matrices with shape . Each matrix can be considered as a 2D expansion of a 4D tensor (the batch and height dimensions are omitted). The point indicates the response of light field position to position in the latent space. In other words, the attention map is able to capture correspondences among all the views in the input 3D light field.
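The attention-map computation described above can be sketched as follows; the tensor layout (batch, width, height, angular, channel), the function name and the toy sizes are assumptions for illustration only, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def epipolar_attention_map(feat_a, feat_b):
    """feat_a, feat_b: 5D tensors (B, W, H, A, C') produced by two 1x1x1
    convolutions on the encoder output (illustrative layout).

    The height axis is folded into the batch axis, and the width and angular
    axes are merged, so every pixel attends to all positions on its own
    epipolar plane. A row-wise softmax yields the attention map."""
    B, W, H, A, C = feat_a.shape
    a = feat_a.permute(0, 2, 1, 3, 4).reshape(B * H, W * A, C)
    b = feat_b.permute(0, 2, 1, 3, 4).reshape(B * H, W * A, C)
    logits = torch.bmm(a, b.transpose(1, 2))   # (B*H, W*A, W*A) pairwise responses
    return F.softmax(logits, dim=-1)           # correspondence weights per query pixel

# toy usage: width 16, height 8, 4 views, 8 latent channels
attn = epipolar_attention_map(torch.randn(1, 16, 8, 4, 8), torch.randn(1, 16, 8, 4, 8))
print(attn.shape)  # torch.Size([8, 64, 64])
```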

Layer | Channels (in/out) | Input
---|---|---
Encoder | |
Conv1 | 1/24 |
Conv1 | 24/24 | Conv1
Conv1 | 24/48 | Conv1
Conv2 | 48/48 | Conv1
Conv2 | 48/48 | Conv2
Conv2 | 48/96 | Conv2
Conv3 | 96/48 | Conv2
Conv3 | 48/48 | Conv3
Conv3 | 48/48 | Conv3
Conv3 | 48/48 | Conv3
Conv3 | 48/48 | Conv3
Skip connection | |
Deconv4 | 24/24 | Conv1
Conv4 | 24/24 | Deconv4
Deconv5 | 48/48 | Conv2
Conv5 | 48/48 | Deconv5
SAAM | |
Decoder | |
Conv6 | 48/96 | SAAM
Deconv6 | 96/48 | Conv6
Concat1 | - | Conv6; Conv4
Conv6 | 48/48 | Concat1
Conv6 | 48/48 | Conv6
Deconv7 | 48/24 | Conv6
Concat2 | - | Conv7; Conv5
Conv7 | 24/24 | Concat2
Conv7 | 24/24 | Conv7
Conv8 | 24/1 | Conv7
We demonstrate the non-local perception of the proposed SAAM by visualizing part of the attention map before the softmax function, as shown in Fig. 4. In this example, there are two points and with distinctive visual features, as shown in Fig. 4(a), and their corresponding points in the other views are marked as () and (). As the viewpoint changes along the angular dimension, the background point is occluded by the foreground point , which is shown more clearly in Fig. 4(b). Fig. 4(c) shows three sub-maps extracted from the attention map with and , respectively. It can be clearly seen that a point has the highest response at the location of its correspondence in the attention map. For instance, the response peaks at the location for the corresponding patch (the middle sub-figure of Fig. 4(c)), and at the location for the corresponding patch (the right sub-figure of Fig. 4(c)). For the occluded point , the location of the maximum response changes from to . In this case, the attention module is able to locate the occluded point through its surrounding pixels. More demonstrations of the spatial-angular attention map can be found in Sec. VII-A.
Features and are obtained by another two convolutions, in a manner similar to that for and . The main difference is that the channel numbers of these two feature layers are , where denotes the reconstruction factor (upsampling scale in the angular dimension) of the network. Another batch-wise matrix multiplication is applied between the attention map and (reshaped from ), resulting in a 3D tensor . We then reshape into a 5D tensor .

Layer | Channels (in/out) | Input
---|---|---
Conv1 | / | Encoder
Conv2 | / | Encoder
Conv3 | / | Encoder
Conv4 | / | Encoder
Reshape1 | /- | Conv1
Reshape2 | /- | Conv2
Reshape3 | /- | Conv3
MatMul1 | - | Reshape1; Reshape2
Softmax | - | MatMul1
MatMul2 | - | Softmax; Reshape3
Reshape4 | -/ | MatMul2
Add | / | Reshape4; Conv4
Pixel shuffle | / | Add
Conv5 | / | Pixel shuffle
Conv6 | / | Conv5
Using the aforementioned SAAM, the reconstructed angular information can be well encoded in the channel dimension of the 5D tensor. By disentangling it with channel-to-angular pixel shuffling, we can reconstruct a high angular resolution light field (feature map) in a non-local manner. Specifically, we first multiply the feature layer by a trainable scale parameter (initialized as 0) and add it back to the feature layer . We then apply the channel-to-angular pixel shuffling and reconstruct a 5D tensor . As illustrated in Fig. 5, the channel-to-angular pixel shuffling rearranges elements from the channel dimension to the angular dimension (different from typical spatial super-resolution, upsampling a light field by generates a light field of views, so we simply remove the last views from the reconstructed light field). The final output of the SAAM is generated by two convolutional layers with kernel sizes of and , respectively.
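The channel-to-angular pixel shuffling, together with the scaled residual connection described above, can be sketched as follows. The tensor layout, the function names and the choice of dropping the trailing alpha-1 views follow our reading of the description and the footnote; they should be taken as assumptions rather than the exact implementation.

```python
import torch

def channel_to_angular_shuffle(attended, base, gamma, alpha):
    """attended, base: 5D tensors (B, W, H, A, alpha*C), the output of the second
    matrix multiplication and the feature layer it is added back to.
    gamma: trainable scalar (initialized to 0); alpha: angular reconstruction factor.

    The residual features are rearranged from the channel axis to the angular
    axis, turning A views into alpha*A views; the trailing (alpha - 1) views are
    then dropped (our reading of the footnote)."""
    x = gamma * attended + base                       # scaled residual connection
    B, W, H, A, AC = x.shape
    C = AC // alpha
    x = x.view(B, W, H, A, alpha, C).reshape(B, W, H, A * alpha, C)
    return x[:, :, :, :A * alpha - (alpha - 1), :]    # drop the last alpha-1 views

# toy usage: 4 input views, alpha = 4 -> 13 output views, 8 channels
out = channel_to_angular_shuffle(torch.randn(1, 16, 8, 4, 32),
                                 torch.randn(1, 16, 8, 4, 32),
                                 torch.tensor(0.0), alpha=4)
print(out.shape)  # torch.Size([1, 16, 8, 13, 8])
```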
By combining the proposed SAAM with the feature maps in the skip connections, the network is able to reconstruct a light field with view consistency while also preserving the high-frequency components. The detailed parameter settings of the SAAM are listed in Table II.
V Network Training
V-A Spatial-Angular Perceptual Loss
Typical learning-based light field reconstruction or view synthesis methods optimize the network parameters with a pixel-wise loss between the inferred image and the desired view (or EPI [10]). Recently, studies [33, 62, 63, 14] have shown that formulating the loss function in a high-level feature space encourages the restoration of high-frequency components. This high-level feature loss, also known as perceptual loss, can be computed from part of the feature layers of the network itself [17] or of other pre-trained networks [18], such as the commonly-used VGG network [19].
In this paper, we propose a spatial-angular perceptual loss that is specifically designed for the high-dimensional light field data. Existing approaches [14, 63, 29] for light field reconstruction apply the perceptual loss between 2D sub-aperture images, neglecting the view consistency constraint in the angular dimension. Instead, we propose to use a 3D light field encoder to map the 3D light fields into high-dimensional feature tensors (width, height, angular and channel).
To achieve this, we design another 3D encoder-decoder network (auto-encoder) that learns to extract the high-level features for the proposed spatial-angular perceptual loss (the architecture of this 3D auto-encoder differs from that of the SAA-Net). Note that the auto-encoder can also be generalized to a 4D form. But given that some light field datasets have only one angular dimension (e.g., light fields from the gantry system in [15]) and the proposed SAA-Net also takes a 3D light field as its input, we only adopt 3D convolutions in the encoder and decoder.

The proposed auto-encoder for the spatial-angular perceptual loss uses the output of the SAA-Net or the desired 3D light field (ground truth) as input, as shown in Fig. 6. It has 12 convolutional layers () with kernel size and 3 bilinear upsampling layers. The encoder part includes the first 6 convolutional layers, where 3 convolutional layers () with stride 2 in each dimension are used to compress the light field from low-level pixel space into high-level feature space. The decoder part includes the last 6 convolutional layers and 3 upsampling layers to restore the light field from the latent representations. Detailed configuration of each layer (stride and number of channels) is shown in Fig. 6.
The auto-encoder learns to extract high-level features through unsupervised learning, i.e., the network learns to reproduce its input 3D light field. The objective for training the auto-encoder is to minimize the reconstruction error between the output of the auto-encoder and its input light field. We use the same training dataset (Sec. V-B), learning rate and optimizer (Sec. V-C) as those for the SAA-Net.
To compute the final spatial-angular perceptual loss, we feed the output of the SAA-Net as well as the ground-truth light field to the auto-encoder, as shown in Fig. 6. The spatial-angular perceptual loss is then defined as

$\mathcal{L}_{SAP} = \sum_{l} \lambda_l \left\| \phi_l(\hat{L}) - \phi_l(L) \right\|_1,$

where $\phi_l(\cdot)$ denotes the feature maps obtained from the $l$-th layer in the encoder, $\hat{L}$ and $L$ denote the reconstructed and the ground-truth light fields, and $\{\lambda_l\}$ is a set of hyperparameters for the proposed spatial-angular perceptual loss.
To prevent the potential possibility that different light field patches are mapped to the same feature vector [17], our loss function also contains a pixel-wise term using the Mean Absolute Error (MAE) between $\hat{L}$ and $L$, i.e.,

$\mathcal{L}_{MAE} = \left\| \hat{L} - L \right\|_1.$
Then the final loss function for training the SAA-Net is defined as

$\mathcal{L} = \mathcal{L}_{MAE} + \mathcal{L}_{SAP},$

where the relative contribution of the two terms is controlled by the set of hyperparameters $\{\lambda_l\}$ in the perceptual loss.
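A minimal sketch of how the combined objective could be assembled is given below, assuming a frozen 3D encoder `encoder3d` that returns a list of intermediate feature maps; the layer selection, the weights `lambdas` and the function names are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def spatial_angular_perceptual_loss(pred, gt, encoder3d, lambdas):
    """pred, gt:  reconstructed and ground-truth 3D light fields, e.g. (B, 1, W, H, A).
    encoder3d:   frozen encoder of the pre-trained 3D auto-encoder, assumed to
                 return a list of intermediate feature maps.
    lambdas:     per-layer weights of the perceptual term."""
    pixel_term = F.l1_loss(pred, gt)              # MAE term in pixel space
    with torch.no_grad():
        feats_gt = encoder3d(gt)                  # target features, no gradient needed
    feats_pred = encoder3d(pred)
    perceptual_term = sum(w * F.l1_loss(fp, fg)   # weighted feature-space distances
                          for w, fp, fg in zip(lambdas, feats_pred, feats_gt))
    return pixel_term + perceptual_term
```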
V-B Training Data
We use light fields from the Stanford (New) Light Field Archive [64] as the training dataset, which contains 12 light fields with views (the light field Lego Gantry Self Portrait is excluded from the training dataset since the moving camera may influence the reconstruction performance). Since the network input is a 3D light field, we can extract 17 and 17 3D light field slices from each 4D light field set. Similar to the data augmentation strategy proposed in [38], we augment the extracted 3D light fields using the shearing operation [65]:
$L_s(x, y, t) = L\big(x + s\,(t - \tfrac{T-1}{2}),\, y,\, t\big),$

where $T$ is the angular resolution of the 3D light field $L(x, y, t)$, $s$ is the shear amount, and $L_s$ is the resulting sheared 3D light field. The other type of 3D slice can be sheared in a similar manner. In practice, we use two shear amounts. The shearing-based data augmentation increases the number of training examples by 2 times. More importantly, the disparity effects in the augmented light fields are more obvious, as shown in Fig. 7, enabling the network to address the large disparity problem.
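The shearing augmentation can be read as shifting each view along the spatial axis in proportion to its angular coordinate, which the following NumPy sketch illustrates; the centering on the middle view, the integer rounding and the wrap-around border handling are simplifying assumptions.

```python
import numpy as np

def shear_light_field(lf, s):
    """lf: 3D light field of shape (W, H, T), with T views along one angular axis.
    s:  shear amount in pixels per view step.

    Each view t is translated along the width axis by s * (t - (T - 1) / 2),
    which changes the apparent disparity of the scene by s pixels per view."""
    W, H, T = lf.shape
    center = (T - 1) / 2.0
    sheared = np.zeros_like(lf)
    for t in range(T):
        shift = int(round(s * (t - center)))
        sheared[..., t] = np.roll(lf[..., t], shift, axis=0)  # wrap-around for simplicity
    return sheared

# toy usage: shearing a random 3D light field by 1 pixel per view
print(shear_light_field(np.random.rand(64, 64, 17), 1).shape)  # (64, 64, 17)
```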
To accelerate the training procedure, the extracted 3D light fields are cropped into sub-light fields with a spatial resolution of (width and height for or height and width for ) and a stride of 40 pixels. About examples can be extracted from the 3D light fields (original and sheared).

V-C Implementation Details
We train two models with reconstruction factors . The input/output angular resolutions of the training samples for these two models are 6/16 and 5/17, respectively. Although the reconstruction factor of the network is fixed, we can achieve a flexible upsampling rate through network cascading. The training is performed on the Y channel (i.e., the luminance channel) of the YCbCr color space. We initialize the weights of both the convolution and deconvolution layers by drawing randomly from a Gaussian distribution with zero mean and a standard deviation of , and the biases to zero. The network is optimized using the ADAM solver [66] with a learning rate of (, ) and a mini-batch size of 28. The model is implemented in the TensorFlow framework [67]. The network converges after steps of backpropagation, taking about 35 hours on an NVIDIA Quadro GV100.

VI Evaluations
In this section, we evaluate the proposed SAA-Net on various kinds of light fields, including those from gantry systems and from a plenoptic camera (Lytro Illum [4]). We mainly compare our method with six state-of-the-art learning-based methods, including Kalantari et al. [7] (depth-based), LLFF [14] (MPI representation), Wu et al. [10], Yeung et al. [11], HDDRNet [63] and DA2N [44] (without explicit depth). To fully demonstrate the effectiveness of our design choices, we also perform ablation studies by training our network without the SAAM, without the MSR structure and without the spatial-angular perceptual loss, respectively. The quantitative evaluation is reported by measuring the average PSNR and SSIM [68] values over the synthesized views of the luminance channel in the YCbCr space. Please refer to the submitted video for more qualitative results.
Method | Bikes | FairyCollection | LivingRoom | Mannequin | WorkShop | Average
---|---|---|---|---|---|---
Kalantari et al. [7] | 34.83 / 0.969 | 36.66 / 0.977 | 46.35 / 0.991 | 40.62 / 0.983 | 38.66 / 0.986 | 39.42 / 0.981 | |
Wu et al. [10] | 38.39 / 0.990 | 40.32 / 0.992 | 45.48 / 0.996 | 43.26 / 0.995 | 41.55 / 0.995 | 41.80 / 0.994 | |
Yeung et al. [11] | 39.55 / 0.993 | 40.25 / 0.993 | 47.32 / 0.997 | 44.49 / 0.996 | 43.17 / 0.996 | 42.96 / 0.995 | |
LLFF [14] | 36.84 / 0.985 | 40.14 / 0.989 | 46.85 / 0.990 | 43.23 / 0.989 | 41.79 / 0.991 | 41.77 / 0.989 | |
HDDRNet [63] | 38.97 / 0.994 | 40.20 / 0.994 | 43.65 / 0.997 | 42.21 / 0.996 | 42.01 / 0.997 | 41.41 / 0.996 | |
DA2N [44] | 39.14 / 0.992 | 41.00 / 0.993 | 46.30 / 0.996 | 44.17 / 0.996 | 43.09 / 0.996 | 42.74 / 0.994 | |
Our proposed | 40.35 / 0.994 | 42.22 / 0.995 | 48.01 / 0.997 | 44.99 / 0.996 | 45.29 / 0.997 | 44.17 / 0.996 | |
Kalantari et al. [7] | 30.67 / 0.935 | 32.39 / 0.952 | 41.62 / 0.973 | 37.15 / 0.970 | 33.94 / 0.971 | 35.15 / 0.960 | |
Wu et al. [10] | 31.22 / 0.951 | 30.33 / 0.942 | 42.43 / 0.991 | 39.53 / 0.989 | 33.49 / 0.977 | 35.40 / 0.970 | |
Yeung et al. [11] | 32.67 / 0.967 | 31.82 / 0.969 | 43.54 / 0.993 | 40.82 / 0.992 | 37.21 / 0.988 | 37.21 / 0.982 | |
LLFF [14] | 34.95 / 0.963 | 34.01 / 0.966 | 44.73 / 0.987 | 39.92 / 0.985 | 37.61 / 0.985 | 38.24 / 0.977 | |
HDDRNet [63] | 33.97 / 0.976 | 35.08 / 0.979 | 44.83 / 0.997 | 40.60 / 0.993 | 38.54 / 0.992 | 38.60 / 0.987 | |
DA2N [44] | 35.79 / 0.984 | 36.23 / 0.981 | 45.91 / 0.996 | 40.83 / 0.992 | 40.11 / 0.994 | 39.77 / 0.990 | |
Our proposed | 36.54 / 0.987 | 37.62 / 0.987 | 47.25 / 0.997 | 42.08 / 0.994 | 40.55 / 0.994 | 40.81 / 0.992 | |
MSR structure-2 | 34.13 / 0.972 | 35.72 / 0.984 | 46.21 / 0.996 | 40.97 / 0.993 | 38.66 / 0.991 | 39.14 / 0.987 | |
MSR structure-3 | 34.20 / 0.973 | 35.53 / 0.984 | 46.79 / 0.996 | 41.04 / 0.993 | 38.85 / 0.991 | 39.28 / 0.987 | |
MSR structure-4 | 34.45 / 0.977 | 35.05 / 0.977 | 47.00 / 0.995 | 41.01 / 0.993 | 38.26 / 0.991 | 39.15 / 0.987 | |
w/o MSR structure | 33.76 / 0.975 | 34.35 / 0.976 | 46.07 / 0.995 | 40.19 / 0.991 | 38.05 / 0.992 | 38.48 / 0.986 | |
w/o SAP loss | 36.43 / 0.987 | 37.29 / 0.986 | 47.10 / 0.997 | 41.81 / 0.993 | 40.18 / 0.993 | 40.56 / 0.991 |
- “MSR structure-X” denotes the proposed network with X series of the MSR structure, without using the SAAM.
VI-A Evaluations on Light Fields from Gantry Systems
In this experiment, the comparisons are performed on light fields from the MPI Light Field Archive [16] ( views of resolution , views for evaluation) and the CIVIT Dataset [15] ( views of resolution ) with upsampling scales and . The performance with respect to both angular sparsity and non-Lambertian effects is taken into consideration. Since the vanilla versions of the networks by Yeung et al. [11] and Meng et al. [63] (HDDRNet) were specifically designed for 4D light fields, we modify their convolutional layers to fit the 3D input while keeping their network architectures unchanged (in the modified implementation of [11], every 8 (6) views are applied to reconstruct (synthesize) a 3D light field of 22 (21) views for the networks of reconstruction factor ()). The networks by Kalantari et al. [7], Yeung et al. [11], Mildenhall et al. [14] (LLFF) and Meng et al. [63] (HDDRNet) are re-trained using the same training dataset as our SAA-Net. Due to the particularity of the training datasets in [44], we do not retrain DA2N. We perform network cascade to achieve different upsampling scales, i.e., two cascades for () upsampling using a network of reconstruction factor ().
Fig. 8 shows the reconstruction results on three light fields, Bikes, FairyCollection and WorkShop, from the MPI Light Field Archive [16] with upsampling scale (disparity range up to 33.5px). The first and the third cases have complex occlusion structures, as shown in the top and the bottom rows of Fig. 8. The baseline methods [7, 10, 11, 14] fail to reconstruct the complex structures. Among them, the depth and learning-based approach [7] and the MPI-based approach [14] fail to estimate proper occlusion relations between the foreground and the background. The second scene is a non-Lambertian case, i.e., a refractive glass in front of the toys. The approach by Kalantari et al. [7] cannot reconstruct the refractive object. And the EPIs reconstructed by the baseline methods [10, 11] exhibit severe aliasing effects due to the limited receptive field. The MPI-based approach LLFF [14] is able to reconstruct the non-Lambertian effects in this case, but produces ghosting artifacts, as marked by the red arrow in Fig. 8.
Fig. 9 shows the reconstruction results on three light fields, Castle, Holiday and Flowers, from the CIVIT Dataset [15] with upsampling scale (disparity range about 14px). The first and the third cases have thin structures with complex occlusions. The depth and learning-based approach by Kalantari et al. [7] fails to estimate depth maps accurate enough to warp the input images, and the color CNN cannot correct the misaligned views, producing ghosting artifacts as shown in Fig. 9. For the second case, we demonstrate reconstructed EPIs in a highly non-Lambertian region. Due to the depth ambiguity, the approach by Kalantari et al. [7] produces choppiness artifacts along the angular dimension. Due to the limited receptive field of the networks, the results by Wu et al. [10] and Yeung et al. [11] show aliasing effects to various degrees. In the demonstrated cases, LLFF [14] assigns wrong planes to the tiny structures, leading to ghosting and tearing artifacts.
Table III and Table IV list the quantitative measurements ( and upsampling scales) on the light fields from the MPI Light Field Archive [16] and the CIVIT Dataset [15], respectively. Compared with the baseline approaches, the proposed SAA-Net shows superior performance on both light field datasets.

Method | Seal & Balls | Castle | Holiday | Dragon | Flowers | Average
---|---|---|---|---|---|---
Kalantari et al. [7] | 46.83 / 0.990 | 39.14 / 0.973 | 36.03 / 0.979 | 43.97 / 0.989 | 39.00 / 0.989 | 40.99 / 0.984 | |
Wu et al. [10] | 49.01 / 0.997 | 37.67 / 0.984 | 40.46 / 0.995 | 48.38 / 0.997 | 45.85 / 0.998 | 44.27 / 0.994 | |
Yeung et al. [11] | 49.83 / 0.997 | 40.84 / 0.993 | 41.16 / 0.996 | 48.61 / 0.997 | 47.83 / 0.997 | 45.65 / 0.996 | |
LLFF [14] | 47.03 / 0.990 | 40.25 / 0.988 | 39.69 / 0.987 | 47.38 / 0.990 | 44.71 / 0.991 | 43.81 / 0.989 | |
HDDRNet [63] | 46.79 / 0.997 | 40.43 / 0.993 | 39.97 / 0.996 | 47.72 / 0.997 | 43.76 / 0.997 | 43.73 / 0.996 | |
DA2N [44] | 48.23 / 0.997 | 42.65 / 0.993 | 41.38 / 0.996 | 48.75 / 0.998 | 46.40 / 0.998 | 45.48 / 0.996 | |
Our proposed | 50.99 / 0.998 | 43.20 / 0.994 | 42.29 / 0.997 | 50.12 / 0.998 | 48.49 / 0.998 | 47.02 / 0.997 | |
Kalantari et al. [7] | 43.13 / 0.985 | 36.03 / 0.965 | 32.44 / 0.961 | 39.50 / 0.985 | 35.21 / 0.973 | 37.26 / 0.974 | |
Wu et al. [10] | 45.21 / 0.994 | 35.20 / 0.977 | 35.58 / 0.987 | 46.39 / 0.997 | 41.60 / 0.995 | 40.80 / 0.990 | |
Yeung et al. [11] | 44.38 / 0.992 | 37.86 / 0.989 | 36.06 / 0.988 | 45.52 / 0.997 | 42.30 / 0.994 | 41.22 / 0.992 | |
LLFF [14] | 45.50 / 0.990 | 38.60 / 0.971 | 36.69 / 0.984 | 44.80 / 0.992 | 41.19 / 0.989 | 41.36 / 0.985 | |
HDDRNet [63] | 44.24 / 0.997 | 39.88 / 0.991 | 38.09 / 0.992 | 44.26 / 0.997 | 42.04 / 0.996 | 41.70 / 0.995 | |
DA2N [44] | 46.19 / 0.996 | 40.77 / 0.992 | 37.99 / 0.992 | 47.19 / 0.998 | 41.95 / 0.996 | 42.82 / 0.995 | |
Our proposed | 49.19 / 0.998 | 41.32 / 0.992 | 38.88 / 0.993 | 48.39 / 0.998 | 44.05 / 0.997 | 44.37 / 0.996 | |
MSR structure-2 | 46.85 / 0.995 | 37.78 / 0.989 | 36.17 / 0.988 | 47.10 / 0.998 | 42.98 / 0.996 | 42.18 / 0.993 | |
MSR structure-3 | 48.50 / 0.997 | 40.66 / 0.991 | 38.23 / 0.992 | 46.94 / 0.997 | 42.92 / 0.996 | 43.45 / 0.995 | |
MSR structure-4 | 47.15 / 0.997 | 40.86 / 0.992 | 38.43 / 0.993 | 46.69 / 0.997 | 43.18 / 0.997 | 43.26 / 0.995 | |
w/o MSR structure | 46.41 / 0.994 | 38.65 / 0.990 | 36.78 / 0.988 | 46.83 / 0.997 | 42.77 / 0.996 | 42.29 / 0.993 | |
w/o SAP loss | 48.83 / 0.996 | 41.05 / 0.992 | 38.62 / 0.992 | 48.00 / 0.997 | 43.85 / 0.997 | 44.07 / 0.995 |
VI-B Evaluations on Light Fields from Lytro Illum
We evaluate the proposed approach using three Lytro light field datasets (113 light fields in total), the 30 Scenes dataset by Kalantari et al. [7], and the Reflective and Occlusions categories from the Stanford Lytro Light Field Archive [69]. In this experiment, we reconstruct a light field from views ( upsampling) and a light field from views ( upsampling). We compare our SAA-Net with 8 learning-based approaches, including 3 depth-based approaches (Kalantari et al. [7], LLFF [14] and Meng et al. [29]) and 5 approaches without explicit depth (Wu et al. [10], Wang et al. [42], Yeung et al. [11], HDDRNet [63] and DA2N [44]). Since the vanilla versions of the networks in [7, 11, 42, 63, 29] are trained on Lytro light fields, we use their open-source model (parameters) without re-training. The networks by Wu et al. [10] and Mildenhall et al. [14] (LLFF) are re-trained using the same dataset introduced in Sec. V-B. Note that the proposed network is not fine-tuned on any Lytro light field datasets, and the results are produced by the same set of network parameters for both and upsampling scales.
Table V lists the quantitative results on the evaluated Lytro light fields. The proposed SAA-Net shows competitive performance compared with the state-of-the-art light field reconstruction approach by Yeung et al. [11]. For the evaluated 113 light fields, the average PSNR / SSIM values of the proposed network are 42.55 / 0.984 for upsampling and 36.55 / 0.969 for upsampling. In comparison, the average PSNR / SSIM values of the baseline approaches with the highest performance are 41.82 / 0.984 (DA2N [44]) for upsampling and 35.81 / 0.938 (Meng et al. [29]) for upsampling. Since our network is not re-trained or fine-tuned on any Lytro light field dataset, these experimental results clearly demonstrate that our network can generalize well on light fields captured by different acquisition geometries.
We demonstrate two cases with relatively large disparities (maximum disparity up to 13px), IMG1743 from the 30 Scenes dataset [7] and Occlusions 23 from the Occlusions category [69], as shown in Fig. 10. In both cases, the reconstruction results by Wu et al. [10] and Yeung et al. [11] show ghosting artifacts around the regions with large disparity (the background in the IMG1743 case, and the foreground in the Occlusions 23 case) due to their limited receptive fields. The depth and learning-based approaches by Kalantari et al. [7] and Mildenhall et al. [14] (LLFF) produce promising results in the first case, but exhibit tearing artifacts near the occlusion boundaries, as marked by the red arrows in the EPIs. In the second case, the approach by Kalantari et al. [7] fails to estimate proper depth information, introducing misalignment as shown by the EPI. LLFF [14] produces ghosting effects due to incorrect plane assignments around the background region, as shown by the red arrows in the figure. In comparison, the proposed SAA-Net provides reconstructed light fields with higher view consistency (as shown in the demonstrated EPIs).

Method | 30 Scenes | Reflective | Occlusions
---|---|---|---
Kalantari et al. [7] | 39.62/0.978 | 37.78/0.971 | 34.02/0.955 | |
Wu et al. [10] | 41.02/0.988 | 41.71/0.989 | 38.11/0.944 | |
Wang et al. [42] | 43.82/0.993 | 39.93/0.959 | 34.69/0.923 | |
Yeung et al. [11] | 44.53/0.990 | 42.56/0.975 | 39.27/0.945 | |
LLFF [14] | 39.92/0.977 | 39.52/0.969 | 35.64/0.929 | |
HDDRNet [63] | 43.02/0.989 | 40.72/0.979 | 36.13/0.958 | |
DA2N [44] | 43.69/0.995 | 43.25/0.991 | 39.82/0.971 | |
Our proposed | 44.75/0.996 | 44.04/0.992 | 40.32/0.972 | |
Kalantari et al. [7] | 38.21/0.974 | 35.84/0.942 | 31.81/0.895 | |
Wu et al. [10] | 36.28/0.965 | 36.48/0.962 | 32.19/0.907 | |
Yeung et al. [11] | 39.22/0.977 | 36.47/0.947 | 32.68/0.906 | |
LLFF [14] | 38.17/0.974 | 36.40/0.948 | 31.96/0.901 | |
HDDRNet [63] | 38.33/0.967 | 36.77/0.931 | 32.78/0.909 | |
Meng et al. [29] | 39.14/0.970 | 37.01/0.950 | 33.10/0.912 | |
DA2N [44] | 38.99/0.986 | 36.72/0.975 | 33.14/0.950 | |
Our proposed | 39.98/0.988 | 37.77/0.978 | 33.77/0.952 | |
MSR structure-2 | 37.17/0.978 | 36.29/0.974 | 32.02/0.944 | |
MSR structure-3 | 37.62/0.980 | 36.93/0.975 | 32.53/0.948 | |
MSR structure-4 | 37.20/0.981 | 36.67/0.975 | 32.58/0.947 | |
w/o MSR structure | 37.08/0.978 | 36.37/0.975 | 32.37/0.946 | |
w/o SAP loss | 39.71/0.986 | 37.39/0.977 | 33.52/0.950 |
VI-C Ablation Studies
In this experiment, we empirically validate the modules and losses in our SAA-Net by performing the following ablation studies on different datasets. The results are listed in the last five rows of Tables III, IV and V. First, we evaluate the network with an increasing number of series of the MSR structure while removing the SAAM. We use “MSR structure-X” to represent the network with X series of the MSR structure. By increasing the series, the network obtains a larger receptive field, e.g., the theoretical receptive field of MSR structure-4 reaches 247 pixels in the spatial dimension. However, since it is intractable to increase the actual size of the receptive field simply by using a deeper network [59], purely increasing the series of the MSR structure brings limited improvement. This point is also verified by the ablation study. In comparison, the non-local attention is more effective than simply enlarging the receptive field of the network.
In the second ablation study, we use a typical 3D U-net with the same convolutional layers as the backbone and remove the deconvolution layers in the skip connections, denoted as “w/o MSR structure” for short. The angular reconstruction is simply achieved by using deconvolution at the end of the network. The number of network parameters is kept comparable to the SAA-Net by adjusting the channel dimension of the network. For light fields from gantry systems [16, 15], the performance of the network decreases by more than 2dB in terms of PSNR, as shown in Table III and Table IV. For light fields from the Lytro Illum, the performance decreases by more than 1.3dB, as shown in Table V.
In the last ablation study, we train the proposed SAA-Net using only the pixel-wise term (MAE loss) without the proposed spatial-angular perceptual loss, denoted as “w/o SAP loss” for short. The performance (PSNR) decreases by around 0.3dB, as shown in Tables III, IV and V.


In Fig. 11, we also visualize the comparison ( upsampling) between the proposed SAA-Net, the networks with different series of the MSR structure (without the SAAM) and the network without the MSR structure. The evaluated light fields are from the MPI Light Field Archive [16]. The results show that without the proposed modules, especially the MSR structure, the network exhibits severe aliasing artifacts around regions with large disparities, e.g., the occluded leaves in the FairyCollection case and the handwriting on the reflective board in the Mannequin case.
VII Further Analysis
VII-A Spatial-Angular Attention Map
We visualize four additional attention maps (before the softmax function) on scenes with small disparity, large disparity, occlusion and non-Lambertian effects, as shown in Fig. 12. The first case (Fig. 12(a)) shows a scene with a relatively small disparity (about 7 pixels). Due to the spatial downsampling operation in the MSR structure, the disparity of the light field features in the SAAM is about 1.75 pixels, as shown in the top left figure of Fig. 12(a). Three sub-maps are shown at the bottom of Fig. 12(a), which visualize the correspondences captured by the attention mechanism. As we can see, the response with the highest value moves from at position to at position along the angular dimension. In addition, the attention map also shows a high response value at position due to the fractional disparity. This indicates that the proposed SAAM has the potential to capture correspondences with sub-pixel accuracy.
In the second case (Fig. 12(b)), we demonstrate the spatial-angular attention on a scene with a large disparity (about 16 pixels). The spatial downsampling operation in the MSR structure reduces the disparity to 4 pixels. The response in the attention map moves from at position to at position along the angular dimension. Although the SAAM only applies convolutions before the attention, it is capable of capturing correspondences with large displacements.
Fig. 12(c) visualizes the attention map of a scene with a complex occlusion structure (Mannequin in the MPI Light Field Archive [16]). The background white board is occluded by the foreground leaves (please refer to the sub-aperture image of the second row in Fig. 11). As shown by the demonstrated attention map in Fig. 12(c), the SAAM fails to capture a correct correspondence because of the occlusion. The highest response value is at position , which is also a background point like . However, the SAAM also generates a considerable response at position , which we speculate is inferred from other non-occluded background points.
The fourth case in Fig. 12(d) demonstrates the spatial-angular attention on a scene with non-Lambertian effects. In this case, the positional relation between the corresponding points , and does not follow a clear depth cue, as shown in the top right figure of Fig. 12(d). Analogously, the responses do not follow a regular disparity pattern along the angular dimension, as visualized at the bottom of Fig. 12(d). This result shows that the proposed SAAM is able to capture the correspondences even in regions with non-Lambertian effects.
VII-B Tensor Decomposition for Spatial-Angular Attention
Although we propose a multi-scale reconstruction structure to save GPU memory, the SAAM still occupies a large amount of GPU memory when dealing with an input light field of high spatial-angular resolution. For example, when reconstructing light fields from the MPI Light Field Archive [16] (spatial resolution ), we have to disassemble the 3D data into sub-light fields of resolution (width, height, angular). Our investigation shows that this disassembly causes quality degradation in the reconstructed results.

We therefore apply the truncated Singular Value Decomposition (SVD) [70] to compact the two 3D feature tensors before computing the attention map,

$F = U \Sigma V^{\top},$

where $F$ denotes either of the two feature tensors, $U$ and $V$ are two orthogonal matrices (ignoring the batch dimension), and $\Sigma$ is a diagonal matrix with the singular values along its diagonal. By truncating $\Sigma$ to the $k$ largest singular values, we obtain a good approximation and also compress the 3D tensors. Since the rank of the matrices is bounded, so is the parameter $k$ of the truncated SVD.
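A minimal sketch of the truncated-SVD compaction is given below, using torch.linalg.svd on batched feature matrices; the number of retained singular values k, the tensor shapes and the way the truncated factors are recombined into the attention logits are illustrative assumptions.

```python
import torch

def truncate_svd(feat, k):
    """feat: batched 2D feature matrices of shape (B, N, C); keep k singular values."""
    U, S, Vh = torch.linalg.svd(feat, full_matrices=False)   # feat = U @ diag(S) @ Vh
    return U[:, :, :k], S[:, :k], Vh[:, :k, :]

def low_rank_attention_logits(q, kf, k=3):
    """Approximate q @ kf^T from the truncated factors of q and kf (shapes (B, N, C)
    and (B, M, C)), so the two large feature tensors can be stored in compact form."""
    Uq, Sq, Vq = truncate_svd(q, k)
    Uk, Sk, Vk = truncate_svd(kf, k)
    # (Uq Sq Vq)(Uk Sk Vk)^T = Uq [Sq (Vq Vk^T) Sk] Uk^T, with a small k x k core
    core = Sq.unsqueeze(-1) * (Vq @ Vk.transpose(1, 2)) * Sk.unsqueeze(1)
    return Uq @ core @ Uk.transpose(1, 2)                    # (B, N, M) logits

# toy usage
print(low_rank_attention_logits(torch.randn(2, 64, 8), torch.randn(2, 64, 8)).shape)
# torch.Size([2, 64, 64])
```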
Fig. 13 shows the PSNR performance as a function of the number of retained singular values in the truncated SVD. As we can see from the “SVD” curve, with no less than three singular values, the SVD decomposition maintains the network performance without fine-tuning. Moreover, since the decomposition enables us to feed the network with a higher spatial resolution input, e.g., from to , we can obtain a reconstruction result with even higher quality (0.04dB higher) when employing the truncated SVD decomposition, as shown by the “SVD-Ext” (green dot) in the figure.
VII-C Limitations
The non-local attention involves outer product of large scale matrices, especially for the high-dimensional light field data. For this reason, the proposed network takes almost 15% of its inference time on the SAAM. For the 3D light field from the MPI Light Field Archive [16], the network takes about 51 seconds to reconstruct a light field from views of spatial resolution ( upsampling), i.e., 0.53s per view. For the 3D light field from the CIVIT Dataset [15], the network takes about 126 seconds to reconstruct a light field from views of spatial resolution ( upsampling), i.e., 0.65s per view. For a 4D light field from Lytro Illum, it takes about 17 seconds to reconstruct a light field from views of spatial resolution ( upsampling), i.e., about 0.35s per view. And the reconstruction of a Lytro light field from views ( upsampling) takes about 30 seconds, i.e., about 0.5s per view. The parameter number of our SAA-Net is about 338K. The above evaluations are performed on an Intel Xeon Gold 6130 CPU @ 2.10GHz with an NVIDIA TITAN Xp.
Although we apply a simple SVD decomposition to accelerate the network and compact the 3D tensors, the compression rate is limited by the rank of the matrices. Decomposing the attention map into a combination of small tensors [71] might solve this problem in a more fundamental way.
Another limitation of our proposed method is that repetitive patterns in the input light field can cause multiple plausible responses in the attention map, leading to misalignments in the reconstructed light fields. A possible solution is to introduce a smoothness term on the attention map as in [54] to penalize multiple responses.
VIII Conclusions
We have proposed a spatial-angular attention module in a 3D U-net backbone to capture correspondence information non-locally for light field reconstruction. The introduced Spatial-Angular Attention Module (termed as SAAM) is designed to compute the responses from all the positions on the epipolar plane for each pixel in the light field and produce a spatial-angular attention map that records the correspondences. The attention map is then applied to guide light field reconstruction via channel-to-angular pixel shuffling. We further propose a multi-scale reconstruction structure based on the 3D U-net backbone that implements the SAAM efficiently in the low spatial resolution feature space, while also preserving fine details in the high spatial resolution feature space. For the network training, a spatial-angular perceptual loss is designed specifically for the high-dimensional light field data by pre-training a 3D auto-encoder. The evaluations on light fields with challenging non-Lambertian effects and large disparities have demonstrated the superiority of the proposed spatial-angular attention network.
References
- [1] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
- [2] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” in ICML, 2019, pp. 7354–7363.
- [3] M. Levoy and P. Hanrahan, “Light field rendering,” in SIGGRAPH, 1996, pp. 31–42.
- [4] “Lytro.” https://www.lytro.com/.
- [5] R. S. Overbeck, D. Erickson, D. Evangelakos, M. Pharr, and P. Debevec, “A system for acquiring, processing, and rendering panoramic light field stills for virtual reality,” in SIGGRAPH, 2019.
- [6] Y. Wang, L. Wang, J. Yang, W. An, J. Yu, and Y. Guo, “Spatial-angular interaction for light field image super-resolution,” in European Conference on Computer Vision (ECCV), 2020.
- [7] N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi, “Learning-based view synthesis for light field cameras,” ACM TOG, vol. 35, no. 6, 2016.
- [8] J. Shi, X. Jiang, and C. Guillemot, “Learning fused pixel and feature-based view reconstructions for light fields,” in IEEE Conference on Computer Vision and Pattern Recognition, 2020.
- [9] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. So Kweon, “Learning a deep convolutional network for light-field image super-resolution,” in CVPRW, 2015, pp. 24–32.
- [10] G. Wu, Y. Liu, L. Fang, Q. Dai, and T. Chai, “Light field reconstruction using convolutional network on EPI and extended applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1681–1694, 2019.
- [11] H. W. F. Yeung, J. Hou, J. Chen, Y. Y. Chung, and X. Chen, “Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues,” in ECCV, 2018, pp. 138–154.
- [12] J. Long, N. Zhang, and T. Darrell, “Do ConvNets learn correspondence?” in Advances in Neural Information Processing Systems, 2014, pp. 1601–1609.
- [13] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, “DeepStereo: Learning to predict new views from the world’s imagery,” in CVPR, 2015.
- [14] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar, “Local light field fusion: Practical view synthesis with prescriptive sampling guidelines,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–14, 2019.
- [15] S. Moreschini, F. Gama, R. Bregovic, and A. Gotchev, “CIVIT Dataset: Horizontal-parallax-only densely-sampled light-fields,” https://civit.fi/densely-sampled-light-field-datasets/.
- [16] V. K. Adhikarla, M. Vinkler, D. Sumin, R. K. Mantiuk, K. Myszkowski, H.-P. Seidel, and P. Didyk, “Towards a quality metric for dense light fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 58–67.
- [17] A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity metrics based on deep networks,” in NIPS, 2016, pp. 658–666.
- [18] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, 2016.
- [19] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” International conference on learning representations, 2015.
- [20] E. Penner and L. Zhang, “Soft 3D reconstruction for view synthesis,” ACM TOG, vol. 36, no. 6, p. 235, 2017.
- [21] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International journal of computer vision, vol. 47, no. 1-3, pp. 7–42, 2002.
- [22] S. Wanner and B. Goldluecke, “Variational light field analysis for disparity estimation and super-resolution,” IEEE TPAMI, vol. 36, no. 3, pp. 606–619, 2014.
- [23] C.-T. Huang, “Robust pseudo random fields for light-field stereo matching,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 11–19.
- [24] M. W. Tao, S. Hadap, J. Malik, and R. Ramamoorthi, “Depth from combining defocus and correspondence using light-field cameras,” in ICCV, 2013, pp. 673–680.
- [25] T.-C. Wang, A. A. Efros, and R. Ramamoorthi, “Occlusion-aware depth estimation using light-field cameras,” in ICCV, 2015, pp. 3487–3495.
- [26] S. Zhang, H. Sheng, C. Li, J. Zhang, and Z. Xiong, “Robust depth estimation for light field via spinning parallelogram operator,” Computer Vision and Image Understanding, vol. 145, pp. 148–159, 2016.
- [27] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 66–75.
- [28] H. Zheng, M. Ji, H. Wang, Y. Liu, and L. Fang, “CrossNet: An end-to-end reference-based super-resolution network using cross-scale warping,” in ECCV, 2018.
- [29] N. Meng, K. Li, J. Liu, and E. Y. Lam, “Light field view synthesis via aperture disparity and warping confidence map,” IEEE Transactions on Image Processing, vol. 30, pp. 3908–3921, 2021.
- [30] J. Jin, J. Hou, H. Yuan, and S. Kwong, “Learning light field angular super-resolution via a geometry-aware network.” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, 2020, pp. 11 141–11 148.
- [31] J. Jin, J. Hou, J. Chen, H. Zeng, S. Kwong, and J. Yu, “Deep coarse-to-fine dense light field reconstruction with flexible sampling and geometry-aware fusion.” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 1, pp. 1–1, 2020.
- [32] K. Ko, Y. J. Koh, S. Chang, and C.-S. Kim, “Light field super-resolution via adaptive feature remixing,” IEEE Transactions on Image Processing, vol. 30, pp. 4114–4128, 2021.
- [33] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: Learning view synthesis using multiplane images,” in SIGGRAPH, 2018.
- [34] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum, “Plenoptic sampling,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 2000, pp. 307–318.
- [35] C. Zhang and T. Chen, “Spectral analysis for sampling image-based rendering data,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 11, pp. 1038–1050, 2003.
- [36] S. Vagharshakyan, R. Bregovic, and A. Gotchev, “Light field reconstruction using shearlet transform,” IEEE TPAMI, vol. 40, no. 1, pp. 133–147, 2018.
- [37] L. Shi, H. Hassanieh, A. Davis, D. Katabi, and F. Durand, “Light field reconstruction using sparsity in the continuous fourier domain,” ACM TOG, vol. 34, no. 1, p. 12, 2014.
- [38] H. Zhu, M. Guo, H. Li, Q. Wang, and A. Robleskelly, “Revisiting spatio-angular trade-off in light field cameras and extended applications in super-resolution,” IEEE Transactions on Visualization and Computer Graphics, pp. 1–15, 2019.
- [39] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in NIPS, 2015, pp. 802–810.
- [40] D. Liu, Y. Huang, Q. Wu, R. Ma, and P. An, “Multi-angular epipolar geometry based light field angular reconstruction network,” IEEE Transactions on Computational Imaging, vol. 6, pp. 1507–1522, 2020.
- [41] J. Jin, J. Hou, J. Chen, S. Kwong, and J. Yu, “Light field super-resolution via attention-guided fusion of hybrid lenses,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 193–201.
- [42] Y. Wang, F. Liu, K. Zhang, Z. Wang, Z. Sun, and T. Tan, “High-fidelity view synthesis for light field imaging with extended pseudo 4DCNN,” IEEE Transactions on Computational Imaging, vol. 6, pp. 830–842, 2020.
- [43] G. Wu, Y. Liu, Q. Dai, and T. Chai, “Learning sheared epi structure for light field reconstruction,” IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3261–3273, 2019.
- [44] G. Wu, Y. Liu, L. Fang, and T. Chai, “Revisiting light field rendering with deep anti-aliasing neural network.” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.
- [45] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on pattern analysis and machine intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
- [46] R. A. Rensink, “The dynamic representation of scenes,” Visual cognition, vol. 7, no. 1-3, pp. 17–42, 2000.
- [47] M. Corbetta and G. L. Shulman, “Control of goal-directed and stimulus-driven attention in the brain,” Nature reviews neuroscience, vol. 3, no. 3, pp. 201–215, 2002.
- [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 5998–6008.
- [49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- [50] S. Woo, J. Park, J. Lee, and I. S. Kweon, “CBAM: Convolutional block attention module,” in ECCV, 2018, pp. 3–19.
- [51] Y. Jiang, S. Chang, and Z. Wang, “TransGAN: Two transformers can make one strong GAN,” arXiv preprint, 2021.
- [52] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2019.
- [53] Z. Qin, P. Zhang, F. Wu, and X. Li, “FcaNet: Frequency channel attention networks,” arXiv preprint arXiv:2012.11879, 2020.
- [54] L. Wang, Y. Wang, Z. Liang, Z. Lin, J. Yang, W. An, and Y. Guo, “Learning parallax attention for stereo image super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 250–12 259.
- [55] L. Wang, Y. Guo, Y. Wang, Z. Liang, Z. Lin, J. Yang, and W. An, “Parallax attention for unsupervised stereo correspondence learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- [56] Y.-J. Tsai, Y.-L. Liu, M. Ouhyoung, and Y.-Y. Chuang, “Attention-based view selection networks for light-field disparity estimation,” in Proceedings of AAAI Conference on Artificial Intelligence (AAAI), 2020, p. 1.
- [57] M. Guo, J. Hou, J. Jin, J. Chen, and L.-P. Chau, “Deep spatial-angular regularization for compressive light field reconstruction over coded apertures.” in European Conference on Computer Vision, 2020, pp. 278–294.
- [58] K. H. Jin, M. T. Mccann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4509–4522, 2017.
- [59] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene cnns,” in International Conference on Learning Representations, 2015.
- [60] G. Eilertsen, J. Kronander, G. Denes, R. K. Mantiuk, and J. Unger, “HDR image reconstruction from a single exposure using deep CNNs,” ACM Transactions on Graphics, 2017.
- [61] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
- [62] Z. Zhang, Z. Wang, Z. Lin, and H. Qi, “Image super-resolution by neural texture transfer,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7982–7991.
- [63] N. Meng, H. K.-H. So, X. Sun, and E. Y. Lam, “High-dimensional dense residual convolutional neural network for light field reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 873–886, 2021.
- [64] “Stanford (New) Light Field Archive.” http://lightfield.stanford.edu/lfs.html.
- [65] R. Ng, “Fourier slice photography,” ACM Transactions on Graphics, vol. 24, no. 3, p. 735, 2005.
- [66] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [67] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” http://tensorflow.org/.
- [68] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE TIP, vol. 13, no. 4, pp. 600–612, 2004.
- [69] “Stanford Lytro Light Field Archive.” http://lightfields.stanford.edu/.
- [70] R. Girshick, “Fast R-CNN,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
- [71] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
Gaochang Wu received the B.E. and M.S. degrees in mechanical engineering from Northeastern University, Shenyang, China, in 2013 and 2015, respectively, and the Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 2020. He is currently an associate professor in the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University. His current research interests include image processing, light field processing, and deep learning.
Yingqian Wang received the B.E. degree in electrical engineering from Shandong University (SDU), Jinan, China, in 2016, and the M.E. degree in information and communication engineering from National University of Defense Technology (NUDT), Changsha, China, in 2018. He is currently pursuing the Ph.D. degree with the College of Electronic Science and Technology, NUDT. His research interests focus on low-level vision, particularly on light field imaging and image super-resolution.
Yebin Liu received the B.E. degree from Beijing University of Posts and Telecommunications, China, in 2002, and the Ph.D. degree from the Automation Department, Tsinghua University, Beijing, China, in 2009. He worked as a research fellow in the computer graphics group of the Max Planck Institute for Informatics, Germany, in 2010. He is currently an associate professor at Tsinghua University. His research areas include computer vision and computer graphics.
Lu Fang is currently an Associate Professor at Tsinghua University. She received the Ph.D. degree in electronic and computer engineering from HKUST in 2011 and the B.E. degree from USTC in 2007. Dr. Fang’s research interests include image/video processing, vision for intelligent robots, and computational photography. Dr. Fang serves as a TC member of the Multimedia Signal Processing Technical Committee (MMSP-TC) of the IEEE Signal Processing Society.
Tianyou Chai received the Ph.D. degree in control theory and engineering from Northeastern University, Shenyang, China, in 1985. He has been with the Research Center of Automation, Northeastern University, Shenyang, China, since 1985, where he became a Professor in 1988 and a Chair Professor in 2004. His current research interests include adaptive control, intelligent decoupling control, integrated plant control and systems, and the development of control technologies with applications to various industrial processes. Prof. Chai is a member of the Chinese Academy of Engineering, an academician of the International Eurasian Academy of Sciences, an IEEE Fellow, and an IFAC Fellow. He is a distinguished visiting fellow of The Royal Academy of Engineering (UK) and an Invitation Fellow of the Japan Society for the Promotion of Science (JSPS).