Learning Non-Local Spatial-Angular Correlation for Light Field
Image Super-Resolution
Abstract
Exploiting spatial-angular correlation is crucial to light field (LF) image super-resolution (SR), but is highly challenging due to its non-local property caused by the disparities among LF images. Although many deep neural networks (DNNs) have been developed for LF image SR and achieved continuously improved performance, existing methods cannot well leverage the long-range spatial-angular correlation and thus suffer a significant performance drop when handling scenes with large disparity variations. In this paper, we propose a simple yet effective method to learn the non-local spatial-angular correlation for LF image SR. In our method, we adopt the epipolar plane image (EPI) representation to project the 4D spatial-angular correlation onto multiple 2D EPI planes, and then develop a Transformer network with repetitive self-attention operations to learn the spatial-angular correlation by modeling the dependencies between each pair of EPI pixels. Our method can fully incorporate the information from all angular views while achieving a global receptive field along the epipolar line. We conduct extensive experiments with insightful visualizations to validate the effectiveness of our method. Comparative results on five public datasets show that our method not only achieves state-of-the-art SR performance but is also robust to disparity variations. Code is publicly available at https://github.com/ZhengyuLiang24/EPIT.
1 Introduction
Light field (LF) cameras record both intensity and directions of light rays, and enable various applications such as depth perception [25, 29, 32], view rendering [3, 52, 66], virtual reality [11, 74], and 3D reconstruction [6, 77]. However, due to the inherent spatial-angular trade-off [82], an LF camera can either provide dense angular samplings with low-resolution (LR) sub-aperture images (SAIs), or capture high-resolution (HR) SAIs with sparse angular sampling. To handle this problem, many methods have been proposed to enhance the angular resolution via novel view synthesis [28, 43, 67], or enhance the spatial resolution by performing LF image super-resolution (SR) [10, 20]. In this paper, we focus on the latter task, i.e., generating HR LF images from their LR counterparts.

Recently, convolutional neural networks (CNNs) have been widely applied to LF image SR and demonstrated superior performance over traditional paradigms [1, 34, 44, 49, 64]. To incorporate the complementary information (i.e., angular information) from different views, existing CNNs adopted various mechanisms such as adjacent-view combination [73], view-stack integration [26, 78, 79], bidirectional recurrent fusion [59], spatial-angular disentanglement [36, 60, 61, 72, 9], and 4D convolutions [42, 43]. However, as illustrated in both Fig.Β 1 and Sec.Β 4.3, existing methods achieve promising results on LFs with small baselines, but suffer a notable performance drop when handling scenes with large disparity variations.
We attribute this performance drop to the contradiction between the local receptive field of CNNs and the requirement of incorporating non-local spatial-angular correlation in LF image SR. That is, LF images provide multiple observations of a scene from a number of regularly distributed viewpoints, and a scene point is projected onto different but correlated spatial locations on different angular views, which is termed the spatial-angular correlation. Note that, the spatial-angular correlation has a non-local property since the difference between the spatial locations of two views (i.e., the disparity value) depends on several factors (e.g., the angular coordinates of the selected views, the depth value of the scene point, the baseline length of the LF camera, and the resolution of LF images), and can be very large in some situations. Consequently, it is crucial for LF image SR methods to incorporate the complementary information from different views by exploiting the spatial-angular correlation under large disparity variations.
In this paper, we propose a simple yet effective method to learn the non-local spatial-angular correlation for LF image SR. In our method, we re-organize 4D LFs as multiple 2D epipolar plane images (EPIs), so that the spatial-angular correlation manifests as line patterns with different slopes. Then, we develop a Transformer-based network called EPIT to learn the spatial-angular correlation by modeling the dependencies between each pair of pixels on EPIs. Specifically, we design a basic Transformer block to alternately process horizontal and vertical EPIs, and thus progressively incorporate the complementary information from all angular views. Compared to existing LF image SR methods, our method can achieve a global receptive field along the epipolar line, and is thus robust to disparity variations.
In summary, the contributions of this work are as follows: (1) We address the importance of exploiting non-local spatial-angular correlation in LF image SR, and propose a simple yet effective method to handle this problem. (2) We develop a Transformer-based network to learn the non-local spatial-angular correlation from horizontal and vertical EPIs, and validate the effectiveness of our method through extensive experiments and visualizations. (3) Compared to existing state-of-the-art LF image SR methods, our method achieves superior performance on public LF datasets, and is much more robust to disparity variations.
2 Related Work
2.1 LF Image SR
LFCNN [73] is the first method to adopt CNNs to learn the correspondence among stacked SAIs. Then, it is a common practice for LF image SR networks to aggregate the complementary information from adjacent views to model the correlation in LFs. Yeung et al. [72] designed a spatial-angular separable (SAS) convolution to approximate the 4D convolution to characterize the sub-pixel relationship of LF 4D structures. Wang et al. [59] proposed a bidirectional recurrent network to model the spatial correlation among views on horizontal and vertical baselines. Meng et al. [42] proposed a densely-connected network with 4D convolutions to explicitly learn the spatial-angular correlation encoded in 4D LF data. To further learn inherent corresponding relations in SAIs, Zhang et al. [78, 79] grouped LFs into four different branches according to the specific angular directions, and used four sub-networks to model the multi-directional spatial-angular correlation.
The aforementioned networks use only part of the input views to super-resolve each view, so the inherent spatial-angular correlation in LF images cannot be well incorporated. To address this issue, Jin et al. [26] proposed an All-to-One framework for LF image SR, in which each SAI is individually super-resolved by combining the information from all views. Wang et al. [61, 60] organized LF images into macro-pixels, and designed a disentangling mechanism to fully incorporate the angular information. Liu et al. [38] introduced a multi-view context block based on 3D convolutions to exploit the correlations among all views. In addition, Wang et al. [62] adopted deformable convolutions to exploit long-range information from all SAIs. However, due to the limited receptive field of CNNs, existing methods generally learn the local correspondence across SAIs, and ignore the importance of the non-local spatial-angular correlation in LF images.
Recently, Liang et al. [36] applied Transformers to LF image SR and developed an angular Transformer and a spatial Transformer to incorporate angular information and model long-range spatial dependencies, respectively. However, since 4D LFs were organized into 2D angular patches to form the input of angular Transformers, the non-local property of spatial-angular correlations reduces the robustness of LFT to large disparity variations.
2.2 Non-Local Correlation Modeling
Non-local means [5] is a classical algorithm that computes the weighted mean of pixels in an image according to a self-similarity measure, and a number of such non-local priors have been proposed for image restoration [12, 51, 19, 4] and for image and video SR [16, 76, 14, 71, 23]. The attention mechanism was subsequently developed as a tool to bias computation toward the most informative components of an input signal, and achieves remarkable performance in various computer vision tasks [22, 8, 58, 15]. Huang et al. [24] proposed a criss-cross attention module to efficiently capture contextual information from full-image dependencies. Wang et al. [56, 55] proposed a parallax attention mechanism to handle the varying disparities of stereo images. Wu et al. [69] applied attention mechanisms to 3D LF reconstruction and developed a spatial-angular attention module to learn the first-order correlation on EPIs.
Recently, the attention mechanism was further generalized into Transformers [54] with multi-head structures and feed-forward networks. Transformers have inspired many works [39, 35, 7, 13] to further investigate the power of attention mechanisms in vision. Liu et al. [40] presented a pure-Transformer method that exploits the inherent spatial-temporal locality of videos for action recognition. Naseer et al. [45] investigated the robustness and generalizability of Transformers, and demonstrated favorable merits of Transformers over CNNs in occlusion handling. Shi et al. [50] observed that Transformers can implicitly build accurate connections for misaligned pixels, and presented a new understanding of Transformers for processing spatially unaligned images.
3 Method
3.1 Preliminary
Based on the two-plane LF parameterization model [33], an LF image is commonly formulated as a 4D function $\mathcal{L}(u, v, h, w) \in \mathbb{R}^{U \times V \times H \times W}$, where $u$ and $v$ represent the angular dimensions, and $h$ and $w$ represent the spatial dimensions. An EPI is a 2D slice of the 4D LF acquired with one fixed angular coordinate and one fixed spatial coordinate. Specifically, the horizontal EPI $\mathcal{L}(u^{*}, v, h^{*}, w)$ is obtained with constant $u{=}u^{*}$ and $h{=}h^{*}$, and the vertical EPI $\mathcal{L}(u, v^{*}, h, w^{*})$ is obtained with constant $v{=}v^{*}$ and $w{=}w^{*}$.
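To make the indexing concrete, the following minimal sketch (our own illustration, assuming the 4D LF is stored as a tensor of shape (U, V, H, W); the function names are ours and not from the released code) shows how horizontal and vertical EPIs can be sliced:

```python
import torch

def horizontal_epi(lf: torch.Tensor, u: int, h: int) -> torch.Tensor:
    """Slice a horizontal EPI from a 4D LF of shape (U, V, H, W).

    Fixing the angular coordinate u and the spatial coordinate h yields a 2D
    slice over the remaining (v, w) dimensions, on which each scene point
    traces a line whose slope encodes its disparity.
    """
    return lf[u, :, h, :]                 # shape: (V, W)

def vertical_epi(lf: torch.Tensor, v: int, w: int) -> torch.Tensor:
    """Slice a vertical EPI by fixing the angular coordinate v and the spatial coordinate w."""
    return lf[:, v, :, w]                 # shape: (U, H)

# Toy example: a 5x5 LF with 32x32 sub-aperture images.
lf = torch.rand(5, 5, 32, 32)
print(horizontal_epi(lf, u=2, h=16).shape)   # torch.Size([5, 32])
print(vertical_epi(lf, v=2, w=16).shape)     # torch.Size([5, 32])
```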
As shown in Fig. 2, EPIs not only record the spatial structures (e.g., edges and textures) of the scene, but also reflect the disparity information via line patterns with different slopes. In particular, under large disparities, the spatial-angular correlation recorded on EPIs becomes long-range. Therefore, we propose to exploit the non-local spatial-angular correlation from horizontal and vertical EPIs for LF image SR.
3.2 Network Design
As shown in Fig. 3(a), our network takes an LR LF $\mathcal{L}_{LR} \in \mathbb{R}^{U \times V \times H \times W}$ as its input, and produces an HR LF $\mathcal{L}_{SR} \in \mathbb{R}^{U \times V \times \alpha H \times \alpha W}$, where $\alpha$ denotes the upscaling factor. Our network consists of three stages: initial feature extraction, deep spatial-angular correlation learning, and feature upsampling.
3.2.1 Initial Feature Extraction
As shown in Fig. 3(b), we follow most existing works [7, 35, 75] to use three 3×3 convolutions with LeakyReLU [41] as a SpatialConv layer to map each SAI to a high-dimensional feature. The initially extracted feature can be represented as $\mathcal{F} \in \mathbb{R}^{U \times V \times H \times W \times C}$, where $C$ denotes the channel dimension.
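A minimal sketch of such a per-view feature extractor is given below. The three stacked 3×3 convolutions with LeakyReLU follow the description above; the channel width of 64, the LeakyReLU slope, and the tensor layout are our assumptions, not values confirmed by the paper.

```python
import torch
import torch.nn as nn

class SpatialConv(nn.Module):
    """Three 3x3 convolutions with LeakyReLU, applied to each SAI independently."""

    def __init__(self, in_ch: int = 1, ch: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, lf: torch.Tensor) -> torch.Tensor:
        # lf: (B, U, V, C_in, H, W) -> fold the angular dimensions into the batch dimension
        b, u, v, c, h, w = lf.shape
        x = self.body(lf.reshape(b * u * v, c, h, w))
        return x.reshape(b, u, v, -1, h, w)

feat = SpatialConv()(torch.rand(1, 5, 5, 1, 32, 32))
print(feat.shape)   # torch.Size([1, 5, 5, 64, 32, 32])
```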


3.2.2 Deep Spatial-Angular Correlation Learning
Non-Local Cascading Block. The basic module for spatial-angular correlation learning is the Non-Local Cascading block. As shown in Fig. 3(a), each block consists of two cascaded Basic-Transformer units that separately incorporate the complementary information along the horizontal and vertical epipolar lines. In our method, we employ five Non-Local Cascading blocks to achieve a global perception of all angular views, and follow SwinIR [35] in adopting spatial convolutions to enhance the local feature representation. The effectiveness of this inter-block spatial convolution is validated in Sec. 4.4. Note that, the weights of the two Basic-Transformer units in each block are shared to jointly learn the intrinsic properties of LFs, which is also demonstrated to be effective in Sec. 4.4.
As shown in Fig. 3(c), the initial feature $\mathcal{F}$ is first reshaped into the horizontal EPI pattern $\mathcal{F}_{hor}$ and the vertical EPI pattern $\mathcal{F}_{ver}$, respectively. Next, $\mathcal{F}_{hor}$ (or $\mathcal{F}_{ver}$) is fed to a Basic-Transformer unit to integrate the long-range information along the horizontal (or vertical) epipolar line. Then, the enhanced feature $\hat{\mathcal{F}}_{hor}$ (or $\hat{\mathcal{F}}_{ver}$) is reshaped back to the size of $U \times V \times H \times W \times C$, and passes through a SpatialConv layer to incorporate the spatial context information within each SAI. Without loss of generality, we take the vertical Basic-Transformer unit as an example to introduce its details in the following.
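The following sketch illustrates the EPI re-arrangement used inside each block (the tensor layout and helper names are our assumptions; the Basic-Transformer unit applied to these EPI batches is sketched at the end of this subsection):

```python
import torch

def to_horizontal_epis(feat: torch.Tensor) -> torch.Tensor:
    """(B, U, V, C, H, W) -> (B*U*H, C, V, W): one 2D slice per horizontal EPI."""
    b, u, v, c, h, w = feat.shape
    return feat.permute(0, 1, 4, 3, 2, 5).reshape(b * u * h, c, v, w)

def from_horizontal_epis(epis: torch.Tensor, b: int, u: int, v: int,
                         c: int, h: int, w: int) -> torch.Tensor:
    """Inverse of to_horizontal_epis, back to (B, U, V, C, H, W)."""
    return epis.reshape(b, u, h, c, v, w).permute(0, 1, 4, 3, 2, 5)

def to_vertical_epis(feat: torch.Tensor) -> torch.Tensor:
    """(B, U, V, C, H, W) -> (B*V*W, C, U, H): one 2D slice per vertical EPI."""
    b, u, v, c, h, w = feat.shape
    return feat.permute(0, 2, 5, 3, 1, 4).reshape(b * v * w, c, u, h)

feat = torch.rand(1, 5, 5, 64, 32, 32)
print(to_horizontal_epis(feat).shape)   # torch.Size([160, 64, 5, 32])
print(to_vertical_epis(feat).shape)     # torch.Size([160, 64, 5, 32])
```

In this layout, a weight-shared Basic-Transformer unit can be applied to the horizontal and vertical EPI batches in turn, followed by the SpatialConv layer on the features reshaped back to the LF layout.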
Basic-Transformer Unit. The objective of this unit is to capture long-range dependencies along the epipolar line via Transformers. To leverage the powerful sequence modeling capability of Transformers, we convert the vertical EPI features into sequences of "tokens" to capture the spatial-angular correlation along the $u$ and $h$ dimensions. As shown in Fig. 3(d), the vertical EPI features are first passed through a linear projection matrix $\mathbf{W}_{in} \in \mathbb{R}^{C \times D}$, where $D$ denotes the embedding dimension of each token. The projected EPI features form a sequence of tokens of length $U \times H$, i.e., $\mathbf{T} \in \mathbb{R}^{UH \times D}$. Following the PreNorm operation in [70], we apply Layer Normalization (LN) before the attention calculation, and obtain the normalized tokens $\mathbf{T}_{LN}$.
Afterwards, the normalized tokens $\mathbf{T}_{LN}$ are passed through the Self-Attention layer and transformed into deep tokens involving the non-local spatial-angular information along the vertical epipolar line. Specifically, $\mathbf{T}_{LN}$ is separately multiplied by the learnable matrices $\mathbf{W}_{Q}$, $\mathbf{W}_{K}$ and $\mathbf{W}_{V}$ to generate the corresponding query, key and value components for the self-attention calculation, i.e., $\mathbf{Q}=\mathbf{T}_{LN}\mathbf{W}_{Q}$, $\mathbf{K}=\mathbf{T}_{LN}\mathbf{W}_{K}$ and $\mathbf{V}=\mathbf{T}_{LN}\mathbf{W}_{V}$.
Given a query position $i$ in $\mathbf{Q}$ and a key position $j$ in $\mathbf{K}$, the corresponding response measures the mutual similarity of the pair by a dot-product operation, followed by a Softmax function to obtain the attention scores over the vertical EPI tokens. That is,

$$\mathbf{A} = \mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{D}}\right), \tag{1}$$

where the entry $\mathbf{A}_{i,j}$ denotes the attention score between the $i$-th query and the $j$-th key.
Based on the attention scores, the output of the self-attention is calculated as the weighted sum of the value tokens. In summary, the calculation of the Self-Attention (SA) layer can be formulated as:

$$\mathbf{T}_{SA} = \mathbf{T} + \mathbf{A}\mathbf{V}. \tag{2}$$
To further incorporate the spatial-angular correlation, following [54], our Basic-Transformer unit also contains a multi-layer perceptron (MLP) and LN, and generates the enhanced tokens $\mathbf{T}_{out}$ as:

$$\mathbf{T}_{out} = \mathbf{T}_{SA} + \mathrm{MLP}\left(\mathrm{LN}(\mathbf{T}_{SA})\right). \tag{3}$$
At the end of the Basic-Transformer unit, the enhanced tokens $\mathbf{T}_{out}$ are fed to another linear projection $\mathbf{W}_{out} \in \mathbb{R}^{D \times C}$, and the result is reshaped back to the size of $U \times V \times H \times W \times C$ for the subsequent SpatialConv layer.
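A condensed sketch of one Basic-Transformer unit operating on a batch of EPIs is given below. It is a simplification that relies on PyTorch's nn.MultiheadAttention (which internally performs the scaled dot-product of Eqs. (1)-(2)); the embedding dimension, head count, and MLP ratio are our assumptions and may differ from the released code.

```python
import torch
import torch.nn as nn

class BasicTransformerUnit(nn.Module):
    """Pre-norm self-attention + MLP over the token sequence of one EPI."""

    def __init__(self, channels: int = 64, dim: int = 128, mlp_ratio: int = 2):
        super().__init__()
        self.proj_in = nn.Linear(channels, dim)      # W_in: C -> D
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        self.proj_out = nn.Linear(dim, channels)     # W_out: D -> C

    def forward(self, epi: torch.Tensor) -> torch.Tensor:
        # epi: (B', C, A, S) -- one EPI per batch element, A angular x S spatial pixels
        b, c, a, s = epi.shape
        tokens = self.proj_in(epi.flatten(2).transpose(1, 2))        # (B', A*S, D)
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]  # Eqs. (1)-(2)
        tokens = tokens + self.mlp(self.norm2(tokens))               # Eq. (3)
        return self.proj_out(tokens).transpose(1, 2).reshape(b, c, a, s)

unit = BasicTransformerUnit()
print(unit(torch.rand(4, 64, 5, 32)).shape)   # torch.Size([4, 64, 5, 32])
```

Because the unit only assumes a generic (angular, spatial) EPI layout, the same instance can be shared between the horizontal and vertical branches, as described above.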

Table 1. PSNR/SSIM values achieved by different methods for ×2 and ×4 SR.

Methods | #Prm. (M) ×2/×4 | EPFL (×2) | HCInew (×2) | HCIold (×2) | INRIA (×2) | STFgantry (×2) | EPFL (×4) | HCInew (×4) | HCIold (×4) | INRIA (×4) | STFgantry (×4)
---|---|---|---|---|---|---|---|---|---|---|---
Bicubic | - / - | 29.74/.9376 | 31.89/.9356 | 37.69/.9785 | 31.33/.9577 | 31.06/.9498 | 25.14/.8324 | 27.61/.8517 | 32.42/.9344 | 26.82/.8867 | 25.93/.8452 |
VDSR [30] | 0.66 / 0.66 | 32.50/.9598 | 34.37/.9561 | 40.61/.9867 | 34.43/.9741 | 35.54/.9789 | 27.25/.8777 | 29.31/.8823 | 34.81/.9515 | 29.19/.9204 | 28.51/.9009
EDSR [37] | 38.6 / 38.9 | 33.09/.9629 | 34.83/.9592 | 41.01/.9874 | 34.97/.9764 | 36.29/.9818 | 27.84/.8854 | 29.60/.8869 | 35.18/.9536 | 29.66/.9257 | 28.70/.9072 |
RCAN [81] | 15.3 / 15.4 | 33.16/.9634 | 34.98/.9603 | 41.05/.9875 | 35.01/.9769 | 36.33/.9831 | 27.88/.8863 | 29.63/.8886 | 35.20/.9548 | 29.76/.9276 | 28.90/.9131 |
resLF[79] | 7.98 / 8.64 | 33.62/.9706 | 36.69/.9739 | 43.42/.9932 | 35.39/.9804 | 38.36/.9904 | 28.27/.9035 | 30.73/.9107 | 36.71/.9682 | 30.34/.9412 | 30.19/.9372 |
LFSSR [72] | 0.88 / 1.77 | 33.68/.9744 | 36.81/.9749 | 43.81/.9938 | 35.28/.9832 | 37.95/.9898 | 28.27/.9118 | 30.72/.9145 | 36.70/.9696 | 30.31/.9467 | 30.15/.9426 |
LF-ATO [26] | 1.22 / 1.36 | 34.27/.9757 | 37.24/.9767 | 44.20/.9942 | 36.15/.9842 | 39.64/.9929 | 28.52/.9115 | 30.88/.9135 | 37.00/.9699 | 30.71/.9484 | 30.61/.9430 |
LF-InterNet [61] | 5.04 / 5.48 | 34.14/.9760 | 37.28/.9763 | 44.45/.9946 | 35.80/.9843 | 38.72/.9909 | 28.67/.9162 | 30.98/.9161 | 37.11/.9716 | 30.64/.9491 | 30.53/.9409 |
LF-DFnet [62] | 3.94 / 3.99 | 34.44/.9755 | 37.44/.9773 | 44.23/.9941 | 36.36/.9840 | 39.61/.9926 | 28.77/.9165 | 31.23/.9196 | 37.32/.9718 | 30.83/.9503 | 31.15/.9494 |
MEG-Net [78] | 1.69 / 1.77 | 34.30/.9773 | 37.42/.9777 | 44.08/.9942 | 36.09/.9849 | 38.77/.9915 | 28.74/.9160 | 31.10/.9177 | 37.28/.9716 | 30.66/.9490 | 30.77/.9453 |
LF-IINet [38] | 4.84 / 4.88 | 34.68/.9773 | 37.74/.9790 | 44.84/.9948 | 36.57/.9853 | 39.86/.9936 | 29.11/.9188 | 31.36/.9208 | 37.62/.9734 | 31.08/.9515 | 31.21/.9502 |
DPT [57] | 3.73 / 3.78 | 34.48/.9758 | 37.35/.9771 | 44.31/.9943 | 36.40/.9843 | 39.52/.9926 | 28.93/.9170 | 31.19/.9188 | 37.39/.9721 | 30.96/.9503 | 31.14/.9488 |
LFT [36] | 1.11 / 1.16 | 34.80/.9781 | 37.84/.9791 | 44.52/.9945 | 36.59/.9855 | 40.51/.9941 | 29.25/.9210 | 31.46/.9218 | 37.63/.9735 | 31.20/.9524 | 31.86/.9548 |
DistgSSR [60] | 3.53 / 3.58 | 34.81/.9787 | 37.96/.9796 | 44.94/.9949 | 36.59/.9859 | 40.40/.9942 | 28.99/.9195 | 31.38/.9217 | 37.56/.9732 | 30.99/.9519 | 31.65/.9535 |
LFSAV [9] | 1.22 / 1.54 | 34.62/.9772 | 37.43/.9776 | 44.22/.9942 | 36.36/.9849 | 38.69/.9914 | 29.37/.9223 | 31.45/.9217 | 37.50/.9721 | 31.27/.9531 | 31.36/.9505 |
EPIT (ours) | 1.42 / 1.47 | 34.83/.9775 | 38.23/.9810 | 45.08/.9949 | 36.67/.9853 | 42.17/.9957 | 29.34/.9197 | 31.51/.9231 | 37.68/.9737 | 31.37/.9526 | 32.18/.9571 |
Cross-View Similarity Analysis. Note that, each row $\mathbf{A}_{i,:}$ of the attention map records the similarity scores of the query token $\mathbf{Q}_{i}$ with all key tokens in $\mathbf{K}$, and thus $\mathbf{A}$ can be re-organized into slices of cross-view attention maps according to the angular coordinates of the query and key tokens. Inspired by this, we visualize the cross-view attention maps on an example scene in Fig. 4. The regions marked by the red stripe in Fig. 4(a) are set as the query tokens, and the self-similarity (i.e., the key view is the same as the query view) is ideally located on the diagonal, as shown in Fig. 4(f). In contrast, the yellow stripes in Figs. 4(b)-4(e) are set as the key tokens, and the corresponding cross-view similarities are shown in Figs. 4(g)-4(j). It can be observed that, due to foreground occlusions, the responses of the background appear as short lines (marked by the white boxes) parallel to the diagonal in each cross-view attention map, and both the distance to the diagonal and the length of the response regions adaptively change as the key view moves along the baseline, which demonstrates the disparity-awareness of our Basic-Transformer unit.
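For reference, a small sketch of how such a cross-view slice can be cut out of an EPI attention map, assuming (as in our token construction above) that tokens are grouped view by view along the epipolar line; the function name is hypothetical:

```python
import torch

def cross_view_attention(attn: torch.Tensor, query_view: int, key_view: int,
                         width: int) -> torch.Tensor:
    """Extract the (width x width) block of an (A*width x A*width) attention map
    that relates the tokens of one query view to the tokens of one key view."""
    qs, ks = query_view * width, key_view * width
    return attn[qs:qs + width, ks:ks + width]

# Toy example: 5 views along the epipolar line, 32 pixels per view.
views, width = 5, 32
attn = torch.softmax(torch.rand(views * width, views * width), dim=-1)
print(cross_view_attention(attn, query_view=2, key_view=4, width=width).shape)  # torch.Size([32, 32])
```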
3.2.3 Feature Upsampling
Finally, we apply the pixel shuffling operation to increase the spatial resolution of the LF features, and further employ a 3×3 convolution to obtain the super-resolved LF image $\mathcal{L}_{SR}$. Following most existing works [61, 60, 36, 62, 57, 38, 78, 79, 72], we use the $L_1$ loss to train our network due to its robustness to outliers [2].
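A minimal sketch of this upsampling stage is shown below, again applied per SAI with the angular dimensions folded into the batch. The 1×1 convolution used to expand the channels before pixel shuffling and the channel width are our assumptions.

```python
import torch
import torch.nn as nn

class Upsampler(nn.Module):
    """Pixel shuffling followed by a 3x3 convolution, applied to each SAI."""

    def __init__(self, ch: int = 64, scale: int = 4, out_ch: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch * scale * scale, 1),   # expand channels for pixel shuffling
            nn.PixelShuffle(scale),                 # (C*s*s, H, W) -> (C, s*H, s*W)
            nn.Conv2d(ch, out_ch, 3, padding=1),    # final 3x3 convolution
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, U, V, C, H, W) -> (B, U, V, out_ch, scale*H, scale*W)
        b, u, v, c, h, w = feat.shape
        sr = self.body(feat.reshape(b * u * v, c, h, w))
        return sr.reshape(b, u, v, *sr.shape[1:])

sr = Upsampler(scale=4)(torch.rand(1, 5, 5, 64, 32, 32))
loss = nn.L1Loss()(sr, torch.rand_like(sr))   # L1 training loss against the HR ground truth
print(sr.shape, float(loss))
```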

4 Experiments
In this section, we first introduce the datasets and our implementation details, and then compare our method with state-of-the-art methods. Next, we investigate the performance of different SR methods with respect to disparity variations. Finally, we validate the effectiveness of our method through ablation studies.
4.1 Datasets and Implementation Details
We followed [62, 60, 38, 57, 36] to use five public LF datasets (EPFL [48], HCInew [21], HCIold [65], INRIA [46], STFgantry [53]) in our experiments. All LFs in these datasets have an angular resolution of 9×9. Unless otherwise specified, we extracted the central 5×5 SAIs for training and test. In the training stage, we cropped each SAI into patches of size 64×64 / 128×128, and performed ×0.5 / ×0.25 bicubic downsampling to generate the LR patches for ×2 / ×4 SR, respectively. We used peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [63] as quantitative metrics for performance evaluation. To obtain the metric score for a dataset with $M$ scenes, we first calculated the metric of each scene by averaging the scores over all its SAIs, and then obtained the score for this dataset by averaging the per-scene scores over the $M$ scenes.
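A small sketch of this two-stage averaging protocol (the helper names are hypothetical; SSIM is omitted here for brevity and would be averaged in the same way):

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
    """PSNR between two images with intensities in [0, max_val]."""
    mse = torch.mean((sr - hr) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))

def dataset_score(sr_scenes, hr_scenes) -> float:
    """Average the per-SAI metric within each scene, then average over scenes."""
    scene_scores = []
    for sr_lf, hr_lf in zip(sr_scenes, hr_scenes):        # each LF: (U, V, H, W)
        u, v = sr_lf.shape[:2]
        per_view = [psnr(sr_lf[i, j], hr_lf[i, j]) for i in range(u) for j in range(v)]
        scene_scores.append(sum(per_view) / len(per_view))
    return sum(scene_scores) / len(scene_scores)

# Toy example with two scenes of 5x5 views.
hr = [torch.rand(5, 5, 128, 128) for _ in range(2)]
sr = [x + 0.01 * torch.randn_like(x) for x in hr]
print(dataset_score(sr, hr))
```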
We adopted the same training settings for all experiments, i.e., the Xavier initialization algorithm [17] and the Adam optimizer [31] with $\beta_1{=}0.9$ and $\beta_2{=}0.999$. The initial learning rate was set to $2\times10^{-4}$ and decreased by a factor of 0.5 every 15 epochs. During the training phase, we performed random horizontal flipping, vertical flipping, and 90-degree rotation to augment the training data. All models were implemented in the PyTorch framework and trained from scratch for 80 epochs with 2 Nvidia RTX 2080Ti GPUs.
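Note that flipping or rotating an LF must be applied to the angular and spatial dimensions jointly, otherwise the parallax structure (and hence the EPI line patterns) would be destroyed. A minimal sketch of such LF-aware augmentation (our own illustration; in practice the same random choices must be applied to the LR input and the HR label):

```python
import torch

def augment_lf(lf: torch.Tensor, hflip: bool, vflip: bool, rot90: bool) -> torch.Tensor:
    """Flip / rotate a 4D LF of shape (U, V, H, W) without breaking its parallax structure."""
    if hflip:    # flip the horizontal angular (v) and spatial (w) axes together
        lf = torch.flip(lf, dims=[1, 3])
    if vflip:    # flip the vertical angular (u) and spatial (h) axes together
        lf = torch.flip(lf, dims=[0, 2])
    if rot90:    # rotate both the angular plane and the spatial plane by 90 degrees
        lf = torch.rot90(torch.rot90(lf, k=1, dims=[0, 1]), k=1, dims=[2, 3])
    return lf

lf = torch.rand(5, 5, 64, 64)
print(augment_lf(lf, hflip=True, vflip=False, rot90=True).shape)   # torch.Size([5, 5, 64, 64])
```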
4.2 Comparisons on Benchmark Datasets
We compare our method to 14 state-of-the-art methods, including 3 single image SR methods [30, 37, 81] and 11 LF image SR methods [79, 72, 26, 61, 62, 78, 38, 57, 36, 60, 9].
Quantitative Results. A quantitative comparison among different methods is shown in Table 1. Our EPIT has a small model size (i.e., 1.42M/1.47M for ×2/×4 SR) and achieves state-of-the-art PSNR and SSIM scores on almost all the datasets for both ×2 and ×4 SR. It is worth noting that LFs in the STFgantry dataset [53] have larger disparity variations and are thus more challenging. Our EPIT significantly outperforms all the compared methods on this dataset, achieving 1.66dB/0.32dB PSNR improvements over the second top-performing method LFT for ×2/×4 SR, respectively, which demonstrates the powerful capacity of our EPIT in non-local correlation modeling.
Qualitative Results. Figure 5 shows the qualitative results achieved by different methods for ×2/×4 SR. It can be observed from the zoom-in regions that the single image SR method RCAN [81] cannot recover the textures and details in the SR images. In contrast, our EPIT can incorporate sub-pixel correspondences among SAIs and generate more faithful details with fewer artifacts. Compared to most LF image SR methods, our EPIT generates superior visual results with high angular consistency. Please refer to the supplemental material for additional visual comparisons.

Table 2. PSNR values achieved by DistgSSR [60] for ×4 SR with different input angular resolutions, and the PSNR gain of our EPIT over [60].

Input | EPFL [60] | Ours | HCInew [60] | Ours | HCIold [60] | Ours | INRIA [60] | Ours | STFgantry [60] | Ours
---|---|---|---|---|---|---|---|---|---|---
2×2 | 28.27 | -0.05 | 30.80 | +0.04 | 36.77 | +0.17 | 30.55 | -0.03 | 30.74 | +0.56
3×3 | 28.67 | +0.03 | 31.07 | +0.19 | 37.18 | +0.19 | 30.83 | +0.11 | 31.12 | +0.74
4×4 | 28.81 | +0.23 | 31.25 | +0.15 | 37.32 | +0.20 | 30.93 | +0.26 | 31.23 | +0.88
5×5 | 28.99 | +0.35 | 31.38 | +0.13 | 37.56 | +0.12 | 30.99 | +0.38 | 31.65 | +0.56
6×6 | 29.10 | +0.33 | 31.39 | +0.18 | 37.52 | +0.26 | 30.98 | +0.47 | 31.57 | +0.74
7×7 | 29.38 | +0.22 | 31.43 | +0.20 | 37.65 | +0.27 | 31.18 | +0.33 | 31.63 | +0.77
8×8 | 29.32 | +0.28 | 31.52 | +0.14 | 37.76 | +0.24 | 31.23 | +0.31 | 31.58 | +0.90
9×9 | 29.41 | +0.30 | 31.48 | +0.21 | 37.80 | +0.26 | 31.22 | +0.34 | 31.66 | +0.84

Angular Consistency. We evaluate the angular consistency by applying the ×4 SR results on several challenging scenes (e.g., Backgammon and Stripes) of the 4D LF benchmark [21] to disparity estimation. As shown in Fig. 6, our EPIT achieves competitive MSE scores on these challenging scenes, which demonstrates the superiority of our EPIT in terms of angular consistency.
Performance with Different Angular Resolutions. Since the angular resolution of LR images can vary significantly across different LF devices, we compare our method to DistgSSR [60] on LFs with different angular resolutions. It can be observed from Table 2 that our method achieves higher PSNR values than DistgSSR on almost all the datasets for each angular resolution (except on the EPFL and INRIA datasets with 2×2 input LFs). The consistent performance improvements demonstrate that our EPIT can well model the spatial-angular correlation under various angular resolutions. More comparisons and discussions are provided in the supplemental material.

Performance on Real-World LF Scenes. We compare our method to state-of-the-art methods under real-world degradation by directly applying them to LFs in the STFlytro dataset [47]. Since no ground-truth HR images are available in this dataset, we present the LR inputs and their super-resolved results in Fig. 7. It can be observed that our method can recover more faithful details and generate clearer letters than other methods. Since the LF structure remains unchanged under both bicubic and real-world degradation, our method can learn the spatial-angular correlation from bicubicly downsampled training data and generalize well to LF images under real degradation.
4.3 Robustness to Large Disparity Variations
Considering the parallax structure of LF images, we followed the shearing operation in existing works [67, 68] to linearly change the overall disparity range of the LF datasets. Note that, the content of each SAI remains unchanged after the shearing operation, so we can quantitatively investigate the performance of different SR methods with respect to disparity variations.
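A minimal sketch of such a shearing operation for integer sheared values is given below (our own illustration: each SAI is shifted according to its angular offset from the center view, which adds a constant offset to the disparity of every scene point; sub-pixel shears would require interpolation, and real implementations typically crop the borders instead of wrapping them).

```python
import torch

def shear_lf(lf: torch.Tensor, shear: int) -> torch.Tensor:
    """Shift every SAI of an LF (U, V, H, W) by `shear` pixels per unit of angular
    offset from the center view, linearly changing the overall disparity range
    while leaving the content of each SAI unchanged (up to the circular wrap)."""
    u_num, v_num = lf.shape[:2]
    uc, vc = u_num // 2, v_num // 2
    out = torch.empty_like(lf)
    for u in range(u_num):
        for v in range(v_num):
            dy, dx = shear * (u - uc), shear * (v - vc)
            out[u, v] = torch.roll(lf[u, v], shifts=(dy, dx), dims=(0, 1))
    return out

lf = torch.rand(5, 5, 64, 64)
print(shear_lf(lf, shear=2).shape)   # torch.Size([5, 5, 64, 64])
```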
Quantitative & Qualitative Comparison. Figure 8 shows the quantitative and qualitative results of different SR methods with respect to the sheared values, from which we can observe that: 1) Except for the single image SR method RCAN, all LF image SR methods suffer a performance drop as the absolute sheared value of the LF images increases. That is because large sheared values result in more significant misalignment among LF images, and introduce difficulties in incorporating complementary information; 2) As the absolute sheared value increases, the performance of existing LF image SR methods even falls below that of RCAN. The possible reason is that these methods do not make full use of the local spatial information, but rather rely on local angular information from adjacent views. When the sheared value exceeds their receptive fields, the large disparities make the spatial-angular correlation non-local and thus challenge the incorporation of complementary information; 3) Our EPIT is much more robust to disparity variations and achieves the highest PSNR scores under all sheared values. More quantitative comparisons on the whole datasets are provided in the supplemental material.
LAM Visualization. We used Local Attribution Maps (LAM) [18] to visualize the input regions that contribute to the SR results of different methods. As shown in Fig. 8, we first specify the center of the green stripes in the HR images as the target regions, and then re-organize the corresponding attribution maps on the LR images into the EPI pattern. It can be observed that RCAN achieves a larger receptive field along the spatial dimension than the other compared methods, which supports the results in Figs. 8(b) and 8(e) that RCAN achieves relatively stable SR performance under different sheared values. It is worth noting that our EPIT can automatically incorporate the most relevant information from different views, and can learn the non-local spatial-angular correlation regardless of disparity variations.

Perspective Comparison. We compare the performance of MEG-Net, DistgSSR and our method with respect to different perspectives and sheared values (0 to 3). It can be observed in Fig.Β 9 that, both MEG-Net and DistgSSR suffer significant performance drops on all perspectives as the sheared value increases. In contrast, our EPIT can well handle the disparity variation problem, and achieve much higher PSNR values with a balanced distribution among different views regardless of the sheared values.
4.4 Ablation Study
In this subsection, we compare the performance of our EPIT with different variants to verify the effectiveness of our design choices, and additionally, investigate their robustness to large disparity variations.
Horizontal/Vertical Basic-Transformer Units. We demonstrated the effectiveness of the horizontal and vertical Basic-Transformer units in our EPIT by separately removing them from our network. Note that, without using horizontal or vertical Basic-Transformer unit, these variants cannot incorporate any information from the corresponding angular directions. As shown in TableΒ 3, both variants w/o-Horizontal and w/o-Vertical suffer a decrease of 0.72dB in the INRIA dataset as compared to EPIT, which demonstrates the importance of exploiting spatial-angular correlations from all angular views.
Weight Sharing in Non-Local Cascading Blocks. We introduced the variant w/o-Share by removing the weight sharing between horizontal and vertical Basic-Transformer units. As shown in TableΒ 3, the additional parameters in variant w/o-Share do not introduce further performance improvement. It demonstrates that the weight sharing strategy between two directional Basic-Transformer units is beneficial and efficient to regularize the network.
SpatialConv in Non-Local Cascading Blocks. We introduced the variant w/o-Local by removing the SpatialConv layers from our EPIT, and we adjusted the channel number to make the model size of this variant not smaller than the main model. As shown in TableΒ 3, the SpatialConv has a significant influence on the SR performance, e.g., the variant w/o-Local suffers a 0.41dB PSNR drop on the EPFL dataset. It demonstrates that local context information is crucial to the SR performance, and the simple convolutions can fully incorporate the spatial information from each SAI.
Basic-Transformer in Non-Local Cascading Blocks. We introduced the variant w/o-Trans by replacing Basic-Transformer in Non-Local Blocks with cascaded convolutions. As shown in TableΒ 3, w/o-Trans suffers a most significant performance drop as the sheared value increases, which demonstrates the effectiveness of the Basic-Transformer in incorporating global information on the EPIs.
Basic-Transformer Number. We introduced the variants with-n-Block (n=1,2,3) by retaining only n Non-Local Cascading blocks. Results in Table 3 demonstrate that our EPIT (with 5 Non-Local Cascading blocks) benefits from its higher-order spatial-angular correlation modeling capability.
Table 3. PSNR values achieved by different variants of our EPIT for ×2 SR on the EPFL and INRIA datasets with different sheared values.

Variants | #Prm. | FLOPs | EPFL (sheared 0) | EPFL (sheared 2) | EPFL (sheared 4) | INRIA (sheared 0) | INRIA (sheared 2) | INRIA (sheared 4)
---|---|---|---|---|---|---|---|---
w/o-Horiz | 1.42M | 80.20G | 33.96 | 33.98 | 34.02 | 35.95 | 36.08 | 36.11 |
w/o-Verti | 1.42M | 80.20G | 34.01 | 33.94 | 33.87 | 35.95 | 35.97 | 36.02 |
w/o-Share | 2.71M | 80.20G | 34.80 | 34.63 | 34.51 | 36.66 | 36.72 | 36.45 |
w/o-Local | 1.64M | 96.39G | 34.42 | 34.36 | 34.27 | 36.36 | 36.40 | 36.25 |
w/o-Trans | 1.60M | 78.82G | 33.90 | 31.32 | 31.74 | 35.95 | 33.28 | 33.55 |
w-1-Block | 1.54M | 68.23G | 33.97 | 34.24 | 34.08 | 35.84 | 36.19 | 35.93 |
w-2-Block | 1.45M | 73.37G | 34.19 | 34.36 | 34.29 | 35.98 | 36.27 | 35.99 |
w-3-Block | 1.71M | 85.78G | 34.64 | 34.51 | 34.45 | 36.53 | 36.47 | 36.22 |
EPIT | 1.42M | 74.96G | 34.83 | 34.69 | 34.59 | 36.67 | 36.75 | 36.59 |
5 Conclusion
In this paper, we propose a Transformer-based network for LF image SR. By modeling the dependencies between each pair of pixels on EPIs, our method can learn the spatial-angular correlation while achieving a global receptive field along the epipolar line. Extensive experimental results demonstrate that our method not only achieves state-of-the-art SR performance on benchmark datasets, but is also robust to large disparity variations.
Acknowledgment: This work was supported in part by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China under Grant 61921001.
References
- [1] Martin Alain and Aljosa Smolic. Light field super-resolution via lfbm5d sparse coding. In 2018 25th IEEE international conference on image processing (ICIP), pages 2501β2505, 2018.
- [2] Yildiray Anagun, Sahin Isik, and Erol Seke. Srlibrary: comparing different loss functions for super-resolution over various convolutional architectures. Journal of Visual Communication and Image Representation, 61:178β187, 2019.
- [3] Benjamin Attal, Jia-Bin Huang, Michael ZollhΓΆfer, Johannes Kopf, and Changil Kim. Learning neural light fields with ray-space embedding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 19819β19829, 2022.
- [4] Dana Berman, Shai Avidan, etΒ al. Non-local image dehazing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1674β1682, 2016.
- [5] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volumeΒ 2, pages 60β65, 2005.
- [6] Zewei Cai, Xiaoli Liu, Xiang Peng, and BruceΒ Z Gao. Ray calibration and phase mapping for structured-light-field 3d reconstruction. Optics Express, 26(6):7598β7613, 2018.
- [7] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 12299β12310, 2021.
- [8] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and AlanΒ L Yuille. Attention to scale: Scale-aware semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3640β3649, 2016.
- [9] Zhen Cheng, Yutong Liu, and Zhiwei Xiong. Spatial-angular versatile convolution for light field reconstruction. IEEE Transactions on Computational Imaging, 8:1131β1144, 2022.
- [10] Zhen Cheng, Zhiwei Xiong, Chang Chen, Dong Liu, and Zheng-Jun Zha. Light field super-resolution with zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10010β10019, 2021.
- [11] Suyeon Choi, Manu Gopakumar, Yifan Peng, Jonghyun Kim, and Gordon Wetzstein. Neural 3d holography: Learning accurate wave propagation models for 3d holographic virtual and augmented reality displays. ACM Transactions on Graphics (TOG), 40(6):1β12, 2021.
- [12] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on image processing, 16(8):2080β2095, 2007.
- [13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR), 2021.
- [14] Gilad Freedman and Raanan Fattal. Image and video upscaling from local self-examples. ACM Transactions on Graphics (TOG), 30(2):1β11, 2011.
- [15] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3146β3154, 2019.
- [16] Daniel Glasner, Shai Bagon, and Michal Irani. Super-resolution from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 349β356, 2009.
- [17] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 249β256, 2010.
- [18] Jinjin Gu and Chao Dong. Interpreting super-resolution networks with local attribution maps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9199β9208, 2021.
- [19] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng. Weighted nuclear norm minimization with application to image denoising. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2862β2869, 2014.
- [20] Mantang Guo, Junhui Hou, Jing Jin, Jie Chen, and Lap-Pui Chau. Deep spatial-angular regularization for light field imaging, denoising, and super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6094β6110, 2021.
- [21] Katrin Honauer, Ole Johannsen, Daniel Kondermann, and Bastian Goldluecke. A dataset and evaluation methodology for depth estimation on 4d light fields. In Asian Conference on Computer Vision (ACCV), pages 19β34, 2016.
- [22] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7132β7141, 2018.
- [23] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5197β5206, 2015.
- [24] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In IEEE International Conference on Computer Vision (ICCV), pages 603β612, 2019.
- [25] Jing Jin and Junhui Hou. Occlusion-aware unsupervised learning of depth from 4-d light fields. IEEE Transactions on Image Processing, 31:2216β2228, 2022.
- [26] Jing Jin, Junhui Hou, Jie Chen, and Sam Kwong. Light field spatial super-resolution via deep combinatorial geometry embedding and structural consistency regularization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2260β2269, 2020.
- [27] Jing Jin, Junhui Hou, Jie Chen, Huanqiang Zeng, Sam Kwong, and Jingyi Yu. Deep coarse-to-fine dense light field reconstruction with flexible sampling and geometry-aware fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- [28] NimaΒ Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi. Learning-based view synthesis for light field cameras. ACM Transactions on Graphics (TOG), 35(6):1β10, 2016.
- [29] Numair Khan, MinΒ H Kim, and James Tompkin. Differentiable diffusion for dense depth estimation from multi-view images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8912β8921, 2021.
- [30] Jiwon Kim, JungKwon Lee, and KyoungMu Lee. Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1646β1654, 2016.
- [31] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
- [32] Titus Leistner, Radek Mackowiak, Lynton Ardizzone, Ullrich KΓΆthe, and Carsten Rother. Towards multimodal depth estimation from light fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 12953β12961, 2022.
- [33] Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 31β42, 1996.
- [34] Chia-Kai Liang and Ravi Ramamoorthi. A light transport framework for lenslet light field cameras. ACM Transactions on Graphics (TOG), 34(2):1β19, 2015.
- [35] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc VanΒ Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In IEEE International Conference on Computer Vision Workshops (ICCVW), pages 1833β1844, 2021.
- [36] Zhengyu Liang, Yingqian Wang, Longguang Wang, Jungang Yang, and Shilin Zhou. Light field image super-resolution with transformers. IEEE Signal Processing Letters, 29:563β567, 2022.
- [37] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and KyoungMu Lee. Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 136β144, 2017.
- [38] Gaosheng Liu, Huanjing Yue, Jiamin Wu, and Jingyu Yang. Intra-inter view interaction network for light field image super-resolution. IEEE Transactions on Multimedia, pages 1β1, 2021.
- [39] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE International Conference on Computer Vision (ICCV), pages 10012β10022, 2021.
- [40] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3202β3211, 2022.
- [41] AndrewΒ L Maas, AwniΒ Y Hannun, AndrewΒ Y Ng, etΒ al. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volumeΒ 30, pageΒ 3, 2013.
- [42] Nan Meng, HaydenKwokHay So, Xing Sun, and Edmund Lam. High-dimensional dense residual convolutional neural network for light field reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
- [43] Nan Meng, Xiaofei Wu, Jianzhuang Liu, and Edmund Lam. High-order residual network for light field super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, volumeΒ 34, pages 11757β11764, 2020.
- [44] Kaushik Mitra and Ashok Veeraraghavan. Light field denoising, light field superresolution and stereo camera based refocussing using a gmm light field patch prior. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 22β28, 2012.
- [45] MuhammadΒ Muzammal Naseer, Kanchana Ranasinghe, SalmanΒ H Khan, Munawar Hayat, Fahad ShahbazΒ Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems, 34:23296β23308, 2021.
- [46] MikaelLe Pendu, Xiaoran Jiang, and Christine Guillemot. Light field inpainting propagation via low rank matrix completion. IEEE Transactions on Image Processing, 27(4):1981β1993, 2018.
- [47] AbhilashΒ Sunder Raj, Michael Lowney, Raj Shah, and Gordon Wetzstein. Stanford lytro light field archive, 2016.
- [48] Martin Rerabek and Touradj Ebrahimi. New light field image dataset. In International Conference on Quality of Multimedia Experience (QoMEX), 2016.
- [49] Mattia Rossi and Pascal Frossard. Geometry-consistent light field super-resolution via graph-based regularization. IEEE Transactions on Image Processing, 27(9):4207β4218, 2018.
- [50] Shuwei Shi, Jinjin Gu, Liangbin Xie, Xintao Wang, Yujiu Yang, and Chao Dong. Rethinking alignment in video super-resolution transformers. Advances in Neural Information Processing Systems, 2022.
- [51] Abhishek Singh, Fatih Porikli, and Narendra Ahuja. Super-resolving noisy images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2846β2853, 2014.
- [52] Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Light field neural rendering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8269β8279, 2022.
- [53] Vaibhav Vaish and Andrew Adams. The (new) stanford light field archive. Computer Graphics Laboratory, Stanford University, 6(7), 2008.
- [54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, AidanΒ N Gomez, Εukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [55] Longguang Wang, Yulan Guo, Yingqian Wang, Zhengfa Liang, Zaiping Lin, Jungang Yang, and Wei An. Parallax attention for unsupervised stereo correspondence learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- [56] Longguang Wang, Yingqian Wang, Zhengfa Liang, Zaiping Lin, Jungang Yang, Wei An, and Yulan Guo. Learning parallax attention for stereo image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- [57] Shunzhou Wang, Tianfei Zhou, Yao Lu, and Huijun Di. Detail preserving transformer for light field image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence,, 2022.
- [58] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7794β7803, 2018.
- [59] Yunlong Wang, Fei Liu, Kunbo Zhang, Guangqi Hou, Zhenan Sun, and Tieniu Tan. Lfnet: A novel bidirectional recurrent convolutional neural network for light-field image super-resolution. IEEE Transactions on Image Processing, 27(9):4274β4286, 2018.
- [60] Yingqian Wang, Longguang Wang, Gaochang Wu, Jungang Yang, Wei An, Jingyi Yu, and Yulan Guo. Disentangling light fields for super-resolution and disparity estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [61] Yingqian Wang, Longguang Wang, Jungang Yang, Wei An, Jingyi Yu, and Yulan Guo. Spatial-angular interaction for light field image super-resolution. In European Conference on Computer Vision (ECCV), pages 290β308, 2020.
- [62] Yingqian Wang, Jungang Yang, Longguang Wang, Xinyi Ying, Tianhao Wu, Wei An, and Yulan Guo. Light field image super-resolution using deformable convolution. IEEE Transactions on Image Processing, 30:1057β1071, 2020.
- [63] Zhou Wang, AlanC Bovik, HamidR Sheikh, and EeroP Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600β612, 2004.
- [64] Sven Wanner and Bastian Goldluecke. Variational light field analysis for disparity estimation and super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):606β619, 2013.
- [65] Sven Wanner, Stephan Meister, and Bastian Goldluecke. Datasets and benchmarks for densely sampled 4d light fields. In Vision, Modelling and Visualization (VMV), volumeΒ 13, pages 225β226, 2013.
- [66] Suttisak Wizadwongsa, Pakkapon Phongthawee, Jiraphon Yenphraphai, and Supasorn Suwajanakorn. Nex: Real-time view synthesis with neural basis expansion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8534β8543, 2021.
- [67] Gaochang Wu, Yebin Liu, Qionghai Dai, and Tianyou Chai. Learning sheared epi structure for light field reconstruction. IEEE Transactions on Image Processing, 28(7):3261β3273, 2019.
- [68] Gaochang Wu, Yebin Liu, Lu Fang, and Tianyou Chai. Revisiting light field rendering with deep anti-aliasing neural network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- [69] Gaochang Wu, Yingqian Wang, Yebin Liu, Lu Fang, and Tianyou Chai. Spatial-angular attention network for light field reconstruction. IEEE Transactions on Image Processing, 30:8999β9013, 2021.
- [70] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524β10533, 2020.
- [71] Jianchao Yang, Zhe Lin, and Scott Cohen. Fast image super-resolution based on in-place example regression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1059β1066, 2013.
- [72] HenryWingFung Yeung, Junhui Hou, Xiaoming Chen, Jie Chen, Zhibo Chen, and YukYing Chung. Light field spatial super-resolution using deep efficient spatial-angular separable convolution. IEEE Transactions on Image Processing, 28(5):2319β2330, 2018.
- [73] Youngjin Yoon, HaeGon Jeon, Donggeun Yoo, JoonYoung Lee, and InSo Kweon. Light-field image super-resolution using convolutional neural network. IEEE Signal Processing Letters, 24(6):848β852, 2017.
- [74] Jingyi Yu. A light-field journey to virtual reality. IEEE MultiMedia, 24(2):104β112, 2017.
- [75] Jiyang Yu, Jingen Liu, Liefeng Bo, and Tao Mei. Memory-augmented non-local attention for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 17834β17843, 2022.
- [76] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711β730, 2010.
- [77] Jingyang Zhang, Yao Yao, and Long Quan. Learning signed distance field for multi-view surface reconstruction. In IEEE International Conference on Computer Vision (ICCV), pages 6525β6534, 2021.
- [78] Shuo Zhang, Song Chang, and Youfang Lin. End-to-end light field spatial super-resolution network using multiple epipolar geometry. IEEE Transactions on Image Processing, 30:5956β5968, 2021.
- [79] Shuo Zhang, Youfang Lin, and Hao Sheng. Residual networks for light field image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11046β11055, 2019.
- [80] Shuo Zhang, Hao Sheng, Chao Li, Jun Zhang, and Zhang Xiong. Robust depth estimation for light field via spinning parallelogram operator. Computer Vision and Image Understanding, 145:148β159, 2016.
- [81] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision (ECCV), pages 286β301, 2018.
- [82] Hao Zhu, Mantang Guo, Hongdong Li, Qing Wang, and Antonio Robles-Kelly. Revisiting spatio-angular trade-off in light field cameras and extended applications in super-resolution. IEEE Transactions on Visualization and Computer Graphics, 27(6):3019β3033, 2019.
Learning Non-Local Spatial-Angular Correlation for Light Field
Image Super-Resolution (Supplemental Material)
Figure I. Qualitative comparison of different SR methods for ×4 SR.
Section A provides more visual comparisons on the light field (LF) datasets, and presents additional comparisons on LFs with different angular resolutions. Section B presents detailed quantitative results of different methods on each dataset with various sheared values. Section C describes additional experiments for LF angular SR, and shows visual results achieved by different methods.
A Additional Comparisons on Benchmarks
A.1 Qualitative Results
In this subsection, we show more visual comparisons for ×4 SR on the benchmark datasets in Fig. I. It can be observed that the proposed EPIT recovers richer and more realistic details.
A.2 Robustness to Different Angular Resolution
In the main body of our paper, we have illustrated that our EPIT (trained on the central 5×5 SAIs) achieves competitive PSNR scores for other angular resolutions, as compared to the top-performing DistgSSR [60]. In Table I, we provide more quantitative results achieved by state-of-the-art methods with different angular resolutions.
In addition, we train a series of EPIT models from scratch on 2×2, 3×3 and 4×4 SAIs, respectively. It can be observed from Table II that when using LFs with a larger angular resolution (e.g., 5×5) as training data, our method achieves better SR performance across different test angular resolutions. That is because more angular views are beneficial for our EPIT to better learn the spatial-angular correlation. This phenomenon inspires us to further explore the intrinsic mechanism of LF processing tasks in the future.
Table I. PSNR/SSIM values achieved by state-of-the-art methods (trained on 5×5 LFs) for ×4 SR with different input angular resolutions.

Datasets | Input | resLF | LFSSR | MEG-Net | LFT | EPIT (ours)
---|---|---|---|---|---|---
EPFL [48] | 22 | - | 26.00/.8541 | 26.40/.8667 | 27.64/.8953 | 28.22/.9024 |
33 | 28.13/.9012 | 26.84/.8750 | 27.16/.8834 | 28.12/.9029 | 28.74/.9103 | |
44 | - | 27.62/.8930 | 28.04/.9036 | 28.43/.9087 | 29.04/.9164 | |
55 | 28.27/.9035 | 28.27/.9118 | 28.74/.9160 | 29.85/.9210 | 29.34/.9197 | |
66 | - | 27.62/.8995 | 28.46/.9115 | 28.45/.9101 | 29.43/.9218 | |
77 | 27.91/.9038 | 27.29/.8889 | 28.30/.9083 | 28.55/.9094 | 29.60/.9231 | |
88 | - | 27.06/.8834 | 28.15/.9061 | 28.37/.9064 | 29.60/.9240 | |
99 | 26.07/.8881 | 26.95/.8810 | 28.12/.9046 | 28.45/.9071 | 29.71/.9246 | |
HCInew [21] | 22 | - | 28.44/.8639 | 29.02/.8782 | 29.94/.8960 | 30.84/.9114 |
33 | 30.63/.9089 | 29.47/.8848 | 29.84/.8943 | 30.28/.9031 | 31.23/.9182 | |
44 | - | 30.22/.8997 | 30.68/.9094 | 30.51/.9065 | 31.40/.9213 | |
55 | 30.73/.9107 | 30.72/.9145 | 31.10/.9177 | 31.46/.9218 | 31.51/.9231 | |
66 | - | 30.24/.9053 | 30.91/.9154 | 30.26/.9009 | 31.57/.9241 | |
77 | 30.23/.9112 | 29.89/.8997 | 30.64/.9125 | 30.05/.8975 | 31.63/.9250 | |
88 | - | 29.68/.8969 | 30.48/.9105 | 29.81/.8923 | 31.66/.9256 | |
99 | 27.84/.8967 | 29.46/.8942 | 30.34/.9087 | 29.77/.8916 | 31.69/.9260 | |
HCIold [65] | 22 | - | 33.37/.9413 | 34.17/.9489 | 35.52/.9591 | 36.94/.9690 |
33 | 36.61/.9674 | 34.72/.9535 | 35.26/.9579 | 35.91/.9616 | 37.37/.9717 | |
44 | - | 35.80/.9615 | 36.42/.9662 | 36.15/.9634 | 37.52/.9729 | |
55 | 36.71/.9682 | 36.70/.9696 | 37.28/.9716 | 37.63/.9735 | 37.68/.9737 | |
66 | - | 35.32/.9617 | 36.75/.9688 | 36.21/.9636 | 37.76/.9744 | |
77 | 36.21/.968 | 34.94/.9578 | 36.35/.9662 | 36.10/.9629 | 37.92/.9749 | |
88 | - | 34.70/.9558 | 36.18/.9651 | 35.73/.9596 | 38.00/.9754 | |
99 | 33.55/.9519 | 34.46/.9539 | 36.08/.9644 | 35.71/.9593 | 38.06/.9756 | |
INRIA [46] | 22 | - | 27.83/.9035 | 28.31/.9125 | 29.99/.9378 | 30.52/.9418 |
33 | 30.33/.9413 | 28.78/.9201 | 29.16/.9264 | 30.35/.9424 | 30.94/.9472 | |
44 | - | 29.59/.9327 | 30.00/.9401 | 30.64/.9457 | 31.19/.9509 | |
55 | 30.34/.9412 | 30.31/.9467 | 30.66/.9490 | 31.20/.9524 | 31.27/.9526 | |
66 | - | 29.50/.9356 | 30.38/.9443 | 30.61/.9457 | 31.45/.9533 | |
77 | 29.82/.9398 | 29.05/.9269 | 30.13/.9415 | 30.56/.9443 | 31.51/.9539 | |
88 | - | 28.76/.9221 | 30.02/.9399 | 30.41/.9422 | 31.54/.9540 | |
99 | 27.65/.9226 | 28.58/.9196 | 29.97/.9386 | 30.43/.9420 | 31.56/.9539 | |
STFgantry [53] | 22 | - | 27.29/.8710 | 28.15/.8944 | 29.69/.9263 | 31.30/.9468 |
33 | 30.05/.9348 | 28.81/.9064 | 29.22/.9161 | 30.05/.9316 | 31.86/.9534 | |
44 | - | 29.77/.9254 | 30.30/.9356 | 30.35/.9359 | 32.11/.9558 | |
55 | 30.19/.9372 | 30.15/.9426 | 30.77/.9453 | 31.86/.9548 | 32.18/.9571 | |
66 | - | 29.79/.9320 | 30.58/.9428 | 30.01/.9289 | 32.31/.9580 | |
77 | 29.71/.9375 | 29.40/.9257 | 30.25/.9393 | 29.53/.9208 | 32.40/.9585 | |
88 | - | 29.12/.9211 | 30.03/.9367 | 29.17/.9135 | 32.48/.9591 | |
99 | 27.23/.9224 | 28.85/.9169 | 29.83/.9344 | 29.06/.9110 | 32.50/.9592 |
Table II. PSNR/SSIM values achieved by EPIT models trained on LFs with different angular resolutions, evaluated for ×4 SR with different input angular resolutions.

Datasets | Input | EPIT (2×2)* | EPIT (3×3)* | EPIT (4×4)* | EPIT (5×5)*
---|---|---|---|---|---
EPFL [48] | 22 | 28.40/.9037 | 28.45/.9040 | 28.33/.9034 | 28.22/.9024 |
33 | 28.61/.9076 | 28.75/.9090 | 28.67/.9090 | 28.74/.9103 | |
44 | 28.69/.9108 | 28.90/.9131 | 28.86/.9137 | 29.04/.9164 | |
55 | 28.81/.9124 | 29.08/.9152 | 29.06/.9162 | 29.34/.9197 | |
66 | 28.81/.9133 | 29.13/.9168 | 29.12/.9180 | 29.43/.9218 | |
77 | 28.88/.9137 | 29.24/.9176 | 29.24/.9190 | 29.60/.9231 | |
88 | 28.86/.9140 | 29.25/.9184 | 29.25/.9198 | 29.60/.9240 | |
99 | 28.92/.9141 | 29.32/.9188 | 29.34/.9204 | 29.71/.9246 | |
HCInew [21] | 22 | 30.81/.9109 | 30.86/.9116 | 30.86/.9116 | 30.84/.9114 |
33 | 30.84/.9124 | 31.06/.9157 | 31.09/.9162 | 31.23/.9182 | |
44 | 30.86/.9132 | 31.14/.9174 | 31.21/.9184 | 31.40/.9213 | |
55 | 30.86/.9134 | 31.19/.9184 | 31.27/.9197 | 31.51/.9231 | |
66 | 30.86/.9134 | 31.21/.9190 | 31.32/.9205 | 31.57/.9241 | |
77 | 30.85/.9133 | 31.23/.9194 | 31.35/.9211 | 31.63/.9250 | |
88 | 30.86/.9133 | 31.24/.9197 | 31.37/.9215 | 31.66/.9256 | |
99 | 30.85/.9132 | 31.25/.9199 | 31.39/.9219 | 31.69/.9260 | |
HCIold [65] | 22 | 36.83/.9683 | 36.85/.9682 | 36.81/.9679 | 36.94/.9690 |
33 | 36.92/.9688 | 37.13/.9701 | 37.14/.9702 | 37.37/.9717 | |
44 | 36.95/.9692 | 37.21/.9708 | 37.27/.9712 | 37.52/.9729 | |
55 | 37.01/.9695 | 37.31/.9714 | 37.39/.9718 | 37.68/.9737 | |
66 | 37.00/.9696 | 37.33/.9717 | 37.44/.9723 | 37.76/.9744 | |
77 | 37.00/.9696 | 37.40/.9719 | 37.52/.9726 | 37.92/.9749 | |
88 | 36.99/.9696 | 37.41/.9721 | 37.56/.9729 | 38.00/.9754 | |
99 | 36.99/.9697 | 37.44/.9722 | 37.60/.9730 | 38.06/.9756 | |
INRIA [46] | 22 | 30.63/.9429 | 30.66/.9431 | 30.58/.9427 | 30.52/.9418 |
33 | 30.82/.9458 | 30.91/.9465 | 30.87/.9466 | 30.94/.9472 | |
44 | 30.90/.9472 | 31.04/.9484 | 31.02/.9489 | 31.19/.9509 | |
55 | 30.95/.9483 | 31.14/.9498 | 31.14/.9506 | 31.27/.9526 | |
66 | 30.94/.9484 | 31.17/.9503 | 31.18/.9511 | 31.45/.9533 | |
77 | 30.93/.9485 | 31.20/.9506 | 31.22/.9515 | 31.51/.9539 | |
88 | 30.92/.9484 | 31.22/.9507 | 31.24/.9517 | 31.54/.9540 | |
99 | 30.91/.9481 | 31.22/.9506 | 31.26/.9516 | 31.56/.9539 | |
STFgantry [53] | 22 | 30.84/.9432 | 31.03/.9449 | 31.09/.9452 | 31.30/.9468 |
33 | 30.93/.9447 | 31.39/.9493 | 31.49/.9503 | 31.86/.9534 | |
44 | 31.02/.9459 | 31.56/.9510 | 31.69/.9523 | 32.11/.9558 | |
55 | 30.99/.9459 | 31.58/.9518 | 31.74/.9534 | 32.18/.9571 | |
66 | 31.03/.9460 | 31.68/.9525 | 31.85/.9541 | 32.31/.9580 | |
77 | 31.03/.9459 | 31.70/.9526 | 31.90/.9545 | 32.40/.9585 | |
88 | 31.04/.9459 | 31.73/.9528 | 31.96/.9549 | 32.48/.9591 | |
99 | 31.02/.9457 | 31.74/.9529 | 31.97/.9550 | 32.50/.9592 |
* Note that "EPIT (A×A)" denotes an EPIT model trained from scratch on LFs with an angular resolution of A×A.
B Additional Quantitative Comparison on Disparity Variations
We have presented the performance comparison on two selected scenes with different sheared values for ×2 SR in the main body of our paper. Here, we provide quantitative results on each dataset in Table III and Fig. II. It can be observed that our EPIT achieves more consistent performance than existing methods with respect to disparity variations on various datasets.
Table III. PSNR/SSIM values achieved by different methods for ×2 SR on each dataset with different sheared values.

Datasets | Shear | Bicubic | RCAN | resLF | LFSSR | LF-ATO | LF-InterNet | LF-DFnet | MEG-Net | LF-IINet | LFT | DistgSSR | Ours
---|---|---|---|---|---|---|---|---|---|---|---|---|---
EPFL [48] | -4 | 29.95/.9372 | 33.47/.9640 | 32.41/.9582 | 31.90/.9550 | 32.59/.9593 | 32.15/.9573 | 32.69/.9597 | 32.07/.9560 | 32.24 /.9579 | 32.48/.9587 | 32.29/.9583 | 34.52/.9734 |
-3 | 29.92/.9369 | 33.45/.9637 | 32.38/.9578 | 31.85/.9548 | 32.58/.9592 | 32.14/.9572 | 32.68/.9597 | 32.16/.9564 | 32.27/.9577 | 32.49/.9587 | 32.29/.9578 | 34.67/.9746 | |
-2 | 29.89/.9369 | 33.31/.9632 | 32.36/.9587 | 31.92/.9561 | 32.37/.9589 | 32.06/.9571 | 32.47/.9592 | 32.17/.9574 | 32.37/.9589 | 32.35/.9587 | 32.65/.9618 | 34.64/.9749 | |
-1 | 29.83/.9373 | 33.30/.9634 | 33.01/.9652 | 32.69/.9640 | 33.06/.9659 | 32.62/.9636 | 33.41/.9673 | 32.82/.9653 | 33.29/.9676 | 33.33/.9676 | 33.37/.9687 | 34.71/.9756 | |
0 | 29.74/.9376 | 33.16/.9634 | 33.62/.9706 | 33.68/.9744 | 34.27/.9757 | 34.14/.9760 | 34.40/.9755 | 34.30/.9773 | 34.68/.9773 | 34.80/.9781 | 34.81/.9787 | 34.83/.9775 | |
1 | 29.87/.9373 | 33.16/.9629 | 32.81/.9644 | 32.70/.9639 | 32.67/.9656 | 32.57/.9642 | 33.19/.9669 | 32.76/.9647 | 33.12/.9663 | 33.18/.9675 | 33.01/.9681 | 34.66/.9760 | |
2 | 29.91/.9370 | 33.37/.9633 | 32.28/.9579 | 31.87/.9548 | 32.47/.9597 | 32.00/.9569 | 32.45/.9593 | 31.85/.9560 | 32.15/.9577 | 32.42/.9598 | 32.04/.9581 | 34.69/.9750 | |
3 | 29.94/.9370 | 33.48/.9638 | 32.32/.9575 | 31.85/.9543 | 32.56/.9591 | 32.09/.9569 | 32.61/.9594 | 31.84/.9545 | 32.19/.9574 | 32.43/.9585 | 32.17/.9578 | 34.73/.9747 | |
4 | 29.98/.9372 | 33.52/.9641 | 32.40/.9579 | 31.97/.9550 | 32.57/.9592 | 32.15/.9572 | 32.68/.9597 | 31.93/.9554 | 32.19/.9575 | 32.46/.9586 | 32.15/.9579 | 34.59/.9736 | |
HCInew [21] | -4 | 30.83/.9343 | 34.59/.9611 | 33.34/.9533 | 32.57/.9494 | 33.37/.9545 | 32.99/.9525 | 33.62/.9554 | 32.91/.9510 | 33.34/.9534 | 33.35/.9541 | 33.03/.9523 | 36.77/.9782 |
-3 | 30.81/.9342 | 34.65/.9609 | 33.45/.9543 | 32.61/.9501 | 33.58/.9558 | 33.06/.9523 | 33.75/.9562 | 33.16/.9527 | 33.44/.9542 | 33.51/.9551 | 33.43/.9554 | 37.05/.9791 | |
-2 | 30.83/.9344 | 34.60/.9605 | 33.50/.9594 | 32.58/.9548 | 33.13/.9599 | 32.91/.9563 | 33.41/.9609 | 33.33/.9588 | 33.80/.9618 | 33.37/.9609 | 33.76/.9644 | 36.98/.9792 | |
-1 | 30.74/.9349 | 34.42/.9603 | 35.00/.9704 | 34.19/.9691 | 34.87/.9716 | 34.29/.9690 | 35.59/.9739 | 34.51/.9716 | 35.70/.9748 | 35.49/.9747 | 35.68/.9754 | 37.21/.9815 | |
0 | 31.89/.9356 | 34.98/.9603 | 36.69/.9739 | 36.81/.9749 | 37.24/.9767 | 37.28/.9763 | 37.44/.9773 | 37.42/.9777 | 37.74/.9790 | 37.84/.9791 | 37.96/.9796 | 38.23/.9810 | |
1 | 30.73/.9350 | 34.14/.9602 | 34.04/.9649 | 33.90/.9639 | 33.41/.9660 | 33.63/.9633 | 34.30/.9681 | 34.06/.9659 | 34.64/.9682 | 34.33/.9694 | 34.30/.9691 | 36.83/.9792 | |
2 | 30.79/.9344 | 34.30/.9605 | 32.99/.9547 | 32.64/.9509 | 32.84/.9566 | 32.65/.9527 | 32.80/.9560 | 32.43/.9517 | 32.99/.9546 | 33.10/.9571 | 32.31/.9546 | 36.31/.9787 | |
3 | 30.77/.9341 | 34.39/.9609 | 33.17/.9523 | 32.70/.9493 | 33.32/.9545 | 33.03/.9523 | 33.51/.9553 | 32.59/.9492 | 33.22/.9529 | 33.32/.9541 | 32.87/.9521 | 36.56/.9787 | |
4 | 30.79/.9343 | 34.36/.9612 | 33.16/.9530 | 32.74/.9499 | 33.19/.9545 | 32.99/.9526 | 33.40/.9553 | 32.61/.9497 | 33.13/.9532 | 33.21/.9543 | 32.70/.9521 | 36.40/.9778 | |
HCIold [65] | -4 | 36.85/.9775 | 40.85/.9875 | 39.36/.9852 | 38.44/.9833 | 39.18/.9852 | 39.22/.9851 | 39.55/.9858 | 38.69/.9837 | 38.93/.9849 | 39.20/.9851 | 39.17/.9850 | 42.34/.9929 |
-3 | 36.83/.9775 | 40.88/.9874 | 39.57/.9854 | 38.45/.9837 | 39.35/.9854 | 39.33/.9853 | 39.76/.9858 | 38.99/.9843 | 39.18/.9850 | 39.37/.9851 | 39.40/.9852 | 43.04/.9936 | |
-2 | 36.84/.9777 | 40.32/.9871 | 38.84/.9858 | 38.05/.9841 | 38.33/.9854 | 38.80/.9852 | 38.70/.9862 | 38.64/.9851 | 38.90/.9860 | 38.47/.9855 | 39.53/.9879 | 42.80/.9938 | |
-1 | 36.71/.9782 | 40.22/.9873 | 40.43/.9902 | 39.44/.9891 | 39.60/.9900 | 39.79/.9895 | 40.96/.9914 | 39.68/.9899 | 41.19/.9915 | 40.73/.9913 | 41.45/.9923 | 43.31/.9952 | |
0 | 37.69/.9785 | 41.05/.9875 | 43.42/.9932 | 43.81/.9938 | 44.20/.9942 | 44.45/.9946 | 44.23/.9941 | 44.08/.9942 | 44.84/.9948 | 44.52/.9945 | 44.94/.9949 | 45.08/.9949 | |
1 | 36.66/.9783 | 39.25/.9869 | 39.85/.9903 | 40.31/.9904 | 38.42/.9901 | 39.93/.9903 | 40.18/.9915 | 39.85/.9905 | 40.88/.9921 | 39.99/.9916 | 40.50/.9922 | 42.75/.9942 | |
2 | 36.74/.9779 | 39.78/.9871 | 38.77/.9862 | 38.50/.9844 | 38.25/.9862 | 38.70/.9856 | 38.41/.9865 | 38.17/.9847 | 38.64/.9861 | 38.61/.9867 | 38.33/.9863 | 42.31/.9939 | |
3 | 36.76/.9777 | 40.66/.9876 | 39.31/.9852 | 38.48/.9834 | 39.10/.9855 | 39.10/.9852 | 39.45/.9858 | 38.37/.9832 | 39.00/.9851 | 39.19/.9853 | 38.90/.9849 | 42.97/.9939 | |
4 | 36.80/.9776 | 40.70/.9877 | 39.21/.9853 | 38.68/.9838 | 39.03/.9855 | 39.09/.9853 | 39.35/.9859 | 38.36/.9834 | 38.68/.9848 | 39.00/.9851 | 38.68/.9848 | 42.67/.9935 | |
INRIA [46] | -4 | 31.58/.9566 | 35.40/.9769 | 34.24/.9719 | 33.75/.9695 | 34.42/.9725 | 33.99/.9713 | 34.64/.9736 | 33.89/.9703 | 34.13/.9719 | 34.37/.9724 | 34.20/.9720 | 36.46/.9815 |
-3 | 31.55/.9566 | 35.39/.9768 | 34.22/.9717 | 33.71/.9695 | 34.43/.9726 | 34.04/.9715 | 34.62/.9736 | 33.95/.9703 | 34.12/.9715 | 34.39/.9726 | 34.10/.9710 | 36.67/.9826 | |
-2 | 31.55/.9567 | 35.22/.9763 | 34.04/.9715 | 33.59/.9695 | 34.08/.9718 | 33.87/.9709 | 34.31/.9726 | 33.91/.9707 | 34.13/.9721 | 34.11/.9716 | 34.67/.9749 | 36.67/.9829 | |
-1 | 31.49/.9573 | 35.26/.9767 | 34.88/.9767 | 34.59/.9760 | 34.92/.9770 | 34.56/.9757 | 35.51/.9790 | 34.69/.9766 | 35.42/.9790 | 35.26/.9783 | 35.55/.9799 | 36.79/.9837 | |
0 | 31.33/.9577 | 35.01/.9769 | 35.39/.9804 | 35.28/.9832 | 36.15/.9842 | 35.80/.9843 | 36.36/.9840 | 36.09/.9849 | 36.57/.9853 | 36.59/.9855 | 36.59/.9859 | 36.67/.9853 | |
1 | 31.53/.9573 | 35.04/.9762 | 34.82/.9765 | 34.83/.9768 | 34.56/.9772 | 34.73/.9772 | 35.44/.9793 | 34.93/.9773 | 35.30/.9782 | 35.21/.9784 | 35.25/.9795 | 36.80/.9840 | |
2 | 31.55/.9567 | 35.29/.9765 | 34.16/.9721 | 33.75/.9698 | 34.43/.9735 | 33.99/.9717 | 34.49/.9737 | 33.75/.9706 | 34.07/.9720 | 34.46/.9740 | 34.08/.9726 | 36.75/.9832 | |
3 | 31.56/.9565 | 35.41/.9768 | 34.10/.9710 | 33.65/.9689 | 34.39/.9725 | 33.94/.9711 | 34.54/.9732 | 33.61/.9687 | 34.02/.9712 | 34.37/.9725 | 34.04/.9715 | 36.75/.9829 | |
4 | 31.58/.9565 | 35.43/.9769 | 34.18/.9715 | 33.80/.9696 | 34.40/.9724 | 34.01/.9713 | 34.63/.9736 | 33.72/.9695 | 34.03/.9713 | 34.36/.9723 | 34.02/.9715 | 36.59/.9821 | |
STFgantry [53] | -4 | 29.83/.9479 | 35.69/.9833 | 33.73/.9739 | 32.48/.9677 | 34.19/.9776 | 32.92/.9715 | 34.70/.9792 | 32.98/.9702 | 33.87/.9751 | 34.11/.9775 | 33.58/.9751 | 39.33/.9947 |
-3 | 29.80/.9479 | 35.79/.9832 | 33.78/.9740 | 32.59/.9688 | 34.44/.9781 | 33.12/.9723 | 34.78/.9794 | 33.25/.9714 | 33.92/.9750 | 34.34/.9778 | 33.89/.9755 | 39.68/.9950 | |
-2 | 29.82/.9484 | 35.65/.9831 | 33.83/.9769 | 32.59/.9716 | 33.70/.9789 | 32.56/.9734 | 34.26/.9808 | 33.39/.9754 | 34.31/.9793 | 33.84/.9792 | 34.05/.9821 | 39.43/.9950 | |
-1 | 29.72/.9490 | 35.44/.9830 | 35.56/.9860 | 34.37/.9837 | 35.89/.9881 | 34.09/.9831 | 36.46/.9890 | 34.89/.9860 | 36.53/.9895 | 36.34/.9895 | 36.65/.9903 | 39.65/.9952 | |
0 | 31.06/.9498 | 36.33/.9831 | 38.36/.9904 | 37.95/.9898 | 39.64/.9929 | 38.72/.9909 | 39.61/.9926 | 38.77/.9915 | 39.86/.9936 | 40.54/.9941 | 40.40/.9942 | 42.17/.9957 | |
1 | 29.72/.9490 | 34.87/.9830 | 34.97/.9862 | 34.67/.9846 | 34.64/.9890 | 34.10/.9851 | 35.60/.9902 | 34.96/.9862 | 35.78/.9893 | 35.66/.9906 | 35.15/.9901 | 38.81/.9949 | |
2 | 29.79/.9483 | 35.01/.9829 | 33.66/.9779 | 32.88/.9721 | 33.85/.9821 | 32.61/.9740 | 33.85/.9816 | 32.90/.9750 | 33.97/.9800 | 34.15/.9827 | 32.70/.9798 | 38.58/.9947 | |
3 | 29.77/.9477 | 35.20/.9831 | 33.45/.9731 | 32.50/.9676 | 33.96/.9779 | 32.90/.9715 | 34.18/.9787 | 32.41/.9683 | 33.53/.9743 | 33.94/.9777 | 33.02/.9741 | 38.53/.9949 | |
4 | 29.80/.9477 | 35.19/.9832 | 33.39/.9733 | 32.53/.9679 | 33.72/.9774 | 32.78/.9714 | 34.18/.9792 | 32.41/.9685 | 33.43/.9745 | 33.76/.9773 | 32.78/.9739 | 38.46/.9947 |
C LF Angular SR
It is worth noting that the proposed spatial-angular correlation learning mechanism has great potential for a variety of LF image processing tasks. In this section, we apply it to the LF angular SR task. We first introduce our EPIT-ASR model for LF angular SR. Then, we describe the datasets and implementation details of our experiments. Finally, we present preliminary but promising results in comparison with state-of-the-art LF angular SR methods.
C.1 Upsampling
Since our EPIT can flexibly handle LFs with different angular resolutions (as demonstrated in Sec. A.2), the EPIT-ASR model can be built by modifying only the upsampling stage of EPIT.
Here, we follow [60, 27] to take the 2×2→7×7 angular SR task as an example to introduce the angular upsampling module in our EPIT-ASR. Given the deep LF feature, a 2×2 convolution without padding is first applied to the angular dimensions to generate an angular-downsampled feature. Then, a 1×1 convolution is used to increase the channel dimension, followed by a 2D pixel-shuffling layer to generate the angular-upsampled feature. Finally, a 3×3 convolution is applied to the spatial dimensions of the angular-upsampled feature to generate the final output.
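For clarity, below is a minimal PyTorch sketch of this angular upsampling module for the 2×2→7×7 case. The tensor layout, the channel width, the module name, and the single-channel output are illustrative assumptions rather than the authors' exact implementation.

```python
# A minimal sketch of the angular upsampling module (2x2 -> 7x7), assuming a
# [B, C, U, V, H, W] feature layout; names and channel counts are hypothetical.
import torch
import torch.nn as nn


class AngularUpsample2to7(nn.Module):
    def __init__(self, channels=64, ang_out=7):
        super().__init__()
        self.ang_out = ang_out
        # 2x2 convolution (no padding) over the angular dimensions: 2x2 -> 1x1
        self.ang_down = nn.Conv2d(channels, channels, kernel_size=2, padding=0)
        # 1x1 convolution to expand channels for angular pixel shuffling
        self.expand = nn.Conv2d(channels, channels * ang_out * ang_out, kernel_size=1)
        # 2D pixel shuffle over the angular dimensions: 1x1 -> 7x7
        self.shuffle = nn.PixelShuffle(ang_out)
        # 3x3 convolution over the spatial dimensions to produce the output SAIs
        self.spa_conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, feat):
        # feat: [B, C, U, V, H, W] with U = V = 2 (angular), H x W spatial
        b, c, u, v, h, w = feat.shape
        # fold spatial dims into the batch so the convs act on the angular plane
        x = feat.permute(0, 4, 5, 1, 2, 3).reshape(b * h * w, c, u, v)
        x = self.ang_down(x)              # [B*H*W, C, 1, 1]
        x = self.shuffle(self.expand(x))  # [B*H*W, C, 7, 7]
        # unfold, then fold the angular dims into the batch for the spatial conv
        x = x.reshape(b, h, w, c, self.ang_out, self.ang_out)
        x = x.permute(0, 4, 5, 3, 1, 2).reshape(b * self.ang_out ** 2, c, h, w)
        out = self.spa_conv(x)            # [B*7*7, 1, H, W]
        return out.reshape(b, self.ang_out, self.ang_out, 1, h, w)


feat = torch.randn(1, 64, 2, 2, 32, 32)
print(AngularUpsample2to7()(feat).shape)  # torch.Size([1, 7, 7, 1, 32, 32])
```

In this sketch, the 2×2 angular convolution collapses the four input views into a single latent view, and the pixel shuffle then redistributes the expanded channels into the 7×7 angular grid before the spatial convolution reconstructs the output views.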
C.2 Datasets and Implementation Details
Following [27, 60], we conducted experiments on the HCInew [21] and HCIold [65] datasets. All LFs in these datasets have an angular resolution of 9×9. We cropped the central 7×7 SAIs with 64×64 spatial resolution as ground-truth high-angular-resolution LFs, and selected the corner 2×2 SAIs as inputs.
Our EPIT-ASR was initialized using the Xavier algorithm [17] and trained using the Adam method [31] with β1 = 0.9 and β2 = 0.999. The initial learning rate was set to 2×10⁻⁴ and halved after every 15 epochs. The training was stopped after 80 epochs. During the training phase, we performed random horizontal flipping, vertical flipping, and 90-degree rotation to augment the training data.
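As a rough reference, the sketch below shows a PyTorch training configuration consistent with the description above; the placeholder model, the batch handling, and the exact augmentation implementation are our assumptions. Note that LF augmentation must transform the angular and spatial dimensions jointly to keep the LF geometry consistent.

```python
# A hedged sketch of the training setup: Adam, learning rate halved every
# 15 epochs, and random flip/rotation augmentation applied jointly to the
# angular and spatial dimensions of [U, V, H, W] light fields.
import torch
import torch.nn as nn


def augment(lf_lr, lf_hr):
    if torch.rand(1) < 0.5:  # horizontal flip: reverse V (angular) and W (spatial)
        lf_lr, lf_hr = lf_lr.flip(dims=(1, 3)), lf_hr.flip(dims=(1, 3))
    if torch.rand(1) < 0.5:  # vertical flip: reverse U (angular) and H (spatial)
        lf_lr, lf_hr = lf_lr.flip(dims=(0, 2)), lf_hr.flip(dims=(0, 2))
    if torch.rand(1) < 0.5:  # 90-degree rotation of both the view grid and each SAI
        lf_lr = torch.rot90(torch.rot90(lf_lr, 1, dims=(2, 3)), 1, dims=(0, 1))
        lf_hr = torch.rot90(torch.rot90(lf_hr, 1, dims=(2, 3)), 1, dims=(0, 1))
    return lf_lr, lf_hr


model = nn.Conv2d(1, 1, 3, padding=1)  # placeholder standing in for EPIT-ASR
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.5)
```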
C.3 Qualitative Results
Figure IV shows the quantitative and qualitative results achieved by different LF angular SR methods. It can be observed that the error magnitude of our EPIT-ASR is smaller than that of the other methods, especially in delicate texture areas (e.g., the letters in scene Dishes). As shown in the zoom-in regions, our method generates more faithful details with fewer artifacts.