Zeke Zexi [email protected]
\addauthorHaodong [email protected]
\addauthorYuk Ying [email protected]
\addauthorXiaoming [email protected]
\addinstitution
The School of Computer Science,
University of Sydney,
Darlington, NSW, Australia
\addinstitution
The School of Computer and Artificial Intelligence,
Beijing Technology and Business University,
Beijing, China
Efficient Multi-disparity Transformer for LFSR
Efficient Multi-disparity Transformer for
Light Field Image Super-resolution
Abstract
This paper presents the Multi-scale Disparity Transformer (MDT), a novel Transformer tailored for light field image super-resolution (LFSR) that addresses the issues of computational redundancy and disparity entanglement caused by the indiscriminate processing of sub-aperture images inherent in conventional methods. MDT features a multi-branch structure, with each branch utilising independent disparity self-attention (DSA) to target specific disparity ranges, effectively reducing computational complexity and disentangling disparities. Building on this architecture, we present LF-MDTNet, an efficient LFSR network. Experimental results demonstrate that LF-MDTNet outperforms existing state-of-the-art methods by 0.37 dB and 0.41 dB PSNR at the $\times 2$ and $\times 4$ scales, achieving superior performance with fewer parameters and higher speed.
1 Introduction
Light Field (LF) imaging captures light from multiple directions in a single shot, enabling computer vision capabilities beyond those of traditional cameras. This technology excels in areas such as material recognition [Lu et al.(2019)Lu, Yeung, Qu, Chung, Chen, and Chen, Wang et al.(2016)Wang, Zhu, Hiroaki, Chandraker, Efros, and Ramamoorthi], depth estimation and more [Verinaz-Jadan et al.(2022)Verinaz-Jadan, Song, Howe, Foust, and Dragotti, Raghavendra et al.(2015)Raghavendra, Raja, and Busch]. Several LF capture devices have emerged [Wikipedia contributors(2020), Raytrix(), Debevec(2018), Sta()]. However, they face challenges in balancing angular and spatial resolution due to sensor limitations, often at the cost of spatial resolution.
Light field image super-resolution (LFSR) aims to improve this by enhancing spatial resolution while maintaining the underlying LF parallax structure via the utilisation of correlation information, which traditional single image super-resolution (SISR) does not address. With the advent of deep learning, particularly the application of convolutional neural networks (CNNs) and Transformers, there has been a substantial improvement in the quality of reconstructed images. Recent LFSR models, such as LFT [Liang et al.(2022)Liang, Wang, Wang, Yang, and Zhou], EPIT [Liang et al.(2023)Liang, Wang, Wang, Yang, Zhou, and Guo] and LF-DET [Cong et al.(2024)Cong, Sheng, Yang, Cui, and Chen], have implemented Transformers to establish dependencies within the spatial, angular, and epipolar plane image (EPI) subspace inherent in LF images. However, these models generally encounter a critical limitation: the indiscriminate processing of all sub-aperture images (SAIs) in self-attention mechanisms, which leads to two primary issues: computational redundancy and disparity entanglement.
Figure 1: Visual comparison of parallax information at different disparity ranges: (a) raw image, (b) large disparity range, (c) small disparity range.
The issue of computational redundancy arises because a substantial portion of the information in an LF image is redundant across SAIs. Processing all correlation information through Transformers often results in unnecessary computation and excessive model size, rendering the model impractical for real-world applications. On the other hand, disparity entanglement occurs when all SAIs are processed uniformly, which tends to overlook the wide disparity variations and the unique characteristics of the information represented by each disparity range. This issue becomes particularly problematic when the training data distribution is unbalanced, leading to some disparities dominating others, which can suppress vital clues representing the underlying correlations. Figure 1 exemplifies how the parallax manifests differently across disparity ranges. Notably, in Figure 1(b) and (c), the edges of the bowls in the blue box are more distinguishable at a large disparity range, while at a smaller disparity range, these features are absent or obscured. Similarly, the window patterns vary significantly between these two disparity ranges, revealing subtle nuances, as highlighted in the red box. These observations underscore the importance of proper disparity processing.
In response to these challenges, this paper introduces the Multi-scale Disparity Transformer (MDT), a novel Transformer architecture tailored for LF image processing that effectively manages disparity information across multiple scales. Specifically, the MDT employs a multi-branch structure to capture different disparity ranges explicitly. Within each branch, unlike in conventional Transformers, the key-query calculation operates only on a predesignated subset of SAIs, focusing on a specific disparity range. Meanwhile, the value matrix is preserved directly from the input to maintain the original information for image reconstruction. As a result, the MDT reduces computational redundancy and disentangles disparities through efficient and structured processing.
Building upon MDTs, we present LF-MDTNet, an efficient LFSR network. In our experiments, LF-MDTNet outperforms the previous state-of-the-art methods while requiring at most 67% of the parameters, 55% of the FLOPs and 37% of the inference time. Under less stringent computational constraints, the best LF-MDTNet model achieves an improvement of 0.41 dB PSNR over the best competitor, demonstrating its superior performance.
2 Related Works
Efficient and effective 4D LF data processing has posed a significant challenge due to its inherently large volume. To reduce this complexity, several approaches have emerged. Wang et al. [Wang et al.(2016)Wang, Zhu, Hiroaki, Chandraker, Efros, and Ramamoorthi] introduced an interleaved filter for LF material recognition that decomposes a 4D convolution into spatial and angular ones. The concept of decomposition was further evolved by Yeung et al. [Yeung et al.(2019)Yeung, Hou, Chen, Chen, Chen, and Chung] in their LFSR network using spatial-angular separable (SAS) convolutions to replace computationally intensive 4D convolutions. Subsequent advancements, such as DKNet by Hu et al. [Hu et al.(2022)Hu, Chen, Yeung, Chung, and Chen], DistgSSR by Wang et al. [Wang et al.(2022b)Wang, Wang, Wu, Yang, An, Yu, and Guo], and HLFSR by Duong et al. [Van Duong et al.(2023)Van Duong, Huu, Yim, and Jeon], have further refined and extended this decomposition approach across various LF subspaces.
More recently, Vision Transformers (ViTs) have extended the success of Transformers in natural language processing [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] to image processing [Liu et al.(2021)Liu, Lin, Cao, Hu, Wei, Zhang, Lin, and Guo, Liang et al.(2021)Liang, Cao, Sun, Zhang, Van Gool, and Timofte, Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.]. For LFSR, Wang et al. introduced DPT [Wang et al.(2022a)Wang, Zhou, Lu, and Di], utilising content and gradient Transformers to build long-range dependencies within the spatial subspace. Liang et al. proposed LFT [Liang et al.(2022)Liang, Wang, Wang, Yang, and Zhou], which employs spatial Transformers to perform self-attention on overlapping local windows, echoing techniques used in HAT [Chen et al.(2023)Chen, Wang, Zhou, Qiao, and Dong] for SISR. Liang et al\bmvaOneDot extended the Transformer to horizontal and vertical EPI subspaces in EPIT [Liang et al.(2023)Liang, Wang, Wang, Yang, Zhou, and Guo]. Cong et al\bmvaOneDot introduced several improvements over LFT [Liang et al.(2022)Liang, Wang, Wang, Yang, and Zhou], including constructing a hierarchy to aggregate information from spatial-angular Transformers, subsampling the key and value matrices in the spatial Transformers to reduce computational complexity, and partitioning macro-pixels in the angular Transformers to capture multi-scale disparity. However, these methods still process all SAIs in the self-attention mechanism, leading to computational redundancy and disparity entanglement issues.
Figure 2: (a) The overall architecture of LF-MDTNet. (b) The SAI subsets used by the DSA branches of the MDT.
3 Methodology
3.1 Network Architecture
Light field image super-resolution (LFSR) enhances the spatial resolution of a low-resolution (LR) light field (LF) image $\mathcal{I}_{LR} \in \mathbb{R}^{U \times V \times H \times W \times C}$ to produce a super-resolved (SR) LF image $\mathcal{I}_{SR} \in \mathbb{R}^{U \times V \times \alpha H \times \alpha W \times C}$, which aims to approximate the corresponding high-resolution (HR) LF image $\mathcal{I}_{HR}$. The process can be mathematically represented as follows:
$$\mathcal{I}_{SR} = f_{\theta}\left(\mathcal{I}_{LR}\right),$$
where $U \times V$ represents the angular dimensions and $H \times W$ the spatial dimensions of an LR image, with $C$ being the channel dimension. Correspondingly, $(u, v)$ indicates an angular location, and $(h, w)$ indicates a spatial location. The term $\alpha$ denotes the scaling factor, and $f_{\theta}$ denotes the super-resolution network with parameters $\theta$.
The proposed LF-MDTNet is illustrated in Figure 2(a). The network comprises three primary stages: shallow feature extraction, deep feature extraction and image reconstruction. The shallow feature extraction stage utilises four convolution layers operating on the spatial subspace of LF images to obtain low-level features. The subsequent deep feature extraction stage incorporates a sequence of correlation blocks crafted to derive comprehensive correlation information and establish a high-level spatial-angular representation. Their details will be elaborated on in the following subsection. The final image reconstruction stage aggregates the deep features via convolution layers and upscales the spatial resolution through the pixel shuffler [Shi et al.(2016)Shi, Caballero, Huszar, Totz, Aitken, Bishop, Rueckert, and Wang].
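To make the three-stage pipeline concrete, the following is a minimal PyTorch sketch of shallow feature extraction, stacked correlation blocks, and pixel-shuffle reconstruction. The module names, channel counts, activation choices and the placeholder residual blocks are illustrative assumptions rather than the released implementation.

```python
# A minimal sketch of the LF-MDTNet pipeline: shallow convolutions per SAI,
# stacked (placeholder) correlation blocks, and pixel-shuffle reconstruction.
import torch
import torch.nn as nn

class LFMDTNetSketch(nn.Module):
    def __init__(self, channels=64, n_blocks=16, scale=4, in_ch=1):
        super().__init__()
        # Shallow feature extraction: spatial convolutions applied to every SAI.
        self.shallow = nn.Sequential(
            nn.Conv2d(in_ch, channels, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Deep feature extraction: a chain of correlation blocks (MDT + angular
        # Transformer + spatial convolutions); a simple residual conv block stands in here.
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                           nn.LeakyReLU(0.1),
                           nn.Conv2d(channels, channels, 3, padding=1))
             for _ in range(n_blocks)]
        )
        # Image reconstruction: aggregate features (with the raw image connection)
        # and upscale the spatial resolution with a pixel shuffler.
        self.reconstruct = nn.Sequential(
            nn.Conv2d(channels + in_ch, channels, 3, padding=1),
            nn.Conv2d(channels, in_ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.scale = scale

    def forward(self, lr):                       # lr: (B, A, C, H, W), A = U*V SAIs
        b, a, c, h, w = lr.shape
        x = lr.reshape(b * a, c, h, w)           # fold SAIs into the batch for 2D convs
        feat = self.shallow(x)
        for blk in self.blocks:
            feat = feat + blk(feat)              # residual placeholder for correlation blocks
        feat = torch.cat([feat, x], dim=1)       # raw image connection
        sr = self.reconstruct(feat)
        # Standard skip connection bridging LR and HR via bicubic upsampling.
        sr = sr + nn.functional.interpolate(x, scale_factor=self.scale,
                                            mode='bicubic', align_corners=False)
        return sr.reshape(b, a, c, h * self.scale, w * self.scale)
```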
Correlation Blocks. The correlation block comprises two specialised Transformers: the Multi-scale Disparity Transformer, which operates in the spatial subspace, and the angular Transformer, which focuses on the angular subspace. Both components model long-range dependencies within the LF data, each targeting a subspace of distinct characteristics. They are followed by two spatial convolutions to capture locality, serving as a dynamic alternative to the commonly used static positional encoding.
Angular Transformer. The angular Transformer follows the methodologies of prior works [Liang et al.(2022)Liang, Wang, Wang, Yang, and Zhou, Cong et al.(2024)Cong, Sheng, Yang, Cui, and Chen] at large, utilising a vanilla Transformer to build long-range dependencies in the angular subspace with SAIs as tokens. To boost efficiency, the embedding dimension of the query and key is reduced to be notably smaller than that of the value. This modification not only accelerates the computation of the attention matrix but also consolidates the feature representation into a more compact embedding space.
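A hedged sketch of this idea follows: SAIs act as tokens, and the query/key projections map to a deliberately smaller embedding than the value projection. The dimension values and layer names are assumptions for illustration.

```python
# Angular self-attention with a reduced query/key embedding dimension.
import torch
import torch.nn as nn

class AngularAttentionSketch(nn.Module):
    def __init__(self, channels=64, qk_dim=16):
        super().__init__()
        self.to_q = nn.Linear(channels, qk_dim)    # compact query embedding
        self.to_k = nn.Linear(channels, qk_dim)    # compact key embedding
        self.to_v = nn.Linear(channels, channels)  # full-width value embedding
        self.proj = nn.Linear(channels, channels)
        self.scale = qk_dim ** -0.5

    def forward(self, x):
        # x: (B*H*W, A, C) -- each spatial position attends across its A SAIs.
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (.., A, A)
        return self.proj(attn @ v)
```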
Connection Enhancement. To further improve LF-MDTNet's performance, besides a standard skip connection bridging the LR input and the HR output, two types of connections are integrated to optimise data flow, as illustrated in Figure 2(a). Firstly, in alignment with the strategy from [Hu et al.(2022)Hu, Chen, Yeung, Chung, and Chen], a raw image connection concatenates the raw image tensor directly to the output of the last correlation block prior to the image reconstruction stage. This connection acts as a specialised form of dense connection [He et al.(2016)He, Zhang, Ren, and Sun], ensuring that raw image data directly contributes to feature aggregation and upsampling. Secondly, inspired by [Chen et al.(2024)Chen, Zhang, Gu, Kong, and Yang], each correlation block incorporates a learnable skip connection that dynamically scales the input by channel-wise coefficients and adds it to the output.
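The learnable skip connection can be sketched as a thin wrapper around a block, as below; the per-channel coefficient initialisation and the tensor layout are assumptions.

```python
# Learnable channel-wise skip connection around a correlation block.
import torch
import torch.nn as nn

class LearnableSkip(nn.Module):
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.block = block
        # One learnable coefficient per channel, initialised to an identity skip.
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):              # x: (B*A, C, H, W)
        return self.block(x) + self.gamma * x
```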
3.2 Multi-scale Disparity Transformer
Our proposed Multi-scale Disparity Transformer (MDT) features a multi-branch structure to target different disparity ranges. Given a 4D LF tensor $X \in \mathbb{R}^{A \times H \times W \times C}$ with $A = U \times V$ SAIs, the channels are firstly split evenly into $K$ branches, with each branch holding $C_b = C / K$ channels. Each branch tensor $X_k \in \mathbb{R}^{A \times H \times W \times C_b}$ undergoes disparity self-attention (DSA) to explicitly construct long-range dependencies for a specific disparity range. Diverging from conventional LF Transformers [Cong et al.(2024)Cong, Sheng, Yang, Cui, and Chen, Liang et al.(2022)Liang, Wang, Wang, Yang, and Zhou, Liang et al.(2023)Liang, Wang, Wang, Yang, Zhou, and Guo], the query and key matrices, denoted as $Q_k$ and $K_k$, are constructed on a selective SAI subset of $X_k$, denoted $\hat{X}_k$, holding $\hat{A}$ SAIs out of the complete set of $A$ SAIs. Meanwhile, conventional LF Transformers commonly merge the angular subspace with the batch dimension before self-attention, treating the spatial subspace as tokens and the channel dimension as embedding, whilst the MDT merges the angular subspace with the channel dimension, resulting in a tensor of shape $HW \times A C_b$, where the angular subspace and the channel dimension jointly form the embedding for self-attention.
Afterwards, each $\hat{X}_k$ undergoes a linear projection through $W_E$ to produce a compact disparity embedding $E_k$. This embedding is then used in subsequent linear projections $W_Q$ and $W_K$ to generate $Q_k$ and $K_k$, respectively. The value matrix, $V_k$, is directly derived from $X_k$, thus preserving the information across all SAIs and avoiding the computation to encode and decode it. The resulting process for obtaining these three matrices for self-attention is defined as follows:
$$E_k = \hat{X}_k W_E, \quad Q_k = E_k W_Q, \quad K_k = E_k W_K, \quad V_k = X_k.$$
The calculation of self-attention adheres to the conventional Transformers as:
$$Y_k = \mathrm{softmax}\!\left(\frac{Q_k K_k^{\top}}{\sqrt{d}}\right) V_k,$$
where $d$ is the embedding dimension of $Q_k$ and $K_k$. Finally, MDT compiles the self-attention of multiple disparities by concatenating the branches' outputs:
$$Y = \left[\, Y_1 \,\|\, Y_2 \,\|\, \cdots \,\|\, Y_K \,\right],$$
where $\|$ signifies the concatenation operation along the channel dimension, and $Y$ is the output of the MDT.
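The PyTorch sketch below illustrates the DSA branch and the multi-branch MDT as formulated above: queries and keys are computed from a pre-designated SAI subset folded into the embedding dimension, the value matrix is the unprojected input over all SAIs, and branch outputs are concatenated along the channel dimension. The tensor layout, the embedding size $d$ and the module names are illustrative assumptions.

```python
# Disparity self-attention (DSA) branch and multi-branch MDT, as a sketch.
import torch
import torch.nn as nn

class DSABranch(nn.Module):
    def __init__(self, sai_idx, c_branch, d=32):
        super().__init__()
        self.sai_idx = list(sai_idx)                          # SAIs used for queries/keys
        self.embed = nn.Linear(len(sai_idx) * c_branch, d)    # compact disparity embedding
        self.to_q = nn.Linear(d, d)
        self.to_k = nn.Linear(d, d)
        self.scale = d ** -0.5

    def forward(self, x):                                     # x: (B, A, C_b, H, W)
        b, a, c, h, w = x.shape
        # Queries/keys: only the selected SAIs, with the angular axis folded into the embedding.
        x_sub = x[:, self.sai_idx]                                   # (B, A_hat, C_b, H, W)
        x_sub = x_sub.permute(0, 3, 4, 1, 2).reshape(b, h * w, -1)   # (B, HW, A_hat*C_b)
        e = self.embed(x_sub)                                        # (B, HW, d)
        q, k = self.to_q(e), self.to_k(e)
        # Values: the full, unprojected input over all SAIs.
        v = x.permute(0, 3, 4, 1, 2).reshape(b, h * w, a * c)        # (B, HW, A*C_b)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v                                               # (B, HW, A*C_b)
        return out.reshape(b, h, w, a, c).permute(0, 3, 4, 1, 2)     # back to (B, A, C_b, H, W)

class MDTSketch(nn.Module):
    def __init__(self, channels, sai_subsets, d=32):
        super().__init__()
        # Assumes `channels` is divisible by the number of branches.
        self.c_branch = channels // len(sai_subsets)
        self.branches = nn.ModuleList(
            [DSABranch(idx, self.c_branch, d) for idx in sai_subsets])

    def forward(self, x):                                     # x: (B, A, C, H, W)
        chunks = torch.split(x, self.c_branch, dim=2)         # even channel split per branch
        return torch.cat([br(xk) for br, xk in zip(self.branches, chunks)], dim=2)
```

For a $5 \times 5$ LF, `sai_subsets` could, for instance, hold the four corner indices and four near-centre indices, mirroring the branches in Figure 2(b).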
Table 1: Computational complexity comparison (symbolic and numeric) between the MDT and a conventional spatial Transformer, broken down into matrix projection, query-key dot-product, self-attention output, and feed-forward network.
3.2.1 Advantages
The MDT structure confers three principal advantages:
Disentanglement of Disparities. MDT systematically disentangles the process of disparity modelling at multiple scales, with each scale explicitly addressed within its distinct branch. This separation creates a clear information structure, significantly reducing the risk of confounding that may arise when disparities are processed collectively. By ensuring each branch operates independently, MDT prevents any specific disparity, especially those underrepresented in training samples yet critical for high-fidelity reconstructions, from being overlooked and suppressed.
Flexible Design for Targeted Disparity Modelling. MDT provides flexibility in SAI selection to target specific disparity ranges strategically. For example, as shown in Figure 2(b), one branch targets larger disparities using SAIs from the LF image's corners, whereas the other utilises a closer subset of intermediate SAIs for modelling shorter disparities. The number of SAIs can vary across branches, and SAIs can be shared among branches. This adaptability enhances the network's ability to meet diverse requirements and scenarios.
Reduced Computational Complexity. The computational complexity of a Transformer can be broken down into four parts. In our analysis in Table 1, we compare the MDT with a conventional spatial Transformer [Cong et al.(2024)Cong, Sheng, Yang, Cui, and Chen]. Under the settings utilised in the following experiments, MDT significantly reduces computational complexity: it requires only a fraction of the cost for the matrix projection of queries, keys, and values, and for the query-key dot-product. Moreover, MDT's direct derivation of values from the input obviates the need for feed-forward network computation.
4 Experiments
4.1 Experimental Settings
Evaluation Configuration. We conduct experiments using the widely used BasicLFSR framework [Bas(2023)] implemented in PyTorch [Foundation()]. Five datasets are involved: EPFL [Rerábek and Ebrahimi(2016)], HCInew [Honauer et al.(2016)Honauer, Johannsen, Kondermann, and Goldluecke], HCIold [Wanner et al.(2013)Wanner, Meister, and Goldluecke], INRIA [Le Pendu et al.(2018)Le Pendu, Jiang, and Guillemot] and STFgantry [Vaish and Adams(2008)]. These datasets are split into 70/20/10/35/9 training samples and 10/4/2/5/2 testing samples, respectively. In each sample, the central $5 \times 5$ SAIs are used. PSNR and SSIM are calculated on the Y channel of the YCbCr colour space as quantitative metrics for network performance evaluation.
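For reference, the metric computation can be sketched as below: PSNR is evaluated on the Y channel obtained with the standard BT.601 conversion and averaged over SAIs; the exact averaging protocol of the BasicLFSR framework may differ slightly.

```python
# PSNR on the Y channel of YCbCr, averaged over SAIs (sketch).
import numpy as np

def rgb_to_y(img):                      # img: (..., 3) RGB in [0, 1]
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return (65.481 * r + 128.553 * g + 24.966 * b + 16.0) / 255.0   # BT.601 luma

def psnr_y(sr, hr):                     # sr, hr: (A, H, W, 3) arrays in [0, 1]
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    mse = np.mean((y_sr - y_hr) ** 2, axis=(-2, -1))    # per-SAI mean squared error
    return float(np.mean(10.0 * np.log10(1.0 / mse)))   # average PSNR over SAIs
```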
Training Settings. To train the network, the samples are bicubically down-sampled and segmented into LR patches for the $\times 2$ and $\times 4$ tasks. The network is optimised for 40 epochs using the Adam optimiser, followed by a fine-tuning phase of 10 epochs with a reduced learning rate. Code and model weights will be released publicly.
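A schematic of this two-phase schedule is given below. The 40-epoch main phase and the 10-epoch fine-tuning with a reduced learning rate follow the description above; the loss function, learning-rate values and reduction factor are placeholder assumptions.

```python
# Two-phase training schedule: main phase, then fine-tuning at a reduced learning rate.
import torch

def train_two_phase(model, loader, lr=2e-4, epochs=40, ft_epochs=10, ft_factor=0.1):
    criterion = torch.nn.L1Loss()                      # common LFSR loss choice (assumption)
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)

    def run(n_epochs):
        for _ in range(n_epochs):
            for lr_patch, hr_patch in loader:          # patches from bicubically down-sampled LR
                optimiser.zero_grad()
                loss = criterion(model(lr_patch), hr_patch)
                loss.backward()
                optimiser.step()

    run(epochs)                                        # 40-epoch main training phase
    for group in optimiser.param_groups:               # reduce the learning rate for fine-tuning
        group["lr"] *= ft_factor
    run(ft_epochs)                                     # 10-epoch fine-tuning phase
    return model
```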
Network Implementation. We empirically set the number of feature channels, shared across all Transformers and convolutions, together with the embedding dimensions of the MDTs and angular Transformers. In MDTs, as shown in Figure 2(b), there are two DSA branches: one contains the four corner SAIs (blue), and the other contains the four SAIs close to the centre (red). The number of correlation blocks is 16.
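The two SAI subsets can be expressed as index sets on the flattened angular grid, as sketched below; the $5 \times 5$ grid and the diagonal choice of the near-centre SAIs are assumptions for illustration.

```python
# Index construction for the two DSA branches on a flattened U x V angular grid.
def corner_sais(u=5, v=5):
    # Four corner SAIs of the angular grid.
    return [0, v - 1, (u - 1) * v, u * v - 1]

def near_centre_sais(u=5, v=5):
    cu, cv = u // 2, v // 2
    # Four diagonal neighbours of the central SAI (assumed layout).
    offsets = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    return [(cu + du) * v + (cv + dv) for du, dv in offsets]

sai_subsets = [corner_sais(), near_centre_sais()]   # two DSA branches
# e.g. mdt = MDTSketch(channels=64, sai_subsets=sai_subsets)
```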
4.2 Quantitative Comparison
Table 2: Quantitative comparison (PSNR/SSIM) with state-of-the-art methods at the $\times 2$ and $\times 4$ scales.
Method | EPFL | HCInew | HCIold | INRIA | STFgantry | Average
$\times 2$ scale: | | | | | |
LFSSR [Yeung et al.(2019)Yeung, Hou, Chen, Chen, Chen, and Chung] | 33.76/0.9729 | 36.84/0.9744 | 43.25/0.9931 | 35.50/0.9816 | 37.41/0.9885 | 37.35/0.9821
DistgSSR [Wang et al.(2022b)Wang, Wang, Wu, Yang, An, Yu, and Guo] | 34.81/0.9787 | 37.96/0.9796 | 44.94/0.9949 | 36.58/0.9859 | 40.40/0.9942 | 38.94/0.9867 |
LFT [Liang et al.(2022)Liang, Wang, Wang, Yang, and Zhou] | 34.78/0.9776 | 37.77/0.9788 | 44.63/0.9947 | 36.54/0.9853 | 40.41/0.9941 | 38.82/0.9861 |
HLFSR [Van Duong et al.(2023)Van Duong, Huu, Yim, and Jeon] | 35.31/0.9800 | 38.32/0.9807 | 44.98/0.9950 | 37.06/0.9867 | 40.85/0.9947 | 39.30/0.9874 |
EPIT [Liang et al.(2023)Liang, Wang, Wang, Yang, Zhou, and Guo] | 34.85/0.9775 | 38.23/0.9810 | 45.08/0.9949 | 36.68/0.9852 | 42.17/0.9957 | 39.40/0.9869 |
LF-DET [Cong et al.(2024)Cong, Sheng, Yang, Cui, and Chen] | 35.20/0.9794 | 38.22/0.9803 | 44.92/0.9949 | 36.88/0.9862 | 41.56/0.9953 | 39.36/0.9872 |
LF-MDTNet | 35.60/0.9812 | 38.43/0.9811 | 45.22/0.9952 | 37.24/0.9870 | 41.71/0.9955 | 39.67/0.9880 |
$\times 4$ scale: | | | | | |
LFSSR [Yeung et al.(2019)Yeung, Hou, Chen, Chen, Chen, and Chung] | 28.66/0.9096 | 30.94/0.9128 | 36.89/0.9686 | 30.70/0.9442 | 30.43/0.9381 | 31.52/0.9347
DistgSSR [Wang et al.(2022b)Wang, Wang, Wu, Yang, An, Yu, and Guo] | 28.99/0.9195 | 31.38/0.9217 | 37.56/0.9732 | 30.99/0.9519 | 31.65/0.9534 | 32.12/0.9439
LFT [Liang et al.(2022)Liang, Wang, Wang, Yang, and Zhou] | 29.33/0.9196 | 31.36/0.9205 | 37.59/0.9731 | 31.30/0.9515 | 31.62/0.9530 | 32.24/0.9436 |
HLFSR [Van Duong et al.(2023)Van Duong, Huu, Yim, and Jeon] | 29.20/0.9222 | 31.57/0.9238 | 37.78/0.9742 | 31.24/0.9534 | 31.64/0.9537 | 32.28/0.9455 |
EPIT [Liang et al.(2023)Liang, Wang, Wang, Yang, Zhou, and Guo] | 29.31/0.9196 | 31.51/0.9231 | 37.68/0.9737 | 31.35/0.9526 | 32.18/0.9570 | 32.41/0.9452 |
LF-DET [Cong et al.(2024)Cong, Sheng, Yang, Cui, and Chen] | 29.42/0.9220 | 31.51/0.9227 | 37.76/0.9739 | 31.34/0.9528 | 32.02/0.9561 | 32.41/0.9455 |
LF-MDTNet | 29.82/0.9268 | 31.78/0.9261 | 38.06/0.9754 | 31.75/0.9558 | 32.69/0.9608 | 32.82/0.9490 |
A quantitative comparison is conducted to assess the overall performance of LF-MDTNet against several state-of-the-art methods at the $\times 2$ and $\times 4$ scales. The compared methods include the convolution-based LFSSR [Yeung et al.(2019)Yeung, Hou, Chen, Chen, Chen, and Chung], DistgSSR [Wang et al.(2022b)Wang, Wang, Wu, Yang, An, Yu, and Guo] and HLFSR [Van Duong et al.(2023)Van Duong, Huu, Yim, and Jeon], and the Transformer-based LFT [Liang et al.(2022)Liang, Wang, Wang, Yang, and Zhou], EPIT [Liang et al.(2023)Liang, Wang, Wang, Yang, Zhou, and Guo] and LF-DET [Cong et al.(2024)Cong, Sheng, Yang, Cui, and Chen]. The results are shown in Table 2.
LF-MDTNet is the leading method across both scales and nearly all datasets, with the only exception being the STFgantry dataset at the $\times 2$ scale, where it ranks second. Notably, at the $\times 2$ scale, LF-MDTNet surpasses the second-best method by over 0.3 dB PSNR for EPFL and INRIA. At the $\times 4$ scale, LF-MDTNet's lead expands further, achieving more than 0.4 dB higher PSNR for EPFL and INRIA, and more than 0.5 dB for STFgantry. These results highlight the superiority of LF-MDTNet's performance in LFSR.
Figure 3: Qualitative comparison with zoom-in views and the corresponding EPIs. Columns: Ground-truth, HLFSR, EPIT, LF-DET, LF-MDTNet (Ours). PSNR/SSIM per scene for HLFSR, EPIT, LF-DET, and LF-MDTNet (Ours):
(a) Perforated_Metal_3 | 28.12/0.9075 | 27.76/0.8966 | 28.33/0.9077 | 28.98/0.9224
(b) Hublais | 30.11/0.9463 | 30.62/0.9456 | 30.63/0.9460 | 31.20/0.9491
(c) Lego Knights | 34.83/0.9771 | 35.71/0.9813 | 35.44/0.9799 | 36.41/0.9837
Figure 4: Visualisation of the DSA branch feature maps of the last MDT for (a) Perforated_Metal_3, (b) Hublais, and (c) Lego Knights.
4.3 Performance Analysis
Qualitative Comparison. We showcase LF-MDTNet's superior performance through qualitative evaluations in Figure 3, with two zoom-in views within blue and red boxes and the corresponding EPIs highlighting the LF parallax structure. The samples include (a) Perforated_Metal_3 and (b) Hublais, captured with the Lytro Illum camera [Wikipedia contributors(2020)], and (c) Lego Knights, captured using a multi-camera array [Sta()] and featuring larger disparity ranges. In general, LF-MDTNet surpasses the second-best model on these samples by a significant 0.5-0.7 dB PSNR. In (a) Perforated_Metal_3, LF-MDTNet accurately reconstructs the dense perforated holes and complex occlusions with sharp edges and clear light spots, outperforming other methods that produce vague and merged effects. In (b) Hublais, it clearly reconstructs the windows and bars, where others show artefacts. In (c) Lego Knights, it achieves sharper and more distinguishable edges on the studs and the knight's shield. The improvements are evident not only in the zoom-in views but also in the corresponding EPIs, indicating that LF-MDTNet reconstructs richer details and enhances the LF parallax structure.
DSA Feature Visualisation. To better understand the factors behind this success, we further analyse the feature maps from the DSA branches of the last MDT, depicted in Figure 4. It is apparent that these branches focus on distinct aspects of LF images, reflecting that the network is disparity-aware and has effectively achieved disparity disentanglement. Specifically, the branch operating on the distant corner SAIs (blue, in the first row) primarily captures object edges at larger disparity ranges where occlusion is most prevalent. In contrast, the branch targeting the closer SAIs (red) specialises in detailed textures and patterns. Notably, the former exhibits strong activation on critical features such as the hole edges in (a) Perforated_Metal_3 and the studs on Legos in (c) Lego Knights, whereas the latter shows a weaker reaction in these areas. This discrepancy explains LF-MDTNet's notable improvement in these regions shown in Figure 3. Meanwhile, there is also a noticeable overlap between the two branches, including the areas within the perforated holes in Perforated_Metal_3, the windows in Hublais, and the knights' helmets in Lego Knights. These phenomena suggest that while the branches focus on different disparities, they also share common targets, creating a complementary and cooperative synergy that effectively addresses disparity entanglement and enhances overall performance.
Analysis of SAI Subsets. We conducted an ablation study to evaluate how different combinations of SAI subsets in the DSA branches affect LF-MDTNet’s performance. The results, detailed in Table 3, highlight the critical role of strategic SAI subset selection.
Table 3: Ablation study on SAI subset combinations in the DSA branches.
Model | (a) | (b) | (c) | (d) | (e) | (f) | (g) | (h) | (i)
SAI subsets | (diagrams of the selected SAIs for each configuration, described in the text below)
PSNR/SSIM | 32.59/0.9468 | 32.50/0.9453 | 32.44/0.9449 | 32.41/0.9449 | 32.31/0.9439 | 32.30/0.9435 | 32.10/0.9428 | 32.58/0.9469 | 32.50/0.9451 |
#Parameters | 981,120 | 923,520 | 923,520 | 925,824 | 868,224 | 868,224 | 840,576 | 1,038,720 | 1,504,128 |
Inference Time | 1.80 | 1.76 | 1.76 | 1.80 | 1.73 | 1.74 | 1.73 | 1.84 | 1.77 |
FLOPs (G) | 28.20 | 27.84 | 27.84 | 28.15 | 27.79 | 27.79 | 27.76 | 28.57 | 28.44 |
Figure 5: Average PSNR versus the number of parameters, inference time, and FLOPs, comparing LF-MDTNet variants with state-of-the-art methods.
Column (a) shows the baseline model. Columns (b) and (c) isolate the SAI subsets used in the baseline model, revealing that each subset individually underperforms the combined approach by 0.09 dB and 0.15 dB, respectively. This indicates the benefit of combining different SAI subsets for multi-scale disparity learning. Columns (d), (e), and (f) experiment with similar SAI combinations as Columns (a), (b), and (c) but reduce the number of SAIs by half, modelling disparities with a single diagonal line instead of two. This adjustment results in a performance decline of 0.18 dB for the dual-branch model and 0.30 dB and 0.29 dB for the models with separate branches. Column (g) explores a minimal model with just one central SAI, severely limiting the angular information in MDT's query-key calculations. This leads to a marked decline in performance of 0.49 dB, underscoring the necessity of using sufficient SAIs for comprehensive disparity capture. However, merely increasing the number of SAIs does not always correlate with enhanced performance. Column (h) adds four central SAIs from the edges as an additional DSA branch to the baseline model, resulting in extra parameters and computation but achieving only similar performance, suggesting a saturation in learning multi-scale disparities. Conversely, Column (i) aggregates all SAIs into a single DSA branch, offering access to more SAIs but resulting in a performance decline of 0.09 dB, highlighting how disparity entanglement hampers model performance.
4.4 Model Efficiency
We assessed the efficiency of LF-MDTNet by comparing it with state-of-the-art methods, varying the number of correlation blocks from 4 to 16 in increments of 2 to encompass a broad spectrum of model complexity for a balanced comparison. The average PSNR is used as the performance metric, depicted in Figure 5, against the parameter number as the memory efficiency metric, and inference time and FLOPs as the computation metrics. The inference time is measured as the elapsed time for inferring a testing sample on our hardware setup, which includes an Intel i7-11700 CPU, 32 GB of RAM, and an Nvidia RTX 3090 GPU.
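The timing protocol can be sketched as follows; warm-up iterations and CUDA synchronisation ensure the measurement reflects completed GPU work. The test-sample shape in the example is an assumption, not the benchmark's exact resolution.

```python
# Per-sample GPU inference timing with warm-up and CUDA synchronisation.
import time
import torch

@torch.no_grad()
def time_inference(model, sample, warmup=5, runs=20):
    model.eval()
    for _ in range(warmup):                    # warm-up (kernel compilation, caches)
        model(sample)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(sample)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs   # average seconds per sample

# Example: a single 5x5-SAI test sample (shape is assumed).
# t = time_inference(model.cuda(), torch.randn(1, 25, 1, 32, 32, device="cuda"))
```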
Remarkably, with a reduced number of correlation blocks, LF-MDTNet outperforms all competitors with a PSNR of 32.59 dB while being smaller and faster. Notably, compared to the top performers LF-DET and EPIT, this variant requires only 58% and 67% of their parameters, 37% and 71% of their inference time, and 55% and 40% of their FLOPs, respectively, while achieving a 0.18 dB higher PSNR. Under a similar resource budget, a larger variant, with a parameter count similar to LF-DET and EPIT and using only 88% and 64% of their FLOPs and 54% and 104% of their inference time, surpasses their PSNR by 0.25 dB. These results demonstrate LF-MDTNet's ability to strike a superior balance between performance and efficiency, making it a more viable model for real-world applications. Detailed results are available in the supplementary material.
5 Conclusion
In this paper, we presented LF-MDTNet, a novel LFSR network that leverages the Multi-scale Disparity Transformer (MDT) to overcome the challenges of computational redundancy and disparity entanglement in LF image processing. The experimental results demonstrated that LF-MDTNet outperformed the state-of-the-art methods significantly across scales and datasets while requiring fewer computational resources in terms of model size and computational complexity. The qualitative results showed that LF images reconstructed by LF-MDTNet were sharper and clearer while the parallax structure was enhanced. Our analysis further revealed how the disparity self-attention branches in MDT capture information of distinct disparity ranges and characteristics, and how the selection of SAI subsets influences the performance. We believe this work establishes a new paradigm in LF image processing and will inspire future research in related areas.
References
- [Sta()] The (new) stanford light field archive. URL http://lightfield.stanford.edu/index.html. Accessed: 2024-05-02.
- [Bas(2023)] Basiclfsr: Open source light field toolbox for super-resolution. https://github.com/ZhengyuLiang24/BasicLFSR, 2023. Accessed: 2023-06-10.
- [Chen et al.(2023)Chen, Wang, Zhou, Qiao, and Dong] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22367–22377, 2023.
- [Chen et al.(2024)Chen, Zhang, Gu, Kong, and Yang] Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, and Xiaokang Yang. Recursive Generalization Transformer for Image Super-Resolution, February 2024.
- [Cong et al.(2024)Cong, Sheng, Yang, Cui, and Chen] Ruixuan Cong, Hao Sheng, Da Yang, Zhenglong Cui, and Rongshan Chen. Exploiting Spatial and Angular Correlations With Deep Efficient Transformers for Light Field Image Super-Resolution. IEEE Transactions on Multimedia, 26:1421–1435, 2024. ISSN 1520-9210, 1941-0077. 10.1109/TMM.2023.3282465.
- [Debevec(2018)] Paul Debevec. Experimenting with light fields. https://blog.google/products/google-ar-vr/experimenting-light-fields/, 2018. Accessed: 2024-05-02.
- [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [Foundation()] The Linux Foundation. Pytorch. https://pytorch.org/. Accessed: 2024-05-02.
- [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. ISBN 978-1-4673-8851-1.
- [Honauer et al.(2016)Honauer, Johannsen, Kondermann, and Goldluecke] Katrin Honauer, Ole Johannsen, Daniel Kondermann, and Bastian Goldluecke. A dataset and evaluation methodology for depth estimation on 4d light fields. In Asian Conference on Computer Vision, pages 19–34. Springer, 2016.
- [Hu et al.(2022)Hu, Chen, Yeung, Chung, and Chen] Zexi Hu, Xiaoming Chen, Henry Wing Fung Yeung, Yuk Ying Chung, and Zhibo Chen. Texture-Enhanced Light Field Super-Resolution With Spatio-Angular Decomposition Kernels. IEEE Transactions on Instrumentation and Measurement, 71:1–16, 2022.
- [Le Pendu et al.(2018)Le Pendu, Jiang, and Guillemot] Mikael Le Pendu, Xiaoran Jiang, and Christine Guillemot. Light field inpainting propagation via low rank matrix completion. IEEE Transactions on Image Processing, 27(4):1981–1993, 2018.
- [Liang et al.(2021)Liang, Cao, Sun, Zhang, Van Gool, and Timofte] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021.
- [Liang et al.(2022)Liang, Wang, Wang, Yang, and Zhou] Zhengyu Liang, Yingqian Wang, Longguang Wang, Jungang Yang, and Shilin Zhou. Light field image super-resolution with transformers. IEEE Signal Processing Letters, 29:563–567, 2022.
- [Liang et al.(2023)Liang, Wang, Wang, Yang, Zhou, and Guo] Zhengyu Liang, Yingqian Wang, Longguang Wang, Jungang Yang, Shilin Zhou, and Yulan Guo. Learning non-local spatial-angular correlation for light field image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12376–12386, 2023.
- [Liu et al.(2021)Liu, Lin, Cao, Hu, Wei, Zhang, Lin, and Guo] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
- [Lu et al.(2019)Lu, Yeung, Qu, Chung, Chen, and Chen] Zhicheng Lu, Henry W. F. Yeung, Qiang Qu, Yuk Ying Chung, Xiaoming Chen, and Zhibo Chen. Improved image classification with 4D light-field and interleaved convolutional neural network. Multimedia Tools and Applications, 78(20):29211–29227, October 2019. ISSN 1573-7721.
- [Raghavendra et al.(2015)Raghavendra, Raja, and Busch] Ramachandra Raghavendra, Kiran B. Raja, and Christoph Busch. Presentation attack detection for face recognition using light field camera. IEEE Transactions on Image Processing, 24(3):1060–1075, 2015.
- [Raytrix()] Raytrix. 3d light field camera technology. https://raytrix.de/. Accessed: 2024-05-02.
- [Rerábek and Ebrahimi(2016)] Martin Rerábek and Touradj Ebrahimi. New Light Field Image Dataset. 8th International Conference on Quality of Multimedia Experience (QoMEX), pages 1–2, 2016.
- [Shi et al.(2016)Shi, Caballero, Huszar, Totz, Aitken, Bishop, Rueckert, and Wang] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1874–1883. IEEE, June 2016. ISBN 978-1-4673-8851-1.
- [Vaish and Adams(2008)] Vaibhav Vaish and Andrew Adams. The (new) stanford light field archive. Computer Graphics Laboratory, Stanford University, 6(7):3, 2008.
- [Van Duong et al.(2023)Van Duong, Huu, Yim, and Jeon] Vinh Van Duong, Thuc Nguyen Huu, Jonghoon Yim, and Byeungwoo Jeon. Light Field Image Super-Resolution Network via Joint Spatial-Angular and Epipolar Information. IEEE Transactions on Computational Imaging, pages 1–16, 2023. ISSN 2333-9403, 2334-0118, 2573-0436. 10.1109/TCI.2023.3261501.
- [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [Verinaz-Jadan et al.(2022)Verinaz-Jadan, Song, Howe, Foust, and Dragotti] Herman Verinaz-Jadan, Pingfan Song, Carmel L Howe, Amanda J Foust, and Pier Luigi Dragotti. Shift-invariant-subspace discretization and volume reconstruction for light field microscopy. IEEE Transactions on Computational Imaging, 8:286–301, 2022.
- [Wang et al.(2022a)Wang, Zhou, Lu, and Di] Shunzhou Wang, Tianfei Zhou, Yao Lu, and Huijun Di. Detail preserving transformer for light field image super-resolution. In Proc. AAAI Conf. Artif. Intell., 2022a.
- [Wang et al.(2016)Wang, Zhu, Hiroaki, Chandraker, Efros, and Ramamoorthi] Ting-Chun Wang, Jun-Yan Zhu, Ebi Hiroaki, Manmohan Chandraker, Alexei A Efros, and Ravi Ramamoorthi. A 4D light-field dataset and CNN architectures for material recognition. In European Conference on Computer Vision, pages 121–138. Springer, 2016.
- [Wang et al.(2022b)Wang, Wang, Wu, Yang, An, Yu, and Guo] Yingqian Wang, Longguang Wang, Gaochang Wu, Jungang Yang, Wei An, Jingyi Yu, and Yulan Guo. Disentangling light fields for super-resolution and disparity estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022b.
- [Wanner et al.(2013)Wanner, Meister, and Goldluecke] Sven Wanner, Stephan Meister, and Bastian Goldluecke. Datasets and benchmarks for densely sampled 4D light fields. In Vision, Modelling and Visualization (VMV), volume 13, pages 225–226, 2013.
- [Wikipedia contributors(2020)] Wikipedia contributors. Lytro — Wikipedia, the free encyclopedia. https://w.wiki/7G9s, 2020. Accessed: 2024-05-02.
- [Yeung et al.(2019)Yeung, Hou, Chen, Chen, Chen, and Chung] Henry Wing Fung Yeung, Junhui Hou, Xiaoming Chen, Jie Chen, Zhibo Chen, and Yuk Ying Chung. Light Field Spatial Super-Resolution Using Deep Efficient Spatial-Angular Separable Convolution. IEEE Transactions on Image Processing, 28(5):2319–2330, 2019. ISSN 10577149.