LTCF-Net: A Transformer-Enhanced Dual-Channel Fourier Framework for Low-Light Image Restoration

Gaojing Zhang¹
¹University of Sussex
[email protected] Jinglun Feng²
²City College of New York
[email protected] Corresponding author.

Abstract

We introduce LTCF-Net, a novel network architecture designed for enhancing low-light images. Unlike Retinex-based methods, our approach utilizes two color spaces—LAB and YUV—to efficiently separate and process color information, by leveraging the separation of luminance from chromatic components in color images. In addition, our model incorporates the Transformer architecture to comprehensively understand image content while maintaining computational efficiency. To dynamically balance the brightness in output images, we also introduce a Fourier transform module that adjusts the luminance channel in the frequency domain. This mechanism could uniformly balance brightness across different regions while eliminating background noises, and thereby enhancing visual quality. By combining these innovative components, LTCF-Net effectively improves low-light image quality while keeping the model lightweight. Experimental results demonstrate that our method outperforms current state-of-the-art approaches across multiple evaluation metrics and datasets, achieving more natural color restoration and a balanced brightness distribution.

1 Introduction

Low-Light Image Enhancement (LLIE) is a critical and complex task within the domain of computer vision. In environments with inadequate lighting, camera images often exhibit severe noise, diminished contrast, and obscured details. These degraded images not only compromise human visual perception but also challenge the downstream visual tasks such as object detection at night. The primary objective of LLIE is to enhance the visibility and contrast of these images, reveal details obscured in shadows, and mitigate distortions that typically occur during the enhancement process, including noise, unwanted artifacts, and inaccurate color reproduction.

Traditional methods like histogram equalization[11, 1, 10, 33] and gamma correction[21] are among the most straightforward techniques to enhance visibility in low-light images. However, while these methods effectively increase contrast, they often lead to oversaturation in bright areas but also result in a loss of detail [15], without addressing the underlying issues of inadequate illumination. In contrast, Retinex theory, which simulates human visual perception of color, has become a foundational principle in LLIE strategies [6, 43, 45, 37, 41, 47]. Although Retinex-based methods leverage illumination estimation and reflectance for enhancement, their underlying assumption of clean, distortion-free inputs rarely holds in real low-light conditions [43, 45]. This fundamental limitation often results in amplified noise and color artifacts, compromising their practical effectiveness.

Recent learning-based approaches have attempted direct mapping between low-light and normal-light conditions [16, 17, 52, 51]. While these methods show promise, they often sacrifice perceptual color accuracy and theoretical grounding for performance [28], complicated by their multi-stage training requirements. The introduction of Transformers [35], known for capturing global dependencies, initially seemed promising. However, the computational demands of standard Vision Transformers [44, 12] proved prohibitive [13], leading to hybrid CNN-Transformer architectures like SNR-Net [27] and LYT-Net [4]. While these models incorporate global Transformer layers at reduced resolutions, they have yet to fully capitalize on Transformers’ potential for low-light enhancement, indicating room for further advancement.

Refer to caption — Figure 1: Model Pipeline. Our main models are Multi-header Self-attention (MHSA) Block, Channel Denoising (CD) Block, Multi-stage Squeeze and Excited Fusion (MSEF) Block and Fourier Branch Processing (FBP) Block. The individual submodules can be seen in Fig.2.

To address these challenges, we present LTCF-Net, a novel approach to low-light image enhancement that leverages a dual-channel color space architecture. Our model uniquely processes illumination and color information in parallel streams, employing self-attention mechanisms [35] to dynamically adjust enhancement based on varying exposure levels across image regions. Unlike traditional Retinex-based methods [25, 6] that rely on complex degradation modeling, LTCF-Net enables direct end-to-end training. By prioritizing illumination processing—aligned with human visual perception, our approach achieves superior detail preservation and natural enhancement while maintaining color fidelity, effectively avoiding common artifacts like overexposure and color distortion.

Experimental results demonstrate that our method outperforms existing state-of-the-art techniques on multiple datasets, achieving significant performance improvements on datasets such as LOL-v1 [42], LOL-v2 [49], SID [8], and SDSD [39], validating the effectiveness and superiority of our approach. Our main contributions are:

•

We propose a novel dual-channel color space transformation method that effectively separates and independently processes illumination and color information, simplifying the complex decoupling task. The model utilizes LAB [34] and YUV [19] color spaces for targeted enhancement, implementing a multi-head self-attention mechanism [35] on the luminance and chrominance layers to dynamically enhance low-light recovery capabilities.
•

The Fourier adjustment module [30, 50] is introduced to convolve the real and imaginary parts of the data in the frequency domain to enhance the features in the frequency domain. It also includes an adaptive enhancement mechanism that balances the light distribution by dealing with regions of different scales or dynamic ranges
•

Through quantitative and qualitative experiments, our network demonstrates superior performance on LOL [49, 42] and other datasets compared to state-of-the-art methods.

2 Proposed Method

Fig. 1 and Fig. 2 illustrate the structure of the proposed LTCF-Net. As depicted in Fig. 1, the architecture is divided into two branches dedicated to illumination enhancement, each operating within distinct color spaces: LAB and YUV. Within each branch, as illustrated in Fig. 2(a) and Fig. 2(d), a luminance enhancement channel incorporates both a Multi-Head Self-Attention (MHSA) module and a Fourier Brightness Processing (FBP) module, which are critical in meticulously restoring luminance details. These modules are detailed in Sections 2.2 and 2.5, respectively.

Additionally, Fig. 2(c) illustrates that each color channel integrates a Channel Denoising (CD) Block, outlined in Section 3.3. This block enhances the noise reduction capabilities of the system. The denoised outputs are then combined with the luminance channel through a Multi-stage Squeeze and Excited Fusion (MSEF) module, elaborated in Section 3.4. Fig. 2(b) provides an in-depth view of the MSEF module, which is engineered to refine the modeling of both illumination and color features, thereby ensuring the fidelity and uniformity of the final enhanced images.

2.1 Color Spaces

The LAB and YUV color spaces offer significant advantages over the traditional RGB color space, particularly in terms of separating light information. This separation facilitates independent manipulation of the light and color channels, enhancing the process’s flexibility and effectiveness. The LAB color space, grounded in human visual perception, allows for precise adjustments as demonstrated by the standard LAB conversion formula presented in Eq.4.

\displaystyle\begin{array}[]{r}L^{*}=116\times f\left(\frac{Y}{Y_{n}}\right)-16\\ a^{*}=500\left[f\left(\frac{X}{X_{n}}\right)-f\left(\frac{Y}{Y_{n}}\right)\right]\\ b^{*}=200\left[f\left(\frac{Y}{Y_{n}}\right)-f\left(\frac{Z}{Z_{n}}\right)\right]\end{array}

(4)

Meanwhile, the YUV color space is particularly beneficial for low-light image enhancement due to its ability to independently adjust luminance (Y) — critical for enhancing visibility — without affecting the chrominance channels (U and V). The formulation of the YUV color space is specified in Eq. 5.

\begin{array}[]{r}Y=0.299R+0.587G+0.114B\\ U=-0.14713R-0.28886G+0.436B\\ V=0.615R-0.51499G-0.10001B\end{array}

(5)

2.2 Multi-header Self-attention Block

The MHSA Block, inspired by transformer architecture, begins by transforming the input feature $\mathbf{F}_{in}\in\mathbb{R}^{H\times W\times C}$ into a reshaped format $\mathbf{X}\in\mathbb{R}^{HW\times C}$ . This reshaped feature is then partitioned into multiple ‘heads’ as $\mathbf{X}=\left[\mathbf{X}_{1},\mathbf{X}_{2},\cdots,\mathbf{X}_{k}\right]$ .

Each segment, $\mathbf{X}_{i}\in\mathbb{R}^{HW\times d_{k}}$ , where $d_{k}=\frac{c}{k}$ and $i=1,2,…,k$ , undergoes processing by three bias-free fully connected layers ( $f_{c}$ ), which project $\mathbf{X}_{i}$ into the query $\mathbf{Q}_{i}$ , key $\mathbf{K}_{i}$ , and value $\mathbf{V}_{i}$ components, which can be defined by $\mathbf{Q}_{i}=\mathbf{X}_{i}\mathbf{W}_{\mathbf{Q}_{i}},\mathbf{K}_{i}=\mathbf{X}_{i}\mathbf{W}_{\mathbf{K}_{i}},\mathbf{V}_{i}=\mathbf{X}_{i}\mathbf{W}_{\mathbf{V}_{i}}$ .

The learnable parameters of the fully connected layers are represented by $\mathbf{W}_{\mathbf{Q}i}$ , $\mathbf{W}_{\mathbf{K}i}$ , and $\mathbf{W}_{\mathbf{V}_{i}}\in\mathbb{R}^{d_{k}\times d_{k}}$ . This structure allows the model to adaptively respond to varying lighting conditions across different image regions, enhancing areas that are typically more challenging due to darkness. Each attention head functions independently, employing the self-attention mechanism as defined below:

\text{Attention}(\mathbf{Q}_{i},\mathbf{K}_{i},\mathbf{V}_{i})=\text{Softmax}\left(\frac{\mathbf{Q}_{i}\mathbf{K}_{i}^{\mathrm{T}}}{\sqrt{d_{k}}}\right)\times\mathbf{V}_{i}

(6)

The outputs from all attention heads are concatenated and integrated using a positional encoder, then reshaped back to the original input dimensions, resulting in the final output feature $\mathbf{F}_{\mathrm{out}}\in\mathbb{R}^{H\times W\times C}$ .

2.3 Channel Denoising Block

The CD Block leverages a four-scale U-shaped network [32], incorporating the MHSA mechanism introduced above at the network bottleneck. This integration of convolutional and attention-based methodologies allows for robust feature extraction and effective denoising. The module consists of multiple convolutional layers characterized by two types of strides and includes skip connections to enhance detail retrieval and noise reduction.

This block processes color information from two distinct color spaces, represented by $R_{i}$ , where $i=4$ corresponds to the color space components $A$ , $B$ , $U$ , and $V$ . The process initiates with these color components being processed through a $3\times 3$ convolutional layer with a stride of one to extract preliminary features, denoted by $F^{(0)}=\mathrm{Conv}_{3\times 3}^{s=1}(R)$ . The signal then sequentially traverses three $3\times 3$ convolutional layers each with a stride of two, progressively capturing multi-scale features. For each convolution layer $k=1,2,3$ , the operation results in $F^{(k)}=\mathrm{Conv}_{3\times 3}^{s=2}(F^{(k-1)})$ , reducing the spatial dimensions of the feature map by half with each convolution, transitioning from $F^{(1)}\in\mathbb{R}^{\frac{H}{2}\times\frac{W}{2}\times C}$ to $F^{(3)}\in\mathbb{R}^{\frac{H}{8}\times\frac{W}{8}\times C}$ .

At the network bottleneck, minimum-scale feature mapping captures global dependencies effectively, utilizing the MHSA Block. Subsequent stages involve up-sampling, matched in scale to the prior down-sampling phases. The initial up-sampling step, employing a deconvolution with a stride of two, produces $G^{(1)}=\mathrm{Deconv}_{3\times 3}^{s=2}(F^{(3)})$ . This up-sampling process is repeated twice to restore the feature map to its original dimensions, $H\times W$ . The final output is then refined through two additional convolutional layers followed by a Tanh activation function to maintain color fidelity. The ultimate output, $\mathbf{F}_{\mathrm{out}}$ , restores the feature dimensions to $\mathbf{F}_{\mathrm{out}}\in\mathbb{R}_{i}^{H\times W\times C}$ .

2.4 Multi-stage Squeeze and Excited Fusion Block

The MSEF Block is engineered to integrate multiple feature enhancement mechanisms [20], specifically designed to enhance the model’s capability to accurately reconstruct and refine details in images by capturing both global and local features associated with illumination and color information. Initially, the input feature is subjected to Layer Normalization to stabilize its mean and variance. Subsequently, the normalized features are processed through the Squeeze-and-Excitation Block (SEBlock), which employs a two-stage recalibration of feature importance: squeezing to reduce redundancy and excitation to emphasize informative features.

Within the SEBlock, features $\mathbf{F}_{in}$ initially pass through Global Average Pooling (GAP) to form a global descriptor for each channel, capturing essential contextual information that is broadly representative of the entire image. This descriptor $\mathbf{D}_{\mathrm{re}}$ is then refined through a fully connected layer equipped with a ReLU activation function, as outlined in Eq. 7, enhancing significant features while filtering out less relevant data. The process continues with another fully connected layer featuring a Tanh activation function, which expands the features back to their original dimensions to achieve optimal recalibration:

\begin{array}[]{r}\mathbf{D}_{\mathrm{re}}=\operatorname{ReLU}\left(\mathbf{W}_{1}\cdot\operatorname{GP}\left(\operatorname{LN}\left(\mathbf{F}_{\text{in }}\right)\right)\right)\\ \mathbf{D}_{\mathrm{ex}}=\operatorname{Tanh}\left(\mathbf{W}_{2}\cdot\mathbf{D}_{\mathrm{re}}\right)\cdot\operatorname{LN}\left(\mathbf{F}_{\text{in }}\right)\end{array}

(7)

This process not only emphasizes and restores key features but also enhances the model’s ability to detect variations in color and illumination, producing images that appear more natural. This is particularly vital for low-light image enhancement, where maintaining accurate color and detail is crucial. Finally, residual connections also help preserve the original input features, ensuring stability during training and preventing the loss of important details.

2.5 Fourier Branch Processing Block

The Fourier Branch Processing Block is crucial for enhancing and restoring image brightness by manipulating the luminance channel in the frequency domain through the application of Fourier Transform techniques [3]. This approach allows for precise differentiation and manipulation of high-frequency and low-frequency components, thereby significantly improving the details in brightness across the image.

Initially, the luminance channel undergoes transformation into its frequency-domain representation, with the real and imaginary parts processed separately to refine frequency-domain features. A convolutional layer reduces the number of channels to 16, streamlining features while preserving essential details. This is followed by a Leaky ReLU activation function that captures finer details and mitigates the issue of dying ReLUs in areas of low brightness. Subsequently, features are further refined through an additional convolutional layer that maintains the channel count, ensuring that critical nuances are preserved. A final convolutional layer re-expands the features to match the original input dimensions. This restoration is crucial for aligning the enhanced features with the original image structure, facilitating seamless integration back into the overall image processing workflow. Detailed experimental validations of this process are presented in Section 3.4.

2.6 Combined Loss Function

In this section, we introduce our loss function framework, dividing six loss variables into two main categories: pixel-level and perceptual-level losses, each targeting different aspects of image quality enhancement.

Pixel-Level Losses. These losses measure discrepancies directly at the pixel level, thus preserving low-level image attributes such as brightness and color fidelity.

Specifically, $\mathcal{L}_{\mathrm{S1}}$ is a smooth L1 loss function, which is defined for two parameters: the predicted image ( $y_{\mathrm{pred}}$ ) and the ground truth image ( $y_{\mathrm{true}}$ ). This loss function has been proven effective in Fast R-CNN because it is robust to outliers and can provide stable gradients during the training process. Its formula is shown in Eq. 8

\displaystyle\mathcal{L}_{\text{S1}}(y_{\mathrm{true}},y_{\mathrm{pred}})

\displaystyle=\sum_{i\in\{x,y,w,h\}}\text{smooth}_{L_{1}}(y_{\mathrm{true}}^{i}-y_{\mathrm{pred}}^{i})

(8)

Where the function $smooth_{L_{1}(x)}$ is defined as:

\begin{array}[]{r}smooth_{L_{1}(x)}=\begin{cases}0.5x^{2},&\text{if }|x|<1\\ |x|-0.5,&\text{otherwise}\end{cases}\end{array}

(9)

Methods	LOL-v1		LOL-v2-real		SID		SMID		SDSD-indoor		SDSD-outdoor		Complexity
	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	FLOPS (G)	Params (M)
SID[8]	14.35	0.43	13.24	0.44	16.97	0.59	24.78	0.71	23.29	0.70	24.90	0.69	13.43	7.76
DeepUPE[38]	14.38	0.44	13.27	0.45	17.01	0.60	23.91	0.69	21.70	0.66	21.94	0.69	21.10	1.02
IPT[9]	16.27	0.50	19.80	0.83	20.53	0.56	27.03	0.78	26.11	0.83	27.55	0.85	6887	115.31
RetinexNet[42]	16.77	0.56	15.47	0.56	16.48	0.57	22.83	0.68	20.84	0.61	20.96	0.62	587.47	0.84
EnGAN[23]	17.48	0.65	18.23	0.61	17.23	0.54	22.62	0.67	20.02	0.60	20.10	0.61	61.01	114.35
FIDE [46]	18.27	0.66	16.85	0.67	18.34	0.57	24.42	0.69	22.41	0.65	22.20	0.62	28.51	8.62
DRBN [48]	20.13	0.83	20.29	0.83	19.02	0.57	26.60	0.78	24.08	0.86	25.77	0.84	48.61	5.27
KinD++ [51]	20.86	0.79	14.74	0.64	18.02	0.58	22.18	0.63	21.95	0.67	21.97	0.65	34.99	8.02
URetinex [45]	21.32	0.83	22.79	0.83	19.07	0.61	22.47	0.67	24.88	0.73	24.96	0.76	58.27	0.36
MIRNet [43]	24.14	0.83	20.02	0.82	20.84	0.60	25.66	0.76	24.38	0.86	27.13	0.83	785	31.76
SNR-Net [47]	24.61	0.84	21.48	0.84	22.87	0.62	28.49	0.80	29.44	0.89	28.66	0.86	26.35	4.01
Retinexformer [6]	25.16	0.84	22.80	0.84	24.44	0.68	29.15	0.81	29.77	0.89	29.84	0.87	15.57	1.61
DiffLL [22]	26.33	0.83	28.85	0.87	25.88	0.66	28.71	0.72	27.92	0.79	28.82	0.80	124.48	0.66
LYT-Net [4]	27.23	0.85	27.80	0.87	24.19	0.62	28.73	0.73	29.20	0.77	29.87	0.82	3.49	0.045
HVI-CIDNet [14]	27.71	0.87	28.13	0.89	22.90	0.67	28.63	0.79	29.54	0.86	27.54	0.79	7.57	1.88
LTCF-Net*	26.88	0.89	30.11	0.93	25.26	0.68	29.58	0.86	29.27	0.80	28.70	0.84	4.60	0.097
LTCF-Net	27.07	0.89	29.76	0.92	26.28	0.70	29.63	0.86	29.98	0.89	30.14	0.87	10.370	0.155

Table 1: Quantitative comparisons on LOL (v1 [42] and v2 [49]), SID [8], SMID [7], and SDSD [39] (indoor and outdoor) datasets. The best result is in red, the second best result is in yellow and the third best result is in in blue. Our LTCF-Net significantly outperforms SOTA algorithms. LTCF-Net* represents the proposed method without the Fourier Branch Processing Block.

$\mathcal{L}_{\mathrm{PSNR}}$ measures and optimizes image quality by assessing the mean squared error between ( $y_{\text{pred}}$ ) and ( $y_{\text{true}}$ ). It is designed to enhance pixel accuracy, as detailed in Eq. 10.

\displaystyle\mathcal{L}_{\text{PSNR}}=40.0-\sum_{i=1}^{n}20\cdot\log_{10}\left(\frac{1}{\sqrt{\text{MSE}}}\right)

(10)

The $\mathcal{L}_{\mathrm{Color}}$ ensures color fidelity by minimizing color discrepancies between the predicted and reference images, calculated as shown in Eq. 11.

\displaystyle\mathcal{L}_{\text{Color}}=\sum_{i=1}^{n}\left|\text{mean}(y_{\text{true}}^{i})-\text{mean}(y_{\text{pred}}^{i})\right|

(11)

The histogram loss function aligns the global statistical features such as brightness and contrast by comparing histograms, enhancing overall image consistency as illustrated in Eq. 12.

\displaystyle\mathcal{L}_{\text{Hist}}=\sum_{i=1}^{n}\left|\frac{1}{N}\sum_{n=1}^{N}H_{\text{true}}^{i,n}-\frac{1}{N}\sum_{n=1}^{N}H_{\text{pred}}^{i,n}\right|

(12)

Where $H_{\text{true}}^{i,n}$ and $H_{\text{pred}}^{i,n}$ are the histogram values for the ground truth and predicted images, respectively, at bin $n$ . This loss encourages the model to generate predictions whose color distributions match those of the ground truth more closely.

Perceptual-Level Losses. These losses, on the other hand, leverage high-level features to enhance the visual quality of generated images.

$\mathcal{L}_{\mathrm{per}}$ utilizes features extracted from a VGG19 network, this perceptual loss measures differences that affect the perceptual quality between the predicted and ground truth images. Its formula is shown in Eq. 13.

\begin{array}[]{r}\mathcal{L}_{\text{per}}=\frac{1}{C_{j}H_{j}W_{j}}\left\|\phi_{j}(y_{true})-\phi_{j}(y_{pred})\right\|_{2}^{2}\end{array}

(13)

Additionally, $\mathcal{L}_{\mathrm{SSIM}}$ is based on multi-scale structural similarity, this loss ensures structural consistency across different scales, thereby enhancing the perceptual similarity of the generated image. Its formula is shown in Eq. 14. Where $y$ is the original image and $\hat{y}$ is the target image.

\mathcal{L}_{\text{SSIM}}(y,\hat{y})=1-\frac{(2\mu_{y}\mu_{\hat{y}}+C_{1})(2\sigma_{y\hat{y}}+C_{2})}{(\mu_{y}^{2}+\mu_{\hat{y}}^{2}+C_{1})(\sigma_{y}^{2}+\sigma_{\hat{y}}^{2}+C_{2})}

(14)

As shown in Eq 15, the final loss function is a combined loss between pixel level losses and perceptual level losses, where $\alpha_{1}$ to $\alpha_{5}$ are hyperparameters for each loss component. This final loss function includes provides a comprehensive evaluation metric that optimizes various aspects of image quality, ultimately leading to a more realistic and high-quality output.

$\displaystyle\mathcal{L}_{Pixel}$	$\displaystyle=\mathcal{L}_{\mathrm{S1}}+\alpha_{1}\mathcal{L}_{\mathrm{PSNR}}+\alpha_{2}\mathcal{L}_{\mathrm{Color}}+\alpha_{3}\mathcal{L}_{\mathrm{Hist}}$	(15)
$\displaystyle\mathcal{L}_{Perc}$	$\displaystyle=\alpha_{4}\mathcal{L}_{\mathrm{Per}}+\alpha_{5}\mathcal{L}_{\mathrm{SSIM}}$
$\displaystyle\mathcal{L}_{Total}$	$\displaystyle=\mathcal{L}_{Pixel}+\mathcal{L}_{Perc}$

3 Experiment

3.1 Datasets and Implementation Details

We eveluate our method on LOL(v1 [42] and v2-real [49]), SID [8], SMID [7], SDSD [39], and FiveK [5] datasets.

LOL. We use both LOL-v1 and LOL-v2-real dataset. The training and testing sets are split with ratios of 485:15, 689:100 for LOL-v1, LOL-v2-real respectively.

SID. A subset of the SID dataset captured using the Sony $\alpha 7SII$ camera is used for evaluation. This subset consists of 2697 pairs of short- and long-exposure RAW images. Low light and normal light RGB images are generated by applying the same in-camera signal processing as used in SID to convert RAW images to RGB. Of these pairs, 2299 are allocated for training and 398 for testing.

SMID. The SMID benchmark dataset contains 20809 pairs of short- and long-exposure RAW images. These RAW images are also converted to low-light and normal-light RGB image pairs. The dataset provides 18789 pairs for training and 2021 pairs for testing.

SDSD. We use the static version of the SDSD dataset captured with a Canon EOS 6D Mark $II$ camera and an ND filter. The SDSD dataset includes both indoor and outdoor subsets, with the SDSD-outdoor subset containing 3150 images and the SDSD-indoor subset containing 1963 images.

FiveK. FiveK dataset is adjusted by experts as references to create low-light photos followed by the method proposed in [5]. It contains 5000 underexposed image pairs with bounding box annotations for 60 object categories. Note that this dataset is particularly used in Object Detection task for enhanced low-light images as discussed in Section 3.3.

In addition to the above eight benchmarks, we also tested our method on five datasets: LIME [18], NPE [40], MEF [31], DICM [26], and VV [36], which do not have ground truth annotations.

Methods	L-v1	L-v2	SID	SMID	SD-in	SD-out	Mean
HVI-CIDNet	2.97	3.27	2.75	3.23	3.34	3.48	3.17
Retinexformer	3.52^∗	4.01^∗	3.19^∗	3.75	3.66	3.98^∗	3.69^∗
URetinex	2.81	3.01	2.69	3.88^∗	3.75	3.82	3.33
LYT-Net	3.6	3.77	3.1	3.72	3.77	3.83	3.63
GASD	2.99	3.21	2.88	3.02	$\uparrow$ 3.84	3.24	3.20
KID++	2.5	2.64	2.78	3.21	3.01	3.22	2.90
DIffLL	2.65	2.72	2.69	3.01	3.27	3.08	2.91
LTCF-Net	$\uparrow$ 3.69	$\uparrow$ 4.23	$\uparrow$ 3.21	$\uparrow$ 4.11	3.78^∗	$\uparrow$ 4.01	$\uparrow$ 3.84

(a)

Methods	Bicycle	Boat	Bottle	Bus	Car	Cat	Chair	Cup	Dog	Motor	People	Table	Mean
HVI-CIDNet	41.3	31.4	32.5	46.3	44.5	26.4	25.5	26.5	27.4	25.5	31.4	25.5	32.1
Retinexformer	44.5	33.8	32.5	48.5	46.8	28.7	27.8	28.3	29.1	28.9	32.7	26.8	34.2
URetinex	46.2^∗	36.2^∗	35.6	51.2	49.2	$\uparrow$ 30.2	30.3^∗	30.7	32.3	31.2	34.2	28.2	36.2^∗
LYT-Net	34.5	34.5	31.8	47.6	45.7	27.6	26.7	27.8	26.8	27.3	30.8	25.8	32.4
GASD	43.8	32.7	34.1	49.8	47.3	29.1^∗	28.1	31.4^∗	30.9	33.1^∗	35.3	27.7	35.3
KID++	40.6	35.1	$\uparrow$ 36.4	52.4^∗	$\uparrow$ 50.1	25.8	29.6	$\uparrow$ 33.2	33.6^∗	29.7	37.1	29.1^∗	36.1
DIffLL	45.1	30.9	30.7	45.7	43.9	27.3	$\uparrow$ 30.9	25.6	25.7	32.8	38.6^∗	26.9	33.7
LTCF-Net	$\uparrow$ 47.8	$\uparrow$ 36.9	36.2^∗	$\uparrow$ 53.4	49.8^∗	28.5	29.7	31.2	$\uparrow$ 34.6	$\uparrow$ 33.9	$\uparrow$ 39.7	$\uparrow$ 30.1	$\uparrow$ 37.7

(b)

Table 2: (a) compares the human perception quality of various low-light enhancement algorithms conduced by user survey. (b) compares the preprocessing effects of different methods on high-level vision understanding. Our model LTCF-Net outperforms the benchmarks on average.

Implementation Details. Our model is implemented using the PyTorch framework and trained with the ADAM optimizer [24], using parameters $\beta_{1}=0.9$ and $\beta_{2}=0.999$ , for 1000 epochs. Training and testing process were conducted on a server equipped with an NVIDIA A40 GPU. The network parameters were randomly initialized, and the model was trained from scratch. An initial learning rate of $2\times 10^{-4}$ was set and gradually reduced to $1\times 10^{-6}$ using a cosine annealing scheduler [29] to facilitate convergence and avoid local minima. For the hyperparameters in the loss function, we set $\alpha_{1}=0.12$ , $\alpha_{2}=0.05$ , $\alpha_{3}=0.55$ , $\alpha_{4}=0.015$ , and $\alpha_{5}=0.25$ .

3.2 Comparison Study of Low-light Image Enhancement

Quantitative Results. The proposed method is quantitatively evaluated against various SOTA algorithms, as detailed in Table 1. While our model registers a slightly lower PSNR on the LOL-v1 dataset, it shows a notable improvement of 1.26 dB in PSNR on the LOL-v2-real dataset compared to the leading DiffLL model. Additionally, it surpasses other top-performing methods on the SID, SMID, SDSD-indoor, and SDSD-outdoor datasets with respective PSNR enhancements of 0.4 dB, 0.43 dB, 0.21 dB, and 0.27 dB. This performance is achieved with only 10.37G FLOPS and 0.155M Params, demonstrating the efficiency and effectiveness of our approach.

Qualitative Results. As shown in Fig. 3, we first compare the visual results of our model against others across the LOL-v1 and LOL-v2-real datasets. For instance, Retinexformer [6] and URetinex [45] demonstrate subpar performance in restoring light and shadow details. Similarly, DiffLL [22] and HVI-CIDNet [14] struggle with accurate color restoration. In contrast, our model excels in accurately enhancing both light and intricate details.

Additionally, as depicted in Fig. 4, models like URetinex [45], KinD++ [51], and DiffLL [22] show noticeable color inaccuracies. Moreover, Retinexformer [6] is unable to recover text on a sticker, underscoring its limitations. These observations confirm that our method outperforms others in restoring contrast and detail, particularly in the SID and SMID datasets. Including Fig. 5, in SDSD-indoor and outdoor datasets, our method can restore many details.

User Study Score. We conducted a user study involving 95 participants to subjectively evaluate the visual quality of images from seven different datasets as shown in Fig. 6. Each participant evaluated the enhanced images based on three criteria: overall reconstruction quality, presence of overexposure, and accuracy of detailed recovery. Each participant rated the enhanced images on a scale from 0 to 5. According to the aggregated scores presented in Table 2(a), our method consistently outperformed competing methods, effectively avoiding issues of overexposure and underexposure while ensuring uniformly distributed brightness across the images.

3.3 Low-light Object Detection Comparisons

Additional, we evaluate the impact of various enhancement algorithms on object detection using the FiveK dataset [5]. We used YOLO-v4-tiny [2] as the detector, and then compare the object detection performance on raw images and the images enhanced by different low-light enhancement methods.

Table 2 shows the average precision (AP) scores. Our model achieves the highest mean AP of 37.7, surpassing the leading self-monitoring method, URetinex, by 1.5 AP. Notably, our approach outperforms in seven object categories: Bicycle, Boat, Bus, Dog, Motor, People, and Table.

Fig. 7 illustrates the qualitative detection results in a low-light scene (left) and after LTCF-Net enhancement (right). The enhanced image enables the detector to identify more objects with higher accuracy, demonstrating the crucial role of our low-light enhancement method in improving object detection performance.

3.4 Ablation Study

To validate the effectiveness of each component within the LTCF-Net and its contribution to training optimization, we conducted detailed ablation experiments on the SID dataset, specifically examines the dual color branches, the MSEF module, and the FBP module.

Effectiveness of Dual Color Space Branch. We assessed the impact of individual color spaces on model performance through a series of decomposed experiments, as detailed in Table 3. By separately eliminating the LAB and YUV color spaces, and comparing these to the dual color space setup, we observed the effect on performance metrics. The results indicate that the combined use of both color spaces significantly enhances performance over any single color space configuration.

Effectiveness of MSEF Block. The incorporation of the MSEF module, despite increasing the number of parameters and FLOPS, results in significant improvements in image quality metrics, boosting the metric of PSNR by 1.38 and SSIM by 0.08. These gains confirm that the inclusion of the MSEF module is crucial for enhancing the model’s performance.

LAB	YUV	MSEF	FBP	PSNR	SSIM	Params (M)	FLOPS (G)
✓				22.07	0.588	0.044	3.38
	✓			22.12	0.587	0.042	3.33
✓		✓		24.08	0.610	0.047	3.55
	✓	✓		24.19	0.622	0.045	3.50
✓	✓			23.88	0.601	0.078	4.23
✓	✓	✓		25.26	0.682	0.097	4.60
✓	✓	✓	✓	26.28	0.701	0.155	10.37

Table 3: Ablation study on each proposed network component tested on the SID [8] dataset, with PSNR, SSIM, params, FLOPS(size=256

\times

256) metrics reported.

Effectiveness of FBP Block. The primary function of this module is to reduce noise across the image. Observations from Table 3 and Fig. 8 indicate that models lacking the Fourier module exhibit noticeable noise. Integrating Fourier modules significantly reduce this noise, albeit at the cost of slightly increasing the overall brightness of the image. Given that the final assessment of image quality relies on visual perception, the inclusion of Fourier modules is deemed essential. Consequently, we offer two versions of the model to accommodate different preferences and requirements.

4 Conclusion

In this work, we introduced LTCF-Net, an innovative low-light image enhancement model that combines dual-channel color spaces with Transformer and Fourier Transform techniques to address the limitations of current enhancement methods. By effectively separating and processing illumination and color information, our model simplifies the enhancement process, allowing for end-to-end, single-stage training that improves operational efficiency and reduces the likelihood of artifacts such as noise and color distortion. Experimental results show that LTCF-Net achieves superior performance compared to existing state-of-the-art methods, as demonstrated on several benchmarks where it showed notable improvements in both quantitative metrics and qualitative assessments. Through extensive evaluations, LTCF-Net not only sets a new standard in low-light image enhancement but also suggests a promising direction for future research in integrating color space knowledge with deep learning architectures to further advance the field of image processing.

References

Abdullah-Al-Wadud et al. [2007] Mohammad Abdullah-Al-Wadud, Md Hasanul Kabir, M Ali Akber Dewan, and Oksam Chae. A dynamic histogram equalization for image contrast enhancement. IEEE transactions on consumer electronics, 53(2):593–600, 2007.
Bochkovskiy et al. [2020] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
Bracewell [1989] Ronald N Bracewell. The fourier transform. Scientific American, 260(6):86–95, 1989.
Brateanu et al. [2024] Alexandru Brateanu, Raul Balmez, Adrian Avram, and CC Orhei. Lyt-net: Lightweight yuv transformer-based network for low-light image enhancement. arXiv preprint arXiv:2401.15204, 2024.
Bychkovsky et al. [2011] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In CVPR 2011, pages 97–104. IEEE, 2011.
Cai et al. [2023] Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, and Yulun Zhang. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12504–12513, 2023.
Chen et al. [2018] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3291–3300, 2018.
Chen et al. [2019] Chen Chen, Qifeng Chen, Minh N Do, and Vladlen Koltun. Seeing motion in the dark. In Proceedings of the IEEE/CVF International conference on computer vision, pages 3185–3194, 2019.
Chen et al. [2021] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12299–12310, 2021.
Cheng and Shi [2004] Heng-Da Cheng and XJ Shi. A simple and effective histogram equalization approach to image enhancement. Digital signal processing, 14(2):158–170, 2004.
Dale-Jones and Tjahjadi [1993] R Dale-Jones and Tardi Tjahjadi. A study and modification of the local histogram equalization algorithm. Pattern Recognition, 26(9):1373–1381, 1993.
Dosovitskiy [2020a] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020a.
Dosovitskiy [2020b] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020b.
Feng et al. [2024] Yixu Feng, Cheng Zhang, Pei Wang, Peng Wu, Qingsen Yan, and Yanning Zhang. You only need one color space: An efficient network for low-light image enhancement. arXiv preprint arXiv:2402.05809, 2024.
Fu et al. [2016] Xueyang Fu, Delu Zeng, Yue Huang, Xiao-Ping Zhang, and Xinghao Ding. A weighted variational model for simultaneous reflectance and illumination estimation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2782–2790, 2016.
Fu et al. [2022] Ying Fu, Yang Hong, Linwei Chen, and Shaodi You. Le-gan: Unsupervised low-light image enhancement network using attention module and identity invariant loss. Knowledge-Based Systems, 240:108010, 2022.
Guo et al. [2020] Chunle Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin Cong. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1780–1789, 2020.
Guo et al. [2016] Xiaojie Guo, Yu Li, and Haibin Ling. Lime: Low-light image enhancement via illumination map estimation. IEEE Transactions on image processing, 26(2):982–993, 2016.
Herbstreit and Pouliquen [1967] Jack W Herbstreit and H Pouliquen. International standards for color television. IEEE spectrum, 4(3):104–111, 1967.
Hu et al. [2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
Huang et al. [2012] Shih-Chia Huang, Fan-Chieh Cheng, and Yi-Sheng Chiu. Efficient contrast enhancement using adaptive gamma correction with weighting distribution. IEEE transactions on image processing, 22(3):1032–1041, 2012.
Jiang et al. [2023] Hai Jiang, Ao Luo, Haoqiang Fan, Songchen Han, and Shuaicheng Liu. Low-light image enhancement with wavelet-based diffusion models. ACM Transactions on Graphics (TOG), 42(6):1–14, 2023.
Jiang et al. [2021] Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, and Zhangyang Wang. Enlightengan: Deep light enhancement without paired supervision. IEEE transactions on image processing, 30:2340–2349, 2021.
Kingma [2014] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Land [1977] Edwin H. Land. The retinex theory of color vision: A retina-and-cortex system (retinex) may treat a color as a code for a three-part report from the retina, independent of the flux of radiant energy but correlated with the reflectance of objects, 1977.
Lee et al. [2013] Chulwoo Lee, Chul Lee, and Chang-Su Kim. Contrast enhancement based on layered difference representation of 2d histograms. IEEE transactions on image processing, 22(12):5372–5384, 2013.
Liu et al. [2021] Risheng Liu, Long Ma, Jiaao Zhang, Xin Fan, and Zhongxuan Luo. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10561–10570, 2021.
Lore et al. [2017] Kin Gwn Lore, Adedotun Akintayo, and Soumik Sarkar. Llnet: A deep autoencoder approach to natural low-light image enhancement. Pattern Recognition, 61:650–662, 2017.
Loshchilov and Hutter [2016] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
Lv et al. [2024] Xiaoqian Lv, Shengping Zhang, Chenyang Wang, Yichen Zheng, Bineng Zhong, Chongyi Li, and Liqiang Nie. Fourier priors-guided diffusion for zero-shot joint low-light enhancement and deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25378–25388, 2024.
Ma et al. [2015] Kede Ma, Kai Zeng, and Zhou Wang. Perceptual quality assessment for multi-exposure image fusion. IEEE Transactions on Image Processing, 24(11):3345–3356, 2015.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
Somal [2020] Simran Somal. Image enhancement using local and global histogram equalization technique and their comparison. In First International Conference on Sustainable Technologies for Computational Intelligence: Proceedings of ICTSCI 2019, pages 739–753. Springer, 2020.
Standard et al. [2007] CIE Standard et al. Colorimetry-part 4: Cie 1976 l* a* b* colour space. International Standard, pages 2019–06, 2007.
Vaswani [2017] A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
Vonikakis et al. [2018] Vassilios Vonikakis, Rigas Kouskouridas, and Antonios Gasteratos. On the evaluation of illumination compensation algorithms. Multimedia Tools and Applications, 77:9211–9231, 2018.
Wang et al. [2024] Litian Wang, Liquan Zhao, Tie Zhong, and Chunming Wu. Low-light image enhancement using generative adversarial networks. Scientific Reports, 14(1):18489, 2024.
Wang et al. [2019] Ruixing Wang, Qing Zhang, Chi-Wing Fu, Xiaoyong Shen, Wei-Shi Zheng, and Jiaya Jia. Underexposed photo enhancement using deep illumination estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Wang et al. [2021] Ruixing Wang, Xiaogang Xu, Chi-Wing Fu, Jiangbo Lu, Bei Yu, and Jiaya Jia. Seeing dynamic scene in the dark: A high-quality video dataset with mechatronic alignment. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9700–9709, 2021.
Wang et al. [2013] Shuhang Wang, Jin Zheng, Hai-Miao Hu, and Bo Li. Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE transactions on image processing, 22(9):3538–3548, 2013.
Wang et al. [2018] Wenjing Wang, Chen Wei, Wenhan Yang, and Jiaying Liu. Gladnet: Low-light enhancement network with global awareness. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 751–755. IEEE, 2018.
Wei et al. [2018a] Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu. Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560, 2018a.
Wei et al. [2018b] Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu. Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560, 2018b.
Wu et al. [2021] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 22–31, 2021.
Wu et al. [2022] Wenhui Wu, Jian Weng, Pingping Zhang, Xu Wang, Wenhan Yang, and Jianmin Jiang. Uretinex-net: Retinex-based deep unfolding network for low-light image enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5901–5910, 2022.
Xu et al. [2020] Ke Xu, Xin Yang, Baocai Yin, and Rynson WH Lau. Learning to restore low-light images via decomposition-and-enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2281–2290, 2020.
Xu et al. [2022] Xiaogang Xu, Ruixing Wang, Chi-Wing Fu, and Jiaya Jia. Snr-aware low-light image enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17714–17724, 2022.
Yang et al. [2021a] Wenhan Yang, Shiqi Wang, Yuming Fang, Yue Wang, and Jiaying Liu. Band representation-based semi-supervised low-light image enhancement: Bridging the gap between signal fidelity and perceptual quality. IEEE Transactions on Image Processing, 30:3461–3473, 2021a.
Yang et al. [2021b] Wenhan Yang, Wenjing Wang, Haofeng Huang, Shiqi Wang, and Jiaying Liu. Sparse gradient regularized deep retinex network for robust low-light image enhancement. IEEE Transactions on Image Processing, 30:2072–2086, 2021b.
Zhang et al. [2024] Tongshun Zhang, Pingping Liu, Ming Zhao, and Haotian Lv. Dmfourllie: Dual-stage and multi-branch fourier network for low-light image enhancement. In ACM Multimedia 2024, 2024.
Zhang et al. [2019] Yonghua Zhang, Jiawan Zhang, and Xiaojie Guo. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM international conference on multimedia, pages 1632–1640, 2019.
Zhou et al. [2022] Shangchen Zhou, Chongyi Li, and Chen Change Loy. Lednet: Joint low-light enhancement and deblurring in the dark. In European conference on computer vision, pages 573–589. Springer, 2022.