Geometric-Aware Low-Light Image and Video Enhancement
via Depth Guidance

Yingqi Lin Xiaogang Xu^∗ Yan Han Jiafei Wu Zhe Liu
Zhejiang Lab {linyq, xgxu, hanyan, wujiafei, zhe.liu}@zhejianglab.com
* indicates the corresponding author

Abstract

Low-Light Enhancement (LLE) is aimed at improving the quality of photos/videos captured under low-light conditions. It is worth noting that most existing LLE methods do not take advantage of geometric modeling. We believe that incorporating geometric information can enhance LLE performance, as it provides insights into the physical structure of the scene that influences illumination conditions. To address this, we propose a Geometry-Guided Low-Light Enhancement Refine Framework (GG-LLERF) designed to assist low-light enhancement models in learning improved features for LLE by integrating geometric priors into the feature representation space. In this paper, we employ depth priors as the geometric representation. Our approach focuses on the integration of depth priors into various LLE frameworks using a unified methodology. This methodology comprises two key novel modules. First, a depth-aware feature extraction module is designed to inject depth priors into the image representation. Then, Hierarchical Depth-Guided Feature Fusion Module (HDGFFM) is formulated with a cross-domain attention mechanism, which combines depth-aware features with the original image features within the LLE model. We conducted extensive experiments on public low-light image and video enhancement benchmarks. The results illustrate that our designed framework significantly enhances existing LLE methods.

Figure 1: We propose to enhance the performance of existing LLIE and LLVE methods via depth priors via our designed GG-LLERF. Our method consistently achieves the improvement target on different datasets/target networks with the same structure. “R.F.” denotes Retinexformer, “R.T.” represents Restormer.

1 Introduction

Low-light imaging is a common requirement in our daily lives, but it often results in poor quality images or videos due to inadequate illumination or limited exposure time [3]. In response to this issue, the Low-Light Enhancement (LLE) technique has been developed. Its objectives include reducing noise and artifacts, preserving edges and textures, and reproducing natural brightness and color [17, 28].

LLE can be further categorized into two subdomains: Low-Light Image Enhancement (LLIE) [25, 4, 5] and Low-Light Video Enhancement (LLVE) [19, 26, 35]. Both LLIE and LLVE can significantly benefit the performance of various downstream visual applications, such as nighttime detection and autopilot systems [9, 16]. Although many deep-learning-based LLE methods have been introduced and have achieved considerable success in some scenarios, their performance in more challenging situations is often unsatisfactory and has encountered certain limitations. Overcoming these performance limits is currently a prominent research topic.

One approach to improving enhancement performance involves the utilization of multi-modal maps as priors. For example, SKF [23] leverages semantic maps to optimize the feature space for low-light enhancement, while SMG [27] employs a generative framework to integrate edge information, thereby enhancing the initial appearance modeling for low-light scenarios. However, it’s worth noting that all of these employed priors operate at the 2D level. 2D-level information may not fully capture the geometric information of the entire 3D scene. The inclusion of 3D geometric information can be instrumental in determining the illumination distribution in the scene, thereby enhancing LLE tasks. This is especially important in video tasks. Having geometric information of each frame allows for a more accurate modeling of the scene’s geometric features.

In this paper, we introduce a groundbreaking approach by learning the essential geometric priors within Low-Light Enhancement (LLE) frameworks. Subsequently, we develop an effective Geometry-Guided Low-Light Enhancement Refine Framework (GG-LLERF) tailored to incorporate these priors into the target frameworks $\mathcal{G}$ , thereby surpassing their original performance limits. For our geometric representation, we harness depth priors, which offer valuable insights into the scene’s geometry. We choose depth priors because of the existence of highly effective open-world depth estimation models $\mathcal{D}$ that have been trained on large-scale datasets, e.g., DPT [13].

Formulating a desired depth prior presents two primary challenges. First, achieving precise depth estimation directly from a low-light image is a demanding task. Existing depth estimation networks often struggle to provide ideal zero-shot predictions under various dark environments. Second, a significant domain gap exists between the depth scores and image content. Consequently, the direct fusion of depth and image features proves to be a complex and challenging endeavor.

To address the first challenge, we introduce an encoder-decoder network denoted as $\mathcal{F}$ , which predicts depth information from low-light images/frames $I_{l}$ . We use the depth prediction from the corresponding normal-light images $I_{n}$ as the ground truth for supervision. This approach allows us to distill the depth prediction capability of $\mathcal{D}$ to achieve the desired depth estimation in low-light conditions. To overcome the second challenge, instead of directly fusing depth and image data, we propose a feature fusion method that involves the encoder of $\mathcal{G}$ and the depth-aware feature extraction module of $\mathcal{F}$ (with the extracted features denoted as $f_{g}$ and $f_{d}$ , respectively). These features are extracted from $I_{l}$ , making them more compatible and suitable for fusion. Furthermore, these depth-aware feature priors $f_{d}$ not only convey the geometrical information of the target scene but also encapsulate complementary image characteristics that can enhance the features $f_{g}$ extracted by $\mathcal{G}$ .

Once the depth-aware priors have been obtained and the fusion locations identified, the key challenge is how to effectively carry out the fusion process. To address this, we introduce the Hierarchical Depth-Guided Feature Fusion Module (HDGFFM) at various encoder stages of $\mathcal{G}$ . The HDGFFM facilitates feature fusion between $f_{g}$ and $f_{d}$ through a cross-attention strategy, where the depth-aware features $f_{d}$ serve as the query vector, while the extracted feature $f_{g}$ functions as the key and value vectors in the transformer’s attention computation. Compared to self-attention using $f_{g}$ alone, this cross-attention strategy allows us to obtain depth-aware features that encompass information beyond the image content. These depth-aware features can then be harnessed to refine $f_{g}$ for enhancement, guided by the objectives of LLE. The proposed HDGFFM with its cross-attention strategy is adaptable to various LLE networks, making it a versatile and effective fusion method.

Extensive experiments are conducted on public LLIE and LLVE datasets with various target networks. Experimental results demonstrate that our GG-LLERF can be combined with existing LLIE and LLVE frameworks to improve their performance on varying datasets. In summary, our contribution is three-fold.

•

We are the first to propose adopting the designed suitable depth-aware features as the prior for LLIE and LLVE tasks, and we formulate the corresponding Geometry-Guided Low-Light Enhancement Refine Framework.
•

HDGFFM is proposed to conduct feature fusion in an effective attention computation manner.
•

Extensive experiments are conducted on different datasets and networks, showing the effectiveness of our strategy.

2 Related Work

Refer to caption — Figure 2: The overall framework of GG-LLERF, which incorporates depth-aware features into the encoder of the target LLE framework (LLIE or LLVE) $\mathcal{G}$ . The depth-aware features are extracted from the depth modeling network $\mathcal{F}$ , which is supervised by a pre-trained depth estimator $\mathcal{D}$ . The depth-aware features and image/frame features are fused via our proposed Hierarchical Depth-Guided Feature Fusion Module (HDGFFM) (Sec. 3.2), where Correlation-based Cross Attention is utilized (Sec. 3.3).

2.1 Low-light Image and Video Enhancement

In recent years, there have been notable advancements in learning-based LLIE techniques [32, 24, 34, 8, 36, 37, 20, 10, 30, 7, 29], primarily emphasizing supervised approaches due to the availability of abundant image pairs for training. For example, MIRNet [31] adopts a multi-scale architecture to effectively capture and distill information at various levels. This results in improved image quality with enhanced brightness, contrast, and details, along with a reduction in noise. Fu et al. [5] implemented illumination augmentation on a pair of images, successfully achieving self-supervised Retinex decomposition learning. This innovative approach contributes to further advancements in LLIE techniques.

Beyond LLIE, there is a rising need for LLVE solutions. Danai et al. [14] proposed a data synthesis mechanism that generates dynamic video pairs static datasets, formulating a LLVE framework. Wang et al. [19] introduced a multi-task model capable of simultaneously estimating noise and illumination, particularly effective for videos with severe noise conditions. Xu et al. [26] innovatively designed a parametric 3D filter tailored for enhancing and sharpening low-light videos, contributing to the latest developments in video enhancement techniques.

While significant progress has been achieved in both LLIE and LLVE, there remains a challenge in obtaining accurate enhancement results in certain demanding scenarios. This difficulty arises from the highly ill-posed nature of directly recovering normal-light photos from these images. Consequently, the incorporation of suitable priors becomes imperative to address these challenges effectively.

2.2 Low-light Enhancement with Priors

Given the ill-posed nature of LLE, the integration of suitable priors is essential to achieve the desired enhancement results. Recent methods have proposed to enhance the corresponding effects by incorporating multi-modal maps as unified priors. For instance, SKF [23] utilizes semantic maps to optimize the feature space for low-light enhancement. SMG [27] adopts a generative framework that integrates edge information, enhancing the initial appearance modeling specifically designed for low-light scenarios.

Nevertheless, the existing priors primarily operate at the 2D level, potentially falling short of fully capturing the geometric intricacies of the entire 3D scene. Recognizing the importance of incorporating 3D geometric information, which plays a pivotal role in determining illumination distribution, becomes crucial for enhancing LLE tasks. The central focus of this paper lies in exploring methods to effectively acquire and utilize 3D priors in the context of LLE.

3 Method

In this section, we first provide an overview of our proposed strategy, which leverages geometrical priors (i.e., depth), in Sec. 3.1. Following the overview, we delve into the structural details of our proposed Hierarchical Depth-Guided Feature Fusion (HDGFFM) in Sec. 3.2. This includes discussing how to obtain the desired depth prior for fusion, where the fusion takes place, and the methods employed for conducting the fusion. The next section is dedicated to describing a vital component of our fusion procedure, known as Correlation-based Cross Attention for Fusion, and can be found in Sec. 3.3. Finally, we introduce the training pipeline of our strategy in Sec. 3.4.

3.1 Overview

Motivation. To enhance the performance of existing low-light enhancement methods, some approaches incorporate multi-modality information, such as edge and semantic maps. However, all of these incorporated priors operate at a 2D level, making them inadequate for accurately representing the corresponding real-world 3D structure. While deriving explicit 3D priors, such as point clouds and meshes, from 2D data is a highly ill-posed problem [12, 6], there are alternative approaches to obtain pseudo-3D data in a data-driven manner, one of which is depth information.

Depth estimation has been a long-standing task, and it has recently reached new heights with the development of large models. By training on extensive datasets that encompass diverse scenes and objects, current depth estimation models (e.g., DPT [13]) exhibit remarkable zero-shot performance on various images and videos. Consequently, we suggest distilling depth information from SOTA depth prediction networks and integrating it into the low-light enhancement task.

Implementation. In this paper, we focus on the supervised learning manner for LLE, where low-light data is denoted as $I_{l}$ , and the normal-light data is represented as $I_{n}$ . As shown in Fig., we set a lightweight depth estimation branch $\mathcal{F}$ for a given target LLE framework $\mathcal{G}$ to be improved.

$\mathcal{F}$ takes the input of $I_{d}$ and output $\hat{d}_{d}$ , as

\hat{d}_{d}=\mathcal{F}(I_{l};\theta_{f}),

(1)

where $\theta_{f}$ is the parameters to learn. The ground truth to train $\mathcal{F}$ is obtained as the output of a pre-trained open-world depth estimator $\mathcal{D}$ (the parameter of $\mathcal{D}$ is frozen in the training stage), as $d_{n}=\mathcal{D}(I_{n})$ .

We combine the extracted features from $\mathcal{G}$ and $\mathcal{F}$ using a cross-attention module $\mathcal{C}$ . This process refines the features in $\mathcal{G}$ by incorporating depth information from $\mathcal{F}$ , ultimately enhancing the final results, denoted as $\hat{I}_{n}$ , as

\hat{I}_{n}=\mathcal{G}(I_{d},\mathcal{F}_{e}(I_{l});\theta_{g}),

(2)

where $\mathcal{F}_{e}(I_{l})$ denotes the extracted feature with $\mathcal{F}$ . The objective can be represented as

\hat{\theta}_{g},\hat{\theta}_{f}=argmin(\mathcal{L}(\hat{I}_{n},I_{n})+\mathcal{L}(\hat{d}_{d},d_{n})).

(3)

In the following sections, we will provide the details of feature fusion.

3.2 Hierarchical Depth-Guided Feature Fusion

What and where to fuse? Assuming we can obtain the desired depth prediction $\hat{d}_{d}$ from the low-light input data, it remains challenging to directly integrate $\hat{d}_{d}$ into the enhancement process of $\mathcal{G}$ due to the fundamental difference between depth and image domains. Therefore, we propose conducting the fusion in the deep feature space, specifically between the extracted features from $\mathcal{G}$ and $\mathcal{F}$ , which we denote as $f_{g}$ and $f_{d}$ , respectively. This approach effectively mitigates the challenges posed by the disparity between the two data sources during fusion, as both $f_{g}$ and $f_{d}$ are derived from the same input image $I_{d}$ .

Our experiments have revealed that conducting the fusion is more appropriate in the encoder part, as opposed to the decoder part, as done in SKF [23]. This choice is driven by the fact that in the decoder part, the feature discrepancy tends to increase because these features are closer to the target outputs, which belong to different domains. To elaborate, the features in the decoder of $\mathcal{G}$ represent the 2D image content, while the features in $\mathcal{F}$ capture the 3D geometrical information. Hence, it is more suitable to perform the fusion in the encoder part of both $\mathcal{G}$ and $\mathcal{F}$ .

Based on the analysis provided above, we have identified both the location and the specific features to be fused within HDGFFM. In the following sections, we will elaborate on the fusion method itself.

The fusion pipeline in HDGFFM. In a typical low-light enhancement network, a hierarchical encoder structure is employed. To ensure compatibility between the feature sets $f_{g}$ and $f_{d}$ , the encoder of $\mathcal{F}$ is also designed in a pyramidal manner. Consequently, both $f_{g}$ and $f_{d}$ consist of a sequence of features, which we denote as $f_{g}=\{f_{g,l}\},l\in[1,L]$ , and $f_{d}=\{f_{d,l}\},l\in[1,L]$ , where $L$ represents the number of layers in the encoder. To further standardize the channel representation of $f_{g}$ and $f_{d}$ , we incorporate a depth-aware embedding module, denoted as $\mathcal{E}_{l}$ , in each layer. These modules are responsible for processing the depth-aware features within the respective layers, as

\hat{f}_{d,l}=\mathcal{E}_{l}(f_{d,l}).

(4)

The depth-aware embedding modules serve the purpose of channel adjustment and do not involve information compression. Therefore, they are designed to be lightweight, consisting of just a single layer of linear convolution.

The fusion process is carried out through our proposed cross-attention mechanism, which will be elaborated on in the upcoming section.

3.3 Correlation-based Cross Attention for Fusion

Feature fusion strategy. Several strategies have been employed to fuse cross-domain features for low-light enhancement. In SMG [27], synthesized edge information serves as a condition for generating feature normalization and convolution parameters. However, this operation can be computationally expensive, especially when dealing with high-resolution images. On the other hand, SKF [23] computes the similarity between image and semantic features and utilizes this similarity to modulate the original image features, using query, key, and value features in the fusion as image, semantic, and image features. It’s worth noting that this strategy may not effectively incorporate new information, as the semantic features are primarily used to facilitate the computation of an attention matrix rather than being integrated with the image features themselves.

In this paper, our approach involves directly guiding the depth-aware features into the pathways of $\mathcal{G}$ . Additionally, we integrate the results of the cross-attention computation with the outputs of the self-attention computation, based on a feature correlation measurement.

Fusion computation details. Suppose our goal is to fuse features in the $l$ -th layer, denoted as $f_{g,l}$ and $\hat{f}_{d,l}$ . Both of these feature maps have the same shape of $H\times W\times C$ , where $H$ represents the feature height, $W$ the feature width, and $C$ the feature channel. To perform the feature computation, separate projection matrices for query, key, and value [15] are established for both $f_{g,l}$ and $\hat{f}_{d,l}$ . For $f_{g,l}$ , these projection matrices are labeled as $W_{q}$ , $W_{k}$ , and $W_{v}$ , while for $\hat{f}_{d,l}$ , only the key and query matrices are required, designated as $\hat{W}_{k}$ and $\hat{W}_{v}$ , respectively.

The self-attention is first conducted for $f_{g,l}$ , by computing the self-similarity matrix, as

A_{l}=Softmax(W_{q}(f_{g,l})\times W_{k}(f_{g,l})^{T}\times\tau),

(5)

where $W_{q}(f_{g,l})$ and $W_{k}(f_{g,l})$ has the shape of $C\times(H\times W)$ , $T$ is the transpose operation, $\tau$ is the temperature value that is a hyper-parameter, $\times$ is the matrix multiplication operation, and $A_{l}$ has the shape of $C\times C$ . The output of the self-attention is obtained as

f^{\prime}_{g,l}=A_{l}\times W_{v}(f_{g,l}).

(6)

Regarding cross-attention, the similarity matrix can be computed between the image and depth-aware features, and the final output can be determined based on this computed similarity matrix. The procedure can be written as

		$\displaystyle\hat{A}_{l}=Softmax(W_{q}(f_{g,l})\times\hat{W}_{k}(\hat{f}_{d,l})^{T}\times\tau),$		(7)
		$\displaystyle f^{\prime\prime}_{g,l}=\hat{A}_{l}\times\hat{W}_{v}(\hat{f}_{d,l}).$		(7)

As both $f_{g,l}$ and $\hat{f}{d,l}$ are derived from the encoder process of $I_{d}$ and exhibit homogeneity, as discussed in Sec. 3.2, $f^{\prime}{g,l}$ and $f^{\prime\prime}{g,l}$ also exhibit small heterogeneity between them. To further minimize the disparity between $f^{\prime}{g,l}$ and $f^{\prime\prime}_{g,l}$ , we introduce a method to model the correlation between these two sets of features and utilize this correlation as a weighting factor for the fusion process. The correlation is obtained as

w_{l}=Sigmoid(\mathcal{O}(f^{\prime}_{g,l}\oplus f^{\prime\prime}_{g,l})),

(8)

where $\oplus$ is the channel concatenation operation. The fusion output can be written as

\bar{f}_{g,l}=w_{l}\cdot f^{\prime\prime}_{g,l}+f^{\prime}_{g,l}+f_{g,l},

(9)

where $\cdot$ is the element-wise multiplication. Moreover, the final outputs of HDGFFM is obtained with the attention outputs and the feed-forward network, as

F_{g,l}=f_{g,l+1}=\mathcal{H}(\bar{f}_{g,l})+f_{g,l},

(10)

where $\mathcal{H}$ denotes the feed-forward network.

3.4 Training Strategy

In Eq. 3, two objectives are considered: the restoration loss for enhancement and the depth supervision loss to obtain accurate depth priors. The restoration loss, denoted as $\mathcal{L}_{g}$ , can employ the same loss terms as those used for the target model $\mathcal{G}$ . On the other hand, the depth supervision loss is implemented as the $L_{2}$ distance between the predicted depth values and the ground truth depth information, as

\mathcal{L}_{d}=\|\hat{d}_{d}-d_{n}\|_{2}.

(11)

The final loss to train our strategy can be written as

\mathcal{L}=\mathcal{L}_{g}+\lambda\mathcal{L}_{d},

(12)

where $\lambda$ is the loss weight, which is robust across different target scenarios. We guarantee that our code and models will be released upon the publication of this paper.

4 Experiment

4.1 Experimental Settings

Datasets. We assessed the proposed framework using various LLIE/LLVE datasets, which include SDSD-indoor [19], SDSD-outdoor [19], LOLv2 [30], SMID [3], SID [2], and DAVIS2017 [11]. SDSD: The videos in SDSD were captured in dynamic pairs using electromechanical equipment. Consequently, it is employed in both LLIE and LLVE tasks. LOLv2: LOLv2 is subdivided into LOLv2-real and LOLv2-synthetic. LOLv2-real comprises real-world low-/normal-light image pairs for training and testing. LOLv2-synthetic was generated by analyzing the illumination distribution in the RAW format. SMID: MID is a static video dataset consisting of frames captured with short exposures and corresponding ground truth obtained using long exposures. SID: The collection approach for SID is similar to that of SMID but introduces additional challenges with extreme situations. For both SMID and SID, we use full images and convert RAW data to RGB since our work focuses on LLE in the RGB domain. DAVIS2017: The utilization of the DAVIS dataset for LLE was initially proposed by [35]. It synthesizes low-light and normal-light video pairs with dynamic motions. In comparison to [35], we further incorporate the degradation of noise into low-light videos, in addition to invisibility, aligning more closely with real-world low-light videos. All training and testing splits adhere to the guidelines specified in the original papers.

Methods	SDSD-indoor		SDSD-outdoor		Davis2017		Params	Runtime
Methods	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	(millions)	(seconds)
DP3DF [22]	28.9	0.880	27.24	0.794	-	-	28.86	0.01
Ours	30.95	0.897	27.89	0.804	-	-	33.32	0.01
SDSD [2]	25.86	0.738	25.35	0.765	-	-	4.45	-
Ours	27.92	0.759	26.88	0.785	-	-	6.82	-
StableLLVE [18]	-	-	-	-	26.39	0.977	-	0.01
Ours	-	-	-	-	28.04	0.981	-	0.02

Table 1: Quantitative comparison in LLVE task on SDSD-indoor, SDSD-outdoor, and Davis2017. Our method performs better than the baseline consistently. The runtime is computed with a size of 512

\times

960.

Metrics. To assess the performance of various frameworks, we employ full-reference image quality evaluation metrics. Specifically, we utilize peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [21].

Compared methods. To validate the effectiveness of our method, we conduct comparisons with a set of SOTA methods for both LLIE and LLVE tasks. For video tasks, we evaluate against DP3DF [26], SDSD [19], and StableLLVE [35], following their official settings and objectives. In the case of LLIE tasks, we include three representative frameworks: Retinexformer [1], Restormer [33], and SNR [25].

Implementation details. We implement our framework in PyTorch and conducted experiments on an NVIDIA A100 GPU. Our implementation is based on the released code of the baseline networks, and we ensure the use of the same training settings for both the baseline and our method. The code will be made publicly available immediately following publication.

4.2 Quantitative Evaluation

Quantitative comparison on LLVE tasks. The quantitative comparisons for LLVE baselines and their versions with our depth guidance are presented in Table 1. Specifically, in DP3DF, our method demonstrates improvements of 2.05 in PSNR and 0.017 in SSIM on the SDSD-indoor dataset. Notably, within the SDSD framework, our method achieves enhancements of 2.06 in PSNR and 1.53 in SSIM on the SDSD-indoor dataset. Furthermore, in the StableLLVE framework evaluated on the Davis2017 dataset, our results exhibit a significant increase of 1.65 in PSNR, showcasing substantial improvement, particularly in various dynamic scenes. Additionally, we provide metrics for ”params” and ”running time” for both baselines and our method in Table 1. The results illustrate that the improvements achieved by our method do not come at the severe expense of efficiency.

Quantitative comparison on LLIE tasks. For the evaluation of LLIE, we conduct experiment on a variety of datasets, showing the effectiveness of our strategy more comprehensively and convincingly. The results are displayed in Table 2, where we can see that almost all baselines are improved by involving our strategy without additional heavy computation cost. The “-” iterms are placed due to the limited computation resources. In particular, for Retinexformer, our method showcases improvements of 1.84 in PSNR and 0.021 in SSIM on the SDSD-outdoor dataset. Within the Restormer framework, our method achieves enhancements of 0.012 in SSIM on the SDSD-outdoor dataset. Additionally, in the SNR framework evaluated on the LOLv2-synthetic dataset, our results demonstrate a notable increase of 1.83 in PSNR and 0.021 in SSIM, indicating substantial improvements across different scenarios.

Qualitative evaluation. We also provide the visual comparisons for the baseline and ours on different datasets, as shown in Figs. 4 and 5. As we can see, our method’s results are closer to the ground truth with natural illumination and color, fewer noise and artifacts.

Methods	SDSD-indoor		SDSD-outdoor		SMID		LOLv2-real		LOLv2-synthetic		SID		Params	Runtime
Methods	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	(millions)	(seconds)
Retinexformer [1]	25.00	0.894	25.69	0.780	28.66	0.805	-	-	24.20	0.926	21.59	0.56	-	0.05
Ours	25.12	0.903	27.53	0.801	28.72	0.810	-	-	24.97	0.928	22.17	0.566	-	0.06
Restormer [33]	27.11	0.923	29.53	0.838	-	-	19.49	0.852	-	-	22.99	0.533	26.13	0.15
Ours	27.40	0.924	29.91	0.850	-	-	19.95	0.853	-	-	23.77	0.634	31.41	0.21
SNR [25]	26.15	0.914	28.41	0.824	28.21	0.801	19.98	0.828	23.35	0.925	22.69	0.616	39.12	0.01
Ours	26.86	0.924	29.34	0.849	29.01	0.813	20.88	0.849	25.18	0.942	26.53	0.678	41.49	0.03

Table 2: Quantitative comparison in video task on SDSD-indoor, SDSD-outdoor, SMID, LOLv2-real, LOLv2-synthetic and SID. Our method performs better than the baseline consistently. The runtime is computed with a size of 512

\times

960.

4.3 Ablation Study

In this section, we perform various ablation studies on the SDSD datasets, encompassing both indoor and outdoor subsets, to assess the significance of different components in our framework. Specifically, we evaluate the key components in HDGFFM through three ablated cases: (i) The location where the geometric factor is added. (ii) The method of fusion between image features and depth-aware features. (iii) Different weights for depth loss. The details of each case are elaborated below.

•

“Ours with Decoder”. As indicated in Sec.3.1 and Eq.2, the fusion in our framework occurs in the encoder of the target network $\mathcal{G}$ . This choice is made because both the image content features and depth-aware features are extracted from the low-light input image, fostering homogeneity and enhancing fusion effects. To investigate the impact of alternative fusion points, we aim to test whether fusing image content features and depth-aware features in the decoder of $\mathcal{G}$ yields comparable results.
•

“Ours w/o Corre.”. As highlighted in Eq.8, we employ a correlation computation method via Sigmoid before the fusion of image content features and depth-aware features. This approach is designed to further mitigate inhomogeneity between these features. In this ablation setting, we omit the correlation computation, i.e., remove the Sigmoid from Eq.8. Consequently, the weight $w_{l}$ can have unlimited score ranges in this scenario.
•

“Ours with Add”. In a straightforward fusion approach, the two features are directly added without considering correlation computations. This setting neglects the incorporation of correlation computations between the features.
•

“Ours with $\lambda\times 5$ ” and “Ours with $\lambda\div 5$ ”. As specified in Eq. 12, there is a hyper-parameter that governs the contribution of image reconstruction loss and depth prediction loss, denoted as $\lambda$ . We aim to analyze the effects of varying the value of $\lambda$ on the final results.

As depicted in Tables 3 and 4, it is evident that the results of all ablation settings are inferior to our original implementation in the comparison section. In comparing “Ours” with “Ours with Decoder”, we validate the impracticality of fusing restored image content features and depth-aware features since the former is close to the normal-light output while the latter is close to the low-light inputs. Furthermore, comparing “Ours” with “Ours w/o Corre.” and ”Ours with Add” underscores the significance of our proposed “Correlation-based Cross Attention for Fusion”. Lastly, the consistent and even improved results across different values for $\lambda$ highlight the robustness of the hyperparameter choice in our framework. This emphasizes that researchers can choose optimal $\lambda$ values for different datasets.

	SDSD-indoor		SDSD-outdoor
Methods	PSNR	SSIM	PSNR	SSIM
Ours with Decoder	28.97	0.875	27.34	0.799
Ours w/o Corre.	30.74	0.894	24.82	0.794
Ours with Add	30.57	0.892	27.14	0.786
Ours with $\lambda\times 5$	29.53	0.878	27.99	0.802
Ours with $\lambda\div 5$	29.89	0.883	27.42	0.795
Ours	30.95	0.897	27.89	0.804

Table 3: Results of the ablation study in the SDSD-indoor and SDSD-outdoor datasets with DP3DF. Our method performs better than all ablation settings.

	SDSD-indoor		SDSD-outdoor
Methods	PSNR	SSIM	PSNR	SSIM
Ours with Decoder	23.50	0.877	29.14	0.836
Ours w/o Corre.	22.86	0.876	27.33	0.823
Ours with Add	25.24	0.902	28.06	0.834
Ours with $\lambda\times 5$	23.92	0.900	30.00	0.853
Ours with $\lambda\div 5$	23.75	0.896	30.03	0.851
Ours	26.86	0.924	29.34	0.849

Table 4: Results of the ablation study in the SDSD-indoor, SDSD-outdoor, and SMID datasets with SNR. Our method performs better than all ablation settings.

4.4 User Study

To evaluate the effectiveness of our proposed HDGFFM in terms of subjective evaluation, we carried out a user study with 50 participants (using online questionnaires). Our objective is to verify the subjective quality of our approach compared with baselines, and the evaluation scenarios should cover across various datasets and frameworks.

For more details, we initially select $T$ videos/images randomly from each testing dataset for a given LLIE/LLVE approach, where $T=20$ for LLIE and $T=10$ for LLVE. Our user study employs an AB-test strategy, where each participant is presented with an image/video from the corresponding baseline and an image/video from our framework. Importantly, participants are unaware of which one is the baseline, as the order of our results and baseline results is randomized during each evaluation. During each evaluation, participants compare 5 pairs for a specific method on a given dataset. They have the option to choose which one is better or indicate that the two are the same. Participants are instructed to evaluate the results based on criteria such as natural color, minimal noise, realistic illumination, and stable temporal changes (for LLVE). Each participant is provided with an example. Each participant completes 15 tasks (10 methods × 5 pairs). It takes approximately 30 minutes for a participant to finish the user study.

The results of the user study are presented in Fig. 6. Clearly, the participants consistently preferred the results of our method. This indicates that our approach significantly improves the human subjective perception of existing LLIE and LLVE frameworks.

5 Limitation

In this paper, we introduce a novel emphasis on employing geometrical priors, particularly in the form of depth, for low-light enhancement, a aspect overlooked by previous methods. Our extensive experiments across various datasets and networks demonstrate the effectiveness of our proposed HDGFFM. Despite achieving significant advancements, there are still areas for improvement. Firstly, further efforts are required to minimize model parameters and additional computational requirements of HDGFFM, even though it has not introduced heavy computations in this paper. Secondly, refining our frameworks to establish a greater distinction between ours and baselines is a future direction. This could be achieved by incorporating other 3D priors.

6 Conclusion

This paper proposes the utilization of geometrical priors (depth in this paper) for low-light enhancement, encompassing both image and video enhancement. To distill depth estimation capabilities from large-scale open-world depth estimation models, we introduce a lightweight depth estimator. Furthermore, the extracted depth-aware features are efficiently fused with the encoder features of the target network using our proposed HDGFFM, which incorporates correlation-based cross attention for fusion. Through extensive experiments conducted on representative datasets and methods, we demonstrate the effectiveness of our proposed strategy.

References

Cai et al. [2023] Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, and Yulun Zhang. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In ICCV, 2023.
Chen et al. [2018] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. In CVPR, 2018.
Chen et al. [2019] Chen Chen, Qifeng Chen, Minh N. Do, and Vladlen Koltun. Seeing motion in the dark. In ICCV, 2019.
Fu et al. [2023a] Huiyuan Fu, Wenkai Zheng, Xiangyu Meng, Xin Wang, Chuanming Wang, and Huadong Ma. You do not need additional priors or regularizers in retinex-based low-light image enhancement. In CVPR, 2023a.
Fu et al. [2023b] Zhenqi Fu, Yan Yang, Xiaotong Tu, Yue Huang, Xinghao Ding, and Kai-Kuang Ma. Learning a simple low-light image enhancer from paired low-light instances. In CVPR, 2023b.
Hu et al. [2021] Tao Hu, Liwei Wang, Xiaogang Xu, Shu Liu, and Jiaya Jia. Self-supervised 3d mesh reconstruction from single images. In CVPR, 2021.
Jiang et al. [2021] Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, and Zhangyang Wang. EnlightenGAN: Deep light enhancement without paired supervision. IEEE TIP, 2021.
Kim et al. [2021] Hanul Kim, Su-Min Choi, Chang-Su Kim, and Yeong Jun Koh. Representative color transform for image enhancement. In ICCV, 2021.
Lamba et al. [2020] Mohit Lamba, Atul Balaji, and Kaushik Mitra. Towards fast and light-weight restoration of dark images. In BMVC, 2020.
Liu et al. [2021] Risheng Liu, Long Ma, Jiaao Zhang, Xin Fan, and Zhongxuan Luo. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In CVPR, 2021.
Pont-Tuset et al. [2017] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv, 2017.
Pontes et al. [2019] Jhony K Pontes, Chen Kong, Sridha Sridharan, Simon Lucey, Anders Eriksson, and Clinton Fookes. Image2mesh: A learning framework for single image 3d reconstruction. In ACCV, 2019.
Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021.
Triantafyllidou et al. [2020] Danai Triantafyllidou, Sean Moran, Steven McDonagh, Sarah Parisot, and Gregory Slabaugh. Low light video enhancement using synthetic data produced with an intermediate domain mapping. In ECCV, 2020.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
Wang et al. [2022] Hai Wang, Yanyan Chen, Yingfeng Cai, Long Chen, Yicheng Li, Miguel Angel Sotelo, and Zhixiong Li. Sfnet-n: An improved sfnet algorithm for semantic segmentation of low-light autonomous driving road scenes. IEEE Transactions on Intelligent Transportation Systems, 2022.
Wang et al. [2023] Haoyuan Wang, Xiaogang Xu, Ke Xu, and Rynson WH Lau. Lighting up nerf via unsupervised decomposition and enhancement. In ICCV, 2023.
Wang et al. [2019] Ruixing Wang, Qing Zhang, Chi-Wing Fu, Xiaoyong Shen, Wei-Shi Zheng, and Jiaya Jia. Underexposed photo enhancement using deep illumination estimation. In CVPR, 2019.
Wang et al. [2021a] Ruixing Wang, Xiaogang Xu, Chi-Wing Fu, Jiangbo Lu, Bei Yu, and Jiaya Jia. Seeing dynamic scene in the dark: High-quality video dataset with mechatronic alignment. In ICCV, 2021a.
Wang et al. [2021b] Tao Wang, Yong Li, Jingyang Peng, Yipeng Ma, Xian Wang, Fenglong Song, and Youliang Yan. Real-time image enhancer via learnable spatial-aware 3D lookup tables. In ICCV, 2021b.
Wang et al. [2004] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 2004.
Wei et al. [2018] Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu. Deep Retinex decomposition for low-light enhancement. In BMVC, 2018.
Wu et al. [2023] Yuhui Wu, Chen Pan, Guoqing Wang, Yang Yang, Jiwei Wei, Chongyi Li, and Heng Tao Shen. Learning semantic-aware knowledge guidance for low-light image enhancement. In CVPR, 2023.
Xu et al. [2020] Ke Xu, Xin Yang, Baocai Yin, and Rynson WH. Lau. Learning to restore low-light images via decomposition-and-enhancement. In CVPR, 2020.
Xu et al. [2022] Xiaogang Xu, Ruixing Wang, Chi-Wing Fu, and Jiaya Jia. SNR-Aware low-light image enhancement. In CVPR, 2022.
Xu et al. [2023a] Xiaogang Xu, Ruixing Wang, Chi-Wing Fu, and Jiaya Jia. Deep parametric 3d filters for joint video denoising and illumination enhancement in video super resolution. In AAAI, 2023a.
Xu et al. [2023b] Xiaogang Xu, Ruixing Wang, and Jiangbo Lu. Low-light image enhancement via structure modeling and guidance. In CVPR, 2023b.
Yang et al. [2023] Shuzhou Yang, Moxuan Ding, Yanmin Wu, Zihan Li, and Jian Zhang. Implicit neural representation for cooperative low-light image enhancement. In ICCV, 2023.
Yang et al. [2021a] Wenhan Yang, Shiqi Wang, Yuming Fang, Yue Wang, and Jiaying Liu. Band representation-based semi-supervised low-light image enhancement: Bridging the gap between signal fidelity and perceptual quality. IEEE TIP, 2021a.
Yang et al. [2021b] Wenhan Yang, Wenjing Wang, Haofeng Huang, Shiqi Wang, and Jiaying Liu. Sparse gradient regularized deep Retinex network for robust low-light image enhancement. IEEE TIP, 2021b.
Zamir et al. [2020a] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Learning enriched features for real image restoration and enhancement. In ECCV, 2020a.
Zamir et al. [2020b] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Learning enriched features for real image restoration and enhancement. In ECCV, 2020b.
Zamir et al. [2022] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In CVPR, 2022.
Zeng et al. [2020] Hui Zeng, Jianrui Cai, Lida Li, Zisheng Cao, and Lei Zhang. Learning image-adaptive 3D lookup tables for high performance photo enhancement in real-time. IEEE TPAMI, 2020.
Zhang et al. [2021] Fan Zhang, Yu Li, Shaodi You, and Ying Fu. Learning temporal consistency for low light video enhancement from single images. In CVPR, 2021.
Zhao et al. [2021] Lin Zhao, Shao-Ping Lu, Tao Chen, Zhenglu Yang, and Ariel Shamir. Deep symmetric network for underexposed image enhancement with recurrent attentional learning. In ICCV, 2021.
Zheng et al. [2021] Chuanjun Zheng, Daming Shi, and Wentian Shi. Adaptive unfolding total variation network for low-light image enhancement. In ICCV, 2021.

Geometric-Aware Low-Light Image and Video Enhancement via Depth Guidance