Refer to caption — Figure 1. CLE Diffusion enables users to select regions of interest(ROI) with a simple click and adjust the degree of brightness enhancement as desired, while MAXIM (Tu et al., 2022) is limited to homogeneously enhancing images to a pre-defined level of brightness.

CLE Diffusion: Controllable Light Enhancement Diffusion Model

Yuyang Yin Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network TechnologyChina , Dejia Xu VITA GroupUniversity of Texas at AustinUSA , Chuangchuang Tan Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network TechnologyChina , Ping Liu Center for Frontier AI Research, IHPC, A*STARSingapore , Yao Zhao Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network TechnologyChina and Yunchao Wei Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network TechnologyChina

(2023)

Abstract.

Low light enhancement has gained increasing importance with the rapid development of visual creation and editing. However, most existing enhancement algorithms are designed to homogeneously increase the brightness of images to a pre-defined extent, limiting the user experience. To address this issue, we propose Controllable Light Enhancement Diffusion Model, dubbed CLE Diffusion, a novel diffusion framework to provide users with rich controllability. Built with a conditional diffusion model, we introduce an illumination embedding to let users control their desired brightness level. Additionally, we incorporate the Segment-Anything Model (SAM) to enable user-friendly region controllability, where users can click on objects to specify the regions they wish to enhance. Extensive experiments demonstrate that CLE Diffusion achieves competitive performance regarding quantitative metrics, qualitative results, and versatile controllability. Project page: https://yuyangyin.github.io/CLEDiffusion/

image processing, low light image enhancement, diffusion model

^†^†journalyear: 2023^†^†copyright: acmlicensed^†^†conference: Proceedings of the 31st ACM International Conference on Multimedia; October 29-November 3, 2023; Ottawa, ON, Canada^†^†booktitle: Proceedings of the 31st ACM International Conference on Multimedia (MM ’23), October 29-November 3, 2023, Ottawa, ON, Canada^†^†price: 15.00^†^†doi: 10.1145/3581783.3612145^†^†isbn: 979-8-4007-0108-5/23/10^†^†ccs: Computing methodologies Computer vision

1. Introduction

The low-light capturing conditions can stem from various factors, ranging from environmental causes like inadequate illumination to technical reasons such as sub-optimal ISO settings. Consequently, these images are often regarded as less visually appealing and present new difficulties for downstream tasks. Remedying such degradation has been a critical area of vision research for many years, leading to the development of diverse approaches and solutions for low-light enhancement.

Enhancing low-light images tackles an ill-posed problem due to the lack of large publicly available datasets with paired images, which makes it difficult to create a universal solution for quick and easy enhancement. As a result, this challenging problem has remained the focus of numerous researchers over the past few decades. One approach to addressing this issue is using histogram equalization (HE) methods (Pizer et al., 1990; Abdullah-Al-Wadud et al., 2007) to adjust the contrast. These methods aim to stretch the dynamic range of low-light images, but they often result in unwanted illumination in more intricate scenes. Another line of research involves the implementation of the Retinex theory (Land, 1977), which aims to decompose low-light images into two distinct layers, reflectance, and illumination. Various image filters (Jobson et al., 1997; Lee et al., 2013; Wang et al., 2013) and manually designed priors (Fu et al., 2016; Guo et al., 2017) have been utilized to enhance the decomposition process.

With the recent development of deep learning, data-driven approaches for low-light enhancement have gained significant interest in recent years. These approaches utilize large-scale datasets to restore normal-light images from complex degradations. However, many of these methods (Ren et al., 2019; Shen et al., 2017; Wang et al., 2019; Wei* et al., 2018) require paired low-light and normal-light images pixel-aligned to achieve optimal results. Recently, researchers have made significant progress by utilizing unpaired data through adversarial learning (Jiang et al., 2021; Xu et al., 2022b) and more advanced networks (Fan et al., 2022; Wang et al., 2021; Yang et al., 2020; Guo et al., 2020) for enhancing low-light images.

However, adversarial learning is prone to optimization instability (Arjovsky et al., 2017; Gulrajani et al., 2017) and mode collapse issues (Metz et al., 2016; Ravuri and Vinyals, 2019), making them hard to scale up. The recent success of denoising diffusion models (Ho et al., 2020; Song et al., 2020; Rombach et al., 2022; Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022b) in image generation reveals their outstanding capacity. This has also attracted researchers to study their abilities for image restoration tasks (Kanizo et al., 2013; Saharia et al., 2022c; Whang et al., 2022). These attempts have proven that denoising diffusion models can model natural image distributions better than existing methods. Despite the exciting progress, there has been no effort to explore diffusion models’ ability to restore low-light images.

Moreover, most existing approaches are implicitly designed to enhance low-light images in a deterministic way, ignoring the ill-posed nature of low-light enhancement. While learning a one-to-one mapping between low-light and normal-light images might provide visually pleasing results, this design limits the model’s flexibility since they assume a pre-defined well-lit brightness exists. Although the claim might hold for certain natural images, the definition of well-lit remains highly subjective. On the contrary, the presumed well-lit brightness is extracted through the datasets (Chen et al., 2018b; Liu et al., 2021; Bychkovsky et al., 2011) for most data-driven methods. Despite the great effort in collecting aligned datasets (Liu et al., 2021; Chen et al., 2018b; Bychkovsky et al., 2011), low-light and normal-light image pairs remain scarce compared to large-scale datasets like ImageNet (Deng et al., 2009) and LAION (Schuhmann et al., 2022). As a result, the learned mapping is biased toward the available samples, and an ability to provide a diverse set of plausible reconstructions is greatly desired. While ReCoRo (Xu et al., 2022b) provides controllability of brightness level, the model assigns the brightest illumination using the training set pairs. This brings brightness inconsistency across the images, leading to unstabilized optimization and confusing user experience. Moreover, ReCoRo has little extrapolation ability to generate brighter results than the ground truth images in the dataset.

On the other hand, most previous methods learn to enlighten images in a global homogeneous way, making the models incapable of dealing with intricate lighting situations where both under-exposed and over-exposed regions coexist. Though these methods may improve the visibility of under-exposed regions, they often result in over-exposed regions becoming overly emphasized since the models fail to address them adaptively. To this end, users are used to specifying regions of interest manually so that the desired edits can be performed locally without retouching the other regions. However, this requirement of precise user specification can be tedious for smartphone end-users since finger drawings inevitably come with unintentional noise. ReCoRo (Xu et al., 2022b) studies this setting for the first time and develops a robust enhancement model to work with imprecise user inputs by baking in domain-specific augmentations. However, their augmentations are specially designed for portrait images and require further tuning for masks of other image classes.

To overcome the above-mentioned issues, we propose a novel denoising diffusion framework, dubbed CLE Diffusion, that performs Controllable Light Enhancement via iterative refinement. Conditioned on a unified illumination embedding, our diffusion model learns to enhance the low-light images towards a target brightness level specified by users. Our brightness level, different from ReCoRo’s (Xu et al., 2022b), is represented using the average pixel intensities of the image and thus is consistent throughout the dataset. This avoids the wild assumption that a perfect well-lit illumination level exists. We further condition the denoising diffusion process with the low-light image features to ease the optimization. Alongside the original low-light image, we prepare a normalized color map and a signal-to-noise ratio (SNR) map to reduce the burden of the enhancement module. Additionally, we include a binary mask as an extra input to support localized edits, letting the users freely specify the regions of interest (ROI). Armed with the Segment-Anything Model (SAM) (Kirillov et al., 2023), our framework is capable of user-friendly region-controllable enhancement via diverse simple promptings, such as points and boxes, alleviating the requirement for precise user specification in practice.

Our major contributions can be summarized as follows,

•

We propose a novel diffusion framework, dubbed CLE Diffusion, for Controllable Light Enhancement via iterative refinement. To the best of our knowledge, CLE Diffusion is the first attempt to study controllable light enhancement using diffusion models.
•

Our framework’s controllability allows users to specify desired brightness levels and the regions of interest easily. Using a unified illumination embedding, our conditional diffusion model provides seamless and consistent control over brightness levels. Moreover, we facilitate user-friendly region control by employing the SAM model, which enables users to click on the image to specify regions for enhancement.
•

As shown in extensive experiments, our CLE Diffusion demonstrates competitive performance in terms of quantitative, qualitative, and versatile controllability.

2. Related Works

2.1. Traditional Light Enhancement Methods

Many image priors have found their use for traditional single-image low-light enhancement. Some approaches implement local and global histogram equalization (Pizer et al., 1990; Abdullah-Al-Wadud et al., 2007) to increase the contrast of the input image. Some other solutions (Li et al., 2015; Zhang et al., 2012) consider the low-light images as inverted haze images and use dehazing methods on the inverted input image. Another popular line of work is based on the Retinex theory (Land, 1977), which separates the image into illumination and reflectance layers and performs simple transformations on top of them for the desired effects. SRIE (Fu et al., 2015) estimates both layers simultaneously using a weighted variational model, while LIME (Guo et al., 2017) refines the illumination layer estimate and uses the decomposed reflection layer as the final enhanced result. For noise suppression, JED (Ren et al., 2018) has made progress by utilizing sequential decomposition. However, hand-crafted models used in these methods require careful parameter tuning and have limited model capacity.

2.2. Learning-based Light Enhancement Methods

In recent years, various learning-based light enhancement methods have been introduced thanks to the rapid development of deep learning. LLNet and S-LLNet (Lore et al., 2017) utilize deep autoencoder-based approaches for contrast enhancement and denoising. They are trained using data synthesized with random Gamma corrections and Gaussian noise. Based on the Retinex theory, Retinex-Net (Chen et al., 2018b) assumes that paired aligned images share the same reflectance but have different illuminations. The authors collect a paired low-light and normal-light data set, which paves the way for the training of larger models. Since collecting well-aligned paired datasets (Liu et al., 2021; Bychkovsky et al., 2011) requires hard work, great effort has been put into searching for methods without the need for paired supervision. EnlightenGAN (Jiang et al., 2021) utilizes a generative adversarial network framework to learn a powerful generator without using unpaired data. Zero-DCE (Guo et al., 2020) learns light enhancement through image-specific curve estimation. CERL (Chen et al., 2022a) builds upon EnlightenGAN and incorporates plug-and-play noise suppression. In addition, many works investigate the performance of different architectures, including Recursive Band Network (Yang et al., 2020), Signal-to-Noise-Ratio (SNR) Prior-aware Transformer (Xu et al., 2022c), Multi-axis MLP (Tu et al., 2022), Normalizing Flow (Wang et al., 2021), and Half Wavelet Network (Fan et al., 2022). These existing methods generally enhance the images to a pre-defined brightness level learned from the training dataset and avoid the challenging real-world case of region controlling for light enhancement.

A controllable GAN network named ReCoRo (Xu et al., 2022b) is the closest baseline to our approach. It allows users to specify the areas and levels of enhancement. However, the authors assign the brightest illumination level for each image using its normal-light version from the training set. This inconsistency of brightness level across images unstabilizes network optimization and disturbs the user experience at inference time. ReCoRo (Xu et al., 2022b) works with imprecise region controls by baking in augmentations. Effective as it is for portrait images, this method requires re-training for novel object classes. In contrast, our CLE Diffusion framework produces visually pleasing results via a conditional diffusion framework. Our unified brightness level also allows for seamless and consistent brightness control. Moreover, our framework offers a user-friendly experience with region control accessible through a single click.

2.3. Diffusion Model

Denoising diffusion model (Sohl-Dickstein et al., 2015) is a deep generative model synthesizing data through an iterative denoising process. Diffusion models consist of a forward process that distorts clean images with noise and a reverse process that learns to reconstruct clean images. They have demonstrated outstanding image generation (Song and Ermon, 2019; Ho et al., 2020) capability with the help of various improvements in architecture design (Dhariwal and Nichol, 2021; Rombach et al., 2022), sampling guidance (Ho and Salimans, 2022), and inference cost (Salimans and Ho, 2022; Rombach et al., 2022; Vahdat et al., 2021). Equipped with large-scale image-pair datasets, many works scaled up the model (Rombach et al., 2022; Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022b) to billions of parameters to work with the challenging text-to-image generation. Additionally, denoising diffusion models have succeeded in various high-level computer vision tasks, including 3D generation (Poole et al., 2022; Singer et al., 2023; Xu et al., 2022a; Gu et al., 2023; Watson et al., 2022), object detection (Chen et al., 2022b) and depth estimation (Saxena et al., 2023). Meanwhile, various image restoration tasks are also studied through diffusion models, including super resolution (Saharia et al., 2022c), image deblurring (Whang et al., 2022), adverse weather condition (Özdenizci and Legenstein, 2023), and image to image translation (Saharia et al., 2022a). Our CLE Diffusion lies in the conditional denoising diffusion model category and makes the first attempt to study its effectiveness in controllable light enhancement.

3. Method

We propose a novel Controllable Light Enhancement (CLE) Diffusion framework, which adopts a conditional diffusion model to enhance any region in low-light images to any brightness level. As shown in Fig. 2, we design domain-specific conditioning information and loss functions tailored to our needs. Additionally, we incorporate a Brightness Control Module to enable controllable light enhancement. To further improve usability, we support region controllability by including a binary mask as input and leveraging the Segment-Anything Model (SAM) (Kirillov et al., 2023). This allows for a user-friendly interface where users can easily click on an image to specify the regions to enhance.

3.1. Preliminary of Diffusion Model

Diffusion model (Sohl-Dickstein et al., 2015) is a type of generative model that uses iterative refinement to generate data. The widely used DDPM model (Ho et al., 2020) consists of a forward and a reverse process. The forward process gradually adds Gaussian noise to a clean input image, formulated as $q(y_{t}|y_{t-1})=\mathcal{N}(y_{t};\sqrt{\beta_{t}}y_{t-1},(1-\beta_{t})I)$ . Furthermore, the intermediate steps can be marginalized out to characterize the distribution into $q(y_{t}|y_{0})=\mathcal{N}(\sqrt{\bar{\alpha_{t}}}y_{0},(1-\bar{\alpha_{t}})I)$ , where $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha_{t}}=\prod_{i=0}^{t}\alpha_{i}$ . The sequence of $\beta_{t}$ is carefully designed so that the noisy images will converge to pure Gaussian random noise $\mathcal{N}(0,I)$ when $t$ reaches the end of the forward process.

In the reverse process of the DDPM model, a neural network is used to denoise the data samples. The process can be formulated as $p_{\theta}(y_{t-1}|y_{t})=\mathcal{N}(\hat{y_{0}},\frac{1-\bar{\alpha_{t-1}}}{1-\bar{\alpha_{t}}}\beta_{t}),$ where $\hat{y_{0}}=\frac{1}{\sqrt{\alpha_{t}}}(y_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{\theta}(y_{t},t)).$ $\epsilon_{\theta}$ is implemented with a U-Net (Ronneberger et al., 2015) model to estimate the noise component from a noisy image. During inference time, we sample $y_{T}\sim\mathcal{N}(0,I)$ and gradually reduce the noise level until we reach a clean image $y_{0}$ . To accelerate sampling, DDIM (Song et al., 2020) presents a deterministic sampling approach as follows,

(1)		$\displaystyle y_{t-1}=$	$\displaystyle\sqrt{\bar{\alpha}_{t-1}}\left(\frac{y_{t}-\sqrt{1-\bar{\alpha}_{t}}\cdot\boldsymbol{\epsilon}_{\theta}\left(y_{t},t\right)}{\sqrt{\bar{\alpha}_{t}}}\right)$
(1)			$\displaystyle+\sqrt{1-\bar{\alpha}_{t-1}}\cdot\boldsymbol{\epsilon}_{\theta}\left(y_{t},t\right).$

Diffusion models are usually trained via optimizing the negative log-likelihood loss function, which is further simplified (Ho et al., 2020; Kingma et al., 2021) as follows:

(2)

\mathcal{L}_{\text{simple}}=\mathbb{E}_{\mathbf{y}_{0},t,\epsilon}\left[\left\|\epsilon-\epsilon_{\theta}\left(\sqrt{\bar{a}_{t}}\mathbf{y}_{0}+\sqrt{1-\bar{a}_{t}}\epsilon,t\right)\right\|^{2}\right]

3.2. Controllable Light Enhancement Diffusion Model

Specifically for low-light enhancement, we need to generate coherent normal-light images $y$ that share the content with the input low-light image $x$ . Instead of learning the one-to-one mapping between the two domains, we are interested in approximating the conditional distribution $p(y|x)$ using the available paired image samples from the dataset. Similar to existing attempts to utilize diffusion model for image restoration (Saharia et al., 2022a, c), we implement our CLE Diffusion by adapting the original DDPM (Ho et al., 2020) model to accept additional condition information. While the forward process remains as simple as distorting the image with some carefully attenuated Gaussian noise, the U-net in the reverse process takes $x$ as an additional input.

Empirically, we observe simple concatenation of low-light image $x$ and diffused element $y_{t}$ leads to unstable training for complex scenes. This is partially due to the intricate noise and the diverse lighting in low-light environments. To this end, we further parameterize the conditional diffusion process with two additional priors.

Our first motivation comes from the severe color distortion in low-light images. As discussed in earlier works (Xu et al., 2022b; Liu et al., 2021), unnatural color shifts are often observed when enhancing low-light images. To this end, we implement a color map to reduce color distortion by normalizing the range of three color channels in input images. Specifically, an input image $x$ can be decomposed into three channels:

(3)

x=[x_{r},x_{g},x_{b}],

where $x_{r}$ means the red channel of the image, $x_{g}$ means the green channel of the image, $x_{b}$ means the blue channel of the image. We then extract the maximum pixel value for the three channels respectively:

(4)

x_{max}=[x_{r,max},x_{g,max},x_{b,max}],

where for example, $x_{r,max}$ means the maximum value of pixels on the red channel. Overall, the color map can be formulated as follows:

(5)

C(x)=\frac{x}{x_{max}}=[\frac{x_{r}}{x_{r,max}},\frac{x_{g}}{x_{g,max}},\frac{x_{b}}{x_{b,max}}],

Another main challenge for enhancing low-light images lies in the inevitable noise in low-light conditions. Numerous researchers (Ren et al., 2019; Li et al., 2018; Chen et al., 2018b) have made various attempts to model the noise formulation. More recently, Xu et al. (Xu et al., 2022c) adopted an SNR-aware transformer for effective low-light enhancement. Specifically, the SNR map is used to bring spatial attention to the low signal-to-noise-ratio region. The SNR map can be obtained as follows,

(6)

S(x)=\frac{F(x)}{|x-F(x)+\epsilon|},

where $\epsilon$ is included for numerical stability and $F$ is a low-pass filter implemented as a Gaussian blur in our experiments. We consider the high-frequency component in the image to be noise and calculate the ratio between the original image and noise directly.

In each training step, we randomly sample a pair of low-light image $x$ and its corresponding normal-light image $y$ . We then prepare the color map $C(x)$ , the SNR map $S(x)$ and a noisy image $y_{t}=\sqrt{\bar{a}_{t}}{y}+\sqrt{1-\bar{a}_{t}}\epsilon$ . They are concatenated with low-light image $x$ as the input of our diffusion model. Consequently, Eq. 2 can be extended as follows,

(7)

\displaystyle\mathcal{L}_{\text{simple}}=\mathbb{E}_{{y},t,\epsilon}\left[\left\|\epsilon-\epsilon_{\theta}\left(\sqrt{\bar{a}_{t}}{y}+\sqrt{1-\bar{a}_{t}}\epsilon,x,C(x),S(x),t\right)\right\|^{2}\right]

3.3. Brightness Control Module

To emphasize the effectiveness of conditioning information, we adopt a classifier-free guidance (Ho and Salimans, 2022) approach. This involves jointly training a conditional and an unconditional diffusion model, without the need for a classifier on noisy images as in the classifier guidance technique (Dhariwal and Nichol, 2021). Instead, the classifier-free guidance approach allows for a trade-off between sample quality and diversity by adjusting the weighting of conditional sampling and unconditional sampling during inference. For CLE Diffusion, we treat the brightness level of images as our ”class” labels. It is important to note that our ”class” in this case is continuous, allowing for seamless interpolation and continuous adjustment of the target brightness levels.

We first extract the vanilla brightness level $\lambda$ of normal-light images by calculating the average pixel value. Then we encode the average value using a random orthogonal matrix into the illumination embedding. Specifically, we uniformly establish several discretized values in $[0,1]$ as anchors. Given a $n\times n$ encoding matrix, the $d$ -th value is mapped to the $d$ -th column of the $n$ -dimensional random orthogonal matrix. We adopt bilinear interpolations of the two nearby columns for the intermediate values. The illumination embedding is further embedded into the U-net using our Brightness Control Module. As shown in Fig. 3, we adopt a FiLM layer (Perez et al., 2018) to learn feature-wise affine transformation based on the illumination embedding. Then the modulated feature is split by half along the channel axis, with one copy being multiplied by the feature map and the other added to the feature map.

We train the conditional diffusion model $\epsilon_{\theta}(y_{t},x,C(x),S(x)|\lambda)$ together with an unconditional model $\epsilon_{\theta}(y_{t},x,C(x),S(x)|0)$ . To be more specific, for the unconditional model, we input a zero embedding with the same shape as the illumination embedding. During the training process, we randomly drop the conditioning by allowing $2\%$ of the total iterations to train the unconditional model.

As shown in Algo. 1, the sampling process is implemented with DDIM (Song et al., 2020) sampler. Firstly, we sample $y_{T}\sim\mathcal{N}(0,I)$ . Subsequently, we estimate two noise maps, one from the conditional model and the other from the unconditional model, and apply a weighted average of the two noise estimates. The guidance scale $s$ is used to regulate the influence of the conditioning signal, where a larger scale produces results more aligned with the controlling signal, while a smaller scale produces less connected results.

Algorithm 1 CLE Diffusion sampling

Input

y_{T}\sim\mathcal{N}(0,1)

,low-light image

x

,color map

C(x)

, SNR map

S(x)

, brightness level

\lambda

, and number of implicit sampling steps

T

for

t=T,...,1

\begin{aligned} e_{t-1}&=s\cdot\epsilon_{\theta}(y_{t},x,C(x),S(x)|\lambda)\\ &+(1-s)\cdot\epsilon_{\theta}(y_{t},x,C(x),S(x)|0)\end{aligned}

\hat{y_{0}}=\frac{y_{t}-\sqrt{1-\overline{\alpha_{t}}}\cdot e_{t-1}}{\sqrt{\overline{\alpha_{t}}}}

y_{t-1}=\sqrt{\overline{\alpha_{t-1}}}\hat{y_{0}}+\sqrt{1-\overline{\alpha_{t-1}}}\cdot e_{t-1}

Output

y_{0}

3.4. Regional Controllability

Users may prioritize increasing the brightness of specific regions of interest over globally illuminating the entire image, especially when dealing with complex lighting conditions. To address this need, we introduce region controllability to our CLE Diffusion method, referred to as Mask-CLE Diffusion.

We incorporate a binary mask $M$ into our diffusion model by concatenating the mask with the original inputs. To accommodate this requirement, we created synthetic training data by randomly sampling free-form masks (Suvorov et al., 2022) with feathered boundaries. The target images are generated by alpha blending the low-light and normal-light images from existing low-light datasets (Chen et al., 2018b; Bychkovsky et al., 2011).

Segment-Anything Model (SAM) (Kirillov et al., 2023) is a large vision transformer trained on 11 million images to produce high-quality object masks from input prompts such as points or boxes. With the aid of SAM (Kirillov et al., 2023), obtaining precise object masks in low-light conditions becomes a user-friendly process. Users can select their desired region with just one click. Mask-CLE Diffusion subsequently generates controllable light enhancement images via mixing the results from $\epsilon_{\theta}(y_{t},x,C(x),S(x),M|\lambda)$ and $\epsilon_{\theta}(y_{t},x,C(x),S(x),M|0)$ . For tiny objects such as human hair, SAM masks are not accurate enough. In such cases, we can still use the SAM mask as an initialization for downstream models. In our experiments, we utilize MatteFormer (Park et al., 2022) to automatically construct detailed alpha mattes for human matting. The trimaps used for MatteFormer are constructed using dilate and erode operations.

3.5. Auxiliary Loss Functions

Although an ideal optimum for the $\mathcal{L}_{\text{simple}}$ will be able to approximate the $p(y|x)$ distribution from the available paired dataset, in practice, we observe that the model often ends up generating color distortion and unprecedented noise. To improve the convergence of our diffusion model, we introduce auxiliary losses to provide direct supervision on the estimated denoised images $\hat{y_{0}}$ .

Brightness Loss

To maintain the same brightness level between the enhanced images and ground truth, we utilize brightness intensity loss. We use average pixel intensities to supervise the brightness information as follows,

(8)

\mathcal{L}_{\text{br}}=|{G}(\hat{y_{0}})-{G}(y)|_{1},

where $G(\cdot)$ means the gray-scale version of an RGB image.

Angular Color Loss

Increasing the brightness can amplify the color distortion from low-light images. We adopt a color loss (Wang et al., 2019) that encourages the color of the enhanced images $\hat{y_{0}}$ to match the ground truth $y$ . The color loss can be expressed as:

(9)

\mathcal{L}_{\text{col}}=\sum_{i}{\angle\left(\hat{y_{0}}_{i},y_{i}\right)},

where $i$ denotes a pixel location and $\angle(,)$ calculates the angular difference between two 3-dimensional vectors representing colors in RGB color space.

SSIM Loss

We further incorporate Structural Similarity Index (SSIM) loss to improve the visual quality of the enhanced images. SSIM calculates the structural similarity between a predicted image and a ground truth image, providing statistics for overall contrast and luminance consistency. SSIM loss can be expressed as follows,

(10)

\mathcal{L}_{\text{ssim}}=\frac{(2\mu_{y}\mu_{\hat{y_{0}}}+c_{1})(2\sigma_{y\hat{y_{0}}}+c_{2})}{(\mu_{y^{2}}+\mu_{\hat{y_{0}}}^{2}+c_{1})(\sigma_{y^{2}}+\sigma_{\hat{y_{0}}}^{2}+c_{2})},

where $\mu_{y}$ and $\mu_{\hat{y_{0}}}$ are pixel value averages, $\sigma_{y}$ and $\sigma_{\hat{y_{0}}}$ are variances, $\sigma_{y\hat{y_{0}}}$ is covariance, $c_{1}$ and $c_{2}$ are constants for numerical stability.

Perceptual Loss

Although effective, the pixel-space metrics mentioned above primarily focus on low-level information and are less aligned with human perception. Thus, it is preferable to utilize high-level feature statistics to measure the quality of the enhanced images (Zhang et al., 2018a). LPIPS (Zhang et al., 2018b) loss extracts deep features via a VGG network (Simonyan and Zisserman, 2014) pre-trained on ImageNet (Deng et al., 2009) and calculates the distance between the enhanced features and ground truth features.

(11)

\mathcal{L}_{\text{lpips}}=\sum_{l}\frac{1}{H_{l}W_{l}}|\phi^{l}_{VGG}(\hat{y_{0}})-\phi^{l}_{VGG}(y)|_{2},

where $\phi^{l}_{VGG}$ represents the feature maps extracted from $l$ -th layer of VGG network and $H_{l},W_{l}$ are the height and weight of feature map of layer $l$ , respectively. We find that LPIPS loss can improve the model’s ability to restore high-frequency information.

Overall, the auxiliary loss functions can be summarized as:

(12)

\displaystyle\mathcal{L}_{\text{aux}}=\mathcal{L}_{\text{simple}}+W_{\text{br}}\mathcal{L}_{\text{br}}+W_{\text{col}}\mathcal{L}_{\text{col}}+W_{\text{ssim}}\mathcal{L}_{\text{ssim}}+W_{\text{lpips}}\mathcal{L}_{\text{lpips}},

where the $W_{\text{br}}$ , $W_{\text{col}}$ , $W_{\text{ssim}}$ , $W_{\text{lpips}}$ are weighting coefficients.

4. Experiment

Due to the space limit, we provide more implementation details, results, and analysis in the supplementary.

4.1. Quantitative Metric

Existing quantitative metrics typically assume the existence of an ideal brightness level, making it difficult to compare images with different brightness levels fairly. As shown in Fig. 4, PSNR and SSIM present an inverted “V”-shaped pattern while LPIPS show a “V”-shaped pattern. These metrics exhibit large variances as brightness levels change, indicating their sensitivity to changes in light intensity. As light enhancement is highly subjective, a better value of these metrics does not necessarily correlate with better image quality. Additionally, the highly ill-posed nature of light enhancement brings about many possible optimal solutions. These solutions can have different white balances and exposure levels, further complicating the determination of a single ”best” solution.

We are initially inspired by the color map’s ability to normalize the brightness of images. As mentioned in Eq. 5, the color map can bring images with different brightness levels to a normalized brightness level, especially for images that are too dark or overexposed. Then, we use Canny edge detector (Canny, 1986) to extract useful structural information, further reducing the appearance information that is easily affected by illumination. We have empirically set the Canny threshold as 50 and 250. Other combinations exhibit minimal variation. As semantic features are more resilient than pixel-space metrics, we select the LPIPS distance calculation for comparing the structure of two images. We name this new metric as Light-Independent LPIPS, dubbed $LI\text{-}LPIPS$ :

(13)

\mathcal{L}_{\text{LI\text{-}LPIPS}}(a,b)=\sum_{l}\frac{1}{H_{l}W_{l}}|\phi^{l}(\text{Canny}(C(a))-\phi^{l}(\text{Canny}(C(b))|_{2},

where $C(\cdot)$ is color map, $\phi^{l}$ is the $l$ -th layer feature map of a pre-trained VGG network, and $\text{Canny}(\cdot)$ represents the Canny operator. In Fig.4, when the average image brightness ranges from 0.18 to 0.69, the difference observed in LI-LPIPS is less than 0.02, while LPIPS yields a variation in maximum and minimum values as high as 0.15. This validates LI-LPIPS is insensitive to variations in brightness.

4.2. Evaluation Protocol

We evaluate the CLE-Diffusion on two popular benchmarks, LOL (Chen et al., 2018b) and MIT-Adobe FiveK (Bychkovsky et al., 2011). The LOL dataset contains 485 training and 15 testing paired images, with each pair comprising a low-light and a normal-light image. MIT-Adobe FiveK consists of 5000 images processed with Adobe Lightroom by five experts. Following the split in previous methods (Tu et al., 2022; Ni et al., 2020), 4500 paired images are utilized as the training set, while the remaining 500 paired images serve as the test set. We use PSNR, SSIM, LPIPS (Zhang et al., 2018a), and LI-LPIPS metrics to measure the quality of output images.

Table 1. Comparisons on the LOL dataset. Ours* refers to the model variant trained with a smaller weight for color loss.

Method	PSNR↑	SSIM↑	LPIPS↓	LI-LPIPS↓
Zero-DCE (Guo et al., 2020)	14.86	0.54	0.33	0.3051
EnlightenGAN (Jiang et al., 2021)	17.48	0.65	0.32	0.2838
RetinexNet (Chen et al., 2018b)	16.77	0.56	0.47	0.5468
DRBN (Yang et al., 2020)	20.13	0.83	0.16	0.3271
KinD++ (Zhang et al., 2021)	21.30	0.82	0.16	0.3768
MAXIM (Tu et al., 2022)	23.43	0.96	0.20	0.1801
HWMNet (Fan et al., 2022)	24.24	0.85	0.12	0.1893
LLFlow (Wang et al., 2022)	25.19	0.93	0.11	0.1763
Ours	25.51	0.89	0.16	0.1841
Ours*	24.92	0.88	0.16	0.1751

4.3. Performance on the LOL dataset

We compare our CLE Diffusion with state-of-the-art low-light image enhancement methods. For quantitative comparison, we select several representative methods, including deep Retinex-based methods (RetinexNet (Chen et al., 2018b), KinD++ (Zhang et al., 2019)), CNN-based methods (MAXIM (Tu et al., 2022), HWMNet (Fan et al., 2022)), a Zero-Shot learning method (Zero-DCE (Guo et al., 2020)), a Semi-Supervised Learning method (DRBN (Yang et al., 2020)), a GAN-based (EnlightenGAN (Jiang et al., 2021)) and a Flow-based model (LLFlow (Wang et al., 2022)).

As shown in Tab. 1, our method surpasses the compared methods in terms of PSNR and ranks third in SSIM, LPIPS, and LI-LPIPS. These findings provide evidence that our CLE Diffusion technique can generate superior samples from the dataset distribution in comparison to current approaches.

Visual comparisons are shown in Fig. 5, our proposed method surpasses other methods by accurately aligning the predicted luminance with the ground truth. The outputs generated by our framework exhibit significantly higher quality, characterized by well-suppressed noise, in contrast to the results produced by other methods. These alternative approaches display color shifts, noisy artifacts, and irregular illumination.

4.4. Performance on MIT-Adobe FiveK dataset

To further validate the ability of CLE Diffusion, we test on the MIT-Adobe FiveK dataset, which is much larger than LOL dataset and includes more diverse scenes. Tab. 2 shows the comparison results with other methods. Our method achieves the highest values in terms of PSNR, while also producing results comparable to state-of-the-art methods in terms of SSIM. As indicated by Fig. 6, our method is more consistent with ground truth in terms of color distortion. Compared to previous methods, our methods can preserve better details and color consistency as exemplified in zoomed-in regions.

Table 2. Comparisons on the MIT-Adobe FiveK dataset.

Method	PSNR↑	SSIM↑
EnlightenGAN (Jiang et al., 2021)	17.74	0.83
CycleGAN (Zhu et al., 2017)	18.23	0.84
Exposure (Hu et al., 2018)	22.35	0.86
DPE (Chen et al., 2018a)	24.08	0.92
UEGAN (Ni et al., 2020)	25.00	0.93
MAXIM (Tu et al., 2022)	26.15	0.95
HWMNet (Fan et al., 2022)	26.29	0.96
Ours	29.81	0.97

4.5. Controllable Light Enhancement

With the powerful capabilities of CLE-Diffusion and Mask-CLE-Diffusion, we can not only control the global brightness of images at a specified brightness level but also achieve precise brightness control for desired regions. Users can input multiple desired brightness levels into the network to sample multiple images with various brightness levels as depicted in Fig. 7. For images with distinct brightness levels, the fidelity of details is remarkably high and the images exhibit no apparent color distortion, overexposure, or underexposure while maintaining a consistent global brightness.

By utilizing SAM model (Kirillov et al., 2023), users can effortlessly obtain regions of interest(ROI) by clicking on the image. As demonstrated in Fig. A, Mask-CLE-Diffusion is able to selectively enhance specific regions to attain desired brightness levels. In contrast to directly blending normal-light and low-light images using masks, our results have a more natural appearance and better emphasize the ROI.

4.6. Ablation Study

We conduct ablation studies to validate the effectiveness of our proposed loss functions and network design. As shown in Fig. 8, when training with only $\mathcal{L}_{\text{simple}}$ , the output images suffer from color distortion, artifacts, and unsatisfactory lighting. And when training without the conditioning color map and SNR map, the network produces noisy results which do not appear in the results of the full model. Tab. 3 also presents a notable decrease in all four metrics. These results show that auxiliary losses, the conditioning color map, and the SNR map contribute to improving the overall performance of the model.

Table 3. Ablation study for CLE Diffusion’s components.

Method	PSNR↑	SSIM↑	LPIPS↓	LI-LPIPS↓
Only $\mathcal{L}_{\text{simple}}$	8.85	0.59	0.63	0.2940
Only concat $x$	22.22	0.81	0.22	0.2186
CLE-Diffusion	25.51	0.89	0.16	0.1841

5. Conclusion

In this work, we present a novel diffusion framework, CLE Diffusion, for Controllable Light Enhancement. The framework is based on a diffusion model conditioned on an illumination embedding, which enables seamless control of brightness during inference. Additionally, we incorporate the Segment-Anything Model (SAM) to allow users to easily enhance specific regions of interest with a single click. Through extensive experiments, we demonstrate that our CLE Diffusion achieves competitive results in terms of quantitative metrics, qualitative performance, and versatile controllability.

Limitations. Despite the exciting results from CLE Diffusion, the model suffers from the slow inference speed of diffusion models, since multiple inferences are required during the sampling process. Moreover, challenging cases (e.g., complex lighting, blurry scenes) need to be further explored. In the future, we will investigate how to extend our framework to more general scenarios.

Acknowledgment. This work was supported in part by the Fundamental Research Funds for the Central Universities (No.K22RC00010) and A*STAR Career Development Funding (CDF) Award (No.222D- 800031).

References

(1)
Abdullah-Al-Wadud et al. (2007) M. Abdullah-Al-Wadud, M. H. Kabir, M. A. Akber Dewan, and O. Chae. 2007. A Dynamic Histogram Equalization for Image Contrast Enhancement. IEEE Transactions on Consumer Electronics 53, 2 (May 2007), 593–600.
Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein generative adversarial networks. In International conference on machine learning. PMLR, 214–223.
Bychkovsky et al. (2011) Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand. 2011. Learning Photographic Global Tonal Adjustment with a Database of Input / Output Image Pairs. In The Twenty-Fourth IEEE Conference on Computer Vision and Pattern Recognition.
Canny (1986) John Canny. 1986. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence 6 (1986), 679–698.
Chen et al. (2022b) Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. 2022b. Diffusiondet: Diffusion model for object detection. arXiv preprint arXiv:2211.09788 (2022).
Chen et al. (2018b) Wei Chen, Wang Wenjing, Yang Wenhan, and Liu Jiaying. 2018b. Deep Retinex Decomposition for Low-Light Enhancement. In British Machine Vision Conference. British Machine Vision Association.
Chen et al. (2018a) Yu-Sheng Chen, Yu-Ching Wang, Man-Hsin Kao, and Yung-Yu Chuang. 2018a. Deep photo enhancer: Unpaired learning for image enhancement from photographs with gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6306–6314.
Chen et al. (2022a) Zeyuan Chen, Yifan Jiang, Dong Liu, and Zhangyang Wang. 2022a. CERL: A Unified Optimization Framework for Light Enhancement With Realistic Noise. IEEE Transactions on Image Processing (2022).
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. Ieee, 248–255.
Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems 34 (2021), 8780–8794.
Fan et al. (2022) Chi-Mao Fan, Tsung-Jung Liu, and Kuan-Hsien Liu. 2022. Half wavelet attention on M-Net+ for low-light image enhancement. In 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 3878–3882.
Fu et al. (2015) Xueyang Fu, Yinghao Liao, Delu Zeng, Yue Huang, Xiao-Ping Zhang, and Xinghao Ding. 2015. A probabilistic method for image enhancement with simultaneous illumination and reflectance estimation. IEEE Transactions on Image Processing 24, 12 (2015), 4965–4977.
Fu et al. (2016) X. Fu, D. Zeng, Y. Huang, X. P. Zhang, and X. Ding. 2016. A Weighted Variational Model for Simultaneous Reflectance and Illumination Estimation. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. 2782–2790.
Gu et al. (2023) Jiatao Gu, Alex Trevithick, Kai-En Lin, Josh Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ramamoorthi. 2023. NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion. arXiv preprint arXiv:2302.10109 (2023).
Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans. Advances in neural information processing systems 30 (2017).
Guo et al. (2020) Chunle Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin Cong. 2020. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1780–1789.
Guo et al. (2017) X. Guo, Y. Li, and H. Ling. 2017. LIME: Low-Light Image Enhancement via Illumination Map Estimation. IEEE Trans. on Image Processing 26, 2 (Feb 2017), 982–993.
Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
Ho and Salimans (2022) Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022).
Hu et al. (2018) Yuanming Hu, Hao He, Chenxi Xu, Baoyuan Wang, and Stephen Lin. 2018. Exposure: A white-box photo post-processing framework. ACM Transactions on Graphics (TOG) 37, 2 (2018), 1–17.
Jiang et al. (2021) Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, and Zhangyang Wang. 2021. Enlightengan: Deep light enhancement without paired supervision. IEEE Transactions on Image Processing 30 (2021), 2340–2349.
Jobson et al. (1997) D. J. Jobson, Z. Rahman, and G. A. Woodell. 1997. A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Trans. on Image Processing 6, 7 (Jul 1997), 965–976.
Kanizo et al. (2013) Yossi Kanizo, David Hay, and Isaac Keslassy. 2013. Palette: Distributing tables in software-defined networks. In 2013 Proceedings IEEE INFOCOM. IEEE, 545–549.
Kingma et al. (2021) Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. 2021. Variational diffusion models. Advances in neural information processing systems 34 (2021), 21696–21707.
Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. 2023. Segment anything. arXiv preprint arXiv:2304.02643 (2023).
Land (1977) Edwin H. Land. 1977. The retinex theory of color vision. Sci. Amer (1977), 108–128.
Lee et al. (2013) Chang-Hsing Lee, Jau-Ling Shih, Cheng-Chang Lien, and Chin-Chuan Han. 2013. Adaptive multiscale retinex for image contrast enhancement. In Signal-Image Technology & Internet-Based Systems (SITIS), 2013 International Conference on. IEEE, 43–50.
Li et al. (2015) L. Li, R. Wang, W. Wang, and W. Gao. 2015. A low-light image enhancement method for both denoising and contrast enlarging. In Proc. IEEE Int’l Conf. Image Processing. 3730–3734.
Li et al. (2018) M. Li, J. Liu, W. Yang, X. Sun, and Z. Guo. 2018. Structure-Revealing Low-Light Image Enhancement Via Robust Retinex Model. IEEE Trans. on Image Processing 27, 6 (June 2018), 2828–2841.
Liu et al. (2021) Jiaying Liu, Dejia Xu, Wenhan Yang, Minhao Fan, and Haofeng Huang. 2021. Benchmarking low-light image enhancement and beyond. International Journal of Computer Vision 129 (2021), 1153–1184.
Lore et al. (2017) Kin Gwn Lore, Adedotun Akintayo, and Soumik Sarkar. 2017. LLNet: A deep autoencoder approach to natural low-light image enhancement. Pattern Recognition 61 (2017), 650 – 662.
Metz et al. (2016) Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. 2016. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163 (2016).
Ni et al. (2020) Zhangkai Ni, Wenhan Yang, Shiqi Wang, Lin Ma, and Sam Kwong. 2020. Towards unsupervised deep image enhancement with generative adversarial network. IEEE Transactions on Image Processing 29 (2020), 9140–9151.
Nichol et al. (2021) Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021).
Özdenizci and Legenstein (2023) Ozan Özdenizci and Robert Legenstein. 2023. Restoring vision in adverse weather conditions with patch-based denoising diffusion models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
Park et al. (2022) GyuTae Park, SungJoon Son, JaeYoung Yoo, SeHo Kim, and Nojun Kwak. 2022. Matteformer: Transformer-based image matting via prior-tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11696–11706.
Perez et al. (2018) Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
Pizer et al. (1990) S. M. Pizer, R. E. Johnston, J. P. Ericksen, B. C. Yankaskas, and K. E. Muller. 1990. Contrast-limited adaptive histogram equalization: speed and effectiveness. In Proceedings of Conference on Visualization in Biomedical Computing. 337–345.
Poole et al. (2022) Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. 2022. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022).
Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022).
Ravuri and Vinyals (2019) Suman Ravuri and Oriol Vinyals. 2019. Classification accuracy score for conditional generative models. Advances in neural information processing systems 32 (2019).
Ren et al. (2019) Wenqi Ren, Sifei Liu, Lin Ma, Qianqian Xu, Xiangyu Xu, Xiaochun Cao, Junping Du, and Ming-Hsuan Yang. 2019. Low-light image enhancement via a deep hybrid network. IEEE Transactions on Image Processing 28, 9 (2019), 4364–4375.
Ren et al. (2018) Xutong Ren, Mading Li, Wen-Huang Cheng, and Jiaying Liu. 2018. Joint enhancement and denoising method via sequential decomposition. In 2018 IEEE international symposium on circuits and systems (ISCAS). IEEE, 1–5.
Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention. Springer, 234–241.
Saharia et al. (2022a) Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. 2022a. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings. 1–10.
Saharia et al. (2022b) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. 2022b. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487 (2022).
Saharia et al. (2022c) Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. 2022c. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
Salimans and Ho (2022) Tim Salimans and Jonathan Ho. 2022. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 (2022).
Saxena et al. (2023) Saurabh Saxena, Abhishek Kar, Mohammad Norouzi, and David J Fleet. 2023. Monocular Depth Estimation using Diffusion Models. arXiv preprint arXiv:2302.14816 (2023).
Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402 (2022).
Shen et al. (2017) Liang Shen, Zihan Yue, Fan Feng, Quan Chen, Shihao Liu, and Jie Ma. 2017. Msr-net: Low-light image enhancement using deep convolutional network. arXiv preprint arXiv:1711.02488 (2017).
Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
Singer et al. (2023) Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. 2023. Text-To-4D Dynamic Scene Generation. arXiv preprint arXiv:2301.11280 (2023).
Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
Song and Ermon (2019) Yang Song and Stefano Ermon. 2019. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems 32 (2019).
Suvorov et al. (2022) Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. 2022. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2149–2159.
Tu et al. (2022) Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. 2022. Maxim: Multi-axis mlp for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5769–5780.
Vahdat et al. (2021) Arash Vahdat, Karsten Kreis, and Jan Kautz. 2021. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems 34 (2021), 11287–11302.
Wang et al. (2019) Ruixing Wang, Qing Zhang, Chi-Wing Fu, Xiaoyong Shen, Wei-Shi Zheng, and Jiaya Jia. 2019. Underexposed photo enhancement using deep illumination estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6849–6857.
Wang et al. (2013) S. Wang, J. Zheng, H. M. Hu, and B. Li. 2013. Naturalness Preserved Enhancement Algorithm for Non-Uniform Illumination Images. IEEE Trans. on Image Processing 22, 9 (Sept 2013), 3538–3548.
Wang et al. (2022) Yufei Wang, Renjie Wan, Wenhan Yang, Haoliang Li, Lap-Pui Chau, and Alex Kot. 2022. Low-light image enhancement with normalizing flow. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 2604–2612.
Wang et al. (2021) Yufei Wang, Renjie Wan, Wenhan Yang, Haoliang Li, Lap-Pui Chau, and Alex C Kot. 2021. Low-Light Image Enhancement with Normalizing Flow. arXiv preprint arXiv:2109.05923 (2021).
Watson et al. (2022) Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. 2022. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628 (2022).
Wei* et al. (2018) Chen Wei*, Wenjing Wang*, Wenhan Yang, and Jiaying Liu. 2018. Deep Retinex Decomposition for Low-Light Enhancement. In British Machine Vision Conference.
Whang et al. (2022) Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G Dimakis, and Peyman Milanfar. 2022. Deblurring via stochastic refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16293–16303.
Xu et al. (2022a) Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. 2022a. NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views. arXiv e-prints (2022), arXiv–2211.
Xu et al. (2022b) Dejia Xu, Hayk Poghosyan, Shant Navasardyan, Yifan Jiang, Humphrey Shi, and Zhangyang Wang. 2022b. ReCoRo: Region-Controllable Robust Light Enhancement with User-Specified Imprecise Masks. In Proceedings of the 30th ACM International Conference on Multimedia. 1376–1386.
Xu et al. (2022c) Xiaogang Xu, Ruixing Wang, Chi-Wing Fu, and Jiaya Jia. 2022c. SNR-aware low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17714–17724.
Yang et al. (2020) Wenhan Yang, Shiqi Wang, Yuming Fang, Yue Wang, and Jiaying Liu. 2020. From fidelity to perceptual quality: A semi-supervised approach for low-light image enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 3063–3072.
Zhang et al. (2018a) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018a. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition. 586–595.
Zhang et al. (2018b) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018b. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
Zhang et al. (2012) X. Zhang, P. Shen, L. Luo, L. Zhang, and J. Song. 2012. Enhancement and noise reduction of very low light level images. In Proc. IEEE Int’l Conf. Pattern Recognition. 2034–2037.
Zhang et al. (2021) Yonghua Zhang, Xiaojie Guo, Jiayi Ma, Wei Liu, and Jiawan Zhang. 2021. Beyond brightening low-light images. International Journal of Computer Vision 129 (2021), 1013–1037.
Zhang et al. (2019) Yonghua Zhang, Jiawan Zhang, and Xiaojie Guo. 2019. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM international conference on multimedia. 1632–1640.
Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision. 2223–2232.

Appendix A Implementation Details

All experiments are conducted on one NVIDIA RTX3090 GPU with PyTorch. The Adamw optimizer is used with the initial learning rate of $5\times 10^{5}$ . The weight decay is $1\times 10^{4}$ . A batch size of 16 is applied for 12000 epochs in LOL dataset (Chen et al., 2018b) and 2,000 epochs in MIT-Adobe FiveK dataset (Bychkovsky et al., 2011). The brightness level information is derived from the average values of the patches. We set $W_{br}$ , $W_{col}$ , $W_{ssim}$ , $W_{lpips}$ to 20,100,2.83 and 50, respectively.

During the training phase, we randomly crop images into 128 $\times$ 128 randomly and perform horizontal and vertical flipping for data augmentation. Similar to prior works, we normalize the input pixels into the range of $[-1,1]$ to stabilize the training. However, the datasets for low-light image enhancement are primarily distributed in the lower range of values and exhibit characteristics of a short-tailed distribution. We empirically discover that normalizing the input data to Gaussian distribution can ease the optimization.

Appendix B More Quantitative comparisons

As shown in Tab. A, on the MIT-Adobe FiveK dataset, we outperform state-of-the-art methods and achieve the best performance across all metrics. To ensure consistency, we revisited our testing scheme and adopted the procedure described in MAXIM (Tu et al., 2022), which involves center cropping the images to 512x512 prior to metric calculations. Additionally, we identified that in MAXIM (Tu et al., 2022), the reported PSNR and SSIM values are computed solely on the Y channel (grayscale version). Consequently, we recalculated our performance on the MIT-Adobe FiveK dataset and showcased our superiority compared to prior methods, achieving a notable improvement of +3.65 dB.

Table A. Quantitative comparisons on MIT-Adobe Fivek dataset

	PSNR $\uparrow$	PSNR-Gray $\uparrow$	SSIM $\uparrow$	SSIM-Gray $\uparrow$	LPIPS $\downarrow$	LI-LPIPS $\downarrow$
LLFlow	19.74	21.04	0.80	0.80	0.1728	0.2665
HWMNet	24.41	26.30	0.93	0.96	0.0812	0.1428
MAXIM	24.60	26.16	0.93	0.95	0.0752	0.1491
Our	26.62	29.81	0.94	0.97	0.0601	0.1385

Appendix C More Ablation Study

To further explore the contribution of each auxiliary loss function, we present the results of CLE Diffusion trained with various combinations of losses in Tab. LABEL:tab:abl. Removing the brightness loss fails to achieve sufficient brightness enhancement. The angular color loss function contributes to reducing color distortion and improves overall performance. Furthermore, we separately train the networks without SSIM loss or perceptual loss to demonstrate the positive impact of the auxiliary loss functions.

Table B. Quantitative comparisons on the LOL dataset for ablations of each auxiliary loss function. The full model presents better results than its partial versions, demonstrating the effectiveness of the auxiliary loss functions.

$L_{\text{simple}}$	$L_{\text{br}}$	$L_{\text{col}}$	$L_{\text{ssim}}$	$L_{\text{lpips}}$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	LI-LPIPS $\downarrow$
$\checkmark$					8.85	0.59	0.63	0.2940
$\checkmark$		$\checkmark$	$\checkmark$	$\checkmark$	22.45	0.85	0.20	0.1908
$\checkmark$	$\checkmark$		$\checkmark$	$\checkmark$	25.15	0.87	0.17	0.1885
$\checkmark$	$\checkmark$	$\checkmark$		$\checkmark$	25.04	0.87	0.18	0.1866
$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$		25.07	0.86	0.20	0.1918
$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$	25.51	0.89	0.16	0.1841

Appendix D More visual comparisons

We provide more cases about region controllable light enhancement in Fig. A. Fig. B show comparisons of performance on a real-world image. Fig.C shows global controllable light enhancement on MIT-Adobe FiveK dataset. We provide comparisons on global brightness control ability in Fig.D. We also compare performance on normal light image inputs in Fig. E. Fig. F shows comparisons of performance on Segment-Anything model. Fig. G give comparisons on LOL dataset.