
Deep Decomposition and Bilinear Pooling Network for Blind Night-Time Image Quality Evaluation

Qiuping Jiang, Jiawu Xu, Yudong Mao, Wei Zhou, Xiongkuo Min, Guangtao Zhai Q. Jiang, J. Xu, and Y. Mao are with the School of Information Science and Engineering, Ningbo University, Ningbo 315211, China (e-mail: [email protected]). W. Zhou is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada (e-mail: [email protected]). X. Min and G. Zhai are with the Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai 200240, China (minxiongkuo, [email protected]).
Abstract

Blind image quality assessment (BIQA), which aims to accurately predict image quality without any pristine reference information, has received extensive attention in the past decades. In particular, great progress has been achieved with the help of deep neural networks. However, BIQA for night-time images (NTIs), which usually suffer from complicated authentic distortions such as reduced visibility, low contrast, additive noise, and color distortion, remains less investigated. These diverse authentic degradations particularly challenge the design of effective deep neural networks for blind NTI quality evaluation (NTIQE). In this paper, we propose a novel deep decomposition and bilinear pooling network (DDB-Net) to better address this issue. The DDB-Net contains three modules, i.e., an image decomposition module, a feature encoding module, and a bilinear pooling module. The image decomposition module is inspired by the Retinex theory and decouples the input NTI into an illumination layer component responsible for illumination information and a reflection layer component responsible for content information. Then, the feature encoding module learns feature representations of the degradations rooted in the two decoupled components separately. Finally, by modeling illumination-related and content-related degradations as two-factor variations, the two feature sets are bilinearly pooled together to form a unified representation for quality prediction. The superiority of the proposed DDB-Net has been well validated by extensive experiments on several benchmark datasets. The source code will be made available soon.

Index Terms:
Night-time image, image quality assessment, blind/no-reference, Retinex decomposition.

I Introduction

Due to poor lighting conditions at night-time, captured night-time images (NTIs) usually exhibit poor visibility and low visual quality. Given that high-quality NTIs are crucial for consumer photography and practical applications such as automated driving systems, many NTI quality/visibility enhancement algorithms have been proposed. However, research efforts on designing objective quality metrics that can automatically quantify the visual quality of NTIs and compare the performance of different NTI enhancement algorithms remain limited, which hinders the development of this field. Generally, objective image quality assessment (IQA) methods can be roughly divided into three categories, i.e., full-reference (FR), reduced-reference (RR), and no-reference (NR) [1]. Among them, FR and RR IQA methods require full and partial reference information, respectively. However, for the NTIs considered here, there is usually no pristine image available to provide any reference information. Therefore, NR-IQA is more valuable for NTIs in this regard.

Early studies on NR-IQA mainly focus on specific distortion types, i.e., assuming that a particular distortion type is known and then extracting distortion-specific features to predict image quality [2, 3, 4, 5, 6, 7, 8]. Obviously, such specificity limits their applicability to scenarios like real-world night-time imaging. Although rapid advances in the IQA community during the last decade have produced general-purpose blind IQA (BIQA) methods [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26] that can simultaneously handle a number of distortion types, their efficacy is still limited to synthetic distortions. This is evidenced by the fact that they usually validate their performance on legacy synthetic-distortion benchmark databases where the distorted images are simulated from pristine corpora in the laboratory. As a result, existing general-purpose BIQA methods still cannot work well with authentically distorted images like NTIs captured in real-world night-time scenarios. Recently, inspired by the success of deep neural networks in many image processing and computer vision tasks, great progress has also been achieved on deep learning-based BIQA. However, deep learning-based BIQA for NTIs, which usually suffer from complicated authentic distortions such as reduced visibility, low contrast, additive noise, invisible details, and color distortion, remains less investigated. The diverse authentic degradations in NTIs pose great challenges to the design of highly effective end-to-end deep network architectures for blind NTI quality evaluation (NTIQE).

To evaluate the visual quality of NTIs, Xiang et al. [27] first established a dedicated large-scale natural NTI database (NNID), which contains 2,240 NTIs with 448 different image contents captured by three different photographic devices in real-world scenarios, along with their corresponding subjective quality scores (obtained by conducting human subjective experiments). Then, an NR quality metric called BNBT was proposed by considering both brightness and texture features. The experimental results on the NNID database have demonstrated an acceptable performance of BNBT, i.e., the quality scores predicted by BNBT are consistent with the ground truth subjective quality scores. Despite its effectiveness, BNBT requires elaborately designed handcrafted features, which motivates us to adopt an end-to-end data-driven method by taking advantage of deep learning.

However, designing tailored end-to-end deep neural networks for blind NTIQE is non-trivial due to the diverse authentic degradations. The main challenge is that heterogeneous distortions in NTIs make it difficult to learn a unified mapping from an input NTI to a quality score. An important observation is that the commonly encountered distortions in NTIs affect either illumination perception or content perception. For example, color distortion and additive noise only affect content perception, while reduced visibility and low contrast only affect illumination perception. Thus, it is intuitive to decompose the input NTI into two independent components accounting for illumination information and content information, respectively. Assisted by such a tailored image decomposition process, the degradation features related to illumination perception and content perception can be better learned and then fused to facilitate blind NTIQE.

Figure 1: The proposed deep decomposition and bilinear pooling network (DDB-Net) for blind NTIQE. It contains an image decomposition module, a feature encoding module, and a bilinear pooling module. The image decomposition module takes an NTI as input and decouples it into two layer components, i.e., illumination ($L$) and reflection ($R$). Then, the feature encoding module involves learning feature representations of degradations that are rooted in the illumination and reflectance separately. Finally, the two feature sets are bilinearly pooled and concatenated together to form a unified representation for quality prediction.

In this paper, we propose a novel deep decomposition and bilinear pooling network (DDB-Net) for blind NTIQE to better address the above issues. As shown in Fig. 1, our DDB-Net contains three modules, namely an image decomposition module, a feature encoding module, and a bilinear pooling module. Inspired by the Retinex theory [28], the image decomposition module decouples the input NTI into two layer components, i.e., one component (illumination) responsible for illumination information and the other (reflection) responsible for content information. Then, the feature encoding module learns feature representations of the degradations rooted in the two decoupled components separately. Finally, by modeling illumination-related and content-related degradations as two-factor variations, the two feature sets are bilinearly pooled and concatenated together to form a unified representation for quality prediction. Extensive experiments conducted on several databases have demonstrated the superiority of the proposed DDB-Net against state-of-the-art BIQA methods. In summary, this paper presents the following contributions:

1) We propose a novel “decompose-and-conquer” end-to-end deep neural network with an embedded Retinex decomposition module to better address the sophisticated blind NTIQE problem.

2) We introduce a self-reconstruction-based feature encoding module and design tailored loss functions to regularize network training towards learning more targeted illumination-related and content-related feature representations from the two decomposed components separately.

3) We model the illumination-related and content-related degradations as two-factor variations and perform bilinear pooling to fuse the two sets of features into a unified feature representation for quality prediction of NTIs.

The rest of this paper is organized as follows. Section II introduces the related works. Section III describes the proposed method in detail. Section IV presents the experimental results. Section V concludes the paper.

II Related Works

In this section, we will review the existing related works, including traditional blind image quality assessment, deep learning-based blind image quality assessment, and blind image quality assessment in poor conditions.

II-A Traditional Blind Image Quality Assessment

In the literature of traditional blind image quality assessment, natural scene statistics (NSS) and the human visual system (HVS) are two main cues for designing objective BIQA models. As for NSS-based frameworks, Moorthy et al. [12] proposed the Distortion Identification-based Image Verity and INtegrity Evaluation (DIIVINE) index to evaluate perceptual image quality in a no-reference manner, which is composed of distortion identification and NSS-based quality regression. Likewise, CurveletQA [11] also operates within a two-stage framework containing distortion classification and quality assessment. Different from DIIVINE, the quality assessment of CurveletQA is based on NSS features in the curvelet domain. Besides the two-stage frameworks, other NSS-based BIQA algorithms have been developed. For example, Mittal et al. [10] presented the blind/referenceless image spatial quality evaluator (BRISQUE), which operates in the spatial domain. Moreover, BLIINDS-II [9] was proposed by exploiting an NSS model of discrete cosine transform (DCT) coefficients. In addition, some opinion-unaware NSS-based BIQA methods, i.e., so-called “completely blind” models such as the Natural Image Quality Evaluator (NIQE) [20] and ILNIQE [19], have shown competitive performance with the help of a large corpus of natural images.

Beyond NSS features, many other statistical factors have been considered by researchers. In the family of two-stage frameworks, Liu et al. [17] proposed the Spatial-Spectral Entropy-based Quality (SSEQ) index, where local spatial and spectral entropy features are used to predict perceptual image quality. The GM-LOG method [15] extracts the joint statistics of local contrast features, including the gradient magnitude and the Laplacian of Gaussian response, to assess image quality. Apart from statistical structural features, the luminance histogram is used in the NRSL model [13]. NSS features are combined with contrast, sharpness, brightness, and colorfulness measures to form the BIQME framework [18].

For HVS-based objective BIQA methods, HVS-inspired features are applied to estimate perceptual image quality. Among these methods, Gu et al. [14] proposed the No-reference Free Energy-based Robust Metric (NFERM) on the basis of the free energy principle. In [21], Li et al. used contrast masking to design a BIQA model based on structural degradation. Besides, based on similar HVS properties, they proposed GWH-GLBP by computing the gradient-weighted histogram of local binary patterns [16].

However, the above-mentioned conventional BIQA methods generally need elaborately designed handcrafted features based on pre-defined NSS or HVS mechanisms. Thus, resorting to data-driven methods based on deep learning is a promising alternative.

II-B Deep Learning-based Blind Image Quality Assessment

Recently, deep learning has achieved great success in the field of blind image quality assessment. These methods can typically be divided into two categories: those using pre-trained deep features and end-to-end learning ones. For the first category, Wu et al. [29] proposed HFD-BIQA, which integrates deep semantic features from ResNet [30] with local structure features. Moreover, a Network in Network (NIN) model [31] pre-trained on ImageNet [32] was utilized to make image quality predictions. For the second category, Kang et al. [33] proposed a relatively shallow convolutional neural network (CNN) for BIQA. Each patch is assigned the subjective quality score of the corresponding image as the ground-truth target for training, and the visual quality of the whole image is then calculated by averaging the predicted patch quality values. Furthermore, a BIQA model was developed based on the shearlet transform and stacked auto-encoders [34]. In [35], RankIQA was designed to synthesize a large number of ranked images for training a Siamese network. Ma et al. [36] proposed an end-to-end optimized deep neural network for BIQA. Additionally, Bosse et al. [37] presented the end-to-end WaDIQaM that can blindly learn perceptual image quality. The deep bilinear convolutional neural network (DBCNN) was proposed to bilinearly pool feature representations into a single quality score [38].

Although deep learning-based BIQA models can deliver good performance, they are not suitable for evaluating the perceptual quality of NTIs. This is mainly because these models usually neglect the specific characteristics of NTIs, e.g., reduced visibility, low contrast, additive noise, invisible details, and color distortion.

II-C Quality Assessment for Images Captured in Poor Conditions

In real-world applications, people may encounter many kinds of poor imaging environments, e.g., hazy, rainy, and underwater scenes. In such poor conditions, capturing high-quality images is quite challenging, and thus addressing the blind quality assessment issue is urgently needed.

In the quality evaluation of hazy images, Min et al. [39] proposed haze-removing, structure-preserving, and over-enhancement features to construct an objective quality assessment index. They also used synthetic hazy images to build an effective quality assessment model for image dehazing [40]. For image deraining applications, a deep quality assessment model with a feature embedding network was developed to predict the visual quality of real-world derained images [41]. As for underwater image quality evaluation, Yang et al. [42] proposed to linearly combine chroma, saturation, and contrast factors to quantify the perceptual quality of underwater images. In [43], colorfulness, sharpness, and contrast measures were fused to predict underwater image quality. Most recently, Jiang et al. [44] first constructed a large-scale benchmark dataset for the quality evaluation of underwater image enhancement and then proposed a no-reference underwater image quality metric by extracting both color and luminance features.

It should be noted that the degradation characteristics of NTIs are different from those of hazy, rainy and underwater images. Thus, these quality evaluation methods developed for hazy, rainy and underwater images cannot achieve good performance on NTIs. In this paper, we focus on designing efficient end-to-end deep network architectures for blind NTIQE.

III DDB-Net

In this section, we first describe the architecture of the image decomposition module. Then, we introduce the self-reconstruction-based feature encoding module for illumination-related and content-related degradation feature learning. Finally, we introduce the bilinear pooling module for fusing the two sets of features.

Figure 2: Detailed architecture of the image decomposition module.

III-A Image Decomposition Module

According to the Retinex theory [28], a single image $I$ can be considered as a composition of two independent layer components, i.e., reflectance $R$ and illumination $L$, in the fashion of $I = R \otimes L$, where $\otimes$ denotes the element-wise product. However, recovering two components from one single input is an ill-posed problem. Although the Retinex theory has been widely applied in relevant applications [45, 46, 47, 48], there are obvious differences between our proposed method and these works. First, the problem we focus on is different, i.e., we focus on NTIQE while these works [45, 46, 47, 48] focus on low-light/night-time image enhancement. Second, none of them has decomposed a single image into two independent components with an embedded deep neural network module in a fully unsupervised manner. In what follows, we present how to design an effective deep image decomposition module to achieve this goal.

The detailed architecture of our deep image decomposition module is shown in Fig. 2. It contains two streams corresponding to the reflection component ($R$) and the illumination component ($L$), respectively. The reflection stream adopts a 5-layer U-Net-like architecture, followed by two convolutional (conv) layers and a Sigmoid layer at the end, while the illumination stream is composed of two conv+ReLU layers and a conv layer applied to feature maps concatenated with those from the reflection stream, followed by a Sigmoid layer at the end.
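For concreteness, a minimal PyTorch sketch of this two-stream module is given below. The text does not fully specify the U-Net depth, the channel widths, or whether the illumination map is single-channel, so those choices (a plain 32-channel conv stack standing in for the U-Net backbone, a one-channel illumination output) are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class DecompositionNet(nn.Module):
    """Two-stream Retinex-style decomposition: input NTI -> (reflectance R, illumination L)."""
    def __init__(self, ch=32):
        super().__init__()
        # Reflection stream: U-Net-like backbone (sketched as a plain conv stack here),
        # followed by two conv layers and a Sigmoid.
        self.reflect_backbone = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.reflect_head = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.Conv2d(ch, 3, 3, padding=1),
            nn.Sigmoid(),
        )
        # Illumination stream: two conv+ReLU layers, then a conv layer on features
        # concatenated with the reflection-stream features, then a Sigmoid.
        self.illum_convs = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.illum_head = nn.Sequential(
            nn.Conv2d(2 * ch, 1, 3, padding=1),   # single-channel illumination (assumption)
            nn.Sigmoid(),
        )

    def forward(self, x):
        f_r = self.reflect_backbone(x)
        R = self.reflect_head(f_r)                       # reflectance, 3 channels in [0, 1]
        f_l = self.illum_convs(x)
        L = self.illum_head(torch.cat([f_l, f_r], 1))    # illumination, 1 channel in [0, 1]
        return R, L
```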

Typically, there are no ground truth reflection and illumination maps available for supervised training. Therefore, designing a well-defined reference-free loss function is the key to training a stable deep image decomposition module. In the literature, a basic assumption is that different shots of the same scene should share the same reflection component. Furthermore, although the illumination maps may vary intensively, they should have simple and mutually consistent structures. These observations inspire us to take a pair of images of the same scene as input and impose both reflection and illumination constraints between the image pair to train the image decomposition module without ground truth.

Specifically, during the training stage, the input to our image decomposition module is a pair consisting of an NTI and its corresponding exposure-adjusted image (EAI). We denote the input NTI and EAI by $[I_N, I_E]$. Their corresponding reflection and illumination components are denoted by $[R_N, R_E]$ and $[L_N, L_E]$, respectively. The image decomposition module is constrained by a hybrid loss defined upon these components. In the following, we first describe how to generate the EAI from the input NTI, and then illustrate the definitions of the different loss terms.

EAI generation: Given an input NTI $I_N$, we first generate multiple intermediate images with different exposure levels according to a camera response model that characterizes the relationship between pixel values and exposure ratios. Then, these intermediate images are fused to obtain an EAI $I_E$.

Since there is no available camera information, an existing camera response model [49] that characterizes the general relationship between pixel value and exposure ratio is applied:

\mathcal{E}(I,e) = I^{(e^{\alpha})} \cdot e^{\beta(1-e^{\alpha})}, \qquad (1)

where $I$ and $e$ represent the pixel value and the exposure ratio, respectively, and the parameters $\alpha=-0.3293$ and $\beta=1.1258$ are estimated by fitting a total of 201 real-world camera response curves provided in the DoRF database [50]. Specifically, the exposure ratios are $e, \cdots, e^{K}$, where the base ratio is empirically set to $e=2.4$ and the number of ratios is set to $K=4$, as in [51]. Based on the multi-exposure images $\{\mathcal{E}_{1}, \mathcal{E}_{2}, \mathcal{E}_{3}, \mathcal{E}_{4}\}$, the SPD-MEF algorithm [52] is performed to reconstruct a fused image, which is used as the EAI $I_E$.
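The following sketch illustrates the EAI generation step under the stated settings ($\alpha=-0.3293$, $\beta=1.1258$, $e=2.4$, $K=4$). Two assumptions are made: the base of the outer exponential in Eq. (1) is taken as Euler's number, following the camera response model of [49], and the SPD-MEF fusion [52] is replaced by a naive average purely as a placeholder so the pipeline runs end to end.

```python
import numpy as np

ALPHA, BETA = -0.3293, 1.1258   # fitted on the 201 DoRF camera response curves [50]
BASE_E, K = 2.4, 4              # base exposure ratio and number of ratios, as in [51]

def apply_crf(img, ratio):
    """Eq. (1): re-expose an image with pixel values in [0, 1] by the given exposure ratio."""
    gain = ratio ** ALPHA
    return np.power(img, gain) * np.exp(BETA * (1.0 - gain))

def generate_eai(nti):
    """Generate a rough exposure-adjusted image (EAI) from a night-time image."""
    exposures = [apply_crf(nti, BASE_E ** k) for k in range(1, K + 1)]
    return np.mean(exposures, axis=0)   # stand-in for SPD-MEF fusion [52]
```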

Inter-consistency loss: The inter-consistency loss includes a reflection consistency loss and an illumination mutual consistency loss. First, the reflection consistency loss $\mathcal{L}_{con}^{R}$ encourages reflection similarity and is defined as follows:

\mathcal{L}_{con}^{R} = \left\| R_{N} - R_{E} \right\|_{1}, \qquad (2)

where $\|\cdot\|_{1}$ denotes the $\ell_{1}$ norm. Second, the illumination mutual consistency loss $\mathcal{L}_{con}^{L}$ is defined as follows:

\mathcal{L}_{con}^{L} = f(M) = \left\| \frac{M}{c^{2}} \otimes \exp\left(-\frac{M^{2}}{2c^{2}}\right) \right\|_{1}, \qquad (3)
M = \left| \triangledown L_{N} \right| + \left| \triangledown L_{E} \right|, \qquad (4)

where $\triangledown$ denotes the first-order derivative operator along both the horizontal and vertical directions, and $c$ is a parameter controlling the shape of the penalty curve. To facilitate understanding, we draw the penalty curves with different values of $c$ in Fig. 3. As we can see, the penalty value first increases and then decreases to zero as $M$ increases. In our implementation, we set $c=0.1$. By minimizing such an illumination mutual consistency loss, mutually strong edges are encouraged to be well preserved while weak edges are suppressed.
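A minimal PyTorch sketch of the inter-consistency losses in Eqs. (2)-(4) is given below. Forward differences for the gradient operator and a mean (rather than summed) $\ell_1$ reduction are implementation assumptions; the `gradient()` helper is reused in the later loss sketches.

```python
import torch
import torch.nn.functional as F

def gradient(x):
    """First-order differences along horizontal and vertical directions (zero-padded)."""
    dx = F.pad(x[:, :, :, 1:] - x[:, :, :, :-1], (0, 1, 0, 0))
    dy = F.pad(x[:, :, 1:, :] - x[:, :, :-1, :], (0, 0, 0, 1))
    return dx, dy

def reflection_consistency_loss(R_N, R_E):
    """Eq. (2): l1 difference between the two reflectance maps."""
    return torch.mean(torch.abs(R_N - R_E))

def illumination_consistency_loss(L_N, L_E, c=0.1):
    """Eqs. (3)-(4): preserve mutually strong illumination edges, suppress weak ones."""
    dxn, dyn = gradient(L_N)
    dxe, dye = gradient(L_E)
    M = torch.abs(dxn) + torch.abs(dyn) + torch.abs(dxe) + torch.abs(dye)   # Eq. (4)
    penalty = (M / c ** 2) * torch.exp(-M ** 2 / (2 * c ** 2))              # Eq. (3)
    return torch.mean(penalty)
```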

Figure 3: Penalty curves with different values of $c$.

Intra-smoothness loss: Besides the inter-consistency loss mentioned above, we also consider an intra-component smoothness loss. On the one hand, the illumination maps should be piece-wise smooth; thus we introduce a structure-aware smoothness loss $\mathcal{L}_{sm}^{L}$ to constrain both $L_N$ and $L_E$:

\mathcal{L}_{sm}^{L} = \left\| \frac{\triangledown L_{N}}{\max\{(\triangledown R_{N})^{2}, \tau\}} \right\|_{1} + \left\| \frac{\triangledown L_{E}}{\max\{(\triangledown R_{E})^{2}, \tau\}} \right\|_{1}, \qquad (5)

where $\tau$ denotes a small positive constant, which is empirically set to $\tau=0.01$ to avoid a zero denominator. This loss measures the relative structure of the illumination with respect to the reflection. Therefore, the illumination loss is aware of the image structure reflected by the reflection component. Specifically, for a strong edge point in the reflection map, the penalty on the illumination will be small; for a point in a flat region of the reflection map, the penalty on the illumination becomes large. On the other hand, different from the illumination maps that should be piece-wise smooth, the reflectance maps usually tend to be piece-wise continuous. Thus, we directly use a total-variation loss $\mathcal{L}_{sm}^{R}$ to constrain both $R_N$ and $R_E$:

\mathcal{L}_{sm}^{R} = \left\| \triangledown R_{N} \right\|_{1} + \left\| \triangledown R_{E} \right\|_{1}. \qquad (6)
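The intra-smoothness losses of Eqs. (5)-(6) can be sketched in the same way, reusing the `gradient()` helper above. Reducing the three-channel reflectance gradient to one channel before the division is an assumption made to match a single-channel illumination map.

```python
import torch

def illumination_smoothness_loss(L, R, tau=0.01):
    """Eq. (5) for one (illumination, reflectance) pair; call it for both the NTI and the EAI."""
    dxl, dyl = gradient(L)
    dxr, dyr = gradient(R)
    grad_r = (torch.abs(dxr) + torch.abs(dyr)).mean(dim=1, keepdim=True)   # reduce to 1 channel
    grad_l = torch.abs(dxl) + torch.abs(dyl)
    return torch.mean(grad_l / torch.clamp(grad_r ** 2, min=tau))

def reflection_tv_loss(R):
    """One term of Eq. (6): total variation of a reflectance map."""
    dxr, dyr = gradient(R)
    return torch.mean(torch.abs(dxr) + torch.abs(dyr))
```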

Reconstruction loss: The third consideration is that the two decomposed components should well reproduce the input via the element-wise product, which is constrained by an image reconstruction loss:

\mathcal{L}_{rec} = \left\| I_{N} - L_{N} \otimes R_{N} \right\|_{1} + \left\| I_{E} - L_{E} \otimes R_{E} \right\|_{1}. \qquad (7)

Image decomposition loss: The total loss for our image decomposition module is defined as follows:

\mathcal{L}_{idm} = \mathcal{L}_{con}^{R} + \mathcal{L}_{con}^{L} + \mathcal{L}_{sm}^{R} + \mathcal{L}_{sm}^{L} + \mathcal{L}_{rec}. \qquad (8)
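Putting the pieces together, a sketch of the reconstruction loss of Eq. (7) and the total decomposition loss of Eq. (8) (with all terms weighted equally, as written) could look as follows; it reuses the loss helpers sketched above.

```python
import torch

def reconstruction_loss(I_N, L_N, R_N, I_E, L_E, R_E):
    """Eq. (7): both decompositions should reproduce their inputs under the element-wise product."""
    return (torch.mean(torch.abs(I_N - L_N * R_N)) +
            torch.mean(torch.abs(I_E - L_E * R_E)))

def decomposition_loss(I_N, I_E, R_N, R_E, L_N, L_E):
    """Eq. (8): sum of all decomposition loss terms, equally weighted."""
    return (reflection_consistency_loss(R_N, R_E)
            + illumination_consistency_loss(L_N, L_E)
            + reflection_tv_loss(R_N) + reflection_tv_loss(R_E)
            + illumination_smoothness_loss(L_N, R_N)
            + illumination_smoothness_loss(L_E, R_E)
            + reconstruction_loss(I_N, L_N, R_N, I_E, L_E, R_E))
```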
Figure 4: Detailed architecture of the self-reconstruction-based encoder-decoder for hierarchical feature learning.

III-B Feature Encoding Module

Based on the reflection and illumination components generated by the image decomposition module, the next step is to build feature representations for each of these two components separately. In this work, we design a simple self-reconstruction-based encoder-decoder architecture to achieve this goal. Specifically, both the reflection and illumination components share the same feature encoding network architecture. However, these two feature encoding networks are constrained by different loss functions, i.e., tailored loss terms are designed to separately regularize the feature encoding of reflection and illumination components.

As shown in Fig. 4, our proposed self-reconstruction-based encoder-decoder module consists of two parts, namely an encoder and a decoder. The encoder receives either the reflection ($R$) or illumination ($L$) component as input and progressively forms a set of hierarchical feature representations $C_1$, $C_2$, $C_3$, $C_4$. Then, the decoder takes the last-layer feature representation $C_4$ as input and progressively reconstructs the input ($\hat{R}$ or $\hat{L}$). The encoder contains four stacked 3$\times$3 convolutional layers, each equipped with a ReLU activation. The stride of all convolutional layers is set to 1. The output channels of the convolutional layers in the encoder are set to 16, 32, 64, and 128, respectively.
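A sketch of this encoder-decoder is shown below. The encoder follows the description in the text (four 3x3 conv+ReLU layers, stride 1, channels 16/32/64/128); the decoder is not specified in detail, so a mirrored convolutional stack with a Sigmoid output is assumed.

```python
import torch
import torch.nn as nn

class SelfReconEncoderDecoder(nn.Module):
    """Self-reconstruction encoder-decoder; returns hierarchical features C1..C4 and the reconstruction."""
    def __init__(self, in_ch=3):
        super().__init__()
        channels = [16, 32, 64, 128]                  # encoder widths given in the text
        self.enc = nn.ModuleList()
        prev = in_ch
        for c in channels:
            self.enc.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=1, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.dec = nn.Sequential(                     # mirrored decoder (assumed)
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, in_ch, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        feats = []
        for layer in self.enc:
            x = layer(x)
            feats.append(x)                           # C1, C2, C3, C4
        return feats, self.dec(feats[-1])             # features and reconstructed input
```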

Since the reflection and illumination components contain NTI degradation information in different aspects, we design tailored losses to guide the reconstruction of each component. Specifically, the losses imposed on the reflection component reconstruction include a structure loss $\mathcal{L}_{str}$ and a color loss $\mathcal{L}_{color}$, while the loss imposed on the illumination component reconstruction is a mean squared error (MSE) loss $\mathcal{L}_{mse}$. In this way, the learned reflection feature representations will focus more on structural and color information due to the joint guidance of $\mathcal{L}_{str}$ and $\mathcal{L}_{color}$, while the learned illumination feature representations will focus more on luminance information.

1) Structure loss: It is well known that the HVS is highly sensitive to image structural information, and the distortions in low-quality NTIs inevitably alter structural perception [53]. Therefore, we adopt a structural similarity (SSIM) [53] loss between the input reflection image $R$ and its reconstructed version $\hat{R}$ to encourage the encoder to extract informative structural features. The SSIM loss is defined as follows:

\mathcal{L}_{str} = 1 - \mathrm{SSIM}(R, \hat{R}), \qquad (9)

where $\mathrm{SSIM}(R, \hat{R})$ computes the structural similarity score between $R$ and $\hat{R}$ according to the SSIM metric [53].

2) Color loss: NTIs commonly suffer from color distortions, and the reflection component contains almost all the color information of the scene. Therefore, a simple yet effective color loss between $R$ and $\hat{R}$ is also desired, which encourages the encoder to extract effective color features. As observed in [54], a blurring operation removes the high frequencies of an image and facilitates color comparison. Thus, the following color loss is introduced:

\mathcal{L}_{color} = \left\| R_{B} - \hat{R}_{B} \right\|_{2}^{2}, \qquad (10)

where $R_{B}$ and $\hat{R}_{B}$ are the blurred versions of $R$ and $\hat{R}$, respectively:

R_{B}(i,j) = \sum_{\Omega(i,j)} R(i+\Delta_{i}, j+\Delta_{j})\, G(\Delta_{i}, \Delta_{j}), \qquad (11)
\hat{R}_{B}(i,j) = \sum_{\Omega(i,j)} \hat{R}(i+\Delta_{i}, j+\Delta_{j})\, G(\Delta_{i}, \Delta_{j}), \qquad (12)

where $\Omega(i,j)$ is an image patch centered at pixel $(i,j)$ and $G(\Delta_{i}, \Delta_{j})$ is a 2-D Gaussian blur kernel, which can be expressed as:

G(\Delta_{i}, \Delta_{j}) = T \cdot \exp\left(-\frac{(\Delta_{i}-\mu)^{2} + (\Delta_{j}-\mu)^{2}}{2\sigma}\right), \qquad (13)

where the parameters $T=0.053$, $\mu=0$, and $\sigma=3$ are set according to [54].
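The color loss of Eqs. (10)-(13) can be sketched as below with the stated parameters ($T=0.053$, $\mu=0$, $\sigma=3$); the kernel support (21x21) is an assumption since the text does not give the size of $\Omega(i,j)$. The structure loss of Eq. (9) would simply wrap an off-the-shelf SSIM implementation and is omitted here.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=21, T=0.053, mu=0.0, sigma=3.0):
    """Eq. (13) with the parameters of [54]; the 21x21 support is an assumption."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    dj, di = torch.meshgrid(ax, ax, indexing="ij")
    return T * torch.exp(-((di - mu) ** 2 + (dj - mu) ** 2) / (2.0 * sigma))

def blur(x, kernel):
    """Eqs. (11)-(12): depth-wise Gaussian blurring of a (N, C, H, W) tensor."""
    c = x.shape[1]
    k = kernel.to(x.device)[None, None].repeat(c, 1, 1, 1)
    return F.conv2d(x, k, padding=kernel.shape[-1] // 2, groups=c)

def color_loss(R, R_hat, kernel=gaussian_kernel()):
    """Eq. (10): squared l2 distance between the blurred reflectance maps."""
    return torch.mean((blur(R, kernel) - blur(R_hat, kernel)) ** 2)
```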

3) MSE loss: For the reconstruction of the illumination component, we only apply a simple MSE loss, which is defined by the Euclidean distance between $L$ and $\hat{L}$:

\mathcal{L}_{mse} = \left\| L - \hat{L} \right\|_{2}^{2}. \qquad (14)

The overall loss for our feature encoding module is $\mathcal{L}_{feat} = \mathcal{L}_{str} + \mathcal{L}_{color} + \mathcal{L}_{mse}$. Constrained by these tailored self-reconstruction losses, the content-related and illumination-related features can be well extracted from the reflection and illumination components, respectively.

III-C Bilinear Pooling Module

We adopt bilinear techniques to combine the reflection and illumination feature representations into a unified one. Bilinear models have shown powerful capability in modeling two-factor variations, such as the style and content of images [55], location and appearance for fine-grained recognition [56], and temporal and spatial aspects for video analysis [57]. They have also been applied to the BIQA problem, where synthetic and authentic distortions are modeled as the two-factor variations [58]. Here, we tackle the blind NTIQE problem with a similar philosophy, where the reflection-related and illumination-related degradations are modeled as the two-factor variations.

Given an input NTI and the side-output feature maps from the reflection and illumination encoders, $C_i^R$ and $C_i^L$ both have size $h_i \times w_i \times d_i$, since the reflection and illumination encoders share the same architecture and configuration. Before performing bilinear pooling, $C_i^R$ and $C_i^L$ are separately fed into a $1\times 1$ convolutional layer to obtain their compact versions with 32 channels ($\hat{C}_i^R$ and $\hat{C}_i^L$), i.e., of size $h_i \times w_i \times 32$. Then, bilinear pooling is performed on $\hat{C}_i^R$ and $\hat{C}_i^L$ as follows:

B_{i} = (\hat{C}_{i}^{R})^{T} \hat{C}_{i}^{L}, \qquad (15)

where $B_i$, the result of the bilinear (outer) product, is of size $32\times 32$ and is subsequently flattened into a vector.

According to [59], the bilinear representation is usually mapped from a Riemannian manifold into a Euclidean space using the signed square root and $\ell_2$ normalization [60]:

\hat{B}_{i} = \frac{\mathrm{sign}(B_{i}) \odot \sqrt{|B_{i}|}}{\left\| \mathrm{sign}(B_{i}) \odot \sqrt{|B_{i}|} \right\|_{2}}, \qquad (16)

where $\odot$ denotes the element-wise product. Finally, the bilinearly pooled feature representations over all scales are concatenated into a single vector:

\hat{B} = \mathrm{concat}(\hat{B}_{1}, \hat{B}_{2}, \hat{B}_{3}, \hat{B}_{4}). \qquad (17)

Then, $\hat{B}$ is fed into two fully connected layers for quality prediction, whose output is a scalar indicating the overall quality score. Here, we adopt the $\ell_2$ norm as the empirical loss, which has been widely used in previous works:

\mathcal{L}_{quality} = \frac{1}{K} \sum_{k=1}^{K} \left\| Q_{k} - \hat{Q}_{k} \right\|_{2}^{2}, \qquad (18)

where $Q_k$ is the ground truth subjective quality score of the $k$-th image in a mini-batch and $\hat{Q}_k$ is the quality score predicted by DDB-Net. It is noteworthy that bilinear pooling is a global pooling strategy, and therefore our DDB-Net can receive input images of arbitrary size.
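A compact sketch of the multi-scale bilinear pooling and the quality regression head (Eqs. (15)-(17)) is given below. The hidden width of the first fully connected layer (512) is an assumption; the text only states that two fully connected layers produce a scalar score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearPoolingHead(nn.Module):
    """Multi-scale bilinear pooling of reflection/illumination features plus the FC regressor."""
    def __init__(self, enc_channels=(16, 32, 64, 128), d=32, hidden=512):
        super().__init__()
        self.reduce_r = nn.ModuleList([nn.Conv2d(c, d, 1) for c in enc_channels])
        self.reduce_l = nn.ModuleList([nn.Conv2d(c, d, 1) for c in enc_channels])
        self.fc = nn.Sequential(nn.Linear(len(enc_channels) * d * d, hidden),
                                nn.ReLU(inplace=True), nn.Linear(hidden, 1))

    def forward(self, feats_R, feats_L):
        pooled = []
        for conv_r, conv_l, f_r, f_l in zip(self.reduce_r, self.reduce_l, feats_R, feats_L):
            r = conv_r(f_r).flatten(2)                         # (N, 32, h*w)
            l = conv_l(f_l).flatten(2)                         # (N, 32, h*w)
            B = torch.bmm(r, l.transpose(1, 2)).flatten(1)     # Eq. (15), flattened 32x32
            B = torch.sign(B) * torch.sqrt(torch.abs(B) + 1e-8)    # signed square root
            pooled.append(F.normalize(B, p=2, dim=1))               # l2 normalization, Eq. (16)
        return self.fc(torch.cat(pooled, dim=1)).squeeze(1)         # Eq. (17) + FC regression
```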

III-D Network Training and Testing

Our DDB-Net is trained on the target NTI quality database by minimizing the following hybrid loss function:

\mathcal{L}_{total} = \lambda_{1}\mathcal{L}_{idm} + \lambda_{2}\mathcal{L}_{feat} + \lambda_{3}\mathcal{L}_{quality}, \qquad (19)

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weights used to control the relative importance of the different loss terms. The optimal weights are $\lambda_1=0.1$, $\lambda_2=0.2$, and $\lambda_3=0.7$, respectively. During training, all parameters are randomly initialized and optimized with the Adam algorithm [61] using a batch size of 16. We train for 100 epochs with a learning rate of $3\times 10^{-5}$ and use batch normalization to stabilize the training process. All training images are resized to $512\times 512\times 3$ before being fed into the network. The model is implemented in PyTorch [62] on a single NVIDIA GTX 2080Ti GPU. During testing, the EAI stream is not used.
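For reference, the training procedure can be summarized by the condensed sketch below, which combines the three loss groups with the weights of Eq. (19) and the reported hyper-parameters. The data loader, the `ssim` function, and the module classes sketched earlier are assumptions about the surrounding code, not the released implementation.

```python
# Assumed externals: `train_loader` yields (I_N, I_E, mos) batches of tensors in [0, 1],
# and `ssim` is an off-the-shelf SSIM function returning a scalar similarity in [0, 1].
import torch

lambda1, lambda2, lambda3 = 0.1, 0.2, 0.7
decomp = DecompositionNet()
enc_r = SelfReconEncoderDecoder(in_ch=3)      # reflection encoder-decoder
enc_l = SelfReconEncoderDecoder(in_ch=1)      # illumination encoder-decoder
head = BilinearPoolingHead()
params = [p for m in (decomp, enc_r, enc_l, head) for p in m.parameters()]
optimizer = torch.optim.Adam(params, lr=3e-5)

for epoch in range(100):
    for I_N, I_E, mos in train_loader:
        R_N, L_N = decomp(I_N)
        R_E, L_E = decomp(I_E)                # EAI stream, used only during training
        loss_idm = decomposition_loss(I_N, I_E, R_N, R_E, L_N, L_E)        # Eq. (8)
        feats_R, R_hat = enc_r(R_N)
        feats_L, L_hat = enc_l(L_N)
        loss_feat = ((1 - ssim(R_N, R_hat))                                # Eq. (9)
                     + color_loss(R_N, R_hat)                              # Eq. (10)
                     + torch.mean((L_N - L_hat) ** 2))                     # Eq. (14)
        q_hat = head(feats_R, feats_L)
        loss_q = torch.mean((q_hat - mos) ** 2)                            # Eq. (18)
        loss = lambda1 * loss_idm + lambda2 * loss_feat + lambda3 * loss_q # Eq. (19)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```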

IV Experimental Results

IV-A Experimental Setups

IV-A1 Databases

For performance evaluation, we use three benchmark datasets: the natural night-time image dataset (NNID) [27], the enhanced night-time image dataset (EHND, available at https://sites.google.com/site/xiangtaooo/), and a subset of the LIVE Challenge (CLIVE) image quality dataset [63].

The NNID dataset contains 2,240 NTIs with 448 different image contents captured by three different photographic devices (i.e., a digital camera (Device I: Nikon D5300), a mobile phone (Device II: iPhone 8plus), and a tablet (Device III: iPad mini2)) in real-world night-time scenarios. For each image content, one device is used with five different settings to capture five images of different visual quality levels. The five settings differ across image contents. In NNID, 1,400 images with 280 different image contents are captured by the Nikon D5300, 640 images with 128 different image contents by the iPhone 8plus, and 200 images with 40 different image contents by the iPad mini2. The images in NNID come in three resolutions: $512\times 512$, $1024\times 1024$, and $2048\times 2048$. The ground truth subjective quality score for each NTI is provided in the form of a mean opinion score (MOS).

The EHND dataset contains both original NTIs and their corresponding enhanced versions produced by different NTI enhancement algorithms. Specifically, EHND contains a total of 1,500 enhanced NTIs obtained by applying 15 off-the-shelf NTI enhancement algorithms to 100 original NTIs. Similarly, the ground truth subjective quality score, i.e., the MOS, of each enhanced NTI is also provided.

The CLIVE dataset contains widely diverse authentic distortions in 1,162 images captured using a representative variety of modern mobile devices. Each image was collected without artificially introducing any distortions beyond those occurring during capture, processing, and storage by a user’s device. The MOS values obtained from large-scale subjective studies are also available as part of this database. In our experiments, we pick out the images captured at night-time for testing. In total, 159 NTIs were selected from the CLIVE dataset for testing only.

IV-A2 Protocols and Criteria

We conduct experiments by following the general evaluation protocol adopted in existing learning-based BIQA studies. Specifically, we randomly divide all the images in each individual dataset into five folds, with each fold containing an equal number of images. The dataset split is conducted according to source images to guarantee that there is no overlap of image content. Each time, we use four folds for training and the remaining fold for testing. The training and testing procedures are repeated five times on each database so that each image in the dataset is tested exactly once. Each time, we compute four criteria to measure model performance: the Pearson linear correlation coefficient (PLCC), the Spearman rank order correlation coefficient (SRCC), the Kendall rank order correlation coefficient (KRCC), and the root mean square error (RMSE). Among these criteria, PLCC and RMSE measure prediction accuracy, while SRCC and KRCC measure prediction monotonicity. The results from the five training-testing sessions are averaged to obtain the final performance.
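For clarity, the four criteria can be computed with standard SciPy/NumPy routines as sketched below. Note that IQA studies often fit a nonlinear logistic mapping before computing PLCC and RMSE; that step is omitted here for brevity.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def iqa_criteria(pred, mos):
    """Compute PLCC, SRCC, KRCC, and RMSE between predicted scores and subjective MOSs."""
    pred, mos = np.asarray(pred, dtype=float), np.asarray(mos, dtype=float)
    plcc, _ = pearsonr(pred, mos)
    srcc, _ = spearmanr(pred, mos)
    krcc, _ = kendalltau(pred, mos)
    rmse = np.sqrt(np.mean((pred - mos) ** 2))
    return {"PLCC": plcc, "SRCC": srcc, "KRCC": krcc, "RMSE": rmse}
```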

TABLE I: Performance Results of Different BIQA Methods on The NNID Dataset.
Methods Entire Database (2240 images) Device I: Nikon D5300 (1400 images) Device II: iPhone 8plus (640 images) Device III: iPad mini2 (200 images)
SRCC KRCC PLCC RMSE SRCC KRCC PLCC RMSE SRCC KRCC PLCC RMSE SRCC KRCC PLCC RMSE
BLIINDS-II 0.7438 0.5403 0.7549 0.1119 0.7520 0.5461 0.7627 0.1108 0.6419 0.4564 0.6574 0.1103 0.6777 0.5048 0.7333 0.0892
BRISQUE 0.7365 0.5352 0.7452 0.1132 0.7315 0.5332 0.7420 0.1150 0.6445 0.4598 0.6652 0.1091 0.5704 0.4166 0.6431 0.0980
CurveletQA 0.8676 0.6762 0.8679 0.0924 0.8937 0.7115 0.8953 0.0844 0.8110 0.6147 0.8183 0.0916 0.7712 0.5881 0.8217 0.0889
DIIVINE 0.7744 0.5675 0.7637 0.1092 0.7601 0.5545 0.7330 0.1178 0.6830 0.4793 0.5844 0.1187 0.6661 0.4698 0.6491 0.0998
NRSL 0.8291 0.6265 0.8327 0.0936 0.8165 0.6131 0.8192 0.0981 0.7417 0.5417 0.7325 0.1007 0.6625 0.4848 0.6903 0.0966
NFERM 0.8512 0.6572 0.8556 0.1099 0.8706 0.6803 0.8764 0.1110 0.8122 0.6146 0.8224 0.1257 0.7610 0.5727 0.7882 0.1231
GM-LOG 0.8114 0.6072 0.8125 0.0985 0.8135 0.6099 0.8171 0.0992 0.7338 0.5338 0.7313 0.0998 0.6996 0.5117 0.7107 0.0951
GWH-GLBP 0.7111 0.5108 0.7098 0.1350 0.6998 0.5020 0.6819 0.1382 0.6383 0.4731 0.6174 0.1614 0.6244 0.4547 0.7071 0.1343
SSEQ 0.7838 0.5894 0.7865 0.1144 0.7809 0.5878 0.7891 0.1258 0.6735 0.4919 0.6968 0.1451 0.6673 0.4617 0.6436 0.1689
BIQME 0.8255 0.6185 0.8273 0.0911 0.8189 0.6141 0.8245 0.0913 0.8140 0.6144 0.8027 0.0972 0.7935 0.6064 0.7905 0.1005
ILNIQE 0.7115 0.5183 0.6335 0.1691 0.6712 0.4831 0.6766 0.1679 0.6949 0.5018 0.6809 0.1639 0.7983 0.6086 0.6721 0.1720
NIQE 0.5983 0.4220 0.5701 0.1803 0.6007 0.4240 0.5859 0.1847 0.5772 0.4017 0.5874 0.1811 0.6591 0.4694 0.6092 0.1842
BNBT 0.8769 0.6822 0.8784 0.1061 0.8866 0.7066 0.8939 0.1020 0.8632 0.6737 0.8698 0.1157 0.8517 0.6890 0.8576 0.1137
MDM 0.8023 0.6060 0.8039 0.1005 0.8273 0.6331 0.7253 0.1323 0.7741 0.5746 0.6974 0.1134 0.7252 0.5542 0.6304 0.1962
HOSA 0.5484 0.3806 0.5487 0.1416 0.5547 0.3839 0.5507 0.1448 0.4353 0.2983 0.4266 0.1365 0.6374 0.4563 0.6202 0.1359
WaDIQaM 0.8272 0.6213 0.8229 0.0954 0.8127 0.6258 0.8263 0.0895 0.8194 0.6017 0.8101 0.0952 0.8069 0.6048 0.8016 0.0937
DBCNN 0.8938 0.6953 0.8958 0.0849 0.8745 0.6779 0.8826 0.0843 0.8704 0.6738 0.8796 0.0852 0.8526 0.6539 0.8614 0.0893
TSCNN 0.8618 0.6575 0.8669 0.0841 0.8788 0.6863 0.8723 0.0823 0.8655 0.6462 0.8638 0.0821 0.8574 0.6434 0.8548 0.0851
VCR 0.8792 0.6621 0.8744 0.0817 0.8436 0.6397 0.8214 0.0896 0.8812 0.6667 0.8854 0.0819 0.8628 0.6533 0.8554 0.0842
GraphBIQA 0.8618 0.6696 0.8546 0.1108 0.8891 0.7034 0.8818 0.1035 0.8443 0.6509 0.8425 0.0986 0.7986 0.6059 0.7990 0.1074
DDB-Net 0.9318 0.7881 0.9311 0.0745 0.9185 0.7926 0.9203 0.0762 0.9022 0.7707 0.9003 0.0801 0.8925 0.7575 0.8918 0.0824
Figure 5: Scatter plots between the objective scores (predicted by BIQA methods) and subjective MOSs (provided in the NNID dataset). The scatter plots from top left to bottom right correspond to BLIINDS-II [9], BRISQUE [10], CurveletQA [11], DIIVINE [12], NRSL [13], NFERM [14], GM-LOG [15], GWH-GLBP [16], SSEQ [17], BIQME [18], ILNIQE [19], NIQE [20], MDM [7], HOSA [23], WaDIQaM [37], DBCNN [38], TSCNN [64], VCR [65], GraphBIQA [66], and the proposed DDB-Net, respectively. The corresponding fitting error in terms of the R-Square score and the 95% confidence interval are also provided with each plot. A higher R-Square value indicates a lower fitting error.

IV-B Performance Comparisons on NNID and EHND

Since no pristine reference is available for real-world NTIs, the quality evaluation of NTIs can only be performed in a no-reference manner. Therefore, we compare the performance of the proposed DDB-Net against 20 existing BIQA methods, including 15 handcrafted feature-based BIQA methods (i.e., BLIINDS-II [9], BRISQUE [10], CurveletQA [11], DIIVINE [12], NRSL [13], NFERM [14], GM-LOG [15], GWH-GLBP [16], SSEQ [17], BIQME [18], ILNIQE [19], NIQE [20], BNBT [27], MDM [7], and HOSA [23]) and five deep learning-based BIQA methods (i.e., WaDIQaM [37], DBCNN [38], TSCNN [64], VCR [65], and GraphBIQA [66]). The handcrafted feature-based BIQA methods are of two types: training-based and training-free. The training-based ones commonly adopt elaborately designed features to characterize the level of deviation from the statistical regularities of high-quality natural images, based on which a quality prediction function is learned via support vector regression (SVR) [67]. The training-free ones (i.e., ILNIQE [19] and NIQE [20]) first build a pristine statistical model from a large collection of high-quality natural images and then measure the distance between this pristine statistical model and the statistical model built on the distorted image as the estimated quality score. By contrast, the deep learning-based BIQA methods directly optimize an end-to-end mapping from the input image to its quality score without any manual feature engineering.

Figure 6: Significance t-test results on the NNID and EHND datasets. In the figures, a white/black block indicates that the row model performs statistically better/worse than the column model.

IV-B1 Comparisons on NNID

The performance comparison results of different BIQA methods on the NNID database are shown in Table I. From the results, we can make the following observations. First, most training-based methods perform better than the two training-free methods (i.e., ILNIQE [19] and NIQE [20]), and deep learning-based methods are superior to most handcrafted feature-based methods. This is reasonable because BIQA is a challenging task where training is particularly useful to model the complex non-linear relationship between the extracted features and the perceived quality score, and end-to-end deep learning further provides an effective solution to directly establish the explicit image-to-quality mapping owing to its powerful representation learning capacity. Second, the existing NSS feature-based BIQA methods cannot obtain satisfactory results for evaluating NTIs because NSS is not well suited to characterize the degradation properties of in-the-wild NTIs. Third, the proposed DDB-Net delivers the best performance among all competitors. The reason is that we decompose the complex blind NTIQE task into two easier sub-tasks accounting for illumination perception and content perception, respectively. In this way, the features related to illumination perception and content perception can be better learned to facilitate blind NTIQE.

In addition to the numerical performance results, we also show the scatter plots between the objective scores (predicted by BIQA methods) and the subjective MOSs (provided in the database) in Fig. 5. In each scatter plot, every point corresponds to an image in the NNID database. The $x$-axis represents the scores predicted by each BIQA method, while the $y$-axis represents the ground truth subjective MOS. The fitted curve that characterizes the distribution of all the data points is shown in red. A good BIQA method is expected to have a more compact scatter point distribution, with the fitted curve close to the diagonal line. In addition, the corresponding fitting error in terms of the R-Square score and the 95% confidence interval is also provided with each plot. A higher R-Square value indicates a lower fitting error. It is clearly observed that the proposed DDB-Net produces promising prediction results that are highly consistent with the subjective scores.

Finally, we use a hypothesis testing approach based on t-statistics [68] to further examine whether the superiority of our proposed DDB-Net over the competitors is statistically significant. In our experiment, a two-sample t-test between each pair of PLCC value sets (from any two different BIQA algorithms) is conducted at the 5% significance level. Fig. 6(a) shows the results of the t-test, where a white/black block indicates that the row model performs statistically better/worse than the column model. From the results, we find that our DDB-Net always performs significantly better than all competitors, which further validates its superiority.

TABLE II: Performance Results of Different BIQA Methods on The EHND Dataset.
Methods SRCC (\uparrow) KRCC (\uparrow) PLCC (\uparrow) RMSE (\downarrow)
BLIINDS-II 0.7168 0.5016 0.7026 0.7383
BRISQUE 0.7021 0.5077 0.6907 0.7424
CurveletQA 0.7525 0.5743 0.7624 0.6931
DIIVINE 0.6868 0.4750 0.6216 0.7593
NRSL 0.7853 0.5845 0.7812 0.6241
NFERM 0.7546 0.5683 0.7532 0.6831
GM-LOG 0.7915 0.5947 0.7794 0.6325
GWH-GLBP 0.7235 0.5074 0.7196 0.7240
SSEQ 0.7012 0.5135 0.6981 0.7383
BIQME 0.6916 0.4847 0.7174 0.7063
ILNIQE 0.3815 0.2145 0.4637 0.8034
NIQE 0.2723 0.1839 0.3125 0.8279
MDM 0.6959 0.5180 0.7193 0.7489
HOSA 0.3784 0.3179 0.3997 0.9840
WaDIQaM 0.7528 0.5843 0.7598 0.6865
DBCNN 0.7935 0.6442 0.8051 0.6234
TSCNN 0.8174 0.6521 0.8235 0.6183
VCR 0.8213 0.6647 0.8311 0.6121
GraphBIQA 0.7967 0.6115 0.8114 0.6232
DDB-Net 0.8682 0.6881 0.8814 0.6032
Figure 7: Comparison of the rank-1, rank-2, rank-3, rank-4, and rank-5 accuracies by different BIQA methods on the EHND dataset.

IV-B2 Comparisons on EHND

A well-performing NTIQE should also be able to measure the performance of different NTI quality enhancement algorithms, i.e., to properly evaluate different enhanced results. In fact, a certain enhancement algorithm may produce a particularly poor enhanced result that still suffers from unsatisfactory brightness and even more serious color distortions than the original raw NTI. Therefore, we also evaluate the performance of different BIQA methods on another night-time image database called EHND, which contains 1,500 images obtained by applying 15 existing representative NTI enhancement algorithms to 100 original NTIs. The numerical performance results of different BIQA methods on the EHND database are shown in Table II, and the significance t-test results are shown in Fig. 6(b). It is observed from these results that our proposed DDB-Net again outperforms the other competitors by a large margin in terms of all performance criteria.

In this case, the most important role of an NTIQE is to automatically select the result with the highest visual quality from the 15 enhanced results generated from the same original NTI. Therefore, it is of great interest to conduct experiments to further compare this capability of different BIQA methods. Specifically, we measure the rank-$n$ accuracy, which is closely related to the capability of an objective quality metric in selecting the optimal enhanced result from a set of candidates. Given 15 different enhanced results associated with the same original NTI, the rank-$n$ accuracy is defined as the percentage of images whose top-1 result in terms of MOS appears within the top-$n$ results in terms of the objective predicted score. Obviously, a higher rank-$n$ accuracy indicates better performance of an NTIQE. In Fig. 7, we show the rank-1, rank-2, rank-3, rank-4, and rank-5 accuracy values of different BIQA methods on the EHND database. It is observed that our DDB-Net always delivers the highest rank-$n$ accuracy values, indicating the best capability in selecting the result with the highest visual quality from a set of candidates.
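A small sketch of how the rank-$n$ accuracy can be computed is given below, assuming the predicted and subjective scores are arranged as one row of 15 values per source NTI.

```python
import numpy as np

def rank_n_accuracy(pred_scores, mos_scores, n):
    """pred_scores, mos_scores: arrays of shape (num_source_NTIs, 15)."""
    pred = np.asarray(pred_scores, dtype=float)
    mos = np.asarray(mos_scores, dtype=float)
    hits = 0
    for p, m in zip(pred, mos):
        best_by_mos = np.argmax(m)              # top-1 enhanced result in terms of MOS
        top_n_by_pred = np.argsort(-p)[:n]      # top-n results in terms of predicted score
        hits += int(best_by_mos in top_n_by_pred)
    return hits / len(pred)
```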

TABLE III: Performance Results of Cross-Dataset Validation. Note That The MOS Values of The EHND Dataset Are Normalized into The Range of [0,1] in This Experiment.
Methods Train on NNID & Test on EHND Train on EHND & Test on NNID
SRCC KRCC PLCC RMSE SRCC KRCC PLCC RMSE
CurveletQA 0.5872 0.4198 0.6800 0.1563 0.6880 0.4998 0.6830 0.1237
NRSL 0.3425 0.2368 0.5280 0.1814 0.3854 0.2603 0.4986 0.1469
NFERM 0.5079 0.3565 0.6154 0.1683 0.6494 0.4631 0.6444 0.1295
GM-LOG 0.4111 0.2861 0.5383 0.1793 0.4063 0.2758 0.4710 0.1494
BIQME 0.5635 0.4023 0.6943 0.1545 0.6000 0.4170 0.6472 0.1291
MDM 0.6521 0.4698 0.7190 0.1496 0.6084 0.4439 0.6050 0.1649
WaDIQaM 0.6785 0.4947 0.7059 0.1599 0.6923 0.5182 0.7037 0.1595
DBCNN 0.7105 0.5237 0.7218 0.1574 0.7548 0.5493 0.7421 0.1362
TSCNN 0.7828 0.5970 0.8236 0.1276 0.7911 0.5946 0.7646 0.1253
VCR 0.7289 0.5400 0.7620 0.1351 0.8542 0.6640 0.8314 0.1157
GraphBIQA 0.7233 0.5384 0.7758 0.1408 0.7895 0.5913 0.7852 0.1250
DDB-Net 0.8119 0.6231 0.8456 0.1182 0.8691 0.6754 0.8572 0.1116

IV-C Cross-Dataset Validation

Although the above results have demonstrated the promising performance of our DDB-Net on each individual dataset, it remains unknown whether a model pretrained on one dataset generalizes well to another. Therefore, in this section we conduct cross-dataset validation by training the model on one dataset and testing it on the other. Since the ranges of MOS values of the two datasets are different, i.e., [0,1] for the NNID dataset and [0,5] for the EHND dataset, we linearly normalize the MOS values of the EHND dataset into the range [0,1] to facilitate cross-dataset validation. The performance results are shown in Table III. In the table, we only show the results of the methods that achieve fairly good performance (SRCC higher than 0.8) in the intra-dataset validation experiments. From Table III, we observe that the results obtained by training on NNID and testing on EHND are worse than those obtained by training on EHND and testing on NNID. This is expected because the images in EHND are enhanced by different algorithms and may suffer from extra artifacts beyond those appearing in the original raw NTIs. Despite this, our proposed DDB-Net still shows the best generalization capability among all competitors.

IV-D Further Validation on the Night-Time Subset of CLIVE

Besides the above two large-scale night-time image quality datasets (i.e., NNID and EHND), we are also interested in the performance of different BIQA methods on a small-scale dataset. To the best of our knowledge, apart from NNID and EHND, there is no other dataset dedicated to NTI quality evaluation. Therefore, we pick out all the NTIs from the CLIVE dataset to form the CLIVE-NT subset for testing. In total, 159 NTIs were selected. Some examples are shown in Fig. 8. Since the CLIVE-NT subset is quite small, it is not suitable for training. Thus, we use the model trained on the whole NNID dataset to test performance on the CLIVE-NT subset. The results are shown in Table IV. We observe that the performance values are consistently inferior to those on the NNID dataset. This is expected because the characteristics of the two datasets are quite different. Despite this, our proposed DDB-Net is still better than the other compared methods by a large margin.

Figure 8: Examples of NTIs sampled from the CLIVE-NT subset.
TABLE IV: Performance Results on The CLIVE-NT Subset. The Model Was Trained on The NNID Dataset and Tested on The CLIVE-NT Subset.
Methods SRCC (\uparrow) KRCC (\uparrow) PLCC (\uparrow) RMSE (\downarrow)
BLIINDS-II 0.0914 0.0649 0.1001 0.2117
BRISQUE 0.2233 0.1482 0.2288 0.1875
CurveletQA 0.4240 0.2924 0.4303 0.1739
DIIVINE 0.3474 0.2303 0.3886 0.1775
NRSL 0.4470 0.3048 0.4502 0.1720
NFERM 0.3230 0.2190 0.3262 0.1821
GM-LOG 0.3681 0.2515 0.3782 0.1783
GWH-GLBP 0.2867 0.2005 0.3198 0.1825
SSEQ 0.4072 0.2808 0.4241 0.1745
BIQME 0.2589 0.1740 0.2668 0.1862
ILNIQE 0.1832 0.1416 0.1874 0.1946
NIQE 0.1619 0.1369 0.1535 0.1987
MDM 0.2525 0.2077 0.2820 0.1926
HOSA 0.2104 0.1437 0.2164 0.1934
WaDIQaM 0.4537 0.3121 0.4618 0.1692
DBCNN 0.5581 0.3819 0.5462 0.1625
TSCNN 0.2323 0.1537 0.2213 0.1641
VCR 0.6024 0.4286 0.5842 0.1553
GraphBIQA 0.4641 0.3285 0.4853 0.1635
DDB-Net 0.6228 0.4593 0.6301 0.1432

IV-E Influence of Loss Weights

As mentioned in Eq. (19), we use $\lambda_1$, $\lambda_2$, and $\lambda_3$ to control the relative importance of the different loss terms. How to determine these weight values is non-trivial. Since the main task in NTIQE is to predict the quality score of an input NTI, we first assign a relatively large weight to $\lambda_3$, i.e., $\lambda_3 \in \{0.6, 0.7, 0.8, 0.9\}$. Then, the other two weights $\lambda_1$ and $\lambda_2$ are determined based on the constraint $\lambda_1 + \lambda_2 = 1 - \lambda_3$. We test the performance of different weight values on the NNID dataset, as shown in Table V. We find that the best SRCC, KRCC, and RMSE are obtained when $\lambda_1=0.1$, $\lambda_2=0.2$, and $\lambda_3=0.7$, while the best PLCC is obtained when $\lambda_1=0.1$, $\lambda_2=0.1$, and $\lambda_3=0.8$. Given that the PLCC metric depends on the curve-fitting process, it is less important than SRCC and KRCC. Therefore, we adopt $\lambda_1=0.1$, $\lambda_2=0.2$, and $\lambda_3=0.7$ as the final weights in our implementation.

TABLE V: Performance results with different weights in Eq.(19). The experiments are conducted on the NNID dataset.
$\lambda_3$ $\lambda_1$, $\lambda_2$ SRCC (\uparrow) KRCC (\uparrow) PLCC (\uparrow) RMSE (\downarrow)
0.6 0.1, 0.3 0.9313 0.7877 0.9301 0.0751
0.2, 0.2 0.9314 0.7873 0.9305 0.0749
0.3, 0.1 0.9316 0.7876 0.9298 0.0753
0.7 0.1, 0.2 0.9318 0.7881 0.9311 0.0745
0.2, 0.1 0.9313 0.7869 0.9295 0.0767
0.8 0.1, 0.1 0.9315 0.7880 0.9312 0.0747
0.9 0.05, 0.05 0.9310 0.7865 0.9289 0.0768

IV-F Computational Complexity Analysis

We finally compare the computational complexity of different methods in terms of running time, parameter count, and floating point operations (FLOPs). Note that parameter count and FLOPs are only applicable to deep learning-based methods. The computational complexity is tested on a PC with an Intel Xeon Silver 4210 CPU @ 2.20 GHz and an RTX 2080Ti GPU. The results of running time, parameter count, and FLOPs are shown in Table VI. We find that deep learning-based methods generally run faster than handcrafted feature-based methods during the testing stage. Among all methods, DBCNN has the fastest running time, followed by our proposed DDB-Net in second place. Among all deep learning-based methods, our DDB-Net has the lowest number of parameters and the second-lowest number of FLOPs. Note that our DDB-Net has only 0.38M parameters, which is quite compact. Overall, our proposed DDB-Net can evaluate the quality of NTIs efficiently.

TABLE VI: Computational Complexity Comparison of Different Methods Including Running Time, Parameter Number, and Floating Point Operations (FLOPs). Note That The Parameter Number and FLOPs Are Only Applicable to Deep Learning-Based Methods.
Methods Time (s) # of Parameters # of FLOPs
BLIINDS-II 15.3832
BRISQUE 0.0556
CurveletQA 2.4772
DIIVINE 9.5184
NRSL 0.9810
NFERM 22.5676
GM-LOG 0.3130
GWH-GLBP 0.3276
SSEQ 0.9846
BIQME 0.8317
ILNIQE 0.3746
NIQE 0.2457
MDM 0.1344
HOSA 0.4515
WaDIQaM 0.0217 6.29M 17.57G
DBCNN 0.0043 15.31M 43.18G
TSCNN 0.0136 1.44M 13.46G
VCR 0.0389 16.66M 27.78G
GraphBIQA 0.0205 31.90M 10.76G
DDB-Net 0.0092 0.38M 13.33G

V Application: Automatic Parameter Tuning of NTI Quality Enhancement Algorithm

An effective blind NTIQE should be able to guide the optimization of NTI quality enhancement algorithms well. In this section, we demonstrate this idea by applying the proposed DDB-Net to the automatic parameter tuning of existing NTI quality enhancement algorithms. There are always one or several parameters in NTI quality enhancement algorithms whose optimal values vary with content. It is challenging and time-consuming to handpick a set of parameters that works well for all image contents. A well-performing blind NTIQE can replace the role of humans in this task, especially when the volume of images to be processed is particularly large.

Here, we use the LIME algorithm [69] as a representative NTI quality enhancement algorithm, which involves two tunable parameters $g$ and $l$. The default values are $g=0.6$ and $l=0.2$. However, the visual quality of the final enhanced image is highly sensitive to these two parameters. Fig. 9 shows the results generated with different $g$ and $l$ values. In the figure, warmer colors indicate better predicted quality of the corresponding enhanced image. The scores predicted by our DDB-Net are also shown under each image. By varying $g$ and $l$, we can obtain enhanced results with significantly different visual quality. For example, the two enhanced results on the left side of Fig. 9 still suffer from over-/under-exposure, while the two enhanced results on the right side exhibit much better visual quality with finer details and a more natural color appearance. Our DDB-Net evaluates their visual quality consistently with human subjective perception. Furthermore, the visual quality of the upper-right image is better than that of the bottom-right one, which is produced using the default parameter values. This means that it is possible to adaptively determine the optimal parameter values under the guidance of our proposed DDB-Net.
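The tuning procedure itself reduces to a simple search over the two parameters guided by the predicted quality, as sketched below. Both `lime` and `ddb_net_score`, as well as the grid of candidate values, are hypothetical wrappers and settings used only for illustration.

```python
import itertools

def tune_lime_parameters(nti, g_values=(0.4, 0.5, 0.6, 0.7, 0.8),
                         l_values=(0.1, 0.15, 0.2, 0.25, 0.3)):
    """Pick the (g, l) pair whose enhanced result receives the highest predicted quality."""
    best = None
    for g, l in itertools.product(g_values, l_values):
        enhanced = lime(nti, g=g, l=l)          # hypothetical wrapper around LIME [69]
        score = ddb_net_score(enhanced)         # hypothetical wrapper around trained DDB-Net
        if best is None or score > best[0]:
            best = (score, g, l, enhanced)
    return best                                 # (predicted quality, g, l, enhanced image)
```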

Figure 9: Automatic parameter tuning of an off-the-shelf NTI quality enhancement algorithm using the proposed DDB-Net method. Warmer color in the surface plot represents better visual quality.

VI Conclusion

This paper has presented a novel deep NTIQE called DDB-Net, which consists of three modules, namely an image decomposition module, a feature encoding module, and a bilinear pooling module. By decomposing the input NTI into two independent layer components (illumination and reflectance), the degradation features related to illumination perception and content perception are better learned and then fused via bilinear pooling to improve the performance of blind NTIQE. Experiments on two benchmark databases have demonstrated the superiority of our proposed DDB-Net.
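For reference, the following is a minimal sketch of how two feature maps can be bilinearly pooled into a unified descriptor, following the standard bilinear-CNN formulation [56]; the channel sizes and the signed-square-root/L2 normalization steps are illustrative assumptions rather than the exact DDB-Net configuration.

```python
# Sketch: bilinear pooling of two feature maps into one descriptor.
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """feat_a: (B, Ca, H, W), feat_b: (B, Cb, H, W) -> (B, Ca*Cb)."""
    B, Ca, H, W = feat_a.shape
    Cb = feat_b.shape[1]
    a = feat_a.reshape(B, Ca, H * W)
    b = feat_b.reshape(B, Cb, H * W)
    # Outer product of the two feature sets, averaged over spatial locations.
    pooled = torch.bmm(a, b.transpose(1, 2)) / (H * W)   # (B, Ca, Cb)
    pooled = pooled.reshape(B, Ca * Cb)
    # Common post-processing: signed square root followed by L2 normalization.
    pooled = torch.sign(pooled) * torch.sqrt(pooled.abs() + 1e-8)
    return F.normalize(pooled, dim=1)

# Example: illumination-related and content-related features of one image.
fa = torch.randn(1, 64, 14, 14)
fb = torch.randn(1, 64, 14, 14)
print(bilinear_pool(fa, fb).shape)  # torch.Size([1, 4096])
```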

Although the proposed DDB-Net is promising, future work towards further improving its performance may focus on the following directions: 1) designing more efficient unsupervised solutions for image layer decomposition; 2) designing more effective loss functions to facilitate learning degradation features from each component; and 3) designing more powerful feature fusion schemes by considering other variants of bilinear pooling.

References

  • [1] G. Zhai and X. Min, “Perceptual image quality assessment: A survey,” Science China: Information Sciences, vol. 63, no. 11, pp. 76–127, 11 2020.
  • [2] Y. Zhan and R. Zhang, “No-reference JPEG image quality assessment based on blockiness and luminance change,” IEEE Signal Processing Letters, vol. 24, no. 6, pp. 760–764, 2017.
  • [3] S. A. Golestaneh and D. M. Chandler, “No-reference quality assessment of JPEG images via a quality relevance map,” IEEE Signal Processing Letters, vol. 21, no. 2, pp. 155–158, 2014.
  • [4] K. Gu, G. Zhai, W. Lin, X. Yang, and W. Zhang, “No-reference image sharpness assessment in autoregressive parameter space,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3218–3231, 2015.
  • [5] T. Oh, J. Park, K. Seshadrinathan, S. Lee, and A. C. Bovik, “No-reference sharpness assessment of camera-shaken images by analysis of spectral structure,” IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 5428–5439, 2014.
  • [6] C. Tang, X. Yang, and G. Zhai, “Noise estimation of natural images via statistical analysis and noise injection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 8, pp. 1283–1294, 2015.
  • [7] H. Ziaei Nafchi and M. Cheriet, “Efficient no-reference quality assessment and classification model for contrast distorted images,” IEEE Transactions on Broadcasting, vol. 64, no. 2, pp. 518–523, 2018.
  • [8] Q. Jiang, Z. Peng, G. Yue, H. Li, and F. Shao, “No-reference image contrast evaluation by generating bidirectional pseudoreferences,” IEEE Transactions on Industrial Informatics, vol. 17, no. 9, pp. 6062–6072, 2021.
  • [9] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, 2012.
  • [10] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.
  • [11] L. Liu, H. Dong, H. Huang, and A. C. Bovik, “No-reference image quality assessment in curvelet domain,” Signal Processing: Image Communication, vol. 29, no. 4, pp. 494–505, 2014.
  • [12] A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE Transactions on Image Processing, vol. 20, no. 12, pp. 3350–3364, 2011.
  • [13] Q. Li, W. Lin, J. Xu, and Y. Fang, “Blind image quality assessment using statistical structural and luminance features,” IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2457–2469, 2016.
  • [14] K. Gu, G. Zhai, X. Yang, and W. Zhang, “Using free energy principle for blind image quality assessment,” IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50–63, 2015.
  • [15] W. Xue, X. Mou, L. Zhang, A. C. Bovik, and X. Feng, “Blind image quality assessment using joint statistics of gradient magnitude and Laplacian features,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4850–4862, 2014.
  • [16] Q. Li, W. Lin, and Y. Fang, “No-reference quality assessment for multiply-distorted images in gradient domain,” IEEE Signal Processing Letters, vol. 23, no. 4, pp. 541–545, 2016.
  • [17] L. Liu, B. Liu, H. Huang, and A. C. Bovik, “No-reference image quality assessment based on spatial and spectral entropies,” Signal Processing: Image Communication, vol. 29, no. 8, pp. 856–863, 2014.
  • [18] K. Gu, D. Tao, J.-F. Qiao, and W. Lin, “Learning a no-reference quality assessment model of enhanced images with big data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 4, pp. 1301–1313, 2017.
  • [19] L. Zhang, L. Zhang, and A. C. Bovik, “A feature-enriched completely blind image quality evaluator,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2579–2591, 2015.
  • [20] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 209–212, 2012.
  • [21] Q. Li, W. Lin, and Y. Fang, “BSD: Blind image quality assessment based on structural degradation,” Neurocomputing, vol. 236, pp. 93–103, 2017.
  • [22] P. Ye, J. Kumar, L. Kang, and D. Doermann, “Unsupervised feature learning framework for no-reference image quality assessment,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1098–1105.
  • [23] J. Xu, P. Ye, Q. Li, H. Du, Y. Liu, and D. Doermann, “Blind image quality assessment based on high order statistics aggregation,” IEEE Transactions on Image Processing, vol. 25, no. 9, pp. 4444–4457, 2016.
  • [24] Q. Jiang, F. Shao, G. Jiang, M. Yu, and Z. Peng, “Supervised dictionary learning for blind image quality assessment using quality-constraint sparse coding,” Journal of Visual Communication and Image Representation, 2015.
  • [25] Q. Jiang, F. Shao, W. Lin, K. Gu, G. Jiang, and H. Sun, “Optimizing multistage discriminative dictionaries for blind image quality assessment,” IEEE Transactions on Multimedia, vol. 20, no. 8, pp. 2035–2048, 2018.
  • [26] Q. Wu, H. Li, F. Meng, K. N. Ngan, B. Luo, C. Huang, and B. Zeng, “Blind image quality assessment based on multichannel feature fusion and label transfer,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 3, pp. 425–440, 2016.
  • [27] T. Xiang, Y. Yang, and S. Guo, “Blind night-time image quality assessment: Subjective and objective approaches,” IEEE Transactions on Multimedia, vol. 22, no. 5, pp. 1259–1272, 2020.
  • [28] J. McCann, Retinex Theory. New York, NY: Springer New York, 2016, pp. 1118–1125.
  • [29] J. Wu, J. Zeng, Y. Liu, G. Shi, and W. Lin, “Hierarchical feature degradation based blind image quality assessment,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 510–517.
  • [30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [31] Y. Li, L.-M. Po, L. Feng, and F. Yuan, “No-reference image quality assessment with deep convolutional neural networks,” in IEEE International Conference on Digital Signal Processing, 2016, pp. 685–689.
  • [32] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [33] L. Kang, P. Ye, Y. Li, and D. Doermann, “Convolutional neural networks for no-reference image quality assessment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1733–1740.
  • [34] Y. Li, L.-M. Po, X. Xu, L. Feng, F. Yuan, C.-H. Cheung, and K.-W. Cheung, “No-reference image quality assessment with shearlet transform and deep neural networks,” Neurocomputing, vol. 154, pp. 94–109, 2015.
  • [35] X. Liu, J. van de Weijer, and A. D. Bagdanov, “RankIQA: Learning from rankings for no-reference image quality assessment,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1040–1049.
  • [36] K. Ma, W. Liu, K. Zhang, Z. Duanmu, Z. Wang, and W. Zuo, “End-to-end blind image quality assessment using deep neural networks,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1202–1213, 2017.
  • [37] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 206–219, 2017.
  • [38] W. Zhang, K. Ma, J. Yan, D. Deng, and Z. Wang, “Blind image quality assessment using a deep bilinear convolutional neural network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 1, pp. 36–47, 2018.
  • [39] X. Min, G. Zhai, K. Gu, X. Yang, and X. Guan, “Objective quality evaluation of dehazed images,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 8, pp. 2879–2892, 2018.
  • [40] X. Min, G. Zhai, K. Gu, Y. Zhu, J. Zhou, G. Guo, X. Yang, X. Guan, and W. Zhang, “Quality evaluation of image dehazing methods using synthetic hazy images,” IEEE Transactions on Multimedia, vol. 21, no. 9, pp. 2319–2333, 2019.
  • [41] Q. Wu, L. Wang, K. N. Ngan, H. Li, F. Meng, and L. Xu, “Subjective and objective de-raining quality assessment towards authentic rain image,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 3883–3897, 2020.
  • [42] M. Yang and A. Sowmya, “An underwater color image quality evaluation metric,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 6062–6071, 2015.
  • [43] P. Guo, L. He, S. Liu, D. Zeng, and H. Liu, “Underwater image quality assessment: Subjective and objective methods,” IEEE Transactions on Multimedia, 2021.
  • [44] Q. Jiang, Y. Gu, C. Li, R. Cong, and F. Shao, “Underwater image enhancement quality evaluation: Benchmark dataset and objective metric,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [45] J. Liu, D. Xu, W. Yang, M. Fan, and H. Huang, “Benchmarking low-light image enhancement and beyond,” Int. J. Comput. Vis., vol. 129, pp. 1153–1184, 2021.
  • [46] Y. Wang, Y. Cao, Z. Zha, J. Zhang, Z. Xiong, W. Zhang, and F. Wu, “Progressive retinex: Mutually reinforced illumination-noise perception network for low-light image enhancement,” Proceedings of the 27th ACM International Conference on Multimedia, 2019.
  • [47] M. Li, J. Liu, W. Yang, X. Sun, and Z. Guo, “Structure-revealing low-light image enhancement via robust retinex model,” IEEE Transactions on Image Processing, vol. 27, pp. 2828–2841, 2018.
  • [48] J. Zhang, Y. Cao, S. Fang, Y. Kang, and C. Chen, “Fast haze removal for nighttime image using maximum reflectance prior,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7016–7024, 2017.
  • [49] Z. Ying, G. Li, and W. Gao, “A bio-inspired multi-exposure fusion framework for low-light image enhancement,” ArXiv, vol. abs/1711.00591, 2017.
  • [50] M. D. Grossberg and S. K. Nayar, “Modeling the space of camera response functions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 1272–1282, 2004.
  • [51] J. Liang, J. Wang, Y. Quan, T. Chen, J. Liu, H. Ling, and Y. Xu, “Recurrent exposure generation for low-light face detection,” IEEE Transactions on Multimedia, vol. 24, pp. 1609–1621, 2022.
  • [52] K. Ma, H. Li, H. Yong, Z. Wang, D. Meng, and L. Zhang, “Robust multi-exposure image fusion: A structural patch decomposition approach,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2519–2532, 2017.
  • [53] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [54] A. Ignatov, N. Kobyshev, R. Timofte, and K. Vanhoey, “DSLR-quality photos on mobile devices with deep convolutional networks,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3297–3305.
  • [55] J. B. Tenenbaum and W. T. Freeman, “Separating style and content,” in NIPS, 1996.
  • [56] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear CNN models for fine-grained visual recognition,” 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1449–1457, 2015.
  • [57] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
  • [58] W. Zhang, K. Ma, J. Yan, D. Deng, and Z. Wang, “Blind image quality assessment using a deep bilinear convolutional neural network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, pp. 36–47, 2020.
  • [59] X. Pennec, P. Fillard, and N. Ayache, “A Riemannian framework for tensor computing,” International Journal of Computer Vision, vol. 66, pp. 41–66, 2005.
  • [60] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the Fisher kernel for large-scale image classification,” in ECCV, 2010.
  • [61] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2015.
  • [62] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in NeurIPS, 2019.
  • [63] D. Ghadiyaram and A. C. Bovik, “Massive online crowdsourced study of subjective and objective picture quality,” IEEE Transactions on Image Processing, vol. 25, pp. 372–387, 2016.
  • [64] Q. Yan, D. Gong, and Y. Zhang, “Two-stream convolutional networks for blind image quality assessment,” IEEE Transactions on Image Processing, vol. 28, pp. 2200–2211, 2019.
  • [65] Z. Pan, F. Yuan, J. Lei, Y. Fang, X. Shao, and S. T. W. Kwong, “VCRNet: Visual compensation restoration network for no-reference image quality assessment,” IEEE Transactions on Image Processing, vol. 31, pp. 1613–1627, 2022.
  • [66] S. Sun, T. Yu, J. Xu, J. Lin, W. Zhou, and Z. Chen, “GraphIQA: Learning distortion graph representations for blind image quality assessment,” IEEE Transactions on Multimedia, 2022.
  • [67] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, pp. 27:1–27:27, 2011.
  • [68] D. C. Montgomery and G. C. Runger, Applied Statistics and Probability for Engineers. John Wiley & Sons, 2014.
  • [69] X. Guo, Y. Li, and H. Ling, “LIME: Low-light image enhancement via illumination map estimation,” IEEE Transactions on Image Processing, vol. 26, pp. 982–993, 2017.