
Dynamic Image Restoration and Fusion Based on Dynamic Degradation

Aiqing Fang, Xinbo Zhao, Jiaqi Yang, Yanning Zhang aNational Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China [email protected] (X. Zhao), [email protected] (A. Fang), [email protected] (J. Yang), [email protected] (S. Cao), [email protected] (Y. Zhang)
Abstract

Deep-learning-based image restoration and fusion methods have achieved remarkable results. However, existing restoration and fusion methods have paid limited attention to the robustness problem caused by dynamic degradation. In this paper, we propose a novel dynamic image restoration and fusion neural network, termed DDRF-Net, which addresses two problems: static restoration and fusion, and dynamic degradation. To solve the static fusion problem of existing methods, dynamic convolution is introduced to learn dynamic restoration and fusion weights. In addition, a dynamic degradation kernel is proposed to improve the robustness of image restoration and fusion. Our network framework effectively combines image degradation with the image fusion task: it provides more detailed information for image fusion through the image restoration loss, and optimizes image restoration through the image fusion loss. Therefore, the stumbling blocks of deep learning in image fusion, e.g., static fusion weights and specifically designed network architectures, are greatly mitigated. Extensive experiments show that our method clearly outperforms the state-of-the-art methods.

keywords:
image fusion, dynamic fusion, dynamic degradation, deep learning.

1 Introduction

As a basic research topic in the field of computer vision, image fusion plays an important role in many high-level semantic tasks, e.g., object detection, saliency detection, and object recognition. Existing image fusion methods have explored fusion criteria (e.g., maximum, sum, weighted average, and the L1 norm) and network frameworks (e.g., Siamese networks, nest networks, and generative adversarial networks) extensively. However, they lack exploration and analysis of dynamic fusion weights and dynamic degradation models. Unfortunately, in real-world scenes, due to differences between data acquisition equipment and environments, different cross-modal images face dynamic degradation factors, making existing static image fusion weights unable to adapt to dynamic scenes.

In contrast, the human visual system handles dynamic scenes robustly, which is closely related to the dynamic cognitive processing mechanism of the human brain. According to research in cognitive psychology [1] and biological neuroscience [2], the cognitive process of the human brain is dynamic. The dynamic characteristics of the human brain in information collection, transmission, and processing enable humans to complete very complex visual and auditory tasks. Therefore, we believe that the dynamic cognitive learning of the human brain is of positive significance for improving the robustness of image fusion tasks, as verified in Sect. 4.

Figure 1: Examples of image restoration and fusion. The first row shows an example of image restoration, and the second row shows an example of image fusion. The first two columns are the inputs of the network, the middle columns show the results of existing image restoration and fusion methods, and the last column shows our method.

To overcome the dynamic image fusion problem, inspired by the dynamic cognitive mechanism of the human brain, we propose a robust dynamic image restoration and fusion method. As far as we know, our work is the first such exploration in the fields of image fusion and image restoration. Our method studies dynamic fusion weights and dynamic noise kernels in kernel space. Because the data is processed in kernel space, the method is more efficient than methods that operate in feature space. Different from existing image restoration and fusion methods, the input of our feature extraction module is not only the cross-modal images but also the dynamic degradation kernel. This allows the network to extract features of different modal images and restore the degraded images at the same time. This joint optimization strategy is critical for hidden feature extraction. In addition, dynamic convolution is used to construct the image fusion module, which makes the fusion weights dynamic and sample-dependent. Moreover, we introduce a dynamic negative sample fusion loss based on the idea of adversarial learning, so as to reduce image fusion noise and improve fusion quality. The main contributions of our work include the following three points:

  • 1.

    Firstly, we investigate the influence of static fusion weights and static noise on image fusion tasks and propose the idea of dynamic fusion, i.e., dynamic fusion weights, a dynamic noise kernel, and a dynamic negative sample fusion loss.

  • 2.

    Secondly, we propose a dynamic image fusion network with image restoration ability. Through the joint optimization of the image fusion and image restoration tasks, the robustness of image fusion in complex environments (e.g., low light and blur) is further improved.

  • 3.

    Finally, we conduct comprehensive comparative and analysis experiments on image restoration and fusion on public datasets. The experimental results demonstrate that our method is superior to the state-of-the-art methods.

The remainder of this paper is structured as follows. Sect. 2 reviews relevant background. Sect. 3 presents the proposed dynamic image fusion method. Sect. 4 introduces the experimental datasets, evaluation metrics, implementation details, and results. Sect. 5 presents a discussion. Sect. 6 draws a conclusion.

2 Related work

Our research involves not only dynamic image fusion but also dynamic image degradation. Therefore, we briefly review dynamic convolution, image restoration, and image fusion.

2.1 Dynamic convolution

Deep convolutional neural networks [3] have achieved remarkable results in many visual fields. However, the traditional convolution module learns static weights, which no longer change in the test phase once training is completed. This severely limits the performance of the network model in changing scenarios. To solve this problem, the concept of dynamic convolution (e.g., CondConv [4], DynamicConv [5], and DyNet [6]) was proposed, i.e., the target convolution kernel is obtained by weighting a series of basis convolution kernels. Dynamic convolution performs excellently in classification and recognition tasks, but there is a lack of relevant research in the image fusion task.
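
To make the idea concrete, the sketch below shows a minimal dynamic convolution layer in the spirit of CondConv/DynamicConv: a channel-attention branch predicts per-sample mixing coefficients, and the effective kernel is their weighted sum over N basis kernels. This is an illustrative PyTorch sketch under our own naming (DynamicConv2d, num_kernels, etc.), not the code of [4, 5, 6] or of DDRF-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Illustrative dynamic convolution: kernel = sum_k pi_k(x) * W_k."""
    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=4):
        super().__init__()
        self.num_kernels = num_kernels
        # N basis kernels and biases, mixed per sample at run time.
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        self.bias = nn.Parameter(torch.zeros(num_kernels, out_ch))
        # Channel-attention branch predicting the mixing coefficients pi_k(x).
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, num_kernels), nn.Softmax(dim=1))

    def forward(self, x):
        b, c, h, w = x.shape
        pi = self.attention(x)                                    # (B, N) per-sample weights
        w_dyn = torch.einsum('bn,noikl->boikl', pi, self.weight)  # per-sample kernels
        b_dyn = torch.einsum('bn,no->bo', pi, self.bias)
        # Grouped-convolution trick: fold the batch into the channel dimension.
        x = x.reshape(1, b * c, h, w)
        w_dyn = w_dyn.reshape(b * w_dyn.shape[1], c, *w_dyn.shape[-2:])
        out = F.conv2d(x, w_dyn, b_dyn.reshape(-1), padding='same', groups=b)
        return out.reshape(b, -1, h, w)
```

Because the attention weights depend on the input, two different samples are effectively convolved with two different kernels, which is the property that static convolution lacks.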

2.2 Image restoration

When existing image restoration [7] or super-resolution [8, 9] methods generate degraded data, they often use a limited set of degradation kernels (e.g., blurring, bicubic downsampling, and white Gaussian noise) to simulate image degradation in a real environment. In this way, however, a single image is only affected by a single blur kernel and downsampling, which is inconsistent with the real environment, where a single image is affected by a variety of degradation kernels. Image degradation is of great research value not only for image restoration and super-resolution tasks, but also for image fusion, especially cross-modal image fusion without ground truth. After all, the goal of image fusion is to improve the real quality of the image, not merely the similarity to the original images. Unfortunately, no relevant research has been found in the field of cross-modal image fusion.

2.3 Image fusion

Due to the outstanding performance of deep learning in many visual fields, deep-learning-based image fusion methods have been widely proposed, e.g., DenseFuse [10], IFCNN [11], FusionDN [12], PGMI [13], U2Fusion [14], NestFuse [15], and RFN-Nest [16]. Existing deep-learning-based image fusion networks, e.g., those based on Siamese networks and generative adversarial networks, often use similarity losses (e.g., content loss, detail loss, gradient loss, and tee loss) or an adversarial loss to learn static fusion weights. For the cross-modal image fusion task without ground truth, it is necessary to calculate the similarity between the predicted image and the original images, so as to retain the texture details of the source images as much as possible. Li et al. [10] introduced SSIM [17] to optimize the image fusion weights. Many deep-learning-based cross-modal image fusion methods (e.g., IFCNN, FusionDN, NestFuse, and U2Fusion) use such similarity-measure optimization functions, e.g., SSIM, MSE, and content loss. In contrast, Ma et al. [18] introduced an adversarial loss into the infrared and visible image fusion task. Although the adversarial loss provides a new idea for image fusion, the subjective fusion quality obtained with it shows an obvious local blur effect. To solve this problem, Ma et al. [19] introduced a dual-discriminator generative adversarial fusion network in which multiple loss functions (e.g., content loss, detail loss, gradient loss, tee loss, and adversarial loss) are combined to optimize the fusion weights. In addition, the fusion criterion of existing image fusion methods is usually defined as a fusion layer after the feature extraction layer, which separates fusion from the subsequent decoding convolutions. Although this design is widely used, it has some limitations. 1) A hand-crafted fusion criterion requires specific expert experience. 2) Hand-crafted fusion criteria limit the versatility of the network across tasks such as infrared-visible and multi-focus image fusion. To solve the above problems, we carry out research on dynamic image fusion and propose dynamic degradation for image restoration and fusion.

3 Proposed method

Figure 2: Network architecture of the proposed DDRF-Net, where $H_{1}$ represents the dynamic degradation model of the visible image; $H_{2}$ represents the dynamic degradation model of the infrared image; $H_{3}$ represents the dynamic degradation model of the fused image, i.e., the fusion process; $L$ represents the loss of the image fusion task; $P$ denotes the fused image; $G$ denotes the PCA and stretch alignment operation [8]. The dynamic degradation kernel is composed of 12 basic kernels of three types (i.e., isotropic Gaussian kernels, motion blur kernels, and anisotropic Gaussian kernels) shown in the black box.

In this section, we first analyze static restoration and fusion methods, and then present our dynamic restoration and fusion formulation together with the definition and design of the dynamic kernel. Finally, the neural network architecture and training details are described.

3.1 Motivation

The robustness of image fusion is of great significance for its application and development. Existing image fusion methods lack research on dynamic fusion weights and dynamic degradation models: in practice, visual data is affected by illumination, blur, and compression artifacts at the same time, and these factors are dynamic and coupled with each other, as verified in Sect. 4. Existing image fusion methods do not consider this dynamic degradation problem in the feature extraction stage. In a dynamic visual scene, the learned feature extraction weights become invalid and the image fusion quality is seriously affected. Although degradation kernels have been studied comprehensively in existing image restoration tasks, only the influence of a single degradation kernel (e.g., downsampling combined with a blur kernel) on the whole image is considered. Dynamic fusion weights and dynamic degradation models, however, play an important role in improving the robustness of image fusion. Therefore, we are motivated to study dynamic fusion and dynamic degradation models in the field of image fusion.

3.2 Dynamic degradation

Image restoration and fusion should not only measure the similarity between the predicted image and the different modal images but also recover a high-quality image. Although existing image restoration methods have studied different degradation kernels extensively, these kernels are mostly static: they only consider the global degradation of the image (i.e., a single degradation kernel is applied to the whole image) and ignore local degradation (i.e., a single image can be affected by different degradation models). The traditional degradation model $H$ applied to a high-quality image $Z$ is defined as [9]:

$Z \circledast H = (\bm{Z} \circledast \bm{k}) \downarrow_{s} + N$,   (1)

where $\bm{k}$ indicates a 2-D convolution kernel, $\circledast$ denotes the convolution operation, $\downarrow_{s}$ represents the down-sampling process with a scaling factor $s$, and $N$ indicates white Gaussian noise. The existing degradation convolution $\bm{k}$ is a single static convolution kernel, which cannot simulate the degradation of real images. Therefore, in order to better simulate the real degradation model, we define $\bm{k}$ as a linear combination of a series of basic degradation kernels, i.e., isotropic Gaussian kernels, motion blur kernels, and anisotropic Gaussian kernels. The dynamic kernel is defined as:

$k_{d} = a \, k_{m}(\bm{j}) + b \, k_{i}(\bm{j}) + c \, k_{a}(\bm{j}), \quad \text{s.t. } j \in \{1,2,3,4\},\ a+b+c=1,\ a,b,c \in [0,1]$,   (2)

where $a$, $b$, $c$ indicate the weights of the basic degradation kernels, $k_{m}(\bm{j})$ denotes a motion blur kernel, $k_{i}(\bm{j})$ an isotropic Gaussian kernel, $k_{a}(\bm{j})$ an anisotropic Gaussian kernel, and $j$ is a random integer index. In order to avoid the image oscillation caused by the dynamic degradation kernel, the weights $a, b, c$ must sum to 1. Therefore, the dynamic degradation model $H_{m}$ can be formulated as:

$Z_{i} * H_{m} = (\bm{Z}_{i} \circledast \bm{k}_{d}) \downarrow_{s} + N, \quad m \in \{1,2,3\}$,   (3)

For the cross-modal image fusion task, taking the infrared and visible image fusion task as an example, we need to restore three dynamic degradation models, i.e., for the visible image ($H_{1}$), the infrared image ($H_{2}$), and the fused image ($H_{3}$).
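
The following sketch illustrates one way to implement the dynamic degradation of Eqs. (2)-(3): the convex weights $(a, b, c)$ are drawn at random, the three basic kernels are mixed, and the image is blurred, down-sampled, and corrupted with white Gaussian noise. Kernel sizes, parameter ranges, and the Dirichlet sampling of the weights are our illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy import ndimage

def gaussian_kernel(size=15, sigma_x=2.0, sigma_y=2.0, theta=0.0):
    """Anisotropic Gaussian kernel; isotropic when sigma_x == sigma_y."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    x_r = xx * np.cos(theta) + yy * np.sin(theta)
    y_r = -xx * np.sin(theta) + yy * np.cos(theta)
    k = np.exp(-0.5 * ((x_r / sigma_x) ** 2 + (y_r / sigma_y) ** 2))
    return k / k.sum()

def motion_kernel(size=15, angle=0.0):
    """Simple linear motion-blur kernel along a given angle."""
    k = np.zeros((size, size))
    c = size // 2
    for t in np.linspace(-c, c, size * 4):
        x = int(round(c + t * np.cos(angle)))
        y = int(round(c + t * np.sin(angle)))
        if 0 <= x < size and 0 <= y < size:
            k[y, x] = 1.0
    return k / k.sum()

def dynamic_kernel(rng, size=15):
    """Eq. (2): k_d = a*k_m + b*k_i + c*k_a with a+b+c=1, parameters drawn at random."""
    a, b, c = rng.dirichlet([1.0, 1.0, 1.0])          # convex mixing weights
    s = rng.uniform(0.5, 3.0)
    k_m = motion_kernel(size, angle=rng.uniform(0.0, np.pi))
    k_i = gaussian_kernel(size, s, s)
    k_a = gaussian_kernel(size, rng.uniform(0.5, 3.0), rng.uniform(0.5, 3.0),
                          rng.uniform(0.0, np.pi))
    return a * k_m + b * k_i + c * k_a

def degrade(img, rng, scale=2, noise_sigma=0.01):
    """Eq. (3): (Z conv k_d), down-sampled by s, plus white Gaussian noise."""
    k_d = dynamic_kernel(rng)
    blurred = ndimage.convolve(img, k_d, mode='reflect')
    low = blurred[::scale, ::scale]
    return low + rng.normal(0.0, noise_sigma, low.shape), k_d
```

A typical call would be `low, k_d = degrade(img, np.random.default_rng(0))`, with `img` a float grayscale array in [0, 1]; repeated calls yield a different kernel each time, which is the dynamic behaviour the model is trained to invert.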

3.3 Dynamic fusion

Existing image fusion methods often use hand-crafted fusion criteria in feature space or learn fixed fusion weights in a data-driven manner. A hand-crafted fusion criterion is defined as:

$ø = \sum_{i=1}^{k} \left( w_{i}(\bm{x},\bm{y}) * C(\bm{f}_{i}, \bm{f}_{v}) \right)$,   (4)

where $\bm{f}_{i}$ and $\bm{f}_{v}$ denote the infrared and visible feature maps; $w_{i}(\bm{x},\bm{y})$ denotes the fusion weight at position $(\bm{x},\bm{y})$ of the feature map; $C(*)$ denotes the concatenation operation; and $ø$ indicates the fused feature maps. This class of methods focuses on the design and improvement of the fusion weight strategy for different feature maps, e.g., fusion weights based on saliency, weighted average, maximum, and sum. For a static CNN, the fused features can be formulated as:

$ø = W^{T} * C(\bm{f}_{i}, \bm{f}_{v}) + \bm{b}$,   (5)

where $W$ and $\bm{b}$ denote the convolution weights and bias. To overcome the shortcomings of the existing fusion methods (i.e., hand-crafted fusion criteria and static fusion weights), we propose a dynamic fusion method in kernel space. We concatenate the convolution kernel parameters learned by the feature extraction module and linearly stack the convolution kernel weights combined with a channel attention mechanism. The dynamic fusion criterion is defined as:

$ø = \left( \frac{1}{N} \sum_{k=1}^{N} \pi_{k}(\bm{x}) \bm{W}_{k} \right)^{T} * C(\bm{f}_{i}, \bm{f}_{v}, \bm{k}_{di}, \bm{k}_{dv}) + \frac{1}{N} \sum_{k=1}^{N} \pi_{k}(\bm{x}) \bm{b}_{k}$,   (6)

where $W_{k}$ denotes the k-th convolution kernel; $N$ is the number of basis kernels of the dynamic kernel; $\pi_{k}(\bm{x})$ denotes the convolution kernel weight, i.e., the channel attention weight; and $C(\bm{f}_{i}, \bm{f}_{v}, \bm{k}_{di}, \bm{k}_{dv})$ denotes that the dynamic degradation kernels (i.e., $\bm{k}_{di}$ and $\bm{k}_{dv}$) are superimposed on the feature maps, i.e., the visible feature map ($\bm{f}_{v}$) and the infrared feature map ($\bm{f}_{i}$), along the channel dimension.
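
The sketch below illustrates Eq. (6): the reduced degradation-kernel codes are stretched to the spatial size of the feature maps, concatenated with the infrared and visible features, and passed through an attention-weighted dynamic convolution (the illustrative DynamicConv2d module sketched in Sect. 2.1). Channel sizes and names are assumptions for illustration, not DDRF-Net's actual configuration.

```python
import torch

def dynamic_fuse(f_i, f_v, k_di, k_dv, fusion_layer):
    """f_i, f_v: (B, C, H, W) feature maps; k_di, k_dv: (B, t) reduced kernel codes."""
    b, _, h, w = f_i.shape
    # Stretch each t-dimensional kernel code into t constant maps of size HxW,
    # so it can be concatenated with the feature maps along the channel axis.
    m_di = k_di[:, :, None, None].expand(b, k_di.shape[1], h, w)
    m_dv = k_dv[:, :, None, None].expand(b, k_dv.shape[1], h, w)
    x = torch.cat([f_i, f_v, m_di, m_dv], dim=1)      # C(f_i, f_v, k_di, k_dv)
    return fusion_layer(x)                            # attention-weighted dynamic conv

# Usage sketch (channel counts are illustrative):
# fusion_layer = DynamicConv2d(in_ch=2 * 64 + 2 * 15, out_ch=64, num_kernels=12)
# fused = dynamic_fuse(f_i, f_v, k_di, k_dv, fusion_layer)
```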

3.4 Loss Function

Existing deep-learning-based image fusion methods mainly use a similarity-measure loss to optimize the image fusion weights, and the similarity fusion loss $\mathcal{L}(\bm{\theta})$ is defined as:

$\mathcal{L}(\bm{\theta}) = 1 - \frac{1}{N} \sum_{i=1}^{N} S(f(\bm{X}_{i}, \bm{\theta}), \bm{P})$,   (7)

where $\bm{X}_{i}$ indicates the i-th image to be fused; $\bm{\theta}$ denotes the CNN's weights; $\bm{P}$ indicates the fused image; and $S(*)$ denotes a similarity function. For the cross-modal image fusion task, due to the limitation of the existing similarity loss (i.e., it only measures similarity rather than image quality), this paper introduces the idea of adversarial learning based on the dynamic degradation model and proposes the dynamic negative sample fusion loss $\mathcal{L}(\bm{H}_{i})$, which is defined as:

$\mathcal{L}(\bm{H}_{i}) = 1 - \frac{1}{N} \sum_{i=1}^{N} S(\bm{X}_{i} * \bm{H}_{i}, \bm{P})$,   (8)

where $\bm{H}_{i}$ denotes the i-th dynamic degradation model. The basis of adversarial learning is to have positive samples as guidance so that the network model can be optimized in the right direction. Therefore, we introduce a positive sample loss $\mathcal{L}(\bm{O})$, which is defined as:

$\mathcal{L}(\bm{O}) = O(\bm{M}_{1}, \bm{M}_{2}, \bm{M}_{3}, \cdots, \bm{M}_{i})$,   (9)

where $M_{i}$ denotes the i-th image fusion method and $O(*)$ indicates the evaluation function. Therefore, our overall fusion loss $\mathcal{L}$ is defined as:

$\mathcal{L} = \frac{1}{3} \sum_{i=1}^{3} \mathcal{L}(\bm{H}_{i}) + \mathcal{L}(\bm{\theta}) + \mathcal{L}(\bm{O})$.   (10)

Our fusion loss considers not only the similarity to the original images but also the distance to negative samples generated by the dynamic degradation model. This effectively mitigates the loss of generality in image fusion caused by differences in data distribution.
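
A minimal sketch of how Eqs. (7)-(10) could be assembled is given below, assuming a differentiable similarity function S (e.g., SSIM) and a fixed high-quality positive reference produced by existing fusion methods; the sign convention of the negative-sample term follows the "expanding the distance" description in Sect. 5 and is our interpretation, so all names and details here are illustrative rather than the paper's exact implementation.

```python
import torch

def similarity_loss(inputs, P, S):
    """Eq. (7): 1 - mean similarity between each source image X_i and the fused image P."""
    return 1.0 - torch.stack([S(x, P) for x in inputs]).mean()

def negative_sample_loss(degraded, P, S):
    """Negative-sample term of Eq. (8): penalise similarity between P and each
    dynamically degraded image X_i * H_i, pushing P away from degraded samples
    (the sign convention here is our assumption)."""
    return torch.stack([S(d, P) for d in degraded]).mean()

def total_loss(inputs, degraded, P, positive_ref, S):
    """Eq. (10): averaged negative-sample loss + similarity loss + positive-sample term."""
    l_neg = negative_sample_loss(degraded, P, S)   # (1/3) * sum over the 3 degradation models
    l_sim = similarity_loss(inputs, P, S)          # L(theta)
    l_pos = 1.0 - S(positive_ref, P)               # one possible instantiation of L(O)
    return l_neg + l_sim + l_pos
```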

3.5 Network architecture and training

Our network adopts the classic Siamese framework, in which the weights of the two feature extraction branches are not shared. In traditional image fusion methods based on Siamese or auto-encoder networks, e.g., DeepFuse, DenseFuse, and IFCNN, the two sub-modules purely extract features from the different images. In contrast, our feature extraction branches not only extract features but also restore the dynamic degradation model. Notably, our network does not need datasets from other fields to assist learning, i.e., the cross-modal image fusion task completes the restoration of the dynamic degradation model with its own data. This strategy effectively reduces training complexity and avoids restoration failures caused by differences in data distribution. Different from existing deep-learning-based methods (e.g., NestFuse, RFN-Nest, and SAF), our network does not need a complex network structure such as ResNet, NestNet, or DenseNet. Our fusion module stacks four layers of dynamic convolution, and the convolution kernel weights are generated by a channel attention mechanism. During training, our dynamic degradation kernel is serialized, reduced in dimension, and stretched to align with the different modal data in dimension, as sketched below. When the same image is affected by multiple degradation kernels, it is difficult for a simple single-branch network to converge; that is why we avoid a purely data-driven training model in feature space.
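
The serialization step (denoted G in Fig. 2, following the degradation-map idea of [8]) can be sketched as follows: the 2-D dynamic kernel is vectorised, projected onto a low-dimensional PCA code, and stretched into constant maps matching the feature-map size. The PCA fitting procedure and the code length t below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def fit_kernel_pca(kernels, t=15):
    """Fit a PCA basis on a set of vectorised degradation kernels of shape (K, s, s),
    assuming K >= t sampled kernels are available."""
    flat = kernels.reshape(len(kernels), -1)
    mean = flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:t]                       # mean vector and top-t principal directions

def stretch_kernel(k_d, mean, basis, height, width):
    """Project one kernel to a t-dimensional code and stretch it to t maps of size HxW,
    ready to be concatenated with the feature maps of each modality."""
    code = basis @ (k_d.reshape(-1) - mean)   # (t,)
    return np.broadcast_to(code[:, None, None], (len(code), height, width))
```

The stretched maps produced by stretch_kernel play the role of $\bm{k}_{di}$ and $\bm{k}_{dv}$ in Eq. (6), giving every spatial location access to the degradation information.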

4 Experiments

In this section, the experimental setup is first presented. Then, the comparative experiments and analysis experiments are introduced.

Table 1: Metrics
No. Metric Equation and description
1 EN [20]: $EN = -\sum_{i=0}^{255} p_{i} \log_{2} p_{i}$, where $p_{i}$ is the probability of gray level $i$ appearing in the image.
2 AG [21]: $AG = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \sqrt{\frac{\Delta I_{x}^{2}(i,j) + \Delta I_{y}^{2}(i,j)}{2}}$, where $M \times N$ denotes the image height and width; $\Delta I_{x}(i,j)$ and $\Delta I_{y}(i,j)$ denote the horizontal and vertical image gradients.
3 SSIM [17]: $SSIM(I_{i}, R) = \frac{(2\mu_{I_{i}}\mu_{R} + C_{1})(2\sigma_{I_{i}R} + C_{2})}{(\mu_{I_{i}}^{2} + \mu_{R}^{2} + C_{1})(\sigma_{I_{i}}^{2} + \sigma_{R}^{2} + C_{2})}$, where $\mu_{I_{i}}$ and $\mu_{R}$ indicate the mean values of the original image $I_{i}$ and the fused image $R$; $\sigma_{I_{i}R}$ is the covariance between them.
4 VIFF [22]: $VIF = \frac{\sum_{j \in subbands} I(C^{N,j}; F^{N,j} | s^{N,j})}{\sum_{j \in subbands} I(C^{N,j}; E^{N,j} | s^{N,j})}$, where $C^{N,j}$ denotes the $N$ elements of $C_{j}$ that describe the coefficients from subband $j$; the numerator and denominator measure the visual information of the fused and reference images, respectively.
5 PSNR [23]: $PSNR = 10 \log_{10} \frac{(2^{n}-1)^{2}}{MSE}$, where $MSE$ is the mean square error between the current image $X$ and the reference image $Y$.
Table 2: Parameter settings of the evaluated methods. Bold values indicate different architectures of OURS. Time was computed on images with an average size of 400x400. The numbers in brackets indicate the size of the pre-trained model.
No. | Method | Parameters | Year | Category | Time (s) | Dynamic | Model Size (M)
1 | IFCNN [11] | lr_0=0.01, power=0.9 | 2020 | Deep learning | 0.021 | × | 0.32 (170)
2 | GANMCC [24] | epoch=10, c_dim=1, scale=3, stride=14 | 2020 | GAN | × | × | 10.90
3 | PGMI [13] | epoch=15, lr=1e-4, c_dim=1, stride=14, scale=3 | 2020 | Deep learning | 0.044 | × | 0.80
4 | NestFuse-max [15] | nb_filter=[64,112,160,208,256] | 2020 | Deep learning | 0.116 | × | 20.80
5 | U2Fusion [14] | num=30, epoch=[3,2,2], lam=0 | 2020 | Deep learning | 0.857 | ✓ | 1064.96
6 | GFF [25] | f=1, mode=1, n=10 | 2020 | Traditional | 0.094 | × | ×
7 | RFN-Nest [16] | batchsize=4, epoch=2, λ=100 | 2021 | Deep learning | 0.119 | ✓ | 18.20
8 | DDRF-Net | batchsize=16, lr=0.001, k_b=12 | 2021 | Deep learning | 0.028 | ✓ | 7.58

4.1 Experimental Setup

In this subsection, the datasets, metrics, and methods used for experimental evaluation are first presented. Then, the implementation details of the evaluated methods are introduced.

1) Datasets: We carry out experiments on four datasets, i.e., FLIR, RGB-t, gun, and TNO.

a) FLIR [26]: The FLIR dataset was captured by RGB and thermal imaging cameras mounted on a vehicle and contains 14,452 thermal infrared images, including 10,228 from short videos and 4,224 from a 144-second video. Unfortunately, the image pairs are not registered.

b) RGB-T [27]: The RGBT dataset includes 821 image pairs.

c) Gun [11, 28]: The Gun image pair was captured in night mode and is contaminated with blur and noise. There is a lot of overlap between this dataset and the TNO dataset; therefore, we only use the gunA & gunB image pair from this dataset.

d) TNO [29]: It contains multi-spectral (near-infrared and long-wave infrared or thermal) night images of different military-related scenes, registered in different multi-band camera systems. There are 21 pairs of images commonly used in existing image fusion methods.

2) Metrics: Fusion quality is quantitatively evaluated using entropy (EN) [20], average gradient (AG) [21], structural similarity (SSIM) [17], visual information fidelity (VIF) [22], the natural image quality evaluator (NIQE) [30], and PSNR [23], as defined in Table 1.
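
For reference, three of these metrics (EN, AG, PSNR) can be computed directly from their definitions in Table 1, as in the sketch below for 8-bit grayscale images; SSIM, VIF, and NIQE are more involved and are omitted here.

```python
import numpy as np

def entropy(img):
    """EN: Shannon entropy of the 8-bit gray-level histogram."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def average_gradient(img):
    """AG: mean of sqrt((dIx^2 + dIy^2) / 2) over the image."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

def psnr(img, ref, n_bits=8):
    """PSNR between a fused image and a reference, with peak value 2^n - 1."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 10.0 * np.log10(((2 ** n_bits - 1) ** 2) / mse)
```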

3) Methods: As shown in Table 2, we compare 7 state-of-the-art methods, i.e., 5 deep-learning-based methods, 1 GAN-based method, and 1 traditional method, all published in top venues.

4) Implementation details: All experiments were conducted in the same environment. Our experimental platform is a desktop with a 3.0 GHz Intel i5-8500 CPU, an RTX 2070 GPU, and 32 GB of memory.

Figure 3: Examples of tested methods on the RGBT dataset with low-light and blur degradation.

4.2 Comparative experiments

To highlight the advantages of our method, we carry out comparative and analysis experiments on the cross-modal image fusion and image restoration tasks.

4.2.1 Image Fusion

To show the superiority of our method, we test DDRF-Net and the state-of-the-art methods on the RGBT and FLIR datasets. In Figure 3, the visible images suffer from severe low-light and blur problems, and only our fusion method still performs well in both cases. In Figure 4, there are strong highlights and halos in the visible image, and blur and noise in the infrared image. IFCNN, U2Fusion, and NestFuse handle the highlights well; NestFuse is relatively poor at handling the halo and blur, and RFN-Nest handles blur and highlights poorly. In contrast, our method achieves the best results for highlights, halos, and blur compared with the state-of-the-art methods. From Figure 4 and Table 3, we can see that our method yields a very significant improvement in gradient compared with the other methods, which means that our results are sharper.

Figure 4: Examples of tested methods on the FLIR dataset with high-light, blur and noise degradation.
Table 3: Quantitative comparison of the tested methods on the FLIR and RGBT datasets using five metrics (EN, AG, SSIM, VIF, PSNR) and their mean. Red represents the best objective score, and green represents the second best.
Datasets FLIR RGBT
Metrics EN AG SSIM VIF PSNR Mean EN AG SSIM VIF PSNR Mean
DDRF 7.22 9.43 0.70 0.38 43.66 10.59 7.71 21.02 0.58 0.31 35.33 11.23
U2Fusion 7.41 6.77 0.70 0.40 36.53 9.05 7.41 12.98 0.61 0.32 35.12 9.85
RFN 7.46 2.79 0.69 0.28 34.63 7.95 7.34 6.54 0.60 0.29 35.21 8.83
PGMI 7.37 4.66 0.70 0.34 36.74 8.73 7.54 9.71 0.62 0.28 35.37 9.32
NestFuse 7.60 4.79 0.68 0.38 35.02 8.56 7.71 13.07 0.62 0.41 39.79 11.08
IFCNN 7.09 5.92 0.76 0.36 44.56 10.19 7.24 14.11 0.64 0.32 34.19 9.89
GFF 7.34 5.06 0.75 0.37 43.65 10.08 7.50 12.85 0.64 0.34 35.41 10.03
GANMCC 7.32 3.90 0.61 0.30 28.66 7.14 6.51 5.33 0.53 0.24 35.65 8.43
Figure 5: Examples of tested methods on the Gun dataset with noise and down-sampling degradation.

4.2.2 Image restoration

Figure 6: Examples of tested methods on the TNO dataset with low-light degradation

In this experiment, for the dynamic image restoration task, we only need to convert the cross-modal inputs into the same input data (i.e., feed the same image to both branches) in the test phase. Our dynamic restoration model includes pure dynamic restoration (i.e., $H_{1}$, $H_{2}$) and fused dynamic restoration (i.e., $H_{3}$). On the Gun test data, for the down-sampling degradation, we downsample the visible image, so this experiment has ground truth (GT) labels. For the illumination degradation, we use the TNO dataset; however, GT labels cannot be obtained for illumination changes. Therefore, for the brightness-change problem we use a dedicated image restoration method, Zero-DCE [7] published at CVPR 2020, as ground truth (GT). Notably, Zero-DCE is only effective for brightness changes and cannot effectively restore image quality after downsampling, which is why it cannot be used as GT for the downsampling experiment. From Figures 5 and 6 and Table 4, we find that DDRF-Net is effective not only for cross-modal image fusion but also for image restoration. Among them, the pure image restoration models (i.e., $H_{1}$, $H_{2}$) have a better effect than the dynamic fusion restoration model (i.e., $H_{3}$). DDRF-Net surpasses the tested methods (e.g., U2Fusion, PGMI, NestFuse, and RFN-Nest) in restoration ability. Although IFCNN shows good restoration ability, there is still a gap compared with DDRF-Net. From the visual results, we can also see that the dedicated restoration algorithm Zero-DCE achieves good results for brightness and noise but loses many texture details, and its contrast is significantly reduced compared with the original image. In addition, due to the difference in data distribution between infrared and visible images, $H_{1}$ and $H_{2}$ perform differently on different data distributions. Notably, from Figure 6 and Table 4, we also find that when the original images have edge-pixel artifacts caused by down-sampling degradation, $H_{1}$ and $H_{2}$ achieve better results than the existing methods, i.e., they effectively overcome the edge-pixel problem. However, the fusion result of $H_{3}$ is better than those of $H_{1}$ and $H_{2}$, i.e., it is clearer and has better subjective visual perception.

4.3 Analysis Experiments

In this section, we analyze the impact of static versus dynamic models on the image restoration and image fusion tasks.

4.3.1 Static and dynamic restore and fusion analysis

Although the comparative experiments show that DDRF-Net has better fusion performance than the state-of-the-art methods, we still need to analyze whether the dynamic model outperforms the static model. In this experiment, we compare static versus dynamic restoration and static versus dynamic fusion. All training parameters of the network models are the same, i.e., epoch=89, lr=0.001, batchsize=16. In the static and dynamic restoration tasks, we use traditional convolution and dynamic convolution as the basic network unit, respectively. The experiment consists of three parts, i.e., restore$_{0}$ analysis, restore$_{1}$ analysis, and fusion analysis. The restore$_{0}$ analysis optimizes the image restoration loss alone. The restore$_{1}$ analysis jointly optimizes image restoration and fusion, and is used to measure the difference between restoration-only optimization and joint optimization. The fusion analysis compares static fusion with dynamic fusion.

Table 4: Quantitative comparison of the tested methods on the Gun and TNO datasets using five metrics (EN, AG, SSIM, VIF, PSNR) and their mean. Red represents the best objective score, and green represents the second best.
Datasets Gun TNO
Metrics EN AG SSIM VIF PSNR Mean EN AG SSIM VIF PSNR Mean
U2Fusion 7.23 7.87 0.10 0.01 20.44 7.13 3.96 2.81 0.44 0.27 35.73 8.64
RFN 6.85 3.00 0.61 0.28 46.56 11.46 4.88 1.74 0.29 0.24 33.16 8.06
PGMI 5.98 5.80 0.42 0.17 27.02 7.88 5.16 4.86 0.67 0.38 31.15 8.44
NestFuse 6.84 4.41 0.65 0.35 49.73 12.40 5.07 3.17 0.46 0.39 36.11 9.04
IFCNN 6.93 7.82 0.65 0.38 46.36 12.43 5.43 5.14 0.56 0.51 37.69 9.87
GFF 6.86 4.35 0.65 0.35 50.02 12.45 5.09 3.22 0.48 0.39 36.55 9.15
GANMCC 7.07 5.48 0.62 0.29 37.91 10.27 5.63 2.88 0.77 0.34 49.21 11.76
DDRF-H1 6.91 6.53 0.93 0.77 53.34 13.69 6.07 6.67 0.90 0.59 48.74 12.59
DDRF-H2 6.89 6.18 0.95 0.76 57.77 14.51 5.45 4.18 0.67 0.46 40.69 10.29
DDRF-H3 6.89 6.18 0.95 0.76 56.81 14.32 5.79 6.42 0.68 0.52 40.38 10.76
Figure 7: Static and dynamic analysis. (a) Restoration results optimized only by the image restoration loss, i.e., dynamic restore0 and static restore0. (b) Restoration results optimized jointly by the restoration and image fusion losses, i.e., dynamic restore1 and static restore1. (c) Fusion analysis results, i.e., dynamic fusion and static fusion.

From Figure 7, we can draw the following four conclusions. (1) Compared with the joint optimization of restoration loss and image fusion loss, the results obtained by optimizing the restoration loss alone have better image definition, e.g., contrast, illumination, and gradient. (2) Dynamic fusion achieves better fusion quality than static fusion in complex environments (e.g., low light). (3) Under the joint optimization strategy, the static restoration result is poor and shows an obvious hazy effect, whereas the dynamic restoration result is good. (4) The static fusion model has better gradient information in normal environments, which makes the image edges sharper.

4.3.2 Comparative analysis of loss and network architecture

To demonstrate the superiority of our loss function and network architecture, we compare and analyze the traditional similarity loss, the traditional static fusion network, and the dynamic fusion network. In this experiment, we use the same training dataset, the same network structure, and the same training parameters, and mainly explore the influence of the traditional similarity loss and our loss function on image fusion, i.e., static fusion and dynamic fusion. The experimental results are shown in Figure 8. The experiment includes 8 sub-experiments, i.e., 4 restoration experiments and 4 fusion experiments.

Figure 8: Loss function and network architecture analysis. St denotes the static network; Dy denotes the dynamic network; t-loss denotes the traditional fusion loss; f-loss denotes the $H_{3}$ fusion loss. There are four groups of experiments with the static network, i.e., (a), (c), (d), and (e), and four groups with the dynamic network, i.e., (b), (f), (g), and (h). The first row of each group shows the image restoration experiment, and the second row shows the image fusion experiment. (a) Traditional static image fusion network, i.e., without image restoration, with the traditional loss and a static network. (b) Dynamic image fusion network, i.e., without image restoration, with the traditional loss and a dynamic network. (c) and (f): Image fusion networks (static and dynamic, respectively) based on the traditional similarity loss, where the image restoration loss is used only in the first 10 epochs. (d) and (g): Image fusion networks (static and dynamic, respectively) based on the traditional similarity loss, where the image restoration loss and image fusion loss are jointly optimized. (e) and (h): The same loss function with different networks, i.e., the static fusion network and the dynamic fusion network.

From Figure 8, we can observe several phenomena. (1) The fusion results based on our joint fusion and restoration framework (c-f) are generally better than those of (a-b), which shows that our framework outperforms a pure fusion framework. (2) Comparing the static groups (a, c-e) with the dynamic groups (b, f-h), the dynamic network shows better comprehensive performance in image restoration and fusion quality. (3) The four groups of experiments (d-e) and (g-h) demonstrate that, under the same network framework, the proposed loss significantly improves the quality of image restoration, and (d) and (g) alleviate the over-sharpening problem. (4) Comparing (c, f) with (e, h), the former have better gradient information; compared with (c), (f) has better image restoration results, i.e., the dynamic network is more advantageous.

4.3.3 Image restore and fusion efficiency analysis

The proposed network can complete the image restoration and image fusion tasks at the same time. As far as we know, existing image fusion networks cannot handle both tasks simultaneously. The model sizes and running times are shown in Table 2. The experimental data show that, under the same conditions, our method has a small model size and fast running speed. In addition, our network needs neither a complex pre-trained model nor a complex network structure.

5 Discussion

The exhaustive experiments in Sect. 4 verify that our dynamic image fusion method is more robust than existing cross-modal image fusion methods. This also demonstrates the effectiveness of our simulation of the dynamic cognitive mechanism of the human brain. There are three potential reasons:

1) Dynamic fusion mechanism. Humans are robust in a variety of visual tasks, which is closely related to the dynamic characteristics of the human brain. This dynamic characteristic is of great significance for various visual tasks. However, existing cross-modal image fusion methods mostly use static fusion weights or separate the fusion layer from the decoding layer, which aggravates the effect of data distribution differences to a certain extent.

2) Dynamic degradation mechanism. Although existing image restoration and super-resolution tasks have studied a variety of degradation kernels, the related research is limited to static degradation kernels applied to a single image. In the field of image fusion, to the best of our knowledge, there is no research on robustness to such degradation, which is why our study is of great significance for image fusion.

3) Joint optimization loss. On top of the traditional similarity-measure loss (i.e., the similarity between the predicted image and the original images), we construct a dynamic negative sample fusion loss based on the dynamic degradation model. On the one hand, the loss function reduces the difference between the prediction and the original data distribution through the similarity loss; on the other hand, by expanding the distance between the prediction and the dynamically degraded data distribution, the network is forced to optimize toward improving quality rather than simply resembling the original images.

6 Conclusion

In this paper, we proposed a dynamic cross-modal image restoration and fusion method inspired by the dynamic cognitive mechanism of the human brain. We mainly studied the influence of the dynamic degradation model and dynamic fusion weights on the image restoration and fusion tasks. Our network effectively unifies the image restoration and image fusion tasks, so that a single network can complete both. Our network also has better feature mining ability than existing image fusion methods. To verify the effectiveness of DDRF-Net, we carried out a large number of comparative and analysis experiments, and the results show the superiority of our method.

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grants no. 61871326 and no. 61231016, and the Shaanxi Natural Science Basic Research Program under Grant no. 2018JM6116.

References


  • [1] R. M. Hutchison, T. Womelsdorf, E. A. Allen, P. A. Bandettini, V. D. Calhoun, M. Corbetta, S. Della Penna, J. H. Duyn, G. H. Glover, J. Gonzalez-Castillo, D. A. Handwerker, S. Keilholz, V. Kiviniemi, D. A. Leopold, F. de Pasquale, O. Sporns, M. Walter, C. Chang, Dynamic functional connectivity: Promise, issues, and interpretations, NeuroImage 80 (2013) 360–378.
  • [2] Y. Yuan, Y. Qiu, A. C. Schouten, Dynamic functional brain connectivity for face perception, Frontiers in Human Neuroscience 9 (e24).
  • [3] Y. Han, G. Huang, S. Song, L. Yang, H. Wang, Y. Wang, Dynamic neural networks: A survey (2021). arXiv:2102.04906.
  • [4] B. Yang, G. Bender, Q. V. Le, J. Ngiam, Condconv: Conditionally parameterized convolutions for efficient inference, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 1307–1318.
  • [5] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, Z. Liu, Dynamic convolution: Attention over convolution kernels, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [6] Y. Zhang, J. Zhang, Q. Wang, Z. Zhong, Dynet: Dynamic convolution for accelerating convolutional neural networks, ArXiv abs/2004.10694.
  • [7] C. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, R. Cong, Zero-reference deep curve estimation for low-light image enhancement, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1777–1786.
  • [8] K. Zhang, W. Zuo, L. Zhang, Learning a single convolutional super-resolution network for multiple degradations, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3262–3271.
  • [9] L. Zhang, J. Nie, W. Wei, Y. Zhang, S. Liao, L. Shao, Unsupervised adaptation learning for hyperspectral imagery super-resolution, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3070–3079.
  • [10] H. Li, X.-J. Wu, Densefuse: A fusion approach to infrared and visible images, IEEE Transactions on Image Processing 28 (5) (2019) 2614–2623.
  • [11] Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, L. Zhang, Ifcnn: A general image fusion framework based on convolutional neural network, Information Fusion 54 (2020) 99 – 118.
  • [12] H. Xu, J. Ma, Z. Le, J. Jiang, X. Guo, Fusiondn: A unified densely connected network for image fusion, in: Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020, pp. 12484–12491.
  • [13] H. Zhang, H. Xu, Y. Xiao, X. Guo, J. Ma, Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity, Proceedings of the AAAI Conference on Artificial Intelligence 34 (7) (2020) 12797–12804.
  • [14] H. Xu, J. Ma, J. Jiang, X. Guo, H. Ling, U2fusion: A unified unsupervised image fusion network, IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) 1–1.
  • [15] H. Li, X. J. Wu, T. Durrani, Nestfuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models, IEEE Transactions on Instrumentation and Measurement 69 (12) (2020) 9645–9656.
  • [16] H. Li, X.-J. Wu, J. Kittler, Rfn-nest: An end-to-end residual fusion network for infrared and visible images, Information Fusion 73 (2021) 72–86.
  • [17] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13 (4) (2004) 600–612.
  • [18] M. Jiayi, Y. Wei, L. Pengwei, L. Chang, J. Junjun, Fusiongan: A generative adversarial network for infrared and visible image fusion, Information Fusion 48 (2019) 11 – 26.
  • [19] J. Ma, H. Xu, J. Jiang, X. Mei, X.-P. Zhang, Ddcgan: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion, IEEE Transactions on Image Processing 29 (2020) 1–1.
  • [20] H. R. Sheikh, A. C. Bovik, Image information and visual quality, IEEE Transactions on Image Processing 15 (2) (2006) 430–444.
  • [21] G. Cui, H. Feng, Z. Xu, Q. Li, Y. Chen, Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition, Optics Communications 341 (341) (2015) 199–209.
  • [22] Y. Han, Y. Cai, Y. Cao, X. Xu, A new image fusion performance metric based on visual information fidelity, Information Fusion 14 (2) (2013) 127–135.
  • [23] J. Sijbers, P. Scheunders, N. Bonnet, D. Van Dyck, E. Raman, Quantification and improvement of the signal-to-noise ratio in a magnetic resonance image acquisition procedure, Magnetic Resonance Imaging 14 (10) (1996) 1157–1163.
  • [24] J. Ma, H. Zhang, Z. Shao, P. Liang, H. Xu, Ganmcc: A generative adversarial network with multi-classification constraints for infrared and visible image fusion, IEEE Transactions on Instrumentation and Measurement 70 (2021) 5005014.
  • [25] J. Ma, Y. Zhou, Infrared and visible image fusion via gradientlet filter, Computer Vision and Image Understanding 197-198 (2020) 103016.
  • [26] AZoSensors, Flir releases starter thermal imaging dataset for machine learning advanced driver assistance development (2018).
  • [27] C. Li, X. Liang, Y. Lu, N. Zhao, J. Tang, Rgb-t object tracking: Benchmark and baseline, Pattern Recognition 96 (2019) 106977.
  • [28] Z. Zhou, B. Wang, S. Li, M. Dong, Perceptual fusion of infrared and visible images through a hybrid multi-scale decomposition with gaussian and bilateral filters, Information Fusion 30 (2016) 15–26.
  • [29] A. Toet, TNO Image Fusion Dataset, Figshare dataset (2014).
  • [30] A. Mittal, R. Soundararajan, A. C. Bovik, Making a completely blind image quality analyzer, IEEE Signal Processing Letters 20 (3) (2013) 209–212.