
Unsupervised Hyperspectral and Multispectral Images Fusion Based on the Cycle Consistency

Shuaikai Shi,  Lijun Zhang, Yoann Altmann, 
Jie Chen, 
Shuaikai Shi, Lijun Zhang and Jie Chen are with the Center of Intelligent Acoustics and Immersive Communications, School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China, and also with the Key Laboratory of Ocean Acoustics and Sensing, Ministry of Industry and Information Technology, Xi’an 710072, China (e-mail: [email protected]; [email protected]; [email protected]). Yoann Altmann is with the School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh EH14 4AS, U.K. (e-mail: [email protected]). Part of this work was supported by the Royal Academy of Engineering under the Research Fellowship scheme RF201617/16/31.
Abstract

Hyperspectral images (HSI), whose abundant spectral information reflects material properties, usually have low spatial resolution due to hardware limits. Meanwhile, multispectral images (MSI), e.g., RGB images, have high spatial resolution but deficient spectral signatures. Hyperspectral and multispectral image fusion is a cost-effective and efficient way to acquire images with both high spatial and high spectral resolution. Many conventional HSI and MSI fusion algorithms rely on known spatial degradation parameters, i.e., the point spread function (PSF), known spectral degradation parameters, i.e., the spectral response function (SRF), or both of them. Another class of deep learning-based models relies on the ground truth of high spatial resolution HSI and needs large amounts of paired training images when working in a supervised manner. Both classes of models are limited in practical fusion scenarios. In this paper, we propose an unsupervised HSI and MSI fusion model based on cycle consistency, called CycFusion. CycFusion learns the domain transformations between low spatial resolution HSI (LrHSI) and high spatial resolution MSI (HrMSI), and the desired high spatial resolution HSI (HrHSI) is treated as the intermediate feature map of the transformation networks. CycFusion can be trained with objective functions enforcing marginal matching in single transformations and cycle consistency in double transformations. Moreover, the estimated PSF and SRF are embedded in the model as pre-training weights, which further enhances the practicality of our proposed model. Experiments conducted on several datasets show that our proposed model outperforms all compared unsupervised fusion methods. The code of this paper will be available at https://github.com/shuaikaishi/CycFusion for reproducibility.

Index Terms:
Unsupervised fusion, hyperspectral image, multispectral image, cycle consistency, super-resolution.

I Introduction

Hyperspectral imaging simultaneously captures spatial information and spectral signatures of the observed objects, enabling the identification of material classes and component information. Consequently, hyperspectral imaging has been widely used over the past decades in areas including object detection [1], face recognition [2], remote sensing [3], medical diagnosis [4], agriculture [5] and the food industry [6]. However, a larger instantaneous field of view (IFOV) [7] is required when acquiring hyperspectral images (HSI), compared to multispectral images (MSI), e.g., RGB images, to obtain an acceptable signal-to-noise ratio (SNR). Thus, there is a trade-off between high spatial resolution and high spectral resolution during image acquisition due to hardware limits. In practice, it is only possible to capture either high spatial resolution multispectral images (HrMSI) or low spatial resolution hyperspectral images (LrHSI), rather than simultaneously achieve high spectral and high spatial resolution. Fortunately, HSI-MSI fusion [8] is an effective tool designed to output an image with high resolution in both the spatial and spectral dimensions by fusing a pair of HrMSI and LrHSI.

I-A Motivation

Conventional HSI and MSI fusion methods generally assume the imaging model is known, i.e., the point spread function (PSF) of the spatial degradation model from HrHSI to LrHSI and the spectral response function (SRF) of the spectral merging process from HrHSI to HrMSI are both known, or at least one of them is. In practice, however, the spatial and spectral downsampling processes are complex and it is non-trivial to accurately determine these parameters [9]. Therefore, this type of traditional model has limited performance when the downsampling parameters mismatch those of the actual system.

Deep learning-based approaches have been introduced into the fusion process to recover the spatial and spectral details of the HrHSI and obtain promising results; they often use HrHSI as ground truth to supervise the training process. However, HrHSI is usually inaccessible in practical fusion scenarios. Moreover, these methods require a large number of paired images as training data. Likewise, strictly paired training data is hard to obtain, i.e., LrHSI and HrMSI may be non-rigidly aligned in practice. These two facts limit the practical generalization ability of deep learning-based models for HSI and MSI fusion.

I-B Methodology Overview and Contributions

Recently, cycle-consistent generative adversarial networks (CycleGAN) [10] have been successfully used for cross-domain transformation tasks, e.g., image style transfer [11], unsupervised language translation [12], image superresolution [13], and near-infrared (NIR) to RGB image transformation [14]. This method, also known as unsupervised distribution alignment, works in an unsupervised manner and does not impose the constraint of using paired training data. Inspired by CycleGAN, we propose an HSI and MSI fusion model based on cycle consistency during the transformation between the LrHSI domain and the HrMSI domain, named CycFusion. Instead of using known PSF and SRF as priors, as in traditional methods, or known HrHSI as ground truth to train deep fusion networks in a supervised manner, our proposed CycFusion estimates the image degradation processes and then implements the fusion task in an unsupervised manner. Specifically, the proposed model consists of two downsampling modules and two upsampling modules. The downsampling modules contain a shared-weight depthwise separable convolution network to estimate the PSF kernel of the spatial degradation process, and a $1\times 1$ convolution module to estimate the SRF matrix of the spectral degradation model. Inspired by the PSF and SRF estimation processes in [15], we use the above downsampling modules to degrade the HrMSI in the spatial dimension and the LrHSI in the spectral dimension, obtaining two low resolution MSI (LrMSI), and then constrain the consistency of these two images to learn the downsampling processes. The estimated results are used as pre-training parameters of CycFusion. The other two upsampling modules, namely, the spatial superresolution module and the spectral superresolution module, restore the spatial details and spectral signatures of the desired result through the mappings from LrHSI to HrHSI and from HrMSI to HrHSI, respectively. With the above modules, we can transfer an image in the LrHSI domain to the image of the same scene in the HrMSI domain through the spatial upsampling and spectral downsampling modules, and the reverse conversion holds via the other two modules. The proposed CycFusion is shown in Fig. 1. After performing the training strategy detailed in Section III, the fused result is obtained by extracting the feature maps of the intermediate transformation.

The main contributions of our proposed model are summarized as follows.

  1. Inspired by the recent progress in domain transformation, a novel unsupervised hyperspectral and multispectral fusion model is proposed.

  2. The observation model is embedded in CycFusion, and the estimated degradation parameters make it possible to implement the blind fusion task using deep neural networks with high model capacity.

  3. The proposed model consists of cross-domain transformations between LrHSI and HrMSI. The experimental results show that the fused images benefit from the objective function enforcing cycle consistency between the input LrHSI and HrMSI and their reconstructions after two successive domain transformations.

The rest of this paper is organized as follows. Section II introduces related work. The proposed CycFusion is presented in Section III. In Section IV, the effectiveness of CycFusion is demonstrated through experiments on three publicly available datasets. Section V concludes this paper and gives future directions.

II Related Work

The fusion of LrHSI and HrMSI has proven to be an effective approach for producing the desired HrHSI. HSI-MSI fusion methods are generally classified into three types, namely, pansharpening-based approaches, subspace-based approaches and deep learning-based approaches.

II-A Pansharpening-based HSI-MSI Fusion

Pansharpening algorithms [16] aim at injecting the spatial details of high-resolution panchromatic images into LrHSI, and include component substitution (CS) approaches and multi-resolution analysis (MRA) approaches. These methods have been extended to HSI-MSI fusion. Gram–Schmidt adaptive (GSA) [17] is a representative CS-based pansharpening algorithm, which uses the Gram–Schmidt transformation to separate the spatial component of the LrHSI that is then substituted by the HrMSI. MRA-based algorithms [18] produce the desired HrHSI by injecting the spatial details obtained by subtracting a low-pass version of the HrMSI from the HrMSI itself. Generalized Laplacian pyramid-based hyper-sharpening (GLP-HS) [19] is a representative method that uses pyramidal decompositions to obtain the high spatial resolution structures. Pansharpening-based HSI-MSI fusion methods are computationally efficient and do not rely on a known PSF and SRF; however, these approaches compute the fused results band by band, which may introduce spectral distortion.

II-B Subspace-based HSI-MSI Fusion

Subspace-based HSI-MSI fusion methods leverage the low-rankness and sparsity of the desired HrHSI and generally assume there is a subspace that can represent all spectral signatures of the HrHSI. Representative approaches include unmixing-based methods and orthogonal subspace-based methods. Coupled non-negative matrix factorization (CNMF) [20] pioneered unmixing-based fusion, which embeds the linear mixing model (LMM) in the fusion problem and iteratively optimizes the endmembers and abundance maps of the HrHSI. Orthogonal subspace-based fusion methods assume the HrHSI lies in a low-dimensional orthogonal subspace. These methods generally require proper regularization of the HrHSI coefficients on the subspace because of the under-estimation problem [21] caused by the dimension of the orthogonal subspace possibly being larger than the number of HrMSI bands. Commonly used regularization imposes smoothness and sparsity priors on the coefficients, e.g., hyperspectral superresolution (HySure) [15] constrains the smoothness of the HrHSI in the subspace using band-by-band total variation (TV), the work in [22] uses dictionary learning with sparse coefficients, and the non-negative structured sparse representation (NSSR) [23] method models the spatial correlations via clustering. Furthermore, convolutional neural network (CNN) denoisers have been introduced in place of the regularization term to produce reasonable results, named CNN-Fus [24]. It is worth mentioning that the work in [25] implements the fusion process by solving a Sylvester equation (FUSE), which greatly improves the fusion efficiency. Beyond solving 2D matrix-based optimization problems, tensor-based methods directly process the 3D HSI data cube. The representative approach of this class is the coupled sparse tensor factorization (CSTF) [26] method based on the Tucker decomposition. Subspace-based HSI-MSI fusion methods have interpretable model designs and physically meaningful regularizers, leading to superior performance over pansharpening-based methods. However, these models generally need the PSF and SRF as priors, which limits their application in real scenarios. Besides, these methods only adopt the linear mixing model and linear subspaces, resulting in restrictive models.

II-C Deep Learning-based HSI-MSI Fusion

Recently, deep learning, as an expressive model family, has been introduced to HSI-MSI fusion [27]. Deep learning-based fusion models can be divided into two classes, namely, supervised methods and unsupervised methods. Supervised fusion methods use a number of paired LrHSI and HrMSI as inputs and HrHSI as ground truth. Once properly trained, the fusion models can fuse two degraded images in the test dataset and output the desired HrHSI. Most deep learning-based methods focus on designing expressive modules to extract the high-resolution spatial information and spectral signatures and then fuse them into the output images. Wang et al. [28] introduced the deep prior into HSI super-resolution to learn spatial-spectral priors automatically. Xie et al. [29] proposed a model-inspired MSI-HSI fusion network (MHF-net) which enhances interpretability by estimating the observation model from the training data. Hu et al. [30] designed a deep spatial and spectral attention CNN for hyperspectral image super-resolution, named HSRnet. Wang et al. [31] proposed an alternating and iterative optimization algorithm to estimate both the degradation model and the fusion model, called the enhanced deep blind hyperspectral image fusion network (EDBIN). Instead of using the known PSF and SRF as in subspace-based models, deep learning-based models leverage the high capacity of deep neural networks to recover the HrHSI from low-resolution images and obtain state-of-the-art performance. However, these models need large numbers of HrHSI as training data, which cannot be acquired in the real world. Fortunately, unsupervised fusion methods have been proposed to reduce the reliance on HrHSI as training data. In [32], the authors assume LrHSI and HrHSI have similar abundances and propose a two-branch unsupervised sparse Dirichlet-Net (uSDN) that iteratively learns the shared abundance features. Then, a nonlinear variational probabilistic generative model (NVPGM) [33] extends the fusion model by globally training the model parameters and obtains more accurate fusion results. Moreover, CNNs have been introduced into unsupervised fusion models to learn spatially correlated structures and further improve the fusion quality, as in FusionNet [34]. The guided deep decoder (GDD) [35], based on the deep image prior, was proposed to produce the fused images from a noise input together with the degraded images. A coupled unmixing model with a cross-attention module is embedded into the fusion network CUCaNet [36], which can learn the unknown SRF and PSF. These unsupervised fusion models are independent of the ground truth HrHSI as training data and thus have promising applications in practice.

To summarize, the prior information of representative fusion methods are listed in Table I.

TABLE I: The properties of representative fusion methods.
Category | Methods
Pansharpening-based | GSA [17], GLP-HS [19]
Subspace-based | HySure [15], CNMF [20], CNN-Fus [24], FUSE [25], CSTF [26], NSSR [23]
Supervised deep learning-based | HSRnet [30], MHF-net [29], EDBIN [31]
Unsupervised deep learning-based | uSDN [32], NVPGM [33], GDD [35], CUCaNet [36], CycFusion (ours)
Figure 1: Framework of the proposed CycFusion, which consists of four domain transformations, including two downsampling modules, $\mathcal{F}_{y}(\cdot)$ and $\mathcal{F}_{z}(\cdot)$, and two upsampling modules, $\mathcal{G}_{y}(\cdot)$ and $\mathcal{G}_{z}(\cdot)$. $\mathcal{F}_{y}(\cdot)$ is realized by depthwise separable CNNs to model the PSF and $\mathcal{F}_{z}(\cdot)$ by pointwise CNNs to approximate the SRF. $\mathcal{G}_{y}(\cdot)$ contains an interpolation block and a refining block, and $\mathcal{G}_{z}(\cdot)$ uses pointwise CNNs to estimate the spectral superresolution process.

III The Proposed CycFusion Model

In this section, we present the proposed CycFusion, including the problem formulation, the degradation model and superresolution networks, the objective function, and the training method.

III-A Problem Formulation

Given a pair of an observed LrHSI cube $\mathcal{Y}\in\mathbb{R}^{L\times w\times h}$ and an HrMSI cube $\mathcal{Z}\in\mathbb{R}^{l\times W\times H}$, the objective of HSI-MSI fusion is to aggregate the spectral and spatial information of these two data cubes and produce the desired HrHSI cube $\mathcal{X}\in\mathbb{R}^{L\times W\times H}$, where $\{W,H,L\}$ and $\{w,h,l\}$ denote the widths, heights and band numbers of the high-resolution and low-resolution images. Generally, $w\ll W$, $h\ll H$ and $l\ll L$, and thus the fusion is ill-posed. When unfolding the 3D tensors to 2D matrices, most of the existing literature assumes the observation model is linear as follows

$\mathbf{Y} = \mathbf{XBS} + \mathbf{N}_{y},$
$\mathbf{Z} = \mathbf{RX} + \mathbf{N}_{z},$ (1)

where $\mathbf{X}\in\mathbb{R}^{L\times WH}$, $\mathbf{Y}\in\mathbb{R}^{L\times wh}$ and $\mathbf{Z}\in\mathbb{R}^{l\times WH}$ are the HrHSI, LrHSI and HrMSI matrices, respectively. $\mathbf{B}\in\mathbb{R}^{WH\times WH}$ is the blurring matrix constructed from the PSF kernel. $\mathbf{S}\in\mathbb{R}^{WH\times wh}$ and $\mathbf{R}\in\mathbb{R}^{l\times L}$ are the spatial downsampling matrix and the SRF matrix. $\mathbf{N}_{y}$ and $\mathbf{N}_{z}$ are additive noise matrices. Note that the above observation model (1) is adopted in most of the existing HSI-MSI fusion literature [26, 23]. The task of HSI-MSI fusion is to recover the matrix $\mathbf{X}$ using $\mathbf{Y},\mathbf{Z}$ as inputs, which may use known $\mathbf{B},\mathbf{S}$ and $\mathbf{R}$ in non-blind fusion or their estimated values in blind fusion.
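To make the observation model (1) concrete, the following is a minimal sketch (in PyTorch) of how an LrHSI and an HrMSI can be simulated from an HrHSI cube; the cube, the uniform PSF kernel and the random SRF matrix are synthetic placeholders, not the degradations used in the experiments, and the noise terms are omitted.

```python
# A minimal sketch of the observation model in Eq. (1); all data are placeholders.
import torch
import torch.nn.functional as F

L, W, H, l, S = 31, 512, 512, 3, 32          # band counts, spatial sizes, downsampling ratio
X = torch.rand(1, L, W, H)                   # HrHSI cube (batch, bands, width, height)

# Spatial degradation X*B*S: blur with a PSF kernel shared across bands, then decimate.
psf = torch.ones(1, 1, S, S) / (S * S)       # uniform averaging kernel (sums to one)
Y = F.conv2d(X, psf.expand(L, 1, S, S), stride=S, groups=L)   # LrHSI: (1, L, W/S, H/S)

# Spectral degradation R*X: merge bands with an SRF matrix whose rows sum to one.
R = torch.softmax(torch.rand(l, L), dim=1)   # placeholder SRF, l x L
Z = torch.einsum('ml,blwh->bmwh', R, X)      # HrMSI: (1, l, W, H)

print(Y.shape, Z.shape)  # torch.Size([1, 31, 16, 16]) torch.Size([1, 3, 512, 512])
```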

III-B Degradation Model

Based on the above observation model, we propose two CNN-based modules, namely, the spatial downsampling module and spectral downsampling module, to estimate the parameters in the observation model for performing the blind fusion task.

First, the spatial degradation process acts only in the spatial dimension and is independent of the spectral dimension. Inspired by this, we introduce a depthwise separable CNN [37] with a kernel shared among all channels, whose size equals the stride size and also the spatial downsampling ratio $S$, to model the PSF kernel. The spatial degradation process is then formulated by a convolutional operator as

$\mathcal{Y} = [\mathcal{X}_{1,:,:}*\mathbf{K},\ \mathcal{X}_{2,:,:}*\mathbf{K},\ \dots,\ \mathcal{X}_{L,:,:}*\mathbf{K}],$
$\mathbf{Y} = \mathcal{F}_{y}(\mathbf{X};\mathbf{K}),$ (2)

where $\mathcal{X}_{i,:,:}$ denotes the $i$-th slice of the HrHSI tensor, i.e., the $i$-th channel image of the data cube, and $[\dots,\dots,\dots]$ denotes concatenating matrices into a tensor. The convolutional operator is abbreviated as $\mathcal{F}_{y}(\cdot)$, parameterized by a shared convolutional kernel $\mathbf{K}\in\mathbb{R}^{S\times S}$. In addition, the kernel parameters should satisfy the sum-to-one constraint to ensure energy conservation in the spatial degradation model, i.e.,

$\sum_{i=1}^{S}\sum_{j=1}^{S}\mathbf{K}_{i,j} = 1$ (3)

and the Softmax function is applied to the model parameter $\boldsymbol{\Theta}^{(PSF)}\in\mathbb{R}^{S\times S}$ to ensure the above constraint, as shown in the top left of Fig. 1:

$\mathbf{K}_{i,j} = \text{Softmax}[\boldsymbol{\Theta}^{(PSF)}]_{i,j} = \frac{\exp[\boldsymbol{\Theta}^{(PSF)}_{i,j}]}{\sum_{i,j}\exp[\boldsymbol{\Theta}^{(PSF)}_{i,j}]}.$ (4)
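As an illustration of Eqs. (2)-(4), the following is a minimal PyTorch sketch of the spatial downsampling module $\mathcal{F}_{y}(\cdot)$: a depthwise convolution whose $S\times S$ kernel is shared by all bands and kept sum-to-one through a softmax over the learnable parameters. The class and attribute names are illustrative and not taken from the released code.

```python
# Sketch of the spatial downsampling module F_y (Eqs. (2)-(4)); names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialDownsample(nn.Module):
    def __init__(self, ratio: int):
        super().__init__()
        self.ratio = ratio
        # Theta^(PSF): unconstrained S x S parameters, mapped to a valid PSF by softmax.
        self.theta_psf = nn.Parameter(torch.randn(ratio, ratio))

    def kernel(self) -> torch.Tensor:
        # Softmax over all S*S entries enforces the sum-to-one constraint of Eq. (3).
        return F.softmax(self.theta_psf.flatten(), dim=0).view(1, 1, self.ratio, self.ratio)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, W, H) HrHSI
        bands = x.shape[1]
        # Depthwise convolution with the same kernel for every band, stride = ratio.
        weight = self.kernel().expand(bands, 1, self.ratio, self.ratio)
        return F.conv2d(x, weight, stride=self.ratio, groups=bands)  # (B, L, w, h)
```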

Second, in contrast to the spatial degradation process, the spectral degradation process acts only on the spectral dimension, so no spatial correlation information is involved here. We therefore use a $1\times 1$ CNN to model the SRF between HrHSI and HrMSI, where the spectral downsampling module in (1) can be formulated by a pointwise convolutional operator as

$\mathcal{Z}_{:,i,j} = [\mathcal{X}_{:,i,j}*\mathbf{R}_{1,:},\ \mathcal{X}_{:,i,j}*\mathbf{R}_{2,:},\ \dots,\ \mathcal{X}_{:,i,j}*\mathbf{R}_{l,:}],\ \forall i,j$
$\mathbf{Z} = \mathcal{F}_{z}(\mathbf{X};\mathbf{R}),$ (5)

where the convolutional operator is abbreviated as $\mathcal{F}_{z}(\cdot)$, the SRF matrix $\mathbf{R}$ is parameterized by $\boldsymbol{\Theta}^{(SRF)}\in\mathbb{R}^{l\times L}$, and the Softmax function is also used here to ensure energy conservation:

$\sum_{i=1}^{L}\mathbf{R}_{:,i} = \mathbf{1},$ (6)
$\mathbf{R} = \text{Softmax}[\boldsymbol{\Theta}^{(SRF)}].$ (7)

The pointwise spectral downsampling module is depicted in the bottom right of Fig. 1.
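Analogously to the spatial case, the following is a minimal PyTorch sketch of the spectral downsampling module $\mathcal{F}_{z}(\cdot)$ in Eqs. (5)-(7): a $1\times 1$ convolution whose weight is the SRF matrix, with a row-wise softmax over the learnable parameters so that each MSI band is a non-negative, sum-to-one combination of the HSI bands. Names are illustrative.

```python
# Sketch of the spectral downsampling module F_z (Eqs. (5)-(7)); names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralDownsample(nn.Module):
    def __init__(self, hsi_bands: int, msi_bands: int):
        super().__init__()
        # Theta^(SRF): unconstrained l x L parameters, mapped to the SRF by softmax.
        self.theta_srf = nn.Parameter(torch.randn(msi_bands, hsi_bands))

    def srf(self) -> torch.Tensor:
        # Row-wise softmax: every MSI band is a convex combination of HSI bands (Eqs. (6)-(7)).
        return F.softmax(self.theta_srf, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, W, H) HrHSI
        weight = self.srf().unsqueeze(-1).unsqueeze(-1)   # (l, L, 1, 1) pointwise kernel
        return F.conv2d(x, weight)                        # (B, l, W, H) HrMSI
```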

To summarize, inspired by the observation model, we build two single-layer CNNs to estimate the downsampling domain transformations from HrHSI to LrHSI and to HrMSI, with the operators and kernel sizes listed in Table II.

TABLE II: Network architecture of CycFusion for the CAVE dataset. Conv. and Interp. are abbreviations of convolution and interpolation, respectively.
Module | Operator | Output shape
Spatial downsampling $\mathcal{F}_{y}(\cdot)$ | Depthwise separable $S\times S$ Conv. | $[w,h,L]$
Spectral downsampling $\mathcal{F}_{z}(\cdot)$ | $1\times 1$ Conv. | $[W,H,l]$
Spatial upsampling $\mathcal{G}_{y}(\cdot)$ | Interp. & $3\times 3$ Conv. ($\times 5$) | $[2w,2h,32]$, $[4w,4h,64]$, $[8w,8h,128]$, $[16w,16h,128]$, $[32w,32h,128]$
 | $1\times 1$ Conv., $1\times 1$ Conv. | $[32w,32h,128]$, $[32w,32h,L]$
Spectral upsampling $\mathcal{G}_{z}(\cdot)$ | $1\times 1$ Conv. ($\times 6$) | $[W,H,32]$, $[W,H,64]$, $[W,H,128]$, $[W,H,128]$, $[W,H,128]$, $[W,H,L]$

III-C Superresolution Model

Furthermore, we propose two upsampling modules to implement the inverse transformations from LrHSI to HrHSI and from HrMSI to HrHSI, namely, the spatial upsampling module and the spectral upsampling module. The former consists of an interpolation block and a refining block. The interpolation block alternately and iteratively increases the spatial resolution of the input LrHSI towards that of the HrHSI and aggregates spatially correlated information via bicubic interpolation and $3\times 3$ convolution operations, respectively. The input LrHSI passes through several interpolation blocks with different learnable parameters during the forward computation. A subsequent refining block with two $1\times 1$ convolutional layers further preserves the high-fidelity details of the HrHSI. Additionally, instance normalization (IN) layers [38] are plugged into the spatial upsampling module to accelerate the network training process, and the rectified linear unit (ReLU) is adopted as the activation function in our proposed model:

$\text{ReLU}(x) = \max(0,x).$ (8)

The Sigmoid function, $\sigma(x)=1/(1+\exp(-x))$, is applied at the end of this module to ensure that the outputs lie in [0,1], the range to which the data are normalized when the network is trained. To clarify, the forward computation of the spatial upsampling module can be formulated as

Interp.: $\tilde{\mathbf{Y}}^{(i)} = \mathbf{Y}^{(i)}\uparrow_{2},$
$3\times 3$ Conv.: $\mathbf{Y}^{(i+1)} = \text{ReLU}(\text{IN}(\text{Conv.}(\tilde{\mathbf{Y}}^{(i)}))),$
$1\times 1$ Conv.: $\mathbf{Y}^{(6)} = \text{ReLU}(\text{IN}(\text{Conv.}(\mathbf{Y}^{(5)}))),$
$1\times 1$ Conv.: $\mathbf{X}_{Spa} = \sigma(\text{Conv.}(\mathbf{Y}^{(6)})),$ (9)

where $\mathbf{Y}^{(0)}=\mathbf{Y}$, $\uparrow_{2}$ denotes doubling the spatial resolution in each interpolation step, $i=0,1,\dots,4$ since the data used in the next section have a spatial downsampling ratio of 32, and $\mathbf{X}_{Spa}$ is the spatial superresolution result. We denote this module as $\mathbf{X}_{Spa}=\mathcal{G}_{y}(\mathbf{Y})$ for short; its diagram is depicted in the bottom left of Fig. 1.
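A minimal PyTorch sketch of the spatial upsampling module $\mathcal{G}_{y}(\cdot)$ in Eq. (9) is given below, following the layer sizes of Table II for a downsampling ratio of 32: five (bicubic interpolation + $3\times 3$ convolution + IN + ReLU) blocks, then a $1\times 1$ refining block with a sigmoid output. The class name and exact layer composition are an illustrative reconstruction, not the released implementation.

```python
# Sketch of the spatial upsampling module G_y (Eq. (9), Table II); names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialUpsample(nn.Module):
    def __init__(self, hsi_bands: int, widths=(32, 64, 128, 128, 128)):
        super().__init__()
        chans = [hsi_bands] + list(widths)
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                          nn.InstanceNorm2d(chans[i + 1], affine=True),
                          nn.ReLU())
            for i in range(len(widths))])
        self.refine = nn.Sequential(
            nn.Conv2d(widths[-1], widths[-1], 1), nn.InstanceNorm2d(widths[-1], affine=True), nn.ReLU(),
            nn.Conv2d(widths[-1], hsi_bands, 1), nn.Sigmoid())

    def forward(self, y: torch.Tensor) -> torch.Tensor:   # y: (B, L, w, h) LrHSI
        for conv in self.convs:
            # Each block doubles the spatial resolution before refining it with a 3x3 conv.
            y = F.interpolate(y, scale_factor=2, mode='bicubic', align_corners=False)
            y = conv(y)
        return self.refine(y)                              # (B, L, 32w, 32h) spatial SR result
```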

Besides, we propose another spectral upsampling module, a stack of $1\times 1$ convolutional layers mapping HrMSI to HrHSI, given as

$1\times 1$ Conv.: $\mathbf{Z}^{(i+1)} = \text{ReLU}(\text{IN}(\text{Conv.}(\mathbf{Z}^{(i)}))),$
$1\times 1$ Conv.: $\mathbf{X}_{Spe} = \sigma(\text{Conv.}(\mathbf{Z}^{(5)})),$ (10)

where $\mathbf{Z}^{(0)}=\mathbf{Z}$, $i=0,1,\dots,4$ and $\mathbf{X}_{Spe}$ is the spectral superresolution result. We denote this module as $\mathbf{X}_{Spe}=\mathcal{G}_{z}(\mathbf{Z})$ for short; its diagram is depicted in the top right of Fig. 1.
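For completeness, a corresponding sketch of the spectral upsampling module $\mathcal{G}_{z}(\cdot)$ in Eq. (10) is shown below: a stack of $1\times 1$ convolutions with IN and ReLU that maps the $l$ MSI bands back to $L$ HSI bands, ending with a sigmoid. Channel widths follow Table II; the class name is illustrative.

```python
# Sketch of the spectral upsampling module G_z (Eq. (10), Table II); names are illustrative.
import torch
import torch.nn as nn

class SpectralUpsample(nn.Module):
    def __init__(self, msi_bands: int, hsi_bands: int, widths=(32, 64, 128, 128, 128)):
        super().__init__()
        chans = [msi_bands] + list(widths)
        layers = []
        for i in range(len(widths)):
            layers += [nn.Conv2d(chans[i], chans[i + 1], 1),
                       nn.InstanceNorm2d(chans[i + 1], affine=True),
                       nn.ReLU()]
        layers += [nn.Conv2d(widths[-1], hsi_bands, 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:   # z: (B, l, W, H) HrMSI
        return self.net(z)                                 # (B, L, W, H) spectral SR result
```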

These two upsampling modules are parameterized by neural networks whose parameters in (9) and (10) are denoted by $\boldsymbol{\Theta}^{(SpaUp)}$ and $\boldsymbol{\Theta}^{(SpeUp)}$, respectively. To clarify, the network architecture of CycFusion for the CAVE dataset is listed in Table II.

Input: a pair of LrHSI and HrMSI: ($\mathbf{Y}$, $\mathbf{Z}$);
Pre-training phase:
1. Initialize $\boldsymbol{\Theta}^{(PSF)}$ and $\boldsymbol{\Theta}^{(SRF)}$ by random sampling from the Kaiming normal distribution [39];
2. repeat
3.   Input the pair of $\mathbf{Y}$ and $\mathbf{Z}$;
4.   Compute the gradient of (18) w.r.t. $\boldsymbol{\Theta}^{(PSF)}$ and $\boldsymbol{\Theta}^{(SRF)}$;
5.   Update the parameters via the Adam optimizer [40];
6. until the pre-training phase ends;
Training phase:
7. Initialize $\boldsymbol{\Theta}^{(SpaUp)}$ and $\boldsymbol{\Theta}^{(SpeUp)}$ by random sampling from the Kaiming normal distribution;
8. repeat
9.   Input the pair of $\mathbf{Y}$ and $\mathbf{Z}$;
10.  Compute the gradient of (16) w.r.t. $\boldsymbol{\Theta}$;
11.  Update $\boldsymbol{\Theta}$ via the Adam optimizer;
12. until the training phase ends;
Output: Compute the fused result by (17).
Algorithm 1: Fusion based on CycFusion

III-D Objective Function

In this part, we construct the objective function based on learning the forward and reverse domain transformations between LrHSI and HrMSI. Specifically, the combination of $\mathcal{G}_{y}(\cdot)$ and $\mathcal{F}_{z}(\cdot)$ completes the forward mapping, while the reverse transformation consists of $\mathcal{G}_{z}(\cdot)$ and $\mathcal{F}_{y}(\cdot)$. The objective function of CycFusion contains three parts, namely, marginal matching, cycle consistency and fusion identity.

Figure 2: Some benchmark RGB images in datasets, CAVE: (a) balloons, (b) chart and stuffed toy, (c) clay and (d) features, Chikusei: (e) region3, (f) region8 and (g) region14, Pavia University: (h) Pavia University.

First, a single domain transformation between LrHSI and HrMSI should satisfy marginal matching, e.g., an image from the LrHSI domain, distributed as $p(\mathbf{Y})$, should, after the forward mapping, follow the HrMSI distribution $p(\mathbf{Z})$; the same holds for images in the HrMSI domain under the reverse transformation. These relationships are given as

$p(\mathbf{Z}) \approx q_{1}(\mathbf{Z}) = \mathbb{E}_{\mathbf{Y}\sim p(\mathbf{Y})}[\mathcal{F}_{z}(\mathcal{G}_{y}(\mathbf{Y}))],$
$p(\mathbf{Y}) \approx q_{1}(\mathbf{Y}) = \mathbb{E}_{\mathbf{Z}\sim p(\mathbf{Z})}[\mathcal{F}_{y}(\mathcal{G}_{z}(\mathbf{Z}))],$ (11)

where $q_{1}(\mathbf{Z})$ and $q_{1}(\mathbf{Y})$ are the learned distributions of HrMSI and LrHSI via a single domain transformation, respectively. In the HSI-MSI fusion task, paired $\mathbf{Y}$ and $\mathbf{Z}$ are generally used, and (11) can be rephrased as the first term of the loss function:

$\mathcal{L}_{mm}(\boldsymbol{\Theta}) = \|\mathbf{Z}-\mathcal{F}_{z}(\mathcal{G}_{y}(\mathbf{Y}))\|_{1} + \|\mathbf{Y}-\mathcal{F}_{y}(\mathcal{G}_{z}(\mathbf{Z}))\|_{1},$ (12)

where $\boldsymbol{\Theta}=\{\boldsymbol{\Theta}^{(PSF)},\boldsymbol{\Theta}^{(SRF)},\boldsymbol{\Theta}^{(SpaUp)},\boldsymbol{\Theta}^{(SpeUp)}\}$ contains all parameters that need to be optimized, and $\|\cdot\|_{1}$ denotes the $\ell_{1}$-norm of a matrix/vector.

Second, dual-domain transformations between LrHSI and HrMSI should satisfy the cycle consistency, e.g., an image from the LrHSI domain transferred to the HrMSI domain and then transferred back should be close to itself; the same holds for images in the HrMSI domain. The cycle consistency can be formulated as

$p(\mathbf{Y}) \approx q_{2}(\mathbf{Y}) = \mathbb{E}_{\mathbf{Z}\sim q_{1}(\mathbf{Z})}[\mathcal{F}_{y}(\mathcal{G}_{z}(\mathbf{Z}))],$
$p(\mathbf{Z}) \approx q_{2}(\mathbf{Z}) = \mathbb{E}_{\mathbf{Y}\sim q_{1}(\mathbf{Y})}[\mathcal{F}_{z}(\mathcal{G}_{y}(\mathbf{Y}))],$ (13)

where $q_{2}(\mathbf{Z})$ and $q_{2}(\mathbf{Y})$ are the learned distributions of HrMSI and LrHSI via dual-domain transformations, respectively. We also use the $\ell_{1}$ loss to construct the second term of the objective function, and (13) can be rephrased as

$\mathcal{L}_{cyc}(\boldsymbol{\Theta}) = \|\mathbf{Y}-\mathcal{F}_{y}(\mathcal{G}_{z}(\mathcal{F}_{z}(\mathcal{G}_{y}(\mathbf{Y}))))\|_{1} + \|\mathbf{Z}-\mathcal{F}_{z}(\mathcal{G}_{y}(\mathcal{F}_{y}(\mathcal{G}_{z}(\mathbf{Z}))))\|_{1}.$ (14)

Third, the corresponding image in the HrHSI domain can be obtained from either single-side superresolution. We constrain the two estimates to be identical via the following loss function

$\mathcal{L}_{ide}(\boldsymbol{\Theta}^{(SpaUp)},\boldsymbol{\Theta}^{(SpeUp)}) = \|\mathcal{G}_{y}(\mathbf{Y})-\mathcal{G}_{z}(\mathbf{Z})\|_{1}.$ (15)

To sum up, the objective function of CycFusion is

$\mathcal{L}_{total}(\boldsymbol{\Theta}) = \mathcal{L}_{mm} + \mathcal{L}_{cyc} + \mathcal{L}_{ide}.$ (16)

When the CycFusion training is complete, we simply obtain the fused result by

$\hat{\mathbf{X}} = \frac{1}{2}(\mathcal{G}_{z}(\mathbf{Z}) + \mathcal{G}_{y}(\mathbf{Y})).$ (17)
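The three loss terms and the fusion rule can be written compactly; the following sketch expresses Eqs. (12)-(17) against the illustrative modules introduced earlier ($\mathcal{F}_{y}$, $\mathcal{F}_{z}$, $\mathcal{G}_{y}$, $\mathcal{G}_{z}$ as spatial/spectral down/upsampling networks). The use of a mean absolute error as the $\ell_{1}$ penalty is an implementation assumption.

```python
# Sketch of the CycFusion objective (Eqs. (12)-(16)) and the fusion rule (Eq. (17)).
import torch

def cycfusion_loss(Y, Z, F_y, F_z, G_y, G_z):
    l1 = lambda a, b: (a - b).abs().mean()
    # Marginal matching, Eq. (12): one forward/reverse domain transformation each.
    loss_mm = l1(Z, F_z(G_y(Y))) + l1(Y, F_y(G_z(Z)))
    # Cycle consistency, Eq. (14): two successive domain transformations.
    loss_cyc = l1(Y, F_y(G_z(F_z(G_y(Y))))) + l1(Z, F_z(G_y(F_y(G_z(Z)))))
    # Fusion identity, Eq. (15): both single-side superresolutions should agree.
    loss_ide = l1(G_y(Y), G_z(Z))
    return loss_mm + loss_cyc + loss_ide          # Eq. (16)

def fuse(Y, Z, G_y, G_z):
    # Eq. (17): the fused HrHSI is the average of the two superresolution outputs.
    with torch.no_grad():
        return 0.5 * (G_y(Y) + G_z(Z))
```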

III-E CycFusion with Pretraining

The parameters of the PSF and SRF may be hard to acquire in real applications. An estimation method for the PSF kernel and the SRF matrix was proposed in HySure [15], which further downsamples the LrHSI and the HrMSI in the spectral and spatial dimensions, respectively, to produce the same low spatial resolution MSI (LrMSI). A similar strategy is used here to learn the downsampling parameters $\boldsymbol{\Theta}^{(PSF)}$ and $\boldsymbol{\Theta}^{(SRF)}$. Specifically, we use the following loss function to pre-train the downsampling modules and then load the pre-trained weights when training the whole network:

$\mathcal{L}_{pre}(\boldsymbol{\Theta}^{(PSF)},\boldsymbol{\Theta}^{(SRF)}) = \|\mathcal{F}_{z}(\mathbf{Y}) - \mathcal{F}_{y}(\mathbf{Z})\|_{1}.$ (18)

To specify, the training process of CycFusion is summarized in Algorithm 1. When the PSF kernel and SRF matrix are known as priors, the proposed model can skip the pretraining phase and implement the fusion task in a non-blind manner, which we refer to as CycFusion-noblind.
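As an illustration, the following is a minimal training-loop sketch following Algorithm 1, reusing the illustrative modules and the `cycfusion_loss` helper from the previous sketches: a pre-training phase that fits the PSF/SRF with the consistency loss of Eq. (18), then a main phase that minimizes the total loss of Eq. (16). The function name, the fixed learning rate and the flat iteration counts are assumptions for readability.

```python
# Sketch of the two-phase training procedure of Algorithm 1; names are illustrative.
import torch

def train_cycfusion(Y, Z, F_y, F_z, G_y, G_z, pre_iters=10_000, iters=30_000):
    # Pre-training phase: spectrally degrade Y and spatially degrade Z to the same
    # LrMSI and match them (Eq. (18)), learning Theta_PSF and Theta_SRF.
    opt_pre = torch.optim.Adam(list(F_y.parameters()) + list(F_z.parameters()), lr=1e-3)
    for _ in range(pre_iters):
        opt_pre.zero_grad()
        (F_z(Y) - F_y(Z)).abs().mean().backward()
        opt_pre.step()

    # Training phase: optimize all modules with the full CycFusion objective.
    params = [p for m in (F_y, F_z, G_y, G_z) for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(iters):
        opt.zero_grad()
        cycfusion_loss(Y, Z, F_y, F_z, G_y, G_z).backward()
        opt.step()

    with torch.no_grad():
        return 0.5 * (G_y(Y) + G_z(Z))             # fused HrHSI, Eq. (17)
```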

Figure 3: The one cycle learning rate schedule with 10k iterations warmup and 20k iterations cosine annealing.
TABLE III: Quantitative metrics of the comparison methods on the CAVE dataset. The best results of non-blind methods are in bold, while those of blind methods are underlined.
Methods PSNR SAM ERGAS SSIM
GSA 39.90 8.96 0.49 0.968
GLP-HS 41.47 8.65 0.41 0.970
CNMF 42.59 6.34 0.38 0.984
CSTF 41.74 9.53 0.42 0.970
FUSE 42.49 7.61 0.40 0.980
NSSR 44.27 5.91 0.34 0.986
GDD 37.06 6.71 0.40 0.971
CUCaNet 37.51 7.49 0.47 0.969
CycFusion 43.88 5.25 0.32 0.986
CycFusion-noblind 44.53 5.15 0.30 0.987
Ideal value $+\infty$ 0 0 1

IV Experiments

We present experimental results of our proposed model and several typical unsupervised approaches on three publicly available datasets: CAVE [41] (https://www.cs.columbia.edu/CAVE/databases/multispectral/), Chikusei [42] (http://naotoyokoya.com/Download.html) and Pavia University (https://rslab.ut.ac.ir/data). Comparison methods include the pansharpening-based GSA [17] (https://openremotesensing.net/knowledgebase/hyperspectral-and-multispectral-data-fusion/) and GLP-HS [19] (http://openremotesensing.net/knowledgebase/hyperspectral-andmultispectral-data-fusion/), the subspace-based CNMF [20] (http://naotoyokoya.com/assets/zip/CNMF_MATLAB.zip), CSTF [26] (https://github.com/renweidian/CSTF), FUSE [25] (http://wei.perso.enseeiht.fr/publications.html) and NSSR [23] (http://see.xidian.edu.cn/faculty/wsdong), and the deep learning-based GDD [35] (https://github.com/tuezato/guided-deep-decoder) and CUCaNet [36] (https://github.com/danfenghong/ECCV2020_CUCaNet).

Figure 4: (a-k) The 21st band (600 nm) of the fused HrHSI (balloons in the CAVE dataset) obtained by the testing methods, where a ROI zoomed in 9 times (bottom-left) and the corresponding residual maps (bottom-right) are shown for detailed visualization. PSNR/SAM per panel: (a) GT, (b) GSA (42.28/4.80), (c) GLP-HS (42.07/5.05), (d) CNMF (43.85/3.72), (e) CSTF (43.60/5.06), (f) FUSE (44.70/3.86), (g) NSSR (46.59/3.04), (h) GDD (36.34/7.51), (i) CUCaNet (38.50/6.91), (j) CycFusion (48.06/2.63), (k) CycFusion-noblind (48.00/2.66).
Figure 5: (a-k) The 21st band (600 nm) of the fused HrHSI (CD in the CAVE dataset) obtained by the testing methods, where a ROI zoomed in 9 times (bottom-left) and the corresponding residual maps (bottom-right) are shown for detailed visualization. PSNR/SAM per panel: (a) GT, (b) GSA (31.85/6.45), (c) GLP-HS (32.27/6.59), (d) CNMF (33.04/5.98), (e) CSTF (31.84/8.01), (f) FUSE (32.64/6.87), (g) NSSR (33.72/4.69), (h) GDD (29.89/7.03), (i) CUCaNet (30.11/6.02), (j) CycFusion (37.40/5.39), (k) CycFusion-noblind (37.71/5.00).
Figure 6: RMSE along the spectral bands for the five best methods on the (a) balloons and (b) CD images in the CAVE dataset.

IV-A Data Description

The CAVE dataset contains 32 indoor HSI of size $512\times 512$ with 31 spectral channels covering wavelengths from 0.4 $\mu m$ to 0.7 $\mu m$. The original Chikusei dataset was acquired over Chikusei, Japan, by the Headwall Hyperspec-VNIR-C imaging sensor in 2014. The scene has a size of $2517\times 2335$ pixels and 128 bands covering the spectral range from 0.363 $\mu m$ to 1.018 $\mu m$ with a ground sampling distance (GSD) of 2.5 m. We select 16 non-overlapping regions of size $512\times 512$ from the raw data to evaluate the performance of all comparison methods. The last HSI used in the experiments is the Pavia University dataset, collected by the Reflective Optics System Imaging Spectrometer (ROSIS) over the University of Pavia, Italy. The image contains $610\times 340$ pixels and has 103 bands ranging from 0.43 $\mu m$ to 0.86 $\mu m$ with a GSD of 1.3 m. We crop the top-left area of size $608\times 320$ for the convenience of data processing. We refer to the above data as the ground truth (GT) HrHSI to evaluate the fused results. Some benchmark images of the three datasets are depicted in Fig. 2.

Following [30, 33] and [34], we directly average each $32\times 32$ spatially disjoint block of the HrHSI to simulate the observed LrHSI, and use the Nikon D700 camera SRF (https://www.maxmax.com/spectral_response.htm), the LANDSAT-8 SRF (https://landsat.gsfc.nasa.gov/article/preliminary-spectral-response-of-the-operational-land-imager-in-band-band-average-relative-spectral-response) and an IKONOS-like SRF to generate the HrMSI from the HrHSI for the CAVE, Chikusei and Pavia University datasets, respectively. The HrMSI of the CAVE dataset has 3 bands corresponding to the red, green and blue channels, while the HrMSI of the other two remote sensing datasets has 4 bands, with one more NIR band than the former.

Figure 7: (a-k) The 73rd band (734 nm) of the fused HrHSI (region3 in the Chikusei dataset) obtained by the testing methods, where a ROI zoomed in 9 times (bottom-left) and the corresponding residual maps (bottom-right) are shown for detailed visualization. PSNR/SAM per panel: (a) GT, (b) GSA (36.99/3.20), (c) GLP-HS (36.18/2.85), (d) CNMF (39.26/2.07), (e) CSTF (36.50/2.59), (f) FUSE (39.53/2.46), (g) NSSR (34.16/2.61), (h) GDD (38.62/1.90), (i) CUCaNet (37.71/2.57), (j) CycFusion (39.30/1.83), (k) CycFusion-noblind (40.75/1.70).

IV-B Experimental Setup

IV-B1 Hyperparameter Settings

The architecture for the CAVE dataset is listed in Table II, while the number of hidden units is doubled when processing the Chikusei and Pavia datasets because these remote sensing data have more spectral bands. We use the Adam optimizer [40] to train the proposed CycFusion. In the blind fusion situation, the number of pretraining iterations is set to 10k and the learning rate is set to 0.001. In the training phase, we use the one cycle learning rate schedule [43] to accelerate the training process, with 10k warm-up iterations and 20k annealing iterations for convergence, where the maximum learning rate is set to 0.01. The learning rate during the training phase is shown in Fig. 3.
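One way to realize the schedule described above is PyTorch's built-in OneCycleLR scheduler, as in the following sketch; the placeholder module and the bare optimizer step stand in for the actual fusion network and loss, and whether the released code uses this particular scheduler is an assumption.

```python
# Sketch of a one cycle learning rate schedule: 10k warm-up out of 30k total
# iterations, maximum learning rate 0.01, cosine annealing afterwards.
import torch

model = torch.nn.Linear(8, 8)                      # placeholder for the fusion network
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, total_steps=30_000,
    pct_start=10_000 / 30_000, anneal_strategy='cos')

for step in range(30_000):
    # ... compute the fusion loss, loss.backward() ...
    optimizer.step()
    scheduler.step()                               # update the learning rate each iteration
```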

IV-B2 Performance Metrics

To quantitatively analyze the fusion results of the comparison methods, we use the following criteria: the peak signal-to-noise ratio (PSNR), the relative dimensionless global error in synthesis (ERGAS) [44], the spectral angle mapper (SAM) and the structural similarity (SSIM) [45]. PSNR is directly related to the root mean squared error (RMSE). ERGAS can be seen as the average relative RMSE of each band, which eliminates intensity effects. SAM measures the similarity between spectra as the angle between them. SSIM is a widely used criterion in image processing, which measures the structural similarity between the ground truth image and the estimated image. All performance metrics are evaluated in the 8-bit range, i.e., [0, 255].
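For reference, a minimal sketch of two of these criteria is given below, assuming 8-bit ground truth and estimate cubes: PSNR computed from the RMSE over all bands and pixels, and SAM as the mean angle between corresponding pixel spectra (returned in degrees here; radians are an equivalent convention).

```python
# Sketch of PSNR and SAM; ERGAS and SSIM are omitted for brevity.
import numpy as np

def psnr(gt, est, peak=255.0):
    rmse = np.sqrt(np.mean((gt.astype(np.float64) - est.astype(np.float64)) ** 2))
    return 20.0 * np.log10(peak / rmse)

def sam(gt, est, eps=1e-12):
    # gt, est: (bands, pixels) matrices whose columns are pixel spectra.
    num = np.sum(gt * est, axis=0)
    den = np.linalg.norm(gt, axis=0) * np.linalg.norm(est, axis=0) + eps
    angles = np.arccos(np.clip(num / den, -1.0, 1.0))
    return np.degrees(angles.mean())
```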

IV-C Experiments on the Indoor Dataset

The average performance metrics on the CAVE dataset for all comparison methods are listed in Table III. Overall, GSA and GLP-HS, pansharpening-based methods that do not use the SRF and PSF parameters, yield poor fusion results. Compared with them, the subspace-based methods incorporate the observation model and give competitive fusion performance; NSSR, in particular, brings a 2.8 dB improvement in PSNR over GLP-HS. GDD attempts to learn the prior information on HrHSI from a noise input with a deep CNN; however, GDD may not learn a sufficient prior for the indoor dataset, resulting in poor results. CUCaNet couples the spectral unmixing model and utilizes multiple consistency losses to learn the parameters of the observation model and the fusion network; it works in a blind fusion manner and brings a 0.45 dB increment in PSNR compared to GDD. Unsurprisingly, the proposed CycFusion outperforms all other fusion algorithms, as listed in Table III. Specifically, CycFusion and CycFusion-noblind gain 2.41 dB and 0.26 dB in PSNR over the second-best blind and non-blind fusion methods, GLP-HS and NSSR, respectively.

Fig. 4 and Fig. 5 show the fused HrHSI in the 21st band (600 nm) for balloons and CD, respectively. Two regions of interest (ROIs) are highlighted for comparing the detailed differences among all testing algorithms. It can be observed that there are prominent bumps in the fused image provided by the pansharpening-based GLP-HS, as shown in Fig. 4(c) and Fig. 5(c). Besides, unusual ripples are present in the result of GDD, especially in Fig. 5(h), which may be caused by insufficient prior learning. Overall, the proposed CycFusion and CycFusion-noblind approaches give better fused HrHSI and correspondingly lower error maps than the other comparison methods. To compare the reconstruction in each band obtained by the testing algorithms, the band-by-band root mean squared error (RMSE) of the five best methods on these two images is shown in Fig. 6. These RMSE results further illustrate the superiority of our proposed methods in spectral reconstruction.

TABLE IV: Quantitative metrics of the comparison methods on the Chikusei and Pavia University dataset. The best results of non-blind methods are in bold, while those of blind methods are underlined.
Methods Chikusei Pavia University
PSNR SAM ERGAS SSIM PSNR SAM ERGAS SSIM
GSA 36.37 3.19 0.46 0.955 39.19 4.76 0.28 0.967
GLP-HS 35.20 2.84 0.48 0.959 38.45 3.61 0.27 0.971
CNMF 38.29 2.26 0.46 0.968 40.41 2.99 0.24 0.976
CSTF 35.28 2.64 0.49 0.939 37.86 3.81 0.29 0.951
FUSE 38.93 2.37 0.44 0.968 42.20 3.00 0.22 0.979
NSSR 32.80 2.98 0.55 0.920 38.56 3.48 0.27 0.965
GDD 38.12 2.94 0.48 0.942 37.65 4.20 0.32 0.962
CUCaNet 38.43 2.52 0.54 0.953 38.27 2.68 0.27 0.971
CycFusion 37.51 2.35 0.45 0.964 40.20 2.86 0.26 0.976
CycFusion-noblind 39.72 2.12 0.43 0.969 42.58 2.57 0.21 0.981
Ideal value $+\infty$ 0 0 1 $+\infty$ 0 0 1
Figure 8: (a-k) The 31st band (556 nm) of the fused HrHSI (Pavia University) obtained by the testing methods, where a ROI zoomed in 9 times (bottom-left) and the corresponding residual maps (bottom-right) are shown for detailed visualization. PSNR/SAM per panel: (a) GT, (b) GSA (39.19/4.76), (c) GLP-HS (38.45/3.61), (d) CNMF (40.41/2.99), (e) CSTF (37.86/3.81), (f) FUSE (42.20/3.00), (g) NSSR (38.56/3.48), (h) GDD (37.65/4.20), (i) CUCaNet (38.27/2.68), (j) CycFusion (40.20/2.86), (k) CycFusion-noblind (42.58/2.57).
Figure 9: RMSE along the spectral bands for the five best methods on (a) region3 in the Chikusei dataset and (b) the Pavia University image.

IV-D Experiments on the Remote Sensing Dataset

To further evaluate the effectiveness, we conduct experiments on two remote sensing datasets. Different from the indoor images, the remote sensing data have lower spatial resolution and generally contain several spectral signatures in one image. Table IV shows the quantitative metrics of all testing methods on the Chikusei and Pavia University datasets. As one can see, the SAM results evaluated on these two datasets are much smaller than the same metric on the CAVE dataset in Table III. Among the non-blind fusion methods, our proposed CycFusion-noblind obtains the best results on all quantitative metrics. We also depict one band of each fused image for visual comparison, as shown in Fig. 7 and Fig. 8. Our proposed model also produces the best fusion results and good spectral reconstruction, as shown in Fig. 9.

IV-E Model Discussion

IV-E1 Complexity Analysis

The computational complexity of CycFusion heavily depends on the number of parameters. The total number of network parameters consists of two parts: the weights and biases of the convolutions and the affine parameters of the instance normalization layers. The former is $\sum_{p=1}^{P}(M^{2}_{p}C^{in}_{p}+1)C^{out}_{p}$, where $M_{p}\times M_{p}$, $C^{in}_{p}$ and $C^{out}_{p}$ are the kernel size, input dimension and output dimension of the $p$-th convolutional layer, and the latter is $\sum_{p=1}^{P}2C^{out}_{p}$. The model sizes and the training time per image on one NVIDIA GeForce RTX 3090 GPU for the three datasets are listed in Table V, which are comparable to those of the unsupervised deep learning-based models GDD [35] and CUCaNet [36].

TABLE V: Model size and training time of the proposed model and two unsupervised deep learning-based methods, where M means millions and min. is short for minutes.
Dataset CAVE Chikusei Pavia University
CycFusion Size (M) 0.47 0.52 0.51
Time (min.) 65 155 110
GDD Size (M) 0.25 0.26 0.26
Time (min.) 40 42 42
CUCaNet Size (M) 0.87 0.89 0.90
Time (min.) 109 120 122
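As a small illustration of the parameter-count formula above, the following sketch sums $(M_{p}^{2}C^{in}_{p}+1)C^{out}_{p}$ over the convolutions and $2C^{out}_{p}$ over the instance normalization layers; the example layer list corresponds to the spectral upsampling path of Table II for the CAVE dataset, taken here only as an assumed configuration.

```python
# Sketch of the parameter-count formula from the complexity analysis.
def count_parameters(conv_layers, in_layers):
    # conv_layers: (kernel size M, input channels, output channels) per convolution.
    conv = sum((m * m * c_in + 1) * c_out for m, c_in, c_out in conv_layers)
    norm = sum(2 * c_out for c_out in in_layers)   # affine scale and shift per IN layer
    return conv + norm

# Example: spectral upsampling path of Table II for the CAVE dataset (l = 3, L = 31).
convs = [(1, 3, 32), (1, 32, 64), (1, 64, 128), (1, 128, 128), (1, 128, 128), (1, 128, 31)]
print(count_parameters(convs, [32, 64, 128, 128, 128]))
```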

IV-E2 Ablation Study

In this part, we perform the fusion task with the cycle consistency loss removed to investigate the effect of the double domain transformation. We conduct a simple experiment on the Pavia University dataset, and the metrics are shown in Table VI. The results show that the proposed CycFusion is greatly enhanced by the cycle consistency.

TABLE VI: Ablation study of CycFusion with and without the cycle consistency loss, conducted on the Pavia University dataset.
Loss function | CycFusion (PSNR / SAM) | CycFusion-noblind (PSNR / SAM)
without $\mathcal{L}_{cyc}$ | 31.07 / 5.32 | 37.33 / 4.91
full objective ($\mathcal{L}_{mm}+\mathcal{L}_{cyc}+\mathcal{L}_{ide}$) | 40.20 / 2.86 | 42.58 / 2.57

V Conclusion

In this paper, we propose a novel unsupervised HSI and MSI fusion framework based on cycle consistency through double domain transformations, named CycFusion. CycFusion recovers the spatial information and spectral signatures by learning the single domain transformations from LrHSI and HrMSI to the desired HrHSI, respectively. Meanwhile, the cycle consistencies between the degraded images and their reconstructions after double domain transformations are retained. Moreover, the proposed CycFusion retains the ability to learn the parameters of the observation model, which further enhances the practicality of our proposed network. Experimental results on three publicly available datasets show the effectiveness and efficiency of our proposed model. In future work, we will extend the fusion model to training with unpaired degraded images, enabled by the decoupled domain-transformation design, and develop more insightful models that contain spatial-spectral attention modules to further enhance the fusion quality.

Acknowledgments

The authors would like to acknowledge Prof. S. K. Nayar for sharing the CAVE data, Prof. N. Yokoya for sharing the Chikusei data and Prof. P. Gamba for sharing the Pavia University imagery.

References

  • [1] L. Yan, M. Zhao, X. Wang, Y. Zhang, and J. Chen, “Object detection in hyperspectral images,” IEEE Signal Process. Lett., vol. 28, pp. 508–512, 2021.
  • [2] Z. Pan, G. Healey, M. Prasad, and B. Tromberg, “Face recognition in hyperspectral images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 12, pp. 1552–1560, 2003.
  • [3] J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N. Nasrabadi, and J. Chanussot, “Hyperspectral remote sensing data analysis and future challenges,” IEEE Geosci. Remote Sens. Mag., vol. 1, no. 2, pp. 6–36, 2013.
  • [4] G. Lu and B. Fei, “Medical hyperspectral imaging: a review,” J. Biomed. Opt., vol. 19, no. 1, p. 010901, 2014.
  • [5] K. Zhu, Z. Sun, F. Zhao, T. Yang, Z. Tian, J. Lai, B. Long, and S. Li, “Remotely sensed canopy resistance model for analyzing the stomatal behavior of environmentally-stressed winter wheat,” ISPRS J. Photogramm. Remote Sens., vol. 168, pp. 197–207, 2020.
  • [6] K. T. Higgins, “Five new technologies for inspection,” Food Process., vol. 74, no. 5, pp. 81–82,84, 2013.
  • [7] R. Heylen, M. Parente, and P. Gader, “A review of nonlinear hyperspectral unmixing methods,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 7, no. 6, pp. 1844–1868, 2014.
  • [8] N. Yokoya, C. Grohnfeldt, and J. Chanussot, “Hyperspectral and multispectral data fusion: A comparative review of the recent literature,” IEEE Geosci. Remote Sens. Mag., vol. 5, no. 2, pp. 29–56, 2017.
  • [9] T. Wang, G. Yan, H. Ren, and X. Mu, “Improved methods for spectral calibration of on-orbit imaging spectrometers,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 11, pp. 3924–3931, 2010.
  • [10] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Los Alamitos, CA, USA, OCT. 2017, pp. 2242–2251.
  • [11] H. You, Y. Cheng, T. Cheng, C. Li, and P. Zhou, “Bayesian cycle-consistent generative adversarial networks via marginalizing latent sampling,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 10, pp. 4389–4403, 2021.
  • [12] H. Sun, R. Wang, K. Chen, M. Utiyama, E. Sumita, and T. Zhao, “Unsupervised neural machine translation with cross-lingual language representation agreement,” IEEE/ACM Trans. on Audio, Speech, and Lang. Process., vol. 28, pp. 1170–1182, 2020.
  • [13] X. Zhong, Y. Wang, A. Cai, N. Liang, L. Li, and B. Yan, “Dual-energy CT image super-resolution via generative adversarial network,” in Proc. Int. Conf. on Artificial Intelligence and Electromechanical Automation (AIEA), 2021, pp. 343–347.
  • [14] A. Mehri and A. D. Sappa, “Colorizing near infrared images through a cyclic adversarial approach of unpaired samples,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 971–979.
  • [15] M. Simões, J. Bioucas‐Dias, L. B. Almeida, and J. Chanussot, “A convex formulation for hyperspectral image superresolution via subspace-based regularization,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3373–3388, 2015.
  • [16] L. Loncan, L. B. de Almeida, J. M. Bioucas-Dias, X. Briottet, J. Chanussot, N. Dobigeon, S. Fabre, W. Liao, G. A. Licciardi, M. Simões, J.-Y. Tourneret, M. A. Veganzones, G. Vivone, Q. Wei, and N. Yokoya, “Hyperspectral pansharpening: A review,” IEEE Geosci. Remote Sens. Mag., vol. 3, no. 3, pp. 27–46, 2015.
  • [17] B. Aiazzi, S. Baronti, and M. Selva, “Improving component substitution pansharpening through multivariate regression of MS+Pan data,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 10, pp. 3230–3239, 2007.
  • [18] B. Aiazzi, L. Alparone, S. Baronti, A. Garzelli, and M. Selva, “MTF-tailored multiscale fusion of high-resolution MS and Pan imagery,” Photogramm. Eng. Remote Sens., vol. 72, no. 5, pp. 591–596, 2015.
  • [19] M. Selva, B. Aiazzi, F. Butera, L. Chiarantini, and S. Baronti, “Hyper-sharpening: A first approach on sim-ga data,” IEEE J. Sel. Topics Appl. Earth Observ. in Remote Sens., vol. 8, no. 6, pp. 3008–3024, 2015.
  • [20] N. Yokoya, T. Yairi, and A. Iwasaki, “Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 2, pp. 528–537, 2012.
  • [21] D. Hong, W. He, N. Yokoya, J. Yao, L. Gao, L. Zhang, J. Chanussot, and X. Zhu, “Interpretable hyperspectral artificial intelligence: When nonconvex modeling meets hyperspectral remote sensing,” IEEE Geosci. Remote Sens. Mag., vol. 9, no. 2, pp. 52–87, 2021.
  • [22] Q. Wei, J. Bioucas-Dias, N. Dobigeon, and J.-Y. Tourneret, “Hyperspectral and multispectral image fusion based on a sparse representation,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 7, pp. 3658–3668, 2015.
  • [23] W. Dong, F. Fu, G. Shi, X. Cao, J. Wu, G. Li, and X. Li, “Hyperspectral image super-resolution via non-negative structured sparse representation,” IEEE Trans. Image Process., vol. 25, no. 5, pp. 2337–2352, 2016.
  • [24] R. Dian, S. Li, and X. Kang, “Regularizing hyperspectral and multispectral image fusion by CNN denoiser,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 3, pp. 1124–1135, 2021.
  • [25] Q. Wei, N. Dobigeon, and J.-Y. Tourneret, “Fast fusion of multi-band images based on solving a Sylvester equation,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 4109–4121, 2015.
  • [26] S. Li, R. Dian, L. Fang, and J. M. Bioucas-Dias, “Fusing hyperspectral and multispectral images via coupled sparse tensor factorization,” IEEE Trans. Image Process., vol. 27, no. 8, pp. 4118–4130, 2018.
  • [27] R. Dian, S. Li, B. Sun, and A. Guo, “Recent advances and new guidelines on hyperspectral and multispectral image fusion,” Information Fusion, vol. 69, pp. 40–51, 2021.
  • [28] X. Wang, J. Chen, Q. Wei, and C. Richard, “Hyperspectral image super-resolution via deep prior regularization with parameter estimation,” IEEE Trans. Circuits Syst, Video Technol, pp. 1–1, 2021.
  • [29] Q. Xie, M. Zhou, Q. Zhao, D. Meng, W. Zuo, and Z. Xu, “Multispectral and hyperspectral image fusion by MS/HS fusion net,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 1585–1594.
  • [30] J.-F. Hu, T.-Z. Huang, L.-J. Deng, T.-X. Jiang, G. Vivone, and J. Chanussot, “Hyperspectral image super-resolution via deep spatiospectral attention convolutional neural networks,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–15, 2021.
  • [31] W. Wang, X. Fu, W. Zeng, L. Sun, R. Zhan, Y. Huang, and X. Ding, “Enhanced deep blind hyperspectral image fusion,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–11, 2021.
  • [32] Y. Qu, H. Qi, and C. Kwan, “Unsupervised sparse Dirichlet-Net for hyperspectral image super-resolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 2511–2520.
  • [33] Z. Wang, B. Chen, H. Zhang, and H. Liu, “Unsupervised hyperspectral and multispectral images fusion based on nonlinear variational probabilistic generative model,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–15, 2020.
  • [34] Z. Wang, B. Chen, R. Lu, H. Zhang, H. Liu, and P. K. Varshney, “Fusionnet: An unsupervised convolutional variational network for hyperspectral and multispectral image fusion,” IEEE Trans. Image Process., vol. 29, pp. 7565–7577, 2020.
  • [35] T. Uezato, D. Hong, N. Yokoya, and W. He, “Guided deep decoder: Unsupervised image pair fusion,” in Proceedings of the European Conference on Computer Vision(ECCV).   Cham: Springer International Publishing, 2020, pp. 87–102.
  • [36] J. Yao, D. Hong, J. Chanussot, D. Meng, X. Zhu, and Z. Xu, “Cross-attention in coupled unmixing nets for unsupervised hyperspectral super-resolution,” in Proceedings of the European Conference on Computer Vision(ECCV).   Cham: Springer International Publishing, 2020, pp. 208–224.
  • [37] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 1800–1807.
  • [38] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
  • [39] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
  • [40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [41] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar, “Generalized assorted pixel camera: Postcapture control of resolution, dynamic range, and spectrum,” IEEE Trans. Image Process., vol. 19, no. 9, pp. 2241–2253, 2010.
  • [42] N. Yokoya and A. Iwasaki, “Airborne hyperspectral data over Chikusei,” Space Application Laboratory, University of Tokyo, Japan, Tech. Rep. SAL-2016-05-27, May 2016.
  • [43] L. N. Smith and N. Topin, “Super-convergence: Very fast training of neural networks using large learning rates,” arXiv preprint arXiv:1708.07120, 2017.
  • [44] L. Wald, Data fusion: definitions and architectures: fusion of images of different spatial resolutions.   Paris, France: Presses des MINES, 2002.
  • [45] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.