Unrolling Plug-and-Play Network for Hyperspectral Unmixing
Abstract
Deep learning based unmixing methods have received great attention in recent years and achieved remarkable performance. These methods employ a data-driven approach to extract structural features from hyperspectral images; however, they tend to be less physically interpretable. Conventional unmixing methods offer much more interpretability, but they require manually designed regularization and carefully chosen penalty parameters. To overcome these limitations, we propose a novel unmixing method that unrolls the plug-and-play unmixing algorithm to construct a deep architecture. Our method integrates both inner and outer priors. The carefully designed unfolded deep architecture learns the spectral and spatial information from the hyperspectral image, which we refer to as inner priors. Additionally, our approach incorporates deep denoisers pretrained on a large volume of image data to leverage outer priors. Furthermore, we design a dynamic convolution to model multiscale information, where different scales are fused using an attention module. Experimental results on both synthetic and real datasets demonstrate that our method outperforms the compared methods.
Index Terms:
Hyperspectral unmixing, unrolling, plug-and-play, ADMM, inner priors, outer priors.
I Introduction
Hyperspectral imaging stands as a pivotal domain in remote sensing, capturing data that incorporates both spatial and spectral characteristics of the target. The dense spectral information with hundreds of spectral bands allows it to be widely applied in many fields, such as environmental monitoring and agriculture [1, 2, 3]. However, due to the large working distance and low spatial resolution of hyperspectral sensors, a pixel may contain several materials, which degrades the performance of subsequent high-level data processing. Spectral unmixing is one of the most prominent tools to cope with this issue, aiming to decompose the mixed pixels into pure components, termed endmembers, and their corresponding fractional abundances.
From the standpoint of physical interpretability and simplicity, the most commonly used mixing model is the linear mixing model (LMM), presuming that an observed pixel is a linear combination of endmembers weighted by abundances, i.e.:
$\mathbf{y} = \mathbf{M}\mathbf{a} + \mathbf{n},$  (1)
where $\mathbf{y}\in\mathbb{R}^{B}$ is a vector containing $B$ spectral bands associated with the observed pixel, $\mathbf{M}\in\mathbb{R}^{B\times R}$ is the endmember matrix with $R$ spectral signatures of pure constituents, $\mathbf{a}\in\mathbb{R}^{R}$ is the corresponding abundance vector, and $\mathbf{n}$ denotes the additive noise. Due to the physical interpretation of the hyperspectral image and the mixing model, the endmembers and abundances are considered to satisfy the nonnegativity constraints (ENC and ANC), and the abundances are also assumed to sum to one for each pixel (ASC). It is important to highlight that the ASC can be relaxed for some physically motivated reasons. This may occur, for instance, when there are local variations in the topography of the scene [4]. In our work, we take all these constraints into account, i.e.,
$\mathcal{M} = \{\mathbf{M} \mid \mathbf{M} \geq \mathbf{0}\},$  (2)
$\mathcal{A} = \{\mathbf{a} \mid \mathbf{a} \geq \mathbf{0},\ \mathbf{1}^{\top}\mathbf{a} = 1\},$  (3)
in which $\mathcal{M}$ and $\mathcal{A}$ are the feasible regions of endmembers and abundances, respectively.
For a hyperspectral image $\mathbf{Y}\in\mathbb{R}^{B\times N}$ with $N$ pixels, the hyperspectral unmixing task is formulated as the following optimization problem:
$\min_{\mathbf{M}\in\mathcal{M},\,\mathbf{A}\in\mathcal{A}}\ \frac{1}{2}\big\|\mathbf{Y}-\mathbf{M}\mathbf{A}\big\|_F^2 + \lambda\,\mathcal{R}(\mathbf{M},\mathbf{A}),$  (4)
where the first term is the data-fitting term, the second term serves as a regularization aiming to enforce certain desirable properties of the endmembers and abundances, and $\lambda$ is the trade-off parameter. Different regularization techniques have been devised. Conventional methods typically design the regularization according to the spatial and spectral characteristics of hyperspectral images. For instance, in most scenes, the pixel values of an image are piecewise continuous in the spatial dimension; the works [5, 6, 7] introduce total variation (TV) to achieve this purpose. As a pixel usually contains far fewer materials than the number of pure materials contained in the endmember matrix, sparsity constraints are introduced on the abundance matrix to obtain sparse results [5, 7]. Minimum-volume simplex regularization has been incorporated into the objective function to constrain the volume of the simplex formed by the endmembers [8].
Typically, the alternating direction method of multipliers (ADMM) is a powerful tool to solve problem (4), as it decomposes a complicated optimization problem into several easier subproblems. The work [5] utilizes this strategy to address an objective function with two regularizations on the abundance, namely the $\ell_1$-norm and TV. A nonstandard application of ADMM with a block coordinate descent scheme has been designed to address a 3D TV-constrained problem, which can model the spatial and spectral correlations and yields sharper edges [9]. In [10], an ADMM-based blind hyperspectral unmixing method is proposed, which can simultaneously estimate abundances and endmembers; a graph TV regularization is considered to capture the spatial correlation information. The work [11] uses this tool to solve a graph-regularized nonlinear unmixing method based on the multilinear mixing model, which models the nonlinearity between endmembers. Many other ADMM-based unmixing methods have also been proposed, including those for nonlinear unmixing using kernel methods [12] and those addressing unmixing problems with spectral variability [13]. While ADMM is a versatile and powerful tool for solving unmixing optimization problems, the choice of penalty parameters inherently limits its performance.
Compared to traditional methods that rely on manually designed regularization terms, deep neural networks (DNNs) can automatically extract and model complex patterns and nonlinear relationships in data. DNNs have achieved breakthrough results in hyperspectral image processing tasks such as object recognition [14], band selection [15], and image super-resolution [16]. The encoding and decoding processes of the autoencoder perfectly fit the formulation of the unmixing problem, and this kind of method has achieved great advances [17]. Thus, many unmixing methods have been proposed based on deep autoencoders [18]. To effectively leverage the spatial structure information of hyperspectral images, deep convolutional neural network (CNN) based autoencoders have been developed [19, 20]. Some unmixing networks with modified autoencoder architectures have also been proposed. For instance, MSSS-Net [21] builds a two-stream unmixing framework, which adopts an end-to-end manner to simultaneously learn spatial-stream and multi-view spectral-stream networks, and fuses information at different scales to achieve more effective unmixing. In [22], an endmember-guided subnetwork is introduced alongside the basic autoencoder framework to extract features from pure or near-pure spectra; meanwhile, a weight-sharing strategy is used to guide the learning of the basic network. However, DNNs are often regarded as "black boxes" and lack physical interpretability.
Recently, integrating model- and deep-learning-based methods has become a new trend for designing unmixing frameworks with both interpretability and data-driven advantages. On one hand, plug-and-play unmixing approaches use a pretrained denoiser to provide priors [23, 24]; however, these methods still require setting the values of penalty parameters. On the other hand, unrolling/unfolding an optimization algorithm to design the unmixing network structure achieves great performance [25, 26, 27, 28], which overcomes the drawbacks of iterative algorithms and adds interpretability to the deep architecture. The work [27] unrolls the constrained sparse regression (CSR) problem to construct an abundance estimation network, which shows better performance with a lighter structure and faster convergence speed. The work [28] employs a fully convolutional network to unroll the variable splitting and augmented Lagrangian algorithm. However, the spatial information learned by this model is still unsatisfactory, and its performance is limited in high-noise scenes.
In our work, to overcome the poor interpretability of deep learning based unmixing architectures, we propose a novel unmixing network, named PnP-Net. Our method unrolls the plug-and-play framework and uses pretrained state-of-the-art denoisers to add information learned from large-scale image datasets. The main merits of this work are threefold:
• Our proposed method takes advantage of both optimization- and learning-based methods. We unroll the plug-and-play unmixing framework into a novel neural network scheme that can be trained in an end-to-end manner. Our approach relies on the ADMM algorithm, and the layers mimic its iterative process.
• By using denoisers pretrained on external datasets, our proposed framework effectively leverages the information from additional data and combines these priors with internal learning from the hyperspectral data. This strategy bypasses the difficulty of limited hyperspectral data and of training a large number of network parameters.
• We design a dynamic convolution using kernels of various sizes to capture multiscale features for spectral unmixing. It increases the model capacity with multiple parallel convolution kernels fused by an attention module.
The remainder of this paper is organized as follows. We describe the background and related works in Section II. In Section III, we present our proposed PnP-Net method in detail and give the main flowchart and training details. Section IV presents the experimental results that validate the effectiveness of our method. Section V concludes this work.
II Related Works
In this section, we briefly review the basic concepts of the plug-and-play unmixing framework and deep unrolling network.
II-A Plug-and-Play Unmixing Methods
The plug-and-play framework benefits from the variable splitting technique, allowing the use of denoising priors to effectively tackle a range of image restoration problems. It has also been used for hyperspectral unmixing [29]. Typically, a denoising regularization is plugged in for the abundances, with (4) rewritten as:
$\min_{\mathbf{M}\in\mathcal{M},\,\mathbf{A}\in\mathcal{A}}\ \frac{1}{2}\big\|\mathbf{Y}-\mathbf{M}\mathbf{A}\big\|_F^2 + \lambda\,\mathcal{R}(\mathbf{A}).$  (5)
We use ADMM to solve the optimization problem described in (5). By introducing an auxiliary variable $\mathbf{V}$ to replace $\mathbf{A}$ in the regularization term of (5) and adding the constraint $\mathbf{A}=\mathbf{V}$, the original problem is decomposed into easier and more manageable subproblems. The iterative optimization process of (5) is as follows:
$(\mathbf{M}^{k+1}, \mathbf{A}^{k+1}) = \arg\min_{\mathbf{M}\in\mathcal{M},\,\mathbf{A}\in\mathcal{A}}\ \frac{1}{2}\big\|\mathbf{Y}-\mathbf{M}\mathbf{A}\big\|_F^2 + \frac{\mu}{2}\big\|\mathbf{A}-\mathbf{V}^{k}+\mathbf{U}^{k}\big\|_F^2,$  (6)
$\mathbf{V}^{k+1} = \arg\min_{\mathbf{V}}\ \lambda\,\mathcal{R}(\mathbf{V}) + \frac{\mu}{2}\big\|\mathbf{A}^{k+1}-\mathbf{V}+\mathbf{U}^{k}\big\|_F^2,$  (7)
$\mathbf{U}^{k+1} = \mathbf{U}^{k} + \mathbf{A}^{k+1} - \mathbf{V}^{k+1},$  (8)
in which $\mathbf{U}$ is the (scaled) dual variable and $\mu$ is the penalty parameter. The process involves two essential operators. The first is a blind unmixing operator to estimate endmembers and abundances. The second step can be viewed as a denoising of $\mathbf{A}^{k+1}+\mathbf{U}^{k}$. The regularization term can be implicitly handled by incorporating a denoising operator $\mathcal{D}(\cdot)$, i.e.,
$\mathbf{V}^{k+1} = \mathcal{D}\big(\mathbf{A}^{k+1} + \mathbf{U}^{k}\big).$  (9)
In general, the denoiser can be any readily available denoising operator. For example, the well-known nonlocal means (NLM) denoising [30] and block-matching and 3D filtering (BM3D) [31] have been plugged in to capture image priors and obtain high-quality unmixing results. This opens up the possibility of integrating CNN-based denoisers, whose robust priors are learned from massive image data, to address the drawback of the insufficient volume of hyperspectral data.
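To make the framework concrete, the following is a minimal Python sketch of the plug-and-play ADMM loop (6)-(9), under simplifying assumptions of our own: the endmember matrix is held fixed, the abundance constraints are omitted from the least-squares step, and `denoiser` stands for any off-the-shelf denoising operator applied to the abundance maps.

```python
import numpy as np

def pnp_admm_unmix(Y, M, denoiser, n_iter=50, mu=0.1):
    """Sketch of PnP-ADMM unmixing, Eqs. (6)-(9).

    Y: (bands, pixels) image; M: (bands, R) endmembers (fixed here);
    denoiser: callable acting on the (R, pixels) abundance maps.
    """
    R = M.shape[1]
    A = np.full((R, Y.shape[1]), 1.0 / R)  # abundances
    V = A.copy()                           # auxiliary variable
    U = np.zeros_like(A)                   # scaled dual variable
    MtM, MtY = M.T @ M, M.T @ Y
    for _ in range(n_iter):
        # Eq. (6): least-squares step on A (constraints omitted here)
        A = np.linalg.solve(MtM + mu * np.eye(R), MtY + mu * (V - U))
        # Eqs. (7)/(9): the regularization is handled by the denoiser
        V = denoiser(A + U)
        # Eq. (8): dual ascent
        U = U + A - V
    return A
```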
II-B Deep Unrolling Network
The deep unrolling paradigm involves unfolding iterative optimization algorithms into trainable deep architectures. One popular deep unrolling unmixing method unfolds the CSR problem for sparse unmixing [27]. With the endmember matrix $\mathbf{M}$ and the observation $\mathbf{Y}$ known, it introduces an auxiliary variable $\mathbf{Z}$. ADMM iteratively addresses the optimization problem through the following steps:
$\mathbf{A}^{k+1} = \mathbf{W}\mathbf{Y} + \mathbf{B}\big(\mathbf{Z}^{k} - \mathbf{U}^{k}\big),$  (10)
$\mathbf{Z}^{k+1} = \mathrm{soft}_{\lambda/\mu}\big(\mathbf{A}^{k+1} + \mathbf{U}^{k}\big),$  (11)
$\mathbf{U}^{k+1} = \mathbf{U}^{k} + \mathbf{A}^{k+1} - \mathbf{Z}^{k+1},$  (12)
in which $\mathbf{W} = (\mathbf{M}^{\top}\mathbf{M}+\mu\mathbf{I})^{-1}\mathbf{M}^{\top}$ and $\mathbf{B} = \mu(\mathbf{M}^{\top}\mathbf{M}+\mu\mathbf{I})^{-1}$. $\mathrm{soft}_{\theta}(\cdot)$ is the soft-threshold operator that resolves the $\ell_1$-norm. The iterative steps from (10) to (12) can be unfolded into network layers with learnable parameters $\{\mathbf{W}^{(k)}, \mathbf{B}^{(k)}, \theta^{(k)}\}$, where $\theta^{(k)}$ replaces the role of $\lambda/\mu$. Thus (10) to (12) can be reexpressed as follows:
$\mathbf{A}^{(k)} = \mathbf{W}^{(k)}\mathbf{Y} + \mathbf{B}^{(k)}\big(\mathbf{Z}^{(k-1)} - \mathbf{U}^{(k-1)}\big),$  (13)
$\mathbf{Z}^{(k)} = \mathrm{soft}_{\theta^{(k)}}\big(\mathbf{A}^{(k)} + \mathbf{U}^{(k-1)}\big),$  (14)
$\mathbf{U}^{(k)} = \mathbf{U}^{(k-1)} + \mathbf{A}^{(k)} - \mathbf{Z}^{(k)}.$  (15)
Each iteration is decomposed into a single network layer, and stacking these layers is analogous to performing multiple iterations of CSR. The unrolled deep network can be trained in an end-to-end manner. In this way, the knowledge of classical iterative algorithms is infused into the design of deep networks for hyperspectral unmixing. However, networks obtained by unrolling such iterative algorithms operate on individual pixels and do not take spatial structure information into account. With some modifications, we can employ well-designed layers such as convolutional layers to add more spatial priors.
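As an illustration of how (13)-(15) become a trainable layer, here is a hedged PyTorch sketch of one unrolled iteration; the class name, the initialization, and the soft threshold written with two ReLUs are our own choices, not the exact design of [27].

```python
import torch
import torch.nn as nn

class UnrolledADMMLayer(nn.Module):
    """One unrolled iteration, Eqs. (13)-(15), with learnable W, B, theta."""
    def __init__(self, n_bands, n_end):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(n_end, n_bands))
        self.B = nn.Parameter(torch.eye(n_end))
        self.theta = nn.Parameter(torch.tensor(0.01))

    def forward(self, Y, Z, U):
        A = self.W @ Y + self.B @ (Z - U)             # Eq. (13)
        X = A + U                                     # soft threshold, Eq. (14)
        Z = torch.relu(X - self.theta) - torch.relu(-X - self.theta)
        U = U + A - Z                                 # Eq. (15)
        return A, Z, U
```

Stacking several such layers, each with its own parameters, and training them end-to-end reproduces the unrolled scheme described above.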
III Proposed Method
III-A Problem Formulation
In (4), handcrafting an effective regularizer and devising an efficient algorithm to solve the objective function are challenging tasks. Rather than following this intricate path, we propose to derive priors from various image data and integrate them into the optimization process through a plug-and-play strategy, where the regularization is implicitly designed by the use of a denoising algorithm. In the hyperspectral unmixing task, the spatial information is embedded in the abundances, and the spectral information is contained in the endmembers. Thus we impose regularization on the abundance to fully exploit the spatial information. We introduce regularization by denoising (RED) as the regularizer to add denoising priors, and the corresponding hyperspectral unmixing optimization problem is reformulated as:
$\min_{\mathbf{M},\,\mathbf{A}}\ \frac{1}{2}\big\|\mathbf{Y}-\mathbf{M}\mathbf{A}\big\|_F^2 + \lambda\,\rho(\mathbf{A}) + \iota_{\mathcal{M}}(\mathbf{M}) + \iota_{\mathcal{A}}(\mathbf{A}),$  (16)
where $\iota_{\mathcal{S}}(\cdot)$ is an indicator function of a set $\mathcal{S}$ and is defined as:
$\iota_{\mathcal{S}}(\mathbf{X}) = \begin{cases} 0, & \mathbf{X}\in\mathcal{S} \\ +\infty, & \text{otherwise}, \end{cases}$  (17)
and $\rho(\cdot)$ is defined as:
$\rho(\mathbf{A}) = \frac{1}{2}\operatorname{tr}\!\Big(\mathbf{A}^{\top}\big(\mathbf{A}-\mathcal{D}(\mathbf{A})\big)\Big),$  (18)
where $\mathcal{D}(\cdot)$ is an off-the-shelf denoiser. RED depends on the inner product between the solution $\mathbf{A}$ and its residual after denoising, $\mathbf{A}-\mathcal{D}(\mathbf{A})$. Under mild assumptions, it demonstrates advantageous derivative properties, with $\nabla\rho(\mathbf{A}) = \mathbf{A}-\mathcal{D}(\mathbf{A})$. This formulation also provides flexibility in accommodating any denoising engine, like the original plug-and-play denoising framework.
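For completeness, here is a short sketch of why the gradient takes this simple form, under the standard RED assumptions of local homogeneity ($\nabla\mathcal{D}(\mathbf{A})\mathbf{A}=\mathcal{D}(\mathbf{A})$) and symmetry of the denoiser Jacobian:

```latex
\nabla\rho(\mathbf{A})
  = \tfrac{1}{2}\bigl(\mathbf{A}-\mathcal{D}(\mathbf{A})\bigr)
  + \tfrac{1}{2}\bigl(\mathbf{I}-\nabla\mathcal{D}(\mathbf{A})\bigr)^{\top}\mathbf{A}
  = \mathbf{A}-\mathcal{D}(\mathbf{A}),
```

since Jacobian symmetry and local homogeneity together give $\nabla\mathcal{D}(\mathbf{A})^{\top}\mathbf{A} = \nabla\mathcal{D}(\mathbf{A})\mathbf{A} = \mathcal{D}(\mathbf{A})$.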
We use the ADMM algorithm to solve (16). Two auxiliary variables, $\mathbf{V}_1$ for the abundance and $\mathbf{V}_2$ for the endmember, are introduced, and the objective function is rewritten as:
$\min_{\mathbf{M},\mathbf{A},\mathbf{V}_1,\mathbf{V}_2}\ \frac{1}{2}\big\|\mathbf{Y}-\mathbf{M}\mathbf{A}\big\|_F^2 + \lambda\,\rho(\mathbf{V}_1) + \iota_{\mathcal{A}}(\mathbf{A}) + \iota_{\mathcal{M}}(\mathbf{V}_2) \quad \text{s.t.}\ \mathbf{A}=\mathbf{V}_1,\ \mathbf{M}=\mathbf{V}_2.$  (19)
The associated augmented Lagrangian is expressed as:
$\mathcal{L} = \frac{1}{2}\big\|\mathbf{Y}-\mathbf{M}\mathbf{A}\big\|_F^2 + \lambda\,\rho(\mathbf{V}_1) + \iota_{\mathcal{A}}(\mathbf{A}) + \iota_{\mathcal{M}}(\mathbf{V}_2) + \frac{\mu_1}{2}\big\|\mathbf{A}-\mathbf{V}_1+\mathbf{U}_1\big\|_F^2 + \frac{\mu_2}{2}\big\|\mathbf{M}-\mathbf{V}_2+\mathbf{U}_2\big\|_F^2,$  (20)
where $\mathbf{U}_1$ and $\mathbf{U}_2$ are dual variables, and $\mu_1$ and $\mu_2$ are nonnegative penalty parameters. The ADMM algorithm then involves solving four subproblems at each iteration. In our work, we unroll the ADMM algorithm into a deep network by associating each iteration with a single network layer and stacking a finite number of layers together. Through unrolling, we can use back-propagation to learn regularization parameters that are difficult to handcraft. This approach allows us to capture image priors through a well-designed network structure and adds physical interpretability to the deep network. The proposed unrolling layers (PnP-Net) are presented in detail as follows.
III-B Unrolling Plug-and-play Framework for A Layer
III-B1 Update $\mathbf{A}$
The constraint $\iota_{\mathcal{A}}(\mathbf{A})$ can be satisfied with a Softmax operator. The remaining part of the $\mathbf{A}$-subproblem is a least-squares problem defined as
$\min_{\mathbf{A}}\ \frac{1}{2}\big\|\mathbf{Y}-\mathbf{M}\mathbf{A}\big\|_F^2 + \frac{\mu_1}{2}\big\|\mathbf{A}-\mathbf{V}_1^{k}+\mathbf{U}_1^{k}\big\|_F^2.$  (21)
It can be solved in closed form as follows:
$\mathbf{A}^{k+1} = \big(\mathbf{M}^{\top}\mathbf{M}+\mu_1\mathbf{I}\big)^{-1}\Big(\mathbf{M}^{\top}\mathbf{Y} + \mu_1\big(\mathbf{V}_1^{k}-\mathbf{U}_1^{k}\big)\Big).$  (22)
The $\mathbf{A}$-update layer is designed by unfolding (22) and is rewritten as
$\mathbf{A}^{(k)} = \mathbf{W}\mathbf{Y} + \mathbf{B}\big(\mathbf{V}_1^{(k-1)} - \mathbf{U}_1^{(k-1)}\big),$  (23)
where
$\mathbf{W} = \big(\mathbf{M}^{\top}\mathbf{M}+\mu_1\mathbf{I}\big)^{-1}\mathbf{M}^{\top}, \quad \mathbf{B} = \mu_1\big(\mathbf{M}^{\top}\mathbf{M}+\mu_1\mathbf{I}\big)^{-1}.$  (24)
When a closed-form solution exists, conventional unrolling methods only learn the regularization parameters. In [27] and [28], $\mathbf{W}$ and $\mathbf{B}$ are considered as learnable parameters to enhance the flexibility of the network and improve the performance. Following this strategy, we also replace the fixed $\mathbf{W}$ and $\mathbf{B}$ with parameters to estimate.
In [27], two fully connected layers with bias, added together, are used to formulate this layer. In [28], 2D convolutional layers are used instead, which can capture spatial information. In our work, inspired by the dynamic convolution work [32], we design a novel dynamic convolution with multiscale convolution kernels to formulate this layer, which can capture abundant spatial information at different scales. This layer consolidates multiple parallel convolution kernels dynamically, adjusting their contributions based on input-dependent attention. Instead of enhancing either the depth or the width of the network, this layer possesses increased representational capability due to the nonlinear aggregation of these kernels through attention. Convolutional kernels of different sizes are able to capture multiscale information. We denote the traditional 2D convolution layer as:
$\mathbf{F}_{\mathrm{out}} = \mathbf{K} * \mathbf{F}_{\mathrm{in}},$  (25)
where $\mathbf{K}$ is the convolutional kernel, $\mathbf{F}_{\mathrm{in}}$ denotes the input, and $\mathbf{F}_{\mathrm{out}}$ is the output. Our dynamic convolution layer (DCL) with $P$ parallel multiscale convolution kernels is defined as:
$\mathbf{F}_{\mathrm{out}} = \sum_{i=1}^{P}\Big(\mathcal{E}_i(\boldsymbol{\alpha}_i)\odot\mathbf{K}_i\Big) * \mathbf{F}_{\mathrm{in}},$  (26)
where $\boldsymbol{\alpha}_i$ is the attention weight of the $i$th convolution kernel, and the size of $\boldsymbol{\alpha}_i$ is the same as the maximum size of the convolutional kernels. The available weights of $\boldsymbol{\alpha}_i$ match the $i$th kernel's size and are placed at the centre of $\boldsymbol{\alpha}_i$; other positions are padded with zeros. The attention weights sum to one at each position, i.e., $\sum_{i=1}^{P}\boldsymbol{\alpha}_i(j)=1$ for the $j$th position. Since the size of $\mathbf{K}_i$ is inconsistent with that of $\boldsymbol{\alpha}_i$, we use an extraction operator $\mathcal{E}_i(\cdot)$ to extract the available weights from $\boldsymbol{\alpha}_i$ so that they have the same size as $\mathbf{K}_i$. $\odot$ denotes element-by-element multiplication. An illustration of the matrix $\boldsymbol{\alpha}_i$ is shown in Fig. 1. The output then represents the optimal combination of multiscale features. The weight $\boldsymbol{\alpha}_i$ is associated with the input and learned by an attention module; the squeeze-and-excitation module [33] is introduced to calculate it. Global average pooling is firstly applied to squeeze the global spatial information of the input data cube. Then two fully connected layers, with a ReLU between them as the activation function, are exploited to extract features. A specially designed layer, named Softmaxpro, is used to obtain sum-to-one attention weights with the characteristics described above. To be specific, for the $i$th kernel, a fully connected layer maps the features to the size of the $i$th convolutional kernel. Then a zero-padding operator expands it to the maximum size of the convolutional kernels. Finally, we apply the Softmax to the available weights at each position of $\boldsymbol{\alpha}$ to accomplish this goal:
$\boldsymbol{\alpha}_i(j) = \frac{\exp\big(\mathbf{z}_i(j)\big)}{\sum_{i'}\exp\big(\mathbf{z}_{i'}(j)\big)},$  (27)
where $\mathbf{z}_i$ denotes the zero-padded feature of the $i$th kernel, and the sum at position $j$ runs over the kernels whose available weights cover that position.
The scheme of our proposed DCL is shown in Fig. 2.
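As a concrete reference, the following PyTorch sketch implements a simplified variant of the DCL: it uses three parallel kernels and a squeeze-and-excitation attention that assigns one sum-to-one weight per kernel, rather than the per-position Softmaxpro weights described above; all names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvLayer(nn.Module):
    """Multiscale dynamic convolution (simplified sketch): P parallel
    kernels whose outputs are fused by input-dependent attention."""
    def __init__(self, in_ch, out_ch, sizes=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in sizes])
        hidden = max(in_ch // 4, 4)
        self.fc1 = nn.Linear(in_ch, hidden)   # squeeze-and-excitation
        self.fc2 = nn.Linear(hidden, len(sizes))

    def forward(self, x):
        # Global average pooling squeezes the spatial information
        s = x.mean(dim=(2, 3))
        # Sum-to-one attention over the P kernels
        attn = F.softmax(self.fc2(F.relu(self.fc1(s))), dim=1)
        feats = torch.stack([conv(x) for conv in self.convs], dim=1)
        # Attention-weighted fusion of the multiscale features
        return (attn[:, :, None, None, None] * feats).sum(dim=1)
```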
With this useful and efficient network component, (23) is equivalently reformulated as:
$\mathbf{A}^{(k)} = \sum_{i=1}^{P_1}\Big(\mathcal{E}_i(\boldsymbol{\alpha}_i)\odot\mathbf{W}_i\Big) * \mathbf{Y} + \sum_{i=1}^{P_2}\Big(\mathcal{E}_i(\boldsymbol{\beta}_i)\odot\mathbf{B}_i\Big) * \big(\mathbf{V}_1^{(k-1)}-\mathbf{U}_1^{(k-1)}\big),$  (28)
where $P_1$ and $P_2$ are the numbers of kernels, $\mathbf{W}_i$ and $\mathbf{B}_i$ represent the weights of the $i$th convolution kernel, and $\boldsymbol{\alpha}_i$ and $\boldsymbol{\beta}_i$ are the attention weight matrices of the dynamic convolution.
Then we apply a Softmax operator to constrain the output $\mathbf{A}^{(k)}$ within $\mathcal{A}$, i.e.,
$\mathbf{A}^{(k)} \leftarrow \mathrm{Softmax}\big(\mathbf{A}^{(k)}\big).$  (29)
The neural network structure is illustrated in Fig. 3.
III-B2 Update $\mathbf{V}_1$
The $\mathbf{V}_1$-subproblem is a standard RED objective function:
$\min_{\mathbf{V}_1}\ \lambda\,\rho(\mathbf{V}_1) + \frac{\mu_1}{2}\big\|\mathbf{A}^{k+1}-\mathbf{V}_1+\mathbf{U}_1^{k}\big\|_F^2.$  (30)
It can be solved efficiently using the fixed-point strategy. By setting the gradient of the objective function to $\mathbf{0}$, we have the following equation:
$\lambda\big(\mathbf{V}_1 - \mathcal{D}(\mathbf{V}_1)\big) + \mu_1\big(\mathbf{V}_1 - \mathbf{A}^{k+1} - \mathbf{U}_1^{k}\big) = \mathbf{0}.$  (31)
The solution of the $\mathbf{V}_1$-subproblem is then obtained by the following iterative scheme:
$\mathbf{V}_1^{j+1} = \frac{1}{\lambda+\mu_1}\Big(\lambda\,\mathcal{D}\big(\mathbf{V}_1^{j}\big) + \mu_1\big(\mathbf{A}^{k+1}+\mathbf{U}_1^{k}\big)\Big).$  (32)
The $\mathbf{V}_1$-update layer is derived by unrolling (32). The update of the $k$th iteration of $\mathbf{V}_1$ is expressed as
$\mathbf{V}_1^{(k)} = \frac{1}{\lambda^{(k)}+\mu_1^{(k)}}\Big(\lambda^{(k)}\mathcal{D}\big(\mathbf{V}_1^{(k-1)}\big) + \mu_1^{(k)}\big(\mathbf{A}^{(k)}+\mathbf{U}_1^{(k-1)}\big)\Big),$  (33)
where $\lambda^{(k)}$ and $\mu_1^{(k)}$ are learnable parameters and play the roles of $\lambda$ and $\mu_1$. In this manner, we can use a data-driven strategy to learn the regularization parameters rather than handcraft them. The neural network layer is shown in Fig. 3.
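As code, one such layer reduces to a few lines; the sketch below uses our own variable names, `lam` and `mu1` would be registered as learnable scalars (e.g., `nn.Parameter`), and the weights of `denoiser` are assumed frozen (`requires_grad=False`):

```python
def v1_update(V1_prev, A, U1, denoiser, lam, mu1):
    # Eq. (33): a convex-like combination of the denoised estimate
    # and A + U1, with lam and mu1 learned rather than handcrafted.
    return (lam * denoiser(V1_prev) + mu1 * (A + U1)) / (lam + mu1)
```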
III-B3 Update $\mathbf{M}$
The $\mathbf{M}$-subproblem aims to solve the following optimization problem:
$\min_{\mathbf{M}}\ \frac{1}{2}\big\|\mathbf{Y}-\mathbf{M}\mathbf{A}\big\|_F^2 + \frac{\mu_2}{2}\big\|\mathbf{M}-\mathbf{V}_2^{k}+\mathbf{U}_2^{k}\big\|_F^2,$  (34)
which can be efficiently solved through a closed-form solution. Then we obtain the following update iteration:
$\mathbf{M}^{k+1} = \Big(\mathbf{Y}\mathbf{A}^{\top} + \mu_2\big(\mathbf{V}_2^{k}-\mathbf{U}_2^{k}\big)\Big)\big(\mathbf{A}\mathbf{A}^{\top}+\mu_2\mathbf{I}\big)^{-1}.$  (35)
The $\mathbf{M}$-update layer is designed by unfolding (35). With $\mathbf{A}^{(k)}$, $\mathbf{V}_2^{(k-1)}$ and $\mathbf{U}_2^{(k-1)}$ given, it is written as follows:
$\mathbf{M}^{(k)} = \mathbf{Y}\mathbf{G} + \big(\mathbf{V}_2^{(k-1)}-\mathbf{U}_2^{(k-1)}\big)\mathbf{H},$  (36)
where
$\mathbf{G} = \mathbf{A}^{\top}\big(\mathbf{A}\mathbf{A}^{\top}+\mu_2\mathbf{I}\big)^{-1}, \quad \mathbf{H} = \mu_2\big(\mathbf{A}\mathbf{A}^{\top}+\mu_2\mathbf{I}\big)^{-1}.$  (37)
We replace $\mathbf{G}$ and $\mathbf{H}$ with learnable parameters to enhance flexibility. The neural network layer is presented in Fig. 3 and defined as
$\mathbf{M}^{(k)} = \mathbf{Y}\mathbf{G}^{(k)} + \big(\mathbf{V}_2^{(k-1)}-\mathbf{U}_2^{(k-1)}\big)\mathbf{H}^{(k)}.$  (38)
III-B4 Update $\mathbf{V}_2$
The $\mathbf{V}_2$-subproblem is a least-squares problem with the nonnegativity constraint, which is formulated as:
$\min_{\mathbf{V}_2 \geq \mathbf{0}}\ \frac{\mu_2}{2}\big\|\mathbf{M}^{k+1}-\mathbf{V}_2+\mathbf{U}_2^{k}\big\|_F^2.$  (39)
We use a hard-threshold operator to solve this problem by mapping $\mathbf{M}^{k+1}+\mathbf{U}_2^{k}$ to the set of all nonnegative elements. This operation is written as:
$\mathbf{V}_2^{k+1} = \max\big(\mathbf{0},\ \mathbf{M}^{k+1}+\mathbf{U}_2^{k}\big).$  (40)
In the design of the $\mathbf{V}_2$-update layer, we use the ReLU to generate this layer:
$\mathbf{V}_2^{(k)} = \mathrm{ReLU}\big(\mathbf{M}^{(k)}+\mathbf{U}_2^{(k-1)}\big).$  (41)
The neural network layer is shown in Fig. 3.
III-B5 Update $\mathbf{U}_1$ and $\mathbf{U}_2$
The $\mathbf{U}_1$-update layer is designed to unfold the following iteration:
$\mathbf{U}_1^{k+1} = \mathbf{U}_1^{k} + \mathbf{A}^{k+1} - \mathbf{V}_1^{k+1}.$  (42)
With $\mathbf{A}^{(k)}$ and $\mathbf{V}_1^{(k)}$, the $k$th neural network layer to update $\mathbf{U}_1$ is designed as
$\mathbf{U}_1^{(k)} = \mathbf{U}_1^{(k-1)} + \eta_1^{(k)}\big(\mathbf{A}^{(k)} - \mathbf{V}_1^{(k)}\big),$  (43)
where $\eta_1^{(k)}$ is a learnable parameter that replaces the role of the unit step size in (42).
The $\mathbf{U}_2$-update layer is derived from
$\mathbf{U}_2^{k+1} = \mathbf{U}_2^{k} + \mathbf{M}^{k+1} - \mathbf{V}_2^{k+1}.$  (44)
In the same way as the design of the $\mathbf{U}_1$-update layer, this layer uses a learnable $\eta_2^{(k)}$ to replace the role of the step size:
$\mathbf{U}_2^{(k)} = \mathbf{U}_2^{(k-1)} + \eta_2^{(k)}\big(\mathbf{M}^{(k)} - \mathbf{V}_2^{(k)}\big).$  (45)
The $\mathbf{U}_1$- and $\mathbf{U}_2$-update neural network layers are also shown in Fig. 3.
III-C Network Architecture
Each unrolling block represents a single iteration of the ADMM-based plug-and-play unmixing method and consists of six parts: the $\mathbf{A}$-, $\mathbf{V}_1$-, $\mathbf{M}$-, $\mathbf{V}_2$-, $\mathbf{U}_1$- and $\mathbf{U}_2$-update layers. As shown in Fig. 4, we use $K$ iteration blocks to construct the architecture of our proposed method, which mimics the ADMM-based unmixing algorithm with $K$ iterations. For the $k$th block, the learnable parameters are those of its update layers, i.e., the dynamic convolution weights and attention modules of the $\mathbf{A}$-update, together with $\lambda^{(k)}$, $\mu_1^{(k)}$, $\mathbf{G}^{(k)}$, $\mathbf{H}^{(k)}$, $\eta_1^{(k)}$ and $\eta_2^{(k)}$. Each block in the network has its own set of learnable parameters that are not shared across blocks, which provides flexibility and strong learning capability for the unmixing task.
Additionally, the PnP-Net architecture can be divided into two parts: the endmember estimation network and the abundance estimation network. Each block of the endmember estimation network contains the $\mathbf{M}$-, $\mathbf{V}_2$- and $\mathbf{U}_2$-update components, and $K$ blocks are used to estimate the endmembers. The abundance estimation network comprises three parts, namely the $\mathbf{A}$-, $\mathbf{V}_1$- and $\mathbf{U}_1$-update layers, and $K$ such layers are applied to estimate the abundances.
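Structurally, the forward pass can be summarized by the following hedged PyTorch sketch; `UnrollBlock` is a stand-in for the six update layers of Section III-B, and the state keys are our own names.

```python
import torch
import torch.nn as nn

class UnrollBlock(nn.Module):
    """Placeholder for one unrolling block; internals abbreviated."""
    def __init__(self):
        super().__init__()
        self.eta1 = nn.Parameter(torch.tensor(1.0))  # unshared per block
        self.eta2 = nn.Parameter(torch.tensor(1.0))
        # ... A-, V1-, M-, V2-update layers would be declared here ...

    def forward(self, Y, state):
        # ... run the six updates of Eqs. (28)-(45) on `state` ...
        return state

class PnPNet(nn.Module):
    """K unrolled blocks with unshared parameters, mimicking K ADMM
    iterations (a structural sketch, not the full implementation)."""
    def __init__(self, n_blocks=5):
        super().__init__()
        self.blocks = nn.ModuleList([UnrollBlock() for _ in range(n_blocks)])

    def forward(self, Y, state):
        recons = []  # per-block reconstructions used by the loss, Eq. (46)
        for block in self.blocks:
            state = block(Y, state)
            recons.append(state["M"] @ state["A"])
        return state, recons
```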
III-D Training Details and Network Initialization
Our PnP-Net is trained in a blind manner with the input hyperspectral image $\mathbf{Y}$. We use the mean square error (MSE) to compute the differences between the input and its reconstruction. Compared to computing the MSE loss only on the reconstruction of the final ($K$th) block, computing it for each block better constrains the training process of the network. Thus, we use a weighted sum of the reconstructions from all blocks to compute the loss function, which is defined as:
$\mathcal{L}_{\mathrm{loss}} = \sum_{k=1}^{K}\gamma_k\big\|\mathbf{Y}-\hat{\mathbf{Y}}^{(k)}\big\|_F^2,$  (46)
where $\gamma_k$ denotes the importance of each term, and $\hat{\mathbf{Y}}^{(k)} = \mathbf{M}^{(k)}\mathbf{A}^{(k)}$ is the corresponding reconstructed image of the $k$th block.
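In code, the loss is a weighted sum over the per-block reconstructions; a minimal sketch, with `gamma` as our name for the importance weights:

```python
import torch

def pnp_net_loss(Y, recons, gamma):
    # Eq. (46): weighted MSE between the input Y and each block's
    # reconstruction; constraining every block stabilizes training.
    return sum(g * torch.mean((Y - R) ** 2) for g, R in zip(gamma, recons))
```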
Parameter initialization plays an important role in our PnP-Net; it can accelerate convergence and enhance the endmember and abundance estimation accuracy. The endmember matrix and abundance matrix are initialized by VCA [34] and FCLS [35]. Specifically, we obtain the initialization of $\mathbf{V}_1$ and $\mathbf{V}_2$ according to $\mathbf{A}$ and $\mathbf{M}$, and the other variables are initialized with zeros.
Our PnP-Net method is flexible in plugging in various deep denoisers. In this work, we choose three advanced denoisers, i.e., DnCNN [36], IRCNN [37] and SCUNet [38], as examples. We pretrain these models using images from the Waterloo Exploration Database [39], DIV2K [40] and Flickr2K [41] datasets with different noise settings. It is worth noting that the parameters of the denoiser are fixed during training, which reduces the computational burden and takes advantage of the outer priors. Our PnP-Net variants with these denoisers are named PnP-Net-1, PnP-Net-2 and PnP-Net-3, respectively. The number of iteration blocks ($K$) is set to 5, and $P_1$ and $P_2$ are set to 3; the learning rate, the three kernel sizes, and the loss weights $\gamma_k$ are set empirically.
IV Experiment Results
In this section, we first introduce the evaluation indicators used to analyse the effectiveness of the proposed method. The conventional and state-of-the-art compared methods are also briefly presented. Then, the datasets used in the experiments are described and the results are illustrated and discussed.
Table I: aRMSE and PSNR (dB) results on the synthetic data at different noise levels (**bold** denotes the best result, _underscores_ the second best).
Noise level | | 5dB | | 10dB | | 20dB | | 30dB |
Method | Denoiser | aRMSE | PSNR | aRMSE | PSNR | aRMSE | PSNR | aRMSE | PSNR
SUnSAL-TV | / | 0.0991 | 27.8650 | 0.0658 | 30.5224 | 0.0225 | 44.0060 | 0.0081 | 50.5379
gtvMBO | / | 0.1149 | 29.6094 | 0.0699 | 34.3044 | 0.0272 | 43.9903 | _0.0080_ | 53.8601
U-ADMM-BUNet | / | 0.0939 | 30.6719 | 0.0647 | 35.1446 | 0.0256 | 44.0009 | 0.0086 | 53.3713
DIFCNN | / | 0.1044 | 30.4841 | 0.0651 | 34.8753 | 0.0242 | 44.0677 | 0.0081 | 53.1629
AERED | NLM | 0.0907 | 32.2215 | 0.0604 | 35.6194 | 0.0246 | 44.1319 | 0.0081 | 54.4845
AERED | BM3D | 0.0927 | 32.4679 | 0.0597 | 36.3611 | 0.0217 | 44.1408 | _0.0080_ | 54.6312
PnP-Net | DnCNN | _0.0856_ | **33.5082** | 0.0572 | 37.6063 | **0.0197** | **44.3506** | **0.0079** | **55.5990**
PnP-Net | IRCNN | **0.0850** | _33.4259_ | _0.0566_ | **37.7314** | 0.0203 | 44.1599 | _0.0080_ | 55.4729
PnP-Net | SCUNet | 0.0859 | 33.3130 | **0.0535** | _37.7282_ | _0.0199_ | _44.2947_ | _0.0080_ | _55.4810_
Table II: mRMSE and mSAD results on the synthetic data at different noise levels (**bold** denotes the best result, _underscores_ the second best).
Noise level | | 5dB | | 10dB | | 20dB | | 30dB |
Method | Denoiser | mRMSE | mSAD | mRMSE | mSAD | mRMSE | mSAD | mRMSE | mSAD
SUnSAL-TV | / | 0.0402 | 6.3464 | 0.0228 | 3.6377 | 0.0052 | 0.9584 | 0.0023 | 0.2176
gtvMBO | / | 0.0502 | 6.4056 | 0.0205 | 2.7845 | 0.0100 | 1.1000 | _0.0014_ | **0.1913**
U-ADMM-BUNet | / | 0.0400 | 6.3073 | 0.0200 | 3.2512 | 0.0050 | 0.9482 | 0.0015 | 0.2003
DIFCNN | / | 0.0423 | 6.4225 | 0.0199 | 3.0286 | 0.0067 | 1.1064 | 0.0017 | 0.2001
AERED | NLM | 0.0415 | 6.2650 | 0.0204 | 2.9087 | 0.0051 | 1.0135 | 0.0015 | 0.2088
AERED | BM3D | 0.0401 | 6.2937 | 0.0206 | 2.7587 | 0.0049 | 0.9676 | 0.0014 | 0.2069
PnP-Net | DnCNN | _0.0394_ | _6.2195_ | _0.0194_ | _2.7191_ | _0.0046_ | **0.8658** | **0.0013** | 0.1928
PnP-Net | IRCNN | 0.0399 | 6.2858 | 0.0202 | 2.7848 | 0.0049 | 0.9291 | _0.0014_ | 0.1943
PnP-Net | SCUNet | **0.0390** | **6.1300** | **0.0192** | **2.6464** | **0.0042** | _0.8734_ | _0.0014_ | _0.1941_
IV-A Evaluation Indicators
We use several indicators to quantitatively evaluate the unmixing results in terms of estimated abundances, extracted endmembers and reconstructed images. Since no ground truth is available for the real datasets, we only evaluate the reconstructed image quantitatively and subjectively assess the visual quality.
The root mean square error (RMSE) is applied to assess the differences between the ground truth and the estimated abundances and endmembers. We name these aRMSE and mRMSE, respectively, defined as follows:
$\mathrm{aRMSE} = \frac{1}{N}\sum_{n=1}^{N}\sqrt{\frac{1}{R}\big\|\mathbf{a}_n-\hat{\mathbf{a}}_n\big\|_2^2}$  (47)
and
$\mathrm{mRMSE} = \frac{1}{R}\sum_{r=1}^{R}\sqrt{\frac{1}{B}\big\|\mathbf{m}_r-\hat{\mathbf{m}}_r\big\|_2^2},$  (48)
where $\hat{\mathbf{a}}_n$ and $\mathbf{a}_n$ represent the estimated and real abundances of the $n$th pixel, and $\hat{\mathbf{m}}_r$ and $\mathbf{m}_r$ denote the $r$th extracted endmember and the corresponding real endmember.
We apply the mean spectral angle distance (SAD) to evaluate the degree of distortion between the real and reconstructed spectral signatures, and mSAD denotes the SAD between the estimated and real endmembers. They are calculated by:
$\mathrm{SAD} = \frac{1}{N}\sum_{n=1}^{N}\arccos\left(\frac{\mathbf{y}_n^{\top}\hat{\mathbf{y}}_n}{\|\mathbf{y}_n\|_2\,\|\hat{\mathbf{y}}_n\|_2}\right)$  (49)
and
$\mathrm{mSAD} = \frac{1}{R}\sum_{r=1}^{R}\arccos\left(\frac{\mathbf{m}_r^{\top}\hat{\mathbf{m}}_r}{\|\mathbf{m}_r\|_2\,\|\hat{\mathbf{m}}_r\|_2}\right),$  (50)
where $\mathbf{y}_n$ and $\hat{\mathbf{y}}_n$ represent the real and reconstructed spectra.
We also use the peak signal-to-noise ratio (PSNR) to evaluate the differences between the real and reconstructed images, calculated by:
$\mathrm{PSNR} = 10\log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right),$  (51)
where $\mathrm{MAX}$ represents the highest pixel value in the reconstructed image and $\mathrm{MSE}$ is the mean square error between the real and reconstructed images.
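For reference, the three indicators can be computed as follows; this is a sketch using one common convention (aRMSE averaged over all abundance entries at once, SAD returned in radians for a single spectrum), and the function names are ours.

```python
import numpy as np

def armse(A_true, A_est):
    # Eq. (47): root mean square error over the abundance maps
    return np.sqrt(np.mean((A_true - A_est) ** 2))

def sad(y_true, y_est):
    # Eq. (49): angle between a real and a reconstructed spectrum;
    # average it over pixels (or over endmembers for mSAD)
    cos = (y_true @ y_est) / (np.linalg.norm(y_true) * np.linalg.norm(y_est))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def psnr(Y_true, Y_rec):
    # Eq. (51), with MAX taken as the highest value of the
    # reconstructed image
    mse = np.mean((Y_true - Y_rec) ** 2)
    return 10.0 * np.log10(Y_rec.max() ** 2 / mse)
```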
IV-B Compared Methods
We compare the proposed PnP-Net method with several conventional and state-of-the-art unmixing methods, i.e. SUnSAL-TV [5], gtvMBO [10], U-ADMM-BUNet [27], DIFCNN [28] and AERED [42]. To make a fair comparison, all methods are initialized by the endmembers extracted by VCA [34] and abundances estimated by FCLS [35]. All parameters of these methods are carefully chosen to achieve the best experimental results.
The first two compared methods are conventional methods. The SUnSAL-TV method uses VCA to extract the endmembers, and a total variation spatial regularization is applied to constrain the estimated abundances. The gtvMBO method is a blind unmixing method with a data-driven graph total variation regularization, and its objective function is solved by ADMM. U-ADMM-BUNet and DIFCNN are novel unrolling-based unmixing methods. AERED integrates the autoencoder-based unmixing network with RED, which combines explicit and implicit priors. We name the AERED variants with the NLM denoiser and the block-matching and 4-D filtering (BM4D) [43] denoiser as AERED-1 and AERED-2, respectively.
IV-C Experiments on Synthetic Datasets
To quantitatively verify the unmixing results, we generate a set of synthetic data, adapting the method described in [44]. The ground-truth abundance maps are shown in Fig. 5, and they are constrained to satisfy the ASC and ANC. The abundances obtained by this strategy show spatial correlation similar to that of real data. Four spectral signatures selected from the USGS spectral library, with 224 bands, are used to generate the endmember matrix. The reference endmember spectra are shown in Fig. 6. Zero-mean Gaussian noise at four signal-to-noise-ratio (SNR) levels, i.e., 5dB, 10dB, 20dB and 30dB, is added to the data.
Table I shows the aRMSE and PSNR results on the synthetic data, and Table II lists the mRMSE and mSAD results. The bold numbers in Tables I and II denote the best results, and the underlined numbers indicate the second-best results. From these two tables, it can be observed that the proposed PnP-Net unmixing framework achieves the best unmixing results compared to the benchmark methods. Compared to traditional unmixing methods such as SUnSAL-TV and gtvMBO, the proposed method eliminates the need for penalty parameter selection, instead utilizing a data-driven learning approach. In comparison to the unrolling unmixing methods, i.e., U-ADMM-BUNet and DIFCNN, the proposed PnP-Net method employs dynamic convolution to achieve multiscale information learning and fusion, and obtains better results. Our PnP-Net method leverages denoisers pretrained externally on large datasets to provide outer priors, and also integrates a hyperspectral data-driven and physically interpretable neural network to learn inner priors. This combination results in superior unmixing performance. We can also observe that, owing to the use of deep denoisers and deep neural networks, our proposed unmixing method is robust to noise.
Fig. 5 shows the abundance maps estimated by the proposed method and the comparison methods for the synthetic data at 10dB. We observe that all these estimates are very close to the ground truth. Moreover, it can be seen that the noise in the abundance maps of PnP-Net and AERED is obviously smaller than that of the other comparison methods, which indicates the superiority of using a denoiser to bring prior information. Furthermore, Fig. 6 illustrates the endmembers extracted by the proposed method; the result of our method is close to the ground truth. Fig. 7 illustrates how the number of blocks $K$ affects the unmixing performance, and also shows how the numbers of kernels $P_1$ and $P_2$ impact the unmixing results.
IV-D Experiments on Real Datasets
We also evaluate our proposed method on real datasets. We conduct experiments on two widely used hyperspectral images obtained by airborne sensors, namely the Jasper Ridge dataset and the Muufl Gulfport dataset. The descriptions and experimental results of these two datasets are presented below.
The Jasper Ridge dataset was captured by Analytical Imaging and Geophysics (AIG) in 1999. It contains 224 bands covering the spectral range from 380nm to 2500nm. After removing the bands affected by water vapor and the atmosphere (1-3, 108-112, 154-166, and 220-224), 198 bands remain. The original size of this data is $512\times614$ pixels, and a popular region of interest of $100\times100$ pixels is cropped. The four endmembers in this data are "water", "soil", "tree" and "road".
The Muufl Gulfport dataset was captured by the CASI-1500 hyperspectral sensor over the University of Southern Mississippi Gulf Park Campus. The original image contains 72 bands. We remove the first four and the last four bands of the image, which contain a lot of noise, leaving 64 bands. A subimage is applied in our experiments. We use the manually labeled scene ground-truth map from [45]. There are 5 pure materials in this subimage, namely "roof", "grass", "tree", "shadow" and "asphalt".
The visual results of the estimated abundance maps for these two datasets are shown in Fig. 8 and Fig. 9. The first four rows of Fig. 8 compare our method with the benchmark methods on the Jasper Ridge dataset. All methods decompose this data into four clear abundance maps. However, we observe that the abundance maps of the "road" material estimated by SUnSAL-TV and gtvMBO are lighter than those of the other methods. The reason may be that the data contain spectral variability, which makes it hard for these two conventional methods to extract representative endmembers. Our PnP-Net obtains clearer abundance results. The last row of Fig. 8 illustrates the reconstruction errors between the reconstructed and real images for these methods. We observe that our method achieves good reconstruction results, which is also consistent with the results in Table III. To some extent, this proves the advantages of our method in unmixing real data. The abundance maps of the Muufl Gulfport dataset are illustrated in Fig. 9. It can be seen that the results of gtvMBO are noisier than those of the other methods, especially for the "asphalt" endmember. The proposed PnP-Net method estimates abundance maps that are much clearer than those of the compared methods. Our results also show more detailed spatial information and sharper edges. The last row of Fig. 9 presents the reconstruction error maps of all methods. DIFCNN shows higher errors, which may be caused by the loss function selected to train the model: it uses the cross-entropy loss, while the other deep methods apply an MSE-like loss, which is more conducive to reducing the Euclidean distance between the reconstructed and original images. Table IV lists the quantitative results on the Muufl Gulfport data, from which we observe that our method obtains good reconstructions.
Table III: Reconstruction PSNR (dB) and SAD results on the Jasper Ridge dataset (**bold** denotes the best result, _underscores_ the second best).
Method | SUnSAL-TV | gtvMBO | U-ADMM-BUNet | DIFCNN | AERED | AERED | PnP-Net | PnP-Net | PnP-Net
Denoiser | / | / | / | / | NLM | BM3D | DnCNN | IRCNN | SCUNet
PSNR | 28.5721 | 29.5939 | 30.0075 | 29.7343 | **31.3895** | 30.0416 | 29.9136 | _30.1203_ | 30.0763
SAD | 9.2143 | 6.3181 | 5.6415 | 8.7582 | 5.5519 | _5.4386_ | 5.5369 | **5.4210** | 5.6177
Table IV: Reconstruction PSNR (dB) and SAD results on the Muufl Gulfport dataset (**bold** denotes the best result, _underscores_ the second best).
Method | SUnSAL-TV | gtvMBO | U-ADMM-BUNet | DIFCNN | AERED | AERED | PnP-Net | PnP-Net | PnP-Net
Denoiser | / | / | / | / | NLM | BM3D | DnCNN | IRCNN | SCUNet
PSNR | 31.5959 | 30.1698 | 31.1459 | 25.652 | 32.3154 | **33.4423** | 32.4775 | _32.5454_ | 32.3401
SAD | 5.2867 | 7.5754 | 6.8583 | 8.3911 | 5.1006 | **4.4823** | 4.9912 | _4.9505_ | 5.1039
V Conclusion
In this paper, we propose a novel PnP-Net method for hyperspectral unmixing. We unroll the plug-and-play unmixing method into a trainable deep model. We apply RED to add denoising priors, and a variety of denoisers can be plugged in. ADMM is used to solve the optimization problem. We unfold the ADMM-based iterative steps to design the layers, where one layer represents one iteration, and $K$ layers are stacked to construct the deep architecture. Through unrolling, we can train the network in an end-to-end manner. To fully exploit the spatial information, we use dynamic convolution with different sizes of convolution kernels to capture multiscale information. The denoisers are pretrained on a variety of images, which allows our method to leverage the power of external information to enhance its modeling ability. This strategy bypasses the issue of limited hyperspectral data and provides a way to integrate the inner and outer priors. Extensive experimental results show the superior performance of our method.
References
- [1] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson, "Deep learning for hyperspectral image classification: An overview," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 9, pp. 6690–6709, 2019.
- [2] B. Lu, P. D. Dao, J. Liu, Y. He, and J. Shang, “Recent advances of hyperspectral imaging technology and applications in agriculture,” Remote Sensing, vol. 12, no. 16, p. 2659, 2020.
- [3] X. Briottet, Y. Boucher, A. Dimmeler, A. Malaplate, A. Cini, M. Diani, H. Bekman, P. Schwering, T. Skauli, I. Kasen et al., “Military applications of hyperspectral imagery,” in Targets and backgrounds XII: Characterization and representation, vol. 6239. SPIE, 2006, pp. 82–89.
- [4] L. Drumetz, M.-A. Veganzones, S. Henrot, R. Phlypo, J. Chanussot, and C. Jutten, “Blind hyperspectral unmixing using an extended linear mixing model to address spectral variability,” IEEE Trans. Image Process., vol. 25, no. 8, pp. 3890–3905, 2016.
- [5] M.-D. Iordache, J. M. Bioucas-Dias, and A. Plaza, “Total variation spatial regularization for sparse hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 11, pp. 4484–4502, 2012.
- [6] F. Xiong, Y. Qian, J. Zhou, and Y. Y. Tang, “Hyperspectral unmixing via total variation regularized nonnegative tensor factorization,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 4, pp. 2341–2357, 2018.
- [7] H. Li, R. Feng, L. Wang, Y. Zhong, and L. Zhang, “Superpixel-based reweighted low-rank and total variation sparse unmixing for hyperspectral remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 1, pp. 629–647, 2020.
- [8] J. Li, A. Agathos, D. Zaharie, J. M. Bioucas-Dias, A. Plaza, and X. Li, “Minimum volume simplex analysis: A fast algorithm for linear hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 9, pp. 5067–5082, 2015.
- [9] X. Wang, Y. Zhong, L. Zhang, and Y. Xu, “Blind hyperspectral unmixing considering the adjacency effect,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 9, pp. 6633–6649, 2019.
- [10] J. Qin, H. Lee, J. T. Chi, L. Drumetz, J. Chanussot, Y. Lou, and A. L. Bertozzi, “Blind hyperspectral unmixing based on graph total variation regularization,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 4, pp. 3338–3351, 2020.
- [11] M. Li, F. Zhu, A. J. Guo, and J. Chen, “A graph regularized multilinear mixing model for nonlinear hyperspectral unmixing,” Remote Sensing, vol. 11, no. 19, p. 2188, 2019.
- [12] J. Gu, B. Yang, and B. Wang, “Nonlinear unmixing for hyperspectral images via kernel-transformed bilinear mixing models,” IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–13, 2021.
- [13] P.-A. Thouvenin, N. Dobigeon, and J.-Y. Tourneret, "Hyperspectral unmixing with spectral variability using a perturbed linear mixing model," IEEE Trans. Signal Process., vol. 64, no. 2, pp. 525–538, 2015.
- [14] X. Yang, M. Zhao, S. Shi, and J. Chen, "Deep constrained energy minimization for hyperspectral target detection," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 15, pp. 8049–8063, 2022.
- [15] Y. Cai, X. Liu, and Z. Cai, “BS-Nets: An end-to-end framework for band selection of hyperspectral image,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 3, pp. 1969–1984, 2019.
- [16] Y. Cheng, X. Wang, Y. Ma, X. Mei, M. Wu, and J. Ma, "General hyperspectral image super-resolution via meta-transfer learning," IEEE Trans. Neural Netw. Learn. Syst., 2024.
- [17] J. Chen, M. Zhao, X. Wang, C. Richard, and S. Rahardja, “Integration of physics-based and data-driven models for hyperspectral image unmixing: A summary of current methods,” IEEE Signal Process. Mag., vol. 40, no. 2, pp. 61–74, 2023.
- [18] B. Palsson, J. R. Sveinsson, and M. O. Ulfarsson, "Blind hyperspectral unmixing using autoencoders: A critical comparison," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 15, pp. 1340–1372, 2022.
- [19] B. Palsson, M. O. Ulfarsson, and J. R. Sveinsson, “Convolutional autoencoder for spectral–spatial hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 1, pp. 535–549, 2020.
- [20] F. Khajehrayeni and H. Ghassemian, "Hyperspectral unmixing using deep convolutional autoencoders in a supervised scenario," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 567–576, 2020.
- [21] L. Qi, Z. Chen, F. Gao, J. Dong, X. Gao, and Q. Du, “Multiview spatial–spectral two-stream network for hyperspectral image unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–16, 2023.
- [22] D. Hong, L. Gao, J. Yao, N. Yokoya, J. Chanussot, U. Heiden, and B. Zhang, "Endmember-guided unmixing network (EGU-Net): A general deep learning framework for self-supervised hyperspectral unmixing," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 11, pp. 6518–6531, 2021.
- [23] Z. Wang, L. Zhuang, L. Gao, A. Marinoni, B. Zhang, and M. K. Ng, “Hyperspectral nonlinear unmixing by using plug-and-play prior for abundance maps,” Remote Sensing, vol. 12, no. 24, p. 4117, 2020.
- [24] X. Wang, J. Chen, and C. Richard, “Tuning-free plug-and-play hyperspectral image deconvolution with deep priors,” IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–13, 2023.
- [25] C. Cui, X. Wang, S. Wang, L. Zhang, and Y. Zhong, “Unrolling nonnegative matrix factorization with group sparsity for blind hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., 2023.
- [26] Y. Shao, Q. Liu, and L. Xiao, "IVIU-Net: Implicit variable iterative unrolling network for hyperspectral sparse unmixing," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 16, pp. 1756–1770, 2023.
- [27] C. Zhou and M. R. Rodrigues, “ADMM-based hyperspectral unmixing networks for abundance and endmember estimation,” IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–18, 2021.
- [28] F. Kong, M. Chen, Y. Li, D. Li, and Y. Zheng, “Deep interpretable fully CNN structure for sparse hyperspectral unmixing via model-driven and data-driven integration,” IEEE Trans. Geosci. Remote Sens., 2023.
- [29] M. Zhao, X. Wang, J. Chen, and W. Chen, “A plug-and-play priors framework for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–13, 2021.
- [30] A. Buades, B. Coll, and J.-M. Morel, “Non-local means denoising,” Image Processing On Line, vol. 1, pp. 208–212, 2011.
- [31] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Trans. Image Process., vol. 16, no. 8, pp. 2080–2095, 2007.
- [32] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, and Z. Liu, “Dynamic convolution: Attention over convolution kernels,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2020, pp. 11 030–11 039.
- [33] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2018, pp. 7132–7141.
- [34] J. M. Nascimento and J. M. Dias, “Vertex component analysis: A fast algorithm to unmix hyperspectral data,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 4, pp. 898–910, 2005.
- [35] D. C. Heinz et al., “Fully constrained least squares linear spectral mixture analysis method for material quantification in hyperspectral imagery,” IEEE Trans. Geosci. Remote Sens., vol. 39, no. 3, pp. 529–545, 2001.
- [36] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, 2017.
- [37] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2017, pp. 3929–3938.
- [38] K. Zhang, Y. Li, J. Liang, J. Cao, Y. Zhang, H. Tang, D.-P. Fan, R. Timofte, and L. V. Gool, “Practical blind image denoising via Swin-Conv-UNet and data synthesis,” Machine Intelligence Research, vol. 20, no. 6, pp. 822–836, 2023.
- [39] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo exploration database: New challenges for image quality assessment models,” IEEE Trans. Image Process., vol. 26, no. 2, pp. 1004–1016, 2016.
- [40] E. Agustsson and R. Timofte, "NTIRE 2017 challenge on single image super-resolution: Dataset and study," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 126–135.
- [41] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, “Non-local recurrent network for image restoration,” Advances in neural information processing systems, vol. 31, 2018.
- [42] M. Zhao, J. Chen, and N. Dobigeon, “AE-RED: A hyperspectral unmixing framework powered by deep autoencoder and regularization by denoising,” IEEE Trans. Geosci. Remote Sens., 2024.
- [43] M. Maggioni, V. Katkovnik, K. Egiazarian, and A. Foi, “Nonlocal transform-domain filter for volumetric data denoising and reconstruction,” IEEE Trans. Image Process., vol. 22, no. 1, pp. 119–133, 2012.
- [44] Z. Han, D. Hong, L. Gao, B. Zhang, M. Huang, and J. Chanussot, “AutoNAS: Automatic neural architecture search for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–14, 2022.
- [45] X. Du and A. Zare, "Technical report: Scene label ground truth map for MUUFL Gulfport data set," University of Florida, Gainesville, FL, Tech. Rep. 20170417, 2017.