
Algorithm Unrolling for Massive Access via Deep Neural Network with Theoretical Guarantee

Yandong Shi, Hayoung Choi, Yuanming Shi, and Yong Zhou
This paper was presented in part at IEEE International Conference on Communications Workshops (ICC Workshops), Dublin, Ireland, 2020 [1]. Y. Shi is with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China, also with the Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]). H. Choi is with the Department of Mathematics Kyungpook National University Daegu, Republic of Korea (e-mail:[email protected]). Y. Shi and Y. Zhou are with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China (e-mail:{shiym, zhouyong}@shanghaitech.edu.cn). Yuanming Shi is also with Yoke Intelligence, Shanghai, China.
Abstract

Massive access is a critical design challenge of Internet of Things (IoT) networks. In this paper, we consider the grant-free uplink transmission of an IoT network with a multiple-antenna base station (BS) and a large number of single-antenna IoT devices. Taking into account the sporadic nature of IoT devices, we formulate the joint activity detection and channel estimation (JADCE) problem as a group-sparse matrix estimation problem. This problem can be solved by applying existing compressed sensing techniques, which, however, either suffer from high computational complexity or lack robustness. To this end, we propose a novel algorithm unrolling framework based on deep neural networks to simultaneously achieve low computational complexity and high robustness for solving the JADCE problem. Specifically, we map the original iterative shrinkage thresholding algorithm (ISTA) into an unrolled recurrent neural network (RNN), thereby improving the convergence rate and computational efficiency through end-to-end training. Moreover, the proposed algorithm unrolling approach inherits the structure and domain knowledge of the ISTA, thereby maintaining the robustness of the algorithm, which allows it to handle a non-Gaussian preamble sequence matrix in massive access. With rigorous theoretical analysis, we further simplify the unrolled network structure by reducing the redundant training parameters. Furthermore, we prove that the simplified unrolled deep neural network structures enjoy a linear convergence rate. Extensive simulations based on various preamble signatures show that the proposed unrolled networks outperform the existing methods in terms of the convergence rate, robustness, and estimation accuracy.

Index Terms:
Massive access, joint activity detection and channel estimation, group-sparse matrix estimation, algorithm unrolling, and deep learning.

I Introduction

Massive machine-type communications (mMTC), as one of the three major use cases of the fifth-generation (5G) wireless networks [2], is expected to support ubiquitous connectivity for a large number of low-cost Internet of Things (IoT) devices. The typical applications of the mMTC use case include smart homes, wearables, environmental sensing, and healthcare [3]. Driven by the increasing popularity of IoT services and the decreasing cost of IoT devices, the number of IoT devices is expected to reach 75.4 billion by 2025 [4]. Providing efficient medium access for such a large number of IoT devices is a critical design challenge that needs to be addressed to support mMTC.

Grant-free random access, proposed by the 3rd generation partnership project (3GPP) for 5G new radio (NR), has been recognized as a promising multiple access scheme that is capable of accommodating massive IoT devices with low-latency requirements [5]. In particular, with grant-free random access, each IoT device can transmit a frame containing both preamble sequence and data at any time instant without waiting for the grant from the base station (BS). Compared to the grant-based random access, the signaling overhead and the random access latency can be significantly reduced. However, due to the lack of prior knowledge on the subset of active IoT devices that will transmit and the necessity of obtaining accurate channel state information (CSI) for data decoding, it is of vital importance to achieve joint activity detection and channel estimation (JADCE) for grant-free random access in IoT networks[6]. To alleviate the collision probability of preamble sequences and in turn facilitate JADCE, the authors in [7, 8, 9] focused on the design of non-orthogonal preamble signature sequences. In particular, the authors in [7] proposed a novel preamble design that concatenates multiple Zadoff-Chu (ZC) sequences, which increase the success rate of preamble transmission and enhance the channel reuse efficiency for grant-free random access.

Taking into account the sporadic nature of IoT devices, the JADCE problem was formulated as a sparse signal recovery problem in [10]. This kind of problem can be addressed by adopting the compressed sensing (CS) technique that is capable of exploiting the channel sparsity structures in both the time- and frequency-domains [11, 12]. Due to the large-scale nature of mMTC, the computationally efficient approximate message passing (AMP) based algorithms were proposed to solve the CS problems for supporting massive access [13, 6, 14]. The authors in [6] showed that AMP with a vector denoiser and multiple measurement vectors (MMV) [15] achieve approximately the same performance for JADCE. An AMP algorithm with a minimum mean-squared error (MMSE) denoiser was proposed in [13], where a prior knowledge of the underlying channel distribution was required. Furthermore, a generalized multiple measurement vector approximate message passing (GMMV-AMP) algorithm was proposed to detect the device activity by exploiting sparsity in both the spatial and the angular domains [14]. However, AMP-based algorithms often fail to converge when the preamble signature matrix is mildly ill-conditioned or non-Gaussian [16]. To this end, the authors in [17, 18] introduced a mixed 1/2\ell_{1}/\ell_{2}-norm to formulate the JADCE problem as a form of group LASSO, which can be solved by using the interior-point method [17] or the alternating direction method of multipliers (ADMM) [18] without any prior knowledge on CSI. In particular, the popular iterative shrinkage thresholding algorithm (ISTA) [19, 20] was one of the best known robust algorithmic approaches to solve the group LASSO problem. However, the ISTA converges only sublinearly in general cases [19] and suffers from an inherent trade-off between the estimation performance and convergence rate [21], which prohibit its practical implementation for grant-free massive access in IoT networks.

Deep learning has recently been emerging as a disruptive technology to reduce the computational cost. For instance, the authors in [22, 23, 24, 25, 26] proposed the idea of “learning to optimize”, where neural networks were developed to replace the traditional optimization algorithms to solve the non-convex constrained optimization problems [22, 23, 24] and mixed-integer nonlinear programming (MINLP) problems [25, 26] for resource allocation in wireless networks. These studies demonstrated the potentials of deep learning for improving communication and computation efficiency. However, the deep learning framework usually regarded the neural network as a black box and cannot guarantee the theoretical performance. Therefore, the lack of interpretability and theoretical guarantee of the deep learning framework is a critical issue that needs to be addressed in wireless networks.

Fortunately, the unrolled deep neural network [27], as another powerful deep learning technique, has been proposed to provide a concrete connection between the existing iterative algorithms and the deep neural networks with theoretical performance guarantee [28, 29]. The promising applications of this method include resource management, MIMO detection and precoding design [30, 31]. The typical unrolled deep neural network, i.e., learned iterative shrinkage thresholding algorithm (LISTA) [27, 32, 33], was adopted to approximate the sparse coding based on LASSO, which achieves a better estimation performance. Moreover, the authors in [34, 32] established the simplified structures for LISTA and proved that simplified LISTA achieves linear convergence rate. The authors in [33] study the selection of adapted step size to improve the convergence rate of ISTA and further propose a unrolled network where only the step sizes of ISTA are learned. However, such an unrolling framework cannot be directly applied to solve the JADCE problem in IoT networks with grant-free random access, where the uplink channels of active devices are usually represented by a group-row-sparse matrix. This is because the existing LISTA with a scalar shrinkage-thresholding operator (SSTO), which only focuses on the recovery of sparse vectors, cannot tackle the group-sparse matrix estimation problem.

I-A Contributions

In this paper, we consider grant-free massive access in an IoT network, where a large number of single-antenna IoT devices sporadically transmit non-orthogonal preamble sequences to a multi-antenna BS. In such a network setting, we formulate the JADCE problem as a group-sparse matrix estimation problem by introducing a mixed 1/2\ell_{1}/\ell_{2}-norm. Based on the above discussions, we shall develop a new unrolled deep neural network framework by adopting the multidimensional generalization operator, named multidimensional shrinkage thresholding operator (MSTO) [35], to address this problem. Such an extension turns out to be non-trivial, as the MSTO for the group-row-sparse matrix breaks the independency between the sparse vectors, which brings formidable challenge to reveal the weight coupling structure and analyze the convergence rate for the proposed method. We address this challenge and provide the theoretical performance guarantee for the proposed framework that is capable of solving JADCE problem to support massive access in IoT networks.

The major contributions of this paper are summarized as follows:

  • We develop a novel unrolled deep neural network framework with three structures for group-sparse matrix estimation to support grant-free massive access in IoT networks. The proposed methods enjoy high robustness by inheriting the structure of ISTA and keep the computational complexity at a low level through end-to-end training.

  • We conduct rigorous theoretical analysis to address the interpretability issue and facilitate efficient training in IoT networks. We reveal the weight coupling structure among the training parameters and identify the property that the learned weight matrix only depends on the preamble signature. The theoretical analysis allows us to further simplify the original unrolled network structure by reducing the redundant training parameters. Furthermore, we prove that the simplified unrolled neural networks enjoy a linear convergence rate.

  • Extensive simulations are conducted to validate the theoretical analysis and show the superior performance of the proposed algorithms in three critical aspects for grant-free massive access. First, compared with robust algorithms such as ISTA, the proposed methods have a much lower computational cost. Second, the proposed methods are more robust than computationally efficient algorithms such as AMP-based algorithms, as demonstrated by simulations with different preamble signatures, including the Gaussian matrix, the binary matrix, and the ZC sequence. Third, the proposed algorithms are capable of returning more accurate solutions compared with the classical CS-based algorithms.

I-B Organization and Notations

The rest of this paper is organized as follows. The system model and problem formulation are described in Section II. Section III presents the three proposed unrolled neural networks for solving the group-sparse matrix estimation problem. The convergence analysis of the proposed unrolled neural networks is presented in Section IV. The numerical results of the proposed algorithms are illustrated in Section V. Finally, Section VI concludes this paper.

Notations

Throughout this paper, we denote [N]={1,2,,N}[N]=\{1,2,\ldots,N\}. Let \mathbb{R} (resp. \mathbb{C}) be the set of real (resp. complex) numbers and \mathbb{N} be the set of integers. For a set 𝒮\mathcal{S}, the number of elements of 𝒮\mathcal{S} is denoted by |𝒮||\mathcal{S}|. For index sets ={i1,,in}[N]\mathcal{I}=\{i_{1},\ldots,i_{n}\}\subset[N], 𝒥={j1,,jm}[M]\mathcal{J}=\{j_{1},\ldots,j_{m}\}\subset[M] and 𝑨N×M\bm{A}\in\mathbb{R}^{N\times M}, we denote by 𝑨[,𝒥]\bm{A}[\mathcal{I},\mathcal{J}] the (sub)matrix of entries that lie in the rows of 𝑨\bm{A} indexed by \mathcal{I} and the columns indexed by 𝒥\mathcal{J}. When =[N]\mathcal{I}=[N] (resp. 𝒥=[M]\mathcal{J}=[M]), we simply denote by 𝑨[:,𝒥]\bm{A}[:,\mathcal{J}] (resp. 𝑨[,:]\bm{A}[\mathcal{I},:]). For a matrix 𝑨\bm{A}, 𝑨2\|\bm{A}\|_{2} (resp. 𝑨F\|\bm{A}\|_{F}) denotes the operator (resp. Frobenius) norm. And 𝑨max\|\bm{A}\|_{\max} denotes the max norm. Also, 𝑨2,1\|\bm{A}\|_{2,1} (resp. 𝑨2,0\|\bm{A}\|_{2,0}) denotes mixed 1/2\ell_{1}/\ell_{2}-norm (resp. mixed 0/2\ell_{0}/\ell_{2}-norm). For a vector 𝒗\bm{v}, 𝒗2\|\bm{v}\|_{2} denotes the Euclidean norm. We denote the support of the vector 𝒗=[v1,,vN]N\bm{v}=[v_{1},\ldots,v_{N}]\in\mathbb{R}^{N} as supp(𝒗)\text{supp}(\bm{v}), i.e. supp(𝒗)={i[N]|vi0}\text{supp}(\bm{v})=\{i\in[N]|v_{i}\neq 0\}.

II System Model and Problem Formulation

II-A System Model

Consider the grant-free uplink transmission of a single-cell IoT network consisting of one M-antenna BS and N single-antenna IoT devices, as shown in Fig. 1. Full frequency reuse and quasi-static block fading channels are considered in this paper. We denote [N]=\{1,\ldots,N\} as the set of IoT devices, which sporadically communicate with the BS. In each transmission block, all the IoT devices independently decide whether or not to be active. Within a specific transmission block, we denote a_{n}=1 if device n is active and a_{n}=0 otherwise. Because of the sporadic transmission, it is reasonable to assume that each IoT device is active with a probability much smaller than 1. In addition, we assume that the active IoT devices are synchronized at the symbol level [17]. The preamble signal received at the BS, denoted as \bm{y}(\ell)\in\mathbb{C}^{M}, can be expressed as

\bm{y}(\ell)=\sum_{n=1}^{N}\bm{h}_{n}a_{n}s_{n}(\ell)+\bm{z}(\ell),\quad\ell=1,\ldots,L, (1)

where 𝒉nM\bm{h}_{n}\in{\mathbb{C}}^{M} denotes the channel vector between device nn and the BS, sn()s_{n}(\ell)\in\mathbb{C} denotes the \ell-th symbol of the preamble signature sequence transmitted by device nn, LL denotes the length of the signature sequence, and 𝒛()M\bm{z}(\ell)\in\mathbb{C}^{M} denotes the additive white Gaussian noise (AWGN) vector at the BS.

Refer to caption
Figure 1: Illustration of an IoT network with a massive number of IoT devices that transmit sporadically to the BS.
Refer to caption
Figure 2: Illustration of the JADCE problem, where 𝑿\bm{X}^{\natural} has the group-sparse structure in rows, i.e., if one entry of 𝑿\bm{X}^{\natural} is zero, then other entries on the same row should be zero simultaneously.

As the length of the signature sequence is typically much smaller than the number of IoT devices, i.e., LNL\ll N, it is impossible to allocate mutually orthogonal signature sequences to all the IoT devices. Hence, each IoT device is assigned a unique signature sequence, which is generally non-orthogonal to the preamble sequences assigned to other IoT devices. Note that three different kinds of non-orthogonal preamble sequences will be considered in the simulations.

For notational ease, we denote 𝒀=[𝒚(1),,𝒚(L)]TL×M\bm{Y}=[\bm{y}(1),...,\bm{y}(L)]^{T}\in\mathbb{C}^{L\times M} as the received preamble signal matrix across MM antennas over the transmission duration of LL symbols, 𝑯=[𝒉1,,𝒉N]TN×M\bm{H}=[\bm{h}_{1},...,\bm{h}_{N}]^{T}\in\mathbb{C}^{N\times M} as the channel matrix between NN devices and the BS, 𝒁=[𝒛(1),,𝒛(L)]L×M\bm{Z}=[\bm{z}(1),...,\bm{z}(L)]\in\mathbb{C}^{L\times M} as the noise matrix at the BS, and 𝑺=[𝒔(1),,𝒔(L)]TL×N\bm{S}=[\bm{s}(1),...,\bm{s}(L)]^{T}\in\mathbb{C}^{L\times N} as the known preamble signature sequence matrix with 𝒔()=[s1(),,sN()]TN\bm{s}(\ell)=[s_{1}(\ell),...,s_{N}(\ell)]^{T}\in\mathbb{C}^{N}. As a result, the preamble signal received at the BS can be rewritten as

\bm{Y}=\bm{SAH}+\bm{Z}, (2)

where 𝑨=diag(a1,,aN)N×N\bm{A}={\rm{diag}}(a_{1},...,a_{N})\in\mathbb{R}^{N\times N} denotes the diagonal activity matrix. We aim to simultaneously detect the activity matrix 𝑨\bm{A} and estimate the channel matrix 𝑯\bm{H}, which is known as a JADCE problem [13]. By denoting 𝑿=𝑨𝑯N×M\bm{X}^{\natural}=\bm{AH}\in\mathbb{C}^{N\times M}, matrix 𝑿\bm{X}^{\natural} thus has the group-sparse structure in rows [17], as shown in Fig. 2. We further rewrite (2) as

\bm{Y}=\bm{SX}^{\natural}+\bm{Z}. (3)

Note that the active devices and their corresponding channel vectors can be recovered from the estimation of matrix 𝑿\bm{X}^{\natural} [17].

II-B Problem Formulation

To estimate the group-row sparse matrix 𝑿\bm{X}^{\natural}, we adopt the mixed 1/2\ell_{1}/\ell_{2}-norm [17] to induce the group sparsity, i.e., (𝑿)=n=1N𝑿[n,:]2,\mathcal{R}(\bm{X})=\sum_{n=1}^{N}\|\bm{X}[n,:]\|_{2}, where 𝑿[n,:]\bm{X}[n,:] is the nn-th row of matrix 𝑿\bm{X}. The group-sparse matrix estimation problem can be reformulated as the following unconstrained convex optimization problem (also known as group LASSO) [36]

\mathscr{P}:\mathop{\operatorname{minimize}}\limits_{\bm{X}\in\mathbb{C}^{N\times M}}\ \frac{1}{2}\|\bm{Y}-\bm{S}\bm{X}\|_{F}^{2}+\lambda\mathcal{R}(\bm{X}), (4)

where \lambda>0 denotes the regularization parameter. As matrix \bm{X} is group sparse in rows, problem \mathscr{P} is essentially a group-sparse matrix estimation problem.

To facilitate efficient algorithm design, we rewrite (3) as its real-valued counterpart

\bm{\tilde{Y}}=\bm{\tilde{S}}\bm{\tilde{X}}^{\natural}+\bm{\tilde{Z}}=\left[\begin{matrix}\Re\{\bm{S}\}&-\Im\{\bm{S}\}\\ \Im\{\bm{S}\}&\Re\{\bm{S}\}\end{matrix}\right]\left[\begin{matrix}\Re\{\bm{X}^{\natural}\}\\ \Im\{\bm{X}^{\natural}\}\end{matrix}\right]+\left[\begin{matrix}\Re\{\bm{Z}\}\\ \Im\{\bm{Z}\}\end{matrix}\right], (5)

where {}\Re\{\cdot\} and {}\Im\{\cdot\} represent the real part and imaginary part of a complex matrix. As a result, problem 𝒫\mathscr{P} can be further rewritten as

\mathscr{P}_{r}:\mathop{\operatorname{minimize}}\limits_{\bm{\tilde{X}}\in\mathbb{R}^{2N\times M}}\ \frac{1}{2}\|\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}\|_{F}^{2}+\lambda\mathcal{R}(\bm{\tilde{X}}). (6)

Therefore, the goal of JADCE becomes recovering the matrix \bm{\tilde{X}}^{\natural} from the known preamble matrix \bm{\tilde{S}} and the noisy observation \bm{\tilde{Y}}.
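As a concrete illustration of the real-valued reformulation, the following minimal NumPy sketch (our own illustrative code, not the authors' implementation; function and variable names are assumptions) builds \bm{\tilde{S}}, \bm{\tilde{X}}^{\natural}, and \bm{\tilde{Y}} from the complex model (3) according to (5).

```python
import numpy as np

def to_real(S, X, Z):
    """Map the complex model Y = S X + Z in (3) to its real-valued form (5).

    S: (L, N) complex, X: (N, M) complex, Z: (L, M) complex.
    Returns S_t (2L, 2N), X_t (2N, M), Y_t (2L, M).
    """
    S_t = np.block([[S.real, -S.imag],
                    [S.imag,  S.real]])
    X_t = np.vstack([X.real, X.imag])
    Z_t = np.vstack([Z.real, Z.imag])
    Y_t = S_t @ X_t + Z_t        # equals [Re(Y); Im(Y)] stacked
    return S_t, X_t, Y_t
```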

II-C Prior Work

The ISTA [36] is a popular approach to solve the group LASSO problem. In particular, the resulting ISTA iteration for (6) can be written as

\bm{\tilde{X}}^{k+1}=\eta_{\lambda/C}\left(\bm{\tilde{X}}^{k}+\frac{1}{C}\bm{\tilde{S}}^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k})\right), (7)

where \bm{\tilde{X}}^{k} is an estimate of the ground truth \bm{\tilde{X}}^{\natural} at iteration k, \frac{1}{C} serves as the step size, \lambda is the regularization parameter, and \eta_{\lambda/C}(\cdot) denotes the MSTO [35]. Specifically, \eta_{\theta}(\bm{X})[n] denotes the n-th row of the matrix \bm{X} after applying \eta_{\theta}(\cdot), which is defined as [20]

\eta_{\theta}(\bm{X})[n]=\max\left\{0,\frac{\|\bm{X}[n,:]\|_{2}-\theta}{\|\bm{X}[n,:]\|_{2}}\right\}\bm{X}[n,:]. (8)

However, the ISTA for the group-sparse matrix estimation problem (referred to as ISTA-GS hereafter) suffers from an inherent trade-off between estimation performance and convergence rate induced by the choice of \lambda, i.e., a larger \lambda leads to faster convergence but poorer estimation performance [21]. Moreover, the choice of the step size \frac{1}{C} also influences the convergence rate [33]. In the next section, we shall propose a learned ISTA framework that learns the parameters, including \lambda and the step size \frac{1}{C}, simultaneously to improve the convergence rate and estimation performance of ISTA-GS.
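For concreteness, the MSTO in (8) and the ISTA-GS iteration (7) can be sketched in NumPy as follows. This is our own illustrative code rather than the authors' implementation; C is taken as the squared spectral norm of \bm{\tilde{S}}, and the regularization parameter defaults to \lambda=0.1 as in the simulations.

```python
import numpy as np

def msto(X, theta):
    """Multidimensional shrinkage thresholding operator (8), applied row-wise."""
    row_norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - theta / np.maximum(row_norms, 1e-12))
    return scale * X

def ista_gs(S, Y, lam=0.1, num_iter=100):
    """Vanilla ISTA-GS (7) for the group LASSO problem (6)."""
    C = np.linalg.norm(S, 2) ** 2              # Lipschitz constant of the gradient
    X = np.zeros((S.shape[1], Y.shape[1]))
    for _ in range(num_iter):
        X = msto(X + S.T @ (Y - S @ X) / C, lam / C)
    return X
```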

III Algorithm Unrolling via Deep Neural Network

In this section, we propose an unrolled deep learning framework to solve the JADCE problem by exploiting the group sparse structure. Although the LISTA proposed in [27, 34, 32, 37] can recover the individual sparse vector signals, all these methods cannot recover the matrix with group row sparsity. To address this issue, we extend LISTA for the group-sparse matrix recovery. Furthermore, we analytically prove that the weight coupling property holds for the group-sparse matrix estimation problem.

Refer to caption
(a) Unrolled network structure 1 (LISTA-GS).
Refer to caption
(b) Unrolled network structure 2 (LISTA-GSCP).
Refer to caption
(c) Unrolled network structure 3 (ALISTA-GS).
Figure 3: Three unrolled network structures.

III-A Unrolled Neural Network Structures

In this subsection, we propose three neural network structures under the unrolled framework for group-sparse matrix estimation problem 𝒫r\mathscr{P}_{r}.

III-A1 LISTA-GS

Inspired by [27, 34] and by denoting 𝑾1=1C𝑺~T\bm{W}_{1}=\frac{1}{C}\bm{\tilde{S}}^{T}, 𝑾2=𝑰1C𝑺~T𝑺~\bm{W}_{2}=\bm{I}-\frac{1}{C}\bm{\tilde{S}}^{T}\bm{\tilde{S}}, and θ=λC\theta=\frac{\lambda}{C}, we rewrite (7) as

𝑿~k+1=ηθk(𝑾1𝒀~+𝑾2𝑿~k).\displaystyle\bm{\tilde{X}}^{k+1}=\eta_{\theta^{k}}(\bm{W}_{1}\bm{\tilde{Y}}+\bm{W}_{2}\bm{\tilde{X}}^{k}). (9)

The key idea of the proposed unrolled method for group-sparse matrix estimation problem is to view matrices 𝑾1\bm{W}_{1}, 𝑾2\bm{W}_{2}, and scalars θk\theta^{k} in (9) as trainable parameters. As a result, (9) can be modeled as a one layer recurrent neural network (RNN), i.e., one iteration of ISTA-GS is treated as one layer neural network. The neural network structure can be concatenated as a KK-layer RNN, which is capable of modeling the ISTA-GS with KK iterations. Mathematically, the unrolled RNN with KK iterations for group-sparse matrix estimation problem is given by

\bm{\tilde{X}}^{k+1}=\eta_{\theta^{k}}(\bm{W}_{1}^{k}\bm{\tilde{Y}}+\bm{W}_{2}^{k}\bm{\tilde{X}}^{k}),\quad k=0,1,\ldots,K-1, (10)

where parameters 𝚯={𝑾1k,𝑾2k,θk}k=0K1\bm{\Theta}=\{\bm{W}_{1}^{k},\bm{W}_{2}^{k},\theta^{k}\}_{k=0}^{K-1} are trainable. This is the main difference from the iterative algorithm in (7).

The unrolled neural network structure given in (10) is named LISTA-GS, and the corresponding structure is plotted in Fig. 3(a). However, this structure contains a large number of parameters, with two high-dimensional weight matrices per layer, which makes it inefficient to train.
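To make the structure concrete, a minimal PyTorch sketch of the K-layer LISTA-GS forward pass (10) is given below. This is our own illustrative code rather than the authors' implementation: the class and argument names are assumptions, the parameters are initialized as in Table I, and the optional num_layers argument (used by the layer-wise training sketch in Section III-B) runs only the first layers of the network.

```python
import torch
import torch.nn as nn

def msto(X, theta):
    """Row-wise soft thresholding (8), written with differentiable PyTorch ops."""
    norms = X.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return torch.clamp(1.0 - theta / norms, min=0.0) * X

class LISTAGS(nn.Module):
    """Unrolled network (10): W1^k, W2^k and theta^k are trainable in every layer."""
    def __init__(self, S, K=12, theta0=0.1):
        super().__init__()
        C = torch.linalg.matrix_norm(S, ord=2) ** 2
        self.K = K
        self.W1 = nn.ParameterList([nn.Parameter(S.t() / C) for _ in range(K)])
        self.W2 = nn.ParameterList(
            [nn.Parameter(torch.eye(S.shape[1]) - S.t() @ S / C) for _ in range(K)])
        self.theta = nn.ParameterList(
            [nn.Parameter(torch.tensor(float(theta0))) for _ in range(K)])

    def forward(self, Y, num_layers=None):
        K = self.K if num_layers is None else num_layers
        X = torch.zeros(self.W1[0].shape[0], Y.shape[1],
                        dtype=Y.dtype, device=Y.device)
        for k in range(K):
            X = msto(self.W1[k] @ Y + self.W2[k] @ X, self.theta[k])
        return X
```

The LISTA-GSCP structure in (13) below follows the same pattern with a single per-layer matrix \bm{W}^{k} and the update \eta_{\theta^{k}}(\bm{\tilde{X}}^{k}+(\bm{W}^{k})^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k})).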

III-A2 LISTA-GSCP

We establish the following necessary condition for the convergence of LISTA-GS, which is inspired by the fact that \bm{W}_{2}^{0}=\bm{I}-\bm{W}_{1}^{0}\bm{\tilde{S}} and by the coupled structure in [34]. This necessary condition reveals the properties that the trainable parameters must satisfy if the proposed unrolled neural network is to recover the group-sparse matrix, and it can be used to simplify the proposed network.

Theorem 1.

Given {𝐖1k,𝐖2k,θk}k=0\{\bm{W}_{1}^{k},\bm{W}_{2}^{k},\theta^{k}\}_{k=0}^{\infty} with bounded weights, i.e., 𝐖1kFCW1\|\bm{W}_{1}^{k}\|_{F}\leq C_{W_{1}} and 𝐖2kFCW2\|\bm{W}_{2}^{k}\|_{F}\leq C_{W_{2}}, let {𝐗~k}k=1\{\bm{\tilde{X}}^{k}\}_{k=1}^{\infty} be generated layer-wise by LISTA-GS in (10) with an input 𝐘~\bm{\tilde{Y}} observed by (3) and the initial point 𝐗~0=𝟎\bm{\tilde{X}}^{0}=\bm{0}. If network can recover any group-row-sparse signal with no observation noise, then {𝐖1k,𝐖2k,θk}k=0\{\bm{W}_{1}^{k},\bm{W}_{2}^{k},\theta^{k}\}_{k=0}^{\infty} must satisfy the following two conditions

\theta^{k}\rightarrow 0,\quad\text{as }k\rightarrow\infty, (11)
\bm{W}_{2}^{k}-(\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}})\rightarrow\bm{0},\quad\text{as }k\rightarrow\infty. (12)
Proof.

Please refer to Appendix A. ∎

Thus, we develop the necessary condition for the recovery convergence of the group-sparse matrix estimation problem. It is worth noting that the MSTO [35] is a generalization of the SSTO, which brings a unique challenge in revealing the weight coupling structure for the group-row-sparse matrix estimation problem. The extension turns out to be non-trivial since the MSTO for the group-sparse matrix breaks the “independence” between the sparse vectors.

Motivated by (12) in Theorem 1, we adopt the partial weight coupling structure of the trainable weights {𝑾1k,𝑾2k}k=0\{\bm{W}_{1}^{k},\bm{W}_{2}^{k}\}_{k=0}^{\infty} in LISTA-GS as 𝑾2k=𝑰𝑾1k𝑺~,k.\bm{W}_{2}^{k}=\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}},\forall k. Hence, by letting 𝑾k=(𝑾1k)T\bm{W}^{k}=(\bm{W}_{1}^{k})^{T}, we obtain the simplified KK-layer unrolled neural network structure (namely LISTA-GSCP) as

\bm{\tilde{X}}^{k+1}=\eta_{\theta^{k}}\big(\bm{\tilde{X}}^{k}+(\bm{W}^{k})^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k})\big),\quad k=0,1,\ldots,K-1, (13)

where parameters 𝚯={𝑾k,θk}k=0K1\bm{\Theta}=\{\bm{W}^{k},\theta^{k}\}_{k=0}^{K-1} are trainable. Fig. 3(b) illustrates the unrolled network.

III-A3 ALISTA-GS

To further alleviate the need to learn a weight matrix \bm{W}^{k} with a large number of parameters while stabilizing the training process, we decompose the weight into the product of a scalar and a matrix, i.e., \bm{W}^{k}=\gamma^{k}\bm{W}. The definition of “good” parameters \bm{W}\in\mathcal{W}(\bm{\tilde{S}}) in (35) shows that the weight matrix only depends on \bm{\tilde{S}}. Thus, we can obtain the weight matrix \bm{W} prior to the training stage by solving the following convex optimization problem with the projected gradient descent (PGD) algorithm

\underset{\bm{W}\in\mathbb{R}^{2L\times 2N}}{\operatorname{minimize}}\quad\|\bm{W}^{T}\bm{\tilde{S}}\|_{F}^{2}
\operatorname{subject~to}\quad\bm{W}[:,i]^{T}\bm{\tilde{S}}[:,i]=1,\ \forall i\in[2N]. (14)

The third proposed unrolled network structure (namely ALISTA-GS) is then given by

\bm{\tilde{X}}^{k+1}=\eta_{\theta^{k}}\big(\bm{\tilde{X}}^{k}+\gamma^{k}\bm{W}^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k})\big),\quad k=0,1,\ldots,K-1, (15)

where \bm{\Theta}=\{\theta^{k},\gamma^{k}\}_{k=0}^{K-1} are the parameters to be trained and the weight matrix \bm{W} is pre-computed before the training process. Different from LISTA-GSCP, this network structure focuses on learning better step sizes. A visual depiction is provided in Fig. 3(c).
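A possible way to pre-compute \bm{W} from (14) is projected gradient descent, as sketched below in NumPy (our own code; the step size and iteration count are illustrative assumptions). The gradient of \|\bm{W}^{T}\bm{\tilde{S}}\|_{F}^{2} is 2\bm{\tilde{S}}\bm{\tilde{S}}^{T}\bm{W}, and each column is projected back onto the affine constraint \bm{W}[:,i]^{T}\bm{\tilde{S}}[:,i]=1.

```python
import numpy as np

def precompute_W(S, step=None, num_iter=2000):
    """Approximately solve (14): minimize ||W^T S||_F^2 s.t. W[:,i]^T S[:,i] = 1."""
    col_sq = np.sum(S * S, axis=0)                      # ||S[:,i]||_2^2 per column
    W = S / col_sq                                      # feasible starting point
    if step is None:
        step = 1.0 / (2.0 * np.linalg.norm(S, 2) ** 2)  # inverse Lipschitz constant
    for _ in range(num_iter):
        W = W - step * 2.0 * (S @ (S.T @ W))            # gradient step
        W = W + S * (1.0 - np.sum(W * S, axis=0)) / col_sq  # per-column projection
    return W
```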

So far, we have established three LISTA variations for the group-sparse matrix estimation problem. With fewer trainable parameters, the training process becomes easier and faster.

Remark 1.

Recently, the advancement of complex-valued deep networks make it possible to apply the proposed LISTA variations in a complex-valued representation, i.e., without rewriting as its real-valued counterpart (5). The complex-valued representation may be utilized to further enhance the performance of sparse recovery by exploiting the extra information about potential grouping of real and imaginary parts, e.g., the real and imaginary parts are either zero or nonzero simultaneously [38, 39]. However, the design of the complex-valued network training process and the complex-valued shrinkage function will be two challenges. Besides, the convergence analysis of such complex-valued LISTA variations will be an interesting problem, which will be studied in our future work.

III-B Deep Neural Networks Training and Testing

In the training stage, we consider a framework in which the neural networks learn to solve the group-sparse matrix estimation problem \mathscr{P}_{r} in a supervised manner. We denote the training data of P samples as \{\bm{\tilde{X}}_{i}^{\natural},\bm{\tilde{Y}}_{i}\}_{i=1}^{P}, where \bm{\tilde{X}}_{i}^{\natural} and \bm{\tilde{Y}}_{i} are viewed as the label and the input, respectively. The following optimization problem is adopted for training a k-layer network,

\mathop{\operatorname{minimize}}\limits_{\bm{\Theta}_{0:k-1}}~\sum_{i=1}^{P}\|\bm{\tilde{X}}^{k}(\bm{\Theta}_{0:k-1},\bm{\tilde{Y}}_{i},\bm{\tilde{X}}^{0})-\bm{\tilde{X}}_{i}^{\natural}\|_{F}^{2}. (16)

However, such a large-scale optimization problem easily converges to a bad local minimum [40]. A layer-wise training strategy is therefore used to separate (16) into the following two parts, i.e., (17) and (18), where (17) aims to find a good initialization for (18) so as to avoid bad local minima of (16). In particular, for training a k-layer network, we denote \bm{\Theta}_{0:k-1} as the trainable parameters from layer 0 to layer k-1, and \bm{\Theta}_{k-1} as the trainable parameters in layer k-1. By fixing the already trained parameters \bm{\Theta}_{0:k-2} of the former (k-1)-layer network and initializing the parameters \bm{\Theta}_{k-1} as given in Table I, we first learn \bm{\Theta}_{k-1} with the learning rate \alpha_{0} by using the Adam algorithm [41] to solve the following optimization problem

\mathop{\operatorname{minimize}}\limits_{\bm{\Theta}_{k-1}}~\sum_{i=1}^{P}\|\bm{\tilde{X}}^{k}(\bm{\Theta}_{0:k-1},\bm{\tilde{Y}}_{i},\bm{\tilde{X}}^{0})-\bm{\tilde{X}}_{i}^{\natural}\|_{F}^{2}, (17)

where \bm{\tilde{X}}^{k}(\bm{\Theta}_{0:k-1},\bm{\tilde{Y}},\bm{\tilde{X}}^{0}) denotes the output of the k-layer network with input \bm{\tilde{Y}} and initial point \bm{\tilde{X}}^{0}. Then, using the parameters \bm{\Theta}_{k-1} obtained from (17) together with the fixed parameters \bm{\Theta}_{0:k-2} as the initialization, we tune all parameters \bm{\Theta}_{0:k-1} with a smaller learning rate \alpha_{1}<\alpha_{0} by using the Adam algorithm to solve

\mathop{\operatorname{minimize}}\limits_{\bm{\Theta}_{0:k-1}}~\sum_{i=1}^{P}\|\bm{\tilde{X}}^{k}(\bm{\Theta}_{0:k-1},\bm{\tilde{Y}}_{i},\bm{\tilde{X}}^{0})-\bm{\tilde{X}}_{i}^{\natural}\|_{F}^{2}. (18)

After applying this procedure, we obtain the parameters \bm{\Theta}_{0:k-1} for the whole k-layer network, and we can then proceed to train a (k+1)-layer network.
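The layer-wise strategy in (17) and (18) can be sketched as the following simplified PyTorch skeleton (our own code with assumed names, building on the LISTAGS sketch in Section III-A1 whose forward pass accepts the number of active layers); samples is a list of (\bm{\tilde{Y}}_{i},\bm{\tilde{X}}_{i}^{\natural}) tensor pairs.

```python
import torch

def layerwise_train(model, samples, epochs=200, lr0=5e-4):
    """Layer-wise training per (17)-(18) for the LISTAGS sketch in Section III-A1."""
    def loss(k):   # squared Frobenius-norm training loss of the k-layer network
        return sum(((model(Y, k) - X) ** 2).sum() for Y, X in samples)

    for k in range(1, model.K + 1):
        # (17): learn only the parameters of the new layer k-1, learning rate alpha_0
        new_params = [model.W1[k - 1], model.W2[k - 1], model.theta[k - 1]]
        opt = torch.optim.Adam(new_params, lr=lr0)
        for _ in range(epochs):
            opt.zero_grad(); err = loss(k); err.backward(); opt.step()
        # (18): fine-tune all parameters of the k-layer network, alpha_1 = 0.2 * alpha_0
        all_params = [p for i in range(k)
                      for p in (model.W1[i], model.W2[i], model.theta[i])]
        opt = torch.optim.Adam(all_params, lr=0.2 * lr0)
        for _ in range(epochs):
            opt.zero_grad(); err = loss(k); err.backward(); opt.step()
```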

In the testing stage, since the known preamble signature 𝑺\bm{S} remains unchanged, the proposed unrolled networks can recover the changing channels. Given a newly received signal 𝒀~\bm{\tilde{Y}}^{{}^{\prime}}, the learned unrolled networks, i.e., LISTA-GS in (10), LISTA-GSCP in (13) and ALISTA-GS in (15), are applied for group-sparse matrix estimation. For instance, with LISTA-GS, we obtain the (k+1)(k+1)-th layer recovered signal by using 𝑿~k+1=η(θk)((𝑾1k)𝒀~+(𝑾2k)𝑿~k)\bm{\tilde{X}}^{k+1}=\eta_{(\theta^{k})^{*}}((\bm{W}_{1}^{k})^{*}\bm{\tilde{Y}}^{{}^{\prime}}+(\bm{W}_{2}^{k})^{*}\bm{\tilde{X}}^{k}), where (θk),(𝑾1k)(\theta^{k})^{*},(\bm{W}_{1}^{k})^{*}, and (𝑾2k)(\bm{W}_{2}^{k})^{*} are the learned parameters of the kk-th layer.

Remark 2.

Most of the existing deep learning based approaches including the proposed LISTA variations rely on the assumption that the channel coefficients follow the same distribution during the training and testing stages, and may not work well in dynamic environment with different channel distributions over time. The continual learning and transfer learning technologies [42, 43] that have been developed for resource allocation and beamforming optimization in dynamic environment have the potential to address this issue. As these technologies are still in the early stage, we leave the proposed LISTA variations to recover the group-sparse matrix in dynamic environment for future work.

III-C Time Complexity and Number of Trainable Parameters

The time complexity of the three neural network structures is mainly due to matrix multiplications. For LISTA-GS, the evaluation of the matrix products \bm{W}_{1}^{k}\bm{\tilde{Y}} and \bm{W}_{2}^{k}\bm{\tilde{X}}^{k} requires \mathcal{O}(NLM+N^{2}M) time at each iteration. As for LISTA-GSCP and ALISTA-GS, the evaluation of (\bm{W}^{k})^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k}) and \gamma^{k}\bm{W}^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k}) requires \mathcal{O}(LNM) time at each iteration. Since the proposed methods converge faster than ISTA with the same per-iteration complexity, the overall computational cost is reduced.

For a K-layer RNN, ALISTA-GS contains only 2K trainable parameters \{\theta^{k},\gamma^{k}\}_{k=0}^{K-1}, while LISTA-GS and LISTA-GSCP require K(N^{2}+LN+1) variables \{\bm{W}_{1}^{k},\bm{W}_{2}^{k},\theta^{k}\}_{k=0}^{K-1} and K(LN+1) variables \{\bm{W}^{k},\theta^{k}\}_{k=0}^{K-1}, respectively. We summarize the initialization and the required numbers of trainable parameters for the K-layer networks in Table I. Since \theta^{k} and \gamma^{k} should be initialized to proper constants, we initialize \theta^{k}=0.1 and \gamma^{k}=1 in this paper.

TABLE I: Number of trainable parameters (params) and initialization in the K-layer RNN.
Network Trainable params Initialization Number of params
LISTA-GS {𝑾1k,𝑾2k,θk}k=0K1\{\bm{W}_{1}^{k},\bm{W}_{2}^{k},\theta^{k}\}_{k=0}^{K-1} 𝑾1k=1C𝑺~T\bm{W}_{1}^{k}=\frac{1}{C}\bm{\tilde{S}}^{T}, 𝑾2k=𝑰1C𝑺~T𝑺~,θk=0.1\bm{W}_{2}^{k}=\bm{I}-\frac{1}{C}\bm{\tilde{S}}^{T}\bm{\tilde{S}},\theta^{k}=0.1 K(N2+LN+1)K(N^{2}+LN+1)
LISTA-GSCP {𝑾k,θk}k=0K1\{\bm{W}^{k},\theta^{k}\}_{k=0}^{K-1} 𝑾k=1C𝑺~T,θk=0.1\bm{W}^{k}=\frac{1}{C}\bm{\tilde{S}}^{T},\theta^{k}=0.1 K(LN+1)K(LN+1)
ALISTA-GS {θk,γk}k=0K1\{\theta^{k},\gamma^{k}\}_{k=0}^{K-1} θk=0.1,γk=1.0\theta^{k}=0.1,\gamma^{k}=1.0 2K2K

IV Convergence Analysis

In this section, we provide the main theoretical results of this paper, i.e., the linear convergence rate of LISTA-GSCP (13) and ALISTA-GS (15), respectively. Since the proposed unrolled networks inherit the structure of ISTA-GS, they thus allow us to track the interpretability for such deep learning framework from the perspective of optimization. As a matter of fact, our proposed unrolled neural networks are extensions of [34, 32] to solve the group-sparse matrix estimation problems. Such dimension expansion and matrix structures bring the unique and formidable challenges for establishing the theoretical analysis on the proposed unrolled neural networks.

We first establish the convergence analysis of the LISTA-GSCP framework. We use \bm{\tilde{X}}^{k} to replace \bm{\tilde{X}}^{k}(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}}) for notational simplicity. The following theorem presents the convergence rate of LISTA-GSCP. For the theoretical analysis, we assume that the \ell_{2}-norm of every row of the signal \bm{\tilde{X}}^{\natural} and the Frobenius norm of the AWGN matrix \bm{\tilde{Z}} are bounded by \beta and \sigma, respectively [34, 32]. Furthermore, since each entry of the activity sequence \{a_{1},\ldots,a_{N}\} follows the Bernoulli distribution, we assume that the number of non-zero rows of the signal \bm{\tilde{X}}^{\natural} is bounded by a small number s [17]. For notational brevity, we assume that the signal \bm{\tilde{X}}^{\natural} and noise \bm{\tilde{Z}} belong to the set \mathcal{X}(\beta,s,\sigma)\coloneqq\big\{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\,\big|\,\|\bm{\tilde{X}}^{\natural}[i,:]\|_{2}\leq\beta,\forall i,\ \|\bm{\tilde{X}}^{\natural}\|_{2,0}\leq s,\ \|\bm{\tilde{Z}}\|_{F}\leq\sigma\big\}.

Theorem 2 (Convergence rate of LISTA-GSCP).

Given {𝐖k,θk}k=0\{\bm{W}^{k},\theta^{k}\}_{k=0}^{\infty}, let {𝐗~k}k=1\{\bm{\tilde{X}}^{k}\}_{k=1}^{\infty} be generated by LISTA-GSCP in (13) with an input 𝐘~\bm{\tilde{Y}} observed by (3) and initial point 𝐗~0=𝟎\bm{\tilde{X}}^{0}=\bm{0}. If ss is sufficiently small, then for all (𝐗~,𝐙~)𝒳(β,s,σ),(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma), we have the error bound:

\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{F}\leq s\beta\exp(-ck)+C\sigma, (19)

where c>0c>0 and C>0C>0 are constants that depend only on 𝐒~\bm{\tilde{S}} and ss.

In particular, for the noiseless case, (19) reduces to \|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{F}\leq s\beta\exp(-ck). Moreover, LISTA-GSCP converges at an \mathcal{O}(\log(\frac{1}{\epsilon})) rate, which is faster than the \mathcal{O}(\frac{1}{\epsilon}) rate of the original ISTA and the \mathcal{O}(\frac{1}{\sqrt{\epsilon}}) rate of Nesterov’s method [44].

The steps of proving Theorem 2 are summarized as follows:

  • A.

    “Good” parameters for learning. We define the “good” parameters that guarantee the linear convergence rate.

  • B.

    Error bound for one sample data. We establish an error bound for one sample (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}}). (To emphasize the signal and noise, we use (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}}) rather than \{\bm{\tilde{X}}^{\natural},\bm{\tilde{Y}}\} to denote one sample.)

  • C.

    Error bound for the whole data set. By taking the supremum over all (𝑿~,𝒁~)(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}}), we establish an error bound over the whole samples.

IV-A “Good” Parameters for Learning

In this subsection, we shall give the definition of “good” parameters for the LISTA-GSCP network structure that guarantees the linear convergence rate.

First, we need to introduce several fundamental definitions inspired by [34]. The first one is the mutual coherence [45] of 𝑺~\bm{\tilde{S}}, which characterizes the coherence between different columns of 𝑺~\bm{\tilde{S}}.

Definition 1.
  • (i)

    The mutual coherence of 𝑺~2L×2N\bm{\tilde{S}}\in\mathbb{R}^{2L\times 2N} with normalized columns is defined as

\mu(\bm{\tilde{S}})=\max_{i\neq j\atop 1\leq i,j\leq 2N}\big|(\bm{\tilde{S}}[:,i])^{T}\bm{\tilde{S}}[:,j]\big|. (20)
  • (ii)

    The generalized mutual coherence of 𝑺~2L×2N\bm{\tilde{S}}\in\mathbb{R}^{2L\times 2N} with normalized columns is defined as

\tilde{\mu}(\bm{\tilde{S}})=\inf_{\bm{W}\in\mathbb{R}^{2L\times 2N}\atop(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,i]=1,\forall i}\bigg\{\max_{i\neq j\atop 1\leq i,j\leq 2N}\Big|(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,j]\Big|\bigg\}. (21)

Lemma 11 in [34] tells us that there exists a matrix 𝑾2L×2N\bm{W}\in\mathbb{R}^{2L\times 2N} that attaches the infimum given in (21), i.e., 𝒲(𝑺~),\mathcal{W}(\bm{\tilde{S}})\neq\emptyset, where a set of “good” weight matrices is defined as

\mathcal{W}(\bm{\tilde{S}})\coloneqq\operatorname*{argmin}_{\bm{W}\in\mathbb{R}^{2L\times 2N}}\Big\{\|\bm{W}\|_{\max}\ \Big|\ (\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,i]=1,\forall i,\ \max_{i\neq j\atop 1\leq i,j\leq 2N}\big|(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,j]\big|=\tilde{\mu}(\bm{\tilde{S}})\Big\}. (22)

Then, we define the “good” parameters to be learned in LISTA-GSCP as follows.

Definition 2.

𝚯={𝑾k,θk}k=0\bm{\Theta}=\{\bm{W}^{k},\theta^{k}\}_{k=0}^{\infty} are called “good” parameters in LISTA-GSCP if they satisfy

\bm{W}^{k}\in\mathcal{W}(\bm{\tilde{S}}),\quad\theta^{k}=\tilde{\mu}\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}+\sigma C_{W},\quad\forall k\in\mathbb{N}, (23)

where CW=maxk0𝐖k2,1C_{W}=\underset{k\geq 0}{\max}~{}\|\bm{W}^{k}\|_{2,1} and μ~=μ~(𝐒~)\tilde{\mu}=\tilde{\mu}(\bm{\tilde{S}}).

In Definition 2, we propose the “good” choice of the learning parameters in LISTA-GSCP. In the following subsection, we prove that the sequence of “good” parameters lead to the conclusion (19) in Theorem 2.

IV-B Error Bound for One Sample

In this subsection, we give an upper bound of the recovery error for one sample (𝑿~,𝒁~)𝒳(β,s,σ)(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma). We first introduce the extra notation ψ\psi, to provide information about the group sparsity. For each 𝑿~2N×M\bm{\tilde{X}}\in\mathbb{R}^{2N\times M}, we define a function ψ:2N×M2N\psi:\mathbb{R}^{2N\times M}\rightarrow\mathbb{R}^{2N} as

\psi(\bm{\tilde{X}})=\begin{bmatrix}\|\bm{\tilde{X}}[1,:]\|_{2},\|\bm{\tilde{X}}[2,:]\|_{2},\cdots,\|\bm{\tilde{X}}[2N,:]\|_{2}\end{bmatrix}^{T}. (24)

Specifically, for a vector 𝒗=[v1,v2,,v2N]T2N\bm{v}=[v_{1},v_{2},\ldots,v_{2N}]^{T}\in\mathbb{R}^{2N}, we have ψ(𝒗)i=|vi|,for all i[2N].\psi(\bm{v})_{i}=|v_{i}|,\text{for all }i\in[2N]. Note that supp(ψ(𝑨)){\rm{supp}}(\psi(\bm{A})) can provide information about group sparsity of a given matrix 𝑨\bm{A}. By simple calculations, one can get the following lemma.

Lemma 1.

With 𝐗~2N×M\bm{\tilde{X}}\in\mathbb{R}^{2N\times M} and 𝐞=[1,1,,1]T\bm{e}=[1,1,\ldots,1]^{T}, we have

\psi(\eta_{\theta}(\bm{\tilde{X}}))=\max\{\psi(\bm{\tilde{X}})-\theta\bm{e},\,\bm{0}\}, (25)

By taking one sample (𝑿~,𝒁~)𝒳(β,s,σ)(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma) and letting =supp(ψ(𝑿~))\mathcal{I}={\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural})), we establish the error bound by two steps: (i) We show that there are no false positive rows in 𝑿~k\bm{\tilde{X}}^{k} for all kk. (ii) Since the no-false-positive property holds, we consider the component on \mathcal{I}.

In step (i), we prove the following lemma.

Lemma 2 (No-false-positive property).

With all assumptions in Theorem 2 and “good” parameters 𝚯\bm{\Theta}, we have

{\rm{supp}}(\psi(\bm{\tilde{X}}^{k}))\subset{\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural})),\quad\forall k. (26)
Proof.

Please refer to Appendix B. ∎

Lemma 2 shows that there are no false positive rows in \bm{\tilde{X}}^{k}. In other words, if (23) holds in LISTA-GSCP, then

\bm{\tilde{X}}^{k}[i,:]=\bm{0},\quad\forall i\notin\mathcal{I},~\forall k. (27)

This property implies that the recovery error of the component beyond \mathcal{I} turns to be 0.

In step (ii), we consider the component on \mathcal{I}. For all ii\in\mathcal{I}, the LISTA-GSCP in (13) gives

\bm{\tilde{X}}^{k+1}[i,:]=\eta_{\theta^{k}}\Big(\bm{\tilde{X}}^{k}[i,:]-(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,\mathcal{I}](\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:])+(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\Big)
\in\underbrace{\bm{\tilde{X}}^{k}[i,:]-(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,\mathcal{I}](\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:])}_{(T1)}+\underbrace{(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}}_{(T2)}-\underbrace{\theta^{k}\partial\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2}}_{(T3)}, (28)

where \partial\|\cdot\|_{2} is given in (51). It can be seen that (28) consists of three parts, (T1), (T2), and (T3). Since (\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,i]=1, (28) can be expressed as

\bm{\tilde{X}}^{k+1}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\in-\sum_{j\in\mathcal{I},j\neq i}(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,j]\Big(\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:]\Big)+(T2)-(T3). (29)

The definition of \partial\|\cdot\|_{2} in (51) shows that \|\partial\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2}\|_{2}\leq 1. Then, taking the norm on both sides of (29), we obtain that for all i\in\mathcal{I},

\big\|\bm{\tilde{X}}^{k+1}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\big\|_{2}\overset{\text{(a)}}{\leq}\sum_{j\in\mathcal{I},j\neq i}\big|(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,j]\big|\cdot\big\|\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:]\big\|_{2}+\theta^{k}+\big\|(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\big\|_{2}
\overset{\text{(b)}}{\leq}\tilde{\mu}\sum_{j\in\mathcal{I},j\neq i}\big\|\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:]\big\|_{2}+\theta^{k}+\big\|\bm{W}^{k}[:,i]\big\|_{2}\big\|\bm{\tilde{Z}}\big\|_{F}, (30)

where (a) follows from the triangle inequality, and (b) follows from the choice of “good” parameters in Definition 2 together with \big\|(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\big\|_{2}\leq\big\|\bm{W}^{k}[:,i]\big\|_{2}\|\bm{\tilde{Z}}\|_{F}. It is easy to check that (27) implies \big\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\big\|_{2,1}=\big\|\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\big\|_{2,1} for all k. Therefore, based on (30), it follows that

\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}=\sum_{i\in\mathcal{I}}\|\bm{\tilde{X}}^{k+1}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\|_{2}\leq\tilde{\mu}(|\mathcal{I}|-1)\big\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\big\|_{2,1}+|\mathcal{I}|\theta^{k}+\sigma C_{W}, (31)

where C_{W}=\underset{k\geq 0}{\max}~\|\bm{W}^{k}\|_{2,1}. The inequality (31) provides a recursive relation between the consecutive errors \|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1} and \|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}. Hence, we establish the recovery error bound in the mixed norm for one sample (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma).

IV-C Error Bound for Whole Data Set

In this subsection, we give an upper bound of the recovery error over the whole set of samples. Note that |\mathcal{I}|:=|{\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural}))|\leq s for all (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma). Taking the supremum over (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma) on both sides of (31), we have

\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq(s-1)\tilde{\mu}\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}+s\theta^{k}+\sigma C_{W}. (32)

Substituting the “good” parameter \theta^{k}=\tilde{\mu}\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}+\sigma C_{W} into (32), it follows that

\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq(2\tilde{\mu}s-\tilde{\mu})^{k+1}s\beta+\frac{(s+1)C_{W}}{1+\tilde{\mu}-2\tilde{\mu}s}\sigma,

provided that 2\tilde{\mu}s-\tilde{\mu}<1. By letting c=-\log(2\tilde{\mu}s-\tilde{\mu}) and C=\frac{(s+1)C_{W}}{1+\tilde{\mu}-2\tilde{\mu}s}, we have \sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq s\beta\exp(-c(k+1))+C\sigma. The fact that \|\bm{X}\|_{F}\leq\|\bm{X}\|_{2,1} for any matrix \bm{X} gives an upper bound of the error with respect to the Frobenius norm

\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{F}\leq\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq s\beta\exp(-c(k+1))+\sigma C. (33)

As long as s\leq(1+1/\tilde{\mu})/2 and c=-\log(2\tilde{\mu}s-\tilde{\mu})>0 hold, the error bound holds uniformly for all (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma). Then, for the k-layer network, we have

\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{F}\leq s\beta\exp(-ck)+\sigma C. (34)

Therefore, we complete the proof of Theorem 2.

IV-D Convergence Rate of ALISTA-GS

In this subsection, we first give the definition of the “good” parameters to be learned in the ALISTA-GS network. We then verify the convergence of ALISTA-GS for the noiseless case. For the noisy case, the analysis can be carried out following that of Theorem 2.

Definition 3.

\bm{\Theta}=\{\theta^{k},\gamma^{k}\}_{k=0}^{\infty} are called “good” parameters for all (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,0) (noiseless case), with the weight matrix pre-computed in ALISTA-GS, if they satisfy, for each k\in\mathbb{N},

\bm{W}\in\mathcal{W}(\bm{\tilde{S}}),\quad\theta^{k}=\tilde{\mu}\gamma^{k}\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}. (35)
Theorem 3 (Convergence rate of ALISTA-GS).

Given {θk,γk}k=0\{\theta^{k},\gamma^{k}\}_{k=0}^{\infty} and 𝐖𝒲(𝐒~)\bm{W}\in\mathcal{W}(\bm{\tilde{S}}), let {𝐗~k}k=1\{\bm{\tilde{X}}^{k}\}_{k=1}^{\infty} be generated by ALISTA-GS in (15) with an input 𝐘~\bm{\tilde{Y}} observed by (3) and initial point 𝐗~0=𝟎\bm{\tilde{X}}^{0}=\bm{0}. If s(1+1/μ~)/2s\leq(1+1/\tilde{\mu})/2 for all (𝐗~,𝐙~)𝒳(β,s,0),(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,0), and

\gamma^{k}\in\Big(0,\frac{2}{1+2\tilde{\mu}s-\tilde{\mu}}\Big),\quad\theta^{k}=\tilde{\mu}\gamma^{k}\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1},\quad\forall k\in\mathbb{N}, (36)

then we have

\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{F}\leq s\beta\exp\Big(-\sum_{\tau=0}^{k}c^{\tau}\Big), (37)

where c^{\tau}=-\log\big(\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu})+|1-\gamma^{\tau}|\big) is a positive constant.

Proof.

Please refer to Appendix C. ∎

Remark 3.

Optimally, if the factor cτc^{\tau} takes the maximum at γτ=1\gamma^{\tau}=1, i.e., cτlog(2μ~sμ~)c^{\tau}\equiv-\log(2\tilde{\mu}s-\tilde{\mu}), then ALISTA-GS enjoys a linear convergence rate.

V Numerical Results

In this section, we present the simulation results for the proposed three unrolled neural network structures for grant-free massive access in IoT networks. We first introduce the settings of the simulations and performance metrics and then present the simulation results to confirm the main theorems including the weight coupling structure and the convergence rate. Finally, we compare the recovery performance of the proposed methods with other popular CS-based methods.

Refer to caption
(a) Weight 𝑾2k(𝑰𝑾1k𝑺~)𝟎\bm{W}_{2}^{k}-(\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}})\rightarrow\bm{0}.
Refer to caption
(b) The threshold θk0\theta^{k}\rightarrow 0.
Figure 4: Validation of Theorem 1. LISTA-GS has weight coupling structure.

V-A Simulation Settings and Performance Metrics

In the simulations, the channels are assumed to suffer from independent Rayleigh fading, i.e., \bm{H}\sim\mathcal{C}\mathcal{N}(\bm{0},\bm{I}). The preamble signature matrix \bm{S} is fixed with each of its columns normalized, and the noise matrix \bm{Z} follows the Gaussian distribution with zero mean and variance \sigma^{2}. In addition, each entry of the activity sequence \{a_{1},\ldots,a_{N}\} is a random variable that follows the Bernoulli distribution with mean 0.1, i.e., \mathbb{P}(a_{n}=1)=0.1 and \mathbb{P}(a_{n}=0)=0.9, \forall n\in[N]. The transmit signal-to-noise ratio (SNR) of the system is defined as

\text{SNR}=\frac{\mathbb{E}[\|\bm{SX}\|_{F}^{2}]}{\mathbb{E}[\|\bm{Z}\|_{F}^{2}]}. (38)

By transforming all these complex-valued matrices into real-valued matrices according to (5), we obtain the training data set \{\bm{\tilde{X}}_{i}^{\natural},\bm{\tilde{Y}}_{i}\}_{i=1}^{P}. We choose K=12 layers unless otherwise stated for all the unrolled models in the simulations. The training set contains P=64 different samples. The learning rate \alpha_{0} is set to 5\times 10^{-4} and \alpha_{1}=0.2\alpha_{0}. In the validation and testing stages, 128 samples are generated to test the trained models (drawn independently of the training set, but from the same distribution).
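For reference, the simulated data can be generated roughly as in the NumPy sketch below (our own code with assumed function names): a column-normalized complex Gaussian preamble, Rayleigh-fading channels, Bernoulli(0.1) activity, and noise scaled to a target transmit SNR according to (38), after which the real-valued mapping in (5) is applied.

```python
import numpy as np

def generate_sample(L=100, N=200, M=30, snr_db=15.0, p_active=0.1, rng=None):
    """Draw one (S, X, Y) triple for the JADCE model (3) under the simulation setting."""
    rng = np.random.default_rng(rng)
    S = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2)
    S /= np.linalg.norm(S, axis=0, keepdims=True)          # normalized columns
    H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    a = rng.random(N) < p_active                            # Bernoulli(0.1) activity
    X = a[:, None] * H                                      # group-row-sparse ground truth
    signal = S @ X
    Z = (rng.standard_normal((L, M)) + 1j * rng.standard_normal((L, M))) / np.sqrt(2)
    Z *= np.sqrt(np.mean(np.abs(signal) ** 2)
                 / (10 ** (snr_db / 10.0) * np.mean(np.abs(Z) ** 2)))
    return S, X, signal + Z
```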

We compare our proposed unrolled networks, i.e., LISTA-GS, LISTA-GSCP and ALISTA-GS, with the following popular CS-based algorithms for the group-sparse matrix estimation problem:

  • ISTA-GS [36]: The vanilla ISTA for solving the multiple measurement vector (MMV) CS problem. The regularization parameter \lambda is set to 0.1.

  • Nesterov’s method [44]: A first-order method that solves an equivalent smooth convex reformulation of (6). The regularization parameter \lambda is set to 0.1.

  • AMP-MMV [15, 6]: A Bayesian AMP algorithm for solving the multiple measurement vector CS problem.

  • AMP MMSE-denoiser [13]: The vector AMP algorithm with minimum mean-squared error (MMSE) denoiser for user activity detection and channel estimation.

We adopt the normalized mean square error (NMSE) to evaluate the performance in recovering the real-valued 𝑿~\bm{\tilde{X}}, defined as

\text{NMSE}(\bm{\tilde{X}},\bm{\tilde{X}}^{\natural})=10\log_{10}\left(\frac{\mathbb{E}\|\bm{\tilde{X}}-\bm{\tilde{X}}^{\natural}\|_{F}^{2}}{\mathbb{E}\|\bm{\tilde{X}}^{\natural}\|_{F}^{2}}\right), (39)

where 𝑿~\bm{\tilde{X}}^{\natural} represents the ground truth and 𝑿~\bm{\tilde{X}} is the estimated value.
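The NMSE in (39) can be evaluated with a short helper such as the one below, where the empirical average over a batch of test samples plays the role of the expectation.

```python
import numpy as np

def nmse_db(X_hat, X_true):
    """Empirical NMSE (39) in dB; a leading batch dimension acts as the expectation."""
    num = np.sum(np.abs(X_hat - X_true) ** 2)
    den = np.sum(np.abs(X_true) ** 2)
    return 10.0 * np.log10(num / den)
```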

V-B Validation of Theorems

Refer to caption
Figure 5: NMSE versus iterations with SNR = 15 dB.
Refer to caption
Figure 6: NMSE of unrolled networks and other methods when condition number \kappa=2 and SNR = 15 dB.
Refer to caption
Figure 7: NMSE of unrolled networks and other methods when condition number \kappa=15 and SNR = 15 dB. AMP-MMV and AMP MMSE-denoiser fail in this case.
Refer to caption
Figure 8: NMSE versus SNR when the preamble signature \bm{S} is a complex Gaussian matrix in the noisy case.

In this subsection, we conduct simulations to validate the developed Theorems. We generate the preamble signature matrix according to the complex Gaussian distribution unless otherwise stated, i.e., \bm{S}\sim\mathcal{C}\mathcal{N}(\bm{0},\bm{I}). We set the length of the signature sequence, the total number of devices, and the number of antennas at the BS, i.e., L, N and M, to 100, 200, and 30, respectively.

Validation of Theorem 1. Theorem 1 is corroborated by the empirical results shown in Figs. 4(a) and 4(b). In Fig. 4, the values of \|\bm{W}_{2}^{k}-(\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}})\|_{2} and \theta^{k} in LISTA-GS are reported. We observe that as k increases, the values of \|\bm{W}_{2}^{k}-(\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}})\|_{2} and \theta^{k} approach 0. The simulations clearly validate Theorem 1: \bm{W}_{2}^{k}-(\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}})\rightarrow\bm{0} and \theta^{k}\rightarrow 0 as k\rightarrow\infty.

Validations of Theorems 2 and 3. We examine the convergence rates of the proposed unrolled networks. Fig. 5 shows the NMSE of the proposed unrolled networks and other methods over the iterations in a noisy scenario with SNR = 15 dB. It shows that the proposed networks (almost) converge linearly, which validates Theorems 2 and 3. Furthermore, LISTA-GS converges fastest, but has the largest number of trainable parameters. Besides, we observe that the proposed unrolled networks outperform the other baseline algorithms in terms of NMSE.

TABLE II: Training time and estimation performance of the proposed unrolled networks.
Networks              LISTA-GS    LISTA-GSCP    ALISTA-GS
Training time (min)   160.13      105.31        87.09
NMSE (dB)             -24.08      -23.38        -23.30
TABLE III: Average running time per sample of various methods with 1212 iterations (layers).
Method      LISTA-GS    LISTA-GSCP    ALISTA-GS    ISTA-GS    Nesterov's method    AMP-MMV    AMP MMSE-denoiser
Time (s)    0.0078      0.0076        0.0076       0.0079     0.0087               0.2089     0.0731
Figure 9: NMSE versus number of devices.
Figure 10: Device activity detection error probability versus SNR.
Figure 11: NMSE versus SNR when preamble signature 𝑺\bm{S} is a binary matrix in the noisy case. AMP-MMV and AMP MMSE-denoiser fail in this case.
Figure 12: NMSE versus SNR when preamble signature 𝑺\bm{S} is a Zadoff-Chu sequence matrix in the noisy case. AMP-MMV and AMP MMSE-denoiser fail in this case.

Training Time Comparison. We compare the training time of the proposed unrolled networks. We train all the networks with 1212 layers and 6464 training samples in the noisy scenario with SNR = 1515 dB on a GeForce GTX 1080 GPU. Table II shows that ALISTA-GS is not only the fastest to train but also performs almost as well as LISTA-GS and LISTA-GSCP. With fewer trainable parameters, the training process becomes faster.

Computation Time Comparison. We compare the running time of the test stage for different methods. Once the proposed unrolled networks are trained, we can use them to recover many new samples of different channels. We run all the methods for 1212 iterations on an Intel(R) Core(TM) i7-8650U CPU @ 2.11 GHz and average over 100100 test samples, as shown in Table III. The running time of the proposed unrolled networks is very close to that of ISTA-GS and Nesterov’s method, which validates the time complexity analysis in Section III-C. In addition, the proposed unrolled networks run faster than the AMP-based algorithms. This benefits from the low computational complexity per iteration of ISTA-GS.

Convergence Performance with Ill-Conditioned Preamble Signature. We now consider the case in which the preamble signature matrix 𝑺\bm{S} is ill-conditioned, i.e., has a large condition number κ\kappa. We consider fixed signature matrices 𝑺\bm{S} with condition numbers κ=2\kappa=2 and 1515. To obtain a signature matrix 𝑺\bm{S} with condition number κ=2\kappa=2 or 1515, we first sample a matrix 𝑨L×N\bm{A}\in\mathbb{C}^{L\times N}, i.e., 𝑨𝒞𝒩(𝟎,𝑰)\bm{A}\sim\mathcal{C}\mathcal{N}(\bm{0},\bm{I}). Then, we compute the singular value decomposition 𝑨=𝑼𝚺𝑽\bm{A}=\bm{U\varSigma V^{\natural}} and replace 𝚺\bm{\varSigma} with a new singular value matrix whose condition number equals the desired κ\kappa.
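The following sketch illustrates this construction; the geometric spacing of the replacement singular values is an assumption, since only the target condition number is specified above.

```python
import numpy as np

def ill_conditioned_signature(L=100, N=200, kappa=15.0, seed=None):
    """Build a preamble signature S with prescribed condition number kappa,
    following the SVD-based procedure described above. The geometric spacing of
    the new singular values is an assumption; any spectrum with max/min = kappa works."""
    rng = np.random.default_rng(seed)
    A = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2)
    U, _, Vh = np.linalg.svd(A, full_matrices=False)
    sigma = np.geomspace(kappa, 1.0, num=min(L, N))   # sigma_max / sigma_min = kappa
    return U @ np.diag(sigma) @ Vh
```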

Fig. 6 and Fig. 7 show the NMSE when the signature matrix 𝑺\bm{S} has condition number κ=2\kappa=2 and κ=15\kappa=15, respectively, with SNR =15=15 dB. The baselines AMP-MMV and AMP MMSE-denoiser fail when κ=15\kappa=15 in the simulations. Comparing the cases κ=2\kappa=2 and κ=15\kappa=15, the proposed unrolled networks remain stable, whereas the outputs of ISTA-GS and Nesterov’s method diverge when κ=15\kappa=15. These results show that the proposed unrolled networks are robust to ill-conditioned preamble signature matrices.

V-C Performance of Proposed Unrolled Neural Networks

In this subsection, we evaluate the performance in terms of NMSE and compare the proposed methods with other classic CS-based methods. We vary the SNR from 0 to 1010 dB, under which all the algorithms under consideration reach a stable solution in the simulations, and we set K=12K=12. A smaller NMSE value indicates better JADCE recovery performance. In these cases, we set L,NL,N and MM to 9090, 300300, and 100100, respectively, which is suitable for grant-free massive access.

Gaussian Matrix as Preamble Signature. We conduct simulations for the case in which the preamble signature matrix 𝑺\bm{S} is a complex Gaussian matrix, i.e., 𝑺𝒞𝒩(𝟎,𝑰)\bm{S}\sim\mathcal{C}\mathcal{N}(\bm{0},\bm{I}). Fig. 8 shows the impact of the SNR on the NMSE of the proposed unrolled networks and the baseline algorithms. First, the NMSE decreases as the SNR grows, which implies that the JADCE performance improves as the SNR increases. Second, the proposed network structures achieve a much lower NMSE than the other CS-based methods for different values of SNR. In addition, LISTA-GS outperforms LISTA-GSCP and ALISTA-GS, which indicates that more trainable parameters lead to better NMSE performance.

Fig. 9 shows the NMSE versus the number of devices for the proposed unrolled networks and the baseline algorithms with L=90L=90 and SNR = 1515 dB. The NMSE increases as the number of devices increases for all the methods, while the proposed unrolled networks outperform the other baseline algorithms. Fig. 10 shows that the proposed unrolled networks achieve a lower device activity detection error probability than the other baseline methods, which further verifies the superiority of the proposed unrolled networks.

Binary Matrix as Preamble Signature. We conduct simulations for the case in which the preamble sequence matrix 𝑺\bm{S} is a binary matrix, i.e., 𝑺{±1}L×N\bm{S}\in\{\pm 1\}^{L\times N}. Each entry of 𝑺\bm{S} is selected uniformly at random from {±1}\{\pm 1\}, which is closely related to Code Division Multiple Access (CDMA) communication systems [41]. Fig. 11 shows the NMSE performance of the proposed unrolled networks and the baseline methods. The proposed networks remain stable, whereas AMP-MMV and AMP MMSE-denoiser fail to solve the JADCE problem when the preamble sequence matrix 𝑺\bm{S} is binary. The proposed unrolled networks outperform the baseline algorithms by more than 33 dB in NMSE for different values of SNR, which demonstrates the robustness of the proposed methods in the non-Gaussian preamble case.
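A minimal sketch of this binary preamble construction is given below; whether any column normalization is applied afterwards is an assumption left unspecified here.

```python
import numpy as np

# Binary (CDMA-like) preamble signature: entries drawn i.i.d. and uniformly from {+1, -1}.
rng = np.random.default_rng(0)
L, N = 90, 300
S_binary = rng.choice([-1.0, 1.0], size=(L, N))
```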

Zadoff-Chu Sequences as Preamble Signature. In this case, we carry out simulations when the preamble signature 𝑺\bm{S} is composed of Zadoff-Chu sequences [8], which are widely used in practical systems, e.g., 4G LTE. Fig. 12 shows the NMSE performance of the proposed unrolled networks and the other baseline CS-based methods. The proposed networks remain stable, whereas the AMP-based algorithms fail to solve the JADCE problem when 𝑺\bm{S} is a Zadoff-Chu sequence matrix. We observe that the proposed unrolled networks achieve a much better NMSE performance than the baseline methods, and the gap widens as the SNR increases. Among the proposed structures, LISTA-GS achieves the best performance, while LISTA-GSCP performs similarly to ALISTA-GS. This result further demonstrates the robustness of the proposed methods in practical settings.
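For illustration, one possible Zadoff-Chu preamble construction is sketched below; the choice of an odd sequence length and the mapping of devices to root/cyclic-shift pairs are assumptions, since the exact construction used in the simulations is not detailed in this section.

```python
import numpy as np

def zadoff_chu(root, length):
    """Zadoff-Chu sequence of odd `length` with root index `root`, gcd(root, length) = 1."""
    n = np.arange(length)
    return np.exp(-1j * np.pi * root * n * (n + 1) / length)

def zc_signature(L=89, N=300):
    """Assemble an L x N preamble matrix from Zadoff-Chu sequences.

    ZC sequences are commonly defined for odd (often prime) lengths, so an odd L
    close to the simulated value is used here; assigning one (root, cyclic shift)
    pair per device is likewise an assumption about the mapping to columns.
    """
    roots = [u for u in range(1, L) if np.gcd(u, L) == 1]
    cols = []
    for dev in range(N):
        u = roots[dev % len(roots)]
        shift = dev // len(roots)
        cols.append(np.roll(zadoff_chu(u, L), shift))
    return np.stack(cols, axis=1)
```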

In summary, the simulations demonstrate the effectiveness of the proposed unrolled networks in the following aspects:

  • The proposed unrolled networks converge faster than the robust algorithms such as ISTA-GS and Nesterov’s method, yielding much lower computational complexity.

  • Simulations of different preambles show that the proposed unrolled networks are more robust compared with the computationally efficient algorithms such as AMP-based algorithms.

  • The proposed unrolled networks achieve better JADCE performance compared with the baseline CS-based algorithms.

VI Conclusions

In this paper, we proposed a novel unrolled deep neural network framework that enjoys a linear convergence rate, low computational complexity and high robustness for solving the JADCE problem in grant-free massive access for IoT networks. We introduced the first unrolled network structure by mapping the iterative algorithm ISTA-GS to an unrolled RNN, thereby improving the convergence rate through end-to-end training. To make the training procedure more efficient and to tackle the interpretability issue of deep learning, we proposed two simplified unrolled network structures with fewer trainable parameters and proved the linear convergence rate of these methods. Extensive simulations were conducted to verify the effectiveness of the proposed unrolled networks in terms of the convergence rate, robustness and estimation accuracy for the JADCE problem.

Appendix A Proof of theorem 1

Proof.

We first prove that the threshold value θk\theta^{k} converges to zero, and then show that the weights {𝑾1k,𝑾2k}k=0\{\bm{W}_{1}^{k},\bm{W}_{2}^{k}\}_{k=0}^{\infty} in LISTA-GS satisfy the weight coupling property.

(1) We verify that θk0\theta^{k}\to 0 as kk\to\infty in (11). We define a subset of 𝒳(β,s,σ)\mathcal{X}(\beta,s,\sigma) for a given 0<β~β0<\tilde{\beta}\leq\beta as

𝒳~(β,β~,s,σ):={(𝑿~,𝒁~)𝒳(β,s,σ)||supp(ψ(𝑿~))|s,\displaystyle\tilde{\mathcal{X}}(\beta,\tilde{\beta},s,\sigma):=\big{\{}(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma)|~{}|{\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural}))|\leq s,
β~𝑿~[i,:]2β,isupp(ψ(𝑿~))}.\displaystyle\tilde{\beta}\leq\big{\|}\bm{\tilde{X}}^{\natural}[i,:]\big{\|}_{2}\leq\beta,\forall i\in{\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural}))\big{\}}.

Clearly, 𝒳~(β,β~,s,0)𝒳(β,s,0)\tilde{\mathcal{X}}(\beta,\tilde{\beta},s,0)\subset\mathcal{X}(\beta,s,0). Since 𝑿~k𝑿~\bm{\tilde{X}}^{k}\rightarrow\bm{\tilde{X}}^{\natural} uniformly for all (𝑿~,𝟎)𝒳(β,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\mathcal{X}(\beta,s,0), the convergence is also uniform on 𝒳~(β,β~,s,0)\tilde{\mathcal{X}}(\beta,\tilde{\beta},s,0) for any β~β\tilde{\beta}\leq\beta. Then, there exists K1K_{1}\in\mathbb{N} such that for all (𝑿~,𝟎)𝒳~(β,β/10,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta,\beta/10,s,0) and all kK1k\geq K_{1},

|𝑿~k[i,:]2𝑿~[i,:]2|<β10i[2N].\Big{|}\big{\|}\bm{\tilde{X}}^{k}[i,:]\big{\|}_{2}-\big{\|}\bm{\tilde{X}}^{\natural}[i,:]\big{\|}_{2}\Big{|}<\frac{\beta}{10}\quad\forall i\in[2N].

Then we have sign(ψ(𝑿~k))=sign(ψ(𝑿~)),kK1.\operatorname{sign}(\psi(\bm{\tilde{X}}^{k}))=\operatorname{sign}(\psi(\bm{\tilde{X}}^{\natural})),\forall k\geq K_{1}. Recall the recurrence relation 𝑿~k+1=ηθk(𝑾1k𝒀~+𝑾2k𝑿~k)\bm{\tilde{X}}^{k+1}=\eta_{\theta^{k}}(\bm{W}_{1}^{k}\bm{\tilde{Y}}+\bm{W}_{2}^{k}\bm{\tilde{X}}^{k}) and let =supp(ψ(𝑿~))\mathcal{I}={\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural})). From the uniform convergence of 𝑿~k\bm{\tilde{X}}^{k}, it follows that for any kK1k\geq K_{1} and (𝑿~,𝟎)𝒳~(β,β/10,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta,\beta/10,s,0), we have

𝑿~k+1[,:]=𝑸~kθkdiag(ψ(𝑸~k))1𝑸~k,\displaystyle\bm{\tilde{X}}^{k+1}[\mathcal{I},:]=\bm{\tilde{Q}}^{k}-\theta^{k}{\rm{diag}}(\psi(\bm{\tilde{Q}}^{k}))^{-1}\bm{\tilde{Q}}^{k}, (40)

where 𝑸~k=𝑾2k[,]𝑿~k[,:]+𝑾1k[,:]𝒀~\bm{\tilde{Q}}^{k}=\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]\bm{\tilde{X}}^{k}[\mathcal{I},:]+\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{Y}}.
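To make the row-wise action in (40) concrete, the short numerical sketch below checks that, on rows whose 2\ell_{2} norm exceeds θk\theta^{k}, the group soft-thresholding operator (assumed here to be the standard form of ηθ\eta_{\theta}) reduces to the expression in (40).

```python
import numpy as np

def msto(Q, theta):
    """Row-wise multidimensional shrinkage-thresholding operator: shrink each row
    of Q toward zero by theta in l2 norm (standard group soft-thresholding,
    assumed to match the eta_theta used in the paper)."""
    norms = np.linalg.norm(Q, axis=1, keepdims=True)
    return np.maximum(1.0 - theta / np.maximum(norms, 1e-12), 0.0) * Q

# Numerical check of (40): on rows whose l2 norm exceeds theta, eta_theta acts as
# Q - theta * diag(psi(Q))^{-1} Q, i.e., Q[i, :] * (1 - theta / ||Q[i, :]||_2).
rng = np.random.default_rng(0)
theta = 0.1
Q = rng.standard_normal((5, 4)) + 2.0        # rows with norm well above theta
rhs = Q - theta * Q / np.linalg.norm(Q, axis=1, keepdims=True)
assert np.allclose(msto(Q, theta), rhs)
```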

Also, the uniform convergence of 𝑿~k\bm{\tilde{X}}^{k} implies that for any ε>0\varepsilon>0 and (𝑿~,𝟎)𝒳~(β,β/10,s,0),(\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta,\beta/10,s,0), there exists K2K_{2}\in\mathbb{N} such that if kK2k\geq K_{2}, then 𝑿~k[,:]𝑿~[,:]F<ε\|\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\|_{F}<\varepsilon. Denote 𝑬~k=𝑿~k[,:]𝑿~[,:]||×M\bm{\tilde{E}}^{k}=\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\in\mathbb{R}^{|\mathcal{I}|\times M} for each kk\in\mathbb{N}. That is, 𝑬~kF<ε\|\bm{\tilde{E}}^{k}\|_{F}<\varepsilon for all kK2k\geq K_{2}.

Since the noise is assumed to be zero, we have 𝒀~=𝑺~𝑿~\bm{\tilde{Y}}=\bm{\tilde{S}}\bm{\tilde{X}}^{\natural}. Then from (40) we obtain

θkdiag(ψ(𝑸~k))1𝑸~k[:,j]22=ξjk+\displaystyle\|\theta^{k}{\rm{diag}}(\psi(\bm{\tilde{Q}}^{k}))^{-1}\bm{\tilde{Q}}^{k}[:,{j}]\|_{2}^{2}=\xi_{j}^{k}+
(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,j]22,\displaystyle\big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},{j}]\big{\|}_{2}^{2}, (41)

where ξjk=(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,j]+𝑾2k[,]𝑬~k[:,j]𝑬~k+1[:,j]22(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,j]22.\xi_{j}^{k}=\big{\|}-\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},{j}]+\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]\bm{\tilde{E}}^{k}[:,j]-\bm{\tilde{E}}^{k+1}[:,j]\big{\|}_{2}^{2}-\big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},{j}]\big{\|}_{2}^{2}.

We now show that |ξjk||\xi_{j}^{k}| is sufficiently small for kK2k\geq K_{2}. We introduce the notation

𝒑jk=(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,j],\displaystyle\bm{p}_{j}^{k}=(\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},{j}],\quad
𝒒jk=𝑾2k[,]𝑬~k[:,j]𝑬~k+1[:,j].\displaystyle\bm{q}_{j}^{k}=\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]\bm{\tilde{E}}^{k}[:,j]-\bm{\tilde{E}}^{k+1}[:,j].

The value ξjk\xi_{j}^{k} can be rewritten as 𝒒jk222𝒑jk,𝒒jk,\|\bm{q}_{j}^{k}\|_{2}^{2}-2\langle\bm{p}_{j}^{k},\bm{q}_{j}^{k}\rangle, where ,\langle\cdot,\cdot\rangle is the inner product in ||\mathbb{R}^{|\mathcal{I}|}. Then the Cauchy–Schwarz inequality implies |ξjk|(2𝒑jk2+1)𝒒jk2|\xi_{j}^{k}|\leq(2\|\bm{p}_{j}^{k}\|_{2}+1)\|\bm{q}_{j}^{k}\|_{2}, provided that 𝒒jk21\|\bm{q}_{j}^{k}\|_{2}\leq 1, which holds for sufficiently small ε\varepsilon by (42) below.

From the fact that 𝑨2𝑨F\|\bm{A}\|_{2}\leq\|\bm{A}\|_{F} for any matrix 𝑨\bm{A} and the triangle inequality it follows that if kK2k\geq K_{2}, then

𝒒jk2\displaystyle\|\bm{q}_{j}^{k}\|_{2} 𝑾2k[,]2𝑬~k[:,j]2+𝑬~k+1[:,j]2\displaystyle\leq\|\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]\|_{2}\|\bm{\tilde{E}}^{k}[:,j]\|_{2}+\|\bm{\tilde{E}}^{k+1}[:,j]\|_{2}
(CW2+1)ε.\displaystyle\leq(C_{W_{2}}+1)\varepsilon. (42)

Note that

𝒑jk22𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,]22𝑿~[,j]22.\displaystyle\|\bm{p}_{j}^{k}\|_{2}^{2}\leq\big{\|}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{\|}_{2}^{2}\big{\|}\bm{\tilde{X}}^{\natural}[\mathcal{I},{j}]\big{\|}_{2}^{2}. (43)

Then it follows that

j=1M𝒑jk22\displaystyle\sum_{j=1}^{M}\|\bm{p}_{j}^{k}\|_{2}^{2} (1+𝑾2kF+𝑾1kF𝑺~F)2𝑿~F2\displaystyle\leq\Big{(}1+\big{\|}\bm{W}_{2}^{k}\big{\|}_{F}+\big{\|}\bm{W}_{1}^{k}\big{\|}_{F}\big{\|}\bm{\tilde{S}}\big{\|}_{F}\Big{)}^{2}\big{\|}\bm{\tilde{X}}^{\natural}\big{\|}_{F}^{2}
(1+CW2+CW1𝑺~F)2𝑿~F2.\displaystyle\leq\Big{(}1+C_{W_{2}}+C_{W_{1}}\big{\|}\bm{\tilde{S}}\big{\|}_{F}\Big{)}^{2}\big{\|}\bm{\tilde{X}}^{\natural}\big{\|}_{F}^{2}. (44)

By Cauchy-Schwarz inequality, it holds that

j=1M𝒑jk2\displaystyle\sum_{j=1}^{M}\|\bm{p}_{j}^{k}\|_{2} Mj=1M𝒑jk22\displaystyle\leq\sqrt{M}\sqrt{\sum_{j=1}^{M}\|\bm{p}_{j}^{k}\|_{2}^{2}}
M(1+CW2+CW1𝑺~F)𝑿~F.\displaystyle\leq\sqrt{M}\Big{(}1+C_{W_{2}}+C_{W_{1}}\big{\|}\bm{\tilde{S}}\big{\|}_{F}\Big{)}\big{\|}\bm{\tilde{X}}^{\natural}\big{\|}_{F}. (45)

Thus, by (42) and (45)

j=1M|ξjk|\displaystyle\sum_{j=1}^{M}\big{|}\xi_{j}^{k}\big{|} j=1M(2𝒑jk2+1)𝒒jk2\displaystyle\leq\sum_{j=1}^{M}(2\|\bm{p}_{j}^{k}\|_{2}+1)\|\bm{q}_{j}^{k}\|_{2}
(CW2+1)εj=1M(2𝒑jk2+1)\displaystyle\leq(C_{W_{2}}+1)\varepsilon\sum_{j=1}^{M}(2\|\bm{p}_{j}^{k}\|_{2}+1)
(CW2+1)(2C+M)ε,\displaystyle\leq(C_{W_{2}}+1)(2C+M)\varepsilon, (46)

where C=M(1+CW2+CW1𝑺~F)𝑿~FC=\sqrt{M}\Big{(}1+C_{W_{2}}+C_{W_{1}}\big{\|}\bm{\tilde{S}}\big{\|}_{F}\Big{)}\big{\|}\bm{\tilde{X}}^{\natural}\big{\|}_{F}. Then, by (41) we have

(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,:]F2\displaystyle\big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\big{\|}_{F}^{2}
=j=1Mξjk+|θk|2i=1||1𝑸~k[i,:]2𝑸~k[i,:]22\displaystyle=-\sum_{j=1}^{M}\xi_{j}^{k}+\big{|}\theta^{k}\big{|}^{2}\sum_{i=1}^{|\mathcal{I}|}\Big{\|}\frac{1}{\|\bm{\tilde{Q}}^{k}[i,:]\|_{2}}\bm{\tilde{Q}}^{k}[i,:]\Big{\|}_{2}^{2}

Therefore, since each normalized row 𝑸~k[i,:]/𝑸~k[i,:]2\bm{\tilde{Q}}^{k}[i,:]/\|\bm{\tilde{Q}}^{k}[i,:]\|_{2} has unit 2\ell_{2} norm, the sum over ii equals |||\mathcal{I}| and we get

(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,:]F2\displaystyle\big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\big{\|}_{F}^{2}
=j=1Mξjk+|θk|2||.\displaystyle=-\sum_{j=1}^{M}\xi_{j}^{k}+|\theta^{k}|^{2}|\mathcal{I}|. (47)

For any (𝑿~,𝟎)𝒳~(β/2,β/10,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta/2,\beta/10,s,0), it holds that (2𝑿~,𝟎)𝒳~(β,β/10,s,0)(2\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta,\beta/10,s,0). Thus, the above argument applies to (2𝑿~,𝟎)(2\bm{\tilde{X}}^{\natural},\bm{0}) whenever (𝑿~,𝟎)𝒳~(β/2,β/10,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta/2,\beta/10,s,0). Substituting 𝑿~\bm{\tilde{X}}^{\natural} with 2𝑿~2\bm{\tilde{X}}^{\natural} in (47), we have

4(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,:]F2\displaystyle 4\big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\big{\|}_{F}^{2}
=j=1Mξ^jk+|θk|2||.\displaystyle=-\sum_{j=1}^{M}\hat{\xi}_{j}^{k}+|\theta^{k}|^{2}|\mathcal{I}|. (48)

From (47) and (48), one can get 3|θk|2||=4j=1Mξjkj=1Mξ^jk.3|\theta^{k}|^{2}|\mathcal{I}|=4\sum_{j=1}^{M}{\xi}_{j}^{k}-\sum_{j=1}^{M}\hat{\xi}_{j}^{k}. Taking the absolute value of both sides and applying (46), we obtain, for all kmax{K1,K2}k\geq\max\{K_{1},K_{2}\},

|θk|25(CW2+1)(2C+M)3||ε.\displaystyle|\theta^{k}|^{2}\leq\frac{5(C_{W_{2}}+1)(2C+M)}{3|\mathcal{I}|}\varepsilon. (49)

Moreover, since θk\theta^{k} is the MSTO threshold parameter, we have θk0\theta^{k}\geq 0. Therefore, θk0\theta^{k}\rightarrow 0 as kk\rightarrow\infty.

(2) We prove that 𝑰𝑾2k𝑾1k𝑺~𝟎\bm{I}-\bm{W}_{2}^{k}-\bm{W}_{1}^{k}\bm{\tilde{S}}\rightarrow\bm{0} as kk\rightarrow\infty. LISTA-GS model (9) gives

𝑿~k+1[,:]=ηθk(𝑾1k[,:]𝑺~𝑿~+𝑾2k[,:]𝑿~k)\displaystyle\bm{\tilde{X}}^{k+1}[\mathcal{I},:]=\eta_{\theta^{k}}\big{(}\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}\bm{\tilde{X}}^{\natural}+\bm{W}_{2}^{k}[\mathcal{I},:]\bm{\tilde{X}}^{k}\big{)}
𝑾1k[,:]𝑺~𝑿~+𝑾2k[,:]𝑿~kθk2,1(𝑿~k+1[,:]),\displaystyle\in\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}\bm{\tilde{X}}^{\natural}+\bm{W}_{2}^{k}[\mathcal{I},:]\bm{\tilde{X}}^{k}-\theta^{k}\partial\ell_{2,1}(\bm{\tilde{X}}^{k+1}[\mathcal{I},:]),

where 2,1(𝑿~)\partial\ell_{2,1}(\bm{\tilde{X}}) is the sub-gradient of 𝑿~2,1\|\bm{\tilde{X}}\|_{2,1}. It is a set defined for each row as follows:

2,1(𝑿)=[𝑿[1,:]2,𝑿[2,:]2,,𝑿[2N,:]2]T,\displaystyle\partial\ell_{2,1}(\bm{X})=\begin{bmatrix}\partial\|\bm{X}[1,:]\|_{2},\partial\|\bm{X}[2,:]\|_{2},\cdots,\partial\|\bm{X}[2N,:]\|_{2}\end{bmatrix}^{T}, (50)

where

𝑿[n,:]2={{𝑿[n,:]𝑿[n,:]2}if 𝑿[n,:]𝟎,{𝒉M|𝒉21}if 𝑿[n,:]=𝟎.\displaystyle\partial\|\bm{X}[n,:]\|_{2}=\begin{cases}\big{\{}\frac{\bm{X}[n,:]}{\|\bm{X}[n,:]\|_{2}}\big{\}}\quad\text{if }\bm{X}[n,:]\neq\bm{0},\\ \{\bm{h}\in\mathbb{R}^{M}|~{}\|\bm{h}\|_{2}\leq 1\}\quad\text{if }\bm{X}[n,:]=\bm{0}.\end{cases} (51)

If kmax{K1,K2}k\geq\max\{K_{1},K_{2}\} then from (49) and (46)

(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,j]2\displaystyle\Big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},j]\Big{\|}_{2}
((CW2+1)(2C+M)+5(CW2+1)(2C+M)3||)ε.\displaystyle\leq\Big{(}(C_{W_{2}}+1)(2C+M)+\frac{5(C_{W_{2}}+1)(2C+M)}{3|\mathcal{I}|}\Big{)}\varepsilon.

From the definition of operator norm we have that

σmax(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])\displaystyle\sigma_{\max}\Big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\Big{)}
1β((CW2+1)(2C+M)+5(CW2+1)(2C+M)3||)ε.\displaystyle\leq\frac{1}{\beta}\Big{(}(C_{W_{2}}+1)(2C+M)+\frac{5(C_{W_{2}}+1)(2C+M)}{3|\mathcal{I}|}\Big{)}\varepsilon.

Thus, for s2s\geq 2, 𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,]𝟎\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\rightarrow\bm{0} uniformly for all \mathcal{I} with 2||s2\leq|\mathcal{I}|\leq s. Therefore, 𝑰𝑾2k𝑾1k𝑺~𝟎\bm{I}-\bm{W}_{2}^{k}-\bm{W}_{1}^{k}\bm{\tilde{S}}\rightarrow\bm{0} as kk\rightarrow\infty. ∎

Appendix B Proof of Lemma 1

We will show that the no-false-positive property holds for LISTA-GSCP.

Proof.

Let (𝑿~,𝒁~)𝒳(β,s,σ)(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma) and =supp(ψ(𝑿~))\mathcal{I}={\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural})). We prove that if (23) holds, then 𝑿~k[i,:]=𝟎,i,k.\bm{\tilde{X}}^{k}[i,:]=\bm{0},\forall i\notin\mathcal{I},\forall k.

(i) When k=0k=0, it is satisfied since 𝑿~0=𝟎\bm{\tilde{X}}^{0}=\bm{0}.

(ii) Suppose that 𝑿~k[i,:]=𝟎\bm{\tilde{X}}^{k}[i,:]=\bm{0} for all ii\notin\mathcal{I}. Then by (13) it follows that

𝑿~k+1[i,:]=ηθk(𝑿~k[i,:](𝑾k[:,i])T𝑺~(𝑿~k𝑿~)\displaystyle\bm{\tilde{X}}^{k+1}[i,:]=\eta_{\theta^{k}}\Big{(}\bm{\tilde{X}}^{k}[i,:]-(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}(\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural})
+(𝑾k[:,i])T𝒁~)\displaystyle+(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\Big{)}

for all ii\notin\mathcal{I}. Then by the triangle inequality it follows that

(𝑾k[:,i])T𝑺~(𝑿~k𝑿~)+(𝑾k[:,i])T𝒁~2\displaystyle\big{\|}-(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}(\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural})+(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\big{\|}_{2}
j|(𝑾k[:,i])T𝑺~[:,j]|𝑿~k[j,:]𝑿~[j,:]2+(𝑾k[:,i])T𝒁~2\displaystyle{\leq}\sum_{j\in\mathcal{I}}\big{|}(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,j]\big{|}\big{\|}\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:]\big{\|}_{2}+\big{\|}(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\big{\|}_{2}
μ~𝑿~k𝑿~2,1+CW𝒁~F.\displaystyle{\leq}\tilde{\mu}\big{\|}\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\big{\|}_{2,1}+C_{W}\|\bm{\tilde{Z}}\|_{F}.

Since θk=μ~sup𝑿~,𝒁~{𝑿~k𝑿~2,1}+CWσ\theta^{k}=\tilde{\mu}\sup_{\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}}}\{\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}\}+C_{W}\sigma and 𝑾k𝒲(𝑺~)\bm{W}^{k}\in\mathcal{W}(\bm{\tilde{S}}), it holds for all ii\notin\mathcal{I} that

θk(𝑾k[:,i])T𝑺~(𝑿~k𝑿~)+(𝑾k[:,i])T𝒁~2,\displaystyle\theta^{k}\geq\big{\|}-(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}(\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural})+(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\big{\|}_{2},

which implies 𝑿~k+1[i,:]2=0,i\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2}=0,\forall i\notin\mathcal{I} by the definition of ηθk.\eta_{\theta^{k}}. Therefore, by induction, we have 𝑿~k[i,:]2=0,i,k.\|\bm{\tilde{X}}^{k}[i,:]\|_{2}=0,\forall i\notin\mathcal{I},\forall k.

Appendix C Proof of Theorem 3

Proof.

Similar to the proof of Theorem 2, we first show that there are no false positives in 𝑿~k\bm{\tilde{X}}^{k}. Take an arbitrary (𝑿~,𝒁~)𝒳(β,s,0)(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,0) and let =supp(ψ(𝑿~))\mathcal{I}={\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural})). We prove the no-false-positive property by induction.

(i) When k=0k=0, it is satisfied since 𝑿~0=𝟎\bm{\tilde{X}}^{0}=\bm{0}.

(ii) Fix kk and suppose that 𝑿~k[i,:]=𝟎\bm{\tilde{X}}^{k}[i,:]=\bm{0} for all ii\notin\mathcal{I}. Then we have

𝑿~k+1[i,:]\displaystyle\bm{\tilde{X}}^{k+1}[i,:] =ηθk(𝑿~k[i,:]+γk(𝑾[:,i])T(𝒀~𝑺~𝑿~k))\displaystyle=\eta_{\theta^{k}}\big{(}\bm{\tilde{X}}^{k}[i,:]+\gamma^{k}(\bm{W}[:,i])^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k})\big{)}
=ηθk(γk(𝑾[:,i])T𝑺~(𝑿~k𝑿~))\displaystyle=\eta_{\theta^{k}}\big{(}-\gamma^{k}(\bm{W}[:,i])^{T}\bm{\tilde{S}}(\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural})\big{)}

for all ii\notin\mathcal{I}. As the thresholds are taken as θk=μ~γksup𝑿~𝑿~k𝑿~2,1\theta^{k}=\tilde{\mu}\gamma^{k}\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1} and 𝑾𝒲(𝑺~)\bm{W}\in\mathcal{W}(\bm{\tilde{S}}), it holds that

θkμ~γk𝑿~k𝑿~2,1γk(𝑾[:,i])T𝑺~(𝑿~k𝑿~)2,\displaystyle\theta^{k}\geq\tilde{\mu}\gamma^{k}\big{\|}\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\big{\|}_{2,1}\geq\big{\|}-\gamma^{k}(\bm{W}[:,i])^{T}\bm{\tilde{S}}(\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural})\big{\|}_{2},
i,\displaystyle\forall i\notin\mathcal{I},

which implies 𝑿~k+1[i,:]2=0,i\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2}=0,\forall i\notin\mathcal{I} by the definition. Thus, 𝑿~k[i,:]=𝟎,i,k.\bm{\tilde{X}}^{k}[i,:]=\bm{0},\forall i\notin\mathcal{I},\forall k.

In the next step, we consider the components on \mathcal{I}. For all ii\in\mathcal{I}, we have

𝑿~k+1[i,:]𝑿~k[i,:]γk(𝑾[:,i])T𝑺~[:,](𝑿~k[,:]𝑿~[,:])(T1)\displaystyle\bm{\tilde{X}}^{k+1}[i,:]\in\underbrace{\bm{\tilde{X}}^{k}[i,:]-\gamma^{k}(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,\mathcal{I}]\big{(}\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\big{)}}_{(T1)}
θk𝑿~k+1[i,:]2,\displaystyle-\theta^{k}\partial\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2}, (52)

where 𝑿~k+1[i,:]2\partial\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2} is defined in (51). Since we choose 𝑾𝒲(𝑺~)\bm{W}\in\mathcal{W}(\bm{\tilde{S}}), we have (𝑾[:,i])T𝑺~[:,i]=1(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,i]=1, and hence (T1)(T1) can be expressed as

𝑿~[i,:]γkj,ji(𝑾[:,i])T𝑺~[:,j](𝑿~k[j,:]𝑿~[j,:])\displaystyle\bm{\tilde{X}}^{\natural}[i,:]\underbrace{-\gamma^{k}\sum_{j\in\mathcal{I},j\neq i}(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,j](\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:])}
+(1γk)(𝑿~k[i,:]𝑿~[i,:])(T2).\displaystyle\underbrace{+(1-\gamma^{k})(\bm{\tilde{X}}^{k}[i,:]-\bm{\tilde{X}}^{\natural}[i,:])}_{(T2)}.

Hence, (52) can be rewritten as

𝑿~k+1[i,:]𝑿~[i,:](T2)θk2,1(𝑿~k+1[i,:])\displaystyle\bm{\tilde{X}}^{k+1}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\in(T2)-\theta^{k}\partial\ell_{2,1}(\bm{\tilde{X}}^{k+1}[i,:])

Then one can get that for all ii\in\mathcal{I}

𝑿~k+1[i,:]𝑿~[i,:]2μ~γkj,ji(𝑿~k[j,:]𝑿~[j,:])2\displaystyle\|\bm{\tilde{X}}^{k+1}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\|_{2}\leq\tilde{\mu}\gamma^{k}\sum_{j\in\mathcal{I},j\neq i}\|(\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:])\|_{2}
+θk+|1γk|𝑿~k[i,:]𝑿~[i,:]2.\displaystyle+\theta^{k}+|1-\gamma^{k}|\|\bm{\tilde{X}}^{k}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\|_{2}.

By the no-false-positive property, we have

𝑿~k+1𝑿~2,1μ~γk(||1)𝑿~k𝑿~2,1+||θk\displaystyle\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq\tilde{\mu}\gamma^{k}(|\mathcal{I}|-1)\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}+|\mathcal{I}|\theta^{k}
+|1γk|𝑿~k𝑿~2,1.\displaystyle+|1-\gamma^{k}|\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}.

Finally, taking the supremum over (𝑿~,𝟎)𝒳(β,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\mathcal{X}(\beta,s,0) with ||s|\mathcal{I}|\leq s, we have

sup𝑿~𝑿~k+1𝑿~2,1(μ~γk(s1)+|1γk|)\displaystyle\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq\big{(}\tilde{\mu}\gamma^{k}(s-1)+|1-\gamma^{k}|\big{)}
sup𝑿~𝑿~k𝑿~2,1+sθk.\displaystyle\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}+s\theta^{k}.

Taking cτ=log(γτ(2μ~sμ~)+|1γτ|)c^{\tau}=-\log\big{(}\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu})+|1-\gamma^{\tau}|\big{)} and using the fact that θk=μ~γksup𝑿~𝑿~k𝑿~2,1\theta^{k}=\tilde{\mu}\gamma^{k}\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1} in (36), we obtain

sup𝑿~𝑿~k+1𝑿~2,1\displaystyle\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1} exp(τ=0kcτ)sup𝑿~𝑿~0𝑿~2,1\displaystyle\leq\exp\Big{(}-\sum_{\tau=0}^{k}c^{\tau}\Big{)}\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{0}-\bm{\tilde{X}}^{\natural}\|_{2,1}
sβexp(τ=0kcτ).\displaystyle\leq s\beta\exp\Big{(}-\sum_{\tau=0}^{k}c^{\tau}\Big{)}.

Hence, we get the following upper bound with respect to the Frobenius norm:

sup𝑿~𝑿~k+1𝑿~Fsup𝑿~𝑿~k+1𝑿~2,1\displaystyle\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{F}\leq\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}
sβexp(τ=0kcτ).\displaystyle\leq s\beta\exp\Big{(}-\sum_{\tau=0}^{k}c^{\tau}\Big{)}. (53)

Therefore, the error bound holds uniformly for all (𝑿~,𝟎)𝒳(β,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\mathcal{X}(\beta,s,0).

Lastly, we verify that cτ>0c^{\tau}>0 for all τ\tau. The assumption s<(1+1/μ~)/2s<(1+1/\tilde{\mu})/2 gives 2μ~sμ~<12\tilde{\mu}s-\tilde{\mu}<1. If 0<γτ10<\gamma^{\tau}\leq 1, then γτ(2μ~sμ~)+|1γτ|=γτ(2μ~sμ~1)+1<1.\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu})+|1-\gamma^{\tau}|=\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu}-1)+1<1. If 1<γτ<2/(1+2μ~sμ~)1<\gamma^{\tau}<2/(1+2\tilde{\mu}s-\tilde{\mu}), then γτ(2μ~sμ~)+|1γτ|=γτ(2μ~sμ~+1)1<1.\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu})+|1-\gamma^{\tau}|=\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu}+1)-1<1. Thus cτ>0c^{\tau}>0 for all τ\tau.
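A quick numerical sanity check of this last claim, with illustrative (assumed) values of μ~\tilde{\mu} and ss, is sketched below.

```python
import numpy as np

# Check: with s < (1 + 1/mu_tilde)/2, the factor
#   gamma * (2*mu_tilde*s - mu_tilde) + |1 - gamma|
# stays strictly below 1 for all 0 < gamma < 2 / (1 + 2*mu_tilde*s - mu_tilde),
# so c_tau = -log(factor) > 0. The values of mu_tilde and s are illustrative.
mu_tilde, s = 0.02, 20                          # s < (1 + 1/mu_tilde)/2 = 25.5
upper = 2.0 / (1.0 + 2.0 * mu_tilde * s - mu_tilde)
gammas = np.linspace(1e-3, upper - 1e-3, 1000)
factor = gammas * (2.0 * mu_tilde * s - mu_tilde) + np.abs(1.0 - gammas)
assert np.all(factor < 1.0) and np.all(-np.log(factor) > 0.0)
print(f"max factor = {factor.max():.4f} < 1, so c_tau > 0 on the whole range")
```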

References

  • [1] Y. Shi, S. Xia, Y. Zhou, and Y. Shi, “Sparse signal processing for massive device connectivity via deep learning,” in Proc. IEEE Int. Conf. Commun. (ICC) Workshop, pp. 1–6, 2020.
  • [2] S. K. Sharma and X. Wang, “Toward massive machine type communications in ultra-dense cellular IoT networks: Current issues and machine learning-assisted solutions,” IEEE Commun. Surveys Tuts., vol. 22, pp. 426–471, First Quart. 2020.
  • [3] W. Peng, W. Gao, and J. Liu, “AI-enabled massive devices multiple access for smart city,” IEEE Internet Things J., vol. 6, pp. 7623–7634, Oct. 2019.
  • [4] Y. Wu, X. Gao, S. Zhou, W. Yang, Y. Polyanskiy, and G. Caire, “Massive access for future wireless communication systems,” IEEE Wireless Commun., pp. 1–9, Aug. 2020.
  • [5] L. Liu, E. G. Larsson, W. Yu, P. Popovski, C. Stefanovic, and E. De Carvalho, “Sparse signal processing for grant-free massive connectivity: A future paradigm for random access protocols in the internet of things,” IEEE Signal Process. Mag., vol. 35, pp. 88–99, Sept. 2018.
  • [6] Z. Chen, F. Sohrabi, and W. Yu, “Sparse activity detection for massive connectivity,” IEEE Trans. Signal Process., vol. 66, pp. 1890–1904, Apr. 2018.
  • [7] H. Jiang, D. Qu, J. Ding, and T. Jiang, “Multiple preambles for high success rate of grant-free random access with massive MIMO,” IEEE Trans. Wireless Commun., vol. 18, pp. 4779–4789, Oct. 2019.
  • [8] J. Ding and J. Choi, “Comparison of preamble structures for grant-free random access in massive MIMO systems,” IEEE Wireless Commun. Lett., vol. 9, pp. 166–170, Feb. 2020.
  • [9] X. Shao, X. Chen, C. Zhong, J. Zhao, and Z. Zhang, “A unified design of massive access for cellular internet of things,” IEEE Internet of Things J., vol. 6, pp. 3934–3947, Apr. 2019.
  • [10] K. Senel and E. G. Larsson, “Grant-free massive MTC-enabled massive MIMO: A compressive sensing approach,” IEEE Trans. Commun., vol. 66, pp. 6164–6175, Dec. 2018.
  • [11] Z. Qin, J. Fan, Y. Liu, Y. Gao, and G. Y. Li, “Sparse representation for wireless communications: A compressive sensing approach,” IEEE Signal Process. Mag., vol. 35, pp. 40–58, May 2018.
  • [12] Z. Gao, L. Dai, S. Han, I. Chih-Lin, Z. Wang, and L. Hanzo, “Compressive sensing techniques for next-generation wireless communications,” IEEE Wireless Commun., vol. 25, pp. 144–153, Jun. 2018.
  • [13] L. Liu and W. Yu, “Massive connectivity with massive MIMO-Part I: Device activity detection and channel estimation,” IEEE Trans. Signal Process., vol. 66, pp. 2933–2946, Jun. 2018.
  • [14] M. Ke, Z. Gao, Y. Wu, X. Gao, and R. Schober, “Compressive sensing-based adaptive active user detection and channel estimation: Massive access meets massive MIMO,” IEEE Trans. Signal Process., vol. 68, pp. 764–779, 2020.
  • [15] J. Ziniel and P. Schniter, “Efficient high-dimensional inference in the multiple measurement vector problem,” IEEE Trans. Signal Process., vol. 61, pp. 340–354, Jan. 2012.
  • [16] A. K. Fletcher, P. Pandit, S. Rangan, S. Sarkar, and P. Schniter, “Plug-in estimation in high-dimensional linear inverse problems: A rigorous analysis,” in Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 7440–7449, 2018.
  • [17] T. Jiang, Y. Shi, J. Zhang, and K. B. Letaief, “Joint activity detection and channel estimation for IoT networks: Phase transition and computation-estimation tradeoff,” IEEE Internet of Things J., vol. 6, pp. 6212–6225, Aug. 2019.
  • [18] Q. He, T. Q. Quek, Z. Chen, Q. Zhang, and S. Li, “Compressive channel estimation and multi-user detection in C-RAN with low-complexity methods,” IEEE Trans. Wireless Commun., vol. 17, pp. 3931–3944, Jun. 2018.
  • [19] Z. Qin, K. Scheinberg, and D. Goldfarb, “Efficient block-coordinate descent algorithms for the group lasso,” Math. Program. Comput., vol. 5, no. 2, pp. 143–169, 2013.
  • [20] A. Bonnefoy, V. Emiya, L. Ralaivola, and R. Gribonval, “Dynamic screening: Accelerating first-order algorithms for the lasso and group-lasso,” IEEE Trans. Signal Process., vol. 63, pp. 5121–5132, Oct. 2015.
  • [21] R. Giryes, Y. C. Eldar, A. M. Bronstein, and G. Sapiro, “Tradeoffs between convergence speed and reconstruction accuracy in inverse problems,” IEEE Trans. Signal Process., vol. 66, pp. 1676–1690, Apr. 2018.
  • [22] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Trans. Signal Process., vol. 66, pp. 5438–5453, Oct. 2018.
  • [23] M. Eisen, C. Zhang, L. F. O. Chamon, D. D. Lee, and A. Ribeiro, “Learning optimal resource allocations in wireless systems,” IEEE Trans. Signal Process., vol. 67, pp. 2775–2790, May 2019.
  • [24] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “A graph neural network approach for scalable wireless power control,” in Proc. IEEE Global Commun. Conf. (Globecom) Workshop, pp. 1–6, Dec. 2019.
  • [25] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “LORM: Learning to optimize for resource management in wireless networks with few training samples,” IEEE Trans. Wireless Commun., vol. 19, pp. 665–679, Jan. 2020.
  • [26] L. Liang, H. Ye, G. Yu, and G. Y. Li, “Deep-learning-based wireless resource allocation with application to vehicular networks,” Proc. of the IEEE, vol. 108, pp. 341–356, Feb. 2020.
  • [27] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proc. Int. Conf. Mach. Learn. (ICML), pp. 399–406, Omnipress, 2010.
  • [28] V. Monga, Y. Li, and Y. C. Eldar, “Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing,” IEEE Signal Process. Mag., vol. 38, pp. 18–44, Mar. 2021.
  • [29] N. Shlezinger, J. Whang, Y. C. Eldar, and A. G. Dimakis, “Model-based deep learning,” arXiv preprint arXiv:2012.08405, 2020.
  • [30] A. Balatsoukas-Stimming and C. Studer, “Deep unfolding for communications systems: A survey and some new directions,” in Proc. IEEE Int. Workshop Signal Process. Syst. (SiPS), pp. 266–271, 2019.
  • [31] Q. Hu, Y. Cai, Q. Shi, K. Xu, G. Yu, and Z. Ding, “Iterative algorithm induced deep-unfolding neural networks: Precoding design for multiuser MIMO systems,” IEEE Trans. Wireless Commun., vol. 20, pp. 1394–1410, Feb. 2021.
  • [32] J. Liu, X. Chen, Z. Wang, and W. Yin, “ALISTA: Analytic weights are as good as learned weights in LISTA,” in Proc. Int. Conf. on Learn. Rep. (ICLR), 2019.
  • [33] P. Ablin, T. Moreau, M. Massias, and A. Gramfort, “Learning step sizes for unfolded sparse coding,” in Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 13100–13110, 2019.
  • [34] X. Chen, J. Liu, Z. Wang, and W. Yin, “Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds,” in Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 9061–9071, 2018.
  • [35] A. T. Puig, A. Wiesel, G. Fleury, and A. O. Hero, “Multidimensional shrinkage-thresholding operator and group LASSO penalties,” IEEE Signal Process. Lett., vol. 18, pp. 363–366, Jun. 2011.
  • [36] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. R. Stat. Soc. B, vol. 68, no. 1, pp. 49–67, 2006.
  • [37] D. Ito, S. Takabe, and T. Wadayama, “Trainable ISTA for sparse signal recovery,” IEEE Trans. Signal Process., vol. 67, pp. 3113–3125, Jun. 2019.
  • [38] A. Maleki, L. Anitori, Z. Yang, and R. G. Baraniuk, “Asymptotic analysis of complex LASSO via complex approximate message passing (CAMP),” IEEE Trans. Inf. Theory, vol. 59, pp. 4290–4308, Jul. 2013.
  • [39] S. Takabe, T. Wadayama, and Y. C. Eldar, “Complex trainable ISTA for linear and nonlinear inverse problems,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 5020–5024, 2020.
  • [40] M. Borgerding, P. Schniter, and S. Rangan, “AMP-Inspired deep networks for sparse linear inverse problems,” IEEE Trans. Signal Process., vol. 65, pp. 4293–4308, Aug. 2017.
  • [41] Y. Kabashima, “A CDMA multiuser detection algorithm on the basis of belief propagation,” J. Phys. A: Math. General, vol. 36, p. 11111, Oct. 2003.
  • [42] H. Sun, W. Pu, M. Zhu, X. Fu, T.-H. Chang, and M. Hong, “Learning to continuously optimize wireless resource in episodically dynamic environment,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 4945–4949, 2021.
  • [43] Y. Yuan, G. Zheng, K.-K. Wong, B. Ottersten, and Z.-Q. Luo, “Transfer learning and meta learning-based fast downlink beamforming adaptation,” IEEE Trans. Wireless Commun., vol. 20, pp. 1742–1755, Mar. 2021.
  • [44] J. Liu, S. Ji, and J. Ye, “Multi-task feature learning via efficient l2,1l_{2,1}-norm minimization,” in Proc. Conf. Uncertainty Artif. Intell. (UAI), pp. 339–348, AUAI Press, 2009.
  • [45] R. Gribonval and M. Nielsen, “Sparse representations in unions of bases,” IEEE Trans. Inf. Theory, vol. 49, pp. 3320–3325, Dec. 2003.