
Algorithm Unrolling for Massive Access via Deep Neural Network with Theoretical Guarantee

Yandong Shi, Hayoung Choi, Yuanming Shi, and Yong Zhou
This paper was presented in part at IEEE International Conference on Communications Workshops (ICC Workshops), Dublin, Ireland, 2020 [1]. Y. Shi is with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China, also with the Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]). H. Choi is with the Department of Mathematics Kyungpook National University Daegu, Republic of Korea (e-mail:[email protected]). Y. Shi and Y. Zhou are with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China (e-mail:{shiym, zhouyong}@shanghaitech.edu.cn). Yuanming Shi is also with Yoke Intelligence, Shanghai, China.
Abstract

Massive access is a critical design challenge of Internet of Things (IoT) networks. In this paper, we consider the grant-free uplink transmission of an IoT network with a multiple-antenna base station (BS) and a large number of single-antenna IoT devices. Taking into account the sporadic nature of IoT devices, we formulate the joint activity detection and channel estimation (JADCE) problem as a group-sparse matrix estimation problem. This problem can be solved by applying existing compressed sensing techniques, which, however, either suffer from high computational complexity or lack robustness. To this end, we propose a novel algorithm unrolling framework based on deep neural networks to simultaneously achieve low computational complexity and high robustness for solving the JADCE problem. Specifically, we map the original iterative shrinkage thresholding algorithm (ISTA) into an unrolled recurrent neural network (RNN), thereby improving the convergence rate and computational efficiency through end-to-end training. Moreover, the proposed algorithm unrolling approach inherits the structure and domain knowledge of the ISTA, thereby maintaining the robustness of the algorithm, which allows it to handle a non-Gaussian preamble sequence matrix in massive access. With rigorous theoretical analysis, we further simplify the unrolled network structure by reducing the redundant training parameters. Furthermore, we prove that the simplified unrolled deep neural network structures enjoy a linear convergence rate. Extensive simulations based on various preamble signatures show that the proposed unrolled networks outperform the existing methods in terms of the convergence rate, robustness, and estimation accuracy.

Index Terms:
Massive access, joint activity detection and channel estimation, group-sparse matrix estimation, algorithm unrolling, and deep learning.

I Introduction

Massive machine-type communications (mMTC), as one of the three major use cases of the fifth-generation (5G) wireless networks [2], is expected to support ubiquitous connectivity for a large number of low-cost Internet of Things (IoT) devices. The typical applications of the mMTC use case include smart homes, wearables, environmental sensing, and healthcare [3]. Driven by the increasing popularity of IoT services and the decreasing cost of IoT devices, the number of IoT devices is expected to reach 75.4 billion by 2025 [4]. Providing efficient medium access for such a large number of IoT devices is a critical design challenge that needs to be addressed to support mMTC.

Grant-free random access, proposed by the 3rd generation partnership project (3GPP) for 5G new radio (NR), has been recognized as a promising multiple access scheme that is capable of accommodating massive IoT devices with low-latency requirements [5]. In particular, with grant-free random access, each IoT device can transmit a frame containing both preamble sequence and data at any time instant without waiting for the grant from the base station (BS). Compared to the grant-based random access, the signaling overhead and the random access latency can be significantly reduced. However, due to the lack of prior knowledge on the subset of active IoT devices that will transmit and the necessity of obtaining accurate channel state information (CSI) for data decoding, it is of vital importance to achieve joint activity detection and channel estimation (JADCE) for grant-free random access in IoT networks[6]. To alleviate the collision probability of preamble sequences and in turn facilitate JADCE, the authors in [7, 8, 9] focused on the design of non-orthogonal preamble signature sequences. In particular, the authors in [7] proposed a novel preamble design that concatenates multiple Zadoff-Chu (ZC) sequences, which increase the success rate of preamble transmission and enhance the channel reuse efficiency for grant-free random access.

Taking into account the sporadic nature of IoT devices, the JADCE problem was formulated as a sparse signal recovery problem in [10]. This kind of problem can be addressed by adopting the compressed sensing (CS) technique that is capable of exploiting the channel sparsity structures in both the time- and frequency-domains [11, 12]. Due to the large-scale nature of mMTC, the computationally efficient approximate message passing (AMP) based algorithms were proposed to solve the CS problems for supporting massive access [13, 6, 14]. The authors in [6] showed that AMP with a vector denoiser and multiple measurement vectors (MMV) [15] achieve approximately the same performance for JADCE. An AMP algorithm with a minimum mean-squared error (MMSE) denoiser was proposed in [13], where a prior knowledge of the underlying channel distribution was required. Furthermore, a generalized multiple measurement vector approximate message passing (GMMV-AMP) algorithm was proposed to detect the device activity by exploiting sparsity in both the spatial and the angular domains [14]. However, AMP-based algorithms often fail to converge when the preamble signature matrix is mildly ill-conditioned or non-Gaussian [16]. To this end, the authors in [17, 18] introduced a mixed 1/2\ell_{1}/\ell_{2}-norm to formulate the JADCE problem as a form of group LASSO, which can be solved by using the interior-point method [17] or the alternating direction method of multipliers (ADMM) [18] without any prior knowledge on CSI. In particular, the popular iterative shrinkage thresholding algorithm (ISTA) [19, 20] was one of the best known robust algorithmic approaches to solve the group LASSO problem. However, the ISTA converges only sublinearly in general cases [19] and suffers from an inherent trade-off between the estimation performance and convergence rate [21], which prohibit its practical implementation for grant-free massive access in IoT networks.

Deep learning has recently been emerging as a disruptive technology to reduce the computational cost. For instance, the authors in [22, 23, 24, 25, 26] proposed the idea of “learning to optimize”, where neural networks were developed to replace the traditional optimization algorithms to solve the non-convex constrained optimization problems [22, 23, 24] and mixed-integer nonlinear programming (MINLP) problems [25, 26] for resource allocation in wireless networks. These studies demonstrated the potentials of deep learning for improving communication and computation efficiency. However, the deep learning framework usually regarded the neural network as a black box and cannot guarantee the theoretical performance. Therefore, the lack of interpretability and theoretical guarantee of the deep learning framework is a critical issue that needs to be addressed in wireless networks.

Fortunately, the unrolled deep neural network [27], as another powerful deep learning technique, has been proposed to provide a concrete connection between the existing iterative algorithms and the deep neural networks with theoretical performance guarantee [28, 29]. The promising applications of this method include resource management, MIMO detection and precoding design [30, 31]. The typical unrolled deep neural network, i.e., learned iterative shrinkage thresholding algorithm (LISTA) [27, 32, 33], was adopted to approximate the sparse coding based on LASSO, which achieves a better estimation performance. Moreover, the authors in [34, 32] established the simplified structures for LISTA and proved that simplified LISTA achieves linear convergence rate. The authors in [33] study the selection of adapted step size to improve the convergence rate of ISTA and further propose a unrolled network where only the step sizes of ISTA are learned. However, such an unrolling framework cannot be directly applied to solve the JADCE problem in IoT networks with grant-free random access, where the uplink channels of active devices are usually represented by a group-row-sparse matrix. This is because the existing LISTA with a scalar shrinkage-thresholding operator (SSTO), which only focuses on the recovery of sparse vectors, cannot tackle the group-sparse matrix estimation problem.

I-A Contributions

In this paper, we consider grant-free massive access in an IoT network, where a large number of single-antenna IoT devices sporadically transmit non-orthogonal preamble sequences to a multi-antenna BS. In such a network setting, we formulate the JADCE problem as a group-sparse matrix estimation problem by introducing a mixed 1/2\ell_{1}/\ell_{2}-norm. Based on the above discussions, we shall develop a new unrolled deep neural network framework by adopting the multidimensional generalization operator, named multidimensional shrinkage thresholding operator (MSTO) [35], to address this problem. Such an extension turns out to be non-trivial, as the MSTO for the group-row-sparse matrix breaks the independency between the sparse vectors, which brings formidable challenge to reveal the weight coupling structure and analyze the convergence rate for the proposed method. We address this challenge and provide the theoretical performance guarantee for the proposed framework that is capable of solving JADCE problem to support massive access in IoT networks.

The major contributions of this paper are summarized as follows:

  • We develop a novel unrolled deep neural network framework with three structures for group-sparse matrix estimation to support grant-free massive access in IoT networks. The proposed methods enjoy high robustness by inheriting the structure of ISTA and keep the computational complexity at a low level through end-to-end training.

  • We conduct rigorous theoretical analysis to address the interpretability issue and facilitate efficient training in IoT networks. We reveal the weight coupling structure among the training parameters and identify the property that the learned weight matrix only depends on the preamble signature. The theoretical analysis allows us to further simplify the original unrolled network structure by reducing the redundant training parameters. Furthermore, we prove that the simplified unrolled neural networks enjoy a linear convergence rate.

  • Extensive simulations are conducted to validate the theoretical analysis and show the superior performance of the proposed algorithms in three critical aspects for grant-free massive access. First, compared with robust algorithms such as ISTA, the proposed methods have a much lower computational cost. Second, the proposed methods are more robust than computationally efficient algorithms such as AMP-based algorithms, as demonstrated by simulations with different preamble signatures, including the Gaussian matrix, the binary matrix, and the ZC sequence. Third, the proposed algorithms are capable of returning more accurate solutions compared with the classical CS-based algorithms.

I-B Organization and Notations

The rest of this paper is organized as follows. The system model and problem formulation are described in Section II. Section III presents the three proposed unrolled neural networks for solving the group-sparse matrix estimation problem. The convergence analysis of the proposed unrolled neural networks is presented in Section IV. The numerical results of the proposed algorithms are illustrated in Section V. Finally, Section VI concludes this paper.

Notations

Throughout this paper, we denote [N]={1,2,,N}[N]=\{1,2,\ldots,N\}. Let \mathbb{R} (resp. \mathbb{C}) be the set of real (resp. complex) numbers and \mathbb{N} be the set of integers. For a set 𝒮\mathcal{S}, the number of elements of 𝒮\mathcal{S} is denoted by |𝒮||\mathcal{S}|. For index sets ={i1,,in}[N]\mathcal{I}=\{i_{1},\ldots,i_{n}\}\subset[N], 𝒥={j1,,jm}[M]\mathcal{J}=\{j_{1},\ldots,j_{m}\}\subset[M] and 𝑨N×M\bm{A}\in\mathbb{R}^{N\times M}, we denote by 𝑨[,𝒥]\bm{A}[\mathcal{I},\mathcal{J}] the (sub)matrix of entries that lie in the rows of 𝑨\bm{A} indexed by \mathcal{I} and the columns indexed by 𝒥\mathcal{J}. When =[N]\mathcal{I}=[N] (resp. 𝒥=[M]\mathcal{J}=[M]), we simply denote by 𝑨[:,𝒥]\bm{A}[:,\mathcal{J}] (resp. 𝑨[,:]\bm{A}[\mathcal{I},:]). For a matrix 𝑨\bm{A}, 𝑨2\|\bm{A}\|_{2} (resp. 𝑨F\|\bm{A}\|_{F}) denotes the operator (resp. Frobenius) norm. And 𝑨max\|\bm{A}\|_{\max} denotes the max norm. Also, 𝑨2,1\|\bm{A}\|_{2,1} (resp. 𝑨2,0\|\bm{A}\|_{2,0}) denotes mixed 1/2\ell_{1}/\ell_{2}-norm (resp. mixed 0/2\ell_{0}/\ell_{2}-norm). For a vector 𝒗\bm{v}, 𝒗2\|\bm{v}\|_{2} denotes the Euclidean norm. We denote the support of the vector 𝒗=[v1,,vN]N\bm{v}=[v_{1},\ldots,v_{N}]\in\mathbb{R}^{N} as supp(𝒗)\text{supp}(\bm{v}), i.e. supp(𝒗)={i[N]|vi0}\text{supp}(\bm{v})=\{i\in[N]|v_{i}\neq 0\}.

II System Model and Problem Formulation

II-A System Model

Consider the grant-free uplink transmission of a single-cell IoT network consisting of one M-antenna BS and N single-antenna IoT devices, as shown in Fig. 1. Full frequency reuse and quasi-static block fading channels are considered in this paper. We denote [N]=\{1,\ldots,N\} as the set of IoT devices, which sporadically communicate with the BS. In each transmission block, all the IoT devices independently decide whether or not to be active. Within a specific transmission block, we denote a_{n}=1 if device n is active and a_{n}=0 otherwise. Because of the sporadic transmission, it is reasonable to assume that each IoT device is active with a probability much smaller than 1. In addition, we assume that the active IoT devices are synchronized at the symbol level [17]. The preamble signal received at the BS, denoted as \bm{y}(\ell)\in\mathbb{C}^{M}, can be expressed as

\bm{y}(\ell)=\sum_{n=1}^{N}\bm{h}_{n}a_{n}s_{n}(\ell)+\bm{z}(\ell),\quad\ell=1,\ldots,L, (1)

where 𝒉nM\bm{h}_{n}\in{\mathbb{C}}^{M} denotes the channel vector between device nn and the BS, sn()s_{n}(\ell)\in\mathbb{C} denotes the \ell-th symbol of the preamble signature sequence transmitted by device nn, LL denotes the length of the signature sequence, and 𝒛()M\bm{z}(\ell)\in\mathbb{C}^{M} denotes the additive white Gaussian noise (AWGN) vector at the BS.

Refer to caption
Figure 1: Illustration of an IoT network with a massive number of IoT devices that transmit sporadically to the BS.
Refer to caption
Figure 2: Illustration of the JADCE problem, where 𝑿\bm{X}^{\natural} has the group-sparse structure in rows, i.e., if one entry of 𝑿\bm{X}^{\natural} is zero, then other entries on the same row should be zero simultaneously.

As the length of the signature sequence is typically much smaller than the number of IoT devices, i.e., LNL\ll N, it is impossible to allocate mutually orthogonal signature sequences to all the IoT devices. Hence, each IoT device is assigned a unique signature sequence, which is generally non-orthogonal to the preamble sequences assigned to other IoT devices. Note that three different kinds of non-orthogonal preamble sequences will be considered in the simulations.

For notational ease, we denote 𝒀=[𝒚(1),,𝒚(L)]TL×M\bm{Y}=[\bm{y}(1),...,\bm{y}(L)]^{T}\in\mathbb{C}^{L\times M} as the received preamble signal matrix across MM antennas over the transmission duration of LL symbols, 𝑯=[𝒉1,,𝒉N]TN×M\bm{H}=[\bm{h}_{1},...,\bm{h}_{N}]^{T}\in\mathbb{C}^{N\times M} as the channel matrix between NN devices and the BS, 𝒁=[𝒛(1),,𝒛(L)]L×M\bm{Z}=[\bm{z}(1),...,\bm{z}(L)]\in\mathbb{C}^{L\times M} as the noise matrix at the BS, and 𝑺=[𝒔(1),,𝒔(L)]TL×N\bm{S}=[\bm{s}(1),...,\bm{s}(L)]^{T}\in\mathbb{C}^{L\times N} as the known preamble signature sequence matrix with 𝒔()=[s1(),,sN()]TN\bm{s}(\ell)=[s_{1}(\ell),...,s_{N}(\ell)]^{T}\in\mathbb{C}^{N}. As a result, the preamble signal received at the BS can be rewritten as

\bm{Y}=\bm{SAH}+\bm{Z}, (2)

where 𝑨=diag(a1,,aN)N×N\bm{A}={\rm{diag}}(a_{1},...,a_{N})\in\mathbb{R}^{N\times N} denotes the diagonal activity matrix. We aim to simultaneously detect the activity matrix 𝑨\bm{A} and estimate the channel matrix 𝑯\bm{H}, which is known as a JADCE problem [13]. By denoting 𝑿=𝑨𝑯N×M\bm{X}^{\natural}=\bm{AH}\in\mathbb{C}^{N\times M}, matrix 𝑿\bm{X}^{\natural} thus has the group-sparse structure in rows [17], as shown in Fig. 2. We further rewrite (2) as

\bm{Y}=\bm{SX}^{\natural}+\bm{Z}. (3)

Note that the active devices and their corresponding channel vectors can be recovered from the estimation of matrix 𝑿\bm{X}^{\natural} [17].

II-B Problem Formulation

To estimate the group-row sparse matrix 𝑿\bm{X}^{\natural}, we adopt the mixed 1/2\ell_{1}/\ell_{2}-norm [17] to induce the group sparsity, i.e., (𝑿)=n=1N𝑿[n,:]2,\mathcal{R}(\bm{X})=\sum_{n=1}^{N}\|\bm{X}[n,:]\|_{2}, where 𝑿[n,:]\bm{X}[n,:] is the nn-th row of matrix 𝑿\bm{X}. The group-sparse matrix estimation problem can be reformulated as the following unconstrained convex optimization problem (also known as group LASSO) [36]

\mathscr{P}:\mathop{\operatorname{minimize}}\limits_{\bm{X}\in\mathbb{C}^{N\times M}}\ \frac{1}{2}\|\bm{Y}-\bm{S}\bm{X}\|_{F}^{2}+\lambda\mathcal{R}(\bm{X}), (4)

where \lambda>0 denotes the regularization parameter. As matrix \bm{X} is group sparse in rows, problem \mathscr{P} is essentially a group-sparse matrix estimation problem.

To facilitate efficient algorithm design, we rewrite (3) as its real-valued counterpart

\bm{\tilde{Y}}=\bm{\tilde{S}}\bm{\tilde{X}}^{\natural}+\bm{\tilde{Z}}=\left[\begin{matrix}\Re\{\bm{S}\}&-\Im\{\bm{S}\}\\ \Im\{\bm{S}\}&\Re\{\bm{S}\}\end{matrix}\right]\left[\begin{matrix}\Re\{\bm{X}^{\natural}\}\\ \Im\{\bm{X}^{\natural}\}\end{matrix}\right]+\left[\begin{matrix}\Re\{\bm{Z}\}\\ \Im\{\bm{Z}\}\end{matrix}\right], (5)

where {}\Re\{\cdot\} and {}\Im\{\cdot\} represent the real part and imaginary part of a complex matrix. As a result, problem 𝒫\mathscr{P} can be further rewritten as

\mathscr{P}_{r}:\mathop{\operatorname{minimize}}\limits_{\bm{\tilde{X}}\in\mathbb{R}^{2N\times M}}\ \frac{1}{2}\|\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}\|_{F}^{2}+\lambda\mathcal{R}(\bm{\tilde{X}}). (6)

Therefore, the goal of JADCE becomes recovering the matrix \bm{\tilde{X}}^{\natural} from the known preamble matrix \bm{\tilde{S}} and the noisy observation \bm{\tilde{Y}}.
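As a concrete illustration of the real-valued reformulation, the following minimal NumPy sketch (our own illustrative code, not the authors' implementation; function and variable names are assumptions) builds \bm{\tilde{S}}, \bm{\tilde{X}}^{\natural}, and \bm{\tilde{Y}} from the complex model (3) according to (5).

```python
import numpy as np

def to_real(S, X, Z):
    """Map the complex model Y = S X + Z in (3) to its real-valued form (5).

    S: (L, N) complex, X: (N, M) complex, Z: (L, M) complex.
    Returns S_t (2L, 2N), X_t (2N, M), Y_t (2L, M).
    """
    S_t = np.block([[S.real, -S.imag],
                    [S.imag,  S.real]])
    X_t = np.vstack([X.real, X.imag])
    Z_t = np.vstack([Z.real, Z.imag])
    Y_t = S_t @ X_t + Z_t        # equals [Re(Y); Im(Y)] stacked
    return S_t, X_t, Y_t
```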

II-C Prior Work

The ISTA [36] is a popular approach to solve the group LASSO problem. In particular, the resulting ISTA iteration for (6) can be written as

\bm{\tilde{X}}^{k+1}=\eta_{\lambda/C}\left(\bm{\tilde{X}}^{k}+\frac{1}{C}\bm{\tilde{S}}^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k})\right), (7)

where \bm{\tilde{X}}^{k} is an estimate of the ground truth \bm{\tilde{X}}^{\natural} at iteration k, \frac{1}{C} serves as the step size, \lambda is the regularization parameter, and \eta_{\lambda/C}(\cdot) denotes the MSTO [35]. Specifically, \eta_{\theta}(\bm{X})[n] denotes the n-th row of the matrix \bm{X} after applying \eta_{\theta}(\cdot), which is defined as [20]

\eta_{\theta}(\bm{X})[n]=\max\left\{0,\frac{\|\bm{X}[n,:]\|_{2}-\theta}{\|\bm{X}[n,:]\|_{2}}\right\}\bm{X}[n,:]. (8)

However, the ISTA for the group-sparse matrix estimation problem (referred to as ISTA-GS hereafter) suffers from an inherent trade-off between estimation performance and convergence rate induced by the choice of \lambda, i.e., a larger \lambda leads to faster convergence but poorer estimation performance [21]. Moreover, the choice of the step size \frac{1}{C} also influences the convergence rate [33]. In the next section, we shall propose a learned ISTA framework that learns the parameters, including \lambda and the step size \frac{1}{C}, simultaneously to improve the convergence rate and estimation performance of ISTA-GS.
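For concreteness, the MSTO in (8) and the ISTA-GS iteration (7) can be sketched in NumPy as follows. This is our own illustrative code rather than the authors' implementation; C is taken as the squared spectral norm of \bm{\tilde{S}}, and the regularization parameter defaults to \lambda=0.1 as in the simulations.

```python
import numpy as np

def msto(X, theta):
    """Multidimensional shrinkage thresholding operator (8), applied row-wise."""
    row_norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - theta / np.maximum(row_norms, 1e-12))
    return scale * X

def ista_gs(S, Y, lam=0.1, num_iter=100):
    """Vanilla ISTA-GS (7) for the group LASSO problem (6)."""
    C = np.linalg.norm(S, 2) ** 2              # Lipschitz constant of the gradient
    X = np.zeros((S.shape[1], Y.shape[1]))
    for _ in range(num_iter):
        X = msto(X + S.T @ (Y - S @ X) / C, lam / C)
    return X
```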

III Algorithm Unrolling via Deep Neural Network

In this section, we propose an unrolled deep learning framework to solve the JADCE problem by exploiting the group sparse structure. Although the LISTA proposed in [27, 34, 32, 37] can recover the individual sparse vector signals, all these methods cannot recover the matrix with group row sparsity. To address this issue, we extend LISTA for the group-sparse matrix recovery. Furthermore, we analytically prove that the weight coupling property holds for the group-sparse matrix estimation problem.

Refer to caption
(a) Unrolled network structure 1 (LISTA-GS).
Refer to caption
(b) Unrolled network structure 2 (LISTA-GSCP).
Refer to caption
(c) Unrolled network structure 3 (ALISTA-GS).
Figure 3: Three unrolled network structures.

III-A Unrolled Neural Network Structures

In this subsection, we propose three neural network structures under the unrolled framework for group-sparse matrix estimation problem 𝒫r\mathscr{P}_{r}.

III-A1 LISTA-GS

Inspired by [27, 34] and by denoting 𝑾1=1C𝑺~T\bm{W}_{1}=\frac{1}{C}\bm{\tilde{S}}^{T}, 𝑾2=𝑰1C𝑺~T𝑺~\bm{W}_{2}=\bm{I}-\frac{1}{C}\bm{\tilde{S}}^{T}\bm{\tilde{S}}, and θ=λC\theta=\frac{\lambda}{C}, we rewrite (7) as

𝑿~k+1=ηθk(𝑾1𝒀~+𝑾2𝑿~k).\displaystyle\bm{\tilde{X}}^{k+1}=\eta_{\theta^{k}}(\bm{W}_{1}\bm{\tilde{Y}}+\bm{W}_{2}\bm{\tilde{X}}^{k}). (9)

The key idea of the proposed unrolled method for group-sparse matrix estimation problem is to view matrices 𝑾1\bm{W}_{1}, 𝑾2\bm{W}_{2}, and scalars θk\theta^{k} in (9) as trainable parameters. As a result, (9) can be modeled as a one layer recurrent neural network (RNN), i.e., one iteration of ISTA-GS is treated as one layer neural network. The neural network structure can be concatenated as a KK-layer RNN, which is capable of modeling the ISTA-GS with KK iterations. Mathematically, the unrolled RNN with KK iterations for group-sparse matrix estimation problem is given by

\bm{\tilde{X}}^{k+1}=\eta_{\theta^{k}}(\bm{W}_{1}^{k}\bm{\tilde{Y}}+\bm{W}_{2}^{k}\bm{\tilde{X}}^{k}),\quad k=0,1,\ldots,K-1, (10)

where parameters 𝚯={𝑾1k,𝑾2k,θk}k=0K1\bm{\Theta}=\{\bm{W}_{1}^{k},\bm{W}_{2}^{k},\theta^{k}\}_{k=0}^{K-1} are trainable. This is the main difference from the iterative algorithm in (7).

The unrolled neural network structure given in (10) is named LISTA-GS, and the corresponding structure is plotted in Fig. 3(a). However, this structure contains a large number of parameters, with two high-dimensional weight matrices per layer, which makes it inefficient to train.
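To make the structure concrete, a minimal PyTorch sketch of the K-layer LISTA-GS forward pass (10) is given below. This is our own illustrative code rather than the authors' implementation: the class and argument names are assumptions, the parameters are initialized as in Table I, and the optional num_layers argument (used by the layer-wise training sketch in Section III-B) runs only the first layers of the network.

```python
import torch
import torch.nn as nn

def msto(X, theta):
    """Row-wise soft thresholding (8), written with differentiable PyTorch ops."""
    norms = X.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return torch.clamp(1.0 - theta / norms, min=0.0) * X

class LISTAGS(nn.Module):
    """Unrolled network (10): W1^k, W2^k and theta^k are trainable in every layer."""
    def __init__(self, S, K=12, theta0=0.1):
        super().__init__()
        C = torch.linalg.matrix_norm(S, ord=2) ** 2
        self.K = K
        self.W1 = nn.ParameterList([nn.Parameter(S.t() / C) for _ in range(K)])
        self.W2 = nn.ParameterList(
            [nn.Parameter(torch.eye(S.shape[1]) - S.t() @ S / C) for _ in range(K)])
        self.theta = nn.ParameterList(
            [nn.Parameter(torch.tensor(float(theta0))) for _ in range(K)])

    def forward(self, Y, num_layers=None):
        K = self.K if num_layers is None else num_layers
        X = torch.zeros(self.W1[0].shape[0], Y.shape[1],
                        dtype=Y.dtype, device=Y.device)
        for k in range(K):
            X = msto(self.W1[k] @ Y + self.W2[k] @ X, self.theta[k])
        return X
```

The LISTA-GSCP structure in (13) below follows the same pattern with a single per-layer matrix \bm{W}^{k} and the update \eta_{\theta^{k}}(\bm{\tilde{X}}^{k}+(\bm{W}^{k})^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k})).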

III-A2 LISTA-GSCP

We establish the following necessary condition for the convergence of LISTA-GS, which is inspired by the fact that \bm{W}_{2}^{0}=\bm{I}-\bm{W}_{1}^{0}\bm{\tilde{S}} and by the coupled structure in [34]. This necessary condition reveals the properties that the trainable parameters must satisfy if the proposed unrolled neural network is to recover the group-sparse matrix, and it can be used to simplify the proposed network.

Theorem 1.

Given {𝐖1k,𝐖2k,θk}k=0\{\bm{W}_{1}^{k},\bm{W}_{2}^{k},\theta^{k}\}_{k=0}^{\infty} with bounded weights, i.e., 𝐖1kFCW1\|\bm{W}_{1}^{k}\|_{F}\leq C_{W_{1}} and 𝐖2kFCW2\|\bm{W}_{2}^{k}\|_{F}\leq C_{W_{2}}, let {𝐗~k}k=1\{\bm{\tilde{X}}^{k}\}_{k=1}^{\infty} be generated layer-wise by LISTA-GS in (10) with an input 𝐘~\bm{\tilde{Y}} observed by (3) and the initial point 𝐗~0=𝟎\bm{\tilde{X}}^{0}=\bm{0}. If network can recover any group-row-sparse signal with no observation noise, then {𝐖1k,𝐖2k,θk}k=0\{\bm{W}_{1}^{k},\bm{W}_{2}^{k},\theta^{k}\}_{k=0}^{\infty} must satisfy the following two conditions

\theta^{k}\rightarrow 0,\quad\text{as }k\rightarrow\infty, (11)
\bm{W}_{2}^{k}-(\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}})\rightarrow\bm{0},\quad\text{as }k\rightarrow\infty. (12)
Proof.

Please refer to Appendix A. ∎

Thus, we develop the necessary condition for the recovery convergence of the group-sparse matrix estimation problem. It is worth noting that the MSTO [35] is a generalization of the SSTO, which brings a unique challenge in revealing the weight coupling structure for the group-row-sparse matrix estimation problem. The extension turns out to be non-trivial since the MSTO for the group-sparse matrix breaks the “independence” between the sparse vectors.

Motivated by (12) in Theorem 1, we adopt the partial weight coupling structure of the trainable weights {𝑾1k,𝑾2k}k=0\{\bm{W}_{1}^{k},\bm{W}_{2}^{k}\}_{k=0}^{\infty} in LISTA-GS as 𝑾2k=𝑰𝑾1k𝑺~,k.\bm{W}_{2}^{k}=\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}},\forall k. Hence, by letting 𝑾k=(𝑾1k)T\bm{W}^{k}=(\bm{W}_{1}^{k})^{T}, we obtain the simplified KK-layer unrolled neural network structure (namely LISTA-GSCP) as

\bm{\tilde{X}}^{k+1}=\eta_{\theta^{k}}\big(\bm{\tilde{X}}^{k}+(\bm{W}^{k})^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k})\big),\quad k=0,1,\ldots,K-1, (13)

where parameters 𝚯={𝑾k,θk}k=0K1\bm{\Theta}=\{\bm{W}^{k},\theta^{k}\}_{k=0}^{K-1} are trainable. Fig. 3(b) illustrates the unrolled network.

III-A3 ALISTA-GS

To further alleviate the need to learn a weight matrix \bm{W}^{k} with a large number of parameters while stabilizing the training process, we decompose the weight into the product of a scalar and a matrix, i.e., \bm{W}^{k}=\gamma^{k}\bm{W}. The definition of “good” parameters \bm{W}\in\mathcal{W}(\bm{\tilde{S}}) in (35) shows that the weight matrix only depends on \bm{\tilde{S}}. Thus, we can obtain the weight matrix \bm{W} prior to the training stage by solving the following convex optimization problem with the projected gradient descent (PGD) algorithm

\underset{\bm{W}\in\mathbb{R}^{2L\times 2N}}{\operatorname{minimize}}\quad\|\bm{W}^{T}\bm{\tilde{S}}\|_{F}^{2}
\operatorname{subject~to}\quad\bm{W}[:,i]^{T}\bm{\tilde{S}}[:,i]=1,\ \forall i\in[2N]. (14)

The third proposed unrolled network structure (namely ALISTA-GS) is then given by

\bm{\tilde{X}}^{k+1}=\eta_{\theta^{k}}\big(\bm{\tilde{X}}^{k}+\gamma^{k}\bm{W}^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k})\big),\quad k=0,1,\ldots,K-1, (15)

where \bm{\Theta}=\{\theta^{k},\gamma^{k}\}_{k=0}^{K-1} are the parameters to be trained and the weight matrix \bm{W} is pre-computed before the training process. Different from LISTA-GSCP, this network structure focuses on learning better step sizes. A visual depiction is provided in Fig. 3(c).
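A possible way to pre-compute \bm{W} from (14) is projected gradient descent, as sketched below in NumPy (our own code; the step size and iteration count are illustrative assumptions). The gradient of \|\bm{W}^{T}\bm{\tilde{S}}\|_{F}^{2} is 2\bm{\tilde{S}}\bm{\tilde{S}}^{T}\bm{W}, and each column is projected back onto the affine constraint \bm{W}[:,i]^{T}\bm{\tilde{S}}[:,i]=1.

```python
import numpy as np

def precompute_W(S, step=None, num_iter=2000):
    """Approximately solve (14): minimize ||W^T S||_F^2 s.t. W[:,i]^T S[:,i] = 1."""
    col_sq = np.sum(S * S, axis=0)                      # ||S[:,i]||_2^2 per column
    W = S / col_sq                                      # feasible starting point
    if step is None:
        step = 1.0 / (2.0 * np.linalg.norm(S, 2) ** 2)  # inverse Lipschitz constant
    for _ in range(num_iter):
        W = W - step * 2.0 * (S @ (S.T @ W))            # gradient step
        W = W + S * (1.0 - np.sum(W * S, axis=0)) / col_sq  # per-column projection
    return W
```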

So far, we have established three LISTA variations for the group-sparse matrix estimation problem. With fewer trainable parameters, the training process becomes easier and faster.

Remark 1.

Recently, the advancement of complex-valued deep networks make it possible to apply the proposed LISTA variations in a complex-valued representation, i.e., without rewriting as its real-valued counterpart (5). The complex-valued representation may be utilized to further enhance the performance of sparse recovery by exploiting the extra information about potential grouping of real and imaginary parts, e.g., the real and imaginary parts are either zero or nonzero simultaneously [38, 39]. However, the design of the complex-valued network training process and the complex-valued shrinkage function will be two challenges. Besides, the convergence analysis of such complex-valued LISTA variations will be an interesting problem, which will be studied in our future work.

III-B Deep Neural Networks Training and Testing

In the training stage, we consider a framework in which the neural networks learn to solve the group-sparse matrix estimation problem \mathscr{P}_{r} in a supervised manner. We denote the training data of P samples as \{\bm{\tilde{X}}_{i}^{\natural},\bm{\tilde{Y}}_{i}\}_{i=1}^{P}, where \bm{\tilde{X}}_{i}^{\natural} and \bm{\tilde{Y}}_{i} are viewed as the label and the input, respectively. The following optimization problem is adopted for training a k-layer network,

\mathop{\operatorname{minimize}}\limits_{\bm{\Theta}_{0:k-1}}~\sum_{i=1}^{P}\|\bm{\tilde{X}}^{k}(\bm{\Theta}_{0:k-1},\bm{\tilde{Y}}_{i},\bm{\tilde{X}}^{0})-\bm{\tilde{X}}_{i}^{\natural}\|_{F}^{2}. (16)

However, such a large-scale optimization problem easily converges to a bad local minimum [40]. A layer-wise training strategy is therefore used to separate (16) into the following two parts, i.e., (17) and (18), where (17) aims to find a good initialization for (18) so as to avoid bad local minima of (16). In particular, for training a k-layer network, we denote \bm{\Theta}_{0:k-1} as the trainable parameters from layer 0 to layer k-1, and \bm{\Theta}_{k-1} as the trainable parameters in layer k-1. By fixing the already trained parameters \bm{\Theta}_{0:k-2} of the former (k-1)-layer network and initializing the parameters \bm{\Theta}_{k-1} as given in Table I, we first learn \bm{\Theta}_{k-1} with the learning rate \alpha_{0} by using the Adam algorithm [41] to solve the following optimization problem

\mathop{\operatorname{minimize}}\limits_{\bm{\Theta}_{k-1}}~\sum_{i=1}^{P}\|\bm{\tilde{X}}^{k}(\bm{\Theta}_{0:k-1},\bm{\tilde{Y}}_{i},\bm{\tilde{X}}^{0})-\bm{\tilde{X}}_{i}^{\natural}\|_{F}^{2}, (17)

where \bm{\tilde{X}}^{k}(\bm{\Theta}_{0:k-1},\bm{\tilde{Y}},\bm{\tilde{X}}^{0}) denotes the output of the k-layer network with input \bm{\tilde{Y}} and initial point \bm{\tilde{X}}^{0}. Then, using the parameters \bm{\Theta}_{k-1} obtained from (17) together with the fixed parameters \bm{\Theta}_{0:k-2} as the initialization, we tune all parameters \bm{\Theta}_{0:k-1} with a smaller learning rate \alpha_{1}<\alpha_{0} by using the Adam algorithm to solve

\mathop{\operatorname{minimize}}\limits_{\bm{\Theta}_{0:k-1}}~\sum_{i=1}^{P}\|\bm{\tilde{X}}^{k}(\bm{\Theta}_{0:k-1},\bm{\tilde{Y}}_{i},\bm{\tilde{X}}^{0})-\bm{\tilde{X}}_{i}^{\natural}\|_{F}^{2}. (18)

After applying this procedure, we obtain the parameters \bm{\Theta}_{0:k-1} for the whole k-layer network, and we can then proceed to train a (k+1)-layer network.
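The layer-wise strategy in (17) and (18) can be sketched as the following simplified PyTorch skeleton (our own code with assumed names, building on the LISTAGS sketch in Section III-A1 whose forward pass accepts the number of active layers); samples is a list of (\bm{\tilde{Y}}_{i},\bm{\tilde{X}}_{i}^{\natural}) tensor pairs.

```python
import torch

def layerwise_train(model, samples, epochs=200, lr0=5e-4):
    """Layer-wise training per (17)-(18) for the LISTAGS sketch in Section III-A1."""
    def loss(k):   # squared Frobenius-norm training loss of the k-layer network
        return sum(((model(Y, k) - X) ** 2).sum() for Y, X in samples)

    for k in range(1, model.K + 1):
        # (17): learn only the parameters of the new layer k-1, learning rate alpha_0
        new_params = [model.W1[k - 1], model.W2[k - 1], model.theta[k - 1]]
        opt = torch.optim.Adam(new_params, lr=lr0)
        for _ in range(epochs):
            opt.zero_grad(); err = loss(k); err.backward(); opt.step()
        # (18): fine-tune all parameters of the k-layer network, alpha_1 = 0.2 * alpha_0
        all_params = [p for i in range(k)
                      for p in (model.W1[i], model.W2[i], model.theta[i])]
        opt = torch.optim.Adam(all_params, lr=0.2 * lr0)
        for _ in range(epochs):
            opt.zero_grad(); err = loss(k); err.backward(); opt.step()
```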

In the testing stage, since the known preamble signature 𝑺\bm{S} remains unchanged, the proposed unrolled networks can recover the changing channels. Given a newly received signal 𝒀~\bm{\tilde{Y}}^{{}^{\prime}}, the learned unrolled networks, i.e., LISTA-GS in (10), LISTA-GSCP in (13) and ALISTA-GS in (15), are applied for group-sparse matrix estimation. For instance, with LISTA-GS, we obtain the (k+1)(k+1)-th layer recovered signal by using 𝑿~k+1=η(θk)((𝑾1k)𝒀~+(𝑾2k)𝑿~k)\bm{\tilde{X}}^{k+1}=\eta_{(\theta^{k})^{*}}((\bm{W}_{1}^{k})^{*}\bm{\tilde{Y}}^{{}^{\prime}}+(\bm{W}_{2}^{k})^{*}\bm{\tilde{X}}^{k}), where (θk),(𝑾1k)(\theta^{k})^{*},(\bm{W}_{1}^{k})^{*}, and (𝑾2k)(\bm{W}_{2}^{k})^{*} are the learned parameters of the kk-th layer.

Remark 2.

Most of the existing deep learning based approaches including the proposed LISTA variations rely on the assumption that the channel coefficients follow the same distribution during the training and testing stages, and may not work well in dynamic environment with different channel distributions over time. The continual learning and transfer learning technologies [42, 43] that have been developed for resource allocation and beamforming optimization in dynamic environment have the potential to address this issue. As these technologies are still in the early stage, we leave the proposed LISTA variations to recover the group-sparse matrix in dynamic environment for future work.

III-C Time Complexity and Number of Trainable Parameters

The time complexity of the three neural network structures is mainly due to matrix multiplications. For LISTA-GS, the evaluation of the matrix products \bm{W}_{1}^{k}\bm{\tilde{Y}} and \bm{W}_{2}^{k}\bm{\tilde{X}}^{k} requires \mathcal{O}(NLM+N^{2}M) time at each iteration. As for LISTA-GSCP and ALISTA-GS, the evaluation of (\bm{W}^{k})^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k}) and \gamma^{k}\bm{W}^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k}) requires \mathcal{O}(LNM) time at each iteration. Since the proposed methods converge faster than ISTA with the same per-iteration complexity, the overall computational cost is reduced.

For a K-layer RNN, ALISTA-GS contains only 2K trainable parameters \{\theta^{k},\gamma^{k}\}_{k=0}^{K-1}, while LISTA-GS and LISTA-GSCP require K(N^{2}+LN+1) variables \{\bm{W}_{1}^{k},\bm{W}_{2}^{k},\theta^{k}\}_{k=0}^{K-1} and K(LN+1) variables \{\bm{W}^{k},\theta^{k}\}_{k=0}^{K-1}, respectively. We summarize the initialization and the required numbers of trainable parameters for the K-layer networks in Table I. Since \theta^{k} and \gamma^{k} should be initialized to proper constants, we initialize \theta^{k}=0.1 and \gamma^{k}=1 in this paper.

TABLE I: Number of trainable parameters (params) and initialization in the K-layer RNN.
Network Trainable params Initialization Number of params
LISTA-GS {𝑾1k,𝑾2k,θk}k=0K1\{\bm{W}_{1}^{k},\bm{W}_{2}^{k},\theta^{k}\}_{k=0}^{K-1} 𝑾1k=1C𝑺~T\bm{W}_{1}^{k}=\frac{1}{C}\bm{\tilde{S}}^{T}, 𝑾2k=𝑰1C𝑺~T𝑺~,θk=0.1\bm{W}_{2}^{k}=\bm{I}-\frac{1}{C}\bm{\tilde{S}}^{T}\bm{\tilde{S}},\theta^{k}=0.1 K(N2+LN+1)K(N^{2}+LN+1)
LISTA-GSCP {𝑾k,θk}k=0K1\{\bm{W}^{k},\theta^{k}\}_{k=0}^{K-1} 𝑾k=1C𝑺~T,θk=0.1\bm{W}^{k}=\frac{1}{C}\bm{\tilde{S}}^{T},\theta^{k}=0.1 K(LN+1)K(LN+1)
ALISTA-GS {θk,γk}k=0K1\{\theta^{k},\gamma^{k}\}_{k=0}^{K-1} θk=0.1,γk=1.0\theta^{k}=0.1,\gamma^{k}=1.0 2K2K

IV Convergence Analysis

In this section, we provide the main theoretical results of this paper, i.e., the linear convergence rate of LISTA-GSCP (13) and ALISTA-GS (15), respectively. Since the proposed unrolled networks inherit the structure of ISTA-GS, they thus allow us to track the interpretability for such deep learning framework from the perspective of optimization. As a matter of fact, our proposed unrolled neural networks are extensions of [34, 32] to solve the group-sparse matrix estimation problems. Such dimension expansion and matrix structures bring the unique and formidable challenges for establishing the theoretical analysis on the proposed unrolled neural networks.

We first establish the convergence analysis of the LISTA-GSCP framework. We use \bm{\tilde{X}}^{k} to replace \bm{\tilde{X}}^{k}(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}}) for notational simplicity. The following theorem presents the convergence rate of LISTA-GSCP. For the theoretical analysis, we assume that the \ell_{2}-norm of every row of the signal \bm{\tilde{X}}^{\natural} and the Frobenius norm of the AWGN matrix \bm{\tilde{Z}} are bounded by \beta and \sigma, respectively [34, 32]. Furthermore, since each entry of the activity sequence \{a_{1},\ldots,a_{N}\} follows the Bernoulli distribution, we assume that the number of non-zero rows of the signal \bm{\tilde{X}}^{\natural} is bounded by a small number s [17]. For notational brevity, we assume that the signal \bm{\tilde{X}}^{\natural} and noise \bm{\tilde{Z}} belong to the set \mathcal{X}(\beta,s,\sigma)\coloneqq\big\{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\,\big|\,\|\bm{\tilde{X}}^{\natural}[i,:]\|_{2}\leq\beta,\forall i,\ \|\bm{\tilde{X}}^{\natural}\|_{2,0}\leq s,\ \|\bm{\tilde{Z}}\|_{F}\leq\sigma\big\}.

Theorem 2 (Convergence rate of LISTA-GSCP).

Given {𝐖k,θk}k=0\{\bm{W}^{k},\theta^{k}\}_{k=0}^{\infty}, let {𝐗~k}k=1\{\bm{\tilde{X}}^{k}\}_{k=1}^{\infty} be generated by LISTA-GSCP in (13) with an input 𝐘~\bm{\tilde{Y}} observed by (3) and initial point 𝐗~0=𝟎\bm{\tilde{X}}^{0}=\bm{0}. If ss is sufficiently small, then for all (𝐗~,𝐙~)𝒳(β,s,σ),(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma), we have the error bound:

\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{F}\leq s\beta\exp(-ck)+C\sigma, (19)

where c>0c>0 and C>0C>0 are constants that depend only on 𝐒~\bm{\tilde{S}} and ss.

In particular, for the noiseless case, (19) reduces to \|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{F}\leq s\beta\exp(-ck). Moreover, LISTA-GSCP converges at an \mathcal{O}(\log(\frac{1}{\epsilon})) rate, which is faster than the \mathcal{O}(\frac{1}{\epsilon}) rate of the original ISTA and the \mathcal{O}(\frac{1}{\sqrt{\epsilon}}) rate of Nesterov’s method [44].

The steps of proving Theorem 2 are summarized as follows:

  • A.

    “Good” parameters for learning. We define the “good” parameters that guarantee the linear convergence rate.

  • B.

    Error bound for one sample data. We establish an error bound for one sample (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}}). (To emphasize the signal and noise, we use (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}}) rather than \{\bm{\tilde{X}}^{\natural},\bm{\tilde{Y}}\} to denote one sample.)

  • C.

    Error bound for the whole data set. By taking the supremum over all (𝑿~,𝒁~)(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}}), we establish an error bound over the whole samples.

IV-A “Good” Parameters for Learning

In this subsection, we shall give the definition of “good” parameters for the LISTA-GSCP network structure that guarantees the linear convergence rate.

First, we need to introduce several fundamental definitions inspired by [34]. The first one is the mutual coherence [45] of 𝑺~\bm{\tilde{S}}, which characterizes the coherence between different columns of 𝑺~\bm{\tilde{S}}.

Definition 1.
  • (i)

    The mutual coherence of 𝑺~2L×2N\bm{\tilde{S}}\in\mathbb{R}^{2L\times 2N} with normalized columns is defined as

\mu(\bm{\tilde{S}})=\max_{i\neq j\atop 1\leq i,j\leq 2N}\big|(\bm{\tilde{S}}[:,i])^{T}\bm{\tilde{S}}[:,j]\big|. (20)
  • (ii)

    The generalized mutual coherence of 𝑺~2L×2N\bm{\tilde{S}}\in\mathbb{R}^{2L\times 2N} with normalized columns is defined as

\tilde{\mu}(\bm{\tilde{S}})=\inf_{\bm{W}\in\mathbb{R}^{2L\times 2N}\atop(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,i]=1,\forall i}\bigg\{\max_{i\neq j\atop 1\leq i,j\leq 2N}\Big|(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,j]\Big|\bigg\}. (21)

Lemma 11 in [34] tells us that there exists a matrix 𝑾2L×2N\bm{W}\in\mathbb{R}^{2L\times 2N} that attaches the infimum given in (21), i.e., 𝒲(𝑺~),\mathcal{W}(\bm{\tilde{S}})\neq\emptyset, where a set of “good” weight matrices is defined as

\mathcal{W}(\bm{\tilde{S}})\coloneqq\operatorname*{argmin}_{\bm{W}\in\mathbb{R}^{2L\times 2N}}\Big\{\|\bm{W}\|_{\max}\ \Big|\ (\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,i]=1,\forall i,\ \max_{i\neq j\atop 1\leq i,j\leq 2N}\big|(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,j]\big|=\tilde{\mu}(\bm{\tilde{S}})\Big\}. (22)

Then, we define the “good” parameters to be learned in LISTA-GSCP as follows.

Definition 2.

𝚯={𝑾k,θk}k=0\bm{\Theta}=\{\bm{W}^{k},\theta^{k}\}_{k=0}^{\infty} are called “good” parameters in LISTA-GSCP if they satisfy

\bm{W}^{k}\in\mathcal{W}(\bm{\tilde{S}}),\quad\theta^{k}=\tilde{\mu}\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}+\sigma C_{W},\quad\forall k\in\mathbb{N}, (23)

where CW=maxk0𝐖k2,1C_{W}=\underset{k\geq 0}{\max}~{}\|\bm{W}^{k}\|_{2,1} and μ~=μ~(𝐒~)\tilde{\mu}=\tilde{\mu}(\bm{\tilde{S}}).

In Definition 2, we propose the “good” choice of the learning parameters in LISTA-GSCP. In the following subsection, we prove that the sequence of “good” parameters lead to the conclusion (19) in Theorem 2.

IV-B Error Bound for One Sample

In this subsection, we give an upper bound of the recovery error for one sample (𝑿~,𝒁~)𝒳(β,s,σ)(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma). We first introduce the extra notation ψ\psi, to provide information about the group sparsity. For each 𝑿~2N×M\bm{\tilde{X}}\in\mathbb{R}^{2N\times M}, we define a function ψ:2N×M2N\psi:\mathbb{R}^{2N\times M}\rightarrow\mathbb{R}^{2N} as

\psi(\bm{\tilde{X}})=\begin{bmatrix}\|\bm{\tilde{X}}[1,:]\|_{2},\|\bm{\tilde{X}}[2,:]\|_{2},\cdots,\|\bm{\tilde{X}}[2N,:]\|_{2}\end{bmatrix}^{T}. (24)

Specifically, for a vector 𝒗=[v1,v2,,v2N]T2N\bm{v}=[v_{1},v_{2},\ldots,v_{2N}]^{T}\in\mathbb{R}^{2N}, we have ψ(𝒗)i=|vi|,for all i[2N].\psi(\bm{v})_{i}=|v_{i}|,\text{for all }i\in[2N]. Note that supp(ψ(𝑨)){\rm{supp}}(\psi(\bm{A})) can provide information about group sparsity of a given matrix 𝑨\bm{A}. By simple calculations, one can get the following lemma.

Lemma 1.

With 𝐗~2N×M\bm{\tilde{X}}\in\mathbb{R}^{2N\times M} and 𝐞=[1,1,,1]T\bm{e}=[1,1,\ldots,1]^{T}, we have

\psi(\eta_{\theta}(\bm{\tilde{X}}))=\max\{\psi(\bm{\tilde{X}})-\theta\bm{e},\,\bm{0}\}, (25)

By taking one sample (𝑿~,𝒁~)𝒳(β,s,σ)(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma) and letting =supp(ψ(𝑿~))\mathcal{I}={\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural})), we establish the error bound by two steps: (i) We show that there are no false positive rows in 𝑿~k\bm{\tilde{X}}^{k} for all kk. (ii) Since the no-false-positive property holds, we consider the component on \mathcal{I}.

In step (i), we prove the following lemma.

Lemma 2 (No-false-positive property).

With all assumptions in Theorem 2 and “good” parameters 𝚯\bm{\Theta}, we have

{\rm{supp}}(\psi(\bm{\tilde{X}}^{k}))\subset{\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural})),\quad\forall k. (26)
Proof.

Please refer to Appendix B. ∎

Lemma 2 shows that there are no false positive rows in \bm{\tilde{X}}^{k}. In other words, if (23) holds in LISTA-GSCP, then

\bm{\tilde{X}}^{k}[i,:]=\bm{0},\quad\forall i\notin\mathcal{I},~\forall k. (27)

This property implies that the recovery error of the component beyond \mathcal{I} turns to be 0.

In step (ii), we consider the component on \mathcal{I}. For all ii\in\mathcal{I}, the LISTA-GSCP in (13) gives

\bm{\tilde{X}}^{k+1}[i,:]=\eta_{\theta^{k}}\Big(\bm{\tilde{X}}^{k}[i,:]-(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,\mathcal{I}](\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:])+(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\Big)
\in\underbrace{\bm{\tilde{X}}^{k}[i,:]-(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,\mathcal{I}](\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:])}_{(T1)}+\underbrace{(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}}_{(T2)}-\underbrace{\theta^{k}\partial\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2}}_{(T3)}, (28)

where \partial\|\cdot\|_{2} is given in (51). It can be seen that (28) consists of three parts, (T1), (T2), and (T3). Since (\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,i]=1, (28) can be expressed as

\bm{\tilde{X}}^{k+1}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\in-\sum_{j\in\mathcal{I},j\neq i}(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,j]\Big(\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:]\Big)+(T2)-(T3). (29)

The definition of \partial\|\cdot\|_{2} in (51) shows that \|\partial\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2}\|_{2}\leq 1. Then, taking the norm on both sides of (29), we obtain that for all i\in\mathcal{I},

\big\|\bm{\tilde{X}}^{k+1}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\big\|_{2}\overset{\text{(a)}}{\leq}\sum_{j\in\mathcal{I},j\neq i}\big|(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,j]\big|\cdot\big\|\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:]\big\|_{2}+\theta^{k}+\big\|(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\big\|_{2}
\overset{\text{(b)}}{\leq}\tilde{\mu}\sum_{j\in\mathcal{I},j\neq i}\big\|\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:]\big\|_{2}+\theta^{k}+\big\|\bm{W}^{k}[:,i]\big\|_{2}\big\|\bm{\tilde{Z}}\big\|_{F}, (30)

where (a) follows from the triangle inequality, and (b) follows from the choice of “good” parameters in Definition 2 together with \big\|(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\big\|_{2}\leq\big\|\bm{W}^{k}[:,i]\big\|_{2}\|\bm{\tilde{Z}}\|_{F}. It is easy to check that (27) implies \big\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\big\|_{2,1}=\big\|\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\big\|_{2,1} for all k. Therefore, based on (30), it follows that

\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}=\sum_{i\in\mathcal{I}}\|\bm{\tilde{X}}^{k+1}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\|_{2}\leq\tilde{\mu}(|\mathcal{I}|-1)\big\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\big\|_{2,1}+|\mathcal{I}|\theta^{k}+\sigma C_{W}, (31)

where C_{W}=\underset{k\geq 0}{\max}~\|\bm{W}^{k}\|_{2,1}. The inequality (31) provides a recursive relation between the consecutive errors \|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1} and \|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}. Hence, we establish the recovery error bound in the mixed norm for one sample (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma).

IV-C Error Bound for Whole Data Set

In this subsection, we give an upper bound of the recovery error over the whole set of samples. Note that |\mathcal{I}|:=|{\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural}))|\leq s for all (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma). Taking the supremum over (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma) on both sides of (31), we have

\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq(s-1)\tilde{\mu}\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}+s\theta^{k}+\sigma C_{W}. (32)

Substituting the “good” parameter \theta^{k}=\tilde{\mu}\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}+\sigma C_{W} into (32), it follows that

\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq(2\tilde{\mu}s-\tilde{\mu})^{k+1}s\beta+\frac{(s+1)C_{W}}{1+\tilde{\mu}-2\tilde{\mu}s}\sigma,

provided that 2\tilde{\mu}s-\tilde{\mu}<1. By letting c=-\log(2\tilde{\mu}s-\tilde{\mu}) and C=\frac{(s+1)C_{W}}{1+\tilde{\mu}-2\tilde{\mu}s}, we have \sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq s\beta\exp(-c(k+1))+C\sigma. The fact that \|\bm{X}\|_{F}\leq\|\bm{X}\|_{2,1} for any matrix \bm{X} gives an upper bound of the error with respect to the Frobenius norm

\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{F}\leq\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq s\beta\exp(-c(k+1))+\sigma C. (33)

As long as s\leq(1+1/\tilde{\mu})/2 and c=-\log(2\tilde{\mu}s-\tilde{\mu})>0 hold, the error bound holds uniformly for all (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma). Then, for the k-layer network, we have

\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{F}\leq s\beta\exp(-ck)+\sigma C. (34)

Therefore, we complete the proof of Theorem 2.

IV-D Convergence Rate of ALISTA-GS

In this subsection, we first give the definition of the “good” parameters to be learned in the ALISTA-GS network. We then verify the convergence of ALISTA-GS for the noiseless case. For the noisy case, the analysis can be carried out following that of Theorem 2.

Definition 3.

\bm{\Theta}=\{\theta^{k},\gamma^{k}\}_{k=0}^{\infty} are called “good” parameters for all (\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,0) (noiseless case), with the weight matrix pre-computed in ALISTA-GS, if they satisfy, for each k\in\mathbb{N},

\bm{W}\in\mathcal{W}(\bm{\tilde{S}}),\quad\theta^{k}=\tilde{\mu}\gamma^{k}\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}. (35)
Theorem 3 (Convergence rate of ALISTA-GS).

Given {θk,γk}k=0\{\theta^{k},\gamma^{k}\}_{k=0}^{\infty} and 𝐖𝒲(𝐒~)\bm{W}\in\mathcal{W}(\bm{\tilde{S}}), let {𝐗~k}k=1\{\bm{\tilde{X}}^{k}\}_{k=1}^{\infty} be generated by ALISTA-GS in (15) with an input 𝐘~\bm{\tilde{Y}} observed by (3) and initial point 𝐗~0=𝟎\bm{\tilde{X}}^{0}=\bm{0}. If s(1+1/μ~)/2s\leq(1+1/\tilde{\mu})/2 for all (𝐗~,𝐙~)𝒳(β,s,0),(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,0), and

\gamma^{k}\in\Big(0,\frac{2}{1+2\tilde{\mu}s-\tilde{\mu}}\Big),\quad\theta^{k}=\tilde{\mu}\gamma^{k}\sup_{(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1},\quad\forall k\in\mathbb{N}, (36)

then we have

\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{F}\leq s\beta\exp\Big(-\sum_{\tau=0}^{k}c^{\tau}\Big), (37)

where c^{\tau}=-\log\big(\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu})+|1-\gamma^{\tau}|\big) is a positive constant.

Proof.

Please refer to Appendix C. ∎

Remark 3.

Optimally, if the factor cτc^{\tau} takes the maximum at γτ=1\gamma^{\tau}=1, i.e., cτlog(2μ~sμ~)c^{\tau}\equiv-\log(2\tilde{\mu}s-\tilde{\mu}), then ALISTA-GS enjoys a linear convergence rate.

V Numerical Results

In this section, we present the simulation results for the proposed three unrolled neural network structures for grant-free massive access in IoT networks. We first introduce the settings of the simulations and performance metrics and then present the simulation results to confirm the main theorems including the weight coupling structure and the convergence rate. Finally, we compare the recovery performance of the proposed methods with other popular CS-based methods.

Refer to caption
(a) Weight 𝑾2k(𝑰𝑾1k𝑺~)𝟎\bm{W}_{2}^{k}-(\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}})\rightarrow\bm{0}.
Refer to caption
(b) The threshold θk0\theta^{k}\rightarrow 0.
Figure 4: Validation of Theorem 1. LISTA-GS has weight coupling structure.

V-A Simulation Settings and Performance Metrics

In the simulations, the channels are assumed to suffer from independent Rayleigh fading, i.e., \bm{H}\sim\mathcal{C}\mathcal{N}(\bm{0},\bm{I}). The preamble signature matrix \bm{S} is fixed with each of its columns normalized, and the noise matrix \bm{Z} follows the Gaussian distribution with zero mean and variance \sigma^{2}. In addition, each entry of the activity sequence \{a_{1},\ldots,a_{N}\} is a random variable that follows the Bernoulli distribution with mean 0.1, i.e., \mathbb{P}(a_{n}=1)=0.1 and \mathbb{P}(a_{n}=0)=0.9, \forall n\in[N]. The transmit signal-to-noise ratio (SNR) of the system is defined as

\text{SNR}=\frac{\mathbb{E}[\|\bm{SX}\|_{F}^{2}]}{\mathbb{E}[\|\bm{Z}\|_{F}^{2}]}. (38)

By transforming all these complex-valued matrices into real-valued matrices according to (5), we obtain the training data set \{\bm{\tilde{X}}_{i}^{\natural},\bm{\tilde{Y}}_{i}\}_{i=1}^{P}. We choose K=12 layers unless otherwise stated for all the unrolled models in the simulations. The training set contains P=64 different samples. The learning rate \alpha_{0} is set to 5\times 10^{-4} and \alpha_{1}=0.2\alpha_{0}. In the validation and testing stages, 128 samples are generated to test the trained models (drawn independently of the training set, but from the same distribution).
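For reference, the simulated data can be generated roughly as in the NumPy sketch below (our own code with assumed function names): a column-normalized complex Gaussian preamble, Rayleigh-fading channels, Bernoulli(0.1) activity, and noise scaled to a target transmit SNR according to (38), after which the real-valued mapping in (5) is applied.

```python
import numpy as np

def generate_sample(L=100, N=200, M=30, snr_db=15.0, p_active=0.1, rng=None):
    """Draw one (S, X, Y) triple for the JADCE model (3) under the simulation setting."""
    rng = np.random.default_rng(rng)
    S = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2)
    S /= np.linalg.norm(S, axis=0, keepdims=True)          # normalized columns
    H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    a = rng.random(N) < p_active                            # Bernoulli(0.1) activity
    X = a[:, None] * H                                      # group-row-sparse ground truth
    signal = S @ X
    Z = (rng.standard_normal((L, M)) + 1j * rng.standard_normal((L, M))) / np.sqrt(2)
    Z *= np.sqrt(np.mean(np.abs(signal) ** 2)
                 / (10 ** (snr_db / 10.0) * np.mean(np.abs(Z) ** 2)))
    return S, X, signal + Z
```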

We compare our proposed unrolled networks, i.e., LISTA-GS, LISTA-GSCP and ALISTA-GS, with the following popular CS-based algorithms for the group-sparse matrix estimation problem:

  • ISTA-GS [36]: The vanilla ISTA for solving the multiple measurement vector (MMV) CS problem. The regularization parameter \lambda is set to 0.1.

  • Nesterov’s method [44]: A first-order method that solves an equivalent smooth convex reformulation of (6). The regularization parameter \lambda is set to 0.1.

  • AMP-MMV [15, 6]: A Bayesian AMP algorithm for solving the multiple measurement vector CS problem.

  • AMP MMSE-denoiser [13]: The vector AMP algorithm with minimum mean-squared error (MMSE) denoiser for user activity detection and channel estimation.

We adopt the normalized mean square error (NMSE) to evaluate the performance in recovering the real-valued 𝑿~\bm{\tilde{X}}, defined as

\text{NMSE}(\bm{\tilde{X}},\bm{\tilde{X}}^{\natural})=10\log_{10}\left(\frac{\mathbb{E}\|\bm{\tilde{X}}-\bm{\tilde{X}}^{\natural}\|_{F}^{2}}{\mathbb{E}\|\bm{\tilde{X}}^{\natural}\|_{F}^{2}}\right), (39)

where 𝑿~\bm{\tilde{X}}^{\natural} represents the ground truth and 𝑿~\bm{\tilde{X}} is the estimated value.
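The NMSE in (39) can be evaluated with a short helper such as the one below, where the empirical average over a batch of test samples plays the role of the expectation.

```python
import numpy as np

def nmse_db(X_hat, X_true):
    """Empirical NMSE (39) in dB; a leading batch dimension acts as the expectation."""
    num = np.sum(np.abs(X_hat - X_true) ** 2)
    den = np.sum(np.abs(X_true) ** 2)
    return 10.0 * np.log10(num / den)
```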

V-B Validation of Theorems

Refer to caption
Figure 5: NMSE versus iterations with SNR = 15 dB.
Refer to caption
Figure 6: NMSE of unrolled networks and other methods when condition number \kappa=2 and SNR = 15 dB.
Refer to caption
Figure 7: NMSE of unrolled networks and other methods when condition number \kappa=15 and SNR = 15 dB. AMP-MMV and AMP MMSE-denoiser fail in this case.
Refer to caption
Figure 8: NMSE versus SNR when the preamble signature \bm{S} is a complex Gaussian matrix in the noisy case.

In this subsection, we conduct simulations to validate the developed Theorems. We generate the preamble signature matrix according to the complex Gaussian distribution unless otherwise stated, i.e., \bm{S}\sim\mathcal{C}\mathcal{N}(\bm{0},\bm{I}). We set the length of the signature sequence, the total number of devices, and the number of antennas at the BS, i.e., L, N and M, to 100, 200, and 30, respectively.

Validation of Theorem 1. Theorem 1 is corroborated by the empirical results shown in Figs. 4(a) and 4(b). In Fig. 4, the values of \|\bm{W}_{2}^{k}-(\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}})\|_{2} and \theta^{k} in LISTA-GS are reported. We observe that as k increases, the values of \|\bm{W}_{2}^{k}-(\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}})\|_{2} and \theta^{k} approach 0. The simulations clearly validate Theorem 1: \bm{W}_{2}^{k}-(\bm{I}-\bm{W}_{1}^{k}\bm{\tilde{S}})\rightarrow\bm{0} and \theta^{k}\rightarrow 0 as k\rightarrow\infty.

Validations of Theorems 2 and 3. We examine the convergence rates of the proposed unrolled networks. Fig. 5 shows the NMSE of the proposed unrolled networks and other methods over the iterations in a noisy scenario with SNR = 15 dB. It shows that the proposed networks (almost) converge linearly, which validates Theorems 2 and 3. Furthermore, LISTA-GS converges fastest, but has the largest number of trainable parameters. Besides, we observe that the proposed unrolled networks outperform the other baseline algorithms in terms of NMSE.

TABLE II: Training time and estimation performance of the proposed unrolled networks.
Networks              LISTA-GS    LISTA-GSCP    ALISTA-GS
Training time (min)   160.13      105.31        87.09
NMSE (dB)             -24.08      -23.38        -23.30
TABLE III: Average running time per sample of various methods with 1212 iterations (layers).
Method      LISTA-GS    LISTA-GSCP    ALISTA-GS    ISTA-GS    Nesterov's method    AMP-MMV    AMP MMSE-denoiser
Time (s)    0.0078      0.0076        0.0076       0.0079     0.0087               0.2089     0.0731
Figure 9: NMSE versus number of devices.
Figure 10: Device activity detection error probability versus SNR.
Figure 11: NMSE versus SNR when preamble signature 𝑺\bm{S} is a binary matrix in the noisy case. AMP-MMV and AMP MMSE-denoiser fail in this case.
Figure 12: NMSE versus SNR when preamble signature 𝑺\bm{S} is a Zadoff-Chu sequence matrix in the noisy case. AMP-MMV and AMP MMSE-denoiser fail in this case.

Training Time Comparison. We compare the training time of the proposed unrolled networks. We train all the networks with 1212 layers and 6464 training samples in the noisy scenario with SNR = 1515 dB on a GeForce GTX 1080 GPU. Table II shows that ALISTA-GS is not only the fastest to train but also performs almost as well as LISTA-GS and LISTA-GSCP. With fewer trainable parameters, the training process becomes faster.

Computation Time Comparison. We compare the running time of the test stage for different methods. Once the proposed unrolled networks are trained, we can use them to recover many new samples of different channels. We run all the methods for 1212 iterations on an Intel(R) Core(TM) i7-8650U CPU @ 2.11 GHz and average over 100100 test samples, as shown in Table III. The running time of the proposed unrolled networks is very close to that of ISTA-GS and Nesterov’s method, which validates the time complexity analysis in Section III-C. In addition, the proposed unrolled networks run faster than the AMP-based algorithms. This benefits from the low computational complexity per iteration of ISTA-GS.

Convergence Performance with Ill-Conditioned Preamble Signature. We now consider the case in which the preamble signature matrix 𝑺\bm{S} is ill-conditioned, i.e., has a large condition number κ\kappa. We consider fixed signature matrices 𝑺\bm{S} with condition numbers κ=2\kappa=2 and 1515. To obtain a signature matrix 𝑺\bm{S} with condition number κ=2\kappa=2 or 1515, we first sample a matrix 𝑨L×N\bm{A}\in\mathbb{C}^{L\times N}, i.e., 𝑨𝒞𝒩(𝟎,𝑰)\bm{A}\sim\mathcal{C}\mathcal{N}(\bm{0},\bm{I}). Then, we compute the singular value decomposition 𝑨=𝑼𝚺𝑽\bm{A}=\bm{U\varSigma V^{\natural}} and replace 𝚺\bm{\varSigma} with a new singular value matrix whose condition number equals the desired κ\kappa.
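The following sketch illustrates this construction; the geometric spacing of the replacement singular values is an assumption, since only the target condition number is specified above.

```python
import numpy as np

def ill_conditioned_signature(L=100, N=200, kappa=15.0, seed=None):
    """Build a preamble signature S with prescribed condition number kappa,
    following the SVD-based procedure described above. The geometric spacing of
    the new singular values is an assumption; any spectrum with max/min = kappa works."""
    rng = np.random.default_rng(seed)
    A = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2)
    U, _, Vh = np.linalg.svd(A, full_matrices=False)
    sigma = np.geomspace(kappa, 1.0, num=min(L, N))   # sigma_max / sigma_min = kappa
    return U @ np.diag(sigma) @ Vh
```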

Fig. 6 and Fig. 7 show the NMSE when the signature matrix 𝑺\bm{S} has condition number κ=2\kappa=2 and κ=15\kappa=15, respectively, with SNR =15=15 dB. The baselines AMP-MMV and AMP MMSE-denoiser fail when κ=15\kappa=15 in the simulations. Comparing the cases κ=2\kappa=2 and κ=15\kappa=15, the proposed unrolled networks remain stable, whereas the outputs of ISTA-GS and Nesterov’s method diverge when κ=15\kappa=15. These results show that the proposed unrolled networks are robust to ill-conditioned preamble signature matrices.

V-C Performance of Proposed Unrolled Neural Networks

In this subsection, we evaluate the performance in terms of NMSE and compare the proposed methods with other classic CS-based methods. We vary the SNR from 0 to 1010 dB, under which all the algorithms under consideration reach a stable solution in the simulations, and we set K=12K=12. A smaller NMSE value indicates better JADCE recovery performance. In these cases, we set L,NL,N and MM to 9090, 300300, and 100100, respectively, which is suitable for grant-free massive access.

Gaussian Matrix as Preamble Signature. We conduct simulations for the case in which the preamble signature matrix 𝑺\bm{S} is a complex Gaussian matrix, i.e., 𝑺𝒞𝒩(𝟎,𝑰)\bm{S}\sim\mathcal{C}\mathcal{N}(\bm{0},\bm{I}). Fig. 8 shows the impact of the SNR on the NMSE of the proposed unrolled networks and the baseline algorithms. First, the NMSE decreases as the SNR grows, which implies that the JADCE performance improves as the SNR increases. Second, the proposed network structures achieve a much lower NMSE than the other CS-based methods for different values of SNR. In addition, LISTA-GS outperforms LISTA-GSCP and ALISTA-GS, which indicates that more trainable parameters lead to better NMSE performance.

Fig. 9 shows the NMSE versus the number of devices for the proposed unrolled networks and the baseline algorithms with L=90L=90 and SNR = 1515 dB. The NMSE increases as the number of devices increases for all the methods, while the proposed unrolled networks outperform the other baseline algorithms. Fig. 10 shows that the proposed unrolled networks achieve a lower device activity detection error probability than the other baseline methods, which further verifies the superiority of the proposed unrolled networks.

Binary Matrix as Preamble Signature. We conduct simulations for the case in which the preamble sequence matrix 𝑺\bm{S} is a binary matrix, i.e., 𝑺{±1}L×N\bm{S}\in\{\pm 1\}^{L\times N}. Each entry of 𝑺\bm{S} is selected uniformly at random from {±1}\{\pm 1\}, which is closely related to Code Division Multiple Access (CDMA) communication systems [41]. Fig. 11 shows the NMSE performance of the proposed unrolled networks and the baseline methods. The proposed networks remain stable, whereas AMP-MMV and AMP MMSE-denoiser fail to solve the JADCE problem when the preamble sequence matrix 𝑺\bm{S} is binary. The proposed unrolled networks outperform the baseline algorithms by more than 33 dB in NMSE for different values of SNR, which demonstrates the robustness of the proposed methods in the non-Gaussian preamble case.
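A minimal sketch of this binary preamble construction is given below; whether any column normalization is applied afterwards is an assumption left unspecified here.

```python
import numpy as np

# Binary (CDMA-like) preamble signature: entries drawn i.i.d. and uniformly from {+1, -1}.
rng = np.random.default_rng(0)
L, N = 90, 300
S_binary = rng.choice([-1.0, 1.0], size=(L, N))
```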

Zadoff-Chu Sequences as Preamble Signature. In this case, we carry out simulations when the preamble signature 𝑺\bm{S} is composed of Zadoff-Chu sequences [8], which are widely used in practical systems, e.g., 4G LTE. Fig. 12 shows the NMSE performance of the proposed unrolled networks and the other baseline CS-based methods. The proposed networks remain stable, whereas the AMP-based algorithms fail to solve the JADCE problem when 𝑺\bm{S} is a Zadoff-Chu sequence matrix. We observe that the proposed unrolled networks achieve a much better NMSE performance than the baseline methods, and the gap widens as the SNR increases. Among the proposed structures, LISTA-GS achieves the best performance, while LISTA-GSCP performs similarly to ALISTA-GS. This result further demonstrates the robustness of the proposed methods in practical settings.
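For illustration, one possible Zadoff-Chu preamble construction is sketched below; the choice of an odd sequence length and the mapping of devices to root/cyclic-shift pairs are assumptions, since the exact construction used in the simulations is not detailed in this section.

```python
import numpy as np

def zadoff_chu(root, length):
    """Zadoff-Chu sequence of odd `length` with root index `root`, gcd(root, length) = 1."""
    n = np.arange(length)
    return np.exp(-1j * np.pi * root * n * (n + 1) / length)

def zc_signature(L=89, N=300):
    """Assemble an L x N preamble matrix from Zadoff-Chu sequences.

    ZC sequences are commonly defined for odd (often prime) lengths, so an odd L
    close to the simulated value is used here; assigning one (root, cyclic shift)
    pair per device is likewise an assumption about the mapping to columns.
    """
    roots = [u for u in range(1, L) if np.gcd(u, L) == 1]
    cols = []
    for dev in range(N):
        u = roots[dev % len(roots)]
        shift = dev // len(roots)
        cols.append(np.roll(zadoff_chu(u, L), shift))
    return np.stack(cols, axis=1)
```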

In summary, the simulations demonstrate the effectiveness of the proposed unrolled networks in the following aspects:

  • The proposed unrolled networks converge faster than the robust algorithms such as ISTA-GS and Nesterov’s method, yielding much lower computational complexity.

  • Simulations of different preambles show that the proposed unrolled networks are more robust compared with the computationally efficient algorithms such as AMP-based algorithms.

  • The proposed unrolled networks achieve better JADCE performance compared with the baseline CS-based algorithms.

VI Conclusions

In this paper, we proposed a novel unrolled deep neural network framework that enjoys a linear convergence rate, low computational complexity and high robustness for solving the JADCE problem in grant-free massive access for IoT networks. We introduced the first unrolled network structure by mapping the iterative algorithm ISTA-GS to an unrolled RNN, thereby improving the convergence rate through end-to-end training. To make the training procedure more efficient and to tackle the interpretability issue of deep learning, we proposed two simplified unrolled network structures with fewer trainable parameters and proved the linear convergence rate of these methods. Extensive simulations were conducted to verify the effectiveness of the proposed unrolled networks in terms of the convergence rate, robustness and estimation accuracy for the JADCE problem.

Appendix A Proof of theorem 1

Proof.

We first prove that the threshold value θk\theta^{k} converges to zero, and then show that the weights {𝑾1k,𝑾2k}k=0\{\bm{W}_{1}^{k},\bm{W}_{2}^{k}\}_{k=0}^{\infty} in LISTA-GS satisfy the weight coupling property.

(1) We verify that θk0\theta^{k}\to 0 as kk\to\infty in (11). We define a subset of 𝒳(β,s,σ)\mathcal{X}(\beta,s,\sigma) for a given 0<β~β0<\tilde{\beta}\leq\beta as

𝒳~(β,β~,s,σ):={(𝑿~,𝒁~)𝒳(β,s,σ)||supp(ψ(𝑿~))|s,\displaystyle\tilde{\mathcal{X}}(\beta,\tilde{\beta},s,\sigma):=\big{\{}(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma)|~{}|{\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural}))|\leq s,
β~𝑿~[i,:]2β,isupp(ψ(𝑿~))}.\displaystyle\tilde{\beta}\leq\big{\|}\bm{\tilde{X}}^{\natural}[i,:]\big{\|}_{2}\leq\beta,\forall i\in{\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural}))\big{\}}.

Clearly, 𝒳~(β,β~,s,0)𝒳(β,s,0)\tilde{\mathcal{X}}(\beta,\tilde{\beta},s,0)\subset\mathcal{X}(\beta,s,0). Since 𝑿~k𝑿~\bm{\tilde{X}}^{k}\rightarrow\bm{\tilde{X}}^{\natural} uniformly for all (𝑿~,𝟎)𝒳(β,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\mathcal{X}(\beta,s,0), the convergence is also uniform on 𝒳~(β,β~,s,0)\tilde{\mathcal{X}}(\beta,\tilde{\beta},s,0) for any β~β\tilde{\beta}\leq\beta. Then, there exists K1K_{1}\in\mathbb{N} such that for all (𝑿~,𝟎)𝒳~(β,β/10,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta,\beta/10,s,0) and all kK1k\geq K_{1},

|𝑿~k[i,:]2𝑿~[i,:]2|<β10i[2N].\Big{|}\big{\|}\bm{\tilde{X}}^{k}[i,:]\big{\|}_{2}-\big{\|}\bm{\tilde{X}}^{\natural}[i,:]\big{\|}_{2}\Big{|}<\frac{\beta}{10}\quad\forall i\in[2N].

Then we have sign(ψ(𝑿~k))=sign(ψ(𝑿~)),kK1.\operatorname{sign}(\psi(\bm{\tilde{X}}^{k}))=\operatorname{sign}(\psi(\bm{\tilde{X}}^{\natural})),\forall k\geq K_{1}. Recall the recurrence relation 𝑿~k+1=ηθk(𝑾1k𝒀~+𝑾2k𝑿~k)\bm{\tilde{X}}^{k+1}=\eta_{\theta^{k}}(\bm{W}_{1}^{k}\bm{\tilde{Y}}+\bm{W}_{2}^{k}\bm{\tilde{X}}^{k}) and let =supp(ψ(𝑿~))\mathcal{I}={\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural})). From the uniform convergence of 𝑿~k\bm{\tilde{X}}^{k}, it follows that for any kK1k\geq K_{1} and (𝑿~,𝟎)𝒳~(β,β/10,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta,\beta/10,s,0), we have

𝑿~k+1[,:]=𝑸~kθkdiag(ψ(𝑸~k))1𝑸~k,\displaystyle\bm{\tilde{X}}^{k+1}[\mathcal{I},:]=\bm{\tilde{Q}}^{k}-\theta^{k}{\rm{diag}}(\psi(\bm{\tilde{Q}}^{k}))^{-1}\bm{\tilde{Q}}^{k}, (40)

where 𝑸~k=𝑾2k[,]𝑿~k[,:]+𝑾1k[,:]𝒀~\bm{\tilde{Q}}^{k}=\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]\bm{\tilde{X}}^{k}[\mathcal{I},:]+\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{Y}}.
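To make the row-wise action in (40) concrete, the short numerical sketch below checks that, on rows whose 2\ell_{2} norm exceeds θk\theta^{k}, the group soft-thresholding operator (assumed here to be the standard form of ηθ\eta_{\theta}) reduces to the expression in (40).

```python
import numpy as np

def msto(Q, theta):
    """Row-wise multidimensional shrinkage-thresholding operator: shrink each row
    of Q toward zero by theta in l2 norm (standard group soft-thresholding,
    assumed to match the eta_theta used in the paper)."""
    norms = np.linalg.norm(Q, axis=1, keepdims=True)
    return np.maximum(1.0 - theta / np.maximum(norms, 1e-12), 0.0) * Q

# Numerical check of (40): on rows whose l2 norm exceeds theta, eta_theta acts as
# Q - theta * diag(psi(Q))^{-1} Q, i.e., Q[i, :] * (1 - theta / ||Q[i, :]||_2).
rng = np.random.default_rng(0)
theta = 0.1
Q = rng.standard_normal((5, 4)) + 2.0        # rows with norm well above theta
rhs = Q - theta * Q / np.linalg.norm(Q, axis=1, keepdims=True)
assert np.allclose(msto(Q, theta), rhs)
```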

Also, the uniform convergence of 𝑿~k\bm{\tilde{X}}^{k} implies that for any ε>0\varepsilon>0 and (𝑿~,𝟎)𝒳~(β,β/10,s,0),(\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta,\beta/10,s,0), there exists K2K_{2}\in\mathbb{N} such that if kK2k\geq K_{2}, then 𝑿~k[,:]𝑿~[,:]F<ε\|\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\|_{F}<\varepsilon. Denote 𝑬~k=𝑿~k[,:]𝑿~[,:]||×M\bm{\tilde{E}}^{k}=\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\in\mathbb{R}^{|\mathcal{I}|\times M} for each kk\in\mathbb{N}. That is, 𝑬~kF<ε\|\bm{\tilde{E}}^{k}\|_{F}<\varepsilon for all kK2k\geq K_{2}.

Since the noise is assumed to be zero, we have 𝒀~=𝑺~𝑿~\bm{\tilde{Y}}=\bm{\tilde{S}}\bm{\tilde{X}}^{\natural}. Then from (40) we obtain

θkdiag(ψ(𝑸~k))1𝑸~k[:,j]22=ξjk+\displaystyle\|\theta^{k}{\rm{diag}}(\psi(\bm{\tilde{Q}}^{k}))^{-1}\bm{\tilde{Q}}^{k}[:,{j}]\|_{2}^{2}=\xi_{j}^{k}+
(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,j]22,\displaystyle\big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},{j}]\big{\|}_{2}^{2}, (41)

where ξjk=(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,j]+𝑾2k[,]𝑬~k[:,j]𝑬~k+1[:,j]22(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,j]22.\xi_{j}^{k}=\big{\|}-\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},{j}]+\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]\bm{\tilde{E}}^{k}[:,j]-\bm{\tilde{E}}^{k+1}[:,j]\big{\|}_{2}^{2}-\big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},{j}]\big{\|}_{2}^{2}.

We now show that |ξjk||\xi_{j}^{k}| is sufficiently small for kK2k\geq K_{2}. We introduce the notation

𝒑jk=(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,j],\displaystyle\bm{p}_{j}^{k}=(\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},{j}],\quad
𝒒jk=𝑾2k[,]𝑬~k[:,j]𝑬~k+1[:,j].\displaystyle\bm{q}_{j}^{k}=\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]\bm{\tilde{E}}^{k}[:,j]-\bm{\tilde{E}}^{k+1}[:,j].

The value ξjk\xi_{j}^{k} can be rewritten as 𝒒jk222𝒑jk,𝒒jk,\|\bm{q}_{j}^{k}\|_{2}^{2}-2\langle\bm{p}_{j}^{k},\bm{q}_{j}^{k}\rangle, where ,\langle\cdot,\cdot\rangle is the inner product in ||\mathbb{R}^{|\mathcal{I}|}. Then the Cauchy–Schwarz inequality implies |ξjk|(2𝒑jk2+1)𝒒jk2|\xi_{j}^{k}|\leq(2\|\bm{p}_{j}^{k}\|_{2}+1)\|\bm{q}_{j}^{k}\|_{2}, provided that 𝒒jk21\|\bm{q}_{j}^{k}\|_{2}\leq 1, which holds for sufficiently small ε\varepsilon by (42) below.

From the fact that 𝑨2𝑨F\|\bm{A}\|_{2}\leq\|\bm{A}\|_{F} for any matrix 𝑨\bm{A} and the triangle inequality it follows that if kK2k\geq K_{2}, then

𝒒jk2\displaystyle\|\bm{q}_{j}^{k}\|_{2} 𝑾2k[,]2𝑬~k[:,j]2+𝑬~k+1[:,j]2\displaystyle\leq\|\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]\|_{2}\|\bm{\tilde{E}}^{k}[:,j]\|_{2}+\|\bm{\tilde{E}}^{k+1}[:,j]\|_{2}
(CW2+1)ε.\displaystyle\leq(C_{W_{2}}+1)\varepsilon. (42)

Note that

𝒑jk22𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,]22𝑿~[,j]22.\displaystyle\|\bm{p}_{j}^{k}\|_{2}^{2}\leq\big{\|}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{\|}_{2}^{2}\big{\|}\bm{\tilde{X}}^{\natural}[\mathcal{I},{j}]\big{\|}_{2}^{2}. (43)

Then it follows that

j=1M𝒑jk22\displaystyle\sum_{j=1}^{M}\|\bm{p}_{j}^{k}\|_{2}^{2} (1+𝑾2kF+𝑾1kF𝑺~F)2𝑿~F2\displaystyle\leq\Big{(}1+\big{\|}\bm{W}_{2}^{k}\big{\|}_{F}+\big{\|}\bm{W}_{1}^{k}\big{\|}_{F}\big{\|}\bm{\tilde{S}}\big{\|}_{F}\Big{)}^{2}\big{\|}\bm{\tilde{X}}^{\natural}\big{\|}_{F}^{2}
(1+CW2+CW1𝑺~F)2𝑿~F2.\displaystyle\leq\Big{(}1+C_{W_{2}}+C_{W_{1}}\big{\|}\bm{\tilde{S}}\big{\|}_{F}\Big{)}^{2}\big{\|}\bm{\tilde{X}}^{\natural}\big{\|}_{F}^{2}. (44)

By Cauchy-Schwarz inequality, it holds that

j=1M𝒑jk2\displaystyle\sum_{j=1}^{M}\|\bm{p}_{j}^{k}\|_{2} Mj=1M𝒑jk22\displaystyle\leq\sqrt{M}\sqrt{\sum_{j=1}^{M}\|\bm{p}_{j}^{k}\|_{2}^{2}}
M(1+CW2+CW1𝑺~F)𝑿~F.\displaystyle\leq\sqrt{M}\Big{(}1+C_{W_{2}}+C_{W_{1}}\big{\|}\bm{\tilde{S}}\big{\|}_{F}\Big{)}\big{\|}\bm{\tilde{X}}^{\natural}\big{\|}_{F}. (45)

Thus, by (42) and (45)

j=1M|ξjk|\displaystyle\sum_{j=1}^{M}\big{|}\xi_{j}^{k}\big{|} j=1M(2𝒑jk2+1)𝒒jk2\displaystyle\leq\sum_{j=1}^{M}(2\|\bm{p}_{j}^{k}\|_{2}+1)\|\bm{q}_{j}^{k}\|_{2}
(CW2+1)εj=1M(2𝒑jk2+1)\displaystyle\leq(C_{W_{2}}+1)\varepsilon\sum_{j=1}^{M}(2\|\bm{p}_{j}^{k}\|_{2}+1)
(CW2+1)(2C+M)ε,\displaystyle\leq(C_{W_{2}}+1)(2C+M)\varepsilon, (46)

where C=M(1+CW2+CW1𝑺~F)𝑿~FC=\sqrt{M}\Big{(}1+C_{W_{2}}+C_{W_{1}}\big{\|}\bm{\tilde{S}}\big{\|}_{F}\Big{)}\big{\|}\bm{\tilde{X}}^{\natural}\big{\|}_{F}. Then, by (41) we have

(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,:]F2\displaystyle\big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\big{\|}_{F}^{2}
=j=1Mξjk+|θk|2i=1||1𝑸~k[i,:]2𝑸~k[i,:]22\displaystyle=-\sum_{j=1}^{M}\xi_{j}^{k}+\big{|}\theta^{k}\big{|}^{2}\sum_{i=1}^{|\mathcal{I}|}\Big{\|}\frac{1}{\|\bm{\tilde{Q}}^{k}[i,:]\|_{2}}\bm{\tilde{Q}}^{k}[i,:]\Big{\|}_{2}^{2}

Therefore, since each normalized row 𝑸~k[i,:]/𝑸~k[i,:]2\bm{\tilde{Q}}^{k}[i,:]/\|\bm{\tilde{Q}}^{k}[i,:]\|_{2} has unit 2\ell_{2} norm, the sum over ii equals |||\mathcal{I}| and we get

(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,:]F2\displaystyle\big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\big{\|}_{F}^{2}
=j=1Mξjk+|θk|2||.\displaystyle=-\sum_{j=1}^{M}\xi_{j}^{k}+|\theta^{k}|^{2}|\mathcal{I}|. (47)

For any (𝑿~,𝟎)𝒳~(β/2,β/10,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta/2,\beta/10,s,0), it holds that (2𝑿~,𝟎)𝒳~(β,β/10,s,0)(2\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta,\beta/10,s,0). Thus, the above argument applies to (2𝑿~,𝟎)(2\bm{\tilde{X}}^{\natural},\bm{0}) whenever (𝑿~,𝟎)𝒳~(β/2,β/10,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\tilde{\mathcal{X}}(\beta/2,\beta/10,s,0). Substituting 𝑿~\bm{\tilde{X}}^{\natural} with 2𝑿~2\bm{\tilde{X}}^{\natural} in (47), we have

4(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,:]F2\displaystyle 4\big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\big{\|}_{F}^{2}
=j=1Mξ^jk+|θk|2||.\displaystyle=-\sum_{j=1}^{M}\hat{\xi}_{j}^{k}+|\theta^{k}|^{2}|\mathcal{I}|. (48)

From (47) and (48), one can get 3|θk|2||=4j=1Mξjkj=1Mξ^jk.3|\theta^{k}|^{2}|\mathcal{I}|=4\sum_{j=1}^{M}{\xi}_{j}^{k}-\sum_{j=1}^{M}\hat{\xi}_{j}^{k}. Taking the absolute value of both sides and applying (46), we obtain, for all kmax{K1,K2}k\geq\max\{K_{1},K_{2}\},

|θk|25(CW2+1)(2C+M)3||ε.\displaystyle|\theta^{k}|^{2}\leq\frac{5(C_{W_{2}}+1)(2C+M)}{3|\mathcal{I}|}\varepsilon. (49)

Moreover, since θk\theta^{k} is the MSTO threshold parameter, we have θk0\theta^{k}\geq 0. Therefore, θk0\theta^{k}\rightarrow 0 as kk\rightarrow\infty.

(2) We prove that 𝑰𝑾2k𝑾1k𝑺~𝟎\bm{I}-\bm{W}_{2}^{k}-\bm{W}_{1}^{k}\bm{\tilde{S}}\rightarrow\bm{0} as kk\rightarrow\infty. LISTA-GS model (9) gives

𝑿~k+1[,:]=ηθk(𝑾1k[,:]𝑺~𝑿~+𝑾2k[,:]𝑿~k)\displaystyle\bm{\tilde{X}}^{k+1}[\mathcal{I},:]=\eta_{\theta^{k}}\big{(}\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}\bm{\tilde{X}}^{\natural}+\bm{W}_{2}^{k}[\mathcal{I},:]\bm{\tilde{X}}^{k}\big{)}
𝑾1k[,:]𝑺~𝑿~+𝑾2k[,:]𝑿~kθk2,1(𝑿~k+1[,:]),\displaystyle\in\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}\bm{\tilde{X}}^{\natural}+\bm{W}_{2}^{k}[\mathcal{I},:]\bm{\tilde{X}}^{k}-\theta^{k}\partial\ell_{2,1}(\bm{\tilde{X}}^{k+1}[\mathcal{I},:]),

where 2,1(𝑿~)\partial\ell_{2,1}(\bm{\tilde{X}}) is the sub-gradient of 𝑿~2,1\|\bm{\tilde{X}}\|_{2,1}. It is a set defined for each row as follows:

2,1(𝑿)=[𝑿[1,:]2,𝑿[2,:]2,,𝑿[2N,:]2]T,\displaystyle\partial\ell_{2,1}(\bm{X})=\begin{bmatrix}\partial\|\bm{X}[1,:]\|_{2},\partial\|\bm{X}[2,:]\|_{2},\cdots,\partial\|\bm{X}[2N,:]\|_{2}\end{bmatrix}^{T}, (50)

where

𝑿[n,:]2={{𝑿[n,:]𝑿[n,:]2}if 𝑿[n,:]𝟎,{𝒉M|𝒉21}if 𝑿[n,:]=𝟎.\displaystyle\partial\|\bm{X}[n,:]\|_{2}=\begin{cases}\big{\{}\frac{\bm{X}[n,:]}{\|\bm{X}[n,:]\|_{2}}\big{\}}\quad\text{if }\bm{X}[n,:]\neq\bm{0},\\ \{\bm{h}\in\mathbb{R}^{M}|~{}\|\bm{h}\|_{2}\leq 1\}\quad\text{if }\bm{X}[n,:]=\bm{0}.\end{cases} (51)

If kmax{K1,K2}k\geq\max\{K_{1},K_{2}\} then from (49) and (46)

(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])𝑿~[,j]2\displaystyle\Big{\|}\big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\big{)}\bm{\tilde{X}}^{\natural}[\mathcal{I},j]\Big{\|}_{2}
((CW2+1)(2C+M)+5(CW2+1)(2C+M)3||)ε.\displaystyle\leq\Big{(}(C_{W_{2}}+1)(2C+M)+\frac{5(C_{W_{2}}+1)(2C+M)}{3|\mathcal{I}|}\Big{)}\varepsilon.

From the definition of operator norm we have that

σmax(𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,])\displaystyle\sigma_{\max}\Big{(}\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\Big{)}
1β((CW2+1)(2C+M)+5(CW2+1)(2C+M)3||)ε.\displaystyle\leq\frac{1}{\beta}\Big{(}(C_{W_{2}}+1)(2C+M)+\frac{5(C_{W_{2}}+1)(2C+M)}{3|\mathcal{I}|}\Big{)}\varepsilon.

Thus, for s2s\geq 2, 𝑰𝑾2k[,]𝑾1k[,:]𝑺~[:,]𝟎\bm{I}-\bm{W}_{2}^{k}[\mathcal{I},\mathcal{I}]-\bm{W}_{1}^{k}[\mathcal{I},:]\bm{\tilde{S}}[:,\mathcal{I}]\rightarrow\bm{0} uniformly for all \mathcal{I} with 2||s2\leq|\mathcal{I}|\leq s. Therefore, 𝑰𝑾2k𝑾1k𝑺~𝟎\bm{I}-\bm{W}_{2}^{k}-\bm{W}_{1}^{k}\bm{\tilde{S}}\rightarrow\bm{0} as kk\rightarrow\infty. ∎

Appendix B Proof of Lemma 1

We will show that the no-false-positive property holds for LISTA-GSCP.

Proof.

Let (𝑿~,𝒁~)𝒳(β,s,σ)(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,\sigma) and =supp(ψ(𝑿~))\mathcal{I}={\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural})). We prove that if (23) holds, then 𝑿~k[i,:]=𝟎,i,k.\bm{\tilde{X}}^{k}[i,:]=\bm{0},\forall i\notin\mathcal{I},\forall k.

(i) When k=0k=0, it is satisfied since 𝑿~0=𝟎\bm{\tilde{X}}^{0}=\bm{0}.

(ii) Suppose that 𝑿~k[i,:]=𝟎\bm{\tilde{X}}^{k}[i,:]=\bm{0} for all ii\notin\mathcal{I}. Then by (13) it follows that

𝑿~k+1[i,:]=ηθk(𝑿~k[i,:](𝑾k[:,i])T𝑺~(𝑿~k𝑿~)\displaystyle\bm{\tilde{X}}^{k+1}[i,:]=\eta_{\theta^{k}}\Big{(}\bm{\tilde{X}}^{k}[i,:]-(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}(\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural})
+(𝑾k[:,i])T𝒁~)\displaystyle+(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\Big{)}

for all ii\notin\mathcal{I}. Then by the triangle inequality it follows that

(𝑾k[:,i])T𝑺~(𝑿~k𝑿~)+(𝑾k[:,i])T𝒁~2\displaystyle\big{\|}-(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}(\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural})+(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\big{\|}_{2}
j|(𝑾k[:,i])T𝑺~[:,j]|𝑿~k[j,:]𝑿~[j,:]2+(𝑾k[:,i])T𝒁~2\displaystyle{\leq}\sum_{j\in\mathcal{I}}\big{|}(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}[:,j]\big{|}\big{\|}\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:]\big{\|}_{2}+\big{\|}(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\big{\|}_{2}
μ~𝑿~k𝑿~2,1+CW𝒁~F.\displaystyle{\leq}\tilde{\mu}\big{\|}\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\big{\|}_{2,1}+C_{W}\|\bm{\tilde{Z}}\|_{F}.

Since θk=μ~sup𝑿~,𝒁~{𝑿~k𝑿~2,1}+CWσ\theta^{k}=\tilde{\mu}\sup_{\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}}}\{\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}\}+C_{W}\sigma and 𝑾k𝒲(𝑺~)\bm{W}^{k}\in\mathcal{W}(\bm{\tilde{S}}), it holds for all ii\notin\mathcal{I} that

θk(𝑾k[:,i])T𝑺~(𝑿~k𝑿~)+(𝑾k[:,i])T𝒁~2,\displaystyle\theta^{k}\geq\big{\|}-(\bm{W}^{k}[:,i])^{T}\bm{\tilde{S}}(\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural})+(\bm{W}^{k}[:,i])^{T}\bm{\tilde{Z}}\big{\|}_{2},

which implies 𝑿~k+1[i,:]2=0,i\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2}=0,\forall i\notin\mathcal{I} by the definition of ηθk.\eta_{\theta^{k}}. Therefore, by induction, we have 𝑿~k[i,:]2=0,i,k.\|\bm{\tilde{X}}^{k}[i,:]\|_{2}=0,\forall i\notin\mathcal{I},\forall k.

Appendix C Proof of Theorem 3

Proof.

Similar to the proof of Theorem 2, we first show that there are no false positives in 𝑿~k\bm{\tilde{X}}^{k}. Take an arbitrary (𝑿~,𝒁~)𝒳(β,s,0)(\bm{\tilde{X}}^{\natural},\bm{\tilde{Z}})\in\mathcal{X}(\beta,s,0) and let =supp(ψ(𝑿~))\mathcal{I}={\rm{supp}}(\psi(\bm{\tilde{X}}^{\natural})). We prove the no-false-positive property by induction.

(i) When k=0k=0, it is satisfied since 𝑿~0=𝟎\bm{\tilde{X}}^{0}=\bm{0}.

(ii) Fix kk and suppose that 𝑿~k[i,:]=𝟎\bm{\tilde{X}}^{k}[i,:]=\bm{0} for all ii\notin\mathcal{I}. Then we have

𝑿~k+1[i,:]\displaystyle\bm{\tilde{X}}^{k+1}[i,:] =ηθk(𝑿~k[i,:]+γk(𝑾[:,i])T(𝒀~𝑺~𝑿~k))\displaystyle=\eta_{\theta^{k}}\big{(}\bm{\tilde{X}}^{k}[i,:]+\gamma^{k}(\bm{W}[:,i])^{T}(\bm{\tilde{Y}}-\bm{\tilde{S}}\bm{\tilde{X}}^{k})\big{)}
=ηθk(γk(𝑾[:,i])T𝑺~(𝑿~k𝑿~))\displaystyle=\eta_{\theta^{k}}\big{(}-\gamma^{k}(\bm{W}[:,i])^{T}\bm{\tilde{S}}(\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural})\big{)}

for all ii\notin\mathcal{I}. As the thresholds are taken as θk=μ~γksup𝑿~𝑿~k𝑿~2,1\theta^{k}=\tilde{\mu}\gamma^{k}\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1} and 𝑾𝒲(𝑺~)\bm{W}\in\mathcal{W}(\bm{\tilde{S}}), it holds that

θkμ~γk𝑿~k𝑿~2,1γk(𝑾[:,i])T𝑺~(𝑿~k𝑿~)2,\displaystyle\theta^{k}\geq\tilde{\mu}\gamma^{k}\big{\|}\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\big{\|}_{2,1}\geq\big{\|}-\gamma^{k}(\bm{W}[:,i])^{T}\bm{\tilde{S}}(\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural})\big{\|}_{2},
i,\displaystyle\forall i\notin\mathcal{I},

which implies 𝑿~k+1[i,:]2=0,i\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2}=0,\forall i\notin\mathcal{I} by the definition. Thus, 𝑿~k[i,:]=𝟎,i,k.\bm{\tilde{X}}^{k}[i,:]=\bm{0},\forall i\notin\mathcal{I},\forall k.

In the next step, we consider the components on \mathcal{I}. For all ii\in\mathcal{I}, we have

𝑿~k+1[i,:]𝑿~k[i,:]γk(𝑾[:,i])T𝑺~[:,](𝑿~k[,:]𝑿~[,:])(T1)\displaystyle\bm{\tilde{X}}^{k+1}[i,:]\in\underbrace{\bm{\tilde{X}}^{k}[i,:]-\gamma^{k}(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,\mathcal{I}]\big{(}\bm{\tilde{X}}^{k}[\mathcal{I},:]-\bm{\tilde{X}}^{\natural}[\mathcal{I},:]\big{)}}_{(T1)}
θk𝑿~k+1[i,:]2,\displaystyle-\theta^{k}\partial\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2}, (52)

where 𝑿~k+1[i,:]2\partial\|\bm{\tilde{X}}^{k+1}[i,:]\|_{2} is defined in (51). Since we choose 𝑾𝒲(𝑺~)\bm{W}\in\mathcal{W}(\bm{\tilde{S}}), we have (𝑾[:,i])T𝑺~[:,i]=1(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,i]=1, and hence (T1)(T1) can be expressed as

𝑿~[i,:]γkj,ji(𝑾[:,i])T𝑺~[:,j](𝑿~k[j,:]𝑿~[j,:])\displaystyle\bm{\tilde{X}}^{\natural}[i,:]\underbrace{-\gamma^{k}\sum_{j\in\mathcal{I},j\neq i}(\bm{W}[:,i])^{T}\bm{\tilde{S}}[:,j](\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:])}
+(1γk)(𝑿~k[i,:]𝑿~[i,:])(T2).\displaystyle\underbrace{+(1-\gamma^{k})(\bm{\tilde{X}}^{k}[i,:]-\bm{\tilde{X}}^{\natural}[i,:])}_{(T2)}.

Hence, (52) can be rewritten as

𝑿~k+1[i,:]𝑿~[i,:](T2)θk2,1(𝑿~k+1[i,:])\displaystyle\bm{\tilde{X}}^{k+1}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\in(T2)-\theta^{k}\partial\ell_{2,1}(\bm{\tilde{X}}^{k+1}[i,:])

Then one can get that for all ii\in\mathcal{I}

𝑿~k+1[i,:]𝑿~[i,:]2μ~γkj,ji(𝑿~k[j,:]𝑿~[j,:])2\displaystyle\|\bm{\tilde{X}}^{k+1}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\|_{2}\leq\tilde{\mu}\gamma^{k}\sum_{j\in\mathcal{I},j\neq i}\|(\bm{\tilde{X}}^{k}[j,:]-\bm{\tilde{X}}^{\natural}[j,:])\|_{2}
+θk+|1γk|𝑿~k[i,:]𝑿~[i,:]2.\displaystyle+\theta^{k}+|1-\gamma^{k}|\|\bm{\tilde{X}}^{k}[i,:]-\bm{\tilde{X}}^{\natural}[i,:]\|_{2}.

By the no-false-positive property, we have

𝑿~k+1𝑿~2,1μ~γk(||1)𝑿~k𝑿~2,1+||θk\displaystyle\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq\tilde{\mu}\gamma^{k}(|\mathcal{I}|-1)\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}+|\mathcal{I}|\theta^{k}
+|1γk|𝑿~k𝑿~2,1.\displaystyle+|1-\gamma^{k}|\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}.

Finally, taking the supremum over (𝑿~,𝟎)𝒳(β,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\mathcal{X}(\beta,s,0) with ||s|\mathcal{I}|\leq s, we have

sup𝑿~𝑿~k+1𝑿~2,1(μ~γk(s1)+|1γk|)\displaystyle\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}\leq\big{(}\tilde{\mu}\gamma^{k}(s-1)+|1-\gamma^{k}|\big{)}
sup𝑿~𝑿~k𝑿~2,1+sθk.\displaystyle\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1}+s\theta^{k}.

Taking cτ=log(γτ(2μ~sμ~)+|1γτ|)c^{\tau}=-\log\big{(}\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu})+|1-\gamma^{\tau}|\big{)} and using the fact that θk=μ~γksup𝑿~𝑿~k𝑿~2,1\theta^{k}=\tilde{\mu}\gamma^{k}\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k}-\bm{\tilde{X}}^{\natural}\|_{2,1} in (36), we obtain

sup𝑿~𝑿~k+1𝑿~2,1\displaystyle\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1} exp(τ=0kcτ)sup𝑿~𝑿~0𝑿~2,1\displaystyle\leq\exp\Big{(}-\sum_{\tau=0}^{k}c^{\tau}\Big{)}\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{0}-\bm{\tilde{X}}^{\natural}\|_{2,1}
sβexp(τ=0kcτ).\displaystyle\leq s\beta\exp\Big{(}-\sum_{\tau=0}^{k}c^{\tau}\Big{)}.

Hence, we get the following upper bound with respect to the Frobenius norm:

sup𝑿~𝑿~k+1𝑿~Fsup𝑿~𝑿~k+1𝑿~2,1\displaystyle\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{F}\leq\sup_{\bm{\tilde{X}}^{\natural}}\|\bm{\tilde{X}}^{k+1}-\bm{\tilde{X}}^{\natural}\|_{2,1}
sβexp(τ=0kcτ).\displaystyle\leq s\beta\exp\Big{(}-\sum_{\tau=0}^{k}c^{\tau}\Big{)}. (53)

Therefore, the error bound holds uniformly for all (𝑿~,𝟎)𝒳(β,s,0)(\bm{\tilde{X}}^{\natural},\bm{0})\in\mathcal{X}(\beta,s,0).

Lastly, we verify that cτ>0c^{\tau}>0 for all τ\tau. The assumption s<(1+1/μ~)/2s<(1+1/\tilde{\mu})/2 gives 2μ~sμ~<12\tilde{\mu}s-\tilde{\mu}<1. If 0<γτ10<\gamma^{\tau}\leq 1, then γτ(2μ~sμ~)+|1γτ|=γτ(2μ~sμ~1)+1<1.\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu})+|1-\gamma^{\tau}|=\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu}-1)+1<1. If 1<γτ<2/(1+2μ~sμ~)1<\gamma^{\tau}<2/(1+2\tilde{\mu}s-\tilde{\mu}), then γτ(2μ~sμ~)+|1γτ|=γτ(2μ~sμ~+1)1<1.\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu})+|1-\gamma^{\tau}|=\gamma^{\tau}(2\tilde{\mu}s-\tilde{\mu}+1)-1<1. Thus cτ>0c^{\tau}>0 for all τ\tau.
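A quick numerical sanity check of this last claim, with illustrative (assumed) values of μ~\tilde{\mu} and ss, is sketched below.

```python
import numpy as np

# Check: with s < (1 + 1/mu_tilde)/2, the factor
#   gamma * (2*mu_tilde*s - mu_tilde) + |1 - gamma|
# stays strictly below 1 for all 0 < gamma < 2 / (1 + 2*mu_tilde*s - mu_tilde),
# so c_tau = -log(factor) > 0. The values of mu_tilde and s are illustrative.
mu_tilde, s = 0.02, 20                          # s < (1 + 1/mu_tilde)/2 = 25.5
upper = 2.0 / (1.0 + 2.0 * mu_tilde * s - mu_tilde)
gammas = np.linspace(1e-3, upper - 1e-3, 1000)
factor = gammas * (2.0 * mu_tilde * s - mu_tilde) + np.abs(1.0 - gammas)
assert np.all(factor < 1.0) and np.all(-np.log(factor) > 0.0)
print(f"max factor = {factor.max():.4f} < 1, so c_tau > 0 on the whole range")
```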

References

  • [1] Y. Shi, S. Xia, Y. Zhou, and Y. Shi, “Sparse signal processing for massive device connectivity via deep learning,” in Proc. IEEE Int. Conf. Commun. (ICC) Workshop, pp. 1–6, 2020.
  • [2] S. K. Sharma and X. Wang, “Toward massive machine type communications in ultra-dense cellular IoT networks: Current issues and machine learning-assisted solutions,” IEEE Commun. Surveys Tuts., vol. 22, pp. 426–471, First Quart. 2020.
  • [3] W. Peng, W. Gao, and J. Liu, “AI-enabled massive devices multiple access for smart city,” IEEE Internet Things J., vol. 6, pp. 7623–7634, Oct. 2019.
  • [4] Y. Wu, X. Gao, S. Zhou, W. Yang, Y. Polyanskiy, and G. Caire, “Massive access for future wireless communication systems,” IEEE Wireless Commun., pp. 1–9, Aug. 2020.
  • [5] L. Liu, E. G. Larsson, W. Yu, P. Popovski, C. Stefanovic, and E. De Carvalho, “Sparse signal processing for grant-free massive connectivity: A future paradigm for random access protocols in the internet of things,” IEEE Signal Process. Mag., vol. 35, pp. 88–99, Sept. 2018.
  • [6] Z. Chen, F. Sohrabi, and W. Yu, “Sparse activity detection for massive connectivity,” IEEE Trans. Signal Process., vol. 66, pp. 1890–1904, Apr. 2018.
  • [7] H. Jiang, D. Qu, J. Ding, and T. Jiang, “Multiple preambles for high success rate of grant-free random access with massive MIMO,” IEEE Trans. Wireless Commun., vol. 18, pp. 4779–4789, Oct. 2019.
  • [8] J. Ding and J. Choi, “Comparison of preamble structures for grant-free random access in massive MIMO systems,” IEEE Wireless Commun. Lett., vol. 9, pp. 166–170, Feb. 2020.
  • [9] X. Shao, X. Chen, C. Zhong, J. Zhao, and Z. Zhang, “A unified design of massive access for cellular internet of things,” IEEE Internet of Things J., vol. 6, pp. 3934–3947, Apr. 2019.
  • [10] K. Senel and E. G. Larsson, “Grant-free massive MTC-enabled massive MIMO: A compressive sensing approach,” IEEE Trans. Commun., vol. 66, pp. 6164–6175, Dec. 2018.
  • [11] Z. Qin, J. Fan, Y. Liu, Y. Gao, and G. Y. Li, “Sparse representation for wireless communications: A compressive sensing approach,” IEEE Signal Process. Mag., vol. 35, pp. 40–58, May 2018.
  • [12] Z. Gao, L. Dai, S. Han, I. Chih-Lin, Z. Wang, and L. Hanzo, “Compressive sensing techniques for next-generation wireless communications,” IEEE Wireless Commun., vol. 25, pp. 144–153, Jun. 2018.
  • [13] L. Liu and W. Yu, “Massive connectivity with massive MIMO-Part I: Device activity detection and channel estimation,” IEEE Trans. Signal Process., vol. 66, pp. 2933–2946, Jun. 2018.
  • [14] M. Ke, Z. Gao, Y. Wu, X. Gao, and R. Schober, “Compressive sensing-based adaptive active user detection and channel estimation: Massive access meets massive MIMO,” IEEE Trans. Signal Process., vol. 68, pp. 764–779, 2020.
  • [15] J. Ziniel and P. Schniter, “Efficient high-dimensional inference in the multiple measurement vector problem,” IEEE Trans. Signal Process., vol. 61, pp. 340–354, Jan. 2012.
  • [16] A. K. Fletcher, P. Pandit, S. Rangan, S. Sarkar, and P. Schniter, “Plug-in estimation in high-dimensional linear inverse problems: A rigorous analysis,” in Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 7440–7449, 2018.
  • [17] T. Jiang, Y. Shi, J. Zhang, and K. B. Letaief, “Joint activity detection and channel estimation for IoT networks: Phase transition and computation-estimation tradeoff,” IEEE Internet of Things J., vol. 6, pp. 6212–6225, Aug. 2019.
  • [18] Q. He, T. Q. Quek, Z. Chen, Q. Zhang, and S. Li, “Compressive channel estimation and multi-user detection in C-RAN with low-complexity methods,” IEEE Trans. Wireless Commun., vol. 17, pp. 3931–3944, Jun. 2018.
  • [19] Z. Qin, K. Scheinberg, and D. Goldfarb, “Efficient block-coordinate descent algorithms for the group lasso,” Math. Program. Comput., vol. 5, no. 2, pp. 143–169, 2013.
  • [20] A. Bonnefoy, V. Emiya, L. Ralaivola, and R. Gribonval, “Dynamic screening: Accelerating first-order algorithms for the lasso and group-lasso,” IEEE Trans. Signal Process., vol. 63, pp. 5121–5132, Oct. 2015.
  • [21] R. Giryes, Y. C. Eldar, A. M. Bronstein, and G. Sapiro, “Tradeoffs between convergence speed and reconstruction accuracy in inverse problems,” IEEE Trans. Signal Process., vol. 66, pp. 1676–1690, Apr. 2018.
  • [22] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Trans. Signal Process., vol. 66, pp. 5438–5453, Oct. 2018.
  • [23] M. Eisen, C. Zhang, L. F. O. Chamon, D. D. Lee, and A. Ribeiro, “Learning optimal resource allocations in wireless systems,” IEEE Trans. Signal Process., vol. 67, pp. 2775–2790, May 2019.
  • [24] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “A graph neural network approach for scalable wireless power control,” in Proc. IEEE Global Commun. Conf. (Globecom) Workshop, pp. 1–6, Dec. 2019.
  • [25] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “LORM: Learning to optimize for resource management in wireless networks with few training samples,” IEEE Trans. Wireless Commun., vol. 19, pp. 665–679, Jan. 2020.
  • [26] L. Liang, H. Ye, G. Yu, and G. Y. Li, “Deep-learning-based wireless resource allocation with application to vehicular networks,” Proc. of the IEEE, vol. 108, pp. 341–356, Feb. 2020.
  • [27] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proc. Int. Conf. Mach. Learn. (ICML), pp. 399–406, Omnipress, 2010.
  • [28] V. Monga, Y. Li, and Y. C. Eldar, “Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing,” IEEE Signal Process. Mag., vol. 38, pp. 18–44, Mar. 2021.
  • [29] N. Shlezinger, J. Whang, Y. C. Eldar, and A. G. Dimakis, “Model-based deep learning,” arXiv preprint arXiv:2012.08405, 2020.
  • [30] A. Balatsoukas-Stimming and C. Studer, “Deep unfolding for communications systems: A survey and some new directions,” in Proc. IEEE Int. Workshop Signal Process. Syst. (SiPS), pp. 266–271, 2019.
  • [31] Q. Hu, Y. Cai, Q. Shi, K. Xu, G. Yu, and Z. Ding, “Iterative algorithm induced deep-unfolding neural networks: Precoding design for multiuser MIMO systems,” IEEE Trans. Wireless Commun., vol. 20, pp. 1394–1410, Feb. 2021.
  • [32] J. Liu, X. Chen, Z. Wang, and W. Yin, “ALISTA: Analytic weights are as good as learned weights in LISTA,” in Proc. Int. Conf. on Learn. Rep. (ICLR), 2019.
  • [33] P. Ablin, T. Moreau, M. Massias, and A. Gramfort, “Learning step sizes for unfolded sparse coding,” in Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 13100–13110, 2019.
  • [34] X. Chen, J. Liu, Z. Wang, and W. Yin, “Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds,” in Proc. Neural Inf. Process. Syst. (NeurIPS), pp. 9061–9071, 2018.
  • [35] A. T. Puig, A. Wiesel, G. Fleury, and A. O. Hero, “Multidimensional shrinkage-thresholding operator and group LASSO penalties,” IEEE Signal Process. Lett., vol. 18, pp. 363–366, Jun. 2011.
  • [36] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. R. Stat. Soc. B, vol. 68, no. 1, pp. 49–67, 2006.
  • [37] D. Ito, S. Takabe, and T. Wadayama, “Trainable ISTA for sparse signal recovery,” IEEE Trans. Signal Process., vol. 67, pp. 3113–3125, Jun. 2019.
  • [38] A. Maleki, L. Anitori, Z. Yang, and R. G. Baraniuk, “Asymptotic analysis of complex LASSO via complex approximate message passing (CAMP),” IEEE Trans. Inf. Theory, vol. 59, pp. 4290–4308, Jul. 2013.
  • [39] S. Takabe, T. Wadayama, and Y. C. Eldar, “Complex trainable ISTA for linear and nonlinear inverse problems,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 5020–5024, 2020.
  • [40] M. Borgerding, P. Schniter, and S. Rangan, “AMP-Inspired deep networks for sparse linear inverse problems,” IEEE Trans. Signal Process., vol. 65, pp. 4293–4308, Aug. 2017.
  • [41] Y. Kabashima, “A CDMA multiuser detection algorithm on the basis of belief propagation,” J. Phys. A: Math. General, vol. 36, p. 11111, Oct. 2003.
  • [42] H. Sun, W. Pu, M. Zhu, X. Fu, T.-H. Chang, and M. Hong, “Learning to continuously optimize wireless resource in episodically dynamic environment,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 4945–4949, 2021.
  • [43] Y. Yuan, G. Zheng, K.-K. Wong, B. Ottersten, and Z.-Q. Luo, “Transfer learning and meta learning-based fast downlink beamforming adaptation,” IEEE Trans. Wireless Commun., vol. 20, pp. 1742–1755, Mar. 2021.
  • [44] J. Liu, S. Ji, and J. Ye, “Multi-task feature learning via efficient l2,1l_{2,1}-norm minimization,” in Proc. Conf. Uncertainty Artif. Intell. (UAI), pp. 339–348, AUAI Press, 2009.
  • [45] R. Gribonval and M. Nielsen, “Sparse representations in unions of bases,” IEEE Trans. Inf. Theory, vol. 49, pp. 3320–3325, Dec. 2003.