
GNN-Based Beamforming for Sum-Rate Maximization in MU-MISO Networks

Yuhang Li, Yang Lu, Bo Ai, Octavia A. Dobre, Zhiguo Ding, and Dusit Niyato
Yuhang Li and Yang Lu are with the School of Computer and Information Technology, and also with the Collaborative Innovation Center of Railway Traffic Safety, Beijing Jiaotong University, Beijing 100044, China (e-mail: 22125206,[email protected]). Bo Ai is with the State Key Laboratory of Rail Traffic Control and Safety, and the School of Electronics and Information Engineering, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]). Octavia A. Dobre is with the Faculty of Engineering and Applied Science, Memorial University, St. John's, NL A1C 5S7, Canada (e-mail: [email protected]). Zhiguo Ding is with the Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, UAE, and the Department of Electrical and Electronic Engineering, University of Manchester, Manchester, UK (e-mail: [email protected]). Dusit Niyato is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).
Abstract

The advantages of graph neural networks (GNNs) in leveraging the graph topology of wireless networks have drawn increasing attention. This paper studies a GNN-based learning approach for sum-rate maximization in multiple-user multiple-input single-output (MU-MISO) networks subject to the users' individual data rate requirements and the power budget of the base station. By modeling the MU-MISO network as a graph, a GNN-based architecture named CRGAT is proposed to directly map the channel state information to the beamforming vectors. The attention-enabled aggregation and the residual-assisted combination are adopted to enhance the learning capability and avoid the oversmoothing issue. Furthermore, a novel activation function is proposed for the constraint due to the limited power budget at the base station. The CRGAT is trained in an unsupervised learning manner with two proposed loss functions. An evaluation method is proposed for the learning-based approach, based on which the effectiveness of the proposed CRGAT is validated in comparison with several convex-optimization-based and learning-based approaches. Numerical results reveal the advantages of the CRGAT, including millisecond-level response with limited optimality performance loss, scalability to different numbers of users and power budgets, and adaptability to different system settings.

Index Terms:
GNNs, sum-rate maximization, MU-MISO, CRGAT.

I INTRODUCTION

With the exponential growth of mobile data demand, enhancing the spectral efficiency (SE) has become an urgent need for wireless communication systems to accommodate higher transmission data rates and support massive access services [1]. As a key SE enabler, smart antenna technology has drawn great attention, which makes the beamforming design an important topic [2]. Over the past decades, the beamforming design has mainly been solved by traditional optimization algorithms, such as successive convex approximation (SCA) [3] and block successive upper bound minimization (BSUM) [4]. However, the high computational complexity due to the iterations makes it hard for traditional optimization algorithms to realize real-time processing in practical time-varying wireless communication systems. Besides, beamforming design problems become extremely complicated for large-scale wireless networks [5].

Recently, some researchers have developed deep learning (DL)-based signal processing approaches [6], where the traditional optimization algorithms were approximated by neural networks (NNs). In [7], a supervised "learn-to-optimize" approach was proposed, where fully connected multi-layer perceptrons (MLPs) were trained to approximate the non-linear mapping between the inputs and outputs of a sum-rate maximization problem. Once the MLPs are well-trained, the average inference time is greatly reduced with limited optimality performance loss compared with the weighted minimum mean squared error (WMMSE) algorithm. Similarly, in [8], a black-box convolutional NN (CNN) was trained to approximate the WMMSE algorithm. Such supervised learning approaches require a large number of labeled samples, which burdens the training process. Instead, some works utilized unsupervised learning to directly optimize the objective function. In [9], the negative objective function was used as the loss function to facilitate unsupervised learning of NNs to solve a sum-rate maximization problem in a multiple-input single-output (MISO) system.

The aforementioned models achieve remarkable performance in some scenarios, but their inputs usually have flat data structures [10], which may fail to exploit the hidden features that could improve the learning performance. Due to the graph-structured topology of wireless networks, graph NNs (GNNs) are more suitable to approximate the desired transmit design, especially for large-scale or complicated networks [11]. Besides, GNNs can be directly generalized to different network scenarios without re-training, which is an important feature in practice but cannot be realized by MLPs and CNNs [12].

There have been some existing works on GNN-based transmit designs, e.g., [13, 14, 15, 16, 17, 18, 19, 20, 21]. In [13], a link scheduling problem was solved by using channel gains and distance information based on graph embedding. In [14], the authors transformed the radio resource management problem into a graph optimization problem and theoretically analyzed the permutation equivariance, scalability and generalization capabilities of GNNs from the perspective of message passing mechanisms. In [15], a graph attention network (GAT)-based energy-efficient beamforming design was proposed for a multiple-user (MU)-MISO system with the users' individual power constraints. In addition to the fully connected graph representation, the heterogeneous graph representation has also been studied. For example, in [16], the MU-MISO system was modeled as a bipartite graph, with which the sum-rate and min-rate maximization problems were solved by a bipartite GNN based on the bipartite message passing mechanism. In [17], the authors designed a unicast-multicast GNN (UMGNN) and proposed a UMGNN-based approach to jointly design the multicast and unicast beamformers based on imperfect channel state information (CSI). In [18], a device-to-device (D2D) network was modeled as a heterogeneous graph, and the power control and beamforming designs were obtained by use of heterogeneous interfering GNNs. Nevertheless, only simple constraints were considered in the above works, and thus, these approaches cannot be directly extended to handle transmit design problems with complicated constraints involving coupled variables, such as the quality-of-service (QoS) constraints.

As the QoS is a central consideration in wireless services, some recent works have studied QoS-constrained learning approaches. For instance, in [19], a GNN-based user scheduling and beamforming design was proposed for MU-MISO downlink systems under constraints of the QoS requirement and the power budget. In [20], a GNN-based primal-dual learning approach was proposed for solving the radio resource management problem with beam selection and link activation constraints for D2D networks. However, the mapping from the CSI to the beamforming vectors was simplified via the WMMSE method, with which only the power parts of the beamforming vectors were output. In [21], the authors proposed a unified framework for GNNs to solve constrained optimization problems in wireless communications, but a properly selected initial point was required as the input. In addition to the above model-based learning approaches, developing an end-to-end learning approach is a fundamental problem for the GNN-based beamforming design.

The architecture of the GNN is the key factor in exploiting the features [12]. In most of the aforementioned works, the permutation invariance and permutation equivariance properties of GNNs were leveraged. However, how to further improve the learning capability, e.g., by handling the oversmoothing issue, has not yet been investigated. Moreover, as an emerging approach, the evaluation method for GNN-based approaches is of high importance, but still remains an open issue. The above observations motivate this paper, which investigates a GNN-based approach to solve the constrained sum-rate maximization problem for MU-MISO networks. The major contributions are summarized as follows:

  • We formulate a sum-rate maximization problem for MU-MISO networks subject to the rate requirement of each user and the sum-power constraint of the base station (BS). Then, we present a graph representation of the MU-MISO network, and reformulate the sum-rate maximization problem into a graph optimization problem.

  • To establish the mapping from the CSI to the beamforming vectors, we propose a GNN-based model named CRGAT based on the message passing mechanism. Specifically, we adopt the attention-enabled aggregation and the residual-assisted combination to enhance the learning capability and avoid the oversmoothing issue. Furthermore, we propose a novel activation function for the sum-power constraint.

  • To train the CRGAT in an unsupervised learning manner and reduce the requirement of extensive training data, the rate requirements are incorporated into the objective function by the proposed loss functions based on the penalty method (PM) and Lagrangian duality method (LDM), respectively. With the loss functions, the adaptability of the CRGAT to the considered problem is enhanced.

  • We develop an evaluation method for the learning-based approach, based on which we validate the effectiveness of the proposed CRGAT in comparison with several convex-optimization-based and learning-based benchmarking schemes. In particular, the millisecond-level response with limited performance loss, the scalability to different numbers of users and power budgets, and the adaptability to different system settings are observed for the CRGAT. Besides, the effectiveness of the residual-assisted combination is validated by an ablation experiment, and the transfer learning performance of the CRGAT is illustrated.

The rest of this paper is organized as follows. Section II introduces the system model and problem formulation. The proposed CRGAT and the unsupervised learning algorithms are presented in Section III and Section IV, respectively. Numerical results are presented in Section V. Finally, Section VI concludes this paper.

Notations: The following mathematical notations and symbols are used throughout this paper. ${\bf a}$ and ${\bf A}$ stand for a column vector and a matrix (including a multi-dimensional tensor), respectively. The sets of real numbers and $n$-by-$m$ real matrices are denoted by $\mathbb{R}$ and $\mathbb{R}^{n\times m}$, respectively. The sets of complex numbers and $n$-by-$m$ complex matrices are denoted by $\mathbb{C}$ and $\mathbb{C}^{n\times m}$, respectively. For a complex number $a$, $|a|$ denotes its modulus, and ${\rm Re}(a)$ and ${\rm Im}(a)$ denote its real and imaginary parts, respectively. For a vector ${\bf a}$, $\|{\bf a}\|_{2}$ is its Euclidean norm. For a matrix ${\bf A}$, ${\bf A}^{H}$ and $\|{\bf A}\|_{F}$ denote its conjugate transpose and Frobenius norm, respectively, and ${\bf A}_{(i,j)}$ and ${\bf A}_{(i,:)}$ denote the $(i,j)$-th entry and the $i$-th row of ${\bf A}$, respectively. For a tensor ${\bf A}$, ${\bf A}_{(i,:,:)}$ denotes the matrix indexed by $i$ in the first dimension of the tensor.

II System Model and Problem Formulation

Figure 1: A graph representation of the MU-MISO system.

II-A System model

Consider a downlink MU-MISO system as shown in Fig. 1, where one $N_{\rm T}$-antenna BS intends to serve $K$ single-antenna users over a common spectral band. We use $\mathcal{K}\triangleq\{1,2,\dots,K\}$ to denote the index set of the users.

Denote the symbol for the $k$-th user and the corresponding beamforming vector by $s_{k}$ and ${\bf w}_{k}$, respectively. The received signal at the $k$-th user is given by

$$y_{k}=\underbrace{{\bf h}_{k}^{H}{\bf w}_{k}s_{k}}_{\rm desired\ signal}+\underbrace{\sum\nolimits_{i=1,i\neq k}^{K}{\bf h}_{k}^{H}{\bf w}_{i}s_{i}}_{\rm intra\mbox{-}cell\ interference}+n_{k},$$

where ${\bf h}_{k}\in\mathbb{C}^{N_{\rm T}}$ denotes the CSI between the BS and the $k$-th user, and $n_{k}\sim\mathcal{CN}(0,\sigma_{k}^{2})$ denotes the additive white Gaussian noise (AWGN). Without loss of generality, it is assumed that $\mathbb{E}\{|s_{k}|^{2}\}=1$ ($\forall k\in\mathcal{K}$). Then, the achievable data rate at the $k$-th user is expressed as

$$R_{k}\left(\{{\bf w}_{i}\}\right)=\log_{2}\left(1+\frac{\left|{\bf h}_{k}^{H}{\bf w}_{k}\right|^{2}}{\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf h}_{k}^{H}{\bf w}_{i}\right|^{2}+\sigma_{k}^{2}}\right).\quad (1)$$
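For illustration, the rates in (1) can be evaluated for all users at once as in the following sketch (illustrative PyTorch code with assumed variable names, not part of the paper):

import torch

def achievable_rates(H, W, sigma2):
    # H, W: complex tensors of shape (K, N_T) whose k-th rows are h_k and w_k;
    # sigma2: noise power(s). Hypothetical helper for the rate expression in (1).
    G = torch.abs(H.conj() @ W.T) ** 2        # G[k, i] = |h_k^H w_i|^2
    signal = torch.diagonal(G)                # desired-signal power of each user
    interference = G.sum(dim=1) - signal      # intra-cell interference at each user
    return torch.log2(1.0 + signal / (interference + sigma2))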

II-B Optimization problem formulation

The sum-rate maximization problem is formulated as

$$\begin{aligned}
\max_{\{{\bf w}_{i}\}}\quad & \sum\nolimits_{k=1}^{K}R_{k}\left(\{{\bf w}_{i}\}\right) && \text{(2a)}\\
{\rm s.t.}\quad & \sum\nolimits_{k=1}^{K}\left\|{\bf w}_{k}\right\|_{2}^{2}\leq P_{\rm Max}, && \text{(2b)}\\
& R_{k}\left(\{{\bf w}_{i}\}\right)\geq R_{\rm Req}, && \text{(2c)}\\
& {\bf w}_{i}\in\mathbb{C}^{N_{\rm T}},\ \forall i,k\in\mathcal{K},
\end{aligned}$$

where $R_{\rm Req}$ denotes the information rate requirement of each user and $P_{\rm Max}$ denotes the power budget of the BS.

The problem (2) can be solved by conventional convex optimization, which, however, has to be re-run for every CSI realization. To extract the knowledge of the mapping from the CSI to the optimal beamforming vectors, a GNN-based approach is investigated by solving the following graph optimization problem.

II-C Graph optimization problem formulation

The network topology of the considered MU-MISO system makes it suitable to be modeled as a graph. In particular, the considered MU-MISO system can be modeled as a fully connected directed graph denoted by ${\cal G}=({\cal V},{\cal E})$, as shown in Fig. 1, where ${\cal V}$ denotes the set of nodes and ${\cal E}$ denotes the set of edges. Each BS-user desired link is modeled as a node while each interference link is modeled as an edge. That is, there are $K$ nodes and $K(K-1)$ edges. The node feature of the $k$-th node is defined as the corresponding CSI, i.e., ${\bf h}_{k}$, while there is no edge feature as the inter-user interference is implicit.

The node feature matrix of the graph ${\cal G}$ is denoted by $\mathbf{H}\triangleq[{\bf h}_{1},\cdots,{\bf h}_{K}]^{T}\in\mathbb{C}^{K\times N_{\rm T}}$. By defining the beamforming matrix $\mathbf{W}\triangleq[{\bf w}_{1},\cdots,{\bf w}_{K}]^{T}\in\mathbb{C}^{K\times N_{\rm T}}$, the problem (2) can be rewritten as the following graph optimization problem:

$$\begin{aligned}
\max_{\mathbf{W}}\quad & \sum\nolimits_{k=1}^{K}\widehat{R}_{k}\left(\mathbf{W}\right) && \text{(3a)}\\
{\rm s.t.}\quad & \|\mathbf{W}\|_{F}^{2}\leq P_{\rm Max}, && \text{(3b)}\\
& \widehat{R}_{k}\left(\mathbf{W}\right)\geq R_{\rm Req}, && \text{(3c)}\\
& \mathbf{W}\in\mathbb{C}^{K\times N_{\rm T}},\ \forall k\in\mathcal{K},
\end{aligned}$$

where

$$\widehat{R}_{k}\left(\mathbf{W}\right)\triangleq\log_{2}\left(1+\frac{\left|\mathbf{H}_{(k,:)}^{H}\mathbf{W}_{(k,:)}\right|^{2}}{\sum\nolimits_{i\in\mathcal{N}(k)}\left|\mathbf{H}_{(k,:)}^{H}\mathbf{W}_{(i,:)}\right|^{2}+\sigma_{k}^{2}}\right)$$

and $\mathcal{N}(k)\triangleq\mathcal{K}\setminus\{k\}$ denotes the set of neighbor nodes of the $k$-th node.
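To make the graph representation concrete, the following sketch (illustrative PyTorch code; the function name and tensors are assumptions, not from the paper) builds the node feature matrix and the $K(K-1)$ directed edges of the fully connected graph:

import torch

def build_fully_connected_graph(H):
    # H: complex channel matrix of shape (K, N_T); row k is the node feature h_k.
    # Returns the node features and a (2, K(K-1)) edge index with one directed
    # edge per interference link; no edge features are attached.
    K = H.shape[0]
    src, dst = [], []
    for k in range(K):
        for j in range(K):
            if j != k:
                src.append(j)   # neighbour j sends its feature ...
                dst.append(k)   # ... to the central node k during aggregation
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return H, edge_index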

Figure 2: The architecture of the CRGAT.

III GNN Architecture

To solve the defined graph optimization problem, we propose a GNN-based architecture named CRGAT with the node feature matrix (i.e., $\mathbf{H}$) as the input and the optimization variable (i.e., $\mathbf{W}$) as the output. In particular, the proposed CRGAT is composed of two parts, i.e., a complex encoder and a complex decoder. The complex encoder aims to explore the hidden edge features (such as the inter-user interference) as well as to exploit the node features (i.e., $\mathbf{H}$) by using the message passing mechanism, so as to obtain embedded features with full graph information. The complex decoder aims to map the embedded features to the learned beamforming matrix.

The overall architecture of the proposed CRGAT is shown in Fig. 2, where the complex encoder includes $L$ complex residual graph attention layers (CRGALs) and the complex decoder includes $C$ complex fully connected layers (CFCLs). The details of each layer are presented as follows.

III-A Complex residual graph attention layer

The CRGAL is the key component of the proposed CRGAT, and it updates the feature of each node by aggregating the features of its neighboring nodes. As there is no edge feature, each node can only learn its interference to/from other nodes via the node features. To facilitate the aggregation process, multiple attention mechanisms are adopted to assign different weights to the neighboring nodes (as the considered system is modeled as a fully connected graph, the neighboring nodes of a node are the other $K-1$ nodes), such that the central node pays more attention to the neighboring nodes with greater impact on it.

It is noted that stacking more CRGALs may induce the oversmoothing issue, which homogenizes the aggregated features of different nodes and degrades their usefulness. Therefore, the residual structure is adopted to balance the trade-off between adopting deeper CRGALs and avoiding the oversmoothing issue.

Figure 3: The processes of the CRGAL.

As shown in Fig. 3, each CRGAL consists of two processes, i.e., the attention-enabled aggregation and the residual-assisted combination.

III-A1 Attention-enabled aggregation

The $k$-th node aggregates the features of its neighboring nodes using feedforward NNs and attention mechanisms.

For the $l$-th CRGAL ($l\in{\cal L}\triangleq\{1,\dots,L\}$), let $\mathbf{H}^{(l)}\in\mathbb{C}^{K\times F(l)}$ denote its input node feature matrix, where $F(l)$ denotes the corresponding dimension of the node feature. The input of the $1$-st CRGAL is the node feature matrix of the graph ${\cal G}$, i.e., $\mathbf{H}^{(1)}=\mathbf{H}$ with $F(1)=N_{\rm T}$, and the input of the $l$-th CRGAL, i.e., $\mathbf{H}^{(l)}$, is the output of the $(l-1)$-th CRGAL. $D(l)$ self-attention mechanisms are applied, and each one is realized by a trainable vector ${\bf a}^{(l)}_{d}\in\mathbb{C}^{\widetilde{F}(l)}$ ($d\in\{1,\dots,D(l)\}$), where $\widetilde{F}(l)$ denotes the corresponding dimension of the output of the feedforward NN. Let $\alpha^{(l)}_{k,j,d}$ denote the attention coefficient for the $(k,j)$-th node pair after the $d$-th attention mechanism of the $l$-th CRGAL; $\alpha^{(l)}_{k,j,d}$ is computed by (6), where

$$\mathbf{A}^{(l)}\triangleq[{\bf a}^{(l)}_{1},\cdots,{\bf a}^{(l)}_{D(l)}]^{T}\in\mathbb{C}^{D(l)\times\widetilde{F}(l)},\quad (4)$$

${\bm\Theta}^{(l)}\in\mathbb{C}^{D(l)\times F(l)\times\widetilde{F}(l)}$ denotes the learnable parameter of the feedforward NN, $f_{2}(\cdot)$ represents a nonlinear operation (e.g., ${\rm LeakyReLU}(\cdot)$), and $f_{3}(\cdot):\mathbb{C}\rightarrow\mathbb{R}$ represents a complex-to-real function (e.g., ${\rm Abs}(\cdot)$).

Then, the aggregated feature of the $k$-th node after the $d$-th attention mechanism of the $l$-th CRGAL is computed by

$${\bm\beta}^{(l)}_{k,d}=f_{1}\left(\alpha^{(l)}_{k,j,d}\,\mathbf{H}^{(l)}_{(j,:)}{\bm\Theta}^{(l)}_{(d,:,:)}\,\Big|\,j\in\mathcal{N}(k)\right)\in\mathbb{C}^{\widetilde{F}(l)},\quad (5)$$

where $f_{1}(\cdot):\mathbb{C}^{\widetilde{F}(l)\times|\mathcal{N}(k)|}\rightarrow\mathbb{C}^{\widetilde{F}(l)}$ represents a function with permutation invariance (e.g., ${\rm sum}(\cdot)$). Note that with $D(l)$ independent attention mechanisms, $D(l)$ aggregated features, i.e., $\{{\bm\beta}^{(l)}_{k,d}\}_{d=1}^{D(l)}$, are obtained for each node in the $l$-th CRGAL.

$$\alpha^{(l)}_{k,j,d}=\frac{\exp\left(f_{3}\left(f_{2}\left(\mathbf{H}^{(l)}_{(k,:)}{\bm\Theta}^{(l)}_{(d,:,:)}+\mathbf{H}^{(l)}_{(j,:)}{\bm\Theta}^{(l)}_{(d,:,:)}\right)^{T}\mathbf{A}^{(l)}_{(d,:)}\right)\right)}{\sum\nolimits_{i\in\mathcal{N}(k)}\exp\left(f_{3}\left(f_{2}\left(\mathbf{H}^{(l)}_{(k,:)}{\bm\Theta}^{(l)}_{(d,:,:)}+\mathbf{H}^{(l)}_{(i,:)}{\bm\Theta}^{(l)}_{(d,:,:)}\right)^{T}\mathbf{A}^{(l)}_{(d,:)}\right)\right)}\quad (6)$$
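For illustration, the attention coefficients in (6) for a single head can be computed as in the following sketch (hypothetical PyTorch helper, assuming $f_{2}={\rm LeakyReLU}$ applied to the real and imaginary parts and $f_{3}={\rm Abs}$):

import torch

def attention_coefficients(Hl, Theta_d, a_d, neighbors):
    # Hl: complex node features of layer l, shape (K, F(l)).
    # Theta_d: complex weight of head d, shape (F(l), F_tilde(l)).
    # a_d: complex attention vector, shape (F_tilde(l),).
    # neighbors[k] lists the indices in N(k).
    Z = Hl @ Theta_d                              # H^(l) Theta^(l)_(d,:,:)
    K = Z.shape[0]
    leaky = torch.nn.LeakyReLU(0.2)
    alpha = torch.zeros(K, K)
    for k in range(K):
        logits = []
        for j in neighbors[k]:
            s = Z[k] + Z[j]                                  # argument of f2 in (6)
            s = torch.complex(leaky(s.real), leaky(s.imag))  # f2 on real/imag parts
            logits.append(torch.abs(torch.sum(s * a_d)))     # f3 of the product with a_d
        alpha[k, neighbors[k]] = torch.softmax(torch.stack(logits), dim=0)
    return alpha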

 

III-A2 Residual-assisted combination

The residual-assisted combination combines the $D(l)$ attention-enabled aggregated features, the input features of the $l$-th CRGAL, and the initial input features (i.e., the input features of the $1$-st CRGAL). The latter two features (also known as the jump residual and the initial residual, respectively) are utilized to overcome the oversmoothing issue. With the residual-assisted combination operation, the net activation of the $l$-th CRGAL, denoted by $\widetilde{\mathbf{H}}^{(l+1)}$, is given by

$$\widetilde{\mathbf{H}}^{(l+1)}=\underbrace{{\rm Com}\left(\left\{{\bm\beta}^{(l)}_{k,d}\right\}_{d=1}^{D(l)}\right)}_{\rm Aggregation}+\underbrace{\overline{a}^{(l)}\mathbf{H}^{(l)}\overline{\bm\Theta}^{(l)}}_{\rm Jump\ residual}+\underbrace{\widetilde{a}^{(l)}\mathbf{H}^{(1)}\widetilde{\bm\Theta}^{(l)}}_{\rm Initial\ residual}\in\mathbb{C}^{K\times(\widetilde{F}(l)\times D(l))},\quad (7)$$

where ${\rm Com}(\cdot)$ represents the concatenation operation, and $\overline{a}^{(l)},\widetilde{a}^{(l)}\in\mathbb{R}$ are trainable weighting factors. $\overline{\bm\Theta}^{(l)}\in\mathbb{C}^{F(l)\times(\widetilde{F}(l)\times D(l))}$ and $\widetilde{\bm\Theta}^{(l)}\in\mathbb{C}^{F(1)\times(\widetilde{F}(l)\times D(l))}$ denote the trainable parameters of the feedforward NNs for the jump residual and the initial residual, respectively.
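A minimal sketch of the combination step (7), assuming PyTorch tensors and hypothetical argument names, is given below; it simply concatenates the per-head aggregated features and adds the two weighted residual branches:

import torch

def residual_combination(betas, Hl, H1, Theta_bar, Theta_tilde, a_bar, a_tilde):
    # betas: list of D(l) complex tensors, each of shape (K, F_tilde(l)).
    # Hl, H1: inputs of the l-th and the 1-st CRGAL, shapes (K, F(l)) and (K, F(1)).
    # Theta_bar, Theta_tilde: complex weights mapping to (K, F_tilde(l)*D(l)).
    # a_bar, a_tilde: real trainable weighting factors.
    aggregated = torch.cat(betas, dim=-1)        # Com({beta_{k,d}})
    jump = a_bar * (Hl @ Theta_bar)              # jump residual
    initial = a_tilde * (H1 @ Theta_tilde)       # initial residual
    return aggregated + jump + initial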

For the $l$-th CRGAL, $\mathbb{C}{\rm SELU}(\cdot)$ is adopted as the activation function to enhance the representational capability, which is given by

$$\mathbf{H}^{(l+1)}=\mathbb{C}{\rm SELU}\left(\widetilde{\mathbf{H}}^{(l+1)}\right)={\rm SELU}\left({\rm Re}\left(\widetilde{\mathbf{H}}^{(l+1)}\right)\right)+i\,{\rm SELU}\left({\rm Im}\left(\widetilde{\mathbf{H}}^{(l+1)}\right)\right),\quad (8)$$

where the activation function ${\rm SELU}(\cdot)$ is applied to the real and imaginary parts of the input features in an element-wise manner, such that the output and the input of $\mathbb{C}{\rm SELU}(\cdot)$ have the same shape. Note that the output of the $L$-th CRGAL, i.e., $\mathbf{H}^{(L+1)}$, is the input of the $1$-st CFCL.
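The $\mathbb{C}{\rm SELU}$ activation in (8) amounts to the following one-line sketch (illustrative PyTorch code):

import torch
import torch.nn.functional as F

def complex_selu(X):
    # SELU applied element-wise to the real and imaginary parts of the complex
    # input X, so the output has the same shape as the input.
    return torch.complex(F.selu(X.real), F.selu(X.imag))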

Remark 1.

The considered system is formulated as a fully connected graph. As a result, each node obtains information from all its neighboring nodes through a single graph convolution, which makes it more likely to suffer from the oversmoothing issue. For the traditional GAT, the learning performance is degraded by using more layers, because the embedded features of all nodes tend to become identical [27]. However, the oversmoothing issue is mitigated by the adopted residual-assisted combination. As a result, more CRGALs can be stacked while the feature embedding capability is also enhanced. The reason is that the jump residual and the initial residual prevent the embedded features of all nodes from becoming identical.

III-B Complex fully connected layer

The CFCLs are utilized to map the embedded features, i.e., $\mathbf{H}^{(L+1)}$, to the learned beamforming matrix. Let $\mathbf{H}^{(c)}\in\mathbb{C}^{K\times G(c)}$ denote the input node feature matrix of the $c$-th CFCL ($c\in{\cal C}\triangleq\{1,\dots,C\}$), where $G(c)$ denotes the corresponding feature dimension of each node. In particular, for the $c$-th CFCL, the mapping from the input to the net activation, denoted by $\widetilde{\mathbf{H}}^{(c+1)}$, is realized by

$$\widetilde{\mathbf{H}}^{(c+1)}=\mathbf{H}^{(c)}{\bm\Theta}^{(c)}\in\mathbb{C}^{K\times G(c+1)},\quad (9)$$

where ${\bm\Theta}^{(c)}\in\mathbb{C}^{G(c)\times G(c+1)}$ denotes the learnable parameters of the $c$-th CFCL.

For the $c$-th CFCL with $c\in{\cal C}\setminus\{C\}$, a $\mathbb{C}{\rm BatchNorm}$ ($\mathbb{C}$BN) layer is added in order to prevent the proposed CRGAT from overfitting as well as to enhance the convergence behavior. The $\mathbb{C}$BN layer can be treated as an activation function, which is applied to the net activation after $\mathbb{C}{\rm SELU}(\cdot)$ as follows:

$$\mathbf{H}^{(c+1)}=\mathbb{C}{\rm BN}\left(\mathbb{C}{\rm SELU}\left(\widetilde{\mathbf{H}}^{(c+1)}\right)\right)={\rm BN}\left({\rm SELU}\left({\rm Re}\left(\widetilde{\mathbf{H}}^{(c+1)}\right)\right)\right)+i\,{\rm BN}\left({\rm SELU}\left({\rm Im}\left(\widetilde{\mathbf{H}}^{(c+1)}\right)\right)\right),\quad (10)$$

where ${\rm BN}(\cdot)$ denotes the batch normalization function.

As for the last CFCL, a new activation function is proposed to guarantee that the output beamforming matrix, denoted as $\widehat{\mathbf{W}}$, satisfies the sum-power constraint (2b), which is given by

$$\widehat{\mathbf{W}}=\phi\left(\widetilde{\mathbf{H}}^{(C+1)}\right)=\left\{\begin{array}{ll}\sqrt{P_{\rm Max}}\,\widetilde{\mathbf{H}}^{(C+1)}, & \left\|\widetilde{\mathbf{H}}^{(C+1)}\right\|_{F}^{2}\leq 1\\ \sqrt{\dfrac{P_{\rm Max}}{\left\|\widetilde{\mathbf{H}}^{(C+1)}\right\|_{F}^{2}}}\,\widetilde{\mathbf{H}}^{(C+1)}, & \left\|\widetilde{\mathbf{H}}^{(C+1)}\right\|_{F}^{2}>1\end{array}\right.\in\mathbb{C}^{K\times N_{\rm T}}.\quad (11)$$

Note that $\widehat{\mathbf{W}}$ also represents the output of the CRGAT, i.e., the learned beamforming matrix.
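For illustration, the scaling in (11) can be applied to the decoder output as in the following sketch (hypothetical PyTorch helper, assuming the second branch normalizes by the squared Frobenius norm so that the budget (2b) is always met):

import math
import torch

def power_projection(X, p_max):
    # X: complex net activation of the last CFCL, shape (K, N_T).
    norm_sq = torch.sum(torch.abs(X) ** 2)                # ||X||_F^2
    if norm_sq <= 1.0:
        return math.sqrt(p_max) * X                       # ||W||_F^2 = P_Max * ||X||_F^2 <= P_Max
    return math.sqrt(p_max) * X / torch.sqrt(norm_sq)     # ||W||_F^2 = P_Max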

With the proposed activation function (11), the problem (2) is equivalently transformed into the following optimization problem:

$$\begin{aligned}
\min_{\bm\theta}\quad & \sum\nolimits_{k=1}^{K}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}|\bm\theta\right) && \text{(15a)}\\
{\rm s.t.}\quad & \widehat{R}_{k}\left(\widehat{\mathbf{W}}|\bm\theta\right)\geq R_{\rm Req},\ \forall k\in\mathcal{K}, && \text{(15b)}
\end{aligned}$$

where $\bm\theta\triangleq\{\mathbf{A}^{(l)},{\bm\Theta}^{(l)},\overline{\bm\Theta}^{(l)},\widetilde{\bm\Theta}^{(l)},{\bm\Theta}^{(c)}\}$ represents all trainable parameters in the proposed CRGAT.

Remark 2.

The activation function (11) is based on the projected gradient descent method and is utilized to realize the allocation of the power budget. That is, the activation function (11) in fact simplifies the task from learning the beamforming vectors to learning the directions of the beamforming vectors. Besides, it is observed from (15) that the impact of $P_{\rm Max}$ has been integrated into the CRGAT, and thus, the CRGAT should have the generalization capability with respect to $P_{\rm Max}$.

Remark 3.

It is observed from the definition of $\bm\theta$ that all learnable parameters are independent of the number of users. Therefore, the CRGAT still works if the number of users changes; that is, the CRGAT has the generalization capability with respect to the number of users. Intuitively, the generalization performance is affected by the difference between the numbers of users in the training set and the test set. Based on the generalization performance, one can directly use the well-trained CRGAT or update the CRGAT in a transfer learning manner if the network topology changes, which greatly saves the time and computational resources needed to re-train a new CRGAT.

IV Unsupervised learning

With the proposed CRGAT, the mapping from the channel matrix to the learned beamforming matrix satisfying the sum-power constraint is established, i.e., $\widehat{\mathbf{W}}={\rm CRGAT}(\mathbf{H})$. To train the CRGAT via unsupervised learning, the problem (15) should be further transformed into an unconstrained problem, such that the trainable parameters can be updated in a gradient descent manner.

To this end, the PM and the LDM are respectively adopted to design the loss function based on the objective function (15a) and the rate requirement constraint (15b).

IV-A Loss function based on penalty method

$$L_{\rm Penal}\left(\bm\theta\right)=\frac{1}{N}\sum\nolimits_{n=1}^{N}\Bigg[\underbrace{\sum\nolimits_{k=1}^{K}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}_{(n)}|\bm\theta\right)}_{\rm Objective\ function\ due\ to\ (15a)}+\lambda\underbrace{\sum\nolimits_{k=1}^{K}{\rm ReLU}\left(R_{\rm Req}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}_{(n)}|\bm\theta\right)\right)}_{\rm Penalty\ term\ due\ to\ (15b)}\Bigg]\quad (16)$$

 

The PM embeds the rate requirement constraint (15b) into the loss function as a penalty term to enforce the output of the CRGAT to satisfy (15b). For the problem (15), the loss function based on the PM is given by (16), where $N$ denotes the batch size, ${\rm ReLU}(x)\triangleq\max(x,0)$ represents the rectified linear unit activation function, $\widehat{\mathbf{W}}_{(n)}$ represents the output of the CRGAT associated with the $n$-th sample, i.e., $\mathbf{H}_{(n)}$, and $\lambda>0$ denotes the penalty coefficient, which is treated as a hyperparameter.

Particularly, the penalty term is imposed only if at least one rate requirement constraint is not satisfied. With a larger value of $\lambda$, the rate requirement constraints are more likely to be satisfied, which, however, may weaken the impact of the objective function on the loss function. By carefully adjusting the value of $\lambda$, a tradeoff between maximizing the sum rate and satisfying the rate requirement constraints can be achieved.
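For reference, the loss (16) for one minibatch can be written as in the following sketch (illustrative PyTorch code; the names rates, r_req and lam are assumptions):

import torch

def pm_loss(rates, r_req, lam):
    # rates: tensor of shape (N, K) with the achieved rates of the N samples.
    sum_rate = rates.sum(dim=1)                        # objective term of (16)
    penalty = torch.relu(r_req - rates).sum(dim=1)     # rate-requirement violations
    return (-sum_rate + lam * penalty).mean()          # average over the minibatch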

The unsupervised training procedure with the loss function based on the PM is shown in Algorithm 1, where $M$, $E$ and $B$ respectively denote the total numbers of samples, epochs and minibatches in the training dataset, and $\omega$ denotes the learning rate.

$$\bm\theta:=\bm\theta-\omega\nabla_{\bm\theta}L_{\rm Penal}\left(\bm\theta\right)\quad (17)$$
Input:  Training dataset $\mathcal{D}=\{\mathbf{H}_{(n)}\}_{n=1}^{M}$;
Output:  Learned $\bm\theta$;
1:  Initialize $\bm\theta$;
2:  for epoch $e\in[0,1,\dots,E]$ do
3:     for index of minibatch $b\in[0,1,\dots,B]$ do
4:        Sample the $b$-th minibatch $\{\mathbf{H}_{(n)}\}_{n=1}^{N}$;
5:        Calculate the corresponding beamforming matrices $\{\widehat{\mathbf{W}}_{(n)}\}_{n=1}^{N}$ via $\widehat{\mathbf{W}}_{(n)}={\rm CRGAT}(\mathbf{H}_{(n)})$, $\forall n$;
6:        Update $\bm\theta$ via (17);
7:     end for
8:  end for
9:  return $\bm\theta$.
Algorithm 1 Unsupervised Learning with Loss Function Based on PM
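A minimal sketch of Algorithm 1 in PyTorch is given below; rates_from(H, W) is an assumed helper implementing the rates in (3), pm_loss is the sketch given above, and the remaining names are placeholders rather than the authors' code:

import torch

def train_pm(model, loader, epochs, lr, r_req, lam):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for H in loader:                                   # step 4: sample a minibatch
            W = model(H)                                   # step 5: W_hat = CRGAT(H)
            loss = pm_loss(rates_from(H, W), r_req, lam)   # loss (16)
            opt.zero_grad()
            loss.backward()
            opt.step()                                     # step 6: update via (17)
    return model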

IV-B Loss function based on Lagrangian duality method

Instead of using the constant penalty coefficient, the LDM adopts the learnable Lagrangian multipliers.

Let $\bm\mu\triangleq[\mu_{1},\dots,\mu_{K}]\in\mathbb{R}_{+}^{K}$ denote the non-negative Lagrangian multipliers associated with (15b). The Lagrangian dual optimization problem of the problem (15) is given by

$$\max_{\bm\mu}\min_{\bm\theta}\ \mathcal{L}\left(\bm\theta,\bm\mu\right),$$

where $\mathcal{L}\left(\bm\theta,\bm\mu\right)$ is given by (18).

$$\mathcal{L}\left(\bm\theta,\bm\mu\right)=\sum\nolimits_{k=1}^{K}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}|\bm\theta\right)+\sum\nolimits_{k=1}^{K}\mu_{k}\,{\rm ReLU}\left(R_{\rm Req}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}|\bm\theta\right)\right)\quad (18)$$

 

The Lagrangian dual optimization problem can be solved by alternately optimizing $\bm\theta$ and $\bm\mu$. Similarly, the loss function based on the LDM is given by (21), and we alternately update $\bm\theta$ and $\bm\mu$ during the training phase. Then, the unsupervised training procedure with the loss function based on the LDM is shown in Algorithm 2, where $\tau>0$ denotes the step size for updating $\bm\mu$.

$$\bm\theta:=\bm\theta-\omega\nabla_{\bm\theta}L_{\rm Lag}\left(\bm\theta,\bm\mu\right)\quad (19)$$

$$\nabla_{\mu_{k}}:=\nabla_{\mu_{k}}+\tau\sum\nolimits_{n=1}^{N}{\rm ReLU}\left(R_{\rm Req}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}_{(n)}|\bm\theta\right)\right)\quad (20)$$
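For reference, the loss (21) and the epoch-level multiplier update (line 10 of Algorithm 2) can be sketched as follows (illustrative PyTorch code; grad_mu is assumed to accumulate the per-minibatch violations as in (20)):

import torch

def ldm_loss(rates, r_req, mu):
    # rates: (N, K) achieved rates; mu: (K,) Lagrangian multipliers.
    violation = torch.relu(r_req - rates)
    return (-rates.sum(dim=1) + (mu * violation).sum(dim=1)).mean()

def dual_ascent_step(mu, grad_mu, tau):
    # Epoch-level update of the multipliers, kept non-negative.
    return torch.clamp(mu + tau * grad_mu, min=0.0)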
Remark 4.

Different from the sum-power constraint, which is handled via the activation function, neither the PM nor the LDM can theoretically guarantee a feasible solution. Nevertheless, the efficacy of the two methods has been validated numerically in some existing works, see, e.g., [24, 25].

$$L_{\rm Lag}\left(\bm\theta,\bm\mu\right)=\frac{1}{N}\sum\nolimits_{n=1}^{N}\left[\sum\nolimits_{k=1}^{K}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}_{(n)}|\bm\theta\right)+\sum\nolimits_{k=1}^{K}\mu_{k}\,{\rm ReLU}\left(R_{\rm Req}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}_{(n)}|\bm\theta\right)\right)\right]\quad (21)$$

 

Input:  Training dataset $\mathcal{D}=\{\mathbf{H}_{(n)}\}_{n=1}^{M}$;
Output:  Learned $\bm\theta$ and $\bm\mu$;
1:  Initialize $\bm\theta$ and $\bm\mu$;
2:  for epoch $e\in[0,1,\dots,E]$ do
3:     $\nabla_{\bm\mu}\triangleq[\nabla_{\mu_{1}},\dots,\nabla_{\mu_{K}}]\leftarrow{\bf 0}$;
4:     for index of minibatch $b\in[0,1,\dots,B]$ do
5:        Sample the $b$-th minibatch $\{\mathbf{H}_{(n)}\}_{n=1}^{N}$;
6:        Calculate the corresponding beamforming matrices $\{\widehat{\mathbf{W}}_{(n)}\}_{n=1}^{N}$ via $\widehat{\mathbf{W}}_{(n)}={\rm CRGAT}(\mathbf{H}_{(n)})$, $\forall n$;
7:        Update $\bm\theta$ via (19);
8:        Update $\nabla_{\mu_{k}}$ via (20), $\forall k\in\mathcal{K}$;
9:     end for
10:     $\bm\mu\leftarrow\bm\mu+\tau\nabla_{\bm\mu}$;
11:  end for
12:  return $\bm\theta$ and $\bm\mu$.
Algorithm 2 Unsupervised Learning with Loss Function Based on LDM

V Numerical Results

In this section, numerical results are provided to evaluate the proposed CRGAT, the proposed activation function, and the two loss functions. An evaluation method for the learning-based approach is proposed based on certain basic metrics and covers three aspects: model effectiveness, scalability, and adaptability. Besides, the effectiveness of the residual-assisted combination is validated by an ablation experiment, and the transfer learning performance of the CRGAT is illustrated.

V-A Simulation setup

Simulation scenario: The number of antennas is set as $N_{\rm T}\in\{8,16\}$. The number of users is set as $K\in\{3,4,5,6,7,8,9\}$. The total power budget is set as $P_{\rm Max}\in\{1,2,3\}$. The rate requirement is set as $R_{\rm Req}\in\{1,2,3\}$ bit/s/Hz. The CSI of each user includes both the large-scale path loss and the small-scale fading. The path-loss model is $10^{-(140.7+36.7\log_{10}(\widetilde{d}))/10}$, where $\widetilde{d}$ (in kilometers) denotes the distance between the BS and the user. Rayleigh fading is adopted for the small-scale fading. The noise power spectral density is $-162$ dBm/Hz and the bandwidth is $10$ MHz. The average ratio between the path loss and the noise power is set as $10$ dB.

Computer configuration: The convex formulations and the NNs are respectively processed by the convex solver SeDuMi under MathWorks MATLAB R2021b and by Python 3.8.18 with PyTorch 1.10.0, on a computer with an Intel(R) Core(TM) i9-12900K CPU and one NVIDIA RTX 3090 GPU (24 GB of memory).

Dataset: The ratio between the training set, validation set and test set is set as $9:1:1$. Since unsupervised learning is adopted, the training set does not include labels. The labels of the validation set and the test set are generated by an SCA-based approach (presented in Appendix A) with a convergence precision of $10^{-4}$. All the datasets prepared for the simulation experiments are listed in Table I.

TABLE I: Datasets.
No. $N_{\rm T}$ $K$ $P_{\rm Max}$ $R_{\rm Req}$ Size Type
1 8 3 1 1 10,000 B
2 8 4 1 1 110,000 A
3 8 5 1 1 110,000 A
4 16 7 1 1 10,000 B
5 16 8 1 1 110,000 A
6 16 9 1 1 10,000 B
7 16 8 1 2 110,000 A
8 16 8 1 3 110,000 A
9 16 8 2 1 110,000 A
10 16 8 3 1 110,000 A
  • Type A: The training set, validation set and test set are all included.

  • Type B: Only test set is included.

GNN architecture: The detailed architecture of the CRGAT is shown in Table II, where $4$ CGCLs and $3$ CFCLs are included.

TABLE II: The architecture of the proposed CRGAT.
No. Type IFs OFs AMs $\mathbb{C}$SELU $\mathbb{C}$BN
1 CRGL $N_{\rm T}$ 32 10 ✓ ×
2 CGCL 320 64 10 ✓ ×
3 CGCL 640 128 10 ✓ ×
4 CGCL 1280 256 10 ✓ ×
5 CFCL 2560 1024 - ✓ ✓
6 CFCL 1024 512 - ✓ ✓
7 CFCL 512 $N_{\rm T}$ - × ×
  • IFs: The dimension of the input feature, i.e., $F(l)$, $G(c)$.

  • OFs: The dimension of the output feature.

  • AMs: The number of attention mechanisms, i.e., $D(l)$.

  • CGCL: Generalized to all complex graph convolution layers.

Basic Metrics: The basic metrics are used to evaluate the learning-based approaches on a given test set in terms of the optimality performance, feasibility rate and inference time.

  • 1) Optimality performance: The ratio of the average achievable sum rate (over the feasible solutions) obtained by the learning-based approach to the average optimal sum rate.

  • 2) Feasibility rate: The percentage of feasible solutions to the considered problem obtained by the learning-based approach.

  • 3) Inference time: The average running time required by the learning-based approach to calculate the feasible beamforming vectors from the given CSI.

Advanced Metrics: Based on the basic metrics, we define the following three advanced metrics to evaluate the learning-based approaches. The details are described as follows.

  • 1) Model effectiveness: Evaluate the efficacy of the model architecture based on the basic metrics, where the settings of the test set and the training set are identical (the identical setting means that the scalar variables are the same while the random variables are i.i.d.).

  • 2) Scalability: Evaluate the efficacy of the learning-based approach on a test set with settings different from those of the training set.

  • 3) Adaptability: Evaluate the efficacy of the activation function and the loss functions under different constraints.

V-B Model effectiveness

This subsection shows the performance of the proposed CRGAT based on the basic metrics. Two scenarios are considered, i.e., $(N_{\rm T},K,P_{\rm Max},R_{\rm Req})=(8,4,1,1)$ (Dataset No. 2) and $(N_{\rm T},K,P_{\rm Max},R_{\rm Req})=(16,8,1,1)$ (Dataset No. 5). To fully evaluate the proposed CRGAT, several convex-optimization-based and learning-based approaches are adopted as benchmarks (the learning-based benchmarks follow existing works and are extended to solve the considered problem).

  • 1) SCA: The SCA-based algorithm, which serves as a benchmark to evaluate the optimality and the computational efficiency of the learning-based approaches;

  • 2) MRT: The SCA-based algorithm based on the maximum ratio transmission (MRT) scheme (given in Appendix B);

  • 3) ZF: The SCA-based algorithm based on the zero-forcing (ZF) scheme (given in Appendix B);

  • 4) CMLP [7]: Direct learning of the mapping from the CSI, i.e., $\{{\bf h}_{i}\}$, to the beamforming vectors, i.e., $\{{\bf w}_{i}\}$, via CFCLs;

  • 5) CGCN [23]: Topological information among the CSI is utilized through graph convolution;

  • 6) CGAT [15]: An additive attention mechanism is used in the process of graph convolution;

  • 7) CTGCN [28]: A key-value attention mechanism is used in the process of graph convolution.

For a fair comparison, all learning-based approaches adopt the activation function (11) and are trained via Algorithm 1. The numerical results are shown in Table III.

TABLE III: Model effectiveness.
$N_{\rm T}$ $K$ Approach OP FR IT
8 4 SCA 100% 100% 1.99s
MRT 46.6% 100% 1.29s
ZF 99.1% 100% 1.12s
CMLP 79.7% 0% 0.93ms
CGCN 84.5% 99.3% 0.90ms
CGAT 91.0% 99.6% 1.66ms
CTGCN 85.0% 99.2% 2.42ms
CRGAT 97.1% 100% 2.73ms
16 8 SCA 100% 100% 8.48s
MRT 40.6% 100% 1.86s
ZF 98.9% 100% 2.14s
CMLP 36.1% 0% 0.94ms
CGCN 52.3% 98.5% 0.92ms
CGAT 84.0% 99.9% 1.71ms
CTGCN 45.0% 92.7% 2.46ms
CRGAT 95.4% 100% 2.81ms
  • OP: Optimality performance.

  • FR: Feasibility rate.

  • IT: Inference time.

  • : The optimality performance is calculated with infeasible solutions as there is no feasible solution.

Compared with the benchmark learning-based approaches, the proposed CRGAT achieves the best optimality performance, and its optimality performance gap to the SCA-based algorithm is less than $5\%$. Increasing the numbers of antennas and users has an insignificant impact on the optimality performance of the CRGAT, but greatly degrades the optimality performance of the other learning-based approaches. On the one hand, the attention-enabled aggregation adopted in the CRGAT is able to explore the hidden features due to the inter-user interference, which makes it more suitable for the considered problem than the CMLP and the CGCN. On the other hand, the residual-assisted combination adopted in the CRGAT avoids the oversmoothing issue, which enhances its feature embedding capability compared with the CGAT and the CTGCN. Besides, a feasible solution by the CRGAT is guaranteed.

Compared with the traditional convex-optimization-based approaches, the inference time is greatly reduced by the learning-based approaches. Besides, increasing the numbers of antennas and users has a limited impact on the inference time of the learning-based approaches while greatly increasing the running time of the convex-optimization-based approaches. That is, with limited optimality performance loss, the proposed CRGAT is able to make a real-time response to the time-varying CSI, e.g., within a few milliseconds with our computer configuration. The following comparisons are among the learning-based approaches only, and the inference time is omitted.

Furthermore, we evaluate the robustness of the proposed CRGAT with $(N_{\rm T},K)=(16,8)$ by adding channel estimation errors to the original input, i.e., $\widehat{\mathbf{H}}=\mathbf{H}+\mathbf{E}$, where $\mathbf{H}$ represents the original input and $\mathbf{E}$ denotes the CSI error with $\mathbf{E}_{(k,:)}\sim\mathcal{CN}(0,\sigma^{2}_{{\rm E},k}{\bf I}_{N_{\rm T}})$ ($\forall k\in\mathcal{K}$). The numerical results are shown in Table IV. It is observed that adding noise to the original input has a limited impact on the optimality performance of the CRGAT.

TABLE IV: The robustness of the CRGAT with imperfect CSI.
$\sigma^{2}_{{\rm E},k}$ OP FR
$1\times 10^{-4}\|\mathbf{H}_{(k,:)}\|^{2}$ 95.3% 100%
$4\times 10^{-4}\|\mathbf{H}_{(k,:)}\|^{2}$ 95.2% 100%
$9\times 10^{-4}\|\mathbf{H}_{(k,:)}\|^{2}$ 95.0% 100%

V-C Scalability

V-C1 Scalability to different numbers of users

Due to the dynamic nature of the network topology, the generalization capability with respect to the number of users is of high importance. Therefore, this subsection first evaluates the scalability of the well-trained CRGAT, where the numbers of users in the training set and the test set are different. For comparison, the CGCN, CGAT and CTGCN are also simulated; the CMLP is not included as it is not scalable due to its fixed input size. The numerical results are shown in Table V.

TABLE V: Scalability to different numbers of users.
$N_{\rm T}$ $K_{\rm Tr}$ $K_{\rm Te}$ Model OP FR
8 4 3 CGCN 82.7% 100%
CGAT 86.4% 100%
CTGCN 89.0% 100%
CRGAT 96.0% 100%
5 CGCN 78.8% 92.7%
CGAT 82.6% 98.2%
CTGCN 77.8% 94.3%
CRGAT 92.2% 99.5%
16 8 7 CGCN 55.9% 99.8%
CGAT 84.8% 100%
CTGCN 47.8% 98.3%
CRGAT 95.3% 100%
9 CGCN 52.2% 95.9%
CGAT 82.5% 99.6%
CTGCN 45.0% 92.7%
CRGAT 94.6% 100%
  • $K_{\rm Tr}$: The number of users in the training set.

  • $K_{\rm Te}$: The number of users in the test set.

It is observed that the CRGAT trained on the dataset with $K_{\rm Tr}=4$ (Dataset No. 2) can also achieve a reasonably good performance (within $10\%$ performance loss) when tested on the datasets with $K_{\rm Te}\in\{3,5\}$ (Dataset No. 1 and 3). Besides, the performance loss is within $6\%$ when the CRGAT trained on the dataset with $K_{\rm Tr}=8$ (Dataset No. 5) is tested on the datasets with $K_{\rm Te}\in\{7,9\}$ (Dataset No. 4 and 6). The reason that the CRGAT yields better scalability in the latter case is that in the former case ($K_{\rm Tr}=4$), $K_{\rm Te}$ is increased/decreased by $25\%$ of $K_{\rm Tr}$, while in the latter case ($K_{\rm Tr}=8$), $K_{\rm Te}$ is increased/decreased by $12.5\%$ of $K_{\rm Tr}$. In addition, the CRGAT outperforms the other GNN-based models in terms of scalability, which is again due to the attention-enabled aggregation and the residual-assisted combination. Such a result indicates that the proposed CRGAT is scalable to dynamic network topologies without re-training.

V-C2 Scalability to different power budgets

Besides the number of users, the generalization capability of the proposed approach with regard to $P_{\rm Max}$ is also evaluated. Specifically, considering a scenario of $(N_{\rm T},K,R_{\rm Req})=(16,8,1)$, we train the CRGAT with $P_{\rm Max}=1$ (Dataset No. 5) and test it with $P_{\rm Max}=2$ (Dataset No. 9) and $P_{\rm Max}=3$ (Dataset No. 10), respectively. For comparison, the following two benchmark methods are also simulated:

  • 1) PM only (PM-only) [24]: All constraints are handled by the PM.

  • 2) LDM only (LDM-only) [25]: All constraints are handled by the LDM.

The numerical results are shown in Table VI. It is observed that all methods under test achieve a $100\%$ feasibility rate, while the proposed methods obtain higher optimality performance than the benchmarking ones. Note that the generalization capability of the proposed approach saves the time and computational resources required to re-train the neural network for different scenarios.

TABLE VI: Scalability to different $P_{\rm Max}$.
$P_{\rm Max}^{\rm Tr}$ $P_{\rm Max}^{\rm Te}$ Method OP FR
1 2 PM-only 78.5% 100%
LDM-only 78.3% 100%
PM with (11) 92.4% 100%
LDM with (11) 92.3% 100%
3 PM-only 71.1% 100%
LDM-only 70.1% 100%
PM with (11) 90.1% 100%
LDM with (11) 89.9% 100%
  • $P_{\rm Max}^{\rm Tr}$: $P_{\rm Max}$ in the training set.

  • $P_{\rm Max}^{\rm Te}$: $P_{\rm Max}$ in the test set.

V-D Adaptability

The adaptability is to evaluate the performance of the learning-based approaches under different scenarios. The adaptability to different RReqR_{\rm Req} and PMax{P_{\rm Max}} is respectively evaluated as follows.

Table VII shows the results with different values of $R_{\rm Req}$. It is observed that with an increase of $R_{\rm Req}$, the optimality performance of each method remains almost unchanged while the feasibility rate decreases in general. The reason is that with a larger value of $R_{\rm Req}$, the feasible set of problem (15) shrinks. Besides, the proposed methods are superior to the benchmark ones, especially in terms of the feasibility rate. As it is of high significance for learning-based approaches to yield feasible solutions, the effectiveness of the proposed approach is validated.

TABLE VII: Adaptability to different $R_{\rm Req}$.
$R_{\rm Req}$ Method OP FR
1 PM-only 94.5% 63.0%
LDM-only 94.5% 84.7%
PM with (11) 95.4% 100%
LDM with (11) 95.2% 100%
2 PM-only 94.3% 62.0%
LDM-only 94.5% 91.3%
PM with (11) 94.9% 99.6%
LDM with (11) 95.6% 99.7%
3 PM-only 94.4% 51.0%
LDM-only 95.1% 85.3%
PM with (11) 94.8% 94.1%
LDM with (11) 95.4% 94.3%

Table VIII shows the results with different values of $P_{\rm Max}$. It is interesting to see that with an increment of $P_{\rm Max}$, the optimality performance of all methods is slightly degraded. This might be due to the fact that with a larger $P_{\rm Max}$, there are more possible allocations of the power budget, and the learning task becomes more complicated. Nevertheless, the feasibility rates of the proposed methods remain $100\%$, which outperforms the benchmark methods.

TABLE VIII: Adaptability to different $P_{\rm Max}$.
$P_{\rm Max}$ Method OP FR
1 PM-only 94.5% 63.0%
LDM-only 94.5% 84.7%
PM with (11) 95.4% 100%
LDM with (11) 95.2% 100%
2 PM-only 91.3% 71.4%
LDM-only 91.5% 88.4%
PM with (11) 92.6% 100%
LDM with (11) 92.1% 100%
3 PM-only 88.5% 74.5%
LDM-only 89.9% 87.3%
PM with (11) 89.7% 100%
LDM with (11) 90.5% 100%

V-E Ablation experiment: Residual-assisted combination

The effectiveness of the attention-enabled aggregation has been analyzed in our previous work [15], which is also validated in Table III and Table V. For a deeper understanding of the residual-assisted combination, this subsection shows the performance of CGAT and CRGAT with different depths. The numerical results are shown in Table IX.

TABLE IX: Ablation experiment: Residual-assisted combination.
Model Depth OP FR
CGAT 2 84.4% 99.9%
CGAT 3 84.0% 99.9%
CGAT 4 4.0% 0%
CRGAT 2 84.9% 100%
CRGAT 3 93.8% 100%
CRGAT 4 95.4% 100%
  • Depth: Total number of complex graph convolution layers.

It is observed that with the increment of depth, the performance of the CGAT is degraded. In particular, a sharp decay occurs at a depth of $4$. This is due to the oversmoothing issue, as analyzed in Remark 1. However, the oversmoothing issue is avoided by the CRGAT, whose expressive capacity is enhanced as the depth increases. Such a result validates the effectiveness and the importance of the residual-assisted combination.

Furthermore, we adopt the mean average distance (MAD) [26] among the node features to quantify the oversmoothing issue, which is calculated by

$${\rm MAD}=\frac{\sum_{i=1}^{K}\sum_{j=1}^{K}\mathbf{D}_{(i,j)}}{\sum_{i=1}^{K}\left|\mathcal{N}(i)\right|},\quad (22)$$

where $\mathbf{D}\in\mathbb{R}^{K\times K}$ denotes the distance matrix, which is calculated by

$$\mathbf{D}_{(i,j)}=1-\frac{\overline{\mathbf{H}}^{(l)}_{(i,:)}\overline{\mathbf{H}}^{(l)}_{(j,:)}}{\|\overline{\mathbf{H}}^{(l)}_{(i,:)}\|_{2}\|\overline{\mathbf{H}}^{(l)}_{(j,:)}\|_{2}},\quad (23)$$

where $\overline{\mathbf{H}}^{(l)}={\rm Com}\left({\rm Re}(\mathbf{H}^{(l)}),{\rm Im}(\mathbf{H}^{(l)})\right)\in\mathbb{R}^{K\times 2\widetilde{F}(l)}$. The MADs of different layers in the CGAT and the CRGAT are shown in Table X.

TABLE X: MAD in different layers of CGAT and CRGAT.
Model Layer 1 2 3 4
CGAT 0.382 0.303 0.238 0.086
CRGAT 0.712 0.689 0.484 0.472

It is observed that the MAD of the CGAT converges to $0$ as the depth grows, which indicates that the node features tend to become indistinguishable. In contrast, a larger MAD is obtained by the CRGAT. As the learning capability is roughly positively correlated with the depth, the residual-assisted combination in fact balances the trade-off between enhancing the learning capability and avoiding the oversmoothing issue.
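For completeness, the MAD in (22)-(23) can be computed as in the following sketch (illustrative PyTorch code; the node features of one layer are assumed to be stored in a complex tensor):

import torch

def mean_average_distance(Hl):
    # Hl: complex node features of one layer, shape (K, F).
    X = torch.cat([Hl.real, Hl.imag], dim=-1)        # concatenate real/imag parts per (23)
    Xn = X / X.norm(dim=-1, keepdim=True)            # row-normalize
    D = 1.0 - Xn @ Xn.T                              # cosine distance matrix
    K = X.shape[0]
    return D.sum() / (K * (K - 1))                   # |N(i)| = K - 1 on the full graph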

V-F Transfer learning

As analyzed before, the scalability of the CRGAT to different numbers of users is affected by $|K_{\rm Tr}-K_{\rm Te}|/K_{\rm Tr}$. To improve the optimality performance, instead of generalizing, one can also reuse the pre-trained CRGAT on a new test set in a transfer learning manner.

Fig. 4 shows the transfer learning performance of the CRGAT, where the learnable parameters of the CRGAT trained on the dataset with $K_{\rm Tr}=4$ (Dataset No. 2) are leveraged as the initial parameters for fine-tuning the CRGAT on the dataset with $K_{\rm Tr}=5$ (Dataset No. 3). For comparison, the convergence behavior of retraining the CRGAT from scratch on the dataset with $K_{\rm Tr}=5$ (Dataset No. 3) is also plotted. It is observed that with transfer learning, the CRGAT achieves about $95\%$ optimality performance after $4$ epochs, while the generalization performance is about $92\%$ (in Table V). Besides, the convergence speed of the transfer learning is much faster than that of the retraining, with a slight optimality performance gain.

Figure 4: Convergence behavior of the transfer learning and the retraining.

VI CONCLUSION

In this paper, we studied an end-to-end GNN-based beamforming design for a constrained sum-rate maximization problem. A GNN-based architecture named CRGAT was proposed to directly map the CSI to the beamforming vectors. To enhance the model representation and avoid the oversmoothing issue, attention mechanisms and residual structures were applied to the message passing mechanism. To guarantee a feasible solution, a novel activation function was proposed to handle the sum-power constraint, and two loss functions were designed for the QoS requirements. The CRGAT was trained in an unsupervised learning manner, and was evaluated from the perspectives of model effectiveness, scalability and adaptability. Compared with the benchmarking schemes, the advantages of the proposed CRGAT are summarized as: 1) real-time response to time-varying CSI with limited optimality performance loss; 2) scalability to dynamic network topologies; 3) adaptability to various network settings.

Appendix A SCA-based Solution Approach for the problem (2)

The problem (2) can be solved by the SCA method, and the details are given as follows.

By defining auxiliary variables $\{\gamma_{k}\}$ satisfying $\gamma_{k}\leq{R_{k}}(\{{\bf{w}}_{i}\})$, the problem (2) is equivalent to

\displaystyle\mathop{\max}\limits_{\left\{{{\bf{w}}_{i}},\gamma_{i}\right\}}~\sum\nolimits_{k=1}^{K}\gamma_{k} \quad (24a)
{\rm s.t.}~\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{k}}\right|^{2}\geq\left(2^{\gamma_{k}}-1\right)\left(\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{i}}\right|^{2}+\sigma_{k}^{2}\right), \quad (24b)
\sum\nolimits_{k=1}^{K}\left\|{{\bf{w}}_{k}}\right\|_{2}^{2}\leq{P_{\rm Max}}, \quad (24c)
\gamma_{k}\geq{R_{\rm Req}}, \quad (24d)
{{\bf{w}}_{i}}\in{\mathbb{C}^{{N_{\rm{T}}}}},~\forall i,k\in{\cal K}.

By defining the auxiliary variables $\{a_{i},b_{i}\}$ satisfying

\displaystyle\left\{\begin{array}{l}\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{k}}\right|^{2}\geq e^{a_{k}+b_{k}}\\ e^{a_{k}}\geq 2^{\gamma_{k}}-1\\ e^{b_{k}}\geq\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{i}}\right|^{2}+\sigma_{k}^{2}\end{array}\right., \quad (28)

the problem (24) is equivalent to

\displaystyle\mathop{\max}\limits_{\left\{{{\bf{w}}_{i}},\gamma_{i},a_{i},b_{i}\right\}}~\sum\nolimits_{k=1}^{K}\gamma_{k} \quad (29a)
{\rm s.t.}~\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{k}}\right|^{2}\geq e^{a_{k}+b_{k}}, \quad (29b)
e^{a_{k}}\geq 2^{\gamma_{k}}-1, \quad (29c)
e^{b_{k}}\geq\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{i}}\right|^{2}+\sigma_{k}^{2}, \quad (29d)
{\rm(24c)},~{\rm(24d)},~{{\bf{w}}_{i}}\in{\mathbb{C}^{{N_{\rm{T}}}}},~\forall i,k\in{\cal K}.

Applying the first-order Taylor approximation to the non-convex constraints (29b), (29c) and (29d), the problem (29) is approximated by the following convex problem (30) with a given feasible point $\{{{\widetilde{\bf w}}}_{i},{{\widetilde{a}}_{i}},{{\widetilde{b}}_{i}}\}$:

\displaystyle\mathop{\max}\limits_{\left\{{{\bf{w}}_{i}},\gamma_{i},a_{i},b_{i}\right\}}~\sum\nolimits_{k=1}^{K}\gamma_{k} \quad (30a)
{\rm s.t.}~2{\rm Re}\left\{{{\widetilde{\bf w}}}_{k}^{H}{{\bf h}}_{k}{{\bf h}}_{k}^{H}{{\bf{w}}_{k}}\right\}-\left|{{\bf h}}_{k}^{H}{{\widetilde{\bf w}}}_{k}\right|^{2}\geq e^{a_{k}+b_{k}}, \quad (30b)
e^{{\widetilde{a}}_{k}}\left(1+a_{k}-{\widetilde{a}}_{k}\right)\geq 2^{\gamma_{k}}-1, \quad (30c)
e^{{\widetilde{b}}_{k}}\left(1+b_{k}-{\widetilde{b}}_{k}\right)\geq\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{i}}\right|^{2}+\sigma_{k}^{2}, \quad (30d)
{\rm(24c)},~{\rm(24d)},~{{\bf{w}}_{i}}\in{\mathbb{C}^{{N_{\rm{T}}}}},~\forall i,k\in{\cal K}.
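The convexification in (30b)-(30d) follows from the first-order Taylor lower bounds of the convex functions $e^{x}$ and $|{\bf h}_{k}^{H}{\bf w}_{k}|^{2}$ at the expansion point, namely

e^{x}\geq e^{\widetilde{x}}\left(1+x-\widetilde{x}\right), \qquad \left|{\bf h}_{k}^{H}{\bf w}_{k}\right|^{2}\geq 2{\rm Re}\left\{{{\widetilde{\bf w}}}_{k}^{H}{\bf h}_{k}{\bf h}_{k}^{H}{\bf w}_{k}\right\}-\left|{\bf h}_{k}^{H}{{\widetilde{\bf w}}}_{k}\right|^{2},

so that each constraint in (30) is a restriction of its counterpart in (29), and any feasible point of (30) remains feasible for (29).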

By iteratively solving the problem (30) via the SCA method, a near-optimal solution to the problem (2) can be obtained. Specifically, the initialization $\{{{\widetilde{\bf w}}}_{i}\}$ can be obtained by solving the following second-order cone program (SOCP):

{\rm Find}~\left\{{{\bf{w}}_{i}}\right\}
{\rm s.t.}~\sqrt{\left(1+\frac{1}{2^{R_{\rm Req}}-1}\right)}\,{\bf h}_{k}^{H}{{\bf{w}}_{k}}\geq\left\|{\bm{\Omega}}_{k}{\bf f}+{\bf b}_{k}\right\|, \quad (31a)
{\rm Re}\left\{{\bf h}_{k}^{H}{{\bf{w}}_{k}}\right\}\geq 0,~{\rm Im}\left\{{\bf h}_{k}^{H}{{\bf{w}}_{k}}\right\}=0, \quad (31b)
{\rm(24c)},~{{\bf{w}}_{i}}\in{\mathbb{C}^{{N_{\rm{T}}}}},~\forall i,k\in{\cal K},

where

\displaystyle\left\{\begin{array}{l}{\bf{f}}\triangleq\left[{\bf{w}}_{1}^{H},{\bf{w}}_{2}^{H},\ldots,{\bf{w}}_{K}^{H}\right]^{H}\in{\mathbb{C}}^{KN_{\rm T}}\\ {\bf{b}}_{k}\triangleq\left[{\bf{0}}_{K}^{T},\sigma_{k}^{2}\right]^{T}\in{\mathbb{C}}^{K+1}\\ {\bm{\Omega}}_{k}\triangleq\left[\begin{array}{l}{\rm{diag}}\left({\bf{h}}_{k}^{H},{\bf{h}}_{k}^{H},\ldots,{\bf{h}}_{k}^{H}\right)\\ {\bf{0}}_{KN_{\rm{T}}}^{T}\end{array}\right]\in{\mathbb{C}}^{(K+1)\times KN_{\rm T}}\end{array}\right.. \quad (37)

With the initialized $\{{{\widetilde{\bf w}}}_{i}\}$, ${{\widetilde{a}}_{i}}$ and ${{\widetilde{b}}_{i}}$ are respectively initialized by

\displaystyle\left\{\begin{array}{l}e^{{\widetilde{a}}_{k}}=2^{{R_{k}}(\{{{\widetilde{\bf{w}}}_{i}}\})}-1\\ e^{{\widetilde{b}}_{k}}=\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf{h}}_{k}^{H}{{\widetilde{\bf{w}}}_{i}}\right|^{2}+\sigma_{k}^{2}\end{array}\right.. \quad (40)
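To illustrate the overall procedure, a minimal Python sketch of the SCA loop is given below. It uses CVXPY to solve the convex subproblem (30) and, for brevity, initializes $\{\widetilde{\bf w}_{i}\}$ with scaled MRT beamformers rather than the SOCP (31); the function name, solver behavior and fixed iteration count are illustrative assumptions, not the exact implementation evaluated in the paper.

import numpy as np
import cvxpy as cp

def sca_sum_rate(H, P_max, R_req, sigma2, n_iter=20):
    # H: K x N_T complex matrix whose k-th row is h_k^H; returns beamformers W_t (K x N_T).
    K, NT = H.shape
    # Starting point (sketch): scaled MRT; the paper instead initializes via the SOCP (31).
    W_t = np.sqrt(P_max / K) * H.conj() / np.linalg.norm(H, axis=1, keepdims=True)
    for _ in range(n_iter):
        sig = np.abs(np.sum(H * W_t, axis=1)) ** 2                      # |h_k^H w_k|^2
        intf = np.array([sum(np.abs(H[k] @ W_t[i]) ** 2 for i in range(K) if i != k) + sigma2
                         for k in range(K)])
        b_t = np.log(intf)                                              # cf. (40)
        a_t = np.log(sig) - b_t                                         # e^{a_k} = SINR_k, cf. (40)
        W = cp.Variable((K, NT), complex=True)
        gam, a, b = cp.Variable(K), cp.Variable(K), cp.Variable(K)
        cons = [cp.sum_squares(W) <= P_max, gam >= R_req]               # (24c), (24d)
        for k in range(K):
            hk = H[k]
            # (30b): linearized |h_k^H w_k|^2 >= e^{a_k + b_k}
            lin = 2 * cp.real(np.conj(hk @ W_t[k]) * (hk @ W[k])) - np.abs(hk @ W_t[k]) ** 2
            cons.append(lin >= cp.exp(a[k] + b[k]))
            # (30c): e^{a_t}(1 + a - a_t) >= 2^gamma - 1
            cons.append(np.exp(a_t[k]) * (1 + a[k] - a_t[k]) >= cp.exp(np.log(2) * gam[k]) - 1)
            # (30d): e^{b_t}(1 + b - b_t) >= interference + noise
            interference = cp.sum_squares(cp.hstack([hk @ W[i] for i in range(K) if i != k]))
            cons.append(np.exp(b_t[k]) * (1 + b[k] - b_t[k]) >= interference + sigma2)
        cp.Problem(cp.Maximize(cp.sum(gam)), cons).solve()
        W_t = W.value
    return W_t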

Appendix B MRT and ZF beamforming design

In the MRT and ZF beamforming designs, each beamforming vector is simplified by fixing its direction as

\displaystyle{\bf w}_{k}=\sqrt{p_{k}}\,{\overline{\bf w}}_{k}, \quad (41)

where $p_{k}\in{\mathbb{R}}^{+}$ and ${\overline{\bf w}}_{k}$ satisfying $\|{\overline{\bf w}}_{k}\|_{2}^{2}=1$ denote the power component and the direction component, respectively.

In the MRT design, the beamforming direction is set as ${\bf h}_{k}/\|{\bf h}_{k}\|_{2}$, while in the ZF design it is set as ${\bf v}_{k}/\|{\bf v}_{k}\|_{2}$, where ${\bf v}_{k}$ is the $k$-th column of ${\bf V}\triangleq{\bf G}^{H}({\bf G}{\bf G}^{H})^{-1}$ and ${\bf G}\triangleq[{\bf h}_{1}^{H},{\bf h}_{2}^{H},\ldots,{\bf h}_{K}^{H}]$.
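As a small illustration, the following NumPy sketch computes the MRT and ZF directions; it assumes ${\bf G}$ is stored as a $K\times N_{\rm T}$ matrix whose $k$-th row is ${\bf h}_{k}^{H}$ and that $K\leq N_{\rm T}$ so that ${\bf G}{\bf G}^{H}$ is invertible. The function name is an illustrative assumption.

import numpy as np

def mrt_zf_directions(G):
    # G: K x N_T complex matrix whose k-th row is h_k^H; returns unit-norm directions (K x N_T).
    H = G.conj()                                              # k-th row carries the entries of h_k
    w_mrt = H / np.linalg.norm(H, axis=1, keepdims=True)      # MRT: h_k / ||h_k||_2
    V = G.conj().T @ np.linalg.inv(G @ G.conj().T)            # V = G^H (G G^H)^{-1}; column k is v_k
    w_zf = (V / np.linalg.norm(V, axis=0, keepdims=True)).T   # ZF: v_k / ||v_k||_2, stored as rows
    return w_mrt, w_zf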

The power components in the MRT and ZF designs can also be optimized by the SCA method, which is similar to the SCA-based solution approach for the problem (2) and is thus omitted here.

References

  • [1] H. Niu et al., “Active RIS assisted rate-splitting multiple access network: Spectral and energy efficiency tradeoff,” IEEE J. Sel. Areas Commun., vol. 41, no. 5, pp. 1452-1467, May 2023.
  • [2] W. Xu, Z. Yang, D. W. K. Ng, M. Levorato, Y. C. Eldar, and M. Debbah, “Edge learning for B5G networks with distributed signal processing: Semantic communication, edge computing, and wireless sensing,” IEEE J. Sel. Top. Signal Process, vol. 17, no. 1, pp. 9-39, Jan. 2023.
  • [3] W. R. Ghanem, V. Jamali, Y. Sun, and R. Schober, “Resource allocation for multi-user downlink MISO OFDMA-URLLC systems,” IEEE Trans. Commun., vol. 68, no. 11, pp. 7184-7200, Nov. 2020.
  • [4] Y. Omid, S. M. Shahabi, C. Pan, Y. Deng, and A. Nallanathan, “Low-complexity robust beamforming design for IRS-aided MISO systems with imperfect channels,” IEEE Commun. Lett., vol. 25, no. 5, pp. 1697-1701, May 2021.
  • [5] Y. Yang, Z. Zhang, Y. Tian, R. Jin, and C. Huang, “Implementing graph neural networks over wireless networks via over-the-air computing: A joint communication and computation framework,” IEEE Wireless Commun., vol. 30, no. 3, pp. 62-69, June 2023.
  • [6] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cognit. Commun. Networking, vol. 3, no. 4, pp. 563-575, Dec. 2017.
  • [7] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438-5453, Oct. 2018.
  • [8] Q. Hu, Y. Cai, Q. Shi, K. Xu, G. Yu and Z. Ding, “Iterative algorithm induced deep-unfolding neural networks: Precoding design for multiuser MIMO systems,” IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1394-1410, Feb. 2021.
  • [9] T. Lin and Y. Zhu, “Beamforming design for large-scale antenna arrays using deep learning,” IEEE Wireless Commun. Lett., vol. 9, no. 1, pp. 103-107, Jan. 2020.
  • [10] T. Bilot, N. E. Madhoun, K. A. Agha, and A. Zouaoui, “Graph neural networks for intrusion detection: A survey,” IEEE Access, vol. 11, pp. 49114-49139, 2023.
  • [11] Y. Gu, C. She, Z. Quan, C. Qiu, and X. Xu, “Graph neural networks for distributed power allocation in wireless networks: Aggregation over-the-air,” IEEE Trans. Wireless Commun., early access.
  • [12] Y. Shen, J. Zhang, S. H. Song, and K. B. Letaief, “Graph neural networks for wireless communications: From theory to practice,” IEEE Trans. Wireless Commun., vol. 22, no. 5, pp. 3554-3569, May 2023.
  • [13] M. Lee, G. Yu, and G. Y. Li, “Graph embedding-based wireless link scheduling with few training samples,” IEEE Trans. Wireless Commun., vol. 20, no. 4, pp. 2282-2294, Apr. 2021.
  • [14] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “Graph neural networks for scalable radio resource management: Architecture design and theoretical analysis,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 101-115, Jan. 2021.
  • [15] Y. Li, Y. Lu, R. Zhang, B. Ai, and Z. Zhong, “Deep learning for energy efficient beamforming in MU-MISO networks: A GAT-based approach,” IEEE Wireless Commun. Lett., vol. 12, no. 7, pp. 1264-1268, July 2023.
  • [16] J. Kim, H. Lee, S.-E. Hong, and S.-H. Park, “A bipartite graph neural network approach for scalable beamforming optimization,” IEEE Trans. Wireless Commun., vol. 22, no. 1, pp. 333-347, Jan. 2023.
  • [17] Z. Zhang, M. Tao, and Y.-F. Liu, “Learning to beamform in joint multicast and unicast transmission with imperfect CSI,” IEEE Trans. Commun., vol. 71, no. 5, pp. 2711-2723, May 2023.
  • [18] X. Zhang, H. Zhao, J. Xiong, X. Liu, L. Zhou, and J. Wei, “Scalable power control/beamforming in heterogeneous wireless networks with graph neural networks,” in Proc. IEEE Globecom, 2021.
  • [19] S. He, J. Yuan, Z. An, W. Huang, Y. Huang, and Y. Zhang, “Joint user scheduling and beamforming design for multiuser MISO downlink systems,” IEEE Trans. Wireless Commun., vol. 22, no. 5, pp. 2975-2988, May 2023.
  • [20] S. He, S. Xiong, W. Zhang, Y. Yang, J. Ren, and Y. Huang, “GBLinks: GNN-based beam selection and link activation for ultra-dense D2D mmWave networks,” IEEE Trans. Commun., vol. 70, no. 5, pp. 3451-3466, May 2022.
  • [21] S. He, S. Xiong, Z. An, W. Zhang, Y. Huang, and Y. Zhang, “An unsupervised deep unrolling framework for constrained optimization problems in wireless networks,” IEEE Trans. Wireless Commun., vol. 21, no. 10, pp. 8552-8564, Oct. 2022.
  • [22] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “A graph neural network approach for scalable wireless power control,” in Proc. IEEE GC Wkshps, 2019.
  • [23] T. Chen, M. You, G. Zheng, and S. Lambotharan, “Graph neural network based beamforming in D2D wireless networks,” in Proc. Int. ITG Workshop on Smart Antennas (WSA), 2021.
  • [24] A. Kaushik, M. Alizadeh, O. Waqar, and H. Tabassum, “Deep unsupervised learning for generalized assignment problems: A case-study of user-association in wireless networks,” in Proc. IEEE ICC Wkshps, 2021.
  • [25] F. Fioretto, T. W. Mak, and P. Van Hentenryck, “Predicting AC optimal power flows: Combining deep learning and Lagrangian dual methods,” in Proc. AAAI, vol. 34, no. 01, pp. 630-637, Apr. 2020.
  • [26] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun, “Measuring and relieving the over-smoothing problem for graph neural networks from the topological view,” in Proc. AAAI, vol. 34, no. 04, pp. 3438-3445, Apr. 2020.
  • [27] W. Zhao, C. Wang, C. Han, and T. Guo, “Exploring over-smoothing in graph attention networks from the Markov chain perspective,” in Proc. ICLR, Sep. 2022.
  • [28] Y. Shi, Z. Huang, S. Feng, H. Zhong, W. Wang, and Y. Sun, “Masked label prediction: Unified message passing model for semi-supervised classification,” in Proc. IJCAI-21, pp. 1548-1554, Aug. 2021.