
GNN-Based Beamforming for Sum-Rate Maximization in MU-MISO Networks

Yuhang Li, Yang Lu, Bo Ai, Octavia A. Dobre, Zhiguo Ding, and Dusit Niyato
Yuhang Li and Yang Lu are with the School of Computer and Information Technology, and also with the Collaborative Innovation Center of Railway Traffic Safety, Beijing Jiaotong University, Beijing 100044, China (e-mail: 22125206,[email protected]). Bo Ai is with the State Key Laboratory of Rail Traffic Control and Safety, and the School of Electronics and Information Engineering, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]). Octavia A. Dobre is with the Faculty of Engineering and Applied Science, Memorial University, St. John's, NL A1C 5S7, Canada (e-mail: [email protected]). Zhiguo Ding is with the Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, UAE, and the Department of Electrical and Electronic Engineering, University of Manchester, Manchester, UK (e-mail: [email protected]). Dusit Niyato is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).
Abstract

The advantages of graph neural networks (GNNs) in leveraging the graph topology of wireless networks have drawn increasing attention. This paper studies a GNN-based learning approach for sum-rate maximization in multiple-user multiple-input single-output (MU-MISO) networks subject to the users' individual data rate requirements and the power budget of the base station. By modeling the MU-MISO network as a graph, a GNN-based architecture named CRGAT is proposed to directly map the channel state information to the beamforming vectors. The attention-enabled aggregation and the residual-assisted combination are adopted to enhance the learning capability and avoid the oversmoothing issue. Furthermore, a novel activation function is proposed for the constraint due to the limited power budget at the base station. The CRGAT is trained in an unsupervised learning manner with two proposed loss functions. An evaluation method is proposed for the learning-based approach, based on which the effectiveness of the proposed CRGAT is validated in comparison with several convex-optimization-based and learning-based approaches. Numerical results reveal the advantages of the CRGAT, including millisecond-level response with limited optimality performance loss, scalability to different numbers of users and power budgets, and adaptability to different system settings.

Index Terms:
GNNs, sum-rate maximization, MU-MISO, CRGAT.

I INTRODUCTION

With the exponential growth of mobile data demand, enhancing the spectral efficiency (SE) has become an urgent need for wireless communication systems to accommodate higher transmission data rates and support massive access services [1]. As a key SE enabler, smart antenna technology has drawn great attention, which makes the beamforming design an important topic [2]. Over the past decades, the beamforming design has mainly been solved by traditional optimization algorithms, such as successive convex approximation (SCA) [3] and block successive upper bound minimization (BSUM) [4]. However, the high computational complexity due to the iterations makes it hard for traditional optimization algorithms to realize real-time processing in practical time-varying wireless communication systems. Besides, beamforming design problems become extremely complicated for large-scale wireless networks [5].

Recently, some researchers have developed deep learning (DL)-based signal processing approaches [6], where the traditional optimization algorithms were approximated by neural networks (NNs). In [7], a supervised "learn-to-optimize" approach was proposed, where fully connected multi-layer perceptrons (MLPs) were trained to approximate the non-linear mapping between the inputs and outputs of a sum-rate maximization problem. Once the MLPs are well-trained, the average inference time is greatly reduced with limited optimality performance loss compared with the weighted minimum mean squared error (WMMSE) algorithm. Similarly, in [8], a black-box convolutional NN (CNN) was trained to approximate the WMMSE algorithm. Such supervised learning approaches require a large number of labeled samples, which burdens the training process. Instead, some works utilized unsupervised learning to directly optimize the objective function. In [9], the negative objective function was used as the loss function to facilitate unsupervised learning of NNs to solve a sum-rate maximization problem in a multiple-input single-output (MISO) system.

The aforementioned models achieve remarkable performance in some scenarios, but their inputs usually have flat data structures [10], which may fail to exploit the hidden features that could improve the learning performance. Due to the graph-structured topology of wireless networks, graph NNs (GNNs) are more suitable to approximate the desired transmit design, especially for large-scale or complicated networks [11]. Besides, GNNs can be directly generalized to different network scenarios without re-training, which is an important feature in practice but cannot be realized by MLPs and CNNs [12].

There have been some existing works on GNN-based transmit designs, e.g., [13, 14, 15, 16, 17, 18, 19, 20, 21]. In [13], a link scheduling problem was solved by using channel gains and distance information based on graph embedding. In [14], the authors transformed the radio resource management problem into a graph optimization problem and theoretically analyzed the permutation equivariance, scalability and generalization capabilities of GNNs from the perspective of message passing mechanisms. In [15], a graph attention network (GAT)-based energy-efficient beamforming design was proposed for a multiple-user (MU)-MISO system with the users' individual power constraints. In addition to the fully connected graph representation, the heterogeneous graph representation has also been studied. For example, in [16], the MU-MISO system was modeled as a bipartite graph, with which the sum-rate and min-rate maximization problems were solved by a bipartite GNN based on the bipartite message passing mechanism. In [17], the authors designed a unicast-multicast GNN (UMGNN) and proposed a UMGNN-based approach to jointly design the multicast and unicast beamformers based on imperfect channel state information (CSI). In [18], a device-to-device (D2D) network was modeled as a heterogeneous graph, and the power control and beamforming designs were obtained by use of heterogeneous interfering GNNs. Nevertheless, only simple constraints were considered in the above works, and thus, these approaches cannot be directly extended to handle transmit design problems with complicated constraints involving coupled variables, such as the quality-of-service (QoS) constraints.

As the QoS is a central consideration in wireless services, some recent works have studied QoS-constrained learning approaches. For instance, in [19], a GNN-based user scheduling and beamforming design was proposed for MU-MISO downlink systems under constraints of the QoS requirement and the power budget. In [20], a GNN-based primal-dual learning approach was proposed for solving the radio resource management problem with beam selection and link activation constraints for D2D networks. However, the mapping from the CSI to the beamforming vectors was simplified via the WMMSE method, with which only the power parts of the beamforming vectors were output. In [21], the authors proposed a unified framework for GNNs to solve constrained optimization problems in wireless communications, but a properly selected initial point was required as the input. In addition to the above model-based learning approaches, developing an end-to-end learning approach is a fundamental problem for the GNN-based beamforming design.

The architecture of the GNN is the key factor in exploiting the features [12]. In most of the aforementioned works, the permutation invariance and permutation equivariance properties of GNNs were leveraged. However, how to further improve the learning capability, e.g., by handling the oversmoothing issue, has not yet been investigated. Moreover, as an emerging approach, the evaluation method for GNN-based approaches is of high importance, but still remains an open issue. The above observations motivate this paper, which investigates a GNN-based approach to solve the constrained sum-rate maximization problem for MU-MISO networks. The major contributions are summarized as follows:

  • We formulate a sum-rate maximization problem for MU-MISO networks subject to the rate requirement of each user and the sum-power constraint of the base station (BS). Then, we present a graph representation of the MU-MISO network, and reformulate the sum-rate maximization problem into a graph optimization problem.

  • To establish the mapping from the CSI to the beamforming vectors, we propose a GNN-based model named CRGAT based on the message passing mechanism. Specifically, we adopt the attention-enabled aggregation and the residual-assisted combination to enhance the learning capability and avoid the oversmoothing issue. Furthermore, we propose a novel activation function for the sum-power constraint.

  • To train the CRGAT in an unsupervised learning manner and reduce the requirement of extensive training data, the rate requirements are incorporated into the objective function by the proposed loss functions based on the penalty method (PM) and Lagrangian duality method (LDM), respectively. With the loss functions, the adaptability of the CRGAT to the considered problem is enhanced.

  • We develop an evaluation method for the learning-based approach, based on which we validate the effectiveness of the proposed CRGAT in comparison with several convex-optimization-based and learning-based benchmarking schemes. In particular, the millisecond-level response with limited performance loss, the scalability to different numbers of users and power budgets, and the adaptability to different system settings are observed for the CRGAT. Besides, the effectiveness of the residual-assisted combination is validated by an ablation experiment, and the transfer learning performance of the CRGAT is illustrated.

The rest of this paper is organized as follows. Section II introduces the system model and problem formulation. The proposed CRGAT and the unsupervised learning algorithms are presented in Section III and Section IV, respectively. Numerical results are presented in Section V. Finally, Section VI concludes this paper.

Notations: The following mathematical notations and symbols are used throughout this paper. ${\bf a}$ and ${\bf A}$ stand for a column vector and a matrix (including a multi-dimensional tensor), respectively. The sets of real numbers and $n$-by-$m$ real matrices are denoted by $\mathbb{R}$ and $\mathbb{R}^{n\times m}$, respectively. The sets of complex numbers and $n$-by-$m$ complex matrices are denoted by $\mathbb{C}$ and $\mathbb{C}^{n\times m}$, respectively. For a complex number $a$, $|a|$ denotes its modulus, and ${\rm Re}(a)$ and ${\rm Im}(a)$ denote its real and imaginary parts, respectively. For a vector ${\bf a}$, $\|{\bf a}\|_{2}$ is its Euclidean norm. For a matrix ${\bf A}$, ${\bf A}^{H}$ and $\|{\bf A}\|_{F}$ denote its conjugate transpose and Frobenius norm, respectively, and ${\bf A}_{(i,j)}$ and ${\bf A}_{(i,:)}$ denote the $(i,j)$-th entry and the $i$-th row of ${\bf A}$, respectively. For a tensor ${\bf A}$, ${\bf A}_{(i,:,:)}$ denotes the matrix indexed by $i$ in the first dimension of the tensor.

II System Model and Problem Formulation

Figure 1: A graph representation of the MU-MISO system.

II-A System model

Consider a downlink MU-MISO system as shown in Fig. 1, where one $N_{\rm T}$-antenna BS intends to serve $K$ single-antenna users over a common spectral band. We use $\mathcal{K}\triangleq\{1,2,\dots,K\}$ to denote the index set of the users.

Denote the symbol for the $k$-th user and the corresponding beamforming vector by $s_{k}$ and ${\bf w}_{k}$, respectively. The received signal at the $k$-th user is given by

$$y_{k}=\underbrace{{\bf h}_{k}^{H}{\bf w}_{k}s_{k}}_{\rm desired\ signal}+\underbrace{\sum\nolimits_{i=1,i\neq k}^{K}{\bf h}_{k}^{H}{\bf w}_{i}s_{i}}_{\rm intra\mbox{-}cell\ interference}+n_{k},$$

where ${\bf h}_{k}\in\mathbb{C}^{N_{\rm T}}$ denotes the CSI between the BS and the $k$-th user, and $n_{k}\sim\mathcal{CN}(0,\sigma_{k}^{2})$ denotes the additive white Gaussian noise (AWGN). Without loss of generality, it is assumed that $\mathbb{E}\{|s_{k}|^{2}\}=1$ ($\forall k\in\mathcal{K}$). Then, the achievable data rate at the $k$-th user is expressed as

$$R_{k}\left(\{{\bf w}_{i}\}\right)=\log_{2}\left(1+\frac{\left|{\bf h}_{k}^{H}{\bf w}_{k}\right|^{2}}{\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf h}_{k}^{H}{\bf w}_{i}\right|^{2}+\sigma_{k}^{2}}\right).\quad (1)$$
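For illustration, the rates in (1) can be evaluated for all users at once as in the following sketch (illustrative PyTorch code with assumed variable names, not part of the paper):

import torch

def achievable_rates(H, W, sigma2):
    # H, W: complex tensors of shape (K, N_T) whose k-th rows are h_k and w_k;
    # sigma2: noise power(s). Hypothetical helper for the rate expression in (1).
    G = torch.abs(H.conj() @ W.T) ** 2        # G[k, i] = |h_k^H w_i|^2
    signal = torch.diagonal(G)                # desired-signal power of each user
    interference = G.sum(dim=1) - signal      # intra-cell interference at each user
    return torch.log2(1.0 + signal / (interference + sigma2))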

II-B Optimization problem formulation

The sum-rate maximization problem is formulated as

$$\begin{aligned}
\max_{\{{\bf w}_{i}\}}\quad & \sum\nolimits_{k=1}^{K}R_{k}\left(\{{\bf w}_{i}\}\right) && \text{(2a)}\\
{\rm s.t.}\quad & \sum\nolimits_{k=1}^{K}\left\|{\bf w}_{k}\right\|_{2}^{2}\leq P_{\rm Max}, && \text{(2b)}\\
& R_{k}\left(\{{\bf w}_{i}\}\right)\geq R_{\rm Req}, && \text{(2c)}\\
& {\bf w}_{i}\in\mathbb{C}^{N_{\rm T}},\ \forall i,k\in\mathcal{K},
\end{aligned}$$

where $R_{\rm Req}$ denotes the information rate requirement of each user and $P_{\rm Max}$ denotes the power budget of the BS.

The problem (2) can be solved by conventional convex optimization, which, however, has to be re-run for every CSI realization. To extract the knowledge of the mapping from the CSI to the optimal beamforming vectors, a GNN-based approach is investigated by solving the following graph optimization problem.

II-C Graph optimization problem formulation

The network topology of the considered MU-MISO system makes it suitable to be modeled as a graph. In particular, the considered MU-MISO system can be modeled as a fully connected directed graph denoted by ${\cal G}=({\cal V},{\cal E})$, as shown in Fig. 1, where ${\cal V}$ denotes the set of nodes and ${\cal E}$ denotes the set of edges. Each BS-user desired link is modeled as a node while each interference link is modeled as an edge. That is, there are $K$ nodes and $K(K-1)$ edges. The node feature of the $k$-th node is defined as the corresponding CSI, i.e., ${\bf h}_{k}$, while there is no edge feature as the inter-user interference is implicit.

The node feature matrix of the graph ${\cal G}$ is denoted by $\mathbf{H}\triangleq[{\bf h}_{1},\cdots,{\bf h}_{K}]^{T}\in\mathbb{C}^{K\times N_{\rm T}}$. By defining the beamforming matrix $\mathbf{W}\triangleq[{\bf w}_{1},\cdots,{\bf w}_{K}]^{T}\in\mathbb{C}^{K\times N_{\rm T}}$, the problem (2) can be rewritten as the following graph optimization problem:

$$\begin{aligned}
\max_{\mathbf{W}}\quad & \sum\nolimits_{k=1}^{K}\widehat{R}_{k}\left(\mathbf{W}\right) && \text{(3a)}\\
{\rm s.t.}\quad & \|\mathbf{W}\|_{F}^{2}\leq P_{\rm Max}, && \text{(3b)}\\
& \widehat{R}_{k}\left(\mathbf{W}\right)\geq R_{\rm Req}, && \text{(3c)}\\
& \mathbf{W}\in\mathbb{C}^{K\times N_{\rm T}},\ \forall k\in\mathcal{K},
\end{aligned}$$

where

$$\widehat{R}_{k}\left(\mathbf{W}\right)\triangleq\log_{2}\left(1+\frac{\left|\mathbf{H}_{(k,:)}^{H}\mathbf{W}_{(k,:)}\right|^{2}}{\sum\nolimits_{i\in\mathcal{N}(k)}\left|\mathbf{H}_{(k,:)}^{H}\mathbf{W}_{(i,:)}\right|^{2}+\sigma_{k}^{2}}\right)$$

and $\mathcal{N}(k)\triangleq\mathcal{K}\setminus\{k\}$ denotes the set of neighbor nodes of the $k$-th node.
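To make the graph representation concrete, the following sketch (illustrative PyTorch code; the function name and tensors are assumptions, not from the paper) builds the node feature matrix and the $K(K-1)$ directed edges of the fully connected graph:

import torch

def build_fully_connected_graph(H):
    # H: complex channel matrix of shape (K, N_T); row k is the node feature h_k.
    # Returns the node features and a (2, K(K-1)) edge index with one directed
    # edge per interference link; no edge features are attached.
    K = H.shape[0]
    src, dst = [], []
    for k in range(K):
        for j in range(K):
            if j != k:
                src.append(j)   # neighbour j sends its feature ...
                dst.append(k)   # ... to the central node k during aggregation
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return H, edge_index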

Figure 2: The architecture of the CRGAT.

III GNN Architecture

To solve the defined graph optimization problem, we propose a GNN-based architecture named CRGAT with the node feature matrix (i.e., $\mathbf{H}$) as the input and the optimization variable (i.e., $\mathbf{W}$) as the output. In particular, the proposed CRGAT is composed of two parts, i.e., a complex encoder and a complex decoder. The complex encoder aims to explore the hidden edge features (such as the inter-user interference) as well as to exploit the node features (i.e., $\mathbf{H}$) by using the message passing mechanism, so as to obtain embedded features with full graph information. The complex decoder aims to map the embedded features to the learned beamforming matrix.

The overall architecture of the proposed CRGAT is shown in Fig. 2, where the complex encoder includes $L$ complex residual graph attention layers (CRGALs) and the complex decoder includes $C$ complex fully connected layers (CFCLs). The details of each layer are presented as follows.

III-A Complex residual graph attention layer

The CRGAL is the key component of the proposed CRGAT, and it updates the feature of each node by aggregating the features of its neighboring nodes. As there is no edge feature, each node can only learn its interference to/from other nodes via the node features. To facilitate the aggregation process, multiple attention mechanisms are adopted to assign different weights to the neighboring nodes (as the considered system is modeled as a fully connected graph, the neighboring nodes of a node are the other $K-1$ nodes), such that the central node pays more attention to the neighboring nodes with greater impact on it.

It is noted that stacking more CRGALs may induce the oversmoothing issue, which homogenizes the aggregated features of different nodes and degrades their usefulness. Therefore, the residual structure is adopted to balance the trade-off between adopting deeper CRGALs and avoiding the oversmoothing issue.

Figure 3: The processes of the CRGAL.

As shown in Fig. 3, each CRGAL consists of two processes, i.e., the attention-enabled aggregation and the residual-assisted combination.

III-A1 Attention-enabled aggregation

The $k$-th node aggregates the features of its neighboring nodes using feedforward NNs and attention mechanisms.

For the $l$-th CRGAL ($l\in{\cal L}\triangleq\{1,\dots,L\}$), let $\mathbf{H}^{(l)}\in\mathbb{C}^{K\times F(l)}$ denote its input node feature matrix, where $F(l)$ denotes the corresponding dimension of the node feature. The input of the $1$-st CRGAL is the node feature matrix of the graph ${\cal G}$, i.e., $\mathbf{H}^{(1)}=\mathbf{H}$ with $F(1)=N_{\rm T}$, and the input of the $l$-th CRGAL, i.e., $\mathbf{H}^{(l)}$, is the output of the $(l-1)$-th CRGAL. $D(l)$ self-attention mechanisms are applied, and each one is realized by a trainable vector ${\bf a}^{(l)}_{d}\in\mathbb{C}^{\widetilde{F}(l)}$ ($d\in\{1,\dots,D(l)\}$), where $\widetilde{F}(l)$ denotes the corresponding dimension of the output of the feedforward NN. Let $\alpha^{(l)}_{k,j,d}$ denote the attention coefficient for the $(k,j)$-th node pair after the $d$-th attention mechanism of the $l$-th CRGAL; $\alpha^{(l)}_{k,j,d}$ is computed by (6), where

$$\mathbf{A}^{(l)}\triangleq[{\bf a}^{(l)}_{1},\cdots,{\bf a}^{(l)}_{D(l)}]^{T}\in\mathbb{C}^{D(l)\times\widetilde{F}(l)},\quad (4)$$

${\bm\Theta}^{(l)}\in\mathbb{C}^{D(l)\times F(l)\times\widetilde{F}(l)}$ denotes the learnable parameter of the feedforward NN, $f_{2}(\cdot)$ represents a nonlinear operation (e.g., ${\rm LeakyReLU}(\cdot)$), and $f_{3}(\cdot):\mathbb{C}\rightarrow\mathbb{R}$ represents a complex-to-real function (e.g., ${\rm Abs}(\cdot)$).

Then, the aggregated feature of the $k$-th node after the $d$-th attention mechanism of the $l$-th CRGAL is computed by

$${\bm\beta}^{(l)}_{k,d}=f_{1}\left(\alpha^{(l)}_{k,j,d}\,\mathbf{H}^{(l)}_{(j,:)}{\bm\Theta}^{(l)}_{(d,:,:)}\,\Big|\,j\in\mathcal{N}(k)\right)\in\mathbb{C}^{\widetilde{F}(l)},\quad (5)$$

where $f_{1}(\cdot):\mathbb{C}^{\widetilde{F}(l)\times|\mathcal{N}(k)|}\rightarrow\mathbb{C}^{\widetilde{F}(l)}$ represents a function with permutation invariance (e.g., ${\rm sum}(\cdot)$). Note that with $D(l)$ independent attention mechanisms, $D(l)$ aggregated features, i.e., $\{{\bm\beta}^{(l)}_{k,d}\}_{d=1}^{D(l)}$, are obtained for each node in the $l$-th CRGAL.

$$\alpha^{(l)}_{k,j,d}=\frac{\exp\left(f_{3}\left(f_{2}\left(\mathbf{H}^{(l)}_{(k,:)}{\bm\Theta}^{(l)}_{(d,:,:)}+\mathbf{H}^{(l)}_{(j,:)}{\bm\Theta}^{(l)}_{(d,:,:)}\right)^{T}\mathbf{A}^{(l)}_{(d,:)}\right)\right)}{\sum\nolimits_{i\in\mathcal{N}(k)}\exp\left(f_{3}\left(f_{2}\left(\mathbf{H}^{(l)}_{(k,:)}{\bm\Theta}^{(l)}_{(d,:,:)}+\mathbf{H}^{(l)}_{(i,:)}{\bm\Theta}^{(l)}_{(d,:,:)}\right)^{T}\mathbf{A}^{(l)}_{(d,:)}\right)\right)}\quad (6)$$
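For illustration, the attention coefficients in (6) for a single head can be computed as in the following sketch (hypothetical PyTorch helper, assuming $f_{2}={\rm LeakyReLU}$ applied to the real and imaginary parts and $f_{3}={\rm Abs}$):

import torch

def attention_coefficients(Hl, Theta_d, a_d, neighbors):
    # Hl: complex node features of layer l, shape (K, F(l)).
    # Theta_d: complex weight of head d, shape (F(l), F_tilde(l)).
    # a_d: complex attention vector, shape (F_tilde(l),).
    # neighbors[k] lists the indices in N(k).
    Z = Hl @ Theta_d                              # H^(l) Theta^(l)_(d,:,:)
    K = Z.shape[0]
    leaky = torch.nn.LeakyReLU(0.2)
    alpha = torch.zeros(K, K)
    for k in range(K):
        logits = []
        for j in neighbors[k]:
            s = Z[k] + Z[j]                                  # argument of f2 in (6)
            s = torch.complex(leaky(s.real), leaky(s.imag))  # f2 on real/imag parts
            logits.append(torch.abs(torch.sum(s * a_d)))     # f3 of the product with a_d
        alpha[k, neighbors[k]] = torch.softmax(torch.stack(logits), dim=0)
    return alpha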

 

III-A2 Residual-assisted combination

The residual-assisted combination combines the $D(l)$ attention-enabled aggregated features, the input features of the $l$-th CRGAL, and the initial input features (i.e., the input features of the $1$-st CRGAL). The latter two features (also known as the jump residual and the initial residual, respectively) are utilized to overcome the oversmoothing issue. With the residual-assisted combination operation, the net activation of the $l$-th CRGAL, denoted by $\widetilde{\mathbf{H}}^{(l+1)}$, is given by

$$\widetilde{\mathbf{H}}^{(l+1)}=\underbrace{{\rm Com}\left(\left\{{\bm\beta}^{(l)}_{k,d}\right\}_{d=1}^{D(l)}\right)}_{\rm Aggregation}+\underbrace{\overline{a}^{(l)}\mathbf{H}^{(l)}\overline{\bm\Theta}^{(l)}}_{\rm Jump\ residual}+\underbrace{\widetilde{a}^{(l)}\mathbf{H}^{(1)}\widetilde{\bm\Theta}^{(l)}}_{\rm Initial\ residual}\in\mathbb{C}^{K\times(\widetilde{F}(l)\times D(l))},\quad (7)$$

where ${\rm Com}(\cdot)$ represents the concatenation operation, and $\overline{a}^{(l)},\widetilde{a}^{(l)}\in\mathbb{R}$ are trainable weighting factors. $\overline{\bm\Theta}^{(l)}\in\mathbb{C}^{F(l)\times(\widetilde{F}(l)\times D(l))}$ and $\widetilde{\bm\Theta}^{(l)}\in\mathbb{C}^{F(1)\times(\widetilde{F}(l)\times D(l))}$ denote the trainable parameters of the feedforward NNs for the jump residual and the initial residual, respectively.
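A minimal sketch of the combination step (7), assuming PyTorch tensors and hypothetical argument names, is given below; it simply concatenates the per-head aggregated features and adds the two weighted residual branches:

import torch

def residual_combination(betas, Hl, H1, Theta_bar, Theta_tilde, a_bar, a_tilde):
    # betas: list of D(l) complex tensors, each of shape (K, F_tilde(l)).
    # Hl, H1: inputs of the l-th and the 1-st CRGAL, shapes (K, F(l)) and (K, F(1)).
    # Theta_bar, Theta_tilde: complex weights mapping to (K, F_tilde(l)*D(l)).
    # a_bar, a_tilde: real trainable weighting factors.
    aggregated = torch.cat(betas, dim=-1)        # Com({beta_{k,d}})
    jump = a_bar * (Hl @ Theta_bar)              # jump residual
    initial = a_tilde * (H1 @ Theta_tilde)       # initial residual
    return aggregated + jump + initial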

For the $l$-th CRGAL, $\mathbb{C}{\rm SELU}(\cdot)$ is adopted as the activation function to enhance the representational capability, which is given by

$$\mathbf{H}^{(l+1)}=\mathbb{C}{\rm SELU}\left(\widetilde{\mathbf{H}}^{(l+1)}\right)={\rm SELU}\left({\rm Re}\left(\widetilde{\mathbf{H}}^{(l+1)}\right)\right)+i\,{\rm SELU}\left({\rm Im}\left(\widetilde{\mathbf{H}}^{(l+1)}\right)\right),\quad (8)$$

where the activation function ${\rm SELU}(\cdot)$ is applied to the real and imaginary parts of the input features in an element-wise manner, such that the output and the input of $\mathbb{C}{\rm SELU}(\cdot)$ have the same shape. Note that the output of the $L$-th CRGAL, i.e., $\mathbf{H}^{(L+1)}$, is the input of the $1$-st CFCL.
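The $\mathbb{C}{\rm SELU}$ activation in (8) amounts to the following one-line sketch (illustrative PyTorch code):

import torch
import torch.nn.functional as F

def complex_selu(X):
    # SELU applied element-wise to the real and imaginary parts of the complex
    # input X, so the output has the same shape as the input.
    return torch.complex(F.selu(X.real), F.selu(X.imag))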

Remark 1.

The considered system is formulated as a fully connected graph. As a result, each node obtains information from all its neighboring nodes through a single graph convolution, which makes it more likely to suffer from the oversmoothing issue. For the traditional GAT, the learning performance is degraded by using more layers, because the embedded features of all nodes tend to become identical [27]. However, the oversmoothing issue is mitigated by the adopted residual-assisted combination. As a result, more CRGALs can be stacked while the feature embedding capability is also enhanced. The reason is that the jump residual and the initial residual prevent the embedded features of all nodes from becoming identical.

III-B Complex fully connected layer

The CFCLs are utilized to map the embedded features, i.e., $\mathbf{H}^{(L+1)}$, to the learned beamforming matrix. Let $\mathbf{H}^{(c)}\in\mathbb{C}^{K\times G(c)}$ denote the input node feature matrix of the $c$-th CFCL ($c\in{\cal C}\triangleq\{1,\dots,C\}$), where $G(c)$ denotes the corresponding feature dimension of each node. In particular, for the $c$-th CFCL, the mapping from the input to the net activation, denoted by $\widetilde{\mathbf{H}}^{(c+1)}$, is realized by

$$\widetilde{\mathbf{H}}^{(c+1)}=\mathbf{H}^{(c)}{\bm\Theta}^{(c)}\in\mathbb{C}^{K\times G(c+1)},\quad (9)$$

where ${\bm\Theta}^{(c)}\in\mathbb{C}^{G(c)\times G(c+1)}$ denotes the learnable parameters of the $c$-th CFCL.

For the $c$-th CFCL with $c\in{\cal C}\setminus\{C\}$, a $\mathbb{C}{\rm BatchNorm}$ ($\mathbb{C}$BN) layer is added in order to prevent the proposed CRGAT from overfitting as well as to enhance the convergence behavior. The $\mathbb{C}$BN layer can be treated as an activation function, which is applied to the net activation after $\mathbb{C}{\rm SELU}(\cdot)$ as follows:

$$\mathbf{H}^{(c+1)}=\mathbb{C}{\rm BN}\left(\mathbb{C}{\rm SELU}\left(\widetilde{\mathbf{H}}^{(c+1)}\right)\right)={\rm BN}\left({\rm SELU}\left({\rm Re}\left(\widetilde{\mathbf{H}}^{(c+1)}\right)\right)\right)+i\,{\rm BN}\left({\rm SELU}\left({\rm Im}\left(\widetilde{\mathbf{H}}^{(c+1)}\right)\right)\right),\quad (10)$$

where ${\rm BN}(\cdot)$ denotes the batch normalization function.

As for the last CFCL, a new activation function is proposed to guarantee that the output beamforming matrix, denoted as $\widehat{\mathbf{W}}$, satisfies the sum-power constraint (2b), which is given by

$$\widehat{\mathbf{W}}=\phi\left(\widetilde{\mathbf{H}}^{(C+1)}\right)=\left\{\begin{array}{ll}\sqrt{P_{\rm Max}}\,\widetilde{\mathbf{H}}^{(C+1)}, & \left\|\widetilde{\mathbf{H}}^{(C+1)}\right\|_{F}^{2}\leq 1\\ \sqrt{\dfrac{P_{\rm Max}}{\left\|\widetilde{\mathbf{H}}^{(C+1)}\right\|_{F}^{2}}}\,\widetilde{\mathbf{H}}^{(C+1)}, & \left\|\widetilde{\mathbf{H}}^{(C+1)}\right\|_{F}^{2}>1\end{array}\right.\in\mathbb{C}^{K\times N_{\rm T}}.\quad (11)$$

Note that $\widehat{\mathbf{W}}$ also represents the output of the CRGAT, i.e., the learned beamforming matrix.
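For illustration, the scaling in (11) can be applied to the decoder output as in the following sketch (hypothetical PyTorch helper, assuming the second branch normalizes by the squared Frobenius norm so that the budget (2b) is always met):

import math
import torch

def power_projection(X, p_max):
    # X: complex net activation of the last CFCL, shape (K, N_T).
    norm_sq = torch.sum(torch.abs(X) ** 2)                # ||X||_F^2
    if norm_sq <= 1.0:
        return math.sqrt(p_max) * X                       # ||W||_F^2 = P_Max * ||X||_F^2 <= P_Max
    return math.sqrt(p_max) * X / torch.sqrt(norm_sq)     # ||W||_F^2 = P_Max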

With the proposed activation function (11), the problem (2) is equivalently transformed into the following optimization problem:

$$\begin{aligned}
\min_{\bm\theta}\quad & \sum\nolimits_{k=1}^{K}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}|\bm\theta\right) && \text{(15a)}\\
{\rm s.t.}\quad & \widehat{R}_{k}\left(\widehat{\mathbf{W}}|\bm\theta\right)\geq R_{\rm Req},\ \forall k\in\mathcal{K}, && \text{(15b)}
\end{aligned}$$

where $\bm\theta\triangleq\{\mathbf{A}^{(l)},{\bm\Theta}^{(l)},\overline{\bm\Theta}^{(l)},\widetilde{\bm\Theta}^{(l)},{\bm\Theta}^{(c)}\}$ represents all trainable parameters in the proposed CRGAT.

Remark 2.

The activation function (11) is based on the projected gradient descent method and is utilized to realize the allocation of the power budget. That is, the activation function (11) in fact simplifies the task from learning the beamforming vectors to learning the directions of the beamforming vectors. Besides, it is observed from (15) that the impact of $P_{\rm Max}$ has been integrated into the CRGAT, and thus, the CRGAT should have the generalization capability with respect to $P_{\rm Max}$.

Remark 3.

It is observed from the definition of $\bm\theta$ that all learnable parameters are independent of the number of users. Therefore, the CRGAT still works if the number of users changes; that is, the CRGAT has the generalization capability with respect to the number of users. Intuitively, the generalization performance is affected by the difference between the numbers of users in the training set and the test set. Based on the generalization performance, one can directly use the well-trained CRGAT or update the CRGAT in a transfer learning manner if the network topology changes, which greatly saves the time and computational resources needed to re-train a new CRGAT.

IV Unsupervised learning

With the proposed CRGAT, the mapping from the channel matrix to the learned beamforming matrix satisfying the sum-power constraint is established, i.e., $\widehat{\mathbf{W}}={\rm CRGAT}(\mathbf{H})$. To train the CRGAT via unsupervised learning, the problem (15) should be further transformed into an unconstrained problem, such that the trainable parameters can be updated in a gradient descent manner.

To this end, the PM and the LDM are respectively adopted to design the loss function based on the objective function (15a) and the rate requirement constraint (15b).

IV-A Loss function based on penalty method

$$L_{\rm Penal}\left(\bm\theta\right)=\frac{1}{N}\sum\nolimits_{n=1}^{N}\Bigg[\underbrace{\sum\nolimits_{k=1}^{K}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}_{(n)}|\bm\theta\right)}_{\rm Objective\ function\ due\ to\ (15a)}+\lambda\underbrace{\sum\nolimits_{k=1}^{K}{\rm ReLU}\left(R_{\rm Req}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}_{(n)}|\bm\theta\right)\right)}_{\rm Penalty\ term\ due\ to\ (15b)}\Bigg]\quad (16)$$

 

The PM embeds the rate requirement constraint (15b) into the loss function as a penalty term to enforce the output of the CRGAT to satisfy (15b). For the problem (15), the loss function based on the PM is given by (16), where $N$ denotes the batch size, ${\rm ReLU}(x)\triangleq\max(x,0)$ represents the rectified linear unit activation function, $\widehat{\mathbf{W}}_{(n)}$ represents the output of the CRGAT associated with the $n$-th sample, i.e., $\mathbf{H}_{(n)}$, and $\lambda>0$ denotes the penalty coefficient, which is treated as a hyperparameter.

Particularly, the penalty term is imposed only if at least one rate requirement constraint is not satisfied. With a larger value of $\lambda$, the rate requirement constraints are more likely to be satisfied, which, however, may weaken the impact of the objective function on the loss function. By carefully adjusting the value of $\lambda$, a tradeoff between maximizing the sum rate and satisfying the rate requirement constraints can be achieved.
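For reference, the loss (16) for one minibatch can be written as in the following sketch (illustrative PyTorch code; the names rates, r_req and lam are assumptions):

import torch

def pm_loss(rates, r_req, lam):
    # rates: tensor of shape (N, K) with the achieved rates of the N samples.
    sum_rate = rates.sum(dim=1)                        # objective term of (16)
    penalty = torch.relu(r_req - rates).sum(dim=1)     # rate-requirement violations
    return (-sum_rate + lam * penalty).mean()          # average over the minibatch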

The unsupervised training procedure with the loss function based on the PM is shown in Algorithm 1, where $M$, $E$ and $B$ respectively denote the total numbers of samples, epochs and minibatches in the training dataset, and $\omega$ denotes the learning rate.

$$\bm\theta:=\bm\theta-\omega\nabla_{\bm\theta}L_{\rm Penal}\left(\bm\theta\right)\quad (17)$$
Input:  Training dataset $\mathcal{D}=\{\mathbf{H}_{(n)}\}_{n=1}^{M}$;
Output:  Learned $\bm\theta$;
1:  Initialize $\bm\theta$;
2:  for epoch $e\in[0,1,\dots,E]$ do
3:     for index of minibatch $b\in[0,1,\dots,B]$ do
4:        Sample the $b$-th minibatch $\{\mathbf{H}_{(n)}\}_{n=1}^{N}$;
5:        Calculate the corresponding beamforming matrices $\{\widehat{\mathbf{W}}_{(n)}\}_{n=1}^{N}$ via $\widehat{\mathbf{W}}_{(n)}={\rm CRGAT}(\mathbf{H}_{(n)})$, $\forall n$;
6:        Update $\bm\theta$ via (17);
7:     end for
8:  end for
9:  return $\bm\theta$.
Algorithm 1 Unsupervised Learning with Loss Function Based on PM
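A minimal sketch of Algorithm 1 in PyTorch is given below; rates_from(H, W) is an assumed helper implementing the rates in (3), pm_loss is the sketch given above, and the remaining names are placeholders rather than the authors' code:

import torch

def train_pm(model, loader, epochs, lr, r_req, lam):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for H in loader:                                   # step 4: sample a minibatch
            W = model(H)                                   # step 5: W_hat = CRGAT(H)
            loss = pm_loss(rates_from(H, W), r_req, lam)   # loss (16)
            opt.zero_grad()
            loss.backward()
            opt.step()                                     # step 6: update via (17)
    return model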

IV-B Loss function based on Lagrangian duality method

Instead of using the constant penalty coefficient, the LDM adopts the learnable Lagrangian multipliers.

Let $\bm\mu\triangleq[\mu_{1},\dots,\mu_{K}]\in\mathbb{R}_{+}^{K}$ denote the non-negative Lagrangian multipliers associated with (15b). The Lagrangian dual optimization problem of the problem (15) is given by

$$\max_{\bm\mu}\min_{\bm\theta}\ \mathcal{L}\left(\bm\theta,\bm\mu\right),$$

where $\mathcal{L}\left(\bm\theta,\bm\mu\right)$ is given by (18).

$$\mathcal{L}\left(\bm\theta,\bm\mu\right)=\sum\nolimits_{k=1}^{K}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}|\bm\theta\right)+\sum\nolimits_{k=1}^{K}\mu_{k}\,{\rm ReLU}\left(R_{\rm Req}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}|\bm\theta\right)\right)\quad (18)$$

 

The Lagrangian dual optimization problem can be solved by alternately optimizing $\bm\theta$ and $\bm\mu$. Similarly, the loss function based on the LDM is given by (21), and we alternately update $\bm\theta$ and $\bm\mu$ during the training phase. Then, the unsupervised training procedure with the loss function based on the LDM is shown in Algorithm 2, where $\tau>0$ denotes the step size for updating $\bm\mu$.

$$\bm\theta:=\bm\theta-\omega\nabla_{\bm\theta}L_{\rm Lag}\left(\bm\theta,\bm\mu\right)\quad (19)$$

$$\nabla_{\mu_{k}}:=\nabla_{\mu_{k}}+\tau\sum\nolimits_{n=1}^{N}{\rm ReLU}\left(R_{\rm Req}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}_{(n)}|\bm\theta\right)\right)\quad (20)$$
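For reference, the loss (21) and the epoch-level multiplier update (line 10 of Algorithm 2) can be sketched as follows (illustrative PyTorch code; grad_mu is assumed to accumulate the per-minibatch violations as in (20)):

import torch

def ldm_loss(rates, r_req, mu):
    # rates: (N, K) achieved rates; mu: (K,) Lagrangian multipliers.
    violation = torch.relu(r_req - rates)
    return (-rates.sum(dim=1) + (mu * violation).sum(dim=1)).mean()

def dual_ascent_step(mu, grad_mu, tau):
    # Epoch-level update of the multipliers, kept non-negative.
    return torch.clamp(mu + tau * grad_mu, min=0.0)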
Remark 4.

Different from the sum-power constraint, which is handled via the activation function, neither the PM nor the LDM can theoretically guarantee a feasible solution. Nevertheless, the efficacy of the two methods has been validated numerically in some existing works, see, e.g., [24, 25].

$$L_{\rm Lag}\left(\bm\theta,\bm\mu\right)=\frac{1}{N}\sum\nolimits_{n=1}^{N}\left[\sum\nolimits_{k=1}^{K}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}_{(n)}|\bm\theta\right)+\sum\nolimits_{k=1}^{K}\mu_{k}\,{\rm ReLU}\left(R_{\rm Req}-\widehat{R}_{k}\left(\widehat{\mathbf{W}}_{(n)}|\bm\theta\right)\right)\right]\quad (21)$$

 

Input:  Training dataset $\mathcal{D}=\{\mathbf{H}_{(n)}\}_{n=1}^{M}$;
Output:  Learned $\bm\theta$ and $\bm\mu$;
1:  Initialize $\bm\theta$ and $\bm\mu$;
2:  for epoch $e\in[0,1,\dots,E]$ do
3:     $\nabla_{\bm\mu}\triangleq[\nabla_{\mu_{1}},\dots,\nabla_{\mu_{K}}]\leftarrow{\bf 0}$;
4:     for index of minibatch $b\in[0,1,\dots,B]$ do
5:        Sample the $b$-th minibatch $\{\mathbf{H}_{(n)}\}_{n=1}^{N}$;
6:        Calculate the corresponding beamforming matrices $\{\widehat{\mathbf{W}}_{(n)}\}_{n=1}^{N}$ via $\widehat{\mathbf{W}}_{(n)}={\rm CRGAT}(\mathbf{H}_{(n)})$, $\forall n$;
7:        Update $\bm\theta$ via (19);
8:        Update $\nabla_{\mu_{k}}$ via (20), $\forall k\in\mathcal{K}$;
9:     end for
10:     $\bm\mu\leftarrow\bm\mu+\tau\nabla_{\bm\mu}$;
11:  end for
12:  return $\bm\theta$ and $\bm\mu$.
Algorithm 2 Unsupervised Learning with Loss Function Based on LDM

V Numerical Results

In this section, numerical results are provided to evaluate the proposed CRGAT, the proposed activation function, and the two loss functions. An evaluation method for the learning-based approach is proposed based on certain basic metrics and covers three aspects: model effectiveness, scalability, and adaptability. Besides, the effectiveness of the residual-assisted combination is validated by an ablation experiment, and the transfer learning performance of the CRGAT is illustrated.

V-A Simulation setup

Simulation scenario: The number of antennas is set as $N_{\rm T}\in\{8,16\}$. The number of users is set as $K\in\{3,4,5,6,7,8,9\}$. The total power budget is set as $P_{\rm Max}\in\{1,2,3\}$. The rate requirement is set as $R_{\rm Req}\in\{1,2,3\}$ bit/s/Hz. The CSI of each user includes both the large-scale path loss and the small-scale fading. The path-loss model is $10^{-(140.7+36.7\log_{10}(\widetilde{d}))/10}$, where $\widetilde{d}$ (in kilometers) denotes the distance between the BS and the user. Rayleigh fading is adopted for the small-scale fading. The noise power spectral density is $-162$ dBm/Hz and the bandwidth is $10$ MHz. The average ratio between the path loss and the noise power is set as $10$ dB.

Computer configuration: The convex formulations and the NNs are respectively processed by the convex solver SeDuMi under MathWorks MATLAB R2021b and by Python 3.8.18 with PyTorch 1.10.0, on a computer with an Intel(R) Core(TM) i9-12900K CPU and one NVIDIA RTX 3090 GPU (24 GB of memory).

Dataset: The ratio between the training set, validation set and test set is set as $9:1:1$. Since unsupervised learning is adopted, the training set does not include labels. The labels of the validation set and the test set are generated by an SCA-based approach (presented in Appendix A) with a convergence precision of $10^{-4}$. All the datasets prepared for the simulation experiments are listed in Table I.

TABLE I: Datasets.
No. $N_{\rm T}$ $K$ $P_{\rm Max}$ $R_{\rm Req}$ Size Type
1 8 3 1 1 10,000 B
2 8 4 1 1 110,000 A
3 8 5 1 1 110,000 A
4 16 7 1 1 10,000 B
5 16 8 1 1 110,000 A
6 16 9 1 1 10,000 B
7 16 8 1 2 110,000 A
8 16 8 1 3 110,000 A
9 16 8 2 1 110,000 A
10 16 8 3 1 110,000 A
  • Type A: The training set, validation set and test set are all included.

  • Type B: Only test set is included.

GNN architecture: The detailed architecture of the CRGAT is shown in Table II, where $4$ CGCLs and $3$ CFCLs are included.

TABLE II: The architecture of the proposed CRGAT.
No. Type IFs OFs AMs $\mathbb{C}$SELU $\mathbb{C}$BN
1 CRGL $N_{\rm T}$ 32 10 ✓ ×
2 CGCL 320 64 10 ✓ ×
3 CGCL 640 128 10 ✓ ×
4 CGCL 1280 256 10 ✓ ×
5 CFCL 2560 1024 - ✓ ✓
6 CFCL 1024 512 - ✓ ✓
7 CFCL 512 $N_{\rm T}$ - × ×
  • IFs: The dimension of the input feature, i.e., $F(l)$, $G(c)$.

  • OFs: The dimension of the output feature.

  • AMs: The number of attention mechanisms, i.e., $D(l)$.

  • CGCL: Generalized to all complex graph convolution layers.

Basic Metrics: The basic metrics are used to evaluate the learning-based approaches on a given test set in terms of the optimality performance, feasibility rate and inference time.

  • 1) Optimality performance: The ratio of the average achievable sum rate (over the feasible solutions) obtained by the learning-based approach to the average optimal sum rate.

  • 2) Feasibility rate: The percentage of feasible solutions to the considered problem obtained by the learning-based approach.

  • 3) Inference time: The average running time required by the learning-based approach to calculate the feasible beamforming vectors from the given CSI.

Advanced Metrics: Based on the basic metrics, we define the following three advanced metrics to evaluate the learning-based approaches. The details are described as follows.

  • 1) Model effectiveness: Evaluate the efficacy of the model architecture based on the basic metrics, where the settings of the test set and the training set are identical (the identical setting means that the scalar variables are the same while the random variables are i.i.d.).

  • 2) Scalability: Evaluate the efficacy of the learning-based approach on a test set with settings different from those of the training set.

  • 3) Adaptability: Evaluate the efficacy of the activation function and the loss functions under different constraints.

V-B Model effectiveness

This subsection shows the performance of the proposed CRGAT based on the basic metrics. Two scenarios are considered, i.e., $(N_{\rm T},K,P_{\rm Max},R_{\rm Req})=(8,4,1,1)$ (Dataset No. 2) and $(N_{\rm T},K,P_{\rm Max},R_{\rm Req})=(16,8,1,1)$ (Dataset No. 5). To fully evaluate the proposed CRGAT, several convex-optimization-based and learning-based approaches are adopted as benchmarks (the learning-based benchmarks follow existing works and are extended to solve the considered problem).

  • 1) SCA: The SCA-based algorithm, which serves as a benchmark to evaluate the optimality and the computational efficiency of the learning-based approaches;

  • 2) MRT: The SCA-based algorithm based on the maximum ratio transmission (MRT) scheme (given in Appendix B);

  • 3) ZF: The SCA-based algorithm based on the zero-forcing (ZF) scheme (given in Appendix B);

  • 4) CMLP [7]: Direct learning of the mapping from the CSI, i.e., $\{{\bf h}_{i}\}$, to the beamforming vectors, i.e., $\{{\bf w}_{i}\}$, via CFCLs;

  • 5) CGCN [23]: Topological information among the CSI is utilized through graph convolution;

  • 6) CGAT [15]: An additive attention mechanism is used in the process of graph convolution;

  • 7) CTGCN [28]: A key-value attention mechanism is used in the process of graph convolution.

For a fair comparison, all learning-based approaches adopt the activation function (11) and are trained via Algorithm 1. The numerical results are shown in Table III.

TABLE III: Model effectiveness.
$N_{\rm T}$ $K$ Approach OP FR IT
8 4 SCA 100% 100% 1.99s
MRT 46.6% 100% 1.29s
ZF 99.1% 100% 1.12s
CMLP 79.7% 0% 0.93ms
CGCN 84.5% 99.3% 0.90ms
CGAT 91.0% 99.6% 1.66ms
CTGCN 85.0% 99.2% 2.42ms
CRGAT 97.1% 100% 2.73ms
16 8 SCA 100% 100% 8.48s
MRT 40.6% 100% 1.86s
ZF 98.9% 100% 2.14s
CMLP 36.1% 0% 0.94ms
CGCN 52.3% 98.5% 0.92ms
CGAT 84.0% 99.9% 1.71ms
CTGCN 45.0% 92.7% 2.46ms
CRGAT 95.4% 100% 2.81ms
  • OP: Optimality performance.

  • FR: Feasibility rate.

  • IT: Inference time.

  • : The optimality performance is calculated with infeasible solutions as there is no feasible solution.

Compared with the benchmark learning-based approaches, the proposed CRGAT achieves the best optimality performance, and its optimality performance gap to the SCA-based algorithm is less than $5\%$. Increasing the numbers of antennas and users has an insignificant impact on the optimality performance of the CRGAT, but greatly degrades the optimality performance of the other learning-based approaches. On the one hand, the attention-enabled aggregation adopted in the CRGAT is able to explore the hidden features due to the inter-user interference, which makes it more suitable for the considered problem than the CMLP and the CGCN. On the other hand, the residual-assisted combination adopted in the CRGAT avoids the oversmoothing issue, which enhances its feature embedding capability compared with the CGAT and the CTGCN. Besides, a feasible solution by the CRGAT is guaranteed.

Compared with the traditional convex-optimization-based approaches, the inference time is greatly reduced by the learning-based approaches. Besides, increasing the numbers of antennas and users has a limited impact on the inference time of the learning-based approaches while greatly increasing the running time of the convex-optimization-based approaches. That is, with limited optimality performance loss, the proposed CRGAT is able to make a real-time response to the time-varying CSI, e.g., within a few milliseconds with our computer configuration. The following comparisons are among the learning-based approaches only, and the inference time is omitted.

Furthermore, we evaluate the robustness of the proposed CRGAT with $(N_{\rm T},K)=(16,8)$ by adding channel estimation errors to the original input, i.e., $\widehat{\mathbf{H}}=\mathbf{H}+\mathbf{E}$, where $\mathbf{H}$ represents the original input and $\mathbf{E}$ denotes the CSI error with $\mathbf{E}_{(k,:)}\sim\mathcal{CN}(0,\sigma^{2}_{{\rm E},k}{\bf I}_{N_{\rm T}})$ ($\forall k\in\mathcal{K}$). The numerical results are shown in Table IV. It is observed that adding noise to the original input has a limited impact on the optimality performance of the CRGAT.

TABLE IV: The robustness of the CRGAT with imperfect CSI.
$\sigma^{2}_{{\rm E},k}$ OP FR
$1\times 10^{-4}\|\mathbf{H}_{(k,:)}\|^{2}$ 95.3% 100%
$4\times 10^{-4}\|\mathbf{H}_{(k,:)}\|^{2}$ 95.2% 100%
$9\times 10^{-4}\|\mathbf{H}_{(k,:)}\|^{2}$ 95.0% 100%

V-C Scalability

V-C1 Scalability to different numbers of users

Due to the dynamic nature of the network topology, the generalization capability with respect to the number of users is of high importance. Therefore, this subsection first evaluates the scalability of the well-trained CRGAT, where the numbers of users in the training set and the test set are different. For comparison, the CGCN, CGAT and CTGCN are also simulated; the CMLP is not included as it is not scalable due to its fixed input size. The numerical results are shown in Table V.

TABLE V: Scalability to different numbers of users.
$N_{\rm T}$ $K_{\rm Tr}$ $K_{\rm Te}$ Model OP FR
8 4 3 CGCN 82.7% 100%
CGAT 86.4% 100%
CTGCN 89.0% 100%
CRGAT 96.0% 100%
5 CGCN 78.8% 92.7%
CGAT 82.6% 98.2%
CTGCN 77.8% 94.3%
CRGAT 92.2% 99.5%
16 8 7 CGCN 55.9% 99.8%
CGAT 84.8% 100%
CTGCN 47.8% 98.3%
CRGAT 95.3% 100%
9 CGCN 52.2% 95.9%
CGAT 82.5% 99.6%
CTGCN 45.0% 92.7%
CRGAT 94.6% 100%
  • $K_{\rm Tr}$: The number of users in the training set.

  • $K_{\rm Te}$: The number of users in the test set.

It is observed that the CRGAT trained on the dataset with $K_{\rm Tr}=4$ (Dataset No. 2) can also achieve a reasonably good performance (within $10\%$ performance loss) when tested on the datasets with $K_{\rm Te}\in\{3,5\}$ (Dataset No. 1 and 3). Besides, the performance loss is within $6\%$ when the CRGAT trained on the dataset with $K_{\rm Tr}=8$ (Dataset No. 5) is tested on the datasets with $K_{\rm Te}\in\{7,9\}$ (Dataset No. 4 and 6). The reason that the CRGAT yields better scalability in the latter case is that in the former case ($K_{\rm Tr}=4$), $K_{\rm Te}$ is increased/decreased by $25\%$ of $K_{\rm Tr}$, while in the latter case ($K_{\rm Tr}=8$), $K_{\rm Te}$ is increased/decreased by $12.5\%$ of $K_{\rm Tr}$. In addition, the CRGAT outperforms the other GNN-based models in terms of scalability, which is again due to the attention-enabled aggregation and the residual-assisted combination. Such a result indicates that the proposed CRGAT is scalable to dynamic network topologies without re-training.

V-C2 Scalability to different power budgets

Besides the number of users, the generalization capability of the proposed approach with regard to $P_{\rm Max}$ is also evaluated. Specifically, considering a scenario of $(N_{\rm T},K,R_{\rm Req})=(16,8,1)$, we train the CRGAT with $P_{\rm Max}=1$ (Dataset No. 5) and test it with $P_{\rm Max}=2$ (Dataset No. 9) and $P_{\rm Max}=3$ (Dataset No. 10), respectively. For comparison, the following two benchmark methods are also simulated:

  • 1) PM only (PM-only) [24]: All constraints are handled by the PM.

  • 2) LDM only (LDM-only) [25]: All constraints are handled by the LDM.

The numerical results are shown in Table VI. It is observed that all methods under test achieve a $100\%$ feasibility rate, while the proposed methods obtain higher optimality performance than the benchmarking ones. Note that the generalization capability of the proposed approach saves the time and computational resources required to re-train the neural network for different scenarios.

TABLE VI: Scalability to different $P_{\rm Max}$.
$P_{\rm Max}^{\rm Tr}$ $P_{\rm Max}^{\rm Te}$ Method OP FR
1 2 PM-only 78.5% 100%
LDM-only 78.3% 100%
PM with (11) 92.4% 100%
LDM with (11) 92.3% 100%
3 PM-only 71.1% 100%
LDM-only 70.1% 100%
PM with (11) 90.1% 100%
LDM with (11) 89.9% 100%
  • $P_{\rm Max}^{\rm Tr}$: $P_{\rm Max}$ in the training set.

  • $P_{\rm Max}^{\rm Te}$: $P_{\rm Max}$ in the test set.

V-D Adaptability

The adaptability is to evaluate the performance of the learning-based approaches under different scenarios. The adaptability to different RReqR_{\rm Req} and PMax{P_{\rm Max}} is respectively evaluated as follows.

Table VII shows the results with different values of $R_{\rm Req}$. It is observed that with an increase of $R_{\rm Req}$, the optimality performance of each method remains almost unchanged while the feasibility rate decreases in general. The reason is that with a larger value of $R_{\rm Req}$, the feasible set of problem (15) shrinks. Besides, the proposed methods are superior to the benchmark ones, especially in terms of the feasibility rate. As it is of high significance for learning-based approaches to yield feasible solutions, the effectiveness of the proposed approach is validated.

TABLE VII: Adaptability to different $R_{\rm Req}$.
$R_{\rm Req}$ Method OP FR
1 PM-only 94.5% 63.0%
LDM-only 94.5% 84.7%
PM with (11) 95.4% 100%
LDM with (11) 95.2% 100%
2 PM-only 94.3% 62.0%
LDM-only 94.5% 91.3%
PM with (11) 94.9% 99.6%
LDM with (11) 95.6% 99.7%
3 PM-only 94.4% 51.0%
LDM-only 95.1% 85.3%
PM with (11) 94.8% 94.1%
LDM with (11) 95.4% 94.3%

Table VIII shows the results with different values of $P_{\rm Max}$. It is interesting to see that with an increment of $P_{\rm Max}$, the optimality performance of all methods is slightly degraded. This might be due to the fact that with a larger $P_{\rm Max}$, there are more possible allocations of the power budget, and the learning task becomes more complicated. Nevertheless, the feasibility rates of the proposed methods remain $100\%$, which outperforms the benchmark methods.

TABLE VIII: Adaptability to different $P_{\rm Max}$.
$P_{\rm Max}$ Method OP FR
1 PM-only 94.5% 63.0%
LDM-only 94.5% 84.7%
PM with (11) 95.4% 100%
LDM with (11) 95.2% 100%
2 PM-only 91.3% 71.4%
LDM-only 91.5% 88.4%
PM with (11) 92.6% 100%
LDM with (11) 92.1% 100%
3 PM-only 88.5% 74.5%
LDM-only 89.9% 87.3%
PM with (11) 89.7% 100%
LDM with (11) 90.5% 100%

V-E Ablation experiment: Residual-assisted combination

The effectiveness of the attention-enabled aggregation has been analyzed in our previous work [15], which is also validated in Table III and Table V. For a deeper understanding of the residual-assisted combination, this subsection shows the performance of CGAT and CRGAT with different depths. The numerical results are shown in Table IX.

TABLE IX: Ablation experiment: Residual-assisted combination.
Model Depth OP FR
CGAT 2 84.4% 99.9%
CGAT 3 84.0% 99.9%
CGAT 4 4.0% 0%
CRGAT 2 84.9% 100%
CRGAT 3 93.8% 100%
CRGAT 4 95.4% 100%
  • Depth: Total number of complex graph convolution layers.

It is observed that with the increment of depth, the performance of the CGAT is degraded. In particular, a sharp decay occurs at a depth of $4$. This is due to the oversmoothing issue, as analyzed in Remark 1. However, the oversmoothing issue is avoided by the CRGAT, whose expressive capacity is enhanced as the depth increases. Such a result validates the effectiveness and the importance of the residual-assisted combination.

Furthermore, we adopt the mean average distance (MAD) [26] among the node features to quantify the oversmoothing issue, which is calculated by

$${\rm MAD}=\frac{\sum_{i=1}^{K}\sum_{j=1}^{K}\mathbf{D}_{(i,j)}}{\sum_{i=1}^{K}\left|\mathcal{N}(i)\right|},\quad (22)$$

where $\mathbf{D}\in\mathbb{R}^{K\times K}$ denotes the distance matrix, which is calculated by

$$\mathbf{D}_{(i,j)}=1-\frac{\overline{\mathbf{H}}^{(l)}_{(i,:)}\overline{\mathbf{H}}^{(l)}_{(j,:)}}{\|\overline{\mathbf{H}}^{(l)}_{(i,:)}\|_{2}\|\overline{\mathbf{H}}^{(l)}_{(j,:)}\|_{2}},\quad (23)$$

where $\overline{\mathbf{H}}^{(l)}={\rm Com}\left({\rm Re}(\mathbf{H}^{(l)}),{\rm Im}(\mathbf{H}^{(l)})\right)\in\mathbb{R}^{K\times 2\widetilde{F}(l)}$. The MADs of different layers in the CGAT and the CRGAT are shown in Table X.

TABLE X: MAD in different layers of CGAT and CRGAT.
Model Layer 1 2 3 4
CGAT 0.382 0.303 0.238 0.086
CRGAT 0.712 0.689 0.484 0.472

It is observed that the MAD of the CGAT converges to $0$ as the depth grows, which indicates that the node features tend to become indistinguishable. In contrast, a larger MAD is obtained by the CRGAT. As the learning capability is roughly positively correlated with the depth, the residual-assisted combination in fact balances the trade-off between enhancing the learning capability and avoiding the oversmoothing issue.
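For completeness, the MAD in (22)-(23) can be computed as in the following sketch (illustrative PyTorch code; the node features of one layer are assumed to be stored in a complex tensor):

import torch

def mean_average_distance(Hl):
    # Hl: complex node features of one layer, shape (K, F).
    X = torch.cat([Hl.real, Hl.imag], dim=-1)        # concatenate real/imag parts per (23)
    Xn = X / X.norm(dim=-1, keepdim=True)            # row-normalize
    D = 1.0 - Xn @ Xn.T                              # cosine distance matrix
    K = X.shape[0]
    return D.sum() / (K * (K - 1))                   # |N(i)| = K - 1 on the full graph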

V-F Transfer learning

As analyzed before, the scalability of the CRGAT to different numbers of users is affected by $|K_{\rm Tr}-K_{\rm Te}|/K_{\rm Tr}$. To improve the optimality performance, instead of generalizing, one can also reuse the pre-trained CRGAT on a new test set in a transfer learning manner.

Fig. 4 shows the transfer learning performance of the CRGAT, where the learnable parameters of the CRGAT trained on the dataset with $K_{\rm Tr}=4$ (Dataset No. 2) are leveraged as the initial parameters for fine-tuning the CRGAT on the dataset with $K_{\rm Tr}=5$ (Dataset No. 3). For comparison, the convergence behavior of retraining the CRGAT from scratch on the dataset with $K_{\rm Tr}=5$ (Dataset No. 3) is also plotted. It is observed that with transfer learning, the CRGAT achieves about $95\%$ optimality performance after $4$ epochs, while the generalization performance is about $92\%$ (in Table V). Besides, the convergence speed of the transfer learning is much faster than that of the retraining, with a slight optimality performance gain.

Figure 4: Convergence behavior of the transfer learning and the retraining.

VI CONCLUSION

In this paper, we studied an end-to-end GNN-based beamforming design for a constrained sum-rate maximization problem. A GNN-based architecture named CRGAT was proposed to directly map the CSI to the beamforming vectors. To enhance the model representation and avoid the oversmoothing issue, attention mechanisms and residual structures were applied to the message passing mechanism. To guarantee a feasible solution, a novel activation function was proposed to handle the sum-power constraint, and two loss functions were designed for the QoS requirements. The CRGAT was trained in an unsupervised learning manner, and was evaluated from the perspectives of model effectiveness, scalability and adaptability. Compared with the benchmarking schemes, the advantages of the proposed CRGAT are summarized as: 1) real-time response to time-varying CSI with limited optimality performance loss; 2) scalability to dynamic network topologies; 3) adaptability to various network settings.

Appendix A SCA-based Solution Approach for the problem (2)

The problem (2) can be solved by the SCA method, and the details are given as follows.

By defining auxiliary variables $\{\gamma_{k}\}$ satisfying $\gamma_{k}\leq{R_{k}}(\{{\bf{w}}_{i}\})$, the problem (2) is equivalent to

\displaystyle\mathop{\max}\limits_{\left\{{{\bf{w}}_{i}},\gamma_{i}\right\}}~\sum\nolimits_{k=1}^{K}\gamma_{k} \quad (24a)
{\rm s.t.}~\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{k}}\right|^{2}\geq\left(2^{\gamma_{k}}-1\right)\left(\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{i}}\right|^{2}+\sigma_{k}^{2}\right), \quad (24b)
\sum\nolimits_{k=1}^{K}\left\|{{\bf{w}}_{k}}\right\|_{2}^{2}\leq{P_{\rm Max}}, \quad (24c)
\gamma_{k}\geq{R_{\rm Req}}, \quad (24d)
{{\bf{w}}_{i}}\in{\mathbb{C}^{{N_{\rm{T}}}}},~\forall i,k\in{\cal K}.

By defining the auxiliary variables $\{a_{i},b_{i}\}$ satisfying

\displaystyle\left\{\begin{array}{l}\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{k}}\right|^{2}\geq e^{a_{k}+b_{k}}\\ e^{a_{k}}\geq 2^{\gamma_{k}}-1\\ e^{b_{k}}\geq\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{i}}\right|^{2}+\sigma_{k}^{2}\end{array}\right., \quad (28)

the problem (24) is equivalent to

\displaystyle\mathop{\max}\limits_{\left\{{{\bf{w}}_{i}},\gamma_{i},a_{i},b_{i}\right\}}~\sum\nolimits_{k=1}^{K}\gamma_{k} \quad (29a)
{\rm s.t.}~\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{k}}\right|^{2}\geq e^{a_{k}+b_{k}}, \quad (29b)
e^{a_{k}}\geq 2^{\gamma_{k}}-1, \quad (29c)
e^{b_{k}}\geq\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{i}}\right|^{2}+\sigma_{k}^{2}, \quad (29d)
{\rm(24c)},~{\rm(24d)},~{{\bf{w}}_{i}}\in{\mathbb{C}^{{N_{\rm{T}}}}},~\forall i,k\in{\cal K}.

Applying the first-order Taylor approximation to the non-convex constraints (29b), (29c) and (29d), the problem (29) is approximated by the following convex problem (30) with a given feasible point $\{{{\widetilde{\bf w}}}_{i},{{\widetilde{a}}_{i}},{{\widetilde{b}}_{i}}\}$:

\displaystyle\mathop{\max}\limits_{\left\{{{\bf{w}}_{i}},\gamma_{i},a_{i},b_{i}\right\}}~\sum\nolimits_{k=1}^{K}\gamma_{k} \quad (30a)
{\rm s.t.}~2{\rm Re}\left\{{{\widetilde{\bf w}}}_{k}^{H}{{\bf h}}_{k}{{\bf h}}_{k}^{H}{{\bf{w}}_{k}}\right\}-\left|{{\bf h}}_{k}^{H}{{\widetilde{\bf w}}}_{k}\right|^{2}\geq e^{a_{k}+b_{k}}, \quad (30b)
e^{{\widetilde{a}}_{k}}\left(1+a_{k}-{\widetilde{a}}_{k}\right)\geq 2^{\gamma_{k}}-1, \quad (30c)
e^{{\widetilde{b}}_{k}}\left(1+b_{k}-{\widetilde{b}}_{k}\right)\geq\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf{h}}_{k}^{H}{{\bf{w}}_{i}}\right|^{2}+\sigma_{k}^{2}, \quad (30d)
{\rm(24c)},~{\rm(24d)},~{{\bf{w}}_{i}}\in{\mathbb{C}^{{N_{\rm{T}}}}},~\forall i,k\in{\cal K}.
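The convexification in (30b)-(30d) follows from the first-order Taylor lower bounds of the convex functions $e^{x}$ and $|{\bf h}_{k}^{H}{\bf w}_{k}|^{2}$ at the expansion point, namely

e^{x}\geq e^{\widetilde{x}}\left(1+x-\widetilde{x}\right), \qquad \left|{\bf h}_{k}^{H}{\bf w}_{k}\right|^{2}\geq 2{\rm Re}\left\{{{\widetilde{\bf w}}}_{k}^{H}{\bf h}_{k}{\bf h}_{k}^{H}{\bf w}_{k}\right\}-\left|{\bf h}_{k}^{H}{{\widetilde{\bf w}}}_{k}\right|^{2},

so that each constraint in (30) is a restriction of its counterpart in (29), and any feasible point of (30) remains feasible for (29).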

By iteratively solving the problem (30) via the SCA method, a near-optimal solution to the problem (2) can be obtained. Specifically, the initialization $\{{{\widetilde{\bf w}}}_{i}\}$ can be obtained by solving the following second-order cone program (SOCP):

{\rm Find}~\left\{{{\bf{w}}_{i}}\right\}
{\rm s.t.}~\sqrt{\left(1+\frac{1}{2^{R_{\rm Req}}-1}\right)}\,{\bf h}_{k}^{H}{{\bf{w}}_{k}}\geq\left\|{\bm{\Omega}}_{k}{\bf f}+{\bf b}_{k}\right\|, \quad (31a)
{\rm Re}\left\{{\bf h}_{k}^{H}{{\bf{w}}_{k}}\right\}\geq 0,~{\rm Im}\left\{{\bf h}_{k}^{H}{{\bf{w}}_{k}}\right\}=0, \quad (31b)
{\rm(24c)},~{{\bf{w}}_{i}}\in{\mathbb{C}^{{N_{\rm{T}}}}},~\forall i,k\in{\cal K},

where

\displaystyle\left\{\begin{array}{l}{\bf{f}}\triangleq\left[{\bf{w}}_{1}^{H},{\bf{w}}_{2}^{H},\ldots,{\bf{w}}_{K}^{H}\right]^{H}\in{\mathbb{C}}^{KN_{\rm T}}\\ {\bf{b}}_{k}\triangleq\left[{\bf{0}}_{K}^{T},\sigma_{k}^{2}\right]^{T}\in{\mathbb{C}}^{K+1}\\ {\bm{\Omega}}_{k}\triangleq\left[\begin{array}{l}{\rm{diag}}\left({\bf{h}}_{k}^{H},{\bf{h}}_{k}^{H},\ldots,{\bf{h}}_{k}^{H}\right)\\ {\bf{0}}_{KN_{\rm{T}}}^{T}\end{array}\right]\in{\mathbb{C}}^{(K+1)\times KN_{\rm T}}\end{array}\right.. \quad (37)

With the initialized $\{{{\widetilde{\bf w}}}_{i}\}$, ${{\widetilde{a}}_{i}}$ and ${{\widetilde{b}}_{i}}$ are respectively initialized by

\displaystyle\left\{\begin{array}{l}e^{{\widetilde{a}}_{k}}=2^{{R_{k}}(\{{{\widetilde{\bf{w}}}_{i}}\})}-1\\ e^{{\widetilde{b}}_{k}}=\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf{h}}_{k}^{H}{{\widetilde{\bf{w}}}_{i}}\right|^{2}+\sigma_{k}^{2}\end{array}\right.. \quad (40)
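To illustrate the overall procedure, a minimal Python sketch of the SCA loop is given below. It uses CVXPY to solve the convex subproblem (30) and, for brevity, initializes $\{\widetilde{\bf w}_{i}\}$ with scaled MRT beamformers rather than the SOCP (31); the function name, solver behavior and fixed iteration count are illustrative assumptions, not the exact implementation evaluated in the paper.

import numpy as np
import cvxpy as cp

def sca_sum_rate(H, P_max, R_req, sigma2, n_iter=20):
    # H: K x N_T complex matrix whose k-th row is h_k^H; returns beamformers W_t (K x N_T).
    K, NT = H.shape
    # Starting point (sketch): scaled MRT; the paper instead initializes via the SOCP (31).
    W_t = np.sqrt(P_max / K) * H.conj() / np.linalg.norm(H, axis=1, keepdims=True)
    for _ in range(n_iter):
        sig = np.abs(np.sum(H * W_t, axis=1)) ** 2                      # |h_k^H w_k|^2
        intf = np.array([sum(np.abs(H[k] @ W_t[i]) ** 2 for i in range(K) if i != k) + sigma2
                         for k in range(K)])
        b_t = np.log(intf)                                              # cf. (40)
        a_t = np.log(sig) - b_t                                         # e^{a_k} = SINR_k, cf. (40)
        W = cp.Variable((K, NT), complex=True)
        gam, a, b = cp.Variable(K), cp.Variable(K), cp.Variable(K)
        cons = [cp.sum_squares(W) <= P_max, gam >= R_req]               # (24c), (24d)
        for k in range(K):
            hk = H[k]
            # (30b): linearized |h_k^H w_k|^2 >= e^{a_k + b_k}
            lin = 2 * cp.real(np.conj(hk @ W_t[k]) * (hk @ W[k])) - np.abs(hk @ W_t[k]) ** 2
            cons.append(lin >= cp.exp(a[k] + b[k]))
            # (30c): e^{a_t}(1 + a - a_t) >= 2^gamma - 1
            cons.append(np.exp(a_t[k]) * (1 + a[k] - a_t[k]) >= cp.exp(np.log(2) * gam[k]) - 1)
            # (30d): e^{b_t}(1 + b - b_t) >= interference + noise
            interference = cp.sum_squares(cp.hstack([hk @ W[i] for i in range(K) if i != k]))
            cons.append(np.exp(b_t[k]) * (1 + b[k] - b_t[k]) >= interference + sigma2)
        cp.Problem(cp.Maximize(cp.sum(gam)), cons).solve()
        W_t = W.value
    return W_t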

Appendix B MRT and ZF beamforming design

In the MRT and ZF beamforming designs, each beamforming vector is simplified by fixing its direction as

\displaystyle{\bf w}_{k}=\sqrt{p_{k}}\,{\overline{\bf w}}_{k}, \quad (41)

where $p_{k}\in{\mathbb{R}}^{+}$ and ${\overline{\bf w}}_{k}$ satisfying $\|{\overline{\bf w}}_{k}\|_{2}^{2}=1$ denote the power component and the direction component, respectively.

In the MRT design, the beamforming direction is set as ${\bf h}_{k}/\|{\bf h}_{k}\|_{2}$, while in the ZF design it is set as ${\bf v}_{k}/\|{\bf v}_{k}\|_{2}$, where ${\bf v}_{k}$ is the $k$-th column of ${\bf V}\triangleq{\bf G}^{H}({\bf G}{\bf G}^{H})^{-1}$ and ${\bf G}\triangleq[{\bf h}_{1}^{H},{\bf h}_{2}^{H},\ldots,{\bf h}_{K}^{H}]$.
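As a small illustration, the following NumPy sketch computes the MRT and ZF directions; it assumes ${\bf G}$ is stored as a $K\times N_{\rm T}$ matrix whose $k$-th row is ${\bf h}_{k}^{H}$ and that $K\leq N_{\rm T}$ so that ${\bf G}{\bf G}^{H}$ is invertible. The function name is an illustrative assumption.

import numpy as np

def mrt_zf_directions(G):
    # G: K x N_T complex matrix whose k-th row is h_k^H; returns unit-norm directions (K x N_T).
    H = G.conj()                                              # k-th row carries the entries of h_k
    w_mrt = H / np.linalg.norm(H, axis=1, keepdims=True)      # MRT: h_k / ||h_k||_2
    V = G.conj().T @ np.linalg.inv(G @ G.conj().T)            # V = G^H (G G^H)^{-1}; column k is v_k
    w_zf = (V / np.linalg.norm(V, axis=0, keepdims=True)).T   # ZF: v_k / ||v_k||_2, stored as rows
    return w_mrt, w_zf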

The power components in the MRT and ZF designs can also be optimized by the SCA method, which is similar to the SCA-based solution approach for the problem (2) and is thus omitted here.

References

  • [1] H. Niu et al., “Active RIS assisted rate-splitting multiple access network: Spectral and energy efficiency tradeoff,” IEEE J. Sel. Areas Commun., vol. 41, no. 5, pp. 1452-1467, May 2023.
  • [2] W. Xu, Z. Yang, D. W. K. Ng, M. Levorato, Y. C. Eldar, and M. Debbah, “Edge learning for B5G networks with distributed signal processing: Semantic communication, edge computing, and wireless sensing,” IEEE J. Sel. Top. Signal Process, vol. 17, no. 1, pp. 9-39, Jan. 2023.
  • [3] W. R. Ghanem, V. Jamali, Y. Sun, and R. Schober, “Resource allocation for multi-user downlink MISO OFDMA-URLLC systems,” IEEE Trans. Commun., vol. 68, no. 11, pp. 7184-7200, Nov. 2020.
  • [4] Y. Omid, S. M. Shahabi, C. Pan, Y. Deng, and A. Nallanathan, “Low-complexity robust beamforming design for IRS-aided MISO systems with imperfect channels,” IEEE Commun. Lett., vol. 25, no. 5, pp. 1697-1701, May 2021.
  • [5] Y. Yang, Z. Zhang, Y. Tian, R. Jin, and C. Huang, “Implementing graph neural networks over wireless networks via over-the-air computing: A joint communication and computation framework,” IEEE Wireless Commun., vol. 30, no. 3, pp. 62-69, June 2023.
  • [6] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cognit. Commun. Networking, vol. 3, no. 4, pp. 563-575, Dec. 2017.
  • [7] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438-5453, Oct. 2018.
  • [8] Q. Hu, Y. Cai, Q. Shi, K. Xu, G. Yu and Z. Ding, “Iterative algorithm induced deep-unfolding neural networks: Precoding design for multiuser MIMO systems,” IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1394-1410, Feb. 2021.
  • [9] T. Lin and Y. Zhu, “Beamforming design for large-scale antenna arrays using deep learning,” IEEE Wireless Commun. Lett., vol. 9, no. 1, pp. 103-107, Jan. 2020.
  • [10] T. Bilot, N. E. Madhoun, K. A. Agha, and A. Zouaoui, “Graph neural networks for intrusion detection: A survey,” IEEE Access, vol. 11, pp. 49114-49139, 2023.
  • [11] Y. Gu, C. She, Z. Quan, C. Qiu, and X. Xu, “Graph neural networks for distributed power allocation in wireless networks: Aggregation over-the-air,” IEEE Trans. Wireless Commun., early access.
  • [12] Y. Shen, J. Zhang, S. H. Song, and K. B. Letaief, “Graph neural networks for wireless communications: From theory to practice,” IEEE Trans. Wireless Commun., vol. 22, no. 5, pp. 3554-3569, May 2023.
  • [13] M. Lee, G. Yu, and G. Y. Li, “Graph embedding-based wireless link scheduling with few training samples,” IEEE Trans. Wireless Commun., vol. 20, no. 4, pp. 2282-2294, Apr. 2021.
  • [14] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “Graph neural networks for scalable radio resource management: Architecture design and theoretical analysis,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 101-115, Jan. 2021.
  • [15] Y. Li, Y. Lu, R. Zhang, B. Ai, and Z. Zhong, “Deep learning for energy efficient beamforming in MU-MISO networks: A GAT-based approach,” IEEE Wireless Commun. Lett., vol. 12, no. 7, pp. 1264-1268, July 2023.
  • [16] J. Kim, H. Lee, S.-E. Hong, and S.-H. Park, “A bipartite graph neural network approach for scalable beamforming optimization,” IEEE Trans. Wireless Commun., vol. 22, no. 1, pp. 333-347, Jan. 2023.
  • [17] Z. Zhang, M. Tao, and Y.-F. Liu, “Learning to beamform in joint multicast and unicast transmission with imperfect CSI,” IEEE Trans. Commun., vol. 71, no. 5, pp. 2711-2723, May 2023.
  • [18] X. Zhang, H. Zhao, J. Xiong, X. Liu, L. Zhou, and J. Wei, “Scalable power control/beamforming in heterogeneous wireless networks with graph neural networks,” in Proc. IEEE Globecom, 2021.
  • [19] S. He, J. Yuan, Z. An, W. Huang, Y. Huang, and Y. Zhang, “Joint user scheduling and beamforming design for multiuser MISO downlink systems,” IEEE Trans. Wireless Commun., vol. 22, no. 5, pp. 2975-2988, May 2023.
  • [20] S. He, S. Xiong, W. Zhang, Y. Yang, J. Ren, and Y. Huang, “GBLinks: GNN-based beam selection and link activation for ultra-dense D2D mmWave networks,” IEEE Trans. Commun., vol. 70, no. 5, pp. 3451-3466, May 2022.
  • [21] S. He, S. Xiong, Z. An, W. Zhang, Y. Huang, and Y. Zhang, “An unsupervised deep unrolling framework for constrained optimization problems in wireless networks,” IEEE Trans. Wireless Commun., vol. 21, no. 10, pp. 8552-8564, Oct. 2022.
  • [22] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “A graph neural network approach for scalable wireless power control,” in Proc. IEEE GC Wkshps, 2019.
  • [23] T. Chen, M. You, G. Zheng, and S. Lambotharan, “Graph neural network based beamforming in D2D wireless networks,” in Proc. Int. ITG Workshop on Smart Antennas (WSA), 2021.
  • [24] A. Kaushik, M. Alizadeh, O. Waqar, and H. Tabassum, “Deep unsupervised learning for generalized assignment problems: A case-study of user-association in wireless networks,” in Proc. IEEE ICC Wkshps, 2021.
  • [25] F. Fioretto, T. W. Mak, and P. Van Hentenryck, “Predicting AC optimal power flows: Combining deep learning and Lagrangian dual methods,” in Proc. AAAI, vol. 34, no. 01, pp. 630-637, Apr. 2020.
  • [26] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun, “Measuring and relieving the over-smoothing problem for graph neural networks from the topological view,” in Proc. AAAI, vol. 34, no. 04, pp. 3438-3445, Apr. 2020.
  • [27] W. Zhao, C. Wang, C. Han, and T. Guo, “Exploring over-smoothing in graph attention networks from the Markov chain perspective,” in Proc. ICLR, Sep. 2022.
  • [28] Y. Shi, Z. Huang, S. Feng, H. Zhong, W. Wang, and Y. Sun, “Masked label prediction: Unified message passing model for semi-supervised classification,” in Proc. IJCAI-21, pp. 1548-1554, Aug. 2021.