Power Allocation for Device-to-Device Interference Channel Using Truncated Graph Transformers ^†^†thanks: This work was supported by The Hong Kong University of Science and Technology (HKUST) Startup Fund under Grant R9249.

Dohoon Kim and Shenghui Song Dept. of ECE, The Hong Kong University of Science and Technology, Hong Kong
Email: {dhkimac, eeshsong}@ust.hk

Abstract

Power control for the device-to-device interference channel with single-antenna transceivers has been widely analyzed with both model-based methods and learning-based approaches. Although the learning-based approaches, i.e., data-driven and model-driven, offer performance improvement, the widely adopted graph neural network suffers from learning the heterophilous power distribution of the interference channel. In this paper, we propose a deep learning architecture in the family of graph transformers to circumvent the issue. Experiment results show that the proposed methods achieve the state-of-the-art performance across a wide range of untrained network configurations. Furthermore, we show there is a trade-off between model complexity and generality.

Index Terms:

Interference channel, power allocation, deep learning, graph transformer networks

I Introduction

Power allocation is a fundamental problem for wireless communications. For device-to-device (D2D) systems where single antenna transmitter-receiver pairs communicate in an interference channel, the problem can be abstracted into a sum-rate maximization with power constraints. However, the non-convex and NP-hard nature makes optimizing this simple formulation a challenge [1]. There have been various approaches to approximate a solution, such as solving it on the Lagrangian dual domain [2] or weighted minimum mean squared error (WMMSE) minimization [3]. These classical methods guarantee convergence to a local minima, but are computationally complex and not suitable for large scale wireless networks. Furthermore, the iterative nature of these algorithms prevent the utilization of parallel computing.

Recent advances in deep learning has inspired data-driven approaches to tackle this optimization problem [4, 5, 6]. These methods outperform classical algorithms with a fraction of computational complexity, in exchange for generality. Purely data-driven models can not guarantee performance outside of its training data distribution. Model-driven approaches merge classical algorithms with deep learning models through algorithmic unfolding to utilize the strong prior of conventional techniques [4]. The deep learning model is trained to accelerate the convergence of existing algorithms, drastically reducing the iteration count. However, such an unfolding algorithm retains the computational complexity of the classical algorithm it is optimizing, and retains the iterative nature.

For both data-driven and model-driven methods, graph neural networks (GNNs) have been the architecture of choice due to their strong prior on graph data structures. GNN’s kernel sizes are constant to arbitrary size of graphs, and its permutation invariance reduces data complexity during learning [5]. Since wireless networks can be easily modeled as a graph, where individual devices can be treated as nodes and communication links as edges, GNNs can be easily adapted for wirless network datasets. However, we would like to argue that the basic GNN architecture is not suitable for the concerned power allocation problem, as the interference channel is always a fully connected graph, and its optimal power distribution is heterephilous, which GNN models suffer from learning [7].

In this work, we explore a different model architecture, i.e., transformers. Transformers have achieved great successes in the area of natural language processing (NLP), where large pretrained models could be fine tuned on various tasks. Models like GPT-3 or Lambda are giant transformer networks with billions of parameters that generate state-of-the-art performance across different NLP domains [8]. Their attention mechanism maps a weight to all input nodes in the sequence, which allows gradient flow on long dependencies of a sentence. As attention maps a scalar weight to all words in a sentence, it is equivalent to providing an edge weight to a fully connected graph with the same number of nodes as the input sequence.

In this paper, we propose to revise the graph transformer architecture for the power allocation problem in the D2D interference channel. Graph transformers extends the concept of attention weights from transformers to GNNs, where it generates an attention map that acts as a soft weight across all neighbors [9]. Extensive experiments validate that the proposed architecture outperforms classical algorithms and other learning based models across different network topologies and configurations with just a single trained model. We further show that model complexity correlates to model generalization, which proposes an interesting trade-off between the computational complexity during inference and the frequency for model retraining.

II Problem Formulation and Preliminaries

II-A Weighted Sum-Rate Maximization

Consider a single-input single-output (SISO) ad hoc interference network with $n$ transmitter-receiver pairs. Denote $h_{ij}$ to be the channel coefficient between transmitter $i$ and receiver $j$ , $p_{i}$ as the power allocated to transmitter $i$ , and $\sigma^{2}$ as the additive white Gaussian noise (AWGN) power level. The channel rate of the $i$ -th transmitter-receiver pair can be expressed as:

c_{i}=log_{2}\left(1+\frac{|h_{ii}|^{2}p_{i}}{\sigma^{2}+\sum_{j=1}^{j\neq i}|h_{ij}|^{2}p_{j}}\right),i=1,...n.

(1)

The design objective is to maximize the sum rate across all pairs under the maximum power constraint. Additionally, each pair would have a weight $w_{i}$ on the channel rate to enforce priority or fairness across devices. The constrained weighted sum rate maximization problem can be written as:

	$\displaystyle\max_{\textbf{p}}$	$\displaystyle\sum w_{i}c_{i},$		(2)
	s.t	$\displaystyle\textbf{p}\in[0,P_{max}]^{n}$		(3)

Our goal is to find a data-driven model that determines the maximizing power $\textbf{p}\in\mathbb{R}_{+}^{n}$ given an instantiated channel state $\textbf{H}\in\mathbb{R}_{+}^{n\times n}$ with $\textbf{H}_{i,j}=h_{ij}$ and noise power $\sigma^{2}$ .

II-B Graph Neural Networks

GNN is a family of neural network architectures designed to be trained on graph data structures. A single pass of a GNN layer consists of three stages: message formulation, neighbor aggregation, and feature update. The message formulation function $m(\cdot)$ creates a message feature vector, generally using the corresponding node embedding $\textbf{x}_{i}$ and edge embedding $\textbf{e}_{ij}$ . Neighbor aggregation $\bigoplus$ is a symmetric operation across all neighbors of a specific node, which grants permutation invariance to the input sequence. $j\in\mathcal{N}(i)$ is a set of nodes that are connected with node $i$ in the graph. Generally, these would be summation, average, or maximum, but any other symmetrical operations would suffice. The authors of [10] discussed the efficacy for various candidates, and showed that no specific operation is superior and subject to case by case identification. Finally, the feature update $\sigma(\cdot)$ takes both the aggregated message and its node feature embedding to compute the new embedding for node i.

\textbf{z}_{i}=\sigma(\textbf{x}_{i},\underset{j\in\mathcal{N}(i)}{\bigoplus}m(\textbf{x}_{i},\textbf{x}_{j},\textbf{e}_{ij})).

(4)

Note that if $\bigoplus$ is a sum operation, this is equivalent to multiplying the node feature matrix $\textbf{X}\in\mathbb{R}^{n\times d}$ with the adjacency matrix $\textbf{A}\in\mathbb{N}_{0}^{n\times n}$ , where the value of $\textbf{A}_{ij}=1$ if node i and j are connected, and 0 otherwise. In this case, we can imagine a basic message passing GNN as a masked linear layer in relation to the adjacency matrix. Various architectures of GNN have been utilized in the literature for the power allocation problem. The authors of [4] proposed a GNN to accelerate WMMSE, and [5] utilized a message passing GNN where the message is generated with concatenated node and edge feature embeddinigs .

While GNNs show better expressitivity in graph data, they have their limits. The limit relevant to our problem formulation is their strong bias on homophily, which means nodes connected with each other will have the same label. Detailed analysis in regards to the power control problem will be discussed in Section V-C.

II-C Transformers

Transformer is another family of neural network architectures that was developed for NLP problems [11]. Its attention mechanism was devised to circumvent the problem of gradient vanishing in long sentences. Information of words at the start of the sentence would not pass through the model when inferring on the words on the later part. Attention solves this by placing a soft weight on all words in the sentence for each step, which is calculated on run-time. The attention map is calculated using a query, key, and value matrix. If the model has an input feature vector $\textbf{x}_{\textbf{i}}\in\mathbb{R}^{n\times f}$ , the attention module would contain linear weight $\textbf{W}_{\textbf{q}}\in\mathbb{R}^{d\times n},\textbf{W}_{\textbf{k}}\in\mathbb{R}^{d\times n},\textbf{W}_{\textbf{v}}\in\mathbb{R}^{d\times n}$ which would compute the feature matrix $\textbf{Q},\textbf{K},\textbf{V}\in\mathbb{R}^{d\times f}$ . When multiple attention modules are stacked, this is called a multi-head attention. The attention module could be expressed as:

\textbf{z}_{i}=\text{softmax}(\frac{\textbf{Q}\textbf{K}^{T}}{\sqrt{d}})\textbf{V}.

(5)

Because attention is individually computed for each word, and the matrix multiplication of the attention weights to the value matrix is equivalent to a sum neighbor aggregation, transformer models are inherently permutation invariant. Note that the input that generates the Q ,K, and V can be from different input vectors. This is called cross-attention, which would be used for our proposed method as the query comes from the target node, and the keys and values come from the neighboring nodes.

III Truncated Graph Transformers (TGTs)

III-A Model Architecture

Refer to caption — Figure 1: Block diagram of the TGT algorithm. Note that the Transformer layer weights are shared across the three instances.

Transformers take a sequence input of arbitrary length and compute a permutation invariant operation. This can be alternatively viewed as a GNN on a fully connected simple graph. The authors of [9] generalized the transformer to a GNN framework, named graph transformers. Details of the structure is nearly identical with the original self attention, but the difference comes from how the input data is handled. Here, the queries and keys are generated from the corresponding node embeddings, and the attention is mapped across the neighbors. A possible integration of edge feature vector to the process is also introduced.

We revise the graph transformer architecture discussed in [9] for better convergence on the power allocation problem. Fig. 1 illustrates the proposed model. First, both the node and edge features are passed through a fully connected layer with batch normalization into a higher dimension embedding with dimension $d$ . The embedding is then fed through a multi-head cross attention 3-layer model defined as:

	$\displaystyle\hat{\textbf{x}_{i}}^{l+1}=\sum_{k=1}^{H}\underset{j\in\mathcal{N}(i)}{\sum}w^{kl}_{ij}\textbf{V}^{k}\textbf{x}^{kl}_{j},$	(6)
where,
	$\displaystyle w^{kl}_{ij}=\text{softmax}_{j}(\text{LeakyReLU}(\frac{\textbf{Q}^{k}x^{kl}_{i}(\textbf{K}^{k}\textbf{x}^{kl}_{j}+\textbf{e}^{k}_{ij})}{\sqrt{d}})).$	(7)

Here, $\textbf{x}^{l}_{ij},\textbf{e}_{ij}\in\mathbb{R}^{d}$ are the node and edge embeddings, $\textbf{Q}^{k},\textbf{K}^{k},\textbf{V}^{k}\in\mathbb{R}^{d/H\times d/H}$ are the query, key, and value weights of the $k^{th}$ head, and $H$ is the number of heads for the multi-head attention. $w_{ij}^{kl}$ is the attention map for the $ij$ -th message on head $k$ , layer $l$ . The right side of Fig. 1 illustrates how a single TGT layer pass is computed. First $\textbf{x}^{l}_{ij},\textbf{e}_{ij}$ are split into $\textbf{x}^{kl}_{ij},\textbf{e}^{k}_{ij}\in\mathbb{R}^{d/H}$ for each head. Then the query is generated from the target node and the keys are generated from its neighboring nodes. The edge information is injected as a sum to the key feature vector. The product of the query and keys are scaled, and passes through a Leaky ReLU before going through softmax [12]. Ablation study showed that models with no shared weights only provide marginal improvement on performance while tripling the number of trainable parameters. Thus, $\textbf{Q}^{k},\textbf{K}^{k},\textbf{V}^{k}$ are shared across all three layers. The output of the multi-head attention is followed by a layer normalization [13] with a skip connection of the previous layer output, as:

\textbf{x}_{i}^{l+1}=LayerNorm(\hat{\textbf{x}_{i}}^{l+1}+\textbf{x}_{i}^{l}).

(8)

The final layer of the model is just a sum along the feature dimension to flatten the node feature to a scalar, which is then passed through a sigmoid activation to limit the range of the output to $[0,1]$ . This can be expressed as:

p_{i}=\frac{1}{1+exp(-(1,1,...,1)\textbf{x}_{i}^{L})}.

(9)

As our model reduces the number of non-linearities from the original model, we name it as the Truncated Graph Transformer (TGT).

III-B Graph Homophily

In this section, we attempt to explain why TGT outperforms basic GNNs by showing that the power control problem with low noise is heterophilous. Graph homophily is an index that measures whether a node with an adjacent edge would have the same label. We start with the index definition by [14]:

h=E[\frac{1}{d_{i}}\sum_{j\in\mathcal{N}(i)}|x_{i}=x_{j}|^{\textbf{1}}].

(10)

Where $||^{\textbf{1}}$ is a function that returns 1 if the two values are the same, and $d_{i}$ is the degree of node $i$ . $h$ would map to a value $[0,1]$ , where 1 represents that the graph is highly homophilic. Citation graphs or relation graphs are known to be homophilic, as papers with similar topics would be cited together and similar people will be in a closer relationship. GNNs are known to better perform in homophilic datasets, which is due to their neighbor aggregation design on a fixed graph structure. The author of [15] showed that when removing the non-linearity of the GNN, it can be abstracted into a Chebyshev low pass filter on the graph, which indicates that the performance gained from utilizing GNNs may not be from the non-linear feature extractions. In order to circumvent this issue, researchers are exploring more sophisticated GNN architectures that include sub-graph information [16], sheaf diffusion [17], and mapping the graph into a hyperbolic space [7].

Due to the network modeled in the problem formulation is a fully connected graph with a scalar edge weights and power allocation, we expand the definition of graph homophily for a graph with scalar edge weight $e_{ij}$ and node label $x_{i}\in[0,1]$ as:

h=E_{i}[\frac{1}{\sum_{j\in\mathcal{N}(i)}e_{ij}}\sum_{j\in\mathcal{N}(i)}e_{ij}(1-|x_{i}-x_{j}|)].

(11)

TABLE I: Homophily

h

on a wireless network of 50 pairs

	Noise Power Level $\sigma^{2}$
	$10^{-6}$	$10^{-5}$	$10^{-4}$	$10^{-3}$	$10^{-2}$	$10^{-1}$	$10^{0}$
$h$	0.417	0.418	0.468	0.560	0.688	0.817	0.914

Table I shows the homophily of a network channel H with 50 pairs, where the optimal power generated by WMMSE is considered as the label. As the noise level increases, the allocated power changes from a heterophilous to a homophilic distribution. Intuitively, the power control problem for the SISO interference channel can be abstracted into finding the optimal volume for each person in a busy dinner party. If two people were close together with their partners in equal distance, it would be beneficial for one to remain silent and the other to speak with their loudest voice, instead of the two speaking over each other. With increased noise power, the optimal power allocation saturates to full power, as the noise power level overpowers the interference from neighboring pairs.

Graph Network Transformers are more resilient to heterophilous datasets due to the attention weights acting as active graph rewiring. The wireless network is formulated as a graph with soft neighbor connections, in which every node is connected with a scalar weight associated to it. By applying attention to each neighbors, the model is effectively rewiring the graph each layer, avoiding the over smoothing issue of GNNs when aggregating with a fixed graph topology.

IV Numerical Experiments

IV-A Performance on Singular Network Size

We perform experiments on a simulated SISO D2D network over an AWGN channel of fixed noise $\sigma^{2}=2.6\times 10^{-5}$ . $n$ transmitters are uniformly distributed on a square plane with range $[-n,n]^{2}$ , where the receiver is placed within the distance of $[1,n/4]$ around the target transmitter. A lower bound on the receiver location is placed to prevent numerical instability when calculating the path gain. The channel is constructed with a constant path-loss defined as $h^{p}_{ij}=||t_{i}-r_{j}||^{-2.2}$ , where $t_{i},r_{j}$ are the coordinates of the corresponding transmitter and receiver. We then multiply with the large-scale channel coefficient coefficient $h^{f}_{ij}\sim\text{Rayleigh}(1)$ . This specific problem setting has been used across various data driven power allocation publications, and has been chosen here for ease of comparison across different methods [4, 6]. We assume equal weight $w_{i}=1$ for all pairs.
For all data-driven benchmarks, the models were trained on a dataset of 25,000 network instances, which consists of 50 randomly sampled channel states on a 500 randomly sampled network topology. For the data-driven benchmarks, published configurations for model dimension and training strategies were retained, but the learning rate were adjusted so that the model could stably converge due to difference in simulated conditions. Then the model was tested on previously unseen 1000 network instances with 50 randomly sampled channel states on 50 randomly sampled network topologies.

The proposed model, TGT, is a 3 layer graph transformer model with 32 head attention, where the embedding dimension $d=64$ . As the channel $h_{ij}$ is a near zero value, we normalize the channel matrix H to a range of $[0,1]$ . The model was trained using a AdamW[18] optimizer with learning rate set to $5\times 10^{-4}$ . All models were trained for 50 epochs. The channel state H needs to be represented as a graph data to be processed by a GNN. For our method, we convert H to an undirected graph $G$ with $n$ nodes. In particular, $G$ is a fully connected graph with self loops where feature vector of the $i$ -th node and the $ij$ -edge are given by $\textbf{x}_{i}^{0}=[h_{ii},w_{i}]\in\mathbb{R}^{2}$ and $\textbf{e}_{ij}=[h_{ij},h_{ji}]\in\mathbb{R}^{2}$ , respectively. Implementation details can be accessed through the github repository [19].

We perform numerical comparison across the following benchmarks:

•

Max Power is a naive strategy where all transmitters transmit maximum power $P_{max}=1$ .
•

WMMSE[3] acts as a baseline for the classical method, which is run for 100 iterations.
•

PCGNN[5] utilizes a variant of a GNN with its edge weights embedded onto the message. It achieves state-of-the-art performance on complex interference channels.
•

UWMMSE[4] is a model-driven approach that utilizes GNNs to accelerate the conversion of WMMSE with only 4 iterations. It achieves state-of-the art performance on a fixed network topology.
•

Graph Trasformer [9] is run with the same number of layers, heads, and $d$ to compare with the proposed method. Note that this model has 173K trainable parameters, while ours only have 13.1K.

TABLE II: Experiment results on a various range of network size.

	Average Sumrate per Network Size
Algorithm	20	30	40	50
Max Power	62.547	79.038	96.168	106.364
WMMSE	84.563	109.972	132.677	147.463
PCGNN	84.237	109.892	131.403	147.165
UWMMSE	83.081	108.335	131.048	146.035
Graph Transformer	85.670	111.376	133.554	148.951
Proposed Method	85.968	111.584	134.320	149.592
Proposed Method: Multi Node	84.999	111.543	134.487	149.343

Table II shows the comparison across various number of pairs. Each data-driven model is trained only on the dataset with the corresponding network size. Multi-node is a single model that was trained on a even distribution of network size of $\{20,30,40,50\}$ , with the same number of samples as mentioned above. The proposed method outperforms all methods across various network sizes for both fixed node training and multi-node training. As the training dataset only contains 50 channel instances, we do further comparison of the models on a fixed network topology with 32,000 randomly sampled channel fading. Fig. 2 shows that the proposed method follows a similar distribution curve as WMMSE, with a consistent advantage on all range of channel instances.

IV-B Deviation from the Original Network Configuration

While deep learning models outperform classical algorithms with lower computational complexity, it’s performance is only guaranteed within the training data distribution, and thus must be tested with its generality across unseen data distributions to understand when to retrain the model. We test the model’s generality across network size, fading distribution, and pair density. We then compare the computational complexity, and provide analysis on the relationship between model size and generality.

In Table II, each model was trained only on a fixed number of $n$ . We compare how our model performs across a sweep of network sizes, ranging from 20 to 100 pairs. 10 channel instances are sampled for each 100 random network topologies. Fig. 3 shows the result across three different plots $n=30,50$ and multi-node model, where the sum-rate is normalized with WMMSE for ease of comparison. A taper of performance is clearly visible from the model’s trained network size, but the model is able to outperform WMMSE even outside of the trained pair size. However, we can see that the multi-node model has a more stable performance drop off, performing better on average than fixed size training. This shows the model should be trained on the expected data distribution of the deployed scenario, to prevent frequent retraining of the model.

We now consider a scenario where the fading coefficient of the Rayleigh fading changes. Using the same configuration as in Table II, Table III shows that the proposed model generalizes well from its original fading coefficient 1, outperforming WMMSE up to a fading coefficient of 4 for both fixed model and the multi-node model.

TABLE III: Model generalization across fading coefficients

	0.5	1	1.5	2	2.5	3	3.5	4
Max Power	94.9	103.4	110.4	111.8	113.9	117.5	115.2	114.5
WMMSE	122.2	144.3	158.2	164.2	169.4	174.9	174.3	174.0
TGT	123.1	146.3	160.3	165.9	170.2	175.6	174.4	173.9

While the model is able to generalize across multiple network size and fading coefficients, the proposed method struggles when the overall density of the network changes from the training data. We test the proposed model on a fixed pair network of $n=50$ , but change the the distribution of transmitters from $[-n,n]^{2}$ to $[-n/r,n/r]^{2}$ , where $r\in[0.25,4]$ . Table IV shows the model struggles to generalize on deviation from the trained pair density. Note that the model still out perform WMMSE when trained on the different density data.

TABLE IV: Model generalization across network field sizes

	200	150	100	50	25	17	12
Max Power	217.2	210.3	179.7	107.4	44.3	22.3	14.8
WMMSE	221.4	219.3	196.2	146.8	96.0	70.0	57.0
TGT	215.0	216.7	195.0	148.7	88.9	53.3	37.6

IV-C Computational Complexity

One advantage of data-driven models over iterative classical algorithms is their simpler computational complexity. WMMSE is known to be run on $O(KN^{3})$ time, where N is the number of pairs and K is the number of iterations. Like all other GNN based models, the proposed model has a computational complexity of $O(LFN^{2})$ , where L is the number of layers, F is the dimension of the linear weights, and N is the number of pairs. Since the computational complexity rely on the model parameter, we compare how the proposed model performs across various number of parameter size. We compare the model with various widths from $d=[4,104]$ , in which the model parameter ranges from 960 to 33.7K. Fig. 4 compares the performance of the proposed model trained on a fixed pair size of $n=30$ . For the different parameter sizes, we retain the original ratio of the hidden layers dimensions and scale accordingly while keeping the training conditions same as Section IV-A. The proposed method outperforms WMMSE already at the parameter size of 3.5K, and seems to converge at the parameter size of 13K. Graph transformer achieves equivalent performance with ten times more parameters.

While the proposed model outperforms WMMSE at a smaller parameter size, this comes with a price of generality. Fig. 5 shows that models with smaller parameters fail to generalize to unseen network configurations while models with larger parameters have a slower drop off. Thus, there is a trade-off of computational simplicity and model generality, which decides the frequency of retraining required once the deployed network deviates from its trained data distribution.

V Conclusion

We proposed a graph transformer based architecture called TGT for the wireless power allocation problem. It was observed that as the power control problem displays a heterophilous distribution with lower level of noise, architectures such as graph transformer would perform well. With parameter sharing and removing additional non-linear layers, the proposed model is able to achieve state-of-the-art results with a fraction of trainable parameters. Analysis of the model to unseen data distributions shows that although the model is able to generalize on a range of unseen distributions, limits exist to the type of perturbations, which shows a trade-off between model complexity and performance.

References

[1] Z.-Q. Luo and S. Zhang, “Dynamic spectrum management: Complexity and duality,” IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 1, pp. 57–73, Feb. 2008.
[2] W. Yu and R. Lui, “Dual methods for nonconvex spectrum optimization of multicarrier systems,” IEEE Transactions on Communications, vol. 54, no. 7, pp. 1310–1322, July 2006.
[3] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively weighted mmse approach to distributed sum-utility maximization for a mimo interfering broadcast channel,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4331–4340, Sept. 2011.
[4] A. Chowdhury, F. Gama, and S. Segarra, “Stability Analysis of Unfolded WMMSE for Power Allocation,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022, pp. 5298–5302.
[5] Y. Shen, J. Zhang, S. H. Song, and K. B. Letaief, “Graph neural networks for wireless communications: From theory to practice,” IEEE Transactions on Wireless Communications, vol. 22, no. 5, pp. 3554–3569, May 2023.
[6] M. Eisen and A. Ribeiro, “Large scale wireless power allocation with graph neural networks,” in 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1–5, ISSN: 1948-3252.
[7] Q. Liu, M. Nickel, and D. Kiela, “Hyperbolic graph neural networks,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[8] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of transformers,” AI Open, 2022.
[9] V. P. Dwivedi and X. Bresson, “A generalization of transformer networks to graphs.” [Online]. Available: http://arxiv.org/abs/2012.09699
[10] G. Corso, L. Cavalleri, D. Beaini, P. Liò, and P. Veličković, “Principal neighbourhood aggregation for graph nets,” Advances in Neural Information Processing Systems, vol. 33, pp. 13 260–13 271, 2020.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[12] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
[13] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[14] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, and B. Yang, “Geom-gcn: Geometric graph convolutional networks,” arXiv preprint arXiv:2002.05287, 2020.
[15] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in International conference on machine learning. PMLR, 2019, pp. 6861–6871.
[16] F. Frasca, B. Bevilacqua, M. M. Bronstein, and H. Maron, “Understanding and extending subgraph GNNs by rethinking their symmetries.” [Online]. Available: http://arxiv.org/abs/2206.11140
[17] C. Bodnar, F. Di Giovanni, B. P. Chamberlain, P. Liò, and M. M. Bronstein, “Neural sheaf diffusion: A topological perspective on heterophily and oversmoothing in GNNs.” [Online]. Available: http://arxiv.org/abs/2202.04579
[18] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[19] K. Dohoon, https://github.com/dhkimac/Truncated-Graph-Transformer, 2022.

Power Allocation for Device-to-Device Interference Channel Using Truncated Graph Transformers ††thanks: This work was supported by The Hong Kong University of Science and Technology (HKUST) Startup Fund under Grant R9249.