
Power Allocation for Device-to-Device Interference Channel Using Truncated Graph Transformers

This work was supported by The Hong Kong University of Science and Technology (HKUST) Startup Fund under Grant R9249.

Dohoon Kim and Shenghui Song Dept. of ECE, The Hong Kong University of Science and Technology, Hong Kong
Email: {dhkimac, eeshsong}@ust.hk
Abstract

Power control for the device-to-device interference channel with single-antenna transceivers has been widely analyzed with both model-based and learning-based approaches. Although learning-based approaches, both data-driven and model-driven, offer performance improvements, the widely adopted graph neural network struggles to learn the heterophilous power distribution of the interference channel. In this paper, we propose a deep learning architecture in the family of graph transformers to circumvent this issue. Experimental results show that the proposed method achieves state-of-the-art performance across a wide range of untrained network configurations. Furthermore, we show that there is a trade-off between model complexity and generality.

Index Terms:
Interference channel, power allocation, deep learning, graph transformer networks

I Introduction

Power allocation is a fundamental problem in wireless communications. For device-to-device (D2D) systems, where single-antenna transmitter-receiver pairs communicate over an interference channel, the problem can be abstracted into a sum-rate maximization under power constraints. However, its non-convex and NP-hard nature makes optimizing even this simple formulation a challenge [1]. Various approaches have been proposed to approximate a solution, such as solving the problem in the Lagrangian dual domain [2] or via weighted minimum mean squared error (WMMSE) minimization [3]. These classical methods guarantee convergence to a local optimum, but they are computationally complex and not suitable for large-scale wireless networks. Furthermore, their iterative nature prevents the utilization of parallel computing.

Recent advances in deep learning have inspired data-driven approaches to this optimization problem [4, 5, 6]. These methods outperform classical algorithms at a fraction of the computational complexity, in exchange for generality: purely data-driven models cannot guarantee performance outside of their training data distribution. Model-driven approaches merge classical algorithms with deep learning models through algorithmic unfolding to exploit the strong prior of conventional techniques [4]. The deep learning model is trained to accelerate the convergence of an existing algorithm, drastically reducing the iteration count. However, such an unfolded algorithm retains both the computational complexity and the iterative nature of the classical algorithm it optimizes.

For both data-driven and model-driven methods, graph neural networks (GNNs) have been the architecture of choice due to their strong prior on graph data structures. A GNN's kernel size is independent of the graph size, and its permutation invariance reduces data complexity during learning [5]. Since wireless networks can be naturally modeled as graphs, where individual devices are treated as nodes and communication links as edges, GNNs are easily adapted to wireless network datasets. However, we argue that the basic GNN architecture is not suitable for the concerned power allocation problem: the interference channel is always a fully connected graph, and its optimal power distribution is heterophilous, which GNN models struggle to learn [7].

In this work, we explore a different model architecture, i.e., the transformer. Transformers have achieved great success in natural language processing (NLP), where large pretrained models can be fine-tuned on various tasks. Models like GPT-3 or LaMDA are giant transformer networks with billions of parameters that deliver state-of-the-art performance across different NLP domains [8]. Their attention mechanism assigns a weight to every input token in the sequence, which allows gradients to flow across long dependencies in a sentence. Since attention maps a scalar weight to every word in a sentence, it is equivalent to assigning an edge weight to a fully connected graph with as many nodes as the input sequence.

In this paper, we propose to revise the graph transformer architecture for the power allocation problem in the D2D interference channel. Graph transformers extend the concept of attention weights from transformers to GNNs by generating an attention map that acts as a soft weight across all neighbors [9]. Extensive experiments validate that the proposed architecture outperforms classical algorithms and other learning-based models across different network topologies and configurations with a single trained model. We further show that model complexity correlates with model generalization, which presents an interesting trade-off between the computational complexity at inference and the frequency of model retraining.

II Problem Formulation and Preliminaries

II-A Weighted Sum-Rate Maximization

Consider a single-input single-output (SISO) ad hoc interference network with $n$ transmitter-receiver pairs. Denote by $h_{ij}$ the channel coefficient between transmitter $i$ and receiver $j$, by $p_i$ the power allocated to transmitter $i$, and by $\sigma^2$ the additive white Gaussian noise (AWGN) power level. The channel rate of the $i$-th transmitter-receiver pair can be expressed as:

c_i = \log_2\left(1 + \frac{|h_{ii}|^2 p_i}{\sigma^2 + \sum_{j \neq i} |h_{ij}|^2 p_j}\right), \quad i = 1, \ldots, n.   (1)

The design objective is to maximize the sum rate across all pairs under a maximum power constraint. Additionally, each pair has a weight $w_i$ on its channel rate to enforce priority or fairness across devices. The constrained weighted sum-rate maximization problem can be written as:

\max_{\mathbf{p}} \; \sum_i w_i c_i,   (2)
\text{s.t.} \; \mathbf{p} \in [0, P_{\max}]^n.   (3)

Our goal is to find a data-driven model that determines the maximizing power $\mathbf{p} \in \mathbb{R}_+^n$ given an instantiated channel state $\mathbf{H} \in \mathbb{R}_+^{n \times n}$ with $\mathbf{H}_{i,j} = h_{ij}$ and noise power $\sigma^2$.
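For concreteness, the following NumPy sketch evaluates the objective in (1)-(2) for a given channel matrix and power vector. The function name weighted_sum_rate is an illustrative helper of ours, not code from the authors' repository.

```python
import numpy as np

def weighted_sum_rate(H, p, w, sigma2):
    """Weighted sum rate of (2), with per-pair rates computed as in (1).

    H[i, j] = h_ij, p[i] = power of transmitter i, w[i] = weight of pair i."""
    gains = np.abs(H) ** 2
    signal = np.diag(gains) * p                      # |h_ii|^2 p_i
    interference = gains @ p - signal                # sum_{j != i} |h_ij|^2 p_j
    sinr = signal / (sigma2 + interference)
    return float(np.sum(w * np.log2(1.0 + sinr)))

# Example: the max-power baseline on a random channel with 50 pairs.
n = 50
H = np.random.rayleigh(scale=1.0, size=(n, n))
print(weighted_sum_rate(H, p=np.ones(n), w=np.ones(n), sigma2=2.6e-5))
```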

II-B Graph Neural Networks

GNNs are a family of neural network architectures designed to be trained on graph data structures. A single pass of a GNN layer consists of three stages: message formulation, neighbor aggregation, and feature update. The message formulation function $m(\cdot)$ creates a message feature vector, generally using the corresponding node embedding $\mathbf{x}_i$ and edge embedding $\mathbf{e}_{ij}$. The neighbor aggregation $\bigoplus$ is a symmetric operation across all neighbors of a specific node, which grants permutation invariance with respect to the input ordering; $\mathcal{N}(i)$ denotes the set of nodes connected to node $i$ in the graph. The aggregation is typically a summation, average, or maximum, but any other symmetric operation would suffice. The authors of [10] discussed the efficacy of various candidates and showed that no specific operation is universally superior, so the choice must be made case by case. Finally, the feature update $\sigma(\cdot)$ takes both the aggregated message and the node's own embedding to compute the new embedding for node $i$:

\mathbf{z}_i = \sigma\left(\mathbf{x}_i, \bigoplus_{j \in \mathcal{N}(i)} m(\mathbf{x}_i, \mathbf{x}_j, \mathbf{e}_{ij})\right).   (4)

Note that if $\bigoplus$ is a sum operation, this is equivalent to multiplying the node feature matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ with the adjacency matrix $\mathbf{A} \in \mathbb{N}_0^{n \times n}$, where $\mathbf{A}_{ij} = 1$ if nodes $i$ and $j$ are connected and $0$ otherwise. In this sense, a basic message-passing GNN can be viewed as a linear layer masked by the adjacency matrix. Various GNN architectures have been utilized in the literature for the power allocation problem. The authors of [4] proposed a GNN to accelerate WMMSE, and [5] utilized a message-passing GNN where the message is generated from concatenated node and edge feature embeddings.
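As an illustration of (4), the sketch below implements a dense message-passing layer with sum aggregation in PyTorch. It is a generic example of the three stages described above, not the specific architecture of [4] or [5]; the layer sizes and ReLU non-linearities are our own choices.

```python
import torch
import torch.nn as nn

class SumMessagePassing(nn.Module):
    """One message-passing layer in the spirit of (4): m(.) from node/edge
    features, sum aggregation over N(i), then a feature update sigma(.)."""
    def __init__(self, node_dim, edge_dim, hidden_dim):
        super().__init__()
        self.message = nn.Linear(2 * node_dim + edge_dim, hidden_dim)  # m(.)
        self.update = nn.Linear(node_dim + hidden_dim, node_dim)       # sigma(.)

    def forward(self, x, e, adj):
        # x: (n, node_dim), e: (n, n, edge_dim), adj: (n, n) binary mask for N(i).
        n = x.size(0)
        xi = x.unsqueeze(1).expand(n, n, -1)          # x_i broadcast over neighbors j
        xj = x.unsqueeze(0).expand(n, n, -1)          # x_j broadcast over targets i
        msg = torch.relu(self.message(torch.cat([xi, xj, e], dim=-1)))
        agg = (adj.unsqueeze(-1) * msg).sum(dim=1)    # sum over j in N(i)
        return torch.relu(self.update(torch.cat([x, agg], dim=-1)))
```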

While GNNs show better expressivity on graph data, they have their limits. The limit relevant to our problem formulation is their strong bias toward homophily, i.e., the assumption that connected nodes share the same label. A detailed analysis with respect to the power control problem is given in Section III-B.

II-C Transformers

Transformers are another family of neural network architectures, originally developed for NLP problems [11]. The attention mechanism was devised to circumvent the problem of vanishing gradients in long sentences, where information about words at the start of a sentence would not propagate through the model when inferring words in the later part. Attention solves this by placing a soft weight on all words in the sentence at each step, computed at run time. The attention map is calculated using query, key, and value matrices. If the model has an input feature matrix $\mathbf{x}_i \in \mathbb{R}^{n \times f}$, the attention module contains linear weights $\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v \in \mathbb{R}^{d \times n}$ which compute the feature matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{d \times f}$. When multiple attention modules are stacked, this is called multi-head attention. The attention module can be expressed as:

\mathbf{z}_i = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right)\mathbf{V}.   (5)

Because attention is computed individually for each word, and multiplying the attention weights with the value matrix is equivalent to a sum neighbor aggregation, transformer models are inherently permutation invariant. Note that the inputs that generate $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ can come from different feature vectors. This is called cross-attention, which is used in our proposed method: the query comes from the target node, while the keys and values come from the neighboring nodes.
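The following sketch spells out (5) in PyTorch. It uses the conventional row-wise layout (inputs of shape (n, f), projections of width d) rather than the transposed dimensions written above, and passing different tensors for the query and key/value sides yields cross-attention; the function is an illustration, not library code.

```python
import torch
import torch.nn.functional as F

def attention(x_q, x_kv, W_q, W_k, W_v):
    """Scaled dot-product attention as in (5).

    x_q: (n_q, f) query-side inputs, x_kv: (n_kv, f) key/value-side inputs,
    W_q, W_k, W_v: (d, f) projection weights."""
    Q, K, V = x_q @ W_q.T, x_kv @ W_k.T, x_kv @ W_v.T
    scores = Q @ K.T / Q.size(-1) ** 0.5              # (n_q, n_kv) similarity logits
    return F.softmax(scores, dim=-1) @ V              # soft-weighted sum of values
```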

III Truncated Graph Transformers (TGTs)

III-A Model Architecture


Figure 1: Block diagram of the TGT algorithm. Note that the Transformer layer weights are shared across the three instances.

Transformers take a sequence input of arbitrary length and compute a permutation-invariant operation. This can alternatively be viewed as a GNN on a fully connected simple graph. The authors of [9] generalized the transformer to a GNN framework, named the graph transformer. Its structure is nearly identical to the original self-attention; the difference lies in how the input data is handled. Here, the queries and keys are generated from the corresponding node embeddings, and the attention is mapped across the neighbors. A possible integration of edge feature vectors into the process is also introduced.

We revise the graph transformer architecture of [9] for better convergence on the power allocation problem. Fig. 1 illustrates the proposed model. First, both the node and edge features are passed through a fully connected layer with batch normalization into a higher-dimensional embedding of dimension $d$. The embedding is then fed through a three-layer multi-head cross-attention model defined as:

\hat{\mathbf{x}}_i^{l+1} = \sum_{k=1}^{H} \sum_{j \in \mathcal{N}(i)} w_{ij}^{kl} \mathbf{V}^k \mathbf{x}_j^{kl},   (6)

where

w_{ij}^{kl} = \text{softmax}_j\left(\text{LeakyReLU}\left(\frac{\mathbf{Q}^k \mathbf{x}_i^{kl} \left(\mathbf{K}^k \mathbf{x}_j^{kl} + \mathbf{e}_{ij}^k\right)}{\sqrt{d}}\right)\right).   (7)

Here, $\mathbf{x}_i^l, \mathbf{e}_{ij} \in \mathbb{R}^d$ are the node and edge embeddings, $\mathbf{Q}^k, \mathbf{K}^k, \mathbf{V}^k \in \mathbb{R}^{d/H \times d/H}$ are the query, key, and value weights of the $k$-th head, and $H$ is the number of heads of the multi-head attention. $w_{ij}^{kl}$ is the attention weight for the $ij$-th message on head $k$ of layer $l$. The right side of Fig. 1 illustrates how a single TGT layer pass is computed. First, $\mathbf{x}_i^l$ and $\mathbf{e}_{ij}$ are split into $\mathbf{x}_i^{kl}, \mathbf{e}_{ij}^k \in \mathbb{R}^{d/H}$ for each head. The query is then generated from the target node and the keys from its neighboring nodes, with the edge information injected as an additive term to the key feature vector. The product of the query and keys is scaled and passed through a LeakyReLU before the softmax [12]. An ablation study showed that models without shared weights provide only marginal performance improvement while tripling the number of trainable parameters; thus, $\mathbf{Q}^k, \mathbf{K}^k, \mathbf{V}^k$ are shared across all three layers. The output of the multi-head attention is followed by layer normalization [13] with a skip connection from the previous layer output:

\mathbf{x}_i^{l+1} = \text{LayerNorm}\left(\hat{\mathbf{x}}_i^{l+1} + \mathbf{x}_i^l\right).   (8)

The final layer of the model is simply a sum along the feature dimension, flattening each node feature to a scalar, which is then passed through a sigmoid activation to limit the output range to $[0, 1]$. This can be expressed as:

p_i = \frac{1}{1 + \exp\left(-(1, 1, \ldots, 1)\,\mathbf{x}_i^L\right)}.   (9)

As our model reduces the number of non-linearities compared with the original graph transformer, we name it the Truncated Graph Transformer (TGT).
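To make (6)-(9) concrete, the sketch below gives one possible PyTorch implementation of a TGT layer and the output head. It is an interpretation rather than the authors' code: the heads are concatenated here so that the residual in (8) is dimensionally consistent, the input embedding layers and batch normalization of Fig. 1 are omitted, and weight sharing across the three layers is obtained by simply reusing the module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TGTLayer(nn.Module):
    """One TGT attention layer, following (6)-(8); reuse the same instance
    three times to share Q^k, K^k, V^k across layers."""
    def __init__(self, d, heads):
        super().__init__()
        assert d % heads == 0
        self.d, self.h, self.dh = d, heads, d // heads
        scale = self.dh ** -0.5
        self.Wq = nn.Parameter(scale * torch.randn(heads, self.dh, self.dh))
        self.Wk = nn.Parameter(scale * torch.randn(heads, self.dh, self.dh))
        self.Wv = nn.Parameter(scale * torch.randn(heads, self.dh, self.dh))
        self.norm = nn.LayerNorm(d)

    def forward(self, x, e):
        # x: (n, d) node embeddings, e: (n, n, d) edge embeddings.
        n = x.size(0)
        xh = x.view(n, self.h, self.dh)                       # split into heads
        eh = e.view(n, n, self.h, self.dh)
        q = torch.einsum('hef,ihf->ihe', self.Wq, xh)         # Q^k x_i^{kl}
        k = torch.einsum('hef,jhf->jhe', self.Wk, xh)         # K^k x_j^{kl}
        v = torch.einsum('hef,jhf->jhe', self.Wv, xh)         # V^k x_j^{kl}
        # Attention logits q_i . (k_j + e_ij), scaled and passed through
        # LeakyReLU before the softmax over neighbors j, as in (7).
        logits = torch.einsum('ihe,ijhe->ijh', q, k.unsqueeze(0) + eh) / self.d ** 0.5
        w = F.softmax(F.leaky_relu(logits), dim=1)
        agg = torch.einsum('ijh,jhe->ihe', w, v).reshape(n, self.d)  # (6), heads concatenated
        return self.norm(agg + x)                             # (8): LayerNorm + skip

def tgt_output(x):
    """Output head of (9): sum over the feature dimension, then a sigmoid."""
    return torch.sigmoid(x.sum(dim=-1))
```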

III-B Graph Homophily

In this section, we attempt to explain why TGT outperforms basic GNNs by showing that the power control problem at low noise levels is heterophilous. Graph homophily is an index that measures whether adjacent nodes share the same label. We start with the index definition of [14]:

h = \mathbb{E}_i\left[\frac{1}{d_i}\sum_{j \in \mathcal{N}(i)} \mathbb{1}\{x_i = x_j\}\right].   (10)

Here, $\mathbb{1}\{\cdot\}$ is the indicator function, which returns 1 if the two labels are equal, and $d_i$ is the degree of node $i$. The index $h$ takes a value in $[0, 1]$, where 1 indicates that the graph is highly homophilic. Citation graphs or social relation graphs are known to be homophilic, as papers on similar topics tend to cite each other and similar people tend to be closely related. GNNs are known to perform better on homophilic datasets, which is due to their neighbor aggregation design on a fixed graph structure. The authors of [15] showed that when the non-linearity of a GNN is removed, it can be abstracted into a Chebyshev low-pass filter on the graph, which indicates that the performance gained from utilizing GNNs may not come from non-linear feature extraction. To circumvent this issue, researchers are exploring more sophisticated GNN architectures that include sub-graph information [16], sheaf diffusion [17], and mappings of the graph into hyperbolic space [7].

Since the network modeled in the problem formulation is a fully connected graph with scalar edge weights and power allocations, we extend the definition of graph homophily to a graph with scalar edge weights $e_{ij}$ and node labels $x_i \in [0, 1]$ as:

h = \mathbb{E}_i\left[\frac{1}{\sum_{j \in \mathcal{N}(i)} e_{ij}} \sum_{j \in \mathcal{N}(i)} e_{ij}\left(1 - |x_i - x_j|\right)\right].   (11)
TABLE I: Homophily $h$ on a wireless network of 50 pairs
Noise power level $\sigma^2$:   $10^{-6}$   $10^{-5}$   $10^{-4}$   $10^{-3}$   $10^{-2}$   $10^{-1}$   $10^{0}$
$h$:                            0.417       0.418       0.468       0.560       0.688       0.817       0.914

Table I shows the homophily of a network channel $\mathbf{H}$ with 50 pairs, where the optimal power generated by WMMSE is treated as the label. As the noise level increases, the allocated power shifts from a heterophilous to a homophilic distribution. Intuitively, the power control problem for the SISO interference channel can be pictured as finding the optimal speaking volume for each person at a busy dinner party. If two people stand close together, each at equal distance from their partner, it is beneficial for one to remain silent and the other to speak at full volume, instead of the two speaking over each other. With increased noise power, the optimal power allocation saturates to full power, as the noise overpowers the interference from neighboring pairs.
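A minimal NumPy sketch of the weighted homophily index in (11) is given below. The text does not specify which scalar edge weight $e_{ij}$ was used for Table I (the interference gains $|h_{ij}|^2$ are one natural choice), so the helper takes the weight matrix as an input; the function name is ours.

```python
import numpy as np

def weighted_homophily(E, x):
    """Edge-weighted homophily of (11) on a fully connected graph.

    E[i, j] >= 0 is the edge weight e_ij and x[i] in [0, 1] is the node label
    (e.g., the WMMSE power normalized by P_max)."""
    n = len(x)
    E = E * (1.0 - np.eye(n))                        # exclude self-loops from N(i)
    agreement = 1.0 - np.abs(x[:, None] - x[None, :])
    per_node = (E * agreement).sum(axis=1) / E.sum(axis=1)
    return float(per_node.mean())                    # empirical E_i[.]
```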

Graph transformers are more resilient to heterophilous datasets because the attention weights act as active graph rewiring. The wireless network is formulated as a graph with soft neighbor connections, in which every node is connected to every other node with an associated scalar weight. By applying attention to each neighbor, the model effectively rewires the graph at every layer, avoiding the over-smoothing issue that GNNs face when aggregating over a fixed graph topology.

IV Numerical Experiments

IV-A Performance on Singular Network Size

We perform experiments on a simulated SISO D2D network over an AWGN channel with fixed noise power $\sigma^2 = 2.6 \times 10^{-5}$. $n$ transmitters are uniformly distributed on a square plane of range $[-n, n]^2$, and each receiver is placed at a distance within $[1, n/4]$ of its target transmitter. The lower bound on the receiver distance prevents numerical instability when calculating the path gain. The channel is constructed with a deterministic path-loss component defined as $h^p_{ij} = \|t_i - r_j\|^{-2.2}$, where $t_i, r_j$ are the coordinates of the corresponding transmitter and receiver, which is then multiplied by the fading coefficient $h^f_{ij} \sim \text{Rayleigh}(1)$. This specific problem setting has been used across various data-driven power allocation publications and is chosen here for ease of comparison across different methods [4, 6]. We assume equal weights $w_i = 1$ for all pairs.
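The sketch below reproduces this channel generation procedure under stated assumptions: receivers are dropped at a uniformly random angle and distance from their transmitter, and NumPy's Rayleigh sampler with unit scale stands in for Rayleigh(1). The exact parameterization in the authors' simulator may differ.

```python
import numpy as np

def sample_channel(n, rng=None):
    """Sample one n-pair channel matrix H with H[i, j] = h_ij."""
    rng = rng or np.random.default_rng()
    tx = rng.uniform(-n, n, size=(n, 2))                      # transmitters on [-n, n]^2
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    dist = rng.uniform(1.0, n / 4.0, size=n)                  # receiver distance in [1, n/4]
    rx = tx + dist[:, None] * np.stack([np.cos(angle), np.sin(angle)], axis=1)
    d = np.linalg.norm(tx[:, None, :] - rx[None, :, :], axis=-1)  # ||t_i - r_j||
    path_loss = d ** (-2.2)
    fading = rng.rayleigh(scale=1.0, size=(n, n))             # Rayleigh(1) fading
    return path_loss * fading
```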
For all data-driven benchmarks, the models were trained on a dataset of 25,000 network instances, consisting of 50 randomly sampled channel states on each of 500 randomly sampled network topologies. For the data-driven benchmarks, the published configurations for model dimensions and training strategies were retained, but the learning rates were adjusted so that the models could converge stably under the different simulation conditions. Each model was then tested on 1,000 previously unseen network instances, with 50 randomly sampled channel states on 50 randomly sampled network topologies.

The proposed model, TGT, is a 3-layer graph transformer with 32-head attention and embedding dimension $d = 64$. As the channel coefficients $h_{ij}$ are near-zero values, we normalize the channel matrix $\mathbf{H}$ to the range $[0, 1]$. The model was trained using the AdamW [18] optimizer with the learning rate set to $5 \times 10^{-4}$. All models were trained for 50 epochs. The channel state $\mathbf{H}$ needs to be represented as graph data to be processed by a GNN. For our method, we convert $\mathbf{H}$ to an undirected graph $G$ with $n$ nodes. In particular, $G$ is a fully connected graph with self-loops, where the feature vectors of the $i$-th node and the $ij$-th edge are given by $\mathbf{x}_i^0 = [h_{ii}, w_i] \in \mathbb{R}^2$ and $\mathbf{e}_{ij} = [h_{ij}, h_{ji}] \in \mathbb{R}^2$, respectively. Implementation details can be accessed through the GitHub repository [19].
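The conversion of $\mathbf{H}$ into node and edge features can be sketched as follows. Normalizing by the maximum entry of $\mathbf{H}$ is our assumption for mapping the channel to $[0, 1]$; the text only states that the matrix is normalized.

```python
import numpy as np

def channel_to_graph(H, w):
    """Build the node features x_i^0 = [h_ii, w_i] and edge features
    e_ij = [h_ij, h_ji] of the fully connected graph G from the channel H."""
    Hn = H / H.max()                                   # normalize channel to [0, 1] (assumed)
    node_feat = np.stack([np.diag(Hn), w], axis=1)     # (n, 2)
    edge_feat = np.stack([Hn, Hn.T], axis=-1)          # (n, n, 2)
    return node_feat, edge_feat
```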

We perform numerical comparison across the following benchmarks:

  • Max Power is a naive strategy where all transmitters transmit at maximum power $P_{\max} = 1$.

  • WMMSE[3] acts as a baseline for the classical method, which is run for 100 iterations.

  • PCGNN[5] utilizes a variant of a GNN with its edge weights embedded onto the message. It achieves state-of-the-art performance on complex interference channels.

  • UWMMSE [4] is a model-driven approach that utilizes GNNs to accelerate the convergence of WMMSE with only 4 iterations. It achieves state-of-the-art performance on a fixed network topology.

  • Graph Transformer [9] is run with the same number of layers, heads, and embedding dimension $d$ as the proposed method for comparison. Note that this model has 173K trainable parameters, while ours has only 13.1K.

TABLE II: Experiment results across a range of network sizes.
Average sum rate per network size $n$
Algorithm               $n=20$      $n=30$      $n=40$      $n=50$
Max Power 62.547 79.038 96.168 106.364
WMMSE 84.563 109.972 132.677 147.463
PCGNN 84.237 109.892 131.403 147.165
UWMMSE 83.081 108.335 131.048 146.035
Graph Transformer 85.670 111.376 133.554 148.951
Proposed Method 85.968 111.584 134.320 149.592
Proposed Method: Multi Node 84.999 111.543 134.487 149.343


Figure 2: Histogram of sum rates for TGT and WMMSE, compared across 32,000 different channel instances with $n = 50$, $\sigma^2 = 2.6 \times 10^{-5}$.

Table II shows the comparison across various numbers of pairs. Each data-driven model is trained only on the dataset with the corresponding network size. Multi-node denotes a single model trained on an even distribution of network sizes $\{20, 30, 40, 50\}$, with the same number of samples as mentioned above. The proposed method outperforms all other methods across the various network sizes for both fixed-size and multi-node training. As the training dataset contains only 50 channel instances per topology, we further compare the models on a fixed network topology with 32,000 randomly sampled channel fading realizations. Fig. 2 shows that the proposed method follows a similar distribution curve to WMMSE, with a consistent advantage across the whole range of channel instances.

IV-B Deviation from the Original Network Configuration

While deep learning models outperform classical algorithms with lower computational complexity, their performance is only guaranteed within the training data distribution; their generality across unseen data distributions must therefore be tested to understand when to retrain the model. We test the model's generality across network size, fading distribution, and pair density. We then compare the computational complexity and analyze the relationship between model size and generality.

In Table II, each model was trained only on a fixed network size $n$. We now examine how our model performs across a sweep of network sizes ranging from 20 to 100 pairs, sampling 10 channel instances for each of 100 random network topologies. Fig. 3 shows the results for three models, trained on $n = 30$, $n = 50$, and the multi-node setting, where the sum rate is normalized by that of WMMSE for ease of comparison. A taper in performance away from the trained network size is clearly visible, but the model is able to outperform WMMSE even outside of the trained pair size. The multi-node model shows a more stable drop-off, performing better on average than fixed-size training. This suggests the model should be trained on the data distribution expected in the deployed scenario, to prevent frequent retraining.


Figure 3: Normalized sum rate of TGT across a wide range of unseen network sizes. $\sigma^2 = 2.6 \times 10^{-5}$.

We now consider a scenario where the coefficient of the Rayleigh fading changes. Using the same configuration as in Table II, Table III shows that the proposed model generalizes well beyond its original fading coefficient of 1, outperforming WMMSE up to a fading coefficient of 4 for both the fixed and multi-node models.

TABLE III: Model generalization across fading coefficients
Fading coefficient      0.5     1     1.5     2     2.5     3     3.5     4
Max Power 94.9 103.4 110.4 111.8 113.9 117.5 115.2 114.5
WMMSE 122.2 144.3 158.2 164.2 169.4 174.9 174.3 174.0
TGT 123.1 146.3 160.3 165.9 170.2 175.6 174.4 173.9

While the model is able to generalize across multiple network sizes and fading coefficients, the proposed method struggles when the overall density of the network deviates from the training data. We test the proposed model on a fixed network of $n = 50$ pairs, but change the distribution of transmitters from $[-n, n]^2$ to $[-n/r, n/r]^2$, where $r \in [0.25, 4]$. Table IV shows that the model struggles to generalize under deviations from the trained pair density. Note that the model still outperforms WMMSE when it is trained on data of the corresponding density.

TABLE IV: Model generalization across network field sizes
Field size $n/r$        200    150    100    50    25    17    12
Max Power 217.2 210.3 179.7 107.4 44.3 22.3 14.8
WMMSE 221.4 219.3 196.2 146.8 96.0 70.0 57.0
TGT 215.0 216.7 195.0 148.7 88.9 53.3 37.6

IV-C Computational Complexity

One advantage of data-driven models over iterative classical algorithms is their lower computational complexity. WMMSE is known to run in $O(KN^3)$ time, where $N$ is the number of pairs and $K$ is the number of iterations. Like other GNN-based models, the proposed model has a computational complexity of $O(LFN^2)$, where $L$ is the number of layers, $F$ is the dimension of the linear weights, and $N$ is the number of pairs. Since the computational complexity depends on the model size, we examine how the proposed model performs across various parameter counts. We compare models with widths $d \in [4, 104]$, for which the parameter count ranges from 960 to 33.7K. Fig. 4 compares the performance of the proposed models trained on a fixed pair size of $n = 30$. For the different parameter sizes, we retain the original ratio of the hidden layer dimensions and scale accordingly, while keeping the training conditions the same as in Section IV-A. The proposed method already outperforms WMMSE at a parameter size of 3.5K and appears to converge at a parameter size of 13K. The graph transformer achieves equivalent performance with ten times more parameters.


Figure 4: Model performance in relation to parameter size. All models were trained on $n = 30$, $\sigma^2 = 2.6 \times 10^{-5}$. The sum rate of WMMSE is plotted for reference.

While the proposed model outperforms WMMSE at a smaller parameter size, this comes at the price of generality. Fig. 5 shows that models with fewer parameters fail to generalize to unseen network configurations, while larger models exhibit a slower drop-off. Thus, there is a trade-off between computational simplicity and model generality, which determines how frequently retraining is required once the deployed network deviates from the trained data distribution.


Figure 5: Model generalization in relation to parameter size. Every model was trained on a dataset with $n = 30$, $\sigma^2 = 2.6 \times 10^{-5}$.

V Conclusion

We proposed a graph transformer based architecture called TGT for the wireless power allocation problem. We observed that, since the power control problem displays a heterophilous distribution at low noise levels, architectures such as the graph transformer perform well. With parameter sharing and the removal of additional non-linear layers, the proposed model achieves state-of-the-art results with a fraction of the trainable parameters. Analysis of the model on unseen data distributions shows that, although the model generalizes over a range of unseen distributions, limits exist depending on the type of perturbation, revealing a trade-off between model complexity and generality.

References

  • [1] Z.-Q. Luo and S. Zhang, “Dynamic spectrum management: Complexity and duality,” IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 1, pp. 57–73, Feb. 2008.
  • [2] W. Yu and R. Lui, “Dual methods for nonconvex spectrum optimization of multicarrier systems,” IEEE Transactions on Communications, vol. 54, no. 7, pp. 1310–1322, July 2006.
  • [3] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively weighted mmse approach to distributed sum-utility maximization for a mimo interfering broadcast channel,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4331–4340, Sept. 2011.
  • [4] A. Chowdhury, F. Gama, and S. Segarra, “Stability Analysis of Unfolded WMMSE for Power Allocation,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022, pp. 5298–5302.
  • [5] Y. Shen, J. Zhang, S. H. Song, and K. B. Letaief, “Graph neural networks for wireless communications: From theory to practice,” IEEE Transactions on Wireless Communications, vol. 22, no. 5, pp. 3554–3569, May 2023.
  • [6] M. Eisen and A. Ribeiro, “Large scale wireless power allocation with graph neural networks,” in 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1–5, ISSN: 1948-3252.
  • [7] Q. Liu, M. Nickel, and D. Kiela, “Hyperbolic graph neural networks,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [8] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of transformers,” AI Open, 2022.
  • [9] V. P. Dwivedi and X. Bresson, “A generalization of transformer networks to graphs.” [Online]. Available: http://arxiv.org/abs/2012.09699
  • [10] G. Corso, L. Cavalleri, D. Beaini, P. Liò, and P. Veličković, “Principal neighbourhood aggregation for graph nets,” Advances in Neural Information Processing Systems, vol. 33, pp. 13260–13271, 2020.
  • [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [12] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
  • [13] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  • [14] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, and B. Yang, “Geom-gcn: Geometric graph convolutional networks,” arXiv preprint arXiv:2002.05287, 2020.
  • [15] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in International conference on machine learning.   PMLR, 2019, pp. 6861–6871.
  • [16] F. Frasca, B. Bevilacqua, M. M. Bronstein, and H. Maron, “Understanding and extending subgraph GNNs by rethinking their symmetries.” [Online]. Available: http://arxiv.org/abs/2206.11140
  • [17] C. Bodnar, F. Di Giovanni, B. P. Chamberlain, P. Liò, and M. M. Bronstein, “Neural sheaf diffusion: A topological perspective on heterophily and oversmoothing in GNNs.” [Online]. Available: http://arxiv.org/abs/2202.04579
  • [18] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [19] D. Kim, Truncated Graph Transformer, https://github.com/dhkimac/Truncated-Graph-Transformer, 2022.