
KANsformer for Scalable Beamforming

Xinke Xie, Yang Lu, Chong-Yung Chi, Wei Chen, Bo Ai, and Dusit Niyato
Xinke Xie and Yang Lu are with the School of Computer and Technology, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]; [email protected]). Chong-Yung Chi is with the Institute of Communications Engineering, Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]). Wei Chen and Bo Ai are with the School of Electronics and Information Engineering, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]; [email protected]). Dusit Niyato is with the College of Computing and Data Science, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).
Abstract

This paper proposes an unsupervised deep learning (DL) approach, termed KANsformer, which integrates the transformer and the Kolmogorov–Arnold network (KAN) to realize scalable beamforming for mobile communication systems. Specifically, we consider the classic multi-input-single-output energy efficiency maximization problem subject to a total power budget. The proposed KANsformer first extracts hidden features via a multi-head self-attention mechanism and then reads out the desired beamforming design via the KAN. Numerical results evaluate the KANsformer in terms of generalization performance, transfer learning and an ablation experiment. Overall, the KANsformer outperforms existing benchmark DL approaches, and adapts to changes in the number of mobile users with real-time and near-optimal inference.

Index Terms:
Transformer, KAN, beamforming, energy efficiency.

I Introduction

Deep learning (DL) has revolutionized a wide range of application fields and achieved unprecedented success in tasks such as image recognition and natural language processing. Its ability to automatically extract high-level features from raw data enables deep neural networks to outperform traditional machine learning methods in complex problem domains. Recently, DL-enabled designs for wireless networks have emerged as a hot research topic [1]. Some researchers have applied the multi-layer perceptron (MLP) [2], convolutional neural networks (CNN) [3] and graph neural networks (GNN) [4] to power allocation and signal processing problems in wireless networks. Overall, DL models can be trained to achieve performance close to that of traditional convex optimization (CVXopt)-based approaches but with much faster inference. How to further improve the learning performance remains an open issue for DL-enabled wireless optimization.

Typically, task-oriented DL requires dedicated models for wireless networks. One promising way is to follow the “encoder-decoder” paradigm, where the encoder extracts features over the wireless network while the decoder maps the extracted features to the desired transmit design. Recent works have paid great attention to the design of the encoder. In particular, the GNN shows good scalability and generalization performance by exploiting the graph topology of wireless networks [5]. In [6, 7, 8], the GNN was adopted as the encoder to develop solution approaches for energy efficiency (EE) maximization, sum-rate maximization and max-min rate optimization, respectively, in multi-user multi-input-single-output (MISO) networks, all of which were scalable to the number of users. In [9], a GNN-based model was trained via unsupervised learning to solve the outage-constrained EE maximization problem. The GNNs in [6, 7, 8, 9] all leveraged the multi-head attention mechanism, also known as the graph attention network (GAT), to enhance the feature extraction, especially for inter-user interference. Similarly, the transformer is also built upon the self-attention mechanism and is popular for its high predictive performance [10]. In [11], a transformer and a weighted A*-based algorithm were proposed to plan the unmanned aerial vehicle trajectory for age-of-information minimization, outperforming traditional algorithms numerically. However, the above works on wireless networks all adopted the MLP as the decoder. Recently, the Kolmogorov-Arnold network (KAN) has been proposed as a promising alternative to the MLP with superior performance in terms of accuracy and interpretability [12].

To the best of our knowledge, the integration of the transformer and the KAN has not yet been applied to beamforming design. In this paper, we formulate the classic EE maximization problem for MISO networks [13]. We then propose an approach integrating the transformer and the KAN, termed KANsformer, which utilizes the multi-head self-attention mechanism to extract hidden features among interference links and the KAN to map the extracted features to the desired beamforming design. The KANsformer is trained via unsupervised learning, and a scale function guarantees a feasible solution. Via parameter sharing, the KANsformer is scalable to the number of users. Numerical results indicate that the KANsformer outperforms existing DL models and approaches the accuracy of the CVXopt-based solution with millisecond-level inference time. An ablation experiment shows that the major performance gain is contributed by the KAN. Besides, we validate the scalability of the KANsformer, which can be further enhanced by transfer learning at little training cost.

The rest of this paper is organized as follows. Section II gives the system model and problem formulation. Section III presents the structure of KANsformer. Section IV provides numerical results. Finally, Section V concludes the paper with future research directions.

II System Model and Problem Formulation

Consider a downlink MISO network, where one $N_{\rm T}$-antenna transmitter intends to serve $K$ single-antenna mobile users (MUs) over a common spectral band. We use $\mathcal{K}\triangleq\{1,2,...,K\}$ to denote the index set of the MUs.

Denote the symbol for the $k$-th MU and the corresponding beamforming vector as $s_k$ and ${\bf w}_k\in\mathbb{C}^{N_{\rm T}}$, respectively. The received signal at the $k$-th MU is given by

$y_k={\bf h}_k^H{\bf w}_k s_k+\sum\nolimits_{i\neq k}^{K}{\bf h}_k^H{\bf w}_i s_i+n_k,\qquad(1)$

where ${\bf h}_k\in\mathbb{C}^{N_{\rm T}}$ denotes the channel state information (CSI) of the $k$-th transmitter-MU link, and $n_k\sim\mathcal{CN}(0,\sigma_k^2)$ denotes the additive white Gaussian noise (AWGN) at the $k$-th MU. Without loss of generality, it is assumed that $\mathbb{E}\{|s_k|^2\}=1$ ($\forall k\in\mathcal{K}$). Then, the achievable rate at the $k$-th MU is expressed as

$R_k\left(\{{\bf w}_i\}\right)=\log_2\left(1+\frac{\left|{\bf h}_k^H{\bf w}_k\right|^2}{\sum\nolimits_{i=1,i\neq k}^{K}\left|{\bf h}_k^H{\bf w}_i\right|^2+\sigma_k^2}\right),\qquad(2)$

where $\{{\bf w}_i\}$ denotes the set of all admissible beamforming vectors. The weighted EE of the considered system is expressed as

${\rm EE}\left(\{{\bf w}_i\}\right)=\frac{\sum\nolimits_{k=1}^{K}\alpha_k R_k\left(\{{\bf w}_i\}\right)}{\sum\nolimits_{k=1}^{K}\left\|{\bf w}_k\right\|_2^2+P_{\rm C}},\qquad(3)$

where $\alpha_k$ is a preassigned weight for the $k$-th MU and $P_{\rm C}$ denotes the constant power consumption introduced by circuit modules.
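For concreteness, the metric in (2)-(3) can be computed in a vectorized manner. Below is a minimal PyTorch sketch (our own illustration, not the authors' released code), assuming batched CSI `H` and beamformers `W` of shape `(B, K, N_T)` and scalar noise power `sigma2`:

```python
import torch

def energy_efficiency(H, W, sigma2=1.0, Pc=0.1, alpha=1.0):
    """Weighted EE (3) for batched CSI H and beamformers W, both (B, K, Nt) complex.

    sigma2, Pc and alpha are assumed scalars here (alpha_k = 1 in Section IV);
    per-MU tensors would broadcast equally well.
    """
    # G[b, k, i] = |h_k^H w_i|^2: effective gains of all transmitter-MU links.
    G = torch.abs(torch.einsum('bkn,bin->bki', H.conj(), W)) ** 2
    signal = torch.diagonal(G, dim1=-2, dim2=-1)        # |h_k^H w_k|^2
    interference = G.sum(dim=-1) - signal               # sum_{i != k} |h_k^H w_i|^2
    rates = torch.log2(1.0 + signal / (interference + sigma2))   # rate (2)
    power = (torch.abs(W) ** 2).sum(dim=(-2, -1)) + Pc  # transmit + circuit power
    return (alpha * rates).sum(dim=-1) / power          # EE (3), one value per sample
```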

Our goal is to maximize the EE of the considered network, which is mathematically formulated as an optimization problem:

$\{{\bf w}_i^{\star}\}=\arg\max_{\{{\bf w}_i\in\mathbb{C}^{N_{\rm T}}\}:\,\sum\nolimits_{i=1}^{K}\|{\bf w}_i\|_2^2\leq P_{\rm max}}\ {\rm EE}\left(\{{\bf w}_i\}\right),\qquad(4)$

where $P_{\rm max}$ denotes the power budget of the transmitter.

The problem (4) can be efficiently solved by existing CVXopt techniques, but it admits no closed-form solution. Treating the CVXopt-based algorithm as a “black box”, it maps the CSI to the beamforming vectors via iterative computations. Such a mapping can also be regarded as a “function” $\Pi(\cdot):\mathbb{C}^{N_{\rm T}\times K}\rightarrow\mathbb{C}^{N_{\rm T}\times K}$. Following the universal approximation theorem, we intend to utilize neural networks to solve the problem (4).

III Structure of KANsformer

The main idea of the proposed KANsformer is to realize the mapping $\Pi(\cdot)$ from $\{{\bf h}_i\}$ to $\{{\bf w}_i\}$ such that ${\rm EE}(\{{\bf w}_i\})$ is close to ${\rm EE}(\{{\bf w}_i^{\star}\})$. In particular, we train the KANsformer via unsupervised learning to alleviate the burden of collecting a labelled training set. Denoting the learnable parameters of the KANsformer by $\bm{\theta}$, the loss function used to update them is given by

$\mathcal{L}(\bm{\theta})=-{\rm EE}\left(\Pi\left(\{{\bf h}_i\}\,|\,\bm{\theta}\right)\right).\qquad(5)$
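A minimal sketch of this loss in PyTorch, reusing the `energy_efficiency()` helper sketched in Section II (both are our assumptions, not released code):

```python
def kansformer_loss(model, H):
    """Unsupervised loss (5): negative EE of the predicted beamformers, batch-averaged."""
    W = model(H)                            # Pi({h_i} | theta)
    return -energy_efficiency(H, W).mean()
```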

Note that, via off-line training on historical statistics, the KANsformer can derive the solution instantaneously at low computational complexity, instead of relying on the complex iterative calculations of mathematical optimization approaches.

The structure of the KANsformer is illustrated in Fig. 1, which includes four modules: pre-processing module, transformer encoder module, KAN decoder module and post-processing module. The detailed processes in each module are described as follows.

III-A Pre-Processing Module

In the pre-processing module, we divide each complex-valued CSI vector in $\{{\bf h}_k\}$ into its real part ${\rm Re}({\bf h}_k)$ and its imaginary part ${\rm Im}({\bf h}_k)$. Then, ${\rm Re}({\bf h}_k)$ and ${\rm Im}({\bf h}_k)$ are concatenated and fed into a linear transformation to obtain the input of the transformer encoder module, denoted by $\widehat{\bf H}\in\mathbb{R}^{K\times D}$, which is given by

$\widehat{\bf H}=\left[{\rm Con}\left({\rm Re}({\bf h}_1),{\rm Im}({\bf h}_1)\right);\cdots;{\rm Con}\left({\rm Re}({\bf h}_K),{\rm Im}({\bf h}_K)\right)\right]{\bf W}_0,\qquad(6)$

where ${\rm Con}(\cdot)$ represents the concatenation operation, and ${\bf W}_0\in\mathbb{R}^{2N_{\rm T}\times D}$ denotes the learnable parameters, with $D$ being a configurable dimension usually greater than $K$ such that more attention heads can be employed.
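A sketch of the pre-processing in (6) under the shape conventions assumed above; `nn.Linear` applies ${\bf W}_0$ row-wise, so the module is independent of $K$:

```python
import torch
import torch.nn as nn

class PreProcessing(nn.Module):
    """Split complex CSI into real/imaginary parts and project to dimension D, cf. (6)."""
    def __init__(self, Nt, D):
        super().__init__()
        self.proj = nn.Linear(2 * Nt, D, bias=False)     # learnable W_0 in (6)

    def forward(self, H):                                # H: (B, K, Nt) complex
        X = torch.cat([H.real, H.imag], dim=-1)          # Con(Re(h_k), Im(h_k))
        return self.proj(X)                              # H_hat: (B, K, D)
```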

Figure 1: Structure of the KANsformer, which includes four modules: the pre-processing module, the transformer encoder module (with $L$ TELs), the KAN decoder module (with $T$ KDLs) and the post-processing module. The detailed processes of the $l$-th TEL and the $t$-th KDL are illustrated.

III-B Transformer Encoder Module

The aim of the transformer encoder module is to encode the obtained network feature $\widehat{\bf H}$ by exploring the interactions among MUs and embedding the impact of inter-MU interference into the encoded network feature. The transformer encoder module comprises $L$ transformer encoder layers (TELs), each of which includes two ingredients, i.e., multi-head self-attention and a position-wise feed-forward network. For the $l$-th TEL, we denote its input and output as ${\bf H}^{(l)}$ and ${\bf H}^{(l+1)}\in\mathbb{R}^{K\times D}$, respectively (the inputs and outputs of all the TELs have the same size, with ${\bf H}^{(1)}=\widehat{\bf H}$). The detailed processes of the two ingredients are given as follows.

III-B1 Multi-Head Self-Attention

Suppose that $M^{(l)}$ self-attention heads are employed in the $l$-th TEL. Then, the attention coefficient matrix associated with the $m$-th self-attention head in the $l$-th TEL is given by

${\bf A}_m^{(l)}={\rm Softmax}\left(\frac{{\bf H}^{(l)}{\bf W}_Q^{(l)}\big({\bf H}^{(l)}{\bf W}_K^{(l)}\big)^T}{\sqrt{D}}\right){\bf H}^{(l)}{\bf W}_V^{(l)}\in\mathbb{R}^{K\times\frac{D}{M^{(l)}}},\qquad(7)$

where ${\bf W}_Q^{(l)}$, ${\bf W}_K^{(l)}$ and ${\bf W}_V^{(l)}\in\mathbb{R}^{D\times\frac{D}{M^{(l)}}}$ denote the learnable parameters for the query, key and value projections, respectively.

The obtained $M^{(l)}$ attention heads, i.e., $\{{\bf A}_m^{(l)}\}_m$, are concatenated and then passed into a linear layer with learnable parameters ${\bf W}_{\rm MA}^{(l)}\in\mathbb{R}^{D\times D}$. Then, we obtain the multi-head attention coefficient matrix as

${\bf H}_{\rm MA}^{(l)}={\rm Con}\big({\bf A}_1^{(l)}\cdots{\bf A}_{M^{(l)}}^{(l)}\big)\,{\bf W}_{\rm MA}^{(l)}\in\mathbb{R}^{K\times D},\qquad(8)$ where the concatenation ${\rm Con}\big({\bf A}_1^{(l)}\cdots{\bf A}_{M^{(l)}}^{(l)}\big)\in\mathbb{R}^{K\times D}$.

To improve the training performance and allow deeper layers to be stacked, a parameter-free layer normalization process, represented by ${\rm LayerNorm}(\cdot)$, and a residual connection are adopted. The attention coefficient matrix is then updated by

$\widetilde{\bf H}_{\rm MA}^{(l)}={\rm LayerNorm}\left({\bf H}_{\rm MA}^{(l)}\right)+{\bf H}^{(l)}\in\mathbb{R}^{K\times D}.\qquad(9)$
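A sketch of the multi-head self-attention sub-block (7)-(9); shapes and initialization are our assumptions. Note that, following (9) as written, the residual is added after the layer normalization:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention of one TEL, cf. (7)-(9)."""
    def __init__(self, D, M):
        super().__init__()
        assert D % M == 0, "D must be divisible by the number of heads M"
        self.M, self.d = M, D // M
        self.Wq = nn.Linear(D, D, bias=False)    # stacks the M query matrices W_Q
        self.Wk = nn.Linear(D, D, bias=False)    # key matrices W_K
        self.Wv = nn.Linear(D, D, bias=False)    # value matrices W_V
        self.Wma = nn.Linear(D, D, bias=False)   # W_MA in (8)
        self.norm = nn.LayerNorm(D, elementwise_affine=False)  # parameter-free

    def forward(self, X):                        # X: (B, K, D)
        B, K, D = X.shape
        split = lambda Y: Y.view(B, K, self.M, self.d).transpose(1, 2)  # (B, M, K, d)
        Q, Kh, V = split(self.Wq(X)), split(self.Wk(X)), split(self.Wv(X))
        A = torch.softmax(Q @ Kh.transpose(-2, -1) / D ** 0.5, dim=-1) @ V   # (7)
        A = A.transpose(1, 2).reshape(B, K, D)   # concatenate the M heads
        return self.norm(self.Wma(A)) + X        # (8)-(9): LayerNorm then residual
```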

III-B2 Position-wise Feed-forward Layer

The obtained attention coefficient matrix is input into a 2-layer feed-forward network with position-wise operation, i.e.,

${\bf H}_{\rm FF}^{(l)}={\rm Con}\left(f_2^{(l)}\left({\rm ReLU}\left(f_1^{(l)}\left(\big[\widetilde{\bf H}_{\rm MA}^{(l)}\big]_{k,:}\right)\right)\right)\right)\in\mathbb{R}^{K\times D},\qquad(10)$

where $f_1^{(l)}(\cdot):\mathbb{R}^D\rightarrow\mathbb{R}^{D^{\prime}}$ and $f_2^{(l)}(\cdot):\mathbb{R}^{D^{\prime}}\rightarrow\mathbb{R}^D$ denote the feed-forward functions with $D^{\prime}$ being an intermediate dimension. The learnable parameters of $f_1^{(l)}(\cdot)$ and $f_2^{(l)}(\cdot)$ are denoted by ${\bf W}_1^{(l)}\in\mathbb{R}^{D\times D^{\prime}}$ and ${\bf W}_2^{(l)}\in\mathbb{R}^{D^{\prime}\times D}$, respectively.

Similarly, the feed-forward network is followed by a layer normalization process and a residual connection, and the output of the $l$-th TEL is given by

${\bf H}^{(l+1)}={\rm LayerNorm}\left({\bf H}_{\rm FF}^{(l)}\right)+{\bf H}_{\rm FF}^{(l)}.\qquad(11)$
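A sketch of the feed-forward part (10)-(11); `nn.Linear` already acts position-wise on the $K$ rows, so no explicit per-MU loop is needed. As written in (11), the residual is taken on ${\bf H}_{\rm FF}^{(l)}$ itself:

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Position-wise feed-forward part of one TEL, cf. (10)-(11)."""
    def __init__(self, D, D_ff):
        super().__init__()
        self.f1 = nn.Linear(D, D_ff)     # f_1: R^D -> R^{D'}
        self.f2 = nn.Linear(D_ff, D)     # f_2: R^{D'} -> R^D
        self.act = nn.ReLU()
        self.norm = nn.LayerNorm(D, elementwise_affine=False)

    def forward(self, X):                # X: (B, K, D)
        H = self.f2(self.act(self.f1(X)))      # (10), applied row-wise
        return self.norm(H) + H                # (11)
```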

III-C KAN Decoder Module

The aim of the KAN decoder module is to decode the obtained network feature, i.e., ${\bf H}^{(L+1)}$, into the required beamforming vectors via $T$ KAN decoder layers (KDLs). For the $t$-th KDL, we denote its input and output as ${\bf F}^{(t)}\in\mathbb{R}^{K\times F(t)}$ and ${\bf F}^{(t+1)}\in\mathbb{R}^{K\times F(t+1)}$, respectively, where $F(t)$ and $F(t+1)$ denote the corresponding dimensions. Note that ${\bf F}^{(1)}={\bf H}^{(L+1)}$ and $F(1)=D$, while $F(T+1)=2N_{\rm T}$. The processing of the $t$-th KDL is given by

$\left[{\bf F}^{(t+1)}\right]_{k,j}=\sum\nolimits_{i=1}^{F(t)}\phi_{j,i}^{(t)}\left(\left[{\bf F}^{(t)}\right]_{k,i}\right),\qquad(12)$

where $j\in\{1,...,F(t+1)\}$, $k\in\{1,...,K\}$ and $\phi_{j,i}^{(t)}(\cdot):\mathbb{R}\rightarrow\mathbb{R}$ is a continuous function given by

$\phi_{j,i}^{(t)}(x)=\beta_{j,i}^{(t)}\frac{x}{1+\exp(-x)}+\gamma_{j,i}^{(t)}{\rm Spline}_{j,i}^{(t)}(x),\qquad(13)$

where $\beta_{j,i}^{(t)}$ and $\gamma_{j,i}^{(t)}$ are learnable parameters, and ${\rm Spline}_{j,i}^{(t)}(\cdot):\mathbb{R}\rightarrow\mathbb{R}$ is parameterized as a linear combination of B-splines such that

${\rm Spline}_{j,i}^{(t)}(x)=\sum\nolimits_{p=0}^{P}c_{p,j,i}^{(t)}B_p(x),\qquad(14)$

where $c_{p,j,i}^{(t)}$ denotes the learnable weights and $P$ is a hyperparameter related to the B-splines (cf. [14]).
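A simplified sketch of one KDL implementing (12)-(14). The uniform grid, its range and the spline order are our assumptions (inputs outside the grid get a zero spline response); practical KAN implementations [12] differ in such details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KANLayer(nn.Module):
    """One KDL: out_j = sum_i beta_ji * silu(x_i) + gamma_ji * Spline_ji(x_i), cf. (12)-(14)."""
    def __init__(self, in_dim, out_dim, num_basis=8, order=3, lo=-2.0, hi=2.0):
        super().__init__()
        self.order = order                       # B-spline order (cubic assumed)
        # Uniform knot vector yielding num_basis (= P + 1) basis functions.
        self.register_buffer('grid', torch.linspace(lo, hi, num_basis + order + 1))
        self.beta = nn.Parameter(torch.randn(out_dim, in_dim) / in_dim ** 0.5)
        self.gamma = nn.Parameter(torch.randn(out_dim, in_dim) / in_dim ** 0.5)
        self.c = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.1)

    def b_splines(self, x):                      # x: (..., in_dim)
        g, x = self.grid, x.unsqueeze(-1)
        B = ((x >= g[:-1]) & (x < g[1:])).to(x.dtype)       # order-0 (indicator) basis
        for p in range(1, self.order + 1):                  # Cox-de Boor recursion
            B = (x - g[:-(p + 1)]) / (g[p:-1] - g[:-(p + 1)]) * B[..., :-1] \
              + (g[p + 1:] - x) / (g[p + 1:] - g[1:-p]) * B[..., 1:]
        return B                                 # (..., in_dim, num_basis)

    def forward(self, x):                        # x: (..., in_dim)
        base = torch.einsum('...i,oi->...o', F.silu(x), self.beta)       # silu term in (13)
        S = torch.einsum('...ip,oip->...oi', self.b_splines(x), self.c)  # Spline_ji in (14)
        return base + (self.gamma * S).sum(dim=-1)                       # (12)-(13)
```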

III-D Post-Processing Module

The post-processing module converts ${\bf F}^{(T+1)}$, obtained by the KAN decoder module, into a feasible solution to the problem (4).

In particular, the real-valued ${\bf F}^{(T+1)}$ is used to recover the $K$ complex-valued beamforming vectors, with the $k$-th beamforming vector given by

$\widetilde{\bf w}_k={\bf F}^{(T+1)}\left[k,1:N_{\rm T}\right]+i\,{\bf F}^{(T+1)}\left[k,N_{\rm T}+1:2N_{\rm T}\right].\qquad(15)$

Then, each beamforming vector is fed into a scale function to satisfy the power budget $P_{\rm max}$:

${\bf w}_k=\sqrt{\frac{P_{\rm max}}{\max\left(P_{\rm max},\,\sum\nolimits_{i=1}^{K}\left\|\widetilde{\bf w}_i\right\|_2^2\right)}}\;\widetilde{\bf w}_k.\qquad(16)$
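A sketch of the post-processing (15)-(16), assuming a `(B, K, 2*N_T)` real-valued decoder output:

```python
import torch

def post_process(F_out, P_max=1.0):
    """Recover complex beamformers (15) and project onto the power budget (16)."""
    Nt = F_out.shape[-1] // 2
    W = torch.complex(F_out[..., :Nt], F_out[..., Nt:])        # (15)
    P = (torch.abs(W) ** 2).sum(dim=(-2, -1), keepdim=True)    # total transmit power
    scale = torch.sqrt(P_max / torch.clamp(P, min=P_max))      # (16): shrink only if P > P_max
    return scale * W
```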

The learnable parameters in the KANsformer are given by

$\bm{\theta}=\left\{{\bf W}_0,\,{\bf W}^{(l)},\,\beta_{j,i}^{(t)},\,\gamma_{j,i}^{(t)},\,c_{p,j,i}^{(t)}\right\},\qquad(17)$

where ${\bf W}^{(l)}\triangleq\{{\bf W}_Q^{(l)},{\bf W}_K^{(l)},{\bf W}_V^{(l)},{\bf W}_{\rm MA}^{(l)},{\bf W}_1^{(l)},{\bf W}_2^{(l)}\}$. Note that the size of $\bm{\theta}$ is independent of $K$, which facilitates the KANsformer accepting inputs $\{{\bf h}_k\}$ with different values of $K$.
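Finally, a sketch assembling the modules sketched above into the end-to-end mapping $\Pi(\cdot\,|\,\bm{\theta})$; the layer counts and hidden dimensions below are illustrative assumptions, not the paper's configuration. No module depends on $K$, which is what makes the model scalable in the number of MUs:

```python
import torch.nn as nn

class TEL(nn.Module):
    """One transformer encoder layer: attention (7)-(9) then feed-forward (10)-(11)."""
    def __init__(self, D, M, D_ff):
        super().__init__()
        self.attn = MultiHeadSelfAttention(D, M)
        self.ffn = PositionwiseFFN(D, D_ff)

    def forward(self, X):
        return self.ffn(self.attn(X))

class KANsformer(nn.Module):
    """Pre-processing -> L TELs -> T KDLs -> post-processing, cf. Fig. 1."""
    def __init__(self, Nt, D=128, L=2, M=4, D_ff=256, T=2, F_mid=64, P_max=1.0):
        super().__init__()
        self.P_max = P_max
        self.pre = PreProcessing(Nt, D)
        self.encoder = nn.Sequential(*[TEL(D, M, D_ff) for _ in range(L)])
        dims = [D, F_mid, 2 * Nt]        # F(1) = D, ..., F(T+1) = 2*Nt (here T = 2)
        self.decoder = nn.Sequential(*[KANLayer(dims[t], dims[t + 1]) for t in range(T)])

    def forward(self, H):                # H: (B, K, Nt) complex CSI
        F_out = self.decoder(self.encoder(self.pre(H)))
        return post_process(F_out, self.P_max)
```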

IV Numerical Results

This section provides numerical results to evaluate the proposed KANsformer in terms of generalization performance, transfer learning and an ablation experiment, under the following settings.

IV-1 Simulation scenario

All the system parameters used are $N_{\rm T}\in\{4,8,16\}$, $K\in\{2,3,4,5,7,8,9,10,12,14\}$, $\alpha_k=1$ ($\forall k\in\mathcal{K}$), $P_{\rm max}=1$ W, $P_{\rm C}=0.1$ W, and CSI $\{{\bf h}_k\in\mathbb{C}^{N_{\rm T}}\}$ following Rayleigh fading for both training and test samples; the labels for the test samples are the maximal EEs obtained by the CVXopt-based algorithm. Specifically, we use $K_{\rm Tr}$, $K_{\rm Te}$ and $K_{\rm Tr}^{\prime}$ to denote the number of MUs in the training stage, the test stage and the fine-tuning stage (for transfer learning), respectively, where the case $K_{\rm Te}\neq K_{\rm Tr}$ (i.e., the scalability case) means that $K_{\rm Te}$ is unknown during the training stage.
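A sketch of generating such Rayleigh-fading samples, i.e., i.i.d. complex Gaussian entries per transmitter-MU link (the unit variance is our assumption):

```python
import torch

def sample_csi(batch, K, Nt):
    """Batch of Rayleigh CSI {h_k}: i.i.d. CN(0, 1) entries, shape (batch, K, Nt)."""
    return torch.complex(torch.randn(batch, K, Nt),
                         torch.randn(batch, K, Nt)) / 2 ** 0.5
```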

IV-2 Computer configuration

All DL models are trained and tested in Python 3.10 with PyTorch 2.4.0 on a computer with an Intel(R) Xeon(R) Platinum 8255C CPU and an NVIDIA RTX 2080 Ti GPU (11 GB of memory).

IV-3 Initialization and training

The learnable parameters are initialized according to the He (Kaiming) method, and the learning rate is initialized as $10^{-4}$. The Adam algorithm is adopted as the optimizer during the training phase. The batch size is set to 16 for 100 training epochs. The learnable weights with the best performance are used as the training results.
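Putting the pieces together, a minimal unsupervised training loop under the stated setup (Adam, learning rate $10^{-4}$, batch size 16, 100 epochs); the number of batches per epoch, the validation set and the reuse of the helpers sketched earlier are our assumptions:

```python
import torch

model = KANsformer(Nt=16, D=128)                 # constructor from the Section III sketch
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
best_ee, best_state = -float('inf'), None
for epoch in range(100):
    for _ in range(500):                         # assumed number of batches per epoch
        H = sample_csi(16, K=8, Nt=16)           # fresh Rayleigh batch, K_Tr = 8
        loss = kansformer_loss(model, H)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                        # keep the best weights, as in IV-3
        H_val = sample_csi(256, K=8, Nt=16)
        ee = energy_efficiency(H_val, model(H_val)).mean().item()
    if ee > best_ee:
        best_ee = ee
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
```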

IV-4 Benchmark DL models

In order to evaluate the KANsformer numerically, the following benchmark schemes are considered, i.e.,

  • CVXopt-based approach: A single-layer successive convex approximation based optimization algorithm, similar to Algorithm 1 in [15], used to generate the test labels.

  • MLP: A basic feed-forward neural network, similar to [2].

  • GAT: A basic GNN with the multi-head attention mechanism, similar to [6].

IV-5 Test performance metrics

  • Optimality performance: The ratio of the average EE achieved by the DL model to the optimal EE.

  • Inference time: Average running time for yielding the feasible beamforming solution by the DL model.

TABLE I: Generalization performance evaluation.

| $N_{\rm T}$ | $K_{\rm Tr}$ | $K_{\rm Te}$ | CVX | MLP | GAT | KF |
|---|---|---|---|---|---|---|
| 4 | 2 | 2 | 100% | 98.2% | 98.4% | 99.5% |
| | | 3 | × | × | 84.3% | 85.1% |
| | Inference time | | 6.7 s | 3.3 ms | 7.7 ms | 8.5 ms |
| 8 | 4 | 4 | 100% | 79.1% | 90.1% | 95.3% |
| | | 5 | × | × | 82.6% | 83.2% |
| | Inference time | | 10.5 s | 3.4 ms | 7.7 ms | 8.5 ms |
| 16 | 8 | 7 | × | × | 84.0% | 91.1% |
| | | 8 | 100% | 17.9% | 85.6% | 92.9% |
| | | 9 | × | × | 82.8% | 90.8% |
| | Inference time | | 57.4 s | 3.5 ms | 7.8 ms | 8.4 ms |

  • KF is short for KANsformer.

  • × represents “not applicable”.

TABLE II: Transfer learning evaluation: $N_{\rm T}=16$.

| $K_{\rm Te}$ | Scaling ($K_{\rm Tr}=8$) | Re-training ($K_{\rm Tr}=K_{\rm Te}$, 100 epochs) | Transfer learning ($K_{\rm Tr}=8$, $K^{\prime}_{\rm Tr}=K_{\rm Te}$), 10 epochs | 20 epochs | 50 epochs |
|---|---|---|---|---|---|
| 10 | 86.6% | 93.4% | 93.0% | 93.5% | 93.5% |
| 12 | 77.9% | 90.7% | 94.6% | 95.2% | 95.2% |
| 14 | 71.1% | 95.2% | 93.6% | 94.2% | 94.2% |

  • Scaling means directly applying the model trained with $K_{\rm Tr}$ MUs to the scenario with $K_{\rm Te}$ MUs.

TABLE III: Ablation experiment: $N_{\rm T}=16$ and $K_{\rm Tr}=8$.

| Encoder | Decoder | $K_{\rm Te}=7$ | $K_{\rm Te}=8$ | $K_{\rm Te}=9$ |
|---|---|---|---|---|
| GAT | MLP | 84.0% | 85.6% | 82.8% |
| GAT | KAN | 89.5% | 91.2% | 88.1% |
| TF | MLP | 82.5% | 85.8% | 83.2% |
| TF | KAN | 91.1% | 92.9% | 90.8% |
| Avg. gain of TF (encoder) | | 0.1% | 1.9% | 3.1% |
| Avg. gain of KAN (decoder) | | 14.1% | 7.3% | 12.9% |

  • TF is short for transformer.

IV-A Generalization Performance

Table I reports the optimality performance and inference times of the KANsformer and the benchmarks, which are discussed in detail below.

IV-A1 Optimality performance with $K_{\rm Te}=K_{\rm Tr}$

One can observe that the KANsformer outperforms the MLP and the GAT in all three cases. Moreover, the larger $N_{\rm T}$ and $K_{\rm Te}=K_{\rm Tr}$ are, the larger the performance degradation, indicating that the problem scale negatively impacts the learning performance of all the DL models. Nevertheless, the KANsformer, as the best performer, keeps the performance loss within $10\%$.

IV-A2 Optimality performance with $K_{\rm Te}\neq K_{\rm Tr}$

The KANsformer performs much better than the GAT for $K_{\rm Tr}=8$, $K_{\rm Te}\in\{7,9\}$, but only slightly better for $K_{\rm Tr}=4$, $K_{\rm Te}\in\{3,5\}$, with some performance loss compared with the cases $K_{\rm Te}=K_{\rm Tr}\in\{4,8\}$. These results also indicate that the scalability performance loss grows with $|K_{\rm Te}-K_{\rm Tr}|/K_{\rm Tr}$, because the multi-head self-attention mechanism is intended to explore the interactions among MUs, which may change with the number of MUs.

IV-A3 Inference time

The MLP, GAT and KANsformer all achieve millisecond-level inference (significantly faster than the iterative CVXopt-based approach), making them applicable under time-varying channel conditions. A further observation is that the inference time of the DL models remains almost unchanged over all the considered values of $N_{\rm T}$ and $K$, while that of the CVXopt-based approach grows rapidly with the problem size, as is widely known.

In summary, the well-trained KANsformer achieves real-time and near-optimal inference for solving the problem (4), while remaining scalable, with acceptable performance, to numbers of MUs $K_{\rm Te}$ unseen in the training stage.

IV-B Transfer Learning

As mentioned above, the scalability suffers from performance degradation as $|K_{\rm Te}-K_{\rm Tr}|/K_{\rm Tr}$ increases. One can either re-train the model or fine-tune it via transfer learning on a new dataset in which the number of MUs is $K_{\rm Te}$, as sketched after this paragraph. The former initializes the learnable parameters randomly, while the latter adopts the learnable parameters $\bm{\theta}$ of the model trained with $K_{\rm Tr}$ MUs as the initial values. Table II shows the performance of plain scaling, re-training (100 epochs) and transfer learning ($\{10,20,50\}$ epochs). It can be seen that transfer learning effectively improves the performance at a quite low training cost (e.g., 10 epochs) compared with plain scaling, while achieving performance comparable to re-training with fewer training epochs (e.g., 20 epochs). For $K_{\rm Te}=14$, transfer learning falls behind re-training by about $1\%$; the reason is that the prior knowledge learned for $K_{\rm Tr}$ may mislead the fine-tuning under $K^{\prime}_{\rm Tr}=K_{\rm Te}$ when $|K^{\prime}_{\rm Tr}-K_{\rm Tr}|$ is large. Nevertheless, transfer learning still achieves a considerable gain ($>20\%$) over plain scaling.
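A sketch of the fine-tuning step, reusing the names from the earlier sketches: warm-start from the $K_{\rm Tr}=8$ weights (possible because $\bm{\theta}$ is independent of $K$) and train briefly on data with $K^{\prime}_{\rm Tr}=K_{\rm Te}$ MUs:

```python
import torch

model_tl = KANsformer(Nt=16, D=128)              # same architecture as the source model
model_tl.load_state_dict(best_state)             # theta from the K_Tr = 8 training
opt = torch.optim.Adam(model_tl.parameters(), lr=1e-4)
for epoch in range(10):                          # e.g., 10 fine-tuning epochs
    for _ in range(500):                         # assumed number of batches per epoch
        H = sample_csi(16, K=12, Nt=16)          # K'_Tr = K_Te = 12, for instance
        loss = kansformer_loss(model_tl, H)
        opt.zero_grad()
        loss.backward()
        opt.step()
```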

IV-C Ablation Experiment

Table III gives the ablation experiment validating the effectiveness of the transformer as the encoder and the KAN as the decoder. A performance gain can be observed when replacing the GAT with the transformer and the MLP with the KAN, for both $K_{\rm Te}=K_{\rm Tr}$ and $K_{\rm Te}\neq K_{\rm Tr}$. Specifically, the average performance gains over $K_{\rm Te}\in\{7,8,9\}$ contributed by the transformer and the KAN are $1.7\%$ and $11.4\%$, respectively. The reason is that both the GAT and the transformer adopt the attention mechanism to enhance the expressive capability, while the KAN offers more flexible activation processes than the MLP, such that the KAN can outperform the MLP in terms of accuracy and interpretability [12].

V Conclusion

We have presented a DL model, i.e., the KANsformer shown in Fig. 1, which uses the transformer and the KAN as the encoder and the decoder, respectively, for solving the beamforming design problem (cf. (4)).

Numerical results showed that the KANsformer outperforms existing DL models in terms of both performance accuracy and inference time. Furthermore, we would like to emphasize that, in response to a given input CSI $\{{\bf h}_k\}$, the KANsformer yields the beamforming output $\{{\bf w}_k\}$ with an almost fixed inference time (thus insensitive to the problem size) that is tremendously lower than the problem-size-dependent running time of the CVXopt-based approach, while the performance accuracy of the former is quite close to that of the latter (treated as the optimum). These results also motivate the further development of more powerful encoders and decoders dedicated to wireless communication systems.

References

  • [1] Y. Lu, W. Mao, H. Du, O. A. Dobre, D. Niyato, and Z. Ding, “Semantic-aware vision-assisted integrated sensing and communication: Architecture and resource allocation,” IEEE Wireless Commun., vol. 31, no. 3, pp. 302-308, Jun. 2024.
  • [2] C. Hu et al., “AI-empowered RIS-assisted networks: CV-enabled RIS selection and DNN-enabled transmission,” IEEE Trans. Veh. Technol., early access, 2024.
  • [3] Z. Song et al., “A deep learning framework for physical-layer secure beamforming,” IEEE Trans. Veh. Technol., early access, 2024.
  • [4] Y. Lu, Y. Li, R. Zhang, W. Chen, B. Ai, and D. Niyato, “Graph neural networks for wireless networks: Graph representation, architecture and evaluation,” IEEE Wireless Commun., early access, 2024.
  • [5] Y. Shen, J. Zhang, S. H. Song, and K. B. Letaief, “Graph neural networks for wireless communications: From theory to practice,” IEEE Trans. Wireless Commun., vol. 22, no. 5, pp. 3554-3569, May 2023.
  • [6] Y. Li, Y. Lu, R. Zhang, B. Ai, and Z. Zhong, “Deep learning for energy efficient beamforming in MU-MISO networks: A GAT-based approach,” IEEE Wireless Commun. Lett., vol. 12, no. 7, pp. 1264-1268, Jul. 2023.
  • [7] Y. Li, Y. Lu, B. Ai, O. A. Dobre, Z. Ding, and D. Niyato, “GNN-based beamforming for sum-rate maximization in MU-MISO networks,” IEEE Trans. Wireless Commun., vol. 23, no. 8, pp. 9251-9264, Aug. 2024.
  • [8] Y. Li, Y. Lu, B. Ai, Z. Zhong, D. Niyato, and Z. Ding, “GNN-enabled max-min fair beamforming,” IEEE Trans. Veh. Technol., vol. 73, no. 8, pp. 12184-12188, Aug. 2024.
  • [9] C. He, Y. Li, Y. Lu, B. Ai, Z. Ding, and D. Niyato, “ICNet: GNN-enabled beamforming for MISO interference channels with statistical CSI,” IEEE Trans. Veh. Technol., vol. 73, no. 8, pp. 12225-12230, Aug. 2024.
  • [10] A. Vaswani et al., “Attention is all you need,” in Proc. NeurIPS, 2017, pp. 5998-6008.
  • [11] B. Zhu, E. Bedeer, H. H. Nguyen, R. Barton, and Z. Gao, “UAV trajectory planning for AoI-minimal data collection in UAV-aided IoT networks by transformer,” IEEE Trans. Wireless Commun., vol. 22, no. 2, pp. 1343-1358, Feb. 2023.
  • [12] Z. Liu, Y. Wang, S. Vaidya, et al., “KAN: Kolmogorov-Arnold networks,” arXiv preprint arXiv:2404.19756, Apr. 2024.
  • [13] Y. Lu, K. Xiong, P. Fan, Z. Ding, Z. Zhong, and K. B. Letaief, “Global energy efficiency in secure MISO SWIPT systems with non-linear power-splitting EH model,” IEEE J. Sel. Areas Commun., vol. 37, no. 1, pp. 216-232, Jan. 2019.
  • [14] L. Schumaker, Spline Functions: Basic Theory. Wiley, 1981.
  • [15] Y. Lu, “Secrecy energy efficiency in RIS-assisted networks,” IEEE Trans. Veh. Technol., vol. 72, no. 9, pp. 12419-12424, Sept. 2023.