KANsformer for Scalable Beamforming
Abstract
This paper proposes an unsupervised deep-learning (DL) approach, termed KANsformer, that integrates the transformer and the Kolmogorov–Arnold network (KAN) to realize scalable beamforming for mobile communication systems. Specifically, we consider the classic multi-input-single-output energy efficiency maximization problem subject to a total power budget. The proposed KANsformer first extracts hidden features via a multi-head self-attention mechanism and then reads out the desired beamforming design via the KAN. Numerical results evaluate the KANsformer in terms of generalization performance, transfer learning and ablation experiments. Overall, the KANsformer outperforms existing benchmark DL approaches, and adapts to changes in the number of mobile users with real-time and near-optimal inference.
Index Terms:
Transformer, KAN, beamforming, energy efficiency.

I Introduction
Deep learning (DL) has revolutionized a wide range of application fields and achieved unprecedented success in tasks such as image recognition and natural language processing. Its ability to automatically extract high-level features from raw data enables deep neural networks to outperform traditional machine learning methods in complex problem domains. Recently, DL-enabled designs for wireless networks have emerged as a hot research topic [1]. Some researchers have applied multi-layer perceptrons (MLP) [2], convolutional neural networks (CNN) [3] and graph neural networks (GNN) [4] to power allocation and signal processing problems in wireless networks. Overall, DL models can be trained to achieve performance close to traditional convex optimization (CVXopt)-based approaches but with a much faster inference speed. How to further improve the learning performance remains an open issue for DL-enabled wireless optimization.
Typically, task-oriented DL requires dedicated models for wireless networks. One promising way is to follow the "encoder-decoder" paradigm, where the encoder extracts features over the wireless network while the decoder maps the extracted features to the desired transmit design. Recent works have paid great attention to the design of the encoder. Particularly, the GNN shows good scalability and generalization performance by exploiting the graph topology of wireless networks [5]. In [6, 7, 8], the GNN was adopted as the encoder to develop solution approaches for energy efficiency (EE) maximization, sum-rate maximization and max-min rate maximization, respectively, in multi-user multi-input-single-output (MISO) networks, all of which were scalable to the number of users. In [9], a GNN-based model was trained via unsupervised learning to solve the outage-constrained EE maximization problem. The GNNs in [6, 7, 8, 9] all leveraged the multi-head attention mechanism, also known as the graph attention network (GAT), to enhance the feature extraction, especially for inter-user interference. Similarly, the transformer is also built upon the self-attention mechanism and is popular for its high predictive performance [10]. In [11], a transformer and a weighted A∗ based algorithm were proposed to plan the unmanned aerial vehicle trajectory for age-of-information minimization, which outperformed traditional algorithms numerically. However, the above works on wireless networks all adopted the MLP as the decoder. Recently, the Kolmogorov-Arnold network (KAN) has been proposed as a promising alternative to the MLP with superior performance in terms of accuracy and interpretability [12].
To the best of our knowledge, the integration of the transformer and the KAN has not been applied to beamforming design. In this paper, we formulate the classic EE maximization problem for MISO networks [13]. We then propose an approach integrating the transformer and the KAN, termed KANsformer, which utilizes the multi-head self-attention mechanism to extract hidden features among interference links and the KAN to map the extracted features to the desired beamforming design. The KANsformer is trained via unsupervised learning, and a scale function guarantees a feasible solution. Via parameter sharing, the KANsformer is scalable to the number of users. Numerical results indicate that the KANsformer outperforms existing DL models and approaches the accuracy of the CVXopt-based solution with millisecond-level inference time. Ablation experiments show that the major performance gain is contributed by the KAN. Besides, we validate the scalability of the KANsformer, which can be further enhanced by transfer learning at little training cost.
The rest of this paper is organized as follows. Section II gives the system model and problem formulation. Section III presents the structure of KANsformer. Section IV provides numerical results. Finally, Section V concludes the paper with future research directions.
II System Model and Problem Formulation
Consider a downlink MISO network, where one $N$-antenna transmitter intends to serve $K$ single-antenna mobile users (MUs) over a common spectral band. We use $\mathcal{K} \triangleq \{1, \ldots, K\}$ to denote the index set of the MUs.
Denote the symbol for the $k$-th MU and the corresponding beamforming vector as $s_k \in \mathbb{C}$ and $\mathbf{w}_k \in \mathbb{C}^{N}$, respectively. The received signal at the $k$-th MU is given by

$y_k = \mathbf{h}_k^H \mathbf{w}_k s_k + \sum_{j \in \mathcal{K} \setminus \{k\}} \mathbf{h}_k^H \mathbf{w}_j s_j + n_k,$  (1)

where $\mathbf{h}_k \in \mathbb{C}^{N}$ denotes the channel state information (CSI) of the $k$-th transmitter-MU link, and $n_k$ denotes the additive white Gaussian noise (AWGN) at the $k$-th MU. Without loss of generality, it is assumed that $\mathbb{E}[|s_k|^2] = 1$ and $n_k \sim \mathcal{CN}(0, \sigma^2)$ ($\forall k \in \mathcal{K}$). Then, the achievable rate at the $k$-th MU is expressed as
$R_k(\mathbf{W}) = \log_2\left(1 + \frac{|\mathbf{h}_k^H \mathbf{w}_k|^2}{\sum_{j \in \mathcal{K} \setminus \{k\}} |\mathbf{h}_k^H \mathbf{w}_j|^2 + \sigma^2}\right),$  (2)

where $\mathbf{W} \triangleq \{\mathbf{w}_k\}_{k \in \mathcal{K}}$ denotes the set of all admissible beamforming vectors. The weighted EE for the considered system is expressed as
$\mathrm{EE}(\mathbf{W}) = \frac{\sum_{k \in \mathcal{K}} \alpha_k R_k(\mathbf{W})}{\sum_{k \in \mathcal{K}} \|\mathbf{w}_k\|^2 + P_C},$  (3)

where $\alpha_k$ is a preassigned weight for the $k$-th MU and $P_C$ denotes the constant power consumption introduced by circuit modules.
Our goal is to maximize the EE of the considered network, which is mathematically formulated as an optimization problem:

$\max_{\mathbf{W}} \ \mathrm{EE}(\mathbf{W}) \quad \mathrm{s.t.} \ \sum_{k \in \mathcal{K}} \|\mathbf{w}_k\|^2 \le P_{\max},$  (4)

where $P_{\max}$ denotes the power budget of the transmitter.
The problem (4) can be efficiently solved by existing CVX techniques but admits no closed-form solution. Treated as a "black box", the CVXopt-based algorithm maps the CSI to the beamforming vectors via iterative computations. Such a mapping can also be regarded as a "function" represented by $\mathbf{W}^\star = F(\mathbf{H})$, where $\mathbf{H} \triangleq \{\mathbf{h}_k\}_{k \in \mathcal{K}}$. Following the universal approximation theorem, we intend to utilize neural networks to solve the problem (4).
III Structure of KANsformer
The main idea of the proposed KANsformer is to realize a mapping $\widehat{\mathbf{W}} = F_\Theta(\mathbf{H})$ from $\mathbf{H}$ to $\widehat{\mathbf{W}}$ such that $\mathrm{EE}(\widehat{\mathbf{W}})$ is close to $\mathrm{EE}(\mathbf{W}^\star)$. Particularly, we utilize unsupervised learning to train the KANsformer to alleviate the burden of collecting a labelled training set. Denoting $\Theta$ as the learnable parameters of the KANsformer, the loss function to update the learnable parameters is given by

$\mathcal{L}(\Theta) = -\mathbb{E}_{\mathbf{H}}\left[\mathrm{EE}\left(F_\Theta(\mathbf{H})\right)\right].$  (5)
Note that, via off-line training based on historical statistics, the KANsformer can derive the solution instantaneously at a low computational complexity, instead of relying on the complex iterative calculations of mathematical optimization approaches.
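In practice, the expectation in (5) is estimated by a sample average over a mini-batch of channel realizations. The following PyTorch sketch computes the negative weighted EE for a batch of channels and beamformers; the noise power `sigma2`, circuit power `p_c` and user weights are illustrative placeholders, not values from the paper:

```python
import torch

def ee_loss(H, W, sigma2=1.0, p_c=1.0, alpha=None):
    """Mini-batch estimate of the loss (5): the negative weighted EE.

    H: (B, K, N) complex CSI, W: (B, K, N) complex beamformers.
    """
    B, K, _ = H.shape
    if alpha is None:
        alpha = torch.ones(B, K)                       # user weights alpha_k
    # g[b, k, j] = |h_k^H w_j|^2: signal (j = k) and interference (j != k)
    g = torch.abs(torch.einsum('bkn,bjn->bkj', H.conj(), W)) ** 2
    signal = torch.diagonal(g, dim1=1, dim2=2)
    interference = g.sum(dim=2) - signal
    rate = torch.log2(1.0 + signal / (interference + sigma2))   # rate as in (2)
    power = (torch.abs(W) ** 2).sum(dim=(1, 2)) + p_c           # denominator of (3)
    ee = (alpha * rate).sum(dim=1) / power                      # weighted EE (3)
    return -ee.mean()                                           # sample average of (5)
```

Since the loss is built from differentiable tensor operations, gradients with respect to the beamformers (and hence the network parameters producing them) are obtained directly by backpropagation.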
The structure of the KANsformer is illustrated in Fig. 1, which includes four modules: pre-processing module, transformer encoder module, KAN decoder module and post-processing module. The detailed processes in each module are described as follows.
III-A Pre-Processing Module
In the pre-processing module, we divide each complex-valued CSI vector in $\mathbf{H}$ into its real part, i.e., $\mathrm{Re}(\mathbf{h}_k)$, and its imaginary part, i.e., $\mathrm{Im}(\mathbf{h}_k)$. Then, $\mathrm{Re}(\mathbf{h}_k)$ and $\mathrm{Im}(\mathbf{h}_k)$ are concatenated and input into a linear transformation to obtain the input for the transformer encoder module, denoted by $\mathbf{X}^{(0)} \in \mathbb{R}^{K \times d}$, whose $k$-th row is given by

$\mathbf{x}_k^{(0)} = \left[\mathrm{Re}(\mathbf{h}_k)^T \,\|\, \mathrm{Im}(\mathbf{h}_k)^T\right] \mathbf{W}_0,$  (6)

where $[\cdot \,\|\, \cdot]$ represents the concatenation operation, and $\mathbf{W}_0 \in \mathbb{R}^{2N \times d}$ denotes the learnable parameters with $d$ being a configurable dimension, usually greater than $2N$ such that more attention heads can be employed.
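As a concrete illustration of (6), a minimal PyTorch module (the class and argument names are ours, not from the paper) that splits the complex CSI into real and imaginary parts and applies the shared linear map $\mathbf{W}_0$:

```python
import torch
import torch.nn as nn

class PreProcess(nn.Module):
    """Pre-processing of (6): [Re(h_k) || Im(h_k)] W_0, shared across MUs."""
    def __init__(self, n_antennas, d_model):
        super().__init__()
        # W_0 in R^{2N x d}; bias disabled to match the plain linear map in (6)
        self.proj = nn.Linear(2 * n_antennas, d_model, bias=False)

    def forward(self, H):                          # H: (batch, K, N) complex
        x = torch.cat([H.real, H.imag], dim=-1)    # (batch, K, 2N) real
        return self.proj(x)                        # X^(0): (batch, K, d)
```

Because the same $\mathbf{W}_0$ is applied to every row, the module accepts any number of MUs $K$, which is the parameter-sharing property exploited later for scalability.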

III-B Transformer Encoder Module
The aim of the transformer encoder module is to encode the obtained network feature by exploring interactions among MUs and embedding the impact of inter-MU interference into the encoded network feature. The transformer encoder module comprises $L$ transformer encoder layers (TELs), each of which includes two ingredients, i.e., multi-head self-attention and position-wise feed-forward. For the $l$-th TEL, we denote its input and output¹ as $\mathbf{X}^{(l-1)}$ and $\mathbf{X}^{(l)}$, respectively. (¹The inputs and outputs of all the TELs have the same size.) The detailed processes of the two ingredients are given as follows.
III-B1 Multi-Head Self-Attention
Suppose that $M$ self-attention heads are employed in the $l$-th TEL; then, the attention coefficient matrix associated with the $m$-th self-attention head in the $l$-th TEL is given by

$\mathbf{A}_m^{(l)} = \mathrm{softmax}\left(\frac{\mathbf{X}^{(l-1)} \mathbf{W}_{Q,m}^{(l)} \left(\mathbf{X}^{(l-1)} \mathbf{W}_{K,m}^{(l)}\right)^T}{\sqrt{d/M}}\right) \mathbf{X}^{(l-1)} \mathbf{W}_{V,m}^{(l)},$  (7)

where $\mathbf{W}_{Q,m}^{(l)}$, $\mathbf{W}_{K,m}^{(l)}$ and $\mathbf{W}_{V,m}^{(l)}$ denote the learnable parameters for the query, key and value projections, respectively.
The obtained attention heads, i.e., $\mathbf{A}_1^{(l)}, \ldots, \mathbf{A}_M^{(l)}$, are concatenated and then passed into a linear layer with learnable parameters $\mathbf{W}_O^{(l)}$. Then, we obtain the multi-head attention coefficient matrix as

$\mathbf{A}^{(l)} = \left[\mathbf{A}_1^{(l)} \,\|\, \cdots \,\|\, \mathbf{A}_M^{(l)}\right] \mathbf{W}_O^{(l)}.$  (8)
To improve the training performance and allow deeper layers to be stacked, the parameter-free layer normalization process, represented by $\mathrm{LN}(\cdot)$, and the residual connection are adopted. The attention coefficient matrix is then updated by

$\widetilde{\mathbf{X}}^{(l)} = \mathrm{LN}\left(\mathbf{X}^{(l-1)} + \mathbf{A}^{(l)}\right).$  (9)
III-B2 Position-wise Feed-forward Layer
The obtained attention coefficient matrix is input into a 2-layer feed-forward network with position-wise operation, i.e.,

$\mathbf{F}^{(l)} = \sigma\left(\widetilde{\mathbf{X}}^{(l)} \mathbf{W}_1^{(l)} + \mathbf{b}_1^{(l)}\right) \mathbf{W}_2^{(l)} + \mathbf{b}_2^{(l)},$  (10)

where $\sigma(\cdot)$ denotes the activation function and $d_f$ is an intermediate dimension. The learnable parameters of the two layers are denoted as $\{\mathbf{W}_1^{(l)} \in \mathbb{R}^{d \times d_f}, \mathbf{b}_1^{(l)}\}$ and $\{\mathbf{W}_2^{(l)} \in \mathbb{R}^{d_f \times d}, \mathbf{b}_2^{(l)}\}$, respectively.
Similarly, the layer normalization process and the residual connection follow the feed-forward network, and the output of the $l$-th TEL is given by

$\mathbf{X}^{(l)} = \mathrm{LN}\left(\widetilde{\mathbf{X}}^{(l)} + \mathbf{F}^{(l)}\right).$  (11)
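Steps (7)-(11) map directly onto standard PyTorch components. A sketch of one TEL (hyperparameter names are illustrative), using `nn.MultiheadAttention` for (7)-(8) and parameter-free `LayerNorm` for (9) and (11):

```python
import torch
import torch.nn as nn

class TEL(nn.Module):
    """One transformer encoder layer implementing (7)-(11)."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(                   # position-wise FF of (10)
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        # parameter-free layer normalization, cf. (9) and (11)
        self.ln1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.ln2 = nn.LayerNorm(d_model, elementwise_affine=False)

    def forward(self, x):                          # x: (batch, K, d_model)
        a, _ = self.attn(x, x, x)                  # multi-head self-attention, (7)-(8)
        x = self.ln1(x + a)                        # residual + LN, (9)
        return self.ln2(x + self.ff(x))            # FF + residual + LN, (10)-(11)
```

Since attention operates over the MU dimension, the layer handles a varying number of MUs, which underpins the scalability discussed in Section IV.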
III-C KAN Decoder Module
The aim of the KAN decoder module is to decode the obtained network features, i.e., $\mathbf{X}^{(L)}$, to the required beamforming vectors via $D$ KAN decoder layers (KDLs). For the $i$-th KDL, we denote its input and output as $\mathbf{Y}^{(i-1)} \in \mathbb{R}^{K \times d_{i-1}}$ and $\mathbf{Y}^{(i)} \in \mathbb{R}^{K \times d_i}$, respectively, where $d_{i-1}$ and $d_i$ denote the corresponding dimensions. Note that $\mathbf{Y}^{(0)} = \mathbf{X}^{(L)}$ and $d_0 = d$ while $d_D = 2N$. The processing of the $i$-th KDL is given by

$y_{k,j}^{(i)} = \sum_{p=1}^{d_{i-1}} \phi_{j,p}^{(i)}\left(y_{k,p}^{(i-1)}\right), \ \forall k \in \mathcal{K},$  (12)

where $\phi_{j,p}^{(i)}(\cdot)$ is a continuous function which is given by

$\phi(x) = w_b\, \mathrm{silu}(x) + w_s\, \mathrm{spline}(x),$  (13)

where $w_b$ and $w_s$ are learnable parameters, and $\mathrm{spline}(\cdot)$ is parameterized as a linear combination of B-splines such that

$\mathrm{spline}(x) = \sum_{q} c_q B_q(x),$  (14)

where $c_q$ denotes the learnable weights and the number of B-spline basis functions $B_q(\cdot)$ is a hyperparameter related to the B-splines (cf. [14]).
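A minimal KDL in the spirit of (12)-(14) can be written as follows; for brevity, this sketch replaces the B-spline basis with a Gaussian radial basis on a fixed grid, so it is an illustrative stand-in rather than the exact parameterization of [12, 14]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KDL(nn.Module):
    """KAN decoder layer: phi(x) = w_b * silu(x) + w_s * spline(x), summed
    over inputs as in (12); Gaussian RBFs stand in for the B-splines of (14)."""
    def __init__(self, d_in, d_out, n_basis=8):
        super().__init__()
        self.register_buffer('grid', torch.linspace(-2.0, 2.0, n_basis))
        self.base = nn.Linear(d_in, d_out)                          # w_b branch of (13)
        self.spline = nn.Linear(d_in * n_basis, d_out, bias=False)  # weights c_q of (14)

    def forward(self, x):                                        # x: (..., d_in)
        basis = torch.exp(-(x.unsqueeze(-1) - self.grid) ** 2)   # (..., d_in, n_basis)
        return self.base(F.silu(x)) + self.spline(basis.flatten(-2))
```

The linear layers realize the sum over input dimensions in (12), so each input-output edge carries its own learnable activation, which is the key structural difference from an MLP.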
III-D Post-Processing Module
The post-processing module converts $\mathbf{Y}^{(D)}$ obtained by the KAN decoder module to a feasible solution to the problem (4).

In particular, the real-valued $\mathbf{Y}^{(D)}$ is used to recover the complex-valued beamforming vectors, with the $k$-th beamforming vector given by

$\widetilde{\mathbf{w}}_k = \left(\left[\mathbf{Y}^{(D)}\right]_{k,1:N}\right)^T + j \left(\left[\mathbf{Y}^{(D)}\right]_{k,N+1:2N}\right)^T.$  (15)

Then, each beamforming vector is fed into a scale function to satisfy the power budget of $P_{\max}$:

$\mathbf{w}_k = \frac{\sqrt{P_{\max}}\, \widetilde{\mathbf{w}}_k}{\max\left\{\sqrt{\sum_{j \in \mathcal{K}} \|\widetilde{\mathbf{w}}_j\|^2},\, \sqrt{P_{\max}}\right\}}.$  (16)
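The post-processing of (15)-(16) takes only a few lines; in this sketch, the scale function shrinks the beamformers only when the budget is violated, which is one possible realization of (16):

```python
import torch

def post_process(y, p_max):
    """Recover complex beamformers from the real decoder output, cf. (15),
    and scale them to satisfy the total power budget, cf. (16).

    y: (batch, K, 2N) real; returns (batch, K, N) complex.
    """
    n = y.shape[-1] // 2
    w = torch.complex(y[..., :n], y[..., n:])                    # (15)
    power = (w.abs() ** 2).sum(dim=(-2, -1), keepdim=True)       # sum_k ||w_k||^2
    scale = torch.clamp(torch.sqrt(p_max / power), max=1.0)      # shrink only if infeasible
    return w * scale                                             # feasible for (4)
```

Shrinking (rather than always normalizing to full power) is deliberate for EE maximization, where transmitting at the full budget is generally suboptimal.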
The learnable parameters in the KANsformer are given by

$\Theta = \left\{\mathbf{W}_0, \left\{\mathbf{W}_{Q,m}^{(l)}, \mathbf{W}_{K,m}^{(l)}, \mathbf{W}_{V,m}^{(l)}, \mathbf{W}_O^{(l)}, \mathbf{W}_1^{(l)}, \mathbf{b}_1^{(l)}, \mathbf{W}_2^{(l)}, \mathbf{b}_2^{(l)}\right\}_{l=1}^{L}, \left\{\phi_{j,p}^{(i)}\right\}_{i=1}^{D}\right\}.$  (17)

Note that the size of $\Theta$ is independent of $K$, thus facilitating the KANsformer to accept inputs with different values of $K$.
IV Numerical Results
This section provides numerical results to evaluate the proposed KANsformer in terms of generalization performance, transfer learning and ablation experiments under the following settings.
IV-1 Simulation scenario
The numbers of antennas $N$ and MUs are listed in Table I; the power budget and the circuit power (in W) are fixed across all settings; the CSI is Rayleigh distributed for both training samples and test samples, and the corresponding labels (for test samples) represent the maximal EEs obtained by the CVXopt-based algorithm. Specifically, we use $K_{\rm trn}$, $K_{\rm tst}$ and $K_{\rm ft}$ to respectively denote the number of MUs in the training stage, test stage and fine-tuning training stage (due to transfer learning), where the case $K_{\rm tst} \neq K_{\rm trn}$ (known as scalability) means that $K_{\rm tst}$ is unknown during the training stage.
IV-2 Computer configuration
All DL models are trained and tested in Python 3.10 with PyTorch 2.4.0 on a computer with an Intel(R) Xeon(R) Platinum 8255C CPU and an NVIDIA RTX 2080 Ti GPU (11 GB of memory).
IV-3 Initialization and training
The learnable parameters are initialized according to the He (Kaiming) method, and an adaptive gradient-based algorithm with an initialized learning rate is adopted as the optimizer during the training phase. The model is trained in mini-batches over a fixed number of training epochs, and the learnable weights with the best performance are used as the training results.
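The training protocol above can be sketched as follows; the toy model, the choice of Adam, the batch size and the epoch count are illustrative placeholders, not the paper's exact settings:

```python
import torch
import torch.nn as nn

# Toy stand-in model for the KANsformer (illustrative only)
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
for m in model.modules():                        # He (Kaiming) initialization
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight)
        nn.init.zeros_(m.bias)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # an adaptive optimizer

best_loss, best_state = float('inf'), None
for epoch in range(5):                           # placeholder epoch count
    batch = torch.randn(32, 8)                   # stand-in for sampled CSI
    loss = model(batch).pow(2).mean()            # stand-in for the EE loss (5)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if loss.item() < best_loss:                  # keep the best-performing weights
        best_loss = loss.item()
        best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
```

Keeping a snapshot of the best weights (rather than the final ones) matches the convention of using the best-performing parameters as the training result.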
IV-4 Benchmark DL models
In order to evaluate the KANsformer numerically, the following baselines are considered: the CVXopt-based algorithm, the MLP and the GAT (cf. Table I), as well as the encoder-decoder variants in the ablation experiment (cf. Table III).
IV-5 Test performance metrics
- Optimality performance: The ratio of the average achievable EE by the DL model to the optimal EE.
- Inference time: Average running time for yielding the feasible beamforming solution by the DL model.
Table I

| $N$ | $K_{\rm trn}$ | $K_{\rm tst}$ | CVX | MLP | GAT | KF† |
|---|---|---|---|---|---|---|
| 4 | 2 | 2 | 100% | 98.2% | 98.4% | 99.5% |
|  |  | 3 | — | — | 84.3% | 85.1% |
|  |  | Inference time | 6.7 s | 3.3 ms | 7.7 ms | 8.5 ms |
| 8 | 4 | 4 | 100% | 79.1% | 90.1% | 95.3% |
|  |  | 5 | — | — | 82.6% | 83.2% |
|  |  | Inference time | 10.5 s | 3.4 ms | 7.7 ms | 8.5 ms |
| 16 | 8 | 7 | — | — | 84.0% | 91.1% |
|  |  | 8 | 100% | 17.9% | 85.6% | 92.9% |
|  |  | 9 | — | — | 82.8% | 90.8% |
|  |  | Inference time | 57.4 s | 3.5 ms | 7.8 ms | 8.4 ms |

- †KF is short for KANsformer.
- "—" represents "not applicable".
Table II

| $K_{\rm tst}$ | Scaling† ($K_{\rm trn}=8$) | Re-training ($K_{\rm trn}=K_{\rm tst}$), 100 epochs | Transfer learning, 10 epochs | Transfer learning, 20 epochs | Transfer learning, 50 epochs |
|---|---|---|---|---|---|
| 10 | 86.6% | 93.4% | 93.0% | 93.5% | 93.5% |
| 12 | 77.9% | 90.7% | 94.6% | 95.2% | 95.2% |
| 14 | 71.1% | 95.2% | 93.6% | 94.2% | 94.2% |

- †Scaling represents directly applying the model trained with $K_{\rm trn}=8$ to the scenario of $K_{\rm tst}$.
Table III

| Encoder: GAT | Encoder: TF† | Decoder: MLP | Decoder: KAN | $K_{\rm tst}=7$ | $K_{\rm tst}=8$ | $K_{\rm tst}=9$ |
|---|---|---|---|---|---|---|
| ✓ |  | ✓ |  | 84.0% | 85.6% | 82.8% |
| ✓ |  |  | ✓ | 89.5% | 91.2% | 88.1% |
|  | ✓ | ✓ |  | 82.5% | 85.8% | 83.2% |
|  | ✓ |  | ✓ | 91.1% | 92.9% | 90.8% |
| Avg. gain of TF over GAT |  |  |  | 0.1% | 1.9% | 3.1% |
| Avg. gain of KAN over MLP |  |  |  | 14.1% | 12.7% | 12.9% |

- †TF is short for transformer.
IV-A Generalization Performance
The optimality performance and the inference time of the KANsformer are given in Table I and discussed in detail below.
IV-A1 Optimality performance with $K_{\rm tst} = K_{\rm trn}$
One can observe that the KANsformer outperforms the MLP and the GAT for all three cases; the larger $N$ and $K_{\rm trn}$, the larger the performance degradation, showing a larger negative impact on the learning performance for all the DL models. Nevertheless, the KANsformer achieves the best performance and maintains the performance loss within 7.1%.
IV-A2 Optimality performance with $K_{\rm tst} \neq K_{\rm trn}$
The KANsformer performs much better than the GAT for $N=16$, but only slightly better for $N \in \{4, 8\}$, besides some performance loss compared with the case of $K_{\rm tst} = K_{\rm trn}$. These results also indicate that the scalability performance loss is larger for a larger ratio of $K_{\rm tst}$ to $K_{\rm trn}$, because the multi-head self-attention mechanism intends to explore the interaction among MUs, which may change with the number of MUs.
IV-A3 Inference time
All of the MLP, GAT and KANsformer achieve millisecond-level inference (significantly faster than the iterative CVXopt-based approach), such that they are applicable under time-varying channel conditions. A further observation is that the inference time of the DL models remains almost unchanged for all the considered values of $N$ and $K$, while that of the CVXopt-based approach grows rapidly with the problem size (a widely known fact).
In summary, the well-trained KANsformer is able to achieve real-time and near-optimal inference for solving the problem (4) while being scalable to the number of MUs (though unknown in the training stage) with an acceptable performance.
IV-B Transfer Learning
As mentioned above, the scalability suffers from performance degradation as $K_{\rm tst}$ increases. One can re-train the model or fine-tune it via transfer learning on a new dataset (where the number of users is $K_{\rm ft} = K_{\rm tst}$). The former initializes the learnable parameters randomly, while the latter adopts the learnable parameters of the model trained for $K_{\rm trn} = 8$ as the initial values. Table II shows the performance of plain scaling, re-training (via 100 epochs) and transfer learning (via 10, 20 and 50 epochs). It can be seen that transfer learning effectively improves the performance at quite a low training cost (e.g., 10 epochs) compared with plain scaling, meanwhile achieving a performance comparable to re-training with far fewer training epochs. For $K_{\rm tst} = 14$, transfer learning falls behind re-training by 1.0%, and the reason is that the prior knowledge for $K_{\rm trn} = 8$ may mislead the transfer learning when the gap between $K_{\rm trn}$ and $K_{\rm tst}$ is large. Nevertheless, transfer learning still achieves a considerable performance gain (23.1%) compared with plain scaling.
IV-C Ablation Experiment
Table III gives the ablation experiment to validate the effectiveness of the transformer used as the encoder and the KAN used as the decoder. A performance gain can be observed by comparing the transformer with the GAT and the KAN with the MLP for both $K_{\rm tst} = K_{\rm trn}$ and $K_{\rm tst} \neq K_{\rm trn}$. The average performance gains resulting from the transformer and the KAN are given in the last two rows of Table III, with the KAN contributing the major share. The reason is that both the GAT and the transformer adopt the attention mechanism to enhance the expressive capability, while the KAN has more flexible activation processes than the MLP, such that the KAN can outperform the MLP in terms of accuracy and interpretability [12].
V Conclusion
We have presented a DL model (i.e., the KANsformer shown in Fig. 1) with the transformer and KAN used in the encoder-decoder structure, respectively, for solving the beamforming design problem (cf. (4)).
Numerical results showed that the KANsformer outperforms the existing DL models in terms of both performance accuracy and inference time. Furthermore, we would like to emphasize that, in response to the given input CSI $\mathbf{H}$, the KANsformer yields the beamforming output $\widehat{\mathbf{W}}$ with an almost fixed elapsed inference time (thus insensitive to the problem size), tremendously lower than the problem-size-dependent running time required by the CVXopt-based approach, while the performance accuracy of the former is quite close to that of the latter (treated as the optimum). These results also motivate further development of more powerful encoders and decoders dedicated to wireless communication systems.
References
- [1] Y. Lu, W. Mao, H. Du, O. A. Dobre, D. Niyato, and Z. Ding, "Semantic-aware vision-assisted integrated sensing and communication: Architecture and resource allocation," IEEE Wireless Commun., vol. 31, no. 3, pp. 302-308, Jun. 2024.
- [2] C. Hu et al., "AI-empowered RIS-assisted networks: CV-enabled RIS selection and DNN-enabled transmission," IEEE Trans. Veh. Technol., early access, 2024.
- [3] Z. Song et al., "A deep learning framework for physical-layer secure beamforming," IEEE Trans. Veh. Technol., early access, 2024.
- [4] Y. Lu, Y. Li, R. Zhang, W. Chen, B. Ai, and D. Niyato, "Graph neural networks for wireless networks: Graph representation, architecture and evaluation," IEEE Wireless Commun., early access, 2024.
- [5] Y. Shen, J. Zhang, S. H. Song, and K. B. Letaief, "Graph neural networks for wireless communications: From theory to practice," IEEE Trans. Wireless Commun., vol. 22, no. 5, pp. 3554-3569, May 2023.
- [6] Y. Li, Y. Lu, R. Zhang, B. Ai, and Z. Zhong, "Deep learning for energy efficient beamforming in MU-MISO networks: A GAT-based approach," IEEE Wireless Commun. Lett., vol. 12, no. 7, pp. 1264-1268, Jul. 2023.
- [7] Y. Li, Y. Lu, B. Ai, O. A. Dobre, Z. Ding, and D. Niyato, "GNN-based beamforming for sum-rate maximization in MU-MISO networks," IEEE Trans. Wireless Commun., vol. 23, no. 8, pp. 9251-9264, Aug. 2024.
- [8] Y. Li, Y. Lu, B. Ai, Z. Zhong, D. Niyato, and Z. Ding, "GNN-enabled max-min fair beamforming," IEEE Trans. Veh. Technol., vol. 73, no. 8, pp. 12184-12188, Aug. 2024.
- [9] C. He, Y. Li, Y. Lu, B. Ai, Z. Ding, and D. Niyato, "ICNet: GNN-enabled beamforming for MISO interference channels with statistical CSI," IEEE Trans. Veh. Technol., vol. 73, no. 8, pp. 12225-12230, Aug. 2024.
- [10] A. Vaswani et al., "Attention is all you need," in Proc. NeurIPS, 2017, pp. 5998-6008.
- [11] B. Zhu, E. Bedeer, H. H. Nguyen, R. Barton, and Z. Gao, "UAV trajectory planning for AoI-minimal data collection in UAV-aided IoT networks by transformer," IEEE Trans. Wireless Commun., vol. 22, no. 2, pp. 1343-1358, Feb. 2023.
- [12] Z. Liu, Y. Wang, S. Vaidya, et al., "KAN: Kolmogorov-Arnold networks," arXiv preprint arXiv:2404.19756, Apr. 2024.
- [13] Y. Lu, K. Xiong, P. Fan, Z. Ding, Z. Zhong, and K. B. Letaief, "Global energy efficiency in secure MISO SWIPT systems with non-linear power-splitting EH model," IEEE J. Sel. Areas Commun., vol. 37, no. 1, pp. 216-232, Jan. 2019.
- [14] L. Schumaker, Spline Functions: Basic Theory. Wiley, 1981.
- [15] Y. Lu, "Secrecy energy efficiency in RIS-assisted networks," IEEE Trans. Veh. Technol., vol. 72, no. 9, pp. 12419-12424, Sep. 2023.