KANsformer for Scalable Beamforming
Abstract
This paper proposes an unsupervised deep-learning (DL) approach, termed KANsformer, that integrates the transformer and the Kolmogorov–Arnold network (KAN) to realize scalable beamforming for mobile communication systems. Specifically, we consider the classic multi-input-single-output energy efficiency maximization problem subject to a total power budget. The proposed KANsformer first extracts hidden features via a multi-head self-attention mechanism and then reads out the desired beamforming design via the KAN. Numerical results evaluate the KANsformer in terms of generalization performance, transfer learning and ablation experiments. Overall, the KANsformer outperforms existing benchmark DL approaches, and adapts to changes in the number of mobile users with real-time and near-optimal inference.
Index Terms:
Transformer, KAN, beamforming, energy efficiency.

I Introduction
Deep learning (DL) has revolutionized a wide range of application fields and achieved unprecedented success in tasks such as image recognition and natural language processing. Its ability to automatically extract high-level features from raw data enables deep neural networks to outperform traditional machine learning methods in complex problem domains. Recently, DL-enabled designs for wireless networks have emerged as a hot research topic [1]. Some researchers have applied multi-layer perceptrons (MLP) [2], convolutional neural networks (CNN) [3] and graph neural networks (GNN) [4] to power allocation and signal processing problems in wireless networks. Overall, DL models can be trained to achieve performance close to traditional convex optimization (CVXopt)-based approaches but with a much faster inference speed. How to further improve the learning performance remains an open issue for DL-enabled wireless optimization.
Typically, task-oriented DL requires dedicated models for wireless networks. One promising way is to follow the "encoder-decoder" paradigm, where the encoder extracts features over the wireless network while the decoder maps the extracted features to the desired transmit design. Recent works have paid great attention to the design of the encoder. Particularly, the GNN shows good scalability and generalization performance by exploiting the graph topology of wireless networks [5]. In [6, 7, 8], the GNN was adopted as the encoder to develop solution approaches for energy efficiency (EE) maximization, sum-rate maximization and max-min rate maximization, respectively, in multi-user multi-input-single-output (MISO) networks, all of which were scalable to the number of users. In [9], a GNN-based model was trained via unsupervised learning to solve the outage-constrained EE maximization problem. The GNNs in [6, 7, 8, 9] all leveraged the multi-head attention mechanism, also known as the graph attention network (GAT), to enhance the feature extraction, especially for inter-user interference. Similarly, the transformer is also built upon the self-attention mechanism and is popular for its high predictive performance [10]. In [11], a transformer and a weighted A∗ based algorithm were proposed to plan the unmanned aerial vehicle trajectory for age-of-information minimization, which outperformed traditional algorithms numerically. However, the above works on wireless networks all adopted the MLP as the decoder. Recently, the Kolmogorov-Arnold network (KAN) has been proposed as a promising alternative to the MLP with superior performance in terms of accuracy and interpretability [12].
To the best of our knowledge, the integration of the transformer and the KAN has not been applied to beamforming design. In this paper, we formulate the classic EE maximization problem for MISO networks [13]. We then propose an approach integrating the transformer and the KAN, termed KANsformer, which utilizes the multi-head self-attention mechanism to extract hidden features among interference links and the KAN to map the extracted features to the desired beamforming design. The KANsformer is trained via unsupervised learning, and a scale function guarantees a feasible solution. Via parameter sharing, the KANsformer is scalable to the number of users. Numerical results indicate that the KANsformer outperforms existing DL models and approaches the accuracy of the CVXopt-based solution with millisecond-level inference time. Ablation experiments show that the major performance gain is contributed by the KAN. Besides, we validate the scalability of the KANsformer, which can be further enhanced by transfer learning at little training cost.
The rest of this paper is organized as follows. Section II gives the system model and problem formulation. Section III presents the structure of KANsformer. Section IV provides numerical results. Finally, Section V concludes the paper with future research directions.
II System Model and Problem Formulation
Consider a downlink MISO network, where one $N$-antenna transmitter intends to serve $K$ single-antenna mobile users (MUs) over a common spectral band. We use $\mathcal{K} \triangleq \{1, \ldots, K\}$ to denote the index set of the MUs.
Denote the symbol for the $k$-th MU and the corresponding beamforming vector as $s_k \in \mathbb{C}$ and $\mathbf{w}_k \in \mathbb{C}^{N}$, respectively. The received signal at the $k$-th MU is given by

$y_k = \mathbf{h}_k^H \mathbf{w}_k s_k + \sum_{j \in \mathcal{K} \setminus \{k\}} \mathbf{h}_k^H \mathbf{w}_j s_j + n_k,$  (1)

where $\mathbf{h}_k \in \mathbb{C}^{N}$ denotes the channel state information (CSI) of the $k$-th transmitter-MU link, and $n_k$ denotes the additive white Gaussian noise (AWGN) at the $k$-th MU. Without loss of generality, it is assumed that $\mathbb{E}[|s_k|^2] = 1$ and $n_k \sim \mathcal{CN}(0, \sigma^2)$ ($\forall k \in \mathcal{K}$). Then, the achievable rate at the $k$-th MU is expressed as
$R_k(\mathbf{W}) = \log_2\left(1 + \frac{|\mathbf{h}_k^H \mathbf{w}_k|^2}{\sum_{j \in \mathcal{K} \setminus \{k\}} |\mathbf{h}_k^H \mathbf{w}_j|^2 + \sigma^2}\right),$  (2)

where $\mathbf{W} \triangleq \{\mathbf{w}_k\}_{k \in \mathcal{K}}$ denotes the set of all admissible beamforming vectors. The weighted EE for the considered system is expressed as
$\mathrm{EE}(\mathbf{W}) = \frac{\sum_{k \in \mathcal{K}} \alpha_k R_k(\mathbf{W})}{\sum_{k \in \mathcal{K}} \|\mathbf{w}_k\|^2 + P_C},$  (3)

where $\alpha_k$ is a preassigned weight for the $k$-th MU and $P_C$ denotes the constant power consumption introduced by circuit modules.
Our goal is to maximize the EE of the considered network, which is mathematically formulated as an optimization problem:

$\max_{\mathbf{W}} \ \mathrm{EE}(\mathbf{W}) \quad \mathrm{s.t.} \ \sum_{k \in \mathcal{K}} \|\mathbf{w}_k\|^2 \le P_{\max},$  (4)

where $P_{\max}$ denotes the power budget of the transmitter.
The problem (4) can be efficiently solved by existing CVX techniques but admits no closed-form solution. Treated as a "black box", the CVXopt-based algorithm maps the CSI to the beamforming vectors via iterative computations. Such a mapping can also be regarded as a "function" represented by $\mathbf{W}^\star = F(\mathbf{H})$, where $\mathbf{H} \triangleq \{\mathbf{h}_k\}_{k \in \mathcal{K}}$. Following the universal approximation theorem, we intend to utilize neural networks to solve the problem (4).
III Structure of KANsformer
The main idea of the proposed KANsformer is to realize a mapping $\widehat{\mathbf{W}} = F_\Theta(\mathbf{H})$ from $\mathbf{H}$ to $\widehat{\mathbf{W}}$ such that $\mathrm{EE}(\widehat{\mathbf{W}})$ is close to $\mathrm{EE}(\mathbf{W}^\star)$. Particularly, we utilize unsupervised learning to train the KANsformer to alleviate the burden of collecting a labelled training set. Denoting $\Theta$ as the learnable parameters of the KANsformer, the loss function to update the learnable parameters is given by

$\mathcal{L}(\Theta) = -\mathbb{E}_{\mathbf{H}}\left[\mathrm{EE}\left(F_\Theta(\mathbf{H})\right)\right].$  (5)
Note that, via off-line training based on historical statistics, the KANsformer can derive the solution instantaneously at a low computational complexity, instead of relying on the complex iterative calculations of mathematical optimization approaches.
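In practice, the expectation in (5) is estimated by a sample average over a mini-batch of channel realizations. The following PyTorch sketch computes the negative weighted EE for a batch of channels and beamformers; the noise power `sigma2`, circuit power `p_c` and user weights are illustrative placeholders, not values from the paper:

```python
import torch

def ee_loss(H, W, sigma2=1.0, p_c=1.0, alpha=None):
    """Mini-batch estimate of the loss (5): the negative weighted EE.

    H: (B, K, N) complex CSI, W: (B, K, N) complex beamformers.
    """
    B, K, _ = H.shape
    if alpha is None:
        alpha = torch.ones(B, K)                       # user weights alpha_k
    # g[b, k, j] = |h_k^H w_j|^2: signal (j = k) and interference (j != k)
    g = torch.abs(torch.einsum('bkn,bjn->bkj', H.conj(), W)) ** 2
    signal = torch.diagonal(g, dim1=1, dim2=2)
    interference = g.sum(dim=2) - signal
    rate = torch.log2(1.0 + signal / (interference + sigma2))   # rate as in (2)
    power = (torch.abs(W) ** 2).sum(dim=(1, 2)) + p_c           # denominator of (3)
    ee = (alpha * rate).sum(dim=1) / power                      # weighted EE (3)
    return -ee.mean()                                           # sample average of (5)
```

Since the loss is built from differentiable tensor operations, gradients with respect to the beamformers (and hence the network parameters producing them) are obtained directly by backpropagation.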
The structure of the KANsformer is illustrated in Fig. 1, which includes four modules: pre-processing module, transformer encoder module, KAN decoder module and post-processing module. The detailed processes in each module are described as follows.
III-A Pre-Processing Module
In the pre-processing module, we divide each complex-valued CSI vector in $\mathbf{H}$ into its real part, i.e., $\mathrm{Re}(\mathbf{h}_k)$, and its imaginary part, i.e., $\mathrm{Im}(\mathbf{h}_k)$. Then, $\mathrm{Re}(\mathbf{h}_k)$ and $\mathrm{Im}(\mathbf{h}_k)$ are concatenated and input into a linear transformation to obtain the input for the transformer encoder module, denoted by $\mathbf{X}^{(0)} \in \mathbb{R}^{K \times d}$, whose $k$-th row is given by

$\mathbf{x}_k^{(0)} = \left[\mathrm{Re}(\mathbf{h}_k)^T \,\|\, \mathrm{Im}(\mathbf{h}_k)^T\right] \mathbf{W}_0,$  (6)

where $[\cdot \,\|\, \cdot]$ represents the concatenation operation, and $\mathbf{W}_0 \in \mathbb{R}^{2N \times d}$ denotes the learnable parameters with $d$ being a configurable dimension, usually greater than $2N$ such that more attention heads can be employed.
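As a concrete illustration of (6), a minimal PyTorch module (the class and argument names are ours, not from the paper) that splits the complex CSI into real and imaginary parts and applies the shared linear map $\mathbf{W}_0$:

```python
import torch
import torch.nn as nn

class PreProcess(nn.Module):
    """Pre-processing of (6): [Re(h_k) || Im(h_k)] W_0, shared across MUs."""
    def __init__(self, n_antennas, d_model):
        super().__init__()
        # W_0 in R^{2N x d}; bias disabled to match the plain linear map in (6)
        self.proj = nn.Linear(2 * n_antennas, d_model, bias=False)

    def forward(self, H):                          # H: (batch, K, N) complex
        x = torch.cat([H.real, H.imag], dim=-1)    # (batch, K, 2N) real
        return self.proj(x)                        # X^(0): (batch, K, d)
```

Because the same $\mathbf{W}_0$ is applied to every row, the module accepts any number of MUs $K$, which is the parameter-sharing property exploited later for scalability.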

III-B Transformer Encoder Module
The aim of the transformer encoder module is to encode the obtained network feature by exploring interactions among MUs and embedding the impact of inter-MU interference into the encoded network feature. The transformer encoder module comprises $L$ transformer encoder layers (TELs), each of which includes two ingredients, i.e., multi-head self-attention and position-wise feed-forward. For the $l$-th TEL, we denote its input and output¹ as $\mathbf{X}^{(l-1)}$ and $\mathbf{X}^{(l)}$, respectively. (¹The inputs and outputs of all the TELs have the same size.) The detailed processes of the two ingredients are given as follows.
III-B1 Multi-Head Self-Attention
Suppose that $M$ self-attention heads are employed in the $l$-th TEL; then, the attention coefficient matrix associated with the $m$-th self-attention head in the $l$-th TEL is given by

$\mathbf{A}_m^{(l)} = \mathrm{softmax}\left(\frac{\mathbf{X}^{(l-1)} \mathbf{W}_{Q,m}^{(l)} \left(\mathbf{X}^{(l-1)} \mathbf{W}_{K,m}^{(l)}\right)^T}{\sqrt{d/M}}\right) \mathbf{X}^{(l-1)} \mathbf{W}_{V,m}^{(l)},$  (7)

where $\mathbf{W}_{Q,m}^{(l)}$, $\mathbf{W}_{K,m}^{(l)}$ and $\mathbf{W}_{V,m}^{(l)}$ denote the learnable parameters for the query, key and value projections, respectively.
The obtained attention heads, i.e., $\mathbf{A}_1^{(l)}, \ldots, \mathbf{A}_M^{(l)}$, are concatenated and then passed into a linear layer with learnable parameters $\mathbf{W}_O^{(l)}$. Then, we obtain the multi-head attention coefficient matrix as

$\mathbf{A}^{(l)} = \left[\mathbf{A}_1^{(l)} \,\|\, \cdots \,\|\, \mathbf{A}_M^{(l)}\right] \mathbf{W}_O^{(l)}.$  (8)
To improve the training performance and allow deeper layers to be stacked, the parameter-free layer normalization process, represented by $\mathrm{LN}(\cdot)$, and the residual connection are adopted. The attention coefficient matrix is then updated by

$\widetilde{\mathbf{X}}^{(l)} = \mathrm{LN}\left(\mathbf{X}^{(l-1)} + \mathbf{A}^{(l)}\right).$  (9)
III-B2 Position-wise Feed-forward Layer
The obtained attention coefficient matrix is input into a 2-layer feed-forward network with position-wise operation, i.e.,

$\mathbf{F}^{(l)} = \sigma\left(\widetilde{\mathbf{X}}^{(l)} \mathbf{W}_1^{(l)} + \mathbf{b}_1^{(l)}\right) \mathbf{W}_2^{(l)} + \mathbf{b}_2^{(l)},$  (10)

where $\sigma(\cdot)$ denotes the activation function and $d_f$ is an intermediate dimension. The learnable parameters of the two layers are denoted as $\{\mathbf{W}_1^{(l)} \in \mathbb{R}^{d \times d_f}, \mathbf{b}_1^{(l)}\}$ and $\{\mathbf{W}_2^{(l)} \in \mathbb{R}^{d_f \times d}, \mathbf{b}_2^{(l)}\}$, respectively.
Similarly, the layer normalization process and the residual connection follow the feed-forward network, and the output of the $l$-th TEL is given by

$\mathbf{X}^{(l)} = \mathrm{LN}\left(\widetilde{\mathbf{X}}^{(l)} + \mathbf{F}^{(l)}\right).$  (11)
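Steps (7)-(11) map directly onto standard PyTorch components. A sketch of one TEL (hyperparameter names are illustrative), using `nn.MultiheadAttention` for (7)-(8) and parameter-free `LayerNorm` for (9) and (11):

```python
import torch
import torch.nn as nn

class TEL(nn.Module):
    """One transformer encoder layer implementing (7)-(11)."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(                   # position-wise FF of (10)
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        # parameter-free layer normalization, cf. (9) and (11)
        self.ln1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.ln2 = nn.LayerNorm(d_model, elementwise_affine=False)

    def forward(self, x):                          # x: (batch, K, d_model)
        a, _ = self.attn(x, x, x)                  # multi-head self-attention, (7)-(8)
        x = self.ln1(x + a)                        # residual + LN, (9)
        return self.ln2(x + self.ff(x))            # FF + residual + LN, (10)-(11)
```

Since attention operates over the MU dimension, the layer handles a varying number of MUs, which underpins the scalability discussed in Section IV.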
III-C KAN Decoder Module
The aim of the KAN decoder module is to decode the obtained network features, i.e., $\mathbf{X}^{(L)}$, to the required beamforming vectors via $D$ KAN decoder layers (KDLs). For the $i$-th KDL, we denote its input and output as $\mathbf{Y}^{(i-1)} \in \mathbb{R}^{K \times d_{i-1}}$ and $\mathbf{Y}^{(i)} \in \mathbb{R}^{K \times d_i}$, respectively, where $d_{i-1}$ and $d_i$ denote the corresponding dimensions. Note that $\mathbf{Y}^{(0)} = \mathbf{X}^{(L)}$ and $d_0 = d$ while $d_D = 2N$. The processing of the $i$-th KDL is given by

$y_{k,j}^{(i)} = \sum_{p=1}^{d_{i-1}} \phi_{j,p}^{(i)}\left(y_{k,p}^{(i-1)}\right), \ \forall k \in \mathcal{K},$  (12)

where $\phi_{j,p}^{(i)}(\cdot)$ is a continuous function which is given by

$\phi(x) = w_b\, \mathrm{silu}(x) + w_s\, \mathrm{spline}(x),$  (13)

where $w_b$ and $w_s$ are learnable parameters, and $\mathrm{spline}(\cdot)$ is parameterized as a linear combination of B-splines such that

$\mathrm{spline}(x) = \sum_{q} c_q B_q(x),$  (14)

where $c_q$ denotes the learnable weights and the number of B-spline basis functions $B_q(\cdot)$ is a hyperparameter related to the B-splines (cf. [14]).
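A minimal KDL in the spirit of (12)-(14) can be written as follows; for brevity, this sketch replaces the B-spline basis with a Gaussian radial basis on a fixed grid, so it is an illustrative stand-in rather than the exact parameterization of [12, 14]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KDL(nn.Module):
    """KAN decoder layer: phi(x) = w_b * silu(x) + w_s * spline(x), summed
    over inputs as in (12); Gaussian RBFs stand in for the B-splines of (14)."""
    def __init__(self, d_in, d_out, n_basis=8):
        super().__init__()
        self.register_buffer('grid', torch.linspace(-2.0, 2.0, n_basis))
        self.base = nn.Linear(d_in, d_out)                          # w_b branch of (13)
        self.spline = nn.Linear(d_in * n_basis, d_out, bias=False)  # weights c_q of (14)

    def forward(self, x):                                        # x: (..., d_in)
        basis = torch.exp(-(x.unsqueeze(-1) - self.grid) ** 2)   # (..., d_in, n_basis)
        return self.base(F.silu(x)) + self.spline(basis.flatten(-2))
```

The linear layers realize the sum over input dimensions in (12), so each input-output edge carries its own learnable activation, which is the key structural difference from an MLP.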
III-D Post-Processing Module
The post-processing module converts $\mathbf{Y}^{(D)}$ obtained by the KAN decoder module to a feasible solution to the problem (4).

In particular, the real-valued $\mathbf{Y}^{(D)}$ is used to recover the complex-valued beamforming vectors, with the $k$-th beamforming vector given by

$\widetilde{\mathbf{w}}_k = \left(\left[\mathbf{Y}^{(D)}\right]_{k,1:N}\right)^T + j \left(\left[\mathbf{Y}^{(D)}\right]_{k,N+1:2N}\right)^T.$  (15)

Then, each beamforming vector is fed into a scale function to satisfy the power budget of $P_{\max}$:

$\mathbf{w}_k = \frac{\sqrt{P_{\max}}\, \widetilde{\mathbf{w}}_k}{\max\left\{\sqrt{\sum_{j \in \mathcal{K}} \|\widetilde{\mathbf{w}}_j\|^2},\, \sqrt{P_{\max}}\right\}}.$  (16)
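The post-processing of (15)-(16) takes only a few lines; in this sketch, the scale function shrinks the beamformers only when the budget is violated, which is one possible realization of (16):

```python
import torch

def post_process(y, p_max):
    """Recover complex beamformers from the real decoder output, cf. (15),
    and scale them to satisfy the total power budget, cf. (16).

    y: (batch, K, 2N) real; returns (batch, K, N) complex.
    """
    n = y.shape[-1] // 2
    w = torch.complex(y[..., :n], y[..., n:])                    # (15)
    power = (w.abs() ** 2).sum(dim=(-2, -1), keepdim=True)       # sum_k ||w_k||^2
    scale = torch.clamp(torch.sqrt(p_max / power), max=1.0)      # shrink only if infeasible
    return w * scale                                             # feasible for (4)
```

Shrinking (rather than always normalizing to full power) is deliberate for EE maximization, where transmitting at the full budget is generally suboptimal.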
The learnable parameters in the KANsformer are given by

$\Theta = \left\{\mathbf{W}_0, \left\{\mathbf{W}_{Q,m}^{(l)}, \mathbf{W}_{K,m}^{(l)}, \mathbf{W}_{V,m}^{(l)}, \mathbf{W}_O^{(l)}, \mathbf{W}_1^{(l)}, \mathbf{b}_1^{(l)}, \mathbf{W}_2^{(l)}, \mathbf{b}_2^{(l)}\right\}_{l=1}^{L}, \left\{\phi_{j,p}^{(i)}\right\}_{i=1}^{D}\right\}.$  (17)

Note that the size of $\Theta$ is independent of $K$, thus facilitating the KANsformer to accept inputs with different values of $K$.
IV Numerical Results
This section provides numerical results to evaluate the proposed KANsformer in terms of generalization performance, transfer learning and ablation experiments under the following settings.
IV-1 Simulation scenario
The numbers of antennas $N$ and MUs are listed in Table I; the power budget and the circuit power (in W) are fixed across all settings; the CSI is Rayleigh distributed for both training samples and test samples, and the corresponding labels (for test samples) represent the maximal EEs obtained by the CVXopt-based algorithm. Specifically, we use $K_{\rm trn}$, $K_{\rm tst}$ and $K_{\rm ft}$ to respectively denote the number of MUs in the training stage, test stage and fine-tuning training stage (due to transfer learning), where the case $K_{\rm tst} \neq K_{\rm trn}$ (known as scalability) means that $K_{\rm tst}$ is unknown during the training stage.
IV-2 Computer configuration
All DL models are trained and tested in Python 3.10 with PyTorch 2.4.0 on a computer with an Intel(R) Xeon(R) Platinum 8255C CPU and an NVIDIA RTX 2080 Ti GPU (11 GB of memory).
IV-3 Initialization and training
The learnable parameters are initialized according to the He (Kaiming) method, and an adaptive gradient-based algorithm with an initialized learning rate is adopted as the optimizer during the training phase. The model is trained in mini-batches over a fixed number of training epochs, and the learnable weights with the best performance are used as the training results.
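The training protocol above can be sketched as follows; the toy model, the choice of Adam, the batch size and the epoch count are illustrative placeholders, not the paper's exact settings:

```python
import torch
import torch.nn as nn

# Toy stand-in model for the KANsformer (illustrative only)
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
for m in model.modules():                        # He (Kaiming) initialization
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight)
        nn.init.zeros_(m.bias)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # an adaptive optimizer

best_loss, best_state = float('inf'), None
for epoch in range(5):                           # placeholder epoch count
    batch = torch.randn(32, 8)                   # stand-in for sampled CSI
    loss = model(batch).pow(2).mean()            # stand-in for the EE loss (5)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if loss.item() < best_loss:                  # keep the best-performing weights
        best_loss = loss.item()
        best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
```

Keeping a snapshot of the best weights (rather than the final ones) matches the convention of using the best-performing parameters as the training result.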
IV-4 Benchmark DL models
In order to evaluate the KANsformer numerically, the following baselines are considered: the CVXopt-based algorithm, the MLP and the GAT (cf. Table I), as well as the encoder-decoder variants in the ablation experiment (cf. Table III).
IV-5 Test performance metrics
- Optimality performance: The ratio of the average achievable EE by the DL model to the optimal EE.
- Inference time: Average running time for yielding the feasible beamforming solution by the DL model.
Table I

| $N$ | $K_{\rm trn}$ | $K_{\rm tst}$ | CVX | MLP | GAT | KF† |
|---|---|---|---|---|---|---|
| 4 | 2 | 2 | 100% | 98.2% | 98.4% | 99.5% |
|  |  | 3 | — | — | 84.3% | 85.1% |
|  |  | Inference time | 6.7 s | 3.3 ms | 7.7 ms | 8.5 ms |
| 8 | 4 | 4 | 100% | 79.1% | 90.1% | 95.3% |
|  |  | 5 | — | — | 82.6% | 83.2% |
|  |  | Inference time | 10.5 s | 3.4 ms | 7.7 ms | 8.5 ms |
| 16 | 8 | 7 | — | — | 84.0% | 91.1% |
|  |  | 8 | 100% | 17.9% | 85.6% | 92.9% |
|  |  | 9 | — | — | 82.8% | 90.8% |
|  |  | Inference time | 57.4 s | 3.5 ms | 7.8 ms | 8.4 ms |

- †KF is short for KANsformer.
- "—" represents "not applicable".
Table II

| $K_{\rm tst}$ | Scaling† ($K_{\rm trn}=8$) | Re-training ($K_{\rm trn}=K_{\rm tst}$), 100 epochs | Transfer learning, 10 epochs | Transfer learning, 20 epochs | Transfer learning, 50 epochs |
|---|---|---|---|---|---|
| 10 | 86.6% | 93.4% | 93.0% | 93.5% | 93.5% |
| 12 | 77.9% | 90.7% | 94.6% | 95.2% | 95.2% |
| 14 | 71.1% | 95.2% | 93.6% | 94.2% | 94.2% |

- †Scaling represents directly applying the model trained with $K_{\rm trn}=8$ to the scenario of $K_{\rm tst}$.
Table III

| Encoder: GAT | Encoder: TF† | Decoder: MLP | Decoder: KAN | $K_{\rm tst}=7$ | $K_{\rm tst}=8$ | $K_{\rm tst}=9$ |
|---|---|---|---|---|---|---|
| ✓ |  | ✓ |  | 84.0% | 85.6% | 82.8% |
| ✓ |  |  | ✓ | 89.5% | 91.2% | 88.1% |
|  | ✓ | ✓ |  | 82.5% | 85.8% | 83.2% |
|  | ✓ |  | ✓ | 91.1% | 92.9% | 90.8% |
| Avg. gain of TF over GAT |  |  |  | 0.1% | 1.9% | 3.1% |
| Avg. gain of KAN over MLP |  |  |  | 14.1% | 12.7% | 12.9% |

- †TF is short for transformer.
IV-A Generalization Performance
The optimality performance and the inference time of the KANsformer are given in Table I and discussed in detail below.
IV-A1 Optimality performance with $K_{\rm tst} = K_{\rm trn}$
One can observe that the KANsformer outperforms the MLP and the GAT for all three cases; the larger $N$ and $K_{\rm trn}$, the larger the performance degradation, showing a larger negative impact on the learning performance for all the DL models. Nevertheless, the KANsformer achieves the best performance and maintains the performance loss within 7.1%.
IV-A2 Optimality performance with $K_{\rm tst} \neq K_{\rm trn}$
The KANsformer performs much better than the GAT for $N=16$, but only slightly better for $N \in \{4, 8\}$, besides some performance loss compared with the case of $K_{\rm tst} = K_{\rm trn}$. These results also indicate that the scalability performance loss is larger for a larger ratio of $K_{\rm tst}$ to $K_{\rm trn}$, because the multi-head self-attention mechanism intends to explore the interaction among MUs, which may change with the number of MUs.
IV-A3 Inference time
All of the MLP, GAT and KANsformer achieve millisecond-level inference (significantly faster than the iterative CVXopt-based approach), such that they are applicable under time-varying channel conditions. A further observation is that the inference time of the DL models remains almost unchanged for all the considered values of $N$ and $K$, while that of the CVXopt-based approach grows rapidly with the problem size (a widely known fact).
In summary, the well-trained KANsformer is able to achieve real-time and near-optimal inference for solving the problem (4) while being scalable to the number of MUs (though unknown in the training stage) with an acceptable performance.
IV-B Transfer Learning
As mentioned above, the scalability suffers from performance degradation as $K_{\rm tst}$ increases. One can re-train the model or fine-tune it via transfer learning on a new dataset (where the number of users is $K_{\rm ft} = K_{\rm tst}$). The former initializes the learnable parameters randomly, while the latter adopts the learnable parameters of the model trained for $K_{\rm trn} = 8$ as the initial values. Table II shows the performance of plain scaling, re-training (via 100 epochs) and transfer learning (via 10, 20 and 50 epochs). It can be seen that transfer learning effectively improves the performance at quite a low training cost (e.g., 10 epochs) compared with plain scaling, meanwhile achieving a performance comparable to re-training with far fewer training epochs. For $K_{\rm tst} = 14$, transfer learning falls behind re-training by 1.0%, and the reason is that the prior knowledge for $K_{\rm trn} = 8$ may mislead the transfer learning when the gap between $K_{\rm trn}$ and $K_{\rm tst}$ is large. Nevertheless, transfer learning still achieves a considerable performance gain (23.1%) compared with plain scaling.
IV-C Ablation Experiment
Table III gives the ablation experiment to validate the effectiveness of the transformer used as the encoder and the KAN used as the decoder. A performance gain can be observed by comparing the transformer with the GAT and the KAN with the MLP for both $K_{\rm tst} = K_{\rm trn}$ and $K_{\rm tst} \neq K_{\rm trn}$. The average performance gains resulting from the transformer and the KAN are given in the last two rows of Table III, with the KAN contributing the major share. The reason is that both the GAT and the transformer adopt the attention mechanism to enhance the expressive capability, while the KAN has more flexible activation processes than the MLP, such that the KAN can outperform the MLP in terms of accuracy and interpretability [12].
V Conclusion
We have presented a DL model (i.e., the KANsformer shown in Fig. 1) with the transformer and KAN used in the encoder-decoder structure, respectively, for solving the beamforming design problem (cf. (4)).
Numerical results showed that the KANsformer outperforms the existing DL models in terms of both performance accuracy and inference time. Furthermore, we would like to emphasize that, in response to the given input CSI $\mathbf{H}$, the KANsformer yields the beamforming output $\widehat{\mathbf{W}}$ with an almost fixed elapsed inference time (thus insensitive to the problem size), tremendously lower than the problem-size-dependent running time required by the CVXopt-based approach, while the performance accuracy of the former is quite close to that of the latter (treated as the optimum). These results also motivate further development of more powerful encoders and decoders dedicated to wireless communication systems.
References
- [1] Y. Lu, W. Mao, H. Du, O. A. Dobre, D. Niyato, and Z. Ding, "Semantic-aware vision-assisted integrated sensing and communication: Architecture and resource allocation," IEEE Wireless Commun., vol. 31, no. 3, pp. 302-308, Jun. 2024.
- [2] C. Hu et al., "AI-empowered RIS-assisted networks: CV-enabled RIS selection and DNN-enabled transmission," IEEE Trans. Veh. Technol., early access, 2024.
- [3] Z. Song et al., "A deep learning framework for physical-layer secure beamforming," IEEE Trans. Veh. Technol., early access, 2024.
- [4] Y. Lu, Y. Li, R. Zhang, W. Chen, B. Ai, and D. Niyato, "Graph neural networks for wireless networks: Graph representation, architecture and evaluation," IEEE Wireless Commun., early access, 2024.
- [5] Y. Shen, J. Zhang, S. H. Song, and K. B. Letaief, "Graph neural networks for wireless communications: From theory to practice," IEEE Trans. Wireless Commun., vol. 22, no. 5, pp. 3554-3569, May 2023.
- [6] Y. Li, Y. Lu, R. Zhang, B. Ai, and Z. Zhong, "Deep learning for energy efficient beamforming in MU-MISO networks: A GAT-based approach," IEEE Wireless Commun. Lett., vol. 12, no. 7, pp. 1264-1268, Jul. 2023.
- [7] Y. Li, Y. Lu, B. Ai, O. A. Dobre, Z. Ding, and D. Niyato, "GNN-based beamforming for sum-rate maximization in MU-MISO networks," IEEE Trans. Wireless Commun., vol. 23, no. 8, pp. 9251-9264, Aug. 2024.
- [8] Y. Li, Y. Lu, B. Ai, Z. Zhong, D. Niyato, and Z. Ding, "GNN-enabled max-min fair beamforming," IEEE Trans. Veh. Technol., vol. 73, no. 8, pp. 12184-12188, Aug. 2024.
- [9] C. He, Y. Li, Y. Lu, B. Ai, Z. Ding, and D. Niyato, "ICNet: GNN-enabled beamforming for MISO interference channels with statistical CSI," IEEE Trans. Veh. Technol., vol. 73, no. 8, pp. 12225-12230, Aug. 2024.
- [10] A. Vaswani et al., "Attention is all you need," in Proc. NeurIPS, 2017, pp. 5998-6008.
- [11] B. Zhu, E. Bedeer, H. H. Nguyen, R. Barton, and Z. Gao, "UAV trajectory planning for AoI-minimal data collection in UAV-aided IoT networks by transformer," IEEE Trans. Wireless Commun., vol. 22, no. 2, pp. 1343-1358, Feb. 2023.
- [12] Z. Liu, Y. Wang, S. Vaidya, et al., "KAN: Kolmogorov-Arnold networks," arXiv preprint arXiv:2404.19756, Apr. 2024.
- [13] Y. Lu, K. Xiong, P. Fan, Z. Ding, Z. Zhong, and K. B. Letaief, "Global energy efficiency in secure MISO SWIPT systems with non-linear power-splitting EH model," IEEE J. Sel. Areas Commun., vol. 37, no. 1, pp. 216-232, Jan. 2019.
- [14] L. Schumaker, Spline Functions: Basic Theory. Wiley, 1981.
- [15] Y. Lu, "Secrecy energy efficiency in RIS-assisted networks," IEEE Trans. Veh. Technol., vol. 72, no. 9, pp. 12419-12424, Sep. 2023.