Multi-User MISO with Stacked Intelligent Metasurfaces: A DRL-Based Sum-Rate Optimization Approach

Hao Liu, Jiancheng An, George C. Alexandropoulos, Derrick Wing Kwan Ng, Chau Yuen, and Lu Gan

H. Liu and L. Gan are with the School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, Sichuan 611731, China. L. Gan is also with the School of Information and Communication Engineering, the Yibin Institute of UESTC, Yibin, Sichuan 644000, China (e-mail: [email protected], [email protected]). J. An and C. Yuen are with the School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected], [email protected]). G. C. Alexandropoulos is with the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, 15784 Athens, Greece (e-mail: [email protected]). D. W. K. Ng is with the School of Electrical Engineering and Telecommunications, University of New South Wales (UNSW), Sydney, NSW 2052, Australia (e-mail: [email protected]). This paper has been presented in part at the IEEE International Conference on Communications (ICC), Denver, USA, 2024 [1].

Abstract

Stacked intelligent metasurfaces (SIMs) represent a novel signal processing paradigm that enables over-the-air processing of electromagnetic waves at the speed of light. Their multi-layer architecture offers customizable computational capabilities beyond those of conventional single-layer reconfigurable intelligent surfaces and metasurface lenses. In this paper, we deploy a SIM to improve the performance of multi-user multiple-input single-output (MISO) wireless systems in a low-complexity manner with a reduced number of transmit radio frequency chains. In particular, an optimization formulation for the joint design of the SIM phase shifts and the transmit power allocation is presented, which is efficiently tackled via a customized deep reinforcement learning (DRL) approach that systematically explores pre-designed states of the SIM-parametrized smart wireless environment. The presented performance evaluation results demonstrate the proposed method’s capability to effectively learn from the wireless environment, while consistently outperforming conventional precoding schemes under low transmit power conditions. Furthermore, hyperparameter tuning and a whitening process significantly enhance the robustness of the proposed DRL framework.

Index Terms:

Stacked intelligent metasurface (SIM), reconfigurable intelligent surface (RIS), wave-based computing, deep reinforcement learning (DRL), interference cancellation.

I Introduction

The evolution of wireless networks strives to achieve higher transmission rates and lower end-to-end network latency. With the advent of the fifth-generation (5G) mobile networks, which incorporate advanced physical-layer technologies [2, 1, 3], such as massive multiple-input multiple-output (MIMO) [4] and millimeter wave communications [5], there has been a ten-fold improvement in terms of data rates [6]. However, the emergence of the Internet-of-Everything (IoE) has imposed more stringent demands on wireless capacity [6]. Unfortunately, current technologies struggle to provide the extremely high throughput and ultra-reliable low-latency service required by massive numbers of heterogeneous devices. Furthermore, emerging applications, including autonomous vehicles and unmanned aerial vehicles, impose more rigorous connectivity requirements that exceed the capabilities of traditional communication systems [7].

In typical wireless networks, multi-user interference constitutes a significant obstacle that considerably degrades system performance [8], jeopardizing both quality-of-service (QoS) and overall system spectral efficiency. As a result, interference management in next-generation communication networks remains a crucial challenge. Indeed, the task of harnessing inter-user interference has attracted significant attention within the research community, catalyzing the advancement of various inter-user interference management techniques [9, 10, 8, 6, 11]. In particular, several key interference cancellation techniques have been developed over the past decades, including digital precoding schemes [9, 8], hybrid precoding approaches [6, 12, 13], methods based on artificial intelligence (AI) [14, 15, 11], and more recently, fully analog frameworks based on the stacked intelligent metasurface (SIM) technology [7, 16, 17, 18].

In practical multi-user systems, digital precoding schemes emerge as an efficacious tool for mitigating inter-user interference by exploiting accurate channel state information (CSI) [19]. The effectiveness of precoding technology is evidenced by its spatial selectivity to mitigate multipath interference and extend coverage range. In particular, various nonlinear precoding schemes, which leverage input data in multi-user systems, can yield high sum rate performance, such as dirty paper coding, vector perturbation [9], and symbol-level precoding [10]. However, the high computational complexity associated with nonlinear precoding technologies precludes their wide implementation in practice. Consequently, linear precoding schemes, which linearly combine individual users’ data signals for reduced-complexity precoded outputs, have received considerable research attention. Specifically, these schemes include matched filtering (MF) [20], zero-forcing (ZF) [8], minimum mean square error (MMSE) [8], and signal-to-leakage-and-noise-ratio (SLNR) [21]. Yet, it is important to note that advanced digital precoding techniques generally require high-precision digital-to-analog converters (DACs) and energy-hungry radio frequency (RF) chains. This dependence not only introduces processing delays but also escalates power consumption, posing substantial challenges for large-scale network deployment.

The aforementioned digital precoding techniques [9, 10, 20, 8, 21] addressed multi-user interference at the transceiver side. Recently, there has been increasing research and development interest in the technology of reconfigurable intelligent surfaces (RISs) [22, 23, 24, 25, 26, 27, 28], which are ultra-thin planar structures composed of numerous cost-effective and nearly passive reflective units. This technology is intended to contribute to the establishment of favorable channels for next-generation communication networks with low power consumption [26, 29, 30, 31, 32]. Specifically, RISs can programmatically alter the phase of electromagnetic (EM) waves impinging upon them, without requiring any active RF chains or any form of power amplification [24]. Given these potential benefits, numerous studies have focused on exploiting RISs to mitigate interference via passive beamforming [6, 12, 33]. For instance, by adapting the RIS phase shifts based on CSI, beams can be directed towards user equipments (UEs) while minimizing co-channel interference [6].

TABLE I: A survey of metasurface based wireless communication systems

Reference	Multi-layer	Scenario	Metasurface function	Programmable	Metric	Power allocation	Algorithm
RIS[13]	$\times$	Multi-user MISO	Beamforming	$\checkmark$	Sum rate	$\checkmark$	DRL
$\text{D}^{2}\text{NN}$ [34]	$\checkmark$	-	Image classification	$\times$	-	$\times$	-
PAIM[17]	$\checkmark$	-	Image classification	$\checkmark$	-	$\times$	-
SIM[7]	$\checkmark$	MIMO	Precoding/combining	$\checkmark$	Fitting error	$\checkmark$	GD
SIM[18]	$\checkmark$	Multi-user MISO	Wave-based beamforming	$\checkmark$	Sum rate	$\times$	AO
SIM[35]	$\checkmark$	MISO satellite	Wave-based beamforming	$\checkmark$	Sum rate	$\checkmark$	AO
SIM[36]	$\checkmark$	SISO	Wave-based beamforming	$\checkmark$	Received power	$\times$	AO
SIM[37]	$\checkmark$	MIMO	Precoding/combining	$\checkmark$	Achievable rate	$\checkmark$	GD
SIM[38]	$\checkmark$	Multi-user MISO	Wave-based beamforming	$\checkmark$	Sum rate	$\checkmark$	AO
SIM (this paper)	$\checkmark$	Multi-user MISO	Wave-based beamforming	$\checkmark$	Sum rate	$\checkmark$	DRL
Abbreviations: PAIM: Programmable artificial intelligence machine; AO: Alternating optimization; GD: Gradient descent.

Recently, AI techniques have garnered substantial interest for effective interference mitigation in multi-user systems [39, 13, 11, 14, 15, 40, 1]. In one notable study [11], Sun et al. customized a deep neural network (DNN) to mimic a weighted MMSE algorithm, achieving low complexity and efficient interference cancellation. Also, Xiao et al. in [41] proposed a deep reinforcement learning (DRL)-based power control scheme for suppressing downlink interference while conserving energy in ultra-dense small cells. Furthermore, a multi-agent dual-depth Q-network-based approach was proposed in [14] to jointly optimize the beamforming vectors and power splitting ratios in a multi-user multiple-input single-output (MISO) simultaneous wireless information and power transfer system. Meanwhile, AI can be combined with RISs to enhance interference cancellation capabilities. Specifically, Huang et al. in [13] developed a DRL-based scheme to jointly optimize RIS and beamforming at the base station (BS) in a multi-user MISO system. Moreover, Stylianopoulos et al. [15] proposed a lower requirement multi-armed bandits DRL approach for quantized RIS in a multi-user MISO system, while Alexandropoulos et al. [40] elaborated on the design and measurement requirements necessary for DRL RIS controllers, highlighting the evolving role of AI in refining communication system efficiencies.

While AI techniques offer superior resilience to interference in large-scale networks and complex environments compared to digital precoding techniques such as MMSE and ZF, it is worth noting that DNN-based interference cancellation technologies still rely on sophisticated digital signal processors. More recently, inspired by the wave-based computation potential [42], an all-optical diffractive deep neural network ( $\text{D}^{2}\text{NN}$ ) was introduced [34]. This physical DNN consists of several passive diffractive layers, enabling it to compute various complex functions and process input waves at the speed of light. This innovation substantially reduces energy consumption and remarkably shortens the processing delay compared to conventional neural networks relying on digital computing. Nevertheless, the $\text{D}^{2}\text{NN}$ is made of three-dimensional (3D) printed neurons, which are inherently constrained to solving a single specific task once fabricated. Addressing this limitation, a programmable $\text{D}^{2}\text{NN}$ was reported [17], also known as SIM, in combination with several reconfigurable metasurfaces, which can dynamically adapt the network coefficients in real-time through field-programmable gate arrays (FPGA). Indeed, SIM not only inherits the advantages of $\text{D}^{2}\text{NN}$ for network operations at the speed of light but also offers the high flexibility to be adaptively re-trained for accommodating a wide spectrum of machine learning tasks [43].

With its computational power and adaptability to the environment, SIM has been proven to be effective in wireless sensing and communication systems [44, 45, 46, 47], highlighting its great potential to enhance performance across various systems, including satellite communications [35], cell-free communications [48], and point-to-point communications [49, 7, 50, 36, 37, 38, 51]. Specifically, An et al. [18] explored the application of SIM to multi-user beamforming in an MISO system and presented an AO approach to address the non-convex optimization problem. The AO algorithm alternately tackles the power allocation and SIM phase shift design subproblems via iterative water-filling and GD approaches, respectively [18].

However, conventional AO schemes, which optimize variables alternately, tend to converge to locally optimal solutions since they decompose the joint optimization problem into several subproblems. Furthermore, AO incurs prohibitive computational complexity when optimizing the numerous meta-atoms in the SIM, which complicates adaptation to dynamic wireless environments. Moreover, the GD method requires a gradient expression derived for the specific optimization problem and thus lacks generalizability. In contrast, DRL presents a promising paradigm that is independent of the problem’s specific form. It enables wireless systems to acquire knowledge of complex, dynamic wireless environments without prior data, through continuous self-learning from network feedback [40]. Motivated by this potential, DRL is increasingly recognized as an effective approach to acquire near-optimal solutions for complex joint optimization and decision-making problems, while maintaining relatively low computational complexity.

Building on these insights, in this paper, we present a SIM-assisted multi-user downlink MISO communication system optimized by DRL. To elaborate, we contrast this work with related SIM-aided wireless communications research in Table I. Specifically, our contributions are summarized as follows:

1. We consider a SIM-assisted multi-user MISO wireless communication system. The SIM, which consists of a group of metasurfaces, enables efficient signal processing in the wave domain at light speed and is seamlessly integrated with the BS. By leveraging this advanced and efficient computing paradigm, we can completely eliminate the digital precoder and significantly reduce the required number of RF chains at the BS.

2. We formulate an optimization problem that jointly optimizes the transmission coefficients of the meta-atoms in the SIM and the power allocation strategy for the transmit antennas, with the objective of maximizing the sum rate of the multi-user MISO system. However, the problem is non-convex in nature due to the constraints imposed by the passive units of the SIM.

3. In contrast to [18], which adopts the AO algorithm to alternately optimize the SIM phase shifts and the transmit power allocation, we propose a DRL-based method that jointly optimizes wave-based beamforming and power allocation. We construct a deep deterministic policy gradient (DDPG) architecture, which is well suited to intricate optimization problems with continuous solution spaces. This framework is integrated with a residual convolutional neural network (CNN) specifically tailored to the unique SIM structure. The DRL technique requires no prior training data, thereby obviating time-consuming labeled data collection for algorithm training. Furthermore, the computational complexity of the proposed DRL method is thoroughly analyzed.

4. Numerical results demonstrate the advantages of the proposed SIM-assisted multi-user MISO wireless communication system across various practical scenarios. These findings corroborate the sum-rate improvement of the DRL-based joint optimization scheme in the wireless communication network.

The rest of the paper is organized as follows. The SIM-assisted multi-user MISO system model and optimization problem formulation are described in Section II. The proposed DRL-based framework and the parameter updating process of DDPG are presented in Section III. The simulation results and the corresponding analysis are presented in Section IV to verify the performance of the proposed algorithms. Finally, conclusions are provided in Section V.

Notations: Bold lowercase and uppercase letters denote vectors and matrices, respectively; $\mathbb{E}(\cdot)$ denotes the expectation operation; $(\cdot)^{T}$ and $(\cdot)^{H}$ represent the transpose and Hermitian transpose, respectively; $\Re(x)$ and $\Im(x)$ represent the real and imaginary parts of complex-valued $x$ , respectively; $\text{diag}(\mathbf{c})$ denotes a diagonal matrix with the elements of vector $\mathbf{c}$ on its main diagonal; $[\mathbf{X}]_{n,m}$ denotes the $n$ -th row and $m$ -th column element of $\mathbf{X}$ ; $[\mathbf{X}]_{n,:}$ and $[\mathbf{X}]_{:,m}$ denote vectors collecting the elements of the $n$ -th row and $m$ -th column of matrix $\mathbf{X}$ , respectively; $\mathbb{C}^{x\times y}$ denotes the space of $x\times y$ complex-valued matrices; $\lceil x\rceil$ means the nearest integer greater than or equal to $x$ ; $\text{mod}(x,y)$ represents the remainder after division of $x$ by $y$ ; $\mathbf{g}\sim\mathcal{CN}(\mathbf{0},\mathbf{\Sigma})$ denotes the distribution of a circularly symmetric complex Gaussian (CSCG) random vector $\mathbf{g}$ with zero mean and covariance $\mathbf{\Sigma}$ , where $\sim$ means “distributed as”; $\text{log}_{a}$ denotes the logarithmic function with base $a$ ; $|x|$ and $\lVert\mathbf{x}\rVert$ represent the magnitude of a complex number $x$ and the Euclidean norm of vector $\mathbf{x}$ , respectively; $\text{sinc}(x)=\text{sin}(\pi x)/(\pi x)$ represents the sinc function. $\mathcal{O}(\cdot)$ represents the computational complexity order.

II Modeling and Problem Formulation

This section introduces the SIM-assisted multi-user MISO system model, wherein the SIM performs wave-based precoding. Specifically, Section II-A first presents the proposed SIM design. Then, Section II-B describes the spatially correlated channel model for the multi-user MISO system. Section II-C subsequently outlines the system performance metric and Section II-D formulates the design as an optimization problem.

II-A SIM Design

Figure 1: The considered SIM-aided multi-user MISO transmission system with $L$ layers of metasurfaces.

As shown in Fig. 1, we consider a downlink MISO system comprising a BS, a SIM, and multiple UEs, in which the SIM is integrated with the BS to improve system performance. The SIM consists of multiple programmable metasurfaces, each of which is composed of a large number of meta-atoms that are linked to an intelligent controller, e.g., an FPGA [17]. By appropriately tuning the transmission coefficients of the SIM via this controller, the SIM is capable of directly manipulating the electromagnetic (EM) behaviors of the propagating waves [34]. As a result, the SIM can tailor a spatial waveform shape at the output metasurface layer. Note that traditional communication systems without a SIM require many active RF chains to generate the desired EM waveforms. In contrast, employing a SIM can significantly reduce the number of active RF chains [18] and enable low-precision DACs¹. This is achieved by directly steering the EM waves as they pass through the low-cost and energy-efficient metasurfaces.

¹ In practical communication systems, low-precision DACs can be utilized by employing modulation schemes with lower DAC requirements, such as binary phase-shift keying (BPSK). This approach accordingly reduces overall system power consumption and cost.

Let $M$ represent the number of UEs with the corresponding set given by $\mathcal{M}=\{1,2,...,M\}$ . At the BS, $M$ antennas are selected to concurrently transmit $M$ individual data streams associated with the $M$ UEs². Let $L$ denote the number of equally spaced metasurface layers of the SIM and $N$ denote the number of meta-atoms on each metasurface layer, satisfying $N\geq M$ , while the metasurface layer set is represented by $\mathcal{L}=\{1,2,...,L\}$ and the meta-atom set of each layer is represented by $\mathcal{N}=\{1,2,...,N\}$ . Besides, let $\phi_{n}^{l}=e^{j\varphi_{n}^{l}}$ denote the transmission coefficient of the $n$ -th meta-atom on the $l$ -th metasurface layer, where $\varphi_{n}^{l}$ denotes the corresponding phase shift, which is assumed to be continuously adjustable between $0$ and $2\pi$ [18, 34, 7], i.e., $\varphi_{n}^{l}\in[0,2\pi)$ , $n\in\mathcal{N}$ , $l\in\mathcal{L}$ . Then, the transmission coefficient vector and the corresponding matrix representation of the $l$ -th metasurface are denoted by $\boldsymbol{\phi}^{l}=[\phi_{1}^{l},\phi_{2}^{l},...,\phi_{N}^{l}]^{T}\in\mathbb{C}^{N\times 1}$ and $\boldsymbol{\Phi}^{l}=\text{diag}(\boldsymbol{\phi}^{l})\in\mathbb{C}^{N\times N}$ , respectively.

² In this paper, we assume that $M$ antennas have been selected. The task of simultaneously optimizing antenna selection and wave-based precoding constitutes future research work.

Without loss of generality, each metasurface layer of the SIM is modeled as a uniform planar array, with all the metasurface layers being arranged in an identical square configuration. Let $r_{e}$ denote the element spacing between two adjacent meta-atoms within the same layer. Thus, the spacing between the $n$ -th meta-atom and the $\tilde{n}$ -th meta-atom on the same layer is

r_{n,\tilde{n}}=r_{e}\sqrt{(n_{x}-\tilde{n}_{x})^{2}+(n_{y}-\tilde{n}_{y})^{2}},

(1)

where $n_{x}$ and $n_{y}$ are the $x$ -axis and $y$ -axis indices of the $n$ -th meta-atom, respectively, with $\tilde{n}_{x}$ and $\tilde{n}_{y}$ defined similarly for the $\tilde{n}$ -th meta-atom. The indices are calculated as $n_{x}=\text{mod}(n-1,\,n_{\mathrm{max}})+1$ and $n_{y}=\lceil n/n_{\mathrm{max}}\rceil$ , where $n_{\mathrm{max}}=\sqrt{N}$ represents the maximum number of meta-atoms per row.
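The index mapping and Eq. (1) can be illustrated with a short NumPy sketch. The function names and the numerical values ($N=9$, $r_{e}=0.5$) are assumptions chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def atom_indices(N):
    """Return (n_x, n_y) for atoms n = 1..N on a square sqrt(N) x sqrt(N) layer."""
    n_max = int(round(np.sqrt(N)))
    assert n_max * n_max == N, "N must be a perfect square"
    n = np.arange(1, N + 1)
    n_x = np.mod(n - 1, n_max) + 1          # n_x = mod(n - 1, n_max) + 1
    n_y = np.ceil(n / n_max).astype(int)    # n_y = ceil(n / n_max)
    return n_x, n_y

def intra_layer_distances(N, r_e):
    """N x N matrix whose (n, n~) entry is r_{n,n~} from Eq. (1)."""
    n_x, n_y = atom_indices(N)
    dx = n_x[:, None] - n_x[None, :]
    dy = n_y[:, None] - n_y[None, :]
    return r_e * np.sqrt(dx**2 + dy**2)

# Illustrative (assumed) values: a 3 x 3 layer with element spacing 0.5.
N, r_e = 9, 0.5
n_x, n_y = atom_indices(N)
r = intra_layer_distances(N, r_e)
```

The resulting matrix is symmetric with a zero diagonal, and opposite corners of the 3 x 3 layer are separated by $r_{e}\sqrt{8}$ .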

Furthermore, the total thickness of the SIM structure is denoted as $D$ , where the parallel metasurface layers are equally spaced with a distance of $d_{s}=D/(L-1)$ between adjacent layers. Thus, the spacing between the $n$ -th meta-atom on the $l$ -th layer and the $\tilde{n}$ -th meta-atom on the ( $l-1$ )-th layer is $r_{n,\tilde{n}}^{l}=\sqrt{r_{n,\tilde{n}}^{2}+d_{s}^{2}}$ . Moreover, we assume that the transmit antennas are arranged in a uniform linear array (ULA) with a half-wavelength distance between them. The array’s center aligns with that of the first metasurface. Thus, the distance between the $m$ -th transmitting antenna and the $n$ -th meta-atom on the first metasurface is defined as

r_{m,n}^{1}=\bigg(\Big[\Big(n_{y}-\frac{n_{\mathrm{max}}+1}{2}\Big)r_{e}-\Big(m-\frac{M+1}{2}\Big)\frac{\lambda}{2}\Big]^{2}+\Big(n_{x}-\frac{n_{\mathrm{max}}+1}{2}\Big)^{2}r_{e}^{2}+d_{s}^{2}\bigg)^{\frac{1}{2}},

where $\lambda$ denotes the wavelength of the carrier signal.
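As a sanity check of the antenna-to-first-layer distance, the following sketch evaluates it for all antenna and meta-atom pairs. The function name and the parameter values are hypothetical; the geometry (ULA along the $y$ -axis, centered on the first layer, at distance $d_{s}$ ) follows the expression above.

```python
import numpy as np

def antenna_to_layer1_distances(N, M, r_e, d_s, lam):
    """Return an N x M matrix whose (n, m) entry is r^1_{m,n}."""
    n_max = int(round(np.sqrt(N)))
    n = np.arange(1, N + 1)
    n_x = np.mod(n - 1, n_max) + 1
    n_y = np.ceil(n / n_max)
    m = np.arange(1, M + 1)
    atom_y = (n_y - (n_max + 1) / 2) * r_e    # meta-atom offset along y
    atom_x = (n_x - (n_max + 1) / 2) * r_e    # meta-atom offset along x
    ant_y = (m - (M + 1) / 2) * lam / 2       # half-wavelength ULA along y
    return np.sqrt((atom_y[:, None] - ant_y[None, :])**2
                   + atom_x[:, None]**2 + d_s**2)

# Illustrative (assumed) values, with lengths normalized to the wavelength.
r1 = antenna_to_layer1_distances(N=9, M=2, r_e=0.5, d_s=2.0, lam=1.0)
```

With a single antenna ( $M=1$ ), the antenna sits on the axis of the layer, so its distance to the central meta-atom reduces to $d_{s}$ .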

Furthermore, the propagation coefficient of the EM wave between adjacent metasurface layers can be obtained from the Rayleigh-Sommerfeld diffraction equation [34, 18]. According to this equation, the propagation coefficient from the $\tilde{n}$ -th meta-atom on the $(l-1)$ -th metasurface layer to the $n$ -th meta-atom on the $l$ -th metasurface layer is defined by

[\mathbf{W}^{l}]_{n,\tilde{n}}=\frac{d_{s}s_{a}}{(r_{n,\tilde{n}}^{l})^{2}}\left(\frac{1}{2\pi r_{n,\tilde{n}}^{l}}-j\frac{1}{\lambda}\right)e^{j2\pi r_{n,\tilde{n}}^{l}/\lambda},

(2)

for $l\in\mathcal{L}\setminus\{1\}$ and $n,\tilde{n}\in\mathcal{N}$ , where $s_{a}$ represents the area of each meta-atom. The propagation coefficient $[\mathbf{W}^{1}]_{n,m}$ from the $m$ -th transmitting antenna to the $n$ -th meta-atom on the first metasurface is obtained by replacing $r_{n,\tilde{n}}^{l}$ in (2) with the antenna-to-meta-atom distance $r^{1}_{m,n}$ defined above. Furthermore, we assume that non-adjacent layers do not interact with each other [43, 44, 46, 47, 48, 50, 51]. Thus, the overall EM response in the SIM can be expressed as

\mathbf{B}=\mathbf{\Phi}^{L}\mathbf{W}^{L}\cdots\mathbf{\Phi}^{2}\mathbf{W}^{2}\mathbf{\Phi}^{1}\mathbf{W}^{1}\in\mathbb{C}^{N\times M}.

(3)
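A minimal NumPy sketch of Eqs. (2) and (3) is given below. All numerical values (wavelength-normalized lengths, $L=3$ , $N=9$ , $M=2$ , meta-atom area) and the helper names are assumptions made for illustration, and the phase shifts are drawn at random rather than optimized.

```python
import numpy as np

def grid_coords(N, r_e):
    """(x, y) position of each meta-atom on a centered sqrt(N) x sqrt(N) grid."""
    n_max = int(round(np.sqrt(N)))
    n = np.arange(N)                                # zero-based atom index
    x = (n % n_max - (n_max - 1) / 2) * r_e
    y = (n // n_max - (n_max - 1) / 2) * r_e
    return x, y

def propagation_matrix(r, d_s, s_a, lam):
    """Eq. (2), evaluated elementwise on a matrix of distances r."""
    return (d_s * s_a / r**2) * (1 / (2 * np.pi * r) - 1j / lam) \
        * np.exp(2j * np.pi * r / lam)

def sim_response(phases, W):
    """Eq. (3): B = Phi^L W^L ... Phi^1 W^1, with phases of shape (L, N)."""
    B = np.exp(1j * phases[0])[:, None] * W[0]      # Phi^1 W^1, shape (N, M)
    for l in range(1, len(W)):
        B = np.exp(1j * phases[l])[:, None] * (W[l] @ B)
    return B

# Illustrative (assumed) parameters, not taken from the paper.
lam, L, N, M = 1.0, 3, 9, 2
r_e, D = lam / 2, 2 * lam
d_s, s_a = D / (L - 1), (lam / 2) ** 2

x, y = grid_coords(N, r_e)
ant_y = (np.arange(1, M + 1) - (M + 1) / 2) * lam / 2
r_layers = np.sqrt((x[:, None] - x[None, :])**2
                   + (y[:, None] - y[None, :])**2 + d_s**2)
r_first = np.sqrt((y[:, None] - ant_y[None, :])**2 + x[:, None]**2 + d_s**2)

W = [propagation_matrix(r_first, d_s, s_a, lam)] \
    + [propagation_matrix(r_layers, d_s, s_a, lam)] * (L - 1)
phases = np.random.default_rng(0).uniform(0, 2 * np.pi, size=(L, N))
B = sim_response(phases, W)
print(B.shape)  # (9, 2)
```

Multiplying by $\boldsymbol{\Phi}^{l}$ is implemented as a row-wise scaling, which avoids forming the diagonal matrices explicitly.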

II-B Spatially Correlated Channel Model

For the wireless channels shown in Fig. 1, we consider a spatially correlated channel model due to the closely spaced meta-atoms [43, 44, 46, 47, 48, 50, 51]. Thus, the channel spanning from the output metasurface to the $M$ UEs can be expressed as

\mathbf{G}=\tilde{\mathbf{G}}\mathbf{R}^{\frac{1}{2}}\in\mathbb{C}^{M\times N},

(4)

where $\mathbf{R}\in\mathbb{C}^{N\times N}$ denotes the spatial correlation matrix of the SIM, and $\tilde{\mathbf{G}}\in\mathbb{C}^{M\times N}$ represents the independent and identically distributed (i.i.d.) Rayleigh fading channel satisfying $[\tilde{\mathbf{G}}]_{m,:}\sim\mathcal{CN}(\mathbf{0},\rho_{m}^{2}\mathbf{I})$ , where $\rho_{m}^{2}$ denotes the path loss between the output metasurface of the SIM and the $m$ -th UE. In particular, the path loss of the $m$ -th UE is given by

\rho_{m}^{2}=C_{0}d_{c,m}^{-\alpha},\;m\in\mathcal{M},

(5)

where $C_{0}\in\mathbb{R}$ denotes the path loss at a reference distance of $1$ meter, $d_{c,m}$ denotes the transmission distance between the SIM and the $m$ -th UE, and $\alpha$ denotes the corresponding path loss exponent. Besides, assuming far-field wave propagation in an isotropic scattering wireless environment, the spatial correlation matrix $\mathbf{R}$ is thus defined by [50, 4]

[\mathbf{R}]_{n,\tilde{n}}=\text{sinc}(2r_{n,\tilde{n}}/\lambda),\;n\in\mathcal{N},\;\tilde{n}\in\mathcal{N}.

(6)
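The channel model in Eqs. (4) to (6) can be sampled as in the sketch below. The values of $C_{0}$ , $\alpha$ , the UE distances, and the grid size are assumed for illustration only, and the matrix square root is computed by eigendecomposition, which is one of several valid choices.

```python
import numpy as np

def correlation_matrix(r, lam):
    """Eq. (6); np.sinc(x) is sin(pi x)/(pi x), matching the sinc defined in the Notations."""
    return np.sinc(2 * r / lam)

def psd_sqrt(R):
    """Symmetric square root via eigendecomposition (tiny negative eigenvalues clipped)."""
    w, V = np.linalg.eigh(R)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def sample_channel(R_sqrt, rho2, rng):
    """Eq. (4): row m of G_tilde is CN(0, rho2[m] I), then G = G_tilde R^{1/2}."""
    M, N = len(rho2), R_sqrt.shape[0]
    G_tilde = (rng.standard_normal((M, N))
               + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    return (np.sqrt(rho2)[:, None] * G_tilde) @ R_sqrt

# Illustrative (assumed) values.
lam, N, r_e = 1.0, 9, 0.5
C0, alpha = 1e-3, 3.5
d = np.array([50.0, 80.0])                 # SIM-to-UE distances in meters

n_max = int(round(np.sqrt(N)))
n = np.arange(N)
r = r_e * np.hypot((n % n_max)[:, None] - (n % n_max)[None, :],
                   (n // n_max)[:, None] - (n // n_max)[None, :])
R = correlation_matrix(r, lam)
rho2 = C0 * d ** (-alpha)                  # Eq. (5)
G = sample_channel(psd_sqrt(R), rho2, np.random.default_rng(0))
```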

II-C System Performance Metric

We focus on the downlink data transmission and the signal received at the $m$ -th UE is given as

y_{m}=[\mathbf{G}]_{m,:}\mathbf{B}\mathbf{P}\mathbf{x}+z_{m},\,\forall m,

(7)

where $z_{m}$ denotes the additive white Gaussian noise (AWGN) with variance $\sigma^{2}_{m}$ , i.e., $z_{m}\sim\mathcal{CN}(0,\sigma^{2}_{m})$ . $\mathbf{P}=\text{diag}(\sqrt{p_{1}},\sqrt{p_{2}},...,\sqrt{p_{M}})\in\mathbb{C}^{M\times M}$ denotes the transmission power matrix, with $p_{m},\,m\in\mathcal{M}$ , representing the transmit power allocated to the $m$ -th UE. $\mathbf{x}=[x_{1},x_{2},...,x_{M}]^{T}\in\mathbb{C}^{M\times 1}$ denotes a column vector of the data streams transmitted to all the UEs, satisfying $\mathbb{E}[\lVert\mathbf{x}\rVert^{2}]=1$ . We also assume that our passive SIM does not introduce any noise to its internally propagating signals [17].

Furthermore, the received signal of (7) can be rewritten as

y_{m}=\sqrt{p_{m}}[\mathbf{G}]_{m,:}[\mathbf{B}]_{:,m}x_{m}+\sum^{M}_{k=1,k\neq m}\sqrt{p_{k}}[\mathbf{G}]_{m,:}[\mathbf{B}]_{:,k}x_{k}+z_{m}.

(8)

The second term in (8) represents the co-channel interference caused by the other $(M-1)$ users. Thus, the received signal-to-interference-plus-noise ratio (SINR) at the $m$ -th UE is defined as

\kappa_{m}=\frac{p_{m}|[\mathbf{G}]_{m,:}[\mathbf{B}]_{:,m}|^{2}}{\sum^{M}_{k=1,k\neq m}p_{k}|[\mathbf{G}]_{m,:}[\mathbf{B}]_{:,k}|^{2}+\sigma_{m}^{2}}.

(9)

In this paper, we adopt the sum rate of $M$ UEs as a system performance metric, which is defined by

C(\mathbf{P},\boldsymbol{\Phi}^{l})=\sum^{M}_{m=1}\text{log}_{2}(1+\kappa_{m}).

(10)
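Eqs. (9) and (10) translate directly into a few lines of NumPy. The sketch below uses a hand-picked toy case with $N=M=2$ so that the result can be checked by hand; the function name and all values are assumptions for illustration.

```python
import numpy as np

def sinr_and_sum_rate(G, B, p, sigma2):
    """Return the per-UE SINR of Eq. (9) and the sum rate of Eq. (10)."""
    H = G @ B                                # H[m, k] = [G]_{m,:} [B]_{:,k}
    rx = np.abs(H) ** 2 * p[None, :]         # power at UE m from stream k
    signal = np.diag(rx)
    interference = rx.sum(axis=1) - signal   # sum over k != m
    sinr = signal / (interference + sigma2)
    return sinr, np.sum(np.log2(1.0 + sinr))

# Toy (assumed) example: UE 1 hears both streams, UE 2 hears only its own.
G = np.array([[1.0, 1.0], [0.0, 1.0]], dtype=complex)
B = np.eye(2, dtype=complex)
p = np.array([1.0, 1.0])
sinr, C = sinr_and_sum_rate(G, B, p, sigma2=1.0)
```

Here UE 1 has unit signal power against unit interference plus unit noise, giving $\kappa_{1}=0.5$ , while UE 2 is interference-free with $\kappa_{2}=1$ , so the sum rate is $\log_{2}(1.5)+1$ bits/s/Hz.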

II-D Problem Formulation

Our objective is to design the SIM phase shifts $\mathbf{\Phi}^{l}$ and the transmit power allocation strategy $\mathbf{P}$ that jointly maximize the system sum rate $C(\mathbf{P},\boldsymbol{\Phi}^{l})$ under perfect CSI of all involved wireless channels³. As such, the design can be formulated as the following optimization problem:

³ The examination of the effects of imperfect CSI is left for future work.


\begin{align}
(\text{P1}):\ \max_{\{\boldsymbol{\Phi}^{l}\}_{l=1}^{L},\,\{p_{m}\}_{m=1}^{M}}\quad & C(\mathbf{P},\boldsymbol{\Phi}^{l}) \tag{11a}\\
\mathrm{s.t.}\quad & \mathbf{B}=\mathbf{\Phi}^{L}\mathbf{W}^{L}\cdots\mathbf{\Phi}^{2}\mathbf{W}^{2}\mathbf{\Phi}^{1}\mathbf{W}^{1}, \tag{11b}\\
& |\phi_{n}^{l}|=1,\;l\in\mathcal{L},\;n\in\mathcal{N}, \tag{11c}\\
& \mathbf{P}=\text{diag}(\sqrt{p_{1}},\sqrt{p_{2}},...,\sqrt{p_{M}}), \tag{11d}\\
& \sum_{m=1}^{M}p_{m}\leq P, \tag{11e}\\
& p_{m}\geq 0,\;m\in\mathcal{M}, \tag{11f}
\end{align}

where $P$ denotes the maximum transmit power budget at the BS. Problem (P1) is non-convex owing to the non-convex objective function, the unit-modulus constraints (11c), and the coupling of the phase shifts across layers in the multilayered SIM response (11b). Traditional techniques for such problems, such as the AO methods in, e.g., [52, 53, 18], often struggle to find satisfactory solutions with moderate complexity. For this reason, in this paper, we propose a novel DRL approach to determine an effective SIM configuration $\mathbf{\Phi}^{l},\,l\in\mathcal{L}$ , and power allocation solution $\mathbf{P}$ with low complexity. Given the availability of CSI, an efficient DRL approach allows the SIM to interact continually with the environment, enabling autonomous learning and self-guided exploration through feedback from the environment. DRL can thus derive an effective configuration without manually collected training data, in contrast with conventional deep learning algorithms that necessitate both offline training and online learning [54, 13].
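To make the feasible set of (P1) concrete, the sketch below maps arbitrary raw values onto constraints (11c), (11e), and (11f) and checks feasibility. This particular mapping (unit-modulus normalization, clipping negative powers, and rescaling to the budget) is only one simple option assumed here for illustration; it is not claimed to be the normalization used by the proposed framework.

```python
import numpy as np

def to_feasible(z, q, P):
    """Map raw complex values z and raw powers q to a feasible (phi, p)."""
    phi = z / np.abs(z)                  # (11c): unit modulus (assumes z != 0)
    q = np.maximum(q, 0.0)               # (11f): non-negative powers
    total = q.sum()
    p = q * min(1.0, P / total) if total > 0 else q   # (11e): budget
    return phi, p

def is_feasible(phi, p, P, tol=1e-9):
    """Check constraints (11c), (11e), and (11f) up to a tolerance."""
    return bool(np.allclose(np.abs(phi), 1.0, atol=tol)
                and np.all(p >= 0) and p.sum() <= P + tol)

# Illustrative (assumed) dimensions and budget.
rng = np.random.default_rng(0)
L, N, M, P = 2, 4, 3, 1.0
z = rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))
q = rng.standard_normal(M)
phi, p = to_feasible(z, q, P)
```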

III Proposed DRL Design

In this section, we first introduce the fundamental framework of DRL, detailing the composition of the state, action, reward, and policy, to establish the foundation for the proposed optimization method. Subsequently, the entire network architecture is presented, followed by a detailed explanation of the specific optimization process. Finally, we analyze the computational complexity of the proposed DRL-based scheme.

III-A DRL Formulation

Model-free DRL is a dynamic tool capable of solving a series of decision-making problems, enabling an agent to learn the best strategy in a real-time manner [54, 41, 14]. A general DRL system consists of two components: the agent and the environment. The agent continuously interacts with the environment, applying actions based on its current policy and observing the immediate rewards and new states from the environment. The fundamental concepts of the proposed DRL formulation for handling (P1) are as follows:

•

Action space: Let $\mathcal{A}$ denote the action space. According to the state of the environment $\mathbf{s}_{t}$ , the agent takes an action $\mathbf{a}_{t}\in\mathcal{A}$ to influence the environment according to a policy. Then, the agent receives the reward $r_{t}$ and observes the new state $\mathbf{s}_{t+1}$ feedback. For our system model, the action at each time step $t$ is constructed by the current SIM phase shifts $\boldsymbol{\phi}_{t}^{l},\,l\in\mathcal{L}$ and transmit power allocation $p_{m,t},\,m\in\mathcal{M}$ . Since DNNs require real-valued inputs and outputs, the SIM phase shifts are rearranged into real and imaginary components, i.e.,

\mathbf{a}_{t}=\big{[}\Re(\boldsymbol{\phi}^{1}_{t}),...,\Im(\boldsymbol{\phi}^{1}_{t}),...,\Im(\boldsymbol{\phi}^{L}_{t}),p_{1,t},...,p_{M,t}\big{]}^{T},

(12)

where $p_{m,t}$ is the transmit power of the $m$ -th antenna at time step $t$ . Hence, the action is represented as a vector with dimension $D_{a}=2NL+M$ .

•

State space: Let $\mathcal{S}$ denote the environment state space. The current time-slot environment state $\mathbf{s}_{t}\in\mathcal{S}$ is observed from the environment. The current state incorporates the reward $r_{t-1}$ and action $\mathbf{a}_{t-1}$ from the previous time step along with the CSI for all UEs, i.e.,

\displaystyle\mathbf{s}_{t}=\big{[}r_{t-1},\mathbf{a}_{t-1},\Re([\mathbf{G}]_{1,:}),...,\Im([\mathbf{G}]_{1,:}),...,\Im([\mathbf{G}]_{M,:})\big{]}^{T}.

(13)

Hence, the dimension of the state space is $D_{s}=2N(L+M)+M+1$ .

•

Policy: Let $\pi$ denote the policy of agent and $\pi(\mathbf{s}_{t},\mathbf{a}_{t})$ denote the probability of taking $\mathbf{a}_{t}$ based on the environment state $\mathbf{s}_{t}$ , satisfying $\sum_{\mathbf{a}_{t}\in\mathcal{A}}\pi(\mathbf{s}_{t},\mathbf{a}_{t})=1$ .
•

Reward: The reward $r_{t}$ serves as a measure to quantify the quality of the policy. The agent takes action $\mathbf{a}_{t}$ in the environment and then dynamically adjusts its policy based on the feedback reward, aiming to enhance its action performance and maximize the reward. The objective of this paper is to maximize the sum rate of $M$ UEs. As such, the reward at each time step $t$ is defined as $C(\mathbf{P},\boldsymbol{\Phi}^{l})$ , which is determined based on the current power allocation scheme $\mathbf{P}$ and phase shifts of SIM $\{\boldsymbol{\Phi}^{l}\}$ , both are defined by the output of the actor training neural network.
•

Transitions: Action $\mathbf{a}_{t}$ is sampled from $\pi(\mathbf{s}_{t},\mathbf{a}_{t})$ which determines the phase shifts of the SIM and the power allocation. A transmission then occurs within a coherence block where $\mathbf{G}$ remains constant and $C(\mathbf{P},\boldsymbol{\Phi}^{l})$ is computed. Then, the environment/system proceeds to time step $t+1$ where new channel matrices are sampled/observed in i.i.d. fashion.
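As an illustration of the action and state vectors in (12) and (13), the following minimal Python sketch packs complex SIM coefficients, powers, and CSI into real-valued vectors and checks the dimensions $D_{a}=2NL+M$ and $D_{s}=2N(L+M)+M+1$ . The function names `pack_action` and `pack_state`, the per-layer ordering (real parts followed by imaginary parts), and the placeholder numerical values are assumptions made for illustration only.

```python
import cmath
import math


def pack_action(phases, powers):
    """Flatten L layers of N complex coefficients plus M powers into a real vector."""
    action = []
    for layer in phases:
        action.extend(c.real for c in layer)  # real parts of one layer
        action.extend(c.imag for c in layer)  # imaginary parts of the same layer
    action.extend(powers)
    return action


def pack_state(prev_reward, prev_action, G):
    """Concatenate previous reward, previous action, and the CSI rows of G (M x N)."""
    state = [prev_reward] + list(prev_action)
    for row in G:
        state.extend(c.real for c in row)
        state.extend(c.imag for c in row)
    return state


N, L, M = 49, 4, 4  # values used in the simulation setup
# Placeholder unit-modulus coefficients, uniform powers, and dummy CSI.
phases = [[cmath.exp(1j * 2 * math.pi * n / N) for n in range(N)] for _ in range(L)]
powers = [1.0 / M] * M
G = [[complex(0.1 * m, -0.1 * n) for n in range(N)] for m in range(M)]

a = pack_action(phases, powers)
s = pack_state(0.0, a, G)
print(len(a), len(s))
```

For $N=49$ , $L=4$ , and $M=4$ , the action and state vectors have 396 and 789 entries, respectively.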

Generally, there are two categories of algorithms to determine the policy in DRL, i.e., value-based algorithms and policy-based algorithms [15, 40]. Specifically, the well-known deep Q-network (DQN) is a quintessential value-based algorithm that employs experience replay and target network strategies to enhance the stability and efficiency of its learning process. However, DQN is typically applicable to a discrete action space. Yet, in the SIM-aided scenario, a continuous action space spanned by SIM phase shifts and the transmit power allocation strategy is presented in (11). On the other hand, the policy gradient (PG) method is a basic policy-based algorithm that aims to maximize the long-term expected discounted reward of each episode for continuous action spaces [40]. In fact, the PG directly adjusts the likelihood of selecting particular actions based on the gradient of expected reward. However, the pure PG approach exhibits high variance, leading to unstable learning. Moreover, it demands an extensive sample set. Thus, the convergence performance and stability of PG in the wireless communication environment need to be improved [54]. To address this, we adopt a DDPG-based solution, which can handle continuous action space along with excellent stability and convergence, to solve the problem in (P1).

DDPG also exploits a Q-function to choose an action $\mathbf{a}_{t}$ under the policy $\pi$ according to the current state $\mathbf{s}_{t}$ . Specifically, the goal of DDPG is to find the best policy $\pi_{\mathrm{op}}$ that maximizes the long-term expected discounted reward. The Q-function evaluates state-action pairs via a Q-value, and the DDPG employs a separate critic as the Q-function. As shown in Fig. 2, the DDPG algorithm utilizes two neural networks with distinct roles: actor networks (depicted in orange) and critic networks (in green). The DDPG exploits the actor neural network $\pi(\mathbf{s}_{t};\theta_{\pi})$ , with $\theta_{\pi}$ representing its parameters, as the policy to generate the SIM phase shifts and transmit power allocation strategy from a continuous action space $\mathcal{A}$ . Besides, the critic neural network $Q(\mathbf{s}_{t},\mathbf{a}_{t};\theta_{q})$ , having the parameters $\theta_{q}$ , is utilized to output a Q-value, which measures the output of the actor network.

During the training phase, DRL operates without target data, which differentiates it from classic supervised training [40, 13, 11]. To address this, the proposed DDPG pairs each training network with a target network that has an identical structure but different parameters, as shown in Fig. 2. The critic target network $\tilde{Q}(\mathbf{s}_{t+1},\tilde{\mathbf{a}}_{t};\theta_{\tilde{q}})$ is utilized to update the critic training network $Q(\mathbf{s}_{t},\mathbf{a}_{t};\theta_{q})$ , while the actor target network $\tilde{\pi}(\mathbf{s}_{t+1};\theta_{\tilde{\pi}})$ updates the actor training network $\pi(\mathbf{s}_{t};\theta_{\pi})$ , where $\tilde{\mathbf{a}}_{t}$ denotes the output of the actor target network $\tilde{\pi}(\mathbf{s}_{t+1};\theta_{\tilde{\pi}})$ . While the training networks are trained as classic DNNs, the target networks are not trained directly and instead provide the labels for the training networks. At the end of each step, the critic and actor target neural networks are updated as

	$\displaystyle\theta_{\tilde{q}}\leftarrow(1-\eta_{c})\theta_{\tilde{q}}+\eta_{c}\theta_{q},$		(14)
	$\displaystyle\theta_{\tilde{\pi}}\leftarrow(1-\eta_{a})\theta_{\tilde{\pi}}+\eta_{a}\theta_{\pi},$		(15)

where $\eta_{c}\in(0,1)$ and $\eta_{a}\in(0,1)$ denote the soft update parameters of the critic and actor target neural networks, respectively. These updates on the target neural networks can stabilize and accelerate the convergence of the DRL training process [40].
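The soft updates in (14) and (15) can be sketched in a few lines of Python. The parameters are represented here as flat lists of floats, and the function name `soft_update` is a hypothetical helper rather than part of any specific library.

```python
def soft_update(target_params, train_params, eta):
    """Return (1 - eta) * target + eta * train, element-wise, as in (14) and (15)."""
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie in (0, 1)")
    return [(1.0 - eta) * t + eta * w for t, w in zip(target_params, train_params)]


# Toy parameters: the target slowly tracks the training parameters.
theta_q = [1.0, -2.0, 0.5]
theta_q_target = [0.0, 0.0, 0.0]
for _ in range(3):
    theta_q_target = soft_update(theta_q_target, theta_q, eta=0.01)
print(theta_q_target)
```

With a small value such as $\eta_{c}=0.01$ , the target parameters move only 1% of the way toward the training parameters per step, which is what keeps the labels slowly varying.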

Moreover, the critic training neural network’s update employs the gradient of a loss function. The label in the loss function consists of the reward $r_{t}$ , which is returned by the SIM-aided multi-user MISO wireless environment, and the output of the critic target neural network $\tilde{Q}(\mathbf{s}_{t+1},\tilde{\mathbf{a}}_{t};\theta_{\tilde{q}})$ , which can be expressed as

V(\mathbf{s}_{t},\mathbf{a}_{t})=r_{t}+\mu\tilde{Q}(\mathbf{s}_{t+1},\tilde{\mathbf{a}}_{t};\theta_{\tilde{q}}),

(16)

where $\mu$ denotes a discount factor. In this paper, we employ the mean square error (MSE) loss function, which is defined as

l(\theta_{q})=\left(V(\mathbf{s}_{t},\mathbf{a}_{t})-Q(\mathbf{s}_{t},\mathbf{a}_{t};\theta_{q})\right)^{2},

(17)

where the error between $V(\mathbf{s}_{t},\mathbf{a}_{t})$ and estimated value $Q(\mathbf{s}_{t},\mathbf{a}_{t};\theta_{q})$ is termed the temporal difference error [40]. Thus, the parameter update of the critic training neural network can be expressed as

\theta_{q}^{(t+1)}=\theta_{q}^{(t)}-\gamma_{c}\Delta_{\theta_{q}}l(\theta_{q}),

(18)

where $\gamma_{c}>0$ denotes the corresponding learning rate.
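To make (16)-(18) concrete, the sketch below computes the label, the MSE loss, and one gradient step for a toy critic. The linear critic $Q=\mathbf{w}^{T}\mathbf{x}$ , the feature vector, and all numerical values are assumptions chosen for illustration; the critic described in this paper is a DNN, and the label is treated as a constant when differentiating.

```python
def td_target(reward, q_target_next, mu):
    """Label V = r + mu * Q_target(s', a'), as in (16)."""
    return reward + mu * q_target_next


def mse_loss(v, q):
    """Squared temporal difference error, as in (17)."""
    return (v - q) ** 2


def critic_step(weights, features, v, gamma_c):
    """One gradient-descent step on (17) for a hypothetical linear critic Q = w . x."""
    q = sum(w * x for w, x in zip(weights, features))
    # d/dw (v - q)^2 = -2 (v - q) x, so descent adds 2 * gamma_c * (v - q) * x.
    return [w + gamma_c * 2.0 * (v - q) * x for w, x in zip(weights, features)]


weights = [0.0, 0.0]
features = [1.0, 2.0]  # stands in for the concatenated (state, action) input
v = td_target(reward=1.0, q_target_next=2.0, mu=0.99)

q_before = sum(w * x for w, x in zip(weights, features))
loss_before = mse_loss(v, q_before)

weights = critic_step(weights, features, v, gamma_c=0.05)
q_after = sum(w * x for w, x in zip(weights, features))
loss_after = mse_loss(v, q_after)
print(loss_before, loss_after)
```

The loss decreases after the step because the critic output moves toward the label $V(\mathbf{s}_{t},\mathbf{a}_{t})$ .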

The update on the actor training neural network is defined through the policy gradient theorem [40], as follows

\theta_{\pi}^{(t+1)}=\theta_{\pi}^{(t)}-\gamma_{a}\Delta_{\pi}Q(\mathbf{s}_{t},\pi(\mathbf{s}_{t};\theta_{\pi});\theta_{q})\Delta_{\theta_{\pi}}\pi(\mathbf{s}_{t};\theta_{\pi}),

(19)

where $\gamma_{a}>0$ denotes its learning rate. $\Delta_{\pi}Q(\mathbf{s}_{t},\pi(\mathbf{s}_{t};\theta_{\pi});\theta_{q})$ is the gradient of the critic training network regarding the output of the actor training neural network. $\Delta_{\theta_{\pi}}\pi(\mathbf{s}_{t};\theta_{\pi})$ represents the gradient of the actor training neural network with respect to its parameters $\theta_{\pi}$ . Notably, the update process of the actor training neural network depends heavily on the gradient of the critic training neural network according to the current policy. This interaction ensures that the action policy is updated towards a direction that maximizes the long-term expected sum rate, as estimated through the Q-values.

It is noted, however, that value-based algorithms can be trapped in local optima due to the correlation between samples and nonstationary targets [40, 54]. In addition, when considering short timescales, this behavior is particularly significant, since data generated through interactions with wireless environments tend to exhibit high temporal correlation. Therefore, the experience replay technique is utilized to reduce the negative impact of sample correlation on the agent by using a buffer window to store a portion of data. For each training phase, networks adopt a mini-batch randomly sampled from experience replay to calculate the gradient and update parameters. Moreover, in our DRL framework, we introduce noise to the action to prevent the learning process from being trapped in a locally optimal solution [40]. This whitening process is applied to the SIM phase shifts and transmit power allocation prior to utilization within the wireless network. Specifically, truncated random white noise $\mathcal{W}\in[-w_{a},w_{a}]$ , where $w_{a}$ defines the truncation value, is integrated into the action processing, which is expressed as

\mathbf{a}_{t}\leftarrow\mathbf{a}_{t}+\mathcal{W}.

(20)

Before truncation, the noise follows a Gaussian distribution with zero mean and variance $v$ , i.e., $\widetilde{\mathcal{W}}\sim\mathcal{CN}(0,v)$ , where $\widetilde{\mathcal{W}}$ denotes the untruncated counterpart of $\mathcal{W}$ . To enable smooth convergence, $v$ decreases exponentially during training as

v=v_{0}\zeta^{t/t_{\mathrm{gap}}},

(21)

where $v_{0}$ denotes the initial $v$ value while $\zeta\in(0,1)$ and $t_{\mathrm{gap}}$ are the decay rate and the gap factor for discounting the whitening process, respectively.
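A minimal sketch of the whitening process in (20) and (21) is given next. It assumes real-valued Gaussian noise (since the action vector is real) and implements the truncation by clipping to $[-w_{a},w_{a}]$ ; both choices, as well as the helper names `noise_variance` and `whiten`, are illustrative assumptions. The default values follow Table II.

```python
import random


def noise_variance(t, v0=2.0, zeta=0.95, t_gap=100):
    """Exponentially decaying variance v = v0 * zeta^(t / t_gap), as in (21)."""
    return v0 * zeta ** (t / t_gap)


def whiten(action, t, w_a=2.0, rng=random):
    """Add zero-mean Gaussian noise, clipped to [-w_a, w_a], to each entry, as in (20)."""
    std = noise_variance(t) ** 0.5
    noisy = []
    for a in action:
        w = rng.gauss(0.0, std)
        w = max(-w_a, min(w_a, w))  # clipping used as a simple truncation (assumption)
        noisy.append(a + w)
    return noisy


rng = random.Random(0)
action = [0.0] * 1000
noisy_action = whiten(action, t=0, rng=rng)
print(noise_variance(0), noise_variance(100))
```

After whitening, the action still has to be renormalized so that the unit-modulus and power constraints hold.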

III-B Architecture of Employed Networks

Fig. 3 shows the structures of the proposed networks. The actor neural network operates by inputting a state vector, directly producing the SIM phase shifts and transmit power allocation strategy. This network consists of two branches, each addressing the optimization of SIM phase shifts and power allocation, respectively. The branch assigned to process SIM phase shifts consists of a reshape operation, two blocks, an adaptive average pooling layer, a dense layer, a flattening operation, and a normalization layer. The reshape operation serves to transform the input vector $\boldsymbol{\phi}^{l}\in\mathbb{C}^{N\times 1},\,l\in\mathcal{L}$ , into the corresponding matrix in the shape of $n_{\mathrm{max}}\times n_{\mathrm{max}}$ , while the flatten operation is employed to streamline the subsequent dense layer processing. Each block is formed by the interweaving connection of two convolutional layers of a $3\times 3$ kernel size, two layer-normalization (LN) layers, and a residual connection with a convolutional layer of a $1\times 1$ kernel size. The output channels of each convolutional layer are consistent. The adaptive average pooling layer restricts the output feature map size of the block to $3\times 3$ . The power allocation branch is composed of three dense layers. Finally, the outputs from the two branches are concatenated to yield an action. The critic network starts with two input branches, each comprising a dense layer. Their outputs are summed as input to alternating dense and LN layers.

The number of neurons in each dense layer exceeds its input and output dimensions and is chosen based on the number of SIM meta-atoms and data streams. The normalization layer is exploited to restrict the output of the actor neural network such that the modulus of each meta-atom coefficient of the metasurface layers is 1, i.e., $|\phi^{l}_{n}|^{2}=1,l\in\mathcal{L},n\in\mathcal{N}$ . The LN layer is designed to overcome the influence of the data distribution change caused by the variations in the previous layer on the neural network, which is capable of accelerating the learning speed and improving the robustness of the neural network. In contrast to the batch normalization layer that normalizes data across the batch, the LN layer performs normalization individually for each sample, maintaining independence among samples. This property endows it with superior performance in handling sequential data.

On the other hand, the LeakyReLU activation function is utilized after each hidden dense layer and convolutional layer, while the Tanh function is employed before the normalization layer in the branch processing the SIM phase shifts. The Softmax function is inserted at the end of the power allocation branch for power normalization. The Adam optimizer [13] is adopted to update the parameters of both the actor and critic training neural networks.
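The two output normalizations described above can be sketched as follows. The sketch assumes that each meta-atom coefficient is produced as a (real, imaginary) pair and that the Softmax output is scaled by the total power budget; the function names, the small constant `eps`, and the example values are hypothetical.

```python
import math


def unit_modulus(re_im_pairs, eps=1e-12):
    """Scale each (re, im) pair so that the complex coefficient has modulus 1."""
    out = []
    for re, im in re_im_pairs:
        mag = max(math.hypot(re, im), eps)  # eps avoids division by zero
        out.append((re / mag, im / mag))
    return out


def softmax_power(logits, total_power):
    """Softmax over the power branch output, scaled to sum to total_power."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [total_power * e / z for e in exps]


phi = unit_modulus([(0.3, -0.4), (-0.9, 0.1), (0.05, 0.02)])
p = softmax_power([0.2, -1.0, 0.7, 0.0], total_power=0.01)  # 10 dBm = 0.01 W
print(phi[0], sum(p))
```

Because the Softmax outputs are positive and sum to one, the scaled powers are non-negative and use the whole budget by construction.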

Regarding the learning rate, if the model exhibits no improvement even after $\iota_{p}$ iteration steps and the reward ceases to increase, we multiply the learning rate by a decay factor of $\iota_{f}$ to reduce the learning rate. This approach enhances the model’s convergence stability toward a near-optimal solution.
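The learning rate schedule described above can be sketched as a small plateau detector. The class name `PlateauDecay` and the exact way of counting non-improving steps are assumptions; the text only specifies the patience $\iota_{p}$ and the decay factor $\iota_{f}$ .

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` steps without improvement."""

    def __init__(self, lr, patience=200, factor=0.8):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("-inf")
        self.stale = 0  # consecutive steps without a new best reward

    def step(self, reward):
        if reward > self.best:
            self.best = reward
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr


# Short demonstration with a small patience so the decay is visible.
sched = PlateauDecay(lr=4e-4, patience=3, factor=0.8)
for r in [1.0, 2.0, 2.0, 1.5, 1.9, 3.0]:
    sched.step(r)
print(sched.lr)
```

In this demonstration the reward fails to exceed its best value for three consecutive steps, so the learning rate is reduced once, from 0.0004 to 0.00032.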

III-C Proposed DRL-based Solution of (P1)

In our DRL approach, an agent is assigned to continuously collect the channel coefficients $\mathbf{G}$ and combine them with the action $\mathbf{a}_{t-1}$ stored in the experience replay memory to form the current state information $\mathbf{s}_{t}$ . At the very beginning of the algorithm, we need to create four neural networks, i.e., the actor training neural network $\pi(\mathbf{s}_{t};\theta_{\pi})$ , the actor target neural network $\tilde{\pi}(\mathbf{s}_{t};\theta_{\tilde{\pi}})$ , the critic training neural network $Q(\mathbf{s}_{t},\mathbf{a}_{t};\theta_{q})$ , and the critic target neural network $\tilde{Q}(\mathbf{s}_{t},\tilde{\mathbf{a}}_{t};\theta_{\tilde{q}})$ , and then initialize the parameters of $\pi(\mathbf{s}_{t};\theta_{\pi})$ and $Q(\mathbf{s}_{t},\mathbf{a}_{t};\theta_{q})$ , i.e., $\theta_{\pi}$ and $\theta_{q}$ , by randomly sampling in a continuous space. In contrast, the parameters of $\tilde{Q}(\mathbf{s}_{t},\tilde{\mathbf{a}}_{t};\theta_{\tilde{q}})$ and $\tilde{\pi}(\mathbf{s}_{t};\theta_{\tilde{\pi}})$ are given by (14) and (15), respectively. Moreover, the experience replay memory $M_{\mathrm{er}}$ with the capacity $C_{\mathrm{er}}$ is also established.

Algorithm 1 The DRL-Based SIM Optimization and Power Allocation

Input: $\mu$ , $\eta_{c}$ , $\eta_{a}$ , $\gamma_{c}$ , $\gamma_{a}$ , $C_{\mathrm{er}}$ , $N_{B}$ , $s_{a}$ , $v_{0}$ , $\zeta$ , $t_{\mathrm{gap}}$ , $E$ , $T$
Output: $\{\boldsymbol{\phi}^{1}_{\mathrm{op}},\boldsymbol{\phi}^{2}_{\mathrm{op}},...,\boldsymbol{\phi}^{L}_{\mathrm{op}}\}$ , $\mathbf{P}_{\mathrm{op}}$ , and $C_{\mathrm{op}}(\mathbf{P},\boldsymbol{\Phi}^{l})$

0: Randomly initialize the critic training network parameter $\theta_{q}$ and actor training network parameter $\theta_{\pi}$ . Initialize the critic target network parameter $\theta_{\tilde{q}}$ and actor target network parameter $\theta_{\tilde{\pi}}$ by (14) and (15), respectively. Initialize the empty experience replay memory $M_{\mathrm{er}}$
1: for episode $e=1,2,...,E$
2:   Collect the current CSI $\{\mathbf{g}_{1}^{T},...,\mathbf{g}_{M}^{T}\}$ , randomly initialize the SIM coefficient matrix $\{\boldsymbol{\phi}^{1},\boldsymbol{\phi}^{2},...,\boldsymbol{\phi}^{L}\}$ and initialize transmit power matrix as $\mathbf{P}=\sqrt{\frac{P}{M}}\mathbf{I}_{M}$ to obtain initial state $\mathbf{s}_{0}$
3:   Initialize the whitening process $\mathcal{W}$
4:   for $t=1,2,...,T$
5:     Obtain current action $\mathbf{a}_{t}=\pi(\mathbf{s}_{t};\theta_{\pi})$
6:     White $\mathbf{a}_{t}$ by (20) and then normalize it to satisfy condition (11c).
7:     Reform $\mathbf{a}_{t}$ into SIM phase shifts matrix $\{\boldsymbol{\phi}^{1},\boldsymbol{\phi}^{2},...,\boldsymbol{\phi}^{L}\}$ and transmit power $\{p_{1},...,p_{M}\}$
8:     Observe new state $\mathbf{s}_{t+1}$ and instant reward $r_{t}$ from environment given action $\mathbf{a}_{t}$
9:     Store tuple ( $\mathbf{s}_{t},\mathbf{a}_{t},r_{t},\mathbf{s}_{t+1}$ ) to experience replay memory.
10:     if the experience replay memory is full then
11:       Sample a $N_{B}$ -size mini-batch tuples ( $\mathbf{s}_{n}$ , $\mathbf{a}_{n}$ , $r_{n}$ , $\mathbf{s}_{n+1}$ ) randomly from $M_{\mathrm{er}}$
12:       Update the critic and actor training network, i.e., $Q(\mathbf{s}_{t},\mathbf{a}_{t};\theta_{q})$ and $\pi(\mathbf{s}_{t};\theta_{\pi})$ , by (18) and (19), respectively.
13:       Update the critic and actor target network, i.e., $\tilde{Q}(\mathbf{s}_{t},\tilde{\mathbf{a}}_{t};\theta_{\tilde{q}})$ and $\tilde{\pi}(\mathbf{s}_{t};\theta_{\tilde{\pi}})$ , by (14) and (15), respectively.
14:       Update the variance of whitening process by (21).
15:     end if
16:     Update the state $\mathbf{s}_{t}$
17:   end for
18: end for

The algorithm executes for $E$ episodes and iterates $T$ steps in each episode. At the start of each episode, the agent collects CSI, resets the experience buffer and whitening noise, randomly initializes the SIM phase shifts between $0$ and $2\pi$ for $\mathbf{B}$ , and allocates the transmit power equally among the antennas. During a specific episode, the agent first obtains the initial state $\mathbf{s}_{0}$ . Then, the state $\mathbf{s}_{t}$ is fed into the actor training network to output the corresponding action $\mathbf{a}_{t}$ . The agent obtains the reward $r_{t}$ by (10) and the next state $\mathbf{s}_{t+1}$ from the wireless environment. Subsequently, the tuple ( $\mathbf{s}_{t}$ , $\mathbf{a}_{t}$ , $r_{t}$ , $\mathbf{s}_{t+1}$ ) is stored in the experience replay memory $M_{\mathrm{er}}$ . When the number of stored tuples attains the capacity $C_{\mathrm{er}}$ of $M_{\mathrm{er}}$ , signifying a full replay buffer, the critic and actor training networks start to randomly sample mini-batches of size $N_{B}$ from $M_{\mathrm{er}}$ and update their parameters utilizing (18) and (19), respectively. Afterwards, the critic target network and actor target network are updated through a soft update method, as outlined in (14) and (15), respectively.

Finally, the optimal SIM phase shifts $\{\boldsymbol{\phi}^{1}_{\mathrm{op}},\boldsymbol{\phi}^{2}_{\mathrm{op}},...,\boldsymbol{\phi}^{L}_{\mathrm{op}}\}$ and the optimal transmit power allocation strategy $\mathbf{P}_{\mathrm{op}}$ are directly derived from the action corresponding to the maximized sum rate $C_{\mathrm{op}}(\mathbf{P},\boldsymbol{\Phi}^{l})$ , which is associated with the largest instantaneous reward in the current episode. The proposed design is detailed in Algorithm 1.
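The experience replay mechanism used in Algorithm 1 can be sketched as follows. The class name `ReplayMemory`, the first-in-first-out eviction of old tuples once the capacity is reached, and the toy tuples are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of (s_t, a_t, r_t, s_{t+1}) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest tuple when full (assumed policy).
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def is_full(self):
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, batch_size, rng=random):
        """Draw a mini-batch uniformly at random, without replacement."""
        return rng.sample(list(self.buffer), batch_size)


memory = ReplayMemory(capacity=5)
for t in range(7):
    memory.store([float(t)], [0.0], float(t), [float(t + 1)])

batch = []
if memory.is_full():  # training starts only once the buffer is full
    batch = memory.sample(batch_size=3, rng=random.Random(0))
print(len(memory.buffer), len(batch))
```

Sampling uniformly at random from the buffer breaks the temporal correlation between consecutive transitions before they are used for the updates in (18) and (19).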

III-D Complexity Analysis

The computational complexity analysis of our proposed DRL algorithm can be divided into the training and predicting phases. The former involves experience replay, whitening, and four neural networks, while the latter only relates to the actor network. Denote the complexity of the adaptive pooling layer, the LN layer, and the activation function as $v_{\mathrm{pool}}$ , $v_{\mathrm{ln}}$ , and $v_{\mathrm{activation}}$ , respectively.

1) Training phase: The convolutional layer complexity is encompassed within the actor network calculations. As detailed in [55], the complexity considering the output feature map size can be expressed as $c_{l-1}\times s_{c}^{2}\times c_{l}\times s_{f}$ , where $c_{l-1}$ denotes the number of channels in the previous layer, $s_{c}$ is the convolutional kernel size, $c_{l}$ refers to the output channel number, and $s_{f}$ indicates the output feature map size. Thus, the complexity of the two blocks can be expressed as $v_{\mathrm{blocks}}=2\times(L\times 3^{2}\times c\times N)+2\times(c\times 3^{2}\times c\times N)+2\times(c\times 1^{2}\times 2L\times N)+6v_{\mathrm{activation}}+4v_{\mathrm{ln}}\approx\mathcal{O}(NLc+Nc^{2}+v_{\mathrm{ln}})$ , where $c$ denotes the adopted input/output channel of the convolutional layer. The adaptive pooling layer maintains the number of channels in the output feature map of the block, while decreasing the dimensions to $3\times 3$ to reduce the computational complexity. The complexity of the normalization layer is proportional to $2NL$ , as each meta-atom needs to calculate its phase once and then apply the modulus normalization. Moreover, according to [56], the overall complexity of the actor network is characterized by

$\displaystyle c_{\mathrm{actor}}$	$\displaystyle=v_{\mathrm{blocks}}+v_{\mathrm{pool}}+3\times 3\times c\times u_{a,0}+\sum^{3}_{i=1}u_{a,i}u_{a,i+1}$	(22)
	$\displaystyle\quad\,+12v_{\mathrm{activation}}+2NL$
	$\displaystyle\approx\mathcal{O}\Big{(}NLc+Nc^{2}+cu_{a,0}+\sum^{3}_{i=1}u_{a,i}u_{a,i+1}+v_{\mathrm{ln}}\Big{)},$

where $u_{a,0}$ denotes the node count in the dense layer of the branch that handles phase shifts for the SIM, and $u_{a,i},\,i=\{1,2,3\}$ denotes the node count in the dense layers of the branch that allocates power in the actor network. Similarly, the complexity of the critic network can be expressed as

\displaystyle c_{\mathrm{critic}}=\sum^{3}_{i=0}u_{c,i}u_{c,i+1}+3v_{\mathrm{activation}}+2v_{\mathrm{ln}}\approx\mathcal{O}\Big{(}\sum^{3}_{i=0}u_{c,i}u_{c,i+1}+v_{\mathrm{ln}}\Big{)},

(23)

where $u_{c,i}$ represents the node count in the $i$ -th dense layer of critic network. $u_{c,0}$ equals the input size $|\mathbf{a}|+|\mathbf{s}|$ , where $|\mathbf{a}|$ and $|\mathbf{s}|$ are the size of the action and state, respectively.

On the other hand, the complexity of the whitening process is indeed related to the size of the action. Each SIM meta-atom’s coefficient is whitened and then renormalized to meet the conditions in (11c). Thus, the computational complexity of the whitening process is $NL+2NL$ , where $2NL$ corresponds to the cost of the renormalization operation. Therefore, the overall complexity of the training phase is

\displaystyle 2\times c_{\mathrm{critic}}+2\times c_{\mathrm{actor}}+NL+2NL=\mathcal{O}\Big{(}NLc+Nc^{2}+cu_{a,0}+\sum^{3}_{i=1}u_{a,i}u_{a,i+1}+\sum^{3}_{i=0}u_{c,i}u_{c,i+1}+v_{\mathrm{ln}}\Big{)}.

(24)

2) Predicting phase: Since the critic network and the experience replay are adopted to ensure the actor network a faster and more stable training process, the complexity of the predicting phase only needs to consider the actor network. Therefore, the complexity of the predicting phase is

\displaystyle\mathcal{O}\Big{(}NLc+Nc^{2}+cu_{a,0}+\sum^{3}_{i=1}u_{a,i}u_{a,i+1}+v_{\mathrm{ln}}\Big{)}.

(25)

Note that the computational complexity of the DRL scheme scales linearly with the number of meta-atoms and layers in the SIM during both the training and predicting phases. This linear complexity ensures high computational efficiency and scalability, making the scheme suitable for solving large-scale joint parameter optimization problems in SIM-aided wireless communications. Moreover, practical DNN implementations are highly parallelizable and run on suitable hardware (e.g., graphics processing units); therefore, the computational complexity is practically reduced by a near-linear factor.
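The linear scaling in $N$ can be checked numerically by evaluating the convolutional terms of $v_{\mathrm{blocks}}$ . In the sketch below, the channel number $c=16$ is a hypothetical value, and the activation and LN costs are omitted.

```python
def conv_cost(c_in, kernel, c_out, feature_size):
    """Per-layer cost c_{l-1} * s_c^2 * c_l * s_f."""
    return c_in * kernel ** 2 * c_out * feature_size


def block_conv_cost(N, L, c):
    """Convolutional terms of v_blocks for the two blocks of the actor network."""
    return (2 * conv_cost(L, 3, c, N)
            + 2 * conv_cost(c, 3, c, N)
            + 2 * conv_cost(c, 1, 2 * L, N))


base = block_conv_cost(N=49, L=4, c=16)      # c = 16 is an assumed channel number
doubled = block_conv_cost(N=98, L=4, c=16)
print(base, doubled)
```

Doubling the number of meta-atoms per layer doubles the count, in line with the linear dependence on $N$ stated above.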

IV Simulation Results and Discussion

This section presents the numerical evaluation of the proposed DRL-optimized SIM-assisted multi-user downlink MISO wireless communication systems.

IV-A Setup and Benchmarks

TABLE II: Hyper Parameters of DRL

Parameter	Meaning	Value
$E$	The number of episodes	50
$T$	The number of iterations	26000
$C_{\mathrm{er}}$	The capacity of experience replay	5000
$\zeta$	The decay rate of whitening process	0.95
$v_{0}$	The initial value of whitening process	2
$w_{a}$	The truncation range of whitening process	2
$t_{\mathrm{gap}}$	The gap factor for discounting whitening process	100
$\eta_{c}$	The critic target network soft update parameter	0.01
$\eta_{a}$	The actor target network soft update parameter	0.01
$\gamma_{c}$	The critic training network initial learning rate	0.0004
$\gamma_{a}$	The actor training network initial learning rate	0.0004
$\mu$	The discounting factor of reward	0.99
$N_{B}$	The size of mini-batch	32
$\iota_{p}$	The patience factor for discounting the learning rate	200
$\iota_{f}$	The learning rate decay factor	0.8

As shown in Fig. 4, we consider a SIM-assisted multi-user MISO system that operates in the downlink at a frequency of 28 GHz, corresponding to a wavelength of $\lambda=10.7$ mm. The thickness of the SIM is set to $D=5\lambda$ [7, 18], and the SIM consists of $L=4$ layers of isomorphic square metasurfaces, each comprising $N=49$ meta-atoms. The area of each meta-atom is set to $s_{a}=\lambda^{2}/4$ . In addition, we set the distance between meta-atoms to be $r_{e}=\lambda/2$ . The BS is located at the height $H_{b}=10$ m and a SIM is integrated with the BS to facilitate wave-based precoding. At the beginning of every episode, the $M$ UEs are randomly distributed in an annular region, with the center of the SIM projected onto the ground serving as the annular center. The annulus has an inner radius $R_{l,1}=100$ m and an outer radius $R_{l,2}=250$ m. Due to the random distribution of the UEs, we calculate the path losses independently using (5), where we set $C_{0}=-35$ dB and $\alpha=3.5$ . The transmission distance of the $m$ -th UE can be obtained as $d_{c,m}=(H_{b}^{2}+R_{m}^{2})^{\frac{1}{2}}$ , where $R_{m}$ represents the horizontal distance from the $m$ -th UE to the annular center. At the transmitter, we set the maximum transmit power to $P=10$ dBm and the noise power to $\sigma_{m}^{2}=-104\;\text{dBm},\;\forall m\in\mathcal{M}$ . Additionally, we adaptively adjust the number of neurons of the dense layer based on the number of SIM meta-atoms. This strategy aims to mitigate the potential degradation in system performance due to insufficient network capacity. The other relevant parameters are presented in Table II.
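The geometric quantities of this setup can be reproduced with a short sketch. It assumes that the UEs are drawn uniformly over the area of the annulus, which the text does not state explicitly, and all function names are hypothetical.

```python
import math
import random

C_LIGHT = 299_792_458.0  # speed of light in m/s


def wavelength(freq_hz):
    return C_LIGHT / freq_hz


def dbm_to_watt(dbm):
    return 10.0 ** ((dbm - 30.0) / 10.0)


def sample_ue_radius(r_in, r_out, rng=random):
    """Horizontal distance of a UE drawn uniformly over the annulus area (assumption)."""
    u = rng.random()
    return math.sqrt(u * (r_out ** 2 - r_in ** 2) + r_in ** 2)


def link_distance(h_b, r_m):
    """d_{c,m} = (H_b^2 + R_m^2)^(1/2)."""
    return math.sqrt(h_b ** 2 + r_m ** 2)


rng = random.Random(1)
lam = wavelength(28e9)
radii = [sample_ue_radius(100.0, 250.0, rng) for _ in range(4)]  # M = 4 UEs
dists = [link_distance(10.0, r) for r in radii]
print(round(lam * 1e3, 1), dbm_to_watt(10.0))
```

The computed wavelength is approximately 10.7 mm, and the maximum transmit power of 10 dBm corresponds to 0.01 W.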

To verify the performance of the proposed scheme, we consider the following four benchmark schemes for comparison:

1) DRL-UPA: Optimize the SIM phase shifts based on the proposed DRL algorithm considering the uniform transmit power allocation.

2) Random: The phase shifts of the SIM are configured randomly, while the transmit power of different antennas is allocated via the iterative water-filling algorithm, which follows the water-filling principle [57] in iteration $t$ as

p_{m,w}^{(t)}=\left(p_{w}^{(t)}-\frac{\sum_{k,k\neq m}^{M}|\mathbf{g}_{m}^{T}\mathbf{b}_{k}|^{2}p_{k,w}^{(t-1)}+\sigma^{2}}{|\mathbf{g}_{m}^{T}\mathbf{b}_{m}|^{2}}\right),\;m\in\mathcal{M},

(26)

where the power allocated to data stream $m$ in iteration $t$ is denoted as $p_{m,w}^{(t)}$ and the water level $p_{w}^{(t)}$ is determined such that $\sum_{m=1}^{M}p_{m,w}^{(t)}=P$ .

3) Codebook: The codebook size corresponds to the number of iterations in the proposed DRL scheme, and for each codeword, we utilize the Random scheme. Its performance is identified as the maximum sum rate attained across all the codeword configurations explored during training.

4) AO: Perform iterative water-filling and gradient ascent approaches to separately optimize transmit power allocation and SIM phase shifts as in [18].
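For reference, the iterative water-filling rule in (26) used by the Random, Codebook, and AO benchmarks can be sketched as follows. The sketch assumes that negative allocations are clamped to zero, that the water level is found by bisection, and that the gains $|\mathbf{g}_{m}^{T}\mathbf{b}_{k}|^{2}$ are given as a matrix; the function names and the toy gain values are hypothetical.

```python
def water_fill(inv_gains, total_power, iters=60):
    """Bisection on the water level so that sum(max(level - x, 0)) equals total_power."""
    lo, hi = min(inv_gains), max(inv_gains) + total_power
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(max(mid - x, 0.0) for x in inv_gains) > total_power:
            hi = mid
        else:
            lo = mid
    level = 0.5 * (lo + hi)
    return [max(level - x, 0.0) for x in inv_gains]


def iterative_water_filling(gain, sigma2, total_power, rounds=20):
    """gain[m][k] stands for |g_m^T b_k|^2; returns the power of each data stream."""
    M = len(gain)
    p = [total_power / M] * M  # start from the uniform allocation
    for _ in range(rounds):
        inv = []
        for m in range(M):
            # Interference is computed with the powers of the previous iteration.
            interference = sum(gain[m][k] * p[k] for k in range(M) if k != m)
            inv.append((interference + sigma2) / gain[m][m])
        p = water_fill(inv, total_power)
    return p


gain = [[1.0, 0.1, 0.05],
        [0.2, 0.8, 0.1],
        [0.05, 0.1, 0.5]]
p = iterative_water_filling(gain, sigma2=0.1, total_power=1.0)
print(p, sum(p))
```

Each round recomputes the interference-plus-noise term of every stream and then redistributes the total power $P$ according to the resulting water level.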

All results are obtained by averaging over 100 independent random channel realizations.

IV-B Performance Evaluation of the Proposed Algorithm

In Fig. 5, we depict the average sum rate versus the transmit power $P$ at the BS, by considering $M=4$ , $L=4$ , and $N=49$ . For comparison, we also illustrate the ZF and MMSE fully digital precoding schemes [8] adopting water-filling transmit power allocation without a SIM. As observed, the joint optimization of SIM phase shifts and transmit power allocation via DRL effectively harnesses inter-user interference, yielding superior sum rate performance, of about 2 bps/Hz, compared to the considered AO algorithm for transmit power larger than $P=0$ dBm. At low transmit power levels, ZF and MMSE underperform all SIM-assisted precoding methods with the optimized phase shifts; this happens due to the large SIM aperture. This observation validates the superiority of SIM’s multi-layer structure over traditional precoding schemes in multi-user MISO systems. However, the sum rate of the digital precoding scheme grows faster than SIM with the increasing transmit power, surpassing all other schemes at $P\geq 24$ dBm. In fact, the high transmit power induces an extensive dynamic range in the sum rate, resulting in more significant oscillations and poorer convergence for DRL and AO approaches, at the expense of higher implementation cost for the former.

Fig. 6 illustrates the sum rate versus the number $L$ of SIM layers for two transmit power levels. As shown, the proposed DRL scheme consistently outperforms the considered SIM-aided benchmark schemes in both considered setups. In particular, the average sum rate initially increases and then becomes saturated when the number of SIM layers increases. The initial improvement trend stems from the enhanced interference mitigation precoding capability offered by the multi-layer structure of SIM. Then, the gains diminish and eventually become saturated, since optimizing the numerous parameters becomes intractable.

We show in Fig. 7 the average sum rate versus the transmit power $P$ considering three system settings, i.e., $N=81$ , $N=49$ , and $N=25$ . The proposed DRL algorithm demonstrates superior performance compared to the other three benchmark schemes. At low transmit power levels, the proposed DRL scheme achieves substantially higher sum rates compared to Codebook and AO. Specifically, the proposed scheme with $N=25$ achieves a sum rate performance comparable to the AO algorithm with $N=81$ , since the AO algorithm tends to become trapped in locally suboptimal solutions [18]. In particular, DRL, benefiting from the DNN’s powerful computational capability, achieves a four-fold increase in average sum rate over Codebook at $P=0$ dBm. However, the elevated transmit power induces an extensive dynamic range at the neural network output of DRL, which directly degrades the performance of the scheme, diminishing the gap to the Codebook scheme. Moreover, setting a large codebook size enables the Codebook scheme to outperform the AO at low $N$ .

In Fig. 8, we show the sum rate versus the number $N$ of meta-atoms per layer considering three cases for the number $L$ of SIM layers, while the other hyperparameters are identical to those described in Section IV-A. It can be seen that the sum rate increases with the number of meta-atoms per layer, showing diminishing returns beyond a threshold, e.g., $N=80$ . In this regime, the spatial gain attained through the SIM approaches saturation. Specifically, the AO algorithm gradually approaches the performance of the DRL scheme in terms of sum rate, at the cost of an increased computational complexity, which is proportional to the number of meta-atoms. Additionally, both the Random and Codebook schemes, which rely on random SIM phase shifts, attain only mild performance gains with the increase of $L$ , as they are inefficient in utilizing the degrees of freedom offered by increasing $N$ .

IV-C Impact of Hyper-Parameters of DRL

To better demonstrate the optimization capability of the proposed DRL algorithm, Fig. 9a shows its convergence process for three transmit power levels. It can be seen that the transmit power levels significantly affect the final convergence result. Specifically, in our proposed scheme, the improvements in average reward are considerable as the transmit power increases. This is because the proposed DRL can jointly adapt both the SIM phase shifts and transmit power allocation strategy, thus improving the system sum rate performance. However, DRL-UPA, which only optimizes the SIM phase shifts, has a relatively small gain from the increase of transmit power. Furthermore, we compare the average reward versus training steps in Fig. 9b under different numbers of data streams, i.e., $M=\{2,3,4\}$ , while other parameters remain the same as those described in Section IV-A. It is observed from Fig. 9b that Codebook and Random are unable to efficiently optimize such a massive number of parameters as the number of data streams increases. Thus, adjusting the number of data streams has less impact on Codebook and Random in terms of the convergence speed and resultant performance.

Fig. 10a demonstrates the effect of different learning rates, considering $\gamma_{a}=\gamma_{c}=\{0.001,0.0001,0.00001\}$ , with the number of data streams set to $M=4$ . It is shown that the learning rate dictates the system’s achievable performance and ability to converge. Specifically, when adopting a learning rate of 0.0001, DRL converges to a relatively satisfactory reward level, and further decreasing the learning rate over the training process yields a trend of continued reward improvement. This is because, after sufficient training, a relatively lower learning rate can effectively constrain the parameter space and prevent the model from learning noise with a higher probability. These benefits allow the model to fine-tune the parameters to find a better solution near the local minimum. However, with a lower learning rate, such as 0.00001, the system is more likely to become trapped in a local optimum because of the lack of exploration competence. Therefore, it is crucial to determine an appropriate learning rate to ensure that the system achieves optimal performance. Moreover, the Codebook scheme converges faster but performs poorly, whereas the DRL approach achieves a superior sum rate by relying on a well-configured SIM.

Next, we show in Fig. 10b the effect of different batch sizes considering $N_{B}=\{8,16,32,64,128\}$ , while the transmit power is $P=10$ dBm and the number of the data streams is $M=4$ . During the DRL training process, the batch size does not significantly influence the resultant sum rate performance. Instead, it slightly affects the convergence speed and efficacy of the DRL algorithm. For example, when employing a batch size of 32, the DRL algorithm rapidly converges to the maximum. A larger batch size reduces the gradient’s variance while improving the gradient’s stability. In contrast, when the batch sizes are set to 8 or 16, the algorithm exhibits slower convergence behavior. The convergence behaviors of DRL-UPA are less sensitive to the batch size, which, however, suffers from severe performance loss due to the CSI-unaware power allocation.

IV-D Impact of Whitening Process

Finally, in Fig. 11, we investigate the impact of the whitening process in (20) on the convergence behavior of the DRL algorithm. We note that the parameters involved in the whitening process significantly influence the convergence of the rewards. Setting $v_{0}=2$ and $t_{\mathrm{gap}}=100$ provides sufficient exploration ability in both the early and late training stages, enabling a broader range of attempted actions and circumventing over-dependence on proven high-return tactics. This parameterization leads to a $20\%$ performance increase compared to the suboptimal values of $v_{0}=0.5$ and $t_{\mathrm{gap}}=100$.

Additionally, the whitening process leads to a smoother DRL training process as it simulates the uncertainty in a real wireless communications environment, thus granting a certain level of robustness to the trained model against unseen interference. Indeed, the whitening process has a regularization effect that reduces overfitting and improves the generalization ability of the model. Nevertheless, introducing excessive noise also impedes convergence, which calls for careful control of the noise variance. Accordingly, attenuation of the whitening process over time is implemented, with Fig. 11 illustrating the efficacy of this attenuation strategy. According to the above analysis, the whitening process plays a pivotal role in the proposed DRL algorithm.
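To make the attenuation idea concrete, the sketch below perturbs an action vector with zero-mean Gaussian noise whose variance starts at $v_{0}$ and is reduced once every $t_{\mathrm{gap}}$ training steps. The exact form of (20) is not reproduced here: the geometric attenuation rule, the factor of 0.9, and the function names are hypothetical choices made for illustration, and only the values $v_{0}=2$ and $t_{\mathrm{gap}}=100$ are taken from the setting discussed above.

```python
import random


def noise_variance(step, v0=2.0, t_gap=100, decay=0.9):
    """Noise variance at a given training step.

    Assumed rule (not the paper's (20)): start at v0 and multiply
    by `decay` once every `t_gap` steps.
    """
    return v0 * decay ** (step // t_gap)


def whiten(action, step, rng, v0=2.0, t_gap=100, decay=0.9):
    """Add zero-mean Gaussian exploration noise to every action entry."""
    std = noise_variance(step, v0, t_gap, decay) ** 0.5
    return [a + rng.gauss(0.0, std) for a in action]


rng = random.Random(0)
noisy_early = whiten([0.0, 0.0, 0.0], step=0, rng=rng)    # strong exploration
noisy_late = whiten([0.0, 0.0, 0.0], step=5000, rng=rng)  # weak perturbation
```

With these assumed values, the variance stays at $v_{0}$ for the first $t_{\mathrm{gap}}$ steps and then shrinks geometrically, so the noise dominates early exploration and becomes a small perturbation late in training.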

V Conclusions

This paper investigated a SIM-assisted multi-user MISO wireless communication system that benefits from SIM-enabled precoding in the wave domain. A joint optimization framework of the SIM phase shifts and the transmit power allocation, aiming to maximize the sum-rate performance, was formulated. To address this challenging non-convex optimization problem, a DRL approach that operates on continuous-valued solutions and requires no labeled data was proposed. The SIM phase shifts and power allocation strategies were directly extracted from the DRL’s actor network. The presented simulation results validated the efficiency of the SIM for multi-user interference suppression, particularly at low transmit power levels. Furthermore, the proposed DRL optimization for an indicative SIM-assisted multi-user MISO system demonstrated a 2 bps/Hz sum-rate improvement compared to a state-of-the-art AO algorithm. It was also shown that combining appropriate hyperparameter selection with a whitening process can substantially enhance the robustness of the proposed DRL algorithm. For future research, SIM-enabled broadband communication and discrete space optimization techniques deserve further exploration.

References

[1] H. Liu, J. An, D. W. K. Ng, G. C. Alexandropoulos, and L. Gan, “DRL-based orchestration of multi-user MISO systems with stacked intelligent metasurfaces,” in IEEE Int. Conf. Commun. (ICC). Denver, USA: IEEE, Jun. 2024, pp. 1–6.
[2] J. An, C. Yuen, L. Dai, M. Di Renzo, M. Debbah, and L. Hanzo, “Near-field communications: Research advances, potential, and challenges,” IEEE Wireless Commun., vol. 31, no. 3, pp. 100–107, 2024.
[3] E. Basar, G. C. Alexandropoulos, Y. Liu, Q. Wu, S. Jin, C. Yuen, O. A. Dobre, and R. Schober, “Reconfigurable intelligent surfaces for 6G: Emerging hardware architectures, applications, and open challenges,” IEEE Veh. Technol. Mag., pp. 2–22, 2024.
[4] Ö. T. Demir, E. Björnson, and L. Sanguinetti, “Channel modeling and channel estimation for holographic massive MIMO with planar arrays,” IEEE Wireless Commun. Lett., vol. 11, no. 5, pp. 997–1001, Feb. 2022.
[5] Y. Niu, Y. Li, D. Jin, L. Su, and A. V. Vasilakos, “A survey of millimeter wave communications (mmWave) for 5G: Opportunities and challenges,” Wireless Netw., vol. 21, no. 8, pp. 2657–2676, Apr. 2015.
[6] Z. Chen, G. Chen, J. Tang, S. Zhang, D. K. So, O. A. Dobre, K.-K. Wong, and J. Chambers, “Reconfigurable-intelligent-surface-assisted B5G/6G wireless communications: Challenges, solution, and future opportunities,” IEEE Commun. Mag., vol. 61, no. 1, pp. 16–22, Jan. 2023.
[7] J. An et al., “Stacked intelligent metasurfaces for efficient holographic MIMO communications in 6G,” IEEE J. Sel. Areas Commun., vol. 41, no. 8, pp. 2380–2396, Aug. 2023.
[8] G. C. Alexandropoulos et al., “Advanced coordinated beamforming for the downlink of future LTE cellular networks,” IEEE Commun. Mag., vol. 54, no. 7, pp. 54–60, Jul. 2016.
[9] B. M. Hochwald, C. B. Peel, and A. L. Swindlehurst, “A vector-perturbation technique for near-capacity multiantenna multiuser communication-part II: Perturbation,” IEEE Trans. Commun., vol. 53, no. 3, pp. 537–544, Mar. 2005.
[10] A. Li, D. Spano, J. Krivochiza, S. Domouchtsidis, C. G. Tsinos, C. Masouros, S. Chatzinotas, Y. Li, B. Vucetic, and B. Ottersten, “A tutorial on interference exploitation via symbol-level precoding: Overview, state-of-the-art and future directions,” IEEE Commun. Surveys Tuts., vol. 22, no. 2, pp. 796–839, Mar. 2020.
[11] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, Oct. 2018.
[12] T. V. Nguyen, D. N. Nguyen, M. D. Renzo, and R. Zhang, “Leveraging secondary reflections and mitigating interference in multi-IRS/RIS aided wireless networks,” IEEE Trans. Wireless Commun., vol. 22, no. 1, pp. 502–517, Aug. 2023.
[13] C. Huang, R. Mo, and C. Yuen, “Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning,” IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1839–1850, Aug. 2020.
[14] R. Zhang, K. Xiong, Y. Lu, B. Gao, P. Fan, and K. B. Letaief, “Joint coordinated beamforming and power splitting ratio optimization in MU-MISO SWIPT-enabled HetNets: A multi-agent DDQN-based approach,” IEEE J. Sel. Areas Commun., vol. 40, no. 2, pp. 677–693, Feb. 2022.
[15] K. Stylianopoulos et al., “Deep contextual bandits for orchestrating multi-user MISO systems with multiple RISs,” in IEEE Int. Conf. Commun. (ICC). Seoul, South Korea: IEEE, May 2022, pp. 1556–1561.
[16] G. Huang et al., “Stacked intelligent metasurfaces for task-oriented semantic communications,” 2024. [Online]. Available: https://arxiv.org/pdf/2407.15053
[17] C. Liu, Q. Ma, Z. J. Luo, Q. R. Hong, Q. Xiao, H. C. Zhang, L. Miao, W. M. Yu, Q. Cheng, L. Li et al., “A programmable diffractive deep neural network based on a digital-coding metasurface array,” Nat. Electron., vol. 5, no. 2, pp. 113–122, Feb. 2022.
[18] J. An et al., “Stacked intelligent metasurfaces for multiuser beamforming in the wave domain,” in IEEE Int. Conf. Commun. (ICC). Rome, Italy: IEEE, May 2023, pp. 1938–1883.
[19] C. Liu, X. Liu, D. W. K. Ng, and J. Yuan, “Deep residual learning for channel estimation in intelligent reflecting surface-assisted multi-user communications,” IEEE Trans. Wireless Commun., vol. 21, no. 2, pp. 898–912, Feb. 2022.
[20] M. Joham, W. Utschick, and J. A. Nossek, “Linear transmit processing in MIMO communications systems,” IEEE Trans. Signal Process., vol. 53, no. 8, pp. 2700–2712, Jul. 2005.
[21] M. Sadek, A. Tarighat, and A. H. Sayed, “A leakage-based precoding scheme for downlink multi-user MIMO channels,” IEEE Trans. Wireless Commun., vol. 6, no. 5, pp. 1711–1721, May 2007.
[22] X. Jia et al., “Environment-aware codebook for reconfigurable intelligent surface-aided MISO communications,” IEEE Wireless Commun. Lett., vol. 12, no. 7, pp. 1174–1178, Jul. 2023.
[23] H. Liu et al., “K-means based constellation optimization for index modulated reconfigurable intelligent surfaces,” IEEE Commun. Lett., vol. 27, no. 8, pp. 2152–2156, Aug. 2023.
[24] M. Jian et al., “Reconfigurable intelligent surfaces for wireless communications: Overview of hardware designs, channel models, and estimation techniques,” Intell. Converged Netw., vol. 3, no. 1, pp. 1–32, Mar. 2022.
[25] J. An, C. Xu, D. W. K. Ng, C. Yuen, and L. Hanzo, “Adjustable-delay RIS is capable of improving OFDM systems,” IEEE Trans. Veh. Technol., vol. 73, no. 7, pp. 9927–9942, 2024.
[26] M. A. ElMossallamy, H. Zhang, L. Song, K. G. Seddik, Z. Han, and G. Y. Li, “Reconfigurable intelligent surfaces for wireless communications: Principles, challenges, and opportunities,” IEEE Trans. Cogn. Commun. Netw., vol. 6, no. 3, pp. 990–1002, Sep. 2020.
[27] C. Xu et al., “Antenna selection for reconfigurable intelligent surfaces: A transceiver-agnostic passive beamforming configuration,” IEEE Trans. Wireless Commun., vol. 22, no. 11, pp. 7756–7774, 2023.
[28] W. Xu et al., “Time-varying channel prediction for RIS-assisted MU-MISO networks via deep learning,” IEEE Trans. Cogn. Commun. Netw., vol. 8, no. 4, pp. 1802–1815, Dec. 2022.
[29] J. An et al., “Codebook-based solutions for reconfigurable intelligent surfaces and their open challenges,” IEEE Wireless Commun., pp. 1–8, Nov. 2022.
[30] J. An, C. Xu, L. Gan, and L. Hanzo, “Low-complexity channel estimation and passive beamforming for RIS-assisted MIMO systems relying on discrete phase shifts,” IEEE Trans. Commun., vol. 70, no. 2, pp. 1245–1260, Feb. 2022.
[31] L. You et al., “Reconfigurable intelligent surfaces-assisted multiuser MIMO uplink transmission with partial CSI,” IEEE Trans. Wireless Commun., vol. 20, no. 9, pp. 5613–5627, Sep. 2021.
[32] Z. Yu et al., “Environment-aware codebook design for RIS-assisted MU-MISO communications: Implementation and performance analysis,” IEEE Trans. Commun., pp. 1–15, Jun. 2024.
[33] X. Cao et al., “Massive access of static and mobile users via reconfigurable intelligent surfaces: Protocol design and performance analysis,” IEEE J. Sel. Areas Commun., vol. 40, no. 4, pp. 1253–1269, Apr. 2022.
[34] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, “All-optical machine learning using diffractive deep neural networks,” Sci., vol. 361, no. 6406, pp. 1004–1008, Jul. 2018.
[35] S. Lin et al., “Stacked intelligent metasurface enabled LEO satellite communications relying on statistical CSI,” IEEE Wireless Commun. Lett., vol. 13, no. 5, pp. 1295–1299, Feb. 2024.
[36] N. U. Hassan, J. An, M. Di Renzo, M. Debbah, and C. Yuen, “Efficient beamforming and radiation pattern control using stacked intelligent metasurfaces,” IEEE Open J. Commun. Soc., Jan. 2024.
[37] A. Papazafeiropoulos et al., “Achievable rate optimization for stacked intelligent metasurface-assisted holographic MIMO communications,” IEEE Trans. Wireless Commun., pp. 1–14, May 2024.
[38] A. Papazafeiropoulos, P. Kourtessis, S. Chatzinotas, D. I. Kaklamani, and I. S. Venieris, “Achievable rate optimization for large stacked intelligent metasurfaces based on statistical CSI,” IEEE Wireless Commun. Lett., pp. 1–5, May 2024.
[39] W. Xu et al., “Deep reinforcement learning based on location-aware imitation environment for RIS-aided mmWave MIMO systems,” IEEE Wireless Commun. Lett., vol. 11, no. 7, pp. 1493–1497, Jul. 2022.
[40] G. C. Alexandropoulos et al., “Pervasive machine learning for smart radio environments enabled by reconfigurable intelligent surfaces,” Proc. IEEE, vol. 110, no. 9, pp. 1494–1525, Sep. 2022.
[41] L. Xiao, H. Zhang, Y. Xiao, X. Wan, S. Liu, L.-C. Wang, and H. V. Poor, “Reinforcement learning-based downlink interference control for ultra-dense small cells,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 423–434, Jan. 2020.
[42] B. Yang, X. Cao, J. Xu, C. Huang, G. C. Alexandropoulos, L. Dai, M. Debbah, H. V. Poor, and C. Yuen, “Reconfigurable intelligent computational surfaces: When wave propagation control meets computing,” IEEE Wireless Commun., vol. 30, no. 3, pp. 120–128, Jun. 2023.
[43] J. An et al., “Stacked intelligent metasurface-aided MIMO transceiver design,” IEEE Wireless Commun., pp. 1–9, Apr. 2024.
[44] H. Liu et al., “Stacked intelligent metasurfaces for wireless sensing and communication: Applications and challenges,” arXiv preprint arXiv:2407.03566, 2024.
[45] J. An, C. Yuen, M. Di Renzo, M. Debbah, H. V. Poor, and L. Hanzo, “Stacked intelligent metasurface performs a 2D DFT in the wave domain for DOA estimation,” arXiv preprint arXiv:2310.09861, 2023.
[46] J. An et al., “Two-dimensional direction-of-arrival estimation using stacked intelligent metasurfaces,” IEEE J. Sel. Areas Commun., pp. 1–16, Jun. 2024.
[47] X. Yao et al., “Channel estimation for stacked intelligent metasurface-assisted wireless networks,” IEEE Wireless Commun. Lett., vol. 13, no. 5, pp. 1349–1353, Feb. 2024.
[48] Q. Li, M. El-Hajjar, C. Xu, J. An, C. Yuen, and L. Hanzo, “Stacked intelligent metasurfaces for holographic MIMO aided cell-free networks,” IEEE Trans. Commun., pp. 1–13, May 2024.
[49] Q.-U.-A. Nadeem et al., “Hybrid digital-wave domain channel estimator for stacked intelligent metasurface enabled multi-user MISO systems,” in 2024 IEEE Wireless Communications and Networking Conference (WCNC), 2024, pp. 1–6.
[50] J. An et al., “A tutorial on holographic MIMO communications—part I: Channel modeling and channel estimation,” IEEE Commun. Lett., vol. 27, no. 7, pp. 1664–1668, Jul. 2023.
[51] M. Nerini and B. Clerckx, “Physically consistent modeling of stacked intelligent metasurfaces implemented with beyond diagonal RIS,” IEEE Commun. Lett., pp. 1–5, Jul. 2024.
[52] Q. Wu and R. Zhang, “Intelligent reflecting surface enhanced wireless network via joint active and passive beamforming,” IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5394–5409, Nov. 2019.
[53] H. Guo, Y.-C. Liang, J. Chen, and E. G. Larsson, “Weighted sum-rate maximization for reconfigurable intelligent surface aided wireless networks,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3064–3076, May 2020.
[54] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, “An introduction to deep reinforcement learning,” Found. Trends Mach. Learn., vol. 11, no. 3-4, pp. 219–354, Dec. 2018.
[55] K. He and J. Sun, “Convolutional neural networks at constrained time cost,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 5353–5360.
[56] O. Sidelnikov, A. Redyuk, and S. Sygletos, “Equalization performance and complexity analysis of dynamic deep neural networks in long haul transmission systems,” Opt. Exp., vol. 26, no. 25, pp. 32 765–32 776, Nov. 2018.
[57] S. Deng, T. Weber, and A. Ahrens, “Capacity optimizing power allocation in interference channels,” AEU Int. J. Electronics Commun., vol. 63, no. 2, pp. 139–147, Feb. 2009.