
Intelligent Reflecting Surface Configurations for Smart Radio Using Deep Reinforcement Learning

Wei Wang and Wei Zhang

This work was supported in part by the National Key R&D Program of China under Grant 2020YFA0711400, the Shenzhen Science & Innovation Fund under Grant JCYJ20180507182451820, and the Australian Research Council's Project funding scheme under LP160101244. W. Wang is with Peng Cheng Laboratory, Shenzhen, China (e-mail: [email protected]). W. Zhang is with the School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW 2052, Australia (e-mail: [email protected]).
Abstract

Intelligent reflecting surface (IRS) is envisioned to change the paradigm of wireless communications from "adapting to wireless channels" to "changing wireless channels". However, current IRS configuration schemes, which perform sub-channel estimation and passive beamforming in sequence, follow conventional model-based design philosophies and are difficult to realize in practice in complex radio environments. To create the smart radio environment, we propose a model-free design of IRS control that is independent of the channel state information (CSI) of the sub-channels and requires minimal interaction between the IRS and the wireless communication system. We first model the control of the IRS as a Markov decision process (MDP) and apply deep reinforcement learning (DRL) to perform real-time coarse phase control of the IRS. Then, we apply extremum seeking control (ESC) as the fine phase control of the IRS. Finally, by updating the frame structure, we integrate DRL and ESC in the model-free control of the IRS to improve its adaptivity to different channel dynamics. Numerical results show the superiority of our proposed joint DRL and ESC scheme and verify its effectiveness in model-free IRS control without sub-channel CSI.

I Introduction

Metasurfaces, which consist of artificially periodic or quasi-periodic structures with sub-wavelength scales, are a new class of functional materials [1, 2]. Some extraordinary electromagnetic properties observed on metasurfaces, e.g., negative permittivity and permeability, reveal their potential in tailoring electromagnetic waves over a wide frequency range, from microwave to visible light [3, 4, 5, 6]. Intelligent reflecting surface (IRS), a.k.a. reconfigurable intelligent surface (RIS), is a type of programmable metasurface that is capable of electronically tuning electromagnetic waves by incorporating active components into each unit cell [7, 8, 9, 10, 11, 12]. The advent of IRS is envisioned to revolutionize many industries, a major one of which is wireless communications.

Wireless communications are subject to the time-varying radio propagation environment. The effects of free-space path loss, signal absorption, reflections, refractions, and diffractions caused by physical objects during the propagation of electromagnetic waves jointly render wireless channels highly dynamic [13, 14, 15]. The IRS's capability of manipulating electromagnetic waves in real time brings vast possibilities to wireless communications and makes it possible to transform the design paradigm of wireless communications from "adapting to wireless channels" to "changing wireless channels" [7]. To this end, great research efforts have been devoted to acquiring the channel state information (CSI) of the sub-channels, i.e., the channels between the wireless transceivers and the IRS, which is widely regarded as the prerequisite of IRS reflection pattern (passive beamforming) design [8]. Owing to its passive nature, the IRS is unable to sense the incident signal, which renders the estimation process far more complicated than in traditional wireless communication systems. In [9], a channel estimation scheme with reduced training overhead is proposed by exploiting the inter-user channel correlation. In [10, 11], compressed sensing based channel estimation methods are proposed to estimate the channel responses between the base station, the IRS, and a single-antenna user in the mmWave frequency band. As the proposed schemes focus on a single-antenna user, their extension to multiple users with antenna arrays might increase the training overhead multi-fold. In [16], a joint beam training and positioning scheme is proposed to estimate the parameters of the line-of-sight (LoS) paths for IRS assisted mmWave communications. The proposed random beamforming in the training stage is performed in a broadcasting manner, and thus the training overhead is independent of the number of users.

Despite the aforementioned endeavors to advance CSI acquisition techniques in IRS assisted wireless communications, the practical application of IRS still confronts various challenges. Firstly, channel estimation for IRS assisted wireless communications demands a radical update of existing protocols to incorporate the coordination of the transmitter, the receiver, and the IRS. This indicates that existing wireless systems, e.g., Wi-Fi, 4G-LTE and 5G-NR, are unable to readily embrace IRS. Secondly, even if perfect CSI is available, the real-time optimization of IRS reflection coefficients using convex and non-convex optimization techniques is computationally prohibitive [12]. Thirdly, current solutions to IRS control, which consist of CSI acquisition and IRS reflection design, are based on accurate modelling of the IRS. However, as a type of low-cost reflective metasurface, the IRS changes its reflection coefficients by tuning the impedance, the exact value of which depends on the carrier frequency of the incident signal [17, 18]. The carrier frequency can be shifted by Doppler effects and might also vary over different users, so the mathematical modelling of IRS in the complex radio propagation environment is inherently difficult.

To tackle the aforementioned challenges, we follow the design paradigm of model-free control by treating the wireless communication system as a (semi) black box with uncertain parameters and optimizing the reflection coefficients of the IRS through deep reinforcement learning (DRL) and extremum seeking control (ESC). Compared with the prevailing designs of IRS assisted wireless communication systems [11, 19, 20, 21, 22, 23, 24, 25, 26], our proposed scheme is model-free. Specifically, the instantaneous (or statistical) CSI of the sub-channels (i.e., the Tx-Rx channel, the Tx-IRS channel, and the IRS-Rx channel) that constitute the equivalent wireless channel is not required. Our design, in a true sense, treats the IRS as a part of the wireless channel and requires minimal interaction with the wireless communication system. The disentanglement of IRS configuration from the wireless communication system improves the independence of the IRS and will speed up the rollout of IRS in the future. There are already some attempts towards the standalone operation of IRS [27, 28, 29]. In [27, 28], deep learning and deep reinforcement learning are applied to guide the IRS to interact with the incident signal given the knowledge of sampled channel vectors. However, in order to obtain the CSI of the Tx-IRS and IRS-Rx sub-channels, the authors propose to install channel sensors on the IRS, which is, to some extent, against the initial role of the IRS as a passive device. In [29], to reduce the dependence on the CSI of the sub-channels, a deep learning scheme is proposed to extract the interactions between the phase shifts of the IRS and the receiver locations. However, the training data has to be collected offline, which limits its adaptability to more general scenarios.

In this paper, our objective is to build a model-free IRS control scheme with a higher level understanding of the radio environment, which is able to configure the IRS reflection coefficients without the CSI of the sub-channels. To this end, we adopt a typical scenario, i.e., time-division duplexing (TDD) multi-user multiple-input-multiple-output (MIMO), as an example to perform our design. To summarize, our contributions are as follows.

  • We model the control of IRS as a Markov decision process (MDP) and then apply DRL, specifically, the double deep Q-network (DDQN) method, to perform real-time coarse phase control of IRS. The proposed DDQN scheme outperforms other sub-channel-CSI-independent methods, e.g., multi-armed bandit (MAB) and random reflection.

  • To enhance the action of DDQN, we further apply ESC as the fine phase control of IRS. Specifically, we propose a dither-based iterative method to optimize the phase shift of IRS through trial and error. We also prove that the output of the proposed dither-based iterative method is monotonically increasing.

  • By updating the frame structure, we integrate DRL and ESC in the model-free control of IRS. The integrated scheme is more adaptive to various channel dynamics and has the potential to achieve better performance.

Numerical results show the superiority of our proposed DRL, ESC, and joint DRL and ESC scheme and verify their effectiveness in model-free IRS control without sub-channel CSI.

The rest of the paper is organized as follows. In Section II, we introduce the system model. In Section III, we propose a DRL enabled model-free control of IRS. In Section IV, we propose a dither-based iterative method to enhance the action of DRL. In Section V, we present numerical results. Finally, in Section VI, we conclude the paper.

Notations:  Column vectors (matrices) are denoted by bold-face lower (upper) case letters; $\mathbf{x}[n]$ denotes the $n$-th element of the vector $\mathbf{x}$; $\odot$ represents the Hadamard product; $(\cdot)^{*}$, $(\cdot)^{T}$ and $(\cdot)^{H}$ represent the conjugate, transpose and conjugate transpose operations, respectively.

II System Model

In this section, we introduce the system model of model-free IRS control.

II-A Optimal Phase Shift Vector of IRS

Figure 1: The illustration of IRS assisted wireless communications

For IRS assisted wireless communications, the channel model between the BS and a certain user $k$ can be represented as

$$\mathbf{h}_{k}=\mathbf{h}_{BU_{k}}+\mathbf{H}_{BR}\bm{\Theta}\mathbf{h}_{RU_{k}}\qquad(1)$$

where user $k$ is equipped with a single antenna, $\mathbf{h}_{BU_{k}}\in\mathbb{C}^{N_{B}\times 1}$ is the channel response vector between user $k$ and the BS, $\mathbf{H}_{BR}\in\mathbb{C}^{N_{B}\times N_{R}}$ is the channel response matrix between the BS and the IRS, $\mathbf{h}_{RU_{k}}\in\mathbb{C}^{N_{R}\times 1}$ is the channel response vector between user $k$ and the IRS, $\bm{\Theta}=\operatorname{diag}\{\bm{\theta}\}$, and $\bm{\theta}\in\mathbb{C}^{N_{R}\times 1}$ (with $|\bm{\theta}[n]|=1$) is the phase shift vector of the IRS. Accordingly, the multi-user channel is written as

$$\mathbf{H}=\mathbf{H}_{BU}+\mathbf{H}_{BR}\bm{\Theta}\mathbf{H}_{RU}\qquad(2)$$

where

$$\mathbf{H}=[\mathbf{h}_{1},\mathbf{h}_{2},\cdots,\mathbf{h}_{K}]$$
$$\mathbf{H}_{BU}=[\mathbf{h}_{BU_{1}},\mathbf{h}_{BU_{2}},\cdots,\mathbf{h}_{BU_{K}}]$$
$$\mathbf{H}_{RU}=[\mathbf{h}_{RU_{1}},\mathbf{h}_{RU_{2}},\cdots,\mathbf{h}_{RU_{K}}]$$

The relationship between the aggregated equivalent channel $\mathbf{H}$ and the sub-channels is shown in Figure 1.

The objective of reinforcement learning based IRS configuration is to develop a widely compatible method that can be deployed in various scenarios of wireless communications without any knowledge of the wireless system’s internal working mechanism. Mathematically, the problem is formulated as

$$\begin{aligned}&\max_{\bm{\theta}}\;\;P_{m}\\ &\;{\rm s.t.}\;\;\bm{\theta}[n]=e^{-j\bm{\varphi}[n]},\;\forall n\in\{1,2,\cdots,N_{R}\}\\ &\qquad\;\bm{\varphi}[n]\in\mathcal{B},\;\forall n\in\{1,2,\cdots,N_{R}\}\end{aligned}\qquad(3)$$

where $P_{m}$ is the performance metric of the wireless system that is to be optimized. $P_{m}$ is dependent on the wireless channel $\mathbf{H}$, and $\mathbf{H}$ is dependent on the reflection pattern $\bm{\theta}$. $\bm{\varphi}[n]$ is the quantized phase selected from the finite set $\mathcal{B}=\left\{-\pi,\frac{-2^{r}+2}{2^{r}}\pi,\frac{-2^{r}+4}{2^{r}}\pi,\cdots,\pi\right\}$ with $2^{r}+1$ possible values.

It is worth mentioning that the model-free control does not need to know the exact relationship between the objective $P_{m}$ and the variable $\bm{\theta}$.
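As a small illustration of the feasible set in (3), the sketch below builds the quantized phase codebook $\mathcal{B}$ and maps a phase vector to the unit-modulus reflection coefficients; the function names are ours, not part of the paper.

```python
import numpy as np

def phase_codebook(r):
    """The set B in (3): 2^r + 1 uniformly spaced phases over [-pi, pi]."""
    return np.linspace(-np.pi, np.pi, 2 ** r + 1)

def reflection_vector(phi):
    """Map quantized phases phi[n] in B to reflection coefficients
    theta[n] = exp(-j * phi[n]), each with unit modulus."""
    return np.exp(-1j * phi)
```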

II-B A Typical Scenario – TDD Multi-User MIMO

Without loss of generality, we use a typical scenario in wireless communication, i.e., TDD multi-user MIMO, to illustrate our design philosophy. In TDD, by exploiting the channel reciprocity, the BS can estimate the downlink channel from the pilot of the uplink channel. Thus, TDD multi-user MIMO consists of two stages (refer to Figure 2), i.e., uplink pilot transmission and downlink data transmission[30, 31].

At the uplink stage, pilots are transmitted from multiple users to the BS simultaneously. The received pilot signal is represented as

$$\mathbf{Y}_{U}=\mathbf{H}\mathbf{S}+\mathbf{N}\qquad(4)$$

where $\mathbf{S}\in\mathbb{C}^{K\times K}$ is the pilot pattern and $\mathbf{N}\in\mathbb{C}^{N_{B}\times K}$ is the additive white Gaussian noise. Upon receiving the pilot, the BS performs minimum mean square error (MMSE) estimation of the channel matrix, i.e.,

$$\hat{\mathbf{H}}=\mathbf{Y}_{U}\mathbf{S}^{H}(\mathbf{S}\mathbf{S}^{H}+\sigma^{2}_{U}\mathbf{I})^{-1}\qquad(5)$$

When $\mathbf{S}$ is a unitary matrix, (5) is further expressed as

$$\hat{\mathbf{H}}=\frac{\mathbf{Y}_{U}\mathbf{S}^{H}}{1+\sigma^{2}_{U}}\qquad(6)$$

At the downlink stage, data transmission with zero-forcing (ZF) precoding is performed, and the precoding matrix is represented as

$$\begin{aligned}\mathbf{M}&=[\mathbf{m}_{1},\mathbf{m}_{2},\cdots,\mathbf{m}_{K}]^{H}\qquad&(7\text{a})\\&=\mathbf{D}_{P}(\hat{\mathbf{H}}^{H}\hat{\mathbf{H}})^{-1}\hat{\mathbf{H}}^{H}\qquad&(7\text{b})\end{aligned}$$

where $\mathbf{D}_{P}={\rm diag}\big(\big[\frac{1}{\|\mathbf{m}_{1}\|_{2}},\frac{1}{\|\mathbf{m}_{2}\|_{2}},\cdots,\frac{1}{\|\mathbf{m}_{K}\|_{2}}\big]\big)$ is for power normalization. The received signal of user $k$ is given by

$$y_{D,k}=\mathbf{m}_{k}^{H}\mathbf{h}_{k}x_{k}+\sum_{l\neq k}^{K}\mathbf{m}_{l}^{H}\mathbf{h}_{k}x_{l}+n_{k}\qquad(8)$$

where $x_{k}$ is the signal intended for user $k$ ($\mathbb{E}(x_{k})=0$ and $\mathbb{E}(|x_{k}|^{2})=1$, $\forall k\in\{1,\cdots,K\}$), and $n_{k}\sim\mathcal{CN}(0,\sigma_{k}^{2})$ is the additive white Gaussian noise. Thus, the signal-to-interference-plus-noise ratio (SINR) of the $k$-th user is

$${\rm SINR}_{k}=\frac{|\mathbf{m}_{k}^{H}\mathbf{h}_{k}|^{2}}{\sum_{l\neq k}^{K}|\mathbf{m}_{l}^{H}\mathbf{h}_{k}|^{2}+\sigma^{2}_{k}}\qquad(9)$$

For a communication system, the performance metric can be the SINR, data rate, frame error rate (FER), etc. Without loss of generality, we adopt the sum data rate as the performance metric, i.e.,

$$P_{m}=\sum_{k=1}^{K}r_{k}=\sum_{k=1}^{K}\log_{2}(1+{\rm SINR}_{k})\qquad(10)$$
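To make the performance-metric feedback concrete, the following sketch (NumPy only; the noise values and function name are illustrative assumptions, not the paper's interface) strings together the uplink MMSE estimate in (6) with a unitary pilot, the ZF precoder in (7), the SINR in (9), and the sum rate in (10) for a given aggregated channel $\mathbf{H}$.

```python
import numpy as np

def sum_rate_metric(H, sigma_u2=0.1, sigma_k2=0.5, rng=None):
    """Sum-rate metric P_m of (10) for an aggregated channel H (N_B x K):
    uplink MMSE estimation (6) with a unitary pilot, ZF precoding (7) with
    row-wise power normalization, and per-user SINR (9)."""
    rng = np.random.default_rng() if rng is None else rng
    N_B, K = H.shape

    # Uplink pilot transmission (4) with a unitary pilot matrix (S = I_K).
    S = np.eye(K, dtype=complex)
    N = np.sqrt(sigma_u2 / 2) * (rng.standard_normal((N_B, K))
                                 + 1j * rng.standard_normal((N_B, K)))
    Y_U = H @ S + N

    # MMSE estimate (6) for unitary S.
    H_hat = Y_U @ S.conj().T / (1.0 + sigma_u2)

    # ZF precoder (7): rows of M are m_k^H, each normalized to unit norm.
    M = np.linalg.inv(H_hat.conj().T @ H_hat) @ H_hat.conj().T   # K x N_B
    M = M / np.linalg.norm(M, axis=1, keepdims=True)

    # SINR (9) and sum rate (10), evaluated on the true channel H.
    G = np.abs(M @ H) ** 2            # G[l, k] = |m_l^H h_k|^2
    desired = np.diag(G)
    interference = G.sum(axis=0) - desired
    sinr = desired / (interference + sigma_k2)
    return float(np.sum(np.log2(1.0 + sinr)))
```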

II-C Channel Model

We assume the Rician channel model for $\mathbf{h}_{BU_{k}}$, $\mathbf{H}_{BR}$ and $\mathbf{h}_{RU_{k}}$. Taking $\mathbf{H}_{BR}$ as an example, it is represented as

$$\mathbf{H}_{BR}=\sqrt{\frac{K_{Rician}}{K_{Rician}+1}}\mathbf{H}_{BR,LoS}+\sqrt{\frac{1}{K_{Rician}+1}}\mathbf{H}_{BR,NLoS}\qquad(11)$$

where $\mathbf{H}_{BR,LoS}$ denotes the deterministic LoS component, $\mathbf{H}_{BR,NLoS}$ denotes the fast-fading NLoS component, whose entries are independent and identically distributed (i.i.d.) circularly symmetric complex Gaussian random variables with zero mean and unit variance, and $K_{Rician}$ is the ratio between the power in the LoS path and the power in the NLoS paths [32].

The LoS component is position-dependent and thus slowly time-varying; the NLoS components are caused by multi-path effects and are thus fast time-varying [33]. Combining these characteristics of the wireless channel with the setting of reinforcement learning, we introduce the following two concepts.

(1) Channel block: One channel block consists of the uplink pilot transmission stage and downlink data transmission stage (as shown in Figure 2), and the channel matrix is constant during the channel block.

(2) Channel episode: One channel episode consists of $T$ channel blocks (as shown in Figure 2). The LoS component within one channel episode remains constant; the NLoS components change over time, and the NLoS components of different channel blocks are i.i.d.
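As an illustration of the two concepts above, the following sketch draws one channel episode under the Rician model (11): the LoS component is fixed over the episode while the NLoS part is redrawn i.i.d. for every channel block. The Rician factor is written as `kappa` here to avoid clashing with the user number $K$; building the LoS matrix itself (e.g., from the steering vectors of Section V) is left to the caller.

```python
import numpy as np

def rician_block(H_los, kappa, rng):
    """One channel-block realization of (11): fixed LoS part plus an
    i.i.d. CN(0,1) NLoS part, mixed by the Rician factor kappa."""
    H_nlos = (rng.standard_normal(H_los.shape)
              + 1j * rng.standard_normal(H_los.shape)) / np.sqrt(2)
    return (np.sqrt(kappa / (kappa + 1)) * H_los
            + np.sqrt(1 / (kappa + 1)) * H_nlos)

def channel_episode(H_los, kappa, T=20, rng=None):
    """A channel episode of T blocks: the LoS component is held constant
    while the NLoS component is redrawn i.i.d. for every block."""
    rng = np.random.default_rng() if rng is None else rng
    return [rician_block(H_los, kappa, rng) for _ in range(T)]
```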

Figure 2: The frame structure of the typical TDD multi-user MIMO (green part) and the corresponding signal processing procedures (purple part)

III Model-Free IRS Control Enabled By Deep Reinforcement Learning

In this section, we apply DRL to model-free IRS control.

III-A Design Objectives

We aim to achieve stand-alone operation of the wireless communication system and the IRS, and our design includes the following characteristics.

  • Wireless Communication System: The wireless communication system is almost unaware of the existence of the IRS, except that it needs to feed back its instantaneous performance to the IRS controller. In this regard, the uplink pilot transmission and downlink data transmission exactly follow the conventional structure in Figure 2.

  • IRS: The IRS is strictly regarded as part of the wireless channel and will not be jointly designed with the wireless communication system. The configuration of the IRS is based on (a) the performance feedback from the wireless system and (b) its learned policy obtained through trial-and-error interaction with the dynamic environment. The IRS is unaware of the working mechanism of the wireless communication system.

A salient advantage of the proposed design is that the IRS can be deployed in various wireless communication applications, e.g., Wi-Fi, 4G-LTE, 5G-NR, without updating their existing protocols, which will speed up the roll-out of IRSs. Another benefit is that, by treating the existing wireless communication system as a black box, the configuration of the IRS does not require the overhead-demanding channel sounding process to acquire the CSI of the sub-channels, i.e., $\mathbf{H}_{BU}$, $\mathbf{H}_{BR}$, and $\mathbf{H}_{RU}$, that constitute the aggregated equivalent channel $\mathbf{H}$.

Our design is primarily based on reinforcement learning. Specifically, the IRS and its controller constitute the agent, while the wireless communication system, which comprises the transmitter, the wireless channel, and the receiver, is the environment. The relationship between the different parties is given in Figure 3. Initially, the agent takes random actions, and the environment responds to those actions by giving rise to rewards and presenting new situations to the agent [34, 35]. Through trial-and-error interaction with the wireless communication system, the agent gradually learns the optimal policy to maximize the expected return over time. In this regard, the IRS, which is capable of changing the radio environment, is analogous to the human body, and the reinforcement learning method, which guides the actions of the IRS, is analogous to the human brain. The integration of the IRS and the reinforcement learning method is the pathway to creating the smart radio environment.

Figure 3: The structure of model-free IRS configuration enabled by deep reinforcement learning

III-B Basics of Deep Reinforcement Learning

To facilitate the presentation of our design, we briefly introduce some key concepts of DRL in this subsection.

III-B1 Objective of Reinforcement Learning

An MDP is specified by the 4-tuple $\langle\mathcal{S},\mathcal{A},P,R\rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the state transition probability, and $R$ is the immediate reward received by the agent. When an agent in state $s\in\mathcal{S}$ takes action $a\in\mathcal{A}$, the environment evolves to the next state $s^{\prime}\in\mathcal{S}$ with probability $P(s^{\prime}|s,a)={\rm Pr}(S_{t+1}=s^{\prime}|S_{t}=s,A_{t}=a)$, and in the meantime, the agent receives the immediate reward $R_{s\rightarrow s^{\prime}}^{a}$. Adding the time index to $S,A,R$, the evolution of an MDP can be represented by the following trajectory

$$\langle S_{0},A_{0},R_{1},S_{1},A_{1},R_{2},\cdots,S_{T-1},A_{T-1},R_{T},S_{T},\cdots\rangle\qquad(12)$$

The agent’s action is directed by the policy function

$$\pi(a|s)={\rm Pr}(A_{t}=a|S_{t}=s)\qquad(13)$$

which is the probability that the agent takes action aa when the current state is ss. A reinforcement learning task intends to find a policy that achieves a good return over the long run, where the return is defined as the cumulative discounted future reward, i.e.,

$$U_{t}=\sum_{\tau=0}^{\infty}\gamma^{\tau}R_{t+\tau+1}$$

where $\gamma\in[0,1]$ is the discount factor for future rewards. Owing to the randomness of the state transitions (caused by the dynamic environment) and the action selection, the return $U_{t}$ is a random variable. Mathematically, the agent's goal in reinforcement learning is to find a good policy that maximizes the expected return, i.e.,

$$\max_{\pi}\;\;\mathbb{E}(U_{t})\qquad(14)$$

III-B2 Action-Value Function and Optimal Policy

One key metric for action selection in reinforcement learning is action-value function, i.e.,

$$Q_{\pi}(s,a)=\mathbb{E}[U_{t}|S_{t}=s,A_{t}=a]\qquad(15)$$

which is the conditional expected return for an agent to pick action $a$ in state $s$ under the policy $\pi$. For any policy $\pi$ and any state $s\in\mathcal{S}$, the action-value function satisfies the following recursive relationship, i.e.,

$$\begin{aligned}Q_{\pi}(s,a)&=\mathbb{E}_{s^{\prime}}\left[R_{s\rightarrow s^{\prime}}^{a}+\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s^{\prime})Q_{\pi}(s^{\prime},a^{\prime})\,\Big|\,S_{t}=s,A_{t}=a\right]\\&=\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)\left(R_{s\rightarrow s^{\prime}}^{a}+\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s^{\prime})Q_{\pi}(s^{\prime},a^{\prime})\right)\end{aligned}\qquad(16)$$

where $R_{s\rightarrow s^{\prime}}^{a}$ is the immediate reward when the environment transits from state $s$ to state $s^{\prime}$ after taking action $a$, and Eq. (16) is the well-known Bellman equation of the action-value function [34].

A policy is defined to be better than another policy if its expected return is greater than or equal to that of the other for all states and all actions. Thus, the optimal action-value function is

$$Q^{*}(s,a):=Q_{\pi^{*}}(s,a)=\max_{\pi}Q_{\pi}(s,a),\;\;\forall s\in\mathcal{S},\,a\in\mathcal{A}\qquad(17)$$

With the optimal action-value function, the optimal policy is obviously

$$\pi^{*}(a|s)=\begin{cases}1,&{\rm if}\;a=\operatorname*{arg\,max}_{a\in\mathcal{A}}Q^{*}(s,a)\\ 0,&{\rm otherwise}\end{cases}\qquad(20)$$

Combining (17) and (20) with (16), the Bellman optimality equation for $Q^{*}(s,a)$ is given by

$$\begin{aligned}Q^{*}(s,a)&=\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)\left(R_{s\rightarrow s^{\prime}}^{a}+\gamma\max_{a^{\prime}\in\mathcal{A}}Q^{*}(s^{\prime},a^{\prime})\right)\\&=\mathbb{E}_{s^{\prime}}\left[R_{s\rightarrow s^{\prime}}^{a}+\gamma\max_{a^{\prime}\in\mathcal{A}}Q^{*}(s^{\prime},a^{\prime})\,\Big|\,S_{t}=s,A_{t}=a\right]\end{aligned}\qquad(21)$$

With the Bellman optimality equation, the optimal policy $\pi^{*}(a|s)$ or the optimal action-value function $Q^{*}(s,a)$ can be obtained via iterative methods, i.e., policy iteration based methods and value iteration based methods [34]. Hereinafter, we will mainly focus on value iteration based methods.

III-B3 Temporal Difference Learning

The aforementioned iterative methods require complete knowledge of the environment, i.e., the state transition probability $P(s^{\prime}|s,a)$, the reward function $R_{s\rightarrow s^{\prime}}^{a}$, etc. However, explicit knowledge of the environment dynamics is unavailable in practice. Instead, the conditional expectation in (21) can be approximated by numerically averaging over sample sequences of states, actions, and rewards obtained from actual interaction with the environment, e.g., via the temporal difference method or the Monte Carlo method.

Upon observing a new segment of the trajectory in (12), i.e., $\langle S_{t}=s,A_{t}=a,R_{t+1}=R_{s\rightarrow s^{\prime}}^{a},S_{t+1}=s^{\prime}\rangle$, the action-value function $Q(s,a)$ is updated as follows:

$$Q_{t+1}(s,a)=Q_{t}(s,a)+\alpha\left(R_{s\rightarrow s^{\prime}}^{a}+\gamma\max_{a^{\prime}\in\mathcal{A}}Q_{t}(s^{\prime},a^{\prime})-Q_{t}(s,a)\right)\qquad(22)$$

where $\alpha\in(0,1]$ is the learning rate and the term inside the brackets is the error between the estimated Q value and the target return. The value function is thus updated in the direction that reduces this error, and the iteration terminates when the error becomes negligibly small.
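For a discrete state and action space, the update in (22) reduces to a few lines. The sketch below (NumPy, with `Q` stored as a 2-D table indexed by state and action) is illustrative only; the state and action encodings are assumptions.

```python
import numpy as np

def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference (Q-learning) update of (22):
    move Q(s, a) towards r + gamma * max_a' Q(s', a') with step size alpha."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```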

III-B4 Double Deep Q-Network (DDQN)

When the state $s$ and the action $a$ are both discrete, the optimal action-value function $Q^{*}(s,a)$ can be obtained as a lookup table, also known as a Q-table [36], following the iterative procedure in (22). However, the size of the state (or action) space can be prohibitively large, and the state (or action) can even be continuous. In such cases, it is impractical to represent $Q(s,a)$ as a lookup table. Fortunately, a deep neural network (DNN) can be adopted to approximate the Q-table as $Q(s,a)\approx\widetilde{Q}(s,a;\mathbf{w})$, which enables reinforcement learning to scale to more general decision-making problems. The coefficients $\mathbf{w}$ of $\widetilde{Q}(s,a;\mathbf{w})$ are the weights of the DNN, and the DNN is termed a deep Q-network (DQN) [37, 36].

The trajectory segment $\langle S_{t}=s,A_{t}=a,R_{t+1}=R_{s\rightarrow s^{\prime}}^{a},S_{t+1}=s^{\prime}\rangle$ in (12) constitutes an "experience sample" that is used to train the DQN, and in accordance with (22), the loss function adopted during the training of the DQN is

$$Loss=\Big(\underbrace{R_{s\rightarrow s^{\prime}}^{a}+\gamma\max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}(s^{\prime},a^{\prime};\mathbf{w})}_{T_{DQN}}-\widetilde{Q}(s,a;\mathbf{w})\Big)^{2}\qquad(23)$$

where $T_{DQN}$ is the target value fed to the network.

The target $T_{DQN}$ depends on the immediate reward $R_{s\rightarrow s^{\prime}}^{a}$ as well as on the output of the DQN $\widetilde{Q}(s^{\prime},a^{\prime};\mathbf{w})$. Such a structure inevitably results in over-estimation of the action value (a.k.a. Q value) during training and thus significantly degrades the performance of DRL. To mitigate over-estimation, we adopt the double DQN (DDQN) structure [38, 39] in our design.

The fundamental idea of DDQN is to apply a separate target network $\widetilde{Q}(s^{\prime},a^{\prime};\mathbf{w}^{-})$ to estimate the target value [39], and the target in DDQN is expressed as

$$T_{DQN}=R_{s\rightarrow s^{\prime}}^{a}+\gamma\widetilde{Q}\Big(s^{\prime},\operatorname*{arg\,max}_{a^{\prime}\in\mathcal{A}}\widetilde{Q}(s^{\prime},a^{\prime};\mathbf{w});\mathbf{w}^{-}\Big)\qquad(24)$$

To summarize, DDQN differs from DQN in the following two aspects: (1) the action for the target is selected using the online DQN $\widetilde{Q}(s^{\prime},a^{\prime};\mathbf{w})$ with weights $\mathbf{w}$, and (2) the Q value in the target is evaluated by the target network with weights $\mathbf{w}^{-}$.
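A minimal sketch of how the DDQN target (24) is formed from a mini-batch is given below; `q_online` and `q_target` are placeholders for the two networks and are assumed to return a vector of Q values over the action set.

```python
import numpy as np

def ddqn_targets(rewards, next_states, q_online, q_target, gamma=0.9):
    """DDQN targets of (24): the online network (weights w) selects the
    greedy next action, the target network (weights w^-) evaluates it."""
    targets = np.empty(len(rewards))
    for i, (r, s_next) in enumerate(zip(rewards, next_states)):
        a_star = int(np.argmax(q_online(s_next)))          # action selection
        targets[i] = r + gamma * q_target(s_next)[a_star]  # action evaluation
    return targets
```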

III-C Model-Free Control of IRS Using Deep Reinforcement Learning

To apply reinforcement learning to model-free IRS configuration, we firstly model IRS assisted wireless communications as an MDP.

  • Agent: The agent is the IRS controller, which is capable of autonomously interacting with the environment via the IRS to meet the design objectives.

  • Environment: The environment refers to everything that the agent interacts with, which includes the BS, the wireless channel, the IRS, and the mobile users.

  • State: To facilitate accurate prediction of the expected reward and the next state given an action, we define the state as $\{\mathbf{H},\bm{\theta}\}$, which consists of two sub-states, namely the equivalent wireless channel $\mathbf{H}$ and the reflection vector $\bm{\theta}$ of the IRS.

  • Action: The action is defined as the incremental phase shift of the current reflection pattern, i.e.,

    $$\bm{\theta}^{(t+1)}=\bm{\theta}^{(t)}\odot\Delta\bm{\theta}^{(t)}\qquad(25)$$

    where $\odot$ is the Hadamard (element-wise) product, $\bm{\theta}^{(t)}$ is the reflection pattern at the $t$-th channel block, and $\Delta\bm{\theta}^{(t)}$ is the incremental phase shift of $\bm{\theta}^{(t)}$. We use a subset (or the full set) of the discrete Fourier transform (DFT) vectors as the action set. For example, when the size of the action space is 5, we set $\mathcal{A}=\left\{\mathbf{v}(-\frac{6}{N_{R}}),\mathbf{v}(-\frac{2}{N_{R}}),\mathbf{v}(0),\mathbf{v}(\frac{2}{N_{R}}),\mathbf{v}(\frac{6}{N_{R}})\right\}$, where $\mathbf{v}(\Psi_{R})$ is the steering vector (without loss of generality, we assume that the reflector array of the IRS is a uniform linear array (ULA)), i.e.,

    $$\mathbf{v}(\Psi_{R})=\left[1,\;e^{j\pi\Psi_{R}},\;\cdots,\;e^{j(N_{R}-1)\pi\Psi_{R}}\right]^{T}$$

    When $\Delta\bm{\theta}^{(t)}=\mathbf{v}(0)$, the sub-state $\bm{\theta}$ stays unchanged, and the sub-state $\mathbf{H}$ changes merely due to the variation of the NLoS components; $\Delta\bm{\theta}^{(t)}=\mathbf{v}(-\frac{2}{N_{R}})$ and $\Delta\bm{\theta}^{(t)}=\mathbf{v}(\frac{2}{N_{R}})$ point in opposite directions, which enables the agent to quickly recover from a poor action; $\Delta\bm{\theta}^{(t)}=\mathbf{v}(-\frac{6}{N_{R}})$ and $\Delta\bm{\theta}^{(t)}=\mathbf{v}(\frac{6}{N_{R}})$ are used to speed up the transition of the reflection pattern.

  • Reward: The immediate reward after the transition from $s$ to $s^{\prime}$ with action $a$ is defined as

    $$R=\begin{cases}P_{m},&{\rm when}\;\;P_{m}\geq P_{th}\\ P_{m}-100,&{\rm when}\;\;P_{m}<P_{th}\end{cases}\qquad(28)$$

    where $P_{th}$ is a performance threshold. When $P_{m}$ is less than $P_{th}$, we add a penalty of $-100$ to encourage the IRS to maximize performance while maintaining an acceptable performance above the threshold.

Remark 1.

The reasons for using the incremental phase shift, rather than the absolute phase shift, as the action are two-fold. On one hand, we need to preserve the Markov property of the state transition; on the other hand, we intend to reduce the size of the action space and accelerate the convergence rate.
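For concreteness, a minimal sketch of the MDP ingredients defined above is given below (NumPy); the ULA steering-vector form follows the action definition in Section III-C, while the function names and the threshold value are ours.

```python
import numpy as np

def steering_vector(psi, n_r=32):
    """ULA steering vector v(Psi) over N_R reflectors."""
    return np.exp(1j * np.pi * psi * np.arange(n_r))

def action_set_A5(n_r=32):
    """The size-5 action set A_5 of incremental DFT phase shifts."""
    return [steering_vector(psi, n_r)
            for psi in (-6 / n_r, -2 / n_r, 0.0, 2 / n_r, 6 / n_r)]

def apply_action(theta, delta_theta):
    """State transition of the reflection sub-state, eq. (25)."""
    return theta * delta_theta        # element-wise (Hadamard) product

def reward(p_m, p_th):
    """Reward of (28): the sum rate, with a -100 penalty below the threshold."""
    return p_m if p_m >= p_th else p_m - 100.0
```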

Algorithm 1 Double DQN based model-free IRS control for IRS-assisted wireless communications

Initialize parameters $s_{0},\epsilon$;
Initialize the FIFO memory $\mathcal{M}$ with size $N_{m}$;
Initialize the weights $\mathbf{w}$ of the DQN and set the target network weights as $\mathbf{w}^{-}=\mathbf{w}$;
for $t=0,1,2,\cdots$ do
  Input $s_{t}$ to the DQN and obtain the state-action values $\widetilde{Q}(s_{t},a;\mathbf{w}),\,a\in\mathcal{A}$;
  With $\widetilde{Q}(s_{t},a;\mathbf{w}),\,a\in\mathcal{A}$, select an action $a_{t}$ using the $\epsilon$-greedy policy;
  Receive the reward $r_{t+1}$ and the estimated channel response $\hat{\mathbf{H}}_{t+1}$, and compute the next state $s_{t+1}$ from $\hat{\mathbf{H}}_{t+1}$, $s_{t}$ and $a_{t}$;
  Store the experience tuple $\langle s_{t},a_{t},r_{t+1},s_{t+1}\rangle$ in the FIFO memory $\mathcal{M}$;
  if $|\mathcal{M}|\geq N_{e}$ then
    Randomly select a mini-batch of $N_{e}$ experience tuples $\langle s_{i},a_{i},r_{i+1},s_{i+1}\rangle$ from $\mathcal{M}$;
    Calculate the target values $T_{DQN,i}$ for the mini-batch according to (24);
    With the inputs $\{s_{i}\}$ and the targets $\{T_{DQN,i}\}$, train the DQN and update its weights $\mathbf{w}$;
    if $t\bmod N_{TNet}=0$ then update the weights of the target network, i.e., set $\mathbf{w}^{-}=\mathbf{w}$; end if
  end if
end for

Based on the modeled MDP and the DRL basics presented in Subsection III-B, we propose to maximize the expected return (cumulative discounted future reward) using Algorithm 1. Some of the key techniques applied in Algorithm 1 are explained as follows.

Figure 4: Structure of the DQN

III-C1 DDQN

Different from the naive DQN method, where the DQN $\widetilde{Q}(s,a;\mathbf{w})$ (with weights $\mathbf{w}$) is used to generate the target value, we use a separate target network $\widetilde{Q}(s,a;\mathbf{w}^{-})$ (with weights $\mathbf{w}^{-}$) to generate the target value, and the weights of the target network are updated as $\mathbf{w}^{-}=\mathbf{w}$ every $N_{TNet}$ time intervals. The structure of the network is shown in Figure 4. Specifically, we apply a deep residual network (ResNet) [40] to process the two sub-states (i.e., $\mathbf{H}$ and $\bm{\theta}$), and then we fuse the processed information of the two sub-states using a two-layer dense network. The activation function that we use is the Swish function [41].

III-C2 ϵ\epsilon-Greedy Policy

Given the perfect $\widetilde{Q}(s,a;\mathbf{w})$, the optimal policy is to select the action that yields the largest state-action value. However, obtaining the perfect $\widetilde{Q}(s,a;\mathbf{w})$ demands an infinite number of experiences, which is impractical in the dynamic wireless environment. Therefore, it is necessary for the agent to keep exploring to avoid getting stuck with a sub-optimal policy. To this end, we apply an $\epsilon$-greedy policy, in which $\epsilon$ is the probability of exploring, i.e., randomly selecting from all possible actions, and $1-\epsilon$ is the probability of exploiting the learned DQN in decision making. The $\epsilon$-greedy policy is represented as

$$\pi^{\epsilon}=\begin{cases}\pi^{*}(a|s),&{\rm w.p.}\;\;1-\epsilon\\ P(a)=\frac{1}{|\mathcal{A}|},&{\rm w.p.}\;\;\epsilon\end{cases}\qquad(31)$$

where $\pi^{*}(a|s)$ is the policy based on the Q-network, which is introduced in (20). In our design, $\epsilon$ is initially set to 1 and decreases exponentially at a rate of $\vartheta$ $(0<\vartheta<1)$ every time interval until it reaches the lower bound $\epsilon_{min}$.
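A minimal sketch of the $\epsilon$-greedy selection in (31) and the exponential decay of $\epsilon$ described above is given below; the decay rate and floor values are illustrative.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """eps-greedy policy of (31): explore uniformly with probability epsilon,
    otherwise exploit the action with the largest Q value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decay_epsilon(epsilon, vartheta=0.995, eps_min=0.05):
    """Exponential decay of epsilon at rate vartheta down to eps_min."""
    return max(epsilon * vartheta, eps_min)
```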

III-C3 Experience Replay

Instead of training the DQN with only the latest experience tuple, we store the $N_{m}$ most recent experience tuples in the memory $\mathcal{M}$ in a "first in, first out" (FIFO) manner, i.e., a queue data structure, and then randomly fetch a mini-batch of $N_{e}$ experience samples from $\mathcal{M}$ to train the DQN.
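A minimal sketch of such a FIFO memory, using Python's `deque` with a maximum length so that the oldest tuple is discarded first:

```python
import random
from collections import deque

class ReplayMemory:
    """FIFO experience memory M of size N_m with uniform mini-batch sampling."""
    def __init__(self, n_m=10000):
        self.buffer = deque(maxlen=n_m)     # oldest tuple is dropped first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, n_e):
        return random.sample(list(self.buffer), n_e)

    def __len__(self):
        return len(self.buffer)
```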

Figure 5: Sequence diagram of the proposed model-free IRS configuration

III-D Summarizing The Work Flow of Model-Free IRS Configuration

In this subsection, we summarize the work flow of the proposed model-free IRS configuration. To this end, we plot the sequence diagram in Figure 5.

According to Figure 5, in a specific loop, the IRS is configured with a reflection pattern ($\bm{\theta}^{(t+1)}=\bm{\theta}^{(t)}\odot\Delta\bm{\theta}^{(t)}$) according to the $\epsilon$-greedy policy, and then the UEs and the BS perform uplink pilot transmission and downlink data transmission sequentially as if no IRS existed. After that, the BS sends the estimated channel matrix ($\hat{\mathbf{H}}^{(t+1)}$) to the IRS controller, which can be done through wired communications, and the UEs feed back their performance metrics to the IRS controller. Finally, based on the received channel estimate ($\hat{\mathbf{H}}^{(t+1)}$) and performance feedback ($P_{m}^{(t+1)}$), the IRS controller forms the tuple $\langle\{\hat{\mathbf{H}}^{(t)},\bm{\theta}^{(t)}\},\Delta\bm{\theta}^{(t)},R^{(t+1)},\{\hat{\mathbf{H}}^{(t+1)},\bm{\theta}^{(t+1)}\}\rangle$ and stores it in the FIFO queue as training data for the DQN $\widetilde{Q}(s_{t},a;\mathbf{w})$.

Compared with traditional TDD multi-user MIMO, the extra effort of incorporating the IRS is merely the feedback of $\hat{\mathbf{H}}^{(t)}$ and $P_{m}^{(t+1)}$. The former can be easily achieved via wired communications between the BS and the IRS, and the latter costs negligible wireless communication resources of the mobile UEs. It is also noteworthy that the IRS controller is unaware of the working mechanism of the BS and UEs and does not require the CSI of the sub-channels.

IV Enhancing IRS Control Using Extremum Seeking Control

In DRL, the action space is restricted to achieve a fast convergence rate, which limits the phase freedom of the IRS. To enhance the control of the IRS, another model-free real-time optimization method, namely extremum seeking control (ESC), is used as the fine phase control of the IRS.

IV-A Model-Free Control of IRS Using ESC

ESC is a model-free method that realizes a learning-based adaptive controller for maximizing/minimizing certain system performance metrics [42, 43]. The first application of ESC can be traced back to the work of the French engineer Leblanc in 1922 to maintain efficient power transfer for a tram car [44]. The basic idea of ESC is to add a dither signal (e.g., a sinusoidal signal [45, 46] or random noise [47]) to the system input and observe its effect on the output, so as to obtain an approximate implicit gradient of the nonlinear static map of the unknown system [48, 42].

Following the design philosophy of ESC, we propose a dither-based model-free control of IRS as shown in Figure 6. Our design consists of three parts, i.e., a dither signal generation module, an ascent direction estimation module, and a parameter update module. The dither signal generation module generates a random dither/perturbation signal to probe the response $P_{m}(\cdot)$ of the system; the ascent direction estimation module determines the update direction of the system input according to the system performance $P_{m}(\bm{\varphi}+\Delta\bm{\varphi})$ so as to guarantee the monotonic increase of the performance, and it also controls the on-off switching of the random dither signal; the parameter update module updates the system input $\bm{\varphi}$ according to the estimated direction.

Figure 6: Principle of the proposed ESC-inspired dither-based iterative method

Specifically, the iterative process in Figure 6 runs as follows.

Step 1. Dither Signal Generation

Generate a small random dither signal from a uniform random distribution (the parameter selection will be explained in the following context), i.e.,

$$\Delta\bm{\varphi}[n]=\frac{a}{2^{r-1}},\;\;\;a\in\mathcal{U}\left\{-\frac{2^{r-1}}{N_{R}},\frac{2^{r-1}}{N_{R}}\right\}\qquad(32)$$

Then, add the dither signal $\Delta\bm{\varphi}$ to the parameter $\bm{\varphi}$, use $\bm{\varphi}+\Delta\bm{\varphi}$ as the input of the system, and receive the feedback of the performance metric $P_{m}(\bm{\varphi}+\Delta\bm{\varphi})$.

Step 2. Direction Estimation and Parameter Update

Condition 1. If $P_{m}(\bm{\varphi}+\Delta\bm{\varphi})\geq P_{m}(\bm{\varphi})$, adopt $\Delta\bm{\varphi}$ as the direction. Update the parameter as

$$\bm{\varphi}\leftarrow\bm{\varphi}+\Delta\bm{\varphi},\qquad(33)$$

and update the performance metric as

$$P_{m}(\bm{\varphi})\leftarrow P_{m}(\bm{\varphi}+\Delta\bm{\varphi})\qquad(34)$$

Then, jump to Step 1 for the next iteration;

Condition 2. Else, if $P_{m}(\bm{\varphi}+\Delta\bm{\varphi})<P_{m}(\bm{\varphi})$, adopt $-\Delta\bm{\varphi}$ as the direction. Update the parameter as

$$\bm{\varphi}\leftarrow\bm{\varphi}-\Delta\bm{\varphi}\qquad(35)$$

Turn off the dither signal, use only $\bm{\varphi}$ as the system input, and measure the system performance $P_{m}(\bm{\varphi})$. Then, jump to Step 1 for the next iteration.
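Putting Steps 1 and 2 together, one iteration of the dither-based method can be sketched as follows (NumPy); `measure(phi)` is a stand-in for the system's performance feedback $P_{m}(\cdot)$ and is not part of the paper's interface, while the quantized dither follows (32).

```python
import numpy as np

def random_dither(n_r=32, r=8, rng=None):
    """Quantized dither of (32): each entry is a / 2^(r-1), with a drawn
    uniformly from the integers {-2^(r-1)/N_R, ..., 2^(r-1)/N_R}."""
    rng = np.random.default_rng() if rng is None else rng
    a_max = 2 ** (r - 1) // n_r
    a = rng.integers(-a_max, a_max + 1, size=n_r)
    return a / 2 ** (r - 1)

def esc_iteration(phi, p_m_phi, measure, rng=None):
    """One iteration of Steps 1-2: probe phi + dither; keep it if the
    performance did not decrease (Condition 1), otherwise reverse the
    direction and re-measure (Condition 2)."""
    rng = np.random.default_rng() if rng is None else rng
    delta_phi = random_dither(len(phi), rng=rng)          # Step 1
    p_m_trial = measure(phi + delta_phi)
    if p_m_trial >= p_m_phi:                              # Condition 1
        return phi + delta_phi, p_m_trial
    phi_new = phi - delta_phi                             # Condition 2
    return phi_new, measure(phi_new)
```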

It is noteworthy that each iteration uses one or two time intervals, and each iteration can guarantee the monotonic increase of the performance metric $P_{m}$, which is validated in the following proposition.

Proposition 1.

Each iteration of the ESC based iterative process can guarantee the monotonic increase of the performance metric $P_{m}$, given that the norm of the random dither signal, i.e., $\|\Delta\bm{\varphi}\|$, is small enough.

Proof.

To prove Proposition 1, it suffices to validate that the operation (35) in Condition 2 of Step 2 guarantees the increase of $P_{m}$, i.e., that when $P_{m}(\bm{\varphi}+\Delta\bm{\varphi})<P_{m}(\bm{\varphi})$, the following inequality

$$P_{m}(\bm{\varphi}-\Delta\bm{\varphi})>P_{m}(\bm{\varphi})$$

holds.

To this end, we expand $P_{m}(\bm{\varphi}+\Delta\bm{\varphi})$ using the Taylor series of $P_{m}$ with respect to $\bm{\varphi}$, i.e.,

$$P_{m}(\bm{\varphi}+\Delta\bm{\varphi})=P_{m}(\bm{\varphi})+\frac{\partial P_{m}(\bm{\varphi})}{\partial\bm{\varphi}^{H}}\Delta\bm{\varphi}+\mathcal{O}(\|\Delta\bm{\varphi}\|^{2})\quad{\rm as}\;\;\Delta\bm{\varphi}\rightarrow 0\qquad(36)$$

Since $\|\Delta\bm{\varphi}\|$ is small, namely $\|\Delta\bm{\varphi}\|\rightarrow 0$, we adopt the first-order approximation, i.e.,

$$P_{m}(\bm{\varphi}+\Delta\bm{\varphi})\approx P_{m}(\bm{\varphi})+\frac{\partial P_{m}(\bm{\varphi})}{\partial\bm{\varphi}^{H}}\Delta\bm{\varphi}\qquad(37)$$

As it is reported by the system that $P_{m}(\bm{\varphi}+\Delta\bm{\varphi})<P_{m}(\bm{\varphi})$ in Condition 2, we have

$$\frac{\partial P_{m}(\bm{\varphi})}{\partial\bm{\varphi}^{H}}\Delta\bm{\varphi}<0\qquad(38)$$

Then, it is easy to verify that

$$P_{m}(\bm{\varphi}-\Delta\bm{\varphi})\approx P_{m}(\bm{\varphi})-\frac{\partial P_{m}(\bm{\varphi})}{\partial\bm{\varphi}^{H}}\Delta\bm{\varphi}>P_{m}(\bm{\varphi})\qquad(39)$$

which completes the proof. ∎

Remark 2.

As $\bm{\theta}[n]=e^{-j\pi\bm{\varphi}[n]}$, the operations (33) and (35) can be written w.r.t. $\bm{\theta}$ as follows:

$$\bm{\theta}\leftarrow\bm{\theta}\odot\Delta\bm{\theta}\qquad(40\text{a})$$
$$\bm{\theta}\leftarrow\bm{\theta}\odot\Delta\bm{\theta}^{*}\qquad(40\text{b})$$

IV-B Comparison with Gradient Ascent Search

To obtain further insight into the proposed dither-based method, we compare it with a well-known iterative algorithm, namely gradient ascent (descent, in minimization problems) search.

In the gradient ascent search algorithm, $\bm{\varphi}$ is updated in each iteration as follows:

$$\bm{\varphi}\leftarrow\bm{\varphi}+\underbrace{\gamma\frac{\partial P_{m}(\bm{\varphi})}{\partial\bm{\varphi}}}_{\Delta\bm{\varphi}}\qquad(41)$$

When the step size $\gamma$ is small, each iteration almost surely increases $P_{m}(\bm{\varphi})$, because

$$P_{m}(\bm{\varphi}+\Delta\bm{\varphi})\approx P_{m}(\bm{\varphi})+\gamma\frac{\partial P_{m}(\bm{\varphi})}{\partial\bm{\varphi}^{H}}\frac{\partial P_{m}(\bm{\varphi})}{\partial\bm{\varphi}}=P_{m}(\bm{\varphi})+\gamma\left\|\frac{\partial P_{m}(\bm{\varphi})}{\partial\bm{\varphi}}\right\|_{2}^{2}\geq P_{m}(\bm{\varphi})\qquad(42)$$
Remark 3.

As gradient ascent search takes steps in the direction of the gradient, it is also called steepest ascent (or steepest descent, in minimization problems). Thus, the convergence rate of gradient ascent search is faster than that of the dither-based extremum search when the step size $\gamma$ is properly selected. On the other hand, it is also noteworthy that gradient ascent search requires the exact expression of the gradient $\frac{\partial P_{m}(\bm{\varphi})}{\partial\bm{\varphi}}$, whilst the dither-based method is implemented through trial and error and does not rely on any explicit knowledge of the wireless system's internal working mechanism.

IV-C Integrating ESC Into DRL

Figure 7: The upgraded frame structure for the integration of ESC into DRL

Recall that the action space for DRL is intentionally restricted to achieve a fast convergence rate, whilst the dither-based iterative method relies on small-scale phase shifts. Therefore, the dither-based iterative method is complementary to the action of DRL and can be applied to enhance it.

The enhanced action of DRL is defined as follows.

Step 1. Coarse phase shift (CPS). When $l=0$, set

$$\bm{\theta}^{(t+1)}_{temp}=\bm{\theta}^{(t)}\odot\Delta\bm{\theta}_{c}^{(t)}\qquad(43)$$

where $\bm{\theta}^{(t)}$ is the reflection pattern at the $t$-th channel block, $\Delta\bm{\theta}_{c}^{(t)}$ is the coarse incremental phase shift at the $t$-th channel block, and $\bm{\theta}^{(t+1)}_{temp}$ is the intermediate reflection pattern for the $(t+1)$-th channel block. Following the example in Section III-C, the action set of the incremental phase shift $\Delta\bm{\theta}_{c}^{(t)}$ is $\mathcal{A}=\left\{\mathbf{v}(-\frac{6}{N_{R}}),\mathbf{v}(-\frac{2}{N_{R}}),\mathbf{v}(0),\mathbf{v}(\frac{2}{N_{R}}),\mathbf{v}(\frac{6}{N_{R}})\right\}$.

Step 2. Fine phase shift (FPS)
For $l$ from 1 to $L$, do

$$\bm{\theta}^{(t+1)}_{temp}\leftarrow\bm{\theta}^{(t+1)}_{temp}\odot\Delta\bm{\theta}_{f}\qquad(44)$$

where $\Delta\bm{\theta}_{f}=\Delta\bm{\theta}$ (or $\Delta\bm{\theta}_{f}=\Delta\bm{\theta}^{*}$) is the ascent direction, and $\Delta\bm{\theta}$ is the random dither signal.

Remark 4.

For example, when the quantization level is $r=8$ and $N_{R}=32$, the step of the coarse phase shift is $\frac{2\pi}{32}$, and the step of the fine phase shift is $\frac{2\pi}{256}$ with the range $[-\frac{\pi}{32},\;\frac{\pi}{32}]$.

To be compatible with the enhanced action, the frame structure needs to be updated as in Figure 7. In the first $K$ time slots, the UEs transmit pilots and the BS performs channel estimation with the reflection pattern in (43); in the subsequent $(L-1)K$ time slots, the UEs repeatedly transmit pilots and the BS performs channel estimation, while the reflection pattern is updated as in (44). It is noteworthy that, as the performance feedback is provided once per channel block, the performance metric used for the dither-based method is an approximation obtained by replacing the true channel response $\mathbf{H}$ in (10) with the channel estimate $\hat{\mathbf{H}}$.
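Under this upgraded frame structure, one channel block of the integrated scheme can be sketched as follows (NumPy): a single coarse DRL increment as in (43), followed by fine dither-based refinements as in (44) and (40). `measure_from_estimate(theta)` is a placeholder for the approximate performance metric computed from the channel estimate $\hat{\mathbf{H}}$, and the dither convention of (32) with $\bm{\theta}[n]=e^{-j\pi\bm{\varphi}[n]}$ is assumed.

```python
import numpy as np

def enhanced_action(theta, delta_theta_c, measure_from_estimate, L,
                    r=8, rng=None):
    """Enhanced action of Section IV-C: coarse phase shift (43) followed by
    L-1 fine phase shifts (44), keeping or reversing each dither as in (40)."""
    rng = np.random.default_rng() if rng is None else rng
    n_r = len(theta)
    a_max = 2 ** (r - 1) // n_r

    theta = theta * delta_theta_c                     # Step 1: CPS, eq. (43)
    p_m = measure_from_estimate(theta)

    for _ in range(L - 1):                            # Step 2: FPS, eq. (44)
        delta_phi = rng.integers(-a_max, a_max + 1, size=n_r) / 2 ** (r - 1)
        delta_theta = np.exp(-1j * np.pi * delta_phi)
        trial = measure_from_estimate(theta * delta_theta)
        if trial >= p_m:
            theta, p_m = theta * delta_theta, trial       # keep, eq. (40a)
        else:
            theta = theta * np.conj(delta_theta)          # reverse, eq. (40b)
            p_m = measure_from_estimate(theta)
    return theta
```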

Remark 5.

The parameter $L$ can be set adaptively according to the channel dynamics. For a practical wireless communication system, different values of $L$ correspond to different operating modes.

V Numerical Results

In this section, we present numerical results to verify the effectiveness of our proposed model-free control of IRS. The simulation code is available at https://github.com/WeiWang-WYS/IRSconfigurationDRL.

V-A Simulation Parameters

The BS is equipped with a ULA placed along the direction $[1,0,0]$ (i.e., the x-axis), the IRS is a ULA placed along the direction $[0,1,0]$ (i.e., the y-axis), and the UEs are each equipped with a single antenna. The number of users is $K=2$, the number of BS antennas is $N_{B}=2$, and the number of IRS reflectors is $N_{R}=32$. The antenna/reflector elements of the BS and the IRS both have half-wavelength spacing. The position of the BS is $[0,0,10]$, the position of the IRS is $[-2,5,5]$, and the UEs are uniformly distributed in the area $[0,10)\times[0,10)$ at a height of $1.5$. The noise variance at the BS side is $\sigma_{B}^{2}=0.1$, and the noise variance at the UE side is $\sigma^{2}_{k}=0.5,\;\forall k\in\{1,\cdots,K\}$. Each channel episode consists of 20 channel blocks. In each channel episode, the LoS component is generated by randomly selecting the user locations within the area $[0,10)\times[0,10)$, and the LoS component is time-invariant within the 20 channel blocks of that episode.

The LoS channel between BS and IRS is

$$\mathbf{H}_{BR,LoS}=\mathbf{v}_{R}\mathbf{v}_{B}^{H}\qquad(45)$$

where the steering vectors are represented as

$$\mathbf{v}_{R}=\mathbf{v}(\Psi_{R},N_{R,y})=\left[1,\;e^{j\pi\Psi_{R}},\;\cdots,\;e^{j(N_{R,y}-1)\pi\Psi_{R}}\right]^{T}$$
$$\mathbf{v}_{B}=\mathbf{v}(\Psi_{B},N_{B,x})=\left[1,\;e^{j\pi\Psi_{B}},\;\cdots,\;e^{j(N_{B,x}-1)\pi\Psi_{B}}\right]^{T}$$

and, according to [49], the directional cosines $\Psi_{R},\Psi_{B}$ are given by

$$\Psi_{R}=[0,1,0]\,\mathbf{e}_{BR}=\mathbf{e}_{BR}(2)\qquad(47\text{a})$$
$$\Psi_{B}=[1,0,0]\,\mathbf{e}_{BR}=\mathbf{e}_{BR}(1)\qquad(47\text{b})$$

where the direction vector $\mathbf{e}_{BR}$ is determined by the relative positions of the BS and the IRS, i.e.,

$$\mathbf{e}_{BR}\triangleq\frac{\mathbf{p}_{B}-\mathbf{p}_{R}}{\|\mathbf{p}_{B}-\mathbf{p}_{R}\|_{2}}\qquad(48)$$

The NLoS components are Gaussian distributed, i.e., $\mathbf{H}_{BR,NLoS}(\ell,\kappa)\sim\mathcal{CN}(0,1)$. The channel matrices $\mathbf{H}_{BU}$ and $\mathbf{H}_{RU}$ are generated in the same way.

Figure 8: Moving average of $P_{m}$ for DRL under different values of the Rician factor: (a) $K_{Rician}=5$, (b) $K_{Rician}=10$, (c) $K_{Rician}=15$, (d) $K_{Rician}=20$.

V-B Performance Study of DRL

Figure 9: Performance of DRL with different action sets: (a) moving average of $P_{m}$ for DRL; (b) average sum rate $\mathbb{E}(P_{m})$ (orange) and probability $\Pr(P_{m}>10)$ (blue).

In Figure 8, we study the performance of the proposed DRL scheme (with the action set $\mathcal{A}_{5}$) under different values of the Rician factor. The x-axis represents the episode, and the y-axis represents the moving average of $P_{m}$ (with a window length of 64). As can be seen, the proposed DRL scheme significantly outperforms the benchmark schemes, i.e., random reflection and multi-armed bandit (MAB). Different from DRL, the actions used for random reflection and MAB are absolute phase shifts, and their action set is the set of DFT vectors. Although all three schemes are independent of the sub-channel CSI, they utilize the remaining information differently. Random reflection uses no information at all and undoubtedly achieves the worst performance. MAB assumes a fixed distribution of rewards and explores the reward distributions of all arms; however, it fails to describe the state of the environment and to build the connection between the action and the environment. DRL defines an appropriate state to represent the agent's "position" within the environment and learns the quality of a state-action combination using the DQN from the information of rewards and states, which enables the agent to choose the best action to maximize the return. We can also find that the performance gap between DRL and the benchmark schemes becomes larger as the Rician factor $K_{Rician}$ increases, which indicates that the effectiveness of DRL also depends on the radio environment.

In Figure 9, we study the impact of the action-set size on the performance of the proposed DRL scheme when the Rician factor is $K_{Rician}=10$. In addition to the action set $\mathcal{A}_{5}=\left\{\mathbf{v}(-\frac{6}{N_{R}}),\mathbf{v}(-\frac{2}{N_{R}}),\mathbf{v}(0),\mathbf{v}(\frac{2}{N_{R}}),\mathbf{v}(\frac{6}{N_{R}})\right\}$ defined in Section III, we adopt the action set $\mathcal{A}_{3}=\left\{\mathbf{v}(-\frac{2}{N_{R}}),\mathbf{v}(0),\mathbf{v}(\frac{2}{N_{R}})\right\}$ and the action set $\mathcal{A}_{32}=\left\{\mathbf{v}(-1),\mathbf{v}(-1+\frac{2}{N_{R}}),\cdots,\mathbf{v}(1-\frac{2}{N_{R}})\right\}$ (namely, the DFT matrix) as benchmarks. From Figure 9(a), we can see that $\mathcal{A}_{5}$ is the fastest to converge, while $\mathcal{A}_{32}$ is the slowest. Although a large action set speeds up the response rate of the agent, it demands more time for the DQN to converge. In the convergence region, we find that $\mathcal{A}_{5}$ and $\mathcal{A}_{32}$ achieve similar performance, while $\mathcal{A}_{3}$'s performance is inferior. This indicates that a well-designed action set of moderate size can be better than both a small and an over-large one. In Figure 9(b), the average sum rate $\mathbb{E}(P_{m})$ and the probability $\Pr(P_{m}>10)$ are presented in the bar chart. For $\mathcal{A}_{5}$, $\mathbb{E}(P_{m})=9.89$ bps/Hz and $\Pr(P_{m}>10)=58.05\%$; for $\mathcal{A}_{3}$, $\mathbb{E}(P_{m})=9.34$ bps/Hz and $\Pr(P_{m}>10)=42.9\%$; for $\mathcal{A}_{32}$, $\mathbb{E}(P_{m})=9.60$ bps/Hz and $\Pr(P_{m}>10)=49.35\%$. This further verifies the importance of action set design.

V-C Performance Study of ESC

Figure 10: Performance of ESC-inspired dither-based iterative method

In Figure 10, we study the performance of the ESC-inspired dither-based iterative method in a specific channel block. Each time interval on the x-axis consists of $K$ time slots, which is the pilot length required by the BS to estimate the multi-user channel $\mathbf{H}$. As can be seen, the performance metric $P_{m}$ of the dither-based method is almost monotonically increasing over time. Note that the performance metric used to guide the dither-based iterative method is an approximation rather than the authentic feedback; thus, the slight fluctuation of the performance curve is reasonable. For comparison, we adopt a model-based method as the benchmark, in which the perfect sub-channel CSI, namely $\mathbf{H}_{BU}$, $\mathbf{H}_{BR}$, and $\mathbf{H}_{RU}$, is available. The optimal reflection coefficient vector $\bm{\theta}$ is derived by solving the optimization problem (3). Recall that the difference from model-free control is that the exact relationship between the objective function $P_{m}$ and the variable $\bm{\theta}$ is known in model-based methods. Due to the discrete nature of the feasible region $\mathcal{B}$, the optimization problem is intractable. Hence, we solve it using the alternating optimization technique, which alternately freezes $N_{R}-1$ reflection coefficients and optimizes only one reflection coefficient. According to Figure 10, the model-free dither-based method achieves almost the same performance as the model-based alternating optimization when the time index is greater than 500. However, in practice, we have to weigh the cost of time resources against the benefits. As the dither-based method needs to sample $\hat{\mathbf{H}}$, one iteration costs a unit of time resource in wireless communications. Thus, in order to balance the time allocation between pilot transmission and data transmission, the time resources dedicated to the dither-based method (i.e., the time for pilot transmission) should be deliberately selected according to the channel dynamics (i.e., the length of a channel block).

V-D Performance Study of the Integrated DRL and ESC

Figure 11: Moving average of the normalized $P_{m}$ for the integrated DRL and ESC under different channel dynamics: (a) $L_{B}=100$, (b) $L_{B}=400$, (c) $L_{B}=800$, (d) $L_{B}=1200$.

In Figure 11, we study the performance of the integrated DRL and ESC method (with the action set $\mathcal{A}_{5}$) in different channel dynamics. The normalized $P_{m}$ is obtained by multiplying $P_{m}$ by the coefficient $\frac{L_{B}-L}{L_{B}}$, where $L_{B}$ is the channel block length and $L$ is the training length. Taking $L_{B}=100$ and $L=1$ as an example, the first time interval is used for pilot transmission and the remaining $L_{B}-L=99$ time intervals are used for data transmission. It is also noteworthy that when $L=1$ the scheme reduces to DRL only, whereas when $L\geq 2$ it is the integrated DRL and ESC. From the figure, we can see that when the channel block length is $L_{B}=100$, DRL alone outperforms the integrated DRL and ESC with $L=20$ and $L=40$. However, as the channel block length increases, the integrated DRL and ESC gradually becomes superior, which verifies its effectiveness in slow fading channels. Therefore, the parameter $L$ of the integrated DRL and ESC can be set adaptively to accommodate different channel dynamics.
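A small sketch of this normalization, together with one possible rule for choosing the training length $L$ adaptively, is given below. The selection rule is illustrative (it simply maximizes the normalized metric over a candidate set), and eval_Pm, which returns the achieved $P_{m}$ for a given $L$, is a hypothetical stand-in for a per-block simulation or measurement.

```python
def normalized_metric(P_m, L_B, L):
    # Only L_B - L of the L_B time intervals in a block carry data.
    return P_m * (L_B - L) / L_B

def pick_training_length(L_B, candidates, eval_Pm):
    # Choose the L that maximizes the normalized metric; L = 1 corresponds to
    # DRL only, L >= 2 to the integrated DRL and ESC.
    return max(candidates, key=lambda L: normalized_metric(eval_Pm(L), L_B, L))

# Usage (eval_Pm is hypothetical, e.g., a per-block simulation):
# best_L = pick_training_length(L_B=800, candidates=[1, 20, 40], eval_Pm=eval_Pm)
```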

VI Conclusion

In this paper, we have proposed a model-free control of IRS that is independent of the sub-channel CSI. We first modeled the control of IRS as an MDP and applied DRL to perform real-time coarse phase control of IRS. Then, we applied ESC as the fine phase control of IRS. Finally, by updating the frame structure, we integrated DRL and ESC in the model-free control of IRS to improve its adaptivity to different channel dynamics. Numerical results demonstrated the superiority of the proposed scheme for model-free IRS control without sub-channel CSI.
