
iRDRC: An Intelligent Real-time Dual-functional Radar-Communication System for Automotive Vehicles

Nguyen Quang Hieu, Dinh Thai Hoang, Nguyen Cong Luong, and Dusit Niyato

N. Q. Hieu and D. Niyato are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: {quanghieu.nguyen, dniyato}@ntu.edu.sg). D. T. Hoang is with the School of Electrical and Data Engineering, University of Technology Sydney, Sydney, NSW 2007, Australia (e-mail: [email protected]). N. C. Luong is with the Faculty of Computer Science, PHENIKAA University, Hanoi 12116, Vietnam (e-mail: [email protected]).
Abstract

This letter introduces an intelligent Real-time Dual-functional Radar-Communication (iRDRC) system for autonomous vehicles (AVs). This system enables an AV to perform both radar and data communications functions to maximize bandwidth utilization as well as significantly enhance safety. In particular, the data communications function allows the AV to transmit data, e.g., of current traffic, to edge computing systems, and the radar function is used to enhance the reliability and reduce the collision risks of the AV, e.g., under bad weather conditions. The problem of the iRDRC is to decide when to use the communication mode or the radar mode to maximize the data throughput while minimizing the miss detection probability of unexpected events given the uncertainty of the surrounding environment. To solve the problem, we develop a deep reinforcement learning algorithm that allows the AV to quickly obtain the optimal policy without requiring any prior information about the environment. Simulation results show that the proposed scheme outperforms baseline schemes in terms of data throughput, miss detection probability, and convergence rate.

Index Terms:
Joint Radar-Communications, Autonomous Vehicle, Deep Reinforcement Learning, MDP.

I Introduction

AUTONOMOUS vehicles (AVs) are required to navigate efficiently and safely in complex and uncontrolled environments [1]. To meet these requirements, Dual-Functional Radar-Communication (DFRC) system design has been recently proposed as a promising technology for AVs. The DFRC allows an AV to jointly implement radar and communication functions. In particular, with the radar function, the AV is able to accurately detect the presence of distant objects or unexpected events even under bad weather conditions and poor visibility. With the communication function, the AV can use communication channels to communicate with road-side units, base stations, and edge computing systems, e.g., by using vehicle-to-infrastructure (V2I) and vehicle-to-network (V2N), to facilitate intelligent road management, route selection, and data analysis [2, 3].

Since the DFRC system implements both radar and communications using a single hardware device, these functionalities share some system resources such as antennas, spectrum, and power. As a result, one major problem of the AV is how to optimize the resource sharing between the radar function and communication function. In particular, the problem of the AV is how to optimize the selection between the radar mode and communication mode.

Recently, some resource sharing approaches have been proposed to solve the problem. In particular, the authors in [2] proposed to adopt the IEEE 802.11ad standard for the joint radar-communication in an AV system. Accordingly, the AV reserves preamble blocks in the IEEE 802.11ad frame for the radar mode, i.e., to estimate the ranges and velocities of surrounding objects, and uses data blocks for the data transmission. Different from [2], the time sharing approach proposed in [4] uses time cycles instead of the standard frames. Then, time portions in the time cycle are allocated to the radar mode and communication mode to maximize the radar estimation rate and communication rate of the radar-communication system. Considering the communication system for the AVs in the V2I scenario, the authors in [3] proposed a method to reduce the beam alignment overhead between the AVs and the infrastructure. However, the radar's performance on object detection is not considered.

In general, the approaches in [2, 4, 3] are fixed schedule schemes that are not appropriate to implement in practice because the surrounding environment of the AV is uncertain and dynamic. To maximize the resource efficiency under an uncertain environment, adaptive algorithms for the radar and communication mode selection are required. For example, when the weather is in a bad condition, e.g., heavy rain, the AV can select the radar mode more frequently to improve the radar performance to detect unexpected events on the road. In contrast, when the weather and the communication channel are in good conditions, the AV can select the communication mode more frequently to transmit its data. However, it is challenging for the AV to determine optimal decisions because the environment states, e.g., the weather and road states as well as the communication channel state, are dynamic and uncertain. In this letter, we thus develop a deep reinforcement learning (DRL) technique that enables the AV to find the optimal selection of the radar mode and communication mode without prior knowledge of the environment. To the best of our knowledge, this is the first approach using DRL to solve the mode selection problem of the DFRC in an AV. For this, we first formulate the AV's problem as a Markov decision process (MDP). Then, we develop the DRL with Deep Q-Network (DQN) [7] algorithm to achieve the optimal policy for the AV. Simulation results show that the proposed DRL outperforms baseline schemes in terms of higher data throughput, lower miss detection probability, and shorter convergence time.

II System Model

Figure 1: Autonomous vehicle with a DFRC system.

The system model with an AV is shown in Fig. 1. The AV is equipped with a DFRC equipment that enables the AV to work in two modes, i.e., the radar mode and the communication mode. Typically, the radar and communication modes can be allocated in time cycles, in which each time cycle is divided between the radar mode and the communication mode [4]. Unlike [4], we consider that each time cycle/step is allocated to either the radar mode or the communication mode. This enables the AV to effectively change the mode based on the current observation of the environment, rather than based on the previous time cycle as in [4].

II-A Dual-functional Radar-Communication Model

In the communication mode, the AV uses the V2I capability to transmit data, e.g., of current road traffic or live on-board video streaming, to the Base Stations (BSs) distributed along the road. Assume that the AV uses a single channel for the data transmission and has a data queue for storing incoming data packets, e.g., from its sensor devices. Let $D$ be the capacity of the data queue. In the radar mode, the AV uses an automotive millimeter-wave radar to detect unexpected events. As shown in Fig. 1, the radar mode can be used to detect unexpected events, e.g., a car coming from another road obscured by a truck. In particular, we define an unexpected event as an event that can possibly cause a collision with the AV. We consider that the occurrence of an unexpected event is influenced by four main factors: the road condition, the weather condition, the speed of the AV, and nearby moving objects [5, 6]. Note that the values of these factors can be obtained by the AV's sensing system, e.g., road friction sensor, weather station instrument, speedometer, and cameras.

Let $r\in\{0,1\}$, $w\in\{0,1\}$, $v\in\{0,1\}$, and $m\in\{0,1\}$ be the road state, weather state, speed state, and moving object state, respectively. In particular, $r=1$, $w=1$, $v=1$, and $m=1$ represent unfavorable conditions, e.g., slippery road, rainy weather, high speed of the AV, and a moving object nearby, respectively. In contrast, $r=0$, $w=0$, $v=0$, and $m=0$ express favorable conditions, e.g., straight road, good weather, low speed, and no moving object nearby, respectively. Let $p_j^i$ denote the probability that an unexpected event occurs under condition $j$ (where $j\in\{0,1\}$ corresponds to the favorable or unfavorable condition, respectively) of factor $i$, $i\in\{r,w,v,m\}$. For example, $p_1^r$ expresses the probability that an unexpected event occurs given the slippery road condition, i.e., $r=1$. Note that the generalization of the states beyond 0 and 1 is straightforward. For example, the speed of the AV can be divided into multiple levels, e.g., low, medium, and high.

II-B Environment Model

To model the dynamics of the environment, the probabilities $p_j^v$ and $p_j^w$ are taken from the real-world data in [5, 6], and the other probabilities are assumed to be pre-defined. Then, we can determine the probability that an unexpected event occurs given the factor states $(r,w,v,m)$ using Bayes' theorem. For this, let $\oplus$ denote the occurrence of an unexpected event, and $\ominus$ denote that no unexpected event occurs. Let $\tau_i$ be the probability that factor $i$ is at state 0, where $i\in\{r,w,v,m\}$. Thus, the probability that factor $i$ is at state 1 is $1-\tau_i$. By using Bayes' theorem, the probability that an unexpected event occurs given the factor states $(r,w,v,m)$ is determined by:

P(\oplus)=\sum_{i\in\{r,w,v,m\}}\left(\tau_{i}p_{0}^{i}+(1-\tau_{i})p_{1}^{i}\right).   (1)

In general, when the probability of an unexpected event, $P(\oplus)$, is high, the environment is more dynamic and uncertain. We introduce a metric, i.e., the miss detection probability, to evaluate the performance of the proposed system. The miss detection probability is defined as the ratio of the number of unexpected events that the AV cannot detect to the total number of unexpected events on the road. A high miss detection probability results in a high risk of accident for the AV. The second metric used to evaluate the performance of the proposed system is the data throughput, defined as the average number of packets per time unit that are successfully transmitted from the AV to the BSs. Note that we assume that the accuracy of the automotive radar system is perfect, i.e., there is no miss detection or false alarm, when the AV uses the radar mode. However, the system model can be straightforwardly extended by considering the miss detection and false alarm caused by the sensing accuracy of the radar. In this case, the proposed DRL scheme can still work well as it can learn these parameters through real-time interactions with the environment.
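To make the environment model concrete, the following Python sketch illustrates one way to compute $P(\oplus)$ as in (1) and to sample unexpected events given the observed factor states. The values of $\tau_i$, and of the probabilities $p_j^i$ other than $p_1^v$ and $p_1^w$ (which follow Section V-A), are illustrative assumptions rather than parameters prescribed by the model.

```python
import random

# Illustrative per-factor probabilities p_j^i (j = 0: favorable, j = 1: unfavorable).
# p_1^v and p_1^w follow the values quoted in Section V-A; the others are assumptions.
P = {"r": (0.005, 0.05), "w": (0.005, 0.046), "v": (0.005, 0.1), "m": (0.005, 0.05)}
# Assumed probabilities tau_i that factor i is in its favorable state.
TAU = {"r": 0.9, "w": 0.9, "v": 0.8, "m": 0.9}

def event_probability():
    """Marginal probability of an unexpected event, P(+), as in (1)."""
    return sum(TAU[i] * P[i][0] + (1.0 - TAU[i]) * P[i][1] for i in P)

def sample_event(state):
    """Sample whether an unexpected event occurs given the observed factor states,
    e.g., state = {"r": 1, "w": 0, "v": 1, "m": 0} (conditional counterpart of (1))."""
    p = sum(P[i][state[i]] for i in P)
    return random.random() < p

print(event_probability())
print(sample_event({"r": 1, "w": 1, "v": 1, "m": 1}))
```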

Intuitively, to minimize the miss detection probability, the AV can use the radar mode more frequently to detect unexpected events, but this reduces the data throughput. Conversely, to increase the throughput, the AV can use the communication mode more frequently, but this may increase the miss detection probability. Considering this tradeoff and the uncertainty of the environment, the AV's decision making problem can be modeled as an MDP. We then develop a DRL algorithm to quickly obtain the optimal policy for the AV without requiring complete information about the environment. The DRL scheme is developed in Section IV, and its ability to quickly find the optimal policy is evaluated in Section V-B.

III Problem Formulation

To formulate the problem by using the MDP, we define a tuple $\langle\mathcal{S},\mathcal{A},\mathcal{R},\mathcal{P}\rangle$, where $\mathcal{S}$, $\mathcal{A}$, $\mathcal{R}$, and $\mathcal{P}$ are the state space, action space, reward function, and state transition probability of the AV, respectively. Note that the transition probability $\mathcal{P}$ is unknown to the AV in advance.

III-A Action Space and State Space

At each time step, the AV decides to use either the communication mode or the radar mode. Let $\mathcal{A}$ denote the action space of the AV, $\mathcal{A}=\{a;\ a\in\{0,1\}\}$, where $a=0$ means that the AV chooses the communication mode, and $a=1$ means that the AV chooses the radar mode. The state of the AV is the combination of (i) the state of the data queue, (ii) the state of the channel that the AV uses for its data communication, (iii) the state of the road, (iv) the weather state, (v) the speed state of the AV, and (vi) the nearby moving object state. Thus, the state space of the AV can be defined as

\mathcal{S}=\Big\{(d,c,r,w,v,m);\ d\in\{0,1,\ldots,D\},\ c\in\{0,1\},\ r\in\{0,1\},\ w\in\{0,1\},\ v\in\{0,1\},\ m\in\{0,1\}\Big\},   (2)

where $d$ represents the state of the data queue, i.e., the number of packets in the data queue, and $c$ refers to the state of the communication channel that the AV uses to transmit data to the BSs, with $c=0$ if the channel is good, i.e., low interference, and $c=1$ if the channel is bad, i.e., high interference. $r$, $w$, $v$, and $m$ are defined in Section II-A. The state of the system at time step $t$ is defined as $s_t=(d,c,r,w,v,m)\in\mathcal{S}$.
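For illustration, a state $s_t=(d,c,r,w,v,m)$ can be encoded as a six-dimensional vector that is fed directly to the Q-network described in Section IV; the normalization of the queue length below is one possible choice, not a prescribed part of the model.

```python
import numpy as np

D = 10  # data queue capacity (packets), as set in Section V-A

def encode_state(d, c, r, w, v, m):
    """Encode the MDP state (d, c, r, w, v, m) as a 6-dimensional input vector.

    The queue length d is normalized to [0, 1]; the remaining components are
    already binary indicators.
    """
    return np.array([d / D, c, r, w, v, m], dtype=np.float32)

s_t = encode_state(d=3, c=0, r=1, w=0, v=1, m=0)
print(s_t.shape)  # (6,), matching the 6-unit input layer in Section V-A
```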

III-B Reward Function

At each time step $t$, the AV chooses an action $a_t\in\mathcal{A}$ at state $s_t\in\mathcal{S}$ and receives an immediate reward $r_t$. The reward is designed to encourage the AV to increase the data throughput and, at the same time, decrease its miss detection probability. For this, we define the reward function as follows.

When the AV selects the communication mode and the channel state is good, the AV successfully transmits $\nu_1$ packets and receives a reward $r_1$. When the AV selects the communication mode and the channel is bad, the AV successfully transmits $\nu_2$ packets and receives a reward $r_2$. Moreover, when the AV selects the communication mode and an unexpected event occurs, the AV receives a penalty of $-r_3$. When the AV selects the radar mode and does not detect any unexpected event, the AV receives no reward. Otherwise, when the AV selects the radar mode and detects an unexpected event, the AV receives a reward that is proportional to the number of unfavorable conditions in $\{r,w,m,v\}$, i.e., the number of values of 1 in $\{r,w,m,v\}$. This means that the AV receives a high reward if the probability of an unexpected event is high, e.g., the AV is under very unfavorable conditions, and the unexpected event is detected. This definition encourages the AV to use the radar mode when the environment conditions are unfavorable. In summary, the immediate reward can be defined as follows:

r_{t}=\begin{cases}+r_{1},&\text{if }a_{t}=0,\ c=0,\text{ given }\ominus,\\ +r_{2},&\text{if }a_{t}=0,\ c=1,\text{ given }\ominus,\\ -r_{3},&\text{if }a_{t}=0,\text{ given }\oplus,\\ +r_{4}(b+1),&\text{if }a_{t}=1,\text{ given }\oplus,\\ 0,&\text{if }a_{t}=1,\text{ given }\ominus,\end{cases}   (3)

where $b$ is the number of values of 1 in the set $\{r,w,m,v\}$. Note that the probability that an unexpected event occurs given $\{r,w,m,v\}$, $P(\oplus)$, is defined in (1).
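The reward in (3) reduces to a simple lookup on the chosen action, the channel state, and the event outcome. A minimal sketch is given below, assuming the reward values $(r_1,r_2,r_3,r_4)=(2,1,50,5)$ that are used later in Section V-A.

```python
R1, R2, R3, R4 = 2, 1, 50, 5   # reward values from Section V-A

def reward(a_t, c, event, factor_states):
    """Immediate reward r_t as defined in (3).

    a_t: 0 = communication mode, 1 = radar mode.
    c:   0 = good channel, 1 = bad channel.
    event: True if an unexpected event occurs at this time step.
    factor_states: dict of the binary factors {r, w, v, m}.
    """
    b = sum(factor_states.values())            # number of unfavorable conditions
    if a_t == 0:                               # communication mode
        if event:
            return -R3
        return R1 if c == 0 else R2
    return R4 * (b + 1) if event else 0        # radar mode

print(reward(a_t=1, c=0, event=True, factor_states={"r": 1, "w": 1, "v": 0, "m": 0}))
```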

In this paper, we aim to find the optimal policy for the AV, denoted by $\pi^*$, to maximize its long-term discounted cumulative reward, i.e., the discounted return, defined by

\max_{\pi}\ G(\pi)=\mathbb{E}\Big\{\sum_{t=0}^{T}\gamma^{t}r_{t+1}(\pi)\Big\},   (4)

where $G(\pi)$ is the expected discounted return under the policy $\pi$, $r_{t+1}(\pi)$ is the immediate reward under policy $\pi$ at time step $t+1$, $T$ is the time horizon, and $\gamma\in(0,1)$ is the discount factor. The optimal policy $\pi^*$ allows the AV to make optimal decisions at any state $s_t$, i.e., $a^*_t=\pi^*(s_t)$.
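As a small illustration of the objective in (4), the discounted return of a finite reward trajectory can be computed as follows; the discount factor and the reward sequence are placeholders, not values from our simulations.

```python
GAMMA = 0.9   # assumed discount factor

def discounted_return(rewards, gamma=GAMMA):
    """Discounted return G = sum_t gamma^t * r_{t+1} over a finite horizon."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([2, 2, -50, 5, 0]))   # e.g., a sequence of rewards drawn from (3)
```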

To find the optimal policy for the AV, standard Q-learning [8] can be adopted to estimate the Q-values of all state-action pairs, i.e., $Q(s,a)$. However, the Q-values are iteratively updated in a Q-table, and thus Q-learning suffers from the large state space problem. Therefore, we propose to use the DRL with DQN to quickly find the optimal policy.
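For reference, the tabular Q-learning baseline updates a single entry $Q(s_t,a_t)$ per time step using the standard rule $Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha[r_t+\gamma\max_a Q(s_{t+1},a)-Q(s_t,a_t)]$. A minimal sketch, with assumed hyperparameters, is shown below; states are represented as hashable tuples $(d,c,r,w,v,m)$.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1           # assumed hyperparameters
Q = defaultdict(lambda: [0.0, 0.0])             # Q-table: state -> [Q(s,0), Q(s,1)]

def select_action(s):
    """Epsilon-greedy action selection over {communication, radar}."""
    if random.random() < EPSILON:
        return random.randint(0, 1)
    return max((0, 1), key=lambda a: Q[s][a])

def q_update(s, a, r, s_next):
    """One-step Q-learning update for a single state-action pair."""
    td_target = r + GAMMA * max(Q[s_next])
    Q[s][a] += ALPHA * (td_target - Q[s][a])
```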

IV Deep Reinforcement Learning Algorithm

The DQN algorithm uses a deep neural network, called the Q-network, with weights $\boldsymbol{\theta}$ to derive an approximate value of $Q^*(s,a)$. The input of the Q-network is one of the states of the AV, and the output includes the Q-values $Q(s,a;\boldsymbol{\theta})$ of all possible actions. The approximate Q-values allow the AV to map its state to an optimal action. For this, the Q-network needs to be trained to update the weights $\boldsymbol{\theta}$ as follows.

At the beginning of iteration $t$, given state $s_t\in\mathcal{S}$, the AV obtains the Q-values $Q(s_t,\cdot\,;\boldsymbol{\theta})$ for all possible actions. The AV then takes an action $a_t$ according to the $\epsilon$-greedy policy [7] and observes the reward $r_t$ and the next state $s_{t+1}$. The AV stores the transition $m_t=(s_t,a_t,r_t,s_{t+1})$ in a replay memory $\mathcal{M}$. Then, the AV randomly samples a mini-batch of transitions from $\mathcal{M}$ to update $\boldsymbol{\theta}$ as follows:

\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\alpha\left[y_{t}-Q(s_{t},a_{t};\boldsymbol{\theta}_{t})\right]\nabla Q(s_{t},a_{t};\boldsymbol{\theta}_{t}),   (5)

where $\alpha$ is the learning rate, $\nabla Q(s_t,a_t;\boldsymbol{\theta}_t)$ is the gradient of $Q(s_t,a_t;\boldsymbol{\theta}_t)$ with respect to the online network weights $\boldsymbol{\theta}_t$, and $y_t$ is the target value, defined as $y_t=r_t+\gamma\max_{a}Q(s_{t+1},a;\boldsymbol{\theta}^-_t)$, where $\gamma$ is the discount factor and $\boldsymbol{\theta}^-_t$ are the target network weights that are copied periodically from the online network weights. The above steps are repeated in iteration $t+1$ to update the weights $\boldsymbol{\theta}$. Note that the training process is considered to be an episodic task, and the algorithm converges when the cumulative reward is stable over episodes.
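To make the training procedure concrete, the sketch below shows one possible implementation of the update in (5) with experience replay and a periodically copied target network, using the Keras API mentioned in Section V-A. The hidden layer sizes, learning rate, discount factor, and mini-batch size are assumptions, as they are not specified in this letter.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

GAMMA, ALPHA, BATCH = 0.9, 1e-3, 32              # assumed hyperparameters

def build_q_network():
    """MLP with a 6-unit input, two hidden layers, and 2 output Q-values."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(6,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2),                # Q(s, a=0), Q(s, a=1)
    ])

online = build_q_network()
target = build_q_network()
target.set_weights(online.get_weights())         # repeated every few steps (assumed)
online.compile(optimizer=tf.keras.optimizers.Adam(ALPHA), loss="mse")
memory = deque(maxlen=10000)                     # replay memory M of transitions

def train_step():
    """Sample a mini-batch from M and apply the gradient update in (5)."""
    if len(memory) < BATCH:
        return
    batch = random.sample(memory, BATCH)
    s, a, r, s_next = map(np.array, zip(*batch))
    y = online.predict(s, verbose=0)             # current Q-values Q(s, ., theta)
    q_next = target.predict(s_next, verbose=0).max(axis=1)
    y[np.arange(BATCH), a] = r + GAMMA * q_next  # target values y_t
    online.fit(s, y, verbose=0)                  # gradient step on (y_t - Q)^2
```

In this sketch, each transition stored in `memory` is a tuple `(s, a, r, s_next)` with the state encoded as in Section III-A; the squared-error loss makes the Keras gradient step equivalent to the update rule in (5).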

V Performance evaluation

Figure 2: (a) Total reward vs. episode; (b) average reward, (c) throughput, and (d) miss detection probability vs. $p_1^v$.

V-A Experiment Setup

For comparison purposes, the capacity of the data queue is set to $D=10$ packets, and the arriving packets follow a Poisson distribution with an average arrival rate of $\lambda_d=1$ packet/time step. If the channel state is good, i.e., $c=0$, the AV can transmit $\nu_1=4$ packets; if the channel state is bad, i.e., $c=1$, the AV can transmit $\nu_2=2$ packets. We assume that the probability that the channel is at the bad state is $p_c=0.1$, and the probability that the channel is at the good state is $1-p_c$. For the reward values, to minimize the miss detection probability, the value of $r_3$ should be much higher than the other values, i.e., $r_1$, $r_2$, and $r_4$. In particular, we set the values $(r_1,r_2,r_3,r_4)$ to $(2,1,50,5)$. The values of $p_0^v$ and $p_1^v$ are taken from [5], in which the AV's speed is considered high if it exceeds 60 km/h and low otherwise. Specifically, the values of $p_0^v$ and $p_1^v$ are set to 0.005 and 0.1, respectively. Rain can be considered a common unfavorable weather state, and thus the values of $p_1^w$ and $p_0^w$ can be taken from [6], in which $p_1^w=0.046$ and $p_0^w=0.005$. The parameters of the DQN scheme are set as follows. The neural network used in the DQN is a Multilayer Perceptron with 1 input layer, 2 hidden layers, and 1 output layer. The input layer contains 6 units, which correspond to the number of dimensions of the state space. The output layer contains 2 units, corresponding to the number of possible actions of the AV. The DQN and the environment for the AV are implemented by using the Keras library and the OpenAI Gym environment, respectively. To evaluate the DQN scheme, we introduce the Q-learning [8] scheme and the Round-robin scheme, i.e., the AV switches back and forth between the radar mode and the communication mode, as baseline schemes.
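A simplified skeleton of the custom OpenAI Gym environment described above is sketched below. It captures the data queue and channel dynamics of this section; the event sampling of Section II-B and the reward of (3) would be plugged into `step()`. The class and method structure is our own illustration, not a released codebase.

```python
import gym
import numpy as np
from gym import spaces

class IRDRCEnv(gym.Env):
    """Simplified iRDRC environment: queue, channel, and binary factor states."""

    D, NU1, NU2, P_C, LAMBDA_D = 10, 4, 2, 0.1, 1.0   # parameters from Section V-A

    def __init__(self):
        self.action_space = spaces.Discrete(2)          # 0: communication, 1: radar
        self.observation_space = spaces.MultiDiscrete(
            [self.D + 1, 2, 2, 2, 2, 2])                # (d, c, r, w, v, m)
        self.reset()

    def reset(self):
        self.d = 0
        self._sample_exogenous_states()
        return self._obs()

    def step(self, action):
        if action == 0:                                 # communication mode
            sent = self.NU1 if self.c == 0 else self.NU2
            self.d = max(self.d - sent, 0)
        arrivals = np.random.poisson(self.LAMBDA_D)     # Poisson packet arrivals
        self.d = min(self.d + arrivals, self.D)
        reward = 0.0   # reward per (3) and event sampling per Section II-B omitted here
        self._sample_exogenous_states()
        return self._obs(), reward, False, {}

    def _sample_exogenous_states(self):
        self.c = int(np.random.rand() < self.P_C)       # bad channel w.p. p_c
        self.r, self.w, self.v, self.m = np.random.randint(0, 2, size=4)

    def _obs(self):
        return np.array([self.d, self.c, self.r, self.w, self.v, self.m])
```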

V-B Simulation Results

We first compare the total rewards obtained by the schemes. As shown in Fig. 2(a), the total rewards obtained by the DQN and Q-learning are much higher than that of the Round-robin. Furthermore, the DQN and Q-learning converge to the same reward. However, the convergence speed of the DQN is much faster than that of the Q-learning. In particular, the DQN requires 170 episodes to approach the optimal value, while the Q-learning scheme requires 280 episodes. The reason is that the DQN updates multiple Q-values in a mini-batch at each training iteration [7], while Q-learning performs only one Q-value update at each training iteration [8]. As a result, the convergence rate of the Q-learning is usually much lower than that of the DQN, especially for large state/action spaces [7].

Next, we evaluate the DQN scheme by varying the environmental factors. Without loss of generality, we evaluate the proposed scheme when the probability that an unexpected event occurs given the high speed of the AV, $p_1^v$, varies from 0.1 to 1. As shown in Fig. 2(b), as $p_1^v$ increases, the average reward obtained by the Round-robin scheme decreases, while those obtained by the DQN and Q-learning schemes increase. The reason can be explained as follows. With the Round-robin scheme, the radar mode is chosen according to a fixed policy, meaning that the radar mode may not be frequently used even if the occurrence probability of an unexpected event is high. Thus, the AV may receive high penalties, which results in a decrease of the average reward. With the DQN and Q-learning schemes, the AV uses the radar mode more frequently as $p_1^v$ increases to minimize the penalties. As a result, the DQN and Q-learning schemes can achieve higher average rewards compared with the Round-robin scheme.

Following the optimal policy, the DQN and Q-learning can significantly outperform the Round-robin in terms of throughput (see Fig. 2(c)) and miss detection probability (see Fig. 2(d)). As shown in Fig. 2(d), the miss detection probabilities obtained by the DQN and Q-learning decrease as $p_1^v$ increases. The reason is that the optimal policies obtained by the DQN and Q-learning enable the AV to select the radar mode more frequently as unexpected events become more likely to occur. Thus, the AV can detect more unexpected events and reduce the miss detection probability. Note that the simulation results presented in this section are especially useful for designing key parameters of real AV systems to ensure the safety of the users. In particular, given the current simulation setting $(r_1,r_2,r_3,r_4)=(2,1,50,5)$, the AV can achieve a miss detection probability ranging from 0.15 to 0.3. We can further reduce the miss detection probability of the AV to meet its requirement by increasing the reward when the AV selects the radar mode, e.g., increasing $r_4$ from 5 to 50 or 100.

VI Conclusion

In this paper, we have proposed the iRDRC system which enables the AV to optimize the radar mode and communication mode selection automatically in a real-time manner. To deal with the uncertainty of the environment, we have formulated the optimization problem based on the MDP framework and developed the DQN algorithm to obtain the optimal policy. The results show that the proposed system can simultaneously maximize the data throughput and minimize the miss detection probability. In our future work, continuous actions and cooperation between the AVs in V2I networks can also be considered.

References

  • [1] D. Ma et al., "Joint radar-communications strategies for autonomous vehicles," arXiv preprint arXiv:1909.01729, 2019.
  • [2] P. Kumari et al., "Investigating the IEEE 802.11ad standard for millimeter wave automotive radar," in Proc. IEEE VTC, Sep. 2015, pp. 1–5.
  • [3] J. Choi et al., "Millimeter-wave vehicular communication to support massive automotive sensing," IEEE Commun. Mag., vol. 54, no. 12, pp. 160–167, Dec. 2016.
  • [4] A. R. Chiriyath et al., "Radar-communications convergence: Coexistence, cooperation, and co-design," IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 1, pp. 1–12, Dec. 2017.
  • [5] C. N. Kloeden et al., "Travelling speed and risk of crash involvement on rural roads," Australian Transport Safety Bureau, 2001.
  • [6] "How Do Weather Events Impact Roads," Accessed: Feb. 2020. [Online]. Available: https://ops.fhwa.dot.gov/weather/q1_roadimpact.htm
  • [7] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
  • [8] C. J. C. H. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3–4, pp. 279–292, 1992.