
Vehicle Tracking in Wireless Sensor Networks via Deep Reinforcement Learning

Jun Li, Zhichao Xing, Weibin Zhang, Yan Lin, and Feng Shu This work was supported in part by the National Natural Science Foundation of China under Grant 61727802 and Grant 61872184, in part by the Fundamental Research Funds for the Central Universities under Grant 30919011227, and in part by the Natural Science Foundation of Jiangsu Province under Grant BK20190454. (Corresponding authors: Yan Lin and Jun Li.)J. Li, Z. Xing, W. Zhang, Y. Lin, and F. Shu are with the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China. (E-mail: {jun.li, xing.zhichao, weibin.zhang, yanlin, shufeng }@njust.edu.cn).
Abstract

Vehicle tracking has become one of the key applications of wireless sensor networks (WSNs) in the fields of rescue, surveillance, traffic monitoring, etc. However, the increased tracking accuracy requires more energy consumption. In this letter, a decentralized vehicle tracking strategy is conceived for improving both tracking accuracy and energy saving, which is based on adjusting the intersection area between the fixed sensing area and the dynamic activation area. Then, two deep reinforcement learning (DRL) aided solutions are proposed relying on the dynamic selection of the activation area radius. Finally, simulation results show the superiority of our DRL aided design.

Index Terms:
Wireless Sensor Networks, Vehicle Tracking, Deep Reinforcement Learning.

I Introduction

With the advances in fabrication technologies that integrate sensing and wireless communication, large-scale wireless sensor networks (WSNs) can be formed in the desired fields by deploying dense, tiny sensor nodes. Vehicle tracking in WSNs has several prominent merits: firstly, the sensing unit is close to the vehicle, thus the sensed data are of qualitatively good geometric fidelity via vehicle-to-infrastructure (V2I) technology [1]; secondly, the information about the target is generated simultaneously by multiple sensors and thus contains redundancy [2]. However, there still exist some unsolved problems as far as vehicle tracking in WSNs is concerned. The first issue is how to guarantee that the vehicle is tracked by sensor nodes. The second is how to maximize the tracking accuracy with the limited resources of WSNs, such as the energy restriction of each node [3].

There are two strategies based on the data processing mechanism, namely the centralized strategy and the decentralized strategy. Specifically, in the centralized strategy, one sensor is designated as the cluster head, and the tracking estimation is performed at this node using all received data [4].

Figure 1: An illustration of a vehicle tracking scenario in WSNs.

However, the cluster head may become overloaded in this strategy. In each iteration of the decentralized strategy, each cluster uses the data from its neighbours to refine its local estimate [5]. Since the head is closer to the data source than the fixed head in the centralized strategy, this approach achieves higher energy efficiency. Meanwhile, the work in [6] points out that deep reinforcement learning (DRL) is a problem-solving tool suitable for decentralized systems in WSNs.

The existing decentralized tracking strategies in WSNs activate a fixed number of nodes based on the predicted position, which results in unnecessary energy consumption. Thus, we propose a dynamic activation area adjustment scheme based on DRL to save energy while ensuring tracking accuracy. We first formulate the problem as a Markov decision process (MDP). Then, we construct the optimization problem as maximizing the average reward of the MDP, where the reward consists of both tracking accuracy and energy consumption. Furthermore, we propose a pair of schemes based on the deep Q network (DQN) and on the deep deterministic policy gradient (DDPG) to maximize the average reward of the MDP. In simulations, our proposed DQN and DDPG based algorithms outperform the conventional Q-learning based method in terms of tracking accuracy and energy consumption.

II System Model

Fig. 1 illustrates a vehicle tracking scenario in WSNs, which consists of a single vehicle and multiple sensor nodes. The WSNs starts tracking and broadcasting when the object vehicle appears in its monitoring area. The time slots are denoted by t=1,2,\ldots,T, with unequal durations of \tau(t) seconds. The vehicle drives into the monitoring area of the WSNs at time slot t=1, and leaves the WSNs, or is lost by the WSNs, at t=T. Let the vehicle's state vector be \emph{{x}}(t)=[x_{t},v_{t}^{x},y_{t},v_{t}^{y}]^{\text{T}}, where \emph{{l}}(t)=[x_{t},y_{t}]^{\text{T}} is the 2-D position of the vehicle and [v_{t}^{x},v_{t}^{y}]^{\text{T}} is its velocity at time slot t on the plane. As shown in Fig. 1, the connection between the sensor nodes and the vehicle can be established by V2I within a certain range, thus the radius r_{\text{s}} of the sensing area n_{\text{s}}(t) is fixed. Moreover, the WSNs activates the sensors in the activation area n_{\text{a}}(t) with radius r_{\text{a}}(t) in advance, for tracking with dynamic energy consumption. Additionally, let the sensors in the intersection area S_{\text{int}}(t)=n_{\text{s}}(t)\cap n_{\text{a}}(t) be indexed by i=1,2,\ldots,|m(t)|, where m(t) is the collection of sensors in S_{\text{int}}(t), and |m(t)|=\text{card}[m(t)] represents the number of elements in the set m(t).

At time slot t-1, the WSNs enables an activation area n_{\text{a}}(t) in advance, which is determined by the predicted vector \hat{\emph{{x}}}^{\prime}(t) and the radius r_{\text{a}}(t). When the vehicle moves from t-1 to t, the sensors in m(t) track the vehicle cooperatively, and then only one sensor node is selected to transmit its data to the sensors in n_{\text{a}}(t+1).

II-A Motion Model

In our model, the vehicle's state vector is 4-dimensional. For simplicity, we model the vehicle's motion as a linear model [7], \emph{{x}}(t+1)=\emph{{A}}\emph{{x}}(t)+\emph{{D}}\emph{{w}}(t), where \emph{{A}} represents the state transition matrix, \emph{{D}} is the transition duration matrix, and \emph{{w}}(t) denotes the noise acting on the vehicle, which is zero-mean white Gaussian noise with covariance matrix \emph{{Q}}.
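To make the motion model concrete, the following minimal Python sketch simulates one step of the linear model; the matrices correspond to the constant-velocity model given later in (6), while the slot duration tau and the noise standard deviation are assumed values for illustration only.

```python
import numpy as np

def motion_step(x, tau, q_std=np.sqrt(0.03), rng=np.random.default_rng()):
    """One step of x(t+1) = A x(t) + D w(t) for the state [x, vx, y, vy]."""
    A = np.array([[1, tau, 0, 0],
                  [0, 1,   0, 0],
                  [0, 0,   1, tau],
                  [0, 0,   0, 1]], dtype=float)
    D = np.array([[tau**2 / 2, 0],
                  [tau,        0],
                  [0, tau**2 / 2],
                  [0, tau       ]], dtype=float)
    w = rng.normal(0.0, q_std, size=2)   # zero-mean white Gaussian noise w(t)
    return A @ x + D @ w

# Example: initial state of the simulation section, with an assumed tau = 0.5 s.
x_next = motion_step(np.array([0.0, 7.84, 0.0, 7.84]), tau=0.5)
```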

II-B Measurement Model

The vehicle's measurement model is given by {\emph{{z}}_{i}}(t)=\emph{{H}}\emph{{x}}(t)+\emph{{v}}_{i}(t), where {\emph{{z}}_{i}}(t) represents the measurement of the i-th sensor at time t, and the measurement matrix \emph{{H}} is assumed to be the same for each sensor node. \emph{{v}}_{i}(t) is the measurement noise of the i-th sensor, which is assumed to be zero-mean white Gaussian with covariance matrix \emph{{R}}_{i}. Additionally, the observation noises added at different sensor nodes are independent.

II-C Filtering Model

In this paper, we employ the Kalman filtering approach for the WSNs to track the vehicle. To be specific, the iterative process with the motion and measurement models is given by [5] as

\hat{\emph{{x}}}^{\prime}(t)=\emph{{A}}\hat{\emph{{x}}}(t-1)+\emph{{D}}\emph{{w}}(t-1), (1a)
{\emph{{P}}^{\prime}_{i}}(t)=\emph{{A}}\emph{{P}}_{i}(t-1)\emph{{A}}^{\text{T}}+\emph{{Q}}(t-1), (1b)

where the prior estimated vector \hat{\emph{{x}}}(t-1) and its covariance \emph{{P}}_{i}(t-1) are obtained from the data at time slot t-1. Then, the a priori data are corrected using \hat{\emph{{x}}}^{\prime}(t) and {\emph{{z}}}(t):

\emph{{P}}_{i}(t)=\left[\emph{{P}}_{i}^{\prime}(t)^{-1}+{\emph{{H}}}^{\text{T}}[\emph{{R}}_{i}(t)]^{-1}\emph{{H}}\right]^{-1}, (2a)
\hat{\emph{{x}}}(t)=\emph{{P}}_{\text{o}}(t)\left[\emph{{P}}_{\text{o}}^{\prime}(t)^{-1}\hat{\emph{{x}}}^{\prime}(t)+\emph{{H}}^{\text{T}}[\emph{{R}}_{\text{o}}(t)]^{-1}\emph{{z}}_{\text{o}}(t)\right]. (2b)

Herein, \emph{{z}}_{\text{o}}(t)=\emph{{H}}\emph{{x}}(t)+\mathop{\rm{min}}\limits_{{i}}\emph{{v}}_{i}(t) denotes the optimal measurement obtained by broadcasting the data of the selected sensor node in m(t), where \emph{{R}}_{\text{o}}(t)=\mathbb{E}[\mathop{\rm{min}}\limits_{{i}}\emph{{v}}_{i}(t)]. This selection strategy helps improve the tracking accuracy, namely the smaller \emph{{P}}_{i}(t) is, the better the estimation quality becomes [8]. Next, we select the sensor with the best measuring performance according to \emph{{P}}_{\text{o}}(t)=\mathop{\rm{min}}\limits_{{i}}\text{tr}(\emph{{P}}_{i}(t)), where \emph{{P}}_{\text{o}}(t) is the data that the fusion center needs to transmit to n_{\text{a}}(t+1).
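As a rough illustration of the filtering and node-selection steps, the sketch below is our own simplification: the zero-mean noise term of (1a) is dropped, Q is the process-noise covariance expressed in the state space, each sensor is updated independently, and the node with the smallest trace is kept; the function names are not from the paper.

```python
import numpy as np

def predict(x_hat, P, A, Q):
    """Prediction step (1a)-(1b); the zero-mean term D*w(t-1) is omitted."""
    x_prior = A @ x_hat
    P_prior = A @ P @ A.T + Q        # Q: process-noise covariance in state space
    return x_prior, P_prior

def correct_and_select(x_prior, P_prior, H, measurements, R_list):
    """Per-sensor correction (2a)-(2b), keeping the node with minimum trace."""
    P_prior_inv = np.linalg.inv(P_prior)
    best = None
    for z_i, R_i in zip(measurements, R_list):
        R_inv = np.linalg.inv(R_i)
        P_i = np.linalg.inv(P_prior_inv + H.T @ R_inv @ H)        # (2a)
        x_i = P_i @ (P_prior_inv @ x_prior + H.T @ R_inv @ z_i)   # (2b)
        if best is None or np.trace(P_i) < np.trace(best[1]):
            best = (x_i, P_i)
    return best    # (x_hat(t), P_o(t)) to be broadcast to the nodes in n_a(t+1)
```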

II-D Time Model

In this paper, to obtain the time interval of tracking the vehicle in WSNs, we quantify the time based on the following formulas [9]:

\tau[c(\emph{{p}}_{1},\emph{{p}}_{2})]=\frac{N_{b}}{B\text{log}_{2}\left(1+\frac{P_{\text{t}}c(\emph{{p}}_{1},\emph{{p}}_{2})}{\sigma^{2}}\right)}, (3a)
c(\emph{{p}}_{1},\emph{{p}}_{2})=\left[\frac{\rho_{0}}{||\emph{{p}}_{1}-\emph{{p}}_{2}||_{2}}\right], (3b)

where \rho_{0} in (3b) denotes the path loss per meter, and c(\emph{{p}}_{1},\emph{{p}}_{2}) is the channel gain between positions \emph{{p}}_{1} and \emph{{p}}_{2}. In (3a), N_{\text{b}} denotes the number of bits of each task, B is the communication bandwidth, P_{\text{t}} is the transmission power, and \sigma^{2} is the power of the Gaussian white noise in the WSNs. Therefore, the data transmission duration is calculated as \tau[c(\emph{{p}}_{1},\emph{{p}}_{2})]. Furthermore, the processing time consists of three parts:

  1. Data Gathering Time Model: Based on the estimated vector \hat{\emph{{x}}}(t-1), the activation radius r_{\text{a}}(t), and the sensing area n_{\text{s}}(t), the sensors in m(t) detect the vehicle. Let \emph{{p}}_{1}=\hat{\emph{{l}}}(t) and \emph{{p}}_{2}=\emph{{p}}(i), i=1,2,\ldots,|m(t)|, thus \tau_{1}^{i}(t)=\tau[c(\hat{\emph{{l}}}(t),\emph{{p}}(i))], where the estimated position is \hat{\emph{{l}}}(t)=\emph{{H}}\hat{\emph{{x}}}(t). In addition, let \tau_{1}^{\prime}(t)=\mathop{\rm{max}}\limits_{{i}}\tau[c(\hat{\emph{{l}}}(t),\emph{{p}}(i))] be the longest duration, which is based on the channel gain associated with the distance between the vehicle and the furthest node i.

  2. Data Fusion Time Model: After data gathering, the sensors in m(t) establish a communication mechanism to find the optimal tracking data. Thus, the system randomly selects a node J as a virtual data fusion center, which is responsible for receiving data from the other sensor nodes j=1,2,\ldots,|m(t)|, j\neq J. Let \emph{{p}}_{1}=\emph{{p}}(J) and \emph{{p}}_{2}=\emph{{p}}(j), thus \tau_{2}^{j}(t)=\tau[c(\emph{{p}}(J),\emph{{p}}(j))]. Let \tau_{2}^{\prime}(t)=\mathop{\rm{max}}\limits_{{j}}\tau[c(\emph{{p}}(J),\emph{{p}}(j))] represent the duration of the total data fusion process, which is based on the channel gain associated with the distance between J and the furthest node j.

  3. Data Broadcast Time Model: After data fusion in m(t), the center J transfers the optimal data to the sensors in n_{\text{a}}(t+1). Let \emph{{p}}_{1}=\emph{{p}}(J) and \emph{{p}}_{2}=\emph{{p}}(k), k=1,2,\ldots,|m_{\text{a}}(t+1)|, thus \tau_{3}^{k}(t)=\tau[c(\emph{{p}}(J),\emph{{p}}(k))]. Similarly, |m_{\text{a}}(t+1)| denotes the number of sensors in n_{\text{a}}(t+1), and \tau_{3}^{\prime}(t)=\mathop{\rm{max}}\limits_{{k}}\tau[c(\emph{{p}}(J),\emph{{p}}(k))] denotes the total data broadcast duration, based on the minimum channel gain, i.e., the longest distance between J and a sensor k.

In summary, the time duration of each round consists of these three parts: \tau(t)=\tau_{1}^{\prime}(t)+\tau_{2}^{\prime}(t)+\tau_{3}^{\prime}(t), as sketched below.
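A minimal sketch of this timing model is given below; N_b, B, and sigma2 follow Table I, while P_t and rho_0 are placeholder values chosen only for illustration, and the helper names are ours.

```python
import numpy as np

def tx_time(p1, p2, N_b=2e4, B=1e6, P_t=1.0, rho_0=1e-3, sigma2=1e-14):
    """Transmission duration (3a) with the channel gain of (3b)."""
    gain = rho_0 / np.linalg.norm(np.asarray(p1) - np.asarray(p2))
    rate = B * np.log2(1.0 + P_t * gain / sigma2)      # achievable bits per second
    return N_b / rate

def slot_duration(l_hat, node_pos, J, next_area_pos):
    """tau(t) = tau_1'(t) + tau_2'(t) + tau_3'(t) for one tracking round."""
    tau1 = max((tx_time(l_hat, p) for p in node_pos), default=0.0)             # gathering
    tau2 = max((tx_time(node_pos[J], p)
                for i, p in enumerate(node_pos) if i != J), default=0.0)       # fusion
    tau3 = max((tx_time(node_pos[J], p) for p in next_area_pos), default=0.0)  # broadcast
    return tau1 + tau2 + tau3
```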

III Problem Formulation

This paper addresses the problem of improving tracking accuracy and saving energy when tracking a vehicle in WSNs. Explicitly, the number of sensor nodes used for tracking contributes to the tracking accuracy [5]. Nevertheless, as the number of sensor nodes increases, the energy consumption for communication and data processing becomes higher. To track the vehicle efficiently, we propose a dynamic activation area adjustment scheme to balance the trade-off between tracking accuracy and energy consumption. In this section, we model this scheme as an MDP based on the nature of the dynamic decision-making problem.

  • State Space: Let \mathcal{S}=\{s(t)|t=1,2,\ldots,T\} denote the state space of the WSNs, where s(t)=\{\hat{\emph{{x}}}^{\prime}(t),\hat{\emph{{x}}}(t),r_{\text{a}}(t)\} includes the predicted vector \hat{\emph{{x}}}^{\prime}(t) of the vehicle in time slot t, the estimated vector \hat{\emph{{x}}}(t) based on \hat{\emph{{x}}}^{\prime}(t) and the measurement {\emph{{z}}_{i}}(t), and the radius of the activation area r_{\text{a}}(t). The elements in \mathcal{S} other than r_{\text{a}}(t) are described in the Kalman filtering model.

  • Action Space: Let \mathcal{A} denote the action space of the WSNs, where r_{\text{a}}(t)\in\mathcal{A} is the radius of n_{\text{a}}(t).

  • Reward: Assuming that n_{\text{a}}(t) and n_{\text{s}}(t) are circles with radii r_{\text{a}}(t) and r_{\text{s}}, the prediction error in time slot t is denoted as L=||\hat{\emph{{l}}}^{\prime}(t)-\hat{\emph{{l}}}(t)||_{2}, where the predicted position is \hat{\emph{{l}}}^{\prime}(t)=\emph{{H}}\hat{\emph{{x}}}^{\prime}(t). In the tracking process, we consider a reward consisting of both the tracking accuracy and the energy consumption in each time slot:

    1. Tracking Accuracy: To quantify the tracking accuracy of the WSNs, a classical metric (such as the estimation error covariance) captures the error between \hat{\emph{{l}}}^{\prime}(t) and \hat{\emph{{l}}}(t), but we would like to measure the mismatch between n_{\text{s}}(t) and n_{\text{a}}(t), which depends on \hat{\emph{{l}}}^{\prime}(t) and r_{\text{a}}(t). By adopting S_{\text{int}}(t)=n_{\text{s}}(t)\cap n_{\text{a}}(t) as the metric, we can measure the effect of r_{\text{a}}(t) on energy consumption and tracking accuracy. In particular, \hat{\emph{{l}}}^{\prime}(t)=\hat{\emph{{l}}}(t) holds when L=0 and \text{max}[r_{\text{a}}(t)]=r_{\text{s}}; thus, as S_{\text{int}}(t) increases, the predicted position becomes more accurate. We define S_{\text{int}}(t)=r_{\text{a}}^{2}(t)[\theta_{1}(t)-\text{sin}\theta_{1}(t)\text{cos}\theta_{1}(t)]+r_{\text{s}}^{2}[\theta_{2}(t)-\text{sin}\theta_{2}(t)\text{cos}\theta_{2}(t)]. Herein, \theta_{1}(t) and \theta_{2}(t) are the angles of the intersection area, obtained by \theta_{1}(t)=\text{cos}^{-1}\frac{r_{\text{a}}^{2}(t)+L^{2}-r_{\text{s}}^{2}}{2r_{\text{a}}(t)L} and \theta_{2}(t)=\text{cos}^{-1}\frac{r_{\text{s}}^{2}+L^{2}-r_{\text{a}}^{2}(t)}{2r_{\text{s}}L}, respectively. A short sketch of this computation is provided after this list.

    2. Energy Consumption: During \tau(t), the energy consumption generated by data transmission is

      \begin{split}e_{\text{com}}(t)=&P_{\text{r}}\sum\limits_{i=1}^{|m(t)|}\tau_{1}^{i}(t)+P_{\text{r}}\tau_{2}^{\prime}(t)+P_{\text{t}}\sum\limits_{j=1}^{|m(t)|-1}\tau_{2}^{j}(t)\\ &+P_{\text{t}}\tau_{3}^{\prime}(t)+P_{\text{r}}\sum\limits_{k=1}^{|m_{\text{a}}(t+1)|}\tau_{3}^{k}(t),\end{split} (4)

      where P_{\text{t}} and P_{\text{r}} denote the power of a node for transmitting and receiving data, respectively. In addition, the energy consumption of the nodes waking up in n_{\text{a}}(t) and working in m(t) is e_{\text{w}}(t)=\left[P_{\text{w}}|m(t)|+P_{\text{idle}}(|m_{\text{a}}(t)|-|m(t)|)\right]\tau(t), where |m_{\text{a}}(t)| denotes the number of sensors in n_{\text{a}}(t), and the power of a node in working mode and in idle mode is denoted as P_{\text{w}} and P_{\text{idle}}, respectively [5]. Thus, the total energy consumption is e(t)=e_{\text{com}}(t)+e_{\text{w}}(t).

    Therefore, since the goal of this system is to improve the tracking accuracy and to minimize the energy consumption, we define the reward R(t) by combining the normalized S_{\text{int}}(t) and e(t) as R(t)=\frac{S_{\text{int}}(t)}{\pi r^{2}_{\text{s}}}-\frac{e(t)}{e_{\text{max}}}, where e_{\text{max}} denotes the maximum energy consumption in each time duration.
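The following sketch illustrates this reward computation (our own illustration; the energy value e is assumed to be supplied by the energy model above, and the degenerate cases of disjoint or nested circles are handled explicitly):

```python
import numpy as np

def intersection_area(L, r_a, r_s):
    """Lens-shaped intersection S_int(t) of the activation and sensing circles."""
    if L >= r_a + r_s:            # circles are disjoint
        return 0.0
    if L <= abs(r_s - r_a):       # one circle lies inside the other
        return np.pi * min(r_a, r_s) ** 2
    theta1 = np.arccos((r_a**2 + L**2 - r_s**2) / (2 * r_a * L))
    theta2 = np.arccos((r_s**2 + L**2 - r_a**2) / (2 * r_s * L))
    return (r_a**2 * (theta1 - np.sin(theta1) * np.cos(theta1))
            + r_s**2 * (theta2 - np.sin(theta2) * np.cos(theta2)))

def reward(L, r_a, r_s, e, e_max):
    """R(t) = S_int(t) / (pi r_s^2) - e(t) / e_max."""
    return intersection_area(L, r_a, r_s) / (np.pi * r_s**2) - e / e_max
```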

It is noteworthy that, as the number of training episodes increases, the total reward in a single episode may fluctuate to a certain degree; thus, an increase of the total reward in a single episode does not accurately represent an improvement of the system performance. Therefore, our goal is to find an optimal policy \pi^{*} with the maximum average total reward, given by:

\begin{split}\pi^{*}=\arg\max\frac{\sum\limits_{t=1}^{T}{R(t)}}{T}.\end{split} (5)

IV Proposed DRL Based Methods

In this section, to solve problem (5), we adopt the policy iteration based DDPG method and the value iteration based DQN method. Additionally, in the DQN method, we use the softmax action selection strategy for comparison with the ε-greedy strategy.

IV-A Proposed DDPG Based Method

The DDPG method is an algorithm that combines the Actor-Critic framework with neural networks [10], where the Actor-Critic learns both a policy network and a value network. In DDPG, when the state and action are the input of a Q-function, it estimates a Q-value Q^{\pi}(s,a) according to a policy \pi as Q^{\pi^{*}}(s,a)\leftarrow\mathbb{E}_{s^{\prime}}[r+\gamma\max\limits_{a^{\prime}}Q^{\pi^{*}}(s^{\prime},a^{\prime})|s,a], where s and a are the current state and action, s^{\prime} and a^{\prime} are the next state and action, and \gamma\in[0,1] is a discount factor for the future reward. The policy network is termed the actor and the value network is termed the critic. In DDPG, the critic receives (s,r) and produces a temporal-difference (TD) error \sigma=r+\gamma Q(s^{\prime},a^{\prime})-Q(s,a). With the TD error, the actor network updates in the direction suggested by the critic network. In the DDPG based algorithm, to avoid the influence of data correlation in training, the actor network is divided into a current actor network \mu(s|\vartheta^{\mu}) and a target actor network \mu^{\prime}(s|\vartheta^{\mu^{\prime}}), while the critic network is divided into a current critic network Q(s,a|\vartheta^{Q}) and a target critic network Q^{\prime}(s,a|\vartheta^{Q^{\prime}}). The weights of the four networks are updated as \vartheta^{Q^{\prime}}\leftarrow\ \iota\vartheta^{Q}+(1-\iota)\vartheta^{Q^{\prime}} and \vartheta^{\mu^{\prime}}\leftarrow\ \iota\vartheta^{\mu}+(1-\iota)\vartheta^{\mu^{\prime}}, where \iota is the soft update coefficient.
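The following fragment sketches only the two ingredients described above, the TD error and the soft target-network update; the parameter dictionaries and their shapes are hypothetical placeholders rather than the paper's actual networks.

```python
import numpy as np

def soft_update(theta_target, theta_current, iota=0.01):
    """Target update: theta' <- iota * theta + (1 - iota) * theta'."""
    return {k: iota * theta_current[k] + (1.0 - iota) * theta_target[k]
            for k in theta_target}

def td_error(r, q_sa, q_next, gamma=0.9):
    """TD error sigma = r + gamma * Q(s', a') - Q(s, a) used to train the critic."""
    return r + gamma * q_next - q_sa

# Example with weights stored as named NumPy arrays (placeholder shapes).
theta_Q, theta_Q_target = {"w": np.random.randn(8, 1)}, {"w": np.zeros((8, 1))}
theta_Q_target = soft_update(theta_Q_target, theta_Q, iota=0.01)
sigma = td_error(r=0.3, q_sa=1.2, q_next=1.5, gamma=0.9)
```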

IV-B Proposed DQN Based Method

Algorithm 1 DQN Based Dynamic Activation Area Adjustment Scheme
1:  Initialize \vartheta and \overline{\vartheta}; r and q; \mathcal{D} and N_{t}; \iota and T=0;
2:  for all episodes \ell=1,\ldots,N_{t} (N_{t} is the number of episodes) do
3:     Get the initial state s_{1};
4:     while the vehicle is being tracked do
5:        Choose an action based on the \varepsilon-greedy policy or the softmax policy;
6:        Obtain the reward r and store (s,a,r,s^{\prime}) in \mathcal{D};
7:        if the memory \mathcal{D} is full then
8:           Sample a mini-batch K randomly from \mathcal{D};
9:           \vartheta\leftarrow\min\limits_{\vartheta}l(\vartheta) and every E steps reset \widehat{Q}=Q;
10:        end if
11:        Set T=T+\tau(t), return q;
12:     end while
13:  end for

This paper considers two action selection methods:

  • \varepsilon-Greedy Policy: In each time slot, the action is selected according to a^{\prime}=\arg\max\limits_{a}Q(s,a;\vartheta) when \psi_{t}<\varepsilon, where \psi_{t} is a random variable; otherwise, a is selected randomly from \mathcal{A}. At the end of each episode, the parameter \varepsilon is updated by \varepsilon=\text{max}\{\varepsilon_{\text{min}},\varepsilon-\varsigma\}, where \varepsilon_{\text{min}} is the minimum value of \varepsilon, and \varsigma is the decrease speed.

  • Softmax Policy: To obtain the selection probability P(a) of action a, the softmax based action selection policy is formulated as P(a)=\frac{\text{e}^{\frac{1}{t}\sum_{1}^{t}r(a)}}{\sum_{i=1}^{\text{card}(\mathcal{A})}{\text{e}^{\frac{1}{t}\sum_{1}^{t}r(a_{i})}}}, where r(a) is the reward obtained with action a and r(a_{i}) is the reward obtained with action a_{i}.
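For illustration, both selection rules can be sketched as follows (the Q-values and the per-action average rewards are assumed to be supplied by the caller; the greedy-when-below-epsilon convention follows the description above):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """Greedy action when psi_t < epsilon, otherwise a uniformly random action."""
    if rng.random() < epsilon:
        return int(np.argmax(q_values))
    return int(rng.integers(len(q_values)))

def softmax_select(avg_rewards, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(average reward)."""
    logits = np.asarray(avg_rewards, dtype=float)
    probs = np.exp(logits - logits.max())      # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```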

Additionally, DQN employs two networks with the same structure but different parameters \vartheta and \bar{\vartheta}, respectively. To train the model, we search for the weights by minimizing the loss function \min\limits_{\vartheta}l(\vartheta)=\sum\limits_{t=0}^{T}[\hat{Q}(s,a;\vartheta)-y_{t}^{\text{DQN}}]^{2}, where y_{t}^{\text{DQN}}=r_{t}+\gamma\max\limits_{a^{\prime}}\hat{Q}(s^{\prime},a^{\prime};\bar{\vartheta}), \hat{Q}(s,a;\vartheta) represents the predicted Q-value with weights \vartheta, and y_{t}^{\text{DQN}} represents the target Q-value computed with weights \bar{\vartheta}. The details of this process are shown in Algorithm 1.
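The loss and target computation over a sampled mini-batch can be sketched as below; q_net and q_target_net are hypothetical callables returning the Q-values of all actions for a given state.

```python
import numpy as np

def dqn_loss(batch, q_net, q_target_net, gamma=0.9):
    """Squared TD loss with a separate target network (weights theta_bar).

    batch: transitions (s, a, r, s_next) sampled from the replay memory D.
    """
    loss = 0.0
    for s, a, r, s_next in batch:
        y = r + gamma * np.max(q_target_net(s_next))   # target y_t^DQN
        loss += (q_net(s)[a] - y) ** 2                 # predicted vs. target Q-value
    return loss / len(batch)
```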

V Simulations and Results

TABLE I: Simulation Parameters
Parameter | Value | Parameter | Value | Parameter | Value
B (MHz) | 1 | N_{\text{b}} (bits) | 2\times 10^{4} | P_{\text{w}} (W) | 5
P_{\text{idle}} (W) | 0.05 | \gamma | 0.9 | \mathcal{D} | 2000
N_{t} | 1800 | \vartheta | 0.5 | E | 500
K | 30 | \varepsilon | 0.9 | \iota^{\prime} | 0.01
\vartheta^{\mu} | 0.5 | \vartheta^{Q} | 0.5 | \vartheta^{\mu^{\prime}} | 0.5
\vartheta^{Q^{\prime}} | 0.5 | \bar{\vartheta} | 0.5 | \sigma^{2} (dBm) | -110

Based on the previous research in [7], we consider a WSNs monitoring field of 360\times 600\ \text{m}^{2}. Besides, the motion model is

Figure 2: Energy consumption and tracking accuracy versus r_{\text{s}}.
\emph{{x}}(t+1)=\begin{bmatrix}1&\tau&0&0\\ 0&1&0&0\\ 0&0&1&\tau\\ 0&0&0&1\end{bmatrix}\emph{{x}}(t)+\begin{bmatrix}\frac{\tau^{2}}{2}&0\\ \tau&0\\ 0&\frac{\tau^{2}}{2}\\ 0&\tau\end{bmatrix}\begin{bmatrix}w_{x}(t)\\ w_{y}(t)\end{bmatrix}, (6)

where \emph{{x}}(t) is the state vector of the vehicle, \tau=\tau(t) is the length of each time interval, and \emph{{w}}(t) is the process noise. The initial vector is \emph{{x}}(0)=[0\ 7.84\ 0\ 7.84]^{\text{T}}, which corresponds to an initial speed of about 40 km/h. The process noise accounting for the unpredictable modeling error is characterized by \emph{{Q}}=\mathbb{E}[\emph{{w}}(t)\emph{{w}}^{\text{T}}(t)]=\begin{bmatrix}0.03&0\\ 0&0.03\end{bmatrix}.

In the tracking of a single vehicle, many kinds of sensors can be used in this scenario, such as cameras [5] and logical combinations of ranging sensors [2]. By adopting the uniform sensor deployment strategy of [5], we place 375 nodes in the sensing range with a density of 1 sensor per 24 m, and the linear measurement model of each sensor is \emph{{z}}_{i}(t)=\emph{{H}}\emph{{x}}(t)+\begin{pmatrix}v_{x}(t)\\ v_{y}(t)\end{pmatrix}_{i}, where \emph{{H}}=\begin{bmatrix}1&0&0&0\\ 0&0&1&0\end{bmatrix}. Notably, the motion model can also be non-linear, in which case it can be estimated by an extended Kalman filter, a particle filter, or other methods. The measurement noise covariance is \emph{{R}}_{i}=\mathbb{E}\left[\begin{pmatrix}v_{x}(t)\\ v_{y}(t)\end{pmatrix}_{i}\begin{pmatrix}v_{x}(t)\\ v_{y}(t)\end{pmatrix}_{i}^{\text{T}}\right]=\begin{bmatrix}400&0\\ 0&400\end{bmatrix}. We randomly initialize \emph{{x}}(0), and the parameters used in this model are shown in Table I.
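As a small worked example of this measurement setup (variable names are ours; only the measurement side is shown), one noisy observation per sensor can be generated as:

```python
import numpy as np

rng = np.random.default_rng()
H = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)   # extracts the 2-D position from the state
R_i = 400.0 * np.eye(2)                     # measurement noise covariance of sensor i

def measure(x, H=H, R=R_i):
    """z_i(t) = H x(t) + v_i(t) with zero-mean Gaussian noise of covariance R."""
    v = rng.multivariate_normal(np.zeros(2), R)
    return H @ x + v

z = measure(np.array([0.0, 7.84, 0.0, 7.84]))   # noisy position of the initial state
```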

Fig. 2 plots the trends of the average tracking accuracy \frac{1}{T}\sum\limits_{t=1}^{T}{S_{\text{int}}(t)} and the average energy consumption \frac{1}{T}\sum\limits_{t=1}^{T}{e(t)} for different r_{\text{s}}, where each value of e(t) and S_{\text{int}}(t) is the final convergence value obtained by training for 1800 episodes. Clearly, increasing the sensing area n_{\text{s}}(t) yields higher S_{\text{int}}(t) and e(t).

Fig. 3 shows the average accumulated rewards \frac{1}{T}\sum\limits_{t=1}^{T}{R(t)} of the proposed algorithms. Firstly, we observe that although e(t) and S_{\text{int}}(t) increase with r_{\text{s}}, the gaps between the algorithms in terms of the average accumulated rewards enlarge gradually. Secondly, as the episodes go by, the average accumulated reward based on DQN is much higher than that of the other methods. The reasons are twofold: first, the QL based method yields the lowest value because the Q-table has a limited capacity; second, the value iteration of DQN is based on an unbiased estimation of the data, while DDPG is based on a biased estimation.

Figure 3: Average accumulated rewards versus r_{\text{s}}.

VI Conclusions

This paper focuses on improving the tracking accuracy and energy saving when tracking a target vehicle in WSNs. To this end, we propose a decentralized vehicle tracking strategy, which dynamically adjusts the activation area to improve tracking accuracy and save energy. Then, we define the problem as an MDP and use DRL methods to solve it. Finally, the simulation results show that the DQN based method performs better than the other DRL based methods. In the future, we will explore multi-agent DRL algorithms in cooperative tracking scenarios.

References

  • [1] O. Popescu, S. Sha-Mohammad, H. Abdel-Wahab, D. C. Popescu, and S. El-Tawab, “Automatic incident detection in intelligent transportation systems using aggregation of traffic parameters collected through v2i communications,” IEEE Intell. Trans. Syst. Mag., vol. 9, pp. 64–75, Summer 2017.
  • [2] B. Wu, Y. Feng, H. Zheng, and X. Chen, “Dynamic cluster members scheduling for target tracking in sensor networks,” IEEE Sensors J., vol. 16, pp. 7242–7249, Oct. 2016.
  • [3] K. Zheng, H. Wang, H. Li, W. Xiang, L. Lei, J. Qiao, and X. S. Shen, “Energy-efficient localization and tracking of mobile devices in wireless sensor networks,” IEEE Trans. Veh. Technol., vol. 66, pp. 2714–2726, Mar. 2017.
  • [4] U. Baroudi, “Robot-assisted maintenance of wireless sensor networks using wireless energy transfer,” IEEE Sensors J., vol. 17, pp. 4661–4671, Jul. 2017.
  • [5] J. Li, Q. Jia, X. Guan, and X. Chen, “Tracking a moving object via a sensor network with a partial information broadcasting scheme,” vol. 181, pp. 4733–4753, 2011.
  • [6] Y. Su, X. Lu, Y. Zhao, L. Huang, and X. Du, “Cooperative communications with relay selection based on deep reinforcement learning in wireless sensor networks,” IEEE Sensors J., vol. 19, pp. 9561–9569, Oct. 2019.
  • [7] Z. Zhao, X. Wang, and T. Wang, “A novel measurement data classification algorithm based on svm for tracking closely spaced targets,” IEEE Trans. Instrum. Meas., vol. 68, pp. 1089–1100, Apr. 2019.
  • [8] W. Xiao, C. K. Tham, and S. K. Das, “Collaborative sensing to improve information quality for target tracking in wireless sensor networks,” in IEEE PERCOM Workshops, pp. 99–104, Mar. 2010.
  • [9] J. Li, S. Chu, F. Shu, J. Wu, and D. N. K. Jayakody, “Contract-based small-cell caching for data disseminations in ultra-dense cellular networks,” IEEE Trans. Mobile Comput., vol. 18, pp. 1042–1053, May 2019.
  • [10] P. Wu, J. Li, L. Shi, M. Ding, K. Cai, and F. Yang, “Dynamic content update for wireless edge caching via deep reinforcement learning,” IEEE Commun. Lett., vol. 23, pp. 1773–1777, Oct. 2019.