
Aging Bandits: Regret Analysis and Order-Optimal Learning Algorithm for Wireless Networks with Stochastic Arrivals

Eray Unsal Atay, Igor Kadota, and Eytan Modiano
Abstract

We consider a single-hop wireless network with sources transmitting time-sensitive information to the destination over multiple unreliable channels. Packets from each source are generated according to a stochastic process with known statistics and the state of each wireless channel (ON/OFF) varies according to a stochastic process with unknown statistics. The reliability of the wireless channels is to be learned through observation. At every time slot, the learning algorithm selects a single pair (source, channel) and the selected source attempts to transmit its packet via the selected channel. The probability of a successful transmission to the destination depends on the reliability of the selected channel. The goal of the learning algorithm is to minimize the Age-of-Information (AoI) in the network over $T$ time slots. To analyze the performance of the learning algorithm, we introduce the notion of AoI regret, which is the difference between the expected cumulative AoI of the learning algorithm under consideration and the expected cumulative AoI of a genie algorithm that knows the reliability of the channels a priori. The AoI regret captures the penalty incurred by having to learn the statistics of the channels over the $T$ time slots. The results are two-fold: first, we consider learning algorithms that employ well-known solutions to the stochastic multi-armed bandit problem (such as $\epsilon$-Greedy, Upper Confidence Bound, and Thompson Sampling) and show that their AoI regret scales as $\Theta(\log T)$; second, we develop a novel learning algorithm and show that it has $O(1)$ regret. To the best of our knowledge, this is the first learning algorithm with bounded AoI regret.

I Introduction

Age-of-Information (AoI) is a performance metric that captures the freshness of the information from the perspective of the destination. AoI measures the time that elapsed since the generation of the packet that was most recently delivered to the destination. This performance metric has been receiving attention in the literature [1, 2, 3] for its application in communication systems that carry time-sensitive data. In this paper, we consider a network with $M$ sources transmitting time-sensitive information to the destination over $N$ unreliable wireless channels, as illustrated in Fig. 1. Packets from each source are generated according to an i.i.d. stochastic process with known statistics and the state of each wireless channel (ON/OFF) varies according to an i.i.d. stochastic process with unknown statistics. At every time slot, the learning algorithm schedules a single pair (source, channel) and the selected source attempts to transmit its packet via the selected wireless channel. When a packet with fresh information is successfully transmitted to the destination, the AoI associated with the selected source is reduced. The goal of the scheduler is to keep the information associated with every source in the network as fresh as possible, i.e. to minimize the AoI in the network. To decide which pair to select in a time slot, the scheduler takes into account: i) the packet generation processes at the $M$ sources; ii) the current values of AoI at the destination; and iii) the estimated reliability of the $N$ wireless channels.

In this sequential decision problem, the outcomes of previous transmission attempts are used to estimate the reliability of the wireless channels. This statistical learning problem is closely related to the stochastic multi-armed bandit (MAB) problem in which the wireless channels are the bandits that give i.i.d. rewards and the scheduler is the player that attempts to learn the statistics of the bandits in order to maximize the reward accumulated over time. The main challenge in the stochastic MAB problem is to strike a balance between exploiting the bandit that gave the highest rewards in the past and exploring other bandits that may give high rewards in the future. To evaluate the performance of different learning algorithms, we define regret. Regret is the difference between the expected cumulative reward of a genie algorithm (that knows the statistics of the bandits a priori) and the expected cumulative reward of the learning algorithm under consideration. The regret captures the penalty incurred by having to learn the statistics of the bandits over time. Some well-known order-optimal learning algorithms in terms of regret are: $\epsilon$-Greedy, Upper Confidence Bound (UCB), and Thompson Sampling (TS). The regret of these policies was shown to increase no more than logarithmically in time [4, 5, 6], i.e. $O(\log T)$, and this bound was shown to be tight [7].

We refer to our problem as the Aging Bandit problem. An important distinction between the stochastic MAB problem and the Aging Bandit problem is the reward structure. In the stochastic MAB problem, the player selects a bandit in each time slot and receives a reward that is i.i.d. over time and depends only on the probability distribution associated with the selected bandit. In the Aging Bandit problem, the scheduler selects a pair (source, channel) and the reward is the AoI reduction that results from a packet transmission to the destination. This reward depends on the state of the selected channel (which is i.i.d. over time), since a failed transmission gives zero reward, and it also depends on the history of previous packet deliveries and packet generations. In particular, if the selected source has recently delivered a fresh information update to the destination, then the reduction in AoI may be small. In contrast, if the selected source has not updated the destination for a long period, then the AoI reduction may be large. The reward structure of Aging Bandits is closely related to the AoI evolution (formally defined in Sec. II) which is history-dependent. This intricate reward structure has significant impact on the analysis of regret and on the development of learning algorithms when compared to the analysis of the traditional stochastic MAB.

The literature on MAB problems is vast, dating more than eight decades [8]. For surveys on different types of MAB problems, we refer the readers to [9, 10, 11, 12]. Most relevant to this work are [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. The authors in [13, 14, 15] considered the problem of minimizing the expected queue-length in a system with a single queue and multiple servers with unknown service rates. In [13], the authors introduced the concept of queue-length regret, developed a learning algorithm inspired by Thompson Sampling, and analyzed its regret. In [14, 15], the authors used information particular to the queue evolution to develop a learning algorithm with $O(1)$ queue-length regret.

The authors in [16, 17, 18, 19, 20, 21, 22, 23, 24] considered the problem of minimizing the average AoI in a single-hop wireless network with unreliable channels. In [16, 17, 18, 19, 20, 21], the authors posed the AoI minimization problem in a network with multiple sources and known channel statistics as a restless MAB problem, developed the associated Whittle's Index scheduling policy, and evaluated its performance in terms of the average AoI. In [22], the authors considered the AoI minimization problem in a network with a single source-destination pair and unknown channel statistics, introduced the concept of AoI regret, and showed that the AoI regret of UCB and TS scales as $O(\log T)$. In [23], the authors obtained similar results as in [22] for the more challenging case of correlated wireless channels. In [24], the authors considered the AoI minimization problem in a network with multiple sources that generate and transmit fresh packets at every time slot through (possibly) different channels with unknown statistics. The authors in [24] showed that the AoI regret of a UCB-based distributed learning algorithm scales as $O(\log^{2}T)$. An important modelling assumption common to [22, 23, 24] is that sources generate and transmit fresh packets at every time slot. The more realistic assumptions of random packet generation and scheduled transmissions have significant impact on the AoI evolution, on the analysis of AoI regret, and on the development of learning algorithms. For example, in Sec. IV, we leverage the random packet generation to develop a learning algorithm with $O(1)$ AoI regret.

In this paper, we study learning algorithms that attempt to minimize AoI in a network with multiple sources generating packets according to stochastic processes and transmitting these packets to the destination over wireless channels with initially unknown statistics. At every time slot, the learning algorithm schedules a single pair (source, channel) and the selected source attempts to transmit a packet through the selected channel. Note that the source policy, which selects a source at each time slot, and the channel policy, which selects the channel to be used in each time slot, can be naturally decoupled, as the optimal channel is independent of the source selected. In this paper, we focus on the exploration-exploitation dilemma faced by the channel policy. In particular, we consider learning algorithms employing the optimal source policy and different channel policies. Our main contributions include:

  • we analyze the performance of channel policies based on traditional MAB algorithms including $\epsilon$-Greedy, UCB, and TS, and show that their AoI regret scales as $\Theta(\log T)$. These results generalize the analysis in [22] to networks with multiple sources generating packets randomly. The analysis of the AoI regret is more challenging in this network setting since the AoI evolution depends on both the source policy and the stochastic packet generation process. These challenges are discussed in Sec. III;

  • we develop a novel learning algorithm and establish that it has $O(1)$ AoI regret. The key insight is that when packets are generated randomly, the learning algorithm can utilize times when the network has no packets to transmit, in order to learn the statistics of the channel. To the best of our knowledge, this is the first learning algorithm with bounded AoI regret.

The remainder of this paper is outlined as follows. In Sec. II, the network model and performance metrics are formally presented. In Sec. III, we analyze the AoI regret of traditional learning algorithms. In Sec. IV, we develop an order-optimal learning algorithm and analyze its AoI regret. In Sec. V, we compare the AoI regret of different learning algorithms using simulations. The paper is concluded in Sec. VI. Some of the technical proofs have been omitted due to the space constraint, and will be made available in a technical report.

II System Model

Consider a single-hop wireless network with $M$ sources, $N$ channels and a single destination, as illustrated in Fig. 1. Each source generates packets containing time-sensitive information and these packets are to be transmitted to the destination through one of the wireless channels. Let the time be slotted, with slot index $t\in\{1,2,\cdots,T\}$, where $T$ is the time horizon of this discrete-time system. The slot duration allows for a single packet transmission. We normalize the slot duration to unity.

Figure 1: Illustration of the wireless network with $M$ sources, $N$ channels, and a destination.

At the beginning of every slot $t$, each source generates a packet with probability $\lambda\in(0,1)$. Let $a_{m}(t)\in\{0,1\}$ be the indicator function that is equal to $1$ when source $m\in\{1,2,\cdots,M\}$ generates a packet in slot $t$, and $a_{m}(t)=0$ otherwise. This Bernoulli process with parameter $\lambda$ is i.i.d. over time and independent across different sources, with $\mathrm{P}\left(a_{m}(t)=1\right)=\lambda,\ \forall m,t$. A packet that is generated in slot $t$ can be transmitted during the same slot $t$. We denote the vector of packet generations in slot $t$ by $\vec{a}(t)=\left[a_{1}(t)\ \cdots\ a_{M}(t)\right]^{\mathsf{T}}$.

Each source has a transmission queue to store its packets. Sources keep only the most recently generated packet, i.e. the freshest packet, in their queue. When source $m$ generates a new packet at the beginning of slot $t$, older packets (if any) are discarded from its queue. Notice that delivering the most recently generated packet provides the freshest information to the destination. This queueing discipline is known to optimize the AoI in a variety of contexts [25, 26, 27]. After a packet delivery from source $m$, the queue remains empty until the next packet generation from the same source. However, while the queue is empty, a dummy packet can be transmitted for the purpose of probing the channels.

The networked system is empty during slot $t$ if there are no data packets available for transmission, i.e. if the $M$ queues are empty. Let $E(t)\in\{0,1\}$ be the indicator function that is equal to $1$ if the system is empty during slot $t$, and $E(t)=0$ otherwise. Notice that if there is a packet generation at the beginning of slot $t$, then the system is nonempty during slot $t$ and $E(t)=0$. Recall that when the system is empty, sources can still transmit dummy packets.

In a slot, the learning algorithm selects a single pair $(m,n)$, where $m\in\{1,2,\cdots,M\}$ is the index of the source and $n\in\{1,2,\cdots,N\}$ is the index of the wireless channel. Then, during this slot, source $m$ transmits a packet to the destination through channel $n$. If channel $n$ is ON, then the packet is successfully transmitted to the destination, and if channel $n$ is OFF, then the transmission fails. The learning algorithm does not know the channel states while making scheduling decisions, and the outcome of a transmission attempt during slot $t$ is known at the beginning of slot $t+1$. Let $b_{n}(t)\in\{0,1\}$ be the indicator function that represents the state of channel $n$ during slot $t$. The channel is ON, $b_{n}(t)=1$, with probability $\mu_{n}\in(0,1]$, and the channel is OFF, $b_{n}(t)=0$, with probability $1-\mu_{n}$. The channel state process is i.i.d. over time and independent across different channels.

The reliability of channel $n$ is represented by the probability of this channel being ON, $\mu_{n}$. Let $\vec{\mu}=\left[\mu_{1}\ \cdots\ \mu_{N}\right]^{\mathsf{T}}$ be the vector of channel reliabilities. Let $\mu^{*}$ be the maximum channel reliability and let $n^{*}$ be the index of the corresponding channel, i.e. $\mu^{*}=\max_{n}\mu_{n}=\mu_{n^{*}}$. For simplicity, we assume that the optimal channel $n^{*}$ is unique. Naturally, if the channel reliabilities were known by the learning algorithm in advance, then the algorithm would select channel $n^{*}$ in every slot $t$. However, since the channel reliabilities $\vec{\mu}$ are initially unknown, the learning algorithm has to estimate $\mu_{n}$ using observations from previous transmission attempts, while at the same time attempting to minimize the AoI in the network. Next, we formulate the AoI minimization problem.

II-A Age of Information

The AoI captures how old the information is from the perspective of the destination. Let $h_{m}(t)$ be a positive integer that represents the AoI associated with source $m$ at the beginning of slot $t$. By definition, we have $h_{m}(t):=t-\tau_{m}(t)$, where $\tau_{m}(t)$ is the generation time of the latest packet successfully transmitted from source $m$ to the destination (we define $\tau_{m}(t)=0$ prior to the first packet delivery from source $m$). If the destination does not receive a fresh packet from source $m$ during slot $t$, then in the next slot we have $h_{m}(t+1)=h_{m}(t)+1$, since the information at the destination is one slot older. In contrast, if the destination receives a fresh packet from source $m$ during slot $t$, then in the next slot the value of $\tau_{m}(t+1)$ is updated to the generation time of the received packet and the AoI is reduced by $\tau_{m}(t+1)-\tau_{m}(t)$. This difference is the "freshness gain" associated with the received packet. The evolution of $h_{m}(t)$ over time is illustrated in Fig. 2. We define the vector of AoI in slot $t$ as $\vec{h}(t)=\left[h_{1}(t)\ \cdots\ h_{M}(t)\right]^{\mathsf{T}}$.
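The AoI recursion described above can be summarized in a short sketch. This is an illustrative Python fragment under our own naming (`update_aoi`, `h`, `tau`), not code from the paper:

```python
# Illustrative sketch of the per-source AoI recursion (names are ours, not the paper's).
# h[m]   : AoI of source m at the beginning of the current slot t, h_m(t).
# tau[m] : generation time of the latest delivered packet from source m, tau_m(t).

def update_aoi(h, tau, t, delivered, gen_time):
    """Advance the AoI of every source from slot t to slot t+1.

    delivered : index of the source whose fresh packet was delivered in slot t, or None.
    gen_time  : generation time of that delivered packet, i.e. tau_m(t+1).
    """
    for m in range(len(h)):
        if m == delivered:
            tau[m] = gen_time           # destination now holds a packet generated at gen_time
            h[m] = (t + 1) - tau[m]     # h_m(t+1) = (t+1) - tau_m(t+1)
        else:
            h[m] += 1                   # no delivery: information is one slot older
    return h, tau
```

The AoI reduction upon a delivery is exactly $\tau_{m}(t+1)-\tau_{m}(t)$, the freshness gain discussed above.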

Figure 2: The blue and orange rectangles at the bottom represent packets generated at source $m$ and successful packet transmissions from source $m$, respectively. The orange curve shows the AoI evolution $h_{m}(t)$ associated with source $m$.

To capture the information freshness of the entire network, we consider the expected total AoI $\bar{h}(T)$, which is defined as the expected sum of the AoI over all sources and over time, namely

\bar{h}(T)=\mathrm{E}\left[\sum_{m=1}^{M}\sum_{t=1}^{T}h_{m}(t)\right]\;,   (1)

where the expectation is with respect to the randomness in the channel states $b_{n}(t)$, the packet generation process $\vec{a}(t)$, and the scheduling decisions $(m,n)$. The learning algorithm schedules pairs $(m,n)$ over time so as to minimize the expected total AoI $\bar{h}(T)$. Recall that in this sequential decision problem, the channel reliabilities $\mu_{n}$ are initially unknown to the learning algorithm and must be estimated over time. Next, we discuss the class of learning algorithms considered in this paper.

II-B Learning Algorithm

In this section, we present three important concepts associated with the learning algorithm: the channel policy, the source policy, and the AoI regret. Prior to discussing these concepts, we introduce some notation. In each slot $t$, the learning algorithm selects a single source and a single channel. Let $m(t)$ be the index of the source selected during slot $t$ and let $n(t)$ be the index of the channel selected during slot $t$. Then, the pair selected in each slot can be denoted as $(m(t),n(t))$. Notice that the learning algorithm can be divided into two components: the source policy, which selects $m(t)$, and the channel policy, which selects $n(t)$. Let $b(t)=b_{n(t)}(t)$ be the state of the channel selected during slot $t$, and recall that $\vec{a}(t)$ is the vector of packet generations and $\vec{h}(t)$ is the vector of AoI in slot $t$. Using this notation, we define the channel policy and the source policy.

The channel policy may (or may not) take into account the status of the transmission queues at the sources (in particular $E(t)$) in making scheduling decisions $n(t)$. Hence, we define two types of channel policies: queue-independent channel policies and queue-dependent channel policies. Let $\Pi_{B}$ be the class of admissible queue-independent channel policies $\pi_{b}$. In slot $t$, an arbitrary policy $\pi_{b}\in\Pi_{B}$ selects $n(t)$ using information about the outcome of previous transmission attempts. In particular, the queue-independent channel history in slot $t$ is given by $H_{B}(t)=\{n(1),b(1),\cdots,n(t-1),b(t-1)\}$. Let $\bar{\Pi}_{B}$ be the class of admissible queue-dependent channel policies $\bar{\pi}_{b}$. In slot $t$, an arbitrary policy $\bar{\pi}_{b}\in\bar{\Pi}_{B}$ selects $n(t)$ using information about the outcome of previous transmission attempts and about the current status of the transmission queues. In particular, the queue-dependent channel history in slot $t$ is given by $\bar{H}_{B}(t)=H_{B}(t)\cup\{E(t)\}$. In Sec. IV, we show that this small amount of information, namely $E(t)$, can have a significant impact on the performance of the channel policy. It is easy to see that both the optimal queue-independent channel policy $\pi^{*}_{b}$ and the optimal queue-dependent channel policy $\bar{\pi}^{*}_{b}$ select the channel with highest reliability $\mu^{*}$ at every slot $t$. However, since the reliabilities $\vec{\mu}$ are not known a priori, the channel policies have to estimate $\vec{\mu}$ over time. In Sec. III, we consider queue-independent channel policies and in Sec. IV, we consider queue-dependent channel policies.

The source policies considered in this paper are work-conserving, i.e. policies that never transmit dummy packets when there are undelivered data packets in the system. Let $\Pi_{A}$ be the class of admissible work-conserving source policies $\pi_{a}$. In slot $t$, an arbitrary source policy $\pi_{a}\in\Pi_{A}$ selects $m(t)$ using information about the current AoI and the generation times of the packets waiting to be transmitted at the sources' queues. In particular, the source history in slot $t$ is given by $H_{A}(t)=\{\vec{a}(1),\vec{h}(1),\cdots,\vec{a}(t),\vec{h}(t)\}$. The optimal source policy $\pi^{*}_{a}\in\Pi_{A}$ is the transmission scheduling policy that minimizes the expected total AoI in (1). A few works in the literature [18, 19, 28, 29] have addressed the problem of finding the transmission scheduling policy that minimizes AoI in wireless networks with stochastic packet generation and unreliable channels with known statistics. Despite those efforts, a full characterization of the optimal source policy is still an open problem.

In this paper, we consider learning algorithms $\pi$ that are a composition of a source policy and a channel policy, $\pi=(\pi_{a},\pi_{b})$. Our goal is to study the exploration-exploitation dilemma faced by the channel policy. To that end, we analyze the AoI regret of learning algorithms employing the optimal source policy and different channel policies. To analyze the AoI regret of learning algorithms without a full characterization of the optimal source policy $\pi_{a}^{*}$, we derive lower and upper bounds on the regret. These bounds are discussed in Proposition 2, Proposition 3, and Theorem 7, where we assume that the optimal source policy $\pi_{a}^{*}$ is the same irrespective of the queue-independent channel policy $\pi_{b}$ under consideration, namely

\pi^{*}_{a}=\operatorname*{arg\,min}_{\pi_{a}\in\Pi_{A}}\mathrm{E}\left[\sum_{m=1}^{M}\sum_{t=1}^{T}h^{(\pi_{a},\pi_{b})}_{m}(t)\right]\;,\;\forall\pi_{b}\in\Pi_{B}\;,   (2)

where $h^{(\pi_{a},\pi_{b})}_{m}(t)$ denotes the AoI associated with source $m$ in slot $t$ when the learning algorithm $\pi=(\pi_{a},\pi_{b})$ is employed. An analogous assumption is utilized for the case of queue-dependent channel policies $\bar{\pi}_{b}\in\bar{\Pi}_{B}$.

The AoI regret of a learning algorithm $\pi$ with queue-independent channel policy $\pi_{b}$ is defined as the difference between the expected total AoI $\bar{h}^{\pi}(T)$ when $\pi=(\pi_{a},\pi_{b})$ is employed and the expected total AoI $\bar{h}^{*}(T)$ when the optimal algorithm $\pi^{*}=(\pi_{a}^{*},\pi_{b}^{*})$ is employed, namely

R^{\pi}(T)=\mathrm{E}\left[\sum_{m=1}^{M}\sum_{t=1}^{T}h^{\pi}_{m}(t)-\sum_{m=1}^{M}\sum_{t=1}^{T}h^{*}_{m}(t)\right]\;,   (3)

where the expectation is with respect to the randomness in the channel states $b(t)$, the packet generation process $\vec{a}(t)$, and the scheduling decisions $(m(t),n(t))$. The definition of AoI regret for a learning algorithm $\bar{\pi}$ with queue-dependent channel policy $\bar{\pi}_{b}$ is analogous to (3). Next, we analyze the AoI regret of learning algorithms with queue-independent channel policies.
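Since (3) involves an expectation over channel states, packet generations, and scheduling decisions, the AoI regret is typically estimated by averaging sample paths. A minimal Monte Carlo sketch, assuming hypothetical simulators `run_learning(T)` and `run_genie(T)` (not defined in the paper) that return the $M\times T$ array of AoI values under $\pi$ and $\pi^{*}$, respectively:

```python
import numpy as np

def estimate_aoi_regret(run_learning, run_genie, T, num_runs=1000):
    """Monte Carlo estimate of the AoI regret in (3).

    run_learning(T) and run_genie(T) are hypothetical simulators that each return
    an (M x T) numpy array of AoI values h_m(t) for one sample path under the
    learning algorithm pi and the genie algorithm pi*, respectively.
    """
    total = 0.0
    for _ in range(num_runs):
        h_pi = run_learning(T)       # AoI trajectory under pi
        h_star = run_genie(T)        # AoI trajectory under pi*
        total += float(np.sum(h_pi) - np.sum(h_star))
    return total / num_runs
```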

III Regret Analysis

The problem of learning channel reliabilities over time is closely related to the stochastic MAB problem. A natural class of channel policies to consider are traditional MAB algorithms such as $\epsilon$-Greedy, UCB, and TS. In this section, we derive bounds on the AoI regret of learning algorithms that employ queue-independent channel policies. Notice that the class of queue-independent channel policies $\Pi_{B}$ includes traditional MAB algorithms. We describe a learning algorithm employing TS as its channel policy in Algorithm 1.

Initialization: time $t=1$, estimates $\hat{\mu}_{n}=0$, counters $T_{n}=0$, parameters $\alpha_{n}=\beta_{n}=1$, $\forall n\in\{1,\cdots,N\}$;
while $1\leq t\leq T$ do
       Optimal source policy selects $m\in\{1,2,\cdots,M\}$;
       $\theta_{n}\sim\text{Beta}(\alpha_{n},\beta_{n})$;
       $n=\operatorname*{arg\,max}_{n^{\prime}\in\{1,\cdots,N\}}\theta_{n^{\prime}}$;
       Source $m$ transmits packet through channel $n$ and observes channel state $b$;
       if $b=1$ then
              $\alpha_{n}=\alpha_{n}+1$;
       else
              $\beta_{n}=\beta_{n}+1$;
       end if
       Compute new estimate $\hat{\mu}_{n}=\dfrac{\hat{\mu}_{n}T_{n}+b}{T_{n}+1}$;
       $T_{n}=T_{n}+1$;
       $t=t+1$;
end while
Algorithm 1: Learning Algorithm employing TS as its channel policy
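A minimal Python sketch of the TS channel policy in Algorithm 1 is given below; the class and method names are ours, and the source policy and queueing dynamics are left out:

```python
import numpy as np

class ThompsonSamplingChannelPolicy:
    """Sketch of the queue-independent TS channel policy of Algorithm 1.

    Maintains Beta(alpha_n, beta_n) posteriors over the channel reliabilities mu_n.
    Class and method names are ours, not the paper's.
    """

    def __init__(self, num_channels, rng=None):
        self.rng = np.random.default_rng() if rng is None else rng
        self.alpha = np.ones(num_channels)   # posterior successes + 1
        self.beta = np.ones(num_channels)    # posterior failures + 1

    def select_channel(self):
        # Sample theta_n ~ Beta(alpha_n, beta_n) for every channel, pick the largest sample.
        theta = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(theta))

    def update(self, n, b):
        # b = 1 if the transmission over channel n succeeded (channel ON), 0 otherwise.
        if b == 1:
            self.alpha[n] += 1
        else:
            self.beta[n] += 1
```

In each slot, `select_channel()` draws the posterior samples $\theta_{n}$ and `update(n, b)` performs the Beta posterior update; the running estimate $\hat{\mu}_{n}$ maintained in Algorithm 1 is not needed for the channel selection itself and is omitted here.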

Scheduling decisions of a learning algorithm $\pi$ might differ from those of $\pi^{*}$ both in the source and in the channel, which makes the analysis of the AoI regret $\sum_{t=1}^{T}\sum_{m=1}^{M}\mathrm{E}[h_{m}^{\pi}(t)-h_{m}^{*}(t)]$ challenging. To alleviate this challenge, we use stochastic coupling to create equivalent coupled channel state processes that are simpler to analyze. Similar coupling arguments were employed in [13, 22].

Remark 1 (Coupled Channel States).

Let $\{U(t)\}_{t=1}^{T}$ be a sequence of i.i.d. random variables uniformly distributed in the interval $[0,1]$. In each slot $t$, the channel states $b_{n}(t)$ are determined as follows

b_{n}(t)=1\iff 0\leq U(t)\leq\mu_{n}\;,\;\forall n\;.   (4)

By construction, the coupled channel states are no longer independent. In particular, if a channel is ON during slot $t$, then all channels with higher reliability $\mu_{n}$ are also ON during that slot. Notice that, in each slot $t$, each coupled channel $n$ has the same probability distribution as the associated original channel $n$, namely $\mathrm{P}\left(b_{n}(t)=1\right)=\mu_{n},\ \forall n,t$. Hence, given the scheduling decision $(m(t),n(t))$ of $\pi$ during any slot $t$, the probability of a successful transmission attempt from source $m(t)$ through channel $n(t)$ is the same for both the coupled and original channel states. It follows that the probability distribution of $h_{m}^{\pi}(t)$ also remains the same for all slots $t$ and for all sources $m$ and, thus, the AoI regret $R^{\pi}(T)$ in (3) also remains the same for both the coupled and original channel state processes. For simplicity of analysis, henceforth in this paper, we assume that the channel state processes are coupled as described in Remark 1.
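For concreteness, the coupling in (4) can be generated as in the following sketch (our own code, not from the paper):

```python
import numpy as np

def coupled_channel_states(mu, T, rng=None):
    """Generate coupled channel states as in Remark 1, equation (4).

    mu : vector of channel reliabilities mu_n.
    Returns an (N x T) array with b_n(t) = 1 iff U(t) <= mu_n, so that marginally
    P(b_n(t) = 1) = mu_n while all channels share the same uniform U(t).
    """
    rng = np.random.default_rng() if rng is None else rng
    U = rng.uniform(0.0, 1.0, size=T)        # one uniform U(t) per slot, shared by all channels
    mu = np.asarray(mu).reshape(-1, 1)       # column vector of reliabilities
    return (U[None, :] <= mu).astype(int)    # b_n(t) = 1  iff  0 <= U(t) <= mu_n
```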

In Proposition 2, Proposition 3, and Corollary 4, we derive bounds on the AoI regret of a learning algorithm $\pi$ with respect to its expected number of suboptimal channel choices, namely

\mathrm{E}[K^{\pi}(T)]=\mathrm{E}\left[\sum_{t=1}^{T}\mathbbm{1}\left\{n^{\pi}(t)\neq n^{*}\right\}\right]\;,   (5)

where $\mathbbm{1}\left\{n^{\pi}(t)\neq n^{*}\right\}=1$ if $n^{\pi}(t)\neq n^{*}$, and $\mathbbm{1}\left\{n^{\pi}(t)\neq n^{*}\right\}=0$ otherwise. We consider two classes of admissible learning algorithms

\Pi=\left\{\pi=(\pi_{a},\pi_{b}):\pi_{a}\in\Pi_{A},\ \pi_{b}\in\Pi_{B}\right\}\;;   (6)
\Pi^{*}=\left\{\pi=(\pi_{a},\pi_{b}):\pi_{a}=\pi^{*}_{a},\ \pi_{b}\in\Pi_{B}\right\}\;.   (7)

Both classes employ queue-independent channel policies. The difference is that $\Pi$ employs any admissible source policy $\pi_{a}\in\Pi_{A}$, while $\Pi^{*}$ employs the optimal source policy $\pi_{a}^{*}$. Naturally, we have $\Pi^{*}\subset\Pi$.

Proposition 2 (Lower Bound).

For any given network configuration $(\lambda,\vec{\mu})$, the AoI regret of any learning algorithm $\pi\in\Pi$ scales at least on the order of its expected number of suboptimal channel choices (we write $f(t)=\Omega(g(t))$ if and only if $\exists C>0,\ \exists t_{0},\ \forall t>t_{0}: f(t)\geq C\cdot g(t)$), namely

R^{\pi}(T)=\Omega\left(\mathrm{E}\left[K^{\pi}(T)\right]\right)\;.   (8)

Proof outline. In addition to the suboptimal channel choices, the source choices $m^{\pi}(t)$ of algorithm $\pi\in\Pi$ can also differ from the source choices $m^{*}(t)$ of $\pi^{*}$. To overcome this challenge, we construct an auxiliary algorithm $\hat{\pi}^{*}$ with optimal channel policy and a source policy that selects the same source $m^{\pi}(t)$ as $\pi$ in every slot $t$ (notice that if the selected source $m^{\pi}(t)$ has no packet in its transmission queue, then the auxiliary algorithm attempts to transmit a dummy packet). Then, we focus on the auxiliary AoI regret $\sum_{t=1}^{T}\sum_{m=1}^{M}\mathrm{E}[h_{m}^{\pi}(t)-h_{m}^{\hat{\pi}^{*}}(t)]$ associated with the auxiliary algorithm $\hat{\pi}^{*}$, which we show to be not greater than the original AoI regret $\sum_{t=1}^{T}\sum_{m=1}^{M}\mathrm{E}[h_{m}^{\pi}(t)-h_{m}^{*}(t)]$. We then observe that each suboptimal channel choice of $\pi$ results in a penalty to the auxiliary AoI regret, and we show that this penalty is lower bounded by a constant. Using this constant, we obtain the desired lower bound on the original AoI regret in (8). The details are omitted due to the space constraint.

Proposition 3 (Upper Bound).

For any given network configuration $(\lambda,\vec{\mu})$, the AoI regret of any learning algorithm $\pi\in\Pi^{*}$ scales at most on the order of its expected number of suboptimal channel choices (we write $f(t)=O(g(t))$ if and only if $\exists C>0,\ \exists t_{0},\ \forall t>t_{0}: f(t)\leq C\cdot g(t)$), namely

R^{\pi}(T)=O\left(\mathrm{E}\left[K^{\pi}(T)\right]\right)\;.   (9)

Proof outline. Although both learning algorithms $\pi\in\Pi^{*}$ and $\pi^{*}$ employ the same optimal source policy $\pi_{a}^{*}$, they might select different sources $m^{\pi}(t)\neq m^{*}(t)$ over time, due to their different channel policies. To address this challenge, we use an approach similar to the proof of Proposition 2. We construct an auxiliary algorithm $\hat{\pi}\in\Pi^{*}$ with a source policy that selects the same source $m^{*}(t)$ as $\pi^{*}$ in every slot $t$, and with a channel policy that selects the same channel $n^{\pi}(t)$ as $\pi$ in every slot $t$. Then, we show that the auxiliary AoI regret $\sum_{t=1}^{T}\sum_{m=1}^{M}\mathrm{E}[h_{m}^{\hat{\pi}}(t)-h_{m}^{*}(t)]$ associated with the auxiliary algorithm $\hat{\pi}$ is not lower than the original AoI regret $\sum_{t=1}^{T}\sum_{m=1}^{M}\mathrm{E}[h_{m}^{\pi}(t)-h_{m}^{*}(t)]$. To derive an upper bound on the auxiliary AoI regret, we analyze the penalty that results from each suboptimal channel choice of $\hat{\pi}$. During a slot $t$ in which $\hat{\pi}$ makes a suboptimal channel choice, if channel $n^{\hat{\pi}}(t)$ is OFF and channel $n^{*}$ is ON, then a discrepancy is added to the difference between the AoI of $\hat{\pi}$ and the AoI of $\pi^{*}$, i.e. $h_{m}^{\hat{\pi}}(t+1)-h_{m}^{*}(t+1)>h_{m}^{\hat{\pi}}(t)-h_{m}^{*}(t)$. This discrepancy lasts until the next successful transmission of a packet from source $m$ by the auxiliary algorithm $\hat{\pi}$, after which the values of $h_{m}^{\hat{\pi}}(\cdot)$ and $h_{m}^{*}(\cdot)$ become equal (recall from Remark 1 that channel states are coupled; hence, if channel $n^{\hat{\pi}}(t)$ is ON, then channel $n^{*}$ is also ON). We refer to the duration of the discrepancy as its length. The penalty that results from a suboptimal channel choice is the product of the discrepancy and its length. We characterize the auxiliary AoI regret by expressing it as the sum of the penalties arising from suboptimal channel choices. Then, using discrete phase-type distributions, we upper bound the discrepancies and the lengths by constants (in the expected sense) to obtain the result in (9). The details are omitted due to the space constraint.

Corollary 4.

For any given network configuration $(\lambda,\vec{\mu})$, the AoI regret of any learning algorithm $\pi\in\Pi^{*}$ scales with its expected number of suboptimal channel choices (we write $f(t)=\Theta(g(t))$ if and only if $f(t)=O(g(t))$ and $f(t)=\Omega(g(t))$, i.e. $\exists C_{1},C_{2}>0,\ \exists t_{0},\ \forall t>t_{0}: C_{1}\cdot g(t)\leq f(t)\leq C_{2}\cdot g(t)$), namely

R^{\pi}(T)=\Theta\left(\mathrm{E}\left[K^{\pi}(T)\right]\right)\;.   (10)

Corollary 4 follows directly from Propositions 2 and 3. Notice that the bounds in Proposition 3 and Corollary 4 are not valid for the broader class of learning algorithms $\Pi$, which includes suboptimal source policies. This is because suboptimal source choices may add to the AoI regret, possibly making it grow faster than $\mathrm{E}\left[K^{\pi}(T)\right]$.

Prior to analyzing the AoI regret of learning algorithms that employ $\epsilon$-Greedy, UCB, and TS as their channel policy, we define $\alpha$-consistent learning algorithms [12, 13] and discuss a few of their properties. Let $\mathrm{E}[T^{\pi}_{n}(T)]$ be the expected number of times channel $n$ is selected by $\pi\in\Pi$ in the first $T$ slots, namely

\mathrm{E}[T_{n}^{\pi}(T)]=\mathrm{E}\left[\sum_{t=1}^{T}\mathbbm{1}\left\{n^{\pi}(t)=n\right\}\right]\;.   (11)
Definition 5 ($\alpha$-consistency).

For a given $\alpha\in(0,1)$, a learning algorithm $\pi\in\Pi$ is classified as $\alpha$-consistent if, for any network configuration $(\lambda,\vec{\mu})$, we have $\mathrm{E}\left[T^{\pi}_{n}(T)\right]=O(T^{\alpha})$ for all suboptimal channels $n\neq n^{*}$.

Intuitively, a learning algorithm $\pi\in\Pi$ is $\alpha$-consistent if its channel policy has good performance in every network configuration. Consider a learning algorithm with a trivial channel policy that selects $n(t)=1$ in every slot $t$. In network configurations with $n^{*}=1$, this channel policy never selects suboptimal channels, i.e. $\mathrm{E}\left[T^{\pi}_{n}(T)\right]=O(T^{\alpha}),\ \forall n\neq n^{*}$. However, in network settings with $n^{*}\neq 1$, this channel policy is such that $\mathrm{E}\left[T^{\pi}_{1}(T)\right]=T$, which violates the definition of $\alpha$-consistency. In the remainder of this section, we focus on channel policies that have good performance in every network configuration. In particular, we analyze the AoI regret of $\alpha$-consistent learning algorithms with queue-independent channel policies.

Remark 6 (AoI regret of $\alpha$-consistent algorithms).

In [13], the authors show that any learning algorithm $\pi\in\Pi$ that is $\alpha$-consistent has an expected number of suboptimal channel choices that scales as $\mathrm{E}\left[K^{\pi}(T)\right]=\Omega(\log T)$, for any network configuration $(\lambda,\vec{\mu})$. Hence, it follows from the lower bound in Proposition 2 that the associated AoI regret scales as

R^{\pi}(T)=\Omega(\log T)\;,   (12)

for any network configuration $(\lambda,\vec{\mu})$.

Notice that the lower bound in Remark 6 applies to $\alpha$-consistent learning algorithms with queue-independent channel policies that do not know the statistics of the channels in advance.

Learning algorithms that employ $\epsilon$-Greedy, UCB, and TS as their channel policy are known to have an expected number of suboptimal channel choices that scales as $\mathrm{E}[K^{\pi}(T)]=O(\log T)$ for any network configuration $(\lambda,\vec{\mu})$ [4, 30], which implies that they are $\alpha$-consistent. Hence, it follows from the upper bound in Proposition 3 and from (12) that the AoI regret of these learning algorithms scales as

R^{\pi}(T)=\Theta(\log T)\;.   (13)

In [22], the authors derived lower and upper bounds on the AoI regret of learning algorithms employing queue-independent channel policies, including UCB and TS, in networks with a single source generating and transmitting fresh packets in every slot $t$. Propositions 2 and 3 generalize the results in [22] to networks with multiple sources generating packets according to stochastic processes. The analysis of the AoI regret is more challenging in this network setting for the following reasons: i) the optimal source policy $\pi^{*}_{a}$ is unknown and there is no closed-form expression for the expected total AoI (1) of the optimal algorithm $\pi^{*}=(\pi_{a}^{*},\pi_{b}^{*})$; and ii) the learning algorithm under consideration, $\pi=(\pi_{a},\pi_{b})$, can make suboptimal choices both in terms of sources $m(t)$ and channels $n(t)$, and these two types of suboptimal choices affect the AoI regret $R^{\pi}(T)$ differently. Next, we develop a learning algorithm that leverages information about the status of the transmission queues in making scheduling decisions $n(t)$, and show that this new learning algorithm has $O(1)$ AoI regret.

IV Order-Optimal Learning Algorithm

In this section, we develop a learning algorithm $\bar{\eta}\in\bar{\Pi}$ with a queue-dependent channel policy that selects $n(t)$ using information about the outcome of previous transmission attempts, namely $H_{B}(t)=\{n(1),b(1),\cdots,n(t-1),b(t-1)\}$, and about the current status of the transmission queues, $E(t)$. Then, we derive an upper bound on its AoI regret. In particular, we show that the AoI regret of $\bar{\eta}$ is such that $R^{\bar{\eta}}(T)=O(1)$. Notice that the only difference between the learning algorithms $\pi\in\Pi$ in Sec. III and the order-optimal learning algorithm $\bar{\eta}$ is the knowledge of $E(t)$. This seemingly modest addition led to the reduction of the AoI regret from $R^{\pi}(T)=\Omega(\log T)$ to $R^{\bar{\eta}}(T)=O(1)$. To the best of our knowledge, this is the first learning algorithm with bounded AoI regret.

The key insight is that when packets are generated randomly, the learning algorithm $\bar{\eta}$ can utilize times when the network has no data packets to transmit, i.e. when $E(t)=1$, to transmit dummy packets and learn the statistics of the channels without incurring an opportunity cost. The order-optimal learning algorithm $\bar{\eta}=(\eta_{a},\bar{\eta}_{b})$ has the optimal source policy $\eta_{a}=\pi^{*}_{a}$ and a channel policy $\bar{\eta}_{b}\in\bar{\Pi}_{B}$ that operates as follows: when the system is empty, $E(t)=1$, the policy chooses a channel uniformly at random and uses the outcome of the transmission attempt to update its estimates of the channel reliabilities; when the system is nonempty, $E(t)=0$, the policy chooses the channel with the currently highest estimated reliability. Notice that the channel policy only updates its estimates of the channel reliabilities when the system is empty. A similar channel policy was used in [14, 15] to develop a learning algorithm with bounded queue-length regret. The order-optimal learning algorithm $\bar{\eta}$ is described in Algorithm 2. The upper bound on the AoI regret is established in the theorem that follows.

Initialization: time $t=1$, estimates $\hat{\mu}_{n}=0$, counters $T_{n}=0$, $\forall n\in\{1,\cdots,N\}$;
while $1\leq t\leq T$ do
       Optimal source policy selects $m\in\{1,2,\cdots,M\}$;
       if system is empty then
              $n=\text{Unif}\{1,\cdots,N\}$;
              Source $m$ transmits dummy packet through channel $n$ and observes channel state $b$;
              $\hat{\mu}_{n}=\dfrac{\hat{\mu}_{n}T_{n}+b}{T_{n}+1}$;
              $T_{n}=T_{n}+1$;
       else
              $n=\operatorname*{arg\,max}_{n^{\prime}\in\{1,\cdots,N\}}\hat{\mu}_{n^{\prime}}$;
              Source $m$ transmits data packet through channel $n$ and observes channel state $b$;
       end if
       $t=t+1$;
end while
Algorithm 2: Order-Optimal Learning Algorithm
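A minimal Python sketch of the queue-dependent channel policy $\bar{\eta}_{b}$ in Algorithm 2; class and method names are ours, and the optimal source policy is assumed to be provided externally:

```python
import numpy as np

class OrderOptimalChannelPolicy:
    """Sketch of the queue-dependent channel policy of Algorithm 2.

    Reliability estimates are updated only in empty slots (E(t) = 1), when a dummy
    packet probes a channel chosen uniformly at random. Names are ours, not the paper's.
    """

    def __init__(self, num_channels, rng=None):
        self.rng = np.random.default_rng() if rng is None else rng
        self.mu_hat = np.zeros(num_channels)   # empirical reliability estimates
        self.counts = np.zeros(num_channels)   # number of probes T_n per channel

    def select_channel(self, system_empty):
        if system_empty:
            return int(self.rng.integers(len(self.mu_hat)))   # explore: uniform probe
        return int(np.argmax(self.mu_hat))                    # exploit: highest estimate

    def update(self, n, b, system_empty):
        # Estimates change only when the observation came from an empty (probing) slot.
        if system_empty:
            self.counts[n] += 1
            self.mu_hat[n] += (b - self.mu_hat[n]) / self.counts[n]
```

The incremental update in `update` is equivalent to $\hat{\mu}_{n}=(\hat{\mu}_{n}T_{n}+b)/(T_{n}+1)$ used in Algorithm 2.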
Theorem 7.

For any given network configuration $(\lambda,\vec{\mu})$, the AoI regret of the order-optimal learning algorithm $\bar{\eta}$ is bounded, namely

R^{\bar{\eta}}(T)=O(1)\;.   (14)
Proof.

Recall that the two components of the order-optimal learning algorithm are $\bar{\eta}=(\eta_{a},\bar{\eta}_{b})$. Similarly to the proof of Proposition 3, we start by constructing an auxiliary algorithm $\hat{\eta}=(\hat{\eta}_{a},\bar{\eta}_{b})$ which has a source policy $\hat{\eta}_{a}$ that selects the same source $m^{*}(t)$ as $\pi^{*}$ in every slot $t$. Since $\eta_{a}=\pi^{*}_{a}$ is optimal, it follows that $\hat{\eta}_{a}$ is suboptimal, which implies that

\sum_{m=1}^{M}\sum_{t=1}^{T}\mathrm{E}[h_{m}^{\bar{\eta}}(t)-h_{m}^{*}(t)]\leq\sum_{m=1}^{M}\sum_{t=1}^{T}\mathrm{E}[h_{m}^{\hat{\eta}}(t)-h_{m}^{*}(t)]\;.   (15)

We denote the RHS of (15) as the auxiliary AoI regret. Prior to deriving the upper bound on the auxiliary AoI regret, we introduce some definitions that are particular to the channel policy $\bar{\eta}_{b}$.

Consider the time slots when the system becomes empty, i.e. time slots $t$ such that $E(t-1)=0$ and $E(t)=1$. We denote the time interval between two such slots as a period and we divide time $t\in\{1,2,\cdots,T\}$ into successive periods, with period index $p\in\{1,2,\cdots,P\}$. By definition, the system is empty, $E(t)=1$, in the beginning of each period $p$ and it remains empty until the first packet generation. Once the first packet is generated, the system becomes nonempty, $E(t)=0$, and it remains nonempty until the end of the period. Hence, each period $p$ has two phases: an empty phase and a nonempty phase, with each phase having at least one slot. Let $s_{p}$ and $f_{p}$ be the first and the last slots of period $p$, respectively, with $s_{1}=1$ and $s_{p+1}=f_{p}+1,\ \forall p$. Then, the cumulative AoI of source $m$ during period $p$ can be written as

y^{\hat{\eta}}_{m}(p)=\sum_{t=s_{p}}^{f_{p}}h^{\hat{\eta}}_{m}(t)\;.   (16)

Recall from Algorithm 2 that estimates of the channel reliabilities are only updated during empty phases. Within a nonempty phase, the estimates do not change and, thus, the selected channel also does not change. Let $\bar{n}(p)$ be the channel selected by policy $\bar{\eta}_{b}$ during the entire nonempty phase of period $p$. If $\bar{n}(p)=n^{*}$, we refer to period $p$ as an optimal period. Otherwise, we refer to period $p$ as a suboptimal period. Next, we derive an upper bound on the auxiliary AoI regret in terms of the expected AoI contributions of the suboptimal periods.

Lemma 8.

The auxiliary AoI regret is upper bounded by

\sum_{m=1}^{M}\sum_{t=1}^{T}\mathrm{E}[h_{m}^{\hat{\eta}}(t)-h_{m}^{*}(t)]\leq\sum_{m=1}^{M}\sum_{p=1}^{T}\mathrm{E}\left[y^{\hat{\eta}}_{m}(p)\;\middle|\;\bar{n}(p)\neq n^{*}\right]\mathrm{P}\left(\bar{n}(p)\neq n^{*}\right)\;.   (17)

To establish Lemma 8, we first show that if period $p$ is an optimal period, then $h^{\hat{\eta}}_{m}(t)=h^{*}_{m}(t),\ \forall m,\ \forall t\in\{s_{p},\cdots,f_{p}\}$, which implies that optimal periods do not contribute to the auxiliary AoI regret. Then, we obtain the upper bound in (17) by manipulating the expression of the auxiliary AoI regret. The complete proof of Lemma 8 can be found in Appendix A.

In Lemmas 9 and 10, we derive upper bounds on the first and second terms on the RHS of (17), respectively.

Lemma 9.

There exists a constant $C_{y}$ such that

\mathrm{E}\left[y^{\hat{\eta}}_{m}(p)\;\middle|\;\bar{n}(p)\neq n^{*}\right]\leq C_{y}\;.   (18)

To establish Lemma 9, we first show that the cumulative AoI $y^{\hat{\eta}}_{m}(p)$ of source $m$ in period $p$ can be upper bounded by

y^{\hat{\eta}}_{m}(p)=\sum_{t=s_{p}}^{f_{p}}h^{\hat{\eta}}_{m}(t)\leq\sum_{i=0}^{f_{p}-s_{p}}\left(h_{m}^{\hat{\eta}}(s_{p})+i\right)=h_{m}^{\hat{\eta}}(s_{p})\left[f_{p}-s_{p}+1\right]+\dfrac{1}{2}\left[(f_{p}-s_{p})^{2}+f_{p}-s_{p}\right]\;.   (19)

Then, we derive an upper bound on the conditional expectation of (19). In particular, we show that $h_{m}^{\hat{\eta}}(s_{p})$ can be upper bounded by a geometric random variable. Then, we show that the random variable $f_{p}-s_{p}$, which represents the length of period $p$, follows a discrete phase-type distribution. The upper bound on the conditional expectation of (19) follows from the fact that the geometric random variable has finite second moment and the phase-type random variable has finite first and second moments. The details are omitted due to the space constraint.
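To make the final bounding step concrete, let $G$ be a geometric random variable and $L$ a discrete phase-type random variable such that, on the event $\{\bar{n}(p)\neq n^{*}\}$, $h_{m}^{\hat{\eta}}(s_{p})\leq G$ and $f_{p}-s_{p}\leq L$; this is only a sketch of the omitted argument, and the exact construction in the omitted proof may differ. Taking conditional expectations of (19) and bounding the cross term with the Cauchy–Schwarz inequality then gives a constant of the form

\mathrm{E}\left[y^{\hat{\eta}}_{m}(p)\;\middle|\;\bar{n}(p)\neq n^{*}\right]\leq\mathrm{E}\left[G\,(L+1)\right]+\dfrac{1}{2}\,\mathrm{E}\left[L^{2}+L\right]\leq\sqrt{\mathrm{E}[G^{2}]\,\mathrm{E}[(L+1)^{2}]}+\dfrac{1}{2}\left(\mathrm{E}[L^{2}]+\mathrm{E}[L]\right)=:C_{y}\;,

which is finite because $G$ has finite second moment and $L$ has finite first and second moments.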

Lemma 10.

There exists a constant $C_{p}$ such that

\sum_{p=1}^{T}\mathrm{P}\left(\bar{n}(p)\neq n^{*}\right)\leq C_{p}\;.   (20)

To establish Lemma 10, we use Hoeffding's inequality to upper bound $\mathrm{P}\left(\bar{n}(p)=n\right)$ by an exponential function of $-p$, for every suboptimal channel $n$. The result in (20) follows directly from this upper bound. The complete proof of Lemma 10 can be found in Appendix B.
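For reference, the one-sided Hoeffding inequality states that the empirical mean of $k$ i.i.d. samples in $[0,1]$ exceeds the true mean by more than $\delta$ with probability at most $e^{-2k\delta^{2}}$. A sketch of how this yields the exponential decay, with $k_{n}$ and $k_{n^{*}}$ informally denoting the number of probes of channels $n$ and $n^{*}$ available when the nonempty phase of period $p$ starts:

\mathrm{P}\left(\bar{n}(p)=n\right)\;\leq\;\mathrm{P}\left(\hat{\mu}_{n}\geq\tfrac{\mu_{n}+\mu^{*}}{2}\right)+\mathrm{P}\left(\hat{\mu}_{n^{*}}\leq\tfrac{\mu_{n}+\mu^{*}}{2}\right)\;\leq\;e^{-k_{n}\Delta_{n}^{2}/2}+e^{-k_{n^{*}}\Delta_{n}^{2}/2}\;,\qquad\Delta_{n}=\mu^{*}-\mu_{n}\;.

Since every period contains at least one empty slot and the probed channel is chosen uniformly at random, these counts grow with $p$, and summing the resulting exponential terms over $p$ yields a finite constant; the omitted proof in Appendix B handles the conditioning on the (random) number of probes precisely.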

From the upper bound on the AoI regret in (15) and the results in Lemmas 8, 9 and 10, we have

R^{\bar{\eta}}(T)\leq\sum_{m=1}^{M}\sum_{t=1}^{T}\mathrm{E}[h_{m}^{\hat{\eta}}(t)-h_{m}^{*}(t)]\leq\sum_{m=1}^{M}\sum_{p=1}^{T}\mathrm{E}\left[y^{\hat{\eta}}_{m}(p)\;\middle|\;\bar{n}(p)\neq n^{*}\right]\mathrm{P}\left(\bar{n}(p)\neq n^{*}\right)\leq C_{y}\,M\,C_{p}\;,   (21)

which establishes the bound in (14). ∎

In the particular case of a network with sources generating fresh packets at every slot $t$, i.e. $\lambda=1$, the system is never empty, so the algorithm $\bar{\eta}$ cannot learn the channel reliabilities without incurring a cost in terms of AoI regret, which results in an $R^{\bar{\eta}}(T)$ that grows over time. The upper bound in Theorem 7 is only valid for the network model described in Sec. II, in which $\lambda\in(0,1)$. Next, we evaluate the AoI regret of the different learning algorithms discussed in this paper using MATLAB simulations and we propose a heuristic algorithm that leverages the fast learning rates of TS and the bounded regret of the order-optimal algorithm.

V Simulations

In this section, we evaluate the performance of learning algorithms in terms of the AoI regret in (3). We compare learning algorithms employing the Age-Based Max-Weight source policy [29, Sec. 5] and different channel policies, namely: i) $\epsilon$-Greedy; ii) UCB; iii) TS; iv) Optimal; and v) Hybrid. The Age-Based Max-Weight source policy selects, in each slot $t$, the source $m$ associated with the packet that gives the largest AoI reduction, $\tau_{m}(t+1)-\tau_{m}(t)$, if the transmission in slot $t$ is successful. Intuitively, this policy selects the source with the highest potential reward in terms of AoI. In [29], the authors evaluate the performance of the Age-Based Max-Weight source policy both analytically and using simulations, and show that it achieves near-optimal AoI. The first three channel policies, namely $\epsilon$-Greedy, UCB, and TS, were discussed in Sec. III. The Optimal policy is the order-optimal channel policy $\bar{\eta}_{b}$ developed in Sec. IV. The Hybrid policy employs TS for a fixed period in the beginning of the simulation and then employs the Optimal policy in the remaining slots.
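A sketch of the Age-Based Max-Weight source selection, using our own function name and interface (the actual implementation in [29] may differ):

```python
def age_based_max_weight(h, gen_time, t):
    """Sketch of the Age-Based Max-Weight source policy of [29, Sec. 5].

    h[m]        : current AoI of source m, h_m(t).
    gen_time[m] : generation time of the packet in source m's queue, or None if the queue is empty.
    Returns the source whose delivery in slot t would yield the largest AoI reduction
    tau_m(t+1) - tau_m(t) = h_m(t) - (t - gen_time[m]), or None if the system is empty.
    """
    best_m, best_gain = None, float("-inf")
    for m in range(len(h)):
        if gen_time[m] is None:          # empty queue: nothing to transmit
            continue
        gain = h[m] - (t - gen_time[m])  # AoI reduction if this packet is delivered
        if gain > best_gain:
            best_m, best_gain = m, gain
    return best_m
```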

We simulate a network with a time horizon of $T=10^{5}$ slots, $M=3$ sources, each generating packets according to a Bernoulli process with rate $\lambda$, and $N=5$ channels with reliabilities $\vec{\mu}=[0.4\ 0.45\ 0.5\ 0.55\ 0.6]^{\mathsf{T}}$. Figures 3 and 4 show simulation results of the evolution of the AoI regret over time for $\lambda=0.1$ and $\lambda=0.75$, respectively. Figure 5 shows simulation results of the evolution of the reliability estimates associated with the channels with $\mu_{4}=0.55$ and $\mu_{5}=0.6$ over time for $\lambda=0.75$. Each data point in Figs. 3, 4, and 5 is an average over the results of $10^{3}$ simulations.

Figure 3: Simulation of a network with $\lambda=0.1$.
Figure 4: Simulation of a network with $\lambda=0.75$.
Figure 5: Simulation of a network with $\lambda=0.75$.

The results in Figs. 3 and 4 suggest that, as expected, the AoI regret associated with Optimal and Hybrid is bounded, while the AoI regrets associated with $\epsilon$-Greedy, UCB, and TS grow over time. By comparing the AoI regret of Optimal and TS in Figs. 3 and 4, it is clear that the AoI regret of the Optimal channel policy varies significantly with $\lambda$. In particular, for $T=10^{5}$, when $\lambda$ increases from $0.1$ to $0.75$, the AoI regret of TS increases by a factor of $1.5$ (from $1{,}318$ to $1{,}963$), while the AoI regret of Optimal increases by a factor of $451.0$ (from $1{,}068$ to $481{,}700$). A main reason for this performance degradation is that when $\lambda$ increases, empty systems with $E(t)=1$ occur less often and, as a result, the Optimal channel policy takes longer to learn the reliability of the channels, as can be seen in Fig. 5. To improve the performance of the Optimal policy for networks with large $\lambda$, we propose a heuristic called the Hybrid channel policy, which employs TS in the first $10^{4}$ slots to quickly learn the reliability of the channels, and then shifts to the Optimal policy, which has bounded AoI regret in the long term. Figure 5 illustrates the difference in the learning rates between Optimal and Hybrid. Notice in Fig. 5 that there are extended periods of time in which the Optimal channel policy assigns a larger estimated reliability to a suboptimal channel, which leads to the large AoI regret shown in Fig. 4. However, as established in Theorem 7, for a long enough time horizon $T$, the Optimal policy will eventually converge to the true reliabilities, at which point the AoI regret will stop increasing.
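One plausible way to implement the Hybrid channel policy is sketched below, composing the TS and order-optimal policy sketches given earlier; the switching slot and whether the Optimal stage reuses observations gathered during the TS stage are our assumptions, not fixed by the text:

```python
class HybridChannelPolicy:
    """Sketch of the heuristic Hybrid channel policy used in the simulations.

    Runs a TS channel policy for the first `switch_slot` slots and the order-optimal
    (queue-dependent) policy afterwards. Interfaces follow the earlier sketches
    (ThompsonSamplingChannelPolicy, OrderOptimalChannelPolicy); names are ours.
    """

    def __init__(self, ts_policy, optimal_policy, switch_slot=10_000):
        self.ts = ts_policy
        self.opt = optimal_policy
        self.switch_slot = switch_slot

    def select_channel(self, t, system_empty):
        if t <= self.switch_slot:
            return self.ts.select_channel()
        return self.opt.select_channel(system_empty)

    def update(self, n, b, system_empty):
        # Feed every observation to both sub-policies so their estimates stay current
        # (an implementation choice; the paper does not specify this detail).
        self.ts.update(n, b)
        self.opt.update(n, b, system_empty)
```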

VI Conclusion

This paper considers a single-hop wireless network with $M$ sources transmitting time-sensitive information to the destination over $N$ unreliable channels. Packets from each source are generated according to a Bernoulli process with known rate $\lambda$ and the state of channel $n$ (ON/OFF) varies according to a Bernoulli process with unknown rate $\mu_{n}$. The reliabilities $\vec{\mu}$ of the wireless channels are to be learned through observation. At every slot $t$, the learning algorithm selects a single pair $(m(t),n(t))$ and the selected source $m(t)$ attempts to transmit its packet via the selected channel $n(t)$. The goal of the learning algorithm is to minimize the expected total AoI $\bar{h}(T)$. To analyze the performance of the learning algorithm, we derive bounds on the AoI regret $R^{\pi}(T)$ associated with different learning algorithms. Our main contributions include: i) analyzing the performance of learning algorithms that employ channel policies based on traditional MAB algorithms ($\epsilon$-Greedy, UCB, and TS) and showing that their AoI regret scales as $\Theta(\log T)$; and ii) developing a novel learning algorithm and establishing that it has $O(1)$ AoI regret. To the best of our knowledge, this is the first learning algorithm with bounded AoI regret. Interesting extensions of this work include consideration of sources with unknown packet generation rates and channels with time-varying statistics.

References

  • [1] A. Kosta, N. Pappas, and V. Angelakis, “Age of information: A new concept, metric, and tool,” Foundations and Trends in Networking, vol. 12, no. 3, pp. 162–259, 2017.
  • [2] Y. Sun, I. Kadota, R. Talak, and E. Modiano, Age of Information: A New Metric for Information Freshness.   Morgan & Claypool, 2019.
  • [3] R. D. Yates, Y. Sun, D. R. B. III, S. K. Kaul, E. Modiano, and S. Ulukus, “Age of information: An introduction and survey,” 2020.
  • [4] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, no. 2–3, pp. 235–256, May 2002.
  • [5] A. Garivier and O. Cappé, “The KL-UCB algorithm for bounded stochastic bandits and beyond,” in Proceedings of the 24th Annual Conference on Learning Theory, vol. 19, 2011.
  • [6] S. Agrawal and N. Goyal, “Analysis of thompson sampling for the multi-armed bandit problem,” in Proceedings of the 25th Annual Conference on Learning Theory, vol. 23, 2012.
  • [7] T. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
  • [8] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, pp. 285–294, 1933.
  • [9] A. Slivkins, “Introduction to multi-armed bandits,” Foundations and Trends in Machine Learning, vol. 12, pp. 1–286, 2019.
  • [10] J. Gittins, K. Glazebrook, and R. Weber, Multi-armed Bandit Allocation Indices, 2nd ed.   Wiley, Mar. 2011.
  • [11] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games.   USA: Cambridge University Press, 2006.
  • [12] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends in Machine Learning, vol. 5, pp. 1–122, 2012.
  • [13] S. Krishnasamy, R. Sen, R. Johari, and S. Shakkottai, “Learning unknown service rates in queues: A multiarmed bandit approach,” Operations Research, 2020.
  • [14] T. Stahlbuhk, B. Shrader, and E. Modiano, “Learning algorithms for minimizing queue length regret,” in Proceedings of IEEE ISIT, 2018, pp. 1001–1005.
  • [15] T. B. Stahlbuhk, “Control of wireless networks under uncertain state information,” Ph.D. dissertation, Massachusetts Institute of Technology, 2018.
  • [16] I. Kadota, A. Sinha, E. Uysal-Biyikoglu, R. Singh, and E. Modiano, “Scheduling policies for minimizing age of information in broadcast wireless networks,” IEEE/ACM Transactions on Networking, 2018.
  • [17] I. Kadota, A. Sinha, and E. Modiano, “Optimizing age of information in wireless networks with throughput constraints,” in Proceedings of IEEE INFOCOM, 2018.
  • [18] Y.-P. Hsu, “Age of information: Whittle index for scheduling stochastic arrivals,” in Proceedings of IEEE ISIT, 2018.
  • [19] Y.-P. Hsu, E. Modiano, and L. Duan, “Scheduling algorithms for minimizing age of information in wireless broadcast networks with random arrivals,” IEEE Transactions on Mobile Computing, vol. 19, no. 12, pp. 2903–2915, 2020.
  • [20] J. Sun, Z. Jiang, B. Krishnamachari, S. Zhou, and Z. Niu, “Closed-form whittle’s index-enabled random access for timely status update,” IEEE Transactions on Communications, vol. 68, no. 3, pp. 1538–1551, 2020.
  • [21] V. Tripathi and E. Modiano, "A whittle index approach to minimizing functions of age of information," in Proceedings of IEEE Allerton, 2019, pp. 1160–1167.
  • [22] S. Fatale, K. Bhandari, U. Narula, S. Moharir, and M. K. Hanawal, “Regret of age-of-information bandits,” 2020.
  • [23] I. Juneja, S. Fatale, and S. Moharir, “Correlated age-of-information bandits,” 2020.
  • [24] A. Prasad, V. Jain, and S. Moharir, “Decentralized age-of-information bandits,” 2020.
  • [25] M. Costa, M. Codreanu, and A. Ephremides, “On the age of information in status update systems with packet management,” IEEE Transactions on Information Theory, vol. 62, no. 4, pp. 1897–1910, 2016.
  • [26] S. Kaul, R. D. Yates, and M. Gruteser, “Status updates through queues,” in Proceedings of IEEE CISS, 2012.
  • [27] A. M. Bedewy, Y. Sun, and N. B. Shroff, “The age of information in multihop networks,” IEEE/ACM Transactions on Networking, vol. 27, no. 3, pp. 1248–1257, 2019.
  • [28] I. Kadota and E. Modiano, “Minimizing the age of information in wireless networks with stochastic arrivals,” in Proceedings of ACM MobiHoc, 2019.
  • [29] ——, “Minimizing the age of information in wireless networks with stochastic arrivals,” IEEE Transactions on Mobile Computing, 2019.
  • [30] S. Agrawal and N. Goyal, “Further optimal regret bounds for thompson sampling,” in Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, vol. 31, 2013, pp. 99–107.

Appendix A Proof of Lemma 8

To establish Lemma 8, we first show that optimal periods do not contribute to the auxiliary AoI regret defined in (15). Then, we obtain an upper bound on the contribution of suboptimal periods by manipulating the expression of the auxiliary AoI regret.

Lemma 11.

If period $p$ is an optimal period, then for any slot $t$ within period $p$ and for any source $m$, we have $h^{\hat{\eta}}_{m}(t)=h^{*}_{m}(t)$.

Proof.

Consider the time slots preceding period $p$ in a network employing the auxiliary learning algorithm $\hat{\eta}$. In slot $s_{p}-1$, the algorithm $\hat{\eta}$ delivers the last packet in the system and in slot $s_{p}$ the system becomes empty, with $E(s_{p})=1$. Let $t^{\prime}_{m}$ be the slot in which the latest packet generated from source $m$ was delivered by $\hat{\eta}$. It follows that $t^{\prime}_{m}\leq s_{p}-1$ and source $m$ did not generate new packets during the time interval $[t^{\prime}_{m}+1,s_{p}]$.

Recall that (by construction) algorithms $\hat{\eta}$ and $\pi^{*}$ select the same source $m^{*}(t)$ at every slot $t$ and that (due to the coupling argument in Remark 1, illustrated in the sketch after this proof) when a transmission by $\hat{\eta}$ is successful, the transmission by $\pi^{*}$ is also successful. Hence, algorithm $\pi^{*}$ also delivered the latest packet generated from source $m$ during (or before) slot $t^{\prime}_{m}$, which implies that $\tau_{m}^{\hat{\eta}}(s_{p})=\tau_{m}^{\pi^{*}}(s_{p})$ and, as a result, $h^{\hat{\eta}}_{m}(s_{p})=s_{p}-\tau_{m}^{\hat{\eta}}(s_{p})=s_{p}-\tau_{m}^{\pi^{*}}(s_{p})=h^{*}_{m}(s_{p})$.

Since $h^{\hat{\eta}}_{m}(s_{p})=h^{*}_{m}(s_{p})$ for every source $m$ and for the first slot of every period $p$, it follows that if period $p$ is an optimal period (in which $\hat{\eta}$ and $\pi^{*}$ select the same source and the same channel in every slot $t$) then $h^{\hat{\eta}}_{m}(t)=h^{*}_{m}(t)$ in every slot $t$ within period $p$ and for every source $m$. ∎
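To make the coupling argument concrete, the following is a minimal sketch of one standard shared-randomness construction; the reliability values, the channel selected by $\hat{\eta}$, and the assumption that $\pi^{*}$ always transmits over the most reliable channel $n^{*}$ are illustrative assumptions and are not taken verbatim from Remark 1.

```python
import random

# Minimal sketch of a shared-randomness coupling (assumed construction):
# in each slot both algorithms observe the same uniform draw u, and a
# transmission over channel n succeeds iff u <= mu[n]. Since n_star is the
# most reliable channel, a success by eta_hat implies a success by pi_star.

mu = {1: 0.9, 2: 0.6, 3: 0.3}          # hypothetical channel reliabilities
n_star = max(mu, key=mu.get)           # most reliable channel

def coupled_slot(rng, channel_hat):
    """Return (success of eta_hat over channel_hat, success of pi_star over n_star)."""
    u = rng.random()                   # common randomness shared by both algorithms
    return u <= mu[channel_hat], u <= mu[n_star]

rng = random.Random(0)
for _ in range(10_000):
    success_hat, success_star = coupled_slot(rng, channel_hat=3)
    assert (not success_hat) or success_star   # eta_hat success implies pi_star success
```

Under such a coupling, a delivery by $\hat{\eta}$ in a given slot is also a delivery by $\pi^{*}$ of a packet from the same source, which is the implication used in the proof above.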

From Lemma 11, we have that if period $p$ is an optimal period, then

$$\sum_{m=1}^{M}\sum_{t=s_{p}}^{f_{p}}h^{\hat{\eta}}_{m}(t)=\sum_{m=1}^{M}\sum_{t=s_{p}}^{f_{p}}h^{*}_{m}(t)\;.\qquad(22)$$

Using this result in the expression of the auxiliary AoI regret, we obtain

$$\begin{aligned}
\mathrm{E}&\left[\sum_{m=1}^{M}\sum_{t=1}^{T}\left[h^{\hat{\eta}}_{m}(t)-h^{*}_{m}(t)\right]\right]\\
&=\mathrm{E}\left[\sum_{m=1}^{M}\sum_{t=1}^{T}\underbrace{\left[h^{\hat{\eta}}_{m}(t)-h^{*}_{m}(t)\right]}_{\leq h^{\hat{\eta}}_{m}(t)}\mathbbm{1}\left\{n(t)\neq n^{*}\cap E(t)=0\right\}\right.\\
&\qquad+\left.\underbrace{\left[h^{\hat{\eta}}_{m}(t)-h^{*}_{m}(t)\right]\mathbbm{1}\left\{n(t)=n^{*}\cup E(t)=1\right\}}_{=0}\right]\\
&\overset{(a)}{\leq}\mathrm{E}\left[\sum_{m=1}^{M}\sum_{p=1}^{T}y^{\hat{\eta}}_{m}(p)\,\mathbbm{1}\left\{\bar{n}(p)\neq n^{*}\right\}\right]\\
&=\sum_{m=1}^{M}\sum_{p=1}^{T}\mathrm{E}\left[y^{\hat{\eta}}_{m}(p)\,\mathbbm{1}\left\{\bar{n}(p)\neq n^{*}\right\}\right]\\
&\overset{(b)}{=}\sum_{m=1}^{M}\sum_{p=1}^{T}\mathrm{E}\left[y^{\hat{\eta}}_{m}(p)\;\middle|\;\bar{n}(p)\neq n^{*}\right]\mathrm{P}\left(\bar{n}(p)\neq n^{*}\right)\qquad(23)
\end{aligned}$$

where (a) follows from the fact that each period $p$ has a duration of at least $1$ time slot, and (b) follows from the law of total expectation.

Appendix B Proof of Lemma 10

To establish Lemma 10, we first use Hoeffding's inequality to upper bound $\mathrm{P}\left(\bar{n}(p)=n\right)$, for every suboptimal channel $n$, by a function that decays exponentially in $p$. The result then follows directly from this upper bound.

Lemma 12.

For every suboptimal channel index $n\neq n^{*}$, the probability that the channel policy $\bar{\eta}_{b}$ selects channel $n$ in the nonempty phase of period $p$ is bounded by

$$\mathrm{P}\left(\bar{n}(p)=n\right)\leq 2\exp\left(-\dfrac{1}{2N^{2}}p\right)+2\exp\left(-\dfrac{\Delta_{n}^{2}}{4N}p\right)\;,\qquad(24)$$

where $\Delta_{n}=\mu^{*}-\mu_{n}$.

Proof.

The channel policy $\bar{\eta}_{b}$ uses the empty phases of each period $p$ to explore the channels and update its estimates of the channel reliabilities. We know that at the beginning of each period $p$ there is an empty phase with at least one exploration slot. Let $N_{n}(p)$ be the number of exploration slots in which channel $n$ is selected within the first $p$ periods and let $p^{\prime}$ be the total number of exploration slots within the first $p$ periods. It follows that

$$\sum_{n=1}^{N}N_{n}(p)=p^{\prime}\geq p\quad\text{and}\quad\mathrm{E}\left[N_{n}(p)\right]=\dfrac{1}{N}p^{\prime}\;.\qquad(25)$$

Let $\hat{\mu}_{n}(p^{\prime})$ be the estimate of the reliability of channel $n$ after a total of $p^{\prime}$ exploration slots. Then, for any suboptimal channel $n\neq n^{*}$, we have

$$\begin{aligned}
\mathrm{P}\left(\bar{n}(p)=n\right)&\leq\mathrm{P}\left(\hat{\mu}_{n}(p^{\prime})\geq\hat{\mu}^{*}(p^{\prime})\right)\\
&\overset{(a)}{\leq}\mathrm{P}\left(\hat{\mu}_{n}(p^{\prime})\geq\dfrac{\mu_{n}+\mu^{*}}{2}\right)+\mathrm{P}\left(\dfrac{\mu_{n}+\mu^{*}}{2}\geq\hat{\mu}^{*}(p^{\prime})\right)\\
&=\mathrm{P}\left(\hat{\mu}_{n}(p^{\prime})-\mu_{n}\geq\dfrac{\Delta_{n}}{2}\right)+\mathrm{P}\left(\hat{\mu}^{*}(p^{\prime})-\mu^{*}\leq-\dfrac{\Delta_{n}}{2}\right)\;,\qquad(26)
\end{aligned}$$

where (a) follows from the union bound. Denote $\hat{\mu}_{n}(p^{\prime})-\mu_{n}=I_{n}$. Then,

$$\begin{aligned}
\mathrm{P}&\left(\hat{\mu}_{n}(p^{\prime})-\mu_{n}\geq\dfrac{\Delta_{n}}{2}\right)\\
&=\mathrm{P}\left(I_{n}\geq\dfrac{\Delta_{n}}{2},\,N_{n}(p)\leq\dfrac{p^{\prime}}{2N}\right)+\mathrm{P}\left(I_{n}\geq\dfrac{\Delta_{n}}{2},\,N_{n}(p)>\dfrac{p^{\prime}}{2N}\right)\\
&\leq\mathrm{P}\left(N_{n}(p)\leq\dfrac{p^{\prime}}{2N}\right)+\mathrm{P}\left(I_{n}\geq\dfrac{\Delta_{n}}{2}\;\middle|\;N_{n}(p)>\dfrac{p^{\prime}}{2N}\right)\;.\qquad(27)
\end{aligned}$$

Next, we upper bound each of the last two terms by Hoeffding’s inequality.

Notice that:

  • $N_{n}(p)$ is the sum of $p^{\prime}$ i.i.d. Bernoulli random variables with mean $\dfrac{1}{N}$; and

  • $\hat{\mu}_{n}(p^{\prime})$ is the average of $N_{n}(p)$ i.i.d. Bernoulli random variables with mean $\mu_{n}$.

Thus, by Hoeffding’s inequality

$$\begin{aligned}
\mathrm{P}\left(N_{n}(p)\leq\dfrac{1}{2N}p^{\prime}\right)&=\mathrm{P}\left(\dfrac{N_{n}(p)}{p^{\prime}}-\dfrac{1}{N}\leq-\dfrac{1}{2N}\right)\\
&\leq\exp\left(-\dfrac{1}{2N^{2}}p^{\prime}\right)\qquad(28)
\end{aligned}$$

and

$$\mathrm{P}\left(I_{n}\geq\dfrac{\Delta_{n}}{2}\;\middle|\;N_{n}(p)>\dfrac{1}{2N}p^{\prime}\right)\leq\exp\left(-\dfrac{\Delta_{n}^{2}}{4N}p^{\prime}\right)\;,\qquad(29)$$
where (29) follows from applying Hoeffding's inequality to the average of more than $\dfrac{p^{\prime}}{2N}$ samples, since $\exp\left(-2\cdot\dfrac{p^{\prime}}{2N}\cdot\left(\dfrac{\Delta_{n}}{2}\right)^{2}\right)=\exp\left(-\dfrac{\Delta_{n}^{2}}{4N}p^{\prime}\right)$.
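As a quick numerical sanity check of (28) (a sketch only; the values of $N$, $p^{\prime}$, and the number of trials below are arbitrary assumptions), the probability on its left-hand side can be estimated by simulating the uniform exploration and compared with the exponential bound:

```python
import math
import random

# Monte Carlo check of inequality (28): the probability that channel n is
# selected in at most p'/(2N) of the p' uniform exploration slots should lie
# below exp(-p'/(2N^2)). N and p_prime are arbitrary illustrative values.
N = 4
p_prime = 200
trials = 50_000

rng = random.Random(0)
hits = 0
for _ in range(trials):
    # N_n(p): number of exploration slots (out of p') that select channel n
    n_n = sum(rng.random() < 1 / N for _ in range(p_prime))
    hits += n_n <= p_prime / (2 * N)

empirical = hits / trials
bound = math.exp(-p_prime / (2 * N ** 2))
print(f"empirical {empirical:.5f}  vs  Hoeffding bound {bound:.5f}")
```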

Inequalities (27), (28) and (29) imply

$$\mathrm{P}\left(\hat{\mu}_{n}(p^{\prime})-\mu_{n}\geq\dfrac{\Delta_{n}}{2}\right)\leq\exp\left(-\dfrac{1}{2N^{2}}p^{\prime}\right)+\exp\left(-\dfrac{\Delta_{n}^{2}}{4N}p^{\prime}\right).\qquad(30)$$

Analogously, we have

$$\mathrm{P}\left(\hat{\mu}^{*}(p^{\prime})-\mu^{*}\leq-\dfrac{\Delta_{n}}{2}\right)\leq\exp\left(-\dfrac{1}{2N^{2}}p^{\prime}\right)+\exp\left(-\dfrac{\Delta_{n}^{2}}{4N}p^{\prime}\right)\;.\qquad(31)$$

Now, inequalities (26), (30) and (31) imply that

$$\begin{aligned}
\mathrm{P}\left(\bar{n}(p)=n\right)&\leq 2\exp\left(-\dfrac{1}{2N^{2}}p^{\prime}\right)+2\exp\left(-\dfrac{\Delta_{n}^{2}}{4N}p^{\prime}\right)\\
&\leq 2\exp\left(-\dfrac{1}{2N^{2}}p\right)+2\exp\left(-\dfrac{\Delta_{n}^{2}}{4N}p\right)\;,\qquad(32)
\end{aligned}$$
where the second inequality follows from $p^{\prime}\geq p$ in (25). ∎
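The bound of Lemma 12 can also be checked numerically through its intermediate form $\mathrm{P}\left(\hat{\mu}_{n}(p^{\prime})\geq\hat{\mu}^{*}(p^{\prime})\right)$. The sketch below (with assumed channel reliabilities and $p^{\prime}=p$, the worst case since $p^{\prime}\geq p$) estimates how often a given suboptimal channel ends up with the larger empirical mean and compares this frequency with the right-hand side of (24):

```python
import math
import random

# Monte Carlo check of Lemma 12 via P(mu_hat_n >= mu_hat_star): simulate p
# uniform exploration slots with Bernoulli channel outcomes and count how
# often the suboptimal channel n has the larger empirical mean. All
# parameter values are illustrative assumptions.
mu = [0.8, 0.5, 0.3]        # channel 0 plays the role of n*, channel 1 is the suboptimal n
n_star, n = 0, 1
N = len(mu)
p = 400                     # p' = p exploration slots (worst case, since p' >= p)
trials = 10_000

rng = random.Random(0)
misselect = 0
for _ in range(trials):
    successes = [0] * N
    counts = [0] * N
    for _ in range(p):
        c = rng.randrange(N)                 # uniform channel selection in exploration slots
        counts[c] += 1
        successes[c] += rng.random() < mu[c]
    if counts[n] == 0 or counts[n_star] == 0:
        misselect += 1                       # conservative: count unsampled channels as a misselection
    elif successes[n] / counts[n] >= successes[n_star] / counts[n_star]:
        misselect += 1

delta_n = mu[n_star] - mu[n]
bound = 2 * math.exp(-p / (2 * N ** 2)) + 2 * math.exp(-delta_n ** 2 * p / (4 * N))
print(f"empirical {misselect / trials:.4f}  vs  bound (24) {bound:.4f}")
```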

Using Lemma 12, we obtain

$$\begin{aligned}
\sum_{p=1}^{T}\mathrm{P}\left(\bar{n}(p)\neq n^{*}\right)&=\sum_{n\neq n^{*}}\sum_{p=1}^{T}\mathrm{P}\left(\bar{n}(p)=n\right)\\
&\leq\sum_{n\neq n^{*}}\sum_{p=1}^{T}\left[2\exp\left(-\dfrac{1}{2N^{2}}p\right)+2\exp\left(-\dfrac{\Delta_{n}^{2}}{4N}p\right)\right]\\
&\leq\sum_{n\neq n^{*}}\sum_{p=1}^{\infty}\left[2\exp\left(-\dfrac{1}{2N^{2}}p\right)+2\exp\left(-\dfrac{\Delta_{n}^{2}}{4N}p\right)\right]\\
&=\sum_{n\neq n^{*}}\left[\dfrac{2}{\exp\left(\dfrac{1}{2N^{2}}\right)-1}+\dfrac{2}{\exp\left(\dfrac{\Delta_{n}^{2}}{4N}\right)-1}\right]\\
&\leq\dfrac{2(N-1)}{\exp\left(\dfrac{1}{2N^{2}}\right)-1}+\dfrac{2(N-1)}{\exp\left(\dfrac{\Delta_{\min}^{2}}{4N}\right)-1}\;,\qquad(33)
\end{aligned}$$

which is a constant independent of $T$, where $\Delta_{\min}=\mu^{*}-\max_{n\neq n^{*}}\mu_{n}$. This concludes the proof of Lemma 10. ∎
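The last two steps of (33) use the geometric series $\sum_{p=1}^{\infty}e^{-cp}=\dfrac{1}{e^{c}-1}$ for $c>0$ and the worst-case gap $\Delta_{\min}$. As a small numerical illustration (with arbitrarily chosen values of $N$ and $\Delta_{\min}$), the closed-form constant can be compared against a truncated version of the infinite sum:

```python
import math

# Numerical illustration of the constant in (33): the closed form
# 2(N-1)/(e^{1/(2N^2)}-1) + 2(N-1)/(e^{Delta_min^2/(4N)}-1) should match the
# (truncated) sum over periods of the per-period bounds, with every
# suboptimal gap replaced by Delta_min. N and delta_min are illustrative.
N = 4
delta_min = 0.2

c1 = 1 / (2 * N ** 2)
c2 = delta_min ** 2 / (4 * N)

closed_form = 2 * (N - 1) / math.expm1(c1) + 2 * (N - 1) / math.expm1(c2)

# Truncated sum over p of (N-1) * [2 exp(-c1 p) + 2 exp(-c2 p)]
truncated = sum(
    (N - 1) * (2 * math.exp(-c1 * p) + 2 * math.exp(-c2 * p)) for p in range(1, 20_000)
)

print(f"closed form {closed_form:.3f}  vs  truncated sum {truncated:.3f}")
```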