
Indexability of Finite State Restless Multi-Armed Bandit and Rollout Policy

Vishesh Mittal, Rahul Meshram, Deepak Dev and Surya Prakash
IIIT Allahabad
India
Abstract

We consider the finite state restless multi-armed bandit problem. In each time step the decision maker can act on M out of N bandits. Playing an arm (the active action) yields a state dependent reward, and when an arm is not played it also yields a state dependent reward. The objective of the decision maker is to maximize the infinite horizon discounted reward. The classical approach to restless bandits is the Whittle index policy, in which the M arms with the highest indices are played at each time step. One decouples the restless bandit problem by analyzing a relaxed constrained version of it; the Lagrangian relaxation then decouples the problem into N single-armed restless bandit problems. We analyze the single-armed restless bandit. In order to study the Whittle index policy, we show structural results on the single-armed bandit model. We define indexability and show indexability in special cases. We propose an alternative approach to verify the indexability criterion for a single-armed bandit model using the value iteration algorithm. We demonstrate the performance of our algorithm on different examples. We provide insight on conditions for indexability of restless bandits using different structural assumptions on the transition probability and reward matrices.

We also study an online rollout policy, discuss the computational complexity of the algorithm, and compare it with the complexity of index computation. Numerical examples illustrate that the index policy and the rollout policy perform better than the myopic policy.

I Introduction

The restless multi-armed bandit (RMAB) is a class of sequential decision problems. RMABs are extensively studied for resource allocation problems such as machine maintenance [1, 2], congestion control [3], healthcare [4], opportunistic channel scheduling [5], recommendation systems [6, 7], and queueing systems [8, 9]. The RMAB problem is described as follows. A decision maker has N independent arms, where each arm can be in one of finitely many states and the state evolves according to a Markovian law. Playing an arm yields a state dependent reward, and we assume that not playing an arm also yields a state dependent reward. The transition dynamics of the state evolution are assumed to be known to the decision maker. The system is time slotted, and the decision maker plays M out of N arms in each slot. The goal is to determine which sequence of arms to play so as to maximize the long-term discounted cumulative reward.

RMAB is a generalization of the classical multi-armed bandit (Markov bandit) and is known to be PSPACE hard [10]. The RMAB was first introduced in [11], where the author proposed the Whittle index policy, a heuristic whose popularity is due to its optimal performance under some conditions [12]. In this policy, the arms with the highest Whittle indices are played at each time-step. The index is derived for each arm by analyzing a single-armed restless bandit, where a subsidy is introduced for not playing the arm; this is the Lagrangian relaxation of the restless bandit problem. To use the Whittle index policy, one is required to show the indexability condition. Sufficient conditions for indexability are provided under modeling assumptions on the restless bandit, particularly on the transition probability and reward matrices. In [13, 14], the authors introduced a linear programming (LP) approach to restless bandits and proposed a primal-dual heuristic algorithm. In [15], the author studied the marginal productivity index (MPI) for RMABs, a generalization of the Whittle index, and further developed an adaptive greedy policy to compute the MPI. Their model is assumed to satisfy partial conservation laws (PCL); for RMABs with PCL, one can apply the adaptive greedy algorithm for Whittle index computation. In general, it is difficult to show indexability and then compute an index formula. It is possible to compute the index approximately, and this is done in [16]. Index computation algorithms using a two timescale stochastic approximation approach and the value iteration algorithm are studied in [3, 17, 18]. All of these works require proving indexability before computing the indices, and results on indexability make structural assumptions on the transition probability and reward matrices.

A generalization of RMAB is referred to as weakly coupled Markov decision processes (MDPs), where more complex constraints are introduced [19]; an approximate dynamic program for this setting is studied in [20]. A fluid approximation for restless bandits is studied in [21]. A simulation-based approach for restless bandits is considered in [22]. A linear programming approach with asymptotic optimality using a fluid model is studied in [23]; another paper on fluid models for resource allocation is [24]. An LP-based policy for weakly coupled MDPs is studied in [25]. In [26], an index computation algorithm is discussed. Recently, restless bandits with partially observable states and hidden Markov bandit models have been studied in [18, 5, 27].

I-A Our contributions

In this paper, we study the finite state restless multi-armed bandit problem. We analyze the single-armed restless bandit problem, and our goal is to show that the restless bandit is indexable. First, we provide structural results by making modeling assumptions on the transition probability and reward matrices and prove indexability analytically.

The second contribution of this paper is to study the single-armed restless bandit using the value-iteration algorithm with a fixed subsidy (Lagrangian parameter) and obtain the optimal decision rule for each state, where the optimal decision is either the passive or the active action. Next, we vary the subsidy over a fixed-size grid and compute the optimal decision rule for each state and each value of subsidy on the grid; the result is referred to as the policy matrix. By analyzing this matrix, we provide an alternative approach to verify the indexability condition, which also yields an index value for each state. The advantage of our approach is that the indexability condition is easy to verify using the value iteration scheme. We present the analysis of indexability using the policy matrix with a few examples. To the best of our knowledge, this viewpoint has not been taken up in earlier works.

Third, we discuss different numerical examples from [28, 29, 16, 15] and present examples of non-indexable bandits. We give insight on conditions for non-indexability of restless bandits using numerical examples. We observe that when the reward is monotone (non-decreasing/non-increasing) in the same direction for both actions, the bandit is indexable. When the reward for the passive action is increasing in the state, the reward for the active action is decreasing in the state, there is a sufficient difference between the rewards of the passive and active actions, and the transition probability matrices have some structure, then the bandit is shown to be non-indexable. This example is motivated from [15]. However, for most applications, such as opportunistic scheduling in communication systems [17] and recommendation systems [6, 7], the rewards are monotone in the same direction for both actions, and thus the bandit is indexable.

Fourth, we compare the performance of the Whittle index policy, the myopic policy and an online rollout policy. Numerical examples illustrate that the discounted cumulative reward under the index policy and the rollout policy is higher than that of the myopic policy for non-identical restless bandits.

The rest of the paper is organized as follows. The RMAB problem is formulated in Section II. Indexability and our approach to verifying indexability for a single-armed restless bandit are discussed in Section III. Different algorithms for RMAB are presented in Section IV. Numerical examples on indexability of the single-armed restless bandit are given in Section V. Numerical examples for RMAB are illustrated in Section VI. We make concluding remarks in Section VII.

II Model of Restless Bandits

Consider N independent K-state Markov decision processes (MDPs). Each MDP has state space \mathcal{S}=\{1,2,\cdots,K\} and action space \mathcal{A}=\{0,1\}. Time progresses in discrete time-steps, denoted by t=0,1,2,3,\dots. Let X_{t}^{n} denote the state of Markov chain n at instant t\geq 0 for 1\leq n\leq N, with X_{t}^{n}\in\mathcal{S}. Let p_{i,j}^{a,n} be the transition probability from state i to j under action a for the nth Markov chain; thus the transition probability matrix of the nth MDP is P^{a,n}=[[p_{i,j}^{a,n}]] for a\in\mathcal{A}. The reward from state i under action a for the nth MDP is r^{n}(i,a). Let A_{t}^{n} be the action corresponding to the nth MDP at time t, with A_{t}^{n}\in\{0,1\}. Let \pi_{t}:H_{t}\rightarrow\mathcal{M} be the strategy that maps the history to a subset \mathcal{M}\subset\mathcal{N}, where \mathcal{N}=\{1,2,\dots,N\} and |\mathcal{M}|=M. Under policy \pi, the action at time t is

A_{t}^{n}=\begin{cases}1&n\in\mathcal{M},\\ 0&n\notin\mathcal{M}.\end{cases}

The infinite horizon discounted cumulative reward starting from initial state X_{0}=s under a fixed policy \pi is given as follows.

V^{\pi}(s)=\mathrm{E}^{\pi}\left[\sum_{t=0}^{\infty}\beta^{t}\left(\sum_{n=1}^{N}r^{n}(X_{t}^{n},A_{t}^{n})\right)\right].

Here, β is the discount parameter, 0<\beta<1. The agent can play at most M arms at each time t. Thus,

\sum_{n=1}^{N}A_{t}^{n}=M,\ \ \forall\ t.

Our objective is

\max_{\pi\in\Pi}V^{\pi}(s)
\text{ subject to }\sum_{n=1}^{N}A_{t}^{n}=M,\ \forall t.

This is the discounted (reward-based) restless bandit problem. Here, \Pi is the set of policies. The relaxed discounted restless bandit problem is as follows.

\max_{\pi\in\Pi}V^{\pi}(s)
\text{ subject to }\sum_{t=0}^{\infty}\beta^{t}\left(\sum_{n=1}^{N}A_{t}^{n}\right)=\frac{M}{1-\beta}. (1)

Then the Lagrangian relaxation of problem (1) is given by

\max_{\pi\in\Pi}\mathrm{E}^{\pi}\left[\sum_{t=0}^{\infty}\beta^{t}\left(\sum_{n=1}^{N}r^{n}(X_{t}^{n},A_{t}^{n})+\lambda(1-A_{t}^{n})\right)\right].

Here, λ is the Lagrangian variable, \lambda\in\mathbb{R}. This is equivalent to solving N single-armed restless bandits, i.e., MDPs with a finite number of states, two actions, and policies \pi:\mathcal{S}\rightarrow\{0,1\}. Thus, the single-armed restless bandit problem is given as follows.

\max_{\pi\in\Pi}\mathrm{E}^{\pi}\left[\sum_{t=0}^{\infty}\beta^{t}\left(r^{n}(X_{t}^{n},A_{t}^{n})+\lambda(1-A_{t}^{n})\right)\right] (2)

II-A Single-armed restless bandit

For notational simplicity, we omit the superscript n and make the dependence of the value function on λ explicit. Then, the optimization problem from starting state s under policy \pi is as follows.

V(s,\lambda)=\max_{\pi\in\Pi}\mathrm{E}^{\pi}\left[\sum_{t=0}^{\infty}\beta^{t}\left(r(X_{t},A_{t})+\lambda(1-A_{t})\right)\right] (3)

The optimal dynamic program is given by

V(s,\lambda)=\max_{a\in\{0,1\}}\left\{r(s,a)+(1-a)\lambda+\beta\sum_{s^{\prime}\in\mathcal{S}}p_{s,s^{\prime}}^{a}V(s^{\prime},\lambda)\right\}

for all s\in\mathcal{S}. We use these ideas to define and analyze the indexability of the single-armed restless bandit in the next section.

III Indexability and Whittle index

The Lagrangian variable λ is referred to as the subsidy. We can rewrite the dynamic program as follows.

Q(s,a=0,\lambda) = r(s,a=0)+\lambda+\beta\sum_{s^{\prime}\in\mathcal{S}}p_{s,s^{\prime}}^{0}V(s^{\prime},\lambda)
Q(s,a=1,\lambda) = r(s,a=1)+\beta\sum_{s^{\prime}\in\mathcal{S}}p_{s,s^{\prime}}^{1}V(s^{\prime},\lambda)
V(s,\lambda) = \max\left\{Q(s,a=0,\lambda),Q(s,a=1,\lambda)\right\}

for all s\in\mathcal{S}. We define the passive set B(\lambda) as follows.

B(\lambda)=\left\{s\in\mathcal{S}:Q(s,a=0,\lambda)\geq Q(s,a=1,\lambda)\right\}

Let \pi_{\lambda}:\mathcal{S}\rightarrow\mathcal{A} be the policy that maps state s to action a for a given subsidy λ. We can also rewrite B(\lambda) as

B(\lambda)=\left\{s\in\mathcal{S}:\pi_{\lambda}(s)=0\right\}.
Definition 1 (Indexable)

A single-armed restless bandit (\mathcal{S},\mathcal{A}=\{0,1\},P^{0},P^{1},R,\beta) is indexable if B(\lambda) is non-decreasing in λ, i.e.,

\lambda_{2}\geq\lambda_{1}\implies B(\lambda_{1})\subseteq B(\lambda_{2}).

A restless multi-armed bandit problem is indexable if each bandit is indexable. In words, a restless bandit is indexable if, as the passive subsidy increases, the set of states for which the passive action is optimal does not shrink.

Definition 2 (Whittle index)

If a restless bandit (\mathcal{S},\mathcal{A}=\{0,1\},P^{0},P^{1},R,\beta) is indexable, then its Whittle index \lambda:\mathcal{S}\rightarrow\mathbb{R} is given by \lambda(s)=\inf\{\lambda:s\in B(\lambda)\} for s\in\mathcal{S}.

We assume that the rewards are bounded, which implies that the indices are also bounded.

Remark 1
  1. In general, it is very difficult to verify that a bandit is indexable and even more difficult to compute the Whittle indices. Note that, for indexability, the set B(\lambda) must be non-decreasing as λ increases.

  2. If a bandit is indexable, then there are \lambda_{1},\lambda_{2}\in\mathbb{R} with \lambda_{2}>\lambda_{1} such that \pi_{\lambda_{1}}(s)=1 and \pi_{\lambda_{2}}(s)=0 for s\in\mathcal{S}, because B(\lambda_{1})\subseteq B(\lambda_{2}). This further suggests that the policy \pi_{\lambda}(s) has a single threshold as a function of λ for fixed s\in\mathcal{S}. That is, there exists \lambda^{*} such that

     \pi_{\lambda}(s)=\begin{cases}1&\lambda<\lambda^{*},\\ 0&\lambda\geq\lambda^{*},\end{cases}

     for each s\in\mathcal{S}.

  3. If a bandit is non-indexable, then there are \lambda_{1},\lambda_{2}\in\mathbb{R} with \lambda_{2}>\lambda_{1} such that B(\lambda_{1}) is not a subset of B(\lambda_{2}). In other words, there exists some \hat{s}\in\mathcal{S} such that for \lambda_{3}>\lambda_{2}>\lambda_{1}, \pi_{\lambda_{1}}(\hat{s})=1, \pi_{\lambda_{2}}(\hat{s})=0, and \pi_{\lambda_{3}}(\hat{s})=1.

III-A Single threshold policy and indexability

We now describe a single threshold type policy for an MDP with action space \mathcal{A}=\{0,1\}, for a fixed subsidy λ. We consider deterministic Markov policies. The policy is of single threshold type when the decision rule \pi_{\lambda}(s), s\in\mathcal{S}, has the form

\pi_{\lambda}(s)=\begin{cases}0&s\leq s^{*},\\ 1&s>s^{*},\end{cases}

where s^{*} is a threshold state. This is also referred to as a control limit policy in [30]. The conditions required on the transition probability and reward matrices for a single threshold policy are discussed below.
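For concreteness, the control limit structure can be checked mechanically from a decision rule evaluated on the ordered state space. The short Python sketch below is our own illustration (the function name is not from the paper); it tests whether a 0/1 decision vector indexed by s=1,\cdots,K is passive up to some threshold state s^{*} and active afterwards.

import numpy as np

def is_single_threshold(decision_rule):
    """Return True if the 0/1 vector (indexed by state s=1..K) is passive
    up to some threshold state s* and active for all larger states."""
    d = np.asarray(decision_rule)
    # A control limit policy switches from 0 to 1 at most once and never back.
    switches = np.count_nonzero(np.diff(d) != 0)
    return switches == 0 or (switches == 1 and d[0] == 0 and d[-1] == 1)

# Example: passive in states 1-2, active in states 3-4.
print(is_single_threshold([0, 0, 1, 1]))   # True
print(is_single_threshold([0, 1, 0, 1]))   # False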

Lemma 1 (Convexity of value functions)
  1. For fixed λ and β, V(s,\lambda), Q(s,a=0,\lambda), and Q(s,a=1,\lambda) are non-decreasing and piecewise convex in s.

  2. For fixed s and β, V(s,\lambda), Q(s,a=0,\lambda), and Q(s,a=1,\lambda) are non-decreasing and piecewise convex in λ.

The proof of the lemma follows by induction and is given in Appendix -A.

Definition 3

The function Q(s,a,\lambda) is superadditive for a fixed λ if, for s^{\prime}>s with s^{\prime},s\in\mathcal{S},

Q(s^{\prime},1,\lambda)+Q(s,0,\lambda)\geq Q(s,1,\lambda)+Q(s^{\prime},0,\lambda). (4)

This implies that Q(s,1,\lambda)-Q(s,0,\lambda) is non-decreasing in s, which is also referred to as the monotone non-decreasing difference property in s, and it implies that the decision rule \pi_{\lambda}(s) is non-decreasing in s. We now state the following result from [30].

Theorem 1

Suppose that

  1. r(s,a) is non-decreasing in s for all a\in\{0,1\},

  2. q(k|s,a)=\sum_{j=k}^{K}p_{s,j}^{a} is non-decreasing in s for all k\in\mathcal{S} and a\in\{0,1\},

  3. r(s,a) is a superadditive (subadditive) function on \mathcal{S}\times\{0,1\},

  4. q(k|s,a) is a superadditive (subadditive) function on \mathcal{S}\times\{0,1\} for all k\in\mathcal{S}.

Then, there exists an optimal policy \pi_{\lambda}(s) which is non-decreasing (non-increasing) in s.

These assumptions imply that one requires fairly strong conditions on the transition probability matrices P^{0}, P^{1} and the reward matrix R.

Remark 2
  1. If r(s,a) is a superadditive function on \mathcal{S}\times\{0,1\}, then

     r(s^{\prime},1)-r(s^{\prime},0)\geq r(s,1)-r(s,0)

     for s^{\prime}>s and s^{\prime},s\in\mathcal{S}.

  2. If q(k|s,a) is a superadditive function on \mathcal{S}\times\{0,1\}, then

     q(k|s^{\prime},1)-q(k|s^{\prime},0)\geq q(k|s,1)-q(k|s,0)

     for all k\in\mathcal{S}. This implies

     \sum_{j=k}^{K}\left[p_{s^{\prime},j}^{1}-p_{s^{\prime},j}^{0}\right]\geq\sum_{j=k}^{K}\left[p_{s,j}^{1}-p_{s,j}^{0}\right].

From Theorem 1, it is clear that the optimal policy is of single threshold type in s for a fixed λ.
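The conditions of Theorem 1 can also be checked numerically for a given model. The sketch below is our own helper with hypothetical names: for states 1,\cdots,K it tests the monotonicity of r(s,a), the stochastic monotonicity of q(k|s,a)=\sum_{j\geq k}p_{s,j}^{a}, and the superadditivity of r and q on \mathcal{S}\times\{0,1\} (the non-decreasing branch of the theorem).

import numpy as np

def check_theorem1_conditions(P0, P1, R):
    """P0, P1: K x K transition matrices; R: K x 2 reward matrix (columns a=0, a=1).
    Returns a dict with one boolean per condition of Theorem 1."""
    R = np.asarray(R, dtype=float)
    # q(k|s,a) = sum_{j>=k} p^a_{s,j}: reversed cumulative sums along the next-state axis.
    q = {a: np.cumsum(P[:, ::-1], axis=1)[:, ::-1]
         for a, P in ((0, np.asarray(P0)), (1, np.asarray(P1)))}
    return {
        # 1) r(s,a) non-decreasing in s for each action.
        "reward_monotone": all(np.all(np.diff(R[:, a]) >= 0) for a in (0, 1)),
        # 2) q(k|s,a) non-decreasing in s for every k and action (stochastic monotonicity).
        "stochastically_monotone": all(np.all(np.diff(q[a], axis=0) >= 0) for a in (0, 1)),
        # 3) r superadditive: r(s,1) - r(s,0) non-decreasing in s.
        "reward_superadditive": bool(np.all(np.diff(R[:, 1] - R[:, 0]) >= 0)),
        # 4) q superadditive: q(k|s,1) - q(k|s,0) non-decreasing in s for every k.
        "q_superadditive": bool(np.all(np.diff(q[1] - q[0], axis=0) >= 0)),
    }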

We next prove that Q(s,a,\lambda) is subadditive on \{0,1\}\times\mathbb{R} for every s\in\mathcal{S}; that is, Q(s,a=1,\lambda)-Q(s,a=0,\lambda) is non-increasing in λ for each s\in\mathcal{S}. Observe that Q(s,a=0,\lambda) is strictly increasing in λ and Q(s,a=1,\lambda) is non-decreasing in λ for fixed s. Under the assumptions of Theorem 1, Q(s,a=1,\lambda)-Q(s,a=0,\lambda) is non-increasing in λ. Hence the policy \pi_{\lambda}(s) is non-increasing in λ for each s\in\mathcal{S}.

Remark 3
  1. It implies that for each s\in\mathcal{S} there is \lambda^{*}(s) such that

     \pi_{\lambda}(s)=\begin{cases}1&\lambda<\lambda^{*}(s),\\ 0&\lambda\geq\lambda^{*}(s).\end{cases}

     Further, this implies that as λ increases, B(\lambda) is non-decreasing. Thus, the restless bandit is indexable.

  2. It is possible for a restless bandit to be indexable with the following structure on the reward matrix: r(s,1) and r(s,0) are non-decreasing in s and there is no structure on the transition probability matrices. Such a model is not covered by the preceding discussion.

  3. It is further possible that a bandit is indexable even though there is no structure on the transition probability and reward matrices. This case is also not covered by the preceding theorem; this is a limitation of the analysis. With this motivation, we study an alternative approach to indexability using value-iteration based methods in the next section.

III-B Alternative approach for indexability

In this section, we study the value iteration algorithm and compute the optimal policy \pi_{\lambda}(s) for each state s\in\mathcal{S} and a fixed value of the subsidy λ. We define a grid of subsidies Λ of finite size, say J. We perform value iteration for each λ\in\Lambda and compute the policy \pi_{\lambda}(s) for s\in\mathcal{S}. The resulting policy matrix \Phi=[[\pi_{\lambda}(s)]] has size K\times J. We then verify the indexability criterion of Remark 1 using the policy matrix \Phi=[[\pi_{\lambda}(s)]], that is, for each s\in\mathcal{S} there exists \lambda^{*} such that

\pi_{\lambda}(s)=\begin{cases}1&\lambda<\lambda^{*},\\ 0&\lambda\geq\lambda^{*}.\end{cases}

If this condition is not met, we verify the non-indexability condition in Remark 1. The value iteration algorithm for the computation of the policy matrix Φ is described in Algorithm 1.

Input: Reward matrix R, transition probability matrices P^{0}, P^{1}, discount β, tolerance Δ, grid of subsidies Λ over [-1,1], T_{\max}.
Output: Policy matrix \Phi=[[\pi_{\lambda}(s)]]_{K\times J}
for λ ∈ Λ([-1,1]) do
       Initialization: Q_{0}(s,a,\lambda)=0 for all s\in\mathcal{S}, a\in\{0,1\}; V_{0}(s,\lambda)=0 for all s\in\mathcal{S}.
       for s\in\mathcal{S} do
              for t=0,1,2,\cdots,T_{\max}-1 do
                     Q_{t+1}(s,a=0,\lambda)=r(s,a=0)+\lambda+\beta\sum_{s^{\prime}\in\mathcal{S}}p_{s,s^{\prime}}^{0}V_{t}(s^{\prime},\lambda)
                     Q_{t+1}(s,a=1,\lambda)=r(s,a=1)+\beta\sum_{s^{\prime}\in\mathcal{S}}p_{s,s^{\prime}}^{1}V_{t}(s^{\prime},\lambda)
                     V_{t+1}(s,\lambda)=\max\left\{Q_{t+1}(s,a=0,\lambda),Q_{t+1}(s,a=1,\lambda)\right\}
              end for
              Q(s,a=0,\lambda)=Q_{T_{\max}}(s,a=0,\lambda)
              Q(s,a=1,\lambda)=Q_{T_{\max}}(s,a=1,\lambda)
              V(s,\lambda)=\max\{Q(s,a=0,\lambda),Q(s,a=1,\lambda)\}
              if |Q(s,a=0,\lambda)-Q(s,a=1,\lambda)|\geq\Delta then
                     \Phi(s,\lambda)=\arg\max_{a\in\{0,1\}}Q(s,a,\lambda)
              if |Q(s,a=0,\lambda)-Q(s,a=1,\lambda)|<\Delta then
                     \Phi(s,\lambda)=0
       end for
end for
return matrix Φ;
Algorithm 1: Value iteration algorithm for policy matrix Φ
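For concreteness, a minimal Python sketch of Algorithm 1 follows. The function and variable names are ours; the inputs use K x K transition matrices P0 and P1, a K x 2 reward matrix R whose columns correspond to a=0 and a=1, and a grid of subsidies over [-1,1]. Ties within the tolerance are resolved in favour of the passive action as in Algorithm 1, and the sketch additionally stops early when successive value functions differ by less than the tolerance.

import numpy as np

def policy_matrix(P0, P1, R, beta=0.9, grid=np.linspace(-1.0, 1.0, 41),
                  T_max=500, delta=1e-6):
    """Value iteration over a subsidy grid; returns Phi of shape (K, J) with
    Phi[s, j] = optimal action in state s under subsidy grid[j]."""
    P0, P1, R = np.asarray(P0), np.asarray(P1), np.asarray(R, dtype=float)
    K, J = P0.shape[0], len(grid)
    Phi = np.zeros((K, J), dtype=int)
    for j, lam in enumerate(grid):
        V = np.zeros(K)
        for _ in range(T_max):
            Q0 = R[:, 0] + lam + beta * P0 @ V      # passive action
            Q1 = R[:, 1] + beta * P1 @ V            # active action
            V_new = np.maximum(Q0, Q1)
            if np.max(np.abs(V_new - V)) < delta:
                V = V_new
                break
            V = V_new
        # Ties (|Q0 - Q1| < delta) are resolved in favour of the passive action.
        Phi[:, j] = np.where(Q1 - Q0 >= delta, 1, 0)
    return Phi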

We now discuss the structure of the matrix Φ with examples; this structure can depend on the transition probability and reward matrices.

In the following examples we consider the state space \mathcal{S}=\{1,2,3,4\} and 8 values of λ\in\Lambda. Here Λ=\{\lambda_{1},\lambda_{2},\cdots,\lambda_{8}\} with \lambda_{i}<\lambda_{j} for 1\leq i<j\leq 8.

III-B1 Example 1

Suppose the policy matrix is as given in Eqn. (5).

\Phi=\begin{array}{c|cccccccc}
 & \lambda_{1} & \lambda_{2} & \lambda_{3} & \lambda_{4} & \lambda_{5} & \lambda_{6} & \lambda_{7} & \lambda_{8}\\
\hline
s=1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
s=2 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\
s=3 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0\\
s=4 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0
\end{array} (5)

Observe that for fixed λ, \pi_{\lambda}=\Phi(:,\lambda) is the policy over the state vector. The policy \pi_{\lambda}(s) is non-decreasing in s; this structure implies a single threshold policy in s for fixed subsidy λ. Moreover, for fixed s, \pi_{\lambda}(s) is non-increasing in λ; this structure implies a single threshold policy in λ for each fixed state s. If a single threshold policy in λ exists for each s\in\mathcal{S}, then the bandit is indexable, and this follows from the definition of indexability because B(\lambda) is a non-decreasing set as λ increases. In this example, B(\lambda_{1})=\emptyset, B(\lambda_{2})=\emptyset, B(\lambda_{3})=\{1\}, B(\lambda_{4})=\{1,2\}, B(\lambda_{5})=\{1,2,3\}, B(\lambda_{6})=\{1,2,3\}, B(\lambda_{7})=\{1,2,3,4\}, and B(\lambda_{8})=\{1,2,3,4\}.

III-B2 Example 2

Suppose the policy matrix is as given in Eqn. (6).

\Phi=\begin{array}{c|cccccccc}
 & \lambda_{1} & \lambda_{2} & \lambda_{3} & \lambda_{4} & \lambda_{5} & \lambda_{6} & \lambda_{7} & \lambda_{8}\\
\hline
s=1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
s=2 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0\\
s=3 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\
s=4 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0
\end{array} (6)

Notice that for fixed λ, \pi_{\lambda}=\Phi(:,\lambda), and the policy \pi_{\lambda}(s) is not monotone in s for all λ\in\Lambda. We can observe from (6) that for \lambda_{4} and \lambda_{5} the optimal policy is \pi_{\lambda_{4}}=\pi_{\lambda_{5}}=[0,1,0,1]^{T}, so there is no single threshold policy in s.

But for fixed s, \pi_{\lambda}(s) is non-increasing in λ. This structure implies that there is a single threshold policy in λ for each fixed state s\in\mathcal{S}, and B(\lambda) is a non-decreasing set as λ increases. In this example, B(\lambda_{1})=\emptyset, B(\lambda_{2})=\emptyset, B(\lambda_{3})=\{1\}, B(\lambda_{4})=\{1,3\}, B(\lambda_{5})=\{1,3\}, B(\lambda_{6})=\{1,2,3\}, B(\lambda_{7})=\{1,2,3,4\}, and B(\lambda_{8})=\{1,2,3,4\}. From the definition of indexability, the bandit is indexable.

III-B3 Example 3

Suppose the policy matrix is as given in Eqn. (7).

\Phi=\begin{array}{c|cccccccc}
 & \lambda_{1} & \lambda_{2} & \lambda_{3} & \lambda_{4} & \lambda_{5} & \lambda_{6} & \lambda_{7} & \lambda_{8}\\
\hline
s=1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
s=2 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0\\
s=3 & 1 & 1 & 1 & 0 & 0 & 1 & 1 & 0\\
s=4 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0
\end{array} (7)

In this example, B(\lambda_{1})=\emptyset, B(\lambda_{2})=\emptyset, B(\lambda_{3})=\{1\}, B(\lambda_{4})=\{1,3\}, B(\lambda_{5})=\{1,3\}, B(\lambda_{6})=\{1,2\}, B(\lambda_{7})=\{1,2,4\}, and B(\lambda_{8})=\{1,2,3,4\}. Observe that B(\lambda_{5})\not\subset B(\lambda_{6}) even though \lambda_{5}<\lambda_{6}. Hence, the bandit is not indexable.

Notice that for fixed λ, \pi_{\lambda}=\Phi(:,\lambda), and the policy \pi_{\lambda}(s) is not monotone in s for all λ\in\Lambda: there is no single threshold policy in s for \lambda_{4}, \lambda_{5} and \lambda_{7}.

For s=3, as λ increases, \pi_{\lambda}(s) is not monotone in λ, because the corresponding row of the policy matrix is \{1,1,1,0,0,1,1,0\}. This structure implies that there is no single threshold policy in λ for s=3, and therefore the indexability condition is not satisfied.
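Given the policy matrix Φ, the indexability test of Remark 1 and the index of Definition 2 reduce to a row-wise scan. The helper below is our own sketch with hypothetical names: it checks that each row of Φ is non-increasing in λ and, when this holds, returns for each state the smallest grid subsidy at which that state becomes passive, which approximates \lambda(s)=\inf\{\lambda:s\in B(\lambda)\} up to the grid resolution.

import numpy as np

def indexability_and_index(Phi, grid):
    """Phi: (K, J) policy matrix over the subsidy grid; grid: increasing subsidies.
    Returns (indexable, whittle_index); whittle_index is None if not indexable."""
    Phi = np.asarray(Phi)
    # Indexable iff, for every state, the decision is non-increasing in lambda
    # (a single switch from active to passive), so that B(lambda) only grows.
    indexable = bool(np.all(np.diff(Phi, axis=1) <= 0))
    if not indexable:
        return False, None
    grid = np.asarray(grid)
    whittle_index = np.array([grid[row == 0][0] if np.any(row == 0) else np.inf
                              for row in Phi])
    return True, whittle_index

For the policy matrix in Eqn. (5) this check reports indexability, while for Eqn. (7) it fails because the row of s=3 switches back to the active action.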

III-C Computational Complexity of Indexability Verification and Index Computation

RMABs are known to be PSPACE hard [10, Theorem 4]. However, the Lagrangian relaxed RMAB problem is solved by the index policy, where we analyze a single-armed restless bandit. Each bandit is an MDP with a finite number of states and two actions. From [31], an MDP with finite state space and finite action space can be solved in polynomial time (it is P-complete). Thus, the computational complexity of indexability verification and index computation is polynomial in time. Note that we have used the value iteration algorithm for the computation of the policy matrix with a fixed subsidy grid of size |\Lambda|=J. From [32], the sample complexity of the value iteration algorithm is O\left(\frac{K^{2}\log(1/(1-\beta)\epsilon)}{1-\beta}\right). The value iteration algorithm is performed for each λ\in\Lambda; thus, the sample complexity of computing the policy matrix Φ is O\left(\frac{K^{2}J\log(1/(1-\beta)\epsilon)}{1-\beta}\right). The verification of indexability from the matrix Φ requires additional time KJ, since one has to verify, for each s, that \Phi(s,\lambda) is non-increasing in λ for all λ\in\Lambda. Hence, the indexability verification criterion and the index computation together require a sample complexity of O\left(KJ+\frac{K^{2}J\log(1/(1-\beta)\epsilon)}{1-\beta}\right). There are many variants of the value iteration algorithm which improve the sample complexity in terms of its dependence on the number of states K, see [33]. Therefore, one can improve the complexity of indexability verification and index computation accordingly.

IV Algorithms for restless multi-armed bandits

We now discuss algorithms for restless multi-armed bandits: the myopic policy, the Whittle index policy and the online rollout policy. In the myopic policy, the arms with the highest immediate rewards are selected in each time-step. That is, for state (X_{t}^{1},X_{t}^{2},\cdots,X_{t}^{N})\in\mathcal{S}^{N}, the immediate reward of arm n is r^{n}(X_{t}^{n},A_{t}^{n}), and the M arms are selected according to the immediate reward at each time-step t. In the Whittle index policy, at time-step t with state (X_{t}^{1},X_{t}^{2},\cdots,X_{t}^{N})\in\mathcal{S}^{N}, one computes the index for each bandit, i.e., the index of the nth bandit is W(X_{t}^{n}), and then selects the M arms with the highest Whittle indices. The Whittle indices for all states are computed in offline mode. In the following, we discuss the online rollout policy.
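Both selection rules above amount to a top-M choice under different scores. A short sketch follows (names are ours); here indices[n] is the precomputed Whittle index table of arm n and R[n] its K x 2 reward matrix, and the myopic score is read as the immediate active reward r^{n}(X^{n},1) (other variants use the difference r^{n}(X^{n},1)-r^{n}(X^{n},0)).

import numpy as np

def select_arms_myopic(states, R, M):
    """Pick the M arms with the highest immediate active reward r^n(X^n, 1)."""
    scores = np.array([R[n][s, 1] for n, s in enumerate(states)])
    return np.argsort(scores)[-M:]

def select_arms_whittle(states, indices, M):
    """Pick the M arms with the highest Whittle index W(X^n)."""
    scores = np.array([indices[n][s] for n, s in enumerate(states)])
    return np.argsort(scores)[-M:]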

IV-A Online Rollout Policy

We now discuss a simulation-based approach referred to as the online rollout policy. Here, many trajectories are ‘rolled out’ using a simulator, and the value of each action is estimated from the cumulative reward along these trajectories. Trajectories of length H are generated using a fixed base policy, say ϕ, which might choose arms according to a deterministic rule (for instance, the myopic decision) at each time-step. The information obtained from a trajectory is

\{X^{n}_{h,l},b^{n,\phi}_{h,l},r^{n,\phi}_{h,l}\}_{n=1,h=1}^{N,H} (8)

under policy ϕ. Here, l denotes a trajectory, h denotes the time-step and X^{n}_{h,l} denotes the state of arm n. The action of playing or not playing arm n at time-step h in trajectory l is denoted by b^{n,\phi}_{h,l}\in\{0,1\}. The reward obtained from arm n is r^{n,\phi}_{h,l}.

We now describe the rollout policy for M=1. We compute the value estimate for trajectory l with starting state X=(X^{1},\cdots,X^{N}) and initial action \xi\in\{1,2,\cdots,N\}. Here, \xi_{h,l}=n means arm n is played at time-step h in trajectory l, so b^{n,\phi}_{h,l}=1 and b^{i,\phi}_{h,l}=0 for all i\neq n. The value estimate for initial action ξ along trajectory l is given by

Q_{l}^{\phi}(X,\xi) = \sum_{h=1}^{H}\beta^{h-1}\sum_{n=1}^{N}r^{n,\phi}_{h,l}.

Then, averaging over L trajectories, the value estimate for action ξ in state X under policy ϕ is

\widetilde{Q}_{H,L}^{\phi}(X,\xi)=\frac{1}{L}\sum_{l=1}^{L}Q_{l}^{\phi}(X,\xi),

where the base policy ϕ is myopic (greedy), i.e., it chooses the arm with the highest immediate reward along each trajectory. We then perform a one-step policy improvement, and the optimal action is selected as

j^{*}(X)=\arg\max_{1\leq j\leq N}\left[\widetilde{r}(X,\xi=j)+\beta\widetilde{Q}_{H,L}^{\phi}(X,\xi=j)\right], (9)

where \widetilde{r}(X,\xi)=\sum_{n=1}^{N}r^{n}(X^{n},b^{n}). At each time-step t with X_{t}=\{X^{1}_{t},\cdots,X^{N}_{t}\}, the online rollout policy plays the arm j^{*}(X_{t}) obtained from Eqn. (9). This approach can be extended to multiple plays of arms.
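A compact simulation sketch of the rollout policy for M=1 with a myopic base policy is given below. It assumes each arm n is described by transition matrices P[n][a] and a K x 2 reward matrix R[n]; all names are ours, states are 0-indexed, and this simplified estimator folds the immediate reward into the trajectory average rather than keeping it separate as in Eqn. (9), so it is a sketch in the spirit of Eqns. (8)-(9) rather than a literal implementation.

import numpy as np

rng = np.random.default_rng(0)

def step(P, s, a):
    """Sample the next state of one arm from row s of P[a]."""
    return rng.choice(len(P[a]), p=P[a][s])

def rollout_value(P, R, states, first_arm, beta, H):
    """Discounted reward of one trajectory: play first_arm now, then the myopic
    base policy (arm with the highest active reward) for the remaining steps."""
    x = list(states)
    total = 0.0
    for h in range(H):
        arm = first_arm if h == 0 else int(np.argmax([R[n][x[n], 1] for n in range(len(x))]))
        total += beta ** h * sum(R[n][x[n], 1] if n == arm else R[n][x[n], 0]
                                 for n in range(len(x)))
        x = [step(P[n], x[n], 1 if n == arm else 0) for n in range(len(x))]
    return total

def rollout_policy(P, R, states, beta=0.99, H=4, L=30):
    """One-step improvement: estimate each initial arm by averaging L trajectories."""
    N = len(P)
    Q_tilde = [np.mean([rollout_value(P, R, states, j, beta, H) for _ in range(L)])
               for j in range(N)]
    return int(np.argmax(Q_tilde))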

We now discuss the computational complexity of the online rollout policy. As the rollout policy is a heuristic (look-ahead) policy, it does not require convergence analysis. In the Whittle index policy, index computation is done offline, where we compute and store the index values for each element on the grid G for all N arms. During online implementation, when the state X=\{X^{1},\dots,X^{N}\} is observed, the corresponding index values are drawn from the stored data. In contrast, the rollout policy is implemented online, and its computational complexity is stated in the following theorem.

Theorem 2

The online rollout policy has a worst case complexity of O(N(HL+2)T) over T iterations when the base policy is myopic.

The proof is given in Appendix -C. Here T depends on β, with T=1/(1-\beta).

IV-B Comparison of complexity results

For a large state space, say K=1000, one requires the grid of subsidies Λ to be large, that is, J=O(K). Thus the sample complexity of index computation is O\left(K^{2}+\frac{K^{3}\log(1/(1-\beta)\epsilon)}{1-\beta}\right). The computational complexity of the online rollout policy depends on the horizon length H and the number of trajectories L, and it is O(N(HL+2)/(1-\beta)). For large state space models, the rollout policy can therefore have lower computational complexity than the index policy.

V Numerical Examples of single-armed restless bandit

In this section, we provide numerical examples to illustrate the indexability of the bandit using our approach. In all of the following examples, we use discount parameter β=0.9. Similar results also hold for β=0.99.

V-A Example with circular dynamics

In the following example, we illustrate that the optimal policy is not of single threshold type but the bandit is nevertheless indexable. This example is taken from [28], [29], where indexability of the bandit is assumed and the authors studied learning algorithms. Here, we verify indexability using our approach.

The example has four states and the dynamics are circulant: when an arm is passive (a=0), resp. active (a=1), the state evolves according to the following transition probability matrices.

P^{0}=\begin{bmatrix}1/2&0&0&1/2\\ 1/2&1/2&0&0\\ 0&1/2&1/2&0\\ 0&0&1/2&1/2\end{bmatrix},\qquad P^{1}={P^{0}}^{T}.

The reward matrix is given as follows.

R=\begin{bmatrix}-1&-1\\ 0&0\\ 0&0\\ 1&1\end{bmatrix}.

We consider different values of the subsidy λ\in[-1,1], compute the optimal policy for s\in\mathcal{S} using the dynamic programming algorithm, and obtain the policy matrix Φ shown in Fig. 1. We observe that the optimal policy is not of single threshold type in s for some values of the subsidy λ: for λ=-0.4, the optimal policy is \pi_{\lambda=-0.4}=\{0,1,1,0\} and there are two switches in the decision rule. The bandit is nevertheless indexable, because as λ increases from -1 to +1 the set B(\lambda)=\{s\in\mathcal{S}:\pi_{\lambda}(s)=0\} is non-decreasing. Note that B(-0.9)=\emptyset, B(-0.8)=\{4\}, B(-0.4)=\{1,4\}, B(0.5)=\{1,2,4\}, and B(0.9)=\{1,2,3,4\}.

Figure 1: Example of circular dynamics: policy matrix Φ for different states as a function of the subsidy λ.
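Assuming the policy_matrix and indexability_and_index sketches from Section III-B are in scope, the circular-dynamics example can be reproduced along the following lines (states are 0-indexed internally, β=0.9).

import numpy as np

P0 = np.array([[0.5, 0.0, 0.0, 0.5],
               [0.5, 0.5, 0.0, 0.0],
               [0.0, 0.5, 0.5, 0.0],
               [0.0, 0.0, 0.5, 0.5]])
P1 = P0.T.copy()
R = np.array([[-1.0, -1.0],
              [ 0.0,  0.0],
              [ 0.0,  0.0],
              [ 1.0,  1.0]])    # columns: r(s,0), r(s,1)

grid = np.linspace(-1.0, 1.0, 41)
Phi = policy_matrix(P0, P1, R, beta=0.9, grid=grid)        # sketch from Sec. III-B
indexable, W = indexability_and_index(Phi, grid)
print(indexable)   # expected: True for this example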

V-B Examples with restart

We illustrate the structure of the policy matrix Φ for the restart model, where, when action 1 is taken, p^{1}_{s,1}=1 for all s\in\mathcal{S}. We consider K=5 states.

Figure 2: Example of restart model: policy matrix Φ for different states as a function of the subsidy λ. K=5 states.

This example is taken from [28], where the authors assumed indexability. The transition probability matrices are as follows.

P^{0} = \begin{bmatrix}1/10&9/10&0&0&0\\ 1/10&0&9/10&0&0\\ 1/10&0&0&9/10&0\\ 1/10&0&0&0&9/10\\ 1/10&0&0&0&9/10\end{bmatrix},\qquad
P^{1} = \begin{bmatrix}1&0&0&0&0\\ 1&0&0&0&0\\ 1&0&0&0&0\\ 1&0&0&0&0\\ 1&0&0&0&0\end{bmatrix}.

The reward for the passive action (a=0) in state k is r(k,0)=0.9^{k}, and the reward for the active action (a=1) is r(k,1)=0 for all k\in\mathcal{S}. We observe from Fig. 2 that the optimal policy is of single threshold type for each fixed subsidy λ. Moreover, as the subsidy increases from -1 to +1, the optimal policy changes monotonically from the active action to the passive action for every state. Thus, the bandit is indexable.

V-C Non-indexable model with 5 states

We consider the following parameters.

P^{0} = \begin{bmatrix}0.1502&0.0400&0.4156&0.0300&0.3642\\ 0.4000&0.3500&0.0800&0.1200&0.0500\\ 0.5276&0.0400&0.3991&0.0200&0.0133\\ 0.0500&0.1000&0.1500&0.2000&0.5000\\ 0.0191&0.0100&0.0897&0.0300&0.8512\end{bmatrix},
P^{1} = \begin{bmatrix}0.7196&0.0500&0.0903&0.0100&0.1301\\ 0.5500&0.2000&0.0500&0.0800&0.1200\\ 0.1903&0.0100&0.1663&0.0100&0.6234\\ 0.2000&0.0500&0.1500&0.1000&0.5000\\ 0.2501&0.0100&0.3901&0.0300&0.3198\end{bmatrix},
R = \begin{bmatrix}0.4580&0.9631\\ 0.5100&0.8100\\ 0.5308&0.7963\\ 0.6710&0.1061\\ 0.6873&0.1057\end{bmatrix}.

We make no structural assumptions on the transition probability matrices. The reward is decreasing in s for a=1 and increasing in s for a=0. From Fig. 3, we notice that \pi_{\lambda}(s) is not monotone in λ for s=3. Thus, the bandit is non-indexable.

Figure 3: Example of a model with no structure on the probability matrices: policy matrix Φ for different states as a function of the subsidy λ. K=5 states. Non-indexable example.

More numerical examples on indexability are provided in the Appendix.

VI Numerical results on RMAB

We compare the performance of the rollout policy, the index policy and the myopic policy. We consider N=3,5,10 arms and discount parameter β=0.99. In the rollout policy, we use L=30 trajectories and horizon length H=4. In the following, we consider non-identical restless bandits. All bandits are indexable, and the index is monotone in the state. In Figs. 4, 5 and 6, we compare the performance of the myopic, rollout and index policies for N=3,5,10, respectively. We observe that the rollout policy and the index policy perform better than the myopic policy.

Figure 4: 3 armed restless bandits: non-identical, monotone and indexable
Figure 5: 5 armed restless bandits: non-identical, monotone and indexable
Figure 6: 10 armed restless bandits: non-identical, monotone and indexable

Additional numerical examples of RMAB are given in the Appendix.

VII Discussion and concluding remarks

In this paper, we studied the finite state restless bandit problem. We analyzed the Whittle index policy and proposed a value-iteration based algorithm which computes the policy matrix, verifies the indexability condition, and obtains the indices. Further, the sample complexity of the index computation is discussed. We also studied the online rollout policy and its computational complexity. Numerical examples are provided to illustrate the indexability condition and the performance of the algorithms: the myopic policy, the index policy and the rollout policy.

From the numerical examples, we note that the indexability condition does not require any structural assumptions on the transition probabilities. It is even possible that arms whose rewards have no structure with respect to actions are indexable. Hence, our approach is widely applicable for verifying the indexability condition and computing the index without structural model assumptions. In future work, we plan to extend our approach to countable state-space models.

Acknowledgment

R. Meshram and S. Prakash are with ECE Dept. IIIT Allahabad and Center for Intelligent Robotics (CIR) Lab, IIIT Allahabad, India. V. Mittal and D. Dev are with ECE Dept. IIIT Allahabad, India. The work of Rahul Meshram and Deepak Dev is supported by grants from SERB.

References

  • [1] K. D. Glazebrook, H. M. Mitchell, and P. S. Ansell, “Index policies for the maintenance of a collection of machines by a set of repairmen,” European Journal of Operational Research, vol. 165, pp. 267–284, Aug. 2005.
  • [2] D. Ruiz-Hernandez, K. D. Glazebrook, and C. Kirkbride, “Some indexable families of restless bandit problems,” Advances in Applied Probability, vol. 38, no. 3, pp. 643–672, Sept. 2006.
  • [3] K. Avrachenkov, U. Ayesta, J. Doncel, and P. Jacko, “Congestion control of TCP flows in Internet routers by means of index policy,” Computer Networks, vol. 57, no. 17, pp. 3463–3478, 2013.
  • [4] A. Mate, A. Perrault, and M. Tambe, “Risk-aware interventions in public health: Planning with restless multi-armed bandits.,” in Proceedings of AAMAS, 2021, pp. 880–888.
  • [5] R. Meshram, D. Manjunath, and A. Gopalan, “On the Whittle index for restless multi-armed hidden Markov bandits,” IEEE Transactions on Automatic Control, vol. 63, no. 9, pp. 3046–3053, 2018.
  • [6] R. Meshram, D. Manjunath, and A. Gopalan, “A restless bandit with no observable states for recommendation systems and communication link scheduling,” in Proceedings of CDC, Dec 2015, pp. 7820–7825.
  • [7] R. Meshram, A. Gopalan, and D. Manjunath, “A hidden Markov restless multi-armed bandit model for playout recommendation systems,” COMSNETS, Lecture Notes in Computer Science, Springer, pp. 335–362, 2017.
  • [8] P. S. Ansell, K. D. Glazebrook, J. Niño-Mora, and M. O’Keeffe, “Whittle’s index policy for a multi-class queueing system with convex holding costs,” Math. Methods Oper. Res., vol. 57, no. 1, pp. 21–39, 2003.
  • [9] J. Gittins, K. Glazebrook, and R. Weber, Multi-armed bandit allocation indices, Wiley, 2011.
  • [10] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of optimal queuing network control,” Mathematics of Operation Research, vol. 24, no. 2, pp. 293–305, May 1999.
  • [11] P. Whittle, “Restless bandits: Activity allocation in a changing world,” Journal of Applied Probability, vol. 25, no. A, pp. 287–298, 1988.
  • [12] R. R. Weber and G. Weiss, “On an index policy for restless bandits,” Journal of Applied Probability, vol. 27, no. 3, pp. 637–648, Sept. 1990.
  • [13] D. Bertsimas and J. Niño-Mora, “Conservation laws and extended polymatroids and multi-armed bandit problems: A polyhedral approach to indexable systems,” Mathematics of Operations Research, vol. 21, no. 2, pp. 257–306, 1996.
  • [14] D. Bertsimas and J. Niño-Mora, “Restless bandits, linear programming relaxation and a primal-dual index heuristic,” Operations Research, vol. 48, no. 1, pp. 80–90, 2000.
  • [15] J. Niño-Mora, “Dynamic priority allocation via restless bandit marginal productivity indices,” TOP, vol. 15, no. 2, pp. 161–198, 2007.
  • [16] N. Akbarzadeh and A. Mahajan, “Conditions for indexability of restless bandits and an O(K^{3}) algorithm to compute Whittle index,” Advances in Applied Probability, vol. 54, pp. 1164–1192, 2022.
  • [17] V. S. Borkar, G. S. Kasbekar, S. Pattathil, and P. Y. Shetty, “Opportunistic scheduling as restless bandits,” ArXiv:1706.09778, 2017.
  • [18] K. Kaza, R. Meshram, V. Mehta, and S. N. Merchant, “Sequential decision making with limited observation capability: Application to wireless networks,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 2, pp. 237–251, June 2019.
  • [19] J. T. Hawkins, A Lagrangian decomposition approach to weakly coupled dynamic optimization problems and its applications, Ph.D. thesis, Massachusetts Institute of Technology, 2003.
  • [20] D. Adelman and A. J. Mersereau, “Relaxations of weakly coupled stochastic dynamic programs,” Operations Research, vol. 56, no. 3, pp. 712–727, 2008.
  • [21] D. Bertsimas and V. V. Mišić, “Decomposable Markov decision processes: A fluid optimization approach,” Operations Research, vol. 64, no. 6, pp. 1537–1555, 2016.
  • [22] R. Meshram and K. Kaza, “Simulation based algorithms for Markov decision processes and multi-action restless bandits,” Arxiv, 2020.
  • [23] X. Zhang and P. I. Frazier, “Near optimality for infinite horizon restless bandits with many arms,” Arxiv, pp. 1–28, 2022.
  • [24] D. B. Brown and J. Zhang, “Fluid policies, reoptimization, and performance guarantees in dynamic resource allocation,” SRRN, pp. 1–46, 2022.
  • [25] N. Gast, B. Gaujal, and C. Yan, “The LP-update policy for weakly coupled markov decision processes,” Arxiv, pp. 1–29, 2022.
  • [26] J. Nino Mora, “A fast-pivoting algorithm for Whittle’s restless bandit index,” Mathematics, vol. 8, pp. 1–21, 2020.
  • [27] R. Meshram, K. Kaza, V. Mehta, and S. N. Merchant, “Restless bandits for sensor scheduling in energy constrained networks,” in Proceedings of IEEE Indian Control Conference, Dec. 2022, pp. 1–6.
  • [28] K. E. Avrachenkov and V. S. Borkar, “Whittle index based Q-learning for restless bandits with average reward,” Automatica, vol. 139, pp. 1–10, May 2022.
  • [29] J. Fu, Y. Nazarthy, S. Moka, and P. G. Taylor, “Towards Q-learning the Whittle index for restless bandits,” in Proceedings of Australian and New Zealand Control Conference (ANZCC), Nov. 2019, pp. 249–254.
  • [30] M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming, John Wiley & Sons, 2005.
  • [31] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of Markov decision processes,” Mathematics of Operation Research, vol. 12, no. 3, pp. 441–450, Aug. 1987.
  • [32] M. L. Littman, T. L. Dean, and L. P. Kaelbling, “On the complexity of solving Markov decision problems,” in Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI’95), Aug. 1995, pp. 394–402.
  • [33] M. Wang, “Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time,” Mathematics of Operation Research, vol. 45, no. 2, pp. 1–30, May 2020.

-A Proof of Lemma 1

The proof is by induction; we give the argument for piecewise convexity.

Let

Q_{0}(s,a=1,\lambda) = r(s,1)
Q_{0}(s,a=0,\lambda) = r(s,0)+\lambda
V_{1}(s,\lambda) = \max\{r(s,1),r(s,0)+\lambda\}

Note that the rewards r(s,1) and r(s,0) are linear functions, and the maximum of linear functions is piecewise convex.

We next assume that V_{n}(s,\lambda) is piecewise convex in s for fixed λ and β.

From the dynamic programming equations, we have

Q_{n}(s,a,\lambda)=r(s,a)+\lambda(1-a)+\beta\sum_{s^{\prime}\in\mathcal{S}}p_{s,s^{\prime}}^{a}V_{n}(s^{\prime},\lambda).

We want to show that Q_{n}(s,a,\lambda) is piecewise convex in s. Observe that \sum_{s^{\prime}\in\mathcal{S}}p_{s,s^{\prime}}^{a}V_{n}(s^{\prime},\lambda) is piecewise convex in s because it is a convex combination of piecewise convex functions. Hence Q_{n}(s,a,\lambda) is piecewise convex in s. Then

V_{n+1}(s,\lambda) = \max\{Q_{n}(s,0,\lambda),Q_{n}(s,1,\lambda)\}

is piecewise convex in s. Hence V_{n}(s,\lambda) is piecewise convex in s for all n.

By the Banach fixed point theorem [30, Theorem 6.2.3], the dynamic programming equation converges and there exists a unique V. Thus, as n\rightarrow\infty, V_{n}(s,\lambda)\rightarrow V(s,\lambda), and hence V(s,\lambda) is piecewise convex in s. The result follows.

Analogously, we can prove that V(s,\lambda) is piecewise convex in λ for fixed s and β.

-B Proof of Theorem 1

We make use of the following lemma from [30, Lemma 4.7.2].

Lemma 2

Let \{y_{j}\} and \{y_{j}^{\prime}\} be real-valued non-negative sequences satisfying

\sum_{j=k}y_{j}\geq\sum_{j=k}y_{j}^{\prime}

for all values of k, with equality holding for k=0. Suppose that v_{j+1}\geq v_{j} for j=0,1,2,\cdots. Then

\sum_{j=k}v_{j}y_{j}\geq\sum_{j=k}v_{j}y_{j}^{\prime}. (10)

Using this lemma we have the following result. Let Z and Z^{\prime} be two random variables with \mathsf{Pr}(Z=j)=y_{j} and \mathsf{Pr}(Z^{\prime}=j)=y_{j}^{\prime}. If \sum_{j=k}y_{j}\geq\sum_{j=k}y_{j}^{\prime}, then Z is stochastically greater than Z^{\prime}, and from the preceding lemma, for any non-decreasing function f(j) we have \mathsf{E}(f(Z))\geq\mathsf{E}(f(Z^{\prime})). We first prove the following lemma.

Lemma 3

Suppose

  1. r(s,a) is non-decreasing in s for all a\in\{0,1\},

  2. q(k|s,a)=\sum_{j=k}^{K}p_{s,j}^{a} is non-decreasing in s.

Then V(s,\lambda) is non-decreasing in s for fixed λ.

Proof:

The proof of this lemma is by induction. Let

Q_{0}(s,a=1,\lambda) = r(s,1)
Q_{0}(s,a=0,\lambda) = r(s,0)+\lambda
V_{1}(s,\lambda) = \max\{r(s,1),r(s,0)+\lambda\}

Notice that Q_{0}(s,a=1,\lambda) and Q_{0}(s,a=0,\lambda) are non-decreasing in s, and hence V_{1}(s,\lambda) is non-decreasing in s.

Assume that V_{n}(s,\lambda) is non-decreasing in s for fixed λ. Then,

Q_{n}(s,a,\lambda)=r(s,a)+\lambda(1-a)+\beta\sum_{s^{\prime}\in\mathcal{S}}p_{s,s^{\prime}}^{a}V_{n}(s^{\prime},\lambda).

From the earlier lemma and discussion, \sum_{s^{\prime}\in\mathcal{S}}p_{s,s^{\prime}}^{0}V_{n}(s^{\prime},\lambda) is non-decreasing in s whenever V_{n}(s,\lambda) is non-decreasing in s. Hence Q_{n}(s,0,\lambda) is non-decreasing in s, and similarly Q_{n}(s,1,\lambda) is non-decreasing in s. Then

V_{n+1}(s,\lambda) = \max\{Q_{n}(s,0,\lambda),Q_{n}(s,1,\lambda)\}

is non-decreasing in s. Hence V_{n}(s,\lambda) is non-decreasing in s for all n.

By the Banach fixed point theorem [30, Theorem 6.2.3], the dynamic programming equation converges and there exists a unique V. Thus, as n\rightarrow\infty, V_{n}(s,\lambda)\rightarrow V(s,\lambda), and hence V(s,\lambda) is non-decreasing in s. The result follows. ∎

Lemma 4

V(s,\lambda) is non-decreasing in λ for fixed s.

The proof is again by induction; the steps are analogous to the previous lemma and hence we omit the details.

We are now ready to prove the main theorem.

  1. If r(s,a) is a superadditive function on \mathcal{S}\times\{0,1\}, then

     r(s^{\prime},1)-r(s^{\prime},0)\geq r(s,1)-r(s,0)

     for s^{\prime}>s and s^{\prime},s\in\mathcal{S}.

  2. If q(k|s,a) is a superadditive function on \mathcal{S}\times\{0,1\}, then

     q(k|s^{\prime},1)-q(k|s^{\prime},0)\geq q(k|s,1)-q(k|s,0)

     for all k\in\mathcal{S}. This implies

     \sum_{j=k}^{K}\left[p_{s^{\prime},j}^{1}-p_{s^{\prime},j}^{0}\right]\geq\sum_{j=k}^{K}\left[p_{s,j}^{1}-p_{s,j}^{0}\right].

We assume that λ is fixed. The proof is by induction. From the assumption on the rewards, Q_{0}(s,a,\lambda) is superadditive on \mathcal{S}\times\{0,1\}. The optimal policy is

\pi_{\lambda,1}(s)=\arg\max_{a\in\{0,1\}}\{Q_{0}(s,a,\lambda)\}. (11)

Further, note that

Q_{0}(s,1,\lambda)-Q_{0}(s,0,\lambda)=r(s,1)-r(s,0)-\lambda

and Q_{0}(s,1,\lambda)-Q_{0}(s,0,\lambda) is non-decreasing in s for fixed λ. Hence, the optimal policy \pi_{\lambda,1}(s) is non-decreasing in s.

We next want to show that the optimal policy \pi_{\lambda,n+1} at time step n+1 is non-decreasing in s. We need to show that Q_{n}(s,1,\lambda)-Q_{n}(s,0,\lambda) is non-decreasing in s.

Consider

Q_{n}(s,1,\lambda)-Q_{n}(s,0,\lambda)=r(s,1)-r(s,0)-\lambda+\beta\left[\sum_{j=1}^{K}(p_{s,j}^{1}-p_{s,j}^{0})V_{n}(j,\lambda)\right].

Next, we require the following inequality to hold for s^{\prime}\geq s:

\sum_{j=k}^{K}\left[p_{s^{\prime},j}^{1}-p_{s^{\prime},j}^{0}\right]V_{n}(j,\lambda)\geq\sum_{j=k}^{K}\left[p_{s,j}^{1}-p_{s,j}^{0}\right]V_{n}(j,\lambda).

From the earlier lemma, we know that V_{n}(j,\lambda) is non-decreasing in j and q(k|s,a) is a superadditive function on \mathcal{S}\times\{0,1\}. Thus \sum_{j=1}^{K}p_{s,j}^{a}V_{n}(j,\lambda) is superadditive. The reward r(s,a) is a superadditive function on \mathcal{S}\times\{0,1\}; hence Q_{n}(s,a,\lambda) is a superadditive function on \mathcal{S}\times\{0,1\} for all n.

This implies that Q_{n}(s,1,\lambda)-Q_{n}(s,0,\lambda) is non-decreasing in s, and there exists an optimal policy \pi_{\lambda,n+1}(s)=\arg\max_{a}Q_{n}(s,a,\lambda) which is non-decreasing in s for fixed λ. The policy \pi_{\lambda,n}(s) is non-decreasing in s for all values of n.

By the Banach fixed point theorem [30, Theorem 6.2.3], the dynamic programming equation converges and there exists a unique V. Thus, as n\rightarrow\infty, V_{n}(s,\lambda)\rightarrow V(s,\lambda), V(s,\lambda) is non-decreasing in s, and \pi_{\lambda,n}(s)\rightarrow\pi_{\lambda}(s). Moreover,

Q(s,a,\lambda)=r(s,a)+\lambda(1-a)+\beta\sum_{j=1}^{K}p_{s,j}^{a}V(j,\lambda).

Then Q(s,1,\lambda)-Q(s,0,\lambda) is non-decreasing in s. Hence \pi_{\lambda}(s) is non-decreasing in s for fixed λ. This completes the proof.

-C Proof of Theorem 2

In [27], complexity analysis is studied for online rollout policy with partially observable restless bandits. Here, we sketch proof detail with finite state observable restless bandit model.

We consider case M=1,M=1, i.e., a single-arm is played.

Recall from Section IV-A that we simulate the multiple trajectories for fixed horizon length. We compute the value estimate for each trajectory, say ll starting from initial state X,X, and initial action ξ,\xi, i.e., Qlϕ(X,ξ).Q_{l}^{\phi}(X,\xi). This require computation of O(H),O(H), since each trajectory is run for HH horizon length. There are LL trajectories and it requires O(HL)O(HL) computation.

The empirical value estimates Q~H,Lϕ(X,ξ),\widetilde{Q}_{H,L}^{\phi}(X,\xi), ξ{1,2,,N}\xi\in\{1,2,\cdots,N\} for NN possible initial actions (arms). This takes O(NHL)O(NHL) computations as there are LL trajectories of horizon (look-ahead) length HH for each of the NN initial actions.

The rollout policy also involves a policy improvement step, which takes another $O(2N)$ computations. Thus the total computational complexity per iteration is $O(2N+NHL)=O(N(HL+2))$. For $T$ time steps, the computational complexity is $O(N(HL+2)T)$.
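As an illustration of this count, the following is a minimal sketch, under assumed helper functions, of how the empirical value estimates could be formed for $M=1$: for each of the $N$ candidate initial arms, $L$ trajectories of horizon $H$ are simulated under the base policy $\phi$, which gives the $O(NHL)$ work per decision epoch. The functions `step(state, arm, rng)` (one simulated transition returning the next state and reward) and `base_policy(state)` are hypothetical stand-ins for the simulator and the base policy.

```python
import numpy as np

def rollout_estimates(state, N, H, L, step, base_policy, beta=0.99, rng=None):
    """Empirical estimates of Q^phi_{H,L}(X, xi) for xi = 0, ..., N-1."""
    rng = rng if rng is not None else np.random.default_rng()
    Q_tilde = np.zeros(N)
    for xi in range(N):                    # N candidate initial arms ...
        for _ in range(L):                 # ... times L trajectories ...
            s, arm, ret, disc = state, xi, 0.0, 1.0
            for _ in range(H):             # ... times horizon H per trajectory
                s, reward = step(s, arm, rng)
                ret += disc * reward
                disc *= beta
                arm = base_policy(s)       # base policy phi chooses the next arm
            Q_tilde[xi] += ret / L         # average over the L trajectories
    return Q_tilde
```

The one-step improvement then amounts to playing `np.argmax(Q_tilde)` (possibly after adding the immediate reward, as in Eqn. (12) for the multi-arm case), which corresponds to the extra $O(2N)$ term in the count.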

-D Online rollout policy when multiple arms are played

We now discuss the rollout policy for $M>1$, i.e., more than one arm is played at each time step.

When the decision maker plays more than one arm in each time step, employing a base policy with future look-ahead in the rollout policy is non-trivial.

This is due to the large number of possible combinations of $M$ out of $N$ available arms, i.e., $\binom{N}{M}$. Since the rollout policy depends on future look-ahead actions, it can be computationally expensive to implement: at each time step we would need to choose from $\binom{N}{M}$ subsets. We reduce these computations for the base policy $\phi$ by employing a myopic rule in the look-ahead, where we select the $M$ arms with the highest immediate rewards while computing the value estimates of trajectories.

In this case, $\sum_{n=1}^{N}b_{h,l}^{n,\phi}=M$ with $M>1$. The set of arms played at step $h$ in trajectory $l$ is $\xi_{h,l}\subset\mathcal{N}=\{1,2,\cdots,N\}$ with $|\xi_{h,l}|=M$, and $b_{h,l}^{n,\phi}=1$ if $n\in\xi_{h,l}$. The base policy $\phi$ uses the myopic decision rule, and the one-step policy improvement is given by

j^{*}(X)=\arg\max_{\xi\subset\mathcal{N}}\left[\widetilde{r}(X,\xi)+\beta\widetilde{Q}_{H,L}^{\phi}(X,\xi)\right], \qquad (12)

where $\widetilde{r}(X,\xi)=\sum_{n=1}^{N}r^{n}(X^{n},b^{n})$. The computation of $\widetilde{Q}_{H,L}^{\phi}(X,\xi)$ is similar to the preceding discussion. At time $t$ with state $X=\{X^{1},X^{2},\cdots,X^{N}\}$, the rollout policy plays the subset of arms $j^{*}$ according to Eqn. (12).
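The following is a minimal sketch of the one-step improvement in Eqn. (12). Here `candidate_subsets` is a hypothetical list of size-$M$ arm subsets (the restricted set discussed in the next subsection), `immediate_reward(state, subset)` returns $\widetilde{r}(X,\xi)$, and `rollout_value(state, subset)` returns $\widetilde{Q}_{H,L}^{\phi}(X,\xi)$ computed with the myopic base policy along each trajectory; all three are assumptions standing in for the corresponding components of the method.

```python
def improved_subset(state, candidate_subsets, immediate_reward, rollout_value, beta=0.99):
    """One-step improvement: j*(X) = argmax_xi [ r~(X, xi) + beta * Q~(X, xi) ]."""
    best_subset, best_score = None, float("-inf")
    for xi in candidate_subsets:           # |A| candidate subsets instead of C(N, M)
        score = immediate_reward(state, xi) + beta * rollout_value(state, xi)
        if score > best_score:
            best_subset, best_score = xi, score
    return best_subset
```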

-D1 Computation complexity result for $M>1$ (multiple arms are played)

Along each trajectory, $M$ arms are played out of $N$. The number of possible actions at each time step in a trajectory is $\binom{N}{M}$, and $\binom{N}{M}=O(N^{M})$, which is polynomial in $N$ for fixed $M$. Thus, the complexity would be very high if all possible actions were considered.

The value estimates are therefore computed only for $|A|$ initial actions, where $N\leq|A|<\binom{N}{M}$. The computation required for the value estimates $\widetilde{Q}_{H,L}^{\phi}(X,\xi)$ per iteration is $O(|A|HL)$. As there are $|A|$ subsets considered at each step in the action selection, the computations needed for the policy improvement step are at most $2|A|$. So the per-iteration computational complexity is $|A|HL+2|A|$. Hence, for $T$ time steps the computational complexity is $O(|A|(HL+2)T)$.
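The choice of the restricted set $A$ is left open here; purely as an illustration (an assumption, not the rule used in the paper), one simple construction satisfying $N\leq|A|\ll\binom{N}{M}$ is to take the myopic top-$M$ subset together with all of its single-arm swaps, giving $|A|=1+M(N-M)$. The per-arm function `myopic_reward(state, n)` below is hypothetical.

```python
def candidate_subsets(state, N, M, myopic_reward):
    """Illustrative restricted action set A: myopic top-M subset plus single-arm swaps."""
    order = sorted(range(N), key=lambda n: myopic_reward(state, n), reverse=True)
    base = order[:M]                                  # myopic top-M arms
    subsets = [tuple(sorted(base))]
    for out_arm in base:                              # swap one played arm ...
        for in_arm in order[M:]:                      # ... for one unplayed arm
            swapped = [a for a in base if a != out_arm] + [in_arm]
            subsets.append(tuple(sorted(swapped)))
    return list(dict.fromkeys(subsets))               # deduplicate, preserve order
```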

-E Additional Numerical Examples: Single-Armed Restless Bandit

Here we provide more numerical examples and verify the indexability using our Algorithm 1.
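For reference, the following is a minimal sketch, in the spirit of Algorithm 1 though not its exact implementation, of how such checks can be carried out: solve the single-armed problem by value iteration on a grid of subsidies, record the policy matrix $\Phi$, and test that the passive sets $B(\lambda)$ are nested as $\lambda$ increases.

```python
import numpy as np

def policy_matrix(P0, P1, R, lambdas, beta=0.9, n_iter=2000, tol=1e-9):
    """Phi[s, i] = optimal action in state s under subsidy lambdas[i]."""
    K = R.shape[0]
    Phi = np.zeros((K, len(lambdas)), dtype=int)
    for i, lam in enumerate(lambdas):
        V = np.zeros(K)
        for _ in range(n_iter):                      # value iteration for fixed lambda
            Q0 = R[:, 0] + lam + beta * (P0 @ V)
            Q1 = R[:, 1] + beta * (P1 @ V)
            V_new = np.maximum(Q0, Q1)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        Q0 = R[:, 0] + lam + beta * (P0 @ V)         # Q at the (approximate) fixed point
        Q1 = R[:, 1] + beta * (P1 @ V)
        Phi[:, i] = (Q1 > Q0).astype(int)
    return Phi

def is_indexable(Phi):
    """B(lambda) = {s : Phi[s, i] = 0}; indexable iff B is nested as lambda grows."""
    passive = (Phi == 0)
    return bool(np.all(passive[:, :-1] <= passive[:, 1:]))
```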

-E1 Examples with restart

We illustrate the structure of the policy matrix $\Phi$ for the restart model, where, when action $1$ is taken, $p^{1}_{s,1}=1$ for all $s\in\mathcal{S}$. The reward is decreasing in $s$ for the passive action $(a=0)$, and the reward for the active action $(a=1)$ is $0$ for all states. We provide examples for $K=10$ and $K=100$ states. The transition probabilities are as follows: $p^{0}(s,\min\{s+1,K\})=0.9$, $p^{0}(s,1)=0.1$, and $p^{1}(s,1)=1$ for all $s\in\mathcal{S}$.
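A short sketch of this restart model (states indexed $1,\ldots,K$ but stored 0-based), which can be passed to the `policy_matrix` check sketched above; the decay rate of the passive reward ($0.95$ for $K=10$, $0.99$ for $K=100$) is a parameter.

```python
import numpy as np

def restart_model(K, decay):
    """Restart model: active action resets to state 1; passive reward decays as decay**k."""
    P0 = np.zeros((K, K))
    P1 = np.zeros((K, K))
    for s in range(K):
        P0[s, min(s + 1, K - 1)] += 0.9     # p^0(s, min{s+1, K}) = 0.9
        P0[s, 0] += 0.1                     # p^0(s, 1) = 0.1
        P1[s, 0] = 1.0                      # p^1(s, 1) = 1 (restart)
    R = np.zeros((K, 2))
    R[:, 0] = decay ** np.arange(1, K + 1)  # r(k, 0) = decay^k, r(k, 1) = 0
    return P0, P1, R

# Example usage (assumed subsidy grid):
# Phi = policy_matrix(*restart_model(10, 0.95), np.linspace(0, 1, 101))
```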

Figure 7: Example of restart model: policy matrix $\Phi$ for different states as a function of subsidy $\lambda$. States $K=10$.

For $K=10$, the reward for action $a=0$ (passive action) in state $k$ is $r(k,0)=0.95^{k}$ and the reward for action $a=1$ (active action) is $r(k,1)=0$ for all $k\in\mathcal{S}$. The policy matrix $\Phi$ is illustrated in Fig. 7. We observe that the optimal policy is of a single-threshold type and the bandit is indexable.

In Fig. 8, we use $K=100$, with the reward for the passive action $r(k,0)=0.99^{k}$ and the reward for the active action $r(k,1)=0$ for all $k\in\mathcal{S}$. It is difficult to draw the policy matrix $\Phi$ for a large number of states. Hence, we illustrate the threshold state $\hat{s}_{\lambda}$ as a function of $\lambda$, where $\hat{s}_{\lambda}:=\inf\{s:\pi_{\lambda}(s)=0\}$. Notice that $\hat{s}_{\lambda}$ is increasing in $\lambda$. This monotone behavior indicates that the bandit is indexable.

Figure 8: Example of restart model: threshold state $\hat{s}_{\lambda}$ as a function of subsidy $\lambda$. States $K=100$.

-E2 Example of one-step random walk

We consider $K=5$ states and a one-step random walk. The transition probability matrix is the same for both actions $a=1$ and $a=0$. The reward for the passive action is $r(k,0)=0$ and for the active action $r(k,1)=0.9^{k}$ for all $k\in\mathcal{S}$. This model is also applicable in wireless communication systems.

Figure 9: Example of one-step random walk model: policy matrix $\Phi$ for different states as a function of subsidy $\lambda$. States $K=5$.
P_{0}=\begin{bmatrix}3/10&7/10&0&0&0\\ 1/10&2/10&7/10&0&0\\ 0&1/10&2/10&7/10&0\\ 0&0&1/10&2/10&7/10\\ 0&0&0&3/10&7/10\end{bmatrix},\qquad P_{1}=P_{0}.

In Fig. 9, we illustrate the policy matrix $\Phi$ and observe that the policy $\pi_{\lambda}(s)$ is non-increasing in $s$ for fixed $\lambda$; this holds for all $\lambda$. Thus, the policy has a single threshold in $s$. For fixed state $s$, $\pi_{\lambda}(s)$ is non-increasing in $\lambda$, and this holds for all $s\in\mathcal{S}$. Hence, the policy also has a single threshold in $\lambda$.

From Fig. 9, it is clear that $B(\lambda=0.5)=\emptyset$, $B(\lambda=0.6)=\{5\}$, $B(\lambda=0.7)=\{4,5\}$, $B(\lambda=0.8)=\{3,4,5\}$, $B(\lambda=0.9)=\{1,2,3,4,5\}$, and $B(\lambda_{1})\subseteq B(\lambda_{2})$ for $\lambda_{2}>\lambda_{1}$. Thus, the bandit is indexable.

-E3 Example from [16] (Akbarzadeh 2022)

Figure 10: Example of a model with no structure: policy matrix $\Phi$ for different states as a function of subsidy $\lambda$. States $K=3$.

In this example, there is no structural assumption on transition probability and reward matrices.

P_{0}=\begin{bmatrix}0.3629&0.5026&0.1343\\ 0.0823&0.7534&0.1643\\ 0.2460&0.0294&0.7246\end{bmatrix},\qquad P_{1}=\begin{bmatrix}0.1719&0.1749&0.6532\\ 0.0547&0.9317&0.0136\\ 0.1547&0.6271&0.2182\end{bmatrix}.

The reward for the passive action is $r(k,0)=0$ for all $k\in\mathcal{S}$, and for the active action $r(1,1)=0.44138$, $r(2,1)=0.8033$, and $r(3,1)=0.14257$.

From the policy matrix in Fig. 10, we observe that the bandit is indexable, because $B(\lambda=0.1)=\emptyset$, $B(\lambda=0.2)=\{1\}$, $B(\lambda=0.6)=\{1,3\}$, and $B(\lambda=0.9)=\{1,2,3\}$; that is, $B(\lambda)$ is non-decreasing in $\lambda$.

-E4 Indexable model from [15] (Nino-Mora 2007)

There is no structural assumption on transition probability matrices.

P_{0}=\begin{bmatrix}0.1810&0.4801&0.3389\\ 0.2676&0.2646&0.4678\\ 0.5304&0.2843&0.1853\end{bmatrix},\qquad P_{1}=\begin{bmatrix}0.2841&0.4827&0.2332\\ 0.5131&0.0212&0.4657\\ 0.4612&0.0081&0.5307\end{bmatrix},\qquad R=\begin{bmatrix}0&0.9016\\ 0&0.10949\\ 0&0.01055\end{bmatrix}.

We use the discount parameter $\beta=0.9$. We observe from Fig. 11 that the optimal policy $\pi_{\lambda}(s)$ is non-increasing in $s$ for fixed $\lambda$, even though there is no structural assumption on the transition probability and reward matrices. Further, $\pi_{\lambda}(s)$ is non-increasing in $\lambda$ for fixed $s$, and this is true for all $s\in\mathcal{S}$. Notice that $B(\lambda=-0.1)=\emptyset$, $B(\lambda=0)=\{3\}$, $B(\lambda=0.3)=\{2,3\}$, and $B(\lambda=1)=\{1,2,3\}$. Thus, the bandit is indexable.

Figure 11: Example of a model with no structure on reward and probability matrices from [15]: policy matrix $\Phi$ for different states as a function of subsidy $\lambda$. States $K=3$.

-E5 Non-indexable model from [15] (Nino-Mora 2007)

Here, we select an example of a bandit that is non-indexable. We use the following parameters with $\beta=0.9$:

P_{0}=\begin{bmatrix}0.1902&0.4156&0.3942\\ 0.5676&0.4191&0.0133\\ 0.0191&0.1097&0.8712\end{bmatrix},\qquad P_{1}=\begin{bmatrix}0.7796&0.0903&0.1301\\ 0.1903&0.1863&0.6234\\ 0.2901&0.3901&0.3198\end{bmatrix},\qquad R=\begin{bmatrix}0.458&0.9631\\ 0.5308&0.7963\\ 0.6873&0.1057\end{bmatrix}.

We do not make any structural assumption on the transition probability matrices, but there is one on the reward matrix: the reward is decreasing in $s$ for the active action $(a=1)$ and increasing in $s$ for the passive action $(a=0)$. In Fig. 12, notice that the policy $\pi_{\lambda}(s)$ is not monotone in $s$ for some values of $\lambda\in\Lambda$, which implies that there exist $\lambda$ with multiple thresholds. This can be observed for $\lambda=-0.2$, where $\pi_{\lambda}(s)=[1,0,1]^{T}$, and for $\lambda=0.5$, where $\pi_{\lambda}(s)=[0,1,0]^{T}$. Also, $\pi_{\lambda}(s)$ is not monotone in $\lambda$ for some $s\in\mathcal{S}$. For instance, for $s=2$ we have $\pi_{\lambda}(s=2)=[1,0,0,0,0,0,1,1,1,0]$ as $\lambda$ ranges from $-0.3$ to $0.6$. Further, $B(\lambda=-0.3)=\emptyset$, $B(\lambda=-0.2)=\{2\}$, $B(\lambda=-0.1)=\{2,3\}$, $B(\lambda=0.2)=\{2,3\}$, $B(\lambda=0.3)=\{3\}$, $B(\lambda=0.4)=\{3\}$, $B(\lambda=0.5)=\{1,3\}$, and $B(\lambda=0.6)=\{1,2,3\}$. Clearly, $B(\lambda)$ is not monotone in $\lambda$; hence, the bandit is non-indexable.

Figure 12: Example of a model with no structure on reward and probability matrices from [15]: policy matrix $\Phi$ for different states as a function of subsidy $\lambda$. States $K=3$. Non-indexable example.

-F Non-indexable to indexable bandit by modification of reward matrix

In the following examples, we use $K=5$ states and the same transition probability matrices as in the non-indexable bandit. We modify the reward matrix and illustrate that the bandit becomes indexable.

-F1 Indexable model with $5$ states and $\beta=0.9$

In this example, we slightly modify the reward matrix while keeping the same transition probability matrices as in the previous example with $K=5$. We then observe that the bandit becomes indexable. This may be because the difference between the rewards of the passive and active actions is smaller here than in the previous example, while the structure of the reward matrix remains the same in both examples, i.e., the reward is decreasing in $s$ for the active action and increasing in $s$ for the passive action.

P_{0}=\begin{bmatrix}0.1502&0.0400&0.4156&0.0300&0.3642\\ 0.4000&0.3500&0.0800&0.1200&0.0500\\ 0.5276&0.0400&0.3991&0.0200&0.0133\\ 0.0500&0.1000&0.1500&0.2000&0.5000\\ 0.0191&0.0100&0.0897&0.0300&0.8512\end{bmatrix},\qquad P_{1}=\begin{bmatrix}0.7196&0.0500&0.0903&0.0100&0.1301\\ 0.5500&0.2000&0.0500&0.0800&0.1200\\ 0.1903&0.0100&0.1663&0.0100&0.6234\\ 0.2000&0.0500&0.1500&0.1000&0.5000\\ 0.2501&0.0100&0.3901&0.0300&0.3198\end{bmatrix},\qquad R=\begin{bmatrix}0.4580&0.9631\\ 0.5100&0.8100\\ 0.6508&0.7963\\ 0.6710&0.6061\\ 0.6873&0.5057\end{bmatrix}.

We use $\beta=0.9$. In Fig. 13, we notice that $\pi_{\lambda}(s)$ is non-increasing (monotone) in $\lambda$ for each $s\in\mathcal{S}$, but $\pi_{\lambda}(s)$ is not monotone in $s$ for $\lambda=-0.1,-0.05,0.0,0.05$. Further, we observe that $B(\lambda=-0.15)=\emptyset$, $B(\lambda=-0.1)=\{3\}$, $B(\lambda=0.05)=\{3,4\}$, $B(\lambda=0.1)=\{3,4,5\}$, $B(\lambda=0.3)=\{3,4,5\}$, $B(\lambda=0.35)=\{2,3,4,5\}$, and $B(\lambda=0.4)=\{1,2,3,4,5\}$. Hence, $B(\lambda)$ is non-decreasing in $\lambda$, and the bandit is indexable.

Figure 13: Example of a model with no structure on reward and probability matrices: policy matrix $\Phi$ for different states as a function of subsidy $\lambda$. States $K=5$. The bandit is indexable.

-F2 Indexable model with $5$ states, monotone reward, and $\beta=0.9$

We consider the following parameters.

P_{0}=\begin{bmatrix}0.1502&0.0400&0.4156&0.0300&0.3642\\ 0.4000&0.3500&0.0800&0.1200&0.0500\\ 0.5276&0.0400&0.3991&0.0200&0.0133\\ 0.0500&0.1000&0.1500&0.2000&0.5000\\ 0.0191&0.0100&0.0897&0.0300&0.8512\end{bmatrix},\qquad P_{1}=\begin{bmatrix}0.7196&0.0500&0.0903&0.0100&0.1301\\ 0.5500&0.2000&0.0500&0.0800&0.1200\\ 0.1903&0.0100&0.1663&0.0100&0.6234\\ 0.2000&0.0500&0.1500&0.1000&0.5000\\ 0.2501&0.0100&0.3901&0.0300&0.3198\end{bmatrix},\qquad R=\begin{bmatrix}0.4580&0.5057\\ 0.5100&0.6061\\ 0.6508&0.7963\\ 0.6710&0.8100\\ 0.6873&0.9631\end{bmatrix}.

We use $\beta=0.9$. The reward is increasing in $s$ for both actions (passive $a=0$ and active $a=1$), but there is no structural assumption on the transition probability matrices. Using this example, we illustrate that the bandit is indexable. From Fig. 14, we observe that $\pi_{\lambda}(s)$ is non-increasing in $\lambda$ for every $s\in\mathcal{S}$, but $\pi_{\lambda}(s)$ is not monotone in $s$ for $\lambda=0.15,0.2,0.25,0.3$. There is no single-threshold policy in $s$ for these values of $\lambda$, but there is a single-threshold policy in $\lambda$ for each $s\in\mathcal{S}$. We have $B(\lambda=-0.35)=\emptyset$, $B(\lambda=-0.3)=\{1\}$, $B(\lambda=0.15)=\{1,2,4\}$, $B(\lambda=0.2)=\{1,2,4,5\}$, and $B(\lambda=0.35)=\{1,2,3,4,5\}$. The set $B(\lambda)$ is non-decreasing in $\lambda$, and hence the bandit is indexable.

Figure 14: Example of a model with no structure on reward and probability matrices: policy matrix $\Phi$ for different states as a function of subsidy $\lambda$. States $K=5$. The bandit is indexable.

-F3 Indexable model with $5$ states and $\beta=0.99$

Consider

P_{0}=\begin{bmatrix}0.1502&0.0400&0.4156&0.0300&0.3642\\ 0.4000&0.3500&0.0800&0.1200&0.0500\\ 0.5276&0.0400&0.3991&0.0200&0.0133\\ 0.0500&0.1000&0.1500&0.2000&0.5000\\ 0.0191&0.0100&0.0897&0.0300&0.8512\end{bmatrix},\qquad P_{1}=\begin{bmatrix}0.7196&0.0500&0.0903&0.0100&0.1301\\ 0.5500&0.2000&0.0500&0.0800&0.1200\\ 0.1903&0.0100&0.1663&0.0100&0.6234\\ 0.2000&0.0500&0.1500&0.1000&0.5000\\ 0.2501&0.0100&0.3901&0.0300&0.3198\end{bmatrix},\qquad R=\begin{bmatrix}0.4580&0.9631\\ 0.5100&0.8100\\ 0.6508&0.7963\\ 0.6710&0.6061\\ 0.6873&0.5057\end{bmatrix}.

We use $\beta=0.99$; the reward is again monotone in $s$ for both actions. From Fig. 15, we have $B(\lambda=-0.2)=\emptyset$, $B(\lambda=-0.15)=\{3\}$, $B(\lambda=0.05)=\{3,4\}$, $B(\lambda=0.1)=\{3,4,5\}$, $B(\lambda=0.35)=\{2,3,4,5\}$, and $B(\lambda=0.4)=\{1,2,3,4,5\}$. Since $B(\lambda)$ is non-decreasing in $\lambda$, the bandit is indexable.

Figure 15: Example of a model with no structure on reward and probability matrices: policy matrix $\Phi$ for different states as a function of subsidy $\lambda$. States $K=5$ and $\beta=0.99$. The bandit is indexable.

-G Discussion on non-indexability and indexability of SAB

From the preceding examples (non-indexable and indexable) in Section V and Appendix -F, we infer that for a bandit to be non-indexable, the rewards of the two actions need to be ordered in opposite directions along the state, and there should be a sufficient difference between the rewards in each state. When the reward is monotone (non-decreasing) in the state for both actions, the bandit is indexable: as we increase the subsidy $\lambda$, there is no possibility of more than a single threshold in $\lambda$ for each state $s\in\mathcal{S}$. To see how more than one threshold in $\lambda$ can arise: first, $\lambda$ is a fixed reward obtained for the passive action and is independent of the state; second, as $\lambda$ increases, the passive action becomes optimal, since not playing is the better choice; third, for the active action to become optimal again as $\lambda$ increases further, the cumulative reward from the active action must exceed that of the passive action.

This is possible because the reward is increasing in the state for the passive action and decreasing for the active action: the difference $r(s,1)-r(s,0)$ changes from positive to negative, and increasing $\lambda$ can rebalance the two actions for some states through the dynamic programming equation. Fourth, as $\lambda$ increases even further, the optimal action must become passive and remains passive for all remaining values of $\lambda$.

-G1 Comments on Indexability

From the numerical examples of restless single-armed bandits, we observed that non-indexability occurs only under very restrictive settings of the transition probability and reward matrices. In most applications in wireless communication, machine maintenance, and recommendation system models, some structure on the reward and transition probability matrices is assumed. Hence, many applications of finite state observable restless bandit models are indexable.

-H Additional Numerical Examples: Performance of Different Policies for RMAB

Here, we provide more numerical examples comparing the performance of different policies for restless multi-armed bandits. We consider both identical and non-identical restless bandits, and vary the scenarios: monotone and indexable bandits, non-monotone and indexable bandits, and non-indexable bandits. In the non-indexable scenario, we compare the performance of the rollout and myopic policies. We run simulations for $N=3,5,10$ arms with discount parameter $\beta=0.99$. In the rollout policy, we use $L=30$ trajectories and horizon lengths $H=3,4,5$.

-H1 Example of identical, indexable, and monotone bandits

Figure 16: 5 armed restless bandits: identical, indexable, and monotone.
Figure 17: 10 armed restless bandits: identical, indexable, and monotone.

In Fig. 16 and Fig. 17, we compare the performance of the myopic, rollout, and index policies for $N=5$ and $N=10$, respectively. All the bandits are identical and indexable, and the index is monotone in the state. We use $L=30$ trajectories and horizon length $H=4$. Notice that all policies have identical performance; this is due to the identical and monotone reward structure of the bandits and the index being monotone in the state.

-H2 Example of identical, indexable, and non-monotone bandits

Figure 18: 5 armed restless bandits: identical, indexable, and non-monotone.
Figure 19: 10 armed restless bandits: identical, indexable, and non-monotone.

In Fig. 18 and Fig. 19, we compare the performance of the myopic, rollout, and index policies for $N=5$ and $N=10$, respectively. All the bandits are identical, indexable, and non-monotone, i.e., the index is not monotone in the state. We use $L=30$ trajectories and horizon length $H=4$. We observe that the rollout policy and the index policy perform better than the myopic policy.

Examples of such bandits are given in Appendix -F. Even though $B(\lambda)$ is non-decreasing in $\lambda$, the policy $\pi_{\lambda}(s)$ is not monotone in $s$ for some values of $\lambda$. These single-armed bandit examples suggest that there need not be any structure on the transition probability or reward matrices. In such examples, the myopic policy does not perform as well as the rollout and index policies. The rollout policy uses a simulation-based look-ahead approach and the index policy uses dynamic programming for index computation; both take future state information into account.

Figure 20: 5 armed restless bandits: identical and non-indexable.
Figure 21: 10 armed restless bandits: identical and non-indexable.
Figure 22: 3 armed restless bandits: non-identical and indexable, $H=3$.
Figure 23: 3 armed restless bandits: non-identical and indexable, $H=5$.

-H3 Example of identical and non-indexable bandits

In Fig. 20 and Fig. 21, we compare the performance of the myopic and rollout policies for $N=5$ and $N=10$, respectively, where all bandits are identical and non-indexable. In the rollout policy, the number of trajectories is $L=30$ and the horizon length is $H=4$. We observe that both policies have identical performance.

-H4 Effect of horizon length $H$ in rollout policy

In Fig. 22 and Fig. 23, we compare the performance of the myopic, rollout, and index policies for $N=3$. All the bandits are non-identical and indexable; some are monotone and some are non-monotone. We use $L=30$ trajectories and horizon lengths $H=3$ and $H=5$, respectively. We observe that the rollout policy and the index policy perform better than the myopic policy. As $H$ increases from $3$ to $5$, the gap between the rollout policy and the index policy decreases.