Indexability of Finite State Restless Multi-Armed Bandit and Rollout Policy
Abstract
We consider the finite-state restless multi-armed bandit problem, in which the decision maker can act on $M$ out of $N$ bandits in each time step. Playing an arm (the active action) yields a state-dependent reward, and an arm that is not played also yields a reward that depends on its state and action. The objective of the decision maker is to maximize the infinite-horizon discounted reward. The classical approach to restless bandits is the Whittle index policy, in which the arms with the highest indices are played at each time step. One first relaxes the constraint of the restless bandit problem and then, via a Lagrangian relaxation, decouples it into single-armed restless bandit problems. We analyze the single-armed restless bandit. In order to study the Whittle index policy, we derive structural results for the single-armed bandit model. We define indexability and establish it in special cases. We propose an alternative approach to verifying the indexability criterion for a single-armed bandit model using the value iteration algorithm, and we demonstrate the performance of our algorithm on different examples. We provide insight into conditions for indexability of restless bandits using different structural assumptions on the transition probability and reward matrices.
We also study an online rollout policy, discuss its computational complexity, and compare it with the complexity of index computation. Numerical examples illustrate that the index policy and the rollout policy perform better than the myopic policy.
I Introduction
Restless multi-armed bandit (RMAB) is a class of sequential decision problems. RMABs are extensively studied for resource allocation problems such as machine maintenance [1, 2], congestion control [3], healthcare [4], opportunistic channel scheduling [5], recommendation systems [6, 7], and queueing systems [8, 9]. The RMAB problem is described as follows. There is a decision maker with $N$ independent arms, where each arm can be in one of finitely many states and the state evolves according to a Markovian law. Playing an arm yields a state-dependent reward, and we assume that not playing an arm also yields a state-dependent reward. The transition dynamics of the state evolution are assumed to be known to the decision maker. The system is time slotted, and the decision maker plays $M$ out of $N$ arms in each slot. The goal is to determine which sequence of arms to play so as to maximize the long-term discounted cumulative reward.
RMAB is a generalization of the classical multi-armed bandit (Markov bandit) and is known to be PSPACE-hard [10]. RMAB was first introduced in [11], where the author proposed the Whittle index policy, a heuristic approach. The popularity of the Whittle index policy is due to its asymptotically optimal performance under certain conditions [12]. In this policy, the arms with the highest Whittle indices are played at each time step. The index is derived for each arm by analyzing a single-armed restless bandit, in which a subsidy is introduced for not playing the arm; this is the Lagrangian relaxation of the restless bandit problem. To use the Whittle index policy, one is required to establish the indexability condition. Sufficient conditions for indexability are provided under modeling assumptions on the restless bandits, particularly on the transition probability and reward matrices. In [13, 14], the authors introduced a linear programming (LP) approach to restless bandits and proposed a primal-dual heuristic algorithm. In [15], the author studied the marginal productivity index (MPI) for RMABs, which generalizes the Whittle index, and further developed an adaptive greedy policy to compute the MPI. Their model is assumed to satisfy partial conservation laws (PCL); for RMABs with PCL, one can apply the adaptive greedy algorithm to compute the Whittle index. In general, it is difficult to show indexability and then derive an index formula. The index can also be computed approximately, as is done in [16]. Index computation algorithms using a two-timescale stochastic approximation approach and the value iteration algorithm are studied in [3, 17, 18]. All of these works require proving indexability before computing the indices, and results on indexability make structural assumptions on the transition probability and reward matrices.
A generalization of RMAB is referred to as weakly coupled Markov decision processes (MDPs), where more complex constraints are introduced [19]; an approximate dynamic programming approach is studied in [20]. A fluid approximation for restless bandits is studied in [21]. A simulation-based approach for restless bandits is considered in [22]. A linear programming approach with asymptotic optimality based on a fluid model is studied in [23]. Another paper on fluid models for resource allocation is [24]. The LP policy for weakly coupled MDPs is studied in [25]. In [26], an index computation algorithm is discussed. Recently, restless bandits with partially observable states and hidden Markov bandit models have been studied in [18, 5, 27].
I-A Our contributions
In this paper, we study the finite-state restless multi-armed bandit problem and analyze the single-armed restless bandit problem. Our goal is to show that the restless bandit is indexable. First, we provide structural results under modeling assumptions on the transition probability and reward matrices and prove indexability analytically.
Our second contribution is to study the single-armed restless bandit using the value iteration algorithm with a fixed subsidy (Lagrangian parameter) and obtain the optimal decision rule for each state, where the optimal decision is either the passive action or the active action. Next, we vary the subsidy over a fixed-size grid, compute the optimal decision rule for each state and each subsidy value on the grid, and collect the results in what we call the policy matrix. By analyzing this matrix, we provide an alternative approach to verifying the indexability condition; it also yields an index value for each state. The advantage of our approach is that the indexability condition is easy to verify using a value iteration scheme. We present the analysis of indexability using the policy matrix on a few examples. To the best of our knowledge, this viewpoint has not been taken in earlier works.
Third, we discuss different numerical examples from [28, 29, 16, 15] and present examples of non-indexable bandits. We give insight into conditions for non-indexability of restless bandits using numerical examples. We observe that when the reward is monotone (non-decreasing or non-increasing) in the state for both actions in the same direction, the bandit is indexable. When the reward for the passive action is increasing in the state, the reward for the active action is decreasing in the state, there is a sufficient difference between the rewards of the passive and active actions, and the transition probability matrices have a certain structure, then the bandit is shown to be non-indexable. This example is motivated by [15]. However, for most applications, such as opportunistic scheduling in communication systems [17] and recommendation systems [6, 7], the rewards are monotone in the same direction for both actions, and thus the bandit is indexable.
Fourth, we compare the performance of the Whittle index policy, the myopic policy, and an online rollout policy. Numerical examples illustrate that the discounted cumulative reward under the index policy and the rollout policy is higher than that of the myopic policy for non-identical restless bandits.
The rest of the paper is organized as follows. The RMAB problem is formulated in Section II. Indexability and our approach to verifying it for a single-armed restless bandit are discussed in Section III. Different algorithms for RMAB are presented in Section IV. Numerical examples for indexability of the single-armed restless bandit are given in Section V, and numerical examples for RMAB are illustrated in Section VI. We make concluding remarks in Section VII.
II Model of Restless Bandits
Consider $N$ independent Markov decision processes (MDPs). Each MDP has finite state space $\{1,2,\dots,K\}$ and action space $\{0,1\}$. Time progresses in discrete time steps, denoted by $t = 0,1,2,\dots$. Let $s^n_t$ denote the state of the $n$th Markov chain at instant $t$, for $n = 1,\dots,N$. Let $p^n_{s,s'}(a)$ be the transition probability from state $s$ to state $s'$ under action $a$ for the $n$th Markov chain; thus the transition probability matrix of the $n$th MDP is $P^n_a = [p^n_{s,s'}(a)]$ for $a \in \{0,1\}$. The reward from state $s$ under action $a$ for the $n$th MDP is $r^n(s,a)$. Let $a^n_t \in \{0,1\}$ be the action corresponding to the $n$th MDP at time $t$, with $a^n_t = 1$ when arm $n$ is played. Let $\pi$ be the strategy that maps the history of states and actions to the subset of arms played; under policy $\pi$, the action for arm $n$ at time $t$ is denoted by $a^n_t$.
The infinite-horizon discounted cumulative reward starting from initial state $s = (s^1,\dots,s^N)$ under a fixed policy $\pi$ is given by
$$V_{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \beta^{t} \sum_{n=1}^{N} r^{n}(s^{n}_{t}, a^{n}_{t}) \;\middle|\; s_0 = s \right].$$
Here, $\beta \in (0,1)$ is the discount parameter. The agent can play at most $M$ arms at each time $t$; thus
$$\sum_{n=1}^{N} a^{n}_{t} \le M \quad \text{for all } t.$$
Our objective is
$$\max_{\pi \in \Pi} V_{\pi}(s).$$
This is the discounted (reward-based) restless bandit problem; here, $\Pi$ is the set of policies. The relaxed discounted restless bandit problem is as follows.
$$\max_{\pi \in \Pi} \; \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \beta^{t} \sum_{n=1}^{N} r^{n}(s^{n}_{t}, a^{n}_{t}) \right] \quad \text{subject to} \quad \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \beta^{t} \sum_{n=1}^{N} a^{n}_{t} \right] \le \frac{M}{1-\beta}. \qquad (1)$$
Then the Lagrangian relaxation of problem (1) is given by
$$\max_{\pi \in \Pi} \; \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \beta^{t} \sum_{n=1}^{N} \Big( r^{n}(s^{n}_{t}, a^{n}_{t}) + W\,(1 - a^{n}_{t}) \Big) \right] + \frac{W(M-N)}{1-\beta}.$$
Here, $W \ge 0$ is the Lagrangian variable. For a fixed $W$, this problem decouples into $N$ single-armed restless bandit problems, i.e., MDPs with a finite number of states, a two-action model, and a policy for each arm. Thus, a single-armed restless bandit problem is given as follows.
$$\max_{\pi} \; \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \beta^{t} \Big( r^{n}(s^{n}_{t}, a^{n}_{t}) + W\,(1 - a^{n}_{t}) \Big) \right]. \qquad (2)$$
II-A Single-armed restless bandit
For notational simplicity, we omit the superscript $n$ and make the dependence of the value function on $W$ explicit. Then, the optimization problem starting from state $s$ under policy $\pi$ is as follows.
$$V_{\pi}(s, W) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \beta^{t} \Big( r(s_{t}, a_{t}) + W\,(1 - a_{t}) \Big) \;\middle|\; s_0 = s \right]. \qquad (3)$$
The optimal dynamic program is given by
$$V(s, W) = \max\left\{ W + r(s, 0) + \beta \sum_{s'} p_{s,s'}(0)\, V(s', W), \;\; r(s, 1) + \beta \sum_{s'} p_{s,s'}(1)\, V(s', W) \right\}$$
for all $s \in \{1,\dots,K\}$. We use these ideas to define the indexability of a single-armed restless bandit in the next section, where the analysis is carried out.
III Indexability and Whittle index
The Lagrangian variable $W$ is referred to as the subsidy. For a given subsidy $W$, define the action-value function
$$Q(s, a; W) = r(s, a) + W\,\mathbf{1}_{\{a = 0\}} + \beta \sum_{s'} p_{s,s'}(a)\, V(s', W).$$
Then, we can rewrite the dynamic program as
$$V(s, W) = \max_{a \in \{0, 1\}} Q(s, a; W)$$
for all $s$. We define the passive set as follows:
$$B(W) = \left\{ s : Q(s, 0; W) \ge Q(s, 1; W) \right\}.$$
Let $\pi^{*}(s, W)$ be the policy that maps state $s$ to the optimal action for a given subsidy $W$. We can also rewrite the passive set as $B(W) = \{ s : \pi^{*}(s, W) = 0 \}$.
Definition 1 (Indexable)
A single-armed restless bandit is indexable if $B(W)$ is non-decreasing in $W$, i.e., $W_1 \le W_2 \implies B(W_1) \subseteq B(W_2)$.
A restless multi-armed bandit problem is indexable if every bandit is indexable. Equivalently, a restless bandit is indexable if, as the passive subsidy increases, the set of states for which the passive action is optimal also grows.
Definition 2 (Whittle index)
If a restless bandit is indexable, then its Whittle index $W(s)$ is given by $W(s) = \inf\{ W : s \in B(W) \}$ for each state $s$.
We assume that the rewards are bounded which implies that the indices are also bounded.
Remark 1
-
1.
In general, it is very difficult to verify that a bandit is indexable, and even more difficult to compute the Whittle indices. Note that, for indexability, the set $B(W)$ has to be non-decreasing as $W$ increases.
-
2.
If a bandit is indexable, then for each state $s$ there is a $W(s)$ such that the passive action is optimal for all $W \ge W(s)$ and the active action is optimal for all $W < W(s)$, because $B(W)$ is non-decreasing in $W$. This further suggests that the policy $\pi^{*}(s, W)$ has a single threshold as a function of $W$ for fixed $s$. That is, there exists $W(s)$ such that
$$\pi^{*}(s, W) = \begin{cases} 1 & \text{if } W < W(s), \\ 0 & \text{if } W \ge W(s), \end{cases}$$
for each $s$.
-
3.
If a bandit is non-indexable, then there exist subsidies $W_1 < W_2$ such that $B(W_1)$ is not a subset of $B(W_2)$. In other words, there exists some state $s$ such that $\pi^{*}(s, W_1) = 0$ and $\pi^{*}(s, W_2) = 1$ for some $W_1 < W_2$.
III-A Single threshold policy and indexability
We now describe a single-threshold-type policy for an MDP with action space $\{0,1\}$, assuming a fixed subsidy $W$. We consider deterministic Markov policies. The policy is called a single-threshold-type policy when the decision rule is of the following form:
$$\pi^{*}(s) = \begin{cases} 1 & \text{if } s \ge s_T, \\ 0 & \text{if } s < s_T, \end{cases}$$
where $s_T$ is a threshold state. This is also referred to as a control limit policy in [30]. The conditions required on the transition probability and reward matrices for a single-threshold policy are discussed below.
Lemma 1 (Convexity of value functions)
-
1.
For fixed and and are non-decreasing piecewise convex in
-
2.
For fixed and and are non-decreasing piecewise convex in
The proof of the lemma follows by induction and is given in Appendix -A.
Definition 3
The function $Q(s, a; W)$ is superadditive in $(s, a)$ for a fixed $W$ if
$$Q(s+1, 1; W) - Q(s+1, 0; W) \ge Q(s, 1; W) - Q(s, 0; W) \quad \text{for all } s. \qquad (4)$$
This implies that $Q(s, 1; W) - Q(s, 0; W)$ is non-decreasing in $s$; this is also referred to as the monotone non-decreasing difference property in $s$, and it implies that the optimal decision rule is non-decreasing in $s$. We now state the following result from [30].
Theorem 1
Suppose that
-
1.
$r(s, a)$ is non-decreasing in $s$ for all $a$,
-
2.
$q(k \mid s, a) = \sum_{s' \ge k} p_{s,s'}(a)$ is non-decreasing in $s$ for all $k$ and $a$,
-
3.
$r(s, a)$ is a superadditive (subadditive) function on $S \times A$,
-
4.
$q(k \mid s, a)$ is a superadditive (subadditive) function on $S \times A$ for all $k$.
Then, there exists an optimal policy that is non-decreasing (non-increasing) in $s$.
These assumptions mean that rather strong conditions are required on the transition probability matrices and the reward matrix.
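The conditions of Theorem 1 can be checked numerically for a given model. The sketch below tests the monotonicity and superadditivity conditions listed above for passive/active transition matrices `P0`, `P1` and a reward matrix `r`; the function name, the zero-based state indexing, and the array layout `r[s, a]` are our own conventions rather than the paper's.

```python
import numpy as np

def check_theorem1_conditions(P0, P1, r):
    """Check the sufficient conditions of Theorem 1 on a finite model.

    P0, P1 : (K, K) transition matrices for the passive/active actions.
    r      : (K, 2) reward matrix, r[s, a].
    Returns the truth values of the four conditions.
    """
    P = {0: np.asarray(P0, dtype=float), 1: np.asarray(P1, dtype=float)}
    r = np.asarray(r, dtype=float)
    # tail sums q(k | s, a) = sum_{s' >= k} p(s' | s, a)
    q = {a: np.flip(np.cumsum(np.flip(P[a], axis=1), axis=1), axis=1)
         for a in (0, 1)}
    cond1 = all(np.all(np.diff(r[:, a]) >= 0) for a in (0, 1))       # r non-decreasing in s
    cond2 = all(np.all(np.diff(q[a], axis=0) >= 0) for a in (0, 1))  # q non-decreasing in s
    cond3 = np.all(np.diff(r[:, 1] - r[:, 0]) >= 0)                  # r superadditive
    cond4 = np.all(np.diff(q[1] - q[0], axis=0) >= 0)                # q superadditive
    return cond1, cond2, cond3, cond4
```

If all four conditions hold, Theorem 1 guarantees a monotone (single-threshold) optimal policy in the state; if some of them fail, the bandit may still be indexable, which motivates the value-iteration check of the next subsection.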
Remark 2
-
1.
If $r(s, a)$ is a superadditive function on $S \times A$, then
$$r(s+1, 1) - r(s+1, 0) \ge r(s, 1) - r(s, 0)$$
for all $s$.
-
2.
If $q(k \mid s, a)$ is a superadditive function on $S \times A$, then
$$q(k \mid s+1, 1) - q(k \mid s+1, 0) \ge q(k \mid s, 1) - q(k \mid s, 0)$$
for all $k$ and $s$. This implies the monotone-difference structure used in Theorem 1.
From Theorem 1, it is clear that the optimal policy is of single-threshold type in $s$ for a fixed subsidy $W$.
We next argue that $Q(s, a; W)$ is subadditive in $(a, W)$ for all states $s$; that is, $Q(s, 1; W) - Q(s, 0; W)$ is non-increasing in $W$ for each $s$. Observe that $Q(s, 0; W)$ is strictly increasing in $W$ and $Q(s, 1; W)$ is non-decreasing in $W$ for fixed $s$. Under the assumptions of Theorem 1, $Q(s, 1; W) - Q(s, 0; W)$ is non-increasing in $W$. Hence the optimal policy $\pi^{*}(s, W)$ is non-increasing in $W$ for each $s$.
Remark 3
-
1.
This implies that for each state $s$ there is a threshold $W(s)$ such that the active action is optimal for $W < W(s)$ and the passive action is optimal for $W \ge W(s)$.
Further, this suggests that as $W$ increases, the passive set $B(W)$ grows. Thus, the restless bandit is indexable.
-
2.
It is possible for a restless bandit to be indexable with the following structure on the reward matrix: $r(s, 0)$ and $r(s, 1)$ are non-decreasing in $s$, with no structure imposed on the transition probability matrices. Such a model is not covered by the preceding discussion.
-
3.
It is further possible that a bandit is indexable even though there is no structure on the transition probability and reward matrices. This case is also not covered by the preceding theorem; this is a limitation of the analysis. With this motivation, we study an alternative approach to indexability using value-iteration-based methods in the next section.
III-B Alternative approach for indexability
In this section, we use the value iteration algorithm to compute the optimal policy for each state at a fixed value of the subsidy $W$. We define a grid of subsidies of finite size, say $W_1 < W_2 < \cdots < W_L$. We perform value iteration for each $W_j$ and compute the policy $\pi^{*}(s, W_j)$ for every state $s$; the resulting policy matrix has size $L \times K$. We then verify the indexability criterion of Remark 1 using the policy matrix, that is, for each state $s$ there exists a threshold $W(s)$ such that $\pi^{*}(s, W) = 1$ for $W < W(s)$ and $\pi^{*}(s, W) = 0$ for $W \ge W(s)$. If this condition is not met, we check the non-indexability condition of Remark 1. The value iteration algorithm for computing the policy matrix is described in Algorithm 1.
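A minimal Python sketch of Algorithm 1 is given below. It assumes passive/active transition matrices `P0`, `P1`, a $K \times 2$ reward matrix `r`, and zero-based state indexing; the default subsidy grid and the stopping tolerance are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def policy_matrix(P0, P1, r, beta=0.9, subsidies=None,
                  tol=1e-8, max_iter=10_000):
    """Value iteration for each subsidy on a grid (sketch of Algorithm 1).

    Returns (Phi, subsidies), where Phi is an (L, K) 0/1 matrix whose entry
    (j, s) is the optimal action in state s under subsidy W_j
    (1 = active, 0 = passive).
    """
    P0, P1 = np.asarray(P0, dtype=float), np.asarray(P1, dtype=float)
    r = np.asarray(r, dtype=float)
    K = r.shape[0]
    if subsidies is None:
        subsidies = np.linspace(r.min(), r.max(), 50)   # crude grid choice
    subsidies = np.asarray(subsidies, dtype=float)
    Phi = np.zeros((len(subsidies), K), dtype=int)
    for j, W in enumerate(subsidies):
        V = np.zeros(K)
        for _ in range(max_iter):
            Q0 = W + r[:, 0] + beta * P0 @ V            # passive action
            Q1 = r[:, 1] + beta * P1 @ V                # active action
            V_new = np.maximum(Q0, Q1)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        # greedy decision rule at the (nearly) converged value function
        Q0 = W + r[:, 0] + beta * P0 @ V
        Q1 = r[:, 1] + beta * P1 @ V
        Phi[j] = (Q1 > Q0).astype(int)
    return Phi, subsidies
```

Each row of `Phi` is the decision rule for one subsidy value; each column traces how the optimal action for one state changes as the subsidy grows, which is exactly the structure examined in the examples below.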
We now discuss the structure of the policy matrix through examples; this structure depends on the transition probability and reward matrices.
In the following examples, we consider a small state space and a finite grid of subsidy values $W_1 < W_2 < \cdots < W_L$; the entry in row $j$ and column $s$ of the policy matrix is the optimal action $\pi^{*}(s, W_j)$.
III-B1 Example 1
Suppose the policy matrix is as given in Eqn. (5).
(5)
Observe that, for a fixed subsidy, the corresponding row of the policy matrix is the decision rule over the state vector, and this decision rule is non-decreasing in the state; this structure implies a single-threshold policy in the state for each fixed subsidy. Moreover, for each fixed state, the optimal action is non-increasing as the subsidy increases; this implies a single-threshold policy in the subsidy for each state. If a single-threshold policy in the subsidy exists for each state, then the bandit is indexable, since $B(W)$ is a non-decreasing set as $W$ increases. In this example, the thresholds, and hence the index value of each state, can be read directly from the policy matrix.
III-B2 Example 2
Suppose the policy matrix is as given in Eqn. (6).
(6)
Notice that, for some fixed subsidies, the policy is not monotone in the state; from (6), for certain subsidy values the optimal decision rule switches more than once along the state axis, so there is no single-threshold policy in the state.
However, for each fixed state, the optimal action is non-increasing as the subsidy increases; this implies a single-threshold policy in the subsidy for every state, and $B(W)$ is a non-decreasing set as $W$ increases. In this example, the index value of each state is the smallest subsidy at which the passive action becomes optimal. From the definition of indexability, we can say that the bandit is indexable.
III-B3 Example 3
Suppose the policy matrix is as given in Eqn. (7).
(7)
In this example, the passive set is not monotone in the subsidy: there are subsidy values $W_1 < W_2$ with $B(W_1) \not\subseteq B(W_2)$. Hence, the bandit is not indexable.
Notice that for some fixed subsidies the policy is not monotone in the state, so there is no single-threshold policy in the state for those subsidies.
Moreover, for some fixed state, the optimal action is not monotone as the subsidy increases, because the decision rule switches from passive back to active; this means there is no single-threshold policy in the subsidy for that state. Because of this, the indexability condition is not satisfied.
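The checks used in the three examples above can be automated from the policy matrix. The helper functions below (our own names, operating on the matrix `Phi` and subsidy grid returned by the previous sketch) test for a single threshold in the state for a fixed subsidy, verify the indexability criterion of Remark 1 column by column, and read off an approximate index for each state.

```python
import numpy as np

def single_threshold_in_state(policy_row):
    """True if the decision rule over states (one row of Phi, i.e. a fixed
    subsidy) switches between actions at most once."""
    return np.count_nonzero(np.diff(np.asarray(policy_row))) <= 1

def verify_indexability(Phi, subsidies):
    """Check the criterion of Remark 1 on a policy matrix Phi (L x K, 1 = active).

    The bandit is declared indexable if every column of Phi is non-increasing
    in the subsidy (at most one switch, from active to passive).  Returns
    (indexable, indices), where indices[s] is the smallest grid subsidy at
    which the passive action becomes optimal in state s, i.e. a grid
    approximation of the Whittle index.
    """
    L, K = Phi.shape
    indices = np.full(K, np.nan)
    indexable = True
    for s in range(K):
        col = Phi[:, s]
        if np.any(np.diff(col) > 0):        # switches from passive back to active
            indexable = False
            continue
        passive = np.where(col == 0)[0]
        if passive.size > 0:
            indices[s] = subsidies[passive[0]]
    return indexable, indices
```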
III-C Computational complexity of Verification of Indexability and Index Computation
RMABs are known to be PSPACE-hard [10]. However, the Lagrangian-relaxed RMAB problem is solved by the index policy, where we analyze a single-armed restless bandit. Each bandit is an MDP with a finite number of states and two actions. From [31], an MDP with finite state space and finite action space can be solved in polynomial time (the problem is P-complete). Thus, verifying indexability and computing the index take polynomial time. Note that we have used the value iteration algorithm to compute the policy matrix over a fixed grid of subsidies of size $L$. From [32], value iteration on a fixed finite MDP has polynomial complexity in the number of states and actions; the value iteration algorithm is performed once for each of the $L$ subsidy values, so the complexity of computing the policy matrix is $L$ times that of a single value iteration. The verification of indexability from the policy matrix requires additional time proportional to $LK$, since one has to verify, for each state, that the decision rule is non-increasing in the subsidy along the grid. Hence, the indexability verification criterion and the index computation together require polynomial complexity. There are many variants of the value iteration algorithm with improved complexity in terms of the dependence on the number of states; see [33]. Therefore, one can further improve the complexity of indexability verification and index computation.
IV Algorithms for restless multi-armed bandits
We now discuss algorithms for restless multi-armed bandits: the myopic policy, the Whittle index policy, and the online rollout policy. In the myopic policy, the $M$ arms with the highest immediate rewards from playing are selected in each time step; that is, arms are chosen at time $t$ according to their immediate rewards in the current states. In the Whittle index policy, at time step $t$ with state $(s^1_t,\dots,s^N_t)$, one looks up the index of each bandit, i.e., the index $W^{n}(s^{n}_t)$ of the $n$th bandit, and selects the $M$ arms with the highest Whittle indices; the Whittle indices are computed for all states in offline mode (a short sketch of both selection rules follows this paragraph). In the following, we discuss the online rollout policy.
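Before turning to the rollout policy, the sketch below illustrates the two selection rules just described. It assumes that the myopic rule ranks arms by the immediate reward of playing, $r^{n}(s,1)$, and that (approximate) Whittle indices have been precomputed per arm, e.g., with the policy-matrix routine of Section III-B; the function names are ours.

```python
import numpy as np

def select_arms_myopic(states, rewards, M):
    """Myopic rule: play the M arms with the highest immediate active reward.

    states  : current state of each arm.
    rewards : list of (K, 2) reward matrices, one per arm.
    """
    immediate = np.array([rewards[n][s, 1] for n, s in enumerate(states)])
    return np.argsort(immediate)[-M:]

def select_arms_whittle(states, indices, M):
    """Whittle rule: play the M arms with the highest precomputed index.

    indices : list of length-K index arrays, one per arm (computed offline).
    """
    current = np.array([indices[n][s] for n, s in enumerate(states)])
    return np.argsort(current)[-M:]
```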
IV-A Online Rollout Policy
We now discuss a simulation-based approach referred to as the online rollout policy. Here, many trajectories are 'rolled out' using a simulator, and the value of each action is estimated from the cumulative reward along these trajectories. Trajectories of length $H$ are generated using a fixed base policy, say $\phi$, which chooses arms according to a deterministic rule (for instance, the myopic decision) at each time step. The information obtained from trajectory $\ell$ is
$$\left\{ \left(s^{n}_{t,\ell},\, a^{n}_{t,\ell},\, r^{n}\!\left(s^{n}_{t,\ell}, a^{n}_{t,\ell}\right)\right) : 1 \le n \le N,\ 1 \le t \le H \right\} \qquad (8)$$
under policy $\phi$. Here, $\ell$ denotes the trajectory, $t$ denotes the time step, and $s^{n}_{t,\ell}$ denotes the state of arm $n$. The action of playing or not playing arm $n$ at time step $t$ in trajectory $\ell$ is denoted by $a^{n}_{t,\ell}$, and the reward obtained from arm $n$ is $r^{n}(s^{n}_{t,\ell}, a^{n}_{t,\ell})$.
We now describe the rollout policy for $M = 1$. We compute the value estimate $Q_{\phi,\ell}(s, k)$ for trajectory $\ell$ with starting state $s$ and initial action of playing arm $k$; that is, arm $k$ is played at the first time step of trajectory $\ell$, so $a^{k}_{1,\ell} = 1$ and $a^{m}_{1,\ell} = 0$ for $m \ne k$. The value estimate for initial action $k$ along trajectory $\ell$ is given by
$$Q_{\phi,\ell}(s, k) = \sum_{t=1}^{H} \beta^{t-1} \sum_{n=1}^{N} r^{n}\!\left(s^{n}_{t,\ell}, a^{n}_{t,\ell}\right).$$
Then, averaging over $\Gamma$ trajectories, the value estimate for initial action $k$ in state $s$ under policy $\phi$ is
$$\widetilde{Q}_{\phi}(s, k) = \frac{1}{\Gamma} \sum_{\ell=1}^{\Gamma} Q_{\phi,\ell}(s, k),$$
where the base policy $\phi$ is myopic (greedy), i.e., it chooses the arm with the highest immediate reward along each trajectory. Now, we perform a one-step policy improvement, and the action is selected as
$$k^{*} = \arg\max_{k \in \{1,\dots,N\}} \widetilde{Q}_{\phi}(s, k). \qquad (9)$$
At each time step with state $s$, the online rollout policy plays the arm obtained according to Eqn. (9); a simulation sketch of this procedure is given below. This can be extended to multiple plays of arms.
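A simulation sketch of this single-play rollout rule is shown below. It assumes fully observable finite-state arms, a myopic base policy that plays the arm with the highest immediate active reward, and illustrative values for the horizon and the number of trajectories.

```python
import numpy as np

def rollout_action(states, P, r, beta=0.9, H=20, n_traj=50, rng=None):
    """One-step rollout with a myopic base policy for single play (M = 1).

    states : current state of each arm.
    P      : list of (P0, P1) transition-matrix pairs, one per arm.
    r      : list of (K, 2) reward matrices r[n][s, a], one per arm.
    Returns the arm whose estimated rollout value is largest (cf. Eqn. (9)).
    """
    rng = rng or np.random.default_rng()
    N = len(states)
    values = np.zeros(N)
    for first in range(N):                        # candidate initial action
        for _ in range(n_traj):
            s, played, total = list(states), first, 0.0
            for t in range(H):
                for n in range(N):
                    a = 1 if n == played else 0
                    total += (beta ** t) * r[n][s[n], a]
                    s[n] = rng.choice(P[n][a].shape[0], p=P[n][a][s[n]])
                # myopic base policy picks the next arm to play
                played = int(np.argmax([r[n][s[n], 1] for n in range(N)]))
            values[first] += total / n_traj
    return int(np.argmax(values))
```

One arm is selected per call; the routine is rerun at every time step, which is what makes the rollout policy an online method.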
We now discuss the computational complexity of the online rollout policy. As the rollout policy is a heuristic (look-ahead) policy, it does not require a convergence analysis. In the Whittle index policy, the index computation is done offline, where we compute and store the index values for every state of all $N$ arms. During online implementation, when the state is observed, the corresponding index values are retrieved from the stored data. The rollout policy, on the other hand, is implemented online, and its computational complexity is stated in the following theorem.
Theorem 2
The online rollout policy has a worst case complexity of for number of iterations when the base policy is myopic.
The proof is given in Appendix -C. Here is dependent on and
IV-B Comparison of complexity results
For a large state space, one requires the subsidy grid to be large as well, so the complexity of index computation grows with both the number of states and the grid size. The computational complexity of the online rollout policy depends on the horizon length and the number of trajectories. For large state-space models, the rollout policy can therefore have lower computational complexity than the index policy.
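To make the comparison concrete, the following back-of-the-envelope sketch contrasts the two costs for purely hypothetical problem sizes; the operation counts used here (a $K^2$ sweep per value-iteration pass over each of $L$ subsidies, and a per-step rollout cost proportional to $N^2 \Gamma H$) are rough assumptions of this sketch, not the bounds stated above.

```python
# Hypothetical sizes, only to illustrate the comparison in this subsection.
K, L_grid = 500, 200            # states per arm, subsidy grid points
N = 10                          # arms
n_traj, H, T = 50, 20, 1000     # rollout trajectories, horizon, time steps
sweeps = 100                    # assumed value-iteration sweeps to converge

index_cost = N * L_grid * sweeps * K**2      # offline, once per problem
rollout_cost = T * N * n_traj * H * N        # online, over T time steps
print(f"offline index computation ~ {index_cost:.1e} operations")
print(f"online rollout policy      ~ {rollout_cost:.1e} operations")
```

With these illustrative numbers the offline index computation dominates as $K$ grows, which matches the observation that the rollout policy can be cheaper for large state spaces.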
V Numerical Examples of single-armed restless bandit
In this section, we provide numerical examples to illustrate the indexability of bandits using our approach. All of the following examples use a fixed discount parameter; similar results also hold for other values of the discount parameter.
V-A Example with circular dynamics
In the following example, we illustrate that the optimal policy is not of single-threshold type in the state, yet the bandit is indexable. The example is taken from [28, 29], where indexability of the bandit is assumed and learning algorithms are studied. Here, we establish indexability using our approach.
The example has four states and the dynamics are circulant: when an arm is passive ($a = 0$), resp. active ($a = 1$), the state evolves according to the corresponding transition probability matrix.
The reward matrix is given as follows.
We consider different values of the subsidy $W$, compute the optimal policy using the dynamic programming algorithm, and obtain the policy matrix shown in Fig. 1. We observe that the optimal policy has more than one threshold in the state for some values of the subsidy; for one such subsidy there are two switches in the decision rule. Nevertheless, the bandit is indexable, because as $W$ increases the passive set $B(W)$ is non-decreasing.

V-B Examples with restart
We illustrate the structure of the policy matrix for the restart model, in which, whenever the active action is taken, the state resets to the first state with probability one, i.e., $p_{s,1}(1) = 1$ for all $s$. We consider a model with a few states.

This example is taken from [28], where the authors assumed indexability. The transition probability matrices are as follows.
The rewards for the passive action depend on the state, while the reward for the active action is the same for all states. We observe from Fig. 2 that the optimal policy is of single-threshold type for each fixed subsidy. Moreover, as the subsidy increases, the optimal action changes monotonically from active to passive for every state. Thus, the bandit is indexable.
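For concreteness, the active-action (restart) transition matrix of such a model can be constructed as in the short sketch below, under the assumption that playing the arm resets it to the first state; the passive-action matrix `P0` is left unchanged.

```python
import numpy as np

def restart_model(P0, restart_state=0):
    """Build the active-action transition matrix of a restart model:
    playing the arm moves it to `restart_state` with probability one,
    while the passive dynamics P0 are unchanged."""
    K = np.asarray(P0).shape[0]
    P1 = np.zeros((K, K))
    P1[:, restart_state] = 1.0
    return P1
```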
V-C Non-Indexable model with states
We consider the following parameters.
We make no structural assumptions on the transition probability matrices, but the reward is decreasing in the state for the active action and increasing in the state for the passive action. From Fig. 3, we notice that the passive set is not monotone in the subsidy; thus, the bandit is non-indexable.

More numerical examples on indexability are provided in Appendix.
VI Numerical results on RMAB
We compare the performance of the rollout policy, the index policy, and the myopic policy. We consider a fixed number of arms and a fixed discount parameter. In the rollout policy, we use a fixed number of trajectories and a fixed horizon length. In the following, we consider non-identical restless bandits; all bandits are indexable and the index is monotone in the state. In Figs. 4, 5, and 6, we compare the performance of the myopic, rollout, and index policies for different numbers of arms played. We observe that the rollout policy and the index policy perform better than the myopic policy.



Additional numerical examples of RMAB are given in Appendix.
VII Discussion and concluding remarks
In this paper, we studied the finite-state restless bandit problem. We analyzed the Whittle index policy and proposed a value-iteration-based algorithm that computes the policy matrix, verifies the indexability condition, and obtains the indices. Further, we discussed the complexity of index computation. We also studied the online rollout policy and its computational complexity. Numerical examples were provided to illustrate the indexability condition and the performance of the myopic, index, and rollout policies.
From the numerical examples, we note that the indexability condition does not require any structural assumption on the transition probabilities. It is even possible for arms whose rewards have no structure with respect to the actions to be indexable. Hence, our approach can be widely applied to verify the indexability condition and compute the index without structural model assumptions. In future work, we plan to extend our approach to countable state-space models.
Acknowledgment
R. Meshram and S. Prakash are with ECE Dept. IIIT Allahabad and Center for Intelligent Robotics (CIR) Lab, IIIT Allahabad, India. V. Mittal and D. Dev are with ECE Dept. IIIT Allahabad, India. The work of Rahul Meshram and Deepak Dev is supported by grants from SERB.
References
- [1] K. D. Glazebrook, H. M. Mitchell, and P. S. Ansell, “Index policies for the maintenance of a collection of machines by a set of repairmen,” European Journal of Operational Research, vol. 165, pp. 267–284, Aug. 2005.
- [2] D. Ruiz-Hernandez, K. D. Glazebrook, and C. Kirkbride, “Some indexable families of restless bandit problems,” Advances in Applied Probability, vol. 38, no. 3, pp. 643–672, Sept. 2006.
- [3] K. Avrachenkov, U. Ayesta, J. Doncel, and P. Jacko, “Congestion control of TCP flows in Internet routers by means of index policy,” Computer Networks, vol. 57, no. 17, pp. 3463–3478, 2013.
- [4] A. Mate, A. Perrault, and M. Tambe, “Risk-aware interventions in public health: Planning with restless multi-armed bandits.,” in Proceedings of AAMAS, 2021, pp. 880–888.
- [5] R. Meshram, D. Manjunath, and A. Gopalan, “On the Whittle index for restless multi-armed hidden Markov bandits,” IEEE Transactions on Automatic Control, vol. 63, no. 9, pp. 3046–3053, 2018.
- [6] R. Meshram, D. Manjunath, and A. Gopalan, “A restless bandit with no observable states for recommendation systems and communication link scheduling,” in Proceedings of CDC, Dec 2015, pp. 7820–7825.
- [7] R. Meshram, A. Gopalan, and D. Manjunath, “A hidden Markov restless multi-armed bandit model for playout recommendation systems,” COMSNETS, Lecture Notes in Computer Science, Springer, pp. 335–362, 2017.
- [8] P. S. Ansell, K. D. Glazebrook, J. Niño-Mora, and M. O’Keeffe, “Whittle’s index policy for a multi-class queueing system with convex holding costs,” Math. Methods Oper. Res., vol. 57, no. 1, pp. 21–39, 2003.
- [9] J. Gittins, K. Glazebrook, and R. Weber, Multi-armed bandit allocation indices, Wiley, 2011.
- [10] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of optimal queuing network control,” Mathematics of Operations Research, vol. 24, no. 2, pp. 293–305, May 1999.
- [11] P. Whittle, “Restless bandits: Activity allocation in a changing world,” Journal of Applied Probability, vol. 25, no. A, pp. 287–298, 1988.
- [12] R. R. Weber and G. Weiss, “On an index policy for restless bandits,” Journal of Applied Probability, vol. 27, no. 3, pp. 637–648, Sept. 1990.
- [13] D. Bertsimas and J. Niño-Mora, “Conservation laws and extended polymatroids and multi-armed bandit problems: A polyhedral approach to indexable systems,” Mathematics of Operations Research, vol. 21, no. 2, pp. 257–306, 1996.
- [14] D. Bertsimas and J. Niño-Mora, “Restless bandits, linear programming relaxation and a primal-dual index heuristic,” Operations Research, vol. 48, no. 1, pp. 80–90, 2000.
- [15] J. Niño-Mora, “Dynamic priority allocation via restless bandit marginal productivity indices,” TOP, vol. 15, no. 2, pp. 161–198, 2007.
- [16] N. Akbarzadeh and A. Mahajan, “Conditions for indexability of restless bandits and algorithm to compute Whittle index,” Advances in Applied Probability, vol. 54, pp. 1164–1192, 2022.
- [17] V. S. Borkar, G. S. Kasbekar, S. Pattathil, and P. Y. Shetty, “Opportunistic scheduling as restless bandits,” ArXiv:1706.09778, 2017.
- [18] K. Kaza, R. Meshram, V. Mehta, and S. N. Merchant, “Sequential decision making with limited observation capability: Application to wireless networks,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 2, pp. 237–251, June 2019.
- [19] J. T. Hawkins, A Lagrangian decomposition approach to weakly coupled dynamic optimization problems and its applications, Ph.D. thesis, Massachusetts Institute of Technology, 2003.
- [20] D. Adelman and A. J. Mersereau, “Relaxations of weakly coupled stochastic dynamic programs,” Operations Research, vol. 56, no. 3, pp. 712–727, 2008.
- [21] D. Bertsimas and V. V. Mišić, “Decomposable Markov decision processes: A fluid optimization approach,” Operations Research, vol. 64, no. 6, pp. 1537–1555, 2016.
- [22] R. Meshram and K. Kaza, “Simulation based algorithms for Markov decision processes and multi-action restless bandits,” arXiv preprint, 2020.
- [23] X. Zhang and P. I. Frazier, “Near optimality for infinite horizon restless bandits with many arms,” arXiv preprint, pp. 1–28, 2022.
- [24] D. B. Brown and J. Zhang, “Fluid policies, reoptimization, and performance guarantees in dynamic resource allocation,” SSRN, pp. 1–46, 2022.
- [25] N. Gast, B. Gaujal, and C. Yan, “The LP-update policy for weakly coupled Markov decision processes,” arXiv preprint, pp. 1–29, 2022.
- [26] J. Niño-Mora, “A fast-pivoting algorithm for Whittle’s restless bandit index,” Mathematics, vol. 8, pp. 1–21, 2020.
- [27] R. Meshram, K. Kaza, V. Mehta, and S. N. Merchant, “Restless bandits for sensor scheduling in energy constrained networks,” in Proceedings of IEEE Indian Control Conference, Dec. 2022, pp. 1–6.
- [28] K. E. Avrachenkov and V. S. Borkar, “Whittle index based Q-learning for restless bandits with average reward,” Automatica, vol. 139, pp. 1–10, May 2022.
- [29] J. Fu, Y. Nazarathy, S. Moka, and P. G. Taylor, “Towards Q-learning the Whittle index for restless bandits,” in Proceedings of the Australian and New Zealand Control Conference (ANZCC), Nov. 2019, pp. 249–254.
- [30] M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming, John Wiley & Sons, 2005.
- [31] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of Markov decision processes,” Mathematics of Operations Research, vol. 12, no. 3, pp. 441–450, Aug. 1987.
- [32] M. L. Littman, T. L. Dean, and L. P. Kaelbling, “On the complexity of solving Markov decision problems,” in Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI’95), Aug. 1995, pp. 394–402.
- [33] M. Wang, “Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time,” Mathematics of Operations Research, vol. 45, no. 2, pp. 1–30, May 2020.
-A Proof of Lemma 1
The proof is by induction. We give the argument for piecewise convexity.
Let
Note that the rewards are linear functions, and the maximum of linear functions is a piecewise convex function.
We next assume that is piecewise convex in for fixed
From dynamic programming equations, we have
We want to show that is piecewise convex in Now observe that is piecewise convex in because it is a convex combination of piecewise convex functions. Hence is piecewise convex function in Then
is piecewise convex in Hence is piecewise convex in for all
From Banach fixed point theorem, [30, Theorem ], the dynamic program equation converges and there exists unique Thus as and hence is piecewise convex in The result follows.
Analogously, we can prove that piecewise convex in for fixed
-B Proof of Theorem 1
We make use of the following lemma from [30].
Lemma 2
Let and be real-valued non-negative sequences satisfying
for all values of Equality holds for Suppose that for then
(10)
Using lemma we have following results. Let and be two random variables with probabilities and If then is stochastically greater than From preceding lemma for any non-decreasing function we have We first prove the following lemma.
Lemma 3
Suppose
-
1.
is non-decreasing in for all
-
2.
is non-decreasing in
Then is non-decreasing in for fixed
Proof:
Proof of this lemma is by induction method. Let
Notice that are non-decreasing in and hence is non-decreasing in
Assume that is non-decreasing in for fixed Then,
From earlier lemma and discussion, is non-decreasing in whenever is non-decreasing in Hence is non-decreasing in Similarly, is non-decreasing in Then
and is non-decreasing in Hence is non-decreasing in for all
From Banach fixed point theorem, [30, Theorem ], the dynamic program equation converges and there exists unique Thus as and hence is non-decreasing in The result follows. ∎
Lemma 4
is non-decreasing in for fixed
The proof is again by induction; the steps are analogous to the previous lemma, and hence we omit the details.
We are now ready to prove the main theorem.
-
1.
If is a superadditive function on then
for and
-
2.
If is a superadditive function on then
for all This implies
We assumed that is fixed. Proof is by induction method. From assumption on rewards, is superadditive on The optimal policy is
(11)
Further, note that
and is non-decreasing in for fixed Hence, the optimal policy is non-decreasing in
We next want to show that the optimal policy at time step is non-decreasing in We need to show that is non-decreasing in
Consider
Next we require the following inequality for to be true.
From earlier lemma, we know that is non-decreasing in and is a superadditive function on Thus is superadditive. Reward is a superadditive function on Hence is a superadditive function on for all
This implies that is non-decreasing in and there exists the optimal policy which is non-decreasing in for fixed The policy is non-decreasing in for all values of
From Banach fixed point theorem, [30, Theorem ], the dynamic program equation converges and there exists unique Thus as and is non-decreasing in Moreover,
Then is non-decreasing in Hence is non-decreasing in for fixed This completes the proof.
-C Proof of Theorem 2
In [27], a complexity analysis is given for the online rollout policy with partially observable restless bandits. Here, we sketch the proof for the finite-state, fully observable restless bandit model.
We consider the case $M = 1$, i.e., a single arm is played.
Recall from Section IV-A that we simulate multiple trajectories of fixed horizon length and compute a value estimate for each trajectory, starting from the given initial state and initial action; each trajectory is run for the full horizon length, and this is repeated for every trajectory.
The empirical value estimates are formed for all possible initial actions (arms); this requires simulating trajectories of the given horizon (look-ahead) length for each of the initial actions.
The rollout policy also involves the policy improvement step, which takes additional computation. The per-iteration computational cost therefore scales with the number of arms, the number of trajectories, and the horizon length, and the total cost over the given number of time steps scales linearly in the number of time steps.
-D Online rollout policy when multiple arms are played
We now discuss the rollout policy when more than one arm is played at each time step.
When a decision maker plays more than one arm in each time step, employing a base policy with future look-ahead in the rollout policy is non-trivial.
This is due to the large number of possible combinations of $M$ out of $N$ available arms, i.e., $\binom{N}{M}$. Since the rollout policy depends on future look-ahead actions, it can be computationally expensive to implement at each time step if one has to choose among all such subsets. We reduce this computation in the base policy by employing a myopic rule in the look-ahead, where we select the $M$ arms with the highest immediate rewards while computing the value estimates of the trajectories.
In this case, the set of arms played at step $t$ in trajectory $\ell$ is a subset $A_{t,\ell} \subset \{1,\dots,N\}$ with $|A_{t,\ell}| = M$; here $a^{n}_{t,\ell} = 1$ if $n \in A_{t,\ell}$. The base policy uses the myopic decision rule, and the one-step policy improvement is given by
$$A^{*} = \arg\max_{A \subset \{1,\dots,N\},\, |A| = M} \widetilde{Q}_{\phi}(s, A), \qquad (12)$$
where $\widetilde{Q}_{\phi}(s, A)$ is the averaged value estimate for initial action $A$; its computation is similar to the preceding discussion. At time $t$ with state $s$, the rollout policy plays the subset of arms obtained according to Eqn. (12).
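A short simulation sketch of this multi-play rollout rule is given below. It evaluates only a few candidate $M$-subsets as the initial action and continues every trajectory with the myopic base policy; the candidate pruning, the function names, and the parameter values are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def rollout_select_subset(states, P, r, M, beta=0.9, H=10, n_traj=20,
                          n_candidates=10, rng=None):
    """Multi-play rollout (cf. Eqn. (12)) with a myopic base policy.

    states : current state of each arm.
    P      : list of (P0, P1) transition-matrix pairs, one per arm.
    r      : list of (K, 2) reward matrices, one per arm.
    """
    rng = rng or np.random.default_rng()
    N = len(states)
    candidates = list(combinations(range(N), M))[:n_candidates]

    def myopic_subset(s):
        # base policy: the M arms with the highest immediate active reward
        return set(np.argsort([r[n][s[n], 1] for n in range(N)])[-M:])

    best_val, best_set = -np.inf, None
    for subset in candidates:                    # candidate initial actions
        total = 0.0
        for _ in range(n_traj):
            s, played = list(states), set(subset)
            for t in range(H):
                for n in range(N):
                    a = 1 if n in played else 0
                    total += (beta ** t) * r[n][s[n], a]
                    s[n] = rng.choice(P[n][a].shape[0], p=P[n][a][s[n]])
                played = myopic_subset(s)
        avg = total / n_traj
        if avg > best_val:
            best_val, best_set = avg, subset
    return best_set
```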
-D1 Computation complexity result for (multiple arms are played)
Along each trajectory, $M$ arms out of $N$ are played. The number of possible actions at each time step in a trajectory is $\binom{N}{M}$, which is polynomial in $N$ for fixed $M$. Thus, the complexity would be very high if all possible actions were considered.
The value estimates are computed only for some initial actions, The computations required for value estimate per iteration is As there are number of subsets considered for each step in action selection, the computations needed for policy improvement steps are at most So, the per-iteration computation complexity is Hence, for time-steps the computational complexity would be
-E Additional Numerical Examples: Single-Armed Restless Bandit
Here we provide more numerical examples and verify indexability using our Algorithm 1.
-E1 Examples with restart
We illustrate the structure of the policy matrix for the restart model, in which, whenever the active action is taken, the state resets to the first state with probability one. The reward is decreasing in the state for the passive action, and the reward for the active action is the same for all states. We provide examples for two different numbers of states. The transition probabilities under the passive action are as follows.

For the first example, the rewards for the passive action (action $0$) vary with the state, and the rewards for the active action (action $1$) are the same for all states. The policy matrix is illustrated in Fig. 7. We observe that the optimal policy is of single-threshold type, and the bandit is indexable.
In Fig. 8, we use a larger number of states, again with state-dependent rewards for the passive action and a common reward for the active action across all states. It is difficult to display the policy matrix for a large number of states; hence, we illustrate the threshold state as a function of the subsidy. Notice that the threshold state is increasing in the subsidy. This monotone behavior indicates that the bandit is indexable.
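The threshold-state curve plotted in Fig. 8 can be extracted from the policy matrix as in the sketch below, under the assumption that the active action is optimal in the higher states (as in this restart example); states are indexed from zero.

```python
import numpy as np

def threshold_states(Phi):
    """For each subsidy grid point (row of Phi, 1 = active), return the
    smallest state at which the active action is optimal; rows with no
    active state get K, the number of states."""
    L, K = Phi.shape
    thresholds = np.full(L, K)
    for j in range(L):
        active = np.where(Phi[j] == 1)[0]
        if active.size > 0:
            thresholds[j] = active[0]
    return thresholds
```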

-E2 Example of one-step random walk
We consider a one-step random walk with a fixed number of states. The transition probability matrix is the same for both actions, $a = 0$ and $a = 1$. The rewards for the passive and active actions are specified for all states. This model is also applicable in wireless communication systems.

In Fig. 9, we illustrate the policy matrix, and we observe that the policy is non-increasing in the state for each fixed subsidy; this holds for all subsidies, so the policy has a single threshold in the state. For each fixed state, the policy is also non-increasing in the subsidy, so there is a single-threshold policy in the subsidy as well.
From Fig. 9, it is clear that the passive set is non-decreasing in the subsidy. Thus, the bandit is indexable.
-E3 Example from [16] (Akbarzadeh and Mahajan)

In this example, there is no structural assumption on transition probability and reward matrices.
The rewards for the passive action are the same for all states, while the rewards for the active action vary with the state.
From the policy matrix in Fig. 10, we can observe that the bandit is indexable, because the passive set is non-decreasing in the subsidy.
-E4 Indexable model from [15] (Niño-Mora)
There is no structural assumption on transition probability matrices.
We use a fixed discount parameter. We observe from Fig. 11 that the optimal policy is non-increasing in the state for each fixed subsidy, even though there is no structural assumption on the transition probability and reward matrices. Further, the policy is non-increasing in the subsidy for each fixed state; this is true for all states. The passive set is therefore non-decreasing in the subsidy, and the bandit is indexable.

-E5 Non-indexable model from [15] (Niño-Mora)
Here, we select an example of a bandit that is non-indexable. We use the following parameters.
We do not make any structural assumption on the transition probability matrices, but there is one on the reward matrix: the reward is decreasing in the state for the active action and increasing in the state for the passive action. In Fig. 12, notice that the policy is not monotone in the subsidy for all states, which implies that for some states there are multiple thresholds in the subsidy; this can be observed for two of the states. As a result, there is a range of subsidies over which a state leaves the passive set as the subsidy increases, so the passive set is not monotone in the subsidy; hence, the bandit is non-indexable.

-F Non-indexable to indexable bandit by modification of reward matrix
In the following example, we use the same number of states and the same transition probability matrices as in the non-indexable bandit above. We modify the reward matrix and illustrate that the bandit becomes indexable.
-F1 Indexable model with states and
In this example, we slightly modify the reward matrix while keeping the same transition probability matrices as in the previous example. We then observe that the bandit becomes indexable. This may be due to the fact that the difference between the rewards of the passive and active actions is smaller here than in the previous example, while the structure of the reward matrix remains the same in both examples, i.e., the reward is decreasing in the state for the active action and increasing in the state for the passive action.
In Fig. 13, we notice that the policy is non-increasing (monotone) in the subsidy for each state, but it is not monotone in the state for some subsidies. Further, we observe that the passive set is non-decreasing in the subsidy, and hence the bandit is indexable.

-F2 Indexable model with states, monotone reward and
We consider the following parameters.
The reward is increasing in the state for both actions (passive and active), but there is no structural assumption on the transition probability matrices. Using this example, we illustrate that the bandit is indexable. From Fig. 14, we can observe that the policy is non-increasing in the subsidy for every state, but it is not monotone in the state for some subsidies: there is no single-threshold policy in the state for those values of the subsidy, yet there is a single-threshold policy in the subsidy for each state. The passive set is non-decreasing in the subsidy, and hence the bandit is indexable.

-F3 Indexable model with states and
Consider
We now observe that the reward is monotone in the state for both actions. From Fig. 15, the passive set is non-decreasing in the subsidy; hence, the bandit is indexable.

-G Discussion on non-indexability and indexability of SAB
From the preceding examples (non-indexable and indexable) in Section V and Appendix -F, we infer that for a bandit to be non-indexable, the rewards for the two actions need to be ordered in opposite directions along the states, and there needs to be a sufficient difference between the rewards in each state. When the reward is monotone (non-decreasing) in the state for both actions, the bandit is indexable, because as we increase the subsidy there is no possibility of more than a single threshold in the subsidy for any state. To see how more than one threshold in the subsidy can arise, observe first that the subsidy is a fixed reward obtained for the passive action, independent of the state; second, as the subsidy increases, the passive action becomes optimal, since not playing becomes the better choice; third, for the active action to become optimal again as the subsidy increases further, the cumulative reward from the active action must exceed that of the passive action.
This is possible when the reward for the passive action is increasing in the state while the reward for the active action is decreasing: the reward difference goes from positive to negative across states, and increasing the subsidy rebalances the dynamic programming equation for some states. Fourth, as the subsidy increases even further, the optimal action must become passive and remain passive for all larger values of the subsidy.
-G1 Comments on Indexability
From the numerical examples of single-armed restless bandits, we observed that non-indexability occurs under a rather restrictive setting on the transition probability and reward matrices. In most applications, such as wireless communication, machine maintenance, and recommendation systems, some structure on the reward and transition probability matrices is assumed. Hence, many applications of finite-state observable restless bandit models are indexable.
-H Additional Numerical Examples: RMAB with Performance of different policies
Here, we provide more numerical examples comparing the performance of different policies on restless multi-armed bandits. In the following, we consider both identical and non-identical restless bandits. We also use varying scenarios: monotone and indexable bandits, non-monotone and indexable bandits, and non-indexable bandits. In the non-indexable scenario, we compare the performance of the rollout and myopic policies. We illustrate simulations for a fixed number of arms and a fixed discount parameter. In the rollout policy, we use a fixed number of trajectories and a fixed horizon length.
-H1 Example of identical, indexable, and monotone bandits


In Fig. 16 and Fig. 17, we compare the performance of the myopic, rollout, and index policies for two different numbers of arms played. All the bandits are identical and indexable, and the index is monotone in the state. We use a fixed number of trajectories and a fixed horizon length. Notice that all policies have identical performance; this is due to the identical and monotone reward structure of the bandits and the monotonicity of the index in the state.
-H2 Example of identical, indexable, and non-monotone bandits


In Fig. 18 and Fig. 19, we compare the performance of the myopic, rollout, and index policies for two different numbers of arms played. All the bandits are identical and indexable, but non-monotone, i.e., the index is not monotone in the state. We use a fixed number of trajectories and a fixed horizon length. We observe that the rollout policy and the index policy perform better than the myopic policy.
Examples of such bandits are given in Appendix -F. Even though the passive set is non-decreasing in the subsidy, the policy is not monotone in the state for all values of the subsidy. The single-armed bandit examples suggest that there need not be any structure on the transition probability or reward matrices. In such examples, the myopic policy does not perform as well as the rollout policy and the index policy. The rollout policy uses a simulation-based look-ahead approach and the index policy uses a dynamic program for index computation, so both policies take possible future state information into account.




-H3 Example of identical and non-indexable bandits
In Fig. 20 and Fig. 21, we compare the performance of the myopic and rollout policies for two different numbers of arms played, where all bandits are non-indexable and identical. In the rollout policy, a fixed number of trajectories and a fixed horizon length are used. We observe that both policies have identical performance.
-H4 Effect of horizon length in rollout policy
In Fig. 22 and Fig. 23, we compare the performance of the myopic, rollout, and index policies. All the bandits are non-identical and indexable; some are monotone and some are non-monotone. We use a fixed number of trajectories and two different horizon lengths. We observe that the rollout policy and the index policy perform better than the myopic policy. It is noticed that as the horizon length increases, the difference between the rollout policy and the index policy decreases.