Keqin Liu [1,2]
[1] Department of Mathematics, Nanjing University, Nanjing 210093, China
[2] National Center for Applied Mathematics, Nanjing 210093, China
[3] Department of Mathematics, University of Cambridge, Cambridge CB3 0WB, UK
Low-Complexity Algorithm for Restless Bandits with Imperfect Observations
Abstract
We consider a class of restless bandit problems that has broad application in reinforcement learning and stochastic optimization. We consider independent discrete-time Markov processes, each of which has two possible states: 1 and 0 ('good' and 'bad'). Reward accrues from a process only if it is in state 1 and observed to be so. The aim is to maximize the expected discounted sum of rewards over the infinite horizon, subject to the constraint that only a fixed number of processes may be observed at each step. Observation is error-prone: there are known probabilities that state 1 (0) will be observed as 0 (1). From these one can maintain, at any time, the probability that each process is in state 1. The resulting system may be modeled as a restless multi-armed bandit problem with an information state space of uncountable cardinality. Restless bandit problems with even finite state spaces are PSPACE-hard in general. We propose a novel approach for simplifying the dynamic programming equations of this class of restless bandits and develop a low-complexity algorithm that achieves strong performance and is readily extensible to the general restless bandit model with observation errors. Under certain conditions, we establish the existence of the Whittle index (indexability) and its equivalence to our algorithm. When those conditions do not hold, we show by numerical experiments the near-optimal performance of our algorithm over the general parameter space. Furthermore, we theoretically prove the optimality of our algorithm for homogeneous systems.
keywords:
restless bandits, continuous state space, observation errors, value functions, index policy

1 Introduction
The exploration-exploitation (EE) dilemma is well posed in optimization-over-time problems and is mathematically modeled in various forms for reinforcement learning, of which multi-armed bandits (MAB) form a major category. In the classical MAB model, a player chooses one of several statistically independent arms to pull and possibly accrues a reward determined by the state of the chosen arm, which transits to a new state according to a known Markovian rule (Gittins et al., 2011). The states of the other arms remain frozen. The objective is to maximize the expected total discounted reward summed over an infinite time horizon with discount factor $\beta\in(0,1)$:
$$\max_{\pi\in\Pi}\ \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty}\beta^{t}\,R_{a(t)}\big(S_{a(t)}(t)\big)\right]. \qquad (1)$$
The expectation is taken under a policy $\pi$ chosen from the set of all feasible policies $\Pi$; $S_i(t)$ is the state of arm $i$ at time $t$, $a(t)$ is the arm pulled at time $t$, and $R_{a(t)}(S_{a(t)}(t))$ is the reward thus obtained. It follows from the standard theory of Markov decision processes that there must exist an optimal stationary policy, independent of time $t$. If each arm's state space has cardinality $X$, then the joint state space has size $X^N$ for $N$ arms, so a dynamic programming solution has a running time that grows geometrically as the number of arms increases. Gittins (1979) solved the problem by showing the optimality of an index policy: for each state of each arm there exists an index depending solely on the parameters of that arm, and it is optimal at each time to pull the arm whose current state has the highest index. The running time of the Gittins index policy grows only linearly with the number of arms, as the arms are decoupled when computing the index function (of the states of each arm). Whittle (1988) generalized the Gittins index to the restless MAB model, in which the arms that are not chosen may also change state and produce reward. Whittle's generalization has been shown to perform very well in theoretical and numerical studies (see, e.g., Weber and Weiss (1990, 1991), Liu and Zhao (2010), Hu and Frazier (2017), Zayas-Cabán et al. (2019), Brown and Smith (2020), Chen et al. (2021), Gast et al. (2021)). In general, however, it is difficult to theoretically establish the condition required for the Whittle index to exist (so-called indexability) and to solve for the closed-form Whittle index when it does exist, due to the curse of dimensionality (see Sec. 2.1). For finite-state models, Gast et al. (2023) proposed an efficient algorithm to numerically test indexability and compute the Whittle index. A natural question is whether one can transform a bandit with a continuous state space into a finite-state one by discretization. Unfortunately, we found that determining how fine the discretization needs to be for given system parameters is itself a difficult problem. Furthermore, a finer discretization inevitably leads to a larger state transition matrix, with increased algorithmic complexity and unpredictability of the steady-state system performance. This motivates us to consider alternative approaches to bandits with infinite state spaces, as addressed in this paper. In terms of searching for the general optimal policy, Papadimitriou and Tsitsiklis (1999) showed that the restless MAB with a finite state space is PSPACE-hard in general. With an infinite state space, a restless bandit problem presents practical difficulties for purely numerical methods, as discussed above. In this paper, we show that for our particular Markovian model with an infinite state space, the Whittle index policy can be efficiently implemented with satisfactory accuracy after a theoretical analysis of the rich mathematical structure of the problem.
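The decoupling behind an index policy is easy to express in code: each arm carries its own index function of its own state, and the player simply activates the arm (or arms) with the largest current index. The sketch below is a minimal illustration; the name `index_fn` is a hypothetical stand-in for a Gittins or Whittle index table and is not part of the paper.

```python
import numpy as np

def index_policy_choice(index_fn, states, k=1):
    """Activate the k arms whose current states have the largest indices.

    index_fn(arm, state) -> float is a per-arm index (e.g., a Gittins or
    Whittle index); the per-step cost is one index evaluation per arm,
    hence linear in the number of arms.
    """
    indices = [index_fn(arm, s) for arm, s in enumerate(states)]
    return np.argsort(indices)[-k:]
```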
In this paper, we extend the work in Liu and Zhao (2010) (a perfect-observation model) and Liu et al. (2010) (the myopic policy on stochastically identical arms) to build a near-optimal, low-complexity algorithm for a class of restless bandits with an infinite state space and an imperfect observation model. Our model also belongs to the general framework of partially observable Markov decision processes (POMDP) (Sondik, 1978). Consider independent processes, each of which evolves on a two-state Markov chain whose state is observed only if the process is chosen. Furthermore, the observation is error-prone: state 1 may be observed as 0 and vice versa. Each process is referred to as an arm. At each time, the player obtains a reward from an arm if and only if that arm is currently chosen and accurately observed in state 1. Under resource constraints, the player's goal is to select a fixed number of arms at each time so as to maximize the long-term reward. By formulating the belief vector as the system state for decision making, we show that indexability is satisfied under certain conditions. Furthermore, we propose an efficient algorithm to compute an approximate Whittle index that achieves near-optimal performance in general, even when the conditions for indexability do not hold.
Rahul et al. (2018), Varun et al. (2018) and Kesav et al. (2019) considered similar models (up to some nuances) and established indexability under much stricter conditions on the system parameters. For example, all three papers require that the differences of the Markov transition probabilities of each arm be suitably bounded, while we need no restriction on the transition probabilities. Furthermore, the three papers require the discount factor to be smaller than a fixed constant, while our condition on the discount factor is a more relaxed closed-form expression of the system parameters. In terms of computing the Whittle index, their algorithm is a direct application of general reinforcement learning, while our algorithmic framework is based on a detailed analysis of the value functions and converges quickly to the exact Whittle index function. In this paper, we also plot the performance of the optimal policy to demonstrate the near-optimality of our algorithm, in addition to the comparison with the myopic policy. For homogeneous systems (stochastically identical arms), we show that our algorithm is equivalent to the myopic policy and theoretically prove its optimality under certain conditions. Wang et al. (2018) also considered our model and assumed the optimality of the threshold policy for a single arm while using a very coarse linear approximation to compute the Whittle index function (the key step (a) for the second equality in the proof of Lemma 6 in Wang et al. (2018) is incorrect). In this paper, we rigorously prove the optimality of the threshold policy for a single arm, establish indexability under certain conditions, and subsequently construct an efficient algorithm for computing the Whittle index function with arbitrary precision; its near-optimality is numerically verified in general and formally proved for a class of homogeneous systems.
The rest of this paper is organized as follows. Sec. 2 presents our problem formulation and main results on the optimal threshold policy and indexability. Sec. 3 presents our algorithm with design details. Sec. 4 presents a theoretical proof of optimality for homogeneous arms. Sec. 5 presents numerical results, concludes the paper, and discusses future research directions.
2 Main Results
Consider a restless MAB consisting of $N$ internal two-state Markov chains (arms) with potentially different transition probabilities. At each time $t$, the player chooses $k$ arms and observes their states. Let $S\in\{0,1\}$ denote the current state of an arm and let $O\in\{0,1\}$ denote its observation (detection) outcome. The error probabilities are $\delta=\Pr\{O=1\mid S=0\}$ and $\epsilon=\Pr\{O=0\mid S=1\}$, i.e., the probabilities of miss detection and false alarm, respectively, in the observation model. In Levy (2008), it was shown that under the optimal detector the error probabilities follow the curve of receiver operating characteristics (ROC), which makes the detection probability $1-\delta$ a concave increasing function of $\epsilon$ from $(0,0)$ to $(1,1)$. This matches the intuition that making a detector more sensitive reduces $\delta$ but increases $\epsilon$. Since the optimal detector design is a solved problem and not the focus of this paper, we simply assume that $\delta$ and $\epsilon$ are given. If an arm in state 1 is also observed in state 1, then the player accrues $B$ units of reward from this arm. One of many application examples of this observation model is cognitive radio, where a secondary user aims to utilize a frequency band (channel/arm) currently unused by the primary users. Due to energy and policy constraints on the sensor of the secondary user, only a subset of channels can be sensed at each time, and if any of them is sensed idle ($O=1$), the user can send packets over it to its receiver and obtains an acknowledgement (ACK) at the end of the time slot if the channel is indeed idle ($S=1$); otherwise no ACK from this channel is received. The reward $B$ is then just the bandwidth of the channel. Clearly, the hard constraint here should be on the miss detection probability $\delta$ to guarantee the satisfaction of the primary users, i.e., the disturbance caused when a secondary user senses a busy channel as idle and subsequently sends data over it should be capped.
2.1 System Model and Belief Vector
At each discrete time $t$, the internal state $S_i(t)$ of an arm cannot be observed before deciding whether or not to observe the arm. Therefore, we cannot use the states of the Markov chains directly as the system state for decision making. Applying the general POMDP theory to our model, the belief vector, consisting of the conditional probabilities that each arm is in state 1 given all past observations and actions, is a sufficient statistic for making future decisions (Sondik, 1978):
$$\Omega(t)=\big(\omega_1(t),\ldots,\omega_N(t)\big), \qquad (2)$$
$$\omega_i(t)=\Pr\big\{S_i(t)=1 \mid \text{all past observations and actions}\big\}, \qquad (3)$$
where $\omega_i(t)$ is the belief state of arm $i$ at time $t$ and $S_i(t)$ its internal state. According to Bayes' rule, the belief state of any arm itself evolves as a Markov chain with an infinite state space:
$$\omega_i(t+1)=\begin{cases} p^{(i)}_{11}, & i\in A(t),\ O_i(t)=1,\ K_i(t)=1,\\[2pt] p^{(i)}_{01}, & i\in A(t),\ O_i(t)=1,\ K_i(t)=0,\\[2pt] \mathcal{T}_i\!\left(\dfrac{\epsilon_i\,\omega_i(t)}{\epsilon_i\,\omega_i(t)+(1-\delta_i)\,(1-\omega_i(t))}\right), & i\in A(t),\ O_i(t)=0, \end{cases} \qquad (4)$$
$$\omega_i(t+1)=\mathcal{T}_i\big(\omega_i(t)\big):=\omega_i(t)\,p^{(i)}_{11}+\big(1-\omega_i(t)\big)\,p^{(i)}_{01}, \qquad i\notin A(t), \qquad (5)$$
where $A(t)$ is the set of arms chosen at time $t$ with $|A(t)|=k$, $S_i(t)$ and $O_i(t)$ are respectively the state of and observation from arm $i$ at time $t$ if $i\in A(t)$, $K_i(t)$ is the acknowledgement of successful utilization of arm $i$ in slot $t$, $\mathcal{T}_i$ is the one-step belief update operator without observation, and $P_i=\{p^{(i)}_{jl}\}_{j,l\in\{0,1\}}$ is the transition matrix of the internal Markov chain of arm $i$. Note that when $\epsilon_i=0$, all three expressions of the next belief state in (4) become linear functions and the problem is reduced to that in Liu and Zhao (2010) for perfect observations, where the value functions for dynamic programming can be solved directly in closed form. Later, we will see that when $\epsilon_i>0$, the value functions are very difficult to analyze except for a few general properties. Without a fine-grained analysis of the value functions, the indexability and Whittle index cannot be established for our model. Our approach is to start with a finite time horizon, utilize backward induction (on the time horizon), and then take limits of the corresponding functions as the time horizon goes to infinity. The key idea is to obtain analytic bounds on the value functions and their derivatives as functions of the system parameters, instead of relying on purely numerical computations of these functions. We will show that these bounds, together with some detailed properties of the value functions, are sufficient for our purpose. From (5), the $l$-step belief update of an unobserved arm left passive for $l$ consecutive slots starting from any belief state $\omega$ is
$$\mathcal{T}_i^{\,l}(\omega)=\frac{p^{(i)}_{01}}{1-p^{(i)}_{11}+p^{(i)}_{01}}+\big(p^{(i)}_{11}-p^{(i)}_{01}\big)^{l}\left(\omega-\frac{p^{(i)}_{01}}{1-p^{(i)}_{11}+p^{(i)}_{01}}\right). \qquad (6)$$
For simplicity of notation, we write this $l$-step update as $\mathcal{T}^{\,l}(\omega)$ when the arm index is clear from the context.
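To make the belief dynamics concrete, the following sketch implements the updates (4)–(6) as reconstructed above, with $\epsilon=\Pr\{O=0\mid S=1\}$ (false alarm) and $\delta=\Pr\{O=1\mid S=0\}$ (miss detection). The function names and the exact form of the observed-arm update reflect this paper's cognitive-radio reading of the model and should be taken as an illustrative assumption rather than a verbatim transcription.

```python
def passive_update(omega, p11, p01):
    """One-step belief update T(omega) of an unobserved arm, eq. (5)."""
    return omega * p11 + (1.0 - omega) * p01

def passive_update_l(omega, p11, p01, l):
    """l-step update T^l(omega) in closed form, eq. (6): T is affine with
    slope (p11 - p01) and fixed point omega_stat (the stationary belief)."""
    omega_stat = p01 / (1.0 - p11 + p01)
    return omega_stat + (p11 - p01) ** l * (omega - omega_stat)

def observed_update(omega, p11, p01, eps, delta, obs, ack=0):
    """Belief update of a chosen arm, following the reconstruction of (4).
    eps = Pr{O=0 | S=1} (false alarm), delta = Pr{O=1 | S=0} (miss detection);
    obs is the detection outcome, ack the ACK when a transmission is made."""
    if obs == 1:                 # transmitted: the ACK reveals the true state
        return p11 if ack == 1 else p01
    # obs == 0: Bayes posterior of state 1, then one Markov transition
    post = eps * omega / (eps * omega + (1.0 - delta) * (1.0 - omega))
    return passive_update(post, p11, p01)
```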
At time $t=0$, the initial belief state of arm $i$ can be set as the stationary distribution of the internal Markov chain (here we assume the internal Markov chain with transition matrix $P_i$ is irreducible and aperiodic):
$$\omega_i(0)=\omega_i^{\circ}:=\frac{p^{(i)}_{01}}{1-p^{(i)}_{11}+p^{(i)}_{01}}, \qquad (7)$$
where $\omega_i^{\circ}$ is the unique solution to $\mathcal{T}_i(\omega)=\omega$; more generally, $\omega_i(0)$ may be set to an arbitrary probability. Given the initial belief vector $\Omega(0)$, we arrive at the following constrained optimization problem:
$$\max_{\pi\in\Pi}\ \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty}\beta^{t}\sum_{i\in A(t)} B_i\,\mathbf{1}\{S_i(t)=1,\ O_i(t)=1\}\ \Big|\ \Omega(0)\right] \qquad (8)$$
$$\text{subject to}\quad |A(t)|=k,\qquad \forall\,t\ge 0, \qquad (9)$$
where $B_i$ is the reward (e.g., the channel bandwidth) collected from a successful utilization of arm $i$. For a fixed initial belief vector $\Omega(0)$ the decision problem has a countable state space, as modelled by the reachable belief vectors, and an uncountable state space for an arbitrarily chosen $\Omega(0)$. It is clear that, fixing $\Omega(0)$, the action-dependent belief vector takes a number of possible values that grows geometrically with time $t$, leading to high complexity in solving the problem; this is the so-called curse of dimensionality. In the following, we adopt Whittle's original idea of Lagrangian relaxation to decouple the arms for an index policy and show some crucial properties of the value functions of a single arm.
2.2 Arm Decoupling by Lagrangian Relaxation
Following Whittle (1988), we first relax the hard constraint (9) to a constraint on the expected total discounted number of passive arms:
$$\max_{\pi\in\Pi}\ \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty}\beta^{t}\sum_{i\in A(t)} B_i\,\mathbf{1}\{S_i(t)=1,\ O_i(t)=1\}\ \Big|\ \Omega(0)\right] \qquad (10)$$
$$\text{subject to}\quad \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty}\beta^{t}\,\big(N-|A(t)|\big)\right]=\frac{N-k}{1-\beta}. \qquad (11)$$
Clearly constraint (11) is a relaxation of the player's action constraint (9). Applying a Lagrangian multiplier $m$ to constraint (11), we arrive at the following unconstrained optimization problem:
$$\max_{\pi\in\Pi}\ \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty}\beta^{t}\left(\sum_{i\in A(t)} B_i\,\mathbf{1}\{S_i(t)=1,\ O_i(t)=1\}+m\,\big(N-|A(t)|\big)\right)\Big|\ \Omega(0)\right]. \qquad (12)$$
Fixing $m$, the above optimization is equivalent to $N$ independent unconstrained optimization problems, one for each arm $i$:
$$\max_{\pi_i}\ \mathbb{E}^{\pi_i}\!\left[\sum_{t=0}^{\infty}\beta^{t}\Big(B_i\,\mathbf{1}\{i\in A(t),\ S_i(t)=1,\ O_i(t)=1\}+m\,\mathbf{1}\{i\notin A(t)\}\Big)\ \Big|\ \omega_i(0)\right]. \qquad (13)$$
Here $\pi_i$ is a single-arm policy that maps the belief state of arm $i$ to the binary action $1$ (chosen/activated) or $0$ (unchosen/made passive). It is thus sufficient to consider a single arm for solving problem (12). For simplicity, we drop the subscript $i$ and consider a single-armed bandit problem without loss of generality. Let $V_m(\omega)$ denote the value of (13) with subsidy $m$ and initial belief state $\omega$. It is straightforward to write out the dynamic programming equation of the single-armed bandit problem as follows:
$$V_m(\omega)=\max\big\{V_m(\omega;1),\ V_m(\omega;0)\big\}, \qquad (14)$$
where $V_m(\omega;1)$ and $V_m(\omega;0)$ denote, respectively, the maximum expected total discounted reward that can be obtained if the arm is activated or made passive at the current belief state $\omega$ and an optimal policy is followed in subsequent slots. Since we consider the infinite-horizon problem, a stationary optimal policy can be chosen and no time index is needed in (14). Define the nonlinear (Bayes-update) operator $\Phi$, which gives the posterior probability of state 1 after the arm is observed in state 0, as
$$\Phi(\omega):=\frac{\epsilon\,\omega}{\epsilon\,\omega+(1-\delta)(1-\omega)}.$$
It is easy to see that $\Phi$ is Lipschitz continuous on $[0,1]$: for any $\omega_1,\omega_2\in[0,1]$,
$$\big|\Phi(\omega_1)-\Phi(\omega_2)\big|=\frac{\epsilon\,(1-\delta)\,|\omega_1-\omega_2|}{\big(\epsilon\omega_1+(1-\delta)(1-\omega_1)\big)\big(\epsilon\omega_2+(1-\delta)(1-\omega_2)\big)} \qquad (15)$$
$$\le\ \frac{\epsilon\,(1-\delta)}{\big(\min\{\epsilon,\,1-\delta\}\big)^{2}}\,\big|\omega_1-\omega_2\big|. \qquad (16)$$
We assume that $\epsilon>0$ (otherwise the problem is reduced to that considered in Liu and Zhao, 2010) and $p_{11}\neq p_{01}$ (otherwise the belief update is independent of observations or actions and the problem becomes trivial). Without loss of generality, set $B=1$. We have
$$\begin{aligned} V_m(\omega;1)&=(1-\epsilon)\,\omega+\beta\Big[(1-\epsilon)\,\omega\,V_m(p_{11})+\delta\,(1-\omega)\,V_m(p_{01})\\ &\qquad\qquad+\big(\epsilon\omega+(1-\delta)(1-\omega)\big)\,V_m\big(\mathcal{T}(\Phi(\omega))\big)\Big],\\ V_m(\omega;0)&=m+\beta\,V_m\big(\mathcal{T}(\omega)\big). \end{aligned} \qquad (20)$$
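As a numerical sanity check on (14) and (20), the single-arm problem with subsidy can be solved approximately by value iteration on a discretized belief grid. This brute-force baseline is not the algorithm developed in Sec. 3; the grid size and tolerance below are arbitrary choices, and the transition structure follows the model reconstruction above.

```python
import numpy as np

def value_iteration(p11, p01, eps, delta, beta, m, grid=2001, tol=1e-9):
    """Approximate the single-arm value function V_m on a uniform belief grid
    by iterating the dynamic programming equations (14)/(20); a numerical
    baseline, not the paper's linear-equation method."""
    omegas = np.linspace(0.0, 1.0, grid)
    V = np.zeros(grid)

    def interp(w):                       # piecewise-linear read-off of V
        return np.interp(w, omegas, V)

    while True:
        w = omegas
        p_obs0 = eps * w + (1 - delta) * (1 - w)          # Pr{O = 0}
        post = eps * w / (p_obs0 + 1e-12)                 # Phi(w)
        nxt0 = post * p11 + (1 - post) * p01              # T(Phi(w))
        V_active = (1 - eps) * w + beta * (
            (1 - eps) * w * interp(p11)
            + delta * (1 - w) * interp(p01)
            + p_obs0 * interp(nxt0)
        )
        V_passive = m + beta * interp(w * p11 + (1 - w) * p01)
        V_new = np.maximum(V_active, V_passive)
        if np.max(np.abs(V_new - V)) < tol:
            return omegas, V_new, V_active, V_passive
        V = V_new
```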
Define the passive set $\mathcal{P}(m)$ as the set of all belief states in which taking the passive action is optimal:
$$\mathcal{P}(m):=\big\{\omega\in[0,1]:\ V_m(\omega;0)\ge V_m(\omega;1)\big\}. \qquad (21)$$
Notice that the immediate reward under the active action cannot exceed $1$, so it is optimal to always make the arm passive whenever the immediate reward under the passive action satisfies $m\ge 1$: in this case the passive action collects the maximum possible immediate reward at every time step regardless of the state transitions. On the other hand, if $m<-\beta/(1-\beta)$ then the optimal action must be to activate the arm. To see this, note that the total discounted reward of any policy consists of two parts: the reward collected under active actions and the reward collected under passive actions. If the optimal policy made the arm passive now when $m<-\beta/(1-\beta)$, then the reward it collects under future active actions would have to exceed $-m>\beta/(1-\beta)$, for otherwise its total discounted reward would be negative, contradicting its optimality (any policy that always activates the arm achieves a nonnegative total discounted reward, and so must the optimal policy). This is again a contradiction, since the total discounted reward collected under active actions starting no earlier than the next slot is upper bounded by $\beta/(1-\beta)$. Consequently, the passive set $\mathcal{P}(m)$ changes from the empty set to the closed interval $[0,1]$ as $m$ increases from $-\beta/(1-\beta)$ to $1$. However, this change may not be monotonic as $m$ increases. If $\mathcal{P}(m)$ does increase monotonically with $m$, then for each belief state $\omega$ one can define the unique $m$ at which $\omega$ joins, and stays in, the passive set forever. Intuitively, such an $m$ measures in a well-ordered manner the attractiveness of activating the arm in belief state $\omega$ compared to other belief states: the larger the $m$ required to make $\omega$ passive, the greater the incentive to activate the arm in belief state $\omega$, even in the problem without subsidy. This Lagrangian multiplier is thus called the 'subsidy for passivity' by Whittle, who formalized the following definition of indexability and the Whittle index (Whittle, 1988).
Definition 1.
A restless multi-armed bandit is indexable if, for each associated single-armed bandit with subsidy for passivity, the set of arm states in which passivity is optimal increases monotonically from the empty set to the whole state space as the subsidy $m$ increases from $-\infty$ to $+\infty$. Under indexability, the Whittle index of an arm state is defined as the infimum subsidy $m$ such that the state remains in the passive set.
For our model, in which the arm state is given by the belief state, indexability is equivalent to the following:
$$m_1\le m_2\ \Longrightarrow\ \mathcal{P}(m_1)\subseteq\mathcal{P}(m_2). \qquad (22)$$
Under indexability, the Whittle index $W(\omega)$ of arm state $\omega$ is defined as
$$W(\omega):=\inf\big\{m:\ \omega\in\mathcal{P}(m)\big\}. \qquad (23)$$
In the following we derive useful properties of the value functions $V_m(\omega)$, $V_m(\omega;1)$ and $V_m(\omega;0)$. Our strategy is to first establish these properties for finite horizons and then extend them to the infinite horizon by the uniform convergence of the finite-horizon value functions to their infinite-horizon counterparts. Define the $T$-horizon value function $V_{m,T}(\omega)$ as the maximum expected total discounted reward achievable over the next $T$ time slots starting from the initial belief state $\omega$. Then
$$V_{m,T}(\omega)=\max\big\{V_{m,T}(\omega;1),\ V_{m,T}(\omega;0)\big\}, \qquad (24)$$
where $V_{m,T}(\omega;1)$ and $V_{m,T}(\omega;0)$ denote, respectively, the maximum expected total discounted reward achievable over the next $T$ slots given the initial active and passive actions, starting from the initial belief state $\omega$:
$$\begin{aligned} V_{m,T}(\omega;1)&=(1-\epsilon)\,\omega+\beta\Big[(1-\epsilon)\,\omega\,V_{m,T-1}(p_{11})+\delta\,(1-\omega)\,V_{m,T-1}(p_{01})\\ &\qquad\qquad+\big(\epsilon\omega+(1-\delta)(1-\omega)\big)\,V_{m,T-1}\big(\mathcal{T}(\Phi(\omega))\big)\Big], \end{aligned} \qquad (25)$$
$$V_{m,T}(\omega;0)=m+\beta\,V_{m,T-1}\big(\mathcal{T}(\omega)\big), \qquad (26)$$
$$V_{m,0}(\omega)\equiv 0. \qquad (27)$$
From the above recursive equations, we can analyze $V_{m,T}$ by backward induction on $T$. It is easy to see that for any $\omega\in[0,1]$,
$$V_{m,1}(\omega)=\max\big\{(1-\epsilon)\,\omega,\ m\big\}. \qquad (28)$$
Therefore $V_{m,1}$ is the maximum of two linear functions and is thus piecewise linear and convex (in both $\omega$ and $m$). Assume that $V_{m,T-1}$ is piecewise linear and convex. Bayes' rule shows that the term
$$\big(\epsilon\omega+(1-\delta)(1-\omega)\big)\,V_{m,T-1}\big(\mathcal{T}(\Phi(\omega))\big) \qquad (29)$$
is piecewise linear and convex, since the leading coefficient also appears as the denominator of the argument of the function $V_{m,T-1}$ (composed with the linear operator $\mathcal{T}$), which is piecewise linear and convex by the induction hypothesis. Hence the recursive equations (25) and (26) show that $V_{m,T}$ is the maximum of two convex and piecewise linear functions and is thus piecewise linear and convex for any $T$ (in both $\omega$ and $m$). Motivated by the Lipschitz continuity of $\Phi$, we show in Lemma 2 that $V_{m,T}$ is also Lipschitz continuous under certain conditions. In the following, we first establish a monotonicity property of $V_{m,T}$ in the case of $p_{11}\ge p_{01}$ (positively correlated Markov chain).
Lemma 1.
If $p_{11}\ge p_{01}$, then $V_{m,T}(\omega)$ is monotonically increasing in $\omega$ for any $T\ge 1$.
Proof.
Since $V_{m,T}(\omega)$ is piecewise linear in $\omega$, it is differentiable almost everywhere; the exceptional set (under the Lebesgue measure on $[0,1]$) consists of finitely many points, at each of which both the left and right derivatives exist but need not be equal. To prove that the continuous function $V_{m,T}(\omega)$ is monotonically increasing in $\omega$, we only need to show
$$\frac{\partial^{+} V_{m,T}(\omega)}{\partial\omega}\ \ge\ 0,\qquad \forall\,\omega\in[0,1), \qquad (30)$$
where $\partial^{+}V_{m,T}(\omega)/\partial\omega$ denotes the right derivative of $V_{m,T}$ as a function of the belief state $\omega$ with $m$ fixed. From (28), the value function $V_{m,1}(\omega)$ is monotonically increasing with nonnegative right derivative $1-\epsilon$ or $0$. Assume (30) is true for $T-1$; then for $T$ we have, with
(34) |
From the above, we have
(38) |
where the terms involved denote the right derivatives of the corresponding functions. We have used the fact that $V_{m,T-1}$ is monotonically increasing, that $\mathcal{T}(\omega)$ is also monotonically increasing in $\omega$ when $p_{11}\ge p_{01}$, and that
(39) |
By the induction hypothesis and (38), if then is monotonically increasing (since ) and
(40) |
where both the first and second inequalities are due to the monotonically increasing property of under the assumption that by our induction hypothesis and
(41) |
This proves the monotonically increasing property of . Thus is also monotonically increasing and the proof by induction is finished. ∎
Now we show that, under a constraint on the discount factor $\beta$, the value function $V_{m,T}(\omega)$ is a Lipschitz function of $\omega$:
Lemma 2.
Suppose the discount factor satisfies
(42) |
Then and ,
(43) | |||
(44) |
Proof.
We prove this by induction. Without loss of generality, assume . For the case of ,
Thus , where the second inequality is due to (42).
Assume that the claim holds for $T-1$, i.e., neither the left nor the right derivative of $V_{m,T-1}(\omega)$ has an absolute value exceeding the stated bound. We have the following inequalities:
(45) |
To prove the claim for $T$, recall the definitions of the active and passive value functions in (34) and their right derivatives in (38). We have
(46) |
where the inequality in the first (or second) line above is due to the first (or second) line in (45). Thus we have the following lower and upper bounds on and :
(47) |
From (47) and that , we have
where we used the fact that in the above inequality. Since is absolutely continuous, the above implies that
The proof is thus finished by the induction process. ∎
Last, we give a lemma establishing the order of the right derivatives of the active and passive value functions under certain conditions, which further leads to a threshold structure of the optimal single-arm policy, as detailed in Section 2.3.
Lemma 3.
Suppose that $p_{11}\ge p_{01}$ and the discount factor satisfies (42). Then
$$\frac{\partial^{+} V_{m,T}(\omega;1)}{\partial\omega}\ \ge\ \frac{\partial^{+} V_{m,T}(\omega;0)}{\partial\omega},\qquad\forall\,\omega\in[0,1),\ T\ge 1, \qquad (48)$$
where $\partial^{+}/\partial\omega$ denotes the right derivative with respect to $\omega$. The above inequality is also true if $p_{11}<p_{01}$ and the discount factor satisfies a corresponding condition.
Proof.
Again, we prove by induction on the time horizon $T$. When $T=1$, it is clear that $V_{m,1}(\omega;1)=(1-\epsilon)\,\omega$ and $V_{m,1}(\omega;0)=m$:
$$\frac{\partial^{+}V_{m,1}(\omega;1)}{\partial\omega}=1-\epsilon\ \ge\ 0=\frac{\partial^{+}V_{m,1}(\omega;0)}{\partial\omega}. \qquad (49)$$
Assume that for . From (47), we have, in case of and , the inequality which shows that . When , is increasing with therefore has nonnegative right derivatives by Lemma 1. We can thus obtain tighter bounds on and by (38):
If , we have , which shows that . The proof is thus complete. ∎
2.3 Threshold Policy and Indexability
In this section, we show that the optimal single-arm policy is a threshold policy under the constraints on the discount factor specified in Section 2.2, and we analyze the conditions for indexability. First, for a finite-horizon single-armed bandit, a threshold policy is defined by a time-dependent real number $\omega^{*}(T)$ (the threshold with $T$ slots remaining) such that
$$a_{T}(\omega)=\begin{cases}1 \ (\text{activate the arm}), & \omega>\omega^{*}(T),\\ 0 \ (\text{make the arm passive}), & \omega\le\omega^{*}(T). \end{cases} \qquad (50)$$
In the above, $a_T(\omega)$ is the action taken under the threshold policy at the current belief state $\omega$ with $T$ slots remaining. Intuitively, the larger $\omega$ is, the larger the expected immediate reward to accrue and thus the more attractive it is to activate the arm. We formalize this intuition under certain conditions in the following theorem.
Theorem 1.
Suppose that $p_{11}\ge p_{01}$ and the discount factor satisfies (42). For any $T\ge 1$, the optimal single-arm policy with subsidy $m$ is a threshold policy, i.e., there exists $\omega^{*}(T)$ such that the optimal action is to activate the arm if $\omega>\omega^{*}(T)$ and to make it passive if $\omega\le\omega^{*}(T)$. Furthermore, at the threshold $\omega^{*}(T)$,
$$V_{m,T}\big(\omega^{*}(T);1\big)=V_{m,T}\big(\omega^{*}(T);0\big). \qquad (51)$$
The conclusion is also true for the case of $p_{11}<p_{01}$ under a corresponding condition on the discount factor.
Proof.
At $T=1$, $V_{m,1}(\omega;1)=(1-\epsilon)\,\omega$ and $V_{m,1}(\omega;0)=m$. Thus we can choose $\omega^{*}(1)$ as follows:
$$\omega^{*}(1)=\begin{cases} c_1<0, & m<0,\\ m/(1-\epsilon), & 0\le m\le 1-\epsilon,\\ c_2\ge 1, & m>1-\epsilon, \end{cases}$$
where $c_1$ and $c_2$ are arbitrary constants.
For $T\ge 2$, when the condition on the discount factor is satisfied, Lemma 3 shows that
(52) |
(53) |
This shows that $V_{m,T}(\omega;1)-V_{m,T}(\omega;0)$ is monotonically increasing in $\omega$ and either has no zeros in $[0,1]$ or its zero set is a closed subinterval of $[0,1]$ (which can be a single point). Specifically,
Consider the following three regions of $m$.
- (i) $0\le m\le 1-\epsilon$. In this case, $V_{m,T}(0;1)-V_{m,T}(0;0)=-m\le 0$ and $V_{m,T}(1;1)-V_{m,T}(1;0)=(1-\epsilon)-m\ge 0$. Therefore $V_{m,T}(\cdot;1)$ intersects $V_{m,T}(\cdot;0)$ at (at least) one point in $[0,1]$. This point can thus be chosen as $\omega^{*}(T)$.
- (ii) $m<0$. In this case, $V_{m,T}(0;1)-V_{m,T}(0;0)=-m>0$, and the difference is monotonically increasing. So the difference is strictly positive over $[0,1]$ and we can choose $\omega^{*}(T)$ to be any negative constant.
- (iii) $m>1-\epsilon$. In this case, always choosing the passive action is clearly optimal, as the expected immediate active reward is uniformly upper bounded by $m$ over the whole belief state space. We can thus choose $\omega^{*}(T)$ to be any constant greater than or equal to $1$.
In conclusion, when the conditions in the theorem are satisfied, the optimal finite-horizon single-arm policy is a threshold policy for any horizon length . ∎
In the next theorem, we show that the optimal single-arm policy over the infinite horizon is also a threshold policy under the same conditions.
Theorem 2.
Under the conditions of Theorem 1, the finite-horizon value functions $V_{m,T}(\omega)$, $V_{m,T}(\omega;1)$ and $V_{m,T}(\omega;0)$ converge uniformly to $V_m(\omega)$, $V_m(\omega;1)$ and $V_m(\omega;0)$ as $T\to\infty$, and the optimal single-arm policy over the infinite horizon is a threshold policy: there exists $\omega^{*}$ such that it is optimal to activate the arm if $\omega>\omega^{*}$ and to make it passive if $\omega\le\omega^{*}$.
Proof.
The uniform convergence follows since the per-slot reward is bounded and discounted by $\beta^{t}$; the threshold structure then follows from Theorem 1 and can be easily proved by contradiction using the uniform convergence. ∎
Thus far we have established the threshold structure of the optimal single-arm policy with subsidy, based on the analysis of $V_m(\omega)$ as a function of the belief state $\omega$ with $m$ fixed. To study the indexability condition, we now analyze the properties of $V_m(\omega)$ as a function of the subsidy $m$ with the starting belief $\omega$ fixed. From Definition 1 and the threshold structure of the optimal policy, the indexability of our model reduces to requiring that the threshold $\omega^{*}(m)$ be monotonically increasing in $m$ (if the threshold is a closed interval then its right end is selected). Note that for the infinite-horizon problem the threshold is independent of time. Furthermore, $V_m(\omega)$ is convex in $m$: for any $m=\lambda m_1+(1-\lambda)m_2$ with $\lambda\in[0,1]$, the optimal policy achieving $V_m(\omega)$, applied respectively to the problems with subsidies $m_1$ and $m_2$, cannot outperform the policies achieving $V_{m_1}(\omega)$ and $V_{m_2}(\omega)$. Specifically, let $R_a$ be the expected total discounted reward collected under the active actions and $R_p(m')$ that collected under the passive actions when the optimal policy for subsidy $m$ is applied to the problem with subsidy $m'$ (so that $R_p(m')$ is linear in $m'$); then
$$V_m(\omega)=R_a+R_p(m)=\lambda\big(R_a+R_p(m_1)\big)+(1-\lambda)\big(R_a+R_p(m_2)\big)\le \lambda\,V_{m_1}(\omega)+(1-\lambda)\,V_{m_2}(\omega).$$
Since $V_m(\omega)$ is convex in $m$, its left and right derivatives with respect to $m$ exist at every point. Furthermore, consider two policies achieving $V_{m_1}(\omega)$ and $V_{m_2}(\omega)$ for any $m_1,m_2$. By a similar interchange argument applied to these two policies, $V_m(\omega)$ is Lipschitz continuous in $m$. By the Rademacher theorem (see Heinonen, 2005), $V_m(\omega)$ is differentiable almost everywhere in $m$. For a small increase of $m$, the rate at which $V_m(\omega)$ increases is at least the expected total discounted passive time under any optimal policy for the problem with subsidy $m$ starting from the belief state $\omega$. In the following theorem, we formalize this relation between the value function and the passive time, as well as a sufficient condition for the indexability of our model.
Theorem 3.
Let $\Pi^{*}_{m,\omega}$ denote the set of all optimal single-arm policies achieving $V_m(\omega)$ with initial belief state $\omega$. Define the passive time
$$D_m(\omega):=\max_{\pi\in\Pi^{*}_{m,\omega}}\ \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty}\beta^{t}\,\mathbf{1}\{a(t)=0\}\ \Big|\ \omega(0)=\omega\right]. \qquad (54)$$
The right derivative of the value function with respect to $m$, denoted by $\partial^{+}V_m(\omega)/\partial m$, exists at every value of $m$ and
$$\frac{\partial^{+}V_m(\omega)}{\partial m}=D_m(\omega). \qquad (55)$$
Furthermore, the single-armed bandit is indexable if at least one of the following conditions is satisfied:
- i. for any $m$, the optimal policy is a threshold policy with threshold $\omega^{*}(m)$ (if the threshold is a closed interval then the right end is selected) and
$$\frac{\partial^{+}}{\partial m}\Big(V_m\big(\omega^{*}(m);0\big)-V_m\big(\omega^{*}(m);1\big)\Big)>0; \qquad (56)$$
- ii. for any $\omega\in[0,1]$ and any $m$, we have
$$\frac{\partial^{+}}{\partial m}\Big(V_m(\omega;0)-V_m(\omega;1)\Big)\ge 0. \qquad (57)$$
Proof.
The proof of (55) follows directly from the argument of Theorem 1 in Liu (2021) and is omitted here. To prove the sufficiency of (56), we note that if it holds then there exists a $\xi>0$ such that, for all $m'\in[m,m+\xi)$,
Since the difference $V_{m'}(\omega^{*}(m);0)-V_{m'}(\omega^{*}(m);1)$ is then nondecreasing in $m'$ over this range, the threshold remains in the passive set as $m$ continuously increases, so $\omega^{*}(m)$ is monotonically increasing in $m$. This conclusion is clearly true in the trivial cases where the passive set is empty or the whole interval $[0,1]$. The sufficiency of (57) is obvious, because then it is impossible for any $\omega$ to escape from $\mathcal{P}(m)$ as $m$ increases, due to the nondecreasing (in $m$) property of $V_m(\omega;0)-V_m(\omega;1)$ enforced by (57). ∎
Theorem 3 essentially provides a way of checking the indexability condition in terms of the passive times. For example, by (55) and the dynamic programming equation (20), condition (56) is equivalent to the following: for any $m$,
$$1+\beta\,D_m\big(\mathcal{T}(\omega^{*})\big)\ >\ \beta\Big[(1-\epsilon)\,\omega^{*}\,D_m(p_{11})+\delta\,(1-\omega^{*})\,D_m(p_{01})+\big(\epsilon\,\omega^{*}+(1-\delta)(1-\omega^{*})\big)\,D_m\big(\mathcal{T}(\Phi(\omega^{*}))\big)\Big], \qquad (58)$$
where $\omega^{*}=\omega^{*}(m)$.
The above strict inequality clearly holds if $\beta\le 1/2$, since $0\le D_m(\omega)\le 1/(1-\beta)$ for any $\omega$, so the right-hand side of (58) is at most $\beta/(1-\beta)\le 1$ while the left-hand side is at least $1$. When $\beta>1/2$, we prove by contradiction that the strict inequality (58) must hold under the threshold structure of the optimal policy. If the threshold $\omega^{*}(m)$ falls outside $(0,1)$ then (58) is clearly true. Assume then that the left and right sides of (58) are equal and $\omega^{*}(m)\in(0,1)$. In this case, we have
(59) | |||
(60) |
Equation (60) implies that, starting from the corresponding belief state, always activating the arm is strictly optimal. This means that the threshold is strictly below that belief state, and we have a contradiction to (59). Another, easier way to see that the bandit is indexable when $\beta\le 1/2$ is that (57) is then satisfied, where no strict inequality is required. However, condition (58) provides a convenient way of approximately computing the passive times as well as the value functions, which leads to an efficient algorithm for evaluating indexability and solving for the Whittle index function $W(\omega)$ for any $\omega$, as detailed in the next section.
Corollary 1.
The restless bandit is indexable if $\beta\le 1/2$.
3 The Whittle Index Policy
In this section, we design an efficient algorithm by approximating the Whittle index.
3.1 The Approximated Whittle Index
The threshold structure of the optimal single-arm policy under certain conditions yields the following iterative nature of the dynamic programming equations for both the active and passive value functions. Define the first crossing time
$$L(\omega,\omega^{*}):=\min\big\{l\ge 0:\ \mathcal{T}^{\,l}(\omega)>\omega^{*}\big\}. \qquad (61)$$
In the above $\mathcal{T}^{\,0}(\omega)=\omega$, and we set $L(\omega,\omega^{*})=\infty$ if $\mathcal{T}^{\,l}(\omega)\le\omega^{*}$ for all $l$. Clearly $L(\omega,\omega^{*})$ is the minimum number of time slots required for a belief state $\omega$ to stay in the passive set before the arm is activated, given a threshold $\omega^{*}$. Consider the nontrivial case where the transition probabilities lie strictly between $0$ and $1$ with $p_{11}\neq p_{01}$ and $\epsilon>0$, such that the Markov chain of the internal arm states is aperiodic and irreducible and the belief update is action-dependent. From (6), if $p_{11}>p_{01}$ then
(62) |
or, if $p_{11}<p_{01}$, then
(63) |
To illustrate (62) and (63), note that under passive actions a belief state converges to the stationary belief value given by (7), monotonically when $p_{11}>p_{01}$ and in an oscillating manner when $p_{11}<p_{01}$ (see Fig. 3 and Fig. 4 in Liu and Zhao (2010)). Suppose that the following conditions are satisfied, so that the optimal single-arm policy is a threshold policy and indexability holds:
(64) |
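Whatever the sign of $p_{11}-p_{01}$, the first crossing time in (61) can be computed numerically by simply iterating the passive update until the belief exceeds the threshold, which covers both the monotone and the oscillating case without branching on the closed forms (62)–(63). A minimal sketch (reusing `passive_update` from the earlier listing; the iteration cap is an arbitrary safeguard):

```python
def first_crossing_time(omega, threshold, p11, p01, max_steps=10_000):
    """Minimum number of passive slots before the belief exceeds the threshold,
    as in (61); returns None when the belief never crosses within the cap
    (treated as an infinite crossing time)."""
    l, w = 0, omega
    while w <= threshold:
        if l >= max_steps:
            return None
        w = passive_update(w, p11, p01)   # one passive update T(w)
        l += 1
    return l
```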
To solve for the Whittle index function $W(\omega)$, given the current arm state $\omega$, we aim to find the minimum subsidy $m$ that makes $\omega$ a threshold, i.e., the minimum $m$ for which
$$V_{m}(\omega;1)=V_{m}(\omega;0), \qquad (65)$$
$$V_{m}(\omega;1)=(1-\epsilon)\,\omega+\beta\Big[(1-\epsilon)\,\omega\,V_m(p_{11})+\delta\,(1-\omega)\,V_m(p_{01})+\big(\epsilon\omega+(1-\delta)(1-\omega)\big)\,V_m\big(\mathcal{T}(\Phi(\omega))\big)\Big], \qquad (66)$$
$$V_{m}(\omega;0)=m+\beta\,V_m\big(\mathcal{T}(\omega)\big). \qquad (67)$$
Given a threshold $\omega^{*}$ and any belief state $\omega$, the value function under the threshold policy can be expanded by the first crossing time as
$$V_m(\omega)=\frac{1-\beta^{\,L(\omega,\omega^{*})}}{1-\beta}\,m+\beta^{\,L(\omega,\omega^{*})}\,V_m\big(\mathcal{T}^{\,L(\omega,\omega^{*})}(\omega);1\big). \qquad (68)$$
The last term of the above equation is what makes solving for the value function difficult. However, if we let
(69) |
and construct the corresponding sequence of belief states iteratively as in (69), we then obtain the following sequence of equations:
For a sufficiently large iteration number, we can estimate the value function with an arbitrarily small error by truncating the above sequence of equations; the truncation error is discounted by $\beta$ at each backward step, causing a geometrically decreasing error propagation in the backward computation. Therefore we can obtain an estimate of the value function with arbitrarily high precision for any belief state. Interestingly, extensive numerical results show that the iterated belief states quickly converge to a limit belief state, independent of the starting point (see Fig. 2 for an example). Specifically, after a few iterations the difference between successive iterates becomes too small to affect the performance of our algorithm, as discussed in Sec. 5.1. We can therefore close the system at the limit point and efficiently solve the resulting finite set of linear equations. In general, the approximated Whittle index with a given iteration number is based on the solution of the following equations:
After the value functions are (approximately) solved, we can plot the active value function and the passive one and observe that they intersect at a single point (the threshold), verifying the optimality of the threshold policy proven in Theorem 2 (see Fig. 2 for an example). According to Theorem 3, the passive time can also be approximately solved based on the following equations:
Substituting the approximately solved value functions into the above linear equations, we can obtain the passive times and hence the quantities needed in (58). The indexability condition (56) in Theorem 3 can then be checked online: for the original multi-armed bandit problem, for each arm at its current belief state at time $t$, we compute its approximated Whittle index by solving a set of linear equations, which has a polynomial complexity in the iteration number and is independent of the decision time $t$. At each time $t$, for each arm, if the approximated Whittle index is found to be nondecreasing over the belief states encountered so far starting from the initial belief vector defined in (2), then indexability has not been violated. Interestingly, extensive numerical studies show that indexability is always satisfied, as illustrated by the index plots in Sec. 5.1.
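As an independent cross-check of the approximated index, one can locate the subsidy at which the active and passive values at $\omega$ coincide by bisection, evaluating the value functions with the grid-based `value_iteration` baseline sketched in Sec. 2.2. This is a brute-force reference, not the finite linear-equation scheme described above, and it presumes the monotonicity in $m$ implied by (relaxed) indexability.

```python
def approx_whittle_index(omega, p11, p01, eps, delta, beta,
                         lo=-25.0, hi=25.0, iters=40, grid=1001):
    """Bisection on the subsidy m for the point where the active and passive
    values at belief `omega` coincide; a brute-force cross-check that reuses
    value_iteration(), not the paper's finite-linear-equation scheme."""
    def gap(m):
        omegas, _, V_a, V_p = value_iteration(p11, p01, eps, delta, beta, m,
                                              grid=grid)
        i = int(round(omega * (grid - 1)))
        return V_a[i] - V_p[i]          # > 0: active preferred; < 0: passive
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid                    # subsidy too small: still active
        else:
            hi = mid
    return 0.5 * (lo + hi)
```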
For large discount factors, where the threshold structure of the optimal policy or indexability may not hold (i.e., condition (64) is not satisfied), we can still use the above process to solve for the subsidy $m$ that makes (65) true, if it exists. Note that after computing the value functions appearing in (65) in terms of $m$, both sides of (65) are linear (affine) functions of $m$, and setting them equal gives a unique solution for $m$ whenever their linear coefficients differ. This $m$, if it exists, can thus be used as the approximated Whittle index without requiring indexability or a threshold-based optimal policy. If it does not exist, the approximated index can simply be set to a default value. The existence of such an $m$ is defined as the relaxed indexability in Liu (2021). Extensive numerical studies show that the relaxed indexability of our model with imperfect state observations is always satisfied as well. Before summarizing our general algorithm in Section 3.2, we solve for the approximated Whittle index function in closed form for the simplest case, which is referred to as the imperfect Whittle index. Note that if $\epsilon$ is small then $\Phi(\omega)$ is close to $0$. Thus when $\epsilon$ is sufficiently small, we can approximate the post-observation belief $\mathcal{T}(\Phi(\omega))$ by $p_{01}$. Under this approximation, we have, for any $\omega$,
By using the above three equations, we can directly solve for and in closed-form.
When ,
The approximate Whittle index is given by
where
Similarly, when , we have
The approximate Whittle index is given by
where
3.2 Algorithm
Our analysis leads to Algorithm 1 below for the RMAB model with imperfect observations, applicable to all system parameters.
Input: transition probabilities $\{p^{(i)}_{11},\,p^{(i)}_{01}\}$, observation error probabilities $\{\epsilon_i,\,\delta_i\}$, rewards $\{B_i\}$, discount factor $\beta$, iteration number
Input: initial belief state $\Omega(0)$, number of arms $N$, number $k$ of arms to activate per slot, time horizon
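A compact sketch of how Algorithm 1's pieces fit together at run time is given below; it reuses `passive_update`, `observed_update` and (optionally) `approx_whittle_index` from the earlier listings, simulates the observations, and is illustrative wiring rather than the authors' implementation.

```python
import numpy as np

def whittle_policy_step(beliefs, params, k, index_fn, rng=np.random.default_rng()):
    """One decision step of the (approximated) Whittle index policy:
    compute an index for each arm, activate the k arms with the largest
    indices, simulate the observations, and update every belief with the
    update functions sketched in Sec. 2.1. params[i] = (p11, p01, eps, delta);
    index_fn(i, w) is any per-arm index, e.g. approx_whittle_index wrapped
    with arm i's parameters (hypothetical wiring, for illustration only)."""
    indices = [index_fn(i, w) for i, w in enumerate(beliefs)]
    chosen = set(np.argsort(indices)[-k:].tolist())
    new_beliefs, reward = list(beliefs), 0.0
    for i, w in enumerate(beliefs):
        p11, p01, eps, delta = params[i]
        if i in chosen:
            state = rng.random() < w                       # simulate S_i(t) = 1
            obs = (rng.random() >= eps) if state else (rng.random() < delta)
            ack = 1 if (obs and state) else 0
            reward += ack                                  # unit reward B_i = 1
            new_beliefs[i] = observed_update(w, p11, p01, eps, delta,
                                             int(obs), ack)
        else:
            new_beliefs[i] = passive_update(w, p11, p01)
    return new_beliefs, reward, chosen
```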
4 Optimality for Homogeneous Systems
A space-wise homogeneous restless bandit is a system of stochastically identical arms, i.e., the parameters $p_{11}$, $p_{01}$, $\epsilon$, $\delta$ and $B$ do not depend on the arm index $i$. In this case, our algorithm is equivalent to the myopic policy that activates the $k$ arms with the largest belief values, and, under the conditions below, it is optimal.
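For such homogeneous systems the decision rule collapses to a one-line sort, as sketched below.

```python
import numpy as np

def myopic_choice(beliefs, k):
    """Myopic rule for stochastically identical arms: activate the k arms
    with the largest belief values (ties broken arbitrarily)."""
    return np.argsort(beliefs)[-k:]
```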
Theorem 4.
Consider a space-wise homogeneous model with positively correlated arms ($p_{11}\ge p_{01}$) whose observation error probabilities satisfy
(70) |
the myopic policy is optimal over both finite and infinite horizons.
Proof.
We adopt notation similar to that in Liu et al. (2011) for the case of perfect observation, but several non-trivial modifications are needed due to the additional complexity introduced by observation errors. Consider $N$ arms in total, of which $k$ are activated at each step. Consider the expected total discounted reward over $T$ steps when the arms are listed so that the probabilities that the underlying processes are in state 1 are $\omega_1,\ldots,\omega_N$. In Liu et al. (2010), it was proved that the myopic policy has a dynamic queuing structure if the error probabilities satisfy (70). Then we have
where the expectation is taken over the possible outcomes that can occur when the arms observed are those at the left end of the list (the $k$ arms whose underlying processes are in state 1 with the largest probabilities). We describe this more specifically. For a belief state sequence, we call two subsequences a partition of it if they satisfy:
-
(i)
is a rearrangement of ;
-
(ii)
and .
Let be the collection of all partitions of , the expectation above can be written as follows:
We can see that when , is the value function for the myopic policy.
To prove the theorem, we first prove that , is linear in . We will prove it by induction. It is obvious that are linear in . Assume it is true for , , where and are constants independent of . Now consider . When
In this case, probability terms in expectation are only related to , is linear in and is linear in , thus is also linear in .
In the other case, for any partition of the belief sequence, if the first condition holds, the corresponding term in the expectation is linear in the exchanged beliefs; otherwise, by the inductive hypothesis there exist coefficients such that
The equation above shows that is linear in , thus the proposition is proved.
From the above, we can assume the stated linear form with constant coefficients. If we swap the positions of two belief values and take the difference of the two expressions, we have
Next we will prove two important properties of the value function. We let an omitted block denote any sequence of belief values, possibly empty. We again use induction to prove the next two properties:
(A)
(B)
These are clearly true for the base case. We will begin by proving an induction step for (B). As above, the expression in (B) is equal to
Suppose the position exchange occurs in the $i$th and $j$th places. In the first case, by the inductive hypothesis, for some coefficients (which are stochastically determined by the observations from the top arms in the queue),
The interesting case is . In this case,
where the first inequality follows from the inductive hypothesis for (B) and the second follows from (A).
Next we will prove (A) by induction. Suppose that the position exchange occurs within the two expressions in the $i$th and $j$th places. In the first case, similarly, for some coefficients (which depend on the observations),
If , we have
The interesting case is . Let , where and represent states in total. Then
The above expression is a function of the two exchanged belief values that is affine in each of them. To prove that it is nonnegative for all values, we just need to check that it is nonnegative at the extreme points.
If , then
If , then
If , then
If , then
Thus (A) is true. In fact, (B) shows that the myopic policy is optimal over finite horizons. By contradiction, it is then easy to show that the myopic policy also maximizes the expected total discounted reward and the expected average reward over the infinite horizon. Furthermore, our proof does not depend on the time-homogeneity of the system, so the optimality result holds even if the system parameters are time-varying, as long as the arms remain positively correlated and the error probabilities satisfy (70). ∎
5 Numerical Analysis and Conclusion
In this section, we illustrate the near-optimality and efficiency of our approximated Whittle index policy for non-homogeneous arms through simulation examples. After the discussions and illustrations on these numerical results, we conclude this paper and propose several future research directions on relevant problems.
5.1 Numerical Examples
We first show that the four-iteration approximation algorithm is sufficient to yield the same performance as the exact Whittle index policy, and we use it to plot the approximated Whittle index function for two sets of system parameters. Note that the monotonically increasing property of the plotted index function numerically supports indexability, while its nonlinearity in the belief state illustrates its difference from the myopic policy (whose index is a linear function of the belief state).
We now compare the performance of the Whittle index policy with the optimal policy, which is computed by dynamic programming over a finite horizon of length $T$. In other words, we recursively apply the following equation with the terminal value function set to zero:
$$V_{T}(\Omega)=\max_{A\subseteq\{1,\ldots,N\},\ |A|=k}\ \mathbb{E}\!\left[\sum_{i\in A} B_i\,\mathbf{1}\{S_i=1,\ O_i=1\}+\beta\,V_{T-1}(\Omega')\ \Big|\ \Omega\right], \qquad (71)$$
where $\Omega'$ is the next belief vector obtained from (4) and (5), and the per-arm reward and belief-transition terms take the forms given in (25) and (26). Clearly, the number of reachable belief states grows exponentially with both the number of arms (as the arms are not decoupled by any relaxation) and the time horizon, due to the tree-expansion type of belief update in (4). In contrast, our approximated Whittle index policy has complexity linear in both the time horizon and the number of arms. In Figs. 5 and 6, we compare the real running times of the optimal policy and our algorithm to illustrate the efficiency of the latter. Note that the optimal policy for any finite time horizon $T$ provides an upper bound on the total discounted reward accrued over the first $T$ slots by the infinite-horizon optimal policy, because the infinite-horizon optimal policy can always be applied to the finite-horizon problem up to time $T$. Hence the near-optimality of our algorithm is well demonstrated by the comparison with the finite-horizon optimal policy over the first $T$ steps, as shown in Figures 8-22 (see Table 1 and Table 2 for the system parameters). Furthermore, in all numerical experiments with randomly generated system parameters, setting the iteration number to four was sufficient for the Whittle indices of all arms to converge, in the sense that their ranking remains unchanged as the iteration number increases at each time step. In other words, four iterations make the approximated Whittle index follow the same action path as the exact Whittle index, leading to the same performance. When the iteration number is smaller, the approximation error grows and causes more performance loss, as shown in Fig. 23.
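The exponential blow-up is easy to see from a direct implementation of the finite-horizon benchmark. The sketch below enumerates all size-$k$ subsets and all joint observation outcomes of the chosen arms at every step (reusing `passive_update` from Sec. 2.1); it is illustrative brute force, feasible only for a few arms and short horizons, and is not the implementation used for the reported experiments.

```python
from itertools import combinations, product

def optimal_value(beliefs, params, k, beta, horizon):
    """Finite-horizon benchmark in the spirit of (71): exhaustive search over
    actions and joint observation outcomes. params[i] = (p11, p01, eps, delta);
    the reward of each arm is taken to be 1."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    arms = range(len(beliefs))
    for A in combinations(arms, k):
        # expected immediate reward: arm i pays off iff S_i = 1 and O_i = 1
        value = sum((1 - params[i][2]) * beliefs[i] for i in A)
        # joint outcomes per chosen arm: 0 = O=1 & ACK, 1 = O=1 & no ACK, 2 = O=0
        for outcome in product(range(3), repeat=k):
            prob, nxt = 1.0, list(beliefs)
            for i, o in zip(A, outcome):
                w = beliefs[i]
                p11, p01, eps, delta = params[i]
                if o == 0:
                    prob *= w * (1 - eps)
                    nxt[i] = p11
                elif o == 1:
                    prob *= (1 - w) * delta
                    nxt[i] = p01
                else:
                    prob *= w * eps + (1 - w) * (1 - delta)
                    post = eps * w / (eps * w + (1 - delta) * (1 - w) + 1e-12)
                    nxt[i] = passive_update(post, p11, p01)
            if prob == 0.0:
                continue
            for j in arms:
                if j not in A:
                    nxt[j] = passive_update(beliefs[j], *params[j][:2])
            value += prob * beta * optimal_value(nxt, params, k, beta, horizon - 1)
        if value > best:
            best = value
    return best
```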
We also illustrate, for comparison, the performance of the myopic policy that activates the $k$ arms with the largest expected immediate rewards. From Figures 8-22, we observe that Algorithm 1 outperforms the myopic policy. Interestingly, the Whittle index policy may suffer some performance loss in the middle of the horizon but eventually catches up with the optimal policy as time goes on. This is consistent with the conjecture that the Whittle index policy is asymptotically optimal as time goes to infinity, since from the perspective of large deviation theory the Lagrangian relaxation should not fundamentally alter the state and action paths of the optimal policy for the original problem (Weber and Weiss, 1990). In contrast, the myopic policy is unable to follow the optimal action path and never catches up. Since the myopic policy only maximizes the immediate reward, it is optimal precisely when the state transitions do not matter, e.g., over the very first step. The myopic policy certainly has the lowest complexity, but this advantage is negligible given its significant performance loss and the efficiency of our algorithm relative to the optimal policy, as shown in Figs. 5 and 6.
5.2 Conclusion and Future Work
In this paper, we proposed a low-complexity algorithm based on the idea of value function approximation arising in solving for the Whittle index policy of a class of RMAB with an infinite state space and an imperfect observation model. By exploring and exploiting the rich mathematical structure of this RMAB model, our algorithm is designed to be implemented online and to control the approximation error so that it becomes equivalent to the original Whittle index policy. Extensive numerical examples showed that our algorithm achieves near-optimal performance with a complexity linear in the key system parameters, such as the time horizon and the number of arms. From Figures 8-22, we observe that in some instances the four-iteration policy is closer to optimality than in others. Unfortunately, it is still unknown how to theoretically quantify the gap between the Whittle index policy and the optimal policy in the finite regime: only asymptotic results have been obtained for restless bandits with finite state spaces, as both the number of arms and the time horizon go to infinity under the time-average reward criterion and under conditions verified only for bandits with a small number of states (Weber and Weiss, 1990, 1991).
Future work includes the theoretical study of the performance loss of the Whittle index policy through more in-depth analysis of the value functions, in order to further improve our algorithm. Another research direction is to consider more complex system models, such as non-Markovian state models (see, e.g., Liu et al., 2011) or high-dimensional state models (see, e.g., Liu, 2021), and to study the convergence patterns of the state path in order to approximate the value functions. Future work also includes generalizing the construction of a finite set of linear equations for solving dynamic programming problems, e.g., the simplification of nonlinearity by the threshold-based first crossing time, the error-control process by backward induction, and further complexity reduction by minimizing the number of linear equations required.

[Figures: running-time comparisons and performance comparisons of the optimal policy, the approximated Whittle index policy, and the myopic policy; see Tables 1 and 2 for the corresponding system parameters.]
Table 1: System parameters (transition and observation probabilities) for System-1 through System-4, referenced by the examples in Table 2.
Table 2: Example settings and whether the threshold-policy and indexability conditions are met.

System | Example | | | Meet threshold conditions? | Meet indexability conditions?
System-1 | 1 | 0.3 | 0.999 | yes | no
System-1 | 2 | 0.1 | 0.999 | yes | no
System-2 | 3 | 0.3 | 0.29 | yes | yes
System-2 | 4 | 0.1 | 0.29 | yes | yes
System-2 | 5 | 0.3 | 0.48 | no | yes
System-2 | 6 | 0.1 | 0.48 | no | yes
System-3 | 7 | 0.3 | 0.69 | yes | no
System-3 | 8 | 0.1 | 0.69 | yes | no
System-3 | 9 | 0.3 | 0.48 | yes | yes
System-3 | 10 | 0.1 | 0.48 | yes | yes
System-4 | 11 | 0.3 | 0.29 | yes | yes
System-4 | 12 | 0.1 | 0.29 | yes | yes
System-4 | 13 | 0.3 | 0.48 | no | yes
System-4 | 14 | 0.1 | 0.48 | no | yes
System-4 | 15 | 0.3 | 0.999 | no | no
System-4 | 16 | 0.1 | 0.999 | no | no
Declarations
- Funding: Not applicable
- Conflict of interest/Competing interests: Not applicable
- Ethics approval: Not applicable
- Consent to participate: Not applicable
- Consent for publication: Yes
- Availability of data and materials: Available upon request
- Code availability: Available upon request
- Authors' contributions: Keqin Liu constructed the proof sketch for each theorem and the main algorithm and contributed to the writing of the paper. Richard Weber outlined the proof strategy for the optimality of the myopic policy in homogeneous systems and contributed to the verification and writing of the paper. Chengzhong Zhang filled out the details of the proofs and conducted the numerical simulations.
References
- Brown and Smith (2020) Brown DB, Smith JE (2020) Index policies and performance bounds for dynamic selection problems. Manage Sci 66(7):3029–3050.
- Chen et al. (2021) Chen M, Wu K, Song L (2021) A Whittle index approach to minimizing age of multi-packet information in IoT network. IEEE Access 9: 31467 - 31480.
- Gast et al. (2023) Gast N, Gaujal B, Khun K (2023) Testing indexability and computing Whittle and Gittins index in subcubic time. Math Meth Oper Res 97:391–436.
- Gast et al. (2021) Gast N, Gaujal B, Yan C (2021, working paper) (Close to) Optimal policies for finite horizon restless bandits. https://hal.inria.fr/hal-03262307/file/LP_paper.pdf.
- Gittins (1979) Gittins JC (1979) Bandit processes and dynamic allocation indices. J R Stat Soc 41(2):148–177.
- Gittins et al. (2011) Gittins JC, Glazebrook KD, Weber RR (2011) Multi-Armed Bandit Allocation Indices. Wiley, Chichester.
- Hu and Frazier (2017) Hu W, Frazier PI (2017, working paper) An asymptotically optimal index policy for finite-horizon restless bandits. https://arxiv.org/abs/1707.00205.
- Heinonen (2005) Heinonen J (2005) Lectures on Lipschitz analysis. www.math.jyu.fi/research/reports/rep100.pdf.
- Kesav et al. (2019) Kesav K, Rahul M, Varun M, Shabbir NM (2019) Sequential decision making with limited observation capability: Application to wireless networks. IEEE Trans Cognit Commun Networking 5(2):237–251.
- Levy (2008) Levy BC (2008) Principles of Signal Detection and Parameter Estimation. Springer, Verlag.
- Liu (2021) Liu K (2021) Index policy for a class of partially observable Markov decision processes. https://arxiv.org/abs/2107.11939.
- Liu et al. (2011) Liu K, Weber RR, Zhao Q (2011) Indexability and Whittle index for restless bandit problems involving reset processes. Proc. of the 50th IEEE Conference on Decision and Control 7690–7696.
- Liu and Zhao (2010) Liu K, Zhao Q (2010) Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Trans Inf Theory 56(11):5547–5567.
- Liu et al. (2010) Liu K, Zhao Q, Krishnamachari B (2010) Dynamic multichannel access with imperfect channel state detection. IEEE Trans Signal Process 2795–2808.
- Papadimitriou and Tsitsiklis (1999) Papadimitriou CH, Tsitsiklis JN (1999) The complexity of optimal queueing network control. Math Oper Res 24(2):293–305.
- Rahul et al. (2018) Rahul M, D M, Aditya G (2018) On the Whittle index for restless multiarmed hidden Markov bandits. IEEE Trans Autom Control 63(9):3046–3053.
- Sondik (1978) Sondik EJ (1978) The optimal control of partially observable Markov processes over the infinite horizon: discounted costs. Oper Res 26(2):282–304.
- Varun et al. (2018) Varun M, Rahul M, Kesav K, Shabbir NM, Uday BD (2018) Rested and restless bandits with constrained arms and hidden states: Applications in social networks and 5G networks. IEEE Access 6:56782–56799.
- Wang et al. (2018) Wang K, Chen L, Yu J, Win M (2018) Opportunistic multichannel access with imperfect observation: A fixed point analysis on indexability and index-based policy. Proceedings of IEEE INFOCOM.
- Weber and Weiss (1990) Weber RR, Weiss G (1990) On an index policy for restless bandits. J Appl Probab 27:637–648.
- Weber and Weiss (1991) Weber RR, Weiss G (1991) Addendum to ‘On an index policy for restless bandits’. Adv Appl Prob 23:429–430.
- Whittle (1988) Whittle P (1988) Restless bandits: Activity allocation in a changing world. J Appl Probab 25:287–298.
- Zayas-Cabán et al. (2019) Zayas-Cabán G, Jasin S, Wang G (2019) An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits. Adv Appl Probab 51:745–772.