Two families of indexable partially observable restless bandits and Whittle index computation
This research was funded in part by Fonds de Recherche du Quebec – Nature et technologies (FRQNT).
Abstract
We consider restless bandits with a general finite state space under partial observability, with two observation models: first, the state of each bandit is not observable at all, and second, the state of each bandit is observable when it is selected. Under the assumption that the models satisfy a restart property, we prove that both models are indexable. For the first model, we derive a closed-form expression for the Whittle index. For the second model, we propose an efficient algorithm to compute the Whittle index by exploiting the qualitative properties of the optimal policy. We present detailed numerical experiments for multiple instances of a machine maintenance problem. The results indicate that the Whittle index policy outperforms the myopic policy and can be close to optimal in different setups.
keywords:
Multi-armed bandits; Restless bandits; Whittle index; indexability; partially observable; scheduling; resource allocation.

1 Introduction
Resource allocation and scheduling problems arise in various applications including telecommunication networks, sensor management, patient prioritization, and machine maintenance. Restless bandits is a widely-used solution framework for such models [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
Identifying the optimal policy in restless bandits suffers from the curse of dimensionality because the state space is exponential in the number of alternatives [16]. To circumvent the curse of dimensionality, Whittle proposed an index heuristic which has a linear complexity in the number of alternatives [17]. The resulting policy, called the Whittle index policy, operates as follows: assign an index (called the Whittle index) to each state of each arm (or alternative) and then, at each time, play the arms in states with the highest indices.
The Whittle index policy is attractive for two reasons. First, it is a scalable heuristic because its complexity is linear in the number of arms. Second, although it is a heuristic, there are certain settings where it is optimal [18, 19, 20, 21] and, in general, it performs close to optimal in many instances [22, 23, 10, 24, 25, 26].
Nonetheless, there are two challenges in using the Whittle index policy. First, the Whittle index heuristic is applicable only when a technical condition known as indexability is satisfied. There is no general test for indexability, and the existing sufficient conditions are only applicable for specific models [23, 10, 24, 27, 28, 25, 29, 30]. Second, while there are closed-form expressions to compute the Whittle index in some instances [10, 24, 29, 5, 6, 4, 3, 26, 31], in general, the Whittle index policy has to be computed numerically. For a subclass of restless bandits which satisfy an additional technical condition known as PCL (partial conservation law), the Whittle index can be computed using an algorithm called the adaptive greedy algorithm [32, 22]. Recently, [31, 33] presented generalizations of adaptive greedy algorithm which are applicable to all indexable restless bandits.
We are interested in resource allocation and scheduling problems where the state of each arm is not fully-observed. Such partially observable restless bandit models are conceptually and computationally more challenging. The sufficient conditions for indexability that are derived for fully-observed bandits [17, 19, 10, 24, 29, 31, 34] are not directly applicable to the partially observable setting. The existing literature on partially observable restless bandits often restricts attention to models where each arm has two states [2, 3, 4, 11, 9, 13, 1, 5, 35]. In some cases, it is also assumed that the two states are positively correlated [4, 3, 5]; in others, it is assumed that the state dynamics are independent of the chosen action [6, 36, 37]. There are very few results for general finite state space models under partial observability [12, 7, 6, 36, 37], and, for such models, indexability is often verified numerically. In addition, there are very few algorithms to compute the Whittle index for such models.
Recently, alternative index-based heuristics for partially observable restless bandits have been proposed [38], but we restrict attention to the Whittle index policy in this paper.
The main contributions of our paper are as follows:
1. We investigate partially observable restless bandits with general finite state spaces and consider two observation models, which we call model A and model B. We show that both models are indexable.
2. For model A, we provide a closed-form expression to compute the Whittle index. For model B, we provide a refinement of the adaptive greedy algorithm of [31] to efficiently compute the Whittle index.
3. We present a detailed numerical study which illustrates that the Whittle index policy performs close to optimal for small scale systems and outperforms a commonly used heuristic (the myopic policy) for large-scale systems.
The organization of the paper is as follows. In Section 2, we formulate the restless bandit problem under partial observations for two different models. Then, we define a belief state by which the partially-observable problem can be converted into a fully-observable one. In Section 3, we present a short overview of restless bandits. In Section 4, we show that the restless bandit problem is indexable for both models and present a general formula to compute the index. In Section 5, we present a countable state representation of the belief state and use it to develop methods to compute the Whittle index. In Section 6, we present the proofs of the results. In Section 7, we present a detailed numerical study which compares the performance of the Whittle index policy with two baseline policies. Finally, we conclude in Section 8.
1.1 Notations and Definitions
We use uppercase letters to denote random variables, the corresponding lowercase letters to denote their realizations, and the corresponding calligraphic letters to denote the sets of realizations. Superscripts index arms and subscripts index time, e.g., denotes the state of arm at time . The subscript denotes the history of the variable from time to , e.g., denotes . Bold letters denote the vector of a variable for all arms, e.g., denotes . Given a collection of real numbers , we use to denote their product . Given a collection of sets , we use to denote their Cartesian product . For a finite set , denotes the set of probability mass functions (PMFs) on .
We use as the indicator function, as the expectation operator, as the probability function, as the set of real numbers, as the set of integers and as the set of integers that are not lower than .
Given ordered sets $\mathcal{X}$ and $\mathcal{Y}$, a function $f \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is called submodular if for any $x_1, x_2 \in \mathcal{X}$ and $y_1, y_2 \in \mathcal{Y}$ such that $x_1 \le x_2$ and $y_1 \le y_2$, we have $f(x_1, y_1) + f(x_2, y_2) \le f(x_1, y_2) + f(x_2, y_1)$. Furthermore, a transition probability matrix $P$ is stochastic monotone if for any states $x_1, x_2$ such that $x_1 \le x_2$, we have $\sum_{z \ge y} P(x_1, z) \le \sum_{z \ge y} P(x_2, z)$ for any $y$. For any function $V$, $\operatorname{sp}(V)$ denotes the span semi-norm of $V$, i.e., $\operatorname{sp}(V) = \max_x V(x) - \min_x V(x)$.
2 Model and Problem Formulation
2.1 Restless Bandit Process with restart
A discrete-time restless bandit process (or arm) is a controlled Markov process where denotes the finite set of states; denotes the action space where the action is called the passive action and the action is the active action; , , denotes the transition matrix when action is chosen; denotes the cost function; denotes the initial state distribution; denotes the finite set of observations.
Assumption 1 (Restart property)
All rows of the transition matrix under the active action are identical.
For models which satisfy Assumption 1, we denote by and denote each (identical) row of by . The term restart property is used following the terminology of [26], where was a PMF on the state space (i.e., on taking the active action, the state resets according to PMF ). Note that [26] considered fully observed models, while we consider partially observable setups.
An operator has to select arms at each time but does not observe the state of the arms. We consider two observation models.
1. Model A: In model A, the operator does not observe anything. We denote this by , where denotes a blank symbol.
2. Model B: In model B, the operator observes the state of the arm after it has been reset, i.e.,
(1)
For model A, and for model B, .
2.2 Partially-observable Restless Multi-armed Bandit Problem
A partially-observable restless multi-armed bandit (PO-RMAB) problem is a collection of independent restless bandits , .
Let , , and denote the combined state, action, and observation spaces, respectively. Also, let , , and denote the combined states, actions taken, and observations made by the operator at time . Due to the independent evolution of each arm, for each realization of and of , we have
Let denote the initial state distribution of all arms.
When the system is in state and action is taken, the system incurs a cost . The action at time is chosen according to
(2) |
where is a history-dependent policy at time . Let denote the policy for infinite time horizon and let denote the family of all such policies. Then, the performance of policy is given by
(3) |
where denotes the discount factor.
Formally, the optimization problem of interest is as follows:
Problem 1
Given a discount factor , the total number of arms, the number to be selected, the system model of each arm, and the observation model at the operator, choose a history-dependent policy that minimizes given by (3).
Some remarks
1. Problem 1 is a POMDP and the standard methodology to solve POMDPs is to convert them to a fully observable Markov decision process (MDP) by viewing the “belief state” as the information state of the system [39]. We present such a belief state representation in Sec. 2.4 and point out its limitations in the context of restless bandits.
2. In Problem 1, the objective depends on the initial state distribution . This can give the impression that the optimal policy may depend on the initial distribution. It is well known in the MDP literature that there exist policies that are optimal for all initial distributions [40]. However, our results rely on translating the belief state representation of the POMDP into a countable state MDP formulation and such a transformation is valid only when the initial state distribution is of a specific form. See Sec. 5 for details. Our results do not depend on the specific choice of the initial distribution, as long as it satisfies Assumption 2 specified in Sec. 5.
3. In Problem 1, we consider the normalized expected discounted cost as the objective, where the discounted cost is multiplied by . In the MDP literature, one typically considers unnormalized objectives. However, the normalized objective is commonly used in constrained MDPs [41] and has also been used in some of the previous literature on restless bandits [31]. Note that multiplying the objective by a constant does not change the optimal policy. The reason we use a normalized expected discounted cost is that it simplifies the description of the adaptive greedy algorithms to compute the Whittle index presented in Sec. 5.
2.3 Some Examples
In this section, we present some examples corresponding to the model presented above.
Example 1
Consider a sensor network where there are sensors, each observing an independent Markov process. We assume that the state of each Markov process is integer valued and evolves in an auto-regressive manner: , where are i.i.d. processes which are also independent across the sensors. An estimator can observe only , where , sensors at each time. If a sensor is observed, the state of the Markov process at that sensor is revealed to the estimator. If a sensor is not observed, the estimator gets no new information about its state and has to estimate the state based on previous observations. The objective is to determine a sensor scheduling policy to decide which sensors to observe at each time.
In this case, it can be shown that when the noise processes have symmetric and unimodal distributions, the optimal estimation strategy is a Kalman-filter like strategy, i.e., the optimal estimate is when the Markov process is transmitted, and is equal to the previous estimate when the Markov process is not transmitted [42]. Thus, the error process has a restart property [43]. An instance of such a sensor network problem was considered in [44].
Example 2
Consider a maintenance company monitoring machines which deteriorate independently over time. Each machine has multiple deterioration states, sorted from pristine to ruined. There is a cost associated with running the machine, and the cost is a non-decreasing function of the state. If a machine is left unmonitored, its state deteriorates and, after a while, the machine is ruined. However, the state of the machine is not observed.
Furthermore, it is assumed that the company cannot observe the state of a machine unless it sends a service-person to visit it. Replacing a machine is relatively inexpensive, and when a service-person visits a machine, they simply replace it with a new one. Due to manufacturing defects, the machines may not all be in the pristine state when installed. If the service-person can observe the state of the machine when installing a new one, the observation model is the same as model B; otherwise, it is model A. There are , where , service-persons. The objective is to determine a scheduling policy to decide which machines should be serviced at each time. An instance of such a machine maintenance problem, in the context of maintaining demand response devices, was considered in [9].
Example 3
Consider the problem of resource constrained health intervention delivery, where a community health center is monitoring patients to check if they are adhering to the prescribed medication [35]. Each patient has a binary state of “Adhering” or “Not Adhering”, which is hidden. There are , where , health workers, and if a health worker visits a patient, the state of the patient is observed. Moreover, it is assumed that after the visit by a health worker, the patient goes into the “Adhering” state. The objective is to determine a policy to schedule the health workers to maximize the number of patients in the “Adhering” state.
2.4 Belief State
Using standard results from Markov decision theory, the partially observable restless bandit problem can be converted to a fully observable restless bandit problem with the belief (or posterior distribution) as the state. We present the details in this section. Let denote the operator’s belief on the state of arm at time , defined as follows: for any , let . Note that is a distribution-valued random variable. Also, define .
Then, for arm , the evolution of the belief state is as follows: for model A, the belief update rule is
(4) |
and for model B, the belief update rule is
(5) |
where is the Dirac delta distribution over the discrete state space with the value of one only for state . The per-step cost function of the belief state when action is taken is
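To make the update rules (4) and (5) concrete, the following is a minimal sketch in Python. The names are ours and reflect an assumption about notation: `P0` stands for the transition matrix under the passive action, and `rho` for the common row of the transition matrix under the active action (the restart distribution of Assumption 1).

```python
import numpy as np

def belief_update_model_A(b, a, P0, rho):
    """Model A (no observations), cf. (4): propagate the belief through the
    passive dynamics, or reset it to rho when the arm is selected, since every
    row of the active transition matrix equals rho (restart property)."""
    return b @ P0 if a == 0 else rho.copy()

def belief_update_model_B(b, a, y, P0):
    """Model B, cf. (5): same passive update, but when the arm is selected the
    post-reset state y is observed, so the belief collapses to the Dirac
    distribution at y."""
    if a == 0:
        return b @ P0
    e = np.zeros(len(b))
    e[y] = 1.0
    return e
```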
Define the combined belief state of the system as follows: for any ,
Note that is a random variable that takes values in . Using standard results in POMDPs [39], we have the following.
Proposition 1
In Problem 1, is a sufficient statistic for . Therefore, there is no loss of optimality in restricting attention to decision policies of the form . Furthermore, an optimal policy with this structure can be identified by solving an appropriate dynamic program.
Next, we present our first simplification for the structure of optimal decision policy as follows.
Proposition 2
For any , we have
(6) |
Therefore, there is no loss of optimality in restricting attention to decision policies of the form . Furthermore, an optimal policy with this structure can be identified by solving an appropriate dynamic program.
Proof 1
3 Whittle index policy solution concept
For ease of notation, we drop the superscript from all relevant variables for the rest of this section and the next.
Consider an arm with a modified per-step cost function
(7) |
The modified cost function implies that there is a penalty of for taking the active action. Given any time-homogeneous policy , the modified performance of the policy is
(8) |
Subsequently, consider the following optimization problem.
Problem 2
Given an arm , the discount factor and the penalty , choose a Markov policy to minimize given by (8).
Problem 2 is a Markov decision process where one may use dynamic programming to obtain the optimal solution as follows.
Proposition 3
Proof 2
The result follows immediately from Markov decision theory [40].
Finally, we present the following definitions.
Definition 1 (Passive Set)
Given penalty , define the passive set as the set of states where passive action is optimal for the modified arm, i.e.,
Definition 2 (Indexability)
An arm is indexable if is non-decreasing in , i.e., for any ,
A restless multi-armed bandit problem is indexable if all arms are indexable.
Definition 3 (Whittle index)
The Whittle index of the state of an arm is the smallest value of for which state is part of the passive set , i.e.,
Equivalently, the Whittle index is the smallest value of for which the optimal policy is indifferent between the active action and passive action when the belief state of the arm is .
The Whittle index policy is as follows: at each time step, select the arms which are in states with the highest indices. The Whittle index policy is easy to implement and efficient to compute, but it may not be optimal. As mentioned earlier, the Whittle index policy is optimal in certain cases [18, 19, 20, 21] and performs close to optimal in many other cases [22, 23, 10, 24, 25, 26].
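To make these definitions operational, the following is a minimal sketch (in Python, with names of our own choosing; it is not the paper's algorithm) for a generic finite-state arm with passive and active transition matrices `P0`, `P1` and an S-by-2 per-step cost array `c`. It solves the penalized single-arm problem of Problem 2 by value iteration, finds the Whittle index of a state by bisection on the penalty (valid under indexability, since the passive set is then monotone in the penalty), and implements the top-m selection rule of the Whittle index policy.

```python
import numpy as np

def penalized_optimal_actions(P0, P1, c, penalty, beta=0.99, tol=1e-9, max_iter=100000):
    """Value iteration for the penalized single-arm problem (Problem 2): per-step
    cost c[s, a] plus `penalty` whenever the active action a = 1 is taken.
    Returns, for each state, 1 if the active action is optimal and 0 otherwise."""
    V = np.zeros(c.shape[0])
    for _ in range(max_iter):
        Q0 = c[:, 0] + beta * (P0 @ V)            # passive action
        Q1 = c[:, 1] + penalty + beta * (P1 @ V)  # active action
        V_new = np.minimum(Q0, Q1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return (Q1 < Q0).astype(int)

def whittle_index_by_bisection(P0, P1, c, s, lo=-100.0, hi=100.0, iters=60, beta=0.99):
    """Definition 3 made operational: the smallest penalty at which state s joins
    the passive set, located by bisection (lo and hi must bracket the index)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if penalized_optimal_actions(P0, P1, c, mid, beta)[s] == 0:
            hi = mid   # s already passive: the index is at most mid
        else:
            lo = mid   # s still active: the index exceeds mid
    return 0.5 * (lo + hi)

def whittle_policy_step(current_indices, m):
    """One step of the Whittle index policy: activate the m arms whose current
    states carry the largest indices."""
    return np.argsort(current_indices)[-m:]
```

The same generic routine can be reused on the truncated information-state models constructed in Sec. 5.4.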
4 Indexability and the corresponding Whittle index for models A and B
Given an arm, let denote the family of all stopping times , with respect to the natural filtration associated with . For any stopping time and an initial belief state , define
Theorem 1
The PO-RMAB for models A and B is indexable. In particular, each arm is indexable and the Whittle index is given by
where
(10)
(11)
and for model A and for model B.
Proof 3
Recall that we assert that and are non-decreasing in for any . Hence, for any policy
where is non-decreasing in for any and . From Markov decision theory we know that . Since the infimum of non-decreasing functions is non-decreasing, is non-decreasing in for any . Consequently, is non-decreasing which implies is non-decreasing in .
Given any stopping time , let denote a policy that takes the passive action up to and including time , takes the active action at time , and follows the optimal policy from time onwards. The performance of policy is given by
(12) |
We use to denote a policy that takes active action at time and follows the optimal policy from time onwards. The performance of policy is given by
(13) |
The next result generalizes [26, Lemma 2].
Lemma 1
The following characterizations of the passive sets are equivalent to Def. 1:
1. .
2. .
3. .
Proof 4
Note that does not depend on while we showed that is non-decreasing in . Hence, is non-decreasing in by the lemma. Thus, arm is indexable. The expression for the Whittle index in the theorem follows from the definitions.
5 Whittle index computation
Computing the Whittle index using the belief state representation is intractable in general. Inspired by the approach taken in [45], we introduce a new information state which is equivalent to the belief state.
5.1 Countable Information state
For models A and B, define
Assumption 2
For model A, and for model B, .
For model A, define a process as follows. The initial state is such that and for , is given by
(14) |
Similarly, for model B, define a process as follows. The initial state is such that and for , evolves according to (14) and evolves according to
(15) |
Note that once the first observation has been taken in both models, denotes the time elapsed since the last observation of arm and, in addition, in model B, denotes the last observed state of arm . Let and . The relation between the belief state and the variables and is characterized in the following lemma.
Lemma 2
The following statements hold under Assumption 2:
1. For model A, for any , . In particular, .
2. For model B, for any , . In particular, .
For model A, the expected per-step cost at time may be written as
(16) |
Similarly, for model B, the expected per-step cost at time may be written as
(17) |
Proposition 4
In Problem 1, there is no loss of optimality in restricting attention to decision policies of the form for model A and of the form for model B.
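As an illustration, here is a minimal sketch of the information-state updates (14)–(15) and the per-step costs (16)–(17). It relies on two conventions that are our assumptions rather than statements of the paper: the counter `lam` records the number of passive transitions since the last reset (so it is set to 0 at a selection epoch), and, in the spirit of Lemma 2, the belief after `lam` passive steps equals the reset PMF `rho` (model A) or the Dirac distribution at the last observed state (model B) propagated through the passive matrix `P0`.

```python
import numpy as np
from numpy.linalg import matrix_power

def info_state_update_A(lam, a):
    """Model A, cf. (14): the counter grows under the passive action and is
    reset when the arm is selected (convention assumed: reset to 0)."""
    return lam + 1 if a == 0 else 0

def info_state_update_B(y, lam, a, observed_state=None):
    """Model B, cf. (14)-(15): (y, lam) records the last observed (post-reset)
    state and the time elapsed since that observation."""
    if a == 0:
        return y, lam + 1
    return observed_state, 0

def expected_cost_A(lam, a, rho, P0, c):
    """Expected per-step cost for model A, cf. (16): inner product of the belief
    rho @ P0^lam (assumed convention) with the cost column c[:, a]."""
    return (rho @ matrix_power(P0, lam)) @ c[:, a]

def expected_cost_B(y, lam, a, P0, c):
    """Expected per-step cost for model B, cf. (17): belief started from the
    Dirac distribution at the last observed state y."""
    delta = np.zeros(P0.shape[0]); delta[y] = 1.0
    return (delta @ matrix_power(P0, lam)) @ c[:, a]
```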
5.2 Threshold policies
We assume that the model satisfies the following condition.
Assumption 3
Let where and are non-decreasing functions in and is submodular in .
Under Assumption 3, we derive structural properties of the optimal policies for models A and B. Then, we show how can be decomposed and computed.
In the following theorem, we show that the optimal policy for model A has a threshold structure and the optimal policy for model B has a threshold structure with respect to the second dimension of the information state.
Theorem 2
Under Assumptions 2 and 3, the following statements hold:
1. In model A, for any , the optimal policy is a threshold policy, i.e., there exists a threshold such that
Moreover, the threshold is non-decreasing in .
2. In model B, for any , the optimal policy is a threshold policy with respect to for every , i.e., there exists a threshold for each such that
Moreover, for every , the threshold is non-decreasing in .
See Section 6 for the proof.
We use to denote the vector .
5.3 Performance of threshold based policies
We simplify the notation and denote the policy corresponding to thresholds and by simply and instead of and .
5.3.1 Model A
Let be the total discounted cost incurred under policy with penalty when the initial state is , i.e.,
(18) |
where
represents the expected total discounted cost while represents the expected number of times active action is selected under policy starting from the initial information state .
We will show (see Theorem 7) that the Whittle index for model A can be computed as a function of and . First, we present a method to compute these two variables. Let
where and , respectively, denote the expected discounted cost and time starting from information state until reaching information state for the first time.
Theorem 3
For any , we have
See Section 6 for the proof.
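The quantities D and N of a threshold policy in model A can also be estimated by a direct rollout of the information-state chain, which evolves deterministically under a threshold policy. The sketch below does this by brute force (it is not the recursive characterization of Theorem 3) and uses the same assumed conventions as the earlier sketches: the counter resets to 0 on the active action, and the objective is the normalized discounted cost of Sec. 2.2.

```python
import numpy as np

def evaluate_threshold_policy_A(n, start_lam, rho, P0, c, beta=0.95, horizon=2000):
    """Roll out the threshold-n policy for model A: passive while lam < n, active
    (which resets lam) once lam >= n.  Returns the pair (D, N): the normalized
    discounted state cost and the normalized discounted number of activations,
    starting from information state start_lam."""
    D, N, lam, disc = 0.0, 0.0, start_lam, 1.0
    for _ in range(horizon):
        a = 1 if lam >= n else 0
        belief = rho @ np.linalg.matrix_power(P0, lam)
        D += disc * (belief @ c[:, a])
        N += disc * a
        lam = 0 if a == 1 else lam + 1
        disc *= beta
    return (1 - beta) * D, (1 - beta) * N
```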
5.3.2 Model B
Let be the total discounted cost incurred under policy with penalty when the initial information state is , i.e.,
(19) |
where
and have the same interpretations as the ones for model A. We will show (see Theorem 8) that the Whittle index for model B can be computed as a function of and . First, define the vector , , and the vectors and in a similar manner. Then, from (19), . Also define
Let and .
Theorem 4
For any , we have
Let be a matrix where , for any . Then,
See Section 6 for the proof.
5.4 Finite state approximation
For computing the Whittle index, we provide a finite-state approximation of Proposition 3 for models A and B. Essentially, we truncate the countable set of possible information states to a finite set and provide an approximation bound on the optimal value function for each of the models.
Theorem 5 (Model A)
Given , let and be the unique fixed point of equation
where
We set if . Then, we have the following:
(i) For any , we have
(ii) For all , . Moreover, let be any limit point of . Then, the policy is optimal for Problem 2.
See Section 6 for the proof.
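A minimal sketch of the finite-state truncation for model A: capping the counter at a level L yields an ordinary finite MDP that can be fed to a generic penalized solver such as the one sketched in Section 3. The boundary treatment below (simply freezing the counter at L) is one simple choice and is not claimed to be identical to the construction analyzed in Theorem 5.

```python
import numpy as np
from numpy.linalg import matrix_power

def truncated_model_A(L, rho, P0, c):
    """Build a finite approximation of the model-A information-state MDP with
    states {0, 1, ..., L}.  Returns (P0_trunc, P1_trunc, c_trunc), where the
    cost array has one column per action."""
    S = L + 1
    P0_t, P1_t = np.zeros((S, S)), np.zeros((S, S))
    c_t = np.zeros((S, 2))
    for lam in range(S):
        P0_t[lam, min(lam + 1, L)] = 1.0   # passive: counter grows, capped at L
        P1_t[lam, 0] = 1.0                 # active: counter resets
        belief = rho @ matrix_power(P0, lam)
        c_t[lam, 0] = belief @ c[:, 0]
        c_t[lam, 1] = belief @ c[:, 1]
    return P0_t, P1_t, c_t
```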
Theorem 6 (Model B)
Given , let and be the unique fixed point of equation
where
We set if . Then, we have the following:
(i) For any ,
(ii) For all , . Let be any limit point of . Then, the policy is optimal for Problem 2.
See Section 6 for the proof.
5.5 Computation of Whittle index
Next, we derive a closed form expression to compute the Whittle index for model A and provide an efficient algorithm to compute the Whittle index for model B.
5.5.1 Whittle index formula for model A.
For model A, we obtain the Whittle index formula based on the two variables and as follows.
Theorem 7
Let . Then, under Assumption 3, , and the Whittle index of model A at information state is
(20) |
Proof 7
Since model A is a restart model, the result follows from [31, Lemma 4].
Theorem 7 gives us a closed-form expression to approximately compute the Whittle index for model A.
5.5.2 Modified adaptive greedy algorithm for model B.
Let and denote the number of distinct Whittle indices. Let where denote the sorted distinct Whittle indices with . Let . For any subset , define the policy as
Given , define and . Additionally, for any , and all states , define , and . Then, for all , define
(21) |
Lemma 3
For , we have the following:
1. For all , we have .
2. For all and , we have for all with equality if and only if and .
Proof 8
Theorem 8
The following properties hold:
1. For any , the set is non-empty.
2. For any , with equality if and only if .
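As a brute-force baseline (not the refined adaptive greedy algorithm of this section), one can enumerate the truncated information states (y, lam) of model B, build the induced finite MDP, and compute the Whittle index of each state directly from Definition 3, for instance with the bisection routine sketched in Section 3. The sketch below constructs that finite MDP under the same assumed conventions as the earlier sketches; it is useful mainly as a sanity check on small instances.

```python
import numpy as np
from numpy.linalg import matrix_power

def truncated_model_B(L, rho, P0, c):
    """Enumerate model-B information states (y, lam) with lam <= L and build the
    corresponding finite MDP (passive/active transition matrices and cost array)."""
    n = P0.shape[0]
    states = [(y, lam) for y in range(n) for lam in range(L + 1)]
    idx = {s: i for i, s in enumerate(states)}
    S = len(states)
    P0_t, P1_t, c_t = np.zeros((S, S)), np.zeros((S, S)), np.zeros((S, 2))
    for (y, lam), i in idx.items():
        P0_t[i, idx[(y, min(lam + 1, L))]] = 1.0   # passive: counter grows (capped)
        for y_new in range(n):                     # active: reset and observe y_new
            P1_t[i, idx[(y_new, 0)]] += rho[y_new]
        delta = np.zeros(n); delta[y] = 1.0
        belief = delta @ matrix_power(P0, lam)
        c_t[i, 0], c_t[i, 1] = belief @ c[:, 0], belief @ c[:, 1]
    return states, P0_t, P1_t, c_t
```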
6 Proof of Main Results
6.1 Proof of Theorem 2
Let and be two probability mass functions on a totally ordered set . Then we say stochastically dominates if for all , . Given two transition matrices and , we say stochastically dominates if each row of stochastically dominates the corresponding row of . A basic property of stochastic dominance is the following.
Lemma 4
If stochastically dominates and is a non-decreasing function defined on , then for all , .
Proof 10
This follows from [40, Lemma 4.7.2].
Consider a fully-observable restless bandit process (note that is removed due to the observability assumption). According to [31], we say a fully-observable restless bandit process is stochastic monotone if it satisfies the following conditions.
(D1) and are stochastic monotone transition matrices.
(D2) For any , is non-decreasing in .
(D3) For any , is non-decreasing in .
(D4) is submodular in .
The following is established in [31, Lemma 5].
Proposition 5
The optimal policy of a stochastic monotone fully-observable restless bandit process is a threshold policy denoted by , which is a policy which takes passive action for states below a threshold denoted by and active action for the rest of the states, i.e.,
6.1.1 Proof of Theorem 2, Part 1
We show that each machine in model A is a stochastic monotone fully-observable restless bandit process. Each condition of a stochastic monotone fully-observable restless bandit process is verified for model A below.
(D1') The transition probability matrix under passive action for model A based on the information states is and the transition probability matrix under active action for model A is . Thus, and are stochastic monotone matrices.
(D2') Since is a stochastic monotone matrix and has constant rows, is non-decreasing in for any integer .
(D3')
(D4') As is submodular in and as shown in (D3'), stochastically dominates for any . Therefore, by Lemma 4, is non-decreasing in .
Therefore, according to Proposition 5, the optimal policy of a fully-observable restless bandit process under model A is a threshold based policy.
Finally, since the optimal policy is threshold based, the passive set is given by . As shown in Theorem 1, model A is indexable. Therefore, the passive set must be non-decreasing in , which implies that the threshold is non-decreasing in .
6.1.2 Proof of Theorem 2, Part 2
We first characterize the behavior of the value function and the state-action value function for model B.
Lemma 5
We have:
a. is non-decreasing in for any and .
b. Given a fixed , is non-decreasing in for any .
c. is submodular in , for any .
d. is submodular in , for any .
Proof 11
The proof of each part is as follows.
a. By definition, we have
Similar to the proof of (D3') in Proposition 5, for a given and , is non-decreasing in and , and as is non-decreasing in , is non-decreasing in .
b. Let
where for all .
Claim: is non-decreasing in for any and .
We prove the claim by induction. By construction, is non-decreasing in for any . This forms the basis of induction. Now assume that is non-decreasing in for any and some . Consider . Then, by the induction hypothesis, we have
Therefore,
Thus, is non-decreasing in for any . This completes the induction step. Since monotonicity is preserved under limits, the claimed monotonicity also holds in the limit, which completes the proof of the claim.
c. is submodular in . Also, note that is the row of . Thus, stochastically dominates and by Lemma 4 we have
Therefore,
Consequently,
Hence,
d. As for any , is non-decreasing in , and is submodular in , for any and , we have
Lemma 6
Suppose is a submodular function and for each , exists. Then, is monotone non-decreasing in .
Proof 12
This result follows from [40, Lemma 4.7.1].
Now, since is submodular in for any , and since only two actions are available, it follows from Lemma 6 that the optimal policy is a threshold policy as specified in the theorem statement.
Finally, since the optimal policy is threshold based, the passive set is given by . As shown in Theorem 1, model B is indexable. Therefore, the passive set must be non-decreasing in , which implies that, for every , the threshold is non-decreasing in .
6.2 Proof of Theorem 3
By the strong Markov property, we have
If we set in the above,
6.3 Proof of Theorem 4
By the strong Markov property, we have
If we set in the above,
which results in
and hence, the statement is obtained by rearranging the terms in the above equations.
6.4 Proof of Theorem 5
(i): Starting from information state , the cost incurred by is the same as for information states . The per-step cost incurred by differs from for information states by at most .
(ii): The sequence of finite-state models described above is an augmentation type approximation sequence. As a result, a limit point of exists and the final result holds by [47, Proposition B.5, Theorem 4.6.3].
6.5 Proof of Theorem 6
(i): Starting from information state , given any and , the cost incurred by is the same as for information states . The per-step cost incurred by differs from for later realized information states by at most . Thus, the bound holds.
(ii): The sequence of finite-state models described above is an augmentation type approximation sequence. As a result, a limit point of exists and the final result holds [47, Proposition B.5, Theorem 4.6.3].
7 Numerical Analysis
In this section, we consider Example 2 and compare the performance of the following policies:
- opt: the optimal policy obtained using dynamic programming. As discussed earlier, the dynamic programming computation to obtain the optimal policy suffers from the curse of dimensionality. Therefore, the optimal policy can be computed only for small-scale models.
- myp: the myopic policy, a heuristic which selects machines sequentially: given the machines selected so far, the next machine chosen is the one that provides the smallest increase in the total per-step cost. The detailed description for model B is shown in Alg. 2; a minimal sketch of the selection rule is given after this list.
- wip: the Whittle index heuristic, as described in this paper.
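A minimal sketch of the myopic selection rule described above: since the arms evolve independently and the costs are additive across arms, the sequential selection reduces to ranking the arms by the one-step marginal cost of activating them. The per-arm expected one-step costs can be computed from the current information states, for example with the cost routines sketched in Sec. 5.1; Alg. 2 gives the detailed version for model B.

```python
import numpy as np

def myopic_selection(cost_if_passive, cost_if_active, m):
    """One-step greedy heuristic: activate the m arms whose activation yields the
    smallest increase (equivalently, the largest decrease) in the expected total
    per-step cost at the current information states."""
    marginal = np.asarray(cost_if_active) - np.asarray(cost_if_passive)
    return np.argsort(marginal)[:m]
```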
7.1 Experiments and Results
We conduct numerical experiments for both models A and B, and vary the number of machines, the number of service-persons, and the parameters associated with each machine. There are three parameters associated with each machine: the deterioration probability matrix , the reset PMF , and the per-step cost . We assume the matrix is chosen from a family of four types of structured transition matrices , where is a parameter of the model. The details of these models are presented in Appendix A. We assume each element of is sampled from Exp(), i.e., an exponential distribution with rate parameter , and then normalized such that the sum of all elements equals . Finally, we assume that the per-step cost is given by and .
In all experiments, the discount factor is . The performance of every policy is evaluated using Monte-Carlo simulation of length averaged over sample paths.
In Experiment 1, we consider a small-scale problem where we can compute opt and compare the performance of wip with it. In Experiment 2, we consider a large-scale problem where we compare the performance of wip with myp, as computing the optimal policy is highly time-consuming.
The code for both experiments is available at [48].
Experiment 1) Comparison of Whittle index with the optimal policy.
In this experiment, we compare the performance of wip with opt. We assume , and , for both models A and B. In order to model heterogeneous machines, we consider the following. Let denote equispaced points in the interval . Then we choose as the transition matrix of machine . We denote the accumulated discounted cost of wip and opt by and , respectively. In order to have a better perspective of the performances, we compute the relative performance of wip with respect to opt by computing
(22) |
The closer is to , the closer wip is to opt. For all combinations of the parameters, the results of indicate that the Whittle index policy performs as well as the optimal policy.
Experiment 2) Comparison of Whittle index with the myopic policy for structured models.
In this experiment, we increase the state space size to , set , and select from the set and from the set . We denote the accumulated discounted cost of myp by . In order to have a better perspective of the performances, we compute the relative improvement of wip with respect to myp by computing
(23) |
Note that means that wip performs better than myp. We generate structured transition matrices, similar to Experiment 1, and apply the same procedure to build heterogeneous machines. The results of for different choices of the parameters for models A and B are shown in Tables 1 and 2, respectively.
Table 1: Relative improvement of wip over myp for model A.

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
|   | 1.99 | 2.54 | 2.24 | 7.44 |
|   | 3.41 | 6.90 | 4.71 | 8.14 |
|   | 2.97 | 6.19 | 2.80 | 6.70 |

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
|   | 0.21 | 0.26 | 0.19 | 0.97 |
|   | 0.68 | 1.73 | 1.28 | 4.54 |
|   | 1.36 | 2.35 | 2.32 | 6.41 |

Table 2: Relative improvement of wip over myp for model B.

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
|   | 7.67 | 11.17 | 12.12 | 9.39 |
|   | 14.96 | 13.85 | 14.55 | 9.17 |
|   | 15.02 | 12.12 | 13.39 | 6.63 |

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
|   | 0.63 | 1.62 | 1.01 | 2.92 |
|   | 2.92 | 3.14 | 3.21 | 6.57 |
|   | 4.86 | 7.22 | 6.99 | 9.96 |
7.2 Discussion
In Experiment 1, where wip is compared with opt, we observe that the relative performance is very close to one for almost all experiments, implying that wip performs as well as opt in these experiments. The relative performance in model B is lower than in model A because, for a given set of parameters, model B is more complex than model A; hence, the gap between the performance of the two policies is larger for model B.
In Experiment 2, where wip is compared with myp, we observe that the relative improvement ranges from to . By a similar reasoning as in Experiment 1, since model B is more complex than model A, the improvement for model B is higher than that for model A for the same set of parameters.
Furthermore, we observe that as increases, also increases overall. Also, as increases, decreases in general. This suggests that as increases, the overlap between the sets of machines chosen according to wip and myp grows, and hence, the performances of wip and myp become close to each other.
8 Conclusion
We investigated partially observable restless bandits. Unlike most of the existing literature, which restricts attention to models with a binary state space, we considered general state-space models. We presented two observation models, which we call model A and model B, and showed that the partially observable restless bandits are indexable for both models.
To compute the Whittle index, we work with a countable space representation rather than the belief state representation. We established certain qualitative properties of the auxiliary problem to compute the Whittle index. In particular, for both models we showed that the optimal policies of the auxiliary problem satisfy threshold properties. For model A, we used the threshold property to obtain a closed form expression to compute the Whittle index. For model B, we used the threshold policy to present a refinement of the adaptive greedy algorithm of [31] to compute the Whittle index.
Finally, we presented a detailed numerical study of a machine maintenance model. We observed that for small-scale models, the Whittle index policy is close-to-optimal and for large-scale models, the Whittle index policy outperforms the myopic policy baseline.
Appendix A Structured Markov chains
Consider a Markov chain with states. Then a family of structured stochastic monotone matrices which dominates the identity matrix is illustrated below.
1. Matrix : Let and . Then,
2. Matrix : Similar to with and .
3. Matrix : Similar to with and .
4. Matrix : Let . Then,
References
- [1] R. Meshram, D. Manjunath, A. Gopalan, On the Whittle index for restless multiarmed hidden Markov bandits 63 (9) (2018) 3046–3053.
- [2] S. Guha, K. Munagala, P. Shi, Approximation algorithms for restless bandit problems, Journal of the ACM (JACM) 58 (1) (2010) 3.
- [3] K. Kaza, R. Meshram, V. Mehta, S. N. Merchant, Sequential decision making with limited observation capability: Application to wireless networks, IEEE Transactions on Cognitive Communications and Networking 5 (2) (2019) 237–251.
- [4] K. Kaza, V. Mehta, R. Meshram, S. Merchant, Restless bandits with cumulative feedback: Applications in wireless networks, in: Wireless Communications and Networking Conference, IEEE, 2018, pp. 1–6.
- [5] S. Aalto, P. Lassila, P. Osti, Whittle index approach to size-aware scheduling for time-varying channels with multiple states, Queueing Systems 83 (3-4) (2016) 195–225.
- [6] M. Larrañaga, M. Assaad, A. Destounis, G. S. Paschos, Dynamic pilot allocation over Markovian fading channels: A restless bandit approach, in: Information Theory Workshop, IEEE, 2016, pp. 290–294.
- [7] N. Akbarzadeh, A. Mahajan, Dynamic spectrum access under partial observations: A restless bandit approach, in: Canadian Workshop on Information Theory, IEEE, 2019, pp. 1–6.
- [8] S. S. Villar, J. Bowden, J. Wason, Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges, Statistical science: a review journal of the Institute of Mathematical Statistics 30 (2) (2015) 199.
- [9] C. Abad, G. Iyengar, A near-optimal maintenance policy for automated DR devices, IEEE Transactions on Smart Grid 7 (3) (2016) 1411–1419.
- [10] K. D. Glazebrook, H. M. Mitchell, P. S. Ansell, Index policies for the maintenance of a collection of machines by a set of repairmen, European Journal of Operational Research 165 (1) (2005) 267–284.
- [11] S. S. Villar, Indexability and optimal index policies for a class of reinitialising restless bandits, Probability in the engineering and informational sciences 30 (1) (2016) 1–23.
- [12] Y. Qian, C. Zhang, B. Krishnamachari, M. Tambe, Restless poachers: Handling exploration-exploitation tradeoffs in security domains, in: Int. Conf. on Autonomous Agents & Multiagent Systems, 2016, pp. 123–131.
- [13] V. S. Borkar, Whittle index for partially observed binary markov decision processes, IEEE Transactions on Automatic Control 62 (12) (2017) 6614–6618.
- [14] V. S. Borkar, G. S. Kasbekar, S. Pattathil, P. Shetty, Opportunistic scheduling as restless bandits, IEEE Transactions on Control of Network Systems (2017).
- [15] K. E. Avrachenkov, V. S. Borkar, Whittle index policy for crawling ephemeral content, IEEE Transactions on Control of Network Systems 5 (1) (2018) 446–455. doi:10.1109/TCNS.2016.2619066.
- [16] C. H. Papadimitriou, J. N. Tsitsiklis, The complexity of optimal queuing network control, Mathematics of Operations Research 24 (2) (1999) 293–305.
- [17] P. Whittle, Restless bandits: Activity allocation in a changing world, Journal of Applied Probability 25 (A) (1988) 287–298.
- [18] J. C. Gittins, Bandit processes and dynamic allocation indices, Journal of the Royal Statistical Society. Series B (Methodological) (1979) 148–177.
- [19] R. R. Weber, G. Weiss, On an index policy for restless bandits, Journal of Applied Probability 27 (3) (1990) 637–648.
- [20] C. Lott, D. Teneketzis, On the optimality of an index rule in multichannel allocation for single-hop mobile networks with multiple service classes, Probability in the Engineering and Informational Sciences 14 (3) (2000) 259–297.
- [21] K. Liu, Q. Zhao, Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access 56 (11) (2010) 5547–5567.
- [22] J. Niño-Mora, Dynamic priority allocation via restless bandit marginal productivity indices, TOP 15 (2) (2007) 161–198.
- [23] P. S. Ansell, K. D. Glazebrook, J. Niño-Mora, M. O’Keeffe, Whittle’s index policy for a multi-class queueing system with convex holding costs, Mathematical Methods of Operations Research 57 (1) (2003) 21–39.
- [24] K. D. Glazebrook, D. Ruiz-Hernandez, C. Kirkbride, Some indexable families of restless bandit problems, Advances in Applied Probability 38 (3) (2006) 643–672.
- [25] U. Ayesta, M. Erausquin, P. Jacko, A modeling framework for optimizing the flow-level scheduling with time-varying channels, Performance Evaluation 67 (11) (2010) 1014–1029.
- [26] N. Akbarzadeh, A. Mahajan, Restless bandits with controlled restarts: Indexability and computation of Whittle index, in: Conference on Decision and Control, 2019, pp. 7294–7300.
- [27] T. W. Archibald, D. P. Black, K. D. Glazebrook, Indexability and index heuristics for a simple class of inventory routing problems, Operations research 57 (2) (2009) 314–326.
- [28] K. Avrachenkov, U. Ayesta, J. Doncel, P. Jacko, Congestion control of TCP flows in internet routers by means of index policy, Computer Networks 57 (17) (2013) 3463–3478.
- [29] K. Glazebrook, D. Hodge, C. Kirkbride, Monotone policies and indexability for bidirectional restless bandits, Advances in Applied Probability 45 (1) (2013) 51–85.
- [30] Z. Yu, Y. Xu, L. Tong, Deadline scheduling as restless bandits 63 (8) (2018) 2343–2358.
- [31] N. Akbarzadeh, A. Mahajan, Conditions for indexability of restless bandits and an algorithm to compute Whittle index, Adv. in Appl. Probab. 54 (4) (2022) 1164–1192.
- [32] J. Niño-Mora, Restless bandits, partial conservation laws and indexability, Advances in Applied Probability 33 (1) (2001) 76–98.
- [33] N. Gast, B. Gaujal, K. Khun, Computing Whittle (and Gittins) index in subcubic time, arXiv preprint arXiv:2203.05207 (2022).
- [34] D. Ruiz-Hernández, J. M. Pinar-Pérez, D. Delgado-Gómez, Multi-machine preventive maintenance scheduling with imperfect interventions: A restless bandit approach, Computers & Operations Research 119 (2020) 104927.
- [35] A. Mate, J. Killian, H. Xu, A. Perrault, M. Tambe, Collapsing bandits and their application to public health intervention, Advances in Neural Information Processing Systems 33 (2020) 15639–15650.
- [36] C. R. Dance, T. Silander, Optimal policies for observing time series and related restless bandit problems., J. Mach. Learn. Res. 20 (2019) 35–1.
- [37] K. Liu, Index policy for a class of partially observable Markov decision processes, arXiv preprint arXiv:2107.11939 (2021).
- [38] D. B. Brown, J. E. Smith, Index policies and performance bounds for dynamic selection problems, Management Science 66 (7) (2020) 3029–3050.
- [39] K. J. Astrom, Optimal control of Markov processes with incomplete state information, Journal of mathematical analysis and applications 10 (1) (1965) 174–205.
- [40] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming, John Wiley & Sons, 2014.
- [41] E. Altman, Constrained Markov decision processes, Vol. 7, CRC press, 1999.
- [42] G. M. Lipsa, N. C. Martins, Remote state estimation with communication costs for first-order LTI systems, IEEE Transactions on Automatic Control 56 (9) (2011) 2013–2025.
- [43] J. Chakravorty, A. Mahajan, Fundamental limits of remote estimation of autoregressive Markov processes under communication constraints., IEEE Trans. Automat. Contr. 62 (3) (2017) 1109–1124.
- [44] A. Mahajan, Remote estimation over control area networks, in: 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall), 2017, pp. 1–5.
- [45] D. I. Shuman, A. Nayyar, A. Mahajan, Y. Goykhman, K. Li, M. Liu, D. Teneketzis, M. Moghaddam, D. Entekhabi, Measurement scheduling for soil moisture sensing: From physical models to optimal control, Proceedings of the IEEE 98 (11) (2010) 1918–1933.
- [46] J. Keilson, A. Kester, Monotone matrices and monotone Markov processes, Stochastic Processes and their Applications 5 (3) (1977) 231–241.
- [47] L. I. Sennott, Stochastic dynamic programming and the control of queueing systems, Vol. 504, John Wiley & Sons, 2009.
- [48] N. Akbarzadeh, A. Mahajan, Partially observable restless bandits with restarts, https://codeocean.com/capsule/6654724/tree/v1.