
Two families of indexable partially observable restless bandits and Whittle index computation

This research was funded in part by the Fonds de Recherche du Quebec-Nature et technologies (FRQNT).

Nima Akbarzadeh [email protected] and Aditya Mahajan [email protected], Department of Electrical and Computer Engineering, McGill University, 3480 Rue University, Montréal, QC H3A 0E9.
Abstract

We consider restless bandits with a general finite state space under partial observability, with two observation models: in the first, the state of each bandit is not observable at all; in the second, the state of each bandit is observable whenever it is selected. Under the assumption that the models satisfy a restart property, we prove that both models are indexable. For the first model, we derive a closed-form expression for the Whittle index. For the second model, we propose an efficient algorithm to compute the Whittle index by exploiting the qualitative properties of the optimal policy. We present detailed numerical experiments for multiple instances of a machine maintenance problem. The results indicate that the Whittle index policy outperforms the myopic policy and can be close to optimal in different setups.

keywords:
Multi-armed bandits; Restless bandits; Whittle index; indexability; partially observable; scheduling; resource allocation.

1 Introduction

Resource allocation and scheduling problems arise in various applications including telecommunication networks, sensor management, patient prioritization, and machine maintenance. Restless bandits is a widely-used solution framework for such models [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].

Identifying the optimal policy in restless bandits suffers from the curse of dimensionality because the state space is exponential in the number of alternatives [16]. To circumvent the curse of dimensionality, Whittle proposed an index heuristic which has a linear complexity in the number of alternatives [17]. The resulting policy, called the Whittle index policy, operates as follows: assign an index (called the Whittle index) to each state of each arm (or alternative) and then, at each time, play the arms in states with the highest indices.

The Whittle index policy is attractive for two reasons. First, it is a scalable heuristic because its complexity is linear in the number of arms. Second, although it is a heuristic, there are certain settings where it is optimal [18, 19, 20, 21] and, in general, it performs close to optimal in many instances [22, 23, 10, 24, 25, 26].

Nonetheless, there are two challenges in using the Whittle index policy. First, the Whittle index heuristic is applicable only when a technical condition known as indexability is satisfied. There is no general test for indexability, and the existing sufficient conditions are only applicable to specific models [23, 10, 24, 27, 28, 25, 29, 30]. Second, while there are closed-form expressions for the Whittle index in some instances [10, 24, 29, 5, 6, 4, 3, 26, 31], in general the Whittle index has to be computed numerically. For a subclass of restless bandits which satisfy an additional technical condition known as PCL (partial conservation law), the Whittle index can be computed using an algorithm called the adaptive greedy algorithm [32, 22]. Recently, [31, 33] presented generalizations of the adaptive greedy algorithm which are applicable to all indexable restless bandits.

We are interested in resource allocation and scheduling problems where the state of each arm is not fully-observed. Such partially observable restless bandit models are conceptually and computationally more challenging. The sufficient conditions for indexability that are derived for fully-observed bandits [17, 19, 10, 24, 29, 31, 34] are not directly applicable to the partially observable setting. The existing literature on partially observable restless bandits often restricts attention to models where each arm has two states [2, 3, 4, 11, 9, 13, 1, 5, 35]. In some cases, it is also assumed that the two states are positively correlated [4, 3, 5]; in others, it is assumed that the state dynamics are independent of the chosen action [6, 36, 37]. There are very few results for general finite state space models under partial observability [12, 7, 6, 36, 37], and, for such models, indexability is often verified numerically. In addition, there are very few algorithms to compute the Whittle index for such models.

Recently, alternative index-based heuristics for partially observable restless bandits have been proposed [38], but we restrict attention to the Whittle index policy in this paper.

The main contributions of our paper are as follows:

  • 1.

    We investigate partially observable restless bandits with general finite state spaces and consider two observation models, which we call model A and model B. We show that both models are indexable.

  • 2.

    For model A, we provide a closed-form expression to compute the Whittle index. For model B, we provide a refinement of the adaptive greedy algorithm of [31] to efficiently compute the Whittle index.

  • 3.

    We present a detailed numerical study which illustrates that the Whittle index policy performs close to optimal for small scale systems and outperforms a commonly used heuristic (the myopic policy) for large-scale systems.

The organization of the paper is as follows. In Section 2, we formulate the restless bandit problem under partial observations for two different models. Then, we define a belief state by which the partially-observable problem can be converted into a fully-observable one. In Section 3, we present a short overview of restless bandits. In Section 4, we show the restless bandit problem is indexable for both models and present a general formula to compute the index. In Section 5, we present a countable state representation of the belief state and use it to develop methods to compute the Whittle index. In Section 6, we present the proofs of the results. In Section 7, we present a detailed numerical study which compares the performance of Whittle index policy with two baseline policies. Finally, we conclude in Section 8.

1.1 Notations and Definitions

We use uppercase letters to denote random variables, the corresponding lowercase letters to denote their realizations, and the corresponding calligraphic letters to denote the sets of realizations. Superscripts index arms and subscripts index time, e.g., X^{i}_{t} denotes the state of arm i at time t. The subscript 0{:}t denotes the history of a variable from time 0 to t, e.g., X^{i}_{0:t} denotes (X^{i}_{0},\ldots,X^{i}_{t}). Bold letters denote the vector of a variable for all arms, e.g., \boldsymbol{X}_{t} denotes (X^{1}_{t},\ldots,X^{n}_{t}). Given a collection of real numbers p^{1},\ldots,p^{n}, we use \prod_{i=1}^{n}p^{i} to denote their product p^{1}\cdot p^{2}\cdots p^{n}. Given a collection of sets \mathcal{X}^{1},\dots,\mathcal{X}^{n}, we use \prod_{i=1}^{n}\mathcal{X}^{i} to denote their Cartesian product \mathcal{X}^{1}\times\mathcal{X}^{2}\times\dots\times\mathcal{X}^{n}. For a finite set \mathcal{X}, \mathcal{P}(\mathcal{X}) denotes the set of probability mass functions (PMFs) on \mathcal{X}.

We use \mathds{I} to denote the indicator function, \mathds{E} the expectation operator, \mathds{P} the probability function, \mathds{R} the set of real numbers, \mathds{Z} the set of integers, and \mathds{Z}_{\geq m} the set of integers that are not lower than m.

Given ordered sets \mathcal{X} and \mathcal{Y}, a function f:\mathcal{X}\times\mathcal{Y}\to\mathds{R} is called submodular if for any x_{1},x_{2}\in\mathcal{X} and y_{1},y_{2}\in\mathcal{Y} such that x_{2}\geq x_{1} and y_{2}\geq y_{1}, we have f(x_{1},y_{2})-f(x_{1},y_{1})\geq f(x_{2},y_{2})-f(x_{2},y_{1}). Furthermore, a transition probability matrix P is stochastic monotone if for any x,y\in\mathcal{X} such that x<y, we have \sum_{w\in\mathcal{X}_{\geq z}}P_{xw}\leq\sum_{w\in\mathcal{X}_{\geq z}}P_{yw} for any z\in\mathcal{X}. For any function f\colon\mathcal{Z}\to\mathds{R}, \operatorname{span}(f) denotes the span semi-norm of f, i.e., \operatorname{span}(f)=\sup_{z\in\mathcal{Z}}f(z)-\inf_{z\in\mathcal{Z}}f(z).
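
To make the two definitions above concrete, the following minimal Python sketch (ours, not part of the paper; the matrix P, the cost table c, and the function names are illustrative assumptions) checks stochastic monotonicity of a transition matrix and submodularity of a cost table on a finite ordered state space indexed 0, 1, ...:

import numpy as np

def is_stochastically_monotone(P, tol=1e-12):
    # Tail sums T[x, z] = sum_{w >= z} P[x, w]; monotonicity requires
    # T[x, z] <= T[y, z] whenever x < y, for every z.
    T = np.cumsum(P[:, ::-1], axis=1)[:, ::-1]
    return bool(np.all(np.diff(T, axis=0) >= -tol))

def is_submodular(c, tol=1e-12):
    # Submodularity of c(x, a) on an ordered grid is equivalent to all
    # second-order mixed differences being non-positive.
    d = np.diff(np.diff(c, axis=0), axis=1)
    return bool(np.all(d <= tol))

# Illustrative 3-state deterioration chain and cost table c[x, a].
P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
c = np.array([[0.0, 1.0],
              [2.0, 2.0],
              [5.0, 3.0]])
print(is_stochastically_monotone(P), is_submodular(c))   # True True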

2 Model and Problem Formulation

2.1 Restless Bandit Process with restart

A discrete-time restless bandit process (or arm) is a controlled Markov process (\mathcal{X},\{0,1\},\{\bar{P}(a)\}_{a\in\{0,1\}},c,\pi_{0},\mathcal{Y}) where \mathcal{X} denotes the finite set of states; \{0,1\} denotes the action space, where action 0 is called the passive action and action 1 the active action; \bar{P}(a), a\in\{0,1\}, denotes the transition matrix when action a is chosen; c:\mathcal{X}\times\{0,1\}\to\mathbb{R}_{\geq 0} denotes the cost function; \pi_{0} denotes the initial state distribution; and \mathcal{Y} denotes the finite set of observations.

Assumption 1 (Restart property)

All rows of the transition matrix \bar{P}(1) are identical.

For models which satisfy Assumption 1, we denote \bar{P}(0) by P and each (identical) row of \bar{P}(1) by Q. The term restart property follows the terminology of [26], where Q is a PMF on the state space (i.e., on taking the active action, the state resets according to the PMF Q). Note that [26] considered fully observed models, while we consider partially observable setups.

An operator has to select m<n arms at each time but does not observe the state of the arms. We consider two observation models.

  • 1.

    Model A: In model A, the operator does not observe anything. We denote this by Yt=𝔈Y_{t}=\mathfrak{E}, where 𝔈\mathfrak{E} denotes a blank symbol.

  • 2.

    Model B: In model B, the operator observes the state of the arm after it has been reset, i.e.,

    Yt+1={𝔈 if At=0Xt+1 if At=1,Y_{t+1}=\begin{cases}\mathfrak{E}\leavevmode\nobreak\ &\text{ if }\leavevmode\nobreak\ A_{t}=0\\ X_{t+1}\leavevmode\nobreak\ &\text{ if }\leavevmode\nobreak\ A_{t}=1\end{cases}, (1)

For model A, 𝒴={𝔈}\mathcal{Y}=\{\mathfrak{E}\} and for model B, 𝒴=𝒳{𝔈}\mathcal{Y}={\cal X}\cup\{\mathfrak{E}\}.

2.2 Partially-observable Restless Multi-armed Bandit Problem

A partially-observable restless multi-armed bandit (PO-RMAB) problem is a collection of nn independent restless bandits (𝒳i,{0,1},{Pi(a)}a{0,1},ci,π0i,𝒴i)(\mathcal{X}^{i},\{0,1\},\allowbreak\{P^{i}(a)\}_{a\in\{0,1\}},c^{i},\pi^{i}_{0},{\mathcal{Y}}^{i}), i𝒩{1,,n}i\in{\cal N}\coloneqq\{1,\ldots,n\}.

Let 𝓧i𝒩𝒳i\boldsymbol{\mathcal{X}}\coloneqq\prod_{i\in{\cal N}}{\mathcal{X}}^{i}, 𝒜(m){(a1,,an){0,1}n:i𝒩ai=m}{\mathbfcal A}(m)\coloneqq\biggl{\{}(a^{1},\ldots,a^{n})\in\{0,1\}^{n}:\sum_{i\in{\cal N}}a^{i}=m\biggr{\}}, and 𝓨i𝒩𝒴i\boldsymbol{\mathcal{Y}}\coloneqq\prod_{i\in{\cal N}}{\mathcal{Y}}^{i} denote the combined state, action, and observation spaces, respectively. Also, let 𝑿t=(Xt1,Xtn)𝓧\boldsymbol{X}_{t}=(X^{1}_{t},\dots X^{n}_{t})\in\boldsymbol{\mathcal{X}}, 𝑨t=(At1,,Atn)𝒜(m)\boldsymbol{A}_{t}=(A^{1}_{t},\dots,A^{n}_{t})\in{\mathbfcal A}(m), and 𝒀t=(Yt1,Ytn)𝓨\boldsymbol{Y}_{t}=(Y^{1}_{t},\dots Y^{n}_{t})\in\boldsymbol{\mathcal{Y}} denote the combined states, actions taken, and observations made by the operator at time t0t\geq 0. Due to the independent evolution of each arm, for each realization 𝒙0:t{\boldsymbol{x}}_{0:t} of 𝑿0:t{\boldsymbol{X}}_{0:t} and 𝒂0:t{\boldsymbol{a}}_{0:t} of 𝑨0:t{\boldsymbol{A}}_{0:t}, we have

(𝑿t+1=𝒙t+1|𝑿0:t=𝒙0:t,𝑨0:t=𝒂0:t)\displaystyle\mathds{P}(\boldsymbol{X}_{t+1}=\boldsymbol{x}_{t+1}|\boldsymbol{X}_{0:t}=\boldsymbol{x}_{0:t},\boldsymbol{A}_{0:t}=\boldsymbol{a}_{0:t}) =i𝒩(Xt+1i=xt+1i|Xti=xti,Ati=ati)\displaystyle=\prod_{i\in{\cal N}}\mathds{P}(X^{i}_{t+1}=x^{i}_{t+1}|X^{i}_{t}=x^{i}_{t},A^{i}_{t}=a^{i}_{t})
=i𝒩Pxti,xt+1ii(ati).\displaystyle=\prod_{i\in{\cal N}}P^{i}_{x^{i}_{t},x^{i}_{t+1}}(a^{i}_{t}).

Let 𝝅0=i𝒩π0i{\boldsymbol{\pi}}_{0}=\prod_{i\in{\cal N}}\pi^{i}_{0} denote the initial state distribution of all arms.

When the system is in state 𝒙t{\boldsymbol{x}}_{t} and action 𝒂t{\boldsymbol{a}}_{t} is taken, the system incurs a cost 𝒄(𝒙t,𝒂t)i𝒩ci(xti,ati).{\boldsymbol{c}}({\boldsymbol{x}}_{t},{\boldsymbol{a}}_{t})\coloneqq\sum_{i\in{\cal N}}c^{i}(x^{i}_{t},a^{i}_{t}). The action at time tt is chosen according to

𝑨t=𝒈t(𝒀0:t1,𝑨0:t1),\displaystyle{\boldsymbol{A}}_{t}={\boldsymbol{g}}_{t}({\boldsymbol{Y}}_{0:t-1},{\boldsymbol{A}}_{0:t-1}), (2)

where 𝒈t{\boldsymbol{g}}_{t} is a history-dependent policy at time tt. Let 𝒈=(g1,g2,){\boldsymbol{g}}=(g_{1},g_{2},\ldots) denote the policy for infinite time horizon and let 𝒢{\mathbfcal G} denote the family of all such policies. Then, the performance of policy 𝒈{\boldsymbol{g}} is given by

J(𝒈)(1β)𝔼[t=0βti𝒩ci(Xti,Ati)|X0iπ0i],J^{({\boldsymbol{g}})}\coloneqq(1-\beta)\mathds{E}\biggl{[}\sum_{t=0}^{\infty}\beta^{t}\sum_{i\in{\cal N}}c^{i}(X^{i}_{t},A^{i}_{t})\big{|}X^{i}_{0}\sim\pi^{i}_{0}\biggr{]}, (3)

where β(0,1)\beta\in(0,1) denotes the discount factor.

Formally, the optimization problem of interest is as follows:

Problem 1

Given a discount factor β(0,1)\beta\in(0,1), the total number nn of arms, the number mm to be selected, the system model {(𝒳i,{0,1},Pi(a),ci,𝒴i)}i𝒩\{(\mathcal{X}^{i},\{0,1\},P^{i}(a),c^{i},\mathcal{Y}^{i})\}_{i\in{\cal N}} of each arm, and the observation model at the operator, choose a history-dependent policy 𝐠𝒢{\boldsymbol{g}}\in{\mathbfcal G} that minimizes J(𝐠)J^{({\boldsymbol{g}})} given by (3).

Some remarks

  1. 1.

    Problem 1 is a POMDP and the standard methodology to solve POMDPs is to convert them to a fully observable Markov decision process (MDP) by viewing the “belief state” as the information state of the system [39]. We present such a belief state representation in Sec. 2.4 and point out its limitations in the context of restless bandits.

  2. 2.

    In Problem 1, the objective J(g)J^{(g)} depends on the initial state distribution (π0i)i𝒩(\pi^{i}_{0})_{i\in\mathcal{N}}. This can give the impression that the optimal policy may depend on the initial distribution. It is well known in the MDP literature that there exist policies that are optimal for all initial distributions [40]. However, our results rely on translating the belief state representation of the POMDP into a countable state MDP formulation and such a transformation is valid only when the initial state distribution is of a specific form. See Sec. 5 for details. Our results do not depend on the specific choice of the initial distribution, as long as it satisfies Assumption 2 specified in Sec. 5.

  3. 3.

In Problem 1, we consider the normalized expected discounted cost as the objective, where the discounted cost is multiplied by (1-\beta). In the MDP literature, one typically considers unnormalized objectives. However, the normalized objective is commonly used in constrained MDPs [41] and has also been used in some of the previous literature on restless bandits [31]. Note that multiplying the objective by a constant does not change the optimal policy. The reason we use a normalized expected discounted cost is that it simplifies the description of the adaptive greedy algorithms to compute the Whittle index presented in Sec. 5.

2.3 Some Examples

In this section, we present some examples corresponding to the model presented above.

Example 1

Consider a sensor network where there are n sensors, each observing an independent Markov process. We assume that the state \{S^{i}_{t}\}_{t\geq 1} of each Markov process is integer valued and evolves in an auto-regressive manner: S^{i}_{t+1}=S^{i}_{t}+W^{i}_{t}, where \{W^{i}_{t}\}_{t\geq 1} are i.i.d. processes which are also independent across sensors. An estimator can observe only m, where m<n, sensors at each time. If a sensor is observed, the state of the Markov process at that sensor is revealed to the estimator. If a sensor is not observed, the estimator gets no new information about its state and has to estimate the state based on previous observations. The objective is to determine a sensor scheduling policy to decide which sensors to observe at each time.

In this case, it can be shown that when the noise processes {Wti}t1\{W^{i}_{t}\}_{t\geq 1} have symmetric and unimodal distributions, the optimal estimation strategy is a Kalman-filter like strategy, i.e., the optimal estimate S^ti\hat{S}^{i}_{t} is StiS^{i}_{t} when the Markov process ii is transmitted, and is equal to the previous estimate S^t1i\hat{S}^{i}_{t-1} when the Markov process ii is not transmitted [42]. Thus, the error process EtiStiS^tiE^{i}_{t}\coloneqq S^{i}_{t}-\hat{S}^{i}_{t} has a restart property [43]. An instance of such a sensor network problem was considered in [44].

Example 2

Consider a maintenance company monitoring n machines which deteriorate independently over time. Each machine has multiple deterioration states sorted from pristine to ruined. There is a cost associated with running the machine, and this cost is a non-decreasing function of the state. If a machine is left un-monitored, its state deteriorates and, after a while, the machine is ruined. However, the state of the machine is not observed.

Furthermore, it is assumed that the company cannot observe the state of a machine unless it sends a service-person to visit the machine. Replacing a machine is relatively inexpensive, so when a service-person visits a machine, they simply replace it with a new one. Due to manufacturing defects, the machines may not all be in the pristine state when installed. If the service-person can observe the state of the machine when installing a new one, the observation model is the same as model B; otherwise, it is model A. There are m, where m<n, service-persons. The objective is to determine a scheduling policy to decide which machines should be serviced at each time. An instance of such a machine maintenance problem in the context of maintaining demand response devices was considered in [9].

Example 3

Consider the problem of resource-constrained health intervention delivery, where a community health center monitors n patients to check if they are adhering to the prescribed medication [35]. Each patient has a hidden binary state: “Adhering” or “Not Adhering”. There are m, where m<n, health workers, and if a health worker visits a patient, the state of the patient is observed. Moreover, it is assumed that after the visit by a health worker, the patient goes into the “Adhering” state. The objective is to determine a policy to schedule the health workers so as to maximize the number of patients in the “Adhering” state.

2.4 Belief State

Using standard results from Markov decision theory, the partially observable restless bandit problem can be converted to a fully observable restless bandit problem with beliefs (posterior distributions) as states. We present the details in this section. Define the operator's belief \Pi^{i}_{t}\in\mathcal{P}(\mathcal{X}^{i}) on the state of arm i at time t as follows: for any x^{i}_{t}\in\mathcal{X}^{i}, let \Pi^{i}_{t}(x^{i}_{t})\coloneqq\mathds{P}(X^{i}_{t}=x^{i}_{t}\,|\,Y^{i}_{0:t-1},A^{i}_{0:t-1}). Note that \Pi^{i}_{t} is a distribution-valued random variable. Also, define \boldsymbol{\Pi}_{t}\coloneqq(\Pi^{1}_{t},\dots,\Pi^{n}_{t}).

Then, for arm ii, the evolution of the belief state is as follows: for model A, the belief update rule is

Πt+1i={ΠtiP,if Ati=0,Q,if Ati=1,\Pi^{i}_{t+1}=\begin{cases}\Pi^{i}_{t}P,&\hbox{if }A^{i}_{t}=0,\\ Q,&\hbox{if }A^{i}_{t}=1,\end{cases} (4)

and for model B, the belief update rule is

Πt+1i={ΠtiP,if Ati=0,δXt+1ii where Xt+1iQ,if Ati=1\Pi^{i}_{t+1}=\begin{cases}\Pi^{i}_{t}P,&\hbox{if }A^{i}_{t}=0,\\ \delta^{i}_{X^{i}_{t+1}}\text{ where }X^{i}_{t+1}\sim Q,&\hbox{if }A^{i}_{t}=1\end{cases} (5)

where δx\delta_{x} is the Dirac delta distribution over the discrete state space 𝒳{\cal X} with the value of one only for state xx. The per-step cost function of the belief state Πti\Pi^{i}_{t} when action AtiA^{i}_{t} is taken is

c¯(Πti,Ati)=𝔼[cti(Xti,Ati)|Y0:t1i,A0:t1i]=x𝒳iΠti(x)ci(x,Ati).\bar{c}(\Pi^{i}_{t},A^{i}_{t})=\mathds{E}[c^{i}_{t}(X^{i}_{t},A^{i}_{t})|Y^{i}_{0:t-1},A^{i}_{0:t-1}]=\sum_{x\in{\cal X}^{i}}\Pi^{i}_{t}(x)c^{i}(x,A^{i}_{t}).
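
As an illustration, here is a short Python sketch (ours; the transition matrix P, reset distribution Q, and cost table c are illustrative assumptions, not taken from the paper) of the belief updates (4)-(5) and of the per-step belief cost \bar{c}:

import numpy as np

def belief_update_model_A(pi, a, P, Q):
    # Model A: the passive action propagates the belief, the active action resets it to Q.
    return pi @ P if a == 0 else Q.copy()

def belief_update_model_B(pi, a, P, Q, rng):
    # Model B: under the active action the post-reset state is observed, so the
    # belief becomes a point mass at a state sampled from Q (here we simulate it).
    if a == 0:
        return pi @ P
    x_next = rng.choice(len(Q), p=Q)
    delta = np.zeros(len(Q))
    delta[x_next] = 1.0
    return delta

def expected_cost(pi, a, c):
    # Per-step cost of belief pi under action a: sum_x pi(x) c(x, a).
    return pi @ c[:, a]

# Illustrative 3-state example.
P = np.array([[0.7, 0.2, 0.1], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]])
Q = np.array([0.9, 0.1, 0.0])
c = np.array([[0.0, 1.0], [2.0, 2.0], [5.0, 3.0]])
pi = Q @ P          # a belief reachable under model A one step after a reset
print(expected_cost(pi, 0, c), belief_update_model_A(pi, 0, P, Q))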

Define the combined belief state Θt𝒫(𝓧)\Theta_{t}\in\mathcal{P}(\boldsymbol{\cal X}) of the system as follows: for any 𝒙𝓧\boldsymbol{x}\in\boldsymbol{\cal X},

Θt(𝒙)=(𝑿t=𝒙|𝒀0:t1,𝑨0:t1).\Theta_{t}(\boldsymbol{x})=\mathds{P}(\boldsymbol{X}_{t}=\boldsymbol{x}\,|\,\boldsymbol{Y}_{0:t-1},\boldsymbol{A}_{0:t-1}).

Note that Θt\Theta_{t} is a random variable that takes values in 𝒫(𝓧)\mathcal{P}(\boldsymbol{\cal X}). Using standard results in POMDPs [39], we have the following.

Proposition 1

In Problem 1, Θt\Theta_{t} is a sufficient statistic for (𝐘0:t1,𝐀0:t1)(\boldsymbol{Y}_{0:t-1},\boldsymbol{A}_{0:t-1}). Therefore, there is no loss of optimality in restricting attention to decision policies of the form 𝐀t=gtbelief(Θt)\boldsymbol{A}_{t}=g^{\text{belief}}_{t}(\Theta_{t}). Furthermore, an optimal policy with this structure can be identified by solving an appropriate dynamic program.

Next, we present a first simplification of the structure of the optimal decision policy.

Proposition 2

For any 𝐱𝓧\boldsymbol{x}\in\boldsymbol{\cal X}, we have

Θt(𝒙)=i𝒩Πti(xi),a.s..\Theta_{t}(\boldsymbol{x})=\prod_{i\in{\cal N}}\Pi^{i}_{t}(x^{i}),\quad\text{a.s.}. (6)

Therefore, there is no loss of optimality in restricting attention to decision policies of the form 𝐀t=gtsimple(𝚷t)\boldsymbol{A}_{t}=g^{\text{simple}}_{t}(\boldsymbol{\Pi}_{t}). Furthermore, an optimal policy with this structure can be identified by solving an appropriate dynamic program.

Proof 1

Eq. (6) follows from the conditional independence of the arms and the nature of the observation function. The structure of the optimal policies then follows immediately from Proposition 1.

In Propositions 1 and 2, we do not present the dynamic programs because they suffer from the curse of dimensionality. In particular, obtaining the optimal policy for a PO-RMAB is PSPACE-hard [16]. So, we focus on the Whittle index heuristic to solve the problem.

3 Whittle index policy solution concept

For ease of notation, we drop the superscript i from all variables for the rest of this section and the next.

Consider an arm (𝒳,{0,1},{P¯(a)}a{0,1},c,π0,𝒴)(\mathcal{X},\{0,1\},\allowbreak\{\bar{P}(a)\}_{a\in\{0,1\}},c,\pi_{0},\mathcal{Y}) with a modified per-step cost function

c¯λ(π,a)c¯(π,a)+λa,π𝒫(𝒳),a{0,1},λ.{\bar{c}}_{\lambda}(\pi,a)\coloneqq{\bar{c}}(\pi,a)+\lambda a,\quad\forall\pi\in\mathcal{P}({\mathcal{X}}),\forall a\in\{0,1\},\lambda\in\mathds{R}. (7)

The modified cost function implies that there is a penalty of λ\lambda for taking the active action. Given any time-homogeneous policy g:𝒫(𝒳){0,1}g:\mathcal{P}({\mathcal{X}})\to\{0,1\}, the modified performance of the policy is

Jλ(g):=(1β)𝔼[t=0βtc¯λ(Πt,g(Πt))|Π0].J^{(g)}_{\lambda}:=(1-\beta)\mathds{E}\biggl{[}\sum_{t=0}^{\infty}\beta^{t}\bar{c}_{\lambda}(\Pi_{t},g(\Pi_{t}))\big{|}\Pi_{0}\biggr{]}. (8)

Subsequently, consider the following optimization problem.

Problem 2

Given an arm (𝒳,𝒴,{0,1},{P¯(a)}a{0,1},c)({\cal X},{\cal Y},\{0,1\},\{\bar{P}(a)\}_{a\in\{0,1\}},c), the discount factor β(0,1)\beta\in(0,1) and the penalty λ\lambda\in\mathds{R}, choose a Markov policy g:𝒫(𝒳){0,1}g:\mathcal{P}({\mathcal{X}})\to\{0,1\} to minimize Jλ(g)J^{(g)}_{\lambda} given by (8).

Problem 2 is a Markov decision process where one may use dynamic programming to obtain the optimal solution as follows.

Proposition 3

Consider the fixed point equation

Vλ(π)=mina{0,1}Hλ(π,a)V_{\lambda}(\pi)=\min_{a\in\{0,1\}}H_{\lambda}(\pi,a) (9)

where for Model A we have

Hλ(π,0)=(1β)c¯(π,0)+βVλ(πP),Hλ(π,1)=(1β)c¯(π,1)+(1β)λ+βVλ(Q)\displaystyle H_{\lambda}(\pi,0)=(1-\beta)\bar{c}(\pi,0)+\beta V_{\lambda}(\pi P),H_{\lambda}(\pi,1)=(1-\beta)\bar{c}(\pi,1)+(1-\beta)\lambda+\beta V_{\lambda}(Q)

and for Model B, we have

Hλ(π,0)=(1β)c¯(π,0)+βVλ(πP),Hλ(π,1)=(1β)c¯(π,1)+βx𝒳QxVλ(δx).\displaystyle H_{\lambda}(\pi,0)=(1-\beta)\bar{c}(\pi,0)+\beta V_{\lambda}(\pi P),H_{\lambda}(\pi,1)=(1-\beta)\bar{c}(\pi,1)+\beta\sum_{x\in{\cal X}}Q_{x}V_{\lambda}(\delta_{x}).

Then (9) has a unique fixed point VλV^{*}_{\lambda}, and the policy

gλ(π)={0, if Hλ(π,0)<Hλ(π,1)1, otherwise\displaystyle g_{\lambda}(\pi)=\begin{cases}0,&\text{ if }H_{\lambda}(\pi,0)<H_{\lambda}(\pi,1)\\ 1,&\text{ otherwise }\end{cases}

is optimal for Problem 2.

Proof 2

The result follows immediately from Markov decision theory [40].

Finally, we present the following definitions.

Definition 1 (Passive Set)

Given penalty λ\lambda, define the passive set 𝒲λ{\cal W}_{\lambda} as the set of states where passive action is optimal for the modified arm, i.e.,

𝒲λ:={πΠ:gλ(π)=0}.{\cal W}_{\lambda}:=\left\{\pi\in\Pi:g_{\lambda}(\pi)=0\right\}.
Definition 2 (Indexability)

An arm is indexable if 𝒲λ{\cal W}_{\lambda} is non-decreasing in λ\lambda, i.e., for any λ1,λ2\lambda_{1},\lambda_{2}\in\mathds{R},

λ1<λ2𝒲λ1𝒲λ2.\lambda_{1}<\lambda_{2}\implies{\cal W}_{\lambda_{1}}\subseteq{\cal W}_{\lambda_{2}}.

A restless multi-armed bandit problem is indexable if all nn arms are indexable.

Definition 3 (Whittle index)

The Whittle index of the state π\pi of an arm is the smallest value of λ\lambda for which state π\pi is part of the passive set 𝒲λ{\cal W}_{\lambda}, i.e.,

w(π)=inf{λ:π𝒲λ}.w(\pi)=\inf\left\{\lambda\in\mathds{R}:\pi\in{\cal W}_{\lambda}\right\}.

Equivalently, the Whittle index w(π)w(\pi) is the smallest value of λ\lambda for which the optimal policy is indifferent between the active action and passive action when the belief state of the arm is π\pi.

The Whittle index policy is as follows: at each time step, select the m arms whose current states have the highest indices. The Whittle index policy is easy to implement and efficient to compute, but it may not be optimal. As mentioned earlier, the Whittle index policy is optimal in certain cases [18, 19, 20, 21] and performs close to optimal in many other cases [22, 23, 10, 24, 25, 26].
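
Operationally, given the index of the current state of each arm, the policy simply activates the m arms with the largest indices. A minimal Python sketch (ours; the function name and the example numbers are assumptions) is:

import numpy as np

def whittle_index_policy(indices, m):
    # indices[i] = Whittle index of the current (belief) state of arm i.
    # Returns a 0/1 action vector with exactly m ones.
    indices = np.asarray(indices, dtype=float)
    selected = np.argsort(-indices)[:m]     # arms with the m highest indices
    actions = np.zeros(len(indices), dtype=int)
    actions[selected] = 1
    return actions

print(whittle_index_policy([0.3, -1.2, 0.9, 0.1], m=2))   # activates arms 2 and 0 (0-indexed)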

4 Indexability and the corresponding Whittle index for models A and B

Given an arm, let Σ\Sigma denote the family of all stopping times τ1\tau\geq 1, with respect to the natural filtration associated with {Πt}t0\{\Pi_{t}\}_{t\geq 0}. For any stopping time τΣ\tau\in\Sigma and an initial belief state πΠ\pi\in\Pi, define

L(π,τ)\displaystyle L(\pi,\tau) 𝔼[t=0τ1βtc¯(Πt,0)+βτc¯(Πτ,1)|Π0=π],\displaystyle\coloneqq\mathds{E}\bigg{[}\sum_{t=0}^{\tau-1}\beta^{t}{\bar{c}}(\Pi_{t},0)+\beta^{\tau}{\bar{c}}(\Pi_{\tau},1)\Bigm{|}\Pi_{0}=\pi\bigg{]},
B(π,τ)\displaystyle B(\pi,\tau) 𝔼[βτ|Π0=π].\displaystyle\coloneqq\mathds{E}[\beta^{\tau}|\Pi_{0}=\pi].
Theorem 1

The PO-RMAB for model A and B is indexable. In particular, each arm is indexable and the Whittle index is given by

w(π)=inf{λ:G(π)<Ωλ},w(\pi)=\inf\left\{\lambda\in\mathbb{R}:G(\pi)<\Omega_{\lambda}\right\},

where

G(π)\displaystyle G(\pi) (1β)infτΣL(π,τ)c¯(π,1)1B(π,τ),\displaystyle\coloneqq(1-\beta)\inf_{\tau\in\Sigma}\frac{L(\pi,\tau)-{\bar{c}}(\pi,1)}{1-B(\pi,\tau)}, (10)
Ωλ\displaystyle\Omega_{\lambda} λ+βV1next,\displaystyle\coloneqq\lambda+\beta V^{\textsc{next}}_{1}, (11)

and V1next=Vλ(Q)V^{\textsc{next}}_{1}=V_{\lambda}(Q) for model A and V1next=x𝒳QxVλ(δx)V^{\textsc{next}}_{1}=\sum_{x\in{\cal X}}Q_{x}V_{\lambda}(\delta_{x}) for model B.

Proof 3

We first show that V_{\lambda}(\pi) and \Omega_{\lambda} are non-decreasing in \lambda for any \pi\in\mathcal{P}(\mathcal{X}). For any policy g:\mathcal{P}(\mathcal{X})\to\{0,1\},

V^{(g)}_{\lambda}(\pi)=(1-\beta)\mathbb{E}\bigg[\sum_{t=0}^{\infty}\beta^{t}\bar{c}_{\lambda}(\Pi_{t},A_{t})\bigg|\Pi_{0}=\pi\bigg]

where \bar{c}_{\lambda}(\pi,a)=\bar{c}(\pi,a)+\lambda a is non-decreasing in \lambda for any \pi\in\mathcal{P}(\mathcal{X}) and a\in\{0,1\}. From Markov decision theory, we know that V_{\lambda}(\pi)=\inf_{g:\mathcal{P}(\mathcal{X})\to\{0,1\}}V^{(g)}_{\lambda}(\pi). Since the infimum of non-decreasing functions is non-decreasing, V_{\lambda}(\pi) is non-decreasing in \lambda for any \pi\in\mathcal{P}(\mathcal{X}). Consequently, V^{\textsc{next}}_{1} is non-decreasing in \lambda, which implies that \Omega_{\lambda} is non-decreasing in \lambda.

Given any stopping time τΣ\tau\in\Sigma, let hτh_{\tau} denote a policy that takes the passive action up to and including time τ1\tau-1, takes the active action at time τ\tau, and follows the optimal policy from time τ+1\tau+1 onwards. The performance Cλ(π,τ)C_{\lambda}(\pi,\tau) of policy hτh_{\tau} is given by

Cλ(π,τ)\displaystyle C_{\lambda}(\pi,\tau) =(1β)𝔼hτ[t=0βtcλ(πt,At)|Π0=π]=(1β)L(π,τ)+𝔼[βτΩλ|Π0=π]\displaystyle=(1-\beta)\mathds{E}^{h_{\tau}}\bigg{[}\sum_{t=0}^{\infty}\beta^{t}c_{\lambda}(\pi_{t},A_{t})\Bigm{|}\Pi_{0}=\pi\bigg{]}=(1-\beta)L(\pi,\tau)+\mathds{E}[\beta^{\tau}\Omega_{\lambda}|\Pi_{0}=\pi]
=(1β)L(π,τ)+B(π,τ)Ωλ.\displaystyle=(1-\beta)L(\pi,\tau)+B(\pi,\tau)\Omega_{\lambda}. (12)

We use h0h_{0} to denote a policy that takes active action at time 0 and follows the optimal policy from time 11 onwards. The performance Cλ(π,0)C_{\lambda}(\pi,0) of policy h0h_{0} is given by

Cλ(π,0)=(1β)c(π,1)+Ωλ.C_{\lambda}(\pi,0)=(1-\beta)c(\pi,1)+\Omega_{\lambda}. (13)

The next result generalizes [26, Lemma 2].

Lemma 1

The following characterizations of the passive sets are equivalent to Def. 1.

  1. 1.

    𝒲λ={πΠ:Hλ(π,0)<Hλ(π,1)}{\cal W}_{\lambda}=\left\{\pi\in\Pi:H_{\lambda}(\pi,0)<H_{\lambda}(\pi,1)\right\}.

  2. 2.

    𝒲λ={πΠ:σΣ such that Cλ(π,σ)<Cλ(π,0)}{\cal W}_{\lambda}=\left\{\pi\in\Pi:\exists\sigma\in\Sigma\text{ such that }C_{\lambda}(\pi,\sigma)<C_{\lambda}(\pi,0)\right\}.

  3. 3.

    𝒲λ={πΠ:G(π)<Ωλ}{\cal W}_{\lambda}=\left\{\pi\in\Pi:G(\pi)<\Omega_{\lambda}\right\}.

Proof 4

Characterization 1) follows from the dynamic program given in Proposition 3. Characterization 2) follows from the fact that Cλ(π,0)=Hλ(π,1)C_{\lambda}(\pi,0)=H_{\lambda}(\pi,1) and for π𝒲λ\pi\in{\cal W}_{\lambda}, Cλ(π,σ)=Hλ(π,0)C_{\lambda}(\pi,\sigma)=H_{\lambda}(\pi,0), where σ\sigma is the hitting time of the set 𝒫(𝒳)𝒲λ\mathcal{P}({\mathcal{X}})\setminus{\cal W}_{\lambda}. Characterization 3) follows from characterization 2) and rearranging the terms using (12) and (13).

Note that G(π)G(\pi) does not depend on λ\lambda while we showed that Ωλ\Omega_{\lambda} is non-decreasing in λ\lambda. Hence, 𝒲λ={πΠ:G(π)<Ωλ}{\cal W}_{\lambda}=\left\{\pi\in\Pi:G(\pi)<\Omega_{\lambda}\right\} is non-decreasing in λ\lambda by the lemma. Thus, arm ii is indexable. The expression for the Whittle index in the theorem follows from the definitions.

5 Whittle index computation

Computing the Whittle index using the belief state representation is intractable in general. Inspired by the approach taken in [45], we introduce a new information state which is equivalent to the belief state.

5.1 Countable Information state

For models A and B, define A={QPk:k0},B={δsPk:s𝒳,k0}.{\cal R}_{A}=\bigl{\{}QP^{k}:k\in\mathds{Z}_{\geq 0}\bigr{\}},\leavevmode\nobreak\ {\cal R}_{B}=\bigl{\{}\delta_{s}P^{k}:s\in{\cal X},k\in\mathds{Z}_{\geq 0}\bigr{\}}.

Assumption 2

For model A, π0A\pi_{0}\in{\cal R}_{A} and for model B, π0B\pi_{0}\in{\cal R}_{B}.

For model A, define a process {Kt}t0\{K_{t}\}_{t\geq 0} as follows. The initial state k0k_{0} is such that π0=QPk0\pi_{0}=QP^{k_{0}} and for t>0t>0, KtK_{t} is given by

Kt={0, if At1=1Kt1+1, if At1=0.K_{t}=\begin{cases}0,&\text{ if }A_{t-1}=1\\ K_{t-1}+1,&\text{ if }A_{t-1}=0.\end{cases} (14)

Similarly, for model B, define a process {St,Kt}t0\{S_{t},K_{t}\}_{t\geq 0} as follows. The initial state (s0,k0)(s_{0},k_{0}) is such that π0=δs0Pk0\pi_{0}=\delta_{s_{0}}P^{k_{0}} and for t>0t>0, KtK_{t} evolves according to (14) and StS_{t} evolves according to

St={Xt1 where Xt1Q, if At1=1St1, if At1=0.S_{t}=\begin{cases}X_{t-1}\text{ where }X_{t-1}\sim Q,&\text{ if }A_{t-1}=1\\ S_{t-1},&\text{ if }A_{t-1}=0.\end{cases} (15)

Note that, once arm i has been selected at least once, K_{t} denotes the time elapsed since the last time the arm was selected (and hence reset); in addition, in model B, S_{t} denotes the state observed at that time. Let \boldsymbol{S}_{t}\coloneqq(S^{1}_{t},\dots,S^{n}_{t}) and \boldsymbol{K}_{t}\coloneqq(K^{1}_{t},\dots,K^{n}_{t}). The relation between the belief state \Pi_{t} and the variables S_{t} and K_{t} is characterized in the following lemma.

Figure 1: Belief state dynamics for a 3-state arm in the simplex \mathcal{P}(\{1,2,3\}). Dashed arrows show a sample realization of the belief state evolution under A_{t}=0 for three time steps and the solid arrow shows a sample realization of the belief state evolution under A_{t}=1.
Lemma 2

The following statements hold under Assumption 2:

  • 1.

    For model A, for any tt, ΠtA\Pi_{t}\in{\cal R}_{A}. In particular, Πt=QPKt\Pi_{t}=QP^{K_{t}}.

  • 2.

    For model B, for any tt, ΠtB\Pi_{t}\in{\cal R}_{B}. In particular, Πt=δStPKt\Pi_{t}=\delta_{S_{t}}P^{K_{t}}.

Proof 5

The results immediately follow from (4)-(5) and (14)-(15).

For model A, the expected per-step cost at time tt may be written as

c¯(Kt,At)c¯(QPKt,At)=x𝒳[QPKt]xc(x,At).\displaystyle\bar{c}(K_{t},A_{t})\coloneqq\bar{c}(QP^{K_{t}},A_{t})=\sum_{x\in{\cal X}}[QP^{K_{t}}]_{x}c(x,A_{t}). (16)

Similarly, for model B, the expected per-step cost at time tt may be written as

\bar{c}(S_{t},K_{t},A_{t})\coloneqq\bar{c}(\delta_{S_{t}}P^{K_{t}},A_{t})=\sum_{x\in\mathcal{X}}[\delta_{S_{t}}P^{K_{t}}]_{x}c(x,A_{t}). (17)
Proposition 4

In Problem 1, there is no loss of optimality in restricting attention to decision policies of the form 𝐀t=gtinfo(𝐊t)\boldsymbol{A}_{t}=g^{\text{info}}_{t}(\boldsymbol{K}_{t}) for model A and of the form 𝐀t=gtinfo(𝐒t,𝐊t)\boldsymbol{A}_{t}=g^{\text{info}}_{t}(\boldsymbol{S}_{t},\boldsymbol{K}_{t}) for model B.

Proof 6

This result immediately follows from Lemma 2, (16) and (17).
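
In terms of the countable information state, the expected per-step costs (16)-(17) only require the beliefs QP^{k} and \delta_{s}P^{k}. A short Python sketch (ours; P, Q, and c are illustrative assumptions) is:

import numpy as np

def cbar_model_A(k, a, P, Q, c):
    # Expected per-step cost when the arm was last reset k steps ago (model A).
    belief = Q @ np.linalg.matrix_power(P, k)
    return belief @ c[:, a]

def cbar_model_B(s, k, a, P, c):
    # Expected per-step cost when state s was observed k steps ago (model B);
    # delta_s P^k is simply row s of P^k.
    belief = np.linalg.matrix_power(P, k)[s, :]
    return belief @ c[:, a]

# Illustrative usage with a 3-state example.
P = np.array([[0.7, 0.2, 0.1], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]])
Q = np.array([0.9, 0.1, 0.0])
c = np.array([[0.0, 1.0], [2.0, 2.0], [5.0, 3.0]])
print(cbar_model_A(2, 0, P, Q, c), cbar_model_B(0, 2, 1, P, c))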

5.2 Threshold policies

We assume that the model satisfies the following condition.

Assumption 3

Let c(x,a)=(1a)ϕ(x)+aρ(x)c(x,a)=(1-a)\phi(x)+a\rho(x) where ϕ:𝒳[0,ϕmax)\phi:{\cal X}\to[0,\phi_{\max}) and ρ:𝒳[0,ρmax)\rho:{\cal X}\to[0,\rho_{\max}) are non-decreasing functions in 𝒳{\cal X} and c(x,a)c(x,a) is submodular in (x,a)(x,a).

Under Assumption 3, we derive structural properties of the optimal policies for models A and B. Then, we show how Jλ(g)J^{(g)}_{\lambda} can be decomposed and computed.

In the following theorem, we show that the optimal policy for model A has a threshold structure and the optimal policy for model B has a threshold structure with respect to the second dimension of the information state.

Theorem 2

Under Assumptions 2 and 3, the following statements hold:

  1. 1.

    In model A, for any λ\lambda\in\mathds{R}, the optimal policy gλA(k)g^{A}_{\lambda}(k) is a threshold policy, i.e., there exists a threshold θλA1\theta^{A}_{\lambda}\in\mathds{Z}_{\geq-1} such that

    gλA(k)={0,k<θλA1,otherwise.\displaystyle g^{A}_{\lambda}(k)=\begin{cases}0,\leavevmode\nobreak\ k<\theta^{A}_{\lambda}\\ 1,\leavevmode\nobreak\ \text{otherwise}.\end{cases}

    Moreover, the threshold θλA\theta^{A}_{\lambda} is non-decreasing in λ\lambda.

  2. 2.

    In model B, for any λ\lambda\in\mathds{R}, the optimal policy gλB(s,k)g^{B}_{\lambda}(s,k) is a threshold policy with respect to kk for every s𝒳s\in{\cal X}, i.e., there exists a threshold θs,λB1\theta^{B}_{s,\lambda}\in\mathds{Z}_{\geq-1} for each s𝒳s\in{\cal X} such that

    gλB(s,k)={0,k<θs,λB1,otherwise.\displaystyle g^{B}_{\lambda}(s,k)=\begin{cases}0,\leavevmode\nobreak\ k<\theta^{B}_{s,\lambda}\\ 1,\leavevmode\nobreak\ \text{otherwise}.\end{cases}

    Moreover, for every s𝒳s\in{\cal X}, the threshold θs,λB\theta^{B}_{s,\lambda} is non-decreasing in λ\lambda.

See Section 6 for the proof.

We use 𝜽B{\boldsymbol{\theta}}^{B} to denote the vector (θsB)s𝒳(\theta^{B}_{s})_{s\in{\cal X}}.

5.3 Performance of threshold based policies

We simplify the notation and denote the policy corresponding to thresholds θA\theta^{A} and 𝜽B{\boldsymbol{\theta}}^{B} by simply θA\theta^{A} and 𝜽B\boldsymbol{\theta}^{B} instead of g(θA)g^{(\theta^{A})} and g(𝜽B)g^{(\boldsymbol{\theta}^{B})}.

5.3.1 Model A

Let Jλ(θA)(k)J^{(\theta^{A})}_{\lambda}(k) be the total discounted cost incurred under policy g(θA)g^{(\theta^{A})} with penalty λ\lambda when the initial state is kk, i.e.,

Jλ(θA)(k)(1β)𝔼[t=0βtc¯λ(Kt,g(θA)(Kt))|K0=k]D(θA)(k)+λN(θA)(k),\displaystyle J^{(\theta^{A})}_{\lambda}(k)\coloneqq(1-\beta)\mathds{E}\biggl{[}\sum_{t=0}^{\infty}\beta^{t}\bar{c}_{\lambda}(K_{t},g^{(\theta^{A})}(K_{t}))\Bigm{|}K_{0}=k\biggr{]}\eqqcolon D^{(\theta^{A})}(k)+\lambda N^{(\theta^{A})}(k), (18)

where

D(θA)(k)\displaystyle D^{(\theta^{A})}(k) (1β)𝔼[t=0βtc(Kt,g(θA)(Kt))|K0=k],\displaystyle\coloneqq(1-\beta)\mathds{E}\biggl{[}\sum_{t=0}^{\infty}\beta^{t}c(K_{t},g^{(\theta^{A})}(K_{t}))\Bigm{|}K_{0}=k\biggr{]},
N(θA)(k)\displaystyle N^{(\theta^{A})}(k) (1β)𝔼[t=0βtg(θA)(Kt)|K0=k].\displaystyle\coloneqq(1-\beta)\mathds{E}\biggl{[}\sum_{t=0}^{\infty}\beta^{t}g^{(\theta^{A})}(K_{t})\Bigm{|}K_{0}=k\biggr{]}.

D^{(\theta^{A})}(k) represents the expected total discounted cost, while N^{(\theta^{A})}(k) represents the expected discounted number of times the active action is selected (both normalized by 1-\beta), under policy g^{(\theta^{A})} starting from the initial information state k.

We will show (see Theorem 7) that the Whittle index for model A can be computed as a function of D(θA)(k)D^{({\theta^{A}})}(k) and N(θA)(k)N^{({\theta^{A}})}(k). First, we present a method to compute these two variables. Let

L(θA)(k)\displaystyle L^{(\theta^{A})}(k) (1β)t=kθA1βtkc¯(t,0)+(1β)βθAkc¯(θA,1)\displaystyle\coloneqq(1-\beta)\sum_{t=k}^{\theta^{A}-1}\beta^{t-k}\bar{c}(t,0)+(1-\beta)\beta^{\theta^{A}-k}\bar{c}(\theta^{A},1)
M(θA)(k)\displaystyle M^{(\theta^{A})}(k) (1β)βθAk\displaystyle\coloneqq(1-\beta)\beta^{\theta^{A}-k}

where L(θA)(k)L^{(\theta^{A})}(k) and M(θA)(k)M^{(\theta^{A})}(k), respectively, denote the expected discounted cost and time starting from information state kk until reaching information state θA\theta^{A} for the first time.

Theorem 3

For any k0k\in\mathds{Z}_{\geq 0}, we have

D(θA)(k)=L(θA)(k)+βθAk+1L(θA)(0)1βθA+1,\displaystyle D^{(\theta^{A})}(k)=L^{(\theta^{A})}(k)+\beta^{\theta^{A}-k+1}\dfrac{L^{(\theta^{A})}(0)}{1-\beta^{\theta^{A}+1}},
N(θA)(k)=M(θA)(k)+βθAk+1M(θA)(0)1βθA+1.\displaystyle N^{(\theta^{A})}(k)=M^{(\theta^{A})}(k)+\beta^{\theta^{A}-k+1}\dfrac{M^{(\theta^{A})}(0)}{1-\beta^{\theta^{A}+1}}.

See Section 6 for the proof.
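
As a sanity check of Theorem 3, the closed-form expressions for D^{(\theta^{A})}(k) and N^{(\theta^{A})}(k) can be evaluated directly. The following Python sketch (ours) assumes an initial information state k \leq \theta^{A} and takes \bar{c}(k,a) as a function argument (e.g., cbar_model_A from the earlier sketch with P, Q, c fixed):

def L_and_M(k, theta, beta, cbar):
    # Discounted cost and discount weight accumulated from state k until the
    # first activation, which occurs at information state theta.
    L = (1 - beta) * sum(beta**(t - k) * cbar(t, 0) for t in range(k, theta)) \
        + (1 - beta) * beta**(theta - k) * cbar(theta, 1)
    M = (1 - beta) * beta**(theta - k)
    return L, M

def D_and_N(k, theta, beta, cbar):
    # Closed-form expressions of Theorem 3 (assumes 0 <= k <= theta).
    L_k, M_k = L_and_M(k, theta, beta, cbar)
    L_0, M_0 = L_and_M(0, theta, beta, cbar)
    geom = beta**(theta - k + 1) / (1 - beta**(theta + 1))
    return L_k + geom * L_0, M_k + geom * M_0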

5.3.2 Model B

Let Jλ(𝜽B)(s,k)J^{({\boldsymbol{\theta}}^{B})}_{\lambda}(s,k) be the total discounted cost incurred under policy g(𝜽B)g^{({\boldsymbol{\theta}}^{B})} with penalty λ\lambda when the initial information state is (s,k)(s,k), i.e.,

Jλ(𝜽B)(s,k)\displaystyle J^{({\boldsymbol{\theta}}^{B})}_{\lambda}(s,k) =(1β)𝔼[t=0βtc¯λ(St,Kt,g(𝜽B)(St,Kt))|(S0,K0)=(s,k)]\displaystyle=(1-\beta)\mathds{E}\biggl{[}\sum_{t=0}^{\infty}\beta^{t}\bar{c}_{\lambda}(S_{t},K_{t},g^{({\boldsymbol{\theta}}^{B})}(S_{t},K_{t}))\Bigm{|}(S_{0},K_{0})=(s,k)\biggr{]}
D(𝜽B)(s,k)+λN(𝜽B)(s,k),\displaystyle\eqqcolon D^{({\boldsymbol{\theta}}^{B})}(s,k)+\lambda N^{({\boldsymbol{\theta}}^{B})}(s,k), (19)

where

D(𝜽B)(s,k)\displaystyle D^{({\boldsymbol{\theta}}^{B})}(s,k) (1β)𝔼[t=0βtc¯(St,Kt,g(𝜽B)(St,Kt))|(S0,K0)=(s,k)],\displaystyle\coloneqq(1-\beta)\mathds{E}\biggl{[}\sum_{t=0}^{\infty}\beta^{t}\bar{c}(S_{t},K_{t},g^{({\boldsymbol{\theta}}^{B})}(S_{t},K_{t}))\Bigm{|}(S_{0},K_{0})=(s,k)\biggr{]},
N(𝜽B)(s,k)\displaystyle N^{({\boldsymbol{\theta}}^{B})}(s,k) (1β)𝔼[t=0βtg(𝜽B)(St,Kt)|(S0,K0)=(s,k)].\displaystyle\coloneqq(1-\beta)\mathds{E}\biggl{[}\sum_{t=0}^{\infty}\beta^{t}g^{({\boldsymbol{\theta}}^{B})}(S_{t},K_{t})\Bigm{|}(S_{0},K_{0})=(s,k)\biggr{]}.

D^{(\boldsymbol{\theta}^{B})}(s,k) and N^{(\boldsymbol{\theta}^{B})}(s,k) have the same interpretations as their counterparts for model A. We will show (see Theorem 8) that the Whittle index for model B can be computed as a function of D^{(\boldsymbol{\theta}^{B})}(s,k) and N^{(\boldsymbol{\theta}^{B})}(s,k). We first define the vector \boldsymbol{J}^{(\boldsymbol{\theta}^{B})}_{\lambda}(0)=(J^{(\boldsymbol{\theta}^{B})}_{\lambda}(1,0),\ldots,J^{(\boldsymbol{\theta}^{B})}_{\lambda}(|\mathcal{X}|,0)) and the vectors \boldsymbol{D}^{(\boldsymbol{\theta}^{B})}(0) and \boldsymbol{N}^{(\boldsymbol{\theta}^{B})}(0) in a similar manner. Then, from (19), \boldsymbol{J}^{(\boldsymbol{\theta}^{B})}_{\lambda}(0)=\boldsymbol{D}^{(\boldsymbol{\theta}^{B})}(0)+\lambda\boldsymbol{N}^{(\boldsymbol{\theta}^{B})}(0). We also define

L(𝜽B)(s,k)\displaystyle L^{({\boldsymbol{\theta}}^{B})}(s,k) (1β)t=kθsB1βtkc¯(s,t,0)+(1β)βθsBkc¯(s,θsB,1),\displaystyle\coloneqq(1-\beta)\sum_{t=k}^{\theta^{B}_{s}-1}\beta^{t-k}\bar{c}(s,t,0)+(1-\beta)\beta^{\theta^{B}_{s}-k}\bar{c}(s,\theta^{B}_{s},1),
M(𝜽B)(s,k)\displaystyle M^{({\boldsymbol{\theta}}^{B})}(s,k) (1β)βθsBk.\displaystyle\coloneqq(1-\beta)\beta^{\theta^{B}_{s}-k}.

Let 𝑳(𝜽B)(0)=(L(𝜽B)(1,0),,L(𝜽B)(|𝒳|,0))\boldsymbol{L}^{({\boldsymbol{\theta}}^{B})}(0)=(L^{({\boldsymbol{\theta}}^{B})}(1,0),\ldots,L^{({\boldsymbol{\theta}}^{B})}(|{\cal X}|,0)) and 𝑴(𝜽B)(0)=(M(𝜽B)(1,0),,M(𝜽B)(|𝒳|,0))\boldsymbol{M}^{({\boldsymbol{\theta}}^{B})}(0)=(M^{({\boldsymbol{\theta}}^{B})}(1,0),\ldots,M^{({\boldsymbol{\theta}}^{B})}(|{\cal X}|,0)).

Theorem 4

For any (s,k)𝒳×0(s,k)\in{\cal X}\times\mathds{Z}_{\geq 0}, we have

D(𝜽B)(s,k)=L(𝜽B)(s,k)+βθsBk+1r𝒳QrD(𝜽B)(r,0),\displaystyle D^{({\boldsymbol{\theta}}^{B})}(s,k)=L^{({\boldsymbol{\theta}}^{B})}(s,k)+\beta^{\theta^{B}_{s}-k+1}\sum_{r\in{\cal X}}Q_{r}D^{({\boldsymbol{\theta}}^{B})}(r,0),
N(𝜽B)(s,k)=M(𝜽B)(s,k)+βθsBk+1r𝒳QrN(𝜽B)(r,0).\displaystyle N^{({\boldsymbol{\theta}}^{B})}(s,k)=M^{({\boldsymbol{\theta}}^{B})}(s,k)+\beta^{\theta^{B}_{s}-k+1}\sum_{r\in{\cal X}}Q_{r}N^{({\boldsymbol{\theta}}^{B})}(r,0).

Let Z(𝛉B)Z^{({\boldsymbol{\theta}}^{B})} be a |𝒳|×|𝒳||{\cal X}|\times|{\cal X}| matrix where Zsr(𝛉B)=βθsB+1QrZ^{({\boldsymbol{\theta}}^{B})}_{sr}=\beta^{\theta^{B}_{s}+1}Q_{r}, for any s,r𝒳s,r\in{\cal X}. Then,

𝑫(𝜽B)(0)=(IZ(𝜽B))1𝑳(𝜽B)(0),\displaystyle\boldsymbol{D}^{({\boldsymbol{\theta}}^{B})}(0)=(I-Z^{({\boldsymbol{\theta}}^{B})})^{-1}\boldsymbol{L}^{({\boldsymbol{\theta}}^{B})}(0),
𝑵(𝜽B)(0)=(IZ(𝜽B))1𝑴(𝜽B)(0).\displaystyle\boldsymbol{N}^{({\boldsymbol{\theta}}^{B})}(0)=(I-Z^{({\boldsymbol{\theta}}^{B})})^{-1}\boldsymbol{M}^{({\boldsymbol{\theta}}^{B})}(0).

See Section 6 for the proof.
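
The linear system in Theorem 4 is small (one equation per last observed state), so \boldsymbol{D}^{(\boldsymbol{\theta}^{B})}(0) and \boldsymbol{N}^{(\boldsymbol{\theta}^{B})}(0) can be obtained by a single matrix solve. A Python sketch (ours; theta is the vector of thresholds and cbar(s, k, a) is the cost (17)) is:

import numpy as np

def D_and_N_model_B(theta, beta, Q, cbar):
    # Returns the vectors D(., 0) and N(., 0) of Theorem 4.
    n = len(Q)
    L0 = np.array([
        (1 - beta) * sum(beta**t * cbar(s, t, 0) for t in range(theta[s]))
        + (1 - beta) * beta**theta[s] * cbar(s, theta[s], 1)
        for s in range(n)
    ])
    M0 = (1 - beta) * beta**np.array(theta, dtype=float)
    Z = (beta**(np.array(theta, dtype=float) + 1))[:, None] * np.asarray(Q)[None, :]
    D0 = np.linalg.solve(np.eye(n) - Z, L0)
    N0 = np.linalg.solve(np.eye(n) - Z, M0)
    # For a general state (s, k) with k <= theta[s]:
    #   D(s, k) = L(s, k) + beta**(theta[s] - k + 1) * Q @ D0, and similarly for N.
    return D0, N0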

5.4 Finite state approximation

To compute the Whittle index, we provide a finite-state approximation of Proposition 3 for models A and B. Essentially, we truncate the countable set of possible information states K_{t} to a finite set and provide an approximation bound on the optimal value function for each model.

Theorem 5 (Model A)

Given \ell\in\mathds{N}, let :={0,,}\mathds{N}_{\ell}:=\{0,\ldots,\ell\} and V,λ:V_{\ell,\lambda}:\mathds{N}_{\ell}\to\mathds{R} be the unique fixed point of equation

V,λ(k)=mina{0,1}H,λ(k,a),g^,λ(k)=argmina{0,1}H,λ(k,a)V_{\ell,\lambda}(k)=\min_{a\in\{0,1\}}H_{\ell,\lambda}(k,a),\leavevmode\nobreak\ \hat{g}_{\ell,\lambda}(k)=\operatorname*{arg\,min}_{a\in\{0,1\}}H_{\ell,\lambda}(k,a)

where

H,λ(k,0)\displaystyle H_{\ell,\lambda}(k,0) =(1β)c¯(k,0)+βV,λ(min{k+1,}),\displaystyle=(1-\beta)\bar{c}(k,0)+\beta V_{\ell,\lambda}(\min\{k+1,\ell\}),
H,λ(k,1)\displaystyle H_{\ell,\lambda}(k,1) =(1β)c¯(k,1)+(1β)λ+βV,λ(0).\displaystyle=(1-\beta)\bar{c}(k,1)+(1-\beta)\lambda+\beta V_{\ell,\lambda}(0).

We set g^,λ(k)=1\hat{g}_{\ell,\lambda}(k)=1 if H,λ(k,0)=H,λ(k,1)H_{\ell,\lambda}(k,0)=H_{\ell,\lambda}(k,1). Then, we have the following:
(i) For any 0k0\leq k\leq\ell, we have

|Vλ(k)V,λ(k)|βk+1span(cλ)1β.|V_{\lambda}(k)-V_{\ell,\lambda}(k)|\leq\dfrac{\beta^{\ell-k+1}\operatorname{span}(c_{\lambda})}{1-\beta}.

(ii) For all k0k\in\mathds{Z}_{\geq 0}, limV,λ(k)=Vλ(k)\lim_{\ell\to\infty}V_{\ell,\lambda}(k)=V_{\lambda}(k). Moreover, let g^λ()\hat{g}^{*}_{\lambda}(\cdot) be any limit point of {g^,λ()}1\{\hat{g}_{\ell,\lambda}(\cdot)\}_{\ell\geq 1}. Then, the policy g^λ()\hat{g}^{*}_{\lambda}(\cdot) is optimal for Problem 2.

See Section 6 for the proof.
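
For reference, the truncated dynamic program of Theorem 5 can be solved by standard value iteration on the finite set \{0,\ldots,\ell\}. A Python sketch (ours; cbar(k, a) is the cost (16)) is:

import numpy as np

def truncated_dp_model_A(ell, lam, beta, cbar, iters=100_000, tol=1e-10):
    # Value iteration for the fixed point in Theorem 5; ties are broken in
    # favour of the active action, as in the theorem.
    V = np.zeros(ell + 1)
    for _ in range(iters):
        H0 = np.array([(1 - beta) * cbar(k, 0) + beta * V[min(k + 1, ell)]
                       for k in range(ell + 1)])
        H1 = np.array([(1 - beta) * cbar(k, 1) + (1 - beta) * lam + beta * V[0]
                       for k in range(ell + 1)])
        V_new = np.minimum(H0, H1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = (H1 <= H0).astype(int)
    return V, policy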

Theorem 6 (Model B)

Given \ell\in\mathds{N}, let :={0,,}\mathds{N}_{\ell}:=\{0,\ldots,\ell\} and V,λ:𝒳×V_{\ell,\lambda}:{\cal X}\times\mathds{N}_{\ell}\to\mathds{R} be the unique fixed point of equation

V,λ(s,k)\displaystyle V_{\ell,\lambda}(s,k) =mina{0,1}H,λ(s,k,a),g^,λ(s,k)=argmina{0,1}H,λ(s,k,a)\displaystyle=\min_{a\in\{0,1\}}H_{\ell,\lambda}(s,k,a),\leavevmode\nobreak\ \hat{g}_{\ell,\lambda}(s,k)=\operatorname*{arg\,min}_{a\in\{0,1\}}H_{\ell,\lambda}(s,k,a)

where

H,λ(s,k,0)\displaystyle H_{\ell,\lambda}(s,k,0) =(1β)c¯(s,k,0)+βV,λ(s,min{k+1,}),\displaystyle=(1-\beta)\bar{c}(s,k,0)+\beta V_{\ell,\lambda}(s,\min\{k+1,\ell\}),
H,λ(s,k,1)\displaystyle H_{\ell,\lambda}(s,k,1) =(1β)c¯(s,k,1)+(1β)λ+βx𝒳~QxV,λ(x,0).\displaystyle=(1-\beta)\bar{c}(s,k,1)+(1-\beta)\lambda+\beta\sum_{x^{\prime}\in\tilde{\cal X}}Q_{x^{\prime}}V_{\ell,\lambda}(x^{\prime},0).

We set g^,λ(s,k)=1\hat{g}_{\ell,\lambda}(s,k)=1 if H,λ(s,k,0)=H,λ(s,k,1)H_{\ell,\lambda}(s,k,0)=H_{\ell,\lambda}(s,k,1). Then, we have the following:
(i) For any 0k0\leq k\leq\ell,

|Vλ(s,k)V,λ(s,k)|βk+1span(cλ)1β,s𝒳.|V_{\lambda}(s,k)-V_{\ell,\lambda}(s,k)|\leq\dfrac{\beta^{\ell-k+1}\operatorname{span}(c_{\lambda})}{1-\beta},\forall s\in{\cal X}.

(ii) For all (s,k)𝒳×0(s,k)\in{\cal X}\times\mathds{Z}_{\geq 0}, limV,λ(s,k)=Vλ(s,k)\lim_{\ell\to\infty}V_{\ell,\lambda}(s,k)=V_{\lambda}(s,k). Let g^λ(,)\hat{g}^{*}_{\lambda}(\cdot,\cdot) be any limit point of {g^,λ(,)}1\{\hat{g}_{\ell,\lambda}(\cdot,\cdot)\}_{\ell\geq 1}. Then, the policy g^λ(,)\hat{g}^{*}_{\lambda}(\cdot,\cdot) is optimal for Problem 2.

See Section 6 for the proof.

Due to Theorems 5 and 6, we can restrict the countable part of the information state to a finite set, \mathds{N}_{\ell}.

5.5 Computation of Whittle index

Next, we derive a closed form expression to compute the Whittle index for model A and provide an efficient algorithm to compute the Whittle index for model B.

5.5.1 Whittle index formula for model A.

For model A, we obtain the Whittle index formula based on the two variables D(θA)()D^{(\theta^{A})}(\cdot) and N(θA)()N^{(\theta^{A})}(\cdot) as follows.

Theorem 7

Let ΛkA={k0{0,1,,(+1)1}:N(k)(k0)N(k+1)(k0)}\Lambda^{A}_{k}=\{k_{0}\in\{0,1,\ldots,(\ell+1)-1\}:N^{(k)}(k_{0})\neq N^{(k+1)}(k_{0})\}. Then, under Assumption 3, ΛkA\Lambda^{A}_{k}\neq\emptyset, and the Whittle index of model A at information state kk\in\mathds{N}_{\ell} is

wA(k)=mink0ΛkAD(k+1)(k0)D(k)(k0)N(k)(k0)N(k+1)(k0).w^{A}(k)=\min_{k_{0}\in\Lambda^{A}_{k}}\dfrac{D^{(k+1)}(k_{0})-D^{(k)}(k_{0})}{N^{(k)}(k_{0})-N^{(k+1)}(k_{0})}. (20)
Proof 7

Since model A is a restart model, the result follows from [31, Lemma 4].

Theorem 7 gives a closed-form expression to approximately compute the Whittle index for model A.
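
A direct way to evaluate formula (20) numerically is to compute D^{(\theta)} and N^{(\theta)} for the threshold policies \theta=k and \theta=k+1 and take the minimum ratio over the initial states where the two policies differ. The sketch below (ours) evaluates the threshold policies by solving the policy-evaluation linear system on the truncated chain \{0,\ldots,\ell\} rather than using the closed form of Theorem 3, which avoids special-casing initial states above the threshold; cbar(k, a) is the cost (16):

import numpy as np

def evaluate_threshold_model_A(theta, ell, beta, cbar):
    # D and N of the policy "activate iff K >= theta" on the truncated chain.
    n = ell + 1
    Pg = np.zeros((n, n))
    cD = np.zeros(n)
    cN = np.zeros(n)
    for k in range(n):
        a = 1 if k >= theta else 0
        Pg[k, 0 if a == 1 else min(k + 1, ell)] = 1.0
        cD[k] = cbar(k, a)
        cN[k] = a
    D = np.linalg.solve(np.eye(n) - beta * Pg, (1 - beta) * cD)
    N = np.linalg.solve(np.eye(n) - beta * Pg, (1 - beta) * cN)
    return D, N

def whittle_index_model_A(k, ell, beta, cbar):
    # Formula (20): minimum ratio over initial states where N^(k) and N^(k+1) differ.
    D_k, N_k = evaluate_threshold_model_A(k, ell, beta, cbar)
    D_k1, N_k1 = evaluate_threshold_model_A(k + 1, ell, beta, cbar)
    mask = ~np.isclose(N_k, N_k1)
    return np.min((D_k1[mask] - D_k[mask]) / (N_k[mask] - N_k1[mask]))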

5.5.2 Modified adaptive greedy algorithm for model B.

Let B=|𝒳|(+1)B=|{\cal X}|(\ell+1) and BD(B)B_{D}(\leq B) denote the number of distinct Whittle indices. Let Λ={λ0,λ1,,λBD}\Lambda^{*}=\{\lambda_{0},\lambda_{1},\ldots,\lambda_{B_{D}}\} where λ1<λ2<<λBD\lambda_{1}<\lambda_{2}<\ldots<\lambda_{B_{D}} denote the sorted distinct Whittle indices with λ0=\lambda_{0}=-\infty. Let 𝒲b:={(s,k)𝒳×:w(s,k)λb}{\cal W}_{b}:=\{(s,k)\in{\cal X}\times\mathds{N}_{\ell}:w(s,k)\leq\lambda_{b}\}. For any subset 𝒮𝒳×{\cal S}\subseteq{\cal X}\times\mathds{N}_{\ell}, define the policy g¯(𝒮):𝒳×{0,1}\bar{g}^{({\cal S})}:{\cal X}\times\mathds{N}_{\ell}\to\{0,1\} as

g¯(𝒮)(s,k)={0, if (s,k)𝒮1, if (s,k)(𝒳×)\𝒮.\bar{g}^{(\cal S)}(s,k)=\begin{cases}0,\leavevmode\nobreak\ \text{ if }(s,k)\in{\cal S}\\ 1,\leavevmode\nobreak\ \text{ if }(s,k)\in({\cal X}\times\mathds{N}_{\ell})\backslash{\cal S}.\end{cases}

Given 𝒲b{\cal W}_{b}, define Φb={(s,k)(𝒳×)𝒲b:(s,max{0,k1})𝒲b}\Phi_{b}=\{(s,k)\in({\cal X}\times\mathds{N}_{\ell})\setminus{\cal W}_{b}:(s,\max\{0,k-1\})\in{\cal W}_{b}\} and Γb+1=𝒲b+1\𝒲b\Gamma_{b+1}={\cal W}_{b+1}\backslash{\cal W}_{b}. Additionally, for any b{0,,BD1}b\in\{0,\ldots,B_{D}-1\}, and all states yΦby\in\Phi_{b}, define hb=g¯(𝒲b)h_{b}=\bar{g}^{({\cal W}_{b})}, hb,y=g¯(𝒲b{y})h_{b,y}=\bar{g}^{({\cal W}_{b}\cup\{y\})} and Λb,y={(x,k)(𝒳×):N(hb)(x,k)N(hb,y)(x,k)}\Lambda_{b,y}=\{(x,k)\in({\cal X}\times\mathds{N}_{\ell}):N^{(h_{b})}(x,k)\neq N^{({h}_{b,y})}(x,k)\}. Then, for all (x,k)Λb,y(x,k)\in\Lambda_{b,y}, define

μb,y(x,k)=D(hb,y)(x,k)D(hb)(x,k)N(hb)(x,k)N(hb,y)(x,k).\displaystyle\mu_{b,y}(x,k)=\dfrac{D^{({h}_{b,y})}(x,k)-D^{(h_{b})}(x,k)}{N^{(h_{b})}(x,k)-N^{({h}_{b,y})}(x,k)}. (21)
Lemma 3

For b\in\{0,\ldots,B_{D}-1\}, we have the following:

  1. 1.

    For all yΓb+1y\in\Gamma_{b+1}, we have w(y)=λb+1w(y)=\lambda_{b+1}.

  2. 2.

    For all yΦby\in\Phi_{b} and λ(λb,λb+1]\lambda\in(\lambda_{b},\lambda_{b+1}], we have Jλ(hb,y)(x)Jλ(hb)(x)J^{(h_{b,y})}_{\lambda}(x)\geq J^{(h_{b})}_{\lambda}(x) for all x𝒳x\in{\cal X} with equality if and only if y𝒲b+1\𝒲by\in{\cal W}_{b+1}\backslash{\cal W}_{b} and λ=λb+1\lambda=\lambda_{b+1}.

Proof 8

The result follows from [31, Lemma 3]. The only difference is that since we know from Theorem 2 that the optimal policy is a threshold policy with respect to the second dimension, we restrict to yΦby\in\Phi_{b}.

Theorem 8

The following properties hold:

  1. 1.

    For any yΓb+1y\in\Gamma_{b+1}, the set Λb,y\Lambda_{b,y} is non-empty.

  2. 2.

    For any xΛb,yx\in\Lambda_{b,y}, μb,y(x)λb+1\mu_{b,y}(x)\geq\lambda_{b+1} with equality if and only if yΓb+1y\in\Gamma_{b+1}.

Proof 9

The result follows from [31, Theorem 2]. Similar to Lemma 3, we consider yΦby\in\Phi_{b}.

input : RB (𝒳,{0,1},P,Q,c,ρ)(\mathcal{X},\{0,1\},P,Q,c,\rho), discount factor β\beta.
Initialize b=0b=0, 𝒲b={\cal W}_{b}=\emptyset.
while 𝒲b𝒳×{\cal W}_{b}\neq{\cal X}\times\mathds{N}_{\ell} do
       Compute Λb,y\Lambda_{b,y} and μb,y(x)\mu_{b,y}(x) using (21), yΦb\forall y\in\Phi_{b}.
       Compute μb,y=minxΛb,yμb,y(x)\mu^{*}_{b,y}=\min_{x\in\Lambda_{b,y}}\mu_{b,y}(x), yΦb\forall y\in\Phi_{b}.
       Compute λb+1=minyΦbμb,y\lambda_{b+1}=\min_{y\in\Phi_{b}}\mu^{*}_{b,y}.
       Compute Γb+1=argminyΦbμb,y\Gamma_{b+1}=\arg\min_{y\in\Phi_{b}}\mu^{*}_{b,y}.
       Set w(z)=λb+1w(z)=\lambda_{b+1}, zΓb+1\forall z\in\Gamma_{b+1}.
       Set 𝒲b+1=𝒲bΓb+1{\cal W}_{b+1}={\cal W}_{b}\cup\Gamma_{b+1}.
       Set b=b+1b=b+1.
      
Algorithm 1 Computing Whittle index of all information states of model B

By Theorem 8, we can find the Whittle indices iteratively. This approach is summarized in Algorithm 1. For a computationally-efficient implementation using the Sherman-Morrison formula, see [31, Algorithm 2].
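
For completeness, the following Python sketch (ours) implements the main loop of Algorithm 1 on the truncated space \mathcal{X}\times\{0,\ldots,\ell\}. Policies are represented by their passive sets and evaluated by solving the policy-evaluation linear system directly, which is simpler (though less efficient) than the Sherman-Morrison-based implementation in [31, Algorithm 2]; following the threshold structure of Theorem 2, the candidate states considered at each step are, for each s, the smallest k not yet in the passive set. All function names are ours.

import numpy as np

def evaluate_passive_set(W, S, ell, beta, Q, cbar):
    # D and N (as dicts keyed by (s, k)) for the policy: passive iff (s, k) in W.
    states = [(s, k) for s in range(S) for k in range(ell + 1)]
    idx = {x: i for i, x in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))
    cD = np.zeros(n)
    cN = np.zeros(n)
    for (s, k), i in idx.items():
        a = 0 if (s, k) in W else 1
        cD[i] = cbar(s, k, a)
        cN[i] = a
        if a == 0:
            P[i, idx[(s, min(k + 1, ell))]] = 1.0
        else:
            for r in range(S):
                P[i, idx[(r, 0)]] = Q[r]
    D = np.linalg.solve(np.eye(n) - beta * P, (1 - beta) * cD)
    N = np.linalg.solve(np.eye(n) - beta * P, (1 - beta) * cN)
    return {x: D[i] for x, i in idx.items()}, {x: N[i] for x, i in idx.items()}

def whittle_indices_model_B(S, ell, beta, Q, cbar):
    all_states = {(s, k) for s in range(S) for k in range(ell + 1)}
    W, w = set(), {}
    while W != all_states:
        D_b, N_b = evaluate_passive_set(W, S, ell, beta, Q, cbar)
        # Candidate (frontier) states: smallest k not yet passive, for each s.
        frontier = []
        for s in range(S):
            ks = [k for k in range(ell + 1) if (s, k) not in W]
            if ks:
                frontier.append((s, min(ks)))
        best = {}
        for y in frontier:
            D_y, N_y = evaluate_passive_set(W | {y}, S, ell, beta, Q, cbar)
            ratios = [(D_y[x] - D_b[x]) / (N_b[x] - N_y[x])
                      for x in all_states if not np.isclose(N_b[x], N_y[x])]
            if ratios:
                best[y] = min(ratios)           # this is mu*_{b, y} from (21)
        lam = min(best.values())
        Gamma = {y for y, v in best.items() if np.isclose(v, lam)}
        for z in Gamma:
            w[z] = lam
        W |= Gamma
    return w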

6 Proof of Main Results

6.1 Proof of Theorem 2

Let \mu^{1} and \mu^{2} be two probability mass functions on a totally ordered set \tilde{\mathcal{X}}. We say \mu^{1} stochastically dominates \mu^{2} if for all x\in\tilde{\mathcal{X}}, \sum_{z\in\tilde{\mathcal{X}}_{\geq x}}\mu^{1}_{z}\geq\sum_{z\in\tilde{\mathcal{X}}_{\geq x}}\mu^{2}_{z}. Given two |\tilde{\mathcal{X}}|\times|\tilde{\mathcal{X}}| transition matrices M and N, we say M stochastically dominates N if each row of M stochastically dominates the corresponding row of N. A basic property of stochastic dominance is the following.

Lemma 4

If M^{1} stochastically dominates M^{2} and c is a non-decreasing function defined on \tilde{\mathcal{X}}, then for all x\in\tilde{\mathcal{X}}, \sum_{y\in\tilde{\mathcal{X}}}M^{1}_{xy}c(y)\geq\sum_{y\in\tilde{\mathcal{X}}}M^{2}_{xy}c(y).

Proof 10

This follows from [40, Lemma 4.7.2].

Consider a fully-observable restless bandit process {(𝒳~,{0,1},{P~,Q~},c~,π~0)}\{(\mathcal{\tilde{X}},\{0,1\},\allowbreak\{\tilde{P},\tilde{Q}\},\tilde{c},\tilde{\pi}_{0})\} (note that 𝒴{\cal Y} is removed due to the observability assumption). According to [31], we say a fully-observable restless bandit process is stochastic monotone if it satisfies the following conditions.

  1. (D1)

    P~\tilde{P} and Q~\tilde{Q} are stochastic monotone transition matrices.

  2. (D2)

    For any z𝒳~z\in\tilde{\mathcal{X}}, w𝒳~z[P~Q~]xw\sum_{w\in\tilde{\mathcal{X}}_{\geq z}}[\tilde{P}-\tilde{Q}]_{xw} is non-decreasing in x𝒳~x\in\tilde{\mathcal{X}}.

  3. (D3)

    For any a{0,1}a\in\{0,1\}, c~(x,a)\tilde{c}(x,a) is non-decreasing in xx.

  4. (D4)

    c~(x,a)\tilde{c}(x,a) is submodular in (x,a)(x,a).

The following is established in [31, Lemma 5].

Proposition 5

The optimal policy of a stochastic monotone fully-observable restless bandit process is a threshold policy \tilde{g}, i.e., a policy that takes the passive action for states below a threshold \tilde{\theta} and the active action for the remaining states, i.e.,

g~={0,x<θ~1,otherwise.\displaystyle\tilde{g}=\begin{cases}0,\leavevmode\nobreak\ x<\tilde{\theta}\\ 1,\leavevmode\nobreak\ \text{otherwise}\end{cases}.

6.1.1 Proof of Theorem 2, Part 1

We show that each arm in model A, viewed in terms of its information state, is a stochastic monotone fully-observable restless bandit process. Each of the conditions (D1)-(D4) is verified for model A below.

  1. (D1’)

    The transition probability matrix under passive action for model A based on the information states is PxyA=𝕀{y=x+1}P^{A}_{xy}=\mathds{I}_{\{y=x+1\}} and the transition probability matrix under active action for model A is QxyA=𝕀{y=0}Q^{A}_{xy}=\mathds{I}_{\{y=0\}}. Thus, PAP^{A} and QAQ^{A} are stochastic monotone matrices.

  2. (D2’)

    Since PAP^{A} is a stochastic monotone matrix and QAQ^{A} has constant rows, rz[PAQA]sr\sum_{r\geq z}[P^{A}-Q^{A}]_{sr} is non-decreasing in ss for any integer z0z\geq 0.

  3. (D3’)

    As P stochastically dominates the identity matrix, we infer from [46, Theorem 1.1-b and Theorem 1.2-c] that QP^{\ell} stochastically dominates QP^{k} for any \ell>k\geq 0. Additionally, c_{\lambda}(x,a) is non-decreasing in x for any a\in\{0,1\}. By (16), we have \bar{c}_{\lambda}(k,a)=\sum_{x\in\mathcal{X}}[QP^{k}]_{x}c_{\lambda}(x,a). Therefore, by Lemma 4, \bar{c}_{\lambda}(k,a) is non-decreasing in k.

  4. (D4’)

    c(x,a) is submodular in (x,a) and, as shown in (D3’), QP^{\ell} stochastically dominates QP^{k} for any \ell>k\geq 0. Therefore, by Lemma 4, \bar{c}_{\lambda}(k,0)-\bar{c}_{\lambda}(k,1)=\sum_{x\in\mathcal{X}}[QP^{k}]_{x}(c_{\lambda}(x,0)-c_{\lambda}(x,1)) is non-decreasing in k, i.e., \bar{c}_{\lambda}(k,a) is submodular in (k,a).

Therefore, according to Proposition 5, the optimal policy of a fully-observable restless bandit process under model A is a threshold based policy.

Finally, since the optimal policy is threshold based, the passive set 𝒲λ{\cal W}_{\lambda} is given by {k1:k<θλA}\{k\in\mathbb{Z}_{\geq-1}:k<\theta^{A}_{\lambda}\}. As shown in Theorem 1, model A is indexable. Therefore, the passive set must be non-decreasing in λ\lambda, which implies that the threshold θλA\theta^{A}_{\lambda} is non-decreasing in λ\lambda.

6.1.2 Proof of Theorem 2, Part 2

We first characterize the behavior of value function and state-action value function for Model B.

Lemma 5

We have

  1. a.

    c¯λ(s,k,a)\bar{c}_{\lambda}(s,k,a) is non-decreasing in kk for any s𝒳s\in{\cal X} and a{0,1}a\in\{0,1\}.

  2. b.

    Given a fixed λ\lambda, Vλ(s,k)V_{\lambda}(s,k) is non-decreasing in kk for any s𝒳s\in\mathcal{X}.

  3. c.

    c¯λ(s,k,a)\bar{c}_{\lambda}(s,k,a) is submodular in (k,a)(k,a), for any s𝒳s\in{\cal X}.

  4. d.

    Hλ(s,k,a)H_{\lambda}(s,k,a) is submodular in (k,a)(k,a), for any s𝒳s\in{\cal X}.

Proof 11

The proof of each part is as follows.

  1. a.

    By definition, we have

    c¯λ(s,k,a)=x𝒳[δsPk](x)c(x,a)+λa.\bar{c}_{\lambda}(s,k,a)=\sum_{x\in{\cal X}}[\delta_{s}P^{k}](x)c(x,a)+\lambda a.

    Similar to the proof of (D3’) in Proposition 5, for a given s𝒳s\in{\cal X} and a{0,1}a\in\{0,1\}, [δsPk](x)[\delta_{s}P^{k}](x) is non-decreasing in kk and xx and as c(x,a)c(x,a) is non-decreasing in xx, c¯λ(s,k,a)\bar{c}_{\lambda}(s,k,a) is non-decreasing in kk.

  2. b.

    Let

    H^{j}_{\lambda}(s,k,0) :=(1-\beta)\bar{c}(s,k,0)+\beta V^{j}_{\lambda}(s,k+1),
    H^{j}_{\lambda}(s,k,1) :=(1-\beta)\bar{c}(s,k,1)+(1-\beta)\lambda+\beta\sum_{r}Q_{r}V^{j}_{\lambda}(r,0),
    V^{j+1}_{\lambda}(s,k) :=\min_{a\in\{0,1\}}\{H^{j}_{\lambda}(s,k,a)\},

    where V^{0}_{\lambda}(\cdot,\cdot)=0 for all (s,k)\in{\cal X}\times\mathds{Z}_{\geq 0}. (A numerical sketch of this recursion on a truncated state space is given after the proof.)
    Claim: V^{j}_{\lambda}(s,k) is non-decreasing in k for any s\in{\cal X} and j\geq 0.
    We prove the claim by induction. By construction, V^{0}_{\lambda}(s,k) is non-decreasing in k for any s\in{\cal X}. This forms the basis of induction. Now assume that V^{j}_{\lambda}(s,k) is non-decreasing in k for any s\in{\cal X} and some j\geq 0. Consider \ell>k\geq 0. Then, by the induction hypothesis, we have

    H^{j}_{\lambda}(s,\ell,0) =(1-\beta)\bar{c}(s,\ell,0)+\beta V^{j}_{\lambda}(s,\ell+1)
    \geq(1-\beta)\bar{c}(s,k,0)+\beta V^{j}_{\lambda}(s,k+1)=H^{j}_{\lambda}(s,k,0),
    H^{j}_{\lambda}(s,\ell,1) =(1-\beta)\bar{c}(s,\ell,1)+(1-\beta)\lambda+\beta\sum_{r}Q_{r}V^{j}_{\lambda}(r,0)
    \geq(1-\beta)\bar{c}(s,k,1)+(1-\beta)\lambda+\beta\sum_{r}Q_{r}V^{j}_{\lambda}(r,0)=H^{j}_{\lambda}(s,k,1).

    Therefore,

    V^{j+1}_{\lambda}(s,\ell)=\min_{a}\{H^{j}_{\lambda}(s,\ell,a)\}\geq\min_{a}\{H^{j}_{\lambda}(s,k,a)\}=V^{j+1}_{\lambda}(s,k).

    Thus, V^{j+1}_{\lambda}(s,k) is non-decreasing in k for any s\in{\cal X}, which completes the induction step. Since V_{\lambda}(s,k)=\lim_{j\to\infty}V^{j}_{\lambda}(s,k) and monotonicity is preserved under limits, the claim implies that V_{\lambda}(s,k) is non-decreasing in k.

  3. c.

    c(x,a) is submodular in (x,a). Also, note that \delta_{s}P^{k} is the s-th row of P^{k}. Thus, \delta_{s}P^{k+1} stochastically dominates \delta_{s}P^{k} and, by Lemma 4, we have

    \sum_{x\in{\cal X}}[\delta_{s}(P^{k+1}-P^{k})]_{x}(c(x,0)-c(x,1))\geq 0.

    Therefore,

    \sum_{x\in{\cal X}}[\delta_{s}(P^{k}-P^{k+1})]_{x}\,c(x,1)\geq\sum_{x\in{\cal X}}[\delta_{s}(P^{k}-P^{k+1})]_{x}\,c(x,0).

    Consequently,

    \sum_{x\in{\cal X}}[\delta_{s}P^{k}]_{x}c(x,1)-\sum_{x\in{\cal X}}[\delta_{s}P^{k}]_{x}c(x,0)\geq\sum_{x\in{\cal X}}[\delta_{s}P^{k+1}]_{x}c(x,1)-\sum_{x\in{\cal X}}[\delta_{s}P^{k+1}]_{x}c(x,0).

    Hence,

    \bar{c}(s,k,1)-\bar{c}(s,k,0)\geq\bar{c}(s,k+1,1)-\bar{c}(s,k+1,0).
  4. d.

    As V_{\lambda}(s,k) is non-decreasing in k for any s\in{\cal X}, and \bar{c}_{\lambda}(s,k,a) is submodular in (k,a) for any k\in\mathds{N}_{\ell} and a\in\{0,1\}, we have

    H_{\lambda}(s,k,1)-H_{\lambda}(s,k,0) =(1-\beta)\bar{c}(s,k,1)+(1-\beta)\lambda+\beta\sum_{r}Q_{r}V_{\lambda}(r,0)
    -(1-\beta)\bar{c}(s,k,0)-\beta V_{\lambda}(s,k+1)
    \geq(1-\beta)\bar{c}(s,k+1,1)+(1-\beta)\lambda+\beta\sum_{r}Q_{r}V_{\lambda}(r,0)
    -(1-\beta)\bar{c}(s,k+1,0)-\beta V_{\lambda}(s,k+2)
    =H_{\lambda}(s,k+1,1)-H_{\lambda}(s,k+1,0).
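As mentioned in part (b), the value-iteration recursion can be implemented directly on a truncated information-state space. The sketch below is a minimal illustration, assuming a truncation at \ell information states (with the passive transition clamped at \ell) and generic inputs P, Q, c, \lambda, \beta; it is not the exact implementation used in the experiments.

```python
import numpy as np

def value_iteration_model_B(P, Q, c, lam, beta, ell, iters=2000):
    """Truncated value iteration for the model-B auxiliary problem.

    P : (n, n) deterioration matrix, Q : (n,) reset pmf, c : (n, 2) per-step costs,
    lam : activation charge lambda, ell : truncation level of the information state k.
    Returns V[s, k] and a greedy policy g[s, k] (1 = active action).
    """
    n = P.shape[0]
    # cbar[s, k, a] = sum_x [delta_s P^k](x) c(x, a), where delta_s P^k is the s-th row of P^k.
    cbar = np.empty((n, ell + 1, 2))
    Pk = np.eye(n)
    for k in range(ell + 1):
        cbar[:, k, :] = Pk @ c
        Pk = Pk @ P
    V = np.zeros((n, ell + 1))                          # V^0 = 0
    k_next = np.minimum(np.arange(1, ell + 2), ell)     # clamp the passive transition at k = ell
    for _ in range(iters):
        reset = Q @ V[:, 0]                             # sum_r Q_r V(r, 0) (a scalar)
        H0 = (1 - beta) * cbar[:, :, 0] + beta * V[:, k_next]
        H1 = (1 - beta) * (cbar[:, :, 1] + lam) + beta * reset
        V = np.minimum(H0, H1)
    g = (H1 <= H0).astype(int)                          # threshold structure emerges in k
    return V, g
```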
Lemma 6

Suppose f:\mathcal{X}\times\mathcal{Y}\to\mathds{R} is a submodular function and, for each x\in\mathcal{X}, \min_{y\in\mathcal{Y}}f(x,y) exists. Then \max\{\operatorname*{arg\,min}_{y\in\mathcal{Y}}f(x,y)\} is monotone non-decreasing in x.

Proof 12

This result follows from [40, Lemma 4.7.1].

Now, we conclude that since H_{\lambda}(s,k,a) is submodular in (k,a) for any s\in{\cal X} and only two actions are available, Lemma 6 implies that the optimal policy is a threshold policy of the form specified in the theorem statement.

Finally, since the optimal policy is threshold based, the passive set {\cal W}_{\lambda} is given by \{(s,k)\in{\cal X}\times\mathbb{Z}_{\geq-1}:k<\theta^{B}_{s,\lambda}\}. As shown in Theorem 1, model B is indexable. Therefore, the passive set must be non-decreasing in \lambda, which implies that, for every s\in{\cal X}, the threshold \theta^{B}_{s,\lambda} is non-decreasing in \lambda.

6.2 Proof of Theorem 3

By the strong Markov property, we have

D^{(\theta^{A})}(k) =(1-\beta)\sum_{t=k}^{\theta^{A}}\beta^{t-k}\bar{c}(t,g(t))+\beta^{\theta^{A}-k+1}D^{(\theta^{A})}(0)=L^{(\theta^{A})}(k)+\beta^{\theta^{A}-k+1}D^{(\theta^{A})}(0),
N^{(\theta^{A})}(k) =(1-\beta)\beta^{\theta^{A}-k}+\beta^{\theta^{A}-k+1}N^{(\theta^{A})}(0)=M^{(\theta^{A})}(k)+\beta^{\theta^{A}-k+1}N^{(\theta^{A})}(0).

If we set k=0k=0 in the above,

D^{(\theta^{A})}(0)=\dfrac{L^{(\theta^{A})}(0)}{1-\beta^{\theta^{A}+1}}\quad\text{and}\quad N^{(\theta^{A})}(0)=\dfrac{M^{(\theta^{A})}(0)}{1-\beta^{\theta^{A}+1}}.
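A minimal sketch of how these expressions can be evaluated numerically is given below. It assumes that \bar{c}(k,a) is available as an array for k=0,\ldots,\theta^{A} and that, under the threshold policy, the passive action is taken for k<\theta^{A} and the active action at k=\theta^{A}; the helper names are ours.

```python
import numpy as np

def L_theta(cbar, theta, beta, k):
    """L^(theta)(k) = (1-beta) * sum_{t=k}^{theta} beta^(t-k) * cbar(t, g(t)),
    where g takes the passive action for t < theta and the active action at t = theta."""
    t = np.arange(k, theta + 1)
    costs = np.where(t < theta, cbar[t, 0], cbar[t, 1])
    return (1 - beta) * np.sum(beta ** (t - k) * costs)

def M_theta(theta, beta, k):
    """M^(theta)(k) = (1-beta) * beta^(theta-k)."""
    return (1 - beta) * beta ** (theta - k)

def D_and_N(cbar, theta, beta, k):
    """D^(theta)(k) and N^(theta)(k) via the renewal relations in the proof of Theorem 3."""
    D0 = L_theta(cbar, theta, beta, 0) / (1 - beta ** (theta + 1))
    N0 = M_theta(theta, beta, 0) / (1 - beta ** (theta + 1))
    D = L_theta(cbar, theta, beta, k) + beta ** (theta - k + 1) * D0
    N = M_theta(theta, beta, k) + beta ** (theta - k + 1) * N0
    return D, N
```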

6.3 Proof of Theorem 4

By the strong Markov property, we have

D^{({\boldsymbol{\theta}}^{B})}(s,k) =(1-\beta)\sum_{t=k}^{\theta^{B}_{s}}\beta^{t-k}\bar{c}(s,t,g(s,t))+\beta^{\theta^{B}_{s}-k+1}\sum_{r\in{\cal X}}Q_{r}D^{({\boldsymbol{\theta}}^{B})}(r,0)
=L^{({\boldsymbol{\theta}}^{B})}(s,k)+\beta^{\theta^{B}_{s}-k+1}\sum_{r\in{\cal X}}Q_{r}D^{({\boldsymbol{\theta}}^{B})}(r,0),
N^{({\boldsymbol{\theta}}^{B})}(s,k) =(1-\beta)\beta^{\theta^{B}_{s}-k}+\beta^{\theta^{B}_{s}-k+1}\sum_{r\in{\cal X}}Q_{r}N^{({\boldsymbol{\theta}}^{B})}(r,0)
=M^{({\boldsymbol{\theta}}^{B})}(s,k)+\beta^{\theta^{B}_{s}-k+1}\sum_{r\in{\cal X}}Q_{r}N^{({\boldsymbol{\theta}}^{B})}(r,0).

If we set k=0k=0 in the above,

D^{({\boldsymbol{\theta}}^{B})}(s,0) =L^{({\boldsymbol{\theta}}^{B})}(s,0)+\beta^{\theta^{B}_{s}+1}\sum_{r\in{\cal X}}Q_{r}D^{({\boldsymbol{\theta}}^{B})}(r,0),
N^{({\boldsymbol{\theta}}^{B})}(s,0) =M^{({\boldsymbol{\theta}}^{B})}(s,0)+\beta^{\theta^{B}_{s}+1}\sum_{r\in{\cal X}}Q_{r}N^{({\boldsymbol{\theta}}^{B})}(r,0),

which results in

\boldsymbol{D}^{({\boldsymbol{\theta}}^{B})}(0) =\boldsymbol{L}^{({\boldsymbol{\theta}}^{B})}(0)+Z^{({\boldsymbol{\theta}}^{B})}\boldsymbol{D}^{({\boldsymbol{\theta}}^{B})}(0),
\boldsymbol{N}^{({\boldsymbol{\theta}}^{B})}(0) =\boldsymbol{M}^{({\boldsymbol{\theta}}^{B})}(0)+Z^{({\boldsymbol{\theta}}^{B})}\boldsymbol{N}^{({\boldsymbol{\theta}}^{B})}(0),

and hence the statement is obtained by rearranging the terms in these equations.
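Reading off the equations above, the matrix Z^{({\boldsymbol{\theta}}^{B})} has entries Z_{sr}=\beta^{\theta^{B}_{s}+1}Q_{r}, so the fixed-point relations are two linear systems that can be solved directly. A minimal sketch (the helper name solve_D_N is ours) is given below.

```python
import numpy as np

def solve_D_N(L0, M0, theta_B, Q, beta):
    """Solve D(0) = L(0) + Z D(0) and N(0) = M(0) + Z N(0), where
    Z[s, r] = beta**(theta_B[s] + 1) * Q[r] as read off the relations above.

    L0, M0 : arrays of L^(theta^B)(s, 0) and M^(theta^B)(s, 0) over s,
    theta_B : array of per-state thresholds, Q : reset pmf, beta : discount factor.
    """
    theta_B = np.asarray(theta_B)
    Z = np.outer(beta ** (theta_B + 1), Q)   # Z[s, r] = beta^(theta_s + 1) Q_r
    A = np.eye(len(Q)) - Z
    D0 = np.linalg.solve(A, L0)              # (I - Z) D(0) = L(0)
    N0 = np.linalg.solve(A, M0)              # (I - Z) N(0) = M(0)
    return D0, N0
```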

6.4 Proof of Theorem 5

(i): Starting from information state k\in\mathds{N}_{\ell}, the cost incurred by \hat{g}_{\ell,\lambda}(\cdot) is the same as that of g^{A}_{\lambda}(\cdot) for information states \{k,\ldots,\ell\}. The per-step cost incurred by \hat{g}_{\ell,\lambda}(\cdot) differs from that of g^{A}_{\lambda}(\cdot) for information states \{\ell+1,\ldots\} by at most \operatorname{span}(c_{\lambda}).
(ii): The sequence of finite-state models described above is an augmentation-type approximation sequence. As a result, a limit point of \hat{g}^{*}_{\lambda} exists and the final result follows from [47, Proposition B.5, Theorem 4.6.3].

6.5 Proof of Theorem 6

(i): Starting from information state (s,k), for any s\in{\cal X} and k\in\mathds{N}_{\ell}, the cost incurred by \hat{g}_{\ell,\lambda}(\cdot,\cdot) is the same as that of g^{B}_{\lambda}(\cdot,\cdot) for information states \{(s,l)\}_{l=k}^{\ell}. The per-step cost incurred by \hat{g}_{\ell,\lambda}(\cdot,\cdot) differs from that of g^{B}_{\lambda}(\cdot,\cdot) for later realized information states by at most \Delta c_{\lambda}. Thus, the bound holds.
(ii): The sequence of finite-state models described above is an augmentation-type approximation sequence. As a result, a limit point of \hat{g}^{*}_{\lambda} exists and the final result follows from [47, Proposition B.5, Theorem 4.6.3].

7 Numerical Analysis

In this section, we consider Example 2 and compare the performance of the following policies:

  • opt:

    the optimal policy obtained using dynamic programming. As discussed earlier, this computation suffers from the curse of dimensionality, so the optimal policy can be computed only for small-scale models.

  • myp:

    the myopic policy, a heuristic that selects m machines sequentially as follows: suppose \zeta<m machines have been selected; then machine \zeta+1 is chosen as the one that yields the smallest increase in the total per-step cost. The detailed description for model B is shown in Alg. 2.

  • wip:

    the Whittle index heuristic, as described in this paper.

input : RB (\mathcal{X},\{0,1\},P,Q,c,\rho), discount factor \beta, m.
Initialize t=0.
while t\geq 0 do
       Set \zeta=0, \mathcal{M}=\emptyset, \mathcal{Z}={\cal N}.
       while \zeta<m do
             Compute i^{*}_{\zeta}\in\arg\min_{i\in\mathcal{Z}}\big[\sum_{j\in\mathcal{Z}\setminus\{i\}}\bar{c}^{j}(S^{j}_{t},K^{j}_{t},0)+\bar{c}^{i}(S^{i}_{t},K^{i}_{t},1)\big].
             Let \mathcal{M}=\mathcal{M}\cup\{i^{*}_{\zeta}\}, \mathcal{Z}=\mathcal{Z}\setminus\{i^{*}_{\zeta}\}.
             Set \zeta=\zeta+1.
       Service the machines with indices collected in \mathcal{M}.
       Update K^{i}_{t} according to (14) and S^{i}_{t} according to (15) for all i\in{\cal N}.
       Set t=t+1.
Algorithm 2 Myopic Heuristic (Model B)
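The selection step of Algorithm 2 can be written compactly in code. The sketch below is a minimal illustration, assuming that the expected per-step costs \bar{c}^{i}(S^{i}_{t},K^{i}_{t},a) have already been computed and are passed in as arrays; the function name myopic_selection is ours.

```python
def myopic_selection(cbar_passive, cbar_active, m):
    """One period of the myopic heuristic (Algorithm 2).

    cbar_passive[i], cbar_active[i] : expected per-step cost of machine i at its
    current information state under the passive / active action.
    Returns the list of m machines to service.
    """
    remaining = list(range(len(cbar_passive)))
    selected = []
    for _ in range(m):
        # For each candidate i: cost of servicing i while leaving the other
        # remaining machines passive.
        best = min(
            remaining,
            key=lambda i: cbar_active[i]
            + sum(cbar_passive[j] for j in remaining if j != i),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```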

7.1 Experiments and Results

We conduct numerical experiments for both models A and B, and vary the number n of machines, the number m of service-persons, and the parameters associated with each machine. There are three parameters associated with each machine: the deterioration probability matrix P^{i}, the reset pmf Q^{i}, and the per-step cost c^{i}(x,a). We assume the matrix P^{i} is chosen from a family of four types of structured transition matrices {\cal P}_{\gamma}(p), \gamma\in\{1,2,3,4\}, where p is a parameter of the model. The details of these families are presented in Appendix A. We assume each element of Q^{i} is sampled from Exp(1), i.e., an exponential distribution with rate parameter 1, and then normalized so that the elements sum to 1. Finally, we assume that the per-step cost is given by c^{i}(x,0)=(x-1)^{2} and c^{i}(x,1)=0.5|\mathcal{X}^{i}|^{2}.
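A minimal sketch of this parameter generation is given below; the function name sample_machine_params is ours, and the structured matrix P is assumed to be constructed separately (see the sketch in Appendix A).

```python
import numpy as np

def sample_machine_params(P, rng):
    """Given a structured deterioration matrix P (see Appendix A), sample the
    remaining machine parameters as described above: the reset pmf Q from
    normalized Exp(1) samples, and the per-step costs c(x,0) = (x-1)^2 and
    c(x,1) = 0.5*|X|^2, with states labelled x = 1, ..., |X|."""
    n = P.shape[0]
    q = rng.exponential(scale=1.0, size=n)
    Q = q / q.sum()
    x = np.arange(1, n + 1)
    c = np.column_stack([(x - 1) ** 2, np.full(n, 0.5 * n ** 2)])
    return Q, c

rng = np.random.default_rng(0)   # seed chosen only for reproducibility of the sketch
```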

In all experiments, the discount factor is \beta=0.99. The performance of every policy is evaluated using a Monte-Carlo simulation of length 1000, averaged over 5000 sample paths.

In Experiment 1, we consider a small-scale problem for which opt can be computed and compare the performance of wip with it. In Experiment 2, we consider a large-scale problem, where computing the optimal policy is prohibitively time-consuming, and compare the performance of wip with myp.

The code for both experiments is available at [48].

Experiment 1) Comparison of Whittle index with the optimal policy.

In this experiment, we compare the performance of wip with opt. We assume |{\cal X}|=4, (\ell+1)=6, n=3, and m=1 for both models A and B. In order to model heterogeneous machines, we proceed as follows. Let (p_{1},\ldots,p_{n}) denote n equispaced points in the interval [0.05,0.95]. Then we choose {\cal P}_{\gamma}(p_{i}) as the transition matrix of machine i. We denote the accumulated discounted costs of wip and opt by J(wip) and J(opt), respectively. In order to have a better perspective of the performances, we compute the relative performance of wip with respect to opt as

\alpha_{\textsc{opt}}=100\times\dfrac{J(\textsc{opt})}{J(\textsc{wip})}. (22)

The closer \alpha_{\textsc{opt}} is to 100, the closer wip is to opt. The value of \alpha_{\textsc{opt}} was 100 for all combinations of parameters, which means that the Whittle index policy performs as well as the optimal policy in this experiment.

Experiment 2) Comparison of Whittle index with the myopic policy for structured models.

In this experiment, we increase the state-space size to |{\cal X}|=20, set (\ell+1)=40, and select n from the set \{20,40,60\} and m from the set \{1,5\}. We denote the accumulated discounted cost of myp by J(myp). In order to have a better perspective of the performances, we compute the relative improvement of wip with respect to myp as

\varepsilon_{\textsc{myp}}=100\times\dfrac{J(\textsc{myp})-J(\textsc{wip})}{J(\textsc{myp})}. (23)

Note that \varepsilon_{\textsc{myp}}>0 means that wip performs better than myp. We generate structured transition matrices as in Experiment 1 and apply the same procedure to build heterogeneous machines. The values of \varepsilon_{\textsc{myp}} for different choices of the parameters for models A and B are shown in Tables 1 and 2, respectively.

Table 1: \varepsilon_{\textsc{myp}} for different choices of parameters of Model A in Experiment 2.

(a) Model A, m=1

          γ=1     γ=2     γ=3     γ=4
n=20      1.99    2.54    2.24    7.44
n=40      3.41    6.90    4.71    8.14
n=60      2.97    6.19    2.80    6.70

(b) Model A, m=5

          γ=1     γ=2     γ=3     γ=4
n=20      0.21    0.26    0.19    0.97
n=40      0.68    1.73    1.28    4.54
n=60      1.36    2.35    2.32    6.41
Table 2: \varepsilon_{\textsc{myp}} for different choices of parameters of Model B in Experiment 2.

(a) Model B, m=1

          γ=1     γ=2     γ=3     γ=4
n=20      7.67    11.17   12.12   9.39
n=40      14.96   13.85   14.55   9.17
n=60      15.02   12.12   13.39   6.63

(b) Model B, m=5

          γ=1     γ=2     γ=3     γ=4
n=20      0.63    1.62    1.01    2.92
n=40      2.92    3.14    3.21    6.57
n=60      4.86    7.22    6.99    9.96

7.2 Discussion

In Experiment 1, where wip is compared with opt, we observe that \alpha_{\textsc{opt}} is very close to 100 for almost all experiments, implying that wip performs as well as opt in these cases. The value of \alpha_{\textsc{opt}} for model B is lower than that for model A because, for a given set of parameters, model B is more complex than model A, and hence the gap between the performance of the two policies is larger.

In Experiment 2, where wip is compared with myp, we observe that \varepsilon_{\textsc{myp}} ranges from 0.2\% to 15\%. By the same reasoning as in Experiment 1, since model B is more complex than model A, \varepsilon_{\textsc{myp}} for model B is higher than that for model A for the same set of parameters.

Furthermore, we observe that \varepsilon_{\textsc{myp}} generally increases with n and decreases with m. This suggests that, as m increases, the overlap between the sets of machines chosen by wip and myp grows, and hence the performance of the two policies becomes closer.

8 Conclusion

We investigated partially observable restless bandits. Unlike most of the existing literature, which restricts attention to models with a binary state space, we considered general state-space models. We presented two observation models, which we call model A and model B, and showed that the partially observable restless bandits are indexable for both models.

To compute the Whittle index, we work with a countable space representation rather than the belief state representation. We established certain qualitative properties of the auxiliary problem to compute the Whittle index. In particular, for both models we showed that the optimal policies of the auxiliary problem satisfy threshold properties. For model A, we used the threshold property to obtain a closed form expression to compute the Whittle index. For model B, we used the threshold policy to present a refinement of the adaptive greedy algorithm of [31] to compute the Whittle index.

Finally, we presented a detailed numerical study of a machine maintenance model. We observed that for small-scale models, the Whittle index policy is close-to-optimal and for large-scale models, the Whittle index policy outperforms the myopic policy baseline.

Appendix A Structured Markov chains

Consider a Markov chain with |{\cal X}| states. A family of structured stochastic monotone matrices that stochastically dominate the identity matrix is given below; a construction sketch follows the list.

  1. 1.

    Matrix {\cal P}_{1}(p): Let q_{1}=1-p and q_{2}=0. Then,

    {\cal P}_{1}(p)=\begin{bmatrix}p&q_{1}&q_{2}&0&0&0&0&\dots&0\\ 0&p&q_{1}&q_{2}&0&0&0&\dots&0\\ 0&0&p&q_{1}&q_{2}&0&0&\dots&0\\ 0&0&0&p&q_{1}&q_{2}&0&\dots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 0&0&0&0&0&0&p&q_{1}&q_{2}\\ 0&0&0&0&0&0&0&p&q_{1}+q_{2}\\ 0&0&0&0&0&0&0&\dots&1\end{bmatrix}.
  2. 2.

    Matrix {\cal P}_{2}(p): Similar to {\cal P}_{1}(p) with q_{1}=(1-p)/2 and q_{2}=(1-p)/2.

  3. 3.

    Matrix {\cal P}_{3}(p): Similar to {\cal P}_{1}(p) with q_{1}=2(1-p)/3 and q_{2}=(1-p)/3.

  4. 4.

    Matrix {\cal P}_{4}(p): Let q_{i}=(1-p)/(|{\cal X}|-i). Then,

    {\cal P}_{4}(p)=\begin{bmatrix}p&q_{1}&q_{1}&\dots&q_{1}&q_{1}\\ 0&p&q_{2}&\dots&q_{2}&q_{2}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 0&0&0&\dots&p&q_{|{\cal X}|-1}\\ 0&0&0&\dots&0&1\end{bmatrix}.
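A minimal construction sketch for these four families, with the same labelling as above (the helper name build_P and the convention of folding probability mass beyond the last state into the last state are ours), is given below.

```python
import numpy as np

def build_P(gamma, p, n):
    """Construct the structured transition matrix P_gamma(p) on n states."""
    P = np.zeros((n, n))
    if gamma in (1, 2, 3):
        q1, q2 = {1: (1 - p, 0.0),
                  2: ((1 - p) / 2, (1 - p) / 2),
                  3: (2 * (1 - p) / 3, (1 - p) / 3)}[gamma]
        for i in range(n - 1):
            P[i, i] = p
            P[i, min(i + 1, n - 1)] += q1
            P[i, min(i + 2, n - 1)] += q2   # mass beyond the last state folds into it
        P[n - 1, n - 1] = 1.0               # last state is absorbing
    else:  # gamma == 4
        for i in range(n - 1):
            P[i, i] = p
            P[i, i + 1:] = (1 - p) / (n - 1 - i)   # q_i spread over the states to the right
        P[n - 1, n - 1] = 1.0
    return P
```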

References

  • [1] R. Meshram, D. Manjunath, A. Gopalan, On the Whittle index for restless multiarmed hidden Markov bandits 63 (9) (2018) 3046–3053.
  • [2] S. Guha, K. Munagala, P. Shi, Approximation algorithms for restless bandit problems, Journal of the ACM (JACM) 58 (1) (2010) 3.
  • [3] K. Kaza, R. Meshram, V. Mehta, S. N. Merchant, Sequential decision making with limited observation capability: Application to wireless networks, IEEE Transactions on Cognitive Communications and Networking 5 (2) (2019) 237–251.
  • [4] K. Kaza, V. Mehta, R. Meshram, S. Merchant, Restless bandits with cumulative feedback: Applications in wireless networks, in: Wireless Communications and Networking Conference, IEEE, 2018, pp. 1–6.
  • [5] S. Aalto, P. Lassila, P. Osti, Whittle index approach to size-aware scheduling for time-varying channels with multiple states, Queueing Systems 83 (3-4) (2016) 195–225.
  • [6] M. Larrañaga, M. Assaad, A. Destounis, G. S. Paschos, Dynamic pilot allocation over Markovian fading channels: A restless bandit approach, in: Information Theory Workshop, IEEE, 2016, pp. 290–294.
  • [7] N. Akbarzadeh, A. Mahajan, Dynamic spectrum access under partial observations: A restless bandit approach, in: Canadian Workshop on Information Theory, IEEE, 2019, pp. 1–6.
  • [8] S. S. Villar, J. Bowden, J. Wason, Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges, Statistical science: a review journal of the Institute of Mathematical Statistics 30 (2) (2015) 199.
  • [9] C. Abad, G. Iyengar, A near-optimal maintenance policy for automated DR devices, IEEE Transactions on Smart Grid 7 (3) (2016) 1411–1419.
  • [10] K. D. Glazebrook, H. M. Mitchell, P. S. Ansell, Index policies for the maintenance of a collection of machines by a set of repairmen, European Journal of Operational Research 165 (1) (2005) 267–284.
  • [11] S. S. Villar, Indexability and optimal index policies for a class of reinitialising restless bandits, Probability in the engineering and informational sciences 30 (1) (2016) 1–23.
  • [12] Y. Qian, C. Zhang, B. Krishnamachari, M. Tambe, Restless poachers: Handling exploration-exploitation tradeoffs in security domains, in: Int. Conf. on Autonomous Agents & Multiagent Systems, 2016, pp. 123–131.
  • [13] V. S. Borkar, Whittle index for partially observed binary markov decision processes, IEEE Transactions on Automatic Control 62 (12) (2017) 6614–6618.
  • [14] V. S. Borkar, G. S. Kasbekar, S. Pattathil, P. Shetty, Opportunistic scheduling as restless bandits, IEEE Transactions on Control of Network Systems (2017).
  • [15] K. E. Avrachenkov, V. S. Borkar, Whittle index policy for crawling ephemeral content, IEEE Transactions on Control of Network Systems 5 (1) (2018) 446–455. doi:10.1109/TCNS.2016.2619066.
  • [16] C. H. Papadimitriou, J. N. Tsitsiklis, The complexity of optimal queuing network control, Mathematics of Operations Research 24 (2) (1999) 293–305.
  • [17] P. Whittle, Restless bandits: Activity allocation in a changing world, Journal of Applied Probability 25 (A) (1988) 287–298.
  • [18] J. C. Gittins, Bandit processes and dynamic allocation indices, Journal of the Royal Statistical Society. Series B (Methodological) (1979) 148–177.
  • [19] R. R. Weber, G. Weiss, On an index policy for restless bandits, Journal of Applied Probability 27 (3) (1990) 637–648.
  • [20] C. Lott, D. Teneketzis, On the optimality of an index rule in multichannel allocation for single-hop mobile networks with multiple service classes, Probability in the Engineering and Informational Sciences 14 (3) (2000) 259–297.
  • [21] K. Liu, Q. Zhao, Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access 56 (11) (2010) 5547–5567.
  • [22] J. Niño-Mora, Dynamic priority allocation via restless bandit marginal productivity indices, TOP 15 (2) (2007) 161–198.
  • [23] P. S. Ansell, K. D. Glazebrook, J. Niño-Mora, M. O’Keeffe, Whittle’s index policy for a multi-class queueing system with convex holding costs, Mathematical Methods of Operations Research 57 (1) (2003) 21–39.
  • [24] K. D. Glazebrook, D. Ruiz-Hernandez, C. Kirkbride, Some indexable families of restless bandit problems, Advances in Applied Probability 38 (3) (2006) 643–672.
  • [25] U. Ayesta, M. Erausquin, P. Jacko, A modeling framework for optimizing the flow-level scheduling with time-varying channels, Performance Evaluation 67 (11) (2010) 1014–1029.
  • [26] N. Akbarzadeh, A. Mahajan, Restless bandits with controlled restarts: Indexability and computation of Whittle index, in: Conference on Decision and Control, 2019, pp. 7294–7300.
  • [27] T. W. Archibald, D. P. Black, K. D. Glazebrook, Indexability and index heuristics for a simple class of inventory routing problems, Operations research 57 (2) (2009) 314–326.
  • [28] K. Avrachenkov, U. Ayesta, J. Doncel, P. Jacko, Congestion control of TCP flows in internet routers by means of index policy, Computer Networks 57 (17) (2013) 3463–3478.
  • [29] K. Glazebrook, D. Hodge, C. Kirkbride, Monotone policies and indexability for bidirectional restless bandits, Advances in Applied Probability 45 (1) (2013) 51–85.
  • [30] Z. Yu, Y. Xu, L. Tong, Deadline scheduling as restless bandits 63 (8) (2018) 2343–2358.
  • [31] N. Akbarzadeh, A. Mahajan, Conditions for indexability of restless bandits and an 𝒪(k3)\mathcal{O}\!\left(k^{3}\right) algorithm to compute Whittle index, Adv. in Appl. Probab. 54 (4) (2022) 1164–1192.
  • [32] J. Niño-Mora, Restless bandits, partial conservation laws and indexability, Advances in Applied Probability 33 (1) (2001) 76–98.
  • [33] N. Gast, B. Gaujal, K. Khun, Computing Whittle (and Gittins) index in subcubic time, arXiv preprint arXiv:2203.05207 (2022).
  • [34] D. Ruiz-Hernández, J. M. Pinar-Pérez, D. Delgado-Gómez, Multi-machine preventive maintenance scheduling with imperfect interventions: A restless bandit approach, Computers & Operations Research 119 (2020) 104927.
  • [35] A. Mate, J. Killian, H. Xu, A. Perrault, M. Tambe, Collapsing bandits and their application to public health intervention, Advances in Neural Information Processing Systems 33 (2020) 15639–15650.
  • [36] C. R. Dance, T. Silander, Optimal policies for observing time series and related restless bandit problems., J. Mach. Learn. Res. 20 (2019) 35–1.
  • [37] K. Liu, Index policy for a class of partially observable Markov decision processes, arXiv preprint arXiv:2107.11939 (2021).
  • [38] D. B. Brown, J. E. Smith, Index policies and performance bounds for dynamic selection problems, Management Science 66 (7) (2020) 3029–3050.
  • [39] K. J. Astrom, Optimal control of Markov processes with incomplete state information, Journal of mathematical analysis and applications 10 (1) (1965) 174–205.
  • [40] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming, John Wiley & Sons, 2014.
  • [41] E. Altman, Constrained Markov decision processes, Vol. 7, CRC press, 1999.
  • [42] G. M. Lipsa, N. C. Martins, Remote state estimation with communication costs for first-order LTI systems, IEEE Transactions on Automatic Control 56 (9) (2011) 2013–2025.
  • [43] J. Chakravorty, A. Mahajan, Fundamental limits of remote estimation of autoregressive Markov processes under communication constraints., IEEE Trans. Automat. Contr. 62 (3) (2017) 1109–1124.
  • [44] A. Mahajan, Remote estimation over control area networks, in: 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall), 2017, pp. 1–5.
  • [45] D. I. Shuman, A. Nayyar, A. Mahajan, Y. Goykhman, K. Li, M. Liu, D. Teneketzis, M. Moghaddam, D. Entekhabi, Measurement scheduling for soil moisture sensing: From physical models to optimal control, Proceedings of the IEEE 98 (11) (2010) 1918–1933.
  • [46] J. Keilson, A. Kester, Monotone matrices and monotone Markov processes, Stochastic Processes and their Applications 5 (3) (1977) 231–241.
  • [47] L. I. Sennott, Stochastic dynamic programming and the control of queueing systems, Vol. 504, John Wiley & Sons, 2009.
  • [48] N. Akbarzadeh, A. Mahajan, Partially observable restless bandits with restarts, https://codeocean.com/capsule/6654724/tree/v1.