Two families of indexable partially observable restless bandits and Whittle index computation
This research was funded in part by Fonds de Recherche du Quebec – Nature et technologies (FRQNT).
Abstract
We consider restless bandits with a general finite state space under partial observability, with two observation models: first, the state of each bandit is not observable at all, and second, the state of each bandit is observable when it is selected. Under the assumption that the models satisfy a restart property, we prove that both models are indexable. For the first model, we derive a closed-form expression for the Whittle index. For the second model, we propose an efficient algorithm to compute the Whittle index by exploiting the qualitative properties of the optimal policy. We present detailed numerical experiments for multiple instances of a machine maintenance problem. The results indicate that the Whittle index policy outperforms the myopic policy and can be close to optimal in different setups.
keywords:
Multi-armed bandits; Restless bandits; Whittle index; indexability; partially observable; scheduling; resource allocation.

1 Introduction
Resource allocation and scheduling problems arise in various applications including telecommunication networks, sensor management, patient prioritization, and machine maintenance. Restless bandits is a widely-used solution framework for such models [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
Identifying the optimal policy in restless bandits suffers from the curse of dimensionality because the state space is exponential in the number of alternatives [16]. To circumvent the curse of dimensionality, Whittle proposed an index heuristic which has a linear complexity in the number of alternatives [17]. The resulting policy, called the Whittle index policy, operates as follows: assign an index (called the Whittle index) to each state of each arm (or alternative) and then, at each time, play the arms in states with the highest indices.
The Whittle index policy is attractive for two reasons. First, it is a scalable heuristic because its complexity is linear in the number of arms. Second, although it is a heuristic, there are certain settings where it is optimal [18, 19, 20, 21] and, in general, it performs close to optimal in many instances [22, 23, 10, 24, 25, 26].
Nonetheless, there are two challenges in using the Whittle index policy. First, the Whittle index heuristic is applicable only when a technical condition known as indexability is satisfied. There is no general test for indexability, and the existing sufficient conditions are only applicable for specific models [23, 10, 24, 27, 28, 25, 29, 30]. Second, while there are closed-form expressions to compute the Whittle index in some instances [10, 24, 29, 5, 6, 4, 3, 26, 31], in general, the Whittle index policy has to be computed numerically. For a subclass of restless bandits which satisfy an additional technical condition known as PCL (partial conservation law), the Whittle index can be computed using an algorithm called the adaptive greedy algorithm [32, 22]. Recently, [31, 33] presented generalizations of adaptive greedy algorithm which are applicable to all indexable restless bandits.
We are interested in resource allocation and scheduling problems where the state of each arm is not fully-observed. Such partially observable restless bandit models are conceptually and computationally more challenging. The sufficient conditions for indexability that are derived for fully-observed bandits [17, 19, 10, 24, 29, 31, 34] are not directly applicable to the partially observable setting. The existing literature on partially observable restless bandits often restricts attention to models where each arm has two states [2, 3, 4, 11, 9, 13, 1, 5, 35]. In some cases, it is also assumed that the two states are positively correlated [4, 3, 5]; in others, it is assumed that the state dynamics are independent of the chosen action [6, 36, 37]. There are very few results for general finite state space models under partial observability [12, 7, 6, 36, 37], and, for such models, indexability is often verified numerically. In addition, there are very few algorithms to compute the Whittle index for such models.
Recently, alternative index-based heuristics for partially observable restless bandits have been proposed [38], but we restrict attention to the Whittle index policy in this paper.
The main contributions of our paper are as follows:
1. We investigate partially observable restless bandits with general finite state spaces and consider two observation models, which we call model A and model B. We show that both models are indexable.
2. For model A, we provide a closed-form expression to compute the Whittle index. For model B, we provide a refinement of the adaptive greedy algorithm of [31] to efficiently compute the Whittle index.
3. We present a detailed numerical study which illustrates that the Whittle index policy performs close to optimal for small scale systems and outperforms a commonly used heuristic (the myopic policy) for large-scale systems.
The organization of the paper is as follows. In Section 2, we formulate the restless bandit problem under partial observations for two different models. Then, we define a belief state by which the partially-observable problem can be converted into a fully-observable one. In Section 3, we present a short overview of restless bandits. In Section 4, we show that the restless bandit problem is indexable for both models and present a general formula to compute the index. In Section 5, we present a countable state representation of the belief state and use it to develop methods to compute the Whittle index. In Section 6, we present the proofs of the results. In Section 7, we present a detailed numerical study which compares the performance of the Whittle index policy with two baseline policies. Finally, we conclude in Section 8.
1.1 Notations and Definitions
We use uppercase letters to denote random variables, the corresponding lowercase letters to denote their realizations, and the corresponding calligraphic letters to denote the sets of realizations. Superscripts index arms and subscripts index time, e.g., denotes the state of arm at time . The subscript denotes the history of the variable from time to , e.g., denotes . Bold letters denote the vector of a variable for all arms, e.g., denotes . Given a collection of real numbers , we use to denote their product . Given a collection of sets , we use to denote their Cartesian product . For a finite set , denotes the set of probability mass functions (PMFs) on .
We use as the indicator function, as the expectation operator, as the probability function, as the set of real numbers, as the set of integers and as the set of integers that are not lower than .
Given ordered sets $\mathcal{X}$ and $\mathcal{Y}$, a function $f \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is called submodular if for any $x_1, x_2 \in \mathcal{X}$ and $y_1, y_2 \in \mathcal{Y}$ such that $x_1 \le x_2$ and $y_1 \le y_2$, we have $f(x_1, y_1) + f(x_2, y_2) \le f(x_1, y_2) + f(x_2, y_1)$. Furthermore, a transition probability matrix $P$ is stochastic monotone if for any states $x_1, x_2$ such that $x_1 \le x_2$, we have $\sum_{z \ge y} P(x_1, z) \le \sum_{z \ge y} P(x_2, z)$ for any $y$. For any function $V$, $\operatorname{sp}(V)$ denotes the span semi-norm of $V$, i.e., $\operatorname{sp}(V) = \max_x V(x) - \min_x V(x)$.
2 Model and Problem Formulation
2.1 Restless Bandit Process with restart
A discrete-time restless bandit process (or arm) is a controlled Markov process where denotes the finite set of states; denotes the action space where the action is called the passive action and the action is the active action; , , denotes the transition matrix when action is chosen; denotes the cost function; denotes the initial state distribution; denotes the finite set of observations.
Assumption 1 (Restart property)
All rows of the transition matrix under the active action are identical.
For models which satisfy Assumption 1, we denote by and denote each (identical) row of by . The term restart property is used following the terminology of [26], where was a PMF on the state space (i.e., on taking the active action, the state resets according to PMF ). Note that [26] considered fully observed models, while we consider partially observable setups.
An operator has to select arms at each time but does not observe the state of the arms. We consider two observation models.
1. Model A: In model A, the operator does not observe anything. We denote this by , where denotes a blank symbol.
2. Model B: In model B, the operator observes the state of the arm after it has been reset, i.e.,
(1)
For model A, and for model B, .
2.2 Partially-observable Restless Multi-armed Bandit Problem
A partially-observable restless multi-armed bandit (PO-RMAB) problem is a collection of independent restless bandits , .
Let , , and denote the combined state, action, and observation spaces, respectively. Also, let , , and denote the combined states, actions taken, and observations made by the operator at time . Due to the independent evolution of each arm, for each realization of and of , we have
Let denote the initial state distribution of all arms.
When the system is in state and action is taken, the system incurs a cost . The action at time is chosen according to
(2) |
where is a history-dependent policy at time . Let denote the policy for infinite time horizon and let denote the family of all such policies. Then, the performance of policy is given by
(3) |
where denotes the discount factor.
Formally, the optimization problem of interest is as follows:
Problem 1
Given a discount factor , the total number of arms, the number to be selected, the system model of each arm, and the observation model at the operator, choose a history-dependent policy that minimizes given by (3).
Some remarks
1. Problem 1 is a POMDP and the standard methodology to solve POMDPs is to convert them to a fully observable Markov decision process (MDP) by viewing the “belief state” as the information state of the system [39]. We present such a belief state representation in Sec. 2.4 and point out its limitations in the context of restless bandits.
2. In Problem 1, the objective depends on the initial state distribution . This can give the impression that the optimal policy may depend on the initial distribution. It is well known in the MDP literature that there exist policies that are optimal for all initial distributions [40]. However, our results rely on translating the belief state representation of the POMDP into a countable state MDP formulation and such a transformation is valid only when the initial state distribution is of a specific form. See Sec. 5 for details. Our results do not depend on the specific choice of the initial distribution, as long as it satisfies Assumption 2 specified in Sec. 5.
3. In Problem 1, we consider the normalized expected discounted cost as the objective, where the discounted cost is multiplied by . In the MDP literature, one typically considers unnormalized objectives. However, the normalized objective is commonly used in constrained MDPs [41] and has also been used in some of the previous literature on restless bandits [31]. Note that multiplying the objective by a constant does not change the optimal policy. The reason we use a normalized expected discounted cost is that it simplifies the description of the adaptive greedy algorithms to compute the Whittle index presented in Sec. 5.
2.3 Some Examples
In this section, we present some examples corresponding to the model presented above.
Example 1
Consider a sensor network where there are sensors, each observing an independent Markov process. We assume that the state of each Markov process is integer valued and evolves in an auto-regressive manner: , where are i.i.d. processes which are also independent across the sensors. An estimator can observe only , where , sensors at each time. If a sensor is observed, the state of the Markov process at that sensor is revealed to the estimator. If a sensor is not observed, the estimator gets no new information about its state and has to estimate the state based on previous observations. The objective is to determine a sensor scheduling policy to decide which sensors to observe at each time.
In this case, it can be shown that when the noise processes have symmetric and unimodal distributions, the optimal estimation strategy is a Kalman-filter like strategy, i.e., the optimal estimate is when the Markov process is transmitted, and is equal to the previous estimate when the Markov process is not transmitted [42]. Thus, the error process has a restart property [43]. An instance of such a sensor network problem was considered in [44].
Example 2
Consider a maintenance company monitoring machines which deteriorate independently over time. Each machine has multiple deterioration states, sorted from pristine to ruined. There is a cost associated with running the machine, and the cost is a non-decreasing function of the state. If a machine is left unmonitored, its state deteriorates and, after a while, the machine is ruined. However, the state of the machine is not observed.
Furthermore, it is assumed that the company cannot observe the state of a machine unless it sends a service-person to visit it. Replacing a machine is relatively inexpensive, and when a service-person visits a machine, they simply replace it with a new one. Due to manufacturing defects, the machines may not all be in the pristine state when installed. If the service-person can observe the state of the machine when installing a new one, the observation model is the same as model B; otherwise, it is model A. There are , where , service-persons. The objective is to determine a scheduling policy to decide which machines should be serviced at each time. An instance of such a machine maintenance problem, in the context of maintaining demand response devices, was considered in [9].
Example 3
Consider the problem of resource constrained health intervention delivery, where a community health center is monitoring patients to check if they are adhering to the prescribed medication [35]. Each patient has a binary state of “Adhering” or “Not Adhering”, which is hidden. There are , where , health workers, and if a health worker visits a patient, the state of the patient is observed. Moreover, it is assumed that after the visit by a health worker, the patient goes into the “Adhering” state. The objective is to determine a policy to schedule the health workers to maximize the number of patients in the “Adhering” state.
2.4 Belief State
Using standard results from Markov decision theory, the partially observable restless bandit problem can be converted to a fully observable restless bandit problem with the belief (or posterior distribution) as the state. We present the details in this section. Let denote the operator’s belief on the state of arm at time , defined as follows: for any , let . Note that is a distribution-valued random variable. Also, define .
Then, for arm , the evolution of the belief state is as follows: for model A, the belief update rule is
(4) |
and for model B, the belief update rule is
(5) |
where is the Dirac delta distribution over the discrete state space with the value of one only for state . The per-step cost function of the belief state when action is taken is
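To make the update rules (4) and (5) concrete, the following is a minimal sketch in Python. The names are ours and reflect an assumption about notation: `P0` stands for the transition matrix under the passive action, and `rho` for the common row of the transition matrix under the active action (the restart distribution of Assumption 1).

```python
import numpy as np

def belief_update_model_A(b, a, P0, rho):
    """Model A (no observations), cf. (4): propagate the belief through the
    passive dynamics, or reset it to rho when the arm is selected, since every
    row of the active transition matrix equals rho (restart property)."""
    return b @ P0 if a == 0 else rho.copy()

def belief_update_model_B(b, a, y, P0):
    """Model B, cf. (5): same passive update, but when the arm is selected the
    post-reset state y is observed, so the belief collapses to the Dirac
    distribution at y."""
    if a == 0:
        return b @ P0
    e = np.zeros(len(b))
    e[y] = 1.0
    return e
```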
Define the combined belief state of the system as follows: for any ,
Note that is a random variable that takes values in . Using standard results in POMDPs [39], we have the following.
Proposition 1
In Problem 1, is a sufficient statistic for . Therefore, there is no loss of optimality in restricting attention to decision policies of the form . Furthermore, an optimal policy with this structure can be identified by solving an appropriate dynamic program.
Next, we present our first simplification for the structure of optimal decision policy as follows.
Proposition 2
For any , we have
(6) |
Therefore, there is no loss of optimality in restricting attention to decision policies of the form . Furthermore, an optimal policy with this structure can be identified by solving an appropriate dynamic program.
Proof 1
3 Whittle index policy solution concept
For ease of notation, we drop the superscript from all relevant variables for the rest of this section and the next.
Consider an arm with a modified per-step cost function
(7) |
The modified cost function implies that there is a penalty of for taking the active action. Given any time-homogeneous policy , the modified performance of the policy is
(8) |
Subsequently, consider the following optimization problem.
Problem 2
Given an arm , the discount factor and the penalty , choose a Markov policy to minimize given by (8).
Problem 2 is a Markov decision process where one may use dynamic programming to obtain the optimal solution as follows.
Proposition 3
Proof 2
The result follows immediately from Markov decision theory [40].
Finally, we present the following definitions.
Definition 1 (Passive Set)
Given penalty , define the passive set as the set of states where passive action is optimal for the modified arm, i.e.,
Definition 2 (Indexability)
An arm is indexable if is non-decreasing in , i.e., for any ,
A restless multi-armed bandit problem is indexable if all arms are indexable.
Definition 3 (Whittle index)
The Whittle index of the state of an arm is the smallest value of for which state is part of the passive set , i.e.,
Equivalently, the Whittle index is the smallest value of for which the optimal policy is indifferent between the active action and passive action when the belief state of the arm is .
The Whittle index policy is as follows: at each time step, select the arms which are in states with the highest indices. The Whittle index policy is easy to implement and efficient to compute, but it may not be optimal. As mentioned earlier, the Whittle index policy is optimal in certain cases [18, 19, 20, 21] and performs close to optimal in many other cases [22, 23, 10, 24, 25, 26].
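To make these definitions operational, the following is a minimal sketch (in Python, with names of our own choosing; it is not the paper's algorithm) for a generic finite-state arm with passive and active transition matrices `P0`, `P1` and an S-by-2 per-step cost array `c`. It solves the penalized single-arm problem of Problem 2 by value iteration, finds the Whittle index of a state by bisection on the penalty (valid under indexability, since the passive set is then monotone in the penalty), and implements the top-m selection rule of the Whittle index policy.

```python
import numpy as np

def penalized_optimal_actions(P0, P1, c, penalty, beta=0.99, tol=1e-9, max_iter=100000):
    """Value iteration for the penalized single-arm problem (Problem 2): per-step
    cost c[s, a] plus `penalty` whenever the active action a = 1 is taken.
    Returns, for each state, 1 if the active action is optimal and 0 otherwise."""
    V = np.zeros(c.shape[0])
    for _ in range(max_iter):
        Q0 = c[:, 0] + beta * (P0 @ V)            # passive action
        Q1 = c[:, 1] + penalty + beta * (P1 @ V)  # active action
        V_new = np.minimum(Q0, Q1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return (Q1 < Q0).astype(int)

def whittle_index_by_bisection(P0, P1, c, s, lo=-100.0, hi=100.0, iters=60, beta=0.99):
    """Definition 3 made operational: the smallest penalty at which state s joins
    the passive set, located by bisection (lo and hi must bracket the index)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if penalized_optimal_actions(P0, P1, c, mid, beta)[s] == 0:
            hi = mid   # s already passive: the index is at most mid
        else:
            lo = mid   # s still active: the index exceeds mid
    return 0.5 * (lo + hi)

def whittle_policy_step(current_indices, m):
    """One step of the Whittle index policy: activate the m arms whose current
    states carry the largest indices."""
    return np.argsort(current_indices)[-m:]
```

The same generic routine can be reused on the truncated information-state models constructed in Sec. 5.4.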
4 Indexability and the corresponding Whittle index for models A and B
Given an arm, let denote the family of all stopping times , with respect to the natural filtration associated with . For any stopping time and an initial belief state , define
Theorem 1
The PO-RMAB for models A and B is indexable. In particular, each arm is indexable and the Whittle index is given by
where
(10)
(11)
and for model A and for model B.
Proof 3
Recall that we assert that and are non-decreasing in for any . Hence, for any policy
where is non-decreasing in for any and . From Markov decision theory we know that . Since the infimum of non-decreasing functions is non-decreasing, is non-decreasing in for any . Consequently, is non-decreasing which implies is non-decreasing in .
Given any stopping time , let denote a policy that takes the passive action up to and including time , takes the active action at time , and follows the optimal policy from time onwards. The performance of policy is given by
(12) |
We use to denote a policy that takes active action at time and follows the optimal policy from time onwards. The performance of policy is given by
(13) |
The next result generalizes [26, Lemma 2].
Lemma 1
The following characterizations of the passive sets are equivalent to Def. 1:
1. .
2. .
3. .
Proof 4
Note that does not depend on while we showed that is non-decreasing in . Hence, is non-decreasing in by the lemma. Thus, arm is indexable. The expression for the Whittle index in the theorem follows from the definitions.
5 Whittle index computation
Computing the Whittle index using the belief state representation is intractable in general. Inspired by the approach taken in [45], we introduce a new information state which is equivalent to the belief state.
5.1 Countable Information state
For models A and B, define
Assumption 2
For model A, and for model B, .
For model A, define a process as follows. The initial state is such that and for , is given by
(14) |
Similarly, for model B, define a process as follows. The initial state is such that and for , evolves according to (14) and evolves according to
(15) |
Note that once the first observation has been taken in both models, denotes the time elapsed since the last observation of arm and, in addition, in model B, denotes the last observed state of arm . Let and . The relation between the belief state and the variables and is characterized in the following lemma.
Lemma 2
The following statements hold under Assumption 2:
1. For model A, for any , . In particular, .
2. For model B, for any , . In particular, .
For model A, the expected per-step cost at time may be written as
(16) |
Similarly, for model B, the expected per-step cost at time may be written as
(17) |
Proposition 4
In Problem 1, there is no loss of optimality in restricting attention to decision policies of the form for model A and of the form for model B.
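As an illustration, here is a minimal sketch of the information-state updates (14)–(15) and the per-step costs (16)–(17). It relies on two conventions that are our assumptions rather than statements of the paper: the counter `lam` records the number of passive transitions since the last reset (so it is set to 0 at a selection epoch), and, in the spirit of Lemma 2, the belief after `lam` passive steps equals the reset PMF `rho` (model A) or the Dirac distribution at the last observed state (model B) propagated through the passive matrix `P0`.

```python
import numpy as np
from numpy.linalg import matrix_power

def info_state_update_A(lam, a):
    """Model A, cf. (14): the counter grows under the passive action and is
    reset when the arm is selected (convention assumed: reset to 0)."""
    return lam + 1 if a == 0 else 0

def info_state_update_B(y, lam, a, observed_state=None):
    """Model B, cf. (14)-(15): (y, lam) records the last observed (post-reset)
    state and the time elapsed since that observation."""
    if a == 0:
        return y, lam + 1
    return observed_state, 0

def expected_cost_A(lam, a, rho, P0, c):
    """Expected per-step cost for model A, cf. (16): inner product of the belief
    rho @ P0^lam (assumed convention) with the cost column c[:, a]."""
    return (rho @ matrix_power(P0, lam)) @ c[:, a]

def expected_cost_B(y, lam, a, P0, c):
    """Expected per-step cost for model B, cf. (17): belief started from the
    Dirac distribution at the last observed state y."""
    delta = np.zeros(P0.shape[0]); delta[y] = 1.0
    return (delta @ matrix_power(P0, lam)) @ c[:, a]
```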
5.2 Threshold policies
We assume that the model satisfies the following condition.
Assumption 3
Let where and are non-decreasing functions in and is submodular in .
Under Assumption 3, we derive structural properties of the optimal policies for models A and B. Then, we show how can be decomposed and computed.
In the following theorem, we show that the optimal policy for model A has a threshold structure and the optimal policy for model B has a threshold structure with respect to the second dimension of the information state.
Theorem 2
Under Assumptions 2 and 3, the following statements hold:
1. In model A, for any , the optimal policy is a threshold policy, i.e., there exists a threshold such that
Moreover, the threshold is non-decreasing in .
2. In model B, for any , the optimal policy is a threshold policy with respect to for every , i.e., there exists a threshold for each such that
Moreover, for every , the threshold is non-decreasing in .
See Section 6 for the proof.
We use to denote the vector .
5.3 Performance of threshold based policies
We simplify the notation and denote the policy corresponding to thresholds and by simply and instead of and .
5.3.1 Model A
Let be the total discounted cost incurred under policy with penalty when the initial state is , i.e.,
(18) |
where
represents the expected total discounted cost while represents the expected number of times active action is selected under policy starting from the initial information state .
We will show (see Theorem 7) that the Whittle index for model A can be computed as a function of and . First, we present a method to compute these two variables. Let
where and , respectively, denote the expected discounted cost and time starting from information state until reaching information state for the first time.
Theorem 3
For any , we have
See Section 6 for the proof.
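The quantities D and N of a threshold policy in model A can also be estimated by a direct rollout of the information-state chain, which evolves deterministically under a threshold policy. The sketch below does this by brute force (it is not the recursive characterization of Theorem 3) and uses the same assumed conventions as the earlier sketches: the counter resets to 0 on the active action, and the objective is the normalized discounted cost of Sec. 2.2.

```python
import numpy as np

def evaluate_threshold_policy_A(n, start_lam, rho, P0, c, beta=0.95, horizon=2000):
    """Roll out the threshold-n policy for model A: passive while lam < n, active
    (which resets lam) once lam >= n.  Returns the pair (D, N): the normalized
    discounted state cost and the normalized discounted number of activations,
    starting from information state start_lam."""
    D, N, lam, disc = 0.0, 0.0, start_lam, 1.0
    for _ in range(horizon):
        a = 1 if lam >= n else 0
        belief = rho @ np.linalg.matrix_power(P0, lam)
        D += disc * (belief @ c[:, a])
        N += disc * a
        lam = 0 if a == 1 else lam + 1
        disc *= beta
    return (1 - beta) * D, (1 - beta) * N
```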
5.3.2 Model B
Let be the total discounted cost incurred under policy with penalty when the initial information state is , i.e.,
(19) |
where
and have the same interpretations as the ones for model A. We will show (see Theorem 8) that the Whittle index for model B can be computed as a function of and . First, define the vector , , and the vectors and in a similar manner. Then, from (19), . Also define
Let and .
Theorem 4
For any , we have
Let be a matrix where , for any . Then,
See Section 6 for the proof.
5.4 Finite state approximation
For computing the Whittle index, we provide a finite-state approximation of Proposition 3 for models A and B. Essentially, we truncate the countable set of possible information states to a finite set and provide an approximation bound on the optimal value function for each of the models.
Theorem 5 (Model A)
Given , let and be the unique fixed point of equation
where
We set if . Then, we have the following:
(i) For any , we have
(ii) For all , . Moreover, let be any limit point of . Then, the policy is optimal for Problem 2.
See Section 6 for the proof.
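A minimal sketch of the finite-state truncation for model A: capping the counter at a level L yields an ordinary finite MDP that can be fed to a generic penalized solver such as the one sketched in Section 3. The boundary treatment below (simply freezing the counter at L) is one simple choice and is not claimed to be identical to the construction analyzed in Theorem 5.

```python
import numpy as np
from numpy.linalg import matrix_power

def truncated_model_A(L, rho, P0, c):
    """Build a finite approximation of the model-A information-state MDP with
    states {0, 1, ..., L}.  Returns (P0_trunc, P1_trunc, c_trunc), where the
    cost array has one column per action."""
    S = L + 1
    P0_t, P1_t = np.zeros((S, S)), np.zeros((S, S))
    c_t = np.zeros((S, 2))
    for lam in range(S):
        P0_t[lam, min(lam + 1, L)] = 1.0   # passive: counter grows, capped at L
        P1_t[lam, 0] = 1.0                 # active: counter resets
        belief = rho @ matrix_power(P0, lam)
        c_t[lam, 0] = belief @ c[:, 0]
        c_t[lam, 1] = belief @ c[:, 1]
    return P0_t, P1_t, c_t
```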
Theorem 6 (Model B)
Given , let and be the unique fixed point of equation
where
We set if . Then, we have the following:
(i) For any ,
(ii) For all , . Let be any limit point of . Then, the policy is optimal for Problem 2.
See Section 6 for the proof.
5.5 Computation of Whittle index
Next, we derive a closed form expression to compute the Whittle index for model A and provide an efficient algorithm to compute the Whittle index for model B.
5.5.1 Whittle index formula for model A.
For model A, we obtain the Whittle index formula based on the two variables and as follows.
Theorem 7
Let . Then, under Assumption 3, , and the Whittle index of model A at information state is
(20) |
Proof 7
Since model A is a restart model, the result follows from [31, Lemma 4].
Theorem 7 gives us a closed-form expression to approximately compute the Whittle index for model A.
5.5.2 Modified adaptive greedy algorithm for model B.
Let and denote the number of distinct Whittle indices. Let where denote the sorted distinct Whittle indices with . Let . For any subset , define the policy as
Given , define and . Additionally, for any , and all states , define , and . Then, for all , define
(21) |
Lemma 3
For , we have the following:
1. For all , we have .
2. For all and , we have for all with equality if and only if and .
Proof 8
Theorem 8
The following properties hold:
1. For any , the set is non-empty.
2. For any , with equality if and only if .
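As a brute-force baseline (not the refined adaptive greedy algorithm of this section), one can enumerate the truncated information states (y, lam) of model B, build the induced finite MDP, and compute the Whittle index of each state directly from Definition 3, for instance with the bisection routine sketched in Section 3. The sketch below constructs that finite MDP under the same assumed conventions as the earlier sketches; it is useful mainly as a sanity check on small instances.

```python
import numpy as np
from numpy.linalg import matrix_power

def truncated_model_B(L, rho, P0, c):
    """Enumerate model-B information states (y, lam) with lam <= L and build the
    corresponding finite MDP (passive/active transition matrices and cost array)."""
    n = P0.shape[0]
    states = [(y, lam) for y in range(n) for lam in range(L + 1)]
    idx = {s: i for i, s in enumerate(states)}
    S = len(states)
    P0_t, P1_t, c_t = np.zeros((S, S)), np.zeros((S, S)), np.zeros((S, 2))
    for (y, lam), i in idx.items():
        P0_t[i, idx[(y, min(lam + 1, L))]] = 1.0   # passive: counter grows (capped)
        for y_new in range(n):                     # active: reset and observe y_new
            P1_t[i, idx[(y_new, 0)]] += rho[y_new]
        delta = np.zeros(n); delta[y] = 1.0
        belief = delta @ matrix_power(P0, lam)
        c_t[i, 0], c_t[i, 1] = belief @ c[:, 0], belief @ c[:, 1]
    return states, P0_t, P1_t, c_t
```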
6 Proof of Main Results
6.1 Proof of Theorem 2
Let and be two probability mass functions on a totally ordered set . Then we say stochastically dominates if for all , . Given two transition matrices and , we say stochastically dominates if each row of stochastically dominates the corresponding row of . A basic property of stochastic dominance is the following.
Lemma 4
If stochastically dominates and is a non-decreasing function defined on , then for all , .
Proof 10
This follows from [40, Lemma 4.7.2].
Consider a fully-observable restless bandit process (note that is removed due to the observability assumption). According to [31], we say a fully-observable restless bandit process is stochastic monotone if it satisfies the following conditions.
(D1) and are stochastic monotone transition matrices.
(D2) For any , is non-decreasing in .
(D3) For any , is non-decreasing in .
(D4) is submodular in .
The following is established in [31, Lemma 5].
Proposition 5
The optimal policy of a stochastic monotone fully-observable restless bandit process is a threshold policy denoted by , which is a policy which takes passive action for states below a threshold denoted by and active action for the rest of the states, i.e.,
6.1.1 Proof of Theorem 2, Part 1
We show that each machine in model A is a stochastic monotone fully-observable restless bandit process. Each condition of a stochastic monotone fully-observable restless bandit process is verified for model A below.
(D1') The transition probability matrix under passive action for model A based on the information states is and the transition probability matrix under active action for model A is . Thus, and are stochastic monotone matrices.
(D2') Since is a stochastic monotone matrix and has constant rows, is non-decreasing in for any integer .
(D3')
(D4') As is submodular in and as shown in (D3'), stochastically dominates for any . Therefore, by Lemma 4, is non-decreasing in .
Therefore, according to Proposition 5, the optimal policy of a fully-observable restless bandit process under model A is a threshold based policy.
Finally, since the optimal policy is threshold based, the passive set is given by . As shown in Theorem 1, model A is indexable. Therefore, the passive set must be non-decreasing in , which implies that the threshold is non-decreasing in .
6.1.2 Proof of Theorem 2, Part 2
We first characterize the behavior of the value function and the state-action value function for model B.
Lemma 5
We have:
a. is non-decreasing in for any and .
b. Given a fixed , is non-decreasing in for any .
c. is submodular in , for any .
d. is submodular in , for any .
Proof 11
The proof of each part is as follows.
a. By definition, we have
Similar to the proof of (D3') in Proposition 5, for a given and , is non-decreasing in and , and as is non-decreasing in , is non-decreasing in .
b. Let
where for all .
Claim: is non-decreasing in for any and .
We prove the claim by induction. By construction, is non-decreasing in for any . This forms the basis of induction. Now assume that is non-decreasing in for any and some . Consider . Then, by the induction hypothesis, we have
Therefore,
Thus, is non-decreasing in for any . This completes the induction step. Since monotonicity is preserved under limits, the claimed monotonicity also holds in the limit, which completes the proof of the claim.
c. is submodular in . Also, note that is the row of . Thus, stochastically dominates and by Lemma 4 we have
Therefore,
Consequently,
Hence,
d. As for any , is non-decreasing in , and is submodular in , for any and , we have
Lemma 6
Suppose is a submodular function and for each , exists. Then, is monotone non-decreasing in .
Proof 12
This result follows from [40, Lemma 4.7.1].
Now, since is submodular in for any , and since only two actions are available, it follows from Lemma 6 that the optimal policy is a threshold policy as specified in the theorem statement.
Finally, since the optimal policy is threshold based, the passive set is given by . As shown in Theorem 1, model B is indexable. Therefore, the passive set must be non-decreasing in , which implies that, for every , the threshold is non-decreasing in .
6.2 Proof of Theorem 3
By the strong Markov property, we have
If we set in the above,
6.3 Proof of Theorem 4
By the strong Markov property, we have
If we set in the above,
which results in
and hence, the statement is obtained by rearranging the terms in the above equations.
6.4 Proof of Theorem 5
(i): Starting from information state , the cost incurred by is the same as for information states . The per-step cost incurred by differs from for information states by at most .
(ii): The sequence of finite-state models described above is an augmentation type approximation sequence. As a result, a limit point of exists and the final result holds by [47, Proposition B.5, Theorem 4.6.3].
6.5 Proof of Theorem 6
(i): Starting from information state , given any and , the cost incurred by is the same as for information states . The per-step cost incurred by differs from for later realized information states by at most . Thus, the bound holds.
(ii): The sequence of finite-state models described above is an augmentation type approximation sequence. As a result, a limit point of exists and the final result holds [47, Proposition B.5, Theorem 4.6.3].
7 Numerical Analysis
In this section, we consider Example 2 and compare the performance of the following policies:
- opt: the optimal policy obtained using dynamic programming. As discussed earlier, the dynamic programming computation to obtain the optimal policy suffers from the curse of dimensionality. Therefore, the optimal policy can be computed only for small-scale models.
- myp: the myopic policy, a heuristic which selects machines sequentially: given the machines selected so far, the next machine chosen is the one that provides the smallest increase in the total per-step cost. The detailed description for model B is shown in Alg. 2; a minimal sketch of the selection rule is given after this list.
- wip: the Whittle index heuristic, as described in this paper.
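A minimal sketch of the myopic selection rule described above: since the arms evolve independently and the costs are additive across arms, the sequential selection reduces to ranking the arms by the one-step marginal cost of activating them. The per-arm expected one-step costs can be computed from the current information states, for example with the cost routines sketched in Sec. 5.1; Alg. 2 gives the detailed version for model B.

```python
import numpy as np

def myopic_selection(cost_if_passive, cost_if_active, m):
    """One-step greedy heuristic: activate the m arms whose activation yields the
    smallest increase (equivalently, the largest decrease) in the expected total
    per-step cost at the current information states."""
    marginal = np.asarray(cost_if_active) - np.asarray(cost_if_passive)
    return np.argsort(marginal)[:m]
```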
7.1 Experiments and Results
We conduct numerical experiments for both models A and B, and vary the number of machines, the number of service-persons, and the parameters associated with each machine. There are three parameters associated with each machine: the deterioration probability matrix , the reset PMF , and the per-step cost . We assume the matrix is chosen from a family of four types of structured transition matrices , where is a parameter of the model. The details of these models are presented in Appendix A. We assume each element of is sampled from Exp(), i.e., an exponential distribution with rate parameter , and then normalized such that the sum of all elements equals . Finally, we assume that the per-step cost is given by and .
In all experiments, the discount factor is . The performance of every policy is evaluated using Monte-Carlo simulation of length averaged over sample paths.
In Experiment 1, we consider a small-scale problem where we can compute opt and compare the performance of wip with it. In Experiment 2, we consider a large-scale problem where we compare the performance of wip with myp, as computing the optimal policy is highly time-consuming.
The code for both experiments is available at [48].
Experiment 1) Comparison of Whittle index with the optimal policy.
In this experiment, we compare the performance of wip with opt. We assume , and , for both models A and B. In order to model heterogeneous machines, we consider the following. Let denote equispaced points in the interval . Then we choose as the transition matrix of machine . We denote the accumulated discounted cost of wip and opt by and , respectively. In order to have a better perspective of the performances, we compute the relative performance of wip with respect to opt by computing
(22) |
The closer is to , the closer wip is to opt. For all combinations of the parameters, the results of indicate that the Whittle index policy performs as well as the optimal policy.
Experiment 2) Comparison of Whittle index with the myopic policy for structured models.
In this experiment, we increase the state space size to , set , and select from the set and from the set . We denote the accumulated discounted cost of myp by . In order to have a better perspective of the performances, we compute the relative improvement of wip with respect to myp by computing
(23) |
Note that means that wip performs better than myp. We generate structured transition matrices, similar to Experiment 1, and apply the same procedure to build heterogeneous machines. The results of for different choices of the parameters for models A and B are shown in Tables 1 and 2, respectively.
Table 1: Relative improvement of wip over myp for model A.

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
|   | 1.99 | 2.54 | 2.24 | 7.44 |
|   | 3.41 | 6.90 | 4.71 | 8.14 |
|   | 2.97 | 6.19 | 2.80 | 6.70 |

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
|   | 0.21 | 0.26 | 0.19 | 0.97 |
|   | 0.68 | 1.73 | 1.28 | 4.54 |
|   | 1.36 | 2.35 | 2.32 | 6.41 |

Table 2: Relative improvement of wip over myp for model B.

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
|   | 7.67 | 11.17 | 12.12 | 9.39 |
|   | 14.96 | 13.85 | 14.55 | 9.17 |
|   | 15.02 | 12.12 | 13.39 | 6.63 |

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
|   | 0.63 | 1.62 | 1.01 | 2.92 |
|   | 2.92 | 3.14 | 3.21 | 6.57 |
|   | 4.86 | 7.22 | 6.99 | 9.96 |
7.2 Discussion
In Experiment 1, where wip is compared with opt, we observe that the relative performance is very close to one for almost all experiments, implying that wip performs as well as opt in these experiments. The relative performance in model B is lower than in model A because, for a given set of parameters, model B is more complex than model A; hence, the gap between the performance of the two policies is larger for model B.
In Experiment 2, where wip is compared with myp, we observe that the relative improvement ranges from to . By a similar reasoning as in Experiment 1, since model B is more complex than model A, the improvement for model B is higher than that for model A for the same set of parameters.
Furthermore, we observe that as increases, also increases overall. Also, as increases, decreases in general. This suggests that as increases, the overlap between the sets of machines chosen according to wip and myp grows, and hence, the performances of wip and myp become close to each other.
8 Conclusion
We investigated partially observable restless bandits. Unlike most of the existing literature, which restricts attention to models with a binary state space, we considered general state-space models. We presented two observation models, which we call model A and model B, and showed that the partially observable restless bandits are indexable for both models.
To compute the Whittle index, we work with a countable space representation rather than the belief state representation. We established certain qualitative properties of the auxiliary problem to compute the Whittle index. In particular, for both models we showed that the optimal policies of the auxiliary problem satisfy threshold properties. For model A, we used the threshold property to obtain a closed form expression to compute the Whittle index. For model B, we used the threshold policy to present a refinement of the adaptive greedy algorithm of [31] to compute the Whittle index.
Finally, we presented a detailed numerical study of a machine maintenance model. We observed that for small-scale models, the Whittle index policy is close-to-optimal and for large-scale models, the Whittle index policy outperforms the myopic policy baseline.
Appendix A Structured Markov chains
Consider a Markov chain with states. Then a family of structured stochastic monotone matrices which dominates the identity matrix is illustrated below.
1. Matrix : Let and . Then,
2. Matrix : Similar to with and .
3. Matrix : Similar to with and .
4. Matrix : Let . Then,
References
- [1] R. Meshram, D. Manjunath, A. Gopalan, On the Whittle index for restless multiarmed hidden Markov bandits 63 (9) (2018) 3046–3053.
- [2] S. Guha, K. Munagala, P. Shi, Approximation algorithms for restless bandit problems, Journal of the ACM (JACM) 58 (1) (2010) 3.
- [3] K. Kaza, R. Meshram, V. Mehta, S. N. Merchant, Sequential decision making with limited observation capability: Application to wireless networks, IEEE Transactions on Cognitive Communications and Networking 5 (2) (2019) 237–251.
- [4] K. Kaza, V. Mehta, R. Meshram, S. Merchant, Restless bandits with cumulative feedback: Applications in wireless networks, in: Wireless Communications and Networking Conference, IEEE, 2018, pp. 1–6.
- [5] S. Aalto, P. Lassila, P. Osti, Whittle index approach to size-aware scheduling for time-varying channels with multiple states, Queueing Systems 83 (3-4) (2016) 195–225.
- [6] M. Larrañaga, M. Assaad, A. Destounis, G. S. Paschos, Dynamic pilot allocation over Markovian fading channels: A restless bandit approach, in: Information Theory Workshop, IEEE, 2016, pp. 290–294.
- [7] N. Akbarzadeh, A. Mahajan, Dynamic spectrum access under partial observations: A restless bandit approach, in: Canadian Workshop on Information Theory, IEEE, 2019, pp. 1–6.
- [8] S. S. Villar, J. Bowden, J. Wason, Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges, Statistical science: a review journal of the Institute of Mathematical Statistics 30 (2) (2015) 199.
- [9] C. Abad, G. Iyengar, A near-optimal maintenance policy for automated DR devices, IEEE Transactions on Smart Grid 7 (3) (2016) 1411–1419.
- [10] K. D. Glazebrook, H. M. Mitchell, P. S. Ansell, Index policies for the maintenance of a collection of machines by a set of repairmen, European Journal of Operational Research 165 (1) (2005) 267–284.
- [11] S. S. Villar, Indexability and optimal index policies for a class of reinitialising restless bandits, Probability in the engineering and informational sciences 30 (1) (2016) 1–23.
- [12] Y. Qian, C. Zhang, B. Krishnamachari, M. Tambe, Restless poachers: Handling exploration-exploitation tradeoffs in security domains, in: Int. Conf. on Autonomous Agents & Multiagent Systems, 2016, pp. 123–131.
- [13] V. S. Borkar, Whittle index for partially observed binary markov decision processes, IEEE Transactions on Automatic Control 62 (12) (2017) 6614–6618.
- [14] V. S. Borkar, G. S. Kasbekar, S. Pattathil, P. Shetty, Opportunistic scheduling as restless bandits, IEEE Transactions on Control of Network Systems (2017).
- [15] K. E. Avrachenkov, V. S. Borkar, Whittle index policy for crawling ephemeral content, IEEE Transactions on Control of Network Systems 5 (1) (2018) 446–455. doi:10.1109/TCNS.2016.2619066.
- [16] C. H. Papadimitriou, J. N. Tsitsiklis, The complexity of optimal queuing network control, Mathematics of Operations Research 24 (2) (1999) 293–305.
- [17] P. Whittle, Restless bandits: Activity allocation in a changing world, Journal of Applied Probability 25 (A) (1988) 287–298.
- [18] J. C. Gittins, Bandit processes and dynamic allocation indices, Journal of the Royal Statistical Society. Series B (Methodological) (1979) 148–177.
- [19] R. R. Weber, G. Weiss, On an index policy for restless bandits, Journal of Applied Probability 27 (3) (1990) 637–648.
- [20] C. Lott, D. Teneketzis, On the optimality of an index rule in multichannel allocation for single-hop mobile networks with multiple service classes, Probability in the Engineering and Informational Sciences 14 (3) (2000) 259–297.
- [21] K. Liu, Q. Zhao, Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access 56 (11) (2010) 5547–5567.
- [22] J. Niño-Mora, Dynamic priority allocation via restless bandit marginal productivity indices, TOP 15 (2) (2007) 161–198.
- [23] P. S. Ansell, K. D. Glazebrook, J. Niño-Mora, M. O’Keeffe, Whittle’s index policy for a multi-class queueing system with convex holding costs, Mathematical Methods of Operations Research 57 (1) (2003) 21–39.
- [24] K. D. Glazebrook, D. Ruiz-Hernandez, C. Kirkbride, Some indexable families of restless bandit problems, Advances in Applied Probability 38 (3) (2006) 643–672.
- [25] U. Ayesta, M. Erausquin, P. Jacko, A modeling framework for optimizing the flow-level scheduling with time-varying channels, Performance Evaluation 67 (11) (2010) 1014–1029.
- [26] N. Akbarzadeh, A. Mahajan, Restless bandits with controlled restarts: Indexability and computation of Whittle index, in: Conference on Decision and Control, 2019, pp. 7294–7300.
- [27] T. W. Archibald, D. P. Black, K. D. Glazebrook, Indexability and index heuristics for a simple class of inventory routing problems, Operations research 57 (2) (2009) 314–326.
- [28] K. Avrachenkov, U. Ayesta, J. Doncel, P. Jacko, Congestion control of TCP flows in internet routers by means of index policy, Computer Networks 57 (17) (2013) 3463–3478.
- [29] K. Glazebrook, D. Hodge, C. Kirkbride, Monotone policies and indexability for bidirectional restless bandits, Advances in Applied Probability 45 (1) (2013) 51–85.
- [30] Z. Yu, Y. Xu, L. Tong, Deadline scheduling as restless bandits 63 (8) (2018) 2343–2358.
- [31] N. Akbarzadeh, A. Mahajan, Conditions for indexability of restless bandits and an algorithm to compute Whittle index, Adv. in Appl. Probab. 54 (4) (2022) 1164–1192.
- [32] J. Niño-Mora, Restless bandits, partial conservation laws and indexability, Advances in Applied Probability 33 (1) (2001) 76–98.
- [33] N. Gast, B. Gaujal, K. Khun, Computing Whittle (and Gittins) index in subcubic time, arXiv preprint arXiv:2203.05207 (2022).
- [34] D. Ruiz-Hernández, J. M. Pinar-Pérez, D. Delgado-Gómez, Multi-machine preventive maintenance scheduling with imperfect interventions: A restless bandit approach, Computers & Operations Research 119 (2020) 104927.
- [35] A. Mate, J. Killian, H. Xu, A. Perrault, M. Tambe, Collapsing bandits and their application to public health intervention, Advances in Neural Information Processing Systems 33 (2020) 15639–15650.
- [36] C. R. Dance, T. Silander, Optimal policies for observing time series and related restless bandit problems., J. Mach. Learn. Res. 20 (2019) 35–1.
- [37] K. Liu, Index policy for a class of partially observable Markov decision processes, arXiv preprint arXiv:2107.11939 (2021).
- [38] D. B. Brown, J. E. Smith, Index policies and performance bounds for dynamic selection problems, Management Science 66 (7) (2020) 3029–3050.
- [39] K. J. Astrom, Optimal control of Markov processes with incomplete state information, Journal of mathematical analysis and applications 10 (1) (1965) 174–205.
- [40] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming, John Wiley & Sons, 2014.
- [41] E. Altman, Constrained Markov decision processes, Vol. 7, CRC press, 1999.
- [42] G. M. Lipsa, N. C. Martins, Remote state estimation with communication costs for first-order LTI systems, IEEE Transactions on Automatic Control 56 (9) (2011) 2013–2025.
- [43] J. Chakravorty, A. Mahajan, Fundamental limits of remote estimation of autoregressive Markov processes under communication constraints., IEEE Trans. Automat. Contr. 62 (3) (2017) 1109–1124.
- [44] A. Mahajan, Remote estimation over control area networks, in: 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall), 2017, pp. 1–5.
- [45] D. I. Shuman, A. Nayyar, A. Mahajan, Y. Goykhman, K. Li, M. Liu, D. Teneketzis, M. Moghaddam, D. Entekhabi, Measurement scheduling for soil moisture sensing: From physical models to optimal control, Proceedings of the IEEE 98 (11) (2010) 1918–1933.
- [46] J. Keilson, A. Kester, Monotone matrices and monotone Markov processes, Stochastic Processes and their Applications 5 (3) (1977) 231–241.
- [47] L. I. Sennott, Stochastic dynamic programming and the control of queueing systems, Vol. 504, John Wiley & Sons, 2009.
- [48] N. Akbarzadeh, A. Mahajan, Partially observable restless bandits with restarts, https://codeocean.com/capsule/6654724/tree/v1.