
Low-Complexity Algorithm for Restless Bandits with Imperfect Observations

Keqin Liu^{1,2}, Richard Weber^{3}, Chengzhong Zhang

1. Department of Mathematics, Nanjing University, Nanjing 210093, China
2. National Center for Applied Mathematics, Nanjing 210093, China
3. Department of Mathematics, University of Cambridge, Cambridge CB3 0WB, UK
Abstract

We consider a class of restless bandit problems that finds broad application in reinforcement learning and stochastic optimization. We consider $N$ independent discrete-time Markov processes, each of which has two possible states: 1 and 0 ('good' and 'bad'). Only if a process is both in state 1 and observed to be so does reward accrue. The aim is to maximize the expected discounted sum of returns over the infinite horizon, subject to the constraint that only $M$ $(<N)$ processes may be observed at each step. Observation is error-prone: there are known probabilities that state 1 (0) will be observed as 0 (1). From this one knows, at any time $t$, a probability that process $i$ is in state 1. The resulting system may be modeled as a restless multi-armed bandit problem with an information state space of uncountable cardinality. Restless bandit problems with even finite state spaces are PSPACE-HARD in general. We propose a novel approach for simplifying the dynamic programming equations of this class of restless bandits and develop a low-complexity algorithm that achieves strong performance and is readily extensible to the general restless bandit model with observation errors. Under certain conditions, we establish the existence (indexability) of the Whittle index and its equivalence to our algorithm. When those conditions do not hold, we show by numerical experiments the near-optimal performance of our algorithm in the general parametric space. Furthermore, we theoretically prove the optimality of our algorithm for homogeneous systems.

keywords:
restless bandits, continuous state space, observation errors, value functions, index policy

1 Introduction

The exploration-exploitation (EE) dilemma is well posed in optimization-over-time problems and mathematically modeled in various forms for reinforcement learning, to which a major category, multi-armed bandits (MAB), belongs. In the classical MAB model, a player chooses one out of $N$ statistically independent arms to pull and possibly accrues reward determined by the state of the chosen arm, which transits to a new state according to a known Markovian rule (Gittins et al., 2011). The states of the other arms remain frozen. The objective is to maximize the expected total discounted reward summed over times $t=1,2,\ldots$ to an infinite time horizon with discount factor $\beta\in(0,1)$,

\max_{\pi\in\Pi}\mathbbm{E}_{\pi}\left[\sum_{t=1}^{\infty}\beta^{t-1}R(t)\Big{|}S(t),A(t)\right]. (1)

The expectation is taken under some policy $\pi$, chosen from the set of all feasible policies $\Pi$; $S(t)$ is the joint state of all arms at time $t$, $A(t)\in\{1,2,\ldots,N\}$ is the arm pulled at $t$ and $R(t)$ is the reward thus obtained. It follows from the standard theory of Markov decision processes that there must exist an optimal stationary policy $\pi^{*}$, independent of time $t$. If each arm's state space has cardinality $K$, then the joint state space has size $K^{N}$. This means that a dynamic programming solution to the problem will have running time that grows geometrically as the number of arms increases. Gittins (1979) solved the problem by showing the optimality of an index policy, i.e., for each state of each arm there exists an index depending solely on the parameters of that arm; it is then optimal at each time to choose the arm whose state has the highest index. The running time of the Gittins index policy grows only linearly with the number of arms, as the arms are decoupled when computing the index function (of the states of each arm). Whittle (1988) generalized the Gittins index to the restless MAB model, in which those arms that are not chosen may also change states and produce reward. Whittle's generalization has been shown to perform very well in theoretical and numerical studies (see, e.g., Weber and Weiss (1990, 1991), Liu and Zhao (2010), Hu and Frazier (2017), Zayas-Cabán et al. (2019), Brown and Smith (2020), Chen et al. (2021), Gast et al. (2021)). In general, however, it is difficult to theoretically establish the condition that is necessary for the Whittle index to exist (so-called indexability) and to solve for the closed-form Whittle index when it does exist, due to the curse of dimensionality (see Sec. 2.1). For finite-state models, Gast et al. (2023) proposed an efficient algorithm to numerically test indexability and compute the Whittle index. A natural question to ask is whether one can transform a bandit with a continuous state space into a finite-state one by discretization. Unfortunately, we found that determining how fine the discretization needs to be, given the system parameters, is itself a difficult problem. Furthermore, a finer discretization inevitably leads to a larger state transition matrix with increased algorithmic complexity and unpredictability of the steady-state system performance. This motivates us to consider alternative approaches to bandits with infinite state spaces, as addressed in this paper. In terms of searching for the general optimal policy, Papadimitriou and Tsitsiklis (1999) have shown that the restless MAB with a finite state space is PSPACE-HARD in general. With an infinite state space, a restless bandit problem yields practical difficulties in implementing purely numerical methods, as discussed above. In this paper, we show that for our particular Markovian model with an infinite state space, the Whittle index policy can be efficiently implemented with satisfactory accuracy after theoretical analysis of the rich mathematical structure of the problem.

In this paper, we extend the work in Liu and Zhao (2010) (for a perfect observation model) and Liu et al. (2010) (for the myopic policy on stochastically identical arms) to build a near-optimal algorithm with low complexity for a class of restless bandits with an infinite state space and an imperfect observation model. Our model also belongs to the general framework of partially observable Markov decision processes (POMDP) (Sondik, 1978). Consider $N$ processes, each of which evolves on a 2-state Markov chain whose state is observed if and only if the process is chosen. Furthermore, the observation is error-prone: state 1 may be observed as 0 and vice versa. Each process is referred to as an arm. At time $t$, the player obtains reward of amount $B_{n}$ if and only if arm $n$ is currently chosen and accurately observed in state 1. Under resource constraints, the player's goal is to select $M$ ($M<N$) arms at each time and maximize the long-term reward. By formulating the belief vector as the system state for decision making, we show that indexability is satisfied under certain conditions. Furthermore, we propose an efficient algorithm to compute an approximate Whittle index that achieves near-optimal performance in general, even if the conditions for indexability do not hold.

Rahul et al. (2018), Varun et al. (2018) and Kesav et al. (2019) considered similar models (except for some nuances) and established indexability under much stricter conditions on the system parameters. For example, all three papers require that the Markov transition probabilities of each arm have differences bounded by $1/3$, while we do not need any restriction on the transition probabilities. Furthermore, the three papers require that the discount factor $\beta$ be less than $1/3$, while our condition on $\beta$ is a more relaxed closed-form expression of the system parameters. In terms of the computation of the Whittle index, their algorithm is a direct application of general reinforcement learning, while our algorithmic framework is based on a detailed analysis of the value functions with quick convergence to the exact Whittle index function. In this paper, we also plot the performance of the optimal policy to demonstrate the near-optimality of our algorithm, in addition to the comparison with the myopic policy. For homogeneous systems (stochastically identical arms), we show that our algorithm is equivalent to the myopic policy and theoretically prove its optimality under certain conditions. Wang et al. (2018) also considered our model and assumed the optimality of the threshold policy for a single arm while using a very coarse linear approximation to compute the Whittle index function (the key step (a) for the second equality in the proof of Lemma 6 in Wang et al. (2018) is incorrect). In this paper, we rigorously prove the optimality of the threshold policy for a single arm, establish indexability under certain conditions, and subsequently construct an efficient algorithm for computing the Whittle index function with arbitrary precision, with its optimality numerically verified in general and formally proved for a class of homogeneous systems.

The rest of this paper is organized as follows. Sec. 2 presents our problem formulation and main results on the optimal threshold policy and indexability. Sec. 3 presents our algorithm with design details. Sec. 4 presents a theoretical proof of optimality for homogeneous arms. Sec. 5 concludes the paper and provides some future research directions in this field.

2 Main Results

Consider a restless MAB having $N$ internal 2-state Markov chains (arms) with potentially different transition probabilities. At each time $t$, the player chooses to observe the states of $M$ $(<N)$ arms. Let $S\in\{0~\text{(bad)},1~\text{(good)}\}$ denote the current state of an arm and let $O$ denote its observation (detection) outcome. The error probabilities are $\delta=\Pr(O=1\mid S=0)$ and $\epsilon=\Pr(O=0\mid S=1)$, i.e., the probabilities of miss detection and false alarm, respectively, in the observation model. In Levy (2008), it was shown that the error probabilities $\delta$ and $\epsilon$ follow the curve of receiver operating characteristics (ROC) under the optimal detector, which makes $1-\delta$ a concave increasing function from 0 to 1 of $\epsilon$. This matches the intuition that making a detector more sensitive will reduce $\delta$ but increase $\epsilon$. Since the optimal detector design is a solved problem and not the focus of this paper, we simply assume that $\delta$ and $\epsilon$ are given. If arm $n$ in state $S=1$ is observed in state 1 (i.e., $S=O=1$), then the player accrues $B_{n}$ units of reward from this arm. One of many application examples of this observation model is cognitive radio, where a secondary user aims to utilize a frequency band (channel/arm) currently unused by the primary users. Due to energy and policy constraints on the sensor of the secondary user, only a subset of channels can be sensed at each time; if any of them is sensed idle ($O=1$), the user can send packets over it to its receiver and obtain an ACK (acknowledgement) at the end of the time slot if the channel is indeed idle ($S=1$); otherwise no ACK from this channel is received. The reward $B_{n}$ is then just the bandwidth of channel $n$. Clearly, the hard constraint here should be on the miss detection probability $\delta$ to guarantee the satisfaction of the primary users, i.e., the disturbance (when a secondary user senses a busy channel as idle and subsequently sends data over it) to the primary users should be capped.

2.1 System Model and Belief Vector

At each discrete time $t$, the internal state ($0/1$) of an arm cannot be observed before deciding whether or not to observe the arm. Therefore, we cannot use the states of the Markov chains as the system state for decision making. Applying the general POMDP theory to our model, the belief state vector, consisting of the probabilities that the arms are in state 1 given all past observations, is a sufficient statistic for making future decisions (Sondik, 1978):

\boldsymbol{\omega}(t)=(\omega_{1}(t),\omega_{2}(t),\cdots,\omega_{N}(t)), (2)
\omega_{n}(t)=\Pr(S_{n}(t)=1\mid\text{past observations on arm}~n),~~\forall\ n\in\{1,\cdots,N\}, (3)

where $\omega_{n}(t)$ is the belief state of arm $n$ at time $t$ and $S_{n}(t)$ its internal state. According to Bayes' rule, the belief state (of any arm) itself evolves as a Markov chain with an infinite state space:

\omega_{n}(t+1)=\begin{cases}p_{11}^{(n)},&n\in A(t),\ \mathop{\text{ACK}}_{n}(t)~(S_{n}(t)=1,O_{n}(t)=1)\\ \mathcal{T}_{n}\left(\frac{\epsilon\omega_{n}(t)}{\epsilon\omega_{n}(t)+1-\omega_{n}(t)}\right),&n\in A(t),\ \text{no}\mathop{\text{ACK}}_{n}(t)\\ \mathcal{T}_{n}(\omega_{n}(t)),&n\notin A(t)\end{cases} (4)
\mathcal{T}_{n}(\omega_{n}(t))=\omega_{n}(t)p^{(n)}_{11}+(1-\omega_{n}(t))p^{(n)}_{01}, (5)

where $A(t)\subset\{1,2,\ldots,N\}$ is the set of arms chosen at time $t$ with $|A(t)|=M$, $S_{n}(t)$ and $O_{n}(t)$ are respectively the state of and observation from arm $n$ at time $t$ if $n\in A(t)$, $\mathop{\text{ACK}}_{n}(t)$ is the acknowledgement of successful utilization of arm $n$ in slot $t$, $\mathcal{T}_{n}(\cdot)$ the one-step belief update operator without observation, and $\textbf{P}^{(n)}=\{p^{(n)}_{ij}\}_{i,j\in\{0,1\}}$ the transition matrix of the internal Markov chain of arm $n$. Note that when $\epsilon=0$, all three expressions of the next belief state in (4) become linear functions and the problem is reduced to that in Liu and Zhao (2010) for perfect observations, where the value functions for dynamic programming can be directly solved in closed form. Later, we will see that when $\epsilon\neq 0$, the value functions are very difficult to analyze except for a few general properties. Without a fine-grained analysis of the value functions, the indexability and the Whittle index cannot be established for our model. Our approach is to start with a finite time horizon and then apply backward induction (on the time horizon) until we take limits of the corresponding functions as the time horizon goes to infinity. The key idea is to obtain analytic bounds on the value functions and their derivatives as functions of the system parameters instead of just numerical computations of these functions. We will show that these bounds, together with some detailed properties of the value functions, are sufficient for our purpose. From (5), the $k$-step belief update of an arm unobserved for $k$ consecutive slots starting from any belief state $\omega$ is

\mathcal{T}_{n}^{k}(\omega)=\frac{p^{(n)}_{01}-(p^{(n)}_{11}-p^{(n)}_{01})^{k}\left(p^{(n)}_{01}-(1+p^{(n)}_{01}-p^{(n)}_{11})\omega\right)}{1+p^{(n)}_{01}-p^{(n)}_{11}}. (6)

For simplicity of notation, we denote $\mathcal{T}_{n}^{1}(\cdot)$ by $\mathcal{T}_{n}(\cdot)$.
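For concreteness, the belief evolution in (4)-(6) can be written as a few lines of code. The sketch below is only an illustration under assumed example semantics; all function names are ours and not part of the paper.

```python
# Minimal sketch of the belief updates (4)-(6); function names are ours, not the paper's.

def T(omega, p11, p01):
    """One-step belief update without observation, eq. (5)."""
    return omega * p11 + (1.0 - omega) * p01

def phi(omega, eps):
    """Bayesian correction of the belief after the arm is chosen but no ACK is received."""
    return eps * omega / (eps * omega + 1.0 - omega)

def belief_update(omega, active, ack, p11, p01, eps):
    """Next belief state of one arm, following eq. (4)."""
    if not active:                       # arm not chosen: blind one-step prediction
        return T(omega, p11, p01)
    if ack:                              # chosen and ACK received: state was surely 1
        return p11
    return T(phi(omega, eps), p11, p01)  # chosen, no ACK: Bayes correction then prediction

def T_k(omega, k, p11, p01):
    """k-step unobserved update in closed form, eq. (6)."""
    r = p11 - p01
    return (p01 - r**k * (p01 - (1.0 + p01 - p11) * omega)) / (1.0 + p01 - p11)

if __name__ == "__main__":
    # consistency check: iterating T five times matches the closed form (6)
    p11, p01, eps, omega = 0.8, 0.2, 0.3, 0.5
    w = omega
    for _ in range(5):
        w = T(w, p11, p01)
    assert abs(w - T_k(omega, 5, p11, p01)) < 1e-12
```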
At time $t=1$, the initial belief state $\omega_{n}(1)$ of arm $n$ can be set as the stationary distribution $\omega_{n,o}$ of the internal Markov chain (here we assume the internal Markov chain with transition matrix $\textbf{P}^{(n)}$ is irreducible and aperiodic):

\omega_{n}(1)=\omega_{n,o}=\lim_{k\rightarrow\infty}\mathcal{T}_{n}^{k}(\omega^{\prime})=\frac{p^{(n)}_{01}}{p^{(n)}_{01}+p^{(n)}_{10}}, (7)

where $\omega_{n,o}$ is the unique solution to $\mathcal{T}_{n}(\omega)=\omega$ and $\omega^{\prime}\in[0,1]$ is an arbitrary probability. Given the initial belief vector $\boldsymbol{\omega}(1)=(\omega_{1}(1),\omega_{2}(1),\ldots,\omega_{N}(1))$, we arrive at the following constrained optimization problem:

\max\limits_{\pi:\boldsymbol{\omega}(t)\rightarrow A(t)}\mathbbm{E}_{\pi}\left[\sum\limits_{t=1}^{\infty}\beta^{t-1}R(t)\Big{|}\boldsymbol{\omega}(1)\right], (8)
\text{subject to}\quad|A(t)|=M,\quad t\geq 1. (9)

Now the decision problem has a countable state space, as modelled by the belief vector, for a fixed initial $\boldsymbol{\omega}(1)$, and an uncountable state space for an arbitrarily chosen $\boldsymbol{\omega}(1)$. It is clear that, fixing $\boldsymbol{\omega}(1)$, the number of possible values taken by the action-dependent belief vector $\boldsymbol{\omega}(t)$ grows geometrically with time $t$, leading to high complexity in solving the problem; this is the so-called curse of dimensionality. In the following, we adopt Whittle's original idea of Lagrangian relaxation to decouple the arms for an index policy and show some crucial properties of the value functions of a single arm.

2.2 Arm Decoupling by Lagrangian Relaxation

Consider the following relaxation of problem (8)-(9), in which the per-slot constraint (9) is replaced by a constraint on the expected total discounted number of passive arms:

\max\limits_{\pi:\boldsymbol{\omega}(t)\rightarrow A(t)}\mathbbm{E}_{\pi}\left[\sum_{t=1}^{\infty}\beta^{t-1}\sum_{n=1}^{N}\mathbbm{1}_{(n\in A(t))}\cdot S_{n}(t)\cdot O_{n}(t)\cdot B_{n}\Big{|}\boldsymbol{\omega}(1)\right] (10)
\text{subject to}\quad\mathbbm{E}_{\pi}\left[\sum_{t=1}^{\infty}\beta^{t-1}\sum_{n=1}^{N}\mathbbm{1}_{(n\notin A(t))}\Big{|}\boldsymbol{\omega}(1)\right]=\frac{N-M}{1-\beta}. (11)

Clearly, constraint (11) is a relaxation of the per-slot constraint (9) on the player's action $A(t)$. Applying the Lagrangian multiplier $\mu$ to constraint (11), we arrive at the following unconstrained optimization problem:

\max\limits_{\pi:\boldsymbol{\omega}(t)\rightarrow A(t)}\mathbbm{E}_{\pi}\left[\sum_{t=1}^{\infty}\beta^{t-1}\sum_{n=1}^{N}\left[\mathbbm{1}_{(n\in A(t))}S_{n}(t)O_{n}(t)B_{n}+\mu\cdot\mathbbm{1}_{(n\notin A(t))}\right]\Big{|}\boldsymbol{\omega}(1)\right]. (12)

Fixing $\mu$, the above optimization is equivalent to $N$ independent unconstrained optimization problems, as shown below: for each $n\in\{1,2,\ldots,N\}$,

\max\limits_{\pi:\omega_{n}(t)\rightarrow\{0,1\}}\mathbbm{E}_{\pi}\left[\sum_{t=1}^{\infty}\beta^{t-1}\left[\mathbbm{1}_{(n\in A(t))}S_{n}(t)O_{n}(t)B_{n}+\mu\cdot\mathbbm{1}_{(n\notin A(t))}\right]\Big{|}\omega_{n}(1)\right]. (13)

Here $\pi$ is a single-arm policy that maps the belief state of the arm to the binary action $u=1$ (chosen/activated) or $u=0$ (unchosen/made passive). It is thus sufficient to consider a single arm for solving problem (12). For simplicity, we will drop the subscript $n$ and consider a single-armed bandit problem without loss of generality. Let $V_{\beta,m}(\omega)$ denote the value of (13) with $\mu=m$ and $\omega_{n}(1)=\omega$. It is straightforward to write out the dynamic programming equation of the single-armed bandit problem as follows:

V_{\beta,m}(\omega)=\max\{V_{\beta,m}(\omega;u=1),\ V_{\beta,m}(\omega;u=0)\}, (14)

where $V_{\beta,m}(\omega;u=1)$ and $V_{\beta,m}(\omega;u=0)$ denote, respectively, the maximum expected total discounted reward that can be obtained if the arm is activated or made passive at the current belief state $\omega$, followed by an optimal policy in subsequent slots. Since we consider the infinite-horizon problem, a stationary optimal policy can be chosen and the time index $t$ is not needed in (14). Define the nonlinear operator $\phi(\cdot)$ as

\phi(\omega)=\frac{\epsilon\omega}{\epsilon\omega+1-\omega}.

It is easy to see that $\mathcal{T}\circ\phi(\cdot)$ is Lipschitz continuous on $[0,1]$:

\Big{|}\mathcal{T}\left(\tfrac{\epsilon\omega}{\epsilon\omega+1-\omega}\right)-\mathcal{T}\left(\tfrac{\epsilon\omega^{\prime}}{\epsilon\omega^{\prime}+1-\omega^{\prime}}\right)\Big{|}=\Big{|}\tfrac{p_{11}\epsilon\omega+(1-\omega)p_{01}}{\epsilon\omega+1-\omega}-\tfrac{p_{11}\epsilon\omega^{\prime}+(1-\omega^{\prime})p_{01}}{\epsilon\omega^{\prime}+1-\omega^{\prime}}\Big{|} (15)
=\Big{|}\tfrac{\epsilon(p_{11}-p_{01})(\omega-\omega^{\prime})}{(1-(1-\epsilon)\omega)(1-(1-\epsilon)\omega^{\prime})}\Big{|}\leq\frac{|p_{11}-p_{01}|}{\epsilon}|\omega-\omega^{\prime}|. (16)

We assume that $\epsilon\neq 0$ (otherwise the problem is reduced to that considered in Liu and Zhao, 2010) and $p_{11}\neq p_{01}$ (otherwise the belief update is independent of observations and actions and the problem becomes trivial). Without loss of generality, set $B=1$. We have

\left\{\begin{array}{l}V_{\beta,m}(\omega;u=1)=(1-\epsilon)\omega+\beta\big{[}(1-\epsilon)\omega V_{\beta,m}(p_{11})+(1-(1-\epsilon)\omega)V_{\beta,m}(\mathcal{T}(\phi(\omega)))\big{]},\\ V_{\beta,m}(\omega;u=0)=m+\beta V_{\beta,m}(\mathcal{T}(\omega)).\end{array}\right. (20)

Define the passive set $P(m)$ as the set of all belief states in which taking the passive action $u=0$ is optimal:

P(m)\stackrel{\Delta}{=}\{\omega:~V_{\beta,m}(\omega;u=1)\leq V_{\beta,m}(\omega;u=0)\}. (21)

Notice that the immediate reward under the active action cannot exceed 1, so it is optimal to always make the arm passive if the immediate reward under the passive action is at least 1 ($m\geq 1$): in that case the player obtains the maximum possible immediate reward $m$ at each time step regardless of the state transitions. On the other hand, if $m<-1/(1-\beta)$ then the optimal action must be to activate the arm. To see this, note that the total discounted reward of any policy consists of two parts: the reward under the active actions and the reward under the passive actions. Suppose the optimal policy made the arm passive now when $m<-1/(1-\beta)$. Then the reward obtained under the future active actions of this policy would have to exceed $1/(1-\beta)$, otherwise the total discounted reward would be negative, contradicting the optimality of the policy (any policy that always activates the arm achieves a nonnegative total discounted reward, so the optimal policy must as well). But this is itself a contradiction, since the total discounted reward under the active actions is upper bounded by $1/(1-\beta)$. Consequently, the passive set $P(m)$ changes from the empty set to the closed interval $[0,1]$ as $m$ increases from $-\infty$ to $\infty$. However, this change need not be monotonic in $m$. If $P(m)$ does increase monotonically with $m$, then for each value $\omega$ of the belief state one can define the unique $m$ at which it joins $P(m)$ and stays in the set thereafter. Intuitively, such an $m$ measures, in a well-ordered manner, the attractiveness of activating the arm in belief state $\omega$ compared to other belief states: the larger the $m$ required to make $\omega$ passive, the greater the incentive to activate the arm in belief state $\omega$, even in the problem without $m$. This Lagrangian multiplier $m$ is thus called the 'subsidy for passivity' by Whittle, who formalized the following definition of indexability and the Whittle index (Whittle, 1988).
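The behaviour of $P(m)$ can be explored numerically. The following sketch is our own illustration (not the paper's algorithm): it uses a crude discretization of the belief space, with off-grid beliefs handled by linear interpolation and example parameters of our choosing, to run value iteration on (20) for a fixed subsidy $m$ and report which grid beliefs fall in the passive set (21).

```python
# Illustrative value iteration for the single-arm problem with subsidy m, eq. (20).
# Grid + interpolation approximation, for exploration only; not the paper's method.
import numpy as np

def passive_set(m, p11=0.8, p01=0.2, eps=0.3, beta=0.5, n_grid=501, n_iter=200):
    grid = np.linspace(0.0, 1.0, n_grid)
    T = lambda w: w * p11 + (1.0 - w) * p01            # eq. (5)
    phi = lambda w: eps * w / (eps * w + 1.0 - w)      # no-ACK Bayes correction
    ev = lambda V, w: np.interp(w, grid, V)            # evaluate V off-grid by interpolation

    def q(V):
        ack = (1.0 - eps) * grid                       # prob. of S=O=1 given the belief
        Va = ack * (1.0 + beta * ev(V, p11)) + (1.0 - ack) * beta * ev(V, T(phi(grid)))
        Vp = m + beta * ev(V, T(grid))                 # passive: collect the subsidy m
        return Va, Vp

    V = np.zeros(n_grid)
    for _ in range(n_iter):
        Va, Vp = q(V)
        V = np.maximum(Va, Vp)                         # Bellman update, eq. (20)
    Va, Vp = q(V)
    return grid, Va <= Vp                              # eq. (21): passive action optimal

if __name__ == "__main__":
    grid, passive = passive_set(m=0.3)
    print("largest passive belief on the grid:", grid[passive].max())
```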

Definition 1.

A restless multi-armed bandit is indexable if, for each single-armed bandit in the problem with subsidy $m$ for passivity, the set of arm states $P(m)$ in which passivity is optimal increases monotonically from the empty set to the whole state space as $m$ increases from $-\infty$ to $+\infty$. Under indexability, the Whittle index of an arm state is defined as the infimum subsidy $m$ such that the state remains in the passive set.

For our model, in which the state of an arm is given by its belief value, indexability is equivalent to the following:

\text{If}~~V_{\beta,m}(\omega;u=1)\leq V_{\beta,m}(\omega;u=0),~\text{then}~\forall~m^{\prime}>m,~V_{\beta,m^{\prime}}(\omega;u=1)\leq V_{\beta,m^{\prime}}(\omega;u=0). (22)

Under indexability, the Whittle index $W(\omega)$ of arm state $\omega$ is defined as

W(\omega)\stackrel{\Delta}{=}\inf\{m:~V_{\beta,m}(\omega;u=1)\leq V_{\beta,m}(\omega;u=0)\}. (23)
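Definition (23) suggests a direct, if brute-force, way to approximate $W(\omega)$: bisect over the subsidy $m$ and test on which side of the active/passive comparison the state $\omega$ falls. The sketch below is our own illustration only (not the efficient algorithm of Sec. 3); it reuses the grid value-iteration idea from the previous snippet, assumes indexability so that the bisection is valid, and uses example parameters of our choosing.

```python
# Brute-force approximation of the Whittle index (23) by bisection over the subsidy m.
# For illustration only; assumes indexability and a crude belief-grid value iteration.
import numpy as np

def q_values(omega, m, p11, p01, eps, beta, n_grid=501, n_iter=200):
    grid = np.linspace(0.0, 1.0, n_grid)
    T = lambda w: w * p11 + (1.0 - w) * p01
    phi = lambda w: eps * w / (eps * w + 1.0 - w)
    V = np.zeros(n_grid)
    for _ in range(n_iter):
        ack = (1.0 - eps) * grid
        Va = ack * (1.0 + beta * np.interp(p11, grid, V)) \
             + (1.0 - ack) * beta * np.interp(T(phi(grid)), grid, V)
        Vp = m + beta * np.interp(T(grid), grid, V)
        V = np.maximum(Va, Vp)
    ack0 = (1.0 - eps) * omega
    Va0 = ack0 * (1.0 + beta * np.interp(p11, grid, V)) \
          + (1.0 - ack0) * beta * np.interp(T(phi(omega)), grid, V)
    Vp0 = m + beta * np.interp(T(omega), grid, V)
    return Va0, Vp0

def whittle_bisect(omega, p11=0.8, p01=0.2, eps=0.3, beta=0.5, tol=1e-4):
    lo, hi = -1.0 / (1.0 - beta), 1.0        # W(omega) lies in this range (see the discussion above)
    while hi - lo > tol:
        m = 0.5 * (lo + hi)
        Va, Vp = q_values(omega, m, p11, p01, eps, beta)
        if Va <= Vp:                          # passive optimal at subsidy m: index is below m
            hi = m
        else:
            lo = m
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print(whittle_bisect(0.5))
```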

In the following we derive useful properties of the value functions $V_{\beta,m}(\omega;u=1)$, $V_{\beta,m}(\omega;u=0)$ and $V_{\beta,m}(\omega)$. Our strategy is to first establish these properties for finite horizons and then extend them to the infinite horizon by the uniform convergence of the finite-horizon value functions to their infinite-horizon counterparts. Define the $T$-horizon value function $V_{1,T,\beta,m}(\omega)$ as the maximum expected total discounted reward achievable over the next $T$ time slots starting from the initial belief state $\omega$. Then

V_{1,T,\beta,m}(\omega)=\max\{V_{1,T,\beta,m}(\omega;u=1),\ V_{1,T,\beta,m}(\omega;u=0)\}, (24)

where $V_{1,T,\beta,m}(\omega;u=1)$ and $V_{1,T,\beta,m}(\omega;u=0)$ denote, respectively, the maximum expected total discounted reward achievable given the initial active and passive actions over the next $T$ time slots starting from the initial belief state $\omega$:

V_{1,T,\beta,m}(\omega;u=1)=(1-\epsilon)\omega+(1-\epsilon)\omega\beta V_{1,T-1,\beta,m}(p_{11})+(1-(1-\epsilon)\omega)\beta V_{1,T-1,\beta,m}\left(\mathcal{T}\left(\tfrac{\epsilon\omega}{1-(1-\epsilon)\omega}\right)\right), (25)
V_{1,T,\beta,m}(\omega;u=0)=m+\beta V_{1,T-1,\beta,m}(\mathcal{T}(\omega)), (26)
V_{1,0,\beta,m}(\cdot)\equiv 0. (27)

From the above recursive equations, we can analyze $V_{1,T,\beta,m}(\omega)$ by backward induction on $T$. It is easy to see that for any $\omega$,

V_{1,1,\beta,m}(\omega;u=1)=(1-\epsilon)\omega,\quad V_{1,1,\beta,m}(\omega;u=0)=m. (28)

Therefore $V_{1,1,\beta,m}(\omega)$ is the maximum of two linear functions and is thus piecewise linear and convex for $T=1$ (in both $\omega$ and $m$). Assume that $V_{1,T-1,\beta,m}(\omega)$ is piecewise linear and convex. By Bayes' rule, the term

(1-(1-\epsilon)\omega)\beta V_{1,T-1,\beta,m}\left(\mathcal{T}\left(\tfrac{\epsilon\omega}{1-(1-\epsilon)\omega}\right)\right) (29)

is piecewise linear and convex in $\omega$, since the leading coefficient $(1-(1-\epsilon)\omega)$ also appears as the denominator of the argument of the linear operator $\mathcal{T}$ inside $V_{1,T-1,\beta,m}(\cdot)$, which is piecewise linear and convex by the induction hypothesis. Hence, the recursive equations (25) and (26) show that $V_{1,T,\beta,m}(\omega)$ is the maximum of two piecewise linear and convex functions and is thus piecewise linear and convex for any $T>1$ (in both $\omega$ and $m$). Motivated by the Lipschitz continuity of $\mathcal{T}\circ\phi$, we show in Lemma 2 that $V_{1,T,\beta,m}(\omega)$ is also Lipschitz continuous under certain conditions. In the following, we first establish a monotonicity property of $V_{1,T,\beta,m}(\omega)$ in the case $p_{11}>p_{01}$ (positively correlated Markov chain).

Lemma 1.

If $p_{11}>p_{01}$, then $V_{1,T,\beta,m}(\omega)$ is monotonically increasing in $\omega\in[0,1]$ for any $T\geq 1$.

Proof.

Since $V_{1,T,\beta,m}(\omega)$ is piecewise linear, it is differentiable everywhere except on a null set (under the Lebesgue measure on $\mathbbm{R}$) consisting of finitely many points, at each of which both the left and right derivatives exist but are not equal. To prove that the continuous function $V_{1,T,\beta,m}(\omega)$ is monotonically increasing in $\omega$, we only need to show

V^{\prime}_{1,T,\beta,m}(\omega)\geq 0,\quad\forall\omega\in(0,1), (30)

where $V^{\prime}_{1,T,\beta,m}(\omega)$ denotes the right derivative of $V_{1,T,\beta,m}(\cdot)$ as a function of the belief state with $m$ fixed. From (28), the value function $V_{1,1,\beta,m}(\omega)=\max\{(1-\epsilon)\omega,m\}$ is monotonically increasing with nonnegative right derivative $1-\epsilon$ or $0$. Assume (30) is true for $T\geq 1$; then for $T+1$ we have $V_{1,T+1,\beta,m}(\omega)=\max\{f_{T}(\omega),g_{T}(\omega)\}$ with

\begin{array}{l}f_{T}(\omega)\stackrel{\Delta}{=}(1-\epsilon)\omega+(1-\epsilon)\omega\beta V_{1,T,\beta,m}(p_{11})+(1-(1-\epsilon)\omega)\beta V_{1,T,\beta,m}(\mathcal{T}\circ\phi(\omega)),\\ g_{T}(\omega)\stackrel{\Delta}{=}m+\beta V_{1,T,\beta,m}(\mathcal{T}(\omega)).\end{array} (34)

From the above, we have

\begin{array}{l}f_{T}^{\prime}(\omega)=(1-\epsilon)+(1-\epsilon)\beta V_{1,T,\beta,m}(p_{11})-(1-\epsilon)\beta V_{1,T,\beta,m}(\mathcal{T}\circ\phi(\omega))+V_{1,T,\beta,m}^{\prime}(\mathcal{T}\circ\phi(\omega))\frac{\epsilon\beta(p_{11}-p_{01})}{1-(1-\epsilon)\omega},\\ g_{T}^{\prime}(\omega)=\beta(p_{11}-p_{01})V_{1,T,\beta,m}^{\prime}(\mathcal{T}(\omega)),\end{array} (38)

where $f_{T}^{\prime}(\cdot)$, $g_{T}^{\prime}(\cdot)$ and $V_{1,T,\beta,m}^{\prime}(\cdot)$ denote the right derivatives of the corresponding functions. We have used the facts that $\phi(\cdot)$ is monotonically increasing, that $\mathcal{T}(\cdot)$ is also monotonically increasing when $p_{11}>p_{01}$, and that

\phi^{\prime}(\omega)=\frac{\epsilon}{(1-(1-\epsilon)\omega)^{2}}. (39)

By the induction hypothesis and (38), if $p_{11}>p_{01}$ then $g_{T}(\omega)$ is monotonically increasing (since $g^{\prime}_{T}(\omega)\geq 0$) and

\begin{aligned} f_{T}^{\prime}(\omega)&=(1-\epsilon)+(1-\epsilon)\beta V_{1,T,\beta,m}(p_{11})-(1-\epsilon)\beta V_{1,T,\beta,m}\big(\mathcal{T}(\tfrac{\epsilon\omega}{\epsilon\omega+1-\omega})\big)\\ &~~~+(1-(1-\epsilon)\omega)\beta V_{1,T,\beta,m}^{\prime}\big(\mathcal{T}(\tfrac{\epsilon\omega}{\epsilon\omega+1-\omega})\big)(\mathcal{T}\circ\phi)^{\prime}(\omega)\\ &\geq(1-\epsilon)+\tfrac{\epsilon\beta(p_{11}-p_{01})}{1-(1-\epsilon)\omega}V_{1,T,\beta,m}^{\prime}\big(\mathcal{T}(\tfrac{\epsilon\omega}{\epsilon\omega+1-\omega})\big)>0, \end{aligned} (40)

where both the first and second inequalities follow from the monotonically increasing property of $V_{1,T,\beta,m}(\cdot)$ (by the induction hypothesis, under the assumption that $p_{11}>p_{01}$) together with

p_{01}\leq\mathcal{T}(\omega)\leq p_{11},\quad 0\leq\phi(\omega)\leq 1,\quad\forall~\omega\in[0,1]. (41)

This proves the monotonically increasing property of $f_{T}(\omega)$. Thus $V_{1,T+1,\beta,m}(\omega)=\max\{f_{T}(\omega),g_{T}(\omega)\}$ is also monotonically increasing, and the induction is complete. ∎
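Since (25)-(27) express $V_{1,T,\beta,m}(\omega)$ through at most three successor beliefs per step, the finite-horizon value at a single point can be evaluated exactly by recursion, with no discretization. The sketch below is our own illustration (function names and parameter values are ours) for checking properties such as the monotonicity of Lemma 1 on small horizons.

```python
# Exact recursive evaluation of the finite-horizon value functions (24)-(27).
# Our own illustration for small T; the branching factor is at most 3 per step.
from functools import lru_cache

def make_V(p11, p01, eps, beta, m):
    T = lambda w: w * p11 + (1.0 - w) * p01
    phi = lambda w: eps * w / (eps * w + 1.0 - w)

    @lru_cache(maxsize=None)
    def V(horizon, omega):
        if horizon == 0:
            return 0.0                                                        # eq. (27)
        ack = (1.0 - eps) * omega
        v_active = ack + beta * (ack * V(horizon - 1, p11)
                                 + (1.0 - ack) * V(horizon - 1, T(phi(omega))))  # eq. (25)
        v_passive = m + beta * V(horizon - 1, T(omega))                          # eq. (26)
        return max(v_active, v_passive)                                          # eq. (24)
    return V

if __name__ == "__main__":
    V = make_V(p11=0.8, p01=0.2, eps=0.3, beta=0.5, m=0.3)
    vals = [V(6, w / 20) for w in range(21)]
    # Lemma 1 (p11 > p01): the value function is nondecreasing in the belief
    assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))
    print(vals[0], vals[-1])
```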

Now we show that, under a constraint on the discount factor $\beta\in(0,1)$, the value function $V_{1,T,\beta,m}(\omega)$ is a Lipschitz function:

Lemma 2.

Suppose the discount factor $\beta\in(0,1)$ satisfies

\beta<\tfrac{1}{(2-\epsilon)|p_{11}-p_{01}|}. (42)

Then for all $T\geq 1$ and $\omega,\omega^{\prime}\in[0,1]$,

|V_{1,T,\beta,m}(\omega)-V_{1,T,\beta,m}(\omega^{\prime})|\leq C|\omega-\omega^{\prime}|,\quad\text{where} (43)
C=\frac{1-\epsilon}{1-(2-\epsilon)\beta|p_{11}-p_{01}|}. (44)
Proof.

We prove this by induction. Without loss of generality, assume $\omega<\omega^{\prime}$. For the case $T=1$,

|V_{1,1,\beta,m}(\omega)-V_{1,1,\beta,m}(\omega^{\prime})|=\begin{cases}0,&\text{if }m\geq(1-\epsilon)\omega^{\prime}\\ (1-\epsilon)\omega^{\prime}-m,&\text{if }(1-\epsilon)\omega\leq m<(1-\epsilon)\omega^{\prime}\\ (1-\epsilon)|\omega-\omega^{\prime}|,&\text{if }m<(1-\epsilon)\omega\end{cases}.

Thus $|V_{1,1,\beta,m}(\omega)-V_{1,1,\beta,m}(\omega^{\prime})|\leq(1-\epsilon)|\omega-\omega^{\prime}|\leq C|\omega-\omega^{\prime}|$, where the second inequality is due to (42).

Assume that for $T\geq 1$, $|V_{1,T,\beta,m}(\omega)-V_{1,T,\beta,m}(\omega^{\prime})|\leq C|\omega-\omega^{\prime}|$ holds, i.e., neither the left nor the right derivative of $V_{1,T,\beta,m}(\cdot)$ can have absolute value exceeding $C$. We have the following inequalities:

\begin{aligned} |V_{1,T,\beta,m}(p_{11})-V_{1,T,\beta,m}(\mathcal{T}(\tfrac{\epsilon\omega}{\epsilon\omega+1-\omega}))|&\leq C|p_{11}-\mathcal{T}(\tfrac{\epsilon\omega}{\epsilon\omega+1-\omega})|\leq C|p_{11}-p_{01}|,\\ \frac{\epsilon}{1-(1-\epsilon)\omega}|V_{1,T,\beta,m}^{\prime}(\mathcal{T}(\tfrac{\epsilon\omega}{\epsilon\omega+1-\omega}))|&\leq\frac{\epsilon}{1-(1-\epsilon)\cdot 1}C=C. \end{aligned} (45)

To prove $|V_{1,T+1,\beta,m}(\omega)-V_{1,T+1,\beta,m}(\omega^{\prime})|\leq C|\omega-\omega^{\prime}|$, recall the definitions of $f_{T}(\omega)$ and $g_{T}(\omega)$ in (34) and their right derivatives $f_{T}^{\prime}(\omega)$ and $g_{T}^{\prime}(\omega)$ in (38). We have

\begin{aligned} |f_{T}^{\prime}(\omega)-(1-\epsilon)|&\leq(1-\epsilon)\beta C|p_{11}-p_{01}|+C\beta|p_{11}-p_{01}|=(2-\epsilon)\beta C|p_{11}-p_{01}|,\\ |g_{T}^{\prime}(\omega)|&\leq C\beta|p_{11}-p_{01}|, \end{aligned} (46)

where the inequality in the first (resp. second) line above follows from the first (resp. second) line of (45). Thus we have the following lower and upper bounds on $f^{\prime}_{T}(\omega)$ and $g^{\prime}_{T}(\omega)$:

\begin{aligned} (1-\epsilon)-(2-\epsilon)\beta C|p_{11}-p_{01}|\leq f_{T}^{\prime}(\omega)&\leq(1-\epsilon)+(2-\epsilon)\beta C|p_{11}-p_{01}|,\\ -C\beta|p_{11}-p_{01}|\leq g_{T}^{\prime}(\omega)&\leq C\beta|p_{11}-p_{01}|. \end{aligned} (47)

From (47) and the fact that $V_{1,T+1,\beta,m}(\omega)=\max\{f_{T}(\omega),g_{T}(\omega)\}$, we have

\begin{aligned} |V_{1,T+1,\beta,m}^{\prime}(\omega)|&\leq(1-\epsilon)+(2-\epsilon)\beta C|p_{11}-p_{01}|\\ &=(1-\epsilon)+(2-\epsilon)\beta|p_{11}-p_{01}|\tfrac{1-\epsilon}{1-(2-\epsilon)\beta|p_{11}-p_{01}|}\\ &=\tfrac{1-\epsilon}{1-(2-\epsilon)\beta|p_{11}-p_{01}|}=C, \end{aligned}

where we used the fact that $2-\epsilon\geq 1$ in the above inequality. Since $V_{1,T+1,\beta,m}(\omega)$ is absolutely continuous, the above implies that

|V_{1,T+1,\beta,m}(\omega)-V_{1,T+1,\beta,m}(\omega^{\prime})|\leq C|\omega-\omega^{\prime}|.

This completes the induction and the proof. ∎
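As a quick numerical illustration of Lemma 2 (with example parameters of our own choosing), the admissible range (42) and the constant (44) can be evaluated directly:

```python
# Numerical illustration of condition (42) and the Lipschitz constant (44).
p11, p01, eps = 0.8, 0.2, 0.3

beta_bound = 1.0 / ((2.0 - eps) * abs(p11 - p01))               # eq. (42): beta must stay below this
beta = 0.5                                                       # satisfies (42) since 0.5 < ~0.980
C = (1.0 - eps) / (1.0 - (2.0 - eps) * beta * abs(p11 - p01))    # eq. (44)

print(f"beta bound = {beta_bound:.3f}, C = {C:.3f}")
# beta bound ~ 0.980; C = 0.7 / (1 - 1.7*0.5*0.6) = 0.7/0.49 ~ 1.429
```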

Last, we give a lemma establishing the order of $V^{\prime}_{1,T,\beta,m}(\,\cdot\,;u=1)$ and $V^{\prime}_{1,T,\beta,m}(\,\cdot\,;u=0)$ under certain conditions, which further leads to a threshold structure of the optimal single-arm policy, as detailed in Section 2.3.

Lemma 3.

Suppose that $p_{11}>p_{01}$ and $\beta\leq 1/[(3-\epsilon)(p_{11}-p_{01})]$. Then

V_{1,T,\beta,m}^{\prime}(\omega;u=1)\geq V_{1,T,\beta,m}^{\prime}(\omega;u=0), (48)

where $V_{1,T,\beta,m}^{\prime}(\omega;u=i)$ denotes the right derivative of $V_{1,T,\beta,m}(\,\cdot\,;u=i)$ at $\omega$ for $i\in\{0,1\}$. The above inequality is also true if $p_{01}>p_{11}$ and $\beta\leq 1/[(5-2\epsilon)(p_{01}-p_{11})]$.

Proof.

Again, we prove by induction on the time horizon $T$. When $T=1$, it is clear that $V_{1,1,\beta,m}(\omega;u=1)=(1-\epsilon)\omega$ and $V_{1,1,\beta,m}(\omega;u=0)=m$, so

V_{1,1,\beta,m}^{\prime}(\omega;u=1)=1-\epsilon>V_{1,1,\beta,m}^{\prime}(\omega;u=0)=0. (49)

Assume that $V_{1,T,\beta,m}^{\prime}(\omega;u=1)\geq V_{1,T,\beta,m}^{\prime}(\omega;u=0)$ for $T\geq 1$. From (47), in the case $p_{01}>p_{11}$ and $\beta\leq\frac{1}{(5-2\epsilon)(p_{01}-p_{11})}$, we have the inequality $C\beta(p_{01}-p_{11})\leq(1-\epsilon)-(2-\epsilon)\beta C(p_{01}-p_{11})$, which shows that $f_{T}^{\prime}(\omega)\geq g_{T}^{\prime}(\omega)$. When $p_{11}>p_{01}$, $V_{1,T,\beta,m}(\omega)$ is increasing in $\omega$ and therefore has nonnegative right derivatives by Lemma 1. We can thus obtain tighter bounds on $f_{T}^{\prime}(\omega)$ and $g_{T}^{\prime}(\omega)$ from (38):

\begin{aligned} (1-\epsilon)\leq f_{T}^{\prime}(\omega)&\leq(1-\epsilon)+(2-\epsilon)\beta C(p_{11}-p_{01}),\\ 0\leq g_{T}^{\prime}(\omega)&\leq C\beta(p_{11}-p_{01}). \end{aligned}

If $\beta\leq\frac{1}{(3-\epsilon)(p_{11}-p_{01})}$, we have $C\beta(p_{11}-p_{01})\leq(1-\epsilon)$, which shows that $f_{T}^{\prime}(\omega)\geq g_{T}^{\prime}(\omega)$. The proof is thus complete. ∎

2.3 Threshold Policy and Indexability

In this section, we show that the optimal single-arm policy is a threshold policy under the constraints on the discount factor $\beta$ specified in Section 2.2, and we analyze the conditions for indexability. First, for a finite-horizon single-armed bandit, a threshold policy $\pi$ is defined by a time-dependent real number $\omega_{T,\beta}(m)$ such that

u_{T,m}(\omega)=\begin{cases}1,&\text{if }\omega>\omega_{T,\beta}(m);\\ 0,&\text{if }\omega\leq\omega_{T,\beta}(m).\end{cases} (50)

Here $u_{T,m}(\omega)\in\{0,1\}$ is the action taken under $\pi$ at the current belief state $\omega$ with $T$ slots remaining. Intuitively, the larger $\omega$ is, the larger the expected immediate reward and thus the more attractive it is to activate the arm. We formalize this intuition under certain conditions in the following theorem.

Theorem 1.

Suppose that $p_{11}>p_{01}$ and $\beta\leq\frac{1}{(3-\epsilon)(p_{11}-p_{01})}$. For any $T\geq 1$, the optimal single-arm policy $\pi^{*}$ is a threshold policy, i.e., there exists $\omega_{T,\beta}^{*}(m)\in\mathbbm{R}$ such that under $\pi^{*}$, the optimal action is

u_{T,m}^{*}(\omega)=\begin{cases}1,&\text{if }\omega>\omega_{T,\beta}^{*}(m);\\ 0,&\text{if }\omega\leq\omega_{T,\beta}^{*}(m).\end{cases}

Furthermore, at the threshold $\omega_{T,\beta}^{*}(m)$,

V_{1,T,\beta,m}(\omega_{T,\beta}^{*}(m);u=0)=V_{1,T,\beta,m}(\omega_{T,\beta}^{*}(m);u=1). (51)

The conclusion also holds in the case $p_{01}>p_{11}$ and $\beta\leq\frac{1}{(5-2\epsilon)(p_{01}-p_{11})}$.

Proof.

At $T=1$, $V_{1,1,\beta,m}(\omega;u=1)=(1-\epsilon)\omega$ and $V_{1,1,\beta,m}(\omega;u=0)=m$. Thus we can choose $\omega_{1,\beta}^{*}(m)$ as follows:

\omega_{1,\beta}^{*}(m)=\begin{cases}c,&\text{if }m\geq 1-\epsilon;\\ \frac{m}{1-\epsilon},&\text{if }0\leq m<1-\epsilon;\\ b,&\text{if }m<0,\end{cases}

where $b<0$ and $c>1$ are arbitrary constants.

For $T\geq 1$, when the condition on $\beta$ is satisfied, Lemma 3 shows that

h_{T}(\omega)\stackrel{\Delta}{=}V_{1,T,\beta,m}(\omega;u=1)-V_{1,T,\beta,m}(\omega;u=0), (52)
h^{\prime}_{T}(\omega)\geq 0,\quad\forall~\omega\in(0,1). (53)

This shows that $h_{T}(\cdot)$ is monotonically increasing and that its zero set in $[0,1]$ is either empty or a closed interval (possibly a single point). In particular,

\begin{aligned} V_{1,T,\beta,m}(0;u=1)&=\beta V_{1,T-1,\beta,m}(p_{01}),\\ V_{1,T,\beta,m}(0;u=0)&=m+\beta V_{1,T-1,\beta,m}(p_{01}),\\ V_{1,T,\beta,m}(1;u=1)&=(1-\epsilon)+\beta V_{1,T-1,\beta,m}(p_{11}),\\ V_{1,T,\beta,m}(1;u=0)&=m+\beta V_{1,T-1,\beta,m}(p_{11}). \end{aligned}

Consider the following three regions of $m$.

  • (i) $0\leq m<1-\epsilon$. In this case, $V_{1,T,\beta,m}(0;u=1)\leq V_{1,T,\beta,m}(0;u=0)$ and $V_{1,T,\beta,m}(1;u=1)>V_{1,T,\beta,m}(1;u=0)$. Therefore $h_{T}(\cdot)$ has (at least) one zero in $[0,1]$, which can be chosen as $\omega_{T,\beta}^{*}(m)$.

  • (ii) $m<0$. In this case, $V_{1,T,\beta,m}(0;u=1)>V_{1,T,\beta,m}(0;u=0)$ and $V_{1,T,\beta,m}(1;u=1)>V_{1,T,\beta,m}(1;u=0)$. So $h_{T}(\cdot)$ is strictly positive over $[0,1]$ and we can choose $\omega_{T,\beta}^{*}(m)=b$ with any $b<0$.

  • (iii) $m\geq 1-\epsilon$. In this case, always choosing the passive action is clearly optimal, as the expected immediate reward of the active action is uniformly upper-bounded by $m$ over the whole belief state space. We can thus choose $\omega_{T,\beta}^{*}(m)=c$ with any $c>1$.

In conclusion, when the conditions in the theorem are satisfied, the optimal finite-horizon single-arm policy is a threshold policy for any horizon length $T\geq 1$. ∎

In the next theorem, we show that the optimal single-arm policy over the infinite horizon is also a threshold policy under the same conditions.

Theorem 2.

Fix the subsidy $m$. The finite-horizon value functions $V_{1,T,\beta,m}(\cdot)$, $V_{1,T,\beta,m}(\,\cdot\,;u=1)$ and $V_{1,T,\beta,m}(\,\cdot\,;u=0)$ converge uniformly to the infinite-horizon value functions $V_{\beta,m}(\cdot)$, $V_{\beta,m}(\,\cdot\,;u=1)$ and $V_{\beta,m}(\,\cdot\,;u=0)$, which consequently obey the same properties established in Lemmas 1 and 2 and Theorem 1.

Proof.

The uniform convergence is immediate since $\beta<1$, and the remaining claims follow easily by contradiction from the uniform convergence. ∎

Thus far we have established the threshold structure of the optimal single-arm policy with subsidy, based on the analysis of $V_{\beta,m}(\omega)$ as a function of the belief state $\omega$ with $m$ fixed. To study the indexability condition, we now analyze the properties of $V_{\beta,m}(\omega)$ as a function of the subsidy $m$ with the starting belief $\omega$ fixed. From Definition 1 and the threshold structure of the optimal policy, the indexability of our model reduces to requiring that the threshold $\omega^{*}_{\beta}(m)$ be monotonically increasing in $m$ (if the threshold set is a closed interval then its right end is selected). Note that for the infinite-horizon problem, the threshold $\omega^{*}_{\beta}(m)$ is independent of time. Furthermore, $V_{\beta,m}(\omega)$ is convex in $m$: for any $m_{1},m_{2}\in\mathbbm{R}$ and $\theta\in(0,1)$, the optimal policy $\pi_{\beta}^{*}(\theta m_{1}+(1-\theta)m_{2})$ achieving $V_{\beta,\theta m_{1}+(1-\theta)m_{2}}(\omega)$, when applied to the problems with subsidies $m_{1}$ and $m_{2}$ respectively, cannot outperform the policies achieving $V_{\beta,m_{1}}(\omega)$ and $V_{\beta,m_{2}}(\omega)$. Specifically, let $r_{a}$ be the expected total discounted reward from the active actions and $r_{p}(m)$ that from the passive actions under $\pi_{\beta}^{*}(\theta m_{1}+(1-\theta)m_{2})$ applied to the problem with subsidy $m$; then

\theta V_{\beta,m_{1}}(\omega)+(1-\theta)V_{\beta,m_{2}}(\omega)\geq r_{a}+\theta r_{p}(m_{1})+(1-\theta)r_{p}(m_{2})=r_{a}+r_{p}(\theta m_{1}+(1-\theta)m_{2})=V_{\beta,\theta m_{1}+(1-\theta)m_{2}}(\omega).

Since $V_{\beta,m}(\omega)$ is convex in $m$, its left and right derivatives with respect to $m$ exist at every point $m_{0}\in\mathbbm{R}$. Furthermore, consider two policies $\pi_{\beta}^{*}(m_{1})$ and $\pi_{\beta}^{*}(m_{2})$ achieving $V_{\beta,m_{1}}(\omega)$ and $V_{\beta,m_{2}}(\omega)$ for any $m_{1},m_{2}\in\mathbbm{R}$, respectively. By a similar interchange argument applied to $\pi_{\beta}^{*}(m_{1})$ and $\pi_{\beta}^{*}(m_{2})$, we have $|V_{\beta,m_{1}}(\omega)-V_{\beta,m_{2}}(\omega)|\leq\frac{1}{1-\beta}|m_{1}-m_{2}|$, so $V_{\beta,m}(\omega)$ is Lipschitz continuous in $m$. By Rademacher's theorem (see Heinonen, 2005), $V_{\beta,m}(\omega)$ is differentiable almost everywhere in $m$. For a small increase of $m$, the rate at which $V_{\beta,m}(\omega)$ increases is at least the expected total discounted passive time under any optimal policy for the problem with subsidy $m$ starting from the belief state $\omega$. In the following theorem, we formalize this relation between the value function and the passive time, as well as a sufficient condition for the indexability of our model.

Theorem 3.

Let $\Pi^{*}_{\beta}(m)$ denote the set of all optimal single-arm policies achieving $V_{\beta,m}(\omega)$ with initial belief state $\omega$. Define the passive time

D_{\beta,m}(\omega)\stackrel{\Delta}{=}\max_{\pi^{*}_{\beta}(m)\in\Pi^{*}_{\beta}(m)}\mathbbm{E}_{\pi^{*}_{\beta}(m)}\left[\sum_{t=1}^{\infty}\beta^{t-1}\mathbbm{1}_{(u(t)=0)}\Big{|}\omega(1)=\omega\right]. (54)

The right derivative of the value function $V_{\beta,m}(\omega)$ with respect to $m$, denoted by $\frac{dV_{\beta,m}(\omega)}{(dm)^{+}}$, exists at every value of $m$ and

\left.\frac{dV_{\beta,m}(\omega)}{(dm)^{+}}\right|_{m=m_{0}}=D_{\beta,m_{0}}(\omega). (55)

Furthermore, the single-armed bandit is indexable if at least one of the following conditions is satisfied:

  • i. For any $m_{0}\in[0,1-\epsilon)$ the optimal policy is a threshold policy with threshold $\omega^{*}_{\beta}(m_{0})\in[0,1)$ (if the threshold set is a closed interval then its right end is selected) and

    \left.\frac{dV_{\beta,m}(\omega^{*}_{\beta}(m_{0});u=0)}{(dm)^{+}}\right|_{m=m_{0}}>\left.\frac{dV_{\beta,m}(\omega^{*}_{\beta}(m_{0});u=1)}{(dm)^{+}}\right|_{m=m_{0}}. (56)

  • ii. For any $m_{0}\in\mathbbm{R}$ and $\omega\in P(m_{0})$, we have

    \left.\frac{dV_{\beta,m}(\omega;u=0)}{(dm)^{+}}\right|_{m=m_{0}}\geq\left.\frac{dV_{\beta,m}(\omega;u=1)}{(dm)^{+}}\right|_{m=m_{0}}. (57)
Proof.

The proof of (55) follows directly from the argument in Theorem 1 in Liu (2021) and is omitted here. To prove the sufficiency of (56), note that if it holds then there exists $\Delta m>0$ such that for all $m\in(m_{0},m_{0}+\Delta m)$,

V_{\beta,m}(\omega^{*}_{\beta}(m_{0});u=0)-V_{\beta,m_{0}}(\omega^{*}_{\beta}(m_{0});u=0)>V_{\beta,m}(\omega^{*}_{\beta}(m_{0});u=1)-V_{\beta,m_{0}}(\omega^{*}_{\beta}(m_{0});u=1).

Since $V_{\beta,m_{0}}(\omega^{*}_{\beta}(m_{0});u=0)=V_{\beta,m_{0}}(\omega^{*}_{\beta}(m_{0});u=1)$, we have $V_{\beta,m}(\omega^{*}_{\beta}(m_{0});u=0)>V_{\beta,m}(\omega^{*}_{\beta}(m_{0});u=1)$, which implies that the threshold $\omega^{*}_{\beta}(m_{0})$ remains in the passive set as $m$ increases continuously, so $P(m)$ is monotonically increasing in $m$. This conclusion is clearly true in the trivial cases $m<0$ and $m\geq 1-\epsilon$. The sufficiency of (57) is obvious: it is then impossible for any $\omega\in P(m)$ to escape from $P(m)$ as $m$ increases, due to the nondecreasing property of $V_{\beta,m}(\omega;u=0)-V_{\beta,m}(\omega;u=1)$ enforced by (57). ∎

Theorem 3 essentially provides a way of checking the indexability condition in terms of the passive times. For example, condition (56) is equivalent to: for any $m\in[0,1-\epsilon)$,

\beta\Bigl{[}(1-\epsilon)\omega_{\beta}^{*}(m)D_{\beta,m}(p_{11})+(1-(1-\epsilon)\omega_{\beta}^{*}(m))D_{\beta,m}\Bigl{(}\mathcal{T}\bigl{(}\tfrac{\epsilon\omega_{\beta}^{*}(m)}{\epsilon\omega_{\beta}^{*}(m)+1-\omega_{\beta}^{*}(m)}\bigr{)}\Bigr{)}\Bigr{]}<1+\beta D_{\beta,m}(\mathcal{T}(\omega_{\beta}^{*}(m))). (58)

The above strict inequality clearly holds if $\beta<0.5$, since $D_{\beta,m}(\cdot)\in[0,\frac{1}{1-\beta}]$ for any $m\in\mathbbm{R}$. When $\beta=0.5$, we prove by contradiction that the strict inequality (58) must hold under the threshold structure of the optimal policy. If $\omega_{\beta}^{*}(m)=0$ then (58) is clearly true. Assume that the left and right sides of (58) are equal and $\omega_{\beta}^{*}(m)\neq 0$. In this case, we have

D_{\beta,m}(p_{11})=\tfrac{1}{1-\beta}, (59)
D_{\beta,m}(\mathcal{T}(\omega_{\beta}^{*}(m)))=0. (60)

Equation (60) implies that, starting from $\mathcal{T}(\omega_{\beta}^{*}(m))$, always activating the arm is strictly optimal. This means that the threshold $\omega_{\beta}^{*}(m)$ is strictly below $p_{11}$, and we have a contradiction to (59). An easier way to see that the bandit is indexable when $\beta\leq 0.5$ is that (57) is then satisfied, since no strict inequality is required there. However, condition (58) provides a convenient way of approximately computing the passive times as well as the value functions, which leads to an efficient algorithm for evaluating indexability and solving for the Whittle index function for any $\beta\in(0,1)$, as detailed in the next section.

Corollary 1.

The restless bandit is indexable if $\beta\leq 0.5$.

3 The Whittle Index Policy

In this section, we design an efficient algorithm by approximating the Whittle index.

3.1 The Approximated Whittle Index

The threshold structure of the optimal single-arm policy under certain conditions yields the following iterative form of the dynamic equations for both $D_{\beta,m}(\omega)$ and $V_{\beta,m}(\omega)$. Define the first crossing time

L(\omega,\omega^{\prime})=\min_{0\leq k<\infty}\{k:\mathcal{T}^{k}(\omega)>\omega^{\prime}\}. (61)

In the above, $\mathcal{T}^{0}(\omega)\stackrel{\Delta}{=}\omega$ and we set $L(\omega,\omega^{\prime})=+\infty$ if $\mathcal{T}^{k}(\omega)\leq\omega^{\prime}$ for all $k\geq 0$. Clearly $L(\omega,\omega^{\prime})$ is the minimum number of time slots for which a belief state $\omega$ stays in the passive set $P(m)$ before the arm is activated, given a threshold $\omega^{\prime}\in[0,1)$. Consider the nontrivial case where $p_{01},p_{11}\in(0,1)$ and $p_{01}\neq p_{11}$, so that the Markov chain of the internal arm states is aperiodic and irreducible and the belief update is action-dependent. From (6), if $p_{11}>p_{01}$ then

L(\omega,\omega^{\prime})=\begin{cases}0,&\omega>\omega^{\prime}\\ \left\lfloor\log_{p_{11}-p_{01}}\frac{p_{01}-\omega^{\prime}(1-p_{11}+p_{01})}{p_{01}-\omega(1-p_{11}+p_{01})}\right\rfloor+1,&\omega\leq\omega^{\prime}<\omega_{o}\\ \infty,&\omega\leq\omega^{\prime},\ \omega^{\prime}\geq\omega_{o}\end{cases}; (62)

or if $p_{11}<p_{01}$ then

L(\omega,\omega^{\prime})=\begin{cases}0,&\omega>\omega^{\prime}\\ 1,&\omega\leq\omega^{\prime},\ \mathcal{T}(\omega)>\omega^{\prime}\\ \infty,&\omega\leq\omega^{\prime},\ \mathcal{T}(\omega)\leq\omega^{\prime}\end{cases}. (63)

To interpret (62) and (63), note that under passive actions a belief state converges to the stationary belief value given by (7), monotonically if $p_{11}>p_{01}$ and in an oscillating manner if $p_{11}<p_{01}$ (see Fig. 3 and Fig. 4 in Liu and Zhao (2010)). Suppose that the following conditions are satisfied, so that the optimal single-arm policy is a threshold policy and indexability holds:

\beta\leq\begin{cases}\min\left\{\frac{1}{(3-\epsilon)(p_{11}-p_{01})},0.5\right\},&\text{if}~p_{11}>p_{01}\\ \min\left\{\frac{1}{(5-2\epsilon)(p_{01}-p_{11})},0.5\right\},&\text{if}~p_{11}<p_{01}\end{cases}. (64)
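The first crossing time in (61)-(63) is straightforward to compute. The sketch below is our own helper (names and test parameters are ours): it evaluates $L(\omega,\omega^{\prime})$ both by direct iteration of $\mathcal{T}$ and via the closed form (62) for the positively correlated case, and checks that the two agree.

```python
# First crossing time L(omega, omega') of eq. (61), with the closed form (62) as a check.
import math

def crossing_time_iter(omega, thresh, p11, p01, max_k=10_000):
    """Direct use of eq. (61): smallest k with T^k(omega) > thresh (None stands for +infinity)."""
    w = omega
    for k in range(max_k):
        if w > thresh:
            return k
        w = w * p11 + (1.0 - w) * p01            # one passive step, eq. (5)
    return None

def crossing_time_closed(omega, thresh, p11, p01):
    """Closed form (62), valid for p11 > p01."""
    if omega > thresh:
        return 0
    omega_o = p01 / (p01 + 1.0 - p11)            # stationary belief, eq. (7)
    if thresh >= omega_o:
        return None                              # L = +infinity
    num = p01 - thresh * (1.0 - p11 + p01)
    den = p01 - omega * (1.0 - p11 + p01)
    return math.floor(math.log(num / den, p11 - p01)) + 1

if __name__ == "__main__":
    p11, p01 = 0.8, 0.2
    for omega in (0.1, 0.3, 0.5):
        for thresh in (0.2, 0.4, 0.45):
            assert crossing_time_iter(omega, thresh, p11, p01) == \
                   crossing_time_closed(omega, thresh, p11, p01)
```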

To solve for the Whittle index function $W(\omega)$, given the current arm state $\omega$, we aim to find the minimum subsidy $m$ that makes $\omega$ a threshold:

V_{\beta,m}(\omega)=V_{\beta,m}(\omega;u=1)=V_{\beta,m}(\omega;u=0), (65)
V_{\beta,m}(\omega;u=1)=(1-\epsilon)\omega+\beta\big[(1-\epsilon)\omega V_{\beta,m}(p_{11})+(1-(1-\epsilon)\omega)V_{\beta,m}(\mathcal{T}\circ\phi(\omega))\big], (66)
V_{\beta,m}(\omega;u=0)=m+\beta V_{\beta,m}(\mathcal{T}(\omega)). (67)

Given a threshold $\omega^{*}_{\beta}(m)\in[0,1)$ and any $\omega\in[0,1]$, the value function $V_{\beta,m}(\omega)$ can be expanded via the first crossing time as

\begin{aligned} V_{\beta,m}(\omega)&=\frac{1-\beta^{L(\omega,\omega^{*}_{\beta}(m))}}{1-\beta}m+\beta^{L(\omega,\omega^{*}_{\beta}(m))}V_{\beta,m}(\mathcal{T}^{L(\omega,\omega^{*}_{\beta}(m))}(\omega);u=1)\\ &=\frac{1-\beta^{L(\omega,\omega^{*}_{\beta}(m))}}{1-\beta}m+\beta^{L(\omega,\omega^{*}_{\beta}(m))}\bigg\{(1-\epsilon)\mathcal{T}^{L(\omega,\omega^{*}_{\beta}(m))}(\omega)\\ &\quad+\beta\bigg[(1-\epsilon)\mathcal{T}^{L(\omega,\omega^{*}_{\beta}(m))}(\omega)V_{\beta,m}(p_{11})+(1-(1-\epsilon)\mathcal{T}^{L(\omega,\omega^{*}_{\beta}(m))}(\omega))\\ &\quad\quad\quad\times V_{\beta,m}\bigg(\mathcal{T}\bigg(\tfrac{\epsilon\mathcal{T}^{L(\omega,\omega^{*}_{\beta}(m))}(\omega)}{\epsilon\mathcal{T}^{L(\omega,\omega^{*}_{\beta}(m))}(\omega)+1-\mathcal{T}^{L(\omega,\omega^{*}_{\beta}(m))}(\omega)}\bigg)\bigg)\bigg]\bigg\}. \end{aligned} (68)

The last term of the above equation is what makes V_{β,m}(ω) difficult to solve for in closed form. However, if we let

f(ω,ωβ(m))=𝒯(ϵ𝒯L(ω,ωβ(m))(ω)ϵ𝒯L(ω,ωβ(m))(ω)+1𝒯L(ω,ωβ(m))(ω))=p11ϵ𝒯L(ω,ωβ(m))(ω)+p01(1𝒯L(ω,ωβ(m))(ω))ϵ𝒯L(ω,ωβ(m))(ω)+1𝒯L(ω,ωβ(m))(ω)\displaystyle\begin{aligned} f(\omega,\omega_{\beta}^{*}(m))&=\mathcal{T}\bigg{(}\tfrac{\epsilon\mathcal{T}^{L(\omega,\omega_{\beta}^{*}(m))}(\omega)}{\epsilon\mathcal{T}^{L(\omega,\omega_{\beta}^{*}(m))}(\omega)+1-\mathcal{T}^{L(\omega,\omega_{\beta}^{*}(m))}(\omega)}\bigg{)}\\ &=\frac{p_{11}\epsilon\mathcal{T}^{L(\omega,\omega_{\beta}^{*}(m))}(\omega)+p_{01}(1-\mathcal{T}^{L(\omega,\omega_{\beta}^{*}(m))}(\omega))}{\epsilon\mathcal{T}^{L(\omega,\omega_{\beta}^{*}(m))}(\omega)+1-\mathcal{T}^{L(\omega,\omega_{\beta}^{*}(m))}(\omega)}\end{aligned} (69)

and iteratively construct the sequence {k_n} by k_{n+1} = f(k_n, ω*_β(m)) with k_0 = ω, then we obtain the following sequence of equations:

Vβ,m(k0)=1βL(k0,ωβ(m))1βm+βL(k0,ωβ(m)){(1ϵ)𝒯L(k0,ωβ(m))(k0)+β[(1ϵ)×𝒯L(k0,ωβ(m))(k0)Vβ,m(p11)+(1(1ϵ)𝒯L(k0,ωβ(m))(k0))Vβ,m(k1)]}Vβ,m(k1)=1βL(k1,ωβ(m))1βm+βL(k1,ωβ(m)){(1ϵ)𝒯L(k1,ωβ(m))(k1)+β[(1ϵ)×𝒯L(k1,ωβ(m))(k1)Vβ,m(p11)+(1(1ϵ)𝒯L(k1,ωβ(m))(k1))Vβ,m(k2)]}Vβ,m(kn)=1βL(kn,ωβ(m))1βm+βL(kn,ωβ(m)){(1ϵ)𝒯L(kn,ωβ(m))(kn)+β[(1ϵ)𝒯L(kn,ωβ(m))(kn)Vβ,m(p11)+(1(1ϵ)×𝒯L(kn,ωβ(m))(kn))Vβ,m(kn+1)]}.{\begin{aligned} V_{\beta,m}(k_{0})&=\frac{1-\beta^{L(k_{0},\omega_{\beta}^{*}(m))}}{1-\beta}m+\beta^{L(k_{0},\omega_{\beta}^{*}(m))}\big{\{}(1-\epsilon)\mathcal{T}^{L(k_{0},\omega_{\beta}^{*}(m))}(k_{0})+\beta\big{[}(1-\epsilon)\\ &\quad\times\mathcal{T}^{L(k_{0},\omega_{\beta}^{*}(m))}(k_{0})V_{\beta,m}(p_{11})+(1-(1-\epsilon)\mathcal{T}^{L(k_{0},\omega_{\beta}^{*}(m))}(k_{0}))V_{\beta,m}(k_{1})\big{]}\big{\}}\\ V_{\beta,m}(k_{1})&=\frac{1-\beta^{L(k_{1},\omega_{\beta}^{*}(m))}}{1-\beta}m+\beta^{L(k_{1},\omega_{\beta}^{*}(m))}\big{\{}(1-\epsilon)\mathcal{T}^{L(k_{1},\omega_{\beta}^{*}(m))}(k_{1})+\beta\big{[}(1-\epsilon)\\ &\quad\times\mathcal{T}^{L(k_{1},\omega_{\beta}^{*}(m))}(k_{1})V_{\beta,m}(p_{11})+(1-(1-\epsilon)\mathcal{T}^{L(k_{1},\omega_{\beta}^{*}(m))}(k_{1}))V_{\beta,m}(k_{2})\big{]}\big{\}}\\ \cdots\\ V_{\beta,m}(k_{n})&=\frac{1-\beta^{L(k_{n},\omega_{\beta}^{*}(m))}}{1-\beta}m+\beta^{L(k_{n},\omega_{\beta}^{*}(m))}\big{\{}(1-\epsilon)\mathcal{T}^{L(k_{n},\omega_{\beta}^{*}(m))}(k_{n})\\ &\quad+\beta\big{[}(1-\epsilon)\cdot\mathcal{T}^{L(k_{n},\omega_{\beta}^{*}(m))}(k_{n})V_{\beta,m}(p_{11})+(1-(1-\epsilon)\\ &\quad\quad\quad\times\mathcal{T}^{L(k_{n},\omega_{\beta}^{*}(m))}(k_{n}))V_{\beta,m}(k_{n+1})\big{]}\big{\}}\\ \cdots\end{aligned}}.

For sufficiently large n, we can estimate V_{β,m}(ω) = V_{β,m}(k_0) with an arbitrarily small error by setting V_{β,m}(k_{n+1}) = 0: the resulting error is discounted by β at each step of the backward computation, so it decays geometrically as we work back to V_{β,m}(k_0). Note that we first compute V_{β,m}(p_{11}) in the same way by setting k_0 = p_{11} in the above equation set. Therefore we can estimate V_{β,m}(ω) with arbitrarily high precision for any ω ∈ [0,1]. Interestingly, extensive numerical results show that {k_n} quickly converges to a limit belief state k* (independent of k_0); see Fig. 1 for an example. Specifically, after 4 iterations the difference |k_4 − k*| becomes too small to affect the performance of our algorithm, as discussed in Sec. 5.1. So we can set V_{β,m}(k_5) = V_{β,m}(k_4) and efficiently solve the finite linear equation set (up to V_{β,m}(k_4)). In general, the n-iteration Whittle index is based on the solution of the following equations:

Vβ,m(p11)\displaystyle{}V_{\beta,m}(p_{11}) =1βL(p11,ωβ(m))1βm+βL(p11,ωβ(m)){(1ϵ)𝒯L(p11,ωβ(m))(p11)\displaystyle=\frac{1-\beta^{L(p_{11},\omega_{\beta}^{*}(m))}}{1-\beta}m+\beta^{L(p_{11},\omega_{\beta}^{*}(m))}\big{\{}(1-\epsilon)\mathcal{T}^{L(p_{11},\omega_{\beta}^{*}(m))}(p_{11})
+β[(1ϵ)𝒯L(p11,ωβ(m))(p11)Vβ,m(p11)+(1(1ϵ)\displaystyle\quad+\beta\big{[}(1-\epsilon)\mathcal{T}^{L(p_{11},\omega_{\beta}^{*}(m))}(p_{11})V_{\beta,m}(p_{11})+(1-(1-\epsilon)
×𝒯L(p11,ωβ(m))(p11))Vβ,m(k1)]}\displaystyle\quad\quad\quad\times\mathcal{T}^{L(p_{11},\omega_{\beta}^{*}(m))}(p_{11}))V_{\beta,m}(k_{1})\big{]}\big{\}}
Vβ,m(k1)\displaystyle{}V_{\beta,m}(k_{1}) =1βL(k1,ωβ(m))1βm+βL(k1,ωβ(m)){(1ϵ)𝒯L(k1,ωβ(m))(k1)+β[(1ϵ)\displaystyle=\frac{1-\beta^{L(k_{1},\omega_{\beta}^{*}(m))}}{1-\beta}m+\beta^{L(k_{1},\omega_{\beta}^{*}(m))}\big{\{}(1-\epsilon)\mathcal{T}^{L(k_{1},\omega_{\beta}^{*}(m))}(k_{1})+\beta\big{[}(1-\epsilon)
×𝒯L(k1,ωβ(m))(k1)Vβ,m(p11)+(1(1ϵ)𝒯L(k1,ωβ(m))(k1))Vβ,m(k2)]}\displaystyle\quad\times\mathcal{T}^{L(k_{1},\omega_{\beta}^{*}(m))}(k_{1})V_{\beta,m}(p_{11})+(1-(1-\epsilon)\mathcal{T}^{L(k_{1},\omega_{\beta}^{*}(m))}(k_{1}))V_{\beta,m}(k_{2})\big{]}\big{\}}
\displaystyle\vdots
Vβ,m(kn)\displaystyle{}V_{\beta,m}(k_{n}) =1βL(kn,ωβ(m))1βm+βL(kn,ωβ(m)){(1ϵ)𝒯L(kn,ωβ(m))(kn)\displaystyle=\frac{1-\beta^{L(k_{n},\omega_{\beta}^{*}(m))}}{1-\beta}m+\beta^{L(k_{n},\omega_{\beta}^{*}(m))}\big{\{}(1-\epsilon)\mathcal{T}^{L(k_{n},\omega_{\beta}^{*}(m))}(k_{n})
+β[(1ϵ)𝒯L(kn,ωβ(m))(kn)Vβ,m(p11)+(1(1ϵ)\displaystyle\quad+\beta\big{[}(1-\epsilon)\mathcal{T}^{L(k_{n},\omega_{\beta}^{*}(m))}(k_{n})V_{\beta,m}(p_{11})+(1-(1-\epsilon)
×𝒯L(kn,ωβ(m))(kn))Vβ,m(kn)]}.\displaystyle\quad\quad\quad\times\mathcal{T}^{L(k_{n},\omega_{\beta}^{*}(m))}(k_{n}))V_{\beta,m}(k_{n})\big{]}\big{\}}.
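As a concrete illustration of how these equations are solved, the sketch below assembles and solves the n-iteration linear system for a fixed numerical subsidy m: it builds the chain p_{11}, k_1, …, k_n through the map f of (69) (inlined), closes the last equation on itself as described above, and reuses passive_update and first_crossing_time from the sketch after (63). The name value_functions and the example parameters are ours.

import math
import numpy as np

def value_functions(m, threshold, p11, p01, eps, beta, n=4):
    # Solve the linear system above for [V(p11), V(k1), ..., V(kn)] at a fixed subsidy m.
    # The chain is cut short if some k_i never crosses the threshold (then V(k_i) = m/(1-beta)).
    states, crossings = [p11], []
    while True:
        s = states[-1]
        L = first_crossing_time(s, threshold, p11, p01)
        crossings.append(L)
        if L == math.inf or len(states) == n + 1:
            break
        x = s
        for _ in range(L):
            x = passive_update(x, p11, p01)
        phi = eps * x / (eps * x + 1.0 - x)            # belief after a '0' observation
        states.append(passive_update(phi, p11, p01))   # k_{i+1} = f(k_i, threshold), cf. (69)
    size = len(states)
    A, b = np.eye(size), np.zeros(size)
    for i, (s, L) in enumerate(zip(states, crossings)):
        if L == math.inf:                              # this state stays passive forever
            b[i] = m / (1.0 - beta)
            continue
        x = s
        for _ in range(L):
            x = passive_update(x, p11, p01)            # x = T^L(s)
        nxt = min(i + 1, size - 1)                     # the last equation closes on itself
        b[i] = (1.0 - beta**L) / (1.0 - beta) * m + beta**L * (1.0 - eps) * x
        A[i, 0] -= beta**(L + 1) * (1.0 - eps) * x
        A[i, nxt] -= beta**(L + 1) * (1.0 - (1.0 - eps) * x)
    return states, np.linalg.solve(A, b)

# Example with hypothetical parameters; the chain k_n settles within a few iterations,
# consistent with the behaviour reported in Fig. 1.
# states, values = value_functions(m=0.2, threshold=0.35, p11=0.6, p01=0.3, eps=0.1, beta=0.9)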

After the value functions are (approximately) solved, we can plot the active value function V_{β,m}(ω; u=1) and the passive one V_{β,m}(ω; u=0) and observe that they intersect at a single point (the threshold), which verifies the optimality of the threshold policy proven by Theorem 2 (see Fig. 2 for an example). According to Theorem 3, the passive time D_{β,m}(ω) can also be approximately solved from the following equations:

Dβ,m(p11)=1βL(p11,ωβ(m))1β+βL(p11,ωβ(m))+1{(1ϵ)𝒯L(p11,ωβ(m))(p11)×Dβ,m(p11)+(1(1ϵ)𝒯L(p11,ωβ(m))(p11))Dβ,m(k1)}Dβ,m(k1)=1βL(k1,ωβ(m))1β+βL(k1,ωβ(m))+1{(1ϵ)𝒯L(k1,ωβ(m))(k1)×Dβ,m(p11)+(1(1ϵ)𝒯L(k1,ωβ(m))(k1))Dβ,m(k2)}Dβ,m(kn)=1βL(kn,ωβ(m))1β+βL(kn,ωβ(m))+1{(1ϵ)𝒯L(kn,ωβ(m))(kn)×Dβ,m(p11)+(1(1ϵ)𝒯L(kn,ωβ(m))(kn))Dβ,m(kn)}.\begin{aligned} D_{\beta,m}(p_{11})&=\frac{1-\beta^{L(p_{11},\omega_{\beta}^{*}(m))}}{1-\beta}+\beta^{L(p_{11},\omega_{\beta}^{*}(m))+1}\big{\{}(1-\epsilon)\mathcal{T}^{L(p_{11},\omega_{\beta}^{*}(m))}(p_{11})\\ &\quad\times D_{\beta,m}(p_{11})+(1-(1-\epsilon)\mathcal{T}^{L(p_{11},\omega_{\beta}^{*}(m))}(p_{11}))D_{\beta,m}(k_{1})\big{\}}\\ D_{\beta,m}(k_{1})&=\frac{1-\beta^{L(k_{1},\omega_{\beta}^{*}(m))}}{1-\beta}+\beta^{L(k_{1},\omega_{\beta}^{*}(m))+1}\big{\{}(1-\epsilon)\mathcal{T}^{L(k_{1},\omega_{\beta}^{*}(m))}(k_{1})\\ &\quad\times D_{\beta,m}(p_{11})+(1-(1-\epsilon)\mathcal{T}^{L(k_{1},\omega_{\beta}^{*}(m))}(k_{1}))D_{\beta,m}(k_{2})\big{\}}\\ \vdots\\ D_{\beta,m}(k_{n})&=\frac{1-\beta^{L(k_{n},\omega_{\beta}^{*}(m))}}{1-\beta}+\beta^{L(k_{n},\omega_{\beta}^{*}(m))+1}\big{\{}(1-\epsilon)\mathcal{T}^{L(k_{n},\omega_{\beta}^{*}(m))}(k_{n})\\ &\quad\times D_{\beta,m}(p_{11})+(1-(1-\epsilon)\mathcal{T}^{L(k_{n},\omega_{\beta}^{*}(m))}(k_{n}))D_{\beta,m}(k_{n})\big{\}}\end{aligned}.

Substituting ω for ω*_β(m) in the above n+1 linear equations with n+1 unknowns (first solving for ω = p_{11}), we can obtain V_{β,m}(ω′) and D_{β,m}(ω′) for any ω′ ∈ [0,1] from the linear equation sets. The indexability condition (56) in Theorem 3 can be checked online: for the original multi-armed bandit problem and for each arm at state ω(t) at time t, we compute its approximated Whittle index W(ω(t)) by solving a set of linear equations, with complexity polynomial in the iteration number n and independent of the decision time t. At time t, for each arm, if W(·) is nondecreasing over the arm states (ω(1), ω(2), …, ω(t)) observed so far, starting from the initial belief vector 𝝎(1) defined in (2), then indexability has not been violated. Interestingly, extensive numerical studies have shown that indexability is always satisfied, as illustrated in Figs. 3 and 4 in Sec. 5.1.
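The online check just described only requires the index values at the belief states an arm has actually visited. A minimal sketch (the function name and tolerance are ours):

def indexability_not_violated(visited_beliefs, whittle_index, tol=1e-12):
    # Returns True if the (approximated) Whittle index is nondecreasing over the
    # belief states visited so far; whittle_index is any callable omega -> W(omega).
    pairs = sorted((w, whittle_index(w)) for w in visited_beliefs)
    return all(pairs[i][1] <= pairs[i + 1][1] + tol for i in range(len(pairs) - 1))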

Figure 1: The Convergence of k_n
Figure 2: The Optimality of the Threshold Policy

For large β ∈ (0,1), where the threshold structure of the optimal policy or indexability may not hold (i.e., condition (64) is not satisfied), we can still use the above process to solve for the subsidy m that makes (65) true, if such an m exists. Note that after the value functions appearing in (65) are computed in terms of m, both V_{β,m}(ω; u=1) and V_{β,m}(ω; u=0) are affine in m, and their equality gives a unique solution for m whenever their linear coefficients differ. This m, if it exists, can thus be used as the approximated Whittle index W(ω) without requiring indexability or a threshold-based optimal policy. If it does not exist, we simply set W(ω) = ωB. The existence of such an m is defined as the relaxed indexability in Liu (2021). Extensive numerical studies have shown that the relaxed indexability of our model with imperfect state observations is always satisfied as well. Before summarizing our general algorithm for all β ∈ (0,1) in Section 3.2, we solve for the approximated Whittle index function in closed form for the simplest case of 0 iterations, which we refer to as the imperfect Whittle index. Note that if ϵ → 0 then 𝒯(ϵω/(ϵω+1−ω)) → p_{01}. Thus, when ϵ is sufficiently small, we can approximate V_{β,m}(𝒯(ϵω/(ϵω+1−ω))) by V_{β,m}(p_{01}). Under this approximation we have, for any ω ∈ [0,1],

Vβ,m(ω;u=1)=(1ϵ)ω+β[(1ϵ)ωVβ,m(p11)+(1(1ϵ)ω)Vβ,m(p01)]Vβ,m(ω;u=0)=m+βVβ,m(𝒯(ω))Vβ,m(ω)=1βL(ω,ωβ(m))1βm+βL(ω,ωβ(m)){(1ϵ)𝒯L(ω,ωβ(m))(ω)+β[(1ϵ)𝒯L(ω,ωβ(m))(ω)Vβ,m(p11)+(1(1ϵ)𝒯L(ω,ωβ(m))(ω))Vβ,m(p01)]}.\begin{aligned} &V_{\beta,m}(\omega;u=1)=(1-\epsilon)\omega+\beta\big{[}(1-\epsilon)\omega V_{\beta,m}(p_{11})+(1-(1-\epsilon)\omega)V_{\beta,m}(p_{01})\big{]}\\ &V_{\beta,m}(\omega;u=0)=m+\beta V_{\beta,m}(\mathcal{T}(\omega))\\ &V_{\beta,m}(\omega)=\frac{1-\beta^{L(\omega,\omega_{\beta}^{*}(m))}}{1-\beta}m+\beta^{L(\omega,\omega_{\beta}^{*}(m))}\big{\{}(1-\epsilon)\mathcal{T}^{L(\omega,\omega_{\beta}^{*}(m))}(\omega)\\ &\quad\quad\quad\quad~{}~{}+\beta\big{[}(1-\epsilon)\mathcal{T}^{L(\omega,\omega_{\beta}^{*}(m))}(\omega)V_{\beta,m}(p_{11})\\ &\quad\quad\quad\quad~{}~{}+(1-(1-\epsilon)\mathcal{T}^{L(\omega,\omega_{\beta}^{*}(m))}(\omega))V_{\beta,m}(p_{01})\big{]}\big{\}}\\ \end{aligned}.

By using the above three equations, we can directly solve for V_{β,m}(p_{01}) and V_{β,m}(p_{11}) in closed form.
When p_{11} > p_{01}, V_{β,m}(p_{01}) =

{(1ϵ)p01(1β)(1β(1ϵ)p11+β(1ϵ)p01),if ωβ(m)<p01(1β(1ϵ)p11)(1βL(p01,ωβ(m)))m+(1ϵ)(1β)βL(p01,ωβ(m))𝒯L(p01,ωβ(m))(p01)(1β(1ϵ)p11)(1β)(1βL(p01,ωβ(m))+1)+(1ϵ)(1β)2βL(p01,ωβ(m))+1𝒯L(p01,ωβ(m))(p01),if p01ωβ(m)<ω0m1β,if ωβ(m)ω0,\begin{cases}\frac{(1-\epsilon)p_{01}}{(1-\beta)(1-\beta(1-\epsilon)p_{11}+\beta(1-\epsilon)p_{01})},\text{if }\omega_{\beta}^{*}(m)<p_{01}\\[10.0pt] \frac{(1-\beta(1-\epsilon)p_{11})(1-\beta^{L(p_{01},\omega_{\beta}^{*}(m))})m+(1-\epsilon)(1-\beta)\beta^{L(p_{01},\omega_{\beta}^{*}(m))}\mathcal{T}^{L(p_{01},\omega_{\beta}^{*}(m))}(p_{01})}{(1-\beta(1-\epsilon)p_{11})(1-\beta)(1-\beta^{L(p_{01},\omega_{\beta}^{*}(m))+1})+(1-\epsilon)(1-\beta)^{2}\beta^{L(p_{01},\omega_{\beta}^{*}(m))+1}\mathcal{T}^{L(p_{01},\omega_{\beta}^{*}(m))}(p_{01})},\\ ~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\text{if }p_{01}\leq\omega_{\beta}^{*}(m)<\omega_{0}\\ \frac{m}{1-\beta},\text{if }\omega_{\beta}^{*}(m)\geq\omega_{0}\end{cases},
Vβ,m(p11)={(1ϵ)p11+β(1(1ϵ)p11)Vβ,m(p01)1β(1ϵ)p11,if ωβ(m)<p11m1β,if ωβ(m)p11.V_{\beta,m}(p_{11})=\begin{cases}\frac{(1-\epsilon)p_{11}+\beta(1-(1-\epsilon)p_{11})V_{\beta,m}(p_{01})}{1-\beta(1-\epsilon)p_{11}},&\text{if }\omega_{\beta}^{*}(m)<p_{11}\\ \frac{m}{1-\beta},&\text{if }\omega_{\beta}^{*}(m)\geq p_{11}\end{cases}.

The approximate Whittle index is given by

W(ω)={ω(1ϵ)(1βp11+βp01)1β(1ϵ)p11+β(1ϵ)p01,if ωp01(1ϵ)(ωβ𝒯(ω))+C2(1β)β[(1β(1ϵ)p11)(1ϵ)(ωβ𝒯(ω))]1β(1ϵ)p11C1β[(1β(1ϵ)p11)(1ϵ)(ωβ𝒯(ω))],if p01<ωω0(1ϵ)ω1β(1ϵ)p11+β(1ϵ)ω,if ω0<ωp11(1ϵ)ω,if ω>p11,W(\omega)=\begin{cases}\frac{\omega(1-\epsilon)(1-\beta p_{11}+\beta p_{01})}{1-\beta(1-\epsilon)p_{11}+\beta(1-\epsilon)p_{01}},&\text{if }\omega\leq p_{01}\\ \frac{(1-\epsilon)(\omega-\beta\mathcal{T}(\omega))+C_{2}(1-\beta)\beta[(1-\beta(1-\epsilon)p_{11})-(1-\epsilon)(\omega-\beta\mathcal{T}(\omega))]}{1-\beta(1-\epsilon)p_{11}-C_{1}\beta[(1-\beta(1-\epsilon)p_{11})-(1-\epsilon)(\omega-\beta\mathcal{T}(\omega))]},&\text{if }p_{01}<\omega\leq\omega_{0}\\ \frac{(1-\epsilon)\omega}{1-\beta(1-\epsilon)p_{11}+\beta(1-\epsilon)\omega},&\text{if }\omega_{0}<\omega\leq p_{11}\\ (1-\epsilon)\omega,&\text{if }\omega>p_{11}\end{cases},

where

C1=(1β(1ϵ)p11)(1βL(p01,ω))(1β(1ϵ)p11)(1βL(p01,ω)+1)+(1ϵ)(1β)βL(p01,ω)+1𝒯L(p01,ω)(p01),\displaystyle C_{1}=\frac{(1-\beta(1-\epsilon)p_{11})(1-\beta^{L(p_{01},\omega)})}{(1-\beta(1-\epsilon)p_{11})(1-\beta^{L(p_{01},\omega)+1})+(1-\epsilon)(1-\beta)\beta^{L(p_{01},\omega)+1}\mathcal{T}^{L(p_{01},\omega)}(p_{01})},
C2=(1ϵ)βL(p01,ω)𝒯L(p01,ω)(p01)(1β(1ϵ)p11)(1βL(p01,ω)+1)+(1ϵ)(1β)βL(p01,ω)+1𝒯L(p01,ω)(p01).\displaystyle C_{2}=\frac{(1-\epsilon)\beta^{L(p_{01},\omega)}\mathcal{T}^{L(p_{01},\omega)}(p_{01})}{(1-\beta(1-\epsilon)p_{11})(1-\beta^{L(p_{01},\omega)+1})+(1-\epsilon)(1-\beta)\beta^{L(p_{01},\omega)+1}\mathcal{T}^{L(p_{01},\omega)}(p_{01})}.
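The 0-iteration index above can be evaluated directly. The sketch below transcribes the four cases and the constants C_1, C_2 for the positively correlated case p_{11} > p_{01}, reusing passive_update and first_crossing_time from the sketch after (63); it assumes unit reward B = 1 (scale the result by B otherwise), and the function name is ours.

import math

def imperfect_whittle_index(omega, p11, p01, eps, beta):
    # Closed-form 0-iteration (imperfect Whittle) index W(omega) for p11 > p01.
    T = lambda w: passive_update(w, p11, p01)
    omega0 = p01 / (1.0 - p11 + p01)            # stationary belief under passive updates
    a = 1.0 - beta * (1.0 - eps) * p11          # recurring factor 1 - beta*(1-eps)*p11
    if omega <= p01:
        return omega * (1.0 - eps) * (1.0 - beta * p11 + beta * p01) / (a + beta * (1.0 - eps) * p01)
    if omega <= omega0:
        L = first_crossing_time(p01, omega, p11, p01)
        if L == math.inf:                        # omega numerically at the stationary belief
            C1, C2 = 1.0, 0.0
        else:
            x = p01
            for _ in range(L):
                x = T(x)                         # x = T^L(p01)
            den = a * (1.0 - beta**(L + 1)) + (1.0 - eps) * (1.0 - beta) * beta**(L + 1) * x
            C1 = a * (1.0 - beta**L) / den
            C2 = (1.0 - eps) * beta**L * x / den
        g = (1.0 - eps) * (omega - beta * T(omega))
        return (g + C2 * (1.0 - beta) * beta * (a - g)) / (a - C1 * beta * (a - g))
    if omega <= p11:
        return (1.0 - eps) * omega / (a + beta * (1.0 - eps) * omega)
    return (1.0 - eps) * omega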

Similarly, when p_{01} > p_{11}, we have

Vβ,m(p11)={(1ϵ)(p11(1β)+βp01)(1β)(1β(1ϵ)p11+β(1ϵ)p01),if ωβ(m)<p11(1β(1(1ϵ)p01))m+β(1ϵ)𝒯(p11)(1β)+β2(1ϵ)p01(1β)(1+β(1+β)(1ϵ)p01β2(1ϵ)𝒯(p11)),if p11ωβ(m)<𝒯(p11)m1β,if ωβ(m)𝒯(p11),V_{\beta,m}(p_{11})=\begin{cases}\frac{(1-\epsilon)(p_{11}(1-\beta)+\beta p_{01})}{(1-\beta)(1-\beta(1-\epsilon)p_{11}+\beta(1-\epsilon)p_{01})},\ \text{if }\omega_{\beta}^{*}(m)<p_{11}\\[8.0pt] \frac{(1-\beta(1-(1-\epsilon)p_{01}))m+\beta(1-\epsilon)\mathcal{T}(p_{11})(1-\beta)+\beta^{2}(1-\epsilon)p_{01}}{(1-\beta)(1+\beta(1+\beta)(1-\epsilon)p_{01}-\beta^{2}(1-\epsilon)\mathcal{T}(p_{11}))},\ \\ ~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\text{if }p_{11}\leq\omega_{\beta}^{*}(m)<\mathcal{T}(p_{11})\\[6.0pt] \frac{m}{1-\beta},\ \text{if }\omega_{\beta}^{*}(m)\geq\mathcal{T}(p_{11})\end{cases},
Vβ,m(p01)={(1ϵ)p01+β(1ϵ)p01Vβ,m(p11)1β(1(1ϵ)p01),if ωβ(m)<p01m1β,if ωβ(m)p01.V_{\beta,m}(p_{01})=\begin{cases}\frac{(1-\epsilon)p_{01}+\beta(1-\epsilon)p_{01}V_{\beta,m}(p_{11})}{1-\beta(1-(1-\epsilon)p_{01})},&\text{if }\omega_{\beta}^{*}(m)<p_{01}\\ \frac{m}{1-\beta},&\text{if }\omega_{\beta}^{*}(m)\geq p_{01}\end{cases}.~{}~{}~{}~{}~{}~{}~{}~{}

The approximate Whittle index is given by

W(ω)={ω(1ϵ)(1βp11+βp01)1β(1ϵ)p11+β(1ϵ)p01,if ωp11(1ϵ)(1β+C4β)(βp01+ωβ𝒯(ω))1β(1(1ϵ)p01)+(1ϵ)C3β(β𝒯(ω)βp01ω),if p11<ω<ω0(1ϵ)(1β+βC4)(βp01+ω(1β))1β(1(1ϵ)p01)(1ϵ)βC3(βp01+ωβω),if ω0ω<𝒯(p11)(1ϵ)(βp01+(1β)ω)1+(1ϵ)β(p01ω),if 𝒯(p11)ω<p01(1ϵ)ω,if ωp01,W(\omega)=\begin{cases}\frac{\omega(1-\epsilon)(1-\beta p_{11}+\beta p_{01})}{1-\beta(1-\epsilon)p_{11}+\beta(1-\epsilon)p_{01}},&\text{if }\omega\leq p_{11}\\[6.0pt] \frac{(1-\epsilon)(1-\beta+C_{4}\beta)(\beta p_{01}+\omega-\beta\mathcal{T}(\omega))}{1-\beta(1-(1-\epsilon)p_{01})+(1-\epsilon)C_{3}\beta(\beta\mathcal{T}(\omega)-\beta p_{01}-\omega)},&\text{if }p_{11}<\omega<\omega_{0}\\[6.0pt] \frac{(1-\epsilon)(1-\beta+\beta C_{4})(\beta p_{01}+\omega(1-\beta))}{1-\beta(1-(1-\epsilon)p_{01})-(1-\epsilon)\beta C_{3}(\beta p_{01}+\omega-\beta\omega)},&\text{if }\omega_{0}\leq\omega<\mathcal{T}(p_{11})\\[6.0pt] \frac{(1-\epsilon)(\beta p_{01}+(1-\beta)\omega)}{1+(1-\epsilon)\beta(p_{01}-\omega)},&\text{if }\mathcal{T}(p_{11})\leq\omega<p_{01}\\[6.0pt] (1-\epsilon)\omega,&\text{if }\omega\geq p_{01}\end{cases},

where

C3=1β(1(1ϵ)p01)1+β(1+β)(1ϵ)p01β2(1ϵ)𝒯(p11),\displaystyle C_{3}=\tfrac{1-\beta(1-(1-\epsilon)p_{01})}{1+\beta(1+\beta)(1-\epsilon)p_{01}-\beta^{2}(1-\epsilon)\mathcal{T}(p_{11})},
C4=β(1ϵ)𝒯(p11)(1β)+β2(1ϵ)p011+β(1+β)(1ϵ)p01β2(1ϵ)𝒯(p11).\displaystyle C_{4}=\tfrac{\beta(1-\epsilon)\mathcal{T}(p_{11})(1-\beta)+\beta^{2}(1-\epsilon)p_{01}}{1+\beta(1+\beta)(1-\epsilon)p_{01}-\beta^{2}(1-\epsilon)\mathcal{T}(p_{11})}.

3.2 Algorithm

Our analysis leads to Algorithm 1 for the RMAB model with imperfect observations, applicable to all β ∈ (0,1).

Algorithm 1 Whittle Index Policy

Input: β ∈ (0,1), T ≥ 1, N ≥ 2, 1 ≤ M < N, iteration number k
Input: initial belief states ω_n(1), transition matrices P^(n), rewards B_n, n = 1, …, N

1:  for t = 1, 2, …, T do
2:      for n = 1, …, N do
3:          Set the threshold ω*_β(m) = ω_n(t) in (68)
4:          Compute L(p_{11}^{(n)}, ω_n(t)) and set ω = p_{11}^{(n)} in (68)
5:          Expand (68) to the kth step and solve for V^{(n)}_{β,m}(p_{11}^{(n)})
6:          Compute L(𝒯∘ϕ(ω), ω_n(t)) and set ω = 𝒯∘ϕ(ω) in (68)
7:          Expand (68) to the kth step and solve for V^{(n)}_{β,m}(𝒯∘ϕ(ω)) from V^{(n)}_{β,m}(p_{11}^{(n)})
8:          Compute L(𝒯(ω), ω_n(t)) and set ω = 𝒯(ω) in (68)
9:          Expand (68) to the kth step and solve for V^{(n)}_{β,m}(𝒯(ω)) from V^{(n)}_{β,m}(p_{11}^{(n)})
10:         Solve for V^{(n)}_{β,m}(ω; u=1) from V^{(n)}_{β,m}(p_{11}^{(n)}) and V^{(n)}_{β,m}(𝒯∘ϕ(ω)) as in (66)
11:         Solve for V^{(n)}_{β,m}(ω; u=0) from V^{(n)}_{β,m}(𝒯(ω)) as in (67)
12:         Evaluate the solvability of the linear equation of m: V^{(n)}_{β,m}(ω; u=1) = V^{(n)}_{β,m}(ω; u=0)
13:         If it is unsolvable, set W(ω_n(t)) = ω_n(t)B_n and skip Step 14
14:         Compute W(ω_n(t)) as the solution to V^{(n)}_{β,m}(ω; u=1) = V^{(n)}_{β,m}(ω; u=0)
15:     end for
16:     Choose the top M arms with the largest Whittle indices W(ω_n(t))
17:     Observe the selected M arms and accrue reward O_n(t)S_n(t)B_n from each observed arm
18:     for n = 1, …, N do
19:         Update the belief state ω_n(t) according to (4)
20:     end for
21: end for
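Steps 12–15 exploit the fact that, once the threshold is fixed at the current belief, both V^{(n)}_{β,m}(ω; u=1) and V^{(n)}_{β,m}(ω; u=0) are affine in the subsidy m, so the index can be found by evaluating the gap between them at two trial subsidies and solving a scalar linear equation. Below is a sketch of the per-arm computation, reusing value_functions, passive_update and first_crossing_time from the sketches in Section 3.1; the names value_at and arm_index, the truncation-depth handling, and the fallback tolerance are ours, and the reward B is assumed to enter multiplicatively (so the unit-reward subsidy is scaled by B).

import math

def value_at(s, m, V_p11, threshold, p11, p01, eps, beta, depth):
    # Evaluate V_{beta,m}(s) by expanding (68) to the given depth with a previously solved
    # V(p11); the tail beyond the last level is dropped (its error is discounted by beta^depth).
    L = first_crossing_time(s, threshold, p11, p01)
    if L == math.inf:
        return m / (1.0 - beta)
    x = s
    for _ in range(L):
        x = passive_update(x, p11, p01)
    if depth == 0:
        tail = 0.0
    else:
        phi = eps * x / (eps * x + 1.0 - x)
        tail = value_at(passive_update(phi, p11, p01), m, V_p11,
                        threshold, p11, p01, eps, beta, depth - 1)
    return ((1.0 - beta**L) / (1.0 - beta)) * m + beta**L * (
        (1.0 - eps) * x + beta * ((1.0 - eps) * x * V_p11 + (1.0 - (1.0 - eps) * x) * tail))

def arm_index(omega, p11, p01, eps, beta, B=1.0, k=4):
    # Steps 3-15 of Algorithm 1 for one arm: threshold at the current belief, active and
    # passive values as affine functions of m, then solve for the equalizing subsidy.
    def gap(m):  # V(omega; u=1) - V(omega; u=0), affine in m
        V_p11 = value_functions(m, omega, p11, p01, eps, beta, n=k)[1][0]
        phi = eps * omega / (eps * omega + 1.0 - omega)
        active = (1.0 - eps) * omega + beta * (
            (1.0 - eps) * omega * V_p11
            + (1.0 - (1.0 - eps) * omega)
            * value_at(passive_update(phi, p11, p01), m, V_p11, omega, p11, p01, eps, beta, k))
        passive = m + beta * value_at(passive_update(omega, p11, p01), m, V_p11,
                                      omega, p11, p01, eps, beta, k)
        return active - passive
    g0, g1 = gap(0.0), gap(1.0)
    if abs(g1 - g0) < 1e-12:      # degenerate linear equation in m: fall back to the myopic index
        return omega * B
    return -g0 / (g1 - g0) * B    # unit-reward index scaled by the arm's reward B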

4 Optimality for Homogeneous Systems

A space-wise homogeneous restless bandit system is one with N stochastically identical arms, i.e., the parameters P^{(n)} and B_n do not depend on n. In this case, our algorithm is equivalent to the myopic policy that activates the arms with the largest belief values, and this policy is optimal.

Theorem 4.

Consider a space-wise homogeneous model with positively correlated arms (p_{11} ≥ p_{01}) and an observation error ϵ satisfying

\epsilon\leq\frac{p_{01}(1-p_{11})}{p_{11}(1-p_{01})}=\frac{p_{01}p_{10}}{p_{11}p_{00}}. (70)

Then the myopic policy is optimal over both finite and infinite horizons.

Proof.

We adopt notation similar to that in Liu et al. (2011) for the case of perfect observation (ϵ = δ = 0), but several non-trivial modifications are needed due to the additional complexity introduced by observation errors. Consider N arms in total, of which K arms are activated at each step. Let W_s(ω_1, …, ω_N) denote the expected total discounted reward over s steps when the arms are ordered so that the probabilities that the underlying processes are in state 1 satisfy ω_1 ≥ ⋯ ≥ ω_N. In Liu et al. (2010), it is proved that the myopic policy has a dynamic queuing structure if the error probability ϵ satisfies (70). Then we have

Ws+1(ω1,,ωN)=(1ϵ)i=1Kωi\displaystyle W_{s+1}(\omega_{1},\ldots,\omega_{N})=(1-\epsilon)\sum_{i=1}^{K}\omega_{i}
+β𝔼[Ws(p11,,p11,τ(ωK+1),,τ(ωN),σ(),,σ())],\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\beta\mathbbm{E}[W_{s}(p_{11},\ldots,p_{11},\tau(\omega_{K+1}),\ldots,\tau(\omega_{N}),\sigma(\cdot),\cdots,\sigma(\cdot))],

where W_0(⋅) = 0, τ(ω) = p_{11}ω + p_{01}(1−ω), σ(ω) = τ(ϵω/(ϵω+1−ω)), and the expectation is taken over the possible observation outcomes when the K activated arms are those at the left end of the queue (i.e., those with probabilities ω_1, …, ω_K of being in state 1). We now describe this expectation more explicitly. For a belief state sequence (ω_1, …, ω_K), we call (ω_{i_1}, …, ω_{i_s}) and (ω_{j_1}, …, ω_{j_t}) a partition of (ω_1, …, ω_K) if they satisfy:

  • (i) (ω_{i_1}, …, ω_{i_s}, ω_{j_1}, …, ω_{j_t}) is a rearrangement of (ω_1, …, ω_K);

  • (ii) i_1 < ⋯ < i_s and j_1 < ⋯ < j_t.

Let 𝒫 be the collection of all partitions of (ω_1, …, ω_K); then the expectation above can be written as follows:

𝔼[Ws(p11,,p11,τ(ωK+1),,τ(ωN),σ(),,σ())]\displaystyle~{}\mathbbm{E}[W_{s}(p_{11},\ldots,p_{11},\tau(\omega_{K+1}),\ldots,\tau(\omega_{N}),\sigma(\cdot),\ldots,\sigma(\cdot))]
=\displaystyle= 𝒫(m=1s(1ϵ)ωim)(n=1t(1(1ϵ)ωjn))\displaystyle\sum_{\mathcal{P}}(\prod_{m=1}^{s}(1-\epsilon)\omega_{i_{m}})(\prod_{n=1}^{t}(1-(1-\epsilon)\omega_{j_{n}}))
×Ws(p11,,p11,τ(ωK+1),,τ(ωN),σ(ωj1),,σ(ωjt)).\displaystyle\times W_{s}(p_{11},\ldots,p_{11},\tau(\omega_{K+1}),\ldots,\tau(\omega_{N}),\sigma(\omega_{j_{1}}),\ldots,\sigma(\omega_{j_{t}})).

We can see that when ω_1 ≥ ω_2 ≥ ⋯ ≥ ω_N, W_s(ω_1, …, ω_N) is the value function for the myopic policy.
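For small s this recursion can be evaluated by brute force, which is useful for checking the arguments below on concrete instances. A minimal sketch with our own function names and hypothetical example parameters; the inner loop over 'ones' enumerates the partitions 𝒫:

from itertools import combinations

def tau(w, p11, p01):
    # Belief update for an unobserved arm: tau(w) = p11*w + p01*(1 - w).
    return p11 * w + p01 * (1.0 - w)

def sigma(w, p11, p01, eps):
    # Belief update for an activated arm observed in state 0: sigma(w) = tau(eps*w/(eps*w + 1 - w)).
    x = eps * w / (eps * w + 1.0 - w)
    return tau(x, p11, p01)

def W_s(beliefs, s, K, p11, p01, eps, beta):
    # s-step value of the queue-based (myopic) policy: activate the first K positions, then
    # reorder the queue as (observed 1 | unobserved | observed 0), as in the recursion above.
    if s == 0:
        return 0.0
    head, tail = beliefs[:K], beliefs[K:]
    total = (1.0 - eps) * sum(head)
    for r in range(K + 1):                     # r = number of activated arms observed in state 1
        for ones in combinations(range(K), r):
            prob = 1.0
            for i in range(K):
                prob *= (1.0 - eps) * head[i] if i in ones else 1.0 - (1.0 - eps) * head[i]
            nxt = ([p11] * r
                   + [tau(w, p11, p01) for w in tail]
                   + [sigma(head[i], p11, p01, eps) for i in range(K) if i not in ones])
            total += beta * prob * W_s(tuple(nxt), s - 1, K, p11, p01, eps, beta)
    return total

# Example (hypothetical parameters): W_3 for four identical arms, two activated per slot.
# print(W_s((0.9, 0.7, 0.4, 0.2), 3, 2, p11=0.8, p01=0.2, eps=0.1, beta=0.9))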

To prove the theorem, we first show by induction that, for every s, W_s(ω_1, …, ω_N) is linear in each ω_i (1 ≤ i ≤ N). It is obvious that W_0 = 0 and W_1(ω_1, …, ω_N) = (1−ϵ)Σ_{i=1}^{K} ω_i are linear in ω_1, …, ω_N. Assume the claim holds for s, i.e., W_s(ω_l) = W_s(ω_1, …, ω_l, …, ω_N) = a_l ω_l + b_l for all 1 ≤ l ≤ N, where a_l and b_l are constants independent of ω_l. Now consider W_{s+1}(ω_l) = W_{s+1}(ω_1, …, ω_l, …, ω_N). When l > K,

Ws+1(ω1,,\displaystyle W_{s+1}(\omega_{1},\ldots, ωl,,ωN)=(1ϵ)i=1Kωi\displaystyle\omega_{l},\cdots,\omega_{N})=(1-\epsilon)\sum_{i=1}^{K}\omega_{i}
+β𝔼[Ws(p11,,p11,τ(ωK+1),,τ(ωl),,τ(ωN),σ(),,σ())]\displaystyle+\beta\mathbbm{E}[W_{s}(p_{11},\ldots,p_{11},\tau(\omega_{K+1}),\ldots,\tau(\omega_{l}),\ldots,\tau(\omega_{N}),\sigma(\cdot),\ldots,\sigma(\cdot))]~{}~{}~{}~{}~{}~{}~{}~{}

In this case, the probability terms in the expectation depend only on ω_1, …, ω_K; moreover, τ(ω_l) is linear in ω_l and W_s is linear in τ(ω_l), so W_{s+1} is also linear in ω_l.
When l ≤ K, for any partition (ω_{i_1}, …, ω_{i_s}) and (ω_{j_1}, …, ω_{j_t}) of (ω_1, …, ω_K), if ω_l ∈ {ω_{i_1}, …, ω_{i_s}}, the corresponding term

(m=1s(1ϵ)\displaystyle(\prod_{m=1}^{s}(1-\epsilon) ωim)(n=1t(1(1ϵ)ωjn))\displaystyle\omega_{i_{m}})\cdot(\prod_{n=1}^{t}(1-(1-\epsilon)\omega_{j_{n}}))
×Ws(p11,,p11,τ(ωK+1),,τ(ωN),σ(ωj1),,σ(ωjt))\displaystyle~{}~{}~{}~{}\times W_{s}(p_{11},\ldots,p_{11},\tau(\omega_{K+1}),\ldots,\tau(\omega_{N}),\sigma(\omega_{j_{1}}),\ldots,\sigma(\omega_{j_{t}}))~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}

in the expectation is linear in ω_l. If ω_l ∈ {ω_{j_1}, …, ω_{j_t}}, then by the inductive hypothesis there exist constants ã and b̃ such that

(1(1ϵ)ωl)Ws(p11,,p11,τ(ωK+1),,τ(ωN),σ(ωj1),,σ(ωl),,σ(ωjt))\displaystyle(1-(1-\epsilon)\omega_{l})W_{s}(p_{11},\ldots,p_{11},\tau(\omega_{K+1}),\ldots,\tau(\omega_{N}),\sigma(\omega_{j_{1}}),\ldots,\sigma(\omega_{l}),\ldots,\sigma(\omega_{j_{t}}))
=(1(1ϵ)ωl)(a~σ(ωl)+b~)=a~(ϵp11ωl+p01(1ωl))+b~(1(1ϵ)ωl).\displaystyle=(1-(1-\epsilon)\omega_{l})(\tilde{a}\sigma(\omega_{l})+\tilde{b})=\tilde{a}(\epsilon p_{11}\omega_{l}+p_{01}(1-\omega_{l}))+\tilde{b}(1-(1-\epsilon)\omega_{l}).

The equation above shows that W_{s+1} is linear in ω_l, and thus the claim is proved.
From the above, we can write W_s(ω_1, …, x, …, y, …, ω_N) = ax + by + cxy + d, where a, b, c, d are constants. Swapping the positions of x and y and taking the difference of the two expressions, we have

Ws(ω1,,x,,y,,ωN)Ws(ω1,,y,,x,,ωN)\displaystyle W_{s}(\omega_{1},\ldots,x,\ldots,y,\ldots,\omega_{N})-W_{s}(\omega_{1},\ldots,y,\ldots,x,\ldots,\omega_{N})
=\displaystyle= (xy)[Ws(ω1,,1,,0,,ωN)Ws(ω1,,0,,1,,ωN)].\displaystyle(x-y)[W_{s}(\omega_{1},\ldots,1,\ldots,0,\ldots,\omega_{N})-W_{s}(\omega_{1},\ldots,0,\ldots,1,\ldots,\omega_{N})].

Next we prove two important properties of W_s. Let ω̄_i denote any (possibly empty) sequence of belief values. We again use induction to prove the following two properties:
(A) 1 − ϵ + W_s(ω̄_1, y, ω̄_2, ω̄_3) − W_s(ω̄_1, ω̄_2, y, ω̄_3) ≥ 0.
(B) For all y > x, W_s(ω̄_1, y, ω̄_2, x, ω̄_3) − W_s(ω̄_1, x, ω̄_2, y, ω̄_3) ≥ 0.
These are clearly true for s = 1. We begin by proving the induction step for (B). As above, the expression in (B) equals (y − x)[W_s(ω̄_1, 1, ω̄_2, 0, ω̄_3) − W_s(ω̄_1, 0, ω̄_2, 1, ω̄_3)].
Suppose the exchanged positions are the ith and jth places, i < j. If i, j ≤ K, then for some ω̄_1′, ω̄_2′, ω̄_3′ (stochastically determined by the observations from the top K arms in the queue), by the inductive hypothesis,

Ws(ω1¯,1,ω2¯,0,ω3¯)Ws(ω1¯,0,ω2¯,1,ω3¯)\displaystyle~{}~{}~{}~{}W_{s}(\bar{\omega_{1}},1,\bar{\omega_{2}},0,\bar{\omega_{3}})-W_{s}(\bar{\omega_{1}},0,\bar{\omega_{2}},1,\bar{\omega_{3}})
=β𝔼[Ws1(ω1¯,ω2¯,p01,ω3¯)Ws1(ω1¯,p01,ω2¯,ω3¯)]0.\displaystyle={}\beta\mathbbm{E}[W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},p_{01},\bar{\omega_{3}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},p_{01},\bar{\omega_{2}}^{\prime},\bar{\omega_{3}}^{\prime})]\geq{}0.

Similarly, if i, j > K,

Ws(ω1¯,1,ω2¯,0,ω3¯)Ws(ω1¯,0,ω2¯,1,ω3¯)\displaystyle~{}~{}~{}~{}W_{s}(\bar{\omega_{1}},1,\bar{\omega_{2}},0,\bar{\omega_{3}})-W_{s}(\bar{\omega_{1}},0,\bar{\omega_{2}},1,\bar{\omega_{3}})
=β𝔼[Ws1(ω1¯,p11,ω2¯,p01,ω3¯)Ws1(ω1¯,p01,ω2¯,p11,ω3¯)]0.\displaystyle=\beta\mathbbm{E}[W_{s-1}(\bar{\omega_{1}}^{\prime},p_{11},\bar{\omega_{2}}^{\prime},p_{01},\bar{\omega_{3}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},p_{01},\bar{\omega_{2}}^{\prime},p_{11},\bar{\omega_{3}}^{\prime})]\geq 0.

The interesting case is i ≤ K < j. In this case,

Ws(ω1¯,1,ω2¯,0,ω3¯)Ws(ω1¯,0,ω2¯,1,ω3¯)\displaystyle W_{s}(\bar{\omega_{1}},1,\bar{\omega_{2}},0,\bar{\omega_{3}})-W_{s}(\bar{\omega_{1}},0,\bar{\omega_{2}},1,\bar{\omega_{3}})
=\displaystyle={} 1ϵ+β𝔼[Ws1(ω1¯,p11,ω2¯,p01,ω3¯,ω4¯)Ws1(ω1¯,ω2¯,p11,ω3¯,p01,ω4¯)]\displaystyle 1-\epsilon+\beta\mathbbm{E}[W_{s-1}(\bar{\omega_{1}}^{\prime},p_{11},\bar{\omega_{2}}^{\prime},p_{01},\bar{\omega_{3}}^{\prime},\bar{\omega_{4}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},p_{11},\bar{\omega_{3}}^{\prime},p_{01},\bar{\omega_{4}}^{\prime})]
\displaystyle\geq{} 1ϵ+β𝔼[Ws1(ω1¯,ω2¯,p11,p01,ω3¯,ω4¯)Ws1(ω1¯,ω2¯,p11,ω3¯,p01,ω4¯)]\displaystyle 1-\epsilon+\beta\mathbbm{E}[W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},p_{11},p_{01},\bar{\omega_{3}}^{\prime},\bar{\omega_{4}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},p_{11},\bar{\omega_{3}}^{\prime},p_{01},\bar{\omega_{4}}^{\prime})]
=\displaystyle={} (1ϵ)(1β)+β𝔼[(1ϵ)\displaystyle(1-\epsilon)(1-\beta)+\beta\mathbbm{E}[(1-\epsilon)
+Ws1(ω1¯,ω2¯,p11,p01,ω3¯,ω4¯)Ws1(ω1¯,ω2¯,p11,ω3¯,p01,ω4¯)]\displaystyle+W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},p_{11},p_{01},\bar{\omega_{3}}^{\prime},\bar{\omega_{4}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},p_{11},\bar{\omega_{3}}^{\prime},p_{01},\bar{\omega_{4}}^{\prime})]
\displaystyle\geq{} 0,\displaystyle 0,

where the first inequality follows from the inductive hypothesis for (B) and the second follows from (A).
Next we prove (A) by induction. Suppose that y occupies the ith place in the first expression and the jth place in the second, i < j. If i, j ≤ K, then similarly, for some ω̄_1′, ω̄_2′, ω̄_3′ (depending on the observations),

1ϵ+Ws(ω1¯,y,ω2¯,ω3¯)Ws(ω1¯,ω2¯,y,ω3¯)\displaystyle 1-\epsilon+W_{s}(\bar{\omega_{1}},y,\bar{\omega_{2}},\bar{\omega_{3}})-W_{s}(\bar{\omega_{1}},\bar{\omega_{2}},y,\bar{\omega_{3}})
=\displaystyle={} 1ϵ+β𝔼[Ws1(ω1¯,σ(y),ω2¯,ω3¯)Ws1(ω1¯,ω2¯,σ(y),ω3¯)]\displaystyle 1-\epsilon+\beta\mathbbm{E}[W_{s-1}(\bar{\omega_{1}}^{\prime},\sigma(y),\bar{\omega_{2}}^{\prime},\bar{\omega_{3}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},\sigma(y),\bar{\omega_{3}}^{\prime})]
=\displaystyle={} (1ϵ)(1β)+β𝔼[(1ϵ)+Ws1(ω1¯,σ(y),ω2¯,ω3¯)Ws1(ω1¯,ω2¯,σ(y),ω3¯)]\displaystyle(1-\epsilon)(1-\beta)+\beta\mathbbm{E}[(1-\epsilon)+W_{s-1}(\bar{\omega_{1}}^{\prime},\sigma(y),\bar{\omega_{2}}^{\prime},\bar{\omega_{3}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},\sigma(y),\bar{\omega_{3}}^{\prime})]
\displaystyle\geq{} 0.\displaystyle 0.

If i, j > K, we have

1ϵ+Ws(ω1¯,y,ω2¯,ω3¯)Ws(ω1¯,ω2¯,y,ω3¯)\displaystyle 1-\epsilon+W_{s}(\bar{\omega_{1}},y,\bar{\omega_{2}},\bar{\omega_{3}})-W_{s}(\bar{\omega_{1}},\bar{\omega_{2}},y,\bar{\omega_{3}})
=\displaystyle={} (1ϵ)(1β)+β𝔼[(1ϵ)+Ws1(ω1¯,τ(y),ω2¯,ω3¯)Ws1(ω1¯,ω2¯,τ(y),ω3¯)]\displaystyle(1-\epsilon)(1-\beta)+\beta\mathbbm{E}[(1-\epsilon)+W_{s-1}(\bar{\omega_{1}}^{\prime},\tau(y),\bar{\omega_{2}}^{\prime},\bar{\omega_{3}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},\tau(y),\bar{\omega_{3}}^{\prime})]
\displaystyle\geq{} 0.\displaystyle 0.

The interesting case is i ≤ K < j. Let ω̄_2 = (ω̄_{21}, x, ω̄_{22}), where ω̄_1 and ω̄_{21} together contain K − 1 states. Then

1ϵ+Ws(ω1¯,y,ω2¯,ω3¯)Ws(ω1¯,ω2¯,y,ω3¯)\displaystyle~{}~{}~{}1-\epsilon+W_{s}(\bar{\omega_{1}},y,\bar{\omega_{2}},\bar{\omega_{3}})-W_{s}(\bar{\omega_{1}},\bar{\omega_{2}},y,\bar{\omega_{3}})
=1ϵ+Ws(ω1¯,y,ω21¯,x,ω22¯,ω3¯)Ws(ω1¯,ω21¯,x,ω22¯,y,ω3¯).\displaystyle={}1-\epsilon+W_{s}(\bar{\omega_{1}},y,\bar{\omega_{21}},x,\bar{\omega_{22}},\bar{\omega_{3}})-W_{s}(\bar{\omega_{1}},\bar{\omega_{21}},x,\bar{\omega_{22}},y,\bar{\omega_{3}}).

The above expression is a function of x and y of the form ax + by + cxy + d, i.e., it is affine in x for each fixed y and vice versa. To prove that it is nonnegative for all x, y ∈ [0,1], it therefore suffices to check that it is nonnegative for (x, y) ∈ {(0,0), (0,1), (1,0), (1,1)}.
If x = y = 0, then

1ϵ+Ws(ω1¯,0,ω21¯,0,ω22¯,ω3¯)Ws(ω1¯,ω21¯,0,ω22¯,0,ω3¯)\displaystyle~{}~{}~{}1-\epsilon+W_{s}(\bar{\omega_{1}},0,\bar{\omega_{21}},0,\bar{\omega_{22}},\bar{\omega_{3}})-W_{s}(\bar{\omega_{1}},\bar{\omega_{21}},0,\bar{\omega_{22}},0,\bar{\omega_{3}})
=1ϵ+β𝔼[Ws1(ω1¯,p01,τ(ω22¯),ω3¯,p01,ω4¯)\displaystyle={}1-\epsilon+\beta\mathbbm{E}[W_{s-1}(\bar{\omega_{1}}^{\prime},p_{01},\tau(\bar{\omega_{22}}),\bar{\omega_{3}}^{\prime},p_{01},\bar{\omega_{4}}^{\prime})
Ws1(ω1¯,τ(ω22¯),p01,ω3¯,ω4¯,p01)]\displaystyle\quad-W_{s-1}(\bar{\omega_{1}}^{\prime},\tau(\bar{\omega_{22}}),p_{01},\bar{\omega_{3}}^{\prime},\bar{\omega_{4}}^{\prime},p_{01})]
(1ϵ)(1β)+β𝔼[(1ϵ)\displaystyle\geq{}(1-\epsilon)(1-\beta)+\beta\mathbbm{E}[(1-\epsilon)
+Ws1(ω1¯,p01,τ(ω22¯),ω3¯,p01,ω4¯)Ws1(ω1¯,τ(ω22¯),ω3¯,p01,ω4¯,p01)]\displaystyle\quad+W_{s-1}(\bar{\omega_{1}}^{\prime},p_{01},\tau(\bar{\omega_{22}}),\bar{\omega_{3}}^{\prime},p_{01},\bar{\omega_{4}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},\tau(\bar{\omega_{22}}),\bar{\omega_{3}}^{\prime},p_{01},\bar{\omega_{4}}^{\prime},p_{01})]
0.\displaystyle\geq{}0.

If x = y = 1, then

1ϵ+Ws(ω1¯,1,ω21¯,1,ω22¯,ω3¯)Ws(ω1¯,ω21¯,1,ω22¯,1,ω3¯)\displaystyle~{}~{}~{}1-\epsilon+W_{s}(\bar{\omega_{1}},1,\bar{\omega_{21}},1,\bar{\omega_{22}},\bar{\omega_{3}})-W_{s}(\bar{\omega_{1}},\bar{\omega_{21}},1,\bar{\omega_{22}},1,\bar{\omega_{3}})
=1ϵ+β𝔼[Ws1(ω1¯,p11,ω2¯,p11,τ(ω22¯),ω3¯)\displaystyle={}1-\epsilon+\beta\mathbbm{E}[W_{s-1}(\bar{\omega_{1}}^{\prime},p_{11},\bar{\omega_{2}}^{\prime},p_{11},\tau(\bar{\omega_{22}}),\bar{\omega_{3}}^{\prime})
Ws1(ω1¯,ω2¯,p11,τ(ω22¯),p11,ω3¯)]\displaystyle\quad-W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},p_{11},\tau(\bar{\omega_{22}}),p_{11},\bar{\omega_{3}}^{\prime})]
(1ϵ)(1β)+β𝔼[(1ϵ)\displaystyle\geq{}(1-\epsilon)(1-\beta)+\beta\mathbbm{E}[(1-\epsilon)
+Ws1(ω1¯,ω2¯,p11,p11,τ(ω22¯),ω3¯)Ws1(ω1¯,ω2¯,p11,τ(ω22¯),p11,ω3¯)]\displaystyle\quad+W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},p_{11},p_{11},\tau(\bar{\omega_{22}}),\bar{\omega_{3}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{2}}^{\prime},p_{11},\tau(\bar{\omega_{22}}),p_{11},\bar{\omega_{3}}^{\prime})]
0.\displaystyle\geq{}0.

If x = 0, y = 1, then

1ϵ+Ws(ω1¯,1,ω21¯,0,ω22¯,ω3¯)Ws(ω1¯,ω21¯,0,ω22¯,1,ω3¯)\displaystyle~{}~{}~{}1-\epsilon+W_{s}(\bar{\omega_{1}},1,\bar{\omega_{21}},0,\bar{\omega_{22}},\bar{\omega_{3}})-W_{s}(\bar{\omega_{1}},\bar{\omega_{21}},0,\bar{\omega_{22}},1,\bar{\omega_{3}})
=2(1ϵ)+β𝔼[Ws1(ω1¯,p11,ω3¯,p01,τ(ω22¯),ω4¯)\displaystyle=2(1-\epsilon)+\beta\mathbbm{E}[W_{s-1}(\bar{\omega_{1}}^{\prime},p_{11},\bar{\omega_{3}}^{\prime},p_{01},\tau(\bar{\omega_{22}}),\bar{\omega_{4}}^{\prime})
Ws1(ω1¯,ω3¯,τ(ω22¯),p11,ω4¯,p01)]\displaystyle\quad-W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{3}}^{\prime},\tau(\bar{\omega_{22}}),p_{11},\bar{\omega_{4}}^{\prime},p_{01})]
2(1ϵ)(1β)+β𝔼[22ϵ\displaystyle\geq 2(1-\epsilon)(1-\beta)+\beta\mathbbm{E}[2-2\epsilon
+Ws1(ω1¯,ω3¯,p01,τ(ω22¯),p11,ω4¯)Ws1(ω1¯,ω3¯,τ(ω22¯),p11,ω4¯,p01)]\displaystyle\quad+W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{3}}^{\prime},p_{01},\tau(\bar{\omega_{22}}),p_{11},\bar{\omega_{4}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},\bar{\omega_{3}}^{\prime},\tau(\bar{\omega_{22}}),p_{11},\bar{\omega_{4}}^{\prime},p_{01})]
>0.\displaystyle>0.

If x = 1, y = 0, then

1ϵ+Ws(ω1¯,0,ω21¯,1,ω22¯,ω3¯)Ws(ω1¯,ω21¯,1,ω22¯,0,ω3¯)\displaystyle~{}~{}~{}1-\epsilon+W_{s}(\bar{\omega_{1}},0,\bar{\omega_{21}},1,\bar{\omega_{22}},\bar{\omega_{3}})-W_{s}(\bar{\omega_{1}},\bar{\omega_{21}},1,\bar{\omega_{22}},0,\bar{\omega_{3}})
=β𝔼[Ws1(ω1¯,p11,τ(ω22¯),ω3¯,p01,ω4¯)Ws1(ω1¯,p11,τ(ω22¯),p01,ω3¯,ω4¯)]\displaystyle=\beta\mathbbm{E}[W_{s-1}(\bar{\omega_{1}}^{\prime},p_{11},\tau(\bar{\omega_{22}}),\bar{\omega_{3}}^{\prime},p_{01},\bar{\omega_{4}}^{\prime})-W_{s-1}(\bar{\omega_{1}}^{\prime},p_{11},\tau(\bar{\omega_{22}}),p_{01},\bar{\omega_{3}}^{\prime},\bar{\omega_{4}}^{\prime})]
0.\displaystyle\geq 0.

Thus (A) is true. In fact, (B) shows that the myopic policy is optimal over finite horizons. By a contradiction argument, it is then easy to show that the myopic policy also maximizes the expected total discounted reward and the expected average reward over the infinite horizon. Furthermore, our proof does not depend on time-homogeneity of the system, so the optimality result holds even if the system parameters are time-varying, as long as p_{11}(t) ≥ p_{01}(t) and ϵ satisfies (70). ∎

5 Numerical Analysis and Conclusion

In this section, we illustrate the near-optimality and efficiency of our approximated Whittle index policy for non-homogeneous arms through simulation examples. After discussing these numerical results, we conclude the paper and propose several future research directions.

5.1 Numerical Examples

We will show that the 4-iteration approximation algorithm is sufficient to yield the same performance as the exact Whittle index policy, and we use it to plot the approximated Whittle index function W(ω) for the following parameters: in Fig. 3, p_{11} = 0.2, p_{01} = 0.9, β = 0.9, ϵ = 0.1, B = 1; in Fig. 4, p_{11} = 0.6, p_{01} = 0.3, β = 0.9, ϵ = 0.1, B = 1. Note that the monotonically increasing W(ω) indicates indexability numerically, while the nonlinearity of W(ω) illustrates its difference from the myopic policy (whose index ωB is linear in ω).

We now compare the performance of the Whittle index policy with the optimal policy, which is computed by dynamic programming over a finite horizon of length T. In other words, we recursively apply the following equation with terminal condition V_{1,0,β,m}(⋅) = 0:

V_{1,T,\beta,m}(\omega)=\max\{V_{1,T,\beta,m}(\omega;u=1);\ V_{1,T,\beta,m}(\omega;u=0)\}, (71)

where V_{1,T,β,m}(ω; u=1) and V_{1,T,β,m}(ω; u=0) are given by (25) and (26), respectively, in terms of V_{1,T−1,β,m}(⋅). Clearly, the number of reachable belief states grows exponentially with both the number of arms (as arms are not decoupled by any relaxation) and the time horizon T, due to the tree-expansion type of belief update given in (4). In contrast, our approximate Whittle index has a complexity linear in both T and the number of arms. In Fig. 5 and Fig. 6, we compare the actual running times of the optimal policy and our algorithm to illustrate the efficiency of the latter. Note that the optimal policy for any finite time horizon T provides an upper bound on the total discounted reward over T achieved by the infinite-horizon optimal policy, because one can always apply the infinite-horizon optimal policy to the finite-horizon problem up to time T. Hence, the near-optimality of our algorithm is well demonstrated by comparison with the finite-horizon optimal policy over the first T steps, as shown in Figures 7-22 (see Table 1 and Table 2 for the system parameters). Furthermore, all numerical experiments with randomly generated system parameters showed that setting the iteration number k = 4 is sufficient for the Whittle indices of all arms to converge, in the sense that their rank remains the same as k increases at each time step t. In other words, setting k = 4 gives the approximated Whittle index the same action path as the exact Whittle index, and hence the same performance. When k is smaller, the approximation error is larger and causes more performance loss, as shown in Fig. 23.
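Equations (25) and (26) are not repeated here; the sketch below assumes they are the finite-horizon counterparts of (66) and (67), i.e., the same one-step expressions with V_{1,T−1,β,m} in place of V_{β,m}, and implements the recursion (71) for a single process by memoized backward induction (function names are ours).

from functools import lru_cache

def finite_horizon_value(omega, T, p11, p01, eps, beta, m):
    # V_{1,T,beta,m}(omega) from (71) with terminal condition V_{1,0,beta,m}(.) = 0,
    # assuming the active/passive branches are the finite-horizon analogues of (66)-(67).
    @lru_cache(maxsize=None)
    def V(w, t):
        if t == 0:
            return 0.0
        tau = p11 * w + p01 * (1.0 - w)          # passive belief update T(w)
        phi = eps * w / (eps * w + 1.0 - w)      # belief after a '0' observation
        active = (1.0 - eps) * w + beta * (
            (1.0 - eps) * w * V(p11, t - 1)
            + (1.0 - (1.0 - eps) * w) * V(p11 * phi + p01 * (1.0 - phi), t - 1))
        passive = m + beta * V(tau, t - 1)
        return max(active, passive)
    return V(omega, T)

Even for a single process the number of distinct beliefs reached grows quickly with T, which illustrates the exponential blow-up noted above for the full N-arm benchmark.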

We also illustrate the performance of the myopic policy, which chooses the M arms with the largest ω_n B_n, for comparison. From Figures 7-22, we observe that Algorithm 1 outperforms the myopic policy. Interestingly, the Whittle index policy may suffer some performance loss in the middle of the horizon but eventually catches up with the optimal policy as time goes on. This is consistent with the conjecture that the Whittle index policy is asymptotically optimal as time goes to infinity, since the Lagrangian relaxation should not fundamentally alter the state and action paths of the optimal policy for the original problem, from the perspective of large deviation theory (Weber and Weiss, 1990). In contrast, the myopic policy is unable to follow the optimal action path and never catches up. Since the myopic policy only maximizes the immediate reward, its performance for T = 1 is optimal (and thus better than that of any other policy), because state transitions do not matter in this case. The myopic policy certainly has the lowest complexity, but this advantage is negligible given its significant performance loss and the efficiency of our algorithm relative to the optimal policy, as shown in Fig. 5 and Fig. 6.

5.2 Conclusion and Future Work

In this paper, we proposed a low-complexity algorithm based on value function approximations arising in the computation of the Whittle index policy for a class of RMABs with an infinite state space and an imperfect observation model. By exploring and exploiting the rich mathematical structure of this RMAB model, our algorithm is designed to be implemented online with a controlled approximation error, such that it becomes equivalent to the original Whittle index policy. Extensive numerical examples showed that our algorithm achieves near-optimal performance with a complexity linear in the key system parameters, such as the time horizon T and the number of arms. From Figures 7-22, we observe that in some instances the four-iteration policy is closer to optimality than in others. Unfortunately, it is still unknown how to theoretically quantify the performance gap of the Whittle index policy to optimality in the finite regime: only asymptotic results have been obtained for restless bandits with finite state spaces, as both the number of arms and the time horizon go to infinity under the time-average reward criterion, and under conditions verified only for bandits with 2 or 3 states (Weber and Weiss, 1990, 1991).

Future work includes the theoretical study of the performance loss of the Whittle index policy through more in-depth analysis of the value functions, with the goal of further improving our algorithm. Another research direction is to consider more complex system models, such as non-Markovian state models (see, e.g., Liu et al., 2011) or high-dimensional state models (see, e.g., Liu, 2021), and to study the convergence patterns of the state path in order to approximate the value functions. Future work also includes generalizing the construction of a finite set of linear equations for solving dynamic programming problems, e.g., the simplification of nonlinearity via the threshold-based first crossing time, the error-control process by backward induction, and further complexity reduction by minimizing the number of linear equations required.

Figure 3: Approximated W(ω) (p_{11} < p_{01})
Figure 4: Approximated W(ω) (p_{11} > p_{01})
Figure 5: The Complexity with T
Figure 6: The Complexity with N
Figure 7: Example-1
Figure 8: Example-2
Figure 9: Example-3
Figure 10: Example-4
Figure 11: Example-5
Figure 12: Example-6
Figure 13: Example-7
Figure 14: Example-8
Figure 15: Example-9
Figure 16: Example-10
Figure 17: Example-11
Figure 18: Example-12
Figure 19: Example-13
Figure 20: Example-14
Figure 21: Example-15
Figure 22: Example-16
Figure 23: The Performance Comparison for k
Table 1: Experiment Setting

System-1  {p_{11}^{(i)}}_{i=1}^{7} = {0.3, 0.6, 0.4, 0.7, 0.2, 0.6, 0.8}
          {p_{01}^{(i)}}_{i=1}^{7} = {0.1, 0.4, 0.3, 0.4, 0.1, 0.3, 0.5}
          {B_i}_{i=1}^{7} = {0.8800, 0.2200, 0.3300, 0.1930, 1.0000, 0.2558, 0.1549}
System-2  {p_{11}^{(i)}}_{i=1}^{7} = {0.6, 0.4, 0.2, 0.2, 0.4, 0.1, 0.3}
          {p_{01}^{(i)}}_{i=1}^{7} = {0.8, 0.6, 0.4, 0.9, 0.8, 0.6, 0.7}
          {B_i}_{i=1}^{7} = {0.5150, 0.6666, 1.0000, 0.6296, 0.5833, 0.8100, 0.6700}
System-3  {p_{11}^{(i)}}_{i=1}^{7} = {0.1, 0.4, 0.3, 0.4, 0.1, 0.3, 0.5}
          {p_{01}^{(i)}}_{i=1}^{7} = {0.3, 0.6, 0.4, 0.7, 0.2, 0.6, 0.8}
          {B_i}_{i=1}^{7} = {0.7273, 0.3636, 0.5000, 0.3377, 1.0000, 0.3939, 0.2955}
System-4  {p_{11}^{(i)}}_{i=1}^{7} = {0.6, 0.7, 0.2, 0.6, 0.4, 0.5, 0.3}
          {p_{01}^{(i)}}_{i=1}^{7} = {0.8, 0.4, 0.9, 0.5, 0.7, 0.2, 0.6}
          {B_i}_{i=1}^{7} = {0.4286, 0.5000, 0.5397, 0.5143, 0.5306, 1.0000, 0.6190}
Table 2: Experiment Setting (continued)

System     Example  ϵ    β      Meets threshold conditions?  Meets indexability conditions?
System-1   1        0.3  0.999  yes                          no
           2        0.1  0.999  yes                          no
System-2   3        0.3  0.29   yes                          yes
           4        0.1  0.29   yes                          yes
           5        0.3  0.48   no                           yes
           6        0.1  0.48   no                           yes
System-3   7        0.3  0.69   yes                          no
           8        0.1  0.69   yes                          no
           9        0.3  0.48   yes                          yes
           10       0.1  0.48   yes                          yes
System-4   11       0.3  0.29   yes                          yes
           12       0.1  0.29   yes                          yes
           13       0.3  0.48   no                           yes
           14       0.1  0.48   no                           yes
           15       0.3  0.999  no                           no
           16       0.1  0.999  no                           no

Declarations

  • Funding Not applicable

  • Conflict of interest/Competing interests Not applicable

  • Ethics approval Not applicable

  • Consent to participate Not applicable

  • Consent for publication Yes

  • Availability of data and materials Available upon request

  • Code availability Available upon request

  • Authors’ contributions Keqin Liu constructed the proof sketch for each theorem and the main algorithm and contributed to the writing of the paper. Richard Weber outlined the proof strategy for the optimality of the myopic policy in homogeneous systems and contributed to the verification and writing of the paper. Chengzhong Zhang filled out the details of the proofs and conducted the numerical simulations.

References

  • Brown and Smith (2020) Brown DB, Smith JE (2020) Index policies and performance bounds for dynamic selection problems. Manage Sci 66(7):3029–3050.
  • Chen et al. (2021) Chen M, Wu K, Song L (2021) A Whittle index approach to minimizing age of multi-packet information in IoT network. IEEE Access 9:31467–31480.
  • Gast et al. (2023) Gast N, Gaujal B, Khun K (2023) Testing indexability and computing Whittle and Gittins index in subcubic time. Math Meth Oper Res 97:391–436.
  • Gast et al. (2021) Gast N, Gaujal B, Yan C (2021, working paper) (Close to) Optimal policies for finite horizon restless bandits. https://hal.inria.fr/hal-03262307/file/LP_paper.pdf.
  • Gittins (1979) Gittins JC (1979) Bandit processes and dynamic allocation indices. J R Stat Soc 41(2):148–177.
  • Gittins et al. (2011) Gittins JC, Glazebrook KD, Weber RR (2011) Multi-Armed Bandit Allocation Indices. Wiley, Chichester.
  • Hu and Frazier (2017) Hu W, Frazier PI (2017, working paper) An asymptotically optimal index policy for finite-horizon restless bandits. https://arxiv.org/abs/1707.00205.
  • Heinonen (2005) Heinonen J (2005) Lectures on Lipschitz analysis. www.math.jyu.fi/research/reports/rep100.pdf.
  • Kesav et al. (2019) Kesav K, Rahul M, Varun M, Shabbir NM (2019) Sequential decision making with limited observation capability: Application to wireless networks. IEEE Trans Cognit Commun Networking 5(2):237–251.
  • Levy (2008) Levy BC (2008) Principles of Signal Detection and Parameter Estimation. Springer-Verlag.
  • Liu (2021) Liu K (2021) Index policy for a class of partially observable Markov decision processes. https://arxiv.org/abs/2107.11939.
  • Liu et al. (2011) Liu K, Weber RR, Zhao Q (2011) Indexability and Whittle index for restless bandit problems involving reset processes. Proc. of the 50th IEEE Conference on Decision and Control 7690–7696.
  • Liu and Zhao (2010) Liu K, Zhao Q (2010) Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Trans Inf Theory 56(11):5547–5567.
  • Liu et al. (2010) Liu K, Zhao Q, Krishnamachari B (2010) Dynamic multichannel access with imperfect channel state detection. IEEE Trans Signal Process 2795–2808.
  • Papadimitriou and Tsitsiklis (1999) Papadimitriou CH, Tsitsiklis JN (1999) The complexity of optimal queueing network control. Math Oper Res 24(2):293–305.
  • Rahul et al. (2018) Rahul M, Manjunath D, Aditya G (2018) On the Whittle index for restless multiarmed hidden Markov bandits. IEEE Trans Autom Control 63(9):3046–3053.
  • Sondik (1978) Sondik EJ (1978) The optimal control of partially observable Markov processes over the infinite horizon: discounted costs. Oper Res 26(2):282–304.
  • Varun et al. (2018) Varun M, Rahul M, Kesav K, Shabbir NM, Uday BD (2018) Rested and restless bandits with constrained arms and hidden states: Applications in social networks and 5G networks. IEEE Access 6:56782–56799.
  • Wang et al. (2018) Wang K, Chen L, Yu J, Win M (2018) Opportunistic multichannel access with imperfect observation: A fixed point analysis on indexability and index-based policy. Proceedings of IEEE INFOCOM.
  • Weber and Weiss (1990) Weber RR, Weiss G (1990) On an index policy for restless bandits. J Appl Probab 27:637–648.
  • Weber and Weiss (1991) Weber RR, Weiss G (1991) Addendum to ‘On an index policy for restless bandits’. Adv Appl Prob 23:429–430.
  • Whittle (1988) Whittle P (1988) Restless bandits: Activity allocation in a changing world. J Appl Probab 25:287–298.
  • Zayas-Cabán et al. (2019) Zayas-Cabán G, Jasin S, Wang G (2019) An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits. Adv Appl Probab 51:745–772.