
Thompson Sampling for Parameterized Markov Decision Processes with Uninformative Actions

Michael Gimelfarb and Michael Jong Kim Michael Gimelfarb is with the Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Ontario M5S 3G8, Canada. (e-mail: [email protected]). Michael Jong Kim was with the Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Ontario M5S 3G8, Canada. He is now with the Sauder School of Business, University of British Columbia, Vancouver, British Columbia V6T 1Z2, Canada. (e-mail: [email protected]).
Abstract

We study parameterized MDPs (PMDPs) in which the key parameters of interest are unknown and must be learned using Bayesian inference. A key defining feature of such models is the presence of “uninformative” actions that provide no information about the unknown parameters. We contribute a set of assumptions for PMDPs under which Thompson sampling guarantees an asymptotically optimal expected regret bound of $O(T^{-1})$, and which are easily verified for many classes of problems such as queuing, inventory control, and dynamic pricing.

Index Terms:
Bayesian Inference, Exploration–Exploitation, Markov Decision Process, Parameter Uncertainty, Regret Bound, Thompson Sampling

I Introduction

Parameterized MDPs (PMDPs) are dynamic control problems described by parameters of interest whose values are typically unknown, arising in many applications such as queuing and supply chain control, maintenance, and network design. Such problems can be naturally formulated as POMDPs, in which a learned posterior distribution over the model parameters is incorporated directly into the system state [1, 3, 5]. However, the corresponding dynamic programming equations are typically computationally or analytically intractable due to the so-called “curse of dimensionality”. To remedy this, a variety of approaches have been proposed for approximately optimal control, including myopic and lookahead policies [16], least-squares methods [7, 19], non-parametric approaches [10, 8], and reinforcement learning [12].

One particular approach, which we investigate in this work, is Thompson sampling (TS) [27, 26], which samples a parameter according to the current posterior distribution at each time step, and then solves the underlying MDP assuming that the sampled parameter is correct. The decoupling of the posterior update and sampling phase from the optimization phase makes Thompson sampling a computationally and analytically tractable alternative for solving general PMDPs, and it has been applied in a variety of problem settings, notably multi-armed bandit problems [2, 25]. However, these models can be classified as purely Bayesian optimization problems since, when stripped of all parameter uncertainty, they are fundamentally single-stage optimization problems. On the other hand, much less is known about the theoretical performance of Thompson sampling for stochastic control problems with parameter uncertainty.

Our work builds upon the stream of literature on Thompson sampling for PMDPs [20, 6, 24, 15, 13]. Specifically, the closest work to ours is [20], which showed that Thompson sampling is asymptotically optimal assuming each action taken reveals some additional information about the unknown parameter(s). On the other hand, many important PMDPs contain “uninformative” actions that do not reveal information about the parameters: e.g., setting lower inventory levels in an inventory control problem may lead to stock-outs [18], preventative maintenance provides less statistical information about failures [21], and so-called “uninformative prices” arise in dynamic pricing problems [1]. To this end, Gopalan and Mannor [15] showed – under rather general assumptions – that the number of instants where sub-optimal actions are chosen scales logarithmically with time, with high probability. In contrast, this paper provides a different set of assumptions and analysis, and contributes more precise regret and learning rate bounds, holding in expectation, for an important class of PMDPs with uninformative actions. Our work contributes more general assumptions and regret analysis than found in Kim [20], extending the previous optimal regret guarantees to a broader class of problems with uninformative actions that are of practical interest.

The remainder of the paper is structured as follows. Section II defines the Thompson sampling algorithm in the context of PMDPs. Section III provides a formal statement of the notion of an “uninformative” action, a set of general assumptions on the problem structure, and corresponding optimal regret and learning rate bounds for Thompson sampling. These results are proved in Section IV, and empirically validated on three important classes of problems, namely admission control, inventory control and dynamic pricing.

II Preliminaries

II-A Parameterized Markov Decision Process

Decision-making in this paper is modeled as a parameterized Markov decision process (PMDP), which consists of a finite set of possible states $\mathcal{S}$, a finite set of possible actions or controls $\mathcal{A}$, and a finite set of parameters $\Theta$ (e.g. hypotheses) that specify both the reward and the state distribution. Specifically, given a control $a_{t}\in\mathcal{A}$ applied at time $t$ in state $s_{t}\in\mathcal{S}$, and a parameter $\theta\in\Theta$, the reward and the next state are sampled according to

S_{t+1}\sim p_{\theta}(\cdot\,|\,s_{t},a_{t}),\quad R_{t}\sim f_{\theta}(\cdot\,|\,s_{t},a_{t}).

We also define the reward function as $r_{\theta}(s,a)\coloneqq\mathbb{E}_{\theta}[R\,|\,s,a]=\int r\,f_{\theta}(r\,|\,s,a)\,\mathrm{d}r$, which is assumed to be uniformly bounded on $\mathcal{S}\times\mathcal{A}$.

An admissible policy $\mu$ is defined as a sequence of mappings $\{\mu_{t} : t\geq 0\}$, where $\mu_{t}:\mathscr{H}_{t}\mapsto\mathscr{P}(\mathcal{A})$ maps the history of information observed up to time $t$, $\mathscr{H}_{t}=\{S_{0},A_{0},R_{0},\dots,S_{t-1},A_{t-1},R_{t-1},S_{t}\}$, to a probability distribution over the action space; let $\Pi$ denote the set of all admissible policies. On the other hand, a Markov policy is a sequence of mappings $\mu_{t}:\mathcal{S}\to\mathcal{A}$; it is further a stationary policy if $\mu_{0}=\mu_{1}=\dots$. Let $\Pi_{S}$ denote the set of all stationary policies. Given a fixed parameter $\theta\in\Theta$, the goal is to maximize, over all $\mu\in\Pi$, the long-run expected average reward starting from an initial state $S_{0}=s_{0}$, given as

J_{\theta}^{\mu}(s_{0})\coloneqq\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}_{\theta}\left[\sum_{t=0}^{T-1}r_{\theta}(S_{t},A_{t})\right], (1)

in which the expectation is computed with respect to $p_{\theta}$ and the $A_{t}$ are sampled according to $\mu$. We assume that an optimal policy $\mu_{\theta}^{*}$ exists, which must necessarily be a stationary Markov policy. Furthermore, the state process $\mathscr{S}_{\theta}=\{S_{t} : t\geq 0\}$ induced by $\mu_{\theta}^{*}$ constitutes a time-homogeneous Markov chain, and we assume that

Assumption 1.

The state process $\mathscr{S}_{\theta}=\{S_{t} : t\geq 0\}$ induced by $\mu_{\theta}^{*}$ is an ergodic unichain.

Note that our assumption is weaker than the ergodicity required in [20], although the main theoretical results therein also go through under this weaker assumption [17].

II-B Thompson Sampling

The decision maker accounts for uncertainty in $\theta\in\Theta$ by modeling it as a random variable $\Theta_{t}$ in each decision epoch. Starting with prior distribution $\pi_{0}(\theta)\coloneqq\mathbb{P}(\Theta_{0}=\theta\,|\,\mathscr{H}_{0})$, and given history $\mathscr{H}_{t}$, the decision maker updates his belief about $\theta$ by computing the posterior distribution using Bayes’ theorem:

\pi_{t}(\theta)\coloneqq\mathbb{P}(\Theta_{t}=\theta\,|\,\mathscr{H}_{t})=\frac{\mathcal{L}_{\theta}(\mathscr{H}_{t})\pi_{0}(\theta)}{\sum_{\vartheta\in\Theta}\mathcal{L}_{\vartheta}(\mathscr{H}_{t})\pi_{0}(\vartheta)}, (2)

where

\mathcal{L}_{\theta}(\mathscr{H}_{t})\coloneqq\prod_{i=1}^{t}f_{\theta}(R_{i-1}\,|\,S_{i-1},A_{i-1})\,p_{\theta}(S_{i}\,|\,S_{i-1},A_{i-1})

is the likelihood. Note that $\pi_{t}(\theta)$ is a function of $\mathscr{H}_{t}$ and is therefore a random variable.

The agent selects controls that are consistent with the learned belief $\pi_{t}$ by following the Thompson sampling algorithm. Formally, this corresponds to an admissible (i.e., history-dependent) policy $\tau$ defined in Algorithm 1.

Algorithm 1 Thompson Sampling for PMDPs
  Initialize $S_{0}\leftarrow s_{0}$ and $\pi_{0}$
  for $t=0$ to $T$ do
     Sample $\Theta_{t}\sim\pi_{t}(\cdot)$
     Solve MDP (1) with parameter $\Theta_{t}$ to obtain $\mu_{\Theta_{t}}^{*}$
     Apply control $A_{t}\leftarrow\mu_{\Theta_{t}}^{*}(S_{t})$
     Observe $S_{t+1}$ and $R_{t+1}$
     for $\theta\in\Theta$ do
        $\pi_{t+1}(\theta)\leftarrow\frac{f_{\theta}(R_{t+1}\,|\,S_{t},A_{t})\,p_{\theta}(S_{t+1}\,|\,S_{t},A_{t})\,\pi_{t}(\theta)}{\sum_{\vartheta\in\Theta}f_{\vartheta}(R_{t+1}\,|\,S_{t},A_{t})\,p_{\vartheta}(S_{t+1}\,|\,S_{t},A_{t})\,\pi_{t}(\vartheta)}$
     end for
  end for
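
To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1 for a finite parameter set. The functions env_step, solve_mdp, and likelihood are hypothetical placeholders for the (unknown) system, the per-parameter MDP solver, and the one-step likelihood $f_{\theta}(r|s,a)\,p_{\theta}(s'|s,a)$, respectively; in practice the policies $\mu_{\theta}^{*}$ would be computed once and cached rather than re-solved each epoch.

```python
import numpy as np

def thompson_sampling(env_step, solve_mdp, likelihood, thetas, prior, s0, T, rng=None):
    """Sketch of Algorithm 1 for a finite parameter set Theta.

    env_step(s, a)                     -> (s_next, r), sampled from the true (unknown) PMDP
    solve_mdp(theta)                   -> stationary policy mu_theta_star: state -> action
    likelihood(theta, s, a, r, s_next) -> f_theta(r|s,a) * p_theta(s_next|s,a)
    thetas                             -> list of candidate parameters
    prior                              -> array of prior probabilities pi_0(theta)
    """
    rng = rng or np.random.default_rng()
    pi = np.asarray(prior, dtype=float)
    s = s0
    for t in range(T):
        # Sample a parameter from the current posterior and act greedily w.r.t. it.
        theta_t = thetas[rng.choice(len(thetas), p=pi)]
        a = solve_mdp(theta_t)(s)
        s_next, r = env_step(s, a)
        # Bayes update: multiply by the one-step likelihood and renormalize.
        pi = pi * np.array([likelihood(th, s, a, r, s_next) for th in thetas])
        pi /= pi.sum()
        s = s_next
    return pi
```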

Intuitively, the posterior update (2) should only improve the decision maker’s belief about $\theta$ when $f_{\theta}(\cdot\,|\,s,a)\,p_{\theta}(\cdot\,|\,s,a)$ is distinguishable for different values of $\theta$. To see this, we can restate (2) as

\pi_{t}(\theta)=\left[1+\sum_{\vartheta\not=\theta}\frac{\pi_{0}(\vartheta)}{\pi_{0}(\theta)}\left(\frac{\mathcal{L}_{\vartheta}(\mathscr{H}_{t})}{\mathcal{L}_{\theta}(\mathscr{H}_{t})}\right)\right]^{-1},

from which we can see that $\pi_{t+1}(\theta)=\pi_{t}(\theta)$ precisely when the ratio of the two likelihoods does not change. Conceptually, the existence of such uninformative state-action pairs $(s,a)$ can significantly “stall” the posterior updates and prevent the decision maker from learning the correct PMDP parameters. On the other hand, recent analysis of posterior convergence [20] in general PMDPs precludes such scenarios by imposing rather strong assumptions on the problem structure that do not always hold in practice. In the following section, we will formalize the notion of an uninformative action and provide an alternative and complementary set of assumptions that still provide an asymptotically optimal regret bound.
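
As a toy numerical illustration of this stalling effect (our own example, not taken from the paper), consider two hypotheses for an arrival probability: under an uninformative action the one-step likelihood is the same for both hypotheses, so the posterior is unchanged, while an informative observation shifts it.

```python
import numpy as np

# Two hypotheses for an arrival probability theta; Bernoulli observation model.
thetas = [0.3, 0.7]
pi = np.array([0.5, 0.5])

def update(pi, likelihoods):
    pi = pi * np.asarray(likelihoods)
    return pi / pi.sum()

# Uninformative action: the observation distribution does not depend on theta
# (e.g. a closed server hides the arrival), so the likelihood ratio is 1 and
# the posterior is unchanged.
print(update(pi, [0.5, 0.5]))   # -> [0.5 0.5]

# Informative action: an arrival is observed, with probability theta under each
# hypothesis, so the posterior shifts toward the larger theta.
print(update(pi, [0.3, 0.7]))   # -> [0.3 0.7]
```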

III Thompson Sampling for PMDPs with Uninformative Actions

III-A Uninformative Actions

In order to assess the convergence speed of $\pi_{t}$ more precisely, it is necessary to quantify the magnitude of the posterior update. In particular, we define $\nu_{\theta}(s,a)$ as the joint probability distribution specified by $f_{\theta}(\cdot\,|\,s,a)\,p_{\theta}(\cdot\,|\,s,a)$ and the KL-divergence

\mathcal{D}\left(\nu_{\theta}(s,a)\,\|\,\nu_{\vartheta}(s,a)\right)\coloneqq\mathbb{E}_{\theta}\left[\log\left(\frac{f_{\theta}(\cdot\,|\,s,a)\,p_{\theta}(\cdot\,|\,s,a)}{f_{\vartheta}(\cdot\,|\,s,a)\,p_{\vartheta}(\cdot\,|\,s,a)}\right)\right].

We require that $\nu_{\vartheta}(s,a)$ is absolutely continuous with respect to $\nu_{\theta}(s,a)$, so that the above quantity is finite.

Formalizing the above intuition, an action $a\in\mathcal{A}$ is called informative in state $s\in\mathcal{S}$ if, for any two distinct $\theta,\vartheta\in\Theta$,

\mathcal{D}\left(\nu_{\theta}(s,a)\,\|\,\nu_{\vartheta}(s,a)\right)>0.
Assumption 2.

There exists $s^{*}\in\mathcal{S}$ such that $\mu_{\theta}^{*}(s^{*})$ is an informative action in state $s^{*}$ for all $\theta\in\Theta$.

Note that Assumption 2 is easy to check in practice since, in the context of Algorithm 1, the policies $\mu_{\theta}^{*}$ are typically computed in advance and cached. Furthermore, it naturally partitions the state-action space into two classes. The first class $\Sigma\subset\mathcal{S}\times\mathcal{A}$ consists of state-action pairs $(s,a)$ for which $a$ is informative in $s$, and is non-empty since it contains $(s^{*},\mu_{\theta}^{*}(s^{*}))$ for every $\theta\in\Theta$. The second class comprises those state-action pairs for which $a$ is not informative in $s$. Thus, Assumption 2 is much weaker than the one required in [20], where it is simply assumed that $\Sigma^{\mathsf{C}}=\emptyset$.
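
A minimal sketch of such a check is given below, assuming finite reward/next-state supports so that $\nu_{\theta}(s,a)$ can be represented as a flat probability vector; the names nu and policies are hypothetical placeholders for the model and the cached policies $\mu_{\theta}^{*}$.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def is_informative(nu, s, a, thetas, tol=1e-9):
    """nu(theta, s, a) -> joint pmf of (reward, next state) as a flat array."""
    return all(kl(nu(th1, s, a), nu(th2, s, a)) > tol
               for th1 in thetas for th2 in thetas if th1 != th2)

def check_assumption_2(nu, policies, states, thetas):
    """Return the candidate states s* at which mu*_theta(s*) is informative for every theta."""
    return [s for s in states
            if all(is_informative(nu, s, policies[th](s), thetas) for th in thetas)]
```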

Finally, to ensure that the posterior belief converges at the optimal asymptotic rate, it is necessary that the Markov chain induced by Thompson sampling eventually visits an informative state a positive fraction of the time. One way to guarantee this is to ensure that the induced Markov chain under Thompson sampling mixes with respect to the special state $s^{*}$, which is known to be informative by Assumption 2. More formally, define $\mathcal{I}(s)\subseteq\mathcal{A}$ as the set of controls that are informative in state $s$, and the following random variables for every $\mu\in\Pi$ and $s\in\mathcal{S}$:

I^{\mu}(t)\coloneqq\sum_{i=0}^{t-1}\mathbbm{1}\{A_{i}\in\mathcal{I}(S_{i})\},\quad I^{\mu}(t;s)\coloneqq\sum_{i=0}^{t-1}\mathbbm{1}\{S_{i}=s\}.

Finally, for each $s\in\mathcal{S}$, we define $\mathcal{A}^{*}(s)\coloneqq\{a\in\mathcal{A} : a=\mu_{\theta}^{*}(s)\mbox{ for some }\theta\in\Theta\}$. We then define $\Pi_{C}\subset\Pi$ to be the set of policies $\mu$ for which $\mu(s)\in\mathcal{A}^{*}(s)$ holds for all $s\in\mathcal{S}$, i.e., the set of policies whose controls are consistent with the policy set $\{\mu_{\theta}^{*} : \theta\in\Theta\}$.

Assumption 3.

There exists a policy $\bar{\mu}\in\Pi_{S}$ such that the Markov chain induced by $\bar{\mu}$ is ergodic, and $I^{\mu}(t;s^{*})\geq I^{\bar{\mu}}(t;s^{*})$ holds (with probability 1) for all $t$ and $\mu\in\Pi_{C}$.

Figure 1: Conceptual illustration of Assumptions 2 and 3.

One way to interpret Assumptions 2 and 3 is to define another policy class $\Pi_{*}\subseteq\Pi$, consisting of all policies $\mu\in\Pi$ for which $\mathbb{P}(\mu(s^{*})\in\mathcal{I}(s^{*}))=1$. Clearly, $\Pi_{C}\subset\Pi_{*}$ according to Assumption 2, and $\mu_{\theta}^{*}\in\Pi_{C}$ by construction. Furthermore, by the design of Thompson sampling (Algorithm 1), we can also conclude that $\tau\in\Pi_{C}$ (although $\tau\not\in\Pi_{S}$). Finally, Assumption 3 requires that $I^{\mu}(t;s^{*})\geq I^{\bar{\mu}}(t;s^{*})$ for all $\mu\in\Pi_{C}$, including $\tau$. These facts are summarized conceptually in Fig. 1. Note that $\bar{\mu}$ can be any policy in the shaded region for which the induced Markov chain is ergodic. As illustrated later in our examples, this policy also often lies in $\Pi_{*}$, but this is not required under our assumptions in general.

In the next section, we will prove that Assumptions 1-3 are sufficient to achieve asymptotically optimal regret bounds for Thompson sampling with uninformative actions. However, these assumptions are not in general necessary. To see this, suppose Assumption 3 is relaxed by defining a secondary state $s^{\prime}\in\mathcal{S}$ instead of $s^{*}$, which need not satisfy Assumption 2. Now, if we further assume that there is a non-zero probability of visiting $s^{*}$ between successive visits to $s^{\prime}$ under any policy in $\Pi_{C}$, then states $s^{\prime}$ and $s^{*}$ belong to the same communicating class, and the same convergence guarantees also extend to this setting. This generalization would allow us to establish convergence guarantees for learning the service rate parameters in the examples that follow.

We now provide several important classes of stochastic control problems that satisfy the required assumptions in this paper. Please note, however, that these examples do not satisfy the stronger assumption on the KL-divergence in [20], and thus the analysis in that paper cannot be applied to them.

Example 1 (Admission Control).

At the beginning of each decision epoch, a decision maker decides whether to open or close a server with a fixed capacity $\bar{n}>0$ (no backlogging of orders is allowed). A single customer arrives in each period with unknown probability $\theta\in(0,1)$. If the server is open, the customer pays a toll $R\geq 0$ and joins the end of the queue, while if the server is closed, the customer arrival is unobserved. For each customer waiting in the server in each epoch, the decision maker incurs a penalty of $h\geq 0$. Meanwhile, at the end of each period, the first customer in queue completes service and leaves with probability $\beta\in(0,1)$. Opening (closing) the server corresponds to an informative (uninformative) action. It is also easy to check that, if $\beta R\geq h$ (see, e.g., [23]), the optimal policy satisfies $\mu_{\theta}^{*}(0)=\mathrm{``open"}$ for every $\theta\in(0,1)$, i.e., an empty server should be open, and we set $s^{*}=0$. Finally, let $\bar{\mu}$ be the policy that always admits a customer unless the server is full. The Markov chain induced by $\bar{\mu}$ is ergodic, and Assumption 3 holds for any admissible policy $\mu$ that admits when the server is empty. This example can also be generalized to multiple servers, multiple customer types, and general demand or service distributions. $\Box$
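
The following is a minimal Python sketch of one plausible transition/reward model for this example (our own formalization, with the convention that the holding cost is charged on the pre-service queue length); it is not taken from the paper and omits the observation model that makes the closed action uninformative.

```python
def admission_control_kernel(theta, beta, R, h, n_bar):
    """One-step transition and expected-reward model for Example 1 under theta.

    State s is the queue length; action a=1 opens the server and a=0 closes it.
    Returns dicts p[a][s] (next-state pmf) and r[a][s] (expected one-step reward).
    """
    p = {0: {}, 1: {}}
    r = {0: {}, 1: {}}
    for s in range(n_bar + 1):
        for a in (0, 1):
            admit = (a == 1) and (s < n_bar)   # an arrival joins only if open and not full
            probs = {}
            for arrives, pa in ((1, theta), (0, 1.0 - theta)):
                n = s + (1 if (arrives and admit) else 0)
                for served, ps in ((1, beta), (0, 1.0 - beta)):  # end-of-period service
                    n_next = max(n - served, 0)
                    probs[n_next] = probs.get(n_next, 0.0) + pa * ps
            p[a][s] = probs
            r[a][s] = (theta * R if admit else 0.0) - h * s      # expected toll minus holding cost
    return p, r
```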

Example 2 (Inventory Management).

A store sells $m$ different types of goods. At the beginning of each decision epoch $t$, the store manager observes the amount of each type of good in stock, $s_{t,i}$, and decides whether or not to fully restock the inventory of each type of good $i$ up to a level $\bar{n}_{i}>0$. The delivery time is negligible compared to the length of each decision epoch, and so goods are delivered instantaneously. The wholesale price of a good of type $i$ is $c_{i}>0$ and it is sold for $p_{i}>c_{i}$. Meanwhile, the cost of holding a good of type $i$ is $h_{i}>0$ per item per period. Furthermore, the demand for each good $i$ is modeled as a Poisson random variable with mean $\theta_{i}>0$, and stock-outs (i.e., demand exceeding the inventory) are unobserved. Since the manager profits from selling every type of good, it is always optimal to reorder inventory when a particular item is out of stock, and so $\mu_{\theta}^{*}(0,0,\dots,0)$ requires that all types of goods are restocked. In this case, the inventory level is guaranteed to be positive at the beginning of each epoch and the demand for each type of good is observed. Finally, let $\bar{\mu}$ be the policy that chooses to restock every type of good in every decision epoch regardless of the inventory level. Since demand is unbounded, the induced Markov chain is clearly ergodic, and Assumption 3 holds for all admissible policies $\mu$ that reorder in state $s^{*}=(0,0,\dots,0)$. $\Box$
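
A minimal simulation sketch of a single-good version of this model (our own formalization; the exact timing of the holding cost is an assumption), illustrating that sales, rather than demand, are what the manager observes when a stock-out occurs:

```python
import numpy as np

def inventory_step(s, restock, theta, n_bar, c, p, h, rng):
    """One period of a single-good version of Example 2, as a simulation sketch.

    s is the opening stock; restocking (instantaneously) raises it to n_bar.
    Demand is Poisson(theta); sales are censored at the available stock, so a
    stock-out hides the excess demand (the uninformative case).
    """
    level = n_bar if restock else s
    order_cost = c * (n_bar - s) if restock else 0.0
    demand = rng.poisson(theta)
    sales = min(demand, level)            # observed quantity; demand itself may be censored
    s_next = level - sales
    reward = p * sales - order_cost - h * s_next
    return s_next, reward, sales

rng = np.random.default_rng(0)
print(inventory_step(0, True, theta=4.0, n_bar=30, c=2.0, p=2.8, h=0.01, rng=rng))
```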

Example 3 (Dynamic Pricing).

Customer demand in each period follows a Poisson distribution with parameter $\theta>0$. When customers arrive, they observe the firm’s posted price $p>0$ and the number of customers already in queue $n\geq 0$, and decide whether to join the queue or leave. Given that the value of the service to the customer is $V>0$, the waiting cost per epoch is $c>0$, and the service time per customer is geometric with parameter $\beta\in(0,1)$, the customer will only join the queue if $V-\frac{c}{\beta}(n+1)\geq p$ (see, e.g., [9, 11]). At the beginning of each decision epoch, the firm fixes a price from the set $\{p_{1},p_{2},\dots,p_{m}\}$, where without loss of generality we may assume $0<p_{1}<p_{2}<\dots<p_{m}<\infty$. We also assume that $p_{m}>V-\frac{c}{\beta}$ so the firm can choose to reject customers, and that $p_{1}\leq V-\frac{c}{\beta}$ so the problem is non-trivial. Clearly, the queue has an effective capacity, given by $\bar{n}=\min\{n : n\geq 0,\, V-\frac{c}{\beta}(n+1)<p_{1}\}$. Similar to the admission control problem, the cost incurred by the firm is $h>0$ per customer in queue per period. Finally, let $s^{*}=0$ and consider the policy $\bar{\mu}$ that always sets the lowest price $p_{1}$. The induced Markov chain is ergodic, and Assumption 3 holds for all admissible policies $\mu$ that fix an attractive price when the queue is empty. $\Box$
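
A small sketch of the customer join rule and the resulting effective capacity $\bar{n}$; the printed example uses $V=4$, $c=0.05$, $\beta=0.5$ together with the illustrative choice $p_{1}=1$ (the latter is our assumption, not a value from the paper).

```python
def joins(V, c, beta, n, p):
    """A customer joins iff the net value of service exceeds the posted price."""
    return V - (c / beta) * (n + 1) >= p

def effective_capacity(V, c, beta, p1):
    """Smallest queue length at which even the lowest price p1 deters all customers."""
    n = 0
    while V - (c / beta) * (n + 1) >= p1:
        n += 1
    return n

print(joins(4, 0.05, 0.5, 10, 2))            # -> True (net value 2.9 exceeds price 2)
print(effective_capacity(4, 0.05, 0.5, 1))   # -> 30
```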

III-B Asymptotically Optimal Convergence Rates

The main theoretical result of this paper establishes that Thompson sampling with uninformative actions achieves asymptotically optimal learning rates (as measured by expected posterior probability of sampling the incorrect parameter) and regret. Here, we sketch out the main reasoning necessary to establish both results, and provide formal proofs in the following sections.

First, it is necessary to understand how fast the time-homogeneous Markov chain induced by the policy $\bar{\mu}$ mixes. In the following section, it will be shown that this Markov chain mixes at an exponential (i.e., geometric) rate, and so Assumption 3 ensures that the special state $s^{*}$ will be visited at a linear rate asymptotically.

Proposition 1.

For $\eta>0$ sufficiently small, there exist $\delta_{\theta},\lambda_{\theta}>0$ and $N_{\theta}\in\mathbb{N}$ such that

\mathbb{P}_{\theta}\left(I^{\tau}(t)\geq\eta t\right)\geq 1-\delta_{\theta}\exp\left(-\lambda_{\theta}t\right),\quad\forall t>N_{\theta}.

By making use of Proposition 1, we then prove that the posterior error converges to zero exponentially fast under our more general assumptions.

Theorem 2.

Under Assumptions 1-3, and assuming $\pi_{0}(\theta)>0$ for all $\theta$, there exist constants $a_{\theta},b_{\theta}>0$ such that under Thompson sampling,

\mathbb{E}_{\theta}\left[1-\pi_{t}(\theta)\right]\leq a_{\theta}\exp\left(-b_{\theta}t\right),\quad\forall t\geq 0.

The following regret result follows immediately from Theorem 2 above, Assumption 1, and Theorem 5 in [20].

Theorem 3 ([20]).

Under Assumptions 1-3, Thompson sampling achieves a worst-case average regret of $O(T^{-1})$.

IV Proofs of Theoretical Results

IV-A Proof of Proposition 1

We first cite the following result.

Proposition 4 ([14]).

Let $\mathscr{X}=\{X_{t} : t\geq 0\}$ be a Markov chain taking values in $\mathcal{X}$. Suppose there exists a probability measure $\varphi$ on $\mathcal{X}$, $\lambda>0$, and an integer $m\geq 1$ such that

\mathbb{P}\left(X_{m}\in\cdot\,|\,X_{0}=x\right)\geq\lambda\varphi(\cdot),\quad\forall x\in\mathcal{X}. (3)

Let $f:\mathcal{X}\to\mathbb{R}$, define $Y_{i}\coloneqq f(X_{i})$ and $C_{t}\coloneqq\sum_{i=0}^{t-1}Y_{i}$, and suppose that $\|f\|=\sup\{|f(x)| : x\in\mathcal{X}\}<\infty$. Then,

\mathbb{P}\left(C_{t}-\mathbb{E}\left[C_{t}\right]\geq t\epsilon\right)\leq\exp\left(-\frac{\lambda^{2}(t\epsilon-2\|f\|m/\lambda)^{2}}{2t\|f\|^{2}m^{2}}\right)

for $t>2\|f\|m/(\lambda\epsilon)$.

Condition (3) was introduced earlier in [4], in which the authors provide sufficient conditions under which it holds for general Markov chains. In particular, if $\mathcal{X}$ is discrete, then (3) holds if the $m$-step transition matrix of $\mathscr{X}$ has a column whose elements are uniformly bounded away from zero. Furthermore, if $\mathcal{X}$ is finite, then (3) holds if $\mathscr{X}$ is aperiodic and irreducible.
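
For a finite state space, condition (3) can be checked numerically from the $m$-step transition matrix by taking column-wise minima, as in the following sketch (a standard construction, not specific to [4] or [14]):

```python
import numpy as np

def minorization(P, m):
    """Check condition (3) for a finite-state chain with transition matrix P.

    Returns (lam, phi) with P^m(x, .) >= lam * phi(.) for all x, where phi is a
    probability vector, or (0, None) if every column of P^m contains a zero.
    """
    Pm = np.linalg.matrix_power(np.asarray(P, dtype=float), m)
    col_mins = Pm.min(axis=0)          # smallest entry in each column over all rows
    lam = col_mins.sum()
    if lam <= 0.0:
        return 0.0, None
    return lam, col_mins / lam         # phi puts mass proportional to the column minima

# Example: a two-state chain satisfies (3) with m=1 and lam = 0.4 + 0.3 = 0.7.
P = [[0.7, 0.3], [0.4, 0.6]]
print(minorization(P, 1))
```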

Let $\mathscr{S}^{\bar{\mu}}=\{S_{t}^{\bar{\mu}} : t\geq 0\}$ denote the ergodic Markov chain induced by the policy $\bar{\mu}$ specified in Assumption 3. First, Assumption 2 states that $\mu_{\theta}^{*}(s^{*})$ is an informative action for every $\theta\in\Theta$. Since Thompson sampling (randomly) selects one of the optimal policies $\mu_{\Theta_{t}}^{*}$ in each decision epoch, actions selected by Thompson sampling upon each visit to state $s^{*}$ must also be informative, and thus $I^{\tau}(t)\geq I^{\tau}(t;s^{*})$. By replacing the abstract policy $\mu$ with Thompson sampling in Assumption 3, we have $I^{\tau}(t;s^{*})\geq I^{\bar{\mu}}(t;s^{*})$ for all $t\geq 0$, or in other words:

\mathbb{P}_{\theta}\left(\frac{I^{\tau}(t)}{t}<\eta\right)\leq\mathbb{P}_{\theta}\left(\frac{I^{\tau}(t;s^{*})}{t}<\eta\right)\leq\mathbb{P}_{\theta}\left(\frac{I^{\bar{\mu}}(t;s^{*})}{t}<\eta\right),

which holds for all $\eta>0$. Therefore, it remains to bound the rightmost quantity.

To this end, let $R(t)=I^{\bar{\mu}}(t;s^{*})/t$. By Assumption 3, $\mathscr{S}^{\bar{\mu}}$ is time-homogeneous and ergodic, and hence a limiting distribution exists such that $\mathbb{E}_{\theta}[R(t)]\rightarrow r$ as $t\to\infty$ for some $r\in(0,1)$. More precisely, for each $\nu>0$ there exists $N\in\mathbb{N}$ (dependent on $\nu$) such that $|\mathbb{E}_{\theta}[R(t)]-r|<\nu$ for $t\geq N$, in which case:

\mathbb{P}_{\theta}\left(|R(t)-r|>2\nu\right) =\mathbb{P}_{\theta}\left(|R(t)-\mathbb{E}_{\theta}[R(t)]+\mathbb{E}_{\theta}[R(t)]-r|>2\nu\right)
\leq\mathbb{P}_{\theta}\left(|R(t)-\mathbb{E}_{\theta}[R(t)]|+|\mathbb{E}_{\theta}[R(t)]-r|>2\nu\right)
\leq\mathbb{P}_{\theta}\left(|R(t)-\mathbb{E}_{\theta}[R(t)]|+\nu>2\nu\right)
=\mathbb{P}_{\theta}\left(|R(t)-\mathbb{E}_{\theta}[R(t)]|>\nu\right).

Now, choosing any $\eta\in(0,r)$ and substituting $\nu=\frac{r-\eta}{2}$, we obtain:

\mathbb{P}_{\theta}\left(R(t)<\eta\right) =\mathbb{P}_{\theta}\left(R(t)-r<-(r-\eta)\right)
\leq\mathbb{P}_{\theta}\left(|R(t)-r|>r-\eta\right)
\leq\mathbb{P}_{\theta}\left(|R(t)-\mathbb{E}_{\theta}[R(t)]|>\frac{r-\eta}{2}\right),

which is valid for all $t\geq N_{1}$, where $N_{1}\in\mathbb{N}$. Finally, we apply Proposition 4 by setting $Y_{i}=\mathbbm{1}\{S_{i}^{\bar{\mu}}=s^{*}\}$, $C_{t}=I^{\bar{\mu}}(t;s^{*})$, and $\|f\|=1$, so that there exist $N_{2}\in\mathbb{N}$ and $\delta,\lambda>0$, dependent on $\eta$ and $\theta$, such that

\mathbb{P}_{\theta}\left(|R(t)-\mathbb{E}_{\theta}[R(t)]|>\frac{r-\eta}{2}\right)\leq\delta\exp\left(-\lambda t\right),\quad\forall t\geq N_{2}.

Therefore, by setting $N_{\theta}=\max\{N_{1},N_{2}\}$, we have found $\delta_{\theta},\lambda_{\theta}>0$ and $\eta_{\theta}\in(0,1)$ such that

\mathbb{P}_{\theta}\left(I^{\tau}(t;s^{*})\geq\eta_{\theta}t\right)\geq 1-\delta_{\theta}\exp\left(-\lambda_{\theta}t\right),\quad\forall t>N_{\theta}.

Since $I^{\tau}(t)\geq I^{\tau}(t;s^{*})$, this completes the proof. $\blacksquare$

IV-B Proof of Theorem 2

The proof begins in an identical manner to [20] but diverges after the second paragraph. We begin by writing $\pi_{t}(\theta)$ as:

\pi_{t}(\theta) =\frac{\mathcal{L}_{\theta}(\mathscr{H}_{t})\pi_{0}(\theta)}{\sum_{\vartheta\in\Theta}\mathcal{L}_{\vartheta}(\mathscr{H}_{t})\pi_{0}(\vartheta)}=\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\left(\frac{\mathcal{L}_{\vartheta}(\mathscr{H}_{t})}{\mathcal{L}_{\theta}(\mathscr{H}_{t})}\right)}
=\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-\log\left(\frac{\mathcal{L}_{\theta}(\mathscr{H}_{t})}{\mathcal{L}_{\vartheta}(\mathscr{H}_{t})}\right)\right)}
=\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-\sum_{s=0}^{t}\log\Lambda_{s}^{\vartheta}\right)}, (4)

where $c_{\vartheta}\coloneqq\pi_{0}(\vartheta)/\pi_{0}(\theta)$, and where:

\Lambda_{0}^{\vartheta}\coloneqq 1,\quad \Lambda_{i}^{\vartheta}\coloneqq\frac{f_{\theta}(R_{i}\,|\,S_{i-1},A_{i-1})\,p_{\theta}(S_{i}\,|\,S_{i-1},A_{i-1})}{f_{\vartheta}(R_{i}\,|\,S_{i-1},A_{i-1})\,p_{\vartheta}(S_{i}\,|\,S_{i-1},A_{i-1})},\quad 0<i\leq t,

and where the processes $\{S_{t} : t\geq 0\}$ and $\{A_{t} : t\geq 0\}$ are understood to evolve under Thompson sampling.

We define the processes $\mathscr{Z}^{\vartheta}=\{Z_{t}^{\vartheta} : t\geq 0\}$ as

Z_{t}^{\vartheta}\coloneqq\sum_{i=0}^{t}\log\Lambda_{i}^{\vartheta},\quad\vartheta\not=\theta\in\Theta,

and, for convenience, the filtration $\mathscr{F}=\{\mathscr{F}_{t} : t\geq 0\}$ as the sequence of sigma algebras $\mathscr{F}_{t}\coloneqq\sigma(\mathscr{H}_{t},A_{t})$, i.e., obtained by concatenating $A_{t}$ to $\mathscr{H}_{t}$. We can now separate $\mathscr{Z}^{\vartheta}$ into a martingale and a predictable process using the Doob decomposition theorem [22]:

Z_{t}^{\vartheta} =M_{t}^{\vartheta}+P_{t}^{\vartheta}
=\sum_{i=0}^{t}\left(\log\Lambda_{i}^{\vartheta}-\mathbb{E}_{\theta}\left[\log\Lambda_{i}^{\vartheta}\,|\,\mathscr{F}_{i-1}\right]\right)+\sum_{i=0}^{t}\mathbb{E}_{\theta}\left[\log\Lambda_{i}^{\vartheta}\,|\,\mathscr{F}_{i-1}\right],

where $\mathscr{M}^{\vartheta}=\{M_{t}^{\vartheta} : t\geq 0\}$ is an $\mathscr{F}_{t}$-martingale under $\mathbb{P}_{\theta}$ by construction, with $M_{0}^{\vartheta}=0$ since $\Lambda_{0}^{\vartheta}=1$.

Since $A_{i}$ is $\mathscr{F}_{i}$-measurable, for every $i\in\{1,2,\dots,t\}$ and $\vartheta\in\Theta$ with $\vartheta\not=\theta$, we have:

\mathbb{E}_{\theta}\left[\log\Lambda_{i}^{\vartheta}\,|\,\mathscr{F}_{i-1}\right] =\mathbb{E}_{\theta}\left[\mathcal{D}\left(\nu_{\theta}(S_{i-1},A_{i-1})\,\|\,\nu_{\vartheta}(S_{i-1},A_{i-1})\right)\,|\,\mathscr{F}_{i-1}\right]
\geq\mathbb{E}_{\theta}\left[\min_{\vartheta\not=\theta\in\Theta}\mathcal{D}\left(\nu_{\theta}(S_{i-1},A_{i-1})\,\|\,\nu_{\vartheta}(S_{i-1},A_{i-1})\right)\,\Big|\,\mathscr{F}_{i-1}\right]
\geq\mathbb{E}_{\theta}\left[\min_{\vartheta\not=\theta}\varepsilon(\theta,\vartheta)\mathbbm{1}\{A_{i-1}\in\mathcal{I}(S_{i-1})\}\,\Big|\,\mathscr{F}_{i-1}\right]
\geq\varepsilon\mathbbm{1}\{A_{i-1}\in\mathcal{I}(S_{i-1})\},

where the constants $\varepsilon(\theta,\vartheta)>0$ follow from the definition of an informative action. As a result, $\{P_{t}^{\vartheta} : t\geq 0\}$ is a non-decreasing process with $P_{0}^{\vartheta}=0$ and:

P_{t}^{\vartheta} =\sum_{i=0}^{t}\mathbb{E}_{\theta}\left[\log\Lambda_{i}^{\vartheta}\,|\,\mathscr{F}_{i-1}\right]\geq\sum_{i=1}^{t}\varepsilon\mathbbm{1}\{A_{i-1}\in\mathcal{I}(S_{i-1})\}
=\varepsilon I(t),

where $I(t)\coloneqq I^{\tau}(t)$.

We now take the expectation of the posterior in (4):

\mathbb{E}_{\theta}\left[\pi_{t}(\theta)\right] =\mathbb{E}_{\theta}\left[\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-Z_{t}^{\vartheta}\right)}\right]
=\mathbb{E}_{\theta}\left[\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-M_{t}^{\vartheta}-P_{t}^{\vartheta}\right)}\right]
\geq\mathbb{E}_{\theta}\left[\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-M_{t}^{\vartheta}-\varepsilon I(t)\right)}\right].

For $\eta\in(0,1)$, we define the event $\mathcal{G}_{t}^{\eta}=\{I(t)\geq\eta t\}$ and condition as follows:

\mathbb{E}_{\theta}\left[\pi_{t}(\theta)\right] \geq\mathbb{E}_{\theta}\left[\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-M_{t}^{\vartheta}-\varepsilon I(t)\right)}\,\Big|\,\mathcal{G}_{t}^{\eta}\right]\mathbb{P}_{\theta}\left(\mathcal{G}_{t}^{\eta}\right)+\mathbb{E}_{\theta}\left[\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-M_{t}^{\vartheta}-\varepsilon I(t)\right)}\,\Big|\,(\mathcal{G}_{t}^{\eta})^{\mathsf{C}}\right]\mathbb{P}_{\theta}\left((\mathcal{G}_{t}^{\eta})^{\mathsf{C}}\right)
\geq\mathbb{E}_{\theta}\left[\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-M_{t}^{\vartheta}-\varepsilon I(t)\right)}\,\Big|\,\mathcal{G}_{t}^{\eta}\right]\mathbb{P}_{\theta}\left(\mathcal{G}_{t}^{\eta}\right)
\geq\mathbb{E}_{\theta}\left[\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}\,\Big|\,\mathcal{G}_{t}^{\eta}\right]\mathbb{P}_{\theta}\left(\mathcal{G}_{t}^{\eta}\right).

For $\delta\in(0,\varepsilon\eta)$, we define the event $\mathcal{B}_{t}(\delta)\coloneqq\bigcap_{\vartheta\not=\theta}\{|M_{t}^{\vartheta}|\leq\delta t\}$ and the scalar $d_{\theta}\coloneqq\frac{1-\pi_{0}(\theta)}{\pi_{0}(\theta)}$. Then:

\mathbb{E}_{\theta}\left[\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}\,\Big|\,\mathcal{G}_{t}^{\eta}\right] =\mathbb{E}_{\theta}\left[\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}\mathbbm{1}\{\mathcal{B}_{t}(\delta)\}\,\Big|\,\mathcal{G}_{t}^{\eta}\right]+\mathbb{E}_{\theta}\left[\frac{1}{1+\sum_{\vartheta\not=\theta}c_{\vartheta}\exp\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}\mathbbm{1}\{\mathcal{B}_{t}(\delta)^{\mathsf{C}}\}\,\Big|\,\mathcal{G}_{t}^{\eta}\right]
\geq\frac{\mathbb{P}_{\theta}\left(\mathcal{B}_{t}(\delta)\,\big|\,\mathcal{G}_{t}^{\eta}\right)}{1+d_{\theta}\exp\left(-(\varepsilon\eta-\delta)t\right)},

which yields the lower bound

\mathbb{E}_{\theta}\left[\pi_{t}(\theta)\right]\geq\frac{\mathbb{P}_{\theta}\left(\mathcal{B}_{t}(\delta)\,\big|\,\mathcal{G}_{t}^{\eta}\right)}{1+d_{\theta}\exp\left(-(\varepsilon\eta-\delta)t\right)}\mathbb{P}_{\theta}\left(\mathcal{G}_{t}^{\eta}\right). (5)

Next, we apply De Morgan’s law and the union bound as follows:

\mathbb{P}_{\theta}\left(\mathcal{B}_{t}(\delta)\,\big|\,\mathcal{G}_{t}^{\eta}\right) =1-\mathbb{P}_{\theta}\left(\mathcal{B}_{t}(\delta)^{\mathsf{C}}\,\big|\,\mathcal{G}_{t}^{\eta}\right)
=1-\mathbb{P}_{\theta}\left(\bigcup_{\vartheta\not=\theta}\{|M_{t}^{\vartheta}|\leq\delta t\}^{\mathsf{C}}\,\Big|\,\mathcal{G}_{t}^{\eta}\right)
\geq 1-\sum_{\vartheta\not=\theta}\mathbb{P}_{\theta}\left(|M_{t}^{\vartheta}|\geq\delta t\,\big|\,\mathcal{G}_{t}^{\eta}\right). (6)

Using conditional probability, the summand can be further simplified to

\mathbb{P}_{\theta}\left(|M_{t}^{\vartheta}|>\delta t\,\big|\,\mathcal{G}_{t}^{\eta}\right)\leq\frac{\mathbb{P}_{\theta}\left(|M_{t}^{\vartheta}|>\delta t\right)}{\mathbb{P}_{\theta}\left(\mathcal{G}_{t}^{\eta}\right)}.

Next, we apply Proposition 1, so there exists $N$ (dependent on $\eta$ and $\theta$) such that for $t>N$, $\mathbb{P}_{\theta}(\mathcal{G}_{t}^{\eta})\geq\frac{1}{2}$ and thus

\mathbb{P}_{\theta}\left(|M_{t}^{\vartheta}|>\delta t\,\big|\,\mathcal{G}_{t}^{\eta}\right)\leq 2\mathbb{P}_{\theta}\left(|M_{t}^{\vartheta}|>\delta t\right).

Since $\mathcal{D}\left(\nu_{\theta}(S_{i-1},A_{i-1})\,\|\,\nu_{\vartheta}(S_{i-1},A_{i-1})\right)<\infty$ for all $\vartheta\not=\theta$, $\mathscr{M}^{\vartheta}$ is an $\mathscr{F}_{t}$-martingale whose increments are bounded by some constant $c>0$, and applying Azuma’s inequality [22] yields:

\mathbb{P}_{\theta}\left(|M_{t}^{\vartheta}|>\delta t\,\big|\,\mathcal{G}_{t}^{\eta}\right)\leq 2\mathbb{P}_{\theta}\left(|M_{t}^{\vartheta}|>\delta t\right)\leq 4\exp\left(-\frac{\delta^{2}}{2c^{2}}t\right),

which holds for $t$ large, $\eta$ small, and $\delta<\varepsilon\eta$. Plugging this into (6):

\mathbb{P}_{\theta}\left(\mathcal{B}_{t}(\delta)\,\Big|\,\mathcal{G}_{t}^{\eta}\right) \geq 1-\sum_{\vartheta\not=\theta}\mathbb{P}_{\theta}\left(|M_{t}^{\vartheta}|>\delta t\,\big|\,\mathcal{G}_{t}^{\eta}\right)
\geq 1-4\left(|\Theta|-1\right)\exp\left(-\frac{\delta^{2}}{2c^{2}}t\right),

and therefore (5) leads to:

\mathbb{E}_{\theta}\left[\pi_{t}(\theta)\right] \geq\frac{\mathbb{P}_{\theta}\left(\mathcal{B}_{t}(\delta)\,\big|\,\mathcal{G}_{t}^{\eta}\right)}{1+d_{\theta}\exp\left(-(\varepsilon\eta-\delta)t\right)}\mathbb{P}_{\theta}\left(\mathcal{G}_{t}^{\eta}\right)
\geq\frac{1-4\left(|\Theta|-1\right)\exp\left(-\frac{\delta^{2}}{2c^{2}}t\right)}{1+d_{\theta}\exp\left(-(\varepsilon\eta-\delta)t\right)}\left(1-\delta_{\theta}\exp\left(-\lambda_{\theta}t\right)\right)
\geq\frac{1-\delta_{1}\exp\left(-\lambda_{1}t\right)}{1+\delta_{2}\exp\left(-\lambda_{2}t\right)}

for $t$ sufficiently large and some $\delta_{1},\delta_{2},\lambda_{1},\lambda_{2}>0$ dependent on $\theta$. Finally:

\mathbb{E}_{\theta}\left[1-\pi_{t}(\theta)\right] \leq 1-\frac{1-\delta_{1}\exp\left(-\lambda_{1}t\right)}{1+\delta_{2}\exp\left(-\lambda_{2}t\right)}
=\frac{\delta_{2}\exp\left(-\lambda_{2}t\right)+\delta_{1}\exp\left(-\lambda_{1}t\right)}{1+\delta_{2}\exp\left(-\lambda_{2}t\right)}
\leq\delta_{2}\exp\left(-\lambda_{2}t\right)+\delta_{1}\exp\left(-\lambda_{1}t\right)
\leq\delta_{3}\exp\left(-\lambda_{3}t\right)

for $t$ sufficiently large and some $\delta_{3},\lambda_{3}>0$. It is easy to find $\delta_{4},\lambda_{4}>0$ so that $\mathbb{E}_{\theta}[1-\pi_{t}(\theta)]\leq\delta_{4}\exp(-\lambda_{4}t)$ holds for all $t\geq 0$. This completes the proof. $\blacksquare$

V Numerical Examples

Figure 2: (a)-(c) Empirically estimated average regret (top) and posterior error rate (bottom) for the admission control, inventory management, and dynamic pricing problems. The dark curve in the topmost figures corresponds to the regret of the true unknown parameter value, while the gray curves correspond to the regret of the other parameters. (d) Inverse of the empirically estimated regret bounds, showing approximately linear growth for large $T$.

In this section, we apply Thompson sampling (Algorithm 1) to the three instantiations of PMDPs with arrival rate uncertainty and uninformative actions given in Section III-A. The goal of these experiments is to verify that the asymptotically optimal regret bounds hold when Thompson sampling is applied in the online setting with uninformative actions.

The admission control problem assumes $\bar{n}=40$, $R=10$, $h=0.15$, $\beta=0.3$, and $\Theta=\{0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9\}$. The inventory management problem assumes a single type of good with $\bar{n}=30$, $c=2$, $p=2.8$, $h=0.01$, and $\Theta=\{1,2,3,4,5,6,7,8,9,10\}$. Finally, the dynamic pricing problem assumes $V=4$, $c=0.05$, $\beta=0.5$, $h=0.01$, prices $p_{i}=i$ for $i=0,1,\dots,5$, and $\Theta=\{1,2,3,4,5\}$. A uniform prior $\pi_{0}(\theta)=1/|\Theta|$ is used for all problems, and relative value iteration is used to calculate $\mu_{\theta}^{*}$ for each $\theta$. Finally, to estimate the learning rate and regret for each problem, Thompson sampling is run for $5{,}000{,}000$ sample paths, and the results across paths are averaged for each decision epoch $t$.
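
For reference, the following is a minimal sketch of relative value iteration for computing $\mu_{\theta}^{*}$ from tabulated transition matrices and rewards; this is a generic textbook-style implementation under standard unichain/aperiodicity conditions, not the authors' code.

```python
import numpy as np

def relative_value_iteration(P, r, tol=1e-8, max_iter=100_000):
    """Average-reward optimal policy for a known parameter via relative value iteration.

    P[a] is an |S| x |S| transition matrix and r[a] an |S|-vector of rewards for
    action a; convergence assumes the usual unichain/aperiodicity conditions.
    """
    P = [np.asarray(Pa, dtype=float) for Pa in P]
    r = [np.asarray(ra, dtype=float) for ra in r]
    n_actions, n_states = len(P), len(r[0])
    h = np.zeros(n_states)                          # relative value function
    for _ in range(max_iter):
        Q = np.array([r[a] + P[a] @ h for a in range(n_actions)])  # |A| x |S|
        h_new = Q.max(axis=0)
        h_new = h_new - h_new[0]                    # normalize against a reference state
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return Q.argmax(axis=0)                         # greedy stationary policy: state -> action
```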

The results are reported in Figure 2. As illustrated in plots (a)-(c), the empirical expected regret for the true parameter value (shown in black) tends to zero over decision epochs for all problems. To validate that the rate is indeed $O(T^{-1})$, plot (d) illustrates the corresponding inverse of the empirical regret values, which becomes linear for large $T$ and confirms Theorem 3. The analysis was conducted on an Intel quad-core processor at 2.5 GHz with 8 GB of RAM, with an average running time of around $10^{-3}$ seconds per sample path and parameter pair. Due to its low time complexity, Thompson sampling can be easily implemented for larger parameter spaces and longer planning horizons, while converging at the asymptotically optimal rate.

VI Conclusion

We studied parameterized MDPs described by a set of unknown parameters learned using Bayesian inference. A crucial feature of such models is the presence of “uninformative” actions, which do not provide any information about the unknown parameters and slow down the rate of learning. We contributed a set of assumptions for PMDPs under which Thompson sampling guarantees an asymptotically optimal expected regret bound of $O(T^{-1})$, and which are easily verified for many classes of problems such as queuing, inventory control, and dynamic pricing. Numerical experiments validated the theory and showed that, when our assumptions can be verified, Thompson sampling provides a computationally efficient algorithm for solving parameterized MDPs.

References

  • Afèche and Ata [2013] Philipp Afèche and Barış Ata. Bayesian dynamic pricing in queueing systems with unknown delay cost characteristics. Manufacturing & Service Operations Management, 15(2):292–304, 2013.
  • Agrawal and Goyal [2012] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pages 39–1. JMLR Workshop and Conference Proceedings, 2012.
  • Araman and Caldentey [2009] Victor F Araman and René Caldentey. Dynamic pricing for nonperishable products with demand learning. Operations research, 57(5):1169–1188, 2009.
  • Asmussen et al. [1992] Søren Asmussen, Peter W Glynn, and Hermann Thorisson. Stationarity detection in the initial transient problem. ACM Transactions on Modeling and Computer Simulation (TOMACS), 2(2):130–157, 1992.
  • Aviv and Pazgal [2005] Yossi Aviv and Amit Pazgal. A partially observed markov decision process for dynamic pricing. Management science, 51(9):1400–1416, 2005.
  • Banjević and Kim [2019] Dragan Banjević and Michael Jong Kim. Thompson sampling for stochastic control: The continuous parameter case. IEEE Transactions on Automatic Control, 64(10):4137–4152, 2019. doi: 10.1109/TAC.2019.2895253.
  • Bertsimas and Perakis [2001] Dimitris J Bertsimas and Georgia Perakis. Dynamic pricing: A learning approach. Massachusetts Institute of Technology, Operations Research Center, 2001.
  • Besbes and Zeevi [2012] Omar Besbes and Assaf Zeevi. Blind network revenue management. Operations research, 60(6):1537–1550, 2012.
  • Borgs et al. [2014] Christian Borgs, Jennifer T Chayes, Sherwin Doroudi, Mor Harchol-Balter, and Kuang Xu. The optimal admission threshold in observable queues with state dependent pricing. Probability in the Engineering and Informational Sciences, 28(1):101, 2014.
  • Burnetas and Smith [2000] Apostolos N Burnetas and Craig E Smith. Adaptive ordering and pricing for perishable products. Operations Research, 48(3):436–443, 2000.
  • Chen and Frank [2001] Hong Chen and Murray Z Frank. State dependent pricing with a queue. IIE Transactions, 33(10):847–860, 2001.
  • Chowdhury et al. [2021] Sayak Ray Chowdhury, Aditya Gopalan, and Odalric-Ambrym Maillard. Reinforcement learning in parametric mdps with exponential families. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1855–1863. PMLR, 13–15 Apr 2021.
  • Ferreira et al. [2018] Kris Johnson Ferreira, David Simchi-Levi, and He Wang. Online network revenue management using thompson sampling. Operations research, 66(6):1586–1602, 2018.
  • Glynn and Ormoneit [2002] Peter W Glynn and Dirk Ormoneit. Hoeffding’s inequality for uniformly ergodic markov chains. Statistics & probability letters, 56(2):143–146, 2002.
  • Gopalan and Mannor [2015] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized markov decision processes. In Conference on Learning Theory, pages 861–898, 2015.
  • Harrison et al. [2012] J Michael Harrison, N Bora Keskin, and Assaf Zeevi. Bayesian dynamic pricing policies: Learning and earning under a binary prior distribution. Management Science, 58(3):570–586, 2012.
  • Huang et al. [1976] Cheng-Chi Huang, Dean Isaacson, and B Vinograde. The rate of convergence of certain nonhomogeneous markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 35(2):141–146, 1976.
  • Jain et al. [2015] Aditya Jain, Nils Rudi, and Tong Wang. Demand estimation and ordering under censoring: Stock-out timing is (almost) all you need. Operations Research, 63(1):134–150, 2015.
  • Keskin and Zeevi [2014] N Bora Keskin and Assaf Zeevi. Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Operations Research, 62(5):1142–1167, 2014.
  • Kim [2017] Michael Jong Kim. Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control, 62(12):6415–6422, 2017.
  • Kim and Makis [2012] Michael Jong Kim and Viliam Makis. Optimal control of a partially observable failing system with costly multivariate observations. Stochastic Models, 28(4):584–608, 2012.
  • Klenke [2013] A. Klenke. Probability Theory: A Comprehensive Course. Universitext. Springer London, 2013. ISBN 9781447153610.
  • Naor [1969] Pinhas Naor. The regulation of queue size by levying tolls. Econometrica: journal of the Econometric Society, pages 15–24, 1969.
  • Osband et al. [2013] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
  • Russo and Van Roy [2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
  • Russo et al. [2018] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
  • Thompson [1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.