Thompson Sampling for Parameterized Markov Decision Processes with Uninformative Actions

Michael Gimelfarb and Michael Jong Kim Michael Gimelfarb is with the Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Ontario M5S 3G8, Canada. (e-mail: [email protected]). Michael Jong Kim was with the Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Ontario M5S 3G8, Canada. He is now with the Sauder School of Business, University of British Columbia, Vancouver, British Columbia V6T 1Z2, Canada. (e-mail: [email protected]).

Abstract

We study parameterized MDPs (PMDPs) in which the key parameters of interest are unknown and must be learned using Bayesian inference. One key defining feature of such models is the presence of “uninformative” actions that not provide no information about the unknown parameters. We contribute a set of assumptions for PMDPs under which Thompson sampling guarantees an asymptotically optimal expected regret bound of $O(T^{-1})$ , which are easily verified for many classes of problems such as queuing, inventory control, and dynamic pricing.

Index Terms:

Bayesian Inference, Exploration–Exploitation, Markov Decision Process, Parameter Uncertainty, Regret Bound, Thompson Sampling

I Introduction

Parameterized MDPs (PMDPs) are dynamic control problems described by parameters of interest whose values are typically unknown, arising in many applications such as queuing and supply chain control, maintenance, and network design. Such problems can be naturally formulated as POMDPs, in which a learnt posterior distribution over the model parameters is incorporated directly into the system state [1, 3, 5]. However, the corresponding dynamic programming equations are typically computationally or analytically intractable due to the so-called “curse of dimensionality”. To remedy this, a variety of approaches have been proposed to do approximately optimal control including myopic and lookahead policies [16], least-squares methods [7, 19], non-parametric approaches [10, 8] and reinforcement learning [12]

One particular approach, which we investigate in this work, is Thompson sampling (TS) [27, 26], which samples a parameter according to the current posterior distribution at each time step, and then solves the underlying MDP assuming that the sampled parameter is correct. The decoupling of the posterior update and sample phase from the optimization phase makes Thompson sampling a computationally and analytically tractable alternative for solving general PMDPs, and it has been applied in a variety of problems settings, notably multi-armed bandit problems [2, 25]. However, these models can be classified as purely Bayesian optimization problems since, when stripped of all parameter uncertainty, they are fundamentally single-stage optimization problems. On the other hand, much less is known about the theoretical performance of Thompson sampling for stochastic control problems with parameter uncertainty.

Our work builds upon the stream of literature on Thompson sampling for PMDPs [20, 6, 24, 15, 13]. Specifically, the closest work to ours is [20], which showed that Thompson sampling is asymptotically optimal assuming each action taken reveals some additional information about the unknown parameter(s). On the other hand, many important PMDPs contain “uninformative” actions that do not reveal information about the parameters, e.g. setting lower inventory levels in an inventory control problem may lead to stock-outs [18], preventative maintenance provides less statistical information about failures [21], and so-called “uninformative prices” in dynamic pricing problems [1]. To this end, Gopalan and Mannor [15] showed – under rather general assumptions – that the number of instants where sub-optimal actions are chosen scales logarithmically with time, with high probability. On the other hand, this paper provides a different set of assumptions and analysis, and contributes more precise regret and learning rate bounds for an important class of PMDPs with uninformative actions that hold in expectation. Our work contributes more general assumptions and regret analysis than found in Kim [20], extending the previous optimal regret guarantees for a broader classes of problems with uninformative actions that are of practical interest.

The remainder of the paper is structured as follows. Section II defines the Thompson sampling algorithm in the context of PMDPs. Section III provides a formal statement of the notion of an “uninformative” action, a set of general assumptions on the problem structure, and corresponding optimal regret and learning rate bounds for Thompson sampling. These results are proved in Section IV, and empirically validated on three important classes of problems, namely admission control, inventory control and dynamic pricing.

II Preliminaries

II-A Parameterized Markov Decision Process

Decision-making in this paper be summarized as a parameterized Markov decision process (PMDP), which consists of a finite set of possible states $\mathcal{S}$ , a finite set of possible actions or controls $\mathcal{A}$ , and a finite set of parameters $\Theta$ (e.g. hypotheses) that specify both the reward and the state distribution. Specifically, given a control $a_{t}\in\mathcal{A}$ applied at time $t$ in state $s_{t}\in\mathcal{S}$ , and a parameter $\theta\in\Theta$ , the reward and the next state are sampled according to

S_{t+1}\sim p_{\theta}(\cdot\,|\,s_{t},a_{t}),\quad R_{t}\sim f_{\theta}(\cdot\,|\,s_{t},a_{t}).

We also define the reward function as $r_{\theta}\!\left(s,a\right)\coloneqq\operatorname{\mathbb{E}}_{\theta}\!\left[R|s,a\right]=\int rf_{\theta}(r|s,a)\,\mathrm{d}r$ , which is assumed to be uniformly bounded on $\mathcal{S}\times\mathcal{A}$ .

An admissible policy $\mu$ is defined as a sequence of mappings $\left\{\,\mu_{t}\,:\,t\geq 0\,\right\}$ , where $\mu_{t}(\mathscr{H}_{t})\mapsto\mathscr{P}(\mathcal{A})$ maps the history of information observed up to time $t$ , $\mathscr{H}_{t}=\left\{S_{0},A_{0},R_{0},\dots S_{t-1},A_{t-1},\dots R_{t-1},S_{t}\right\}$ , to a probabilitity distribution over the action space, and $\Pi$ be the set of all admissible policies. On the other hand, a Markov policy is a sequence of mappings $\mu_{t}:\mathcal{S}\to\mathcal{A}$ ; it is further a stationary policy if $\mu_{0}=\mu_{1}=\dots$ . Let $\Pi_{S}$ denote the set of all stationary policies. Given a fixed parameter $\theta\in\Theta$ , the goal is to maximize over all $\mu\in\Pi$ , the long-run expected average reward starting from an initial state $S_{0}=s_{0}$ , given as

J_{\theta}^{\mu}(s_{0})\coloneqq\limsup_{T\to\infty}\frac{1}{T}\operatorname{\mathbb{E}}_{\theta}\!\left[\sum_{t=0}^{T-1}r_{\theta}\!\left(S_{t},A_{t}\right)\right],

(1)

in which the expectation is computed with respect to $p_{\theta}$ and $A_{t}$ are sampled according to $\mu$ . We assume that an optimal policy $\mu_{\theta}^{*}$ exists, which must necessarily be a stationary Markov policy. Furthermore, the state process $\mathscr{S}_{\theta}=\left\{\,S_{t}\,:\,t\geq 0\,\right\}$ induced by $\mu_{\theta}^{*}$ constitutes a time-homogeneous Markov chain, and we assume that

Assumption 1.

The state process $\mathscr{S}_{\theta}=\left\{\,S_{t}\,:\,t\geq 0\,\right\}$ induced by $\mu_{\theta}^{*}$ is ergodic unichain.

Please note that our assumption is weaker than the ergodicity required in [20], although the main theoretical results therein also follow through under the weaker assumption above [17].

II-B Thompson Sampling

The decision maker accounts for uncertainty in $\theta\in\Theta$ by modeling it as a random variable $\Theta_{t}$ in each decision epoch. Starting with prior distribution $\pi_{0}(\theta)\coloneqq\operatorname{\mathbb{P}}\!\left(\Theta_{0}=\theta\,|\,\mathscr{H}_{0}\right)$ , and given history $\mathscr{H}_{t}$ , the decision maker updates his belief about $\theta$ by computing the posterior distribution using Bayes’ theorem:

\pi_{t}(\theta)\coloneqq\operatorname{\mathbb{P}}\!\left(\Theta_{t}=\theta\,|\,\mathscr{H}_{t}\right)=\frac{\mathcal{L}_{\theta}(\mathscr{H}_{t})\pi_{0}(\theta)}{\sum_{\vartheta\in\Theta}{\mathcal{L}_{\vartheta}(\mathscr{H}_{t})\pi_{0}(\vartheta)}},

(2)

where

\mathcal{L}_{\theta}(\mathscr{H}_{t})\coloneqq\prod_{i=1}^{t}f_{\theta}(R_{i-1}\,|\,S_{i-1},A_{i-1})\,p_{\theta}(S_{i}\,|\,S_{i-1},A_{i-1})

is the likelihood. Note that $\pi_{t}(\theta)$ is a function of $\mathscr{H}_{t}$ and is therefore a random variable.

The agent selects controls that are consistent with the learned belief $\pi_{t}$ by following the Thompson sampling algorithm. Formally, this corresponds to an admissible (e.g. history-dependent) policy $\tau$ defined in Algorithm 1.

Algorithm 1 Thompson Sampling for PMDPs

Initialize

S_{0}\leftarrow s_{0}

and

\pi_{0}

for

t=0

T

Sample

\Theta_{t}\sim\pi_{t}(\cdot)

Solve MDP (1) with parameter

\Theta_{t}

to obtain

\mu_{\Theta_{t}}^{*}

Apply control

A_{t}\leftarrow\mu_{\Theta_{t}}^{*}(S_{t})

Observe

S_{t+1}

and

R_{t+1}

for

\theta\in\Theta

\pi_{t+1}(\theta)\leftarrow\frac{f_{\theta}(R_{t+1}|S_{t},A_{t})p_{\theta}(S_{t+1}|S_{t},A_{t})\pi_{t}(\theta)}{\sum_{\vartheta\in\Theta}f_{\vartheta}(R_{t+1}|S_{t},A_{t})p_{\vartheta}(S_{t+1}|S_{t},A_{t})\pi_{t}(\vartheta)}

end for

end forend

Intuitively, the posterior update (2) should only improve upon the decision maker’s belief about $\theta$ when $f_{\theta}(\cdot\,|\,s,a)\,p_{\theta}(\cdot\,|\,s,a)$ is distinguishable for different values of $\theta$ . To see this, we can restate (2) as

\pi_{t}(\theta)=\left[1+\displaystyle\sum_{\vartheta\not=\theta}{\frac{\pi_{0}(\vartheta)}{\pi_{0}(\theta)}\left(\frac{\mathcal{L}_{\vartheta}(\mathscr{H}_{t})}{\mathcal{L}_{\theta}(\mathscr{H}_{t})}\right)}\right]^{-1},

from which we can see that $\pi_{t+1}(\theta)=\pi_{t}(\theta)$ precisely when the ratio of the two likelihoods does not change. Conceptually, the existence of such uninformative state-action pairs $(s,a)$ can significantly “stall” the posterior updates and prevent the decision maker from learning the correct PMDP parameters. On the other hand, recent analysis of posterior convergence [20] in general PMDPs precludes such scenarios, by imposing rather strong assumptions on the problem structure that do not always hold in practice. In the following section, we will formalize the notion of an uninformative action and provide an alternative and complementary set of assumptions that still provide an asymptotically optimal regret bound.

III Thompson Sampling for PMDPs with Uninformative Actions

III-A Uninformative Actions

In order to assess the convergence speed of $\pi_{t}$ more precisely, it is necessary to quantify the magnitude of the posterior update. In particular, we define $\nu_{\theta}(s,a)$ as the joint probability distribution specified by $f_{\theta}(\cdot\,|\,s,a)\,p_{\theta}(\cdot\,|\,s,a)$ and the KL-divergence

\mathcal{D}\!\left(\nu_{\theta}(s,a)\,\|\,\nu_{\vartheta}(s,a)\right)\coloneqq\operatorname{\mathbb{E}}_{\theta}\!\left[\log\!\left(\frac{f_{\theta}(\cdot\,|\,s,a)\,p_{\theta}(\cdot\,|\,s,a)}{f_{\vartheta}(\cdot\,|\,s,a)\,p_{\vartheta}(\cdot\,|\,s,a)}\right)\right].

We require that $\nu_{\vartheta}(s,a)$ is absolutely continuous with respect to $\nu_{\theta}(s,a)$ , so that the above quantity is finite.

Formalizing the above intuition, an action $a\in\mathcal{A}$ is called informative in state $s\in\mathcal{S}$ if, for any two distinct $\theta,\vartheta\in\Theta$ ,

\mathcal{D}\!\left(\nu_{\theta}(s,a)\,\|\,\nu_{\vartheta}(s,a)\right)>0.

Assumption 2.

There exists $s^{*}\in\mathcal{S}$ such that $\mu_{\theta}^{*}\left(s^{*}\right)$ is an informative action in state $s^{*}$ for all $\theta\in\Theta$ .

Note that Assumption 2 is easy to check in practice since, in the context of Algorithm 1, the policies $\mu_{\theta}^{*}$ are typically computed in advance and cached. Furthermore, it naturally partitions the state-action space into two classes. The first class $\Sigma\subset\mathcal{S}\times\mathcal{A}$ consists of state-action pairs $(s,a)$ for which $a$ is informative in $s$ , and is non-empty since it contains $(s^{*},\mu_{\theta}^{*}\left(s^{*}\right))$ for every $\theta\in\Theta$ . The second class is comprised of those state-action pairs for which $a$ are not informative with respect to $s$ . Thus, Assumption 2 is much weaker than the one required in [20], where it is simply assumed that $\Sigma^{\mathsf{C}}=\emptyset$ .

Finally, to ensure that the posterior belief converges at the optimal asymptotic rate, it is necessary that the Markov chain induced by Thompson sampling eventually visits an informative state a positive fraction of the time. One way to guarantee this is to ensure that the induced Markov chain under Thompson sampling mixes with respect to the special state $s^{*}$ , which is now known to be informative. More formally, define $\mathcal{I}(s)\subseteq\mathcal{A}$ as the set of controls that are informative in state $s$ , and the following random variables for every $\mu\in\Pi$ and $s\in\mathcal{S}$ :

I^{\mu}(t)\coloneqq\sum_{i=0}^{t-1}\mathbbm{1}\!\left\{A_{i}\in\mathcal{I}(S_{i})\right\},\quad I^{\mu}(t;s)\coloneqq\sum_{i=0}^{t-1}\mathbbm{1}\!\left\{S_{i}=s\right\}.

Finally, for each $s\in\mathcal{S}$ , we define $\mathcal{A}^{*}(s)\coloneqq\left\{\,a\in\mathcal{A}\,:\,a=\mu_{\theta}^{*}(s)\mbox{ for some }\theta\in\Theta\,\right\}$ . We then define $\Pi_{C}\subset\Pi$ to be the set of policies $\mu$ for which $\mu(s)\in\mathcal{A}^{*}(s)$ holds for all $s\in\mathcal{S}$ , e.g. the set of policies whose controls are consistent with respect to the policy set $\left\{\,\mu_{\theta}^{*}\,:\,\theta\in\Theta\,\right\}$ .

Assumption 3.

There exists a policy $\bar{\mu}\in\Pi_{S}$ , such that the Markov chain induced by $\bar{\mu}$ is ergodic, and $I^{\mu}(t;s^{*})\geq I^{\bar{\mu}}(t;s^{*})$ holds (with probability 1) for all $t$ and $\mu\in\Pi_{C}$ .

Refer to caption — Figure 1: Conceptual illustration of Assumptions 2 and 3.

One way to interpret Assumptions 2 and 3 is to define another policy class $\Pi_{*}\subseteq\Pi$ , consisting of all policies $\mu\in\Pi$ for which $\operatorname{\mathbb{P}}\!\left(\mu(s^{*})\in\mathcal{I}(s^{*})\right)=1$ . Clearly, $\Pi_{C}\subset\Pi_{*}$ according to Assumption 2, and $\mu_{\theta}^{*}\in\Pi_{C}$ by construction. Furthermore, by the design of Thompson sampling (Algorithm 1), we can also conclude that $\tau\in\Pi_{C}$ (although $\tau\not\in\Pi_{S}$ ). Finally, Assumption 3 requires that $I^{\mu}(t;s^{*})\geq I^{\bar{\mu}}(t;s^{*})$ for all ${\mu}\in\Pi_{C}$ , including $\tau$ . These facts are summarized conceptually in Fig. 1. Note that $\bar{\mu}$ can be any policy in the shaded region for which the induced Markov chain is ergodic. As illustrated later in our examples, this policy also often lies in $\Pi_{*}$ , but this is not required under our assumptions in general.

In the next section, we will prove that Assumptions 1-3 are sufficient to achieve asymptotically optimal regret bounds for Thompson sampling with uninformative actions. However, these assumptions are not in general necessary. To see this, suppose Assumption 3 is relaxed by defining a secondary state $s^{\prime}\in\mathcal{S}$ instead of $s^{*}$ , that need not satisfy Assumption 2. Now, if we further assume that there is a non-zero probability of visiting $s^{*}$ between successive visits to $s^{\prime}$ under any policy in $\Pi_{C}$ , then states $s^{\prime}$ and $s^{*}$ belong in the same communicating class, and the same convergence guarantees also extend to this setting. This generalization would allow us to establish convergence guarantees for learning the service rate parameters in the examples of the following paragraph.

We now provide several important classes of stochastic control problems that satisfy the required assumptions in this paper. Please note, however, that these examples do not satisfy the stronger assumption on the KL-divergence in [20], and thus the analysis in that paper cannot be applied to them.

Example 1 (Admission Control).

At the beginning of each decision epoch, a decision maker decides whether to open or close a server with a fixed capacity $\bar{n}>0$ (no backlogging of orders is allowed). A single customer arrives in each period with unknown probability $\theta\in(0,1)$ . If the server is open, the customer pays a toll $R\geq 0$ and joins the end of the queue, while if the server is closed, the customer arrival is unobserved. For each customer waiting in the server in each epoch, the decision maker incurs a penalty of $h\geq 0$ . Meanwhile at the end of each period, the first customer in queue completes service and leaves with probability $\beta\in(0,1)$ . Opening (closing) the server corresponds to an informative (uninformative) action. It is also easy to check that, if $\beta R\geq h$ (see, e.g., [23]), the optimal policy satisfies $\mu_{\theta}^{*}(0)=``\mathrm{open}"$ for every $\theta\in(0,1)$ , e.g. an empty server should be open, and we set $s^{*}=0$ . Finally, let $\bar{\mu}$ be the policy that always admits a customer unless the server is full. The Markov chain induced by $\bar{\mu}$ is ergodic, and Assumption 3 holds for any admissible policy $\mu$ that admits when the server is empty. This example can also be generalized to multiple servers, multiple customer types, and general demand or service distributions. $\Box$

Example 2 (Inventory Management).

A store sells $m$ different types of goods. At the beginning of each decision epoch $t$ , the store manager observes the amount of each type of good in stock, $s_{t,i}$ , and decides whether or not to fully restock the inventories for each type of good $i$ up to a level $\bar{n}_{i}>0$ . The delivery time is negligible compared to the length of each decision epoch, and so goods are delivered instantaneously. The wholesale price of a good of type $i$ is $c_{i}>0$ and it is sold for $p_{i}>c_{i}$ . Meanwhile, the cost of holding a good of type $i$ is $h_{i}>0$ per item per period. Furthermore, the demand for each good $i$ is modeled as a Poisson random variable with mean $\theta_{i}>0$ , and stock-outs (e.g. demand exceeding the inventory) are unobserved. Since the manager profits from selling every type of good, it is always optimal to reorder inventory when a particular item is out of stock, and so $\mu_{\theta}^{*}(0,0,\dots 0)$ requires that all types of goods are restocked. In this case, the inventory level is guaranteed to be positive at the beginning of each epoch and demand for each type of good is observed. Finally, let $\bar{\mu}$ be the policy that chooses to restock every type of good in every decision epoch regardless of the inventory level. Since demand is unbounded, the induced Markov chain is clearly ergodic, and Assumption 3 holds for all admissible policies $\mu$ that reorder in state $s^{*}=(0,0,\dots 0)$ . $\Box$

Example 3 (Dynamic Pricing).

Customer demand in each period follows a Poisson distribution with parameter $\theta>0$ . When customers arrive, they observe the firm’s posted price $p>0$ and the number of customers already in queue $n\geq 0$ , and decide whether to join the queue or leave. Given that the value of the service to the customer is $V>0$ , the waiting cost per epoch is $c>0$ , and the service time per customer is geometric with parameter $\beta\in(0,1)$ , the customer will only join the queue if $V-\frac{c}{\beta}(n+1)\geq p$ (see, e.g., [9, 11]). At the beginning of each decision epoch, the firm fixes a price from the set $\left\{p_{1},p_{2},\dots p_{m}\right\}$ , where without loss of generality we may assume $0<p_{1}<p_{2}<\dots<p_{m}<\infty$ . We also assume that $p_{m}>V-\frac{c}{\beta}$ so the firm can choose to reject customers, and also assume that $p_{1}\leq V-\frac{c}{\beta}$ so the problem is non-trivial. Clearly, the queue has an effective capacity, given by $\bar{n}=\min\left\{\,n\,:\,n\geq 0,\,V-\frac{c}{\beta}(n+1)<p_{1}\,\right\}$ . Similar to the admission control problem, the cost incurred by the firm is $h>0$ per customer in queue per period. Finally, let $s^{*}=0$ and consider the policy $\bar{\mu}$ that always sets the lowest price $p_{1}$ . The induced Markov chain is ergodic, and Assumption 3 holds for all admissible policies $\mu$ that fix an attractive price when the queue is empty. $\Box$

III-B Asymptotically Optimal Convergence Rates

The main theoretical result of this paper establishes that Thompson sampling with uninformative actions achieves asymptotically optimal learning rates (as measured by expected posterior probability of sampling the incorrect parameter) and regret. Here, we sketch out the main reasoning necessary to establish both results, and provide formal proofs in the following sections.

First, it is necessary to understand how fast the time-homogeneous Markov chain induced by the policy $\bar{\mu}$ mixes. In the following section, it will be shown that the Markov chain induced by the policy $\bar{\mu}$ mixes at an exponential (e.g. geometric) rate, and so Assumption 3 ensures that the special state $s^{*}$ will be visited at a linear rate asymptotically.

Proposition 1.

For $\eta>0$ sufficiently small, there exists $\delta_{\theta},\lambda_{\theta}>0$ and $N_{\theta}\in\mathbb{N}$ such that

\operatorname{\mathbb{P}}_{\theta}\!\left(I^{\tau}(t)\geq\eta t\right)\geq 1-\delta_{\theta}\exp\!\left(-\lambda_{\theta}t\right),\quad\forall t>N_{\theta}.

By making use of Proposition 1, we then prove that the posterior error converges to zero exponentially fast under our more general assumptions.

Theorem 2.

Under Assumptions 1-3, and assuming $\pi_{0}(\theta)>0$ for all $\theta$ , there exist constants $a_{\theta}$ and $b_{\theta}>0$ such that under Thompson sampling,

\operatorname{\mathbb{E}}_{\theta}\!\left[1-\pi_{t}(\theta)\right]\leq a_{\theta}\exp\!\left(-b_{\theta}t\right),\quad\forall t\geq 0.

The following regret result follows immediately from Theorem 2 proved above, Assumption 1, and Theorem 5 in [20].

Theorem 3 ([20]).

Under Assumptions 1-3, Thompson sampling achieves a worst-case average regret of $O(T^{-1})$ .

IV Proofs of Theoretical Results

IV-A Proof of Proposition 1

We first cite the following result.

Proposition 4 ([14]).

Let $\mathscr{X}=\left\{\,X_{t}\,:\,t\geq 0\,\right\}$ be a Markov chain taking values in $\mathcal{X}$ . Suppose there exists a probability measure $\varphi$ on $\mathcal{X}$ , $\lambda>0$ , and integer $m\geq 1$ such that

\operatorname{\mathbb{P}}\!\left(X_{m}\in\cdot\,|\,X_{0}=x\right)\geq\lambda\varphi(\cdot),\quad\forall x\in\mathcal{X}.

(3)

Let $f:\mathcal{X}\to\mathbb{R}$ , and define $Y_{i}\coloneqq f(X_{i})$ and $C_{t}\coloneqq\sum_{i=0}^{t-1}{Y_{i}}$ , and suppose that $\|f\|=\sup\left\{\,\left\lvert f(x)\right\rvert\,:\,x\in\mathcal{X}\,\right\}<\infty$ . Then,

\operatorname{\mathbb{P}}\!\left(C_{t}-\operatorname{\mathbb{E}}\!\left[C_{t}\right]\geq t\epsilon\right)\leq\exp\!\left(-\frac{\lambda^{2}(t\epsilon-2\|f\|m/\lambda)^{2}}{2t\|f\|^{2}m^{2}}\right)

for $t>2\|f\|m/(\lambda\epsilon)$ .

The condition (3) is introduced earlier in [4], in which the authors provide sufficient conditions under which it holds for general Markov chains. In particular, if $\mathcal{X}$ is discrete, then (3) holds if the $m$ -step transition matrix of $\mathscr{X}$ has a column whose elements are uniformly bounded away from zero. Furthermore, if $\mathcal{X}$ is finite, then (3) holds if $\mathscr{X}$ is aperiodic and irreducible.

Let $\mathscr{S}^{\bar{\mu}}=\left\{\,S_{t}^{\bar{\mu}}\,:\,t\geq 0\,\right\}$ denote the ergodic Markov chain induced by policy $\bar{\mu}$ specified in Assumption 3. First, Assumption 2 states that $\mu_{\theta}^{*}(s^{*})$ is an informative action for every $\theta\in\Theta$ . Since Thompson sampling (randomly) selects one of the optimal policies $\mu_{\Theta_{t}}^{*}$ in each decision epoch, actions selected by Thompson sampling upon each visit to state $s^{*}$ must also be informative and thus $I^{\tau}(t)\geq I^{\tau}(t;s^{*})$ . By replacing the abstract policy $\mu$ with Thompson sampling in Assumption 3, we have $I^{\tau}(t;s^{*})\geq I^{\bar{\mu}}(t;s^{*})$ for all $t\geq 0$ , or in other words:

	$\displaystyle\operatorname{\mathbb{P}}_{\theta}\!\left(\frac{I^{\tau}(t)}{t}<\eta\right)$	$\displaystyle\leq\operatorname{\mathbb{P}}_{\theta}\!\left(\frac{I^{\tau}(t;s^{*})}{t}<\eta\right)$
		$\displaystyle\leq\operatorname{\mathbb{P}}_{\theta}\!\left(\frac{I^{\bar{\mu}}(t;s^{*})}{t}<\eta\right),$

which hold for all $\eta>0$ . Therefore, it remains to bound the rightmost quantity.

To this end, let $R(t)=I^{\bar{\mu}}(t;s^{*})/t$ . By Assumption 3, $\mathscr{S}^{\bar{\mu}}$ is time-homogeneous and ergodic, and hence a limiting distribution exists such that $\operatorname{\mathbb{E}}_{\theta}\!\left[R(t)\right]\rightarrow r$ as $t\to\infty$ for some $r\in(0,1)$ . More precisely, for each $\nu>0$ there exists $N\in\mathbb{N}$ (dependent on $\nu$ ) such that $\left\lvert\operatorname{\mathbb{E}}_{\theta}\!\left[R(t)\right]-r\right\rvert<\nu$ for $t\geq N$ , in which case:

	$\displaystyle\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert R(t)-r\right\rvert>2\nu\right)$
	$\displaystyle=\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert R(t)-\operatorname{\mathbb{E}}_{\theta}\!\left[R(t)\right]+\operatorname{\mathbb{E}}_{\theta}\!\left[R(t)\right]-r\right\rvert>2\nu\right)$
	$\displaystyle\leq\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert R(t)-\operatorname{\mathbb{E}}_{\theta}\!\left[R(t)\right]\right\rvert+\left\lvert\operatorname{\mathbb{E}}_{\theta}\!\left[R(t)\right]-r\right\rvert>2\nu\right)$
	$\displaystyle\leq\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert R(t)-\operatorname{\mathbb{E}}_{\theta}\!\left[R(t)\right]\right\rvert+\nu>2\nu\right)$
	$\displaystyle=\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert R(t)-\operatorname{\mathbb{E}}_{\theta}\!\left[R(t)\right]\right\rvert>\nu\right).$

Now, choosing any $\eta\in(0,r)$ and substituting $\nu=\frac{r-\eta}{2}$ , we obtain:

	$\displaystyle\operatorname{\mathbb{P}}_{\theta}\!\left(R(t)<\eta\right)$	$\displaystyle=\operatorname{\mathbb{P}}_{\theta}\!\left(R(t)-r<-(r-\eta)\right)$
		$\displaystyle\leq\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert R(t)-r\right\rvert>r-\eta\right)$
		$\displaystyle\leq\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert R(t)-\operatorname{\mathbb{E}}_{\theta}\!\left[R(t)\right]\right\rvert>\frac{r-\eta}{2}\right),$

which is valid for all $t\geq N_{1}$ where $N_{1}\in\mathbb{N}$ . Finally, we apply Proposition 4 by setting $Y_{i}=\mathbbm{1}\!\left\{S_{i}^{\bar{\mu}}=s^{*}\right\}$ , $C_{t}=I^{\bar{\mu}}(t;s^{*})$ , and $\|f\|=1$ , so that there exists $N_{2}\in\mathbb{N}$ and $\delta,\lambda>0$ dependent on $\eta$ and $\theta$ such that

\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert R(t)-\operatorname{\mathbb{E}}_{\theta}\!\left[R(t)\right]\right\rvert>\frac{r-\eta}{2}\right)\leq\delta\exp\!\left(-\lambda t\right),\quad\forall t\geq N_{2}.

Therefore, by setting $N_{\theta}=\max\left\{N_{1},N_{2}\right\}$ , we have found $\delta_{\theta},\lambda_{\theta}>0$ and $\eta_{\theta}\in(0,1)$ such that

\operatorname{\mathbb{P}}_{\theta}\!\left(I^{\tau}(t;s^{*})\geq\eta_{\theta}t\right)\geq 1-\delta_{\theta}\exp\!\left(-\lambda_{\theta}t\right),\quad\forall t>N_{\theta}.

This completes the proof. $\blacksquare$

IV-B Proof of Theorem 2

The proof begins in an identical manner to [20] but diverges after the second paragraph. We begin by writing $\pi_{t}(\theta)$ as:

$\displaystyle\pi_{t}(\theta)$	$\displaystyle=\frac{\mathcal{L}_{\theta}(\mathscr{H}_{t})\pi_{0}(\theta)}{\sum_{\vartheta\in\Theta}{\mathcal{L}_{\vartheta}(\mathscr{H}_{t})\pi_{0}(\vartheta)}}=\frac{1}{1+\displaystyle\sum_{\vartheta\not=\theta}{c_{\vartheta}\left(\frac{\mathcal{L}_{\vartheta}(\mathscr{H}_{t})}{\mathcal{L}_{\theta}(\mathscr{H}_{t})}\right)}}$
	$\displaystyle=\frac{1}{1+\displaystyle\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-\log{\left(\frac{\mathcal{L}_{\theta}(\mathscr{H}_{t})}{\mathcal{L}_{\vartheta}(\mathscr{H}_{t})}\right)}\right)}}$
	$\displaystyle=\frac{1}{1+\displaystyle\sum_{\theta\not=\vartheta}{c_{\vartheta}\exp\!\left(-\sum_{s=0}^{t}\log{\Lambda_{s}^{\vartheta}}\right)}},$	(4)

where $c_{\vartheta}\coloneqq\pi_{0}(\vartheta)/\pi_{0}(\theta)$ , and where:

	$\displaystyle\Lambda_{0}^{\vartheta}$	$\displaystyle\coloneqq 1,$
	$\displaystyle\Lambda_{i}^{\vartheta}$	$\displaystyle\coloneqq\frac{f_{\theta}(R_{i}\,\|\,S_{i-1},A_{i-1})\,p_{\theta}(S_{i}\,\|\,S_{i-1},A_{i-1})}{f_{\vartheta}(R_{i}\,\|\,S_{i-1},A_{i-1})\,p_{\vartheta}(S_{i}\,\|\,S_{i-1},A_{i-1})},\quad 0<i\leq t,$

and where the processes $\left\{\,S_{t}\,:\,t\geq 0\,\right\}$ and $\left\{\,A_{t}\,:\,t\geq 0\,\right\}$ are understood to evolve under Thompson sampling.

We define the processes $\mathscr{Z}^{\vartheta}=\left\{\,Z_{t}^{\vartheta}\,:\,t\geq 0\,\right\}$ as

Z_{t}^{\vartheta}\coloneqq\sum_{i=0}^{t}\log{\Lambda_{i}^{\vartheta}},\quad\vartheta\not=\theta\in\Theta,

and for convenience, the filtration $\mathscr{F}=\left\{\,\mathscr{F}_{t}\,:\,t\geq 0\,\right\}$ as the sequence of sigma algebras $\mathscr{F}_{t}\coloneqq\sigma(\mathscr{H}_{t},A_{t})$ ., e.g, by concatenating $A_{t}$ to $\mathscr{H}_{t}$ . We can now separate $\mathscr{Z}^{\vartheta}$ into a martingale and a predictable process using the Doob Decomposition theorem [22]:

	$\displaystyle Z_{t}^{\vartheta}$	$\displaystyle=M_{t}^{\vartheta}+P_{t}^{\vartheta}$
		$\displaystyle=\sum_{i=0}^{t}\left(\log{\Lambda_{i}^{\vartheta}}-\operatorname{\mathbb{E}}_{\theta}\!\left[\log{\Lambda_{i}^{\vartheta}}\|\mathscr{F}_{i-1}\right]\right)+\sum_{i=0}^{t}\operatorname{\mathbb{E}}_{\theta}\!\left[\log{\Lambda_{i}^{\vartheta}}\|\mathscr{F}_{i-1}\right],$

where $\mathscr{M}^{\vartheta}=\left\{\,M_{t}^{\vartheta}\,:\,t\geq 0\,\right\}$ is an $\mathscr{F}_{t}$ -martingale under $\operatorname{\mathbb{P}}_{\theta}$ by construction, with $M_{0}^{\vartheta}=0$ since $\Lambda_{0}^{\vartheta}=1$ .

Since $A_{i}$ is $\mathscr{F}_{i}$ -measurable, for every $i\in\left\{1,2\dots t\right\}$ , $\vartheta\in\Theta$ with $\vartheta\not=\theta$ , we have:

	$\displaystyle\operatorname{\mathbb{E}}_{\theta}\!\left[\log{\Lambda_{i}^{\vartheta}}\,\|\,\mathscr{F}_{i-1}\right]$
	$\displaystyle=\operatorname{\mathbb{E}}_{\theta}\!\left[\mathcal{D}\!\left(\nu_{\theta}(S_{i-1},A_{i-1})\,\\|\,\nu_{\vartheta}(S_{i-1},A_{i-1})\right)\,\|\,\mathscr{F}_{i-1}\right]$
	$\displaystyle\geq\operatorname{\mathbb{E}}_{\theta}\!\left[\min_{\vartheta\not=\theta\in\Theta}\mathcal{D}\!\left(\nu_{\theta}(S_{i-1},A_{i-1})\,\\|\,\nu_{\vartheta}(S_{i-1},A_{i-1})\right)\,\Big{\|}\,\mathscr{F}_{i-1}\right]$
	$\displaystyle\geq\operatorname{\mathbb{E}}_{\theta}\!\left[\min_{\vartheta\not=\theta}\varepsilon(\theta,\vartheta)\mathbbm{1}\!\left\{A_{i-1}\in\mathcal{I}(S_{i-1})\right\}\,\Big{\|}\,\mathscr{F}_{i-1}\right]$
	$\displaystyle\geq\varepsilon\mathbbm{1}\!\left\{A_{i-1}\in\mathcal{I}(S_{i-1})\right\},$

where the constants $\varepsilon(\theta,\vartheta)>0$ follow from the definition of an informative action. As a result, $\left\{\,P_{t}^{\vartheta}\,:\,t\geq 0\,\right\}$ is a non-decreasing process with $P_{0}^{\vartheta}=0$ and:

	$\displaystyle P_{t}^{\vartheta}$	$\displaystyle=\sum_{i=0}^{t}\operatorname{\mathbb{E}}_{\theta}\!\left[\log{\Lambda_{i}^{\vartheta}}\|\mathscr{F}_{i-1}\right]\geq\sum_{i=1}^{t}\varepsilon\mathbbm{1}\!\left\{A_{i-1}\in\mathcal{I}(S_{i-1})\right\}$
		$\displaystyle=\varepsilon I(t),$

where $I(t)\coloneqq I^{\tau}(t)$ .

We now take expectation of the posterior in (IV-B):

	$\displaystyle\operatorname{\mathbb{E}}_{\theta}\!\left[\pi_{t}(\theta)\right]$	$\displaystyle=\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-Z_{t}^{\vartheta}\right)}}\right]$
		$\displaystyle=\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-P_{t}^{\vartheta}\right)}}\right]$
		$\displaystyle\geq\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon I(t)\right)}}\right].$

For $\eta\in(0,1)$ , we define the event $\mathcal{G}_{t}^{\eta}=\{I(t)\geq\eta t\}$ and condition as follows:

	$\displaystyle\operatorname{\mathbb{E}}_{\theta}\!\left[\pi_{t}(\theta)\right]$
	$\displaystyle\geq\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon I(t)\right)}}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{G}_{t}^{\eta}\right)$
	$\displaystyle\indent+\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon I(t)\right)}}\,\Big{\|}\,(\mathcal{G}_{t}^{\eta})^{\mathsf{C}}\right]\operatorname{\mathbb{P}}_{\theta}\!\left((\mathcal{G}_{t}^{\eta})^{\mathsf{C}}\right)$
	$\displaystyle\geq\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon I(t)\right)}}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{G}_{t}^{\eta}\right)$
	$\displaystyle\geq\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{G}_{t}^{\eta}\right).$

For $\delta\in(0,\varepsilon\eta)$ , we define the event $\mathcal{B}_{t}(\delta)\coloneqq\bigcap_{\vartheta\not=\theta}\left\{\left\lvert M_{t}^{\vartheta}\right\rvert\leq\delta t\right\}$ and the scalar $d_{\theta}\coloneqq\frac{1-\pi_{0}(\theta)}{\pi_{0}(\theta)}$ . Then:

	$\displaystyle\operatorname{\mathbb{E}}_{\theta,p}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]$
	$\displaystyle=\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}}\mathbbm{1}\!\left\{\mathcal{B}_{t}(\delta)\right\}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]$
	$\displaystyle\indent+\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum{c_{\theta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}}\mathbbm{1}\!\left\{\mathcal{B}_{t}(\delta)^{\mathsf{C}}\right\}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]$
	$\displaystyle\geq\frac{\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{B}_{t}(\delta)\,\big{\|}\,\mathcal{G}_{t}^{\eta}\right)}{1+d_{\theta}\exp\!\left(-(\epsilon\eta-\delta)t\right)},$

which yields the lower bound

\displaystyle\operatorname{\mathbb{E}}_{\theta}\!\left[\pi_{t}(\theta)\right]\geq\frac{\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{B}_{t}(\delta)\,\big{|}\,\mathcal{G}_{t}^{\eta}\right)}{1+d_{\theta}\exp\!\left(-(\epsilon\eta-\delta)t\right)}\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{G}_{t}^{\eta}\right).

(5)

Next, we apply DeMorgan’s law and the union bound as follows:

$\displaystyle\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{B}_{t}(\delta)\,\big{\|}\,\mathcal{G}_{t}^{\eta}\right)$	$\displaystyle=1-\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{B}_{t}(\delta)^{\mathsf{C}}\,\big{\|}\,\mathcal{G}_{t}^{\eta}\right)$
	$\displaystyle=1-\operatorname{\mathbb{P}}_{\theta}\!\left(\bigcup_{\vartheta\not=\theta}\left\{\left\lvert M_{t}^{\vartheta}\right\rvert\leq\delta t\right\}^{\mathsf{C}}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right)$
	$\displaystyle\geq 1-\sum_{\vartheta\not=\theta}\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert M_{t}^{\vartheta}\right\rvert\geq\delta t\,\big{\|}\,\mathcal{G}_{t}^{\eta}\right).$	(6)

Using conditional probability, the summand can be further simplified to

\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert M_{t}^{\vartheta}\right\rvert>\delta t\,\big{|}\,\mathcal{G}_{t}^{\eta}\right)\leq\frac{\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert M_{t}^{\theta}\right\rvert>\delta t\right)}{\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{G}_{t}^{\eta}\right)}.

Next, we apply Proposition 1, so there exists $N$ (dependent on $\eta$ and $\theta$ ) such that for $t>N$ , $\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{G}_{t}^{\eta}\right)\geq\frac{1}{2}$ and thus

\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert M_{t}^{\vartheta}\right\rvert>\delta t\,\big{|}\,\mathcal{G}_{t}^{\eta}\right)\leq 2\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert M_{t}^{\vartheta}\right\rvert>\delta t\right).

Since $\mathcal{D}\!\left(\nu_{\theta}(S_{i-1},A_{i-1})\,\|\,\nu_{\vartheta}(S_{i-1},A_{i-1})\right)<\infty$ for all $\vartheta\not=\theta$ , $\mathscr{M}^{\theta}$ is an $\mathscr{F}_{t}$ -martingale with bounded increments, and applying Azuma’s inequality [22] obtains:

\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert M_{t}^{\vartheta}\right\rvert>\delta t\,\big{|}\,\mathcal{G}_{t}^{\eta}\right)\leq 2\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert M_{t}^{\vartheta}\right\rvert>\delta t\right)\leq 4\exp\!\left(-\frac{\delta^{2}}{2c^{2}}t\right),

which holds for $t$ large, $\eta$ small and $\delta<\varepsilon\eta$ . Plugging this into (IV-B):

	$\displaystyle\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{B}_{t}(\delta)\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right)$	$\displaystyle\geq 1-\sum_{\vartheta\not=\theta}\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert M_{t}^{\vartheta}\right\rvert>\delta t\,\big{\|}\,\mathcal{G}_{t}^{\eta}\right)$
		$\displaystyle\geq 1-4\left(\|\Theta\|-1\right)\exp\!\left(-\frac{\delta^{2}}{2c^{2}}t\right),$

and therefore (5) leads to:

	$\displaystyle\operatorname{\mathbb{E}}_{\theta}\!\left[\pi_{t}(\theta)\right]$	$\displaystyle\geq\frac{\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{B}_{t}(\delta)\,\big{\|}\,\mathcal{G}_{t}^{\eta}\right)}{1+d_{\theta}\exp\!\left(-(\epsilon\eta-\delta)t\right)}\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{G}_{t}^{\eta}\right)$
		$\displaystyle\geq\frac{1-4\left(\|\Theta\|-1\right)\exp\!\left(-\frac{\delta^{2}}{2c^{2}}t\right)}{1+d_{\theta}\exp\!\left(-(\epsilon\eta-\delta)t\right)}\left(1-\delta\exp\!\left(\lambda t\right)\right)$
		$\displaystyle\geq\frac{1-\delta_{1}\exp\!\left(-\lambda_{1}t\right)}{1+\delta_{2}\exp\!\left(-\lambda_{2}t\right)}$

for $t$ sufficiently large and some $\delta_{1},\delta_{2},\lambda_{1},\lambda_{2}>0$ dependent on $\theta$ and $p$ . Finally:

	$\displaystyle\operatorname{\mathbb{E}}_{\theta}\!\left[1-\pi_{t}(\theta)\right]$	$\displaystyle\leq 1-\frac{1-\delta_{1}\exp\!\left(-\lambda_{1}t\right)}{1+\delta_{2}\exp\!\left(-\lambda_{2}t\right)}$
		$\displaystyle=\frac{\delta_{2}\exp\!\left(-\lambda_{2}t\right)+\delta_{1}\exp\!\left(-\lambda_{1}t\right)}{1+\delta_{2}\exp\!\left(-\lambda_{2}t\right)}$
		$\displaystyle\leq\delta_{2}\exp\!\left(-\lambda_{2}t\right)+\delta_{1}\exp\!\left(-\lambda_{1}t\right)$
		$\displaystyle\leq\delta_{3}\exp\!\left(-\lambda_{3}t\right)$

for $t$ sufficiently large and some $\delta_{3},\lambda_{3}>0$ . It is easy to find $\delta_{4},\lambda_{4}>0$ so that $\operatorname{\mathbb{E}}_{\theta}\!\left[1-\pi_{t}(\theta)\right]\leq\delta_{4}\exp\!\left(-\lambda_{4}t\right)$ holds for all $t\geq 0$ . This completes the proof. $\blacksquare$

V Numerical Examples

In this section, we apply Thompson sampling (Algorithm 1) to three instantiations of PMDPs with arrival rate uncertainty and uninformative actions given in Section III-A. The goal of these experiments is to verify the asymptotically optimal regret bounds hold when Thompson sampling is applied in the online setting with uninformative actions.

The admission control problem assumes $\bar{n}=40,\,R=10,\,h=0.15,\,\beta=0.3$ and $\Theta=\left\{0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9\right\}$ . The inventory management problem assumes a single type of good and $\bar{n}=30,\,c=2,\,p=2.8,\,h=0.01$ and $\Theta=\left\{1,2,3,4,5,6,7,8,9,10\right\}$ . Finally, the dynamic pricing problem assumes $V=4,\,c=0.05,\,\beta=0.5,\,h=0.01$ , prices $p_{i}=i$ for $i=0,1\dots 5$ , and $\Theta=\left\{1,2,3,4,5\right\}$ . The prior $\pi_{0}(\theta)=1/|\Theta|$ is assigned to a uniform distribution for all problems, and relative value iteration is used to calculate $\mu_{\theta}^{*}$ for each $\theta$ . Finally, to estimate the learning rate and regret for each problem, Thompson sampling is run for $5,\!000,\!000$ sample paths, and the results across paths are averaged for each decision epoch $t$ .

The results are reported in Figure 2(d). As illustrated in plots (a)-(c), the empirical expected regret for the true parameter value (shown in black) tends to zero over decision epochs for all problems. To validate that the rate is indeed $O(T^{-1})$ , (d) illustrates the corresponding inverse of the empirical regret values, which becomes linear for large $T$ and confirms Theorem 3. The analysis was conducted on an Intel quad core processor at 2.5 GHz with 8 GB ram, with an average running time of around $10^{-3}$ seconds per sample path and parameter pair. Due to its low time complexity, Thompson sampling can be easily implemented for larger parameter spaces and longer planning horizons, which converging at the asymptotically optimal rate.

VI Conclusion

We studied parameterized MDPs described by a set of unknown parameters learned using Bayesian inference. A crucial feature of such models was the presence of “uninformative” actions, which do not provide any information about the unknown parameters and slow down the rate of learning. We contribute a set of assumptions for PMDPs under which Thompson sampling guarantees an asymptotically optimal expected regret bound of $O(T^{-1})$ , which are easily verified for many classes of problems such as queuing, inventory control, and dynamic pricing. Numerical experiments validated the theory and showed that, when our assumptions can be verified, provides a computationally efficient algorithm for solving parameterized MDPs.

References

Afèche and Ata [2013] Philipp Afèche and Barış Ata. Bayesian dynamic pricing in queueing systems with unknown delay cost characteristics. Manufacturing & Service Operations Management, 15(2):292–304, 2013.
Agrawal and Goyal [2012] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pages 39–1. JMLR Workshop and Conference Proceedings, 2012.
Araman and Caldentey [2009] Victor F Araman and René Caldentey. Dynamic pricing for nonperishable products with demand learning. Operations research, 57(5):1169–1188, 2009.
Asmussen et al. [1992] Søren Asmussen, Peter W Glynn, and Hermann Thorisson. Stationarity detection in the initial transient problem. ACM Transactions on Modeling and Computer Simulation (TOMACS), 2(2):130–157, 1992.
Aviv and Pazgal [2005] Yossi Aviv and Amit Pazgal. A partially observed markov decision process for dynamic pricing. Management science, 51(9):1400–1416, 2005.
Banjević and Kim [2019] Dragan Banjević and Michael Jong Kim. Thompson sampling for stochastic control: The continuous parameter case. IEEE Transactions on Automatic Control, 64(10):4137–4152, 2019. doi: 10.1109/TAC.2019.2895253.
Bertsimas and Perakis [2001] Dimitris J Bertsimas and Georgia Perakis. Dynamic pricing: A learning approach. Massachusetts Institute of Technology, Operations Research Center, 2001.
Besbes and Zeevi [2012] Omar Besbes and Assaf Zeevi. Blind network revenue management. Operations research, 60(6):1537–1550, 2012.
Borgs et al. [2014] Christian Borgs, Jennifer T Chayes, Sherwin Doroudi, Mor Harchol-Balter, and Kuang Xu. The optimal admission threshold in observable queues with state dependent pricing. Probability in the Engineering and Informational Sciences, 28(1):101, 2014.
Burnetas and Smith [2000] Apostolos N Burnetas and Craig E Smith. Adaptive ordering and pricing for perishable products. Operations Research, 48(3):436–443, 2000.
Chen and Frank [2001] Hong Chen and Murray Z Frank. State dependent pricing with a queue. IIE Transactions, 33(10):847–860, 2001.
Chowdhury et al. [2021] Sayak Ray Chowdhury, Aditya Gopalan, and Odalric-Ambrym Maillard. Reinforcement learning in parametric mdps with exponential families. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1855–1863. PMLR, 13–15 Apr 2021.
Ferreira et al. [2018] Kris Johnson Ferreira, David Simchi-Levi, and He Wang. Online network revenue management using thompson sampling. Operations research, 66(6):1586–1602, 2018.
Glynn and Ormoneit [2002] Peter W Glynn and Dirk Ormoneit. Hoeffding’s inequality for uniformly ergodic markov chains. Statistics & probability letters, 56(2):143–146, 2002.
Gopalan and Mannor [2015] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized markov decision processes. In Conference on Learning Theory, pages 861–898, 2015.
Harrison et al. [2012] J Michael Harrison, N Bora Keskin, and Assaf Zeevi. Bayesian dynamic pricing policies: Learning and earning under a binary prior distribution. Management Science, 58(3):570–586, 2012.
Huang et al. [1976] Cheng-Chi Huang, Dean Isaacson, and B Vinograde. The rate of convergence of certain nonhomogeneous markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 35(2):141–146, 1976.
Jain et al. [2015] Aditya Jain, Nils Rudi, and Tong Wang. Demand estimation and ordering under censoring: Stock-out timing is (almost) all you need. Operations Research, 63(1):134–150, 2015.
Keskin and Zeevi [2014] N Bora Keskin and Assaf Zeevi. Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Operations Research, 62(5):1142–1167, 2014.
Kim [2017] Michael Jong Kim. Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control, 62(12):6415–6422, 2017.
Kim and Makis [2012] Michael Jong Kim and Viliam Makis. Optimal control of a partially observable failing system with costly multivariate observations. Stochastic Models, 28(4):584–608, 2012.
Klenke [2013] A. Klenke. Probability Theory: A Comprehensive Course. Universitext. Springer London, 2013. ISBN 9781447153610.
Naor [1969] Pinhas Naor. The regulation of queue size by levying tolls. Econometrica: journal of the Econometric Society, pages 15–24, 1969.
Osband et al. [2013] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
Russo and Van Roy [2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
Russo et al. [2018] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
Thompson [1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

	$\displaystyle\operatorname{\mathbb{E}}_{\theta}\!\left[\log{\Lambda_{i}^{\vartheta}}\,\|\,\mathscr{F}_{i-1}\right]$
	$\displaystyle=\operatorname{\mathbb{E}}_{\theta}\!\left[\mathcal{D}\!\left(\nu_{\theta}(S_{i-1},A_{i-1})\,\\|\,\nu_{\vartheta}(S_{i-1},A_{i-1})\right)\,\|\,\mathscr{F}_{i-1}\right]$
	$\displaystyle\geq\operatorname{\mathbb{E}}_{\theta}\!\left[\min_{\vartheta\not=\theta\in\Theta}\mathcal{D}\!\left(\nu_{\theta}(S_{i-1},A_{i-1})\,\\|\,\nu_{\vartheta}(S_{i-1},A_{i-1})\right)\,\Big{\|}\,\mathscr{F}_{i-1}\right]$
	$\displaystyle\geq\operatorname{\mathbb{E}}_{\theta}\!\left[\min_{\vartheta\not=\theta}\varepsilon(\theta,\vartheta)\mathbbm{1}\!\left\{A_{i-1}\in\mathcal{I}(S_{i-1})\right\}\,\Big{\|}\,\mathscr{F}_{i-1}\right]$
	$\displaystyle\geq\varepsilon\mathbbm{1}\!\left\{A_{i-1}\in\mathcal{I}(S_{i-1})\right\},$

	$\displaystyle\operatorname{\mathbb{E}}_{\theta}\!\left[\pi_{t}(\theta)\right]$
	$\displaystyle\geq\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon I(t)\right)}}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{G}_{t}^{\eta}\right)$
	$\displaystyle\indent+\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon I(t)\right)}}\,\Big{\|}\,(\mathcal{G}_{t}^{\eta})^{\mathsf{C}}\right]\operatorname{\mathbb{P}}_{\theta}\!\left((\mathcal{G}_{t}^{\eta})^{\mathsf{C}}\right)$
	$\displaystyle\geq\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon I(t)\right)}}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{G}_{t}^{\eta}\right)$
	$\displaystyle\geq\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{G}_{t}^{\eta}\right).$

	$\displaystyle\operatorname{\mathbb{E}}_{\theta,p}\!\left[\frac{1}{1+\sum_{\vartheta\not=\theta}{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]$
	$\displaystyle=\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum{c_{\vartheta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}}\mathbbm{1}\!\left\{\mathcal{B}_{t}(\delta)\right\}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]$
	$\displaystyle\indent+\operatorname{\mathbb{E}}_{\theta}\!\left[\frac{1}{1+\sum{c_{\theta}\exp\!\left(-M_{t}^{\vartheta}-\varepsilon\eta t\right)}}\mathbbm{1}\!\left\{\mathcal{B}_{t}(\delta)^{\mathsf{C}}\right\}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right]$
	$\displaystyle\geq\frac{\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{B}_{t}(\delta)\,\big{\|}\,\mathcal{G}_{t}^{\eta}\right)}{1+d_{\theta}\exp\!\left(-(\epsilon\eta-\delta)t\right)},$

$\displaystyle\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{B}_{t}(\delta)\,\big{\|}\,\mathcal{G}_{t}^{\eta}\right)$	$\displaystyle=1-\operatorname{\mathbb{P}}_{\theta}\!\left(\mathcal{B}_{t}(\delta)^{\mathsf{C}}\,\big{\|}\,\mathcal{G}_{t}^{\eta}\right)$
	$\displaystyle=1-\operatorname{\mathbb{P}}_{\theta}\!\left(\bigcup_{\vartheta\not=\theta}\left\{\left\lvert M_{t}^{\vartheta}\right\rvert\leq\delta t\right\}^{\mathsf{C}}\,\Big{\|}\,\mathcal{G}_{t}^{\eta}\right)$
	$\displaystyle\geq 1-\sum_{\vartheta\not=\theta}\operatorname{\mathbb{P}}_{\theta}\!\left(\left\lvert M_{t}^{\vartheta}\right\rvert\geq\delta t\,\big{\|}\,\mathcal{G}_{t}^{\eta}\right).$	(6)