
Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning

Amin Karbasi (Yale University, [email protected]), Nikki Lijing Kuang (University of California, San Diego, [email protected]), Yi-An Ma (University of California, San Diego, [email protected]), and Siddharth Mitra (Yale University, [email protected]). Major contribution credited equally to Kuang and Mitra.
Abstract

Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance. However, many existing analytical and empirical results for TS rely on restrictive assumptions on reward distributions, such as belonging to conjugate families, which limits their applicability in realistic scenarios. Moreover, sequential decision making problems are often carried out in a batched manner, either due to the inherent nature of the problem or to serve the purpose of reducing communication and computation costs. In this work, we jointly study these problems in two popular settings, namely, stochastic multi-armed bandits (MABs) and infinite-horizon reinforcement learning (RL), where TS is used to learn the unknown reward distributions and transition dynamics, respectively. We propose batched Langevin Thompson Sampling algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches. Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of \mathcal{O}(\log T) for stochastic MABs and \mathcal{O}(\sqrt{T}) for RL. We complement our theoretical findings with experimental results.

1 Introduction

Setting | TS-based Algorithm | Batching Scheme | MCMC Method | Regret | # of Batches
Stochastic MAB | Karbasi et al. (2021) | Dynamic | - | O(\log T) | O(\log T)
Stochastic MAB | Mazumdar et al. (2020) | - | SGLD, ULA | O(\log T) | O(T)
Stochastic MAB | This paper (Algorithm 2) | Dynamic | SGLD | O(\log T) | O(\log T)
TS for RL (PSRL) | Osband et al. (2013) | - | - | O(\sqrt{T}) | O(T/H)
TS for RL (PSRL) | Ouyang et al. (2017) | Dynamic | - | O(\sqrt{T}) | O(\sqrt{T})
TS for RL (PSRL) | Theocharous et al. (2017b) | Static | - | O(\sqrt{T}) | O(\log T)
TS for RL (PSRL) | This paper (Algorithm 3) | Static | SGLD, MLD | O(\sqrt{T}) | O(\log T)
Table 1: We compare our methods with existing TS-based algorithms in terms of the batching schemes and MCMC methods adopted for approximation. Performance is measured by regret, while communication cost is quantified by the number of batches. Here, T is the time horizon and H is the fixed episode length in episodic MDP settings. Our methods achieve optimal performance while reducing computation and communication costs due to batching, and are applicable in broader regimes.

Modern machine learning often needs to balance computation and communication budgets with statistical guarantees. Existing analyses of sequential decision making have primarily focused on the statistical aspects of the problem (Tian et al., 2020); less is known about its computation and communication aspects. In particular, regret minimization in multi-armed bandits (MABs) and reinforcement learning (RL) (Jaksch et al., 2010; Wu et al., 2022; Jung et al., 2019) is often studied under the common assumptions that computation can be performed perfectly in time and that communication always happens in real time (Li et al., 2022; Jin et al., 2018; Haarnoja et al., 2018).

A question of particular importance is whether optimal decisions can still be made under reasonable computation and communication budgets. In this work, we study the exploration-exploitation problem with low computation and communication costs using Thompson Sampling (TS) (Thompson, 1933), also known as posterior sampling. To allow sampling from distributions that deviate from standard restrictive assumptions, and to enable deployment in settings where computing the exact posterior is challenging, we employ Markov Chain Monte Carlo (MCMC) methods.

TS operates by maintaining a posterior distribution over the unknown quantity and is widely used owing to its strong empirical performance (Chapelle and Li, 2011). However, the theoretical understanding of TS in bandits has typically relied on restrictive conjugacy assumptions between priors and reward distributions. Recently, approximate TS methods for general posterior distributions have started to be studied (Mazumdar et al., 2020; Xu et al., 2022), where MCMC algorithms are used in conjunction with TS to expand its applicability in fully-sequential settings. On the other hand, to the best of our knowledge, how to provably incorporate MCMC with posterior sampling in RL domains remains an open problem.

Moreover, previous analyses of TS have been restricted to fully-sequential settings, where feedback or reward is assumed to be immediately observable upon taking an action (i.e., before making the next decision). This assumption fails to account for practical settings where delays take place in communication, or where feedback is only available in an aggregate or batched manner. Examples include clinical trials, where feedback about the efficacy of medication is only available after a nontrivial amount of time; recommender systems, where feedback from multiple users comes all at once; and marketing campaigns (Schwartz et al., 2017). This issue has been studied in the literature by considering static or dynamic batching schemes and by designing algorithms that acquire feedback in batches, where the learner typically receives reward information only at the end of a batch (Karbasi et al., 2021; Kalkanli and Ozgur, 2021; Vernade et al., 2020; Zhang et al., 2020). Nonetheless, the analysis of approximate TS in batched settings is unavailable for both bandits and RL.

In this paper, we tackle these challenges by combining TS with Langevin Monte Carlo (LMC) and batching schemes in stochastic MABs and infinite-horizon RL. Our algorithms are applicable to a broad class of distributions with only logarithmic rounds of communication between the learner and the environment, and are thus robust to communication constraints. We compare our results with other works in Table 1, and summarize our main contributions as follows:

  • For stochastic MABs with time horizon T, we present Batched Langevin Thompson Sampling (BLTS, Algorithm 2) along with Theorem 5.2, which achieves the optimal O(\log T) regret with O(\log T) batches (a T-round game can be thought of as T batches, each of size 1). The main technical contribution here is to show that when feedback is obtained in a batched manner, where posterior concentration is weaker (Theorem 1), the convergence guarantee of SGLD continues to hold.

  • For large-scale infinite-horizon MDPs, we present Langevin Posterior Sampling for RL (LPSRL, Algorithm 3) along with Theorem 6.2 to show that SGLD with a static policy-switching scheme (in MDP settings, the notion of a batch is more appropriately thought of as a policy switch) achieves the optimal O(\sqrt{T}) Bayesian regret with O(\log T) policy switches. For tabular MDPs, we show that LPSRL combined with Mirrored Langevin Dynamics (MLD) achieves the optimal O(\sqrt{T}) Bayesian regret with O(\log T) policy switches. The use of approximate sampling leads to an additive error term, since the true model and the sampled model are no longer identically distributed. This error can be properly handled with the convergence guarantees of LMC methods.

  • Experiments are performed to demonstrate the effectiveness of our algorithms, which maintain the order-optimal regret with significantly lower communication costs compared to existing exact TS methods.

2 Problem Setting

In this section, we introduce the problem setting with relevant background information.

2.1 Stochastic Multi-armed Bandits

We consider the N-armed stochastic multi-armed bandit problem, where the set of arms is denoted by \mathcal{A}=[N]=\{1,2,\dots,N\}. Let T be the time horizon of the game. At t=1,2,\dots,T, the learner chooses an arm a_t \in \mathcal{A} and receives a real-valued reward r_{a_t} drawn from a fixed, unknown, parametric distribution corresponding to arm a_t. In the standard fully-sequential setup, the learner observes rewards immediately. Here, we consider the more general batched setting, where the learner observes the rewards for all timesteps within a batch at the end of it. We use B_k to denote the starting time of the k-th batch, B(t) to represent the starting time of the batch that contains time t, and K as the total number of batches. The learner observes the set of rewards \{r_{a_t}\}_{t=B_k}^{B_{k+1}-1} at the end of the k-th batch. Note that the batched setting reduces to the fully-sequential setting when the number of batches is T, each with size 1.

Suppose that for each arm a there exists a parametric reward distribution parameterized by \theta_a \in \mathbb{R}^d such that the true reward distribution is given by p_a(r)=p_a(r|\theta^*_a), where \theta^*_a is an unknown parameter. (Our results hold for the more general case of \theta_a \in \mathbb{R}^{d_a}, but for simplicity of exposition, we take the ambient dimension of the parameters to be the same for every arm.) To ensure meaningful results, we impose the following assumptions on the reward distributions for all a \in \mathcal{A}:

  • Assumption 1: \log p_a(r|\theta_a) is L-smooth and m-strongly concave in \theta_a.

  • Assumption 2: p_a(r|\theta_a^*) is \nu-strongly log-concave in r, and \nabla_\theta \log p_a(r|\theta^*_a) is L-Lipschitz in r.

  • Assumption 3: The prior \lambda_a(\theta_a) is concave with L-Lipschitz gradients for all \theta_a.

  • Assumption 4: Joint Lipschitz smoothness of (the bivariate) \log p_a(r|\theta_a) in r and \theta_a.

These properties include log-concavity and Lipschitz smoothness of the parametric families and prior distributions, which are standard assumptions in the existing literature (Mazumdar et al., 2020) and are satisfied by models such as Gaussian bandits (Honda and Takemura, 2013). For the sake of brevity, we provide the precise mathematical statements of these assumptions in Appendix B.

Let \mu_a denote the expected value of the true reward distribution for arm a. The goal of the learner is to minimize the expected regret, which is defined as follows:

R(T) := \mathbb{E}\left[\sum_{t=1}^{T} \mu^* - \mu_{a_t}\right] = \sum_{a\in\mathcal{A}} \Delta_a\,\mathbb{E}\left[k_a(T)\right], \qquad (1)

where \mu^* = \max_{a\in\mathcal{A}} \mu_a, \Delta_a = \mu^* - \mu_a, and k_a(t) represents the number of times arm a has been played up to time t. Without loss of generality, we assume that arm 1 is the best arm. We discuss the MAB setting in Section 5.
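Note that the second equality in (1) follows by grouping timesteps according to the arm played, since each play of arm a incurs instantaneous regret \Delta_a:

\mathbb{E}\Big[\sum_{t=1}^{T}(\mu^* - \mu_{a_t})\Big] = \mathbb{E}\Big[\sum_{a\in\mathcal{A}}\sum_{t=1}^{T}\mathbb{1}\{a_t = a\}\,\Delta_a\Big] = \sum_{a\in\mathcal{A}}\Delta_a\,\mathbb{E}[k_a(T)].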

2.2 Infinite-horizon Markov Decision Processes

We focus on average-reward MDPs with infinite horizon (Jaksch et al., 2010; Wei et al., 2021), a setting that is underexplored compared to the episodic one. It is a more realistic model for real-world tasks, such as robotics and financial-market decision making, where state resets are not possible. Specifically, we consider an undiscounted weakly communicating MDP (\mathcal{S}, \mathcal{A}, p, \mathcal{R}) with infinite time horizon (Ouyang et al., 2017; Theocharous et al., 2017b); it is known that weakly communicating MDPs satisfy the Bellman optimality equation. Here \mathcal{S} is the state space, \mathcal{A} is the action space, p represents the parameterized transition dynamics, and \mathcal{R}: \mathcal{S}\times\mathcal{A} \to \mathbb{R} is the reward function. We assume that \theta \in \mathbb{R}^d parameterizes the transition dynamics and that there exists a true unknown \theta^* governing the next state of the learner. At each time t, the learner is in state s_t, takes action a_t, and transitions into the next state s_{t+1}, which is drawn from p(\cdot|s_t, a_t, \theta^*).

We consider two sub-settings based on the parameterization of the transition dynamics: the General Parameterization and the Simplex Parameterization. These sub-settings require different assumptions and setups, which we elaborate on in their respective sections (Section 6.2 and Section 6.3). In the MDP context, the notion of a batch is more appropriately thought of as a policy switch. Therefore, B_k now represents the starting time of the k-th policy switch, and we additionally define T_k as the number of time steps between policy switch k and k+1. We consider stationary and deterministic policies, which are mappings from \mathcal{S} \to \mathcal{A}. Let \pi_k be the policy followed by the learner after the k-th policy switch. When the decision to update and obtain the k-th policy is made, the learner uses the observed data \{s_t, a_t, \mathcal{R}(s_t,a_t), s_{t+1}\}_{t=B_k}^{B_{k+1}-1} collected after the (k-1)-th policy switch to sample from the updated posterior and compute \pi_k. The goal of the learner is to maximize the long-term average reward:

J^\pi(\theta) = \mathbb{E}\left[\limsup_{T\rightarrow\infty} \frac{1}{T}\sum_{t=1}^{T} \mathcal{R}(s_t, a_t)\right].

Similar to other works (Ouyang et al., 2017; Osband et al., 2013; Theocharous et al., 2017b), we measure performance using the Bayesian regret (in which the expectation is taken with respect to the prior distribution of the true parameter \theta^*, the randomness of the algorithm, and the transition dynamics), defined by:

R_B(T) := \mathbb{E}\left[\sum_{t=1}^{T}\big(J^{\pi^*}(\theta^*) - \mathcal{R}(s_t, a_t)\big)\right], \qquad (2)

where J^{\pi^*}(\theta^*) denotes the long-term average reward of running the optimal policy under the true model.

It is known that weakly communicating MDPs satisfy the following Bellman optimality equation (Bertsekas, 2012; Ouyang et al., 2017; Wei et al., 2021) in the infinite-horizon setting, and that there exists some positive number H such that the span (Definition 2) satisfies sp(h(\theta)) \leq H for all \theta \in \mathbb{R}^d.

Lemma 1 (Bellman Optimality).

There exist an optimal average reward J \in \mathbb{R} and a bounded measurable function h: \mathcal{S} \rightarrow \mathbb{R} such that for any s \in \mathcal{S} and \theta \in \mathbb{R}^d, the Bellman optimality equation holds:

J(\theta) + h(s, \theta) = \max_{a\in\mathcal{A}}\Big\{\mathcal{R}(s,a) + \mathbb{E}_{s^\prime \sim p(\cdot|s,a;\theta)}[h(s^\prime, \theta)]\Big\}. \qquad (3)

Here J(\theta) = \max_\pi J^\pi(\theta) is the optimal average reward under \theta and is independent of the initial state. The function h^\pi(s,\theta) = \lim_{T\rightarrow\infty} \mathbb{E}\big[\sum_{t=1}^{T}\big(\mathcal{R}(s_t, \pi(s_t)) - J^\pi(\theta)\big) \,\big|\, s_1 = s\big] quantifies the bias of policy \pi with respect to the average reward under \theta, and h(s,\theta) = h^{\pi^*}(s,\theta), where \pi^* = \operatorname{argmax}_\pi J^\pi(\theta).

Definition 2.

For any \theta \in \mathbb{R}^d, the span of an MDP is defined as sp(h(\theta)) := \sup_{s,s^\prime \in \mathcal{S}} |h(s,\theta) - h(s^\prime,\theta)| = \max_{s\in\mathcal{S}} h(s,\theta) - \min_{s\in\mathcal{S}} h(s,\theta).
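For the planning step used later (computing an optimal policy for a sampled tabular model), one standard choice is relative value iteration applied to the Bellman optimality equation (3). The following is a minimal Python sketch under assumed inputs, namely a sampled transition tensor P[a, s, s'] and a reward table R[s, a]; it is an illustrative planner, not necessarily the one used in the paper's experiments.

```python
import numpy as np

def relative_value_iteration(P, R, tol=1e-8, max_iters=10_000):
    """Solve the average-reward Bellman optimality equation (3) for a tabular MDP.

    P[a, s, s'] is the (sampled) transition tensor and R[s, a] the reward table;
    both are assumed inputs.  Returns the gain J, the bias h, and a greedy policy.
    """
    n_actions, n_states, _ = P.shape
    h = np.zeros(n_states)
    for _ in range(max_iters):
        # Q(s, a) = R(s, a) + E_{s' ~ p(.|s, a)}[h(s')]
        Q = R + np.einsum('asp,p->sa', P, h)
        h_new = Q.max(axis=1)
        h_new = h_new - h_new[0]           # subtract a reference state to keep h bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    Q = R + np.einsum('asp,p->sa', P, h)
    J = (Q.max(axis=1) - h)[0]             # gain recovered at the fixed point
    policy = Q.argmax(axis=1)              # greedy stationary deterministic policy
    return J, h, policy
```

The normalization against a reference state keeps the iterates bounded, which mirrors the role of the span bound sp(h(\theta)) \leq H in the analysis.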

3 Related Work

Under conjugacy assumptions on rewards, the asymptotic convergence of TS in stochastic MABs was studied by Granmo (2010) and May et al. (2012). Later, finite-time analyses with O(\log T) problem-dependent regret bounds were provided (Agrawal and Goyal, 2012; Kaufmann et al., 2012; Agrawal and Goyal, 2013). However, in practice, exact posteriors are intractable for all but the simplest models (Riquelme et al., 2018), necessitating the use of approximate sampling methods with TS in complex problem domains. Recent progress has been made in understanding approximate TS in fully-sequential MABs (Lu and Van Roy, 2017; Mazumdar et al., 2020; Zhang, 2022; Xu et al., 2022). On the other hand, the question of learning with TS in the presence of batched data has evolved along a separate line of work (Karbasi et al., 2021; Kalkanli and Ozgur, 2021; Vernade et al., 2020; Zhang et al., 2020). However, provably performing Langevin TS in batched settings remains unexplored, and in this paper we aim to bridge these two lines.

Moving to the more complex decision-making frameworks based on MDPs, TS is employed in model-based methods to learn transition models, which is known as Posterior Sampling for Reinforcement Learning (PSRL) (Strens, 2000). When exact posteriors are intractable, MCMC methods have been studied empirically for performing Bayesian inference in policy and reward spaces in RL (Brown et al., 2020; Imani et al., 2018; Bojun, 2020; Guez et al., 2014). MCMC is a family of approximate posterior inference methods that enables sampling without exact knowledge of the posterior (Ma et al., 2015; Welling and Teh, 2011). However, it has been unclear how to provably incorporate MCMC methods into learning transition models for RL.

Furthermore, the analysis of undiscounted infinite-horizon MDPs (Abbasi-Yadkori and Szepesvári, 2015; Osband and Van Roy, 2016; Ouyang et al., 2017; Wei et al., 2020, 2021) poses greater challenges compared to the well-studied episodic MDPs with finite horizon and fixed episode length (Osband et al., 2013). Previous works on infinite-horizon settings include model-based methods that estimate environment dynamics and switch policies whenever the number of visits to some state-action pair doubles (Jaksch et al., 2010; Tossou et al., 2019; Agrawal and Jia, 2017; Bartlett and Tewari, 2012). Under such dynamic schemes, however, the number of policy switches can be as large as O(\sqrt{T}), making them computationally heavy and infeasible for continuous states and actions. To enable TS with logarithmic policy switches while maintaining optimal regret, we build upon an algorithmically independent static scheme as in Theocharous et al. (2017b), and incorporate Langevin Monte Carlo (LMC) methods to sample from inexact posteriors.

4 SGLD for Langevin Thompson Sampling

In the MAB and MDP settings, \theta parameterizes the unknown reward or transition distribution, respectively. TS maintains a distribution over the parameters and updates this distribution to the new posterior upon receiving new data. Given the likelihood p(X|\theta), prior \lambda(\theta), and n data samples \{X_i\}_{i=1}^n (depending on the setting, the data can be rewards for some arm or observed transitions of state-action pairs), let \rho_n be the posterior distribution after receiving n data samples, which satisfies \rho(\theta|\{X_i\}_{i=1}^n) \propto \exp\big(\sum_{i=1}^n \log p(X_i|\theta) + \log\lambda(\theta)\big). In addition, consider the scaled posterior \rho_n[\gamma] for some scaling parameter \gamma, which denotes the density proportional to \exp\big(\gamma(\sum_{i=1}^n \log p(X_i|\theta) + \log\lambda(\theta))\big).

The introduction of MCMC methods arises from the need to sample from intractable posteriors in the absence of conjugacy assumptions. We resort to a gradient-based MCMC method that performs noisy updates based on Langevin dynamics: Stochastic Gradient Langevin Dynamics (SGLD). Algorithm 1 presents SGLD with batched data, which generates samples from an approximation of the true posterior. For a detailed exposition, please refer to Welling and Teh (2011); Ma et al. (2015) and Appendix A. Algorithm 1 takes all data \{X_s\}_{s=1}^n available at the start of a batch b as input, subsamples the data, performs gradient updates by computing \nabla\widehat{U}(\theta) = -\frac{n}{|D|}\sum_{X_s\in D}\nabla\log p(X_s|\theta) - \nabla\log\lambda(\theta), and outputs the posterior sample for batch b.

Input: prior \lambda(\theta), data \{X_s\}_{s=1}^n, sample from last batch \theta^{b-1}, total iterations N, learning rate \eta, parameter L, scaling parameter \gamma.
Initialization: \theta_0 \leftarrow \theta^{b-1}
for i=1,\dots,N do
      Subsample D \subseteq \{X_s\}_{s=1}^n
      Compute \nabla\widehat{U}(\theta_{i\eta}) over D
      Sample \theta_{(i+1)\eta} \sim \mathcal{N}(\theta_{i\eta} - \eta\nabla\widehat{U}(\theta_{i\eta}),\ 2\eta I)
Output: \theta^b \sim \mathcal{N}\big(\theta_{N\eta},\ \frac{1}{nL\gamma} I\big)
Algorithm 1: SGLD with Batched Data
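To make the update concrete, below is a minimal Python sketch of Algorithm 1 for a single one-dimensional parameter; the function name, the unit-variance Gaussian likelihood, the Gaussian prior, and the fixed subsample size are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def sgld_batch_sample(X, theta_prev, n_iters, eta, L, gamma,
                      prior_mean=0.0, prior_std=10.0, batch_size=32, rng=None):
    """Minimal sketch of Algorithm 1 (SGLD with batched data) for a 1-d parameter.

    Assumes a Gaussian likelihood r ~ N(theta, 1) and a Gaussian prior
    N(prior_mean, prior_std^2); both modeling choices are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    n = len(X)
    if n == 0:                         # no data yet: fall back to the previous sample
        return theta_prev
    theta = theta_prev
    for _ in range(n_iters):
        # Subsample a mini-batch D from the data available at the start of the batch.
        D = rng.choice(X, size=min(batch_size, n), replace=False)
        # Stochastic gradient of U(theta) = -sum_i log p(X_i|theta) - log lambda(theta).
        grad_loglik = np.sum(D - theta)                       # d/dtheta of log N(X_i; theta, 1)
        grad_logprior = -(theta - prior_mean) / prior_std**2
        grad_U = -(n / len(D)) * grad_loglik - grad_logprior
        # Langevin step: gradient descent on U plus Gaussian noise of variance 2*eta.
        theta = theta - eta * grad_U + np.sqrt(2.0 * eta) * rng.standard_normal()
    # Output step of Algorithm 1: extra Gaussian noise with variance 1/(n * L * gamma).
    return theta + rng.standard_normal() / np.sqrt(n * L * gamma)
```

For instance, `sgld_batch_sample(rewards, theta_prev=0.0, n_iters=100, eta=1e-3, L=1.0, gamma=0.01)` would produce one approximate posterior sample for a single arm from its observed rewards.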

In the batched setting, new data is received at the end of a batch, or when making the decision to perform a new policy switch. Because of the way the learner receives new data, and because the amount of data in a batch may grow exponentially (the data received in batch k can be double that of the previous batch), the posterior concentrates more slowly. This differs from the fully-sequential problem, where the distribution shift between successive true posteriors is small owing to data being received one observation at a time. We show that in batched settings, with only constant computational complexity in terms of iterations, SGLD provides the same strong convergence guarantee as in the fully-sequential setting (Mazumdar et al., 2020). Theorem 1 shows that convergence of SGLD in the Wasserstein-p distance can be achieved with a constant number of iterations and data samples.

Theorem 1 (SGLD convergence).

Suppose that the parametric reward/transition families, priors, and true reward/transition distributions satisfy Assumptions B-B. Let \kappa := \max\{L/m, L/\nu\}, |D| = O(\kappa^2), \eta = O(1/n\kappa L), and N = O(\kappa^2). Then for any \delta \in (0,1), the following holds with probability \geq 1-\delta:

W_p(\tilde{\rho}_n, \rho_n) \leq \sqrt{\frac{12}{nm}}\,\big(d + \log Q + (32 + 8d\kappa^2)p\big)^{1/2}

for all p \geq 2, where Q := \max_\theta \frac{\lambda(\theta)}{\lambda(\theta^*)} measures the quality of the prior distribution.

Here \rho_n denotes the true posterior corresponding to n data samples, and \tilde{\rho}_n is the approximate posterior output by Algorithm 1. We also note that similar concentration bounds can be achieved by using the Unadjusted Langevin Algorithm (ULA) with batched data, which adopts full-batch gradient evaluations and therefore incurs a growing iteration complexity. The proof of Theorem 1 is adapted to the batched setting, which differs from Mazumdar et al. (2020).

5 Batched Langevin Thompson Sampling for Bandits

In this section, we introduce Langevin Thompson Sampling for the batched stochastic MAB setting in Algorithm 2, namely BLTS. It leverages SGLD and batching schemes to learn a wide class of unknown reward distributions while reducing communication and computation costs. We have discussed the guarantees of SGLD in Section 4 for both MABs and MDPs. Here, we focus on the batching strategy of Algorithm 2 for bandits and discuss the resulting regret guarantee.

5.1 Dynamic Doubling Batching Scheme

BLTS keeps track of the number of times each arm a has been played up to time t with k_a(t). Initially, all \{k_a\}_{a\in\mathcal{A}} are set to 0. The size of each batch is determined by \{k_a\}_{a\in\mathcal{A}} and the corresponding integers \{l_a\}_{a\in\mathcal{A}}. Once k_a reaches 2^{l_a} for some arm a, BLTS terminates the current batch, collects all rewards from the batch in a single request, and increases l_a by 1. BLTS thus starts a new batch whenever some arm has been played twice as many times as in the previous batch, which results in growing batch sizes. As the decision to move on to the next batch depends on the sequence of arms played, the scheme is considered "dynamic". This batching scheme is similar to the one used in Karbasi et al. (2021). The total number of batches that BLTS carries out satisfies the following theorem, whose proof can be found in Appendix D.

Theorem 5.1. BLTS ensures that the total number of batches is at most O(N\log T), where N = |\mathcal{A}|.

Gao et al. (2019) showed that \Omega(\log T/\log\log T) batches are required to achieve the optimal logarithmic dependence on the time horizon T in the batched MAB problem. The dependence on T in the number of batches BLTS requires is therefore at most a factor of \log\log T away from optimal. We now state and discuss the BLTS algorithm.

5.2 Regret of BLTS Algorithm

In Algorithm 2, denote by \theta_a^k the output of Algorithm 1 for arm a at batch k. At the end of each batch, new data is acquired all at once and the posterior is updated. It is important to note that when we run Algorithm 1 for an arm upon receiving new data, only that arm's data is fed into Algorithm 1. For each a \in \mathcal{A}, we assume the existence of a linear map \alpha_a such that \mathbb{E}_{X\sim p_a(X|\theta_a)}[X] = \alpha_a^\intercal \theta_a for all \theta_a \in \mathbb{R}^d, where \|\alpha_a\| is bounded. Theorem 5.2 states the regret guarantee of BLTS.

Theorem 5.2. Assume that the parametric reward families, priors, and true reward distributions satisfy Assumptions 1 through 4 for each arm a \in \mathcal{A}. Then, with the SGLD parameters specified as per Algorithm 1 and with \gamma = O(1/d\kappa^3) (for \kappa := \max\{L/m, L/\nu\}), BLTS satisfies:

R(T) \leq \sum_{a>1}\frac{C\sqrt{Q_1}}{m\Delta_a}\left(d + \log Q_1 + d\kappa^2\log T + d^2\kappa^2\right) + \frac{C}{m\Delta_a}\left(d + \log Q_a + d^2\kappa^2\log T\right) + 4\Delta_a,

where C is a constant and Q_a := \max_\theta \frac{\lambda_a(\theta)}{\lambda_a(\theta^*)}. The total number of SGLD iterations used by BLTS is O(\kappa^2 N\log T).

Discussion

We show that BLTS achieves the optimal O\big(\frac{\log T}{\Delta}\big) regret bound with exponentially fewer rounds of communication between the learner and the environment. The result of Theorem 5.2 relies on both the statistical guarantee provided by SGLD and the design of our batching scheme. In the batched setting, one must carefully consider the trade-off between batch size and the number of batches. While it is desirable to reuse the existing posterior for sampling within a batch, the batching scheme must also ensure that new data is collected in time to avoid significant distribution shifts. In addition, the use of SGLD allows BLTS to be applicable in a wide range of general settings with a low computation cost of O(\kappa^2 N\log T).

In the regret bound of Theorem 5.2, Q_a measures the quality of the prior for arm a. Specifically, if the prior is properly centered so that its mode is at \theta_a^*, or if the prior is uninformative or flat everywhere, then \log Q_a = 0. In Section 7, we show that using either favorable priors or uninformative priors yields empirical performance similar to existing methods.

Input: priors \lambda_a(\theta) for all a \in \mathcal{A}, scaling parameter \gamma, inputs for the SGLD subroutine N, \eta, L.
Initialization: k_a \leftarrow 0, l_a \leftarrow 0, n_a \leftarrow 0, \tilde{\rho}_{a,k} = \tilde{\rho}_{a,0} = \lambda_a for all a \in \mathcal{A}, batch index k \leftarrow 0.
for t=1,\dots,T do
      Sample \theta_{a,t} \sim \mathcal{N}\big(\theta_a^k,\ \frac{1}{n_a L\gamma} I\big) for all a \in \mathcal{A}
      Choose action a_t = \operatorname{argmax}_{a\in\mathcal{A}} \alpha_a^\intercal \theta_{a,t}
      Update k_{a_t} \leftarrow k_{a_t} + 1
      if k_{a_t} = 2^{l_{a_t}} then
            l_{a_t} \leftarrow l_{a_t} + 1
            Terminate batch k and observe rewards \{r_{a_i}\}_{i=B_k}^{t}
            for a \in \mathcal{A} do
                  Update n_a with the number of new samples
                  Run Algorithm 1 to obtain \tilde{\rho}_{a,k+1} and \theta_a^{k+1}
            Update batch index k \leftarrow k+1
Algorithm 2: Batched Langevin Thompson Sampling (BLTS)
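The control flow of Algorithm 2 is summarized in the schematic Python sketch below. Here `pull_arm` is an assumed environment callback, and the per-arm Gaussian posterior update is an illustrative stand-in for the SGLD subroutine of Algorithm 1; neither is part of the paper's specification.

```python
import numpy as np

def blts_sketch(pull_arm, n_arms, horizon, rng=None):
    """Schematic of the BLTS control flow (Algorithm 2)."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.zeros(n_arms, dtype=int)        # k_a: number of plays of arm a
    l = np.zeros(n_arms, dtype=int)        # l_a: doubling exponents
    data = [[] for _ in range(n_arms)]     # rewards observed so far, per arm
    pending = []                           # (arm, reward) pairs withheld until the batch ends
    post_mean = np.zeros(n_arms)           # stand-in posterior means
    post_std = np.ones(n_arms)             # stand-in posterior standard deviations

    for t in range(horizon):
        theta = post_mean + post_std * rng.standard_normal(n_arms)  # Thompson sample
        a = int(np.argmax(theta))
        pending.append((a, pull_arm(a)))   # reward is not revealed to the learner yet
        k[a] += 1
        if k[a] == 2 ** l[a]:              # dynamic doubling rule: terminate the batch
            l[a] += 1
            for arm, r in pending:         # rewards of the batch arrive all at once
                data[arm].append(r)
            pending = []
            for arm in range(n_arms):      # refresh each arm's (approximate) posterior
                n = len(data[arm])
                if n > 0:
                    post_mean[arm] = np.mean(data[arm])
                    post_std[arm] = 1.0 / np.sqrt(n)
    return post_mean
```

In a full implementation, the posterior refresh inside the `if` block would call the SGLD sketch given after Algorithm 1 on that arm's data only, matching the per-arm updates of BLTS.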

6 Batched Langevin Posterior Sampling For RL

In RL frameworks, posterior sampling is commonly used in model-based methods to learn unknown transition dynamics and is known as PSRL (we also depart from the TS terminology in the RL setting and stick to the more common name, posterior sampling). In infinite-horizon settings, PSRL operates by sampling a model and solving for an optimal policy of the sampled MDP at the beginning of each policy switch. The learner then follows the same policy until the next policy switch. In this context, the concept of a batch corresponds to a policy switch.

Previous analyses of PSRL have primarily focused on transition distributions that conform to well-behaved conjugate families. Handling transitions that deviate from these families and computing the corresponding posteriors has been left heuristically to MCMC methods. Here, we provably extend PSRL with LMC and introduce Langevin Posterior Sampling for RL (LPSRL, Algorithm 3) using a static doubling policy-switch scheme. Analyses of PSRL have crucially relied on the true transition dynamics \theta^* and the sampled MDPs being identically distributed (Osband et al., 2013; Osband and Van Roy, 2016; Russo and Van Roy, 2014; Ouyang et al., 2017; Theocharous et al., 2017b). However, when the dynamics are sampled from an approximation of the true posterior, this no longer holds. To address this issue, we introduce the Langevin posterior sampling lemma (Lemma 6), which shows that approximate sampling yields an additive error in the Wasserstein-1 distance.

Lemma 6 (Langevin Posterior Sampling). Let t_k be the beginning time of policy switch k, \mathcal{H}_{t_k} := \{s_\tau, a_\tau\}_{\tau=1}^{t_k} be the history of observed states and actions up to time t_k, and \theta^k \sim \tilde{\rho}_{t_k} be the model sampled from the approximate posterior \tilde{\rho}_{t_k} at time t_k. Then, for any \sigma(\mathcal{H}_{t_k})-measurable function f that is 1-Lipschitz, it holds that:

\Big|\mathbb{E}[f(\theta^*)|\mathcal{H}_{t_k}] - \mathbb{E}[f(\theta^k)|\mathcal{H}_{t_k}]\Big| \leq W_1(\tilde{\rho}_{t_k}, \rho_{t_k}). \qquad (4)

By the tower rule, \big|\mathbb{E}[f(\theta^*)] - \mathbb{E}[f(\theta^k)]\big| \leq W_1(\tilde{\rho}_{t_k}, \rho_{t_k}).
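The inequality in (4) is an instance of Kantorovich-Rubinstein duality: conditioned on \mathcal{H}_{t_k}, the true parameter \theta^* is distributed according to the exact posterior \rho_{t_k} (the standard posterior sampling argument), while \theta^k \sim \tilde{\rho}_{t_k}, so for any 1-Lipschitz f,

\mathbb{E}[f(\theta^*)|\mathcal{H}_{t_k}] - \mathbb{E}[f(\theta^k)|\mathcal{H}_{t_k}] = \int f\,\mathrm{d}\rho_{t_k} - \int f\,\mathrm{d}\tilde{\rho}_{t_k} \leq \sup_{\|g\|_{\mathrm{Lip}}\leq 1}\Big(\int g\,\mathrm{d}\rho_{t_k} - \int g\,\mathrm{d}\tilde{\rho}_{t_k}\Big) = W_1(\tilde{\rho}_{t_k}, \rho_{t_k}),

and applying the same bound to -f gives the absolute value in (4).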

As demonstrated later, this error term can be effectively controlled and does not impact the overall regret order (Theorems 6.2 and 6.3). It only requires the average reward function J^\pi(\theta) to be 1-Lipschitz, as specified in Assumption B (the mathematical statement is in Appendix B). We consider the parameterization of the transition dynamics p by \theta \in \mathbb{R}^d, where \theta^* \in \mathbb{R}^d denotes the true (unknown) parameter governing the dynamics, and explore two distinct settings based on this parameterization:

  • General Parameterization (Section 6.2): In this setting, we model the full transition dynamics using \theta^* \in \mathbb{R}^d, where d \ll |\mathcal{S}||\mathcal{A}|. This parameterization is particularly useful for tackling large-scale MDPs with large (or even continuous) state and action spaces. Towards this end, we consider \mathcal{S} \cong \mathbb{R}. Examples of the General Parameterization include linear MDPs with feature mappings (Jin et al., 2020), RL with general function approximation (Yang et al., 2020), and low-dimensional structures that govern the transitions (Gopalan and Mannor, 2015; Yang and Wang, 2020). We provide a real-world example that adopts such a parameterization in Appendix E.3.

    Although Theocharous et al. (2017b) study a similar setting, their work confines the parameter space to \mathbb{R}. To accommodate a broader class of MDPs, we generalize the parameter space to \mathbb{R}^d. As Theorem 6.2 shows, our algorithm retains the optimal O(\sqrt{T}) regret with O(\log T) policy switches, making it applicable to a wide range of general transition dynamics.

  • Simplex Parameterization (Section 6.3): Here, we consider classical tabular MDPs with finite states and actions. For each state-action pair, there exists a probability simplex \Delta^{|\mathcal{S}|} that encodes the likelihood of transitioning into each state. Hence, in this case, \theta^* \in \mathbb{R}^d with d = |\mathcal{S}|^2|\mathcal{A}|. This structure necessitates sampling transition dynamics from constrained distributions, which naturally leads us to instantiate LPSRL with Mirrored Langevin Dynamics (Hsieh et al., 2018) (see Appendix A for more discussion). As proven in Theorem 6.3, LPSRL with MLD achieves the optimal O(\sqrt{T}) regret with O(\log T) policy switches for general transition dynamics subject to the probability simplex constraints.

6.1 The LPSRL Algorithm

LPSRL (Algorithm 3) uses SamplingAlg as a subroutine, with SGLD or MLD invoked depending on the parameterization. Unlike the BLTS algorithm in the bandit setting, LPSRL adopts a static doubling batching scheme, in which the decision to move on to the next batch does not depend on the running statistics of the algorithm and is thus algorithmically independent.

Let t_k be the starting time of policy switch k and let T_k := 2^{k-1} represent the number of time steps between policy switches k and k+1. At the beginning of each policy switch k, we use SamplingAlg to obtain an approximate posterior distribution \tilde{\rho}_{t_k} and sample dynamics \theta^k from \tilde{\rho}_{t_k}. A policy \pi_k is then computed for \theta^k with any planning algorithm (we assume the optimality of the computed policies and focus on learning the transitions; when only suboptimal policies are available, small approximation errors in the policies contribute only additive non-leading terms to the regret, see Ouyang et al. (2017)). The learner follows \pi_k to select actions and transition into new states during the remaining time steps before the next policy switch. New data is collected all at once at the end of policy switch k. Once the total number of time steps has doubled, i.e., t reaches t_k + T_k - 1, the posterior is updated using the latest data D, and the above process is repeated.

Input: MCMC scheme SamplingAlg initialized with prior \lambda(\theta).
Initialization: time step t \leftarrow 1, D \leftarrow \emptyset
for batch k=1,\dots,K_T do
      T_k \leftarrow 2^{k-1}
      t_k \leftarrow 2^{k-1}
      Run SamplingAlg and sample \theta^k from the posterior: \theta^k \sim \tilde{\rho}_{t_k}(\theta|D)
      Compute the optimal policy \pi_k based on \theta^k
      for t = t_k, t_k+1, \cdots, t_k+T_k-1 do
            Choose action a_t \sim \pi_k
            Receive immediate reward \mathcal{R}(s_t, a_t), transition into new state s_{t+1}
      D \leftarrow D \cup \{s_t, a_t, \mathcal{R}(s_t, a_t), s_{t+1}\}_{t=t_k}^{t_k+T_k-1}
Algorithm 3: Langevin PSRL (LPSRL)
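A minimal Python sketch of the control flow of Algorithm 3 is given below; `sample_posterior`, `plan`, and `env_step` are assumed placeholder callbacks (the MCMC sampler such as SGLD or MLD, a planner returning a policy, and the environment transition), not interfaces defined by the paper.

```python
def lpsrl_sketch(sample_posterior, plan, env_step, s0, horizon):
    """Schematic of LPSRL (Algorithm 3) with the static doubling scheme."""
    D = []                     # all transitions observed so far
    s, t, k = s0, 1, 1
    total_reward = 0.0
    while t <= horizon:
        T_k = 2 ** (k - 1)     # static doubling: policy switch k lasts 2^(k-1) steps
        theta_k = sample_posterior(D)      # approximate posterior sample of the dynamics
        pi_k = plan(theta_k)               # policy (mapping state -> action) for the sampled model
        batch = []
        for _ in range(T_k):
            if t > horizon:
                break
            a = pi_k(s)
            r, s_next = env_step(s, a)
            batch.append((s, a, r, s_next))
            total_reward += r
            s, t = s_next, t + 1
        D.extend(batch)        # new data is only incorporated at the next policy switch
        k += 1
    return total_reward / horizon
```

Because the switching times are fixed in advance (t_k = 2^{k-1}), the number of policy switches over a horizon T is O(\log T), independent of the trajectory the learner observes.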

6.2 General Parameterization

In the RL context, to study the performance of LPSRL instantiated with SGLD as SamplingAlg, Assumptions B-B are required to hold for the (unknown) transition dynamics rather than the (unknown) rewards as in the bandit setting. Additionally, similar to Theocharous et al. (2017b), the General Parameterization requires p(\cdot|\theta) to be Lipschitz in \theta (Assumption B). The mathematical statements of all assumptions are in Appendix B. We now state the main theorem for LPSRL under the General Parameterization.

Theorem 6.2. Under Assumptions 1-6, by instantiating SamplingAlg with SGLD and setting the hyperparameters as per Theorem 1 with p=2, the regret of LPSRL (Algorithm 3) satisfies:

R_B(T) \leq CH\log T\sqrt{\frac{T}{m}}\,\big(d + \log Q + (32 + 8d\kappa^2)p\big)^{1/2},

where C is some positive constant, H is the upper bound on the MDP span, and Q denotes the quality of the prior. The total number of iterations required for SGLD is O(\kappa^2\log T).

Discussion.

LPSRL with SGLD maintains the same order-optimal regret as exact PSRL in Theocharous et al. (2017b). Similar to Theorem 5.2, the regret bound depends explicitly on the quality of the prior imposed on the transitions, where \log Q = 0 when the prior is properly centered with its mode at \theta^*, or when it is uninformative or flat. Let \theta^{k,*} be a sample from the true posterior in policy switch k. Our result relies on \theta^* and \theta^{k,*} being identically distributed, and on the convergence of SGLD in O(\log T) iterations to control the additive cumulative error \sum_{k=1}^{K_T} T_k W_1(\tilde{\rho}_{t_k}, \rho_{t_k}) arising from approximate sampling.

6.3 Simplex Parameterization

We now consider the tabular setting, where \theta^* models a collection of |\mathcal{A}| transition matrices in [0,1]^{|\mathcal{S}|\times|\mathcal{S}|}. Each row of a transition matrix lies in the probability simplex \Delta^{|\mathcal{S}|}, specifying the transition probabilities of the corresponding state-action pair. In particular, if the learner is in state s \in \mathcal{S} and takes action a \in \mathcal{A}, then it lands in state s^\prime with probability p(s^\prime) = p(s^\prime|s,a,\theta^*). To run LPSRL on this constrained space, we need to sample from probability simplexes and therefore appeal to Mirrored Langevin Dynamics (MLD) (Hsieh et al., 2018) with the entropic mirror map, which satisfies the requirements set forth by Theorem 2 of Hsieh et al. (2018). Under Assumptions B and B, we have the following convergence guarantee for MLD and regret bound for LPSRL under the Simplex Parameterization.

Theorem (MLD convergence). At the beginning of each policy switch k, for each state-action pair (s,a) \in \mathcal{S}\times\mathcal{A}, sample transition probabilities over \Delta^{|\mathcal{S}|} using MLD with the entropic mirror map. Let n_{t_k} be the number of data samples for (s,a) at time t_k. Then, with the step size chosen as per Cheng and Bartlett (2018), running MLD with O(n_{t_k}) iterations guarantees that W_2(\tilde{\rho}_{t_k}, \rho_{t_k}) = \tilde{O}\big(\sqrt{|\mathcal{S}|/n_{t_k}}\big).

Theorem 6.3. Suppose Assumptions 5 and 6 are satisfied. Then, by instantiating SamplingAlg with MLD (Algorithm 4), there exists some positive constant C such that the regret of LPSRL (Algorithm 3) under the Simplex Parameterization is bounded by

R_B(T) \leq CH|\mathcal{S}|\sqrt{|\mathcal{A}|T\log(|\mathcal{S}||\mathcal{A}|T)},

where H is the upper bound on the MDP span. The total number of iterations required for MLD is O(|\mathcal{S}|^2|\mathcal{A}|^2 T).

Discussion.

Under the Simplex Parameterization, instantiating LPSRL with MLD achieves the same order-optimal regret, but the computational complexity in terms of iterations for MLD is linear in T, as opposed to \log T for SGLD under the General Parameterization. Nevertheless, since the Simplex Parameterization implies a simpler structure, fewer assumptions are needed for the theory to hold.
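To sketch how MLD handles the simplex constraint (following the general recipe of Hsieh et al. (2018), not a restatement of their exact algorithm): the entropic mirror map h(x) = \sum_i x_i\log x_i maps the simplex to an unconstrained dual space via y = \nabla h(x). To sample from a constrained target e^{-V(x)}\,\mathrm{d}x on \Delta^{|\mathcal{S}|}, MLD runs an unadjusted Langevin chain on the pushforward density e^{-W} := (\nabla h)_{\#} e^{-V} in the dual space and maps the iterates back through \nabla h^*, so the simplex constraint is never violated and the convergence guarantees for unconstrained Langevin algorithms carry over to the constrained posterior.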

7 Experiments

In this section, we perform empirical studies in simulated bandit and RL environments to corroborate our theoretical findings. By comparing the actual regret (or average reward) and the number of batches (or maximum number of policy switches), we show that Langevin TS algorithms empowered by LMC methods achieve appealing statistical accuracy with low communication cost. For additional experimental details, please refer to Appendix F.

7.1 Langevin TS in Bandits

We first study how Langevin TS behaves when learning the true reward distributions of log-concave bandits with different priors and batching schemes. Specifically, we construct two bandit environments with Gaussian and Laplace reward distributions, respectively (our theory applies to bandits with a more general family of reward distributions). While both environments are instances of log-concave families, Laplace bandits do not belong to conjugate families.

7.1.1 Gaussian Bandits

We simulate a Gaussian bandit environment with N=15 arms. The existence of closed-form posteriors in Gaussian bandits allows us to benchmark against existing exact TS algorithms. More specifically, we instantiate Langevin TS with SGLD (SGLD-TS) and perform the following tasks:

  • Compare SGLD-TS against both frequentist and Bayesian methods, including UCB1, Bayes-UCB, decaying \epsilon-greedy, and exact TS.

  • Apply informative priors and uninformative priors for the Bayesian methods, based on the availability of prior knowledge about the reward distributions.

  • Examine all methods under three batching schemes: fully-sequential mode, dynamic batch, and static batch.

Results and Discussion. Figure 1(a) illustrates the cumulative regret of SGLD-TS and Exact-TS with favorable priors. Table 2 reports the regret upon convergence along with the total number of batches of interaction. Note that SGLD-TS equipped with the dynamic batching scheme implements Algorithm 2 (BLTS). Empirical results demonstrate that SGLD-TS is comparable to Exact-TS under all batching schemes, and is empirically more appealing than UCB1 and Bayes-UCB. While the static batch scheme incurs slightly lower communication cost than the dynamic batch scheme, all methods under the dynamic batch scheme are more robust, with smaller standard deviations. Our BLTS algorithm thus balances the trade-off between statistical performance, communication, and computational efficiency by achieving order-optimal regret with a small number of batches.

Figure 1: (a) Regret in Gaussian bandits (N=15): expected regret is reported over 10 experiments with informative priors. Results show that SGLD-TS under the dynamic batching scheme achieves optimal performance, matching the sequential case without approximate sampling. Results with uninformative priors yield the same conclusions (see Appendix F). (b) Regret in Laplace bandits (N=10): regret is reported over 10 experiments with informative priors. As in the Gaussian bandits, SGLD-TS with the dynamic batching scheme achieves optimal regret and outperforms UCB1. (c) Average reward in RiverSwim: expected average reward is reported over 10 experiments. MLD-PSRL achieves the optimal average reward upon convergence with a small number of policy switches.
Batching scheme | SGLD-TS | Exact-TS | UCB1 | Bayes-UCB | Batches
Fully sequential | 99.66 ± 13.09 | 99.07 ± 12.23 | 154.13 ± 4.10 | 160.55 ± 25.75 | 650.0 ± 0.0
Static batch | 148.52 ± 39.28 | 145.94 ± 31.46 | 155.17 ± 5.06 | 231.80 ± 52.11 | 9.0 ± 0.0
Dynamic batch | 99.80 ± 15.62 | 98.71 ± 12.10 | 153.31 ± 3.83 | 214.43 ± 0.5 | 22.93 ± 1.50
Table 2: Average regret with standard deviation under different batching schemes. The last column quantifies communication cost w.r.t. the total number of batches of interaction. BLTS (SGLD-TS under the dynamic batching scheme) achieves order-optimal regret with low communication cost.

7.1.2 Laplace Bandits

To demonstrate the applicability of Langevin TS in scenarios where posteriors are intractable, we construct a Laplace bandit environment with N=10 arms. It is important to note that Laplace reward distributions do not have conjugate priors, rendering exact TS inapplicable in this setting. We therefore compare the performance of SGLD-TS with favorable priors against UCB1. The results in Figure 1(b) show that, as in the Gaussian bandits, SGLD-TS with the dynamic batching scheme achieves performance comparable to the fully-sequential setting and significantly outperforms UCB1, highlighting its capability to handle diverse environments. In addition, the static batching scheme exhibits larger deviations than the dynamic one, which aligns with the results in Table 2.

7.2 Langevin PSRL in Average-reward MDPs

In the MDP setting, we consider a variant of the RiverSwim environment (Strehl and Littman, 2008), a common testbed for provable RL methods. It models an agent swimming in a river with five states and two actions (|\mathcal{S}|=5, |\mathcal{A}|=2). In this tabular case, LPSRL (Algorithm 3) employs MLD (Algorithm 4 in Appendix A) as SamplingAlg; we refer to this instantiation as MLD-PSRL. We benchmark the performance of MLD-PSRL against other mainstream model-based RL methods, including TSDE (Ouyang et al., 2017), DS-PSRL (Theocharous et al., 2017b), and DB-PSRL (exact PSRL (Strens, 2000) with dynamic batching). Note that MLD-PSRL and DS-PSRL adopt the static doubling policy-switch scheme discussed in Section 6. The dynamic doubling policy-switch scheme adopted by both DB-PSRL and TSDE is akin to the one we use in the bandit setting, but is based on the visit counts of state-action pairs. We simulate 10 different runs of the experiment and report the average rewards obtained by each method in Figure 1(c). The mechanisms used by each method are summarized in Table 3, along with the average rewards achieved and the maximum number of policy switches incurred.

Mechanism | MLD-PSRL | DS-PSRL | DB-PSRL | TSDE | Optimal policy
Static ps | √ | √ | | |
Dynamic ps | | | √ | √ |
Linear growth | | | | √ |
Avg. reward | 4.01 ± 0.11 | 4.02 ± 0.08 | 2.41 ± 0.91 | 4.01 ± 0.17 | 4.15 ± 0.04
Max. switches | 12.0 ± 0.0 | 12.0 ± 0.0 | 15.33 ± 1.70 | 94.0 ± 3.56 | -
Table 3: We report the average reward and the maximum number of policy switches of all methods over 10 different runs. MLD-PSRL instantiates Algorithm 3 from Section 6 and achieves order-optimal performance with a small number of policy switches.

Results and Discussion. MLD-PSRL achieves performance comparable to existing PSRL methods while significantly reducing communication costs through the use of static policy switches. In contrast, as illustrated in Figure 4 (Appendix F) and Table 3, TSDE achieves near-optimal performance but requires high communication costs. Additionally, our empirical results reveal that the static policy-switch scheme in the MDP setting outperforms the dynamic policy-switch scheme alone. This observation aligns with existing findings that frequent policy switches in MDPs can harm performance. Moreover, compared to DS-PSRL, MLD-PSRL is applicable to more general frameworks where closed-form posterior distributions are not available (different mirror maps can be applied within the MLD method depending on the parameterization).

8 Conclusion

In this paper, we jointly address two challenges in the design and analysis of Thompson sampling (TS) methods. First, when dealing with posteriors that do not belong to conjugate families, it is necessary to generate approximate samples within a reasonable computational budget. Second, when interacting with the environment in a batched manner, it is important to limit the amount of communication required. These challenges are critical in real-world deployments of TS, as closed-form posteriors and fully-sequential interactions are rare. In stochastic MABs, approximate TS and batched interactions have been studied independently; we bridge the two lines of work by providing a Langevin TS algorithm that works for a wide class of reward distributions with only logarithmic communication. In undiscounted infinite-horizon MDP settings, to the best of our knowledge, we are the first to provably incorporate approximate sampling into the TS paradigm, which enhances the applicability of TS to RL problems with low communication costs. Finally, we conclude with experiments that demonstrate the appealing empirical performance of the Langevin TS algorithms.

Acknowledgements

This work is supported in part by the National Science Foundation Grants NSF-SCALE MoDL (2134209) and NSF-CCF-2112665 (TILOS), the U.S. Department of Energy, Office of Science, and the Facebook Research award. Amin Karbasi acknowledges funding in direct support of this work from NSF (IIS-1845032), ONR (N00014-19-1-2406), and the AI Institute for Learning-Enabled Optimization at Scale (TILOS).

References

  • Abbasi-Yadkori and Szepesvári (2015) Abbasi-Yadkori, Y. and Szepesvári, C. (2015). Bayesian optimal control of smoothly parameterized systems. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. AUAI Press.
  • Agrawal and Goyal (2012) Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1. JMLR Workshop and Conference Proceedings.
  • Agrawal and Goyal (2013) Agrawal, S. and Goyal, N. (2013). Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107. PMLR.
  • Agrawal and Jia (2017) Agrawal, S. and Jia, R. (2017). Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. Advances in Neural Information Processing Systems, 30.
  • Bartlett and Tewari (2012) Bartlett, P. L. and Tewari, A. (2012). REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. arXiv preprint arXiv:1205.2661.
  • Bertsekas (2012) Bertsekas, D. (2012). Dynamic Programming and Optimal Control: Volume I, volume 1. Athena Scientific.
  • Bojun (2020) Bojun, H. (2020). Steady state analysis of episodic reinforcement learning. Advances in Neural Information Processing Systems, 33:9335–9345.
  • Brown et al. (2020) Brown, D., Coleman, R., Srinivasan, R., and Niekum, S. (2020). Safe imitation learning via fast Bayesian reward inference from preferences. In International Conference on Machine Learning, pages 1165–1177. PMLR.
  • Chapelle and Li (2011) Chapelle, O. and Li, L. (2011). An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24:2249–2257.
  • Cheng and Bartlett (2018) Cheng, X. and Bartlett, P. (2018). Convergence of Langevin MCMC in KL-divergence. In Algorithmic Learning Theory, pages 186–211. PMLR.
  • Gao et al. (2019) Gao, Z., Han, Y., Ren, Z., and Zhou, Z. (2019). Batched multi-armed bandits problem.
  • Gopalan and Mannor (2015) Gopalan, A. and Mannor, S. (2015). Thompson sampling for learning parameterized Markov decision processes. In Proceedings of The 28th Conference on Learning Theory. PMLR.
  • Granmo (2010) Granmo, O.-C. (2010). Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics.
  • Guez et al. (2014) Guez, A., Silver, D., and Dayan, P. (2014). Better optimism by Bayes: Adaptive planning with rich models. arXiv preprint arXiv:1402.1958.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. (2018). Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
  • Honda and Takemura (2013) Honda, J. and Takemura, A. (2013). Optimality of Thompson sampling for Gaussian bandits depends on priors.
  • Hsieh et al. (2018) Hsieh, Y.-P., Kavis, A., Rolland, P., and Cevher, V. (2018). Mirrored Langevin dynamics. Advances in Neural Information Processing Systems, 31.
  • Imani et al. (2018) Imani, M., Ghoreishi, S. F., and Braga-Neto, U. M. (2018). Bayesian control of large MDPs with unknown dynamics in data-poor environments. Advances in Neural Information Processing Systems, 31.
  • Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600.
  • Jin et al. (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31.
  • Jin et al. (2020) Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR.
  • Jung et al. (2019) Jung, Y. H., Abeille, M., and Tewari, A. (2019). Thompson sampling in non-episodic restless bandits. arXiv preprint arXiv:1910.05654.
  • Kalkanli and Ozgur (2021) Kalkanli, C. and Ozgur, A. (2021). Batched Thompson sampling. Advances in Neural Information Processing Systems, 34:29984–29994.
  • Karbasi et al. (2021) Karbasi, A., Mirrokni, V., and Shadravan, M. (2021). Parallelizing Thompson sampling. Advances in Neural Information Processing Systems, 34:10535–10548.
  • Kaufmann et al. (2012) Kaufmann, E., Korda, N., and Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer.
  • Li et al. (2022) Li, T., Wu, F., and Lan, G. (2022). Stochastic first-order methods for average-reward Markov decision processes. arXiv preprint arXiv:2205.05800.
  • Lu and Van Roy (2017) Lu, X. and Van Roy, B. (2017). Ensemble sampling. arXiv preprint arXiv:1705.07347.
  • Ma et al. (2015) Ma, Y.-A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. Advances in Neural Information Processing Systems, 28.
  • May et al. (2012) May, B. C., Korda, N., Lee, A., and Leslie, D. S. (2012). Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13:2069–2106.
  • Mazumdar et al. (2020) Mazumdar, E., Pacchiano, A., Ma, Y.-A., Bartlett, P. L., and Jordan, M. I. (2020). On approximate Thompson sampling with Langevin algorithms. In ICML, volume 119, pages 6797–6807.
  • Osband et al. (2013) Osband, I., Russo, D., and Van Roy, B. (2013). (More) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems, 26.
  • Osband and Van Roy (2014) Osband, I. and Van Roy, B. (2014). Model-based reinforcement learning and the eluder dimension. Advances in Neural Information Processing Systems, 27.
  • Osband and Van Roy (2016) Osband, I. and Van Roy, B. (2016). Posterior sampling for reinforcement learning without episodes. arXiv preprint arXiv:1608.02731.
  • Ouyang et al. (2017) Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. (2017). Learning unknown Markov decision processes: A Thompson sampling approach. Advances in Neural Information Processing Systems, 30.
  • Riquelme et al. (2018) Riquelme, C., Tucker, G., and Snoek, J. (2018). Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127.
  • Russo and Van Roy (2014) Russo, D. and Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243.
  • Schwartz et al. (2017) Schwartz, E. M., Bradlow, E. T., and Fader, P. S. (2017). Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522.
  • Strehl and Littman (2008) Strehl, A. L. and Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331.
  • Strens (2000) Strens, M. (2000). A Bayesian framework for reinforcement learning. In ICML, volume 2000, pages 943–950.
  • Theocharous et al. (2017a) Theocharous, G., Vlassis, N., and Wen, Z. (2017a). An interactive points of interest guidance system. In Proceedings of the 22nd International Conference on Intelligent User Interfaces Companion, IUI '17 Companion.
  • Theocharous et al. (2017b) Theocharous, G., Wen, Z., Abbasi-Yadkori, Y., and Vlassis, N. (2017b). Posterior sampling for large scale reinforcement learning. arXiv preprint arXiv:1711.07979.
  • Thompson (1933) Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
  • Tian et al. (2020) Tian, Y., Qian, J., and Sra, S. (2020). Towards minimax optimal reinforcement learning in factored Markov decision processes. Advances in Neural Information Processing Systems, 33:19896–19907.
  • Tossou et al. (2019) Tossou, A., Basu, D., and Dimitrakakis, C. (2019). Near-optimal optimistic reinforcement learning using empirical Bernstein inequalities. arXiv preprint arXiv:1905.12425.
  • Vernade et al. (2020) Vernade, C., Carpentier, A., Lattimore, T., Zappella, G., Ermis, B., and Brueckner, M. (2020). Linear bandits with stochastic delayed feedback. In International Conference on Machine Learning, pages 9712–9721. PMLR.
  • Wei etย al., (2021) Wei, C.-Y., Jahromi, M.ย J., Luo, H., and Jain, R. (2021). Learning infinite-horizon average-reward mdps with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pages 3007โ€“3015. PMLR.
  • Wei etย al., (2020) Wei, C.-Y., Jahromi, M.ย J., Luo, H., Sharma, H., and Jain, R. (2020). Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International conference on machine learning, pages 10170โ€“10180. PMLR.
  • Welling and Teh, (2011) Welling, M. and Teh, Y.ย W. (2011). Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681โ€“688.
  • Wu etย al., (2022) Wu, Y., Zhou, D., and Gu, Q. (2022). Nearly minimax optimal regret for learning infinite-horizon average-reward mdps with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pages 3883โ€“3913. PMLR.
  • Xu etย al., (2022) Xu, P., Zheng, H., Mazumdar, E.ย V., Azizzadenesheli, K., and Anandkumar, A. (2022). Langevin monte carlo for contextual bandits. In International Conference on Machine Learning, pages 24830โ€“24850. PMLR.
  • Yang and Wang, (2020) Yang, L. and Wang, M. (2020). Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746โ€“10756. PMLR.
  • Yang etย al., (2020) Yang, Z., Jin, C., Wang, Z., Wang, M., and Jordan, M.ย I. (2020). On function approximation in reinforcement learning: Optimism in the face of large state spaces. arXiv preprint arXiv:2011.04622.
  • Zhang, (2022) Zhang, T. (2022). Feel-good thompson sampling for contextual bandits and reinforcement learning. SIAM Journal on Mathematics of Data Science, 4(2):834โ€“857.
  • Zhang etย al., (2020) Zhang, W., Zhou, D., Li, L., and Gu, Q. (2020). Neural thompson sampling.

Appendices

Appendix A MCMC Methods

A.1 Unconstrained Approximate Sampling

Suppose a target distribution ฯ\rho is parameterized by ฮธโˆˆโ„d\theta\in\mathbb{R}^{d}, and the observed data {Xi}i=1n\{X_{i}\}_{i=1}^{n} are independently and identically distributed. The posterior distribution, defined up to a normalization constant, can be expressed in the Gibbs form:

ฯโ€‹(ฮธ|X1,โ€ฆ,Xn)โˆฮปโ€‹(ฮธ)โ€‹โˆi=1npโ€‹(Xi;ฮธ)=expโก(โˆ’Uโ€‹(ฮธ)),\rho(\theta|X_{1},\dots,X_{n})\propto\lambda(\theta)\prod_{i=1}^{n}p(X_{i};\theta)=\exp\left(-U(\theta)\right),

where ฮปโ€‹(ฮธ)\lambda(\theta) is the prior distribution of ฮธ\theta, pโ€‹(Xi;ฮธ)p(X_{i};\theta) is the likelihood function, and Uโ€‹(ฮธ):=โˆ’logโก(ฮปโ€‹(ฮธ))โˆ’โˆ‘i=1nlogโก(pโ€‹(Xi;ฮธ))U(\theta):=-\log\left(\lambda(\theta)\right)-\sum_{i=1}^{n}\log\left(p(X_{i};\theta)\right) is the energy function.

Typical MCMC methods require computations over the whole dataset, which is inefficient in large-scale online learning. To overcome this issue, we adopt SGLD Welling and Teh, (2011) as one of the approximate sampling methods; it builds on stochastic optimization over a mini-batch of data DโІ{Xi}i=1nD\subseteq\{X_{i}\}_{i=1}^{n}. The update rule is based on the Euler–Maruyama discretization of the Langevin stochastic differential equation (SDE):

dโ€‹ฮธt=12โ€‹(โˆ‡logโก(ฮปโ€‹(ฮธ0))+n|D|โ€‹โˆ‘iโˆˆDโˆ‡logโก(pโ€‹(xi;ฮธt)))โ€‹dโ€‹t+2โ€‹dโ€‹Bt,\displaystyle d\theta_{t}=\frac{1}{2}\left(\nabla\log\left(\lambda(\theta_{0})\right)+\frac{n}{|D|}\sum_{i\in D}\nabla\log\left(p(x_{i};\theta_{t})\right)\right)dt+\sqrt{2}dB_{t},

where BtB_{t} is a Brownian motion. To further reduce computation, we reuse samples from previous batches to warm-start the Markov chains (Algorithm 1). The resulting dependence of samples across batches complicates our analysis.
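For concreteness, the following is a minimal Python sketch of the SGLD update above for a one-dimensional Gaussian reward model with a Gaussian prior; the model, step size, and mini-batch size are illustrative placeholders rather than the hyperparameters prescribed by Theorem 2, and it uses a standard Euler–Maruyama step whose drift is the stochastic estimate of the log-posterior gradient.

import numpy as np

def sgld_gaussian(rewards, n_total, theta0, step_size, batch_size, n_steps,
                  prior_mean=0.0, prior_var=1.0, lik_var=1.0, rng=None):
    # SGLD targeting the posterior over the mean theta of a Gaussian reward model
    # r ~ N(theta, lik_var) under the prior theta ~ N(prior_mean, prior_var).
    # `rewards` is the array of observed rewards; `n_total` is the full sample count n.
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    theta = float(theta0)
    for _ in range(n_steps):
        batch = rng.choice(rewards, size=min(batch_size, len(rewards)), replace=False)
        grad_prior = -(theta - prior_mean) / prior_var                        # gradient of log prior
        grad_lik = (n_total / len(batch)) * np.sum(batch - theta) / lik_var   # rescaled mini-batch gradient
        theta += step_size * (grad_prior + grad_lik) \
                 + np.sqrt(2.0 * step_size) * rng.standard_normal()           # injected Gaussian noise
    return theta

Warm-starting across batches, as in Algorithm 1, simply amounts to passing the previous batch's output as theta0 for the next run.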

A.2 Constrained Approximate Sampling

While the convergence of SGLD is well studied, the method is only applicable to unconstrained settings. To enable sampling from constrained, non-log-concave distributions, such as the probability simplex arising in the transition dynamics of MDPs, reparameterization can be used in conjunction with SGLD. Alternatively, one can adopt MLD Hsieh et al., (2018), which utilizes mirror maps to sample in an unconstrained dual space (Algorithm 4). Let the probability measure of ฮธ\theta be dโ€‹ฯ=eโˆ’Uโ€‹(ฮธ)โ€‹dโ€‹ฮธd\rho=e^{-U(\theta)}\text{d}\theta, where dom(UU) is constrained. Suppose there exists a mirror map hh that maps ฯ\rho to some unconstrained distribution dโ€‹ฮฝ=eโˆ’Wโ€‹(ฯ‰)โ€‹dโ€‹ฯ‰d\nu=e^{-W(\omega)}\text{d}\omega, denoted by โˆ‡hโ€‹#โ€‹ฯ=ฮฝ\nabla h\#\rho=\nu. Then MLD is governed by the following SDE:

{dโ€‹ฯ‰t=โˆ’(โˆ‡Wโˆ˜โˆ‡h)โ€‹(ฮธt)โ€‹dโ€‹t+2โ€‹dโ€‹Btฮธt=โˆ‡hโˆ—โ€‹(ฯ‰t),\begin{cases}\text{d}\omega_{t}=-(\nabla W\circ\nabla h)(\theta_{t})\text{d}t+\sqrt{2}\text{d}B_{t}\\ \theta_{t}=\nabla h^{*}(\omega_{t})\end{cases}, (5)

where hโˆ—h^{*} is the dual of hh, and (โˆ‡h)โˆ’1=โˆ‡hโˆ—(\nabla h)^{-1}=\nabla h^{*}.

Input: ๐’ฎ,๐’œ\mathcal{S},\mathcal{A}, mirror map hh, observed transitions {Xs}s=1n\{X_{s}\}_{s=1}^{n}, total iterations NN
forย i=1,โ€ฆ,Ni=1,\dots,Nย do
ย ย ย ย ย ย  Subsample DโІ{Xs}s=1nD\subseteq\{X_{s}\}_{s=1}^{n}
ย ย ย ย ย ย  Sample ฯ‰i+1โˆผโˆ‡hโ€‹#โ€‹eโˆ’U\omega_{i+1}\sim\nabla h\#e^{-U} from the unconstrained dual space
ย ย ย ย ย ย  Compute constrained sample ฮธi+1=โˆ‡hโˆ—โ€‹(ฯ‰i+1)\theta_{i+1}=\nabla h^{*}(\omega_{i+1})
Output: ฮธN\theta_{N}
Algorithmย 4 Mirrored Langevin Dynamics (MLD)

In the tabular MDP setting, MLD needs to be run for each row of the |๐’œ|ร—|๐’ฎ||\mathcal{A}|\times|\mathcal{S}| transition matrix to generate a sampled transition vector from the simplex ฮ”|๐’ฎ|\Delta_{|\mathcal{S}|} for each state-action pair. In this case, the entropic mirror map is adopted as hh, given by

hโ€‹(ฮธ)=โˆ‘i=1|๐’ฎ|ฮธiโ€‹logโกฮธi+(1โˆ’โˆ‘i=1|๐’ฎ|ฮธi)โ€‹logโก(1โˆ’โˆ‘i=1|๐’ฎ|ฮธi),whereโ€‹0โ€‹logโก0:=0.h(\theta)=\sum_{i=1}^{|\mathcal{S}|}\theta_{i}\log\theta_{i}+(1-\sum_{i=1}^{|\mathcal{S}|}\theta_{i})\log(1-\sum_{i=1}^{|\mathcal{S}|}\theta_{i}),~{}~{}~{}~{}\text{where}~{}~{}0\log 0:=0. (6)
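As an illustration, the sketch below gives the closed forms of the forward map grad h and its inverse grad h* for the entropic mirror map in (6), together with one Euler–Maruyama step of the dual dynamics in Equation (5). This is a minimal sketch under our own naming: theta holds the free coordinates of a simplex point (the remaining mass is 1 - sum(theta)), and grad_W stands for a user-supplied, possibly stochastic mini-batch estimate of the dual potential's gradient, which is not specified here.

import numpy as np

def grad_h(theta):
    # Forward entropic mirror map: (grad h)(theta)_i = log(theta_i) - log(1 - sum(theta)).
    return np.log(theta) - np.log(1.0 - theta.sum())

def grad_h_star(omega):
    # Inverse map: (grad h*)(omega)_i = exp(omega_i) / (1 + sum_j exp(omega_j)),
    # i.e., a numerically stabilized softmax over (omega, 0) with the last coordinate dropped.
    m = max(float(omega.max()), 0.0)
    e = np.exp(omega - m)
    return e / (e.sum() + np.exp(-m))

def mld_step(theta, grad_W, step_size, rng):
    # One Euler-Maruyama step of Equation (5): move in the dual space, then map back.
    omega = grad_h(theta)
    omega = omega - step_size * grad_W(omega) \
            + np.sqrt(2.0 * step_size) * rng.standard_normal(omega.shape)
    return grad_h_star(omega)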

Appendix B Assumptions

Here we explicitly state all of the assumptions required in the paper. Assumptions 1-4 are required for SGLD to converge (Algorithm 1 and Theorem 1), while Assumptions 5 and 6 are required in Section 6. A concrete example satisfying Assumptions 1-4 is sketched after the list.

  • Assumptionย 1ย (Assumption on the family pโ€‹(S|ฮธ)p(S|\theta) for approximate sampling). Assume that logโกpโ€‹(s|ฮธ)\log p(s|\theta) is LL-smooth and mm-strongly concave over ฮธ\theta:

    โˆ’logโกpโ€‹(s|ฮธโ€ฒ)โˆ’โˆ‡ฮธlogโกpโ€‹(s|ฮธโ€ฒ)โŠคโ€‹(ฮธโˆ’ฮธโ€ฒ)+m2โ€‹โ€–ฮธโˆ’ฮธโ€ฒโ€–2โ‰คโˆ’logโกpโ€‹(s|ฮธ)\displaystyle-\log p(s|\theta^{\prime})-\nabla_{\theta}\log p(s|\theta^{\prime})^{\top}(\theta-\theta^{\prime})+\frac{m}{2}\left\|\theta-\theta^{\prime}\right\|^{2}\leq-\log p(s|\theta)
    โ‰คโˆ’logโกpโ€‹(s|ฮธโ€ฒ)โˆ’โˆ‡ฮธlogโกpโ€‹(s|ฮธโ€ฒ)โŠคโ€‹(ฮธโˆ’ฮธโ€ฒ)+L2โ€‹โ€–ฮธโˆ’ฮธโ€ฒโ€–2โ€‹โˆ€ฮธ,ฮธโ€ฒโˆˆโ„d,sโˆˆ๐’ฎ\displaystyle\leq-\log p(s|\theta^{\prime})-\nabla_{\theta}\log p(s|\theta^{\prime})^{\top}(\theta-\theta^{\prime})+\frac{L}{2}\left\|\theta-\theta^{\prime}\right\|^{2}~{}~{}\forall\theta,\theta^{\prime}\in\mathbb{R}^{d},s\in\mathcal{S}
  • Assumptionย 2ย (Assumption on true reward/transition distribution pโ€‹(S|ฮธโˆ—)p(S|\theta^{*})). Assume that pโ€‹(S;ฮธโˆ—)p(S;\theta^{*}) is strongly log-concave in SS with some parameter ฮฝ\nu, and that โˆ‡ฮธlogโกpโ€‹(s|ฮธโˆ—)\nabla_{\theta}\log p(s|\theta^{*}) is LL-Lipschitz in SS:

    โˆ’(โˆ‡slogโกpโ€‹(s|ฮธโˆ—)โˆ’โˆ‡slogโกpโ€‹(sโ€ฒ|ฮธโˆ—))โŠคโ€‹(sโˆ’sโ€ฒ)โ‰ฅฮฝโ€‹โ€–sโˆ’sโ€ฒโ€–2,โˆ€s,sโ€ฒโˆˆโ„-(\nabla_{s}\log p(s|\theta^{*})-\nabla_{s}\log p(s^{\prime}|\theta^{*}))^{\top}(s-s^{\prime})\geq\nu\left\|s-s^{\prime}\right\|^{2},~{}~{}\forall s,s^{\prime}\in\mathbb{R}
    โˆฅโˆ‡ฮธlogp(s|ฮธโˆ—)โˆ’โˆ‡ฮธlogp(sโ€ฒ|ฮธโˆ—)โˆฅโ‰คLโˆฅsโˆ’sโ€ฒโˆฅ,โˆ€s,sโ€ฒโˆˆโ„\left\|\nabla_{\theta}\log p(s|\theta^{*})-\nabla_{\theta}\log p(s^{\prime}|\theta^{*})\right\|\leq L\left\|s-s^{\prime}\right\|,~{}~{}\forall s,s^{\prime}\in\mathbb{R}
  • Assumptionย 3ย (Assumption on the prior distribution). Assume that logโกฮปโ€‹(ฮธ)\log\lambda(\theta) is concave with LL-Lipschitz gradients for all ฮธโˆˆโ„d\theta\in\mathbb{R}^{d}:

    โ€–โˆ‡ฮธlogโกฮปโ€‹(ฮธ)โˆ’โˆ‡ฮธlogโกฮปโ€‹(ฮธโ€ฒ)โ€–โ‰คLโ€‹โ€–ฮธโˆ’ฮธโ€ฒโ€–,โˆ€ฮธ,ฮธโ€ฒโˆˆโ„d\left\|\nabla_{\theta}\log\lambda(\theta)-\nabla_{\theta}\log\lambda(\theta^{\prime})\right\|\leq L\left\|\theta-\theta^{\prime}\right\|,~{}~{}\forall\theta,\theta^{\prime}\in\mathbb{R}^{d}
  • Assumptionย 4ย (Joint Lipschitz smoothness of logโกpโ€‹(S|ฮธ)\log p(S|\theta)).

    โˆฅโˆ‡ฮธlogp(s|ฮธ)โˆ’โˆ‡ฮธlogp(sโ€ฒ|ฮธ)โˆฅโ‰คLโˆฅฮธโˆ’ฮธโ€ฒโˆฅ+Lโˆฅsโˆ’sโ€ฒโˆฅ,โˆ€ฮธ,ฮธโ€ฒโˆˆโ„d,s,sโ€ฒโˆˆโ„\left\|\nabla_{\theta}\log p(s|\theta)-\nabla_{\theta}\log p(s^{\prime}|\theta)\right\|\leq L\left\|\theta-\theta^{\prime}\right\|+L\left\|s-s^{\prime}\right\|,~{}~{}\forall\theta,\theta^{\prime}\in\mathbb{R}^{d},s,s^{\prime}\in\mathbb{R}
  • Assumption 5 (1-Lipschitzness of Jโ€‹(ฮธ)J(\theta) in ฮธ\theta). The optimal average-reward function JJ satisfies

    โ€–Jโ€‹(ฮธ)โˆ’Jโ€‹(ฮธโ€ฒ)โ€–โ‰คโ€–ฮธโˆ’ฮธโ€ฒโ€–,โˆ€ฮธ,ฮธโ€ฒโˆˆโ„d\left\|J(\theta)-J(\theta^{\prime})\right\|\leq\|\theta-\theta^{\prime}\|,~{}~{}\forall\theta,\theta^{\prime}\in\mathbb{R}^{d}

    where Jโ€‹(ฮธ)=maxฯ€โกJฯ€โ€‹(ฮธ)J(\theta)=\max_{\pi}J^{\pi}(\theta).

  • Assumption 6 (Lipschitzness of the transition in ฮธ\theta for RL). There exists a constant LpL_{p} such that the transition kernel for each state-action pair is LpL_{p}-Lipschitz in the parameter space:

    โˆฅp(โ‹…|s,a,ฮธ)โˆ’p(โ‹…|s,a,ฮธโ€ฒ)โˆฅโ‰คLpโˆฅฮธโˆ’ฮธโ€ฒโˆฅ,โˆ€ฮธ,ฮธโ€ฒโˆˆโ„d,s,aโˆˆ๐’ฎร—๐’œ\left\|p(\cdot|s,a,\theta)-p(\cdot|s,a,\theta^{\prime})\right\|\leq L_{p}\left\|\theta-\theta^{\prime}\right\|,~{}~{}\forall\theta,\theta^{\prime}\in\mathbb{R}^{d},~{}~{}s,a\in\mathcal{S}\times\mathcal{A}
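As a concrete illustration (our example, not part of the paper's assumptions), consider a one-dimensional Gaussian model with a standard normal prior:

\log p(s|\theta)=-\tfrac{1}{2}(s-\theta)^{2}+\text{const},\qquad\log\lambda(\theta)=-\tfrac{1}{2}\theta^{2}+\text{const}.

Then \nabla_{\theta}\log p(s|\theta)=s-\theta, so Assumptions 1, 2, and 4 hold with L=m=\nu=1 (hence \kappa=1), and Assumption 3 holds since \log\lambda(\theta) is concave with a 1-Lipschitz gradient.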

Appendix C Convergence of SGLD with Batched Data

In this section, we prove the convergence of SGLD in sequential decision making frameworks under the batch scheme, which is stated with precise hyperparameters as Theorem 2. We first state the supporting lemmas, followed by the proof of the convergence theorem.

Lemma C.3 (Lemma 5 in Mazumdar etย al., (2020)).

Denote by U^\widehat{U} the stochastic estimator of UU. Then, for a stochastic gradient estimate computed with kk data points, we have,

๐”ผโ€‹[โ€–โˆ‡U^โ€‹(ฮธ)โˆ’โˆ‡Uโ€‹(ฮธ)โ€–p|ฮธ]โ‰ค2โ€‹np/2kp/2โ€‹(dโ€‹pโ€‹Laโˆ—ฮฝa)p.\displaystyle\mathbb{E}\left[\left\|\nabla\widehat{U}(\theta)-\nabla U(\theta)\right\|^{p}\big{|}\theta\right]\leq 2\frac{n^{p/2}}{k^{p/2}}\left(\frac{\sqrt{dp}L_{a}^{*}}{\sqrt{\nu_{a}}}\right)^{p}.
Lemma C.4 (Lemma 6 from Mazumdar etย al., (2020)).

For a fixed arm aa with nn samples, suppose we run Algorithm 1 with step size ฮทโ‰คm^32โ€‹L^2\eta\leq\frac{\hat{m}}{32\hat{L}^{2}} for N iterations to generate samples from posterior ฯnโˆ—โˆexpโก(โˆ’U)\rho_{n}^{*}\propto\exp(-U), in which UU is m^โˆ’\hat{m}-strongly convex and L^โˆ’\hat{L}-Lipschitz smooth. If at each step iโˆˆ[N]i\in[N], the pp-th moment between the true gradient and the stochastic gradient satisfies ๐”ผโ€‹[โ€–โˆ‡Uโ€‹(ฮธiโ€‹ฮท)โˆ’โˆ‡U^โ€‹(ฮธiโ€‹ฮท)โ€–p|ฮธiโ€‹ฮท]โ‰คฮ”p,\mathbb{E}\left[\|\nabla U(\theta_{i\eta})-\nabla\hat{U}(\theta_{i\eta})\|^{p}~{}|~{}\theta_{i\eta}\right]\leq\Delta_{p}, then:

Wppโ€‹(ฯ~iโ€‹ฮท,n,ฯnโˆ—)โ‰ค(1โˆ’m^8โ€‹ฮท)pโ€‹iโ€‹Wppโ€‹(ฯ0,ฯnโˆ—)+25โ€‹pโ€‹L^pm^pโ€‹(dโ€‹p)p/2โ€‹(ฮท)p/2+22โ€‹p+3โ€‹ฮ”pm^pW_{p}^{p}(\tilde{\rho}_{i\eta,n},\rho_{n}^{*})\leq\left(1-\frac{\hat{m}}{8}\eta\right)^{pi}W_{p}^{p}(\rho_{0},\rho_{n}^{*})+2^{5p}\frac{\hat{L}^{p}}{\hat{m}^{p}}(dp)^{p/2}(\eta)^{p/2}+2^{2p+3}\frac{\Delta_{p}}{\hat{m}^{p}}

where ฯ0=ฯ~0โ€‹ฮท,n\rho_{0}=\tilde{\rho}_{0\eta,n}.

Theorem 2 (SGLD convergence).

Fix an arm aโˆˆ๐’œa\in\mathcal{A} and suppose that Assumptions 1-4 are met for it. Let ฮบ:=maxโก{L/m,L/ฮฝ}\kappa:=\max\{L/m,L/\nu\}, nkn_{k} be the number of available rewards for arm aa when running SGLD for the kk-th time, ฯa,nk\rho_{a,n_{k}} be the exact posterior of arm aa after observing nkn_{k} samples, and ฯ~a,nk\tilde{\rho}_{a,n_{k}} be the corresponding approximate posterior obtained by SGLD. If ๐”ผฮธโˆผฯa,nkโ€‹[โ€–ฮธโˆ’ฮธโˆ—โ€–p]1/pโ‰คD~nk\mathbb{E}_{\theta\sim\rho_{a,n_{k}}}[\|\theta-\theta^{*}\|^{p}]^{1/p}\leq\frac{\tilde{D}}{\sqrt{n_{k}}} is satisfied by the posterior, then with mini-batch size s=32โ€‹L2mโ€‹ฮฝ=๐’ชโ€‹(ฮบ2)s=\frac{32L^{2}}{m\nu}=\mathcal{O}(\kappa^{2}), step size ฮท=mโ€‹nk32โ€‹L2โ€‹(nk+1)2=๐’ชโ€‹(1Lโ€‹ฮบโ€‹nk)\eta=\frac{mn_{k}}{32L^{2}(n_{k}+1)^{2}}=\mathcal{O}(\frac{1}{L\kappa n_{k}}), and the number of steps N=1280โ€‹L2โ€‹(nk+1)2m2โ€‹nk2=๐’ชโ€‹(ฮบ2)N=\frac{1280L^{2}(n_{k}+1)^{2}}{m^{2}n_{k}^{2}}=\mathcal{O}(\kappa^{2}), SGLD in Algorithm 1 converges in Wasserstein-pp distance:

Wpโ€‹(ฯ~a,nk,ฯa,nk)โ‰ค2โ€‹D~nk,โˆ€D~โ‰ฅ32โ€‹dโ€‹pm,pโ‰ฅ2.W_{p}\left(\tilde{\rho}_{a,n_{k}},\rho_{a,n_{k}}\right)\leq\frac{2\tilde{D}}{\sqrt{n_{k}}},~{}~{}~{}~{}\forall\tilde{D}\geq\sqrt{\frac{32dp}{m}},~{}p\geq 2.
  • Proof of Theorem ย 2

    The proof follows similarly to that of Theorem 6 in Mazumdar et al., (2020). Compared to the analysis in Mazumdar et al., (2020), our proof is based on induction over the batches, as opposed to induction over the number of samples, since in our case SGLD is only executed at the end of each batch. Let BkB_{k} be the kk-th batch. For the base case k=1k=1, we have nk=1n_{k}=1, and the claim follows from the initialization of the algorithm (this is similar to the fully sequential case in Mazumdar et al., (2020)).

    Now, suppose that the claim holds for batch kโˆ’1k-1. That is, suppose that all the necessary conditions are met and that Wpโ€‹(ฯ~a,nkโˆ’1,ฯa,nkโˆ’1)โ‰ค2โ€‹D~nkโˆ’1W_{p}\left(\tilde{\rho}_{a,n_{k-1}},\rho_{a,n_{k-1}}\right)\leq\frac{2\tilde{D}}{\sqrt{n_{k-1}}}.

    Taking the initial condition ฯ0=ฯ~a,nkโˆ’1\rho_{0}=\tilde{\rho}_{a,n_{k-1}} in Lemma C.4, we get that:

    Wppโ€‹(ฯ~iโ€‹ฮท,nk,ฯnkโˆ—)โ‰ค(1โˆ’m^8โ€‹ฮท)pโ€‹iโ€‹Wppโ€‹(ฯ~a,nkโˆ’1,ฯnkโˆ—)+25โ€‹pโ€‹L^pm^pโ€‹(dโ€‹p)p/2โ€‹(ฮท)p/2+22โ€‹p+3โ€‹ฮ”pm^p.W_{p}^{p}(\tilde{\rho}_{i\eta,n_{k}},\rho_{n_{k}}^{*})\leq\left(1-\frac{\hat{m}}{8}\eta\right)^{pi}W_{p}^{p}(\tilde{\rho}_{a,n_{k-1}},\rho_{n_{k}}^{*})+2^{5p}\frac{\hat{L}^{p}}{\hat{m}^{p}}(dp)^{p/2}(\eta)^{p/2}+2^{2p+3}\frac{\Delta_{p}}{\hat{m}^{p}}.

Now we know that:

Wpโ€‹(ฯnkโˆ—,ฯ~a,nkโˆ’1)\displaystyle W_{p}(\rho_{n_{k}}^{*},\tilde{\rho}_{a,n_{k-1}}) โ‰คWpโ€‹(ฯnkโˆ—,ฯnkโˆ’1โˆ—)+Wpโ€‹(ฯnkโˆ’1โˆ—,ฯ~a,nkโˆ’1)\displaystyle\leq W_{p}(\rho_{n_{k}}^{*},\rho_{n_{k-1}}^{*})+W_{p}(\rho_{n_{k-1}}^{*},\tilde{\rho}_{a,n_{k-1}})
โ‰คD~nk+D~nkโˆ’1+2โ€‹D~nkโˆ’1\displaystyle\leq\frac{\tilde{D}}{\sqrt{n_{k}}}+\frac{\tilde{D}}{\sqrt{n_{k-1}}}+\frac{2\tilde{D}}{\sqrt{n_{k-1}}}
โ‰ค8โ€‹D~nk\displaystyle\leq\frac{8\tilde{D}}{\sqrt{n_{k}}}

    where the first inequality follows from the triangle inequality, the second from the assumption on the posterior and the induction hypothesis, and the last upper bounds the expression using the fact that nkโ‰ค2โ€‹nkโˆ’1n_{k}\leq 2n_{k-1}. Hence we recover the same upper bound as in the fully sequential proof: the argument has enough slack that, even though up to twice as much data is collected between posterior updates, the same bounds hold. With the stated choice of hyperparameters, taking i=Ni=N and applying Lemma C.3 leads to the conclusion that Wpโ€‹(ฯ~a,nk,ฯa,nk)โ‰ค2โ€‹D~nkW_{p}\left(\tilde{\rho}_{a,n_{k}},\rho_{a,n_{k}}\right)\leq\frac{2\tilde{D}}{\sqrt{n_{k}}}.

โ– \blacksquare
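For reference, here is a minimal sketch of the hyperparameter choices prescribed by Theorem 2 above (mini-batch size, step size, and number of SGLD iterations as functions of m, L, nu and the current sample count n_k); the function name and interface are ours.

import numpy as np

def sgld_hyperparams(m, L, nu, n_k):
    # Hyperparameter choices from Theorem 2 (SGLD convergence with batched data).
    batch_size = int(np.ceil(32.0 * L**2 / (m * nu)))                       # s = O(kappa^2)
    step_size = m * n_k / (32.0 * L**2 * (n_k + 1)**2)                      # eta = O(1/(L kappa n_k))
    n_steps = int(np.ceil(1280.0 * L**2 * (n_k + 1)**2 / (m**2 * n_k**2)))  # N = O(kappa^2)
    return batch_size, step_size, n_steps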

We now state the concentration result provided by SGLD in Lemma C.5, which shows that the probability that parameters sampled from the approximate posterior are far away from the true parameters is small. Lemma C.5 extends Lemma 11 in Mazumdar et al., (2020) to the batched setting.

Lemma C.5 (Concentration of SGLD in bandits).

For a fixed arm aโˆˆ๐’œa\in\mathcal{A}, suppose it has been pulled nkโˆ’1n_{k-1} times up to batch kโˆ’1k-1 and nkn_{k} times up to batch kk (where nkโ‰ค2โ€‹nkโˆ’1n_{k}\leq 2n_{k-1}). Suppose that Assumptions 1-4 are satisfied. Then for ฮดโˆˆ(0,1)\delta\in(0,1), with parameters as specified in Theorem 2, the sampled parameter ฮธak\theta_{a}^{k} generated in the kk-th batch satisfies,

โ„™ฮธakโˆผฯ~a,nkโ€‹[ฮณ]โ€‹(โ€–ฮธakโˆ’ฮธaโˆ—โ€–2>36โ€‹enkโ€‹mโ€‹(d+logโกQa+2โ€‹ฯƒโ€‹logโก1/ฮด+2โ€‹(ฯƒ+mโ€‹d18โ€‹Lโ€‹ฮณ)โ€‹logโก1/ฮด)|Zkโˆ’1)<ฮด,\mathbb{P}_{\theta_{a}^{k}\sim\tilde{\rho}_{a,n_{k}}[\gamma]}\left(\|\theta_{a}^{k}-\theta_{a}^{*}\|_{2}>\sqrt{\frac{36e}{n_{k}m}\left(d+\log Q_{a}+2\sigma\log 1/\delta+2(\sigma+\frac{md}{18L\gamma})\log 1/\delta\right)}~{}~{}\Bigg{|}~{}Z_{k-1}\right)<\delta,

where Zkโˆ’1={โˆฅฮธakโˆ’1โˆ’ฮธaโˆ—โˆฅ2โ‰คC(nk)Z_{k-1}=\{\left\|\theta_{a}^{k-1}-\theta_{a}^{*}\right\|_{2}\leq C(n_{k}) }, Cโ€‹(nk)=18โ€‹enkโ€‹mโ€‹(d+logโกQa+2โ€‹ฯƒโ€‹logโก1/ฮด)0.5C(n_{k})=\sqrt{\frac{18e}{n_{k}m}}(d+\log Q_{a}+2\sigma\log 1/\delta)^{0.5}, ฯƒ=16+4โ€‹dโ€‹L2ฮฝโ€‹m\sigma=16+\frac{4dL^{2}}{\nu m}.

  • Proof of Lemma C.5

The proof follows exactly as Lemma 11 from Mazumdar etย al., (2020) by replacing the notations in fully-sequential settings by those in batched settings, i.e., ฮธa,t\theta_{a,t} by ฮธak\theta_{a}^{k}, ฮธa,tโˆ’1\theta_{a,t-1} by ฮธakโˆ’1\theta_{a}^{k-1}.

โ– \blacksquare

Appendix D Proofs of Langevin Thompson Sampling in Multi-armed Bandits

In this section, we provide the regret proofs for the BLTS algorithm in the stochastic multi-armed bandit (MAB) setting discussed in Section 5. In particular, we discuss the information-exchange guarantees under the dynamic batching scheme and its communication cost. We then combine the convergence of SGLD from Appendix C with the above results to prove the problem-dependent regret bound in the MAB setting.

D.1 Notations

We first introduce the notation being used in this section, which is summarized in Table 4.

Symbol Meaning
๐’œ\mathcal{A} set of arms in bandit environment
NN number of arms in bandit environment, i.e., |๐’œ||\mathcal{A}|
TT time horizon
KK total number of batches
Bโ€‹(t)B(t) starting time of the batch containing timestep tt
BkB_{k} starting time of the kk-th batch
lal_{a} trigger of dynamic batches (a batch is formed when kaโ€‹(t)=2lak_{a}(t)=2^{l_{a}}), a monotonically-increasing integer for arm aa
kaโ€‹(t)k_{a}(t) the number of times that arm aa has been pulled up to time tt
paโ€‹(r|ฮธa)p_{a}(r|\theta_{a}) reward distribution of arm aa parameterized by ฮธaโˆˆโ„d\theta_{a}\in\mathbb{R}^{d}
ฮธa\theta_{a} parameter of reward distribution for arm aโˆˆ๐’œa\in\mathcal{A}
ฮผa\mu_{a} expected reward of arm aa, ฮผa:=๐”ผโ€‹[ra|ฮธaโˆ—]\mu_{a}:=\mathbb{E}[r_{a}|\theta_{a}^{*}]
ฮผ^a\hat{\mu}_{a} estimated expected reward of arm aa, ฮผ^a:=๐”ผโ€‹[โ€–ฮธaโ€–]\hat{\mu}_{a}:=\mathbb{E}[\left\|\theta_{a}\right\|]
QaQ_{a} quality of prior for arm aa, Qa:=maxฮธโกpaโ€‹(ฮธ)paโ€‹(ฮธaโˆ—)Q_{a}:=\max_{\theta}\frac{p_{a}(\theta)}{p_{a}(\theta_{a}^{*})}
ฮบ\kappa condition number of parameterized reward distribution, ฮบ:=maxโก{L/m,L/ฮฝ}\kappa:=\max\{L/m,L/\nu\}
ฮปaโ€‹(ฮธa)\lambda_{a}(\theta_{a}) prior distribution over ฮธaโˆˆโ„d\theta_{a}\in\mathbb{R}^{d}
UU energy function of posterior distribution ฯ\rho : ฯโˆeโˆ’U\rho\propto e^{-U}
LL Lipschitz constant of the true reward distribution and likelihood families paโ€‹(r|ฮธโˆ—)p_{a}(r|\theta^{*}) in rr
mm strong log-concavity parameter of paโ€‹(r;ฮธ)p_{a}(r;\theta) in ฮธ\theta for all rr
ฮฝ\nu strong log-concavity parameter of paโ€‹(r;ฮธ)p_{a}(r;\theta) in rr
Table 4: Notations in multi-armed bandit setting.

D.2 Communication cost of Dynamic Doubling Batching Scheme

In the batched setting, striking a balance between the batch size and the number of batches is critical to achieving optimal performance. More specifically, it is crucial to balance the number of actions taken within each batch against the frequency of starting new batches to collect new data and update the posteriors. According to Lemma D.2, the dynamic doubling batching scheme guarantees that an arm that has been pulled kk times has at least k/2k/2 observed rewards, indicating that communication between the learner and the environment is sufficient under this scheme; a code sketch of the batch trigger is given after the proof below.

Lemma D.2 (restated). Let tt be the current time step, Bโ€‹(t)B(t) be the starting time of the current batch, and kaโ€‹(t)k_{a}(t) be the number of times that arm aa has been pulled up to time tt. For all aโˆˆ๐’œa\in\mathcal{A}, the dynamic batching scheme ensures:

12โ€‹kaโ€‹(t)โ‰คkaโ€‹(Bโ€‹(t))โ‰คkaโ€‹(t).\frac{1}{2}k_{a}(t)\leq k_{a}(B(t))\leq k_{a}(t).
  • Proof of Lemmaย D.2

    By the mechanism of our batching scheme, a new batch begins when the number of times any arm aโˆˆ๐’œa\in\mathcal{A} has been pulled doubles. This implies that the number of times an arm is pulled within a batch is at most the number of times it had been pulled at the beginning of that batch. At any time step tโ‰คTt\leq T:

    kaโ€‹(t)โˆ’kaโ€‹(Bโ€‹(t))โ‰คkaโ€‹(Bโ€‹(t)),k_{a}(t)-k_{a}(B(t))\leq k_{a}(B(t)),

    which gives 12โ€‹kaโ€‹(t)โ‰คkaโ€‹(Bโ€‹(t))\frac{1}{2}k_{a}(t)\leq k_{a}(B(t)). On the other hand, kaโ€‹(Bโ€‹(t))โ‰คkaโ€‹(t)k_{a}(B(t))\leq k_{a}(t) holds due to the fact that Bโ€‹(t)โ‰คtB(t)\leq t.

    โ– \blacksquare
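To make the trigger concrete, here is a minimal sketch of the doubling rule used above: a new batch starts as soon as some arm's pull count reaches twice its count at the start of the current batch (equivalently, hits the next power of two 2^{l_a}). The function and variable names are illustrative, not from the paper.

def should_start_new_batch(pulls_now, pulls_at_batch_start):
    # Inputs are dicts mapping arm -> count: pulls_now[a] = k_a(t), pulls_at_batch_start[a] = k_a(B(t)).
    # Triggering when k_a(t) >= 2 * k_a(B(t)) enforces k_a(B(t)) >= k_a(t) / 2,
    # which is exactly the guarantee of Lemma D.2.
    # (max(1, .) is an illustrative convention for arms not yet pulled at B(t).)
    return any(pulls_now[a] >= 2 * max(1, pulls_at_batch_start[a]) for a in pulls_now)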

Next, we show that by employing the dynamic doubling batching scheme, the BATS algorithm achieves optimal performance using only a logarithmic number of communication rounds (measured in terms of batches).

Theorem 5.1 (restated).

  • Proof of Theoremย 5.1

    Denote by BkB_{k} the starting time of the kk-th batch, let laโ€‹(Bk)l_{a}(B_{k}) be the trigger integer for arm aa at time BkB_{k}, and let KK be the total number of rounds of interaction with the environment, i.e., the number of batches. Then for each arm aโˆˆ๐’œa\in\mathcal{A}, kaโ€‹(T)โ‰คTk_{a}(T)\leq T, and

    kaโ€‹(T)=โˆ‘k=1Kโˆ’1kaโ€‹(Bk+1)โˆ’kaโ€‹(Bk)โ‰คโˆ‘k=1Kโˆ’1kaโ€‹(Bk)=โˆ‘k=1Kโˆ’12laโ€‹(Bk)โˆ’1โ‰คโˆ‘l=0Kโˆ’12l,\displaystyle k_{a}(T)=\sum_{k=1}^{K-1}k_{a}(B_{k+1})-k_{a}(B_{k})\leq\sum_{k=1}^{K-1}k_{a}(B_{k})=\sum_{k=1}^{K-1}2^{l_{a}(B_{k})-1}\leq\sum_{l=0}^{K-1}2^{l},

    where the second and third step result from the dynamic batching scheme. Thus for each arm aa, we have

    Kโ‰คlogโก(T+1).K\leq\log(T+1).

    The proof is then completed by multiplying the above bound by the number of arms NN, since every batch is triggered by some arm. โ– \blacksquare

D.3 Regret Proofs in Multi-armed Bandit

With the convergence properties shown in Appendixย C, we proceed to prove the regret guarantee of Langevin TS with SGLD. The general idea of our regret proof is to upper bound the total number of times that the sub-optimal arms are pulled over time horizon TT. We remark that the dependence of approximate samples across batches complicates our analysis of TS compared to the existing analyses in bandit literature.

We first decompose the expected regret according to the events of concentration in approximate samples ฮธa,t\theta_{a,t} and the events of estimation accuracy in expected rewards of sub-optimal arms.

For approximate samples ฮธ\theta, define event Eฮธ,aโ€‹(Bk)={โ€–ฮธa,kโˆ’ฮธaโˆ—โ€–<Cโ€‹(nk)},E_{\theta,a}(B_{k})=\left\{\left\|\theta_{a,k}-\theta_{a}^{*}\right\|<C(n_{k})\right\}, which is guaranteed to happen with probability at least (1โˆ’ฮด2)(1-\delta_{2}) by Lemmaย C.5 for some ฮด2โˆˆ[0,1]\delta_{2}\in[0,1]. Let Eฮธ,aโ€‹(T)=โ‹‚t=1TEฮธ,aโ€‹(t)E_{\theta,a}(T)=\bigcap_{t=1}^{T}E_{\theta,a}(t), Eฮธ,aโ€‹(K)=โ‹‚k=1KEฮธ,aโ€‹(Bk)E_{\theta,a}(K)=\bigcap_{k=1}^{K}E_{\theta,a}(B_{k}), where KK is the total number of batches. Without loss of generality, we take โ€–ฮฑaโ€–=1\left\|\alpha_{a}\right\|=1 for all arms in ๐”ผXโˆผpaโ€‹(X|ฮธa)โ€‹[X]=ฮฑaโŠบโ€‹ฮธaโ‰คโ€–ฮธaโ€–\mathbb{E}_{X\sim p_{a}(X|\theta_{a})}[X]=\alpha_{a}^{\intercal}~{}\theta_{a}\leq\left\|\theta_{a}\right\| in the subsequent proofs

Let ฮผ^aโ€‹(t)\hat{\mu}_{a}(t) be the estimate of the expected reward for arm aa at time step tt, and denote the filtration up to time Bโ€‹(t)B(t) as โ„ฑBโ€‹(t):={aโ€‹(ฯ„),raโ€‹(ฯ„),kaโ€‹(Bโ€‹(ฯ„))|ฯ„โ‰คBโ€‹(t)}\mathcal{F}_{B(t)}:=\{a(\tau),r_{a(\tau),k_{a}(B(\tau))}\ |\ \tau\leq B(t)\}. For any sub-optimal arm aโ‰ 1a\neq 1, define event Eฮผ,aโ€‹(t)={ฮผ^aโ€‹(t)โ‰ฅฮผ1โˆ’ฯต}E_{\mu,a}(t)=\{\hat{\mu}_{a}(t)\geq\mu_{1}-\epsilon\} with probability pa,kaโ€‹(Bโ€‹(t))โ€‹(t):=โ„™โ€‹(ฮผ^aโ€‹(t)โ‰ฅฮผ1โˆ’ฯต|โ„ฑBโ€‹(t))p_{a,k_{a}(B(t))}(t):={\mathbb{P}}(\hat{\mu}_{a}(t)\geq\mu_{1}-\epsilon|\mathcal{F}_{B(t)}) for some ฯตโˆˆ(0,1)\epsilon\in(0,1), which signifies the estimation of arm aa is close to the true optimal expected reward.

Lemma D.1 (Regret Decomposition).

Let ฮผa\mu_{a} be the true expected reward of arm aa, ฮผโˆ—=maxaโˆˆ๐’œโกฮผa\mu^{*}=\max_{a\in\mathcal{A}}\mu_{a}, ฮ”a:=ฮผโˆ—โˆ’ฮผa\Delta_{a}:=\mu^{*}-\mu_{a}. The expected regret of Langevin TS with SGLD satisfies:

RTโ‰คโˆ‘aโˆˆ๐’œ(R1+R2+2)โ€‹ฮ”a,\displaystyle R_{T}\leq\sum_{a\in\mathcal{A}}\Big{(}R_{1}+R_{2}+2\Big{)}\Delta_{a},

where R1:=๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(aโ€‹(t)=a,Eฮผ,acโ€‹(t))|Eฮธ,aโ€‹(K)โˆฉEฮธ,1โ€‹(K)],R2:=๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(aโ€‹(t)=a,Eฮผ,aโ€‹(t))|Eฮธ,aโ€‹(K)โˆฉEฮธ,1โ€‹(K)]R_{1}:=\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(a(t)=a,E_{\mu,a}^{c}(t))\ |\ E_{\theta,a}(K)\cap E_{\theta,1}(K)\right],R_{2}:=\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(a(t)=a,E_{\mu,a}(t))\ |\ E_{\theta,a}(K)\cap E_{\theta,1}(K)\right] .

  • Proof of Lemma D.1.

    Recall that

    RT=โˆ‘aโˆˆ๐’œฮ”aโ‹…๐”ผโ€‹[kaโ€‹(T)],ฮ”a=ฮผโˆ—โˆ’ฮผa.\displaystyle R_{T}=\sum_{a\in\mathcal{A}}\Delta_{a}\cdot\mathbb{E}\left[k_{a}(T)\right],\quad\Delta_{a}=\mu^{*}-\mu_{a}.

    For any sub-optimal arm aโ‰ 1a\neq 1, consider the event space โ„ฑฮธ={{Eฮธ,aโ€‹(T)โˆฉEฮธ,1โ€‹(T)},{Eฮธ,aโ€‹(T)โˆฉEฮธ,1โ€‹(T)}C}\mathcal{F}_{\theta}=\{\{E_{\theta,a}(T)\cap E_{\theta,1}(T)\},\{E_{\theta,a}(T)\cap E_{\theta,1}(T)\}^{C}\}, in which Eฮธ,aโ€‹(T)โˆฉEฮธ,1โ€‹(T)E_{\theta,a}(T)\cap E_{\theta,1}(T) denotes the event that all approximate samples of arm aa and optimal arm 11 are concentrated.

    To bound the regret, we bound the expected number of times that each sub-optimal arm is played:

    ๐”ผโ€‹[kaโ€‹(T)]\displaystyle\mathbb{E}\left[k_{a}(T)\right] =๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(aโ€‹(t)=a,Eฮผ,aโ€‹(t)โˆชEฮผ,acโ€‹(t),Eฮธ,aโ€‹(T)โˆฉEฮธ,1โ€‹(T))]+๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(aโ€‹(t)=a,(Eฮธ,aโ€‹(T)โˆฉEฮธ,1โ€‹(T))c)]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(a(t)=a,E_{\mu,a}(t)\cup E_{\mu,a}^{c}(t),E_{\theta,a}(T)\cap E_{\theta,1}(T))\right]+\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(a(t)=a,\left(E_{\theta,a}(T)\cap E_{\theta,1}(T)\right)^{c})\right]
    โ‰ค๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(aโ€‹(t)=a,Eฮผ,aโ€‹(t)โˆชEฮผ,acโ€‹(t))|Eฮธ,aโ€‹(T)โˆฉEฮธ,1โ€‹(T)]+๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(aโ€‹(t)=a,(Eฮธ,aโ€‹(T)โˆฉEฮธ,1โ€‹(T))c)].\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(a(t)=a,E_{\mu,a}(t)\cup E_{\mu,a}^{c}(t))\Big{|}E_{\theta,a}(T)\cap E_{\theta,1}(T)\right]+\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(a(t)=a,\left(E_{\theta,a}(T)\cap E_{\theta,1}(T)\right)^{c})\right].

    where the inequality results from Pโ€‹(Eฮธ,aโ€‹(T)โˆฉEฮธ,1โ€‹(T))โ‰ค1P(E_{\theta,a}(T)\cap E_{\theta,1}(T))\leq 1.

    For any arm aโˆˆ๐’œa\in\mathcal{A} in each batch, approximate samples are independently generated from the identical approximate distribution ฯ~aโ€‹(ฮธa|Ra)\tilde{\rho}_{a}(\theta_{a}|R_{a}). Thus, approximate samples for arm aa are independent within the same batch, while being dependent across different batches, implying

    {โ„™โ€‹(Eฮธ,aโ€‹(T))=โˆt=1Tโ„™โ€‹(Eฮธ,aโ€‹(t)|Eฮธ,aโ€‹(1),โ€ฆ,Eฮธ,aโ€‹(tโˆ’1))=โˆk=1Kโˆ’1โ„™โ€‹(Eฮธ,aโ€‹(Bk+1)|Eฮธ,aโ€‹(Bk))Tk+1โ„™โ€‹(Eฮธ,acโ€‹(T))=โ„™โ€‹(โ‹ƒt=1TEฮธ,acโ€‹(t))=โˆ‘t=1Tโ„™โ€‹(Eฮธ,acโ€‹(t))=โˆ‘k=1KTkโ€‹โ„™โ€‹(Eฮธ,acโ€‹(Bk)),\displaystyle\begin{cases}{\mathbb{P}}(E_{\theta,a}(T))&=\prod_{t=1}^{T}{\mathbb{P}}(E_{\theta,a}(t)|E_{\theta,a}(1),\dots,E_{\theta,a}(t-1))=\prod_{k=1}^{K-1}{\mathbb{P}}(E_{\theta,a}(B_{k+1})|E_{\theta,a}(B_{k}))^{T_{k+1}}\\ {\mathbb{P}}(E^{c}_{\theta,a}(T))&={\mathbb{P}}(\bigcup_{t=1}^{T}E^{c}_{\theta,a}(t))=\sum_{t=1}^{T}{\mathbb{P}}(E^{c}_{\theta,a}(t))=\sum_{k=1}^{K}T_{k}{\mathbb{P}}(E^{c}_{\theta,a}(B_{k}))\end{cases},

    where Tk:=Bk+1โˆ’BkT_{k}:=B_{k+1}-B_{k} is the number of time steps in the kk-th batch, namely, the length of the batch. By Lemmaย C.5, for each arm aa in batch BkB_{k}, โ„™โ€‹(Eฮธ,acโ€‹(Bk))โ‰คฮด2,(1โˆ’ฮด2)โ‰คโ„™โ€‹(Eฮธ,aโ€‹(Bk))โ‰ค1{\mathbb{P}}(E^{c}_{\theta,a}(B_{k}))\leq\delta_{2},(1-\delta_{2})\leq{\mathbb{P}}(E_{\theta,a}(B_{k}))\leq 1, which gives:

    {โ„™โ€‹[Eฮธ,aโ€‹(T)โˆฉEฮธ,1โ€‹(T)]โ‰คโ„™โ€‹[Eฮธ,aโ€‹(K)โˆฉEฮธ,1โ€‹(K)]โ„™โ€‹[Eฮธ,acโ€‹(T)โˆชEฮธ,1cโ€‹(T)]โ‰คโ„™โ€‹[Eฮธ,acโ€‹(T)]+โ„™โ€‹[Eฮธ,1cโ€‹(T)]โ‰ค2โ€‹ฮด2โ€‹โˆ‘k=1TTk=2โ€‹ฮด2โ€‹T.\displaystyle\begin{cases}{\mathbb{P}}[E_{\theta,a}(T)\cap E_{\theta,1}(T)]\leq{\mathbb{P}}[E_{\theta,a}(K)\cap E_{\theta,1}(K)]\\ {\mathbb{P}}[E_{\theta,a}^{c}(T)\cup E_{\theta,1}^{c}(T)]\leq{\mathbb{P}}[E_{\theta,a}^{c}(T)]+{\mathbb{P}}[E_{\theta,1}^{c}(T)]\leq 2\delta_{2}\sum_{k=1}^{T}T_{k}=2\delta_{2}T\qquad\qquad\qquad\qquad\qquad\end{cases}.

    Setting ฮด2=1/T2\delta_{2}=1/T^{2} gives,

    ๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(aโ€‹(t)=a,(Eฮธ,aโ€‹(T)โˆฉEฮธ,1โ€‹(T))c)]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(a(t)=a,\left(E_{\theta,a}(T)\cap E_{\theta,1}(T)\right)^{c})\right] =๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(aโ€‹(t)=a)|Eฮธ,aโ€‹(T)cโˆชEฮธ,1cโ€‹(T)]โ€‹โ„™โ€‹[Eฮธ,acโ€‹(T)โˆชEฮธ,1cโ€‹(T)]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(a(t)=a)\ |\ E_{\theta,a}(T)^{c}\cup E_{\theta,1}^{c}(T)\right]{\mathbb{P}}\bigg{[}{E_{\theta,a}^{c}(T)\cup E_{\theta,1}^{c}(T)}\bigg{]}
    โ‰ค2โ€‹ฮด2โ€‹Tโ€‹๐”ผโ€‹[kaโ€‹(T)|Eฮธ,aโ€‹(T)cโˆชEฮธ,1โ€‹(T)c]โ‰ค2โ€‹ฮด2โ€‹T2โ‰ค2.\displaystyle\leq 2\delta_{2}T\mathbb{E}\bigg{[}k_{a}(T)\ |\ E_{\theta,a}(T)^{c}\cup E_{\theta,1}(T)^{c}\bigg{]}\leq 2\delta_{2}T^{2}\leq 2.

    Plugging these results into the definition of the regret yields,

    Rโ€‹(T)\displaystyle R(T) โ‰คโˆ‘aโˆˆ๐’œฮ”aโ‹…(๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(aโ€‹(t)=a,Eฮผ,aโ€‹(t)โˆชEฮผ,acโ€‹(t))|Eฮธ,aโ€‹(T)โˆฉEฮธ,1โ€‹(T)]+2)\displaystyle\leq\sum_{a\in\mathcal{A}}\Delta_{a}\cdot\Bigg{(}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(a(t)=a,E_{\mu,a}(t)\cup E_{\mu,a}^{c}(t))\ \Big{|}\ E_{\theta,a}(T)\cap E_{\theta,1}(T)\right]+2\Bigg{)}
    โ‰คโˆ‘aโˆˆ๐’œฮ”aโ‹…(๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(aโ€‹(t)=a,Eฮผ,aโ€‹(t)โˆชEฮผ,acโ€‹(t))|Eฮธ,aโ€‹(K)โˆฉEฮธ,1โ€‹(K)]+2)\displaystyle\leq\sum_{a\in\mathcal{A}}\Delta_{a}\cdot\Bigg{(}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(a(t)=a,E_{\mu,a}(t)\cup E_{\mu,a}^{c}(t))\ \Big{|}\ E_{\theta,a}(K)\cap E_{\theta,1}(K)\right]+2\Bigg{)}

    โ– \blacksquare

We then proceed to bound R1R_{1} and R2R_{2} in Lemma D.2 and Lemma D.3, respectively. The key to maintaining optimal regret is to maximize the probability of pulling the optimal arm by ensuring that the event Eฮผ,aโ€‹(t)E_{\mu,a}(t) occurs with low probability for all sub-optimal arms.

Lemma D.2 (Bound term R1R_{1}).

It can be shown that,

R1โ‰ค๐”ผโ€‹[โˆ‘t=1T(1p1,k1โ€‹(Bโ€‹(t))โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)].R_{1}\leq\mathbb{E}\left[\sum_{t=1}^{T}\bigg{(}\frac{1}{p_{1,k_{1}(B(t))}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right].
  • Proof of lemma D.2.

    Note that arm aa is played at time tt if and only if ฮผ^aโ€ฒโ‰คฮผ^a,โˆ€aโ€ฒโˆˆ๐’œ\hat{\mu}_{a^{\prime}}\leq\hat{\mu}_{a},\ \forall{a^{\prime}\in\mathcal{A}}. Thus for a sub-optimal arm aa, the following event relationship holds: {aโ€‹(t)=a,Eฮผ,acโ€‹(t)}={aโ€‹(t)=a,Eฮผ,acโ€‹(t),โˆฉaโ€ฒโ‰ aEฮผ,aโ€ฒcโ€‹(t)}โІ{โˆฉaโ€ฒโˆˆ๐’œEฮผ,aโ€ฒcโ€‹(t)}\{a(t)=a,E_{\mu,a}^{c}(t)\}=\{a(t)=a,E_{\mu,a}^{c}(t),\cap_{a^{\prime}\neq a}E_{\mu,a^{\prime}}^{c}(t)\}\subseteq\{\cap_{a^{\prime}\in\mathcal{A}}E_{\mu,a^{\prime}}^{c}(t)\}, and {Eฮผ,1โ€‹(t),โˆฉaโ€ฒโ‰ 1Eฮผ,aโ€ฒcโ€‹(t)}={aโ€‹(t)=1,Eฮผ,1โ€‹(t),โˆฉaโ€ฒโ‰ 1Eฮผ,aโ€ฒcโ€‹(t)}โІ{aโ€‹(t)=1}\{E_{\mu,1}(t),\cap_{a^{\prime}\neq 1}E_{\mu,a^{\prime}}^{c}(t)\}=\{a(t)=1,E_{\mu,1}(t),\cap_{a^{\prime}\neq 1}E_{\mu,a^{\prime}}^{c}(t)\}\subseteq\{a(t)=1\}. We then have,

    {โ„™โ€‹[aโ€‹(t)=a,Eฮผ,acโ€‹(t)|โ„ฑBโ€‹(t)]โ‰คโ„™โ€‹[โ‹‚aโ€ฒโˆˆ๐’œEฮผ,aโ€ฒcโ€‹(t)|โ„ฑBโ€‹(t)]=โ„™โ€‹[โ‹‚aโ€ฒโ‰ 1Eฮผ,aโ€ฒcโ€‹(t)|โ„ฑBโ€‹(t)]โ€‹(1โˆ’โ„™โ€‹[Eฮผ,1โ€‹(t)|โ„ฑBโ€‹(t)])โ„™โ€‹[aโ€‹(t)=1|โ„ฑBโ€‹(t)]โ‰ฅโ„™โ€‹[Eฮผ,1โ€‹(t)|โ„ฑBโ€‹(t)]โ€‹โ„™โ€‹[โ‹‚aโ€ฒโ‰ 1Eฮผ,aโ€ฒcโ€‹(t)|โ„ฑBโ€‹(t)]\displaystyle\begin{cases}&{\mathbb{P}}\left[a(t)=a,E_{\mu,a}^{c}(t)\ |\ \mathcal{F}_{B(t)}\right]\leq{\mathbb{P}}\big{[}\bigcap_{a^{\prime}\in\mathcal{A}}E_{\mu,a^{\prime}}^{c}(t)\ |\ \mathcal{F}_{B(t)}\big{]}={\mathbb{P}}\big{[}\bigcap_{a^{\prime}\neq 1}E_{\mu,a^{\prime}}^{c}(t)\ |\ \mathcal{F}_{B(t)}\big{]}\big{(}1-{\mathbb{P}}\big{[}E_{\mu,1}(t)\ |\ \mathcal{F}_{B(t)}\big{]}\big{)}\\ &{\mathbb{P}}\left[a(t)=1\ |\ \mathcal{F}_{B(t)}\right]\geq{\mathbb{P}}\big{[}E_{\mu,1}(t)\ |\ \mathcal{F}_{B(t)}\big{]}{\mathbb{P}}\big{[}\bigcap_{a^{\prime}\neq 1}E_{\mu,a^{\prime}}^{c}(t)\ |\ \mathcal{F}_{B(t)}\big{]}\end{cases}

    Recall that p1,k1โ€‹(Bโ€‹(t))โ€‹(t):=โ„™โ€‹[Eฮผ,1โ€‹(t)|โ„ฑBโ€‹(t)]p_{1,k_{1}(B(t))}(t):={\mathbb{P}}[E_{\mu,1}(t)\ |\ \mathcal{F}_{B(t)}]. Combining the above two equations shows that the probability of pulling a sub-optimal arm aa is bounded by the probability of pulling the optimal arm with an exponentially decaying coefficient:

    โ„™โ€‹[aโ€‹(t)=a,Eฮผ,acโ€‹(t)|โ„ฑBโ€‹(t)]โ‰ค(1p1,k1โ€‹(Bโ€‹(t))โ€‹(t)โˆ’1)โ€‹โ„™โ€‹[aโ€‹(t)=1|โ„ฑBโ€‹(t)].{\mathbb{P}}\left[a(t)=a,E_{\mu,a}^{c}(t)\ |\ \mathcal{F}_{B(t)}\right]\leq\bigg{(}\frac{1}{p_{1,k_{1}(B(t))}(t)}-1\bigg{)}{\mathbb{P}}\left[a(t)=1\ |\ \mathcal{F}_{B(t)}\right]. (7)

    Therefore, R1R_{1} is upper bounded accordingly:

    R1\displaystyle R_{1} =๐”ผ[โˆ‘t=1T๐”ผ[๐•€(a(t)=a,Eฮผ,ac(t)|โ„ฑBโ€‹(t)]|Eฮธ,a(K)โ‹‚Eฮธ,1(K)]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}\left[\mathbb{I}(a(t)=a,E_{\mu,a}^{c}(t)\ |\ \mathcal{F}_{B(t)}\right]\ \bigg{|}\ E_{\theta,a}(K)\bigcap E_{\theta,1}(K)\right]
    =๐”ผโ€‹[โˆ‘t=1Tโ„™โ€‹[aโ€‹(t)=a,Eฮผ,acโ€‹(t)|โ„ฑBโ€‹(t)]|Eฮธ,aโ€‹(K)โ€‹โ‹‚Eฮธ,1โ€‹(K)]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}{\mathbb{P}}\left[a(t)=a,E_{\mu,a}^{c}(t)\ |\ \mathcal{F}_{B(t)}\right]\ \bigg{|}\ E_{\theta,a}(K)\bigcap E_{\theta,1}(K)\right]
    โ‰ค๐”ผโ€‹[โˆ‘t=1T(1p1,k1โ€‹(Bโ€‹(t))โ€‹(t)โˆ’1)โ€‹โ„™โ€‹[aโ€‹(t)=1|โ„ฑBโ€‹(t)]|Eฮธ,aโ€‹(K)โ€‹โ‹‚Eฮธ,1โ€‹(K)]\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{T}\bigg{(}\frac{1}{p_{1,k_{1}(B(t))}(t)}-1\bigg{)}{\mathbb{P}}\left[a(t)=1\ |\ \mathcal{F}_{B(t)}\right]\ \bigg{|}\ E_{\theta,a}(K)\bigcap E_{\theta,1}(K)\right]
    =๐”ผโ€‹[โˆ‘t=1T(1p1,k1โ€‹(Bโ€‹(t))โ€‹(t)โˆ’1)โ€‹๐•€โ€‹[aโ€‹(t)=1]|Eฮธ,1โ€‹(K)]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}\bigg{(}\frac{1}{p_{1,k_{1}(B(t))}(t)}-1\bigg{)}\mathbb{I}\big{[}a(t)=1\big{]}\ \bigg{|}E_{\theta,1}(K)\right]
    โ‰ค๐”ผโ€‹[โˆ‘t=1T(1p1,k1โ€‹(Bโ€‹(t))โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)].\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{T}\bigg{(}\frac{1}{p_{1,k_{1}(B(t))}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right].

    โ– \blacksquare

Lemma D.3 (Bound term R2R_{2}).

It can be shown that,

R2โ‰ค1+๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(pa,kaโ€‹(Bโ€‹(t))โ€‹(t)>1T)|Eฮธ,aโ€‹(K)].R_{2}\leq 1+\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}\left(p_{a,k_{a}(B(t))}(t)>\frac{1}{T}\right)\ \bigg{|}\ E_{\theta,a}(K)\right].
  • Proof of Lemma D.3.

    The proof closely follows Agrawal and Goyal, (2012). Let ๐’ฏ:={t|pa,kaโ€‹(Bโ€‹(t))โ€‹(t)>1T}\mathcal{T}:=\{t\ |\ p_{a,k_{a}(B(t))}(t)>\frac{1}{T}\}. The R2R_{2} term can be rewritten as:

    R2\displaystyle R_{2} =๐”ผโ€‹[โˆ‘tโˆˆ๐’ฏ๐•€โ€‹(aโ€‹(t)=a,Eฮผ,aโ€‹(t))|Eฮธ,aโ€‹(K)โ€‹โ‹‚Eฮธ,1โ€‹(K)]+๐”ผโ€‹[โˆ‘tโˆ‰๐’ฏ๐•€โ€‹(aโ€‹(t)=a,Eฮผ,aโ€‹(t))|Eฮธ,aโ€‹(K)โ€‹โ‹‚Eฮธ,1โ€‹(K)]\displaystyle=\mathbb{E}\left[\sum_{t\in\mathcal{T}}\mathbb{I}(a(t)=a,E_{\mu,a}(t))\ \bigg{|}\ E_{\theta,a}(K)\bigcap E_{\theta,1}(K)\right]+\mathbb{E}\left[\sum_{t\notin\mathcal{T}}\mathbb{I}(a(t)=a,E_{\mu,a}(t))\ \bigg{|}\ E_{\theta,a}(K)\bigcap E_{\theta,1}(K)\right]
    โ‰ค๐”ผโ€‹[โˆ‘tโˆˆ๐’ฏ๐•€โ€‹(aโ€‹(t)=a)|Eฮธ,aโ€‹(K)โ€‹โ‹‚Eฮธ,1โ€‹(K)]โŸI+๐”ผโ€‹[โˆ‘tโˆ‰๐’ฏ๐•€โ€‹(Eฮผ,aโ€‹(t))|Eฮธ,aโ€‹(K)โ€‹โ‹‚Eฮธ,1โ€‹(K)]โŸIโ€‹I.\displaystyle\leq\underbrace{\mathbb{E}\left[\sum_{t\in\mathcal{T}}\mathbb{I}(a(t)=a)\ \bigg{|}\ E_{\theta,a}(K)\bigcap E_{\theta,1}(K)\right]}_{I}+\underbrace{\mathbb{E}\left[\sum_{t\notin\mathcal{T}}\mathbb{I}(E_{\mu,a}(t))\ \bigg{|}\ E_{\theta,a}(K)\bigcap E_{\theta,1}(K)\right]}_{II}.

    It follows that the first term satisfies,

    I=๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(aโ€‹(t)=a,pa,kaโ€‹(Bโ€‹(t))โ€‹(t)>1T)|Eฮธ,aโ€‹(K)]โ‰ค๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(pa,kaโ€‹(Bโ€‹(t))โ€‹(t)>1T)|Eฮธ,aโ€‹(K)],\displaystyle I=\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(a(t)=a,p_{a,k_{a}(B(t))}(t)>\frac{1}{T})\ \bigg{|}\ E_{\theta,a}(K)\right]\leq\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}(p_{a,k_{a}(B(t))}(t)>\frac{1}{T})\ \bigg{|}\ E_{\theta,a}(K)\right],

    and the second term satisfies,

    Iโ€‹I=๐”ผโ€‹[โˆ‘tโˆ‰๐’ฏ๐”ผโ€‹[๐•€โ€‹(Eฮผ,aโ€‹(t))|โ„ฑBโ€‹(t)]|Eฮธ,aโ€‹(K)โ€‹โ‹‚Eฮธ,1โ€‹(K)]=๐”ผโ€‹[โˆ‘tโˆ‰๐’ฏpa,kaโ€‹(Bโ€‹(t))โ€‹(t)|Eฮธ,aโ€‹(T)โ€‹โ‹‚Eฮธ,1โ€‹(T)]โ‰ค1,\displaystyle II=\mathbb{E}\left[\sum_{t\notin\mathcal{T}}\mathbb{E}\Big{[}\mathbb{I}(E_{\mu,a}(t))\ \big{|}\ \mathcal{F}_{B(t)}\Big{]}\ \bigg{|}\ E_{\theta,a}(K)\bigcap E_{\theta,1}(K)\right]=\mathbb{E}\left[\sum_{t\notin\mathcal{T}}p_{a,k_{a}(B(t))}(t)\ \bigg{|}\ E_{\theta,a}(T)\bigcap E_{\theta,1}(T)\right]\leq 1,

    where the last inequality holds as pa,kaโ€‹(Bโ€‹(t))โ€‹(t)โ‰ค1/Tp_{a,k_{a}(B(t))}(t)\leq 1/T for tโˆ‰๐’ฏt\notin\mathcal{T}. โ– \blacksquare

Lemma D.4.

Assume that the prior and reward distributions satisfy Assumptions 1-4. Then at each time step tโ‰คTt\leq T, if there are k1โ€‹(Bโ€‹(t))k_{1}(B(t)) observed rewards for arm 11, Algorithm 1 ensures:

๐”ผโ€‹[1p1,k1โ€‹(Bโ€‹(t))โ€‹(t)]โ‰ค36โ€‹Q1,\mathbb{E}\left[\frac{1}{p_{1,k_{1}(B(t))}(t)}\right]\leq 36\sqrt{Q_{1}},

where Q1=maxฮธโˆˆโ„dโกp1โ€‹(ฮธ)p1โ€‹(ฮธ1โˆ—)Q_{1}=\max_{\theta\in\mathbb{R}^{d}}\frac{p_{1}(\theta)}{p_{1}(\theta_{1}^{*})} measures the quality of the prior distribution, Q1โ‰ฅ1Q_{1}\geq 1.

  • Proof of Lemma D.4

    For completeness, we provide the proof of this lemma, which closely follows the proof of Lemma 18 in Mazumdar etย al., (2020).

    For each arm aa, upon running SGLD with batched data in batch kk, by the Cauchy–Schwarz inequality, we have,

    โ„™โ€‹(ฮฑaTโ€‹(ฮธakโˆ’ฮธa,Nโ€‹ฮท)โ‰ฅฮฑ1Tโ€‹(ฮธaโˆ—โˆ’ฮธa,Nโ€‹ฮท)โˆ’ฯต)โ‰ฅโ„™โ€‹(Zโ‰ฅโ€–ฮธaโˆ—โˆ’ฮธa,Nโ€‹ฮทโ€–),\mathbb{P}\left(\alpha_{a}^{\mathrm{T}}(\theta_{a}^{k}-\theta_{a,N\eta})\geq\alpha_{1}^{\mathrm{T}}(\theta_{a}^{*}-\theta_{a,N\eta})-\epsilon\right)\geq\mathbb{P}\left(Z\geq\left\|\theta_{a}^{*}-\theta_{a,N\eta}\right\|\right),

    where Zโˆผ๐’ฉโ€‹(0,1nโ€‹Lโ€‹ฮณโ€‹I)Z\sim\mathcal{N}(0,\frac{1}{nL\gamma}I). Let ฯƒ2=1nโ€‹Lโ€‹ฮณโ€‹I\sigma^{2}=\frac{1}{nL\gamma}I, by anti-concentration of Gaussian random variables, for the optimal arm 11,

    p1,k1โ€‹(Bโ€‹(t))โ€‹(t)โ‰ฅ12โ€‹ฯ€โ€‹{ฯƒโ€‹tt2+ฯƒ2โ€‹eโˆ’t22โ€‹ฯƒ2,t>ฯƒ;0.34,otherwise.\displaystyle p_{1,k_{1}(B(t))}(t)\geq\sqrt{\frac{1}{2\pi}}\begin{cases}\frac{\sigma t}{t^{2}+\sigma^{2}}e^{-\frac{t^{2}}{2\sigma^{2}}},&~{}~{}t>\sigma;\\ 0.34,&~{}~{}\text{otherwise}.\end{cases}

    Taking expectations of both sides and applying the Cauchy–Schwarz inequality,

    ๐”ผโ€‹[1p1,k1โ€‹(Bโ€‹(t))โ€‹(t)]\displaystyle\mathbb{E}\left[\frac{1}{p_{1,k_{1}(B(t))}(t)}\right] โ‰ค3โ€‹2โ€‹ฯ€+2โ€‹ฯ€โ€‹nโ€‹Lโ€‹ฮณโ€‹๐”ผโ€‹[โ€–ฮธ1โˆ—โˆ’ฮธ1,Nโ€‹hโ€–2]โ€‹๐”ผโ€‹[enโ€‹Lโ€‹ฮณโ€‹โ€–ฮธ1โˆ—โˆ’ฮธ1,Nโ€‹hโ€–2]+2โ€‹ฯ€โ€‹๐”ผโ€‹[enโ€‹Lโ€‹ฮณ2โ€‹โ€–ฮธ1โˆ—โˆ’ฮธ1,Nโ€‹hโ€–2].\displaystyle\leq 3\sqrt{2\pi}+\sqrt{2\pi nL\gamma}\sqrt{\mathbb{E}\left[\|\theta_{1}^{*}-\theta_{1,Nh}\|^{2}\right]}\sqrt{\mathbb{E}\left[e^{nL\gamma\|\theta_{1}^{*}-\theta_{1,Nh}\|^{2}}\right]}+\sqrt{2\pi}\mathbb{E}\left[e^{\frac{nL\gamma}{2}\|\theta_{1}^{*}-\theta_{1,Nh}\|^{2}}\right].

    By the convergence guarantee of SGLD in Theoremย 2,

    ๐”ผโ€‹[โ€–ฮธ1โˆ—โˆ’ฮธ1,Nโ€‹hโ€–2]โ‰ค18mโ€‹nโ€‹(d+logโกQ+32+8โ€‹dโ€‹L2ฮฝโ€‹m).\mathbb{E}\left[\|\theta_{1}^{*}-\theta_{1,Nh}\|^{2}\right]\leq\frac{18}{mn}\left(d+\log Q+32+\frac{8dL^{2}}{\nu m}\right).

    Note that โ€–ฮธ1โˆ—โˆ’ฮธ1,Nโ€‹hโ€–2\|\theta_{1}^{*}-\theta_{1,Nh}\|^{2} is a sub-Gaussian random variable; hence, when ฮณโ‰คm32โ€‹Lโ€‹ฯƒ\gamma\leq\frac{m}{32L\sigma},

    ๐”ผโ€‹[enโ€‹Lโ€‹ฮณโ€‹โ€–ฮธ1โˆ—โˆ’ฮธ1,Nโ€‹hโ€–2]โ‰ค3/2โ€‹(e4โ€‹nโ€‹Lโ€‹ฮณโ€‹Dm+2.5).\mathbb{E}[e^{nL\gamma\|\theta_{1}^{*}-\theta_{1,Nh}\|^{2}}]\leq 3/2\left(e^{\frac{4nL\gamma D}{m}}+2.5\right).

    Combining the above results together completes the proof. โ– \blacksquare

With Lemma D.4 in hand, we now proceed to bound the terms in R1R_{1} and R2R_{2} that lead to the final regret bound.

Lemma D.5.

Assume that Assumptions 1-4 are satisfied. Let ฯƒ=16+4โ€‹dโ€‹L2ฮฝโ€‹m\sigma=16+\frac{4dL^{2}}{\nu m} and ฮณ=m32โ€‹Lโ€‹ฯƒ\gamma=\frac{m}{32L\sigma}. Running Algorithm 2 with samples generated from the approximate posteriors using Algorithm 1, we have,

๐”ผโ€‹[โˆ‘t=1T(1p1,k1โ€‹(Bโ€‹(t))โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)]โ‰ค20736โ€‹emโ€‹ฮ”a2โ€‹Q1โ€‹(d+logโกQ1+4โ€‹ฯƒโ€‹logโกT+12โ€‹dโ€‹ฯƒโ€‹logโก2)+1.\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\bigg{(}\frac{1}{p_{1,k_{1}(B(t))}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right]\leq\frac{20736e}{m\Delta_{a}^{2}}\sqrt{Q_{1}}\big{(}d+\log Q_{1}+4\sigma\log T+12d\sigma\log 2\big{)}+1. (8)
๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(pa,kaโ€‹(Bโ€‹(t))โ€‹(t)>1T)|Eฮธ,aโ€‹(K)]โ‰ค576โ€‹emโ€‹ฮ”a2โ€‹(d+logโกQa+10โ€‹dโ€‹ฯƒโ€‹logโก(T)).\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}\left(p_{a,k_{a}(B(t))}(t)>\frac{1}{T}\right)\ \bigg{|}\ E_{\theta,a}(K)\right]\leq\frac{576e}{m\Delta_{a}^{2}}\big{(}d+\log Q_{a}+10d\sigma\log(T)\big{)}. (9)
  • Proof of lemma D.5.

For ease of notation, let the number of observed rewards in batch kk for arm aโˆˆ๐’œa\in\mathcal{A} be nkn_{k}. By definition,

p1,nk=โ„™โ€‹(ฮผ^1โ€‹(t)โ‰ฅฮผ1โˆ’ฯต|โ„ฑBโ€‹(t))โ‰ฅ1โˆ’โ„™โ€‹(โ€–ฮธ1โˆ’ฮธ1โˆ—โ€–>ฯต|โ„ฑBโ€‹(t))\displaystyle\quad p_{1,n_{k}}=\mathbb{P}\big{(}\hat{\mu}_{1}(t)\geq\mu_{1}-\epsilon\ \big{|}\mathcal{F}_{B(t)}\big{)}\geq 1-\mathbb{P}\big{(}\left\|\theta_{1}-\theta_{1}^{*}\right\|>\epsilon\ \big{|}\mathcal{F}_{B(t)}\big{)}

By the concentration property of approximate samples in Lemma C.5, an increasing number of observed rewards for the optimal arm leads to an increasing probability of it being estimated as optimal. Thus, by Lemma D.2, at any time step tโ‰คTt\leq T,

p1,k1โ€‹(Bโ€‹(t))โ€‹(t)โ‰ฅp1,k1โ€‹(t)2โ€‹(t).p_{1,k_{1}(B(t))}(t)\geq p_{1,\frac{k_{1}(t)}{2}}(t).

Concentration is achieved only when a sufficient number of rewards has been observed; we thus require:

โ„™ฮธ1โˆผฯ~1,nkโ€‹[ฮณ]โ€‹(โ€–ฮธ1โˆ’ฮธ1โˆ—โ€–โ‰ฅฯต)โ‰คexpโก(โˆ’16โ€‹dโ€‹ฯƒโ€‹(mโ€‹nkโ€‹ฯต236โ€‹eโˆ’Dยฏ1)),\mathbb{P}_{\theta_{1}\sim\tilde{\rho}_{1,n_{k}}[\gamma]}\left(\left\|\theta_{1}-\theta_{1}^{*}\right\|\geq\epsilon\right)\leq\exp\left(-\frac{1}{6d\sigma}\left(\frac{mn_{k}\epsilon^{2}}{36e}-\bar{D}_{1}\right)\right), (10)

where Dยฏ1=d+logโกQ1+4โ€‹ฯƒโ€‹logโกT,ฯƒ=16+4โ€‹dโ€‹L2ฮฝโ€‹m\bar{D}_{1}=d+\log Q_{1}+4\sigma\log T,\sigma=16+\frac{4dL^{2}}{\nu m}. Choose ฯต=(ฮผ1โˆ’ฮผa)/2=ฮ”a/2\epsilon=(\mu_{1}-\mu_{a})/2=\Delta_{a}/2, and consider the time step tt when arm 11 satisfies:

k1โ€‹(t)=2โŒˆlog2โก2โ€‹lโŒ‰,wโ€‹hโ€‹eโ€‹rโ€‹el=144โ€‹emโ€‹ฮ”a2โ€‹(Dยฏ1+6โ€‹dโ€‹ฯƒโ€‹logโก2).k_{1}(t)=2^{\lceil\log_{2}2l\rceil},\qquad where\quad l=\frac{144e}{m\Delta_{a}^{2}}\big{(}\bar{D}_{1}+6d\sigma\log 2\big{)}.

As k1โ€‹(t)โ‰ฅ2โ€‹lk_{1}(t)\geq 2l, the number of observed rewards is guaranteed to be at least 36โ€‹emโ€‹ฯต2โ€‹Dยฏ\frac{36e}{m\epsilon^{2}}\bar{D}, and โ„™ฮธ1โˆผฯ~1,nkโ€‹[ฮณ](โˆฅฮธ1โˆ’ฮธ1โˆ—โˆฅ\mathbb{P}_{\theta_{1}\sim\tilde{\rho}_{1,n_{k}}[\gamma]}(\left\|\theta_{1}-\theta_{1}^{*}\right\| โ‰ฅฯต)โ‰ค1/2\geq\epsilon)\leq 1/2. Thus, the individual term in R1R_{1} follows:

๐”ผโ€‹[โˆ‘t=1T(1p1,k1โ€‹(Bโ€‹(t))โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)]\displaystyle\quad\mathbb{E}\left[\sum_{t=1}^{T}\bigg{(}\frac{1}{p_{1,k_{1}(B(t))}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right]
โ‰ค๐”ผโ€‹[โˆ‘t=1T(1p1,k1โ€‹(t)2โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)]\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{T}\bigg{(}\frac{1}{p_{1,\frac{k_{1}(t)}{2}}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right]
โ‰ค๐”ผโ€‹[โˆ‘k1โ€‹(t)=0Tโˆ’1(1p1,k1โ€‹(t)2โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)]\displaystyle\leq\mathbb{E}\left[\sum_{k_{1}(t)=0}^{T-1}\bigg{(}\frac{1}{p_{1,\frac{k_{1}(t)}{2}}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right]
โ‰ค๐”ผโ€‹[โˆ‘k1โ€‹(t)=02โŒˆlog2โก2โ€‹lโŒ‰(1p1,k1โ€‹(t)2โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)]+๐”ผโ€‹[โˆ‘k1โ€‹(t)=2โŒˆlog2โก2โ€‹lโŒ‰+1Tโˆ’1(1p1,k1โ€‹(t)2โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)].\displaystyle\leq\mathbb{E}\left[\sum_{k_{1}(t)=0}^{2^{\lceil\log_{2}2l\rceil}}\bigg{(}\frac{1}{p_{1,\frac{k_{1}(t)}{2}}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right]+\mathbb{E}\left[\sum_{k_{1}(t)=2^{\lceil\log_{2}2l\rceil}+1}^{T-1}\bigg{(}\frac{1}{p_{1,\frac{k_{1}(t)}{2}}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right]. (11)

In the early stage, before concentration has been achieved, using the result from Lemma D.4,

๐”ผโ€‹[โˆ‘k1โ€‹(t)=02โŒˆlog2โก2โ€‹lโŒ‰(1p1,k1โ€‹(t)2โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)]โ‰ค2โŒˆlog2โก2โ€‹lโŒ‰โ€‹36โ€‹Q1โ‰ค2โ‹…2โ€‹lโ‹…36โ€‹Q1.\mathbb{E}\left[\sum_{k_{1}(t)=0}^{2^{\lceil\log_{2}2l\rceil}}\bigg{(}\frac{1}{p_{1,\frac{k_{1}(t)}{2}}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right]\leq 2^{\lceil\log_{2}2l\rceil}36\sqrt{Q_{1}}\leq 2\cdot 2l\cdot 36\sqrt{Q_{1}}. (12)

When sufficient rewards for the optimal arm have been accumulated,

๐”ผโ€‹[โˆ‘k1โ€‹(t)=2โŒˆlog2โก2โ€‹lโŒ‰+1Tโˆ’1(1p1,k1โ€‹(t)2โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)]\displaystyle\quad\mathbb{E}\left[\sum_{k_{1}(t)=2^{\lceil\log_{2}2l\rceil}+1}^{T-1}\bigg{(}\frac{1}{p_{1,\frac{k_{1}(t)}{2}}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right]
โ‰ค๐”ผโ€‹[โˆ‘k1โ€‹(t)=0Tโˆ’1(1p1,k1โ€‹(t)2โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)]\displaystyle\leq\mathbb{E}\left[\sum_{k_{1}(t)=0}^{T-1}\bigg{(}\frac{1}{p_{1,\frac{k_{1}(t)}{2}}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right]
โ‰คโˆ‘k1โ€‹(t)=0Tโˆ’11expโก(โˆ’16โ€‹dโ€‹ฯƒ1โ€‹(m1โ€‹ฯต236โ€‹eโ‹…k1โ€‹(t)2))โˆ’1\displaystyle\leq\sum_{k_{1}(t)=0}^{T-1}\frac{1}{\exp\left(-\frac{1}{6d\sigma_{1}}\left(\frac{m_{1}\epsilon^{2}}{36e}\cdot\frac{k_{1}(t)}{2}\right)\right)}-1
โ‰คโˆซz=0โˆž(1expโก(โˆ’mโ€‹ฯต2432โ€‹eโ€‹dโ€‹ฯƒ1โ€‹z)โˆ’1)โ€‹๐‘‘z\displaystyle\leq\int_{z=0}^{\infty}\left(\frac{1}{\exp\left(-\frac{m\epsilon^{2}}{432ed\sigma_{1}}z\right)}-1\right)dz
โ‰ค2โ‹…144โ€‹emโ€‹ฮ”a2โ‹…6โ€‹dโ€‹ฯƒโ€‹logโก2+1.\displaystyle\leq 2\cdot\frac{144e}{m\Delta_{a}^{2}}\cdot 6d\sigma\log 2+1. (13)

Substituting equations (12) and (13) back into (11) yields,

๐”ผโ€‹[โˆ‘t=1T(1p1,k1โ€‹(Bโ€‹(t))โ€‹(t)โˆ’1)|Eฮธ,1โ€‹(K)]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\bigg{(}\frac{1}{p_{1,k_{1}(B(t))}(t)}-1\bigg{)}\ \bigg{|}\ E_{\theta,1}(K)\right] โ‰ค4โ€‹lโ‹…36โ€‹Q1+2โ‹…144โ€‹emโ€‹ฮ”a2โ‹…6โ€‹dโ€‹ฯƒโ€‹logโก2+1\displaystyle\leq 4l\cdot 36\sqrt{Q_{1}}+2\cdot\frac{144e}{m\Delta_{a}^{2}}\cdot 6d\sigma\log 2+1
โ‰ค36โ€‹Q1โ€‹576โ€‹emโ€‹ฮ”a2โ€‹(Dยฏ1+12โ€‹dโ€‹ฯƒโ€‹logโก2)+1.\displaystyle\leq 36\sqrt{Q_{1}}\frac{576e}{m\Delta_{a}^{2}}\big{(}\bar{D}_{1}+12d\sigma\log 2\big{)}+1.

Similarly, for R2R_{2} term with event Eฮผ,aโ€‹(t)={ฮผ^aโ€‹(t)โ‰ฅฮผ1โˆ’ฯต}E_{\mu,a}(t)=\{\hat{\mu}_{a}(t)\geq\mu_{1}-\epsilon\}, let ฯต=(ฮผ1โˆ’ฮผa)/2=ฮ”a/2\epsilon=(\mu_{1}-\mu_{a})/2=\Delta_{a}/2,

pa,kaโ€‹(Bโ€‹(t))โ€‹(t)\displaystyle p_{a,k_{a}(B(t))}(t) =โ„™โ€‹(ฮผ^aโ€‹(t)โˆ’ฮผaโ‰ฅฮผ1โˆ’ฮผaโˆ’ฯต|โ„ฑBโ€‹(t))\displaystyle={\mathbb{P}}(\hat{\mu}_{a}(t)-\mu_{a}\geq\mu_{1}-\mu_{a}-\epsilon|\mathcal{F}_{B(t)})
=โ„™โ€‹(ฮผ^aโ€‹(t)โˆ’ฮผaโ‰ฅฮ”a2|โ„ฑBโ€‹(t))\displaystyle={\mathbb{P}}(\hat{\mu}_{a}(t)-\mu_{a}\geq\frac{\Delta_{a}}{2}|\mathcal{F}_{B(t)})
โ‰คโ„™โ€‹(ฮผ^aโ€‹(t)โˆ’ฮผaโ‰ฅฮ”a2|โ„ฑkaโ€‹(t)2)\displaystyle\leq{\mathbb{P}}(\hat{\mu}_{a}(t)-\mu_{a}\geq\frac{\Delta_{a}}{2}|\mathcal{F}_{\frac{k_{a}(t)}{2}})
=pa,kaโ€‹(t)2โ€‹(t),\displaystyle=p_{a,\frac{k_{a}(t)}{2}}(t),

which gives

๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(pa,kaโ€‹(Bโ€‹(t))โ€‹(t)>1T)|Eฮธ,aโ€‹(K)]\displaystyle\quad\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}\left(p_{a,k_{a}(B(t))}(t)>\frac{1}{T}\right)\ \bigg{|}\ E_{\theta,a}(K)\right]
โ‰ค๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(โ„™โ€‹(ฮผ^aโ€‹(t)โˆ’ฮผaโ‰ฅฮ”a2|โ„ฑkaโ€‹(t)2)>1T)|Eฮธ,aโ€‹(K)]\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}\left({\mathbb{P}}(\hat{\mu}_{a}(t)-\mu_{a}\geq\frac{\Delta_{a}}{2}|\mathcal{F}_{\frac{k_{a}(t)}{2}})>\frac{1}{T}\right)\ \bigg{|}\ E_{\theta,a}(K)\right]
โ‰ค๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(โ„™โ€‹(|ฮผ^aโ€‹(t)โˆ’ฮผa|โ‰ฅฮ”a2|โ„ฑkaโ€‹(t)2)>1T)|Eฮธ,aโ€‹(K)]\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}\left({\mathbb{P}}(|\hat{\mu}_{a}(t)-\mu_{a}|\geq\frac{\Delta_{a}}{2}|\mathcal{F}_{\frac{k_{a}(t)}{2}})>\frac{1}{T}\right)\ \bigg{|}\ E_{\theta,a}(K)\right]
โ‰ค๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(โ„™ฮธaโˆผฯ~a,kaโ€‹(t)2โ€‹(โ€–ฮธaโˆ’ฮธaโˆ—โ€–โ‰ฅฮ”a2)>1T)|Eฮธ,aโ€‹(K)].\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}\left(\mathbb{P}_{\theta_{a}\sim\tilde{\rho}_{a,\frac{k_{a}(t)}{2}}}\big{(}\left\|\theta_{a}-\theta_{a}^{*}\right\|\geq\frac{\Delta_{a}}{2}\big{)}>\frac{1}{T}\right)\ \bigg{|}\ E_{\theta,a}(K)\right].

With the same form of posterior as in equation 10, โ„™ฮธaโˆผฯ~a,kaโ€‹(t)2โ€‹(โ€–ฮธaโˆ’ฮธaโˆ—โ€–โ‰ฅฮ”a2)โ‰ค1T\mathbb{P}_{\theta_{a}\sim\tilde{\rho}_{a,\frac{k_{a}(t)}{2}}}\big{(}\left\|\theta_{a}-\theta_{a}^{*}\right\|\geq\frac{\Delta_{a}}{2}\big{)}\leq\frac{1}{T} for arm aโ‰ 1a\neq 1 holds, when

kaโ€‹(t)>2โ‹…2โ‹…144โ€‹emโ€‹ฮ”a2โ€‹(Dยฏa+6โ€‹dโ€‹ฯƒโ€‹logโก(T)).k_{a}(t)>2\cdot 2\cdot\frac{144e}{m\Delta_{a}^{2}}\big{(}\bar{D}_{a}+6d\sigma\log(T)\big{)}.

Here, the number of observed rewards is guaranteed to be at least 2โŒˆlog2โกlโŒ‰2^{\lceil\log_{2}l\rceil}, where l=144โ€‹emโ€‹ฮ”a2โ€‹(Dยฏa+6โ€‹dโ€‹ฯƒโ€‹logโก(T))l=\frac{144e}{m\Delta_{a}^{2}}\big{(}\bar{D}_{a}+6d\sigma\log(T)\big{)}. Therefore, using the fact that d>1d>1, we have,

๐”ผโ€‹[โˆ‘t=1T๐•€โ€‹(pa,kaโ€‹(Bโ€‹(t))โ€‹(t)>1T)|Eฮธ,aโ€‹(K)]โ‰ค576โ€‹emaโ€‹ฮ”a2โ€‹(Dยฏa+6โ€‹dโ€‹ฯƒaโ€‹logโก(T)).\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}\left(p_{a,k_{a}(B(t))}(t)>\frac{1}{T}\right)\ \bigg{|}\ E_{\theta,a}(K)\right]\leq\frac{576e}{m_{a}\Delta_{a}^{2}}\big{(}\bar{D}_{a}+6d\sigma_{a}\log(T)\big{)}.

โ– \blacksquare

We are ready to prove the final regret bound by combining the results from the above lemmas. Theorem 5.2 (restated).

  • Proof of Theorem 5.2.

    The proof follows directly by combining Lemmas D.1, D.2, D.3, and D.5, which gives,

    RT\displaystyle R_{T} โ‰คโˆ‘aโˆˆ๐’œฮ”aโ‹…(R1+R2+2)\displaystyle\leq\sum_{a\in\mathcal{A}}\Delta_{a}\cdot\Bigg{(}R_{1}+R_{2}+2\Bigg{)}
    โ‰ค(โˆ‘aโˆˆ๐’œ4โ‹…36Q1144โ€‹emโ€‹ฮ”a(d+logQ1+4(16+4โ€‹dโ€‹L2mโ€‹ฮฝ)(logT+3dlog2))+ฮ”a\displaystyle\leq\Bigg{(}\sum_{a\in\mathcal{A}}4\cdot 36\sqrt{Q_{1}}\frac{144e}{m\Delta_{a}}\big{(}d+\log Q_{1}+4\left(16+\frac{4dL^{2}}{m\nu}\right)\left(\log T+3d\log 2\right)\big{)}+\Delta_{a}
    +ฮ”a+4144โ€‹emโ€‹ฮ”a(d+logQa+10d(16+4โ€‹dโ€‹L2mโ€‹ฮฝ)log(T))+2ฮ”a)\displaystyle+\Delta_{a}+4\frac{144e}{m\Delta_{a}}\big{(}d+\log Q_{a}+10d\left(16+\frac{4dL^{2}}{m\nu}\right)\log(T)\big{)}+2\Delta_{a}\Bigg{)}
    โ‰คโˆ‘a>1Cโ€‹Q1mโ€‹ฮ”aโ€‹(d+logโกQ1+dโ€‹ฮบ2โ€‹logโกT+d2โ€‹ฮบ2)+Cmโ€‹ฮ”aโ€‹(d+logโกQa+d2โ€‹ฮบ2โ€‹logโกT)+4โ€‹ฮ”a.\displaystyle\leq\sum_{a>1}\frac{C\sqrt{Q_{1}}}{m\Delta_{a}}\left(d+\log Q_{1}+d\kappa^{2}\log T+d^{2}\kappa^{2}\right)+\frac{C}{m\Delta_{a}}\left(d+\log Q_{a}+d^{2}\kappa^{2}\log T\right)+4\Delta_{a}.

    โ– \blacksquare

Appendix E Proofs of Langevin Posterior Sampling for Reinforcement Learning

In this section, we present the regret proofs for Langevin posterior sampling algorithms in the RL framework under different types of parameterization, and conclude with a real-world example where the General Parameterization from Section 6.2 is applicable.

E.1 Communication cost of Static Doubling Batching Scheme

We first show that under the static doubling batching scheme in the RL setting, the LPSRL algorithm achieves optimal performance using only a logarithmic number of communication rounds (measured in terms of batches or, equivalently, policy switches).

Theorem 3.

Let TkT_{k} be the number of time steps between the (kโˆ’1)(k-1)-th policy switch and the kk-th policy switch, and KTK_{T} be the total number of policy switches for time horizon TT. LPSRL ensures that

KTโ‰คlogโกT+1.K_{T}\leq\log T+1.
  • Proof of Theorem 3.

    By the design of Algorithm 3, at the kk-th policy switch, Tk=2kโˆ’1T_{k}=2^{k-1}. Since the total number of time steps is determined by the time horizon TT, we obtain KT=โŒˆlogโกTโŒ‰โ‰คlogโกT+1K_{T}=\lceil\log T\rceil\leq\log T+1. โ– \blacksquare
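A minimal sketch of this static doubling schedule (batch lengths T_k = 2^(k-1), truncated at the horizon), with an illustrative function name of our own and logarithms taken base 2:

def static_doubling_schedule(T):
    # Batch lengths T_k = 2^(k-1), truncated so that the total equals the horizon T.
    lengths, k, used = [], 1, 0
    while used < T:
        length = min(2 ** (k - 1), T - used)
        lengths.append(length)
        used += length
        k += 1
    return lengths  # len(lengths) = K_T <= log2(T) + 1

For example, static_doubling_schedule(10) returns [1, 2, 4, 3], i.e., K_T = 4 <= log2(10) + 1.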

E.2 Regret Proofs in Average-reward MDPs

In this section, we proceed to prove the theorems in Section 6. To focus on the problem of model estimation, our results are developed under the optimality of policies. (If only suboptimal policies are available in our setting, it can be shown that small approximation errors in the policies contribute only additive, non-leading terms to the regret; see Ouyang et al., (2017) for details.)

While analyses of the Bayes regret in existing works on PSRL crucially rely on the true transition dynamics ฮธโˆ—\theta^{*} being identically distributed with those of the sampled MDP (Russo and Van Roy, (2014)), we show that in Langevin PSRL, sampling from the approximate posterior instead of the true posterior introduces a bias that can be upper bounded by the Wasserstein-11 distance.

Lemma 6 (restated).

  • Proof of Lemmaย 6

    Notice that both ฯ~tk\tilde{\rho}_{t_{k}} and ฯtk{\rho}_{t_{k}} are measurable with respect to ฯƒโ€‹(โ„‹tk)\sigma(\mathcal{H}_{t_{k}}). Therefore, conditioned on the history โ„‹tk\mathcal{H}_{t_{k}}, the only randomness under the expectation comes from the sampling procedure for the approximate posterior, which gives,

    ๐”ผโ€‹[fโ€‹(ฮธk)|โ„‹tk]\displaystyle\mathbb{E}[f(\theta^{k})|\mathcal{H}_{t_{k}}] =โˆซโ„dfโ€‹(ฮธ)โ€‹ฯ~tkโ€‹(dโ€‹ฮธ)\displaystyle=\int_{\mathbb{R}^{d}}f(\theta)\tilde{\rho}_{t_{k}}(d\theta)
    =โˆซโ„dfโ€‹(ฮธ)โ€‹(ฯ~tkโˆ’ฯtk+ฯtkโˆ’ฮดโ€‹(ฮธโˆ—)+ฮดโ€‹(ฮธโˆ—))โ€‹(dโ€‹ฮธ)\displaystyle=\int_{\mathbb{R}^{d}}f(\theta)(\tilde{\rho}_{t_{k}}-\rho_{t_{k}}+\rho_{t_{k}}-\delta(\theta^{*})+\delta(\theta^{*}))(d\theta)
    โ‰ค๐”ผโ€‹[fโ€‹(ฮธk,โˆ—)|โ„‹tk]โˆ’๐”ผโ€‹[fโ€‹(ฮธโˆ—)|โ„‹tk]+๐”ผโ€‹[fโ€‹(ฮธโˆ—)|โ„‹tk]+W1โ€‹(ฯ~tk,ฯtk)\displaystyle\leq\mathbb{E}[f(\theta^{k,*})|\mathcal{H}_{t_{k}}]-\mathbb{E}[f(\theta^{*})|\mathcal{H}_{t_{k}}]+\mathbb{E}[f(\theta^{*})|\mathcal{H}_{t_{k}}]+W_{1}(\tilde{\rho}_{t_{k}},\rho_{t_{k}})
    =๐”ผโ€‹[fโ€‹(ฮธโˆ—)|โ„‹tk]+W1โ€‹(ฯ~tk,ฯtk).\displaystyle=\mathbb{E}[f(\theta^{*})|\mathcal{H}_{t_{k}}]+W_{1}(\tilde{\rho}_{t_{k}},\rho_{t_{k}}). (14)

    The inequality in the third line follows from the fact that, given โ„‹tk\mathcal{H}_{t_{k}}, ฯtk\rho_{t_{k}} is the posterior of ฮธk,โˆ—\theta^{k,*}, together with the dual representation of W1W_{1} with respect to 11-Lipschitz functions ff. The last equality follows from the standard posterior sampling lemma in the Bayesian setting (Osband etย al.,, 2013; Osband and Vanย Roy,, 2014), which states that at time tkt_{k}, given the sigma-algebra ฯƒโ€‹(โ„‹tk)\sigma(\mathcal{H}_{t_{k}}), ฮธk,โˆ—\theta^{k,*} and ฮธโˆ—\theta^{*} are identically distributed:

    ๐”ผโ€‹[fโ€‹(ฮธk,โˆ—)|โ„‹tk]=๐”ผโ€‹[fโ€‹(ฮธโˆ—)|โ„‹tk].\mathbb{E}[f(\theta^{k,*})|\mathcal{H}_{t_{k}}]=\mathbb{E}[f(\theta^{*})|\mathcal{H}_{t_{k}}].

    Following the same argument, conditioned on โ„‹tk\mathcal{H}_{t_{k}}, we also have,

    ๐”ผโ€‹[fโ€‹(ฮธโˆ—)|โ„‹tk]=๐”ผโ€‹[fโ€‹(ฮธk,โˆ—)|โ„‹tk]=โˆซโ„dfโ€‹(ฮธ)โ€‹(ฯtk+ฯ~tkโˆ’ฯ~tk)โ€‹(dโ€‹ฮธ)โ‰ค๐”ผโ€‹[fโ€‹(ฮธk)|โ„‹tk]+W1โ€‹(ฯ~tk,ฯtk).\mathbb{E}[f(\theta^{*})|\mathcal{H}_{t_{k}}]=\mathbb{E}[f(\theta^{k,*})|\mathcal{H}_{t_{k}}]=\int_{\mathbb{R}^{d}}f(\theta)(\rho_{t_{k}}+\tilde{\rho}_{t_{k}}-\tilde{\rho}_{t_{k}})(d\theta)\leq\mathbb{E}[f(\theta^{k})|\mathcal{H}_{t_{k}}]+W_{1}(\tilde{\rho}_{t_{k}},\rho_{t_{k}}). (15)

    Combining Equations (14) and (15) yields Equation (4). Applying the tower rule concludes the proof. โ– \blacksquare
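
To illustrate the role of the Wasserstein-1 term in the bound above, the following sketch (purely illustrative; the one-dimensional Gaussian "exact" and "approximate" posteriors and the sample sizes are assumptions, not part of the paper's setup) estimates the distance between draws from the two distributions with scipy.stats.wasserstein_distance, which computes the empirical 1-Wasserstein distance.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical 1-D posteriors: an "exact" posterior and a biased, slightly inflated approximation.
exact_samples = rng.normal(loc=0.0, scale=1.0, size=5000)     # stands in for draws from rho_{t_k}
approx_samples = rng.normal(loc=0.05, scale=1.1, size=5000)   # stands in for draws from tilde-rho_{t_k}

# Empirical estimate of the bias term W_1(tilde-rho_{t_k}, rho_{t_k}) appearing in Lemma 6.
print("estimated W_1 bias:", wasserstein_distance(exact_samples, approx_samples))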

Corollary E.1 (Tabular Langevin Posterior Sampling).

In tabular settings with finite states and actions, by running an approximate sampling method for each (s,a)โˆˆ๐’ฎร—๐’œ(s,a)\in\mathcal{S}\times\mathcal{A} at time tkt_{k}, it holds that for each policy switch kโˆˆ[KT]k\in[K_{T}],

|๐”ผ[f(ฮธโˆ—)|โ„‹tk]โˆ’๐”ผ[f(ฮธk)|โ„‹tk]|โ‰คโˆ‘(s,a)โˆˆ๐’ฎร—๐’œW1(ฯ~tk(s,a),ฯtk(s,a)),\Big{|}\mathbb{E}[f(\theta^{*})|\mathcal{H}_{t_{k}}]-\mathbb{E}[f(\theta^{k})|\mathcal{H}_{t_{k}}]\Big{|}\leq\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}W_{1}(\tilde{\rho}_{t_{k}}(s,a),\rho_{t_{k}}(s,a)),

where ฯ~tkโ€‹(s,a)\tilde{\rho}_{t_{k}}(s,a) and ฯtkโ€‹(s,a)\rho_{t_{k}}(s,a) are the corresponding approximate posterior and true posterior for (s,a)(s,a) at time tkt_{k}.

  • Proof of Corollaryย E.1

    Since we run the approximate sampling algorithm for each state-action pair at the beginning of each policy switch kk, the total approximation error equals the sum of the approximation errors over all (s,a)(s,a). โ– \blacksquare

We first provide a general regret decomposition in Lemma E.2, which holds for any undiscounted weakly-communicating MDP with infinite horizon in which approximate sampling is adopted and the transition dynamics are Lipschitz.

Lemma E.2 (Regret decomposition.).

For a weakly-communicating MDP with infinite time-horizon TT, the Bayesian regret of Algorithm 3 instantiated with any approximate sampling method can be decomposed as follows:

RBโ€‹(T)โ‰ค๐”ผโ€‹[โˆ‘k=1KTTkโ€‹W1โ€‹(ฯ~tk,ฯtk)]+Hโ€‹(logโกT+1)+Hโ€‹Lpโ€‹๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1โ€–ฮธโˆ—โˆ’ฮธkโ€–],R_{B}(T)\leq\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}T_{k}W_{1}(\tilde{\rho}_{t_{k}},\rho_{t_{k}})\Big{]}+H(\log T+1)+HL_{p}\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left\|\theta^{*}-\theta^{k}\right\|\Big{]}, (16)

where LpL_{p} is the Lipschitz constant of the transition dynamics, and HH is an upper bound on the span of the MDP.

  • Proof of Lemmaย E.2

    We adopt the greedy policy with respect to the sampled model, which gives at=argmaxaโˆˆAa_{t}=\operatorname*{argmax}_{a\in A} rโ€‹(st,a)r(s_{t},a) at each time step tt. By the Bellman optimality equation in Lemma 1,

    Jฯ€kโ€‹(ฮธk)+hฯ€kโ€‹(s,ฮธk)=โ„›โ€‹(st,at)+โˆซsโ€ฒโˆˆ๐’ฎpโ€‹(sโ€ฒ|st,at;ฮธk)โ€‹hฯ€kโ€‹(sโ€ฒ,ฮธk)โ€‹๐‘‘sโ€ฒ,โˆ€tโˆˆ[tk,tk+1โˆ’1].J^{\pi_{k}}(\theta^{k})+h^{\pi_{k}}(s,\theta^{k})=\mathcal{R}(s_{t},a_{t})+\int_{s^{\prime}\in\mathcal{S}}p(s^{\prime}|s_{t},a_{t};\theta^{k})h^{\pi_{k}}(s^{\prime},\theta^{k})ds^{\prime},~{}~{}~{}~{}\forall t\in[t_{k},t_{k+1}-1]. (17)

    We then follow the standard analyses in the RL literature (Osband etย al.,, 2013; Osband and Vanย Roy,, 2014) to decompose the regret into a sum of Bellman errors. Plugging Equation (17) into the definition of the Bayesian regret, we have,

    RBโ€‹(T)\displaystyle R_{B}(T) =๐”ผโ€‹[โˆ‘t=1TJฯ€โˆ—โ€‹(ฮธโˆ—)โˆ’โ„›โ€‹(st,at)]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}J^{\pi^{*}}(\theta^{*})-\mathcal{R}(s_{t},a_{t})\right]
    =๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1Jฯ€โˆ—โ€‹(ฮธโˆ—)โˆ’โ„›โ€‹(st,ฯ€kโ€‹(st))]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}J^{\pi^{*}}(\theta^{*})-\mathcal{R}(s_{t},\pi_{k}(s_{t}))\right]
    =๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1Jโˆ—โ€‹(ฮธโˆ—)โˆ’Jฯ€kโ€‹(ฮธk)]โŸ(i)+๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1(โˆซsโ€ฒโˆˆ๐’ฎpโ€‹(sโ€ฒ|st,ฯ€kโ€‹(st);ฮธk)โ€‹hฯ€kโ€‹(sโ€ฒ,ฮธk)โ€‹๐‘‘sโ€ฒโˆ’hฯ€kโ€‹(st,ฮธk))]โŸ(ii)\displaystyle=\underbrace{\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}J^{*}(\theta^{*})-J^{\pi_{k}}(\theta^{k})\Big{]}}_{\text{(i)}}+\underbrace{\mathbb{E}\left[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left(\int_{s^{\prime}\in\mathcal{S}}p(s^{\prime}|s_{t},\pi_{k}(s_{t});\theta^{k})h^{\pi_{k}}(s^{\prime},\theta^{k})ds^{\prime}-h^{\pi_{k}}(s_{t},\theta^{k})\right)\right]}_{\text{(ii)}} (18)

Term (i). By the property of approximate posterior sampling in Lemma 6 and the non-negativity of Wasserstein distance,

(i)โ‰ค|(i)|\displaystyle\mathrm{(i)}\leq|\mathrm{(i)}| โ‰ค๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1|Jโˆ—โ€‹(ฮธโˆ—)โˆ’Jฯ€kโ€‹(ฮธk)|]โ‰ค๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1W1โ€‹(ฯtk,ฯ~tk)]=๐”ผโ€‹[โˆ‘k=1KTTkโ€‹W1โ€‹(ฯ~tk,ฯtk)].\displaystyle\leq\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big{|}J^{*}(\theta^{*})-J^{\pi_{k}}(\theta^{k})\Big{|}\Big{]}\leq\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}W_{1}(\rho_{t_{k}},\tilde{\rho}_{t_{k}})\Big{]}=\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}T_{k}W_{1}(\tilde{\rho}_{t_{k}},\rho_{t_{k}})\Big{]}. (19)

We remark that this term differs from exact PSRL, where no approximate sampling method is used. To ensure that the final regret is properly bounded, the approximate sampling method being used must provide a sufficient statistical guarantee on the closeness of ฯ~tk\tilde{\rho}_{t_{k}} and ฯtk\rho_{t_{k}} in terms of the Wasserstein-11 distance.

Term (ii). We further decompose term (ii) into a telescoping term ฮ”h\Delta_{h} and the model estimation error ฮ”eโ€‹rโ€‹r\Delta_{err}.

(ii)\displaystyle\mathrm{(ii)} =๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1(โˆซsโ€ฒโˆˆ๐’ฎpโ€‹(sโ€ฒ|st,ฯ€kโ€‹(st);ฮธk)โ€‹hฯ€kโ€‹(sโ€ฒ,ฮธk)โ€‹๐‘‘sโ€ฒโˆ’hฯ€kโ€‹(st,ฮธk)+hฯ€kโ€‹(st+1,ฮธk)โˆ’hฯ€kโ€‹(st+1,ฮธk))]\displaystyle=\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big{(}\int_{s^{\prime}\in\mathcal{S}}p(s^{\prime}|s_{t},\pi_{k}(s_{t});\theta^{k})h^{\pi_{k}}(s^{\prime},\theta^{k})ds^{\prime}-h^{\pi_{k}}(s_{t},\theta^{k})+h^{\pi_{k}}(s_{t+1},\theta^{k})-h^{\pi_{k}}(s_{t+1},\theta^{k})\Big{)}\Big{]}
=๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1(hฯ€kโ€‹(st+1,ฮธk)โˆ’hฯ€kโ€‹(st,ฮธk))]โŸฮ”h\displaystyle=\underbrace{\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big{(}h^{\pi_{k}}(s_{t+1},\theta^{k})-h^{\pi_{k}}(s_{t},\theta^{k})\Big{)}\Big{]}}_{\Delta_{h}}
+๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1โˆซsโ€ฒโˆˆ๐’ฎ(pโ€‹(sโ€ฒ|st,ฯ€kโ€‹(st),ฮธk)โˆ’pโ€‹(sโ€ฒ|st,ฯ€kโ€‹(st),ฮธโˆ—))โ€‹hฯ€kโ€‹(sโ€ฒ,ฮธk)โ€‹๐‘‘sโ€ฒ]โŸฮ”eโ€‹rโ€‹r.\displaystyle+\underbrace{\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\int_{s^{\prime}\in\mathcal{S}}\Big{(}p(s^{\prime}|s_{t},\pi_{k}(s_{t}),\theta^{k})-p(s^{\prime}|s_{t},\pi_{k}(s_{t}),\theta^{*})\Big{)}h^{\pi_{k}}(s^{\prime},\theta^{k})ds^{\prime}\Big{]}}_{\Delta_{err}}.

To bound ฮ”h\Delta_{h}, note that for each kโˆˆ[1,KT]k\in[1,K_{T}], sโ€‹pโ€‹(hโ€‹(ฮธk))โ‰คHsp(h(\theta^{k}))\leq H, and by Theorem 3,

ฮ”h\displaystyle\Delta_{h} =๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1(hฯ€kโ€‹(st+1,ฮธk)โˆ’hฯ€kโ€‹(st,ฮธk))]\displaystyle=\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big{(}h^{\pi_{k}}(s_{t+1},\theta^{k})-h^{\pi_{k}}(s_{t},\theta^{k})\Big{)}\Big{]}
=๐”ผโ€‹[โˆ‘k=1KT(hฯ€kโ€‹(stk+1,ฮธk)โˆ’hฯ€kโ€‹(stk,ฮธk))]\displaystyle=\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\Big{(}h^{\pi_{k}}(s_{t_{k+1}},\theta_{k})-h^{\pi_{k}}(s_{t_{k}},\theta_{k})\Big{)}\Big{]}
โ‰ค๐”ผโ€‹[sโ€‹pโ€‹(hโ€‹(ฮธk))โ€‹KT]\displaystyle\leq\mathbb{E}\left[sp(h(\theta^{k}))K_{T}\right]
โ‰คHโ€‹(logโกT+1).\displaystyle\leq H(\log T+1). (20)

Thus, combining Equations (18), (19), and (20) with Lemma E.3, we conclude the proof.

โ– \blacksquare

Lemma E.3 (Bound estimation error).

Let ฮ”eโ€‹rโ€‹r=๐”ผ[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1โˆซsโ€ฒโˆˆ๐’ฎ(p(sโ€ฒ|st,ฯ€k(st),ฮธk)โˆ’p(sโ€ฒ|st,ฯ€k(st),ฮธโˆ—))hฯ€k(sโ€ฒ,ฮธk)dsโ€ฒ]\Delta_{err}=\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\int_{s^{\prime}\in\mathcal{S}}\Big{(}p(s^{\prime}|s_{t},\pi_{k}(s_{t}),\theta^{k})-p(s^{\prime}|s_{t},\pi_{k}(s_{t}),\theta^{*})\Big{)}h^{\pi_{k}}(s^{\prime},\theta^{k})ds^{\prime}\Big{]}. Suppose Assumption B holds. Then

ฮ”eโ€‹rโ€‹rโ‰คHโ€‹Lpโ€‹๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1โ€–ฮธโˆ—โˆ’ฮธkโ€–].\Delta_{err}\leq HL_{p}\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left\|\theta^{*}-\theta^{k}\right\|\Big{]}. (21)
  • Proof of Lemma E.3

    Recall that ฯ€k\pi_{k} is the optimal policy under ฮธk\theta^{k}, and thus hโ€‹(โ‹…,ฮธk)=hฯ€kโ€‹(โ‹…,ฮธk)h(\cdot,\theta^{k})=h^{\pi_{k}}(\cdot,\theta^{k}); moreover, the span is uniformly bounded in weakly-communicating MDPs: sโ€‹pโ€‹(hโ€‹(ฮธ))โ‰คHsp(h(\theta))\leq H for any ฮธโˆˆโ„d\theta\in\mathbb{R}^{d}. Then, by Assumption B and Hölder's inequality,

    โˆซsโ€ฒโˆˆ๐’ฎ(pโ€‹(sโ€ฒ|st,ฯ€kโ€‹(st),ฮธk)โˆ’pโ€‹(sโ€ฒ|st,ฯ€kโ€‹(st),ฮธโˆ—))โ€‹hฯ€kโ€‹(sโ€ฒ,ฮธk)โ€‹๐‘‘sโ€ฒ\displaystyle\qquad\int_{s^{\prime}\in\mathcal{S}}\Big{(}p(s^{\prime}|s_{t},\pi_{k}(s_{t}),\theta^{k})-p(s^{\prime}|s_{t},\pi_{k}(s_{t}),\theta^{*})\Big{)}h^{\pi_{k}}(s^{\prime},\theta^{k})ds^{\prime}
    โ‰คโˆฅp(โ‹…|st,ฯ€k(st),ฮธk)โˆ’p(โ‹…|st,ฯ€k(st),ฮธโˆ—)โˆฅโˆฅh(โ‹…,ฮธk)โˆฅโˆž\displaystyle\leq\left\|p(\cdot|s_{t},\pi_{k}(s_{t}),\theta^{k})-p(\cdot|s_{t},\pi_{k}(s_{t}),\theta^{*})\right\|\left\|h(\cdot,\theta^{k})\right\|_{\infty}
    โ‰คHโ€‹Lpโ€‹โ€–ฮธโˆ—โˆ’ฮธkโ€–.\displaystyle\leq HL_{p}\left\|\theta^{*}-\theta^{k}\right\|.

    Plugging the result into the definition of ฮ”eโ€‹rโ€‹r\Delta_{err} concludes the proof. โ– \blacksquare

The above regret decomposition in Lemma E.2 holds regardless of the approximate sampling method being employed. To derive the final regret bounds, we discuss the General Parameterization and the Simplex Parameterization respectively.

E.2.1 General Parameterization

The first term in Equation (16) corresponds to the approximation error that accumulates over the time horizon TT due to the use of an approximate sampling method. Upper bounding this term relies on the statistical guarantee provided by the adopted approximate sampling method, which is the main novelty of LPSRL. In this section, we focus on the regret guarantee under the general parameterization.

To maintain the sub-linear regret guarantee, the convergence guarantee provided by SGLD is required to effectively upper bound the approximation error in the first term of Lemma E.2.

Lemma E.4.

Suppose Assumptions B-B are satisfied. Under the general parameterization of MDP, by instantiating LPSRL with SGLD, it holds that for any pโ‰ฅ2p\geq 2,

โˆ‘k=1KTTkโ€‹W1โ€‹(ฯ~tk,ฯtk)โ‰ค24โ€‹Tโ€‹(logโกT+1)mโ€‹(d+logโกQ+(32+8โ€‹dโ€‹ฮบ2)โ€‹p)1/2.\sum_{k=1}^{K_{T}}T_{k}W_{1}(\tilde{\rho}_{t_{k}},\rho_{t_{k}})\leq\sqrt{\frac{24T(\log T+1)}{m}}(d+\log Q+(32+8d\kappa^{2})p)^{1/2}~{}. (22)
  • Proof of Lemma E.4

    By design of Algorithmย 3 and the convergence guarantee of SGLD in Theorem 1, we have,

    โˆ‘k=1KTTkโ€‹W1โ€‹(ฯ~tk,ฯtk)\displaystyle\sum_{k=1}^{K_{T}}T_{k}W_{1}(\tilde{\rho}_{t_{k}},\rho_{t_{k}}) โ‰คโˆ‘k=1logโกT+12kโˆ’1โ€‹Wpโ€‹(ฯ~tk,ฯtk)\displaystyle\leq\sum_{k=1}^{\log T+1}2^{k-1}W_{p}(\tilde{\rho}_{t_{k}},\rho_{t_{k}})
    โ‰คโˆ‘k=1logโกT+12kโˆ’1โ€‹122kโˆ’1โ€‹mโ€‹(d+logโกQ+(32+8โ€‹dโ€‹ฮบ2)โ€‹p)1/2\displaystyle\leq\sum_{k=1}^{\log T+1}2^{k-1}\sqrt{\frac{12}{2^{k-1}m}}(d+\log Q+(32+8d\kappa^{2})p)^{1/2}
    =12mโ€‹(d+logโกQ+(32+8โ€‹dโ€‹ฮบ2)โ€‹p)1/2โ€‹โˆ‘k=1logโกT+12kโˆ’1\displaystyle=\sqrt{\frac{12}{m}}(d+\log Q+(32+8d\kappa^{2})p)^{1/2}\sum_{k=1}^{\log T+1}\sqrt{2^{k-1}}
    โ‰ค24โ€‹Tโ€‹(logโกT+1)mโ€‹(d+logโกQ+(32+8โ€‹dโ€‹ฮบ2)โ€‹p)1/2.\displaystyle\leq\sqrt{\frac{24T(\log T+1)}{m}}(d+\log Q+(32+8d\kappa^{2})p)^{1/2}.

    Here, the first inequality follows from the fact that Wpโ‰ฅWqW_{p}\geq W_{q} for any pโ‰ฅqp\geq q. The second inequality directly follows from Theorem 1, and the last inequality follows from the Cauchy-Schwarz inequality.

    โ– \blacksquare
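
For concreteness, a minimal SGLD-style sampler is sketched below. This is a generic stochastic gradient Langevin dynamics loop under an assumed Gaussian prior and Gaussian likelihood; the step size, batch size, and iteration count are illustrative choices and do not reproduce the exact SGLD instantiation or step-size schedule behind Theorem 1.

import numpy as np

rng = np.random.default_rng(1)

def grad_log_prior(theta, prior_mean=0.0, prior_prec=1.0):
    # Gradient of an assumed Gaussian log-prior.
    return -prior_prec * (theta - prior_mean)

def grad_log_lik(theta, batch, noise_var=1.0):
    # Gradient of an assumed Gaussian log-likelihood for observations y ~ N(theta, noise_var).
    return np.sum(batch - theta) / noise_var

def sgld_draw(data, n_iters=2000, batch_size=32, step=1e-3):
    """One SGLD chain: theta <- theta + (step/2) * (stochastic grad log-posterior) + N(0, step) noise."""
    n, theta = len(data), 0.0
    for _ in range(n_iters):
        batch = rng.choice(data, size=min(batch_size, n), replace=False)
        # Mini-batch gradient rescaled by n/|batch| is an unbiased estimate of the full-data gradient.
        grad = grad_log_prior(theta) + (n / len(batch)) * grad_log_lik(theta, batch)
        theta = theta + 0.5 * step * grad + np.sqrt(step) * rng.normal()
    return theta

data = rng.normal(loc=2.0, scale=1.0, size=500)   # synthetic observations
print("approximate posterior draw:", sgld_draw(data))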

To further upper bound ฮ”eโ€‹rโ€‹r\Delta_{err} in Lemma E.3 under the General Parameterization, we establish the following concentration guarantee provided by SGLD under the static doubling batching scheme adopted by LPSRL.

Lemma E.5 (Concentration of SGLD).

For any policy-switch kโˆˆ[KT]k\in[K_{T}], instantiating LPSRL with SGLD guarantees that

๐”ผโ€‹[Tkโ€‹โ€–ฮธโˆ—โˆ’ฮธkโ€–2]โ‰ค960โ€‹dmโ€‹logโกT.\mathbb{E}\Big{[}T_{k}\|\theta^{*}-\theta^{k}\|^{2}\Big{]}\leq\frac{960d}{m}\log T.
  • Proof of Lemma E.5

At time tkt_{k}, denote by ฮธk,โˆ—\theta^{k,*} the parameter sampled from the true posterior ฯtk\rho_{t_{k}}, and by ntkn_{t_{k}} the total number of available observations. By the triangle inequality,

โ€–ฮธโˆ—โˆ’ฮธkโ€–2โ‰ค3โ€‹(โ€–ฮธโˆ—โˆ’ฮธk,โˆ—โ€–2+โ€–ฮธk,โˆ—โˆ’ฮธkโ€–2).\left\|\theta^{*}-\theta^{k}\right\|^{2}\leq 3(\left\|\theta^{*}-\theta^{k,*}\right\|^{2}+\left\|\theta^{k,*}-\theta^{k}\right\|^{2}).

Taking expectation and multiplying both sides by 2kโˆ’12^{k-1} yields,

๐”ผโ€‹[Tkโ€‹โ€–ฮธโˆ—โˆ’ฮธkโ€–2]โ‰ค3โ€‹๐”ผโ€‹[Tkโ€‹โ€–ฮธโˆ—โˆ’ฮธk,โˆ—โ€–2]+3โ€‹๐”ผโ€‹[Tkโ€‹โ€–ฮธk,โˆ—โˆ’ฮธkโ€–2].\mathbb{E}[T_{k}\|\theta^{*}-\theta^{k}\|^{2}]\leq 3\mathbb{E}[T_{k}\|\theta^{*}-\theta^{k,*}\|^{2}]+3\mathbb{E}[T_{k}\|\theta^{k,*}-\theta^{k}\|^{2}]. (23)

Under the General parameterization, we follow Assumption A2 in Theocharous etย al., 2017b and focus on MDPs that exhibit proper posterior concentration, which suggests that the true parameter ฮธโˆ—\theta^{*} and the mode of the posterior ฮธk,โˆ—\theta^{k,*} satisfy

๐”ผโ€‹[โ€–ฮธโˆ—โˆ’ฮธk,โˆ—โ€–2]โ‰ค32โ€‹dmโ€‹ntkโ€‹logโกT.\mathbb{E}\left[\|\theta^{*}-\theta^{k,*}\|^{2}\right]\leq\frac{32d}{mn_{t_{k}}}\log T.

We provide an example in Appendix E.3 to show this assumption can be easily satisfied in practice. Let D~2:=32โ€‹dmโ€‹logโกT\tilde{D}^{2}:=\frac{32d}{m}\log T, then by Theorem 2 adapted to the MDP setting (i.e. with a change in notation), we have,

W22โ€‹(ฯ~tk,ฯtk)โ‰ค4โ€‹D~2ntk.W_{2}^{2}(\widetilde{\rho}_{t_{k}},\rho_{t_{k}})\leq\frac{4\tilde{D}^{2}}{n_{t_{k}}}.

Note that ntk=โˆ‘kโ€ฒ=1kโˆ’1Tkโ€ฒn_{t_{k}}=\sum_{k^{\prime}=1}^{k-1}T_{k^{\prime}} and by design of Algorithmย 3, ntkโ‰คTkโ‰ค2โ€‹ntkn_{t_{k}}\leq T_{k}\leq 2n_{t_{k}}. Combining the above results and Equation (23), we have

๐”ผโ€‹[Tkโ€‹โ€–ฮธโˆ—โˆ’ฮธkโ€–2]โ‰ค30โ€‹D~2=960โ€‹dmโ€‹logโกT.\displaystyle\mathbb{E}\left[T_{k}\|\theta^{*}-\theta^{k}\|^{2}\right]\leq 30\tilde{D}^{2}=\frac{960d}{m}\log T.

โ– \blacksquare

With all the above results, we are now ready to prove the main theorem for LPSRL with SGLD.

Theorem 6.2 (restated).

  • Proof of Theorem 6.2

We first upper bound ฮ”eโ€‹rโ€‹r\Delta_{err} using the concentration guarantee provided by SGLD. Note that by the Cauchyโ€“Schwarz inequality,

โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1โ€–ฮธโˆ—โˆ’ฮธkโ€–=โˆ‘t=1Tโ€–ฮธโˆ—โˆ’ฮธkโ€–โ‰คTโ€‹โˆ‘t=1Tโ€–ฮธโˆ—โˆ’ฮธkโ€–2=Tโ€‹โˆ‘k=1KTTkโ€‹โ€–ฮธโˆ—โˆ’ฮธkโ€–2.\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left\|\theta^{*}-\theta^{k}\right\|=\sum_{t=1}^{T}\left\|\theta^{*}-\theta^{k}\right\|\leq\sqrt{T\sum_{t=1}^{T}\left\|\theta^{*}-\theta^{k}\right\|^{2}}=\sqrt{T\sum_{k=1}^{K_{T}}T_{k}\left\|\theta^{*}-\theta^{k}\right\|^{2}}. (24)

Combining Equation (21) in Lemma E.3 with Equation (24), and applying Theorem 3 and Lemma E.5, we have,

ฮ”eโ€‹rโ€‹r\displaystyle\Delta_{err} โ‰คHโ€‹Lpโ€‹Tโ€‹๐”ผโ€‹[โˆ‘k=1KTTkโ€‹โ€–ฮธโˆ—โˆ’ฮธkโ€–2]\displaystyle\leq HL_{p}\sqrt{T\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}T_{k}\left\|\theta^{*}-\theta^{k}\right\|^{2}\Big{]}}
โ‰คHโ€‹Lpโ€‹Tโ€‹KTโ€‹maxkโก๐”ผโ€‹[Tkโ€‹โ€–ฮธโˆ—โˆ’ฮธkโ€–2]\displaystyle\leq HL_{p}\sqrt{TK_{T}\max_{k}\mathbb{E}\Big{[}T_{k}\left\|\theta^{*}-\theta^{k}\right\|^{2}\Big{]}}
โ‰คHโ€‹Lpโ€‹960โ€‹dmโ€‹Tโ€‹logโกTโ€‹(logโกT+1)\displaystyle\leq HL_{p}\sqrt{\frac{960d}{m}T\log T(\log T+1)}
โ‰คHโ€‹(logโกT+1)โ€‹960โ€‹dmโ€‹T.\displaystyle\leq H(\log T+1)\sqrt{\frac{960d}{m}T}. (25)

Then, combining Lemmas E.2 and E.4 with Equation (25), we have,

RBโ€‹(T)\displaystyle R_{B}(T) โ‰คHโ€‹(logโกT+1)+Hโ€‹(logโกT+1)โ€‹960โ€‹dmโ€‹T+24โ€‹Tโ€‹(logโกT+1)mโ€‹(d+logโกQ+(32+8โ€‹dโ€‹ฮบ2)โ€‹p)1/2\displaystyle\leq H(\log T+1)+H(\log T+1)\sqrt{\frac{960d}{m}T}+\sqrt{\frac{24T(\log T+1)}{m}}(d+\log Q+(32+8d\kappa^{2})p)^{1/2}
โ‰ค(1+960+24)โ€‹Hโ€‹(logโกT+1)โ€‹Tmโ€‹(d+logโกQ+(32+8โ€‹dโ€‹ฮบ2)โ€‹p)1/2\displaystyle\leq(1+\sqrt{960}+\sqrt{24})H(\log T+1)\sqrt{\frac{T}{m}}(d+\log Q+(32+8d\kappa^{2})p)^{1/2}
โ‰ค38โ€‹Hโ€‹(logโกT+1)โ€‹Tmโ€‹(d+logโกQ+(32+8โ€‹dโ€‹ฮบ2)โ€‹p)1/2.\displaystyle\leq 38H(\log T+1)\sqrt{\frac{T}{m}}(d+\log Q+(32+8d\kappa^{2})p)^{1/2}.

โ– \blacksquare

E.2.2 Simplex Parameterization

We now discuss the performance of LPSRL under the simplex parameterization. As in the General Parameterization setting, the regret guarantee of LPSRL relies on the convergence guarantee of MLD, which is presented in the following theorem.

Theorem 6.3 (restated).

  • Proof of Theorem 6.3

    Theorem 6.3 follows from Theorems 2 and 3 from Hsieh etย al., (2018) with step sizes given as per Theorem 3 from Cheng and Bartlett, (2018). โ– \blacksquare

Instantiating Algorithmย 3 with MLD provides the following statistical guarantee to control the approximation error in terms of the Wasserstein-11 distance.

Lemma E.6.

Under the simplex parameterization of MDPs, we run MLD for each state-action pair (s,a)โˆˆ๐’ฎร—๐’œ(s,a)\in\mathcal{S}\times\mathcal{A} at the beginning of each policy-switch kโˆˆ[KT]k\in[K_{T}] for O~โ€‹(|๐’ฎ|โ€‹|๐’œ|โ€‹ntk)\tilde{O}(|\mathcal{S}||\mathcal{A}|n_{t_{k}}) iterations. Suppose Assumption B and B are satisfied, then by instantiating LPSRL (Algorithm 3) with MLD (Algorithm 4) as SamplingAlg, we have,

โˆ‘k=1KTTkโ€‹W1โ€‹(ฯ~tk,ฯtk)โ‰ค|๐’ฎ|โ€‹8โ€‹|๐’œ|โ€‹Tโ€‹logโกT.\sum_{k=1}^{K_{T}}T_{k}W_{1}(\tilde{\rho}_{t_{k}},\rho_{t_{k}})\leq|\mathcal{S}|\sqrt{8|\mathcal{A}|T\log T}. (26)
  • Proof of Lemmaย E.6

    By Corollaryย E.1, in tabular settings, the error term in the Wasserstein-11 distance can be further decomposed in terms of state-action pairs, suggesting

    W1โ€‹(ฯ~tk,ฯtk)=โˆ‘(s,a)โˆˆ๐’ฎร—๐’œW1โ€‹(ฯ~tkโ€‹(s,a),ฯtkโ€‹(s,a)).W_{1}(\tilde{\rho}_{t_{k}},\rho_{t_{k}})=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}W_{1}(\tilde{\rho}_{t_{k}}(s,a),\rho_{t_{k}}(s,a)).

    Then by design of Algorithmย 3, we have,

    โˆ‘k=1KTTkโ€‹W1โ€‹(ฯ~tk,ฯtk)\displaystyle\sum_{k=1}^{K_{T}}T_{k}W_{1}(\tilde{\rho}_{t_{k}},\rho_{t_{k}}) =โˆ‘k=1KTTkโ€‹โˆ‘s,aW1โ€‹(ฯ~tkโ€‹(s,a),ฯtkโ€‹(s,a))\displaystyle=\sum_{k=1}^{K_{T}}T_{k}\sum_{s,a}W_{1}(\tilde{\rho}_{t_{k}}(s,a),\rho_{t_{k}}(s,a))
    โ‰คโˆ‘k=1logโกT+1Tkโ€‹โˆ‘s,aW2โ€‹(ฯ~tkโ€‹(s,a),ฯtkโ€‹(s,a))\displaystyle\leq\sum_{k=1}^{\log T+1}T_{k}\sum_{s,a}W_{2}(\tilde{\rho}_{t_{k}}(s,a),\rho_{t_{k}}(s,a))
    โ‰คโˆ‘k=1logโกT+1Tkโ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹maxs,aโกW2โ€‹(ฯ~tkโ€‹(s,a),ฯtkโ€‹(s,a)),\displaystyle\leq\sum_{k=1}^{\log T+1}T_{k}|\mathcal{S}||\mathcal{A}|\max_{s,a}W_{2}(\tilde{\rho}_{t_{k}}(s,a),\rho_{t_{k}}(s,a)), (27)

    where the first inequality follows from the fact that Wpโ‰ฅWqW_{p}\geq W_{q} for any pโ‰ฅqp\geq q.

    The convergence guarantee provided by Theorem 6.3 for MLD suggests that, for each state-action pair (s,a)โˆˆ๐’ฎร—๐’œ(s,a)\in\mathcal{S}\times\mathcal{A}, upon running MLD for O~โ€‹(|๐’ฎ|โ€‹|๐’œ|โ€‹ntk)\tilde{O}(|\mathcal{S}||\mathcal{A}|n_{t_{k}}) iterations, where ntkn_{t_{k}} is the number of observations available for (s,a)(s,a) at time tkt_{k}, we have

    W2โ€‹(ฯ~tkโ€‹(s,a),ฯtkโ€‹(s,a))โ‰ค1|๐’œ|โ€‹ntk.W_{2}(\tilde{\rho}_{t_{k}}(s,a),\rho_{t_{k}}(s,a))\leq\sqrt{\frac{1}{|\mathcal{A}|n_{t_{k}}}}. (28)

    At time tkt_{k}, let ntkn_{t_{k}} be the total number of available observations, which gives ntk=โˆ‘kโ€ฒ=1kโˆ’1Tkโ€ฒn_{t_{k}}=\sum_{k^{\prime}=1}^{k-1}T_{k^{\prime}} and tk=ntk+1t_{k}=n_{t_{k}}+1. By design of Algorithmย 3, ntkโ‰คTkโ‰ค2โ€‹ntkn_{t_{k}}\leq T_{k}\leq 2n_{t_{k}}. Then combining Equation (27) and (28) gives,

    โˆ‘k=1KTTkโ€‹โˆ‘s,aW1โ€‹(ฯ~tkโ€‹(s,a),ฯtkโ€‹(s,a))\displaystyle\sum_{k=1}^{K_{T}}T_{k}\sum_{s,a}W_{1}(\tilde{\rho}_{t_{k}}(s,a),\rho_{t_{k}}(s,a)) โ‰คโˆ‘k=1logโกT+1|๐’ฎ|โ€‹2โ€‹|๐’œ|โ€‹Tk\displaystyle\leq\sum_{k=1}^{\log T+1}|\mathcal{S}|\sqrt{2|\mathcal{A}|T_{k}}
    โ‰คโˆ‘k=1logโกT+1|๐’ฎ|โ€‹|๐’œ|โ€‹2k\displaystyle\leq\sum_{k=1}^{\log T+1}|\mathcal{S}|\sqrt{|\mathcal{A}|2^{k}}
    โ‰ค|๐’ฎ|โ€‹|๐’œ|โ€‹(logโกT+1)โ€‹โˆ‘k=1logโกT+12k\displaystyle\leq|\mathcal{S}|\sqrt{|\mathcal{A}|(\log T+1)\sum_{k=1}^{\log T+1}2^{k}}
    โ‰ค|๐’ฎ|โ€‹8โ€‹|๐’œ|โ€‹Tโ€‹logโกT,\displaystyle\leq|\mathcal{S}|\sqrt{8|\mathcal{A}|T\log T},

    where the third inequality follows from the Cauchy-Schwarz inequality.

    โ– \blacksquare
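
The last two inequalities in the display above combine a geometric sum with Cauchy-Schwarz. The short check below (an illustrative numerical verification, with T restricted to powers of two so that log2(T) is exact) confirms that the sum over the log T + 1 doubling batches of sqrt(2^k) is indeed dominated by sqrt(8 T log T).

import math

def batched_sum(T: int) -> float:
    """Left-hand side: sum_{k=1}^{log2(T)+1} sqrt(2^k)."""
    K = int(math.log2(T)) + 1
    return sum(math.sqrt(2 ** k) for k in range(1, K + 1))

def bound(T: int) -> float:
    """Right-hand side: sqrt(8 * T * log2(T))."""
    return math.sqrt(8 * T * math.log2(T))

for T in [2 ** 5, 2 ** 10, 2 ** 20]:
    print(T, round(batched_sum(T), 1), round(bound(T), 1), batched_sum(T) <= bound(T))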

Lemmaย E.6 suggests the approximation error in the first term of Lemmaย E.2 can be effectively bounded when instantiating SamplingAlg with MLD.

Lemma E.7 (Concentration of MLD).

For any policy-switch kโˆˆ[KT]k\in[K_{T}], we run MLD (Algorithm 4) for each state-action pair (s,a)โˆˆ๐’ฎร—๐’œ(s,a)\in\mathcal{S}\times\mathcal{A} at time tkt_{k} for O~โ€‹(|๐’ฎ|โ€‹|๐’œ|โ€‹ntk)\tilde{O}(|\mathcal{S}||\mathcal{A}|n_{t_{k}}) iterations. Then instantiating LPSRL with MLD guarantees that

๐”ผโ€‹[Tkโ€‹โ€–ฮธk,โˆ—โˆ’ฮธkโ€–2]โ‰ค2โ€‹|๐’ฎ|2โ€‹|๐’œ|.\mathbb{E}[T_{k}\|\theta^{k,*}-\theta^{k}\|^{2}]\leq 2|\mathcal{S}|^{2}|\mathcal{A}|.
  • Proof of Lemma E.7

    By the tower rule and the triangle inequality, we have

    ๐”ผโ€‹[Tkโ€‹โ€–ฮธk,โˆ—โˆ’ฮธkโ€–2]\displaystyle\mathbb{E}\left[T_{k}\|\theta^{k,*}-\theta^{k}\|^{2}\right] =๐”ผโ€‹[๐”ผโ€‹[Tkโ€‹โ€–ฮธk,โˆ—โˆ’ฮธkโ€–2]|โ„‹tk]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[T_{k}\|\theta^{k,*}-\theta^{k}\|^{2}\right]\Big{|}\mathcal{H}_{t_{k}}\right]
    โ‰ค๐”ผโ€‹[Tkโ€‹W22โ€‹(ฯ~tk,ฯtk)|โ„‹tk]\displaystyle\leq\mathbb{E}\left[T_{k}W_{2}^{2}(\tilde{\rho}_{t_{k}},\rho_{t_{k}})\Big{|}\mathcal{H}_{t_{k}}\right]
    โ‰ค๐”ผโ€‹[Tkโ€‹(|๐’ฎ|โ€‹|๐’œ|)2โ€‹maxs,aโกW22โ€‹(ฯ~tkโ€‹(s,a),ฯtkโ€‹(s,a))|โ„‹tk].\displaystyle\leq\mathbb{E}\left[T_{k}(|\mathcal{S}||\mathcal{A}|)^{2}\max_{s,a}W_{2}^{2}(\tilde{\rho}_{t_{k}}(s,a),\rho_{t_{k}}(s,a))\Big{|}\mathcal{H}_{t_{k}}\right]. (29)

    where the last inequality follows from the fact that in tabular setting, W22โ€‹(ฯ~tk,ฯtk)=(โˆ‘s,aW2โ€‹(ฯ~tkโ€‹(s,a),ฯtkโ€‹(s,a)))2W_{2}^{2}(\tilde{\rho}_{t_{k}},\rho_{t_{k}})=\left(\sum_{s,a}W_{2}(\tilde{\rho}_{t_{k}}(s,a),\rho_{t_{k}}(s,a))\right)^{2} .

    By the convergence guarantee of MLD in Theorem 6.3, for each state-action pair (s,a)โˆˆ๐’ฎร—๐’œ(s,a)\in\mathcal{S}\times\mathcal{A}, upon running MLD for O~โ€‹(|๐’ฎ|โ€‹|๐’œ|โ€‹ntk)\tilde{O}(|\mathcal{S}||\mathcal{A}|n_{t_{k}}) iterations, we have

    W2โ€‹(ฯ~tkโ€‹(s,a),ฯtkโ€‹(s,a))โ‰ค1|๐’œ|โ€‹ntk.W_{2}(\tilde{\rho}_{t_{k}}(s,a),\rho_{t_{k}}(s,a))\leq\sqrt{\frac{1}{|\mathcal{A}|n_{t_{k}}}}. (30)

    Combining Equation (29) and (30) and the fact that ntkโ‰คTkโ‰ค2โ€‹ntkn_{t_{k}}\leq T_{k}\leq 2n_{t_{k}} concludes the proof.

โ– \blacksquare

With the concentration guarantee between sample ฮธk\theta^{k} and ฮธk,โˆ—\theta^{k,*}, as well as the concentration guarantee between ฮธk,โˆ—\theta^{k,*} and ฮธโˆ—\theta^{*} in exact PSRL, we are able to effectively upper bound the model estimation error ฮ”eโ€‹rโ€‹r\Delta_{err} in tabular settings.

Lemma E.8 (Bound ฮ”eโ€‹rโ€‹r\Delta_{err} in tabular settings).

With the definition of the model estimation error ฮ”eโ€‹rโ€‹r\Delta_{err} in Lemma E.3, the following upper bound holds in the tabular setting:

ฮ”eโ€‹rโ€‹rโ‰ค66โ€‹Hโ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹Tโ€‹logโก(2โ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹T),\Delta_{err}\leq 66H|\mathcal{S}|\sqrt{|\mathcal{A}|T\log(2|\mathcal{S}||\mathcal{A}|T)}, (31)

where the Lipschitz constant LpL_{p} of the transition dynamics satisfies Lp=1L_{p}=1 under the simplex parameterization and is therefore absorbed into the constant.

  • Proof of Lemma E.8

    By the triangle inequality and Lemma E.3,

    ฮ”eโ€‹rโ€‹rโ‰คHโ€‹Lpโ€‹๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1โ€–ฮธโˆ—โˆ’ฮธkโ€–]โ‰คHโ€‹Lpโ€‹(๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1โ€–ฮธโˆ—โˆ’ฮธk,โˆ—โ€–]+๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1โ€–ฮธk,โˆ—โˆ’ฮธkโ€–]).\Delta_{err}\leq HL_{p}\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left\|\theta^{*}-\theta^{k}\right\|\Big{]}\leq HL_{p}\left(\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left\|\theta^{*}-\theta^{k,*}\right\|\Big{]}+\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left\|\theta^{k,*}-\theta^{k}\right\|\Big{]}\right). (32)

    Bounding the first term. The first term can be upper bounded using standard concentration results for exact PSRL algorithms in the Bayesian setting. Define the event

    Eฮธ={ฮธ:โˆ€(s,a)โˆˆ๐’ฎร—๐’œ,โˆฅฮธ(โ‹…|s,a)โˆ’ฮธ^k(โ‹…|s,a)โˆฅ1โ‰คฮฒk(s,a)},E_{\theta}=\left\{\theta:\forall(s,a)\in\mathcal{S}\times\mathcal{A},~{}~{}~{}~{}\left\|\theta(\cdot|s,a)-\widehat{\theta}^{k}(\cdot|s,a)\right\|_{1}\leq\beta_{k}(s,a)\right\}, (33)

    where ฮธ^k\widehat{\theta}^{k} is the empirical distribution at the beginning of policy switch kk, ฮฒkโ€‹(s,a):=14โ€‹|๐’ฎ|โ€‹logโก(2โ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹tkโ€‹T)mโ€‹aโ€‹xโ€‹(1,ntkโ€‹(s,a))\beta_{k}(s,a):=\sqrt{\frac{14|\mathcal{S}|\log(2|\mathcal{S}||\mathcal{A}|t_{k}T)}{max(1,n_{t_{k}}(s,a))}} following Jaksch etย al., (2010); Ouyang etย al., (2017); Osband etย al., (2013) by setting ฮด=1/T\delta=1/T. Then event EฮธE_{\theta} happens with probability at least 1โˆ’ฮด1-\delta. Note that for any vector x, โ€–xโ€–2โ‰คโ€–xโ€–1\left\|x\right\|_{2}\leq\left\|x\right\|_{1}, and by the triangle inequality, we have

    โˆฅฮธโˆ—โˆ’ฮธk,โˆ—โˆฅโ‰คโˆ‘sโ€ฒโˆˆ๐’ฎ|ฮธโˆ—(โ‹…|s,a)โˆ’ฮธk,โˆ—(โ‹…|s,a)|โ‰ค2(ฮฒk(st,at)+๐Ÿ{ฮธโˆ—โˆ‰Eฮธ}).\left\|\theta^{*}-\theta^{k,*}\right\|\leq\sum_{s^{\prime}\in\mathcal{S}}\Big{|}\theta^{*}(\cdot|s,a)-\theta^{k,*}(\cdot|s,a)\Big{|}\leq 2(\beta_{k}(s_{t},a_{t})+\mathbf{1}_{\{\theta^{*}\notin E_{\theta}\}}).

    At any time tโˆˆ[tk,tk+Tkโˆ’1]t\in[t_{k},t_{k}+T_{k}-1], ntโ‰ค2โ€‹ntkn_{t}\leq 2n_{t_{k}} for any state-action pair (st,at)(s_{t},a_{t}), and by the fact that tkโ‰คTt_{k}\leq T, we have

    ๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1ฮฒkโ€‹(st,at)]โ‰คโˆ‘k=1KTโˆ‘t=tktk+1โˆ’128โ€‹|๐’ฎ|โ€‹logโก(2โ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹tkโ€‹T)mโ€‹aโ€‹xโ€‹(1,ntโ€‹(st,at))โ‰คโˆ‘t=1T56โ€‹|๐’ฎ|โ€‹logโก(2โ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹T)mโ€‹aโ€‹xโ€‹(1,ntโ€‹(st,at)).\displaystyle\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\beta_{k}(s_{t},a_{t})\Big{]}\leq\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\sqrt{\frac{28|\mathcal{S}|\log(2|\mathcal{S}||\mathcal{A}|t_{k}T)}{max(1,n_{t}(s_{t},a_{t}))}}\leq\sum_{t=1}^{T}\sqrt{\frac{56|\mathcal{S}|\log(2|\mathcal{S}||\mathcal{A}|T)}{max(1,n_{t}(s_{t},a_{t}))}}. (34)

    It then suffices to bound โˆ‘t=1T1/maxโก(1,ntโ€‹(st,at))\sum_{t=1}^{T}1/\sqrt{\max(1,n_{t}(s_{t},a_{t}))}. Note that

    โˆ‘t=1T1mโ€‹aโ€‹xโ€‹(1,ntโ€‹(st,at))\displaystyle\sum_{t=1}^{T}\frac{1}{\sqrt{max(1,n_{t}(s_{t},a_{t}))}} =โˆ‘(s,a)โˆ‘t=1T๐Ÿ(st,at)=(s,a)mโ€‹aโ€‹xโ€‹(1,ntโ€‹(s,a))\displaystyle=\sum_{(s,a)}\sum_{t=1}^{T}\frac{\mathbf{1}_{(s_{t},a_{t})=(s,a)}}{\sqrt{max(1,n_{t}(s,a))}}
    โ‰ค4โ€‹โˆ‘(s,a)โˆซz=0nT+1โ€‹(s,a)zโˆ’1/2โ€‹๐‘‘z\displaystyle\leq 4\sum_{(s,a)}\int_{z=0}^{n_{T+1}(s,a)}z^{-1/2}dz
    โ‰ค4โ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹โˆ‘(s,a)nT+1โ€‹(s,a)\displaystyle\leq 4\sqrt{|\mathcal{S}||\mathcal{A}|\sum_{(s,a)}n_{T+1}(s,a)}
    โ‰ค4โ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹T.\displaystyle\leq 4\sqrt{|\mathcal{S}||\mathcal{A}|T}. (35)

    On the other hand, by the definition of ฮฒkโ€‹(s,a)\beta_{k}(s,a), โ„™โ€‹(ฮธโˆ—โˆ‰Eฮธ)โ‰ค1/(Tโ€‹tk6)\mathbb{P}(\theta^{*}\notin E_{\theta})\leq 1/(Tt_{k}^{6}), which yields

    ๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1๐Ÿ{ฮธโˆ—โˆ‰Eฮธ}]โ‰ค๐”ผโ€‹[โˆ‘k=1KTTkโ€‹โ„™โ€‹(ฮธโˆ—โˆ‰Eฮธ)]โ‰คโˆ‘k=1โˆžkโˆ’6โ‰คโˆ‘k=1โˆžkโˆ’2โ‰ค2.\displaystyle\mathbb{E}\left[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{\{\theta^{*}\notin E_{\theta}\}}\right]\leq\mathbb{E}\left[\sum_{k=1}^{K_{T}}T_{k}\mathbb{P}(\theta^{*}\notin E_{\theta})\right]\leq\sum_{k=1}^{\infty}k^{-6}\leq\sum_{k=1}^{\infty}k^{-2}\leq 2. (36)

    Combining Equations (34), (35), and (36), we have,

    Hโ€‹Lpโ€‹๐”ผโ€‹[โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1โ€–ฮธโˆ—โˆ’ฮธk,โˆ—โ€–]โ‰ค64โ€‹Hโ€‹Lpโ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹Tโ€‹logโก(2โ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹T)HL_{p}\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left\|\theta^{*}-\theta^{k,*}\right\|\Big{]}\leq 64HL_{p}|\mathcal{S}|\sqrt{|\mathcal{A}|T\log(2|\mathcal{S}||\mathcal{A}|T)} (37)

Bounding the second term. The second term arises from the use of approximate sampling. Note that by the Cauchyโ€“Schwarz inequality, this term in Equation (32) satisfies,

โˆ‘k=1KTโˆ‘t=tktk+1โˆ’1โ€–ฮธk,โˆ—โˆ’ฮธkโ€–=โˆ‘t=1Tโ€–ฮธk,โˆ—โˆ’ฮธkโ€–โ‰คTโ€‹โˆ‘t=1Tโ€–ฮธk,โˆ—โˆ’ฮธkโ€–2=Tโ€‹โˆ‘k=1KTTkโ€‹โ€–ฮธk,โˆ—โˆ’ฮธkโ€–2.\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left\|\theta^{k,*}-\theta^{k}\right\|=\sum_{t=1}^{T}\left\|\theta^{k,*}-\theta^{k}\right\|\leq\sqrt{T\sum_{t=1}^{T}\left\|\theta^{k,*}-\theta^{k}\right\|^{2}}=\sqrt{T\sum_{k=1}^{K_{T}}T_{k}\left\|\theta^{k,*}-\theta^{k}\right\|^{2}}. (38)

The bound then relies on the concentration guarantee provided by MLD for LPSRL under the static policy-switch scheme. By Lemma E.7, we have,

Hโ€‹Lpโ€‹Tโ€‹๐”ผโ€‹[โˆ‘k=1KTTkโ€‹โ€–ฮธk,โˆ—โˆ’ฮธkโ€–2]โ‰คHโ€‹Lpโ€‹Tโ€‹KTโ€‹maxkโก๐”ผโ€‹[Tkโ€‹โ€–ฮธk,โˆ—โˆ’ฮธkโ€–2]โ‰คHโ€‹Lpโ€‹|๐’ฎ|โ€‹4โ€‹|๐’œ|โ€‹Tโ€‹logโกT.\displaystyle HL_{p}\sqrt{T\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}T_{k}\left\|\theta^{k,*}-\theta^{k}\right\|^{2}\Big{]}}\leq HL_{p}\sqrt{TK_{T}\max_{k}\mathbb{E}\Big{[}T_{k}\left\|\theta^{k,*}-\theta^{k}\right\|^{2}\Big{]}}\leq HL_{p}|\mathcal{S}|\sqrt{4|\mathcal{A}|T\log T}. (39)

Combining Equations (37) and (39) concludes the proof.

โ– \blacksquare

With all the above results, we now proceed to prove the regret bound for LPSRL with MLD.

\MDPRegretMLD

*

  • Proof of Theorem 6.3

By Lemmas E.2, E.6, and E.8, we have

RBโ€‹(T)\displaystyle R_{B}(T) โ‰คHโ€‹(logโกT+1)+66โ€‹Hโ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹Tโ€‹logโก(2โ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹T)+|๐’ฎ|โ€‹8โ€‹|๐’œ|โ€‹Tโ€‹logโกT\displaystyle\leq H(\log T+1)+66H|\mathcal{S}|\sqrt{|\mathcal{A}|T\log(2|\mathcal{S}||\mathcal{A}|T)}+|\mathcal{S}|\sqrt{8|\mathcal{A}|T\log T}
โ‰ค2โ€‹Hโ€‹logโกT+66โ€‹Hโ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹Tโ€‹logโก(2โ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹T)+4โ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹Tโ€‹logโกT\displaystyle\leq 2H\log T+66H|\mathcal{S}|\sqrt{|\mathcal{A}|T\log(2|\mathcal{S}||\mathcal{A}|T)}+4|\mathcal{S}|\sqrt{|\mathcal{A}|T\log T}
โ‰ค72โ€‹Hโ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹Tโ€‹logโก(2โ€‹|๐’ฎ|โ€‹|๐’œ|โ€‹T).\displaystyle\leq 72H|\mathcal{S}|\sqrt{|\mathcal{A}|T\log(2|\mathcal{S}||\mathcal{A}|T)}.

By Lemma E.6, for each state-action pair (s,a)โˆˆ๐’ฎร—๐’œ(s,a)\in\mathcal{S}\times\mathcal{A} and policy-switch kโˆˆ[KT]k\in[K_{T}], the number of iterations required for MLD is Oโ€‹(|๐’ฎ|โ€‹|๐’œ|โ€‹2kโˆ’1)O(|\mathcal{S}||\mathcal{A}|2^{k-1}). This suggests that for each state-action pair, the total number of iterations required for MLD is Oโ€‹(|๐’ฎ|โ€‹|๐’œ|โ€‹T)O(|\mathcal{S}||\mathcal{A}|T) along the time horizon TT. Summing over all possible state-action pairs, the computational cost of running MLD in terms of the total number of iterations is Oโ€‹(|๐’ฎ|2โ€‹|๐’œ|2โ€‹T)O(|\mathcal{S}|^{2}|\mathcal{A}|^{2}T). โ– \blacksquare

E.3 General Parameterization Example

Following Theocharous et al., 2017a; Theocharous et al., 2017b, we consider a points-of-interest (POI) recommender system in which the system recommends a sequence of points that could be of interest to a particular tourist or individual. We let each point of interest be represented by a point on โ„\mathbb{R}. Following the perturbation model in Theocharous et al., 2017a; Theocharous et al., 2017b, the transition probabilities are pโ€‹(s|ฮธ)=pโ€‹(s)1/ฮธp(s|\theta)=p(s)^{1/\theta} if the chosen action is ss, and pโ€‹(s)/zโ€‹(ฮธ)p(s)/z(\theta) otherwise. Here ss is a state (a POI) and zโ€‹(ฮธ)=โˆ‘xโ‰ spโ€‹(x)1โˆ’pโ€‹(s)1/ฮธz(\theta)=\frac{\sum_{x\neq s}p(x)}{1-p(s)^{1/\theta}}. Furthermore, to fully specify pp we take pโ€‹(s|ฮธ)=12โ€‹ฯ€โ€‹eโˆ’s2/2โ€‹ฮธp(s|\theta)=\frac{1}{\sqrt{2\pi}}e^{-s^{2}/2\theta}. One can see that Assumptions 1-4 are satisfied due to the Gaussian-like nature of the transition dynamics, and the satisfiability of Assumption 5 follows from Lemma 5 in Theocharous et al., 2017b.
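
The perturbation model above can be written out directly. The sketch below is illustrative only: it discretizes the POIs to a small hypothetical finite set standing in for points on the real line and uses a normalized Gaussian-shaped base distribution as an assumption; it computes the perturbed transition row for a chosen POI s and checks that it sums to one by construction of z(theta).

import numpy as np

def perturbed_transition(p: np.ndarray, s: int, theta: float) -> np.ndarray:
    """Transition row of the POI model: probability p(s)^(1/theta) for the chosen POI s,
    and p(x)/z(theta) for every other x, with z(theta) = sum_{x != s} p(x) / (1 - p(s)^(1/theta))."""
    p_s = p[s] ** (1.0 / theta)
    z = (p.sum() - p[s]) / (1.0 - p_s)
    row = p / z
    row[s] = p_s
    return row

# Hypothetical discretized POIs with a Gaussian-shaped base distribution (normalized for illustration).
points = np.linspace(-3.0, 3.0, 7)
base = np.exp(-points ** 2 / 2.0)
base /= base.sum()

row = perturbed_transition(base, s=3, theta=2.0)
print(row, row.sum())   # the row sums to 1 by construction of z(theta)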

Appendix F Experimental Details

F.1 Additional Discussions of Langevin TS in Gaussian Bandits

Figure 2: Left (a): Expected regret with informative priors. Right (b): Expected regret with uninformative priors. Results are reported over 10 experiments. In both scenarios, SGLD-TS under the dynamic batching scheme achieves the same optimal performance as the fully-sequential case that does not use approximate sampling.
Figure 3: Left (a): Expected communication cost of three batching schemes: fully-sequential mode, dynamic batching, and static batching. Right (b): Expected communication cost under the dynamic batching scheme for Gaussian bandits.

In this section, we present additional empirical results for the Gaussian bandit experiments. In particular, we examine both informative and uninformative priors for Gaussian bandits with N=15 arms, where each arm is associated with a distinct expected reward. We set the true expected rewards to be evenly spaced in the interval [1,20], and the ordering of the values is shuffled before they are assigned to the arms. All arms share the same standard deviation of 0.5. We investigate the performance of SGLD-TS against UCB1, Bayes-UCB, and exact-TS under different interaction schemes: fully-sequential mode, the dynamic batch scheme, and the static batch scheme.

In the first setting, we assume prior knowledge of the ordering of the expected rewards and apply informative priors to facilitate the learning process. Gaussian priors are adopted with means evenly spaced in [14,20], and the inverted variance (i.e., precision) is set to 0.375. The priors are assigned according to the ordering of the true reward distributions; note that exact knowledge of the true expected values is not required. In TS algorithms, the selection of arms at each time step is based on sampled values; therefore, efficient learning is essential even with knowledge of the correct ordering. The expected regret of all methods is reported over 10 experiments, and the results are illustrated in Figure 2(a). The results in both Figure 1(a) and Figure 2(a) demonstrate that SGLD-TS achieves optimal performance similar to exact-TS with conjugate families. Its appealing empirical performance in comparison to other popular methods (e.g., UCB1 and Bayes-UCB), along with its ability to handle complex posteriors using MCMC algorithms, makes it a promising solution for challenging problem domains. Additionally, the introduction of the dynamic batch scheme ensures the computational efficiency of SGLD-TS. As depicted in Figure 3(a)(b) and Table 2 (column labeled "batches"), the communication cost is significantly reduced from a linear to a logarithmic dependence on the time horizon, as suggested by Theorem 5.1. Furthermore, in bandit environments, our dynamic batch scheme exhibits greater robustness compared to the static batch scheme for both frequentist and Bayesian methods.
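
For reference, the arm and prior construction described in this subsection can be sketched as follows. This is an illustrative Python setup only: the variable names, the random seed, and the rank-matching step are our own, and the snippet builds just the Gaussian arms and informative priors rather than the full SGLD-TS / UCB1 / Bayes-UCB pipeline.

import numpy as np

rng = np.random.default_rng(0)
N = 15                                    # number of arms

# True Gaussian arms: means evenly spaced in [1, 20], shuffled, with common standard deviation 0.5.
true_means = np.linspace(1.0, 20.0, N)
rng.shuffle(true_means)
reward_std = 0.5

# Informative Gaussian priors: means evenly spaced in [14, 20] with precision 0.375,
# assigned according to the ordering of the true means (the true values themselves stay unknown).
ranks = np.argsort(np.argsort(true_means))
prior_means = np.linspace(14.0, 20.0, N)[ranks]
prior_precision = 0.375

def pull(arm: int) -> float:
    """Draw a reward from the chosen arm."""
    return rng.normal(true_means[arm], reward_std)

print(true_means[:3], prior_means[:3], pull(0))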

Furthermore, we explore the setting where prior information is absent and uninformative priors are employed. In this case, we adopt the same Gaussian prior ๐’ฉโ€‹(14.0,8.0)\mathcal{N}(14.0,8.0) for all arms. Similar to the first setting, the same conclusion can be drawn for SGLD-TS from Figure 2(b).

F.2 Experimental Setup for Langevin TS in Laplace Bandits

In order to demonstrate the performance of Langevin TS in a broader class of general bandit environments, where closed-form posteriors are not available and exact TS is not applicable, we construct a Laplace bandit environment consisting of N=10 arms. Specifically, we set the expected rewards to be evenly spaced in the interval [1,10] and shuffle the ordering before assigning each arm a value. The reward distributions of all arms share the same standard deviation of 0.8. We adopt favorable priors to incorporate knowledge of the true ordering in the Laplace bandits. It is important to note that our objective is to learn the expected rewards, and arm selection at each time step is based on the sampled values rather than the ordering. In particular, we adopt Gaussian priors with means evenly spaced in [4,10] (ordered according to prior knowledge). The inverted variance (i.e., precision) of all Gaussian priors is set to 0.875. We conduct the experiments 10 times and report the cumulative regrets in Figure 1(b).

By employing Langevin TS in the Laplace bandit environment, we aim to showcase the algorithm's effectiveness and versatility in scenarios where posteriors are intractable and exact TS cannot be directly applied.
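
Because a Laplace likelihood paired with a Gaussian prior has no conjugate closed form, Langevin TS only needs the gradient of the log-posterior for each arm. A sketch of that gradient is given below; the Laplace scale b, the prior mean, and the subgradient convention at zero are illustrative assumptions (in particular, whether 0.8 is treated as the Laplace scale or as the standard deviation is a detail of the setup we leave as a parameter).

import numpy as np

def grad_log_posterior(mu: float, rewards: np.ndarray,
                       b: float = 0.8, prior_mean: float = 7.0,
                       prior_precision: float = 0.875) -> float:
    """Gradient in mu of log p(mu | rewards) for a Laplace(mu, b) likelihood and a
    Gaussian prior N(prior_mean, 1/prior_precision); this is the quantity an SGLD step uses."""
    # d/dmu of -|r - mu| / b is sign(r - mu) / b (subgradient taken as 0 at r == mu).
    lik_grad = np.sum(np.sign(rewards - mu)) / b
    prior_grad = -prior_precision * (mu - prior_mean)
    return lik_grad + prior_grad

rewards = np.array([4.8, 5.3, 5.1, 4.9])   # hypothetical observed rewards for one arm
print(grad_log_posterior(5.0, rewards))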

Figure 4: Number of policy switches in RiverSwim over 10 experiments. The static policy-switch scheme requires the fewest rounds of communication.

F.3 Experimental Setup for Langevin PSRL

In the MDP setting, we consider a variant of the RiverSwim environment that is frequently used in empirical studies (Strehl and Littman,, 2008), in which the agent swimming in the river is modeled with five states and two available actions: left and right. If the agent swims rightwards along the river current, the attempt to transition to the right succeeds with a large probability p=0.8. If the agent swims leftwards against the current, the transition to the left succeeds only with a small probability p=0.2. Rewards are zero unless the agent is in the leftmost state (r=2.0) or the rightmost state (r=10.0). The agent is assumed to start from the leftmost state. We implement MLD-PSRL and exact-PSRL under two policy-switch schemes: the static doubling scheme discussed in Section 6, and a dynamic doubling scheme based on the visit counts of state-action pairs. To ensure the performance of TSDE, we adopt its original policy-switch criteria based on the linear growth restriction on episode length and the dynamic doubling scheme. We run the experiments 10 times and report the average rewards of each method in Figure 1(c). The number of policy switches under the different schemes is depicted in Figure 4.
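
The RiverSwim variant described above can be encoded directly as transition and reward tensors. The sketch below is illustrative only and follows the text's p = 0.8 / p = 0.2 success probabilities and reward values; how a failed or blocked move behaves (here, the agent stays in place) and the purely state-based reward are simplifying assumptions of this sketch.

import numpy as np

n_states, LEFT, RIGHT = 5, 0, 1
P = np.zeros((n_states, 2, n_states))        # P[s, a, s'] = transition probability

for s in range(n_states):
    right_target = min(s + 1, n_states - 1)  # moving past the boundary keeps the agent in place
    left_target = max(s - 1, 0)
    P[s, RIGHT, right_target] += 0.8         # swimming right succeeds with probability 0.8
    P[s, RIGHT, s] += 0.2
    P[s, LEFT, left_target] += 0.2           # swimming left succeeds only with probability 0.2
    P[s, LEFT, s] += 0.8

# State-based rewards (an assumption of this sketch): 2.0 in the leftmost state, 10.0 in the rightmost.
R = np.zeros(n_states)
R[0], R[-1] = 2.0, 10.0
start_state = 0                               # the agent starts from the leftmost state

assert np.allclose(P.sum(axis=2), 1.0)        # each (s, a) row is a valid distribution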