
An Information-Theoretic Analysis of Nonstationary Bandit Learning


Seungki Min
Industrial and Systems Engineering
KAIST
[email protected]
  
Daniel J. Russo
Graduate School of Business
Columbia University
[email protected]
(Initial Version: February 2023
Current Revision: December 2023)
Abstract

In nonstationary bandit learning problems, the decision-maker must continually gather information and adapt their action selection as the latent state of the environment evolves. In each time period, some latent optimal action maximizes expected reward under the environment state. We view the optimal action sequence as a stochastic process, and take an information-theoretic approach to analyze attainable performance. We bound per-period regret in terms of the entropy rate of the optimal action process. The bound applies to a wide array of problems studied in the literature and reflects the problem’s information structure through its information-ratio.

1 Introduction

We study the problem of learning in interactive decision-making. Across a sequence of time periods, a decision-maker selects actions, observes outcomes, and associates these with rewards. They hope to earn high rewards, but this may require investing in gathering information.

Most of the literature studies stationary environments — where the likelihood of outcomes under an action is fixed across time. (An alternative style of result lets the environment change, but tries only to compete with the best fixed action in hindsight.) Efficient algorithms limit costs required to converge on optimal behavior. We study the design and analysis of algorithms in nonstationary environments, where converging on optimal behavior is impossible.

In our model, the latent state of the environment in each time period is encoded in a parameter vector. These parameters are unobservable, but evolve according to a known stochastic process. The decision-maker hopes to earn high rewards by adapting their action selection as the environment evolves. This requires continual learning from interaction and striking a judicious balance between exploration and exploitation. Uncertainty about the environment’s state cannot be fully resolved before the state changes, and this necessarily manifests in suboptimal decisions. Strong performance is impossible under adversarial forms of nonstationarity but is possible in more benign environments. Why are A/B testing and recommender systems widespread and effective even though nonstationarity is a ubiquitous concern? Quantifying the impact different forms of nonstationarity have on decision-quality is, unfortunately, quite subtle.

Our contributions.

We provide a novel information-theoretic analysis that bounds the inherent degradation of decision-quality in dynamic environments. The work can be seen as a very broad generalization of Russo and Van Roy [2016, 2022] from stationary to dynamic environments. To understand our results, it is important to first recognize that the latent state evolution within these environments gives rise to a latent optimal action process. This process identifies an optimal action at each time step, defined as the one that maximizes the expected reward based on the current environment state. Theorem 1 bounds the per-period regret in terms of the entropy rate of this optimal action process. This rate is indicative of the extent to which changes in the environment lead to unpredictable and erratic shifts in the optimal action process. Complementing Theorem 1, Section 5.3 lower bounds regret by the entropy rate of the optimal action process in a family of nonstationary bandit problems.

Through examples, we illustrate that the entropy rate of the action process may be much lower than the entropy rate of the latent environment states. Subsection 1.1 provides an illustrative example of nonstationarity, drawing inspiration from A/B testing scenarios. This distinction is further explored in the context of news recommendation problems, as detailed in Example 5 and Example 6.

We provide several generalizations of our main result. Section 7 considers problems where the environment evolves very rapidly, so aiming to track the optimal action sequence is futile. The assurances of Theorem 1 become vacuous in this case. Regardless of how erratic the environment is, Theorem 2 shows that it is possible to earn rewards competitive with any weaker benchmark — something we call a satisficing action sequence — whose entropy rate is low.

Section 8 interprets and generalizes the entropy rate through the lens of a communication problem. In this problem, an observer of the latent state aims to transmit useful information to a decision-maker. The decision-maker can implement optimal actions if and only if the observer (optimally) transmits information with bit-rate exceeding the entropy rate of the optimal action process. Theorem 3 replaces the entropy rate in Theorem 1 with a rate-distortion function [Cover and Thomas, 2006], which characterizes the bit-rate required to enable action selection with a given per-period regret.

This abstract theory provides an intriguing link between interactive learning problems and the limits of lossy compression. We do not engage in a full investigation, but provide a few initial connections to existing measures of the severity of nonstationarity. Proposition 2 bounds the entropy rate (and hence the rate-distortion function) in terms of the number of switches in the optimal action process, similar to Auer [2002] or Suk and Kpotufe [2022]. Proposition 4 upper bounds the rate-distortion function in terms of the total variation in arms’ mean-rewards, following the influential work of Besbes et al. [2015]. Section 9 reviews a duality between stochastic and adversarial models of nonstationarity which shows these information-theoretic techniques can be used to study adversarial models.

Our general bounds also depend on the algorithm’s information ratio. While the problem’s entropy rate (or rate-distortion function) bounds the rate at which the decision-maker must acquire information, the information-ratio bounds the per-period price of acquiring each bit of information. Since its introduction in the study of stationary learning problems in Russo and Van Roy [2016], the information-ratio has been shown to properly capture the complexity of learning in a range of widely studied problems. Recent works link it to generic limits on when efficient learning is possible [Lattimore, 2022, Foster et al., 2022]. Section 6 details bounds on the information ratio that apply to many of the most important sequential learning problems — ranging from contextual bandits to online matching. Through our analysis, these imply regret bounds in nonstationary variants of these problems.

This work emphasizes understanding of the limits of attainable performance. Thankfully, most results apply to variants of Thompson sampling (TS) [Thompson, 1933], one of the most widely used learning algorithms in practice. In some problems, TS is far from optimal, and better bounds are attained with Information-Directed Sampling [Russo and Van Roy, 2018].

A short conference version of this paper appeared in [Min and Russo, 2023]. In addition to providing a more thorough exposition, this full-length paper provides substantive extensions to that initial work, including the treatment of satisficing in Section 7, the connections with rate-distortion theory in Section 8, and the connections with adversarial analysis in Section 9.

1.1 An illustrative Bayesian model of nonstationarity

Consider a multi-armed bandit environment where two types of nonstationarity coexist – a common variation that affects the performance of all arms, and idiosyncratic variations that affect the performance of individual arms separately. More explicitly, let us assume that the mean reward of arm $a$ at time $t$ is given by

\mu_{t,a}=\theta_{t}^{\rm cm}+\theta_{t,a}^{\rm id},

where $(\theta_{t}^{\rm cm})_{t\in\mathbb{N}}$ and $(\theta_{t,a}^{\rm id})_{t\in\mathbb{N}}$ are latent stochastic processes describing common and idiosyncratic disturbances. While deferring the detailed description to Appendix C, we introduce two hyperparameters $\tau^{\rm cm}$ and $\tau^{\rm id}$ in our generative model to control the time scale of these two types of variations. (We assume that $(\theta_{t}^{\rm cm})_{t\in\mathbb{N}}$ is a zero-mean Gaussian process satisfying $\mathrm{Cov}(\theta_{s}^{\rm cm},\theta_{t}^{\rm cm})=\exp\left(-\frac{1}{2}\left(\frac{t-s}{\tau^{\rm cm}}\right)^{2}\right)$, so that $\tau^{\rm cm}$ determines the volatility of the process. Similarly, the volatility of $(\theta_{t,a}^{\rm id})_{t\in\mathbb{N}}$ is determined by $\tau^{\rm id}$.)
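
This generative model is straightforward to simulate. The sketch below is ours, not code from the paper: it samples the common and idiosyncratic Gaussian processes with the squared-exponential covariance described above (the helper name sample_smooth_gp and the two-armed setup are illustrative choices) and confirms that the optimal action depends only on the idiosyncratic terms.

import numpy as np

def sample_smooth_gp(T, tau, rng):
    """Sample a zero-mean Gaussian process on t = 1..T with
    Cov(x_s, x_t) = exp(-((t - s) / tau)^2 / 2)."""
    t = np.arange(T)
    cov = np.exp(-0.5 * ((t[:, None] - t[None, :]) / tau) ** 2)
    cov += 1e-6 * np.eye(T)  # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(T), cov, method="cholesky")

rng = np.random.default_rng(0)
T, k = 1000, 2
theta_cm = sample_smooth_gp(T, tau=10, rng=rng)              # common variation
theta_id = np.stack([sample_smooth_gp(T, tau=50, rng=rng)    # idiosyncratic variations
                     for _ in range(k)], axis=1)
mu = theta_cm[:, None] + theta_id                            # mean rewards mu_{t,a}
A_star = mu.argmax(axis=1)                                   # equals theta_id.argmax(axis=1), since the common term cancels
print("optimal-action switches over", T, "periods:",
      int((A_star[1:] != A_star[:-1]).sum()))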


Figure 1: A two-arm bandit environment with two types of nonstationarity – a common variation $(\theta_{t}^{\rm cm})_{t\in\mathbb{N}}$ generated with a time-scaling factor $\tau^{\rm cm}=10$, and idiosyncratic variations $(\theta_{t,a}^{\rm id})_{t\in\mathbb{N},a\in\mathcal{A}}$ generated with a time-scaling factor $\tau^{\rm id}=50$. While the absolute performances of the two arms are extremely volatile (left), their idiosyncratic performances are relatively stable (right).

Inspired by real-world A/B tests [Wu et al., 2022], we imagine a two-armed bandit instance involving a common variation that is much more erratic than the idiosyncratic variations. Common variations reflect exogenous shocks to user behavior which impact the reward under all treatment arms. Figure 1 visualizes such an example, a sample path generated with the choice of $\tau^{\rm cm}=10$ and $\tau^{\rm id}=50$. Observe that the optimal action $A_{t}^{*}$ has changed only five times throughout 1,000 periods. Although the environment itself is highly nonstationary and unpredictable due to the common variation term, the optimal action sequence $(A_{t}^{*})_{t\in\mathbb{N}}$ is relatively stable and predictable since it depends only on the idiosyncratic variations.

Now we ask — How difficult is this learning task? Which type of nonstationarity determines the difficulty? A quick numerical investigation shows that the problem’s difficulty is mainly determined by the frequency of optimal action switches, rather than by the volatility of the common variation.


Figure 2: Performance of algorithms in two-armed bandit environments, with different choices of time-scaling factors $\tau^{\rm cm}$ (common variation) and $\tau^{\rm id}$ (idiosyncratic variations). Each data point reports per-period regret averaged over 1,000 time periods and 1,000 runs of simulation.

See Figure 2, where we report the effect of $\tau^{\rm cm}$ and $\tau^{\rm id}$ on the performance of several bandit algorithms: namely, Thompson sampling with exact posterior sampling (which, in order to perform exact posterior sampling, exploits the specified nonstationary structure as well as the values of $\tau^{\rm cm}$ and $\tau^{\rm id}$), and Sliding-Window TS that only uses the most recent $L\in\{10,50,100\}$ observations; see Appendix C for the details. Remarkably, their performances appear to be sensitive only to $\tau^{\rm id}$ but not to $\tau^{\rm cm}$, highlighting that nonstationarity driven by the common variation is benign to the learner.

We remark that our information-theoretic analyses predict this result. Theorem 1 shows that the complexity of a nonstationary environment can be sufficiently characterized by the entropy rate of the optimal action sequence, which depends only on $\tau^{\rm id}$ but not on $\tau^{\rm cm}$ in this example. Proposition 1 further expresses the entropy rate in terms of the effective horizon, which corresponds to $\tau^{\rm id}$ in this example. (See Example 7, where we revisit this two-armed bandit problem.)

1.2 Comments on the use of prior knowledge

A substantive discussion of Bayesian, frequentist, and adversarial perspectives on decision-making under uncertainty is beyond the scope of this paper. We make two quick observations. First, where does a prior like the one in Figure 1 come from? One answer is that a company may run many thousands of A/B tests, and an informed prior may let them transfer experience across tests [Azevedo et al., 2019]. In particular, experience with past tests may let them calibrate $\tau^{\rm id}$, or form a hierarchical prior where $\tau^{\rm id}$ is also random. Example 5 in Section 2.2.2 illustrates the possibility of estimating prior hyperparameters in a news article recommendation problem. Second, Thompson sampling with a stationary prior is perhaps the most widely used bandit algorithm. One might view the model in Section 1.1 as a more conservative way of applying TS that guards against a certain magnitude of nonstationarity.

1.3 Literature review

We adopt a Bayesian viewpoint to describe nonstationary environments: changes in the underlying reward distributions (more generally, changes in outcome distributions) are driven by a stochastic process. Such a viewpoint dates back to the early work of Whittle [1988], which introduced the term ‘restless bandits’ and has motivated subsequent work [Slivkins and Upfal, 2008, Chakrabarti et al., 2008, Jung and Tewari, 2019]. On the other hand, as Thompson sampling (TS) has gained popularity as a Bayesian bandit algorithm, variants of it have been proposed for nonstationary settings: e.g., Dynamic TS [Gupta et al., 2011], Discounted TS [Raj and Kalyani, 2017], Sliding-Window TS [Trovo et al., 2020], and TS with Bayesian changepoint detection [Mellor and Shapiro, 2013, Ghatak, 2020]. Although the Bayesian framework can flexibly model various types of nonstationarity, this literature rarely presents performance guarantees that apply to a broad class of models.

To provide broad performance guarantees, our analysis adopts an information-theoretic approach introduced in Russo and Van Roy [2016] and extended in Russo and Van Roy [2022]. While past work applies this framework to analyze learning in stationary environments, we show that it can be gracefully extended to dynamic environments. In this extension, bounds that depend on the entropy of the static optimal action [Russo and Van Roy, 2016] are replaced with the entropy rate of the optimal action process; rate-distortion functions are similarly extended. A strength of this approach is that it enables one to leverage a wealth of existing bounds on the information-ratio, including for $k$-armed bandits/linear bandits/combinatorial bandits [Russo and Van Roy, 2016], contextual bandits [Neu et al., 2022], logistic bandits [Dong et al., 2019], bandits with graph-based feedback [Liu et al., 2018], convex optimization [Bubeck and Eldan, 2016, Lattimore and Szepesvári, 2020], etc.

Liu et al. [2023] recently proposed an algorithm called Predictive Sampling for nonstationary bandit learning problems. At a conceptual level, their algorithm is most closely related to our Section 7, in which Thompson sampling style algorithms are modified to explore less aggressively in the face of rapid dynamics. The authors introduce a modified information-ratio and use it to successfully display the benefits of predictive sampling over vanilla Thompson sampling. Our framework does not allow us to study predictive sampling, but enables us to establish a wide array of theoretical guarantees for other algorithms that are not provided in Liu et al. [2023]. We leave a detailed comparison to Appendix B.

Recent work of Chen et al. [2023] develops an algorithm and regret analysis for $k$-armed bandit problems when arm means (i.e., the latent states) evolve according to an auto-regressive process. This is a special case of our formulation, and a detailed application of our Theorem 1 or Theorem 3 to such settings would be an interesting avenue for future research. One promising route is to modify the argument in Example 7.

Most existing theoretical studies on nonstationary bandit experiments adopt adversarial viewpoints in the modeling of nonstationarity. Implications of our information-theoretic regret bounds in adversarial environments are discussed in Section 9. These works typically fall into two categories – ‘‘switching environments’’ and ‘‘drifting environments’’.

Switching environments consider a situation where underlying reward distributions change at unknown times (often referred to as changepoints or breakpoints). Denoting the total number of changes over $T$ periods by $N$, it was shown that cumulative regret of $\tilde{O}(\sqrt{NT})$ is achievable: e.g., Exp3.S [Auer et al., 2002, Auer, 2002], Discounted-UCB [Kocsis and Szepesvári, 2006], Sliding-Window UCB [Garivier and Moulines, 2011], and more complicated algorithms that actively detect the changepoints [Auer et al., 2019, Chen et al., 2019]. While those results depend on the number of switches in the underlying environment, the work of Auer [2002] already showed that it is possible to achieve a bound of $\tilde{O}(\sqrt{ST})$ where $S$ only counts the number of switches in the identity of the best arm. More recent studies design algorithms that adapt to an unknown number of switches [Abbasi-Yadkori et al., 2022, Suk and Kpotufe, 2022]. Our most comparable result, which follows from Theorem 1 with Proposition 2, is a bound of $\tilde{O}(\sqrt{ST})$ on the expected cumulative regret of Thompson sampling, in a wide range of problems beyond $k$-armed bandits (see Section 6.1).

Another stream of work considers drifting environments. Denoting the total variation in the underlying reward distribution by $V$ (often referred to as the variation budget), it was shown that cumulative regret of $\tilde{O}(V^{1/3}T^{2/3})$ is achievable [Besbes et al., 2014, 2015, Cheung et al., 2019]. Our most comparable result is given in Section 8.3, which recovers the Bayesian analogue of this result by bounding the rate-distortion function. In fact, we derive a stronger regret bound that ignores variation that is common across all arms. A recent preprint by Jia et al. [2023] reveals that stronger bounds are possible under additional smoothness assumptions on environment variations. It is an interesting open question to study whether one can derive similar bounds by bounding the rate-distortion function.

2 Problem Setup

A decision-maker interacts with a changing environment across rounds $t\in\mathbb{N}:=\{1,2,3,\ldots\}$. In period $t$, the decision-maker selects some action $A_{t}$ from a finite set $\mathcal{A}$, observes an outcome $O_{t}$, and associates this with reward $R_{t}=R(O_{t},A_{t})$ that depends on the outcome and action through a known utility function $R(\cdot)$.

There is a function $g$, an i.i.d. sequence of disturbances $W=(W_{t})_{t\in\mathbb{N}}$, and a sequence of latent environment states $\theta=(\theta_{t})_{t\in\mathbb{N}}$ taking values in $\Theta$, such that outcomes are determined as

O_{t}=g(A_{t},\theta_{t},W_{t}). (1)

Write potential outcomes as $O_{t,a}=g(a,\theta_{t},W_{t})$ and potential rewards as $R_{t,a}=R(O_{t,a},a)$. Equation (1) is equivalent to specifying a known probability distribution over outcomes for each choice of action and environment state.
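
To make this protocol concrete in code, here is a minimal simulation harness. It is a sketch under our own naming choices (Environment, run, and the callables g, reward, next_state, and noise are placeholders a modeler would supply), not an implementation from the paper.

from dataclasses import dataclass
from typing import Any, Callable
import numpy as np

@dataclass
class Environment:
    # O_t = g(A_t, theta_t, W_t) and R_t = R(O_t, A_t); theta evolves on its own.
    g: Callable[[Any, Any, Any], Any]
    reward: Callable[[Any, Any], float]
    next_state: Callable[[Any, np.random.Generator], Any]
    noise: Callable[[np.random.Generator], Any]

def run(env: Environment, policy, theta0, T: int, seed: int = 0):
    """Interact for T periods; `policy` maps (history, rng) to an action A_t."""
    rng = np.random.default_rng(seed)
    theta, history, rewards = theta0, [], []
    for _ in range(T):
        a = policy(history, rng)                 # A_t = pi_t(F_{t-1}, seed)
        o = env.g(a, theta, env.noise(rng))      # outcome O_t
        history.append((a, o))                   # F_t = (A_1, O_1, ..., A_t, O_t)
        rewards.append(env.reward(o, a))         # R_t = R(O_t, A_t)
        theta = env.next_state(theta, rng)       # latent state evolves, unaffected by A_t
    return history, rewards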

The decision-maker wants to earn high rewards even as the environment evolves, but cannot directly observe the environment state or influence its evolution. Specifically, the decision-maker’s actions are determined by some choice of policy $\pi=(\pi_{1},\pi_{2},\ldots)$. At time $t$, an action $A_{t}=\pi_{t}(\mathcal{F}_{t-1},\tilde{W}_{t})$ is a function of the observation history $\mathcal{F}_{t-1}=(A_{1},O_{1},\ldots,A_{t-1},O_{t-1})$ and an internal random seed $\tilde{W}_{t}$ that allows for randomness in action selection. Reflecting that the seed is exogenously determined, assume $\tilde{W}=(\tilde{W}_{t})_{t\in\mathbb{N}}$ is jointly independent of the outcome disturbance process $W$ and state process $\theta=(\theta_{t})_{t\in\mathbb{N}}$. That actions do not influence the environment’s evolution can be written formally through the conditional independence relation $(\theta_{s})_{s\geq t+1}\perp\mathcal{F}_{t}\mid(\theta_{\ell})_{\ell\leq t}$.

The decision-maker wants to select a policy $\pi$ that accumulates high rewards as this interaction continues. They know all probability distributions and functions listed above, but are uncertain about how environment states will evolve across time. To perform ‘well’, they need to continually gather information about the latent environment states and carefully balance exploration and exploitation.

Rather than measure the reward a policy generates, it is helpful to measure its regret. We define the $T$-period per-period regret of a policy $\pi$ to be

\bar{\Delta}_{T}(\pi):=\frac{\mathbb{E}_{\pi}\left[\sum_{t=1}^{T}\big(R_{t,A_{t}^{*}}-R_{t,A_{t}}\big)\right]}{T},

where the latent optimal action $A_{t}^{*}$ is a function of the latent state $\theta_{t}$ satisfying $A_{t}^{*}\in\operatorname{argmax}_{a\in\mathcal{A}}\mathbb{E}[R_{t,a}\mid\theta_{t}]$. We further define the regret rate of policy $\pi$ as its limit value,

\bar{\Delta}_{\infty}(\pi):=\limsup_{T\to\infty}\,\bar{\Delta}_{T}(\pi).

It measures the (long-run) per-period degradation in performance due to uncertainty about the environment state.

Remark 1.

The use of a limit supremum and Cesàro averages is likely unnecessary under some technical restrictions. For instance, under Thompson sampling applied to Examples 1–4, if the latent state process $(\theta_{t})_{t\in\mathbb{N}}$ is ergodic, we conjecture that $\bar{\Delta}_{\infty}(\pi)=\lim_{t\to\infty}\mathbb{E}_{\pi}\left[R_{t,A_{t}^{*}}-R_{t,A_{t}}\right]$.

Our analysis proceeds under the following assumption, which is standard in the literature.

Assumption 1.

There exists $\sigma$ such that, conditioned on $\mathcal{F}_{t-1}$, $R_{t,a}$ is sub-Gaussian with variance proxy $\sigma^{2}$.

2.1 ‘Stationary processes’ in ‘nonstationary bandits’

The way the term ‘nonstationarity’ is used in the bandit learning literature could cause confusion, as it conflicts with the meaning of ‘stationarity’ in the theory of stochastic processes, which we use elsewhere in this paper.

Definition 1.

A stochastic process $X=(X_{t})_{t\in\mathbb{N}}$ is (strictly) stationary if for each integer $t$, the random vector $(X_{1+m},\ldots,X_{t+m})$ has the same distribution for each choice of $m$.

‘Nonstationarity’, as used in the bandit learning literature, means that realizations of the latent state $\theta_{t}$ may differ at different time steps. The decision-maker can gather information about the current state of the environment, but it may later change. Nonstationarity of the stochastic process $(\theta_{t})_{t\in\mathbb{N}}$, in the language of probability theory, arises when a priori there are predictable differences between environment states at different timesteps – e.g., if time period $t$ falls at nighttime then rewards tend to be lower than during daytime. It is often clearer to model predictable differences like that through contexts, as in Example 4.

2.2 Examples

2.2.1 Modeling a range of interactive decision-making problems

Many interactive decision-making problems can be naturally written as special cases of our general protocol, where actions generate outcomes that are associated with rewards.

Our first example describes a bandit problem with independent arms, where outcomes generate information only about the selected action.

Example 1 ($k$-armed bandit).

Consider a website that can display one among $k$ ads at a time and gains one dollar per click. For each ad $a\in[k]:=\{1,\ldots,k\}$, the potential outcome/reward $O_{t,a}=R_{t,a}\sim\text{Bernoulli}(\theta_{t,a})$ is a random variable representing whether the ad $a$ is clicked by the $t^{\text{th}}$ visitor if displayed, where $\theta_{t,a}\in[0,1]$ represents its click-through-rate. The platform only observes the reward of the displayed ad, so $O_{t}=R_{t,A_{t}}$.

Full information online optimization problems fall at the other extreme from $k$-armed bandit problems. There the potential observation $O_{t,a}$ does not depend on the chosen action $a$, so purposeful information gathering is unnecessary. The next example was introduced by Cover [1991] and motivates such scenarios.

Example 2 (Log-optimal online portfolios).

Consider a small trader who has no market impact. In period $t$ they have wealth $W_{t}$ which they divide among $k$ possible investments. The action $A_{t}$ is chosen from a feasible set of probability vectors, with $A_{t,i}$ denoting the proportion of wealth invested in stock $i$. The observation is $O_{t}\in\mathbb{R}^{k}_{+}$ where $O_{t,i}$ is the end-of-day value of $\$1$ invested in stock $i$ at the start of the day, and the distribution of $O_{t}$ is parameterized by $\theta_{t}$. Because the observation consists of publicly available data, and the trader has no market impact, $O_{t}$ does not depend on the investor’s decision. Define the reward function $R_{t}=\log\left(O_{t}^{\top}A_{t}\right)$. Since wealth evolves according to the equation $W_{t+1}=\left(O_{t}^{\top}A_{t}\right)W_{t}$,

\sum_{t=1}^{T-1}R_{t}=\log(W_{T}/W_{1}).
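
As a quick numerical illustration of this identity (our own example, with an arbitrary return distribution and a fixed equal-weight allocation), the per-period log rewards telescope to the log of the wealth ratio:

import numpy as np

rng = np.random.default_rng(1)
k, T = 3, 250
O = rng.lognormal(mean=0.0, sigma=0.01, size=(T, k))   # hypothetical gross daily returns O_t
A = np.full(k, 1.0 / k)                                # fixed equal-weight allocation A_t

W = [1.0]                                              # wealth, W_1 = 1
rewards = []
for t in range(T):
    growth = float(O[t] @ A)                           # O_t^T A_t
    rewards.append(np.log(growth))                     # R_t = log(O_t^T A_t)
    W.append(growth * W[-1])                           # W_{t+1} = (O_t^T A_t) W_t

assert np.isclose(sum(rewards), np.log(W[-1] / W[0]))  # sum of log rewards = log(W_T / W_1)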

Many problems lie in between these extremes. We give two examples. The first is a matching problem. Many pairs of individuals are matched together and, in addition to the cumulative reward, the decision-maker observes feedback on the quality of each individual match. This kind of observation structure is sometimes called ‘‘semi-bandit’’ feedback [Audibert et al., 2014].

Example 3 (Matching).

Consider an online dating platform with two disjoint sets of individuals $\mathcal{M}$ and $\mathcal{W}$. On each day $t$, the platform suggests a matching of size $k$, $A_{t}\subset\{(m,w):m\in\mathcal{M},w\in\mathcal{W}\}$ with $|A_{t}|\leq k$. For each pair $(m,w)$, their match quality is given by $Q_{t,(m,w)}$, where the distribution of $Q_{t}=(Q_{t,(m,w)}:m\in\mathcal{M},w\in\mathcal{W})$ is parameterized by $\theta_{t}$. The platform observes the quality of individual matches, $O_{t}=\big(Q_{t,(m,w)}:(m,w)\in A_{t}\big)$, and earns their average, $R_{t}=\frac{1}{k}\sum_{(m,w)\in A_{t}}Q_{t,(m,w)}$.

Our final example is a contextual bandit problem. Here an action is itself more like a policy — it is a rule for assigning treatments on the basis of an observed context. Observations are richer than in the $k$-armed bandit. The decision-maker sees not only the reward a policy generated but also the context in which it was applied.

Example 4 (Contextual bandit).

Suppose that the website described in Example 1 can now access additional information about each visitor, denoted by $X_{t}\in\mathcal{X}$. The website observes the contextual information $X_{t}$, chooses an ad to display, and then observes whether the user clicks. To represent this task using our general protocol, we let the decision space $\mathcal{A}$ be the set of mappings from the context space $\mathcal{X}$ to the set of ads $\{1,\ldots,k\}$, the decision $A_{t}\in\mathcal{A}$ be a personalized advertising rule, and the observation $O_{t}=(X_{t},R_{t})$ contain the observed visitor information and the reward from applying the ad $A_{t}(X_{t})$. Rewards are drawn according to $R_{t}\mid X_{t},A_{t},\theta_{t}\sim\text{Bernoulli}(\phi_{\theta_{t}}(X_{t},A_{t}(X_{t})))$, where $\phi_{\theta}:\mathcal{X}\times[k]\to[0,1]$ is a parametric click-through-rate model. Assume $X_{t+1}\perp(A_{t},\theta)\mid X_{t},\mathcal{F}_{t-1}$. This assumption means that advertising decisions cannot influence the future contexts and that parameters of the click-through rate model $\theta=(\theta_{t})_{t\in\mathbb{N}}$ cannot be inferred passively by observing contexts.

2.2.2 Unconventional models of nonstationarity

Section 1.1 already presented one illustration of a $k$-armed bandit problem in which arms’ mean-rewards change across time. This section illustrates a very different kind of dynamics that can also be cast as a special case of our framework.

The next example describes a variant of the kk-armed bandit problem motivated by a news article recommendation problem treated in the seminal work of Chapelle and Li [2011]. We isolate one source of nonstationarity: the periodic replacement of news articles with fresh ones. Each new article is viewed as a draw from a population distribution; experience recommending other articles gives an informed prior on an article’s click-through-rate, but resolving residual uncertainty requires exploration. Because articles are continually refreshed, the system must perpetually balance exploration and exploitation.

Our treatment here is purposefully stylized. A realistic treatment might include features of the released articles which signal their click-through-rates, features of arriving users, time-of-day effects, and more general article lifetime distributions.

Example 5 (News article recommendation).

At each user visit, a news website recommends one out of a small pool of hand-picked news articles. The pool is dynamic – old articles are removed and new articles are added to the pool in an endless manner. An empirical investigation of a dataset from Yahoo news (the ‘‘Yahoo! Front Page User Click Log’’ dataset) shows that, as visualized in Figure 3, the pool size varies between 19 and 25 while an article stays in the candidate pool for 17.9 hours on average.


Figure 3: Addition/removal times of 271 articles recorded in ‘‘Yahoo! Front Page User Click Log’’ dataset. Each horizontal line segment represents the time period on which an article was included in the candidate pool, and these line segments are packed into 25 slots. Colors are randomly assigned.

We can formulate this situation with $k=25$ article slots (arms), each of which contains one article at a time. Playing an arm $i\in[k]$ corresponds to recommending the article in slot $i$. The article in a slot gets replaced with a new article at random times; the decision-maker observes that it was replaced but is uncertain about the click-through rate of the new article.

Let $\theta_{t}\in[0,1]^{k}$ denote the vector of click-through rates across $k$ article slots, where $R_{t,a}\mid\theta_{t,a}\sim\text{Bernoulli}(\theta_{t,a})$. The observation at time $t$ when playing arm $a$ is $O_{t,a}=(R_{t,a},\chi_{t})$, where $\chi_{t}\in\{0,1\}^{k}$ indicates which article slots were updated. If $\chi_{t,a}=0$, then $\theta_{t+1,a}=\theta_{t,a}$ almost surely. If $\chi_{t,a}=1$, then $\theta_{t+1,a}$ is drawn independently from the past as

\theta_{t+1,a}\mid(\chi_{t,a}=1)\sim Q.

As this process repeats across many articles, the population click-through-rate distribution $Q$ can be learned from historical data. For purposes of analysis, we simply assume $Q$ is known.

Assume that $\chi_{t,a}\sim{\rm Bernoulli}(1/\tau)$ is independent across times $t$ and arms $a$; this means that each article’s lifetime follows a Geometric distribution with mean lifetime $\tau$. (Assume also that $(\chi_{t})_{t\in\mathbb{N}}$ is independent of the disturbance process $W$ and random seeds $\tilde{W}$.)
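
A minimal sketch of this generative model is below. It is ours, not the paper's code, and the population distribution Q (here Beta(1, 9)) is a hypothetical choice made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
k, tau = 25, 500

def draw_Q(n):
    return rng.beta(1, 9, n)                         # hypothetical population CTR distribution Q

theta = draw_Q(k)                                    # theta_{1,a} ~ Q, i.i.d. across slots

def step(theta, action):
    """One period of Example 5: Bernoulli click for the played slot, then
    each slot refreshes independently with probability 1/tau."""
    reward = int(rng.random() < theta[action])       # R_{t,a} ~ Bernoulli(theta_{t,a})
    chi = rng.random(k) < 1.0 / tau                  # refresh indicators chi_t
    observation = (reward, chi.copy())               # O_t = (R_t, chi_t)
    theta_next = np.where(chi, draw_Q(k), theta)     # refreshed slots redraw from Q
    return observation, theta_next

obs, theta = step(theta, action=0)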

3 Thompson sampling

While this paper is primarily focused on the limits of attainable performance, many of our regret bounds apply to Thompson sampling [Thompson, 1933], one of the most popular bandit algorithms. This section explains how to generalize its definition to dynamic environments.

We denote the Thompson sampling policy by $\pi^{\rm TS}$ and define it by the probability matching property:

\mathbb{P}(A_{t}=a\mid\mathcal{F}_{t-1})=\mathbb{P}(A_{t}^{*}=a\mid\mathcal{F}_{t-1}), (2)

which holds for all $t\in\mathbb{N}$, $a\in\mathcal{A}$. Actions are chosen by sampling from the posterior distribution of the optimal action at the current time period. This aligns with the familiar definition of Thompson sampling in static environments where $A_{1}^{*}=\cdots=A_{T}^{*}$.

Practical implementations of TS sample from the posterior distribution of the optimal action by sampling a plausible latent state $\hat{\theta}_{t}\sim\mathbb{P}(\theta_{t}=\cdot\mid\mathcal{F}_{t-1})$ and picking the action $A_{t}\in\operatorname{argmax}_{a\in\mathcal{A}}\mathbb{E}[R_{t,a}\mid\theta_{t}=\hat{\theta}_{t}]$.

How the posterior gradually forgets stale data.

The literature on nonstationary bandit problems often constructs mean-reward estimators that explicitly throw away or discount old data [Kocsis and Szepesvári, 2006, Garivier and Moulines, 2011]. Under Bayesian algorithms like TS, forgetting of stale information happens implicitly through the formation of posterior distributions. To provide intuition, we explain that proper posterior updating resembles an exponential moving average estimator under a simple model of nonstationarity.

Consider a $k$-armed bandit with Gaussian noise where each arm’s mean-reward process follows an auto-regressive process of order 1. That is,

\theta_{t,a}=\alpha\cdot\theta_{t-1,a}+\xi_{t,a},\quad R_{t,a}=\theta_{t,a}+W_{t,a},

where $\xi_{t,a}\sim\mathcal{N}(0,\sigma_{\xi}^{2})$ are i.i.d. innovations in the mean-reward process, $W_{t,a}\sim\mathcal{N}(0,\sigma_{W}^{2})$ are i.i.d. noises in the observations, and $\alpha\in(0,1)$ is the autoregression coefficient. Assume $\theta_{1,a}\sim\mathcal{N}(0,\frac{\sigma_{\xi}^{2}}{1-\alpha^{2}})$ so that the mean-reward process is stationary.

Algorithm 1 implements Thompson sampling in this environment. It uses a (simple) Kalman filter to obtain the posterior distribution. Past data is ‘‘forgotten’’ at a geometric rate over time, and the algorithm shrinks posterior mean estimates toward 0 (i.e., the prior mean) if few recent observations are available.

Input: AR(1) process parameters $(\alpha,\sigma_{\xi}^{2})$, noise variance $\sigma_{W}^{2}$.
Initialize $\hat{\mu}_{0,a}\leftarrow 0$, $\hat{\nu}_{0,a}\leftarrow\frac{\sigma_{\xi}^{2}}{1-\alpha^{2}}$ for each $a\in\mathcal{A}$;
for $t=1,2,\ldots$ do
       Diffuse posterior as it proceeds one time step: for each $a\in\mathcal{A}$,
       \hat{\mu}_{t,a}\leftarrow\alpha\hat{\mu}_{t-1,a},\quad\hat{\nu}_{t,a}\leftarrow\alpha^{2}\hat{\nu}_{t-1,a}+\sigma_{\xi}^{2} (3)
       Sample $\tilde{\theta}_{a}\sim\mathcal{N}(\hat{\mu}_{t,a},\hat{\nu}_{t,a})$ for each $a\in\mathcal{A}$;
       Play $A_{t}=\operatorname{argmax}_{a}\tilde{\theta}_{a}$, and observe $R_{t}$;
       Additionally update posterior for the new observation: for $a=A_{t}$ only,
       \hat{\mu}_{t,a}\leftarrow\frac{\hat{\nu}_{t,a}^{-1}\hat{\mu}_{t,a}+\sigma_{W}^{-2}R_{t}}{\hat{\nu}_{t,a}^{-1}+\sigma_{W}^{-2}},\quad\hat{\nu}_{t,a}\leftarrow\frac{1}{\hat{\nu}_{t,a}^{-1}+\sigma_{W}^{-2}} (4)
end for
Algorithm 1: Thompson sampling for the AR(1) bandit
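
For concreteness, here is a direct Python transcription of Algorithm 1, wrapped in a small simulated AR(1) environment so it runs end-to-end. The environment wrapper and parameter values are our own choices for illustration; only the filtering and sampling steps come from the algorithm above.

import numpy as np

def ar1_thompson_sampling(T, k, alpha, sigma_xi, sigma_W, seed=0):
    """Thompson sampling for the AR(1) bandit via a per-arm Kalman filter (Algorithm 1),
    run against a simulated environment; returns average per-period regret."""
    rng = np.random.default_rng(seed)
    nu0 = sigma_xi**2 / (1 - alpha**2)
    mu_hat, nu_hat = np.zeros(k), np.full(k, nu0)        # posterior means and variances
    theta = rng.normal(0.0, np.sqrt(nu0), k)             # latent mean rewards (stationary start)
    regret = 0.0
    for t in range(T):
        # Diffuse the posterior one time step (predict step of the Kalman filter), eq. (3).
        mu_hat = alpha * mu_hat
        nu_hat = alpha**2 * nu_hat + sigma_xi**2
        # Probability matching: sample plausible mean rewards and play the argmax.
        a = int(np.argmax(rng.normal(mu_hat, np.sqrt(nu_hat))))
        r = theta[a] + rng.normal(0.0, sigma_W)
        # Measurement update for the played arm only, eq. (4).
        precision = 1.0 / nu_hat[a] + 1.0 / sigma_W**2
        mu_hat[a] = (mu_hat[a] / nu_hat[a] + r / sigma_W**2) / precision
        nu_hat[a] = 1.0 / precision
        regret += theta.max() - theta[a]
        # The environment evolves regardless of the chosen action.
        theta = alpha * theta + rng.normal(0.0, sigma_xi, k)
    return regret / T

print(ar1_thompson_sampling(T=5000, k=5, alpha=0.99, sigma_xi=0.1, sigma_W=1.0))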

4 Information Theoretic Preliminaries

The entropy of a discrete random variable $X$, defined by $H(X):=-\sum_{x}\mathbb{P}(X=x)\log(\mathbb{P}(X=x))$, measures the uncertainty in its realization. The entropy rate of a stochastic process $(X_{1},X_{2},\ldots)$ is the rate at which the entropy of the partial realization $(X_{1},\ldots,X_{t})$ accumulates as $t$ grows.

Definition 2.

The $T$-period entropy rate of a stochastic process $X=(X_{t})_{t\in\mathbb{N}}$, taking values in a discrete set $\mathcal{X}$, is

\bar{H}_{T}(X):=\frac{H\left([X_{1},\ldots,X_{T}]\right)}{T}.

The entropy rate is defined as its limit value:

\bar{H}_{\infty}(X):=\limsup_{T\to\infty}\,\bar{H}_{T}(X).

We provide some useful facts about the entropy rate:

Fact 1.

$0\leq\bar{H}_{T}(X)\leq\log(|\mathcal{X}|)$.

Fact 2.

By the chain rule,

\bar{H}_{T}(X)=\frac{1}{T}\sum_{t=1}^{T}H(X_{t}|X_{t-1},\ldots,X_{1}).

If $X$ is a stationary stochastic process, then

\bar{H}_{\infty}(X)=\lim_{t\to\infty}H(X_{t}|X_{t-1},\ldots,X_{1}). (5)

The form (5) is especially elegant. The entropy rate of a stationary stochastic process is the residual uncertainty in the draw of $X_{t}$ which cannot be removed by knowing the draw of $X_{t-1},\ldots,X_{1}$. Processes that evolve quickly and erratically have high entropy rate. Those that tend to change infrequently (i.e., $X_{t}=X_{t-1}$ for most $t$) or change predictably will have low entropy rate.

The next lemma sharpens this understanding by establishing an upper bound on the entropy rate in terms of the switching frequency.

Lemma 1.

Consider a stochastic process $X=(X_{t})_{t\in\mathbb{N}}$ taking values in a finite set $\mathcal{X}$. Let $\mathcal{S}_{T}(X)$ be the number of switches in $X$ occurring up to time $T$,

\mathcal{S}_{T}(X):=1+\sum_{t=2}^{T}\mathbb{I}\{X_{t}\neq X_{t-1}\}. (6)

If $\mathcal{S}_{T}(X)\leq S_{T}$ almost surely for some $S_{T}\in\{1,\ldots,T\}$, the $T$-period entropy rate of $X$ is upper bounded as

\bar{H}_{T}(X)\leq\frac{S_{T}}{T}\cdot\left(1+\log\left(1+\frac{T}{S_{T}}\right)+\log(|\mathcal{X}|)\right).

The proof is provided in Appendix A; it simply counts the number of possible realizations $(X_{1},\ldots,X_{T})\in\mathcal{X}^{T}$ with a restricted number of switches.
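
The counting argument is easy to check numerically: the entropy of a sequence whose number of switches is at most $S_{T}$ is bounded by the log of the number of such realizations, and that count sits below the expression in Lemma 1. The helper names below are ours.

import math

def log_num_sequences(T, S, num_symbols):
    """log of the number of sequences x_{1:T} over `num_symbols` symbols whose
    switch count (6), which includes the first segment, is at most S."""
    count = sum(math.comb(T - 1, s - 1) * num_symbols * (num_symbols - 1) ** (s - 1)
                for s in range(1, S + 1))
    return math.log(count)

def lemma1_bound(T, S, num_symbols):
    return S * (1 + math.log(1 + T / S) + math.log(num_symbols))

for T, S, k in [(100, 5, 3), (1000, 20, 10), (1000, 1000, 2)]:
    assert log_num_sequences(T, S, k) <= lemma1_bound(T, S, k)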

The conditional entropy rate and mutual information rate of two stochastic processes are defined similarly. For brevity, we write $X_{1:T}$ to denote the random vector $(X_{1},\ldots,X_{T})$.

Definition 3.

Consider a stochastic process $(X,Z)=(X_{t},Z_{t})_{t\in\mathbb{N}}$, where $X_{t}$ takes values in a discrete set. The $T$-period conditional entropy rate of $X$ given $Z$ is defined as

\bar{H}_{T}(X|Z):=\frac{H(X_{1:T}\mid Z_{1:T})}{T},

and the $T$-period mutual information rate between $X$ and $Z$ is defined as

\bar{I}_{T}(X;Z):=\frac{I(X_{1:T};Z_{1:T})}{T}.

The conditional entropy rate and mutual information rate are defined as their limit values: i.e., $\bar{H}_{\infty}(X|Z):=\limsup_{T\to\infty}\bar{H}_{T}(X|Z)$, and $\bar{I}_{\infty}(X;Z):=\limsup_{T\to\infty}\bar{I}_{T}(X;Z)$.

The mutual information rate $\bar{I}_{T}(X;Z)$ represents the average amount of communication happening between two processes, $X$ and $Z$. This quantity can also be understood as the decrease in the entropy rate of $X$ resulting from observing $Z$, and trivially it cannot exceed the entropy rate of $X$:

Fact 3.

$0\leq\bar{I}_{T}(X;Z)=\bar{H}_{T}(X)-\bar{H}_{T}(X|Z)\leq\bar{H}_{T}(X)$.

The next fact, which follows from the chain rule, shows that the mutual information rate can be expressed as the average amount of information that $Z$ accumulates about $X$ in each period.

Fact 4.

\bar{I}_{T}(X;Z)=\frac{1}{T}\sum_{t=1}^{T}I(X_{1:T};Z_{t}\mid Z_{t-1},\ldots,Z_{1}).

5 Information-Theoretic Analysis of Dynamic Regret

We apply the information theoretic analysis of Russo and Van Roy [2016] and establish upper bounds on the per-period regret, expressed in terms of (1) the algorithm’s information ratio, and (2) the entropy rate of the optimal action process.

5.1 Preview: special cases of our result

We begin by giving a special case of our result. It bounds the regret of Thompson sampling in terms of the reward variance proxy $\sigma^{2}$, the number of actions $|\mathcal{A}|$, and the entropy rate of the optimal action process $\bar{H}_{T}(A^{*})$.

Corollary 1.

Under any problem in the scope of our problem formulation,

\bar{\Delta}_{T}(\pi^{\rm TS})\leq\sigma\sqrt{2\cdot|\mathcal{A}|\cdot\bar{H}_{T}(A^{*})},
\quad\text{and}\quad\bar{\Delta}_{\infty}(\pi^{\rm TS})\leq\sigma\sqrt{2\cdot|\mathcal{A}|\cdot\bar{H}_{\infty}(A^{*})}.

This result naturally covers a wide range of bandit learning tasks while highlighting that the entropy rate of the optimal action process captures the level of degradation due to nonstationarity. Note that it includes as a special case the well-known regret upper bound established for a stationary $k$-armed bandit: when $A_{1}^{*}=\ldots=A_{T}^{*}$, we have $H([A_{1}^{*},\ldots,A_{T}^{*}])=H(A_{1}^{*})\leq\log k$, and thus $\bar{\Delta}_{T}(\pi^{\rm TS})\leq\tilde{O}(\sigma\sqrt{k/T})$.

According to (5), the entropy rate is small when the conditional entropy $H(A_{t}^{*}\mid A^{*}_{1},\ldots,A^{*}_{t-1})$ is small. That is, the entropy rate is small if most uncertainty in the optimal action $A_{t}^{*}$ is removed through knowledge of the past optimal actions. Of course, Thompson sampling does not observe the environment states or the corresponding optimal actions, so its dependence on this quantity is somewhat remarkable.

The dependence of regret on the number of actions, $|\mathcal{A}|$, is unavoidable in a problem like the $k$-armed bandit of Example 1. But in other cases, it is undesirable. Our general results depend on the problem’s information structure in a more refined manner. To preview this, we give another corollary of our main result, which holds for problems with full-information feedback (see Example 2 for motivation). In this case, the dependence on the number of actions completely disappears and the bound depends on the variance proxy and the entropy rate. The bound applies to TS and the policy $\pi^{\rm Greedy}$, which chooses $A_{t}\in\operatorname{argmax}_{a\in\mathcal{A}}\mathbb{E}[R_{t,a}\mid\mathcal{F}_{t-1}]$ in each period $t$.

Corollary 2.

For full information problems, where $O_{t,a}=O_{t,a^{\prime}}$ for each $a,a^{\prime}\in\mathcal{A}$, we have

\bar{\Delta}_{T}(\pi^{\rm Greedy})\leq\bar{\Delta}_{T}(\pi^{\rm TS})\leq\sigma\sqrt{2\cdot\bar{H}_{T}(A^{*})},
\quad\text{and}\quad\bar{\Delta}_{\infty}(\pi^{\rm Greedy})\leq\bar{\Delta}_{\infty}(\pi^{\rm TS})\leq\sigma\sqrt{2\cdot\bar{H}_{\infty}(A^{*})}.

5.2 Bounds on the entropy rate

Our results highlight that the difficulty arising due to the nonstationarity of the environment is sufficiently characterized by the entropy rate of the optimal action process, denoted by $\bar{H}_{T}(A^{*})$ or $\bar{H}_{\infty}(A^{*})$. We provide some stylized upper bounds on these quantities to aid in their interpretation and to compare the resulting regret bounds with existing results in the literature.

Bound with the effective time horizon.

In settings where the optimal action process $(A_{t}^{*})_{t\in\mathbb{N}}$ is stationary (Definition 1), define $\tau_{\textup{eff}}:=1/\mathbb{P}(A_{t}^{*}\neq A_{t-1}^{*})$. We interpret $\tau_{\textup{eff}}$ as the problem’s ‘‘effective time horizon’’, as it captures the average length of time before the identity of the optimal action changes. The effective time horizon $\tau_{\textup{eff}}$ is long when the optimal action changes infrequently, so that, intuitively, a decision-maker could continue to exploit the optimal action for a long time if it were identified, achieving low regret. The next proposition shows that the entropy rate $\bar{H}_{T}(A^{*})$ is bounded by the inverse of $\tau_{\textup{eff}}$, regardless of $T$:

Proposition 1.

When the process $(A_{t}^{*})_{t\in\mathbb{N}}$ is stationary,

\bar{H}_{T}(A^{*})\leq\frac{1+\log(\tau_{\textup{eff}})+H(A_{t}^{*}|A_{t}^{*}\neq A_{t-1}^{*})}{\tau_{\textup{eff}}}+\frac{H(A^{*}_{1})}{T}, (7)

for every $T\in\mathbb{N}$, where

\tau_{\textup{eff}}:=\frac{1}{\mathbb{P}(A_{t}^{*}\neq A_{t-1}^{*})}. (8)

Combining this result with Corollary 1, and the fact that $\frac{H(A^{*}_{1})}{T}\to 0$ as $T\to\infty$, we obtain

\bar{\Delta}_{\infty}(\pi^{\rm TS})\leq\tilde{O}\left(\sigma\sqrt{\frac{|\mathcal{A}|}{\tau_{\rm eff}}}\right),

which closely mirrors familiar $O(\sqrt{k/T})$ bounds on the average per-period regret in bandit problems with $k$ arms, $T$ periods, and i.i.d. rewards [Bubeck and Cesa-Bianchi, 2012], except that the effective time horizon replaces the problem’s raw time horizon.

Example 6 (Revisiting news article recommendation).

In Example 5, the stochastic process $(\theta_{t})_{t\in\mathbb{N}}$ is stationary as long as $\theta_{1,a}\overset{i.i.d.}{\sim}Q$ for each article slot $a\in[k]$. Recall that each article slot is refreshed with probability $1/\tau$, independently. As a result,

\tau_{\rm eff}=\frac{1}{\mathbb{P}(A_{t}^{*}\neq A_{t-1}^{*})}\approx\frac{k+1}{2(k-1)}\cdot\tau,

and $\bar{\Delta}_{\infty}(\pi^{\rm TS})\leq\tilde{O}\left(\sigma\sqrt{\frac{k}{\tau}}\right)$. Per-period regret scales with the number of article slots that need to be explored and inversely with the lifetime of an article; the latter determines how long information about an article’s click-rate can be exploited before it is no longer relevant.

To attain this result, it is critical that our theory depends on the entropy rate or effective horizon of the optimal action process, rather than the parameter process. The vector of click-through rates $\theta_{t}$ changes, on average, every $\tau/k$ periods, since it changes whenever an article is updated. Nevertheless, the optimal action process changes roughly once every $\tau$ periods, on average, since a new article is only optimal $1/k^{\rm th}$ of the time.
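
This comparison of time scales can be checked by simulation. The sketch below uses a hypothetical population distribution Q (Beta(1, 9)) and illustrative values of k and tau; it reports how often the latent CTR vector changes versus how often the optimal slot changes, the latter being on the order of the approximation displayed in Example 6.

import numpy as np

rng = np.random.default_rng(0)
k, tau, T = 25, 500, 200_000
theta = rng.beta(1, 9, k)                       # hypothetical population CTR distribution Q
theta_changes, a_star_switches = 0, 0
prev_a_star = int(theta.argmax())
for t in range(T):
    chi = rng.random(k) < 1.0 / tau             # independent slot refreshes
    if chi.any():
        theta = np.where(chi, rng.beta(1, 9, k), theta)
        theta_changes += 1
    a_star = int(theta.argmax())
    a_star_switches += int(a_star != prev_a_star)
    prev_a_star = a_star

print("periods per change of theta:", T / theta_changes)       # roughly tau / k
print("periods per switch of A*   :", T / a_star_switches)
print("(k+1)/(2(k-1)) * tau       :", (k + 1) / (2 * (k - 1)) * tau)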

We now revisit the two-armed bandit example from Section 1.1.

Example 7 (Revisiting the example in Section 1.1).

Consider the two-armed bandit example discussed in Section 1.1. The idiosyncratic mean reward process of each arm is a stationary zero-mean Gaussian process such that, for each $a\in\{1,2\}$, $\textup{Cov}\left(\theta_{s,a}^{\rm id},\theta_{t,a}^{\rm id}\right)=\exp\left(-\frac{1}{2}\left(\frac{t-s}{\tau^{\rm id}}\right)^{2}\right)$ for any $s,t\in\mathbb{N}$. By Rice’s formula (Rice, 1944; Barnett and Kedem, 1991, eq. (2)) (also known as the zero-crossing rate formula), we have

\tau_{\rm eff}=\frac{\pi}{\cos^{-1}\left(\exp\left(-\frac{1}{2{\tau^{\rm id}}^{2}}\right)\right)}\approx\pi\cdot\tau^{\rm id}.

Consequently, if $\tau^{\rm id}\geq\pi^{-1}$,

\bar{H}_{\infty}(A^{*})\leq\frac{1+\log(\pi\cdot\tau^{\rm id})}{\pi\cdot\tau^{\rm id}}.
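
The quantities in Example 7 are easy to evaluate. The sketch below (our own check, not from the paper) computes tau_eff from the displayed formula and verifies the switching probability by Monte Carlo, using the fact that with two arms the optimal action flips exactly when the difference of the idiosyncratic processes changes sign.

import numpy as np

rng = np.random.default_rng(0)
tau_id = 50

# tau_eff from the zero-crossing formula in this example.
rho1 = np.exp(-0.5 / tau_id**2)                  # lag-1 autocorrelation of theta^id_{., a}
tau_eff = np.pi / np.arccos(rho1)
print("tau_eff:", tau_eff, "  pi * tau_id:", np.pi * tau_id)

# Monte-Carlo check: consecutive values of D_t = theta^id_{t,1} - theta^id_{t,2}
# form a bivariate normal pair with correlation rho1, and A*_t flips when D_t changes sign.
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho1], [rho1, 1.0]], size=1_000_000)
p_switch = float((np.sign(z[:, 0]) != np.sign(z[:, 1])).mean())
print("Monte-Carlo estimate of tau_eff:", 1.0 / p_switch)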

Below we give an example where the upper bound in Proposition 1 is nearly exact.

Example 8 (Piecewise stationary environment).

Suppose $(A_{t}^{*})_{t\in\mathbb{N}}$ follows a switching process. With probability $1-\delta$ there is no change in the optimal action, whereas with probability $\delta$ there is a change-event and $A_{t}^{*}$ is drawn uniformly from among the other $k-1\equiv|\mathcal{A}|-1$ arms. Precisely, $(A_{t}^{*})_{t\in\mathbb{N}}$ follows a Markov process with transition dynamics:

\mathbb{P}(A^{*}_{t+1}=a\mid A^{*}_{t}=a^{\prime})=\begin{cases}1-\delta&\text{if }a=a^{\prime}\\ \delta/(k-1)&\text{if }a\neq a^{\prime}\end{cases}

for $a,a^{\prime}\in\mathcal{A}$. Then

\bar{H}_{\infty}(A^{*}) = (1-\delta)\log\left(\frac{1}{1-\delta}\right)+\delta\log\left(\frac{k-1}{\delta}\right)
                        = (1-\delta)\log\left(1+\frac{\delta}{1-\delta}\right)+\delta\log\left(\frac{k-1}{\delta}\right)
                        \approx \delta+\delta\log((k-1)/\delta),

where we used the approximation $\log(1+x)\approx x$. Plugging in $\tau_{\rm eff}=1/\delta$ and $H(A_{t}^{*}\mid A_{t}^{*}\neq A^{*}_{t-1})=\log(k-1)$ yields

\bar{H}_{\infty}(A^{*})\approx\frac{1+\log(\tau_{\rm eff})+H(A_{t}^{*}\mid A_{t}^{*}\neq A^{*}_{t-1})}{\tau_{\rm eff}},

which matches the upper bound (7).
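
The quantities in this example are simple enough to compute exactly; the snippet below (entropies in nats, parameter values illustrative) evaluates the exact entropy rate of the Markov switching process and compares it with the approximation above and with the right-hand side of (7).

import numpy as np

def switching_entropy_rate(k, delta):
    """Exact entropy rate (nats) of the Markov switching process: the stationary
    distribution is uniform, so the rate equals the entropy of a single row."""
    return (1 - delta) * np.log(1 / (1 - delta)) + delta * np.log((k - 1) / delta)

k, delta = 10, 0.01
tau_eff = 1 / delta
exact = switching_entropy_rate(k, delta)
approx = delta + delta * np.log((k - 1) / delta)
bound_7 = (1 + np.log(tau_eff) + np.log(k - 1)) / tau_eff   # (7), ignoring the H(A*_1)/T term
print(exact, approx, bound_7)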

Bound with the switching rate.

Proposition 1 requires that the optimal action process is stationary, in the sense of Definition 1. We now provide a similar result that removes this restriction. It is stated in terms of the switching rate, which in stationary settings is similar to the inverse of the effective horizon. The added generality comes at the expense of replacing the conditional entropy $H(A_{t}^{*}\mid A_{t}^{*}\neq A_{t-1}^{*})$ in Proposition 1 with the crude upper bound $\log(|\mathcal{A}|)$.

Proposition 2.

Let $\bar{S}_{T}$ be the expected switching rate of the optimal action sequence, defined as

\bar{S}_{T}:=\frac{\mathbb{E}\left[1+\sum_{t=2}^{T}\mathbb{I}\{A_{t}^{*}\neq A_{t-1}^{*}\}\right]}{T}.

Then,

\bar{H}_{T}(A^{*})\leq\bar{S}_{T}\cdot\left(1+\log\left(1+1/\bar{S}_{T}\right)+\log(|\mathcal{A}|)\right)+\frac{\log T}{T}.

This result follows from Lemma 1 while slightly generalizing it by bounding the entropy rate with the expected number of switches instead of an almost sure upper bound. Combining this result with Corollary 1 gives a bound

\bar{\Delta}_{T}(\pi^{\rm TS})\leq\tilde{O}\left(\sigma\sqrt{|\mathcal{A}|\cdot\bar{S}_{T}}\right),

which precisely recovers the well-known results established for switching bandits [Auer et al., 2002, Suk and Kpotufe, 2022, Abbasi-Yadkori et al., 2022]. Although other features of the environment may change erratically, a low regret is achievable if the optimal action switches infrequently.

Bound with the entropy rate of the latent state process.

Although it can be illuminating to consider the number of switches or the effective horizon, the entropy rate is a deeper quantity that better captures a problem’s intrinsic difficulty. A simple but useful fact is that the entropy rate of the optimal action process cannot exceed that of the environment’s state process:

Remark 2.

Since the optimal action $A_{t}^{*}$ is completely determined by the latent state $\theta_{t}$, by the data processing inequality,

\bar{H}_{T}(A^{*})\leq\bar{H}_{T}(\theta),\quad\bar{H}_{\infty}(A^{*})\leq\bar{H}_{\infty}(\theta).

These bounds can be useful when the environment’s nonstationarity has predictable dynamics. The next example illustrates such a situation, in which the entropy rate of the latent state process can be directly quantified.

Example 9 (System with seasonality).

Consider a system that exhibits a strong intraday seasonality. Specifically, suppose that the system’s hourly state (e.g., arrival rate) at time $t$ can be modeled as

\theta_{t}=\xi_{\text{day}(t)}\cdot\mu_{\text{time-of-the-day}(t)}+\epsilon_{t},

where $(\xi_{d})_{d\in\mathbb{N}}$ is a sequence of i.i.d. random variables describing the daily random fluctuation, $(\mu_{h})_{h\in\{0,\ldots,23\}}$ is a known deterministic sequence describing the intraday pattern, and $(\epsilon_{t})_{t\in\mathbb{N}}$ is a sequence of i.i.d. random variables describing the hourly random fluctuation. Then we have

\bar{H}_{\infty}(A^{*})\leq\bar{H}_{\infty}(\theta)=\frac{1}{24}H(\xi)+H(\epsilon),

regardless of the state-action relationship. Imagine that the variation within the intraday pattern $\mu$ is large so that the optimal action changes almost every hour (i.e., $S_{T}\approx T$ and $\tau_{\textup{eff}}\approx 1$). In this case, a bound like the one above can be easier to compute and more meaningful than the bounds in Propositions 1 and 2.
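
For a concrete feel of this bound, the snippet below evaluates it in nats, assuming (purely for illustration) that the daily and hourly fluctuations take finitely many values so the entropies are finite and easy to compute.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p)).sum())

# Hypothetical discrete distributions for the daily and hourly fluctuations.
p_xi = [0.25, 0.5, 0.25]     # distribution of xi_d over three values
p_eps = [0.1, 0.8, 0.1]      # distribution of epsilon_t over three values

rate_bound = entropy(p_xi) / 24 + entropy(p_eps)   # bound on the entropy rate of (theta_t)
print(rate_bound)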

5.3 Lower bound

We provide an impossibility result through the next proposition, showing that no algorithm can perform significantly better than the upper bounds provided in Corollary 1 and implied by Proposition 1. Our proof is built by modifying well-known lower bound examples for stationary bandits.

Proposition 3.

Let $k>1$ and $\tau\geq k$. There exists a nonstationary bandit problem instance with $|\mathcal{A}|=k$ and $\tau_{\rm eff}=\tau$, such that

\inf_{\pi}\bar{\Delta}_{\infty}(\pi)\geq C\cdot\sigma\sqrt{\frac{|\mathcal{A}|}{\tau_{\rm eff}}},

where $C$ is a universal constant.

Remark 3.

For the problem instance constructed in the proof, the entropy rate of the optimal action process is $\bar{H}(A^{*})\approx\log(|\mathcal{A}|)/\tau_{\rm eff}$. This implies that the upper bound established in Corollary 1 is tight up to logarithmic factors, and so is the one established in Theorem 1.

5.4 Main result

The corollaries presented earlier are special cases of a general result that we present now. Define the (maximal) information ratio of an algorithm $\pi$ by

\Gamma(\pi):=\sup_{t\in\mathbb{N}}\underbrace{\frac{\left(\mathbb{E}_{\pi}\left[R_{t,A_{t}^{*}}-R_{t,A_{t}}\right]\right)^{2}}{I\left(A_{t}^{*};(A_{t},O_{t,A_{t}})\mid\mathcal{F}_{t-1}\right)}}_{=:\Gamma_{t}(\pi)}.

The per-period information ratio $\Gamma_{t}(\pi)$ was defined by Russo and Van Roy [2016] and presented in this form by Russo and Van Roy [2022]. It is the ratio between the square of expected regret and the conditional mutual information between the optimal action and the algorithm’s observation. It measures the cost, in terms of the square of expected regret, that the algorithm pays to acquire each bit of information about the optimum.

The next theorem shows that any algorithm’s per-period regret is bounded by the square root of the product of its information ratio and the entropy rate of the optimal action sequence. The result has profound consequences, but follows easily by applying elegant properties of information measures.

Theorem 1.

Under any algorithm $\pi$,

\bar{\Delta}_{T}(\pi)\leq\sqrt{\Gamma(\pi)\cdot\bar{H}_{T}(A^{*})},\quad\text{and}\quad\bar{\Delta}_{\infty}(\pi)\leq\sqrt{\Gamma(\pi)\cdot\bar{H}_{\infty}(A^{*})}.
Proof.

Use the shorthand notation $\Delta_{t}:=\mathbb{E}[R_{t,A_{t}^{*}}-R_{t,A_{t}}]$ for regret, $G_{t}:=I(A_{t}^{*};(A_{t},O_{t,A_{t}})|\mathcal{F}_{t-1})$ for information gain, and $\Gamma_{t}=\Delta_{t}^{2}/G_{t}$ for the information ratio at period $t$. Then,

\sum_{t=1}^{T}\Delta_{t}=\sum_{t=1}^{T}\sqrt{\Gamma_{t}}\sqrt{G_{t}}\leq\sqrt{\sum_{t=1}^{T}\Gamma_{t}}\sqrt{\sum_{t=1}^{T}G_{t}},

by Cauchy-Schwarz, and

\sqrt{\sum_{t=1}^{T}\Gamma_{t}}\sqrt{\sum_{t=1}^{T}G_{t}}\leq\sqrt{\Gamma(\pi)\cdot T\cdot\sum_{t=1}^{T}G_{t}},

by definition of $\Gamma(\pi)$. We can further bound the information gain. This uses the chain rule, the data processing inequality, and the fact that entropy bounds mutual information:

\sum_{t=1}^{T}G_{t}=\sum_{t=1}^{T}I(A_{t}^{*};(A_{t},O_{t,A_{t}})|\mathcal{F}_{t-1})\leq\sum_{t=1}^{T}I([A_{1}^{*},\ldots,A_{T}^{*}];(A_{t},O_{t,A_{t}})|\mathcal{F}_{t-1})=I([A_{1}^{*},\ldots,A_{T}^{*}];\mathcal{F}_{T})\leq H([A_{1}^{*},\ldots,A_{T}^{*}]).

Combining these results, we obtain

\bar{\Delta}_{T}(\pi)\leq\frac{\sqrt{\Gamma(\pi)\cdot T\cdot H([A_{1}^{*},\ldots,A_{T}^{*}])}}{T}=\sqrt{\Gamma(\pi)\cdot\bar{H}_{T}(A^{*})}.

Taking the limit yields the bound on the regret rate $\bar{\Delta}_{\infty}(\pi)$. ∎

Remark 4.

A careful reading of the proof reveals that it is possible to replace the entropy rate $\bar{H}_{T}(A^{*})$ with the mutual information rate $\bar{I}_{T}(A^{*};\mathcal{F})$.

Remark 5.

Following Lattimore and Gyorgy [2021], one can generalize the definition of information ratio,

Γλ(π):=supt(𝔼π[Rt,AtRt,At])λI(At;(At,Ot,At)t1),\Gamma_{\lambda}(\pi):=\sup_{t\in\mathbb{N}}\frac{\left(\mathbb{E}_{\pi}\left[R_{t,A_{t}^{*}}-R_{t,A_{t}}\right]\right)^{\lambda}}{I\left(A_{t}^{*};(A_{t},O_{t,A_{t}})\mid\mathcal{F}_{t-1}\right)},

which immediately yields the inequality Δ¯T(π)(Γλ(π)H¯T(A))1/λ\bar{\Delta}_{T}(\pi)\leq\left(\Gamma_{\lambda}(\pi)\bar{H}_{T}(A^{*})\right)^{1/\lambda} for any λ1\lambda\geq 1.

6 Bounds on the Information Ratio

6.1 Some known bounds on the information ratio

We list some known results about the information ratio. These were originally established for stationary bandit problems but immediately extend to the nonstationary settings considered in this paper. Most results apply to Thompson sampling, and essentially all bounds apply to Information-directed sampling, which is designed to minimize the information ratio [Russo and Van Roy, 2018]. The first four results were shown by Russo and Van Roy [2016] under Assumption 1.

Classical bandits.

Γ(πTS)2σ2|𝒜|\Gamma(\pi^{\rm TS})\leq 2\sigma^{2}|\mathcal{A}|, for bandit tasks with a finite action set (e.g., Example 1).

Full information.

Γ(πTS)2σ2\Gamma(\pi^{\rm TS})\leq 2\sigma^{2}, for problems with full-information feedback (e.g., Example 2).

Linear bandits.

Γ(πTS)2σ2d\Gamma(\pi^{\rm TS})\leq 2\sigma^{2}d, for linear bandits of dimension dd (i.e., 𝒜d\mathcal{A}\subseteq\mathbb{R}^{d}, Θd\Theta\subseteq\mathbb{R}^{d}, and 𝔼[Rt,a|θt]=aθt\mathbb{E}[R_{t,a}\mid\theta_{t}]=a^{\top}\theta_{t}).

Combinatorial bandits.

Γ(πTS)2σ2dk2\Gamma(\pi^{\rm TS})\leq 2\sigma^{2}\frac{d}{k^{2}}, for combinatorial optimization tasks of selecting kk items out of dd items with semi-bandit feedback (e.g., Example 3).

Contextual bandits.

See Section 6.2 below for a new result.

Logistic bandits.

Dong et al. [2019] consider problems where mean-rewards follow a generalized linear model with logistic link function, and bound the information ratio by the dimension of the parameter vector and a new notion they call the ‘fragility dimension.’

Graph based feedback.

With graph-based feedback, the decision-maker observes not only the reward of the selected arm but also the rewards of its neighbors in the feedback graph. One can bound the information ratio by the feedback graph's clique cover number [Liu et al., 2018] or its independence number [Hao et al., 2022].

Sparse linear models.

Hao et al. [2021] consider sparse linear bandits and give conditions under which the generalized information ratio of Information-Directed Sampling (in the sense of Remark 5) is bounded by the number of nonzero elements in the parameter vector.

Convex cost functions.

Bubeck and Eldan [2016] and Lattimore [2020] study bandit learning problems where the reward function is known to be concave and bound the information ratio by a polynomial function of the dimension of the action space.

6.2 A new bound on the information ratio of contextual bandits

Contextual bandit problems are a special case of our formulation that satisfy the following abstract assumption. Re-read Example 4 to get intuition.

Assumption 2.

There is a set 𝒳\mathcal{X} and an integer kk such that 𝒜\mathcal{A} is the set of functions mapping 𝒳\mathcal{X} to [k][k]. The observation at time tt is the tuple Ot=(Xt,Rt)𝒳×O_{t}=(X_{t},R_{t})\in\mathcal{X}\times\mathbb{R}. Define it:=At(Xt)[k]i_{t}:=A_{t}(X_{t})\in[k]. Assume that for each tt, Xt+1(At,Rt)(Xt,t1)X_{t+1}\perp(A_{t},R_{t})\mid(X_{t},\mathcal{F}_{t-1}), and RtAt(Xt,it,θt).R_{t}\perp A_{t}\mid(X_{t},i_{t},\theta_{t}).

Under this assumption, we provide an information ratio bound that depends on the number of arms kk. It is a massive improvement over Corollary 1, which depends on the number of decision-rules.

Lemma 2.

Under Assumption 2, Γ(πTS)2σ2k\Gamma(\pi^{\rm TS})\leq 2\cdot\sigma^{2}\cdot k.

Theorem 1 therefore bounds regret in terms of the entropy rate of the optimal decision rule process (At)t(A^{*}_{t})_{t\in\mathbb{N}}, the number of arms kk, and the reward variance proxy σ2\sigma^{2}.

Neu et al. [2022] recently highlighted that information-ratio analysis seems not to deal adequately with context, and proposed a substantial modification which considers information gain about model parameters rather than optimal decision-rules. Lemma 2 appears to resolve this open question without changing the information ratio itself. Our bounds scale with the entropy of the optimal decision-rule, instead of the entropy of the true model parameter, as in Neu et al. [2022]. By the data processing inequality, the former is always smaller. Our proof bounds the per-period information ratio, so it can be used to provide finite time regret bounds for stationary contextual bandit problems. Hao et al. [2022] provide an interesting study of variants of Information-directed sampling in contextual bandits with complex information structure. It is not immediately clear how that work relates to Lemma 2 and the information ratio of Thompson sampling.
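As an illustration of Assumption 2 and of Thompson sampling over decision rules, here is a minimal, self-contained sketch for a stationary contextual bandit with a finite set of candidate parameters. All numerical values and the discrete-posterior model are illustrative assumptions made purely for readability, not part of our formulation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical finite contextual bandit: a few contexts, k arms, and a finite
# set of candidate parameters, each a context-by-arm matrix of Bernoulli mean
# rewards.  All numerical values here are illustrative assumptions.
n_ctx, k = 4, 3
candidates = rng.uniform(0.1, 0.9, size=(5, n_ctx, k))   # 5 candidate thetas
true_idx = 0                                              # unknown to the learner
posterior = np.full(len(candidates), 1.0 / len(candidates))

T, total_regret = 2000, 0.0
for t in range(T):
    # Thompson sampling over decision rules: sample a parameter from the
    # posterior and commit to its greedy rule before the context is revealed.
    theta_idx = rng.choice(len(candidates), p=posterior)
    rule = candidates[theta_idx].argmax(axis=1)           # maps context -> arm
    x = rng.integers(n_ctx)                               # context X_t
    arm = rule[x]                                         # i_t = A_t(X_t)
    reward = rng.random() < candidates[true_idx, x, arm]  # Bernoulli reward
    total_regret += candidates[true_idx, x].max() - candidates[true_idx, x, arm]
    # Posterior update: only the played (context, arm) pair is informative.
    lik = np.where(reward, candidates[:, x, arm], 1.0 - candidates[:, x, arm])
    posterior *= lik
    posterior /= posterior.sum()

print(f"average per-period regret: {total_regret / T:.4f}")
```

Note that the learner's randomization is over decision rules mapping contexts to arms, as in Assumption 2, even though the posterior update only involves the played arm in the realized context.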

The next corollary combines the information ratio bound above with the earlier bound of Proposition 1. The bound depends on the number of arms, the dimension of the parameter space, and the effective time horizon. No further structural assumptions (e.g., linearity) are needed. An unfortunate feature of the result is that it applies only to parameter vectors that are quantized at scale ϵ\epsilon. The logarithmic dependence on ϵ\epsilon is omitted in the O~()\tilde{O}(\cdot) notation, but displayed in the proof. When outcome distributions are smooth in θt\theta_{t}, we believe this could be removed with careful analysis.

Corollary 3.

Under Assumption 2, if θt{1,1+ϵ,,1ϵ,1}p\theta_{t}\in\{-1,-1+\epsilon,\ldots,1-\epsilon,1\}^{p} is a discretized pp-dimensional vector, and the optimal policy process (At)t(A^{*}_{t})_{t\in\mathbb{N}} is stationary, then

Δ¯(πTS)O~(σpkτeff).\bar{\Delta}_{\infty}(\pi^{\rm TS})\leq\tilde{O}\left(\sigma\sqrt{\frac{p\cdot k}{\tau_{\textup{eff}}}}\right).

7 Extension 1: Satisficing in the Face of Rapidly Evolving Latent States

In nonstationary learning problems with short effective horizon, the decision-maker is continually uncertain about the latent states and latent optimal action. Algorithms like Thompson sampling explore aggressively in an attempt to resolve this uncertainty. But this effort may be futile, and whatever information they do acquire may have low value since it cannot be exploited for long.

Faced with rapidly evolving environments, smart algorithms should satisfice. As noted by Herbert Simon (Simon [1979], p. 498) when receiving his Nobel Prize, ‘‘decision makers can satisfice either by finding optimum solutions for a simplified world, or by finding satisfactory solutions for a more realistic world.'' We focus on the latter case, considering decision-makers that realistically model a rapidly evolving environment but seek a more stable decision-rule with satisfactory performance.

We introduce a broad generalization of TS that performs probability matching with respect to a latent satisficing action sequence instead of the latent optimal action sequence. Our theory gracefully generalizes to this case, and confirms that the decision-maker can attain rewards competitive with any satisficing action sequence whose entropy rate is low. Whereas earlier results were meaningful only if the optimal action sequence had low entropy rate — effectively an assumption on the environment — this result allows the decision-maker to compete with any low entropy action sequence regardless of the extent of nonstationarity in the true environment. We provide some implementable examples of satisficing action sequences in Section 7.2.

One can view this section as a broad generalization of the ideas in Russo and Van Roy [2022], who explore the connection between satisficing, Thompson sampling, and information theory in stationary environments. It also bears a close conceptual relationship to the work of Liu et al. [2023], which is discussed in detail in Appendix B. In the adversarial bandit literature, the most similar work is that of Auer et al. [2002]. They show a properly tuned exponential weighting algorithm attains low regret relative to any sequence of actions which switches infrequently.

7.1 Satisficing regret: competing with a different benchmark

In our formulation, the latent optimal action process A=(At)tA^{*}=(A^{*}_{t})_{t\in\mathbb{N}} serves as both a benchmark and a learning target. When calculating regret, the rewards a decision-maker accrues are compared against the rewards AA^{*} would have accrued, using AA^{*} as a benchmark. When defining TS, we implicitly direct it to resolve uncertainty about the latent optimal action AA^{*} through probability matching, using AA^{*} as a learning target. When the environment evolves rapidly and unpredictably, it is impossible to learn about changes in the latent state quickly enough to exploit them, as AA^{*} does. In such cases, the latent optimal action is not a reasonable benchmark or learning target.

We introduce a satisficing action sequence A=(At)tA^{\dagger}=\big{(}A_{t}^{\dagger}\big{)}_{t\in\mathbb{N}} that serves as an alternative benchmark and learning target. We want the optimal action sequence A=(At)tA^{*}=(A^{*}_{t})_{t\in\mathbb{N}} to be a feasible choice of satisficing action sequence, so we must allow AA^{\dagger} to be a function of θ\theta. The next definition also allows for randomization in the choice. For now, this definition is very abstract, but we give specific examples in Section 7.2.

Definition 4.

A collection of random variables A=(At)tA^{\dagger}=\big{(}A_{t}^{\dagger}\big{)}_{t\in\mathbb{N}} taking values in 𝒜\mathcal{A}^{\infty} is a satisficing action sequence if there exists a function f()f(\cdot) and an exogenous random variable ξ\xi (independent of θ\theta, WW and W~\tilde{W}) such that A=f(θ,ξ)A^{\dagger}=f(\theta,\xi).

Replacing AA^{*} with AA^{\dagger} as a learning target, we modify Thompson sampling so its exploration aims to resolve uncertainty about AA^{\dagger}, which could be much simpler. Satisficing Thompson sampling (STS) [Russo and Van Roy, 2022] with respect to this satisficing action sequence is denoted by π\pi^{\dagger}. It instantiates the probability matching with respect to AA^{\dagger} so that, under π\pi^{\dagger},

(At=at1)=(At=at1),\mathbb{P}(A_{t}=a\mid\mathcal{F}_{t-1})=\mathbb{P}(A_{t}^{\dagger}=a\mid\mathcal{F}_{t-1}),

for all tt\in\mathbb{N} and a𝒜a\in\mathcal{A}.

Replacing AA^{*} with AA^{\dagger} as a benchmark, we define the regret rate of an action sequence A=(At)tA=(A_{t})_{t\in\mathbb{N}} (i.e., some sequence of non-anticipating random variables) with respect to a satisficing action sequence AA^{\dagger} to be

Δ¯T(A;A):=1T𝔼[t=1TRt,AtRt,At],Δ¯(A;A):=lim supTΔ¯T(A;A).\bar{\Delta}_{T}(A;A^{\dagger}):=\frac{1}{T}\mathbb{E}\left[\sum_{t=1}^{T}R_{t,A_{t}^{\dagger}}-R_{t,A_{t}}\right],\qquad\bar{\Delta}_{\infty}(A;A^{\dagger}):=\limsup_{T\rightarrow\infty}\bar{\Delta}_{T}(A;A^{\dagger}).

Each policy π\pi induces an action sequence AA, so we can overload notation to write Δ¯T(π;A)\bar{\Delta}_{T}(\pi;A^{\dagger}).

Theorem 1 is easily generalized to bound satisficing regret. Define the information ratio with respect to AA^{\dagger}:

Γ(π;A):=supt(𝔼π[Rt,AtRt,At])2I(At;(At,Ot,At)t1).\Gamma(\pi;A^{\dagger}):=\sup_{t\in\mathbb{N}}\frac{\left(\mathbb{E}_{\pi}\left[R_{t,A_{t}^{\dagger}}-R_{t,A_{t}}\right]\right)^{2}}{I\left(A_{t}^{\dagger};(A_{t},O_{t,A_{t}})\mid\mathcal{F}_{t-1}\right)}. (9)

It represents the (squared) cost that the algorithm π\pi needs to pay in order to acquire one unit of information about the satisficing action sequence AA^{\dagger}.

Replicating the proof of Theorem 1 with respect to AA^{\dagger} yields a bound on regret relative to this alternative benchmark in terms of the information ratio with respect to this alternative benchmark. Intuitively, less information is needed to identify simpler satisficing action sequences, reducing the I¯(A;θ)\bar{I}_{\infty}(A^{\dagger};\theta) term in this bound.

Theorem 2.

For any algorithm π\pi and satisficing action sequence AA^{\dagger},

Δ¯(π;A)Γ(π;A)×I¯(A;θ).\bar{\Delta}_{\infty}(\pi;A^{\dagger})\leq\sqrt{\Gamma(\pi;A^{\dagger})\times\bar{I}_{\infty}(A^{\dagger};\theta)}.

If the satisficing action sequence satisfies A=f(θ1:T,ξ)A^{\dagger}=f(\theta_{1:T},\xi) for some function f()f(\cdot), then

Δ¯T(π;A)Γ(π;A)×I¯T(A1:T;θ1:T).\bar{\Delta}_{T}(\pi;A^{\dagger})\leq\sqrt{\Gamma(\pi;A^{\dagger})\times\bar{I}_{T}(A^{\dagger}_{1:T};\theta_{1:T})}.

Interpret the mutual information I¯T(A;θ)\bar{I}_{T}(A^{\dagger};\theta) as the rate at which bits about θ=(θt)t\theta=\big{(}\theta_{t}\big{)}_{t\in\mathbb{N}} must be communicated in order to implement the changing decision rule AA^{\dagger}. Interpret Γ(π;A)\Gamma(\pi;A^{\dagger}) as the price (in terms of squared regret) that policy π\pi pays per bit of information acquired about AA^{\dagger}. As made clear in the next remark, satisficing TS controls the price per bit of information Γ(π;A)\Gamma(\pi;A^{\dagger}) regardless of the choice of satisficing action sequence AA^{\dagger}. Theorem 2 therefore offers much stronger guarantees than Theorem 1 whenever I(A;θ)H(A)I(A^{\dagger};\theta)\lll H(A^{*}), i.e., whenever the satisficing action sequence can be implemented while acquiring much less information about the latent states.

Remark 6.

All the upper bounds on Γ(πTS;A)\Gamma(\pi^{\text{TS}};A^{*}) listed in Section 6.1 also apply for Γ(π;A)\Gamma(\pi^{\dagger};A^{\dagger}). Namely, for any satisficing action sequence AA^{\dagger}, the corresponding STS π\pi^{\dagger} satisfies (a) Γ(π;A)2σ2|𝒜|\Gamma(\pi^{\dagger};A^{\dagger})\leq 2\sigma^{2}|\mathcal{A}| for bandit tasks with finite action set, (b) Γ(π;A)2σ2\Gamma(\pi^{\dagger};A^{\dagger})\leq 2\sigma^{2} for problems with full-information feedback, (c) Γ(π;A)2σ2d\Gamma(\pi^{\dagger};A^{\dagger})\leq 2\sigma^{2}d for linear bandits, (d) Γ(π;A)2σ2d/k2\Gamma(\pi^{\dagger};A^{\dagger})\leq 2\sigma^{2}d/k^{2} for combinatorial bandits, and (e) Γ(π;A)2σ2k\Gamma(\pi^{\dagger};A^{\dagger})\leq 2\sigma^{2}k for contextual bandits.

Therefore it satisfies similar bounds on satisficing regret. For linear bandits,

Δ¯(π;A)σ2d×I¯(A;θ).\bar{\Delta}_{\infty}(\pi^{\dagger};A^{\dagger})\leq\sigma\sqrt{2d\times\bar{I}_{\infty}(A^{\dagger};\theta)}.

7.2 Examples of satisficing action sequences

The concept introduced above is quite abstract. Here we introduce three examples. Each aims to construct a satisficing action sequence that is near optimal (i.e., low Δ¯(A;A))\bar{\Delta}_{\infty}(A^{\dagger};A^{*})) while requiring limited information about θ\theta (i.e., low I¯(A;θ)\bar{I}_{\infty}(A^{\dagger};\theta)). As a warmup, we visualize in Figure 4 these examples of satisficing actions; they switch identity infrequently, ensuring they have low entropy, but are nevertheless very nearly optimal.

Figure 4: Three examples of satisficing actions in a two-armed bandit environment described in Example 12. The top figure shows the performance of the two arms, as visualized in Figure 1, where shaded vs. unshaded regions indicate which action is optimal at each time. For each choice of alternative benchmark (suggested in Examples 10, 11 and 12), we show two figures stacked vertically. The upper one plots the performance of the alternative action sequence, which is more stable, where shaded vs. unshaded regions indicate which action the benchmark chooses. The lower one plots the suboptimality of that alternative benchmark.

Our first example is the optimal action sequence among those that switch identity no more than mm times within TT periods.

Example 10 (Restricting number of switches).

Define the number of switches in a sequence a1:T𝒜Ta_{1:T}\in\mathcal{A}^{T} by 𝒮T(a1:T):=1+t=2T𝕀{atat1}\mathcal{S}_{T}(a_{1:T}):=1+\sum_{t=2}^{T}\mathbb{I}\{a_{t}\neq a_{t-1}\}, a measure used in (6). Let

(A1,,AT)argmaxa1:T𝒜T:𝒮T(a1:T)m1Tt=1T𝔼[Rt,atθt].(A^{\dagger}_{1},\ldots,A^{\dagger}_{T})\in\operatorname*{\mathrm{argmax}}_{a_{1:T}\in\mathcal{A}^{T}:\mathcal{S}_{T}(a_{1:T})\leq m}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[R_{t,a_{t}}\mid\theta_{t}\right].

be the best action sequence with at most mm switches. The satisficing regret Δ¯T(π;A)\bar{\Delta}_{T}(\pi^{\dagger};A^{\dagger}) measures the gap between the rewards the algorithm's actions generate and those of the true best action sequence with few switches. By Lemma 1, one can bound the accumulated bits of information about this satisficing action sequence as:

I¯T(A1:T;θ)H¯T(A1:T)mT(1+log(1+Tm)+log(|𝒜|)).\bar{I}_{T}(A^{\dagger}_{1:T};\theta)\leq\bar{H}_{T}(A^{\dagger}_{1:T})\leq\frac{m}{T}\cdot\left(1+\log\left(1+\frac{T}{m}\right)+\log(|\mathcal{A}|)\right).
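For instance, with |𝒜| = 10 actions, T = 10⁴ periods, and m = 100 switches, this bound evaluates to (100/10⁴)·(1 + log 101 + log 10) ≈ 0.08 nats per period, far below the log(10) ≈ 2.3 nats that an unconstrained optimal action could in principle carry in each period.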

Satisficing Thompson sampling can be implemented by drawing, at time tt, a sequence

\tilde{A}_{1:T}\in\operatorname*{\mathrm{argmax}}_{a_{1:T}\in\mathcal{A}^{T}:\mathcal{S}_{T}(a_{1:T})\leq m}\frac{1}{T}\sum_{\ell=1}^{T}\mathbb{E}\left[R_{\ell,a_{\ell}}\mid\theta_{\ell}=\tilde{\theta}_{\ell}\right]\quad\text{where}\quad(\tilde{\theta}_{1},\ldots,\tilde{\theta}_{T})\sim\mathbb{P}(\theta_{1:T}\in\cdot\mid\mathcal{F}_{t-1}).

and then picking At=A~tA_{t}=\tilde{A}_{t}. This samples an arm according to the posterior probability it is the arm chosen in the true best action sequence with mm switches. A simple dynamic programming algorithm can be used to solve the optimization problem defining A~1:T\tilde{A}_{1:T}.

The next example also tries to create a more reasonable benchmark that switches less frequently. Here though, we define a satisficing action sequence that switches to a new arm only once the current arm's suboptimality exceeds some threshold. This leads to a version of satisficing Thompson sampling that has a simpler implementation.

Example 11 (Ignoring small suboptimality).

Define a sequence of actions as follows:

At={At1 if 𝔼[Rt,At1θt]𝔼[maxa𝒜Rt,aθt]D,argmaxa𝒜𝔼[Rt,aθt]otherwise.\displaystyle A^{\dagger}_{t}=\begin{cases}A^{\dagger}_{t-1}&\text{ if }\mathbb{E}[R_{t,A_{t-1}^{\dagger}}\mid\theta_{t}]\geq\mathbb{E}[\max_{a\in\mathcal{A}}R_{t,a}\mid\theta_{t}]-D,\\ \operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\mathbb{E}[R_{t,a}\mid\theta_{t}]&\text{otherwise}.\end{cases}

This sequence of arms switches only when an arm’s suboptimality exceeds a distortion level D>0D>0. The entropy of AA^{\dagger} will depend on the problem, but we will later bound it in problems with ‘‘slow variation’’ as in Besbes et al. [2014]. Algorithm 2 provides an implementation of satisficing Thompson sampling with respect to AA^{\dagger}.

Input: Distortion level D>0D>0.
Define μθt(a)=𝔼[Rt,aθt]\mu_{\theta_{t}}(a)=\mathbb{E}[R_{t,a}\mid\theta_{t}];
for t=1,2,t=1,2,\ldots do
       Sample (θ~1,,θ~t)((θ1,,θt)t1)(\tilde{\theta}_{1},\ldots,\tilde{\theta}_{t})\sim\mathbb{P}\left((\theta_{1},\ldots,\theta_{t})\in\cdot\mid\mathcal{F}_{t-1}\right);
      
A1=argmaxa𝒜μθ~1(a)A^{\dagger}_{1}=\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\mu_{\tilde{\theta}_{1}}(a);
      
for s=2,,ts=2,\ldots,t do
             if μθ~s(As1)maxa𝒜μθ~s(a)D\mu_{\tilde{\theta}_{s}}(A^{\dagger}_{s-1})\geq\max_{a\in\mathcal{A}}\mu_{\tilde{\theta}_{s}}(a)-D then
                   AsAs1A_{s}^{\dagger}\leftarrow A^{\dagger}_{s-1};
                  
            else
                   Asargmaxa𝒜μθ~s(a)A^{\dagger}_{s}\leftarrow\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\mu_{\tilde{\theta}_{s}}(a);
                  
             end if
            
       end for
      Play At=AtA_{t}=A_{t}^{\dagger}, observe OtO_{t} and update history;
      
end for
Algorithm 2 Satisficing Thompson sampling in Example 11
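For concreteness, the following is a minimal Python sketch of Algorithm 2 under an assumed model in which each arm's latent mean follows an independent Gaussian random walk with Gaussian reward noise; the model, all constants, and the helper names (ffbs_sample, satisficing_path) are illustrative assumptions rather than part of our formulation. For this conjugate model, exact posterior path sampling can be done by forward filtering backward sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model for this sketch: each arm's latent mean follows an independent
# Gaussian random walk, theta_{t+1,a} = theta_{t,a} + N(0, q2), with Gaussian
# reward noise of variance r2 and a N(0, 1) prior on theta_{1,a}.
K, T = 3, 200          # number of arms, horizon
q2, r2 = 0.05, 1.0     # random-walk and reward-noise variances
D = 0.2                # distortion level from Example 11

def ffbs_sample(obs, t):
    """Draw one posterior path theta_{1:t} for a single arm given obs, a list of
    (time, reward) pairs, via forward filtering backward sampling."""
    obs_dict = dict(obs)
    m, P = np.zeros(t), np.zeros(t)
    m_prev, P_prev = 0.0, 1.0
    for s in range(t):
        m_pred, P_pred = m_prev, P_prev + (q2 if s > 0 else 0.0)
        if s in obs_dict:                              # Kalman update at observed times
            gain = P_pred / (P_pred + r2)
            m[s] = m_pred + gain * (obs_dict[s] - m_pred)
            P[s] = (1.0 - gain) * P_pred
        else:
            m[s], P[s] = m_pred, P_pred
        m_prev, P_prev = m[s], P[s]
    path = np.zeros(t)
    path[t - 1] = rng.normal(m[t - 1], np.sqrt(P[t - 1]))
    for s in range(t - 2, -1, -1):                     # backward sampling
        J = P[s] / (P[s] + q2)
        mean = m[s] + J * (path[s + 1] - m[s])
        var = P[s] - J * J * (P[s] + q2)
        path[s] = rng.normal(mean, np.sqrt(max(var, 1e-12)))
    return path

def satisficing_path(theta_paths, D):
    """The distortion-threshold recursion of Example 11 applied to a sampled path."""
    t = theta_paths.shape[1]
    a_dag = np.zeros(t, dtype=int)
    a_dag[0] = theta_paths[:, 0].argmax()
    for s in range(1, t):
        if theta_paths[a_dag[s - 1], s] >= theta_paths[:, s].max() - D:
            a_dag[s] = a_dag[s - 1]
        else:
            a_dag[s] = theta_paths[:, s].argmax()
    return a_dag

# Simulate a true environment from the same model and run the algorithm.
theta0 = rng.normal(0.0, 1.0, K)
theta_true = np.vstack([theta0,
                        theta0 + np.cumsum(rng.normal(0.0, np.sqrt(q2), (T - 1, K)), axis=0)])
history = [[] for _ in range(K)]                       # (time, reward) pairs per arm
total_reward = 0.0
for t in range(1, T + 1):
    paths = np.vstack([ffbs_sample(history[a], t) for a in range(K)])
    action = satisficing_path(paths, D)[t - 1]         # play A_t = A_dagger_t
    reward = theta_true[t - 1, action] + rng.normal(0.0, np.sqrt(r2))
    history[action].append((t - 1, reward))
    total_reward += reward

print(f"average reward over {T} periods: {total_reward / T:.3f}")
```

At each period the sketch redraws an entire posterior path and replays the distortion-threshold recursion from scratch, exactly as Algorithm 2 prescribes; more efficient incremental implementations are possible but beside the point here.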

The next example induces satisficing in a very different way. Instead of altering how often the benchmark optimal action can change, it restricts the granularity of information about θ\theta on which it can depend.

Example 12 (Optimizing with respect to obscured information).

Let

At=argmaxa𝒜𝔼[Rt,aY]whereY=h(θ,ξ),A^{\dagger}_{t}=\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\mathbb{E}[R_{t,a}\mid Y]\quad\text{where}\quad Y=h(\theta,\xi),

for some function h()h(\cdot). Since AYθA^{\dagger}-Y-\theta forms a Markov chain, choosing a YY that reveals limited information about θ\theta restricts the magnitude of the mutual information I¯(A;θ)\bar{I}_{\infty}(A^{\dagger};\theta).

To implement satisficing TS, one needs to draw from the posterior distribution of AtA_{t}^{\dagger}. Since Atargmaxa𝒜μt,a(Y)A_{t}^{\dagger}\in\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\mu_{t,a}(Y) where μt,a(Y):=𝔼[Rt,aY]\mu_{t,a}(Y):=\mathbb{E}[R_{t,a}\mid Y], we can sample from the posterior just as we do in TS: one draws a sample μ~t(μt(Y)t1)\tilde{\mu}_{t}\sim\mathbb{P}(\mu_{t}(Y)\in\cdot\mid\mathcal{F}_{t-1}) from the posterior distribution of the vector μt(Y):=(μt,a(Y))a𝒜\mu_{t}(Y):=(\mu_{t,a}(Y))_{a\in\mathcal{A}} and then picks Atargmaxa𝒜μ~t,aA_{t}\in\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\tilde{\mu}_{t,a}.

There are two natural strategies for providing obscured information YY about the latent states θ\theta. First, one can provide a noise-corrupted view, like Y=θ+ξY=\theta+\xi where ξ\xi is mean-zero noise. For stationary problems, this is explored in Russo and Van Roy [2022]. For the kk-armed nonstationary bandits in Example 1, providing fictitious reward observations Y=(θt,a+ξt,a)t,a𝒜Y=(\theta_{t,a}+\xi_{t,a})_{t\in\mathbb{N},a\in\mathcal{A}}, where ξt,aN(0,σ2)\xi_{t,a}\sim N(0,\sigma^{2}), is similar to the proposed algorithm in Liu et al. [2023]. The second approach is to reveal some simple summary statistics of θ\theta.

We describe a simple illustration of the potential benefits of the second approach. Consider a problem with bandit feedback (i.e. Ot,a=Rt,aO_{t,a}=R_{t,a}) and Gaussian reward observations; that is,

Rt,a=θt,a+Wt,a,θt,a=μafixed+μt,aGP,R_{t,a}=\theta_{t,a}+W_{t,a},\quad\theta_{t,a}=\mu_{a}^{\text{fixed}}+\mu_{t,a}^{\text{GP}},

where Wt,a𝒩(0,σ2)W_{t,a}\sim\mathcal{N}(0,\sigma^{2}) is i.i.d Gaussian noise, μafixed𝒩(0,v2)\mu_{a}^{\text{fixed}}\sim\mathcal{N}(0,v^{2}), and μaGP=(μt,aGP)t\mu_{a}^{\text{GP}}=(\mu_{t,a}^{\text{GP}})_{t\in\mathbb{N}} follows a Gaussian process such that Cov(μt,aGP,μs,aGP)=exp(12(tsτ)2)\textup{Cov}(\mu_{t,a}^{\text{GP}},\mu_{s,a}^{\text{GP}})=\exp\left(-\frac{1}{2}\left(\frac{t-s}{\tau}\right)^{2}\right).

We construct a variant of satisficing TS that should vastly outperform regular TS when τ0\tau\approx 0. Choose YY as

Y=(μafixed)a𝒜.Y=\big{(}\mu_{a}^{\text{fixed}}\big{)}_{a\in\mathcal{A}}.

Since 𝔼[Rt,a|Y]=μafixed\mathbb{E}[R_{t,a}|Y]=\mu_{a}^{\text{fixed}}, we have A1=A2==argmaxa𝒜μafixedA_{1}^{\dagger}=A_{2}^{\dagger}=\ldots=\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\mu_{a}^{\text{fixed}}, and satisficing TS works as follows.

Satisficing TS:

Sample μ~t,a𝒩(𝔼[μafixedt1],Var(μafixedt1))\tilde{\mu}_{t,a}\sim\mathcal{N}\left(\mathbb{E}[\mu_{a}^{\text{fixed}}\mid\mathcal{F}_{t-1}]\,,\,{\rm Var}\left(\mu_{a}^{\text{fixed}}\mid\mathcal{F}_{t-1}\right)\right) and pick Atargmaxa𝒜μ~t,aA_{t}\in\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\tilde{\mu}_{t,a}.

Regular TS:

Sample μ~t,a𝒩(𝔼[μafixed+μt,aGPt1],Var(μafixed+μt,aGPt1))\tilde{\mu}_{t,a}\sim\mathcal{N}\left(\mathbb{E}[\mu_{a}^{\text{fixed}}+\mu_{t,a}^{\text{GP}}\mid\mathcal{F}_{t-1}]\,,\,{\rm Var}\left(\mu_{a}^{\text{fixed}}+\mu_{t,a}^{\text{GP}}\mid\mathcal{F}_{t-1}\right)\right) and pick Atargmaxa𝒜μ~t,aA_{t}\in\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\tilde{\mu}_{t,a}.
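The two sampling rules above differ only in which posterior they draw from. The following minimal sketch (Python; the numerical constants and the single-arm helper posterior_samples are illustrative assumptions) computes both posteriors for one arm from that arm's observed rewards, using the fact that the fixed effect, the GP component, and the rewards are jointly Gaussian under the model of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
v2, sigma2, tau = 1.0, 1.0, 5.0   # prior variance of mu_fixed, noise variance, GP timescale

def gp_kernel(t, s, tau):
    """Squared-exponential covariance of the GP component (unit marginal variance)."""
    return np.exp(-0.5 * ((t - s) / tau) ** 2)

def posterior_samples(times, rewards, t_now):
    """For one arm, draw a posterior sample of mu_fixed (the satisficing TS target)
    and of mu_fixed + mu_GP(t_now) (the regular TS target), given the arm's
    rewards observed at the given times.  Everything is jointly Gaussian."""
    times = np.asarray(times, float)
    rewards = np.asarray(rewards, float)
    n = len(times)
    if n == 0:
        return rng.normal(0.0, np.sqrt(v2)), rng.normal(0.0, np.sqrt(v2 + 1.0))
    # Covariance among this arm's observed rewards.
    S = v2 + gp_kernel(times[:, None], times[None, :], tau) + sigma2 * np.eye(n)
    S_inv_r = np.linalg.solve(S, rewards)
    # Satisficing TS target: posterior of mu_fixed; Cov(mu_fixed, R_s) = v2 for every s.
    c_fix = np.full(n, v2)
    m_fix = c_fix @ S_inv_r
    v_fix = v2 - c_fix @ np.linalg.solve(S, c_fix)
    # Regular TS target: posterior of mu_fixed + mu_GP(t_now); Cov with R_s = v2 + k(t_now, s).
    c_now = v2 + gp_kernel(t_now, times, tau)
    m_now = c_now @ S_inv_r
    v_now = (v2 + 1.0) - c_now @ np.linalg.solve(S, c_now)
    return (rng.normal(m_fix, np.sqrt(max(v_fix, 0.0))),
            rng.normal(m_now, np.sqrt(max(v_now, 0.0))))

# Usage: with per-arm observation records, either algorithm picks the argmax
# of its respective samples across arms.
obs = {0: ([1.0, 4.0, 9.0], [0.6, 0.9, 0.4]), 1: ([2.0, 6.0], [0.2, 1.1])}
t_now = 10.0
sat_draws, reg_draws = zip(*(posterior_samples(*obs[a], t_now) for a in obs))
print("satisficing TS plays arm", int(np.argmax(sat_draws)),
      "| regular TS plays arm", int(np.argmax(reg_draws)))
```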

To understand the difference between these algorithms, consider the limiting regimes in which τ\tau\to\infty and τ0\tau\to 0. In the first case, environment dynamics are so slow that the problem looks like a stationary kk-armed bandit.

When τ0\tau\to 0, the latent states evolve so rapidly and erratically that the problem is again equivalent to a stationary problem. Precisely, as τ0\tau\to 0, Cov[μt,aGP,μs,aGP]𝕀{t=s}\textup{Cov}[\mu_{t,a}^{\text{GP}},\mu_{s,a}^{\text{GP}}]\to\mathbb{I}\{t=s\} for any t,st,s\in\mathbb{N}; the process (μt,aGP)t(\mu_{t,a}^{\text{GP}})_{t\in\mathbb{N}} becomes a white noise process, i.e., it behaves like a sequence of i.i.d. random variables distributed as 𝒩(0,12)\mathcal{N}(0,1^{2}). The overall problem looks like a stationary kk-armed Gaussian bandit with prior 𝒩(0,v2)\mathcal{N}(0,v^{2}) and i.i.d. noise with law 𝒩(0,σ2+12)\mathcal{N}(0,\sigma^{2}+1^{2}).

Satisficing TS is similar in the τ0\tau\to 0 and τ\tau\to\infty limits; in either case it attempts to identify the arm with the best performance throughout the horizon. But regular TS behaves very differently in these two limits, making its behavior seemingly incoherent. As τ\tau\to\infty, regular TS coincides with satisficing TS. But as τ0\tau\to 0, the samples maximized by regular TS always have high variance, causing the algorithm to explore in a futile attempt to resolve uncertainty about the white noise process (μt,aGP)t(\mu_{t,a}^{\text{GP}})_{t\in\mathbb{N}}.

8 Extension 2: Generalizing the Entropy Rate with Rate Distortion Theory

Our main result in Theorem 1 bounds regret in a large class of sequential learning problems in terms of the entropy rate of the optimal action process — linking the theory of exploration with the theory of optimal lossless compression. In this section, we replace the dependence of our bounds on the entropy rate with dependence on a rate distortion function, yielding our most general and strongest bounds. The results of this section are quite abstract, but they reveal an intriguing connection between interactive decision-making and compression.

As an application, we bound the rate distortion function for environments in which the reward process has low total variation rate. This recovers regret bounds that are similar to those in the influential work of Besbes et al. [2015], and greatly strengthens what Theorem 1 guarantees under this assumption.

While this section is abstract, the proofs follow easily from attempting to instantiate Theorem 2 with the best possible choice of satisficing action sequence. Like Theorem 2, the results generalize Russo and Van Roy [2022] to nonstationary environments.

8.1 Rate distortion and lossy compression

To derive a rate distortion function, we momentarily ignore the sequential learning problem which is our focus. Consider a simpler communication problem. Imagine that some observer knows the latent states of the environment; their challenge is to encode this knowledge as succinctly as possible and transmit the information to a downstream decision-maker.

As a warmup, Figure 5(a) depicts a more typical problem arising in the theory of optimal communication, in which the goal is to succinctly encode latent states themselves. Assuming the law of the latent states (θ1,θ2,)(\theta_{1},\theta_{2},\ldots) is known, one wants to design a procedure that efficiently communicates information about the realization across a noisy channel. Having observed θ1:T=(θ1,θ2,,θT)\theta_{1:T}=(\theta_{1},\theta_{2},\ldots,\theta_{T}), the observer transmits some signal fT(θ1:T)f_{T}(\theta_{1:T}) to a decoder, which recovers an estimate θ^1:T\hat{\theta}_{1:T}. The encoder is said to communicate at bit-rate nn if fT:ΘT{0,1}nTf_{T}:\Theta^{T}\to\{0,1\}^{n\cdot T}. While θ1:T\theta_{1:T} could be very complex, fT(θ1:T)f_{T}(\theta_{1:T}) can be encoded in a binary string of length nTn\cdot T. Shannon proved that lossless communication is possible as TT\to\infty while transmitting at any bit-rate exceeding the entropy rate444We use the natural logarithm when defining entropy in this paper. The equivalence between the minimal bit-rate of a binary encoder and the entropy rate requires that log base 2 is used instead. of θ\theta. If one is willing to tolerate imperfect reconstruction of the latent states, then the bit-rate can be further reduced. Rate distortion theory quantifies the bit-rate necessary to achieve a given recovery loss, e.g. 𝔼[1Tt=1Tθtθ^t2]D\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\|\theta_{t}-\hat{\theta}_{t}\|^{2}\right]\leq D in Figure 5(a).
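As a point of reference, if the source were a symmetric two-state Markov chain that switches state with probability p in each period, its entropy rate would be the binary entropy h(p) = −p log p − (1−p) log(1−p) nats, so lossless transmission requires a bit-rate of only h(p)/log 2 bits per period, and lossy reconstruction requires even less.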

Figure 5(b) applies similar reasoning to a decision-making problem. Having observed θ1:T=(θ1,θ2,,θT)\theta_{1:T}=(\theta_{1},\theta_{2},\ldots,\theta_{T}), the observer transmits some signal fT(θ1:T)f_{T}(\theta_{1:T}) to a decision-maker, who processes that information and implements the decision A1:TA^{\dagger}_{1:T}. It is possible for the decision-maker to losslessly recover the optimal action process if and only if the encoder communicates at bit-rate exceeding the entropy rate of the optimal action process H¯(A)\bar{H}_{\infty}(A^{*}). Rate distortion theory tells us that the decision-maker can achieve regret-rate lower than some distortion level DD when the bit-rate exceeds ¯(D)\bar{\mathcal{R}}_{\infty}(D) defined as

¯(D):=infAI¯(A;θ)subject toΔ¯(A;A)D.\bar{\mathcal{R}}_{\infty}(D):=\inf_{A^{\dagger}}\bar{I}_{\infty}(A^{\dagger};\theta)\quad\text{subject to}\quad\bar{\Delta}_{\infty}(A^{\dagger};A^{*})\leq D. (10)

The function ¯:++\bar{\mathcal{R}}_{\infty}:\mathbb{R}_{+}\to\mathbb{R}_{+} is called the rate-distortion function. It minimizes the information about the latent states required to implement a satisficing action sequence AA^{\dagger} (see Definition 4), over all choices with regret rate less than DD. For any fixed horizon TT, one can analogously define

¯T(D):=infA1:TI¯T(A1:T;θ)subject toΔ¯T(A1:T;A1:T)D.\bar{\mathcal{R}}_{T}(D):=\inf_{A^{\dagger}_{1:T}}\bar{I}_{T}(A^{\dagger}_{1:T};\theta)\quad\text{subject to}\quad\bar{\Delta}_{T}(A^{\dagger}_{1:T};A^{*}_{1:T})\leq D. (11)
Figure 5: Lossy compression of the environment states. (a) Optimal lossy compression for state estimation: an encoder observes \theta_{1:T} and transmits a signal f_{T}(\theta_{1:T}); a decoder outputs the reconstruction \hat{\theta}_{1:T}, incurring loss \frac{1}{T}\sum_{t=1}^{T}\|\theta_{t}-\hat{\theta}_{t}\|^{2}. (b) Optimal lossy compression for decision-making: an encoder observes \theta_{1:T} and transmits a signal f_{T}(\theta_{1:T}); a decision-rule outputs A^{\dagger}_{1:T}, incurring regret \bar{\Delta}_{T}(A^{\dagger}_{1:T};A^{*}_{1:T}).
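As an aside, the rate-distortion curve of a single-period analogue of (10)-(11) can be traced numerically with the classical Blahut-Arimoto iteration [Blahut, 1972]. The sketch below (Python; the three-state, two-action instance is a purely hypothetical example) treats the latent state as the source and the suboptimality gap of a recommended action as the distortion. It is only an illustration of the lossy-compression viewpoint, not an algorithm used in our analysis.

```python
import numpy as np

def blahut_arimoto(p_x, dist, beta, n_iter=500):
    """Classical Blahut-Arimoto iteration: for source distribution p_x and
    distortion matrix dist[x, a], return one point (D, R) on the
    rate-distortion curve at Lagrange multiplier beta (rate in nats)."""
    n_x, n_a = dist.shape
    q_a = np.full(n_a, 1.0 / n_a)                 # marginal over reproductions
    for _ in range(n_iter):
        w = q_a[None, :] * np.exp(-beta * dist)   # unnormalized q(a | x)
        q_ax = w / w.sum(axis=1, keepdims=True)
        q_a = p_x @ q_ax                          # update the marginal
    rate = np.sum(p_x[:, None] * q_ax * np.log(q_ax / q_a[None, :]))
    distortion = np.sum(p_x[:, None] * q_ax * dist)
    return distortion, rate

# Toy single-period problem: 3 latent states, 2 actions, distortion = regret
# of recommending action a when the state is theta (a hypothetical example).
means = np.array([[0.9, 0.1],     # mean reward of each action per state
                  [0.2, 0.8],
                  [0.55, 0.5]])
p_x = np.array([0.4, 0.4, 0.2])   # distribution of the latent state
dist = means.max(axis=1, keepdims=True) - means   # suboptimality gaps

for beta in [1.0, 5.0, 20.0]:
    D, R = blahut_arimoto(p_x, dist, beta)
    print(f"beta={beta:5.1f}: distortion D={D:.3f}, rate R={R:.3f} nats")
```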

8.2 General regret bounds in terms of the rate distortion function

As discussed in Remark 6, many of the most widely studied online learning problems have the feature that there is a uniform upper bound on the information ratio. While there is a price (in terms of squared regret suffered) for acquiring information, the price-per-bit does not explode regardless of the information one seeks.

For such problems, we can bound the attainable regret-rate of a DM who learns through interaction by solving the kind of lossy compression problem described above. Namely, we bound regret in terms of the information ratio and the problem's rate distortion function. At this point, the proof is a trivial consequence of Theorem 2. But a priori, the possibility of such a result is quite unclear.

Theorem 3.

Suppose that ΓU:=supAinfπΓ(π;A)<\Gamma_{U}:=\sup_{A^{\dagger}}\inf_{\pi}\Gamma(\pi;A^{\dagger})<\infty. Then

infπΔ¯(π)infD0{D+ΓU¯(D)}.\inf_{\pi}\bar{\Delta}_{\infty}(\pi)\leq\inf_{D\geq 0}\left\{D+\sqrt{\Gamma_{U}\cdot\bar{\mathcal{R}}_{\infty}(D)}\right\}.

Similarly, for any finite T<T<\infty,

infπΔ¯T(π)infD0{D+ΓU¯T(D)}.\inf_{\pi}\bar{\Delta}_{T}(\pi)\leq\inf_{D\geq 0}\left\{D+\sqrt{\Gamma_{U}\cdot\bar{\mathcal{R}}_{T}(D)}\right\}.
Proof.

Theorem 2, combined with the regret decomposition \bar{\Delta}_{\infty}(\pi)\leq\bar{\Delta}_{\infty}(A^{\dagger};A^{*})+\bar{\Delta}_{\infty}(\pi;A^{\dagger}), shows that for any satisficing action sequence and any policy π\pi,

Δ¯(π)Δ¯(A;A)+Γ(π,A)×I¯(A;θ).\bar{\Delta}_{\infty}(\pi)\leq\bar{\Delta}_{\infty}(A^{\dagger};A^{*})+\sqrt{\Gamma(\pi,A^{\dagger})\times\bar{I}_{\infty}(A^{\dagger};\theta)}.

Taking the infimum over π\pi yields

infπΔ¯(π)Δ¯(A;A)+ΓU×I¯(A;θ).\inf_{\pi}\bar{\Delta}_{\infty}(\pi)\leq\bar{\Delta}_{\infty}(A^{\dagger};A^{*})+\sqrt{\Gamma_{U}\times\bar{I}_{\infty}(A^{\dagger};\theta)}. (12)

Taking the infimum of the right-hand side over satisficing action sequences with Δ¯(A;A)D\bar{\Delta}_{\infty}(A^{\dagger};A^{*})\leq D, and then minimizing over DD, yields the result. ∎

As in Section 5, we provide some more easily interpreted special cases of our result. The next result is an analogue of Corollary 1 which upper bounds the attainable regret-rate in terms of the number of actions and the rate-distortion function ¯(D)\bar{\cal R}(D). For brevity, we state the result only for the infinite horizon regret rate.

Corollary 4.

Under any problem in the scope of our formulation,

infπΔ¯(π)infD0{D+σ2|𝒜|¯(D)}.\inf_{\pi}\bar{\Delta}_{\infty}(\pi)\leq\inf_{D\geq 0}\left\{D+\sigma\sqrt{2\cdot|\mathcal{A}|\cdot\bar{\mathcal{R}}_{\infty}(D)}\right\}.

A dependence on the number of actions can be avoided when it is possible to acquire information about some actions while exploring other actions. Full information problems are an extreme case where information can be acquired without any active exploration. In that case the optimal policy is the rule πGreedy\pi^{\rm Greedy} that chooses Atargmaxa𝒜𝔼[Rt,at1]A_{t}\in\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\mathbb{E}[R_{t,a}\mid\mathcal{F}_{t-1}] in each period tt. It satisfies the following bound.

Corollary 5.

In full information problems, where Ot,a=Ot,aO_{t,a}=O_{t,a^{\prime}} for each a,a𝒜a,a^{\prime}\in\mathcal{A}, we have

Δ¯(πGreedy)infD0{D+σ2¯(D)}.\bar{\Delta}_{\infty}(\pi^{\rm Greedy})\leq\inf_{D\geq 0}\left\{D+\sigma\sqrt{2\cdot\bar{\mathcal{R}}_{\infty}(D)}\right\}.

8.3 Application to environments with low total variation

One of the leading ways of analyzing nonstationary online optimization problems assumes little about the environment except for a bound on the total variation across time [Besbes et al., 2014, 2015, Cheung et al., 2019]. In this section, we upper bound the rate distortion function in terms of the normalized expected total variation in the suboptimality of an arm:

V¯T:=𝔼[1Tt=2Tmaxa𝒜|Δt(a)Δt1(a)|]whereΔt(a):=𝔼[Rt,AtRt,aθt].\bar{V}_{T}:=\mathbb{E}\left[\frac{1}{T}\sum_{t=2}^{T}\max_{a\in\mathcal{A}}\left|\Delta_{t}(a)-\Delta_{t-1}(a)\right|\right]\quad\text{where}\quad\Delta_{t}(a):=\mathbb{E}[R_{t,A_{t}^{*}}-R_{t,a}\mid\theta_{t}]. (13)

It is also possible to study the total variation in mean-rewards 𝔼[Rt,aθt]\mathbb{E}[R_{t,a}\mid\theta_{t}], as in Besbes et al. [2015]. Figure 1 displays an environment in which total variation of the optimality gap Δt(a)\Delta_{t}(a) is much smaller, since it ignores variation that is common across all arms.

The next proposition shows it is possible to attain a low regret rate whenever the total variation of the environment is low. We establish this by bounding the rate distortion function, i.e., the number of bits about the latent environment states that must be communicated in order to implement a near-optimal action sequence. To ease the presentation, we use O~()\tilde{O}(\cdot) notation that hides logarithmic factors, but all constants can be found in Section A.9.

Proposition 4.

The rate distortion function is bounded as

¯T(D)O~(1+V¯TD).\bar{\mathcal{R}}_{T}(D)\leq\tilde{O}\left(1+\frac{\bar{V}_{T}}{D}\right).

If there is a uniform bound on the information ratio ΓU:=supAinfπΓ(π;A)<\Gamma_{U}:=\sup_{A^{\dagger}}\inf_{\pi}\Gamma(\pi;A^{\dagger})<\infty, then

infπΔ¯T(π)O~((ΓUV¯T)1/3+ΓUT).\inf_{\pi}\bar{\Delta}_{T}\left(\pi\right)\leq\tilde{O}\left(\left(\Gamma_{U}\bar{V}_{T}\right)^{1/3}+\sqrt{\frac{\Gamma_{U}}{T}}\right).

From Corollary 4, one can derive bounds for kk-armed bandits similar to those in Besbes et al. [2015]. One strength of the general result is that it can be specialized to a much broader class of important online decision-making problems with bounded information ratio, ranging from combinatorial bandits to contextual bandits.

Proof sketch.

Even if the total variation V¯T\bar{V}_{T} is small, it is technically possible for the entropy rate of the optimal action process to be very large. This occurs if the optimal action switches identity frequently and unpredictably, but the gap between actions’ performance vanishes.

Thankfully, the satisficing action sequence in Example 11 is near optimal and has low entropy. Recall the definition

At={At1 if 𝔼[Rt,At1θt]𝔼[maxa𝒜Rt,aθt]Dargmaxa𝒜𝔼[Rt,aθt]otherwise.\displaystyle A^{\dagger}_{t}=\begin{cases}A^{\dagger}_{t-1}&\text{ if }\mathbb{E}[R_{t,A_{t-1}^{\dagger}}\mid\theta_{t}]\geq\mathbb{E}[\max_{a\in\mathcal{A}}R_{t,a}\mid\theta_{t}]-D\\ \operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\mathbb{E}[R_{t,a}\mid\theta_{t}]&\text{otherwise}.\end{cases}

This sequence of arms switches only when an arm's suboptimality exceeds a distortion level D>0D>0. It is immediate that Δ¯T(A;A)D\bar{\Delta}_{T}(A^{\dagger};A^{*})\leq D and our proof reveals

H¯T(A1:T)2V¯TD(1+log(1+T(1D2V¯T))+log(|𝒜|))+3log(|𝒜|T)T.\bar{H}_{T}(A^{\dagger}_{1:T})\leq\frac{2\bar{V}_{T}}{D}\cdot\left(1+\log\left(1+T\wedge\left(1\vee\frac{D}{2\bar{V}_{T}}\right)\right)+\log(|\mathcal{A}|)\right)+\frac{3\log(|\mathcal{A}|T)}{T}.

Together, these give a constructive bound on the rate distortion function. The remaining claim then follows from Theorem 3. ∎
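To see how the stated rate emerges, note that the displayed entropy bound is of order V̄_T/D + 1/T up to logarithmic factors, so Theorem 3 gives inf_π Δ̄_T(π) ≲ inf_{D>0} {D + √(Γ_U V̄_T/D)} + √(Γ_U/T); the choice D ∝ (Γ_U V̄_T)^{1/3} balances the first two terms, each of which then scales as (Γ_U V̄_T)^{1/3}.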

While we state this result in terms of fundamental limits of performance, it is worth noting that the proof itself implies Algorithm 2 attains regret bounds in terms of the variation budget in the problem types mentioned in Remark 6.

It is also worth mentioning that the entropy rate of the optimal action process could, in general, be large even in problems with low total-variation. This occurs when the optimal action keeps switching identity, but the arms it switches among tend to still be very nearly optimal. The generalization of Theorem 1 to Theorem 3 is essential to this result.

9 Extension 3: Bounds Under Adversarial Nonstationarity

The theoretical literature on nonstationary bandit learning mostly focuses on the limits of attainable performance when the true environment, or latent state sequence θ\theta, can be chosen adversarially. At first glance, our results seem to be very different. We use the language of stochastic processes, rather than game theory, to model the realization of the latent states. Thankfully, these two approaches are, in a sense, dual to one another.

In statistical decision theory, it is common to study minimax risk by analyzing Bayesian risk under a least favorable prior distribution. Using this insight, it is possible to deduce bounds on attainable regret in adversarial environments from our bounds for stochastic environments. We illustrate this first when the adversary is constrained to select latent states under which the optimal action switches infrequently [see e.g., Suk and Kpotufe, 2022]. Then, we study the case where the adversary must pick a sequence under which arm-means have low total-variation, as in Besbes et al. [2015]. We bound the rate-distortion function in terms of the total-variation, and recover adversarial regret bounds as a result.

Our approach builds on Bubeck and Eldan [2016], who leveraged ‘Bayesian’ information ratio analysis to study the fundamental limits of attainable regret in adversarial convex bandits; see also Lattimore and Szepesvári [2019], Lattimore [2020], Lattimore and Gyorgy [2021], Lattimore [2022]. These works study regret relative to the best fixed action in hindsight, whereas we study regret relative to the best changing action sequence — often called ‘‘dynamic regret.'' In either case, a weakness of this analysis approach is that it is non-constructive. It reveals that certain levels of performance are possible even in adversarial environments, but does not yield concrete algorithms that attain this performance. It would be exciting to synthesize our analysis with recent breakthroughs of Xu and Zeevi [2023], who showed how to derive algorithms for adversarial environments out of information-ratio style analysis.

9.1 A minimax theorem

Define the expected regret rate incurred by a policy π\pi under parameter realization θ\theta^{\prime} by

\bar{\Delta}_{T}(\pi;\theta^{\prime}):=\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(R_{t,A^{*}_{t}}-R_{t,A_{t}}\right)\,\middle|\,\theta=\theta^{\prime}\right].

The Bayesian, or average, regret

Δ¯T(π;q):=𝔼θq[Δ¯T(π;θ)],\bar{\Delta}_{T}(\pi;q):=\mathbb{E}_{\theta\sim q}\left[\bar{\Delta}_{T}(\pi;\theta)\right],

produces a scalar performance measure by averaging. Many papers instead study the worst-case regret supθΞΔ¯T(π;θ)\sup_{\theta\in\Xi}\bar{\Delta}_{T}(\pi;\theta) across a constrained family ΞΘT\Xi\subseteq\Theta^{T} of possible parameter realizations. The next proposition states that the optimal worst-case regret is the same as the optimal Bayesian regret under a worst-case prior. We state stringent regularity conditions to apply the most basic version of the minimax theorem. Generalizations are possible, but require greater mathematical sophistication.

Proposition 5.

Let Ξ\Xi be a finite set and take 𝒟(Ξ)\mathcal{D}(\Xi) to be the set of probability distributions over Ξ\Xi. Suppose the range of the outcome function g()g(\cdot) in (1) is a finite set. Then,

minπΠmaxq𝒟(Ξ)Δ¯T(π;q)minimax regret=maxq𝒟(Ξ)minπΠΔ¯T(π;q).Bayes regret under least favorable prior\underbrace{\min_{\pi\in\Pi}\max_{q\in\mathcal{D}(\Xi)}\bar{\Delta}_{T}(\pi;q)}_{\text{minimax regret}}=\underbrace{\max_{q\in\mathcal{D}(\Xi)}\min_{\pi\in\Pi}\bar{\Delta}_{T}(\pi;q).}_{\text{Bayes regret under least favorable prior}}
proof sketch.

We apply von Neumann's minimax theorem to study a two-player game between an experimenter, who chooses π\pi, and nature, who chooses θ\theta. A deterministic strategy for nature is a choice of θΞ\theta\in\Xi. Take n=|Ξ|n=|\Xi| and label the possible choices of nature by θ(1),,θ(n)\theta^{(1)},\ldots,\theta^{(n)}. A randomized strategy for nature is a probability mass function q=(q1,,qn)[0,1]nq=(q_{1},\ldots,q_{n})\in[0,1]^{n}.

For the experimenter, we need to make a notational distinction between a possibly randomized strategy π\pi and a deterministic one, which we denote by ψ\psi. A deterministic strategy ψ\psi for the experimenter is a mapping from a history of past actions and observations to an action. Let 𝒪\mathcal{O} denote the range of gg, i.e., the set of possible observations. There are at most m=|𝒜||𝒪|Tm=|\mathcal{A}|^{|\mathcal{O}|^{T}} deterministic policies. Label these ψ(1),,ψ(m)\psi^{(1)},\ldots,\psi^{(m)}. A randomized strategy for the experimenter is a probability mass function π=(π1,,πm)[0,1]m\pi=(\pi_{1},\ldots,\pi_{m})\in[0,1]^{m}.

Define Δm×n\Delta\in\mathbb{R}^{m\times n} by Δ(i,j)=Δ¯T(ψ(i);θ(j))\Delta(i,j)=\bar{\Delta}_{T}(\psi^{(i)};\theta^{(j)}). By von Neumann's minimax theorem,

minπmaxqπΔq=maxqminππΔq.\min_{\pi}\max_{q}\,\pi^{\top}\Delta q=\max_{q}\min_{\pi}\,\pi^{\top}\Delta q. ∎
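As a numerical companion to this reduction, the following minimal sketch (Python with scipy; the 3x3 regret matrix is a purely hypothetical example) computes the experimenter's minimax strategy for the matrix game by linear programming; by LP duality, the dual solution of the same program is nature's least favorable prior.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical regret matrix: Delta[i, j] = average regret of deterministic
# policy psi^(i) against latent-state sequence theta^(j).
Delta = np.array([[0.30, 0.05, 0.20],
                  [0.10, 0.25, 0.15],
                  [0.20, 0.20, 0.05]])
m, n = Delta.shape

# Experimenter's LP: choose pi (and value v) to minimize the worst-case
# column payoff, i.e. min v subject to (pi^T Delta)_j <= v for every j.
c = np.r_[np.zeros(m), 1.0]
A_ub = np.c_[Delta.T, -np.ones(n)]            # one constraint per column j
b_ub = np.zeros(n)
A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)  # probabilities sum to one
b_eq = [1.0]
bounds = [(0, None)] * m + [(None, None)]     # pi >= 0, v unrestricted
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

pi_star, minimax_value = res.x[:m], res.x[-1]
print("minimax regret:", round(minimax_value, 4))
print("optimal randomized policy:", np.round(pi_star, 3))
# The dual variables of the A_ub constraints recover the least favorable prior.
```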

9.2 Deducing bounds on regret in adversarial environments: environments with few switches

As a corollary of this result, we bound minimax regret when the adversary is constrained to choose a sequence of latent states under which the optimal action changes infrequently. For concreteness, we state the result for linear bandit problems, though similar statements hold for any of the problems with bounded information ratio discussed in Section 6.1.

Corollary 6 (Corollary of Theorem 1 and Proposition 2).

Suppose 𝒜,Θd\mathcal{A},\Theta\subset\mathbb{R}^{d}, rewards follow the linear model 𝔼[Rt,aθt]=θta\mathbb{E}\left[R_{t,a}\mid\theta_{t}\right]=\theta_{t}^{\top}a, and the range of the reward function R()R(\cdot) is contained in [1,1][-1,1]. Under the conditions in Proposition 5, for any TT\in\mathbb{N},

infπΠmaxθ:𝒮¯T(A;θ)S¯Δ¯T(π;θ)d2S¯(1+log(1+1/S¯)+log(|𝒜|)).\inf_{\pi\in\Pi}\max_{\theta:\bar{\mathcal{S}}_{T}(A^{*};\theta)\leq\bar{S}}\bar{\Delta}_{T}(\pi;\theta)\leq\sqrt{\frac{d}{2}\cdot\bar{S}\cdot\left(1+\log\left(1+1/\bar{S}\right)+\log(|\mathcal{A}|)\right)}.

where 𝒮¯T(A;θ):=1T(1+t=2T𝕀{AtAt1})\bar{\mathcal{S}}_{T}(A^{*};\theta):=\frac{1}{T}(1+\sum_{t=2}^{T}\mathbb{I}\{A^{*}_{t}\neq A^{*}_{t-1}\}) is the switching rate defined in (6).

Proof.

Fix a prior distribution qq supported on Ξ:={θΘT:𝒮¯T(θ)S¯}\Xi:=\{\theta\in\Theta^{T}:\bar{\mathcal{S}}_{T}(\theta)\leq\bar{S}\}. Theorem 1 yields the bound Δ¯T(π)Γ(π)H¯T(A)\bar{\Delta}_{T}(\pi)\leq\sqrt{\Gamma(\pi)\bar{H}_{T}(A^{*})}, where implicitly the information ratio Γ(π)\Gamma(\pi) and the entropy rate H¯T(A)\bar{H}_{T}(A^{*}) depend on qq. We give uniform upper bounds on these quantities.

First, since 𝒮¯T(A;θ)S¯\bar{\mathcal{S}}_{T}(A^{*};\theta)\leq\bar{S} for any θΞ\theta\in\Xi, Lemma 1 implies

H¯T(A)S¯(1+log(1+1/S¯)+log|𝒜|).\bar{H}_{T}(A^{*})\leq\bar{S}\cdot\left(1+\log\left(1+1/\bar{S}\right)+\log|\mathcal{A}|\right).

Bounded random variables are sub-Gaussian. In particular, since the range of the reward function R()R(\cdot) is contained in [1,1][-1,1], the variance proxy is bounded as σ21\sigma^{2}\leq 1. From Section 6.1, we know that Γ(πTS)2σ2d2d\Gamma(\pi^{\rm TS})\leq 2\cdot\sigma^{2}\cdot d\leq 2\cdot d. Combining these results gives

infπΔ¯T(π;q)infπΓ(π)H¯T(A)2dS¯(1+log(1+1/S¯)+log|𝒜|)\inf_{\pi}\bar{\Delta}_{T}(\pi;q)\leq\sqrt{\inf_{\pi}\Gamma(\pi)\bar{H}_{T}(A^{*})}\leq\sqrt{2\cdot d\cdot\bar{S}\cdot\left(1+\log\left(1+1/\bar{S}\right)+\log|\mathcal{A}|\right)}

The claim then follows from Proposition 5. ∎

9.3 Deducing bounds on regret in adversarial environments: environments with low total variation

From the regret bound for stochastic environments in Proposition 4, we can deduce a bound on minimax regret in linear bandits when the adversary is constrained in the rate of total variation in mean rewards [Besbes et al., 2015]. The proof is omitted for brevity.

Corollary 7 (Corollary of Proposition 4).

Suppose 𝒜,Θd\mathcal{A},\Theta\subset\mathbb{R}^{d}, rewards follow the linear model 𝔼[Rt,aθt]=θta\mathbb{E}\left[R_{t,a}\mid\theta_{t}\right]=\theta_{t}^{\top}a, and rewards are bounded as |Rt,a|1|R_{t,a}|\leq 1. Under the conditions in Proposition 5, for any TT\in\mathbb{N},

infπΠmaxθ:𝒱¯T(θ)V¯Δ¯T(π;θ)O~((dV¯)1/3+dT).\inf_{\pi\in\Pi}\max_{\theta:\bar{\mathcal{V}}_{T}(\theta)\leq\bar{V}}\bar{\Delta}_{T}(\pi;\theta)\leq\tilde{O}\left(\left(d\bar{V}\right)^{1/3}+\sqrt{\frac{d}{T}}\right).

where 𝒱¯T(θ):=1Tt=2Tmaxa𝒜|Δt(a)Δt1(a)|\bar{\mathcal{V}}_{T}(\theta):=\frac{1}{T}\sum_{t=2}^{T}\max_{a\in\mathcal{A}}\left|\Delta_{t}(a)-\Delta_{t-1}(a)\right| is the total variation measure used in (13).

10 Conclusion and Open Questions

We have provided a unifying framework to analyze interactive learning in changing environments. The results offer an intriguing measure of the difficulty of learning: the entropy rate of the optimal action process. A strength of the approach is that it applies to nonstationary variants of many of the most important learning problems. Rather than designing algorithms to make the proofs work, we show that most results apply to Thompson sampling (TS), one of the most widely used bandit algorithms, and we recover existing results that were previously proven individually for each setting with different proof techniques and algorithms.

While our analyses offer new theoretical insights, practical implementation of algorithms in real-world problems involves numerous considerations and challenges that we do not address. Section 3 described a simple setting in which Thompson sampling with proper posterior updating resembles simple exponential moving average estimators. The A/B testing scenario of Section 1.1 and the news recommendation problem in Example 5 reflect that there is often natural problem structure that is different from what agnostic exponential moving average estimators capture. To leverage this, it is crucial to properly model the dynamics of the environment and leverage auxiliary data to construct an appropriate prior. Implementing the posterior sampling procedure adds another layer of complexity, given that the exact posterior distribution is often not available in closed form for models with complicated dynamics. Modern generative AI techniques (e.g. diffusion models) provide a promising path to enhance both model flexibility and sampling efficiency.

Lastly, we call for a deeper exploration of the connection between learning in dynamic environments and the theory of optimal compression. Section 8 provides intriguing connections to rate-distortion theory, but many questions remain open. One open direction is around information-theoretic lower bounds. For that, we conjecture one needs to construct problem classes in which the uniform upper bound ΓU\Gamma_{U} on the information ratio is close to a lower bound. Another open direction is to try to characterize or bound the rate-distortion function in other scenarios. In the information-theory literature, numerous studies have provided theoretical characterizations of rate distortion functions [Cover and Thomas, 2006, Gray, 1971, Blahut, 1972, Derpich and Ostergaard, 2012, Stavrou et al., 2018, 2020]. It is worth investigating whether a synthesis of these existing rate distortion results with our framework can produce meaningful regret bounds, particularly for the nonstationary bandit environments driven by Markov processes such as Brownian motion [Slivkins and Upfal, 2008] or autoregressive processes [Chen et al., 2023]. Furthermore, computational methods for obtaining the rate distortion function and the optimal lossy compression scheme [Blahut, 1972, Jalali and Weissman, 2008, Theis et al., 2017] can be implemented to construct the rate-optimal satisficing action sequence.

References

  • Abbasi-Yadkori et al. [2022] Yasin Abbasi-Yadkori, Andras Gyorgy, and Nevena Lazic. A new look at dynamic regret for non-stationary stochastic bandits. arXiv preprint arXiv:2201.06532, 2022.
  • Audibert et al. [2014] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
  • Auer [2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
  • Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
  • Auer et al. [2019] Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pages 138–158. PMLR, 2019.
  • Azevedo et al. [2019] Eduardo M Azevedo, Alex Deng, José L Montiel Olea, and E Glen Weyl. Empirical Bayes estimation of treatment effects with many A/B tests: An overview. In AEA Papers and Proceedings, volume 109, pages 43–47, 2019.
  • Barnett and Kedem [1991] John T Barnett and Benjamin Kedem. Zero-crossing rates of functions of gaussian processes. IEEE Transactions on Information Theory, 37(4):1188–1194, 1991.
  • Besbes et al. [2014] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in Neural Information Processing Systems 27, 2014.
  • Besbes et al. [2015] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
  • Blahut [1972] Richard E. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.
  • Bubeck and Cesa-Bianchi [2012] Sebastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
  • Bubeck and Eldan [2016] Sébastien Bubeck and Ronen Eldan. Multi-scale exploration of convex functions and bandit convex optimization. In Conference on Learning Theory, pages 583–589. PMLR, 2016.
  • Chakrabarti et al. [2008] Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, and Eli Upfal. Mortal multi-armed bandits. In Advances in neural information processing systems 21, 2008.
  • Chapelle and Li [2011] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, 2011.
  • Chen et al. [2023] Qinyi Chen, Negin Golrezaei, and Djallel Bouneffouf. Non-stationary bandits with auto-regressive temporal dependency. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Chen et al. [2019] Yifang Chen, Chung-Wei Lee, Haipeng Luo, and Chen-Yu Wei. A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. In Conference on Learning Theory, pages 696–726. PMLR, 2019.
  • Cheung et al. [2019] Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. In International Conference on Artificial Intelligence and Statistics, pages 1079–1087. PMLR, 2019.
  • Cover [1991] Thomas Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991.
  • Cover and Thomas [2006] Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2006.
  • Derpich and Ostergaard [2012] Milan S. Derpich and Jan Ostergaard. Improved upper bounds to the causal quadratic rate-distortion function for gaussian stationary sources. IEEE Transactions on Information Theory, 58(5):3131–3152, 2012.
  • Dong et al. [2019] Shi Dong, Tengyu Ma, and Benjamin Van Roy. On the performance of Thompson sampling on logistic bandits. In Conference on Learning Theory, pages 1158–1160, 2019.
  • Foster et al. [2022] Dylan J Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan. On the complexity of adversarial decision making. In Advances in Neural Information Processing Systems, 2022.
  • Garivier and Moulines [2011] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pages 174–188. Springer, 2011.
  • Ghatak [2020] Gourab Ghatak. A change-detection-based Thompson sampling framework for non-stationary bandits. IEEE Transactions on Computers, 70(10):1670–1676, 2020.
  • Gray [1971] Robert M. Gray. Rate distortion functions for finite-state finite-alphabet Markov sources. IEEE Transactions on Information Theory, 17(2):127–134, 1971.
  • Gupta et al. [2011] Neha Gupta, Ole-Christoffer Granmo, and Ashok Agrawala. Thompson sampling for dynamic multi-armed bandits. In 10th International Conference on Machine Learning and Applications, volume 1, pages 484–489. IEEE, 2011.
  • Hao et al. [2021] Botao Hao, Tor Lattimore, and Wei Deng. Information directed sampling for sparse linear bandits. In Advances in Neural Information Processing Systems 34, pages 16738–16750, 2021.
  • Hao et al. [2022] Botao Hao, Tor Lattimore, and Chao Qin. Contextual information-directed sampling. In International Conference on Machine Learning, pages 8446–8464. PMLR, 2022.
  • Jalali and Weissman [2008] Shirin Jalali and Tsachy Weissman. Rate-distortion via Markov chain Monte Carlo. In 2008 IEEE International Symposium on Information Theory, 2008.
  • Jia et al. [2023] Su Jia, Qian Xie, Nathan Kallus, and Peter I Frazier. Smooth non-stationary bandits. arXiv preprint arXiv:2301.12366, 2023.
  • Jung and Tewari [2019] Young Hun Jung and Ambuj Tewari. Regret bounds for Thompson sampling in episodic restless bandit problems. In Advances in Neural Information Processing Systems 32, 2019.
  • Kocsis and Szepesvári [2006] Levente Kocsis and Csaba Szepesvári. Discounted UCB. In 2nd PASCAL Challenges Workshop, 2006.
  • Lattimore [2020] Tor Lattimore. Improved regret for zeroth-order adversarial bandit convex optimisation. Mathematical Statistics and Learning, 2(3):311–334, 2020.
  • Lattimore [2022] Tor Lattimore. Minimax regret for partial monitoring: Infinite outcomes and rustichini’s regret. In Conference on Learning Theory, pages 1547–1575. PMLR, 2022.
  • Lattimore and Gyorgy [2021] Tor Lattimore and Andras Gyorgy. Mirror descent and the information ratio. In Conference on Learning Theory, pages 2965–2992. PMLR, 2021.
  • Lattimore and Szepesvári [2019] Tor Lattimore and Csaba Szepesvári. An information-theoretic approach to minimax regret in partial monitoring. In Conference on Learning Theory, pages 2111–2139. PMLR, 2019.
  • Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Liu et al. [2018] Fang Liu, Swapna Buccapatnam, and Ness Shroff. Information directed sampling for stochastic bandits with graph feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Liu et al. [2023] Yueyang Liu, Benjamin Van Roy, and Kuang Xu. Nonstationary bandit learning via predictive sampling. arXiv preprint arXiv:2205.01970v5, 2023.
  • Mellor and Shapiro [2013] Joseph Mellor and Jonathan Shapiro. Thompson sampling in switching environments with Bayesian online change point detection. In Artificial Intelligence and Statistics, pages 442–450. PMLR, 2013.
  • Min and Russo [2023] Seungki Min and Daniel Russo. An information-theoretic analysis of nonstationary bandit learning. In International Conference on Machine Learning, pages 24831–24849, 2023.
  • Min et al. [2019] Seungki Min, Costis Maglaras, and Ciamac C. Moallemi. Thompson sampling with information relaxation penalties. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019.
  • Neu et al. [2022] Gergely Neu, Julia Olkhovskaya, Matteo Papini, and Ludovic Schwartz. Lifting the information ratio: An information-theoretic analysis of Thompson sampling for contextual bandits. arXiv preprint arXiv:2205.13924, 2022.
  • Raj and Kalyani [2017] Vishnu Raj and Sheetal Kalyani. Taming non-stationary bandits: A Bayesian approach. arXiv preprint arXiv:1707.09727, 2017.
  • Rice [1944] Stephen O Rice. Mathematical analysis of random noise. The Bell System Technical Journal, 23(3):282–332, 1944.
  • Russo and Van Roy [2016] Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(1):1221–1243, 2016.
  • Russo and Van Roy [2018] Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. Operations Research, 66(1):230–252, 2018.
  • Russo and Van Roy [2022] Daniel Russo and Benjamin Van Roy. Satisficing in time-sensitive bandit learning. Mathematics of Operations Research, 47(4):2815–2839, 2022.
  • Simon [1979] Herbert A Simon. Rational decision making in business organizations. The American Economic Review, 69(4):493–513, 1979.
  • Slivkins and Upfal [2008] Aleksandrs Slivkins and Eli Upfal. Adapting to a changing environment: the Brownian restless bandits. In COLT, pages 343–354, 2008.
  • Stavrou et al. [2018] Photios A. Stavrou, Jan Østergaard, and Charalambos D. Charalambous. Zero-delay rate distortion via filtering for vector-valued Gaussian sources. IEEE Journal of Selected Topics in Signal Processing, 12(5):841–856, 2018.
  • Stavrou et al. [2020] Photios A. Stavrou, Takashi Tanaka, and Sekhar Tatikonda. The time-invariant multidimensional Gaussian sequential rate-distortion problem revisited. IEEE Transactions on Automatic Control, 65(5), 2020.
  • Suk and Kpotufe [2022] Joe Suk and Samory Kpotufe. Tracking most significant arm switches in bandits. In Conference on Learning Theory, pages 2160–2182. PMLR, 2022.
  • Theis et al. [2017] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszar. Lossy image compression with compressive autoencoders. In International Conference on Learning Representations (ICLR), 2017.
  • Thompson [1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
  • Trovo et al. [2020] Francesco Trovo, Stefano Paladino, Marcello Restelli, and Nicola Gatti. Sliding-window Thompson sampling for non-stationary settings. Journal of Artificial Intelligence Research, 68:311–364, 2020.
  • Whittle [1988] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25:287–298, 1988.
  • Wu et al. [2022] Yuhang Wu, Zeyu Zheng, Guangyu Zhang, Zuohua Zhang, and Chu Wang. Non-stationary A/B tests. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2079–2089, 2022.
  • Xu and Zeevi [2023] Yunbei Xu and Assaf Zeevi. Bayesian design principles for frequentist sequential learning. In International Conference on Machine Learning, pages 38768–38800. PMLR, 2023.

Appendix A Proofs

A.1 Proof of Lemma 1

We first bound the number of possible realizations that involve at most S_{T} switches:555One can imagine a two-dimensional grid of size T\times S_{T}, represented with coordinates \left((t,s):t\in[T],s\in[S_{T}]\right). A feasible switching time configuration corresponds to a path from the lower left corner (1,1) to the upper right corner (T,S_{T}) that consists of T-1 rightward moves and S_{T}-1 upward moves; whenever a path makes an upward move, from (t,s) to (t,s+1), we can mark that a switch occurs at time t (if the path makes two or more upward moves in a row, the actual number of switches can be smaller than S_{T}). The number of such paths is given by \binom{(T-1)+(S_{T}-1)}{S_{T}-1}.

|{(x1,,xT)𝒳T|1+t=2T𝕀{xtxt1}ST}|((T1)+(ST1)ST1)×|𝒳|ST(T+ST1ST)×|𝒳|ST.\left|\left\{(x_{1},\ldots,x_{T})\in\mathcal{X}^{T}\left|1+\sum_{t=2}^{T}\mathbb{I}\{x_{t}\neq x_{t-1}\}\leq S_{T}\right.\right\}\right|\leq\binom{(T-1)+(S_{T}-1)}{S_{T}-1}\times|\mathcal{X}|^{S_{T}}\leq\binom{T+S_{T}-1}{S_{T}}\times|\mathcal{X}|^{S_{T}}.

Note that for any k\leq n\in\mathbb{N},

(nk)=n×(n1)××(nk+1)k!nkk!nk2πk(k/e)knk(k/e)k=(enk)k,\binom{n}{k}=\frac{n\times(n-1)\times\ldots\times(n-k+1)}{k!}\leq\frac{n^{k}}{k!}\leq\frac{n^{k}}{\sqrt{2\pi k}(k/e)^{k}}\leq\frac{n^{k}}{(k/e)^{k}}=\left(\frac{en}{k}\right)^{k},

where the second inequality uses Stirling's approximation, k!\geq\sqrt{2\pi k}\,(k/e)^{k}. Therefore,

log(|𝒳|ST×(T+ST1ST))log(|𝒳|ST×(e(T+ST1)ST)ST)=ST×(log(|𝒳|)+1+log(1+T1ST)).\log\left(|\mathcal{X}|^{S_{T}}\times\binom{T+S_{T}-1}{S_{T}}\right)\leq\log\left(|\mathcal{X}|^{S_{T}}\times\left(\frac{e(T+S_{T}-1)}{S_{T}}\right)^{S_{T}}\right)=S_{T}\times\left(\log(|\mathcal{X}|)+1+\log\left(1+\frac{T-1}{S_{T}}\right)\right).

By observing that \log\left(1+\frac{T-1}{S_{T}}\right)\leq\log\left(1+\frac{T}{S_{T}}\right), we obtain the desired result.
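
As a quick sanity check on this counting argument, the short Python sketch below (not part of the original analysis; the small instance sizes are arbitrary) enumerates all sequences over a small alphabet and compares the exact count of sequences with at most S_{T} switches against the bound \binom{T+S_{T}-1}{S_{T}}\times|\mathcal{X}|^{S_{T}}.

```python
# Brute-force check of the counting bound in Lemma 1 on small instances.
# Illustrative sketch only; instance sizes are arbitrary.
from itertools import product
from math import comb

def num_sequences_with_few_switches(alphabet_size, T, S_T):
    """Count sequences (x_1,...,x_T) with 1 + sum_t I{x_t != x_{t-1}} <= S_T."""
    count = 0
    for seq in product(range(alphabet_size), repeat=T):
        switches = 1 + sum(seq[t] != seq[t - 1] for t in range(1, T))
        count += switches <= S_T
    return count

for X, T, S_T in [(2, 6, 2), (3, 5, 3), (4, 6, 2)]:
    exact = num_sequences_with_few_switches(X, T, S_T)
    bound = comb(T + S_T - 1, S_T) * X**S_T
    print(f"|X|={X}, T={T}, S_T={S_T}: exact={exact}, bound={bound}")
    assert exact <= bound
```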

A.2 Proof of Proposition 1

Let Z_{t}:=\mathbb{I}\{A_{t}^{*}\neq A_{t-1}^{*}\}, an indicator of a ‘‘switch’’. Then, \tau_{\textup{eff}}^{-1}=\mathbb{P}(Z_{t}=1) and, for any t\geq 2,

H(At|A1:t1)\displaystyle H(A_{t}^{*}|A_{1:t-1}^{*}) =H(At|A1:t1)+H(Zt|A1:t1,At)=0\displaystyle=H(A_{t}^{*}|A_{1:t-1}^{*})+\underbrace{H(Z_{t}|A_{1:t-1}^{*},A_{t}^{*})}_{=0}
=H((Zt,At)|A1:t1)\displaystyle=H((Z_{t},A_{t}^{*})|A_{1:t-1}^{*})
=H(Zt|A1:t1)+H(At|Zt,A1:t1)\displaystyle=H(Z_{t}|A_{1:t-1}^{*})+H(A_{t}^{*}|Z_{t},A_{1:t-1}^{*})
H(Zt)+H(At|Zt,A1:t1)\displaystyle\leq H(Z_{t})+H(A_{t}^{*}|Z_{t},A_{1:t-1}^{*})
=H(Zt)+(Zt=1)H(At|Zt=1,A1:t1)\displaystyle=H(Z_{t})+\mathbb{P}(Z_{t}=1)H(A_{t}^{*}|Z_{t}=1,A_{1:t-1}^{*})
+(Zt=0)H(At|Zt=0,A1:t1)=0\displaystyle\qquad+\underbrace{\mathbb{P}(Z_{t}=0)H(A_{t}^{*}|Z_{t}=0,A_{1:t-1}^{*})}_{=0}
H(Zt)+(Zt=1)H(At|Zt=1).\displaystyle\leq H(Z_{t})+\mathbb{P}(Z_{t}=1)H(A_{t}^{*}|Z_{t}=1).

With \delta:=\tau_{\textup{eff}}^{-1},

H(Zt)+(Zt=1)H(At|Zt=1)\displaystyle H(Z_{t})+\mathbb{P}(Z_{t}=1)H(A_{t}^{*}|Z_{t}=1)
=δlog(1/δ)+(1δ)log(1/(1δ))+δH(At|Zt=1)\displaystyle=\delta\log(1/\delta)+(1-\delta)\log(1/(1-\delta))+\delta H(A_{t}^{*}|Z_{t}=1)
=δlog(1/δ)+(1δ)log(1+δ/(1δ))+δH(At|Zt=1)\displaystyle=\delta\log(1/\delta)+(1-\delta)\log(1+\delta/(1-\delta))+\delta H(A_{t}^{*}|Z_{t}=1)
δlog(1/δ)+δ+δH(At|Zt=1)\displaystyle\leq\delta\log(1/\delta)+\delta+\delta H(A_{t}^{*}|Z_{t}=1)
=1τeff[log(τeff)+1+H(At|Zt=1)].\displaystyle=\frac{1}{\tau_{\textup{eff}}}\left[\log(\tau_{\textup{eff}})+1+H(A_{t}^{*}|Z_{t}=1)\right].

Therefore, we deduce that

H¯T(A)=1Tt=1TH(At|A1:t1)1+log(τeff)+H(At|Zt=1)τeff+H(A1)T.\bar{H}_{T}(A^{*})=\frac{1}{T}\sum_{t=1}^{T}H(A_{t}^{*}|A_{1:t-1}^{*})\leq\frac{1+\log(\tau_{\textup{eff}})+H(A_{t}^{*}|Z_{t}=1)}{\tau_{\textup{eff}}}+\frac{H(A_{1}^{*})}{T}.
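
To see how tight this bound can be, the following sketch (illustrative only; the parameter choices are hypothetical) evaluates the per-step inequality from the proof for a simple stationary switching process in which the optimal action is redrawn uniformly among k arms with probability \delta and otherwise stays put.

```python
# Numerical check of the per-step bound behind Proposition 1 for a simple
# stationary switching process: with probability delta the optimal action is
# redrawn uniformly from the k arms, otherwise it stays put.
# Illustrative sketch only; parameter choices are hypothetical.
import numpy as np

def conditional_entropy_and_bound(k, delta):
    # Exact per-step conditional entropy H(A*_t | A*_{t-1}) of this Markov chain.
    p_stay = (1.0 - delta) + delta / k      # probability of keeping the same arm
    p_move = delta / k                      # probability of moving to a given other arm
    h_cond = -(p_stay * np.log(p_stay) + (k - 1) * p_move * np.log(p_move))
    # Effective horizon and the bound (1 + log(tau_eff) + H(A*_t | switch)) / tau_eff,
    # using the crude estimate H(A*_t | switch) <= log k.
    tau_eff = 1.0 / (delta * (1.0 - 1.0 / k))
    bound = (1.0 + np.log(tau_eff) + np.log(k)) / tau_eff
    return h_cond, bound

for k, delta in [(5, 0.1), (3, 0.5), (10, 0.01)]:
    h, b = conditional_entropy_and_bound(k, delta)
    print(f"k={k}, delta={delta}: H(A*_t|A*_t-1)={h:.4f}, bound={b:.4f}")
    assert h <= b
```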

A.3 Proof of Proposition 2

We use A_{1:T}^{*} as shorthand for (A_{1}^{*},\ldots,A_{T}^{*}). Let \mathcal{S}_{T}:=1+\sum_{t=2}^{T}\mathbb{I}\{A_{t}^{*}\neq A_{t-1}^{*}\}. By Lemma 1,

H(A1:T|𝒮T=n)n×(1+log(1+Tn)+log|𝒜|),H(A_{1:T}^{*}|\mathcal{S}_{T}=n)\leq n\times\left(1+\log\left(1+\frac{T}{n}\right)+\log|\mathcal{A}|\right),

for any n\in\{1,\ldots,T\}. Since the right-hand side is a concave function of n, by Jensen's inequality,

1TH(A1:T|𝒮T)\displaystyle\frac{1}{T}H(A_{1:T}^{*}|\mathcal{S}_{T}) 𝔼[𝒮TT×(1+log(1+T𝒮T)+log|𝒜|)],\displaystyle\leq\mathbb{E}\left[\frac{\mathcal{S}_{T}}{T}\times\left(1+\log\left(1+\frac{T}{\mathcal{S}_{T}}\right)+\log|\mathcal{A}|\right)\right],
𝔼[𝒮T]T×(1+log(1+T𝔼[𝒮T])+log|𝒜|)\displaystyle\leq\frac{\mathbb{E}[\mathcal{S}_{T}]}{T}\times\left(1+\log\left(1+\frac{T}{\mathbb{E}[\mathcal{S}_{T}]}\right)+\log|\mathcal{A}|\right)
=S¯T×(1+log(1+1/S¯T)+log|𝒜|).\displaystyle=\bar{S}_{T}\times\left(1+\log\left(1+1/\bar{S}_{T}\right)+\log|\mathcal{A}|\right).

On the other hand, we have H(\mathcal{S}_{T})\leq\log T since \mathcal{S}_{T}\in\{1,2,\ldots,T\}. Therefore,

H¯T(A)=H(A1:T|𝒮T)T+H(𝒮T)TS¯T×(1+log(1+1/S¯T)+log|𝒜|)+logTT.\bar{H}_{T}(A^{*})=\frac{H(A_{1:T}^{*}|\mathcal{S}_{T})}{T}+\frac{H(\mathcal{S}_{T})}{T}\leq\bar{S}_{T}\times\left(1+\log\left(1+1/\bar{S}_{T}\right)+\log|\mathcal{A}|\right)+\frac{\log T}{T}.

A.4 Proof of Proposition 3

We start with a proof sketch. Our proof builds on a well-known result for stationary bandits: there exists a stationary (Bayesian) k-armed bandit instance such that any algorithm's (Bayesian) cumulative regret is lower bounded by \Omega(\sqrt{nk}), where n is the length of the time horizon.

More specifically, we set n=\Theta(\tau_{\rm eff})\in\mathbb{N} and construct a nonstationary environment by concatenating independent n-period stationary Gaussian bandit instances, i.e., the mean rewards change every n time steps. In each block (of length n), the best arm has mean reward \epsilon>0 and the other arms have zero mean reward, where the best arm is drawn uniformly and independently from the k arms in each block. When \epsilon=\Theta(\sqrt{k/n}), no algorithm can identify the best arm within n samples, and hence the cumulative regret must increase by \Omega(n\epsilon) per block. Consequently, the per-period regret \bar{\Delta}_{\infty}(\pi) must be \Omega(\epsilon)=\Omega(\sqrt{k/n})=\Omega(\sqrt{k/\tau_{\rm eff}}). In the detailed proof, we additionally employ a randomization trick in determining the changepoints so as to ensure that the optimal action sequence (A_{t}^{*})_{t\in\mathbb{N}} is stationary and \mathbb{P}(A_{t}^{*}\neq A_{t-1}^{*})=\tau_{\rm eff}^{-1} exactly. We now give the formal proof.

Proof.

We consider Gaussian bandit instances throughout the proof. Without loss of generality, we assume \sigma=1, i.e., the reward noise always has unit variance.

We begin by stating a well-known result for stationary bandits, adapted from Lattimore and Szepesvári [2020, Exercise 15.2]: with \epsilon=(1-1/k)\sqrt{k/n}, for each i\in\{1,\ldots,k\}, let the mean reward vector \mu^{(i)}\in\mathbb{R}^{k} satisfy \mu_{a}^{(i)}=\epsilon\mathbb{I}\{i=a\}. It is shown that, when k>1 and n\geq k, under any algorithm \pi,

1ki=1k𝔼μ(i)π[t=1n(Rt,iRt,At)]18nk,\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}_{\mu^{(i)}}^{\pi}\left[\sum_{t=1}^{n}(R_{t,i}-R_{t,A_{t}})\right]\geq\frac{1}{8}\sqrt{nk}, (14)

where \mathbb{E}_{\mu^{(i)}}^{\pi}\left[\sum_{t=1}^{n}(R_{t,i}-R_{t,A_{t}})\right] is the (frequentist) cumulative regret of \pi in a k-armed Gaussian bandit instance specified by the time horizon length n and mean reward vector \mu^{(i)} (i.e., the reward distribution of arm a is \mathcal{N}(\mu_{a}^{(i)},1^{2})). Taking a uniform distribution over \{\mu^{(1)},\cdots,\mu^{(k)}\} as the prior, we obtain a Bayesian k-armed bandit instance of length n such that \mathbb{E}[\sum_{t=1}^{n}(R_{t,A^{*}}-R_{t,A_{t}})]\geq\sqrt{nk}/8 under any algorithm.

Given \tau_{\rm eff}\geq 2, set \tilde{\tau}=\frac{k-1}{k}\tau_{\rm eff}, n=\lfloor\tilde{\tau}\rfloor, and p=\tilde{\tau}-\lfloor\tilde{\tau}\rfloor. Let N be the random variable that equals n with probability 1-p and equals n+1 with probability p, so that \mathbb{E}[N]=\tilde{\tau}. We construct a stationary renewal process (T_{1},T_{2},\ldots) whose inter-renewal time distribution is given by the distribution of N. That is, T_{j+1}-T_{j}\stackrel{\text{d}}{=}N for all j\in\mathbb{N}, and T_{1} is drawn from the equilibrium distribution of the excess life time, i.e.,

(T1=x)={1/τ~if xn,p/τ~if x=n+1,0if x>n+1,x.\mathbb{P}(T_{1}=x)=\left\{\begin{array}[]{ll}1/\tilde{\tau}&\text{if }x\leq n,\\ p/\tilde{\tau}&\text{if }x=n+1,\\ 0&\text{if }x>n+1,\end{array}\right.\quad\forall x\in\mathbb{N}.

Since the process (T_{1},T_{2},\ldots) is a stationary renewal process,

(renewal occurs at t)=(j,Tj=t)=1𝔼[N]=1τ~,t.\mathbb{P}\left(\text{renewal occurs at $t$}\right)=\mathbb{P}\left(\exists j,T_{j}=t\right)=\frac{1}{\mathbb{E}[N]}=\frac{1}{\tilde{\tau}},\quad\forall t\in\mathbb{N}.

We now consider a nonstationary Gaussian bandit instance whose mean reward vector is (re-)drawn uniformly and independently from \{\mu^{(1)},\cdots,\mu^{(k)}\} at times T_{1},T_{2},\ldots. As desired, the effective horizon of this bandit instance matches the target \tau_{\rm eff}:

(AtAt1)=(j,Tj=t)×(AtAt1|j,Tj=t)=1τ~×(11k)=1τeff.\mathbb{P}\left(A_{t}^{*}\neq A_{t-1}^{*}\right)=\mathbb{P}\left(\exists j,T_{j}=t\right)\times\mathbb{P}\left(A_{t}^{*}\neq A_{t-1}^{*}|\exists j,T_{j}=t\right)=\frac{1}{\tilde{\tau}}\times\left(1-\frac{1}{k}\right)=\frac{1}{\tau_{\rm eff}}.

Since T_{j+1}-T_{j}\geq n,

𝔼[t=TjTj+11(Rt,AtRt,At)]𝔼[t=TjTj+n1(Rt,AtRt,At)]=1ki=1k𝔼[t=TjTj+n1(Rt,AtRt,At)|ATj=i]18nk,\mathbb{E}\left[\sum_{t=T_{j}}^{T_{j+1}-1}(R_{t,A_{t}^{*}}-R_{t,A_{t}})\right]\geq\mathbb{E}\left[\sum_{t=T_{j}}^{T_{j}+n-1}(R_{t,A_{t}^{*}}-R_{t,A_{t}})\right]=\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left[\left.\sum_{t=T_{j}}^{T_{j}+n-1}(R_{t,A_{t}^{*}}-R_{t,A_{t}})\right|A_{T_{j}}^{*}=i\right]\geq\frac{1}{8}\sqrt{nk},

where the last inequality follows from Equation 14. Since there are at least \lfloor T/(n+1)\rfloor renewals by time T, we have \mathbb{E}\left[\sum_{t=1}^{T}(R_{t,A_{t}^{*}}-R_{t,A_{t}})\right]\geq\lfloor T/(n+1)\rfloor\sqrt{nk}/8, and therefore,

Δ¯(π)=lim supT𝔼[1Tt=1T(Rt,AtRt,At)]nk8(n+1).\bar{\Delta}_{\infty}(\pi)=\limsup_{T\rightarrow\infty}\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}(R_{t,A_{t}^{*}}-R_{t,A_{t}})\right]\geq\frac{\sqrt{nk}}{8(n+1)}.

Since n+1\leq 2n and n=\lfloor\frac{k-1}{k}\tau_{\rm eff}\rfloor\leq\tau_{\rm eff}, we have \bar{\Delta}_{\infty}(\pi)\geq\frac{1}{16}\sqrt{\frac{k}{\tau_{\rm eff}}}. ∎
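
As a sanity check on the changepoint construction, the sketch below (illustrative only; the parameter values are hypothetical) simulates the stationary renewal process above and verifies that the resulting optimal action process switches with probability 1/\tau_{\rm eff} per period.

```python
# Simulation sketch of the lower-bound environment: changepoints follow the
# stationary renewal process described above, and the best arm is redrawn
# uniformly at each changepoint. Parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
k, tau_eff, T = 10, 50.5, 200_000
tau_tilde = (k - 1) / k * tau_eff
n, p = int(np.floor(tau_tilde)), tau_tilde - np.floor(tau_tilde)

# First renewal time from the equilibrium (excess-life) distribution:
# P(T_1 = x) = 1/tau_tilde for x <= n, and p/tau_tilde for x = n + 1.
probs = np.array([1.0 / tau_tilde] * n + [p / tau_tilde])
t_next = rng.choice(np.arange(1, n + 2), p=probs / probs.sum())

best_arm = rng.integers(k)
A_star = np.empty(T, dtype=int)
for t in range(1, T + 1):
    if t == t_next:                                       # renewal: redraw the best arm
        best_arm = rng.integers(k)
        t_next += n if rng.random() < 1 - p else n + 1    # inter-renewal time N
    A_star[t - 1] = best_arm

switch_rate = np.mean(A_star[1:] != A_star[:-1])
print(f"empirical switch probability: {switch_rate:.5f}, target 1/tau_eff: {1 / tau_eff:.5f}")
```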

A.5 Proof of Remark 5

Let \Delta_{t}:=\mathbb{E}[R_{t,A_{t}^{*}}-R_{t,A_{t}}], G_{t}:=I(A_{t}^{*};(A_{t},O_{t,A_{t}})|\mathcal{F}_{t-1}), and \Gamma_{\lambda,t}:=\Delta_{t}^{\lambda}/G_{t}. Then \Gamma_{\lambda}(\pi)=\sup_{t\in\mathbb{N}}\Gamma_{\lambda,t}, and we have

Δ¯T(π)\displaystyle\bar{\Delta}_{T}(\pi) =T1t=1TΔt\displaystyle=T^{-1}\sum_{t=1}^{T}\Delta_{t}
=T1t=1TΓλ,t1/λGt1/λ\displaystyle=T^{-1}\sum_{t=1}^{T}\Gamma_{\lambda,t}^{1/\lambda}G_{t}^{1/\lambda}
Γλ(π)1/λ(T1t=1TGt1/λ)\displaystyle\,\leq\Gamma_{\lambda}(\pi)^{1/\lambda}\cdot\left(T^{-1}\sum_{t=1}^{T}G_{t}^{1/\lambda}\right)
(a)Γλ(π)1/λ[T1(t=1TGt)1/λ(t=1T1)11/λ]\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\Gamma_{\lambda}(\pi)^{1/\lambda}\cdot\left[T^{-1}\left(\sum_{t=1}^{T}G_{t}\right)^{1/\lambda}\cdot\left(\sum_{t=1}^{T}1\right)^{1-1/\lambda}\right]
=Γλ(π)1/λ(T1t=1TGt)1/λ\displaystyle=\Gamma_{\lambda}(\pi)^{1/\lambda}\cdot\left(T^{-1}\sum_{t=1}^{T}G_{t}\right)^{1/\lambda}
(b)Γλ(π)1/λH¯T(A)1/λ,\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\Gamma_{\lambda}(\pi)^{1/\lambda}\cdot\bar{H}_{T}(A^{*})^{1/\lambda},

where step (a) uses Hölder's inequality, and step (b) uses T^{-1}\sum_{t=1}^{T}G_{t}\leq\bar{H}_{T}(A^{*}).

A.6 Proof of Lemma 2

Recall the definition \Gamma(\pi):=\sup_{t\in\mathbb{N}}\Gamma_{t}(\pi), where

Γt(π)=(𝔼[Rt,AtRt,At])2I(At;(At,Ot,At)t1).\Gamma_{t}(\pi)=\frac{\left(\mathbb{E}\left[R_{t,A^{*}_{t}}-R_{t,A_{t}}\right]\right)^{2}}{I\left(A_{t}^{*};(A_{t},O_{t,A_{t}})\mid\mathcal{F}_{t-1}\right)}.

Our goal is to bound the numerator of \Gamma_{t}(\pi^{\rm TS}) in terms of the denominator.

Let \mathbb{E}_{t}\left[\cdot\right]:=\mathbb{E}\left[\,\cdot\mid X_{t},\mathcal{F}_{t-1}\right] denote the conditional expectation operator that conditions on the observations prior to time t and the context at time t. Similarly, define the probability operator \mathbb{P}_{t}\left(\cdot\right):=\mathbb{P}\left(\cdot\mid X_{t},\mathcal{F}_{t-1}\right) accordingly. Define I_{t}(\cdot;\cdot) to be the function that evaluates mutual information when the base measure is \mathbb{P}_{t}.

The law of iterated expectations states that, for any real-valued random variable Z, \mathbb{E}[\mathbb{E}_{t}[Z]]=\mathbb{E}[Z]. The definition of conditional mutual information states that, for any random variables Z_{1}, Z_{2},

𝔼[It(Z1;Z2)]=I(Z1;Z2Xt,t1).\mathbb{E}\left[I_{t}(Z_{1};Z_{2})\right]=I(Z_{1};Z_{2}\mid X_{t},\mathcal{F}_{t-1}). (15)

Under Assumption 2, there exists a function \mu:\Theta\times\mathcal{X}\times[k]\rightarrow\mathbb{R} such that

μ(θ,x,i)=𝔼[Rt,Atθt=θ,it=i,At].\mu(\theta^{\prime},x,i)=\mathbb{E}\left[R_{t,A_{t}}\mid\theta_{t}=\theta^{\prime},i_{t}=i,A_{t}\right]. (16)

This specifies expected rewards as a function of the parameter and chosen arm, regardless of the specific decision-rule used.

The definition of Thompson sampling is the probability matching property on decision-rules, \mathbb{P}\left(A_{t}=a\mid\mathcal{F}_{t-1}\right)=\mathbb{P}\left(A^{*}_{t}=a\mid\mathcal{F}_{t-1}\right) for each a\in\mathcal{A}. It implies the following probability matching property on arms: with i^{*}_{t}:=A_{t}^{*}(X_{t})\in[k],

t(it=i)=(At(Xt)=it1,Xt)=(At(Xt)=it1,Xt)=t(it=i),\mathbb{P}_{t}\left(i^{*}_{t}=i\right)=\mathbb{P}\left(A^{*}_{t}(X_{t})=i\mid\mathcal{F}_{t-1},X_{t}\right)=\mathbb{P}\left(A_{t}(X_{t})=i\mid\mathcal{F}_{t-1},X_{t}\right)=\mathbb{P}_{t}\left(i_{t}=i\right),

which holds for each i\in[k].

With this setup, repeating the analysis in Proposition 3 (or Corollary 1) of Russo and Van Roy [2016] immediately implies that

(𝔼t[μ(θt,Xt,it)μ(θt,Xt,it)])22σ2kIt(it;(it,Rt)).\left(\mathbb{E}_{t}\left[\mu(\theta_{t},X_{t},i_{t}^{*})-\mu(\theta_{t},X_{t},i_{t})\right]\right)^{2}\leq 2\cdot\sigma^{2}\cdot k\cdot I_{t}\left(i^{*}_{t};(i_{t},R_{t})\right). (17)

(Conditioned on the context, one can repeat the same proof to relate regret to information gain about the optimal arm.) We now complete the proof:

(𝔼[Rt,AtRt,At])2\displaystyle\left(\mathbb{E}\left[R_{t,A^{*}_{t}}-R_{t,A_{t}}\right]\right)^{2} =(a)(𝔼[μ(θt,Xt,it)μ(θt,Xt,it)])2\overset{(a)}{=}\left(\mathbb{E}\left[\mu(\theta_{t},X_{t},i_{t}^{*})-\mu(\theta_{t},X_{t},i_{t})\right]\right)^{2}
(b)𝔼[(𝔼t[μ(θt,Xt,it)μ(θt,Xt,it)])2]\displaystyle\overset{(b)}{\leq}\mathbb{E}\left[\left(\mathbb{E}_{t}\left[\mu(\theta_{t},X_{t},i_{t}^{*})-\mu(\theta_{t},X_{t},i_{t})\right]\right)^{2}\right]
(c)2σ2k𝔼[It(it;(it,Rt))]\displaystyle\overset{(c)}{\leq}2\cdot\sigma^{2}\cdot k\cdot\mathbb{E}\left[I_{t}\left(i^{*}_{t};(i_{t},R_{t})\right)\right]
=(d)2σ2kI(it;(it,Rt)Xt,t1)\displaystyle\overset{(d)}{=}2\cdot\sigma^{2}\cdot k\cdot I\left(i^{*}_{t};(i_{t},R_{t})\mid X_{t},\mathcal{F}_{t-1}\right)
(e)2σ2kI(At;(it,Rt)Xt,t1)\displaystyle\overset{(e)}{\leq}2\cdot\sigma^{2}\cdot k\cdot I\left(A^{*}_{t};(i_{t},R_{t})\mid X_{t},\mathcal{F}_{t-1}\right)
(f)2σ2kI(At;(At,Rt)Xt,t1)\displaystyle\overset{(f)}{\leq}2\cdot\sigma^{2}\cdot k\cdot I\left(A^{*}_{t};(A_{t},R_{t})\mid X_{t},\mathcal{F}_{t-1}\right)
=(g)2σ2k[I(At;(At,Xt,Rt)t1)I(At;Xtt1)]\displaystyle\overset{(g)}{=}2\cdot\sigma^{2}\cdot k\cdot\left[I\left(A^{*}_{t};(A_{t},X_{t},R_{t})\mid\mathcal{F}_{t-1}\right)-I\left(A^{*}_{t};X_{t}\mid\mathcal{F}_{t-1}\right)\right]
(h)2σ2kI(At;(At,Xt,Rt)t1)\displaystyle\overset{(h)}{\leq}2\cdot\sigma^{2}\cdot k\cdot I\left(A^{*}_{t};(A_{t},X_{t},R_{t})\mid\mathcal{F}_{t-1}\right)
=(i)2σ2kI(At;(At,Ot)t1),\displaystyle\overset{(i)}{=}2\cdot\sigma^{2}\cdot k\cdot I\left(A^{*}_{t};(A_{t},O_{t})\mid\mathcal{F}_{t-1}\right),

where step (a) uses (16), step (b) is Jensen’s inequality, step (c) applies (17), step (d) is (15), steps (e) and (f) apply the data processing inequality, step (g) uses the chain-rule of mutual information, step (h) uses that mutual information is non-negative, and step (i) simply recalls that Ot=(Xt,Rt)O_{t}=(X_{t},R_{t}).

A.7 Proof of Corollary 3

If \theta_{t}\in\{-1,-1+\epsilon,\ldots,1-\epsilon,1\}^{p} is a discretized p-dimensional vector and the optimal action process (A^{*}_{t})_{t\in\mathbb{N}} is stationary, then, by Proposition 1,

H¯(A)\displaystyle\bar{H}(A^{*}) 1+log(τeff)+H(At|AtAt1)τeff\displaystyle\leq\frac{1+\log(\tau_{\textup{eff}})+H(A_{t}^{*}|A_{t}^{*}\neq A_{t-1}^{*})}{\tau_{\textup{eff}}}
1+log(τeff)+H(θt)τeff\displaystyle\leq\frac{1+\log(\tau_{\textup{eff}})+H(\theta_{t})}{\tau_{\textup{eff}}}
1+log(τeff)+pln(2/ϵ)τeff.\displaystyle\leq\frac{1+\log(\tau_{\textup{eff}})+p\ln(2/\epsilon)}{\tau_{\textup{eff}}}.

Combining this with Theorem 1 and the information ratio bound in Lemma 2 gives

Δ¯(πTS)2σ2k×1+log(τeff)+pln(2/ϵ)τeff=O~(σpkτeff).\bar{\Delta}_{\infty}(\pi^{\rm TS})\leq\sqrt{2\cdot\sigma^{2}\cdot k\times\frac{1+\log(\tau_{\textup{eff}})+p\ln(2/\epsilon)}{\tau_{\textup{eff}}}}=\tilde{O}\left(\sigma\sqrt{\frac{p\cdot k}{\tau_{\textup{eff}}}}\right).

A.8 Proof of Theorem 2

The proof is almost identical to that of Theorem 1. Let \Delta_{t}^{\dagger}:=\mathbb{E}[R_{t,A_{t}^{\dagger}}-R_{t,A_{t}}], G_{t}^{\dagger}:=I(A_{t}^{\dagger};(A_{t},O_{t,A_{t}})|\mathcal{F}_{t-1}), and \Gamma_{t}^{\dagger}=(\Delta_{t}^{\dagger})^{2}/G_{t}^{\dagger}. Then,

t=1TΔt=t=1TΓtGtt=1TΓtt=1TGtΓ(π;A)Tt=1TGt,\sum_{t=1}^{T}\Delta_{t}^{\dagger}=\sum_{t=1}^{T}\sqrt{\Gamma_{t}^{\dagger}}\sqrt{G_{t}^{\dagger}}\leq\sqrt{\sum_{t=1}^{T}\Gamma_{t}^{\dagger}}\sqrt{\sum_{t=1}^{T}G_{t}^{\dagger}}\leq\sqrt{\Gamma(\pi;A^{\dagger})\cdot T\cdot\sum_{t=1}^{T}G_{t}^{\dagger}},

and

t=1TGt\displaystyle\sum_{t=1}^{T}G_{t}^{\dagger} =t=1TI(At;(At,Ot,At)|t1)\displaystyle=\sum_{t=1}^{T}I(A_{t}^{\dagger};(A_{t},O_{t,A_{t}})|\mathcal{F}_{t-1})
I(A1:T;T)\displaystyle\leq I(A_{1:T}^{\dagger};\mathcal{F}_{T})
I(A1:T;(θ1:T,T))\displaystyle\leq I(A_{1:T}^{\dagger};(\theta_{1:T},\mathcal{F}_{T}))
=I(A1:T;θ1:T)+I(A1:T;T|θ1:T)\displaystyle=I(A_{1:T}^{\dagger};\theta_{1:T})+I(A_{1:T}^{\dagger};\mathcal{F}_{T}|\theta_{1:T})
=I(A1:T;θ1:T),\displaystyle=I(A_{1:T}^{\dagger};\theta_{1:T}),

where the last step uses the fact that I(A_{1:T}^{\dagger};\mathcal{F}_{T}|\theta_{1:T})=0, since A_{1:T}^{\dagger}\perp\mathcal{F}_{T}\,|\,\theta_{1:T} according to Definition 4. Combining these results,

Δ¯T(π;A)=t=1TΔtTΓ(π;A)t=1TGtTΓ(π;A)I¯T(A1:T;θ1:T).\bar{\Delta}_{T}(\pi;A^{\dagger})=\frac{\sum_{t=1}^{T}\Delta_{t}^{\dagger}}{T}\leq\sqrt{\frac{\Gamma(\pi;A^{\dagger})\cdot\sum_{t=1}^{T}G_{t}^{\dagger}}{T}}\leq\sqrt{\Gamma(\pi;A^{\dagger})\cdot\bar{I}_{T}(A_{1:T}^{\dagger};\theta_{1:T})}.

The bound on \bar{\Delta}_{\infty}(\pi;A^{\dagger}) follows by taking the limit T\rightarrow\infty on both sides.

A.9 Proof of Proposition 4

We state and prove a concrete version of Proposition 4.

Proposition 6 (Concrete version of Proposition 4).

If |\mathcal{A}|\geq 2 and T\geq 2, then

¯T(D)2V¯TD(1+log(1+min{T,D2V¯T})+log|𝒜|)+3log(|𝒜|T)T,\bar{\mathcal{R}}_{T}(D)\leq\frac{2\bar{V}_{T}}{D}\cdot\left(1+\log\left(1+\min\left\{T,\frac{D}{2\bar{V}_{T}}\right\}\right)+\log|\mathcal{A}|\right)+\frac{3\log(|\mathcal{A}|T)}{T},

and

infπΔ¯T(π)\displaystyle\inf_{\pi}\bar{\Delta}_{T}(\pi) 5(ΓUlog|𝒜|V¯T)1/3log(1+min{T,(ΓUlog|𝒜|)1/3V¯T2/3})+3ΓUlog(|𝒜|T)T\displaystyle\leq 5\cdot\left(\Gamma_{U}\cdot\log|\mathcal{A}|\cdot\bar{V}_{T}\right)^{1/3}\cdot\sqrt{\log\left(1+\min\left\{T,\frac{\big{(}\Gamma_{U}\cdot\log|\mathcal{A}|\big{)}^{1/3}}{\bar{V}_{T}^{2/3}}\right\}\right)}+\sqrt{\frac{3\Gamma_{U}\log(|\mathcal{A}|T)}{T}}
10(ΓUlog|𝒜|V¯T)1/3log(T)+3ΓUlog(|𝒜|T)T.\displaystyle\leq 10\cdot\left(\Gamma_{U}\cdot\log|\mathcal{A}|\cdot\bar{V}_{T}\right)^{1/3}\cdot\log(T)+\sqrt{\frac{3\Gamma_{U}\log(|\mathcal{A}|T)}{T}}.
Proof.

For brevity, we define \mu_{t}(a):=\mathbb{E}[R_{t,a}|\theta_{t}] so that \Delta_{t}(a)=\mu_{t}(A_{t}^{*})-\mu_{t}(a).

Fix D and consider a satisficing action sequence A^{\dagger} that re-synchronizes to A^{*} whenever a D-gap occurs:

At={Atif t=1 or μt(At)μt(At1)+D,At1otherwise.A_{t}^{\dagger}=\left\{\begin{array}[]{ll}A_{t}^{*}&\text{if }t=1\text{ or }\mu_{t}(A_{t}^{*})\geq\mu_{t}(A_{t-1}^{\dagger})+D,\\ A_{t-1}^{\dagger}&\text{otherwise}.\end{array}\right.

It is immediate that \bar{\Delta}_{T}(A^{\dagger};A^{*})\leq D, since \mu_{t}(A_{t}^{*})-\mu_{t}(A_{t}^{\dagger})\leq D for every t. The sketch below illustrates this construction.
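
As an illustration (not part of the proof), the D-gap rule can be computed by a single pass over any realized mean-reward path; the inputs below are hypothetical placeholders.

```python
# Sketch of the D-gap satisficing action sequence A^dagger used in the proof:
# it copies A*_t only when the current best arm beats the currently held
# satisficing arm by at least D. Inputs are hypothetical placeholders.
import numpy as np

def satisficing_sequence(mu, D):
    """mu: (T, k) array of mean rewards mu_t(a); returns A^dagger (0-indexed arms)."""
    T = mu.shape[0]
    A_star = mu.argmax(axis=1)
    A_dag = np.empty(T, dtype=int)
    A_dag[0] = A_star[0]
    for t in range(1, T):
        if mu[t, A_star[t]] >= mu[t, A_dag[t - 1]] + D:
            A_dag[t] = A_star[t]         # re-synchronize when a D-gap opens
        else:
            A_dag[t] = A_dag[t - 1]      # otherwise keep the previous target
    return A_dag

rng = np.random.default_rng(1)
mu = np.cumsum(0.05 * rng.standard_normal((1000, 3)), axis=0)  # slowly drifting means
A_dag = satisficing_sequence(mu, D=0.5)
gap = mu.max(axis=1) - mu[np.arange(1000), A_dag]
switches = int((A_dag[1:] != A_dag[:-1]).sum())
print(f"switches: {switches}, max gap mu_t(A*_t) - mu_t(A_dag_t): {gap.max():.3f} (<= D)")
```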

Step 1.

Let V_{T}:=\bar{V}_{T}\times T and S_{T}^{\dagger}:=\mathbb{E}\left[1+\sum_{t=2}^{T}\mathbb{I}\{A_{t}^{\dagger}\neq A_{t-1}^{\dagger}\}\right]. We first show that S_{T}^{\dagger}\leq\frac{2V_{T}}{D}+1 by arguing that the total variation must increase by at least D/2 whenever the satisficing action switches. Observe that, in every sample path,

2×t=2Tsupa𝒜|Δt(a)Δt1(a)|\displaystyle 2\times\sum_{t=2}^{T}\sup_{a\in\mathcal{A}}\left|\Delta_{t}(a)-\Delta_{t-1}(a)\right| t=2T|Δt(At)Δt1(At)|+|Δt(At1)Δt1(At1)|\displaystyle\geq\sum_{t=2}^{T}\left|\Delta_{t}(A_{t}^{*})-\Delta_{t-1}(A_{t}^{*})\right|+\left|\Delta_{t}(A_{t-1}^{\dagger})-\Delta_{t-1}(A_{t-1}^{\dagger})\right|
=t=2T|Δt(At)Δt1(At)|+|Δt1(At1)Δt(At1)|\displaystyle=\sum_{t=2}^{T}\left|\Delta_{t}(A_{t}^{*})-\Delta_{t-1}(A_{t}^{*})\right|+\left|\Delta_{t-1}(A_{t-1}^{\dagger})-\Delta_{t}(A_{t-1}^{\dagger})\right|
t=2T(Δt(At)Δt1(At))+(Δt1(At1)Δt(At1))\displaystyle\geq\sum_{t=2}^{T}\left(\Delta_{t}(A_{t}^{*})-\Delta_{t-1}(A_{t}^{*})\right)+\left(\Delta_{t-1}(A_{t-1}^{\dagger})-\Delta_{t}(A_{t-1}^{\dagger})\right)
=t=2T(Δt(At)Δt(At1))(Δt1(At)Δt1(At1))\displaystyle=\sum_{t=2}^{T}\left(\Delta_{t}(A_{t}^{*})-\Delta_{t}(A_{t-1}^{\dagger})\right)-\left(\Delta_{t-1}(A_{t}^{*})-\Delta_{t-1}(A_{t-1}^{\dagger})\right)
=(a)t=2T(μt(At)μt(At1))(μt1(At)μt1(At1))\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\sum_{t=2}^{T}\left(\mu_{t}(A_{t}^{*})-\mu_{t}(A_{t-1}^{\dagger})\right)-\left(\mu_{t-1}(A_{t}^{*})-\mu_{t-1}(A_{t-1}^{\dagger})\right)
(b)t=2T(μt(At)μt(At1))(μt1(At1)μt1(At1))\displaystyle\stackrel{{\scriptstyle(b)}}{{\geq}}\sum_{t=2}^{T}\left(\mu_{t}(A_{t}^{*})-\mu_{t}(A_{t-1}^{\dagger})\right)-\left(\mu_{t-1}(A_{t-1}^{*})-\mu_{t-1}(A_{t-1}^{\dagger})\right)
=t=2T(μt(At)μt(At))(μt1(At1)μt1(At1))+(μt(At)μt(At1))\displaystyle=\sum_{t=2}^{T}\left(\mu_{t}(A_{t}^{*})-\mu_{t}(A_{t}^{\dagger})\right)-\left(\mu_{t-1}(A_{t-1}^{*})-\mu_{t-1}(A_{t-1}^{\dagger})\right)+\left(\mu_{t}(A_{t}^{\dagger})-\mu_{t}(A_{t-1}^{\dagger})\right)
=(μT(AT)μT(AT))(μ1(A1)μ1(A1))+t=2T(μt(At)μt(At1))\displaystyle=\left(\mu_{T}(A_{T}^{*})-\mu_{T}(A_{T}^{\dagger})\right)-\left(\mu_{1}(A_{1}^{*})-\mu_{1}(A_{1}^{\dagger})\right)+\sum_{t=2}^{T}\left(\mu_{t}(A_{t}^{\dagger})-\mu_{t}(A_{t-1}^{\dagger})\right)
=(μT(AT)μT(AT))0+t=2T(μt(At)μt(At1))𝕀{AtAt1}\displaystyle=\left(\mu_{T}(A_{T}^{*})-\mu_{T}(A_{T}^{\dagger})\right)-0+\sum_{t=2}^{T}\left(\mu_{t}(A_{t}^{\dagger})-\mu_{t}(A_{t-1}^{\dagger})\right)\mathbb{I}\{A_{t}^{\dagger}\neq A_{t-1}^{\dagger}\}
(c)0+t=2TD𝕀{AtAt1},\displaystyle\stackrel{{\scriptstyle(c)}}{{\geq}}0+\sum_{t=2}^{T}D\cdot\mathbb{I}\{A_{t}^{\dagger}\neq A_{t-1}^{\dagger}\},

where step (a) uses the definition of \Delta_{t}(a), step (b) uses the fact that \mu_{t-1}(A_{t}^{*})\leq\mu_{t-1}(A_{t-1}^{*}), and step (c) holds since \mu_{T}(A_{T}^{*})-\mu_{T}(A_{T}^{\dagger})\geq 0 and, if A_{t}^{\dagger}\neq A_{t-1}^{\dagger}, then A_{t}^{*}=A_{t}^{\dagger} and \mu_{t}(A_{t}^{*})-\mu_{t}(A_{t-1}^{\dagger})\geq D. Taking expectations on both sides, we obtain 2\times V_{T}\geq D\times\mathbb{E}[\sum_{t=2}^{T}\mathbb{I}\{A_{t}^{\dagger}\neq A_{t-1}^{\dagger}\}]=D\times(S_{T}^{\dagger}-1).

Step 2.

We further bound \bar{\mathcal{R}}_{T}(D) using Proposition 2:

¯T(D)I¯T(A;θ)H¯T(A)STT(1+log(1+TST)+log|𝒜|)+logTT.\bar{\mathcal{R}}_{T}(D)\leq\bar{I}_{T}(A^{\dagger};\theta)\leq\bar{H}_{T}(A^{\dagger})\leq\frac{S_{T}^{\dagger}}{T}\cdot\left(1+\log\left(1+\frac{T}{S_{T}^{\dagger}}\right)+\log|\mathcal{A}|\right)+\frac{\log T}{T}.

Let x\wedge y=\min\{x,y\} and x\vee y=\max\{x,y\}. Since 1\leq S_{T}^{\dagger}\leq\left(\frac{2V_{T}}{D}+1\right)\wedge T,

¯T(D)\displaystyle\bar{\mathcal{R}}_{T}(D) STT(1+log(1+TST)+log|𝒜|)+logTT\displaystyle\leq\frac{S_{T}^{\dagger}}{T}\cdot\left(1+\log\left(1+\frac{T}{S_{T}^{\dagger}}\right)+\log|\mathcal{A}|\right)+\frac{\log T}{T}
2VT/D+1T(1+log(1+TST)+log|𝒜|)+logTT\displaystyle\leq\frac{2V_{T}/D+1}{T}\cdot\left(1+\log\left(1+\frac{T}{S_{T}^{\dagger}}\right)+\log|\mathcal{A}|\right)+\frac{\log T}{T}
2VTDT(1+log(1+TST)+log|𝒜|)+1+log(1+T)+log|𝒜|T+logTT\displaystyle\leq\frac{2V_{T}}{DT}\cdot\left(1+\log\left(1+\frac{T}{S_{T}^{\dagger}}\right)+\log|\mathcal{A}|\right)+\frac{1+\log(1+T)+\log|\mathcal{A}|}{T}+\frac{\log T}{T}
2VTDT(1+log(1+T(1DT2VT))+log|𝒜|)+2logT+3log|𝒜|T+logTT\displaystyle\leq\frac{2V_{T}}{DT}\cdot\left(1+\log\left(1+T\wedge\left(1\vee\frac{DT}{2V_{T}}\right)\right)+\log|\mathcal{A}|\right)+\frac{2\log T+3\log|\mathcal{A}|}{T}+\frac{\log T}{T}
=2VTDT(1+log(1+T(1DT2VT))+log|𝒜|)+3log(|𝒜|T)T,\displaystyle=\frac{2V_{T}}{DT}\cdot\left(1+\log\left(1+T\wedge\left(1\vee\frac{DT}{2V_{T}}\right)\right)+\log|\mathcal{A}|\right)+\frac{3\log(|\mathcal{A}|T)}{T},

where the last inequality uses 1\leq 2\log|\mathcal{A}| for |\mathcal{A}|\geq 2 and \log(1+T)\leq 2\log T for T\geq 2.

Step 3.

To simplify notation, let k:=|\mathcal{A}| and \kappa:=1+\log k. We consider a satisficing action sequence A^{\dagger} constructed with D=D^{*}:=\left(2\cdot\Gamma_{U}\cdot\kappa\cdot\bar{V}_{T}\right)^{1/3}. By Theorem 3, with \tau^{*}:=T\wedge\left(1\vee\frac{D^{*}}{2\bar{V}_{T}}\right),

infπΔ¯T(π)\displaystyle\inf_{\pi}\bar{\Delta}_{T}(\pi) D+ΓUׯT(D)\displaystyle\leq D^{*}+\sqrt{\Gamma_{U}\times\bar{\mathcal{R}}_{T}(D^{*})}
D+ΓU×[2V¯TD(κ+log(1+τ))+3log(kT)T]\displaystyle\leq D^{*}+\sqrt{\Gamma_{U}\times\left[\frac{2\bar{V}_{T}}{D^{*}}\cdot\left(\kappa+\log\left(1+\tau^{*}\right)\right)+\frac{3\log(kT)}{T}\right]}
(a)D+ΓU×[2V¯TD(κ+log(1+τ))]+3ΓUlog(kT)T\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}D^{*}+\sqrt{\Gamma_{U}\times\left[\frac{2\bar{V}_{T}}{D^{*}}\cdot\left(\kappa+\log\left(1+\tau^{*}\right)\right)\right]}+\sqrt{\frac{3\Gamma_{U}\log(kT)}{T}}
=D+2ΓUκV¯TD×(1+log(1+τ)κ)+3ΓUlog(kT)T\displaystyle=D^{*}+\sqrt{\frac{2\Gamma_{U}\kappa\bar{V}_{T}}{D^{*}}\times\left(1+\frac{\log\left(1+\tau^{*}\right)}{\kappa}\right)}+\sqrt{\frac{3\Gamma_{U}\log(kT)}{T}}
=D+D×1+log(1+τ)κ+3ΓUlog(kT)T\displaystyle=D^{*}+D^{*}\times\sqrt{1+\frac{\log\left(1+\tau^{*}\right)}{\kappa}}+\sqrt{\frac{3\Gamma_{U}\log(kT)}{T}}
(b)2.6×D×log(1+τ)+3ΓUlog(kT)T,\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}2.6\times D^{*}\times\sqrt{\log\left(1+\tau^{*}\right)}+\sqrt{\frac{3\Gamma_{U}\log(kT)}{T}},

where step (a) uses \sqrt{x+y}\leq\sqrt{x}+\sqrt{y} for any x,y\geq 0, and step (b) uses \kappa\geq 1 and \log\left(1+\tau^{*}\right)\geq\log 2. Since \kappa\leq 3\log k, we have

D=(2ΓUκV¯T)1/3(6ΓUlogkV¯T)1/31.9(ΓUlogkV¯T)1/3.D^{*}=(2\cdot\Gamma_{U}\cdot\kappa\cdot\bar{V}_{T})^{1/3}\leq(6\cdot\Gamma_{U}\cdot\log k\cdot\bar{V}_{T})^{1/3}\leq 1.9\cdot(\Gamma_{U}\cdot\log k\cdot\bar{V}_{T})^{1/3}.

Therefore,

infπΔ¯T(π)5(ΓUlogkV¯T)1/3×log(1+min{T,(ΓUlogkV¯T)1/3V¯T})+3ΓUlog(kT)T.\inf_{\pi}\bar{\Delta}_{T}(\pi)\leq 5\cdot(\Gamma_{U}\cdot\log k\cdot\bar{V}_{T})^{1/3}\times\sqrt{\log\left(1+\min\left\{T,\frac{(\Gamma_{U}\cdot\log k\cdot\bar{V}_{T})^{1/3}}{\bar{V}_{T}}\right\}\right)}+\sqrt{\frac{3\Gamma_{U}\log(kT)}{T}}.

The final claim is obtained by observing that \sqrt{\log(1+T)}\leq 2\log T for T\geq 2. ∎

Appendix B Comparison with Liu et al. [2023]

In this section, we provide a detailed comparison between our work and Liu et al. [2023].

Both papers adopt a Bayesian perspective that models the nonstationarity through a stochastic latent parameter process whose law is known to the decision-maker. Both papers consider Thompson sampling (TS), the rule that employs probability matching with respect to the optimal action sequence A^{*}. The primary concern in Liu et al. [2023] is the potential shortcoming of TS in rapidly changing environments where learning A^{*} is hopeless; our paper considers this shortcoming of TS in Section 7. To tackle this challenge, both studies propose variants of TS designed to ‘satisfice’, i.e., to learn and play a target other than A^{*}.

More specifically, Liu et al. [2023] propose an elegant algorithm called predictive sampling (PS) that improves over TS by ‘‘deprioritizing acquiring information that quickly loses usefulness’’. Using our notation, PS's learning target is A_{t}^{\dagger}=\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\mathbb{E}[R_{t,a}|\mathcal{F}_{t-1},(R_{t^{\prime},a^{\prime}})_{a^{\prime}\in\mathcal{A},t^{\prime}\geq t+1}], the arm that would be played by a Bayesian decision-maker informed of all future rewards but not of the latent parameters.666In its implementation, PS generates a fictitious future scenario and infers the best arm after simulating the posterior updating procedure with respect to this future scenario. This makes the algorithm take into account the total amount of information about the current state that the decision-maker can accumulate in the future, and hence prevents it from being overly optimistic about the learnability of the target. A similar idea can be found in Min et al. [2019] for stationary bandit problems with a finite time horizon. By ignoring the unresolvable uncertainty in the latent states (i.e., the uncertainty that remains even conditioned on all future reward observations), PS avoids wasteful exploration. We remark that one of our suggested satisficing action sequence designs, Example 12, was motivated by Liu et al. [2023]. But our satisficing TS cannot fully generalize PS, since we restrict the learning target to be determined independently of the algorithm's decisions (see Definition 4), whereas A_{t}^{\dagger} above depends on the history of observations under the algorithm.

Liu et al. [2023] use an information-ratio analysis to bound an algorithm's ‘foresight regret’ in terms of the total possible ‘predictive information.’ Their custom modifications to the information-ratio analysis of Russo and Van Roy [2016] enable an analysis of predictive sampling that does not seem possible otherwise. By contrast, our analysis does not apply to predictive sampling, but it produces a wealth of results that do not follow from Liu et al. [2023]. We elaborate on three main differences in the information-theoretic analysis.

First, Liu et al. [2023] bound a custom suboptimality measure that they call foresight regret, whereas we bound conventional (‘dynamic’) regret. Compared to the conventional regret, which we denote by \bar{\Delta}_{T}(\pi;A^{*}), foresight regret can serve as a tighter benchmark that better quantifies the range of achievable performance, since it is always non-negative while never exceeding the conventional regret. For the same reason, however, their results do not provide upper bounds on the conventional regret, a metric treated in a large body of prior work. Since our work bounds the conventional regret, it lets us recover results that are comparable to the existing literature; see, for instance, Section 9.

Second, whereas Liu et al. [2023] introduce a new information ratio, partly specialized to predictive sampling, we use the same information ratio as Russo and Van Roy [2016] and many subsequent papers. Using the conventional definition of the information ratio enables us to systematically inherit all the upper bounds established for stationary settings (see Section 6.1). Liu et al. [2023] provide an upper bound on their modified information ratio for a particular nonstationary environment they term the ‘modulated k-armed Bernoulli bandit’ (identical to our Example 5), but more research is required to bound it in other settings.

Finally, because we use a different definition of the information ratio, our bounds depend on a different notion of maximal cumulative information gain. The corresponding term in our Theorem 1 is simply the entropy rate of the optimal action process (and, elsewhere, a mutual information rate). As one of the most fundamental quantities in information theory, this entropy rate can be bounded by numerous standard techniques, as in Section 5.2. The regret bound in Liu et al. [2023] involves a term that they call cumulative predictive information. They bound this quantity in the special case of modulated k-armed bandits (equivalent to Example 5). In that example, their bound on cumulative predictive information is roughly k times larger than our bound on the entropy rate of the optimal action process. As a result, Liu et al. [2023] obtain a final regret bound that scales with k (see Theorem 5 of Liu et al. [2023]), whereas our bound scales with \sqrt{k\log(k)} (see Example 6).

Appendix C Numerical Experiment in Detail

Here we describe in detail the numerical experiment conducted in Section 1.1.

Generative model.

We say a stochastic process (X_{t})_{t\in\mathbb{N}}\sim\mathcal{GP}(\sigma_{X}^{2},\tau_{X}) if (X_{1},\ldots,X_{t}) follows a multivariate normal distribution satisfying

𝔼[Xi]=0,Cov(Xi,Xj)=σX2exp(12(ijτX)2),\mathbb{E}[X_{i}]=0,\quad\text{Cov}(X_{i},X_{j})=\sigma_{X}^{2}\exp\left(-\frac{1}{2}\left(\frac{i-j}{\tau_{X}}\right)^{2}\right),

for any i,j\in[t] and any t\in\mathbb{N}. Note that this process is stationary, and, given a horizon T, a sample path (X_{1},\ldots,X_{T}) can be generated by drawing a single multivariate normal random vector from the distribution specified by \sigma_{X}^{2} and \tau_{X}.

As described in Section 1.1, we consider a nonstationary two-armed Gaussian bandit with unit noise variance:

Rt,a=θtcm+θt,aid:=μt,a+ϵt,a,a{1,2},t,R_{t,a}=\underbrace{\theta_{t}^{\rm cm}+\theta_{t,a}^{\rm id}}_{:=\mu_{t,a}}+\epsilon_{t,a},\quad\forall a\in\{1,2\},t\in\mathbb{N},

where the \epsilon_{t,a}'s are i.i.d. \mathcal{N}(0,1^{2}) noises, \big{(}\theta_{t}^{\rm cm}\big{)}_{t\in\mathbb{N}} is the common variation process \sim\mathcal{GP}(1^{2},\tau^{\rm cm}), and \big{(}\theta_{t,a}^{\rm id}\big{)}_{t\in\mathbb{N}} is arm a's idiosyncratic variation process \sim\mathcal{GP}(1^{2},\tau^{\rm id}).

Note that the optimal action process is completely determined by \theta^{\rm id}:

At={1if θt,1idθt,2id2if θt,1id<θt,2id,A_{t}^{*}=\left\{\begin{array}[]{cc}1&\text{if }\theta_{t,1}^{\rm id}\geq\theta_{t,2}^{\rm id}\\ 2&\text{if }\theta_{t,1}^{\rm id}<\theta_{t,2}^{\rm id}\end{array}\right.,

and the optimal action switches more frequently when \tau^{\rm id} is small (compare Figure 6 with Figure 1).


Figure 6: A sample path generated with \tau^{\rm cm}=\tau^{\rm id}=10 (cf. Figure 1, which was generated with \tau^{\rm cm}=10 and \tau^{\rm id}=50).

Consequently, the problem's effective horizon \tau_{\rm eff}:=1/\mathbb{P}(A_{t}^{*}\neq A_{t-1}^{*}), defined in (8), depends only on \tau^{\rm id}. To visualize this relationship, we estimate \tau_{\rm eff} using the sample average of the number of switches that occur over T=1000 periods (averaged across 100 sample paths), while varying \tau^{\rm id} from 1 to 100. See Figure 7. As expected, \tau_{\rm eff} is linear in \tau^{\rm id} (more specifically, \tau_{\rm eff}\approx\pi\times\tau^{\rm id} by the Rice formula).


Figure 7: The effective time horizon \tau_{\rm eff} as a function of \tau^{\rm id}\in\{1,5,10,\ldots,100\}, estimated from 100 randomly generated sample paths.
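
The estimate in Figure 7 can be reproduced along the following lines; this is an illustrative reconstruction of the procedure described above, not the original code, and the function names are ours.

```python
# Sketch of the tau_eff estimation procedure: sample the idiosyncratic GP paths,
# read off the optimal action process, and estimate the switch probability.
import numpy as np

def gp_sample_paths(sigma2, tau, T, n_paths, rng):
    """Draw sample paths from GP(sigma2, tau) with squared-exponential kernel."""
    t = np.arange(T)
    K = sigma2 * np.exp(-0.5 * ((t[:, None] - t[None, :]) / tau) ** 2)
    K += 1e-9 * np.eye(T)                                 # jitter for numerical stability
    L = np.linalg.cholesky(K)
    return (L @ rng.standard_normal((T, n_paths))).T      # shape (n_paths, T)

def estimate_tau_eff(tau_id, T=1000, n_paths=100, seed=0):
    rng = np.random.default_rng(seed)
    theta1 = gp_sample_paths(1.0, tau_id, T, n_paths, rng)
    theta2 = gp_sample_paths(1.0, tau_id, T, n_paths, rng)
    A_star = (theta1 < theta2).astype(int)                # optimal arm depends only on theta^id
    switch_freq = np.mean(A_star[:, 1:] != A_star[:, :-1])
    return 1.0 / switch_freq

for tau_id in [10, 25, 50]:
    print(f"tau_id={tau_id}: estimated tau_eff ~ {estimate_tau_eff(tau_id):.1f}, "
          f"Rice-formula prediction ~ {np.pi * tau_id:.1f}")
```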
Tested bandit algorithms.

Given the generative model described above, we evaluate four algorithms – Thompson sampling (TS), Sliding-Window TS (SW-TS; Trovo et al. [2020]), Sliding-Window Upper-Confidence-Bound (SW-UCB; Garivier and Moulines [2011]), and Uniform.

Thompson sampling (TS) is assumed to know the dynamics of the latent state processes as well as the exact values of \tau^{\rm cm} and \tau^{\rm id} (i.e., no model/prior misspecification). More specifically, in each period t, \pi^{\rm TS} draws a random sample (\tilde{\theta}_{t,1}^{\rm id},\tilde{\theta}_{t,2}^{\rm id}) of the latent state (\theta_{t,1}^{\rm id},\theta_{t,2}^{\rm id}) from its posterior distribution, and then selects the arm A_{t}\leftarrow\operatorname*{\mathrm{argmax}}_{a\in\{1,2\}}\tilde{\theta}_{t,a}^{\rm id}. Here, the posterior distribution of (\theta_{t,1}^{\rm id},\theta_{t,2}^{\rm id}) given the history \mathcal{F}_{t-1}=(A_{1},R_{1},\ldots,A_{t-1},R_{t-1}) is a multivariate normal distribution that can be computed as follows. Given the past action sequence (A_{1},\ldots,A_{t-1})\in\mathcal{A}^{t-1}, the (conditional) distribution of (R_{1},\ldots,R_{t-1},\theta_{t,1}^{\rm id},\theta_{t,2}^{\rm id}) is given by

[R1Rt1θt,1idθt,2id]|(A1,,At1)𝒩(0(t1)+2,[Σt,RR(t1)×(t1)Σt,θR(t1)×2Σt,θR2×(t1)Σt,θθ2×2]),\left.\left[\begin{array}[]{c}R_{1}\\ \vdots\\ R_{t-1}\\ \theta_{t,1}^{\rm id}\\ \theta_{t,2}^{\rm id}\end{array}\right]\right|(A_{1},\ldots,A_{t-1})\sim\mathcal{N}\left(0\in\mathbb{R}^{(t-1)+2},\left[\begin{array}[]{cc}\Sigma_{t,RR}\in\mathbb{R}^{(t-1)\times(t-1)}&\Sigma_{t,\theta R}^{\top}\in\mathbb{R}^{(t-1)\times 2}\\ \Sigma_{t,\theta R}\in\mathbb{R}^{2\times(t-1)}&\Sigma_{t,\theta\theta}\in\mathbb{R}^{2\times 2}\end{array}\right]\right),

where the pairwise covariances are given by (\Sigma_{t,RR})_{ij}:=\text{Cov}(R_{i},R_{j}|A_{i},A_{j})=\mathbb{I}\{i=j\}\cdot\text{Var}(\epsilon_{i,A_{i}})+\text{Cov}(\theta_{i}^{\rm cm},\theta_{j}^{\rm cm})+\mathbb{I}\{A_{i}=A_{j}\}\cdot\text{Cov}(\theta_{i,A_{i}}^{\rm id},\theta_{j,A_{j}}^{\rm id}) for i,j\in[t-1], (\Sigma_{t,\theta R})_{ai}:=\text{Cov}(R_{i},\theta_{t,a}^{\rm id}|A_{i})=\mathbb{I}\{A_{i}=a\}\cdot\text{Cov}(\theta_{i,a}^{\rm id},\theta_{t,a}^{\rm id}) for a\in[2] and i\in[t-1], and \Sigma_{t,\theta\theta} is the identity matrix. Given additionally the reward realizations,

[θt,1idθt,2id]|t1𝒩(μ^t2,Σ^t2×2),whereμ^t:=Σt,θRΣt,RR1[R1Rt1],Σ^t:=Σt,θθΣt,θRΣt,RR1Σt,θR.\left.\left[\begin{array}[]{c}\theta_{t,1}^{\rm id}\\ \theta_{t,2}^{\rm id}\end{array}\right]\right|\mathcal{F}_{t-1}\sim\mathcal{N}(\hat{\mu}_{t}\in\mathbb{R}^{2},\hat{\Sigma}_{t}\in\mathbb{R}^{2\times 2}),\quad\text{where}\quad\hat{\mu}_{t}:=\Sigma_{t,\theta R}\Sigma_{t,RR}^{-1}\left[\begin{array}[]{c}R_{1}\\ \vdots\\ R_{t-1}\end{array}\right],\quad\hat{\Sigma}_{t}:=\Sigma_{t,\theta\theta}-\Sigma_{t,\theta R}\Sigma_{t,RR}^{-1}\Sigma_{t,\theta R}^{\top}.
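
The posterior update above is a standard Gaussian conditioning step; the sketch below shows how it could be implemented (an illustrative reconstruction assuming \sigma_{X}^{2}=1 for both latent processes, as in the experiment; all names are ours, and the example history is hypothetical).

```python
# Sketch of the exact Gaussian posterior update used by Thompson sampling here:
# condition (theta_{t,1}^id, theta_{t,2}^id) on past (action, reward) pairs via
# the joint normal distribution displayed above. Names are illustrative.
import numpy as np

def se_kernel(i, j, tau):
    return np.exp(-0.5 * ((i - j) / tau) ** 2)

def posterior_theta_id(actions, rewards, t, tau_cm, tau_id):
    """Posterior mean and covariance of (theta_{t,1}^id, theta_{t,2}^id) given F_{t-1}."""
    s = np.arange(1, t)                       # past periods 1, ..., t-1
    a = np.asarray(actions)                   # past actions in {1, 2}
    R = np.asarray(rewards, dtype=float)
    # Sigma_RR: noise + common variation + idiosyncratic variation (same arm only).
    Sigma_RR = (np.eye(t - 1)
                + se_kernel(s[:, None], s[None, :], tau_cm)
                + (a[:, None] == a[None, :]) * se_kernel(s[:, None], s[None, :], tau_id))
    # Sigma_thetaR: covariance between theta_{t,arm}^id and past rewards of that arm.
    Sigma_thetaR = np.stack([(a == arm) * se_kernel(s, t, tau_id) for arm in (1, 2)])
    mu_hat = Sigma_thetaR @ np.linalg.solve(Sigma_RR, R)
    Sigma_hat = np.eye(2) - Sigma_thetaR @ np.linalg.solve(Sigma_RR, Sigma_thetaR.T)
    return mu_hat, Sigma_hat

# Example Thompson sampling step at period t = 4 given a short hypothetical history.
rng = np.random.default_rng(0)
mu_hat, Sigma_hat = posterior_theta_id(actions=[1, 2, 1], rewards=[0.3, -0.1, 0.5],
                                       t=4, tau_cm=10.0, tau_id=50.0)
theta_tilde = rng.multivariate_normal(mu_hat, Sigma_hat)
A_t = 1 + int(theta_tilde[1] > theta_tilde[0])            # argmax over the two arms
print("posterior mean:", mu_hat, "sampled arm:", A_t)
```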

Sliding-Window TS is a simple modification of stationary Thompson sampling that behaves as if the environment were stationary but discards all observations revealed more than L periods ago. More specifically, in each period t, \pi^{\rm SW-TS} with window length L draws, for each arm a, a random sample of the mean reward \tilde{\mu}_{t,a} from \mathcal{N}(\hat{\mu}_{t,a},\hat{\sigma}_{t,a}^{2}), where

μ^t,a:=s=max{1,tL}t1Rs𝕀{As=a}1+Nt,a,σ^t,a2:=11+Nt,a,Nt,a:=s=max{1,tL}t1𝕀{As=a},\hat{\mu}_{t,a}:=\frac{\sum_{s=\max\{1,t-L\}}^{t-1}R_{s}\mathbb{I}\{A_{s}=a\}}{1+N_{t,a}},\quad\hat{\sigma}_{t,a}^{2}:=\frac{1}{1+N_{t,a}},\quad N_{t,a}:=\sum_{s=\max\{1,t-L\}}^{t-1}\mathbb{I}\{A_{s}=a\},

and then selects the arm A_{t}\leftarrow\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}\tilde{\mu}_{t,a}. Here, L is a control parameter determining the degree of adaptivity.
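
A minimal sketch of one SW-TS step (illustrative only; the history arrays below are hypothetical placeholders):

```python
# Sketch of one Sliding-Window Thompson sampling step with window length L.
import numpy as np

def sw_ts_step(past_actions, past_rewards, arms, L, rng):
    A = np.asarray(past_actions)[-L:]          # keep only the last L observations
    R = np.asarray(past_rewards, dtype=float)[-L:]
    samples = {}
    for a in arms:
        N = np.sum(A == a)
        mu_hat = np.sum(R[A == a]) / (1 + N)
        sigma2_hat = 1.0 / (1 + N)
        samples[a] = rng.normal(mu_hat, np.sqrt(sigma2_hat))
    return max(samples, key=samples.get)

rng = np.random.default_rng(0)
arm = sw_ts_step(past_actions=[1, 2, 1, 1, 2], past_rewards=[0.4, 0.1, 0.6, 0.2, -0.3],
                 arms=(1, 2), L=3, rng=rng)
print("selected arm:", arm)
```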

Similarly, Sliding-Window UCB is a simple modification of UCB that computes sliding-window UCB indices defined as

Ut,a:=μ^t,a+β1Nt,a,μ^t,a:=s=max{1,tL}t1Rs𝕀{As=a}Nt,a,Nt,a:=s=max{1,tL}t1𝕀{As=a},U_{t,a}:=\hat{\mu}_{t,a}+\beta\frac{1}{\sqrt{N_{t,a}}},\quad\hat{\mu}_{t,a}:=\frac{\sum_{s=\max\{1,t-L\}}^{t-1}R_{s}\mathbb{I}\{A_{s}=a\}}{N_{t,a}},\quad N_{t,a}:=\sum_{s=\max\{1,t-L\}}^{t-1}\mathbb{I}\{A_{s}=a\},

and then selects the arm A_{t}\leftarrow\operatorname*{\mathrm{argmax}}_{a\in\mathcal{A}}U_{t,a}. Here, L is a control parameter determining the degree of adaptivity, and \beta is a control parameter determining the degree of exploration.
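
A corresponding sketch of one SW-UCB step (illustrative only; the handling of arms unseen in the window is our assumption, since the text does not specify it):

```python
# Sketch of one Sliding-Window UCB step with window length L and exploration beta.
import numpy as np

def sw_ucb_step(past_actions, past_rewards, arms, L, beta):
    A = np.asarray(past_actions)[-L:]
    R = np.asarray(past_rewards, dtype=float)[-L:]
    indices = {}
    for a in arms:
        N = np.sum(A == a)
        if N == 0:
            indices[a] = np.inf                # force exploration of unseen arms (assumption)
        else:
            mu_hat = np.sum(R[A == a]) / N
            indices[a] = mu_hat + beta / np.sqrt(N)
    return max(indices, key=indices.get)

print("selected arm:", sw_ucb_step([1, 2, 1, 1, 2], [0.4, 0.1, 0.6, 0.2, -0.3],
                                   arms=(1, 2), L=4, beta=1.0))
```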

Uniform is a naïve benchmark policy that selects one of the two arms uniformly at random in every period. One can easily show that \bar{\Delta}_{T}(\pi^{\rm Uniform})\approx 0.57 in our setup, regardless of the choice of \tau^{\rm cm} and \tau^{\rm id}.
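
To see where the 0.57 figure comes from (a short calculation consistent with the setup above): under Uniform, the per-period regret equals half the expected gap between the two arms, and \mu_{t,1}-\mu_{t,2}=\theta_{t,1}^{\rm id}-\theta_{t,2}^{\rm id}\sim\mathcal{N}(0,2), so that

\bar{\Delta}_{T}(\pi^{\rm Uniform})=\mathbb{E}\left[\tfrac{1}{2}\left|\mu_{t,1}-\mu_{t,2}\right|\right]=\tfrac{1}{2}\,\mathbb{E}\left|\mathcal{N}(0,2)\right|=\tfrac{1}{2}\cdot\sqrt{2}\cdot\sqrt{\tfrac{2}{\pi}}=\tfrac{1}{\sqrt{\pi}}\approx 0.564.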

Simulation results.

Given a sample path specified by \theta, we measure the (pathwise) Cesàro average regret of an action sequence A as

Δ¯T(A;θ):=1Tt=1T(μtμt,At),whereμt,a:=θtcm+θt,aid,μt:=maxa𝒜μt,a.\bar{\Delta}_{T}(A;\theta):=\frac{1}{T}\sum_{t=1}^{T}(\mu_{t}^{*}-\mu_{t,A_{t}}),\quad\text{where}\quad\mu_{t,a}:=\theta_{t}^{\rm cm}+\theta_{t,a}^{\rm id},\quad\mu_{t}^{*}:=\max_{a\in\mathcal{A}}\mu_{t,a}.

Given an environment specified by (\tau^{\rm cm},\tau^{\rm id}), we estimate the per-period regret of an algorithm \pi using S sample paths:

Δ¯^T(π;τcm,τid):=1Ss=1SΔ¯T(Aπ,(s);θ(s)),\hat{\bar{\Delta}}_{T}(\pi;\tau^{\rm cm},\tau^{\rm id}):=\frac{1}{S}\sum_{s=1}^{S}\bar{\Delta}_{T}(A^{\pi,(s)};\theta^{(s)}),

where \theta^{(s)} is the s^{\text{th}} sample path (shared across all algorithms) and A^{\pi,(s)} is the action sequence taken by \pi along this sample path. In all experiments, we use T=1000 and S=1000.
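
A sketch of this evaluation loop is given below. It is illustrative only: run_policy is a hypothetical callable standing in for any of the algorithms above (a full implementation would feed noisy rewards to the policy sequentially; here the callable simply maps a mean-reward path to an action sequence, which suffices to illustrate the regret computation).

```python
# Sketch of the evaluation loop: pathwise Cesaro-average regret, averaged over
# shared sample paths. 'run_policy' is a hypothetical placeholder.
import numpy as np

def cesaro_regret(mu, actions):
    """Pathwise average regret given mean rewards mu (T, k) and 0-indexed actions."""
    T = mu.shape[0]
    return np.mean(mu.max(axis=1) - mu[np.arange(T), actions])

def estimate_per_period_regret(run_policy, sample_paths):
    """Average the pathwise regret of a policy over shared sample paths."""
    return np.mean([cesaro_regret(mu, run_policy(mu)) for mu in sample_paths])

# Example with a toy 'Uniform' policy on random placeholder paths.
rng = np.random.default_rng(0)
paths = [np.cumsum(0.05 * rng.standard_normal((1000, 2)), axis=0) for _ in range(20)]
uniform = lambda mu: rng.integers(2, size=mu.shape[0])
print("estimated per-period regret of Uniform:", estimate_per_period_regret(uniform, paths))
```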

We first report the convergence of the instantaneous regret in Figure 8: we observe that \mathbb{E}[\mu_{t}^{*}-\mu_{t,A_{t}}] quickly converges to a constant after some initial transient periods, numerically supporting the conjecture made in Remark 1. While not reported here, we also observe that the Cesàro average \bar{\Delta}_{T}(A;\theta) converges to the same limiting value as T\rightarrow\infty along every sample path, suggesting ergodicity of the overall system.


Figure 8: Convergence of the instantaneous regret \mathbb{E}[\mu_{t}^{*}-\mu_{t,A_{t}}] in the case \tau^{\rm cm}=\tau^{\rm id}=50. The solid lines report the instantaneous regret of the algorithms, averaged across S=1000 sample paths, and the dashed horizontal lines represent the estimated per-period regret.

We next examine the effect of \tau^{\rm id} and \tau^{\rm cm} on the performance of the algorithms, and provide detailed simulation results that complement Figure 2 of Section 1.1. While varying \tau^{\rm id} and \tau^{\rm cm}, we measure the per-period regret \hat{\bar{\Delta}}_{T}(\pi;\tau^{\rm cm},\tau^{\rm id}) of each algorithm according to the procedure described above. We observe from Figure 9 that every algorithm's performance is mainly determined by \tau^{\rm id}, essentially independently of \tau^{\rm cm}, numerically confirming our main claim that the difficulty of the problem is characterized by the entropy rate of the optimal action process, \bar{H}(A^{*}), which depends only on \tau^{\rm id}. We additionally visualize the upper bound on TS's regret that our analysis predicts (Example 7). We also observe that TS performs best across all settings, perhaps because TS exploits prior knowledge about the nonstationarity of the environment, whereas SW-TS and SW-UCB perform well when the window length roughly matches \tau^{\rm id}.

Figure 9: The effect of \tau^{\rm id} (left) and \tau^{\rm cm} (right) on the performance of the algorithms. The dashed line in the left plot represents the upper bound on \bar{\Delta}_{\infty}(\pi^{\rm TS}) implied by Corollary 1 and Example 7.