
Leveraging (Biased) Information: Multi-armed Bandits with Offline Data

Wang Chi Cheung    Lixing Lyu
Abstract

We leverage offline data to facilitate online learning in stochastic multi-armed bandits. The probability distributions that govern the offline data and the online rewards can be different. Without any non-trivial upper bound on their difference, we show that no non-anticipatory policy can out-perform the UCB policy by (Auer et al. 2002), even in the presence of offline data. In complement, we propose an online policy MIN-UCB, which outperforms UCB when a non-trivial upper bound is given. MIN-UCB adaptively chooses to utilize the offline data when they are deemed informative, and to ignore them otherwise. MIN-UCB is shown to be tight in terms of both instance independent and dependent regret bounds. Finally, we corroborate the theoretical results with numerical experiments.


1 Introduction

The multi-armed bandit problem (MAB) is a classical model in sequential decision-making. In traditional models, online learning is initialized with no historical dataset. However, in many real world scenarios, historical datasets related to the underlying model exist before online learning begins. Those datasets could be beneficial to online learning. For instance, when a company launches a new product in Singapore, past sales data in the US could be of relevance. This prompts an intriguing question: Can we develop an approach that effectively utilizes offline data, and subsequently learns under bandit feedback? Such approaches carry significant allure, as they potentially reduce the amount of exploration in the learning process. Nevertheless, it is too strong and impractical to assume that the historical dataset and the online model follow the same probability distribution, given the ubiquity of distributional shifts. Therefore, it is desirable to design an adaptive policy that reaps the benefit of offline data if they closely match the online model, while judiciously ignoring the offline data and learning from scratch otherwise.

Motivated by the above discussions, we consider a stochastic multi-armed bandit model with a possibly biased offline dataset. The entire event horizon consists of a warm-start phase, followed by an online phase whereby decision making occurs. During the warm-start phase, the decision maker (DM) receives an offline dataset, consisting of samples governed by a latent probability distribution P(off)P^{\text{(off)}}. The subsequent online phase is the same as in the standard multi-armed bandit model, except that the offline dataset can be incorporated in decision making. The random online rewards are governed by another latent probability distribution P(on)P^{\text{(on)}}. The DM aims to maximize the total cumulative reward earned, or equivalently minimize the regret, during the online phase.

Intuitively, when P(off)P^{\text{(off)}}, P(on)P^{\text{(on)}} are “far apart”, the DM should conduct online learning from scratch and ignore the offline dataset. For example, the vanilla UCB policy by (Auer et al., 2002) incurs expected regret at most

O(a:Δ(a)>0logTΔ(a)),O\left(\sum_{a:\Delta(a)>0}\frac{\log T}{\Delta(a)}\right), (1)

where TT is the length of the online phase, and Δ(a)0\Delta(a)\geq 0 is the difference between the expected reward of an optimal arm and that of arm aa (also see Section 2). The bound (1) holds for all P(off)P^{\text{(off)}}, P(on)P^{\text{(on)}}. By contrast, when P(off)P^{\text{(off)}}, P(on)P^{\text{(on)}} are “sufficiently close”, the DM should incorporate the offline data into online learning and avoid unnecessary exploration. For example, when P(off)=P(on)P^{\text{(off)}}=P^{\text{(on)}}, HUCB1 (Shivaswamy & Joachims, 2012) and MonUCB (Banerjee et al., 2022) incur expected regret at most

O(a:Δ(a)>0max{logTΔ(a)TS(a)Δ(a),0}),O\left(\sum_{a:\Delta(a)>0}\max\left\{\frac{\log T}{\Delta(a)}-T_{\text{S}}(a)\Delta(a),0\right\}\right), (2)

where TS(a)T_{\text{S}}(a) is the number of offline samples on arm aa. (2) is a better regret bound than (1). However, (2) only holds when P(off)=P(on)P^{\text{(off)}}=P^{\text{(on)}}. The above inspires the following question:

(Q) Can the DM outperform the vanilla UCB, i.e. achieve a better regret bound than (1) when P(off)=P(on)P^{\text{(off)}}=P^{\text{(on)}}, while retaining the regret bound (1) for general P(off),P(on)P^{\text{(off)}},P^{\text{(on)}}?

The answer to (Q) turns out to be somewhat mixed. Our novel contributions shed light on (Q) as follows:

Impossibility Result. We show in Section 3 that no non-anticipatory policy can achieve the aim in (Q). Even with an offline dataset, no non-anticipatory policy can outperform the vanilla UCB without any additional knowledge or restriction on the difference between P(off),P(on)P^{\text{(off)}},P^{\text{(on)}}.

Algorithm design and analysis. To bypass the impossibility result, we endow the DM with auxiliary information dubbed the valid bias bound VV, which is information additional to the offline dataset. The bound VV serves as an upper bound on the difference between P(off)P^{\text{(off)}} and P(on)P^{\text{(on)}}. We propose the MIN-UCB policy, which achieves a strictly better regret bound than (1) when P(off),P(on)P^{\text{(off)}},P^{\text{(on)}} are “sufficiently close”. In particular, our regret bound reduces to (2) if the DM knows P(off)=P(on)P^{\text{(off)}}=P^{\text{(on)}}. We provide both instance dependent and independent regret upper bounds on MIN-UCB, and we show their tightness by providing the corresponding regret lower bounds. Our analysis also pins down the precise meaning of “far apart” and “sufficiently close” in the above discussion. The design and analysis of MIN-UCB are provided in Section 4.

In particular, when P(off)=P(on)P^{\text{(off)}}=P^{\text{(on)}}, our analysis provides two novel contributions. Firstly, we establish the tightness of the instance dependent bound (2). Secondly, we establish a pair of instance independent regret upper and lower bounds, which match each other up to a logarithmic factor. MIN-UCB is proved to achieve the upper bound when P(off)=P(on)P^{\text{(off)}}=P^{\text{(on)}}. Both the upper and lower bounds are novel in the literature, and we show that the optimal regret bound involves the optimum of a novel linear program.

1.1 Related Works

Multi-armed bandits with offline data have been studied in multiple works. (Shivaswamy & Joachims, 2012; Banerjee et al., 2022) are the most relevant works, focusing on the special case of P(on)=P(off)P^{\text{(on)}}=P^{\text{(off)}}. They only provide instance dependent regret upper bounds, while we provide both instance dependent and independent bounds. Online learning with offline data under P(on)=P(off)P^{\text{(on)}}=P^{\text{(off)}} is also studied in dynamic pricing settings (Bu et al., 2020) and reinforcement learning settings (Hao et al., 2023; Wagenmaker & Pacchiano, 2023), as well as when historical data are sequentially provided (Gur & Momeni, 2022), or provided as clustered data (Bouneffouf et al., 2019; Ye et al., 2020; Tennenholtz et al., 2021). (Zhang et al., 2019) is another closely related work on contextual bandits that allows P(on)P(off)P^{\text{(on)}}\neq P^{\text{(off)}}. We provide more discussions on it in Appendix A.1.

Bayesian policies, such as the Thompson sampling policy (TS), are popular for incorporating offline data into online learning. When the online and offline reward distributions are identical, offline data can be used to construct well-specified priors, which enables better regret bounds than those in (Russo & Van Roy, 2014; Russo & Roy, 2016; Liu & Li, 2016) when the offline data size increases. By contrast, (Liu & Li, 2016; Simchowitz et al., 2021) show that TS initialized with mis-specified priors generally incurs worse regret bounds than state-of-the-art online policies without any offline data. Our work focuses on the orthogonal issue of deciding whether to incorporate the possibly biased offline data in online learning.

Our problem model closely relates to machine learning under domain shift. (Si et al., 2023) investigates an offline policy learning problem using historical data on contextual bandits under a changing environment, from a distributionally robust optimization perspective. (Chen et al., 2022) studies a variant in the RL context, and their policy is also endowed with an upper bound on the distributional drift between offline and online models. We provide more discussion in Appendix A.3. Numerous research works consider the issue of distributional drift in supervised learning (Crammer et al., 2008; Mansour et al., 2009; Ben-David et al., 2010) and stochastic optimization (Besbes et al., 2022).

Our work is related to research on online learning with advice, which focuses on improving performance guarantees with hints or predictions. Indeed, our offline dataset could serve as a hint to an online learner. A variety of results are derived in multi-armed bandits (Wei & Luo, 2018), linear bandits (Cutkosky et al., 2022), contextual bandits (Wei et al., 2020b), and full feedback settings (Steinhardt & Liang, 2014; Rakhlin & Sridharan, 2013a, b). These works do not apply in our setting, since they do not consider refining instance dependent bounds. We provide more discussions in Appendix A.2. Our theme of whether to utilize the potentially biased offline data is related to online model selection, studied in (Agarwal et al., 2017; Pacchiano et al., 2020; Lee et al., 2021; Cutkosky et al., 2021; Pacchiano et al., 2023). While these works take a black-box approach, we tailor-make our decision making on whether to utilize offline data with a novel UCB algorithm and the auxiliary input VV, and the latter is not studied in the abovementioned works.

Notation. We denote 𝒩(μ,1){\cal N}(\mu,1) as the Gaussian distribution with mean μ\mu and variance 1. We abbreviate independently and identically distributed as “i.i.d.”. The relationship RPR\sim P means that the random variable RR follows the probability distribution PP.

2 Model

We consider a stochastic KK-armed bandit model with possibly biased offline data. Denote 𝒜={1,,K}{\cal A}=\{1,\ldots,K\} as the set of KK arms. The event horizon consists of a warm-start phase, followed by an online phase. During the warm-start phase, the DM receives a collection of TS(a)0T_{\text{S}}(a)\in\mathbb{Z}_{\geq 0} i.i.d. samples, denoted S(a)={Xs(a)}s=1TS(a)S(a)=\{X_{s}(a)\}^{T_{\text{S}}(a)}_{s=1}, for each a𝒜a\in{\cal A}. We postulate that X1(a),,XTS(a)(a)Pa(off)X_{1}(a),\ldots,X_{T_{\text{S}}(a)}(a)\sim P^{\text{(off)}}_{a}, the offline reward distribution on arm aa.

The subsequent online phase consists of TT time steps. For time steps t=1,,Tt=1,\ldots,T, the DM chooses an arm At𝒜A_{t}\in{\cal A} to pull, which gives the DM a random reward RtR_{t}. Pulling arm aa in the online phase generates a random reward R(a)Pa(on)R(a)\sim P^{\text{(on)}}_{a}, dubbed the online reward distribution on arm aa. We emphasize that P(on)={Pa(on)}a𝒜P^{\text{(on)}}=\{P^{\text{(on)}}_{a}\}_{a\in{\cal A}} need not be the same as P(off)={Pa(off)}a𝒜P^{\text{(off)}}=\{P^{\text{(off)}}_{a}\}_{a\in{\cal A}}. Finally, the DM proceeds to time step t+1t+1, or terminates when t=Tt=T.

The DM chooses arms A1,,ATA_{1},\ldots,A_{T} with a non-anticipatory policy π={πt}t=1\pi=\{\pi_{t}\}^{\infty}_{t=1}. The function πt\pi_{t} maps the observations Ht1=(S={S(a)}a𝒜,{(As,Rs)}s=1t1)H_{t-1}=(S=\{S(a)\}_{a\in{\cal A}},\{(A_{s},R_{s})\}^{t-1}_{s=1}) collected by the end of time step t1t-1 to an element in {x0K:a𝒜xa=1}\{x\in\mathbb{R}^{K}_{\geq 0}:\sum_{a\in{\cal A}}x_{a}=1\}. The quantity πt(a|Ht1)[0,1]\pi_{t}(a~{}|~{}H_{t-1})\in[0,1] is the probability of At=aA_{t}=a, conditioned on Ht1H_{t-1} under policy π\pi. While our proposed algorithms are deterministic, i.e. πt(a|Ht1){0,1}\pi_{t}(a~{}|~{}H_{t-1})\in\{0,1\} always, we allow randomized policies in our regret lower bound results. The policy π\pi could utilize SS, which provides a potentially biased estimate of the online model since we allow P(on)P(off)P^{\text{(on)}}\neq P^{\text{(off)}}.

Altogether, the underlying instance II is specified by the tuple (𝒜,{TS(a)}a𝒜,P,T)({\cal A},\{T_{\text{S}}(a)\}_{a\in{\cal A}},P,T), where P=(P(off),P(on))P=(P^{\text{(off)}},P^{\text{(on)}}). The DM only knows 𝒜{\cal A} and SS, but not PP and TT, at the start of the online phase. For every a𝒜a\in{\cal A}, we assume that both Pa(off),Pa(on)P^{\text{(off)}}_{a},P^{\text{(on)}}_{a} are 1-subGaussian, which is also known to the DM. For a𝒜a\in{\cal A}, denote μ(on)(a)=𝔼R(a)Pa(on)[R(a)]\mu^{\text{(on)}}(a)=\mathbb{E}_{R(a)\sim P^{\text{(on)}}_{a}}[R(a)], and μ(off)(a)=𝔼R(a)Pa(off)[R(a)]\mu^{\text{(off)}}(a)=\mathbb{E}_{R(a)\sim P^{\text{(off)}}_{a}}[R(a)]. We do not have any boundedness assumptions on μ(on)(a),μ(off)(a)\mu^{\text{(on)}}(a),\mu^{\text{(off)}}(a).

The DM aims to maximize the total reward earned in the online phase. We quantify the performance guarantee of the DM’s non-anticipatory policy by its regret. Define μ(on)=maxa𝒜μ(on)(a)\mu_{*}^{\text{(on)}}=\max_{a\in{\cal A}}\mu^{\text{(on)}}(a), and denote Δ(a)=μ(on)μ(on)(a)\Delta(a)=\mu^{\text{(on)}}_{*}-\mu^{\text{(on)}}(a) for each a𝒜a\in{\cal A}. The DM aims to design a non-anticipatory policy π\pi that minimizes the regret

RegT(π,P)=Tμ(on)t=1Tμ(on)(At)=t=1TΔ(At)\text{Reg}_{T}(\pi,P)=T\mu_{*}^{\text{(on)}}-\sum_{t=1}^{T}\mu^{\text{(on)}}(A_{t})=\sum^{T}_{t=1}\Delta(A_{t}) (3)

despite the uncertainty on PP. While (3) involves only μ(on)\mu^{\text{(on)}} but not μ(off)\mu^{\text{(off)}}, the choice of AtA_{t} is influenced by both the offline data SS and the online data {As,Rs}s=1t1\{A_{s},R_{s}\}^{t-1}_{s=1}.
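To make the interaction protocol concrete, the following minimal Python sketch simulates the warm-start and online phases and evaluates the regret (3) for a naive policy; all numbers, names, and the naive policy itself are illustrative assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 500
mu_on = np.array([0.5, 0.3, 0.1])                  # latent online means mu^(on)(a)
mu_off = mu_on + np.array([0.0, 0.2, -0.1])        # biased offline means mu^(off)(a)
T_S = np.array([100, 40, 0])                       # offline sample sizes T_S(a)

# Warm-start phase: T_S(a) i.i.d. offline samples per arm, drawn from P^(off)_a.
offline = [rng.normal(mu_off[a], 1.0, size=T_S[a]) for a in range(K)]

# Online phase with a naive (and generally poor) policy: always pull the arm with
# the best offline sample mean; arms without offline data are never considered.
off_means = np.array([s.mean() if s.size else -np.inf for s in offline])
actions = np.full(T, int(np.argmax(off_means)))
rewards = rng.normal(mu_on[actions], 1.0)          # rewards drawn from P^(on)_{A_t}

# Regret (3): sum of optimality gaps Delta(A_t) of the pulled arms.
gaps = mu_on.max() - mu_on
print("regret:", gaps[actions].sum())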

As mentioned in the introduction and detailed in the forthcoming Section 3, the DM cannot outperform the vanilla UCB policy with a non-anticipatory policy. Rather, the DM requires information additional to the offline dataset and a carefully designed policy. We consider the following auxiliary input in our algorithm design.

Auxiliary input: valid bias bound. We say that V={V(a)}a𝒜({})|𝒜|V=\{V(a)\}_{a\in{\cal A}}\in(\mathbb{R}\cup\{\infty\})^{|\mathcal{A}|} is a valid bias bound on an instance II if

V(a)|μ(off)(a)μ(on)(a)|for each a𝒜.V(a)\geq|\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a)|\quad\text{for each $a\in{\cal A}$.} (4)

In our forthcoming policy MIN-UCB, the DM is endowed with the auxiliary input VV in addition to the offline dataset SS before the online phase begins. The quantity V(a)V(a) serves as an upper bound on the amount of distributional shift from Pa(off)P^{\text{(off)}}_{a} to Pa(on)P^{\text{(on)}}_{a}. The condition (4) always holds in the trivial case of V(a)=V(a)=\infty, which is when the DM has no additional knowledge. By contrast, when V(a)V(a)\neq\infty, the DM has non-trivial knowledge on the difference |μ(off)(a)μ(on)(a)||\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a)|.

Knowledge of an upper bound such as VV is in line with the model assumptions in research works on learning under distributional shift. Similar upper bounds are assumed in the contexts of supervised learning (Crammer et al., 2008), stochastic optimization (Besbes et al., 2022), offline policy learning on contextual bandits (Si et al., 2023), and multi-task bandit learning (Wang et al., 2021), for example. Methodologies for constructing such upper bounds are provided in the literature. For instance, (Blanchet et al., 2019) designs an approach based on several machine learning estimators, such as LASSO. (Si et al., 2023) provides some managerial insights on how to construct them empirically. (Chen et al., 2022) constructs them through cross-validation, and implements this in a case study.

3 An Impossibility Result

We illustrate the impossibility result with two instances IP,IQI_{P},I_{Q}. The instances IP,IQI_{P},I_{Q} share the same 𝒜={1,2},{TS(a)}a𝒜{\cal A}=\{1,2\},\{T_{\text{S}}(a)\}_{a\in{\cal A}} and TT. However, IP,IQI_{P},I_{Q} differ in their respective reward distributions PP, QQ, and IP,IQI_{P},I_{Q} have different optimal arms. Consider a fixed but arbitrary β(0,1/2)\beta\in(0,1/2). Instance IPI_{P} has well-aligned reward distributions P(off)=P(on)P^{\text{(off)}}=P^{\text{(on)}}, where

P1(on)=𝒩(0,1),P2(on)=𝒩(Tβ,1).P^{\text{(on)}}_{1}={\cal N}(0,1),\quad P^{\text{(on)}}_{2}={\cal N}(-T^{-\beta},1).

Note that Δ(2)=Tβ\Delta(2)=T^{-\beta}. In IPI_{P}, the offline dataset provides useful hints that arm 1 is optimal. For example, given TS(1)=TS(2)128(T2βT2βϵ)logTT_{\text{S}}(1)=T_{\text{S}}(2)\geq 128(T^{2\beta}-T^{2\beta-\epsilon})\log T where ϵ(0,β)\epsilon\in(0,\beta), the existing policies HUCB1 and MonUCB achieve an expected regret (see Equation (2)) at most

O(logTΔ(2)TS(2)Δ(2))\displaystyle O\left(\frac{\log T}{\Delta(2)}-T_{\text{S}}(2)\Delta(2)\right) =O(logT(TβTβ+Tβϵ))\displaystyle=O(\log T(T^{\beta}-T^{\beta}+T^{\beta-\epsilon}))
=O(TβϵlogT),\displaystyle=O(T^{\beta-\epsilon}\log T),

which strictly outperforms the vanilla UCB policy, which incurs expected regret O(logT/Δ(2))=O(TβlogT)O(\log T/\Delta(2))=O(T^{\beta}\log T). Despite the apparent improvement, the following claim shows that any non-anticipatory policy (not just HUCB1 or MonUCB) that outperforms the vanilla UCB on IPI_{P} would incur a larger regret than the vanilla UCB on a suitably chosen IQI_{Q}.

Proposition 3.1.

Let TS(1),TS(2)T_{\text{S}}(1),T_{\text{S}}(2) be arbitrary. Consider an arbitrary non-anticipatory policy π\pi (which only possesses the offline dataset SS but not the auxiliary input VV) that satisfies 𝔼[RegT(π,P)]CTβϵlogT\mathbb{E}[\text{Reg}_{T}(\pi,P)]\leq CT^{\beta-\epsilon}\log T on instance IPI_{P}, where ϵ(0,β)\epsilon\in(0,\beta), C>0C>0 is an absolute constant, and the horizon TT is so large that C<Tϵ/(4logT)C<T^{\epsilon}/(4\log T). Set Q=(Q(off),Q(on))Q=(Q^{\text{(off)}},Q^{\text{(on)}}) in the instance IQI_{Q} as

Q(off)\displaystyle Q^{\text{(off)}} =P(off),Q1(on)=P1(on), but\displaystyle=P^{\text{(off)}},\quad Q^{\text{(on)}}_{1}=P^{\text{(on)}}_{1}\text{, but}
Q2(on)\displaystyle Q^{\text{(on)}}_{2} =𝒩(1ClogTTβ(ϵ/2)1Tβ,1).\displaystyle={\cal N}\left(\frac{1}{\sqrt{C\log T}T^{\beta-(\epsilon/2)}}-\frac{1}{T^{\beta}},1\right).

The following inequality holds:

𝔼[Reg(π,Q)]\displaystyle\mathbb{E}[\text{Reg}(\pi,Q)] T1β4e2CTβϵlogT\displaystyle\geq\frac{T^{1-\beta}}{4}\cdot e^{-2}-CT^{\beta-\epsilon}\log T (5)
=Ω(T1β)>Ω(T).\displaystyle=\Omega(T^{1-\beta})>\Omega(\sqrt{T}).

Proposition 3.1 is proved in Appendix C.2. We first remark on IQI_{Q}. Different from IPI_{P}, arm 2 is the optimal arm in IQI_{Q}. Thus, the offline data from Q(off)Q^{\text{(off)}} provides the misleading information that arm 1 has a higher expected reward. In addition, note that Q(off)=P(off)Q^{\text{(off)}}=P^{\text{(off)}}, so the offline datasets in instances IP,IQI_{P},I_{Q} are identically distributed. The proposition suggests that no non-anticipatory policy can simultaneously (i) establish that SS is useful for online learning in IPI_{P} to improve upon the UCB policy, and (ii) establish that SS is to be disregarded for online learning in IQI_{Q} to match the UCB policy, negating the short-lived conjecture in (Q).

Lastly, we allow TS(1),TS(2)T_{\text{S}}(1),T_{\text{S}}(2) to be arbitrary. In the most informative case of TS(1)=TS(2)=T_{\text{S}}(1)=T_{\text{S}}(2)=\infty, meaning that the DM knows P(off),Q(off)P^{\text{(off)}},Q^{\text{(off)}} in IP,IQI_{P},I_{Q} respectively, the DM still cannot achieve (i) and (ii) simultaneously. Indeed, the assumed improvement 𝔼[RegT(π,P)]CTβϵlogT\mathbb{E}[\text{Reg}_{T}(\pi,P)]\leq CT^{\beta-\epsilon}\log T limits the number of pulls on arm 2 in IPI_{P} during the online phase. We harness this property to construct an upper bound on the KL divergence between the online dynamics on IPI_{P} and IQI_{Q}, which in turn implies that the DM cannot differentiate between IPI_{P} and IQI_{Q} under policy π\pi. The argument utilizes a chain rule on KL divergence (see Appendix C.1) adapted to our setting.
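For intuition, the display below sketches the form of the divergence decomposition underlying this argument; the precise statement and constants are in Appendix C.1 and C.2, so the display is only indicative. Writing ℙπI\mathbb{P}^{\pi}_{I} for the joint law of the offline dataset and the online trajectory under policy π\pi on instance II, the chain rule gives

\mathrm{KL}\left(\mathbb{P}^{\pi}_{I_P},\mathbb{P}^{\pi}_{I_Q}\right)=\sum_{a\in{\cal A}}T_{\text{S}}(a)\,\mathrm{KL}\left(P^{\text{(off)}}_{a},Q^{\text{(off)}}_{a}\right)+\sum_{a\in{\cal A}}\mathbb{E}_{I_P}\left[N_{T+1}(a)\right]\mathrm{KL}\left(P^{\text{(on)}}_{a},Q^{\text{(on)}}_{a}\right).

Since Q(off)=P(off)Q^{\text{(off)}}=P^{\text{(off)}} and Q1(on)=P1(on)Q^{\text{(on)}}_{1}=P^{\text{(on)}}_{1}, only the arm-2 online term survives, and the assumed regret bound on IPI_{P} caps 𝔼IP[NT+1(2)]\mathbb{E}_{I_P}[N_{T+1}(2)], so the divergence remains small and π\pi cannot reliably distinguish IPI_{P} from IQI_{Q}.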

4 Design and Analysis of MIN-UCB

After the impossibility result, we present our MIN-UCB policy in Algorithm 1. MIN-UCB follows the optimism-in-the-face-of-uncertainty principle. At time t>Kt>K, MIN-UCB sets min{UCBt(a),UCBtS(a)}\min\{\text{UCB}_{t}(a),\text{UCB}_{t}^{\text{S}}(a)\} as the UCB on μ(on)(a)\mu^{\text{(on)}}(a). While UCBt(a)\text{UCB}_{t}(a) follows (Auer et al., 2002), the construction of UCBtS(a)\text{UCB}_{t}^{\text{S}}(a) in (6) adheres to the following ideas, which incorporate the valid bias bound VV, the auxiliary input.

Algorithm 1 Policy MIN-UCB
1:  Input: Valid bias bound VV on the instance, confidence parameter {δt}t=1\{\delta_{t}\}^{\infty}_{t=1}, offline samples SS.
2:  For each a𝒜a\in{\cal A}, compute X^(a)=s=1TS(a)Xs(a)TS(a)\hat{X}(a)=\frac{\sum^{T_{\text{S}}(a)}_{s=1}X_{s}(a)}{T_{\text{S}}(a)}, and initialize R^1(a)=0\hat{R}_{1}(a)=0.
3:  At t=1,,Kt=1,\ldots,K, pull each arm once, then set NK+1(a)=1N_{K+1}(a)=1 for all aa.
4:  for t=K+1,,Tt=K+1,\ldots,T do
5:     For all a𝒜a\in\mathcal{A}, compute the vanilla UCB
UCBt(a)=R^t(a)+radt(a),\text{UCB}_{t}(a)=\hat{R}_{t}(a)+\text{rad}_{t}(a),
where radt(a)=2log(2t/δt)Nt(a)\text{rad}_{t}(a)=\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(a)}}.
6:     For all a𝒜a\in\mathcal{A}, compute the warm-start UCBtS(a)=\text{UCB}^{\text{S}}_{t}(a)=
Nt(a)R^t(a)+TS(a)X^(a)Nt(a)+TS(a)+radtS(a),\frac{N_{t}(a)\cdot\hat{R}_{t}(a)+T_{\text{S}}(a)\cdot\hat{X}(a)}{N_{t}(a)+T_{\text{S}}(a)}+\text{rad}^{\text{S}}_{t}(a), (6)
where radtS(a)=\text{rad}^{\text{S}}_{t}(a)=
2log(2t/δt)Nt(a)+TS(a)+TS(a)Nt(a)+TS(a)V(a).\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(a)+T_{\text{S}}(a)}}+\frac{T_{\text{S}}(a)}{N_{t}(a)+T_{\text{S}}(a)}\cdot V(a). (7)
7:     Select
Atargmaxa𝒜{min{UCBt(a),UCBtS(a)}}.A_{t}\in\text{argmax}_{a\in{\cal A}}\left\{\min\left\{\text{UCB}_{t}(a),\text{UCB}^{\text{S}}_{t}(a)\right\}\right\}.
8:     Observe RtPAt(on)R_{t}\sim P^{\text{(on)}}_{A_{t}}. For each a𝒜a\in{\cal A}, update Nt+1(a)=Nt(a)+𝟏(At=a)N_{t+1}(a)=N_{t}(a)+\mathbf{1}(A_{t}=a),
R^t+1(a)={Nt(a)R^t(a)Nt(a)+1+RtNt(a)+1a=AtR^t(a)aAt.\hat{R}_{t+1}(a)=\begin{cases}\frac{N_{t}(a)\cdot\hat{R}_{t}(a)}{N_{t}(a)+1}+\frac{R_{t}}{N_{t}(a)+1}&a=A_{t}\\ \hat{R}_{t}(a)&a\neq A_{t}\end{cases}.
9:  end for

In (6), the sample mean Nt(a)R^t(a)+TS(a)X^(a)Nt(a)+TS(a)\frac{N_{t}(a)\cdot\hat{R}_{t}(a)+T_{\text{S}}(a)\cdot\hat{X}(a)}{N_{t}(a)+T_{\text{S}}(a)} incorporates both online and offline data. Then, we inject optimism by adding radtS(a)\text{rad}^{\text{S}}_{t}(a), as set in (7). The square-root term in (7) reflects the optimism in view of the randomness of the sample mean. The term TS(a)Nt(a)+TS(a)V(a)\frac{T_{\text{S}}(a)}{N_{t}(a)+T_{\text{S}}(a)}\cdot V(a) in (7) injects optimism to correct the potential bias from the offline dataset. The correction requires the auxiliary input VV.

Finally, we adopt UCBtS(a)\text{UCB}_{t}^{\text{S}}(a) as the UCB when the bias bound V(a)V(a) is so small and the sample size TS(a)T_{\text{S}}(a) is so large that UCBt(a)>UCBtS(a)\text{UCB}_{t}(a)>\text{UCB}_{t}^{\text{S}}(a). Otherwise, P(off)P^{\text{(off)}} is deemed “far away” from P(on)P^{\text{(on)}}, and we follow the vanilla UCB. Our adaptive decision on whether to incorporate offline data at each time step differs from the vanilla UCB policy, which always ignores offline data, and from HUCB1 and MonUCB, which always incorporate all offline data.

Lastly, we remark on VV. Suppose that the DM possesses no knowledge on P(off),P(on)P^{\text{(off)}},P^{\text{(on)}} additional to the offline dataset SS at the start of the online phase. Then, it is advisable to set V(a)=V(a)=\infty for all aa, which specializes MIN-UCB to the vanilla UCB policy. Indeed, the theoretical impossibility result in Section 3 dashes any hope of adaptively tuning VV in MIN-UCB with online data in order to outperform the vanilla UCB. By contrast, when the DM has a non-trivial upper bound V(a)V(a) on |μ(off)(a)μ(on)(a)|\left|\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a)\right|, we show that MIN-UCB can uniformly outperform the vanilla UCB, which directly ignores the offline data.
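For concreteness, the following Python sketch implements Algorithm 1 on simulated Gaussian rewards. The simulation harness (the function name min_ucb, the reward generation from the latent means mu_on, and the finite-V convention below) is an illustrative assumption; only the index computations mirror (6), (7) and the choice δt=1/(2Kt2)\delta_{t}=1/(2Kt^{2}) from the forthcoming Theorem 4.2. For simplicity, the sketch assumes every V(a) is finite; a very large value plays the role of V(a)=∞ and recovers the vanilla UCB choice.

import numpy as np

def min_ucb(T, mu_on, offline_samples, V, seed=0):
    """Sketch of MIN-UCB (Algorithm 1) on Gaussian rewards with unit variance."""
    rng = np.random.default_rng(seed)
    K = len(mu_on)
    T_S = np.array([len(s) for s in offline_samples], dtype=float)
    X_hat = np.array([np.mean(s) if len(s) > 0 else 0.0 for s in offline_samples])
    N = np.zeros(K)          # online pull counts N_t(a)
    R_hat = np.zeros(K)      # online sample means \hat{R}_t(a)
    actions = np.zeros(T, dtype=int)
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                   # pull each arm once
        else:
            delta_t = 1.0 / (2 * K * t ** 2)            # confidence parameter
            width = 2.0 * np.log(2 * t / delta_t)
            ucb = R_hat + np.sqrt(width / N)            # vanilla UCB
            pooled = (N * R_hat + T_S * X_hat) / (N + T_S)
            ucb_s = pooled + np.sqrt(width / (N + T_S)) + T_S / (N + T_S) * V   # (6)-(7)
            a = int(np.argmax(np.minimum(ucb, ucb_s)))  # adaptive min-UCB choice
        r = rng.normal(mu_on[a], 1.0)                   # online reward from P^(on)_a
        R_hat[a] = (N[a] * R_hat[a] + r) / (N[a] + 1)
        N[a] += 1
        actions[t - 1] = a
    return actions

For instance, feeding this sketch offline samples generated as in the earlier illustration, once with a small V(a) and once with a huge V(a) standing in for ∞, lets one compare the adaptive behavior against the vanilla UCB.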

We next embark on the analysis of MIN-UCB, establishing instance dependent regret bounds in Section 4.1 and instance independent regret bounds in Section 4.3. Both sections rely on the following events of accurate estimation by UCBt(a),UCBtS(a)\text{UCB}_{t}(a),\text{UCB}^{\text{S}}_{t}(a). For each tt, define t=a𝒜(t(a)tS(a)){\cal E}_{t}=\cap_{a\in{\cal A}}({\cal E}_{t}(a)\cap{\cal E}^{\text{S}}_{t}(a)), where

t(a)={μ(on)(a)UCBt(a)μ(on)(a)+2radt(a)},\displaystyle{\cal E}_{t}(a)=\left\{\mu^{\text{(on)}}(a)\leq\text{UCB}_{t}(a)\leq\mu^{\text{(on)}}(a)+2\text{rad}_{t}(a)\right\},
tS(a)={μ(on)(a)UCBtS(a)μ(on)(a)+radtS(a)\displaystyle{\cal E}^{\text{S}}_{t}(a)=\left\{\mu^{\text{(on)}}(a)\leq\text{UCB}^{\text{S}}_{t}(a)\leq\mu^{(\text{on})}(a)+\text{rad}^{\text{S}}_{t}(a)\right.
+[2log(2t/δt)Nt(a)+TS(a)+TS(a)(μ(off)(a)μ(on)(a))Nt(a)+TS(a)]}.\displaystyle\left.+\left[\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(a)+T_{\text{S}}(a)}}+\frac{T_{\text{S}}(a)\cdot(\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a))}{N_{t}(a)+T_{\text{S}}(a)}\right]\right\}.

While t(a){\cal E}_{t}(a) incorporates estimation error due to stochastic variations, tS(a){\cal E}^{\text{S}}_{t}(a) additionally involves the potential bias of the offline data.

Lemma 4.1.

Pr(t)12Kδt.\Pr({\cal E}_{t})\geq 1-2K\delta_{t}.

The proof of Lemma 4.1 is provided in Appendix B.1.

4.1 Instance Dependent Regret Bounds

Both our instance dependent regret upper and lower bounds depend on the following discrepancy measure

ω(a)=V(a)+(μ(off)(a)μ(on)(a))\omega(a)=V(a)+(\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a)) (8)

on an arm aa. Note that ω(a)\omega(a) depends on both the valid bias bound V(a)V(a) and the model parameters μ(off)(a),μ(on)(a)\mu^{\text{(off)}}(a),\mu^{\text{(on)}}(a). By the validity of VV, we know that ω(a)[0,2V(a)]\omega(a)\in[0,2V(a)].

Theorem 4.2.

Consider policy MIN-UCB, which inputs a valid bias bound VV on the underlying instance II and δt=12Kt2\delta_{t}=\frac{1}{2Kt^{2}} for t=1,2,t=1,2,\ldots. We have 𝔼[RegT(MIN-UCB,P)]\mathbb{E}[\text{Reg}_{T}(\text{MIN-UCB},P)]\leq

O(a:Δ(a)>0max{log(T)Δ(a)Sav0(a),Δ(a)}),O\left(\sum_{a:\Delta(a)>0}\max\left\{\frac{\log(T)}{\Delta(a)}-\text{Sav}_{0}(a),\Delta(a)\right\}\right), (9)

where a:Δ(a)>0\sum_{a:\Delta(a)>0} is over {a𝒜:Δ(a)>0}\{a\in{\cal A}:\Delta(a)>0\}, and

Sav0(a)=TS(a)Δ(a)max{1ω(a)Δ(a),0}2.\text{Sav}_{0}(a)=T_{\text{S}}(a)\cdot\Delta(a)\cdot\max\left\{1-\frac{\omega(a)}{\Delta(a)},0\right\}^{2}. (10)

Theorem 4.2 is proved in Appendix B.2. We provide a sketch proof in Section 4.2. A fully explicit bound is given in (28) in the Appendix. Compared with the regret bound (1) of the vanilla UCB, MIN-UCB improves the regret bound by the saving term Sav0(a)\text{Sav}_{0}(a).

First, note that Sav0(a)0\text{Sav}_{0}(a)\geq 0 always. The case of ω(a)<Δ(a)\omega(a)<\Delta(a) means that the reward distributions Pa(off),Pa(on)P^{\text{(off)}}_{a},P^{\text{(on)}}_{a} are “sufficiently close”. The offline data on arm aa are incorporated by MIN-UCB to improve upon the vanilla UCB. By contrast, the case of ω(a)Δ(a)\omega(a)\geq\Delta(a) means that Pa(off),Pa(on)P^{\text{(off)}}_{a},P^{\text{(on)}}_{a} are “far apart”, so that the offline data on arm aa are ignored in the learning process, while MIN-UCB still matches the performance guarantee of the vanilla UCB. We emphasize that ω(a)\omega(a) and Δ(a)\Delta(a) are both latent, so MIN-UCB conducts the above adaptive choice under uncertainty.

Next, in the case of ω(a)<Δ(a)\omega(a)<\Delta(a) where the offline data on aa are beneficial, observe that Sav0(a)\text{Sav}_{0}(a) is increasing in TS(a)T_{\text{S}}(a). This monotonicity adheres to the intuition that more offline data leads to a higher reward when Pa(off),Pa(on)P^{\text{(off)}}_{a},P^{\text{(on)}}_{a} are “sufficiently close”. The term Sav0(a)\text{Sav}_{0}(a) decreases as ω(a)\omega(a) increases. Indeed, a smaller V(a)V(a) means a tighter upper bound on |μ(off)(a)μ(on)(a)||\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a)| is known to the DM, which leads to better performance.

Interestingly, Sav0(a)\text{Sav}_{0}(a) increases as μ(off)(a)μ(on)(a)\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a) decreases. Somewhat paradoxically, in the case of μ(off)(a)<μ(on)(a)\mu^{\text{(off)}}(a)<\mu^{\text{(on)}}(a), Sav0(a)\text{Sav}_{0}(a) increases with |μ(off)(a)μ(on)(a)||\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a)|. Indeed, the saving term Sav0(a)\text{Sav}_{0}(a) depends not only on the magnitude of the distributional shift, but also on its direction. To gain intuition, consider the case of μ(off)(a)<μ(on)(a)\mu^{\text{(off)}}(a)<\mu^{\text{(on)}}(a). The offline dataset on aa provides a pessimistic estimate of μ(on)(a)\mu^{\text{(on)}}(a), which encourages the DM to eliminate the sub-optimal arm aa, hence improving the DM’s performance.
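To make the two regimes concrete, the following small sketch (with hypothetical numbers, not taken from the paper) evaluates the saving term (10):

def sav0(T_S, Delta, V, bias):
    # Saving term (10); bias stands for mu^(off)(a) - mu^(on)(a), so omega = V + bias as in (8).
    omega = V + bias
    return T_S * Delta * max(1.0 - omega / Delta, 0.0) ** 2

# "Sufficiently close": omega = 0.05 + (-0.02) = 0.03 < Delta = 0.2, so the saving is positive.
print(sav0(T_S=1000, Delta=0.2, V=0.05, bias=-0.02))   # 144.5
# "Far apart": omega = 0.3 + 0.1 = 0.4 >= Delta = 0.2, so the saving vanishes.
print(sav0(T_S=1000, Delta=0.2, V=0.3, bias=0.1))      # 0.0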

Lastly, we compare with existing baselines. By setting V(a)=V(a)=\infty for all aa, the regret bound (9) reduces to the bound (1) of the vanilla UCB. In the case when P(on)=P(off)P^{\text{(on)}}=P^{(\text{off})} and this is known to the DM, setting V(a)=0V(a)=0 for all aa reduces the regret bound (9) to (2). In both cases we achieve the state-of-the-art regret bounds, while additionally quantifying the impact of distributional drift on the regret bound. Finally, the ARROW-CB policy by (Zhang et al., 2019) can be applied to our model. While ARROW-CB does not require any auxiliary input, its regret bound is at least Ω(K1/3T2/3)\Omega(K^{1/3}T^{2/3}). ARROW-CB provides improvements over purely online policies in the contextual bandit setting, as discussed in Appendix A.1.

We next provide a regret lower bound that nearly matches (9). To proceed, we consider a subset of instances with bounded distributional drift, as defined below.

Definition 4.3.

Let V=(V(a))a𝒜0KV=(V(a))_{a\in{\cal A}}\in\mathbb{R}_{\geq 0}^{K}. We define V={I:|μ(on)(a)μ(off)(a)|V(a) for all a𝒜}{\cal I}_{V}=\{I:|\mu^{\text{(on)}}(a)-\mu^{\text{(off)}}(a)|\leq V(a)\text{ for all $a\in{\cal A}$}\}.

Knowing a valid bias bound VV is equivalent to knowing that the underlying instance II lies in the subset V{\cal I}_{V}, leading to the following corollary, which is equivalent to Theorem 4.2:

Corollary 4.4.

For any V0KV\in\mathbb{R}_{\geq 0}^{K}, the MIN-UCB policy with input VV and δt=1/(2Kt2)\delta_{t}=1/(2Kt^{2}) satisfies 𝔼[RegT(MIN-UCB,P)](9)\mathbb{E}[\text{Reg}_{T}(\text{MIN-UCB},P)]\leq(\ref{eq:reg_dep_upper}), for all IVI\in{\cal I}_{V}.

Our regret lower bounds involve instances with Gaussian rewards.

Definition 4.5.

An instance II is a Gaussian instance if Pa(on),Pa(off)P^{\text{(on)}}_{a},P^{\text{(off)}}_{a} are Gaussian distributions with variance one, for every a𝒜.a\in{\cal A}.

Similar to existing works, we focus on consistent policies.

Definition 4.6.

For C>0,p(0,1)C>0,p\in(0,1) and a collection {\cal I} of instances, a non-anticipatory policy π\pi is said to be (C,p)(C,p)-consistent on {\cal I}, if for all II\in{\cal I}, it holds that 𝔼[RegT(P,π)]CTp\mathbb{E}[\text{Reg}_{T}(P,\pi)]\leq CT^{p}.

For example, the vanilla UCB is (C,1/2)(C,1/2)-consistent for some absolute constant C>0C>0.

Theorem 4.7.

Let V0KV\in\mathbb{R}_{\geq 0}^{K} be arbitrary, and consider a fixed but arbitrary Gaussian instance IVI\in{\cal I}_{V}. For any (C,p)(C,p)-consistent policy π\pi on V{\cal I}_{V}, we have the following regret lower bound of π\pi on II: 𝔼[RegT(π,P)]\mathbb{E}[\text{Reg}_{T}(\pi,P)]\geq

a:Δ(a)>0{2(1p)(1+ϵ)2logTΔ(a)Savϵ(a)+κϵ(a)}\displaystyle\sum_{a:\Delta(a)>0}\left\{\frac{2(1-p)}{(1+\epsilon)^{2}}\cdot\frac{\log T}{\Delta(a)}-\text{Sav}_{\epsilon}(a)+\kappa_{\epsilon}(a)\right\} (11)

holds for any ϵ(0,1]\epsilon\in(0,1], where

Savϵ(a)=TS(a)Δ(a)max{1ω(a)(1+ϵ)Δ(a),0}2,\text{Sav}_{\epsilon}(a)=T_{\text{S}}(a)\cdot\Delta(a)\cdot\max\left\{1-\frac{\omega(a)}{(1+\epsilon)\Delta(a)},0\right\}^{2}, (12)

and κϵ(a)=12(1+ϵ)2Δ(a)logϵΔ(a)8C\kappa_{\epsilon}(a)=\frac{1}{2(1+\epsilon)^{2}\Delta(a)}\log\frac{\epsilon\Delta(a)}{8C} is independent of TT or TS(a)T_{\text{S}}(a).

Theorem 4.7 is proved in Appendix C.3. Modulo the additive term κϵ(a)\kappa_{\epsilon}(a), the regret upper bound (9) and lower bound (11) are nearly matching. Both bounds feature the Θ((logT)/Δ(a))\Theta((\log T)/\Delta(a)) regret term that is classical in the literature (Lattimore & Szepesvári, 2020). Importantly, the lower bound (11) features the regret saving term Savϵ(a)\text{Sav}_{\epsilon}(a), which closely matches the term Sav0(a)\text{Sav}_{0}(a) in the upper bound. Indeed, since ϵ(0,1]\epsilon\in(0,1] is arbitrary, Savϵ(a)\text{Sav}_{\epsilon}(a) tends to the term Sav0(a)\text{Sav}_{0}(a) in the regret upper bound (9) as ϵ\epsilon tends to 0. The close match highlights the fundamental nature of the discrepancy measure ω(a)\omega(a). In addition, the lower bound (11) carries the insight that the offline data on arm aa facilitate improvement over the vanilla UCB if and only if ω(a)<(1+ϵ)Δ(a)\omega(a)<(1+\epsilon)\Delta(a), which is consistent with our insights from the regret upper bound (9) as ϵ0\epsilon\rightarrow 0.

Lastly, in the case P(off)=P(on)P^{\text{(off)}}=P^{\text{(on)}}, the lower bound (11) reduces to Ω(a:Δ(a)>0[logTΔ(a)TS(a)Δ(a)+κϵ(a)])\Omega(\sum_{a:\Delta(a)>0}[\frac{\log T}{\Delta(a)}-T_{\text{S}}(a)\Delta(a)+\kappa_{\epsilon}(a)]), which ascertains the near tightness of (2) by HUCB1 and MonUCB. A nearly tight regret lower bound is novel in the literature.

4.2 Sketch Proof of Theorem 4.2

Theorem 4.2 is a direct consequence of the following lemma:

Lemma 4.8.

Let aa be a sub-optimal arm, that is Δ(a)>0\Delta(a)>0. Conditioned on the event t{\cal E}_{t}, if it holds that

Nt(a)>32log(4Kt4)Δ(a)2TS(a)max{1ω(a)Δ(a),0}2,N_{t}(a)>32\cdot\frac{\log(4Kt^{4})}{\Delta(a)^{2}}-T_{\text{S}}(a)\cdot\max\left\{1-\frac{\omega(a)}{\Delta(a)},0\right\}^{2},

then AtaA_{t}\neq a with certainty.

Note that Sav0(a)Δ(a)=TS(a)max{1ω(a)Δ(a),0}2\frac{\text{Sav}_{0}(a)}{\Delta(a)}=T_{\text{S}}(a)\cdot\max\{1-\frac{\omega(a)}{\Delta(a)},0\}^{2}. The lemma is fully proved in Appendix B.2, and a sketch proof is given below.

Sketch Proof of Lemma 4.8.

Consider the following three cases:

Case 1a: ω(a)<Δ(a)\omega(a)<\Delta(a), and Sav0(a)Δ(a)16log(4Kt4)Δ(a)2\frac{\text{Sav}_{0}(a)}{\Delta(a)}\geq\frac{16\log(4Kt^{4})}{\Delta(a)^{2}},

Case 1b: ω(a)<Δ(a)\omega(a)<\Delta(a), and Sav0(a)Δ(a)<16log(4Kt4)Δ(a)2\frac{\text{Sav}_{0}(a)}{\Delta(a)}<\frac{16\log(4Kt^{4})}{\Delta(a)^{2}},

Case 2: ω(a)Δ(a)\omega(a)\geq\Delta(a).

In both Cases 1b and 2, the condition on Nt(a)N_{t}(a) in the lemma implies Nt(a)>16log(4Kt4)Δ(a)2N_{t}(a)>16\cdot\frac{\log(4Kt^{4})}{\Delta(a)^{2}}. Following (Auer et al., 2002), we can then deduce

UCBt(a)\displaystyle\text{UCB}_{t}(a) μ(on)(a)+Δ(a)2\displaystyle\leq\mu^{\text{(on)}}(a)+\frac{\Delta(a)}{2}
<μ(on)min{UCBt(a),UCBtS(a)},\displaystyle<\mu^{(\text{on})}_{*}\leq\min\left\{\text{UCB}_{t}(a_{*}),\text{UCB}^{\text{S}}_{t}(a_{*})\right\},

thus AtaA_{t}\neq a with certainty. In what follows, we focus on Case 1a, the main case when MIN-UCB incorporates the offline dataset in learning. Case 1a and V(a)V(a) being a valid bias bound imply

ω(a)Δ(a)[0,1).\frac{\omega(a)}{\Delta(a)}\in[0,1). (13)

Then

2log(2t/δt)Nt(a)+TS(a)2log(4Kt4)TS(a)\displaystyle\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(a)+T_{\text{S}}(a)}}\leq\sqrt{\frac{2\log(4Kt^{4})}{T_{\text{S}}(a)}}
()\displaystyle\overset{(\dagger)}{\leq} (1ω(a)Δ(a))2log(4Kt4)16log(4Kt4)Δ(a)2=(1ω(a)Δ(a))Δ(a)8.\displaystyle\left(1-\frac{\omega(a)}{\Delta(a)}\right)\sqrt{\frac{2\log(4Kt^{4})}{\frac{16\log(4Kt^{4})}{\Delta(a)^{2}}}}=\left(1-\frac{\omega(a)}{\Delta(a)}\right)\cdot\frac{\Delta(a)}{\sqrt{8}}.

()(\dagger) is by the second condition in Case 1a and (13). Also,

TS(a)ω(a)Nt(a)+TS(a)ω(a)Δ(a)Δ(a).\frac{T_{\text{S}}(a)\cdot\omega(a)}{N_{t}(a)+T_{\text{S}}(a)}\leq\frac{\omega(a)}{\Delta(a)}\cdot\Delta(a).

Consequently, conditioned on event t{\cal E}_{t},

UCBtS(a)\displaystyle\text{UCB}^{\text{S}}_{t}(a)
\displaystyle\leq μ(on)(a)+22log(2t/δt)Nt(a)+TS(a)+TS(a)ω(a)TS(a)+Nt(a)\displaystyle\mu^{\text{(on)}}(a)+2\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(a)+T_{\text{S}}(a)}}+\frac{T_{\text{S}}(a)\cdot\omega(a)}{T_{\text{S}}(a)+N_{t}(a)} (14)
\displaystyle\leq μ(on)(a)+(1ω(a)Δ(a))Δ(a)2+ω(a)Δ(a)Δ(a)\displaystyle\mu^{\text{(on)}}(a)+\left(1-\frac{\omega(a)}{\Delta(a)}\right)\cdot\frac{\Delta(a)}{\sqrt{2}}+\frac{\omega(a)}{\Delta(a)}\cdot\Delta(a)
<()\displaystyle\overset{(\ddagger)}{<} μ(on)()min{UCBt(a),UCBtS(a)},\displaystyle\mu^{(\text{on})}_{*}\overset{(\P)}{\leq}\min\left\{\text{UCB}_{t}(a_{*}),\text{UCB}^{\text{S}}_{t}(a_{*})\right\},

()(\ddagger) is because 01ω(a)Δ(a)<10\leq 1-\frac{\omega(a)}{\Delta(a)}<1 by (13). Both ()(\P) and (14) follow from the event t{\cal E}_{t}. Altogether, AtaA_{t}\neq a with certainty. ∎

4.3 Instance Independent Regret Bounds

Now we analyze the instance independent bound. We set δt=δ/(2Kt2)\delta_{t}=\delta/(2Kt^{2}), where δ(0,1)\delta\in(0,1) is a prescribed confidence parameter.

Theorem 4.9.

Assume 2KT2\leq K\leq T. Consider policy MIN-UCB, which inputs a valid bias bound VV on the underlying instance and δt=δ/(2Kt2)\delta_{t}=\delta/(2Kt^{2}) for t=1,t=1,\ldots, where δ(0,1)\delta\in(0,1). With probability at least 1δ1-\delta,  RegT(MIN-UCB,P)=\text{Reg}_{T}(\text{MIN-UCB},P)=

O(min{KTlogTδ,(logTδτ+Vmax)T}),O\left(\min\left\{\sqrt{KT\log\frac{T}{\delta}},\left(\sqrt{\frac{\log\frac{T}{\delta}}{\tau_{*}}}+V_{\text{max}}\right)\cdot T\right\}\right), (15)

where Vmax=maxa𝒜V(a)V_{\text{max}}=\max_{a\in{\cal A}}V(a), and (τ,n)(\tau_{*},n_{*}) is an optimum solution of the following linear program:

(LP): maxτ,n\displaystyle\text{(LP): }\max_{\tau,n} τ\displaystyle\tau (16)
s.t. τTS(a)+n(a)a𝒜,\displaystyle\tau\leq T_{\text{S}}(a)+n(a)\quad\forall a\in\mathcal{A},
a𝒜n(a)=T,\displaystyle\sum_{a\in{\cal A}}n(a)=T,
τ0,n(a)0a𝒜.\displaystyle\tau\geq 0,n(a)\geq 0\quad\;\;\forall a\in\mathcal{A}.

The proof of Theorem 4.9 is in Appendix B.3. In the minimum in (15), the first term corresponds to the regret due to {radt(At)}t=K+1T\{\text{rad}_{t}(A_{t})\}^{T}_{t=K+1}, in the same vein as the vanilla UCB. The second term corresponds to the regret due to {radtS(At)}t=K+1T\{\text{rad}^{\text{S}}_{t}(A_{t})\}^{T}_{t=K+1} caused by the warm-start UCBS\text{UCB}^{\text{S}}. The minimum corresponds to the fact that MIN-UCB adapts to the better of the two UCB’s.

We make two observations on τ\tau_{*}. Firstly, τT/K\tau_{*}\geq T/K for any {TS(a)}a𝒜\{T_{\text{S}}(a)\}_{a\in{\cal A}}. Indeed, the solution (τ¯,n¯)(\bar{\tau},\bar{n}) defined as τ¯=n¯(a)=T/K\bar{\tau}=\bar{n}(a)=T/K for all a𝒜a\in{\cal A} is feasible to (LP). Crucially, we have KTT/τ\sqrt{KT}\geq T/\sqrt{\tau_{*}}, which implies RegT(MIN-UCB,P)=o(KTlog(T/δ))\text{Reg}_{T}(\text{MIN-UCB},P)=o(\sqrt{KT\log(T/\delta)}) when Vmax=o(K/T)V_{\text{max}}=o(\sqrt{K/T}), and RegT(MIN-UCB,P)=O(KTlog(T/δ))\text{Reg}_{T}(\text{MIN-UCB},P)=O(\sqrt{KT\log(T/\delta)}) otherwise. Secondly, in the case when TS(a)=mT_{\text{S}}(a)=m for all a𝒜a\in{\cal A}, it can be verified that τ=(T/K)+m\tau_{*}=(T/K)+m is the optimum to (LP), with n(a)=T/Kn_{*}(a)=T/K for all aa, meaning that T/τ=T/(T/K)+mKTT/\sqrt{\tau_{*}}=T/\sqrt{(T/K)+m}\leq\sqrt{KT} for any m0m\in\mathbb{Z}_{\geq 0}.

The second term in (15) requires a new analysis, as we need to determine the worst-case regret under the heterogeneity in the sample sizes {TS(a)}a𝒜\{T_{\text{S}}(a)\}_{a\in{\cal A}}. We overcome the heterogeneity with the novel linear program (LP) in (16). The constraints in (LP) are set such that {n(a)}a𝒜\{n_{*}(a)\}_{a\in{\cal A}} represents a worst-case {𝔼[NT(a)]}a𝒜\{\mathbb{E}[N_{T}(a)]\}_{a\in{\cal A}} that maximizes the instance-independent regret in the case of P(on)=P(off)P^{\text{(on)}}=P^{\text{(off)}}. We incorporate the auxiliary decision variable τ\tau in (LP), where a non-anticipatory policy incurs at most O(1/τ)O(1/\sqrt{\tau_{*}}) regret per time round under the worst-case number of arm pulls.
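For concreteness, τ_* can be computed numerically from (16); below is a minimal sketch doing so with scipy.optimize.linprog, where the solver choice and the example numbers are illustrative assumptions, not from the paper.

import numpy as np
from scipy.optimize import linprog

def tau_star(T_S, T):
    """Solve (LP) in (16) with decision variables x = (tau, n(1), ..., n(K))."""
    K = len(T_S)
    c = np.zeros(K + 1)
    c[0] = -1.0                                        # maximize tau <=> minimize -tau
    A_ub = np.hstack([np.ones((K, 1)), -np.eye(K)])    # tau - n(a) <= T_S(a) for each a
    b_ub = np.asarray(T_S, dtype=float)
    A_eq = np.hstack([np.zeros((1, 1)), np.ones((1, K))])   # sum_a n(a) = T
    b_eq = np.array([float(T)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x[0]

# Equal offline sizes T_S(a) = m recover tau_* = T/K + m, as discussed above.
print(tau_star([50, 50, 50, 50], T=1000))   # approximately 300.0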

Despite the non-closed-form nature of τ\tau_{*}, we demonstrate that τ\tau_{*} is fundamental to a tight regret bound by the following regret lower bound, which matches (15) up to a logarithmic factor.

Theorem 4.10.

Let Vmax0V_{\text{max}}\in\mathbb{R}_{\geq 0} and {TS(a)}a𝒜𝒜\{T_{\text{S}}(a)\}_{a\in{\cal A}}\in\mathbb{N}^{\cal A} be fixed but arbitrary, and let there be K2K\geq 2 arms. Set V(a)=VmaxV(a)=V_{\text{max}} for all a𝒜a\in{\cal A}. For any non-anticipatory policy π\pi, there exists a Gaussian instance IVI\in{\cal I}_{V} with offline sample sizes {TS(a)}a𝒜\{T_{\text{S}}(a)\}_{a\in{\cal A}}, such that 𝔼[RegT(π,P)]=\mathbb{E}[\text{Reg}_{T}(\pi,P)]=

Ω(min{KT,(1τ+Vmax)T}),\Omega\left(\min\left\{\sqrt{KT},\left(\frac{1}{\sqrt{\tau_{*}}}+V_{\text{max}}\right)\cdot T\right\}\right),

where τ\tau_{*} is the optimum of (16).

Theorem 4.10 is proved in Appendix C.4. We conclude the section by highlighting that, in the case of P(off)=P(on)P^{\text{(off)}}=P^{\text{(on)}}, no existing work establishes a problem independent regret bound of o(KTlog(T/δ))o(\sqrt{KT\log(T/\delta)}). To this end, by setting V(a)=0V(a)=0 for all aa, our analysis implies that MIN-UCB achieves a regret of O(Tlog(T/δ)/τ)O(T\sqrt{\log(T/\delta)/\tau_{*}}) and there is a regret lower bound of Ω(T/τ)\Omega(T/\sqrt{\tau_{*}}), showing the near-optimality of MIN-UCB in the well-aligned setting.

Figure 1: Regret on different VV. Panels: (a) different VV; (b) V=0.2V=0.2; (c) V=0.6V=0.6.

Figure 2: Regret on different TST_{\text{S}}. Panels: (a) V=0.1V=0.1; (b) V=0.3V=0.3.

5 Numerical Experiments

We perform numerical experiments to evaluate MIN-UCB. We compare MIN-UCB with three related benchmarks. The first policy is the vanilla UCB (Auer et al., 2002), which ignores the offline data; we call it “PURE-UCB”. The second policy chooses an arm with the largest UCBtS(a)\text{UCB}^{\text{S}}_{t}(a) defined in (6), which we call “UCBS”. The third policy is HUCB1 (Shivaswamy & Joachims, 2012), which we call “HUCB”.

In all experiments, we simulate our algorithm MIN-UCB and the benchmarks on several instances, with fixed K=4K=4. Each arm’s offline and online rewards follow Gaussian distributions with unit variance and means μ(off)(a)\mu^{\text{(off)}}(a) and μ(on)(a)\mu^{\text{(on)}}(a), respectively. In each specific instance, we run each algorithm 100 times and report the average and standard error. For simplicity, we assume V(a)=VV(a)=V for any sub-optimal arm aa and TS(a)=TST_{\text{S}}(a)=T_{\text{S}} for any arm aa. The detailed setup is provided in Appendix D.

Figure 1(a) plots the regret on a single instance under different valid bias bounds VV, with T=TS=10000T=T_{\text{S}}=10000 in each trial. Specifically, for V=0.2,0.6V=0.2,0.6, we illustrate the regret across different TT in Figures 1(b), 1(c). These results demonstrate the benefit of our algorithm. When VV is small, indicating that the DM has a precise estimate of |μ(off)(a)μ(on)(a)||\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a)|, our MIN-UCB and UCBS, which effectively leverage the information from VV and the offline dataset, perform well. By contrast, when VV is large, implying a loose estimate of |μ(off)(a)μ(on)(a)||\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a)|, the offline dataset becomes challenging to utilize effectively. In such cases, MIN-UCB’s performance aligns with PURE-UCB, and remains unaffected by the looseness of VV, which misleads the DM into believing that |μ(off)(a)μ(on)(a)||\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a)| is much larger than the actual gap. The stable performance of MIN-UCB is in contrast with the performance of UCBS and HUCB. Thus, the robustness of our algorithm is validated through these numerical results.

Figures 2(a) and 2(b) show the regret under different TST_{\text{S}}, where in the first group V=0.1V=0.1 (small VV) and in the second group V=0.3V=0.3 (large VV). Both figures demonstrate that MIN-UCB consistently outperforms the alternative methods, irrespective of TST_{\text{S}} and VV. A noteworthy observation is that when the offline data is easy to leverage (small VV), a larger offline dataset (increased TST_{\text{S}}) results in a larger improvement of our algorithm over the vanilla UCB. This observation aligns with intuition, as a larger TST_{\text{S}} coupled with a smaller VV indicates more informative insights into the online distribution. Conversely, when VV is large and the offline data is hard to utilize, increasing the size of the offline dataset (TST_{\text{S}}) does not significantly impact the performance of MIN-UCB. This is because our algorithm can adapt to this environment and consistently align with the vanilla UCB, thereby mitigating the influence of the offline data on performance.

6 Conclusion

We explore a multi-armed bandit problem with additional offline data. Our approach of adapting to the possibly biased offline data is novel in the literature, leading to tight and optimal regret bounds. There are several interesting future directions. For example, one can further investigate how to apply our approach to other online learning models, such as linear bandits and online Markov decision processes. It is also intriguing to study how other metrics that measure distributional drift can affect and facilitate algorithm design.

Impact Statements

This paper presents work whose goal is to advance the field of Online Learning. We primarily focus on theoretical research. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Agarwal et al. (2017) Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In Kale, S. and Shamir, O. (eds.), Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, volume 65 of Proceedings of Machine Learning Research, pp.  12–38. PMLR, 2017.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2–3):235–256, may 2002. ISSN 0885-6125. URL https://doi.org/10.1023/A:1013689704352.
  • Banerjee et al. (2022) Banerjee, S., Sinclair, S. R., Tambe, M., Xu, L., and Yu, C. L. Artificial replay: a meta-algorithm for harnessing historical data in bandits. arXiv preprint arXiv:2210.00025, 2022.
  • Ben-David et al. (2010) Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Mach. Learn., 79(1–2):151–175, may 2010. ISSN 0885-6125.
  • Besbes et al. (2022) Besbes, O., Ma, W., and Mouchtaki, O. Beyond iid: data-driven decision-making in heterogeneous environments. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  23979–23991. Curran Associates, Inc., 2022.
  • Blanchet et al. (2019) Blanchet, J., Kang, Y., and Murthy, K. Robust wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019.
  • Bouneffouf et al. (2019) Bouneffouf, D., Parthasarathy, S., Samulowitz, H., and Wistuba, M. Optimal exploitation of clustering and history information in multi-armed bandit. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp.  2016–2022, 2019.
  • Bu et al. (2020) Bu, J., Simchi-Levi, D., and Xu, Y. Online pricing with offline data: Phase transition and inverse square law. In International Conference on Machine Learning, pp.  1202–1210. PMLR, 2020.
  • Chen et al. (2022) Chen, X., Shi, P., and Pu, S. Data-pooling reinforcement learning for personalized healthcare intervention. arXiv preprint arXiv:2211.08998, 2022.
  • Crammer et al. (2008) Crammer, K., Kearns, M., and Wortman, J. Learning from multiple sources. Journal of Machine Learning Research, 9(57):1757–1774, 2008. URL http://jmlr.org/papers/v9/crammer08a.html.
  • Cutkosky et al. (2021) Cutkosky, A., Dann, C., Das, A., Gentile, C., Pacchiano, A., and Purohit, M. Dynamic balancing for model selection in bandits and rl. In International Conference on Machine Learning, pp.  2276–2285. PMLR, 2021.
  • Cutkosky et al. (2022) Cutkosky, A., Dann, C., Das, A., and Zhang, Q. Leveraging initial hints for free in stochastic linear bandits. In Dasgupta, S. and Haghtalab, N. (eds.), Proceedings of The 33rd International Conference on Algorithmic Learning Theory, volume 167 of Proceedings of Machine Learning Research, pp.  282–318. PMLR, 29 Mar–01 Apr 2022.
  • Gur & Momeni (2022) Gur, Y. and Momeni, A. Adaptive sequential experiments with unknown information arrival processes. Manufacturing & Service Operations Management, 24(5):2666–2684, 2022.
  • Hao et al. (2023) Hao, B., Jain, R., Lattimore, T., Van Roy, B., and Wen, Z. Leveraging demonstrations to improve online learning: Quality matters. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
  • Lee et al. (2021) Lee, J., Pacchiano, A., Muthukumar, V., Kong, W., and Brunskill, E. Online model selection for reinforcement learning with function approximation. In International Conference on Artificial Intelligence and Statistics, pp.  3340–3348. PMLR, 2021.
  • Liu & Li (2016) Liu, C.-Y. and Li, L. On the prior sensitivity of thompson sampling. In Ortner, R., Simon, H. U., and Zilles, S. (eds.), Algorithmic Learning Theory, pp.  321–336, Cham, 2016. Springer International Publishing.
  • Mansour et al. (2009) Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
  • Pacchiano et al. (2020) Pacchiano, A., Phan, M., Abbasi Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvari, C. Model selection in contextual stochastic bandit problems. Advances in Neural Information Processing Systems, 33:10328–10337, 2020.
  • Pacchiano et al. (2023) Pacchiano, A., Dann, C., and Gentile, C. Data-driven regret balancing for online model selection in bandits. CoRR, abs/2306.02869, 2023. URL https://doi.org/10.48550/arXiv.2306.02869.
  • Rakhlin & Sridharan (2013a) Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Conference on Learning Theory, pp.  993–1019. PMLR, 2013a.
  • Rakhlin & Sridharan (2013b) Rakhlin, S. and Sridharan, K. Optimization, learning, and games with predictable sequences. Advances in Neural Information Processing Systems, 26, 2013b.
  • Russo & Roy (2016) Russo, D. and Roy, B. V. An information-theoretic analysis of thompson sampling. Journal of Machine Learning Research, 17(68):1–30, 2016. URL http://jmlr.org/papers/v17/14-087.html.
  • Russo & Van Roy (2014) Russo, D. and Van Roy, B. Learning to optimize via information-directed sampling. Advances in Neural Information Processing Systems, 27, 2014.
  • Shivaswamy & Joachims (2012) Shivaswamy, P. and Joachims, T. Multi-armed bandit problems with history. In Artificial Intelligence and Statistics, pp.  1046–1054. PMLR, 2012.
  • Si et al. (2023) Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributionally robust batch contextual bandits. Management Science, 69(10):5772–5793, 2023.
  • Simchowitz et al. (2021) Simchowitz, M., Tosh, C., Krishnamurthy, A., Hsu, D. J., Lykouris, T., Dudik, M., and Schapire, R. E. Bayesian decision-making under misspecified priors with applications to meta-learning. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  26382–26394. Curran Associates, Inc., 2021.
  • Steinhardt & Liang (2014) Steinhardt, J. and Liang, P. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International conference on machine learning, pp.  1593–1601. PMLR, 2014.
  • Tennenholtz et al. (2021) Tennenholtz, G., Shalit, U., Mannor, S., and Efroni, Y. Bandits with partially observable confounded data. In Uncertainty in Artificial Intelligence, pp.  430–439. PMLR, 2021.
  • Wagenmaker & Pacchiano (2023) Wagenmaker, A. and Pacchiano, A. Leveraging offline data in online reinforcement learning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • Wang et al. (2021) Wang, Z., Zhang, C., Singh, M. K., Riek, L. D., and Chaudhuri, K. Multitask bandit learning through heterogeneous feedback aggregation. In Banerjee, A. and Fukumizu, K. (eds.), The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pp.  1531–1539. PMLR, 2021.
  • Wei & Luo (2018) Wei, C. and Luo, H. More adaptive algorithms for adversarial bandits. In Bubeck, S., Perchet, V., and Rigollet, P. (eds.), Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, volume 75 of Proceedings of Machine Learning Research, pp.  1263–1291. PMLR, 2018.
  • Wei et al. (2020a) Wei, C.-Y., Luo, H., and Agarwal, A. Taking a hint: How to leverage loss predictors in contextual bandits? In Abernethy, J. and Agarwal, S. (eds.), Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp.  3583–3634. PMLR, 09–12 Jul 2020a.
  • Wei et al. (2020b) Wei, C.-Y., Luo, H., and Agarwal, A. Taking a hint: How to leverage loss predictors in contextual bandits? In Conference on Learning Theory, pp.  3583–3634. PMLR, 2020b.
  • Ye et al. (2020) Ye, L., Lin, Y., Xie, H., and Lui, J. Combining offline causal inference and online bandit learning for data driven decision. arXiv preprint arXiv:2001.05699, 2020.
  • Zhang et al. (2019) Zhang, C., Agarwal, A., Daumé III, H., Langford, J., and Negahban, S. Warm-starting contextual bandits: Robustly combining supervised and bandit feedback. In International Conference on Machine Learning, pp.  7335–7344. PMLR, 2019.

Appendix A More Detailed Comparisons with Existing Works

A.1 Comparing with (Zhang et al., 2019)

(Zhang et al., 2019) consider a stochastic contextual KK-armed bandit model with possibly biased offline data, which generalizes our setting of stochastic KK-armed bandits. On one hand, their proposed algorithm ARROW-CB does not require any knowledge on the discrepancy between the offline and online data, while our proposed algorithm requires knowing an upper bound V(a)V(a) on |μ(off)(a)μ(on)(a)||\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a)| for each aa. On the other hand, in the case of stochastic KK-armed bandits, the regret bound of ARROW-CB (see Theorem 1 in (Zhang et al., 2019)) is at least the regret bound of the explore-then-commit algorithm (see Chapter 6 in (Lattimore & Szepesvári, 2020)) that ignores all offline data; in particular, it does not offer an improved regret bound compared to existing baselines that do not utilize offline data.

In more detail, the ARROW-CB algorithm inputs an exploration probability parameter ϵ(0,1)\epsilon\in(0,1) and a set Λ[0,1]\Lambda\subset[0,1] of weighted combination parameters that hedge between ignoring all offline data and incorporating all offline data. In their contextual setting, where they compare ARROW-CB with a benchmark policy class Π\Pi, they establish the following regret upper bound for ARROW-CB on an instance II:

ϵT+3Tlog(T|Π|)+32ϵKTlog(8T|Λ|)+minλΛ{disc(λ,I)}.\epsilon T+3\sqrt{T\log(T|\Pi|)}+\frac{32}{\sqrt{\epsilon}}\cdot\sqrt{KT\log(8T|\Lambda|)}+\min_{\lambda\in\Lambda}\{\text{disc}(\lambda,I)\}. (17)

The function disc(λ,I)\text{disc}(\lambda,I) (see Equation (4) in (Zhang et al., 2019)) is a discrepancy measure on the distance between the offline and online models, with the property that disc(λ,I)0\text{disc}(\lambda,I)\geq 0 for all λ,I\lambda,I.

On one hand, (17) improves upon the conventional regret bound O(KTlog(|Π|T))O(\sqrt{KT\log(|\Pi|T)}) in terms of the dependency on |Π||\Pi| when disc(λ,I)\text{disc}(\lambda,I) is sufficiently small. On the other hand, (17) does not provide any improvement over the conventional regret bounds in the stochastic KK-armed bandit setting. Indeed, in the KK-armed setting we have |Π|=K|\Pi|=K, where Π\Pi consists of the policies of always pulling arm aa, for a𝒜a\in{\cal A}. Ignoring the non-negative term minλΛ{disc(λ,I)}\min_{\lambda\in\Lambda}\{\text{disc}(\lambda,I)\} and optimizing ϵ\epsilon in the remaining terms (the minimum is attained at ϵ\epsilon of order (K/T)1/3(K/T)^{1/3}, up to logarithmic factors) shows that the remaining terms sum to at least

K1/3T2/3+3Tlog(TK)+32K1/3T2/3logT,K^{1/3}T^{2/3}+3\sqrt{T\log(TK)}+32K^{1/3}T^{2/3}\sqrt{\log T},

which is no better than the explore-then-commit policy’s regret of O(K1/3T2/3logT)O(K^{1/3}T^{2/3}\sqrt{\log T}).

Finally, one additional distinction is that (Zhang et al., 2019) requires TS(1)==TS(K)T_{\text{S}}(1)=\ldots=T_{\text{S}}(K), while we allow the offline sample sizes TS(1),,TS(K)T_{\text{S}}(1),\ldots,T_{\text{S}}(K) to be arbitrary non-negative integers.

A.2 Comparing with Online Learning with Advice

The frameworks of (Rakhlin & Sridharan, 2013a; Steinhardt & Liang, 2014; Wei & Luo, 2018; Wei et al., 2020a) utilize a commonly defined quantity {\cal E} to quantify the error in the predictions, which are provided in an online manner. We focus our discussion on comparison with (Wei & Luo, 2018). The framework of (Wei & Luo, 2018) provides the following results for adversarial multi-armed bandits with side information. Consider a KK-armed bandit model, where an arm a𝒜a\in{\cal A} is associated with loss t(a)[1,1]\ell_{t}(a)\in[-1,1] at time tt for t{1,,T}t\in\{1,\ldots,T\}, and the loss vectors {1,,T}\{\ell_{1},\ldots,\ell_{T}\} over the horizon of TT time steps are generated by an oblivious adversary. Before choosing an arm at𝒜a_{t}\in{\cal A} at time tt, the agent is endowed with a prediction mt[1,1]Km_{t}\in[-1,1]^{K}, where mt(a)m_{t}(a) is a prediction of t(a)\ell_{t}(a). They design a version of the optimistic online mirror descent algorithm that incorporates the predictions m1,,mTm_{1},\ldots,m_{T}, and achieves an expected regret bound of

𝔼[t=1Tt(a)minaKt=1Tt(a)]=O(Klog(T)),\mathbb{E}\left[\sum^{T}_{t=1}\ell_{t}(a)-\min_{a\in K}\sum^{T}_{t=1}\ell_{t}(a)\right]=O(\sqrt{K{\cal E}\log(T)}), (18)

where the expectation is taken over the internal randomization of the algorithm, which randomly chooses an arm in each round. The quantity =t=1Ttmt2=O(T){\cal E}=\sum^{T}_{t=1}\|\ell_{t}-m_{t}\|^{2}_{\infty}=O(T) represents the prediction errors.

We argue that their framework can only guarantee an O(KTlog(T))O(\sqrt{KT\log(T)}) expected regret bound in our setting, since the prediction error term {\cal E} depends on the realized losses rather than their means. Indeed, in our stochastic setting (translated to a loss minimization instead of a reward maximization model), the loss vectors 1,,T\ell_{1},\ldots,\ell_{T} are independent and identically distributed according to a common (but latent) distribution 𝒟{\cal D}. Consider the case when the loss of arm aa is distributed according to the Bernoulli distribution with mean μ(a)[1/4,3/4]\mu(a)\in[1/4,3/4] for each a𝒜a\in{\cal A}. Even in the case when we have the prediction mt(a)=μ(a)m_{t}(a)=\mu(a) for each tt, namely the predictions reveal the true means, we still have (t(a)μ(a))21/16(\ell_{t}(a)-\mu(a))^{2}\geq 1/16 for each arm with certainty, leading to T/16{\cal E}\geq T/16.

The framework of (Wei & Luo, 2018) can only guarantee a regret bound of o(KT)o(\sqrt{KT}) when =o(T){\cal E}=o(T). In our setting, this requires that tmt=o(1)\|\ell_{t}-m_{t}\|_{\infty}=o(1), meaning that the prediction mtm_{t} has to be correlated with, and thus contain information about, the realized loss t\ell_{t}, for Ω(T)\Omega(T) many time rounds tt. Different from (Wei & Luo, 2018), we achieve a regret bound of o(KT)o(\sqrt{KT}) in the presence of accurate offline data, instead of receiving hints about the realized online rewards. A similar conclusion (with a worse regret bound than (18)) holds when we compare our results with (Wei et al., 2020a), which presents regret bounds on adversarial contextual bandits with predictions, since their regret bounds are also stated in terms of the quantity {\cal E} mentioned above.
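The following small simulation (our own illustration, not from (Wei & Luo, 2018)) makes the point concrete: with Bernoulli losses whose means lie in [1/4, 3/4], even the perfect mean prediction m_t(a) = mu(a) leaves a per-round squared error of at least 1/16, so the prediction error quantity grows linearly in T.

import random

rng = random.Random(0)
mu = [0.3, 0.5, 0.7]            # loss means, all within [1/4, 3/4]
T, err = 10_000, 0.0
for _ in range(T):
    losses = [1.0 if rng.random() < m else 0.0 for m in mu]
    err += max((l - m) ** 2 for l, m in zip(losses, mu))    # ||l_t - m_t||_inf^2 with m_t = mu
print(err / T, ">= 1/16 =", 1 / 16)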

A.3 Comparing with (Chen et al., 2022)

(Chen et al., 2022) explore a finite-horizon reinforcement learning model incorporating possibly biased historical data, another generalization of our stochastic multi-armed bandit model. Similar to our work, their approach “Data-pooling Perturbed LSVI” (Algorithm 2) requires an upper bound on the discrepancy between offline and online data, defined as Δ\Delta in their Assumption 1. They propose a cross-validation method and illustrate through a case study that a substantial range of Δ\Delta values produces similar performance for their algorithm. However, their approach cannot facilitate the design of an algorithm that achieves our improved regret bounds compared to existing benchmarks that neglect the offline data.

In detail, their “Data-pooling Perturbed LSVI” algorithm is based on classical Value Iteration (VI). They adapt VI to the online learning setting by randomly perturbing the estimated value-to-go functions, and applying least squares fitting to estimate the Q-function. This approach is fundamentally different from the OFU principle we apply in the multi-armed bandit context. In addition, they combine offline and online data in a different way. Specifically, they compute a weighting parameter λt\lambda_{t} that balances between online and offline data, via tt, Δ\Delta, and TSmin=minh,s,aTS(h,s,a)T_{\text{S}}^{\min}=\min_{h,s,a}T_{\text{S}}(h,s,a), where TS(h,s,a)T_{\text{S}}(h,s,a) denotes the number of offline samples for the tuple (h,s,a)(h,s,a), with hh, ss, and aa referring to the epoch, state, and action, respectively. This λt\lambda_{t} is shown in their equation (10). They derive the following regret upper bound:

(1+2p0)1HSKt=1T/(SK)(εRDP(t,TSmin)+H(W¯+1)εPDP(t,TSmin)+εVDP(t,TSmin))+4HTδ\left(1+2p_{0}\right)^{-1}HSK\sum_{t=1}^{T/(SK)}\left(\varepsilon_{R}^{\text{DP}}(t,T_{\text{S}}^{\min})+H(\bar{W}+1)\varepsilon_{P}^{\text{DP}}(t,T_{\text{S}}^{\min})+\varepsilon_{V}^{\text{DP}}(t,T_{\text{S}}^{\min})\right)+4HT\delta (19)

where p0p_{0}, W¯\bar{W}, δ\delta are their input parameters, HH is the horizon length, and SS is the number of states. εRDP(,)\varepsilon_{R}^{\text{DP}}(\cdot,\cdot), εPDP(,)\varepsilon_{P}^{\text{DP}}(\cdot,\cdot), εVDP(,)\varepsilon_{V}^{\text{DP}}(\cdot,\cdot) are confidence radii defined in their Appendix EC.4.2. They state that when TSmin1T_{\text{S}}^{\min}\geq 1 and Δ\Delta is small (see their Theorem EC.1), this regret bound is strictly smaller than in the case without combining historical data.

However, (19) does not offer an explicit closed-form bound, and they do not provide any regret lower bound. More importantly, their implicit bound depends on the offline sample sizes solely through TSmin=minh,s,aTS(h,s,a)T_{\text{S}}^{\min}=\min_{h,s,a}T_{\text{S}}(h,s,a), which is equivalent to minaTS(a)\min_{a}T_{\text{S}}(a) in our model. In contrast, both our instance dependent bound (Theorem 4.2) and our instance independent bound (Theorem 4.9) demonstrate how the differences in TS(a)T_{S}(a) among the arms affect the regret. Thus, their approach cannot deliver the tight regret bounds on, and insights into, our model.

Appendix B Proofs for Regret Upper Bounds

B.1 Proof for Lemma 4.1

The proof uses the Chernoff inequality:

Proposition B.1 (Chernoff Inequality).

Let G1,,GmG_{1},\ldots,G_{m} be independent (though not necessarily identically distributed) 1-sub-Gaussian random variables. For any δ(0,1)\delta\in(0,1), it holds that

Pr[1m𝔼[i=1mGi]1mi=1mGi+2log(2/δ)m]1δ/2,\displaystyle\Pr\left[\frac{1}{m}\mathbb{E}\left[\sum^{m}_{i=1}G_{i}\right]\leq\frac{1}{m}\sum^{m}_{i=1}G_{i}+\sqrt{\frac{2\log(2/\delta)}{m}}\right]\geq 1-\delta/2,
Pr[1mi=1mGi1m𝔼[i=1mGi]+2log(2/δ)m]1δ/2.\displaystyle\Pr\left[\frac{1}{m}\sum^{m}_{i=1}G_{i}\leq\frac{1}{m}\mathbb{E}\left[\sum^{m}_{i=1}G_{i}\right]+\sqrt{\frac{2\log(2/\delta)}{m}}\right]\geq 1-\delta/2.
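As a quick sanity check of Proposition B.1 (our own illustration, not part of the proof), the following Monte Carlo sketch estimates how often the first inequality fails for standard Gaussian variables, a particular family of 1-sub-Gaussian random variables; the parameter values are arbitrary.

import math, random

rng = random.Random(0)
m, delta, n_rep = 20, 0.1, 50_000
radius = math.sqrt(2 * math.log(2 / delta) / m)
# The first inequality fails exactly when the sample mean falls more than `radius`
# below the true mean (here the true mean is 0).
violations = sum(
    sum(rng.gauss(0.0, 1.0) for _ in range(m)) / m < -radius
    for _ in range(n_rep)
)
print(violations / n_rep, "<=", delta / 2)    # observed failure rate vs. delta/2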
Proof of Lemma 4.1.

To prove the Lemma, it suffices to prove that Pr[tS(a)]1δt\Pr[{\cal E}^{\text{S}}_{t}(a)]\geq 1-\delta_{t}. Indeed, the inequality implies Pr(t(a))1δt\Pr({\cal E}_{t}(a))\geq 1-\delta_{t} by setting TS(a)=0T_{\text{S}}(a)=0, and then a union bound on the failure probabilities of tS(a),t(a){\cal E}^{\text{S}}_{t}(a),{\cal E}_{t}(a) for a𝒜a\in{\cal A} establishes the Lemma. Now, we have

Pr[μ(on)(a)UCBtS(a)]\displaystyle\Pr[\mu^{\text{(on)}}(a)\leq\text{UCB}^{\text{S}}_{t}(a)]
=\displaystyle= Pr[μ(on)(a)Nt(a)R^t(a)+TS(a)X^(a)Nt(a)+TS(a)+2log(2t/δt)Nt+(a)+TS(a)+TS(a)Nt+(a)+TS(a)V(a).]\displaystyle\Pr\left[\mu^{\text{(on)}}(a)\leq\frac{N_{t}(a)\cdot\hat{R}_{t}(a)+T_{\text{S}}(a)\cdot\hat{X}(a)}{N_{t}(a)+T_{\text{S}}(a)}+\sqrt{\frac{2\log(2t/\delta_{t})}{N^{+}_{t}(a)+T_{\text{S}}(a)}}+\frac{T_{\text{S}}(a)}{N^{+}_{t}(a)+T_{\text{S}}(a)}\cdot V(a).\right]
=\displaystyle= Pr[Nt(a)μ(on)(a)+TS(a)μ(off)(a)Nt(a)+TS(a)+TS(a)(μ(on)(a)μ(off)(a))Nt(a)+TS(a)\displaystyle\Pr\left[\frac{N_{t}(a)\mu^{\text{(on)}}(a)+T_{\text{S}}(a)\mu^{\text{(off)}}(a)}{N_{t}(a)+T_{\text{S}}(a)}+\frac{T_{\text{S}}(a)(\mu^{\text{(on)}}(a)-\mu^{\text{(off)}}(a))}{N_{t}(a)+T_{\text{S}}(a)}\leq\right.
Nt(a)R^t(a)+TS(a)X^(a)Nt(a)+TS(a)+2log(2t/δt)Nt(a)+TS(a)+TS(a)Nt(a)+TS(a)V(a).]\displaystyle\left.\frac{N_{t}(a)\cdot\hat{R}_{t}(a)+T_{\text{S}}(a)\cdot\hat{X}(a)}{N_{t}(a)+T_{\text{S}}(a)}+\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(a)+T_{\text{S}}(a)}}+\frac{T_{\text{S}}(a)}{N_{t}(a)+T_{\text{S}}(a)}\cdot V(a).\right]
\displaystyle\geq Pr[Nt(a)μ(on)(a)+TS(a)μ(off)(a)Nt(a)+TS(a)Nt(a)R^t(a)+TS(a)X^(a)Nt(a)+TS(a)+2log(2t/δt)Nt(a)+TS(a)]\displaystyle\Pr\left[\frac{N_{t}(a)\mu^{\text{(on)}}(a)+T_{\text{S}}(a)\mu^{\text{(off)}}(a)}{N_{t}(a)+T_{\text{S}}(a)}\leq\frac{N_{t}(a)\cdot\hat{R}_{t}(a)+T_{\text{S}}(a)\cdot\hat{X}(a)}{N_{t}(a)+T_{\text{S}}(a)}+\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(a)+T_{\text{S}}(a)}}\right] (20)
\displaystyle\geq PrYiP(on)(a)[nμ(on)(a)+TS(a)μ(off)(a)n+TS(a)i=1nYi+TS(a)X^(a)n+TS(a)+2log(2t/δt)n+TS(a) for n=1,2,,t]\displaystyle\Pr_{Y_{i}\sim P^{\text{(on)}}(a)}\left[\frac{n\mu^{\text{(on)}}(a)+T_{\text{S}}(a)\mu^{\text{(off)}}(a)}{n+T_{\text{S}}(a)}\leq\frac{\sum_{i=1}^{n}Y_{i}+T_{\text{S}}(a)\cdot\hat{X}(a)}{n+T_{\text{S}}(a)}+\sqrt{\frac{2\log(2t/\delta_{t})}{n+T_{\text{S}}(a)}}\text{\ for $n=1,2,\ldots,t$}\right] (21)
\displaystyle\geq 1δt/2.\displaystyle 1-\delta_{t}/2. (22)

Step (20) is by the input assumption that |μ(on)(a)μ(off)(a)|V(a)|\mu^{\text{(on)}}(a)-\mu^{\text{(off)}}(a)|\leq V(a). Step (21) is by a union bound over all possible values of Nt(a)N_{t}(a). Step (22) is by the Chernoff inequality.

Next,

Pr(UCBtS(a)μ(on)(a)+radtS(a)+2log(2Nt+(a)/δt)Nt+(a)+TS(a)+TS(a)(μ(off)(a)μ(on)(a))Nt+(a)+TS(a))\displaystyle\Pr\left(\text{UCB}^{\text{S}}_{t}(a)\leq\mu^{(\text{on})}(a)+\text{rad}^{\text{S}}_{t}(a)+\sqrt{\frac{2\log(2N^{+}_{t}(a)/\delta_{t})}{N^{+}_{t}(a)+T_{\text{S}}(a)}}+\frac{T_{\text{S}}(a)\cdot(\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a))}{N^{+}_{t}(a)+T_{\text{S}}(a)}\right)
=\displaystyle= Pr(Nt(a)R^t(a)+TS(a)X^(a)Nt(a)+TS(a)μ(on)(a)+2log(2Nt+(a)/δt)Nt+(a)+TS(a)+TS(a)(μ(off)(a)μ(on)(a))Nt+(a)+TS(a))\displaystyle\Pr\left(\frac{N_{t}(a)\cdot\hat{R}_{t}(a)+T_{\text{S}}(a)\cdot\hat{X}(a)}{N_{t}(a)+T_{\text{S}}(a)}\leq\mu^{(\text{on})}(a)+\sqrt{\frac{2\log(2N^{+}_{t}(a)/\delta_{t})}{N^{+}_{t}(a)+T_{\text{S}}(a)}}+\frac{T_{\text{S}}(a)\cdot(\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a))}{N^{+}_{t}(a)+T_{\text{S}}(a)}\right)
=\displaystyle= Pr(Nt(a)R^t(a)+TS(a)X^(a)Nt(a)+TS(a)Nt(a)μ(on)(a)+TS(a)μ(off)(a)Nt(a)+TS(a)+2log(2Nt+(a)/δt)Nt+(a)+TS(a))\displaystyle\Pr\left(\frac{N_{t}(a)\cdot\hat{R}_{t}(a)+T_{\text{S}}(a)\cdot\hat{X}(a)}{N_{t}(a)+T_{\text{S}}(a)}\leq\frac{N_{t}(a)\mu^{\text{(on)}}(a)+T_{\text{S}}(a)\mu^{\text{(off)}}(a)}{N_{t}(a)+T_{\text{S}}(a)}+\sqrt{\frac{2\log(2N^{+}_{t}(a)/\delta_{t})}{N^{+}_{t}(a)+T_{\text{S}}(a)}}\right)
\displaystyle\geq PrYiP(on)(a)[i=1nYi+TS(a)X^(a)n+TS(a)nμ(on)(a)+TS(a)μ(off)(a)n+TS(a)+2log(2t/δt)n+TS(a) for n=1,2,,t]\displaystyle\Pr_{Y_{i}\sim P^{\text{(on)}}(a)}\left[\frac{\sum_{i=1}^{n}Y_{i}+T_{\text{S}}(a)\cdot\hat{X}(a)}{n+T_{\text{S}}(a)}\leq\frac{n\mu^{\text{(on)}}(a)+T_{\text{S}}(a)\mu^{\text{(off)}}(a)}{n+T_{\text{S}}(a)}+\sqrt{\frac{2\log(2t/\delta_{t})}{n+T_{\text{S}}(a)}}\text{\ for $n=1,2,\ldots,t$}\right] (23)
\displaystyle\geq 1δt/2.\displaystyle 1-\delta_{t}/2. (24)

The justifications of (23, 24) are the same as (21, 22) respectively. Altogether, the Lemma is proved.
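For concreteness, here is a minimal sketch (our own illustration, not the authors' code) of the two indices appearing in the proof above. It assumes delta_t = 1/(2Kt^3), so that log(2t/delta_t) = log(4Kt^4) as in the calculations of Appendix B.2, and it writes offline_mean for the offline empirical mean X^(a).

import math

def ucb_indices(online_rewards, offline_mean, T_S, V, t, K):
    n = len(online_rewards)                        # N_t(a), assumed >= 1 here
    r_hat = sum(online_rewards) / n                # online empirical mean
    log_term = math.log(4 * K * t ** 4)            # = log(2t/delta_t) under the assumed delta_t
    ucb = r_hat + math.sqrt(2 * log_term / n)      # vanilla index UCB_t(a)
    pooled = (n * r_hat + T_S * offline_mean) / (n + T_S)
    ucb_s = (pooled + math.sqrt(2 * log_term / (n + T_S))
             + T_S / (n + T_S) * V)                # offline-assisted index UCB_t^S(a)
    return ucb, ucb_s

# With many offline samples and a small bias bound V(a), the second index is the tighter one.
print(ucb_indices([0.4, 0.6, 0.5], offline_mean=0.52, T_S=200, V=0.05, t=100, K=4))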

B.2 Proof of instance dependent regret upper bound

B.2.1 Proof of Lemma 4.8

We consider two main cases depending on ω(a)\omega(a) and Δ(a)\Delta(a):

Case 1: ω(a)<Δ(a)\omega(a)<\Delta(a). The case condition implies that

max{1ω(a)Δ(a),0}2=(1ω(a)Δ(a))2>0,ω(a)Δ(a)[0,1).\max\left\{1-\frac{\omega(a)}{\Delta(a)},0\right\}^{2}=\left(1-\frac{\omega(a)}{\Delta(a)}\right)^{2}>0,\quad\frac{\omega(a)}{\Delta(a)}\in[0,1).

To proceed, we consider two sub-cases:

Sub-case 1a: TS(a)(1ω(a)Δ(a))216log(4Kt4)Δ(a)2T_{\text{S}}(a)\cdot(1-\frac{\omega(a)}{\Delta(a)})^{2}\geq\frac{16\log(4Kt^{4})}{\Delta(a)^{2}}. Then

2log(2t/δt)Nt(a)+TS(a)2log(4Kt4)TS(a)(1ω(a)Δ(a))2log(4Kt4)16log(4Kt4)Δ(a)2=(1ω(a)Δ(a))Δ(a)8.\displaystyle\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(a)+T_{\text{S}}(a)}}\leq\sqrt{\frac{2\log(4Kt^{4})}{T_{\text{S}}(a)}}\leq\left(1-\frac{\omega(a)}{\Delta(a)}\right)\sqrt{\frac{2\log(4Kt^{4})}{\frac{16\log(4Kt^{4})}{\Delta(a)^{2}}}}=\left(1-\frac{\omega(a)}{\Delta(a)}\right)\cdot\frac{\Delta(a)}{\sqrt{8}}. (25)

The second inequality is by the lower bound on TS(a)T_{\text{S}}(a) in the condition of Sub-case 1a. In addition,

TS(a)Nt(a)+TS(a)ω(a)ω(a)Δ(a)Δ(a).\frac{T_{\text{S}}(a)}{N_{t}(a)+T_{\text{S}}(a)}\cdot\omega(a)\leq\frac{\omega(a)}{\Delta(a)}\cdot\Delta(a).

Consequently, conditioned on event t{\cal E}_{t},

UCBtS(a)\displaystyle\text{UCB}^{\text{S}}_{t}(a) μ(on)(a)+22log(2t/δt)Nt(a)+TS(a)+TS(a)TS(a)+Nt(a)ω(a)\displaystyle\leq\mu^{\text{(on)}}(a)+2\cdot\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(a)+T_{\text{S}}(a)}}+\frac{T_{\text{S}}(a)}{T_{\text{S}}(a)+N_{t}(a)}\cdot\omega(a) (26)
μ(on)(a)+(1ω(a)Δ(a))Δ(a)2+ω(a)Δ(a)Δ(a)\displaystyle\leq\mu^{\text{(on)}}(a)+\left(1-\frac{\omega(a)}{\Delta(a)}\right)\cdot\frac{\Delta(a)}{\sqrt{2}}+\frac{\omega(a)}{\Delta(a)}\cdot\Delta(a)
<μ(on)min{UCBt(a),UCBtS(a)},\displaystyle<\mu^{(\text{on})}_{*}\leq\min\left\{\text{UCB}_{t}(a_{*}),\text{UCB}^{\text{S}}_{t}(a_{*})\right\}, (27)

The strict inequality in step (27) is by the sub-case condition that 1ω(a)Δ(a)>01-\frac{\omega(a)}{\Delta(a)}>0. The non-strict inequalities in steps (26, 27) are by conditioning on the event t{\cal E}_{t}. Altogether, AtaA_{t}\neq a with certainty.

Sub-case 1b: TS(a)(1ω(a)Δ(a))2<16log(4Kt4)Δ(a)2T_{\text{S}}(a)\cdot(1-\frac{\omega(a)}{\Delta(a)})^{2}<\frac{16\log(4Kt^{4})}{\Delta(a)^{2}}. In this case, we clearly have

Nt(a)>32log(4Kt4)Δ(a)2TS(a)max{1ω(a)Δ(a),0}216log(4Kt4)Δ(a)2.N_{t}(a)>32\cdot\frac{\log(4Kt^{4})}{\Delta(a)^{2}}-T_{\text{S}}(a)\cdot\max\left\{1-\frac{\omega(a)}{\Delta(a)},0\right\}^{2}\geq\frac{16\log(4Kt^{4})}{\Delta(a)^{2}}.

Consequently, we have

radt(a)=2log(2t/δt)Nt(a)<2log(4Kt4)16log(4Kt4)Δ(a)2Δ(a)8,\text{rad}_{t}(a)=\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(a)}}<\sqrt{\frac{2\log(4Kt^{4})}{16\cdot\frac{\log(4Kt^{4})}{\Delta(a)^{2}}}}\leq\frac{\Delta(a)}{\sqrt{8}},

leading to

UCBt(a)μ(on)(a)+2Δ(a)8<μ(on)min{UCBt(a),UCBtS(a)},\text{UCB}_{t}(a)\leq\mu^{\text{(on)}}(a)+2\cdot\frac{\Delta(a)}{\sqrt{8}}<\mu^{(\text{on})}_{*}\leq\min\left\{\text{UCB}_{t}(a_{*}),\text{UCB}^{\text{S}}_{t}(a_{*})\right\},

thus AtaA_{t}\neq a with certainty.

Case 2: ω(a)Δ(a)\omega(a)\geq\Delta(a). Then max{1ω(a)Δ(a),0}2=0\max\{1-\frac{\omega(a)}{\Delta(a)},0\}^{2}=0, and Nt(a)>32log(4Kt4)Δ(a)2N_{t}(a)>32\cdot\frac{\log(4Kt^{4})}{\Delta(a)^{2}}. Similar to Sub-case 1b, we arrive at

radt(a)=2log(2t/δt)Nt(a)<2log(4Kt4)32log(4Kt4)Δ(a)2Δ(a)4.\text{rad}_{t}(a)=\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(a)}}<\sqrt{\frac{2\log(4Kt^{4})}{32\cdot\frac{\log(4Kt^{4})}{\Delta(a)^{2}}}}\leq\frac{\Delta(a)}{4}.

Consequently,

UCBt(a)μ(on)(a)+2Δ(a)4<μ(on)min{UCBt(a),UCBtS(a)},\text{UCB}_{t}(a)\leq\mu^{\text{(on)}}(a)+2\cdot\frac{\Delta(a)}{4}<\mu^{(\text{on})}_{*}\leq\min\left\{\text{UCB}_{t}(a_{*}),\text{UCB}^{\text{S}}_{t}(a_{*})\right\},

where the first and the last inequalities are by the event t{\cal E}_{t}. Altogether, AtaA_{t}\neq a with certainty.

Altogether, all cases are covered and the Lemma is proved.

B.2.2 Proof of Theorem 4.2

Applying Lemma 4.8, we get

𝔼[RegT(π,P)]\mathbb{E}[\text{Reg}_{T}(\pi,P)] =t=1T𝔼[(μ(on)μ(on)(At))[𝟏(t)+𝟏(tc)]]=\sum^{T}_{t=1}\mathbb{E}\left[(\mu_{*}^{(\text{on})}-\mu^{(\text{on})}(A_{t}))\cdot\left[\mathbf{1}({\cal E}_{t})+\mathbf{1}({\cal E}^{c}_{t})\right]\right]
t=1T𝔼[(μ(on)μ(on)(At))𝟏(t)]+Δmaxt=1T𝔼[𝟏(tc)],\displaystyle\leq\sum^{T}_{t=1}\mathbb{E}\left[(\mu_{*}^{(\text{on})}-\mu^{(\text{on})}(A_{t}))\cdot\mathbf{1}({\cal E}_{t})\right]+\Delta_{\text{max}}\sum^{T}_{t=1}\mathbb{E}\left[\mathbf{1}({\cal E}^{c}_{t})\right],

and

t=1T𝔼[(μ(on)μ(on)(At))𝟏(t)]\displaystyle\sum^{T}_{t=1}\mathbb{E}\left[(\mu_{*}^{(\text{on})}-\mu^{(\text{on})}(A_{t}))\cdot\mathbf{1}({\cal E}_{t})\right] =a𝒜Δ(a)𝔼[t=1T𝟏(At=a)𝟏(t)]\displaystyle=\sum_{a\in{\cal A}}\Delta(a)\mathbb{E}\left[\sum^{T}_{t=1}\mathbf{1}(A_{t}=a)\mathbf{1}({\cal E}_{t})\right]
a𝒜max{32log(4KT4)Δ(a)TS(a)Δ(a)max{1ω(a)Δ(a),0}2,Δ(a)},\displaystyle\leq\sum_{a\in{\cal A}}\max\left\{32\cdot\frac{\log(4KT^{4})}{\Delta(a)}-T_{\text{S}}(a)\cdot\Delta(a)\cdot\max\left\{1-\frac{\omega(a)}{\Delta(a)},0\right\}^{2},\Delta(a)\right\},

and t=1T𝔼[𝟏(tc)]π2/6\sum^{T}_{t=1}\mathbb{E}\left[\mathbf{1}({\cal E}^{c}_{t})\right]\leq\pi^{2}/6. Altogether, we arrive at

π26Δmax+a𝒜:Δ(a)>0max{32log(4KT4)Δ(a)TS(a)Δ(a)max{1ω(a)Δ(a),0}2,Δ(a)},\frac{\pi^{2}}{6}\Delta_{\text{max}}+\sum_{a\in{\cal A}:\Delta(a)>0}\max\left\{32\cdot\frac{\log(4KT^{4})}{\Delta(a)}-T_{\text{S}}(a)\cdot\Delta(a)\cdot\max\left\{1-\frac{\omega(a)}{\Delta(a)},0\right\}^{2},\Delta(a)\right\}, (28)

where Δmax=maxaΔ(a)\Delta_{\text{max}}=\max_{a}\Delta(a), which proves the Theorem.
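As a numerical illustration (our own, not from the paper) of how the offline data enter the bound, the snippet below evaluates the right-hand side of (28) for a given instance, with omega(a) = V(a) + (mu_off(a) - mu_on(a)) as recalled in Appendix C.3.1.

import math

def bound_28(mu_on, mu_off, V, T_S, T):
    K = len(mu_on)
    mu_star = max(mu_on)
    delta = [mu_star - m for m in mu_on]
    total = math.pi ** 2 / 6 * max(delta)
    for a in range(K):
        if delta[a] == 0:
            continue
        omega = V[a] + (mu_off[a] - mu_on[a])
        discount = T_S[a] * delta[a] * max(1 - omega / delta[a], 0) ** 2
        total += max(32 * math.log(4 * K * T ** 4) / delta[a] - discount, delta[a])
    return total

# Informative offline data (omega = 0, large T_S) collapse the per-arm term to Delta(a);
# with no offline data the bound reverts to the usual log(T)/Delta(a) scale.
print(bound_28([0.8, 0.5], [0.8, 0.5], [0.0, 0.0], T_S=[0, 20000], T=10**4))
print(bound_28([0.8, 0.5], [0.8, 0.5], [0.0, 0.0], T_S=[0, 0], T=10**4))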

B.3 Proof for instance independent regret upper bounds

B.3.1 Proof of Theorem 4.9

Conditioned on t=1Tt\cap^{T}_{t=1}{\cal E}_{t}, we have

Reg =t=1T[μ(on)μ(on)(At)]\displaystyle=\sum^{T}_{t=1}\left[\mu^{(\text{on})}_{*}-\mu^{(\text{on})}(A_{t})\right]
KΔmax+t=K+1T[min{UCBt(a),UCBtS(a)}μ(on)(At)]\displaystyle\leq K\Delta_{\text{max}}+\sum^{T}_{t=K+1}\left[\min\{\text{UCB}_{t}(a_{*}),\text{UCB}^{\text{S}}_{t}(a_{*})\}-\mu^{(\text{on})}(A_{t})\right]
KΔmax+t=K+1T[min{UCBt(At),UCBtS(At)}μ(on)(At)]\displaystyle\leq K\Delta_{\text{max}}+\sum^{T}_{t=K+1}\left[\min\{\text{UCB}_{t}(A_{t}),\text{UCB}^{\text{S}}_{t}(A_{t})\}-\mu^{(\text{on})}(A_{t})\right]
KΔmax+t=K+1T[min{μ(on)(At)+2radt(At),μ(on)(At)+2radtS(A(t))}μ(on)(At)]\displaystyle\leq K\Delta_{\text{max}}+\sum^{T}_{t=K+1}\left[\min\{\mu^{(\text{on})}(A_{t})+2\text{rad}_{t}(A_{t}),\mu^{(\text{on})}(A_{t})+2\text{rad}^{\text{S}}_{t}(A(t))\}-\mu^{(\text{on})}(A_{t})\right] (29)
KΔmax+2min{t=K+1Tradt(At),t=K+1TradtS(At)}.\displaystyle\leq K\Delta_{\text{max}}+2\min\left\{\sum^{T}_{t=K+1}\text{rad}_{t}(A_{t}),\sum^{T}_{t=K+1}\text{rad}^{\text{S}}_{t}(A_{t})\right\}. (30)

Step (29) is by applying Lemma 4.1, and noting that μ(off)(a)μ(on)(a)V(a)\mu^{(\text{off})}(a)-\mu^{\text{(on)}}(a)\leq V(a) in the upper bound of UCBtS(a)\text{UCB}^{\text{S}}_{t}(a) in the Lemma.

The conventional analysis shows that t=K+1Tradt(At)=O(KTlog(T/δ))\sum^{T}_{t=K+1}\text{rad}_{t}(A_{t})=O(\sqrt{KT\log(T/\delta)}). We next focus on upper bounding

t=K+1TradtS(At)\sum^{T}_{t=K+1}\text{rad}^{\text{S}}_{t}(A_{t}) =t=K+1T[2log(2t/δt)Nt(At)+TS(At)+TS(At)Nt(At)+TS(At)V(At)]=\sum^{T}_{t=K+1}\left[\sqrt{\frac{2\log(2t/\delta_{t})}{N_{t}(A_{t})+T_{\text{S}}(A_{t})}}+\frac{T_{\text{S}}(A_{t})}{N_{t}(A_{t})+T_{\text{S}}(A_{t})}\cdot V(A_{t})\right]
maxa𝒜V(a)T+a𝒜n=1NT(a)8log(T/δ)n+TS(a).\displaystyle\leq\max_{a\in{\cal A}}V(a)\cdot T+\sum_{a\in{\cal A}}\sum^{N_{T}(a)}_{n=1}\sqrt{\frac{8\log(T/\delta)}{n+T_{\text{S}}(a)}}.

We claim that

a𝒜n(a)=1NT(a)1n(a)+TS(a)a𝒜n(a)=1max{τTS(a),0}1n(a)+TS(a).\sum_{a\in{\cal A}}\sum^{N_{T}(a)}_{n(a)=1}\sqrt{\frac{1}{n(a)+T_{\text{S}}(a)}}\leq\sum_{a\in{\cal A}}\sum^{\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a),0\}}_{n(a)=1}\sqrt{\frac{1}{n(a)+T_{\text{S}}(a)}}. (31)

We show (31) by establishing two major claims:

Claim 1: Suppose that (NT(a))a𝒜0𝒜(N^{*}_{T}(a))_{a\in{\cal A}}\in\mathbb{N}_{\geq 0}^{\cal A} is an optimal solution to the following maximization problem

(IP): max(NT(a))a𝒜\displaystyle\text{(IP): }~{}\underset{(N_{T}(a))_{a\in{\cal A}}}{\text{max}}~{}~{} a𝒜n(a)=1NT(a)1n(a)+TS(a)\displaystyle\sum_{a\in{\cal A}}\sum^{N_{T}(a)}_{n(a)=1}\sqrt{\frac{1}{n(a)+T_{\text{S}}(a)}}
s.t. a𝒜NT(a)=T\displaystyle\sum_{a\in{\cal A}}N_{T}(a)=T ,
NT(a)0\displaystyle N_{T}(a)\in\mathbb{N}_{\geq 0} for all a𝒜.\displaystyle\text{ for all }a\in{\cal A}.

Then it must be the case that NT(a)max{τTS(a),0}N^{*}_{T}(a)\leq\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a),0\} for all a𝒜a\in{\cal A}, where we recall that τ\tau_{*} is the optimum of (LP).

Claim 2: For an optimal solution (τ,{n(a)}a𝒜)(\tau_{*},\{n_{*}(a)\}_{a\in{\cal A}}) to (LP), it holds that

n(a)=max{τTS(a),0} for every arm a.n_{*}(a)=\max\{\tau_{*}-T_{\text{S}}(a),0\}\text{ for every arm $a$.} (32)

We see that Claim 1 is immediately useful for proving (31), since then

a𝒜n(a)=1NT(a)1n(a)+TS(a)a𝒜n(a)=1NT(a)1n(a)+TS(a)a𝒜n(a)=1max{τTS(a),0}1n(a)+TS(a),\sum_{a\in{\cal A}}\sum^{N_{T}(a)}_{n(a)=1}\sqrt{\frac{1}{n(a)+T_{\text{S}}(a)}}\leq\sum_{a\in{\cal A}}\sum^{N^{*}_{T}(a)}_{n(a)=1}\sqrt{\frac{1}{n(a)+T_{\text{S}}(a)}}\leq\sum_{a\in{\cal A}}\sum^{\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a),0\}}_{n(a)=1}\sqrt{\frac{1}{n(a)+T_{\text{S}}(a)}},

where the first inequality is by the optimality of (NT(a))a𝒜(N^{*}_{T}(a))_{a\in{\cal A}} to (IP), and the fact that any realized (NT(a))a𝒜(N_{T}(a))_{a\in{\cal A}} must be feasible to (IP). The second inequality is a direct consequence of Claim 1. Claim 2 is an auxiliary claim for proving Claim 1. We postpone the proofs of these two claims to the end of the proof of the Theorem.

Given (31), for each a𝒜a\in{\cal A} we then have

n=1max{τTS(a),0}1n+TS(a)max{τTS(a),0}τt=1τ1tmax{τTS(a),0}4τ.\sum^{\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a),0\}}_{n=1}\sqrt{\frac{1}{n+T_{\text{S}}(a)}}\leq\frac{\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a),0\}}{\lceil\tau_{*}\rceil}\sum^{\lceil\tau_{*}\rceil}_{t=1}\sqrt{\frac{1}{t}}\leq\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a),0\}\cdot\frac{4}{\sqrt{\tau_{*}}}. (33)

Applying the feasibility of (τ,n)(\tau_{*},n_{*}) to (LP), we have

max{τTS(a),0}max{τTS(a),0}+1n(a)+1.\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a),0\}\leq\max\{\tau_{*}-T_{\text{S}}(a),0\}+1\leq n_{*}(a)+1.

Combining the above with (33) gives

a𝒜n=1max{τTS(a),0}1n+TS(a)a𝒜(n(a)+1)4τT8τ,\sum_{a\in{\cal A}}\sum^{\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a),0\}}_{n=1}\sqrt{\frac{1}{n+T_{\text{S}}(a)}}\leq\sum_{a\in{\cal A}}(n_{*}(a)+1)\cdot\frac{4}{\sqrt{\tau_{*}}}\leq T\cdot\frac{8}{\sqrt{\tau_{*}}},

hence a𝒜n=1NT(a)8log(T/δ)n+TS(a)=O(Tlog(T/δ)/τ)\sum_{a\in{\cal A}}\sum^{N_{T}(a)}_{n=1}\sqrt{\frac{8\log(T/\delta)}{n+T_{\text{S}}(a)}}=O(T\sqrt{\log(T/\delta)/\tau_{*}}). Altogether, the Theorem is proved after we provide the proofs to Claims 1, 2:

Proving Claim 1 We establish the claim by a contradiction argument. Suppose there exists an arm aa such that NT(a)max{τTS(a),0}+1N^{*}_{T}(a)\geq\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a),0\}+1. Firstly, we assert that there must exist another arm a𝒜{a}a^{\prime}\in{\cal A}\setminus\{a\} such that NT(a)max{τTS(a),0}1N^{*}_{T}(a^{\prime})\leq\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a^{\prime}),0\}-1. If not, we then have NT(a)max{τTS(a),0}+1N^{*}_{T}(a)\geq\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a),0\}+1 and NT(a′′)max{τTS(a′′),0}N^{*}_{T}(a^{\prime\prime})\geq\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a^{\prime\prime}),0\} for all a′′𝒜a^{\prime\prime}\in{\cal A}, but then

k𝒜NT(k)>k𝒜max{τTS(k),0}k𝒜max{τTS(k),0}=()k𝒜n(k)=T,\displaystyle\sum_{k\in{\cal A}}N^{*}_{T}(k)>\sum_{k\in{\cal A}}\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(k),0\}\geq\sum_{k\in{\cal A}}\max\{\tau_{*}-T_{\text{S}}(k),0\}\overset{(\dagger)}{=}\sum_{k\in{\cal A}}n_{*}(k)=T,

violating the constraint k𝒜NT(k)=T\sum_{k\in{\cal A}}N^{*}_{T}(k)=T. Note that the equality ()(\dagger) is by Claim 2. Thus, the claimed arm aa^{\prime} exists. To this end, the condition NT(a)max{τTS(a),0}1N^{*}_{T}(a^{\prime})\leq\max\{\lceil\tau_{*}\rceil-T_{\text{S}}(a^{\prime}),0\}-1 implies that τ>TS(a)\lceil\tau_{*}\rceil>T_{\text{S}}(a^{\prime}), since otherwise NT(a)1N^{*}_{T}(a^{\prime})\leq-1, which violates the non-negativity constraint on NT(a)N^{*}_{T}(a). Altogether we have established the existence of two distinct arms a,a𝒜a,a^{\prime}\in{\cal A} such that

NT(a)+TS(a)\displaystyle N^{*}_{T}(a)+T_{\text{S}}(a) max{τ,TS(a)}+1, in particular NT(a)1,\displaystyle\geq\max\{\lceil\tau_{*}\rceil,T_{\text{S}}(a)\}+1\text{, in particular }N^{*}_{T}(a)\geq 1,
NT(a)+TS(a)\displaystyle N^{*}_{T}(a^{\prime})+T_{\text{S}}(a^{\prime}) max{τ,TS(a)}1=τ1.\displaystyle\leq\max\{\lceil\tau_{*}\rceil,T_{\text{S}}(a^{\prime})\}-1=\lceil\tau_{*}\rceil-1.

To establish the contradiction argument, consider another solution (N~T(k))k𝒜(\tilde{N}_{T}(k))_{k\in{\cal A}} to the displayed optimization problem in the claim, where

N~T(k)={NT(k)1if k=aNT(k)+1if k=aNT(k)if k𝒜{a,a}.\tilde{N}_{T}(k)=\begin{cases}N^{*}_{T}(k)-1&\quad\text{if }k=a\\ N^{*}_{T}(k)+1&\quad\text{if }k=a^{\prime}\\ N^{*}_{T}(k)&\quad\text{if }k\in{\cal A}\setminus\{a,a^{\prime}\}.\end{cases}

Since NT(a)1N^{*}_{T}(a)\geq 1, we have N~T(a)0\tilde{N}_{T}(a)\geq 0, and hence (N~T(k))k𝒜(\tilde{N}_{T}(k))_{k\in{\cal A}} is a feasible solution. But then we have

a𝒜n(a)=1N~T(a)1n(a)+TS(a)a𝒜n(a)=1NT(a)1n(a)+TS(a)\displaystyle\sum_{a\in{\cal A}}\sum^{\tilde{N}_{T}(a)}_{n(a)=1}\sqrt{\frac{1}{n(a)+T_{\text{S}}(a)}}-\sum_{a\in{\cal A}}\sum^{N^{*}_{T}(a)}_{n(a)=1}\sqrt{\frac{1}{n(a)+T_{\text{S}}(a)}}
=\displaystyle= 1NT(a)+TS(a)+11NT(a)+TS(a)\displaystyle\sqrt{\frac{1}{N^{*}_{T}(a^{\prime})+T_{\text{S}}(a^{\prime})+1}}-\sqrt{\frac{1}{N^{*}_{T}(a)+T_{\text{S}}(a)}}
\displaystyle\geq 1τ1max{τ,TS(a)}+1>0,\displaystyle\sqrt{\frac{1}{\lceil\tau_{*}\rceil}}-\sqrt{\frac{1}{\max\{\lceil\tau_{*}\rceil,T_{\text{S}}(a)\}+1}}>0, (34)

which contradicts the assumed optimality of (NT(a))a𝒜(N^{*}_{T}(a))_{a\in{\cal A}}. Thus Claim 1 is shown.

Proving Claim 2. Now, since (τ,{n(a)}a𝒜)(\tau_{*},\{n_{*}(a)\}_{a\in{\cal A}}) is feasible to (LP), the constraints in (LP) give us that n(a)max{τTS(a),0}n_{*}(a)\geq\max\{\tau_{*}-T_{\text{S}}(a),0\}. We establish our asserted equality by a contradiction argument. Suppose there is an arm aa^{\prime} such that n(a)max{τTS(a),0}=ϵ>0n_{*}(a^{\prime})-\max\{\tau_{*}-T_{\text{S}}(a^{\prime}),0\}=\epsilon>0. Then it can be verified that the solution (τ,{n(a)}a𝒜)(\tau^{\prime},\{n^{\prime}(a)\}_{a\in{\cal A}}) defined as τ=τ+ϵ/K\tau^{\prime}=\tau_{*}+\epsilon/K, n(a)=n(a)+ϵ/Kn^{\prime}(a)=n_{*}(a)+\epsilon/K for all a𝒜{a}a\in{\cal A}\setminus\{a^{\prime}\} and n(a)=n(a)K1Kϵn^{\prime}(a^{\prime})=n_{*}(a^{\prime})-\frac{K-1}{K}\epsilon is also a feasible solution to (LP). But then τ>τ\tau^{\prime}>\tau_{*}, which contradicts the optimality of (τ,{n(a)}a𝒜)(\tau_{*},\{n_{*}(a)\}_{a\in{\cal A}}). Thus (32) is shown.
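Claim 2 also gives a direct way to compute tau_* numerically: by the equality (dagger) used in the proof of Claim 1, at an optimal solution the quantities n_*(a) = max{tau_* - T_S(a), 0} sum to T, so tau_* solves a water-filling equation and can be found by bisection. The sketch below is our own illustration under this reading of (LP).

def lp_optimum(T_S, T, iters=60):
    # find tau_* with sum_a max(tau_* - T_S(a), 0) = T by bisection
    lo, hi = 0.0, T + max(T_S)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(max(mid - ts, 0.0) for ts in T_S) <= T:
            lo = mid
        else:
            hi = mid
    tau = (lo + hi) / 2
    return tau, [max(tau - ts, 0.0) for ts in T_S]

# Example: an arm with abundant offline samples receives no online exploration budget.
tau_star, n_star = lp_optimum(T_S=[0, 100, 10_000], T=1000)
print(round(tau_star, 1), [round(v, 1) for v in n_star])    # 550.0, [550.0, 450.0, 0.0]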

Appendix C Proofs for Regret Lower Bounds

C.1 Notational Set Up and Auxiliary Results

Notational Set Up. Consider a non-anticipatory policy π\pi and an instance II with reward distribution PP. We denote ρP,π\rho_{P,\pi} as the joint probability distribution function on S,A1,R1,,AT,RTS,A_{1},R_{1},\ldots,A_{T},R_{T}, the concatenation of the offline dataset SS and the online trajectory under policy π\pi on instance II. For a σ(S,A1,R1,,AT,RT)\sigma(S,A_{1},R_{1},\ldots,A_{T},R_{T})-measurable event EE, we denote PrP,π(E)\Pr_{P,\pi}(E) as the probability that EE holds under ρP,π\rho_{P,\pi}. For a σ(S,A1,R1,,AT,RT)\sigma(S,A_{1},R_{1},\ldots,A_{T},R_{T})-measurable random variable YY, we denote 𝔼P,π[Y]\mathbb{E}_{P,\pi}[Y] as the expectation of YY under the joint probability distribution function ρP,π\rho_{P,\pi}. In our analysis, we make use of YY being RegT(π,P)\text{Reg}_{T}(\pi,P) or NT(a)N_{T}(a) for an arm aa. To lighten our notational burden, we abbreviate 𝔼P,π[RegT(π,P)]\mathbb{E}_{P,\pi}[\text{Reg}_{T}(\pi,P)] as 𝔼[RegT(π,P)]\mathbb{E}[\text{Reg}_{T}(\pi,P)].

Auxiliary Results To establish the regret lower bounds and the impossibility result, we need three auxiliary results, namely Claim 1 and Theorems C.1, C.2. Claim 1 is on the KL-divergence between Gaussian random variables:

Claim 1.

For Pi=𝒩(μi,σ2)P_{i}={\cal N}(\mu_{i},\sigma^{2}), where i{1,2}i\in\{1,2\}, we have KL(P1,P2)=(μ1μ2)22σ2\text{KL}(P_{1},P_{2})=\frac{(\mu_{1}-\mu_{2})^{2}}{2\sigma^{2}}.
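For completeness, this is the standard one-line computation (a routine derivation, not quoted from the paper); writing p_i for the density of P_i,

\text{KL}(P_{1},P_{2})=\mathbb{E}_{X\sim P_{1}}\left[\log\frac{p_{1}(X)}{p_{2}(X)}\right]=\mathbb{E}_{X\sim P_{1}}\left[\frac{(X-\mu_{2})^{2}-(X-\mu_{1})^{2}}{2\sigma^{2}}\right]=\frac{(\mu_{1}-\mu_{2})^{2}+2(\mu_{1}-\mu_{2})\,\mathbb{E}_{X\sim P_{1}}[X-\mu_{1}]}{2\sigma^{2}}=\frac{(\mu_{1}-\mu_{2})^{2}}{2\sigma^{2}}.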

The following, dubbed the Bretagnolle–Huber inequality, is extracted from Theorem 14.2 of (Lattimore & Szepesvári, 2020):

Theorem C.1.

Let ,\mathbb{P},\mathbb{Q} be probability distributions on (Ω,)(\Omega,{\cal F}). For an event Eσ()E\in\sigma({\cal F}), it holds that

Pr(E)+Pr(Ec)12exp(KL(,)).\Pr_{\mathbb{P}}(E)+\Pr_{\mathbb{Q}}(E^{c})\geq\frac{1}{2}\exp\left(-\text{KL}(\mathbb{P},\mathbb{Q})\right).

Lastly, the derivation of the chain rule in Theorem C.2 largely follows from the derivation of Lemma 15.1 in (Lattimore & Szepesvári, 2020):

Theorem C.2.

Consider two instances IP,IQI_{P},I_{Q} that share the same arm set 𝒜{\cal A}, online phase horizon TT, offline sample size {TS(a)}a𝒜\{T_{\text{S}}(a)\}_{a\in{\cal A}}, but have two different reward distributions P=(P(on),P(off))P=(P^{(\text{on})},P^{(\text{off})}), Q=(Q(on),Q(off))Q=(Q^{(\text{on})},Q^{(\text{off})}). For any non-anticipatory policy π\pi, it holds that

KL(ρP,π,ρQ,π)=a𝒜𝔼P,π[NT(a)]KL(Pa(on),Qa(on))+a𝒜TS(a)KL(Pa(off),Qa(off)).\text{KL}(\rho_{P,\pi},\rho_{Q,\pi})=\sum_{a\in{\cal A}}\mathbb{E}_{P,\pi}[N_{T}(a)]\cdot\text{KL}(P^{(\text{on})}_{a},Q^{(\text{on})}_{a})+\sum_{a\in{\cal A}}T_{S}(a)\cdot\text{KL}(P^{(\text{off})}_{a},Q^{(\text{off})}_{a}).

We provide a proof of Theorem C.2 for completeness:

Proof of Theorem C.2.

The proof largely follows the well-known chain rule argument in the multi-armed bandit literature; see for example Lemma 15.1 in (Lattimore & Szepesvári, 2020). We start by explicitly expressing the joint probability function ρP,π\rho_{P,\pi} on

S={{Xs(a)}s=1TS(a)}a𝒜,A1,R1,,AT,RT,S=\{\{X_{s}(a)\}_{s=1}^{T_{\text{S}}(a)}\}_{a\in{\cal A}},A_{1},R_{1},\ldots,A_{T},R_{T},

under reward distribution PP and non-anticipatory policy π\pi. In coherence with our focus on Gaussian instances, we only consider the case when all the random rewards (offline or online) are continuous random variables with support on \mathbb{R}. Generalizing the argument to general reward distributions only requires notational changes. By an abuse of notation, we denote Pa(off)(xs(a))P^{\text{(off)}}_{a}(x_{s}(a)) as the probability density function (with variable xs(a)x_{s}(a)) of the offline reward distribution of arm aa, and likewise for the online reward distribution.

To ease the notation, we denote x={{xs(a)}s=1TS(a)}a𝒜x=\{\{x_{s}(a)\}_{s=1}^{T_{\text{S}}(a)}\}_{a\in{\cal A}}. Then ρP,π\rho_{P,\pi} is expressed as

ρP,π(x,a1,r1,,aT,rT)\displaystyle\rho_{P,\pi}(x,a_{1},r_{1},\ldots,a_{T},r_{T})
=\displaystyle= [a𝒜s=1TS(a)Pa(off)(xs(a))]on offline rewardst=1T[πt(at|x,a1,r1,,at1,rt1)Pat(on)(rt)]on online arms and rewards.\displaystyle\underbrace{\left[\prod_{a\in{\cal A}}\prod^{T_{\text{S}}(a)}_{s=1}P^{\text{(off)}}_{a}(x_{s}(a))\right]}_{\text{on offline rewards}}\cdot\underbrace{\prod^{T}_{t=1}\left[\pi_{t}(a_{t}|x,a_{1},r_{1},\ldots,a_{t-1},r_{t-1})\cdot P^{\text{(on)}}_{a_{t}}(r_{t})\right]}_{\text{on online arms and rewards}}. (35)

Likewise, by replacing reward distribution PP with QQ but keeping the fixed policy π\pi unchanged, we have

ρQ,π(x,a1,r1,,aT,rT)\displaystyle\rho_{Q,\pi}(x,a_{1},r_{1},\ldots,a_{T},r_{T})
=\displaystyle= [a𝒜s=1TS(a)Qa(off)(xs(a))]t=1T[πt(at|x,a1,r1,,at1,rt1)Qat(on)(rt)].\displaystyle\left[\prod_{a\in{\cal A}}\prod^{T_{\text{S}}(a)}_{s=1}Q^{\text{(off)}}_{a}(x_{s}(a))\right]\cdot\prod^{T}_{t=1}\left[\pi_{t}(a_{t}|x,a_{1},r_{1},\ldots,a_{t-1},r_{t-1})\cdot Q^{\text{(on)}}_{a_{t}}(r_{t})\right].

The KL divergence between ρP,π\rho_{P,\pi} and ρQ,π\rho_{Q,\pi} is

KL(ρP,π,ρQ,π)\displaystyle\text{KL}(\rho_{P,\pi},\rho_{Q,\pi})
=\displaystyle= x,r1,,rTa1,,aT𝒜ρP,π(x,a1,r1,,aT,rT)log[ρP,π(x,a1,r1,,aT,rT)ρQ,π(x,a1,r1,,aT,rT)] dx dr1 drT,\displaystyle\int_{x,r_{1},\ldots,r_{T}}\sum_{a_{1},\ldots,a_{T}\in{\cal A}}\rho_{P,\pi}(x,a_{1},r_{1},\ldots,a_{T},r_{T})\log\left[\frac{\rho_{P,\pi}(x,a_{1},r_{1},\ldots,a_{T},r_{T})}{\rho_{Q,\pi}(x,a_{1},r_{1},\ldots,a_{T},r_{T})}\right]\text{ d}x\text{ d}r_{1}\ldots\text{ d}r_{T},

where x={xs(a)}s=1TS(a)a𝒜TS(a)\int_{x}=\int_{\{x_{s}(a)\}_{s=1}^{T_{\text{S}}(a)}\in\mathbb{R}^{\sum_{a\in{\cal A}}T_{\text{S}}(a)}}, and dx=a𝒜s=1TS(a)dxs(a)\text{d}x=\prod_{a\in{\cal A}}\prod^{T_{\text{S}}(a)}_{s=1}\text{d}x_{s}(a).

We use the explicit expressions of ρP,π,ρQ,π\rho_{P,\pi},\rho_{Q,\pi} to decompose the log term:

log[ρP,π(x,a1,r1,,aT,rT)ρQ,π(x,a1,r1,,aT,rT)]\displaystyle\log\left[\frac{\rho_{P,\pi}(x,a_{1},r_{1},\ldots,a_{T},r_{T})}{\rho_{Q,\pi}(x,a_{1},r_{1},\ldots,a_{T},r_{T})}\right]
=\displaystyle= a𝒜s=1TS(a)log(Pa(off)(xs(a))Qa(off)(xs(a)))\displaystyle\sum_{a\in{\cal A}}\sum^{T_{\text{S}}(a)}_{s=1}\log\left(\frac{P^{\text{(off)}}_{a}(x_{s}(a))}{Q^{\text{(off)}}_{a}(x_{s}(a))}\right)
+t=1Tπt(at|x,a1,r1,,at1,rt1)log(πt(at|x,a1,r1,,at1,rt1)πt(at|x,a1,r1,,at1,rt1))=0+t=1Tlog(Pat(on)(rt)Qat(on)(rt))\displaystyle+\sum^{T}_{t=1}\underbrace{\pi_{t}(a_{t}|x,a_{1},r_{1},\ldots,a_{t-1},r_{t-1})\log\left(\frac{\pi_{t}(a_{t}|x,a_{1},r_{1},\ldots,a_{t-1},r_{t-1})}{\pi_{t}(a_{t}|x,a_{1},r_{1},\ldots,a_{t-1},r_{t-1})}\right)}_{=0}+\sum^{T}_{t=1}\log\left(\frac{P^{\text{(on)}}_{a_{t}}(r_{t})}{Q^{\text{(on)}}_{a_{t}}(r_{t})}\right)
=\displaystyle= a𝒜s=1TS(a)log(Pa(off)(xs(a))Qa(off)(xs(a)))+t=1Tlog(Pat(on)(rt)Qat(on)(rt)).\displaystyle\sum_{a\in{\cal A}}\sum^{T_{\text{S}}(a)}_{s=1}\log\left(\frac{P^{\text{(off)}}_{a}(x_{s}(a))}{Q^{\text{(off)}}_{a}(x_{s}(a))}\right)+\sum^{T}_{t=1}\log\left(\frac{P^{\text{(on)}}_{a_{t}}(r_{t})}{Q^{\text{(on)}}_{a_{t}}(r_{t})}\right).

In the second line, the middle sum is equal to zero, which reflects the fact that we are evaluating the same policy π\pi on the reward distributions P,QP,Q. Finally, marginalizing and making use of (35) gives, for each a𝒜a\in{\cal A} and s{1,,TS(a)}s\in\{1,\ldots,T_{\text{S}}(a)\},

x,r1,,rTa1,,aT𝒜ρP,π(x,a1,r1,,aT,rT)log(Pa(off)(xs(a))Qa(off)(xs(a))) dx dr1 drT\displaystyle\int_{x,r_{1},\ldots,r_{T}}\sum_{a_{1},\ldots,a_{T}\in{\cal A}}\rho_{P,\pi}(x,a_{1},r_{1},\ldots,a_{T},r_{T})\log\left(\frac{P^{\text{(off)}}_{a}(x_{s}(a))}{Q^{\text{(off)}}_{a}(x_{s}(a))}\right)\text{ d}x\text{ d}r_{1}\ldots\text{ d}r_{T}
=\displaystyle= xs(a)Pa(off)(xs(a))log(Pa(off)(xs(a))Qa(off)(xs(a))) dxs(a)=KL(Pa(off),Qa(off)),\int_{x_{s}(a)}P^{\text{(off)}}_{a}(x_{s}(a))\log\left(\frac{P^{\text{(off)}}_{a}(x_{s}(a))}{Q^{\text{(off)}}_{a}(x_{s}(a))}\right)\text{ d}x_{s}(a)=\text{KL}(P^{\text{(off)}}_{a},Q^{\text{(off)}}_{a}), (36)

where the first equality in (36) follows by integrating with respect to rtr_{t} over \mathbb{R} and then summing over at𝒜a_{t}\in{\cal A} in the order of t=T,,1t=T,\ldots,1, and then integrating over all variables in xx except xs(a)x_{s}(a). For each t{1,,T}t\in\{1,\ldots,T\} during the online phase we have

x,r1,,rTa1,,aT𝒜ρP,π(x,a1,r1,,aT,rT)log(Pat(on)(rt)Qat(on)(rt)) dx dr1 drT\displaystyle\int_{x,r_{1},\ldots,r_{T}}\sum_{a_{1},\ldots,a_{T}\in{\cal A}}\rho_{P,\pi}(x,a_{1},r_{1},\ldots,a_{T},r_{T})\log\left(\frac{P^{\text{(on)}}_{a_{t}}(r_{t})}{Q^{\text{(on)}}_{a_{t}}(r_{t})}\right)\text{ d}x\text{ d}r_{1}\ldots\text{ d}r_{T}
=\displaystyle= x,r1,,rta1,,at𝒜[a𝒜s=1TS(a)Pa(off)(xs(a))]τ=1t[πτ(aτ|x,a1,r1,,aτ1,rτ1)Paτ(on)(rτ)]\displaystyle\int_{x,r_{1},\ldots,r_{t}}\sum_{a_{1},\ldots,a_{t}\in{\cal A}}\left[\prod_{a\in{\cal A}}\prod^{T_{\text{S}}(a)}_{s=1}P^{\text{(off)}}_{a}(x_{s}(a))\right]\cdot\prod^{t}_{\tau=1}\left[\pi_{\tau}(a_{\tau}|x,a_{1},r_{1},\ldots,a_{\tau-1},r_{\tau-1})\cdot P^{\text{(on)}}_{a_{\tau}}(r_{\tau})\right]
log(Pat(on)(rt)Qat(on)(rt)) dx dr1 drt\displaystyle\quad\log\left(\frac{P^{\text{(on)}}_{a_{t}}(r_{t})}{Q^{\text{(on)}}_{a_{t}}(r_{t})}\right)\text{ d}x\text{ d}r_{1}\ldots\text{ d}r_{t}
=\displaystyle= at𝒜{x,r1,,rt1a1,,at1𝒜[a𝒜s=1TS(a)Pa(off)(xs(a))]τ=1t1[πτ(aτ|x,a1,r1,,aτ1,rτ1)Paτ(on)(rτ)]\displaystyle\sum_{a_{t}\in{\cal A}}\left\{\int_{x,r_{1},\ldots,r_{t-1}}\sum_{a_{1},\ldots,a_{t-1}\in{\cal A}}\left[\prod_{a\in{\cal A}}\prod^{T_{\text{S}}(a)}_{s=1}P^{\text{(off)}}_{a}(x_{s}(a))\right]\cdot\prod^{t-1}_{\tau=1}\left[\pi_{\tau}(a_{\tau}|x,a_{1},r_{1},\ldots,a_{\tau-1},r_{\tau-1})\cdot P^{\text{(on)}}_{a_{\tau}}(r_{\tau})\right]\right.
×πt(at|x,a1,r1,,at1,rt1) dx dr1 drt1}rtP(on)at(rt)log(Pat(on)(rt)Qat(on)(rt)) drt\displaystyle\quad\times\pi_{t}(a_{t}|x,a_{1},r_{1},\ldots,a_{t-1},r_{t-1})\text{ d}x\text{ d}r_{1}\ldots\text{ d}r_{t-1}\Big{\}}\cdot\int_{r_{t}}P^{\text{(on)}}_{a_{t}}(r_{t})\log\left(\frac{P^{\text{(on)}}_{a_{t}}(r_{t})}{Q^{\text{(on)}}_{a_{t}}(r_{t})}\right)\text{ d}r_{t}
=\displaystyle= at𝒜PrP,π(At=at)rtPat(on)(rt)log(Pat(on)(rt)Qat(on)(rt)) drt\displaystyle\sum_{a_{t}\in{\cal A}}\Pr_{P,\pi}(A_{t}=a_{t})\int_{r_{t}}P^{\text{(on)}}_{a_{t}}(r_{t})\log\left(\frac{P^{\text{(on)}}_{a_{t}}(r_{t})}{Q^{\text{(on)}}_{a_{t}}(r_{t})}\right)\text{ d}r_{t}
=\displaystyle= a𝒜𝔼P,π[𝟏(At=a)]KL(Pa(on),Qa(on)).\displaystyle\sum_{a\in{\cal A}}\mathbb{E}_{P,\pi}[\mathbf{1}(A_{t}=a)]\cdot\text{KL}(P^{\text{(on)}}_{a},Q^{\text{(on)}}_{a}). (37)

Summing (36) over s{1,,TS(a)}s\in\{1,\ldots,T_{\text{S}}(a)\} and a𝒜a\in{\cal A}, and summing (37) over t{1,,T}t\in\{1,\ldots,T\}, establishes the Theorem. ∎
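The decomposition can also be checked numerically. The sketch below (our own illustration) simulates a toy two-armed Gaussian instance under an arbitrary non-anticipatory policy (an epsilon-greedy rule on the pooled empirical means, standing in for pi), estimates KL(rho_{P,pi}, rho_{Q,pi}) by averaging the log-likelihood ratio of the realized data under P and Q, and compares it with the right-hand side of Theorem C.2 computed via Claim 1.

import math, random

K, T = 2, 50
T_S = [3, 5]                                  # offline sample sizes per arm (both positive here)
mu_on_P, mu_off_P = [0.0, 0.3], [0.1, 0.2]    # instance P (unit-variance Gaussians)
mu_on_Q, mu_off_Q = [0.0, 0.6], [0.1, 0.4]    # instance Q

def log_ratio(x, m_p, m_q):
    # log of the ratio of the N(m_p, 1) and N(m_q, 1) densities at x
    return -0.5 * (x - m_p) ** 2 + 0.5 * (x - m_q) ** 2

def run_once(rng):
    llr, pulls = 0.0, [0] * K
    counts, sums = list(T_S), [0.0] * K
    for a in range(K):                        # offline (warm-start) phase under P
        for _ in range(T_S[a]):
            x = rng.gauss(mu_off_P[a], 1.0)
            sums[a] += x
            llr += log_ratio(x, mu_off_P[a], mu_off_Q[a])
    for t in range(T):                        # online phase under P
        if rng.random() < 0.2:
            a = rng.randrange(K)
        else:
            a = max(range(K), key=lambda k: sums[k] / counts[k])
        r = rng.gauss(mu_on_P[a], 1.0)
        sums[a] += r; counts[a] += 1; pulls[a] += 1
        llr += log_ratio(r, mu_on_P[a], mu_on_Q[a])
    return llr, pulls

rng, n_rep = random.Random(0), 5000
avg_llr, avg_pulls = 0.0, [0.0] * K
for _ in range(n_rep):
    llr, pulls = run_once(rng)
    avg_llr += llr / n_rep
    for a in range(K):
        avg_pulls[a] += pulls[a] / n_rep

rhs = (sum(avg_pulls[a] * (mu_on_P[a] - mu_on_Q[a]) ** 2 / 2 for a in range(K))
       + sum(T_S[a] * (mu_off_P[a] - mu_off_Q[a]) ** 2 / 2 for a in range(K)))
print(round(avg_llr, 2), round(rhs, 2))       # the two estimates should be close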

C.2 Proof for Proposition 3.1, the Impossibility Result

By the condition C<Tϵ/(4logT)C<T^{\epsilon}/(4\log T), we know that 1ClogTTβ(ϵ/2)1Tβ>1Tβ>0\frac{1}{\sqrt{C\log T}T^{\beta-(\epsilon/2)}}-\frac{1}{T^{\beta}}>\frac{1}{T^{\beta}}>0, so in the instance with QQ, arm 2 is the unique optimal arm. Consider the event E={NT(2)>T/2}E=\{N_{T}(2)>T/2\}. Now,

𝔼[Reg(π,P)]+𝔼[Reg(π,Q)]\displaystyle\mathbb{E}[\text{Reg}(\pi,P)]+\mathbb{E}[\text{Reg}(\pi,Q)]
\displaystyle\geq 1TβT2PrP,π(E)+[1ClogTTβ(ϵ/2)1Tβ]T2PrQ,π(EC)\displaystyle\frac{1}{T^{\beta}}\cdot\frac{T}{2}\cdot\Pr_{P,\pi}(E)+\left[\frac{1}{\sqrt{C\log T}T^{\beta-(\epsilon/2)}}-\frac{1}{T^{\beta}}\right]\cdot\frac{T}{2}\Pr_{Q,\pi}(E^{C})
\displaystyle\geq T1β2[PrP,π(E)+PrQ,π(EC)]\displaystyle\frac{T^{1-\beta}}{2}\cdot\left[\Pr_{P,\pi}(E)+\Pr_{Q,\pi}(E^{C})\right]
\displaystyle\geq T1β4exp[𝔼π,P[NT(2)]KL(P2(on),Q2(on))TS(2)KL(P2(off),Q2(off))]\displaystyle\frac{T^{1-\beta}}{4}\cdot\exp\left[-\mathbb{E}_{\pi,P}[N_{T}(2)]\cdot\text{KL}(P^{(\text{on})}_{2},Q^{(\text{on})}_{2})-T_{S}(2)\cdot\text{KL}(P^{(\text{off})}_{2},Q^{(\text{off})}_{2})\right] (38)
=\displaystyle= T1β4exp[𝔼π,P[NT(2)]KL(P2(on),Q2(on))].\displaystyle\frac{T^{1-\beta}}{4}\cdot\exp\left[-\mathbb{E}_{\pi,P}[N_{T}(2)]\cdot\text{KL}(P^{(\text{on})}_{2},Q^{(\text{on})}_{2})\right]. (39)

Step (38) is by the Bretagnolle–Huber inequality (Theorem C.1), followed by the chain rule in Theorem C.2. Step (39) is because P(off)=Q(off)P^{\text{(off)}}=Q^{\text{(off)}}. It is evident that

KL(P2(on),Q2(on))=12CT2βϵlogT,\text{KL}(P^{(\text{on})}_{2},Q^{(\text{on})}_{2})=\frac{1}{2CT^{2\beta-\epsilon}\log T},

and by the assumption of the Proposition that 𝔼[Reg(π,P)]=(1/Tβ)𝔼π,P[NT(2)]CTβϵlogT\mathbb{E}[\text{Reg}(\pi,P)]=(1/T^{\beta})\mathbb{E}_{\pi,P}[N_{T}(2)]\leq CT^{\beta-\epsilon}\log T, we have

𝔼π,P[NT(2)]CT2βϵlogT,\mathbb{E}_{\pi,P}[N_{T}(2)]\leq CT^{2\beta-\epsilon}\log T,

which leads to

𝔼[Reg(π,P)]+𝔼[Reg(π,Q)]T1β4e2.\mathbb{E}[\text{Reg}(\pi,P)]+\mathbb{E}[\text{Reg}(\pi,Q)]\geq\frac{T^{1-\beta}}{4}\cdot e^{-2}.

Again by the Proposition assumption 𝔼[Reg(π,P)]CTβϵlogT\mathbb{E}[\text{Reg}(\pi,P)]\leq CT^{\beta-\epsilon}\log T, we arrive at the claimed inequality, and the Proposition is proved.

C.3 Proof for instance dependent regret lower bound

C.3.1 Proof for Theorem 4.7

We denote μ(off)(a),μ(on)(a)\mu^{\text{(off)}}(a),\mu^{\text{(on)}}(a) as the means of the Gaussian distributions Pa(off),Pa(on)P^{\text{(off)}}_{a},P^{\text{(on)}}_{a} respectively, for each arm a𝒜a\in{\cal A}. In addition, recall the notation that μ(on)=maxa𝒜μ(on)(a)\mu_{*}^{\text{(on)}}=\max_{a\in{\cal A}}\mu^{\text{(on)}}(a), Δ(a)=μ(on)μ(on)(a)\Delta(a)=\mu_{*}^{\text{(on)}}-\mu^{\text{(on)}}(a) and ω(a)=V(a)+(μ(off)(a)μ(on)(a))\omega(a)=V(a)+(\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a)).

Consider an arbitrary but fixed arm aa with Δ(a)>0\Delta(a)>0. We claim that

𝔼P,π[NT(a)]\displaystyle\mathbb{E}_{P,\pi}[N_{T}(a)] 2(1p)(1+ϵ)2logTΔ(a)2+2(1+ϵ)2Δ(a)2log(ϵΔ(a)8C)\displaystyle\geq\frac{2(1-p)}{(1+\epsilon)^{2}}\cdot\frac{\log T}{\Delta(a)^{2}}+\frac{2}{(1+\epsilon)^{2}\Delta(a)^{2}}\cdot\log\left(\frac{\epsilon\Delta(a)}{8C}\right)
TS(a)max{(1ω(a)(1+ϵ)Δ(a)),0}2\displaystyle-T_{\text{S}}(a)\cdot\max\left\{\left(1-\frac{\omega(a)}{(1+\epsilon)\Delta(a)}\right),0\right\}^{2} (40)

The Theorem then follows by applying (40) to each sub-optimal arm, so it remains to prove (40). To proceed, we consider another Gaussian instance Q=(Q(off),Q(on))Q=(Q^{\text{(off)}},Q^{\text{(on)}}), where

Qk(on)=Pk(on) and Qk(off)=Pk(off) for all k𝒜{a},Q^{\text{(on)}}_{k}=P^{\text{(on)}}_{k}\text{ and }Q^{\text{(off)}}_{k}=P^{\text{(off)}}_{k}\text{ for all $k\in{\cal A}\setminus\{a\}$},

but

Qa(on)=𝒩(μ(on)(a)+(1+ϵ)Δ(a),1),Q^{(\text{on})}_{a}={\cal N}(\mu^{\text{(on)}}(a)+(1+\epsilon)\Delta(a),1),
Qa(off)={𝒩(μ(off)(a),1)if μ(off)(a)μ(on)(a)+(1+ϵ)Δ(a)V(a)𝒩(μ(on)(a)+(1+ϵ)Δ(a)V(a),1)if μ(off)(a)<μ(on)(a)+(1+ϵ)Δ(a)V(a).Q^{(\text{off})}_{a}=\begin{cases}{\cal N}(\mu^{\text{(off)}}(a),1)&\text{if }\mu^{\text{(off)}}(a)\geq\mu^{\text{(on)}}(a)+(1+\epsilon)\Delta(a)-V(a)\\ {\cal N}(\mu^{\text{(on)}}(a)+(1+\epsilon)\Delta(a)-V(a),1)&\text{if }\mu^{\text{(off)}}(a)<\mu^{\text{(on)}}(a)+(1+\epsilon)\Delta(a)-V(a)\end{cases}. (41)

It is clear that PVP\in{\cal I}_{V} implies QVQ\in{\cal I}_{V}. Indeed, if the second case in (41) holds, then the mean of Qa(on)Q^{(\text{on})}_{a} minus the mean of Qa(off)Q^{(\text{off})}_{a} is equal to V(a)V(a), while if the first case holds, then we evidently have

μ(on)(a)+(1+ϵ)Δ(a)mean of Qa(on)+V(a)μ(off)(a)μ(on)(a)+(1+ϵ)Δ(a)mean of Qa(on)V(a),\underbrace{\mu^{\text{(on)}}(a)+(1+\epsilon)\Delta(a)}_{\text{mean of $Q^{(\text{on})}_{a}$}}+V(a)\geq\mu^{\text{(off)}}(a)\geq\underbrace{\mu^{\text{(on)}}(a)+(1+\epsilon)\Delta(a)}_{\text{mean of $Q^{(\text{on})}_{a}$}}-V(a),

thus QVQ\in{\cal I}_{V}. Consider the event E={NT(a)T/2}E=\{N_{T}(a)\geq T/2\}. By the assumed consistency of π\pi, we have

2CTp\displaystyle 2CT^{p} 𝔼[RegT(π,P)]+𝔼[RegT(π,Q)]\displaystyle\geq\mathbb{E}[\text{Reg}_{T}(\pi,P)]+\mathbb{E}[\text{Reg}_{T}(\pi,Q)] (42)
\displaystyle\geq T2ϵΔ(a)[PrP,π(E)+PrQ,π(Ec)]\displaystyle\frac{T}{2}\cdot\epsilon\Delta(a)\cdot\left[\Pr_{P,\pi}(E)+\Pr_{Q,\pi}(E^{c})\right] (43)
\displaystyle\geq T4ϵΔ(a)exp[𝔼π,P[NT(a)]KL(Pa(on),Qa(on))TS(a)KL(Pa(off),Qa(off))].\displaystyle\frac{T}{4}\cdot\epsilon\Delta(a)\cdot\exp\left[-\mathbb{E}_{\pi,P}[N_{T}(a)]\cdot\text{KL}(P^{(\text{on})}_{a},Q^{(\text{on})}_{a})-T_{S}(a)\text{KL}(P^{(\text{off})}_{a},Q^{(\text{off})}_{a})\right]. (44)

Step (42) is by the definition of consistency. Step (43) is implied by our construction that in instance PP, arm aa is sub-optimal with optimality gap Δ(a)ϵΔ(a)\Delta(a)\geq\epsilon\Delta(a), and in instance QQ, arm aa is the unique optimal arm and the other arms have optimality gaps at least ϵΔ(a)\epsilon\Delta(a). Step (44) is by the chain rule in Theorem C.2, together with the Bretagnolle–Huber inequality (Theorem C.1). By our setup, we have

KL(Pa(on),Qa(on))\displaystyle\text{KL}(P^{(\text{on})}_{a},Q^{(\text{on})}_{a}) =(1+ϵ)2Δ(a)22,\displaystyle=\frac{(1+\epsilon)^{2}\Delta(a)^{2}}{2},
KL(Pa(off),Qa(off))\displaystyle\text{KL}(P^{(\text{off})}_{a},Q^{(\text{off})}_{a}) =max{(1+ϵ)Δ(a)[V(a)+(μ(off)(a)μ(on)(a))],0}22.\displaystyle=\frac{\max\{(1+\epsilon)\Delta(a)-[V(a)+(\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a))],0\}^{2}}{2}.

Plugging in we get

2CTp\displaystyle 2CT^{p} T4ϵΔ(a)exp[𝔼π,P[NT(a)](1+ϵ)2Δ(a)22\displaystyle\geq\frac{T}{4}\cdot\epsilon\Delta(a)\cdot\exp\left[-\mathbb{E}_{\pi,P}[N_{T}(a)]\cdot\frac{(1+\epsilon)^{2}\Delta(a)^{2}}{2}\right.
TS(a)max{(1+ϵ)Δ(a)[V(a)+(μ(off)(a)μ(on)(a))],0}22],\displaystyle\left.-T_{S}(a)\frac{\max\{(1+\epsilon)\Delta(a)-[V(a)+(\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a))],0\}^{2}}{2}\right], (45)

which is equivalent to

𝔼π,P[NT(a)](1+ϵ)2Δ(a)22+TS(a)max{(1+ϵ)Δ(a)[V(a)+(μ(off)(a)μ(on)(a))],0}22\displaystyle\mathbb{E}_{\pi,P}[N_{T}(a)]\cdot\frac{(1+\epsilon)^{2}\Delta(a)^{2}}{2}+T_{S}(a)\frac{\max\{(1+\epsilon)\Delta(a)-[V(a)+(\mu^{\text{(off)}}(a)-\mu^{\text{(on)}}(a))],0\}^{2}}{2}
\displaystyle\geq (1p)logT+log(ϵΔ(a)8C).\displaystyle(1-p)\log T+\log\left(\frac{\epsilon\Delta(a)}{8C}\right).

Rearranging leads to the claimed inequality (40), and the Theorem is proved.
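To see how the offline data enter (40), the following small helper (our own illustration) evaluates its right-hand side for a single sub-optimal arm, using the notation recalled at the start of this subsection; the values of C, p (from the consistency assumption on pi) and of epsilon are arbitrary.

import math

def rhs_of_40(delta, omega, T_S, T, C=0.5, p=0.5, eps=0.5):
    # right-hand side of (40): a lower bound on E_{P,pi}[N_T(a)] for a sub-optimal arm a
    main = 2 * (1 - p) / ((1 + eps) ** 2) * math.log(T) / delta ** 2
    shift = 2 / ((1 + eps) ** 2 * delta ** 2) * math.log(eps * delta / (8 * C))
    discount = T_S * max(1 - omega / ((1 + eps) * delta), 0) ** 2
    return main + shift - discount

# With no offline data the bound forces order log(T)/Delta(a)^2 pulls of arm a; well-aligned
# offline data (omega much smaller than Delta(a)) with enough samples removes the requirement.
print(rhs_of_40(delta=0.3, omega=0.05, T_S=0, T=10**4))
print(rhs_of_40(delta=0.3, omega=0.05, T_S=500, T=10**4))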

C.4 Proof for instance independent regret lower bound

C.4.1 Proof for Theorem 4.10

We proceed by a case analysis:

Case 1a: 2KT>T(Vmax+1/τ)2\sqrt{KT}>T\cdot(V_{\text{max}}+1/\sqrt{\tau_{*}}), and Vmax1/τV_{\text{max}}\leq 1/\sqrt{\tau_{*}}. We derive a regret lower bound of

Ω(min{KT,T(1τ+Vmax)})=Ω(Tτ).\Omega\left(\min\left\{\sqrt{KT},T\cdot\left(\frac{1}{\sqrt{\tau_{*}}}+V_{\text{max}}\right)\right\}\right)=\Omega\left(\frac{T}{\sqrt{\tau_{*}}}\right).

Case 1b: 2KT>T(Vmax+1/τ)2\sqrt{KT}>T\cdot(V_{\text{max}}+1/\sqrt{\tau_{*}}), and Vmax>1/τV_{\text{max}}>1/\sqrt{\tau_{*}}. We derive a regret lower bound of

Ω(min{KT,T(1τ+Vmax)})=Ω(TVmax).\Omega\left(\min\left\{\sqrt{KT},T\cdot\left(\frac{1}{\sqrt{\tau_{*}}}+V_{\text{max}}\right)\right\}\right)=\Omega(T\cdot V_{\text{max}}).

Case 2: 2KT(Vmax+1/τ)T2\sqrt{KT}\leq(V_{\text{max}}+1/\sqrt{\tau_{*}})\cdot T. We derive a regret lower bound of

Ω(min{KT,T(1τ+Vmax)})=Ω(KT).\Omega\left(\min\left\{\sqrt{KT},T\cdot\left(\frac{1}{\sqrt{\tau_{*}}}+V_{\text{max}}\right)\right\}\right)=\Omega(\sqrt{KT}).

We establish the three cases in what follows.

Case 1a: 2KT>T(Vmax+1/τ)2\sqrt{KT}>T\cdot(V_{\text{max}}+1/\sqrt{\tau_{*}}), and Vmax1/τV_{\text{max}}\leq 1/\sqrt{\tau_{*}}. We derive a regret lower bound of

Ω(min{KT,T(1τ+Vmax)})=Ω(Tτ).\Omega\left(\min\left\{\sqrt{KT},T\cdot\left(\frac{1}{\sqrt{\tau_{*}}}+V_{\text{max}}\right)\right\}\right)=\Omega\left(\frac{T}{\sqrt{\tau_{*}}}\right).

Now, without loss of generality, we assume that 1argmaxa𝒜TS(a)1\in\text{argmax}_{a\in{\cal A}}T_{\text{S}}(a). Consider the following Gaussian reward distributions P=(P(on),P(off))P=(P^{\text{(on)}},P^{\text{(off)}}) with Pa(on)=Pa(off)P^{\text{(on)}}_{a}=P^{\text{(off)}}_{a} (thus PVP\in{\cal I}_{V} for any V0KV\in\mathbb{R}^{K}_{\geq 0}), defined as

Pa(on)={𝒩(Δ,1)if a=1𝒩(0,1)if a𝒜{1}, where Δ=1τ.P^{\text{(on)}}_{a}=\begin{cases}{\cal N}(\Delta,1)&\quad\text{if }a=1\\ {\cal N}(0,1)&\quad\text{if }a\in{\cal A}\setminus\{1\}\end{cases}\text{, where }\Delta=\frac{1}{\sqrt{\tau_{*}}}.

Consider the values (τ~,n~)(\tilde{\tau},\tilde{n}) defined as n~(a)=𝔼P,π[NT(a)]\tilde{n}(a)=\mathbb{E}_{P,\pi}[N_{T}(a)] for each a𝒜a\in{\cal A}, and τ~=mina𝒜{TS(a)+n~(a)}\tilde{\tau}=\min_{a\in{\cal A}}\{T_{\text{S}}(a)+\tilde{n}(a)\}. Evidently, (τ~,n~)(\tilde{\tau},\tilde{n}) is feasible to (LP), and thus we have τ~τ\tilde{\tau}\leq\tau_{*}, the optimum of (LP). In particular, there exists an arm a𝒜a\in{\cal A} such that TS(a)+𝔼P,π[NT(a)]τT_{\text{S}}(a)+\mathbb{E}_{P,\pi}[N_{T}(a)]\leq\tau_{*}. To this end, let’s consider two situations:

Situation (i): TS(a)+𝔼P,π[NT(a)]>τT_{\text{S}}(a)+\mathbb{E}_{P,\pi}[N_{T}(a)]>\tau_{*} for all a𝒜{1}a\in{\cal A}\setminus\{1\}. The condition immediately implies TS(1)+𝔼P,π[NT(1)]τT_{\text{S}}(1)+\mathbb{E}_{P,\pi}[N_{T}(1)]\leq\tau_{*}. Then we can deduce that TS(1)+𝔼P,π[NT(1)]<TS(a)+𝔼P,π[NT(a)]T_{\text{S}}(1)+\mathbb{E}_{P,\pi}[N_{T}(1)]<T_{\text{S}}(a)+\mathbb{E}_{P,\pi}[N_{T}(a)] for all a𝒜{1}a\in{\cal A}\setminus\{1\}, which further implies that 𝔼P,π[NT(1)]<𝔼P,π[NT(a)]\mathbb{E}_{P,\pi}[N_{T}(1)]<\mathbb{E}_{P,\pi}[N_{T}(a)] for all a𝒜{1}a\in{\cal A}\setminus\{1\} since TS(1)=maxa𝒜TS(a)T_{\text{S}}(1)=\max_{a\in{\cal A}}T_{\text{S}}(a). The above implies that 𝔼P,π[NT(1)]<T/K\mathbb{E}_{P,\pi}[N_{T}(1)]<T/K, and since each pull of an arm other than arm 1 incurs regret Δ\Delta, this implies that

𝔼[RegT(π,P)]>Δ(K1)TK=(K1)TKτ.\mathbb{E}[\text{Reg}_{T}(\pi,P)]>\Delta\cdot\frac{(K-1)T}{K}=\frac{(K-1)T}{K\sqrt{\tau_{*}}}.

Situation (ii): TS(k)+𝔼P,π[NT(k)]τT_{\text{S}}(k)+\mathbb{E}_{P,\pi}[N_{T}(k)]\leq\tau_{*} for some k𝒜{1}k\in{\cal A}\setminus\{1\}. In this case, consider the following Gaussian reward distribution Q=(Q(on),Q(off))Q=(Q^{\text{(on)}},Q^{\text{(off)}}) with Q(on)=Q(off)Q^{\text{(on)}}=Q^{\text{(off)}}, defined as

Qa(on)={𝒩(Δ,1)if a=1𝒩(2Δ,1)if a=k𝒩(0,1)if a𝒜{1,k}, where Δ=1τ.Q^{\text{(on)}}_{a}=\begin{cases}{\cal N}(\Delta,1)&\quad\text{if }a=1\\ {\cal N}(2\Delta,1)&\quad\text{if }a=k\\ {\cal N}(0,1)&\quad\text{if }a\in{\cal A}\setminus\{1,k\}\end{cases}\text{, where }\Delta=\frac{1}{\sqrt{\tau_{*}}}.

To this end, note that Pa(off)=Qa(off)=Pa(on)=Qa(on)P^{\text{(off)}}_{a}=Q^{\text{(off)}}_{a}=P^{\text{(on)}}_{a}=Q^{\text{(on)}}_{a} for all a𝒜{k}a\in{\cal A}\setminus\{k\}, but Pk(on)=Pk(off)Qk(on)=Qk(off)P^{\text{(on)}}_{k}=P^{\text{(off)}}_{k}\neq Q^{\text{(on)}}_{k}=Q^{\text{(off)}}_{k}. In addition, both P,QP,Q belong to V{\cal I}_{V}. Consider the event E={NT(1)<T/2}E=\{N_{T}(1)<T/2\}. We have

𝔼[RegT(π,P)]+𝔼[RegT(π,Q)]\displaystyle\mathbb{E}[\text{Reg}_{T}(\pi,P)]+\mathbb{E}[\text{Reg}_{T}(\pi,Q)] T2Δ[PrP,π(E)+PrQ,π(Ec)]\displaystyle\geq\frac{T}{2}\cdot\Delta\cdot\left[\Pr_{P,\pi}(E)+\Pr_{Q,\pi}(E^{c})\right] (46)
T4Δexp[𝔼P,π[NT(k)]KL(Pk(on),Qk(on))TS(k)KL(Pk(off),Qk(off))]\displaystyle\geq\frac{T}{4}\cdot\Delta\cdot\exp\left[-\mathbb{E}_{P,\pi}[N_{T}(k)]\cdot\text{KL}(P^{(\text{on})}_{k},Q^{(\text{on})}_{k})-T_{\text{S}}(k)\text{KL}(P^{(\text{off})}_{k},Q^{(\text{off})}_{k})\right] (47)
=T4Δexp[(TS(k)+𝔼P,π[NT(k)])KL(Pk(on),Qk(on))]\displaystyle=\frac{T}{4}\cdot\Delta\cdot\exp\left[-\left(T_{\text{S}}(k)+\mathbb{E}_{P,\pi}[N_{T}(k)]\right)\cdot\text{KL}(P^{(\text{on})}_{k},Q^{(\text{on})}_{k})\right]
=T4τexp[(TS(k)+𝔼P,π[NT(k)])12τ]\displaystyle=\frac{T}{4\sqrt{\tau_{*}}}\cdot\exp\left[-\left(T_{\text{S}}(k)+\mathbb{E}_{P,\pi}[N_{T}(k)]\right)\cdot\frac{1}{2\tau_{*}}\right]
T4τexp[τ12τ]=14eTτ.\displaystyle\geq\frac{T}{4\sqrt{\tau_{*}}}\cdot\exp\left[-\tau_{*}\cdot\frac{1}{2\tau_{*}}\right]=\frac{1}{4\sqrt{e}}\cdot\frac{T}{\sqrt{\tau_{*}}}. (48)

Altogether, in Case 1a, we either have

𝔼[RegT(π,P)]>(K1)TKτ=Ω(Tτ),\mathbb{E}[\text{Reg}_{T}(\pi,P)]>\frac{(K-1)T}{K\sqrt{\tau_{*}}}=\Omega\left(\frac{T}{\sqrt{\tau_{*}}}\right),

or

max{𝔼[RegT(π,P)],𝔼[RegT(π,Q)]}18eTτ.\max\{\mathbb{E}[\text{Reg}_{T}(\pi,P)],\mathbb{E}[\text{Reg}_{T}(\pi,Q)]\}\geq\frac{1}{8\sqrt{e}}\cdot\frac{T}{\sqrt{\tau_{*}}}.

Case 1b: 2KT>T(Vmax+1/τ)2\sqrt{KT}>T\cdot(V_{\text{max}}+1/\sqrt{\tau_{*}}), and Vmax>1/τV_{\text{max}}>1/\sqrt{\tau_{*}}. We derive a regret lower bound of

Ω(min{KT,T(1τ+Vmax)})=Ω(TVmax).\Omega\left(\min\left\{\sqrt{KT},T\cdot\left(\frac{1}{\sqrt{\tau_{*}}}+V_{\text{max}}\right)\right\}\right)=\Omega(T\cdot V_{\text{max}}).

Now, set Δ=Vmax/2\Delta=V_{\text{max}}/2, and consider the Gaussian instance with reward distribution PP:

Pa(off)=N(0,1) for all a𝒜, and Pa(on)={𝒩(Δ,1)if a=1,𝒩(0,1)if a𝒜{1}.P^{\text{(off)}}_{a}=N(0,1)\text{ for all $a\in{\cal A}$, and }P^{\text{(on)}}_{a}=\begin{cases}{\cal N}(\Delta,1)&\quad\text{if }a=1,\\ {\cal N}(0,1)&\quad\text{if }a\in{\cal A}\setminus\{1\}.\end{cases}

Now, there exists an arm a𝒜{1}a^{\prime}\in{\cal A}\setminus\{1\} such that 𝔼P,π[NT(a)]T/(K1)\mathbb{E}_{P,\pi}[N_{T}(a^{\prime})]\leq T/(K-1). Consider the Gaussian instance with

Qa(off)=𝒩(0,1) for all a𝒜, and Qa(on)={𝒩(Δ,1)if a=1,𝒩(2Δ,1)if a=a,𝒩(0,1)if a𝒜{1,a}.Q^{\text{(off)}}_{a}={\cal N}(0,1)\text{ for all $a\in{\cal A}$, and }Q^{\text{(on)}}_{a}=\begin{cases}{\cal N}(\Delta,1)&\quad\text{if }a=1,\\ {\cal N}(2\Delta,1)&\quad\text{if }a=a^{\prime},\\ {\cal N}(0,1)&\quad\text{if }a\in{\cal A}\setminus\{1,a^{\prime}\}.\end{cases}

To this end, note that Pa(off)=Qa(off)P^{\text{(off)}}_{a}=Q^{\text{(off)}}_{a} for all a𝒜a\in{\cal A}, and Pa(on)=Qa(on)P^{\text{(on)}}_{a}=Q^{\text{(on)}}_{a} for all a𝒜{a}a\in{\cal A}\setminus\{a^{\prime}\}, but Pa(on)Qa(on)P^{\text{(on)}}_{a^{\prime}}\neq Q^{\text{(on)}}_{a^{\prime}}. In addition, both P,QP,Q belong to V{\cal I}_{V}. Consider the event E={NT(1)<T/2}E=\{N_{T}(1)<T/2\}. We have

𝔼[RegT(π,P)]+𝔼[RegT(π,Q)]\displaystyle\mathbb{E}[\text{Reg}_{T}(\pi,P)]+\mathbb{E}[\text{Reg}_{T}(\pi,Q)]\geq T2Δ[PrP,π(E)+PrQ,π(Ec)]\displaystyle\frac{T}{2}\cdot\Delta\cdot\left[\Pr_{P,\pi}(E)+\Pr_{Q,\pi}(E^{c})\right] (49)
\displaystyle\geq T4Δexp[𝔼P,π[NT(a)]KL(Pa(on),Qa(on))]\displaystyle\frac{T}{4}\cdot\Delta\cdot\exp\left[-\mathbb{E}_{P,\pi}[N_{T}(a^{\prime})]\cdot\text{KL}(P^{(\text{on})}_{a^{\prime}},Q^{(\text{on})}_{a^{\prime}})\right] (50)
\displaystyle\geq T4Δexp[TK12Δ2]\displaystyle\frac{T}{4}\cdot\Delta\cdot\exp\left[-\frac{T}{K-1}\cdot 2\cdot\Delta^{2}\right]
=\displaystyle= T8Vmaxexp[TK1Vmax22]\displaystyle\frac{T}{8}\cdot V_{\text{max}}\cdot\exp\left[-\frac{T}{K-1}\cdot\frac{V_{\text{max}}^{2}}{2}\right]
>\displaystyle> T8Vmaxexp[KK1]18e2TVmax.\displaystyle\frac{T}{8}\cdot V_{\text{max}}\cdot\exp\left[-\frac{K}{K-1}\right]\geq\frac{1}{8e^{2}}\cdot T\cdot V_{\text{max}}. (51)

Step (49) is again by Theorem C.1. Step (50) is by the chain rule (Theorem C.2), and our observations on the constructed P,QP,Q. The third inequality is by the choice of arm aa^{\prime} and the KL divergence between Pa(on),Qa(on)P^{(\text{on})}_{a^{\prime}},Q^{(\text{on})}_{a^{\prime}}. The strict inequality in (51) is by the case assumption that 2KT>T(Vmax+1/τ)>TVmax2\sqrt{KT}>T\cdot(V_{\text{max}}+1/\sqrt{\tau_{*}})>T\cdot V_{\text{max}}, which implies TK1Vmax22<KK1\frac{T}{K-1}\cdot\frac{V_{\text{max}}^{2}}{2}<\frac{K}{K-1}. The last inequality in (51) is by the assumption that K2K\geq 2. Altogether, the desired regret lower bound of

max{𝔼[RegT(π,P)],𝔼[RegT(π,Q)]}116e2TVmax\max\{\mathbb{E}[\text{Reg}_{T}(\pi,P)],\mathbb{E}[\text{Reg}_{T}(\pi,Q)]\}\geq\frac{1}{16e^{2}}\cdot T\cdot V_{\text{max}}

is achieved.

Case 2: 2KT(Vmax+1/τ)T2\sqrt{KT}\leq(V_{\text{max}}+1/\sqrt{\tau_{*}})\cdot T. We derive a regret lower bound of

Ω(min{KT,T(1τ+Vmax)})=Ω(KT),\Omega\left(\min\left\{\sqrt{KT},T\cdot\left(\frac{1}{\sqrt{\tau_{*}}}+V_{\text{max}}\right)\right\}\right)=\Omega(\sqrt{KT}),

largely by following Case 1b with a different Δ\Delta, as well as the proof of Lemma 15.2 in (Lattimore & Szepesvári, 2020). Recall KTT/τ\sqrt{KT}\geq T/\sqrt{\tau_{*}} since τT/K\tau_{*}\geq T/K, so the case condition implies KTTVmax\sqrt{KT}\leq TV_{\text{max}}, meaning K/TVmax\sqrt{K/T}\leq V_{\text{max}}. Now, set Δ=(1/2)(K1)/T\Delta=(1/2)\sqrt{(K-1)/T}, and consider the Gaussian instance with reward distribution PP:

Pa(off)=𝒩(0,1) for all a𝒜, and Pa(on)={𝒩(Δ,1)if a=1,𝒩(0,1)if a𝒜{1}.P^{\text{(off)}}_{a}={\cal N}(0,1)\text{ for all $a\in{\cal A}$, and }P^{\text{(on)}}_{a}=\begin{cases}{\cal N}(\Delta,1)&\quad\text{if }a=1,\\ {\cal N}(0,1)&\quad\text{if }a\in{\cal A}\setminus\{1\}.\end{cases}

Now, there exists an arm a𝒜{1}a^{\prime}\in{\cal A}\setminus\{1\} such that 𝔼P,π[NT(a)]T/(K1)\mathbb{E}_{P,\pi}[N_{T}(a^{\prime})]\leq T/(K-1). Consider the Gaussian instance with

Qa(off)=𝒩(0,1) for all a𝒜, and Qa(on)={𝒩(Δ,1)if a=1,𝒩(2Δ,1)if a=a,𝒩(0,1)if a𝒜{1,a}.Q^{\text{(off)}}_{a}={\cal N}(0,1)\text{ for all $a\in{\cal A}$, and }Q^{\text{(on)}}_{a}=\begin{cases}{\cal N}(\Delta,1)&\quad\text{if }a=1,\\ {\cal N}(2\Delta,1)&\quad\text{if }a=a^{\prime},\\ {\cal N}(0,1)&\quad\text{if }a\in{\cal A}\setminus\{1,a^{\prime}\}.\end{cases}

To this end, note that Pa(off)=Qa(off)P^{\text{(off)}}_{a}=Q^{\text{(off)}}_{a} for all a𝒜a\in{\cal A}, and Pa(on)=Qa(on)P^{\text{(on)}}_{a}=Q^{\text{(on)}}_{a} for all a𝒜{a}a\in{\cal A}\setminus\{a^{\prime}\}, but Pa(on)Qa(on)P^{\text{(on)}}_{a^{\prime}}\neq Q^{\text{(on)}}_{a^{\prime}}. In addition, both P,QP,Q belong to V{\cal I}_{V}. Consider the event E={NT(1)<T/2}E=\{N_{T}(1)<T/2\}. We have

𝔼[RegT(π,P)]+𝔼[RegT(π,Q)]\displaystyle\mathbb{E}[\text{Reg}_{T}(\pi,P)]+\mathbb{E}[\text{Reg}_{T}(\pi,Q)]\geq T2Δ[PrP,π(E)+PrQ,π(Ec)]\displaystyle\frac{T}{2}\cdot\Delta\cdot\left[\Pr_{P,\pi}(E)+\Pr_{Q,\pi}(E^{c})\right] (52)
\displaystyle\geq T4Δexp[𝔼P,π[NT(a)]KL(Pa(on),Qa(on))]\displaystyle\frac{T}{4}\cdot\Delta\cdot\exp\left[-\mathbb{E}_{P,\pi}[N_{T}(a^{\prime})]\cdot\text{KL}(P^{(\text{on})}_{a^{\prime}},Q^{(\text{on})}_{a^{\prime}})\right] (53)
\displaystyle\geq T4Δexp[TK12Δ2].\displaystyle\frac{T}{4}\cdot\Delta\cdot\exp\left[-\frac{T}{K-1}\cdot 2\cdot\Delta^{2}\right]. (54)

Step (52) is again by Theorem C.1. Step (53) is by the Chain rule (Theorem C.2), and our observations on the constructed P,QP,Q. Step (54) is by the choice of arm aa^{\prime} and the KL divergence between Pa(on),Qa(on)P^{(\text{on})}_{a^{\prime}},Q^{(\text{on})}_{a^{\prime}}. Finally, putting in Δ=(1/2)(K1)/T\Delta=(1/2)\sqrt{(K-1)/T} gives us

max{𝔼[RegT(π,P)],𝔼[RegT(π,Q)]}116e(K1)T.\max\{\mathbb{E}[\text{Reg}_{T}(\pi,P)],\mathbb{E}[\text{Reg}_{T}(\pi,Q)]\}\geq\frac{1}{16\sqrt{e}}\cdot\sqrt{(K-1)T}.

Altogether, the three cases cover all possibilities and the Theorem is proved.

Appendix D More Details on Numerical Experiments

In all experiments, we set 𝝁(on)=(0.8,0.5,0.5,0.5)\boldsymbol{\mu}^{\text{(on)}}=(0.8,0.5,0.5,0.5) (hence a=1a_{*}=1). In Figure 1, we set 𝝁(off)=(0.4,0.6,0.6,0.6)\boldsymbol{\mu}^{\text{(off)}}=(0.4,0.6,0.6,0.6), V(1)=0.4V(1)=0.4, and vary VV from 0.1 to 0.6. In Figure 2, we set 𝝁(off)=(0.5,0.6,0.6,0.6)\boldsymbol{\mu}^{\text{(off)}}=(0.5,0.6,0.6,0.6), V(1)=0.3V(1)=0.3, and vary TST_{S} from 1000 to 3000, for V=0.1,0.3V=0.1,0.3, respectively.
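For reproducibility, the following is a minimal sketch of the simulation setup (our own reconstruction, not the authors' code). It assumes Bernoulli rewards, uses the index min{UCB_t(a), UCB_t^S(a)} from Appendix B as the MIN-UCB decision rule, and takes delta_t = 1/(2Kt^3) so that log(2t/delta_t) = log(4Kt^4); the horizon T and the offline sample sizes for Figure 1, which are not specified above, are placeholder values.

import math, random

def min_ucb_regret(mu_on, mu_off, T_S, V, T, rng):
    K = len(mu_on)
    mu_star = max(mu_on)
    # warm-start phase: offline empirical means X^(a) from T_S(a) Bernoulli samples
    x_hat = [sum(rng.random() < mu_off[a] for _ in range(T_S[a])) / max(T_S[a], 1)
             for a in range(K)]
    n, s, regret = [0] * K, [0.0] * K, 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                    # pull every arm once
        else:
            log_term = math.log(4 * K * t ** 4)          # = log(2t/delta_t)
            def index(k):
                ucb = s[k] / n[k] + math.sqrt(2 * log_term / n[k])
                pooled = (s[k] + T_S[k] * x_hat[k]) / (n[k] + T_S[k])
                ucb_s = (pooled + math.sqrt(2 * log_term / (n[k] + T_S[k]))
                         + T_S[k] / (n[k] + T_S[k]) * V[k])
                return min(ucb, ucb_s)                   # MIN-UCB index
            a = max(range(K), key=index)
        r = 1.0 if rng.random() < mu_on[a] else 0.0
        n[a] += 1; s[a] += r
        regret += mu_star - mu_on[a]
    return regret

rng = random.Random(1)
mu_on = [0.8, 0.5, 0.5, 0.5]
mu_off = [0.4, 0.6, 0.6, 0.6]
T, T_S = 10_000, [1000] * 4                              # assumed placeholder values
for V_rest in (0.1, 0.3, 0.6):                           # the varied V in Figure 1
    V = [0.4] + [V_rest] * 3
    runs = [min_ucb_regret(mu_on, mu_off, T_S, V, T, rng) for _ in range(20)]
    print(V_rest, sum(runs) / len(runs))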