
Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning

Ye Shen (equal contribution), Department of Statistics, North Carolina State University; Hengrui Cai, Department of Statistics, University of California Irvine; Rui Song, Department of Statistics, North Carolina State University
Abstract

Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, as it provides crucial guidance on early stopping of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention, aiming to infer the mean outcome of the optimal policy (i.e., the value) in real time. Yet, this problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration-and-exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration, which quantifies the probability of exploring non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection for consistency and is asymptotically normal, with a Wald-type confidence interval provided. Extensive simulation studies and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.

Keywords: Asymptotic Normality; Bandit Algorithms; Double Protection; Online Estimation; Probability of Exploration

1 Introduction

Sequential decision-making is one of the essential components of modern artificial intelligence that considers the dynamics of the real world. By maintaining the trade-off between exploration and exploitation based on historical information, bandit algorithms aim to maximize the cumulative outcome of interest and are thus popular in dynamic decision optimization with a wide variety of applications, such as precision medicine (Lu et al., 2021) and dynamic pricing (Turvey, 2017). There has been a vast literature on bandit optimization established over recent decades (see e.g., Sutton and Barto, 2018; Lattimore and Szepesvári, 2020, and the references therein). Most of these theoretical works focus on the regret analysis of bandit algorithms. When properly designed and implemented to address the exploration-and-exploitation trade-off, a powerful bandit policy could achieve a sub-linear regret, and thus eventually approximate the underlying optimal policy that maximizes the expected outcome. However, such a regret analysis only shows the convergence rate of the averaged cumulative regret (difference between the outcome under the optimal policy and that under the bandit policy) but provides limited information on the expected outcome under this bandit policy (referred to as the value in Dudík et al. (2011)).

The evaluation of the performance of bandit policies plays a vital role in many areas, including medicine and economics (see e.g., Chakraborty and Moodie, 2013; Athey, 2019). By evaluation, we aim to unbiasedly estimate the value of the optimal policy that the bandit policy is approaching and to infer the corresponding estimate. Although there is an increasing trend in policy evaluation (see e.g., Li et al., 2011; Dudík et al., 2011; Swaminathan et al., 2017; Wang et al., 2017; Kallus and Zhou, 2018; Su et al., 2019), we note that all of these works focus on learning the value of a target policy offline using historical log data. See the architecture of offline policy evaluation illustrated in the left panel of Figure 1. Instead of a post-experiment investigation, evaluating the ongoing policy in real time has recently attracted more attention. In precision medicine, the physician aims to make the best treatment decision for each patient sequentially according to their baseline covariates. Estimating the mean outcome of the current treatment decision rule is crucial to answering several fundamental questions in health care, such as whether the current strategy significantly improves patient outcomes over some baseline strategies. When the value under the ongoing rule is much lower than the desired average curative effect, the online trial should be terminated until more effective treatment options become available for the next round. Thus, policy evaluation in online learning provides a basis for early stopping of the online experiment and timely feedback from the environment, as demonstrated in the right panel of Figure 1.

Figure 1: Left panel: the architecture of offline policy evaluation, with offline context-action-outcome triples $\{({\bm{x}_{t}},a_{t},r_{t})\}$ stored in the buffer to learn the value under a target policy $\pi^{*}$ with data generated by a behavior policy $\pi_{b}$. Right panel: the architecture of the doubly robust interval estimation (DREAM) method for policy evaluation in online learning, where the context-action-outcome triple at time $t$, $({\bm{x}_{t}},a_{t},r_{t})$, is stored in the buffer to update the bandit policy $\pi_{t}$ and in the meantime to evaluate its performance.

1.1 Related Works and Challenges

Despite the importance of policy evaluation in online learning, the current bandit literature suffers from three main challenges. First, the data, such as the actions and rewards sequentially collected from the online environment, are not independent and identically distributed (i.i.d.) since they depend on the previous history and the running policy (see the right panel of Figure 1). In contrast, the existing methods for the offline policy evaluation (see e.g., Li et al., 2011; Dudík et al., 2011) primarily assumed that the data are generated by the same behavior policy and i.i.d. across different individuals and time points. Such assumptions allow them to evaluate a new policy using offline data by modeling the behavior policy or the conditional mean outcome. In addition, we note that the target policy to be evaluated in offline policy evaluation is fixed and generally known, whereas for policy evaluation in online learning, the optimal policy of interest needs to be estimated and updated in real time.

The second challenge lies in estimating the mean outcome under the optimal policy online. Although numerous methods have recently been proposed to evaluate the online sample mean for a fixed action (see e.g., Nie et al., 2018; Neel and Roth, 2018; Deshpande et al., 2018; Shin et al., 2019a, b; Waisman et al., 2019; Hadad et al., 2019; Zhang et al., 2020), none of these methods is directly applicable to our problem, as the sample mean only captures the impact of one particular arm, not the value of the optimal policy in bandits that considers the dynamics of the online environment. For instance, in contextual bandits, we aim to select an action for each subject based on its context/feature to optimize the overall outcome of interest. However, there may not exist a unified best action for all subjects due to heterogeneity, and thus evaluating the value of one single optimal action cannot fully address policy evaluation in such a setting. Moreover, although commonly used in the regret analysis, the average of collected outcomes is not a good estimator of the value under the optimal policy, in the sense that it does not possess statistical efficiency (see details in Section 3).

Third, given data generated by a bandit algorithm that maintains the exploration-and-exploitation trade-off sequentially, inferring the value of the optimal policy online should consider such a trade-off and quantify the probability of exploration and exploitation. The probability of exploring non-optimal actions is essential in two ways. First, it determines the convergence rate of the online conditional mean estimator under each action. Second, it indicates which data points are used to match the value under the optimal policy. To our knowledge, the regret analysis in the current bandit literature relies on bounding the variance information (see e.g., Auer, 2002; Srinivas et al., 2009; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Bubeck and Cesa-Bianchi, 2012; Zhou, 2015), yet little effort has been made to formally quantify the probability of exploration over time.

There are very few studies directly related to our topic. Chambaz et al. (2017) established the asymptotic normality for the conditional mean outcome under an optimal policy for sequential decision making. Later, Chen et al. (2020) proposed an inverse probability weighted value estimator to infer the value of the optimal policy using the $\epsilon$-Greedy (EG) method. These two works did not discuss how to account for the exploration-and-exploitation trade-off under commonly used bandit algorithms, such as Upper Confidence Bound (UCB) and Thompson Sampling (TS), as considered in this paper. Recently, to evaluate the value of a known policy based on adaptive data, Bibaut et al. (2021) and Zhan et al. (2021) proposed to utilize the stabilized doubly robust estimator and the adaptive weighting doubly robust estimator, respectively. However, both methods focused on obtaining a valid inference of the value estimator under a fixed policy by conveniently assuming a desired exploration rate to ensure sufficient sampling of different arms. Such an assumption can be violated in many commonly used bandits (see details shown in Theorem 4.1). Although there are other works that focus on statistical inference for adaptively collected data (Dimakopoulou et al., 2021; Zhang et al., 2021; Khamaru et al., 2021; Ramprasad et al., 2022) in the bandit or reinforcement learning setting, our work handles policy evaluation from a unique angle, inferring the value of the optimal policy by investigating the exploration rate in online learning.

1.2 Our Contributions

In this paper, we aim to overcome the aforementioned difficulties of policy evaluation in online decision-making. Our contributions are summarized as follows.

The first contribution of this work is to explicitly characterize the trade-off between exploration and exploitation in online policy optimization: we derive the probability of exploration in bandit algorithms. Such a probability is new to the literature, quantifying the chance of taking the non-greedy policy (i.e., a non-optimal action) given the current information over time, in contrast to the probability of exploitation for taking greedy actions. Specifically, we consider three commonly used bandit algorithms for exposition, including the UCB, TS, and EG methods. We note that the probability of exploration is pre-specified by users in EG while remaining implicit in UCB and TS. We use this probability to conduct valid inference on the online conditional mean estimator under each action. The second contribution of this work is to propose the doubly robust interval estimation (DREAM) method to infer the mean outcome of the optimal online policy. DREAM provides double protection on the consistency of the proposed value estimator to the true value, given that the product of the nuisance error rates of the probability of exploitation and the conditional mean outcome is $o_{p}(T^{-1/2})$, with $T$ the termination time. Under standard assumptions for inferring the online sample mean, we show that the value estimator under DREAM is asymptotically normal, with a Wald-type confidence interval provided. To the best of our knowledge, this is the first work to establish the inference for the value under the optimal policy by taking the exploration-and-exploitation trade-off into thorough account, and it thus fills a crucial gap in the policy evaluation of online learning.

The remainder of this paper is organized as follows. We introduce notation and formulate our problem, followed by preliminaries of standard contextual bandit algorithms and the formal definition of probability of exploration. In Section 3, we introduce the DREAM method and its implementation details. Then in Section 4, we derive theoretical results of DREAM under contextual bandits by establishing the convergence rate of the probability of exploration and deriving the asymptotic normality of the online mean estimator. Extensive simulation studies are conducted to demonstrate the empirical performance of the proposed method in Section 5, followed by a real application using OpenML datasets in Section 6. We conclude our paper in Section 7 by discussing the performance of the proposed DREAM in terms of the regret bound and a direct extension of our method for policy evaluation of any known policy in online learning. All the additional results and technical proofs are given in the appendix.

2 Problem Formulation

In this section, we formulate the problem of policy evaluation in online learning. We first build the framework based on the contextual bandits in Section 2.1. We then introduce three commonly used bandit algorithms, including UCB, TS, and EG, to generate data online, in Section 2.2. Lastly, we define the probability of exploration in Section 2.3. In this paper, we use bold symbols for vectors and matrices.

2.1 Framework

In contextual bandits, at each time step $t\in\mathcal{T}\equiv\{1,2,3,\cdots\}$, we observe a $d$-dimensional context $\bm{x}_{t}$ drawn from a distribution $P_{\mathcal{X}}$ which includes 1 for the intercept, choose an action $a_{t}\in\mathcal{A}$, and then observe a reward $r_{t}\in\mathcal{R}$. Denote the history of observations prior to time step $t$ as $\mathcal{H}_{t-1}=\{\bm{x}_{i},a_{i},r_{i}\}_{1\leq i\leq t-1}$. Suppose that the reward given $\bm{x}$ and $a$ follows $r\equiv\mu(\bm{x},a)+e$, where $\mu(\bm{x},a)\equiv\mathbb{E}(r|\bm{x},a)$ is the conditional mean outcome function (also known as the Q-function in the literature (Murphy, 2003; Sutton and Barto, 2018)) and is bounded by $|\mu(\bm{x},a)|\leq U$. The noise term $e$ is $\sigma$-subgaussian at each time step $t$ and independent of $\mathcal{H}_{t-1}$ given $a_{t}$, for $t\in\mathcal{T}$. Let the conditional variance be ${\mathbb{E}}(e^{2}|a)=\sigma^{2}_{a}$. The value (Dudík et al., 2011) of a given policy $\pi(\cdot)$ is defined as

V(\pi)\equiv\mathbb{E}_{\bm{x}\sim P_{\mathcal{X}}}\left[\mathbb{E}\{r|\bm{x},a=\pi(\bm{x})\}\right]=\mathbb{E}_{\bm{x}\sim P_{\mathcal{X}}}\left[\mu\{\bm{x},\pi(\bm{x})\}\right].

We define the optimal policy as $\pi^{*}(\bm{x})\equiv\operatorname*{arg\,max}_{a\in\mathcal{A}}\mu(\bm{x},a),\forall\bm{x}\in\mathcal{X}$, which finds the optimal action based on the conditional mean outcome function given a context $\bm{x}$. Thus, the optimal value can be defined as $V^{*}\equiv V(\pi^{*})=\mathbb{E}_{\bm{x}\sim P_{\mathcal{X}}}\left[\mu\{\bm{x},\pi^{*}(\bm{x})\}\right]$. In the rest of this paper, to simplify the exposition, we focus on two actions, that is, $\mathcal{A}=\{0,1\}$. Then the optimal policy is given by

\pi^{*}(\bm{x})\equiv\operatorname*{arg\,max}_{a\in\mathcal{A}}\mu(\bm{x},a)=\mathbb{I}\{\mu(\bm{x},1)>\mu(\bm{x},0)\},\quad\forall\bm{x}\in\mathcal{X}.

Our goal is to infer the value under the optimal policy $\pi^{*}$ using the online data sequentially generated by a bandit algorithm. Since the optimal policy is unknown, we estimate the optimal policy from the online data as $\widehat{\pi}_{t}$. As commonly assumed in the current online inference literature (see e.g., Deshpande et al., 2018; Zhang et al., 2020; Chen et al., 2020) and the bandit literature (see e.g., Chu et al., 2011; Abbasi-Yadkori et al., 2011; Bubeck and Cesa-Bianchi, 2012; Zhou, 2015), we consider the conditional mean outcome function taking a linear form, i.e., $\mu(\bm{x},a)=\bm{x}^{\top}\bm{\beta}(a)$, where $\bm{\beta}(\cdot)$ is a smooth function and can be estimated via a ridge regression based on $\mathcal{H}_{t-1}$ as

\widehat{\bm{\beta}}_{t-1}(a)=\{\bm{D}_{t-1}(a)^{\top}\bm{D}_{t-1}(a)+\omega\bm{I}_{d}\}^{-1}\bm{D}_{t-1}(a)^{\top}\bm{R}_{t-1}(a), \quad (1)

where $\bm{I}_{d}$ is a $d\times d$ identity matrix, $\bm{D}_{t-1}(a)$ is an $\bm{N}_{t-1}(a)\times d$ design matrix at time $t-1$ with $\bm{N}_{t-1}(a)$ as the number of pulls for action $a$, $\bm{R}_{t-1}(a)$ is the $\bm{N}_{t-1}(a)\times 1$ vector of the outcomes received under action $a$ at time $t-1$, and $\omega$ is a positive and bounded constant serving as the regularization term. There are two main reasons to choose the ridge estimator instead of the ordinary least squares estimator considered in Deshpande et al. (2018); Zhang et al. (2020); Chen et al. (2020). First, the ridge estimator is well defined when $\bm{D}_{t-1}(a)^{\top}\bm{D}_{t-1}(a)$ is singular, and its bias is negligible when the time step is large. Second, the parameter estimates in the ridge method are in accordance with the linear UCB (Li et al., 2010) and the linear TS (Agrawal and Goyal, 2013) methods (detailed in the next section) with $\omega=1$. Based on the ridge estimator in (1), the online conditional mean estimator for $\mu$ is defined as $\widehat{\mu}_{t-1}(\bm{x},a)=\bm{x}^{\top}\widehat{\bm{\beta}}_{t-1}(a)$. With two actions, the estimated optimal policy at time step $t$ is defined by

\widehat{\pi}_{t}(\bm{x})=\mathbb{I}\left\{\widehat{\mu}_{t-1}(\bm{x},1)>\widehat{\mu}_{t-1}(\bm{x},0)\right\}=\mathbb{I}\{\bm{x}^{\top}\widehat{\bm{\beta}}_{t-1}(1)>\bm{x}^{\top}\widehat{\bm{\beta}}_{t-1}(0)\},\quad\forall\bm{x}\in\mathcal{X}. \quad (2)

We note that the linear form of $\mu(\bm{x},a)$ can be relaxed to the non-linear case as $\mu(\bm{x},a)=f(\bm{x})^{\top}\bm{\beta}(a)$, where $f(\cdot)$ is a continuous function (see examples in our simulation studies in Section 5). Then the corresponding online conditional mean estimator for $\mu$ is defined as $\widehat{\mu}_{t-1}(\bm{x},a)=f(\bm{x})^{\top}\widehat{\bm{\beta}}_{t-1}(a)$ based on $\mathcal{H}_{t-1}$.
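To make the online updating concrete, the following is a minimal sketch (not the authors' implementation) of the ridge estimate in Equation (1) and the estimated optimal policy in Equation (2), assuming two actions and NumPy arrays holding one action's accumulated design matrix and outcome vector; all function names are illustrative.

```python
import numpy as np

def ridge_beta(D, R, omega=1.0):
    """Equation (1): beta_hat(a) = (D^T D + omega I_d)^{-1} D^T R for one action's history."""
    d = D.shape[1]
    return np.linalg.solve(D.T @ D + omega * np.eye(d), D.T @ R)

def estimated_optimal_action(x, beta_hat_0, beta_hat_1):
    """Equation (2): pi_hat_t(x) = I{x^T beta_hat(1) > x^T beta_hat(0)}."""
    return int(x @ beta_hat_1 > x @ beta_hat_0)
```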

2.2 Bandit Algorithms

We briefly introduce three commonly used bandit algorithms in the framework of contextual bandits, to generate the online data sequentially.
Upper Confidence Bound (UCB) (Li et al., 2010): Let the estimated standard deviation based on $\mathcal{H}_{t-1}$ be $\widehat{\sigma}_{t-1}(\bm{x},a)=\sqrt{\bm{x}^{\top}\{\bm{D}_{t-1}(a)^{\top}\bm{D}_{t-1}(a)+\omega\bm{I}_{d}\}^{-1}\bm{x}}$. The action at time $t$ is selected by

a_{t}=\operatorname*{arg\,max}_{a\in\mathcal{A}}\widehat{\mu}_{t-1}(\bm{x}_{t},a)+c_{t}\widehat{\sigma}_{t-1}(\bm{x}_{t},a),

where $c_{t}$ is a non-increasing positive parameter that controls the level of exploration. With two actions, we have the action at time step $t$ as

a_{t}=\mathbb{I}\left\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)+c_{t}\widehat{\sigma}_{t-1}(\bm{x}_{t},1)>\widehat{\mu}_{t-1}(\bm{x}_{t},0)+c_{t}\widehat{\sigma}_{t-1}(\bm{x}_{t},0)\right\}.

Thompson Sampling (TS) (Agrawal and Goyal, 2013): Suppose a normal likelihood function for the reward given $\bm{x}$ and $a$ such that $R\sim\mathcal{N}\{\bm{x}^{\top}\bm{\beta}(a),\rho^{2}\}$ with a known parameter $\rho^{2}$. If the prior for $\bm{\beta}(a)$ at time $t$ is

\bm{\beta}(a)\sim\mathcal{N}_{d}[\widehat{\bm{\beta}}_{t-1}(a),\rho^{2}\{\bm{D}_{t-1}(a)^{\top}\bm{D}_{t-1}(a)+\omega\bm{I}_{d}\}^{-1}],

where $\mathcal{N}_{d}$ is the $d$-dimensional multivariate normal distribution, we have the posterior distribution of $\bm{\beta}(a)$ as

\bm{\beta}(a)|\mathcal{H}_{t-1}\sim\mathcal{N}_{d}[\widehat{\bm{\beta}}_{t}(a),\rho^{2}\{\bm{D}_{t}(a)^{\top}\bm{D}_{t}(a)+\omega\bm{I}_{d}\}^{-1}],

for $a\in\{0,1\}$. At each time step $t$, we draw a sample from the posterior distribution as $\bm{\beta}_{t}(a)$ for $a\in\{0,1\}$, and select the next action within two arms by $a_{t}=\mathbb{I}\left\{\bm{x}_{t}^{\top}\bm{\beta}_{t}(1)>\bm{x}_{t}^{\top}\bm{\beta}_{t}(0)\right\}$.
$\epsilon$-Greedy (EG) (Sutton and Barto, 2018): Recall the estimated conditional mean outcome for action $a$ as $\widehat{\mu}_{t}(\bm{x},a)$ given a context $\bm{x}$. Under the EG method, the action at time $t$ is selected by

a_{t}=\delta_{t}\operatorname*{arg\,max}_{a\in\mathcal{A}}\widehat{\mu}_{t-1}(\bm{x}_{t},a)+(1-\delta_{t})\text{Bernoulli}(0.5),

where $\delta_{t}\sim\text{Bernoulli}(1-\epsilon_{t})$ and the parameter $\epsilon_{t}$ controls the level of exploration as pre-specified by users.
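The three action-selection rules above can be sketched as follows. This is an illustrative rendering (not the authors' code) that assumes the ridge quantities from Section 2.1 are available for each arm, with `ct`, `rho`, and `eps_t` playing the roles of $c_t$, $\rho$, and $\epsilon_t$.

```python
import numpy as np

def ucb_action(x, beta_hat, A_inv, ct):
    """UCB: A_inv[a] = {D_{t-1}(a)^T D_{t-1}(a) + omega I}^{-1}; pick the larger upper bound."""
    ucb = [x @ beta_hat[a] + ct * np.sqrt(x @ A_inv[a] @ x) for a in (0, 1)]
    return int(ucb[1] > ucb[0])

def ts_action(x, beta_hat, A_inv, rho, rng):
    """TS: draw beta_tilde(a) ~ N_d(beta_hat(a), rho^2 A_inv[a]) and act greedily on the draws."""
    draw = [rng.multivariate_normal(beta_hat[a], rho ** 2 * A_inv[a]) for a in (0, 1)]
    return int(x @ draw[1] > x @ draw[0])

def eg_action(x, beta_hat, eps_t, rng):
    """EG: with probability 1 - eps_t act greedily, otherwise pick an arm uniformly at random."""
    if rng.random() < 1 - eps_t:
        return int(x @ beta_hat[1] > x @ beta_hat[0])
    return int(rng.random() < 0.5)
```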

2.3 Probability of Exploration

We next quantify the probability of exploring non-optimal actions at each time step. To be specific, define the status of exploration as $\mathbb{I}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})\}$, indicating whether the action taken by the bandit algorithm is different from the estimated optimal action that exploits the historical information, given the context information. Here, $\widehat{\pi}_{t}$ can be viewed as the greedy policy at time step $t$. Thus the probability of exploration is defined by

\kappa_{t}(\bm{x}_{t})\equiv{\mbox{Pr}}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})\}={\mathbb{E}}[\mathbb{I}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})\}], \quad (3)

where the expectation in the last term is taken with respect to $a_{t}\in\mathcal{A}$ and the history $\mathcal{H}_{t-1}$. According to (3), $\kappa_{t}$ is determined by the given context information and the current time point. We clarify the connections and distinctions of the defined probability of exploration with respect to the bandit literature here. First, the exploration rate used in current bandit works mainly refers to the probability of exploring non-optimal actions given the optimal policy and contextual information $\bm{x}_{t}$ at time step $t$, i.e., ${\mbox{Pr}}\{a_{t}\not=\pi^{*}(\bm{x}_{t})\}$. This is different from our proposed probability of exploration $\kappa_{t}(\bm{x}_{t})\equiv{\mbox{Pr}}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})\}$. The main difference lies in the estimated optimal policy from the collected data at the current time step $t$, i.e., $\widehat{\pi}_{t}$, in $\kappa_{t}(\bm{x}_{t})$. If properly designed and implemented, the estimated optimal policy under a bandit algorithm $\widehat{\pi}_{t}$ will eventually converge to the true optimal policy $\pi^{*}$ when $t$ is large, which yields $\kappa_{t}(\bm{x})\to\lim_{t\rightarrow\infty}{\mbox{Pr}}\{a_{t}\not=\pi^{*}(\bm{x})\}$. In practice, we can estimate $\kappa_{t}(\bm{x}_{t})$ sequentially by using $\{i,\bm{x}_{i}\}_{1\leq i\leq t}$ as inputs and $\{\mathbb{I}\{a_{i}=\widehat{\pi}_{i}(\bm{x}_{i})\}\}_{1\leq i\leq t}$ as outputs via parametric or non-parametric tools, as sketched below. We denote the corresponding estimator as $\widehat{\kappa}_{t}(\bm{x}_{t})$ for time step $t$. The implementation details are explicitly described in Section 3.
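As one concrete and purely illustrative choice of such a tool, the sketch below fits a logistic regression of the exploration indicator on the time index and the context; this particular model and its inputs are assumptions for exposition, not the authors' prescribed estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_kappa(times, contexts, explored, t_new, x_new):
    """Return kappa_hat_t(x_new), with explored[i] = I{a_i != pi_hat_i(x_i)}."""
    y = np.asarray(explored)
    Z = np.column_stack([np.asarray(times), np.asarray(contexts)])
    if y.min() == y.max():                 # degenerate history: fall back to the empirical mean
        return float(y.mean())
    model = LogisticRegression().fit(Z, y)
    z_new = np.concatenate([[t_new], np.asarray(x_new)]).reshape(1, -1)
    return float(model.predict_proba(z_new)[0, 1])
```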

3 Doubly Robust Interval Estimation

We present the proposed DREAM method in this section. We first detail why the average of outcomes received in bandits fails to possess statistical efficiency. In view of the results established in regret analysis for contextual bandits (see e.g., Abbasi-Yadkori et al., 2011; Chu et al., 2011; Zhou, 2015), the cumulative regret satisfies $|\sum_{t=1}^{T}(r_{t}-V^{*})|=\tilde{\mathcal{O}}(\sqrt{dT})$, where $\tilde{\mathcal{O}}$ is the asymptotic order up to some logarithm factor. Therefore, it is immediate that the average of rewards satisfies ${\sqrt{T}}(T^{-1}\sum_{t=1}^{T}r_{t}-V^{*})=\tilde{\mathcal{O}}(1)$, since the dimension $d$ is finite. These results indicate that a simple average of the total outcome under the bandit algorithm is not a good estimator for the optimal value, since it does not enjoy the asymptotic normality needed for a valid confidence interval construction. Instead of the simple aggregation, intuitively, we should select the outcome when the action taken under a bandit policy accords with the estimated optimal policy, i.e., $a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})$, defined as the status of exploitation. In contrast to the probability of exploration, we define the probability of exploitation as

{\mbox{Pr}}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}=1-{\kappa}_{t}(\bm{x}_{t})={\mathbb{E}}[\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}].

Following the doubly robust value estimator in Dudík et al. (2011), we propose the doubly robust mean outcome estimator of the value under the optimal policy as

\widehat{V}_{T}=\frac{1}{T}\sum_{t=1}^{T}\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}\Big{[}r_{t}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}+\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}, \quad (4)

where $T$ is the current/termination time, and $1-\widehat{\kappa}_{t}(\bm{x}_{t})$ is the estimated matching probability between the chosen action $a_{t}$ and the estimated optimal action given $\bm{x}_{t}$, which captures the probability of exploitation. Our value estimator provides double protection on the consistency to the true value, given that the product of the nuisance error rates of the probability of exploitation and the conditional mean outcome is $o_{p}(T^{-1/2})$, as discussed in Theorem 3. We further propose a variance estimator for $\widehat{V}_{T}$ as

\widehat{\sigma}_{T}^{2}=\frac{1}{T}\sum_{t=1}^{T}\left(\frac{\widehat{\pi}_{t}(\bm{x}_{t})\widehat{\sigma}_{1,t-1}^{2}+\{1-\widehat{\pi}_{t}(\bm{x}_{t})\}\widehat{\sigma}_{0,t-1}^{2}}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}+\left[\widehat{\mu}_{T}\{\bm{x}_{t},\widehat{\pi}_{T}(\bm{x}_{t})\}-\frac{1}{T}\sum_{s=1}^{T}\widehat{\mu}_{T}\{\bm{x}_{s},\widehat{\pi}_{T}(\bm{x}_{s})\}\right]^{2}\right), \quad (5)

where $\widehat{\sigma}_{a,t}^{2}=\{\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)-d\}^{-1}\sum_{a_{i}=a}^{1\leq i\leq t}[\widehat{\mu}_{t}\{\bm{x}_{i},a_{i}\}-r_{i}]^{2}$ is an estimator of $\sigma^{2}_{a}$, for $a=0,1$. The proposed variance estimator is consistent for the true variance of the value, as shown in Theorem 3. We officially name our method doubly robust interval estimation (DREAM), with the detailed pseudocode provided in Algorithm 1; a code sketch of the estimators in (4) and (5) is given below. To ensure sufficient exploration and a valid inference, we force pulls of non-optimal actions on greedy samples with a pre-specified clipping rate $p_{t}$ in step (4) of Algorithm 1, of an order larger than $\mathcal{O}(t^{-1/2})$ as required by Theorem 3. In step (4), if the unchosen action $1-a_{t}$ does not satisfy the clipping condition detailed in Assumption 4.2, then we force a pull of this non-greedy action for additional exploration, as required for a valid online inference. The estimated probability of exploration is used in step (6) of Algorithm 1 to learn the value and variance. A potential limitation of DREAM arises when the agent resists taking extra non-greedy actions in online learning. Yet, we discuss in Section 7.1 that the regret from such extra exploration is negligible compared to the regret from exploitation; thus we can still maintain a sub-linear regret under DREAM. The theoretical validity of the proposed DREAM is detailed in Section 4, with its empirical outperformance over baseline methods demonstrated in Section 5. Finally, we remark that our method is not overly sensitive to the choice of $p_{t}$, with additional sensitivity analyses provided in Appendix A.
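The following is a minimal sketch (not the authors' implementation) of how the value estimator in Equation (4) and the variance estimator in Equation (5) could be computed from the logged per-step quantities. The array layout is an assumption for exposition, and the per-step variance estimates $\widehat{\sigma}^{2}_{a,t-1}$ are replaced by their terminal values for brevity.

```python
import numpy as np

def dream_value(rewards, actions, pi_hat, mu_hat_opt, kappa_hat):
    """V_hat_T in Equation (4).
    mu_hat_opt[t] stands for mu_hat_{t-1}(x_t, pi_hat_t(x_t)),
    pi_hat[t] for pi_hat_t(x_t), and kappa_hat[t] for kappa_hat_t(x_t)."""
    rewards, mu_hat_opt = np.asarray(rewards), np.asarray(mu_hat_opt)
    match = (np.asarray(actions) == np.asarray(pi_hat)).astype(float)
    weight = match / (1.0 - np.asarray(kappa_hat))
    return float(np.mean(weight * (rewards - mu_hat_opt) + mu_hat_opt))

def dream_variance(pi_hat, kappa_hat, sigma2_hat, mu_hat_T_opt):
    """sigma_hat_T^2 in Equation (5), using terminal arm-variance estimates
    sigma2_hat = (sigma_hat_0^2, sigma_hat_1^2) for simplicity;
    mu_hat_T_opt[t] stands for mu_hat_T(x_t, pi_hat_T(x_t))."""
    pi_hat = np.asarray(pi_hat, dtype=float)
    kappa_hat = np.asarray(kappa_hat, dtype=float)
    mu_hat_T_opt = np.asarray(mu_hat_T_opt, dtype=float)
    arm_term = (pi_hat * sigma2_hat[1] + (1 - pi_hat) * sigma2_hat[0]) / (1 - kappa_hat)
    context_term = (mu_hat_T_opt - mu_hat_T_opt.mean()) ** 2
    return float(np.mean(arm_term + context_term))
```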

Algorithm 1 DREAM under Contextual Bandits with Clipping
Input: termination time $T$, and the clipping rate $p_{t}>\mathcal{O}(t^{-1/2})$;
for Time $t=1,2,\cdots,T$ do
     (1) Sample the $d$-dimensional context $\bm{x}_{t}\in\mathcal{X}$;
     (2) Update $\widehat{\bm{\beta}}_{t-1}(\cdot)$ using Equation (1) and $\widehat{\mu}_{t-1}(\cdot,\cdot)$;
     (3) Update $\widehat{\pi}_{t}(\cdot)$ and $a_{t}$ using Equation (2) and the contextual bandit algorithms in Section 2.2;
     if $\lambda_{\min}({t}^{-1}\sum_{i=1}^{t}\mathbb{I}(a_{i}=1-a_{t})\bm{x}_{i}\bm{x}_{i}^{\top})<p_{t}\lambda_{\min}({t}^{-1}\sum_{i=1}^{t}\bm{x}_{i}\bm{x}_{i}^{\top})$ then
          (4) Choose action $1-a_{t}$;
     end if
     (5) Use the history $\{\mathbb{I}\{\widehat{\pi}_{i}(\bm{x}_{i})=a_{i}\},\bm{x}_{i}\}_{1\leq i\leq t}$ to estimate $\widehat{\kappa}_{t}(\cdot)$;
     (6) Get the value and its variance under the optimal policy by Equations (4) and (5);
     (7) A two-sided $1-\alpha$ CI for $V^{*}$ under the online optimization is given by
          $\Big{[}\widehat{V}_{T}-z_{\alpha/2}\widehat{\sigma}_{T}/\sqrt{T},\quad\widehat{V}_{T}+z_{\alpha/2}\widehat{\sigma}_{T}/\sqrt{T}\Big{]}$.
end for
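A minimal sketch of the clipping check in steps (3)-(4) of Algorithm 1 is given below, assuming the contexts and actions collected so far are stored as arrays; the function and variable names are illustrative.

```python
import numpy as np

def clipped_action(a_t, contexts, actions, p_t):
    """Override a_t with 1 - a_t if the non-greedy arm violates the clipping condition."""
    X = np.asarray(contexts, dtype=float)      # rows x_1, ..., x_t (t x d)
    acts = np.asarray(actions)                 # a_1, ..., a_{t-1} plus the tentative a_t
    t = X.shape[0]
    other = 1 - a_t
    X_other = X[acts == other]
    lam_other = np.linalg.eigvalsh(X_other.T @ X_other / t)[0] if X_other.size else 0.0
    lam_all = np.linalg.eigvalsh(X.T @ X / t)[0]
    # Step (4): force the non-greedy arm when its design matrix is too poorly explored.
    return other if lam_other < p_t * lam_all else a_t
```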

4 Theoretical Results

We formally present our theoretical results. In Section 4.1, we first derive the bound of the probability of exploration under the three commonly used bandit algorithms introduced in Section 2.2. This allows us to further establish the asymptotic normality of the online conditional mean estimator under a specific action in Section 4.2. Next, we establish the theoretical properties of DREAM with a Wald-type confidence interval given in Section 4.3. All the proofs are provided in Appendix B. The following assumptions are required to establish our theories.

Assumption 4.1

(Boundedness) There exists a positive constant $L_{\bm{x}}$ such that $\|\bm{x}\|_{\infty}\leq L_{\bm{x}}$ for all $\bm{x}\in\mathcal{X}$, and $\bm{\Sigma}=\mathbb{E}\left(\bm{x}\bm{x}^{\top}\right)$ has minimum eigenvalue $\lambda_{\min}(\bm{\Sigma})>\lambda$ for some $\lambda>0$.

Assumption 4.2

(Clipping) For any action $a\in\mathcal{A}$ and time step $t\geq 1$, there exists a positive and non-increasing sequence $\{p_{i}\}_{i=1}^{t}$ such that $\lambda_{\min}\{{t}^{-1}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}\}>p_{t}\lambda_{\min}(\bm{\Sigma})$.

Assumption 4.3

(Margin Condition) Assume there exist some constants $\gamma$ and $\delta$ such that ${\mbox{Pr}}\{0\leq|\mu(\bm{x},1)-\mu(\bm{x},0)|\leq M\}=\mathcal{O}(M^{\gamma})$, $\forall\bm{x}\in\mathcal{X}$, where the big-O term is uniform in $0<M\leq\delta$.

Assumption 4.1 is a technical condition on bounded contexts such that the mean of the martingale differences will converge to zero (see e.g., Zhang et al., 2020; Chen et al., 2020). Assumption 4.2 is a technical requirement for the uniqueness and convergence of the least squares estimators, which requires the bandit algorithm to explore all actions sufficiently such that the asymptotic properties of the online conditional mean estimator under different actions hold (see e.g., Deshpande et al., 2018; Hadad et al., 2019; Zhang et al., 2020). The sequence $\{p_{i}\}_{i=1}^{t}$ defined in Assumption 4.2 characterizes the boundary of the probability of taking one action, and we refer to it as the clipping rate. We ensure Assumption 4.2 is satisfied using step (4) in Algorithm 1. We establish the relationship between $p_{t}$ and $\kappa_{t}$ and discuss the requirement on $p_{t}$ for $\widehat{\bm{\beta}}_{t}(a)$ to be consistent in this section. Assumption 4.3 is well known as the margin condition, which is commonly assumed in the literature to derive a sharp convergence rate for the value under the estimated optimal policy (see e.g., Luedtke and Van Der Laan, 2016; Chambaz et al., 2017; Chen et al., 2020).

4.1 Bounding the Probability of Exploration

The probability of exploration not only signals success in finding the global optimum in online learning, but also connects to optimal policy evaluation by quantifying the rate of executing the greedy actions. Instead of directly specifying this probability (see e.g., Zhang et al., 2020; Chen et al., 2020; Bibaut et al., 2021; Zhan et al., 2021), we explicitly bound this rate based on the updating parameters in bandit algorithms, by which we conduct a valid follow-up inference. Since the probability of exploration involves the estimation of the mean outcome function, we first need to derive a tail bound for the online ridge estimator and the estimated difference between the mean outcomes.

Lemma 4.1

(Tail bound for the online ridge estimator) In the online contextual bandit under UCB, TS, or EG, suppose Assumptions 4.1 and 4.2 hold. Then for any $h>0$, the probability that the online ridge estimator deviates from its true value is bounded as

{\mbox{Pr}}\left(\left\|\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\right\|_{1}>h\right)\leq 2d\exp\left\{-tp_{t}^{2}\lambda^{2}\left(h-\sqrt{d}\left\|\bm{\beta}(a)\right\|_{2}\right)^{2}/\left(8d^{2}\sigma^{2}L_{\bm{x}}^{2}\right)\right\}.

The results in Lemma 4.1 establish the tail bound of the online ridge estimator $\widehat{\bm{\beta}}_{t}(a)$, which can be simplified as $C_{1}\exp(-C_{2}tp_{t}^{2})$ for some constants $\{C_{j}\}_{1\leq j\leq 2}$, by noticing that the constants $h$ and $\lambda$, the dimension $d$, the subgaussian parameter $\sigma$, and the bound $L_{\bm{x}}$ are positive and bounded, under bounded true coefficients $\bm{\beta}(a)$. Recall the clipping constraint that $\{p_{t}\}$ with $0<p_{t}<1$ is a non-increasing sequence. This tail bound is asymptotically equivalent to $\exp(-tp_{t}^{2})$. The established results in Lemma 4.1 work for general bandit algorithms including UCB, TS, and EG. These tail bounds are aligned with the bound $\exp(-t\epsilon_{t}^{2})$ derived in Chen et al. (2020) for the EG method only. By Lemma 4.1, the consistency of the online ridge estimator $\widehat{\bm{\beta}}_{t}(a)$ follows immediately if $tp_{t}^{2}\rightarrow\infty$ as $t\rightarrow\infty$. We can further obtain the tail bound for the estimated difference between the conditional mean outcomes under two actions by Lemma 4.1, as detailed in the following corollary.

Corollary 1

(Tail bound for the online mean estimator) Suppose the conditions in Lemma 4.1 hold. Denote $\Delta_{\bm{x}_{t}}\equiv\mu(\bm{x}_{t},1)-\mu(\bm{x}_{t},0)$. Then for any $\xi>0$, the probability that the online conditional mean estimator deviates from its true value is bounded as

{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t}(\bm{x}_{t},1)-\widehat{\mu}_{t}(\bm{x}_{t},0)-\Delta_{\bm{x}_{t}}\right|>\xi\right\}\leq 4d\exp\left\{-tp_{t}^{2}c_{\xi}\right\},

with

c_{\xi}=\lambda^{2}\left[\min\left\{\left(\xi/2-\sqrt{d}L_{\bm{x}}\left\|\bm{\beta}(1)\right\|_{2}\right)^{2},\left(\xi/2-\sqrt{d}L_{\bm{x}}\left\|\bm{\beta}(0)\right\|_{2}\right)^{2}\right\}\right]/\left(8d^{2}\sigma^{2}L_{\bm{x}}^{4}\right)

being a constant that does not depend on the time $t$.

The above corollary quantifies the uncertainty of the online estimation of the conditional mean outcomes, and thus provides a crucial intermediate result to further assess the probability of exploration by noting $\widehat{\pi}_{t}(\bm{x})=\mathbb{I}\left\{\widehat{\mu}_{t-1}(\bm{x},1)>\widehat{\mu}_{t-1}(\bm{x},0)\right\}$ in (2). More specifically, we derive the probability of exploration at each time step under the three discussed bandit algorithms for exposition in the following theorem.

Theorem 1

(Probability of exploration) In the online contextual bandit algorithms using UCB, TS, or EG, given a context $\bm{x}_{t}$ at time step $t$ and assuming Assumptions 4.1, 4.2, and 4.3 hold with $tp_{t}\rightarrow\infty$, we have, for any $0<\xi<\left|\Delta_{\bm{x}_{t}}\right|/2$ with $c_{\xi}$ specified in Corollary 1,
(i) under UCB, there exists some constant $C>0$ such that

\kappa_{t}(\bm{x}_{t})\leq C\left(\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}+\xi\right)^{\gamma}+4d\exp\left\{-(t-1)p_{t-1}^{2}c_{\xi}\right\};

(ii) under TS, we have

\kappa_{t}(\bm{x}_{t})\leq\exp\left(-\frac{\left(\left|\Delta_{\bm{x}_{t}}\right|-\xi\right)^{2}(t-1)p_{t-1}\lambda}{4\rho^{2}L_{\bm{x}}^{2}}\right)+4d\exp\left\{-(t-1)p_{t-1}^{2}c_{\xi}\right\};

(iii) under EG, we have $\kappa_{t}(\bm{x}_{t})=\epsilon_{t}/2$.

The theoretical order of $\kappa_{t}(\cdot)$ is new to the literature, quantifying the probability of exploring the non-optimal actions under a bandit policy over time. We note that the probability of exploration under EG is pre-specified by users, which directly implies its exploration rate as $\epsilon_{t}/2$ (for the two-arm setting), as shown in result (iii). Results (i) and (ii) in Theorem 1 show that the probabilities of exploration under UCB and TS are non-increasing under certain conditions. For instance, if $(t-1)p_{t-1}^{2}\rightarrow\infty$ for TS as $t\rightarrow\infty$, the upper bound for $\kappa_{t}(\bm{x}_{t})$ decays to zero with an asymptotically equivalent convergence rate of $\mathcal{O}\{\exp(-(t-1)p_{t-1}^{2})\}$ up to some constant. Similarly, for UCB, with an arbitrarily small $\xi$ of an order of $\mathcal{O}(1/\sqrt{tp_{t}})$, as $t\rightarrow\infty$, the upper bound for $\kappa_{t}(\bm{x}_{t})$ decays to zero as long as $(t-1)p_{t-1}^{2}\rightarrow\infty$. Theorem 1 also indicates that without the clipping required in Assumption 4.2 and implemented in step (4) of Algorithm 1, the exploration of non-optimal arms might be insufficient, leading to a possibly invalid and biased inference for the online conditional mean.

4.2 Asymptotic Normality of Online Ridge Estimator

We use the established bounds for the probability of exploration in Theorem 1 to further obtain the asymptotic normality of the online conditional mean estimator under each action. Specifically, denote $\kappa_{\infty}(\bm{x})\equiv\lim_{t\to\infty}\kappa_{t}(\bm{x})=\lim_{t\rightarrow\infty}{\mbox{Pr}}\{a_{t}\not=\pi^{*}(\bm{x})\}$ given a context $\bm{x}$. Assuming the conditions in Theorem 1 hold and $tp_{t}^{2}\rightarrow\infty$ as $t\rightarrow\infty$, the upper bound of the probability of exploration in Theorem 1 decays to zero for UCB and TS. Since $\kappa_{t}(\cdot)$ is nonnegative by its definition, it follows from the sandwich theorem immediately that $\kappa_{\infty}(\cdot)$ exists for UCB and TS, and $\kappa_{\infty}(\cdot)=\lim_{t\to\infty}\epsilon_{t}/2$ for EG. Using this limit, we can further characterize the following theorem for asymptotics and inference.

Theorem 2

(Asymptotics and inference) Suppose the conditions in Theorem 1 hold with $tp_{t}^{2}\rightarrow\infty$ as $t\rightarrow\infty$. Then $\sqrt{t}\{\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}_{d}\{\bm{0}_{d},\sigma_{\bm{\beta}(a)}^{2}\}$ and $\sqrt{t}\{\widehat{\mu}_{t}(\bm{x},a)-\mu(\bm{x},a)\}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}\{0,\bm{x}^{\top}\sigma_{\bm{\beta}(a)}^{2}\bm{x}\}$ for $\forall\bm{x}\in\mathcal{X}$ and $\forall a\in\mathcal{A}$, with the variance given by

\sigma_{\bm{\beta}(a)}^{2}=\sigma_{a}^{2}\left[\int\kappa_{\infty}(\bm{x})\mathbb{I}\{\bm{x}^{\top}\bm{\beta}(a)<\bm{x}^{\top}\bm{\beta}(1-a)\}\bm{x}\bm{x}^{\top}dP_{\mathcal{X}}+\int\{1-\kappa_{\infty}(\bm{x})\}\mathbb{I}\left\{\bm{x}^{\top}\bm{\beta}(a)\geq\bm{x}^{\top}\bm{\beta}(1-a)\right\}\bm{x}\bm{x}^{\top}dP_{\mathcal{X}}\right]^{-1}.

The results in Theorem 2 establish the asymptotic normality of the online ridge estimator and the online conditional mean estimator over time, with an explicit asymptotic variance derived. Here, $\kappa_{\infty}(\bm{x})$ may differ under different bandit algorithms, while the asymptotic normality holds as long as the adopted bandit algorithm explores the non-optimal actions sufficiently with a non-increasing rate, following the conclusion in Theorem 1, i.e., a clipping rate $p_{t}$ of an order larger than $\mathcal{O}(t^{-1/2})$ as described in Algorithm 1. The proof of Theorem 2 generalizes Theorem 3.1 in Chen et al. (2020) by additionally considering the online ridge estimator and general bandit algorithms.

4.3 Asymptotic Normality and Robustness for DREAM

In this section, we further derive the asymptotic normality of $\sqrt{T}(\widehat{V}_{T}-V^{*})$. The following additional assumption is required to establish the double robustness of DREAM.

Assumption 4.4

(Rate Double Robustness) Define $||z_{t}||_{2,T}=\sqrt{{1\over T}\sum_{t=1}^{T}z_{t}^{2}}$ as the $L_{2}$ norm. Let $||\mu(\bm{x},a)-\widehat{\mu}_{t}(\bm{x},a)||_{2,T}=\mathcal{O}_{p}(c_{\mu,T})$ for $a\in\mathcal{A}$, and $||\kappa_{t}(\bm{x})-\widehat{\kappa}_{t}(\bm{x})||_{2,T}=\mathcal{O}_{p}(c_{\kappa,T})$. Assume the product of the two rates satisfies $c_{\mu,T}c_{\kappa,T}=o(T^{-1/2})$.

Assumption 4.4 requires the estimated conditional mean function and the estimated probability of exploration to converge at certain rates in online learning. This assumption is frequently studied in the causal inference literature (see e.g., Farrell, 2015; Luedtke and Van Der Laan, 2016; Smucler et al., 2019; Hou et al., 2021; Kennedy, 2022) to derive the asymptotic distribution of the estimated average treatment effect with either parametric or non-parametric estimators (see e.g., Wager and Athey, 2018; Farrell et al., 2021). We remark that under the conditions required in Theorem 2, we have $||\mu(\bm{x},a)-\widehat{\mu}_{t}(\bm{x},a)||_{2,T}=\mathcal{O}_{p}(T^{-1/2})$ for all $a\in\mathcal{A}$; thus Assumption 4.4 holds as long as $||\kappa_{t}(\bm{x})-\widehat{\kappa}_{t}(\bm{x})||_{2,T}=o_{p}(1)$. We then have the asymptotic normality of $\sqrt{T}(\widehat{V}_{T}-V^{*})$ as stated in the following theorem.

Theorem 3

(Asymptotic normality for DREAM) Suppose the conditions in Theorem 2 hold. Under Assumption 4.4, with $\left\|\widehat{\kappa}_{t}(\cdot)-\kappa_{t}(\cdot)\right\|_{\infty}=o_{p}(1)$, we have $\sqrt{T}(\widehat{V}_{T}-V^{*})\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}\left(0,\sigma_{DR}^{2}\right)$, with

\widehat{\sigma}_{T}^{2}\overset{p}{\longrightarrow}\sigma_{DR}^{2}=\int_{\bm{x}}\frac{\pi^{*}(\bm{x})\sigma_{1}^{2}+\{1-\pi^{*}(\bm{x})\}\sigma_{0}^{2}}{1-\kappa_{\infty}(\bm{x})}d{P_{\mathcal{X}}}+{\mbox{Var}}\left[{\mu}\{\bm{x},\pi^{*}(\bm{x})\}\right]<\infty.

The above theorem shows that the value estimator under DREAM is doubly robust for the true value, given that the product of the nuisance error rates of the probability of exploitation and the conditional mean outcome is $o_{p}(T^{-1/2})$. We use the martingale central limit theorem to overcome the non-i.i.d. sample problem for asymptotic normality. The asymptotic variance of the proposed estimator arises from two sources. One is the variance due to the context information, and the other is a weighted average of the variance under the optimal arm and the variance under the non-optimal arms. The weight is determined by the probability of exploration as $t\rightarrow\infty$ under the adopted bandit algorithm. To the best of our knowledge, this is the first work that studies the asymptotic distribution of the mean outcome of the estimated optimal policy under general online bandit algorithms, taking the exploration-and-exploitation trade-off into thorough account. Our method thus fills a crucial gap in policy evaluation of online learning. By Theorem 3, a two-sided $1-\alpha$ confidence interval (CI) for $V^{*}$ is $[\widehat{V}_{T}-z_{\alpha/2}\widehat{\sigma}_{T}/\sqrt{T},\quad\widehat{V}_{T}+z_{\alpha/2}\widehat{\sigma}_{T}/\sqrt{T}]$, where $z_{\alpha/2}$ denotes the upper $\alpha/2$-th quantile of a standard normal distribution.

5 Simulation Studies

We investigate the finite-sample performance of DREAM and demonstrate the coverage probabilities for the policy value in this section. The computing infrastructure used is a virtual machine on the AWS platform with 72 processor cores and 144GB of memory.

Consider the 2-dimensional context $\bm{x}=[x_{1},x_{2}]^{\top}$, with $x_{1},x_{2}{\sim}_{i.i.d.}\text{Uniform}(0,2\pi)$. Suppose the outcome of interest given $\bm{x}$ and $a$ is generated from $\mathcal{N}\{\mu(\bm{x},a),\sigma_{a}^{2}\}$, where the conditional mean function takes a non-linear form as $\mu(\bm{x},a)=2-a+(5a-1)\cos(x_{1})+(1.5-3a)\cos(x_{2})$, with equal variances $\sigma_{1}=\sigma_{0}=0.1$ and $\bm{\beta}(a)=[2-a,5a-1,1.5-3a]^{\top}$. Then the optimal policy is given by $\mathbb{I}\{-1+5\cos(x_{1})-3\cos(x_{2})>0\}$, and the optimal value is 3.27 calculated by integration.

We employ the three bandit algorithms described in Section 2.2 to generate the online data with $\omega=1$ (Li et al., 2010), where $c_{t}=1$ for UCB, $\bm{\beta}(a)\sim\mathcal{N}_{d}(\bm{0}_{d},\bm{I}_{d})$ with $\rho=2$ for TS, and $\epsilon_{t}=0.1t^{-2/5}$ for EG. Set the total decision time as $T=2000$ with a burn-in period $T_{0}=50$. A sketch of this data-generating process is given below.
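The following minimal sketch reproduces the data-generating process of this simulation design; it is illustrative only and is not the authors' full experiment code.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_context():
    """x1, x2 ~ i.i.d. Uniform(0, 2*pi)."""
    return rng.uniform(0.0, 2.0 * np.pi, size=2)

def mean_reward(x, a):
    """mu(x, a) = 2 - a + (5a - 1) cos(x1) + (1.5 - 3a) cos(x2)."""
    return 2 - a + (5 * a - 1) * np.cos(x[0]) + (1.5 - 3 * a) * np.cos(x[1])

def draw_reward(x, a, sigma=0.1):
    """Reward ~ N(mu(x, a), sigma^2) with sigma_1 = sigma_0 = 0.1."""
    return mean_reward(x, a) + sigma * rng.standard_normal()

def optimal_action(x):
    """pi*(x) = I{-1 + 5 cos(x1) - 3 cos(x2) > 0}."""
    return int(-1 + 5 * np.cos(x[0]) - 3 * np.cos(x[1]) > 0)
```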

Figure 2: Results by DREAM with UCB under different model specifications in comparison to the averaged reward. Left panel: the coverage probabilities of the 95% two-sided Wald-type confidence interval, with the red line representing the nominal level at 95%. Middle panel: the bias between the estimated value and the true value. Right panel: the ratio between the standard error and the Monte Carlo standard deviation, with the red line representing the nominal level at 1.

We evaluate the double robustness property of the proposed value estimator under DREAM in comparison to the simple average of the total reward for contextual bandits. To be specific, we consider the following four methods: 1. the conditional mean function $\mu$ is correctly specified, and the probability of exploration $\kappa_{t}$ is estimated by a nonparametric regression in DREAM; 2. the probability of exploration $\kappa_{t}$ is estimated by a nonparametric regression while the model of $\mu$ is misspecified with a linear regression; 3. the conditional mean function $\mu$ is correctly specified while the model of $\kappa_{t}$ is misspecified by a constant 0.5; 4. the averaged reward is used as the value estimator. The clipping rate of DREAM is set to be 0.01, and our method is not overly sensitive to the choice of $p_{t}$, as shown in the additional sensitivity analyses provided in Appendix A. The above four value estimators are evaluated by the coverage probabilities of the 95% two-sided Wald-type CI on covering the optimal value, the bias, and the ratio between the standard error and the Monte Carlo standard deviation, as shown in Figure 2 for UCB, Figure 3 for TS, and Figure 4 for EG, aggregated over 1000 runs.

Figure 3: Results by DREAM with TS under different model specifications in comparison to the averaged reward. Left panel: the coverage probabilities of the 95% two-sided Wald-type confidence interval, with the red line representing the nominal level at 95%. Middle panel: the bias between the estimated value and the true value. Right panel: the ratio between the standard error and the Monte Carlo standard deviation, with the red line representing the nominal level at 1.

Based on Figures 2, 3, and 4, the performance of the proposed DREAM method is substantially better than that of the simple average estimator of the total outcome. Specifically, under different bandit algorithms, as the time $t$ increases, the coverage probabilities of the proposed DREAM estimator are close to the nominal level of 95%, with the biases approaching 0 and the ratios between the standard error and the Monte Carlo standard deviation approaching 1. In addition, our DREAM method achieves reasonably good performance when either the regression model for the conditional mean function or that for the probability of exploration is misspecified. These findings not only validate the theoretical results in Theorem 3 but also demonstrate the double robustness of DREAM in handling policy evaluation in online learning. In contrast, the simple average of rewards can hardly maintain coverage probabilities over 80% and has much larger biases in all cases.

Figure 4: Results by DREAM with EG under different model specifications in comparison to the averaged reward. Left panel: the coverage probabilities of the 95% two-sided Wald-type confidence interval, with the red line representing the nominal level at 95%. Middle panel: the bias between the estimated value and the true value. Right panel: the ratio between the standard error and the Monte Carlo standard deviation, with the red line representing the nominal level at 1.

6 Real Data Application

In this section, we evaluate the performance of the proposed DREAM method on real datasets from the OpenML database, which is a curated, comprehensive benchmark suite for machine-learning tasks. Following the contextual bandit setting considered in this paper, we select two datasets from the public OpenML Curated Classification benchmarking suite 2018 (OpenML-CC18; BSD 3-Clause license) (Bischl et al., 2017), i.e., SEA50 and SEA50000, to formulate the real application. Each dataset is a collection of pairs of 3-dimensional features $\bm{x}$ and their corresponding labels $Y\in\{0,1\}$, with a total number of observations $n=1{,}000{,}000$. To simulate an online environment for data generation, we turn the two-class classification tasks into two-armed contextual bandit problems (see e.g., Dudík et al., 2011; Wang et al., 2017; Su et al., 2019), so that we can reproduce the online data to evaluate the performance of the proposed method, as sketched below. Specifically, at each time step $t$, we draw the pair $\{\bm{x}_{t},Y_{t}\}$ uniformly at random without replacement from the dataset with $n=1{,}000{,}000$. Given the revealed context $\bm{x}_{t}$, the bandit algorithm selects an action $a_{t}\in\{0,1\}$. The reward is generated from a normal distribution $\mathcal{N}\{\mathbb{I}(a_{t}=Y_{t}),0.5^{2}\}$. Here, the mean of the reward is 1 if the selected action matches the underlying true label, and 0 otherwise. Therefore, the optimal value is 1 while the optimal policy is unknown due to the complex relationship between the collected features and the label. Our goal is to infer the value under the optimal policy in the online settings produced by the datasets SEA50 and SEA50000.
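A minimal sketch of one step of this classification-to-bandit conversion is given below. Loading the OpenML features and labels is assumed and not shown, and sampling without replacement over the whole run is simplified to a single random draw for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def bandit_step(features, labels, policy):
    """Draw (x_t, Y_t), act with the bandit policy, and return the noisy matching reward."""
    i = rng.integers(len(labels))                           # simplified: one random draw
    x_t, y_t = features[i], labels[i]
    a_t = policy(x_t)                                       # action in {0, 1}
    r_t = float(a_t == y_t) + 0.5 * rng.standard_normal()   # reward ~ N(I{a_t = Y_t}, 0.5^2)
    return x_t, a_t, r_t
```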

Figure 5: The coverage probabilities of the 95% two-sided Wald-type confidence interval by the proposed value estimator under DREAM in comparison to the averaged reward, under different contextual bandit algorithms. Left panel: the results for the SEA50 dataset. Right panel: the results for the SEA50000 dataset. The red line represents the nominal level at 95%.

Using a similar procedure as described in Section 5, we apply DREAM in comparison to the simple average estimator, by employing the three bandit algorithms with the following specifications: (i) for UCB, let $c_{t}=2$; (ii) for TS, let the priors be $\bm{\beta}(a)\sim\mathcal{N}_{d}(\bm{0}_{d},\bm{I}_{d})$ with parameter $\rho=0.5$; and (iii) for EG, let $\epsilon_{t}\equiv t^{-1/3}$. Set the total time for the online learning as $T=200$ with a burn-in period $T_{0}=20$. The results are evaluated by the coverage probabilities of the 95% two-sided Wald-type CI in Figure 5 for the two real datasets, respectively, averaged over 500 replications. It can be observed from Figure 5 that our proposed DREAM method performs much better than the simple average estimator in all cases. To be specific, the coverage probabilities of the value estimator under DREAM are close to the nominal level of 95%, while the CI constructed from the averaged reward can hardly cover the true value, with its coverage probabilities decaying to 0, under different bandit algorithms and the two simulated online environments. These findings are consistent with what we have observed in simulations and consolidate the practical usefulness of the proposed DREAM.

7 Discussion

In this paper, we propose doubly robust interval estimation (DREAM) to infer the mean outcome of the optimal policy using the online data generated from a bandit algorithm. We explicitly characterize the probability of exploring the non-optimal actions under different bandit algorithms and show the consistency and asymptotic normality of the proposed value estimator. In this section, we discuss the performance of DREAM in terms of the regret bound and extend it to the evaluation of a known policy in online learning.

7.1 Regret Bound under DREAM

In this section, we discuss the regret bound of the proposed DREAM method. Specifically, we study the regret defined as the difference between the expected cumulative rewards under the oracle optimal policy and the bandit policy, which is

R_{T}=\sum_{t=1}^{T}{\mathbb{E}}\left\{\mu(\bm{x}_{t},\pi^{*}(\bm{x}_{t}))-\mu(\bm{x}_{t},a_{t})\right\}. \quad (6)

By noticing that $\mu(\bm{x}_{t},\pi^{*}(\bm{x}_{t}))-\mu(\bm{x}_{t},a_{t})=\Delta_{\bm{x}_{t}}$ if $a_{t}\neq\pi^{*}(\bm{x}_{t})$ and 0 otherwise, we have

R_{T}=\sum_{t=1}^{T}{\mathbb{E}}\left[\Delta_{\bm{x}_{t}}\mathbb{I}\left\{a_{t}\neq\pi^{*}(\bm{x}_{t})\right\}\right].

We note that the indicator function is equivalent to $|a_{t}-\pi^{*}(\bm{x}_{t})|$ and is bounded by $|a_{t}-\widehat{\pi}(\bm{x}_{t})|+|\widehat{\pi}(\bm{x}_{t})-\pi^{*}(\bm{x}_{t})|$. Thus, we can divide the regret defined in Equation (6) into two parts as $R_{T}\leq R_{T}^{(1)}+R_{T}^{(2)}$, where

R_{T}^{(1)}=\sum_{t=1}^{T}{\mathbb{E}}\left[\Delta_{\bm{x}_{t}}|a_{t}-\widehat{\pi}(\bm{x}_{t})|\right],

is the regret from the exploration and $R_{T}^{(2)}=\sum_{t=1}^{T}{\mathbb{E}}\left[\Delta_{\bm{x}_{t}}|\widehat{\pi}(\bm{x}_{t})-\pi^{*}(\bm{x}_{t})|\right]$ is the regret from the exploitation. It is well known that the regret from the exploitation $R_{T}^{(2)}$ is sublinear (Chu et al., 2011; Agrawal and Goyal, 2013), and the regret for EG has been well studied by Chen et al. (2020). Therefore, we focus on the analysis of the regret $R_{T}^{(1)}$ from the exploration for UCB and TS here.

Since $\Delta_{\bm{x}_{t}}$ is bounded and the upper bound of ${\mathbb{E}}|a_{t}-\widehat{\pi}(\bm{x}_{t})|={\mathbb{E}}[\mathbb{I}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})\}]=\kappa_{t}(\bm{x}_{t})$ has an asymptotically equivalent convergence rate of $\mathcal{O}\{\exp(-tp_{t}^{2})\}$ up to some constant by Theorem 1, there exists some constant $C$ such that the regret from the exploration is bounded by

R_{T}^{(1)}\leq C\sum_{t=1}^{T}\mathcal{O}\{\exp(-tp_{t}^{2})\}.

If we choose $p_{t}=\sqrt{{\alpha\log t}/{t}}$ for some $\alpha\in(0,1)$, the regret $R_{T}^{(1)}$ is bounded by $\sum_{t=1}^{T}\mathcal{O}\{t^{-\alpha}\}=\mathcal{O}\{T^{1-\alpha}\}$, where the last equality is obtained using Lemma 6 in Luedtke and Van Der Laan (2016); the calculation is spelled out below. Thus, we still have a sublinear regret under DREAM.
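For completeness, the calculation behind this last step can be written out as follows, under the stated choice of $p_{t}$:

```latex
% With p_t = \sqrt{\alpha \log t / t}, the exploration term becomes polynomial in t:
\exp(-t p_t^2) = \exp(-\alpha \log t) = t^{-\alpha},
\qquad
\sum_{t=1}^{T} t^{-\alpha} \;\le\; 1 + \int_{1}^{T} s^{-\alpha}\, ds
\;=\; 1 + \frac{T^{1-\alpha}-1}{1-\alpha} \;=\; \mathcal{O}(T^{1-\alpha}),
\qquad \alpha \in (0,1).
```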

7.2 Evaluation of Known Policies in Online Learning

We could extend our method to evaluate a new known policy \pi^{E}, different from the bandit policy, in the online environment. Specifically, we focus on the statistical inference of policy evaluation in online learning under the contextual bandit framework. Recalling the setting and notation in Section 2, given a target policy \pi^{E}, we propose its doubly robust value estimator as

\widehat{V}_{T}({\pi}^{E})=\frac{1}{T}\sum_{t=1}^{T}\frac{\mathbb{I}\{a_{t}={\pi}^{E}(\bm{x}_{t})\}}{\widehat{p}_{t-1}\{a_{t}|\bm{x}_{t}\}}\Big{[}r_{t}-\widehat{\mu}_{t-1}\{\bm{x}_{t},{\pi}^{E}(\bm{x}_{t})\}\Big{]}+\widehat{\mu}_{t-1}\{\bm{x}_{t},{\pi}^{E}(\bm{x}_{t})\},

where T is the current time step or the termination time, and \widehat{p}_{t-1}(a_{t}|\bm{x}_{t}) is the estimator of the propensity score p_{t-1}(a_{t}|\bm{x}_{t}) of the chosen action a_{t}. The following corollary summarizes the asymptotic properties of \widehat{V}_{T}({\pi}^{E}), built on Theorem 2.

Corollary 2

(Asymptotic normality for evaluating a known policy) Suppose the conditions in Theorem 2 hold. Furthermore, assuming the rate double robustness condition that ||\mu(\bm{x},a)-\widehat{\mu}_{t}(\bm{x},a)||_{2,T}\,||p_{t}(a|\bm{x})-\widehat{p}_{t}(a|\bm{x})||_{2,T}=o_{p}(T^{-1/2}) for a\in\mathcal{A}, we have \sqrt{T}\{\widehat{V}_{T}(\pi^{E})-V(\pi^{E})\}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}\{0,\sigma(\pi^{E})^{2}\}, with

\sigma(\pi^{E})^{2}=\int_{\bm{x}}\frac{\pi^{E}(\bm{x})\sigma_{1}^{2}+\{1-\pi^{E}(\bm{x})\}\sigma_{0}^{2}}{{\mbox{Pr}}\left\{\pi^{E}(\bm{x})|\bm{x}\right\}}d{P_{\mathcal{X}}}+{\mbox{Var}}\left[{\mu}\{\bm{x},\pi^{E}(\bm{x})\}\right]<\infty,

where {\mbox{Pr}}\left\{\pi^{E}(\bm{x})|\bm{x}\right\}=\lim_{t\rightarrow\infty}p_{t-1}(\pi^{E}(\bm{x}_{t})|\bm{x}_{t}).

Here, we impose the same conditions on the bandit algorithms as in Theorem 3 to guarantee sufficient exploration of the different arms, so that evaluating an arbitrary policy is valid. The usage of the rate double robustness assumption and the margin condition (Assumption 4.3) follows a similar logic as in Theorem 3. The estimator of \sigma(\pi^{E})^{2}, denoted \widehat{\sigma}(\pi^{E})^{2}, can be obtained similarly to (5). Thus, a two-sided 1-\alpha CI for V(\pi^{E}) under online optimization is [\widehat{V}_{T}(\pi^{E})-z_{\alpha/2}\widehat{\sigma}(\pi^{E})/\sqrt{T},\ \widehat{V}_{T}(\pi^{E})+z_{\alpha/2}\widehat{\sigma}(\pi^{E})/\sqrt{T}].
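To make the construction above concrete, the following is a minimal sketch of computing \widehat{V}_{T}(\pi^{E}) and a Wald-type CI from the logged online data (\bm{x}_{t},a_{t},r_{t}). The callables pi_E, mu_hat, and p_hat are placeholders for the target policy and the online estimates \widehat{\mu}_{t-1} and \widehat{p}_{t-1}; the standard error here is a simple sample-variance plug-in of the per-step terms, used for illustration in place of the estimator \widehat{\sigma}(\pi^{E})^{2} discussed above.

```python
import numpy as np
from scipy.stats import norm

def dr_value_known_policy(x, a, r, pi_E, mu_hat, p_hat, alpha=0.05):
    """Doubly robust value estimate and Wald CI for a known policy pi_E.

    x, a, r        : contexts, actions, rewards for t = 1, ..., T
    pi_E(x)        : action recommended by the target policy
    mu_hat(t, x, b): online outcome-model estimate mu_hat_{t-1}(x, b)
    p_hat(t, b, x) : online propensity estimate p_hat_{t-1}(b | x)
    """
    T = len(r)
    terms = np.empty(T)
    for t in range(T):
        a_E = pi_E(x[t])
        mu_E = mu_hat(t, x[t], a_E)
        ipw = float(a[t] == a_E) / p_hat(t, a[t], x[t])
        terms[t] = ipw * (r[t] - mu_E) + mu_E
    v_hat = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(T)          # simple plug-in standard error
    z = norm.ppf(1 - alpha / 2)
    return v_hat, (v_hat - z * se, v_hat + z * se)
```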


There are some other extensions that we may consider in future work. First, in this paper we focus on settings with binary actions; a more general method for multiple actions or even a continuous action space is desirable. Second, we consider contextual bandits in this paper, and all the theoretical results apply to multi-armed bandits as well; it would be practically interesting to extend our proposal to reinforcement learning problems. Third, instead of the rate double robustness assumption used in the current paper, it is of theoretical interest to develop a model double robustness version of DREAM in future research.

References

  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D. and Szepesvári, C. (2011), Improved algorithms for linear stochastic bandits, in ‘Advances in Neural Information Processing Systems’, pp. 2312–2320.
  • Agrawal and Goyal (2013) Agrawal, S. and Goyal, N. (2013), Thompson sampling for contextual bandits with linear payoffs, in ‘International Conference on Machine Learning’, PMLR, pp. 127–135.
  • Athey (2019) Athey, S. (2019), The impact of machine learning on economics, in ‘The economics of artificial intelligence’, University of Chicago Press, pp. 507–552.
  • Auer (2002) Auer, P. (2002), ‘Using confidence bounds for exploitation-exploration trade-offs’, Journal of Machine Learning Research 3(Nov), 397–422.
  • Bibaut et al. (2021) Bibaut, A., Dimakopoulou, M., Kallus, N., Chambaz, A. and van der Laan, M. (2021), ‘Post-contextual-bandit inference’, Advances in Neural Information Processing Systems 34.
  • Bischl et al. (2017) Bischl, B., Casalicchio, G., Feurer, M., Hutter, F., Lang, M., Mantovani, R. G., van Rijn, J. N. and Vanschoren, J. (2017), ‘Openml benchmarking suites’, arXiv preprint arXiv:1708.03731 .
  • Bubeck and Cesa-Bianchi (2012) Bubeck, S. and Cesa-Bianchi, N. (2012), ‘Regret analysis of stochastic and nonstochastic multi-armed bandit problems’, arXiv preprint arXiv:1204.5721 .
  • Chakraborty and Moodie (2013) Chakraborty, B. and Moodie, E. (2013), Statistical methods for dynamic treatment regimes, Springer.
  • Chambaz et al. (2017) Chambaz, A., Zheng, W. and van der Laan, M. J. (2017), ‘Targeted sequential design for targeted learning inference of the optimal treatment rule and its mean reward’, Annals of statistics 45(6), 2537.
  • Chen et al. (2020) Chen, H., Lu, W. and Song, R. (2020), ‘Statistical inference for online decision making: In a contextual bandit setting’, Journal of the American Statistical Association pp. 1–16.
  • Chu et al. (2011) Chu, W., Li, L., Reyzin, L. and Schapire, R. (2011), Contextual bandits with linear payoff functions, in ‘Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics’, JMLR Workshop and Conference Proceedings, pp. 208–214.
  • Dedecker and Louhichi (2002) Dedecker, J. and Louhichi, S. (2002), Maximal inequalities and empirical central limit theorems, in ‘Empirical process techniques for dependent data’, Springer, pp. 137–159.
  • Deshpande et al. (2018) Deshpande, Y., Mackey, L., Syrgkanis, V. and Taddy, M. (2018), Accurate inference for adaptive linear models, in ‘International Conference on Machine Learning’, PMLR, pp. 1194–1203.
  • Dimakopoulou et al. (2021) Dimakopoulou, M., Ren, Z. and Zhou, Z. (2021), ‘Online multi-armed bandits with adaptive inference’, Advances in Neural Information Processing Systems 34, 1939–1951.
  • Dudík et al. (2011) Dudík, M., Langford, J. and Li, L. (2011), ‘Doubly robust policy evaluation and learning’, arXiv preprint arXiv:1103.4601 .
  • Farrell (2015) Farrell, M. H. (2015), ‘Robust inference on average treatment effects with possibly more covariates than observations’, Journal of Econometrics 189(1), 1–23.
  • Farrell et al. (2021) Farrell, M. H., Liang, T. and Misra, S. (2021), ‘Deep neural networks for estimation and inference’, Econometrica 89(1), 181–213.
  • Feller (2008) Feller, W. (2008), An introduction to probability theory and its applications, vol 2, John Wiley & Sons.
  • Hadad et al. (2019) Hadad, V., Hirshberg, D. A., Zhan, R., Wager, S. and Athey, S. (2019), ‘Confidence intervals for policy evaluation in adaptive experiments’, arXiv preprint arXiv:1911.02768 .
  • Hall and Heyde (2014) Hall, P. and Heyde, C. C. (2014), Martingale limit theory and its application, Academic press.
  • Hou et al. (2021) Hou, J., Bradic, J. and Xu, R. (2021), ‘Treatment effect estimation under additive hazards models with high-dimensional confounding’, Journal of the American Statistical Association pp. 1–16.
  • Kallus and Zhou (2018) Kallus, N. and Zhou, A. (2018), ‘Policy evaluation and optimization with continuous treatments’, arXiv preprint arXiv:1802.06037 .
  • Kennedy (2022) Kennedy, E. H. (2022), ‘Semiparametric doubly robust targeted double machine learning: a review’, arXiv preprint arXiv:2203.06469 .
  • Khamaru et al. (2021) Khamaru, K., Deshpande, Y., Mackey, L. and Wainwright, M. J. (2021), ‘Near-optimal inference in adaptive linear regression’, arXiv preprint arXiv:2107.02266 .
  • Lattimore and Szepesvári (2020) Lattimore, T. and Szepesvári, C. (2020), Bandit algorithms, Cambridge University Press.
  • Li et al. (2010) Li, L., Chu, W., Langford, J. and Schapire, R. E. (2010), A contextual-bandit approach to personalized news article recommendation, in ‘Proceedings of the 19th international conference on World wide web’, pp. 661–670.
  • Li et al. (2011) Li, L., Chu, W., Langford, J. and Wang, X. (2011), Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms, in ‘Proceedings of the fourth ACM international conference on Web search and data mining’, pp. 297–306.
  • Lu et al. (2021) Lu, Y., Xu, Z. and Tewari, A. (2021), ‘Bandit algorithms for precision medicine’, arXiv preprint arXiv:2108.04782 .
  • Luedtke and Van Der Laan (2016) Luedtke, A. R. and Van Der Laan, M. J. (2016), ‘Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy’, Annals of statistics 44(2), 713.
  • Murphy (2003) Murphy, S. A. (2003), ‘Optimal dynamic treatment regimes’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2), 331–355.
  • Neel and Roth (2018) Neel, S. and Roth, A. (2018), Mitigating bias in adaptive data gathering via differential privacy, in ‘International Conference on Machine Learning’, PMLR, pp. 3720–3729.
  • Nie et al. (2018) Nie, X., Tian, X., Taylor, J. and Zou, J. (2018), Why adaptively collected data have negative bias and how to correct for it, in ‘International Conference on Artificial Intelligence and Statistics’, PMLR, pp. 1261–1269.
  • Ramprasad et al. (2022) Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W. W. and Cheng, G. (2022), ‘Online bootstrap inference for policy evaluation in reinforcement learning’, Journal of the American Statistical Association (just-accepted), 1–31.
  • Shin et al. (2019a) Shin, J., Ramdas, A. and Rinaldo, A. (2019a), ‘Are sample means in multi-armed bandits positively or negatively biased?’, arXiv preprint arXiv:1905.11397 .
  • Shin et al. (2019b) Shin, J., Ramdas, A. and Rinaldo, A. (2019b), ‘On the bias, risk and consistency of sample means in multi-armed bandits’, arXiv preprint arXiv:1902.00746 .
  • Smucler et al. (2019) Smucler, E., Rotnitzky, A. and Robins, J. M. (2019), ‘A unifying approach for doubly-robust l1l_{1} regularized estimation of causal contrasts’, arXiv preprint arXiv:1904.03737 .
  • Srinivas et al. (2009) Srinivas, N., Krause, A., Kakade, S. M. and Seeger, M. (2009), ‘Gaussian process optimization in the bandit setting: No regret and experimental design’, arXiv preprint arXiv:0912.3995 .
  • Su et al. (2019) Su, Y., Dimakopoulou, M., Krishnamurthy, A. and Dudík, M. (2019), ‘Doubly robust off-policy evaluation with shrinkage’, arXiv preprint arXiv:1907.09623 .
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018), Reinforcement learning: An introduction, MIT press.
  • Swaminathan et al. (2017) Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudik, M., Langford, J., Jose, D. and Zitouni, I. (2017), Off-policy evaluation for slate recommendation, in ‘Advances in Neural Information Processing Systems’, pp. 3632–3642.
  • Turvey (2017) Turvey, R. (2017), Optimal Pricing and Investment in Electricity Supply: An Essay in Applied Welfare Economics, Routledge.
  • Wager and Athey (2018) Wager, S. and Athey, S. (2018), ‘Estimation and inference of heterogeneous treatment effects using random forests’, Journal of the American Statistical Association 113(523), 1228–1242.
  • Waisman et al. (2019) Waisman, C., Nair, H. S., Carrion, C. and Xu, N. (2019), ‘Online inference for advertising auctions’, arXiv preprint arXiv:1908.08600 .
  • Wang et al. (2017) Wang, Y.-X., Agarwal, A. and Dudík, M. (2017), Optimal and adaptive off-policy evaluation in contextual bandits, in ‘International Conference on Machine Learning’, PMLR, pp. 3589–3597.
  • Zhan et al. (2021) Zhan, R., Hadad, V., Hirshberg, D. A. and Athey, S. (2021), Off-policy evaluation via adaptive weighting with data from contextual bandits, in ‘Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining’, pp. 2125–2135.
  • Zhang et al. (2021) Zhang, K., Janson, L. and Murphy, S. (2021), ‘Statistical inference with m-estimators on adaptively collected data’, Advances in Neural Information Processing Systems 34, 7460–7471.
  • Zhang et al. (2020) Zhang, K. W., Janson, L. and Murphy, S. A. (2020), ‘Inference for batched bandits’, arXiv preprint arXiv:2002.03217 .
  • Zhou (2015) Zhou, L. (2015), ‘A survey on contextual multi-armed bandits’, arXiv preprint arXiv:1508.03326 .

Supplementary to ‘Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning’

This supplementary article provides sensitivity analyses and all the technical proofs of the established theorems for policy evaluation in online learning under contextual bandits. Note that the theoretical results in Section 7 can be proven in a similar manner using the arguments in Section B; we thus omit the details here.

Appendix A Sensitivity Test for the Choice of ptp_{t}

We conduct a sensitivity test for the choice of ptp_{t} in this section. We run all the simulations with pt=0.01,0.05,0.1p_{t}=0.01,0.05,0.1 and find that Algorithm 1 is not sensitive to the choice of ptp_{t}.

Figure A.1: Results by DREAM under UCB with different model specifications in comparison to the averaged reward, with panel (a) p_{t}=0.01. Left panel: the coverage probabilities of the 95% two-sided Wald-type CI, with the red line representing the nominal level at 95%. Middle panel: the bias between the estimated value and the true value. Right panel: the ratio between the standard error and the Monte Carlo standard deviation, with the red line representing the nominal level at 1.

Figure A.2: Results by DREAM under UCB with different model specifications in comparison to the averaged reward, with panels (a) p_{t}=0.05 and (b) p_{t}=0.1; the panel layout is the same as in Figure A.1.

Figure A.3: Results by DREAM under TS with different model specifications in comparison to the averaged reward, with panels (a) p_{t}=0.01, (b) p_{t}=0.05, and (c) p_{t}=0.1; the panel layout is the same as in Figure A.1.

Figure A.4: Results by DREAM under EG with different model specifications in comparison to the averaged reward, with panels (a) p_{t}=0.01, (b) p_{t}=0.05, and (c) p_{t}=0.1; the panel layout is the same as in Figure A.1.

Appendix B Technical Proofs for Main Results

This section provides all the technical proofs for the established theorems for policy evaluation in online learning under the contextual bandits.

B.1 Proof of Lemma 4.1

The proof of Lemma 4.1 consists of three main steps. Specifically, we first reconstruct the target difference \widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a) and decompose it into two parts. Then, we establish a bound for each part. Lastly, we derive the lower bound of {\mbox{Pr}}(\|\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\|_{1}\leq h).

Step 1: Recalling Equation (1) in the main paper, with \bm{D}_{t-1}(a) being an N_{t-1}(a)\times d design matrix at time t-1, where N_{t-1}(a) is the number of pulls of action a, we have

𝜷^t(a)={1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}1{1ti=1t𝕀(ai=a)𝒙iri}.\displaystyle\widehat{\bm{\beta}}_{t}(a)=\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}r_{i}\right\}.

We are interested in the quantity

𝜷^t(a)𝜷(a)={1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}1{1ti=1t𝕀(ai=a)𝒙iri}𝜷(a).\displaystyle\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)=\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}r_{i}\right\}-\bm{\beta}(a). (B.1)

Note that 𝜷(a)\bm{\beta}(a) can be written as

{1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}1{1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}𝜷(a),\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}\bm{\beta}(a),

and since ri=𝒙i𝜷(a)+eir_{i}=\bm{x}_{i}^{\top}\bm{\beta}(a)+e_{i}, we can write (B.1) as

𝜷^t(a)𝜷(a)=\displaystyle\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)= {1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}1{1ti=1t𝕀(ai=a)𝒙iei}η3\displaystyle\underbrace{\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}e_{i}\right\}}_{\eta_{3}}
{1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}1ωt𝜷(a)η4.\displaystyle-\underbrace{\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\frac{\omega}{t}\bm{\beta}(a)}_{\eta_{4}}.

Our goal is to find a lower bound of {\mbox{Pr}}(\|\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\|_{1}\leq h) for any h>0. By the triangle inequality, \|\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\|_{1}\leq\left\|\eta_{3}\right\|_{1}+\left\|\eta_{4}\right\|_{1}, and hence

Pr(𝜷^t(a)𝜷(a)1h)Pr(η31+η41h).\displaystyle{\mbox{Pr}}\left(\left\|\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\right\|_{1}\leq h\right)\geq{\mbox{Pr}}\left(\left\|\eta_{3}\right\|_{1}+\left\|\eta_{4}\right\|_{1}\leq h\right). (B.3)

Step 2: We first bound \eta_{4}. By the relationship between the eigenvalues and the L_{2} norm of a symmetric positive definite matrix, we have \|\bm{M}^{-1}\|_{2}=\lambda_{\max}(\bm{M}^{-1})=\{\lambda_{\min}(\bm{M})\}^{-1} for any such matrix \bm{M}. Thus we obtain that

{1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}12\displaystyle\left\|\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\right\|_{2} ={λmin(1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d)}1\displaystyle=\left\{\lambda_{\min}\left(\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right)\right\}^{-1}
=1λmin(1ti=1t𝕀(ai=a)𝒙i𝒙i)+1tω\displaystyle=\frac{1}{\lambda_{\min}\left(\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}\right)+\frac{1}{t}\omega}
1ptλmin(𝚺)+1tω1ptλ+1tω,\displaystyle\leq\frac{1}{p_{t}\lambda_{\min}\left(\bm{\Sigma}\right)+\frac{1}{t}\omega}\leq\frac{1}{p_{t}\lambda+\frac{1}{t}\omega},

which leads to

η42{1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}12ωt𝜷(a)2ωtptλ+ω𝜷(a)2.\|\eta_{4}\|_{2}\leq\left\|\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\right\|_{2}\left\|\frac{\omega}{t}\bm{\beta}(a)\right\|_{2}\leq\frac{\omega}{tp_{t}\lambda+\omega}\left\|\bm{\beta}(a)\right\|_{2}. (B.4)

By the Cauchy-Schwarz inequality, we can further bound the L_{1} norm of \eta_{4} as

η41dη42ωdtptλ+ω𝜷(a)2d1+tptλ/ω𝜷(a)2d𝜷(a)2.\|\eta_{4}\|_{1}\leq\sqrt{d}\|\eta_{4}\|_{2}\leq\frac{\omega\sqrt{d}}{tp_{t}\lambda+\omega}\left\|\bm{\beta}(a)\right\|_{2}\leq\frac{\sqrt{d}}{1+tp_{t}\lambda/\omega}\left\|\bm{\beta}(a)\right\|_{2}\leq\sqrt{d}\left\|\bm{\beta}(a)\right\|_{2}. (B.5)

Step 3: Lastly, using the results in (B.5), we have

Pr(η31+η41h)Pr(η31hd𝜷(a)2).\displaystyle{\mbox{Pr}}\left(\left\|\eta_{3}\right\|_{1}+\left\|\eta_{4}\right\|_{1}\leq h\right)\geq{\mbox{Pr}}\left(\left\|\eta_{3}\right\|_{1}\leq h-\sqrt{d}\left\|\bm{\beta}(a)\right\|_{2}\right). (B.6)

By the definition of η3\eta_{3} and Lemma 2 in Chen et al. (2020), for any constant c>0c>0, we have

Pr(η31>c)2dexp{t(ptλ2)2c22d2σ2L𝒙2}=2dexp{tpt2λ2c28d2σ2L𝒙2}.\displaystyle{\mbox{Pr}}\left(\left\|\eta_{3}\right\|_{1}>c\right)\leq 2d\exp\left\{-\frac{t\left(\frac{p_{t}\lambda}{2}\right)^{2}c^{2}}{2d^{2}\sigma^{2}L_{\bm{x}}^{2}}\right\}=2d\exp\left\{-\frac{tp_{t}^{2}\lambda^{2}c^{2}}{8d^{2}\sigma^{2}L_{\bm{x}}^{2}}\right\}.

Therefore, from (B.3) and (B.6), taking c=hd𝜷(a)2c=h-\sqrt{d}\left\|\bm{\beta}(a)\right\|_{2}, we have that under event EtE_{t},

Pr(𝜷^t(a)𝜷(a)1>h)2dexp{tpt2λ2(hd𝜷(a)2)28d2σ2L𝒙2}.\displaystyle\begin{aligned} {\mbox{Pr}}\left(\left\|\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\right\|_{1}>h\right)&\leq 2d\exp\left\{-\frac{tp_{t}^{2}\lambda^{2}\left(h-\sqrt{d}\left\|\bm{\beta}(a)\right\|_{2}\right)^{2}}{8d^{2}\sigma^{2}L_{\bm{x}}^{2}}\right\}.\\ \end{aligned} (B.7)

Based on the above results, it is immediate that the online ridge estimator \widehat{\bm{\beta}}_{t}(a) is consistent for \bm{\beta}(a) if tp_{t}^{2}\rightarrow\infty as t\rightarrow\infty. The proof is hence completed.
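To illustrate this consistency claim numerically, the following sketch (illustration only, not part of the proof) simulates a two-arm linear bandit in which the non-greedy arm is forced with probability p_{t}=\sqrt{\log t/t}, so that tp_{t}^{2}\rightarrow\infty, and tracks the L_{1} error of the online ridge estimates; the data-generating model, noise level, and all constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, omega = 3, 20000, 1.0
beta = {a: rng.normal(size=d) for a in (0, 1)}     # true arm parameters (hypothetical)
G = {a: omega * np.eye(d) for a in (0, 1)}         # running D_t(a)^T D_t(a) + omega * I_d
b = {a: np.zeros(d) for a in (0, 1)}               # running sum of I(a_i = a) x_i r_i
for t in range(1, T + 1):
    x = rng.uniform(-1.0, 1.0, size=d)
    p_t = np.sqrt(np.log(max(t, 2)) / t)           # forced-exploration rate
    greedy = int(x @ np.linalg.solve(G[1], b[1]) > x @ np.linalg.solve(G[0], b[0]))
    a = 1 - greedy if rng.uniform() < p_t else greedy
    r = x @ beta[a] + rng.normal(scale=0.5)
    G[a] += np.outer(x, x)
    b[a] += x * r
    if t in (1000, 5000, 20000):
        err = max(np.linalg.norm(np.linalg.solve(G[c], b[c]) - beta[c], 1) for c in (0, 1))
        print(t, err)                               # the L1 estimation error should shrink
```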

B.2 Proof of Corollary 1

Since \widehat{\mu}_{t}(\bm{x}_{t},a)-\mu(\bm{x}_{t},a)=\bm{x}_{t}^{\top}\left(\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\right), by Hölder's inequality, we have

|μ^t(𝒙t,a)μ(𝒙t,a)|𝒙t𝜷^t(a)𝜷(a)1L𝒙𝜷^t(a)𝜷(a)1,\left|\widehat{\mu}_{t}(\bm{x}_{t},a)-\mu(\bm{x}_{t},a)\right|\leq\left\|\bm{x}_{t}\right\|_{\infty}\left\|\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\right\|_{1}\leq L_{\bm{x}}\left\|\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\right\|_{1},

from which it follows that

Pr{|μ^t(𝒙t,a)μ(𝒙t,a)|>ξ}Pr{L𝒙𝜷^t(a)𝜷(a)1>ξ}=Pr{𝜷^t(a)𝜷(a)1>ξ/L𝒙}.{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t}(\bm{x}_{t},a)-\mu(\bm{x}_{t},a)\right|>\xi\right\}\leq{\mbox{Pr}}\left\{L_{\bm{x}}\left\|\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\right\|_{1}>\xi\right\}={\mbox{Pr}}\left\{\left\|\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\right\|_{1}>\xi/L_{\bm{x}}\right\}.

By Lemma 4.1, we further have

Pr{|μ^t(𝒙t,a)μ(𝒙t,a)|>ξ}\displaystyle{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t}(\bm{x}_{t},a)-\mu(\bm{x}_{t},a)\right|>\xi\right\} Pr{𝜷^t(a)𝜷(a)1>ξL𝒙}\displaystyle\leq{\mbox{Pr}}\left\{\left\|\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\right\|_{1}>\frac{\xi}{L_{\bm{x}}}\right\}
2dexp{tpt2λ2(ξL𝒙d𝜷(a)2)28d2σ2L𝒙2}\displaystyle\leq 2d\exp\left\{-\frac{tp_{t}^{2}\lambda^{2}\left(\frac{\xi}{L_{\bm{x}}}-\sqrt{d}\left\|\bm{\beta}(a)\right\|_{2}\right)^{2}}{8d^{2}\sigma^{2}L_{\bm{x}}^{2}}\right\}
=2dexp{tpt2λ2(ξdL𝒙𝜷(a)2)28d2σ2L𝒙4}.\displaystyle=2d\exp\left\{-\frac{tp_{t}^{2}\lambda^{2}\left(\xi-\sqrt{d}L_{\bm{x}}\left\|\bm{\beta}(a)\right\|_{2}\right)^{2}}{8d^{2}\sigma^{2}L_{\bm{x}}^{4}}\right\}.

Note that, by the triangle inequality,

|μ^t(𝒙t,1)μ^t(𝒙t,0)Δ𝒙t|\displaystyle\left|\widehat{\mu}_{t}(\bm{x}_{t},1)-\widehat{\mu}_{t}(\bm{x}_{t},0)-\Delta_{\bm{x}_{t}}\right| =|{μ^t(𝒙t,1)μ(𝒙t,1)}{μ^t(𝒙t,0)μ(𝒙t,0)}|\displaystyle=\left|\left\{\widehat{\mu}_{t}(\bm{x}_{t},1)-\mu(\bm{x}_{t},1)\right\}-\left\{\widehat{\mu}_{t}(\bm{x}_{t},0)-\mu(\bm{x}_{t},0)\right\}\right|
|μ^t(𝒙t,1)μ(𝒙t,1)|+|μ^t(𝒙t,0)μ(𝒙t,0)|,\displaystyle\leq\left|\widehat{\mu}_{t}(\bm{x}_{t},1)-\mu(\bm{x}_{t},1)\right|+\left|\widehat{\mu}_{t}(\bm{x}_{t},0)-\mu(\bm{x}_{t},0)\right|,

thus for |μ^t(𝒙t,1)μ^t(𝒙t,0)Δ𝒙t|\left|\widehat{\mu}_{t}(\bm{x}_{t},1)-\widehat{\mu}_{t}(\bm{x}_{t},0)-\Delta_{\bm{x}_{t}}\right|, we have

Pr{|μ^t(𝒙t,1)μ^t(𝒙t,0)Δ𝒙t|>ξ}\displaystyle{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t}(\bm{x}_{t},1)-\widehat{\mu}_{t}(\bm{x}_{t},0)-\Delta_{\bm{x}_{t}}\right|>\xi\right\}
Pr{|μ^t(𝒙t,1)μ(𝒙t,1)|+|μ^t(𝒙t,0)μ(𝒙t,0)|>ξ}\displaystyle\leq{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t}(\bm{x}_{t},1)-\mu(\bm{x}_{t},1)\right|+\left|\widehat{\mu}_{t}(\bm{x}_{t},0)-\mu(\bm{x}_{t},0)\right|>\xi\right\}
Pr{|μ^t(𝒙t,1)μ(𝒙t,1)|>ξ/2}+Pr{|μ^t(𝒙t,0)μ(𝒙t,0)|>ξ/2}\displaystyle\leq{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t}(\bm{x}_{t},1)-\mu(\bm{x}_{t},1)\right|>\xi/2\right\}+{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t}(\bm{x}_{t},0)-\mu(\bm{x}_{t},0)\right|>\xi/2\right\}
2dexp{tpt2λ2(ξ/2dL𝒙𝜷(1)2)28d2σ2L𝒙4}+2dexp{tpt2λ2(ξ/2dL𝒙𝜷(0)2)28d2σ2L𝒙4}\displaystyle\leq 2d\exp\left\{-\frac{tp_{t}^{2}\lambda^{2}\left(\xi/2-\sqrt{d}L_{\bm{x}}\left\|\bm{\beta}(1)\right\|_{2}\right)^{2}}{8d^{2}\sigma^{2}L_{\bm{x}}^{4}}\right\}+2d\exp\left\{-\frac{tp_{t}^{2}\lambda^{2}\left(\xi/2-\sqrt{d}L_{\bm{x}}\left\|\bm{\beta}(0)\right\|_{2}\right)^{2}}{8d^{2}\sigma^{2}L_{\bm{x}}^{4}}\right\}
4dexp{(t1)pt12cξ},\displaystyle\leq 4d\exp\left\{-(t-1)p_{t-1}^{2}c_{\xi}\right\},

with

cξ=λ2[min{(ξ/2dL𝒙𝜷(1)2)2,(ξ/2dL𝒙𝜷(0)2)2}]8d2σ2L𝒙4,c_{\xi}=\frac{\lambda^{2}\left[\min\left\{\left(\xi/2-\sqrt{d}L_{\bm{x}}\left\|\bm{\beta}(1)\right\|_{2}\right)^{2},\left(\xi/2-\sqrt{d}L_{\bm{x}}\left\|\bm{\beta}(0)\right\|_{2}\right)^{2}\right\}\right]}{8d^{2}\sigma^{2}L_{\bm{x}}^{4}},

which is a constant that does not depend on the time t.

B.3 Proof of Theorem 1

The proof of Theorem 1 consists of two main parts, establishing the probability of exploration under UCB and TS, respectively; the probability of exploration under EG follows directly from its definition.

B.3.1 Proof for UCB

We first show the probability of exploration under UCB. The proof consists of three main steps, as follows:

  1. We first rewrite the target probability by its definition and express it as

     {\mbox{Pr}}\left\{\left|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right|<c_{t}\left|\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right|\right\}.

  2. Then, we establish the bound for the variance estimation such that

     c_{t}\left|\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right|\leq\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}.

  3. Lastly, we bound {\mbox{Pr}}\left\{\left|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right|<\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}\right\} using the result in Corollary 1.

Step 1: We rewrite the target probability by definition and decompose it into two parts.

Let Δ𝒙tμ(𝒙t,1)μ(𝒙t,0)\Delta_{\bm{x}_{t}}\equiv\mu(\bm{x}_{t},1)-\mu(\bm{x}_{t},0). Based on the definition of the probability of exploration and the form of the estimated optimal policy π^t(𝒙t)\widehat{\pi}_{t}(\bm{x}_{t}), we have

κt(𝒙t)=Pr{atπ^t(𝒙t)}=𝔼[𝕀{atπ^t(𝒙t)}]\displaystyle\kappa_{t}(\bm{x}_{t})={\mbox{Pr}}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})\}={\mathbb{E}}[\mathbb{I}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})\}] (B.8)
=\displaystyle= 𝔼[𝕀(at=0)|π^t(𝒙t)=1]Pr{π^t(𝒙t)=1}η0+𝔼[𝕀(at=1)|π^t(𝒙t)=0]Pr{π^t(𝒙t)=0}η1,\displaystyle\underbrace{{\mathbb{E}}[\mathbb{I}(a_{t}=0)|\widehat{\pi}_{t}(\bm{x}_{t})=1]{\mbox{Pr}}\{\widehat{\pi}_{t}(\bm{x}_{t})=1\}}_{\eta_{0}}+\underbrace{{\mathbb{E}}[\mathbb{I}(a_{t}=1)|\widehat{\pi}_{t}(\bm{x}_{t})=0]{\mbox{Pr}}\{\widehat{\pi}_{t}(\bm{x}_{t})=0\}}_{\eta_{1}},

where the expectation is taken with respect to the history t1\mathcal{H}_{t-1} before time point tt.

Next, we rewrite η0\eta_{0} and η1\eta_{1} using the estimated mean and variance components μ^t1(𝒙t,a)\widehat{\mu}_{t-1}(\bm{x}_{t},a) and σ^t1(𝒙t,a)\widehat{\sigma}_{t-1}(\bm{x}_{t},a), where a=0,1a=0,1. We focus on η0\eta_{0} first.

Given \widehat{\pi}_{t}(\bm{x}_{t})=1, i.e., \widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)>0, and based on the definition of the action taken in Lin-UCB, namely

at=𝕀{μ^t1(𝒙t,1)+ctσ^t1(𝒙t,1)>μ^t1(𝒙t,0)+ctσ^t1(𝒙t,0)},\displaystyle a_{t}=\mathbb{I}\left\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)+c_{t}\widehat{\sigma}_{t-1}(\bm{x}_{t},1)>\widehat{\mu}_{t-1}(\bm{x}_{t},0)+c_{t}\widehat{\sigma}_{t-1}(\bm{x}_{t},0)\right\},

the probability of choosing action 0 rather than action 1 is

𝔼[𝕀(at=0)|π^t(𝒙t)=1]\displaystyle{\mathbb{E}}[\mathbb{I}(a_{t}=0)|\widehat{\pi}_{t}(\bm{x}_{t})=1]
=\displaystyle= Pr{μ^t1(𝒙t,1)+ctσ^t1(𝒙t,1)<μ^t1(𝒙t,0)+ctσ^t1(𝒙t,0)|π^t(𝒙t)=1}\displaystyle{\mbox{Pr}}\left\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)+c_{t}\widehat{\sigma}_{t-1}(\bm{x}_{t},1)<\widehat{\mu}_{t-1}(\bm{x}_{t},0)+c_{t}\widehat{\sigma}_{t-1}(\bm{x}_{t},0)|\widehat{\pi}_{t}(\bm{x}_{t})=1\right\}
=\displaystyle= Pr{μ^t1(𝒙t,1)μ^t1(𝒙t,0)<ctσ^t1(𝒙t,0)ctσ^t1(𝒙t,1)|μ^t1(𝒙t,1)μ^t1(𝒙t,0)>0}\displaystyle{\mbox{Pr}}\left\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)<c_{t}\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-c_{t}\widehat{\sigma}_{t-1}(\bm{x}_{t},1)|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)>0\right\}
=\displaystyle= Pr[0<μ^t1(𝒙t,1)μ^t1(𝒙t,0)<ct{σ^t1(𝒙t,0)σ^t1(𝒙t,1)}]/Pr{π^t(𝒙t)=1},\displaystyle{\mbox{Pr}}\left[0<\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)<c_{t}\left\{\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right\}\right]/{\mbox{Pr}}\{\widehat{\pi}_{t}(\bm{x}_{t})=1\},

where the second equality rearranges the estimated mean and variance components, and the last equality follows from the definition of conditional probability. Combining this with (B.8), we have

η0=Pr[0<μ^t1(𝒙t,1)μ^t1(𝒙t,0)<ct{σ^t1(𝒙t,0)σ^t1(𝒙t,1)}].\displaystyle\eta_{0}={\mbox{Pr}}\left[0<\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)<c_{t}\left\{\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right\}\right].

Similarly, we have

η1=Pr[ct{σ^t1(𝒙t,0)σ^t1(𝒙t,1)}<μ^t1(𝒙t,1)μ^t1(𝒙t,0)<0].\displaystyle\eta_{1}={\mbox{Pr}}\left[c_{t}\left\{\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right\}<\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)<0\right].

Thus combined with Equation (B.8), we have

κt(𝒙t)=η0+η1=\displaystyle\kappa_{t}(\bm{x}_{t})=\eta_{0}+\eta_{1}= Pr[0<μ^t1(𝒙t,1)μ^t1(𝒙t,0)<ct{σ^t1(𝒙t,0)σ^t1(𝒙t,1)}]\displaystyle{\mbox{Pr}}\left[0<\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)<c_{t}\left\{\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right\}\right] (B.9)
+Pr[ct{σ^t1(𝒙t,0)σ^t1(𝒙t,1)}<μ^t1(𝒙t,1)μ^t1(𝒙t,0)<0]\displaystyle+{\mbox{Pr}}\left[c_{t}\left\{\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right\}<\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)<0\right]
=Pr{|μ^t1(𝒙t,1)μ^t1(𝒙t,0)|<ct|σ^t1(𝒙t,0)σ^t1(𝒙t,1)|}.\displaystyle={\mbox{Pr}}\left\{\left|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right|<c_{t}\left|\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right|\right\}.

The rest of the proof aims to bound the probability

Pr{|μ^t1(𝒙t,1)μ^t1(𝒙t,0)|<ct|σ^t1(𝒙t,0)σ^t1(𝒙t,1)|}.{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right|<c_{t}\left|\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right|\right\}.

Step 2: We next bound the variance estimators \widehat{\sigma}_{t-1}(\bm{x}_{t},0) and \widehat{\sigma}_{t-1}(\bm{x}_{t},1).

We first consider the quantity \widehat{\sigma}_{t-1}(\bm{x}_{t},0)=\sqrt{\bm{x}_{t}^{\top}\{\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)+\omega\bm{I}_{d}\}^{-1}\bm{x}_{t}}. Let \mathbf{v} be any d\times 1 vector; then the sample variance under action 0 is given by

𝒙t{𝑫t1(0)𝑫t1(0)+ω𝑰d}1𝒙t\displaystyle\bm{x}_{t}^{\top}\{\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)+\omega\bm{I}_{d}\}^{-1}\bm{x}_{t} =𝒙t22(𝒙t𝒙t2){𝑫t1(0)𝑫t1(0)+ω𝑰d}1(𝒙t𝒙t2)\displaystyle=\|\bm{x}_{t}\|_{2}^{2}(\frac{\bm{x}_{t}}{\|\bm{x}_{t}\|_{2}})^{\top}\{\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)+\omega\bm{I}_{d}\}^{-1}(\frac{\bm{x}_{t}}{\|\bm{x}_{t}\|_{2}}) (B.10)
𝒙t22max𝐯2=1𝐯{𝑫t1(0)𝑫t1(0)+ω𝑰d}1𝐯\displaystyle\leq\|\bm{x}_{t}\|_{2}^{2}\max\limits_{\|\mathbf{v}\|_{2}=1}\mathbf{v}^{\top}\{\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)+\omega\bm{I}_{d}\}^{-1}\mathbf{v}
𝒙t22λmax{(𝑫t1(0)𝑫t1(0)+ω𝑰d)1},\displaystyle\leq\|\bm{x}_{t}\|_{2}^{2}\lambda_{\max}\{(\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)+\omega\bm{I}_{d})^{-1}\},

where the first inequality follows by maximizing over all normalized vectors, and the second inequality follows from the definition of the largest eigenvalue. According to (B.10), combined with Assumption 4.1, we can further bound \widehat{\sigma}_{t-1}(\bm{x}_{t},0) by

𝒙t2λmax{(𝑫t1(0)𝑫t1(0)+ω𝑰d)1}L𝒙λmin{𝑫t1(0)𝑫t1(0)+ω𝑰d}.\|\bm{x}_{t}\|_{2}\sqrt{\lambda_{\max}\{(\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)+\omega\bm{I}_{d})^{-1}\}}\leq\frac{L_{\bm{x}}}{\sqrt{\lambda_{\min}\{\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)+\omega\bm{I}_{d}\}}}. (B.11)

It is immediate from (B.10) and (B.11) that

0<σ^t1(𝒙t,0)L𝒙λmin{𝑫t1(0)𝑫t1(0)+ω𝑰d}.\displaystyle 0<\widehat{\sigma}_{t-1}(\bm{x}_{t},0)\leq\frac{L_{\bm{x}}}{\sqrt{\lambda_{\min}\{\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)+\omega\bm{I}_{d}\}}}. (B.12)

Note that

λmin{𝑫t1(0)𝑫t1(0)+ω𝑰d}=λmin{𝑫t1(0)𝑫t1(0)}+ω,\lambda_{\min}\{\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)+\omega\bm{I}_{d}\}=\lambda_{\min}\{\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)\}+\omega,

combined with the fact that 𝑫t1(0)𝑫t1(0)=i=1t1(1ai)𝒙i𝒙i\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)=\sum_{i=1}^{t-1}(1-a_{i})\bm{x}_{i}\bm{x}_{i}^{\top}, then λmin{𝑫t1(0)𝑫t1(0)+ω𝑰d}\lambda_{\min}\{\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)+\omega\bm{I}_{d}\} can be further expressed as

λmin{𝑫t1(0)𝑫t1(0)+ω𝑰d}\displaystyle\lambda_{\min}\{\bm{D}_{t-1}(0)^{\top}\bm{D}_{t-1}(0)+\omega\bm{I}_{d}\} =(t1)λmin{1t1i=1t1(1ai)𝒙i𝒙i}+ω\displaystyle=(t-1)\lambda_{\min}\left\{\frac{1}{t-1}\sum_{i=1}^{t-1}(1-a_{i})\bm{x}_{i}\bm{x}_{i}^{\top}\right\}+\omega
>(t1)pt1λmin(𝚺)+ω>(t1)pt1λ+ω,\displaystyle>(t-1)p_{t-1}\lambda_{\min}\left(\bm{\Sigma}\right)+\omega>(t-1)p_{t-1}\lambda+\omega,

where the first inequality is owing to Assumption 4.2 , and the second inequality is owing to Assumption 4.1. This together with (B.12) gives the lower and upper bounds of σ^t1(𝒙t,0)\widehat{\sigma}_{t-1}(\bm{x}_{t},0) as

0<σ^t1(𝒙t,0)L𝒙(t1)pt1λ+ω<L𝒙(t1)pt1λ.0<\widehat{\sigma}_{t-1}(\bm{x}_{t},0)\leq\frac{L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda+\omega}}<\frac{L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}. (B.13)

Similarly we have

0<σ^t1(𝒙t,1)L𝒙(t1)pt1λ,0<\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\leq\frac{L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}, (B.14)

from which it follows that

ct|σ^t1(𝒙t,0)σ^t1(𝒙t,1)|ct(|σ^t1(𝒙t,0)|+|σ^t1(𝒙t,1)|)2ctL𝒙(t1)pt1λ.c_{t}\left|\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right|\leq c_{t}\left(\left|\widehat{\sigma}_{t-1}(\bm{x}_{t},0)\right|+\left|\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right|\right)\leq\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}.

Combining (B.9) and the above inequality, we conclude that

κt(𝒙t)\displaystyle\kappa_{t}(\bm{x}_{t}) Pr{|μ^t1(𝒙t,1)μ^t1(𝒙t,0)|<ct|σ^t1(𝒙t,0)σ^t1(𝒙t,1)|}\displaystyle\leq{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right|<c_{t}\left|\widehat{\sigma}_{t-1}(\bm{x}_{t},0)-\widehat{\sigma}_{t-1}(\bm{x}_{t},1)\right|\right\} (B.15)
Pr{|μ^t1(𝒙t,1)μ^t1(𝒙t,0)|<2ctL𝒙(t1)pt1λ}.\displaystyle\leq{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right|<\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}\right\}.

Step 3: Lastly, we aim to bound Pr{|μ^t1(𝒙t,1)μ^t1(𝒙t,0)|<2ctL𝒙(t1)pt1λ}{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right|<\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}\right\} using the result in Corollary 1.

For any ξ>0\xi>0, define E:={|μ^t(𝒙t,1)μ^t(𝒙t,0)Δ𝒙t|ξ}E:=\{\left|\widehat{\mu}_{t}(\bm{x}_{t},1)-\widehat{\mu}_{t}(\bm{x}_{t},0)-\Delta_{\bm{x}_{t}}\right|\leq\xi\}, which satisfies Pr{E}14dexp{tpt2cξ}{\mbox{Pr}}\left\{E\right\}\geq 1-4d\exp\left\{-tp_{t}^{2}c_{\xi}\right\} by Corollary 1. Then on the Event EE, we have

|μ^t1(𝒙t,1)μ^t1(𝒙t,0)|\displaystyle\left|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right| =|Δ𝒙t+{μ^t1(𝒙t,1)μ^t1(𝒙t,0)Δ𝒙t}|\displaystyle=\left|\Delta_{\bm{x}_{t}}+\left\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)-\Delta_{\bm{x}_{t}}\right\}\right|
|Δ𝒙t||μ^t(𝒙t,1)μ^t(𝒙t,0)Δ𝒙t||Δ𝒙t|ξ.\displaystyle\geq\left|\Delta_{\bm{x}_{t}}\right|-\left|\widehat{\mu}_{t}(\bm{x}_{t},1)-\widehat{\mu}_{t}(\bm{x}_{t},0)-\Delta_{\bm{x}_{t}}\right|\geq\left|\Delta_{\bm{x}_{t}}\right|-\xi.

Thus for the probability Pr{|μ^t1(𝒙t,1)μ^t1(𝒙t,0)|<2ctL𝒙(t1)pt1λ}{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right|<\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}\right\}, we have

κt(𝒙t)\displaystyle\kappa_{t}(\bm{x}_{t}) Pr{|μ^t1(𝒙t,1)μ^t1(𝒙t,0)|<2ctL𝒙(t1)pt1λ}\displaystyle\leq{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right|<\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}\right\} (B.16)
Pr{|μ^t1(𝒙t,1)μ^t1(𝒙t,0)|<2ctL𝒙(t1)pt1λ|E}+Pr{Ec}\displaystyle\leq{\mbox{Pr}}\left\{\left|\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right|<\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}\bigm{|}E\right\}+{\mbox{Pr}}\left\{E^{c}\right\}
Pr{|Δ𝒙t|ξ<2ctL𝒙(t1)pt1λ}+4dexp{(t1)pt12cξ}\displaystyle\leq{\mbox{Pr}}\left\{\left|\Delta_{\bm{x}_{t}}\right|-\xi<\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}\right\}+4d\exp\left\{-(t-1)p_{t-1}^{2}c_{\xi}\right\}
=Pr{|Δ𝒙t|<2ctL𝒙(t1)pt1λ+ξ}+4dexp{(t1)pt12cξ}.\displaystyle={\mbox{Pr}}\left\{\left|\Delta_{\bm{x}_{t}}\right|<\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}+\xi\right\}+4d\exp\left\{-(t-1)p_{t-1}^{2}c_{\xi}\right\}.

Since tp_{t}\rightarrow\infty as t\rightarrow\infty, for any constant \delta>\xi, there exists a sufficiently large t satisfying \frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}\leq\delta-\xi. Then by Assumption 4.3, there exists some constant \gamma such that

Pr{|Δ𝒙t|<2ctL𝒙(t1)pt1λ+ξ}=𝒪{(2ctL𝒙(t1)pt1λ+ξ)γ},{\mbox{Pr}}\left\{\left|\Delta_{\bm{x}_{t}}\right|<\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}+\xi\right\}=\mathcal{O}\left\{\left(\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}+\xi\right)^{\gamma}\right\},

i.e., there exists some constant C such that

{\mbox{Pr}}\left\{\left|\Delta_{\bm{x}_{t}}\right|<\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}+\xi\right\}\leq C\left(\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}+\xi\right)^{\gamma}.

Therefore, combined with the Equation (B.16), we have

κt(𝒙t)\displaystyle\kappa_{t}(\bm{x}_{t}) C(2ctL𝒙(t1)pt1λ+ξ)γ+4dexp{(t1)pt12cξ}.\displaystyle\leq C\left(\frac{2c_{t}L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}+\xi\right)^{\gamma}+4d\exp\left\{-(t-1)p_{t-1}^{2}c_{\xi}\right\}.

The proof is hence completed.
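For intuition about how the two terms of this bound trade off, the following sketch (illustration only) evaluates the bound for a few values of t under hypothetical constants C, \gamma, c_{\xi}, \xi, \lambda, and L_{\bm{x}}, with c_{t}=2 and p_{t}=\sqrt{\log t/t}; the first term decays towards the \xi^{\gamma} floor while the exponential term vanishes.

```python
import numpy as np

# All constants below are illustrative placeholders, not values from the paper.
C, gamma, d, L_x, lam, c_t, c_xi, xi = 1.0, 1.0, 3, 1.0, 0.5, 2.0, 1.0, 0.05
for t in (10**2, 10**3, 10**4, 10**5):
    p_t = np.sqrt(np.log(t) / t)
    term1 = C * (2 * c_t * L_x / np.sqrt((t - 1) * p_t * lam) + xi) ** gamma
    term2 = 4 * d * np.exp(-(t - 1) * p_t**2 * c_xi)
    print(t, term1 + term2)   # the bound on kappa_t(x_t) shrinks as t grows
```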

B.3.2 Proof for TS

We next bound the probability of exploration under TS, in three main steps:

  1. We first define an event E:=\{|\widehat{\mu}_{t}(\bm{x}_{t},1)-\widehat{\mu}_{t}(\bm{x}_{t},0)-\Delta_{\bm{x}_{t}}|\leq\xi\} for any 0<\xi<|\Delta_{\bm{x}_{t}}|/2, on which the estimated difference between the mean functions is close to the true difference; by Corollary 1, {\mbox{Pr}}\{E\}\geq 1-4d\exp\{-tp_{t}^{2}c_{\xi}\}.

  2. Next, we bound the probability of exploration on the event E.

  3. Lastly, we combine the results in the previous two steps to obtain the unconditional probability of exploration.

Step 1: For any 0<ξ<|Δ𝒙t|/20<\xi<\left|\Delta_{\bm{x}_{t}}\right|/2, define E:={|μ^t(𝒙t,1)μ^t(𝒙t,0)Δ𝒙t|ξ}E:=\{\left|\widehat{\mu}_{t}(\bm{x}_{t},1)-\widehat{\mu}_{t}(\bm{x}_{t},0)-\Delta_{\bm{x}_{t}}\right|\leq\xi\}, which satisfies Pr{E}14dexp{(t1)pt12cξ}{\mbox{Pr}}\left\{E\right\}\geq 1-4d\exp\left\{-(t-1)p_{t-1}^{2}c_{\xi}\right\} by Corollary 1. Then for the probability Pr{atπ^t(𝒙t)}{\mbox{Pr}}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})\}, we have

Pr{atπ^t(𝒙t)}Pr{atπ^t(𝒙t)|E}+Pr{Ec}Pr{atπ^t(𝒙t)|E}+4dexp{(t1)pt12cξ}.{\mbox{Pr}}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})\}\leq{\mbox{Pr}}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})|E\}+{\mbox{Pr}}\{E^{c}\}\leq{\mbox{Pr}}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})|E\}+4d\exp\left\{-(t-1)p_{t-1}^{2}c_{\xi}\right\}. (B.17)

Without loss of generality, we assume \Delta_{\bm{x}_{t}}>0; then E=\{0<\Delta_{\bm{x}_{t}}-\xi\leq\widehat{\mu}_{t}(\bm{x}_{t},1)-\widehat{\mu}_{t}(\bm{x}_{t},0)\leq\Delta_{\bm{x}_{t}}+\xi\}, which implies \widehat{\pi}_{t}(\bm{x}_{t})=1.

Using the law of iterated expectations, based on the definition of the probability of exploration and the form of the estimated optimal policy π^t(𝒙t)\widehat{\pi}_{t}(\bm{x}_{t}), on the event EE, we have

Pr{at1|E}=𝔼[𝕀{at1}|E]=𝔼(𝔼[𝕀{at=0}|μ^t1(𝒙t,1),μ^t1(𝒙t,0)]|E).\displaystyle{\mbox{Pr}}\{a_{t}\not=1|E\}={\mathbb{E}}[\mathbb{I}\{a_{t}\not=1\}|E]={\mathbb{E}}\left({\mathbb{E}}[\mathbb{I}\{a_{t}=0\}|\widehat{\mu}_{t-1}(\bm{x}_{t},1),\widehat{\mu}_{t-1}(\bm{x}_{t},0)]|E\right). (B.18)

Step 2: Next, we focus on deriving the bound of 𝔼[𝕀{at=0}|μ^t1(𝒙t,1),μ^t1(𝒙t,0)]{\mathbb{E}}[\mathbb{I}\{a_{t}=0\}|\widehat{\mu}_{t-1}(\bm{x}_{t},1),\widehat{\mu}_{t-1}(\bm{x}_{t},0)] on the Event EE.

Recalling the bandit mechanism of TS, we have at=𝕀{𝒙t𝜷t(1)>𝒙t𝜷t(0)}a_{t}=\mathbb{I}\left\{\bm{x}_{t}^{\top}\bm{\beta}_{t}(1)>\bm{x}_{t}^{\top}\bm{\beta}_{t}(0)\right\}, where 𝜷t(a)\bm{\beta}_{t}(a) is drawn from the posterior distribution of 𝜷(a)\bm{\beta}(a) given by

𝒩d[𝜷^t1(a),ρ2{𝑫t1(a)𝑫t1(a)+ω𝑰d}1].\mathcal{N}_{d}[\widehat{\bm{\beta}}_{t-1}(a),\rho^{2}\{\bm{D}_{t-1}(a)^{\top}\bm{D}_{t-1}(a)+\omega\bm{I}_{d}\}^{-1}].

From the posterior distributions and the definitions of μ^t1(𝒙t,a)\widehat{\mu}_{t-1}(\bm{x}_{t},a) and σ^t1(𝒙t,a)\widehat{\sigma}_{t-1}(\bm{x}_{t},a), we have

𝒙t𝜷t(a)𝒩[𝒙t𝜷^t1(a),ρ2𝒙t{𝑫t1(a)𝑫t1(a)+ω𝑰d}1𝒙t],\bm{x}_{t}^{\top}\bm{\beta}_{t}(a)\sim\mathcal{N}[\bm{x}_{t}^{\top}\widehat{\bm{\beta}}_{t-1}(a),\rho^{2}\bm{x}_{t}^{\top}\{\bm{D}_{t-1}(a)^{\top}\bm{D}_{t-1}(a)+\omega\bm{I}_{d}\}^{-1}\bm{x}_{t}],

that is,

𝒙t𝜷t(a)𝒩[μ^t1(𝒙t,a),ρ2σ^t1(𝒙t,a)2].\bm{x}_{t}^{\top}\bm{\beta}_{t}(a)\sim\mathcal{N}[\widehat{\mu}_{t-1}(\bm{x}_{t},a),\rho^{2}\widehat{\sigma}_{t-1}(\bm{x}_{t},a)^{2}].

Notice that 𝒙t𝜷t(1)\bm{x}_{t}^{\top}\bm{\beta}_{t}(1) and 𝒙t𝜷t(0)\bm{x}_{t}^{\top}\bm{\beta}_{t}(0) are drawn independently, thus,

𝒙t𝜷t(1)𝒙t𝜷t(0)𝒩[μ^t1(𝒙t,1)μ^t1(𝒙t,0),ρ2{σ^t1(𝒙t,1)2+σ^t1(𝒙t,0)2}].\bm{x}_{t}^{\top}\bm{\beta}_{t}(1)-\bm{x}_{t}^{\top}\bm{\beta}_{t}(0)\sim\mathcal{N}[\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0),\rho^{2}\{\widehat{\sigma}_{t-1}(\bm{x}_{t},1)^{2}+\widehat{\sigma}_{t-1}(\bm{x}_{t},0)^{2}\}]. (B.19)

Recall that a_{t}=\mathbb{I}\left\{\bm{x}_{t}^{\top}\bm{\beta}_{t}(1)>\bm{x}_{t}^{\top}\bm{\beta}_{t}(0)\right\} in TS, with the distribution of the difference given in (B.19). Therefore, on the Event E we have

𝔼[𝕀{at=0}|μ^t1(𝒙t,1),μ^t1(𝒙t,0)]\displaystyle{\mathbb{E}}[\mathbb{I}\{a_{t}=0\}|\widehat{\mu}_{t-1}(\bm{x}_{t},1),\widehat{\mu}_{t-1}(\bm{x}_{t},0)] (B.20)
=Pr{𝒙t𝜷t(1)𝒙t𝜷t(0)<0|μ^t1(𝒙t,1),μ^t1(𝒙t,0)}\displaystyle={\mbox{Pr}}\left\{\bm{x}_{t}^{\top}\bm{\beta}_{t}(1)-\bm{x}_{t}^{\top}\bm{\beta}_{t}(0)<0|\widehat{\mu}_{t-1}(\bm{x}_{t},1),\widehat{\mu}_{t-1}(\bm{x}_{t},0)\right\}
=\displaystyle= Φ[{μ^t1(𝒙t,1)μ^t1(𝒙t,0)}/ρ2{σ^t1(𝒙t,1)2+σ^t1(𝒙t,0)2}]\displaystyle\Phi[-\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\}/\sqrt{\rho^{2}\{\widehat{\sigma}_{t-1}(\bm{x}_{t},1)^{2}+\widehat{\sigma}_{t-1}(\bm{x}_{t},0)^{2}\}}]
=\displaystyle= 1Φ[{μ^t1(𝒙t,1)μ^t1(𝒙t,0)}/ρ2{σ^t1(𝒙t,1)2+σ^t1(𝒙t,0)2}],\displaystyle 1-\Phi[\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\}/\sqrt{\rho^{2}\{\widehat{\sigma}_{t-1}(\bm{x}_{t},1)^{2}+\widehat{\sigma}_{t-1}(\bm{x}_{t},0)^{2}\}}],

where \Phi(\cdot) is the cumulative distribution function of the standard normal distribution. Denote \widehat{z}_{t}\equiv\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\}/\sqrt{\rho^{2}\{\widehat{\sigma}_{t-1}(\bm{x}_{t},1)^{2}+\widehat{\sigma}_{t-1}(\bm{x}_{t},0)^{2}\}}, which is positive since \widehat{\pi}_{t}(\bm{x}_{t})=\mathbb{I}\left\{\widehat{\mu}_{t-1}(\bm{x},1)>\widehat{\mu}_{t-1}(\bm{x},0)\right\}=1, i.e., \widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)>0. By applying the tail bound for the normal distribution in Section 7.1 of Feller (2008), (B.20) can be bounded as

𝔼[𝕀{at=0}|μ^t1(𝒙t,1),μ^t1(𝒙t,0)]exp(z^t2/2).\displaystyle{\mathbb{E}}[\mathbb{I}\{a_{t}=0\}|\widehat{\mu}_{t-1}(\bm{x}_{t},1),\widehat{\mu}_{t-1}(\bm{x}_{t},0)]\leq\exp(-\widehat{z}_{t}^{2}/2).
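As a quick numerical sanity check (illustration only, not part of the proof) of the Gaussian tail bound just invoked, namely 1-\Phi(z)\leq\exp(-z^{2}/2) for z>0:

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(0.01, 5.0, 50)
assert np.all(1.0 - norm.cdf(z) <= np.exp(-z**2 / 2.0))  # Feller-type tail bound
print("tail bound holds on the grid")
```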

This yields that on the Event EE,

𝔼[𝕀{at=0}|μ^t1(𝒙t,1),μ^t1(𝒙t,0)]\displaystyle{\mathbb{E}}[\mathbb{I}\{a_{t}=0\}|\widehat{\mu}_{t-1}(\bm{x}_{t},1),\widehat{\mu}_{t-1}(\bm{x}_{t},0)]
𝔼{exp(z^t2/2)}=𝔼{exp({μ^t1(𝒙t,1)μ^t1(𝒙t,0)}22ρ2{σ^t1(𝒙t,1)2+σ^t1(𝒙t,0)2})}.\displaystyle\leq{\mathbb{E}}\left\{\exp(-\widehat{z}_{t}^{2}/2)\right\}={\mathbb{E}}\left\{\exp\left(-\frac{\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\}^{2}}{2\rho^{2}\{\widehat{\sigma}_{t-1}(\bm{x}_{t},1)^{2}+\widehat{\sigma}_{t-1}(\bm{x}_{t},0)^{2}\}}\right)\right\}.

Using similar arguments to those in the proof of (B.13), i.e., \widehat{\sigma}_{t-1}(\bm{x}_{t},0)\leq\frac{L_{\bm{x}}}{\sqrt{(t-1)p_{t-1}\lambda}}, we have

σ^t1(𝒙t,1)2+σ^t1(𝒙t,0)22L𝒙2(t1)pt1λ.\widehat{\sigma}_{t-1}(\bm{x}_{t},1)^{2}+\widehat{\sigma}_{t-1}(\bm{x}_{t},0)^{2}\leq\frac{2L_{\bm{x}}^{2}}{{{(t-1)p_{t-1}\lambda}}}.

Therefore, combining the above two equations leads to

𝔼[𝕀{at=0}|μ^t1(𝒙t,1),μ^t1(𝒙t,0)]𝔼{exp({μ^t1(𝒙t,1)μ^t1(𝒙t,0)}2(t1)pt1λ4ρ2L𝒙2)},{\mathbb{E}}[\mathbb{I}\{a_{t}=0\}|\widehat{\mu}_{t-1}(\bm{x}_{t},1),\widehat{\mu}_{t-1}(\bm{x}_{t},0)]\leq{\mathbb{E}}\left\{\exp\left(-\frac{\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\}^{2}{{(t-1)p_{t-1}\lambda}}}{4\rho^{2}L_{\bm{x}}^{2}}\right)\right\},

where the expectation is taken with respect to history t1\mathcal{H}_{t-1}.

Note that on the Event EE, we have

{μ^t1(𝒙t,1)μ^t1(𝒙t,0)}2(|Δ𝒙t|ξ)2,\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)\}^{2}\geq\left(\left|\Delta_{\bm{x}_{t}}\right|-\xi\right)^{2},

from which it follows that, on the Event E,

𝔼[𝕀{at=0}|μ^t1(𝒙t,1),μ^t1(𝒙t,0)]\displaystyle{\mathbb{E}}[\mathbb{I}\{a_{t}=0\}|\widehat{\mu}_{t-1}(\bm{x}_{t},1),\widehat{\mu}_{t-1}(\bm{x}_{t},0)] 𝔼{exp((|Δ𝒙t|ξ)2(t1)pt1λ4ρ2L𝒙2)}\displaystyle\leq{\mathbb{E}}\left\{\exp\left(-\frac{\left(\left|\Delta_{\bm{x}_{t}}\right|-\xi\right)^{2}{{(t-1)p_{t-1}\lambda}}}{4\rho^{2}L_{\bm{x}}^{2}}\right)\right\} (B.21)
exp((|Δ𝒙t|ξ)2(t1)pt1λ4ρ2L𝒙2).\displaystyle\leq\exp\left(-\frac{\left(\left|\Delta_{\bm{x}_{t}}\right|-\xi\right)^{2}{{(t-1)p_{t-1}\lambda}}}{4\rho^{2}L_{\bm{x}}^{2}}\right).

Step 3: Combining Equations (B.17), (B.18), and (B.21), we have

Pr{atπ^t(𝒙t)}\displaystyle{\mbox{Pr}}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})\} (B.17)Pr{atπ^t(𝒙t)|E}+4dexp{(t1)pt12cξ}\displaystyle\overset{\eqref{thm1_pf_ts2}}{\leq}{\mbox{Pr}}\{a_{t}\not=\widehat{\pi}_{t}(\bm{x}_{t})|E\}+4d\exp\left\{-(t-1)p_{t-1}^{2}c_{\xi}\right\}
(B.18)𝔼(𝔼[𝕀{at=0}|μ^t1(𝒙t,1),μ^t1(𝒙t,0)]|E)+4dexp{(t1)pt12cξ}\displaystyle\overset{\eqref{eq:proofPETS}}{\leq}{\mathbb{E}}\left({\mathbb{E}}[\mathbb{I}\{a_{t}=0\}|\widehat{\mu}_{t-1}(\bm{x}_{t},1),\widehat{\mu}_{t-1}(\bm{x}_{t},0)]|E\right)+4d\exp\left\{-(t-1)p_{t-1}^{2}c_{\xi}\right\}
(B.21)exp((|Δ𝒙t|ξ)2(t1)pt1λ4ρ2L𝒙2)+4dexp{(t1)pt12cξ}.\displaystyle\overset{\eqref{thm1_pf_ts4}}{\leq}\exp\left(-\frac{\left(\left|\Delta_{\bm{x}_{t}}\right|-\xi\right)^{2}{{(t-1)p_{t-1}\lambda}}}{4\rho^{2}L_{\bm{x}}^{2}}\right)+4d\exp\left\{-(t-1)p_{t-1}^{2}c_{\xi}\right\}.

The proof is hence completed.

B.4 Proof of Theorem 2

We detail the proof of Theorem 2 in this section. Using similar arguments to those in (B.1) in the proof of Lemma 4.1, we can rewrite \sqrt{t}\{\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\} as

t{𝜷^t(a)𝜷(a)}=\displaystyle\sqrt{t}\{\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\}= {1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}1𝝃{1ti=1t𝕀(ai=a)𝒙iei}𝜼1\displaystyle\underbrace{\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}}_{\bm{\xi}}\underbrace{\left\{\frac{1}{\sqrt{t}}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}e_{i}\right\}}_{\bm{\eta}_{1}}
{1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}1ωt𝜷(a)𝜼2.\displaystyle-\underbrace{\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\frac{\omega}{\sqrt{t}}\bm{\beta}(a)}_{\bm{\eta}_{2}}.

Our goal is to prove that \sqrt{t}\{\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\} is asymptotically normal. The proof generalizes Theorem 3.1 in Chen et al. (2020) to the commonly used bandit algorithms considered here, including UCB, TS, and EG. We complete the proof in the following four steps:

  • Step 1: Prove that \bm{\eta}_{1}=({1}/{\sqrt{t}})\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}e_{i}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}_{d}\left(\bm{0}_{d},G_{a}\right), where G_{a} is the variance matrix to be specified shortly.

  • Step 2: Prove that 𝝃={(1/t)i=1t𝕀(ai=a)𝒙i𝒙i+(ω/t)𝑰d}1pσa2Ga1\bm{\xi}=\left\{(1/t)\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+(\omega/t)\bm{I}_{d}\right\}^{-1}\stackrel{{\scriptstyle p}}{{\longrightarrow}}\sigma_{a}^{2}G_{a}^{-1}, where σa2=𝔼(et2|at=a)\sigma^{2}_{a}={\mathbb{E}}(e_{t}^{2}|a_{t}=a) for a=0,1a=0,1.

  • Step 3: Prove that 𝜼2={(1/t)i=1t𝕀(ai=a)𝒙i𝒙i+(ω/t)𝑰d}1(ω/t)𝜷(a)𝑝𝟎d\bm{\eta}_{2}=\left\{(1/t)\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+(\omega/t)\bm{I}_{d}\right\}^{-1}({\omega}/{\sqrt{t}})\bm{\beta}(a)\overset{p}{\longrightarrow}\bm{0}_{d}.

  • Step 4: Combine above results in steps 1-3 using Slutsky’s theorem.

Step 1: We first focus on proving that \bm{\eta}_{1}=({1}/{\sqrt{t}})\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}e_{i}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}_{d}\left(\bm{0}_{d},G_{a}\right). Using the Cramér-Wold device, it suffices to show that for any \bm{v}\in\mathbb{R}^{d},

\bm{\eta}_{1}(\bm{v})\equiv\frac{1}{\sqrt{t}}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}e_{i}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}\left(0,\bm{v}^{\top}G_{a}\bm{v}\right).

Note that {\mathbb{E}}\left(\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}e_{i}\mid\mathcal{H}_{i-1}\right)={\mathbb{E}}\left(\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\mid\mathcal{H}_{i-1}\right){\mathbb{E}}\left(e_{i}\mid\mathcal{H}_{i-1},a_{i}=a\right)=0, so \mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}e_{i} is a martingale difference sequence. We next show the asymptotic normality of \bm{\eta}_{1}(\bm{v}) using the martingale central limit theorem, in the following two parts: i) check the conditional Lindeberg condition; ii) derive the limit of the conditional variance.

Firstly, we check the conditional Lindeberg condition. For any δ>0\delta>0, denote

ψ=i=1t𝔼[1t𝕀(ai=a)(𝒗𝒙i)2ei2𝕀{|1t𝕀(ai=a)𝒗𝒙iei|>δ}i1].\displaystyle\psi=\sum_{i=1}^{t}\mathbb{E}\left[\frac{1}{t}\mathbb{I}(a_{i}=a)\left(\bm{v}^{\top}\bm{x}_{i}\right)^{2}e_{i}^{2}\mathbb{I}\left\{\left|\frac{1}{\sqrt{t}}\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}e_{i}\right|>\delta\right\}\mid\mathcal{H}_{i-1}\right]. (B.22)

Notice that (𝒗𝒙i)2𝒗22L𝒙2d\left(\bm{v}^{\top}\bm{x}_{i}\right)^{2}\leq\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d, we have

𝕀{|1t𝕀(ai=a)𝒗𝒙iei|>δ}𝕀{𝕀(ai=a)ei2𝒗22L𝒙2d>tδ2}=𝕀{𝕀(ai=a)ei2>tδ2𝒗22L𝒙2d}.\mathbb{I}\left\{\left|\frac{1}{\sqrt{t}}\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}e_{i}\right|>\delta\right\}\leq\mathbb{I}\left\{\mathbb{I}(a_{i}=a)e_{i}^{2}\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d>t\delta^{2}\right\}=\mathbb{I}\left\{\mathbb{I}(a_{i}=a)e_{i}^{2}>\frac{t\delta^{2}}{\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d}\right\}.

Combining this with (B.22), we obtain that

ψ𝒗22L𝒙2dti=1t𝔼(𝕀(ai=a)ei2𝕀{𝕀(ai=a)ei2>tδ2𝒗22L𝒙2d}i1),\displaystyle\psi\leq\frac{\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d}{t}\sum_{i=1}^{t}\mathbb{E}\left(\mathbb{I}(a_{i}=a)e_{i}^{2}\mathbb{I}\left\{\mathbb{I}(a_{i}=a)e_{i}^{2}>\frac{t\delta^{2}}{\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d}\right\}\mid\mathcal{H}_{i-1}\right), (B.23)

where the right hand side equals

𝒗22L𝒙2dti=1t𝔼(𝕀(ai=a)i1)𝔼(ei2𝕀{𝕀(ai=a)ei2>δ2t𝒗22L𝒙2d}i1).\displaystyle\frac{\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d}{t}\sum_{i=1}^{t}\mathbb{E}\left(\mathbb{I}(a_{i}=a)\mid\mathcal{H}_{i-1}\right)\mathbb{E}\left(e_{i}^{2}\mathbb{I}\left\{\mathbb{I}(a_{i}=a)e_{i}^{2}>\frac{\delta^{2}t}{\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d}\right\}\mid\mathcal{H}_{i-1}\right).

Then, we can further write (B.23) as

ψ𝒗22L𝒙2dti=1t𝔼(ei(a)2𝕀{ei(a)2>δ2t𝒗22L𝒙2d}i1).\displaystyle\psi\leq\frac{\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d}{t}\sum_{i=1}^{t}\mathbb{E}\left(e_{i(a)}^{2}\mathbb{I}\left\{e_{i(a)}^{2}>\frac{\delta^{2}t}{\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d}\right\}\mid\mathcal{H}_{i-1}\right). (B.24)

where e_{i(a)}=e_{i} when a_{i}=a and 0 otherwise. Since the e_{i} conditional on a_{i} are i.i.d. over i, the right-hand side of the above inequality equals

𝒗22L𝒙2d𝔼(e2𝕀{e2>tδ2𝒗22L𝒙2d}),\displaystyle\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d\mathbb{E}\left(e^{2}\mathbb{I}\left\{e^{2}>\frac{t\delta^{2}}{\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d}\right\}\right),

where e is a random variable distributed as e_{i}\mid\mathcal{H}_{i-1}. Note that e^{2}\mathbb{I}\left\{e^{2}>{t\delta^{2}}/({\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d})\right\} is dominated by e^{2} with {\mathbb{E}}e^{2}<\infty and converges to 0 as t\rightarrow\infty. Then, by the Dominated Convergence Theorem, the right-hand side of (B.24) can be further bounded as

ψ𝒗22L𝒙2dti=1t𝔼(e2𝕀{e2>tδ2𝒗22L𝒙2d})0, as t.\psi\leq\frac{\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d}{t}\sum_{i=1}^{t}\mathbb{E}\left(e^{2}\mathbb{I}\left\{e^{2}>\frac{t\delta^{2}}{\|\bm{v}\|_{2}^{2}L_{\bm{x}}^{2}d}\right\}\right)\rightarrow 0,\text{ as }t\rightarrow\infty.

Therefore, the conditional Lindeberg condition holds.

Secondly, we derive the limit of the conditional variance. Notice that

1ti=1t𝔼[𝕀(ai=a)(𝒗𝒙i)2ei2i1]\displaystyle\frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left[\mathbb{I}(a_{i}=a)\left(\bm{v}^{\top}\bm{x}_{i}\right)^{2}e_{i}^{2}\mid\mathcal{H}_{i-1}\right] =1ti=1t𝔼{𝔼[𝕀(ai=a)(𝒗𝒙i)2ei2ai,𝒙i]i1}\displaystyle=\frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left\{\mathbb{E}\left[\mathbb{I}(a_{i}=a)\left(\bm{v}^{\top}\bm{x}_{i}\right)^{2}e_{i}^{2}\mid a_{i},\bm{x}_{i}\right]\mid\mathcal{H}_{i-1}\right\}
=1ti=1t𝔼{𝕀(ai=a)(𝒗𝒙i)2𝔼[ei2ai=a,𝒙i]i1}.\displaystyle=\frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left\{\mathbb{I}(a_{i}=a)\left(\bm{v}^{\top}\bm{x}_{i}\right)^{2}\mathbb{E}\left[e_{i}^{2}\mid a_{i}=a,\bm{x}_{i}\right]\mid\mathcal{H}_{i-1}\right\}.

Since e_{i} is independent of \mathcal{H}_{i-1} and \bm{x}_{i} given a_{i}, and \mathbb{E}\left[e_{i}^{2}\mid a_{i}=a\right]=\sigma_{a}^{2}, we have

1ti=1t𝔼[𝕀(ai=a)(𝒗𝒙i)2ei2i1]\displaystyle\frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left[\mathbb{I}(a_{i}=a)\left(\bm{v}^{\top}\bm{x}_{i}\right)^{2}e_{i}^{2}\mid\mathcal{H}_{i-1}\right] =1ti=1t𝔼[𝕀(ai=a)(𝒗𝒙i)2𝔼[ei2ai=a]i1]\displaystyle=\frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left[\mathbb{I}(a_{i}=a)\left(\bm{v}^{\top}\bm{x}_{i}\right)^{2}\mathbb{E}\left[e_{i}^{2}\mid a_{i}=a\right]\mid\mathcal{H}_{i-1}\right]
=1ti=1t𝔼[𝕀(ai=a)𝒗𝒙i𝒙i𝒗σa2i1]\displaystyle=\frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left[\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\sigma_{a}^{2}\mid\mathcal{H}_{i-1}\right]
=1ti=1tσa2𝔼[𝕀(ai=a)𝒗𝒙i𝒙i𝒗i1],\displaystyle=\frac{1}{t}\sum_{i=1}^{t}\sigma_{a}^{2}\mathbb{E}\left[\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\mid\mathcal{H}_{i-1}\right],

where

𝔼[𝕀(ai=a)𝒗𝒙i𝒙i𝒗i1]=\displaystyle\mathbb{E}\left[\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\mid\mathcal{H}_{i-1}\right]= 𝔼{𝔼[𝕀(aiπ(𝒙i))𝕀(π(𝒙i)a)𝒗𝒙i𝒙i𝒗𝒙i]i1}\displaystyle\mathbb{E}\{\mathbb{E}\left[\mathbb{I}(a_{i}\neq\pi^{*}(\bm{x}_{i}))\mathbb{I}(\pi^{*}(\bm{x}_{i})\neq a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\mid\bm{x}_{i}\right]\mid\mathcal{H}_{i-1}\}
+\displaystyle+ 𝔼{𝔼[𝕀(ai=π(𝒙i))𝕀(π(𝒙i)=a)𝒗𝒙i𝒙i𝒗𝒙i]i1}\displaystyle\mathbb{E}\{\mathbb{E}\left[\mathbb{I}(a_{i}=\pi^{*}(\bm{x}_{i}))\mathbb{I}(\pi^{*}(\bm{x}_{i})=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\mid\bm{x}_{i}\right]\mid\mathcal{H}_{i-1}\}
=\displaystyle= 𝔼{𝔼[𝕀(aiπ(𝒙i))𝒙i,i1]𝕀(π(𝒙i)a)𝒗𝒙i𝒙i𝒗}\displaystyle\mathbb{E}\{\mathbb{E}\left[\mathbb{I}(a_{i}\neq\pi^{*}(\bm{x}_{i}))\mid\bm{x}_{i},\mathcal{H}_{i-1}\right]\mathbb{I}(\pi^{*}(\bm{x}_{i})\neq a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\}
+\displaystyle+ 𝔼{𝔼[𝕀(ai=π(𝒙i))𝒙i,i1]𝕀(π(𝒙i)=a)𝒗𝒙i𝒙i𝒗}.\displaystyle\mathbb{E}\{\mathbb{E}\left[\mathbb{I}(a_{i}=\pi^{*}(\bm{x}_{i}))\mid\bm{x}_{i},\mathcal{H}_{i-1}\right]\mathbb{I}(\pi^{*}(\bm{x}_{i})=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\}.

Here, the first equation comes from iterated expectations over $\bm{x}_{i}$, the fact that $\mathbb{I}(\pi^{*}(\bm{x}_{i})\neq a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}$ is a constant given $\bm{x}_{i}$ and $\mathcal{H}_{i-1}$, and the decomposition

𝕀(ai=a)=𝕀(ai=aπ(𝒙i))+𝕀(ai=a=π(𝒙i))=𝕀(aiπ(𝒙i))𝕀(π(𝒙i)a)+𝕀(ai=π(𝒙i))𝕀(π(𝒙i)=a),\displaystyle\begin{aligned} \mathbb{I}(a_{i}=a)&=\mathbb{I}(a_{i}=a\neq\pi^{*}(\bm{x}_{i}))+\mathbb{I}(a_{i}=a=\pi^{*}(\bm{x}_{i}))\\ &=\mathbb{I}(a_{i}\neq\pi^{*}(\bm{x}_{i}))\mathbb{I}(\pi^{*}(\bm{x}_{i})\neq a)+\mathbb{I}(a_{i}=\pi^{*}(\bm{x}_{i}))\mathbb{I}(\pi^{*}(\bm{x}_{i})=a),\end{aligned}

and the second equation is owing to the fact that 𝕀(π(𝒙i)=a)𝒗𝒙i𝒙i𝒗\mathbb{I}(\pi^{*}(\bm{x}_{i})=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v} is a constant given 𝒙i\bm{x}_{i} and independent of i1\mathcal{H}_{i-1}. Define

νi(𝒙i,i1)Pr{aiπ(𝒙i)|𝒙i,i1}=𝔼[𝕀{aiπ(𝒙i)}|𝒙i,i1],\nu_{i}\left(\bm{x}_{i},\mathcal{H}_{i-1}\right)\equiv\operatorname{Pr}\left\{a_{i}\neq\pi^{*}(\bm{x}_{i})\right|\bm{x}_{i},\mathcal{H}_{i-1}\}=\mathbb{E}\left[\mathbb{I}\left\{a_{i}\neq\pi^{*}(\bm{x}_{i})\right\}|\bm{x}_{i},\mathcal{H}_{i-1}\right], (B.25)

then the conditional variance can be expressed as

1ti=1t𝔼[𝕀(ai=a)𝒗𝒙i𝒙i𝒗i1]=1ti=1t𝔼{νi(𝒙i,i1)𝕀(π(𝒙i)a)𝒗𝒙i𝒙i𝒗}+1ti=1t𝔼[{1νi(𝒙i,i1)}𝕀(π(𝒙i)=a)𝒗𝒙i𝒙i𝒗],\displaystyle\begin{aligned} \frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left[\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\mid\mathcal{H}_{i-1}\right]&=\frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\{\nu_{i}\left(\bm{x}_{i},\mathcal{H}_{i-1}\right)\mathbb{I}(\pi^{*}(\bm{x}_{i})\neq a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\}\\ &+\frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left[\{1-\nu_{i}\left(\bm{x}_{i},\mathcal{H}_{i-1}\right)\}\mathbb{I}(\pi^{*}(\bm{x}_{i})=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\right],\end{aligned}

which can be expressed as

1ti=1tνi(𝒙,i1)𝕀{𝒙𝜷(a)<𝒙𝜷(1a)}𝒗𝒙𝒙𝒗𝑑P𝒳+1ti=1t{1νi(𝒙,i1)}𝕀{𝒙𝜷(a)𝒙𝜷(1a)}𝒗𝒙𝒙𝒗𝑑P𝒳=1ti=1tνi(𝒙,i1)𝕀{𝒙𝜷(a)<𝒙𝜷(1a)}𝒗𝒙𝒙𝒗dP𝒳+{11ti=1tνi(𝒙,i1)}𝕀{𝒙𝜷(a)𝒙𝜷(1a)}𝒗𝒙𝒙𝒗𝑑P𝒳.\displaystyle\begin{aligned} &\frac{1}{t}\sum_{i=1}^{t}\int\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\mathbb{I}\{\bm{x}^{\top}\bm{\beta}(a)<\bm{x}^{\top}\bm{\beta}(1-a)\}\bm{v}^{\top}\bm{x}\bm{x}^{\top}\bm{v}dP_{\mathcal{X}}\\ &+\frac{1}{t}\sum_{i=1}^{t}\int\{1-\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\}\mathbb{I}\left\{\bm{x}^{\top}\bm{\beta}(a)\geq\bm{x}^{\top}\bm{\beta}(1-a)\right\}\bm{v}^{\top}\bm{x}\bm{x}^{\top}\bm{v}dP_{\mathcal{X}}\\ &=\int\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\mathbb{I}\{\bm{x}^{\top}\bm{\beta}(a)<\bm{x}^{\top}\bm{\beta}(1-a)\}\bm{v}^{\top}\bm{x}\bm{x}^{\top}\bm{v}dP_{\mathcal{X}}\\ &+\int\{1-\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\}\mathbb{I}\left\{\bm{x}^{\top}\bm{\beta}(a)\geq\bm{x}^{\top}\bm{\beta}(1-a)\right\}\bm{v}^{\top}\bm{x}\bm{x}^{\top}\bm{v}dP_{\mathcal{X}}.\end{aligned}

Since $\lim_{i\rightarrow\infty}{\mbox{Pr}}\{a_{i}\neq\pi^{*}(\bm{x})\}=\kappa_{\infty}(\bm{x})$, for any $\epsilon>0$ there exists a constant $t_{0}>0$ such that $\left|{\mbox{Pr}}\{a_{i}\neq\pi^{*}(\bm{x})\}-\kappa_{\infty}(\bm{x})\right|<\epsilon$ for all $i\geq t_{0}$. Therefore, for the expectation of $\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)$ over the history, we have

𝔼[1ti=1tνi(𝒙,i1)]=1ti=1t𝔼[νi(𝒙,i1)]=1ti=1tPr{aiπ(𝒙)}.{\mathbb{E}}\left[\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\right]=\frac{1}{t}\sum_{i=1}^{t}{\mathbb{E}}\left[\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\right]=\frac{1}{t}\sum_{i=1}^{t}{\mbox{Pr}}\{a_{i}\neq\pi^{*}(\bm{x})\}.

It follows immediately that

𝔼[1ti=1tνi(𝒙,i1)]κ(𝒙)\displaystyle{\mathbb{E}}\left[\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\right]-\kappa_{\infty}(\bm{x})
=1ti=1t0[Pr{aiπ(𝒙)}κ(𝒙)]+1ti=t0t[Pr{aiπ(𝒙)}κ(𝒙)].\displaystyle=\frac{1}{t}\sum_{i=1}^{t_{0}}\left[{\mbox{Pr}}\{a_{i}\neq\pi^{*}(\bm{x})\}-\kappa_{\infty}(\bm{x})\right]+\frac{1}{t}\sum_{i=t_{0}}^{t}\left[{\mbox{Pr}}\{a_{i}\neq\pi^{*}(\bm{x})\}-\kappa_{\infty}(\bm{x})\right].

Therefore, by the triangle inequality, we have

|𝔼[1ti=1tνi(𝒙,i1)]κ(𝒙)|\displaystyle\left|{\mathbb{E}}\left[\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\right]-\kappa_{\infty}(\bm{x})\right|
\displaystyle\leq\frac{1}{t}\sum_{i=1}^{t_{0}}\left|{\mbox{Pr}}\{a_{i}\neq\pi^{*}(\bm{x})\}-\kappa_{\infty}(\bm{x})\right|+\frac{1}{t}\sum_{i=t_{0}}^{t}\left|{\mbox{Pr}}\{a_{i}\neq\pi^{*}(\bm{x})\}-\kappa_{\infty}(\bm{x})\right|
<1ti=1t0[|Pr{aiπ(𝒙)}|+|κ(𝒙)|]+1ti=t0tϵ\displaystyle<\frac{1}{t}\sum_{i=1}^{t_{0}}\left[\left|{\mbox{Pr}}\{a_{i}\neq\pi^{*}(\bm{x})\}\right|+\left|\kappa_{\infty}(\bm{x})\right|\right]+\frac{1}{t}\sum_{i=t_{0}}^{t}\epsilon
1ti=1t02+1ti=t0tϵ=2t0t+tt0tϵ.\displaystyle\leq\frac{1}{t}\sum_{i=1}^{t_{0}}2+\frac{1}{t}\sum_{i=t_{0}}^{t}\epsilon=\frac{2t_{0}}{t}+\frac{t-t_{0}}{t}\epsilon.

Since the above equation holds for any ϵ>0\epsilon>0 and 0<tt0t<10<\frac{t-t_{0}}{t}<1, we have

|𝔼[1ti=1tνi(𝒙,i1)]κ(𝒙)|2t0t,\left|{\mathbb{E}}\left[\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\right]-\kappa_{\infty}(\bm{x})\right|\leq\frac{2t_{0}}{t},

which goes to zero as tt\rightarrow\infty. Thus,

𝔼[1ti=1tνi(𝒙,i1)]=κ(𝒙)+op(1).{\mathbb{E}}\left[\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\right]=\kappa_{\infty}(\bm{x})+o_{p}(1). (B.26)

Next, we consider the variance of $\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)$. Denote ${\mathbb{E}}\left[\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\right]=\mu_{\nu}(\bm{x})$, so that $\mu_{\nu}(\bm{x})=\kappa_{\infty}(\bm{x})+o_{p}(1)$. Since $\nu_{i}\left(\bm{x}_{i},\mathcal{H}_{i-1}\right)\equiv\operatorname{Pr}\left\{a_{i}\neq\pi^{*}(\bm{x}_{i})\mid\bm{x}_{i},\mathcal{H}_{i-1}\right\}\in[0,1]$, Lemma B.1 gives

Var[1ti=1tνi(𝒙,i1)]μν(𝒙)μν(𝒙)2=κ(𝒙){1κ(𝒙)}+op(1),\displaystyle\operatorname{Var}\left[\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\right]\leq\mu_{\nu}(\bm{x})-\mu_{\nu}(\bm{x})^{2}=\kappa_{\infty}(\bm{x})\left\{1-\kappa_{\infty}(\bm{x})\right\}+o_{p}(1),

which goes to zero as tt\rightarrow\infty. Combined with Equation (B.26), it follows immediately that as tt goes to \infty, we have

1ti=1tνi(𝒙,i1)κ(𝒙).\frac{1}{t}\sum_{i=1}^{t}\nu_{i}\left(\bm{x},\mathcal{H}_{i-1}\right)\rightarrow\kappa_{\infty}(\bm{x}).

Therefore, as tt goes to \infty, we have 1ti=1t𝔼[𝕀(ai=a)𝒗𝒙i𝒙i𝒗i1]\frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left[\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\mid\mathcal{H}_{i-1}\right] converges to

κ(𝒙)𝕀{𝒙𝜷(a)<𝒙𝜷(1a)}𝒗𝒙𝒙𝒗𝑑P𝒳\displaystyle\int\kappa_{\infty}(\bm{x})\mathbb{I}\{\bm{x}^{\top}\bm{\beta}(a)<\bm{x}^{\top}\bm{\beta}(1-a)\}\bm{v}^{\top}\bm{x}\bm{x}^{\top}\bm{v}dP_{\mathcal{X}} (B.27)
+{1κ(𝒙)}𝕀{𝒙𝜷(a)𝒙𝜷(1a)}𝒗𝒙𝒙𝒗𝑑P𝒳.\displaystyle+\int\{1-\kappa_{\infty}(\bm{x})\}\mathbb{I}\left\{\bm{x}^{\top}\bm{\beta}(a)\geq\bm{x}^{\top}\bm{\beta}(1-a)\right\}\bm{v}^{\top}\bm{x}\bm{x}^{\top}\bm{v}dP_{\mathcal{X}}.

Thus, following the similar arguments in S1.2 in Chen et al. (2020), we have

𝜼1(𝒗)=1ti=1t𝕀(ai=a)𝒗𝒙ieiD𝒩d(0,𝒗Ga𝒗),\bm{\eta}_{1}(\bm{v})=\frac{1}{\sqrt{t}}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}e_{i}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}_{d}\left(0,\bm{v}^{\top}G_{a}\bm{v}\right),

where

𝒗Ga𝒗=σa2\displaystyle\bm{v}^{\top}G_{a}\bm{v}=\sigma_{a}^{2} {κ(𝒙)𝕀{𝒙𝜷(a)<𝒙𝜷(1a)}𝒗𝒙𝒙𝒗dP𝒳\displaystyle\left\{\int\kappa_{\infty}(\bm{x})\mathbb{I}\{\bm{x}^{\top}\bm{\beta}(a)<\bm{x}^{\top}\bm{\beta}(1-a)\}\bm{v}^{\top}\bm{x}\bm{x}^{\top}\bm{v}dP_{\mathcal{X}}\right.
+{1κ(𝒙)}𝕀{𝒙𝜷(a)𝒙𝜷(1a)}𝒗𝒙𝒙𝒗dP𝒳}.\displaystyle\left.+\int\{1-\kappa_{\infty}(\bm{x})\}\mathbb{I}\left\{\bm{x}^{\top}\bm{\beta}(a)\geq\bm{x}^{\top}\bm{\beta}(1-a)\right\}\bm{v}^{\top}\bm{x}\bm{x}^{\top}\bm{v}dP_{\mathcal{X}}\right\}.

Finally, since this convergence holds for every $\bm{v}$, the Cram&eacute;r–Wold device yields

𝜼1=1ti=1t𝕀(ai=a)𝒙ieiD𝒩d(0,Ga),\bm{\eta}_{1}=\frac{1}{\sqrt{t}}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}e_{i}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}_{d}\left(0,G_{a}\right),

where

Ga=σa2\displaystyle G_{a}=\sigma_{a}^{2} {κ(𝒙)𝕀{𝒙𝜷(a)<𝒙𝜷(1a)}𝒙𝒙dP𝒳\displaystyle\left\{\int\kappa_{\infty}(\bm{x})\mathbb{I}\{\bm{x}^{\top}\bm{\beta}(a)<\bm{x}^{\top}\bm{\beta}(1-a)\}\bm{x}\bm{x}^{\top}dP_{\mathcal{X}}\right. (B.28)
+{1κ(𝒙)}𝕀{𝒙𝜷(a)𝒙𝜷(1a)}𝒙𝒙dP𝒳}.\displaystyle\left.+\int\{1-\kappa_{\infty}(\bm{x})\}\mathbb{I}\left\{\bm{x}^{\top}\bm{\beta}(a)\geq\bm{x}^{\top}\bm{\beta}(1-a)\right\}\bm{x}\bm{x}^{\top}dP_{\mathcal{X}}\right\}.

The first part is thus completed.
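As a quick sanity check (not part of the original argument), suppose the limiting probability of exploration is a constant, $\kappa_{\infty}(\bm{x})\equiv\kappa\in[0,1)$. With the linear conditional means $\mu(\bm{x},a)=\bm{x}^{\top}\bm{\beta}(a)$ considered here and ignoring ties, $\mathbb{I}\{\bm{x}^{\top}\bm{\beta}(a)<\bm{x}^{\top}\bm{\beta}(1-a)\}=\mathbb{I}\{\pi^{*}(\bm{x})\neq a\}$, so (B.28) simplifies to

G_{a}=\sigma_{a}^{2}\left[\kappa\,\mathbb{E}\left\{\bm{x}\bm{x}^{\top}\mathbb{I}(\pi^{*}(\bm{x})\neq a)\right\}+(1-\kappa)\,\mathbb{E}\left\{\bm{x}\bm{x}^{\top}\mathbb{I}(\pi^{*}(\bm{x})=a)\right\}\right],

that is, the information accumulated for arm $a$ mixes the contexts where $a$ is optimal (weight $1-\kappa$) with those where it is only explored (weight $\kappa$).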

Step 2: We next show that $\bm{\xi}=\left\{(1/t)\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\stackrel{p}{\longrightarrow}\sigma_{a}^{2}G_{a}^{-1}$, for which it suffices to find the limit of $\left\{(1/t)\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}$. By Lemma 6 in Chen et al. (2020), it suffices to show the limit of $\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}$ for any $\bm{v}\in\mathbb{R}^{d}$.

Since ${\mbox{Pr}}(|\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}|>h)\leq{\mbox{Pr}}(|\bm{v}^{\top}\bm{x}\bm{x}^{\top}\bm{v}|>h)$ for each $h>0$ and $i\geq 1$, by Theorem 2.19 in Hall and Heyde (2014), we have

1ti=1t[𝕀(ai=a)𝒗𝒙i𝒙i𝒗𝔼{𝕀(ai=a)𝒗𝒙i𝒙i𝒗i1}]𝑝0, as t.\frac{1}{t}\sum_{i=1}^{t}\left[\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}-{\mathbb{E}}\left\{\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\mid\mathcal{H}_{i-1}\right\}\right]\overset{p}{\longrightarrow}0,\text{ as }t\rightarrow\infty. (B.29)

Recalling the results in (B.4) and (B.28), we have $\frac{1}{t}\sum_{i=1}^{t}{\mathbb{E}}\left\{\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\mid\mathcal{H}_{i-1}\right\}\rightarrow\bm{v}^{\top}G_{a}\bm{v}/\sigma_{a}^{2}$ as $t\rightarrow\infty$. Combining this with (B.29), we have

\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{v}^{\top}\bm{x}_{i}\bm{x}_{i}^{\top}\bm{v}\overset{p}{\longrightarrow}\frac{\bm{v}^{\top}G_{a}\bm{v}}{\sigma_{a}^{2}},\text{ as }t\rightarrow\infty.

By Lemma 6 in Chen et al. (2020) and Continuous Mapping Theorem, we further have

𝝃={1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}1pσa2Ga1.\bm{\xi}=\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\stackrel{{\scriptstyle p}}{{\longrightarrow}}\sigma_{a}^{2}G_{a}^{-1}.

Step 3: We next prove that $\bm{\eta}_{2}=\left\{(1/t)\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+(\omega/t)\bm{I}_{d}\right\}^{-1}({\omega}/{\sqrt{t}})\bm{\beta}(a)\overset{p}{\longrightarrow}\bm{0}_{d}$. It suffices to show that $\bm{b}_{i}^{\top}\bm{\eta}_{2}\overset{p}{\longrightarrow}0$ holds for any standard basis vector $\bm{b}_{i}\in\mathbb{R}^{d}$. Since

𝒃i𝜼2\displaystyle\bm{b}_{i}^{\top}\bm{\eta}_{2} =𝒃i{1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}1ωt𝜷(a)2\displaystyle=\left\|\bm{b}_{i}^{\top}\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\frac{\omega}{\sqrt{t}}\bm{\beta}(a)\right\|_{2}
ωt𝒃i2{1ti=1t𝕀(ai=a)𝒙i𝒙i+1tω𝑰d}12𝜷(a)2,\displaystyle\leq\frac{\omega}{\sqrt{t}}\left\|\bm{b}_{i}^{\top}\right\|_{2}\left\|\left\{\frac{1}{t}\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\frac{1}{t}\omega\bm{I}_{d}\right\}^{-1}\right\|_{2}\left\|\bm{\beta}(a)\right\|_{2},

and by (B.4), we have

𝒃i𝜼2ω𝜷(a)2tpt2λ+1tω.\bm{b}_{i}^{\top}\bm{\eta}_{2}\leq\frac{\omega\left\|\bm{\beta}(a)\right\|_{2}}{\sqrt{tp_{t}^{2}}\lambda+\frac{1}{\sqrt{t}}\omega}.

Thus, we have 𝒃i𝜼2𝑝0\bm{b}_{i}^{\top}\bm{\eta}_{2}\overset{p}{\longrightarrow}0, as tpt2tp_{t}^{2}\rightarrow\infty.

Step 4: Finally, we combine the above results using Slutsky’s theorem, and conclude that

t{𝜷^t(a)𝜷(a)}=𝝃𝜼1+𝜼2D𝒩d(𝟎d,σa4Ga1),\sqrt{t}\{\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\}=\bm{\xi}\bm{\eta}_{1}+\bm{\eta}_{2}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}_{d}\left(\bm{0}_{d},\sigma_{a}^{4}G_{a}^{-1}\right),

where GaG_{a} is defined in (B.28). Denote the variance term as

σ𝜷(a)2=σa2\displaystyle\sigma_{\bm{\beta}(a)}^{2}=\sigma_{a}^{2} [κ(𝒙)𝕀{𝒙𝜷(a)<𝒙𝜷(1a)}𝒙𝒙dP𝒳\displaystyle\left[\int\kappa_{\infty}(\bm{x})\mathbb{I}\{\bm{x}^{\top}\bm{\beta}(a)<\bm{x}^{\top}\bm{\beta}(1-a)\}\bm{x}\bm{x}^{\top}dP_{\mathcal{X}}\right.
+{1κ(𝒙)}𝕀{𝒙𝜷(a)𝒙𝜷(1a)}𝒙𝒙dP𝒳]1,\displaystyle\left.+\int\{1-\kappa_{\infty}(\bm{x})\}\mathbb{I}\left\{\bm{x}^{\top}\bm{\beta}(a)\geq\bm{x}^{\top}\bm{\beta}(1-a)\right\}\bm{x}\bm{x}^{\top}dP_{\mathcal{X}}\right]^{-1},

with σa2=𝔼(et2|at=a)\sigma^{2}_{a}={\mathbb{E}}(e_{t}^{2}|a_{t}=a) denoting the conditional variance of ete_{t} given at=aa_{t}=a, for a=0,1a=0,1, we have

t{𝜷^t(a)𝜷(a)}D𝒩d{𝟎d,σ𝜷(a)2}.\sqrt{t}\left\{\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\right\}\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}_{d}\{\bm{0}_{d},\sigma_{\bm{\beta}(a)}^{2}\}.

The proof is hence completed.
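For concreteness, the decomposition $\sqrt{t}\{\widehat{\bm{\beta}}_{t}(a)-\bm{\beta}(a)\}=\bm{\xi}\bm{\eta}_{1}+\bm{\eta}_{2}$ above is the one obtained when $\widehat{\bm{\beta}}_{t}(a)$ is the per-arm ridge estimate $\{\sum_{i\leq t}\mathbb{I}(a_{i}=a)\bm{x}_{i}\bm{x}_{i}^{\top}+\omega\bm{I}_{d}\}^{-1}\sum_{i\leq t}\mathbb{I}(a_{i}=a)\bm{x}_{i}r_{i}$. A minimal computational sketch of this estimate is given below; the function and variable names are ours and only illustrate the form of the estimator, not the authors' implementation.

import numpy as np

def ridge_beta_hat(X, R, A, a, omega=1.0):
    """Per-arm ridge estimate beta_hat_t(a) of the form
    {sum_i I(a_i = a) x_i x_i^T + omega I_d}^{-1} sum_i I(a_i = a) x_i r_i.

    X : (t, d) array of contexts x_i
    R : (t,) array of rewards r_i
    A : (t,) array of actions a_i in {0, 1}
    a : the arm whose coefficient vector is estimated
    omega : ridge penalty (the omega appearing in xi and eta_2 above)
    """
    mask = (A == a)
    Xa, Ra = X[mask], R[mask]
    d = X.shape[1]
    gram = Xa.T @ Xa + omega * np.eye(d)   # sum_i I(a_i = a) x_i x_i^T + omega I_d
    return np.linalg.solve(gram, Xa.T @ Ra)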

B.5 Proof of Theorem 3

Finally, we prove the asymptotic normality of the proposed value estimator under DREAM in Theorem 3 in this section. The proof consists of four steps. In step 1, we aim to show

V^T=V~T+op(T1/2),\widehat{V}_{T}=\widetilde{V}_{T}+o_{p}(T^{-1/2}),

where

V~T=1Tt=1T𝕀{at=π^t(𝒙t)}1κt(𝒙t)[rtμ{𝒙t,π^t(𝒙t)}]+μ{𝒙t,π^t(𝒙t)}.\widetilde{V}_{T}=\frac{1}{T}\sum_{t=1}^{T}\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}}{1-\kappa_{t}(\bm{x}_{t})}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}+{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}.

Next, in Step 2, we establish

V~T=V¯T+op(T1/2),\widetilde{V}_{T}=\overline{V}_{T}+o_{p}(T^{-1/2}),

where

V¯T=1Tt=1T𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}[rtμ{𝒙t,π(𝒙t)}]+μ{𝒙t,π(𝒙t)}.\overline{V}_{T}=\frac{1}{T}\sum_{t=1}^{T}\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}\Big{]}+{\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}.
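As a brief aside (not needed for the argument that follows), $\overline{V}_{T}$ is an oracle augmented inverse-probability-weighted average: conditioning on $\bm{x}_{t}$ and $\mathcal{H}_{t-1}$ and using ${\mathbb{E}}(e_{t}\mid a_{t},\bm{x}_{t})=0$,

\mathbb{E}\left[\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}\Big{]}+{\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}\;\Big{|}\;\bm{x}_{t},\mathcal{H}_{t-1}\right]={\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\},

whose expectation over $\bm{x}_{t}$ is $V^{*}$; Step 3 below formalizes this through the martingale differences $Z_{t}$ and $W_{t}$.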

The above two steps yield that

V^T=V¯T+op(T1/2).\widehat{V}_{T}=\overline{V}_{T}+o_{p}(T^{-1/2}). (B.30)

Then, in Step 3, based on (B.30) and Martingale Central Limit Theorem, we show

T(V^TV)D𝒩(0,σDR2),\sqrt{T}(\widehat{V}_{T}-V^{*})\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}\left(0,\sigma_{DR}^{2}\right),

with

σDR2=𝒙π(𝒙)σ12+{1π(𝒙)}σ021κ(𝒙)𝑑P𝒳+Var[μ{𝒙,π(𝒙)}],\sigma_{DR}^{2}=\int_{\bm{x}}\frac{\pi^{*}(\bm{x})\sigma_{1}^{2}+\{1-\pi^{*}(\bm{x})\}\sigma_{0}^{2}}{1-\kappa_{\infty}(\bm{x})}d{P_{\mathcal{X}}}+{\mbox{Var}}\left[{\mu}\{\bm{x},\pi^{*}(\bm{x})\}\right],

where σa2=𝔼(et2|at=a)\sigma^{2}_{a}={\mathbb{E}}(e_{t}^{2}|a_{t}=a) for a=0,1a=0,1, and κtκ\kappa_{t}\rightarrow\kappa_{\infty} as tt\rightarrow\infty.

Lastly, in Step 4, we show the variance estimator in Equation (5) in the main paper is a consistent estimator of σDR2\sigma_{DR}^{2}. The proof for Theorem 3 is thus completed.
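Before presenting the details, a minimal computational sketch of the estimator analyzed in Steps 1 and 2 may help fix ideas. The function below evaluates the doubly robust summands of $\widehat{V}_{T}$ from logged tuples $(\bm{x}_{t},a_{t},r_{t})$ together with plug-in estimates of $\widehat{\pi}_{t}$, $\widehat{\kappa}_{t}$ and $\widehat{\mu}_{t-1}$; all names are ours, and how these plug-ins are constructed follows the main text rather than this sketch.

import numpy as np

def dream_value(a, r, pi_hat, kappa_hat, mu_hat):
    """Doubly robust value estimate V_hat_T = mean of the summands below.

    a         : (T,) actions taken by the bandit algorithm
    r         : (T,) observed rewards
    pi_hat    : (T,) estimated optimal actions pi_hat_t(x_t)
    kappa_hat : (T,) estimated exploration probabilities kappa_hat_t(x_t)
    mu_hat    : (T,) plug-in means mu_hat_{t-1}(x_t, pi_hat_t(x_t))
    """
    a, r = np.asarray(a), np.asarray(r)
    pi_hat, kappa_hat, mu_hat = map(np.asarray, (pi_hat, kappa_hat, mu_hat))
    ipw = (a == pi_hat) / (1.0 - kappa_hat)   # inverse probability of following pi_hat
    summand = ipw * (r - mu_hat) + mu_hat     # augmented (doubly robust) term
    return summand.mean()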

Step 1: We first show V^T=V~T+op(T1/2)\widehat{V}_{T}=\widetilde{V}_{T}+o_{p}(T^{-1/2}). To this end, define a middle term as

ϕ~T=1Tt=1T𝕀{at=π^t(𝒙t)}1κt(𝒙t)[rtμ^t1{𝒙t,π^t(𝒙t)}]+μ^t1{𝒙t,π^t(𝒙t)}.\widetilde{\phi}_{T}=\frac{1}{T}\sum_{t=1}^{T}\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}}{1-\kappa_{t}(\bm{x}_{t})}\Big{[}r_{t}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}+\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}.

Thus, it suffices to show V^T=ϕ~T+op(T1/2)\widehat{V}_{T}=\widetilde{\phi}_{T}+o_{p}(T^{-1/2}) and ϕ~T=V~T+op(T1/2)\widetilde{\phi}_{T}=\widetilde{V}_{T}+o_{p}(T^{-1/2}).

Firstly, we have

V^Tϕ~T\displaystyle\widehat{V}_{T}-\widetilde{\phi}_{T} =1Tt=1T[𝕀{at=π^t(𝒙t)}1κ^t(𝒙t)𝕀{at=π^t(𝒙t)}1κt(𝒙t)][rtμ^t1{𝒙t,π^t(𝒙t)}]\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}-\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}}{1-\kappa_{t}(\bm{x}_{t})}\right]\Big{[}r_{t}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}
=1Tt=1T{κ^t(𝒙t)κt(𝒙t)}[𝕀{at=π^t(𝒙t)}[rtμ^t1{𝒙t,π^t(𝒙t)}]{1κ^t(𝒙t)}{1κt(𝒙t)}].\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{[}r_{t}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right].

We can further decompose the right-hand side above as

V^Tϕ~T=1Tt=1T\displaystyle\widehat{V}_{T}-\widetilde{\phi}_{T}=\frac{1}{T}\sum_{t=1}^{T} {κ^t(𝒙t)κt(𝒙t)}\displaystyle\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}
[𝕀{at=π^t(𝒙t)}[rtμ{𝒙t,π^t(𝒙t)}+μ{𝒙t,π^t(𝒙t)}μ^t1{𝒙t,π^t(𝒙t)}]{1κ^t(𝒙t)}{1κt(𝒙t)}]\displaystyle\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}+{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]
=1Tt=1T\displaystyle=\frac{1}{T}\sum_{t=1}^{T} {κ^t(𝒙t)κt(𝒙t)}[𝕀{at=π^t(𝒙t)}[rtμ{𝒙t,π^t(𝒙t)}]{1κ^t(𝒙t)}{1κt(𝒙t)}]\displaystyle\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right] (B.32)
+1Tt=1T\displaystyle+\frac{1}{T}\sum_{t=1}^{T} {κ^t(𝒙t)κt(𝒙t)}[𝕀{at=π^t(𝒙t)}[μ{𝒙t,π^t(𝒙t)}μ^t1{𝒙t,π^t(𝒙t)}]{1κ^t(𝒙t)}{1κt(𝒙t)}].\displaystyle\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{[}{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right].

We first show that the first term in (B.32) is $o_{p}(T^{-1/2})$. Define a class of functions

κ(𝒙,a,r)={{κ^(𝒙)κ(𝒙)}[𝕀{a=π(𝒙)}[rμ{𝒙,π(𝒙)}]{1κ^(𝒙)}{1κ(𝒙)}]:κ^(),κ()Λ,π()Π},\displaystyle\mathcal{F}_{\kappa}(\bm{x},a,r)=\Bigg{\{}\left\{\widehat{\kappa}(\bm{x})-\kappa(\bm{x})\right\}\left[\frac{\mathbb{I}\{a=\pi(\bm{x})\}\Big{[}r-{\mu}\{\bm{x},\pi(\bm{x})\}\Big{]}}{\{1-\widehat{\kappa}(\bm{x})\}\{1-\kappa(\bm{x})\}}\right]:\widehat{\kappa}(\cdot),\kappa(\cdot)\in\Lambda,\pi(\cdot)\in\Pi\Bigg{\}},

where $\Pi$ and $\Lambda$ are two classes of functions that map a context $\bm{x}\in\mathcal{X}$ to a probability.
Define the supremum of the empirical process indexed by κ\mathcal{F}_{\kappa} as

𝔾n\displaystyle||\mathbb{G}_{n}||_{\mathcal{F}}\equiv supπΠ1Tt=1T[κ(𝒙t,at,rt)𝔼{κ(𝒙t,at,rt)|t1}].\displaystyle\sup_{\pi\in\Pi}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\left[\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})-\mathbb{E}\{\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})|\mathcal{H}_{t-1}\}\right]. (B.33)

Notice that

𝔼{κ(𝒙t,at,rt)|t1}\displaystyle\mathbb{E}\{\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})|\mathcal{H}_{t-1}\}
=\displaystyle= 𝔼({κ^(𝒙t)κt(𝒙t)}[𝕀{at=π^t(𝒙t)}[rtμ{𝒙t,π^t(𝒙t)}]{1κ^(𝒙t)}{1κt(𝒙t)}]t1)\displaystyle\mathbb{E}\left(\left\{\widehat{\kappa}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}}{\{1-\widehat{\kappa}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]\mid\mathcal{H}_{t-1}\right)
=\displaystyle= 𝔼({κ^(𝒙t)κt(𝒙t)}[𝕀{at=π^t(𝒙t)}et{1κ^(𝒙t)}{1κt(𝒙t)}]t1),\displaystyle\mathbb{E}\left(\left\{\widehat{\kappa}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}e_{t}}{\{1-\widehat{\kappa}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]\mid\mathcal{H}_{t-1}\right),

by the definition of the noise $e_{t}$ (on the event $\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}$ we have $r_{t}-\mu\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}=e_{t}$). Thus, using iterated expectations, we have

𝔼{κ(𝒙t,at,rt)|t1}\displaystyle\mathbb{E}\{\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})|\mathcal{H}_{t-1}\}
=\displaystyle= 𝔼{𝔼({κ^(𝒙t)κt(𝒙t)}[𝕀{at=π^t(𝒙t)}et{1κ^(𝒙t)}{1κt(𝒙t)}]at,𝒙t)t1}\displaystyle\mathbb{E}\left\{\mathbb{E}\left(\left\{\widehat{\kappa}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}e_{t}}{\{1-\widehat{\kappa}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]\mid a_{t},\bm{x}_{t}\right)\mid\mathcal{H}_{t-1}\right\}
=\displaystyle= 𝔼{{κ^(𝒙t)κt(𝒙t)}[𝔼{𝕀{at=π^t(𝒙t)}𝒙t}{1κ^(𝒙t)}{1κt(𝒙t)}]𝔼(etat,𝒙t)t1}=0,\displaystyle\mathbb{E}\left\{\left\{\widehat{\kappa}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{E}\{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\mid\bm{x}_{t}\}}{\{1-\widehat{\kappa}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]\mathbb{E}\left(e_{t}\mid a_{t},\bm{x}_{t}\right)\mid\mathcal{H}_{t-1}\right\}=0,

where the last equation is due to the definition of the noise ete_{t}. Therefore, Equation (B.33) can be further written as

𝔾nsupπΠ1Tt=1Tκ(𝒙t,at,rt).||\mathbb{G}_{n}||_{\mathcal{F}}\equiv\sup_{\pi\in\Pi}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t}).

Next, we show the second moment is bounded by

𝔼{({κ^t(𝒙t)κt(𝒙t)}[𝕀{at=π^t(𝒙t)}[rtμ{𝒙t,π^t(𝒙t)}]{1κ^t(𝒙t)}{1κt(𝒙t)}])2t1}=𝔼{[{κ^t(𝒙t)κt(𝒙t)}{1κ^t(𝒙t)}{1κt(𝒙t)}]2𝕀{at=π^t(𝒙t)}[rtμ{𝒙t,π^t(𝒙t)}]2t1}=𝔼{[{κ^t(𝒙t)κt(𝒙t)}{1κ^t(𝒙t)}{1κt(𝒙t)}]2𝕀{at=π^t(𝒙t)}et2t1},\begin{aligned} &\mathbb{E}\left\{\left(\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]\right)^{2}\mid\mathcal{H}_{t-1}\right\}\\ =&\mathbb{E}\left\{\left[\frac{\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]^{2}\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}^{2}\mid\mathcal{H}_{t-1}\right\}\\ =&\mathbb{E}\left\{\left[\frac{\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]^{2}\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}e_{t}^{2}\mid\mathcal{H}_{t-1}\right\}\end{aligned},

by the definition of the noise $e_{t}$. Thus, using iterated expectations, we have

𝔼{({κ^t(𝒙t)κt(𝒙t)}[𝕀{at=π^t(𝒙t)}[rtμ{𝒙t,π^t(𝒙t)}]{1κ^t(𝒙t)}{1κt(𝒙t)}])2t1}\displaystyle\mathbb{E}\left\{\left(\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]\right)^{2}\mid\mathcal{H}_{t-1}\right\}
=\displaystyle= 𝔼(𝔼{[{κ^t(𝒙t)κt(𝒙t)}{1κ^t(𝒙t)}{1κt(𝒙t)}]2𝕀{at=π^t(𝒙t)}et2at,𝒙t}t1)\displaystyle\mathbb{E}\left(\mathbb{E}\left\{\left[\frac{\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]^{2}\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}e_{t}^{2}\mid a_{t},\bm{x}_{t}\right\}\mid\mathcal{H}_{t-1}\right)
=\displaystyle= 𝔼([{κ^t(𝒙t)κt(𝒙t)}{1κ^t(𝒙t)}{1κt(𝒙t)}]2𝕀{at=π^t(𝒙t)}𝔼{et2at,𝒙t}t1)\displaystyle\mathbb{E}\left(\left[\frac{\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]^{2}\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\mathbb{E}\left\{e_{t}^{2}\mid a_{t},\bm{x}_{t}\right\}\mid\mathcal{H}_{t-1}\right)
=\displaystyle= 𝔼([11κ^t(𝒙t)11κt(𝒙t)]2𝕀{at=π^t(𝒙t)}σat2t1).\displaystyle\mathbb{E}\left(\left[\frac{1}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}-\frac{1}{1-\kappa_{t}(\bm{x}_{t})}\right]^{2}\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\sigma_{a_{t}}^{2}\mid\mathcal{H}_{t-1}\right).

Note that $\widehat{\kappa}_{t}(\bm{x}_{t})\leq C_{1}<1$ and $\kappa_{t}(\bm{x}_{t})\leq C_{1}<1$ almost surely for some constant $C_{1}<1$ (by the definition of a valid bandit algorithm and the results of Theorem 1). We thus have

[11κ^t(𝒙t)11κt(𝒙t)]2{|11κ^t(𝒙t)|+|11κt(𝒙t)|}2(21C1)2C2,\left[\frac{1}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}-\frac{1}{1-\kappa_{t}(\bm{x}_{t})}\right]^{2}\leq\left\{\left|\frac{1}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}\right|+\left|\frac{1}{1-\kappa_{t}(\bm{x}_{t})}\right|\right\}^{2}\leq\left(\frac{2}{1-C_{1}}\right)^{2}\equiv C_{2},

where C2C_{2} is a bounded constant. Thus we have

𝔼([11κ^t(𝒙t)11κt(𝒙t)]2𝕀{at=π^t(𝒙t)}σat2t1)\displaystyle\mathbb{E}\left(\left[\frac{1}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}-\frac{1}{1-\kappa_{t}(\bm{x}_{t})}\right]^{2}\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\sigma_{a_{t}}^{2}\mid\mathcal{H}_{t-1}\right)
𝔼(C2𝕀{at=π^t(𝒙t)}σat2t1)\displaystyle\leq\mathbb{E}\left(C_{2}\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\sigma_{a_{t}}^{2}\mid\mathcal{H}_{t-1}\right)
C2max{σ02,σ12}.\displaystyle\leq C_{2}\max\{\sigma_{0}^{2},\sigma_{1}^{2}\}.

Therefore, for the second moment of the inner term of the first term, we have

t=1T𝔼{(1T{κ^t(𝒙t)κt(𝒙t)}[𝕀{at=π^t(𝒙t)}[rtμ{𝒙t,π^t(𝒙t)}]{1κ^t(𝒙t)}{1κt(𝒙t)}])2t1}\displaystyle\sum_{t=1}^{T}\mathbb{E}\left\{\left(\frac{1}{\sqrt{T}}\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]\right)^{2}\mid\mathcal{H}_{t-1}\right\}
t=1TC2max{σ02,σ12}T=C2max{σ02,σ12}<.\displaystyle\leq\sum_{t=1}^{T}\frac{C_{2}\max\{\sigma_{0}^{2},\sigma_{1}^{2}\}}{T}=C_{2}\max\{\sigma_{0}^{2},\sigma_{1}^{2}\}<\infty.

Therefore, we have

d_{1}(f)\equiv\left\|\mathbb{E}\left(\left|\mathcal{F}_{\kappa}(\bm{x}_{1},a_{1},r_{1})\right|\mid\mathcal{H}_{0}\right)\right\|_{\infty}<\infty,

and

d2(f)𝔼((κ(𝒙1,a1,r1))20)1/2<.d_{2}(f)\equiv\left\|\mathbb{E}\left(\left(\mathcal{F}_{\kappa}(\bm{x}_{1},a_{1},r_{1})\right)^{2}\mid\mathcal{H}_{0}\right)\right\|_{\infty}^{1/2}<\infty.

It follows from the maximal inequality developed in Section 4.2 of Dedecker and Louhichi (2002) that there exists some constant $K\geq 1$ such that

𝔼[||𝔾n||]K(pd2(f)+1Tmax1tT|κ(𝒙t,at,rt)𝔼(κ(𝒙t,at,rt)t1)|).\displaystyle{\mathbb{E}}\Big{[}||\mathbb{G}_{n}||_{\mathcal{F}}\Big{]}\lesssim K\left(\sqrt{p}d_{2}(f)+\frac{1}{\sqrt{T}}\left\|\max_{1\leq t\leq T}\left|\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})-\mathbb{E}\left(\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})\mid\mathcal{H}_{t-1}\right)\right|\right\|\right).

The above right-hand-side is upper bounded by

O(1)T1/2,\displaystyle O(1)\sqrt{T^{-1/2}},

where O(1)O(1) denotes some universal constant. Hence, we have

𝔼[𝔾n]=𝒪p(T1/2).\displaystyle{\mathbb{E}}\Big{[}||\mathbb{G}_{n}||_{\mathcal{F}}\Big{]}=\mathcal{O}_{p}(T^{-1/2}). (B.34)

Combined with Equation (B.33), we have

1Tt=1Tκ(𝒙t,at,rt)=𝒪p(T1/2).\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})=\mathcal{O}_{p}(T^{-1/2}).

Therefore, for the first term in (B.32), we have

1Tt=1T{κ^t(𝒙t)κt(𝒙t)}[𝕀{at=π^t(𝒙t)}[rtμ{𝒙t,π^t(𝒙t)}]{1κ^t(𝒙t)}{1κt(𝒙t)}]=𝒪p(T1)=op(T1/2).\frac{1}{T}\sum_{t=1}^{T}\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]=\mathcal{O}_{p}(T^{-1})=o_{p}(T^{-1/2}).

Then we consider the second term in (B.32), where

1Tt=1T{κ^t(𝒙t)κt(𝒙t)}[𝕀{at=π^t(𝒙t)}[μ{𝒙t,π^t(𝒙t)}μ^t1{𝒙t,π^t(𝒙t)}]{1κ^t(𝒙t)}{1κt(𝒙t)}]\displaystyle\frac{1}{T}\sum_{t=1}^{T}\left\{\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right\}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{[}{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}}{\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}}\right]
\displaystyle\leq 1Tt=1TBκ|κ^t(𝒙t)κt(𝒙t)||μ{𝒙t,π^t(𝒙t)}μ^t1{𝒙t,π^t(𝒙t)}|,\displaystyle\frac{1}{T}\sum_{t=1}^{T}B_{\kappa}\left|\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right|\left|{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\right|,

for some constant $B_{\kappa}$ bounding $\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}/[\{1-\widehat{\kappa}_{t}(\bm{x}_{t})\}\{1-\kappa_{t}(\bm{x}_{t})\}]$. By the Cauchy–Schwarz inequality, the above term is further bounded by

1Tt=1TBκ|κ^t(𝒙t)κt(𝒙t)||μ{𝒙t,π^t(𝒙t)}μ^t1{𝒙t,π^t(𝒙t)}|\displaystyle\frac{1}{T}\sum_{t=1}^{T}B_{\kappa}\left|\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right|\left|{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\right|
\displaystyle\leq Bκ1Tt=1T|κ^t(𝒙t)κt(𝒙t)|21Tt=1T|μ{𝒙t,π^t(𝒙t)}μ^t1{𝒙t,π^t(𝒙t)}|2.\displaystyle B_{\kappa}\sqrt{\frac{1}{T}\sum_{t=1}^{T}\left|\widehat{\kappa}_{t}(\bm{x}_{t})-\kappa_{t}(\bm{x}_{t})\right|^{2}\frac{1}{T}\sum_{t=1}^{T}\left|{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\right|^{2}}.

Given Assumption 4.4, we have the above bounded by op(T1/2)o_{p}(T^{-1/2}), and thus the second term in (B.32) is op(T1/2)o_{p}(T^{-1/2}).

Therefore, $\widehat{V}_{T}=\widetilde{\phi}_{T}+o_{p}(T^{-1/2})$ holds.

Then, we focus on proving

ϕ~TV~T=1Tt=1T[𝕀{at=π^t(𝒙t)}1κt(𝒙t)1][μ{𝒙t,π^t(𝒙t)}μ^t1{𝒙t,π^t(𝒙t)}],\displaystyle\widetilde{\phi}_{T}-\widetilde{V}_{T}=\frac{1}{T}\sum_{t=1}^{T}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}}{1-\kappa_{t}(\bm{x}_{t})}-1\right]\Big{[}{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}, (B.35)

is op(T1/2)o_{p}(T^{-1/2}). Specifically, since

μ{𝒙t,π^t(𝒙t)}μ^t1{𝒙t,π^t(𝒙t)}=𝒪p(t1/2),{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\widehat{\mu}_{t-1}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}=\mathcal{O}_{p}(t^{-1/2}),

similarly to the proof for the first term in (B.32), we can show that $\widetilde{\phi}_{T}-\widetilde{V}_{T}=o_{p}(T^{-1/2})$. Thus $\widehat{V}_{T}=\widetilde{V}_{T}+o_{p}(T^{-1/2})$ holds. The first step is thus completed.

Step 2: We next focus on proving V~T=V¯T+op(T1/2)\widetilde{V}_{T}=\overline{V}_{T}+o_{p}(T^{-1/2}). By definition of V~T\widetilde{V}_{T} and V¯T\overline{V}_{T}, we have

T(V~TV¯T)=\displaystyle\sqrt{T}(\widetilde{V}_{T}-\overline{V}_{T})= 1Tt=1T[𝕀{at=π^t(𝒙t)}1κt(𝒙t)1][μ{𝒙t,π(𝒙t)}μ{𝒙t,π^t(𝒙t)}]η5\displaystyle\underbrace{\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}}{1-\kappa_{t}(\bm{x}_{t})}-1\right]\Big{[}{\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}-{\mu}\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}\Big{]}}_{\eta_{5}}
+1Tt=1T[𝕀{at=π^t(𝒙t)}1κt(𝒙t)𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}][rtμ{𝒙t,πt(𝒙t)}]η6.\displaystyle+\underbrace{\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}}{1-\kappa_{t}(\bm{x}_{t})}-\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\right]\Big{[}r_{t}-{\mu}\{\bm{x}_{t},\pi^{*}_{t}(\bm{x}_{t})\}\Big{]}}_{\eta_{6}}.

We first show $\eta_{5}=o_{p}(1)$. Since $\kappa_{t}(\bm{x}_{t})\leq C_{2}<1$ almost surely for some constant $0<C_{2}<1$ (by the definition of a valid bandit algorithm and the results of Theorem 1), it suffices to show

ψ5=T1/2t=1T|μ{𝒙t,π^t(𝒙t)}μ{𝒙t,π}|=op(1),\psi_{5}={{T}^{-1/2}}\sum_{t=1}^{T}\left|\mu\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\mu\{\bm{x}_{t},\pi^{*}\}\right|=o_{p}(1),

which is the direct conclusion of Lemma B.3.

Next, we show η6=op(1)\eta_{6}=o_{p}(1). Firstly we can express η6\eta_{6} as

η6=\displaystyle\eta_{6}= 1Tt=1T[𝕀{at=π^t(𝒙t)}1κt(𝒙t)𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}][rtμ{𝒙t,πt(𝒙t)}]\displaystyle\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}}{1-\kappa_{t}(\bm{x}_{t})}-\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\right]\Big{[}r_{t}-{\mu}\{\bm{x}_{t},\pi^{*}_{t}(\bm{x}_{t})\}\Big{]}
=\displaystyle= 1Tt=1T[𝕀{at=π^t(𝒙t)}1κt(𝒙t)𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}]et\displaystyle\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\left[\frac{\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}}{1-\kappa_{t}(\bm{x}_{t})}-\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\right]e_{t}
=\displaystyle= 1Tt=1T[11κt(𝒙t)𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}]et𝕀{at=π^t(𝒙t)}\displaystyle\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\left[\frac{1}{1-\kappa_{t}(\bm{x}_{t})}-\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\right]e_{t}\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}
1Tt=1T𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}et𝕀{atπ^t(𝒙t)}\displaystyle-\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}e_{t}\mathbb{I}\{a_{t}\neq\widehat{\pi}_{t}(\bm{x}_{t})\}
=\displaystyle= 1Tt=1T[11κt(𝒙t)𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}]et𝕀{at=π^t(𝒙t)}ψ6\displaystyle\underbrace{\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\left[\frac{1}{1-\kappa_{t}(\bm{x}_{t})}-\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\right]e_{t}\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}}_{\psi_{6}}
1Tt=1T1Pr{at=π(𝒙t)}et𝕀{atπ^t(𝒙t)}𝕀{at=π(𝒙t)}ψ7.\displaystyle-\underbrace{\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\frac{1}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}e_{t}\mathbb{I}\{a_{t}\neq\widehat{\pi}_{t}(\bm{x}_{t})\}\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}_{\psi_{7}}.

Note that, by the Bonferroni inequality ${\mbox{Pr}}(A\cap B)\geq{\mbox{Pr}}(A)+{\mbox{Pr}}(B)-1$,

Pr{at=π(𝒙t)}\displaystyle{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\} Pr{at=π^t(𝒙t),π^t(𝒙t)=π(𝒙t)}\displaystyle\geq{\mbox{Pr}}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t}),\widehat{\pi}_{t}(\bm{x}_{t})={\pi}^{*}(\bm{x}_{t})\}
Pr{at=π^t(𝒙t)}+Pr{π^t(𝒙t)=π(𝒙t)}1\displaystyle\geq{\mbox{Pr}}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}+{\mbox{Pr}}\{\widehat{\pi}_{t}(\bm{x}_{t})={\pi}^{*}(\bm{x}_{t})\}-1
=Pr{π^t(𝒙t)=π(𝒙t)}κt(𝒙t),\displaystyle={\mbox{Pr}}\{\widehat{\pi}_{t}(\bm{x}_{t})={\pi}^{*}(\bm{x}_{t})\}-\kappa_{t}(\bm{x}_{t}),

where κt(𝒙t)=op(1)\kappa_{t}(\bm{x}_{t})=o_{p}(1) by Theorem 1 and Pr{π^t(𝒙t)=π(𝒙t)}1ctαγ{\mbox{Pr}}\{\widehat{\pi}_{t}(\bm{x}_{t})={\pi}^{*}(\bm{x}_{t})\}\geq 1-ct^{-\alpha\gamma} by Lemma B.2 as tt\rightarrow\infty. Thus we have for large enough t,

Pr{at=π(𝒙t)}>C1,{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}>C_{1}, (B.37)

for some constant C1>0C_{1}>0.

Then we focus on proving $\psi_{6}=o_{p}(1)$. Define a class of functions

κ(𝒙,a,r)={[11κ(𝒙)𝕀{a=π(𝒙)}Pr{a=π(𝒙)}]et𝕀{a=π(𝒙)}:κtΛ,π()Π},\displaystyle\mathcal{F}_{\kappa}(\bm{x},a,r)=\Bigg{\{}\left[\frac{1}{1-\kappa(\bm{x})}-\frac{\mathbb{I}\{a={\pi}^{*}(\bm{x})\}}{{\mbox{Pr}}\{a={\pi}^{*}(\bm{x})\}}\right]e_{t}\mathbb{I}\{a=\pi(\bm{x})\}:\kappa_{t}\in\Lambda,\pi(\cdot)\in\Pi\Bigg{\}},

where $\Pi$ and $\Lambda$ are two classes of functions that map a context $\bm{x}\in\mathcal{X}$ to a probability.
Define the supremum of the empirical process indexed by κ\mathcal{F}_{\kappa} as

𝔾n\displaystyle||\mathbb{G}_{n}||_{\mathcal{F}}\equiv supπΠ1Tt=1T[κ(𝒙t,at,rt)𝔼{κ(𝒙t,at,rt)|t1}].\displaystyle\sup_{\pi\in\Pi}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\left[\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})-\mathbb{E}\{\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})|\mathcal{H}_{t-1}\}\right]. (B.38)

Firstly we notice that

𝔼{κ(𝒙t,at,rt)|t1}=𝔼([11κt(𝒙t)𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}]et𝕀{a=πt(𝒙)}t1)=0,\mathbb{E}\{\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})|\mathcal{H}_{t-1}\}={\mathbb{E}}\left(\left[\frac{1}{1-\kappa_{t}(\bm{x}_{t})}-\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\right]e_{t}\mathbb{I}\{a=\pi_{t}(\bm{x})\}\mid\mathcal{H}_{t-1}\right)=0,

since ${\mathbb{E}}\left(e_{t}\mid a_{t},\mathcal{H}_{t-1}\right)=0$.
Second, since $\kappa_{t}(\bm{x}_{t})\leq C_{2}<1$ almost surely for some constant $0<C_{2}<1$ (by the definition of a valid bandit algorithm and the results of Theorem 1), by the triangle inequality and (B.37), we have

\left|\frac{1}{1-\kappa_{t}(\bm{x}_{t})}-\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\right|\leq\left|\frac{1}{1-\kappa_{t}(\bm{x}_{t})}\right|+\left|\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\right|\leq\frac{1}{1-C_{2}}+\frac{1}{C_{1}}\triangleq C.

Therefore, we have

\displaystyle\sum_{t=1}^{T}{\mathbb{E}}\left\{\left(\frac{1}{\sqrt{T}}\left[\frac{1}{1-\kappa_{t}(\bm{x}_{t})}-\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\right]e_{t}\mathbb{I}\{a_{t}=\widehat{\pi}_{t}(\bm{x}_{t})\}\right)^{2}\mid\mathcal{H}_{t-1}\right\}
\displaystyle\leq\sum_{t=1}^{T}\frac{C^{2}}{T}{\mathbb{E}}\left\{e_{t}^{2}\mid\mathcal{H}_{t-1}\right\}\leq\sum_{t=1}^{T}\frac{C^{2}}{T}\max\{\sigma_{0}^{2},\sigma_{1}^{2}\}=C^{2}\max\{\sigma_{0}^{2},\sigma_{1}^{2}\}<\infty.

Therefore, we have

d_{1}(f)\equiv\left\|\mathbb{E}\left(\left|\mathcal{F}_{\kappa}(\bm{x}_{1},a_{1},r_{1})\right|\mid\mathcal{H}_{0}\right)\right\|_{\infty}<\infty,

and

d2(f)𝔼((κ(𝒙1,a1,r1))20)1/2<.d_{2}(f)\equiv\left\|\mathbb{E}\left(\left(\mathcal{F}_{\kappa}(\bm{x}_{1},a_{1},r_{1})\right)^{2}\mid\mathcal{H}_{0}\right)\right\|_{\infty}^{1/2}<\infty.

It follows from the maximal inequality developed in Section 4.2 of Dedecker and Louhichi (2002) that there exists some constant K1K\geq 1 such that

𝔼[||𝔾n||]K(d2(f)+1Tmax1tT|κ(𝒙t,at,rt)𝔼(κ(𝒙t,at,rt)t1)|)\displaystyle{\mathbb{E}}\Big{[}||\mathbb{G}_{n}||_{\mathcal{F}}\Big{]}\lesssim K\left(d_{2}(f)+\frac{1}{\sqrt{T}}\left\|\max_{1\leq t\leq T}\left|\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})-\mathbb{E}\left(\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})\mid\mathcal{H}_{t-1}\right)\right|\right\|\right)

The above right-hand-side is upper bounded by

𝒪(1)T1/2,\displaystyle\mathcal{O}(1)\sqrt{T^{-1/2}},

where 𝒪(1)\mathcal{O}(1) denotes some universal constant. Hence, we have

𝔼[𝔾n]=𝒪p(T1/2).\displaystyle{\mathbb{E}}\Big{[}||\mathbb{G}_{n}||_{\mathcal{F}}\Big{]}=\mathcal{O}_{p}(T^{-1/2}). (B.39)

Combined with Equation (B.38), we have

1Tt=1Tκ(𝒙t,at,rt)=𝒪p(T1/2).\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\mathcal{F}_{\kappa}(\bm{x}_{t},a_{t},r_{t})=\mathcal{O}_{p}(T^{-1/2}).

and ψ6=op(1)\psi_{6}=o_{p}(1).

Next, for ψ7\psi_{7}, by triangle inequality, we have

\displaystyle|\psi_{7}| =\left|\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\frac{1}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}e_{t}\mathbb{I}\{a_{t}\neq\widehat{\pi}_{t}(\bm{x}_{t})\}\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}\right|
\displaystyle\leq\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\left|\frac{1}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\right||e_{t}|\mathbb{I}\{a_{t}\neq\widehat{\pi}_{t}(\bm{x}_{t})\}\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}
\displaystyle\leq\frac{1}{C_{1}}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}|e_{t}|\mathbb{I}\{a_{t}\neq\widehat{\pi}_{t}(\bm{x}_{t})\}.

Since ${\mbox{Pr}}\{a_{t}\neq\widehat{\pi}_{t}(\bm{x}_{t})\}$ vanishes as $t\rightarrow\infty$ by Theorem 1 and Lemma B.2, we have $\frac{1}{\sqrt{T}}\sum_{t=1}^{T}|e_{t}|\mathbb{I}\{a_{t}\neq\widehat{\pi}_{t}(\bm{x}_{t})\}=o_{p}(1)$, and hence

|\psi_{7}|\leq\frac{1}{C_{1}}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}|e_{t}|\mathbb{I}\{a_{t}\neq\widehat{\pi}_{t}(\bm{x}_{t})\}=o_{p}(1).

Therefore, $\eta_{6}=\psi_{6}-\psi_{7}=o_{p}(1)$. Hence, $\widetilde{V}_{T}=\overline{V}_{T}+o_{p}(T^{-1/2})$ holds.

Step 3: Then, to show the asymptotic normality of the proposed value estimator under DREAM, based on the above two steps, it is sufficient to show

T(V^TV)=T(V¯TV)+op(1)D𝒩(0,σDR2),\sqrt{T}(\widehat{V}_{T}-V^{*})=\sqrt{T}(\overline{V}_{T}-V^{*})+o_{p}(1)\stackrel{{\scriptstyle D}}{{\longrightarrow}}\mathcal{N}\left(0,\sigma_{DR}^{2}\right), (B.40)

as $T\rightarrow\infty$, using the Martingale Central Limit Theorem. Since $r_{t}={\mu}\{\bm{x}_{t},a_{t}\}+e_{t}$, we define

ξt\displaystyle\xi_{t} =𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}[rtμ{𝒙t,π(𝒙t)}]+μ{𝒙t,π(𝒙t)}\displaystyle=\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}\Big{[}r_{t}-{\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}\Big{]}+{\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}
=𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}etZt+μ{𝒙t,π(𝒙t)}VWt.\displaystyle=\underbrace{\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}e_{t}}_{Z_{t}}+\underbrace{{\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}-V^{*}}_{W_{t}}.

By (B.5), we have

𝔼{Ztt1}\displaystyle{\mathbb{E}}\{Z_{t}\mid\mathcal{H}_{t-1}\} =𝔼{𝔼[Ztat]t1}=𝔼{𝔼[𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}etat,𝒙t]t1}\displaystyle={\mathbb{E}}\{{\mathbb{E}}\left[Z_{t}\mid a_{t}\right]\mid\mathcal{H}_{t-1}\}={\mathbb{E}}\left\{{\mathbb{E}}\left[\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}e_{t}\mid a_{t},\bm{x}_{t}\right]\mid\mathcal{H}_{t-1}\right\}
=𝔼{𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}𝔼[etat,𝒙t]t1}.\displaystyle={\mathbb{E}}\left\{\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{\mathbb{E}}\left[e_{t}\mid a_{t},\bm{x}_{t}\right]\mid\mathcal{H}_{t-1}\right\}.

Since $e_{t}$ is independent of $\mathcal{H}_{t-1}$ and $\bm{x}_{t}$ given $a_{t}$, and ${\mathbb{E}}\left\{e_{t}\mid a_{t}\right\}=0$, we have

𝔼{Ztt1}=𝔼{𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}𝔼{etat}t1}=0.{\mathbb{E}}\{Z_{t}\mid\mathcal{H}_{t-1}\}={\mathbb{E}}\left\{\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{\mathbb{E}}\left\{e_{t}\mid a_{t}\right\}\mid\mathcal{H}_{t-1}\right\}=0.

Notice that 𝔼[μ{𝒙t,π(𝒙t)}]=V{\mathbb{E}}[{\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}]=V^{*} by the definition, we have

𝔼{Wtt1}=𝔼{μ{𝒙t,π(𝒙t)}V}=0.{\mathbb{E}}\{W_{t}\mid\mathcal{H}_{t-1}\}={\mathbb{E}}\left\{{\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}-V^{*}\right\}=0.

Thus, we have 𝔼{ξtt1}=0{\mathbb{E}}\{\xi_{t}\mid\mathcal{H}_{t-1}\}=0, which implies that {Zt}t=1T\{Z_{t}\}_{t=1}^{T}, {Wt}t=1T\{W_{t}\}_{t=1}^{T} and {ξt}t=1T\{\xi_{t}\}_{t=1}^{T} are Martingale difference sequences. To prove Equation (B.40), it suffices to prove that (1/T)t=1Tξt𝐷𝒩(0,σDR2)(1/\sqrt{T})\sum_{t=1}^{T}\xi_{t}\overset{D}{\longrightarrow}\mathcal{N}(0,\sigma_{DR}^{2}), as TT\rightarrow\infty, using Martingale Central Limit Theorem.

Firstly we calculate the conditional variance of ξt\xi_{t} given t1\mathcal{H}_{t-1}. Note that

𝔼(Zt2t1)\displaystyle{\mathbb{E}}\left(Z_{t}^{2}\mid\mathcal{H}_{t-1}\right) =𝔼{(𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}et)2t1}=𝔼(𝕀{at=π(𝒙t)}[Pr{at=π(𝒙t)}]2et2t1),\displaystyle={\mathbb{E}}\left\{\left(\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}e_{t}\right)^{2}\mid\mathcal{H}_{t-1}\right\}={\mathbb{E}}\left(\frac{\mathbb{I}\{a_{t}=\pi^{*}(\bm{x}_{t})\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}]^{2}}e_{t}^{2}\mid\mathcal{H}_{t-1}\right),

and

𝔼(Zt2t1)\displaystyle{\mathbb{E}}\left(Z_{t}^{2}\mid\mathcal{H}_{t-1}\right) =𝔼[𝔼(𝕀{at=π(𝒙t)}[Pr{at=π(𝒙t)}]2et2at,𝒙t)t1]\displaystyle={\mathbb{E}}\left[{\mathbb{E}}\left(\frac{\mathbb{I}\{a_{t}=\pi^{*}(\bm{x}_{t})\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}]^{2}}e_{t}^{2}\ \mid a_{t},\bm{x}_{t}\right)\mid\mathcal{H}_{t-1}\right]
=𝔼[𝕀{at=π(𝒙t)}[Pr{at=π(𝒙t)}]2𝔼(et2at,𝒙t)t1].\displaystyle={\mathbb{E}}\left[\frac{\mathbb{I}\{a_{t}=\pi^{*}(\bm{x}_{t})\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}]^{2}}{\mathbb{E}}\left(e_{t}^{2}\ \mid a_{t},\bm{x}_{t}\right)\mid\mathcal{H}_{t-1}\right].

Since $e_{t}$ is independent of $\mathcal{H}_{t-1}$ and $\bm{x}_{t}$ given $a_{t}$, we have

1Tt=1T𝔼(Zt2t1)\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}\left(Z_{t}^{2}\mid\mathcal{H}_{t-1}\right) =1Tt=1T𝔼[𝕀{at=π(𝒙t)}[Pr{at=π(𝒙t)}]2𝔼(et2at)t1]\displaystyle=\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}\left[\frac{\mathbb{I}\{a_{t}=\pi^{*}(\bm{x}_{t})\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}]^{2}}{\mathbb{E}}\left(e_{t}^{2}\ \mid a_{t}\right)\mid\mathcal{H}_{t-1}\right]
\displaystyle=\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}\left(\frac{\mathbb{I}\{a_{t}=\pi^{*}(\bm{x}_{t})\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}]^{2}}\sigma_{a_{t}}^{2}\mid\mathcal{H}_{t-1}\right)
=1Tt=1T𝔼[1[Pr{at=π(𝒙t)}]2σπ(𝒙t)2𝔼(𝕀{at=π(𝒙t)}𝒙t,t1)].\displaystyle=\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}\left[\frac{1}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}]^{2}}\sigma_{\pi^{*}(\bm{x}_{t})}^{2}{\mathbb{E}}\left(\mathbb{I}\{a_{t}=\pi^{*}(\bm{x}_{t})\}\mid\bm{x}_{t},\mathcal{H}_{t-1}\right)\right].

By the definition of Equation (B.25),

νi(𝒙i,i1)Pr{aiπ(𝒙i)|𝒙i,i1}=𝔼[𝕀{aiπ(𝒙i)}|𝒙i,i1],\nu_{i}\left(\bm{x}_{i},\mathcal{H}_{i-1}\right)\equiv\operatorname{Pr}\left\{a_{i}\neq\pi^{*}(\bm{x}_{i})\right|\bm{x}_{i},\mathcal{H}_{i-1}\}=\mathbb{E}\left[\mathbb{I}\left\{a_{i}\neq\pi^{*}(\bm{x}_{i})\right\}|\bm{x}_{i},\mathcal{H}_{i-1}\right],

we have

1Tt=1T𝔼(Zt2t1)\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}\left(Z_{t}^{2}\mid\mathcal{H}_{t-1}\right) =1Tt=1T𝔼[1[Pr{at=π(𝒙t)}]2σπ(𝒙t)2{1νt(𝒙,t1)}]\displaystyle=\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}\left[\frac{1}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}]^{2}}\sigma_{\pi^{*}(\bm{x}_{t})}^{2}\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}\right]
=1Tt=1T{1νt(𝒙,t1)}[Pr{at=π(𝒙)}]2σπ(𝒙)2𝑑P𝒳\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\int\frac{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}\sigma_{\pi^{*}(\bm{x})}^{2}dP_{\mathcal{X}}
=[1Tt=1T{1νt(𝒙,t1)}[Pr{at=π(𝒙)}]2]σπ(𝒙)2𝑑P𝒳.\displaystyle=\int\left[\frac{1}{T}\sum_{t=1}^{T}\frac{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}\right]\sigma_{\pi^{*}(\bm{x})}^{2}dP_{\mathcal{X}}.

Similarly to before, since $\lim_{t\rightarrow\infty}{\mbox{Pr}}\{a_{t}\neq\pi^{*}(\bm{x})\}=\kappa_{\infty}(\bm{x})$, for any $\epsilon>0$ there exists a constant $T_{0}>0$ such that $\left|{\mbox{Pr}}\{a_{t}\neq\pi^{*}(\bm{x})\}-\kappa_{\infty}(\bm{x})\right|<\epsilon$ for all $t\geq T_{0}$.
We first consider the expectation of $(1/T)\sum_{t=1}^{T}{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}/[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}$. Note that ${\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}$ does not depend on $\mathcal{H}_{t-1}$; thus we have

𝔼[1Tt=1T{1νt(𝒙,t1)}[Pr{at=π(𝒙)}]2]\displaystyle{\mathbb{E}}\left[\frac{1}{T}\sum_{t=1}^{T}\frac{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}\right] =1Tt=1T𝔼{1νt(𝒙,t1)}[Pr{at=π(𝒙)}]2=1Tt=1TPr{at=π(𝒙)}[Pr{at=π(𝒙)}]2\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\frac{{\mathbb{E}}\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}=\frac{1}{T}\sum_{t=1}^{T}\frac{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}
=1Tt=1T1Pr{at=π(𝒙)}.\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\frac{1}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}}.

Therefore by the triangle inequality we have

|𝔼[1Tt=1T{1νt(𝒙,t1)}[Pr{at=π(𝒙)}]2]11κ(𝒙)|=|1Tt=1T1Pr{at=π(𝒙)}11κ(𝒙)|\displaystyle~{}~{}~{}\left|{\mathbb{E}}\left[\frac{1}{T}\sum_{t=1}^{T}\frac{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}\right]-\frac{1}{1-\kappa_{\infty}(\bm{x})}\right|=\left|\frac{1}{T}\sum_{t=1}^{T}\frac{1}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}}-\frac{1}{1-\kappa_{\infty}(\bm{x})}\right|
=|1Tt=1T[1Pr{at=π(𝒙)}11κ(𝒙)]|=|1Tt=1T1κ(𝒙)Pr{at=π(𝒙)}Pr{at=π(𝒙)}{1κ(𝒙)}|\displaystyle=\left|\frac{1}{T}\sum_{t=1}^{T}\left[\frac{1}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}}-\frac{1}{1-\kappa_{\infty}(\bm{x})}\right]\right|=\left|\frac{1}{T}\sum_{t=1}^{T}\frac{1-\kappa_{\infty}(\bm{x})-{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\left\{1-\kappa_{\infty}(\bm{x})\right\}}\right|
\displaystyle=\left|\frac{1}{T}\sum_{t=1}^{T}\frac{{\mbox{Pr}}\{a_{t}\neq{\pi}^{*}(\bm{x})\}-\kappa_{\infty}(\bm{x})}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\left\{1-\kappa_{\infty}(\bm{x})\right\}}\right|\leq\frac{1}{T}\sum_{t=1}^{T}\frac{\left|{\mbox{Pr}}\{a_{t}\neq{\pi}^{*}(\bm{x})\}-\kappa_{\infty}(\bm{x})\right|}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\left\{1-\kappa_{\infty}(\bm{x})\right\}}
\displaystyle=\frac{1}{T}\sum_{t=1}^{T_{0}}\frac{\left|{\mbox{Pr}}\{a_{t}\neq{\pi}^{*}(\bm{x})\}-\kappa_{\infty}(\bm{x})\right|}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\left\{1-\kappa_{\infty}(\bm{x})\right\}}+\frac{1}{T}\sum_{t=T_{0}}^{T}\frac{\left|{\mbox{Pr}}\{a_{t}\neq{\pi}^{*}(\bm{x})\}-\kappa_{\infty}(\bm{x})\right|}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\left\{1-\kappa_{\infty}(\bm{x})\right\}}
<1Tt=1T0|Pr{atπ(𝒙)}|+|κ(𝒙)|Pr{at=π(𝒙)}{1κ(𝒙)}+1Tt=T0TϵPr{at=π(𝒙)}{1κ(𝒙)}.\displaystyle<\frac{1}{T}\sum_{t=1}^{T_{0}}\frac{\left|{\mbox{Pr}}\{a_{t}\neq{\pi}^{*}(\bm{x})\}\right|+\left|\kappa_{\infty}(\bm{x})\right|}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\left\{1-\kappa_{\infty}(\bm{x})\right\}}+\frac{1}{T}\sum_{t=T_{0}}^{T}\frac{\epsilon}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\left\{1-\kappa_{\infty}(\bm{x})\right\}}.

Since the above equation holds for any ϵ>0\epsilon>0, we have

|𝔼[1Tt=1T1νt(𝒙,t1)[Pr{at=π(𝒙)}]2]11κ(𝒙)|\displaystyle\left|{\mathbb{E}}\left[\frac{1}{T}\sum_{t=1}^{T}\frac{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}\right]-\frac{1}{1-\kappa_{\infty}(\bm{x})}\right| 1Tt=1T0|Pr{at=π(𝒙)}|+|κ(𝒙)|Pr{at=π(𝒙)}{1κ(𝒙)}\displaystyle\leq\frac{1}{T}\sum_{t=1}^{T_{0}}\frac{\left|{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\right|+\left|\kappa_{\infty}(\bm{x})\right|}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\left\{1-\kappa_{\infty}(\bm{x})\right\}}
1Tt=1T02Pr{at=π(𝒙)}{1κ(𝒙)}.\displaystyle\leq\frac{1}{T}\sum_{t=1}^{T_{0}}\frac{2}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\left\{1-\kappa_{\infty}(\bm{x})\right\}}.

Since $\kappa_{\infty}(\bm{x})\leq C_{2}<1$ almost surely for some constant $0<C_{2}<1$ (by the definition of a valid bandit algorithm and the results of Theorem 1), and ${\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}>C_{1}$ for some constant $C_{1}>0$ by Equation (B.37), we have

|𝔼[1Tt=1T{1νt(𝒙,t1)}[Pr{at=π(𝒙)}]2]11κ(𝒙)|1Tt=1T02C1(1C2)=2T0C1(1C2)T0,\displaystyle\left|{\mathbb{E}}\left[\frac{1}{T}\sum_{t=1}^{T}\frac{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}\right]-\frac{1}{1-\kappa_{\infty}(\bm{x})}\right|\leq\frac{1}{T}\sum_{t=1}^{T_{0}}\frac{2}{C_{1}\left(1-C_{2}\right)}=\frac{2T_{0}}{C_{1}\left(1-C_{2}\right)T}\rightarrow 0,

as TT\rightarrow\infty.
Then we consider the variance of (1/T)t=1T{1νt(𝒙,t1)}/[Pr{at=π(𝒙)}]2(1/T)\sum_{t=1}^{T}{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}/[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}} over different histories. By Lemma B.1, we have

Var[νt(𝒙,t1)]Pr{at=π(𝒙)}[1Pr{at=π(𝒙)}].\displaystyle\operatorname{Var}\left[\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right]\leq{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\left[1-{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\right].

Therefore, we have

Var[1Tt=1T{1νt(𝒙,t1)}[Pr{at=π(𝒙)}]2]=1T2Var[t=1T{1νt(𝒙,t1)}[Pr{at=π(𝒙)}]2]\displaystyle{\mbox{Var}}\left[\frac{1}{T}\sum_{t=1}^{T}\frac{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}\right]=\frac{1}{T^{2}}{\mbox{Var}}\left[\sum_{t=1}^{T}\frac{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}\right]
\displaystyle\leq 1Tt=1TVar[{1νt(𝒙,t1)}[Pr{at=π(𝒙)}]2]=1Tt=1TVar[{1νt(𝒙,t1)}][Pr{at=π(𝒙)}]4=1Tt=1TVar[{νt(𝒙,t1)}][Pr{at=π(𝒙)}]4\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mbox{Var}}\left[\frac{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}\right]=\frac{1}{T}\sum_{t=1}^{T}\frac{{\mbox{Var}}\left[\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}\right]}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{4}}=\frac{1}{T}\sum_{t=1}^{T}\frac{{\mbox{Var}}\left[\left\{\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}\right]}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{4}}
\displaystyle\leq 1Tt=1TPr{at=π(𝒙)}[1Pr{at=π(𝒙)}][Pr{at=π(𝒙)}]4=1Tt=1T1Pr{at=π(𝒙)}[Pr{at=π(𝒙)}]3\displaystyle\frac{1}{T}\sum_{t=1}^{T}\frac{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\left[1-{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}\right]}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{4}}=\frac{1}{T}\sum_{t=1}^{T}\frac{1-{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{3}}
\displaystyle\leq 1Tt=1T1Pr{at=π(𝒙)}[1C2]3=1[1C2]31Tt=1TPr{atπ(𝒙)}.\displaystyle\frac{1}{T}\sum_{t=1}^{T}\frac{1-{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}}{[1-C_{2}]^{3}}=\frac{1}{[1-C_{2}]^{3}}\frac{1}{T}\sum_{t=1}^{T}{\mbox{Pr}}\{a_{t}\neq{\pi}^{*}(\bm{x})\}.

Similarly to before, we can show that

1Tt=1TPr{atπ(𝒙)}κ(𝒙),\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mbox{Pr}}\{a_{t}\neq{\pi}^{*}(\bm{x})\}\rightarrow\kappa_{\infty}(\bm{x}),

from which it follows that

Var[1Tt=1T{1νt(𝒙,t1)}[Pr{at=π(𝒙)}]2]1[1C2]3κ(𝒙).{\mbox{Var}}\left[\frac{1}{T}\sum_{t=1}^{T}\frac{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}\right]\rightarrow\frac{1}{[1-C_{2}]^{3}}\kappa_{\infty}(\bm{x}).

Therefore, as TT goes to infinity, we have

1Tt=1T{1νt(𝒙,t1)}[Pr{at=π(𝒙)}]211κ(𝒙),\frac{1}{T}\sum_{t=1}^{T}\frac{\left\{1-\nu_{t}\left(\bm{x},\mathcal{H}_{t-1}\right)\right\}}{[{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x})\}]^{2}}\to\frac{1}{1-\kappa_{\infty}(\bm{x})},

and

1Tt=1T𝔼(Zt2t1)11κ(𝒙)σπ(𝒙)2𝑑P𝒳.\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}\left(Z_{t}^{2}\mid\mathcal{H}_{t-1}\right)\rightarrow\int\frac{1}{1-\kappa_{\infty}(\bm{x})}\sigma_{\pi^{*}(\bm{x})}^{2}dP_{\mathcal{X}}.

Using the same technique of conditioning on ata_{t} and 𝒙t\bm{x}_{t}, we have

𝔼(WtZtt1)\displaystyle{\mathbb{E}}\left(W_{t}Z_{t}\mid\mathcal{H}_{t-1}\right) =𝔼{(μ{𝒙t,π(𝒙t)}V)𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}ett1}\displaystyle={\mathbb{E}}\left\{\left({\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}-V^{*}\right)\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}e_{t}\mid\mathcal{H}_{t-1}\right\}
=𝔼{𝔼[(μ{𝒙t,π(𝒙t)}V)𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}etat,𝒙t]t1}\displaystyle={\mathbb{E}}\left\{{\mathbb{E}}\left[\left({\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}-V^{*}\right)\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}e_{t}\mid a_{t},\bm{x}_{t}\right]\mid\mathcal{H}_{t-1}\right\}
=𝔼{(μ{𝒙t,π(𝒙t)}V)𝕀{at=π(𝒙t)}Pr{at=π(𝒙t)}𝔼(etat)t1}=0.\displaystyle={\mathbb{E}}\left\{\left({\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}-V^{*}\right)\frac{\mathbb{I}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{{\mbox{Pr}}\{a_{t}={\pi}^{*}(\bm{x}_{t})\}}{\mathbb{E}}\left(e_{t}\mid a_{t}\right)\mid\mathcal{H}_{t-1}\right\}=0.

Thus, we further have

𝔼(ξt2t1)=𝔼{(Zt+Wt)2t1}=𝔼(Zt2t1)+𝔼(Wt2t1),{\mathbb{E}}\left(\xi_{t}^{2}\mid\mathcal{H}_{t-1}\right)={\mathbb{E}}\left\{\left(Z_{t}+W_{t}\right)^{2}\mid\mathcal{H}_{t-1}\right\}={\mathbb{E}}\left(Z_{t}^{2}\mid\mathcal{H}_{t-1}\right)+{\mathbb{E}}\left(W_{t}^{2}\mid\mathcal{H}_{t-1}\right),

and

1Tt=1T𝔼(ξt2t1)=11κ(𝒙)σπ(𝒙)2𝑑P𝒳+1Tt=1TVar[μ{𝒙t,π(𝒙t)}t1].\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}\left(\xi_{t}^{2}\mid\mathcal{H}_{t-1}\right)=\int\frac{1}{1-\kappa_{\infty}(\bm{x})}\sigma_{\pi^{*}(\bm{x})}^{2}dP_{\mathcal{X}}+\frac{1}{T}\sum_{t=1}^{T}{\mbox{Var}}\left[{\mu}\{\bm{x}_{t},{\pi}^{*}(\bm{x}_{t})\}\mid\mathcal{H}_{t-1}\right].

Therefore as TT goes to infinity, we have

t=1T𝔼{(1Tξt)2t1}σDR2,\sum_{t=1}^{T}{\mathbb{E}}\left\{\left(\frac{1}{\sqrt{T}}\xi_{t}\right)^{2}\mid\mathcal{H}_{t-1}\right\}\longrightarrow\sigma_{DR}^{2},

where

σDR2=𝒙σ12𝕀{μ(𝒙,1)>μ(𝒙,0)}+σ02𝕀{μ(𝒙,1)<μ(𝒙,0)}1κ(𝒙)𝑑P𝒳+Var[μ{𝒙,π(𝒙)}].\sigma_{DR}^{2}=\int_{\bm{x}}\frac{\sigma_{1}^{2}\mathbb{I}\{\mu(\bm{x},1)>\mu(\bm{x},0)\}+\sigma_{0}^{2}\mathbb{I}\{\mu(\bm{x},1)<\mu(\bm{x},0)\}}{1-\kappa_{\infty}(\bm{x})}dP_{\mathcal{X}}+{\mbox{Var}}\left[{\mu}\{\bm{x},{\pi}^{*}(\bm{x})\}\right]. (B.42)
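For a concrete reading of (B.42), the following Python sketch approximates $\sigma_{DR}^{2}$ by Monte Carlo in a toy two-armed model; the context distribution, conditional means, noise variances, and the constant exploration limit $\kappa_{\infty}$ are all our own illustrative choices rather than quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
x = rng.uniform(-1.0, 1.0, n)          # scalar context, our own choice
mu1, mu0 = 1.0 + x, 0.5 - x            # toy conditional means mu(x, 1) and mu(x, 0)
sigma1_sq, sigma0_sq = 1.0, 1.5        # arm-wise noise variances
kappa_inf = 0.1                        # limiting exploration probability, constant in x

opt_is_one = mu1 > mu0
sigma_opt_sq = np.where(opt_is_one, sigma1_sq, sigma0_sq)
mu_opt = np.where(opt_is_one, mu1, mu0)

# Monte Carlo version of (B.42): integral term plus the variance of the optimal mean
sigma_dr_sq = np.mean(sigma_opt_sq / (1.0 - kappa_inf)) + np.var(mu_opt)
print(round(sigma_dr_sq, 4))
```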

Then we check the conditional Lindeberg condition. For any h>0h>0, we have

t=1T𝔼{(1Tξt)2𝕀{|1Tξt|>h}t1}=1Tt=1T𝔼{ξt2𝕀{ξt2>Th2}t1}.\displaystyle\sum_{t=1}^{T}{\mathbb{E}}\left\{\left(\frac{1}{\sqrt{T}}\xi_{t}\right)^{2}\mathbb{I}\left\{\left|\frac{1}{\sqrt{T}}\xi_{t}\right|>h\right\}\mid\mathcal{H}_{t-1}\right\}=\frac{1}{T}\sum_{t=1}^{T}{\mathbb{E}}\left\{\xi_{t}^{2}\mathbb{I}\left\{\xi_{t}^{2}>Th^{2}\right\}\mid\mathcal{H}_{t-1}\right\}.

Since $\xi_{t}^{2}\mathbb{I}\left\{\xi_{t}^{2}>Th^{2}\right\}$ converges to zero as $T$ goes to infinity and is dominated by $\xi_{t}^{2}$ given $\mathcal{H}_{t-1}$, by the Dominated Convergence Theorem we conclude that

\sum_{t=1}^{T}{\mathbb{E}}\left\{\left(\frac{1}{\sqrt{T}}\xi_{t}\right)^{2}\mathbb{I}\left\{\left|\frac{1}{\sqrt{T}}\xi_{t}\right|>h\right\}\mid\mathcal{H}_{t-1}\right\}\rightarrow 0,\text{ as }T\rightarrow\infty.

Thus the conditional Lindeberg condition is checked.

Next, recall the derived conditional variance in (B.42). By the Martingale Central Limit Theorem, we have

(1/T)t=1Tξt𝐷𝒩(0,σDR2),(1/\sqrt{T})\sum_{t=1}^{T}\xi_{t}\overset{D}{\longrightarrow}\mathcal{N}(0,\sigma_{DR}^{2}),

as TT\rightarrow\infty. Hence, we complete the proof of Equation (B.40).
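For intuition on the martingale central limit argument invoked above, the following Python sketch (a purely illustrative toy, not the paper's construction) simulates a generic martingale difference sequence whose conditional variance adapts to the history and checks that the scaled sum behaves approximately normally; all numerical choices are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_rep = 2000, 500

def scaled_martingale_sum(T, rng):
    # Toy martingale difference sequence: each xi_t has conditional mean zero
    # given the past, because the Gaussian draw is independent of the
    # history-measurable scale s.
    s, total = 1.0, 0.0
    for t in range(1, T + 1):
        xi = s * rng.standard_normal()               # E[xi_t | H_{t-1}] = 0
        total += xi
        s = 1.0 + 0.5 * np.tanh(total / (t + 1.0))   # scale adapts to history and tends to 1
    return total / np.sqrt(T)

draws = np.array([scaled_martingale_sum(T, rng) for _ in range(n_rep)])
print("sample mean:", round(draws.mean(), 3), " sample sd:", round(draws.std(), 3))
```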

Step 4: Finally, we show that the variance estimator in Equation (5) in the main paper is a consistent estimator of $\sigma_{DR}^{2}$. Recall that the variance estimator is

σ^T2\displaystyle\widehat{\sigma}_{T}^{2} =1Tt=1Tσ^1,t12(𝒙t,1)𝕀{μ^t1(𝒙t,1)>μ^t1(𝒙t,0)}+σ^0,t12𝕀{μ^t1(𝒙t,1)<μ^t1(𝒙t,0)}1κ^t(𝒙t)σ^T,12\displaystyle=\underbrace{\frac{1}{T}\sum_{t=1}^{T}\frac{\widehat{\sigma}_{1,t-1}^{2}(\bm{x}_{t},1)\mathbb{I}\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)>\widehat{\mu}_{t-1}(\bm{x}_{t},0)\}+\widehat{\sigma}_{0,t-1}^{2}\mathbb{I}\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)<\widehat{\mu}_{t-1}(\bm{x}_{t},0)\}}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}}_{\widehat{\sigma}_{T,1}^{2}} (B.43)
+1Tt=1T[μ^T{𝒙t,π^T(𝒙t)}1Tt=1Tμ^T{𝒙t,π^T(𝒙t)}]2σ^T,22.\displaystyle+\underbrace{\frac{1}{T}\sum_{t=1}^{T}\left[\widehat{\mu}_{T}\{\bm{x}_{t},\widehat{\pi}_{T}(\bm{x}_{t})\}-\frac{1}{T}\sum_{t=1}^{T}\widehat{\mu}_{T}\{\bm{x}_{t},\widehat{\pi}_{T}(\bm{x}_{t})\}\right]^{2}}_{\widehat{\sigma}_{T,2}^{2}}.
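To make the plug-in form of (B.43) concrete, here is a minimal Python sketch that assembles $\widehat{\sigma}_{T,1}^{2}+\widehat{\sigma}_{T,2}^{2}$ from per-step arrays; the array names and the synthetic inputs are hypothetical placeholders rather than the paper's implementation, and the per-arm residual variances are assumed to be computed with the degrees-of-freedom correction recalled later in this step.

```python
import numpy as np

def dream_variance_estimate(mu1_hat, mu0_hat, sigma1_hat, sigma0_hat,
                            kappa_hat, mu_opt_hat):
    """Plug-in variance estimate mirroring Equation (B.43).

    mu1_hat, mu0_hat       : arrays of mu_hat_{t-1}(x_t, 1) and mu_hat_{t-1}(x_t, 0)
    sigma1_hat, sigma0_hat : arrays of per-arm residual variance estimates
    kappa_hat              : array of estimated exploration probabilities kappa_hat_t(x_t)
    mu_opt_hat             : array of mu_hat_T{x_t, pi_hat_T(x_t)}
    All inputs are assumed to be logged by the online learner.
    """
    opt_is_one = mu1_hat > mu0_hat
    sigma_opt = np.where(opt_is_one, sigma1_hat, sigma0_hat)
    term1 = np.mean(sigma_opt / (1.0 - kappa_hat))   # \hat{sigma}^2_{T,1}
    term2 = np.var(mu_opt_hat)                       # \hat{sigma}^2_{T,2}, denominator T
    return term1 + term2

# Toy usage with synthetic placeholder inputs.
rng = np.random.default_rng(1)
T = 1000
sigma2_hat = dream_variance_estimate(
    mu1_hat=rng.normal(1.0, 0.1, T), mu0_hat=rng.normal(0.5, 0.1, T),
    sigma1_hat=np.full(T, 1.0), sigma0_hat=np.full(T, 1.5),
    kappa_hat=np.full(T, 0.1), mu_opt_hat=rng.normal(1.0, 0.3, T))
print(round(sigma2_hat, 3))
```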

First, we prove that the first line of the above Equation (B.43), $\widehat{\sigma}_{T,1}^{2}$, is a consistent estimator of

𝒙σ12𝕀{μ(𝒙,1)>μ(𝒙,0)}+σ02𝕀{μ(𝒙,1)<μ(𝒙,0)}1κ(𝒙)𝑑P𝒳.\int_{\bm{x}}\frac{\sigma_{1}^{2}\mathbb{I}\{\mu(\bm{x},1)>\mu(\bm{x},0)\}+\sigma_{0}^{2}\mathbb{I}\{\mu(\bm{x},1)<\mu(\bm{x},0)\}}{1-\kappa_{\infty}(\bm{x})}dP_{\mathcal{X}}.

Recall that we denote $\widehat{\Delta}_{\bm{x}_{t}}=\widehat{\mu}_{t-1}(\bm{x}_{t},1)-\widehat{\mu}_{t-1}(\bm{x}_{t},0)$; thus we can rewrite the first term $\widehat{\sigma}_{T,1}^{2}$ as

σ^T,12\displaystyle\widehat{\sigma}_{T,1}^{2} =1Tt=1Tσ^1,t12𝕀{μ^t1(𝒙t,1)>μ^t1(𝒙t,0)}+σ^0,t12(𝒙t,0)𝕀{μ^t1(𝒙t,1)<μ^t1(𝒙t,0)}1κ^t(𝒙t)\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\frac{\widehat{\sigma}_{1,t-1}^{2}\mathbb{I}\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)>\widehat{\mu}_{t-1}(\bm{x}_{t},0)\}+\widehat{\sigma}_{0,t-1}^{2}(\bm{x}_{t},0)\mathbb{I}\{\widehat{\mu}_{t-1}(\bm{x}_{t},1)<\widehat{\mu}_{t-1}(\bm{x}_{t},0)\}}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}
=1Tt=1Tσ^1,t12𝕀{Δ^𝒙t>0}+σ^0,t12𝕀{Δ^𝒙t<0}1κ^t(𝒙t).\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\frac{\widehat{\sigma}_{1,t-1}^{2}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}+\widehat{\sigma}_{0,t-1}^{2}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}<0\}}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}.

We decompose the proposed variance estimator by

σ^T,12\displaystyle\widehat{\sigma}_{T,1}^{2} =1Tt=1T{σ^1,t12σ12}𝕀{Δ^𝒙t>0}+{σ^0,t12σ02}𝕀{Δ^𝒙t<0}1κ^t(𝒙t)\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\frac{\left\{\widehat{\sigma}_{1,t-1}^{2}-\sigma_{1}^{2}\right\}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}+\left\{\widehat{\sigma}_{0,t-1}^{2}-\sigma_{0}^{2}\right\}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}<0\}}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}
+1Tt=1Tσ12(𝕀{Δ^𝒙t>0}𝕀{Δ𝒙t>0})+σ02(𝕀{Δ^𝒙t<0}𝕀{Δ𝒙t<0})1κ^t(𝒙t)\displaystyle+\frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_{1}^{2}\left(\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}\right)+\sigma_{0}^{2}\left(\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}<0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}\right)}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}
+1Tt=1T(σ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}){11κ^t(𝒙t)11κt(𝒙t)}\displaystyle+\frac{1}{T}\sum_{t=1}^{T}\left(\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}\right)\left\{\frac{1}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}-\frac{1}{1-\kappa_{t}(\bm{x}_{t})}\right\}
+1Tt=1Tσ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}1κt(𝒙t).\displaystyle+\frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}}{1-\kappa_{t}(\bm{x}_{t})}.

Our goal is to prove that the first three lines are all op(1)o_{p}(1).

Firstly, recall that

σ^a,t2\displaystyle\widehat{\sigma}_{a,t}^{2} ={i=1t𝕀(ai=a)d}1ai=a1it[μ^i{𝒙i,ai}ri]2\displaystyle=\{\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)-d\}^{-1}\sum_{a_{i}=a}^{1\leq i\leq t}[\widehat{\mu}_{i}\{\bm{x}_{i},a_{i}\}-r_{i}]^{2}
={i=1t𝕀(ai=a)d}1ai=a1it[𝒙i{𝜷^i1(a)𝜷i1(a)}ei]2.\displaystyle=\{\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)-d\}^{-1}\sum_{a_{i}=a}^{1\leq i\leq t}[\bm{x}_{i}^{\top}\left\{\widehat{\bm{\beta}}_{i-1}(a)-\bm{\beta}_{i-1}(a)\right\}-e_{i}]^{2}.
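For concreteness, a minimal Python sketch of this degrees-of-freedom-corrected residual variance estimator is given below; the variable names are hypothetical, and we assume $d$ is the covariate dimension of the linear model.

```python
import numpy as np

def residual_variance(fitted, rewards, d):
    """\hat{sigma}^2_{a,t}: mean squared residual over one arm's observations,
    dividing by the observation count minus d (a degrees-of-freedom correction)."""
    resid = np.asarray(fitted) - np.asarray(rewards)
    return np.sum(resid ** 2) / (len(resid) - d)

# Toy usage: 50 observations on a single arm, d = 3 covariates.
rng = np.random.default_rng(5)
rewards = rng.normal(1.0, 2.0, 50)
fitted = np.full(50, 1.0)
print(round(residual_variance(fitted, rewards, d=3), 3))
```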

By Lemma 4.1, we have 𝜷^i1(a)𝜷i1(a)1=op(1)\|\widehat{\bm{\beta}}_{i-1}(a)-\bm{\beta}_{i-1}(a)\|_{1}=o_{p}(1). Under Assumption 4.1, we have 𝒙i{𝜷^i1(a)𝜷i1(a)}=op(1)\bm{x}_{i}^{\top}\left\{\widehat{\bm{\beta}}_{i-1}(a)-\bm{\beta}_{i-1}(a)\right\}=o_{p}(1). Thus by Lemma 6 in Luedtke and Van Der Laan (2016), we have

σ^a,t2\displaystyle\widehat{\sigma}_{a,t}^{2} ={i=1t𝕀(ai=a)d}1ai=a1itei2+op(1).\displaystyle=\{\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)-d\}^{-1}\sum_{a_{i}=a}^{1\leq i\leq t}e_{i}^{2}+o_{p}(1).

Since the $e_{i}$ are i.i.d. conditional on $a_{i}$ with ${\mathbb{E}}(e_{i}^{2}\mid a_{i}=a)=\sigma_{a}^{2}$, and noting that

limti=1t𝕀(ai=a)i=1t𝕀(ai=a)d=1,\lim_{t\rightarrow\infty}\frac{\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)}{\sum_{i=1}^{t}\mathbb{I}(a_{i}=a)-d}=1,

by the Law of Large Numbers we have

σ^a,t2=σa2+op(1).\widehat{\sigma}_{a,t}^{2}=\sigma_{a}^{2}+o_{p}(1).

Therefore, the first line is op(1)o_{p}(1).

Secondly, denote the second line as

ψ8\displaystyle\psi_{8} =1Tt=1Tσ12(𝕀{Δ^𝒙t>0}𝕀{Δ𝒙t>0})+σ02(𝕀{Δ^𝒙t<0}𝕀{Δ𝒙t<0})1κ^t(𝒙t)\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_{1}^{2}\left(\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}\right)+\sigma_{0}^{2}\left(\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}<0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}\right)}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}
=1Tt=1Tσ12(𝕀{Δ^𝒙t>0}𝕀{Δ𝒙t>0})+σ02(𝕀{Δ𝒙t>0}𝕀{Δ^𝒙t>0})1κ^t(𝒙t)\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_{1}^{2}\left(\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}\right)+\sigma_{0}^{2}\left(\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}-\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}\right)}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}
=σ12σ02Tt=1T𝕀{Δ^𝒙t>0}𝕀{Δ𝒙t>0}1κ^t(𝒙t).\displaystyle=\frac{\sigma_{1}^{2}-\sigma_{0}^{2}}{T}\sum_{t=1}^{T}\frac{\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}}{1-\widehat{\kappa}_{t}(\bm{x}_{t})}.

Since $\kappa_{t}(\bm{x}_{t})\leq C_{2}<1$ almost surely for some constant $0<C_{2}<1$ (by the definition of a valid bandit algorithm and the results of Theorem 1), by the triangle inequality, we have

\displaystyle|\psi_{8}| \leq\frac{|\sigma_{1}^{2}-\sigma_{0}^{2}|}{T}\sum_{t=1}^{T}\frac{\left|\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}\right|}{\left|1-\widehat{\kappa}_{t}(\bm{x}_{t})\right|}\leq\frac{|\sigma_{1}^{2}-\sigma_{0}^{2}|}{T}\sum_{t=1}^{T}\frac{\left|\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}\right|}{1-C_{2}}
\displaystyle\leq\frac{|\sigma_{1}^{2}-\sigma_{0}^{2}|}{1-C_{2}}\frac{1}{T}\sum_{t=1}^{T}\left|\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}\right|.

Since ${\mbox{Pr}}\left(\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}=0\right)\geq 1-ct^{-\alpha\gamma}$ by Lemma B.2, there exists some constant $c$ such that

{\mathbb{E}}\left(\frac{1}{T}\sum_{t=1}^{T}\left|\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}\right|\right)=\frac{1}{T}\sum_{t=1}^{T}{\mbox{Pr}}\left(\left|\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}\right|\neq 0\right)\leq\frac{1}{T}\sum_{t=1}^{T}ct^{-\alpha\gamma}.

By Lemma 6 in Luedtke and Van Der Laan (2016), we have $T^{-1}\sum_{t=1}^{T}ct^{-\alpha\gamma}=\mathcal{O}(T^{-\alpha\gamma})$, thus by Markov's inequality,

\frac{1}{T}\sum_{t=1}^{T}\left|\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}\right|=\mathcal{O}_{p}(T^{-\alpha\gamma}),

which implies $|\psi_{8}|=o_{p}(1)$.

Lastly, under the assumption that $\widehat{\kappa}_{t}(\bm{x}_{t})$ is a consistent estimator of $\kappa_{t}(\bm{x}_{t})$, the third line is $o_{p}(1)$ by the continuous mapping theorem.

Given the above results, we have

σ^T,12=1Tt=1Tσ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}1κt(𝒙t)+op(1)=1Tt=1Tσ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}1Pr{atπ^t(xt)}+op(1).\widehat{\sigma}_{T,1}^{2}=\frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}}{1-\kappa_{t}(\bm{x}_{t})}+o_{p}(1)=\frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}}{1-\operatorname{Pr}\left\{a_{t}\neq\widehat{\pi}_{t}\left(x_{t}\right)\right\}}+o_{p}(1).

Thus, we can further express σ^T,12\widehat{\sigma}_{T,1}^{2} as

σ^T,12\displaystyle\widehat{\sigma}_{T,1}^{2} =1Tt=1T(σ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0})(11Pr{atπ^t(x)}11Pr{atπ(𝒙)})\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\left(\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}\right)\left(\frac{1}{1-\operatorname{Pr}\left\{a_{t}\neq\widehat{\pi}_{t}\left(x\right)\right\}}-\frac{1}{1-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}\right)\right\}}\right) (B.44)
+1Tt=1Tσ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}1Pr{atπ(𝒙t)}+op(1).\displaystyle+\frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}}{1-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}}+o_{p}(1).

Notice that

(11Pr{atπ^t(x)}11Pr{atπ(𝒙)})=Pr{atπ^t(x)}Pr{atπ(𝒙t)}(1Pr{atπ^t(x)})(1Pr{atπ(𝒙)}),\left(\frac{1}{1-\operatorname{Pr}\left\{a_{t}\neq\widehat{\pi}_{t}\left(x\right)\right\}}-\frac{1}{1-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}\right)\right\}}\right)=\frac{\operatorname{Pr}\left\{a_{t}\neq\widehat{\pi}_{t}\left(x\right)\right\}-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}}{\left(1-\operatorname{Pr}\left\{a_{t}\neq\widehat{\pi}_{t}\left(x\right)\right\}\right)\left(1-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}\right)\right\}\right)},

where

Pr{atπ^t(x)}Pr{atπ(𝒙t)}=𝔼(𝕀{atπ^t(x)}𝕀{atπ(𝒙t)})=𝔼(𝕀{π^t(x)π(𝒙t)}).\operatorname{Pr}\left\{a_{t}\neq\widehat{\pi}_{t}\left(x\right)\right\}-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}={\mathbb{E}}\left(\mathbb{I}\{a_{t}\neq\widehat{\pi}_{t}\left(x\right)\}-\mathbb{I}\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\}\right)={\mathbb{E}}\left(\mathbb{I}\{\widehat{\pi}_{t}\left(x\right)\neq\pi^{*}\left(\bm{x}_{t}\right)\}\right).

We also note that by Lemma B.2, there exists some constant cc and 0<α<120<\alpha<\frac{1}{2} such that αγ<12\alpha\gamma<\frac{1}{2} and

{\mathbb{E}}\left(\mathbb{I}\{\widehat{\pi}_{t}(\bm{x}_{t})\neq\pi^{*}(\bm{x}_{t})\}\right)={\mbox{Pr}}\left(\widehat{\pi}_{t}(\bm{x}_{t})\neq\pi^{*}(\bm{x}_{t})\right)\leq ct^{-\alpha\gamma},

therefore we have the result that

Pr{atπ^t(x)}Pr{atπ(𝒙t)}=𝔼(𝕀{π^t(x)π(𝒙t)})=op(1).\operatorname{Pr}\left\{a_{t}\neq\widehat{\pi}_{t}\left(x\right)\right\}-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}={\mathbb{E}}\left(\mathbb{I}\{\widehat{\pi}_{t}\left(x\right)\neq\pi^{*}\left(\bm{x}_{t}\right)\}\right)=o_{p}(1).

Thus, Equation (B.44) can be expressed as

σ^T,12\displaystyle\widehat{\sigma}_{T,1}^{2} =1Tt=1T(σ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0})op(1)+1Tt=1Tσ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}1Pr{atπ(𝒙t)}+op(1)\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\left(\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}\right)o_{p}(1)+\frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}}{1-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}}+o_{p}(1)
=1Tt=1Tσ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}1Pr{atπ(𝒙t)}+op(1)\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}}{1-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}}+o_{p}(1)
=1Tt=1T{σ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}}[11Pr{atπ(𝒙t)}11κ(𝒙t)]ψ9\displaystyle=\underbrace{\frac{1}{T}\sum_{t=1}^{T}\left\{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}\right\}\left[\frac{1}{1-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}}-\frac{1}{1-\kappa_{\infty}(\bm{x}_{t})}\right]}_{\psi_{9}}
+1Tt=1Tσ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}1κ(𝒙t)+op(1).\displaystyle+\frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}}{1-\kappa_{\infty}(\bm{x}_{t})}+o_{p}(1).

Note that

ψ9\displaystyle\psi_{9} =1Tt=1T{σ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}}[11Pr{atπ(𝒙t)}11κ(𝒙t)]\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\left\{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}\right\}\left[\frac{1}{1-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}}-\frac{1}{1-\kappa_{\infty}(\bm{x}_{t})}\right]
=1Tt=1T{σ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}}κ(𝒙t)Pr{atπ(𝒙t)}[1Pr{atπ(𝒙t)}][1κ(𝒙t)],\displaystyle=\frac{1}{T}\sum_{t=1}^{T}\left\{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}\right\}\frac{\kappa_{\infty}(\bm{x}_{t})-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}}{\left[1-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}\right]\left[1-\kappa_{\infty}(\bm{x}_{t})\right]},

which implies

|ψ9|\displaystyle|\psi_{9}| 1Tt=1T{σ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}}|κ(𝒙t)Pr{atπ(𝒙t)}|[1Pr{atπ(𝒙t)}][1κ(𝒙t)].\displaystyle\leq\frac{1}{T}\sum_{t=1}^{T}\left\{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}\right\}\frac{\left|\kappa_{\infty}(\bm{x}_{t})-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}\right|}{\left[1-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}\right]\left[1-\kappa_{\infty}(\bm{x}_{t})\right]}.

By Equation (B.37), for large enough $t$, there exists some constant $0<C_{1}<1$ such that

Pr{atπ(𝒙t)}<C1.{\mbox{Pr}}\{a_{t}\neq{\pi}^{*}(\bm{x}_{t})\}<C_{1}.

Since $\kappa_{t}(\bm{x}_{t})\leq C_{2}<1$ almost surely for some constant $0<C_{2}<1$ (by the definition of a valid bandit algorithm and the results shown in Theorem 1), we also have

κ(𝒙t)<C2.\kappa_{\infty}(\bm{x}_{t})<C_{2}.

Therefore,

σ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}[1Pr{atπ(𝒙t)}][1κ(𝒙t)]<max{σ02,σ12}(1C1)(1C2)C,\frac{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}}{\left[1-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}\right]\left[1-\kappa_{\infty}(\bm{x}_{t})\right]}<\frac{\max\{\sigma_{0}^{2},\sigma_{1}^{2}\}}{(1-C_{1})(1-C_{2})}\triangleq C,

which immediately implies that

|ψ9|\displaystyle|\psi_{9}| 1Tt=1TC|κ(𝒙t)Pr{atπ(𝒙t)}|.\displaystyle\leq\frac{1}{T}\sum_{t=1}^{T}C\left|\kappa_{\infty}(\bm{x}_{t})-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}\right|.

Since $\lim_{t\rightarrow\infty}\operatorname{Pr}\left\{a_{t}\neq\pi^{*}(\bm{x})\right\}=\kappa_{\infty}(\bm{x})$ for any $\bm{x}$, for any $\epsilon>0$ there exists some constant $T_{0}$ such that $\left|\operatorname{Pr}\left\{a_{t}\neq\pi^{*}(\bm{x})\right\}-\kappa_{\infty}(\bm{x})\right|<\epsilon$ for any $\bm{x}$ and all $t\geq T_{0}$. Thus we have

|ψ9|\displaystyle|\psi_{9}| 1Tt=1TC|κ(𝒙t)Pr{atπ(𝒙t)}|\displaystyle\leq\frac{1}{T}\sum_{t=1}^{T}C\left|\kappa_{\infty}(\bm{x}_{t})-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}\right|
=1Tt=1T0C|κ(𝒙t)Pr{atπ(𝒙t)}|+1Tt=T0TC|κ(𝒙t)Pr{atπ(𝒙t)}|\displaystyle=\frac{1}{T}\sum_{t=1}^{T_{0}}C\left|\kappa_{\infty}(\bm{x}_{t})-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}\right|+\frac{1}{T}\sum_{t=T_{0}}^{T}C\left|\kappa_{\infty}(\bm{x}_{t})-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}\right|
<1Tt=1T0C|κ(𝒙t)Pr{atπ(𝒙t)}|+1Tt=T0TCϵ.\displaystyle<\frac{1}{T}\sum_{t=1}^{T_{0}}C\left|\kappa_{\infty}(\bm{x}_{t})-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}\right|+\frac{1}{T}\sum_{t=T_{0}}^{T}C\epsilon.

Note that by the triangle inequality,

|κ(𝒙t)Pr{atπ(𝒙t)}||κ(𝒙t)|+|Pr{atπ(𝒙t)}|<C1+C2,\left|\kappa_{\infty}(\bm{x}_{t})-\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}\right|\leq\left|\kappa_{\infty}(\bm{x}_{t})\right|+\left|\operatorname{Pr}\left\{a_{t}\neq\pi^{*}\left(\bm{x}_{t}\right)\right\}\right|<C_{1}+C_{2},

thus we have

|\psi_{9}|<\frac{1}{T}\sum_{t=1}^{T_{0}}C\left(C_{1}+C_{2}\right)+\frac{1}{T}\sum_{t=T_{0}}^{T}C\epsilon=\frac{T_{0}C\left(C_{1}+C_{2}\right)}{T}+\frac{T-T_{0}}{T}C\epsilon.

Since the above inequality holds for any $\epsilon>0$ and the first term vanishes as $T\rightarrow\infty$, we have

\limsup_{T\rightarrow\infty}|\psi_{9}|\leq C\epsilon\quad\text{ for any }\epsilon>0,

which implies $\psi_{9}=o_{p}(1)$. Therefore, we have

σ^T,12=1Tt=1Tσ12𝕀{Δ𝒙t>0}+σ02𝕀{Δ𝒙t<0}1κ(𝒙t)+op(1).\widehat{\sigma}_{T,1}^{2}=\frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_{1}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}+\sigma_{0}^{2}\mathbb{I}\{\Delta_{\bm{x}_{t}}<0\}}{1-\kappa_{\infty}(\bm{x}_{t})}+o_{p}(1).

By the Law of Large Numbers, we further have

σ^T,12=𝒙σ12𝕀{μ(𝒙,1)>μ(𝒙,0)}+σ02𝕀{μ(𝒙,1)<μ(𝒙,0)}1κ(𝒙)𝑑P𝒳+op(1).\widehat{\sigma}_{T,1}^{2}=\int_{\bm{x}}\frac{\sigma_{1}^{2}\mathbb{I}\{\mu(\bm{x},1)>\mu(\bm{x},0)\}+\sigma_{0}^{2}\mathbb{I}\{\mu(\bm{x},1)<\mu(\bm{x},0)\}}{1-\kappa_{\infty}(\bm{x})}dP_{\mathcal{X}}+o_{p}(1).

Next, we prove that the second line of Equation (B.43) is a consistent estimator of ${\mbox{Var}}\left[{\mu}\{\bm{x},{\pi}^{*}(\bm{x})\}\right]$. Recall that

σ^T,22=1Tt=1T[μ^T{𝒙t,π^T(𝒙t)}1Tt=1Tμ^T{𝒙t,π^T(𝒙t)}]2.\widehat{\sigma}_{T,2}^{2}=\frac{1}{T}\sum_{t=1}^{T}\left[\widehat{\mu}_{T}\{\bm{x}_{t},\widehat{\pi}_{T}(\bm{x}_{t})\}-\frac{1}{T}\sum_{t=1}^{T}\widehat{\mu}_{T}\{\bm{x}_{t},\widehat{\pi}_{T}(\bm{x}_{t})\}\right]^{2}.

Since the $\bm{x}_{t}$ are i.i.d., $\widehat{\mu}_{T}\{\bm{x}_{t},\widehat{\pi}_{T}(\bm{x}_{t})\}$ are i.i.d. as well. Thus by the Law of Large Numbers, we have

1Tt=1Tμ^T{𝒙t,π^T(𝒙t)}=𝔼[μ^T{𝒙,π^T(𝒙)}]+op(1),\frac{1}{T}\sum_{t=1}^{T}\widehat{\mu}_{T}\{\bm{x}_{t},\widehat{\pi}_{T}(\bm{x}_{t})\}={\mathbb{E}}\left[\widehat{\mu}_{T}\{\bm{x},\widehat{\pi}_{T}(\bm{x})\}\right]+o_{p}(1),

and

σ^T,22=Var[μ^T{𝒙,π^T(𝒙)}]+op(1).\widehat{\sigma}_{T,2}^{2}={\mbox{Var}}\left[\widehat{\mu}_{T}\{\bm{x},\widehat{\pi}_{T}(\bm{x})\}\right]+o_{p}(1).

Note that

μ^T{𝒙,π^T(𝒙)}=𝕀{Δ^𝒙>0}μ^T{𝒙,1}+[1𝕀{Δ^𝒙>0}]μ^T{𝒙,0},\widehat{\mu}_{T}\{\bm{x},\widehat{\pi}_{T}(\bm{x})\}=\mathbb{I}\{\widehat{\Delta}_{\bm{x}}>0\}\widehat{\mu}_{T}\{\bm{x},1\}+\left[1-\mathbb{I}\{\widehat{\Delta}_{\bm{x}}>0\}\right]\widehat{\mu}_{T}\{\bm{x},0\},

since $\mathbb{I}\{\widehat{\Delta}_{\bm{x}}>0\}-\mathbb{I}\{\Delta_{\bm{x}}>0\}=o_{p}(1)$ and both $\widehat{\mu}_{T}\{\bm{x},0\}$ and $\widehat{\mu}_{T}\{\bm{x},1\}$ are consistent, we have

\widehat{\mu}_{T}\{\bm{x},\widehat{\pi}_{T}(\bm{x})\}=\mathbb{I}\{\Delta_{\bm{x}}>0\}\mu\{\bm{x},1\}+\left[1-\mathbb{I}\{\Delta_{\bm{x}}>0\}\right]\mu\{\bm{x},0\}+o_{p}(1)=\mu\{\bm{x},\pi^{*}(\bm{x})\}+o_{p}(1).

Therefore

\widehat{\sigma}_{T,2}^{2}={\mbox{Var}}\left[\mu\{\bm{x},\pi^{*}(\bm{x})\}+o_{p}(1)\right]={\mbox{Var}}\left[\mu\{\bm{x},\pi^{*}(\bm{x})\}\right]+o_{p}(1)

by the Continuous Mapping Theorem. The proof of Theorem 3 is thus completed.

B.6 Results and Proofs for Auxiliary Lemmas

Lemma B.1

Suppose a random variable $X$ is restricted to $[a,b]$ with $\mu=E[X]$. Then the variance of $X$ is bounded by $(b-\mu)(\mu-a)$.

Proof: First consider the case $a=0$, $b=1$. Notice that $E\left[X^{2}\right]\leq E[X]$ since $x^{2}\leq x$ for all $x\in[0,1]$. Therefore,

Var[X]=E[X2](E[X]2)=E[X2]μ2μμ2=μ(1μ).\operatorname{Var}[X]=E\left[X^{2}\right]-\left(E[X]^{2}\right)=E\left[X^{2}\right]-\mu^{2}\leq\mu-\mu^{2}=\mu(1-\mu).

Then we consider the general interval $[a,b]$. Define $Y=\frac{X-a}{b-a}$, which is restricted to $[0,1]$. Equivalently, $X=(b-a)Y+a$, from which it follows immediately that

\operatorname{Var}[X]=(b-a)^{2}\operatorname{Var}[Y]\leq(b-a)^{2}\mu_{Y}\left(1-\mu_{Y}\right),

where the inequality is based on the first result. Now, by substituting μY=μXaba\mu_{Y}=\frac{\mu_{X}-a}{b-a}, the bound equals

(ba)2μXaba(1μXaba)=(ba)2μXababμXba=(μXa)(bμX),(b-a)^{2}\frac{\mu_{X}-a}{b-a}\left(1-\frac{\mu_{X}-a}{b-a}\right)=(b-a)^{2}\frac{\mu_{X}-a}{b-a}\frac{b-\mu_{X}}{b-a}=\left(\mu_{X}-a\right)\left(b-\mu_{X}\right),

which is the desired result.
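As a quick numerical sanity check of Lemma B.1 (illustrative only; the rescaled Beta distribution below is our own choice), one can simulate a bounded random variable and compare its empirical variance with the bound $(b-\mu)(\mu-a)$:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = -1.0, 3.0
# Any distribution supported on [a, b] works; here a Beta(2, 5) rescaled to [a, b].
x = a + (b - a) * rng.beta(2.0, 5.0, size=200_000)
mu = x.mean()
print("Var[X]       :", round(x.var(), 4))
print("(b-mu)(mu-a) :", round((b - mu) * (mu - a), 4))  # Lemma B.1 bound
```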

Lemma B.2

Suppose the conditions in Theorem 2 hold with Assumption 4.3. Then there exists some constant $c$ and $0<\alpha<\frac{1}{2}$ such that $\alpha\gamma<\frac{1}{2}$ and ${\mbox{Pr}}\left(\widehat{\pi}_{t}(\bm{x}_{t})=\pi^{*}(\bm{x}_{t})\right)\geq 1-ct^{-\alpha\gamma}$.

Proof: By Theorem 2, we have Δ^𝒙tΔ𝒙t=𝒪p(t12)\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}=\mathcal{O}_{p}(t^{-\frac{1}{2}}), thus 𝕀{Δ^𝒙t>0}=𝕀{Δ𝒙t+𝒪p(t12)>0}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}=\mathbb{I}\{\Delta_{\bm{x}_{t}}+\mathcal{O}_{p}(t^{-\frac{1}{2}})>0\}.
By Assumption 4.3, there exists some constant cc and 0<α<120<\alpha<\frac{1}{2} such that αγ<12\alpha\gamma<\frac{1}{2} and

Pr{0<|Δ𝒙t|<tα}ctαγ.{\mbox{Pr}}\{0<|\Delta_{\bm{x}_{t}}|<t^{-\alpha}\}\leq ct^{-\alpha\gamma}.

Thus, with probability greater than $1-ct^{-\alpha\gamma}$, we have $|\Delta_{\bm{x}_{t}}|>t^{-\alpha}$, which further implies $\mathbb{I}\{\Delta_{\bm{x}_{t}}+\mathcal{O}_{p}(t^{-\frac{1}{2}})>0\}=\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}$. In other words, $\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}=0$ with probability greater than $1-ct^{-\alpha\gamma}$, which converges to 1 as $t\rightarrow\infty$. Therefore, we have

Pr(𝕀{Δ^𝒙t>0}𝕀{Δ𝒙t>0}=0)1ctαγ,{\mbox{Pr}}\left(\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}>0\}-\mathbb{I}\{\Delta_{\bm{x}_{t}}>0\}=0\right)\geq 1-ct^{-\alpha\gamma},

i.e.,

{\mbox{Pr}}\left(\widehat{\pi}_{t}(\bm{x}_{t})=\pi^{*}(\bm{x}_{t})\right)\geq 1-ct^{-\alpha\gamma}.
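The mechanism behind Lemma B.2 can be illustrated with a small simulation (our own toy example, not part of the proof): when $\widehat{\Delta}_{\bm{x}_{t}}$ equals $\Delta_{\bm{x}_{t}}$ plus an $\mathcal{O}_{p}(t^{-1/2})$ error and the margin is bounded away from zero, the estimated and optimal actions agree with probability tending to one.

```python
import numpy as np

rng = np.random.default_rng(3)
delta = 0.5          # true margin Delta_{x_t}; fixed here for simplicity
n_rep = 5000
for t in [10, 100, 1000, 10000]:
    # Delta_hat = Delta + O_p(t^{-1/2}) estimation error
    delta_hat = delta + rng.standard_normal(n_rep) / np.sqrt(t)
    agree = np.mean((delta_hat > 0) == (delta > 0))
    print(f"t = {t:6d}   empirical Pr(pi_hat = pi*) = {agree:.4f}")
```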
Lemma B.3

Suppose the conditions in Lemma 4.1 hold. Assuming Assumption 4.3 with $tp_{t}^{2}\rightarrow\infty$ as $t\rightarrow\infty$, we have

T1/2t=1T|μ{𝒙t,π^t(𝒙t)}μ{𝒙t,π}|=op(1).{{T}^{-1/2}}\sum_{t=1}^{T}\left|\mu\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\mu\{\bm{x}_{t},\pi^{*}\}\right|=o_{p}(1).

Proof: Without loss of generality, suppose $\bm{x}_{t}^{\top}\bm{\beta}(1)>\bm{x}_{t}^{\top}\bm{\beta}(0)$, i.e., $\pi^{*}(\bm{x}_{t})=1$. Since $\mu(\bm{x}_{t},a)=\bm{x}_{t}^{\top}\bm{\beta}(a)$, we have

μ(𝒙t,π^t(𝒙t)\displaystyle\mu(\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t}) =𝕀{𝒙t𝜷^t(1)>𝒙t𝜷^t(0)}𝒙t𝜷(1)+𝕀{𝒙t𝜷^t(1)𝒙t𝜷^t(0)}𝒙t𝜷(0)\displaystyle=\mathbb{I}\{\bm{x}_{t}^{\top}\widehat{\bm{\beta}}_{t}(1)>\bm{x}_{t}^{\top}\widehat{\bm{\beta}}_{t}(0)\}\bm{x}_{t}^{\top}\bm{\beta}(1)+\mathbb{I}\{\bm{x}_{t}^{\top}\widehat{\bm{\beta}}_{t}(1)\leq\bm{x}_{t}^{\top}\widehat{\bm{\beta}}_{t}(0)\}\bm{x}_{t}^{\top}\bm{\beta}(0) (B.45)
=𝕀{𝒙t𝜷^t(1)>𝒙t𝜷^t(0)}𝒙t{𝜷(1)𝜷(0)}+𝒙t𝜷(0).\displaystyle=\mathbb{I}\{\bm{x}_{t}^{\top}\widehat{\bm{\beta}}_{t}(1)>\bm{x}_{t}^{\top}\widehat{\bm{\beta}}_{t}(0)\}\bm{x}_{t}^{\top}\left\{\bm{\beta}(1)-\bm{\beta}(0)\right\}+\bm{x}_{t}^{\top}\bm{\beta}(0).

Similarly to (B.45), we have

μ(𝒙t,π)=𝕀{𝒙t𝜷(1)>𝒙t𝜷(0)}𝒙t{𝜷(1)𝜷(0)}+𝒙t𝜷(0).\displaystyle\mu(\bm{x}_{t},\pi^{*})=\mathbb{I}\{\bm{x}_{t}^{\top}\bm{\beta}(1)>\bm{x}_{t}^{\top}\bm{\beta}(0)\}\bm{x}_{t}^{\top}\left\{\bm{\beta}(1)-\bm{\beta}(0)\right\}+\bm{x}_{t}^{\top}\bm{\beta}(0). (B.46)

Combining (B.45) and (B.46), we have

μ(𝒙t,π^t(𝒙t)μ(𝒙t,π)=[𝕀{𝒙t𝜷^t(1)>𝒙t𝜷^t(0)}𝕀{𝒙t𝜷(1)>𝒙t𝜷(0)}]𝒙t{𝜷(1)𝜷(0)}.\mu(\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})-\mu(\bm{x}_{t},\pi^{*})=\left[\mathbb{I}\{\bm{x}_{t}^{\top}\widehat{\bm{\beta}}_{t}(1)>\bm{x}_{t}^{\top}\widehat{\bm{\beta}}_{t}(0)\}-\mathbb{I}\{\bm{x}_{t}^{\top}\bm{\beta}(1)>\bm{x}_{t}^{\top}\bm{\beta}(0)\}\right]\bm{x}_{t}^{\top}\left\{\bm{\beta}(1)-\bm{\beta}(0)\right\}. (B.47)

Since 𝕀{𝒙t𝜷(1)>𝒙t𝜷(0)}=1\mathbb{I}\{\bm{x}_{t}^{\top}\bm{\beta}(1)>\bm{x}_{t}^{\top}\bm{\beta}(0)\}=1 by assumption, (B.47) can be simplified as

μ(𝒙t,π^t(𝒙t)μ(𝒙t,π)=𝕀{𝒙t𝜷^t(1)𝒙t𝜷^t(0)0}𝒙t{𝜷(1)𝜷(0)}=𝕀{Δ^𝒙t0}Δ𝒙t0,\mu(\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})-\mu(\bm{x}_{t},\pi^{*})=-\mathbb{I}\{\bm{x}_{t}^{\top}\widehat{\bm{\beta}}_{t}(1)-\bm{x}_{t}^{\top}\widehat{\bm{\beta}}_{t}(0)\leq 0\}\bm{x}_{t}^{\top}\left\{\bm{\beta}(1)-\bm{\beta}(0)\right\}=-\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}\Delta_{\bm{x}_{t}}\leq 0,

where Δ𝒙t=𝒙t{𝜷(1)𝜷(0)}\Delta_{\bm{x}_{t}}=\bm{x}_{t}^{\top}\{\bm{\beta}(1)-\bm{\beta}(0)\} and Δ^𝒙t=𝒙t{𝜷^(1)𝜷^(0)}\widehat{\Delta}_{\bm{x}_{t}}=\bm{x}_{t}^{\top}\{\widehat{\bm{\beta}}(1)-\widehat{\bm{\beta}}(0)\}. Thus, we have

T1/2t=1T|μ{𝒙t,π^t(𝒙t)}μ{𝒙t,π}|=1Tt=1T𝕀{Δ^𝒙t0}Δ𝒙t.{{T}^{-1/2}}\sum_{t=1}^{T}\left|\mu\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\mu\{\bm{x}_{t},\pi^{*}\}\right|=\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}\Delta_{\bm{x}_{t}}.

To show $T^{-1/2}\sum_{t=1}^{T}\left|\mu\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\mu\{\bm{x}_{t},\pi^{*}\}\right|=o_{p}(1)$, it suffices to bound $(1/\sqrt{T})\sum_{t=1}^{T}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}\Delta_{\bm{x}_{t}}$ from above by an $o_{p}(1)$ term. Since $\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}\widehat{\Delta}_{\bm{x}_{t}}\leq 0$, it suffices to show that

ζ=1Tt=1T𝕀{Δ^𝒙t0}(Δ^𝒙tΔ𝒙t)\zeta=\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})

is $o_{p}(1)$. We further notice that for any $\alpha>0$,

ζ\displaystyle\zeta =Pr{0<Δ𝒙t<Tα}1Tt=1T𝕀{0<Δ𝒙t<Tα}𝕀{Δ^𝒙t0}(Δ^𝒙tΔ𝒙t)ζ1\displaystyle=\underbrace{{\mbox{Pr}}\{0<\Delta_{\bm{x}_{t}}<T^{-\alpha}\}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\mathbb{I}\{0<\Delta_{\bm{x}_{t}}<T^{-\alpha}\}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})}_{\zeta_{1}} (B.48)
+Pr{Δ𝒙t>Tα}1Tt=1T𝕀{Δ𝒙t>Tα}𝕀{Δ^𝒙t0}(Δ^𝒙tΔ𝒙t)ζ2.\displaystyle+\underbrace{{\mbox{Pr}}\{\Delta_{\bm{x}_{t}}>T^{-\alpha}\}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\mathbb{I}\{\Delta_{\bm{x}_{t}}>T^{-\alpha}\}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})}_{\zeta_{2}}.

To show $\zeta=o_{p}(1)$, it is sufficient to show that $\zeta_{1}=o_{p}(1)$ and $\zeta_{2}=o_{p}(1)$.

First, we show $\zeta_{1}=o_{p}(1)$. By Theorem 2, $\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}=\mathcal{O}_{p}(t^{-\frac{1}{2}})$, which implies

Δ^𝒙tΔ𝒙t=op{t(12αγ)}.\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}=o_{p}\{t^{-(\frac{1}{2}-\alpha\gamma)}\}.

Thus we have

|1Tt=1T𝕀{0<Δ𝒙t<Tα}𝕀{Δ^𝒙t0}(Δ^𝒙tΔ𝒙t)|\displaystyle|\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\mathbb{I}\{0<\Delta_{\bm{x}_{t}}<T^{-\alpha}\}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})|
1Tt=1T|𝕀{0<Δ𝒙t<Tα}||𝕀{Δ^𝒙t0}||(Δ^𝒙tΔ𝒙t)|\displaystyle\leq\frac{1}{\sqrt{T}}\sum_{t=1}^{T}|\mathbb{I}\{0<\Delta_{\bm{x}_{t}}<T^{-\alpha}\}||\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}||(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})|
1Tt=1T|(Δ^𝒙tΔ𝒙t)|TTt=1Top(t(12αγ))=Top(T(12αγ))=op(Tαγ),\displaystyle\leq\frac{1}{\sqrt{T}}\sum_{t=1}^{T}|(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})|\leq\frac{\sqrt{T}}{T}\sum_{t=1}^{T}o_{p}(t^{-(\frac{1}{2}-\alpha\gamma)})\overset{*}{=}\sqrt{T}o_{p}(T^{-(\frac{1}{2}-\alpha\gamma)})=o_{p}(T^{\alpha\gamma}),

where the equation (*) is derived by Lemma 6 in Luedtke and Van Der Laan (2016). By Assumption 4.3, there exists some constant cc and 0<α<120<\alpha<\frac{1}{2} such that αγ<12\alpha\gamma<\frac{1}{2} and

Pr{0<Δ𝒙t<Tα}cTαγ.{\mbox{Pr}}\{0<\Delta_{\bm{x}_{t}}<T^{-\alpha}\}\leq cT^{-\alpha\gamma}.

Therefore we have

|\zeta_{1}|\leq{\mbox{Pr}}\{0<\Delta_{\bm{x}_{t}}<T^{-\alpha}\}o_{p}(T^{\alpha\gamma})\leq cT^{-\alpha\gamma}o_{p}(T^{\alpha\gamma})=o_{p}(1). (B.49)

Notice that

𝕀{0<Δ𝒙t<Tα}𝕀{Δ^𝒙t0}(Δ^𝒙tΔ𝒙t)𝕀{Δ^𝒙t0}Δ^𝒙t0,\mathbb{I}\{0<\Delta_{\bm{x}_{t}}<T^{-\alpha}\}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})\leq\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}\widehat{\Delta}_{\bm{x}_{t}}\leq 0, (B.50)

where the first inequality holds since Δ𝒙t0\Delta_{\bm{x}_{t}}\geq 0. Combining (B.48), (B.49), and (B.50), we have

0ζ1=op(1).0\geq\zeta_{1}=o_{p}(1). (B.51)

Next, we consider the second part ζ2\zeta_{2}. Note that

\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}=\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}\leq-\Delta_{\bm{x}_{t}}\}\leq\mathbb{I}\{|\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}|\geq\Delta_{\bm{x}_{t}}\},

we have

\left|\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})\right|\leq\mathbb{I}\{|\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}|\geq\Delta_{\bm{x}_{t}}\}|\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}|\leq\frac{|\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}|^{2}}{\Delta_{\bm{x}_{t}}}, (B.52)

where the second inequality holds since

\mathbb{I}\{|\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}|\geq\Delta_{\bm{x}_{t}}\}\leq\frac{|\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}|}{\Delta_{\bm{x}_{t}}}.

Thus, by (B.52), we further have

𝕀{Δ^𝒙t0}(Δ^𝒙tΔ𝒙t)(Δ^𝒙tΔ𝒙t)2Δ𝒙t.\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})\geq-\frac{(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})^{2}}{\Delta_{\bm{x}_{t}}}. (B.53)

Since $\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}<0$ on the event $\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0,\,\Delta_{\bm{x}_{t}}>T^{-\alpha}\}$, based on (B.53), we have

0ζ21Tt=1T𝕀{Δ𝒙t>Tα}𝕀{Δ^𝒙t0}(Δ^𝒙tΔ𝒙t)1Tt=1T𝕀{Δ𝒙t>Tα}(Δ^𝒙tΔ𝒙t)2Δ𝒙t.0\geq\zeta_{2}\geq\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\mathbb{I}\{\Delta_{\bm{x}_{t}}>T^{-\alpha}\}\mathbb{I}\{\widehat{\Delta}_{\bm{x}_{t}}\leq 0\}(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})\geq-\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\mathbb{I}\{\Delta_{\bm{x}_{t}}>T^{-\alpha}\}\frac{(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})^{2}}{\Delta_{\bm{x}_{t}}}. (B.54)

Noticing that $\mathbb{I}\{\Delta_{\bm{x}_{t}}>T^{-\alpha}\}\leq\Delta_{\bm{x}_{t}}T^{\alpha}$ and combining with (B.54), we further have

0\geq\zeta_{2}\geq-\frac{1}{\sqrt{T}}\sum_{t=1}^{T}T^{\alpha}(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})^{2}=-T^{-\frac{1}{2}+\alpha}\sum_{t=1}^{T}(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})^{2}=-T^{\frac{1}{2}+\alpha}\left\{\frac{1}{T}\sum_{t=1}^{T}(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})^{2}\right\}.

By Theorem 2, $\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}}=\mathcal{O}_{p}(t^{-\frac{1}{2}})$, which implies $(\widehat{\Delta}_{\bm{x}_{t}}-\Delta_{\bm{x}_{t}})^{2}=o_{p}\{t^{-(\frac{1}{2}+\alpha)}\}$. By Lemma 6 in Luedtke and Van Der Laan (2016), $T^{-1}\sum_{t=1}^{T}o_{p}\{t^{-(\frac{1}{2}+\alpha)}\}=o_{p}\{T^{-(\frac{1}{2}+\alpha)}\}$, thus we have

0ζ2T12+αop(T(12+α))=op(1).0\geq\zeta_{2}\geq-T^{\frac{1}{2}+\alpha}o_{p}(T^{-(\frac{1}{2}+\alpha)})=o_{p}(1). (B.55)

Therefore, combining Equation (B.51) and Equation (B.55), we have

0ζ=ζ1+ζ2=op(1).0\geq\zeta=\zeta_{1}+\zeta_{2}=o_{p}(1).

Thus, we have

T1/2t=1T|μ{𝒙t,π^t(𝒙t)}μ{𝒙t,π}|=op(1).{{T}^{-1/2}}\sum_{t=1}^{T}\left|\mu\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\mu\{\bm{x}_{t},\pi^{*}\}\right|=o_{p}(1).
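To visualize the conclusion of Lemma B.3, the following Python sketch (a toy linear model of our own choosing, not the paper's simulation study) tracks $T^{-1/2}\sum_{t=1}^{T}\left|\mu\{\bm{x}_{t},\widehat{\pi}_{t}(\bm{x}_{t})\}-\mu\{\bm{x}_{t},\pi^{*}(\bm{x}_{t})\}\right|$ when the arm-difference estimate carries an $\mathcal{O}_{p}(t^{-1/2})$ error; the scaled sum stays small and tends to shrink as $T$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)

def lemma_b3_scaled_sum(T, rng):
    beta_diff = np.array([1.0, -0.5])       # beta(1) - beta(0), a toy choice
    total = 0.0
    for t in range(1, T + 1):
        x = rng.normal(size=2)
        delta = x @ beta_diff                                    # Delta_{x_t}
        delta_hat = delta + rng.standard_normal() / np.sqrt(t)   # O_p(t^{-1/2}) error
        # value lost at step t when the estimated policy disagrees with pi*
        total += abs(delta) * ((delta_hat > 0) != (delta > 0))
    return total / np.sqrt(T)

for T in [100, 1000, 10000]:
    print(T, round(lemma_b3_scaled_sum(T, rng), 4))
```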