Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning
Abstract
Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, by providing crucial guidance on the early stopping of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection for consistency and is asymptotically normal, with a Wald-type confidence interval provided. Extensive simulation studies and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.
Keywords: Asymptotic Normality; Bandit Algorithms; Double Protection; Online Estimation; Probability of Exploration
1 Introduction
Sequential decision-making is one of the essential components of modern artificial intelligence that considers the dynamics of the real world. By maintaining the trade-off between exploration and exploitation based on historical information, bandit algorithms aim to maximize the cumulative outcome of interest and are thus popular in dynamic decision optimization with a wide variety of applications, such as precision medicine (Lu et al., 2021) and dynamic pricing (Turvey, 2017). There has been a vast literature on bandit optimization established over recent decades (see e.g., Sutton and Barto, 2018; Lattimore and Szepesvári, 2020, and the references therein). Most of these theoretical works focus on the regret analysis of bandit algorithms. When properly designed and implemented to address the exploration-and-exploitation trade-off, a powerful bandit policy could achieve a sub-linear regret, and thus eventually approximate the underlying optimal policy that maximizes the expected outcome. However, such a regret analysis only shows the convergence rate of the averaged cumulative regret (difference between the outcome under the optimal policy and that under the bandit policy) but provides limited information on the expected outcome under this bandit policy (referred to as the value in Dudík et al. (2011)).
The evaluation of the performance of bandit policies plays a vital role in many areas, including medicine and economics (see e.g., Chakraborty and Moodie, 2013; Athey, 2019). By evaluation, we aim to unbiasedly estimate the value of the optimal policy that the bandit policy is approaching and to infer the corresponding estimate. Although there is an increasing trend in policy evaluation (see e.g., Li et al., 2011; Dudík et al., 2011; Swaminathan et al., 2017; Wang et al., 2017; Kallus and Zhou, 2018; Su et al., 2019), we note that all of these works focus on learning the value of a target policy offline using historical log data. See the architecture of offline policy evaluation illustrated in the left panel of Figure 1. Instead of a post-experiment investigation, it has attracted more attention recently to evaluate the ongoing policy in real time. In precision medicine, the physician aims to make the best treatment decision for each patient sequentially according to their baseline covariates. Estimating the mean outcome of the current treatment decision rule is crucial to answering several fundamental questions in health care, such as whether the current strategy significantly improves patient outcomes over some baseline strategies. When the value under the ongoing rule is much lower than the desired average curative effect, the online trial should be terminated until more effective treatment options become available for the next round. Thus, policy evaluation in online learning provides a new way to support early stopping of the online experiment and timely feedback from the environment, as demonstrated in the right panel of Figure 1.


1.1 Related Works and Challenges
Despite the importance of policy evaluation in online learning, the current bandit literature suffers from three main challenges. First, the data, such as the actions and rewards sequentially collected from the online environment, are not independent and identically distributed (i.i.d.) since they depend on the previous history and the running policy (see the right panel of Figure 1). In contrast, the existing methods for the offline policy evaluation (see e.g., Li et al., 2011; Dudík et al., 2011) primarily assumed that the data are generated by the same behavior policy and i.i.d. across different individuals and time points. Such assumptions allow them to evaluate a new policy using offline data by modeling the behavior policy or the conditional mean outcome. In addition, we note that the target policy to be evaluated in offline policy evaluation is fixed and generally known, whereas for policy evaluation in online learning, the optimal policy of interest needs to be estimated and updated in real time.
The second challenge lies in estimating the mean outcome under the optimal policy online. Although numerous methods have recently been proposed to evaluate the online sample mean for a fixed action (see e.g., Nie et al., 2018; Neel and Roth, 2018; Deshpande et al., 2018; Shin et al., 2019a, b; Waisman et al., 2019; Hadad et al., 2019; Zhang et al., 2020), none of these methods is directly applicable to our problem, as the sample mean only captures the impact of one particular arm, not the value of the optimal policy in bandits that accounts for the dynamics of the online environment. For instance, in contextual bandits, we aim to select an action for each subject based on its context/feature to optimize the overall outcome of interest. However, there may not exist a unified best action for all subjects due to heterogeneity, and thus evaluating the value of one single optimal action cannot fully address policy evaluation in such a setting. Moreover, although commonly used in the regret analysis, the average of collected outcomes is not a good estimator of the value under the optimal policy, in the sense that it does not possess statistical efficiency (see details in Section 3).
Third, given data generated by a bandit algorithm that maintains the exploration-and-exploitation trade-off sequentially, inferring the value of the optimal policy online should consider such a trade-off and quantify the probability of exploration and exploitation. The probability of exploring non-optimal actions is essential in two ways. First, it determines the convergence rate of the online conditional mean estimator under each action. Second, it indicates which data points are used to match the value under the optimal policy. To our knowledge, the regret analysis in the current bandit literature is based on bounding variance information (see e.g., Auer, 2002; Srinivas et al., 2009; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Bubeck and Cesa-Bianchi, 2012; Zhou, 2015), yet little effort has been made in formally quantifying the probability of exploration over time.
There are very few studies directly related to our topic. Chambaz et al. (2017) established the asymptotic normality for the conditional mean outcome under an optimal policy for sequential decision making. Later, Chen et al. (2020) proposed an inverse probability weighted value estimator to infer the value of the optimal policy using the ε-Greedy (EG) method. These two works did not discuss how to account for the exploration-and-exploitation trade-off under commonly used bandit algorithms, such as Upper Confidence Bound (UCB) and Thompson Sampling (TS), as considered in this paper. Recently, to evaluate the value of a known policy based on adaptively collected data, Bibaut et al. (2021) and Zhan et al. (2021) proposed to utilize the stabilized doubly robust estimator and the adaptive weighting doubly robust estimator, respectively. However, both methods focused on obtaining a valid inference of the value estimator under a fixed policy by conveniently assuming a desired exploration rate to ensure sufficient sampling of different arms. Such an assumption can be violated in many commonly used bandits (see details shown in Theorem 4.1). Although there are other works that focus on statistical inference for adaptively collected data (Dimakopoulou et al., 2021; Zhang et al., 2021; Khamaru et al., 2021; Ramprasad et al., 2022) in the bandit or reinforcement learning setting, our work handles policy evaluation from a unique angle, inferring the value of the optimal policy by investigating the exploration rate in online learning.
1.2 Our Contributions
In this paper, we aim to overcome the aforementioned difficulties of policy evaluation in online decision-making. Our contributions are summarized as follows.
The first contribution of this work is to explicitly characterize the trade-off between exploration and exploitation in online policy optimization: we derive the probability of exploration in bandit algorithms. Such a probability is new to the literature; it quantifies the chance of taking the non-greedy policy (i.e., a non-optimal action) given the current information over time, in contrast to the probability of exploitation for taking greedy actions. Specifically, we consider three commonly used bandit algorithms for exposition, including the UCB, TS, and EG methods. We note that the probability of exploration is prespecified by users in EG while remaining implicit in UCB and TS. We use this probability to conduct valid inference on the online conditional mean estimator under each action. The second contribution of this work is to propose doubly robust interval estimation (DREAM) to infer the mean outcome of the optimal online policy. DREAM provides double protection on the consistency of the proposed value estimator to the true value, provided the product of the nuisance error rates of the probability of exploitation and the conditional mean outcome decays sufficiently fast in the termination time. Under standard assumptions for inferring the online sample mean, we show that the value estimator under DREAM is asymptotically normal, with a Wald-type confidence interval provided. To the best of our knowledge, this is the first work to establish the inference for the value under the optimal policy by taking the exploration-and-exploitation trade-off into thorough account, and it thus fills a crucial gap in the policy evaluation of online learning.
The remainder of this paper is organized as follows. We introduce notation and formulate our problem, followed by preliminaries of standard contextual bandit algorithms and the formal definition of probability of exploration. In Section 3, we introduce the DREAM method and its implementation details. Then in Section 4, we derive theoretical results of DREAM under contextual bandits by establishing the convergence rate of the probability of exploration and deriving the asymptotic normality of the online mean estimator. Extensive simulation studies are conducted to demonstrate the empirical performance of the proposed method in Section 5, followed by a real application using OpenML datasets in Section 6. We conclude our paper in Section 7 by discussing the performance of the proposed DREAM in terms of the regret bound and a direct extension of our method for policy evaluation of any known policy in online learning. All the additional results and technical proofs are given in the appendix.
2 Problem Formulation
In this section, we formulate the problem of policy evaluation in online learning. We first build the framework based on the contextual bandits in Section 2.1. We then introduce three commonly used bandit algorithms, including UCB, TS, and EG, to generate data online, in Section 2.2. Lastly, we define the probability of exploration in Section 2.3. In this paper, we use bold symbols for vectors and matrices.
2.1 Framework
In contextual bandits, at each time step , we observe a -dimensional context drawn from a distribution which includes 1 for the intercept, choose an action , and then observe a reward . Denote the history observations prior to time step as . Suppose that the reward given and follows , where is the conditional mean outcome function (also known as the Q-function in the literature (Murphy, 2003; Sutton and Barto, 2018)), and is bounded by . The noise term is independent -subgaussian at the time step independently of given for . Let the conditional variance be . The value (Dudík et al., 2011) of a given policy is defined as
We define the optimal policy as , which finds the optimal action based on the conditional mean outcome function given a context . Thus, the optimal value can be defined as . In the rest of this paper, to simplify the exposition, we focus on two actions, that is, . Then the optimal policy is given by
Our goal is to infer the value under the optimal policy using the online data sequentially generated by a bandit algorithm. Since the optimal policy is unknown, we estimate the optimal policy from the online data as . As commonly assumed in the current online inference literature (see e.g., Deshpande et al., 2018; Zhang et al., 2020; Chen et al., 2020) and the bandit literature (see e.g., Chu et al., 2011; Abbasi-Yadkori et al., 2011; Bubeck and Cesa-Bianchi, 2012; Zhou, 2015), we consider the conditional mean outcome function taking a linear form, i.e., where is a smooth function and can be estimated via a ridge regression based on as
(1)
where is a identity matrix, is a design matrix at time with as the number of pulls for action , is the vector of the outcomes received under action at time , and is a positive and bounded constant as the regularization term. There are two main reasons to choose the ridge estimator instead of the ordinary least squares estimator that is considered in Deshpande et al. (2018); Zhang et al. (2020); Chen et al. (2020). First, the ridge estimator is well defined when is singular and its bias is negligible when the time step is large. Second, the parameter estimations in the ridge method are in accordance with the linear UCB (Li et al., 2010) and the linear TS (Agrawal and Goyal, 2013) methods (detailed in the next section) with . Based on the ridge estimator in (1), the online conditional mean estimator for is defined as . With two actions, the estimated optimal policy at time step is defined by
(2)
We note that the linear form of can be relaxed to the non-linear case as , where is a continuous function (see examples in our simulation studies in Section 5). Then the corresponding online conditional mean estimator for is defined as based on .
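To make the online updating concrete, the following Python sketch maintains the per-arm ridge statistics behind (1) and returns the greedy (estimated optimal) action in (2). The class and function names, and the way the penalty `lam` enters, are our own illustrative choices rather than the paper's implementation.

```python
import numpy as np


class OnlineRidge:
    """Per-arm online ridge regression, a minimal sketch of the estimator in (1).

    The attribute names and the penalty argument `lam` are illustrative choices.
    """

    def __init__(self, d, lam=1.0):
        self.A = lam * np.eye(d)   # accumulates X^T X + lam * I online
        self.b = np.zeros(d)       # accumulates X^T Y online

    def update(self, x, r):
        # Rank-one update after pulling this arm with context x and reward r.
        self.A += np.outer(x, x)
        self.b += r * x

    @property
    def beta(self):
        # Ridge estimate of the regression coefficients for this arm.
        return np.linalg.solve(self.A, self.b)

    def predict(self, x):
        # Online conditional mean estimate, i.e., x^T beta_hat for this arm.
        return float(x @ self.beta)


def estimated_optimal_action(x, models):
    """Greedy (estimated optimal) action in (2) with two arms indexed 0 and 1."""
    return int(models[1].predict(x) > models[0].predict(x))
```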
2.2 Bandit Algorithms
We briefly introduce three commonly used bandit algorithms in the framework of contextual bandits, to generate the online data sequentially.
Upper Confidence Bound (UCB) (Li et al., 2010): Let the estimated standard deviation based on be The action at time is selected by
where is a non-increasing positive parameter that controls the level of exploration. With two actions, we have the action at time step as
Thompson Sampling (TS) (Agrawal and Goyal, 2013): Suppose a normal likelihood function for the reward given and such that with a known parameter . If the prior for at time is
where is the -dimensional multivariate normal distribution, we have the posterior distribution of as
for .
At each time step , we draw a sample from the posterior distribution as for ,
and select the next action within two arms by
-Greedy (EG) (Sutton and Barto, 2018): Recall the estimated conditional mean outcome for action as given a context .
Under EG method, the action at time is selected by
where and the parameter controls the level of exploration as pre-specified by users.
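The three action-selection rules above differ only in how they turn the per-arm ridge statistics into a score. The sketch below illustrates one step of UCB, TS, and EG on top of the `OnlineRidge` sketch above; the tuning parameters `alpha`, `v`, and `eps` stand in for the exploration parameters described in this section and are not the paper's exact quantities.

```python
import numpy as np


def choose_action(x, models, method="UCB", alpha=1.0, v=1.0, eps=0.1, rng=None):
    """One step of action selection for UCB, TS, or EG (a sketch)."""
    rng = rng or np.random.default_rng()
    if method == "EG" and rng.random() < eps:
        return int(rng.integers(2))            # explore uniformly with probability eps
    scores = np.zeros(2)
    for a, m in enumerate(models):
        if method == "UCB":
            # Mean plus an exploration bonus based on the ridge covariance.
            width = np.sqrt(x @ np.linalg.solve(m.A, x))
            scores[a] = m.predict(x) + alpha * width
        elif method == "TS":
            # Sample coefficients from a Gaussian posterior N(beta_hat, v^2 A^{-1}).
            beta_tilde = rng.multivariate_normal(m.beta, v ** 2 * np.linalg.inv(m.A))
            scores[a] = float(x @ beta_tilde)
        else:                                  # EG exploits with probability 1 - eps
            scores[a] = m.predict(x)
    return int(np.argmax(scores))
```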
2.3 Probability of Exploration
We next quantify the probability of exploring non-optimal actions at each time step. To be specific, define the status of exploration as , indicating whether the action taken by the bandit algorithm is different from the estimated optimal action that exploits the historical information, given the context information. Here, can be viewed as the greedy policy at time step . Thus the probability of exploration is defined by
(3)
where the expectation in the last term is taken with respect to and the history . According to (3), is determined by the given context information and the current time point. We clarify the connections and distinctions between the defined probability of exploration and related quantities in the bandit literature here. First, the exploration rate used in current bandit works mainly refers to the probability of exploring non-optimal actions given an optimal policy and contextual information at time step , i.e., . This is different from our proposed probability of exploration . The main difference lies in the estimated optimal policy from the collected data at the current time step , i.e., , in . If properly designed and implemented, the estimated optimal policy under a bandit algorithm will eventually converge to the true optimal policy when is large, which yields . In practice, we can estimate sequentially by using as inputs and as outputs via parametric or non-parametric tools. We denote the corresponding estimator as for time step . Such implementation details are explicitly described in Section 3.
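As one concrete instance of the parametric route just mentioned, the following sketch fits a logistic regression of the observed exploration indicators on the contexts to obtain an estimated exploration probability; the function name and the choice of logistic regression (rather than a non-parametric learner) are our own illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_exploration_probability(contexts, actions, greedy_actions):
    """Estimate the probability of exploration from logged data (a sketch).

    `greedy_actions[t]` is the estimated optimal action computed from the history
    available at time t; the response is the exploration indicator.
    """
    X = np.asarray(contexts, dtype=float)
    explore = (np.asarray(actions) != np.asarray(greedy_actions)).astype(int)
    if explore.min() == explore.max():
        # Degenerate case (no or only exploration so far): use the empirical rate.
        rate = float(explore.mean())
        return lambda Xnew: np.full(len(np.atleast_2d(Xnew)), rate)
    clf = LogisticRegression().fit(X, explore)
    return lambda Xnew: clf.predict_proba(np.atleast_2d(Xnew))[:, 1]
```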
3 Doubly Robust Interval Estimation
We present the proposed DREAM method in this section. We first detail why the average of outcomes received in bandits fails to possess statistical efficiency. In view of the results established in the regret analysis for contextual bandits (see e.g., Abbasi-Yadkori et al., 2011; Chu et al., 2011; Zhou, 2015), we have the cumulative regret as , where is the asymptotic order up to some logarithmic factor. Therefore, it is immediate that the average of rewards follows , since the dimension is finite. These results indicate that a simple average of the total outcome under the bandit algorithm is not a good estimator for the optimal value, since it does not enjoy the asymptotic normality needed for valid confidence interval construction. Instead of the simple aggregation, intuitively, we should select the outcome when the action taken under a bandit policy accords with the optimal policy, i.e., , defined as the status of exploitation. In contrast to the probability of exploration, we define the probability of exploitation as
Following the doubly robust value estimator in Dudík et al. (2011), we propose the doubly robust mean outcome estimator as the value estimator under the optimal policy as
(4)
where is the current/termination time, and is the estimated matching probability between the chosen action and the estimated optimal action given , which captures the probability of exploitation. Our value estimator provides double protection on the consistency to the true value, provided the product of the nuisance error rates of the probability of exploitation and the conditional mean outcome decays sufficiently fast, as discussed in Theorem 3. We further propose a variance estimator for as
(5)
where is an estimator for , for . The proposed variance estimator is consistent for the true variance of the value, as shown in Theorem 3. We name our method doubly robust interval estimation (DREAM), with the detailed pseudocode provided in Algorithm 1. To ensure sufficient exploration and valid inference, we force the algorithm to pull non-optimal actions given greedy samples with a pre-specified clipping rate in step (4) of Algorithm 1, whose order satisfies the requirement of Theorem 3. In step (4), if the unchosen action does not satisfy the clipping condition detailed in Assumption 4.2, then we force the algorithm to pull this non-greedy action for additional exploration, as required for a valid online inference. The estimated probability of exploration is used in step (6) of Algorithm 1 to learn the value and variance. A potential limitation of DREAM arises when the agent resists taking extra non-greedy actions in online learning. Yet, we discuss in Section 7.1 that the regret from such extra exploration is negligible compared to the regret from exploitation, and thus we can still maintain a sub-linear regret under DREAM. The theoretical validity of the proposed DREAM is detailed in Section 4, with its empirical advantage over baseline methods demonstrated in Section 5. Finally, we remark that our method is not overly sensitive to the choice of , with additional sensitivity analyses provided in Appendix A.
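A minimal sketch of the resulting interval estimation step is given below: it forms the doubly robust pseudo-outcomes behind (4), averages them into the value estimate, and builds a Wald-type confidence interval from a simple plug-in variance. The variance here is a simplified stand-in for the estimator in (5), and all argument names are our own.

```python
import numpy as np
from scipy.stats import norm


def dream_estimate(rewards, actions, greedy_actions, q_opt_hat, exploit_prob_hat,
                   alpha=0.05):
    """Doubly robust value estimate with a Wald-type CI (a sketch of (4)).

    `q_opt_hat[t]` is the estimated conditional mean under the estimated optimal
    arm at time t, and `exploit_prob_hat[t]` estimates the probability that the
    chosen action matches that arm.
    """
    r = np.asarray(rewards, dtype=float)
    match = (np.asarray(actions) == np.asarray(greedy_actions)).astype(float)
    q = np.asarray(q_opt_hat, dtype=float)
    p = np.clip(np.asarray(exploit_prob_hat, dtype=float), 1e-3, 1.0)

    # Doubly robust pseudo-outcomes: model prediction plus weighted residual correction.
    pseudo = q + match / p * (r - q)
    T = len(r)
    v_hat = pseudo.mean()
    sigma2_hat = np.mean((pseudo - v_hat) ** 2)      # simplified plug-in variance
    half_width = norm.ppf(1 - alpha / 2) * np.sqrt(sigma2_hat / T)
    return v_hat, sigma2_hat, (v_hat - half_width, v_hat + half_width)
```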
4 Theoretical Results
We formally present our theoretical results. In Section 4.1, we first derive the bound of the probability of exploration under the three commonly used bandit algorithms introduced in Section 2.2. This allows us to further establish the asymptotic normality of the online conditional mean estimator under a specific action in Section 4.2. Next, we establish the theoretical properties of DREAM with a Wald-type confidence interval given in Section 4.3. All the proofs are provided in Appendix B. The following assumptions are required to establish our theories.
Assumption 4.1
(Boundedness) There exists a positive constant such that for all , and has minimum eigenvalue for some .
Assumption 4.2
(Clipping) For any action and time step , there exists a positive and non-increasing sequence , such that
Assumption 4.3
(Margin Condition) Assume there exist some constants and such that , where the big-O term is uniform in .
Assumption 4.1 is a technical condition on bounded contexts such that the mean of the martingale differences will converge to zero (see e.g., Zhang et al., 2020; Chen et al., 2020). Assumption 4.2 is a technical requirement for the uniqueness and convergence of the least squares estimators, which requires the bandit algorithm to explore all actions sufficiently such that the asymptotic properties for the online conditional mean estimator under different actions hold (see e.g., Deshpande et al., 2018; Hadad et al., 2019; Zhang et al., 2020). The parameter defined in Assumption 4.2 characterizes the boundary of the probability of taking one action and we name it as the clipping rate. We ensure Assumption 4.2 is satisfied using step (4) in Algorithm 1. We establish the relationship between and and discuss the requirement for to consistently estimate in this section. Assumption 4.3 is well known as the margin condition, which is commonly assumed in the literature to derive a sharp convergence rate for the value under the estimated optimal policy (see e.g., Luedtke and Van Der Laan, 2016; Chambaz et al., 2017; Chen et al., 2020).
4.1 Bounding the Probability of Exploration
The probability of exploration not only signals success in finding the global optimum in online learning, but also connects to optimal policy evaluation by quantifying the rate of executing the greedy actions. Instead of directly specifying this probability (see e.g., Zhang et al., 2020; Chen et al., 2020; Bibaut et al., 2021; Zhan et al., 2021), we explicitly bound this rate based on the updating parameters in bandit algorithms, by which we conduct a valid follow-up inference. Since the probability of exploration involves the estimation of the mean outcome function, we first need to derive a tail bound for the online ridge estimator and the estimated difference between the mean outcomes.
Lemma 4.1
The results in Lemma 4.1 establish the tail bound of the online ridge estimator , which can be simplified as , for some constants , by noticing that the constants and , the dimension , the subgaussian parameter , and the bound are positive and bounded, under bounded true coefficients . Recall the clipping constraint that is a non-increasing sequence. This tail bound is asymptotically equivalent to . The established results in Lemma 4.1 work for general bandit algorithms including UCB, TS, and EG. These tail bounds are aligned with the bound derived in Chen et al. (2020) for the EG method only. By Lemma 4.1, it is immediate to obtain the consistency of the online ridge estimator , if as . We can further obtain the tail bound for the estimated difference between the conditional mean outcomes under two actions by Lemma 4.1, as detailed in the following corollary.
Corollary 1
(Tail bound for the online mean estimator) Suppose the conditions in Lemma 4.1 hold. Denote , then for any , we have the probability that the online conditional mean estimator stays within a given distance of its true value bounded as
with
being a constant of time .
The above corollary quantifies the uncertainty of the online estimation of the conditional mean outcomes, and thus provides a crucial intermediate result to further assess the probability of exploration by noting in (2). More specifically, we derive the probability of exploration at each time step under the three discussed bandit algorithms for exposition in the following theorem.
Theorem 1
(Probability of exploration)
In the online contextual bandit algorithms using UCB, TS, or EG, given a context at time step , assuming Assumptions 4.1, 4.2, and 4.3 hold with , then for any with specified in Corollary 1,
(i) under UCB, there exists some constant such that
(ii) under TS, we have
(iii) under EG, we have .
The theoretical order of is new to the literature, quantifying the probability of exploring the non-optimal actions under a bandit policy over time. We note that the probability of exploration under EG is pre-specified by users, which directly implies its exploration rate as (for the two-arm setting), as shown in result (iii). Results (i) and (ii) in Theorem 1 show that the probabilities of exploration under UCB and TS are non-increasing under certain conditions. For instance, if for TS, as , the upper bound for decays to zero with an asymptotically equivalent convergence rate as up to some constant. Similarly, for UCB, with an arbitrarily small at an order of , as , the upper bound for decays to zero as long as . Theorem 1 also indicates that without the clipping required in Assumption 4.2 and implemented in step (4) of Algorithm 1, the exploration of non-optimal arms might be insufficient, leading to a possibly invalid and biased inference for the online conditional mean.
4.2 Asymptotic Normality of Online Ridge Estimator
We use the established bounds for the probability of exploration in Theorem 1 to further obtain the asymptotic normality of the online conditional mean estimator under each action. Specifically, denote given a context . Assume the conditions in Theorem 1 hold and as ; then the upper bound of the probability of exploration in Theorem 1 decays to zero for UCB and TS. Since is nonnegative by its definition, it follows immediately from the Squeeze Theorem that the limit exists for UCB and TS, and for EG. By using this limit, we can further characterize the following theorem for asymptotics and inference.
Theorem 2
(Asymptotics and inference) Supposing the conditions in Theorem 1 hold with as , we have and for and , with the variance given by
The results in Theorem 2 establish the asymptotic normality of the online ridge estimator and the online conditional mean estimator over time, with an explicit asymptotic variance derived. Here, may differ under different bandit algorithms, while the asymptotic normality holds as long as the adopted bandit algorithm explores the non-optimal actions sufficiently with a non-increasing rate, following the conclusion in Theorem 1, i.e., with a clipping rate of the order described in Algorithm 1. The proof of Theorem 2 generalizes Theorem 3.1 in Chen et al. (2020) by additionally considering the online ridge estimator and general bandit algorithms.
4.3 Asymptotic Normality and Robustness for DREAM
In this section, we further derive the asymptotic normality of . The following additional assumption is required to establish the double robustness of DREAM.
Assumption 4.4
(Rate Double Robustness) Define as the norm. Let , and . Assume the product of two rates satisfies .
Assumption 4.4 requires the estimated conditional mean function and the estimated probability of exploration to converge at certain rates in online learning. This assumption is frequently studied in the causal inference literature (see e.g., Farrell, 2015; Luedtke and Van Der Laan, 2016; Smucler et al., 2019; Hou et al., 2021; Kennedy, 2022) to derive the asymptotic distribution of the estimated average treatment effect with either parametric estimators or non-parametric estimators (see e.g., Wager and Athey, 2018; Farrell et al., 2021). We remark that under the conditions required in Theorem 2, we have for all , thus Assumption 4.4 holds as long as . We then have the asymptotic normality of as stated in the following theorem.
Theorem 3
The above theorem shows that the value estimator under DREAM is doubly robust: it is consistent for the true value provided the product of the nuisance error rates of the probability of exploitation and the conditional mean outcome decays sufficiently fast. We use the Martingale Central Limit Theorem to overcome the non-i.i.d. sample problem for asymptotic normality. The asymptotic variance of the proposed estimator arises from two sources. One is the variance due to the context information, and the other is a weighted average of the variance under the optimal arm and the variance under the non-optimal arms, where the weight is determined by the probability of exploration under the adopted bandit algorithm. To the best of our knowledge, this is the first work that studies the asymptotic distribution of the mean outcome of the estimated optimal policy under general online bandit algorithms, taking the exploration-and-exploitation trade-off into thorough account. Our method thus fills a crucial gap in policy evaluation of online learning. By Theorem 3, a two-sided confidence interval (CI) for is , where denotes the upper th quantile of a standard normal distribution.
5 Simulation Studies
We investigate the finite sample performance and demonstrate the coverage probabilities of the policy value of DREAM in this section. The computing infrastructure used is a virtual machine in the AWS Platform with 72 processor cores and 144GB memory.
Consider the 2-dimensional context , with . Suppose the outcome of interest given and is generated from , where the conditional mean function takes a non-linear form as , with equal variances as and . Then the optimal policy is given by , and the optimal value is 3.27 calculated by integration.
We employ the three bandit algorithms described in Section 2.2 to generate the online data with (Li et al., 2010), where for UCB, with for TS, and for EG. Set the total decision time as with a burn-in period .

We evaluate the double robustness property of the proposed value estimator under DREAM in comparison to the simple average of the total reward for contextual bandits. To be specific, we consider the following four methods: 1. the conditional mean function is correctly specified, and the probability of exploration is estimated by a nonparametric regression in DREAM; 2. the probability of exploration is estimated by a nonparametric regression while the model of is misspecified with a linear regression; 3. the conditional mean function is correctly specified while the model of is misspecified by a constant 0.5; 4. the averaged reward is used as the value estimator. The clipping rate of DREAM is set to 0.01, and our method is not overly sensitive to the choice of , as shown in additional sensitivity analyses provided in Appendix A. The above four value estimators are evaluated by the coverage probabilities of the 95% two-sided Wald-type CI on covering the optimal value, the bias, and the ratio between the standard error and the Monte Carlo standard deviation, as shown in Figure 2 for UCB, Figure 3 for TS, and Figure 4 for EG, aggregated over 1000 runs.
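For clarity, the sketch below shows how these three metrics can be aggregated over Monte Carlo replications; the function and argument names are ours, and the critical value corresponds to the nominal 95% level used above.

```python
import numpy as np


def summarize_replications(estimates, std_errors, true_value, z=1.96):
    """Aggregate Monte Carlo runs into the three reported metrics (a sketch).

    `estimates[k]` and `std_errors[k]` are the value estimate and its standard
    error from the k-th replication.
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    covered = (est - z * se <= true_value) & (true_value <= est + z * se)
    return {
        "coverage": covered.mean(),                  # should approach 0.95
        "bias": est.mean() - true_value,             # should approach 0
        "se_to_mcsd": se.mean() / est.std(ddof=1),   # should approach 1
    }
```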

Based on Figures 2, 3, and 4, the performance of the proposed DREAM method is substantially better than the simple average estimator of the total outcome. Specifically, under different bandit algorithms, as time increases, the coverage probabilities of the proposed DREAM estimator are close to the nominal level of 95%, with the biases approaching 0 and the ratios between the standard error and the Monte Carlo standard deviation approaching 1. In addition, our DREAM method achieves reasonably good performance when either the regression model for the conditional mean function or the probability of exploration is misspecified. These findings not only validate the theoretical results in Theorem 3 but also demonstrate the double robustness of DREAM in handling policy evaluation in online learning. In contrast, the simple average of rewards can hardly maintain coverage probabilities over 80%, with much larger biases in all cases.

6 Real Data Application
In this section, we evaluate the performance of the proposed DREAM method on real datasets from the OpenML database, which is a curated, comprehensive benchmark suite for machine-learning tasks. Following the contextual bandit setting considered in this paper, we select two datasets in the public OpenML Curated Classification benchmarking suite 2018 (OpenML-CC18; BSD 3-Clause license) (Bischl et al., 2017), i.e., SEA50 and SEA50000, to formulate the real application. Each dataset is a collection of pairs of 3-dimensional features and their corresponding labels , with a total of 1,000,000 observations. To simulate an online environment for data generation, we turn the two-class classification tasks into two-armed contextual bandit problems (see e.g., Dudík et al., 2011; Wang et al., 2017; Su et al., 2019), such that we can reproduce the online data to evaluate the performance of the proposed method. Specifically, at each time step , we draw the pair uniformly at random without replacement from the dataset with 1,000,000. Given the revealed context , the bandit algorithm selects an action . The reward is generated by a normal distribution . Here, the mean of the reward is 1 if the selected action matches the underlying true label, and 0 otherwise. Therefore, the optimal value is 1, while the optimal policy is unknown due to the complex relationship between the collected features and the label. Our goal is to infer the value under the optimal policy in the online settings produced by the datasets SEA50 and SEA50000.
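The following sketch illustrates this classification-to-bandit conversion; the function name and the noise level `sigma` are illustrative placeholders rather than the exact specification used in the paper.

```python
import numpy as np


def classification_to_bandit(features, labels, policy, sigma=0.1, rng=None):
    """Simulate one pass of the classification-to-bandit conversion (a sketch).

    Each (feature, label) pair is drawn without replacement; the chosen action
    earns a Gaussian reward with mean 1 if it matches the label and 0 otherwise.
    """
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(labels))       # uniform sampling without replacement
    rewards = []
    for idx in order:
        x, y = features[idx], labels[idx]
        a = policy(x)                          # action chosen by the bandit algorithm
        rewards.append(rng.normal(loc=float(a == y), scale=sigma))
    return np.asarray(rewards)
```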

Using a similar procedure to that described in Section 5, we apply DREAM in comparison to the simple average estimator, by employing the three bandit algorithms with the following specifications: (i) for UCB, let ; (ii) for TS, let priors and parameter ; and (iii) for EG, let . Set the total time for the online learning as with a burn-in period of . The results are evaluated by the coverage probabilities of the 95% two-sided Wald-type CI in Figure 5 for the two real datasets, respectively, averaged over 500 replications. It can be observed from Figure 5 that our proposed DREAM method performs much better than the simple average estimator in all cases. To be specific, the coverage probabilities of the value estimator under DREAM are close to the nominal level of 95%, while the CI constructed by the averaged reward can hardly cover the true value, with its coverage probabilities decaying to 0, under different bandit algorithms and both simulated online environments. These findings are consistent with what we observed in simulations and consolidate the practical usefulness of the proposed DREAM.
7 Discussion
In this paper, we propose doubly robust interval estimation (DREAM) to infer the mean outcome of the optimal policy using the online data generated from a bandit algorithm. We explicitly characterize the probability of exploring the non-optimal actions under different bandit algorithms and show the consistency and asymptotic normality of the proposed value estimator. In this section, we discuss the performance of DREAM in terms of the regret bound and extend it to the evaluation of a known policy in online learning.
7.1 Regret Bound under DREAM
In this section, we discuss the regret bound of the proposed DREAM method. Specifically, we study the regret defined as the difference between the expected cumulative rewards under the oracle optimal policy and the bandit policy, which is
(6)
By noticing that if and 0 otherwise, we have
We note that the indicator function is equivalent to and is bounded by . Thus, we can divide the regret defined in Equation (6) into two parts as where
is the regret from the exploration and is the regret from the exploitation. It is well known that the regret from the exploitation is sublinear (Chu et al., 2011; Agrawal and Goyal, 2013) and the regret for EG has been well studied by Chen et al. (2020). Therefore, we focus on the analysis of the regret from the exploration for UCB and TS here.
Since is bounded and the upper bound of has an asymptotically equivalent convergence rate as up to some constant by Theorem 1, there exists some constant such that the regret from the exploration is bounded by
If we choose for some , the regret is bounded by , where the equation is calculated using Lemma 6 in Luedtke and Van Der Laan (2016). Thus, we still have sublinear regret under DREAM.
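To make the decomposition above concrete, the display below (written in our own illustrative notation, with the exact rates depending on the clipping sequence in Assumption 4.2) sketches how the cumulative regret splits into an exploration part and an exploitation part, and how the exploration part is controlled once the exploration probability is dominated by a polynomially decaying sequence.

```latex
% Sketch of the regret decomposition (notation ours, not the paper's exact display).
\begin{align*}
R_T &= \sum_{t=1}^{T}\mathbb{E}\bigl[Q\{x_t,\pi^{*}(x_t)\}-Q(x_t,a_t)\bigr]\\
    &= \underbrace{\sum_{t=1}^{T}\mathbb{E}\Bigl[\mathbb{1}\{a_t\neq\hat{\pi}_t(x_t)\}
       \bigl(Q\{x_t,\pi^{*}(x_t)\}-Q(x_t,a_t)\bigr)\Bigr]}_{\text{regret from exploration}}
     + \underbrace{\sum_{t=1}^{T}\mathbb{E}\Bigl[\mathbb{1}\{a_t=\hat{\pi}_t(x_t)\}
       \bigl(Q\{x_t,\pi^{*}(x_t)\}-Q(x_t,a_t)\bigr)\Bigr]}_{\text{regret from exploitation}},\\
\text{regret from exploration}
    &\le C\sum_{t=1}^{T}\mathbb{E}\{\kappa_t(x_t)\}
     \le C'\sum_{t=1}^{T}t^{-\alpha}
     = O\!\left(T^{1-\alpha}\right),\qquad \alpha\in(0,1),
\end{align*}
```

where $C$ and $C'$ are constants and the last summation bound is a standard calculation, consistent with the use of Lemma 6 in Luedtke and Van Der Laan (2016) mentioned above; hence the exploration part stays sub-linear whenever the exploration probability is bounded by a multiple of $t^{-\alpha}$.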
7.2 Evaluation of Known Policies in Online Learning
We could extend our method to evaluate a new, known policy that is different from the bandit policy in the online environment. Specifically, we focus on the statistical inference of policy evaluation in online learning under the contextual bandit framework. Recalling the setting and notation in Section 2, given a target policy , we propose its doubly robust value estimator as
where is the current time step or the termination time and is the estimator for the propensity score of the chosen action denoted as . The following theorem summarizes the asymptotic properties of , built on Theorem 2.
Corollary 2
(Asymptotic normality for evaluating a known policy) Suppose the conditions in Theorem 2 hold. Furthermore, assuming the rate double robustness condition that , we have , with
where .
Here, we impose the same conditions on the bandit algorithms as in Theorem 3 to guarantee sufficient exploration of different arms, and thus the evaluation of an arbitrary policy is valid. The usage of the new rate double robustness assumption and the margin condition (Assumption 4.3) follows a similar logic as in Theorem 3. The estimator of , denoted as , can be obtained similarly to (5). Thus, a two-sided CI for under online optimization is .
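A minimal sketch of this doubly robust estimator for a fixed, known target policy is given below; the function names `q_hat` and `propensity_hat` are our own placeholders for the estimated conditional mean and the estimated propensity of the chosen action.

```python
import numpy as np


def dr_value_known_policy(contexts, actions, rewards, target_policy,
                          q_hat, propensity_hat):
    """Doubly robust value estimate for a fixed, known target policy (a sketch).

    `q_hat(x, a)` returns an estimated conditional mean outcome and
    `propensity_hat(x, a)` an estimated probability that the bandit chose action a
    at context x.
    """
    pseudo = []
    for x, a, r in zip(contexts, actions, rewards):
        a_target = target_policy(x)
        weight = float(a == a_target) / max(propensity_hat(x, a_target), 1e-3)
        pseudo.append(q_hat(x, a_target) + weight * (r - q_hat(x, a_target)))
    pseudo = np.asarray(pseudo, dtype=float)
    return pseudo.mean(), pseudo.std(ddof=1) / np.sqrt(len(pseudo))   # estimate and SE
```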
There are some other extensions that we may consider in future work. First, in this paper, we focus on settings with binary actions; thus, a more general method with multiple actions or even a continuous action space is desirable. Second, we consider contextual bandits in this paper and all the theoretical results are applicable to multi-armed bandits. It would be practically interesting to extend our proposal to reinforcement learning problems. Third, instead of using the rate double robustness assumption in the current paper, it is of theoretical interest to establish a model double robustness version of DREAM in future research.
References
- Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D. and Szepesvári, C. (2011), Improved algorithms for linear stochastic bandits, in ‘Advances in Neural Information Processing Systems’, pp. 2312–2320.
- Agrawal and Goyal (2013) Agrawal, S. and Goyal, N. (2013), Thompson sampling for contextual bandits with linear payoffs, in ‘International Conference on Machine Learning’, PMLR, pp. 127–135.
- Athey (2019) Athey, S. (2019), The impact of machine learning on economics, in ‘The economics of artificial intelligence’, University of Chicago Press, pp. 507–552.
- Auer (2002) Auer, P. (2002), ‘Using confidence bounds for exploitation-exploration trade-offs’, Journal of Machine Learning Research 3(Nov), 397–422.
- Bibaut et al. (2021) Bibaut, A., Dimakopoulou, M., Kallus, N., Chambaz, A. and van der Laan, M. (2021), ‘Post-contextual-bandit inference’, Advances in Neural Information Processing Systems 34.
- Bischl et al. (2017) Bischl, B., Casalicchio, G., Feurer, M., Hutter, F., Lang, M., Mantovani, R. G., van Rijn, J. N. and Vanschoren, J. (2017), ‘Openml benchmarking suites’, arXiv preprint arXiv:1708.03731 .
- Bubeck and Cesa-Bianchi (2012) Bubeck, S. and Cesa-Bianchi, N. (2012), ‘Regret analysis of stochastic and nonstochastic multi-armed bandit problems’, arXiv preprint arXiv:1204.5721 .
- Chakraborty and Moodie (2013) Chakraborty, B. and Moodie, E. (2013), Statistical methods for dynamic treatment regimes, Springer.
- Chambaz et al. (2017) Chambaz, A., Zheng, W. and van der Laan, M. J. (2017), ‘Targeted sequential design for targeted learning inference of the optimal treatment rule and its mean reward’, Annals of statistics 45(6), 2537.
- Chen et al. (2020) Chen, H., Lu, W. and Song, R. (2020), ‘Statistical inference for online decision making: In a contextual bandit setting’, Journal of the American Statistical Association pp. 1–16.
- Chu et al. (2011) Chu, W., Li, L., Reyzin, L. and Schapire, R. (2011), Contextual bandits with linear payoff functions, in ‘Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics’, JMLR Workshop and Conference Proceedings, pp. 208–214.
- Dedecker and Louhichi (2002) Dedecker, J. and Louhichi, S. (2002), Maximal inequalities and empirical central limit theorems, in ‘Empirical process techniques for dependent data’, Springer, pp. 137–159.
- Deshpande et al. (2018) Deshpande, Y., Mackey, L., Syrgkanis, V. and Taddy, M. (2018), Accurate inference for adaptive linear models, in ‘International Conference on Machine Learning’, PMLR, pp. 1194–1203.
- Dimakopoulou et al. (2021) Dimakopoulou, M., Ren, Z. and Zhou, Z. (2021), ‘Online multi-armed bandits with adaptive inference’, Advances in Neural Information Processing Systems 34, 1939–1951.
- Dudík et al. (2011) Dudík, M., Langford, J. and Li, L. (2011), ‘Doubly robust policy evaluation and learning’, arXiv preprint arXiv:1103.4601 .
- Farrell (2015) Farrell, M. H. (2015), ‘Robust inference on average treatment effects with possibly more covariates than observations’, Journal of Econometrics 189(1), 1–23.
- Farrell et al. (2021) Farrell, M. H., Liang, T. and Misra, S. (2021), ‘Deep neural networks for estimation and inference’, Econometrica 89(1), 181–213.
- Feller (2008) Feller, W. (2008), An introduction to probability theory and its applications, vol 2, John Wiley & Sons.
- Hadad et al. (2019) Hadad, V., Hirshberg, D. A., Zhan, R., Wager, S. and Athey, S. (2019), ‘Confidence intervals for policy evaluation in adaptive experiments’, arXiv preprint arXiv:1911.02768 .
- Hall and Heyde (2014) Hall, P. and Heyde, C. C. (2014), Martingale limit theory and its application, Academic press.
- Hou et al. (2021) Hou, J., Bradic, J. and Xu, R. (2021), ‘Treatment effect estimation under additive hazards models with high-dimensional confounding’, Journal of the American Statistical Association pp. 1–16.
- Kallus and Zhou (2018) Kallus, N. and Zhou, A. (2018), ‘Policy evaluation and optimization with continuous treatments’, arXiv preprint arXiv:1802.06037 .
- Kennedy (2022) Kennedy, E. H. (2022), ‘Semiparametric doubly robust targeted double machine learning: a review’, arXiv preprint arXiv:2203.06469 .
- Khamaru et al. (2021) Khamaru, K., Deshpande, Y., Mackey, L. and Wainwright, M. J. (2021), ‘Near-optimal inference in adaptive linear regression’, arXiv preprint arXiv:2107.02266 .
- Lattimore and Szepesvári (2020) Lattimore, T. and Szepesvári, C. (2020), Bandit algorithms, Cambridge University Press.
- Li et al. (2010) Li, L., Chu, W., Langford, J. and Schapire, R. E. (2010), A contextual-bandit approach to personalized news article recommendation, in ‘Proceedings of the 19th international conference on World wide web’, pp. 661–670.
- Li et al. (2011) Li, L., Chu, W., Langford, J. and Wang, X. (2011), Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms, in ‘Proceedings of the fourth ACM international conference on Web search and data mining’, pp. 297–306.
- Lu et al. (2021) Lu, Y., Xu, Z. and Tewari, A. (2021), ‘Bandit algorithms for precision medicine’, arXiv preprint arXiv:2108.04782 .
- Luedtke and Van Der Laan (2016) Luedtke, A. R. and Van Der Laan, M. J. (2016), ‘Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy’, Annals of statistics 44(2), 713.
- Murphy (2003) Murphy, S. A. (2003), ‘Optimal dynamic treatment regimes’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2), 331–355.
- Neel and Roth (2018) Neel, S. and Roth, A. (2018), Mitigating bias in adaptive data gathering via differential privacy, in ‘International Conference on Machine Learning’, PMLR, pp. 3720–3729.
- Nie et al. (2018) Nie, X., Tian, X., Taylor, J. and Zou, J. (2018), Why adaptively collected data have negative bias and how to correct for it, in ‘International Conference on Artificial Intelligence and Statistics’, PMLR, pp. 1261–1269.
- Ramprasad et al. (2022) Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W. W. and Cheng, G. (2022), ‘Online bootstrap inference for policy evaluation in reinforcement learning’, Journal of the American Statistical Association (just-accepted), 1–31.
- Shin et al. (2019a) Shin, J., Ramdas, A. and Rinaldo, A. (2019a), ‘Are sample means in multi-armed bandits positively or negatively biased?’, arXiv preprint arXiv:1905.11397 .
- Shin et al. (2019b) Shin, J., Ramdas, A. and Rinaldo, A. (2019b), ‘On the bias, risk and consistency of sample means in multi-armed bandits’, arXiv preprint arXiv:1902.00746 .
- Smucler et al. (2019) Smucler, E., Rotnitzky, A. and Robins, J. M. (2019), ‘A unifying approach for doubly-robust regularized estimation of causal contrasts’, arXiv preprint arXiv:1904.03737 .
- Srinivas et al. (2009) Srinivas, N., Krause, A., Kakade, S. M. and Seeger, M. (2009), ‘Gaussian process optimization in the bandit setting: No regret and experimental design’, arXiv preprint arXiv:0912.3995 .
- Su et al. (2019) Su, Y., Dimakopoulou, M., Krishnamurthy, A. and Dudík, M. (2019), ‘Doubly robust off-policy evaluation with shrinkage’, arXiv preprint arXiv:1907.09623 .
- Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018), Reinforcement learning: An introduction, MIT press.
- Swaminathan et al. (2017) Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudik, M., Langford, J., Jose, D. and Zitouni, I. (2017), Off-policy evaluation for slate recommendation, in ‘Advances in Neural Information Processing Systems’, pp. 3632–3642.
- Turvey (2017) Turvey, R. (2017), Optimal Pricing and Investment in Electricity Supply: An Essay in Applied Welfare Economics, Routledge.
- Wager and Athey (2018) Wager, S. and Athey, S. (2018), ‘Estimation and inference of heterogeneous treatment effects using random forests’, Journal of the American Statistical Association 113(523), 1228–1242.
- Waisman et al. (2019) Waisman, C., Nair, H. S., Carrion, C. and Xu, N. (2019), ‘Online inference for advertising auctions’, arXiv preprint arXiv:1908.08600 .
- Wang et al. (2017) Wang, Y.-X., Agarwal, A. and Dudík, M. (2017), Optimal and adaptive off-policy evaluation in contextual bandits, in ‘International Conference on Machine Learning’, PMLR, pp. 3589–3597.
- Zhan et al. (2021) Zhan, R., Hadad, V., Hirshberg, D. A. and Athey, S. (2021), Off-policy evaluation via adaptive weighting with data from contextual bandits, in ‘Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining’, pp. 2125–2135.
- Zhang et al. (2021) Zhang, K., Janson, L. and Murphy, S. (2021), ‘Statistical inference with m-estimators on adaptively collected data’, Advances in Neural Information Processing Systems 34, 7460–7471.
- Zhang et al. (2020) Zhang, K. W., Janson, L. and Murphy, S. A. (2020), ‘Inference for batched bandits’, arXiv preprint arXiv:2002.03217 .
- Zhou (2015) Zhou, L. (2015), ‘A survey on contextual multi-armed bandits’, arXiv preprint arXiv:1508.03326 .
Supplementary to ‘Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning’
This supplementary article provides sensitivity analyses and all the technical proofs for the established theorems for policy evaluation in online learning under the contextual bandits. Note that the theoretical results in Section 7 can be proven in a similar manner by arguments in Section B. Thus we omit the details here.
Appendix A Sensitivity Test for the Choice of
We conduct a sensitivity test for the choice of in this section. We run all the simulations with and find that Algorithm 1 is not sensitive to the choice of .
Appendix B Technical Proofs for Main Results
This section provides all the technical proofs for the established theorems for policy evaluation in online learning under the contextual bandits.
B.1 Proof of Lemma 4.1
The proof of Lemma 4.1 consists of three main steps. To be specific, we first reconstruct the target difference and decompose it into two parts. Then, we establish the bound for each part and derive its lower bound .
Step 1: Recall Equation (1) in the main paper with being a design matrix at time with as the number of pulls for action , we have
We are interested in the quantity
(B.1)
Note that can be written as
and since , we can write (B.1) as
Our goal is to find a lower bound of for any . Notice that by the triangle inequality we have , thus we can find the lower bound using the inequality as
(B.3)
Step 2: We focus on bounding first. By the relationship between the eigenvalues and the norm of a symmetric matrix, we have for any invertible matrix . Thus we can obtain that
which leads to
(B.4)
By the Cauchy–Schwarz inequality, we further bound the norm of as
(B.5)
Step 3: Lastly, using the results in (B.5), we have
(B.6)
By the definition of and Lemma 2 in Chen et al. (2020), for any constant , we have
Therefore, from (B.3) and (B.6), taking , we have that under event ,
(B.7)
Based on the above results, it is immediate that the online ridge estimator is consistent to if as . The proof is hence completed.
B.2 Proof of Corollary 1
Since , by Holder’s inequality, we have
which follows
By Lemma 4.1, we further have
Note that by the Triangle Inequality,
thus for , we have
with
consistent with time .
B.3 Proof of Theorem 1
The proof of Theorem 1 consists of two main parts to show the probability of exploration under UCB and TS, respectively, by noting the probability of exploration under EG is given by its definition.
B.3.1 Proof for UCB
We first show the probability of exploration under UCB. This proof consists of three main steps, stated as follows:
1. We first rewrite the target probability by its definition and express it as
2. Then, we establish the bound for the variance estimation such that
3. Lastly, we bound using the result in Corollary 1.
Step 1: We rewrite the target probability by definition and decompose it into two parts.
Let . Based on the definition of the probability of exploration and the form of the estimated optimal policy , we have
(B.8)
where the expectation is taken with respect to the history before time point .
Next, we rewrite and using the estimated mean and variance components and , where . We focus on first.
Given , i.e., , based on the definition of the taken action in Lin-UCB that
the probability of choosing action 0 rather than action 1 is
where the second equality is to rearrange the estimated mean and variance components, and the last equality comes from the definition of the conditional probability. Combining this with (B.8), we have
The rest of the proof aims to bound the probability
Step 2: Secondly, we bound the variance and .
We consider the quantity first. Let be any vector, then the sample variance under action 0 is given by
(B.10)
where the first inequality is to replace with any normalized vector, and the second inequality is due to the definition. According to (B.10), combined with Assumption 4.1, we can further bound by
(B.11)
It is immediate from (B.10) and (B.11) that
(B.12)
Note that
combined with the fact that , then can be further expressed as
where the first inequality is owing to Assumption 4.2 , and the second inequality is owing to Assumption 4.1. This together with (B.12) gives the lower and upper bounds of as
(B.13)
Similarly we have
(B.14)
which follows that
Combining (B.9) and the above equation, we get the conclusion that
(B.15)
Step 3: Lastly, we aim to bound using the result in Corollary 1.
For any , define , which satisfies by Corollary 1. Then on the Event , we have
Thus for the probability , we have
(B.16)
Since as , for any constant , there exists large enough satisfying . Then by Assumption 4.3, there exists some constant such that
i.e., there exists some constant such that
Therefore, combined with the Equation (B.16), we have
The proof is hence completed.
B.3.2 Proof for TS
We next show the probability of exploration under TS, which consists of three main steps:
1. We first define an event for any , on which the estimated difference between the mean functions is close to the true difference, and we have by Corollary 1.
2. Next, we bound the probability of exploration on the event .
3. Lastly, we combine the results in the previous two steps to get the unconditional probability of exploration .
Step 1: For any , define , which satisfies by Corollary 1. Then for the probability , we have
(B.17)
Without loss of generality, we assume , then , which implies .
Using the law of iterated expectations, based on the definition of the probability of exploration and the form of the estimated optimal policy , on the event , we have
(B.18)
Step 2: Next, we focus on deriving the bound of on the Event .
Recalling the bandit mechanism of TS, we have , where is drawn from the posterior distribution of given by
From the posterior distributions and the definitions of and , we have
that is,
Notice that and are drawn independently, thus,
(B.19)
Recall in TS, based on the posterior distribution in (B.19). Therefore, on the Event we have
(B.20)
where is the cumulative distribution function of the standard normal distribution. Denote , since , i.e., . By applying the tail bound established for the normal distribution in Section 7.1 of Feller (2008), we have that (B.20) can be bounded as
This yields that on the Event ,
Using similar arguments as in proving (B.13) that , we have
Therefore, combining the above two equations leads to
where the expectation is taken with respect to history .
Note that on the Event , we have
which follows that on the Event ,
(B.21)
B.4 Proof of Theorem 2
We detail the proof of Theorem 2 in this section. Using arguments similar to those for (B.1) in the proof of Lemma 4.1, we can rewrite as
Our goal is to prove that is asymptotically normal. The proof generalizes Theorem 3.1 in Chen et al. (2020) by considering commonly used bandit algorithms, including UCB, TS, and EG. We complete the proof in the following four steps:
- Step 1: Prove that , where is the variance matrix to be specified shortly.
- Step 2: Prove that , where for .
- Step 3: Prove that .
- Step 4: Combine the above results in Steps 1-3 using Slutsky's theorem.
Step 1: We first focus on proving that . Using the Cramér–Wold device, it suffices to show that for any ,
Note that ; hence is a martingale difference sequence. We next show the asymptotic normality of using the Martingale Central Limit Theorem, in the following two parts: i) check the conditional Lindeberg condition; ii) derive the limit of the conditional variance.
Firstly, we check the conditional Lindeberg condition. For any , denote
(B.22)
Notice that , we have
Combining this with (B.22), we obtain that
(B.23)
where the right hand side equals
Then, we can further write (B.23) as
(B.24)
where when and 0 otherwise. Since conditioned on are , , we have the right hand side of the above inequality equals
where is the random variable given by . Note that is dominated by with and converges to 0, as . Then, by the Dominated Convergence Theorem, the results in (B.24) can be further bounded by
Therefore, the conditional Lindeberg condition holds.
Secondly, we derive the limit of the conditional variance. Notice that
Since is independent of and given , and , we have
where
Here, the first equation follows from the iterated expectation over and the fact that is a constant given and , and the second equation follows from the fact that is a constant given and independent of . Define
(B.25)
then the conditional variance can be expressed as
which can be further written as
Since , for any there exists a constant such that for all . Therefore, for the expectation of over the history, we have
It follows immediately that
Therefore, by the triangle inequality, we have
Since the above equation holds for any and , we have
which goes to zero as . Thus,
(B.26)
Next, we consider the variance of . Denoting , we have . Noticing that , by Lemma B.1 we have
which goes to zero as . Combined with Equation (B.26), it follows immediately that as goes to , we have
Therefore, as goes to , we have that converges to
(B.27)
Thus, following similar arguments as in S1.2 of Chen et al. (2020), we have
where
Finally, by Martingale Central Limit Theorem, we have
where
(B.28)
The first part is thus completed.
Step 2: We next show that , for which it is sufficient to find the limit of . By Lemma 6 in Chen et al. (2020), it suffices to derive the limit of for any .
Since for each and , by Theorem 2.19 in Hall and Heyde (2014), we have
(B.29)
Recalling the results in (B.4) and (B.28), we have . Combining this with (B.29), we have
By Lemma 6 in Chen et al. (2020) and Continuous Mapping Theorem, we further have
Step 3: We next focus on proving . It suffices to show that holds for any standard basis vector . Since
and by (B.4), we have
Thus, we have , as .
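For reference, the version of Slutsky's theorem used in Step 4 below (in generic notation) states: if $X_T$ converges in distribution to $X$ and $Y_T$ converges in probability to a constant (or a constant matrix) $c$, then $X_T+Y_T$ converges in distribution to $X+c$ and $Y_T X_T$ converges in distribution to $cX$.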
Step 4: Finally, we combine the above results using Slutsky’s theorem, and conclude that
where is defined in (B.28). Denote the variance term as
with denoting the conditional variance of given , for , we have
The proof is hence completed.
B.5 Proof of Theorem 3
In this section, we prove the asymptotic normality of the proposed value estimator under DREAM stated in Theorem 3. The proof consists of four steps. In Step 1, we aim to show
where
Next, in Step 2, we establish
where
The above two steps yield that
(B.30)
Then, in Step 3, based on (B.30) and Martingale Central Limit Theorem, we show
with
where for , and as .
Lastly, in Step 4, we show that the variance estimator in Equation (5) of the main paper is a consistent estimator of . The proof of Theorem 3 is thus completed.
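For orientation only, and not as the paper's exact construction, one common doubly robust value estimate for adaptively collected bandit data takes the form
\[
\hat V_T \;=\; \frac{1}{T}\sum_{t=1}^{T}\Big\{\hat\mu_t\big(x_t,\hat\pi_t(x_t)\big)\;+\;\frac{\mathbb{1}\{a_t=\hat\pi_t(x_t)\}}{\hat\kappa_t(a_t\mid x_t)}\,\big(r_t-\hat\mu_t(x_t,a_t)\big)\Big\},
\]
where $\hat\mu_t$ denotes an estimated conditional mean outcome, $\hat\pi_t$ the estimated optimal policy, and $\hat\kappa_t$ the action probability of the running bandit algorithm; all symbols here are illustrative, and the DREAM estimator in the main paper may differ in its exact weighting and normalization.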
Step 1: We first show . To this end, define an intermediate term as
Thus, it suffices to show and .
We first show that the first term in (B.32) is . Define a class of functions
where and are two classes of functions that map the context to a probability.
Define the supremum of the empirical process indexed by as
(B.33)
Notice that
by the definitions, and thus, using the law of iterated expectations, we have
where the last equation is due to the definition of the noise . Therefore, Equation (B.33) can be further written as
Next, we show the second moment is bounded by
by the definitions, and thus, using the law of iterated expectations, we have
Notice that and almost surely for some constant (by the definition of a valid bandit algorithm and the results of Theorem 1), we have
where is a bounded constant. Thus we have
Therefore, for the second moment of the inner term of the first term, we have
Therefore, we have
and
It follows from the maximal inequality developed in Section 4.2 of Dedecker and Louhichi (2002) that there exists some constant such that
The above right-hand side is upper bounded by
where denotes some universal constant. Hence, we have
(B.34)
Combined with Equation (B.33), we have
Therefore, for the first term in (B.32), we have
Then we consider the second term in (B.32), where
for some as the bound of . By the Cauchy–Schwarz inequality, the above term is further bounded by
Given Assumption 4.4, we have the above bounded by , and thus the second term in (B.32) is .
Therefore, we have that holds.
Then, we focus on proving
(B.35)
is . Specifically, since
similarly to the proof of the first term in (B.32), we can prove that . Thus, holds. The first step is thus completed.
Step 2: We next focus on proving . By the definitions of and , we have
We first show . Since almost surely for some constant (by the definition of a valid bandit algorithm and the results of Theorem 1), it suffices to show
which is the direct conclusion of Lemma B.3.
Next, we show . First, we can express as
Note that
where by Theorem 1 and by Lemma B.2 as . Thus, for large enough t, we have
(B.37)
for some constant .
Then we focus on proving here. Define a class of functions
where and are two classes of functions that map the context to a probability.
Define the supremum of the empirical process indexed by as
(B.38)
Firstly we notice that
since .
Secondly, since almost surely for some constant (by the definition of a valid bandit algorithm and the results of Theorem 1), by the triangle inequality we have
Therefore, we have
Therefore, we have
and
It follows from the maximal inequality developed in Section 4.2 of Dedecker and Louhichi (2002) that there exists some constant such that
The above right-hand side is upper bounded by
where denotes some universal constant. Hence, we have
(B.39)
Combined with Equation (B.38), we have
and .
Next, for , by the triangle inequality, we have
Notice that , we have
Therefore, . Hence, holds.
Step 3: Then, to show the asymptotic normality of the proposed value estimator under DREAM, based on the above two steps, it is sufficient to show
(B.40)
as , using the Martingale Central Limit Theorem. By , we define
By (B.5), we have
Since is independent of and given , and , we have
Notice that by the definition, we have
Thus, we have , which implies that , and are martingale difference sequences. To prove Equation (B.40), it suffices to prove that , as , using the Martingale Central Limit Theorem.
Firstly we calculate the conditional variance of given . Note that
and
Since is independent of and given , we have
By the definition in Equation (B.25),
we have
Similarly as before, since , for any there exists a constant such that for all .
We first consider the expectation of . Note that is not conditioned on ; thus we have
Therefore by the triangle inequality we have
Since the above equation holds for any , we have
Since almost surely for some constant (by the definition of a valid bandit algorithm and the results of Theorem 1), and by Equation (B.37) for some constant , we therefore have
as .
Then we consider the variance of over different histories. By Lemma B.1, we have
Therefore, we have
Similarly as before, we can prove
which implies
Therefore, as goes to infinity, we have
and
Using the same technique of conditioning on and , we have
Thus, we further have
and
Therefore as goes to infinity, we have
where
(B.42)
Then we check the conditional Lindeberg condition. For any , we have
Note that converges to zero as goes to infinity and is dominated by given . Therefore, by the Dominated Convergence Theorem, we conclude that
Thus the conditional Lindeberg condition is checked.
Next, recall the conditional variance derived in (B.42). By the Martingale Central Limit Theorem, we have
as . Hence, we complete the proof of Equation (B.40).
Step 4: Finally, we show that the variance estimator in Equation (5) of the main paper is a consistent estimator of . Recall that the variance estimator is
(B.43)
Firstly, we prove that the first line of the above Equation (B.43) is a consistent estimator of
Recall that we denote , thus we can rewrite as
We decompose the proposed variance estimator by
Our goal is to prove that the first three lines are all .
Firstly, recall that
By Lemma 4.1, we have . Under Assumption 4.1, we have . Thus by Lemma 6 in Luedtke and Van Der Laan (2016), we have
Since is i.i.d. conditional on , and , noting that
by the Law of Large Numbers we have
Therefore, the first line is .
Secondly, denote the second line as
Since almost surely for some constant (by the definition of a valid bandit algorithm and the results of Theorem 1), by the triangle inequality, we have
Since by Lemma B.2, there exists some constant such that
By Lemma 6 in Luedtke and Van Der Laan (2016), we have , thus
which implies .
Lastly, under the assumption that is a consistent estimator of , the third line is by the Continuous Mapping Theorem.
Given the above results, we have
Thus, we can further express as
(B.44)
Notice that
where
We also note that, by Lemma B.2, there exist some constants and such that and
therefore we have the result that
Thus, Equation (B.44) can be expressed as
Note that
which implies
By Equation (B.37), for large enough , there exists some constant such that
Since almost surely for some constant (by the definition of a valid bandit algorithm and the results shown in Theorem 1), we also have
Therefore,
from which it follows immediately that
Since for any , for any there exists some constant such that for any with ; thus we have
Note that by the triangle inequality,
thus we have
Since the above equation holds for any , we have
which implies . Therefore, we have
By the Law of Large Numbers, we further have
Next, we prove that the second line of Equation (B.43) is a consistent estimator of . By the Central Limit Theorem and the Continuous Mapping Theorem, we have
Since are i.i.d., are i.i.d. as well. Thus, by the Law of Large Numbers, we have
and
Note that
since and the fact that and are consistent, we have
Therefore
by the Continuous Mapping Theorem. The proof of Theorem 3 is thus completed.
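As a brief illustration of how the consistency of the variance estimator is used downstream (with generic notation and normalization that may differ from the main paper), the asymptotic normality established above yields a Wald-type interval of the form $\hat V_T \pm z_{1-\alpha/2}\,\hat\sigma_T/\sqrt{T}$, where $\hat\sigma_T^2$ is the variance estimator shown to be consistent in Step 4 and $z_{1-\alpha/2}$ is the standard normal quantile.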
B.6 Results and Proof for Auxiliary Lemmas
Lemma B.1
Suppose a random variable is restricted to and . Then the variance of is bounded by .
Proof: First, consider the case that . Notice that we have , since for all . Therefore,
Then we consider the general interval . Define , which is restricted to . Equivalently, , from which it follows immediately that
where the inequality is based on the first result. Now, by substituting , the bound equals
which is the desired result.
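As a concrete instance of the first step of this argument (written in generic notation), for a random variable $X\in[0,1]$ with mean $\mu=\mathbb{E}[X]$, the pointwise bound $X^2\le X$ gives $\mathbb{E}[X^2]\le\mathbb{E}[X]$, and hence $\operatorname{Var}(X)=\mathbb{E}[X^2]-\mu^2\le\mu-\mu^2\le\mu$; the general case then follows by the rescaling described above.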
Lemma B.2
Suppose the conditions in Theorem 2 hold together with Assumption 4.3. Then there exist some constants and such that and .
Proof: By Theorem 2, we have , thus .
By Assumption 4.3, there exist some constants and such that and
Thus, with probability greater than , we have , which further implies . In other words, with probability greater than , which converges to 1 as . Therefore, as , we have
i.e.,
Lemma B.3
Suppose the conditions in Lemma 4.1 hold. Assuming Assumption 4.3 with as , we have
Proof: Without loss of generality, suppose . Since , we have
(B.45)
Similarly to (B.45), we have
(B.46)
Combining (B.45) and (B.46), we have
(B.47)
Since by assumption, (B.47) can be simplified as
where and . Thus, we have
To show , it suffices to show that has an upper bound. Since , it suffices to show
has a lower bound. We further notice that for any ,
(B.48)
To show that has a lower bound, it is sufficient to show and , correspondingly.
Firstly, we show . By Theorem 2, , which implies
Thus we have
where equation (*) is derived from Lemma 6 in Luedtke and Van Der Laan (2016). By Assumption 4.3, there exist some constants and such that and
Therefore we have
(B.49)
Notice that
(B.50)
where the first inequality holds since . Combining (B.48), (B.49), and (B.50), we have
(B.51)
Next, we consider the second part . Note that
we have
(B.52)
where the inequality holds since
Thus, by (B.52), we further have
(B.53)
Since , based on (B.53), we have
(B.54)
Noticing that and combining with (B.54), we further have
By Theorem 2, , which implies . And by Lemma 6 in Luedtke and Van Der Laan (2016), , we have
(B.55)
Therefore, combining Equation (B.51) and Equation (B.55), we have
Thus, we have