Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection
Abstract
We consider the stochastic contextual bandit problem under a high-dimensional linear model. We focus on the case where the action space is finite and random, with each action associated with a randomly generated contextual covariate. This setting has important applications such as personalized recommendation, online advertisement, and personalized medicine, but it is challenging because we must balance exploration and exploitation. We propose using doubly growing epochs and estimating the parameter by the best subset selection method, which is easy to implement in practice. This approach achieves, with high probability, a regret that is nearly independent of the "ambient" dimension of the regression model. We further attain a sharper regret by using the SupLinUCB framework, matching the minimax lower bound of low-dimensional linear stochastic bandit problems. Finally, we conduct extensive numerical experiments to demonstrate the applicability and robustness of our algorithms empirically.
Keywords: Best subset selection, high-dimensional models, regret analysis, stochastic bandit.
1 Introduction
Contextual bandit problems have received significant attention over the past years in different communities, such as statistics, operations research, and computer science. This class of problems studies how to make optimal sequential decisions as new information arrives: we aim to maximize the cumulative reward while iteratively improving the policy based on newly observed outcomes. It has many important modern applications, such as personalized recommendation (Li et al., 2010, 2011), online advertising (Krause and Ong, 2011), cost-sensitive classification (Agarwal et al., 2014), and personalized medicine (Goldenshluger and Zeevi, 2013; Bastani and Bayati, 2015; Keyvanshokooh et al., 2019). In most contextual bandit problems, at each iteration we first obtain some new information, then take an action according to the current policy, and finally observe a new reward; the goal is to maximize the total reward while iteratively improving the policy. This paper is concerned with the linear stochastic bandit model, one of the most fundamental models among contextual bandit problems: the expected reward at each time period is modeled as a linear function of random covariates that depend on the chosen action. This model has received considerable attention (Auer, 2002; Abe et al., 2003; Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Chu et al., 2011), and, as we discuss in more detail below, it naturally finds applications in optimal sequential treatment regimes.
In a linear stochastic bandit model, at each time period $t = 1, \dots, T$, we are given an action space $\mathcal{A}_t$. Each feasible action $a \in \mathcal{A}_t$ is associated with a $d$-dimensional contextual covariate $x_{t,a} \in \mathbb{R}^d$ that is known before any action is taken. Next, we take an action $a_t \in \mathcal{A}_t$ and then observe a reward $r_t$. We assume that the reward follows the linear model
$$r_t \;=\; \langle x_{t,a_t}, \theta^* \rangle + \varepsilon_t, \qquad (1)$$
where $\theta^* \in \mathbb{R}^d$ is an unknown parameter and the $\varepsilon_t$'s are noise terms. Without loss of generality, we assume that we aim to maximize the total reward. We measure the performance of the selected sequence of actions by the accumulated regret
$$R_T \;=\; \sum_{t=1}^{T}\Bigl(\max_{a\in\mathcal{A}_t}\langle x_{t,a},\theta^*\rangle \;-\; \langle x_{t,a_t},\theta^*\rangle\Bigr). \qquad (2)$$
Essentially, the regret measures the difference between the best noiseless reward we could achieve if we knew the true parameter $\theta^*$ and the noiseless reward of the actions we actually take. The regret is always nonnegative, and our goal is to minimize it.
Meanwhile, due to advances in technology, there are many modern applications in which we encounter high dimensionality, i.e., the dimension $d$ of the covariates is large. Note that the total number of decision periods plays the role of the sample size, i.e., the number of observations from which we can "learn" the true parameter $\theta^*$. Such high-dimensional problems arise frequently in practice. For example, given the genomic information of some patients, we aim to find a policy that assigns each patient to the best treatment for him/her, and the number of patients is usually much smaller than the dimension of the covariates. In this paper, we consider the linear stochastic contextual bandit problem under such a high-dimensional setting, where the dimension $d$ may exceed the horizon length $T$. We assume that $\theta^*$ is sparse, i.e., at most $s$ components of $\theta^*$ are non-zero. This assumption is commonly imposed in the high-dimensional statistics and signal processing literature (Donoho, 2006; Bühlmann and Van De Geer, 2011).
In addition, regarding the action spaces $\mathcal{A}_t$, it is known in the literature (Dani et al., 2008; Shamir, 2015; Szepesvari, 2016) that if there are infinitely many feasible actions at each iteration, the minimax regret lower bound depends polynomially on the ambient dimension, so the curse of dimensionality cannot be circumvented. We therefore assume that the action spaces are finite, small, and random. In particular, we assume that the number of feasible actions in each period is bounded by a constant $K$, and that each action in $\mathcal{A}_t$ is associated with an independently and randomly generated contextual covariate. In most practical applications, this finite-action-space setting is naturally satisfied; for example, in the treatment example there are usually only a small number of feasible treatments available. We refer the readers to Section 3.1 for a complete description and discussion of the assumptions.
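To make the setting concrete, the following sketch simulates the finite-action sparse linear bandit environment described above under a placeholder uniformly random policy; the dimension, sparsity, number of actions, horizon, and noise scale are illustrative choices rather than values prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, K, T = 1000, 5, 10, 200   # ambient dimension, sparsity, actions per period, horizon

# Sparse unknown parameter: only s coordinates are non-zero.
theta_star = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
theta_star[support] = rng.normal(size=s)

regret = 0.0
for t in range(T):
    # K independently generated contextual covariates, one per feasible action.
    X_t = rng.normal(size=(K, d))
    a_t = rng.integers(K)                                        # placeholder policy: uniform at random
    reward = X_t[a_t] @ theta_star + rng.normal()                # linear reward model (1)
    regret += np.max(X_t @ theta_star) - X_t[a_t] @ theta_star   # per-period regret as in (2)

print(f"cumulative regret of the uniform policy: {regret:.1f}")
```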
1.1 Literature Review
We briefly review existing works on linear stochastic bandit problems under both low-dimensional and high-dimensional settings. Under the classical low-dimensional setting, Auer (2002) pioneers the use of upper-confidence-bound (UCB) type algorithms for the linear stochastic bandit; this class of algorithms is among the most powerful and fundamental for such problems and is also considered in Chu et al. (2011). Dani et al. (2008); Rusmevichientong and Tsitsiklis (2010) study linear stochastic bandit problems with large or infinite action spaces and derive corresponding lower bounds. Under the high-dimensional setting, where $\theta^*$ is assumed to be sparse, when the action spaces are hyper-cubes, Lattimore et al. (2015) develop the SETC algorithm that attains nearly dimension-independent regret bounds. We point out that this algorithm exploits the unique structure of hyper-cubes and is unlikely to be applicable to general action spaces, including the finite action spaces of our interest. Abbasi-Yadkori et al. (2012) consider a UCB-type algorithm for general action sets and obtain a regret upper bound that depends polynomially on the ambient dimension $d$; here and below, the $\tilde{O}(\cdot)$ notation is the big-O notation that ignores all logarithmic factors. Carpentier and Munos (2012) consider a different reward model and obtain a corresponding regret upper bound. Goldenshluger and Zeevi (2013); Bastani and Bayati (2015) study a variant of the linear stochastic bandit problem in which only one contextual covariate is observed at each time period, while each action corresponds to a different unknown parameter. We point out that this model is a special case of our model (1), as discussed in Foster et al. (2018).
Another closely related problem is online sparse prediction (Gerchinovitz, 2013; Foster et al., 2016), in which sequential predictions of the responses are of interest and the regret is measured in mean-squared error. It can be further generalized to online empirical-risk minimization (Langford et al., 2009) or even the more general derivative-free/bandit convex optimization (Nemirovsky and Yudin, 1983; Flaxman et al., 2005; Agarwal et al., 2010; Shamir, 2013; Besbes et al., 2015; Bubeck et al., 2017; Wang et al., 2017). Most existing works along this direction assume continuous (infinite) action spaces, which allow small-perturbation type methods such as estimated gradient descent. We summarize the existing theoretical results on stochastic linear bandit problems in Table 1.
From the application perspective of finding the optimal treatment regime, the existing literature focuses on achieving optimality in batch settings. General approaches include model-based methods such as Q-learning (Watkins and Dayan, 1992; Murphy, 2003; Moodie et al., 2007; Chakraborty et al., 2010; Goldberg and Kosorok, 2012; Song et al., 2015) and A-learning (Robins et al., 2000; Murphy, 2005), as well as model-free policy search methods (Robins et al., 2008; Orellana et al., 2010a, b; Zhang et al., 2012b; Zhao et al., 2012, 2014). These methods are all developed for batch settings, in which the whole dataset is used to estimate the optimal treatment regime. This approach is applicable after a clinical trial is completed, or when an observational dataset is fully available. However, the batch-setting approach is not applicable when the optimal treatment regime must be identified urgently. For a contemporary example, during the recent outbreak of coronavirus disease (COVID-19), it is extremely important to quickly identify an optimal or nearly optimal treatment regime that assigns each patient to the best treatment among a few choices. However, since the disease is novel, there is little or no historical data, so the batch-setting approaches mentioned above are not applicable. On the other hand, our model naturally provides a "learn while optimizing" alternative that sequentially improves the policy/treatment regime. This provides an important motivating application of the proposed method.
1.2 Major Contributions
In this paper, we propose new algorithms that iteratively learn the parameter while controlling the regret. Our algorithms combine the "doubling trick" with modern optimization techniques, carefully balancing randomized exploration, which is needed to fully learn the parameter, against reward maximization, which is needed to achieve near-optimal regret. In particular, our algorithms fall under the general UCB-type framework (Lai and Robbins, 1985; Auer et al., 2002). Briefly speaking, we start with some pure-exploration periods in which we take actions uniformly at random and estimate the parameter. Then, over a certain number of subsequent periods, which we call an epoch, we take the action at each time period by maximizing an upper confidence band built from the current estimator. At the end of each epoch, we renew the estimator using the newly collected data, enter the next epoch with the new estimator, and repeat this procedure until the end of the horizon.
The high-dimensional regime (i.e., $d \gg T$) poses significant challenges in our setting, which cannot be addressed by existing works. First, unlike the low-dimensional regime, where ordinary least squares (OLS) always admits closed-form solutions and error bounds, in the high-dimensional regime most existing methods such as the Lasso (Tibshirani, 1996) or the Dantzig selector (Candes and Tao, 2007) require the sample covariance matrix to satisfy certain "restricted eigenvalue" conditions (Bickel et al., 2009), which do not hold in our setting for sequentially selected covariates. Additionally, our action spaces are finite. This rules out several existing algorithms, including the SETC method (Lattimore et al., 2015), which exploits the specific structure of hyper-cube action sets, and finite-difference type algorithms in stochastic sparse convex optimization (Wang et al., 2017; Balasubramanian and Ghadimi, 2018). We adopt the best subset selection estimator (Miller, 2002) to derive valid confidence bands using only ill-conditioned sample covariance matrices. Note that while the optimization for best subset selection is NP-hard in theory (Natarajan, 1995), thanks to the tremendous progress of modern optimization, solving such problems is practically efficient, as discussed in Pilanci et al. (2015); Bertsimas et al. (2016). In addition, the renewed estimator may be correlated with the previous one, which decreases efficiency. We let the epoch sizes grow exponentially, which is known as the "doubling trick" (Auer et al., 1995); this "removes" the correlation between the support sets recovered by best subset regression. Our theoretical analysis is also motivated by known analytical frameworks such as the elliptical potential lemma (Abbasi-Yadkori et al., 2011) and the SupLinUCB framework (Auer, 2002), which we use to obtain sharp regret bounds.
We summarize our main theoretical contribution in the following corollary, which is essentially a simplified version of Theorem 2 in Section 3.3.
Corollary 1.
We assume that, for each period, the contextual covariates follow a distribution satisfying Assumption 1 (for instance, a standard Gaussian distribution), the true parameter is sparse with bounded norm, and the number of feasible actions is bounded by a constant. Then there exists a policy such that, with high probability, for all horizon lengths, the accumulated regret is bounded by a quantity that depends on the sparsity and the horizon but not on the ambient dimension, up to a polynomial of logarithmic factors in the dimension and the horizon.
Note that this corollary holds even when the ambient dimension far exceeds the horizon. Theorem 2 provides more general results than this corollary and can be applied to a broader family of distributions for the contextual covariates. A simpler and more easily implementable algorithm, with a weaker regret guarantee, is given in Algorithm 1, with its analysis in Theorem 1. The form of the regret guarantee in Corollary 1 is particularly similar to its low-dimensional counterparts in Table 1, with the ambient dimension $d$ replaced by the intrinsic dimension $s$.
Notations. Throughout this paper, for an integer $n$, we use $[n]$ to denote the set $\{1,\dots,n\}$. We use standard notation for vector norms. Given a positive definite matrix $A$, we use $\|x\|_A = \sqrt{x^\top A x}$ to denote the norm weighted by $A$. We also use $\langle \cdot,\cdot\rangle$ to denote the inner product of two vectors. Given a set $S$, $S^c$ is its complement and $|S|$ denotes its cardinality. Given a $d$-dimensional vector $\theta$, we use $\theta_j$ to denote its $j$-th coordinate, and we use $\mathrm{supp}(\theta)$ to represent the support of $\theta$, which is the collection of indices corresponding to nonzero coordinates. Furthermore, given an index set $S$, we use $\theta_S$ to denote the restriction of $\theta$ to $S$, which is an $|S|$-dimensional vector; similarly, for a matrix, its restriction to the rows and columns indexed by $S$ is defined analogously. In addition, given two sequences of nonnegative real numbers $\{a_n\}$ and $\{b_n\}$, $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ mean that there exists an absolute constant $C>0$ such that $a_n \le C b_n$ and $a_n \ge C b_n$ for all $n$, respectively; we write $a_n \asymp b_n$ if both hold simultaneously. We say that a random event holds with probability at least a prescribed level if its probability exceeds that level up to an absolute constant. Finally, we remark that arm, action, and treatment all refer to the same concept in different applications. We also denote by $a_t$ the action taken in period $t$ and by $x_t = x_{t,a_t}$ the associated covariate.
Paper Organization. The rest of this paper is organized as follows. We present our algorithms in Section 2. Then we establish our theoretical results in Section 3, under certain technical conditions. We conduct extensive numerical experiments using both synthetic and real datasets to investigate the performance of our methods in Section 4. We conclude the paper in Section 5.
2 Methodologies
In this section, we present the proposed methods for the linear stochastic bandit problem, where we aim to minimize the regret defined in (2). In Section 2.1, we first introduce an algorithm called "Sparse-LinUCB" (SLUCB), summarized in Algorithm 1, which can be implemented efficiently and demonstrates the core idea of our algorithmic design. The SLUCB algorithm is a variant of the celebrated LinUCB algorithm (Chu et al., 2011) for classical linear contextual bandit problems. It is intuitive and easy to implement; however, we cannot derive the optimal upper bound for its regret. To close this gap, we further propose a more sophisticated algorithm called "Sparse-SupLinUCB" (SSUCB) (Algorithm 2) in Section 2.2. In comparison with the SLUCB algorithm, the SSUCB algorithm constructs the upper confidence bands by sequentially selecting historical data and achieves the optimal rate (up to logarithmic factors).
2.1 Sparse-LinUCB Algorithm
As we mentioned above, our algorithm is inspired by the standard LinUCB algorithm, which balances the exploration-exploitation tradeoff following the principle of "optimism in the face of uncertainty". In particular, the LinUCB algorithm repeatedly constructs upper confidence bands for the potential rewards of the actions; these upper confidence bands are optimistic estimators, and we pick the action with the largest upper confidence band. This leads to the optimal regret under the low-dimensional setting. However, under the high-dimensional setting, directly applying the LinUCB algorithm incurs suboptimal regret, since we can only obtain loose confidence bands in the high-dimensional regime. It is therefore desirable to construct tight confidence bands under the high-dimensional and sparse setting in order to achieve the optimal regret.
Inspired by the remarkable success of best subset selection (BSS) in high-dimensional regression problems, we propose to incorporate this powerful tool into the LinUCB algorithm. Since the BSS procedure is computationally expensive, it is impractical to execute it in every time period; as we discuss later, it is also unnecessary.
Before we present our algorithms, we briefly discuss the support of parameters. Given a $d$-dimensional vector $\theta$, we denote by $\mathrm{supp}(\theta)$ its support set, which is the collection of indices with nonzero coordinates, i.e., $\mathrm{supp}(\theta) = \{j \in [d] : \theta_j \neq 0\}$.
This definition agrees with most of the literature. However, for the BSS procedure, it is desirable to generalize this definition. We propose the concept of "generalized support" as follows.
Definition 1 (Generalized Support).
Given a $d$-dimensional vector $\theta$, we call a subset $S \subseteq [d]$ a generalized support of $\theta$ if $\theta_j = 0$ for every $j \notin S$, i.e., if $\mathrm{supp}(\theta) \subseteq S$.
The generalized support is a relaxation of the usual support: the support itself is always a generalized support, but not vice versa. Moreover, the generalized support is not unique, since any superset of the support is a valid generalized support.
We distinguish between support and generalized support in order to define the best subset selection step without causing confusion. For example, consider a linear model $y = X\theta^* + \varepsilon$ and let $S$ be a generalized support of $\theta^*$. Computing the ordinary least squares estimator restricted on the generalized support $S$ means that we fit a low-dimensional model using only the information in $S$ and set the coordinates of the estimator outside $S$ to zero. Formally, letting $X_S$ be the restriction of $X$ to the columns indexed by $S$, the restricted estimator has coordinates $(X_S^\top X_S)^{-1} X_S^\top y$ on $S$ and zeros elsewhere. Since we do not guarantee that the support of the resulting estimator equals $S$, we call $S$ a generalized support rather than a support.
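As a small illustration of estimation restricted on a generalized support, the snippet below fits ordinary least squares using only the columns indexed by a candidate set S and zero-pads the remaining coordinates; the data-generating choices and the set S are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 200
X = rng.normal(size=(n, d))
theta_star = np.zeros(d)
theta_star[[2, 7]] = [1.5, -2.0]                 # true support is {2, 7}
y = X @ theta_star + 0.1 * rng.normal(size=n)

# S is a generalized support: it contains supp(theta_star) plus extra indices.
S = [2, 5, 7, 11]

coef, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)   # low-dimensional OLS using columns in S only
theta_hat = np.zeros(d)
theta_hat[S] = coef                                   # coordinates outside S stay exactly zero
print(dict(zip(S, np.round(theta_hat[S], 3))))
```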
We are now ready to present the details of the SLUCB algorithm. Our algorithm works as follows. We first apply the "doubling trick", which partitions the whole decision horizon into several consecutive epochs whose lengths double from one epoch to the next. We only run the BSS procedure at the end of each epoch to recover the support of the parameter $\theta^*$. Each epoch starts with a fixed number of pure-exploration periods, which we call the pure-exploration stage. At each time period of this stage, we pick an action uniformly at random. Intuitively, in the pure-exploration stage we explore the unknown parameter in all possible directions through uniform randomization. From a technical perspective, the pure-exploration stage is desirable since it keeps the smallest sparse eigenvalue of the empirical design matrix bounded away from zero and hence helps us obtain an efficient estimator of $\theta^*$. After the pure-exploration stage, we keep the estimated support from the previous epoch unchanged and treat the problem as a low-dimensional regression restricted to that support. Specifically, at each time period, we use a ridge estimator (with a suitable penalty weight) to estimate $\theta^*$ and construct the corresponding confidence bands, which guide the decision in the next time period.
In summary, we partition the time horizon into consecutive epochs whose lengths double from one epoch to the next; the last epoch may be shorter. By construction, the number of epochs is only logarithmic in the horizon length, so the SLUCB algorithm runs the BSS procedure only a logarithmic number of times. In our later simulation studies, we find that this is practical for moderately large dimensions.
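The doubling-trick epoch schedule can be written down directly. The sketch below generates epoch lengths that double until the horizon is exhausted (the initial length is a hypothetical choice) and confirms that the number of epochs, and hence the number of BSS calls, grows only logarithmically in the horizon.

```python
import math

def epoch_lengths(T, first_len=8):
    """Doubling-trick partition of T periods: lengths first_len, 2*first_len, 4*first_len, ...
    The last epoch is truncated so that the lengths sum to exactly T."""
    lengths, used, cur = [], 0, first_len
    while used < T:
        lengths.append(min(cur, T - used))
        used += lengths[-1]
        cur *= 2
    return lengths

T = 10_000
lens = epoch_lengths(T)
print(lens)                                          # doubling lengths, last one truncated
print(len(lens), "epochs; log2(T) =", round(math.log2(T), 1))
```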
Next, we introduce the details of constructing the upper confidence bands in the SLUCB algorithm. Suppose that at period $t$ we pick action $a_t$ and observe the associated covariate $x_t = x_{t,a_t}$ and reward $r_t$. Let the BSS estimator computed at the end of the previous epoch be given, together with its generalized support, i.e., the generalized support recovered by that epoch. For periods in the pure-exploration stage of the current epoch, we pick arms uniformly at random and do not update the estimator of $\theta^*$. Beyond the pure-exploration stage, we estimate $\theta^*$ by a ridge estimator whose generalized support is restricted to the recovered support and which is computed from the data observed so far in the current epoch. In particular, all components of the ridge estimator outside the recovered support are set to zero, and its restricted coordinates are obtained by a standard ridge regression on these data.
Given the ridge estimator, we calculate an upper confidence band for the potential reward of each possible action. In particular, we use two tuning parameters, which correspond to the confidence level and to an upper estimate of the potential reward, respectively. The upper confidence band associated with each action is then the predicted reward under the restricted ridge estimator plus a confidence width determined by these tuning parameters.
After that, we pick the arm corresponding to the largest upper confidence band to play and observe the corresponding reward . We repeat this process until the end of epoch .
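The fragment below sketches a single decision of this kind: a ridge estimator restricted to the recovered generalized support, plus a width term based on the weighted norm of each candidate covariate, capped by an upper reward estimate. The penalty weight, the confidence multiplier alpha, and the reward cap are placeholders, not the tuning values prescribed by the theory.

```python
import numpy as np

def slucb_step(X_hist, r_hist, X_cand, S, lam=1.0, alpha=1.0, reward_cap=10.0):
    """One UCB decision restricted to the generalized support S.
    X_hist: (n, d) past covariates, r_hist: (n,) past rewards, X_cand: (K, d) candidate arms."""
    A = X_hist[:, S].T @ X_hist[:, S] + lam * np.eye(len(S))    # ridge Gram matrix on S
    theta_S = np.linalg.solve(A, X_hist[:, S].T @ r_hist)       # restricted ridge estimator
    A_inv = np.linalg.inv(A)
    means = X_cand[:, S] @ theta_S
    # widths[k] = alpha * sqrt( x_k[S]^T A^{-1} x_k[S] ), the weighted-norm confidence width
    widths = alpha * np.sqrt(np.einsum("kj,jl,kl->k", X_cand[:, S], A_inv, X_cand[:, S]))
    ucb = np.minimum(means + widths, reward_cap)                # cap by an upper reward estimate
    return int(np.argmax(ucb))
```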
At the end of the epoch, we run the BSS procedure using all data collected in the epoch to recover the support of $\theta^*$. We also enlarge the allowed size of the generalized support beyond the sparsity level. To be specific, the generalized support recovered in this step is required to satisfy certain size and inclusion constraints, and we obtain the BSS estimator as
$$\hat{\theta} \;=\; \mathcal{P}\Bigl(\operatorname*{arg\,min}_{\theta:\ \text{the generalized support of }\theta\text{ satisfies the constraints}}\ \sum_{t\in\text{epoch}}\bigl(r_t - \langle x_t,\theta\rangle\bigr)^2\Bigr), \qquad (3)$$
where $\mathcal{P}$ denotes the projection onto a centered ball whose radius equals the norm bound on $\theta^*$. Note that, in comparison with the standard BSS estimator, we project the minimizer for regularization; this boundedness also simplifies our later theoretical analysis. We also add the inclusion restriction for technical reasons, which does not lead to any fundamental difference; as a consequence, we need to work with an enlarged sparsity level. This boosts the probability of recovering the true support.
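For intuition, here is a minimal brute-force version of the best subset selection step: it enumerates candidate supports of a fixed size that must contain a previously recovered index set, fits least squares on each, keeps the best fit, and applies the projection described above. This enumeration is only feasible for very small support sizes; the mixed-integer formulations cited above would replace it in a serious implementation, and the radius argument is a hypothetical input.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, size, S_prev=(), radius=None):
    """Exhaustive best subset selection over supports of the given size that contain S_prev."""
    n, d = X.shape
    S_prev = tuple(S_prev)
    rest = [j for j in range(d) if j not in S_prev]
    best_rss, best_theta = np.inf, None
    for extra in combinations(rest, size - len(S_prev)):
        S = list(S_prev) + list(extra)
        coef, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        rss = np.sum((y - X[:, S] @ coef) ** 2)
        if rss < best_rss:
            best_rss = rss
            best_theta = np.zeros(d)
            best_theta[S] = coef
    if radius is not None and np.linalg.norm(best_theta) > radius:
        best_theta *= radius / np.linalg.norm(best_theta)   # project onto the centered l2-ball
    return best_theta
```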
A pseudo-code description of the SLUCB algorithm is presented in Algorithm 1. We remark that although computing the BSS estimator is computationally expensive, many methods exist that can solve this problem very efficiently in practice. See, for example, (Pilanci et al., 2015; Bertsimas et al., 2016) for more details. We also conduct extensive numerical experiments to demonstrate that our proposed algorithm is practical in Section 4.
2.2 Sparse-SupLinUCB Algorithm
Although the SLUCB algorithm is intuitive and easy to implement, we are unable to prove the optimal upper bound for its regret due to technical reasons. Specifically, as discussed in the next section, we can only establish a sub-optimal upper bound for the regret of the SLUCB algorithm; here we omit all constants and logarithmic factors and only consider the dependence on the horizon length and the dimension parameters. The obstacle leading to sub-optimality is the dependence of the covariates on the random noises. Recall that at each period, the SLUCB algorithm constructs the ridge estimator using all historical data, where the designs are correlated with the noises due to the UCB-type policy. Such a complicated correlation prevents us from establishing tight confidence bands for the predicted rewards, which results in suboptimal regret.
To close this gap and achieve optimality, we modify the seminal SupLinUCB algorithm (Auer, 2002; Chu et al., 2011), originally proposed to attain the optimal regret for the classical stochastic linear bandit problem, and use it as a subroutine in our framework; this yields the Sparse-SupLinUCB (SSUCB) algorithm. Specifically, we replace the ridge estimator and UCB-type policy with a modified SupLinUCB subroutine. The basic idea of the SupLinUCB algorithm is to separate the dependent designs into several groups such that, within each group, the designs and the noises are independent of each other. The ridge estimators of the true parameter are then computed from each group individually. Thanks to this independence, we can derive tighter confidence bands by applying sharper concentration inequalities, which ultimately gives rise to the optimal regret.
Next, we present the details of the SupLinUCB subroutine and show how to embed it in our framework. For each period that belongs to the UCB stage, the SupLinUCB subroutine partitions the historical periods into disjoint groups. These groups are initialized by evenly partitioning the periods of the pure-exploration stage and are updated sequentially following the rule introduced below. In each period, we screen the groups one by one (in ascending order of the group index) to either determine the action to take or eliminate obviously sub-optimal actions.
Suppose that we are at a given group in this screening. Let the candidate set be the set of actions still retained at this step, initialized as the whole action space at the first step. We first calculate the ridge estimator restricted on the recovered generalized support, using only the data from the current group. Then, for each candidate action, we calculate the width of the confidence band of its potential reward,
where the tuning parameter
corresponds to the confidence level. Our next step depends on the values of . If
which means that the widths of confidence bands are uniformly small, we pick the action associated with the largest upper confidence band
truncated by an upper estimate of the potential rewards. In this case, we discard the data point and do not update any group, i.e., all groups remain unchanged.
Otherwise, if there exists some candidate action whose confidence-band width is not sufficiently small, then we pick such an action to play for exploration. In this case, we add the current period to the group under consideration while keeping all other groups unchanged.
Finally, if neither of the above scenarios happens, then we do not take any action for now. Instead, we eliminate some obviously sub-optimal actions and move to the next group. In particular, we remove from the candidate set every action whose upper confidence band falls short of the largest upper confidence band by more than the current width threshold.
We repeat the above procedure until an arm is selected. Since the number of groups is finite and the width thresholds shrink geometrically across groups, the SupLinUCB subroutine terminates after finitely many steps. Finally, by replacing the direct ridge regression and UCB-type policy with the SupLinUCB subroutine above, we obtain the SSUCB algorithm; its pseudo-code is presented in Algorithm 2.
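A deliberately simplified sketch of this group-screening subroutine is given below. The per-group ridge fit, the exploit/explore/eliminate cases, and the geometric width thresholds follow the description above, while the specific constants (the 1/sqrt(T) exploit cutoff, the confidence multiplier alpha, and the elimination margin) are illustrative placeholders.

```python
import numpy as np

def suplinucb_choose(groups, X_cand, S, T, lam=1.0, alpha=1.0):
    """Screen groups q = 0, 1, ... to pick an arm.
    groups: list of (X_hist, r_hist) pairs, one per group; X_cand: (K, d) candidate covariates.
    Returns (chosen arm, index of the group to which the new period is appended, or None)."""
    cand = list(range(len(X_cand)))
    for q, (Xg, rg) in enumerate(groups):
        A = Xg[:, S].T @ Xg[:, S] + lam * np.eye(len(S))
        theta = np.linalg.solve(A, Xg[:, S].T @ rg)            # ridge fit from this group only
        A_inv = np.linalg.inv(A)
        mean = {a: X_cand[a, S] @ theta for a in cand}
        width = {a: alpha * np.sqrt(X_cand[a, S] @ A_inv @ X_cand[a, S]) for a in cand}
        if all(width[a] <= 1.0 / np.sqrt(T) for a in cand):    # all widths uniformly small: exploit
            return max(cand, key=lambda a: mean[a] + width[a]), None
        wide = [a for a in cand if width[a] > 2.0 ** (-(q + 1))]
        if wide:                                               # some width too large: explore
            return wide[0], q
        best = max(mean[a] + width[a] for a in cand)           # otherwise eliminate weak arms
        cand = [a for a in cand if mean[a] + width[a] >= best - 2.0 ** (-q)]
    return cand[0], None
```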
3 Theoretical Results
In this section, we present the theoretical results for the SLUCB and SSUCB algorithms. We use the regret to evaluate the performance of our algorithms, which is a standard performance measure in the literature. We denote by $\{a_t\}_{t=1}^{T}$ the action sequence generated by an algorithm. Then, given the true parameter $\theta^*$ and the covariates, the regret of the sequence is defined as
$$R_T \;=\; \sum_{t=1}^{T}\bigl(\langle x_{t,a_t^*},\theta^*\rangle - \langle x_{t,a_t},\theta^*\rangle\bigr), \qquad (4)$$
where $a_t^* = \operatorname*{arg\,max}_{a\in\mathcal{A}_t}\langle x_{t,a},\theta^*\rangle$ denotes the optimal action under the true parameter. The regret measures the discrepancy in accumulated reward between the actions actually taken and the oracle actions of a decision-maker who knows the true parameter. In what follows, we first introduce some technical assumptions in Section 3.1. We then study the regret of the SLUCB and SSUCB algorithms in Section 3.2 and Section 3.3, respectively.
3.1 Assumptions
We present the assumptions used in our theoretical analysis and discuss their relevance and implications. We consider finite action spaces and assume that their cardinalities are uniformly bounded by some constant.
We also assume that for each period , the covariates are sampled independently from an unknown distribution . We further impose the following assumptions on distribution .
Assumption 1.
Let random vector follow the distribution . Then satisfies:
(A1) (Sub-Gaussianity): The random vector is centered and sub-Gaussian with a fixed parameter, i.e., it has mean zero and sub-Gaussian tails.
(A2) (Non-degeneracy): There exists a constant lower bound on the variance of each coordinate.
(A3) (Independent coordinates): The coordinates of the random vector are independent of each other.
We briefly discuss Assumption 1. First, (A1) is a standard assumption in the literature; sub-Gaussianity covers a broad family of distributions including Gaussian, Rademacher, and bounded distributions. Assumption (A2) is a non-degeneracy assumption which, together with (A3), implies that the smallest eigenvalue of the population covariance matrix is bounded away from zero. Similar assumptions are also adopted in the high-dimensional statistics literature in order to prove "restricted eigenvalue" conditions for sample covariance matrices (Raskutti et al., 2010), which are essential in the analysis of penalized least squares methods (Wainwright, 2009; Bickel et al., 2009). However, we emphasize that in this paper the covariates indexed by the selected actions do not satisfy the restricted eigenvalue condition in general, and therefore we need novel and non-standard analyses of high-dimensional M-estimators. Regarding Assumption (A3), at a high level, independence among coordinates enables relatively independent exploration in different dimensions, which is similar to the key idea of the SETC method (Lattimore et al., 2015). Technically, (A3) is used to establish the key independence between the sample covariance matrices restricted within and outside the recovered support. Due to this independence, the rewards in the unexplored directions at each period are independent as well, and hence can be estimated efficiently.
We also impose the following assumptions on the unknown -dimensional true parameter .
Assumption 2.
(B1) (Sparsity): The true parameter is sparse; that is, at most $s$ of its components are non-zero.
(B2) (Boundedness): There exists a constant upper bound on the norm of the true parameter.
Note that in Assumption 2, (B1) is the key sparsity assumption, which states that only a small number of components of the true parameter are non-zero. Assumption (B2) is a boundedness condition on the norm of the true parameter. This assumption is often imposed, either explicitly or implicitly, in contextual bandit problems for deriving an upper bound on the rewards (Dani et al., 2008; Chu et al., 2011).
Finally, we impose a sub-Gaussian assumption on the noise sequence, which is standard in most of the statistics and bandit literature.
Assumption 3.
(C1) (Sub-Gaussian noise): The random noises are independent, centered, and sub-Gaussian with a fixed parameter.
3.2 Regret Analysis of Sparse-LinUCB
In this section, we analyze the performance of the SLUCB algorithm. As discussed earlier, we measure the performance via the regret defined in (4). We show that, with a tailored choice of the pure-exploration-stage length and the tuning parameters, the accumulated regret of the SLUCB algorithm is upper bounded (up to logarithmic factors) with high probability, as stated in the following theorem.
Theorem 1.
Note that Theorem 1 identifies the dominating part of the accumulated regret once all constants and logarithmic factors are omitted.
Theorem 1 builds on a nontrivial combination of the UCB-type algorithm and the best subset selection method. To save space, we put the complete proof of Theorem 1 in Appendix A.
3.3 Regret Analysis of Sparse-SupLinUCB
In comparison with the SLUCB algorithm, the SSUCB algorithm splits the historical data into several groups dynamically. In each period, we sequentially update the ridge estimator and corresponding confidence bands using data from a single group instead of the whole data. The motivation of only using a single group of data is to achieve the independence between the design matrix and random noises within each group, which leads to tighter confidence bands by applying a sharper concentration inequality. The tighter upper bounds of the predicted rewards lead to an improved regret. In particular, we have the following theorem.
Theorem 2.
Note that, if we omit all constants and logarithmic factors, the dominating part of the regret upper bound in Theorem 2 improves upon the rate in Theorem 1 and achieves the optimal rate (up to logarithmic factors). Theorem 2 builds on a tailored analysis of the SupLinUCB subroutine. To save space, we also put the complete proof of Theorem 2 in Appendix B.
4 Numerical Experiments
In this section, we use extensive numerical experiments to demonstrate the efficiency of our proposed algorithms and to validate our theoretical results. To simplify implementation, we only test the SLUCB algorithm, which can be implemented efficiently. Theoretically, the SLUCB algorithm only achieves a sub-optimal regret guarantee due to technical reasons; however, the numerical experiments suggest that it attains near-optimal performance in practice. Specifically, we first empirically verify the regret guarantees established in Section 3 via simulations under various settings. Then we apply the proposed method to a sequential treatment assignment problem using a real medical dataset. In the simulation studies, we sample the contextual covariates from a standard Gaussian distribution. For the real-data test, we sample the covariates from the empirical distribution, which may not be Gaussian, and some assumptions in Section 3.1 are violated. Nevertheless, in all scenarios our algorithm performs well and achieves small regret, which demonstrates its robustness and practicality.
4.1 Simulation Studies
In this section, we present the details of our simulation studies. We first examine the growth rate of the regret empirically. Then we fix the time horizon length and the dimension and study the dependence of the accumulated regret on the sparsity. To demonstrate the power of best subset selection, we also compare our algorithm's performance with the oracle, in which the decision-maker knows the true support of the underlying parameter. Since the computational bottleneck of our algorithm is the best subset selection step, which requires solving a mixed-integer program, it is appealing to replace this step with other variable selection methods, such as the Lasso and iterative hard thresholding (IHT) (Blumensath and Davies, 2009); we therefore also test the performance of these variants. We fix the number of actions in all settings.
4.1.1 Experiment 1: Growth of regret
In this experiment, we study the growth rate of the regret. We run two sets of experiments with different choices of the dimension and sparsity. For each setup, we replicate the experiment multiple times and then compute the corresponding mean and confidence interval. The results are presented in Figure 1. As we see, for each fixed dimension and sparsity, the regret grows at roughly the square-root rate in the horizon, which validates our theory.
4.1.2 Experiment 2: Dependency on sparsity
In this experiment, we fix the dimension and the horizon length and let the sparsity vary. We compute the accumulated regret at the end of the horizon. We again run two sets of experiments with different dimensions. The results are presented in Figure 2. Although Theorem 1 only provides a sub-optimal regret guarantee for the SLUCB algorithm, the observed dependence of the accumulated regret on the sparsity suggests that the algorithm actually attains the optimal rate in practice.
4.1.3 Experiment 3: Comparison with variants of main algorithm and oracle
In this experiment, we compare the performance and the computing time of our algorithm with several variants that substitute the best subset selection subroutine with the Lasso or IHT. We also compare with the oracle regret, in which the decision-maker knows the true support of the parameter. In more detail, for the first variant, we use the Lasso to recover the support of the true parameter at the end of each epoch; we tune the penalty weight such that the size of the support of the estimator approximately matches the target sparsity, and then use the recovered support in the next epoch. For the second variant, we apply IHT to estimate the parameter, with the hard threshold chosen to keep the target number of coordinates.
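For reference, a minimal iterative hard thresholding routine of the kind used in this variant is sketched below; the step size and iteration count are illustrative, and the hard threshold keeps a fixed number of coordinates, as described above.

```python
import numpy as np

def iht(X, y, k, step=None, n_iter=200):
    """Iterative hard thresholding: gradient steps on the least-squares loss,
    keeping only the k largest-magnitude coordinates after every step."""
    n, d = X.shape
    if step is None:
        step = 1.0 / np.linalg.norm(X, 2) ** 2    # conservative step size from the spectral norm
    theta = np.zeros(d)
    for _ in range(n_iter):
        theta = theta + step * X.T @ (y - X @ theta)
        keep = np.argsort(np.abs(theta))[-k:]     # indices of the k largest coordinates
        mask = np.zeros(d, dtype=bool)
        mask[keep] = True
        theta[~mask] = 0.0                        # hard threshold
    return theta
```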
We run two settings of experiments with different dimensions and sparsity levels, and replicate each setting multiple times. In both settings, the Lasso has the shortest average computing time, IHT is slightly slower, and best subset selection takes the longest. We display the associated regret curves in Figure 3. Although the computing time of the Lasso is the shortest among the three methods, its performance is significantly weaker. The computing time of IHT is slightly longer than that of the Lasso, but it achieves performance similar to best subset selection, which suggests that IHT may be a good alternative in practice when computing resources are limited. Finally, although the computing time of best subset selection is the longest, its performance is the best: its regret almost matches that of the oracle. This result demonstrates the power of best subset selection in high-dimensional stochastic bandit problems.
4.2 Real Dataset Test
In this section, we apply our method to a sequential treatment assignment problem based on the ACTG175 dataset from the R package "speff2trial". The scenario is that patients arrive at a clinic sequentially. Several candidate treatments are available, and different treatments may have different treatment effects. Hence, for each patient, the doctor observes his/her health information and then chooses a treatment, aiming to maximize the total treatment effect.
We first briefly introduce the ACTG175 dataset (Hammer et al., 1996). This dataset records the results of a randomized experiment on 2139 HIV-infected patients, together with the background information of those patients. The patients are randomly assigned one of four treatments: zidovudine (AZT) monotherapy, AZT+didanosine (ddI), AZT+zalcitabine (ddC), and ddI monotherapy. The treatment effect is measured via the CD8 T-cell count after 20 weeks of therapy. In addition, for each patient, a feature vector is provided as background information, which includes age, weight, gender, drug history, etc.
As in the simulation studies, we validate our algorithm via the regret. However, in practice the potential outcomes of unassigned treatments on each patient are unavailable, which prevents us from calculating the regret. To overcome this difficulty, we assume a linear model for the treatment effect of each therapy. In particular, we assume that the effect of each treatment (the CD8 T-cell count) is a linear function of the patient's background information plus a standard Gaussian noise, with the treatment-specific parameter estimated via linear regression. Similar assumptions are commonly adopted in the statistics literature to calculate counterfactuals when real experiments are unavailable (Zhang et al., 2012a). The above sequential treatment assignment problem can easily be transformed into our formulation by considering a joint covariate and parameter space: given a patient's background covariate, each treatment corresponds to a joint covariate that places the patient's covariate in the block associated with that treatment and zeros elsewhere, so assigning a treatment to a patient is equivalent to picking an action from this small action space. Note that with this formulation, Assumption 1 is violated. However, the numerical results below still demonstrate the good performance of our algorithm.
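The reduction described above amounts to stacking the patient covariate into one block per treatment, as the following hypothetical sketch shows (four treatments, as in ACTG175; the feature values are toy numbers). With this construction, the joint parameter stacks the treatment-specific parameters, and the inner product between the joint covariate of a treatment and the joint parameter recovers the linear treatment effect of that treatment.

```python
import numpy as np

def joint_covariates(x, n_treatments=4):
    """Map a patient covariate x (length p) to n_treatments joint covariates of length
    n_treatments * p: the action for treatment k places x in block k and zeros elsewhere."""
    p = len(x)
    actions = np.zeros((n_treatments, n_treatments * p))
    for k in range(n_treatments):
        actions[k, k * p:(k + 1) * p] = x
    return actions

x = np.array([35.0, 70.0, 1.0])   # toy patient features, e.g. age, weight, an indicator
A = joint_covariates(x)
print(A.shape)                    # (4, 12): one joint covariate per candidate treatment
```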
In this experiment, for simplicity, we pick nine coordinates (age, weight, drugs in history, Karnofsky score, number of days of previously received antiretroviral therapy, antiretroviral history stratification, gender, CD4 T-cell count at baseline, and CD8 T-cell count at baseline) from the whole dataset as the patient covariate. We then use the real dataset to estimate the true parameters, i.e., the treatment effect of each therapy. To test our algorithm in the high-dimensional setting, we further concatenate the covariate with additional coordinates of random noise drawn from a standard Gaussian distribution; the parameters corresponding to these injected noise coordinates are zero. For each patient, we use the estimated parameters to calculate the potential treatment effects and the regret. In addition, we take the total number of patients as the horizon. As in the simulation studies, we compare our algorithm's performance with the other variants (Lasso, IHT, and the oracle). For each algorithm, we repeat the experiment multiple times and calculate the mean regret curve. We present the results in Figure 4. Compared with the simulation studies, the performance of the Lasso is still the worst, although the gap is much smaller. Although there is almost no difference between IHT and best subset selection, in the real-data test we observe some instances in which the IHT algorithm fails to converge, which indicates a lack of robustness. Again, for our proposed algorithm, the regret grows at roughly the square-root rate and is quite close to that of the oracle, so the BSS-based algorithm also achieves near-optimality. Since the technical assumptions in Section 3.1 are violated in this experiment, these numerical results demonstrate the robustness of our method.
5 Conclusion
In this paper, we first propose a method for the high-dimensional stochastic linear bandit problem by combining the best subset selection method with the LinUCB algorithm. It achieves a regret upper bound that is nearly independent of the ambient dimension (up to logarithmic factors). In order to attain the optimal regret, we further improve our method by incorporating a modified SupLinUCB subroutine. Extensive numerical experiments validate the performance and robustness of our algorithms. Moreover, although we cannot prove the optimal regret upper bound for the SLUCB algorithm due to technical reasons, simulation studies suggest that the regret of the SLUCB algorithm actually matches the optimal rate rather than our provable upper bound. A similar phenomenon is also observed in the seminal works (Auer, 2002; Chu et al., 2011), where low-dimensional stochastic linear bandit problems are investigated. It remains an open problem to prove the optimal upper bound for LinUCB-type algorithms. In addition, it is also interesting to study the high-dimensional stochastic linear bandit problem under weaker assumptions, especially when the independent-coordinates assumption does not hold. We view all of these problems as potential future research directions.
References
- Abbasi-Yadkori et al. [2011] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- Abbasi-Yadkori et al. [2012] Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
- Abe et al. [2003] N. Abe, A. W. Biermann, and P. M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
- Agarwal et al. [2010] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of annual Conference on Learning Theory (COLT). Citeseer, 2010.
- Agarwal et al. [2014] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of International Conference on Machine Learning (ICML), 2014.
- Auer [2002] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Auer et al. [1995] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 1995.
- Auer et al. [2002] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- Balasubramanian and Ghadimi [2018] K. Balasubramanian and S. Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2018.
- Bastani and Bayati [2015] H. Bastani and M. Bayati. Online decision-making with high-dimensional covariates. 2015. Available at SSRN: https://ssrn.com/abstract=2661896.
- Bertsimas et al. [2016] D. Bertsimas, A. King, and R. Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
- Besbes et al. [2015] O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
- Bickel et al. [2009] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
- Blumensath and Davies [2009] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
- Bubeck et al. [2017] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2017.
- Bühlmann and Van De Geer [2011] P. Bühlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
- Candes and Tao [2007] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
- Carpentier and Munos [2012] A. Carpentier and R. Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
- Chakraborty et al. [2010] B. Chakraborty, S. Murphy, and V. Strecher. Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical Methods in Medical Research, 19(3):317–343, 2010.
- Chu et al. [2011] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
- Dani et al. [2008] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of annual Conference on Learning Theory (COLT), 2008.
- Donoho [2006] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
- Filippi et al. [2010] S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2010.
- Flaxman et al. [2005] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the annual ACM-SIAM symposium on Discrete algorithms (SODA), 2005.
- Foster et al. [2016] D. Foster, S. Kale, and H. Karloff. Online sparse linear regression. In Proceedings of annual Conference on Learning Theory (COLT), 2016.
- Foster et al. [2018] D. J. Foster, A. Agarwal, M. Dudík, H. Luo, and R. E. Schapire. Practical contextual bandits with regression oracles. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
- Gerchinovitz [2013] S. Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research, 14(Mar):729–769, 2013.
- Goldberg and Kosorok [2012] Y. Goldberg and M. R. Kosorok. Q-learning with censored data. Annals of Statistics, 40(1):529, 2012.
- Goldenshluger and Zeevi [2013] A. Goldenshluger and A. Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
- Hammer et al. [1996] S. M. Hammer, D. A. Katzenstein, M. D. Hughes, H. Gundacker, R. T. Schooley, R. H. Haubrich, W. K. Henry, M. M. Lederman, J. P. Phair, M. Niu, et al. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. New England Journal of Medicine, 335(15):1081–1090, 1996.
- Keyvanshokooh et al. [2019] E. Keyvanshokooh, M. Zhalechian, C. Shi, M. P. Van Oyen, and P. Kazemian. Contextual learning with online convex optimization: Theory and application to chronic diseases. Available at SSRN, 2019.
- Krause and Ong [2011] A. Krause and C. S. Ong. Contextual gaussian process bandit optimization. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2011.
- Lai and Robbins [1985] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- Langford et al. [2009] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10(Mar):777–801, 2009.
- Lattimore et al. [2015] T. Lattimore, K. Crammer, and C. Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015.
- Li et al. [2010] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the International Conference on World Wide Web (WWW). ACM, 2010.
- Li et al. [2011] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2011.
- Li et al. [2017] L. Li, Y. Lu, and D. Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of International Conference on Machine Learning (ICML), 2017.
- Miller [2002] A. Miller. Subset selection in regression. Chapman and Hall/CRC, 2002.
- Moodie et al. [2007] E. E. Moodie, T. S. Richardson, and D. A. Stephens. Demystifying optimal dynamic treatment regimes. Biometrics, 63(2):447–455, 2007.
- Murphy [2003] S. A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355, 2003.
- Murphy [2005] S. A. Murphy. An experimental design for the development of adaptive treatment strategies. Statistics in medicine, 24(10):1455–1481, 2005.
- Natarajan [1995] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
- Nemirovsky and Yudin [1983] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. SIAM, 1983.
- Orellana et al. [2010a] L. Orellana, A. Rotnitzky, and J. M. Robins. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. The International Journal of Biostatistics, 6(2), 2010a.
- Orellana et al. [2010b] L. Orellana, A. Rotnitzky, and J. M. Robins. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part II: proofs of results. The International Journal of Biostatistics, 6(2), 2010b.
- Pilanci et al. [2015] M. Pilanci, M. J. Wainwright, and L. El Ghaoui. Sparse learning via boolean relaxations. Mathematical Programming (Series B), 151(1):63–87, 2015.
- Raskutti et al. [2010] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
- Robins et al. [2008] J. Robins, L. Orellana, and A. Rotnitzky. Estimation and extrapolation of optimal treatment and testing strategies. Statistics in Medicine, 27(23):4678–4721, 2008.
- Robins et al. [2000] J. M. Robins, M. A. Hernan, and B. Brumback. Marginal structural models and causal inference in epidemiology, 2000.
- Rusmevichientong and Tsitsiklis [2010] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
- Shamir [2013] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Proceedings of annual Conference on Learning Theory (COLT), 2013.
- Shamir [2015] O. Shamir. On the complexity of bandit linear optimization. In Proceedings of annual Conference on Learning Theory (COLT), 2015.
- Song et al. [2015] R. Song, W. Wang, D. Zeng, and M. R. Kosorok. Penalized Q-learning for dynamic treatment regimens. Statistica Sinica, 25(3):901, 2015.
- Szepesvari [2016] C. Szepesvari. Lower bounds for stochastic linear bandits. http://banditalgs.com/2016/10/20/lower-bounds-for-stochastic-linear-bandits/, 2016. Accessed: Oct 15, 2018.
- Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
- Tropp [2012] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
- Wainwright [2009] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using -constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
- Wang et al. [2017] Y. Wang, S. Du, S. Balakrishnan, and A. Singh. Stochastic zeroth-order optimization in high dimensions. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
- Watkins and Dayan [1992] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- Zhang et al. [2012a] B. Zhang, A. A. Tsiatis, M. Davidian, M. Zhang, and E. Laber. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114, 2012a.
- Zhang et al. [2012b] B. Zhang, A. A. Tsiatis, E. B. Laber, and M. Davidian. A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010–1018, 2012b.
- Zhao et al. [2012] Y. Zhao, D. Zeng, A. J. Rush, and M. R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.
- Zhao et al. [2014] Y.-Q. Zhao, D. Zeng, E. B. Laber, R. Song, M. Yuan, and M. R. Kosorok. Doubly robust learning for estimating individualized treatment with censored data. Biometrika, 102(1):151–168, 2014.
Appendix
Appendix A Proof of Theorem 1
Before proceeding to the proof, we present a proposition that upper bounds the magnitude of the covariates uniformly with high probability.
Proposition 1.
Consider all the sub-Gaussian covariates in the problem . Then with probability at least ,
Proof.
By Assumption 1, for all periods, actions, and coordinates, the covariate entries are independent and identically distributed centered sub-Gaussian random variables. The result then follows from the standard sub-Gaussian concentration inequality and a union bound argument. ∎
Next, we present the proof of Theorem 1. For ease of presentation, we defer the detailed proofs of the lemmas to the Supplementary Materials. Our proof benefits from the results of the celebrated work on linear contextual bandits (Abbasi-Yadkori et al., 2011), in which the authors establish a self-normalized martingale concentration inequality and a confidence region for the ridge estimator under dependent designs. For completeness, we restate these results in Section 6 of the Supplementary Materials.
Proof of Theorem 1.
The proof consists of four major steps. First, we quantify the performance of the BSS estimator obtained at the end of each epoch. In particular, we bound the norm of the true parameter outside the generalized support recovered by the epoch; this measures the discrepancy between the recovered support and the true support of the parameter and hence the quality of the BSS estimator. Second, within each epoch, we analyze the error of the restricted ridge estimator at each period; based on this, we upper bound the prediction errors of the potential rewards and construct the corresponding upper confidence bounds. Third, similar to the classical analysis of UCB-type algorithms for linear bandit problems, we establish an elliptical potential lemma to upper bound the sum of the confidence-band widths. Finally, we combine these results to bound the regret of the SLUCB algorithm. In what follows, we present the details of each step.
Step 1. Upper bound the error of the BSS estimator: In this step, we analyze the error of the BSS estimator obtained at the end of each epoch. Recall the notation for the generalized support of the BSS estimator. We have the following lemma.
Lemma 1.
If
we have that, with probability at least , for each epoch ,
(5) |
Note that the bound in (5) becomes small when the epoch is long. Hence, Lemma 1 states that the norm of the unselected part of the true parameter decays as the epoch length grows. Recall that in the next epoch, we restrict the estimation to the recovered support. Failing to identify the exact support of the true parameter incurs a regret that is roughly
We interpret this part as the "bias" component of the total regret. Formally, by Lemma 5 and Proposition 1, we upper bound, for each covariate, its inner product with the unselected part of the true parameter via the following corollary.
Corollary 2.
Under the condition of Lemma 5, with probability at least , for all , , and , we have
(6) |
Essentially, Corollary 2 implies that the regret incurred at a single period by inexact recovery of the support of the true parameter is of the order given in (6). This result is critical for our regret analysis later.
Step 2. Upper bound the prediction error: In this step, for each time period and arm , we establish an upper bound for the prediction error . In particular, we have the following lemma.
Lemma 2.
If
with probability at least , for all , , and , we have
where
and .
The upper bound in Lemma 2 includes two parts. The first term, which is of order , corresponds to the bias of the estimator . In particular, it is caused by the inexact recovery of . In contrast, the second term, which is determined by the variability of and the sample covariance matrix , corresponds to the variance.
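Schematically, and with notation introduced here only for illustration (write $\widehat S$ for the recovered generalized support and $\widehat\theta$ for the restricted ridge estimator supported on $\widehat S$), the decomposition behind this bound is
\[
x^{\top}\theta^{\ast}-x^{\top}\widehat\theta
\;=\;\underbrace{x_{\widehat S^{c}}^{\top}\theta^{\ast}_{\widehat S^{c}}}_{\text{bias: inexact support recovery}}
\;+\;\underbrace{x_{\widehat S}^{\top}\big(\theta^{\ast}_{\widehat S}-\widehat\theta_{\widehat S}\big)}_{\text{variance: restricted ridge error}} ,
\]
where the first term is controlled by Corollary 2 and the second term by the self-normalized martingale argument (Lemma 8 in the Supplementary Materials).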
Step 3. Elliptical potential lemma: In this step, we establish the elliptical potential lemma, a powerful tool in bandit problems (Rusmevichientong and Tsitsiklis, 2010; Auer, 2002; Filippi et al., 2010; Li et al., 2017), which upper bounds the sum of the per-period prediction errors. Specifically, we upper bound the sum of the right-hand side of the inequality in Lemma 2 over all periods. Recall that in each epoch, the arms are picked uniformly at random in the first periods.
Lemma 3.
Let two constants satisfy . Then with probability at least , we have
The upper bound in Lemma 3 consists of two terms. The first one is linear in , the length of the pure-exploration-stage. The second part, which is of order , reflects the trade-off between exploration and exploitation. Later in the proof, we show that this part leads to the order regret under a tailored choice of .
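For reference, a standard form of the elliptical potential argument (see, e.g., Abbasi-Yadkori et al., 2011), stated here in generic notation introduced only for illustration: if $V_{t}=\lambda I+\sum_{s\le t}x_{s}x_{s}^{\top}$ with $\|x_{s}\|_{2}\le L$ for all $s$, then
\[
\sum_{t=1}^{T}\min\Big\{1,\;\|x_{t}\|_{V_{t-1}^{-1}}^{2}\Big\}
\;\le\;2\log\frac{\det V_{T}}{\det(\lambda I)}
\;\le\;2d\log\Big(1+\frac{TL^{2}}{\lambda d}\Big),
\]
where $d$ is the dimension of the vectors $x_{t}$. In our setting, the size of the recovered generalized support plays the role of $d$.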
Step 4. Putting everything together: Based on Lemmas 1, 2, and 3, we are ready to prove Theorem 1. For each and , let
be the true expected reward of arm at period . By Assumption 3 and Proposition 1, the inequality
holds for all and with probability at least . Then under the following choice of ,
Lemma 2 guarantees that
(7) |
is an upper bound for for each and (up to an absolute constant), where
Next, recall that at period , our algorithm picks arm . The optimal action is
Let be the periods of pure-exploration-stage, in which the arms are picked uniformly at random. Correspondingly, we use to denote the periods of UCB-stage, in which the UCB-type policy is employed. Then we have
Since with probability at least ,
holds for all and , by the definition of regret, we have
(8) |
For the second part on the right-hand side of inequality (8), we have
Here the first inequality holds because in period , the picked arm has the largest upper confidence band. The second inequality follows by the definition of in (7). By Lemma 3, we have
(9) |
Finally, by combining inequalities (8) and (9), we obtain that with probability at least ,
which concludes the proof of Theorem 1. ∎
Appendix B Proof of Theorem 2
Proof of Theorem 2.
The proof follows a similar routine to the proof of Theorem 1. The main difference lies in the upper bound on the prediction error (Step 2 in the proof of Theorem 1). Recall that in the SLUCB algorithm, we construct the ridge estimator using all historical data. The complicated correlation between the design matrices and the random noises prevents us from applying the standard concentration inequality for independent random variables to obtain an order upper bound for the prediction error. Instead, we use the self-normalized martingale concentration inequality, which only gives an order upper bound and hence a suboptimal regret. In comparison, in the SSUCB algorithm, the Sparse-SupLinUCB subroutine overcomes this obstacle by splitting the historical data into disjoint groups such that within each group, the designs and random noises are independent. Thanks to this independence, we establish the desired order upper bound for the prediction errors. In particular, we have the following key lemma.
Lemma 4.
When
with probability at least , we have that for every tuple , it holds
where
The upper bound in Lemma 4 includes two parts, corresponding to the bias and the variance, respectively. The bias part is still of order . In comparison with Lemma 2, this lemma removes a factor of from the variance part. Note that by Lemma 3, we have
Such an improvement leads to an order improvement in the accumulated regret, which achieves the optimal rate. The detailed proof of Lemma 4 is in the supplementary document.
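To see informally where this improvement comes from, consider a generic fixed design matrix $X$ with ridge Gram matrix $V=X^{\top}X+\lambda I$ and a noise vector $\epsilon$ whose entries are independent, centered, $\sigma$-sub-Gaussian, and independent of $X$; the notation here is illustrative only. For a fixed direction $x$, the variable $x^{\top}V^{-1}X^{\top}\epsilon$ is sub-Gaussian with parameter $\sigma\|XV^{-1}x\|_{2}\le\sigma\|x\|_{V^{-1}}$, since $x^{\top}V^{-1}X^{\top}XV^{-1}x\le x^{\top}V^{-1}x$. Hence
\[
\big|x^{\top}V^{-1}X^{\top}\epsilon\big|\;\le\;\sigma\sqrt{2\log(2/\delta)}\,\|x\|_{V^{-1}}
\qquad\text{with probability at least }1-\delta .
\]
When the noise may depend on the design, one instead bounds $|x^{\top}V^{-1}X^{\top}\epsilon|\le\|x\|_{V^{-1}}\,\|X^{\top}\epsilon\|_{V^{-1}}$ and controls the second factor by the self-normalized bound (Lemma 9), which brings in an additional $\sqrt{\log\det}$-type factor; this is the factor that the group-splitting in Sparse-SupLinUCB avoids.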
As in the last step of the proof of Theorem 1, by Lemma 4, we obtain that, with probability at least , for all tuples , it holds
(10) |
where and
(11) |
(12) |
Without loss of generality, we assume that inequality (10) holds for all tuples , which is guaranteed with probability at least . We also assume that in period , the SupLinUCB algorithm terminates at the -th group and selects an arm . Let be the optimal arm for period . If the first scenario in the SupLinUCB algorithm happens, i.e.,
we have
(13) |
Otherwise, we have
(14) |
In this case, since the SupLinUCB algorithm does not stop before , for all , we have
Consequently, for all , we have
Moreover, by setting , we have
(15) |
By combining inequalities (13)-(15), we obtain
(16) |
For the accumulated regret , similar to inequality (8) in the proof of Theorem 1, we have that, with probability at least ,
where denotes the periods of the UCB-stage. Then by inequality (16) and Lemma 4, with probability at least , we have
(17) |
where is defined in equation (11), and denotes the -th group of periods at the end of epoch . For each fixed , we have that, with probability at least ,
(18) |
Finally, combining inequalities (17) and (18), we have that, with probability at least ,
which concludes the proof. ∎
Supplementary Materials to
“Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection”
3 Proof of Auxiliary Lemmas in Theorem 1
3.1 Proof of Lemma 1
Lemma (Restatement of Lemma 1).
If
we have that, with probability at least , for each epoch ,
Proof.
The proof of Lemma 1 depends on the following lemma, which establishes an upper bound for the estimation error in an -norm weighted by the sample covariance matrix .
Lemma 5.
With probability at least , we have
The proof of Lemma 5 is provided in Section 5.1. Now we continue the proof of Lemma 1. To simplify notation, from now on, we fix the epoch and let
Then by definition, . We can decompose the error as
where the inequality follows from the fact that the matrix is positive definite. By rearranging the above inequality, we obtain
(19) |
Next, we upper bound the right-hand side of inequality (19) and then lower bound the left-hand side, which leads to a quadratic inequality with respect to the estimation error . By solving this quadratic inequality, we finish the proof of Lemma 1.
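The elementary step referred to here takes the following generic form, with illustrative constants: if $a,b,c>0$ and a nonnegative quantity $x$ satisfies $ax^{2}\le bx+c$, then
\[
x\;\le\;\frac{b+\sqrt{b^{2}+4ac}}{2a}\;\le\;\frac{b}{a}+\sqrt{\frac{c}{a}} .
\]
Taking $x$ to be the estimation error then converts the quadratic inequality derived below into the bound stated in Lemma 1.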
We establish the upper bound first. According to Hölder’s inequality, we have
where the last inequality follows from the fact that has at most non-zero coordinates. To continue the proof, we also need the following two lemmas, which upper bound and individually.
Lemma 6.
With probability at least , we have
Lemma 7.
For arbitrary constant , when
with probability at least , we have
The proofs of Lemmas 6 and 7 are deferred to Sections 5.2 and 5.3. We now continue the proof of Lemma 1. By applying Lemmas 6 and 7 and setting
we obtain the following inequality with probability at least ,
(20) |
Note that with such a choice of , we require
Similarly, we can establish the same upper bound for the second part on the right-hand side of inequality (19). Then by combining inequalities (19) and (20), we obtain
(21) |
Then we turn to the lower bound for the left-hand side of inequality (21). Recall that in the SLUCB algorithm, the selection of actions in epoch does not rely on . Furthermore, by Assumption 1 (A2, A3), for each and , is sampled from distribution independently and satisfies
Subsequently, by the matrix Bernstein inequality, when
with probability at least , we have
(22) |
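For reference, the matrix Bernstein inequality invoked here (Tropp, 2012) can be stated in the following standard form, with generic symbols used only for illustration: if $Z_{1},\dots,Z_{n}$ are independent, centered, symmetric $d\times d$ random matrices satisfying $\|Z_{i}\|\le L$ almost surely, and $v=\big\|\sum_{i=1}^{n}\mathbb{E}Z_{i}^{2}\big\|$, then for all $t\ge 0$,
\[
\Pr\Big(\Big\|\sum_{i=1}^{n}Z_{i}\Big\|\ge t\Big)\;\le\;2d\exp\Big(\frac{-t^{2}/2}{\,v+Lt/3\,}\Big).
\]
Here it is applied to the centered rank-one matrices formed from the covariates, which shows that the sample covariance matrix concentrates around its population counterpart and hence provides the lower bound needed for the left-hand side of inequality (21).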
Then by combining inequalities (20)-(21) and Lemma 5, we obtain with probability at least that,
Finally, we solve the above inequality and obtain
Lemma 1 then follows from the definition . ∎
3.2 Proof of Lemma 2
Lemma (Restatement of Lemma 2).
If
with probability at least , for all , , and , we have
where
and .
Proof.
Recall that is the ridge estimator of using data restricted to the generalized support . Then for each , we have the decomposition
(23) |
We can upper bound the first term in the right-hand side of equation (23) by Corollary 2. Formally, with probability at least , we have
(24) |
It remains to upper bound the second part. By the Cauchy-Schwarz inequality, we have
(25) |
The following lemma establishes an upper bound for the second term on the right-hand side of inequality (25). The proof relies on the self-normalized martingale concentration inequality (i.e., Lemma 9 in Section 6) and is similar to the proof of Lemma 5.
Lemma 8.
For each and , we have
with probability at least .
The proof of Lemma 8 is deferred to Section 5.4. We now continue the proof of Lemma 2. By combining inequalities (23)-(25) and Lemma 8, we obtain with probability at least that
Recall that in Lemma 1, we show that
Hence, when , we have
with probability at least . Finally, by applying a union bound argument over all , we finish the proof of Lemma 2. ∎
3.3 Proof of Lemma 3
Lemma (Restatement of Lemma 3).
Let two constants satisfy . Then with probability at least , we have
Proof.
For a fixed epoch , let be the end of the pure-exploration-stage and be the end of . Then we have . We also have
(26) |
Here the first inequality holds because and . The second inequality follows from the Cauchy-Schwarz inequality. The last inequality is due to the basic inequality
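One standard form of this elementary bound, stated with a generic variable $u$ and constant $c>0$, uses the fact that $u/\log(1+u)$ is increasing in $u$:
\[
u\;\le\;\frac{c}{\log(1+c)}\,\log(1+u)\qquad\text{for all }0\le u\le c .
\]
This is the usual device for converting the capped quadratic forms into logarithmic terms that telescope against the determinant recursion below.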
On the other hand, we have the recursion
Subsequently, we have
(27) |
By combining equations (26)-(27) and taking a telescoping sum, we obtain
(28) |
By Proposition 1, we have with probability at least that, for all and ,
Then by applying the determinant-trace inequality, we have
Hence, we have
(29) |
with probability at least . On the other hand, by definition, we have , which implies
(30) |
Finally, by combining inequalities (28)-(30), we complete the proof of Lemma 3. ∎
4 Proof of Auxiliary Lemmas in Theorem 2
4.1 Proof of Lemma 4
Lemma (Restatement of Lemma 4).
When
with probability at least , we have that for every tuple , it holds
where
Proof.
To simplify notation, for each fixed tuple , we denote by
where is the number of elements in set . Recall that
which is the sample covariance matrix corresponding to restricted on . Recall that is the ridge estimator with generalized support using data . Particularly, we have
Note that
Then we have
Hence, for each , we can upper bound the prediction error as
(31) |
We can use Corollary 6 to upper bound the first term in the right-hand side of inequality (31) and obtain
(32) |
with probability at least . It remains to upper bound the second term.
We first note that
(33) |
where the first inequality is the Cauchy-Schwarz inequality and the second one follows from the fact
Next, we claim that for each fixed tuple , the random vectors and are independent of each other. Recall the policies for picking arms and updating in the SupLinUCB algorithm, which are specified in detail in Section 2.2. We consider a fixed period and the -th group. Note that the random event that covariate enters depends only on the widths of the confidence bands . Moreover, since only relies on the values of the design matrices but not on the noise sequence , whether enters is also independent of . This yields the claim.
The independence between and enables us to establish a tighter upper bound for the second term on the right-hand side of inequality (31) than that in Lemma 2. Recall that by Assumption 1, given the information by epoch , the random variables and are independent. Moreover, is also independent of . Hence, each element of
is a centered sub-Gaussian random variable with parameter , which is also independent of covariates . By applying the sub-Gaussian concentration inequality and the union bound argument, we obtain with probability at least that
(34) |
Here we emphasize that, for the later proof, inequality (34) is interpreted as a random event given the realization of . It also holds regardless of the values of . This is where the independence of and is indispensable: otherwise, given , the elements of are not i.i.d., which prevents us from establishing a concentration result similar to (34). Based on inequality (34) and the Cauchy-Schwarz inequality, with probability at least , we have
(35) |
By combining inequalities (33) and (35), we have
(36) |
with probability at least . Recall that Lemma 1 shows that
Hence, when , . Subsequently, by combining inequalities (32) and (36), we obtain with probability at least that
which concludes the proof of Lemma 4. ∎
5 Proof of Auxiliary Lemmas in Supplementary Document
5.1 Proof of Lemma 5
Lemma.
(Restatement of Lemma 5) With probability at least , we have
Proof.
We use the optimality of to derive an upper bound on the quadratic form of . In the following, we use to denote the support of the true parameter . Let be the projected ridge estimator with support using data , i.e.,
where denotes the projection operator on a centered -ball with radius . We remark that is unavailable in practice since is unknown. However, it is not necessary in the SLUCB algorithm and only serves as an auxiliary estimator in theoretical analysis. Recall the definition of BSS estimator , which is given by
(37) |
where is the generalized support recovered by the -th epoch. Since , we have . We also have . Hence, satisfies the constraints in equation (37). Then by the optimality of , we have
and furthermore,
(38) |
where . Note that we have
After some algebra, we obtain
Then inequality (38) becomes
(39) |
Next, we establish upper bounds for the three parts on the right-hand side of inequality (39) individually. We first introduce some notation. By the definition of , we have
where the optimization is taken over all feasible sets . From now on, for convenience, given a fixed subset with , we denote by
Particularly, let
Then by definition. We are ready to upper bound the right-hand side of inequality (39).
For the first part, note that . Then by the Cauchy-Schwarz inequality, we have
(40) |
where the maximization in the last step is taken over all possible , whose number is at most . For each fixed , by applying the self-normalized martingale concentration inequality (i.e., Lemma 9 provided in Section 6), we obtain with probability at least that
Since the number of possible is upper bounded by , by applying the union bound argument, we obtain with probability at least that,
(41) |
holds for all possible , which further implies that
(42) |
holds with probability at least .
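The counting estimate behind this union bound is standard: writing $p$ for the ambient dimension and $s$ for the sparsity level (generic symbols used only for illustration), the number of candidate supports satisfies
\[
\#\{S\subseteq[p]:|S|\le s\}\;=\;\sum_{k=0}^{s}\binom{p}{k}\;\le\;\Big(\frac{ep}{s}\Big)^{s},
\qquad\text{so}\qquad
\log\#\{S\subseteq[p]:|S|\le s\}\;\le\;s\log\frac{ep}{s}\,.
\]
Consequently, the union bound over candidate supports contributes only a factor of order $s\log p$ inside the logarithm, rather than any polynomial dependence on $p$.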
For the second part of the right-hand side of inequality (39), by the Cauchy-Schwarz inequality and the confidence region of the ridge estimator (i.e., Lemma 10 provided in Section 6), we obtain with probability at least that
(43) |
where .
For the last part of the right-hand side of inequality (39), since and all are in the centered -ball with radius , we have
(44) |
Finally, by combining inequalities (39) and (42)-(44) and Lemma 10 (provided in Section 6), we obtain with probability at least that
(45) |
where the maximization is taken over all subsets with cardinality at most .
In order to conclude the proof of Lemma 5, it remains to establish an upper bound for . By Proposition 1, with probability at least , for all , , and
Then by the determinant-trace inequality, we have
Recall the definition of in equation (5.1). Then with probability at least , we have
(46) |
which further implies that
As a result, by inequality (45), we finally obtain with probability at least that
which concludes the proof of Lemma 5.
∎
5.2 Proof of Lemma 6
Lemma.
(Restatement of Lemma 6) With probability at least , we have
Proof.
By the definition of matrix , for every and , we have
Recall that in epoch , the SLUCB algorithm picks arm based only on the information in dimension . Also recall that the SLUCB algorithm requires
which implies that
Since all covariates have independent coordinates, for each , are independent centered sub-Gaussian random variables. As a result, conditional on , the random variable is sub-Gaussian with parameter . Note that
which is the maximum of sub-Gaussian random variables whose parameters are uniformly upper bounded by . By applying the concentration inequality for the maximum of sub-Gaussian random variables, we obtain with probability at least that
Now we use the concentration inequality for the maximum of sub-Gaussian random variables again. Then with probability at least , for each ,
Hence, we obtain with probability at least that
Similarly, we also have
with probability at least , which concludes the proof of Lemma 6. ∎
5.3 Proof of Lemma 7
Lemma (Restatement of Lemma 7).
Proof.
To simplify notation, let
Then we have . Recall that in the SLUCB algorithm, the first arms in epoch are chosen uniformly at random. Then by the matrix Bernstein inequality (Tropp, 2012) and Assumption 1 (A2), for an arbitrary constant , when , we have
(47) |
with probability at least . Since , when inequality (47) holds, we have
Hence, we obtain with probability at least that
(48) |
where the last inequality follows from Lemma 5. This finishes the proof of Lemma 7. ∎
5.4 Proof of Lemma 8
Lemma (Restatement of Lemma 8).
Proof.
For each , we have decomposition
where
Given the information by epoch , for each and , by Assumption 1 (A3), is independent of . Recall that in epoch , all decisions on picking arms only depend on the information of covariates in dimension . Hence, is independent of the design . Moreover, conditioning on the information by epoch , are independent centered sub-Gaussian random variables with parameter
Recall the definition of , i.e.,
where . Then by using the confidence region of ridge regression estimator (i.e., Lemma 10 provided in Section 6), we have with probability at least that
where
Similar to the proof of Lemma 5, we have
with probability at least . Hence, we obtain with probability at least that
which concludes the proof of Lemma 8. ∎
6 Other Auxiliary Lemmas
In this section, we list the results established in Abbasi-Yadkori et al. (2011), which are used extensively in our proofs.
Lemma 9.
(Self-Normalized Martingale Concentration Inequality) Let be an -valued stochastic process and be a real-valued stochastic process. Let
be the natural filtration generated by and . Assume that for each , is centered and sub-Gaussian with variance proxy given , i.e.,
Also assume that is a positive definite matrix. For any , we define
Then for any , with probability at least , for all ,
Lemma 10.
(Confidence Region of Ridge Regression Estimator under Dependent Design) Consider the stochastic processes . Assume that and satisfy the condition in Lemma 9 and . Also assume that . For any , let
be the ridge regression estimator, where , and is the -dimensional identity matrix. Then with probability at least , for all ,
where ,
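For the reader's convenience, we also record the standard forms of these two results in the notation of Abbasi-Yadkori et al. (2011); the symbols below follow their paper and need not coincide exactly with ours. For Lemma 9, with $\bar V_{t}=V+\sum_{s=1}^{t}X_{s}X_{s}^{\top}$ and $S_{t}=\sum_{s=1}^{t}\eta_{s}X_{s}$, their bound states that for any $\delta>0$, with probability at least $1-\delta$, for all $t\ge 0$,
\[
\|S_{t}\|_{\bar V_{t}^{-1}}^{2}\;\le\;2\sigma^{2}\log\Big(\frac{\det(\bar V_{t})^{1/2}\det(V)^{-1/2}}{\delta}\Big).
\]
For Lemma 10, with $\bar V_{t}=\lambda I+\sum_{s=1}^{t}X_{s}X_{s}^{\top}$ and $\|\theta^{\ast}\|_{2}\le S$, their bound states that with probability at least $1-\delta$, for all $t\ge 0$,
\[
\big\|\widehat\theta_{t}-\theta^{\ast}\big\|_{\bar V_{t}}\;\le\;\sigma\sqrt{2\log\Big(\frac{\det(\bar V_{t})^{1/2}\det(\lambda I)^{-1/2}}{\delta}\Big)}\;+\;\lambda^{1/2}S .
\]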
References
- Abbasi-Yadkori et al. [2011] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- Abbasi-Yadkori et al. [2012] Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
- Abe et al. [2003] N. Abe, A. W. Biermann, and P. M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
- Agarwal et al. [2010] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of annual Conference on Learning Theory (COLT). Citeseer, 2010.
- Agarwal et al. [2014] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of International Conference on Machine Learning (ICML), 2014.
- Auer [2002] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Auer et al. [1995] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 1995.
- Auer et al. [2002] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- Balasubramanian and Ghadimi [2018] K. Balasubramanian and S. Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2018.
- Bastani and Bayati [2015] H. Bastani and M. Bayati. Online decision-making with high-dimensional covariates. 2015. Available at SSRN: https://ssrn.com/abstract=2661896.
- Bertsimas et al. [2016] D. Bertsimas, A. King, and R. Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
- Besbes et al. [2015] O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
- Bickel et al. [2009] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
- Blumensath and Davies [2009] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
- Bubeck et al. [2017] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2017.
- Bühlmann and Van De Geer [2011] P. Bühlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
- Candes and Tao [2007] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
- Carpentier and Munos [2012] A. Carpentier and R. Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
- Chakraborty et al. [2010] B. Chakraborty, S. Murphy, and V. Strecher. Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical Methods in Medical Research, 19(3):317–343, 2010.
- Chu et al. [2011] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
- Dani et al. [2008] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of annual Conference on Learning Theory (COLT), 2008.
- Donoho [2006] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
- Filippi et al. [2010] S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2010.
- Flaxman et al. [2005] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the annual ACM-SIAM symposium on Discrete algorithms (SODA), 2005.
- Foster et al. [2016] D. Foster, S. Kale, and H. Karloff. Online sparse linear regression. In Proceedings of annual Conference on Learning Theory (COLT), 2016.
- Foster et al. [2018] D. J. Foster, A. Agarwal, M. Dudík, H. Luo, and R. E. Schapire. Practical contextual bandits with regression oracles. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
- Gerchinovitz [2013] S. Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research, 14(Mar):729–769, 2013.
- Goldberg and Kosorok [2012] Y. Goldberg and M. R. Kosorok. Q-learning with censored data. Annals of Statistics, 40(1):529, 2012.
- Goldenshluger and Zeevi [2013] A. Goldenshluger and A. Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
- Hammer et al. [1996] S. M. Hammer, D. A. Katzenstein, M. D. Hughes, H. Gundacker, R. T. Schooley, R. H. Haubrich, W. K. Henry, M. M. Lederman, J. P. Phair, M. Niu, et al. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. New England Journal of Medicine, 335(15):1081–1090, 1996.
- Keyvanshokooh et al. [2019] E. Keyvanshokooh, M. Zhalechian, C. Shi, M. P. Van Oyen, and P. Kazemian. Contextual learning with online convex optimization: Theory and application to chronic diseases. Available at SSRN, 2019.
- Krause and Ong [2011] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2011.
- Lai and Robbins [1985] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- Langford et al. [2009] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10(Mar):777–801, 2009.
- Lattimore et al. [2015] T. Lattimore, K. Crammer, and C. Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015.
- Li et al. [2010] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the International Conference on World Wide Web (WWW). ACM, 2010.
- Li et al. [2011] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2011.
- Li et al. [2017] L. Li, Y. Lu, and D. Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of International Conference on Machine Learning (ICML), 2017.
- Miller [2002] A. Miller. Subset selection in regression. Chapman and Hall/CRC, 2002.
- Moodie et al. [2007] E. E. Moodie, T. S. Richardson, and D. A. Stephens. Demystifying optimal dynamic treatment regimes. Biometrics, 63(2):447–455, 2007.
- Murphy [2003] S. A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355, 2003.
- Murphy [2005] S. A. Murphy. An experimental design for the development of adaptive treatment strategies. Statistics in medicine, 24(10):1455–1481, 2005.
- Natarajan [1995] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
- Nemirovsky and Yudin [1983] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. SIAM, 1983.
- Orellana et al. [2010a] L. Orellana, A. Rotnitzky, and J. M. Robins. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. The International Journal of Biostatistics, 6(2), 2010a.
- Orellana et al. [2010b] L. Orellana, A. Rotnitzky, and J. M. Robins. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part II: proofs of results. The International Journal of Biostatistics, 6(2), 2010b.
- Pilanci et al. [2015] M. Pilanci, M. J. Wainwright, and L. El Ghaoui. Sparse learning via boolean relaxations. Mathematical Programming (Series B), 151(1):63–87, 2015.
- Raskutti et al. [2010] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
- Robins et al. [2008] J. Robins, L. Orellana, and A. Rotnitzky. Estimation and extrapolation of optimal treatment and testing strategies. Statistics in Medicine, 27(23):4678–4721, 2008.
- Robins et al. [2000] J. M. Robins, M. A. Hernan, and B. Brumback. Marginal structural models and causal inference in epidemiology, 2000.
- Rusmevichientong and Tsitsiklis [2010] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
- Shamir [2013] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Proceedings of annual Conference on Learning Theory (COLT), 2013.
- Shamir [2015] O. Shamir. On the complexity of bandit linear optimization. In Proceedings of annual Conference on Learning Theory (COLT), 2015.
- Song et al. [2015] R. Song, W. Wang, D. Zeng, and M. R. Kosorok. Penalized Q-learning for dynamic treatment regimens. Statistica Sinica, 25(3):901, 2015.
- Szepesvari [2016] C. Szepesvari. Lower bounds for stochastic linear bandits. http://banditalgs.com/2016/10/20/lower-bounds-for-stochastic-linear-bandits/, 2016. Accessed: Oct 15, 2018.
- Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
- Tropp [2012] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
- Wainwright [2009] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using -constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
- Wang et al. [2017] Y. Wang, S. Du, S. Balakrishnan, and A. Singh. Stochastic zeroth-order optimization in high dimensions. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
- Watkins and Dayan [1992] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- Zhang et al. [2012a] B. Zhang, A. A. Tsiatis, M. Davidian, M. Zhang, and E. Laber. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114, 2012a.
- Zhang et al. [2012b] B. Zhang, A. A. Tsiatis, E. B. Laber, and M. Davidian. A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010–1018, 2012b.
- Zhao et al. [2012] Y. Zhao, D. Zeng, A. J. Rush, and M. R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.
- Zhao et al. [2014] Y.-Q. Zhao, D. Zeng, E. B. Laber, R. Song, M. Yuan, and M. R. Kosorok. Doubly robust learning for estimating individualized treatment with censored data. Biometrika, 102(1):151–168, 2014.