
Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection

Yining Wang, Yi Chen, Ethan X. Fang, Zhaoran Wang, and Runze Li
Abstract

We consider the stochastic contextual bandit problem under the high-dimensional linear model. We focus on the case where the action space is finite and random, with each action associated with a randomly generated contextual covariate. This setting finds essential applications in personalized recommendation, online advertisement, and personalized medicine. However, it is very challenging, as we need to balance exploration and exploitation. We propose doubly growing epochs and estimating the parameter using the best subset selection method, which is easy to implement in practice. This approach achieves $\widetilde{\mathcal{O}}(s\sqrt{T})$ regret with high probability, which is nearly independent of the "ambient" regression model dimension $d$. We further attain a sharper $\widetilde{\mathcal{O}}(\sqrt{sT})$ regret by using the SupLinUCB framework, matching the minimax lower bound of low-dimensional linear stochastic bandit problems. Finally, we conduct extensive numerical experiments to empirically demonstrate the applicability and robustness of our algorithms.

Footnote 1: Yining Wang is an assistant professor of Information Systems and Operations Management at the Warrington College of Business, University of Florida, Gainesville, FL 32611, USA. Yi Chen is a Ph.D. student at the Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA. Ethan X. Fang is an assistant professor at the Department of Statistics, Pennsylvania State University, University Park, PA 16802-2111, USA. Zhaoran Wang is an assistant professor at the Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA. Runze Li is the Eberly Chair Professor at the Department of Statistics, Pennsylvania State University, University Park, PA 16802-2111, USA. Fang and Li's research was supported by NSF grants DMS 1820702, DMS 1953196, and DMS 2015539.

Footnote 2: Wang and Chen contributed equally to this manuscript.

Keywords: Best subset selection, high-dimensional models, regret analysis, stochastic bandit.

1 Introduction

Contextual bandit problems have received significant attention over the past years in different communities, such as statistics, operations research, and computer science. This class of problems studies how to make optimal sequential decisions as new information arrives: we aim to maximize the cumulative reward while iteratively improving our policy based on newly observed results. It finds many important modern applications such as personalized recommendation (Li et al., 2010, 2011), online advertising (Krause and Ong, 2011), cost-sensitive classification (Agarwal et al., 2014), and personalized medicine (Goldenshluger and Zeevi, 2013; Bastani and Bayati, 2015; Keyvanshokooh et al., 2019). In most contextual bandit problems, at each iteration, we first obtain some new information; we then take an action based on a certain policy and observe a new reward. This paper is concerned with the linear stochastic bandit model, one of the most fundamental models among contextual bandit problems: the expected reward at each time period is modeled as a linear function of some random covariate depending on our action. This model has received considerable attention (Auer, 2002; Abe et al., 2003; Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Chu et al., 2011), and as we discuss in more detail, it naturally finds applications in optimal sequential treatment regimes.

In a linear stochastic bandit model, at each time period $t\in\{1,2,\cdots,T\}$, we are given some action space $\mathcal{A}_{t}$. Each feasible action $i\in\mathcal{A}_{t}$ is associated with a $d$-dimensional contextual covariate $X_{t,i}\in\mathbb{R}^{d}$ that is known before any action is taken. Next, we take an action $i_{t}\in\mathcal{A}_{t}$ and then observe a reward $Y_{t}\in\mathbb{R}$. We assume that the reward follows the linear model

Y_{t}=\langle X_{t,i_{t}},\theta^{*}\rangle+\varepsilon_{t}, \qquad (1)

where $\theta^{*}\in\mathbb{R}^{d}$ is an unknown $d$-dimensional parameter, and $\{\varepsilon_{t}\}_{t=1}^{T}$ are noises. Without loss of generality, we assume that we aim to maximize the total reward. We measure the performance of the selected sequence of actions $\{i_{t}\}_{t=1}^{T}$ by the accumulated regret

R_{T}(\{i_{t}\};\theta^{*}):=\sum_{t=1}^{T}\Big(\max_{i\in\mathcal{A}_{t}}\langle X_{t,i},\theta^{*}\rangle-\langle X_{t,i_{t}},\theta^{*}\rangle\Big). \qquad (2)

Essentially, the regret measures the difference between the "best" we could achieve if we knew the true parameter $\theta^{*}$ and the "noiseless" reward we actually collect. The regret is always nonnegative, and our goal is to minimize it.
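To make the regret calculation concrete, here is a minimal Python sketch (our own illustration; the names `contexts`, `actions`, and `theta_star` are hypothetical) that evaluates (2) for a given action sequence:

```python
import numpy as np

def cumulative_regret(contexts, actions, theta_star):
    """Accumulated regret (2). contexts[t] is the k x d matrix whose rows are
    the covariates X_{t,i} at period t; actions[t] is the index of the arm
    actually played at period t."""
    regret = 0.0
    for X_t, i_t in zip(contexts, actions):
        rewards = X_t @ theta_star          # noiseless rewards <X_{t,i}, theta*>
        regret += rewards.max() - rewards[i_t]
    return regret
```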

Meanwhile, due to advances in technology, there are many modern applications in which we encounter the high-dimensionality issue, i.e., the dimension $d$ of the covariates is large. Note that here the total number of periods $T$ plays the role of the sample size, i.e., the number of pieces of information from which we are able to "learn" the true parameter $\theta^{*}$. Such a high-dimensionality issue arises in practice. For example, given the genomic information of some patients, we aim to find a policy that assigns each patient to the best treatment for him/her; the number of patients is usually much smaller than the dimension of the covariates. In this paper, we consider the linear stochastic contextual bandit problem under such a high-dimensional setting, where $d\gg T$. We assume that $\theta^{*}$ is sparse, in the sense that at most $s\ll d$ components of $\theta^{*}$ are non-zero. This assumption is commonly imposed in the high-dimensional statistics and signal processing literature (Donoho, 2006; Bühlmann and Van De Geer, 2011).

In addition, regarding the action spaces $\{\mathcal{A}_{t}\}_{t=1}^{T}$, it is known in the literature (Dani et al., 2008; Shamir, 2015; Szepesvari, 2016) that if there are infinitely many feasible actions at each period, the minimax lower bound is of order $\widetilde{\mathcal{O}}(\sqrt{sdT})$, which does not avoid the curse of dimensionality. We therefore assume that the action spaces $\{\mathcal{A}_{t}\}_{t=1}^{T}$ are finite, small, and random. In particular, we assume that for all $t$, $|\mathcal{A}_{t}|=k\ll d$, and each action in $\mathcal{A}_{t}$ is associated with an independently and randomly generated contextual covariate. In most practical applications, this finite action space setting is naturally satisfied; for example, in the treatment example, there are usually only a small number of feasible treatments available. We refer the readers to Section 3.1 for a complete description and discussion of the assumptions.

1.1 Literature Review

We briefly review existing works on linear stochastic bandit problems under both low-dimensional and high-dimensional settings. Under the classical low-dimensional setting, Auer (2002) pioneered the use of upper-confidence-bound (UCB) type algorithms for the linear stochastic bandit, which remain among the most powerful and fundamental algorithms for this class of problems and are also considered in Chu et al. (2011). Dani et al. (2008) and Rusmevichientong and Tsitsiklis (2010) study linear stochastic bandit problems with large or infinite action spaces and derive corresponding lower bounds. Under the high-dimensional setting, where $\theta^{*}$ is assumed sparse, and when the action spaces $\mathcal{A}_{t}$ are hyper-cubes $[-1,1]^{d}$, Lattimore et al. (2015) develop the SETC algorithm, which attains nearly dimension-independent regret bounds. We point out that this algorithm exploits the unique structure of hyper-cubes and is unlikely to be applicable to general action spaces, including the finite action spaces of our interest. Abbasi-Yadkori et al. (2012) consider a UCB-type algorithm for general action sets and obtain a regret upper bound of $\widetilde{\mathcal{O}}(\sqrt{sdT})$, which depends polynomially on the ambient dimension $d$. Note that here $\widetilde{\mathcal{O}}(\cdot)$ is the big-O notation that ignores all logarithmic factors. Carpentier and Munos (2012) consider a different reward model and obtain an $\widetilde{\mathcal{O}}(\|\theta^{*}\|_{2}^{2}s\sqrt{T})$ regret upper bound. Goldenshluger and Zeevi (2013) and Bastani and Bayati (2015) study a variant of the linear stochastic bandit problem in which only one contextual covariate $X_{t}$ is observed at each time period $t$, while each action $i_{t}$ corresponds to a different unknown model $\theta_{i_{t}}^{*}$. We point out that this model is a special case of our model (1), as discussed in Foster et al. (2018).

Another closely related problem is online sparse prediction (Gerchinovitz, 2013; Foster et al., 2016), in which sequential predictions $\widehat{Y}_{t}$ of $Y_{t}=\langle X_{t},\theta^{*}\rangle+\varepsilon_{t}$ are of interest, and the regret is measured in mean-square error $\sum_{t}|\widehat{Y}_{t}-Y_{t}|^{2}$. It can be further generalized to online empirical-risk minimization (Langford et al., 2009) or even the more general derivative-free/bandit convex optimization (Nemirovsky and Yudin, 1983; Flaxman et al., 2005; Agarwal et al., 2010; Shamir, 2013; Besbes et al., 2015; Bubeck et al., 2017; Wang et al., 2017). Most existing works along this direction assume continuous (infinite) action spaces $\{\mathcal{A}_{t}\}$, which permit small-perturbation type methods such as gradient estimation. We summarize the existing theoretical results on stochastic linear bandit problems in Table 1.

Table 1: Summary of existing results on low- and high-dimensional linear stochastic bandits. The contextual covariates $X$ and the regression model parameter $\theta^{*}$ are in $\mathbb{R}^{d}$. Logarithmic terms are dropped.
Example paper | Model | Action spaces | Model constraints | Regret
Auer (2002) | $Y=\langle X,\theta^{*}\rangle+\varepsilon$ | $|\mathcal{A}_{t}|\leq N$ | $|\langle X,\theta^{*}\rangle|\leq 1$ | $\widetilde{\mathcal{O}}(\sqrt{dT})$
Dani et al. (2008) | $Y=\langle X,\theta^{*}\rangle+\varepsilon$ | $\mathbb{R}^{d}$ | $|\langle X,\theta^{*}\rangle|\leq 1$ | $\widetilde{\mathcal{O}}(d\sqrt{T})$
Lattimore et al. (2015) | $Y=\langle X,\theta^{*}\rangle+\varepsilon$ | $[-1,1]^{d}$ | $\|\theta^{*}\|_{0}\leq s,\ \|\theta^{*}\|_{1}\leq 1$ | $\widetilde{\mathcal{O}}(s\sqrt{T})$
Abbasi-Yadkori et al. (2012) | $Y=\langle X,\theta^{*}\rangle+\varepsilon$ | $\mathbb{R}^{d}$ | $\|\theta^{*}\|_{0}\leq s,\ |\langle X,\theta^{*}\rangle|\leq 1$ | $\widetilde{\mathcal{O}}(\sqrt{sdT})$
Carpentier and Munos (2012) | $Y=\langle X,\theta^{*}\rangle+\varepsilon$ | $\{x:\|x\|_{2}\leq 1\}$ | $\|\theta^{*}\|_{0}\leq s$ | $\widetilde{\mathcal{O}}(\|\theta^{*}\|_{2}^{2}s\sqrt{T})$
This paper | $Y=\langle X,\theta^{*}\rangle+\varepsilon$ | $|\mathcal{A}_{t}|\leq k$ † | $\|\theta^{*}\|_{0}\leq s,\ \|\theta^{*}\|_{2}\leq r$ | $\widetilde{\mathcal{O}}(\sqrt{sT})$

† Contextual covariates $X_{t,i}$ associated with each $i\in\mathcal{A}_{t}$ are randomly generated.

From the application perspective of finding the optimal treatment regime, the existing literature focuses on achieving optimality in batch settings. General approaches include model-based methods such as Q-learning (Watkins and Dayan, 1992; Murphy, 2003; Moodie et al., 2007; Chakraborty et al., 2010; Goldberg and Kosorok, 2012; Song et al., 2015) and A-learning (Robins et al., 2000; Murphy, 2005), and model-free policy search methods (Robins et al., 2008; Orellana et al., 2010a, b; Zhang et al., 2012b; Zhao et al., 2012, 2014). These methods are all developed for batch settings in which the whole dataset is used to estimate the optimal treatment regime. This approach is applicable after a clinical trial is completed, or when an observational dataset is fully available. However, the batch-setting approach is not applicable when it is urgent to identify the optimal treatment regime. For a contemporary example, during the recent outbreak of coronavirus disease (COVID-19), it was extremely important to quickly identify an optimal or nearly optimal treatment regime assigning each patient to the best treatment among a few choices. Since the disease was novel, there was little or no historical data, so the batch-setting approaches mentioned above were not applicable. In contrast, our model naturally provides a "learn while optimizing" alternative that sequentially improves the policy/treatment regime. This provides an important motivating application of the proposed method.

1.2 Major Contributions

In this paper, we propose new algorithms that iteratively learn the parameter $\theta^{*}$ while minimizing the regret. Our algorithms use the "doubling trick" and modern optimization techniques, carefully balancing randomized exploration to fully learn the parameter against maximizing the reward to achieve near-optimal regret. In particular, our algorithms fall under the general UCB-type framework (Lai and Robbins, 1985; Auer et al., 2002). Briefly speaking, we start with some pure-exploration periods in which we take actions uniformly at random and estimate the parameter. Then, for a certain number of subsequent periods, which we call an epoch, we take the action at each time period by optimizing an upper confidence band built from the previous estimator. At the end of each epoch, we renew the estimator using the new information, enter the next epoch with the new estimator, and renew it again at the end of that epoch. We repeat this until the $T$-th time period.

The high-dimensional regime (i.e., $d\gg T$) poses significant challenges in our setting that cannot be resolved by existing works. First, unlike in the low-dimensional regime, where ordinary least squares (OLS) always admits closed-form solutions and error bounds, in the high-dimensional regime, most existing methods such as the Lasso (Tibshirani, 1996) or the Dantzig selector (Candes and Tao, 2007) require the sample covariance matrix to satisfy certain "restricted eigenvalue" conditions (Bickel et al., 2009), which do not hold under our setting with sequentially selected covariates. Additionally, our action spaces $\{\mathcal{A}_{t}\}$ are finite. This rules out several existing algorithms, including the SETC method (Lattimore et al., 2015), which exploits the specific structure of hyper-cube action sets, and finite-difference type algorithms in stochastic sparse convex optimization (Wang et al., 2017; Balasubramanian and Ghadimi, 2018). We adopt the best subset selection estimator (Miller, 2002) to derive valid confidence bands even from ill-conditioned sample covariance matrices. Note that while the optimization for best subset selection is NP-hard in theory (Natarajan, 1995), owing to the tremendous progress of modern optimization, solving such problems is practically efficient, as discussed in Pilanci et al. (2015); Bertsimas et al. (2016). In addition, the renewed estimator may correlate with the previous one, which decreases efficiency. We let the epoch sizes grow exponentially, which is known as the "doubling trick" (Auer et al., 1995); this "removes" the correlation between the support sets recovered by best subset regression. Our theoretical analysis is also motivated by known analytical frameworks such as the elliptical potential lemma (Abbasi-Yadkori et al., 2011) and the SupLinUCB framework (Auer, 2002), which we use to obtain sharp regret bounds.

We summarize our main theoretical contribution in the following corollary, which is essentially a simplified version of Theorem 2 in Section 3.3.

Corollary 1.

We assume that $|\mathcal{A}_{t}|=k$, $X_{t,i}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,I_{d})$, and $\varepsilon_{t}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,1)$ for each $1\leq t\leq T$. Then there exists a policy $\pi$ such that, with probability at least $1-\delta$, for all $\|\theta^{*}\|_{0}\leq s$ and $\|\theta^{*}\|\leq r$,

R_{T}(\{i_{t}\};\theta^{*})\lesssim\sqrt{sT}\cdot\mathrm{polylog}(k,d,T,r),

where $\mathrm{polylog}$ denotes a polynomial of logarithmic factors.

Note that this corollary holds even if $T\ll d$. Theorem 2 provides more general results than this corollary and applies to a broader family of distributions for the contextual covariates $\{X_{t,i}\}$. A simpler and more easily implementable algorithm, with a weaker regret guarantee of $\widetilde{\mathcal{O}}(s\sqrt{T})$, is given in Algorithm 1 and analyzed in Theorem 1. The form of the regret guarantee in Corollary 1 is particularly similar to its low-dimensional counterparts in Table 1, with the ambient dimension $d$ replaced by the intrinsic dimension $s$.

Notations. Throughout this paper, for an integer $n$, we use $[n]$ to denote the set $\{1,2,\ldots,n\}$. We use $\|\cdot\|_{1}$, $\|\cdot\|$, and $\|\cdot\|_{\infty}$ to denote the $\ell_{1}$, $\ell_{2}$, and $\ell_{\infty}$ norms of a vector, respectively. Given a matrix $\mathbf{A}$, we use $\|\cdot\|_{\mathbf{A}}$ to denote the $\ell_{2}$ norm weighted by $\mathbf{A}$; specifically, $\|X\|_{\mathbf{A}}:=\sqrt{X^{\top}\mathbf{A}X}$. We use $\langle\cdot,\cdot\rangle$ to denote the inner product of two vectors. Given a set $S\subseteq[d]$, $S^{c}$ is its complement and $|S|$ denotes its cardinality. Given a $d$-dimensional vector $X$, we use $[X]_{i}$ to denote its $i$-th coordinate and $\mathrm{supp}(X)$ to denote its support, the collection of indices corresponding to nonzero coordinates. Furthermore, we use $[X]_{S}=([X]_{i})_{i\in S}$ to denote the restriction of $X$ to $S$, which is an $|S|$-dimensional vector. Similarly, for a $d\times d$ matrix $\mathbf{A}=([\mathbf{A}]_{ij})_{i,j\in[d]}\in\mathbb{R}^{d\times d}$, we denote by $[\mathbf{A}]_{SS'}=([\mathbf{A}]_{jk})_{j\in S,k\in S'}$ the restriction of $\mathbf{A}$ to $S\times S'$, which is an $|S|\times|S'|$ matrix; when $S=S'$, we abbreviate $[\mathbf{A}]_{S}=[\mathbf{A}]_{SS}$. In addition, given two sequences of nonnegative real numbers $\{a_{n}\}_{n\geq 1}$ and $\{b_{n}\}_{n\geq 1}$, $a_{n}\lesssim b_{n}$ and $a_{n}\gtrsim b_{n}$ mean that there exists an absolute constant $0<C<\infty$ such that $a_{n}\leq Cb_{n}$ and $a_{n}\geq Cb_{n}$ for all $n$, respectively. We write $a_{n}\asymp b_{n}$ if $a_{n}\lesssim b_{n}$ and $a_{n}\gtrsim b_{n}$ hold simultaneously. We say that a random event $\mathcal{E}$ holds with probability at least $1-\delta$ if there exists some absolute constant $C$ such that the probability of $\mathcal{E}$ is at least $1-C\delta$. Finally, we remark that arm, action, and treatment all refer to actions in different applications. We also denote by $i_{t}$ the action taken in period $t$ and by $X_{t}=X_{t,i_{t}}$ the associated covariate.

Paper Organization. The rest of this paper is organized as follows. We present our algorithms in Section 2. Then we establish our theoretical results in Section 3, under certain technical conditions. We conduct extensive numerical experiments using both synthetic and real datasets to investigate the performance of our methods in Section 4. We conclude the paper in Section 5.

2 Methodologies

In this section, we present the proposed methods for the linear stochastic bandit problem, where we aim to minimize the regret defined in (2). In Section 2.1, we first introduce an algorithm called "Sparse-LinUCB" (SLUCB), summarized in Algorithm 1, which can be implemented efficiently, and use it to demonstrate the core idea of our algorithmic design. The SLUCB algorithm is a variant of the celebrated LinUCB algorithm (Chu et al., 2011) for classical linear contextual bandit problems. It is intuitive and easy to implement; however, we cannot derive the optimal regret upper bound for it. To close this gap, we further propose a more sophisticated algorithm called "Sparse-SupLinUCB" (SSUCB, Algorithm 2) in Section 2.2. In comparison with the SLUCB algorithm, the SSUCB algorithm constructs the upper confidence bands by sequentially selecting historical data and achieves the optimal rate (up to logarithmic factors).

2.1 Sparse-LinUCB Algorithm

As mentioned above, our algorithm is inspired by the standard LinUCB algorithm, which balances the tradeoff between exploration and exploitation following the principle of "optimism in the face of uncertainty". In particular, the LinUCB algorithm repeatedly constructs upper confidence bands for the potential rewards of the actions; the upper confidence bands are optimistic estimators, and we pick the action with the largest one. This leads to the optimal regret under the low-dimensional setting. However, under the high-dimensional setting, directly applying the LinUCB algorithm incurs suboptimal regret, since we only obtain loose confidence bands in the high-dimensional regime. Thus, it is desirable to construct tight confidence bands under the high-dimensional and sparse setting to achieve the optimal regret.

Inspired by the remarkable success of best subset selection (BSS) in high-dimensional regression problems, we propose to incorporate this powerful tool into the LinUCB algorithm. Meanwhile, since the BSS procedure is computationally expensive, it is impractical to execute it during every time period; moreover, as we discuss later, doing so is not necessary.

Before we present our algorithms, we briefly discuss the support of the parameters. Given a $d$-dimensional vector $\theta$, we denote by $\mathrm{supp}(\theta)$ the support set of $\theta$, which is the collection of indices of its nonzero coordinates:

\mathrm{supp}(\theta)=\big\{j\in[d]:[\theta]_{j}\neq 0\big\}.

This definition agrees with most of the literature. However, for the BSS procedure, it is desirable to generalize this definition. We propose the concept of "generalized support" as follows.

Definition 1 (Generalized Support).

Given a $d$-dimensional vector $\theta$, we call a subset $S\subseteq[d]$ a generalized support of $\theta$, denoted by $\mathrm{supp}^{+}(\theta)$, if

[\theta]_{j}=0,\ \forall j\notin S.

The generalized support $\mathrm{supp}^{+}(\theta)$ is a relaxation of the usual support, since any support is a generalized support (but not vice versa). Moreover, the generalized support is not unique: any superset of the support is a valid generalized support.

We distinguish between support and generalized support in order to define best subset selection without causing confusion. For example, consider a linear model with $\theta^{*}\in\mathbb{R}^{d}$, $X_{t}\in\mathbb{R}^{d}$, and $Y_{t}=\langle X_{t},\theta^{*}\rangle+\varepsilon_{t}$. Calculating the ordinary least squares estimator restricted to the generalized support $S\subseteq[d]$, which is denoted by

\widehat{\theta}=\operatorname*{argmin}_{\mathrm{supp}^{+}(\theta)=S}\Big\{\sum_{t}\big|Y_{t}-\langle X_{t},\theta\rangle\big|^{2}\Big\},

means that we fit a low-dimensional model using only the information in $S$ and set the coordinates of the estimator outside $S$ to zero. Formally, let $[\widehat{\theta}]_{S}\in\mathbb{R}^{|S|}$ be

[\widehat{\theta}]_{S}=\operatorname*{argmin}_{\phi\in\mathbb{R}^{|S|}}\Big\{\sum_{t}\big|Y_{t}-\langle[X_{t}]_{S},\phi\rangle\big|^{2}\Big\}.

Then we have

\big[\widehat{\theta}\big]_{j}=\big[[\widehat{\theta}]_{S}\big]_{j},\ j\in S;\qquad \big[\widehat{\theta}\big]_{j}=0,\ j\notin S.

Since we do not guarantee that $[\widehat{\theta}]_{j}\neq 0$ for all $j\in S$, we call $S$ a generalized support instead of a support.
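As an illustration of this restricted least-squares computation, the following short Python sketch (our own notation, not part of the original algorithm statement) fits an OLS estimator with generalized support $S$:

```python
import numpy as np

def restricted_ols(X, Y, S, d):
    """OLS restricted to a generalized support S (a list of indices):
    solve an |S|-dimensional regression on the columns in S and set all
    coordinates outside S to zero."""
    X_S = X[:, S]                                  # restrict covariates to S
    phi, *_ = np.linalg.lstsq(X_S, Y, rcond=None)  # |S|-dimensional OLS: [theta_hat]_S
    theta_hat = np.zeros(d)
    theta_hat[S] = phi                             # [theta_hat]_j = 0 for j outside S
    return theta_hat
```

Note that some entries of `theta_hat[S]` may themselves be zero, which is exactly why $S$ is only a generalized support of the estimator.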

We are now ready to present the details of the SLUCB algorithm, which works as follows. We first apply the "doubling trick", which partitions the whole horizon of $T$ decision periods into several consecutive epochs whose lengths grow geometrically. We only implement the BSS procedure at the end of each epoch to recover the support of the parameter $\theta^{*}$. Each epoch starts with a minimal number $n_{0}$ of periods of pure exploration, which we call the pure-exploration-stage. At each time period $t$ of this stage, we pick an action within $\mathcal{A}_{t}$ uniformly at random. Intuitively speaking, in the pure-exploration-stage, we explore the unknown parameter in all possible directions through uniform randomization. From a technical perspective, the pure-exploration-stage is desirable since it keeps the smallest sparse eigenvalue of the empirical design matrix bounded away from zero and hence helps us obtain an efficient estimator of $\theta^{*}$. After the pure-exploration-stage, we keep the estimated support from the previous epoch, say with $s$ nonzero components, unchanged, and treat the problem as an $s$-dimensional regression. Specifically, at each time period, we use the ridge estimator with penalty weight $\lambda$ to estimate $\theta^{*}$ and construct the corresponding confidence bands to guide the decision in the next time period.

In summary, we partition the time horizon $[T]$ into consecutive epochs $\{E_{\tau}\}_{\tau=1}^{\widetilde{\tau}}$ such that

[T]=\bigcup_{\tau=1}^{\widetilde{\tau}}E_{\tau},\qquad |E_{\tau}|=\max\{2^{\tau},n_{0}\}.

The length of the last epoch $E_{\widetilde{\tau}}$ may be less than $\max\{2^{\widetilde{\tau}},n_{0}\}$, since it is truncated at the horizon. By definition, the number of epochs satisfies $\widetilde{\tau}\asymp\log(T)$. Hence, in the SLUCB algorithm, we run the BSS procedure $\widetilde{\mathcal{O}}(\log(T))$ times. In our later simulation studies, we find that this is practical for moderately large dimensions.
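For concreteness, here is a small Python sketch of this partition (assuming the epoch lengths $|E_{\tau}|=\max\{2^{\tau},n_{0}\}$ defined above, with the last epoch truncated at the horizon):

```python
def epoch_lengths(T, n0):
    """Doubling-trick partition of [T] with |E_tau| = max(2^tau, n0);
    the last epoch is truncated so that the lengths sum to T."""
    lengths, used, tau = [], 0, 1
    while used < T:
        length = min(max(2 ** tau, n0), T - used)
        lengths.append(length)
        used += length
        tau += 1
    return lengths  # the number of epochs grows like log(T)
```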

Next, we introduce the details of constructing the upper confidence bands in the SLUCB algorithm. Suppose that at period $t\in E_{\tau}$ we pick action $i_{t}\in\mathcal{A}_{t}$ and observe the associated covariate $X_{t,i_{t}}$ and reward $Y_{t}$; we abbreviate $X_{t,i_{t}}$ as $X_{t}$ when there is no confusion. We denote by $\widehat{\theta}_{\tau-1,\lambda}$ the BSS estimator of the true parameter $\theta^{*}$ computed at the end of the previous epoch $E_{\tau-1}$, and let

S_{\tau-1}=\mathrm{supp}^{+}(\widehat{\theta}_{\tau-1,\lambda})

be its generalized support, i.e., the generalized support recovered by epoch $E_{\tau-1}$. For periods in the pure-exploration-stage of $E_{\tau}$, we pick arms uniformly at random and do not update the estimator of $\theta^{*}$. Beyond the pure-exploration-stage, we estimate $\theta^{*}$ by a ridge estimator. Let $\widehat{\theta}^{t-1}_{\tau,\lambda}$ be the most recently updated ridge estimator of $\theta^{*}$ by period $t\in E_{\tau}$, obtained by restricting its generalized support to $S_{\tau-1}$ and using the data $\{X_{t'},Y_{t'}\}_{t'\in E^{t-1}_{\tau}}$. In particular, all components of $\widehat{\theta}^{t-1}_{\tau,\lambda}$ outside $S_{\tau-1}$ are set to zero and

\widehat{\theta}^{t-1}_{\tau,\lambda}=\operatorname*{argmin}_{\mathrm{supp}^{+}(\theta)=S_{\tau-1}}\Big\{\sum_{t^{\prime}\in E^{t-1}_{\tau}}\big|Y_{t^{\prime}}-\langle X_{t^{\prime}},\theta\rangle\big|^{2}+\lambda\|\theta\|^{2}\Big\}.

Given $\widehat{\theta}^{t-1}_{\tau,\lambda}$, we calculate the upper confidence band of the potential reward $\langle X_{t,i},\theta^{*}\rangle$ for each possible action $i\in\mathcal{A}_{t}$. In particular, let the tuning parameters $\alpha$ and $\beta$ be

\alpha=\sigma\nu\sqrt{s\widetilde{\tau}}\cdot\log(\sigma\rho^{-1}kTd/\delta),\qquad \beta=r\sigma\cdot\sqrt{s\log(kTd/\delta)},

which correspond to the confidence level and an upper estimate of the potential reward, respectively. Then we calculate the upper confidence band associated with action $i$ as

\min\Big\{\beta,\ \big\langle X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1}\big\rangle+\alpha\cdot\Big(\sigma\sqrt{\log(kTd/\delta)/|E_{\tau-1}|}+\sqrt{[X_{t,i}]_{S_{\tau-1}}^{\top}\big[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}\big]^{-1}[X_{t,i}]_{S_{\tau-1}}}\Big)\Big\},

where

\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}=\lambda\mathbf{I}_{|S_{\tau-1}|}+\sum_{t^{\prime}\in E_{\tau}^{t-1}}[X_{t^{\prime}}]_{S_{\tau-1}}[X_{t^{\prime}}]^{\top}_{S_{\tau-1}}.

After that, we pick the arm $i_{t}$ with the largest upper confidence band to play and observe the corresponding reward $Y_{t}=\langle X_{t,i_{t}},\theta^{*}\rangle+\varepsilon_{t}$. We repeat this process until the end of epoch $E_{\tau}$.
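The following Python fragment (a sketch under our own naming; `width0` stands for the $\sigma\sqrt{\log(kTd/\delta)/|E_{\tau-1}|}$ offset term) shows how the restricted ridge estimator and the truncated upper confidence band above can be computed for one action:

```python
import numpy as np

def ridge_on_support(X, Y, S, lam, d):
    """Ridge estimator restricted to the generalized support S."""
    X_S = X[:, S]
    Gamma = lam * np.eye(len(S)) + X_S.T @ X_S   # regularized Gram matrix on S
    theta_hat = np.zeros(d)
    theta_hat[S] = np.linalg.solve(Gamma, X_S.T @ Y)
    return theta_hat, np.linalg.inv(Gamma)

def ucb_band(x, S, theta_hat, Gamma_inv, alpha, beta, width0):
    """min{beta, <x, theta_hat> + alpha * (width0 + ||[x]_S||_{Gamma^{-1}})}."""
    x_S = x[S]
    width = width0 + np.sqrt(x_S @ Gamma_inv @ x_S)
    return min(beta, x @ theta_hat + alpha * width)
```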

Then we run the BSS procedure using all data collected in $E_{\tau}$ to recover the support of $\theta^{*}$. We also enlarge the allowed size of the generalized support by $s$. To be specific, let $S_{\tau}$ be the generalized support recovered in this step. We require that $S_{\tau}$ satisfies the constraints

S_{\tau}\supseteq S_{\tau-1},\qquad |S_{\tau}|\leq\tau s,

and obtain the BSS estimator $\widehat{\theta}_{\tau,\lambda}$ as

\widehat{\theta}_{\tau,\lambda}=\mathrm{Proj}_{r}\Big\{\operatorname*{argmin}_{S_{\tau-1}\subseteq\mathrm{supp}^{+}(\theta),\,|\mathrm{supp}^{+}(\theta)|\leq\tau s}\Big\{\sum_{t^{\prime}\in E_{\tau}}\big|Y_{t^{\prime}}-\langle X_{t^{\prime}},\theta\rangle\big|^{2}+\lambda\|\theta\|^{2}\Big\}\Big\}, \qquad (3)

where $\mathrm{Proj}_{r}\{\cdot\}$ denotes the projection onto the centered $\ell_{2}$-ball of radius $r$, the upper bound on $\|\theta^{*}\|$ from Assumption 2. Note that, in comparison with the standard BSS estimator, we project the minimizer for regularization; the boundedness also simplifies our later theoretical analysis. We also add the inclusion restriction $S_{\tau}\supseteq S_{\tau-1}$ for technical reasons, which does not lead to any fundamental difference; as a result, we need to allow sparsity $\tau s$ instead of $s$. This inclusion boosts the probability of recovering the true support.
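Since the constrained program (3) searches over supports, a brute-force implementation is conceptually simple, though exponential in the support size; in practice one would use the mixed-integer or sketching solvers cited below. A minimal Python sketch, with our own variable names, for illustration only:

```python
import numpy as np
from itertools import combinations

def bss_ridge(X, Y, S_prev, max_size, lam, r):
    """Brute-force version of (3): search over supports S with S_prev included
    in S and |S| <= max_size, fit ridge on each, keep the best penalized fit,
    then project onto the l2-ball of radius r. Not the practical solver."""
    d = X.shape[1]
    rest = [j for j in range(d) if j not in set(S_prev)]
    best_val, best_theta = np.inf, np.zeros(d)
    for extra in range(max_size - len(S_prev) + 1):
        for comb in combinations(rest, extra):
            S = list(S_prev) + list(comb)
            if not S:
                continue                       # skip the empty support
            X_S = X[:, S]
            A = lam * np.eye(len(S)) + X_S.T @ X_S
            phi = np.linalg.solve(A, X_S.T @ Y)
            val = np.sum((Y - X_S @ phi) ** 2) + lam * phi @ phi
            if val < best_val:
                best_val = val
                best_theta = np.zeros(d)
                best_theta[S] = phi
    norm = np.linalg.norm(best_theta)
    if norm > r:
        best_theta *= r / norm                 # Proj_r onto the l2-ball of radius r
    return best_theta
```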

A pseudo-code description of the SLUCB algorithm is presented in Algorithm 1. We remark that although computing the BSS estimator is computationally expensive, many methods exist that solve this problem very efficiently in practice; see, for example, Pilanci et al. (2015) and Bertsimas et al. (2016) for more details. We also conduct extensive numerical experiments in Section 4 to demonstrate that our proposed algorithm is practical.

Input: sequentially arriving covariates $\{X_{t,i}\}_{t\in[T],i\in\mathcal{A}_{t}}$, length of pure-exploration-stage $n_{0}$, confidence level $\alpha$, estimated upper bound of reward $\beta$, sparsity level $s$, ridge regression penalty $\lambda$.
Output: action sequence $\{i_{t}\}_{t\in[T]}$.
partition $[T]$ into consecutive epochs $E_{1},E_{2},\cdots,E_{\widetilde{\tau}}$ such that $|E_{\tau}|=\max\{2^{\tau},n_{0}\}$;
initialization: $\widehat{\theta}_{0,\lambda}=0$, $S=\emptyset$;
for $\tau=1,2,\cdots,\widetilde{\tau}$ do
    for the first $n_{0}$ time periods $t\in E_{\tau}$ do
        select $i_{t}\in\mathcal{A}_{t}$ uniformly at random;
        set $\widehat{\theta}^{t}_{\tau,\lambda}=\widehat{\theta}_{\tau-1,\lambda}$;
    end for
    for the remaining time periods $t\in E_{\tau}$ do
        calculate the matrix
            $\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}=\lambda\mathbf{I}_{|S|}+\sum_{t^{\prime}\in E_{\tau}^{t-1}}[X_{t^{\prime}}]_{S}[X_{t^{\prime}}]^{\top}_{S}$;
        calculate the upper confidence band of the reward for each arm:
            $\overline{r}(X_{t,i})=\min\big\{\beta,\ \langle X_{t,i},\widehat{\theta}^{t-1}_{\tau,\lambda}\rangle+\alpha\big(\sigma\sqrt{\log(kTd/\delta)/|E_{\tau-1}|}+\|[X_{t,i}]_{S}\|_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\big)\big\}$;
        select the arm with the largest upper confidence band:
            $i_{t}=\operatorname*{argmax}_{i\in\mathcal{A}_{t}}\big\{\overline{r}(X_{t,i})\big\}$;
        observe the reward $Y_{t}=\langle X_{t,i_{t}},\theta^{*}\rangle+\varepsilon_{t}$;
        update the ridge estimator:
            $\widehat{\theta}^{t}_{\tau,\lambda}=\operatorname*{argmin}_{\mathrm{supp}^{+}(\theta)=S}\big\{\sum_{t^{\prime}\in E_{\tau}^{t}}|Y_{t^{\prime}}-\langle X_{t^{\prime}},\theta\rangle|^{2}+\lambda\|\theta\|^{2}\big\}$;
    end for
    update the best subset selection estimator:
        $\widehat{\theta}_{\tau,\lambda}=\mathrm{Proj}_{r}\big\{\operatorname*{argmin}_{S\subseteq\mathrm{supp}^{+}(\theta),\,|\mathrm{supp}^{+}(\theta)|\leq\tau s}\big\{\sum_{t^{\prime}\in E_{\tau}}|Y_{t^{\prime}}-\langle X_{t^{\prime}},\theta\rangle|^{2}+\lambda\|\theta\|^{2}\big\}\big\}$;
    update $S=\mathrm{supp}^{+}(\widehat{\theta}_{\tau,\lambda})$;
end for
Algorithm 1 Sparse-LinUCB Algorithm

2.2 Sparse-SupLinUCB Algorithm

Input: epoch index $\tau$, sequentially arriving covariates $\{X_{t,i}\}_{t\in E_{\tau},i\in\mathcal{A}_{t}}$, length of pure-exploration-stage $n_{0}$, confidence level $\gamma$, estimated upper bound of reward $\beta$, support $S_{\tau-1}$ recovered in the previous epoch, sparsity level $s$, ridge regression penalty $\lambda$.
Output: action sequence $\{i_{t}\}_{t\in E_{\tau}}$.
for the first $n_{0}$ time periods $t\in E_{\tau}$ do
    select $i_{t}\in\mathcal{A}_{t}$ uniformly at random;
end for
set $\widetilde{\zeta}=\lceil\log(\beta T)\rceil$, $S=S_{\tau-1}$, and initialize the sets $\{\Psi_{\tau}^{t,1},\cdots,\Psi_{\tau}^{t,\widetilde{\zeta}}\}$ by evenly partitioning the $n_{0}$ time periods of the pure-exploration-stage;
for the remaining time periods $t$ in $E_{\tau}$ do
    initialize $\zeta=1$, $\mathcal{N}_{\tau}^{t-1,\zeta}=\mathcal{A}_{t}$;
    repeat
        compute the restricted ridge estimator
            $\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}=\operatorname*{argmin}_{\mathrm{supp}^{+}(\theta)=S}\big\{\sum_{t^{\prime}\in\Psi_{\tau}^{t-1,\zeta}}|Y_{t^{\prime}}-\langle X_{t^{\prime}},\theta\rangle|^{2}+\lambda\|\theta\|^{2}\big\}$;
        compute the matrix
            $\widehat{\mathbf{\Gamma}}^{t-1,\zeta}_{\tau-1,\lambda}=\lambda\mathbf{I}_{|S|}+\sum_{t^{\prime}\in\Psi^{t-1,\zeta}_{\tau}}[X_{t^{\prime}}]_{S}[X_{t^{\prime}}]_{S}^{\top}$;
        compute the confidence band for each $i\in\mathcal{N}_{\tau}^{t-1,\zeta}$:
            $\omega_{\tau,\lambda}^{t-1,\zeta}(i)=\gamma\big(\sqrt{s/|E_{\tau-1}|}+\|[X_{t,i}]_{S}\|_{[\widehat{\mathbf{\Gamma}}^{t-1,\zeta}_{\tau-1,\lambda}]^{-1}}\big)$;
        if $\omega_{\tau,\lambda}^{t-1,\zeta}(i)\leq 1/\sqrt{T}$ for all $i\in\mathcal{N}_{\tau}^{t-1,\zeta}$ then
            select
                $i_{t}=\operatorname*{argmax}_{i\in\mathcal{N}_{\tau}^{t-1,\zeta}}\min\big\{\beta,\ \langle X_{t,i},\widehat{\theta}_{\tau}^{t-1,\zeta}\rangle+\omega_{\tau,\lambda}^{t-1,\zeta}(i)\big\}$
            as the arm to play and update $\Psi_{\tau}^{t,\zeta}\leftarrow\Psi_{\tau}^{t-1,\zeta}$ for all $\zeta\in[\widetilde{\zeta}]$;
        else if $\omega_{\tau,\lambda}^{t-1,\zeta}(i)\leq 2^{-\zeta}\beta$ for all $i\in\mathcal{N}_{\tau}^{t-1,\zeta}$ then
            eliminate suboptimal arms:
                $\mathcal{N}_{\tau}^{t-1,\zeta+1}=\big\{i\in\mathcal{N}_{\tau}^{t-1,\zeta}:\langle X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}\rangle\geq\max_{j\in\mathcal{N}_{\tau}^{t-1,\zeta}}\langle X_{t,j},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}\rangle-2^{1-\zeta}\beta\big\}$;
            move to the next group and update $\zeta\leftarrow\zeta+1$;
        else
            select some $i_{t}\in\mathcal{N}_{\tau}^{t-1,\zeta}$ with $\omega_{\tau,\lambda}^{t-1,\zeta}(i_{t})>2^{-\zeta}\beta$ as the arm to play;
            update $\Psi_{\tau}^{t,\zeta}\leftarrow\Psi_{\tau}^{t-1,\zeta}\cup\{t\}$ and $\Psi_{\tau}^{t,\zeta^{\prime}}\leftarrow\Psi_{\tau}^{t-1,\zeta^{\prime}}$ for all $\zeta^{\prime}\neq\zeta$;
        end if
    until an arm $i_{t}\in\mathcal{A}_{t}$ is selected;
end for
Algorithm 2 Sparse-SupLinUCB Subroutine

Although the SLUCB algorithm is intuitive and easy to implement, we are unable to prove the optimal upper bound for its regret due to technical reasons. Specifically, as discussed in the next section, we can only establish an $\widetilde{\mathcal{O}}(s\sqrt{T})$ upper bound for the regret of the SLUCB algorithm, while the optimal regret is $\widetilde{\mathcal{O}}(\sqrt{sT})$; here we omit all constants and logarithmic factors and only consider the dependency on the horizon length and dimension parameters. The obstacle to optimality is the dependency of the covariates on the random noises. Recall that at each period $t\in E_{\tau}$, the SLUCB algorithm constructs the ridge estimator $\widehat{\theta}^{t-1}_{\tau,\lambda}$ using all historical data, where the designs $\{X_{t^{\prime}}\}_{t^{\prime}\in E^{t-1}_{\tau}}$ are correlated with the noises $\{\varepsilon_{t^{\prime}}\}_{t^{\prime}\in E^{t-1}_{\tau}}$ due to the UCB-type policy. Such a complicated correlation prevents us from establishing tight confidence bands for the predicted rewards, which results in suboptimal regret.

To close the aforementioned gap and achieve optimality, we modify the seminal SupLinUCB algorithm (Auer, 2002; Chu et al., 2011), originally proposed to attain the optimal regret for the classical stochastic linear bandit problem, and use it as a subroutine in our framework; this yields the Sparse-SupLinUCB (SSUCB) algorithm. Specifically, we replace the ridge estimator and the UCB-type policy with a modified SupLinUCB subroutine. The basic idea of the SupLinUCB algorithm is to separate the dependent designs into several groups such that within each group, the designs and noises are independent of each other; the ridge estimators of the true parameter are then calculated on each group individually. Thanks to this independence, we can derive tighter confidence bands by applying sharper concentration inequalities, which ultimately gives rise to the optimal regret.

Next, we present the details of the SupLinUCB subroutine and show how to embed it in our framework. For each period $t\in E_{\tau}$ that belongs to the UCB-stage, the SupLinUCB algorithm partitions the historical periods $E^{t-1}_{\tau}$ into $\widetilde{\zeta}$ disjoint groups

E^{t-1}_{\tau}=\Psi^{t-1,1}_{\tau}\cup\cdots\cup\Psi^{t-1,\widetilde{\zeta}}_{\tau},

where $\widetilde{\zeta}=\lceil\log(\beta T)\rceil$. These groups are initialized by evenly partitioning the periods of the pure-exploration-stage and are updated sequentially according to the rule introduced below. At each period $t$, we screen the groups $\{\Psi^{t-1,\zeta}_{\tau}\}$ one by one (in ascending order of the index $\zeta$) to either determine the action to take or eliminate some clearly sub-optimal actions.

Suppose that we are at the $\zeta$-th group. Let $\mathcal{N}^{t-1,\zeta}_{\tau}$ be the set of candidate actions still kept at the $\zeta$-th step, initialized as the whole action space $\mathcal{A}_{t}$ when $\zeta=1$. We first calculate the ridge estimator $\widehat{\theta}^{t-1,\zeta}_{\tau,\lambda}$ restricted to the generalized support $S_{\tau-1}$, using the data from group $\Psi^{t-1,\zeta}_{\tau}$. Then for each action $i\in\mathcal{N}^{t-1,\zeta}_{\tau}$, we calculate $\omega_{\tau,\lambda}^{t-1,\zeta}(i)$, the width of the confidence band of the potential reward. Specifically, we have

\widehat{\theta}^{t-1,\zeta}_{\tau,\lambda}=\operatorname*{argmin}_{\mathrm{supp}^{+}(\theta)=S_{\tau-1}}\Big\{\sum_{t^{\prime}\in\Psi^{t-1,\zeta}_{\tau}}\big|Y_{t^{\prime}}-\langle X_{t^{\prime},i_{t^{\prime}}},\theta\rangle\big|^{2}+\lambda\|\theta\|^{2}\Big\},
\omega^{t-1,\zeta}_{\tau,\lambda}(i)=\gamma\cdot\Big(\sqrt{s/|E_{\tau-1}|}+\sqrt{[X_{t,i}]^{\top}_{S_{\tau-1}}\big[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}\big]^{-1}[X_{t,i}]_{S_{\tau-1}}}\Big),

where the tuning parameter

\gamma=\sigma(\lambda^{1/2}+\nu+\sigma)\log^{3/2}(kTd/\delta),

corresponds to the confidence level. Our next step depends on the values of $\omega_{\tau,\lambda}^{t-1,\zeta}(i)$. If

\omega_{\tau,\lambda}^{t-1,\zeta}(i)\leq 1/\sqrt{T},\ \forall i\in\mathcal{N}^{t-1,\zeta}_{\tau},

which means that the widths of the confidence bands are uniformly small, then we pick the action with the largest upper confidence band

\min\Big\{\beta,\ \langle X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}\rangle+\omega_{\tau,\lambda}^{t-1,\zeta}(i)\Big\},

where $\beta=r\sigma\cdot\sqrt{s\log(kTd/\delta)}$ is an upper estimate of the potential rewards. In this case, we discard the data point $(X_{t},Y_{t})$ and do not update any group, i.e., we set $\Psi^{t,\zeta}_{\tau}=\Psi^{t-1,\zeta}_{\tau}$ for all $\zeta\in[\widetilde{\zeta}]$.

Otherwise, if there exists some $i\in\mathcal{N}^{t-1,\zeta}_{\tau}$ such that $\omega_{\tau,\lambda}^{t-1,\zeta}(i)>2^{-\zeta}\beta$, which means that the width of the confidence band is not yet sufficiently small, then we pick such an action $i$ to play, for exploration. In this case, we add the period $t$ to the $\zeta$-th group while keeping all other groups unchanged, i.e.,

\Psi^{t,\zeta}_{\tau}=\Psi^{t-1,\zeta}_{\tau}\cup\{t\},\qquad \Psi^{t,\eta}_{\tau}=\Psi^{t-1,\eta}_{\tau}\ \text{for}\ \eta\neq\zeta.

Finally, if neither of the above scenarios happens, which implies that $\omega_{\tau,\lambda}^{t-1,\zeta}(i)\leq 2^{-\zeta}\beta$ for all $i\in\mathcal{N}^{t-1,\zeta}_{\tau}$, then we do not take any action for now. Instead, we eliminate some clearly sub-optimal actions and move to the next group $\Psi^{t-1,\zeta+1}_{\tau}$. In particular, we update the set of candidate arms as

\mathcal{N}_{\tau}^{t-1,\zeta+1}=\Big\{i\in\mathcal{N}_{\tau}^{t-1,\zeta}:\big\langle X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}\big\rangle\geq\max_{j\in\mathcal{N}_{\tau}^{t-1,\zeta}}\big\langle X_{t,j},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}\big\rangle-2^{1-\zeta}\beta\Big\}.

We repeat the above procedure until an arm is selected. Since the number of groups is $\widetilde{\zeta}=\lceil\log(\beta T)\rceil$ and $2^{-\widetilde{\zeta}}\beta\leq 1/T\leq 1/\sqrt{T}$, the SupLinUCB subroutine is guaranteed to terminate. Finally, by replacing the direct ridge regression and UCB-type policy with the SupLinUCB subroutine above, we obtain the SSUCB algorithm; the pseudo-code is presented in Algorithm 2.
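To summarize the control flow, the following schematic Python sketch (our own abstraction; `widths` and `estimates` stand in for the ridge-based quantities $\omega^{t-1,\zeta}_{\tau,\lambda}(i)$ and $\langle X_{t,i},\widehat{\theta}^{t-1,\zeta}_{\tau,\lambda}\rangle$ computed from group $\Psi^{t-1,\zeta}_{\tau}$) mirrors one period of the screening loop:

```python
import numpy as np

def suplinucb_screen(cands, widths, estimates, beta, T):
    """One period of the SupLinUCB screening loop (schematic).

    cands     : list of candidate action indices (initially the whole action set)
    widths    : widths(zeta, i)    -> confidence width omega^{t-1,zeta}(i)
    estimates : estimates(zeta, i) -> estimated reward <X_{t,i}, theta_hat^{t-1,zeta}>
    Returns (chosen_action, group_to_record); group_to_record is None when the
    period's data point is discarded."""
    zeta = 1
    while True:
        w = {i: widths(zeta, i) for i in cands}
        if all(w[i] <= 1 / np.sqrt(T) for i in cands):
            # all bands uniformly tight: exploit the largest truncated upper band
            best = max(cands, key=lambda i: min(beta, estimates(zeta, i) + w[i]))
            return best, None
        wide = [i for i in cands if w[i] > beta * 2.0 ** (-zeta)]
        if wide:
            # some band still wide: play such an arm and record t in group zeta
            return wide[0], zeta
        # all bands moderately tight: eliminate clearly suboptimal arms, go deeper
        top = max(estimates(zeta, i) for i in cands)
        cands = [i for i in cands
                 if estimates(zeta, i) >= top - beta * 2.0 ** (1 - zeta)]
        zeta += 1
```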

3 Theoretical Results

In this section, we present the theoretical results for the SLUCB and SSUCB algorithms. We use the regret, a standard performance measure in the literature, to evaluate our algorithms. We denote by $\{i_{t}\}_{t\in[T]}$ the action sequence generated by an algorithm. Then, given the true parameter $\theta^{*}$ and covariates $\{X_{t,i}\}_{t\in[T],i\in\mathcal{A}_{t}}$, the regret of the sequence $\{i_{t}\}_{t\in[T]}$ is defined as

R_{T}\big(\{i_{t}\},\theta^{*}\big)=\sum_{t=1}^{T}\big\langle X_{t,i_{t}^{*}},\theta^{*}\big\rangle-\big\langle X_{t,i_{t}},\theta^{*}\big\rangle, \qquad (4)

where $i_{t}^{*}=\operatorname*{argmax}_{i\in\mathcal{A}_{t}}\langle X_{t,i},\theta^{*}\rangle$ denotes the optimal action under the true parameter. The regret measures the discrepancy in accumulated reward between the actions actually taken and the oracle actions that would be taken if the true parameter were known to the decision-maker. In what follows, we first introduce some technical assumptions in Section 3.1 to facilitate our discussion. Then we study the regrets of the SLUCB and SSUCB algorithms in Sections 3.2 and 3.3, respectively.

3.1 Assumptions

We now present the assumptions used in our theoretical analysis and discuss their relevance and implications. We consider finite action spaces $\{\mathcal{A}_{t}\}_{t=1}^{T}$ and assume that there exists some constant $k$ such that

|\mathcal{A}_{t}|=k,\ \forall t\in[T].

We also assume that for each period $t$, the covariates $\{X_{t,i}\}_{i\in\mathcal{A}_{t}}$ are sampled independently from an unknown distribution $P_{0}$. We further impose the following assumptions on the distribution $P_{0}$.

Assumption 1.

Let the random vector $X\in\mathbb{R}^{d}$ follow the distribution $P_{0}$. Then $X$ satisfies:

(A1) (Sub-Gaussianity) The random vector $X\in\mathbb{R}^{d}$ is centered and sub-Gaussian with parameter $\sigma^{2}\geq 1$; that is, $\mathbb{E}[X]=0$ and

\mathbb{E}\big[\exp\{\sigma a^{\top}X\}\big]\leq\exp\big\{\sigma^{2}\|a\|^{2}/2\big\},\ \forall a\in\mathbb{R}^{d};

(A2) (Non-degeneracy) There exists a constant $\rho\in(0,\sigma]$ such that

\mathbb{E}\big[[X]_{j}^{2}\big]\geq\rho,\ \forall j\in[d];

(A3) (Independent coordinates) The $d$ coordinates of $X$ are independent of each other.

We briefly discuss Assumption 1. First, (A1) is a standard assumption in the literature, with sub-Gaussianity covering a broad family of distributions including Gaussian, Rademacher, and bounded distributions. Assumption (A2) is a non-degeneracy assumption which, together with (A3), implies that the smallest eigenvalue of the population covariance matrix $\mathbb{E}[XX^{\top}]$ is bounded from below by the constant $\rho>0$. Similar assumptions are adopted in the high-dimensional statistics literature in order to prove "restricted eigenvalue" conditions for sample covariance matrices (Raskutti et al., 2010), which are essential in the analysis of penalized least squares methods (Wainwright, 2009; Bickel et al., 2009). We emphasize, however, that in this paper the covariates indexed by the selected actions $\{i_{t}\}$ do not satisfy the restricted eigenvalue condition in general; therefore, we need novel, non-standard analyses of high-dimensional M-estimators. For Assumption (A3), at a high level, independence among coordinates enables relatively independent exploration in different dimensions, similar to the key idea of the SETC method (Lattimore et al., 2015). Technically, (A3) is used to establish the key independence between the sample covariance matrices restricted within and outside the recovered support. Owing to this independence, the rewards in the unexplored directions at each period are independent as well and can be estimated efficiently.

We also impose the following assumptions on the unknown $d$-dimensional true parameter $\theta^{*}$.

Assumption 2.

(B1) (Sparsity) The true parameter $\theta^{*}$ is sparse: there exists an $s\ll d$ such that $|\mathrm{supp}(\theta^{*})|=s$.

(B2) (Boundedness) There exists a constant $r\geq 1$ such that $\|\theta^{*}\|\leq r$.

Note that in Assumption 2, (B1) is the key sparsity assumption: only $s\ll d$ components of the true parameter $\theta^{*}$ are non-zero. Assumption (B2) is a boundedness condition on the $\ell_{2}$-norm of $\theta^{*}$; this assumption is often imposed, either explicitly or implicitly, in contextual bandit problems to derive an upper bound on the rewards (Dani et al., 2008; Chu et al., 2011).

Finally, we impose a sub-Gaussian assumption on the noise sequence $\{\varepsilon_{t}\}_{t=1}^{T}$, which is standard in most of the statistics and bandit literature.

Assumption 3.

(C1) (Sub-Gaussian noise) The random noises $\{\varepsilon_{t}\}_{t=1}^{T}$ are independent, centered, and sub-Gaussian with parameter $\nu^{2}\geq 1$.

3.2 Regret Analysis of Sparse-LinUCB

In this section, we analyze the performance of the SLUCB algorithm. As discussed earlier, we measure performance via the regret defined in (4). We show that with tailored choices of the pure-exploration-stage length $n_{0}$ and the tuning parameters $\alpha$ and $\beta$, the accumulated regret of the SLUCB algorithm is upper bounded by $\widetilde{\mathcal{O}}(s\sqrt{T})$ (up to logarithmic factors) with high probability. Formally, we have the following theorem.

Theorem 1.

For any $\delta\in(0,1)$, let

n_{0}\asymp\rho^{-1}\cdot(\nu\sigma\widetilde{\tau})^{2}\cdot s^{3}\cdot\log^{4}\big(kTd/(\delta\lambda)\big),\qquad \alpha=\sigma\nu\sqrt{s\widetilde{\tau}}\cdot\log(\sigma\rho^{-1}dkT/\delta),\qquad \beta=\sigma r\sqrt{s\log(kTd/\delta)}.

Under Assumptions 1-3, the regret of the action sequence $\{i_{t}\}_{t=1}^{T}$ generated by the Sparse-LinUCB algorithm is upper bounded by

R_{T}\big(\{i_{t}\},\theta^{*}\big)\lesssim\Big(n_{0}\beta+\alpha\sqrt{sT\log(T)\cdot\big(\lambda+\log(\sigma dTk/\delta)\big)}+\alpha\sigma\sqrt{T\log(kTd/\delta)}\Big)\cdot\log(T),

with probability at least $1-\delta$.

Note that in Theorem 1, if we omit all constants and logarithmic factors, the dominating term in the accumulated regret is

\widetilde{\mathcal{O}}(\alpha\cdot\sqrt{sT})=\widetilde{\mathcal{O}}(s\sqrt{T}).

Theorem 1 builds on a nontrivial combination of the UCB-type algorithm and the best subset selection method. To save space, we put the complete proof of Theorem 1 in Appendix A.

3.3 Regret Analysis of Sparse-SupLinUCB

In comparison with the SLUCB algorithm, the SSUCB algorithm splits the historical data into several groups dynamically. In each period, we sequentially update the ridge estimator and the corresponding confidence bands using data from a single group instead of the whole dataset. The motivation for using a single group of data is to achieve independence between the design matrix and the random noises within each group, which leads to tighter confidence bands via a sharper concentration inequality. The tighter upper bounds on the predicted rewards lead to an improved regret. In particular, we have the following theorem.

Theorem 2.

For any $\delta\in(0,1)$, let

n_{0}\asymp\rho^{-1}\cdot(\nu\sigma\widetilde{\tau})^{2}\cdot s^{3}\cdot\log^{4}\big(kTd/(\delta\lambda)\big),\qquad \beta=\sigma r\sqrt{s\log(kTd/\delta)},\qquad \gamma=\sigma(\lambda^{1/2}+\nu+\sigma)\log^{3/2}(kTd/\delta).

Then under Assumptions 1-3, the regret of the action sequence $\{i_{t}\}_{t=1}^{T}$ generated by the Sparse-SupLinUCB algorithm is upper bounded by

R_{T}\big(\{i_{t}\},\theta^{*}\big)\lesssim\sqrt{T}+\Big(n_{0}\beta+\gamma\sqrt{sT\cdot\log(T)\big(\lambda+\log(\sigma dTk/\delta)\big)}\Big)\cdot\log(T)\cdot\log(\beta T),

with probability at least $1-\delta$.

Note that in Theorem 2, if we omit all constants and logarithmic factors, the dominating term in the regret upper bound is of order $\widetilde{\mathcal{O}}(\sqrt{sT})$. This improves the rate in Theorem 1 by a factor of $\widetilde{\mathcal{O}}(\sqrt{s})$ and achieves the optimal rate (up to logarithmic factors). Theorem 2 builds on a tailored analysis of the SupLinUCB subroutine. To save space, we also put the complete proof of Theorem 2 in Appendix B.

4 Numerical Experiments

In this section, we use extensive numerical experiments to demonstrate the efficiency of our proposed algorithms and to validate our theoretical results. To simplify implementation, we only test the SLUCB algorithm, which can be implemented efficiently. Theoretically, the SLUCB algorithm only achieves a sub-optimal regret guarantee due to technical reasons; however, our extensive experiments suggest that it attains near-optimality in practice. Specifically, we first empirically verify the regret guarantees established in Section 3 via simulations under various settings. Then we apply the proposed method to a sequential treatment assignment problem using a real medical dataset. In the simulation studies, we sample the contextual covariates from the $d$-dimensional standard Gaussian distribution. For the real dataset test, we sample the covariates from the empirical distribution, which may not be Gaussian, and some assumptions in Section 3.1 are violated. Nevertheless, in all scenarios, our algorithm performs well and achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret, which demonstrates its robustness and practicability.

4.1 Simulation Studies

In this section, we present the details of our simulation studies. We first demonstrate the $\widetilde{\mathcal{O}}(\sqrt{T})$ growth rate of the regret empirically. Then we fix the horizon length $T$ and dimension $d$ and study the dependency of the accumulated regret on the sparsity $s$. To demonstrate the power of best subset selection, we also compare our algorithm's performance with the oracle, where the decision-maker knows the true support of the underlying parameter. Since the bottleneck of the computing time in our algorithm is the best subset selection step, which requires solving a mixed-integer program, it is appealing to replace this step with other variable selection methods, such as the Lasso and iterative hard thresholding (IHT) (Blumensath and Davies, 2009); we therefore also test the performance of these variants. We fix the number of actions $k=60$ in all settings.

4.1.1 Experiment 1: Growth of regret

In this experiment, we study the growth rate of the regret. We run two sets of experiments: in the first, $d=100$, $T=1300$, and $s=5,10,15$; in the second, $d=300$, $T=1970$, and $s=15,20,25$. For each setup, we replicate the experiment $20$ times and then calculate the corresponding mean and $90\%$ confidence interval. We present the results in Figure 1. For each fixed $d$ and $s$, the growth rate of the regret is about $\widetilde{\mathcal{O}}(\sqrt{T})$, which validates our theory.

Figure 1: Plot of regret vs. time periods. In (a), we set the dimension $d=100$, the horizon length $T=1300$, and the sparsity $s=5,10,15$. In (b), we set the dimension $d=300$, the horizon length $T=1970$, and the sparsity $s=5,15,25$. For each setting, we replicate the experiment $20$ times. Solid lines are the means of the regret; shaded areas denote the corresponding empirical confidence intervals.

4.1.2 Experiment 2: Dependency on sparsity

In this experiment, we fix the dimension $d$ and the horizon length $T$ and let the sparsity $s$ vary. We calculate the accumulated regret at the end of the horizon. We again run two sets of experiments: in the first case, $d=100$, $T=1300$, and $s=5,8,11,\cdots,23$; in the second case, $d=300$, $T=1970$, and $s=5,8,11,\cdots,23$. We present the results in Figure 2. Although Theorem 1 only provides an $\widetilde{\mathcal{O}}(s\sqrt{T})$ regret guarantee for the SLUCB algorithm, the linear dependence of the accumulated regret on $\sqrt{s}$ suggests that it actually attains the optimal $\widetilde{\mathcal{O}}(\sqrt{sT})$ rate in practice.

Figure 2: Plot of accumulated regret vs. $\sqrt{s}$. In (a), we set the dimension $d=100$, the horizon length $T=1300$, and the sparsity $s=5,8,11,\cdots,23$. In (b), we set the dimension $d=300$, the horizon length $T=1970$, and the sparsity $s=5,8,11,\cdots,23$.

4.1.3 Experiment 3: Comparison with variants of the main algorithm and the oracle

In this experiment, we compare the performance and the computing time of our algorithm with several variants that substitute the best subset selection subroutine with Lasso and IHT. We also compare with the oracle regret, where the decision-maker knows the true support of the parameter. In more detail, for the first variant, we use Lasso to recover the support of the true parameter at the end of each epoch; we tune the $\ell_{1}$-penalty $\lambda$ such that the size of the support of the estimator is approximately equal to $s$, and then use this support in the next epoch. For the second variant, we apply IHT to estimate the parameter and set the hard-thresholding sparsity level to $s$. Minimal sketches of both substitutes are given below.
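The sketches assume a design matrix X and responses y collected in the current epoch; the function names and the penalty grid are illustrative choices, not the authors' implementation.

import numpy as np
from sklearn.linear_model import Lasso

def lasso_support(X, y, s, lambdas=np.logspace(-3, 1, 30)):
    """Sweep the l1-penalty and keep the support whose size is closest to s."""
    best_supp, best_gap = np.arange(X.shape[1]), np.inf
    for lam in lambdas:
        coef = Lasso(alpha=lam, max_iter=10000).fit(X, y).coef_
        supp = np.flatnonzero(coef)
        if abs(len(supp) - s) < best_gap:
            best_supp, best_gap = supp, abs(len(supp) - s)
    return best_supp

def iht_support(X, y, s, n_iter=200):
    """Gradient steps on the least-squares loss, each followed by hard
    thresholding onto the s largest coordinates (Blumensath and Davies, 2009)."""
    n, d = X.shape
    step = 1.0 / np.linalg.norm(X, 2) ** 2            # conservative step size
    theta = np.zeros(d)
    for _ in range(n_iter):
        theta = theta + step * X.T @ (y - X @ theta)  # gradient step
        theta[np.argsort(np.abs(theta))[:-s]] = 0.0   # keep s largest entries
    return np.flatnonzero(theta)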

We run two settings of experiments, corresponding to $d=100$, $s=15$, $T=1300$ and $d=300$, $s=15$, $T=1300$. We again replicate each setting $20$ times. For the first case, the average computing times are $4.58$ seconds for Lasso, $34.62$ seconds for IHT, and $310.61$ seconds for best subset selection. For the second case, the average computing times are $9.10$ seconds for Lasso, $39.47$ seconds for IHT, and $516.92$ seconds for best subset selection. We display the associated regret curves in Figure 3. As we can see, although the computing time of Lasso is the shortest among the three methods, its performance is significantly weaker. The computing time of IHT is slightly longer than that of Lasso, but it achieves performance similar to best subset selection, which suggests that IHT may be a good alternative in practice when computing resources are limited. Finally, although the computing time of best subset selection is the longest, its performance is the best: its regret almost matches that of the oracle. Such a result demonstrates the power of best subset selection in high-dimensional stochastic bandit problems.

Figure 3: Plot of regret curves of different algorithms. In (a), we set $d=100$, $s=15$, and $T=1300$. In (b), we set $d=300$, $s=15$, and $T=1300$. We test four algorithms: Lasso, IHT, BSS, and the oracle. We replicate each setting $20$ times. Solid lines are the means of the regret. Shaded areas denote the corresponding confidence intervals.

4.2 Real Dataset Test

In this section, we apply our method to a sequential treatment assignment problem based on the ACTG175 dataset from the R package “speff2trial”. The scenario is that patients arrive at a clinic sequentially. Several candidate treatments are available, and different treatments may have different effects. Hence, for each patient, the doctor observes his/her health information and then chooses a treatment for the patient so as to maximize the total treatment effect.

We first briefly introduce the ACTG175 dataset (Hammer et al., 1996). This dataset records the results of a randomized experiment on 2139 HIV-infected patients, together with the background information of those patients. The patients are randomly assigned one of four treatments: zidovudine (AZT) monotherapy, AZT+didanosine (ddI), AZT+zalcitabine (ddC), and ddI monotherapy. The treatment effect is measured via the number of CD8 T cells after 20 weeks of therapy. In addition, for each patient, an $18$-dimensional feature vector is provided as background information, which includes age, weight, gender, drug history, etc.

Similar to the simulation studies, we validate our algorithm by its regret. However, in practice, the potential outcomes of unassigned treatments for each patient are unavailable, which prevents us from calculating the regret. To overcome this difficulty, we assume a linear model for the effect of each treatment. In particular, we assume that the effect of treatment $i$ (the number of CD8 T cells) is $\langle X,\theta^{*}_{i}\rangle+\varepsilon$, where $X$ denotes the background information of the patient, $\varepsilon$ is a standard Gaussian noise, and $\theta_{i}^{*}$ is estimated via linear regression. Similar assumptions are commonly adopted in the statistics literature to calculate counterfactuals when real experiments are unavailable (Zhang et al., 2012a). The above sequential treatment assignment problem can easily be transformed into our formulation by considering a joint covariate space. Given a patient with background $X$, let $X_{1}=(X,0,0,0)$, $X_{2}=(0,X,0,0)$, $X_{3}=(0,0,X,0)$, $X_{4}=(0,0,0,X)$, and $\theta^{*}=(\theta^{*}_{1},\theta^{*}_{2},\theta^{*}_{3},\theta^{*}_{4})$. Since $\langle X_{i},\theta^{*}\rangle=\langle X,\theta^{*}_{i}\rangle$, assigning a treatment to a patient is equivalent to picking an action from the space $\mathcal{A}=\{X_{1},X_{2},X_{3},X_{4}\}$; a sketch of this embedding is given below. Note that with this formulation, Assumption 1 is violated. However, the numerical results introduced below still demonstrate the good performance of our algorithm.
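A minimal sketch of this block embedding (hypothetical names, not the authors' code):

import numpy as np

# Embed a p-dimensional background X into four 4p-dimensional actions so
# that <X_i, theta*> = <X, theta_i*> for the stacked parameter theta*.
def joint_actions(x, n_treatments=4):
    p = x.shape[0]
    A = np.zeros((n_treatments, n_treatments * p))
    for i in range(n_treatments):
        A[i, i * p:(i + 1) * p] = x          # X_i = (0, ..., X, ..., 0)
    return A

rng = np.random.default_rng(0)
x = rng.standard_normal(9)                   # patient background
thetas = [rng.standard_normal(9) for _ in range(4)]
A = joint_actions(x)
assert np.allclose(A @ np.concatenate(thetas), [x @ th for th in thetas])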

Figure 4: Plot of regret in the sequential treatment assignment problem based on the ACTG175 dataset.

In this experiment, for simplicity, we pick $9$ coordinates (age, weight, drugs in history, Karnofsky score, number of days of previously received antiretroviral therapy, antiretroviral history stratification, gender, CD4 T cell count at baseline, and CD8 T cell count at baseline) from the whole dataset as the covariate, which we denote by $X$. Then we use the real dataset to estimate the true parameters $\theta^{*}_{i},i=1,2,3,4$, i.e., the treatment effect of each therapy. To test our algorithm in the high-dimensional setting, we further concatenate $X$ with a $41$-dimensional noise vector drawn from the standard Gaussian distribution; the parameters corresponding to these injected noise coordinates are zero. For each patient, we use the estimated parameters to calculate the potential treatment effects and the regret. In addition, we choose the total number of patients $T=2600$. As in the simulation studies, we compare our algorithm's performance with the other variants (Lasso, IHT, and the oracle); a sketch of the preprocessing is given below. For each algorithm, we repeat the experiment $20$ times and calculate the mean regret curve. We present the results in Figure 4. Compared with the simulation studies, the performance of Lasso is still the worst, although the discrepancy is much smaller. While there is almost no difference between IHT and best subset selection, in the real data tests we observe some instances where the IHT algorithm fails to converge, which indicates a lack of robustness. Again, for our proposed algorithm, we observe an $\widetilde{\mathcal{O}}(\sqrt{T})$ growth rate of the regret, which is also quite close to the oracle; the BSS algorithm thus achieves near-optimality. Since the technical assumptions in Section 3.1 are violated in this experiment, the numerical results demonstrate the robustness of our method.
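The preprocessing can be sketched as follows; the names X_raw (the nine selected covariates), y (observed CD8 counts), and arm (assigned treatments) are hypothetical placeholders, not the actual column names in “speff2trial”.

import numpy as np

def estimate_arm_parameters(X_raw, y, arm, n_treatments=4):
    """Per-treatment least-squares fits used as the ground truth theta_i*."""
    return [np.linalg.lstsq(X_raw[arm == i], y[arm == i], rcond=None)[0]
            for i in range(n_treatments)]

def augment_with_noise(X_raw, d_noise=41, seed=0):
    """Concatenate standard Gaussian noise coordinates whose true
    coefficients are zero, yielding a high-dimensional covariate."""
    rng = np.random.default_rng(seed)
    return np.hstack([X_raw, rng.standard_normal((X_raw.shape[0], d_noise))])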

5 Conclusion

In this paper, we first propose a method for the high-dimensional stochastic linear bandit problem by combining the best subset selection method and the LinUCB algorithm. It achieves the $\widetilde{\mathcal{O}}(s\sqrt{T})$ regret upper bound, which is nearly independent of the ambient dimension $d$ (up to logarithmic factors). In order to attain the optimal regret $\widetilde{\mathcal{O}}(\sqrt{sT})$, we further improve our method by modifying the SupLinUCB algorithm. Extensive numerical experiments validate the performance and robustness of our algorithms. Moreover, although we cannot prove the $\widetilde{\mathcal{O}}(\sqrt{sT})$ regret upper bound for the SLUCB algorithm due to technical reasons, simulation studies show that the regret of the SLUCB algorithm is actually $\widetilde{\mathcal{O}}(\sqrt{sT})$ rather than our provable upper bound $\widetilde{\mathcal{O}}(s\sqrt{T})$. A similar phenomenon is also observed in the seminal works (Auer, 2002; Chu et al., 2011), where low-dimensional stochastic linear bandit problems are investigated. It remains an open problem to prove the $\widetilde{\mathcal{O}}(\sqrt{sT})$ upper bound for LinUCB-type algorithms. In addition, it is also interesting to study the high-dimensional stochastic linear bandit problem under weaker assumptions, especially when the independent-coordinates assumption does not hold. We view all of these problems as potential future research directions.


References

  • Abbasi-Yadkori et al. [2011] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • Abbasi-Yadkori et al. [2012] Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
  • Abe et al. [2003] N. Abe, A. W. Biermann, and P. M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
  • Agarwal et al. [2010] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of annual Conference on Learning Theory (COLT). Citeseer, 2010.
  • Agarwal et al. [2014] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of International Conference on Machine Learning (ICML), 2014.
  • Auer [2002] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Auer et al. [1995] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 1995.
  • Auer et al. [2002] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
  • Balasubramanian and Ghadimi [2018] K. Balasubramanian and S. Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2018.
  • Bastani and Bayati [2015] H. Bastani and M. Bayati. Online decision-making with high-dimensional covariates. 2015. Available at SSRN: https://ssrn.com/abstract=2661896.
  • Bertsimas et al. [2016] D. Bertsimas, A. King, and R. Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
  • Besbes et al. [2015] O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
  • Bickel et al. [2009] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
  • Blumensath and Davies [2009] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
  • Bubeck et al. [2017] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2017.
  • Bühlmann and Van De Geer [2011] P. Bühlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
  • Candes and Tao [2007] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
  • Carpentier and Munos [2012] A. Carpentier and R. Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
  • Chakraborty et al. [2010] B. Chakraborty, S. Murphy, and V. Strecher. Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical Methods in Medical Research, 19(3):317–343, 2010.
  • Chu et al. [2011] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
  • Dani et al. [2008] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of annual Conference on Learning Theory (COLT), 2008.
  • Donoho [2006] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
  • Filippi et al. [2010] S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2010.
  • Flaxman et al. [2005] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the annual ACM-SIAM symposium on Discrete algorithms (SODA), 2005.
  • Foster et al. [2016] D. Foster, S. Kale, and H. Karloff. Online sparse linear regression. In Proceedings of annual Conference on Learning Theory (COLT), 2016.
  • Foster et al. [2018] D. J. Foster, A. Agarwal, M. Dudík, H. Luo, and R. E. Schapire. Practical contextual bandits with regression oracles. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
  • Gerchinovitz [2013] S. Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research, 14(Mar):729–769, 2013.
  • Goldberg and Kosorok [2012] Y. Goldberg and M. R. Kosorok. Q-learning with censored data. Annals of Statistics, 40(1):529, 2012.
  • Goldenshluger and Zeevi [2013] A. Goldenshluger and A. Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
  • Hammer et al. [1996] S. M. Hammer, D. A. Katzenstein, M. D. Hughes, H. Gundacker, R. T. Schooley, R. H. Haubrich, W. K. Henry, M. M. Lederman, J. P. Phair, M. Niu, et al. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. New England Journal of Medicine, 335(15):1081–1090, 1996.
  • Keyvanshokooh et al. [2019] E. Keyvanshokooh, M. Zhalechian, C. Shi, M. P. Van Oyen, and P. Kazemian. Contextual learning with online convex optimization: Theory and application to chronic diseases. Available at SSRN, 2019.
  • Krause and Ong [2011] A. Krause and C. S. Ong. Contextual gaussian process bandit optimization. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2011.
  • Lai and Robbins [1985] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • Langford et al. [2009] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10(Mar):777–801, 2009.
  • Lattimore et al. [2015] T. Lattimore, K. Crammer, and C. Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015.
  • Li et al. [2010] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the International Conference on World Wide Web (WWW). ACM, 2010.
  • Li et al. [2011] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2011.
  • Li et al. [2017] L. Li, Y. Lu, and D. Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of International Conference on Machine Learning (ICML), 2017.
  • Miller [2002] A. Miller. Subset selection in regression. Chapman and Hall/CRC, 2002.
  • Moodie et al. [2007] E. E. Moodie, T. S. Richardson, and D. A. Stephens. Demystifying optimal dynamic treatment regimes. Biometrics, 63(2):447–455, 2007.
  • Murphy [2003] S. A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355, 2003.
  • Murphy [2005] S. A. Murphy. An experimental design for the development of adaptive treatment strategies. Statistics in medicine, 24(10):1455–1481, 2005.
  • Natarajan [1995] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
  • Nemirovsky and Yudin [1983] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. SIAM, 1983.
  • Orellana et al. [2010a] L. Orellana, A. Rotnitzky, and J. M. Robins. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. The International Journal of Biostatistics, 6(2), 2010a.
  • Orellana et al. [2010b] L. Orellana, A. Rotnitzky, and J. M. Robins. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part II: proofs of results. The International Journal of Biostatistics, 6(2), 2010b.
  • Pilanci et al. [2015] M. Pilanci, M. J. Wainwright, and L. El Ghaoui. Sparse learning via boolean relaxations. Mathematical Programming (Series B), 151(1):63–87, 2015.
  • Raskutti et al. [2010] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
  • Robins et al. [2008] J. Robins, L. Orellana, and A. Rotnitzky. Estimation and extrapolation of optimal treatment and testing strategies. Statistics in Medicine, 27(23):4678–4721, 2008.
  • Robins et al. [2000] J. M. Robins, M. A. Hernan, and B. Brumback. Marginal structural models and causal inference in epidemiology, 2000.
  • Rusmevichientong and Tsitsiklis [2010] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  • Shamir [2013] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Proceedings of annual Conference on Learning Theory (COLT), 2013.
  • Shamir [2015] O. Shamir. On the complexity of bandit linear optimization. In Proceedings of annual Conference on Learning Theory (COLT), 2015.
  • Song et al. [2015] R. Song, W. Wang, D. Zeng, and M. R. Kosorok. Penalized Q-learning for dynamic treatment regimens. Statistica Sinica, 25(3):901, 2015.
  • Szepesvari [2016] C. Szepesvari. Lower bounds for stochastic linear bandits. http://banditalgs.com/2016/10/20/lower-bounds-for-stochastic-linear-bandits/, 2016. Accessed: Oct 15, 2018.
  • Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • Tropp [2012] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
  • Wainwright [2009] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using 1\ell_{1}-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
  • Wang et al. [2017] Y. Wang, S. Du, S. Balakrishnan, and A. Singh. Stochastic zeroth-order optimization in high dimensions. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
  • Watkins and Dayan [1992] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
  • Zhang et al. [2012a] B. Zhang, A. A. Tsiatis, M. Davidian, M. Zhang, and E. Laber. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114, 2012a.
  • Zhang et al. [2012b] B. Zhang, A. A. Tsiatis, E. B. Laber, and M. Davidian. A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010–1018, 2012b.
  • Zhao et al. [2012] Y. Zhao, D. Zeng, A. J. Rush, and M. R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.
  • Zhao et al. [2014] Y.-Q. Zhao, D. Zeng, E. B. Laber, R. Song, M. Yuan, and M. R. Kosorok. Doubly robust learning for estimating individualized treatment with censored data. Biometrika, 102(1):151–168, 2014.

Appendix

Appendix A Proof of Theorem 1

Before proceeding to the proof, we present a proposition that uniformly upper bounds the magnitudes of the covariates $\{X_{t,i}\}$ with high probability.

Proposition 1.

Consider all the sub-Gaussian covariates $\{X_{t,i}\}_{t\in[T],i\in\mathcal{A}_{t}}$ in the problem. Then, with probability at least $1-\delta$,

maxt[T],i𝒜t,j[d]|[Xt,i]j|σlog(kTd/δ).\displaystyle\max_{t\in[T],i\in\mathcal{A}_{t},j\in[d]}\big{|}[X_{t,i}]_{j}\big{|}\lesssim\sigma\log(kTd/\delta).
Proof.

By Assumption 1, for all $t\in[T]$, $i\in\mathcal{A}_{t}$, and $j\in[d]$, the entries $\{[X_{t,i}]_{j}\}$ are independent and identically distributed centered sub-Gaussian random variables with parameter $\sigma^{2}$. The result then follows from the standard sub-Gaussian tail bound and a union bound argument. ∎
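For concreteness, a sketch of this calculation (using the stated sub-Gaussian tail bound and a union bound over the at most $kTd$ coordinates) is:

\mathbb{P}\Big(\max_{t\in[T],i\in\mathcal{A}_{t},j\in[d]}\big{|}[X_{t,i}]_{j}\big{|}>u\Big)\leq kTd\cdot 2\exp\Big(-\frac{u^{2}}{2\sigma^{2}}\Big)\leq\delta\quad\text{for}\quad u=\sigma\sqrt{2\log(2kTd/\delta)},

and $\sigma\sqrt{2\log(2kTd/\delta)}\lesssim\sigma\log(kTd/\delta)$ once $kTd/\delta\geq 2$.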

Next, we present the proof of Theorem 1. For ease of presentation, we defer the detailed proofs of the lemmas to the Supplementary Materials. Our proof benefits from the results of the celebrated work on linear contextual bandits (Abbasi-Yadkori et al., 2011), in which the authors establish a self-normalized martingale concentration inequality and a confidence region for the ridge estimator under dependent designs. For completeness, we restate these results in Section 6 of the Supplementary Materials.

Proof of Theorem 1.

The proof includes four major steps. First, we quantify the performance of the BSS estimator obtained at the end of each epoch. In particular, we bound the $\ell_{2}$-norm of the true parameter $\theta^{*}$ outside $S_{\tau}$, the generalized support recovered by epoch $E_{\tau}$. This quantity measures the discrepancy between $S_{\tau}$ and the true support of $\theta^{*}$ and hence the quality of the BSS estimator. Then, within epoch $E_{\tau}$, for each period $t$, we analyze the error of the restricted ridge estimator $\widehat{\theta}^{t-1}_{\tau,\lambda}$. Based on that, we upper bound the prediction errors of the potential rewards and construct the corresponding upper confidence bounds. Next, similar to the classic analysis of UCB-type algorithms for linear bandit problems, we establish an elliptical potential lemma that upper bounds the sum of the confidence bands. Finally, we combine these results to show that the regret of the SLUCB algorithm is of order $\widetilde{\mathcal{O}}(s\sqrt{T})$. In what follows, we present the details of each step.

Step 1. Upper bound the error of the BSS estimator: In this step, we analyze the error of the BSS estimator obtained at the end of each epoch. Recall that $S_{\tau}=\mathrm{supp}^{+}(\widehat{\theta}_{\tau,\lambda})$, which is the generalized support of the BSS estimator $\widehat{\theta}_{\tau,\lambda}$. We have the following lemma.

Lemma 1.

If

n0ρ1(νστ~)2s3log4(kTd/(δλ)),n_{0}\gtrsim\rho^{-1}\cdot(\nu\sigma\widetilde{\tau})^{2}\cdot s^{3}\cdot\log^{4}\big{(}kTd/(\delta\lambda)\big{)},

we have that, with probability at least 1δ1-\delta, for each epoch τ[τ~]\tau\in[\widetilde{\tau}],

[θ]Sτcν2slog2(kTd/(δλ))+λr2|Eτ|.\big{\|}[\theta^{*}]_{S_{\tau}^{c}}\big{\|}\lesssim\sqrt{\frac{\nu^{2}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}}{|E_{\tau}|}}. (5)

Note that when $\tau$ is large, $|E_{\tau}|\asymp T$. Hence, Lemma 1 states that the $\ell_{2}$-norm of the unselected part of the true parameter $\theta^{*}$ decays at a rate of $\widetilde{\mathcal{O}}(\sqrt{s/T})$. Recall that in the next epoch $\tau+1$, we restrict the estimation to $S_{\tau}$. Failing to identify the exact support of $\theta^{*}$ incurs a regret that is roughly

[θ]Sτc2T=𝒪~(s/T)T=𝒪~(sT).\big{\|}[\theta^{*}]_{S_{\tau}^{c}}\big{\|}_{2}\cdot T=\widetilde{\mathcal{O}}(\sqrt{s/T})\cdot T=\widetilde{\mathcal{O}}(\sqrt{sT}).

We interpret this part as the “bias” in the total regret. Formally, by Lemma 5 and Proposition 1, for each covariate $X_{t,i}$, we upper bound the inner product of $[\theta^{*}]_{S_{\tau}^{c}}$ and $[X_{t,i}]_{S_{\tau}^{c}}$ via the following corollary.

Corollary 2.

Under the condition of Lemma 5, with probability at least 1δ1-\delta, for all τ[τ~]\tau\in[\widetilde{\tau}], tEτt\in E_{\tau}, and i𝒜ti\in\mathcal{A}_{t}, we have

|[Xt,i]Sτc,[θ]Sτc|σlog(kTd/δ)ν2slog2(kTd/(δλ))+λr2|Eτ|.\Big{|}\big{\langle}[X_{t,i}]_{S_{\tau}^{c}},[\theta^{*}]_{S_{\tau}^{c}}\big{\rangle}\Big{|}\lesssim\sigma\sqrt{\log(kTd/\delta)}\cdot\sqrt{\frac{\nu^{2}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}}{|E_{\tau}|}}. (6)

Essentially, Corollary 2 implies that the regret incurred by the inexact recovery of the support of the true parameter in a single period is of order $\widetilde{\mathcal{O}}(\sqrt{s/T})$. This result is critical for our later regret analysis.

Step 2. Upper bound the prediction error: In this step, for each time period $t\in E_{\tau}$ and arm $i\in\mathcal{A}_{t}$, we establish an upper bound for the prediction error $\langle X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}\rangle$. In particular, we have the following lemma.

Lemma 2.

If

n0ρ1(νστ~)2s3log4(kTd/(δλ)),n_{0}\gtrsim\rho^{-1}\cdot(\nu\sigma\widetilde{\tau})^{2}\cdot s^{3}\cdot\log^{4}\big{(}kTd/(\delta\lambda)\big{)},

with probability at least 1δ1-\delta, for all τ[τ~]\tau\in[\widetilde{\tau}], tEτt\in E_{\tau}, and i𝒜ti\in\mathcal{A}_{t}, we have

|Xt,i,θ^τ,λt1θ|ν(slog(kTd/(δλ))+λ1/2r)(σlog(kTd/δ)|Eτ1|+[Xt,i]Sτ1[𝚪^τ1,λt1]1),\Big{|}\big{\langle}X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}\big{\rangle}\Big{|}\lesssim\nu\Big{(}\sqrt{s}\log\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda^{1/2}r\Big{)}\Big{(}\sigma\sqrt{\frac{\log(kTd/\delta)}{|E_{\tau-1}|}}+\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{)},

where

𝚪^τ1,λt1=λI|Sτ1|+tEτt1[Xt]Sτ1[Xt]Sτ1,\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}=\lambda\textbf{I}_{|S_{\tau-1}|}+\sum_{t^{\prime}\in E_{\tau}^{t-1}}[X_{t^{\prime}}]_{S_{\tau-1}}[X_{t^{\prime}}]_{S_{\tau-1}}^{\top},

and Eτt1={t:tt1,tEτ}E_{\tau}^{t-1}=\{t^{\prime}:t^{\prime}\leq t-1,t^{\prime}\in E_{\tau}\}.

The upper bound in Lemma 2 consists of two parts. The first term, which is of order $\widetilde{\mathcal{O}}(\sqrt{s/T})$, corresponds to the bias of the estimator $\widehat{\theta}^{t-1}_{\tau,\lambda}$; in particular, it is caused by the inexact recovery of $\mathrm{supp}(\theta^{*})$. In contrast, the second term, which is determined by the variability of $[X_{t,i}]_{S_{\tau-1}}$ and the sample covariance matrix $\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}$, corresponds to the variance.

Step 3. Elliptical potential lemma: In this step, we establish the elliptical potential lemma, a powerful tool in bandit problems (Rusmevichientong and Tsitsiklis, 2010; Auer, 2002; Filippi et al., 2010; Li et al., 2017), which upper bounds the sum of the prediction errors over the periods. Specifically, we upper bound the sum of the right-hand sides of the inequality in Lemma 2 over all periods. Recall that in each epoch, the arms are picked uniformly at random in the first $n_{0}$ periods.

Lemma 3.

Let two constants p,qp,q satisfy 1pq1\leq p\leq q. Then with probability at least 1δ1-\delta, we have

tEτmin{p,q[Xt]Sτ1[𝚪^τ1,λt1]1}n0p+q|Eτ||Sτ1|(λ+log(σdTk/δ)).\sum_{t\in E_{\tau}}\min\Big{\{}p,q\cdot\big{\|}[X_{t}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{\}}\lesssim n_{0}p+q\cdot\sqrt{|E_{\tau}|\cdot|S_{\tau-1}|\cdot\big{(}\lambda+\log(\sigma dTk/\delta)\big{)}}.

The upper bound in Lemma 3 consists of two terms. The first one is linear in $n_{0}$, the length of the pure-exploration stage. The second term, which is of order $\widetilde{\mathcal{O}}(\sqrt{sT})$, reflects the trade-off between exploration and exploitation. In the later proof, we show that this term leads to the order-$\widetilde{\mathcal{O}}(s\sqrt{T})$ regret under a tailored choice of $p,q$.
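The proof of the second term rests on the standard determinant telescoping argument (cf. Abbasi-Yadkori et al., 2011). As a sketch, for a generic sequence of restricted covariates $x_{t}\in\mathbb{R}^{|S|}$ with $\mathbf{\Gamma}_{t}=\lambda\textbf{I}_{|S|}+\sum_{t^{\prime}\leq t}x_{t^{\prime}}x_{t^{\prime}}^{\top}$, one has

\sum_{t\leq T}\log\Big(1+\|x_{t}\|^{2}_{\mathbf{\Gamma}_{t-1}^{-1}}\Big)=\log\frac{\det(\mathbf{\Gamma}_{T})}{\det(\lambda\textbf{I}_{|S|})}\leq|S|\cdot\log\Big(1+\frac{\sum_{t\leq T}\|x_{t}\|^{2}}{\lambda|S|}\Big),

where the equality follows from $\det(\mathbf{\Gamma}_{t})=\det(\mathbf{\Gamma}_{t-1})\cdot(1+\|x_{t}\|^{2}_{\mathbf{\Gamma}_{t-1}^{-1}})$ and the inequality from the AM-GM inequality applied to the eigenvalues of $\mathbf{\Gamma}_{T}$; a Cauchy–Schwarz step then converts the left-hand side into the $\sqrt{|E_{\tau}|\cdot|S_{\tau-1}|\cdot(\cdots)}$ form in Lemma 3.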

Step 4. Putting everything together: Based on Lemmas 1, 2, and 3, we are ready to prove Theorem 1. For each $t\in[T]$ and $i\in\mathcal{A}_{t}$, let

r(Xt,i)=Xt,i,θr(X_{t,i})=\langle X_{t,i},\theta^{*}\rangle

be the true expected reward of arm $i$ at period $t$. By Assumption 3 and Proposition 1, the inequality

r(Xt,i)rσlog(kTd/δ)r(X_{t,i})\leq r\sigma\sqrt{\log(kTd/\delta)}

holds for all $t$ and $i$ with probability at least $1-\delta$. Then under the following choice of $n_{0}$,

n0\displaystyle n_{0} ρ1(νστ~)2s3log4(kTd/(δλ)),\displaystyle\asymp\rho^{-1}\cdot(\nu\sigma\widetilde{\tau})^{2}\cdot s^{3}\cdot\log^{4}\big{(}kTd/(\delta\lambda)\big{)},

Lemma 2 guarantees that

r¯(Xt,i)\displaystyle\overline{r}(X_{t,i}) :=min{β,Xt,i,θ^τ,λt1+α(σlog(kTd/δ)/|Eτ1|+[Xt,i]Sτ1[𝚪^τ1,λt1]1)}\displaystyle:=\min\Big{\{}\beta,\big{\langle}X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1}\big{\rangle}+\alpha\cdot\Big{(}\sigma\sqrt{{\log(kTd/\delta)}\big{/}{|E_{\tau-1}|}}+\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{)}\Big{\}} (7)

is an upper bound for $r(X_{t,i})$ for each $t\in[T]$ and $i\in\mathcal{A}_{t}$ (up to an absolute constant), where

β\displaystyle\beta =rσslog(kTd/δ),\displaystyle=r\sigma\sqrt{s\log(kTd/\delta)},
α\displaystyle\alpha =ν(slog(kTd/(δλ))+λ1/2r).\displaystyle=\nu\cdot\big{(}\sqrt{s}\log\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda^{1/2}r\big{)}.

Next, recall that at period $t$, our algorithm picks arm $i_{t}$. The optimal action is

it=argmaxi𝒜tXt,i,θ.i_{t}^{*}=\arg\max_{i\in\mathcal{A}_{t}}\langle X_{t,i},\theta^{*}\rangle.

Let $\mathcal{T}_{e}$ denote the set of periods in the pure-exploration stage, in which the arms are picked uniformly at random. Correspondingly, we use $\mathcal{T}_{g}$ to denote the set of periods in the UCB stage, in which the UCB-type policy is employed. Then we have

|𝒯e|n0log(T).|\mathcal{T}_{e}|\lesssim n_{0}\log(T).

Since, with probability at least $1-\delta$,

r(Xt,i)max{r¯(Xt,i),rσlog(kTd/δ)}r(X_{t,i})\leq\max\big{\{}\overline{r}(X_{t,i}),r\sigma\sqrt{\log(kTd/\delta)}\big{\}}

holds for all $t$ and $i$, by the definition of regret, we have

RT({it},θ)\displaystyle R_{T}\big{(}\{i_{t}\},\theta^{*}\big{)} =t𝒯er(Xt,it)r(Xt,it)+t𝒯gr(Xt,it)r(Xt,it)\displaystyle=\sum_{t\in\mathcal{T}_{e}}r(X_{t,i_{t}^{*}})-r(X_{t,i_{t}})+\sum_{t\in\mathcal{T}_{g}}r(X_{t,i_{t}^{*}})-r(X_{t,i_{t}})
n0log(T)2rσlog(kTd/δ)+t𝒯gr¯(Xt,it)r(Xt,it).\displaystyle\lesssim n_{0}\log(T)\cdot 2r\sigma\sqrt{\log(kTd/\delta)}+\sum_{t\in\mathcal{T}_{g}}\overline{r}(X_{t,i_{t}^{*}})-r(X_{t,i_{t}}). (8)

For the second term on the right-hand side of inequality (8), we have

t𝒯gr¯(Xt,it)r(Xt,it)\displaystyle\sum_{t\in\mathcal{T}_{g}}\overline{r}(X_{t,i_{t}^{*}})-r(X_{t,i_{t}}) =t𝒯g(r¯(Xt,it)r¯(Xt,it))+(r¯(Xt,it)r(Xt,it))\displaystyle=\sum_{t\in\mathcal{T}_{g}}\big{(}\overline{r}(X_{t,i_{t}^{*}})-\overline{r}(X_{t,i_{t}})\big{)}+\big{(}\overline{r}(X_{t,i_{t}})-r(X_{t,i_{t}})\big{)}
t𝒯gr¯(Xt,it)r(Xt,it)\displaystyle\leq\sum_{t\in\mathcal{T}_{g}}\overline{r}(X_{t,i_{t}})-r(X_{t,i_{t}})
τ=1τ~tEτmin{β,α(σlog(kTd/δ)/|Eτ1|+[Xt]Sτ1[𝚪^τ1,λt1]1)}.\displaystyle\lesssim\sum_{\tau=1}^{\widetilde{\tau}}\sum_{t\in E_{\tau}}\min\Big{\{}\beta,\alpha\cdot\Big{(}\sigma\sqrt{{\log(kTd/\delta)}\big{/}{|E_{\tau-1}|}}+\big{\|}[X_{t}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{)}\Big{\}}.

Here the first inequality holds because, in each period $t\in\mathcal{T}_{g}$, the picked arm $i_{t}$ has the largest upper confidence bound. The second inequality follows from the definition of $\overline{r}(X_{t,i})$ in (7). By Lemma 3, we have

τ=1τ~tEτmin{β,α(σlog(kTd/δ)/|Eτ1|+[Xt]Sτ1[𝚪^τ1,λt1]1)}\displaystyle\sum_{\tau=1}^{\widetilde{\tau}}\sum_{t\in E_{\tau}}\min\Big{\{}\beta,\alpha\cdot\Big{(}\sigma\sqrt{{\log(kTd/\delta)}\big{/}{|E_{\tau-1}|}}+\big{\|}[X_{t}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{)}\Big{\}}
τ=1τ~(n0β+α|Eτ||Sτ1|(λ+log(σdTk/δ)))+τ=1τ~ασlog(kTd/δ)|Eτ1|\displaystyle\lesssim\sum_{\tau=1}^{\widetilde{\tau}}\Big{(}n_{0}\beta+\alpha\sqrt{|E_{\tau}|\cdot|S_{\tau-1}|\cdot\big{(}\lambda+\log(\sigma dTk/\delta)\big{)}}\Big{)}+\sum_{\tau=1}^{\widetilde{\tau}}\alpha\sigma\sqrt{{\log(kTd/\delta)}\cdot{|E_{\tau-1}|}}
(n0β+αsTlog(T)(λ+log(σdTk/δ))+ασTlog(kTd/δ))log(T).\displaystyle\lesssim\Big{(}n_{0}\beta+\alpha\sqrt{sT\log(T)\cdot\big{(}\lambda+\log(\sigma dTk/\delta)\big{)}}+\alpha\sigma\sqrt{T\log(kTd/\delta)}\Big{)}\cdot\log(T). (9)

Finally, by combining inequalities (8) and (9), we obtain that, with probability at least $1-\delta$,

RT({it},θ)(n0β+αsTlog(T)(λ+log(σdTk/δ))+ασTlog(kTd/δ))log(T),\displaystyle R_{T}\big{(}\{i_{t}\},\theta^{*}\big{)}\lesssim\Big{(}n_{0}\beta+\alpha\sqrt{sT\log(T)\cdot\big{(}\lambda+\log(\sigma dTk/\delta)\big{)}}+\alpha\sigma\sqrt{T\log(kTd/\delta)}\Big{)}\cdot\log(T),

which concludes the proof of Theorem 1. ∎

Appendix B Proof of Theorem 2

Proof of Theorem 2.

The proof follows a routine similar to the proof of Theorem 1. The main difference lies in the upper bound on the prediction error (Step 2 in the proof of Theorem 1). Recall that in the SLUCB algorithm, we construct the ridge estimator using all historical data. The complicated correlation between the design matrices and the random noises prevents us from applying the standard concentration inequality for independent random variables to obtain an order-$\widetilde{\mathcal{O}}(\sqrt{s})$ upper bound on the prediction error. Instead, we use the self-normalized martingale concentration inequality, which only gives an order-$\widetilde{\mathcal{O}}(s)$ upper bound and hence a suboptimal regret. In comparison, in the SSUCB algorithm, the Sparse-SupLinUCB subroutine overcomes this obstacle by splitting the historical data into disjoint groups such that, within each group, the designs and the random noises are independent; a schematic sketch of this routing is given after Lemma 4 below. Due to this independence, we establish the desired order-$\widetilde{\mathcal{O}}(\sqrt{s})$ upper bound on the prediction errors. In particular, we have the following key lemma.

Lemma 4.

When

n0ρ1(νστ~)2s3log4(kTd/(δλ)),n_{0}\gtrsim\rho^{-1}\cdot(\nu\sigma\widetilde{\tau})^{2}\cdot s^{3}\cdot\log^{4}\big{(}kTd/(\delta\lambda)\big{)},

with probability at least 1δ1-\delta, we have that for every tuple (τ,t,i,ζ)(\tau,t,i,\zeta), it holds

|Xt,i,θ^τ,λt1,ζθ|\displaystyle\Big{|}\big{\langle}X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}-\theta^{*}\big{\rangle}\Big{|} σ(λ1/2+ν+σ)log3/2(kTd/δ)(s/|Eτ1|+[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1),\displaystyle\lesssim\sigma(\lambda^{1/2}+\nu+\sigma)\log^{3/2}(kTd/\delta)\cdot\Big{(}\sqrt{s/|E_{\tau-1}|}+\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}\Big{)},

where

𝚪^τ1,λt1,ζ=λI|Sτ1|+tΨτt1,ζ[Xt]Sτ1[Xt]Sτ1.\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}=\lambda\textbf{I}_{|S_{\tau-1}|}+\sum_{t^{\prime}\in\Psi_{\tau}^{t-1,\zeta}}[X_{t^{\prime}}]_{S_{\tau-1}}[X_{t^{\prime}}]_{S_{\tau-1}}^{\top}.

The upper bound in Lemma 4 includes two parts, corresponding to the bias and the variance, respectively. The bias part is still of order $\widetilde{\mathcal{O}}(\sqrt{s})$. In comparison with Lemma 2, this lemma improves the variance part by an $\widetilde{\mathcal{O}}(\sqrt{s})$ factor. Note that by Lemma 3, we have

[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1=𝒪~(s).\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}=\widetilde{\mathcal{O}}(\sqrt{s}).

Such an improvement leads to an order-$\widetilde{\mathcal{O}}(\sqrt{s})$ improvement in the accumulated regret, which achieves the optimal rate. The detailed proof of Lemma 4 is provided in the supplementary document.
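To make the routing concrete, the following is a schematic sketch in Python of one period of the standard SupLinUCB stage loop (after Chu et al., 2011); all names are hypothetical, and the paper's Sparse-SupLinUCB subroutine additionally restricts every quantity to the recovered support $S_{\tau-1}$. Here width(zeta, i) and estimate(zeta, i) stand for the group-$\zeta$ confidence width $\omega_{\tau,\lambda}^{t-1,\zeta}(i)$ and point estimate, computed from group $\zeta$'s data only.

import numpy as np

def suplinucb_period(k, width, estimate, T, beta, n_groups):
    """Route one period: return (chosen arm, group index or None)."""
    candidates = list(range(k))
    for zeta in range(1, n_groups + 1):
        w = {i: width(zeta, i) for i in candidates}
        ucb = {i: estimate(zeta, i) + w[i] for i in candidates}
        if max(w.values()) <= 1.0 / np.sqrt(T):
            # Case 1: all widths are small; exploit the best UCB arm and
            # record the period in no group.
            return max(ucb, key=ucb.get), None
        wide = [i for i in candidates if w[i] > 2.0 ** (-zeta) * beta]
        if wide:
            # Case 2: some width is large; explore that arm and record the
            # period in group zeta, refining stage-zeta estimates.
            return wide[0], zeta
        # Case 3: filter to near-optimal arms and refine at stage zeta + 1.
        best = max(ucb.values())
        candidates = [i for i in candidates
                      if ucb[i] >= best - 2.0 ** (1 - zeta) * beta]
    return candidates[0], n_groups

The key point is that a period is recorded in group $\zeta$ only when its arm is chosen based on the group-$\zeta$ widths, which depend on the designs but not on the group-$\zeta$ rewards; this is what keeps the designs and the noises within each group independent, as exploited by Lemma 4.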

As in the last step of the proof of Theorem 1, by Lemma 4, we obtain that, with probability at least $1-\delta$, for all tuples $(\tau,t,i,\zeta)$, it holds that

r(Xt,i)\displaystyle{r}(X_{t,i}) min{β,Xt,i,θ^τ,λt1,ζ+ωτ,λt1,ζ(i)},\displaystyle\lesssim\min\Big{\{}\beta,\big{\langle}X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}\big{\rangle}+\omega^{t-1,\zeta}_{\tau,\lambda}(i)\Big{\}}, (10)

where r(Xt,i)=Xt,i,θ{r}(X_{t,i})=\langle X_{t,i},\theta^{*}\rangle and

ωτ,λt1,ζ(i)\displaystyle\omega^{t-1,\zeta}_{\tau,\lambda}(i) =γ(s/|Eτ1|+[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1),\displaystyle=\gamma\cdot\Big{(}\sqrt{s/|E_{\tau-1}|}+\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}\Big{)}, (11)
γ\displaystyle\gamma =σ(λ1/2+ν+σ)log3/2(kTd/δ).\displaystyle=\sigma(\lambda^{1/2}+\nu+\sigma)\log^{3/2}(kTd/\delta). (12)

Without loss of generality, we assume that inequality (10) holds for all tuples $(\tau,t,i,\zeta)$, which is guaranteed with probability at least $1-\delta$. We also assume that in period $t\in E_{\tau}$, the SupLinUCB algorithm terminates at the $\zeta_{t}$-th group and selects arm $i_{t}$. Let $i^{*}_{t}=\mathrm{argmax}_{i\in\mathcal{A}_{t}}\langle X_{t,i},\theta^{*}\rangle$ be the optimal arm for period $t$. If the first scenario in the SupLinUCB algorithm happens, i.e.,

ωτ,λt1,ζt(i)1/T,i𝒩τt1,ζt,\omega_{\tau,\lambda}^{t-1,\zeta_{t}}(i)\leq 1/\sqrt{T},\ \forall i\in{\mathcal{N}}^{t-1,\zeta_{t}}_{\tau},

we have

Xt,itXt,it,θXt,itXt,it,θ^τ,λt1,ζt+ωτ,λt1,ζt(it)+ωτ,λt1,ζt(it)2/T.\big{\langle}X_{t,i^{*}_{t}}-X_{t,i_{t}},\theta^{*}\big{\rangle}\lesssim\big{\langle}X_{t,i^{*}_{t}}-X_{t,i_{t}},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta_{t}}\big{\rangle}+\omega^{t-1,\zeta_{t}}_{\tau,\lambda}(i^{*}_{t})+\omega^{t-1,\zeta_{t}}_{\tau,\lambda}(i_{t})\lesssim 2/\sqrt{T}. (13)

Otherwise, we have

ωτ,λt1,ζt(it)2ζtβ.\displaystyle\omega^{t-1,\zeta_{t}}_{\tau,\lambda}(i_{t})\geq 2^{-\zeta_{t}}\beta. (14)

In this case, since the SupLinUCB algorithm does not stop before $\zeta_{t}$, for all $\zeta<\zeta_{t}$ we have

ωτ,λt1,ζ(i)2ζβ,i𝒩τt1,ζ.\omega_{\tau,\lambda}^{t-1,\zeta}(i)\leq 2^{-\zeta}\beta,\ \forall i\in\mathcal{N}^{t-1,\zeta}_{\tau}.

Consequently, for all i𝒩τt1,ζi\in\mathcal{N}^{t-1,\zeta}_{\tau}, we have

Xt,it,θ^τ,λt1,ζ+ωτ,λt1,ζ(it)Xt,it,θXt,i,θXt,i,θ^τ,λt1,ζωτ,λt1,ζ(i).\big{\langle}X_{t,i_{t}^{*}},\widehat{\theta}^{t-1,\zeta}_{\tau,\lambda}\big{\rangle}+\omega^{t-1,\zeta}_{\tau,\lambda}(i_{t}^{*})\gtrsim\langle X_{t,i_{t}^{*}},\theta^{*}\rangle\geq\langle X_{t,i},\theta^{*}\rangle\gtrsim\big{\langle}X_{t,i},\widehat{\theta}^{t-1,\zeta}_{\tau,\lambda}\big{\rangle}-\omega^{t-1,\zeta}_{\tau,\lambda}(i).

Moreover, by setting ζ=ζt1\zeta=\zeta_{t}-1, we have

Xt,itXt,it,θXt,itXt,it,θ^τ,λt1,ζt1+ωτ,λt1,ζt1(it)+ωτ,λt1,ζt1(it)2ζtβ.\displaystyle\big{\langle}X_{t,i^{*}_{t}}-X_{t,i_{t}},\theta^{*}\big{\rangle}\lesssim\big{\langle}X_{t,i^{*}_{t}}-X_{t,i_{t}},\widehat{\theta}^{t-1,\zeta_{t}-1}_{\tau,\lambda}\big{\rangle}+\omega^{t-1,\zeta_{t}-1}_{\tau,\lambda}(i^{*}_{t})+\omega^{t-1,\zeta_{t}-1}_{\tau,\lambda}(i_{t})\lesssim 2^{-\zeta_{t}}\beta. (15)

By combining inequalities (13)-(15), we obtain

r(Xt,it)r(Xt,it)\displaystyle r(X_{t,i_{t}^{*}})-r(X_{t,i_{t}}) =Xt,itXt,it,θ\displaystyle=\big{\langle}X_{t,i^{*}_{t}}-X_{t,i_{t}},\theta^{*}\big{\rangle}
max{1/T,2ζtβ}max{1/T,ωτ,λt1,ζt(it)}.\displaystyle\lesssim\max\big{\{}1/\sqrt{T},2^{-\zeta_{t}}\beta\big{\}}\leq\max\big{\{}1/\sqrt{T},\omega^{t-1,\zeta_{t}}_{\tau,\lambda}(i_{t})\big{\}}. (16)

For the accumulated regret RT({it},θ)R_{T}(\{i_{t}\},\theta^{*}), similar to inequality (A) in the proof of Theorem 1, we have that, with probability at least 1δ1-\delta,

RT({it},θ)n0log(T)2rσlog(kTd/δ)+t𝒯gr(Xt,it)r(Xt,it),\displaystyle R_{T}\big{(}\{i_{t}\},\theta^{*}\big{)}\lesssim n_{0}\log(T)\cdot 2r\sigma\sqrt{\log(kTd/\delta)}+\sum_{t\in\mathcal{T}_{g}}r(X_{t,i_{t}^{*}})-r(X_{t,i_{t}}),

where $\mathcal{T}_{g}$ denotes the set of periods in the UCB stage. Then by inequality (16) and Lemma 4, with probability at least $1-\delta$, we have

RT({it},θ)\displaystyle R_{T}\big{(}\{i_{t}\},\theta^{*}\big{)} n0log(T)2rσlog(kTd/δ)+t𝒯gmax{1/T,ωτ,λt1,ζt(it)}\displaystyle\lesssim n_{0}\log(T)\cdot 2r\sigma\sqrt{\log(kTd/\delta)}+\sum_{t\in\mathcal{T}_{g}}\max\big{\{}1/\sqrt{T},\omega^{t-1,\zeta_{t}}_{\tau,\lambda}(i_{t})\big{\}}
n0log(T)2rσlog(kTd/δ)+T+t𝒯gωτ,λt1,ζt(it)\displaystyle\lesssim n_{0}\log(T)\cdot 2r\sigma\sqrt{\log(kTd/\delta)}+\sqrt{T}+\sum_{t\in\mathcal{T}_{g}}\omega_{\tau,\lambda}^{t-1,\zeta_{t}}(i_{t})
T+τ=1τ~tEτmin{β,γ(s/|Eτ1|+[Xt,i]Sτ1[𝚪^τ1,λt1,ζt]1)}\displaystyle\lesssim\sqrt{T}+\sum_{\tau=1}^{\widetilde{\tau}}\sum_{t\in E_{\tau}}\min\Big{\{}\beta,\gamma\cdot\Big{(}\sqrt{s/|E_{\tau-1}|}+\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta_{t}}]^{-1}}\Big{)}\Big{\}}
=T+ζ=1ζ~τ=1τ~tΨτζmin{β,γ(s/|Eτ1|+[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1)},\displaystyle=\sqrt{T}+\sum_{\zeta=1}^{\widetilde{\zeta}}\sum_{\tau=1}^{\widetilde{\tau}}\sum_{t\in\Psi^{\zeta}_{\tau}}\min\Big{\{}\beta,\gamma\cdot\Big{(}\sqrt{s/|E_{\tau-1}|}+\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}\Big{)}\Big{\}}, (17)

where ωτ,λt,ζ(i)\omega_{\tau,\lambda}^{t,\zeta}(i) is defined in equation (11), and Ψτζ\Psi^{\zeta}_{\tau} denotes the ζ\zeta-th group of periods at the end of epoch EτE_{\tau}. For each fixed ζ\zeta, we have that, with probability at least 1δ1-\delta,

τ=1τ~tΨτζmin{β,γ(s/|Eτ1|+[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1)}\displaystyle\sum_{\tau=1}^{\widetilde{\tau}}\sum_{t\in\Psi^{\zeta}_{\tau}}\min\Big{\{}\beta,\gamma\cdot\Big{(}\sqrt{s/|E_{\tau-1}|}+\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}\Big{)}\Big{\}}
τ=1τ~(n0β+γ|Eτ||Sτ1|(λ+log(σdTk/δ)))+τ=1τ~γs|Eτ1|\displaystyle\lesssim\sum_{\tau=1}^{\widetilde{\tau}}\Big{(}n_{0}\beta+\gamma\sqrt{|E_{\tau}|\cdot|S_{\tau-1}|\cdot\big{(}\lambda+\log(\sigma dTk/\delta)\big{)}}\Big{)}+\sum_{\tau=1}^{\widetilde{\tau}}\gamma\sqrt{s\cdot{|E_{\tau-1}|}}
(n0β+γsTlog(T)(λ+log(σdTk/δ)))log(T).\displaystyle\lesssim\Big{(}n_{0}\beta+\gamma\sqrt{sT\cdot\log(T)\big{(}\lambda+\log(\sigma dTk/\delta)\big{)}}\Big{)}\cdot\log(T). (18)

Finally, combining inequalities (17) and (18), we obtain that, with probability at least $1-\delta$,

RT({it},θ)\displaystyle R_{T}\big{(}\{i_{t}\},\theta^{*}\big{)} T+(n0β+γsTlog(T)(λ+log(σdTk/δ)))log(T)log(βT),\displaystyle\lesssim\sqrt{T}+\Big{(}n_{0}\beta+\gamma\sqrt{sT\cdot\log(T)\big{(}\lambda+\log(\sigma dTk/\delta)\big{)}}\Big{)}\cdot\log(T)\cdot\log(\beta T),

which concludes the proof. ∎

Supplementary Materials to
“Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection”

3 Proof of Auxiliary Lemmas in Theorem 1

3.1 Proof of Lemma 1

Lemma (Restatement of Lemma 1).

If

n0ρ1(νστ~)2s3log4(kTd/(δλ)),n_{0}\gtrsim\rho^{-1}\cdot(\nu\sigma\widetilde{\tau})^{2}\cdot s^{3}\cdot\log^{4}\big{(}kTd/(\delta\lambda)\big{)},

we have that, with probability at least 1δ1-\delta, for each epoch τ[τ~]\tau\in[\widetilde{\tau}],

[θ]Sτcν2slog2(kTd/(δλ))+λr2|Eτ|.\big{\|}[\theta^{*}]_{S_{\tau}^{c}}\big{\|}\lesssim\sqrt{\frac{\nu^{2}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}}{|E_{\tau}|}}.
Proof.

The proof of Lemma 1 relies on the following lemma, which establishes an upper bound for the estimation error $\widehat{\theta}_{\tau,\lambda}-\theta^{*}$ in the $\ell_{2}$-norm weighted by the sample covariance matrix

𝚪^τ,λ=λId+tEτXtXt.\widehat{\mathbf{\Gamma}}_{\tau,\lambda}=\lambda\textbf{I}_{d}+\sum_{t\in E_{\tau}}X_{t}X_{t}^{\top}.
Lemma 5.

With probability at least 1δ1-\delta, we have

θ^τ,λθ𝚪^τ,λ2ν2slog2(kTd/(δλ))+λr2.\big{\|}\widehat{\theta}_{\tau,\lambda}-\theta^{*}\big{\|}^{2}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}\lesssim\nu^{2}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}.

The proof of Lemma 5 is provided in Section 5.1. Now we continue the proof of Lemma 1. To simplify notation, from now on, we fix the epoch τ\tau and let

S=Sτ=supp+(θ^τ,λ).S=S_{\tau}=\mathrm{supp}^{+}(\widehat{\theta}_{\tau,\lambda}).

Then by definition, [θ^τ,λθ]Sc=[θ]Sc[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}=-[\theta^{*}]_{S^{c}}. We can decompose the error θ^τ,λθ𝚪^τ,λ2\|\widehat{\theta}_{\tau,\lambda}-\theta^{*}\|^{2}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}} as

θ^τ,λθ𝚪^τ,λ2\displaystyle\big{\|}\widehat{\theta}_{\tau,\lambda}-\theta^{*}\big{\|}^{2}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}} =[θ^τ,λθ]Sc[𝚪^τ,λ]Sc[θ^τ,λθ]Sc+[θ^τ,λθ]S[𝚪^τ,λ]S[θ^τ,λθ]S\displaystyle=[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}^{\top}\big{[}{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}\big{]}_{S^{c}}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}+[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S}^{\top}\big{[}{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}\big{]}_{S}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S}
+[θ^τ,λθ]S[𝚪^τ,λ]SSc[θ^τ,λθ]Sc+[θ^τ,λθ]Sc[𝚪^τ,λ]ScS[θ^τ,λθ]S\displaystyle\qquad+[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S}^{\top}\big{[}{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}\big{]}_{SS^{c}}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}+[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}^{\top}\big{[}{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}\big{]}_{S^{c}S}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S}
[θ^τ,λθ]Sc[𝚪^τ,λ]Sc[θ^τ,λθ]Sc+[θ^τ,λθ]S[𝚪^τ,λ]SSc[θ^τ,λθ]Sc\displaystyle\geq[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}^{\top}\big{[}{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}\big{]}_{S^{c}}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}+[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S}^{\top}\big{[}{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}\big{]}_{SS^{c}}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}
+[θ^τ,λθ]Sc[𝚪^τ,λ]ScS[θ^τ,λθ]S,\displaystyle\qquad+[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}^{\top}\big{[}{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}\big{]}_{S^{c}S}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S},

where the inequality follows from the fact that the matrix $[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]_{S}$ is positive definite. By rearranging the above inequality, we obtain

[θ^τ,λθ]Sc[𝚪^τ,λ]Sc[θ^τ,λθ]Sc\displaystyle[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}^{\top}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{S^{c}}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}} [θ^τ,λθ]S[𝚪^τ,λ]SSc[θ^τ,λθ]Sc\displaystyle\leq-[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S}^{\top}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{SS^{c}}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}
[θ^τ,λθ]Sc[𝚪^τ,λ]ScS[θ^τ,λθ]S+θ^τ,λθ𝚪^τ,λ2.\displaystyle-[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}^{\top}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{S^{c}S}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S}+\big{\|}\widehat{\theta}_{\tau,\lambda}-\theta^{*}\big{\|}^{2}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}. (19)

Next, we upper bound the right-hand side of inequality (19) and then lower bound its left-hand side, which leads to a quadratic inequality in the estimation error $\|[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S_{\tau}^{c}}\|$. By solving this quadratic inequality, we finish the proof of Lemma 1.

We establish the upper bound first. According to Hölder’s inequality, we have

|[θ^τ,λθ]S[𝚪^τ,λ]SSc[θ^τ,λθ]Sc|\displaystyle\Big{|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S}^{\top}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{SS^{c}}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\Big{|} [θ^τ,λθ]S1[𝚪^τ,λ]SSc[θ^τ,λθ]Sc1\displaystyle\leq\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S}\big{\|}_{1}\cdot\big{\|}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{SS^{c}}\big{\|}_{\infty}\cdot\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\big{\|}_{1}
θ^τ,λθ1[𝚪^τ,λ]SScs(τ+1)[θ^τ,λθ]Sc,\displaystyle\leq\|\widehat{\theta}_{\tau,\lambda}-\theta^{*}\|_{1}\cdot\big{\|}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{SS^{c}}\big{\|}_{\infty}\cdot\sqrt{s(\tau+1)}\cdot\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\big{\|},

where the last inequality follows from the fact that $[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}$ has at most $(\tau+1)s$ non-zero coordinates. To continue the proof, we also need the following two lemmas, which upper bound $\|[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]_{SS^{c}}\|_{\infty}$ and $\|\widehat{\theta}_{\tau,\lambda}-\theta^{*}\|_{1}$, respectively.

Lemma 6.

With probability at least 1δ1-\delta, we have

[𝚪^τ,λ]SSc8σlog(d/δ)|Eτ|.\big{\|}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{SS^{c}}\big{\|}_{\infty}\leq 8\sigma{\log(d/\delta)}\sqrt{|E_{\tau}|}.
Lemma 7.

For arbitrary constant ς>0\varsigma>0, when

n0max{ς2ρ1,σ2ρ2log(dτ~/δ)},n_{0}\gtrsim\max\{\varsigma^{2}\rho^{-1},\sigma^{2}\rho^{-2}{\log(d\widetilde{\tau}/\delta)}\},

with probability at least 1δ1-\delta, we have

θ^τ,λθ1νs/ςτ~log(kTd/(δλ)).\displaystyle\|\widehat{\theta}_{\tau,\lambda}-\theta^{*}\|_{1}\lesssim\nu s/\varsigma\cdot\sqrt{\widetilde{\tau}}\cdot\log\big{(}kTd/(\delta\lambda)\big{)}.

The proofs of Lemmas 6 and 7 are deferred to Sections 5.2 and 5.3, respectively. We now continue the proof of Lemma 1. By applying Lemmas 6 and 7 and setting

ς=νστ~s3/2log2(kTd/(δλ)),\displaystyle\varsigma=\nu\sigma\widetilde{\tau}\cdot s^{3/2}\cdot\log^{2}\big{(}kTd/(\delta\lambda)\big{)},

we obtain the following inequality with probability at least 1δ1-\delta,

|[θ^τ,λθ]S[𝚪^τ,λ]SSc[θ^τ,λθ]Sc||Eτ|[θ^τ,λθ]Sc.\displaystyle\Big{|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S}^{\top}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{SS^{c}}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\Big{|}\lesssim\sqrt{|E_{\tau}|}\cdot\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\big{\|}. (20)

Note that with such a choice of ς\varsigma, we require

n0ς2ρ1=ρ1(νστ~)2s3log4(kTd/(δλ)).n_{0}\gtrsim\varsigma^{2}\rho^{-1}=\rho^{-1}\cdot(\nu\sigma\widetilde{\tau})^{2}\cdot s^{3}\cdot\log^{4}\big{(}kTd/(\delta\lambda)\big{)}.

Similarly, we can establish the same upper bound for the second term on the right-hand side of inequality (19). Then by combining inequalities (19) and (20), we obtain

|[θ^τ,λθ]Sc[𝚪^τ,λ]Sc[θ^τ,λθ]Sc|θ^τ,λθ𝚪^τ,λ2+|Eτ|[θ^τ,λθ]Sc.\displaystyle\Big{|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}^{\top}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{S^{c}}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\Big{|}\lesssim\big{\|}\widehat{\theta}_{\tau,\lambda}-\theta^{*}\big{\|}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}^{2}+\sqrt{|E_{\tau}|}\cdot\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\big{\|}. (21)

Then we turn to the lower bound for the left-hand side of inequality (21). Recall that in the SLUCB algorithm, the selection of actions in epoch EτE_{\tau} does not rely on [Xt]Sc[X_{t}]_{S^{c}}. Furthermore, by Assumption 1 (A2, A3), for each tt and jScj\in S^{c}, [Xt]j[X_{t}]_{j} is sampled from distribution P0P_{0} independently and satisfies

𝔼[[Xt]j2]ρ.\mathbb{E}\big{[}[X_{t}]_{j}^{2}\big{]}\geq\rho.

Subsequently, by the matrix Bernstein inequality, when

|Eτ|n0σ2ρ2log(dτ~/δ),|E_{\tau}|\geq n_{0}\gtrsim\sigma^{2}\rho^{-2}\log(d\widetilde{\tau}/\delta),

with probability at least 1δ1-\delta, we have

[θ^τ,λθ]Sc[𝚪^τ,λ]Sc[θ^τ,λθ]Scρ/2|Eτ|[θ^τ,λθ]Sc2.[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}^{\top}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{S^{c}}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\geq\rho/2\cdot|E_{\tau}|\cdot\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\big{\|}^{2}. (22)

Then by combining inequalities (21) and (22) with Lemma 5, we obtain, with probability at least $1-\delta$, that

ρ/2|Eτ|[θ^τ,λθ]Sc2|Eτ|[θ^τ,λθ]Sc+ν2slog2(kTd/(δλ))+λr2\rho/2\cdot|E_{\tau}|\cdot\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\big{\|}^{2}\lesssim\sqrt{|E_{\tau}|}\cdot\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\big{\|}+\nu^{2}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}
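The display above is a quadratic inequality of the form $ax^{2}\leq bx+c$ in $x=\|[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\|$. The elementary fact we use is

ax^{2}\leq bx+c\quad\Longrightarrow\quad x\leq\frac{2b}{a}+\sqrt{\frac{2c}{a}},

which follows by considering the two cases $bx\geq c$ (so $ax^{2}\leq 2bx$) and $bx<c$ (so $ax^{2}\leq 2c$).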

Finally, applying this fact with $a\asymp\rho|E_{\tau}|$, $b\asymp\sqrt{|E_{\tau}|}$, and $c\asymp\nu^{2}s\log^{2}(kTd/(\delta\lambda))+\lambda r^{2}$, where the $b/a$ term is of lower order, we obtain

[θ^τ,λθ]Sc2ν2slog2(kTd/(δλ))+λr2|Eτ|.\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{S^{c}}\big{\|}_{2}\lesssim\sqrt{\frac{\nu^{2}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}}{|E_{\tau}|}}.

Lemma 1 then follows since $[\widehat{\theta}_{\tau,\lambda}]_{S^{c}}=0$ by the definition of $S$. ∎

3.2 Proof of Lemma 2

Lemma (Restatement of Lemma 2).

If

n0ρ1(νστ~)2s3log4(kTd/(δλ)),n_{0}\gtrsim\rho^{-1}\cdot(\nu\sigma\widetilde{\tau})^{2}\cdot s^{3}\cdot\log^{4}\big{(}kTd/(\delta\lambda)\big{)},

with probability at least 1δ1-\delta, for all τ[τ~]\tau\in[\widetilde{\tau}], tEτt\in E_{\tau}, and i𝒜ti\in\mathcal{A}_{t}, we have

|Xt,i,θ^τ,λt1θ|ν(slog(kTd/(δλ))+λ1/2r)(σlog(kTd/δ)|Eτ1|+[Xt,i]Sτ1[𝚪^τ1,λt1]1),\Big{|}\big{\langle}X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}\big{\rangle}\Big{|}\lesssim\nu\Big{(}\sqrt{s}\log\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda^{1/2}r\Big{)}\Big{(}\sigma\sqrt{\frac{\log(kTd/\delta)}{|E_{\tau-1}|}}+\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{)},

where

𝚪^τ1,λt1=λI|Sτ1|+tEτt1[Xt]Sτ1[Xt]Sτ1,\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}=\lambda\textbf{I}_{|S_{\tau-1}|}+\sum_{t^{\prime}\in E_{\tau}^{t-1}}[X_{t^{\prime}}]_{S_{\tau-1}}[X_{t^{\prime}}]_{S_{\tau-1}}^{\top},

and Eτt1={t:tt1,tEτ}E_{\tau}^{t-1}=\{t^{\prime}:t^{\prime}\leq t-1,t^{\prime}\in E_{\tau}\}.

Proof.

Recall that $\widehat{\theta}_{\tau,\lambda}^{t-1}$ is the ridge estimator of $\theta^{*}$ using the data $\{(X_{t^{\prime}},Y_{t^{\prime}})\}_{t^{\prime}\in E_{\tau}^{t-1}}$, restricted to the generalized support $S_{\tau-1}$. Then for each $i\in\mathcal{A}_{t}$, we have the decomposition

Xt,i,θ^τ,λt1θ=[Xt,i]Sτ1c,[θ]Sτ1c+[Xt,i]Sτ1,[θ^τ,λt1θ]Sτ1.\displaystyle\big{\langle}X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}\big{\rangle}=-\big{\langle}[X_{t,i}]_{S_{\tau-1}^{c}},[\theta^{*}]_{S_{\tau-1}^{c}}\big{\rangle}+\big{\langle}[X_{t,i}]_{S_{\tau-1}},[\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}]_{S_{\tau-1}}\big{\rangle}. (23)

We can upper bound the first term on the right-hand side of equation (23) by Corollary 2. Formally, with probability at least $1-\delta$, we have

|[Xt,i]Sτ1c,[θ]Sτ1c|σlog(kTd/δ)ν2slog2(kTd/(δλ))+λr2|Eτ1|.\displaystyle\Big{|}\big{\langle}[X_{t,i}]_{S_{\tau-1}^{c}},[\theta^{*}]_{S_{\tau-1}^{c}}\big{\rangle}\Big{|}\lesssim\sigma\sqrt{\log(kTd/\delta)}\cdot\sqrt{\frac{\nu^{2}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}}{|E_{\tau-1}|}}. (24)

It remains to upper bound the second term. By the Cauchy–Schwarz inequality, we have

\Big{|}\big{\langle}[X_{t,i}]_{S_{\tau-1}},[\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}]_{S_{\tau-1}}\big{\rangle}\Big{|}\leq\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\cdot\big{\|}[\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}]_{S_{\tau-1}}\big{\|}_{\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}}. (25)

The following lemma establishes an upper bound for the second factor on the right-hand side of inequality (25). The proof relies on the self-normalized martingale concentration inequality (i.e., Lemma 9 in Section 6) and is similar to the proof of Lemma 5.

Lemma 8.

For each τ[τ~]\tau\in[\widetilde{\tau}] and tEτt\in E_{\tau}, we have

[θ^τ,λt1θ]Sτ1𝚪^τ1,λt12(ν2+σ2[θ]Sτ1c2)(slog2(kTd/(δλ))+λr2),\big{\|}[\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}]_{S_{\tau-1}}\big{\|}^{2}_{\widehat{\mathbf{\Gamma}}^{t-1}_{\tau-1,\lambda}}\lesssim\big{(}\nu^{2}+\sigma^{2}\cdot\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}^{2}\big{)}\cdot\big{(}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}\big{)},

with probability at least 1δ1-\delta.

The proof of Lemma 8 is deferred to Section 5.4. We now continue the proof of Lemma 2. By combining inequalities (23)-(25) and Lemma 8, we obtain with probability at least 1δ1-\delta that

|Xt,i,θ^τ,λt1θ|\displaystyle\Big{|}\big{\langle}X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}\big{\rangle}\Big{|} σlog(kTd/δ)ν2slog2(kTd/(δλ))+λr2|Eτ1|\displaystyle\lesssim\sigma\sqrt{\log(kTd/\delta)}\cdot\sqrt{\frac{\nu^{2}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}}{|E_{\tau-1}|}}
+(ν+σ[θ]Sτ1c)(slog(kTd/(δλ))+λ1/2r)[Xt,i]Sτ1[𝚪^τ1,λt1]1.\displaystyle\ +\Big{(}\nu+\sigma\cdot\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}\Big{)}\Big{(}\sqrt{s}\log\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda^{1/2}r\Big{)}\cdot\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}.

Recall that in Lemma 1, we show that

[θ]Sτ1c=𝒪~(s/T).\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}=\widetilde{\mathcal{O}}(\sqrt{s/T}).

Hence, when sTs\ll T, we have

|Xt,i,θ^τ,λt1θ|ν(slog(kTd/(δλ))+λ1/2r)(σlog(kTd/δ)|Eτ1|+[Xt,i]Sτ1[𝚪^τ1,λt1]1),\displaystyle\Big{|}\big{\langle}X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}\big{\rangle}\Big{|}\lesssim\nu\Big{(}\sqrt{s}\log\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda^{1/2}r\Big{)}\Big{(}\sigma\sqrt{\frac{\log(kTd/\delta)}{|E_{\tau-1}|}}+\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{)},

with probability at least 1δ1-\delta. Finally, by applying the union bound argument for all tEτt\in E_{\tau}, we finish the proof of Lemma 2. ∎

3.3 Proof of Lemma 3

Lemma (Restatement of Lemma 3).

Let two constants p,qp,q satisfy 1pq1\leq p\leq q. Then with probability at least 1δ1-\delta, we have

tEτmin{p,q[Xt]Sτ1[𝚪^τ1,λt1]1}n0p+q|Eτ||Sτ1|(λ+log(σdTk/δ)).\sum_{t\in E_{\tau}}\min\Big{\{}p,q\cdot\big{\|}[X_{t}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{\}}\lesssim n_{0}p+q\cdot\sqrt{|E_{\tau}|\cdot|S_{\tau-1}|\cdot\big{(}\lambda+\log(\sigma dTk/\delta)\big{)}}.
Proof.

For a fixed epoch EτE_{\tau}, relabel its periods as 1,,|Eτ|1,\ldots,|E_{\tau}|, and let t1t_{1} be the last period of the pure-exploration stage and t2t_{2} be the last period of EτE_{\tau}. Then we have t1n0t2t_{1}\leq n_{0}\leq t_{2}. We also have

tEτmin{p,q[Xt]Sτ1[𝚪^τ1,λt1]1}\displaystyle\sum_{t\in E_{\tau}}\min\Big{\{}p,q\cdot\big{\|}[X_{t}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{\}}
n0p+qt=t1+1t2min{1,[Xt]Sτ1[𝚪^τ1,λt1]1}\displaystyle\quad\leq n_{0}p+q\cdot\sum_{t=t_{1}+1}^{t_{2}}\min\Big{\{}1,\big{\|}[X_{t}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{\}}
n0p+q|Eτ|t=t1+1t2min{1,[Xt]Sτ1[𝚪^τ1,λt1]12}\displaystyle\quad\leq n_{0}p+q\sqrt{|E_{\tau}|}\cdot\sqrt{\sum_{t=t_{1}+1}^{t_{2}}\min\Big{\{}1,\big{\|}[X_{t}]_{S_{\tau-1}}\big{\|}^{2}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{\}}}
n0p+q|Eτ|2t=t1+1t2log(1+[Xt]Sτ1[𝚪^τ1,λt1]12).\displaystyle\quad\leq n_{0}p+q\sqrt{|E_{\tau}|}\cdot\sqrt{2\sum_{t=t_{1}+1}^{t_{2}}\log\Big{(}1+\big{\|}[X_{t}]_{S_{\tau-1}}\big{\|}^{2}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{)}}. (26)

Here the first inequality holds because qpq\geq p and the pure-exploration stage contains at most n0n_{0} periods. The second inequality follows from the Cauchy-Schwarz inequality together with the fact that min{1,u}2min{1,u2}\min\{1,u\}^{2}\leq\min\{1,u^{2}\}. The last inequality is due to the basic inequality

min{1,u}2log(1+u),u>0.\min\{1,u\}\leq 2\log(1+u),\ \forall u>0.

On the other hand, we have the recursion

𝚪^τ1,λt\displaystyle\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t} =𝚪^τ1,λt1+[Xt]Sτ1[Xt]Sτ1\displaystyle=\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}+[X_{t}]_{S_{\tau-1}}[X_{t}]_{S_{\tau-1}}^{\top}
=[𝚪^τ1,λt1]1/2(I|Sτ1|+[𝚪^τ1,λt1]1/2[Xt]Sτ1[Xt]Sτ1[𝚪^τ1,λt1]1/2)[𝚪^τ1,λt1]1/2.\displaystyle=\big{[}\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}\big{]}^{1/2}\cdot\Big{(}I_{|S_{\tau-1}|}+\big{[}\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}\big{]}^{-1/2}[X_{t}]_{S_{\tau-1}}[X_{t}]_{S_{\tau-1}}^{\top}\big{[}\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}\big{]}^{-1/2}\Big{)}\cdot\big{[}\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}\big{]}^{1/2}.

Subsequently, by taking determinants on both sides and applying the matrix determinant lemma, we have

log(det(𝚪^τ1,λt))=log(det(𝚪^τ1,λt1))+log(1+[Xt]Sτ1[𝚪^τ1,λt1]12).\displaystyle\log\big{(}\det(\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t})\big{)}=\log\big{(}\det(\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1})\big{)}+\log\Big{(}1+\big{\|}[X_{t}]_{S_{\tau-1}}\big{\|}^{2}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{)}. (27)

By combining equations (26)-(27) and taking a telescoping sum, we obtain

tEτmin{p,q[Xt]Sτ1[𝚪^τ1,λt1]1}n0p+q2|Eτ|log(det(𝚪^τ1,λt2)/det(𝚪^τ1,λt1)).\sum_{t\in E_{\tau}}\min\Big{\{}p,q\cdot\big{\|}[X_{t}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1}]^{-1}}\Big{\}}\leq n_{0}p+q\cdot\sqrt{2|E_{\tau}|\cdot\log\big{(}\det(\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t_{2}})\big{/}{\det(\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t_{1}})}\big{)}}. (28)
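
Both the telescoping identity (27) and the potential bound (28) can be confirmed numerically. The sketch below (ours; the dimension, horizon, and regularizer are arbitrary stand-ins) simulates the Gram-matrix recursion and checks that the summed log terms telescope exactly and that the summed minima are at most twice the log-determinant ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
p, T, lam = 5, 200, 1.0

V = lam * np.eye(p)                      # running Gram matrix, V_0 = lambda * I
logdet0 = np.linalg.slogdet(V)[1]
log_terms, potential = [], 0.0
for _ in range(T):
    x = rng.standard_normal(p)
    w2 = x @ np.linalg.solve(V, x)       # ||x_t||^2 in the V_{t-1}^{-1} norm
    log_terms.append(np.log1p(w2))
    potential += min(1.0, w2)
    V = V + np.outer(x, x)
logdetT = np.linalg.slogdet(V)[1]

# telescoping: sum_t log(1 + ||x_t||^2_{V_{t-1}^{-1}}) = log det V_T - log det V_0
assert abs(sum(log_terms) - (logdetT - logdet0)) < 1e-6
# elliptical potential: sum_t min{1, ||x_t||^2} <= 2 (log det V_T - log det V_0)
assert potential <= 2.0 * (logdetT - logdet0)
print("telescoping identity and potential bound hold")
```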

By Proposition 1, we have with probability at least 1δ1-\delta that, for all t[T]t\in[T] and i𝒜ti\in\mathcal{A}_{t},

Xt,iσdlog(kT/δ).\big{\|}X_{t,i}\big{\|}\lesssim\sigma\sqrt{d\log(kT/\delta)}.

Then by applying the determinant-trace inequality, we have

det(𝚪^τ1,λt2)(tr(𝚪^τ1,λt2)/|Sτ1|)|Sτ1|(λ+Tmaxt[T],i𝒜tXt,i2)|Sτ1|.\displaystyle\det(\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t_{2}})\leq\Big{(}\mathrm{tr}(\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t_{2}})\big{/}|S_{\tau-1}|\Big{)}^{|S_{\tau-1}|}\leq\Big{(}\lambda+T\cdot\max_{t\in[T],i\in\mathcal{A}_{t}}\|X_{t,i}\|^{2}\Big{)}^{|S_{\tau-1}|}.
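
(The determinant-trace inequality is the AM-GM inequality applied to the eigenvalues of a positive semidefinite matrix; a quick numerical check, our own, is below.)

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    n = int(rng.integers(2, 8))
    B = rng.standard_normal((n, n))
    A = B @ B.T                          # positive semidefinite
    lhs = np.linalg.det(A)               # product of eigenvalues
    rhs = (np.trace(A) / n) ** n         # n-th power of their average
    assert lhs <= rhs + 1e-8 * max(1.0, rhs)
print("determinant-trace inequality verified on 1000 random instances")
```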

Hence, we have

log(det(𝚪^τ1,λt2))|Sτ1|(λ+log(σdTk/δ))\log\big{(}\det(\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t_{2}})\big{)}\lesssim|S_{\tau-1}|\cdot\big{(}\lambda+\log(\sigma dTk/\delta)\big{)} (29)

with probability at least 1δ1-\delta. On the other hand, by definition, we have 𝚪^τ1,λt1λI|Sτ1|\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t_{1}}\gtrsim\lambda\textbf{I}_{|S_{\tau-1}|}, which implies

log(det(𝚪^τ1,λt1))|Sτ1|log(λ).\log\big{(}\det(\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t_{1}})\big{)}\gtrsim|S_{\tau-1}|\cdot\log(\lambda). (30)

Finally, by combining inequalities (28)-(30), we complete the proof of Lemma 3. ∎

4 Proof of Auxiliary Lemmas in Theorem 2

4.1 Proof of Lemma 4

Lemma (Restatement of Lemma 4).

When

n0ρ1(νστ~)2s3log4(kTd/(δλ)),n_{0}\gtrsim\rho^{-1}\cdot(\nu\sigma\widetilde{\tau})^{2}\cdot s^{3}\cdot\log^{4}\big{(}kTd/(\delta\lambda)\big{)},

with probability at least 1δ1-\delta, for every tuple (τ,t,i,ζ)(\tau,t,i,\zeta), it holds that

|Xt,i,θ^τ,λt1,ζθ|\displaystyle\Big{|}\big{\langle}X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}-\theta^{*}\big{\rangle}\Big{|} σ(λ1/2+ν+σ)log3/2(kTd/δ)(s/|Eτ1|+[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1),\displaystyle\lesssim\sigma(\lambda^{1/2}+\nu+\sigma)\log^{3/2}(kTd/\delta)\cdot\Big{(}\sqrt{s/|E_{\tau-1}|}+\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}\Big{)},

where

𝚪^τ1,λt1,ζ=λI|Sτ1|+tΨτt1,ζ[Xt]Sτ1[Xt]Sτ1.\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}=\lambda\textbf{I}_{|S_{\tau-1}|}+\sum_{t^{\prime}\in\Psi_{\tau}^{t-1,\zeta}}[X_{t^{\prime}}]_{S_{\tau-1}}[X_{t^{\prime}}]_{S_{\tau-1}}^{\top}.
Proof.

To simplify notation, for each fixed tuple (τ,t,ζ)(\tau,t,\zeta), we denote by

𝜺τt1,ζ\displaystyle\boldsymbol{\varepsilon}^{t-1,\zeta}_{\tau} ={εt}tΨτt1,ζm,\displaystyle=\{\varepsilon_{t^{\prime}}\}_{t^{\prime}\in\Psi_{\tau}^{t-1,\zeta}}\in\mathbb{R}^{m},
𝐘τt1,ζ\displaystyle\mathbf{Y}^{t-1,\zeta}_{\tau} ={Yt}tΨτt1,ζm,𝐗τt1,ζ={Xt}tΨτt1,ζd×m,\displaystyle=\{Y_{t^{\prime}}\}_{t^{\prime}\in\Psi_{\tau}^{t-1,\zeta}}\in\mathbb{R}^{m},\ \mathbf{X}^{t-1,\zeta}_{\tau}=\{X_{t^{\prime}}\}_{t^{\prime}\in\Psi_{\tau}^{t-1,\zeta}}\in\mathbb{R}^{d\times m},

where mm is the number of elements in set Ψτt1,ζ\Psi_{\tau}^{t-1,\zeta}. Recall that

𝚪^τ1,λt1,ζ=λI|Sτ1|+tΨτt1,ζ[Xt]Sτ1[Xt]Sτ1,\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}=\lambda\textbf{I}_{|S_{\tau-1}|}+\sum_{t^{\prime}\in\Psi_{\tau}^{t-1,\zeta}}[X_{t^{\prime}}]_{S_{\tau-1}}[X_{t^{\prime}}]_{S_{\tau-1}}^{\top},

which is the sample covariance matrix corresponding to 𝐗τt1,ζ\mathbf{X}^{t-1,\zeta}_{\tau} restricted on Sτ1S_{\tau-1}. Recall that θ^τ,λt1,ζ\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta} is the ridge estimator with generalized support Sτ1S_{\tau-1} using data (𝐗τt1,ζ,𝐘τt1,ζ)(\mathbf{X}^{t-1,\zeta}_{\tau},\mathbf{Y}^{t-1,\zeta}_{\tau}). Particularly, we have

[θ^τ,λt1,ζ]Sτ1=[𝚪^τ1,λt1,ζ]1[𝐗τt1,ζ]Sτ1𝐘τt1,ζ.\displaystyle\big{[}\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}\big{]}_{S_{\tau-1}}=\big{[}\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}\big{]}^{-1}\big{[}\mathbf{X}^{t-1,\zeta}_{\tau}\big{]}_{S_{\tau-1}}^{\top}\mathbf{Y}^{t-1,\zeta}_{\tau}.

Note that

𝐘τt1,ζ=[𝐗τt1,ζ]Sτ1[θ]Sτ1+[𝐗τt1,ζ]Sτ1c[θ]Sτ1c+𝜺τt1,ζ.\mathbf{Y}^{t-1,\zeta}_{\tau}=\big{[}\mathbf{X}^{t-1,\zeta}_{\tau}\big{]}_{S_{\tau-1}}[\theta^{*}]_{S_{\tau-1}}+\big{[}\mathbf{X}^{t-1,\zeta}_{\tau}\big{]}_{S^{c}_{\tau-1}}[\theta^{*}]_{S^{c}_{\tau-1}}+\boldsymbol{\varepsilon}^{t-1,\zeta}_{\tau}.

Then we have

[θ^τ,λt1,ζθ]Sτ1=[𝚪^τ1,λt1,ζ]1(λ[θ]Sτ1+[𝐗τt1,ζ]Sτ1(𝜺τt1,ζ+[𝐗τt1,ζ]Sτ1c[θ]Sτ1c)).\displaystyle\big{[}\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}-\theta^{*}\big{]}_{S_{\tau-1}}=\big{[}\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}\big{]}^{-1}\Big{(}-\lambda[\theta^{*}]_{S_{\tau-1}}+[\mathbf{X}^{t-1,\zeta}_{\tau}]_{S_{\tau-1}}^{\top}\big{(}\boldsymbol{\varepsilon}^{t-1,\zeta}_{\tau}+[\mathbf{X}^{t-1,\zeta}_{\tau}]_{S^{c}_{\tau-1}}[\theta^{*}]_{S^{c}_{\tau-1}}\big{)}\Big{)}.
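
The closed form of the restricted ridge estimator and the error decomposition above can be checked to machine precision. The following sketch (our own; the support set, sample size, and noise level are arbitrary stand-ins for Sτ1S_{\tau-1} and the data of epoch τ\tau, with rows of the design playing the role of the covariates) verifies the displayed identity.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, lam = 20, 50, 0.5
S = np.arange(6)                          # stand-in for the support S_{tau-1}
Sc = np.arange(6, d)                      # its complement

theta_star = 0.5 * rng.standard_normal(d)
X = rng.standard_normal((m, d))           # rows are the covariate vectors
eps = 0.1 * rng.standard_normal(m)
Y = X @ theta_star + eps

Gamma = lam * np.eye(len(S)) + X[:, S].T @ X[:, S]
theta_hat_S = np.linalg.solve(Gamma, X[:, S].T @ Y)   # restricted ridge estimator

# error decomposition: theta_hat_S - theta*_S
#   = Gamma^{-1} ( -lam * theta*_S + X_S^T (eps + X_{S^c} theta*_{S^c}) )
rhs = np.linalg.solve(
    Gamma,
    -lam * theta_star[S] + X[:, S].T @ (eps + X[:, Sc] @ theta_star[Sc]),
)
assert np.allclose(theta_hat_S - theta_star[S], rhs)
print("restricted ridge error decomposition verified")
```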

Hence, for each Xt,iX_{t,i}, we can upper bound the prediction error Xt,i,θ^τ,λt1,ζθ\langle X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}-\theta^{*}\rangle as

|Xt,i,θ^τ,λt1,ζθ||[Xt,i]Sτ1c,[θ]Sτ1c|+|[Xt,i]Sτ1,[θ^τ,λt1,ζθ]Sτ1|.\displaystyle\big{|}\langle X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}-\theta^{*}\rangle\big{|}\leq\Big{|}\big{\langle}[X_{t,i}]_{S_{\tau-1}^{c}},[\theta^{*}]_{S_{\tau-1}^{c}}\big{\rangle}\Big{|}+\Big{|}\big{\langle}[X_{t,i}]_{S_{\tau-1}},[\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}-\theta^{*}]_{S_{\tau-1}}\big{\rangle}\Big{|}. (31)

We can use Corollary 6 to upper bound the first term in the right-hand side of inequality (31) and obtain

|[Xt,i]Sτ1c,[θ]Sτ1c|σlog(kTd/δ)ν2slog2(kTd/(δλ))+λr2|Eτ1|,\Big{|}\big{\langle}[X_{t,i}]_{S^{c}_{\tau-1}},[\theta^{*}]_{S^{c}_{\tau-1}}\big{\rangle}\Big{|}\lesssim\sigma\sqrt{\log(kTd/\delta)}\cdot\sqrt{\frac{\nu^{2}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}}{|E_{\tau-1}|}}, (32)

with probability at least 1δ1-\delta. It remains to upper bound the second term.

We first note that

[Xt,i]Sτ1,[𝚪^τ1,λt1,ζ]1[θ]Sτ1\displaystyle\Big{\langle}[X_{t,i}]_{S_{\tau-1}},\big{[}\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}\big{]}^{-1}[\theta^{*}]_{S_{\tau-1}}\Big{\rangle} [θ]Sτ1[𝚪^τ1,λt1,ζ]1[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1\displaystyle\leq\big{\|}[\theta^{*}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}\cdot\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}
λ1/2[θ]Sτ1[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1\displaystyle\leq\lambda^{-1/2}\big{\|}[\theta^{*}]_{S_{\tau-1}}\big{\|}\cdot\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}
λ1/2r[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1,\displaystyle\leq\lambda^{-1/2}r\cdot\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}, (33)

where the first inequality is the Cauchy-Schwarz inequality and the second one follows from the fact that v[𝚪^τ1,λt1,ζ]1λ1/2v\|v\|_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}\leq\lambda^{-1/2}\|v\| whenever

𝚪^τ1,λt1,ζλI|Sτ1|.\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}\succeq\lambda\textbf{I}_{|S_{\tau-1}|}.

Next, we claim that for each fixed tuple (τ,t,ζ)(\tau,t,\zeta), the random vectors 𝜺τt1,ζ\boldsymbol{\varepsilon}^{t-1,\zeta}_{\tau} and 𝐗τt1,ζ\mathbf{X}^{t-1,\zeta}_{\tau} are independent of each other. Recall the policies for picking arms and updating Ψτt1,ζ\Psi_{\tau}^{t-1,\zeta} in the SupLinUCB algorithm, which are specified in detail in Section 2.2. Consider a fixed period tt1t^{\prime}\leq t-1 and the ζ\zeta-th group. Note that the random event that the covariate XtX_{t^{\prime}} enters 𝐗τt1,ζ\mathbf{X}^{t-1,\zeta}_{\tau} depends only on the widths of the confidence bands {ωτ,λt1,ζ}ζζ\{\omega_{\tau,\lambda}^{t^{\prime}-1,\zeta^{\prime}}\}_{\zeta^{\prime}\leq\zeta}. Moreover, since {ωτ,λt1,ζ}ζζ\{\omega_{\tau,\lambda}^{t^{\prime}-1,\zeta^{\prime}}\}_{\zeta^{\prime}\leq\zeta} relies only on the values of the design matrices {𝐗τt1,ζ}ζζ\{\mathbf{X}^{t^{\prime}-1,\zeta^{\prime}}_{\tau}\}_{\zeta^{\prime}\leq\zeta} and not on the noise sequence {εt′′}t′′t\{\varepsilon_{t^{\prime\prime}}\}_{t^{\prime\prime}\leq t^{\prime}}, whether XtX_{t^{\prime}} enters 𝐗τt1,ζ\mathbf{X}^{t-1,\zeta}_{\tau} is also independent of {εt}tΨτt,ζ\{\varepsilon_{t^{\prime}}\}_{t^{\prime}\in\Psi_{\tau}^{t,\zeta}}. This proves the claim.

The independence between 𝜺τt1,ζ\boldsymbol{\varepsilon}^{t-1,\zeta}_{\tau} and 𝐗τt1,ζ\mathbf{X}^{t-1,\zeta}_{\tau} enables us to establish a tighter upper bound for the second term on the right-hand side of inequality (31) than that in Lemma 2. Recall that by Assumption 1, given the information up to epoch τ1\tau-1, the random variables [𝐗t1,ζ]Sτ1[\mathbf{X}^{t-1,\zeta}]_{S_{\tau-1}} and [𝐗t1,ζ]Sτ1c[\mathbf{X}^{t-1,\zeta}]_{S^{c}_{\tau-1}} are independent. Moreover, [𝐗t1,ζ]Sτ1c[\mathbf{X}^{t-1,\zeta}]_{S^{c}_{\tau-1}} is also independent of 𝜺τt1,ζ\boldsymbol{\varepsilon}^{t-1,\zeta}_{\tau}. Hence, each element of

𝜺τt1,ζ+[𝐗τt1,ζ]Sτ1c[θ]Sτ1c\boldsymbol{\varepsilon}^{t-1,\zeta}_{\tau}+[\mathbf{X}^{t-1,\zeta}_{\tau}]_{S^{c}_{\tau-1}}[\theta^{*}]_{S^{c}_{\tau-1}}

is a centered sub-Gaussian random variable with parameter ν2+σ2[θ]Sτ1c22\nu^{2}+\sigma^{2}\|[\theta^{*}]_{S_{\tau-1}^{c}}\|_{2}^{2}, which is also independent of covariates [𝐗τt1,ζ]Sτ1[\mathbf{X}^{t-1,\zeta}_{\tau}]_{S_{\tau-1}}. By applying the sub-Gaussian concentration inequality and the union bound argument, we obtain with probability at least 1δ1-\delta that

𝜺τt1,ζ+[𝐗τt1,ζ]Sτ1c[θ]Sτ1c(ν2+σ2[θ]Sτ1c2)log(kTd/δ).\displaystyle\big{\|}\boldsymbol{\varepsilon}^{t-1,\zeta}_{\tau}+[\mathbf{X}^{t-1,\zeta}_{\tau}]_{S^{c}_{\tau-1}}^{\top}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}\lesssim\sqrt{\big{(}\nu^{2}+\sigma^{2}\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}^{2}\big{)}}\cdot\log(kTd/\delta). (34)

Here we emphasize that inequality (34) is interpreted as a random event given the realization of [𝐗t1,ζ]Sτ1[\mathbf{X}^{t-1,\zeta}]_{S_{\tau-1}}, which we use later in the proof. It also holds regardless of the values of [𝐗t1,ζ]Sτ1[\mathbf{X}^{t-1,\zeta}]_{S_{\tau-1}}. This is where the independence of 𝜺τt1,ζ\boldsymbol{\varepsilon}^{t-1,\zeta}_{\tau} and [𝐗t1,ζ]Sτ1[\mathbf{X}^{t-1,\zeta}]_{S_{\tau-1}} is indispensable: otherwise, given [𝐗t1,ζ]Sτ1c[\mathbf{X}^{t-1,\zeta}]_{S^{c}_{\tau-1}}, the elements of 𝜺τt1,ζ\boldsymbol{\varepsilon}^{t-1,\zeta}_{\tau} are not i.i.d., which prevents us from establishing a concentration result similar to (34). Based on inequality (34) and the Cauchy-Schwarz inequality, with probability at least 1δ1-\delta, we have

[Xt,i]Sτ1,[𝚪^τ1,λt1,ζ]1[𝐗τt1,ζ]Sτ1(𝜺τt1,ζ+[𝐗τt1,ζ]Sτ1c[θ]Sτ1c)\displaystyle\Big{\langle}[X_{t,i}]_{S_{\tau-1}},\big{[}\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}\big{]}^{-1}[\mathbf{X}^{t-1,\zeta}_{\tau}]_{S_{\tau-1}}^{\top}\big{(}\boldsymbol{\varepsilon}^{t-1,\zeta}_{\tau}+[\mathbf{X}^{t-1,\zeta}_{\tau}]_{S^{c}_{\tau-1}}[\theta^{*}]_{S^{c}_{\tau-1}}\big{)}\Big{\rangle}
[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1(ν2+σ2[θ]Sτ1c2)log(kTd/δ).\displaystyle\qquad\qquad\lesssim\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}\cdot\sqrt{\big{(}\nu^{2}+\sigma^{2}\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}^{2}\big{)}}\cdot\log(kTd/\delta). (35)

By combining inequalities (33)-(35), we have

|[Xt,i]Sτ1,[θ^τ,λt1,ζθ]Sτ1|\displaystyle\Big{|}\big{\langle}[X_{t,i}]_{S_{\tau-1}},[\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}-\theta^{*}]_{S_{\tau-1}}\big{\rangle}\Big{|}
(λ1/2r+(ν2+σ2[θ]Sτ1c2)log(kTd/δ))[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1,\displaystyle\qquad\qquad\lesssim\Big{(}\lambda^{1/2}r+\sqrt{\big{(}\nu^{2}+\sigma^{2}\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}^{2}\big{)}}\cdot\log(kTd/\delta)\Big{)}\cdot\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}, (36)

with probability at least 1δ1-\delta. Recall that Lemma 1 shows that

[θ]Sτ1c=𝒪~(s/T).\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}=\widetilde{\mathcal{O}}(\sqrt{s/T}).

Hence, when sTs\ll T, [θ]Sτ1c1\|[\theta^{*}]_{S^{c}_{\tau-1}}\|\lesssim 1. Subsequently, by combining inequalities (32) and (36), we obtain with probability at least 1δ1-\delta that

|Xt,i,θ^τ,λt1,ζθ|\displaystyle\Big{|}\big{\langle}X_{t,i},\widehat{\theta}_{\tau,\lambda}^{t-1,\zeta}-\theta^{*}\big{\rangle}\Big{|} σ(λ1/2+ν+σ)log3/2(kTd/δ)(s/|Eτ1|+[Xt,i]Sτ1[𝚪^τ1,λt1,ζ]1),\displaystyle\lesssim\sigma(\lambda^{1/2}+\nu+\sigma)\log^{3/2}(kTd/\delta)\cdot\Big{(}\sqrt{s/|E_{\tau-1}|}+\big{\|}[X_{t,i}]_{S_{\tau-1}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau-1,\lambda}^{t-1,\zeta}]^{-1}}\Big{)},

which concludes the proof of Lemma 4. ∎

5 Proof of Auxiliary Lemmas in Supplementary Document

5.1 Proof of Lemma 5

Lemma.

(Restatement of Lemma 5) With probability at least 1δ1-\delta, we have

θ^τ,λθ𝚪^τ,λ2ν2slog2(kTd/(δλ))+λr2.\big{\|}\widehat{\theta}_{\tau,\lambda}-\theta^{*}\big{\|}^{2}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}\lesssim\nu^{2}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}.
Proof.

We use the optimality of θ^τ,λ\widehat{\theta}_{\tau,\lambda} to derive an upper bound on the quadratic form of θ^τ,λθ.\widehat{\theta}_{\tau,\lambda}-\theta^{*}. In the following, we use SS^{*} to denote the support of the true parameter θ\theta^{*}. Let θ¯τ,λ\overline{\theta}_{\tau,\lambda} be the projected ridge estimator with support SS^{*} using data {Xt,Yt}tEτ\{X_{t},Y_{t}\}_{t\in E_{\tau}}, i.e.,

θ¯τ,λ=Projr{argminsupp+(θ)=S{tEτ(YtXtθ)2+λθ2}},\displaystyle\overline{\theta}_{\tau,\lambda}=\text{Proj}_{r}\Big{\{}\operatorname*{argmin}_{\mathrm{supp}^{+}{(\theta)}=S^{*}}\Big{\{}\sum_{t\in E_{\tau}}(Y_{t}-X_{t}^{\top}\theta)^{2}+\lambda\|\theta\|^{2}\Big{\}}\Big{\}},

where Projr{}\text{Proj}_{r}\{\cdot\} denotes the projection operator onto a centered 2\ell_{2}-ball with radius rr. We remark that θ¯τ,λ\overline{\theta}_{\tau,\lambda} is unavailable in practice since SS^{*} is unknown. However, it is not needed by the SLUCB algorithm; it only serves as an auxiliary estimator in the theoretical analysis. Recall the definition of the BSS estimator θ^τ,λ\widehat{\theta}_{\tau,\lambda}, which is given by

θ^τ,λ\displaystyle\widehat{\theta}_{\tau,\lambda} =argminθ{tEτ(YtXtθ)2+λθ2},\displaystyle=\operatorname*{argmin}_{\theta}\Big{\{}\sum_{t\in E_{\tau}}(Y_{t}-X_{t}^{\top}\theta)^{2}+\lambda\|\theta\|^{2}\Big{\}},
s.t.supp+(θ)Sτ1,|supp+(θ)|τs,θr,\displaystyle\text{s.t.}\ \mathrm{supp}^{+}{(\theta)}\supseteq S_{\tau-1},|\mathrm{supp}^{+}{(\theta)}|\leq\tau s,\|\theta\|\leq r, (37)

where Sτ1S_{\tau-1} is the generalized support recovered by the (τ1)(\tau-1)-th epoch. Since supp+(θ¯τ,λ)=SSτ1Sτ1\mathrm{supp}^{+}(\overline{\theta}_{\tau,\lambda})=S^{*}\cup S_{\tau-1}\supseteq S_{\tau-1} and |SSτ1|τs|S^{*}\cup S_{\tau-1}|\leq\tau s, the estimator θ¯τ,λ\overline{\theta}_{\tau,\lambda} satisfies the constraints in equation (37). Then by the optimality of θ^τ,λ\widehat{\theta}_{\tau,\lambda}, we have

tEτ(YtXtθ^τ,λ)2+λθ^τ,λ2tEτ(YtXtθ¯τ,λ)2+λθ¯τ,λ2,\sum_{t\in E_{\tau}}\big{(}Y_{t}-X_{t}^{\top}\widehat{\theta}_{\tau,\lambda}\big{)}^{2}+\lambda\|\widehat{\theta}_{\tau,\lambda}\|^{2}\leq\sum_{t\in E_{\tau}}\big{(}Y_{t}-X_{t}^{\top}\overline{\theta}_{\tau,\lambda}\big{)}^{2}+\lambda\|\overline{\theta}_{\tau,\lambda}\|^{2},

and furthermore,

θ^τ,λ𝚪^τ,λθ^τ,λθ¯τ,λ𝚪^τ,λθ¯τ,λ+2tEτYtXt(θ^τ,λθ¯τ,λ),\widehat{\theta}_{\tau,\lambda}^{\top}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\widehat{\theta}_{\tau,\lambda}\leq\overline{\theta}_{\tau,\lambda}^{\top}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\overline{\theta}_{\tau,\lambda}+2\sum_{t\in E_{\tau}}Y_{t}X_{t}^{\top}(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda}), (38)

where 𝚪^τ,λ=tEτXtXt+λId\widehat{\mathbf{\Gamma}}_{\tau,\lambda}=\sum_{t\in E_{\tau}}X_{t}X_{t}^{\top}+\lambda\textbf{I}_{d}. Note that we have

Yt=Xtθ+εt=Xtθ¯τ,λ+εt+Xt(θθ¯τ,λ).Y_{t}=X_{t}^{\top}\theta^{*}+\varepsilon_{t}=X_{t}^{\top}\overline{\theta}_{\tau,\lambda}+\varepsilon_{t}+X_{t}^{\top}(\theta^{*}-\overline{\theta}_{\tau,\lambda}).

After some algebra, we obtain

tEτYtXt(θ^τ,λθ¯τ,λ)\displaystyle\sum_{t\in E_{\tau}}Y_{t}X_{t}^{\top}(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda}) =θ¯τ,λ𝚪^τ,λ(θ^τ,λθ¯τ,λ)+(θθ¯τ,λ)𝚪^τ,λ(θ^τ,λθ¯τ,λ)\displaystyle=\overline{\theta}_{\tau,\lambda}^{\top}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda})+(\theta^{*}-\overline{\theta}_{\tau,\lambda})^{\top}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda})
+(tEτεtXt)(θ^τ,λθ¯τ,λ)λ(θ^τ,λθ¯τ,λ)θ.\displaystyle+\Big{(}\sum_{t\in E_{\tau}}\varepsilon_{t}X_{t}^{\top}\Big{)}(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda})-\lambda(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda})^{\top}\theta^{*}.
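
The algebra above is mechanical but easy to get wrong, so we record a numerical check (our own sketch). Note that the identity holds pointwise for any pair of vectors standing in for the two estimators, which is what the check exploits.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, lam = 8, 30, 0.7

theta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))            # rows are the covariate vectors X_t
eps = rng.standard_normal(n)
Y = X @ theta_star + eps
Gamma = X.T @ X + lam * np.eye(d)          # Gamma_hat = sum_t X_t X_t^T + lam * I

theta_hat = rng.standard_normal(d)         # arbitrary vectors: the identity is
theta_bar = rng.standard_normal(d)         # algebraic, not statistical
delta = theta_hat - theta_bar

lhs = Y @ (X @ delta)                      # sum_t Y_t X_t^T (theta_hat - theta_bar)
rhs = (theta_bar @ Gamma @ delta
       + (theta_star - theta_bar) @ Gamma @ delta
       + (eps @ X) @ delta
       - lam * delta @ theta_star)
assert np.isclose(lhs, rhs)
print("decomposition of sum_t Y_t X_t^T (theta_hat - theta_bar) verified")
```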

Then inequality (38) becomes

(θ^τ,λθ¯τ,λ)𝚪^τ,λ(θ^τ,λθ¯τ,λ)\displaystyle(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda})^{\top}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda})
2(tEτεtXt)(θ^τ,λθ¯τ,λ)+2(θθ¯τ,λ)𝚪^τ,λ(θ^τ,λθ¯τ,λ)2λ(θ^τ,λθ¯τ,λ)θ.\displaystyle\quad\leq 2\Big{(}\sum_{t\in E_{\tau}}\varepsilon_{t}X_{t}^{\top}\Big{)}(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda})+2(\theta^{*}-\overline{\theta}_{\tau,\lambda})^{\top}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda})-2\lambda(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda})^{\top}\theta^{*}. (39)

Next, we establish upper bounds for the three terms on the right-hand side of inequality (39) individually. We first introduce some notation. By the definition of θ^τ,λ\widehat{\theta}_{\tau,\lambda}, we have

θ^τ,λ\displaystyle\widehat{\theta}_{\tau,\lambda} =argminθ{tEτ(YtXtθ)2+λθ2}\displaystyle=\operatorname*{argmin}_{\theta}\Big{\{}\sum_{t\in E_{\tau}}(Y_{t}-X_{t}^{\top}\theta)^{2}+\lambda\|\theta\|^{2}\Big{\}}
s.t. there exists some subsetS[d]:supp+(θ)=S,SSτ1,|S|τs,θr.\displaystyle\text{s.t. there exists some subset}\ S\subseteq[d]:\ \mathrm{supp}^{+}{(\theta)}=S,S\supseteq S_{\tau-1},|S|\leq\tau s,\|\theta\|\leq r.

where the optimization is taken over all feasible sets SS. From now on, for convenience, given a fixed subset S[d]S\subseteq[d] with |S|τs|S|\leq\tau s, we denote by

S~=Ssupp(θ).\widetilde{S}=S\cup\mathrm{supp}(\theta^{*}).

Particularly, let

S~τ=supp+(θ^τ,λ)supp(θ).\widetilde{S}_{\tau}=\mathrm{supp}^{+}(\widehat{\theta}_{\tau,\lambda})\cup\mathrm{supp}(\theta^{*}).

Then |S~|(τ+1)s|\widetilde{S}|\leq(\tau+1)s by definition. We are ready to upper bound the right-hand side of inequality (39).

For the first part, note that supp+(θ^τ,λθ¯τ,λ)S~τ\mathrm{supp}^{+}(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda})\subseteq\widetilde{S}_{\tau}. Then by the Cauchy-Schwarz inequality, we have

(tEτεtXt)(θ^τ,λθ¯τ,λ)\displaystyle\Big{(}\sum_{t\in E_{\tau}}\varepsilon_{t}X_{t}^{\top}\Big{)}(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda}) tEτεt[Xt]S~τ[𝚪^τ,λ]S~τ1[θ^τ,λθ¯τ,λ]S~τ[𝚪^τ,λ]S~τ\displaystyle\leq\Big{\|}\sum_{t\in E_{\tau}}\varepsilon_{t}\cdot[X_{t}]_{\widetilde{S}_{\tau}}\Big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]_{\widetilde{S}_{\tau}}^{-1}}\cdot\Big{\|}[\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda}]_{\widetilde{S}_{\tau}}\Big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]_{\widetilde{S}_{\tau}}}
=tEτεt[Xt]S~τ[𝚪^τ,λ]S~τ1θ^τ,λθ¯τ,λ𝚪^τ,λ\displaystyle=\Big{\|}\sum_{t\in E_{\tau}}\varepsilon_{t}\cdot[X_{t}]_{\widetilde{S}_{\tau}}\Big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]_{\widetilde{S}_{\tau}}^{-1}}\cdot\big{\|}\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda}\big{\|}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}
maxS~{tEτεt[Xt]S~[𝚪^τ,λ]S~1}θ^τ,λθ¯τ,λ𝚪^τ,λ,\displaystyle\leq\max_{\widetilde{S}}\Big{\{}\Big{\|}\sum_{t\in E_{\tau}}\varepsilon_{t}\cdot[X_{t}]_{\widetilde{S}}\Big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]^{-1}_{\widetilde{S}}}\Big{\}}\cdot\big{\|}\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda}\big{\|}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}, (40)

where the maximization in the last step is taken over all possible S~\widetilde{S}, whose number is at most d(τ+1)sd^{(\tau+1)s}. For each fixed S~\widetilde{S}, by applying the self-normalized martingale concentration inequality (i.e., Lemma 9 provided in Section 6), we obtain with probability at least 1δ1-\delta that

tEτεt[Xt]S~[𝚪^τ,λ]S~1ν2log(det([𝚪^τ,λ]S~)1/2det(λ[Id]S~)1/2δ).\displaystyle\Big{\|}\sum_{t\in E_{\tau}}\varepsilon_{t}\cdot[X_{t}]_{\widetilde{S}}\Big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]^{-1}_{\widetilde{S}}}\leq\nu\sqrt{2\log\Big{(}\frac{\det\big{(}[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]_{\widetilde{S}}\big{)}^{1/2}\det\big{(}\lambda[\textbf{I}_{d}]_{\widetilde{S}}\big{)}^{-1/2}}{\delta}\Big{)}}.

Since the number of possible S~\widetilde{S} is upper bounded by d(τ+1)sd^{(\tau+1)s}, by applying the union bound argument, we obtain with probability at least 1δ1-\delta that,

tEτεt[Xt]S~[𝚪^τ,λ]S~1\displaystyle\Big{\|}\sum_{t\in E_{\tau}}\varepsilon_{t}\cdot[X_{t}]_{\widetilde{S}}\Big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]^{-1}_{\widetilde{S}}} ν2log(det([𝚪^τ,λ]S~)1/2det(λ[Id]S~)1/2δ)+τslog(d)\displaystyle\leq\nu\sqrt{2\log\Big{(}\frac{\det\big{(}[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]_{\widetilde{S}}\big{)}^{1/2}\det\big{(}\lambda[\textbf{I}_{d}]_{\widetilde{S}}\big{)}^{-1/2}}{\delta}\Big{)}+\tau s\log(d)}
:=ΛS~,\displaystyle:=\Lambda_{\widetilde{S}}, (41)

holds for all possible S~\widetilde{S}, which further implies that

maxS~tEτεt[Xt]S~[𝚪^τ,λ]S~1maxS~{ΛS~},\displaystyle\max_{\widetilde{S}}\Big{\|}\sum_{t\in E_{\tau}}\varepsilon_{t}\cdot[X_{t}]_{\widetilde{S}}\Big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]^{-1}_{\widetilde{S}}}\leq\max_{\widetilde{S}}\big{\{}\Lambda_{\widetilde{S}}\big{\}}, (42)

holds with probability at least 1δ1-\delta.

For the second part of the right-hand side of inequality (39), by the Cauchy-Schwarz inequality and the confidence region of the ridge estimator (i.e., Lemma 10 provided in Section 6), we obtain with probability at least 1δ1-\delta that

2(θθ¯τ,λ)𝚪^τ,λ(θ^τ,λθ¯τ,λ)2θθ¯τ,λ𝚪^τ,λθ^τ,λθ¯τ,λ𝚪^τ,λ\displaystyle 2(\theta^{*}-\overline{\theta}_{\tau,\lambda})^{\top}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda}) \leq 2\big{\|}\theta^{*}-\overline{\theta}_{\tau,\lambda}\big{\|}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}\cdot\big{\|}\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda}\big{\|}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}
(ΛS+λ1/2r)θ^τ,λθ¯τ,λ𝚪^τ,λ,\displaystyle\lesssim(\Lambda_{S^{*}}+\lambda^{1/2}r)\cdot\big{\|}\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda}\big{\|}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}, (43)

where S=supp(θ¯τ)supp(θ)S^{*}=\mathrm{supp}({\overline{\theta}_{\tau}})\cup\mathrm{supp}(\theta^{*}).

For the last part of the right-hand side of inequality (39), since θ\theta^{*}, θ^τ,λ\widehat{\theta}_{\tau,\lambda}, and θ¯τ,λ\overline{\theta}_{\tau,\lambda} are all in the centered 2\ell_{2}-ball with radius rr, we have

|2λ(θ^τ,λθ¯τ,λ)θ|4λr2.\displaystyle\big{|}2\lambda(\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda})^{\top}\theta^{*}\big{|}\leq 4\lambda r^{2}. (44)

Finally, by combining inequalities (39)-(44) and Lemma 10 (provided in Section 6), we obtain with probability at least 1δ1-\delta that

θ^τ,λθ𝚪^τ,λ\displaystyle\big{\|}\widehat{\theta}_{\tau,\lambda}-\theta^{*}\big{\|}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}} θ^τ,λθ¯τ,λ𝚪^τ,λ+θ¯τ,λθ𝚪^τ,λ\displaystyle\leq\big{\|}\widehat{\theta}_{\tau,\lambda}-\overline{\theta}_{\tau,\lambda}\big{\|}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}+\big{\|}\overline{\theta}_{\tau,\lambda}-\theta^{*}\big{\|}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}}
maxS~{ΛS~}+ΛS+λ1/2r,\displaystyle\lesssim\max_{\widetilde{S}}\big{\{}\Lambda_{\widetilde{S}}\big{\}}+\Lambda_{S^{*}}+\lambda^{1/2}r, (45)

where the maximization is taken over all subsets S~[d]\widetilde{S}\subseteq[d] with cardinality at most (τ+1)s(\tau+1)s.

In order to conclude the proof of Lemma 5, it remains to establish an upper bound for ΛS~\Lambda_{\widetilde{S}}. By Proposition 1, with probability at least 1δ1-\delta, for all tEτt\in E_{\tau}, i𝒜ti\in\mathcal{A}_{t}, and j[d],j\in[d],

|[Xt,i]j|σlog(kTd/δ).\big{|}[X_{t,i}]_{j}\big{|}\lesssim\sigma\log(kTd/\delta).

Then by determinant-trace inequality, we have

det([𝚪^τ,λ]S~)(λ+Tσ2log2(kTd/δ)(τ+1)s)(τ+1)s.\displaystyle\det\big{(}[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]_{\widetilde{S}}\big{)}\lesssim\Big{(}\lambda+\frac{T\sigma^{2}\log^{2}(kTd/\delta)}{(\tau+1)s}\Big{)}^{(\tau+1)s}.

Recall the definition of ΛS~\Lambda_{\widetilde{S}} in equation (41). Then with probability at least 1δ1-\delta, we have

ΛS~\displaystyle\Lambda_{\widetilde{S}} ν(τ+1)slog(δ1+Tσ2log2(kTd/δ)δλ(τ+1)s)+τslog(d)\displaystyle\leq\nu\sqrt{(\tau+1)s\log\Big{(}\delta^{-1}+\frac{T\sigma^{2}\log^{2}(kTd/\delta)}{\delta\lambda(\tau+1)s}\Big{)}+\tau s\log(d)}
νslog(kTd/(δλ)),\displaystyle\lesssim\nu\sqrt{s}\cdot\log\big{(}{kTd}/{(\delta\lambda)}\big{)}, (46)

which further implies that

maxS~{ΛS~},ΛSνslog(kTd/(δλ)).\max_{\widetilde{S}}\big{\{}\Lambda_{\widetilde{S}}\big{\}},\Lambda_{S^{*}}\lesssim\nu\sqrt{s}\cdot\log\big{(}{kTd}/{(\delta\lambda)}\big{)}.

As a result, by inequality (45), we finally obtain with probability at least 1δ1-\delta that

θ^τ,λθ𝚪^τ,λ2\displaystyle\big{\|}\widehat{\theta}_{\tau,\lambda}-\theta^{*}\big{\|}^{2}_{\widehat{\mathbf{\Gamma}}_{\tau,\lambda}} (νslog(kTd/(δλ))+λ1/2r)2ν2slog2(kTd/(δλ))+λr2,\displaystyle\lesssim\Big{(}\nu\sqrt{s}\cdot\log\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda^{1/2}r\Big{)}^{2}\lesssim\nu^{2}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2},

which concludes the proof of Lemma 5. ∎
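
For intuition, the constrained best subset selection program in equation (37) can be solved exactly by brute-force enumeration when dd is small. The sketch below is our own illustration: it omits the projection onto the centered 2\ell_{2}-ball of radius rr, and practical instances require the mixed-integer or iterative solvers cited in the references (e.g., Bertsimas et al., 2016). It enumerates all supports containing a mandatory set, mimicking the constraint that the generalized support contains Sτ1S_{\tau-1}.

```python
import itertools
import numpy as np

def bss_ridge(X, Y, mandatory, max_size, lam):
    """Brute-force best subset ridge: minimize sum_t (Y_t - X_t^T theta)^2
    + lam * ||theta||^2 over theta supported on some S with
    mandatory subset of S and |S| <= max_size."""
    d = X.shape[1]
    best_theta, best_obj = None, np.inf
    free = [j for j in range(d) if j not in mandatory]
    for extra in range(max_size - len(mandatory) + 1):
        for comb in itertools.combinations(free, extra):
            S = sorted(mandatory + list(comb))
            G = lam * np.eye(len(S)) + X[:, S].T @ X[:, S]
            coef = np.linalg.solve(G, X[:, S].T @ Y)
            resid = Y - X[:, S] @ coef
            obj = resid @ resid + lam * coef @ coef
            if obj < best_obj:
                best_theta = np.zeros(d)
                best_theta[S] = coef
                best_obj = obj
    return best_theta

rng = np.random.default_rng(5)
d, n = 8, 40
theta_star = np.zeros(d)
theta_star[[0, 3]] = [1.0, -2.0]
X = rng.standard_normal((n, d))
Y = X @ theta_star + 0.1 * rng.standard_normal(n)
theta_hat = bss_ridge(X, Y, mandatory=[0], max_size=3, lam=0.1)
print("recovered support:", np.flatnonzero(np.abs(theta_hat) > 1e-8))
```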

5.2 Proof of Lemma 6

Lemma.

(Restatement of Lemma 6) With probability at least 1δ1-\delta, we have

[𝚪^τ,λ]SSc8σ2log(d/δ)|Eτ|.\big{\|}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{SS^{c}}\big{\|}_{\infty}\leq 8\sigma^{2}{\log(d/\delta)}\sqrt{|E_{\tau}|}.
Proof.

By the definition of matrix [𝚪^τ,λ]SSc[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]_{SS^{c}}, for every iSi\in S and jScj\in S^{c}, we have

[𝚪^τ,λ]ij=tEτ[Xt]i[Xt]j.\displaystyle\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{ij}=\sum_{t\in E_{\tau}}[X_{t}]_{i}[X_{t}]_{j}.

Recall that in epoch EτE_{\tau}, the SLUCB algorithm picks arms based only on the information in the dimensions supp+(θ^τ1,λ)\mathrm{supp}^{+}(\widehat{\theta}_{\tau-1,\lambda}). Also recall that the SLUCB algorithm requires

S=supp+(θ^τ,λ)supp+(θ^τ1,λ),S=\mathrm{supp}^{+}(\widehat{\theta}_{\tau,\lambda})\supseteq\mathrm{supp}^{+}(\widehat{\theta}_{\tau-1,\lambda}),

which implies that

Sc[supp+(θ^τ1,λ)]c.S^{c}\subseteq\big{[}\mathrm{supp}^{+}(\widehat{\theta}_{\tau-1,\lambda})\big{]}^{c}.

Since all covariates Xt,iX_{t,i} have independent coordinates, for each jScj\in S^{c}, {[Xt]j}tEτ\{[X_{t}]_{j}\}_{t\in E_{\tau}} are independent centered sub-Gaussian random variables. As a result, conditional on {[Xt]i}tEτ\{[X_{t}]_{i}\}_{t\in E_{\tau}}, the random variable [𝚪^τ,λ]ij[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]_{ij} is sub-Gaussian with parameter σ2tEτ[Xt]i2\sigma^{2}\cdot\sum_{t\in E_{\tau}}[X_{t}]_{i}^{2}. Note that

[𝚪^τ,λ]SSc=maxiS,jSc|[𝚪^τ,λ]ij|,\displaystyle\big{\|}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{SS^{c}}\big{\|}_{\infty}=\max_{i\in S,j\in S^{c}}\big{|}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{ij}\big{|},

which is the maximum of |S||Sc||S|\cdot|S^{c}| sub-Gaussian random variables whose parameters are uniformly upper bounded by σ2maxiStEτ[Xt]i2\sigma^{2}\cdot\max_{i\in S}\sum_{t\in E_{\tau}}[X_{t}]_{i}^{2}. By applying the concentration inequality for the maximum of sub-Gaussian random variables, we obtain with probability at least 1δ1-\delta that

[𝚪^τ,λ]SSc4σmaxiStEτ[Xt]i2log(d/δ)4σtEτ(maxiS[Xt]i)2log(d/δ).\displaystyle\big{\|}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{SS^{c}}\big{\|}_{\infty}\leq 4\sigma\sqrt{\max_{i\in S}\sum_{t\in E_{\tau}}[X_{t}]_{i}^{2}\cdot\log(d/\delta)}\leq 4\sigma\sqrt{\sum_{t\in E_{\tau}}\big{(}\max_{i\in S}[X_{t}]_{i}\big{)}^{2}\cdot\log(d/\delta)}.

Now we apply the concentration inequality for the maximum of sub-Gaussian random variables again. Then with probability at least 1δ1-\delta, for each tt,

maxiS|[Xt]i|2σlog(d/δ).\displaystyle\max_{i\in S}\big{|}[X_{t}]_{i}\big{|}\leq 2\sigma\sqrt{\log(d/\delta)}.

Hence, we obtain with probability at least 1δ1-\delta that

[𝚪^τ,λ]SSc8σ2log(d/δ)|Eτ|.\displaystyle\big{\|}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{SS^{c}}\big{\|}_{\infty}\leq 8\sigma^{2}{\log(d/\delta)}\sqrt{|E_{\tau}|}.

Similarly, we also have

[𝚪^τ,λ]ScS8σ2log(d/δ)|Eτ|.\displaystyle\big{\|}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{S^{c}S}\big{\|}_{\infty}\leq 8\sigma^{2}{\log(d/\delta)}\sqrt{|E_{\tau}|}.

with probability at least 1δ1-\delta, which concludes the proof of Lemma 6. ∎
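
The maximal inequality for sub-Gaussian random variables, used twice in this proof, can be checked by simulation. The sketch below (our own, with Gaussian coordinates as a special case of sub-Gaussian ones) estimates the failure probability of the threshold 2σ√(log(d/δ)) by Monte Carlo and confirms that it falls below δ.

```python
import numpy as np

rng = np.random.default_rng(6)
d, sigma, delta, reps = 500, 1.3, 0.05, 5000

thresh = 2.0 * sigma * np.sqrt(np.log(d / delta))
G = sigma * rng.standard_normal((reps, d))           # d Gaussian coordinates
failure_rate = np.mean(np.abs(G).max(axis=1) > thresh)
print(f"empirical failure rate {failure_rate:.4f} (should be <= {delta})")
assert failure_rate <= delta
```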

5.3 Proof of Lemma 7

Lemma.

(Restatement of Lemma 7) For an arbitrary constant ς>0\varsigma>0, when

n0max{ς2ρ1,σ2ρ2log(dτ~/δ)},n_{0}\gtrsim\max\{\varsigma^{2}\rho^{-1},\sigma^{2}\rho^{-2}{\log(d\widetilde{\tau}/\delta)}\},

with probability at least 1δ1-\delta, we have

θ^τ,λθ1νs/ςτ~log(kTd/(δλ)).\displaystyle\|\widehat{\theta}_{\tau,\lambda}-\theta^{*}\|_{1}\lesssim\nu s/\varsigma\cdot\sqrt{\widetilde{\tau}}\cdot\log\big{(}kTd/(\delta\lambda)\big{)}.
Proof.

To simplify notation, let

S~τ=Ssupp(θ)=supp+(θ^τ,λ)supp(θ).\widetilde{S}_{\tau}=S\cup\mathrm{supp}(\theta^{*})=\mathrm{supp}^{+}(\widehat{\theta}_{\tau,\lambda})\cup\mathrm{supp}(\theta^{*}).

Then we have |S~τ|s(τ+1)|\widetilde{S}_{\tau}|\leq s({\tau}+1). Recall that in the SLUCB algorithm, the first n0n_{0} arms in epoch EτE_{\tau} are chosen uniformly at random. Then by the matrix Bernstein inequality (Tropp, 2012) and Assumption 1 (A2), for an arbitrary constant ς>0\varsigma>0, when n0max{ς2ρ1,σ2ρ2log(dτ~/δ)}n_{0}\gtrsim\max\{\varsigma^{2}\rho^{-1},\sigma^{2}\rho^{-2}{\log(d\widetilde{\tau}/\delta)}\}, we have

[𝚪^τ,λ]S~τς2I|S~τ|,\displaystyle\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{\widetilde{S}_{\tau}}\succeq\varsigma^{2}\cdot\textbf{I}_{|\widetilde{S}_{\tau}|}, (47)

with probability at least 1δ1-\delta. Since supp(θ^τ,λθ)S~τ\mathrm{supp}(\widehat{\theta}_{\tau,\lambda}-\theta^{*})\subseteq\widetilde{S}_{\tau}, when inequality (47) holds, we have

(θ^τ,λθ)𝚪^τ,λ(θ^τ,λθ)=[θ^τ,λθ]S~τ[𝚪^τ,λ]S~τ[θ^τ,λθ]S~τς[θ^τ,λθ]S~τ.\sqrt{(\widehat{\theta}_{\tau,\lambda}-\theta^{*})^{\top}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}(\widehat{\theta}_{\tau,\lambda}-\theta^{*})}=\sqrt{[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{\widetilde{S}_{\tau}}^{\top}\big{[}\widehat{\mathbf{\Gamma}}_{\tau,\lambda}\big{]}_{\widetilde{S}_{\tau}}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{\widetilde{S}_{\tau}}}\geq\varsigma\cdot\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{\widetilde{S}_{\tau}}\big{\|}.

Hence, we obtain with probability at least 1δ1-\delta that

θ^τ,λθ1\displaystyle\|\widehat{\theta}_{\tau,\lambda}-\theta^{*}\|_{1} =[θ^τ,λθ]S~τ1s(τ~+1)[θ^τ,λθ]S~τ\displaystyle=\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{\widetilde{S}_{\tau}}\big{\|}_{1}\leq\sqrt{s(\widetilde{\tau}+1)}\cdot\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{\widetilde{S}_{\tau}}\big{\|}
[θ^τ,λθ]S~τ[𝚪^τ,λ]S~τs(τ~+1)/ςνs/ςτ~log(kTd/(δλ)),\displaystyle\leq\big{\|}[\widehat{\theta}_{\tau,\lambda}-\theta^{*}]_{\widetilde{S}_{\tau}}\big{\|}_{[\widehat{\mathbf{\Gamma}}_{\tau,\lambda}]_{\widetilde{S}_{\tau}}}\cdot\sqrt{s(\widetilde{\tau}+1)}/\varsigma\lesssim\nu s/\varsigma\cdot\sqrt{\widetilde{\tau}}\cdot\log\big{(}kTd/(\delta\lambda)\big{)}, (48)

where the last inequality follows from Lemma 5. Now we finish the proof of Lemma 7. ∎
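
The lower bound (47) expresses the fact that the n0n_{0} uniformly explored periods make the restricted Gram matrix well conditioned, which is what the matrix Bernstein inequality quantifies. A simulation sketch (our own, with an isotropic covariance ρI standing in for Assumption 1 (A2)) shows the smallest eigenvalue growing linearly in n0n_{0}.

```python
import numpy as np

rng = np.random.default_rng(7)
s_dim, rho, lam = 6, 0.8, 1.0      # restricted dimension, population curvature

for n0 in (50, 200, 800):
    X = np.sqrt(rho) * rng.standard_normal((n0, s_dim))  # E[x x^T] = rho * I
    Gamma = lam * np.eye(s_dim) + X.T @ X
    lam_min = np.linalg.eigvalsh(Gamma)[0]               # smallest eigenvalue
    print(f"n0 = {n0:4d}   lam_min = {lam_min:8.1f}   rho * n0 = {rho * n0:6.1f}")
```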

5.4 Proof of Lemma 8

Lemma.

(Restatement of Lemma 8) For each τ[τ~]\tau\in[\widetilde{\tau}] and tEτt\in E_{\tau}, we have

[θ^τ,λt1θ]Sτ1𝚪^τ1,λt12(ν2+σ2[θ]Sτ1c2)(slog2(kTd/(δλ))+λr2),\big{\|}[\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}]_{S_{\tau-1}}\big{\|}^{2}_{\widehat{\mathbf{\Gamma}}^{t-1}_{\tau-1,\lambda}}\lesssim\big{(}\nu^{2}+\sigma^{2}\cdot\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}^{2}\big{)}\cdot\big{(}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}\big{)},

with probability at least 1δ1-\delta.

Proof.

For each tEτt^{\prime}\in E_{\tau}, we have the decomposition

Yt=Xt,θ+εt\displaystyle Y_{t^{\prime}}=\langle X_{t^{\prime}},\theta^{*}\rangle+\varepsilon_{t^{\prime}} =[Xt]Sτ1,[θ]Sτ1+[Xt]Sτ1c,[θ]Sτ1c+εt\displaystyle=\big{\langle}[X_{t^{\prime}}]_{S_{\tau-1}},[\theta^{*}]_{S_{\tau-1}}\big{\rangle}+\big{\langle}[X_{t^{\prime}}]_{S^{c}_{\tau-1}},[\theta^{*}]_{S^{c}_{\tau-1}}\big{\rangle}+\varepsilon_{t^{\prime}}
=[Xt]Sτ1,[θ]Sτ1+ε~t.\displaystyle=\big{\langle}[X_{t^{\prime}}]_{S_{\tau-1}},[\theta^{*}]_{S_{\tau-1}}\big{\rangle}+\widetilde{\varepsilon}_{t^{\prime}}.

where

ε~t=[Xt]Sτ1c,[θ]Sτ1c+εt.\widetilde{\varepsilon}_{t^{\prime}}=\big{\langle}[X_{t^{\prime}}]_{S^{c}_{\tau-1}},[\theta^{*}]_{S^{c}_{\tau-1}}\big{\rangle}+\varepsilon_{t^{\prime}}.

Given the information up to epoch τ1\tau-1, for each tEτt^{\prime}\in E_{\tau} and i𝒜ti\in\mathcal{A}_{t^{\prime}}, by Assumption 1 (A3), [Xt,i]Sτ1c[X_{t^{\prime},i}]_{S^{c}_{\tau-1}} is independent of [Xt,i]Sτ1[X_{t^{\prime},i}]_{S_{\tau-1}}. Recall that in epoch τ\tau, all decisions on picking arms depend only on the information of the covariates in the dimensions Sτ1S_{\tau-1}. Hence, ε~t\widetilde{\varepsilon}_{t^{\prime}} is independent of the design [Xt]Sτ1[X_{t^{\prime}}]_{S_{\tau-1}}. Moreover, conditioning on the information up to epoch τ1\tau-1, {ε~t}tEτ\{\widetilde{\varepsilon}_{t^{\prime}}\}_{t^{\prime}\in E_{\tau}} are independent centered sub-Gaussian random variables with parameter

ν2+σ2[θ]Sτ1c2.\nu^{2}+\sigma^{2}\cdot\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}^{2}.

Recall the definition of θ^τ,λt1\widehat{\theta}_{\tau,\lambda}^{t-1}, i.e.,

θ^τ,λt1=argminsupp+(θ)=Sτ1tEτt1|YtXt,θ|2+λθ2,\widehat{\theta}_{\tau,\lambda}^{t-1}=\operatorname*{argmin}_{\mathrm{supp}^{+}(\theta)=S_{\tau-1}}\sum_{t^{\prime}\in E_{\tau}^{t-1}}\big{|}Y_{t^{\prime}}-\langle X_{t^{\prime}},\theta\rangle\big{|}^{2}+\lambda\|\theta\|^{2},

where Eτt1={t:tt1,tEτ}E_{\tau}^{t-1}=\{t^{\prime}:t^{\prime}\leq t-1,t^{\prime}\in E_{\tau}\}. Then by using the confidence region of the ridge regression estimator (i.e., Lemma 10 provided in Section 6), we have with probability at least 1δ1-\delta that

[θ^τ,λt1θ]Sτ1𝚪^τ1,λt1[θ^τ,λt1θ]Sτ1=[θ^τ,λt1θ]Sτ1𝚪^τ1,λt12\displaystyle[\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}]_{S_{\tau-1}}^{\top}\widehat{\mathbf{\Gamma}}^{t-1}_{\tau-1,\lambda}[\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}]_{S_{\tau-1}}=\big{\|}[\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}]_{S_{\tau-1}}\big{\|}^{2}_{\widehat{\mathbf{\Gamma}}^{t-1}_{\tau-1,\lambda}}
(ν2+σ2[θ]Sτ1c2)log(det(𝚪^τ1,λt1)1/2det(λI|Sτ1|)1/2δ)+λr2\displaystyle\lesssim\big{(}\nu^{2}+\sigma^{2}\cdot\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}^{2}\big{)}\cdot\log\Big{(}\frac{\det(\widehat{\mathbf{\Gamma}}^{t-1}_{\tau-1,\lambda})^{1/2}\det(\lambda\textbf{I}_{|S_{\tau-1}|})^{-1/2}}{\delta}\Big{)}+\lambda r^{2}

where

𝚪^τ1,λt1=λI|Sτ1|+tEτt1[Xt]Sτ1[Xt]Sτ1.\widehat{\mathbf{\Gamma}}^{t-1}_{\tau-1,\lambda}=\lambda\textbf{I}_{|S_{\tau-1}|}+\sum_{t^{\prime}\in E^{t-1}_{\tau}}[X_{t^{\prime}}]_{S_{\tau-1}}[X_{t^{\prime}}]_{S_{\tau-1}}^{\top}.

Similar to the proof of Lemma 5, we have

det(𝚪^τ1,λt1)(λ+Tσ2log2(kTd/δ)τs)τs,\displaystyle\det\big{(}\widehat{\mathbf{\Gamma}}^{t-1}_{\tau-1,\lambda}\big{)}\lesssim\Big{(}\lambda+\frac{T\sigma^{2}\log^{2}(kTd/\delta)}{\tau s}\Big{)}^{\tau s},

with probability at least 1δ1-\delta. Hence, we obtain with probability at least 1δ1-\delta that

[θ^τ,λt1θ]Sτ1𝚪^τ1,λt12(ν2+σ2[θ]Sτ1c2)τslog(δ1+Tσ2log2(kTd/δ)δλτs)+λr2\displaystyle\big{\|}[\widehat{\theta}_{\tau,\lambda}^{t-1}-\theta^{*}]_{S_{\tau-1}}\big{\|}^{2}_{\widehat{\mathbf{\Gamma}}^{t-1}_{\tau-1,\lambda}} \lesssim\big{(}\nu^{2}+\sigma^{2}\cdot\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}^{2}\big{)}\cdot\tau s\cdot\log\Big{(}\delta^{-1}+\frac{T\sigma^{2}\log^{2}(kTd/\delta)}{\delta\lambda\tau s}\Big{)}+\lambda r^{2}
(ν2+σ2[θ]Sτ1c2)(slog2(kTd/(δλ))+λr2),\displaystyle\lesssim\big{(}\nu^{2}+\sigma^{2}\cdot\big{\|}[\theta^{*}]_{S^{c}_{\tau-1}}\big{\|}^{2}\big{)}\cdot\big{(}s\log^{2}\big{(}{kTd}/{(\delta\lambda)}\big{)}+\lambda r^{2}\big{)},

which concludes the proof of Lemma 8. ∎

6 Other Auxiliary Lemmas

In this section, we list the results established by Abbasi-Yadkori et al. (2011), which are used extensively in our proofs.

Lemma 9.

(Self-normalized Martingale Concentration Inequality) Let {Xt}t1\{X_{t}\}_{t\geq 1} be an d\mathbb{R}^{d}-valued stochastic process and {ηt}t1\{\eta_{t}\}_{t\geq 1} be a real-valued stochastic process. Let

t=σ(X1,,Xt+1,η1,,ηt)\mathcal{F}_{t}=\sigma(X_{1},\ldots,X_{t+1},\eta_{1},\ldots,\eta_{t})

be the natural filtration generated by {Xt}t1\{X_{t}\}_{t\geq 1} and {ηt}t1\{\eta_{t}\}_{t\geq 1}. Assume that for each tt, ηt\eta_{t} is centered and sub-Gaussian with variance proxy R2R^{2} given t1\mathcal{F}_{t-1}, i.e.,

𝔼[ηt|t1]\displaystyle\mathbb{E}[\eta_{t}|\mathcal{F}_{t-1}] =0,\displaystyle=0,
𝔼[exp{ληt}|t1]\displaystyle\mathbb{E}\big{[}\exp\{\lambda\eta_{t}\}\big{|}\mathcal{F}_{t-1}\big{]} exp{λ2R2/2},λ.\displaystyle\leq\exp\{\lambda^{2}R^{2}/2\},\ \forall\lambda\in\mathbb{R}.

Also assume that 𝐕\mathbf{V} is a d×dd\times d positive definite matrix. For any t0t\geq 0, we define

𝚪^t=𝐕+u=1tXuXu,𝐒t=u=1tηuXu.\displaystyle\widehat{\mathbf{\Gamma}}_{t}=\mathbf{V}+\sum_{u=1}^{t}X_{u}X_{u}^{\top},\ \mathbf{S}_{t}=\sum_{u=1}^{t}\eta_{u}X_{u}.

Then for any δ>0\delta>0, with probability at least 1δ1-\delta, for all t0t\geq 0,

𝐒t𝚪^t1R2log(det(𝚪^t)1/2det(𝐕)1/2δ).\displaystyle\|\mathbf{S}_{t}\|_{\widehat{\mathbf{\Gamma}}_{t}^{-1}}\leq R\sqrt{2\log\Big{(}\frac{\det(\widehat{\mathbf{\Gamma}}_{t})^{1/2}\det(\mathbf{V})^{-1/2}}{\delta}\Big{)}}.
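
A direct simulation of Lemma 9 (our own sketch, with Gaussian noise as a sub-Gaussian instance with R = 1 and 𝐕 = I) illustrates that the self-normalized statistic stays below the stated radius uniformly over time, with frequency well above 1 − δ.

```python
import numpy as np

rng = np.random.default_rng(8)
d, T, delta, R, reps = 4, 100, 0.05, 1.0, 1000

violations = 0
for _ in range(reps):
    Gamma = np.eye(d)                      # V = I_d, so log det V = 0
    S = np.zeros(d)
    ok = True
    for _ in range(T):
        x = rng.standard_normal(d)
        eta = R * rng.standard_normal()    # centered, sub-Gaussian with proxy R^2
        Gamma += np.outer(x, x)
        S += eta * x
        stat = np.sqrt(S @ np.linalg.solve(Gamma, S))   # ||S_t||_{Gamma_t^{-1}}
        radius = R * np.sqrt(2.0 * (np.log(1.0 / delta)
                                    + 0.5 * np.linalg.slogdet(Gamma)[1]))
        if stat > radius:
            ok = False
            break
    violations += (not ok)
print(f"uniform-in-time violation frequency {violations / reps:.4f} (delta = {delta})")
```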
Lemma 10.

(Confidence Region of Ridge Regression Estimator under Dependent Design) Consider the stochastic processes {(Xt,Yt,ηt)}t1\{(X_{t},Y_{t},\eta_{t})\}_{t\geq 1}. Assume that {Xt}t1\{X_{t}\}_{t\geq 1} and {ηt}t1\{\eta_{t}\}_{t\geq 1} satisfy the conditions in Lemma 9 and that Yt=Xt,θ+ηtY_{t}=\langle X_{t},\theta^{*}\rangle+\eta_{t}. Also assume that θ2r\|\theta^{*}\|_{2}\leq r. For any t0t\geq 0, let

θ¯t=(𝐗1:t𝐗1:t+λId)1𝐗1:t𝐘1:t\overline{\theta}_{t}=\big{(}\mathbf{X}_{1:t}^{\top}\mathbf{X}_{1:t}+\lambda\textbf{I}_{d}\big{)}^{-1}\mathbf{X}_{1:t}^{\top}\mathbf{Y}_{1:t}

be the ridge regression estimator, where 𝐗1:t=(X1,,Xt)\mathbf{X}_{1:t}=(X_{1},\ldots,X_{t})^{\top}, 𝐘1:t=(Y1,,Yt)\mathbf{Y}_{1:t}=(Y_{1},\ldots,Y_{t})^{\top} and IdI_{d} is the dd-dimensional identity matrix. Then with probability at least 1δ1-\delta, for all t0t\geq 0,

θ¯tθ𝚪^tR2log(det(𝚪^t)1/2det(λId)1/2δ)+λ1/2r.\displaystyle\|\overline{\theta}_{t}-\theta^{*}\|_{\widehat{\mathbf{\Gamma}}_{t}}\leq R\sqrt{2\log\Big{(}\frac{\det(\widehat{\mathbf{\Gamma}}_{t})^{1/2}\det(\lambda\textbf{I}_{d})^{-1/2}}{\delta}\Big{)}}+\lambda^{1/2}r.

where 𝚪^t=λId+u=1tXuXu\widehat{\mathbf{\Gamma}}_{t}=\lambda\textbf{I}_{d}+\sum_{u=1}^{t}X_{u}X_{u}^{\top}.
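
A matching sanity check for Lemma 10 (ours, in the same Gaussian setup as the previous sketch) compares the ridge error in the 𝚪̂-norm against the stated radius at a single time step.

```python
import numpy as np

rng = np.random.default_rng(9)
d, T, lam, delta, R, r = 4, 200, 1.0, 0.05, 1.0, 1.0

theta_star = rng.standard_normal(d)
theta_star *= r / np.linalg.norm(theta_star)     # enforce ||theta*||_2 <= r

X = rng.standard_normal((T, d))                  # rows are X_t^T
Y = X @ theta_star + R * rng.standard_normal(T)
Gamma = lam * np.eye(d) + X.T @ X
theta_bar = np.linalg.solve(Gamma, X.T @ Y)      # ridge regression estimator

err = theta_bar - theta_star
stat = np.sqrt(err @ Gamma @ err)                # ||theta_bar - theta*||_{Gamma_t}
radius = (R * np.sqrt(2.0 * (np.log(1.0 / delta)
          + 0.5 * (np.linalg.slogdet(Gamma)[1] - d * np.log(lam))))
          + np.sqrt(lam) * r)
print(f"stat = {stat:.3f}, radius = {radius:.3f}, bound holds: {stat <= radius}")
```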


References

  • Abbasi-Yadkori et al. [2011] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • Abbasi-Yadkori et al. [2012] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
  • Abe et al. [2003] N. Abe, A. W. Biermann, and P. M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
  • Agarwal et al. [2010] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of annual Conference on Learning Theory (COLT). Citeseer, 2010.
  • Agarwal et al. [2014] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of International Conference on Machine Learning (ICML), 2014.
  • Auer [2002] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Auer et al. [1995] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 1995.
  • Auer et al. [2002] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
  • Balasubramanian and Ghadimi [2018] K. Balasubramanian and S. Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2018.
  • Bastani and Bayati [2015] H. Bastani and M. Bayati. Online decision-making with high-dimensional covariates. 2015. Available at SSRN: https://ssrn.com/abstract=2661896.
  • Bertsimas et al. [2016] D. Bertsimas, A. King, and R. Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
  • Besbes et al. [2015] O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
  • Bickel et al. [2009] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
  • Blumensath and Davies [2009] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
  • Bubeck et al. [2017] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2017.
  • Bühlmann and Van De Geer [2011] P. Bühlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
  • Candes and Tao [2007] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
  • Carpentier and Munos [2012] A. Carpentier and R. Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
  • Chakraborty et al. [2010] B. Chakraborty, S. Murphy, and V. Strecher. Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical Methods in Medical Research, 19(3):317–343, 2010.
  • Chu et al. [2011] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
  • Dani et al. [2008] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of annual Conference on Learning Theory (COLT), 2008.
  • Donoho [2006] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
  • Filippi et al. [2010] S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2010.
  • Flaxman et al. [2005] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the annual ACM-SIAM symposium on Discrete algorithms (SODA), 2005.
  • Foster et al. [2016] D. Foster, S. Kale, and H. Karloff. Online sparse linear regression. In Proceedings of annual Conference on Learning Theory (COLT), 2016.
  • Foster et al. [2018] D. J. Foster, A. Agarwal, M. Dudík, H. Luo, and R. E. Schapire. Practical contextual bandits with regression oracles. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
  • Gerchinovitz [2013] S. Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research, 14(Mar):729–769, 2013.
  • Goldberg and Kosorok [2012] Y. Goldberg and M. R. Kosorok. Q-learning with censored data. Annals of Statistics, 40(1):529, 2012.
  • Goldenshluger and Zeevi [2013] A. Goldenshluger and A. Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
  • Hammer et al. [1996] S. M. Hammer, D. A. Katzenstein, M. D. Hughes, H. Gundacker, R. T. Schooley, R. H. Haubrich, W. K. Henry, M. M. Lederman, J. P. Phair, M. Niu, et al. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. New England Journal of Medicine, 335(15):1081–1090, 1996.
  • Keyvanshokooh et al. [2019] E. Keyvanshokooh, M. Zhalechian, C. Shi, M. P. Van Oyen, and P. Kazemian. Contextual learning with online convex optimization: Theory and application to chronic diseases. Available at SSRN, 2019.
  • Krause and Ong [2011] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2011.
  • Lai and Robbins [1985] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • Langford et al. [2009] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10(Mar):777–801, 2009.
  • Lattimore et al. [2015] T. Lattimore, K. Crammer, and C. Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015.
  • Li et al. [2010] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the International Conference on World Wide Web (WWW). ACM, 2010.
  • Li et al. [2011] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2011.
  • Li et al. [2017] L. Li, Y. Lu, and D. Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of International Conference on Machine Learning (ICML), 2017.
  • Miller [2002] A. Miller. Subset selection in regression. Chapman and Hall/CRC, 2002.
  • Moodie et al. [2007] E. E. Moodie, T. S. Richardson, and D. A. Stephens. Demystifying optimal dynamic treatment regimes. Biometrics, 63(2):447–455, 2007.
  • Murphy [2003] S. A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355, 2003.
  • Murphy [2005] S. A. Murphy. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine, 24(10):1455–1481, 2005.
  • Natarajan [1995] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
  • Nemirovsky and Yudin [1983] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. SIAM, 1983.
  • Orellana et al. [2010a] L. Orellana, A. Rotnitzky, and J. M. Robins. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. The International Journal of Biostatistics, 6(2), 2010a.
  • Orellana et al. [2010b] L. Orellana, A. Rotnitzky, and J. M. Robins. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part II: proofs of results. The International Journal of Biostatistics, 6(2), 2010b.
  • Pilanci et al. [2015] M. Pilanci, M. J. Wainwright, and L. El Ghaoui. Sparse learning via boolean relaxations. Mathematical Programming (Series B), 151(1):63–87, 2015.
  • Raskutti et al. [2010] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
  • Robins et al. [2008] J. Robins, L. Orellana, and A. Rotnitzky. Estimation and extrapolation of optimal treatment and testing strategies. Statistics in Medicine, 27(23):4678–4721, 2008.
  • Robins et al. [2000] J. M. Robins, M. A. Hernan, and B. Brumback. Marginal structural models and causal inference in epidemiology, 2000.
  • Rusmevichientong and Tsitsiklis [2010] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  • Shamir [2013] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Proceedings of annual Conference on Learning Theory (COLT), 2013.
  • Shamir [2015] O. Shamir. On the complexity of bandit linear optimization. In Proceedings of annual Conference on Learning Theory (COLT), 2015.
  • Song et al. [2015] R. Song, W. Wang, D. Zeng, and M. R. Kosorok. Penalized Q-learning for dynamic treatment regimens. Statistica Sinica, 25(3):901, 2015.
  • Szepesvari [2016] C. Szepesvari. Lower bounds for stochastic linear bandits. http://banditalgs.com/2016/10/20/lower-bounds-for-stochastic-linear-bandits/, 2016. Accessed: Oct 15, 2018.
  • Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • Tropp [2012] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
  • Wainwright [2009] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using 1\ell_{1}-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
  • Wang et al. [2017] Y. Wang, S. Du, S. Balakrishnan, and A. Singh. Stochastic zeroth-order optimization in high dimensions. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
  • Watkins and Dayan [1992] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
  • Zhang et al. [2012a] B. Zhang, A. A. Tsiatis, M. Davidian, M. Zhang, and E. Laber. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114, 2012a.
  • Zhang et al. [2012b] B. Zhang, A. A. Tsiatis, E. B. Laber, and M. Davidian. A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010–1018, 2012b.
  • Zhao et al. [2012] Y. Zhao, D. Zeng, A. J. Rush, and M. R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.
  • Zhao et al. [2014] Y.-Q. Zhao, D. Zeng, E. B. Laber, R. Song, M. Yuan, and M. R. Kosorok. Doubly robust learning for estimating individualized treatment with censored data. Biometrika, 102(1):151–168, 2014.