
Survey Bandits with Regret Guarantees

Sanath Kumar Krishnamurthy    Susan Athey
Abstract

We consider a variant of the contextual bandit problem. In standard contextual bandits, when a user arrives we get the user’s complete feature vector and then assign a treatment (arm) to that user. In a number of applications (like healthcare), collecting features from users can be costly. To address this issue, we propose algorithms that avoid needless feature collection while maintaining strong regret guarantees.

Machine Learning, Contextual bandits

1 Stanford University
The first author is supported in part by a Dantzig-Lieberman Operations Research Fellowship and a Management Science and Engineering Department Fellowship.
The second author is supported in part by the Sloan Foundation, Schmidt Futures, and ONR grant N00014-17-1-2131.

1 Introduction

In a number of applications, like healthcare and mobile health, collecting features from users can be costly. This encourages us to develop variants of contextual bandits that at any time-step select a set of features to be collected from users, and then select an action based on these observed features. We use the term survey bandits to refer to this set-up. Through this formulation, we address the issue of needless feature collection in contextual bandits.

Suppose we are building a system to recommend charities that users can donate to. Since there are many charities we could recommend, it is efficient to make personalized recommendations. This requires us to collect features from users, which can be done by asking users to fill out a survey/questionnaire. The reward at any time-step might be the amount donated. As usual, our goal is to minimize regret. Beyond regret, we would also like to improve user experience by shortening the survey we require users to fill out. We try to answer the following question: Can we ensure strong regret guarantees, while being able to shorten our survey over time?

Our contributions: We answer this question in the affirmative. We start by considering zero-shot surveys, where the decision maker has to decide the set of features to be queried at every time-step before the user arrives, and then make a personalized decision based on the responses. We now state our assumptions. In addition to the standard assumptions made for LinUCB, we introduce Assumption 1 (beta-min), which is common in the feature selection literature. We propose algorithms that are natural variants of LinUCB (Li et al., 2010) for the survey bandits framework. Under our assumptions, we prove regret guarantees for these algorithms. At the same time, these algorithms exploit sparsity in arm parameters to reduce the set of features collected from users.

In fact, our algorithm Ridge SurveyUCB has regret guarantees that are tight even for standard contextual bandits, O(\sqrt{T}\log(T)). We also provide an algorithm, Elastic Net SurveyUCB, which is more robust to Assumption 1, but with a weaker O((T\log(T))^{3/4}) regret guarantee. This result requires us to prove a new adaptive tail bound for the elastic net estimator, which may be of independent interest.

Through simulations, when Assumption 1 holds, we demonstrate that both algorithms perform well in terms of regret. Unfortunately, Ridge SurveyUCB can perform poorly on regret when Assumption 1 is violated. Fortunately, Elastic Net SurveyUCB performs well even when Assumption 1 is violated. It is worth noting that we can still use Ridge SurveyUCB; we just need to use a conservative choice for the beta-min parameter (in Assumption 1). Even for conservative choices of the beta-min parameter, we eventually see benefits in survey length.

Through simulations, we also see that both algorithms demonstrate savings in terms of survey length. In fact, in the absence of sub-optimal arms, both algorithms always remove the features that are not relevant for reward prediction in the survey.

We also consider settings with interactive surveys where, at each time-step, before making a personalized decision, the decision maker can continually expand the set of queries based on previous responses by the user at that time-step. This allows us to start by querying a smaller set of features and expand our survey for the user only if it is needed to make a better personalized decision. We develop variants of our algorithms that use interactive surveys to ensure lower survey lengths, especially in the presence of sub-optimal arms, while maintaining the same regret performance.

Related work: Prior work (e.g. (Abbasi-Yadkori et al., 2012) and (Bastani & Bayati, 2015)) has also exploited sparsity in arm parameters to provide stronger sparsity-dependent regret guarantees. When an upper bound on the sparsity of arm-parameters is known to the decision-maker, Abbasi-Yadkori et al. (2012) provide algorithms with tight regret guarantees for contextual bandits. Under distributional assumptions on user covariates, Bastani & Bayati (2015) provide stronger regret guarantees, even when no sparsity parameter is provided to the algorithm. While these regret guarantees are stronger than the ones we provide, it is important to note that we make no distributional assumption on user covariates and we do not assume that the decision-maker knows an upper bound on the sparsity of arm-parameters.

Bouneffouf et al. (2017) is the most closely related paper to our work. They develop interesting algorithms for a very similar setup and evaluate them empirically. At every time-step, their algorithms query a fixed number of features (s). Their algorithms require this parameter s as an input. For a conservative choice of s, we would see little benefit to the survey length. Unfortunately, their algorithms can empirically perform much worse than contextual bandits for small choices of s. We view our work as an alternate approach with guarantees on regret.

2 Preliminaries

2.1 Problem Setting

Let T denote the number of time-steps. At each time-step, a new user arrives. The decision maker has a survey with d questions and can ask a user any subset of these questions. At any time-step t, a user comes with a set of observable covariates X_{t}. Here X_{t} is a d-vector, corresponding to the t-th user's observable answers to the d questions on the survey.

The decision maker has access to K arms (decisions). Each arm yields a noisy, user-specific reward. In particular, each arm i\in[K] has an unknown parameter \beta_{i}\in\mathbb{R}^{d}. At time t, pulling arm i yields the reward Y_{i}(t):=X_{t}^{\mathsf{T}}\beta_{i}+\epsilon_{i,t}, where the \epsilon_{i,t} form an independent sequence of \sigma-sub-Gaussian random variables. Note that at any time-step, the decision maker can only observe the reward of the arm that was pulled.
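As a concrete illustration, the reward model above can be simulated in a few lines. This is a toy sketch: the dimensions, noise level, and function names are our own choices, not from the paper, and Gaussian noise is used as one example of a \sigma-sub-Gaussian distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, sigma = 5, 3, 0.1           # toy dimensions and noise scale (illustrative)
betas = rng.normal(size=(K, d))   # unknown arm parameters beta_1, ..., beta_K

def pull_arm(i, x):
    """Noisy reward Y_i(t) = x^T beta_i + eps_{i,t}; Gaussian noise is sigma-sub-Gaussian."""
    return x @ betas[i] + sigma * rng.normal()
```

Only the pulled arm's reward is revealed to the learner at each time-step.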

Some notation: For any vector z\in\mathbb{R}^{d} and any index set I\subseteq[d], we let z_{I}\in\mathbb{R}^{d} denote the vector obtained by setting the coordinates of z that are not in I to zero. For any matrix A\in\mathbb{R}^{d\times d} and any index set I\subseteq[d], we let A_{I}\in\mathbb{R}^{d\times d} denote the matrix obtained by setting the rows and columns of A that do not correspond to an index in I to zero. For any z\in\mathbb{R}^{d}, we let \text{Supp}(z) denote the support of z. And for any set \zeta\subseteq\mathbb{R}^{d}, we let \text{Supp}(\zeta):=\cup_{z\in\zeta}\text{Supp}(z).

Goal: The goal is to design a sequential decision-making policy \pi=(\pi^{s},\pi^{a}) that maximizes expected cumulative reward and, subject to strong reward guarantees, minimizes the total number of questions asked to users. Let \pi^{s}_{t}\in 2^{[d]} denote the subset of survey questions queried by policy \pi at time t\in[T], and let \pi^{a}_{t}\in[K] denote the arm chosen by policy \pi at time t\in[T]. Note that we do not observe X_{t} and only observe (X_{t})_{\pi^{s}_{t}}; hence we must be able to choose the arm \pi^{a}_{t} using only the observed covariates (X_{t})_{\pi^{s}_{t}} and data collected from previous time-steps.

Target policy: We now describe a "sensible" target policy. Consider the target policy \pi^{*}=(\pi^{s*},\pi^{a*}) that already knows the arm parameters \{\beta_{i}\}_{i\in[K]} but not the noise parameters. We want the target policy to maximize expected cumulative reward, and hence at any time-step t the policy must pick an arm \pi^{a*}_{t}\in\arg\max_{j}X_{t}^{\mathsf{T}}\beta_{j}. Therefore, the target policy only needs to query features that influence arm rewards. That is,

\pi_{t}^{s*}:=\bigcup_{i\in[K]}\pi^{s*}_{i,t},\text{ where }\pi^{s*}_{i,t}:=\text{Supp}(\beta_{i}).

Note that for any vector z\in\mathbb{R}^{d} and any arm i, we have z^{\mathsf{T}}\beta_{i}=z_{\pi^{s*}_{t}}^{\mathsf{T}}\beta_{i}. Therefore at any time-step t, having observed (X_{t})_{\pi^{s*}_{t}}, the target policy is able to choose the best arm for the covariate X_{t}:

\pi^{a*}_{t}\in\arg\max_{j}X_{t}^{\mathsf{T}}\beta_{j}=\arg\max_{j}(X_{t})_{\pi_{t}^{s*}}^{\mathsf{T}}\beta_{j}.

Regret: If \pi^{a}_{t}=i, we define the expected regret at time t as the difference between the maximum expected reward and the expected reward of arm i at time t, i.e., r_{t}:=\max_{j}X_{t}^{\mathsf{T}}\beta_{j}-X_{t}^{\mathsf{T}}\beta_{i}. And we define the cumulative expected regret as R_{T}:=\sum_{t=1}^{T}r_{t}.
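The regret definitions translate directly to code; a minimal sketch (the function names are ours, for illustration):

```python
import numpy as np

def instantaneous_regret(x, betas, i):
    """r_t = max_j x^T beta_j - x^T beta_i, for the pulled arm i."""
    expected_rewards = betas @ x          # one expected reward per arm
    return expected_rewards.max() - expected_rewards[i]

def cumulative_regret(xs, betas, arms):
    """R_T = sum_t r_t over the sequence of contexts and pulled arms."""
    return sum(instantaneous_regret(x, betas, i) for x, i in zip(xs, arms))
```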

Additional notation: Let \mathbf{X}\in\mathbb{R}^{T\times d} denote the design matrix, whose rows are X_{t}. Let Y_{i}\in\mathbb{R}^{T} denote the vector of observations X_{t}^{\mathsf{T}}\beta_{i}+\epsilon_{i,t}, where entries of Y_{i} may be missing (if arm i wasn't played at time t, then the t-th coordinate of Y_{i} is missing). For all i\in[K] and any n\in[T], define the sample set S_{i,n}:=\{t\,|\,\pi_{t}^{a}=i,t\leq n\} and let S_{i}:=\{t\,|\,\pi_{t}^{a}=i\}\subseteq[T]. Let n_{i,t} denote the number of times arm i was pulled up to time t (i.e., n_{i,t}=|S_{i,t}|). For any S^{\prime}\subseteq[T], we let \mathbf{X}(S^{\prime}) be the |S^{\prime}|\times d submatrix of \mathbf{X} whose rows are X_{t} for each t\in S^{\prime}. Similarly, when S^{\prime}\subseteq S_{i}, we let Y_{i}(S^{\prime}) be the |S^{\prime}|-vector whose coordinates are Y_{i}(t) for all t\in S^{\prime} (since S^{\prime}\subseteq S_{i} implies \pi_{t}^{a}=i for all t\in S^{\prime}, the vector Y_{i}(S^{\prime}) has no missing values). Also let I\in\mathbb{R}^{d\times d} denote the identity matrix.

2.2 Assumptions

We make two assumptions. The first allows us to ignore features that have small influence on arm rewards; we assume that such features in fact have no influence on arm rewards.

Assumption 1 (Beta-min).

The decision maker knows a parameter \beta_{\text{min}}\in\mathbb{R}_{+} such that for all arms i\in[K] and all q\in[d], either \|(\beta_{i})_{\{q\}}\|_{1}=0 or \|(\beta_{i})_{\{q\}}\|_{1}>\beta_{\text{min}}.

The following is a common assumption in contextual multi-armed bandit problems; it is equivalent to assuming that expected rewards are bounded.

Assumption 2 (Bounded rewards).

We make the following assumptions to ensure that the expected reward of any arm for any context is bounded: for all t\in[T] and i\in[K], we have \|X_{t}\|_{1}\leq L and \|\beta_{i}\|_{1}\leq b.
For simplicity, we further assume that for all t\in[T] and i\in[K], the potential reward of pulling arm i lies in [0,1], i.e., Y_{i}(t)\in[0,1]. (We can avoid this assumption; we would just need to choose \alpha\geq\max(1,L) in definition 2 for the regret guarantee to work out, where \|X_{t}\|_{2}\leq L for all t.)

Since for any vector v\in\mathbb{R}^{d} we have \|v\|_{p}\leq\|v\|_{1} for all p\geq 1, Assumption 2 gives us that the p-norms of X_{t} and \beta_{i} are bounded by L and b, respectively, for all time-steps t and arms i\in[K].

3 UCB for Survey Bandits

In this section, we describe a natural extension of LinUCB (Li et al., 2010) for survey bandits which involves describing a policy for selecting survey questions and a consistent arm selection policy that can choose an arm given the observed covariates.

Algorithm 1 SurveyUCB
1:  Initialize confidence sets. (See section 4.1)
2:  for t:=1,2,\dots do
3:     Let \pi^{s}_{k,t}\leftarrow\text{Supp}(C_{k,t-1}) for all k\in[K].
4:     Query \pi^{s}_{t}:=\bigcup_{k\in[K]}\pi^{s}_{k,t}, and observe (X_{t})_{\pi^{s}_{t}}.
5:     Pick \pi^{a}_{t}\in\arg\max_{k\in[K]}\max_{\beta\in C_{k,t-1}}(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta.
6:     Play arm \pi^{a}_{t} and observe reward Y_{t}.
7:     Set C_{k,t}\leftarrow C_{k,t-1} for all k\in[K]\setminus\{\pi^{a}_{t}\}.
8:     Update C_{\pi^{a}_{t},t} (using AlgConfidence).
9:  end for

Upper confidence bound (UCB) algorithms follow the principle of optimism in the face of uncertainty. The essential idea is to construct high-probability confidence sets C_{i,t-1}\subseteq\mathbb{R}^{d} for the parameter \beta_{i} of every arm i\in[K], from the observed covariates (X_{1})_{\pi^{s}_{1}},(X_{2})_{\pi^{s}_{2}},\dots,(X_{t-1})_{\pi^{s}_{t-1}} and rewards Y_{1},Y_{2},\dots,Y_{t-1}. That is, with high probability, \beta_{i}\in C_{i,t-1} for all time-steps t and all arms i\in[K]. The algorithm then queries the set of features:

\pi^{s}_{t}:=\bigcup_{i\in[K]}\pi^{s}_{i,t},\text{ with }\pi^{s}_{i,t}=\text{Supp}(C_{i,t-1})\text{ for all }i\in[K].

Therefore, we have \beta=\beta_{\pi^{s}_{t}} for all \beta\in\bigcup_{i\in[K]}C_{i,t-1}. Hence, for any z\in\mathbb{R}^{d} and \beta\in\bigcup_{i\in[K]}C_{i,t-1}, we have z^{\mathsf{T}}\beta=z_{\pi^{s}_{t}}^{\mathsf{T}}\beta. It follows that:

(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta=X_{t}^{\mathsf{T}}\beta,\text{ for all }\beta\in\bigcup_{i\in[K]}C_{i,t-1}.

The algorithm chooses an optimistic estimate \tilde{\beta}_{i,t}\in\arg\max_{\beta\in C_{i,t-1}}X_{t}^{\mathsf{T}}\beta for every arm i\in[K], and then chooses the arm \pi_{t}^{a}\in\arg\max_{i\in[K]}X_{t}^{\mathsf{T}}\tilde{\beta}_{i,t} that maximizes reward according to the optimistic estimates. Equivalently, and more compactly, the algorithm chooses the arm:

\pi_{t}^{a}\in\arg\max_{k\in[K]}\max_{\beta\in C_{k,t-1}}X_{t}^{\mathsf{T}}\beta=\arg\max_{k\in[K]}\max_{\beta\in C_{k,t-1}}(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta.

Note that the arm \pi^{a}_{t} is chosen given only the observed covariates (X_{t})_{\pi^{s}_{t}}. We call the resulting algorithm SurveyUCB.
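For confidence sets in Standard form (section 4.1), the inner maximization in the arm-selection rule has a closed form: over the ellipsoid \|\beta-\hat{\beta}\|_{D}\leq\theta, the maximum of x^{\mathsf{T}}\beta is x^{\mathsf{T}}\hat{\beta}+\theta\|x\|_{D^{-1}}. A sketch, assuming x, \hat{\beta}, and D have already been restricted to the support coordinates (the function name is ours):

```python
import numpy as np

def ucb_index(x, beta_hat, D, theta):
    """Optimistic value max_{||b - beta_hat||_D <= theta} x^T b
       = x^T beta_hat + theta * ||x||_{D^{-1}}."""
    D_inv_x = np.linalg.solve(D, x)   # D^{-1} x without forming the inverse
    return x @ beta_hat + theta * np.sqrt(x @ D_inv_x)
```

The algorithm then plays the arm with the largest index.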

4 Confidence Sets for SurveyUCB

In this section, we define confidence sets in Standard form and describe the construction of these sets using AlgConfidence. It will turn out that AlgConfidence constructs confidence sets in Standard form. Throughout this manuscript, we use AlgConfidence to construct confidence sets for SurveyUCB.

4.1 Confidence Sets in Standard Form

We start by defining weighted norms and use that to define confidence sets in Standard form.

Definition 1 (Weighted norm).

For any vector z\in\mathbb{R}^{d} and any positive semi-definite matrix A, we define the weighted norm of z with respect to A as \|z\|_{A}:=\sqrt{z^{\mathsf{T}}Az}. (\|\cdot\|_{A} is a norm when A is positive definite, and a seminorm when A is positive semi-definite.)

Definition 2 (Standard form).

We say confidence sets \{C_{i,t-1}\}_{(i,t)\in[K]\times[T]} are in Standard form if at any time-step t and for any arm i\in[K], the set C_{i,t-1} is a ball centered around our estimate \hat{\beta}_{i,t-1} of the true arm parameter \beta_{i} under a weighted norm, and \{C_{i,t-1}\}_{t\in[T]} have non-increasing supports. More specifically, for all i\in[K], t\in[T], and for some \alpha>0, we have:

\text{Supp}(\hat{\beta}_{i,t-1})\subseteq H_{i,t-1}:=\text{Supp}(C_{i,t-1}),
H_{i,t}\subseteq H_{i,t-1}\subseteq[d],\quad 1\leq\theta_{i,t-1}\leq\theta_{i,t},
D_{t-1}^{i}:=(\alpha I)_{H_{i,t-1}}+\sum_{w\in S_{i,t-1}}(X_{w})_{H_{i,t-1}}(X_{w})_{H_{i,t-1}}^{\mathsf{T}},
C_{i,t-1}:=\{\beta\,|\,\|\beta-\hat{\beta}_{i,t-1}\|_{D_{t-1}^{i}}\leq\theta_{i,t-1},\ \text{Supp}(\beta)\subseteq H_{i,t-1}\}.

In SurveyUCB, note that the confidence sets determine the set of features queried. Also, at any time-step t, the confidence sets must be constructed using only the observed covariates and rewards at every time-step up to t. Lemma 1 gives us a set of observed covariates when SurveyUCB uses confidence sets in Standard form.

Lemma 1.

If SurveyUCB uses confidence sets in Standard form for times t\in\{1,2,\dots,t^{\prime}\}, then for any arm k\in[K], we observe \{(X_{t})_{H_{k,t^{\prime}-1}}\}_{t=1}^{t^{\prime}}, where H_{k,t-1}:=\text{Supp}(C_{k,t-1}) for all arms k\in[K] and time-steps t\in[t^{\prime}].

Proof.

All statements in this proof hold for all arms k\in[K] and time-steps t\in[T]. SurveyUCB queries the set of features \pi^{s}_{t}=\cup_{k}\pi^{s}_{k,t} at time t, where \pi^{s}_{k,t}=\text{Supp}(C_{k,t-1})=H_{k,t-1}. From the structure of confidence sets in Standard form, we have \pi^{s}_{k,t+1}=H_{k,t}\subseteq H_{k,t-1}=\pi^{s}_{k,t}. This implies that:

H_{k,t-1}=\pi^{s}_{k,t}\subseteq\pi^{s}_{t}\subseteq\pi^{s}_{t-1}\subseteq\dots\subseteq\pi^{s}_{1}.

That is, we observe \{(X_{t})_{H_{k,t^{\prime}-1}}\}_{t=1}^{t^{\prime}}. ∎

Note that at any time-step t, SurveyUCB observes (X_{t})_{\pi^{s}_{t}}. Lemma 1 shows that for any t\in[t^{\prime}], we also observe (X_{t})_{H_{k,t^{\prime}-1}}, because the set of features queried by SurveyUCB at t is a superset of the support of the confidence set C_{k,t^{\prime}-1}. That is, under the conditions of lemma 1 we have:

\text{Supp}(C_{k,t^{\prime}-1})=H_{k,t^{\prime}-1}\subseteq\pi^{s}_{t},\text{ for all }t\in[t^{\prime}].

Initializing confidence sets: We now define our confidence set for time-step zero based on Assumption 2. For any arm k\in[K], under Assumption 2 we have \|\beta_{k}\|_{\alpha I}=\sqrt{\beta_{k}^{\mathsf{T}}(\alpha I)\beta_{k}}\leq\sqrt{\alpha}b. That is, \beta_{k}\in C_{k,0} for all arms k\in[K], where:

C_{k,0}:=\{\beta\,|\,\|\beta\|_{\alpha I}\leq\sqrt{\alpha}\cdot b\}=\{\beta\,|\,\|\beta\|_{\alpha I}\leq\sqrt{\alpha}\cdot b,\ \text{Supp}(\beta)\subseteq[d]\}.

Based on Assumption 1, section 4.2 describes the construction of the confidence set C_{k,t^{\prime}} for all arms k\in[K] and time-steps t^{\prime}\geq 1.

4.2 AlgConfidence

Consider any time-step t^{\prime}\geq 1. Suppose that the confidence sets constructed up to time t^{\prime} are in Standard form, i.e., \{C_{i,t}\}_{(i,t)\in[K]\times[t^{\prime}-1]} are in Standard form. Let k\in[K] be the arm pulled at time t^{\prime}. We now describe the AlgConfidence update for the confidence set C_{k,t^{\prime}} and show that the confidence sets \{C_{i,t}\}_{(i,t)\in[K]\times[t^{\prime}]} are in Standard form. This inductively implies that AlgConfidence constructs confidence sets in Standard form, since the base case trivially holds (confidence sets constructed up to and including time-step zero are trivially in Standard form).

From lemma 1 we already know that if the confidence sets constructed up to time t^{\prime} are in Standard form, then the decision maker using SurveyUCB at least observes (X_{t})_{H_{k,t^{\prime}-1}} at every time-step t\in[t^{\prime}], where H_{k,t^{\prime}-1}=\text{Supp}(C_{k,t^{\prime}-1}). Hence, AlgConfidence can use this to construct the confidence set C_{k,t^{\prime}} for arm k at time t^{\prime}.

Algorithm 2 AlgConfidence (Constructing confidence sets in Standard form)

Input: \{(X_{t})_{H_{k,t^{\prime}-1}}\}_{t=1}^{t^{\prime}}, Y_{k}, S_{k,t^{\prime}}, and C_{k,t^{\prime}-1}, for some time-step t^{\prime} and arm k.
Output: C_{k,t^{\prime}}.

1:  Initialize H_{k,t^{\prime}}\leftarrow H_{k,t^{\prime}-1}.
2:  for q\in[d] do
3:     if \beta_{\text{min}}>\max_{\beta\in C_{k,t^{\prime}-1}}\|\beta_{\{q\}}\| then
4:        H_{k,t^{\prime}}\leftarrow H_{k,t^{\prime}}\setminus\{q\}.
5:     end if
6:  end for
7:  Set (\hat{\beta}_{k,t^{\prime}})_{H_{k,t^{\prime}}^{\complement}}\leftarrow\vec{0}. {Note that H_{k,t^{\prime}}\subseteq H_{k,t^{\prime}-1}, hence we have \{(X_{t})_{H_{k,t^{\prime}}}\}_{t=1}^{t^{\prime}}.}
8:  Estimate (\hat{\beta}_{k,t^{\prime}})_{H_{k,t^{\prime}}} from regression on the data \{((X_{t})_{H_{k,t^{\prime}}},Y_{k}(t))\}_{t\in S_{k,t^{\prime}}}.
9:  Compute D_{t^{\prime}}^{k}:=(\alpha I)_{H_{k,t^{\prime}}}+\sum_{w\in S_{k,t^{\prime}}}(X_{w})_{H_{k,t^{\prime}}}(X_{w})_{H_{k,t^{\prime}}}^{\mathsf{T}}.
10:  Output C_{k,t^{\prime}}:=\{\beta\,|\,\|\beta-\hat{\beta}_{k,t^{\prime}}\|_{D_{t^{\prime}}^{k}}\leq\theta_{k,t^{\prime}},\ \text{Supp}(\beta)\subseteq H_{k,t^{\prime}}\}.

AlgConfidence starts by constructing H_{k,t^{\prime}} from C_{k,t^{\prime}-1} by relying on Assumption 1, where H_{k,t^{\prime}} will be the support of the confidence set C_{k,t^{\prime}}. We would like to construct H_{k,t^{\prime}} so that it contains the support of \beta_{k}, i.e., \text{Supp}(\beta_{k})\subseteq H_{k,t^{\prime}}. In particular, if the confidence set for arm k at time t^{\prime}-1 holds (i.e., \beta_{k}\in C_{k,t^{\prime}-1}), then from Assumption 1 we have that for all q\in\text{Supp}(\beta_{k}):

\max_{\beta\in C_{k,t^{\prime}-1}}\|\beta_{\{q\}}\|\geq\|(\beta_{k})_{\{q\}}\|>\beta_{\text{min}}.

We also get that \text{Supp}(\beta_{k}) is a subset of the support H_{k,t^{\prime}-1} of the confidence set of arm k at time t^{\prime}-1. Hence we have:

H_{k,t^{\prime}}:=H_{k,t^{\prime}-1}\Big\backslash\Big\{q\,\Big|\,\max_{\beta\in C_{k,t^{\prime}-1}}\|\beta_{\{q\}}\|\leq\beta_{\text{min}}\Big\}

AlgConfidence now constructs the confidence set C_{k,t^{\prime}} with support H_{k,t^{\prime}}. We estimate (\hat{\beta}_{k,t^{\prime}})_{H_{k,t^{\prime}}} by regressing over the features in H_{k,t^{\prime}} on the observed data set \{((X_{t})_{H_{k,t^{\prime}}},Y_{k}(t))\}_{t\in S_{k,t^{\prime}}}. We then set the components of \hat{\beta}_{k,t^{\prime}} not in H_{k,t^{\prime}} to zero, i.e., (\hat{\beta}_{k,t^{\prime}})_{H_{k,t^{\prime}}^{\complement}}\leftarrow\vec{0}. Now, with D_{t^{\prime}}^{k}:=(\alpha I)_{H_{k,t^{\prime}}}+\sum_{w\in S_{k,t^{\prime}}}(X_{w})_{H_{k,t^{\prime}}}(X_{w})_{H_{k,t^{\prime}}}^{\mathsf{T}}, we construct the confidence set for arm k at time t^{\prime}:

C_{k,t^{\prime}}:=\{\beta\,|\,\|\beta-\hat{\beta}_{k,t^{\prime}}\|_{D_{t^{\prime}}^{k}}\leq\theta_{k,t^{\prime}},\ \text{Supp}(\beta)\subseteq H_{k,t^{\prime}}\}

Note that \text{Supp}(C_{k,t^{\prime}})=H_{k,t^{\prime}}\subseteq H_{k,t^{\prime}-1} and \text{Supp}(\hat{\beta}_{k,t^{\prime}})\subseteq H_{k,t^{\prime}}. Hence, given the above form of the confidence set, we have that \{C_{i,t}\}_{(i,t)\in[K]\times[t^{\prime}]} are in Standard form.
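The feature-elimination test on line 3 of AlgConfidence also has a closed form for ellipsoidal confidence sets: \max_{\beta\in C_{k,t^{\prime}-1}}\|\beta_{\{q\}}\|=|\hat{\beta}_{q}|+\theta\sqrt{(D^{-1})_{qq}}. A sketch, again assuming all quantities are restricted to the current support (the helper name is ours):

```python
import numpy as np

def max_abs_coordinate(q, beta_hat, D, theta):
    """Upper bound max_{||b - beta_hat||_D <= theta} |b_q|
       = |beta_hat[q]| + theta * sqrt((D^{-1})_{qq}).
    Feature q can be dropped once this bound falls below beta_min."""
    e_q = np.zeros(len(beta_hat))
    e_q[q] = 1.0
    return abs(beta_hat[q]) + theta * np.sqrt(e_q @ np.linalg.solve(D, e_q))
```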

4.3 Probability Aggregation

To construct the confidence set C_{k,t^{\prime}}, recall that AlgConfidence assumes that \beta_{k}\in C_{k,t^{\prime}-1}. This is unlike LinUCB (Li et al., 2010) and several other UCB algorithms, where confidence sets are constructed from observed data without directly relying on previous confidence sets. Here, we argue that our construction does not lead to any unexpected issues. We now state a helpful lemma and its corollary, and defer the proofs to appendix A.

Lemma 2 (Probability aggregation).

Consider a probability space (\Omega,\mathcal{F},\Pr) and any sequence of events \{B_{i},\Pi_{i}\}_{i=1}^{\infty} such that B_{i},\Pi_{i}\in\mathcal{F} and B_{i}\subseteq\Pi_{i} for all i\in\mathbb{N}. Let \Pi_{0}:=\Omega. We then have:

\Pr\Big[\bigcap_{i=1}^{\infty}B_{i}\Big]\geq 1-\sum_{i=1}^{\infty}\Pr\big[B_{i}^{\complement}\,\big|\,\Pi_{i-1}\big].
Corollary 1.

Suppose SurveyUCB constructs the Standard form confidence sets \{C_{i,t-1}\}_{(i,t)\in[K]\times[T]} for all arms k and all time-steps t. We then have:

\Pr\Big[\bigcap_{k=1}^{K}\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}\Big]\geq 1-\sum_{k=1}^{K}\sum_{t\in S_{k}}\Pr[\beta_{k}\notin C_{k,t}\,|\,\text{Supp}(\beta_{k})\subseteq H_{k,t}].

Where H_{k,t-1}:=\text{Supp}(C_{k,t-1}) and S_{k}:=\{t\,|\,\pi_{t}^{a}=k\} for all arms k\in[K] and time-steps t\in[T].

Therefore, from corollary 1, for any \delta>0 we have \Pr[\bigcap_{k=1}^{K}\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}]\geq 1-\delta if for all k\in[K] and t\in[T] we have (note that \sum_{j=1}^{\infty}\frac{1}{(1+j)^{2}}<\frac{3.15^{2}}{6}-1<1):

\Pr[\beta_{k}\notin C_{k,t}\,|\,\text{Supp}(\beta_{k})\subseteq H_{k,t}]\leq\frac{\delta}{K(1+n_{k,t})^{2}} \qquad (1)
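The allocation in eq. 1 works because the per-update failure probabilities form a convergent series; a quick numerical check of the series bound used above:

```python
import math

# Partial sum of sum_{j>=1} 1/(1+j)^2; the full series equals pi^2/6 - 1 < 1.
# Hence summing delta / (K * (1 + n_{k,t})^2) over all updates of all K arms
# contributes at most delta to the total failure probability.
partial_sum = sum(1.0 / (1 + j) ** 2 for j in range(1, 1_000_000))
```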

5 General Regret Analysis

In section 4.2, we already saw that SurveyUCB constructs confidence sets in Standard form. In lemma 3 we exploit the structure of confidence sets in Standard form to get a general regret bound for SurveyUCB.

Lemma 3 (General regret analysis).

Suppose Assumption 2 holds with \|X_{t}\|_{2}\leq L for all t. Suppose the confidence sets \{C_{i,t-1}\}_{(i,t)\in[K]\times[T]} used in SurveyUCB are in Standard form with parameter \alpha>0 (in definition 2). Further suppose that \beta_{i}\in C_{i,t-1} at every time-step t and for all arms i\in[K]. We then have:

  • Instantaneous regret at time-step t: r_{t}\leq 2\theta_{\pi^{a}_{t},t-1}\|(X_{t})_{\pi^{s}_{t}}\|_{(D_{t-1}^{i})^{-1}}, where i=\pi^{a}_{t}.

  • And cumulative regret in T rounds: R_{T}\leq\sum_{i\in[K]}\theta_{i,T-1}\sqrt{8n_{i,T}d\log\Big(\frac{d\alpha+n_{i,T}L^{2}}{d\alpha}\Big)}.

We defer the proof of lemma 3 to appendix B. It is worth noting that the proof sheds some light on the need for confidence sets to be in Standard form.

6 Tail Inequalities for Adapted Observations

Consider a linear model Y=\mathbf{X}\beta+\epsilon, with design matrix \mathbf{X}\in\mathbb{R}^{n\times d}, response vector Y\in\mathbb{R}^{n}, and noise vector \epsilon\in\mathbb{R}^{n}, where the \epsilon_{t} form an independent sequence of \sigma-sub-Gaussian random variables.

6.1 Ridge Estimator

We now define the Ridge estimator for estimating the parameter \beta as follows:

Definition 3 (Ridge).

Given a regularization parameter \alpha\geq 0, the Ridge estimate is given by:

\hat{\beta}^{\text{ridge}}_{\mathbf{X},Y}(\alpha):=\arg\min_{\beta^{\prime}}\Big\{\|Y-\mathbf{X}\beta^{\prime}\|_{2}^{2}+\alpha\|\beta^{\prime}\|_{2}^{2}\Big\}
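The minimizer in definition 3 has the familiar closed form \hat{\beta}=(\mathbf{X}^{\mathsf{T}}\mathbf{X}+\alpha I)^{-1}\mathbf{X}^{\mathsf{T}}Y; a minimal sketch (function name ours):

```python
import numpy as np

def ridge_estimate(X, Y, alpha):
    """Closed-form minimizer of ||Y - X b||_2^2 + alpha ||b||_2^2,
       i.e. (X^T X + alpha I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)
```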

From theorem 2 in (Abbasi-Yadkori et al., 2011) we get:

Lemma 4 (Abbasi-Yadkori et al. (2011), theorem 2).

Let X_{t} denote the t-th row of \mathbf{X}, and let Y(t) denote the t-th entry of Y. The sequence \{X_{t}\,|\,t=1,2,\dots,n\} forms an adapted sequence of observations; that is, X_{t} may depend on \{X_{t^{\prime}},Y(t^{\prime})\}_{t^{\prime}=1}^{t-1}. Assume that \|\beta\|_{2}\leq b. Then for \delta>0, we have:

\Pr\bigg[\|\hat{\beta}-\beta\|_{D}\leq\sigma\sqrt{2\log\bigg(\frac{\det^{1/2}(D)}{\delta\alpha^{d/2}}\bigg)}+\sqrt{\alpha}b\bigg]\geq 1-\delta

Furthermore, if \|X_{t}\|_{2}\leq L for all t\in[n], we have that:

\Pr\bigg[\|\hat{\beta}-\beta\|_{D}\leq\sigma\sqrt{d\log\bigg(\frac{1+nL^{2}/\alpha}{\delta}\bigg)}+\sqrt{\alpha}b\bigg]\geq 1-\delta

Where \hat{\beta} is the Ridge estimate with parameter \alpha>0, and D=\mathbf{X}^{\mathsf{T}}\mathbf{X}+\alpha I.

6.2 Elastic Net Estimator

We now define the Elastic net estimator for estimating the parameter \beta as follows:

Definition 4 (Elastic net).

Given regularization parameters \alpha\geq 0 and \lambda\geq 0, the Elastic net estimate is given by:

\hat{\beta}^{\text{elnet}}_{\mathbf{X},Y}(\alpha,\lambda):=\arg\min_{\beta^{\prime}}\Big\{\frac{1}{n}\Big[\|Y-\mathbf{X}\beta^{\prime}\|_{2}^{2}+2\alpha\|\beta^{\prime}\|_{2}^{2}\Big]+\lambda\|\beta^{\prime}\|_{1}\Big\}
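The elastic net objective has no closed form, but it can be minimized by proximal gradient descent (ISTA): gradient steps on the smooth squared-error and ridge terms, followed by soft-thresholding for the \ell_{1} term. The sketch below targets the objective in definition 4 as written; the solver choice and names are ours, not the paper's:

```python
import numpy as np

def elastic_net_ista(X, Y, alpha, lam, n_iters=2000):
    """Proximal gradient (ISTA) for
       (1/n) [ ||Y - X b||_2^2 + 2*alpha*||b||_2^2 ] + lam*||b||_1."""
    n, d = X.shape
    # Lipschitz constant of the smooth part's gradient; step size 1/L.
    L = 2.0 * (np.linalg.norm(X, 2) ** 2 + 2.0 * alpha) / n
    eta, b = 1.0 / L, np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / n) * (X.T @ (X @ b - Y) + 2.0 * alpha * b)
        z = b - eta * grad
        b = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)  # soft-threshold
    return b
```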

We now provide an adaptive tail inequality for the Elastic net estimator that may be of independent interest. We defer the proof to appendix C.

Lemma 5 (Elastic net tail inequality for Adapted observations).

Let X_{t} denote the t-th row of \mathbf{X}, and let Y(t) denote the t-th entry of Y. The sequence \{X_{t}\,|\,t=1,2,\dots,n\} forms an adapted sequence of observations; that is, X_{t} may depend on \{X_{t^{\prime}},Y(t^{\prime})\}_{t^{\prime}=1}^{t-1}. Also assume that all realizations of X_{t} satisfy \|X_{t}\|_{\infty}\leq L and that \|\beta\|_{1}\leq b. Then, if \lambda:=4\sigma L\sqrt{(\gamma^{2}+2\log d)/n}, we have:

\Pr\Big[\|\hat{\beta}-\beta\|_{D}^{2}\leq 6\sigma Lb\sqrt{n(\gamma^{2}+2\log(d))}+4\alpha b^{2}\Big]\geq 1-2\exp[-\gamma^{2}/2].

Where \hat{\beta} is the Elastic net estimate with parameters \alpha,\lambda\geq 0, and D=\mathbf{X}^{\mathsf{T}}\mathbf{X}+\alpha I.

7 Ridge SurveyUCB

Note that the description and analysis of SurveyUCB gives us a fair amount of flexibility. We are still free to specify the regression method used to estimate (\hat{\beta}_{i,t})_{H_{i,t}} for all i\in[K] and times t\in[T]. We are also free to specify our choice of \theta_{i,t} for all arms i\in[K] and all times t\in[T].

Ridge SurveyUCB is a version of SurveyUCB in which we use Ridge regression with a fixed regularization parameter \alpha>0 to estimate (\hat{\beta}_{i,t})_{H_{i,t}}. We also choose \theta_{i,t} for all arms and time-steps based on corollary 1 (eq. 1) and lemma 4, so that the Standard form confidence sets we construct hold with high probability. Then from lemma 3 we naturally get a high-probability regret bound for Ridge SurveyUCB.

Say we want our bounds to hold with probability 1δ1-\delta. Then from corollary 1, it is enough to show that eq. 1 holds for all arms and times. That is for all arms k[K]k\in[K] and all times tt, we want:

Pr[βkCk,t|Supp(βk)Hk,t]δK(1+nk,t)2\displaystyle\Pr[\beta_{k}\notin C_{k,t}|\text{Supp}(\beta_{k})\subseteq H_{k,t}]\leq\frac{\delta}{K(1+n_{k,t})^{2}} (1)

Given Supp(βk)Hk,t\text{Supp}(\beta_{k})\subseteq H_{k,t}, lemma 4 gives us that eq. 1 holds for Ck,tC_{k,t} with support Hk,tH_{k,t} if:

θk,t=σ|Hk,t|log(1+nk,tL2/αδ/(K(1+nk,t)2))+αb\displaystyle\theta_{k,t}=\sigma\sqrt{|H_{k,t}|\log\bigg{(}\frac{1+n_{k,t}L^{2}/\alpha}{\delta/(K(1+n_{k,t})^{2})}\bigg{)}}+\sqrt{\alpha}b (2)

Hence, from lemma 3 with the above choice of θk,t\theta_{k,t} and the corresponding high-probability guarantee, we get theorem 1. We defer the proof to appendix E.
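To make the per-arm computation concrete, here is a minimal sketch of the ridge update, the confidence radius of eq. 2, and the resulting optimistic value, assuming the standard-form confidence set $\{\beta:\|\beta-\hat{\beta}\|_{D}\leq\theta\}$; the function and variable names are ours, not the paper's:

```python
import numpy as np

def ridge_theta(sigma, h_size, n_arm, L, alpha, b, delta, K):
    """Confidence radius from eq. 2, for an arm with h_size = |H_{k,t}|
    active features and n_arm pulls so far."""
    log_term = np.log((1 + n_arm * L**2 / alpha) / (delta / (K * (1 + n_arm) ** 2)))
    return sigma * np.sqrt(h_size * log_term) + np.sqrt(alpha) * b

def ridge_fit(X, Y, alpha):
    """Ridge estimate and the matrix D = X^T X + alpha*I defining the set."""
    d = X.shape[1]
    D = X.T @ X + alpha * np.eye(d)
    beta_hat = np.linalg.solve(D, X.T @ Y)
    return beta_hat, D

def ucb(x, beta_hat, D, theta):
    """Optimistic value max over {beta : ||beta - beta_hat||_D <= theta}
    of x^T beta, i.e. x^T beta_hat + theta * ||x||_{D^{-1}}."""
    width = np.sqrt(x @ np.linalg.solve(D, x))
    return x @ beta_hat + theta * width
```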

Theorem 1 (Ridge SurveyUCB regret).

Suppose assumptions 1 and 2 hold. Let $\delta,\alpha>0$ be fixed constants, and let $\theta_{k,t}$ be chosen as in eq. 2 for all $k,t$. Then with probability at least $1-\delta$, Ridge SurveyUCB has the following regret guarantee:

\[
R_{T}\leq O(d\sqrt{KT\log(K)}\log(T)).
\]

8 Elastic Net SurveyUCB

In this section, we develop a variant of SurveyUCB that is more robust to the choice of the beta-min parameter in assumption 1. One way to do this is to modify line 3 in AlgConfidence. In particular, suppose $k\in[K]$ is the arm pulled at time $t^{\prime}$. We then construct $H_{k,t^{\prime}}$ by removing from $H_{k,t^{\prime}-1}$ every feature $q$ that we estimate to be zero (i.e., $(\hat{\beta}_{k,t^{\prime}-1})_{q}=0$) and that we determine to be irrelevant based on assumption 1. That is:

\[
H_{k,t^{\prime}}:=H_{k,t^{\prime}-1}\Big\backslash\bigg\{q\;\Big|\;(\hat{\beta}_{k,t^{\prime}-1})_{q}=0\text{ and }\max_{\beta\in C_{k,t^{\prime}-1}}\|\beta_{\{q\}}\|\leq\beta_{\text{min}}\bigg\}
\]

It is easy to see that all our arguments for SurveyUCB continue to hold under the above modification. The modification makes SurveyUCB more robust to assumption 1, because we additionally require the $q$-th coordinate of our estimate ($\hat{\beta}_{k,t^{\prime}-1}$) to be zero before we remove feature $q$ (at time $t^{\prime}$) from the model of arm $k$. This modification encourages us to use sparse estimators.
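Assuming a standard-form confidence set $\{\beta:\|\beta-\hat{\beta}\|_{D}\leq\theta\}$, the per-coordinate maximum in the rule above is $\max_{\beta\in C}|\beta_{q}|=|\hat{\beta}_{q}|+\theta\sqrt{(D^{-1})_{qq}}$, so the truncation step can be sketched as follows (a hypothetical helper, not the paper's code):

```python
import numpy as np

def truncate_support(H, beta_hat, D, theta, beta_min):
    """Drop feature q from the active set H only when the estimate is exactly
    zero AND every beta in the confidence set puts |beta_q| below beta_min."""
    Dinv = np.linalg.inv(D)
    kept = []
    for q in H:
        # max of |beta_q| over the standard-form confidence set
        max_abs_q = abs(beta_hat[q]) + theta * np.sqrt(Dinv[q, q])
        if beta_hat[q] == 0.0 and max_abs_q <= beta_min:
            continue  # both conditions met: feature q is removed
        kept.append(q)
    return kept
```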

Elastic Net SurveyUCB is a variant of SurveyUCB with the above modification. In Elastic Net SurveyUCB, we use Elastic net regression to estimate $\hat{\beta}_{i,t}$, with regularization parameters $\alpha$ and $\lambda_{i,t}$ for all arms $i\in[K]$ and time-steps $t$, where:

\[
\lambda_{i,t}:=4\sigma L\sqrt{\frac{2}{n_{i,t}}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)},\ \forall i\in[K],\forall t\in[T]. \tag{3}
\]

Similar to the arguments in section 7, corollary 1 and lemma 5 give us that our confidence sets hold with probability $1-\delta$ if, for all arms $i$ and times $t$:

\[
\theta_{i,t}:=\sqrt{6\sigma Lb\sqrt{2n_{i,t}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)}+4\alpha b^{2}}. \tag{4}
\]

Again, from lemma 3 with the above choice of $\theta_{k,t}$ and the corresponding high-probability guarantee, we get theorem 2. We defer the proof to appendix F.

Theorem 2 (Elastic Net SurveyUCB regret).

Suppose assumptions 1 and 2 hold. Let $\delta,\alpha>0$ be fixed constants. For all $k,t$, let $\lambda_{k,t}$ and $\theta_{k,t}$ be chosen as in eq. 3 and eq. 4 respectively. Then with probability at least $1-\delta$, Elastic Net SurveyUCB has the following regret guarantee:

\[
R_{T}\leq O(K^{1/4}d^{1/2}T^{3/4}\log^{3/4}(T)\log^{1/4}(dK)).
\]

9 Interactive Surveys

We start by defining sub-optimal arms, describe an inefficiency in SurveyUCB, and propose a fix using interactive surveys.

Definition 5 (Sub-optimal arms).

An arm is said to be sub-optimal if the target policy would not pick it for any context vector in the context space.

At any time-step $t$, the SurveyUCB algorithms query $\pi^{s}_{t}:=\cup_{i\in[K]}\pi^{s}_{i,t}$, with $\pi^{s}_{i,t}=\text{Supp}(C_{i,t-1})$ for all $i\in[K]$. Now suppose the decision maker plays arm $k$ at time $t$. The decision maker only needs the reward and the features corresponding to $\pi^{s}_{k,t}$ to update the model. Recall that the reason the decision maker queries all the features in $\pi^{s}_{t}$ is to be able to compute the upper confidence bounds for all arms.

Note that the bandit assignment depends only on the highest upper confidence bound, so it is not necessary to compute all the upper confidence bounds. The algorithm's inefficiency becomes more evident in the presence of sub-optimal arms, because these arms are played less frequently, hence updated less frequently, and they increase survey length.

Algorithm 3 Interactive Survey Protocol

Interactive input: User at time $t$, who answers queries whenever asked.
Output: Queries to user $t$.

1:  Create an ordered list of arms $Q$, from the most pulled arm to the least pulled arm.
2:  Initialize $M\leftarrow-\infty$, the largest upper confidence bound among queried arms.
3:  Initialize the set of queried features $U\leftarrow\emptyset$.
4:  while $Q$ is not empty do
5:     Let $i\leftarrow Q[1]$ be the first arm in $Q$.
6:     Query $\pi^{s}_{i,t}$ and observe $(X_{t})_{\pi^{s}_{i,t}}$.
7:     Update $U\leftarrow U\cup\pi^{s}_{i,t}$.
8:     Update $M\leftarrow\max\{M,\max_{\beta\in C_{i,t-1}}(X_{t})_{\pi^{s}_{i,t}}^{\mathsf{T}}\beta\}$.
9:     Remove $i$ from the list $Q$.
10:     for $w$ in $Q$ do
11:        Set $F=\{x\mid x_{U}=(X_{t})_{U}\text{ and }x\text{ is feasible}\}$.
12:        if $\max_{x\in F}\max_{\beta\in C_{w,t-1}}x^{\mathsf{T}}\beta\leq M$ then
13:           Remove arm $w$ from $Q$.
14:        end if
15:     end for
16:  end while {Note that we observe enough information to determine which arm has the largest upper confidence bound, and we can update its confidence set after observing its reward.}

We attempt to resolve this inefficiency through interactive surveys. Consider the user at time $t$: the decision maker needs to query some features from the user before taking an action. We start by creating an ordered list of all arms, starting with the most pulled arms and ending with the least pulled arms. This ordering is a heuristic choice; the idea is to keep sub-optimal arms (which are pulled less frequently) towards the end of the list. We keep removing arms from this list and terminate (take an action) when it is empty.

The main idea is to sequentially query the feature sets $\pi^{s}_{i,t}$ for arms $i$ in the list, remove queried arms from the list, and also remove unqueried arms that cannot have the largest upper confidence bound. We now explain how we determine that some unqueried arm $w$ does not have the largest upper confidence bound. First note that for each queried arm, we can exactly compute its upper confidence bound at time $t$. The following optimization problem gives us an upper bound on the upper confidence bound of arm $w$ (here $(D_{t-1}^{w})^{-1}$ is the pseudo-inverse of the matrix):

\[
\max_{x\in F}\max_{\beta\in C_{w,t-1}}x^{\mathsf{T}}\beta\equiv\max_{x\in F}\ x^{\mathsf{T}}\hat{\beta}_{w,t-1}+\sqrt{\theta_{w,t-1}\,(x^{\mathsf{T}}(D_{t-1}^{w})^{-1}x)} \tag{5}
\]

where $F$ denotes the set of contexts in the context space that are consistent with the queries so far (i.e., feasible $x$ such that $x_{U}=(X_{t})_{U}$, where $U$ is the set of features queried so far for user $t$). Hence, if this upper bound is less than the upper confidence bound of some queried arm, we remove arm $w$ from the list. For more details, see algorithm 3.

Note that interactive versions of SurveyUCB have the same regret performance as SurveyUCB, because both algorithms work with the same confidence sets and always choose the arm with the largest upper confidence bound. Also note that the optimization problem in eq. 5 is non-convex. In section G.3 we provide a heuristic for this optimization that has been effective in simulations.
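A condensed sketch of the elimination loop of algorithm 3 follows. The per-arm state fields and helper names are ours; `ub_over_box` returns a crude but valid upper bound on eq. 5 obtained by maximizing the linear term and the width term separately over the box of feasible contexts, rather than the tighter optimization the paper describes:

```python
import numpy as np

def ucb_exact(x, arm):
    """Exact upper confidence bound once the arm's survey features are observed."""
    width = np.sqrt(x @ np.linalg.solve(arm["D"], x))
    return x @ arm["beta_hat"] + arm["theta"] * width

def ub_over_box(arm, lo, hi):
    """Any valid upper bound on eq. 5 works for elimination.  Here we bound the
    linear term coordinate-wise over the box [lo, hi]^d, and the width term via
    the largest eigenvalue of D^{-1} times the box corner of largest norm."""
    b = arm["beta_hat"]
    lin = np.sum(np.where(b >= 0, hi * b, lo * b))
    corner = np.maximum(np.abs(lo), np.abs(hi)) * np.ones_like(b)
    max_eval = np.linalg.eigvalsh(np.linalg.inv(arm["D"])).max()
    return lin + arm["theta"] * np.sqrt(max_eval * corner @ corner)

def interactive_select(arms, x_full, lo=0.0, hi=1.0):
    """arms: dict arm_id -> {"beta_hat", "D", "theta", "pulls", "survey"}.
    x_full: the user's full context; only queried coordinates are ever read."""
    Q = sorted(arms, key=lambda a: -arms[a]["pulls"])  # most pulled arms first
    M, best, queried = -np.inf, None, set()
    while Q:
        i = Q.pop(0)
        queried |= set(arms[i]["survey"])  # query this arm's survey features
        m_i = ucb_exact(x_full, arms[i])
        if m_i > M:
            M, best = m_i, i
        # drop unqueried arms whose optimistic value cannot beat M
        Q = [w for w in Q if ub_over_box(arms[w], lo, hi) > M]
    return best, queried
```

In the sketch, a clearly dominated arm is eliminated before its survey features are ever queried, which is exactly the saving that interactive surveys aim for.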

10 Simulation

Suppose we have users with fifty features and five arms, where the expected reward for context $x$ is $x_{1}$ for arm 1, $x_{2}$ for arm 2, $1-x_{1}$ for arm 3, and zero for both arm 4 and arm 5. Hence, only two of the fifty features are predictive of arm rewards, and both arm 4 and arm 5 are sub-optimal. We draw contexts from the uniform distribution $U([0,1]^{d})$, and reward noise ($\epsilon_{i,t}$) is drawn from $U([0,1])$. We consider a 100,000 step time horizon.

We assume the noise is 1-sub-Gaussian, contexts lie in the space $[0,1]^{d}$, and the 1-norm and 2-norm of the arm parameters are bounded by $50$ and $\sqrt{50}$ respectively. We run simulations with ($K=5$) and without ($K=3$) sub-optimal arms; that is, in our simulations without sub-optimal arms, we only consider the first three arms. We plot regret and cumulative survey length against time-steps. The plots here are generated by averaging over five runs.
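The data-generating process just described can be reproduced in a few lines (the seed is an arbitrary choice of ours); note that only the first two features enter any arm's reward:

```python
import numpy as np

D_FEATURES, K_ARMS, HORIZON = 50, 5, 100_000
rng = np.random.default_rng(0)  # arbitrary seed

def draw_context():
    return rng.uniform(0.0, 1.0, size=D_FEATURES)

def expected_reward(x, arm):
    """Arms 1-3 depend only on the first two features; arms 4-5 are sub-optimal."""
    return [x[0], x[1], 1.0 - x[0], 0.0, 0.0][arm]

def reward(x, arm):
    # additive uniform noise, as stated in the text
    return expected_reward(x, arm) + rng.uniform(0.0, 1.0)
```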

In simulations where assumption 1 holds, Ridge SurveyUCB has better regret performance than Elastic Net SurveyUCB (for similar beta-min parameters). The plots verify that Elastic Net SurveyUCB is in fact reasonably robust to assumption 1: in simulations it does not remove predictive features even when assumption 1 is violated. We also note that sub-optimal arms hurt the performance of SurveyUCB algorithms, and interactive surveys help mitigate the negative effects of sub-optimal arms on survey lengths. Also note that regret performance improves with less conservative choices of the beta-min parameter, which implies that model truncation also helps improve regret performance.

[Figures 1–2: Regret and survey length for Ridge SurveyUCB with $\beta_{\text{min}}=0.3$, $K=3$.]

[Figures 3–4: Regret and survey length for Ridge SurveyUCB with $\beta_{\text{min}}=0.3$, $K=5$.]

[Figures 5–6: Regret and survey length for Ridge SurveyUCB with $\beta_{\text{min}}=0.5$, $K=5$.]

[Figures 7–8: Regret and survey length for Elastic net SurveyUCB with $\beta_{\text{min}}=0.7$, $K=5$.]

[Figures 9–10: Regret and survey length for Elastic net SurveyUCB with $\beta_{\text{min}}=1.5$, $K=5$; note that assumption 1 is violated.]

References

  • Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.
  • Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1–9, 2012.
  • Bastani, H. and Bayati, M. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015.
  • Bouneffouf, D., Rish, I., Cecchi, G. A., and Féraud, R. Context attentive bandits: Contextual bandit with restricted context. In IJCAI, 2017.
  • Bühlmann, P. and Van De Geer, S. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
  • Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.
  • Ye, Y. Approximating global quadratic optimization with convex quadratic constraints. Journal of Global Optimization, 15(1):1–17, 1999.

Appendix A Proofs for probability aggregation

Lemma 2 (restated).

Proof.

We use induction. First note that:

\[
\Pr[B_{1}]=1-\Pr[B_{1}^{\complement}]=1-\Pr[B_{1}^{\complement}|\Omega].
\]

Also note that for any $t\geq 1$, we have:

\[
\begin{aligned}
\Pr[\cap_{i=1}^{t+1}B_{i}]&=\Pr[\cap_{i=1}^{t}B_{i}]-\Pr[\cap_{i=1}^{t}B_{i}\cap B_{t+1}^{\complement}]\\
&\geq\Pr[\cap_{i=1}^{t}B_{i}]-\Pr[\Pi_{t}\cap B_{t+1}^{\complement}]\\
&=\Pr[\cap_{i=1}^{t}B_{i}]-\Pr[B_{t+1}^{\complement}|\Pi_{t}]\Pr[\Pi_{t}]\\
&\geq\Pr[\cap_{i=1}^{t}B_{i}]-\Pr[B_{t+1}^{\complement}|\Pi_{t}]
\end{aligned}
\]

where the first inequality follows from the fact that $\Pi_{t}$ is a superset of $\cap_{i=1}^{t}B_{i}$ (since $\cap_{i=1}^{t}B_{i}\subseteq B_{t}\subseteq\Pi_{t}$). Therefore, by induction, for all $t\geq 1$ we get:

\[
\Pr\Bigg[\bigcap_{i=1}^{t}B_{i}\Bigg]\geq 1-\sum_{i=1}^{t}\Pr\big[B_{i}^{\complement}|\Pi_{i-1}\big].
\]

We get the required result by taking the limit as $t$ goes to $\infty$. ∎

Corollary 1 (restated).

Proof.

Recall from section 4.2 that $\text{Supp}(\beta_{k})\subseteq H_{k,t^{\prime}}$ if $\beta_{k}\in C_{k,t^{\prime}-1}$. For any arm $k$, applying lemma 2 with $B_{t}$ the event that $\beta_{k}\in C_{k,t}$ and $\Pi_{t}$ the event that $\text{Supp}(\beta_{k})\subseteq H_{k,t}$, we get:

\[
\Pr\Bigg[\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}\Bigg]\geq 1-\sum_{t=1}^{\infty}\Pr[\beta_{k}\notin C_{k,t}|\text{Supp}(\beta_{k})\subseteq H_{k,t}].
\]

Now, from the union bound and the above inequality, we get:

\[
\begin{aligned}
\Pr\Bigg[\bigcap_{k=1}^{K}\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}\Bigg]&=1-\Pr\Bigg[\bigcup_{k=1}^{K}\bigg(\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}\bigg)^{\complement}\Bigg]\\
&\geq 1-\sum_{k=1}^{K}\Pr\Bigg[\bigg(\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}\bigg)^{\complement}\Bigg]\\
&\geq 1-\sum_{k=1}^{K}\sum_{t=1}^{\infty}\Pr[\beta_{k}\notin C_{k,t}|\text{Supp}(\beta_{k})\subseteq H_{k,t}].
\end{aligned}
\]

∎

Appendix B Proofs for General Regret Analysis

We now restate lemma 3 and provide the proof in the following sub-sections.

Lemma 3 (restated).

B.1 Instantaneous Regret Decomposition

Let $r_{t}$ denote the instantaneous regret at time-step $t$. Suppose SurveyUCB picks arm $i$ at time-step $t$, that is, $\pi^{a}_{t}=i$, and suppose arm $j$ is the optimal arm at time-step $t$, that is, $\pi^{a*}_{t}=j$. Let $\pi^{s}_{t}$ denote the set of questions queried by SurveyUCB; then

\[
\pi^{s}_{t}:=\bigcup_{k\in[K]}\text{Supp}(C_{k,t-1}).
\]

Hence, $\text{Supp}(C_{k,t-1})\subseteq\pi^{s}_{t}$ for any arm $k\in[K]$ and time-step $t$. Therefore, for any $\beta\in C_{k,t-1}$ and $z\in\mathbb{R}^{d}$, we have $z^{\mathsf{T}}\beta=(z)_{\pi^{s}_{t}}^{\mathsf{T}}\beta$. Further, from the conditions stated in lemma 3, we have $\beta_{k}\in C_{k,t-1}$ for all $k\in[K]$ and $t\in[T]$. Therefore:

\[
r_{t}=X_{t}^{\mathsf{T}}\beta_{j}-X_{t}^{\mathsf{T}}\beta_{i}=(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta_{j}-(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta_{i}
\]

Since SurveyUCB chooses arm $i=\pi^{a}_{t}\in\arg\max_{k\in[K]}\max_{\beta\in C_{k,t-1}}(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta$ and $\beta_{j}\in C_{j,t-1}$, we have $(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\tilde{\beta}_{i,t-1}\geq(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta_{j}$, where $\tilde{\beta}_{i,t-1}$ denotes the optimistic estimate of arm $i$'s parameter at time-step $t$, that is, $\tilde{\beta}_{i,t-1}=\arg\max_{\beta\in C_{i,t-1}}(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta$. Therefore, we get:

\[
r_{t}\leq(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\tilde{\beta}_{i,t-1}-(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta_{i}=(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}(\tilde{\beta}_{i,t-1}-\hat{\beta}_{i,t-1})+(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}(\hat{\beta}_{i,t-1}-\beta_{i})
\]

where $\hat{\beta}_{i,t-1}$ denotes the estimate of arm $i$'s parameter at time-step $t$. Now, using the Cauchy–Schwarz inequality for the weighted norm with respect to the positive definite matrix $V_{t-1}^{i}:=\alpha I+\sum_{w\in S_{i,t-1}}(X_{w})_{\pi^{s}_{t}}(X_{w})_{\pi^{s}_{t}}^{\mathsf{T}}$, we get:

\[
\begin{aligned}
r_{t}&\leq\|\tilde{\beta}_{i,t-1}-\hat{\beta}_{i,t-1}\|_{V_{t-1}^{i}}\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}+\|\hat{\beta}_{i,t-1}-\beta_{i}\|_{V_{t-1}^{i}}\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}\\
&=\|\tilde{\beta}_{i,t-1}-\hat{\beta}_{i,t-1}\|_{D_{t-1}^{i}}\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}+\|\hat{\beta}_{i,t-1}-\beta_{i}\|_{D_{t-1}^{i}}\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}\\
&\leq 2\theta_{i,t-1}\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}
\end{aligned}
\]

where the last inequality follows directly from the fact that our confidence sets are in Standard form. The equality follows from observation 1, the fact that $\beta_{i},\hat{\beta}_{i,t-1},\tilde{\beta}_{i,t-1}\in C_{i,t-1}$, and the fact that $D_{t-1}^{i}=(V_{t-1}^{i})_{H_{i,t-1}}$, where $H_{i,t-1}:=\text{Supp}(C_{i,t-1})$; the latter also follows from our confidence sets being in Standard form. We defer the proof of observation 1 to section B.3.

Observation 1.

Consider any vector $v\in\mathbb{R}^{d}$ and any positive semi-definite matrix $A\in\mathbb{R}^{d\times d}$. Let $S$ be such that $\text{Supp}(v)\subseteq S$. We have $\|v\|_{A}=\|v\|_{B}$ if $B=(A)_{S}$.

Now, since the reward at any time $t$ lies in the range $[0,1]$, we have $r_{t}\in[0,2]$ for all $t$. Since $\theta_{i,t-1}\geq 1$, we further have $r_{t}\leq 2\theta_{i,t-1}\min\big\{1,\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}\big\}$.

B.2 Cumulative Regret

From lemma 11 in (Abbasi-Yadkori et al., 2011), we get:

Lemma 6 (Abbasi-Yadkori et al.).

Let $\{X_{t}\}_{t=1}^{\infty}$ be a sequence in $\mathbb{R}^{d}$ and let $V\in\mathbb{R}^{d\times d}$ be positive definite. Define $V_{t}:=V+\sum_{s=1}^{t}X_{s}X_{s}^{\mathsf{T}}$. Further, if $\|X_{t}\|_{2}\leq L$ for all $t$, then:

\[
\sum_{t=1}^{n}\min\big\{1,\|X_{t}\|^{2}_{V_{t-1}^{-1}}\big\}\leq 2\bigg(d\log\bigg(\frac{\text{trace}(V)+nL^{2}}{d}\bigg)-\log\det(V)\bigg).
\]
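Lemma 6 is easy to sanity-check numerically; the snippet below draws a random adapted-looking sequence and compares both sides of the inequality (the dimension, horizon, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, L = 4, 200, 1.0
V = np.eye(d)          # V positive definite; trace(V) = d, det(V) = 1
Vt = V.copy()          # running V_{t-1}
lhs = 0.0
for _ in range(n):
    x = rng.uniform(-L, L, size=d)
    x /= max(1.0, np.linalg.norm(x) / L)          # enforce ||x||_2 <= L
    lhs += min(1.0, x @ np.linalg.solve(Vt, x))   # uses V_{t-1}^{-1}
    Vt += np.outer(x, x)
rhs = 2 * (d * np.log((np.trace(V) + n * L**2) / d) - np.log(np.linalg.det(V)))
assert lhs <= rhs
```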

Now, consider any arm $i\in[K]$ and let $R_{T}^{i}$ denote the cumulative regret incurred by arm $i$. From the Cauchy–Schwarz inequality and the bound on $r_{t}$ in section B.1, we get:

\[
\begin{aligned}
R_{T}^{i}&=\sum_{t\in S_{i,T}}r_{t}\leq\sqrt{n_{i,T}\sum_{t\in S_{i,T}}r_{t}^{2}}\\
&\leq\theta_{i,T-1}\sqrt{4n_{i,T}\sum_{t\in S_{i,T}}\min\big\{1,\|(X_{t})_{\pi^{s}_{t}}\|^{2}_{(V_{t-1}^{i})^{-1}}\big\}}
\end{aligned}
\]

From the conditions stated in lemma 3, we have $\|X_{t}\|_{2}\leq L$ for all $t$. Hence, from lemma 6, we get:

\[
R_{T}^{i}\leq\theta_{i,T-1}\sqrt{8n_{i,T}\bigg(d\log\bigg(\frac{\text{trace}(\alpha I)+n_{i,T}L^{2}}{d\alpha}\bigg)\bigg)}
\]

Therefore, our total regret is bounded as:

\[
R_{T}=\sum_{i\in[K]}R_{T}^{i}\leq\sum_{i\in[K]}\theta_{i,T-1}\sqrt{8n_{i,T}\bigg(d\log\bigg(\frac{d\alpha+n_{i,T}L^{2}}{d\alpha}\bigg)\bigg)}.
\]

This completes the proof of lemma 3.

B.3 Proof for Observations

Observation 1 (restated).

Proof.

Consider any $v\in\mathbb{R}^{d}$ and any $A\in\mathbb{R}^{d\times d}$. Let $S$ be such that $\text{Supp}(v)\subseteq S\subseteq[d]$. Let $B=(A)_{S}$; that is, we get $B$ by setting the rows and columns of $A$ not in $S$ to zero. Consider $A^{\prime}$, which we get by setting the columns of $A$ not in $S$ to zero. Since $A$ and $A^{\prime}$ agree on the columns indexed by $S\supseteq\text{Supp}(v)$, we have $(A-A^{\prime})v=0$; that is, $Av=A^{\prime}v$. Now note that we can get $B$ by setting the rows of $A^{\prime}$ not in $S$ to zero, and since $v$ is supported in $S$, zeroing those rows does not change $v^{\mathsf{T}}A^{\prime}$; that is, $v^{\mathsf{T}}A^{\prime}=v^{\mathsf{T}}B$. Therefore:

\[
\|v\|_{A}^{2}=v^{\mathsf{T}}Av=v^{\mathsf{T}}A^{\prime}v=v^{\mathsf{T}}Bv=\|v\|_{B}^{2}.
\]

Since $A$, and hence $B$, is positive semi-definite, this also gives us $\|v\|_{A}=\|v\|_{B}$. ∎
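Observation 1 can also be verified numerically: masking the rows and columns of a random PSD matrix outside the support of $v$ leaves the weighted norm unchanged (the dimension, support set, and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
M = rng.normal(size=(d, d))
A = M.T @ M                        # a random PSD matrix
v = np.zeros(d)
v[[0, 2]] = rng.normal(size=2)     # Supp(v) = {0, 2}
S = [0, 2, 5]                      # any S with Supp(v) in S
B = np.zeros_like(A)
idx = np.ix_(S, S)
B[idx] = A[idx]                    # B = (A)_S: zero rows/columns outside S
norm_A = np.sqrt(v @ A @ v)
norm_B = np.sqrt(v @ B @ v)
assert np.isclose(norm_A, norm_B)
```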

Appendix C Proofs for lemma 5

Consider a linear model $Y=\mathbf{X}\beta+\epsilon$, with design matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$, response vector $Y\in\mathbb{R}^{n}$, and noise vector $\epsilon\in\mathbb{R}^{n}$, where the $\epsilon_{t}$ are independent $\sigma$-sub-Gaussian random variables. Now, from lemma EC2 in (Bastani & Bayati, 2015), we have:

Lemma 7 (Bastani and Bayati).

Let $X_{t}$ denote the $t$-th row of $\mathbf{X}$, and let $Y(t)$ denote the $t$-th entry of $Y$. The sequence $\{X_{t}\mid t=1,2,\dots,n\}$ forms an adapted sequence of observations; that is, $X_{t}$ may depend on $\{X_{t^{\prime}},Y(t^{\prime})\}_{t^{\prime}=1}^{t-1}$. Also assume that all realizations of $X_{t}$ satisfy $\|X_{t}\|_{\infty}\leq L$. Now, define the event:

\[
\mathcal{F}(\lambda_{0}(\gamma)):=\bigg\{\max_{r\in[d]}(2|\epsilon^{\mathsf{T}}X^{r}|/n)\leq\lambda_{0}(\gamma)\bigg\},
\]

where $X^{r}$ is the $r$-th column of $\mathbf{X}$ and $\lambda_{0}(\gamma):=2\sigma L\sqrt{(\gamma^{2}+2\log d)/n}$. Then we have $\Pr[\mathcal{F}(\lambda_{0}(\gamma))]\geq 1-2\exp[-\gamma^{2}/2]$.

We now state a useful basic inequality for Elastic net estimators. The proof is similar to the basic inequalities proved for the Lasso in (Bühlmann & Van De Geer, 2011); we defer the proof to appendix D.

Lemma 8 (Basic inequality for Elastic net).

Consider a linear model $Y=\mathbf{X}\beta+\epsilon$, with design matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$, response vector $Y\in\mathbb{R}^{n}$, and noise vector $\epsilon\in\mathbb{R}^{n}$. We then have:

\[
\frac{1}{n}\|\hat{\beta}-\beta\|_{D}^{2}+\lambda\|\hat{\beta}\|_{1}\leq\frac{2}{n}\epsilon^{\mathsf{T}}\mathbf{X}(\hat{\beta}-\beta)+\lambda\|\beta\|_{1}+\frac{4\alpha}{n}\|\beta\|_{2}^{2},
\]

where $\hat{\beta}$ is the Elastic net estimate with parameters $\alpha,\lambda\geq 0$, and $D=\mathbf{X}^{\mathsf{T}}\mathbf{X}+\alpha I$.

The following lemma results from simplifying lemma 8 under the high-probability event $\mathcal{F}(\lambda_{0})$ given in lemma 7, when $\lambda\geq 2\lambda_{0}$.

Lemma 9.

Consider a linear model $Y=\mathbf{X}\beta+\epsilon$, with design matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$, response vector $Y\in\mathbb{R}^{n}$, and noise vector $\epsilon\in\mathbb{R}^{n}$. Let $\hat{\beta}$ be the Elastic net estimate with parameters $\alpha,\lambda\geq 0$, and let $D=\mathbf{X}^{\mathsf{T}}\mathbf{X}+\alpha I$. When $\lambda\geq 2\lambda_{0}$ and $\mathcal{F}(\lambda_{0})$ holds, we have:

\[
\frac{2}{n}\|\hat{\beta}-\beta\|_{D}^{2}\leq 3\lambda\|\beta\|_{1}+\frac{8\alpha}{n}\|\beta\|_{2}^{2}.
\]
Proof.

Since $\mathcal{F}(\lambda_{0})$ holds and $\lambda\geq 2\lambda_{0}$, we get:

\[
\frac{2}{n}\epsilon^{\mathsf{T}}\mathbf{X}(\hat{\beta}-\beta)\leq\frac{1}{n}\Big(\max_{r\in[d]}2|\epsilon^{\mathsf{T}}X^{r}|\Big)\|\hat{\beta}-\beta\|_{1}\leq\lambda_{0}\|\hat{\beta}-\beta\|_{1}\leq\frac{\lambda}{2}\|\hat{\beta}-\beta\|_{1}
\]

Hence, from lemma 8 and the above inequality, we have:

\[
\frac{1}{n}\|\hat{\beta}-\beta\|_{D}^{2}+\lambda\|\hat{\beta}\|_{1}\leq\frac{2}{n}\epsilon^{\mathsf{T}}\mathbf{X}(\hat{\beta}-\beta)+\lambda\|\beta\|_{1}+\frac{4\alpha}{n}\|\beta\|_{2}^{2}
\]
\[
\implies\frac{2}{n}\|\hat{\beta}-\beta\|_{D}^{2}\leq\lambda\|\hat{\beta}-\beta\|_{1}+2\lambda\|\beta\|_{1}+\frac{8\alpha}{n}\|\beta\|_{2}^{2}-2\lambda\|\hat{\beta}\|_{1}
\]

Now, since $\|\hat{\beta}\|_{1}\geq 0$, we get:

\[
\implies\frac{2}{n}\|\hat{\beta}-\beta\|_{D}^{2}\leq\lambda(\|\hat{\beta}-\beta\|_{1}-\|\hat{\beta}\|_{1})+2\lambda\|\beta\|_{1}+\frac{8\alpha}{n}\|\beta\|_{2}^{2}
\]
\[
\implies\frac{2}{n}\|\hat{\beta}-\beta\|_{D}^{2}\leq 3\lambda\|\beta\|_{1}+\frac{8\alpha}{n}\|\beta\|_{2}^{2}
\]

where the last implication follows from the triangle inequality. ∎

Combining lemma 7 and lemma 9, we get lemma 5:

Lemma 5 (restated).

Proof.

Let $\lambda_{0}:=2\sigma L\sqrt{(\gamma^{2}+2\log d)/n}$, and note that $\lambda\geq 2\lambda_{0}$. Therefore, from lemma 7 and lemma 9, we get that with probability at least $1-2\exp[-\gamma^{2}/2]$:

\[
\frac{2}{n}\|\hat{\beta}-\beta\|_{D}^{2}\leq 3\lambda\|\beta\|_{1}+\frac{8\alpha}{n}\|\beta\|_{2}^{2}
\]
\[
\implies\|\hat{\beta}-\beta\|_{D}^{2}\leq\frac{3n\lambda b}{2}+4\alpha b^{2}
\]
\[
\implies\|\hat{\beta}-\beta\|_{D}^{2}\leq 6\sigma Lb\sqrt{n(\gamma^{2}+2\log(d))}+4\alpha b^{2}
\]

where the first implication follows from $\|\beta\|_{2}\leq\|\beta\|_{1}\leq b$, and the last implication follows from our choice of $\lambda$. ∎

Appendix D Basic Inequality for Elastic Net

Lemma 8 (restated).

Proof.

Since $\hat{\beta}$ is the Elastic net estimate for parameters $\alpha,\lambda\geq 0$, we have:

\[
\frac{1}{n}\Big[\|Y-\mathbf{X}\hat{\beta}\|_{2}^{2}+2\alpha\|\hat{\beta}\|_{2}^{2}\Big]+\lambda\|\hat{\beta}\|_{1}\leq\frac{1}{n}\Big[\|Y-\mathbf{X}\beta\|_{2}^{2}+2\alpha\|\beta\|_{2}^{2}\Big]+\lambda\|\beta\|_{1}
\]
\[
\implies\frac{1}{n}\|\mathbf{X}\hat{\beta}\|_{2}^{2}-\frac{2}{n}Y^{\mathsf{T}}\mathbf{X}\hat{\beta}+\frac{2\alpha}{n}\|\hat{\beta}\|_{2}^{2}+\lambda\|\hat{\beta}\|_{1}\leq\frac{1}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}Y^{\mathsf{T}}\mathbf{X}\beta+\frac{2\alpha}{n}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}
\]

Adding $\frac{1}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}(\mathbf{X}\beta)^{\mathsf{T}}(\mathbf{X}\hat{\beta})+\frac{2\alpha}{n}\|\beta\|_{2}^{2}$ to both sides and re-arranging terms, we get:

\[
\begin{aligned}
&\frac{1}{n}\|\mathbf{X}\hat{\beta}\|_{2}^{2}+\frac{1}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}(\mathbf{X}\beta)^{\mathsf{T}}(\mathbf{X}\hat{\beta})+\frac{2\alpha}{n}(\|\hat{\beta}\|_{2}^{2}+\|\beta\|_{2}^{2})+\lambda\|\hat{\beta}\|_{1}\\
\leq\;&\Big[\frac{2}{n}Y^{\mathsf{T}}(\mathbf{X}\hat{\beta})-\frac{2}{n}(\mathbf{X}\beta)^{\mathsf{T}}(\mathbf{X}\hat{\beta})\Big]+\Big[\frac{2}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}Y^{\mathsf{T}}(\mathbf{X}\beta)\Big]+\frac{4\alpha}{n}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}.
\end{aligned}
\tag{6}
\]

Note that we can lower bound the LHS of eq. 6:

\[
\begin{aligned}
&\frac{1}{n}\|\mathbf{X}\hat{\beta}\|_{2}^{2}+\frac{1}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}(\mathbf{X}\beta)^{\mathsf{T}}(\mathbf{X}\hat{\beta})+\frac{2\alpha}{n}(\|\hat{\beta}\|_{2}^{2}+\|\beta\|_{2}^{2})+\lambda\|\hat{\beta}\|_{1}\\
=\;&\frac{1}{n}\|\mathbf{X}(\hat{\beta}-\beta)\|_{2}^{2}+\frac{\alpha}{n}(\|\hat{\beta}-\beta\|_{2}^{2}+\|\hat{\beta}+\beta\|_{2}^{2})+\lambda\|\hat{\beta}\|_{1}\\
\geq\;&\frac{1}{n}\|\mathbf{X}(\hat{\beta}-\beta)\|_{2}^{2}+\frac{\alpha}{n}\|\hat{\beta}-\beta\|_{2}^{2}+\lambda\|\hat{\beta}\|_{1}\\
=\;&\frac{1}{n}\|\hat{\beta}-\beta\|_{D}^{2}+\lambda\|\hat{\beta}\|_{1}
\end{aligned}
\tag{7}
\]

By substituting $Y=\mathbf{X}\beta+\epsilon$, we simplify the RHS of eq. 6 and get:

\[
\begin{aligned}
&\Big[\frac{2}{n}Y^{\mathsf{T}}(\mathbf{X}\hat{\beta})-\frac{2}{n}(\mathbf{X}\beta)^{\mathsf{T}}(\mathbf{X}\hat{\beta})\Big]+\Big[\frac{2}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}Y^{\mathsf{T}}(\mathbf{X}\beta)\Big]+\frac{4\alpha}{n}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}\\
=\;&\Big[\frac{2}{n}\epsilon^{\mathsf{T}}(\mathbf{X}\hat{\beta})\Big]+\Big[-\frac{2}{n}\epsilon^{\mathsf{T}}(\mathbf{X}\beta)\Big]+\frac{4\alpha}{n}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}\\
=\;&\frac{2}{n}\epsilon^{\mathsf{T}}\mathbf{X}(\hat{\beta}-\beta)+\frac{4\alpha}{n}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}.
\end{aligned}
\tag{8}
\]

Putting eq. 6, eq. 7, and eq. 8 together, we get:

\[
\frac{1}{n}\|\hat{\beta}-\beta\|_{D}^{2}+\lambda\|\hat{\beta}\|_{1}\leq\frac{2}{n}\epsilon^{\mathsf{T}}\mathbf{X}(\hat{\beta}-\beta)+\lambda\|\beta\|_{1}+\frac{4\alpha}{n}\|\beta\|_{2}^{2}.
\]

∎

Appendix E Proof for theorem 1

Theorem 1 (restated).

Proof.

Recall that in Ridge SurveyUCB, for the construction of our confidence sets, for all $k\in[K]$ and $t\in[T]$, we choose:

\[
\theta_{k,t}=\sigma\sqrt{|H_{k,t}|\log\bigg(\frac{1+n_{k,t}L^{2}/\alpha}{\delta/(K(1+n_{k,t})^{2})}\bigg)}+\sqrt{\alpha}\,b
\]

Hence, from corollary 1 and lemma 4, we have:

\[
\Pr\Big[\forall i\in[K],\forall t,\ \beta_{i}\in C_{i,t-1}\Big]\geq 1-\delta.
\]

Therefore, with probability at least $1-\delta$, lemma 3 gives us that the following regret guarantee holds:

\[
\begin{aligned}
R_{T}&=\sum_{i\in[K]}R_{T}^{i}\\
&\leq\sum_{i\in[K]}\theta_{i,T-1}\sqrt{8dn_{i,T}\log\bigg(\frac{d\alpha+n_{i,T}L^{2}}{d\alpha}\bigg)}\\
&\leq\sum_{i\in[K]}\bigg(\sigma\sqrt{d\log\bigg(\frac{1+TL^{2}/\alpha}{\delta/(K(1+T)^{2})}\bigg)}+\sqrt{\alpha}\,b\bigg)\sqrt{8dn_{i,T}\log\bigg(\frac{d\alpha+TL^{2}}{d\alpha}\bigg)}\\
&\leq\bigg(\sigma\sqrt{d\log\bigg(\frac{1+TL^{2}/\alpha}{\delta/(K(1+T)^{2})}\bigg)}+\sqrt{\alpha}\,b\bigg)\sqrt{8d\log\bigg(\frac{d\alpha+TL^{2}}{d\alpha}\bigg)}\sum_{i\in[K]}\sqrt{n_{i,T}}\\
&\leq\bigg(\sigma\sqrt{d\log\bigg(\frac{1+TL^{2}/\alpha}{\delta/(K(1+T)^{2})}\bigg)}+\sqrt{\alpha}\,b\bigg)\sqrt{8d\log\bigg(\frac{d\alpha+TL^{2}}{d\alpha}\bigg)}\sqrt{TK}
\end{aligned}
\]

where the last inequality follows from the Cauchy–Schwarz inequality and the fact that $T=\sum_{i=1}^{K}n_{i,T}$. Therefore, our high-probability regret bound is $O(d\sqrt{KT\log(K)}\log(T))$. ∎

Appendix F Proof for theorem 2

Theorem 2 (restated).

Proof.

Recall that in Elastic Net SurveyUCB we use Elastic net regression to estimate β^i,t\hat{\beta}_{i,t}, with regularization parameters α\alpha and λi,t\lambda_{i,t} for all arms i[K]i\in[K] and time-steps tt. Where:

\[
\lambda_{i,t}:=4\sigma L\sqrt{\frac{2}{n_{i,t}}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)},\quad\forall i\in[K],\forall t\in[T].
\]

For the construction of our confidence sets, for all $i\in[K]$ and $t\in[T]$, we choose:

\[
\theta_{i,t}:=\sqrt{6\sigma Lb\sqrt{2n_{i,t}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)}+4\alpha b^{2}}.
\]

Now from corollary 1 and lemma 5, we have that:

\[
\Pr\Big[\forall i\in[K],\forall t,\ \beta_{i}\in C_{i,t-1}\Big]\geq 1-\delta.
\]

Therefore, with probability at least $1-\delta$, lemma 3 yields the following regret guarantee:

\begin{align*}
R_{T}&=\sum_{i\in[K]}R_{T}^{i}\\
&\leq\sum_{i\in[K]}\theta_{i,T-1}\sqrt{8dn_{i,T}\log\bigg(\frac{d\alpha+n_{i,T}L^{2}}{d\alpha}\bigg)}\\
&\leq\sum_{i\in[K]}\sqrt{6\sigma Lb\sqrt{2n_{i,T}\log\bigg(\frac{4dKn_{i,T}^{2}}{\delta}\bigg)}+4\alpha b^{2}}\sqrt{8dn_{i,T}\log\bigg(\frac{d\alpha+n_{i,T}L^{2}}{d\alpha}\bigg)}\\
&\leq\sum_{i\in[K]}4n_{i,T}^{3/4}\sqrt{3\sigma Lbd\bigg(\sqrt{2\log\bigg(\frac{4dKT^{2}}{\delta}\bigg)}+\frac{2\alpha b}{3\sigma L}\bigg)\log\bigg(\frac{d\alpha+TL^{2}}{d\alpha}\bigg)}\\
&\leq 4T^{3/4}K^{1/4}d^{1/2}\sqrt{3\sigma Lb\bigg(\sqrt{2\log\bigg(\frac{4dKT^{2}}{\delta}\bigg)}+\frac{2\alpha b}{3\sigma L}\bigg)\log\bigg(\frac{d\alpha+TL^{2}}{d\alpha}\bigg)}
\end{align*}

where the last inequality follows from Hölder's inequality and the fact that $T=\sum_{i\in[K]}n_{i,T}$. Therefore, our high-probability regret bound is $O(K^{1/4}d^{1/2}T^{3/4}\log^{3/4}(T)\log^{1/4}(dK))$. ∎
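For completeness, the Hölder step (with exponents $4/3$ and $4$, applied to $a_{i}=n_{i,T}^{3/4}$ and $b_{i}=1$) used in the final inequality is:

```latex
\[
\sum_{i\in[K]}n_{i,T}^{3/4}
\leq\bigg(\sum_{i\in[K]}n_{i,T}\bigg)^{3/4}\bigg(\sum_{i\in[K]}1\bigg)^{1/4}
=K^{1/4}T^{3/4}.
\]
```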

Appendix G Implementation Issues

Here we describe how one can handle some of the difficulties in implementing our algorithms.

G.1 Ridge SurveyUCB

We add a stabilizing numerical approximation that sets numbers in $[-10^{-8},10^{-8}]$ to zero throughout the algorithm, for every parameter that we store. We do this because we often need to take the pseudo-inverse of $D_{t-1}^{i}$ for different arms at every time-step; this makes our updates particularly vulnerable to small errors around zero, and the algorithm exhibits unexpected behavior without the stabilization.
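A minimal sketch of this stabilization in Python (the threshold $10^{-8}$ matches the text; the function name `stabilize` is ours):

```python
import numpy as np

def stabilize(arr, tol=1e-8):
    """Set entries with magnitude at most `tol` to exactly zero.

    Applying this to every stored parameter (e.g. the matrix D_{t-1}^i
    before taking its pseudo-inverse) keeps tiny floating-point residues
    from contaminating the pseudo-inverse.
    """
    arr = np.asarray(arr, dtype=float).copy()
    arr[np.abs(arr) <= tol] = 0.0
    return arr

# Example: small off-diagonal residues are zeroed before the pseudo-inverse.
D = np.array([[1.0, 3e-9], [-2e-9, 0.5]])
D_stable = stabilize(D)
D_pinv = np.linalg.pinv(D_stable)
```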

G.2 Elastic Net SurveyUCB

The Python sklearn implementation of Elastic Net regression seems to have convergence issues for large L1 regularization. For lower-dimensional problems, this can be fixed by setting the tolerance parameter to be very low (which also increases the running time). For higher-dimensional problems, it seems better to divide the L1 regularization parameter by the dimension of the problem and adjust the confidence set bounds accordingly. In particular, we choose:

\[
\lambda_{i,t}:=\frac{1}{d}\cdot 4\sigma L\sqrt{\frac{2}{n_{i,t}}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)},\quad\forall i\in[K],\forall t\in[T].
\]

Note that we are just dividing our original L1 regularization parameter by a factor of $d$. To ensure that our confidence sets hold with high probability, we choose for all arms and time-steps (this is easy to check via a simple modification of the analysis in lemma 9):

\[
\theta_{i,t}:=\sqrt{6\sigma L(b+\|\hat{\beta}_{i,t}\|)d\sqrt{2n_{i,t}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)}+4\alpha b^{2}}.
\]
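A sketch of the per-arm fit with sklearn's `ElasticNet`. Note that sklearn minimizes $\frac{1}{2n}\|y-Xw\|^{2}+ar\|w\|_{1}+\frac{a(1-r)}{2}\|w\|^{2}$ with parameters `alpha` ($a$) and `l1_ratio` ($r$); the mapping from the paper's penalties to $(a,r)$ below is our assumption and would need to be checked against the exact objective normalization used in lemma 9:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_arm(X, y, lam_l1, alpha_l2):
    """Fit elastic net for one arm.

    lam_l1 plays the role of the paper's lambda_{i,t} (L1 penalty) and
    alpha_l2 the role of alpha (L2 penalty). The translation into
    sklearn's (alpha, l1_ratio) parameterization is an assumption.
    A very low `tol` helps with the convergence issues mentioned above.
    """
    a = lam_l1 + alpha_l2
    r = lam_l1 / a
    model = ElasticNet(alpha=a, l1_ratio=r, tol=1e-6, max_iter=100000)
    model.fit(X, y)
    return model.coef_

# Synthetic check: sparse true coefficients are approximately recovered.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))
beta = np.zeros(10)
beta[:3] = 1.0
y = X @ beta + 0.01 * rng.normal(size=200)

d = X.shape[1]
lam = (1 / d) * 0.1  # L1 penalty scaled down by d, as suggested above
coef = fit_arm(X, y, lam_l1=lam, alpha_l2=0.01)
```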

G.3 Interactive Surveys

The only issue in implementing algorithm 3 is that the optimization problem in eq. 5 is non-convex. It is worth noting that any upper bound on the optimization problem maintains the regret performance of interactive SurveyUCB, and tighter upper bounds demonstrate greater savings in terms of survey length. Hence one could use reasonable convex relaxations of this optimization problem in algorithm 3.

In this sub-section, we provide a well-performing heuristic as a surrogate for the optimization problem when the context space is $[0,1]^{d}$. Now note that:

\begin{align*}
\max_{x\in F}\quad & x^{\mathsf{T}}\hat{\beta}_{w,t-1}+\sqrt{\theta_{w,t-1}\,(x^{\mathsf{T}}(D_{t-1}^{w})^{-1}x)}\\
\leq\max_{x,y\in F}\quad & x^{\mathsf{T}}\hat{\beta}_{w,t-1}+\sqrt{\theta_{w,t-1}\,(y^{\mathsf{T}}(D_{t-1}^{w})^{-1}y)}
\end{align*}

Note that the optimization problem $\max_{x\in F}x^{\mathsf{T}}\hat{\beta}_{w,t-1}$ is convex, and in fact linear when our context space can be represented by linear constraints. We now only need to handle the non-convex problem:

\[
\max_{y\in F}\sqrt{\theta_{w,t-1}(y^{\mathsf{T}}(D_{t-1}^{w})^{-1}y)}=\sqrt{\max_{y\in F}\theta_{w,t-1}(y^{\mathsf{T}}(D_{t-1}^{w})^{-1}y)}\qquad(9)
\]

This is a non-convex quadratic optimization problem, where (Dt1w)1(D_{t-1}^{w})^{-1} is symmetric (and positive semi-definite). This problem has been well studied and has good approximations based on SDP relaxations (Ye, 1999).

For our simulations we use the following simple heuristic, which has worked surprisingly well, as a surrogate for this optimization problem (eq. 9). In eq. 9, we want to maximize uncertainty in reward over contexts that are consistent with our queries. Intuitively, uncertainty is maximized when the context $y$ is as large as possible. When our context space is $[0,1]^{d}$, we can simply choose the $y$ that is largest coordinate-wise. That is, $y_{j}=1$ if $j$ is not queried (i.e. $j\notin U$) and $y_{j}=(X_{t})_{j}$ if $j$ has been queried (i.e. $j\in U$).
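A minimal sketch of this heuristic in Python (variable names are ours; `queried` plays the role of $U$ and `x_t` of $X_{t}$):

```python
import numpy as np

def heuristic_y(x_t, queried, d):
    """Coordinate-wise largest context consistent with the queries.

    Unqueried coordinates (j not in U) are set to 1, the maximum over
    [0,1]; queried coordinates keep their observed values (X_t)_j.
    """
    y = np.ones(d)
    for j in queried:
        y[j] = x_t[j]
    return y

def ucb_uncertainty(y, D_inv, theta):
    """Surrogate objective value for eq. 9: sqrt(theta * y^T D^{-1} y)."""
    return np.sqrt(theta * (y @ D_inv @ y))

# Example with d = 3, where only feature 1 has been queried.
x_t = np.array([0.2, 0.7, 0.5])
y = heuristic_y(x_t, queried={1}, d=3)  # -> [1.0, 0.7, 1.0]
val = ucb_uncertainty(y, np.eye(3), theta=1.0)
```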