
Survey Bandits with Regret Guarantees

Sanath Kumar Krishnamurthy    Susan Athey
Abstract

We consider a variant of the contextual bandit problem. In standard contextual bandits, when a user arrives we get the user’s complete feature vector and then assign a treatment (arm) to that user. In a number of applications (like healthcare), collecting features from users can be costly. To address this issue, we propose algorithms that avoid needless feature collection while maintaining strong regret guarantees.

Machine Learning, Contextual bandits

1 Stanford University
The first author is supported in part by a Dantzig-Lieberman Operations Research Fellowship and a Management Science and Engineering Department Fellowship.
The second author is supported in part by the Sloan Foundation, Schmidt Futures, and ONR grant N00014-17-1-2131.

1 Introduction

In a number of applications, like healthcare and mobile health, collecting features from users can be costly. This encourages us to develop variants of contextual bandits that at any time-step select a set of features to be collected from users, and then select an action based on these observed features. We use the term survey bandits to refer to this set-up. Through this formulation, we address the issue of needless feature collection in contextual bandits.

Suppose we are building a system to recommend charities that users can donate to. Since there are many charities we could recommend, it is efficient to make personalized recommendations. This requires us to collect features from users, which can be done by asking users to fill out a survey/questionnaire. The reward at any time-step might be the amount donated. As usual, our goal is to minimize regret. Beyond regret, we would also like to improve user experience by shortening the survey we require users to fill out. We try to answer the following question: Can we ensure strong regret guarantees, while being able to shorten our survey over time?

Our contributions: We answer this question in the affirmative. We start by considering zero-shot surveys, where the decision maker has to decide the set of features to be queried at every time-step before the user arrives, and then make a personalized decision based on the responses. We now state our assumptions. In addition to the standard assumptions made for LinUCB, we introduce Assumption 1 (beta-min), which is common in the feature selection literature. We propose algorithms that are natural variants of LinUCB (Li et al., 2010) for the survey bandits framework. Under our assumptions, we prove regret guarantees for these algorithms. At the same time, these algorithms exploit sparsity in arm parameters to reduce the set of features collected from users.

In fact, our algorithm Ridge SurveyUCB has regret guarantees that are tight even for standard contextual bandits, O(\sqrt{T}\log(T)). We also provide an algorithm, Elastic Net SurveyUCB, which is more robust to Assumption 1, but with a weaker O((T\log(T))^{3/4}) regret guarantee. This result requires us to prove a new adaptive tail bound for the elastic net estimator, which may be of independent interest.

Through simulations, when Assumption 1 holds, we demonstrate that both algorithms perform well in terms of regret. Unfortunately, Ridge SurveyUCB can perform poorly on regret when Assumption 1 is violated. Fortunately, Elastic Net SurveyUCB performs well even when Assumption 1 is violated. It is worth noting that we can still use Ridge SurveyUCB; we just need to use a conservative choice for the beta-min parameter (in Assumption 1). Even for conservative choices of the beta-min parameter, we eventually see benefits in survey length.

Through simulations, we also see that both algorithms demonstrate savings in terms of survey length. In fact, in the absence of sub-optimal arms, both algorithms always remove the features that are not relevant for reward prediction in the survey.

We also consider settings with interactive surveys where, at each time-step, before making a personalized decision, the decision maker can continually expand the set of queries based on previous responses by the user at that time-step. This allows us to start by querying a smaller set of features and expand our survey for the user only if it is needed to make a better personalized decision. We develop variants of our algorithms that use interactive surveys to ensure lower survey lengths, especially in the presence of sub-optimal arms, while maintaining the same regret performance.

Related work: Prior work (e.g. (Abbasi-Yadkori et al., 2012) and (Bastani & Bayati, 2015)) has also exploited sparsity in arm parameters to provide stronger sparsity-dependent regret guarantees. When an upper bound on the sparsity of arm-parameters is known to the decision-maker, Abbasi-Yadkori et al. (2012) provide algorithms with tight regret guarantees for contextual bandits. Under distributional assumptions on user covariates, Bastani & Bayati (2015) provide stronger regret guarantees, even when no sparsity parameter is provided to the algorithm. While these regret guarantees are stronger than the ones we provide, it is important to note that we make no distributional assumption on user covariates and we do not assume that the decision-maker knows an upper bound on the sparsity of arm-parameters.

Bouneffouf et al. (2017) is the most closely related paper to our work. They develop interesting algorithms for a very similar setup and evaluate them empirically. At every time-step, their algorithms query a fixed number of features (s). Their algorithms require this parameter s as an input. For a conservative choice of s, we would see little benefit to the survey length. Unfortunately, their algorithms can empirically perform much worse than contextual bandits for small choices of s. We view our work as an alternate approach with guarantees on regret.

2 Preliminaries

2.1 Problem Setting

Let T denote the number of time-steps. At each time-step, a new user arrives. The decision maker has a survey with d questions and can ask a user any subset of these questions. At any time-step t, a user comes with a set of observable covariates X_{t}. Here X_{t} is a d-vector, corresponding to the t-th user's observable answers to the d questions on the survey.

The decision maker has access to K arms (decisions). Each arm yields a noisy, user-specific reward. In particular, each arm i\in[K] has an unknown parameter \beta_{i}\in\mathbb{R}^{d}. At time t, pulling arm i yields the reward Y_{i}(t):=X_{t}^{\mathsf{T}}\beta_{i}+\epsilon_{i,t}, where the \epsilon_{i,t} form an independent sequence of \sigma-sub-Gaussian random variables. Note that at any time-step, the decision maker can only observe the reward of the arm that was pulled.
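As a concrete illustration, the reward model above can be simulated in a few lines. This is a toy sketch: the dimensions, noise level, and function names are our own choices, not from the paper, and Gaussian noise is used as one example of a \sigma-sub-Gaussian distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, sigma = 5, 3, 0.1           # toy dimensions and noise scale (illustrative)
betas = rng.normal(size=(K, d))   # unknown arm parameters beta_1, ..., beta_K

def pull_arm(i, x):
    """Noisy reward Y_i(t) = x^T beta_i + eps_{i,t}; Gaussian noise is sigma-sub-Gaussian."""
    return x @ betas[i] + sigma * rng.normal()
```

Only the pulled arm's reward is revealed to the learner at each time-step.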

Some notation: For any vector z\in\mathbb{R}^{d} and any index set I\subseteq[d], we let z_{I}\in\mathbb{R}^{d} denote the vector obtained by setting the coordinates of z that are not in I to zero. For any matrix A\in\mathbb{R}^{d\times d} and any index set I\subseteq[d], we let A_{I}\in\mathbb{R}^{d\times d} denote the matrix obtained by setting the rows and columns of A that do not correspond to an index in I to zero. For any z\in\mathbb{R}^{d}, we let \text{Supp}(z) denote the support of z. And for any set \zeta\subseteq\mathbb{R}^{d}, we let \text{Supp}(\zeta):=\cup_{z\in\zeta}\text{Supp}(z).

Goal: The goal is to design a sequential decision-making policy \pi=(\pi^{s},\pi^{a}) that maximizes expected cumulative reward and, subject to strong reward guarantees, minimizes the total number of questions asked to users. Let \pi^{s}_{t}\in 2^{[d]} denote the subset of survey questions queried by policy \pi at time t\in[T], and let \pi^{a}_{t}\in[K] denote the arm chosen by policy \pi at time t\in[T]. Note that we do not observe X_{t} and only observe (X_{t})_{\pi^{s}_{t}}; hence we must be able to choose the arm \pi^{a}_{t} using only the observed covariates (X_{t})_{\pi^{s}_{t}} and data collected from previous time-steps.

Target policy: We now describe a "sensible" target policy. Consider the target policy \pi^{*}=(\pi^{s*},\pi^{a*}) that already knows the arm parameters \{\beta_{i}\}_{i\in[K]} but not the noise parameters. We want the target policy to maximize expected cumulative reward, and hence at any time-step t the policy must pick an arm \pi^{a*}_{t}\in\arg\max_{j}X_{t}^{\mathsf{T}}\beta_{j}. Therefore, the target policy only needs to query features that influence arm rewards. That is,

\pi_{t}^{s*}:=\bigcup_{i\in[K]}\pi^{s*}_{i,t},\text{ where }\pi^{s*}_{i,t}:=\text{Supp}(\beta_{i}).

Note that for any vector z\in\mathbb{R}^{d} and any arm i, we have z^{\mathsf{T}}\beta_{i}=z_{\pi^{s*}_{t}}^{\mathsf{T}}\beta_{i}. Therefore at any time-step t, having observed (X_{t})_{\pi^{s*}_{t}}, the target policy is able to choose the best arm for the covariate X_{t}:

\pi^{a*}_{t}\in\arg\max_{j}X_{t}^{\mathsf{T}}\beta_{j}=\arg\max_{j}(X_{t})_{\pi_{t}^{s*}}^{\mathsf{T}}\beta_{j}.

Regret: If \pi^{a}_{t}=i, we define the expected regret at time t as the difference between the maximum expected reward and the expected reward of arm i at time t, i.e., r_{t}:=\max_{j}X_{t}^{\mathsf{T}}\beta_{j}-X_{t}^{\mathsf{T}}\beta_{i}. And we define the cumulative expected regret as R_{T}:=\sum_{t=1}^{T}r_{t}.
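The regret definitions translate directly to code; a minimal sketch (the function names are ours, for illustration):

```python
import numpy as np

def instantaneous_regret(x, betas, i):
    """r_t = max_j x^T beta_j - x^T beta_i, for the pulled arm i."""
    expected_rewards = betas @ x          # one expected reward per arm
    return expected_rewards.max() - expected_rewards[i]

def cumulative_regret(xs, betas, arms):
    """R_T = sum_t r_t over the sequence of contexts and pulled arms."""
    return sum(instantaneous_regret(x, betas, i) for x, i in zip(xs, arms))
```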

Additional notation: Let \mathbf{X}\in\mathbb{R}^{T\times d} denote the design matrix, whose rows are X_{t}. Let Y_{i}\in\mathbb{R}^{T} denote the vector of observations X_{t}^{\mathsf{T}}\beta_{i}+\epsilon_{i,t}, where entries of Y_{i} may be missing (if arm i wasn't played at time t, then the t-th coordinate of Y_{i} is missing). For all i\in[K] and any n\in[T], define the sample set S_{i,n}:=\{t\,|\,\pi_{t}^{a}=i,t\leq n\} and let S_{i}:=\{t\,|\,\pi_{t}^{a}=i\}\subseteq[T]. Let n_{i,t} denote the number of times arm i was pulled up to time t (i.e., n_{i,t}=|S_{i,t}|). For any S^{\prime}\subseteq[T], we let \mathbf{X}(S^{\prime}) be the |S^{\prime}|\times d submatrix of \mathbf{X} whose rows are X_{t} for each t\in S^{\prime}. Similarly, when S^{\prime}\subseteq S_{i}, we let Y_{i}(S^{\prime}) be the |S^{\prime}|-vector whose coordinates are Y_{i}(t) for all t\in S^{\prime} (since S^{\prime}\subseteq S_{i} implies \pi_{t}^{a}=i for all t\in S^{\prime}, the vector Y_{i}(S^{\prime}) has no missing values). Also let I\in\mathbb{R}^{d\times d} denote the identity matrix.

2.2 Assumptions

We make two assumptions. The first allows us to ignore features that have small influence on arm rewards; we assume that such features in fact have no influence on arm rewards.

Assumption 1 (Beta-min).

The decision maker knows a parameter \beta_{\text{min}}\in\mathbb{R}_{+} such that for all arms i\in[K] and all q\in[d], either \|(\beta_{i})_{\{q\}}\|_{1}=0 or \|(\beta_{i})_{\{q\}}\|_{1}>\beta_{\text{min}}.

The following is a common assumption in contextual multi-armed bandit problems; it is equivalent to assuming that expected rewards are bounded.

Assumption 2 (Bounded rewards).

We make the following assumptions to ensure that the expected reward of any arm for any context is bounded: for all t\in[T] and i\in[K], we have \|X_{t}\|_{1}\leq L and \|\beta_{i}\|_{1}\leq b.
For simplicity, we further assume that for all t\in[T] and i\in[K], the potential reward of pulling arm i lies in [0,1], i.e., Y_{i}(t)\in[0,1]. (We can avoid this assumption; we would just need to choose \alpha\geq\max(1,L) in definition 2 for the regret guarantee to work out, where \|X_{t}\|_{2}\leq L for all t.)

Since for any vector v\in\mathbb{R}^{d} we have \|v\|_{p}\leq\|v\|_{1} for all p\geq 1, Assumption 2 gives us that the p-norms of X_{t} and \beta_{i} are bounded by L and b, respectively, for all time-steps t and arms i\in[K].

3 UCB for Survey Bandits

In this section, we describe a natural extension of LinUCB (Li et al., 2010) for survey bandits which involves describing a policy for selecting survey questions and a consistent arm selection policy that can choose an arm given the observed covariates.

Algorithm 1 SurveyUCB
1:  Initialize confidence sets. (See section 4.1)
2:  for t:=1,2,\dots do
3:     Let \pi^{s}_{k,t}\leftarrow\text{Supp}(C_{k,t-1}) for all k\in[K].
4:     Query \pi^{s}_{t}:=\bigcup_{k\in[K]}\pi^{s}_{k,t}, and observe (X_{t})_{\pi^{s}_{t}}.
5:     Pick \pi^{a}_{t}\in\arg\max_{k\in[K]}\max_{\beta\in C_{k,t-1}}(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta.
6:     Play arm \pi^{a}_{t} and observe reward Y_{t}.
7:     Set C_{k,t}\leftarrow C_{k,t-1} for all k\in[K]\setminus\{\pi^{a}_{t}\}.
8:     Update C_{\pi^{a}_{t},t} (using AlgConfidence).
9:  end for

Upper confidence bound (UCB) algorithms follow the principle of optimism in the face of uncertainty. The essential idea is to construct high-probability confidence sets C_{i,t-1}\subseteq\mathbb{R}^{d} for the parameter \beta_{i} of every arm i\in[K], from the observed covariates (X_{1})_{\pi^{s}_{1}},(X_{2})_{\pi^{s}_{2}},\dots,(X_{t-1})_{\pi^{s}_{t-1}} and rewards Y_{1},Y_{2},\dots,Y_{t-1}. That is, with high probability, \beta_{i}\in C_{i,t-1} for all time-steps t and all arms i\in[K]. The algorithm then queries the set of features:

\pi^{s}_{t}:=\bigcup_{i\in[K]}\pi^{s}_{i,t},\text{ with }\pi^{s}_{i,t}=\text{Supp}(C_{i,t-1})\text{ for all }i\in[K].

Therefore, we have \beta=\beta_{\pi^{s}_{t}} for all \beta\in\bigcup_{i\in[K]}C_{i,t-1}. Hence, for any z\in\mathbb{R}^{d} and \beta\in\bigcup_{i\in[K]}C_{i,t-1}, we have z^{\mathsf{T}}\beta=z_{\pi^{s}_{t}}^{\mathsf{T}}\beta. It follows that:

(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta=X_{t}^{\mathsf{T}}\beta,\text{ for all }\beta\in\bigcup_{i\in[K]}C_{i,t-1}.

The algorithm chooses an optimistic estimate \tilde{\beta}_{i,t}\in\arg\max_{\beta\in C_{i,t-1}}X_{t}^{\mathsf{T}}\beta for every arm i\in[K], and then chooses the arm \pi_{t}^{a}\in\arg\max_{i\in[K]}X_{t}^{\mathsf{T}}\tilde{\beta}_{i,t} that maximizes reward according to the optimistic estimates. Equivalently, and more compactly, the algorithm chooses the arm:

\pi_{t}^{a}\in\arg\max_{k\in[K]}\max_{\beta\in C_{k,t-1}}X_{t}^{\mathsf{T}}\beta=\arg\max_{k\in[K]}\max_{\beta\in C_{k,t-1}}(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta.

Note that the arm \pi^{a}_{t} is chosen given only the observed covariates (X_{t})_{\pi^{s}_{t}}. We call the resulting algorithm SurveyUCB.
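For confidence sets in Standard form (section 4.1), the inner maximization in the arm-selection rule has a closed form: over the ellipsoid \|\beta-\hat{\beta}\|_{D}\leq\theta, the maximum of x^{\mathsf{T}}\beta is x^{\mathsf{T}}\hat{\beta}+\theta\|x\|_{D^{-1}}. A sketch, assuming x, \hat{\beta}, and D have already been restricted to the support coordinates (the function name is ours):

```python
import numpy as np

def ucb_index(x, beta_hat, D, theta):
    """Optimistic value max_{||b - beta_hat||_D <= theta} x^T b
       = x^T beta_hat + theta * ||x||_{D^{-1}}."""
    D_inv_x = np.linalg.solve(D, x)   # D^{-1} x without forming the inverse
    return x @ beta_hat + theta * np.sqrt(x @ D_inv_x)
```

The algorithm then plays the arm with the largest index.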

4 Confidence Sets for SurveyUCB

In this section, we define confidence sets in Standard form and describe the construction of these sets using AlgConfidence. It will turn out that AlgConfidence constructs confidence sets in Standard form. Throughout this manuscript, we use AlgConfidence to construct confidence sets for SurveyUCB.

4.1 Confidence Sets in Standard Form

We start by defining weighted norms and use that to define confidence sets in Standard form.

Definition 1 (Weighted norm).

For any vector z\in\mathbb{R}^{d} and any positive semi-definite matrix A, we define the weighted norm of z with respect to A as \|z\|_{A}:=\sqrt{z^{\mathsf{T}}Az}. (\|\cdot\|_{A} is a norm when A is positive definite, and a seminorm when A is positive semi-definite.)

Definition 2 (Standard form).

We say confidence sets \{C_{i,t-1}\}_{(i,t)\in[K]\times[T]} are in Standard form if at any time-step t and for any arm i\in[K], the set C_{i,t-1} is a ball centered around our estimate \hat{\beta}_{i,t-1} of the true arm parameter \beta_{i} under a weighted norm, and \{C_{i,t-1}\}_{t\in[T]} have non-increasing supports. More specifically, for all i\in[K], t\in[T], and for some \alpha>0, we have:

\text{Supp}(\hat{\beta}_{i,t-1})\subseteq H_{i,t-1}:=\text{Supp}(C_{i,t-1}),
H_{i,t}\subseteq H_{i,t-1}\subseteq[d],\quad 1\leq\theta_{i,t-1}\leq\theta_{i,t},
D_{t-1}^{i}:=(\alpha I)_{H_{i,t-1}}+\sum_{w\in S_{i,t-1}}(X_{w})_{H_{i,t-1}}(X_{w})_{H_{i,t-1}}^{\mathsf{T}},
C_{i,t-1}:=\{\beta\,|\,\|\beta-\hat{\beta}_{i,t-1}\|_{D_{t-1}^{i}}\leq\theta_{i,t-1},\ \text{Supp}(\beta)\subseteq H_{i,t-1}\}.

In SurveyUCB, note that the confidence sets determine the set of features queried. Also, at any time-step t, the confidence sets must be constructed using only the observed covariates and rewards at every time-step up to t. Lemma 1 gives us a set of observed covariates when SurveyUCB uses confidence sets in Standard form.

Lemma 1.

If SurveyUCB uses confidence sets in Standard form for times t\in\{1,2,\dots,t^{\prime}\}, then for any arm k\in[K], we observe \{(X_{t})_{H_{k,t^{\prime}-1}}\}_{t=1}^{t^{\prime}}, where H_{k,t-1}:=\text{Supp}(C_{k,t-1}) for all arms k\in[K] and time-steps t\in[t^{\prime}].

Proof.

All statements in this proof hold for all arms k\in[K] and time-steps t\in[T]. SurveyUCB queries the set of features \pi^{s}_{t}=\cup_{k}\pi^{s}_{k,t} at time t, where \pi^{s}_{k,t}=\text{Supp}(C_{k,t-1})=H_{k,t-1}. From the structure of confidence sets in Standard form, we have \pi^{s}_{k,t+1}=H_{k,t}\subseteq H_{k,t-1}=\pi^{s}_{k,t}. This implies that:

H_{k,t-1}=\pi^{s}_{k,t}\subseteq\pi^{s}_{t}\subseteq\pi^{s}_{t-1}\subseteq\dots\subseteq\pi^{s}_{1}.

That is, we observe \{(X_{t})_{H_{k,t^{\prime}-1}}\}_{t=1}^{t^{\prime}}. ∎

Note that at any time-step t, SurveyUCB observes (X_{t})_{\pi^{s}_{t}}. Lemma 1 shows that for any t\in[t^{\prime}], we also observe (X_{t})_{H_{k,t^{\prime}-1}}, because the set of features queried by SurveyUCB at t is a superset of the support of the confidence set C_{k,t^{\prime}-1}. That is, under the conditions of lemma 1 we have:

\text{Supp}(C_{k,t^{\prime}-1})=H_{k,t^{\prime}-1}\subseteq\pi^{s}_{t},\text{ for all }t\in[t^{\prime}].

Initializing confidence sets: We now define our confidence set for time-step zero based on Assumption 2. For any arm k\in[K], under Assumption 2 we have \|\beta_{k}\|_{\alpha I}=\sqrt{\beta_{k}^{\mathsf{T}}(\alpha I)\beta_{k}}\leq\sqrt{\alpha}b. That is, \beta_{k}\in C_{k,0} for all arms k\in[K], where:

C_{k,0}:=\{\beta\,|\,\|\beta\|_{\alpha I}\leq\sqrt{\alpha}\cdot b\}=\{\beta\,|\,\|\beta\|_{\alpha I}\leq\sqrt{\alpha}\cdot b,\ \text{Supp}(\beta)\subseteq[d]\}.

Based on Assumption 1, section 4.2 describes the construction of the confidence set C_{k,t^{\prime}} for all arms k\in[K] and time-steps t^{\prime}\geq 1.

4.2 AlgConfidence

Consider any time-step t^{\prime}\geq 1. Suppose that the confidence sets constructed up to time t^{\prime} are in Standard form, i.e., \{C_{i,t}\}_{(i,t)\in[K]\times[t^{\prime}-1]} are in Standard form. Let k\in[K] be the arm pulled at time t^{\prime}. We now describe the AlgConfidence update for the confidence set C_{k,t^{\prime}} and show that the confidence sets \{C_{i,t}\}_{(i,t)\in[K]\times[t^{\prime}]} are in Standard form. This inductively implies that AlgConfidence constructs confidence sets in Standard form, since the base case trivially holds (confidence sets constructed up to and including time-step zero are trivially in Standard form).

From lemma 1 we already know that if the confidence sets constructed up to time t^{\prime} are in Standard form, then the decision maker using SurveyUCB at least observes (X_{t})_{H_{k,t^{\prime}-1}} at every time-step t\in[t^{\prime}], where H_{k,t^{\prime}-1}=\text{Supp}(C_{k,t^{\prime}-1}). Hence, AlgConfidence can use this to construct the confidence set C_{k,t^{\prime}} for arm k at time t^{\prime}.

Algorithm 2 AlgConfidence (Constructing confidence sets in Standard form)

Input: \{(X_{t})_{H_{k,t^{\prime}-1}}\}_{t=1}^{t^{\prime}}, Y_{k}, S_{k,t^{\prime}}, and C_{k,t^{\prime}-1}, for some time-step t^{\prime} and arm k.
Output: C_{k,t^{\prime}}.

1:  Initialize H_{k,t^{\prime}}\leftarrow H_{k,t^{\prime}-1}.
2:  for q\in[d] do
3:     if \beta_{\text{min}}>\max_{\beta\in C_{k,t^{\prime}-1}}\|\beta_{\{q\}}\| then
4:        H_{k,t^{\prime}}\leftarrow H_{k,t^{\prime}}\setminus\{q\}.
5:     end if
6:  end for
7:  Set (\hat{\beta}_{k,t^{\prime}})_{H_{k,t^{\prime}}^{\complement}}\leftarrow\vec{0}. {Note that H_{k,t^{\prime}}\subseteq H_{k,t^{\prime}-1}, hence we have \{(X_{t})_{H_{k,t^{\prime}}}\}_{t=1}^{t^{\prime}}.}
8:  Estimate (\hat{\beta}_{k,t^{\prime}})_{H_{k,t^{\prime}}} from regression on the data \{((X_{t})_{H_{k,t^{\prime}}},Y_{k}(t))\}_{t\in S_{k,t^{\prime}}}.
9:  Compute D_{t^{\prime}}^{k}:=(\alpha I)_{H_{k,t^{\prime}}}+\sum_{w\in S_{k,t^{\prime}}}(X_{w})_{H_{k,t^{\prime}}}(X_{w})_{H_{k,t^{\prime}}}^{\mathsf{T}}.
10:  Output C_{k,t^{\prime}}:=\{\beta\,|\,\|\beta-\hat{\beta}_{k,t^{\prime}}\|_{D_{t^{\prime}}^{k}}\leq\theta_{k,t^{\prime}},\ \text{Supp}(\beta)\subseteq H_{k,t^{\prime}}\}.

AlgConfidence starts by constructing H_{k,t^{\prime}} from C_{k,t^{\prime}-1} by relying on Assumption 1, where H_{k,t^{\prime}} will be the support of the confidence set C_{k,t^{\prime}}. We would like to construct H_{k,t^{\prime}} so that it contains the support of \beta_{k}, i.e., \text{Supp}(\beta_{k})\subseteq H_{k,t^{\prime}}. In particular, if the confidence set for arm k at time t^{\prime}-1 holds (i.e., \beta_{k}\in C_{k,t^{\prime}-1}), then from Assumption 1 we have that for all q\in\text{Supp}(\beta_{k}):

\max_{\beta\in C_{k,t^{\prime}-1}}\|\beta_{\{q\}}\|\geq\|(\beta_{k})_{\{q\}}\|>\beta_{\text{min}}.

We also get that \text{Supp}(\beta_{k}) is a subset of the support H_{k,t^{\prime}-1} of the confidence set of arm k at time t^{\prime}-1. Hence we have:

H_{k,t^{\prime}}:=H_{k,t^{\prime}-1}\Big\backslash\Big\{q\,\Big|\,\max_{\beta\in C_{k,t^{\prime}-1}}\|\beta_{\{q\}}\|\leq\beta_{\text{min}}\Big\}

AlgConfidence now constructs the confidence set C_{k,t^{\prime}} with support H_{k,t^{\prime}}. We estimate (\hat{\beta}_{k,t^{\prime}})_{H_{k,t^{\prime}}} by regressing over the features in H_{k,t^{\prime}} on the observed data set \{((X_{t})_{H_{k,t^{\prime}}},Y_{k}(t))\}_{t\in S_{k,t^{\prime}}}. We then set the components of \hat{\beta}_{k,t^{\prime}} not in H_{k,t^{\prime}} to zero, i.e., (\hat{\beta}_{k,t^{\prime}})_{H_{k,t^{\prime}}^{\complement}}\leftarrow\vec{0}. Now, with D_{t^{\prime}}^{k}:=(\alpha I)_{H_{k,t^{\prime}}}+\sum_{w\in S_{k,t^{\prime}}}(X_{w})_{H_{k,t^{\prime}}}(X_{w})_{H_{k,t^{\prime}}}^{\mathsf{T}}, we construct the confidence set for arm k at time t^{\prime}:

C_{k,t^{\prime}}:=\{\beta\,|\,\|\beta-\hat{\beta}_{k,t^{\prime}}\|_{D_{t^{\prime}}^{k}}\leq\theta_{k,t^{\prime}},\ \text{Supp}(\beta)\subseteq H_{k,t^{\prime}}\}

Note that \text{Supp}(C_{k,t^{\prime}})=H_{k,t^{\prime}}\subseteq H_{k,t^{\prime}-1} and \text{Supp}(\hat{\beta}_{k,t^{\prime}})\subseteq H_{k,t^{\prime}}. Hence, given the above form of the confidence set, we have that \{C_{i,t}\}_{(i,t)\in[K]\times[t^{\prime}]} are in Standard form.
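The feature-elimination test on line 3 of AlgConfidence also has a closed form for ellipsoidal confidence sets: \max_{\beta\in C_{k,t^{\prime}-1}}\|\beta_{\{q\}}\|=|\hat{\beta}_{q}|+\theta\sqrt{(D^{-1})_{qq}}. A sketch, again assuming all quantities are restricted to the current support (the helper name is ours):

```python
import numpy as np

def max_abs_coordinate(q, beta_hat, D, theta):
    """Upper bound max_{||b - beta_hat||_D <= theta} |b_q|
       = |beta_hat[q]| + theta * sqrt((D^{-1})_{qq}).
    Feature q can be dropped once this bound falls below beta_min."""
    e_q = np.zeros(len(beta_hat))
    e_q[q] = 1.0
    return abs(beta_hat[q]) + theta * np.sqrt(e_q @ np.linalg.solve(D, e_q))
```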

4.3 Probability Aggregation

To construct the confidence set C_{k,t^{\prime}}, recall that AlgConfidence assumes that \beta_{k}\in C_{k,t^{\prime}-1}. This is unlike LinUCB (Li et al., 2010) and several other UCB algorithms, where confidence sets are constructed from observed data without directly relying on previous confidence sets. Here, we argue that our construction does not lead to any unexpected issues. We now state a helpful lemma and its corollary, and defer the proofs to appendix A.

Lemma 2 (Probability aggregation).

Consider a probability space (\Omega,\mathcal{F},\Pr) and any sequence of events \{B_{i},\Pi_{i}\}_{i=1}^{\infty} such that B_{i},\Pi_{i}\in\mathcal{F} and B_{i}\subseteq\Pi_{i} for all i\in\mathbb{N}. Let \Pi_{0}:=\Omega. We then have:

\Pr\Big[\bigcap_{i=1}^{\infty}B_{i}\Big]\geq 1-\sum_{i=1}^{\infty}\Pr\big[B_{i}^{\complement}\,\big|\,\Pi_{i-1}\big].
Corollary 1.

Suppose SurveyUCB constructs the Standard form confidence sets \{C_{i,t-1}\}_{(i,t)\in[K]\times[T]} for all arms k and all time-steps t. We then have:

\Pr\Big[\bigcap_{k=1}^{K}\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}\Big]\geq 1-\sum_{k=1}^{K}\sum_{t\in S_{k}}\Pr[\beta_{k}\notin C_{k,t}\,|\,\text{Supp}(\beta_{k})\subseteq H_{k,t}].

Where H_{k,t-1}:=\text{Supp}(C_{k,t-1}) and S_{k}:=\{t\,|\,\pi_{t}^{a}=k\} for all arms k\in[K] and time-steps t\in[T].

Therefore, from corollary 1, for any \delta>0 we have \Pr[\bigcap_{k=1}^{K}\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}]\geq 1-\delta if for all k\in[K] and t\in[T] we have (note that \sum_{j=1}^{\infty}\frac{1}{(1+j)^{2}}<\frac{3.15^{2}}{6}-1<1):

\Pr[\beta_{k}\notin C_{k,t}\,|\,\text{Supp}(\beta_{k})\subseteq H_{k,t}]\leq\frac{\delta}{K(1+n_{k,t})^{2}} \qquad (1)
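The allocation in eq. 1 works because the per-update failure probabilities form a convergent series; a quick numerical check of the series bound used above:

```python
import math

# Partial sum of sum_{j>=1} 1/(1+j)^2; the full series equals pi^2/6 - 1 < 1.
# Hence summing delta / (K * (1 + n_{k,t})^2) over all updates of all K arms
# contributes at most delta to the total failure probability.
partial_sum = sum(1.0 / (1 + j) ** 2 for j in range(1, 1_000_000))
```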

5 General Regret Analysis

In section 4.2, we already saw that SurveyUCB constructs confidence sets in Standard form. In lemma 3 we exploit the structure of confidence sets in Standard form to get a general regret bound for SurveyUCB.

Lemma 3 (General regret analysis).

Suppose Assumption 2 holds with \|X_{t}\|_{2}\leq L for all t. Suppose the confidence sets \{C_{i,t-1}\}_{(i,t)\in[K]\times[T]} used in SurveyUCB are in Standard form with parameter \alpha>0 (in definition 2). Further suppose that \beta_{i}\in C_{i,t-1} at every time-step t and for all arms i\in[K]. We then have:

  • Instantaneous regret at time-step t: r_{t}\leq 2\theta_{\pi^{a}_{t},t-1}\|(X_{t})_{\pi^{s}_{t}}\|_{(D_{t-1}^{i})^{-1}}, where i=\pi^{a}_{t}.

  • And cumulative regret in T rounds: R_{T}\leq\sum_{i\in[K]}\theta_{i,T-1}\sqrt{8n_{i,T}d\log\Big(\frac{d\alpha+n_{i,T}L^{2}}{d\alpha}\Big)}.

We defer the proof of lemma 3 to appendix B. It is worth noting that the proof sheds some light on the need for confidence sets to be in Standard form.

6 Tail Inequalities for Adapted Observations

Consider a linear model Y=\mathbf{X}\beta+\epsilon, with design matrix \mathbf{X}\in\mathbb{R}^{n\times d}, response vector Y\in\mathbb{R}^{n}, and noise vector \epsilon\in\mathbb{R}^{n}, where the \epsilon_{t} form an independent sequence of \sigma-sub-Gaussian random variables.

6.1 Ridge Estimator

We now define the Ridge estimator for estimating the parameter \beta as follows:

Definition 3 (Ridge).

Given a regularization parameter \alpha\geq 0, the Ridge estimate is given by:

\hat{\beta}^{\text{ridge}}_{\mathbf{X},Y}(\alpha):=\arg\min_{\beta^{\prime}}\Big\{\|Y-\mathbf{X}\beta^{\prime}\|_{2}^{2}+\alpha\|\beta^{\prime}\|_{2}^{2}\Big\}
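The minimizer in definition 3 has the familiar closed form \hat{\beta}=(\mathbf{X}^{\mathsf{T}}\mathbf{X}+\alpha I)^{-1}\mathbf{X}^{\mathsf{T}}Y; a minimal sketch (function name ours):

```python
import numpy as np

def ridge_estimate(X, Y, alpha):
    """Closed-form minimizer of ||Y - X b||_2^2 + alpha ||b||_2^2,
       i.e. (X^T X + alpha I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)
```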

From theorem 2 in (Abbasi-Yadkori et al., 2011) we get:

Lemma 4 (Abbasi-Yadkori et al. (2011), theorem 2).

Let X_{t} denote the t-th row of \mathbf{X}, and let Y(t) denote the t-th entry of Y. The sequence \{X_{t}\,|\,t=1,2,\dots,n\} forms an adapted sequence of observations; that is, X_{t} may depend on \{X_{t^{\prime}},Y(t^{\prime})\}_{t^{\prime}=1}^{t-1}. Assume that \|\beta\|_{2}\leq b. Then for \delta>0, we have:

\Pr\bigg[\|\hat{\beta}-\beta\|_{D}\leq\sigma\sqrt{2\log\bigg(\frac{\det^{1/2}(D)}{\delta\alpha^{d/2}}\bigg)}+\sqrt{\alpha}b\bigg]\geq 1-\delta

Furthermore, if \|X_{t}\|_{2}\leq L for all t\in[n], we have that:

\Pr\bigg[\|\hat{\beta}-\beta\|_{D}\leq\sigma\sqrt{d\log\bigg(\frac{1+nL^{2}/\alpha}{\delta}\bigg)}+\sqrt{\alpha}b\bigg]\geq 1-\delta

Where \hat{\beta} is the Ridge estimate with parameter \alpha>0, and D=\mathbf{X}^{\mathsf{T}}\mathbf{X}+\alpha I.

6.2 Elastic Net Estimator

We now define the Elastic net estimator for estimating the parameter \beta as follows:

Definition 4 (Elastic net).

Given regularization parameters \alpha\geq 0 and \lambda\geq 0, the Elastic net estimate is given by:

\hat{\beta}^{\text{elnet}}_{\mathbf{X},Y}(\alpha,\lambda):=\arg\min_{\beta^{\prime}}\Big\{\frac{1}{n}\Big[\|Y-\mathbf{X}\beta^{\prime}\|_{2}^{2}+2\alpha\|\beta^{\prime}\|_{2}^{2}\Big]+\lambda\|\beta^{\prime}\|_{1}\Big\}
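The elastic net objective has no closed form, but it can be minimized by proximal gradient descent (ISTA): gradient steps on the smooth squared-error and ridge terms, followed by soft-thresholding for the \ell_{1} term. The sketch below targets the objective in definition 4 as written; the solver choice and names are ours, not the paper's:

```python
import numpy as np

def elastic_net_ista(X, Y, alpha, lam, n_iters=2000):
    """Proximal gradient (ISTA) for
       (1/n) [ ||Y - X b||_2^2 + 2*alpha*||b||_2^2 ] + lam*||b||_1."""
    n, d = X.shape
    # Lipschitz constant of the smooth part's gradient; step size 1/L.
    L = 2.0 * (np.linalg.norm(X, 2) ** 2 + 2.0 * alpha) / n
    eta, b = 1.0 / L, np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / n) * (X.T @ (X @ b - Y) + 2.0 * alpha * b)
        z = b - eta * grad
        b = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)  # soft-threshold
    return b
```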

We now provide an adaptive tail inequality for the Elastic net estimator that may be of independent interest. We defer the proof to appendix C.

Lemma 5 (Elastic net tail inequality for Adapted observations).

Let X_{t} denote the t-th row of \mathbf{X}, and let Y(t) denote the t-th entry of Y. The sequence \{X_{t}\,|\,t=1,2,\dots,n\} forms an adapted sequence of observations; that is, X_{t} may depend on \{X_{t^{\prime}},Y(t^{\prime})\}_{t^{\prime}=1}^{t-1}. Also assume that all realizations of X_{t} satisfy \|X_{t}\|_{\infty}\leq L and that \|\beta\|_{1}\leq b. Then, if \lambda:=4\sigma L\sqrt{(\gamma^{2}+2\log d)/n}, we have:

\Pr\Big[\|\hat{\beta}-\beta\|_{D}^{2}\leq 6\sigma Lb\sqrt{n(\gamma^{2}+2\log(d))}+4\alpha b^{2}\Big]\geq 1-2\exp[-\gamma^{2}/2].

Where \hat{\beta} is the Elastic net estimate with parameters \alpha,\lambda\geq 0, and D=\mathbf{X}^{\mathsf{T}}\mathbf{X}+\alpha I.

7 Ridge SurveyUCB

Note that the description and analysis of SurveyUCB gives us a fair amount of flexibility. We are still free to specify the regression method used to estimate (\hat{\beta}_{i,t})_{H_{i,t}} for all i\in[K] and times t\in[T]. We are also free to specify our choice of \theta_{i,t} for all arms i\in[K] and all times t\in[T].

Ridge SurveyUCB is a version of SurveyUCB in which we use Ridge regression with a fixed regularization parameter \alpha>0 to estimate (\hat{\beta}_{i,t})_{H_{i,t}}. We also choose \theta_{i,t} for all arms and time-steps based on corollary 1 (eq. 1) and lemma 4, so that the Standard form confidence sets we construct hold with high probability. Then from lemma 3 we naturally get a high-probability regret bound for Ridge SurveyUCB.

Say we want our bounds to hold with probability 1δ1-\delta. Then from corollary 1, it is enough to show that eq. 1 holds for all arms and times. That is for all arms k[K]k\in[K] and all times tt, we want:

Pr[βkCk,t|Supp(βk)Hk,t]δK(1+nk,t)2\displaystyle\Pr[\beta_{k}\notin C_{k,t}|\text{Supp}(\beta_{k})\subseteq H_{k,t}]\leq\frac{\delta}{K(1+n_{k,t})^{2}} (1)

Given Supp(βk)Hk,t\text{Supp}(\beta_{k})\subseteq H_{k,t}, lemma 4 gives us that eq. 1 holds for Ck,tC_{k,t} with support Hk,tH_{k,t} if:

θk,t=σ|Hk,t|log(1+nk,tL2/αδ/(K(1+nk,t)2))+αb\displaystyle\theta_{k,t}=\sigma\sqrt{|H_{k,t}|\log\bigg{(}\frac{1+n_{k,t}L^{2}/\alpha}{\delta/(K(1+n_{k,t})^{2})}\bigg{)}}+\sqrt{\alpha}b (2)

Hence, from lemma 3 with the above choice of θk,t\theta_{k,t} and the corresponding high-probability guarantee, we get theorem 1. We defer the proof to appendix E.
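To make the per-arm computation concrete, here is a minimal sketch of the ridge update, the confidence radius of eq. 2, and the resulting optimistic value, assuming the standard-form confidence set $\{\beta:\|\beta-\hat{\beta}\|_{D}\leq\theta\}$; the function and variable names are ours, not the paper's:

```python
import numpy as np

def ridge_theta(sigma, h_size, n_arm, L, alpha, b, delta, K):
    """Confidence radius from eq. 2, for an arm with h_size = |H_{k,t}|
    active features and n_arm pulls so far."""
    log_term = np.log((1 + n_arm * L**2 / alpha) / (delta / (K * (1 + n_arm) ** 2)))
    return sigma * np.sqrt(h_size * log_term) + np.sqrt(alpha) * b

def ridge_fit(X, Y, alpha):
    """Ridge estimate and the matrix D = X^T X + alpha*I defining the set."""
    d = X.shape[1]
    D = X.T @ X + alpha * np.eye(d)
    beta_hat = np.linalg.solve(D, X.T @ Y)
    return beta_hat, D

def ucb(x, beta_hat, D, theta):
    """Optimistic value max over {beta : ||beta - beta_hat||_D <= theta}
    of x^T beta, i.e. x^T beta_hat + theta * ||x||_{D^{-1}}."""
    width = np.sqrt(x @ np.linalg.solve(D, x))
    return x @ beta_hat + theta * width
```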

Theorem 1 (Ridge SurveyUCB regret).

Suppose assumptions 1 and 2 hold. Let $\delta,\alpha>0$ be fixed constants, and let $\theta_{k,t}$ be chosen as in eq. 2 for all $k,t$. Then with probability at least $1-\delta$, Ridge SurveyUCB has the following regret guarantee:

\[
R_{T}\leq O(d\sqrt{KT\log(K)}\log(T)).
\]

8 Elastic Net SurveyUCB

In this section, we develop a variant of SurveyUCB that is more robust to the choice of the beta-min parameter in assumption 1. One way to do this is to modify line 3 in AlgConfidence. In particular, suppose $k\in[K]$ is the arm pulled at time $t^{\prime}$. We then construct $H_{k,t^{\prime}}$ by removing from $H_{k,t^{\prime}-1}$ every feature $q$ that we estimate to be zero (i.e., $(\hat{\beta}_{k,t^{\prime}-1})_{q}=0$) and that we determine to be irrelevant based on assumption 1. That is:

\[
H_{k,t^{\prime}}:=H_{k,t^{\prime}-1}\Big\backslash\bigg\{q\;\Big|\;(\hat{\beta}_{k,t^{\prime}-1})_{q}=0\text{ and }\max_{\beta\in C_{k,t^{\prime}-1}}\|\beta_{\{q\}}\|\leq\beta_{\text{min}}\bigg\}
\]

It is easy to see that all our arguments for SurveyUCB continue to hold under the above modification. The modification makes SurveyUCB more robust to assumption 1, because we additionally require the $q$-th coordinate of our estimate ($\hat{\beta}_{k,t^{\prime}-1}$) to be zero before we remove feature $q$ (at time $t^{\prime}$) from the model of arm $k$. This modification encourages us to use sparse estimators.
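Assuming a standard-form confidence set $\{\beta:\|\beta-\hat{\beta}\|_{D}\leq\theta\}$, the per-coordinate maximum in the rule above is $\max_{\beta\in C}|\beta_{q}|=|\hat{\beta}_{q}|+\theta\sqrt{(D^{-1})_{qq}}$, so the truncation step can be sketched as follows (a hypothetical helper, not the paper's code):

```python
import numpy as np

def truncate_support(H, beta_hat, D, theta, beta_min):
    """Drop feature q from the active set H only when the estimate is exactly
    zero AND every beta in the confidence set puts |beta_q| below beta_min."""
    Dinv = np.linalg.inv(D)
    kept = []
    for q in H:
        # max of |beta_q| over the standard-form confidence set
        max_abs_q = abs(beta_hat[q]) + theta * np.sqrt(Dinv[q, q])
        if beta_hat[q] == 0.0 and max_abs_q <= beta_min:
            continue  # both conditions met: feature q is removed
        kept.append(q)
    return kept
```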

Elastic Net SurveyUCB is a variant of SurveyUCB with the above modification. In Elastic Net SurveyUCB, we use Elastic net regression to estimate $\hat{\beta}_{i,t}$, with regularization parameters $\alpha$ and $\lambda_{i,t}$ for all arms $i\in[K]$ and time-steps $t$, where:

\[
\lambda_{i,t}:=4\sigma L\sqrt{\frac{2}{n_{i,t}}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)},\ \forall i\in[K],\forall t\in[T]. \tag{3}
\]

Similar to the arguments in section 7, corollary 1 and lemma 5 give us that our confidence sets hold with probability $1-\delta$ if, for all arms $i$ and times $t$:

\[
\theta_{i,t}:=\sqrt{6\sigma Lb\sqrt{2n_{i,t}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)}+4\alpha b^{2}}. \tag{4}
\]

Again, from lemma 3 with the above choice of $\theta_{k,t}$ and the corresponding high-probability guarantee, we get theorem 2. We defer the proof to appendix F.

Theorem 2 (Elastic Net SurveyUCB regret).

Suppose assumptions 1 and 2 hold. Let $\delta,\alpha>0$ be fixed constants. For all $k,t$, let $\lambda_{k,t}$ and $\theta_{k,t}$ be chosen as in eq. 3 and eq. 4 respectively. Then with probability at least $1-\delta$, Elastic Net SurveyUCB has the following regret guarantee:

\[
R_{T}\leq O(K^{1/4}d^{1/2}T^{3/4}\log^{3/4}(T)\log^{1/4}(dK)).
\]

9 Interactive Surveys

We start by defining sub-optimal arms, describe an inefficiency in SurveyUCB, and propose a fix using interactive surveys.

Definition 5 (Sub-optimal arms).

An arm is said to be sub-optimal if the target policy would not pick it for any context vector in the context space.

At any time-step $t$, the SurveyUCB algorithms query $\pi^{s}_{t}:=\cup_{i\in[K]}\pi^{s}_{i,t}$, with $\pi^{s}_{i,t}=\text{Supp}(C_{i,t-1})$ for all $i\in[K]$. Now suppose the decision maker plays arm $k$ at time $t$. The decision maker only needs the reward and the features corresponding to $\pi^{s}_{k,t}$ to update the model. Recall that the reason the decision maker queries all the features in $\pi^{s}_{t}$ is to be able to compute the upper confidence bounds for all arms.

Note that the bandit assignment depends only on the highest upper confidence bound, so it is not necessary to compute all the upper confidence bounds. The algorithm's inefficiency becomes more evident in the presence of sub-optimal arms, because these arms are played less frequently, hence updated less frequently, and they increase survey length.

Algorithm 3 Interactive Survey Protocol

Interactive input: User at time $t$, who answers queries whenever asked.
Output: Queries to user $t$.

1:  Create an ordered list of arms $Q$, from the most pulled arm to the least pulled arm.
2:  Initialize $M\leftarrow-\infty$, the largest upper confidence bound among queried arms.
3:  Initialize the set of queried features $U\leftarrow\emptyset$.
4:  while $Q$ is not empty do
5:     Let $i\leftarrow Q[1]$ be the first arm in $Q$.
6:     Query $\pi^{s}_{i,t}$ and observe $(X_{t})_{\pi^{s}_{i,t}}$.
7:     Update $U\leftarrow U\cup\pi^{s}_{i,t}$.
8:     Update $M\leftarrow\max\{M,\max_{\beta\in C_{i,t-1}}(X_{t})_{\pi^{s}_{i,t}}^{\mathsf{T}}\beta\}$.
9:     Remove $i$ from the list $Q$.
10:     for $w$ in $Q$ do
11:        Set $F=\{x\mid x_{U}=(X_{t})_{U}\text{ and }x\text{ is feasible}\}$.
12:        if $\max_{x\in F}\max_{\beta\in C_{w,t-1}}x^{\mathsf{T}}\beta\leq M$ then
13:           Remove arm $w$ from $Q$.
14:        end if
15:     end for
16:  end while {Note that we observe enough information to determine which arm has the largest upper confidence bound, and we can update its confidence set after observing its reward.}

We attempt to resolve this inefficiency through interactive surveys. Consider the user at time $t$: the decision maker needs to query some features from the user before taking an action. We start by creating an ordered list of all arms, starting with the most pulled arms and ending with the least pulled arms. This ordering is a heuristic choice; the idea is to keep sub-optimal arms (which are pulled less frequently) towards the end of the list. We keep removing arms from this list and terminate (take an action) when it is empty.

The main idea is to sequentially query the feature sets $\pi^{s}_{i,t}$ for arms $i$ in the list, remove queried arms from the list, and also remove unqueried arms that cannot have the largest upper confidence bound. We now explain how we determine that some unqueried arm $w$ does not have the largest upper confidence bound. First note that for each queried arm, we can exactly compute its upper confidence bound at time $t$. The following optimization problem gives us an upper bound on the upper confidence bound of arm $w$ (here $(D_{t-1}^{w})^{-1}$ is the pseudo-inverse of the matrix):

\[
\max_{x\in F}\max_{\beta\in C_{w,t-1}}x^{\mathsf{T}}\beta\equiv\max_{x\in F}\ x^{\mathsf{T}}\hat{\beta}_{w,t-1}+\sqrt{\theta_{w,t-1}\,(x^{\mathsf{T}}(D_{t-1}^{w})^{-1}x)} \tag{5}
\]

where $F$ denotes the set of contexts in the context space that are consistent with the queries so far (i.e., feasible $x$ such that $x_{U}=(X_{t})_{U}$, where $U$ is the set of features queried so far for user $t$). Hence, if this upper bound is less than the upper confidence bound of some queried arm, we remove arm $w$ from the list. For more details, see algorithm 3.

Note that interactive versions of SurveyUCB have the same regret performance as SurveyUCB, because both algorithms work with the same confidence sets and always choose the arm with the largest upper confidence bound. Also note that the optimization problem in eq. 5 is non-convex. In section G.3 we provide a heuristic for this optimization that has been effective in simulations.
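A condensed sketch of the elimination loop of algorithm 3 follows. The per-arm state fields and helper names are ours; `ub_over_box` returns a crude but valid upper bound on eq. 5 obtained by maximizing the linear term and the width term separately over the box of feasible contexts, rather than the tighter optimization the paper describes:

```python
import numpy as np

def ucb_exact(x, arm):
    """Exact upper confidence bound once the arm's survey features are observed."""
    width = np.sqrt(x @ np.linalg.solve(arm["D"], x))
    return x @ arm["beta_hat"] + arm["theta"] * width

def ub_over_box(arm, lo, hi):
    """Any valid upper bound on eq. 5 works for elimination.  Here we bound the
    linear term coordinate-wise over the box [lo, hi]^d, and the width term via
    the largest eigenvalue of D^{-1} times the box corner of largest norm."""
    b = arm["beta_hat"]
    lin = np.sum(np.where(b >= 0, hi * b, lo * b))
    corner = np.maximum(np.abs(lo), np.abs(hi)) * np.ones_like(b)
    max_eval = np.linalg.eigvalsh(np.linalg.inv(arm["D"])).max()
    return lin + arm["theta"] * np.sqrt(max_eval * corner @ corner)

def interactive_select(arms, x_full, lo=0.0, hi=1.0):
    """arms: dict arm_id -> {"beta_hat", "D", "theta", "pulls", "survey"}.
    x_full: the user's full context; only queried coordinates are ever read."""
    Q = sorted(arms, key=lambda a: -arms[a]["pulls"])  # most pulled arms first
    M, best, queried = -np.inf, None, set()
    while Q:
        i = Q.pop(0)
        queried |= set(arms[i]["survey"])  # query this arm's survey features
        m_i = ucb_exact(x_full, arms[i])
        if m_i > M:
            M, best = m_i, i
        # drop unqueried arms whose optimistic value cannot beat M
        Q = [w for w in Q if ub_over_box(arms[w], lo, hi) > M]
    return best, queried
```

In the sketch, a clearly dominated arm is eliminated before its survey features are ever queried, which is exactly the saving that interactive surveys aim for.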

10 Simulation

Suppose we have users with fifty features and five arms, where the expected reward for context $x$ is $x_{1}$ for arm 1, $x_{2}$ for arm 2, $1-x_{1}$ for arm 3, and zero for both arm 4 and arm 5. Hence, only two of the fifty features are predictive of arm rewards, and both arm 4 and arm 5 are sub-optimal. We draw contexts from the uniform distribution $U([0,1]^{d})$, and reward noise ($\epsilon_{i,t}$) is drawn from $U([0,1])$. We consider a 100,000 step time horizon.

We assume the noise is 1-sub-Gaussian, contexts lie in the space $[0,1]^{d}$, and the 1-norm and 2-norm of the arm parameters are bounded by $50$ and $\sqrt{50}$ respectively. We run simulations with ($K=5$) and without ($K=3$) sub-optimal arms; that is, in our simulations without sub-optimal arms, we only consider the first three arms. We plot regret and cumulative survey length against time-steps. The plots here are generated by averaging over five runs.
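The data-generating process just described can be reproduced in a few lines (the seed is an arbitrary choice of ours); note that only the first two features enter any arm's reward:

```python
import numpy as np

D_FEATURES, K_ARMS, HORIZON = 50, 5, 100_000
rng = np.random.default_rng(0)  # arbitrary seed

def draw_context():
    return rng.uniform(0.0, 1.0, size=D_FEATURES)

def expected_reward(x, arm):
    """Arms 1-3 depend only on the first two features; arms 4-5 are sub-optimal."""
    return [x[0], x[1], 1.0 - x[0], 0.0, 0.0][arm]

def reward(x, arm):
    # additive uniform noise, as stated in the text
    return expected_reward(x, arm) + rng.uniform(0.0, 1.0)
```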

In simulations where assumption 1 holds, Ridge SurveyUCB has better regret performance than Elastic Net SurveyUCB (for similar beta-min parameters). The plots verify that Elastic Net SurveyUCB is in fact reasonably robust to assumption 1: in simulations it does not remove predictive features even when assumption 1 is violated. We also note that sub-optimal arms hurt the performance of SurveyUCB algorithms, and interactive surveys help mitigate the negative effects of sub-optimal arms on survey lengths. Also note that regret performance improves with less conservative choices of the beta-min parameter, which implies that model truncation also helps improve regret performance.

[Figures 1–2: Regret and survey length for Ridge SurveyUCB with $\beta_{\text{min}}=0.3$, $K=3$.]

[Figures 3–4: Regret and survey length for Ridge SurveyUCB with $\beta_{\text{min}}=0.3$, $K=5$.]

[Figures 5–6: Regret and survey length for Ridge SurveyUCB with $\beta_{\text{min}}=0.5$, $K=5$.]

[Figures 7–8: Regret and survey length for Elastic net SurveyUCB with $\beta_{\text{min}}=0.7$, $K=5$.]

[Figures 9–10: Regret and survey length for Elastic net SurveyUCB with $\beta_{\text{min}}=1.5$, $K=5$; note that assumption 1 is violated.]

References

  • Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.
  • Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1–9, 2012.
  • Bastani, H. and Bayati, M. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015.
  • Bouneffouf, D., Rish, I., Cecchi, G. A., and Féraud, R. Context attentive bandits: Contextual bandit with restricted context. In IJCAI, 2017.
  • Bühlmann, P. and Van De Geer, S. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
  • Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.
  • Ye, Y. Approximating global quadratic optimization with convex quadratic constraints. Journal of Global Optimization, 15(1):1–17, 1999.

Appendix A Proofs for probability aggregation

Lemma 2 (restated).

Proof.

We use induction. First note that:

\[
\Pr[B_{1}]=1-\Pr[B_{1}^{\complement}]=1-\Pr[B_{1}^{\complement}|\Omega].
\]

Also note that for any $t\geq 1$, we have:

\[
\begin{aligned}
\Pr[\cap_{i=1}^{t+1}B_{i}]&=\Pr[\cap_{i=1}^{t}B_{i}]-\Pr[\cap_{i=1}^{t}B_{i}\cap B_{t+1}^{\complement}]\\
&\geq\Pr[\cap_{i=1}^{t}B_{i}]-\Pr[\Pi_{t}\cap B_{t+1}^{\complement}]\\
&=\Pr[\cap_{i=1}^{t}B_{i}]-\Pr[B_{t+1}^{\complement}|\Pi_{t}]\Pr[\Pi_{t}]\\
&\geq\Pr[\cap_{i=1}^{t}B_{i}]-\Pr[B_{t+1}^{\complement}|\Pi_{t}]
\end{aligned}
\]

where the first inequality follows from the fact that $\Pi_{t}$ is a superset of $\cap_{i=1}^{t}B_{i}$ (since $\cap_{i=1}^{t}B_{i}\subseteq B_{t}\subseteq\Pi_{t}$). Therefore, by induction, for all $t\geq 1$ we get:

\[
\Pr\Bigg[\bigcap_{i=1}^{t}B_{i}\Bigg]\geq 1-\sum_{i=1}^{t}\Pr\big[B_{i}^{\complement}|\Pi_{i-1}\big].
\]

We get the required result by taking the limit as $t$ goes to $\infty$. ∎

Corollary 1 (restated).

Proof.

Recall from section 4.2 that $\text{Supp}(\beta_{k})\subseteq H_{k,t^{\prime}}$ if $\beta_{k}\in C_{k,t^{\prime}-1}$. For any arm $k$, applying lemma 2 with $B_{t}$ the event that $\beta_{k}\in C_{k,t}$ and $\Pi_{t}$ the event that $\text{Supp}(\beta_{k})\subseteq H_{k,t}$, we get:

\[
\Pr\Bigg[\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}\Bigg]\geq 1-\sum_{t=1}^{\infty}\Pr[\beta_{k}\notin C_{k,t}|\text{Supp}(\beta_{k})\subseteq H_{k,t}].
\]

Now, from the union bound and the above inequality, we get:

\[
\begin{aligned}
\Pr\Bigg[\bigcap_{k=1}^{K}\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}\Bigg]&=1-\Pr\Bigg[\bigcup_{k=1}^{K}\bigg(\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}\bigg)^{\complement}\Bigg]\\
&\geq 1-\sum_{k=1}^{K}\Pr\Bigg[\bigg(\bigcap_{t=1}^{\infty}\{\beta_{k}\in C_{k,t-1}\}\bigg)^{\complement}\Bigg]\\
&\geq 1-\sum_{k=1}^{K}\sum_{t=1}^{\infty}\Pr[\beta_{k}\notin C_{k,t}|\text{Supp}(\beta_{k})\subseteq H_{k,t}].
\end{aligned}
\]

∎

Appendix B Proofs for General Regret Analysis

We now restate lemma 3 and provide the proof in the following sub-sections.

Lemma 3 (restated).

B.1 Instantaneous Regret Decomposition

Let $r_{t}$ denote the instantaneous regret at time-step $t$. Suppose SurveyUCB picks arm $i$ at time-step $t$, that is, $\pi^{a}_{t}=i$, and suppose arm $j$ is the optimal arm at time-step $t$, that is, $\pi^{a*}_{t}=j$. Let $\pi^{s}_{t}$ denote the set of questions queried by SurveyUCB; then

\[
\pi^{s}_{t}:=\bigcup_{k\in[K]}\text{Supp}(C_{k,t-1}).
\]

Hence, $\text{Supp}(C_{k,t-1})\subseteq\pi^{s}_{t}$ for any arm $k\in[K]$ and time-step $t$. Therefore, for any $\beta\in C_{k,t-1}$ and $z\in\mathbb{R}^{d}$, we have $z^{\mathsf{T}}\beta=(z)_{\pi^{s}_{t}}^{\mathsf{T}}\beta$. Further, from the conditions stated in lemma 3, we have $\beta_{k}\in C_{k,t-1}$ for all $k\in[K]$ and $t\in[T]$. Therefore:

\[
r_{t}=X_{t}^{\mathsf{T}}\beta_{j}-X_{t}^{\mathsf{T}}\beta_{i}=(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta_{j}-(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta_{i}
\]

Since SurveyUCB chooses arm $i=\pi^{a}_{t}\in\arg\max_{k\in[K]}\max_{\beta\in C_{k,t-1}}(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta$ and $\beta_{j}\in C_{j,t-1}$, we have $(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\tilde{\beta}_{i,t-1}\geq(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta_{j}$, where $\tilde{\beta}_{i,t-1}$ denotes the optimistic estimate of arm $i$'s parameter at time-step $t$, that is, $\tilde{\beta}_{i,t-1}=\arg\max_{\beta\in C_{i,t-1}}(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta$. Therefore, we get:

\[
r_{t}\leq(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\tilde{\beta}_{i,t-1}-(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}\beta_{i}=(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}(\tilde{\beta}_{i,t-1}-\hat{\beta}_{i,t-1})+(X_{t})_{\pi^{s}_{t}}^{\mathsf{T}}(\hat{\beta}_{i,t-1}-\beta_{i})
\]

where $\hat{\beta}_{i,t-1}$ denotes the estimate of arm $i$'s parameter at time-step $t$. Now, using the Cauchy–Schwarz inequality for the weighted norm with respect to the positive definite matrix $V_{t-1}^{i}:=\alpha I+\sum_{w\in S_{i,t-1}}(X_{w})_{\pi^{s}_{t}}(X_{w})_{\pi^{s}_{t}}^{\mathsf{T}}$, we get:

\[
\begin{aligned}
r_{t}&\leq\|\tilde{\beta}_{i,t-1}-\hat{\beta}_{i,t-1}\|_{V_{t-1}^{i}}\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}+\|\hat{\beta}_{i,t-1}-\beta_{i}\|_{V_{t-1}^{i}}\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}\\
&=\|\tilde{\beta}_{i,t-1}-\hat{\beta}_{i,t-1}\|_{D_{t-1}^{i}}\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}+\|\hat{\beta}_{i,t-1}-\beta_{i}\|_{D_{t-1}^{i}}\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}\\
&\leq 2\theta_{i,t-1}\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}
\end{aligned}
\]

where the last inequality follows directly from the fact that our confidence sets are in Standard form. The equality follows from observation 1, the fact that $\beta_{i},\hat{\beta}_{i,t-1},\tilde{\beta}_{i,t-1}\in C_{i,t-1}$, and the fact that $D_{t-1}^{i}=(V_{t-1}^{i})_{H_{i,t-1}}$, where $H_{i,t-1}:=\text{Supp}(C_{i,t-1})$; the latter also follows from our confidence sets being in Standard form. We defer the proof of observation 1 to section B.3.

Observation 1.

Consider any vector $v\in\mathbb{R}^{d}$ and any positive semi-definite matrix $A\in\mathbb{R}^{d\times d}$. Let $S$ be such that $\text{Supp}(v)\subseteq S$. We have $\|v\|_{A}=\|v\|_{B}$ if $B=(A)_{S}$.

Now, since the reward at any time $t$ lies in the range $[0,1]$, we have $r_{t}\in[0,2]$ for all $t$. Since $\theta_{i,t-1}\geq 1$, we further have $r_{t}\leq 2\theta_{i,t-1}\min\big\{1,\|(X_{t})_{\pi^{s}_{t}}\|_{(V_{t-1}^{i})^{-1}}\big\}$.

B.2 Cumulative Regret

From lemma 11 in (Abbasi-Yadkori et al., 2011), we get:

Lemma 6 (Abbasi-Yadkori et al.).

Let $\{X_{t}\}_{t=1}^{\infty}$ be a sequence in $\mathbb{R}^{d}$ and let $V\in\mathbb{R}^{d\times d}$ be positive definite. Define $V_{t}:=V+\sum_{s=1}^{t}X_{s}X_{s}^{\mathsf{T}}$. Further, if $\|X_{t}\|_{2}\leq L$ for all $t$, then:

\[
\sum_{t=1}^{n}\min\big\{1,\|X_{t}\|^{2}_{V_{t-1}^{-1}}\big\}\leq 2\bigg(d\log\bigg(\frac{\text{trace}(V)+nL^{2}}{d}\bigg)-\log\det(V)\bigg).
\]
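Lemma 6 is easy to sanity-check numerically; the snippet below draws a random adapted-looking sequence and compares both sides of the inequality (the dimension, horizon, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, L = 4, 200, 1.0
V = np.eye(d)          # V positive definite; trace(V) = d, det(V) = 1
Vt = V.copy()          # running V_{t-1}
lhs = 0.0
for _ in range(n):
    x = rng.uniform(-L, L, size=d)
    x /= max(1.0, np.linalg.norm(x) / L)          # enforce ||x||_2 <= L
    lhs += min(1.0, x @ np.linalg.solve(Vt, x))   # uses V_{t-1}^{-1}
    Vt += np.outer(x, x)
rhs = 2 * (d * np.log((np.trace(V) + n * L**2) / d) - np.log(np.linalg.det(V)))
assert lhs <= rhs
```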

Now, consider any arm $i\in[K]$ and let $R_{T}^{i}$ denote the cumulative regret incurred by arm $i$. From the Cauchy–Schwarz inequality and the bound on $r_{t}$ in section B.1, we get:

\[
\begin{aligned}
R_{T}^{i}&=\sum_{t\in S_{i,T}}r_{t}\leq\sqrt{n_{i,T}\sum_{t\in S_{i,T}}r_{t}^{2}}\\
&\leq\theta_{i,T-1}\sqrt{4n_{i,T}\sum_{t\in S_{i,T}}\min\big\{1,\|(X_{t})_{\pi^{s}_{t}}\|^{2}_{(V_{t-1}^{i})^{-1}}\big\}}
\end{aligned}
\]

From the conditions stated in lemma 3, we have $\|X_{t}\|_{2}\leq L$ for all $t$. Hence, from lemma 6, we get:

\[
R_{T}^{i}\leq\theta_{i,T-1}\sqrt{8n_{i,T}\bigg(d\log\bigg(\frac{\text{trace}(\alpha I)+n_{i,T}L^{2}}{d\alpha}\bigg)\bigg)}
\]

Therefore, our total regret is bounded as:

\[
R_{T}=\sum_{i\in[K]}R_{T}^{i}\leq\sum_{i\in[K]}\theta_{i,T-1}\sqrt{8n_{i,T}\bigg(d\log\bigg(\frac{d\alpha+n_{i,T}L^{2}}{d\alpha}\bigg)\bigg)}.
\]

This completes the proof of lemma 3.

B.3 Proof for Observations

Observation 1 (restated).

Proof.

Consider any $v\in\mathbb{R}^{d}$ and any $A\in\mathbb{R}^{d\times d}$. Let $S$ be such that $\text{Supp}(v)\subseteq S\subseteq[d]$. Let $B=(A)_{S}$; that is, we get $B$ by setting the rows and columns of $A$ not in $S$ to zero. Consider $A^{\prime}$, which we get by setting the columns of $A$ not in $S$ to zero. Since $A$ and $A^{\prime}$ agree on the columns indexed by $S\supseteq\text{Supp}(v)$, we have $(A-A^{\prime})v=0$; that is, $Av=A^{\prime}v$. Now note that we can get $B$ by setting the rows of $A^{\prime}$ not in $S$ to zero, and since $v$ is supported in $S$, zeroing those rows does not change $v^{\mathsf{T}}A^{\prime}$; that is, $v^{\mathsf{T}}A^{\prime}=v^{\mathsf{T}}B$. Therefore:

\[
\|v\|_{A}^{2}=v^{\mathsf{T}}Av=v^{\mathsf{T}}A^{\prime}v=v^{\mathsf{T}}Bv=\|v\|_{B}^{2}.
\]

Since $A$, and hence $B$, is positive semi-definite, this also gives us $\|v\|_{A}=\|v\|_{B}$. ∎
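Observation 1 can also be verified numerically: masking the rows and columns of a random PSD matrix outside the support of $v$ leaves the weighted norm unchanged (the dimension, support set, and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
M = rng.normal(size=(d, d))
A = M.T @ M                        # a random PSD matrix
v = np.zeros(d)
v[[0, 2]] = rng.normal(size=2)     # Supp(v) = {0, 2}
S = [0, 2, 5]                      # any S with Supp(v) in S
B = np.zeros_like(A)
idx = np.ix_(S, S)
B[idx] = A[idx]                    # B = (A)_S: zero rows/columns outside S
norm_A = np.sqrt(v @ A @ v)
norm_B = np.sqrt(v @ B @ v)
assert np.isclose(norm_A, norm_B)
```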

Appendix C Proofs for lemma 5

Consider a linear model $Y=\mathbf{X}\beta+\epsilon$, with design matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$, response vector $Y\in\mathbb{R}^{n}$, and noise vector $\epsilon\in\mathbb{R}^{n}$, where the $\epsilon_{t}$ are independent $\sigma$-sub-Gaussian random variables. Now, from lemma EC2 in (Bastani & Bayati, 2015), we have:

Lemma 7 (Bastani and Bayati).

Let $X_{t}$ denote the $t$-th row of $\mathbf{X}$, and let $Y(t)$ denote the $t$-th entry of $Y$. The sequence $\{X_{t}\mid t=1,2,\dots,n\}$ forms an adapted sequence of observations; that is, $X_{t}$ may depend on $\{X_{t^{\prime}},Y(t^{\prime})\}_{t^{\prime}=1}^{t-1}$. Also assume that all realizations of $X_{t}$ satisfy $\|X_{t}\|_{\infty}\leq L$. Now, define the event:

\[
\mathcal{F}(\lambda_{0}(\gamma)):=\bigg\{\max_{r\in[d]}(2|\epsilon^{\mathsf{T}}X^{r}|/n)\leq\lambda_{0}(\gamma)\bigg\},
\]

where $X^{r}$ is the $r$-th column of $\mathbf{X}$ and $\lambda_{0}(\gamma):=2\sigma L\sqrt{(\gamma^{2}+2\log d)/n}$. Then we have $\Pr[\mathcal{F}(\lambda_{0}(\gamma))]\geq 1-2\exp[-\gamma^{2}/2]$.

We now state a useful basic inequality for Elastic net estimators. The proof is similar to the basic inequalities proved for the Lasso in (Bühlmann & Van De Geer, 2011); we defer the proof to appendix D.

Lemma 8 (Basic inequality for Elastic net).

Consider a linear model $Y=\mathbf{X}\beta+\epsilon$, with design matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$, response vector $Y\in\mathbb{R}^{n}$, and noise vector $\epsilon\in\mathbb{R}^{n}$. We then have:

\[
\frac{1}{n}\|\hat{\beta}-\beta\|_{D}^{2}+\lambda\|\hat{\beta}\|_{1}\leq\frac{2}{n}\epsilon^{\mathsf{T}}\mathbf{X}(\hat{\beta}-\beta)+\lambda\|\beta\|_{1}+\frac{4\alpha}{n}\|\beta\|_{2}^{2},
\]

where $\hat{\beta}$ is the Elastic net estimate with parameters $\alpha,\lambda\geq 0$, and $D=\mathbf{X}^{\mathsf{T}}\mathbf{X}+\alpha I$.

The following lemma results from simplifying lemma 8 under the high-probability event $\mathcal{F}(\lambda_{0})$ given in lemma 7, when $\lambda\geq 2\lambda_{0}$.

Lemma 9.

Consider a linear model $Y=\mathbf{X}\beta+\epsilon$, with design matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$, response vector $Y\in\mathbb{R}^{n}$, and noise vector $\epsilon\in\mathbb{R}^{n}$. Let $\hat{\beta}$ be the Elastic net estimate with parameters $\alpha,\lambda\geq 0$, and let $D=\mathbf{X}^{\mathsf{T}}\mathbf{X}+\alpha I$. When $\lambda\geq 2\lambda_{0}$ and $\mathcal{F}(\lambda_{0})$ holds, we have:

\[
\frac{2}{n}\|\hat{\beta}-\beta\|_{D}^{2}\leq 3\lambda\|\beta\|_{1}+\frac{8\alpha}{n}\|\beta\|_{2}^{2}.
\]
Proof.

Since $\mathcal{F}(\lambda_{0})$ holds and $\lambda\geq 2\lambda_{0}$, we get:

\[
\frac{2}{n}\epsilon^{\mathsf{T}}\mathbf{X}(\hat{\beta}-\beta)\leq\frac{1}{n}\Big(\max_{r\in[d]}2|\epsilon^{\mathsf{T}}X^{r}|\Big)\|\hat{\beta}-\beta\|_{1}\leq\lambda_{0}\|\hat{\beta}-\beta\|_{1}\leq\frac{\lambda}{2}\|\hat{\beta}-\beta\|_{1}
\]

Hence, from lemma 8 and the above inequality, we have:

\[
\frac{1}{n}\|\hat{\beta}-\beta\|_{D}^{2}+\lambda\|\hat{\beta}\|_{1}\leq\frac{2}{n}\epsilon^{\mathsf{T}}\mathbf{X}(\hat{\beta}-\beta)+\lambda\|\beta\|_{1}+\frac{4\alpha}{n}\|\beta\|_{2}^{2}
\]
\[
\implies\frac{2}{n}\|\hat{\beta}-\beta\|_{D}^{2}\leq\lambda\|\hat{\beta}-\beta\|_{1}+2\lambda\|\beta\|_{1}+\frac{8\alpha}{n}\|\beta\|_{2}^{2}-2\lambda\|\hat{\beta}\|_{1}
\]

Now, since $\|\hat{\beta}\|_{1}\geq 0$, we get:

\[
\implies\frac{2}{n}\|\hat{\beta}-\beta\|_{D}^{2}\leq\lambda(\|\hat{\beta}-\beta\|_{1}-\|\hat{\beta}\|_{1})+2\lambda\|\beta\|_{1}+\frac{8\alpha}{n}\|\beta\|_{2}^{2}
\]
\[
\implies\frac{2}{n}\|\hat{\beta}-\beta\|_{D}^{2}\leq 3\lambda\|\beta\|_{1}+\frac{8\alpha}{n}\|\beta\|_{2}^{2}
\]

where the last implication follows from the triangle inequality. ∎

Combining lemma 7 and lemma 9, we get lemma 5:

Lemma 5 (restated).

Proof.

Let $\lambda_{0}:=2\sigma L\sqrt{(\gamma^{2}+2\log d)/n}$, and note that $\lambda\geq 2\lambda_{0}$. Therefore, from lemma 7 and lemma 9, we get that with probability at least $1-2\exp[-\gamma^{2}/2]$:

\[
\frac{2}{n}\|\hat{\beta}-\beta\|_{D}^{2}\leq 3\lambda\|\beta\|_{1}+\frac{8\alpha}{n}\|\beta\|_{2}^{2}
\]
\[
\implies\|\hat{\beta}-\beta\|_{D}^{2}\leq\frac{3n\lambda b}{2}+4\alpha b^{2}
\]
\[
\implies\|\hat{\beta}-\beta\|_{D}^{2}\leq 6\sigma Lb\sqrt{n(\gamma^{2}+2\log(d))}+4\alpha b^{2}
\]

where the first implication follows from $\|\beta\|_{2}\leq\|\beta\|_{1}\leq b$, and the last implication follows from our choice of $\lambda$. ∎

Appendix D Basic Inequality for Elastic Net

Lemma 8 (restated).

Proof.

Since $\hat{\beta}$ is the Elastic net estimate for parameters $\alpha,\lambda\geq 0$, we have:

\[
\frac{1}{n}\Big[\|Y-\mathbf{X}\hat{\beta}\|_{2}^{2}+2\alpha\|\hat{\beta}\|_{2}^{2}\Big]+\lambda\|\hat{\beta}\|_{1}\leq\frac{1}{n}\Big[\|Y-\mathbf{X}\beta\|_{2}^{2}+2\alpha\|\beta\|_{2}^{2}\Big]+\lambda\|\beta\|_{1}
\]
\[
\implies\frac{1}{n}\|\mathbf{X}\hat{\beta}\|_{2}^{2}-\frac{2}{n}Y^{\mathsf{T}}\mathbf{X}\hat{\beta}+\frac{2\alpha}{n}\|\hat{\beta}\|_{2}^{2}+\lambda\|\hat{\beta}\|_{1}\leq\frac{1}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}Y^{\mathsf{T}}\mathbf{X}\beta+\frac{2\alpha}{n}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}
\]

Adding $\frac{1}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}(\mathbf{X}\beta)^{\mathsf{T}}(\mathbf{X}\hat{\beta})+\frac{2\alpha}{n}\|\beta\|_{2}^{2}$ to both sides and re-arranging terms, we get:

\[
\begin{aligned}
&\frac{1}{n}\|\mathbf{X}\hat{\beta}\|_{2}^{2}+\frac{1}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}(\mathbf{X}\beta)^{\mathsf{T}}(\mathbf{X}\hat{\beta})+\frac{2\alpha}{n}(\|\hat{\beta}\|_{2}^{2}+\|\beta\|_{2}^{2})+\lambda\|\hat{\beta}\|_{1}\\
\leq\;&\Big[\frac{2}{n}Y^{\mathsf{T}}(\mathbf{X}\hat{\beta})-\frac{2}{n}(\mathbf{X}\beta)^{\mathsf{T}}(\mathbf{X}\hat{\beta})\Big]+\Big[\frac{2}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}Y^{\mathsf{T}}(\mathbf{X}\beta)\Big]+\frac{4\alpha}{n}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}.
\end{aligned}
\tag{6}
\]

Note that we can lower bound the LHS of eq. 6:

\[
\begin{aligned}
&\frac{1}{n}\|\mathbf{X}\hat{\beta}\|_{2}^{2}+\frac{1}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}(\mathbf{X}\beta)^{\mathsf{T}}(\mathbf{X}\hat{\beta})+\frac{2\alpha}{n}(\|\hat{\beta}\|_{2}^{2}+\|\beta\|_{2}^{2})+\lambda\|\hat{\beta}\|_{1}\\
=\;&\frac{1}{n}\|\mathbf{X}(\hat{\beta}-\beta)\|_{2}^{2}+\frac{\alpha}{n}(\|\hat{\beta}-\beta\|_{2}^{2}+\|\hat{\beta}+\beta\|_{2}^{2})+\lambda\|\hat{\beta}\|_{1}\\
\geq\;&\frac{1}{n}\|\mathbf{X}(\hat{\beta}-\beta)\|_{2}^{2}+\frac{\alpha}{n}\|\hat{\beta}-\beta\|_{2}^{2}+\lambda\|\hat{\beta}\|_{1}\\
=\;&\frac{1}{n}\|\hat{\beta}-\beta\|_{D}^{2}+\lambda\|\hat{\beta}\|_{1}
\end{aligned}
\tag{7}
\]

By substituting $Y=\mathbf{X}\beta+\epsilon$, we simplify the RHS of eq. 6 and get:

\[
\begin{aligned}
&\Big[\frac{2}{n}Y^{\mathsf{T}}(\mathbf{X}\hat{\beta})-\frac{2}{n}(\mathbf{X}\beta)^{\mathsf{T}}(\mathbf{X}\hat{\beta})\Big]+\Big[\frac{2}{n}\|\mathbf{X}\beta\|_{2}^{2}-\frac{2}{n}Y^{\mathsf{T}}(\mathbf{X}\beta)\Big]+\frac{4\alpha}{n}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}\\
=\;&\Big[\frac{2}{n}\epsilon^{\mathsf{T}}(\mathbf{X}\hat{\beta})\Big]+\Big[-\frac{2}{n}\epsilon^{\mathsf{T}}(\mathbf{X}\beta)\Big]+\frac{4\alpha}{n}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}\\
=\;&\frac{2}{n}\epsilon^{\mathsf{T}}\mathbf{X}(\hat{\beta}-\beta)+\frac{4\alpha}{n}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}.
\end{aligned}
\tag{8}
\]

Putting eq. 6, eq. 7, and eq. 8 together, we get:

\[
\frac{1}{n}\|\hat{\beta}-\beta\|_{D}^{2}+\lambda\|\hat{\beta}\|_{1}\leq\frac{2}{n}\epsilon^{\mathsf{T}}\mathbf{X}(\hat{\beta}-\beta)+\lambda\|\beta\|_{1}+\frac{4\alpha}{n}\|\beta\|_{2}^{2}.
\]

∎

Appendix E Proof for theorem 1

Theorem 1 (restated).

Proof.

Recall that in Ridge SurveyUCB, for the construction of our confidence sets, for all $k\in[K]$ and $t\in[T]$, we choose:

\[
\theta_{k,t}=\sigma\sqrt{|H_{k,t}|\log\bigg(\frac{1+n_{k,t}L^{2}/\alpha}{\delta/(K(1+n_{k,t})^{2})}\bigg)}+\sqrt{\alpha}\,b
\]

Hence, from corollary 1 and lemma 4, we have:

\[
\Pr\Big[\forall i\in[K],\forall t,\ \beta_{i}\in C_{i,t-1}\Big]\geq 1-\delta.
\]

Therefore, with probability at least $1-\delta$, lemma 3 gives us that the following regret guarantee holds:

\[
\begin{aligned}
R_{T}&=\sum_{i\in[K]}R_{T}^{i}\\
&\leq\sum_{i\in[K]}\theta_{i,T-1}\sqrt{8dn_{i,T}\log\bigg(\frac{d\alpha+n_{i,T}L^{2}}{d\alpha}\bigg)}\\
&\leq\sum_{i\in[K]}\bigg(\sigma\sqrt{d\log\bigg(\frac{1+TL^{2}/\alpha}{\delta/(K(1+T)^{2})}\bigg)}+\sqrt{\alpha}\,b\bigg)\sqrt{8dn_{i,T}\log\bigg(\frac{d\alpha+TL^{2}}{d\alpha}\bigg)}\\
&\leq\bigg(\sigma\sqrt{d\log\bigg(\frac{1+TL^{2}/\alpha}{\delta/(K(1+T)^{2})}\bigg)}+\sqrt{\alpha}\,b\bigg)\sqrt{8d\log\bigg(\frac{d\alpha+TL^{2}}{d\alpha}\bigg)}\sum_{i\in[K]}\sqrt{n_{i,T}}\\
&\leq\bigg(\sigma\sqrt{d\log\bigg(\frac{1+TL^{2}/\alpha}{\delta/(K(1+T)^{2})}\bigg)}+\sqrt{\alpha}\,b\bigg)\sqrt{8d\log\bigg(\frac{d\alpha+TL^{2}}{d\alpha}\bigg)}\sqrt{TK}
\end{aligned}
\]

where the last inequality follows from the Cauchy–Schwarz inequality and the fact that $T=\sum_{i=1}^{K}n_{i,T}$. Therefore, our high-probability regret bound is $O(d\sqrt{KT\log(K)}\log(T))$. ∎

Appendix F Proof for theorem 2

Theorem 2 (restated).

Proof.

Recall that in Elastic Net SurveyUCB we use Elastic net regression to estimate β^i,t\hat{\beta}_{i,t}, with regularization parameters α\alpha and λi,t\lambda_{i,t} for all arms i[K]i\in[K] and time-steps tt. Where:

\[
\lambda_{i,t}:=4\sigma L\sqrt{\frac{2}{n_{i,t}}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)},\quad\forall i\in[K],\forall t\in[T].
\]

For the construction of our confidence sets, for all $i\in[K]$ and $t\in[T]$, we choose:

\[
\theta_{i,t}:=\sqrt{6\sigma Lb\sqrt{2n_{i,t}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)}+4\alpha b^{2}}.
\]

Now from corollary 1 and lemma 5, we have that:

\[
\Pr\Big[\forall i\in[K],\forall t,\ \beta_{i}\in C_{i,t-1}\Big]\geq 1-\delta.
\]

Therefore, with probability at least $1-\delta$, lemma 3 yields the following regret guarantee:

\begin{align*}
R_{T}&=\sum_{i\in[K]}R_{T}^{i}\\
&\leq\sum_{i\in[K]}\theta_{i,T-1}\sqrt{8dn_{i,T}\log\bigg(\frac{d\alpha+n_{i,T}L^{2}}{d\alpha}\bigg)}\\
&\leq\sum_{i\in[K]}\sqrt{6\sigma Lb\sqrt{2n_{i,T}\log\bigg(\frac{4dKn_{i,T}^{2}}{\delta}\bigg)}+4\alpha b^{2}}\sqrt{8dn_{i,T}\log\bigg(\frac{d\alpha+n_{i,T}L^{2}}{d\alpha}\bigg)}\\
&\leq\sum_{i\in[K]}4n_{i,T}^{3/4}\sqrt{3\sigma Lbd\bigg(\sqrt{2\log\bigg(\frac{4dKT^{2}}{\delta}\bigg)}+\frac{2\alpha b}{3\sigma L}\bigg)\log\bigg(\frac{d\alpha+TL^{2}}{d\alpha}\bigg)}\\
&\leq 4T^{3/4}K^{1/4}d^{1/2}\sqrt{3\sigma Lb\bigg(\sqrt{2\log\bigg(\frac{4dKT^{2}}{\delta}\bigg)}+\frac{2\alpha b}{3\sigma L}\bigg)\log\bigg(\frac{d\alpha+TL^{2}}{d\alpha}\bigg)}
\end{align*}

where the last inequality follows from Hölder's inequality and the fact that $T=\sum_{i\in[K]}n_{i,T}$. Therefore, our high-probability regret bound is $O(K^{1/4}d^{1/2}T^{3/4}\log^{3/4}(T)\log^{1/4}(dK))$. ∎
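For completeness, the Hölder step (with exponents $4/3$ and $4$, applied to $a_{i}=n_{i,T}^{3/4}$ and $b_{i}=1$) used in the final inequality is:

```latex
\[
\sum_{i\in[K]}n_{i,T}^{3/4}
\leq\bigg(\sum_{i\in[K]}n_{i,T}\bigg)^{3/4}\bigg(\sum_{i\in[K]}1\bigg)^{1/4}
=K^{1/4}T^{3/4}.
\]
```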

Appendix G Implementation Issues

Here we describe how one can handle some of the difficulties in implementing our algorithms.

G.1 Ridge SurveyUCB

We add a stabilizing numerical approximation that sets numbers in $[-10^{-8},10^{-8}]$ to zero throughout the algorithm, for every parameter that we store. We do this because we often need to take the pseudo-inverse of $D_{t-1}^{i}$ for different arms at every time-step; this makes our updates particularly vulnerable to small errors around zero, and the algorithm exhibits unexpected behavior without the stabilization.
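A minimal sketch of this stabilization in Python (the threshold $10^{-8}$ matches the text; the function name `stabilize` is ours):

```python
import numpy as np

def stabilize(arr, tol=1e-8):
    """Set entries with magnitude at most `tol` to exactly zero.

    Applying this to every stored parameter (e.g. the matrix D_{t-1}^i
    before taking its pseudo-inverse) keeps tiny floating-point residues
    from contaminating the pseudo-inverse.
    """
    arr = np.asarray(arr, dtype=float).copy()
    arr[np.abs(arr) <= tol] = 0.0
    return arr

# Example: small off-diagonal residues are zeroed before the pseudo-inverse.
D = np.array([[1.0, 3e-9], [-2e-9, 0.5]])
D_stable = stabilize(D)
D_pinv = np.linalg.pinv(D_stable)
```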

G.2 Elastic Net SurveyUCB

The Python sklearn implementation of Elastic Net regression seems to have convergence issues for large L1 regularization. For lower-dimensional problems, this can be fixed by setting the tolerance parameter to be very low (which also increases the running time). For higher-dimensional problems, it seems better to divide the L1 regularization parameter by the dimension of the problem and adjust the confidence set bounds accordingly. In particular, we choose:

\[
\lambda_{i,t}:=\frac{1}{d}\cdot 4\sigma L\sqrt{\frac{2}{n_{i,t}}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)},\quad\forall i\in[K],\forall t\in[T].
\]

Note that we are just dividing our original L1 regularization parameter by a factor of $d$. To ensure that our confidence sets hold with high probability, we choose for all arms and time-steps (this is easy to check via a simple modification of the analysis in lemma 9):

\[
\theta_{i,t}:=\sqrt{6\sigma L(b+\|\hat{\beta}_{i,t}\|)d\sqrt{2n_{i,t}\log\bigg(\frac{4dKn_{i,t}^{2}}{\delta}\bigg)}+4\alpha b^{2}}.
\]
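A sketch of the per-arm fit with sklearn's `ElasticNet`. Note that sklearn minimizes $\frac{1}{2n}\|y-Xw\|^{2}+ar\|w\|_{1}+\frac{a(1-r)}{2}\|w\|^{2}$ with parameters `alpha` ($a$) and `l1_ratio` ($r$); the mapping from the paper's penalties to $(a,r)$ below is our assumption and would need to be checked against the exact objective normalization used in lemma 9:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_arm(X, y, lam_l1, alpha_l2):
    """Fit elastic net for one arm.

    lam_l1 plays the role of the paper's lambda_{i,t} (L1 penalty) and
    alpha_l2 the role of alpha (L2 penalty). The translation into
    sklearn's (alpha, l1_ratio) parameterization is an assumption.
    A very low `tol` helps with the convergence issues mentioned above.
    """
    a = lam_l1 + alpha_l2
    r = lam_l1 / a
    model = ElasticNet(alpha=a, l1_ratio=r, tol=1e-6, max_iter=100000)
    model.fit(X, y)
    return model.coef_

# Synthetic check: sparse true coefficients are approximately recovered.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))
beta = np.zeros(10)
beta[:3] = 1.0
y = X @ beta + 0.01 * rng.normal(size=200)

d = X.shape[1]
lam = (1 / d) * 0.1  # L1 penalty scaled down by d, as suggested above
coef = fit_arm(X, y, lam_l1=lam, alpha_l2=0.01)
```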

G.3 Interactive Surveys

The only issue in implementing algorithm 3 is that the optimization problem in eq. 5 is non-convex. It is worth noting that any upper bound on the optimization problem maintains the regret performance of interactive SurveyUCB, and tighter upper bounds demonstrate greater savings in terms of survey length. Hence one could use reasonable convex relaxations of this optimization problem in algorithm 3.

In this sub-section, we provide a well-performing heuristic as a surrogate for the optimization problem when the context space is $[0,1]^{d}$. Now note that:

\begin{align*}
\max_{x\in F}\quad & x^{\mathsf{T}}\hat{\beta}_{w,t-1}+\sqrt{\theta_{w,t-1}\,(x^{\mathsf{T}}(D_{t-1}^{w})^{-1}x)}\\
\leq\max_{x,y\in F}\quad & x^{\mathsf{T}}\hat{\beta}_{w,t-1}+\sqrt{\theta_{w,t-1}\,(y^{\mathsf{T}}(D_{t-1}^{w})^{-1}y)}
\end{align*}

Note that the optimization problem $\max_{x\in F}x^{\mathsf{T}}\hat{\beta}_{w,t-1}$ is convex, and in fact linear when our context space can be represented by linear constraints. We now only need to handle the non-convex problem:

\[
\max_{y\in F}\sqrt{\theta_{w,t-1}(y^{\mathsf{T}}(D_{t-1}^{w})^{-1}y)}=\sqrt{\max_{y\in F}\theta_{w,t-1}(y^{\mathsf{T}}(D_{t-1}^{w})^{-1}y)}\qquad(9)
\]

This is a non-convex quadratic optimization problem, where (Dt1w)1(D_{t-1}^{w})^{-1} is symmetric (and positive semi-definite). This problem has been well studied and has good approximations based on SDP relaxations (Ye, 1999).

For our simulations we use the following simple heuristic, which has worked surprisingly well, as a surrogate for this optimization problem (eq. 9). In eq. 9, we want to maximize uncertainty in reward over contexts that are consistent with our queries. Intuitively, uncertainty is maximized when the context $y$ is as large as possible. When our context space is $[0,1]^{d}$, we can simply choose the $y$ that is largest coordinate-wise. That is, $y_{j}=1$ if $j$ is not queried (i.e. $j\notin U$) and $y_{j}=(X_{t})_{j}$ if $j$ has been queried (i.e. $j\in U$).
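A minimal sketch of this heuristic in Python (variable names are ours; `queried` plays the role of $U$ and `x_t` of $X_{t}$):

```python
import numpy as np

def heuristic_y(x_t, queried, d):
    """Coordinate-wise largest context consistent with the queries.

    Unqueried coordinates (j not in U) are set to 1, the maximum over
    [0,1]; queried coordinates keep their observed values (X_t)_j.
    """
    y = np.ones(d)
    for j in queried:
        y[j] = x_t[j]
    return y

def ucb_uncertainty(y, D_inv, theta):
    """Surrogate objective value for eq. 9: sqrt(theta * y^T D^{-1} y)."""
    return np.sqrt(theta * (y @ D_inv @ y))

# Example with d = 3, where only feature 1 has been queried.
x_t = np.array([0.2, 0.7, 0.5])
y = heuristic_y(x_t, queried={1}, d=3)  # -> [1.0, 0.7, 1.0]
val = ucb_uncertainty(y, np.eye(3), theta=1.0)
```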