Survey Bandits with Regret Guarantees
Abstract
We consider a variant of the contextual bandit problem. In standard contextual bandits, when a user arrives we get the user’s complete feature vector and then assign a treatment (arm) to that user. In a number of applications (like healthcare), collecting features from users can be costly. To address this issue, we propose algorithms that avoid needless feature collection while maintaining strong regret guarantees.
The first author is supported in part by a Dantzig-Lieberman Operations Research Fellowship and a Management Science and Engineering Department Fellowship.
The second author is supported in part by Sloan Foundation, Schmidt Futures, and ONR grant N00014-17-1-2131.
1 Introduction
In a number of applications, like healthcare and mobile health, collecting features from users can be costly. This encourages us to develop variants of contextual bandits that at any time-step select a set of features to be collected from users, and then select an action based on these observed features. We use the term survey bandits to refer to this set-up. Through this formulation, we address the issue of needless feature collection in contextual bandits.
Suppose we are building a system to recommend charities that users can donate to. Since there are many charities we could recommend, it is efficient to make personalized recommendations. This requires us to collect features from users, which can be done by requiring users to fill a survey/questionnaire. The reward at any time-step might be the amount donated. As usual, our goal is to minimize regret. Beyond regret, we would also like to improve user experience by shortening the survey we require users to fill. We try to answer the following question: Can we ensure strong regret guarantees, while being able to shorten our survey over time?
Our contributions: We answer this question in the affirmative. We start by considering zero-shot surveys where the decision maker has to decide the set of features to be queried at every time-step before the user arrives and then make a personalized decision based on the responses. We now state our assumptions. In addition to the standard assumptions made for LinUCB, we introduce 1 (beta-min) which is common in the feature selection literature. We propose algorithms that are natural variants of LinUCB (Li et al., 2010) for the survey bandits framework. Under our assumptions, we prove regret guarantees for these algorithms. At the same time, these algorithms exploit sparsity in arm parameters to reduce the set of features collected from users.
In fact our algorithm, Ridge SurveyUCB has regret guarantees that are tight even for standard contextual bandits . We also provide an algorithm Elastic Net SurveyUCB which is more robust to 1, but with a weaker regret guarantee. This result requires us to prove a new adaptive tail bounds for the elastic net estimator, which may be of independent interest.
Through simulations, when 1 holds, we demonstrate that both algorithms perform well in terms of regret. Unfortunately, Ridge SurveyUCB can perform poorly on regret when 1 is violated. Fortunately, Elastic Net SurveyUCB performs well even when 1 is violated. It is worth noting that we can still use Ridge SurveyUCB, we just need to use a conservative choices for the beta-min parameter (in 1). Even for conservative choices of the beta-min parameter, we eventually see benefits on survey length.
Through simulations, we also see that both algorithms demonstrate savings in terms of survey length. In fact, in the absence of sub-optimal arms, both algorithms always remove the features that are not relevant for reward prediction in the survey.
We also consider settings with interactive survey’s where at each time step, before making a personalized decision, the decision maker can continually expand the set of queries based on previous responses by the user at that time-step. This allows us to start by querying a smaller set of features and expand our survey for the user only if it is needed to make a better personalized decision. We develop variants of our algorithms that use interactive surveys to ensure lower survey lengths, especially in the presence of sub-optimal arms, while maintaining the same regret performance.
Related work: Prior work (e.g. (Abbasi-Yadkori et al., 2012) and (Bastani & Bayati, 2015)) has also exploited sparsity in arm parameters to provide stronger sparsity dependent regret guarantees. When an upper-bound on the sparsity of arm-parameters is known to the decision-maker, Abbasi-Yadkori et al. (2012) provide algorithms with tight regret guarantees for contextual bandits. Under distributional assumptions on user covariates, Bastani & Bayati (2015) provide stronger regret guarantees, even when no sparsity parameter is provided to the algorithm. While these regret guarantees are stronger than the ones we provide, it is important to note that we do not make any distributional assumption on user covariates and we do not assume that the decision-maker knows an upper-bound to the sparsity of arm-parameters.
Bouneffouf et al. (2017) is the most closely related paper to our work. They develop interesting algorithms for a very similar setup and evaluate them empirically. At every time-step, their algorithms queries a fixed number of features (). Their algorithms requires this parameter to be an input to the algorithm. For a conservative choice of , we would see little benefits to the survey length. Unfortunately, empirically their algorithms could perform much worse than contextual bandits for small choices of . We view our work as an alternate approach with guarantees on regret.
2 Preliminaries
2.1 Problem Setting
Let denote the number of time-steps. At each time-step, a new user arrives. The decision maker has a survey with questions and can ask a user any subset of these questions. At any time-step , a user comes with a set of observable covariates . Here is a -vector, corresponding to the -th user’s observable answers to the questions on the survey.
The decision maker has access to arms (decisions). Each arm yields a noisy, user specific reward. In particular, each arm has an unknown parameter . At time , pulling arm would yield a reward . Where are independent sequence of -sub-Gaussian random variables. Note that at any time-step, the decision maker can only observe the reward of the arm that was pulled.
Some notation: For any vector and any index set , we let denote the vector obtained by setting coordinates of that are not in to zero. For any matrix and any index set , we let denote the matrix obtained by setting rows and columns of that do not correspond to an index in to zero. For any , we let to denote the support. And for any set , we let .
Goal: The goal is to design a sequential decision making policy that maximizes expected cumulative reward, and subject to strong reward guarantees minimizes the total number of questions asked to users. Let denote the subset of survey questions queried by policy at time . And, let denote the arm chosen by policy at time . Note that we do not observe and only observe , hence we should be able to choose the arm using only the observed covariates and data collected from previous time-steps.
Target policy: We now describe a ”sensible” target policy. Consider the target policy that already knows arm parameters but not the noise parameters. We want the target policy to maximize expected cumulative reward, and hence at any time-step the policy must pick an arm . Therefore, the target policy only needs to query features that influence arm rewards. That is,
Note that for any vector and any arm , we have that . Therefore at any time-step , having observed , the target policy is able to choose the best arm for the covariate :
Regret: If , we define expected regret at time as the difference between the maximum expected reward and the expected reward of arm at time , i.e. . And, we define the cumulative expected regret as .
Additional notation: Let denote the design matrix, whose rows are . Let denote the vector of observations , where entries of may be missing 111If arm wasn’t played at time , then the -th coordinate of would be missing.. For all and for any , define the sample set and let . Let denote the number of times arm was pulled upto time (i.e. ). For any , we let be the submatrix of whose rows are for each . Similarly when , we let be the -vector whose coordinates are for all 222Since when , we know that for all . Therefore, we have has no missing values.. Also let denote the identity matrix.
2.2 Assumptions
We make two assumptions. The following assumption is to allow us to ignore features that have small influence on arm rewards, we assume that such features in-fact have no influence on arm rewards.
Assumption 1 (Beta-min).
The decision maker knows a parameter such that for all arms and all , either or .
The following is a common assumption made in problems for contextual multi-arm bandits, it is equivalent to assuming that expected rewards are bounded.
Assumption 2 (Bounded rewards).
We make the following assumptions to ensure that expected rewards of any arm for any context is bounded: For all and , we have and .
For simplicity we further assume for all and , the potential reward of pulling arm lies in . i.e. . 333We can avoid the assumption that for all and . We would just need to choose in definition 2 for the regret guarantee to work out. Where for all .
Since for any vector , we know that for all . Therefore for all , 2 gives us that the -norms of and are bounded by and for all time-steps and arms .
3 UCB for Survey Bandits
In this section, we describe a natural extension of LinUCB (Li et al., 2010) for survey bandits which involves describing a policy for selecting survey questions and a consistent arm selection policy that can choose an arm given the observed covariates.
Upper confidence bound (UCB) algorithms follow the principle of optimism in the face of uncertainty. The essential idea is to construct high-probability confidence sets , for the parameter of every arm , from observed covariates and rewards . That is, with high probability, for all time-steps and for all arms . The algorithm then queries the set of features:
Therefore, we have that for . Hence, for any and , we have that . Therefore, it follows that:
The algorithm chooses an optimistic estimate for every arm and then chooses an arm which maximizes reward according to the optimistic estimates. Equivalently, and more compactly, the algorithm chooses the arm:
Note that the arm is chosen given only the observed covariates . We call the resulting algorithm SurveyUCB.
4 Confidence Sets for SurveyUCB
In this section, we define confidence sets in Standard form and describe the construction of these sets using AlgConfidence. It will turn-out that AlgConfidence constructs confidence sets in Standard form. Throughout this manuscript, we use AlgConfidence to construct confidence sets for SurveyUCB.
4.1 Confidence Sets in Standard Form
We start by defining weighted norms and use that to define confidence sets in Standard form.
Definition 1 (Weighted norm).
For any vector and any positive semi-definite matrix , we define the weighted norm of with respect to as follows: . 444 is a norm when is positive semi-definite.
Definition 2 (Standard form).
We say confidence sets are in Standard form if at any time-step and for any arm , we have that is a ball centered around our estimate of the true arm parameter under a weighted norm and have non-increasing supports. More specifically, for all , and for some , we have:
In SurveyUCB, note that the confidence sets determine the set of features queried. Also at any time-step , confidence sets must be constructed using only the observed covariates and rewards at every time-step upto . Lemma 1 gives us a set of observed covariates when SurveyUCB uses confidence sets in Standard form.
Lemma 1.
If SurveyUCB uses confidence sets in Standard form for times . Then for any arm , we observe . Where for all arms and time-steps .
Proof.
All statements in this proof hold for all arms and time-steps . SurveyUCB queries the set of features at time , where . From the structure of confidence sets in Standard form, we have that . This implies that:
That is, we observe . ∎
Note that at any time-step , SurveyUCB observes . Lemma 1 shows that for any , we also observe because the set of features queried by SurveyUCB at is a supper set of the support of the confidence set . That is, under the conditions of lemma 1 we have:
Initializing confidence sets: We now define our confidence set for time-step zero based on 2. For any arm under 2, we have that . That is, we have that for all arms , where:
Based on 1, section 4.2 describes the construction of confidence set for all arms and time-steps .
4.2 AlgConfidence
Consider any time-step . Suppose that the confidence sets constructed upto time are in Standard form, i.e. are in Standard form. Let be the arm pulled at time . We now describe the AlgConfidence update for confidence set and show that the confidence sets are in Standard form. This would inductively imply that AlgConfidence constructs confidence sets in Standard form since the base case trivially holds 555Confidence sets constructed up to and including time-step zero are trivially in Standard form..
From lemma 1 we already know that if the confidence sets constructed upto time are in Standard form, then the decision maker using SurveyUCB at least observes at every time-step . Where, . Hence, AlgConfidence can use this to construct the confidence set () for arm at time .
Input : Given , , , and , for some time-step and arm .
Output : .
AlgConfidence starts by constructing from by relying on 1. Where with be the support of the confidence set . We would like to construct so that it contains the support of , i.e. . In particular, if the confidence set for arm at time holds (i.e. ), then from 1 we have that for all :
Also we get that is a subset of the support of the confidence set () of arm at time . Hence we have that:
AlgConfidence now constructs confidence set with support . We then estimate by regressing over the features in , on the observed data set: . We then set the components of not in to zero, i.e. . Now with , we construct the confidence set for arm at time :
Note that and . Hence given the above form of the confidence set, we have that are in Standard form.
4.3 Probability Aggregation
To construct the confidence set , recall that AlgConfidence assumes that . This is unlike LinUCB (Li et al., 2010) and several other UCB algorithms where confidence sets are constructed from observed data without directly relying on previous confidence sets. Here, we argue that our construction does not lead to any unexpected issues. We now state a helpful lemma and its corollary, and defer proofs to appendix A.
Lemma 2 (Probability aggregation).
Consider a probability space . Consider any sequence of events , such that and for any . Let . We then have that:
Corollary 1.
Suppose SurveyUCB constructs the Standard form confidence sets for all arms and all time-step’s . We then have that:
Where and for all arms and time-steps .
Therefore from corollary 1 for any , we have that if for all and we have: 666Note that
(1) |
5 General Regret Analysis
In section 4.2, we already saw that SurveyUCB constructs confidence sets in Standard form. In lemma 3 we exploit the structure of confidence sets in Standard form to get a general regret bound for SurveyUCB.
Lemma 3 (General regret analysis).
Suppose 2 holds with for all . Suppose the confidence sets used in SurveyUCB are in Standard form with parameter (in definition 2). And suppose we have that at any time-step and for all arms . We then have that:
-
•
Instantaneous regret at time-step , .
-
•
And cumulative regret in rounds,
We defer the proof of lemma 3 appendix B. It is worth noting that the proof sheds some light on the need for confidence sets to be in Standard form.
6 Tail Inequalities for Adapted Observations
Consider a linear model , with design matrix , response vector , and noise vector . Where are independent sequence of -sub-Gaussian random variables.
6.1 Ridge Estimator
We now define the Ridge estimator for estimating the parameter as follows:
Definition 3 (Ridge).
Given regularization parameters . The Ridge estimate is given by:
From theorem 2 in (Abbasi-Yadkori et al., 2011) we get:
Lemma 4 (Abbasi-Yadkori, et al 2).
Let denote the -th row of . Let denote the -th entry of . The sequence form an adapted sequence of observations. That is, may depend on . Assume that . Then for , we have that:
Furthermore, if for all , we have that:
Where is the Ridge estimate with parameter . And, .
6.2 Elastic Net Estimator
We now define the Elastic net estimator for estimating the parameter as follows:
Definition 4 (Elastic net).
Given regularization parameters , . The Elastic net estimate is given by:
We now provide an adaptive tail inequality for the Elastic net estimator that may be of independent interest. We defer the proof to appendix C.
Lemma 5 (Elastic net tail inequality for Adapted observations).
Let denote the -th row of . Let denote the -th entry of . The sequence form an adapted sequence of observations. That is, may depend on . Also assume all realizations of satisfy and that . Then, if , we have:
Where is the Elastic net estimate with parameters . And, .
7 Ridge SurveyUCB
Note that the description and analysis of SurveyUCB gives us a fair amount of flexibility. We are still free to specify the regression method used to estimate for all and times . We are also free to specify our choice of for all arms and all times .
Ridge SurveyUCB is a version of the SurveyUCB. In Ridge SurveyUCB, we use Ridge regression with a fixed regularization parameter (), to estimate . We also choose for all arms and time-steps based on corollary 1 (eq. 1) and lemma 4 so that the Standard form confidence sets that we construct hold with high probability. Then from lemma 3 we naturally get a high-probability regret bound for Ridge SurveyUCB.
Say we want our bounds to hold with probability . Then from corollary 1, it is enough to show that eq. 1 holds for all arms and times. That is for all arms and all times , we want:
(1) |
Given , lemma 4 gives us that eq. 1 holds for with support if:
(2) |
Hence, from lemma 3 with the above choice of and the corresponding high-probability guarantee, we get theorem 1. We defer the proof to appendix E.
8 Elastic Net SurveyUCB
In this section, we want to develop a variant of SurveyUCB that is more robust to the choice of the beta-min parameter in 1. One way to do this is to modify line 3 in AlgConfidence. In particular, suppose is the arm pulled at time . We then construct by removing all features from that we estimate to be zero (i.e. ) and that we determine are irrelevant based on 1. That is:
It is easy to see that all our arguments for SurveyUCB continue to hold with the above modification. Now note that the above modification makes SurveyUCB more robust to 1, because we additionally need the th coordinate of our estimate () to be zero before we remove feature (at time ) from the model of arm . This modification encourages us to use sparse estimators.
Elastic Net SurveyUCB is a variant of SurveyUCB with the above modification. In Elastic Net SurveyUCB, we use the Elastic net regression to estimate , with regularization parameters and for all arms and time-steps . Where:
(3) |
Similar to the arguments in section 7, corollary 1 and lemma 5 give us that our confidence sets hold with probability if for all arms and times :
(4) |
Again from lemma 3 with the above choice of and the corresponding high-probability guarantee, we get theorem 2. We defer the proof to appendix F.
9 Interactive Surveys
We start by defining sub-optimal arms, describe an inefficiency in SurveyUCB, and propose a fix using interactive surveys.
Definition 5 (Sub-optimal arms).
An arm is said to be sub-optimal if the target policy would not pick it for any context vector in the context space.
At any time-step , the SurveyUCB algorithms query Now suppose the decision maker plays arm at time . The decision maker only needs the reward and the features corresponding to to update the model. Recall that the reason the decision maker queries all the features in is to be able to compute the upper confidence bounds for all arms.
Note that the bandit assignment only depends on the highest upper confidence bound, so it is not necessary to compute all the upper confidence bounds. The algorithm’s inefficiency becomes more evident in the presence of sub-optimal arms because these arms are played less frequently, hence updated less frequently, and increase survey length.
Interactive input : User at time , that answers queries whenever asked.
Output: Queries to user .
We attempt to resolve this inefficiency through interactive surveys. Consider the user at time , the decision maker needs to query some features from the user before taking an action. We start by creating an ordered list of all arms, starting from the most pulled arms and ending with the least pulled arms. This ordering is a heuristic choice, the idea is to keep sub-optimal arms (which are less frequently pulled) towards the end of the list. We keep removing arms from this list and terminate (take an action) when it is empty.
The main idea is to sequentially query the feature sets for arms in the list, simultaneously remove queried arms from the list, and also remove unqueried arms that do not have the largest upper confidence bound. We will now explain how we determine that some unqueried arm doesn’t have the largest upper confidence bound. First note that for each queried arm, we can exactly compute its upper confidence bound at time . Clearly the following optimization problem gives us an upper bound to the upper confidence bound of arm : 777Here is the pseudo-inverse of the matrix.
(5) | ||||
Where denotes the set of contexts in the context space that are consistent with the queries so far (i.e. feasible such that where is the set of features queried so far for user ). Hence if this upper bound is less than the upper confidence bound for some queried arm, we remove arm from the list. For more details see algorithm 3.
Note that interactive versions of SurveyUCB have the same regret performance as SurveyUCB, because both algorithms work with the same confidence sets and always choose the arm with the largest upper confidence bound. Also note that the optimization problem in eq. 5 is non-convex. In section G.3 we provide a heuristics for this optimization that has been effective in simulations.
10 Simulation
Suppose we have users with fifty features and suppose we have five arms. Where expected reward for context is: for arm 1, for arm 2, for arm 3, and zero for both arm 4 and arm 5. Hence, only two of the fifty features are predictive of arm rewards and both arm 4 and arm 5 are sub-optimal. We draw contexts from the uniform distribution and reward noise () is drawn from . We consider a 100000 step time horizon.
We assume noise is 1-sub-Gaussian, contexts lie in the space , and the 1-norm and 2-norms of the arm parameters is bounded by and respectively. We run simulations with () and without () sub-optimal arms. That is, in our simulations without sub-optimal arms, we only consider the first three arms. We plot regret and cumulative survey length vs time-steps. The plots here are generated by averaging over five runs.
In simulations, when 1 holds, Ridge SurveyUCB has better regret performance compared to Elastic Net SurveyUCB (for similar beta-min parameters). Plots verify that Elastic Net SurveyUCB is infact reasonably robust to 1 and doesn’t remove predictive features (in simulations) even under violations of 1. We also note that sub-optimal arms hurt the performance of SurveyUCB algorithms. And, interactive surveys help mitigate the negative effects of sub-optimal arms on survey lengths. Also note that performance on regret improves with less conservative choices for the beta-min parameter, this implies that model truncation also helps improve regret performance.


Ridge SurveyUCB with


Ridge SurveyUCB with


Ridge SurveyUCB with


Elastic net SurveyUCB with .
References
- Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.
- Abbasi-Yadkori et al. (2012) Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1–9, 2012.
- Bastani & Bayati (2015) Bastani, H. and Bayati, M. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015.
- Bouneffouf et al. (2017) Bouneffouf, D., Rish, I., Cecchi, G. A., and Féraud, R. Context attentive bandits: Contextual bandit with restricted context. In IJCAI, 2017.
- Bühlmann & Van De Geer (2011) Bühlmann, P. and Van De Geer, S. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
- Li et al. (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. ACM, 2010.
- Ye (1999) Ye, Y. Approximating global quadratic optimization with convex quadratic constraints. Journal of Global Optimization, 15(1):1–17, 1999.
Appendix A Proofs for probability aggregation
See 2
Proof.
We want to use induction. First note that:
Also note that for any , we have that:
Where the first inequality follows from the fact that is a supper set of [Since, ]. Therefore from induction, for all we get that:
We get our required result by taking limit as goes to . ∎
See 1
Proof.
Recall that from section 4.2 we know that if . For any arm , using lemma 2 with being the event that and being the event that , we get that:
Now, from union bound and the above inequality we get that:
∎
Appendix B Proofs for General Regret Analysis
B.1 Instantaneous Regret Decomposition
Let denote the instantaneous regret at time-step . Suppose SurveyUCB picks arm at time-step , that is . And, suppose arm is the optimum arm at time-step , that is . Now suppose denote the set of questions queried by SurveyUCB, then
Hence we have that for any arm and time-step . Therefore for any and , we have that . Further from conditions stated in lemma 3, we have that for all and . Therefore, we have:
Since SurveyUCB chooses arm , and . We have that . Where denotes the optimistic estimate of arm ’s parameter at time-step , that is . Therefore, we get that:
Where denotes the estimate of arm ’s parameter at time-step . Now using Caushy-Schwarz inequality for weighted norm with respect to the positive definite matrix , we get:
Where the last inequality follows directly from the fact that our confidence sets are in Standard form. The first equality follows from the 1, the fact that , and the fact that which follows from the fact that our confidence sets are in Standard form, where . We defer the proof of 1 to section B.3.
Observation 1.
Consider any vector any positive semi-definite matrix . Let be such that . We have that if .
Now since reward at any time lies in the range . Therefore for all . Since , we further have that .
B.2 Cumulative Regret
From lemma 11 in (Abbasi-Yadkori et al., 2011), we get:
Lemma 6 (Abbasi-Yadkori, et al 1).
Let be a sequence in and that is positive definite. Define . Further if for all , then:
Now, consider any arm , let denote the cumulative regret incurred by arm . Hence from Caushy-Schwarz inequality and from bound on in section B.1, we get that:
From conditions stated in lemma 3 we have that for all . Hence from lemma 6 we get that:
Therefore, our total regret is of the form:
This completes the proof of lemma 3.
B.3 Proof for Observations
See 1
Proof.
Consider any any . Let be such that . Let , that is we get by setting rows and columns of not in to zero. Consider , which we get by setting columns of not in to zero. Since rows of have support in and are the same as the ’s rows within the support, we have that . That is, . Now note that we can get by setting rows of not in to zero. Similarly, we get . Therefore, we have that:
Since and hence are psd, this also gives us that . ∎
Appendix C Proofs for lemma 5
Consider a linear model , with design matrix , response vector , and noise vector . Where are independent sequence of -sub-Gaussian random variables. Now, from lemma EC2 in (Bastani & Bayati, 2015), we have that:
Lemma 7 (Bastani and Bayati).
Let denote the -th row of . Let denote the -th entry of . The sequence form an adapted sequence of observations. That is, may depend on . Also assume all realizations of satisfy . Now, define the event:
Where is the -th column of and . Then, we have .
We now state a useful basic inequality for Elastic net estimators. The proof is similar to basic inequalities proved for Lasso in (Bühlmann & Van De Geer, 2011) and defer the proof to appendix D.
Lemma 8 (Basic inequality for Elastic net).
Consider a linear model , with design matrix , response vector , and noise vector . We then have that:
Where is the Elastic net estimate with parameters . And, .
We define the following lemma is result of simplifying lemma 8 under the high-probability event given in lemma 7 when .
Lemma 9.
Consider a linear model , with design matrix , response vector , and noise vector . Let be the Elastic net estimate with parameters and let . When and holds, we have that:
Proof.
Since holds and , we get:
Hence from lemma 8 and the above inequality, we have that:
Now since , we get:
Where the last implication follows from triangle inequality. ∎
See 5
Appendix D Basic Inequality for Elastic Net
See 8
Appendix E Proof for theorem 1
See 1
Proof.
Recall that in Ridge SurveyUCB, for the construction of our confidence sets, for all and , we choose:
Hence from corollary 1 and lemma 5, we have that:
Therefore, with probability at least , we get that from lemma 3 the following regret guarantee holds:
Where the last inequality follows from Caushy-Schwarz inequality and the fact that . Therefore our high-probability regret bound is . ∎
Appendix F Proof for theorem 2
See 2
Proof.
Recall that in Elastic Net SurveyUCB we use Elastic net regression to estimate , with regularization parameters and for all arms and time-steps . Where:
And for the construction of our confidence sets, for all and , we choose:
Now from corollary 1 and lemma 5, we have that:
Therefore, with probability at least , we get that from lemma 3 the following regret guarantee holds:
Where the last inequality follows from Holder’s inequality and the fact that . Therefore, our high-probability regret bound is . ∎
Appendix G Implementation Issues
Here we describe how one can handle some of the difficulties in implementing our algorithms.
G.1 Ridge SurveyUCB
We need to add a stabilizing numerical approximations that sets numbers between to zero throughout the algorithm. We do this for all parameters that we store. The reason we do this is because, we often need to take the pseudo-inverse of for different arms at every time-step. This makes our updates particularly vulnerable to small errors around zero, and leads to unexpected behavior without it.
G.2 Elastic Net SurveyUCB
The python sklearn implementation for Elastic Net regression seems to have some issues with convergence for large L1 regularization. For smaller dimension problems, this can be fixed by setting the tolerance parameter to be very low (which also increases the running time). For larger dimension problems, it seems better to just divide the L1 regularization parameter by the dimension of the problem and appropriately adjust the confidence set bounds. In particular, we choose:
Note that we are just dividing our original L1 regularization parameter by a factor of . To ensure that our confidence sets hold with high-probability, we choose for all arms and time-steps: 888Easy to check, simple modification of analysis in lemma 9.
G.3 Interactive Surveys
The only issue with implementing algorithm 3 is that the optimization problem in eq. 5 is non-convex. It is worth noting that any upper bound to the optimization problem would maintain the regret performance of interactive SurveyUCB. And better upper-bounds, would be able to demonstrate better savings in terms of survey length. Hence one could use reasonable convex relaxations for this optimization problem in algorithm 3.
In this sub-section, we provide a well performing heuristic as a surrogate to the optimization problem when the context-space is a . Now note that:
Note that the optimization problem is convex and infact linear when our context-space can be represented by linear constraints. We now only need to optimize the non-convex problem:
(9) |
This is a non-convex quadratic optimization problem, where is symmetric (and positive semi-definite). This problem has been well studied and has good approximations based on SDP relaxations (Ye, 1999).
For our simulations we use the following simple heuristic as a surrogate to this optimization problem (eq. 9) that has worked surprisingly well. In the above problem (eq. 9), we want maximize uncertainty in reward over contexts that are consistent with our queries. Intuitively, uncertainty is maximized when the context is as large as possible. When our context space is , we can simply choose that is largest co-ordinate wise. That is, if is not queried (i.e. ) and if has been queried (i.e. ).