Annotation Efficiency: Identifying Hard Samples via Blocked Sparse Linear Bandits
Abstract
This paper considers the problem of annotating datapoints using an expert with only a few annotation rounds in a label-scarce setting. We propose soliciting reliable feedback on difficulty in annotating a datapoint from the expert in addition to ground truth label. Existing literature in active learning or coreset selection turns out to be less relevant to our setting since they presume the existence of a reliable trained model, which is absent in the label-scarce regime. However, the literature on coreset selection emphasizes the presence of difficult data points in the training set to perform supervised learning in downstream tasks (Mindermann et al., 2022). Therefore, for a given fixed annotation budget of rounds, we model the sequential decision-making problem of which (difficult) datapoints to choose for annotation in a sparse linear bandits framework with the constraint that no arm can be pulled more than once (blocking constraint). With mild assumptions on the datapoints, our (computationally efficient) Explore-Then-Commit algorithm BSLB achieves a regret guarantee of where the unknown parameter vector has tail magnitude at sparsity level . To this end, we show offline statistical guarantees of Lasso estimator with mild Restricted Eigenvalue (RE) condition that is also robust to sparsity. Finally, we propose a meta-algorithm C-BSLB that does not need knowledge of the optimal sparsity parameters at a no-regret cost. We demonstrate the efficacy of our BSLB algorithm for annotation in the label-scarce setting for an image classification task on the PASCAL-VOC dataset, where we use real-world annotation difficulty scores.
1 Introduction
In niche industrial applications such as Named Entity Recognition (Nguyen et al., 2023) and learning tasks on low-resource languages (Hedderich et al., 2021), obtaining high-quality labels is challenging due to the lack of expert annotators. However, high-quality labels are critical for effective model training (Li et al., 2023), and thus, expert annotators must provide ground truth labels. (Sorscher et al., 2022) demonstrated that selecting high-quality data can reduce the power-law association of test error with dataset size to an exponential law. In annotation-expensive tasks with large volumes of unlabeled data, the challenge is to select a representative subset of datapoints for labeling. In label-scarce tasks, where the number of expert annotators is extremely low, often only one, it is impractical to query the same datapoint multiple times. While crowd-sourcing literature reduces noise by aggregating labels from multiple annotators (Verroios & Garcia-Molina, 2015), the annotation from a single or aggregated expert is considered the final ground truth label in our setting. We term this restriction, where a datapoint cannot be re-queried after annotation, the blocking constraint. Additionally, the annotation budget is typically much smaller than the datapoint embedding dimension. An efficient annotation strategy should be sequential (instead of one-shot) in such cases, as each annotation informs future decisions and helps identify more informative datapoints.
In addition, data-pruning techniques like coreset selection emphasize selecting hard examples (Maharana et al., 2023), while (Sorscher et al., 2022) justifies this for perceptron learning. Curriculum learning (Bengio et al., 2009) also uses increasingly difficult examples, but defining the ‘hardness’ of unlabeled data is ambiguous. Hard examples identified by heuristics are often noisy, mislabeled, or outliers (Mindermann et al., 2022). To address this, we propose soliciting annotation difficulty feedback directly from expert annotators. This adds minimal annotation overhead and helps identify hard examples more reliably than heuristics. Additionally, leveraging expert feedback to learn and predict hardness (via an auxillary model) is simpler than addressing complex downstream task itself.
Thus, the primary problem we aim to solve is: “Given a small annotation budget , how should we sequentially identify unique and difficult unlabeled datapoints for annotation as we receive expert feedback on hardness of annotation?” Theoretically, we model this sequential decision-making process using sparse linear bandits with a blocking constraint. Specifically, suppose we have a large set of unlabeled datapoints, each associated with a -dimensional embedding. With rounds available, at each round , a datapoint is selected for annotation, and the expert provides noisy feedback on the difficulty of labeling that datapoint, along with the ground truth label. To address the limited annotation budget and the relevance of a few important features in high-dimensional spaces (Hao et al., 2020), we model the expected hardness as a sparse linear function of the embedding. We ensure robustness to sparsity - our theoretical guarantees degrade gracefully as the parameter vector becomes less sparse, i.e., as it deviates from sparsity due to an increasing tail.
Although our primary motivation comes from data selection for annotation in a label-scarce regime, our theoretical framework is broadly applicable. For example, resource-constrained edge devices, such as smartphones, consider the problem of recommendation of personalized products. In applications such as movie/book recommendations, typical users will hardly consume an item more than once - so any feedback provided is not subject to change. Furthermore, the user will provide feedback for every item (books/movies) in a large repository of items. Here, our theoretical framework (sparse linear bandits with blocking constraint) is directly applicable as it involves learning unknown user taste from sequential item recommendations based on user feedback. While existing literature addresses this by combining feedback from multiple users (Bresler et al., 2014; Pal et al., 2024), such algorithms compromise user privacy. Our framework will not have this limitation.
Our Contributions and Techniques: The sparse linear bandits framework in the data-sparse regime was first studied by Hao et al. (2020); Jang et al. (2022) - however, these works do not consider either the blocking constraint or robustness to sparsity - both of which pose novel technical challenges. We propose BSLB (Blocked Sparse Linear Bandits), an efficient algorithm in this setting which is primarily an Explore-Then-Commit algorithm. In the exploration period, we carefully choose a subset of arms from which we sample without replacement - the goal is to ensure that the expected covariance matrix has a large minimum eigenvalue. In the exploitation period, we use a lasso estimator to estimate the unknown model parameters while being robust to sparsity. The optimal exploration period depends on the correct sparsity level of the unknown parameter vector, which is difficult to set in practice. Therefore, we also present a meta-bandit algorithm based on corralling a set of base bandits algorithms (Agarwal et al., 2017) - obtaining the same order-wise regret guarantees but without needing the knowledge of hyperparameters. Below, we summarize our main contributions:
-
1.
We theoretically model the problem of identifying hard (informative) unlabeled datapoints by high dimensional sparse linear bandits framework with blocking constraint - the number of rounds or annotation budget is very small, even smaller than the ambient dimension. The hardness is modeled as a sparse linear function of the datapoint embedding - we allow for robustness to sparsity in the parameters where the tail magnitude is non-zero and unknown.
-
2.
We propose an efficient “explore then commit” (ETC) algorithm BSLB for regret minimization in our framework (Theorem 2) that achieves a regret guarantee of for a fixed sparsity and the corresponding tail magnitude - for the special case of hard sparsity (tail magnitude ), BSLB achieves a regret guarantee of (Corollary 2). The run-time of BSLB is polynomial in the number of datapoints and is therefore efficient.
-
3.
A novel ingredient (function GetGoodSubset()) in the proposed algorithm BSLB is to select a good set of representative datapoints from the large set of unlabelled datapoints. This objective entails solving a discrete non-convex optimization problem in Equation 2. We propose a convex relaxation of the objective and obtain a feasible solution via randomized rounding - Theorem 3 states that our computationally efficient algorithm has good approximation guarantees.
-
4.
BSLB requires knowledge of sparsity (which also controls the tail magnitude ) to set exploration period length. Since this is challenging to set in practice, we propose a meta-algorithm C-BSLB that combines base algorithms with different exploration periods. C-BSLB achieves same regret guarantees order-wise (Theorem 4) as BSTB but without knowledge of sparsity.
-
5.
To validate our hypothesis and theory, we present experiments on the public image classification dataset PASCAL VOC 2012. Additional experiments on text classification on SST-2 and personalized recommendation with MovieLens, Jester, and Goodbooks datasets are in Appendix A.2.
Technical Challenges: At a high level, our proof follows a similar structure as in the analysis of Algorithm ESTC proposed in (Hao et al., 2020). First of all, both in (Hao et al., 2020) and our setting, there are no assumptions on the datapoint embeddings - unlike most existing literature on sparsity where the Gram matrix is assumed to satisfy Restricted Isometry Property (RIP) (Boche et al., 2015). A much weaker condition (namely Restricted Eigenvalue (RE)) in characterizing nice statistical guarantees of the Lasso estimator for sparse linear regression was shown in (Bickel et al., 2009; Rudelson & Zhou, 2013) - it was shown that if the covariance matrix of a distribution satisfies RE, then a sufficient number of independent samples will also satisfy RE. However, in our setting, the additional blocking constraint and our desired robustness to sparsity present three significant technical challenges. (A) Sampling: While the above techniques can be used directly in the analysis of Algorithm ESTC, the blocking constraint in our setting entails that sampling of datapoints cannot be independent in the exploration phase. (B) Statistical guarantees: To the best of our knowledge, when the Gram matrix only satisfies RE, the statistical guarantees of the Lasso estimator hold only for hard sparse vectors - for these vectors, all but entries are zero. Existing guarantees for soft sparsity (non-zero tail) hold only for Gram matrices satisfying the stronger RIP condition (Wainwright, 2019). (C) Knowledge of hyper-parameters: Our first proposed algorithm BSLB (similar to ESTC) requires as input a fixed sparsity (corresponding tail magnitude ). If the input sparsity is too low or too high, then the regret guarantee will not be reasonable. However, it is challenging to set the sparsity parameter without having the knowledge of parameter vector itself. We resolve all technical challenges with novel techniques - in particular, the latter two might be of independent interest in other applications.
To resolve (A), we need to optimize for a probability distribution on datapoints (for sampling) with a well-conditioned expected covariance matrix under the constraint that it is uniform on a subset of datapoints and has zero mass on others. The related optimization problem is discrete and non-convex - we describe a convex relaxation of the objective and show a randomized rounding procedure to obtain a feasible solution with good approximation guarantees on the minimum eigenvalue of the expected covariance matrix. Subsequently, we show that subsampling from the recovered distribution in the exploration component ensures that the data covariance matrix satisfies RE with high probability. To resolve (B), we follow two steps 1) We extend guarantees in (Rudelson & Zhou, 2013) (Theorem 23) that demonstrates a variant of restricted isometry - the idea is that the data matrix acts as a near isometry on the image of all sparse vectors under a linear transformation. Such a technique, in turn, allows us to extend RE guarantees for design matrices whose rows are sampled without replacement 2) using the RE condition we derive guarantees for soft sparsity from the results on hard sparse vectors using an iterative approach to ensure robustness ((Boche et al., 2015) Theorem 1.6). For (C), we use a corralling algorithm based on the techniques of (Agarwal et al., 2017) that combines several base algorithms and provides guarantees with respect to the optimal one. However, a naive application leads to an additional corralling cost with a linear dependence on dimension , making the regret vacuous (). We use a covering argument - we only choose a small representative subset (covering set) of base algorithms for input to corralling and show that for all remaining base algorithms, their regret is close enough to a base algorithm in the covering set.
Related Work: Conceptually our work is similar to active learning (Settles, 2009; Lesci & Vlachos, 2024) where unlabeled samples are annotated adaptively, based on the confidence of a trained model (Coleman et al., 2022). Active learning works well with good initialization and informative confidence intervals. However, in our label-scarce setting, active learning is particularly challenging with complex data due to the absence of a reliably trained model in the first place - this is more pronounced for difficult datapoints for which prediction is hard. Active Learning (AL) needs an initial set of high-quality labeled samples to reasonably train a model - also known as the cold-start problem - when labels are scarce, uncertainty based sampling techniques (AL) are unsuitable (Li et al., 2024). Our goal is to identify informative samples with the help of the expert annotator(s), whom the final model aims to emulate. Coreset selection (Guo et al., 2022; Albalak et al., 2024; Sener & Savarese, 2018) aims to select a subset of datapoints for training. However, coreset selection assumes that a large amount of labeled data already exists, and the focus is on reducing computational costs. In contrast, our setting deals with the lack of labeled data, making existing coreset selection approaches, which rely on the entire labeled dataset, inapplicable. Our work also aligns with curriculum learning (Bengio et al., 2009), where a model is trained on samples of increasing difficulty/complexity. Due to the ambiguity in hardness definition, often heuristics are used to infer the difficulty of samples (Soviany et al., 2022) and can turn out unreliable and not generalizable. For niche tasks where an expert annotator is available, the difficulty ratings from the annotator are more informative since the goal is to train a model to mimic the expert. In computer vision, there has been recent work regarding estimating the difficulty of a dataset for the model using implicit difficulty ratings of annotation (Ionescu et al., 2016; Mayo et al., 2023). For NLP tasks, Ethayarajh et al. (2022) constructs information theoretic metrics to estimate the difficulty of data points.
Notation: We denote vectors by bold small letters (say x), scalars by plain letters (say or ), sets by curly capital letters (say ) and matrices by bold capital letters (say ). We use to denote the set , to denote the -norm of vector x. For a set of indices is used to denote the sub-vector of restricted to the indices in . denotes the minimum eigenvalue of the matrix and denotes a diagonal matrix with entries as x. We use and to denote the unit ball and unit sphere in dimensions, respectively. We will write to denote the expectation of a random variable . notation hides logarithmic factors in .
1.1 Problem Formulation and Preliminaries
Consider a dataset with a set of unlabeled data-points , each of which will be referred to as an arm in the bandit setup. Let denote the -dimensional vector embedding associated with the data-point (arm). The arm embedding vectors are contained in a ball of radius that is .
We have an annotation budget of rounds. In our annotation-poor regime, we have ; that is, the annotation budget is much smaller than the ambient dimension, which, in turn, is significantly smaller than the number of arms. At each round , an unlabelled data-point (corresponding to the arm which has not been pulled in the first rounds) is selected by the online algorithm (decision-maker) and sent to the expert annotator for labeling. Note that such a selection mechanism respects the blocking constraint, which stipulates that no unlabeled datapoint will be sent for annotation more than once. The expert 111If there are multiple experts, we can consider the final feedback to be aggregated, which does not change., as feedback, labels the datapoint and provides that corresponds to the difficulty experienced in providing the ground truth label for . We model the expected hardness as a linear function of the arm embedding where is an unknown parameter vector. In particular, the random variable is obtained as where is zero-mean i.i.d noise random variable with bounded variance . More precisely,
-
1.
and , where denotes the filtration till round .
-
2.
For any sparsity level we define the tail of the parameter vector as, where denotes the set of largest coordinates of by absolute value and .
Note that in the special case of for some sparsity level , will be referred to as a hard-sparse vector. In our set-up, we account for soft sparsity when the tail is non-zero and unknown - our statistical guarantees degrade gracefully as the tail magnitude increases. Now, we formally define the objective (commonly known as regret) in our online learning set-up, which also respects the blocking constraint. Our regret definition captures the difference in cumulative expected hardness of datapoints selected by the online algorithm versus the cumulative expected hardness of top unique datapoints. Consider a permutation of arms such that for any , we have . We define the regret for our setting as,
(1) |
We aim to design an algorithm that minimizes expected regret where the expectation is over the randomness in the algorithm.
2 Our Algorithm and Main Results
Our main contribution is to propose an Explore-Then-Commit (ETC) algorithm named BSLB which is summarized in Algorithm 1. BSLB takes as input a set of unlabeled datapoints (arms) , the annotation budget (time horizon) , the exploration budget and Subset selection parameter . Steps 2-8 of BSLB correspond to the exploration component in the algorithm. In Step 2, we first compute a good subset of arms (using function ) which comprises of representative arms that cover the -dimensional space reasonably well. Subsequently, in Steps 4-8, we sample arms without replacement from the set of arms for rounds. The goal in the exploration component is to select a subset of arms such that the image of sparse vectors under the linear transformation by the gram matrix of the selected set has a sufficiently large magnitude (see Definition 1). As we prove, such a result ensures nice statistical guarantees of the parameter (difficulty) estimation with the subset labeled (with difficulty scores) at the end of the exploration component. Since the set of arms, can be arbitrary, note that sampling arms uniformly at random from the entire set might not have good coverage - especially when most arms are concentrated in a lower-dimensional subspace. Therefore, finding a good representative subset of arms leads to the following discrete optimization problem
(2) |
The function approximates the solution to this computationally infeasible discrete optimization. We maximize a relaxed concave program in equation 6 efficiently for a chosen input parameter to obtain a distribution on the set of arms - subsequently, we construct the subset using randomized rounding (Step 19) with to obtain a feasible solution to 2.
The second part of BSLB (Steps 9-14) corresponds to the exploitation component of the algorithm. In Step 9, we use the Lasso estimator to get an estimate of the unknown parameter vector . Note that the number of samples used in obtaining the estimate is much smaller than the ambient dimension . Finally, in Steps 11-14, we choose datapoints that are predicted to be hard according to our recovered estimate and submit them for annotation. At every round in BSLB, no arm is pulled more than once, thus respecting the blocking constraint. It is important to note that BSLB is two-shot. That is, we change our data acquisition strategy only once after the exploration component - thus making our algorithm easy to use in practice. Next, we move on to our main theoretical results.
2.1 Offline Lasso Estimator Guarantees With Soft Sparsity and RE condition
To the best of our knowledge, there do not exist in the literature offline guarantees for sparse linear regression that is (A) robust to sparsity modeling assumption and (B) holds only under the mild RE condition on the Gram matrix. Our first theoretical result fills this gap to a certain extent with an upper bound on error rate. We will start by introducing the definition of Restricted Eigenvalue (RE)
Definition 1.
Restricted Eigenvalue (RE): satisfies Restricted Eigenvalue property , if there exists a constant such that for all and ,
Bickel et al. (2009) showed that RE is among the weakest conditions imposed in the literature on the Gram matrix to ensure nice statistical guarantees on the Lasso estimator for sparse linear regression.
Theorem 1.
Let be the data matrix with samples, dimension . Let be the corresponding observations such that , where is a zero-mean random vector with i.i.d. components having bounded variance . Suppose satisfies restricted eigenvalue property (Def. 1) with with constant . Let have a tail at sparsity level that is, for some satisfying , where is the set of largest coordinates by absolute value. An estimate of recovered using Lasso (Line 9 in BSLB), satisfies following with probability
(3) |
Insight 1.
Note that in Equation 3, for a fixed sparsity , the estimator error guarantee decays with datapoints and RE constant while growing linearly with the tail . Existing error guarantees in literature focus only on hard sparse - the data matrix satisfies with constant and (see Theorem 7.13 Wainwright (2019)). However, with moderately stronger assumption of on the data matrix, guarantees of Theorem 1 hold for all . As stated in Theorem 1, for a larger tail with , needs to satisfy on a larger cone of vectors.
Remark 1.
Note that the statistical guarantee presented in Theorem 1 is an offline error rate that is robust to sparsity modeling assumption - similar to Theorem 7.19 in Wainwright (2019) and Theorem 1.6 in Boche et al. (2015). However, the former holds only for the special case when has i.i.d. Gaussian rows, and the latter requires the stronger RIP condition on the data matrix. Our error guarantee is much more general and holds for deterministic data matrices satisfying RE.
Below we derive a corollary for the case when the rows of the design matrix are sampled without replacement from a set whose empirical covariance matrix has a minimum eigenvalue.
Corollary 1.
Let be the data matrix with samples and dimension , whose rows are sampled uniformly without replacement from a set . Let . Consider the same setup for observations as in Theorem 1. Provided , an estimate of recovered using Lasso (Line 9 in BSLB), will satisfy with probability ,
(4) |
Note in particular that we do not have the () assumption on the parameter vector in Corollary 1. Instead, it is replaced by a lower bound on - datapoints sampled without replacement from the set whose gram matrix has a sufficiently large minimum eigenvalue. This is possible because a lower bound on minimum eigenvalue for a positive semi-definite matrix implies a lower bound on RE with arbitrary parameters - concentration guarantees imply that the RE condition remains satisfied when sufficient (yet smaller than ) number of datapoints are sampled from .
2.2 Online Guarantees - Regret Bound for BSLB
Our next result is the expected regret incurred by BSLB in the online setting. The key ingredient in the regret analysis lies in appropriately accounting for the blocking constraint in the exploitation component of BSLB. Below, we present the result detailing the regret guarantees of BSLB when the exploration period is set optimally using a known sparsity level . The proof is deferred to Appendix A.1.2 We invoke Corollary 1 at the end of the exploration component to obtain error guarantees of Lasso, which in turn bounds the maximum regret incurred in each step of the exploitation component. We optimize the exploration/exploitation trade-off to obtain our stated result.
Theorem 2.
(Regret Analysis of BSLB) Consider the -dimensional sparse linear bandits framework with blocking constraint having a set of arms spanning and rounds (). In each round , we choose arm and observe reward where is unknown and is zero-mean independent noise random variable with variance . Suppose has tail magnitude at sparsity level where is the set of largest coordinates by absolute value. Let for the set be as defined in equation 2 and assume that . In this framework, BSLB with exploration period , achieves a regret guarantee
(5) |
Insight 2.
BSLB enables diversity in selected arms by performing Step 2 in Alg. 1 - this step ensures that of the covariance matrix of the subset used in exploration is approximately optimal. The exploration period in Theorem 2 is optimized to maximize annotation of hard samples. However, in practice, the exploration period of BSLB can be increased further if diversity has more importance.
Insight 3.
The runtime of the GetGoodSubset(, ) and the optimization in Step 9 of BSLB (LASSO) is (). However, if the mild assumption not satisfied, then GetGoodSubset(, ) can be replaced with a (modified) Brute Force Algorithm that runs in time (see Appendix A.1.5 for details), and the theorem statement still holds. Note that the stated runtime in the latter part is still significantly lower than the trivial brute force search for the optimal subset having a runtime of . This is possible because, as a result of the approximation guarantees of Theorem 3, we can restrict the size of the subset while performing a brute-force search.
2.3 Subset Selection for Maximizing the Minimum Eigenvalue
Recall that in Step 5 of BSLB; we sample from a carefully chosen subset of arms that has good coverage - more precisely, our goal is to solve the optimization problem in Eq. 2 to obtain a representative set of arms. Although Hao et al. (2020) had a similar objective, the absence of blocking constraint in their framework implied that they could solve for a probability distribution on the set of arms such that the minimum eigenvalue of the expected covariance matrix is maximized. Since their solution space was the probability simplex, the objective was continuous and concave - implying that a solution can be found efficiently. However, in our setting, due to the blocking constraint, we need to identify a subset of representative arms from which to sample uniformly at random without replacement in the exploration component - this leads to the objective in Eq. 2 being discrete and therefore non-convex. Note that a brute force solution to our objective implies a search over all subsets of and will take time . To design an efficient algorithm for obtaining a good feasible solution to the non-convex objective in 2, our first step is to obtain a convex relaxation as described in Eq. 6 - in particular, instead of optimizing over a subset, we optimize over probability distributions over the set of arms such that the probability mass over any arm is bounded from above.
(6) |
Note that denotes the matrix with all arms and is an additional parameter to the relaxed objective. Since the solution to Eq. 6 might not be a feasible one for Eq. 2, we use a randomized rounding (Step 19 in BSLB) procedure to obtain a feasible solution. In the randomized rounding procedure, each datapoint is sampled into our feasible output set (used in the exploration component) independently with probability .
Let with be the optimal subset for which the RHS in Equation 2 is maximized and let be the corresponding objective value (minimum eigenvalue). We present the following theorem on the approximation guarantees of the solution achieved by our procedure GetGoodSubset of Algorithm 1 - the theorem says that the minimum eigenvalue of the gram matrix associated with datapoints in (obtained post randomized rounding procedure) is close to .
Theorem 3.
Let denote the matrix of all arms. Consider the convex optimization of equation 6 solved at . Let be the output of the randomized rounding procedure (Step 18-20 of Alg. 1) and be the minimum eigenvalue of the corresponding covariance matrix that is, . Then under the assumption , we must have with probability .
If assumption of is not satisfied in its place., then we can implement a brute-force search over all subsets whose size is in the range (for some constant ) to maximize the objective in Eq. 2 - note that the time complexity is still polynomial in the number of arms (refer to Appendix A.1.5 for details) which is significantly improved than the trivial brute force algorithm which has a running time of . Note that the modified brute-force algorithm enjoys stronger approximation guarantees, with and has a running time that is still polynomial in the number of arms but exponential in the dimension .
Remark 2.
We want to highlight that several existing techniques in experimental design deal with maximizing objectives such as minimum eigenvalue; however, existing work assumes submodularity or matroid constraints (Allen-Zhu et al., 2021), which the average minimum eigenvalue (normalized with size of the set) in our setting does not satisfy as discussed in Appendix A.1.6.
Proof Outline: We first show in Lemma 4 (using concentration guarantees) that the following two objective values are close namely (A) value of the maximized concave objective with distribution and parameter in Equation 6 (B) objective value of the set (eq. 2) obtained via randomized rounding procedure from at (line 19 in BSLB). Note that the value of the maximized concave objective in Equation 6 with parameter is larger than the value with parameter provided . Therefore, we show our approximation guarantees with respect to objective in Equation 6 with parameter - which in turn also translates into guarantees for the optimal parameter even though is unknown (since ). Finally, given that the concave objective with parameter in Equation 6 is a relaxation of the discrete objective in Equation 2, the objective value of the former is going to be larger than the objective value of the latter. Combining all these key ingredients, we have proved our theorem statement.
2.4 Corralling when optimal sparsity level is not known
Note that for any unknown parameter vector , we can fix the sparsity level and therefore the corresponding tail magnitude - subsequently, we can obtain the guarantees of Theorem 2 by setting the exploration period optimally for the fixed . However, if is set too low, then will be too high, and therefore, the second term in regret (Equation 5) dominates. On the other hand, if is set too high, then is low but the first term in the regret bound dominates. There is a trade-off, and therefore, there is an optimal choice of sparsity and, tail magnitude . Therefore, we propose a meta-algorithm C-BSLB that exploits coralling (Agarwal et al., 2017) multiple versions of the BSLB algorithm 1 with different values of used to set the exploration period - the meta-algorithm gradually learns to choose the best base algorithm. However, naively applying CORRAL with all distinct base algorithms leads to a linear dependence on dimension in the regret making it vacuous. Therefore we carefully choose base algorithms for search within CORRAL with corresponding sparsity parameters set on exponentially spaced points - such a restriction ensures that the overhead in regret is minimal (logarithmic dependence on dimension ). However, we still prove our regret guarantee with respect to the base algorithm with optimal sparsity - although it is not guaranteed that the optimal base algorithm will be in the set of carefully chosen base algorithms provided as input to the meta-algorithm.
Theorem 4.
Consider the -dimensional sparse linear bandits framework with blocking constraint as described in Theorem 2. Let the C-BSLB algorithm (Algorithm 3 in Appendix A.1.8) run with an appropriate learning rate on multiple versions of BSLB, using distinct sparsity parameter taking values in the set . Let the optimal sparsity parameter in Theorem 2 that achieves minimum regret be , and let be the corresponding regret. Then the meta-algorithm C-BSLB achieves the following regret guarantee,
(7) |
Note that the first term in Equation 7 and the multiplicative factor of corresponds to the additional cost in combining the input base algorithms by Algorithm 3. We stress that the dependence on dimension from the additional cost is only logarithmic.
Proof Outline: We use Theorem 5 of (Agarwal et al., 2017) and our regret bound from Theorem 2 to obtain Theorem 4. The key novelty in our proof is to establish the following - when searching with the small curated set of base algorithms in CORRAL, we do not suffer a significant loss in the regret even if the base algorithm with the optimal sparsity parameter does not lie in the curated set. The crux of our proof lies in a covering argument. By using a recursive telescoping argument, we can bound the regret incurred between any base algorithm not used for the search while corraling versus the base algorithm with the closest sparsity parameter used in the search.
3 Experiments
Below, we demonstrate our methods for annotation in a label-scarce setting for image classification on the PASCAL VOC 2012 dataset. Additional experimental results on SST-2 (text dataset) can be found in Appendix A.2.3. Finally, experiments on the MovieLens, Netflix, and GoodBooks datasets in the context of personalized recommendation with few labeled data using our theoretical framework are in Appendix A.2.1. Finally, we provide detailed simulations in Appendix A.2.4.
We consider the setting where we have a total of unlabelled samples (with ) and only datapoints can be annotated (sequentially). For each unlabeled data point sent for annotation to the expert(s), we receive the ground truth label and the difficulty score corresponding to the difficulty in annotating the data point. We showcase the effectiveness of BSLB (Alg. 1) in our experimental set-up with real-world datasets. Given a model to be trained on a downstream task, to benchmark BSLB, we consider the following set of baselines (to compare against) to choose subset of datapoints for annotation and subsequent training of the aforementioned model :
-
1.
Random: Subset of unlabeled datapoints chosen uniformly at random
-
2.
All: All the samples in training data (except the validation fold)
-
3.
AnchorAL (Lesci & Vlachos, 2024): an anchoring based active learning baseline ( samples).
-
4.
SEALS (Coleman et al., 2022): a KNN based sub-sampling active learning baseline ( samples).
(thresholds on difficulty score to determine easy/hard samples) and (exploration rounds) are relevant hyper-parameters specified for the corresponding experiments 222We consider the AL setup initialized with samples and samples queried in a batch.. We benchmark learning performance on 2 datasets: a) hard samples (samples with difficulty ) (hard-valid) b) easy samples (samples with difficulty ratings ) (easy-valid).
AnchorAL and SEALS are state-of-the-art active learning (AL) algorithms. In general, for a label-scarce complex task, AL might not be immediately applicable (see cold-start problem in Li et al. (2024)) - especially for datapoints close to the decision boundary with noisy/less informative confidence intervals. This is because AL requires a reliably trained model on an even smaller subset of labeled datapoints - however, on datapoints far from the decision boundary (easy datapoints), noisy confidence signals are still useful. As we show in our experiment, this intuition holds, and the AL models, along with the random baseline, perform well on the easy-valid dataset. It is worth noting that complex (hard) datapoints often tend to be the main challenge in industrial applications. This is because it is easy to improve performance on easy data (cheaper to obtain) by simply increasing samples during training, but hard datapoints are difficult to generalize on (Pukowski & Lu, 2024).
Image Classification on PASCAL VOC 2012: Our main result is for the image classification task on the public image dataset, PASCAL VOC 2012 (Everingham et al., 2015). The dataset has unique images and comprises segmentations for objects. In addition to the image dataset, we use difficulty scores of annotations from (Ionescu et al., 2016) - the authors have provided the visual search difficulty by measuring the time taken to annotate in a controlled environment. The annotation task here was to identify if an image contains a particular object, e.g. “Does this image contain a car”. The authors derive a difficulty score between and by normalizing the time to annotate.
In our experiment, the goal is to train a learning model for image classification - is a support vector machine (SVM) head attached to a frozen pre-trained vision transformer (ViT) model pre-trained on ImageNet-21k dataset (Wu et al., 2020). We present results on the classification task - given an input image, predict if the image has an object or not. We consider different objects, namely chair, car, bottle, and (bottle or chair). The last object is an OR conjunction of two labels. We consider the thresholds as and since the distribution of the difficulty scores in the dataset is heavy-tailed as shown in Figure 4(b). The (image, question) tuple with difficulty scores in the range are highly noisy and therefore have been excluded. Table 2 contains the hyperparameters , used and the number of samples in the all dataset for the different object classification tasks, along with the size of the validation datasets hard-valid and easy-valid and aggregaged accuracies. Table 3 contains results on the effect of varying .
|
|
AnchorAL | SEALS | Random | All | Our (BSLB) | ||||
---|---|---|---|---|---|---|---|---|---|---|
easy-valid | chair | 94.0 1.67 | 90.6 1.8 | 96.4 1.0 | 96.0 1.1 | 94.6 1.6 | ||||
car | 94.5 1.6 | 94.7 4.0 | 97.72.1 | 98.7 0.1 | 96.5 1.8 | |||||
bottle | 93.02.5 | 92.82.3 | 96.81.1 | 96.81.1 | 94.82.0 | |||||
bottle or chair | 91.51.1 | 92.31.1 | 94.80.97 | 94.62.1 | 91.72.2 | |||||
hard-valid | chair | 69.3 3.1 | 69.6 6.1 | 66.0 3.8 | 71.3 3.2 | 73.3 3.3 | ||||
car | 70.3 4.0 | 70.0 5.7 | 60.0 5.4 | 65.4 4.0 | 74.0 3.4 | |||||
bottle | 63.12.9 | 63.43.4 | 59.74.4 | 64.81.9 | 66.82.6 | |||||
bottle or chair | 67.13.5 | 66.31.4 | 68.04.0 | 72.32.0 | 73.01.7 |
We present our results in Table 1 averaged over validation folds. For this classification task, our method (BSLB) efficiently selects datapoints (to be annotated) compared to baselines with an equal number of samples. Regarding the quality of the final trained model , the learning performance of BSLB on easy-valid is within of that obtained by the baseline random. However, there is an improvement of on the hard validation data hard-valid. When compared to the active learning baselines (AnchorAL and SEALS), BSLB performs better by on easy-valid and by on hard-valid. Finally, when compared to trained on all datapoints (all baseline), which has to more samples, our method does better ( to ) on the hard-valid and does decently on easy-valid ( difference). These results validate our theory - in particular, we find that performance on easy-valid improves if the model is trained on more samples (randomly chosen to improve coverage). However, improving the performance on hard-valid dataset is the main challenge where our simple approach BSLB with theoretical guarantees does reasonably well.
4 Conclusion and Future Work
This work formulates the problem of adaptive annotation with expert-provided difficulty feedback as regret minimization in blocked sparse linear bandits. The goal is to annotate difficult samples to improve generalization on hard datapoints, assuming difficulty ratings are linked to sample features by an unknown parameter. We consider the practical consideration when each sample can be annotated once and only a few rounds of annotations are available. We propose an explore-then-commit algorithm BSLB, a two-shot algorithm for unlabelled data-point selection (for annotation). We show theoretical results establishing the sub-linear regret of the proposed BSLB algorithm - we also ensure that our guarantees work with minimum assumptions on datapoints and are robust to sparsity. Finally, we present a meta-algorithm C-BSLB that corrals BSLB algorithms with different input parameters - the result is a bandit algorithm that suffers the same regret orders as BSLB without the knowledge of the optimal sparsity hyper-parameter. Numerical studies on real-life image and text datasets show the efficacy of our methods in the label-scarce regime. We demonstrate experiments in the appendix for recommendation and simulation setup using our theoretical framework and BSLB.
Future Work: Theoretically speaking, lower bounds on regret guarantees in the sparse linear bandits framework with blocking constraint is an open problem. Algorithmically, it will also be interesting future work to obtain an algorithm that achieves in our bandit setting. Our experimental results are done on classifying state-of-the-art embeddings; however, more extensive experimentation can be considered in future work, especially for creating datasets used for generative tasks.
References
- Agarwal et al. (2017) Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pp. 12–38. PMLR, 2017.
- Albalak et al. (2024) Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024.
- Allen-Zhu et al. (2021) Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal discrete optimization for experimental design: a regret minimization approach. Math. Program., 186(1-2):439–478, March 2021.
- Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48, 2009.
- Bickel et al. (2009) Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705 – 1732, 2009. doi: 10.1214/08-AOS620. URL https://doi.org/10.1214/08-AOS620.
- Boche et al. (2015) Holger Boche, Robert Calderbank, Gitta Kutyniok, and Jan Vybíral. A survey of compressed sensing. In Compressed Sensing and its Applications: MATHEON Workshop 2013, pp. 1–39. Springer, 2015.
- Bresler et al. (2014) Guy Bresler, George H Chen, and Devavrat Shah. A latent source model for online collaborative filtering. Advances in neural information processing systems, 27, 2014.
- Brown et al. (2024) Adam Brown, Aditi Laddha, and Mohit Singh. Maximizing the minimum eigenvalue in constant dimension, 2024. URL https://arxiv.org/abs/2401.14317.
- Chen et al. (2001) Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
- Coleman et al. (2022) Cody Coleman, Edward Chou, Julian Katz-Samuels, Sean Culatana, Peter Bailis, Alexander C. Berg, Robert Nowak, Roshan Sumbaly, Matei Zaharia, and I. Zeki Yalniz. Similarity search for efficient active learning and search of rare concepts. Proceedings of the AAAI Conference on Artificial Intelligence, 36(6):6402–6410, Jun. 2022. doi: 10.1609/aaai.v36i6.20591. URL https://ojs.aaai.org/index.php/AAAI/article/view/20591.
- Ethayarajh et al. (2022) Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. Understanding dataset difficulty with -usable information. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), PMLR, volume 162 of Proceedings of Machine Learning Research, pp. 5988–6008, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/ethayarajh22a.html.
- Everingham et al. (2015) M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.
- Goldberg et al. (2001) Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. Inf. Retr. Boston., 4(2):133–151, 2001.
- Guo et al. (2022) Chengcheng Guo, Bo Zhao, and Yanbing Bai. Deepcore: A comprehensive library for coreset selection in deep learning. In International Conference on Database and Expert Systems Applications, pp. 181–195. Springer, 2022.
- Hao et al. (2020) Botao Hao, Tor Lattimore, and Mengdi Wang. High-dimensional sparse linear bandits. Advances in Neural Information Processing Systems, 33:10753–10763, 2020.
- Harper & Konstan (2015) F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4), dec 2015. ISSN 2160-6455. doi: 10.1145/2827872. URL https://doi.org/10.1145/2827872.
- Hedderich et al. (2021) Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. A survey on recent approaches for natural language processing in low-resource scenarios. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2545–2568, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.201. URL https://aclanthology.org/2021.naacl-main.201.
- Ionescu et al. (2016) Radu Tudor Ionescu, Bogdan Alexe, Marius Leordeanu, Marius Popescu, Dim Papadopoulos, and Vittorio Ferrari. How hard can it be? Estimating the difficulty of visual search in an image. In Proceedings of CVPR, pp. 2157–2166, June 2016.
- Jang et al. (2022) Kyoungseok Jang, Chicheng Zhang, and Kwang-Sung Jun. Popart: Efficient sparse regression and experimental design for optimal sparse linear bandits. Advances in Neural Information Processing Systems, 35:2102–2114, 2022.
- Lesci & Vlachos (2024) Pietro Lesci and Andreas Vlachos. AnchorAL: Computationally efficient active learning for large and imbalanced datasets. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 8445–8464, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.467. URL https://aclanthology.org/2024.naacl-long.467.
- Li et al. (2024) Dongyuan Li, Zhen Wang, Yankai Chen, Renhe Jiang, Weiping Ding, and Manabu Okumura. A survey on deep active learning: Recent advances and new frontiers. IEEE Transactions on Neural Networks and Learning Systems, 2024.
- Li et al. (2023) Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. arXiv preprint arXiv:2308.12032, 2023.
- Maharana et al. (2023) Adyasha Maharana, Prateek Yadav, and Mohit Bansal. D2 pruning: Message passing for balancing diversity and difficulty in data pruning. arXiv preprint arXiv:2310.07931, 2023.
- Mayo et al. (2023) David Mayo, Jesse Cummings, Xinyu Lin, Dan Gutfreund, Boris Katz, and Andrei Barbu. How hard are computer vision datasets? calibrating dataset difficulty to viewing time. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 11008–11036. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/2432e9646556f3a98dc78c1f4a10481b-Paper-Datasets_and_Benchmarks.pdf.
- Mindermann et al. (2022) Sören Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N Gomez, Adrien Morisot, Sebastian Farquhar, et al. Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning, pp. 15630–15649. PMLR, 2022.
- Nguyen et al. (2023) Ngoc Dang Nguyen, Wei Tan, Lan Du, Wray Buntine, Richard Beare, and Changyou Chen. Auc maximization for low-resource named entity recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 13389–13399, 2023.
- Pal et al. (2024) Soumyabrata Pal, Arun Suggala, Karthikeyan Shanmugam, and Prateek Jain. Blocked collaborative bandits: online collaborative filtering with per-item budget constraints. Advances in Neural Information Processing Systems, 36, 2024.
- Pukowski & Lu (2024) Pawel Pukowski and Haiping Lu. Investigating the impact of hard samples on accuracy reveals in-class data imbalance. arXiv preprint arXiv:2409.14401, 2024.
- Rudelson & Zhou (2013) Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 59(6):3434–3447, 2013. doi: 10.1109/TIT.2013.2243201.
- Sener & Savarese (2018) Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1aIuk-RW.
- Settles (2009) Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, 2009.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D13-1170.
- Sorscher et al. (2022) Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35:19523–19536, 2022.
- Soviany et al. (2022) Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum Learning: A Survey. International Journal of Computer Vision, 130(6):1526–1565, June 2022. ISSN 1573-1405. doi: 10.1007/s11263-022-01611-x. URL https://doi.org/10.1007/s11263-022-01611-x.
- Tolstikhin (2017) I. O. Tolstikhin. Concentration inequalities for samples without replacement. Theory of Probability & Its Applications, 61(3):462–481, 2017. doi: 10.1137/S0040585X97T988277. URL https://doi.org/10.1137/S0040585X97T988277.
- Verroios & Garcia-Molina (2015) Vasilis Verroios and Hector Garcia-Molina. Entity resolution with crowd errors. In 2015 IEEE 31st international conference on data engineering, pp. 219–230. IEEE, 2015.
- Vershynin (2018) Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
- Wainwright (2019) Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
- Wu et al. (2020) Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision, 2020.
- Zajac (2017) Zygmunt Zajac. Goodbooks-10k: a new dataset for book recommendations. FastML, 2017.
- Zhang et al. (2018) Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup, Marianna J. Martindale, Paul McNamee, Kevin Duh, and Marine Carpuat. An empirical exploration of curriculum learning for neural machine translation. CoRR, abs/1811.00739, 2018. URL http://arxiv.org/abs/1811.00739.
Appendix A Appendix
The Appendix comprises two sections: Section A.1 details the detailed proofs of the technical results and Section A.2 presents additional numerical results.
A.1 Brief overview and Technical Proofs
A.1.1 Proof of Theorem 1
We defined the restricted eigenvalue of a matrix as,
Definition 2.
Restricted Eigenvalue: If satisfies Restricted Eigenvalue property , then there exists a constant such that for all and ,
This definition implies,
We prove the theorem by using the Basis Pursuit Program which is one of three equivalent formulation for LASSO Chen et al. (2001); Wainwright (2019),
(8) |
Proof.
For the purpose of the proof denote the true parameter by . The following inequality holds for a general basis pursuit program (w/o any assumptions on sparsity) Wainwright (2019),
(9) |
where , , is the tolerance and is the number of samples . For the proof, the tolerance is equal to the regularization parameter .
Let and , where denotes the top- coordinates, the next top coordinates and so on, for this proof.
First note that,
And from the decomposition available in Theorem 1.6 in Boche et al. (2015) we know that
We also the following inequality,
The last inequality holds since is a solution of equation 8 and otherwise would be the solution instead.
Therefore,
Now by the RE condition,
Putting this into equation 9,
We now bound
And the bound on is,
Combining both bounds we find the bound on ,
where we make the following substitution ()
Let ,
This is true when,
Now applying concentration using the properties of a bounded zero-mean noise, w.h.p. and w.h.p. (Denote the union bound on the probability of both these events by for some constant dependent on and other constants),
with a high probability .
We simplify the above expression in terms of the number of samples , sparsity parameter , the tail parameter and the RE constant.
∎
A.1.2 Proof of Corollary 1
The only additional ingredient needed to prove the corollary is the following concentration inequality, which shows that if the RE of the covariance matrix of the set which is used for sampling is bounded from below, then so is the RE of the sampled covariance matrix with high probability. Since in the assumption of the corollary, the minimum eigenvalue of the covariance matrix of the set used for sampling is , and the RE is bounded from below by the minimum eigenvalue, the corollary follows.
However although the below theorem is adapted directly from Rudelson & Zhou (2013); to prove it we need to make changes accordingly for the sampling without replacement case. This introduces an additional multiplicative term in the exponential in the probability but does not change the order with respect to any other variable. We now
Concentration Inequality for RE condition to hold on sample covariance matrix We use the following Theorem from Rudelson & Zhou (2013).
Theorem 5.
Let and . Let be a random vector such that a.s and denote . Let be an matrix, whose rows are sampled without replacement from a set . Let satisfy the condition as in Definition 1. Set . Assume that and . Suppose the sample size satisfies for some absolute constant
Then, with probability at least , condition holds for matrix with
(The inequality is reverse because our definition of Restricted Eigenvalue has the 1/ compared to the definition of Rudelson & Zhou (2013) )
The proof of Theorem 5 is dependent on Theorem 23, which is reproduced below, and Theorem 10 (Reduction Principle), which is in the paper. Out of these two, only Theorem 23 has elements related to the randomness of the design. In the proof of the above theorem, authors use concentration inequality to extend a RIP-like condition on a general cone (rather than sparse vectors). This concentration inequality results from the following theorem: an augmented version of Theorem 22 of Rudelson & Zhou (2013) to the sampling without replacement case. The original proof of Theorem 22 is extremely involved (and mathematically rich). Reproducing the entire proof would have surmounted to reproducing the entire paper. We only highlight the key difference, it is recommended that the reader goes through the proof beforehand/side-by-side.
Theorem 6.
Set , and . Let be a subset of vectors such that , with . satisfies . Rows of are drawn uniformly without replacement from . Set . Assume and . If for some absolute constant ,
then with probability , for all
Proof.
In the proof of Theorem 22 of Rudelson & Zhou (2013), two arguments require the sampling with replacement (i.i.d. samples), namely symmetrization and Talagrands concentration inequality. We use the sampling without a replacement version of McDiarmid’s concentration inequality to obtain comparable bounds. Therefore, to prove this argument, the following two lemmas.
Lemma 1.
(Symmetrization without Replacement)
where are i.i.d. Rademacher random variables and are random variables sampled uniformly without replacement from some set.
Proof.
Let be the random variables sampled uniformly without replacement from set . Consider be an independent sequence of random variables sampled uniformly without replacement from set . Then and are zero mean random variable. Then,
(Since and are independent)
(Symmetric random variables (Vershynin, 2018) since and have same distribution; followed by Triangular Inequality). ∎
Lemma 2.
(Concentration using McDiarmid’s inequality) If a.s.. And suppose , where are sampled uniformly without replacement from some set. If .
Proof.
We prove the result by using McDiarmid’s inequality. First we bound the quantity,
We use the loose version of McDiarmid’s concentration inequality for random variable without replacement with to obtain the result. The condition that needs to be verified is that is symmetric under permutations of the individual . This is obviously true. We next state McDiarmid’s concentration inequality without replacement from Tolstikhin (2017),
Lemma 3.
Suppose , where are sampled uniformly without replacement from some set. Then,
The probability is , which is same as that in the original theorem except an additional which is reflected in the Theorem statement. ∎
Comment on Dudley’s inequality: Theorem 23 also uses Dudley’s inequality, but there the are treated as deterministic and so the proof goes through in our sampled without replacement case as well. ∎
We can now complete the proof by computing the probability of the following event,
has the probability, which completes the proof.
A.1.3 Proof of Theorem 2
Proof.
The regret bound can be proved in 3 steps. First, we decompose the regret, apply Corollary 1, and then optimize the exploration period.
Step 1. Regret Decomposition: Define the maximum reward as, and as the corresponding arm. As shown in the Appendix, the regret can be decomposed as, this step requires care since the regret is with respect to the top- arms. In the exploitation stage, the arms are selected such that the top arms are played according to and are indexed by the permutation . We next bound the regret for the selected arm.
-
1.
If , then the regret for the selected arm is negative.
-
2.
If not i.e., , then there exists an arm index, in the permutation such that is shifted to the left in . This implies that . We decompose the regret for this case with respect to this index and bound the error:
Here is the error of estimation after exploration. We therefore obtain the following,
Using (the number of exploration rounds is sublinear in ) we obtain,
(10) |
Step 2. Fast Sparse Learning: We use Theorem 1, which is proved in the appendix, to obtain an estimation guarantee in terms of the number of exploration rounds. And we now apply the bound from Corollary 1 and obtain the following (Making an assumption similar to Hao et al. (2020) on the exploration rounds ) 333Another way to analyze this would be to assume which would have similar regret guarantees with a slightly better dependence on ( term becomes ) but the results still holds. .
Step 2. Exploration Period Optimization: ( The probability of error terms () are left out in the expression.) equation 5 can then be bounded as,
Setting we obtain the desired result.
∎
We can also obtain a regret bound for the case of hard sparsity which is of the same order as Hao et al. (2020).
A.1.4 Proof of Theorem 3
For vectors and , define .
Then the minimum eigenvalue for (covariance matrix of) a set of vectors is given by, .
Let randomized rounding be run with , and be the sampled set of arms. Of-course need to be equal to . However, we first assume that the denominator of is equal to . We later show that this assumption worsens the approximation guarantees by with high probability. Under this assumption by construction of the randomized rounding procedure, the expected minimum eigenvalue of the sampled set is equal to the minimum eigenvalue corresponding to the solution of the convex optimization problem, since, .
We therefore prove the following result to bound the approximation error between the obtained from the randomized rounding solution and the optimal solution of the convex relaxation from equation 6.
Lemma 4.
Proof.
Using from the preceding paragraph.
By symmetrization,
where are i.i.d. Rademacher random variables (we overload as the index). Now using Dudley’s integral inequality,
where is the constant which satisfies,
Now w.l.o.g.,
Now from the definition of subgaussian norm,
Now,
Now by Chernoff’s bound,
To find
Upper bounding , we obtain the ,
with , ,
Therefore combining,
Using Markov’s inequality and the above lemma,
If,
is true with probability . Then ,
with probability . Similar to the other direction, we obtain the desired result. ∎
Then, we bound the size of the actual sampled set with respect to , at which the convex relaxation is computed. We derive the following result which gives a probability bound on the number of arms sampled.
Lemma 5.
Let be the subset size that the randomized rounding is run with (Line 20 in Alg. 1) and let be the true number of sampled arms. Then the following probability holds,
(12) |
Proof.
We prove the following two tail bounds and then take the union bound over them both,
First the size of the sampled subset is the sum of independent Bernoulli random variables, where each . Using tail bound from Chernoff bound,
Achieves infinum for ,
Using a similar left tail bound,
Achieves infinum for ,
Now for , , and therefore applying the union bound we obtain the required result.
∎
Therefore the above lemma helps us prove the following statement,
The above two lemmas help us prove the approximation error of the randomized rounding with respect to a fixed parameter . However, the approximation errors need to be with respect to the optimal choice of , , which is the size of the optimal subset from equation 2.
We claim the following about the solution of the convex relaxation at and ,
(13) |
The last inequality follows from the fact that lies is in the feasibility set.
Also since the convex relaxation at is more constrained than the one at ,
We can now combine the two lemmas and the equation above to say that for an error and the following holds with high probability,
where is a constant.
Further bounding the lower bound,
which has a positive determinant and can only satisfied when . Since , therefore for this to be true,
If , then we can assume the stronger condition (note that constant from Lemma 4 is ),
The search bound, therefore becomes, , .
This completes the proof of the theorem.
However if ,
Then
and the search bound becomes,
Therefore for , .
Another way to derive a tighter bound on is,
We consider the stronger condition,
Plugging this into ,
From ,
Under a stronger condition,
A.1.5 Brute Force Algorithm for Searching the Optimal Subset
From the previous previous proof we can set , and then search for subsets in the range to obtain a minimum eigenvalue . We obtain the approximation guarantee, w.h.p., since we are only using the approximation guarantee from Lemma 4 and Lemma 5, and not from equation 13 because we are already searching the space . Since the search space is dependent on , the time complexity of the brute force algorithm, Algorithm 2 follows. This time complexity is substanially smaller than the complexity over the search over all arms which is of the order
What if the maximum minimum eigenvalue is not known ? We can use a lower bound on the . This is easy to obtain: Randomly sample subsets of the arm set and compute the objective value in Equation 2 for each subset - the lower bound can be the maximum objective value across the sampled subsets.
What can the practitioner do to select a good subset size empirically? Additionally, if a practitioner wants to test out a particular choice of , the worst-case error can be empirically calculated (the difference between the convex relaxation at and averaged across multiple randomized rounding runs for different values of ). This is possible because GetGoodSubset() can be run offline.
A.1.6 Why Does the Average Minimum Eigenvalue Constraint Not Satisfy Matroid Constraints
Existing work in experiment design work with objective functions which often satisfy the matroid, submodularity or cardinality constraints to perform experiment design (Allen-Zhu et al., 2021; Brown et al., 2024). However we need to optimize the minimum eigenvalue averaged across the subset (because we want to avoid dependence of in the regret term and also use the RE condition). Our objective clearly does not satisfy the cardinality constraint, and the feasible sets don’t satisfy a matroid constraint (since removing a vector might improve the minimum eigenvalue averaged across the set, so there is no clear partitioning/structure to the feasible set). Finally, we tried to prove submodularity but were unable to do so for our objective function especially because the denominator is dependent on the subset size.
A.1.7 Details on corralling
Brief Overview on Corralling: The algorithm CORRAL (Agarwal et al., 2017) is a meta-bandit algorithm that uses online mirror descent and importance sampling to sample different bandit algorithms that receive the reward. Using the rewards updates the probabilities used for sampling. The main objective is to achieve a regret which is as good as if the best base algorithm was run on its own.
To setup the context, we exactly reproduce the following excerpt, definition and theorems have been taken from Agarwal et al. (2017):
For an environment , we define the environment induced by importance weighting, which is the environment that results when importance weighting is applied to the losses provided by environment . More precisely, is defined as follows. On each round ,
1. picks an arbitrary sampling probability and obtains .
2. reveals to the learner and the learner makes a decision .
3. With probability , define and ; with probability , define and to be arbitrary.
4. reveals the loss to the learner, and passes to .
Definition 3.
(Agarwal et al., 2017) For some and non-decreasing function , an algorithm with decision space is called -stable with respect to an environment if its regret under is , and its regret under any environment induced by importance weighting is
where (with as in the definition of above), and all expectations are taken over the randomness of both and the algorithm.
Similar too most reasonable Base Algorithms it can be seen that the BSLB algorithm satisfies is -stable by rescaling the losses.
We
Theorem 7.
(Theorem 4 in (Agarwal et al., 2017)) For any , if base algorithm (with decision space ) is -stable (recall Defn. 3) with respect to an environment , then under the same environment CORRAL satisfies
where all expectations are taken over the randomness of CORRAL Algorithm, the base algorithms, and the environment.
Theorem 8.
(Theorem 5 in (Agarwal et al., 2017)) Under the conditions of Theorem 7, if , then with CORRAL satisfies: .
A.1.8 Proof of Theorem 4
Before presenting the proof, we clarify what we mean by the exponential scale with an example. For dimension , the exponential scale will be , and we initialize a base algorithm each with the exploration period set according to the .
Remark: Also Step 5 and Step 6 in Algorithm 3 needs explanation. Note that the CORRAL algorithm as a whole has to respect the blocking constraint. Even though CORRAL does not require the bandit algorithms to be run independently, we want to avoid changing CORRAL or the base algorithm. Instead we just change the arms that are available to sample by exploiting the fact that our base algorithm is a two step algorithm and each of the steps can be performed offline. For each base algorithm:
-
1.
For the explore phase of the base algorithm we take arms from the intersection of the subset sampled in step 5 and the subset of arms which have not been sampled.
-
2.
For the exploit phase if the chosen base algorithm provides an arm which has already been sampled by CORRAL. Then we provide the feedback corresponding to that arm. And then we pull an arm from the remaining set of arms without replacement.
Note that with this modification, the exploration phase of each of the algorithm runs as if the algorithm was being run independently. Hence the regret bounds for each individual base algorithm still holds. We can now prove Theorem 8.
We now use Theorem 8 with algorithms. However if we simply apply the theorem we can bound with respect to the sparsity parameter which lies on the grid, ,
But the optimal sparsity parameter may not lie on the grid and we need to bound in terms of . To that end we prove the following lemma,
Lemma 6.
Let be the sparsity parameter at which the regret bound of 5 is minimized. And let be the parameter on grid which is closest to in absolute distance. Then the following holds (where is the probability of exploration round succeeding at sparsity level ),
Proof.
Let the bound on the expected regret for sparsity level be given by .
From the statement of the theorem, we assume that for the optimal sparsity parameter , the nearest parameter (on the exponential scale) is . lies in the interval (otherwise would not be the closest parameter on the exponential scale.). Therefore if we were perform a binary search for , we would need at most queries to search for . Let be the mid-points of these queries, where . Now each of them is such that , where .
First consider the case when , then by substituting , in the regret bound of Theorem 2, the following inequalities can be obtained ,
Now for the case, , we substitute for
The probability of success of each of them is and times the probability of error is still .
Now we can take a cascade of products by decomposing using the above inequality in the direction of . (i.e. we can decompose ,
∎
A.2 Additional Numerical Experiments
A.2.1 Personalized recommendation with Single Rating per Item
Next we demonstrate our bandit algorithm on real-world recommendation datasets. However, we construct the recommendation task such that a) each item receives only a single rating from a user and b) we can only use previous recommendations of a user for recommending content. This is in contrast to the standard collaborative filtering setting where an item can where the ratings of the other users is used to recommend content to you. Our setting makes this possible by exploiting the additional information from the embeddings obtained from a pre-trained network for the text (or image) features of the different items. We argue that our setting is be more relevant in recommendation scenarios where privacy is a concern.
We run the corralling algorithm using copies of Algorithm 1 each with different exploration periods, . Each of the instance, we first give random recommendations by sampling uniformly without replacement from a suitably constructed subset to each user. Given the ratings obtained, we estimate the parameter specific to the user using only their recommendations. For the remaining rounds, we give the top recommendations based on the estimated parameter. . To benchmark we run the algorithms independently and also against a random policy which randomly recommends. We next describe the three tasks that we report our results on,
Goodbooks-10k (Book Reviews): We use the Goodbooks-10k for a personalized book recommendation tasks (Zajac, 2017). For each book we use the title and the author to obtain embeddings using the MiniLM-L6-v2 sentence transformer which we use as the feature vectors for the arms. There are books and we consider users which have more than ratings. The ratings are between to . We consider the exploration periods as .
MovieLens (Movie Ratings): MovieLens 100K dataset has ratings (-star) from users on movies (Harper & Konstan, 2015). We obtain the embeddings of the TMDB descriptions of the movies using the same sentence transformer model. For experimentation, we consider the users which have more than ratings and average the results across the users, there are such users and we run the experiment with different seeds for each user. We consider for the different instances in the range .
Jester (Joke Ratings): We use the Jester joke dataset 1 which has ratings on jokes by users (Goldberg et al., 2001). We obtain embedding for the jokes using the same transformer as above. For experimental purposes we filter out users which do not have ratings on all the jokes and are left with users. We run our algorithm with different random seeds for each of the users and report the results averaged across all users. The joke ratings range from to . For different algorithm instances is taken to be



Results: We summarize the cummulative regret from Eq. 1 of the algorithms in Figure 1. We add the random policy as a reference. We see that for the different dataset our the algorithm achieves a sub-linear regret. The reason the cumulative regret is not monotonic is due to the fact that the regret is with respect to the top- arms. It can be seen that our algorithm with Corral achieves a performance close to the performance of the algorithm with the exploration period, out of the .
A.2.2 Hyperparameters for Experiment of Section 3
The hyperparameters are presented in Table 2. Note that the reason for selecting different across different objects was because the validation datasets had to be big enough (so that the variance of accuracy is informative) and different objects had different total number of samples. The number of exploration rounds are set with respect to so that the approximation error after exploration is small enough. The active learning methods are also run with random initialization of explorations and one round of querying with queries. We present the results of a study where we vary only the for the same value of in Table 3 and observe that with decreasing the estimation of difficulty scores deteriorates, and the performance on hard-valid deteriorates. The performance on easy-valid improves since samples are randomly chosen if the estimation is unsuccessful. Note that our method still performs better than AL baselines and random sampling.
|
(hard) | (easy) | All | BSLB | ||||||
---|---|---|---|---|---|---|---|---|---|---|
Num Samples | Averaged Accuracies | Averaged Accuracies | ||||||||
chair | 60 | 80 | 100 | 80 | 960 (10x) | 83.65 | 83.95 | |||
car | 70 | 100 | 90 | 60 | 1227 (13x) | 82.05 | 85.25 | |||
bottle | 70 | 100 | 120 | 60 | 822 (6x) | 80.8 | 80.8 | |||
bottle or chair | 120 | 120 | 140 | 100 | 1807 (13x) | 83.45 | 82.35 |
|
|
AnchorAL | SEALS | Random | All | Our (BSLB) | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
easy-valid | chair | 94.51.0 | 93.21.5 | 95.71.0 | 96.01.4 | 92.31.4 | |||||
car | 95.82.5 | 95.32.0 | 97.31.2 | 98.21.1 | 95.71.7 | ||||||
bottle | 95.00.9 | 95.01.1 | 96.21.9 | 96.71.8 | 96.81.2 | ||||||
hard-valid | chair | 72.03.0 | 71.01.4 | 67.05.0 | 69.44.1 | 73.42.4 | |||||
car | 66.47.9 | 69.25.6 | 52.68.2 | 58.86.8 | 74.02.7 | ||||||
bottle | 61.21.5 | 61.82.0 | 51.22.9 | 52.63.1 | 63.22.9 | ||||||
easy-valid | chair | 94.82.1 | 93.71.7 | 95.52.0 | 95.51.4 | 94.81.8 | |||||
car | 96.21.9 | 95.51.9 | 97.51.2 | 98.31.1 | 97.02.3 | ||||||
bottle | 95.01.2 | 94.81.0 | 97.21.7 | 97.51.7 | 96.01.9 | ||||||
hard-valid | chair | 70.05.5 | 70.03.8 | 68.04.2 | 69.84.0 | 72.41.7 | |||||
car | 65.88.7 | 70.07.5 | 53.25.4 | 60.68.1 | 72.63.9 | ||||||
bottle | 61.61.0 | 61.62.3 | 53.42.6 | 54.02.6 | 62.62.8 | ||||||
easy-valid | chair | 94.81.5 | 94.51.2 | 96.31.4 | 96.71.9 | 94.53.2 | |||||
car | 95.32.3 | 95.81.9 | 97.31.2 | 98.21.1 | 92.011.6 | ||||||
bottle | 95.30.8 | 95.00.5 | 96.21.9 | 96.71.8 | 96.71.2 | ||||||
hard-valid | chair | 69.42.1 | 71.22.4 | 67.64.9 | 69.84.0 | 70.84.5 | |||||
car | 64.29.5 | 67.67.7 | 52.68.2 | 58.86.8 | 70.86.5 | ||||||
bottle | 60.21.9 | 62.02.1 | 51.22.9 | 52.63.1 | 63.04.9 |
A.2.3 Text Classification on SST-2
Next we present a result on text classification task on SST-2 (Socher et al., 2013). However since there were no human difficulty ratings available for this task, we use rarity of text (Zhang et al., 2018) as a heuristic for the difficulty ratings. The learning model is a SVM which classifies sentence embeddings obtained from the MiniLM-L6-v2 transformer. We consider samples and . The normalized rarity ranges from to and we set and .
We observe a similar trend as the previous task where BSLB method performs better than a random subset by and as good as the random-large subset on the hard-valid dataset. There is no regression on the easy-valid. The results on both the validation sets are comparable with mixed dataset which require all the difficulty ratings (which can be computed for the heuristic but not otherwise). BSLB performs better than both active learning methods on both the validation sets by . However, since this is a standard sentiment analysis task, the embeddings are more informative, thereby improving the baseline performance for a random subset.

A.2.4 Simulation Study

Finally, we also run a simulation study to study the efficacy of our BSLB and C-BSLB algorithm and demonstrate how CORRAL can be used to achieve a sub-linear regret without the knowledge of the optimal parameters. We compute the cumulative regret at time compared to the top arms, and unlike the standard bandit setting, in a blocked setting, the cumulative regret need not be monotonic. To highlight how our method exploits the sparsity of the parameter, we also run CORRAL with multiple versions of our algorithm but with a simple linear regression estimator. We simulate the experimental set-up with the following parameters and . At sparsity level , the tail parameter is . The experiment is repeated with different random parameter initialization. We plot the cumulative regret in Fig. 3, for algorithms run with different exploration period and two versions of the CORRAL algorithm. Our C-BSLB performs better than corralling with regularization, showing that our method exploits the sparsity and does not require true knowledge of the hyperparameters. We also benchmark against a random policy and show that our method performs significantly better showing that the upper bound on regret is not vacuous.
A.2.5 Correlation between model difficulty and human annotation difficulty
In Figure 4(a), we show that for the chair images, if a model (from the numerical study of Section 3) is trained on all images, then the fraction of difficult samples at a certain distance from the classifier boundary goes down as the distance from the classifier body increases; especially after a certain distance from the decision boundary.


A.2.6 How does the tail of the parameter matter?
In Figure 5, we investigate the effect of the tail parameter in the performance of BSLB with a fixed exploration period and different sizes of the tail in the same setup as the simulation study of Appendix A.2.4. We observe that as the tail parameter grows, the regret worsens, however we remark that even for a decent , the performance is reasonable.

A.2.7 Convergence of CORRAL parameters in C-BSLB
We plot the convergence of the CORRAL parameters of C-BSLB in Figure 6 for the simulated experiment of Appendix A.1.7. We observe that the probability of sampling the best algorithm () increases with rounds. Note that since the experiments were run on a limited resource machine, we could only do , and for our setup has to be sufficiently low ( in this case). This is not enough for the CORRAL algorithm to truely exploit the best possible algorithm in C-BSLB but as we see in Figure 3, however it still achieves the a decent performance.


