
How Many Validation Labels Do You Need?
Exploring the Design Space of Label-Efficient Model Ranking

Zhengyu Hu1  Jieyu Zhang2  Yue Yu3  Yuchen Zhuang3  Hui Xiong1🖂
1 HKUST (GZ)  2 University of Washington  3 Georgia Institute of Technology
[email protected], [email protected],
{yueyu, yczhuang}@gatech.edu, [email protected]
Abstract

This paper presents LEMR (Label-Efficient Model Ranking) and introduces the MoraBench Benchmark. LEMR is a novel framework that minimizes the need for costly annotations in model selection by strategically annotating instances from an unlabeled validation set. To evaluate LEMR, we leverage the MoraBench Benchmark, a comprehensive collection of model outputs across diverse scenarios. Our extensive evaluation across 23 different NLP tasks spanning semi-supervised learning, weak supervision, and prompt selection demonstrates LEMR’s effectiveness in significantly reducing labeling costs. Key findings highlight the impact of suitable ensemble methods, uncertainty sampling strategies, and model committee selection in enhancing model ranking accuracy. LEMR, supported by the insights from MoraBench, provides a cost-effective and accurate solution for model selection, especially valuable in resource-constrained environments.



1 Introduction

Model selection plays a central role in building robust predictive systems for Natural Language Processing (NLP) Awasthy et al. (2020); Lizotte (2021); Zhang et al. (2022b); Han et al. (2023), and underpins numerous application scenarios including feature engineering Severyn and Moschitti (2013), algorithm selection Yang et al. (2023b), and hyperparameter tuning Liu and Wang (2021). Typically, a standard machine learning pipeline uses a held-out validation set for model selection, which often requires a large amount of labeled data. Under a more practical low-resource setting, however, creating a large validation set is no longer feasible Perez et al. (2021); Bragg et al. (2021) due to the additional annotation cost (Zhang et al., 2023) as well as the reliance on domain expertise (Hu et al., 2023). Resolving this challenge is vital for deploying model selection techniques in real application scenarios.

Facilitating model selection under truly resource-limited scenarios is challenging. Existing approaches often adopt fixed hyperparameters (Liu et al., 2022) or early stopping (Mahsereci et al., 2017; Choi et al., 2022) for model selection, yet these approaches can suffer from training instability under low-resource settings and do not reliably choose better-than-average hyperparameters (Blier and Ollivier, 2018; Perez et al., 2021). Several works (Zhou et al., 2022; Lu et al., 2022) instead focus on unsupervised model selection, creating pseudo-validation sets for ranking different models. Nevertheless, without labeled data, there often exists a significant disparity between the rankings produced by these methods and the true model rankings. In summary, model ranking remains challenging and under-explored in low-resource scenarios.

In this work, we propose LEMR (Label-Efficient Model Ranking), a framework that significantly reduces the need for costly annotations. Our framework operates without presuming the availability of ground-truth clean labels. Instead, we aim to strategically annotate instances from an unlabeled validation set for model ranking. The framework can be divided into four steps. First, an ensemble method with a selected model committee generates pseudo-labels for examples from the validation set, reducing the labeling cost (Step-I in Section 4.1). Subsequently, we address the inherent noise in these pseudo-labels through two strategies: we use uncertainty sampling to acquire ground-truth labels (Step-II in Section 4.2), and then utilize a Z-score mechanism to dynamically adjust the model committee based on these updated labels, further refining the labeling process (Step-III in Section 4.3). Finally, LEMR ranks all models using the refined pseudo-label and ground-truth label sets (Step-IV in Section 4.4). This framework allows us to create a design space for model ranking, facilitating a systematic exploration of the efficacy of different selection metrics and identifying optimal strategies for each stage.

Specifically, we organize our investigation of LEMR by proposing an explicit design space centered around disentangling the following key methodological considerations:

  • Pseudo-label Generation (Section 4.1): How to generate pseudo-labels? We adopt an ensemble method based on our model committee to obtain the pseudo-labels. Two variants, soft ensemble and hard ensemble Krogh and Vedelsby (1994); Hansen and Salamon (1990), are considered for this purpose.

  • Active Label Acquisition (Section 4.2): Which pseudo-labels need to be replaced by ground-truth labels? Given the presence of noise in pseudo-labels, acquiring ground-truth labels is sometimes necessary. We employ uncertainty sampling strategies to identify which pseudo-labels to replace. Our approach includes uncertainty, classification margin, entropy, and random sampling strategies.

  • Model Committee Selection (Section 4.3): How to select a model committee reasonably? Selecting an appropriate model committee is crucial. We propose two methods: Z-score and All-model. The choice between them depends on balancing the desire for precision (favoring the Z-score method) and the need for diversity and comprehensive coverage (favoring the All-model approach).

With our design space, we can organize existing methods and modularly generate a variety of new ones. To evaluate these methods and facilitate future research in model ranking, we introduce MoraBench (Model Ranking Benchmark) in Section 5. It covers 23 tasks across diverse scenarios, including semi-supervised learning (Section 6.1), weak supervision (Section 6.2), and prompt selection (Section 6.3). The experiments on MoraBench lead to the following observations:

  • With a suitable combination of methods within the design space, our framework can dramatically reduce the labeling cost for model selection. For instance, in the semi-supervised learning scenario (AGNews task), labeling just 387 samples suffices for model selection, compared to the conventional need for 2000 samples.

  • In the Pseudo-label Generation Step (Section 4.1), under a limited budget, we find that soft ensemble yields a higher-quality model ranking if the models in the model set perform poorly; otherwise, hard ensemble is a better choice.

  • In the Active Label Acquisition Step (Section 4.2), our findings underline the superiority of uncertainty sampling over random acquisition in all tasks.

  • In the Model Committee Selection Step (Section 4.3), we observe that a high-quality committee crucially influences the quality of model ranking. For this reason, we design a Z-score-based selection method, which outperforms the All-model strategy on all datasets.

2 Related Work

2.1 Pseudo-labeling

Recently, pseudo-labeling has marked significant progress in deep learning, using models to predict labels for unlabeled data samples Lee et al. (2013); Chen et al. (2021); Xu et al. (2023); Yang et al. (2023a); Zhang et al. (2022a). Zhu et al. (2023) explore self-adaptive pseudo-label filtering, aiming to refine the selection process for pseudo-labels to boost learning performance. Another popular technique is ensemble distillation Bachman et al. (2014); Hinton et al. (2015); Hu et al. (2023), which distills the knowledge of an ensemble into a single model.

2.2 Model Selection

Model selection Kohavi (1995); Kayali and Wang (2022); Zhang et al. (2023) refers to determining the best from a set of candidate models based on their performance on a given dataset. Current research in this area encompasses a variety of innovative methodologies, especially in natural language processing Yang et al. (2023b); Han et al. (2023); Du et al. (2021). Lu et al. (2022) leverage entropy statistics to select the best prompt orders for in-context learning. Zhou et al. (2022) propose an unsupervised model selection criterion that encourages consistency but simultaneously penalizes collapse.

3 Preliminaries

In this work, we consider a $C$-way classification task $\mathcal{T}$. For task $\mathcal{T}$, there exist $K$ trained models, denoted as $\mathcal{M}=\{m_{k}\}_{k\in[K]}$. Our objective is to rank these models so that top-ranked models achieve better performance on $\mathcal{T}$. Importantly, we work under the constraint of having no access to the original training data, instead relying on an unlabeled validation set $D_{V}=\{x_{i}\}_{i\in[N]}$, along with a limited annotation budget $B$.

Our primary goal is to optimize the annotation process for the validation set in the context of model selection. To this end, we systematically study the effectiveness of our framework across different selection metrics and determine the optimal methods and timing for its utilization.

Figure 1: The illustration of the overall procedure of LEMR.

4 Methodology

To rank the trained models, we propose a novel framework, LEMR, which comprises four primary steps. Step-I (Pseudo-label generation, Section 4.1): Generate pseudo-labels for the unlabeled validation set based on a model committee selected from the model set $\mathcal{M}$. Step-II (Active label acquisition, Section 4.2): Select samples from the validation set and acquire their ground-truth labels to replace the pseudo-labels. Step-III (Model committee selection, Section 4.3): Select a subset of models based on the updated pseudo-labels to form the model committee that will be used to generate pseudo-labels in the next iteration. After $T$ rounds of iterating over these three steps, we obtain our final pseudo-labels, based on which we perform Step-IV (Model ranking, Section 4.4). These four steps are illustrated in Figure 1, and the pseudocode of LEMR is shown in Appendix A.

4.1 Step-I: Pseudo-label Generation

Our first step is to generate pseudo-labels based on a subset of trained models referred to as the model committee, which will be introduced soon. As the trained models usually have a certain level of capability on the task, it is natural to leverage their ensemble to obtain reasonable pseudo-labels Krogh and Vedelsby (1994); Hansen and Salamon (1990). In particular, we denote $\mathcal{M}_{C}^{(t)}$ as the model committee at the $t$-th iteration, and explore two design choices of pseudo-label generation:

  • Hard ensemble: For $x_{i}\in D_{V}$, hard ensemble uses the average of the one-hot label prediction vectors generated by all models in $\mathcal{M}_{C}^{(t)}$ as its pseudo-label distribution $\hat{y}^{(t)}_{i}$.

  • Soft ensemble: For $x_{i}\in D_{V}$, soft ensemble employs the average of the label probability simplexes generated by all models in $\mathcal{M}_{C}^{(t)}$ as its pseudo-label distribution $\hat{y}^{(t)}_{i}$.

Therefore, at the $t$-th iteration, we generate the pseudo-label for the $i$-th sample via:

$$\hat{y}^{(t)}_{i}\leftarrow g(x_{i},\mathcal{M}_{C}^{(t)}),$$ (1)

where the function $g(\cdot)$ could be either hard or soft ensemble. These pseudo-labels will be used to select high-quality models to form the model committee.
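To make the two ensemble choices concrete, the snippet below gives a minimal NumPy sketch; it assumes each committee model exposes class-probability predictions over the validation set, and the function names are illustrative rather than the authors' released implementation.

```python
import numpy as np

def soft_ensemble(prob_list):
    """Average the class-probability simplexes predicted by the committee.

    prob_list: list of arrays of shape (N, C), one per committee model.
    Returns pseudo-label distributions of shape (N, C).
    """
    return np.mean(np.stack(prob_list, axis=0), axis=0)

def hard_ensemble(prob_list):
    """Average the one-hot argmax predictions of the committee (majority vote)."""
    stacked = np.stack(prob_list, axis=0)               # (K, N, C)
    num_classes = stacked.shape[-1]
    one_hot = np.eye(num_classes)[stacked.argmax(-1)]   # (K, N, C)
    return one_hot.mean(axis=0)

# Toy usage: a 3-model committee, 2 validation samples, 3 classes.
committee_probs = [np.random.dirichlet(np.ones(3), size=2) for _ in range(3)]
print(soft_ensemble(committee_probs))
print(hard_ensemble(committee_probs))
```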

4.2 Step-II: Active Label Acquisition

In the second step of the LEMR framework, we actively acquire labels for a subset of samples from the pseudo-label set. We explore several existing active sampling strategies in the literature:

  • Random: Although random sampling is not an uncertainty sampling strategy, it is a classical acquisition baseline Bergstra and Bengio (2012); Rawat et al. (2021), so we also include it in our framework for reference.

  • Uncertainty Culotta and McCallum (2005): We define one minus the probability of the top predicted class as the uncertainty value of a pseudo-label; pseudo-labels with the highest uncertainty are prioritized for acquisition.

  • Margin Schröder et al. (2022): Here, we target pseudo-labels with the smallest margin between the probabilities of the top two predicted classes.

  • Entropy Holub et al. (2008): This strategy calculates the entropy of each pseudo-label distribution. Since higher entropy indicates greater uncertainty, we prioritize acquiring labels with the highest entropy values.

Utilizing these strategies, we produce a set $S^{(t)}$ of $b$ samples at each iteration $t$:

$$S^{(t)}\leftarrow l(L_{p},b),$$ (2)

where $l(\cdot)$ represents a certain acquiring strategy (Uncertainty, Margin, Entropy, or Random) and $L_{p}$ is the current set of pseudo-labels. We then acquire ground-truth labels for the selected set $S^{(t)}$.

We denote the set consisting of all ground-truth labels we have acquired as $L_{g}$. For each sample in $S^{(t)}$, we add its ground-truth label to $L_{g}$ and remove the corresponding pseudo-label from $L_{p}$. This enhances the reliability of our pseudo-labels and refines subsequent steps, such as model committee selection.
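As an illustration, all four acquisition strategies reduce to simple scores over the pseudo-label distributions; the following is a minimal sketch (function names are ours, not taken from the released code):

```python
import numpy as np

def acquisition_scores(pseudo_probs, strategy="uncertainty"):
    """Higher score = more informative; pseudo_probs has shape (N, C)."""
    sorted_p = np.sort(pseudo_probs, axis=1)[:, ::-1]      # probabilities, descending per row
    if strategy == "uncertainty":
        return 1.0 - sorted_p[:, 0]                        # 1 - top-class probability
    if strategy == "margin":
        return -(sorted_p[:, 0] - sorted_p[:, 1])          # smaller margin -> larger score
    if strategy == "entropy":
        return -np.sum(pseudo_probs * np.log(pseudo_probs + 1e-12), axis=1)
    if strategy == "random":
        return np.random.rand(len(pseudo_probs))
    raise ValueError(f"unknown strategy: {strategy}")

def select_batch(pseudo_probs, b, strategy="uncertainty"):
    """Return the indices of the b samples whose ground-truth labels should be acquired."""
    return np.argsort(-acquisition_scores(pseudo_probs, strategy))[:b]
```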

4.3 Step-III: Model Committee Selection

The process of Model Committee Selection in our LEMR framework is a critical step to ensure the appropriate models are chosen to produce pseudo-labels for the next iteration. In our framework, we explore two distinct methods for model committee selection: Z-score and All-model:

  • All-model: The All-model approach uses every model in the set $\mathcal{M}$ as the model committee. It operates on the principle that an ensemble of diverse models can lead to a more generalized and comprehensive understanding, contributing to the robustness of the generated pseudo-labels.

  • Z-score: The Z-score method assesses a model’s performance relative to the median performance of the entire model set $\mathcal{M}$, aiding in the identification and filtering of outlier models with extremely low performance. It starts by calculating the accuracy $a_{k}$ of the $k$-th model against the latest pseudo-label set $L_{p}$ and ground-truth label set $L_{g}$. Then, we calculate the Z-score for each model. Specifically, the Z-score $z_{k}$ of model $m_{k}$ is determined as follows:

    $$z_{k}\leftarrow\frac{\delta\times(a_{k}-a_{m})}{\text{Median}(\{|a_{k^{\prime}}-a_{m}|:k^{\prime}\in[K]\})},$$ (3)

    where $a_{m}$ is the median of $\{a_{k}\}_{k\in[K]}$. Subsequently, models with a Z-score exceeding a certain threshold $\tau$ are selected for the next iteration’s committee. This ensures that only the most predictive and reliable models contribute to pseudo-label generation.

Therefore, at the end of the $t$-th iteration, we select the model committee for the $(t+1)$-th iteration as:

$$\mathcal{M}_{C}^{(t+1)}\leftarrow s(L_{p},L_{g},\mathcal{M}),$$ (4)

where the function $s(\cdot)$ could be either Z-score or All-model. Notably, with the updates of $L_{p}$ and $L_{g}$, each time we choose the model committee from all models, not from the last model committee. This prevents the early exclusion of potentially valuable models, ensuring a robust and dynamic selection process throughout the iterations.
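A minimal sketch of Eq. (3) is given below; it assumes each model's accuracy against the current pseudo- and ground-truth labels has already been computed, and the values of $\delta$ and $\tau$ shown here are placeholders rather than the paper's settings.

```python
import numpy as np

def z_score_committee(accuracies, delta=0.6745, tau=-3.5):
    """Select committee members whose Z-score (Eq. 3) exceeds the threshold tau.

    accuracies: array of shape (K,), accuracy of each model on L_p and L_g.
    delta, tau: placeholder hyperparameters (not the paper's reported values).
    Returns the indices of the selected models.
    """
    a = np.asarray(accuracies, dtype=float)
    median = np.median(a)
    mad = np.median(np.abs(a - median))      # median absolute deviation, the denominator of Eq. (3)
    if mad == 0:                             # all models tie: keep everyone
        return np.arange(len(a))
    z = delta * (a - median) / mad
    return np.where(z > tau)[0]

# Example: the outlier model with 0.20 accuracy is filtered out of the committee.
print(z_score_committee([0.81, 0.79, 0.83, 0.20, 0.80]))
```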

Method Dataset
Pseudo-label Generation Active Label Acquisition Model Committee Selection IMDB (20) AGNews (40) Amazon Review (250) Yelp Review (250) Yahoo! Answer (500) Avg.
0% 10% 20% 50% 0% 10% 20% 50% 0% 10% 20% 50% 0% 10% 20% 50% 0% 10% 20% 50%
Hard Ensemble Random All-model 0.98 0.97 1.09 0.76 5.38 5.35 5.31 0.69 9.47 9.50 9.48 9.47 14.27 14.27 14.27 0.62 7.11 7.01 6.82 0.93 6.19
Z-score 0.98 0.97 1.09 0.76 5.38 5.35 5.31 0.69 9.47 9.50 9.48 9.47 14.27 14.27 14.27 0.62 7.11 7.01 6.82 0.93 6.19
Uncertainty All-model 0.98 0.84 0.77 0.12 5.38 4.81 0.21 0.01 9.47 9.48 9.54 6.90 14.27 14.27 14.28 0.44 7.11 6.37 1.06 0.02 5.32
Z-score 0.98 0.24 0.00 0.00 5.38 4.41 0.24 0.01 9.47 9.45 9.38 5.66 14.27 14.30 14.28 0.60 7.11 6.36 0.99 0.04 5.16
Margin All-model 0.98 0.84 0.77 0.12 5.38 4.79 0.23 0.01 9.47 9.50 9.56 7.34 14.27 14.27 14.28 0.45 7.11 6.69 1.13 0.03 5.36
Z-score 0.98 0.24 0.00 0.00 5.38 4.57 0.22 0.01 9.47 9.45 9.38 6.52 14.27 14.31 14.26 0.59 7.11 6.56 1.24 0.02 5.23
Entropy All-model 0.98 0.84 0.77 0.12 5.38 4.72 0.20 0.01 9.47 9.50 9.56 4.03 14.27 14.27 14.29 0.45 7.11 6.44 0.87 0.02 5.17
Z-score 0.98 0.24 0.00 0.00 5.38 4.59 0.19 0.01 9.47 9.45 9.43 3.65 14.27 14.31 14.20 0.57 7.11 6.04 0.81 0.02 5.04
Soft Ensemble Random All-model 1.13 1.18 1.03 0.76 5.41 5.35 5.31 0.57 9.45 9.46 9.46 9.46 14.26 14.27 14.27 1.51 7.11 7.01 6.77 0.93 6.24
Z-score 1.13 1.18 1.03 0.76 5.41 5.35 5.31 0.57 9.45 9.46 9.46 9.46 14.26 14.27 14.27 1.51 7.11 7.01 6.77 0.93 6.24
Uncertainty All-model 1.13 0.82 0.63 0.12 5.41 4.70 0.22 0.02 9.45 9.47 9.48 7.78 14.26 14.27 14.28 0.48 7.11 6.42 1.17 0.03 5.36
Z-score 1.13 0.34 0.02 0.00 5.41 4.51 0.23 0.01 9.45 9.45 9.40 7.91 14.26 14.27 14.27 0.59 7.11 6.45 1.24 0.02 5.30
Margin All-model 1.13 0.82 0.63 0.12 5.41 4.82 0.25 0.03 9.45 9.47 9.49 8.09 14.26 14.27 14.28 0.44 7.11 6.50 1.15 0.03 5.39
Z-score 1.13 0.34 0.02 0.00 5.41 4.29 0.21 0.00 9.45 9.45 9.45 8.09 14.26 14.30 14.24 0.64 7.11 6.55 1.12 0.04 5.31
Entropy All-model 1.13 0.82 0.63 0.12 5.41 4.61 0.20 0.03 9.45 9.45 9.49 7.10 14.26 14.27 14.27 0.51 7.11 6.31 0.97 0.00 5.31
Z-score 1.13 0.34 0.02 0.00 5.41 4.59 0.17 0.01 9.45 9.45 9.45 7.15 14.26 14.31 14.28 0.64 7.11 6.30 0.94 0.03 5.25
Table 1: Semi-supervised learning setting: This table illustrates the changes in optimal gap values within our design space. These changes are observed across different budget ratios, specifically at 0%, 10%, 20%, and 50%. The number in brackets after the dataset indicates the number of labels used in model training stage.
Method Dataset
Pseudo-label Generation Active Label Acquisition Model Committee Selection IMDB (100) AGNews (200) Amazon Review (1000) Yelp Review (1000) Yahoo! Answer (2000) Avg.
0% 10% 20% 50% 0% 10% 20% 50% 0% 10% 20% 50% 0% 10% 20% 50% 0% 10% 20% 50%
Hard Ensemble Random All-model 0.96 0.96 0.91 0.86 4.85 4.90 4.85 2.17 7.25 7.24 7.23 7.20 8.64 8.65 8.65 7.33 5.83 5.77 5.71 0.70 5.03
Z-score 0.96 0.96 0.91 0.86 4.85 4.90 4.85 2.17 7.25 7.24 7.23 7.20 8.64 8.65 8.65 7.33 5.83 5.77 5.71 0.70 5.03
Uncertainty All-model 0.96 0.66 0.65 0.06 4.85 4.28 0.05 0.01 7.25 7.17 7.14 2.60 8.64 8.63 8.66 0.36 5.83 5.39 0.70 0.01 3.70
Z-score 0.96 0.67 0.04 0.00 4.85 4.47 0.04 0.00 7.25 7.24 7.03 2.70 8.64 8.71 8.67 0.34 5.83 5.26 0.66 0.01 3.67
Margin All-model 0.96 0.66 0.65 0.06 4.85 4.48 0.07 0.01 7.25 7.19 7.10 2.62 8.64 8.65 8.70 0.39 5.83 5.24 0.87 0.00 3.71
Z-score 0.96 0.67 0.04 0.00 4.85 4.44 0.05 0.00 7.25 7.26 7.17 2.83 8.64 8.69 8.69 0.31 5.83 5.21 0.85 0.00 3.69
Entropy All-model 0.96 0.66 0.65 0.06 4.85 3.92 0.04 0.01 7.25 7.18 7.04 1.86 8.64 8.65 8.67 0.42 5.83 5.40 0.60 0.01 3.63
Z-score 0.96 0.67 0.04 0.00 4.85 3.92 0.04 0.00 7.25 7.26 7.08 2.37 8.64 8.70 8.64 0.34 5.83 5.30 0.60 0.01 3.63
Soft Ensemble Random All-model 0.99 0.94 0.95 0.91 4.88 4.87 4.88 2.29 7.25 7.25 7.25 7.24 8.68 8.69 8.67 8.16 5.83 5.82 5.67 0.66 5.09
Z-score 0.99 0.94 0.95 0.91 4.88 4.87 4.88 2.29 7.25 7.25 7.25 7.24 8.68 8.69 8.67 8.16 5.83 5.82 5.67 0.66 5.09
Uncertainty All-model 0.99 0.68 0.60 0.08 4.88 4.37 0.05 0.01 7.25 7.18 7.12 4.74 8.68 8.66 8.67 0.36 5.83 5.41 0.86 0.01 3.82
Z-score 0.99 0.67 0.16 0.00 4.88 4.39 0.05 0.02 7.25 7.25 7.12 4.72 8.68 8.69 8.70 0.31 5.83 5.42 0.86 0.01 3.80
Margin All-model 0.99 0.68 0.60 0.08 4.88 4.52 0.05 0.00 7.25 7.23 7.16 5.28 8.68 8.67 8.69 0.35 5.83 5.29 0.90 0.00 3.86
Z-score 0.99 0.67 0.16 0.00 4.88 4.47 0.04 0.02 7.25 7.24 7.18 5.11 8.68 8.70 8.73 0.29 5.83 5.29 0.98 0.01 3.83
Entropy All-model 0.99 0.68 0.60 0.08 4.88 4.30 0.05 0.01 7.25 7.18 7.13 4.85 8.68 8.66 8.64 0.41 5.83 5.54 0.61 0.00 3.82
Z-score 0.99 0.67 0.16 0.00 4.88 4.35 0.03 0.02 7.25 7.19 7.15 4.55 8.68 8.72 8.70 0.35 5.83 5.56 0.58 0.00 3.78
Table 2: Semi-supervised learning setting: This table illustrates the changes in optimal gap values within our design space. These changes are observed across different budget ratios, specifically at 0%, 10%, 20%, and 50%. The number in brackets after the dataset indicates the number of labels used in model training stage.
Method Dataset
Pseudo-label Generation Active Label Acquisition Model Committee Selection IMDB AGNews Amazon Review Yelp Review Yahoo! Answer Avg.
20 100 40 200 250 1000 250 1000 500 2000
Hard Ensemble Random All-model 396 399 1442 1321 4230 4511 4363 3740 6865 7806 3507.7
Z-score 396 399 1442 1321 4230 4511 4363 3740 6865 7806 3507.7
Uncertainty All-model 239 277 672 393 3984 3495 3959 3285 3304 3829 2344.1
Z-score 57 97 668 392 3896 3355 3941 3107 3301 3829 2264.7
Margin All-model 239 277 667 396 4057 3385 4137 3349 3336 3819 2366.8
Z-score 57 97 671 391 3914 3369 3954 3124 3326 3879 2278.4
Entropy All-model 239 277 668 387 3969 3586 3902 3382 3194 3813 2342.1
Z-score 57 97 665 393 3881 3318 3919 3202 2959 3906 2240.0
Soft Ensemble Random All-model 396 399 1392 1291 4236 4523 4394 3860 7306 7805 3560.5
Z-score 396 399 1392 1291 4236 4523 4394 3860 7306 7805 3560.5
Uncertainty All-model 219 277 709 395 4078 3546 3964 3369 3342 3950 2385.3
Z-score 89 99 650 395 4026 3486 3961 3247 3320 3970 2324.6
Margin All-model 219 277 692 394 4110 3511 4152 3364 3397 3902 2402.1
Z-score 89 99 669 393 3968 3652 3979 3249 3364 3907 2337.2
Entropy All-model 219 277 683 395 4006 3448 3952 3422 3272 3907 2358.6
Z-score 89 99 673 394 3943 3604 3924 3272 3261 3975 2323.8
Size of Unlabeled Validation Set $D_{V}$ 400 400 2000 2000 5000 5000 5000 5000 10000 10000 -
Table 3: Semi-supervised learning setting: This table illustrates the minimum labeling budget necessary to achieve an optimal gap of zero in our framework. The number under the dataset indicates the number of labels used in model training stage.
Figure 2: Semi-supervised learning setting: This figure illustrates the changes in ranking correction values within our design space. These changes are observed across budget ratios from 0 to 1. The number after the dataset indicates the number of labels under the model training stage.
Method Dataset
Pseudo-label Generation Active Label Acquisition Model Committee Selection Yelp SMS IMDB AGNews Trec Avg.
0% 10% 20% 50% 0% 10% 20% 50% 0% 10% 20% 50% 0% 10% 20% 50% 0% 10% 20% 50%
Hard Ensemble Random All-model 22.27 21.50 20.56 13.50 0.49 0.52 0.39 0.29 14.55 14.12 14.07 11.23 1.76 1.73 1.50 0.22 8.49 8.02 6.91 2.99 8.25
Z-score 22.27 21.50 20.56 13.50 0.49 0.52 0.39 0.29 14.55 14.12 14.07 11.23 1.76 1.73 1.50 0.22 8.49 8.02 6.91 2.99 8.25
Uncertainty All-model 22.27 18.75 14.67 0.04 0.49 0.00 0.00 0.00 14.55 10.74 5.64 1.26 1.76 0.14 0.11 0.00 8.49 5.01 4.18 1.24 5.47
Z-score 22.27 17.64 12.75 0.20 0.49 0.00 0.00 0.00 14.55 11.22 5.43 0.51 1.76 0.13 0.10 0.00 8.49 4.63 2.94 0.52 5.18
Margin All-model 22.27 18.75 14.67 0.04 0.49 0.00 0.00 0.00 14.55 10.74 5.64 1.26 1.76 0.13 0.04 0.00 8.49 5.01 4.15 1.01 5.45
Z-score 22.27 17.64 12.75 0.20 0.49 0.00 0.00 0.00 14.55 11.22 5.43 0.51 1.76 0.13 0.08 0.00 8.49 4.38 2.89 0.32 5.16
Entropy All-model 22.27 18.75 14.67 0.04 0.49 0.00 0.00 0.00 14.55 10.74 5.64 1.26 1.76 0.04 0.12 0.00 8.49 5.62 4.68 1.20 5.52
Z-score 22.27 17.64 12.75 0.20 0.49 0.00 0.00 0.00 14.55 11.22 5.43 0.51 1.76 0.09 0.14 0.00 8.49 4.90 3.17 0.66 5.21
Soft Ensemble Random All-model 22.16 21.17 20.09 13.30 0.49 0.52 0.39 0.29 14.19 13.67 13.54 11.33 1.76 1.74 1.53 0.24 8.86 8.69 7.79 3.61 8.27
Z-score 22.16 21.17 20.09 13.30 0.49 0.52 0.39 0.29 14.19 13.67 13.54 11.33 1.76 1.74 1.53 0.24 8.86 8.69 7.79 3.61 8.27
Uncertainty All-model 22.16 18.24 13.69 0.41 0.49 0.01 0.00 0.00 14.19 10.21 5.53 0.25 1.76 0.14 0.14 0.00 8.86 4.98 4.10 0.87 5.30
Z-score 22.16 16.89 12.96 0.07 0.49 0.01 0.00 0.00 14.19 10.76 6.44 0.63 1.76 0.14 0.14 0.00 8.86 4.77 3.63 0.85 5.24
Margin All-model 22.16 18.24 13.69 0.41 0.49 0.01 0.00 0.00 14.19 10.21 5.53 0.25 1.76 0.05 0.14 0.00 8.86 4.97 3.35 1.18 5.27
Z-score 22.16 16.89 12.96 0.07 0.49 0.01 0.00 0.00 14.19 10.76 6.44 0.63 1.76 0.05 0.14 0.00 8.86 4.77 2.90 0.97 5.20
Entropy All-model 22.16 18.24 13.69 0.41 0.49 0.01 0.00 0.00 14.19 10.21 5.53 0.25 1.76 0.06 0.10 0.00 8.86 6.04 4.50 1.07 5.38
Z-score 22.16 16.89 12.96 0.07 0.49 0.01 0.00 0.00 14.19 10.76 6.44 0.63 1.76 0.10 0.14 0.00 8.86 5.00 3.73 0.83 5.25
Table 4: Weak supervision setting: This table illustrates the changes in optimal gap values within our design space. These changes are observed across different budget ratios, specifically at 0%, 10%, 20%, and 50%.
Figure 3: Weak supervision setting: This figure illustrates the changes in ranking correction values within our design space. These changes are observed across budget ratios from 0 to 1.
Method Dataset
Pseudo-label Generation Active Label Acquisition Model Committee Selection WSC Story CB RTE WiC ANLI1 ANLI2 ANLI3 Avg.
0% 10% 30% 0% 10% 30% 0% 10% 30% 0% 10% 30% 0% 10% 30% 0% 10% 30% 0% 10% 30% 0% 10% 30%
Hard Ensemble Random All Model 1.16 0.95 1.04 0.03 0.02 0.01 2.84 2.67 1.82 0.40 0.38 0.50 1.14 0.86 0.82 0.05 0.06 0.06 0.46 0.48 0.40 0.81 0.80 0.82 0.77
Z-score 1.16 0.95 1.04 0.03 0.02 0.01 2.84 2.67 1.82 0.40 0.38 0.50 1.14 0.86 0.82 0.05 0.06 0.06 0.46 0.48 0.40 0.81 0.80 0.82 0.77
Uncertainty All Model 1.16 0.64 0.03 0.03 0.03 0.01 2.84 1.60 0.40 0.40 0.07 0.40 1.14 1.05 0.04 0.05 0.07 0.46 0.46 0.33 0.44 0.81 0.86 0.85 0.59
Z-score 1.16 0.00 0.03 0.03 0.00 0.00 2.84 0.00 0.00 0.40 0.00 0.00 1.14 1.19 0.17 0.05 0.13 0.57 0.46 0.26 0.45 0.81 0.81 0.83 0.47
Margin All Model 1.16 0.64 0.03 0.03 0.03 0.01 2.84 1.64 0.27 0.40 0.07 0.40 1.14 1.05 0.04 0.05 0.23 0.55 0.46 0.33 0.49 0.81 0.94 0.85 0.60
Z-score 1.16 0.00 0.03 0.03 0.00 0.00 2.84 0.00 0.00 0.40 0.00 0.00 1.14 1.19 0.17 0.05 0.11 0.51 0.46 0.27 0.42 0.81 0.77 0.75 0.46
Entropy All Model 1.16 0.64 0.03 0.03 0.03 0.01 2.84 1.42 0.31 0.40 0.07 0.40 1.14 1.05 0.04 0.05 0.10 0.51 0.46 0.31 0.45 0.81 0.66 0.85 0.57
Z-score 1.16 0.00 0.03 0.03 0.00 0.00 2.84 0.00 0.00 0.40 0.00 0.00 1.14 1.19 0.17 0.05 0.08 0.56 0.46 0.31 0.43 0.81 0.83 0.76 0.47
Soft Ensemble Random All Model 2.12 2.01 1.33 0.04 0.04 0.02 2.84 2.71 1.91 1.40 1.33 1.02 0.19 0.10 0.16 0.20 0.22 0.08 0.55 0.54 0.57 1.18 1.17 1.13 0.95
Z-score 2.12 2.01 1.33 0.04 0.04 0.02 2.84 2.71 1.91 1.40 1.33 1.02 0.19 0.10 0.16 0.20 0.22 0.08 0.55 0.54 0.57 1.18 1.17 1.13 0.95
Uncertainty All Model 2.12 1.10 0.04 0.04 0.01 0.00 2.84 1.29 0.36 1.40 2.64 2.00 0.19 0.52 0.14 0.20 0.22 0.58 0.55 0.44 0.37 1.18 0.99 0.80 0.83
Z-score 2.12 0.00 0.04 0.04 0.00 0.00 2.84 0.00 0.00 1.40 0.00 0.00 0.19 0.77 0.16 0.20 0.20 0.43 0.55 0.47 0.33 1.18 0.88 0.97 0.53
Margin All Model 2.12 1.10 0.04 0.04 0.01 0.00 2.84 1.20 0.04 1.40 2.64 2.00 0.19 0.52 0.14 0.20 0.36 0.55 0.55 0.50 0.31 1.18 1.07 0.90 0.83
Z-score 2.12 0.00 0.04 0.04 0.00 0.00 2.84 0.00 0.00 1.40 0.00 0.00 0.19 0.77 0.16 0.20 0.32 0.56 0.55 0.48 0.41 1.18 1.04 0.91 0.55
Entropy All Model 2.12 1.10 0.04 0.04 0.01 0.00 2.84 1.42 1.07 1.40 2.64 2.00 0.19 0.52 0.14 0.20 0.03 0.30 0.55 0.44 0.59 1.18 0.94 1.02 0.87
Z-score 2.12 0.00 0.04 0.04 0.00 0.00 2.84 0.00 0.00 1.40 0.00 0.00 0.19 0.77 0.16 0.20 0.02 0.27 0.55 0.44 0.27 1.18 0.81 1.07 0.52
Table 5: Prompt selection setting: This table illustrates the changes in optimal gap values within our design space. These changes are observed across different budget ratios, specifically at 0%, 10%, and 30%.
Figure 4: Prompt selection setting: This figure illustrates the changes in ranking correction values within our design space. These changes are observed across budget ratios from 0 to 1. The number after the dataset indicates the number of labels under the semi-supervised learning setting.

4.4 Step-IV: Model Ranking

Step-IV in the LEMR framework is dedicated to ranking the models in the set $\mathcal{M}$. This step utilizes the final pseudo-label set $L_{p}$ and the ground-truth label set $L_{g}$ to evaluate each model’s accuracy. The rank $r_{p}$ is determined as:

$$r_{p}\leftarrow r(L_{p},L_{g},\mathcal{M}),$$ (5)

where $r(\cdot)$ is the ranking function. It ranks the models in $\mathcal{M}$ according to their accuracy on $L_{p}$ and $L_{g}$.
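A hedged sketch of this step: each model is scored against the hardened pseudo-labels and the acquired ground-truth labels, and the models are sorted by that accuracy (illustrative code under our own naming, not the released implementation).

```python
import numpy as np

def rank_models(model_probs, pseudo_probs, acquired):
    """Rank models by accuracy against pseudo- and ground-truth labels.

    model_probs: array (K, N, C), each model's class probabilities on the validation set.
    pseudo_probs: array (N, C), final pseudo-label distributions (L_p).
    acquired: dict {sample index: ground-truth class} for the acquired set L_g.
    Returns model indices sorted from best to worst.
    """
    targets = pseudo_probs.argmax(axis=1)            # harden the pseudo-labels
    for i, y in acquired.items():                    # overwrite with acquired true labels
        targets[i] = y
    preds = model_probs.argmax(axis=2)               # (K, N) predicted classes
    accuracies = (preds == targets).mean(axis=1)     # (K,) accuracy per model
    return np.argsort(-accuracies)                   # best model first
```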

5 The MoraBench Benchmark

To advance research in model ranking and evaluate various design choices in our LEMR framework, we introduce MoraBench (Model Ranking Benchmark). This benchmark comprises a collection of model outputs generated under diverse scenarios. The description of all model sets within MoraBench and its generation configuration are given in Appendix C. We then perform model selection based on these outputs. MoraBench and its associated resources are available at http://github.com/ppsmk388/MoraBench.

5.1 Evaluation Metrics

We define two metrics used to evaluate the effectiveness of model selection results, namely Optimal Gap and Ranking Correction.

Optimal Gap.

The Optimal Gap is defined as the difference in test accuracy between the best model chosen by the fully-labeled validation set, and the best model identified by the methods to be assessed.

Ranking Correction.

Ranking Correction measures the similarity between the model rankings based on the fully-labeled validation set and those obtained by the methods to be assessed. This similarity is assessed using the Spearman rank correlation coefficient (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html), a common non-parametric method evaluating the monotonic relationship between two ranked variables.
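Both metrics are straightforward to compute from the models' test accuracies and the two rankings; the following is a minimal sketch, assuming rankings are given as arrays of model indices sorted from best to worst.

```python
import numpy as np
from scipy.stats import spearmanr

def optimal_gap(test_acc, full_label_rank, estimated_rank):
    """Test-accuracy gap between the model chosen with the fully-labeled
    validation set and the model chosen by the method being assessed."""
    return test_acc[full_label_rank[0]] - test_acc[estimated_rank[0]]

def ranking_correction(full_label_rank, estimated_rank):
    """Spearman correlation between the positions each model receives in the two rankings."""
    k = len(full_label_rank)
    pos_full = np.empty(k, dtype=int)
    pos_full[full_label_rank] = np.arange(k)
    pos_est = np.empty(k, dtype=int)
    pos_est[estimated_rank] = np.arange(k)
    return spearmanr(pos_full, pos_est).correlation

# Toy example with four models.
test_acc = np.array([0.82, 0.78, 0.85, 0.90])
full_label_rank = np.argsort(-test_acc)       # ranking given the fully-labeled validation set
estimated_rank = np.array([2, 3, 0, 1])       # ranking produced by a label-efficient method
print(optimal_gap(test_acc, full_label_rank, estimated_rank))   # 0.90 - 0.85 = 0.05
print(ranking_correction(full_label_rank, estimated_rank))
```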

6 Experiments

We test our LEMR with MoraBench in detail under three scenarios, i.e., semi-supervised learning (Section 6.1), weak supervision (Section 6.2), and prompt selection (Section 6.3). Corresponding implementation details and design space are described in Appendix B.

6.1 Semi-supervised Learning Setting

Here, we evaluate the LEMR framework under a semi-supervised learning setting. For clarity, we first introduce the concept of “budget ratio”, defined as the proportion of our budget relative to the size of the complete unlabeled validation set. We examine the performance of LEMR at different budget ratios (0%, 10%, 20%, and 50%); the relevant results are detailed in Tables 1 and 2. The impact of varying budget ratios on ranking correction is shown in Figure 2. Additionally, Table 3 highlights the minimum budget needed to achieve an optimal gap of 0. The number in brackets after the dataset indicates the number of labels used in the model training stage. The model set generation setups can be found in Appendix C.1.

From the results, we have the following findings. First, LEMR significantly minimizes labeling costs for model selection. For instance, in the AGNews (200) setting, we only need to label 387 samples to select the same model as labeling the entire validation set of 2000 samples (see Table 3). Second, our results consistently show the superiority of uncertainty sampling over random sampling. Table 3 illustrates that random sampling typically requires a significantly larger budget than uncertainty sampling, a trend that is evident in Tables 1 and 2 as well. Additionally, the curves representing the random strategy in Figure 2 consistently lie below those of the uncertainty sampling strategies. Finally, the model committee selected by Z-score is better than All-model under our design space. For example, in Tables 1 and 2, Z-score has a smaller optimal gap than All-model in all cases.

6.2 Weak Supervision Setting

In this section, we employ WRENCH Zhang et al. (2021b) to evaluate our LEMR framework within a weak supervision setting. We first evaluate LEMR in a low-budget setting. Specifically, we test our framework with budget ratios of 0%, 10%, 20%, and 50%. The corresponding optimal gap values are displayed in Table 4. Additionally, Figure 3 illustrates the variation in ranking correction as the budget ratio increases from 0 to 1. The model set generation setups can be found in Appendix C.2. Some interesting observations follow. First, LEMR, when combined with an appropriate selection of methods, significantly lowers the labeling cost for validation sets: as shown in Table 4, only 10% of the labeling cost suffices to select the same model that would be chosen with a fully labeled validation set. Second, compared to random sampling, uncertainty sampling strategies consistently exhibit superior performance. This is evident in Table 4, where the optimal gap for random sampling is the highest across all budgets. Moreover, in Figure 3 the random strategy’s curve lies below all uncertainty sampling strategies, which further supports our conclusion. Finally, adopting the Z-score method generally reduces labeling costs, as evidenced by the lower optimal gap values in Table 4. This suggests that removing noisy models helps to reduce the labeling cost.

6.3 Prompt Selection Setting

In this section, we employ the T0 benchmark Sanh et al. (2022) to test LEMR under the prompt selection task. With a large language model, denoted as $M$, and a set of prompts $\{p_{k}\}_{k\in[K]}$, we can treat each $M(p_{k})$ as a model $m_{k}$ and follow Step-I (Section 4.1) through Step-IV (Section 4.4) to obtain the model ranking. The experimental results, including the optimal gap for budget ratios of 0%, 10%, and 30%, are summarized in Table 5. Additionally, Figure 4 visually represents the changes in ranking correction as budget ratios vary from 0 to 1. The setups for model set generation can be found in Appendix C.3.

First, in Figure 4, we find that under a limited budget, soft ensemble yields a higher-quality model ranking if the models in the model set perform poorly, whereas hard ensemble is the superior choice otherwise. For example, in the low-budget case, hard ensemble is a better choice on RTE, Story, and WSC, where models generally perform better, while on WiC, ANLI1, ANLI2, and ANLI3, where models perform worse, soft ensemble works better. A similar situation can be found for the Yelp (250), Amazon (100), Amazon (250), Yahoo (500), and Yahoo (2000) datasets in the semi-supervised setting (Figure 2), as well as for the AGNews and Trec datasets in the weakly supervised setting (Figure 3). An intuitive explanation is that when the models in the model set perform poorly, soft ensemble can utilize all of the models’ uncertainty information about the data, while hard ensemble may rely too much on some wrong predictions, so soft ensemble is more suitable in this case. When the models in the model set perform relatively well, hard ensemble can filter out the noisy information, which is more conducive to obtaining a high-quality ranking. When the models in the model set all perform exceptionally well (the SMS task of the weak supervision setting in Figure 3) or when the model predictions in the model set are relatively consistent (the CB task of prompt selection), the results of the hard ensemble and the soft ensemble remain consistent. Moreover, our framework exhibits a substantial reduction in the labeling costs for validation sets. For example, as demonstrated in Table 5, for the SMS task, achieving an optimal gap value of 0 necessitates only a 10% budget ratio. Besides, we find that random sampling yields the largest optimal gap regardless of the budget ratio in Table 5. Lastly, we observe that using the Z-score reduces the budget required for all tasks: on average, the Z-score method yields a lower optimal gap value in Table 5. This suggests that a high-quality committee can generate a high-quality model ranking.

7 Conclusion

In this paper, we introduce LEMR, a novel framework that significantly reduces labeling costs for model selection tasks, particularly under resource-limited settings. To evaluate LEMR, we propose the MoraBench Benchmark, a comprehensive collection of model outputs across diverse scenarios. Demonstrated across 23 tasks, including semi-supervised learning, weak supervision, and prompt selection, LEMR significantly reduces validation labeling costs without compromising accuracy. Key results show that, in certain tasks, the required labeling effort is reduced to below 10% of that needed for a fully labeled dataset. Our findings emphasize the importance of selecting suitable ensemble methods based on model performance, the superiority of uncertainty sampling over random strategies, and the importance of selecting suitable models to compose the model committee.

References

  • Yelp dataset: http://www.yelp.com/dataset_challenge.
  • Almeida et al. (2011) Tiago A Almeida, José María G Hidalgo, and Akebo Yamakami. 2011. Contributions to the study of sms spam filtering: new collection and results. In DocEng, pages 259–262.
  • Awasthi et al. (2020) Abhijeet Awasthi, Sabyasachi Ghosh, Rasna Goyal, and Sunita Sarawagi. 2020. Learning from rules generalizing labeled exemplars. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Awasthy et al. (2020) Parul Awasthy, Bishwaranjan Bhattacharjee, John Kender, and Radu Florian. 2020. Predictive model selection for transfer learning in sequence labeling tasks. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 113–118, Online. Association for Computational Linguistics.
  • Bach et al. (2017) Stephen H. Bach, Bryan Dawei He, Alexander Ratner, and Christopher Ré. 2017. Learning the structure of generative models without labeled data. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 273–282. PMLR.
  • Bachman et al. (2014) Philip Bachman, Ouais Alsharif, and Doina Precup. 2014. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3365–3373.
  • Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of machine learning research, 13(2).
  • Berthelot et al. (2019a) David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. 2019a. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. ArXiv preprint, abs/1911.09785.
  • Berthelot et al. (2019b) David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. 2019b. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 5050–5060.
  • Berthelot et al. (2022) David Berthelot, Rebecca Roelofs, Kihyuk Sohn, Nicholas Carlini, and Alexey Kurakin. 2022. Adamatch: A unified approach to semi-supervised learning and domain adaptation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Blier and Ollivier (2018) Léonard Blier and Yann Ollivier. 2018. The description length of deep learning models. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 2220–2230.
  • Bragg et al. (2021) Jonathan Bragg, Arman Cohan, Kyle Lo, and Iz Beltagy. 2021. FLEX: unifying evaluation for few-shot NLP. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 15787–15800.
  • Chang et al. (2008) Ming-Wei Chang, Lev-Arie Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of semantic representation: Dataless classification. In Aaai, volume 2, pages 830–835.
  • Chen et al. (2021) Yiming Chen, Yan Zhang, Chen Zhang, Grandee Lee, Ran Cheng, and Haizhou Li. 2021. Revisiting self-training for few-shot learning of language model. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9125–9135, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Choi et al. (2022) HongSeok Choi, Dongha Choi, and Hyunju Lee. 2022. Early stopping based on unlabeled samples in text classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 708–718, Dublin, Ireland. Association for Computational Linguistics.
  • Culotta and McCallum (2005) Aron Culotta and Andrew McCallum. 2005. Reducing labeling effort for structured prediction tasks. In AAAI, volume 5, pages 746–751.
  • Dagan et al. (2006) Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • De Marneffe et al. (2019) Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The commitmentbank: Investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, volume 23, pages 107–124.
  • Du et al. (2021) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. All nlp tasks are generation tasks: A general pretraining framework. ArXiv preprint, abs/2103.10360.
  • Fan et al. (2021) Yue Fan, Anna Kukleva, and Bernt Schiele. 2021. Revisiting consistency regularization for semi-supervised learning. In DAGM German Conference on Pattern Recognition, pages 63–78. Springer.
  • Han et al. (2023) Xudong Han, Timothy Baldwin, and Trevor Cohn. 2023. Fair enough: Standardizing evaluation and model selection for fairness research in NLP. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 297–312, Dubrovnik, Croatia. Association for Computational Linguistics.
  • Hansen and Salamon (1990) Lars Kai Hansen and Peter Salamon. 1990. Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence, 12(10):993–1001.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. ArXiv preprint, abs/1503.02531.
  • Holub et al. (2008) Alex Holub, Pietro Perona, and Michael C. Burl. 2008. Entropy-based active learning for object recognition. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8.
  • Hu et al. (2023) Zhengyu Hu, Jieyu Zhang, Haonan Wang, Siwei Liu, and Shangsong Liang. 2023. Leveraging relational graph neural network for transductive model ensemble. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 775–787.
  • Kayali and Wang (2022) Moe Kayali and Chi Wang. 2022. Mining robust default configurations for resource-constrained automl. ArXiv, abs/2202.09927.
  • Kohavi (1995) Ron Kohavi. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence.
  • Krogh and Vedelsby (1994) Anders Krogh and Jesper Vedelsby. 1994. Neural network ensembles, cross validation, and active learning. Advances in neural information processing systems, 7.
  • Lee et al. (2013) Dong-Hyun Lee et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta.
  • Levesque et al. (2012) Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. Citeseer.
  • Li et al. (2021) Junnan Li, Caiming Xiong, and Steven C. H. Hoi. 2021. Comatch: Semi-supervised learning with contrastive graph regularization. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 9455–9464. IEEE.
  • Li and Roth (2002) Xin Li and Dan Roth. 2002. Learning question classifiers. In International Conference on Computational Linguistics.
  • Liu et al. (2022) Haokun Liu, Derek Tam, Muqeeth Mohammed, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In Advances in Neural Information Processing Systems.
  • Liu and Wang (2021) Xueqing Liu and Chi Wang. 2021. An empirical study on hyperparameter optimization for fine-tuning pre-trained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2286–2300, Online. Association for Computational Linguistics.
  • Lizotte (2021) Dan Lizotte. 2021. Model selection. Machine Learning — A Journey to Deep Learning.
  • Lu et al. (2022) Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland. Association for Computational Linguistics.
  • Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
  • Mahsereci et al. (2017) Maren Mahsereci, Lukas Balles, Christoph Lassner, and Philipp Hennig. 2017. Early stopping without a validation set. ArXiv preprint, abs/1703.09580.
  • McAuley and Leskovec (2013) Julian J. McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Seventh ACM Conference on Recommender Systems, RecSys ’13, Hong Kong, China, October 12-16, 2013, pages 165–172. ACM.
  • Miyato et al. (2018) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993.
  • Mostafazadeh et al. (2017) Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. 2017. LSDSem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51, Valencia, Spain. Association for Computational Linguistics.
  • Nie et al. (2020) Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.
  • OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. ArXiv preprint, abs/2303.08774.
  • Penrose (1946) Lionel S Penrose. 1946. The elementary statistics of majority voting. Journal of the Royal Statistical Society, 109(1):53–57.
  • Perez et al. (2021) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 11054–11070.
  • Pilehvar and Camacho-Collados (2019) Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Rasmus et al. (2015) Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. 2015. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3546–3554.
  • Ratner et al. (2017) Alexander J. Ratner, Stephen H. Bach, Henry R. Ehrenberg, and Christopher Ré. 2017. Snorkel: Fast training set generation for information extraction. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1683–1686. ACM.
  • Rawat et al. (2021) Ankit Singh Rawat, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep Jayasumana, Felix X. Yu, Sashank J. Reddi, and Sanjiv Kumar. 2021. Disentangling sampling and labeling bias for learning in large-output spaces. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8890–8901. PMLR.
  • Ren et al. (2020) Wendi Ren, Yinghao Li, Hanting Su, David Kartchner, Cassie Mitchell, and Chao Zhang. 2020. Denoising multi-source weak supervision for neural text classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3739–3754, Online. Association for Computational Linguistics.
  • Sanh et al. (2022) Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. 2022. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Schröder et al. (2022) Christopher Schröder, Andreas Niekler, and Martin Potthast. 2022. Revisiting uncertainty-based query strategies for active learning with transformers. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2194–2203, Dublin, Ireland. Association for Computational Linguistics.
  • Severyn and Moschitti (2013) Aliaksei Severyn and Alessandro Moschitti. 2013. Automatic feature engineering for answer selection and extraction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 458–467, Seattle, Washington, USA. Association for Computational Linguistics.
  • Sohn et al. (2020) Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. 2020. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Tarvainen and Valpola (2017) Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 1195–1204.
  • Wang et al. (2022) Yidong Wang, Hao Chen, Yue Fan, Wang Sun, Ran Tao, Wenxin Hou, Renjie Wang, Linyi Yang, Zhi Zhou, Lan-Zhe Guo, et al. 2022. Usb: A unified semi-supervised learning benchmark for classification. Advances in Neural Information Processing Systems, 35:3938–3961.
  • Xie et al. (2020) Qizhe Xie, Zihang Dai, Eduard H. Hovy, Thang Luong, and Quoc Le. 2020. Unsupervised data augmentation for consistency training. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Xu et al. (2023) Ran Xu, Yue Yu, Hejie Cui, Xuan Kan, Yanqiao Zhu, Joyce Ho, Chao Zhang, and Carl Yang. 2023. Neighborhood-regularized self-training for learning with few labels. ArXiv preprint, abs/2301.03726.
  • Xu et al. (2021) Yi Xu, Lei Shang, Jinxing Ye, Qi Qian, Yu-Feng Li, Baigui Sun, Hao Li, and Rong Jin. 2021. Dash: Semi-supervised learning with dynamic thresholding. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 11525–11536. PMLR.
  • Yang et al. (2023a) Weiyi Yang, Richong Zhang, Junfan Chen, Lihong Wang, and Jaein Kim. 2023a. Prototype-guided pseudo labeling for semi-supervised text classification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16369–16382, Toronto, Canada. Association for Computational Linguistics.
  • Yang et al. (2023b) Yaoqing Yang, Ryan Theisen, Liam Hodgkinson, Joseph E. Gonzalez, Kannan Ramchandran, Charles H. Martin, and Michael W. Mahoney. 2023b. Test accuracy vs. generalization gap: Model selection in nlp without accessing training or testing data. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  • Zhang et al. (2021a) Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. 2021a. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 18408–18419.
  • Zhang et al. (2022a) Jieyu Zhang, Bohan Wang, Xiangchen Song, Yujing Wang, Yaming Yang, Jing Bai, and Alexander Ratner. 2022a. Creating training sets via weak indirect supervision. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Zhang et al. (2021b) Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang, Mao Yang, and Alexander Ratner. 2021b. WRENCH: A comprehensive benchmark for weak supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Zhang et al. (2022b) Shaokun Zhang, Feiran Jia, Chi Wang, and Qingyun Wu. 2022b. Targeted hyperparameter optimization with lexicographic preferences over multiple objectives. In The Eleventh International Conference on Learning Representations.
  • Zhang et al. (2023) Shaokun Zhang, Yiran Wu, Zhonghua Zheng, Qingyun Wu, and Chi Wang. 2023. Hypertime: Hyperparameter optimization for combating temporal distribution shifts. ArXiv preprint, abs/2305.18421.
  • Zhang et al. (2015) Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657.
  • Zheng et al. (2022) Mingkai Zheng, Shan You, Lang Huang, Fei Wang, Chen Qian, and Chang Xu. 2022. Simmatch: Semi-supervised learning with similarity matching. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 14451–14461. IEEE.
  • Zhou et al. (2022) Chunting Zhou, Junxian He, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022. Prompt consistency for zero-shot task generalization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2613–2626, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Zhu et al. (2023) Lei Zhu, Zhanghan Ke, and Rynson Lau. 2023. Towards self-adaptive pseudo-label filtering for semi-supervised learning. ArXiv preprint, abs/2309.09774.

Appendix A Pseudocode of LEMR

The pseudocode of LEMR is shown in Algorithm 1.

Input: model set $\mathcal{M}=\{m_k\}_{k\in[K]}$, unlabeled validation set $D_V=\{x_i\}_{i\in[N]}$, total budget $B$, per-iteration budget $b$.
Output: rank $r_p$ of $\mathcal{M}$.

// Initialization
$\mathcal{M}_C^{(1)} \leftarrow \mathcal{M}$;  $T \leftarrow \lfloor B/b \rfloor$;  $L_g \leftarrow \emptyset$
for $t = 1$ to $T$ do
    // Initialize the set of pseudo-labels
    $L_p \leftarrow \emptyset$
    // Step I: pseudo-labels decided by the model committee
    for each $x_i \in D_V$ do
        $\hat{y}_i^{(t)} \leftarrow g(x_i, \mathcal{M}_C^{(t)})$
        $L_p \leftarrow L_p \cup \{\hat{y}_i^{(t)}\}$
    // Step II: active label acquisition
    $S^{(t)} \leftarrow l(L_p, b)$
    for each $x_j \in S^{(t)}$ do
        $y_j^{(t)} \leftarrow$ ground-truth label of $x_j$
        $L_g \leftarrow L_g \cup \{y_j^{(t)}\}$
        $L_p \leftarrow L_p \setminus \{\hat{y}_j^{(t)}\}$
        $D_V \leftarrow D_V \setminus \{x_j\}$
    // Step III: model committee selection
    $\mathcal{M}_C^{(t+1)} \leftarrow s(L_p, L_g, \mathcal{M})$
end for
// Step IV: model ranking
$r_p \leftarrow r(L_p, L_g, \mathcal{M})$
return $r_p$

Algorithm 1: LEMR
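For concreteness, the following Python sketch mirrors Algorithm 1. The concrete choices of the ensemble function $g$, the acquisition function $l$, the committee-selection function $s$, and the ranking function $r$ (hard majority vote, committee-disagreement acquisition, Z-score filtering, and accuracy-based ranking) are illustrative assumptions standing in for the options in the design space of Appendix B.2, not the exact implementation used in our experiments.

```python
import numpy as np

def lemr(probs, oracle, budget, b):
    """Minimal sketch of Algorithm 1 (LEMR); assumes budget >= b.

    probs  : (K, N, C) array of softmax outputs from K candidate models
             on N unlabeled validation instances with C classes.
    oracle : callable mapping an instance index to its ground-truth label.
    budget : total labeling budget B; b is the per-iteration budget.
    """
    K, N, C = probs.shape
    committee = np.arange(K)            # M_C^(1) <- all models
    labeled = {}                        # L_g: index -> ground-truth label
    unlabeled = set(range(N))           # remaining part of D_V
    preds = probs.argmax(-1)            # (K, N) hard predictions of every model

    for _ in range(budget // b):
        # Step I: pseudo-labels from the current committee (hard majority vote).
        votes = preds[committee]                                   # (|committee|, N)
        pseudo = np.array([np.bincount(votes[:, i], minlength=C).argmax()
                           for i in range(N)])

        # Step II: active label acquisition -- here, the instances on which the
        # committee disagrees most with its own majority vote.
        disagreement = {i: float((votes[:, i] != pseudo[i]).mean()) for i in unlabeled}
        queried = sorted(disagreement, key=disagreement.get, reverse=True)[:b]
        for i in queried:
            labeled[i] = oracle(i)      # ground-truth label replaces the pseudo-label
            unlabeled.discard(i)

        # Step III: model committee selection via a Z-score filter on accuracy
        # against the current mix of ground-truth and pseudo-labels.
        targets = pseudo.copy()
        for i, y in labeled.items():
            targets[i] = y
        acc = (preds == targets).mean(axis=1)                      # per-model accuracy
        z = (acc - acc.mean()) / (acc.std() + 1e-12)
        committee = np.flatnonzero(z > 0)
        if committee.size == 0:
            committee = np.arange(K)

    # Step IV: model ranking (best model first) on the mixed label set.
    return np.argsort(-acc)
```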
Training Setting | Task Type | Dataset | Model/Prompt Number | # Data
Weak Supervision | Sentiment Classification | Yelp | 480 | 3800
Weak Supervision | Sentiment Classification | IMDB | 480 | 2500
Weak Supervision | Spam Classification | SMS | 480 | 500
Weak Supervision | Topic Classification | AGNews | 480 | 12000
Weak Supervision | Question Classification | Trec | 480 | 500
Semi-supervised Learning | Sentiment Classification | IMDB (20 labels) | 400 | 2000
Semi-supervised Learning | Sentiment Classification | IMDB (100 labels) | 400 | 2000
Semi-supervised Learning | Sentiment Classification | Yelp Review (250 labels) | 400 | 25000
Semi-supervised Learning | Sentiment Classification | Yelp Review (1000 labels) | 400 | 25000
Semi-supervised Learning | Sentiment Classification | Amazon Review (250 labels) | 400 | 25000
Semi-supervised Learning | Sentiment Classification | Amazon Review (1000 labels) | 400 | 25000
Semi-supervised Learning | Topic Classification | Yahoo! Answer (500 labels) | 400 | 50000
Semi-supervised Learning | Topic Classification | Yahoo! Answer (2000 labels) | 400 | 50000
Semi-supervised Learning | Topic Classification | AGNews (40 labels) | 400 | 10000
Semi-supervised Learning | Topic Classification | AGNews (200 labels) | 400 | 10000
Prompt Selection | Coreference Resolution | WSC | 10 | 104
Prompt Selection | Word Sense Disambiguation | WiC | 10 | 638
Prompt Selection | Sentence Completion | Story | 6 | 3742
Prompt Selection | Natural Language Inference | CB | 15 | 56
Prompt Selection | Natural Language Inference | RTE | 10 | 277
Prompt Selection | Natural Language Inference | ANLI1 | 15 | 1000
Prompt Selection | Natural Language Inference | ANLI2 | 15 | 1000
Prompt Selection | Natural Language Inference | ANLI3 | 15 | 1200
Table 6: The initial model sets included in MoraBench and the total size of the validation set plus the test set (# Data). The number in parentheses after each semi-supervised learning dataset indicates the number of labels used in the semi-supervised training stage. Descriptions of the datasets and the generation configuration for each model set are given in Appendix E and Appendix C. We plan to add more model sets soon.
Pseudo-label Generation | Active Label Acquisition | Model Committee Selection | Yelp | SMS | IMDB | AGNews | Trec | Avg.
Hard Ensemble | Random | All-model | 522 | 88 | 482 | 1463 | 99 | 530.8
Hard Ensemble | Random | Z-score | 522 | 88 | 482 | 1463 | 99 | 530.8
Hard Ensemble | Uncertainty | All-model | 378 | 7 | 386 | 210 | 59 | 208.0
Hard Ensemble | Uncertainty | Z-score | 340 | 7 | 295 | 211 | 56 | 181.8
Hard Ensemble | Margin | All-model | 378 | 7 | 386 | 211 | 60 | 208.4
Hard Ensemble | Margin | Z-score | 340 | 7 | 295 | 211 | 57 | 182.0
Hard Ensemble | Entropy | All-model | 378 | 7 | 386 | 220 | 59 | 210.0
Hard Ensemble | Entropy | Z-score | 340 | 7 | 295 | 215 | 55 | 182.4
Soft Ensemble | Random | All-model | 529 | 88 | 482 | 1445 | 99 | 528.6
Soft Ensemble | Random | Z-score | 529 | 88 | 482 | 1445 | 99 | 528.6
Soft Ensemble | Uncertainty | All-model | 415 | 8 | 283 | 210 | 66 | 196.4
Soft Ensemble | Uncertainty | Z-score | 379 | 8 | 297 | 210 | 66 | 192.0
Soft Ensemble | Margin | All-model | 415 | 8 | 283 | 219 | 66 | 198.2
Soft Ensemble | Margin | Z-score | 379 | 8 | 297 | 220 | 66 | 194.0
Soft Ensemble | Entropy | All-model | 415 | 8 | 283 | 219 | 67 | 198.4
Soft Ensemble | Entropy | Z-score | 379 | 8 | 297 | 214 | 66 | 192.8
Size of Unlabeled Validation Set $D_V$ | | | 760 | 100 | 500 | 2400 | 100 | -
Table 7: Weak supervision setting: the minimum labeling budget necessary to achieve an optimal gap of zero within our framework.
Pseudo-label Generation | Active Label Acquisition | Model Committee Selection | WSC | Story | CB | RTE | WiC | ANLI1 | ANLI2 | ANLI3 | Avg.
Hard Ensemble | Random | All-model | 19 | 216 | 10 | 11 | 102 | 44 | 194 | 237 | 105.03
Hard Ensemble | Random | Z-score | 19 | 216 | 10 | 11 | 102 | 44 | 194 | 237 | 105.03
Hard Ensemble | Uncertainty | All-model | 3 | 216 | 5 | 4 | 36 | 20 | 166 | 201 | 81.79
Hard Ensemble | Uncertainty | Z-score | 1 | 44 | 1 | 4 | 43 | 12 | 192 | 232 | 67.05
Hard Ensemble | Margin | All-model | 3 | 216 | 4 | 4 | 36 | 6 | 166 | 201 | 79.84
Hard Ensemble | Margin | Z-score | 1 | 44 | 1 | 4 | 43 | 18 | 192 | 232 | 67.74
Hard Ensemble | Entropy | All-model | 3 | 216 | 5 | 4 | 36 | 20 | 168 | 206 | 82.60
Hard Ensemble | Entropy | Z-score | 1 | 44 | 1 | 4 | 43 | 22 | 194 | 228 | 67.66
Soft Ensemble | Random | All-model | 18 | 748 | 10 | 52 | 30 | 57 | 194 | 237 | 168.20
Soft Ensemble | Random | Z-score | 18 | 748 | 10 | 52 | 30 | 57 | 194 | 237 | 168.20
Soft Ensemble | Uncertainty | All-model | 5 | 142 | 6 | 32 | 43 | 184 | 184 | 225 | 102.54
Soft Ensemble | Uncertainty | Z-score | 2 | 59 | 1 | 4 | 50 | 194 | 188 | 225 | 90.67
Soft Ensemble | Margin | All-model | 5 | 142 | 3 | 32 | 43 | 184 | 186 | 228 | 102.34
Soft Ensemble | Margin | Z-score | 2 | 59 | 1 | 4 | 50 | 194 | 186 | 225 | 90.54
Soft Ensemble | Entropy | All-model | 5 | 142 | 6 | 32 | 43 | 12 | 184 | 225 | 81.06
Soft Ensemble | Entropy | Z-score | 2 | 59 | 1 | 4 | 50 | 12 | 188 | 225 | 67.83
Size of Unlabeled Validation Set $D_V$ | | | 20 | 748 | 11 | 55 | 127 | 200 | 200 | 240 | -
Table 8: Prompt selection setting: the minimum labeling budget necessary to achieve an optimal gap of zero within our framework.

Appendix B Experiments Setup

Here, we show the implementation details and design space of our paper.

B.1 Implementation Details

Our experiments run on a high-performance computing setup with an Intel(R) Xeon(R) Platinum 8358P CPU at 2.60GHz, 512GB of memory, and eight NVIDIA A40 GPUs (48GB each). For model set generation (detailed in Appendix C), models are evaluated on the validation and test datasets at regular intervals during training, and all outputs are saved. These outputs are then split with a 2:8 ratio into validation and test sets for model selection. This process is repeated over 50 different random splits and the results are averaged, providing a reliable and consistent basis for our model selection analysis.
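A minimal sketch of this split-and-average protocol is shown below; the 2:8 ratio and the 50 repetitions come from the text, while the array shapes and the `rank_models` callable are hypothetical placeholders.

```python
import numpy as np

def evaluate_over_splits(outputs, labels, rank_models,
                         n_splits=50, val_ratio=0.2, seed=0):
    """Repeatedly split the saved model outputs 2:8 into validation/test
    and average the model-selection metric over the splits.

    outputs     : (K, N, C) saved model outputs on all N held-out instances.
    labels      : (N,) ground-truth labels.
    rank_models : hypothetical callable implementing a model-selection method;
                  it receives the validation and test slices and returns a
                  scalar metric (e.g., the optimal gap).
    """
    rng = np.random.default_rng(seed)
    N = labels.shape[0]
    n_val = int(val_ratio * N)
    metrics = []
    for _ in range(n_splits):
        perm = rng.permutation(N)
        val_idx, test_idx = perm[:n_val], perm[n_val:]
        metrics.append(rank_models(outputs[:, val_idx], labels[val_idx],
                                   outputs[:, test_idx], labels[test_idx]))
    return float(np.mean(metrics))
```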

B.2 Design Space

Based on Step-I (Section 4.1), Step-II (Section 4.2), and Step-III (Section 4.3), our design space $\mathcal{D}$ can be defined as:

$\mathcal{D} = \{\text{Hard ensemble, Soft ensemble}\} \times \{\text{Uncertainty, Margin, Entropy, Random}\} \times \{\text{Z-score, All-model}\}.$  (6)

Therefore, there are a total of $2 \times 4 \times 2 = 16$ method combinations within our framework.
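As a quick sanity check on this count, the design space can be enumerated directly; the option names below are simply labels for the sets in Equation (6).

```python
from itertools import product

ensembles    = ["hard", "soft"]
acquisitions = ["uncertainty", "margin", "entropy", "random"]
committees   = ["z-score", "all-model"]

# Cartesian product of the three design dimensions.
design_space = list(product(ensembles, acquisitions, committees))
print(len(design_space))  # 2 * 4 * 2 = 16 method combinations
```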

Appendix C Model Set Generation Setups

The statistics of all model sets within MoraBench are shown in Table 6.

C.1 Generation Setups for Semi-supervised Learning Setting

Leveraging the USB benchmark (http://github.com/microsoft/Semi-supervised-learning) Wang et al. (2022), we obtained model outputs from 14 semi-supervised methods across five datasets: IMDB Maas et al. (2011), Amazon Review McAuley and Leskovec (2013), Yelp Review yel, AGNews Zhang et al. (2015), and Yahoo! Answer Chang et al. (2008). More details of these datasets are provided in Appendix E.1.

Specifically, we use 14 common semi-supervised methods: $\Pi$ model Rasmus et al. (2015), Pseudo Labeling Lee et al. (2013), Mean Teacher Tarvainen and Valpola (2017), VAT Miyato et al. (2018), MixMatch Berthelot et al. (2019b), ReMixMatch Berthelot et al. (2019a), UDA Xie et al. (2020), FixMatch Sohn et al. (2020), Dash Xu et al. (2021), CoMatch Li et al. (2021), CRMatch Fan et al. (2021), FlexMatch Zhang et al. (2021a), AdaMatch Berthelot et al. (2022), and SimMatch Zheng et al. (2022) to generate our model sets in the semi-supervised learning setting with the datasets mentioned above. For detailed training configurations, refer to http://github.com/microsoft/Semi-supervised-learning/tree/main/config/usb_nlp. We save each model's outputs every 256 training steps, so each method yields 400 outputs and each dataset yields $400 \times 14 = 5600$ model outputs in total. In this paper, we randomly select 10% of the models from each dataset for model selection; a sketch of this procedure is given below.
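The sketch below illustrates the checkpointing and subsampling procedure; the 256-step interval, the 400 checkpoints per method, and the 10% subsampling rate come from the text, while the function names and the training-loop hook are illustrative assumptions.

```python
import numpy as np

SAVE_EVERY = 256          # training steps between saved outputs
OUTPUTS_PER_METHOD = 400  # checkpoints saved per semi-supervised method
N_METHODS = 14
SUBSAMPLE_RATE = 0.10     # fraction of models kept for model selection

def maybe_save_outputs(step, model, val_loader, test_loader, save_fn):
    """Hypothetical training-loop hook: store validation/test outputs every 256 steps."""
    if step % SAVE_EVERY == 0:
        save_fn(step, model.predict(val_loader), model.predict(test_loader))

def subsample_model_set(n_total=OUTPUTS_PER_METHOD * N_METHODS,
                        rate=SUBSAMPLE_RATE, seed=0):
    """Randomly keep 10% of the 5,600 saved outputs per dataset."""
    rng = np.random.default_rng(seed)
    keep = rng.choice(n_total, size=int(rate * n_total), replace=False)
    return np.sort(keep)
```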

C.2 Generation Setups for Weak Supervision Setting

Utilizing the WRENCH framework (http://github.com/JieyuZ2/wrench) Zhang et al. (2021b), we generated model outputs under a weak supervision setting. We generate model outputs across 48 distinct weak supervision configurations on five datasets: SMS Almeida et al. (2011), AGNews Zhang et al. (2015), Yelp Zhang et al. (2015), IMDB Maas et al. (2011), and Trec Li and Roth (2002). Specifics on the datasets are given in Appendix E.1.

Specifically, we follow the WRENCH training configuration to train the models in the model set, varying the label models, label types, model backbones, and learning rates.

Label Models: Snorkel Ratner et al. (2017), majority voting, weighted majority voting Penrose (1946), and a generative model Bach et al. (2017), each offering a different approach to producing weak labels.

Label Types: Utilization of both soft and hard labels for pseudo-label generation.

Model Backbones: Adoption of bert-base and roberta-base backbones, known for their efficacy in natural language processing.

Learning Rates: Training across three learning rates ($10^{-1}$, $10^{-3}$, and $10^{-5}$) to generate models for the model set.

For the detailed configuration, refer to the WRENCH repository (http://github.com/JieyuZ2/wrench/tree/main). This setup is designed to test model selection methods extensively by covering a comprehensive and diverse set of model generation configurations; a sketch enumerating the resulting 48 configurations is given below.
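The 48 configurations correspond to the Cartesian product of the four dimensions above (4 label models, 2 label types, 2 backbones, 3 learning rates); the sketch below enumerates them, with string identifiers chosen purely for illustration.

```python
from itertools import product

label_models   = ["snorkel", "majority_vote", "weighted_majority_vote", "generative_model"]
label_types    = ["hard", "soft"]
backbones      = ["bert-base", "roberta-base"]
learning_rates = [1e-1, 1e-3, 1e-5]

# Cartesian product over the four weak-supervision training dimensions.
configs = list(product(label_models, label_types, backbones, learning_rates))
assert len(configs) == 48  # 4 * 2 * 2 * 3
```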

C.3 Generation Setups for Prompt Selection Setting

We employed large language models such as GPT-4 OpenAI (2023) with various prompts to generate diverse outputs, assessed using the T0 benchmark (http://github.com/bigscience-workshop/T0) Sanh et al. (2022). This process covers eight tasks; further information is given in Appendix E.2. In particular, we adopt the T0 benchmark with eight different datasets. The prompts used for prompt selection all come from PromptSource (http://github.com/bigscience-workshop/promptsource).

Appendix D Optimal Gap with Different Budget Ratio

Our analysis, illustrated in Figures 5, 6, and 7, explores the optimal gap under budget ratios ranging from 0 to 1. Across diverse scenarios, this investigation establishes a key insight: fully labeling the validation set is wasteful, since the entire validation set does not need to be labeled for model selection. This finding further demonstrates the value of LEMR, highlighting its ability to reduce labeling cost while maintaining high model selection performance.

Appendix E Datasets Details

E.1 Model Selection Datasets

SMS Almeida et al. (2011). This dataset contains 4,571 text messages labeled as spam/not-spam, of which 500 are held out for validation and 2,719 for testing. The labeling functions are generated manually by Awasthi et al. (2020), including 16 keyword-based and 57 regular-expression-based rules.

AGNews Zhang et al. (2015). This dataset is a collection of more than one million news articles. It is constructed by Ren et al. (2020) by choosing the 4 largest topic classes from the original corpus. The training set contains 96K samples, and the validation and test sets each contain 12K samples. The labeling functions are also generated by Ren et al. (2020), including 9 keyword-based rules.

Yelp Zhang et al. (2015). This dataset is a subset of Yelp's businesses, reviews, and user data for binary sentiment classification. It is constructed by Ren et al. (2020) and includes 30.4K training samples, 3.8K validation samples, and 3.8K test samples. The labeling functions are also generated by Ren et al. (2020), including 7 heuristic rules on keywords and one third-party model on sentiment polarity.

IMDB Maas et al. (2011). This is a binary sentiment classification dataset containing 20,000 highly polar movie reviews for training, 2,500 for validation, and 2,500 for testing. It is constructed by Ren et al. (2020). The labeling functions are also generated by Ren et al. (2020), including 4 heuristic rules on keywords and 1 heuristic rule on expressions.

Amazon Review McAuley and Leskovec (2013). This is a sentiment classification dataset with 5 classes (scores). Each class (score) contains 600,000 training samples and 130,000 test samples. For USB, we draw 50,000 samples and 5,000 samples per class from the training samples to form the training and validation datasets, respectively. The test dataset is unchanged.

Yelp Review yel. This sentiment classification dataset has 5 classes (scores). Each class (score) contains 130,000 training samples and 10,000 test samples. For USB, we draw 50,000 samples and 5,000 samples per class from the training samples to form the training and validation datasets, respectively. The test dataset is unchanged.

Trec Li and Roth (2002). This dataset contains 4,965 labeled questions in the training set, 500 in the validation set, and another 500 in the test set. It has 6 classes. The labeling functions are generated by Awasthi et al. (2020), including 68 keyword-based rules.

Yahoo! Answer Chang et al. (2008). This dataset has 10 categories. Each class contains 140,000 training samples and 6,000 test samples. For USB, we draw 50,000 samples and 5,000 samples per class from the training samples to form the training and validation datasets, respectively. The test dataset is unchanged.

E.2 Prompt Selection Datasets

We follow the T0 benchmark (Sanh et al., 2022). Specifically, the test tasks include natural language inference (RTE (Dagan et al., 2006), CB (De Marneffe et al., 2019), ANLI/R1-R3 (Nie et al., 2020)), sentence completion (StoryCloze (Mostafazadeh et al., 2017)), word sense disambiguation (WiC (Pilehvar and Camacho-Collados, 2019)), and coreference resolution (WSC (Levesque et al., 2012)).

Appendix F Minimum Budget to Achieve an Optimal Gap of Zero in Other Cases

We further explore the minimal budget necessary to achieve a zero optimal gap in the weak supervision and prompt selection settings, with findings presented in Table 7 and Table 8. The conclusions are consistent with those in the main text.

First, our framework, combined with an appropriate choice of methods, significantly lowers the labeling cost for validation sets. As seen in Table 7, for the AGNews task only 210 samples need labeling, as opposed to labeling all 2,400 samples of the validation set. This efficiency is further evidenced in the Story task, where selecting the optimal model requires labeling a mere 44 samples instead of the full 748, as shown in Table 8.

Second, uncertainty-based sampling strategies are far more label-efficient than random sampling. This is evident in Table 7 and Table 8, where uncertainty sampling requires a smaller budget on nearly all tasks.

Finally, adopting the Z-score method generally reduces labeling costs. Table 7 shows that the Z-score method requires a smaller budget than the All-model approach to select an equivalent model. The same trend holds in Table 8, where the Z-score variant reaches an optimal gap of zero with a smaller budget than its All-model counterpart.

Appendix G Limitations and Potential Risks

Our evaluations focus primarily on NLP tasks. Although LEMR shows promising results in these areas, its effectiveness and adaptability to other domains, such as computer vision or audio processing, remain to be thoroughly investigated. Different domains may exhibit unique challenges, including higher-dimensional data or different notions of uncertainty, which could affect the performance of our proposed methods. Moreover, the models selected by frameworks like LEMR are often deployed in applications with wide-reaching societal impacts. From enhancing educational tools and healthcare diagnostics to improving environmental monitoring, the potential for positive societal impact is vast. However, careful consideration of the ethical, social, and environmental implications of these applications is essential to ensure that they contribute positively to society.

Figure 5: Weak supervision setting: This figure illustrates the changes in optimal gap values within our design space. These changes are observed across budget ratios from 0 to 1.
Figure 6: Semi-supervised learning setting: This figure illustrates the changes in optimal gap values within our design space. These changes are observed across budget ratios from 0 to 1.
Figure 7: Prompt selection setting: This figure illustrates the changes in optimal gap values within our design space. These changes are observed across budget ratios from 0 to 1.