
STENCIL: Submodular Mutual Information Based Weak Supervision for Cold-Start Active Learning

Nathan Beck    Adithya Iyer    Rishabh Iyer
Abstract

As supervised fine-tuning of pre-trained models within NLP applications increases in popularity, larger corpora of annotated data are required, especially with increasing parameter counts in large language models. Active learning, which attempts to mine and annotate unlabeled instances to improve model performance as quickly as possible, is a common choice for reducing the annotation cost; however, most methods typically ignore class imbalance and either assume access to initial annotated data or require multiple rounds of active learning selection before improving rare classes. We present STENCIL, which utilizes a set of text exemplars and the recently proposed submodular mutual information to select a set of weakly labeled rare-class instances that are then strongly labeled by an annotator. We show that STENCIL improves overall accuracy by 10%-18% and rare-class F-1 score by 17%-40% on multiple text classification datasets over common active learning methods within the class-imbalanced cold-start setting.

Active Learning, Cold-Start, NLP, Weak Supervision, Submodular Information Measures, Submodularity, Efficient Deep Learning

1 Introduction

In recent years, many natural language processing use cases have been driven by the advent of widely available pre-trained models and subsequent fine-tuning (Min et al., 2023). Indeed, this paradigm has become commonplace with the prevalence of large language models, seeing heavy use in cases such as instruction fine-tuning (Zhang et al., 2023; Ouyang et al., 2022). The success of this paradigm for downstream tasks, however, relies heavily on annotated data. To scale to today's pre-eminent models, large corpora of supervised data are needed, resulting in prohibitively large annotation costs. One of the most common choices for reducing annotation cost is the paradigm of active learning, which aims to select the unlabeled instances whose annotation improves model performance as quickly as possible. Indeed, active learning has been used extensively both in classical machine learning and in recent deep learning approaches (Ren et al., 2021; Settles, 2009; Beck et al., 2021).

Despite the successful use of active learning, complications in the downstream learning environment can prohibit the use of many common methods. One complication is the presence of class imbalance, which tends to reduce the efficacy of active learning methods designed for the general case. Another complication is having to deploy such strategies in cold-start scenarios. Indeed, the majority of published active learning methods assume access to an initial annotated set of data. When both complications are combined, active learning methods for dealing with a class-imbalanced cold-start environment are needed.

While the cold-start scenario has been considered for active learning, such methods tend to operate over multiple rounds of selection and retraining (Barata et al., 2021; Brangbour et al., 2022; Ni et al., 2020; Hacohen et al., 2022; Yuan et al., 2020; Jin et al., 2022; Kothawade et al., 2022a), which addresses class imbalance inefficiently. Due to the nature of cold-start scenarios, such gradual build-up is natural as the active learning strategy learns more about the data landscape before capitalizing on selecting rare-class instances. However, such capitalization can be conducted earlier in cases where prior knowledge can be incorporated, which is often the case in many fine-tuning scenarios and can be conducted via weak supervision (Ratner et al., 2017; Maheshwari et al., 2020, 2022; Rauch et al., 2022). For example, using active learning for spam detection with unlabeled spam-imbalanced data can be accelerated if the sketch of a spam instance is known. Such a sketch is also easier to provide than in other tasks (such as drawing exemplar images for vision tasks). Hence, can class-imbalanced cold-start active learning more quickly improve rare-class performance if prior knowledge of the data environment can be provided?

In this work, we present STENCIL – Submodular mutual informaTion based wEak supervisioN for Cold start actIve Learning – which effectively utilizes prior knowledge of the task via a small exemplar set of rare-class instances. STENCIL uses this set to guide the active learning selection by maximizing the Submodular Mutual Information (see Section 2) (Iyer et al., 2021a, b) between the set of selected instances and the exemplar set. As we show in Section 4, STENCIL provides an immediate improvement of 10%-18% overall accuracy and 17%-40% rare-class F-1 score with a single round of selection compared to other methods. STENCIL achieves this performance gain using as few as 15-25 exemplar instances, which are easy to provide in a variety of NLP fine-tuning tasks. Hence, STENCIL offers a sound and complementary strategy for the cold-start round of class-imbalanced active learning, especially for NLP fine-tuning tasks where STENCIL's weak supervision can be used.

2 Preliminaries

First, we briefly introduce submodular functions and Submodular Mutual Information (SMI) (Iyer et al., 2021a, b), which is used as STENCIL's mechanism for guided active learning selection. A set function $F:2^{\mathcal{V}}\rightarrow\mathbb{R}$ over a ground set of instances $\mathcal{V}$ assigns a real-valued score to each possible subset of $\mathcal{V}$. Unfortunately, finding the maximizing subset of a given size of $F$ is NP-hard. Instead, approximation algorithms for finding this subset are used. Notably, a $(1-\frac{1}{e})$-approximate greedy algorithm exists for finding the maximizing cardinality-constrained subset if $F$ is monotone submodular (Nemhauser et al., 1978). Specifically, $F$ is submodular if, for any $A\subseteq B\subseteq\mathcal{V}$ and $a\notin B$, $F(A\cup\{a\})-F(A)\geq F(B\cup\{a\})-F(B)$; additionally, $F$ is monotone if $F(A\cup\{a\})-F(A)\geq 0$ for any $A$ and $a\notin A$. Such a paradigm provides a useful mechanism for selecting sets under an annotation budget within active learning by using desirable submodular functions.
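As a quick illustration of the diminishing-returns condition, the toy Python snippet below instantiates the facility location function $F(A)=\sum_{i\in\mathcal{V}}\max_{j\in A}s_{ij}$ (used later in Table 1) on a random similarity matrix of our own choosing and checks the inequality numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.random((6, 6))  # toy pairwise similarities s_ij over a 6-element ground set

def facility_location(A):
    """F(A) = sum_i max_{j in A} s_ij, with F(empty set) = 0."""
    if not A:
        return 0.0
    return S[:, sorted(A)].max(axis=1).sum()

A, B, a = {0}, {0, 1, 2}, 4  # A is a subset of B, and a is outside B
gain_A = facility_location(A | {a}) - facility_location(A)
gain_B = facility_location(B | {a}) - facility_location(B)
assert gain_A >= gain_B  # diminishing returns: the marginal gain shrinks on the larger set
```

Because facility location is monotone submodular, the assertion holds for any choice of nested sets and any non-negative similarity matrix.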

Table 1: Instantiations of SMI functions using different submodular functions (Kothawade et al., 2022c). Each is based on pairwise similarities $s_{ij}\in[0,1]$ between instances $i$ and $j$.
Name | $I_{F}(A;Q)$
FLVMI | $\sum_{i\in\mathcal{U}}\min\left(\max_{j\in A}s_{ij},\max_{j\in Q}s_{ij}\right)$
FLQMI | $\sum_{i\in Q}\max_{j\in A}s_{ij}+\sum_{i\in A}\max_{j\in Q}s_{ij}$
GCMI | $2\lambda\sum_{i\in A}\sum_{j\in Q}s_{ij}$
LOGDETMI | $\log\det(S_{A})-\log\det(S_{A}-S_{AQ}S_{Q}^{-1}S_{QA})$
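For concreteness, the FLQMI and GCMI rows of Table 1 reduce to a few lines of numpy once the cross-similarity kernel between $A$ and $Q$ is available; the snippet below is a minimal sketch with assumed names and shapes, not the Submodlib implementation:

```python
import numpy as np

def flqmi(S_AQ):
    """FLQMI from Table 1: the sum over Q of each query's best similarity to A,
    plus the sum over A of each selected instance's best similarity to Q.
    S_AQ[i, j] is the similarity between instance i in A and query j in Q."""
    return S_AQ.max(axis=0).sum() + S_AQ.max(axis=1).sum()

def gcmi(S_AQ, lam=1.0):
    """GCMI from Table 1: 2*lambda times the total A-Q cross similarity."""
    return 2.0 * lam * S_AQ.sum()

S_AQ = np.array([[0.9, 0.1], [0.2, 0.8]])  # toy 2x2 cross kernel
print(flqmi(S_AQ), gcmi(S_AQ))             # 3.4 and 4.0
```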
Figure 1: Architectural process of STENCIL. A downstream task is presented, for which unlabeled data is collected. An expert generates an exemplar query set based on prior knowledge of the task. These artifacts are fed to STENCIL's selection process, and the resulting selected subset is annotated and used for fine-tuning.

SMI is defined between two sets $A$ and $Q$ for a base submodular function $F$ as $I_{F}(A;Q)=F(A)+F(Q)-F(A\cup Q)$. Indeed, Iyer et al. (2021a, b) note that SMI generalizes Shannon-entropic mutual information, which is recovered from this definition by using Shannon entropy for $F$. Notably, if $I_{F}$ also satisfies certain conditions, then $I_{F}(A;Q)$ is monotone submodular for a fixed $Q$ (Iyer et al., 2021a). Accordingly, the same maximization framework discussed earlier extends to SMI. Kothawade et al. (2022c) extend this framework by defining SMI versions with restricted submodularity (Table 1); that is, $F$ is submodular for only certain subsets of $\mathcal{V}$. By defining $\mathcal{V}$ as $\mathcal{U}\cup Q$ for an auxiliary set of query instances $Q$, one can obtain a set of instances $A\subseteq\mathcal{U}$ with high information overlap with $Q$ by maximizing $I_{F}(A;Q)$, providing a query-relevant $A$ that also has the salient properties (diversity, representation, etc.) of the scoring afforded by $F$. This mechanism has been used across numerous targeted active learning applications in other modalities (Kothawade et al., 2021, 2022b; Beck et al., 2024), so its extension to text is an exciting avenue of study, especially for the cold-start scenario.

3 Method

In this section, we present STENCIL, which utilizes the maximization framework mentioned in Section 2 with a query set of exemplar instances of the rare class. We provide an overview in Figure 1 and Algorithm 1. STENCIL starts by instantiating the underlying SMI function as chosen from Table 1 through a modular choice of featurization and similarity measure. Featurization can be conducted by taking GloVe embeddings (Pennington et al., 2014) or any similar choice. Subsequently, the similarity values required by each SMI function in Table 1 are computed using a choice of similarity measure, such as cosine similarity, RBF similarity, and so forth (coalesced as matrices of similarity values, or similarity kernels). Once the similarity kernels are computed, a greedy monotone submodular maximization algorithm (Nemhauser et al., 1978; Minoux, 2005; Mirzasoleiman et al., 2015; Iyer & Bilmes, 2019) selects a set of unlabeled instances $A\subseteq\mathcal{U}$ of size $B$. The set $A$ is then annotated as a set $\mathcal{L}$, and a model $M$ is then fine-tuned on $\mathcal{L}$. In our experiments, we use an average of GloVe (Pennington et al., 2014) embeddings and cosine similarity to instantiate each SMI function.

Algorithm 1 STENCIL
Require: Model $M$, unlabeled set $\mathcal{U}$, query set $Q$, active learning budget $B$
Ensure: Fine-tuned model $M$
1: $E_{\mathcal{U}}, E_{Q} \leftarrow \textsc{Featurize}(\mathcal{U}), \textsc{Featurize}(Q)$
2: $\mathcal{K} \leftarrow \textsc{SimilarityKernels}(E_{\mathcal{U}}, E_{Q})$
3: $I_{F} \leftarrow \textsc{Instantiate}(\mathcal{K})$
4: $A \leftarrow \emptyset$
5: while $|A| < B$ do
6:   $a' \leftarrow \mathrm{argmax}_{a\in\mathcal{U}\setminus A}\; I_{F}(A\cup\{a\};Q) - I_{F}(A;Q)$
7:   $A \leftarrow A\cup\{a'\}$
8: end while
9: $\mathcal{L} \leftarrow \textsc{Annotate}(A)$
10: $M \leftarrow \textsc{FineTune}(\mathcal{L}, M)$
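To make the selection step concrete, the sketch below condenses Algorithm 1 into plain numpy with averaged word vectors, cosine similarity, and the FLQMI objective, whose marginal gain has a simple closed form. This is a minimal illustration under our own naming assumptions (featurize, stencil_select, a word_vectors dict), not the authors' implementation, which builds on Submodlib/DISTIL:

```python
import numpy as np

def featurize(texts, word_vectors, dim=300):
    """Average the word vectors of each text (zero vector if no token is known)."""
    feats = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
        if vecs:
            feats[i] = np.mean(vecs, axis=0)
    return feats

def cosine_kernel(X, Y):
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12)
    return Xn @ Yn.T

def stencil_select(unlabeled_texts, query_texts, budget, word_vectors):
    """Greedy FLQMI maximization, per lines 4-8 of Algorithm 1."""
    E_U = featurize(unlabeled_texts, word_vectors)
    E_Q = featurize(query_texts, word_vectors)
    S_UQ = cosine_kernel(E_U, E_Q)  # the cross kernel FLQMI needs
    selected = []
    # Best similarity each query currently has to the selected set A;
    # initialized to 0, which assumes non-negative similarities.
    best_per_query = np.zeros(len(query_texts))
    for _ in range(budget):
        # FLQMI marginal gain of adding u: its own best match to Q, plus
        # any improvement it brings to each query's best match to A.
        gains = S_UQ.max(axis=1) + np.maximum(S_UQ - best_per_query, 0.0).sum(axis=1)
        gains[selected] = -np.inf  # never re-pick an instance
        u = int(np.argmax(gains))
        selected.append(u)
        best_per_query = np.maximum(best_per_query, S_UQ[u])
    return selected
```

The selected indices would then be sent to an annotator and the model fine-tuned on the labeled result, per lines 9-10 of the algorithm.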

To apply STENCIL, an exemplar query set is needed. The query set comes from prior knowledge of the task and contains exemplars of the rare class. While such exemplars are hard to generate for other modalities of data such as images, we posit that it is easier to derive exemplars for reasonable text-based tasks. With spam detection, for example, prior knowledge dictates that messages urging the reader to open hyperlinks are likely spam instances; in such cases, one can prepare a few example sentences of this variety to use as the query set $Q$. With a complementary choice of featurization, STENCIL can then effectively choose a set of instances that semantically match the exemplars. Conducting selection in this way effectively returns a set of instances that are assumed to be weakly labeled as the rare class, which are then strongly labeled by an annotator.

4 Experiments

Table 2: Mean performance (± std dev) of STENCIL variants and baselines across 10 trials. The first three rows give test accuracy; the last three give rare-class F-1 score. LDMI abbreviates LOGDETMI.
Dataset | FLQMI | FLVMI | LDMI | GCMI | RegEx | Rand | Badge | Ent | L Conf | Marg | KMean
YouTube (Acc) | 70.9 ± 6.7 | 67.4 ± 3.4 | 77.8 ± 3.2 | 58.5 ± 5.7 | 66.6 ± 4.7 | 51.5 ± 5.5 | 53.3 ± 4.1 | 52.8 ± 8 | 52.1 ± 4.8 | 51.2 ± 3.1 | 56.3 ± 2.8
SMS (Acc) | 58.9 ± 6.9 | 66.6 ± 9.4 | 72.3 ± 4.9 | 86.9 ± 2.3 | 62.1 ± 12 | 55.8 ± 7.8 | 60.6 ± 6.9 | 68.2 ± 15 | 54.6 ± 9.4 | 60.2 ± 15 | 53.5 ± 3.6
Tweet (Acc) | 68.4 ± 3.2 | 66.8 ± 3.1 | 72.1 ± 1.6 | 66.2 ± 3.3 | 56.2 ± 4.0 | 59.1 ± 3.0 | 56.1 ± 4.4 | 53.2 ± 4.0 | 52.4 ± 5.3 | 53.4 ± 4.1 | 62.4 ± 2.8
YouTube (F-1) | 73.4 ± 5.8 | 55.1 ± 7.5 | 77.0 ± 5.9 | 61.1 ± 4.2 | 59.9 ± 7.5 | 10.0 ± 16 | 16.6 ± 13 | 12.8 ± 22 | 12.1 ± 15 | 9.7 ± 10.9 | 26.8 ± 7.6
SMS (F-1) | 27.9 ± 20 | 46.9 ± 22 | 61.1 ± 10 | 86.9 ± 2.2 | 33.4 ± 30 | 17.5 ± 21 | 32.8 ± 19 | 46.9 ± 34 | 12.7 ± 24 | 25.3 ± 36 | 11.8 ± 11.4
Tweet (F-1) | 72.4 ± 1.4 | 55.0 ± 7.3 | 67.2 ± 3.0 | 66.7 ± 4.4 | 24.3 ± 13 | 33.2 ± 8.9 | 24.0 ± 15 | 14.2 ± 14 | 11.2 ± 18 | 16.1 ± 14 | 43.3 ± 7.2
Table 3: Effect of query set size on mean performance (± std dev) of LOGDETMI across 10 trials. The first three rows give test accuracy; the last three give rare-class F-1 score. At 100% query set utilization, YouTube has 15 query instances, SMS has 25, and Tweet has 20.
Dataset | Sz-20% | Sz-40% | Sz-60% | Sz-80% | Sz-100%
YouTube (Acc) | 68.8 ± 8.8 | 70.8 ± 6.0 | 74.6 ± 4.0 | 74.7 ± 2.5 | 77.2 ± 3.3
SMS (Acc) | 63.9 ± 7.3 | 68.8 ± 4.2 | 70.9 ± 8.3 | 70.3 ± 8.4 | 73.2 ± 4.5
Tweet (Acc) | 69.5 ± 3.7 | 70.5 ± 2.9 | 71.6 ± 1.9 | 71.9 ± 1.6 | 72.3 ± 1.6
YouTube (F-1) | 58.4 ± 21.4 | 63.6 ± 12.6 | 71.9 ± 6.2 | 71.8 ± 5.9 | 76.9 ± 6.5
SMS (F-1) | 41.5 ± 18.8 | 54.1 ± 9.5 | 57.2 ± 17.8 | 55.9 ± 20.2 | 63.1 ± 8.9
Tweet (F-1) | 62.0 ± 8.5 | 63.6 ± 6.0 | 67.0 ± 3.7 | 67.1 ± 3.7 | 67.5 ± 2.7

In this section, we experimentally verify STENCIL's single-selection ability to handle class-imbalanced cold-start active learning scenarios for text versus common active learning strategies. We additionally conduct ablations across the choice of SMI function (Table 1) and the query set size, analyzing their effect on performance. We conduct experiments over three datasets: YouTube Spam Classification (Alberto et al., 2015), SMS Spam Classification (Almeida et al., 2011), and Twitter Sentiment (Wan & Gao, 2015) (see Appendix B). These choices are natural since common knowledge of the structure of spam messages and positive sentiments can be incorporated via exemplars (shown in Appendix A). Additionally, both of the latter datasets feature natural class imbalances (SMS: 1:18, Twitter: 1:6). We induce an imbalance within the YouTube dataset (1:10) to obtain three class-imbalanced settings, and we evaluate on balanced test data to measure both rare-class and overall performance.

For baseline comparison with STENCIL's SMI variants (Table 1), we evaluate against common active learning strategies. Random chooses $B$ random instances from $\mathcal{U}$. Entropy, Least Confidence, and Margin select the top $B$ instances from $\mathcal{U}$ with the highest Shannon entropy, the lowest maximum predicted probability, and the lowest classification margin, respectively (Settles, 2009). Badge calculates the loss gradients of each point in $\mathcal{U}$ using pseudo-labeling and performs k-means++ sampling (Arthur et al., 2007) to select $B$ instances from $\mathcal{U}$, which gives a diverse set of uncertain instances (Ash et al., 2019). RegEx uses regular expression matching between each data point in the exemplar query set and each data point in $\mathcal{U}$; the top $B$ samples with the highest cumulative count of word matches with the patterns or phrases in the query set are selected. Finally, KMean applies $k$-means clustering with $k=B$ on the space of average GloVe embeddings for each instance and returns the 1-NN of each center, serving as an adjacent baseline to (Hacohen et al., 2022; Jin et al., 2022). In all cases, $B$ is set to 50 for YouTube, 136 for SMS, and 144 for Twitter (roughly 1.5% to 5% of the full unlabeled set size for each dataset). Subsequently, we utilize an LSTM-based network architecture (Hochreiter & Schmidhuber, 1997) of 1.23M parameters that takes the sequence as input and makes classification decisions. We optimize the network using SGD with learning rate 0.01 for 25-50 epochs (YouTube: 50, SMS: 30, Tweet: 25).
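For reference, the sketch below outlines this classifier setup in PyTorch; the vocabulary size and layer widths are our own assumptions chosen only to land near the stated ~1.23M-parameter scale, not the exact experimental configuration:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """A small LSTM text classifier: embed tokens, run an LSTM, and
    classify from the final hidden state."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.head(h_n[-1])

model = LSTMClassifier(vocab_size=8000)  # assumed vocab size for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
```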

We present the results of our experiments averaged across 10 trials in Table 2 and Table 3. In summary, the GCMI variant of STENCIL performs best on the SMS dataset, the LOGDETMI variant performs best on the YouTube dataset, and on the Twitter dataset the LOGDETMI variant exhibits the best overall accuracy (on balanced test data) while the FLQMI variant exhibits the best rare-class F-1 score. Hence, STENCIL improves rare-class performance in the class-imbalanced cold-start setting using only one round of selection without sacrificing overall performance, seeing improvements of 10%-18% overall accuracy and 17%-40% rare-class F-1 score. Interestingly, we note that while RegEx uses the exemplar query set, its rule-based operation cannot capture diverse sets of semantically related instances as well as STENCIL. Indeed, LOGDETMI accounts for diversity through its determinants, which is advantageous in cold-start settings where good coverage of the data is required. GCMI's effectiveness is particularly notable in situations where rare instances exhibit high similarity, as observed in the SMS dataset. Lastly, we observe from Table 3 that, while more query examples provide better performance, there are diminishing returns when increasing the query set size beyond a certain point; hence, STENCIL effectively leverages small query sets and does not specifically require large ones.

5 Conclusion

In this work, we present STENCIL, which effectively utilizes SMI and prior knowledge of the data environment in the form of a small exemplar set of rare-class instances to manage class-imbalanced cold-start active learning scenarios. We demonstrate that STENCIL is able to make improvements of 10%-18% overall accuracy and 17%-40% rare-class F-1 score over common active learning strategies within a single round of selection. As an exciting avenue of future work, the application of STENCIL's SMI functionality to other text-based tasks beyond classification – and the formation of query exemplars in these tasks – would greatly expand the contributions of this work.

Reproducibility and Licenses

To reproduce our results, we offer six Google Colab notebooks (https://github.com/nab170130/stencil) that can be run repeatedly to amass trial outcomes for each configuration of STENCIL and for each baseline method. Per the setting and model size (1.23M parameters) described in Section 4, it is sufficient to run these notebooks on the standard CPU configuration. Per dataset, amassing 10 trials requires roughly 2 hours of compute time. Hyper-parameters were chosen based on the code examples provided in DISTIL.

We utilize the following libraries in our experiments and list their licenses as follows:

  • DISTIL (Beck et al., 2021): MIT License

  • Submodlib (Kaushal et al., 2022): MIT License

  • PyTorch (Paszke et al., 2019): Modified BSD License

  • NLTK (Bird et al., 2009): Apache License Version 2.0

  • Transformers (Wolf et al., 2020): Apache License 2.0

  • YouTube Spam (Alberto et al., 2015): CC Attribution 4.0 International

  • SMS Spam (Almeida et al., 2011): CC Attribution 4.0 International

  • Twitter Sentiment (Wan & Gao, 2015): Creative Commons CC-BY-NC-ND

References

  • Alberto et al. (2015) Alberto, T. C., Lochter, J. V., and Almeida, T. A. Tubespam: Comment spam filtering on youtube. In 2015 IEEE 14th international conference on machine learning and applications (ICMLA), pp.  138–143. IEEE, 2015.
  • Almeida et al. (2011) Almeida, T. A., Hidalgo, J. M. G., and Yamakami, A. Contributions to the study of sms spam filtering: new collection and results. In Proceedings of the 11th ACM symposium on Document engineering, pp.  259–262, 2011.
  • Arthur et al. (2007) Arthur, D., Vassilvitskii, S., et al. k-means++: The advantages of careful seeding. In Soda, volume 7, pp.  1027–1035, 2007.
  • Ash et al. (2019) Ash, J. T., Zhang, C., Krishnamurthy, A., Langford, J., and Agarwal, A. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671, 2019.
  • Barata et al. (2021) Barata, R., Leite, M., Pacheco, R., Sampaio, M. O., Ascensão, J. T., and Bizarro, P. Active learning for imbalanced data under cold start. In Proceedings of the Second ACM International Conference on AI in Finance, pp.  1–9, 2021.
  • Beck et al. (2021) Beck, N., Sivasubramanian, D., Dani, A., Ramakrishnan, G., and Iyer, R. Effective evaluation of deep active learning on image classification tasks. arXiv preprint arXiv:2106.15324, 2021.
  • Beck et al. (2024) Beck, N., Killamsetty, K., Kothawade, S., and Iyer, R. Beyond active learning: Leveraging the full potential of human interaction via auto-labeling, human correction, and human verification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  2881–2889, 2024.
  • Bird et al. (2009) Bird, S., Klein, E., and Loper, E. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
  • Brangbour et al. (2022) Brangbour, E., Bruneau, P., Tamisier, T., and Marchand-Maillet, S. Cold start active learning strategies in the context of imbalanced classification. arXiv preprint arXiv:2201.10227, 2022.
  • Hacohen et al. (2022) Hacohen, G., Dekel, A., and Weinshall, D. Active learning on a budget: Opposite strategies suit high and low budgets. arXiv preprint arXiv:2202.02794, 2022.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Iyer & Bilmes (2019) Iyer, R. and Bilmes, J. A memoization framework for scaling submodular optimization to large scale problems. In The 22nd International Conference on Artificial Intelligence and Statistics, pp.  2340–2349. PMLR, 2019.
  • Iyer et al. (2021a) Iyer, R., Khargoankar, N., Bilmes, J., and Asanani, H. Submodular combinatorial information measures with applications in machine learning. In Algorithmic Learning Theory, pp.  722–754. PMLR, 2021a.
  • Iyer et al. (2021b) Iyer, R., Khargonkar, N., Bilmes, J., and Asnani, H. Generalized submodular information measures: Theoretical properties, examples, optimization algorithms, and applications. IEEE Transactions on Information Theory, 68(2):752–781, 2021b.
  • Jin et al. (2022) Jin, Q., Yuan, M., Li, S., Wang, H., Wang, M., and Song, Z. Cold-start active learning for image classification. Information Sciences, 616:16–36, 2022.
  • Kaushal et al. (2022) Kaushal, V., Ramakrishnan, G., and Iyer, R. Submodlib: A submodular optimization library. arXiv preprint arXiv:2202.10680, 2022.
  • Kothawade et al. (2021) Kothawade, S., Beck, N., Killamsetty, K., and Iyer, R. Similar: Submodular information measures based active learning in realistic scenarios. Advances in Neural Information Processing Systems, 34:18685–18697, 2021.
  • Kothawade et al. (2022a) Kothawade, S., Chopra, S., Ghosh, S., and Iyer, R. Active data discovery: Mining unknown data using submodular information measures. arXiv preprint arXiv:2206.08566, 2022a.
  • Kothawade et al. (2022b) Kothawade, S., Ghosh, S., Shekhar, S., Xiang, Y., and Iyer, R. Talisman: targeted active learning for object detection with rare classes and slices using submodular mutual information. In European Conference on Computer Vision, pp.  1–16. Springer, 2022b.
  • Kothawade et al. (2022c) Kothawade, S., Kaushal, V., Ramakrishnan, G., Bilmes, J., and Iyer, R. Prism: A rich class of parameterized submodular information measures for guided data subset selection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  10238–10246, 2022c.
  • Maheshwari et al. (2020) Maheshwari, A., Chatterjee, O., Killamsetty, K., Ramakrishnan, G., and Iyer, R. Semi-supervised data programming with subset selection. arXiv preprint arXiv:2008.09887, 2020.
  • Maheshwari et al. (2022) Maheshwari, A., Killamsetty, K., Ramakrishnan, G., Iyer, R., Danilevsky, M., and Popa, L. Learning to robustly aggregate labeling functions for semi-supervised data programming. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp.  1188–1202, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.94. URL https://aclanthology.org/2022.findings-acl.94.
  • Min et al. (2023) Min, B., Ross, H., Sulem, E., Veyseh, A. P. B., Nguyen, T. H., Sainz, O., Agirre, E., Heintz, I., and Roth, D. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 56(2):1–40, 2023.
  • Minoux (2005) Minoux, M. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques: Proceedings of the 8th IFIP Conference on Optimization Techniques Würzburg, September 5–9, 1977, pp.  234–243. Springer, 2005.
  • Mirzasoleiman et al. (2015) Mirzasoleiman, B., Badanidiyuru, A., Karbasi, A., Vondrák, J., and Krause, A. Lazier than lazy greedy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • Nemhauser et al. (1978) Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming, 14:265–294, 1978.
  • Ni et al. (2020) Ni, A., Yin, P., and Neubig, G. Merging weak and active supervision for semantic parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  8536–8543, 2020.
  • OpenAI (2023) OpenAI. Openai: Introducing chatgpt, 2023. URL https://openai.com/blog/chatgpt.
  • Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp.  1532–1543, 2014.
  • Ratner et al. (2017) Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., and Ré, C. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 11, pp.  269. NIH Public Access, 2017.
  • Rauch et al. (2022) Rauch, L., Huseljic, D., and Sick, B. Enhancing active learning with weak supervision and transfer learning by leveraging information and knowledge sources. IAL@PKDD/ECML, 2022.
  • Ren et al. (2021) Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B. B., Chen, X., and Wang, X. A survey of deep active learning. ACM computing surveys (CSUR), 54(9):1–40, 2021.
  • Settles (2009) Settles, B. Active learning literature survey. 2009.
  • Wan & Gao (2015) Wan, Y. and Gao, Q. An ensemble sentiment classification system of twitter data for airline services analysis. In 2015 IEEE international conference on data mining workshop (ICDMW), pp.  1318–1325. IEEE, 2015.
  • Wolf et al. (2020) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp.  38–45, 2020.
  • Yuan et al. (2020) Yuan, M., Lin, H.-T., and Boyd-Graber, J. Cold-start active learning through self-supervised language modeling. arXiv preprint arXiv:2010.09535, 2020.
  • Zhang et al. (2023) Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.

Appendix A Chosen Query Exemplars within Experiments

In this section, we provide the query exemplars used in Section 4. The query phrases are created using domain knowledge, examples from prior work (Maheshwari et al., 2022), and prompting ChatGPT (OpenAI, 2023) with the following: "Give me 10 example phrases, not sentences, found in each of 1. Spam youtube comments 2. Spam SMS messages 3. Tweets with positive sentiment." Notably, we did not examine the instances of YouTube Spam (Alberto et al., 2015), SMS Spam (Almeida et al., 2011), or Twitter Sentiment (Wan & Gao, 2015) when curating these query sets.

A.1 YouTube Spam

  • ‘check out my latest video’

  • ‘click the link’

  • ‘dont miss out’

  • ‘https’

  • ‘for more information’

  • ‘free gift’

  • ‘free giveaway’

  • ‘win prizes’

  • ‘like and comment’

  • ‘limited time offer’

  • ‘please help’

  • ‘subscribe now’

  • ‘subscribe to my channel’

  • ‘visit my website’

  • ‘watch my video’

A.2 SMS Spam

  • "Click this link to claim your cash prize"

  • "Confirm your account details to continue using our service"

  • "Congratulations! You've won a free trial"

  • "Earn money quickly"

  • "Exclusive discount code inside"

  • "Free gift awaiting you"

  • "Get paid to work from home"

  • "Important security notice"

  • "Problem with your payment method"

  • "Special limited-time offer"

  • "Suspicious activity detected"

  • "To stop receiving these messages, click here"

  • "Unclaimed money in your name"

  • "Urgent action required to receive your package"

  • "Verify your email and password immediately"

  • "Warning: Your account will be deactivated"

  • "You are eligible for a refund"

  • "You have an unpaid bill"

  • "You're our lucky shopper today"

  • "You've been selected for an exclusive offer"

  • "Your account has been temporarily locked"

  • "Your order is ready for pickup"

  • "Your subscription is about to expire"

  • "Your trial period is ending"

  • "Your warranty is expired"

A.3 Twitter Sentiment

  • "Absolutely love"

  • "Amazing job"

  • "Can't wait for"

  • "Extremely happy to"

  • "Feeling blessed"

  • "Feeling inspired by"

  • "Feeling optimistic about"

  • "Feeling very proud of"

  • "Had a great time"

  • "Highly recommend"

  • "Incredible experience"

  • "Overwhelmed with happiness"

  • "Really excited about"

  • "So grateful for"

  • "Such a beautiful"

  • "Thank you so much"

  • "Totally loving the"

  • "Truly amazing"

  • "Very successful"

  • "Wonderful day"

Appendix B Dataset Details

In this section, we provide additional details of each dataset used in Section 4. All datasets contain English text. We also provide class distributions in our train and test splits in Table 4.

Table 4: Class distributions used in the train-test splits of our experiments (Section 4). The train split is made unlabeled before selection occurs.
Dataset | Rare Train | Common Train | Rare Test | Common Test
YouTube | 85 | 808 | 151 | 143
SMS | 234 | 4312 | 480 | 476
Tweet | 1402 | 8178 | 936 | 909

YouTube Spam Classification (Alberto et al., 2015): A collection of real comments on five of the ten most viewed YouTube videos at the time of collection. It consists of 1005 comments marked spam and 951 comments marked ham (non-spam). This dataset is publicly available in the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/380/youtube+spam+collection). After creating a balanced test set and making the spam class the rare class in the training data, we induce a class imbalance of approximately 1:10.

SMS Spam Classification (Almeida et al., 2011): A dataset of SMS messages curated for a binary spam classification task. It combines 425 SMS spam messages from the Grumbletext website, a subset of 3,375 ham messages from the NUS SMS Corpus, 450 ham messages from a PhD thesis, and 1,002 ham and 322 spam messages from the SMS Spam Corpus v.0.1 Big. This dataset is publicly accessible at the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/228/sms+spam+collection). After creating a balanced test set, the imbalance factor is left at approximately 1:18.

Twitter Sentiment (Wan & Gao, 2015): A sentiment classification dataset of comments from Twitter (tweets) about airline services. It includes 12,864 tweets labeled with positive, neutral, or negative sentiments based on analysis using an ensemble of 6 classifiers. For the scope of this study, only entries with positive and negative labels are retained. The dataset is made available by Maheshwari et al. (2022). After creating a balanced test set, the imbalance factor is left at approximately 1:6 (with positive sentiments deemed rare).

Appendix C Additional Baseline Details

In this section, we provide additional detail for each baseline used in our experiments.

C.1 Submodular Mutual Information

As mentioned in Section 2, SMI can be instantiated using various submodular functions, each of which models different properties. Facility Location ($F(A)=\sum_{i\in\mathcal{V}}\max_{j\in A}s_{ij}$) captures how well a subset represents its ground set and is used to instantiate the FLVMI and FLQMI variants in Table 1. As derived in Kothawade et al. (2022c), FLQMI differs from FLVMI by modeling only the cross similarities between $A$ and $Q$. Log Determinant ($F(A)=\log\det S_{A}$) captures diversity information in $A$, which is reflected by the determinant of $S_{A}$; hence, LOGDETMI captures diversity information between $A$ and $Q$. Graph Cut ($F(A)=\sum_{i\in\mathcal{V}}\sum_{j\in A}s_{ij}-\lambda\sum_{i,j\in A}s_{ij}$) also focuses on representing $\mathcal{V}$ with $A$ and, when ingested by SMI's definition, gives GCMI, which focuses entirely on how relevant $A$ is to the queries in $Q$. Notably, all instantiations utilize similarity values between instances. In our experiments, we represent each instance via the average of the GloVe (Pennington et al., 2014) embeddings for each token in the sequence. Afterwards, we compute cosine similarity to derive each $s_{ij}$ value.
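As an illustration, the LOGDETMI score in Table 1 can be computed directly from these kernels; the sketch below uses assumed kernel names and adds a small ridge term for numerical stability:

```python
import numpy as np

def logdetmi(S_A, S_Q, S_AQ, eps=1e-6):
    """LOGDETMI: log det(S_A) - log det(S_A - S_AQ S_Q^{-1} S_QA).

    S_A: (|A|, |A|) kernel over the selected set; S_Q: (|Q|, |Q|) kernel
    over the query set; S_AQ: (|A|, |Q|) cross kernel.
    """
    S_A = S_A + eps * np.eye(len(S_A))  # ridge terms keep determinants well-conditioned
    S_Q = S_Q + eps * np.eye(len(S_Q))
    schur = S_A - S_AQ @ np.linalg.solve(S_Q, S_AQ.T)  # S_A - S_AQ S_Q^{-1} S_QA
    return np.linalg.slogdet(S_A)[1] - np.linalg.slogdet(schur)[1]
```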

C.2 BADGE Sampling

The BADGE strategy (Ash et al., 2019) is a common choice for batch active learning where diverse sets of uncertain instances are selected, which tend to be informative instances. BADGE achieves this by embedding each unlabeled instance using hypothesized loss gradients (using the most confidently predicted class). By picking a diverse span of points with large-magnitude gradients, one obtains a diverse set of instances that are likely to bring large updates to the loss (and are thus uncertain). In general, the following steps are taken (a code sketch follows the list):

  • Calculate the pseudo-label for each point in the unlabeled set. The pseudo-label is the class with the highest probability.

  • Compute the cross-entropy loss for each point in the unlabeled set using this pseudo-label.

  • Obtain the resulting loss gradients on the last linear layer of the model for each point (the hypothesized loss gradients).

  • Using these gradients as a form of embedding for each unlabeled point, run k-means++ initialization (Arthur et al., 2007) on this embedding set, retrieving $B$ centers. Each center is a point from the unlabeled set, and $B$ represents the active learning budget.

  • Request labels for the $B$ points whose embeddings were selected.
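A minimal sketch of these steps, with our own helper names and assumed array shapes, is given below:

```python
import numpy as np

def badge_embeddings(probs, penultimate):
    """Hypothesized last-layer gradient embeddings.

    probs: (n, C) softmax outputs; penultimate: (n, d) last hidden features.
    The cross-entropy gradient w.r.t. the last linear layer at the
    pseudo-label is the outer product of (p - onehot(pseudo-label)) and h.
    """
    n, _ = probs.shape
    delta = probs.copy()
    delta[np.arange(n), probs.argmax(axis=1)] -= 1.0  # p - onehot(pseudo-label)
    return (delta[:, :, None] * penultimate[:, None, :]).reshape(n, -1)

def kmeans_pp_select(X, B, seed=0):
    """k-means++ seeding: pick B indices, each drawn with probability
    proportional to squared distance from the already-chosen set."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]  # first center chosen uniformly
    d2 = ((X - X[chosen[0]]) ** 2).sum(axis=1)
    for _ in range(B - 1):
        c = int(rng.choice(len(X), p=d2 / d2.sum()))
        chosen.append(c)
        d2 = np.minimum(d2, ((X - X[c]) ** 2).sum(axis=1))
    return chosen
```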

C.3 Basic Uncertainty Sampling

A common and long-standing choice for active learning selection is simple uncertainty sampling, as discussed in (Settles, 2009). Namely, one can quantify uncertainty using the predicted class probabilities and select the most uncertain points. There are three common choices for quantifying uncertainty: Entropy ($H(x)=-\sum_{i}p(x)_{i}\log p(x)_{i}$), Least Confidence ($C(x)=\max_{i}p(x)_{i}$), and Margin ($M(x)=p(x)_{\sigma_{1}}-p(x)_{\sigma_{2}}$, where $\sigma_{1}$ and $\sigma_{2}$ denote the most probable and second-most probable class, respectively). Entropy sampling chooses the $B$ largest $H(x)$ values. Least confidence sampling and margin sampling choose the $B$ smallest $C(x)$ and $M(x)$ values, respectively. The uncertain points are then labeled, which tends to provide more crucial information to the downstream model.
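Each of these scores is straightforward to compute from a matrix of predicted class probabilities; the following is a minimal sketch with assumed shapes:

```python
import numpy as np

def uncertainty_ranking(probs, B, method="entropy"):
    """Return the indices of the B most uncertain points.

    probs: (n, C) array of predicted class probabilities. Scores are
    oriented so that larger always means more uncertain.
    """
    if method == "entropy":
        scores = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # high entropy
    elif method == "least_confidence":
        scores = -probs.max(axis=1)                            # low max probability
    else:  # margin
        top2 = np.sort(probs, axis=1)[:, -2:]                  # two largest probs
        scores = -(top2[:, 1] - top2[:, 0])                    # small margin
    return np.argsort(scores)[-B:]
```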

C.4 RegEx Sampling

Given the list of query phrases specific to the dataset, the RegEx selection method performs regular expression matching between each phrase in the query set and each data point in the unlabeled dataset. The $B$ samples with the highest cumulative count of matches with the patterns or phrases in the query set form the selected subset. Note that this sampling strategy focuses only on extracting rare-class samples and not on obtaining a balanced representation of both class labels in the dataset.
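A minimal sketch of this selection rule, with illustrative names, is given below:

```python
import re

def regex_select(unlabeled_texts, query_phrases, B):
    """Select the B texts with the most query-phrase matches."""
    patterns = [re.compile(re.escape(p), re.IGNORECASE) for p in query_phrases]
    counts = [sum(len(pat.findall(text)) for pat in patterns)
              for text in unlabeled_texts]
    order = sorted(range(len(counts)), key=counts.__getitem__, reverse=True)
    return order[:B]
```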

Appendix D Limitations and Societal Impacts

Here, we briefly discuss the limitations of STENCIL. Namely, proper function of STENCIL is contingent upon correct prior knowledge being injected via the query set. If improper prior knowledge is used, STENCIL may not select rare-class data optimally. As such, STENCIL's application is limited to settings where adequate query exemplars can be created a priori, which is not the case for many settings. As a text example, it may not be easy to provide exemplars for tasks that involve processing SMILES-string representations of chemicals. Further, providing exemplars for other modalities, such as video, images, and audio, may not be as easy as for text. Hence, STENCIL is limited to tasks where query exemplars are easily provided. Another limitation of STENCIL is that its core functionality depends on an adequate feature space within the cold-start setting. While this is typically available for NLP tasks via the advent of pre-trained models, certain tasks may not be compatible with the embedding space of these models. Hence, STENCIL additionally needs access to a good featurization of the data within a cold-start setting.

While we do not anticipate this work to have far-reaching societal consequences, we do highlight that this paper provides a query-biased basis for active learning methods within the cold-start setting. Hence, as STENCIL provides a base dataset upon which other active learning methods can build, it is important to mitigate harmful biases that might be present within the query data. Otherwise, these biases may be propagated in downstream learning, which can result in biased models that inadvertently mishandle important domains of the input space. We believe this to be the main potential consequence of this work and urge future applications to take precautions to mitigate this effect.