
EXnet: Efficient In-context Learning for Data-less Text Classification

Debaditya Shome
KIIT University, India
[email protected]
Kuldeep Yadav
SHL, India
Abstract

Large pre-trained language models (PLMs) have made significant progress in encoding world knowledge and spawned a new set of learning paradigms including zero-shot, few-shot, and in-context learning. Many language tasks can be modeled as a set of prompts (for example, is this text about geography?) and language models can provide binary answers, i.e., Yes or No. There is evidence to suggest that the next-word prediction used by many PLMs does not align well with zero-shot paradigms. Therefore, PLMs are fine-tuned as question-answering systems. In-context learning extends zero-shot learning by incorporating prompts and examples, resulting in increased task accuracy. Our paper presents EXnet, a model specifically designed to perform in-context learning without any limitations on the number of examples. We argue that in-context learning is an effective method to increase task accuracy, and providing examples facilitates cross-task generalization, especially when it comes to text classification tasks. With extensive experiments, we show that even our smallest model (15M parameters) generalizes to several unseen classification tasks and domains.

1 Introduction

Large pre-trained language models (PLMs) have revolutionized Natural Language Processing (NLP) and demonstrated state-of-the-art results over a varied set of tasks Min et al. (2021). Despite this success, most pre-trained models need to be fine-tuned on a large corpus of manually-annotated training data, especially when the data originates from a different domain. Annotating such large-scale task-specific data for training each downstream model is an extremely expensive and time-consuming process. To address this issue, recent research focuses on zero-shot learning with PLMs, which involves solving downstream tasks without any further gradient-based training (parameter updates) of the models. For instance, GPT-3 (Brown et al., 2020) and other GPT-family models such as GPT-Neo showed an interesting ability called in-context learning, in which the model can learn unseen tasks on-the-fly using a few demonstration examples, without any parameter updates. Zhong et al. (2021) demonstrated that it is possible to convert text classification tasks into Yes/No questions and train a prompt-based model to perform zero-shot learning. Unfortunately, pure zero-shot models provide low task-specific accuracy and are not useful for many downstream applications. In-context learning, which takes additional context in terms of both a prompt and examples, outperforms classic zero-shot models and is likely to work better for many real-world applications. We argue that adding a small number of demonstration examples is a small overhead but can provide significant accuracy improvements. Popular PLMs do support in-context learning but have major limitations. Apart from a huge training overhead, they have significant inference costs that make them inaccessible to a large number of practitioners. Further, they provide poor accuracy on domain-specific tasks, support only a limited number of examples, and are sensitive to the order of examples (Zhao et al., 2021).

In this paper, we propose EXnet, a lightweight model architecture that enables us to utilize in-context support examples for few-shot text classification. Our model does not limit the number of input task examples, and the ordering of the examples has no visible impact on its predictions. Specifically, this paper makes the following contributions:

  • We present EXnet, a simple and scalable model architecture trained specifically to learn classification tasks with few-shot examples, without any parameter updates.

  • We evaluate EXnet on a collection of 9 datasets that are unseen during training and test both cross-domain and cross-task understanding. Even the smallest variant of EXnet, with 15 million parameters and only two in-context examples, shows performance gains of up to 48% compared to GPT-Neo (1.3 billion parameters), which is 86× the size of our model.


2 Related Work

This section provides a comprehensive overview of the previous works related to EXnet. The overall landscape of few-shot learning can be grouped into two categories: few-shot learning with or without parameter updates. Generally, few-shot learning with parameter updates is termed few-shot fine-tuning because it involves using a small set of K labeled samples to fine-tune a model. Gupta et al. (2020) explored transfer learning with BERT for few-shot fine-tuning, where they use text-label pairs as input and predict whether the label belongs to the text. Their results show that this simple approach outperforms all previous baselines over the ARSC dataset, suggesting that using text-label pairs as input has the potential to demonstrate cross-domain understanding. Wei et al. (2021) presented a data augmentation strategy powered by curriculum learning and BERT-based triplet networks for few-shot fine-tuning, where they show a slight 3% improvement in performance over previous approaches. Ohashi et al. (2021) presented an approach in which they utilize semantic knowledge from labels to learn distinct representations specific to each label, which in turn enables improved model performance in few-shot fine-tuning.

Recently, prompt learning has also been explored for few-shot learning, where PLMs are provided a task instruction/prefix that helps them utilize their pre-trained knowledge to understand the task instruction and thus improve few-shot performance. Schick and Schütze (2020) presented PET, a prompt-based semi-supervised method that shows strong few-shot performance. PET utilizes cloze-style input patterns similar to Masked Language Modeling (MLM). Utama et al. (2021) demonstrate through their experiments that few-shot fine-tuning performs poorly compared to its zero-shot counterpart, indicating that few-shot fine-tuning degrades PLM performance, particularly in the case of sentence-pair classification tasks. A possible explanation for this phenomenon is that the PLMs lose their pre-trained knowledge through catastrophic forgetting during the few-shot gradient updates and learn lexical shortcuts instead of meaningful patterns. An extreme case of few-shot learning without parameter updates is zero-shot learning, where no examples of the task are required. Yin et al. (2019) made the first attempt at unifying datasets and evaluating the zero-shot performance of language models that were fine-tuned only on natural language inference (NLI) tasks. However, Ma et al. (2021) demonstrated some major drawbacks of NLI-based zero-shot text classification. They show that there are negligible gains on some tasks with NLI-based fine-tuning compared to standard BERT pre-trained on the next-sentence prediction task. Moreover, they show that these NLI-based models rely only on shallow lexical patterns and exhibit high variance and a lack of stability. Zhong et al. (2021) explored the idea of utilizing prompts in a question-answering format, where the task is to predict whether the label description belongs to the text or not. They collect a large set of 43 datasets comprising 441 labels, which they categorize into groups of similar datasets and only evaluate on those groups that are unseen during training. Their experiments show that larger PLMs like T5-Large perform much better than smaller ones. In-context learning is the most promising variation of few-shot learning without gradient updates; it involves giving a set of labeled examples in the input to PLMs, along with a query text that must be predicted based on the understanding of the task gained from those examples. Brown et al. (2020) presented GPT-3, which was the first PLM to demonstrate this ability of in-context learning, mostly on text generation tasks. Major disadvantages of GPT-3 are its huge size (175 billion parameters) and the range of biases demonstrated by Zhao et al. (2021).

None of the prior work explores building a lightweight in-context model that supports an unlimited number of examples.

3 Proposed Model

Figure 1: EXnet architecture

EXnet takes as input a Support set S and a Query Q, where the query is a text in a question format with a fixed template T: T(x, l) = "question: is the text about <l>? [SEP] text: <x>". The support set S consists of K example texts that belong to the class l mentioned in Q. Each support example also follows the format of T. For training, we use the binary cross-entropy loss function with the AdamW Loshchilov and Hutter (2018) optimizer and a learning rate of 0.00002. The architecture of EXnet is shown schematically in Figure 1. The idea of using a single encoder to encode both the query and the support set, and then concatenating their embeddings for few-shot learning, was inspired by metric learning-based few-shot methods in Computer Vision, such as Relation Networks Sung et al. (2018) and Matching Networks Vinyals et al. (2016).
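As an illustration of the input format, a minimal sketch of the query/support construction with the template T(x, l); the function names are ours, for illustration only:

```python
# Minimal sketch of the input construction described above; `format_query` and
# `build_support_set` are illustrative names, not part of any released code.

TEMPLATE = "question: is the text about <{label}>? [SEP] text: <{text}>"

def format_query(text: str, label: str) -> str:
    """Render a (text, label) pair with the fixed template T(x, l)."""
    return TEMPLATE.format(label=label, text=text)

def build_support_set(example_texts, label):
    """K positive example texts of class `label`, each rendered with the same template."""
    return [format_query(x, label) for x in example_texts]

# Example: querying whether a sentence is about "geography" with K = 2 support examples.
query = format_query("The Nile flows through eleven countries.", "geography")
support = build_support_set(
    ["Mount Everest lies on the border of Nepal and China.",
     "The Sahara is the largest hot desert on Earth."],
    "geography",
)
```

EXnet consists of the following key components: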

  • Encoder module: We use a single encoder E to generate embeddings for both the query Q and the support set S. Thus, E generates K embeddings e_1, ..., e_K for the K examples in S and one embedding e_q for Q. The K support set embeddings are then concatenated together into a single tensor e_S. Here, E can be any encoder-only transformer model such as BERT Devlin et al. (2018). For simplicity, we experiment with three different variants of BERT based on size/number of parameters.

  • Projector blocks: Two projector blocks P_1 and P_2 are used to convert e_S and e_q respectively into representations of equal dimension. P_1 consists of three GeLU activated feed-forward layers with a 30% dropout between the second and third layers. P_2 consists of only a single GeLU activated feed-forward layer. We observe that keeping this asymmetry in the projections enables the model to show slightly better performance.

  • Cross-attention module: The encoder E attends to either the query Q or the support set S at a time, without being able to utilize both of them together. We therefore use a cross-attention module to attend to both Q and S at the same time, which enables EXnet to make effective use of the knowledge from the in-context examples when predicting the answer to the query Q. The projected vectors are passed into this cross-attention module C, which utilizes the first projected vector as Key and Value, and the second projected vector as Query in the Multi-Head Attention mechanism of the original Transformer Vaswani et al. (2017).

  • Feed-Forward network: The output representation from the cross-attention module is passed to a Feed-Forward network (FFN) consisting of two GeLU activated feed-forward layers, followed by a final Sigmoid activated feed-forward layer that predicts the probability of the presence of the class l mentioned in Q.
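The following PyTorch sketch shows one way these components could be wired together. It is a reconstruction from the description above, not the released implementation; the projection width, number of attention heads, [CLS] pooling of the encoder output, and per-example projection of the support embeddings are our assumptions.

```python
# Illustrative PyTorch sketch of the EXnet forward pass, reconstructed from the paper's
# description. Layer widths, pooling, and other details are assumptions, not released code.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class EXnetSketch(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", proj_dim=256, n_heads=4):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)        # shared encoder E
        d = self.encoder.config.hidden_size
        # P1: three GeLU feed-forward layers, 30% dropout between the 2nd and 3rd.
        self.p1 = nn.Sequential(
            nn.Linear(d, d), nn.GELU(),
            nn.Linear(d, proj_dim), nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(proj_dim, proj_dim), nn.GELU(),
        )
        # P2: a single GeLU feed-forward layer (asymmetric with P1).
        self.p2 = nn.Sequential(nn.Linear(d, proj_dim), nn.GELU())
        # Cross-attention C: support embeddings act as Key/Value, the query as Query.
        self.cross_attn = nn.MultiheadAttention(proj_dim, n_heads, batch_first=True)
        # FFN head: two GeLU layers followed by a sigmoid output for the yes/no probability.
        self.head = nn.Sequential(
            nn.Linear(proj_dim, proj_dim), nn.GELU(),
            nn.Linear(proj_dim, proj_dim), nn.GELU(),
            nn.Linear(proj_dim, 1), nn.Sigmoid(),
        )

    def _embed(self, texts):
        batch = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return self.encoder(**batch).last_hidden_state[:, 0]          # [CLS] pooling (assumed)

    def forward(self, query, support):
        e_s = self.p1(self._embed(support)).unsqueeze(0)               # (1, K, proj_dim)
        e_q = self.p2(self._embed([query])).unsqueeze(0)               # (1, 1, proj_dim)
        # The query attends over the K support embeddings.
        attended, _ = self.cross_attn(query=e_q, key=e_s, value=e_s)
        return self.head(attended.squeeze(0)).squeeze(-1)              # probability of "yes"


# Usage (illustrative); both query and support examples follow the template T(x, l).
model = EXnetSketch()
query = "question: is the text about <geography>? [SEP] text: <The Nile flows through eleven countries.>"
support = [
    "question: is the text about <geography>? [SEP] text: <Mount Everest lies on the border of Nepal and China.>",
    "question: is the text about <geography>? [SEP] text: <The Sahara is the largest hot desert on Earth.>",
]
p_yes = model(query, support)
# Training (as described above) uses nn.BCELoss() on p_yes with AdamW at a learning rate of 2e-5.
```

Under this wiring, the support examples enter only as a set of attention keys and values, so the prediction does not depend on their order and no hard limit is placed on the number of examples.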

4 Experiments

4.1 Training setup

We combined multiple public datasets representing the aspects of Topics, Emotions, Entities, and Questions, which play a key role in providing the models with essential basic world knowledge. To unify these datasets into the same binary ("yes"/"no") format, we use a simple strategy. For all the multi-class datasets, we keep all the labeled pairs of (text, class) as positives, and randomly sample an equal number of negatives for each text from other classes in the dataset. For every triplet (text, class, yes/no) in this prepared dataset, we sample a list of examples from all the positives. For training, we randomly sample the input examples to optimize/calibrate the model to use those examples effectively. In contrast, during testing/evaluation we keep the set of examples fixed, as discussed further in the next subsection. A sketch of this conversion procedure is given after the dataset descriptions. Below, we describe all the datasets used in this paper:

  • GoEmotions: This is a human-annotated dataset of 58k Reddit comments taken from well-known English-language subreddits and labeled with 27 emotions Demszky et al. (2020). Most emotion classification datasets used in the realm of zero-shot/few-shot learning have used the standard Ekman 6-way emotion categories. To provide the model with more fine-grained emotional knowledge across 27 emotions, we chose GoEmotions.

  • News Category (https://www.kaggle.com/datasets/rmisra/news-category-dataset): This dataset contains around 200k news headlines from 2012 to 2018, obtained from the news website HuffPost, along with their respective 42 categories.

  • TREC-6 Li and Roth (2002): This question classification dataset contains 5,500 labeled questions in the training set and another 500 in the test set. It has 6 labels covering questions about descriptions, abbreviations, entities, people/individuals, locations, and numerical/quantitative information.

  • OpenEntity Choi et al. (2018): This dataset consists of around 6,000 sentences annotated with precise entity types, which are free-form noun phrases describing the appropriate types of the target entity. Each sentence includes five categories on average.
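As referenced above, a minimal sketch of the strategy used to convert a multi-class dataset into the binary ("yes"/"no") format; the function and variable names are placeholders, not the actual preprocessing code:

```python
# Sketch of the unification strategy: each labeled (text, class) pair becomes a positive,
# one negative class is sampled per text, and support examples are drawn from the
# positives of the queried class. Names are placeholders for illustration.
import random
from collections import defaultdict

def to_yes_no(dataset, k_max=8, seed=0):
    """dataset: list of (text, class_label) pairs.
    Returns rows of (text, class_label, answer, support_examples)."""
    rng = random.Random(seed)
    labels = sorted({label for _, label in dataset})
    texts_by_label = defaultdict(list)
    for text, label in dataset:
        texts_by_label[label].append(text)

    rows = []
    for text, label in dataset:
        negative = rng.choice([l for l in labels if l != label])   # one sampled negative per positive
        for cls, answer in ((label, "yes"), (negative, "no")):
            # Support examples come from the positives of the queried class, excluding the
            # query text itself; at training time this sampling is random per instance.
            pool = [t for t in texts_by_label[cls] if t != text]
            rows.append((text, cls, answer, rng.sample(pool, min(k_max, len(pool)))))
    return rows
```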

4.2 Evaluation setup

We collected multiple public text classification datasets to effectively evaluate our models. Similar to our training setup, we first convert every evaluation dataset into the binary ("yes"/"no") format using the same strategy of sampling random negatives. To obtain examples, we create a fixed set of examples for each class and use them for every text in that class. Unlike the training setup, we do not sample randomly from the entire dataset, which makes the evaluation robust and simulates a real-world scenario in which humans using our model would be able to give only a few examples per label. We categorize these evaluation datasets into two categories, as described further in this section.
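A short sketch of this evaluation protocol, with the support examples fixed once per class; the names are illustrative, not the actual evaluation code:

```python
# Sketch of the evaluation protocol: K support examples are chosen once per class and
# reused for every query of that class, unlike the random per-instance sampling used
# during training. `predict_fn` stands in for any model, e.g. EXnet.
import random

def fixed_support_sets(texts_by_label, k, seed=0):
    """Pick K support texts per class, once, up front."""
    rng = random.Random(seed)
    return {label: rng.sample(texts, min(k, len(texts)))
            for label, texts in texts_by_label.items()}

def evaluate(predict_fn, test_rows, support_sets):
    """test_rows: iterable of (text, label, gold_answer);
    predict_fn(text, label, examples) -> "yes" or "no"."""
    return [(gold, predict_fn(text, label, support_sets[label]))
            for text, label, gold in test_rows]
```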

4.2.1 Cross-domain evaluation

The purpose of cross-domain evaluation is to measure the model's capability to generalize across texts from different domains. For instance, Abstract Classification is similar to News Category, as both entail detecting topics from text, but the domains are different, i.e., scientific articles vs. news. We gather the following datasets for this evaluation setup:

  • Cloth reviews (https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews): This is a binary classification dataset based on customer reviews of clothing items on an e-commerce website. The positive class indicates that the user recommends the item. The training task closest to this is the emotion classification task, i.e., GoEmotions, because predicting whether a user recommends an item is almost the same as predicting a positive sentiment. Hence, this dataset tests whether a model trained on emotions is able to generalize to the domain of user reviews of clothing items, without being explicitly trained on that domain.

  • Abstract classification (https://www.kaggle.com/abisheksudarshan/topic-modeling-for-research-articles?select=Train.csv): This is a multi-class topic classification dataset where each text is the abstract of a scientific article. The training task closest to this is News Category classification. It tests the model's capability to generalize on topic classification in the unseen domain of scientific text.

  • Water problem (https://www.kaggle.com/vbmokin/nlp-reports-news-classification?select=water_problem_nlp_en_for_Kaggle_100.csv): This is a multi-class topic classification dataset on the task of distinguishing text reports about water problems. It evaluates the model's capability to generalize to the domain of water problems by understanding chemical, biological, and other specialized concepts that were unseen during training.

  • Stock sentiment classification (https://www.kaggle.com/yash612/stockmarket-sentiment-dataset): This is a binary classification dataset for the task of identifying positive or negative trends in the stock market from user comments. Evaluating our model over this dataset shows whether the model is capable of understanding and generalizing over domain-specific textual data about the stock market.

4.2.2 Cross-task evaluation

The purpose of cross-task evaluation is to measure the model's capability to generalize across tasks that are unseen during training. We gather the following datasets for this evaluation setup:

  • ETHICS Hendrycks et al. (2020): This is a multi-class dataset with texts paired with labels of human moral values like justice, well-being, duties, virtues, and commonsense morality.

  • Irony detection: This is a binary classification dataset with pairs of texts and corresponding labels indicating whether the text is ironic.

  • SMS spam detection: This is a binary classification dataset comprising text and label pairs, where the positive class indicates that an SMS text message is spam and the negative class indicates that it is not.

  • Stance classification: This dataset is for the task of classifying the stance of a user on a particular topic (abortion, atheism, climate change, feminism, Hillary) from the text of their tweet.

  • SUBJ Pang and Lee (2004): This is a binary classification dataset with texts labeled with either being subjective or objective.

5 Results and Discussion

Table 1: Cross-domain evaluation (F1 scores)
| Dataset  | K | EXnet-S | EXnet-M | EXnet-L | GPT-Neo | Zhong et al. |
| Cloth    | 2 | 0.88    | 0.92    | 0.93    | 0.59    | -     |
| Cloth    | 4 | 0.88    | 0.94    | 0.93    | 0.63    | -     |
| Cloth    | 8 | 0.89    | 0.94    | 0.95    | 0.62    | -     |
| Abstract | 2 | 0.69    | 0.73    | 0.81    | 0.45    | 0.81* |
| Abstract | 4 | 0.70    | 0.73    | 0.84    | 0.48    | 0.81* |
| Abstract | 8 | 0.71    | 0.75    | 0.84    | 0.46    | 0.81* |
| Water    | 2 | 0.70    | 0.72    | 0.76    | 0.34    | -     |
| Water    | 4 | 0.71    | 0.72    | 0.76    | 0.37    | -     |
| Water    | 8 | 0.71    | 0.72    | 0.76    | 0.38    | -     |
| Stock    | 2 | 0.67    | 0.71    | 0.72    | 0.30    | -     |
| Stock    | 4 | 0.70    | 0.77    | 0.76    | 0.30    | -     |
| Stock    | 8 | 0.74    | 0.79    | 0.80    | 0.28    | -     |
Table 2: Cross-task evaluation (F1 scores)
| Dataset  | K | EXnet-S | EXnet-M | EXnet-L | GPT-Neo | Zhong et al. |
| ETHICS   | 2 | 0.48    | 0.62    | 0.65    | 0.43    | -     |
| ETHICS   | 4 | 0.66    | 0.70    | 0.74    | 0.44    | -     |
| ETHICS   | 8 | 0.73    | 0.78    | 0.79    | 0.48    | -     |
| Irony    | 2 | 0.65    | 0.68    | 0.69    | 0.21    | 0.82* |
| Irony    | 4 | 0.66    | 0.72    | 0.74    | 0.21    | 0.82* |
| Irony    | 8 | 0.69    | 0.72    | 0.75    | 0.28    | 0.82* |
| SMS-spam | 2 | 0.24    | 0.32    | 0.34    | 0.14    | 0.35* |
| SMS-spam | 4 | 0.26    | 0.33    | 0.37    | 0.18    | 0.35* |
| SMS-spam | 8 | 0.34    | 0.39    | 0.42    | 0.22    | 0.35* |
| Stance   | 2 | 0.78    | 0.81    | 0.83    | 0.30    | 0.74* |
| Stance   | 4 | 0.80    | 0.85    | 0.87    | 0.36    | 0.74* |
| Stance   | 8 | 0.81    | 0.84    | 0.88    | 0.32    | 0.74* |
| SUBJ     | 2 | 0.38    | 0.43    | 0.48    | 0.39    | 0.59* |
| SUBJ     | 4 | 0.46    | 0.49    | 0.56    | 0.36    | 0.59* |
| SUBJ     | 8 | 0.67    | 0.71    | 0.74    | 0.40    | 0.59* |

We use the evaluation datasets discussed in Section 4.2 and measure EXnet's performance over each of them using the most commonly used classification metric, i.e., F1-score. We use F1-score instead of accuracy because some of the evaluation datasets are imbalanced. We compare three sizes of EXnet, which we call EXnet-S, EXnet-M, and EXnet-L in increasing order of model size. EXnet-S is a model with 15 million parameters that uses Small BERT Turc et al. (2019) as the encoder, with 6 layers and an embedding dimension of 256. EXnet-M is a 112 million parameter model that uses BERT-base as the encoder, with 12 layers and an embedding dimension of 768. EXnet-L is a 338 million parameter model that uses BERT-large as the encoder, with 24 layers and an embedding dimension of 1024. As a baseline, we use the 1.3 billion parameter GPT-Neo model in an in-context setting (without fine-tuning) for comparison with EXnet. Unlike EXnet, which requires only K positive examples, GPT-Neo requires both K positive and K negative examples, as showing only positives leads it to predict "yes" for every query; previous work has shown that this is because GPT-Neo is trained on next-word prediction (an illustrative sketch of this prompt construction is given at the end of this section). Furthermore, we also compare EXnet's performance with the meta-tuned model from Zhong et al. (2021). *Note: We report results of Zhong et al. from their original meta-tuned zero-shot (K=0) model. The results of our experiments are detailed in Table 1 and Table 2. It can be observed that EXnet is able to use the knowledge from the K support examples, as increasing K shows improved performance over most of the datasets. EXnet-M and EXnet-L outperform GPT-Neo over all the datasets in both cross-domain and cross-task evaluation by a large margin (up to 48% gains in terms of F1 score). EXnet-S also outperforms GPT-Neo on all the datasets, except for SUBJ, where GPT-Neo shows slightly higher performance by a margin of 0.01 F1 score when K=2. In comparison to Zhong et al., EXnet with larger K is able to outperform their meta-tuned model (770M parameters) over most of the evaluation datasets. Based on our results, we make the following observations:

  • Benefits of increasing K depend on task complexity: In our cross-task evaluation, it can be seen from Table 2 that on complex tasks like SUBJ and ETHICS, which involve very subjective human concepts, EXnet shows a large improvement when K is increased from 2 to 8 examples. This demonstrates that in-context learning has more utility whenever the concept required to understand the task is complex and subjective.

  • In-context learning is more useful for improving cross-task performance than cross-domain performance: As seen in Table 1, all three variants of EXnet are able to handle cross-domain tasks, but increasing K does not provide a substantial boost in performance. A possible reason for this is that cross-domain evaluation involves tasks similar to the training tasks, so EXnet is able to understand the task well with limited examples (K=2).
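As referenced earlier in this section, a sketch of how the in-context prompt for the GPT-Neo baseline can be assembled with K positive and K negative demonstrations; the exact prompt wording here is illustrative, not necessarily the one used in our experiments:

```python
# Illustrative construction of an in-context prompt for the GPT-Neo baseline, with K
# positive and K negative demonstrations followed by the query. The wording is an
# assumption for illustration; the generated continuation ("yes"/"no") is the prediction.
def gpt_neo_prompt(label, positives, negatives, query_text):
    demos = [(t, "yes") for t in positives] + [(t, "no") for t in negatives]
    lines = [f"question: is the text about <{label}>? text: <{t}> answer: {a}" for t, a in demos]
    lines.append(f"question: is the text about <{label}>? text: <{query_text}> answer:")
    return "\n".join(lines)

prompt = gpt_neo_prompt(
    "sports",
    positives=["The match went to extra time.", "She broke the world record in the 100m."],
    negatives=["The senate passed the bill.", "The recipe calls for two eggs."],
    query_text="The striker scored a hat-trick last night.",
)
```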

6 Conclusion

Zero-shot models do not provide sufficient accuracy for most real-world applications. We argue that in-context learning without gradient updates is better suited for most real-world applications. For example, to detect a new set of emotions in product reviews, one can take a model trained on a similar task using publicly available datasets. Existing in-context learning models are huge and do not provide great task accuracy.

In this paper, we presented EXnet, a lightweight model architecture specifically designed for in-context learning that does not place a limit on the number of examples. We evaluated EXnet on 9 unseen datasets and demonstrated its strong cross-domain and cross-task understanding, even outperforming models that are 86× its size. In the future, an interesting direction to explore is pre-training the encoder in EXnet with different strategies and observing the trends in performance.

References

  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Choi et al. (2018) Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 87–96.
  • Demszky et al. (2020) Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. 2020. Goemotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4040–4054.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Gupta et al. (2020) Aakriti Gupta, Kapil Thadani, and Neil O’Hare. 2020. Effective few-shot classification with transfer learning. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1061–1066.
  • Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2020. Aligning ai with shared human values. In International Conference on Learning Representations.
  • Li and Roth (2002) Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.
  • Loshchilov and Hutter (2018) Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In International Conference on Learning Representations.
  • Ma et al. (2021) Tingting Ma, Jin-Ge Yao, Chin-Yew Lin, and Tiejun Zhao. 2021. Issues with entailment-based zero-shot text classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 786–796.
  • Min et al. (2021) Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heinz, and Dan Roth. 2021. Recent advances in natural language processing via large pre-trained language models: A survey. arXiv preprint arXiv:2111.01243.
  • Ohashi et al. (2021) Sora Ohashi, Junya Takayama, Tomoyuki Kajiwara, and Yuki Arase. 2021. Distinct label representations for few-shot text classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 831–836.
  • Pang and Lee (2004) Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 271–278.
  • Schick and Schütze (2020) Timo Schick and Hinrich Schütze. 2020. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676.
  • Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1199–1208.
  • Turc et al. (2019) Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.
  • Utama et al. (2021) Prasetya Utama, Nafise Sadat Moosavi, Victor Sanh, and Iryna Gurevych. 2021. Avoiding inference heuristics in few-shot prompt-based finetuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9063–9074.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. Advances in neural information processing systems, 29.
  • Wei et al. (2021) Jason Wei, Chengyu Huang, Soroush Vosoughi, Yu Cheng, and Shiqi Xu. 2021. Few-shot text classification with triplet networks, data augmentation, and curriculum learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5493–5500.
  • Yin et al. (2019) Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3914–3923.
  • Zhao et al. (2021) Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697–12706. PMLR.
  • Zhong et al. (2021) Ruiqi Zhong, Kristy Lee, Zheng Zhang, and Dan Klein. 2021. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2856–2878.