
Contrastive Instruction Tuning

Tianyi Lorena Yan, Fei Wang, James Y. Huang, Wenxuan Zhou, Fan Yin
Aram Galstyan, Wenpeng Yin, Muhao Chen
University of Southern California    University of California, Los Angeles
The Pennsylvania State University    University of California, Davis
{tianyiy, fwang598, huangjam, zhouwenx}@usc.edu    [email protected]
[email protected]    [email protected]    [email protected]
Abstract

Instruction tuning has been a promising approach for improving the performance of large language models (LLMs) on unseen tasks. However, current LLMs exhibit limited robustness to unseen instructions, generating inconsistent outputs when the same instruction is phrased in slightly varied forms or language styles. This behavior indicates LLMs’ lack of robustness to textual variations and generalizability to unseen instructions, potentially leading to trustworthiness issues. Accordingly, we propose Contrastive Instruction Tuning (Coin), which maximizes the similarity between the hidden representations of semantically equivalent instruction-instance pairs while minimizing the similarity between semantically different ones. To facilitate this approach, we augment the existing FLAN collection by paraphrasing task instructions. Experiments on the PromptBench benchmark show that Coin consistently improves LLMs’ robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy. Code is available at https://github.com/luka-group/CoIN.

1 Introduction

Instruction tuning has emerged as an essential training paradigm of large language models (LLMs; Wei et al. 2022; Sanh et al. 2022; Mishra et al. 2022). By training models on various pairs of task instructions and instances, instruction tuning enables LLMs such as TK-Instruct Wang et al. (2022), InstructGPT Ouyang et al. (2022), FLAN-T5 Wei et al. (2022), and Alpaca Taori et al. (2023) to follow various human instructions and fulfill user intents (Wang et al., 2022; Zhang et al., 2023).

Figure 1: An example from CoLA Warstadt et al. (2019) shows that current LLMs like Alpaca may generate entirely different responses when presented with semantically equivalent but textually different instructions.

Despite these advancements, current instruction-tuned LLMs are not robust to instruction variations. Their performance may vary significantly when the same instruction is re-formulated with different forms or language styles (Zhu et al., 2023; Liu et al., 2023b). While optimal instructions for specific user intents may exist, there is no guarantee that user-crafted instructions will precisely match them. Indeed, user-crafted instructions often contain variations that can cause drops in LLMs’ performance, such as unintended minor mistakes (e.g., a typo; Wang et al. 2021, 2023a), different expression preferences (e.g., choice of synonyms or paraphrases; Gu et al. 2023; Zhu et al. 2023), inefficient descriptions Sun et al. (2023), and varying formats Liang et al. (2023). As shown in Figure 1, given different instructions of the same intent, an instruction-tuned LLM like Alpaca can generate entirely different responses, some of which lead to wrong answers. LLMs’ current lack of robustness to instruction variations severely limits their real-world applicability. However, prior instruction tuning methods mainly focus on aligning the desired output for a given instruction-input pair and do not explicitly address models’ robustness against variations in instructions (Ouyang et al., 2022; Wei et al., 2022; Zhang et al., 2023; Longpre et al., 2023).

In this paper, we propose Contrastive Instruction Tuning (Coin), an instruction tuning method that leverages contrastive learning to align the hidden representations of instruction-instance pairs that are semantically equivalent but textually different, and to differentiate those that are semantically distinct. Given the same input and output instance, we pair each instruction with its perturbed versions as positive samples. Having observed that the hidden representations of data from different tasks already have low cosine similarity with each other (Liu et al., 2023a), we use the same instruction paired with different instance inputs and outputs as hard negative samples (see Section 3.2 for details). Intuitively, by recognizing that the same instruction with different formulations can have the same meaning, the model can generate more consistent answers given different instructions of the same intent and become more robust to variations in language expressions. At the same time, negative samples encourage the model to understand that an instruction can lead to different outcomes in different contexts, helping the model distinguish inputs with different user intents.

We assess LLMs’ robustness on the PromptBench benchmark Zhu et al. (2023), which introduces variations to instructions of a diverse set of tasks at character, word, sentence, and semantic levels. Experiments on the benchmark show that Coin significantly improves task performance and reduces response variation of Alpaca on unseen instructions with variations at all four levels, achieving an average accuracy improvement of +2.5% compared with continual instruction tuning on the same dataset.

Our contributions are three-fold. First, we propose a contrastive instruction tuning method, Coin, to enhance LLMs’ robustness to semantic-invariant instruction variations. Second, experiments on PromptBench demonstrate the effectiveness of Coin in handling semantically equivalent instructions that differ at the character, word, sentence, and semantic levels. Third, to facilitate the proposed approach, we augment the FLAN collection, a widely used instruction tuning dataset, with contrastive instructions. We will release the augmented dataset, consisting of 52k entries and 104k instructions, to support future work in this direction.

2 Related Work

In this section, we provide a brief summary of three closely related topics.

Instruction Tuning and Generalizability. Instruction tuning has emerged as a pivotal technique for enhancing the generalizability of LLMs (Sanh et al., 2022; Zhang et al., 2023; Ouyang et al., 2022). This capability is crucial for LLMs, as it determines models’ performance when encountering new data. The efficacy of instruction tuning becomes more evident as the number of tasks scales up Xu et al. (2022). Consequently, many recent studies have focused on fine-tuning LLMs on a wide range of tasks. For instance, large-scale datasets that encompass numerous NLP tasks and multiple data sources have been curated for effectively enhancing LLMs’ zero-shot generalizability (Sanh et al., 2022; Wei et al., 2022; Niu et al., 2023; Kung et al., 2023; Wang et al., 2023b). Despite performance gains on unseen tasks, LLMs fine-tuned with large-scale instruction datasets remain vulnerable to how the same instruction is expressed (Wang et al., 2021; Zhu et al., 2023; Liu et al., 2023b; Sun et al., 2023; Liang et al., 2023). This limitation motivates us to enhance LLMs’ robustness to instruction variations in this work.

Robustness of Instruction-Tuned LLMs. With the increasing reliance on LLMs, recent works have focused on developing a comprehensive understanding of the robustness of instruction-tuned language models. Zhu et al. (2023), Gu et al. (2023), and Wang et al. (2023a) add perturbations to instructions across multiple levels (character, word, sentence, etc.) and show that current LLMs are not robust to the introduced perturbations. LLMs’ performance can also degrade when they are presented with unobserved, paraphrased versions of task instructions Sun et al. (2023). Furthermore, inconsistency in the format and style of instruction expressions, such as placing instructions before, in between, or after the input instances, can decrease models’ performance Liang et al. (2023). While evaluating and analyzing LLMs’ robustness has garnered more attention, enhancing the models’ robustness, particularly against varied instructions of the same task, remains an underexplored problem. Our work is dedicated to addressing this gap.

Contrastive Learning. Contrastive learning, a self-supervised technique that involves training a model to contrast between positive and negative pairs of data points, has rapidly evolved and been adopted in NLP tasks such as sentence embedding Gao et al. (2022), summarization Liu et al. (2022), named entity recognition Layegh et al. (2023), and logical reasoning Bao et al. (2023). Within the context of instruction tuning, contrastive learning has been used with prefix-training to enhance LLMs’ controllability towards desired attributes Qian et al. (2022). However, the focus of that work remains on steering the generated outputs towards an attribute (such as being sentimentally positive) that is assumed to be known but is difficult to specify given the diversity of tasks that LLMs may handle, and it does not explicitly tackle the challenge of LLMs’ robustness against variations in instruction expressions. Inspired by the observation that contrastive learning is suitable for aligning semantically related sentences Gao et al. (2022), we encourage LLMs to learn the semantic invariance of varied instructions for the same task and aim to address LLMs’ imperfect robustness at all four levels: character, word, sentence, and semantic.

3 Contrastive Instruction Tuning

In this section, we first provide a formal definition of contrastive instruction tuning (Section 3.1). Then, we introduce contrastive data selection (Section 3.2) and the learning objective (Section 3.3) of our method Coin.

3.1 Overview

Figure 2: Illustration of Coin. A paraphrased instruction is used as the positive sample (green) given the same instance input and output. An instruction paired with different instance input and output is used as the negative sample (red). Cosine similarity between the hidden representations of original and paraphrased instruction-instance pairs is encouraged to be high, and vice versa for the paired negative samples. As we observe that the cosine similarity between the hidden representations of data from different tasks is already low Liu et al. (2023a), we use the same instruction paired with different instance input and output as hard negative samples to provide more informative training signals.

Assume that we have an (autoregressive) language model $\mathcal{M}$ and a dataset $\mathcal{D}=\{(I_i, x_i, y_i)\}_{i=1}^{N}$, in which $I_i$ denotes the task instruction, $x_i$ is the instance input, and $y_i$ is the desired output. For each original entry, we create a semantically equivalent entry $(I_i^+, x_i^+, y_i^+)$, where $x_i^+ = x_i$ and $y_i^+ = y_i$. $I_i^+$ is constructed by adding textual perturbations to the original instruction while ensuring the underlying semantic meaning remains the same. Our goal is to learn a model $\mathcal{M}$ such that its hidden representations of semantically equivalent instruction-instance pairs, denoted as $h_{\mathcal{M}}(I_i, x_i, y_i)$ and $h_{\mathcal{M}}(I_i^+, x_i^+, y_i^+)$, are close to each other in $\mathcal{M}$’s hidden representation space, thereby enhancing its robustness against instruction variations.
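To make the setup concrete, the following is a minimal sketch of one entry and its semantically equivalent counterpart; the InstructionInstance structure and the example values are hypothetical, not the released dataset format.

```python
# A hypothetical container for one (I_i, x_i, y_i) entry.
from dataclasses import dataclass

@dataclass
class InstructionInstance:
    instruction: str       # I_i: the task instruction
    instance_input: str    # x_i: the instance input
    instance_output: str   # y_i: the desired output

original = InstructionInstance(
    instruction="Is the following sentence grammatically acceptable?",
    instance_input="The boy quickly ran across the finish line.",
    instance_output="yes",
)

# Positive sample: perturbed instruction, identical input and output
# (x_i^+ = x_i and y_i^+ = y_i).
positive = InstructionInstance(
    instruction="Judge whether this sentence is acceptable English.",
    instance_input=original.instance_input,
    instance_output=original.instance_output,
)
```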

As explored in many previous studies, instruction tuning with large-scale datasets mainly focuses on aligning the desired output for a given instruction-instance pair from various tasks (Sanh et al., 2022; Longpre et al., 2023; Wei et al., 2022). However, current LLMs exhibit a lack of robustness when facing the same instruction expressed in different forms (Sun et al., 2023; Zhu et al., 2023; Liang et al., 2023), making them unreliable when deployed in the real world. To mitigate this limitation, our method Coin further leverages contrastive learning to maximize the similarity between hidden representations of semantically equivalent instruction-instance pairs, enhancing models’ robustness and consistency under variations in instruction expressions.

3.2 Contrastive Data Selection

Selecting effective positive and negative samples for each instruction is critical to contrastive learning. In Coin, we construct positive samples by varying the phrasing or template structure of original instructions, ensuring that the positive samples still share the same input and output as the original instance. This approach helps the model learn to align semantically similar instructions despite differences in phrasing.

For negative samples, we observe that the contrastive loss converges quickly when using instruction-input pairs from different tasks (i.e., normal negatives), leading to minor improvement in robustness. This observation is consistent with findings in prior studies (Liu et al., 2023a): LLMs can distinguish between instructions of different tasks such that their hidden representations already have low cosine similarity. To collect hard negatives, we draw inspiration from near-OOD samples, which are data that come from the same task but different classes (Winkens et al., 2020; Fort et al., 2021; Liu et al., 2023a). Prior studies found that it is more difficult for models to detect near-OOD samples than samples from other tasks (far-OOD), indicating that the hidden representations of near-OOD samples may not be distinguishable enough and thus can provide informative supervision signals for contrastive learning. Accordingly, we choose a sample $(I_i^-, x_i^-, y_i^-)$ that shares the same instruction as the original instance ($I_i^- = I_i$) but is paired with a different input ($x_i^- \neq x_i$) and output ($y_i^- \neq y_i$) as a negative sample, as sketched below. For example, if $y_i$ is "yes", then $y_i^-$ can be "no", ensuring the fundamental intent of the instruction-instance pair differs from the original one. Based on this approach, Coin encourages the model to align semantically equivalent instructions with different phrasings while contrasting inputs with different user intents.
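A minimal sketch of this hard-negative selection, reusing the hypothetical InstructionInstance structure from the earlier sketch; this is not the authors’ released implementation.

```python
import random

def pick_hard_negative(entry, dataset):
    """Select a hard negative per Section 3.2: same instruction (I_i^- = I_i),
    different input (x_i^- != x_i) and output (y_i^- != y_i), e.g., "yes" vs.
    "no". Assumes `dataset` contains at least one such entry."""
    candidates = [
        other for other in dataset
        if other.instruction == entry.instruction
        and other.instance_input != entry.instance_input
        and other.instance_output != entry.instance_output
    ]
    return random.choice(candidates)
```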

3.3 Learning Objective

Our method Coin is illustrated in Figure 2. We construct the training batch such that each original sample is matched with a perturbed instruction and an identical instance as its positive sample. All other in-batch samples are hard negatives selected according to Section 3.2, i.e., samples that share the same instruction but are paired with different instances.

Let $h_i$, $h_i^+$, and $h_i^-$ denote model $\mathcal{M}$’s hidden representations of the original, positive, and negative instruction-instance pairs, respectively. Since each original pair may have multiple in-batch negatives, we use $h_{ij}^-$ to denote the hidden representation of the $j$-th negative sample. To align the hidden representations $h_i$ and $h_i^+$, we optimize the model $\mathcal{M}$ with the contrastive loss $\mathcal{L}_{ctr}^{i}$, defined as

$$\mathcal{L}_{ctr}^{i} = -\log\frac{e^{\mathrm{sim}(h_i,\,h_i^{+})/\tau}}{e^{\mathrm{sim}(h_i,\,h_i^{+})/\tau} + \sum_{j} e^{\mathrm{sim}(h_i,\,h_{ij}^{-})/\tau}},$$

where $\mathrm{sim}(h_1, h_2)$ is the cosine similarity $\frac{h_1^{\top} h_2}{\lVert h_1 \rVert \cdot \lVert h_2 \rVert}$, and $\tau$ is a temperature hyperparameter. In Coin, we obtain the hidden representations from the hidden state of the last token of the language model’s decoder. (We also experimented with other pooling methods, such as max and average pooling, but found that using the last token’s hidden state yielded better results.)
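A sketch of this contrastive loss in PyTorch, under the assumption that the last-token hidden states have already been extracted; the tensor shapes and function names are ours.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h, h_pos, h_neg, tau=0.05):
    """InfoNCE-style loss matching the equation above.
    h, h_pos: (B, d) last-token hidden states of the original and positive
    instruction-instance pairs; h_neg: (B, K, d) states of the K in-batch
    hard negatives paired with each original."""
    sim_pos = F.cosine_similarity(h, h_pos, dim=-1) / tau                # (B,)
    sim_neg = F.cosine_similarity(h.unsqueeze(1), h_neg, dim=-1) / tau   # (B, K)
    # -log( e^{sim+} / (e^{sim+} + sum_j e^{sim-_j}) ): cross entropy with
    # the positive similarity placed at index 0.
    logits = torch.cat([sim_pos.unsqueeze(1), sim_neg], dim=1)           # (B, 1+K)
    labels = torch.zeros(h.size(0), dtype=torch.long, device=h.device)
    return F.cross_entropy(logits, labels)
```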

To preserve the generation ability of the language model, we follow Liu et al. (2022) to include the standard cross entropy loss for each instruction pair, which can be defined as follows:

$$\mathcal{L}_{ent}^{i} = -\frac{1}{l}\sum_{k=1}^{l}\log p(y_k \mid I_i, x_i, y_{<k}),$$

where $l$ is the length of the desired output for the instruction-input pair $(I_i, x_i)$. This loss is computed for all samples in the batch.
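For reference, a sketch of this token-level cross entropy in PyTorch for a causal LM; the -100 masking convention for non-output positions is an assumption, and the batch-level averaging is a simplification of the per-example $1/l$ average above.

```python
import torch.nn.functional as F

def generation_loss(logits, labels):
    """Token-level cross entropy matching L_ent above.
    logits: (B, T, V); labels: (B, T), with positions outside the desired
    output set to -100 (a common masking convention, assumed here)."""
    shift_logits = logits[:, :-1, :].contiguous()  # predict token k+1 from k
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```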

Combining the above two parts, the overall learning objective is

$$\mathcal{L}^{i}_{\text{Coin}} = \mathcal{L}^{i}_{ent} + \max\left(\lambda,\ \mathrm{detach}\left(\frac{\mathcal{L}^{i}_{ent}}{\mathcal{L}^{i}_{ctr}}\right)\right)\mathcal{L}^{i}_{ctr},$$

where $\mathrm{detach}(\cdot)$ indicates that the loss value is detached from the computation graph and is thus treated only as a scalar, and $\lambda$ is the upper bound of the weight that is assigned to the contrastive loss. Based on empirical results, we found that setting $\lambda$ too high, thereby significantly increasing the magnitude of the contrastive loss $\mathcal{L}_{ctr}$ relative to the generation loss $\mathcal{L}_{ent}$, adversely affects the model’s generation ability. To mitigate this issue, we scale the contrastive loss to the same magnitude as the generation loss while setting a bound on the weighting, ensuring a balanced influence between enhancing robustness and maintaining generative performance. For more details on the weighting choice of the contrastive loss, refer to Section 5.3.
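A sketch of the combined objective as written in the equation above; the function name is ours, and l_ent and l_ctr are assumed to be scalar loss tensors.

```python
import torch

def coin_loss(l_ent, l_ctr, lam=1000.0):
    """Combined objective as written above. The ratio L_ent / L_ctr is
    detached so it acts as a plain scalar weight, and clamp(min=lam)
    implements max(lambda, ratio)."""
    weight = (l_ent / l_ctr).detach().clamp(min=lam)
    return l_ent + weight * l_ctr
```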

4 Experiment

In this section, we evaluate Coin’s performance on enhancing LLMs’ robustness to instruction variations on PromptBench, specifically on 10 GLUE datasets with unseen instructions perturbed at different levels. In this paper, “unseen instructions” refers to those whose textual expressions do not appear in the instruction-tuning dataset; if a model exhibits inadequate robustness when handling unseen instructions for known tasks, its performance is likely to decrease further when confronted with unknown tasks, so we consider the former a rigorous evaluation setting without additional confounding factors. We first provide an overview of the experiment settings (Section 4.1, Section 4.2, and Section 4.3) and then present a detailed analysis of the experiment results (Section 4.4).

4.1 Training Datasets

In this work, we conduct experiments on a widely used instruction tuning dataset: the FLAN collection Wei et al. (2022), a large-scale collection that encompasses a wide range of tasks, including natural language inference, common sense reasoning, sentiment analysis, paraphrase identification, etc. The collection is created by transforming 62 publicly available text datasets into instructional formats, with 10 unique instruction templates manually composed for each dataset. We choose 25 datasets with deterministic answers from the collection. To ensure each dataset has an equal chance of being sampled into the training set of Coin, we iterate through the training split of each dataset in a round-robin manner, as sketched below. For each entry, we create a positive sample by randomly selecting a predefined instruction template not used by the entry to paraphrase the instruction. Only paraphrasing is used for creating training data, while various types of perturbations are included for evaluation (refer to Section 4.3): avoiding assumptions about specific types of noise in instructions is crucial given the high uncertainty LLMs face in real-world deployment, so a robustness training method that generalizes to other types of perturbations is more desirable. We then select one entry from the remaining entries of the same dataset as a negative sample, following the strategy in Section 3.2. Refer to Appendix A for more details of the processed dataset.
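A sketch of the round-robin sampling mentioned above; the datasets mapping and function name are hypothetical, not the authors’ preprocessing code.

```python
def round_robin_sample(datasets, total):
    """Draw `total` entries by cycling through datasets one entry at a time,
    so each dataset has an equal chance of contributing (Section 4.1).
    `datasets` maps a dataset name to an iterable of entries."""
    iters = {name: iter(ds) for name, ds in datasets.items()}
    names = list(iters)
    samples, i = [], 0
    while iters and len(samples) < total:
        name = names[i % len(names)]
        i += 1
        it = iters.get(name)
        if it is None:
            continue  # this dataset is exhausted
        try:
            samples.append(next(it))
        except StopIteration:
            iters.pop(name)
    return samples
```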

Figure 3: Models’ average accuracy (left) and standard deviation (right) across 10 GLUE datasets, with each dataset having six unseen instructions with no perturbation (clean) or perturbation added at the character, word, sentence, and semantic levels. Coin shows consistent improvement in accuracy and decrease in standard deviation across all types of perturbation compared to the base model and continual instruction tuning. Coin obtains significant improvement in robustness against word-, character-, and sentence-level perturbations.

4.2 Implementation Details

We use Alpaca Taori et al. (2023), a model instruction-tuned from the LLaMA model Touvron et al. (2023) on the 52k Self-Instruct dataset, as the base model. When training models on the augmented FLAN collection, we use the same set of hyper-parameters, with the learning rate, batch size, and cut-off length set to 1×10⁻⁴, 64, and 256, respectively. Following Gao et al. (2022), we set the temperature τ to 0.05; since we observe that the magnitude of the contrastive loss can be small during the later phase of training, we set λ to 1000. All experiments are run on 2 NVIDIA RTX A5000 GPUs.
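For reference, the reported hyperparameters gathered into one configuration sketch; the key names are ours, not the authors’ training code.

```python
training_config = {
    "learning_rate": 1e-4,
    "batch_size": 64,
    "cutoff_length": 256,   # maximum sequence length
    "temperature": 0.05,    # tau in the contrastive loss
    "lambda_max": 1000,     # lambda in the combined objective
}
```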

4.3 Evaluation Setting

To evaluate models’ robustness against variations in the expression of instructions, we adopt the PromptBench benchmark Zhu et al. (2023). Incorporating a diverse set of tasks, such as sentiment analysis, grammar correctness, duplicate sentence detection, and natural language inference, PromptBench introduces perturbations to task instructions at various levels: character, word, sentence, and semantic. Regarding the data used for evaluation, we sample 300 instruction-instance pairs from each GLUE task wherever the validation set exceeds this size, due to the extensive computational requirement of evaluating the models on the entire benchmark. For each dataset, PromptBench predefines 20 instructions. We ensure that all selected and perturbed instructions for each dataset are not seen during training. Given that all instructions are unseen while GLUE tasks are seen during training, this setting allows a more focused evaluation of LLMs’ robustness against variations in instructions without the confounding factor of task generalization.

Instruction Variations. Regarding instructions, we select six clean instructions predefined for each task and then create perturbed versions of each instruction. Following PromptBench, we use DeepWordBug Gao et al. (2018) to introduce character-level perturbations to certain words, and TextFooler Jin et al. (2020) to replace words with contextually similar words. At the sentence level, we follow CheckList Ribeiro et al. (2020) and append randomly generated sequences, each consisting of 10 letters and digits, to the end of an instruction to distract LLMs, as sketched below. For the semantic-level perturbation, PromptBench defines 10 instructions per task that paraphrase the original instruction while following the linguistic behavior of six languages: Chinese, French, Arabic, Spanish, Japanese, and Korean. To keep the number of instructions the same as for other types of perturbation, we randomly select one instruction from each language defined for each task, all of which differ from the clean instructions. We also ensure that instructions used for evaluation differ from all instructions in the training dataset and thus are unseen by the model, preventing data contamination.
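A sketch of the sentence-level perturbation described above; this is an illustration, not PromptBench’s exact generator.

```python
import random
import string

def checklist_suffix(instruction, length=10, rng=random):
    """Sentence-level (CheckList-style) perturbation: append a random
    10-character sequence of letters and digits to distract the model."""
    suffix = "".join(rng.choices(string.ascii_letters + string.digits, k=length))
    return f"{instruction} {suffix}"
```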

Metrics. For each type of perturbation, we report the average accuracy and standard deviation across the six instructions created for each GLUE dataset, as sketched below.
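A minimal sketch of these per-group metrics; whether the paper uses the sample or population standard deviation is not specified, so the sample version is assumed here.

```python
import statistics

def group_metrics(instruction_accuracies):
    """Average accuracy and standard deviation across the six instructions
    of one perturbation group for one GLUE dataset."""
    return (statistics.mean(instruction_accuracies),
            statistics.stdev(instruction_accuracies))  # sample std (assumption)
```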

4.4 Results

Figure 4: UMAP McInnes et al. (2020) visualization of the hidden representations of the decoder’s last output token from the continually instruction-tuned model (left) and Coin (right). 300 data points are selected from CoLA Warstadt et al. (2019) with no perturbations (clean) or perturbations added at different levels. Coin’s representations of inputs with instruction variations are clustered closer to each other compared to the continually instruction-tuned model, especially inputs with perturbations at the word, character, and sentence levels.

In Figure 3, we evaluate the base model, continual instruction tuning (i.e., the base model fine-tuned on the same data as Coin with the cross-entropy loss only), and Coin on five groups of instructions across 10 GLUE datasets. Except for the clean group, which includes the original instructions defined for each dataset, each group contains instructions with the same type of perturbation: character, word, sentence, or semantic.

The base model exhibits low accuracy and large performance variance when given instructions with different perturbations or instructions within the same perturbation group. With only around 52% accuracy on clean instructions, the base model’s performance decreases further when instructions are perturbed at the character, word, and sentence levels. The largest accuracy gap across different groups is 7.7%, observed between the word and semantic groups. For instructions within the same group, the base model exhibits a standard deviation ranging from 16.9% to 19.0%. These observations demonstrate that the base model is sensitive to how instructions are formulated for different tasks.

Compared to the base model, the continually instruction-tuned model shows increases in accuracy, which is expected as the model is exposed to more data and trained for more steps. Nevertheless, the performance gap between different groups can still be as large as 6.1%, observed between the clean group and the group with word-level perturbation. This shows that the continually instruction-tuned model still lacks robustness to unseen instructions with variations across different levels.

Compared to continual instruction tuning, Coin further reduces performance variance and consistently improves accuracy for instructions within and across different groups without introducing any new data or training steps. As can be seen from Figure 3, Coin achieves improvements in accuracy for all types of perturbation, up to 4.4% for word-level perturbations, where the continually instruction-tuned model exhibits its largest drop in performance. The largest performance gap is reduced to 3.6%. The consistent improvement across all types of perturbation demonstrates the generalizability of Coin at enhancing models’ robustness against variations in instructions at different levels. Coin also decreases the standard deviation on instructions from the five groups by 1.6%, 1.9%, 2.1%, 2.5%, and 1.2%, respectively, showing that Coin can effectively help the model become less sensitive to specific instructions for each task and more consistent in its performance. For more detailed results, refer to Table 2.

5 Analyses

To provide a more comprehensive view of the impact of Coin on the model’s robustness to instruction variations, we further analyze the results of our method by examining the hidden representations of instruction variants (Section 5.1), task category (Section 5.2), and the weighting choice for the contrastive loss (Section 5.3).

5.1 Closer Representations of Instruction Variants

To understand the impact of Coin on the representations of instructions with variations at different levels, we visualize the hidden states of the last output tokens from the decoder’s last Transformer layer. Specifically, we select 300 data points from CoLA Warstadt et al. (2019), choose one of its instructions, add perturbations at different levels to the instruction, and obtain the hidden states from the model.
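A sketch of this visualization step, assuming the umap-learn package and that the (N, d) last-token hidden states have already been collected; the parameter values are illustrative defaults, not the paper’s settings.

```python
import numpy as np
import umap  # umap-learn package

def project_hidden_states(hidden_states: np.ndarray, seed: int = 0) -> np.ndarray:
    """Project (N, d) last-token decoder hidden states to 2-D for a plot
    like Figure 4."""
    reducer = umap.UMAP(n_components=2, random_state=seed)
    return reducer.fit_transform(hidden_states)
```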

As observed in Figure 4, Coin’s hidden representations of inputs with instruction variations at different levels are much closer to one another than those of the continually instruction-tuned model. In the embedding space of the continually instruction-tuned model, the representations of instructions with different perturbations, especially at the character and word levels, are clustered into almost distinct groups. This may indicate that the model treats data points with the same instruction varied at different levels differently and thus is more sensitive to how the same instruction is formulated.

In contrast, the representations of data points with character, word, and sentence level variations are less distinguishable in Coin’s embedding space, with representations of instructions varied at the word level (red) having greater overlap with those of the clean group (blue). This observation can be associated with Coin’s varied improvement in performance across different perturbations. As shown in Figure 3, Coin achieves more evident improvement on instructions with word, character, and sentence level perturbations. It can be concluded from the two figures that when Coin effectively aligns the representations of perturbed instructions to those of the clean instructions, the model becomes more capable of capturing the original semantic meaning of the instructions. Thus, it becomes less sensitive to perturbations in instructions.

The representations of instructions with semantic-level perturbation are located relatively far from those of instructions with other types of perturbation. This is expected, as paraphrasing introduces new structure and wording to the original instruction, which may lead to varied hidden representations. Nonetheless, Coin stabilizes the representations of the original and paraphrased instructions, demonstrating that Coin can effectively align the representations of instruction variants with each other and thus enhance the model’s robustness to instruction variations.

5.2 Impact on Different Tasks

Task | Continual Instruction Tuning (Accuracy, Std) | Coin (Accuracy, Std) | Δ (Accuracy, Std); all values in %
Sentiment Analysis | 89.0, 4.1 | 90.4, 3.1 | +1.4, -1.1
Natural Language Inference | 64.4, 3.7 | 66.1, 3.5 | +1.7, -0.2
Paraphrase Identification | 63.0, 11.0 | 68.5, 5.9 | +5.4, -5.1
Grammar Correctness | 62.0, 9.2 | 68.4, 3.9 | +6.3, -5.3
Table 1: Models’ average accuracy and standard deviation for each task category. Coin shows consistent improvement across all tasks, with more evident improvement on the paraphrase identification and grammar correctness tasks.

We examine Coin’s influence on the model’s performance for different tasks. Based on the task categories defined in the PromptBench benchmark, we split the 10 GLUE datasets into four categories: (1) sentiment analysis, (2) natural language inference (NLI), (3) paraphrase identification, and (4) grammar correctness. Refer to Table 5 for the specific datasets classified into each category.

As shown in Table 1, Coin achieves evident improvements in accuracy of +5.4% and +6.3% on the paraphrase identification and grammar correctness tasks. Intuitively, these tasks can benefit more directly from Coin, which aims to enhance the similarity of representations of semantically equivalent instruction-input pairs. For example, paraphrase identification can directly benefit from the model’s more refined ability to group textual inputs with similar semantic meanings, as Coin pushes representations of inputs with different meanings away from each other. Similarly, grammar correctness can also benefit from the contrastive loss, which may group hidden representations of grammatically correct inputs closer to each other and thus enable the model to better detect inputs with invalid syntactic structures and grammatical rules.

On the other hand, Coin achieves modest improvements in accuracy of +1.4% and +1.7% on the sentiment analysis and NLI tasks compared to the continually instruction-tuned model. For sentiment analysis, the continually instruction-tuned model has already achieved an accuracy of 89.0%, and obtaining further improvements can be challenging given that the model is already capable of distinguishing between sentences with different sentiments. Regarding NLI, the task requires a comprehensive understanding of the relationship between two sentences, which can depend on the model’s knowledge of various domains or its reasoning ability to infer implicit meanings that are not directly stated. The complex relation between two sentences may not be explicitly captured by the hidden representations, so Coin may not further enhance the model’s reasoning ability. Nevertheless, the improvements on these two tasks demonstrate Coin’s effectiveness at enhancing the model’s ability to discern the nuanced inferential relations that underlie the overall semantic meaning of instruction-input pairs.

Model Perturbation CoLA MNLI MNLI-m MNLI-mm MRPC QNLI QQP RTE SST2 WNLI Average
Alpaca Baseline Clean 65.1 ± 2.1 51.5 ± 4.3 51.5 ± 4.3 51.3 ± 5.0 28.6 ± 27.5 51.8 ± 1.5 26.6 ± 10.8 62.2 ± 2.4 80.9 ± 5.7 50.5 ± 3.4 52.0 ± 18.2
Character 61.8 ± 4.6 47.2 ± 6.4 47.2 ± 6.4 49.3 ± 4.5 27.4 ± 24.1 42.7 ± 6.9 15.6 ± 10.9 55.5 ± 5.6 66.7 ± 15.6 49.3 ± 3.5 46.3 ± 18.0
Word 61.7 ± 2.0 49.6 ± 3.8 49.6 ± 3.8 49.2 ± 4.7 43.3 ± 21.8 24.8 ± 17.4 14.7 ± 8.1 57.5 ± 4.9 46.4 ± 25.8 53.1 ± 2.7 45.0 ± 18.7
Sentence 64.8 ± 1.8 51.2 ± 3.6 51.2 ± 3.6 52.9 ± 2.2 15.3 ± 10.7 50.2 ± 3.2 22.6 ± 6.8 61.5 ± 3.3 82.3 ± 4.1 52.1 ± 2.0 50.4 ± 19.0
Semantic 65.4 ± 1.9 52.1 ± 1.2 52.1 ± 1.2 51.6 ± 1.8 37.9 ± 25.6 52.1 ± 3.7 25.8 ± 10.0 59.2 ± 4.4 82.1 ± 3.3 48.6 ± 4.4 52.7 ± 16.9
Continual Instruction Tuning Clean 63.5 ± 8.6 68.7 ± 2.4 67.3 ± 2.7 66.3 ± 2.7 62.8 ± 13.0 62.9 ± 4.2 71.2 ± 7.6 82.0 ± 1.9 90.1 ± 2.4 57.5 ± 3.8 69.2 ± 11.1
Character 64.9 ± 3.1 64.9 ± 2.1 64.1 ± 2.3 63.4 ± 1.9 62.1 ± 11.9 54.7 ± 3.6 61.9 ± 11.8 75.7 ± 4.8 90.5 ± 2.0 54.0 ± 5.1 65.6 ± 11.7
Word 58.9 ± 12.6 64.8 ± 4.1 65.4 ± 3.8 64.3 ± 3.5 56.4 ± 10.5 46.8 ± 6.7 62.5 ± 8.2 73.8 ± 3.5 84.2 ± 12.6 54.0 ± 2.1 63.1 ± 12.6
Sentence 58.6 ± 15.2 66.4 ± 1.9 65.3 ± 1.4 65.1 ± 3.7 55.9 ± 16.8 53.2 ± 8.6 66.6 ± 8.1 80.3 ± 3.0 90.4 ± 1.2 55.9 ± 4.3 65.8 ± 13.9
Semantic 64.3 ± 6.6 67.0 ± 2.9 67.1 ± 2.5 66.0 ± 3.1 61.4 ± 14.3 56.4 ± 9.9 69.6 ± 8.1 80.0 ± 4.4 89.6 ± 2.5 58.0 ± 4.6 67.9 ± 11.8
Coin Clean 70.4 ± 3.9 68.8 ± 2.7 68.0 ± 2.2 67.6 ± 3.5 70.6 ± 3.5 61.9 ± 6.0 70.1 ± 6.0 82.3 ± 1.5 91.4 ± 0.7 59.9 ± 2.5 71.1 ± 9.5
Character 66.9 ± 3.0 68.2 ± 2.0 67.5 ± 1.3 66.6 ± 4.0 72.4 ± 2.5 58.7 ± 4.2 64.7 ± 8.0 78.5 ± 3.1 91.1 ± 2.1 58.9 ± 2.6 69.4 ± 9.8
Word 66.5 ± 4.5 67.4 ± 1.7 67.7 ± 3.0 66.1 ± 2.3 71.9 ± 5.4 49.9 ± 7.5 63.9 ± 6.0 75.6 ± 3.5 85.6 ± 11.6 60.1 ± 3.8 67.5 ± 10.5
Sentence 68.4 ± 7.2 67.7 ± 3.5 68.2 ± 2.6 66.3 ± 3.6 63.3 ± 9.6 55.4 ± 9.5 66.8 ± 6.1 79.8 ± 3.5 92.3 ± 0.6 59.6 ± 2.8 68.8 ± 11.4
Semantic 69.7 ± 1.2 66.3 ± 1.8 67.0 ± 0.5 64.3 ± 2.6 72.6 ± 5.8 56.1 ± 10.0 68.5 ± 6.3 78.5 ± 4.5 91.6 ± 0.6 59.2 ± 2.0 69.4 ± 10.6
Table 2: Models’ average accuracy and standard deviation on 10 GLUE datasets, each having six instructions with different types of perturbation. Coin here is trained with λ = 1000.

5.3 Weighting of Contrastive Loss

Figure 5: Coin’s performance by the maximum weight λ assigned to the contrastive loss. Coin achieves the highest average accuracy at λ = 10³.

As the weight of the contrastive loss may affect the extent to which Coin aligns representations of instruction variants Liu et al. (2022), we examine how different values of λ affect Coin’s performance across different perturbation levels.

As shown in Figure 5, Coin achieves its best average performance when λ = 1000. When λ is small, the contrastive loss does not have a significant impact on the model due to its small magnitude, and the resulting model has performance and sensitivity to instruction variations similar to the continually instruction-tuned model. As λ increases, Coin’s performance improves across different types of perturbation, indicating that the contrastive loss is guiding the model to align representations of instruction variants closer to each other and thus become more robust to the introduced perturbations.

However, when λ is too large, Coin’s performance decreases significantly. Therefore, based on the empirical results, we choose λ = 1000 for higher accuracy and smaller standard deviation. Refer to Table 4 for detailed experimental results of models trained with different contrastive loss weightings.

6 Conclusion

In this paper, we proposed Coin, which aligns hidden representations of semantically equivalent instruction-instance pairs. Evaluation results on PromptBench, with instructions that differ at the character, word, sentence, and semantic levels, demonstrate Coin’s effectiveness in enhancing LLMs’ robustness to semantic-invariant instruction variations. Future work can apply contrastive instruction tuning to models and tasks in other modalities, and to other prompt components such as few-shot demonstrations and system prompts.

Limitations

We summarize the limitations of this work as follows: First, the current contrastive data selection method only considers paraphrasing for positive instruction augmentation. More semantic-invariant data augmentation methods could be explored. Second, the experiment scale could be enlarged to include more instruction tuning datasets, instruction-tuned models, and downstream tasks. This would provide additional evidence about Coin’s effectiveness. Third, while we use a rigorous evaluation setting to measure model robustness, evaluating the influence of Coin from other perspectives could enhance understanding of contrastive instruction tuning.

Acknowledgement

We appreciate the reviewers for their insightful comments and suggestions. Tianyi Yan was supported by the Center for Undergraduate Research in Viterbi Engineering (CURVE) Fellowship. Fei Wang was supported by the Amazon ML Fellowship. James Y. Huang was supported by a gift fund from the Amazon Center on Secure & Trusted ML. Muhao Chen was supported by the NSF Grant IIS 2105329, the NSF Grant ITE 2333736, the DARPA AIE Grant HR0011-24-9-0370, and an Amazon Research Award.

References

Appendix A Datasets

Task Category Dataset Count
Natural Language Inference (NLI) ANLI (R1) 2664
ANLI (R2) 2670
ANLI (R3) 2658
CB 232
MNLI-Matched 2678
MNLI-Mismatched 2678
QNLI 2682
RTE 2328
SNLI 2682
WNLI 920
Sentiment Analysis IMDB 354
Sent140 2684
SST2 2682
Yelp 834
Paraphrase Identification MRPC 2684
QQP 2684
PAWS Wiki 2684
STS-B 2682
Reading Comprehension BoolQ 1044
MultiRC 30
Coreference Resolution WSC273 720
Summarization AG News 2678
Miscellaneous TREC 2682
CoLA 2684
WIC 2684
Total 52002
Table 3: Number of entries sampled for each dataset from the FLAN collection.

The training dataset is sampled from the FLAN collection, which is released under the Apache-2.0 license. We select 25 datasets with answer options, which can be classified into 7 categories:

1. Natural Language Inference (NLI): how two sentences are related. The following datasets are used:
   (a) ANLI Nie et al. (2020)
   (b) CB Marneffe et al. (2019)
   (c) MNLI Williams et al. (2018)
   (d) QNLI Rajpurkar et al. (2018)
   (e) RTE (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009)

2. Sentiment Analysis: whether the input text has positive or negative sentiment. The following datasets are used:
   (a) IMDB Maas et al. (2011)
   (b) Sent140 Go et al. (2009)
   (c) SST2 Socher et al. (2013)
   (d) Yelp Zhang et al. (2015)

3. Paraphrase Detection: whether two sentences are semantically equivalent. The following datasets are used:
   (a) MRPC Dolan and Brockett (2005)
   (b) QQP Wang et al. (2018)
   (c) PAWS Wiki Zhang et al. (2019)
   (d) STS-B Cer et al. (2017)

4. Reading Comprehension: answer questions based on passages that contain the answers. The following datasets are used:
   (a) BoolQ Clark et al. (2019)
   (b) MultiRC Khashabi et al. (2018)

5. Coreference: find expressions that refer to the same entity in the input text. The WSC273 dataset is used Levesque et al. (2012).

6. Summarization: produce an abbreviated summary of the input text. For input with answer options, the model is asked to, for instance, choose the broader topic or the best summary among all choices provided. The AG News dataset is used Zhang et al. (2015).

7. Miscellaneous:
   (a) TREC (Li and Roth, 2002; Hovy et al., 2001): classify questions into specified categories, such as whether the question is related to human, location, abbreviations, etc.
   (b) CoLA Warstadt et al. (2019): linguistic acceptability.
   (c) WIC Pilehvar and Camacho-Collados (2019): evaluate the intended meaning of a word within a context.

Refer to Table 3 for the number of entries filtered and selected from each dataset following the rules described in Section 4.1.

Appendix B Detailed Experiment Results

λ Perturbation CoLA MNLI MNLI-m MNLI-mm MRPC QNLI QQP RTE SST2 WNLI Average
1 Clean 66.4 ± 6.0 67.7 ± 2.6 67.8 ± 2.6 65.8 ± 1.4 63.6 ± 15.2 62.3 ± 5.5 66.4 ± 12.1 81.7 ± 2.9 90.1 ± 1.7 56.6 ± 3.9 68.8 ± 11.6
DeepWordBug 65.0 ± 3.4 65.2 ± 1.7 64.6 ± 1.8 63.3 ± 2.0 63.3 ± 11.3 54.7 ± 3.5 57.6 ± 11.2 75.3 ± 3.6 90.3 ± 1.8 52.8 ± 4.2 65.2 ± 11.8
TextFooler 58.7 ± 11.0 65.4 ± 1.9 66.2 ± 2.8 64.3 ± 3.7 59.9 ± 10.7 46.8 ± 6.0 55.6 ± 12.0 74.1 ± 4.1 85.0 ± 13.4 54.0 ± 3.1 63.0 ± 13.0
CheckList 61.3 ± 13.0 67.7 ± 1.8 66.8 ± 1.9 64.3 ± 3.2 57.8 ± 17.3 51.3 ± 10.0 61.9 ± 12.6 80.5 ± 2.4 91.1 ± 1.6 57.5 ± 2.8 66.0 ± 14.2
Semantic 68.8 ± 3.6 65.1 ± 1.7 65.4 ± 1.6 64.9 ± 3.2 62.6 ± 15.8 56.5 ± 7.8 65.8 ± 10.3 79.6 ± 2.4 89.9 ± 1.9 56.3 ± 5.2 67.5 ± 11.9
10 Clean 69.6 ± 3.2 65.8 ± 2.1 65.4 ± 2.7 64.6 ± 2.2 71.7 ± 8.1 62.5 ± 5.2 68.7 ± 9.5 81.7 ± 2.9 90.0 ± 2.4 56.8 ± 3.0 69.7 ± 10.4
DeepWordBug 66.3 ± 2.5 64.8 ± 1.8 64.9 ± 1.5 61.3 ± 1.6 70.4 ± 6.6 55.4 ± 4.0 57.4 ± 7.2 76.5 ± 3.6 89.4 ± 2.6 56.8 ± 4.7 66.3 ± 10.7
TextFooler 61.2 ± 9.6 63.5 ± 1.8 64.6 ± 1.6 62.8 ± 3.6 70.2 ± 8.2 48.4 ± 5.1 56.0 ± 11.2 74.4 ± 3.9 84.2 ± 12.9 57.3 ± 1.9 64.3 ± 12.0
CheckList 67.6 ± 8.0 66.1 ± 1.7 66.9 ± 2.2 62.6 ± 2.0 64.8 ± 17.0 53.2 ± 10.4 61.4 ± 11.1 80.3 ± 2.6 90.9 ± 2.2 58.2 ± 2.7 67.2 ± 13.0
Semantic 69.4 ± 1.3 63.7 ± 1.5 64.4 ± 1.3 63.1 ± 2.6 69.7 ± 10.3 57.2 ± 7.1 67.4 ± 8.7 79.5 ± 2.7 89.8 ± 2.4 58.5 ± 4.1 68.3 ± 10.7
100 Clean 69.3 ± 3.2 68.9 ± 1.7 69.1 ± 1.9 66.8 ± 3.1 73.6 ± 3.8 62.3 ± 5.9 70.1 ± 7.8 82.4 ± 1.6 90.6 ± 1.1 62.0 ± 3.2 71.5 ± 9.2
DeepWordBug 66.5 ± 3.8 68.4 ± 1.8 68.7 ± 1.6 65.5 ± 2.9 73.5 ± 2.7 55.2 ± 4.3 61.9 ± 8.4 77.3 ± 3.6 91.1 ± 2.1 57.5 ± 2.5 68.6 ± 10.6
TextFooler 62.1 ± 6.6 66.8 ± 2.9 67.5 ± 2.3 66.0 ± 1.5 72.1 ± 4.9 48.5 ± 7.4 60.3 ± 9.6 73.7 ± 4.5 85.8 ± 10.9 56.3 ± 2.8 65.9 ± 11.5
CheckList 68.9 ± 5.4 69.2 ± 3.0 69.4 ± 2.8 66.3 ± 3.7 64.9 ± 12.8 53.8 ± 10.0 66.1 ± 8.8 80.6 ± 3.1 91.6 ± 0.7 57.0 ± 2.4 68.8 ± 12.1
Semantic 68.7 ± 2.1 66.9 ± 1.7 67.0 ± 2.5 64.0 ± 2.4 72.3 ± 6.8 55.0 ± 9.6 70.7 ± 6.7 79.8 ± 3.5 91.1 ± 0.7 59.2 ± 4.7 69.5 ± 10.9
1000 Clean 70.4 ± 3.9 68.8 ± 2.7 68.0 ± 2.2 67.6 ± 3.5 70.6 ± 3.5 61.9 ± 6.0 70.1 ± 6.0 82.3 ± 1.5 91.4 ± 0.7 59.9 ± 2.5 71.1 ± 9.5
DeepWordBug 66.9 ± 3.0 68.2 ± 2.0 67.5 ± 1.3 66.6 ± 4.0 72.4 ± 2.5 58.7 ± 4.2 64.7 ± 8.0 78.5 ± 3.1 91.1 ± 2.1 58.9 ± 2.6 69.4 ± 9.8
TextFooler 66.5 ± 4.5 67.4 ± 1.7 67.7 ± 3.0 66.1 ± 2.3 71.9 ± 5.4 49.9 ± 7.5 63.9 ± 6.0 75.6 ± 3.5 85.6 ± 11.6 60.1 ± 3.8 67.5 ± 10.5
CheckList 68.4 ± 7.2 67.7 ± 3.5 68.2 ± 2.6 66.3 ± 3.6 63.3 ± 9.6 55.4 ± 9.5 66.8 ± 6.1 79.8 ± 3.5 92.3 ± 0.6 59.6 ± 2.8 68.8 ± 11.4
Semantic 69.7 ± 1.2 66.3 ± 1.8 67.0 ± 0.5 64.3 ± 2.6 72.6 ± 5.8 56.1 ± 10.0 68.5 ± 6.3 78.5 ± 4.5 91.6 ± 0.6 59.2 ± 2.0 69.4 ± 10.6
10000 Clean 69.6 ± 5.5 67.9 ± 2.4 68.6 ± 2.1 67.4 ± 1.7 69.0 ± 8.5 63.9 ± 6.0 72.9 ± 5.9 81.1 ± 2.2 91.3 ± 0.9 56.8 ± 4.7 70.8 ± 10.1
DeepWordBug 66.4 ± 3.7 67.2 ± 2.7 67.4 ± 2.0 66.9 ± 3.5 64.3 ± 8.0 59.8 ± 4.4 65.9 ± 9.0 77.2 ± 2.4 90.7 ± 2.7 58.5 ± 2.7 68.5 ± 10.0
TextFooler 62.9 ± 7.9 66.7 ± 2.7 66.5 ± 2.7 65.6 ± 2.7 68.4 ± 9.4 54.8 ± 7.3 66.8 ± 6.3 76.2 ± 3.6 84.8 ± 11.5 61.0 ± 3.5 67.4 ± 10.1
CheckList 68.9 ± 7.9 67.2 ± 2.9 67.4 ± 2.8 65.4 ± 2.4 61.7 ± 17.6 59.2 ± 9.0 70.5 ± 6.6 79.7 ± 3.1 92.2 ± 0.5 58.7 ± 3.8 69.1 ± 12.1
Semantic 69.5 ± 2.8 65.9 ± 2.1 66.1 ± 2.3 65.5 ± 2.2 67.2 ± 13.4 60.1 ± 7.7 70.7 ± 6.6 77.9 ± 4.6 91.4 ± 0.9 58.0 ± 1.5 69.2 ± 10.7
100000000 Clean 70.4 ± 3.0 66.2 ± 2.1 66.1 ± 1.9 65.7 ± 1.7 55.0 ± 10.3 61.2 ± 7.3 70.9 ± 4.9 83.3 ± 1.1 90.6 ± 1.5 56.6 ± 3.7 68.6 ± 11.5
DeepWordBug 64.4 ± 4.5 63.4 ± 3.0 63.2 ± 2.7 64.1 ± 2.0 46.2 ± 4.3 60.3 ± 5.8 64.6 ± 6.0 80.3 ± 2.4 86.7 ± 6.3 56.3 ± 4.0 65.0 ± 11.6
TextFooler 62.9 ± 8.1 64.3 ± 3.7 62.7 ± 3.6 63.9 ± 3.4 49.1 ± 8.2 54.2 ± 7.6 65.4 ± 2.9 78.5 ± 3.2 81.6 ± 12.7 58.5 ± 2.9 64.1 ± 11.4
CheckList 70.5 ± 3.3 67.1 ± 2.0 66.6 ± 2.6 65.8 ± 2.0 50.4 ± 16.3 57.8 ± 9.4 66.3 ± 5.1 81.8 ± 2.3 90.9 ± 1.2 58.0 ± 4.0 67.5 ± 12.9
Semantic 69.2 ± 3.9 64.3 ± 2.6 64.5 ± 2.5 64.1 ± 2.8 56.4 ± 16.4 57.3 ± 8.1 75.0 ± 5.8 78.8 ± 5.6 91.4 ± 1.4 55.9 ± 2.7 67.7 ± 12.6
Table 4: Average accuracy and standard deviation of Coin trained with different contrastive loss weightings.

For the results of models trained with different contrastive loss weighting, refer to Table 4.

Appendix C GLUE Datasets Category

Task Category Datasets
Sentiment Analysis SST-2
Grammar Correctness CoLA
Paraphrase Identification QQP, MRPC
Natural Language Inference MNLI, QNLI, RTE, WNLI
Table 5: Task categories for the GLUE datasets, following the categories defined in the PromptBench benchmark Zhu et al. (2023).

Following the task categories defined in the PromptBench benchmark, we split the GLUE datasets into the four categories shown in Table 5.