
LabelPrompt: Effective Prompt-based Learning for Relation Classification

Wenjie Zhang [email protected] 0000-0002-3307-6189, Jiangnan University, 1800 Lihu Avenue, Wuxi, Jiangsu, China; Xiaoning Song [email protected], Jiangnan University, 1800 Lihu Avenue, Wuxi, Jiangsu, China; Zhenhua Feng [email protected], University of Surrey, Guildford GU2 7XH, Surrey, United Kingdom; Tianyang Xu [email protected], Jiangnan University, 1800 Lihu Avenue, Wuxi, Jiangsu, China; and Xiaojun Wu [email protected], Jiangnan University, 1800 Lihu Avenue, Wuxi, Jiangsu, China
Abstract.

Recently, prompt-based learning has gained popularity across many natural language processing (NLP) tasks by reformulating them into a cloze-style format to better align pre-trained language models (PLMs) with downstream tasks. However, applying this approach to relation classification poses unique challenges. Specifically, associating natural language words that fill the masked token with semantic relation labels (e.g. “org:founded_by”) is difficult. To address this challenge, this paper presents a novel prompt-based learning method, namely LabelPrompt, for the relation classification task. Motivated by the intuition to “GIVE MODEL CHOICES!”, we first define additional tokens to represent relation labels, regard these tokens as a verbaliser with semantic initialisation, and explicitly construct them into a prompt template. Then, to mitigate the inconsistency between predicted relations and given entities, we implement an entity-aware module with contrastive learning. Last, we adopt an attention query strategy within the self-attention layers to differentiate prompt tokens from sequence tokens. Together, these strategies enhance the adaptability of prompt-based learning, especially when only a small labelled dataset is available. Comprehensive experiments on benchmark datasets demonstrate the superiority of our method, particularly in the few-shot scenario.

relation classification, prompt learning, few-shot
ccs: Computing methodologies Information extraction

1. Introduction

Pretrained Language Models (PLMs) have achieved great success in many Natural Language Processing (NLP) tasks (Jiang et al., 2020a; Xu et al., 2021; Guo et al., 2022; Zhong and Chen, 2021), such as Relation Classification (RC), Named Entity Recognition (NER), etc. The strong performance of PLMs stems from their ability to learn rich knowledge representations from large unlabelled corpora through self-supervised pretraining objectives such as Masked Language Modelling (MLM). A common paradigm is to then fine-tune the pretrained parameters on downstream task datasets, thereby transferring the embedded knowledge in the PLM to improve performance on the target task (Peters et al., 2018; Lewis et al., 2020). However, a fundamental challenge with this paradigm is the inherent mismatch between the pretraining objective of the PLM and the actual downstream task. This discrepancy in objectives can result in a sub-optimal transfer of knowledge from pretraining to fine-tuning (Liu et al., 2021a).

To bridge the mismatch between pretraining and downstream tasks, prompt-based learning has emerged as an effective approach for utilising PLMs (Lester et al., 2021; Brown et al., 2020; Wallace et al., 2019). The key idea is to reformulate a downstream task into the masked language modelling format used during pretraining. This is achieved through two main components: 1) a prompt template that prepends instructions and context to the input, and 2) an answer mapping method (known as a verbaliser) that converts the model’s output into target labels. By transforming tasks into the native pretraining format, prompt-based learning reduces the discrepancy in objectives between pretraining and fine-tuning (Tan et al., 2022; Ma et al., 2022). Despite promising results, manually engineering prompt templates and verbalisers can be expensive and time-consuming, limiting wider adoption (Liu et al., 2021a). Recent work has focused on automatically generating prompts (Lester et al., 2021; Liu et al., 2022) or learning continuous prompt representations (Shin et al., 2020a; Li and Liang, 2021) to reduce the cost of prompt engineering. Overall, prompt-based learning offers an appealing paradigm for effectively utilising PLMs, warranting further research to address current limitations.

Meanwhile, inspired by the great success of prompt-based learning, recent studies have explored applications to text classification tasks (Zhang et al., 2022). In this paper, we specifically investigate prompts for relation classification, a prevalent text classification problem. Given a pair of entities mentioned in a text, the goal of relation classification is to predict the semantic relation between them, if any (e.g., “per:employee_of”, “org:founded” and “no_relation”) (Hendrickx et al., 2010). While prompt-based learning provides an appealing approach to use pretrained language models for this task, applying prompts to relation classification also poses unique challenges.

First, relation classification presents a unique challenge compared to many NLP tasks in that the target labels contain rich semantic information, as opposed to being single words representable within the pretrained model’s vocabulary. Consequently, existing answer mapping techniques struggle to effectively project the model’s masked outputs to the complex, multi-word relation labels when reformulating relation classification as a masked language modelling task.

To improve mapping performance, some studies suggest searching external data for label-related words to design a verbaliser (Hu et al., 2022; Gu et al., 2022; Zhu et al., 2022). However, these methods have yet to achieve optimal results. Alternatively, other approaches define additional learnable tokens external to the pretrained vocabulary, and map model outputs to these new tokens for classification (Chen et al., 2022). Although these methods can mitigate token mapping issues, tuning the new tokens requires more labelled training data due to the significant gap from the pretrained vocabulary. Thus, the first challenge in relation classification is developing effective training methodologies that can enhance model performance under limited data availability.

Second, the masked language modelling (MLM) pre-training objective depends on predictions derived from neighbouring words and global sentence features. However, relation classification requires not only identifying a relation from a sentence, but also determining whether the relation holds between two given entities. Existing approaches predominantly predict relations at the sentence level, without explicitly modelling the correspondence within the entity-relation triplet $(s, r, o)$.

For instance, consider an abbreviated excerpt from the ReTACRED (Stoica et al., 2021) dataset: “Travis the Chimp, the 14-year-old, 200-pound pet of Sandra Herold, 71.” In this example, “Sandra Herold” is denoted as the subject entity while “14-year-old” is denoted as the object entity. Despite the ground truth relation existing between “Sandra Herold” and “71”, the model incorrectly predicts the relation as “per:age” between “Sandra Herold” and “14-year-old”. Evidently, the model fails to accurately capture the correspondence between the given entities and their relation. Subsequently, an additional challenge arising in relation classification is augmenting the model’s capacity to discern given entities when inferring relations.

To mitigate the above limitations of prompt-tuning for relation classification, we develop a novel LabelPrompt approach with an entity-aware module. LabelPrompt is a simple yet effective method that significantly reduces the gap between pre-training objectives and the relation classification task. Unlike previous methods, LabelPrompt enables the model to capture relation labels more intuitively by augmenting the input sequence.

In terms of prompt template studies, many researchers have focused on constructing a prompt template that meets the requirements of the task formulation. At the same time, some studies have shown that the quality of the prompt template has a significant impact on task performance. Thus, we believe it is essential to provide the pre-trained model with supplementary task-relevant information via the input. Building on this idea, we suggest an intuitive technique for providing the model with selectable options. In particular, we develop a prompt template approach that embeds relation class labels into the input example before it is fed into the model.

Specifically, to enhance the ability of masked language modelling (MLM) for relation label prediction, we propose extending the vocabulary space with custom label tokens that are not present in natural language (e.g., “[C1]”). The embeddings of these tokens are initialised with semantic information derived from the relation label texts. However, this introduces a gap between pre-training and our task, as the additional label tokens are not present in the pre-training vocabulary space. To mitigate this limitation and improve prompt-based performance in few-shot scenarios, we expand the vocabulary with additional label prompt tokens and use these label tokens to construct prompt templates on the input side. Furthermore, to prevent additional prompt tokens from affecting sentence semantics, we redesign the attention query strategy to use distinct query projections for prompt and sentence token pairs. This controls prompt-sentence interactions and allows label prompt tokens to provide task-specific information without compromising the sentence semantics encoded in the pre-trained language model. This promotes independence between prompts and sentences to improve the efficacy of relation classification.

Second, to enhance entity perception during model training, we implement an entity-aware module to assess the correlation between entities and their relations. As shown in Fig. 1, we adopt a contrastive learning-inspired approach to construct positive and negative samples, where the former consist of the given entities and the predicted relation, while the latter contain randomly selected tokens from the sentence and the ground-truth relation. This effectively constrains the relation and entities with semantic information.

Finally, extensive evaluations on popular benchmark datasets demonstrate that our proposed approach yields significant performance improvements compared to fine-tuning and other prompt-based methods. We attribute this enhancement to our method’s ability to effectively constrain the model’s search space and guide optimisation in the correct direction.

To summarise, the main contributions of the proposed LabelPrompt method include:

  • We introduce a prompt-based method that explicitly exploits relation features and significantly improves the performance of prompt-based learning in few-shot scenarios.

  • We implement an entity-aware module that employs contrastive learning to enhance entity awareness during the inference stage.

  • We design an attention query strategy within self-attention layers to differentiate between prompt and sentence tokens and improve the performance of prompt-based learning.

The rest of this paper is organised as follows. We first introduce the related work in Section 2. Then we present the proposed LabelPrompt method in Section 3. The experimental results and analysis of the proposed method are reported in Section 4 and Section 5, respectively. Last, we draw the conclusion of the proposed method in Section 6.

2. Related Work

2.1. Pre-training and Fine-tuning

Pre-trained language models are trained on large-scale unlabelled text data to learn robust general-purpose features, such as lexical, syntactic, semantic, and word representations (Liu et al., 2022). Many large models, such as GPT (Liu et al., 2021b), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2020), have been proposed, greatly promoting research in natural language processing. Most NLP tasks benefit from the knowledge contained in pre-trained language models, and increasing the model size and the amount of pre-training data further facilitates downstream tasks. By fine-tuning the parameters of PLMs with additional task-specific modules and objective functions, the models can be adapted to most downstream tasks. Generally, the pre-training and fine-tuning paradigm has become the foundation for many NLP tasks, such as named entity recognition (Wu et al., 2021), relation classification (Lyu and Chen, 2021), and question answering (Zhong et al., 2022).

2.2. Prompt-based Learning

Although fine-tuning has achieved significant success in many fields, there remains a large gap between the objectives of pre-training and fine-tuning tasks (Liu et al., 2021a). Typically, during pre-training, a PLM is trained using self-supervised corpus texts with Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) tasks (Devlin et al., 2019). This differs from other NLP tasks, such as classification and sequence tagging tasks. As a result, recent studies have sought to explore how to effectively use pre-trained language models and address this gap (Wang et al., 2021; Moiseev et al., 2022).

Prompt-based learning is a different approach that aims to reduce the gap between pre-training and fine-tuning. Rather than fine-tuning, prompt-based learning reformulates a downstream task into the form of the original PLM training task. For example, a prompt template might reformulate a machine translation instance as “English: I love this film. French: __”. A large PLM could then fill in the blank with a French translation of the English text. Essentially, prompt-based learning enables PLMs to perform downstream tasks by providing additional information in the form of a prompt (Lester et al., 2021; Gao et al., 2021).

Many studies have shown that prompt templates containing semantics can effectively activate knowledge in the model. Building on this idea, Petroni et al. (2019) considered language models as knowledge bases and proposed the LAMA (LAnguage Model Analysis) probe dataset. The LAMA probe provides manually created completion templates for exploring the knowledge in PLMs. By using hand-crafted query sentences, it extracts the knowledge base from PLMs and performs comparably to using PLMs directly. However, the manual design of templates as discrete prompts is time-consuming, expensive (Shin et al., 2021), and often requires a wealth of expert knowledge.

To extend prompt-based learning to a wider range of tasks and reduce the overhead of prompt text construction, Shin et al. (2020b) automatically generated prompts for different task sets based on a gradient search approach. Some researchers have also added a small set of task-specific prefixes to replace manual templates as continuous prompts, which are learnable by backpropagation (Li and Liang, 2021; Lester et al., 2021). Additionally, Zhong et al. (2021) suggests initialising continuous prompts based on the discrete prompts already explored by prompt search methods.

Another important component of prompt-based learning is the answer-mapping approach. Since the masked output does not always match the label, the aim of answer mapping is to project the model output into the answer space when the raw outputs cannot be used directly as task outputs. For example, sentiment analysis tasks require mapping the model outputs “{ good, fantastic, unhappy, average }” to “{ positive, negative, neutral }”. The answer-mapping method is also known as a verbaliser. A verbaliser is a projection method that constructs a mapping between the natural word space and the task label space (Hu et al., 2022). Just like the prompt template, a verbaliser also has “manual design” (Yin et al., 2019; Cui et al., 2021), “discrete answer search” (Jiang et al., 2020b; Xia et al., 2022) and “continuous answer search” strategies (Hambardzumyan et al., 2021). Equally, the quality of a verbaliser can have a significant impact on model performance.

In conclusion, prompt-based learning is widely considered as a highly effective method for utilising pre-trained language models, and it is a subject worthy of further study.

Table 1. Two examples of relation classification. Relation classification aims to predict the relation between two entities mentioned in a given sentence.
Sentence Relation
Mark Fisher [subject] writes for the Dayton Daily News [object]. per:employee_of
He has a sense of humor [subject] about his reaction [object] to that day. no_relation

2.3. Relation Classification

Relation Classification (RC) is a sub-task of Information Extraction (IE). The goal of RC is to identify the semantic relation between two entities in a given text. As shown in Table 1, given the sentence “Mark Fisher writes for the Dayton Daily News.”, the relation between the subject “Mark Fisher” and the object “Dayton Daily News” is “per:employee_of”. RC is a crucial task for many natural language understanding applications, including question answering, knowledge base construction, and text summarization (Hendrickx et al., 2010).

Early approaches for RC relied heavily on prior knowledge to extract additional features from the text (Califf and Mooney, 1997; Zeng et al., 2015). The emergence of large-scale PLMs, such as BERT (Devlin et al., 2019), has provided universal language representations that are useful for RC tasks (Vaswani et al., 2017; Cohen et al., 2020; Yamada et al., 2020). However, these methods often require significant amounts of labelled data and complex, task-specific neural modules, and tend to underperform in the few-shot scenario.

Recently, there have been studies exploring the use of prompt-based learning to leverage the knowledge contained in PLMs for relation classification. The LAMA dataset is a probe for analysing factual and commonsense knowledge in PLMs, which consists of a set of knowledge sources, each comprising a set of facts. For example, one template could be “__ is the capital of France.” Hu et al. (2022) proposed a method that uses external knowledge to construct a verbaliser for obtaining more accurate label words for text classification. Han et al. (2021) applied logic rules to construct prompts with several sub-prompts, effectively using prior knowledge in relation classification. Chen et al. (2022) proposed a prompt-based learning approach called KnowPrompt that incorporates the abundant semantics and prior knowledge in relation labels into relation classification.

All these methods aimed to effectively use prompt-based learning to narrow the gap between relation classification and pre-training tasks, and demonstrated promising results in both few-shot and full-data scenarios. However, none of these methods addressed the issue of inconsistency between prompt output and relation label texts. Inspired by KnowPrompt, we propose a new prompt-based learning approach, namely LabelPrompt, that explicitly constructs prompt templates using label tokens and adds supervised information corresponding to the labels at the model’s output.

3. The Proposed LabelPrompt Method

Figure 1. An overview of the proposed method. We input the tokens, including the label tokens, into the RoBERTa model to obtain a global feature representation and improve contextual awareness. The masked token is then used to predict the relation of the sentence. The relation-align module aligns the encoded label prompt tokens with their corresponding classes and computes the label loss to preserve the label prompt information. The entity-aware module constructs two samples based on the triplet information and the sentence tokens in the pure prompt template to constrain the correspondence between the given entities. The attention query strategy in the encoding stage facilitates independent encoding of prompt tokens and sentence tokens.

Relation Classification (RC) can be denoted as $\mathcal{T}=\{\mathcal{X},\mathcal{Y}\}$, where $\mathcal{X}$ is the set of instances and $\mathcal{Y}$ is the set of relation labels. Each instance $\mathbf{x}\in\mathcal{X}$ contains a sentence $x=\{x_{1},x_{2},\dots,x_{n}\}$, two entities ($e_{s}$, $e_{o}$) mentioned in $x$, and its relation label $y_{x}\in\mathcal{Y}$. The goal of relation classification is to predict the relation $y\in\mathcal{Y}$ between the given subject entity $e_{s}$ and object entity $e_{o}$ in sentence $x$. If there is no relation between the entities, it is considered a special type of relation, i.e. “no_relation”, which is included in the set of relations.

We will elaborate on the major components and training schemes of our approach in the following parts. As illustrated in Fig. 1, the input to the model is composed of two primary components, the prompt tokens and the sentence tokens. We can further separate the prompt tokens into label prompt tokens and typical prompt tokens.

3.1. Label Prompt

3.1.1. Label Prompt Tokens

For each relation $y\in\mathcal{Y}$, its label text is not an individual word in natural language, but a composite text, e.g. “org:top_members/employees”, “per:country_of_death”, etc. We define a label space by creating $m$ label tokens $\mathcal{C}=\{c_{1},c_{2},\dots,c_{m}\}$ and adding them to the vocabulary of the PLM, where $m$ is the size of the relation set $\mathcal{Y}$. For a better training setup, these label tokens are initialised from their knowledge-rich semantic label texts. The initialisation process is defined as follows:

(1) $C_{i}=\frac{1}{N_{i}}\sum_{j=1}^{N_{i}}\textrm{Embedding}_{\mathcal{M}}(t_{j}),$

where $C_{i}$ represents the word embedding of label token $c_{i}$, $\textrm{Embedding}_{\mathcal{M}}$ stands for the word embedding layer of $\mathcal{M}$, and $t_{j}$ denotes the $j$-th sub-text of the decomposed relation label text $T_{i}=\{t_{1},t_{2},\dots,t_{N_{i}}\}$.
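As a concrete illustration, the following PyTorch sketch shows one way to implement the label-token initialisation in Equ. (1); the tokenizer calls and the toy label subset are illustrative assumptions rather than our released implementation.

    import torch
    from transformers import RobertaTokenizer, RobertaForMaskedLM

    tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
    model = RobertaForMaskedLM.from_pretrained("roberta-large")

    relation_labels = ["org:founded_by", "per:employee_of", "no_relation"]  # toy subset
    label_tokens = [f"[C{i + 1}]" for i in range(len(relation_labels))]

    # Extend the PLM vocabulary with one new token per relation class.
    tokenizer.add_tokens(label_tokens, special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))

    embedding = model.get_input_embeddings()  # word embedding layer of M
    with torch.no_grad():
        for tok, label in zip(label_tokens, relation_labels):
            # Decompose the relation label text into sub-texts t_1, ..., t_Ni.
            pieces = label.replace(":", " ").replace("_", " ").replace("/", " ")
            sub_ids = tokenizer(pieces, add_special_tokens=False)["input_ids"]
            # Equ. (1): C_i is the mean of the sub-token embeddings.
            c_i = embedding.weight[sub_ids].mean(dim=0)
            embedding.weight[tokenizer.convert_tokens_to_ids(tok)] = c_i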

3.1.2. Label Prompt Template

To provide context for our work, we first introduce the concepts of prompt templates and the prompt template method. In this paper, a prompt template refers to additional text consisting of prompt tokens, while the template method refers to the construction function used to build the input instance $\mathbf{x}_{prompt}$ containing a specific token [MASK].

The objective of our work is to address the difficulty that prompt-based methods have in fitting the correct label tokens in the few-shot scenario, where data is limited. To achieve this, we propose a solution that converts the relation classification task into a masked language modelling one. However, previous studies have shown that this approach alone may not be effective due to the large size of the vocabulary and the newly added label tokens that were not present during the model’s pre-training stage.

To overcome this limitation, we introduce an intuitive approach of directly adding label tokens to the input sequences. Our proposed prompt template, namely the Label Prompt Template, consists of both label prompt tokens and typical prompt tokens. By incorporating the label tokens directly into the input sequences, we can improve the performance of the prompt-based model in the few-shot scenario. We represent the prompt template as $T(\mathbf{x})$:

(2) $T(\mathbf{x})=\{c_{1},c_{2},\dots,c_{m},[\textrm{SEP}],e_{s},[\textrm{MASK}],e_{o}\},$

where $c_{i}\in\mathcal{C}$ is the label token, and $e_{s}$ and $e_{o}$ are the subject and object entities. $[\textrm{CLS}]$, $[\textrm{SEP}]$ and $[\textrm{MASK}]$ are predefined in $\mathcal{M}$. The input $x$ will be converted to the input sequence as follows,

(3) $x_{prompt}=\left\{[\textrm{CLS}],T(\mathbf{x}),[\textrm{SEP}],x,[\textrm{SEP}]\right\}.$
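To make the template construction concrete, here is a minimal sketch of Equs. (2)-(3) in Python; the special-token strings and function name are illustrative assumptions, and in practice the tokenizer of $\mathcal{M}$ supplies its own [CLS]/[SEP]/[MASK] symbols.

    def build_prompt(sentence: str, subj: str, obj: str, label_tokens: list) -> str:
        # T(x) = {c_1, ..., c_m, [SEP], e_s, [MASK], e_o}
        template = " ".join(label_tokens) + f" [SEP] {subj} [MASK] {obj}"
        # x_prompt = {[CLS], T(x), [SEP], x, [SEP]}; the leading [CLS] and the
        # trailing [SEP] are typically added by the tokenizer when encoding.
        return template + " [SEP] " + sentence

    prompt = build_prompt(
        "Mark Fisher writes for the Dayton Daily News.",
        "Mark Fisher", "Dayton Daily News",
        ["[C1]", "[C2]", "[C3]"],
    )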

3.1.3. Label Prompt Learning

In this section, we introduce our key training objective. First, we define $\mathbf{h}$ as the output of the encoder of the model $\mathcal{M}$. An input sequence $x_{prompt}$ is encoded by the model to obtain $\mathbf{h}\in\mathbb{R}^{\ell\times d}$,

(4) $\mathbf{h}=\textrm{Encoder}_{\mathcal{M}}(x_{prompt}),$

where $\ell$ represents the length of $x_{prompt}$, and $d$ is the size of the hidden vectors.

We also define $h_{mask}$ as the hidden vector of the $[\textrm{MASK}]$ token, $h_{c_{i}}$ as the hidden vector of label prompt token $c_{i}$, and $h_{sub}$ and $h_{obj}$ as the hidden vectors of the subject and object entities.

Next, we define a verbaliser $V_{\phi}$ as an answer-mapping method that maps a hidden vector $h$ to a relation label,

(5) $V_{\phi}(h)=\mathbf{E}_{label}\cdot(W_{v}\cdot h+b),$

where $\mathbf{E}_{label}$ represents the label embeddings, and $W_{v}$ and $b$ are learnable parameters. We use $C_{i}\in\mathbf{E}_{label}$ to denote the embedding of label token $c_{i}$. Then we calculate the probability distribution of the mapping result of hidden vector $h$. The probability that the label of the masked token is $c_{i}$ is calculated using the softmax function,

(6) $p\left(V_{\phi}\left(h\right)=c_{i}\mid x_{prompt}\right)=\frac{\exp(C_{i}\cdot(W_{v}\cdot h+b))}{\sum_{j=1}^{m}\exp(C_{j}\cdot(W_{v}\cdot h+b))},$

where $m$ is the size of the relation set $\mathcal{Y}$.
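A small PyTorch sketch of the verbaliser in Equs. (5)-(6) is given below; the class and variable names are assumptions chosen for illustration.

    import torch
    import torch.nn as nn

    class Verbalizer(nn.Module):
        def __init__(self, hidden_size: int, label_embeddings: torch.Tensor):
            super().__init__()
            # W_v and b from Equ. (5).
            self.proj = nn.Linear(hidden_size, label_embeddings.size(-1))
            # E_label: (m, d) embedding matrix of the m label tokens.
            self.label_embeddings = nn.Parameter(label_embeddings)

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # V_phi(h) = E_label . (W_v h + b); softmax over the m classes (Equ. (6)).
            logits = self.proj(h) @ self.label_embeddings.t()
            return torch.softmax(logits, dim=-1)

    # Usage: probs = verbalizer(h_mask), where h_mask is the encoder output at the
    # [MASK] position; the argmax class gives the predicted relation.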

Last, our model optimisation aims to assign a suitable relation label to the masked token. We define the loss function,

(7) $\mathcal{L}_{mask}=\frac{1}{m}\sum_{i=1}^{m}y_{x}\log p(y_{x}|\mathbf{x}),$

as the cross-entropy loss between the predicted probability $p(y_{x}|\mathbf{x})=p(V_{\phi}(h_{mask})=\mathcal{C}\mid x_{prompt})$ and the ground truth $y$. The label token with the highest probability is selected as the predicted relation class between the entities in instance $\mathbf{x}$.

3.2. Relation-Align Module

The objective of the relation-align module is to align the label tokens with the deep decision-making layer of the model. In the context of relation extraction models, the label token on the input side may lose its inherent features in the subsequent encoder layers, which reduces its effectiveness in aiding the prediction of relations in the deep decision-making layer.

To tackle this problem, we propose the relation-align module, which is integrated at the end of our model. The verbaliser $V_{\phi}$ is applied to the hidden vector $h_{c_{i}}$ of each label token $c_{i}$ in $x_{prompt}$ to reinforce its semantic meaning, analogously to the mask loss. To accomplish this, a probability score $p(c_{i}|\mathbf{x},c_{i})=p\left(\tilde{c_{i}}=\mathcal{C}\mid x_{prompt}\right)$ is computed for each label token, which should be classified into itself by the verbaliser, where $\tilde{c_{i}}$ denotes $V_{\phi}(h_{c_{i}})$.

The cross-entropy loss for each label token is computed based on the probability distribution, and then the losses are averaged to obtain the label loss $\mathcal{L}_{label}$:

(8) $\mathcal{L}_{c_{i}}=\frac{1}{m}\sum_{i=1}^{m}y_{x}\log p(c_{i}|\mathbf{x},c_{i}),\qquad\mathcal{L}_{label}=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}_{c_{i}}.$
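The label loss can be sketched as follows, reusing the verbaliser sketched above; this is a simplified reading of Equ. (8) in which each label token's hidden vector should be classified back into its own class, and the variable names are assumptions.

    import torch
    import torch.nn.functional as F

    def label_loss(h_label_tokens: torch.Tensor, verbalizer) -> torch.Tensor:
        """h_label_tokens: (m, d) hidden vectors of the m label prompt tokens."""
        m = h_label_tokens.size(0)
        probs = verbalizer(h_label_tokens)          # (m, m) class probabilities
        targets = torch.arange(m)                   # token c_i should map to class i
        # Cross-entropy per label token, averaged over the m tokens.
        return F.nll_loss(torch.log(probs + 1e-12), targets)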

3.3. Entity-Aware Module

In this section, we introduce an entity-aware module aimed at addressing the issue of a PLM failing to consider given entities when predicting the relation between them.

Through analysis of numerous error cases, we observed that a PLM can detect a relation in a sentence, but it may not exist between the two given entities. In other words, the model does not take the given entities into account when predicting the relationship.

Drawing from TransE (Bordes et al., 2013), which suggests a correlation $s+r=o$ between the features of the entities ($s$, $o$) and the relation $r$, we propose the entity-aware module. The module employs the distance metric $d(s,r,o)=\lVert s+r-o\rVert_{2}$ to measure the distance between entities and relations. Furthermore, the dimensions of the encoder outputs for entities and relations are reduced to eliminate redundant features:

(9) $s=\phi_{sub}\cdot h_{sub},\quad o=\phi_{obj}\cdot h_{obj},\quad r=\phi_{rel}\cdot h_{mask},$

where $\phi_{sub}$, $\phi_{obj}$ and $\phi_{rel}\in\mathbb{R}^{p\times d}$ are trainable parameters, and $p$ is the length of the hidden vector after dimension reduction.

In Equ. (9), the tuple $(s,r,o)$ is considered a positive example, while negative examples $(s^{\prime},r,o^{\prime})$ are randomly sampled for the entity-aware loss, where $s^{\prime}$ and $o^{\prime}$ denote two arbitrary spans from the sentence. The entity-aware loss, denoted by $\mathcal{L}_{entity}$, is formulated as follows:

(10) $\mathcal{L}_{entity}=-\log\sigma\left(\gamma-d\left(s,r,o\right)\right)-\log\sigma\left(d\left(s^{\prime},r,o^{\prime}\right)-\gamma\right),$

where $\sigma$ is the sigmoid function and $\gamma$ is the margin, which is set to 0.3.
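A possible PyTorch sketch of the entity-aware loss in Equs. (9)-(10) is shown below; the projection sizes and how the negative spans are encoded are assumptions left to the caller.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EntityAwareLoss(nn.Module):
        def __init__(self, hidden_size: int, proj_size: int, margin: float = 0.3):
            super().__init__()
            # phi_sub, phi_obj and phi_rel from Equ. (9).
            self.phi_sub = nn.Linear(hidden_size, proj_size, bias=False)
            self.phi_obj = nn.Linear(hidden_size, proj_size, bias=False)
            self.phi_rel = nn.Linear(hidden_size, proj_size, bias=False)
            self.margin = margin

        def forward(self, h_sub, h_obj, h_mask, h_sub_neg, h_obj_neg):
            s, o, r = self.phi_sub(h_sub), self.phi_obj(h_obj), self.phi_rel(h_mask)
            s_n, o_n = self.phi_sub(h_sub_neg), self.phi_obj(h_obj_neg)
            d_pos = torch.norm(s + r - o, p=2, dim=-1)      # d(s, r, o)
            d_neg = torch.norm(s_n + r - o_n, p=2, dim=-1)  # d(s', r, o')
            # Equ. (10): positives pulled inside the margin, negatives pushed outside.
            return (-F.logsigmoid(self.margin - d_pos)
                    - F.logsigmoid(d_neg - self.margin)).mean()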

Different from previous approaches that use simple negative samples, we use more challenging negative samples and compute the contrastive loss only within the current sentence during the training phase. Our experiments reveal that negative samples drawn from other instances in the same batch can hinder the convergence of this portion of the loss, so we disregard them.

3.4. Attention Query Strategy

As shown in Equ. (3), the input sequence comprises two distinct categories of tokens, namely prompt tokens and sentence tokens. However, the inclusion of additional unknown tokens can significantly impact the semantic meaning of the sentence and subsequently, the efficacy of relation extraction.

To mitigate the adverse impact of prompt tokens on sentence semantics, we employ an attention query strategy during the PLM encoding process. This approach utilises distinct query matrices for different token pairs in the self-attention layer to minimise the influence of prompt tokens on sentence semantics and enhance the quality of relation extraction outcomes. Further details are provided in Fig. 1.

The self-attention layer in the transformer encoder can be formulated as follows:

(11) $\operatorname{SelfAttn}(E)=\operatorname{softmax}\left(\frac{QE\cdot(KE)^{T}}{\sqrt{d}}\right)\cdot VE,$

where $E$ represents the input, and $Q\in\mathbb{R}^{d\times d}$, $K\in\mathbb{R}^{d\times d}$ and $V\in\mathbb{R}^{d\times d}$ denote the mapping matrices. In our method, we select a different query mapping matrix depending on which pair of tokens is interacting: $Q_{pp}$, $Q_{ps}$, $Q_{sp}$, or $Q_{ss}$. The effective initialisation from the pre-trained parameters enables the method to maintain strong few-shot performance even when the additional query matrices are incorporated.
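The following single-head sketch illustrates how such pair-dependent query projections could be realised; the masking logic follows the description above, but the exact implementation details are assumptions rather than our released code.

    import torch
    import torch.nn as nn

    class QuerySplitAttention(nn.Module):
        def __init__(self, d: int):
            super().__init__()
            # Separate query projections for prompt/sentence (query, key) pairs.
            self.q = nn.ModuleDict({k: nn.Linear(d, d) for k in ["pp", "ps", "sp", "ss"]})
            self.k = nn.Linear(d, d)
            self.v = nn.Linear(d, d)
            self.d = d

        def forward(self, e: torch.Tensor, is_prompt: torch.Tensor) -> torch.Tensor:
            # e: (L, d) token states; is_prompt: (L,) bool mask for prompt positions.
            keys, values = self.k(e), self.v(e)
            q = {name: proj(e) for name, proj in self.q.items()}   # four (L, d) queries
            p_i, p_j = is_prompt[:, None], is_prompt[None, :]      # query side / key side
            scores = torch.where(
                p_i & p_j, q["pp"] @ keys.t(),
                torch.where(p_i & ~p_j, q["ps"] @ keys.t(),
                torch.where(~p_i & p_j, q["sp"] @ keys.t(), q["ss"] @ keys.t())))
            attn = torch.softmax(scores / self.d ** 0.5, dim=-1)
            return attn @ values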

3.5. Objective Function

We propose three loss functions to optimise the model parameters: mask loss for relation prediction, label loss for semantic consistency of label prompt tokens, and entity loss for improving entity awareness.

Our objective is to minimise the final loss $\mathcal{L}$, which is the weighted sum of these losses:

(12) $\mathcal{L}=\mathcal{L}_{mask}+\alpha_{1}\mathcal{L}_{label}+\alpha_{2}\mathcal{L}_{entity},$

where $\alpha_{1}$ and $\alpha_{2}$ are the weights for the label loss and entity loss, respectively. In this paper, we set them to 1 and 0.04 based on our empirical experimental results.

Table 2. Statistics of different datasets for relation classification.
Dataset #Training #Validation #Test #Relation
SemEval 6,507 1,493 2,717 19
TACRED 68,124 22,631 15,509 42
TACREV 68,124 22,631 15,509 42
ReTACRED 58,465 19,584 13,418 40

4. Experiment

In this section, we evaluate the proposed method on four benchmark datasets with the widely used micro $F_{1}$ metric. For multi-class classification tasks, there are two commonly used averaging methods, namely macro and micro scores. The macro $F_{1}$ score computes the global score by averaging the per-class scores, while the micro-averaging method computes the global score from the pooled True Positives (TP), False Negatives (FN), and False Positives (FP) across all classes. In this paper, we use the micro $F_{1}$ score as the primary metric for all the experiments.
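For reference, the sketch below computes micro and macro $F_{1}$ with scikit-learn on a toy prediction list; this is a generic illustration of the metric, not the evaluation script used in our experiments.

    from sklearn.metrics import f1_score

    y_true = [1, 2, 0, 2, 1, 0]   # gold relation class indices
    y_pred = [1, 2, 0, 1, 1, 0]   # predicted relation class indices

    micro_f1 = f1_score(y_true, y_pred, average="micro")  # pools TP/FP/FN over classes
    macro_f1 = f1_score(y_true, y_pred, average="macro")  # averages per-class F1 scores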

Table 3. $F_{1}$ scores (%) on TACRED, TACREV, ReTACRED and SemEval in the few-shot scenario. The best results are marked in bold.
kk-shot Methods TACRED TACREV ReTACRED SemEval
8-shot fine-tuning 12.2 13.5 28.5 41.3
GDPNet (Xue et al., 2021) 11.8 12.3 29.0 42.0
PTR (Han et al., 2021) 28.1 28.7 51.5 70.5
KnowPrompt (Chen et al., 2022) 32.0 32.1 55.3 74.3
LabelPrompt(ours) 33.9 34.8 61.5 77.0
16-shot fine-tuning 21.5 22.3 49.5 65.2
GDPNet 22.5 23.8 50.0 67.5
PTR 30.7 31.4 56.2 81.3
KnowPrompt 35.4 33.1 63.3 82.9
LabelPrompt(ours) 34.4 35.4 64.5 81.7
32-shot fine-tuning 28.0 28.2 56.0 80.1
GDPNet 28.8 29.1 56.5 81.2
PTR 32.1 32.4 62.1 84.2
KnowPrompt 36.5 34.7 65.0 84.8
LabelPrompt(ours) 35.4 36.8 66.7 84.6
Table 4. $F_{1}$ scores (%) on TACRED, TACREV, ReTACRED and SemEval in the full-data scenario. In the “Extra Data” column, “w/o” means that no additional data is used for pre-training and fine-tuning, while “w/” means that extra data or knowledge bases are used for data augmentation. The best results are marked in bold.
Model Extra Data TACRED TACREV ReTACRED SemEval
Fine-tuning pre-trained models
RoBERTa_large (Liu et al., 2019) w/o 68.7 76.0 84.9 87.6
KnowBERT (Chen et al., 2022) w/ 71.5 79.3 - 89.1
MTB (Baldini Soares et al., 2019) w/ 70.1 - - 89.5
SpanBERT (Joshi et al., 2020) w/ 70.8 78.0 85.3 -
LUKE (Yamada et al., 2020) w/ 72.7 80.6 90.3 -
Prompt tuning pre-trained models
PTR w/o 72.4 81.4 90.9 89.9
KnowPrompt w/o 72.4 82.4 91.3 90.2
LabelPrompt(ours) w/o 73.1 82.5 91.6 91.3

4.1. Datasets

To better verify the effectiveness of our idea, we use the most popular relation classification datasets, including TACRED (Zhang et al., 2017), TACREV (Alt et al., 2020), ReTACRED (Stoica et al., 2021), and SemEval (Hendrickx et al., 2010).

TACRED is one of the largest and most widely used crowd-sourced datasets in relation extraction. It is created by combining available human annotations from the TAC KBP challenges with crowd-sourcing, and it contains 42 relation types (including “no_relation”).

TACREV is a modified version of TACRED. It corrects the label errors in the validation and test sets while leaving the training set unchanged.

ReTACRED is a new completely re-annotated version of the TACRED dataset. The dataset fixes many labelling errors in the TACRED dataset and refactors the training, validation, and test sets.

SemEval is a traditional dataset in relation classification. It contains 9 directional relations and one special relation, “Other”.

Annotation quality of the datasets. In the field of natural language processing, an essential component of dataset preparation is the quality of the annotation. We compare the annotation quality of the four datasets: the results indicate that the ReTACRED dataset has the highest annotation quality among them. Although the TACREV and TACRED datasets share the same training data, Alt et al. (2020) improved the quality of annotations in the TACREV test set. The statistical details of each dataset are tabulated in Table 2.

4.2. Baselines

In this paper, we compare the proposed LabelPrompt method with several representative approaches for relation classification. LabelPrompt is a novel method that leverages label prompt tokens to enhance pre-trained language models for this task. We select baselines that cover different aspects of relation classification, such as fine-tuning, prompt-based learning, and knowledge integration.

For fine-tuning pre-trained models, RoBERTa is a very popular language model that improved BERT’s training strategy to overcome its under-training problem, and it also serves as the text encoder in this paper. We also review some other pre-trained models that incorporate different types of knowledge into their representations. For example, KnowBERT embeds multiple Knowledge Bases (KBs) into a PLM to leverage structured, human-curated knowledge. MTB builds task-agnostic relation representations from entity-linked text and fine-tunes them on supervised relation extraction datasets. SpanBERT focuses on improving span representations by defining a masked span prediction task for representing and predicting spans of text with varying lengths. LUKE predicts randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia.

For prompt-based learning models, we select two representative works. PTR (Han et al., 2021) uses logic rules to create prompts with sub-prompts that encode the prior knowledge of each class. KnowPrompt (Chen et al., 2022) injects the latent knowledge contained in relation labels into the construction and optimisation of the prompts.

4.3. Experiment Settings

We use $\textrm{RoBERTa}_{large}$ (Liu et al., 2019) as the pre-trained language model for all our experiments and follow the same settings as the previous methods. We compare the performance of our proposed method with other methods in two scenarios: full-data and few-shot. In the full-data scenario, we use all the training samples in the dataset for network tuning. In the few-shot scenario, we use the $k$-shot approach, where $k\in\{8,16,32\}$, to evaluate our model’s performance. The $k$-shot approach means that we sample $k$ instances from each class in the training/validation set as the training/validation data, while using all the test instances to evaluate the model’s performance.
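A minimal sketch of this $k$-shot sampling protocol is given below; it assumes each example is stored as a dictionary with a "relation" field, which is an illustrative data format rather than the exact one used in our code.

    import random
    from collections import defaultdict

    def sample_k_shot(examples, k, seed=42):
        """Draw k training/validation instances per relation class."""
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for ex in examples:
            by_class[ex["relation"]].append(ex)
        subset = []
        for relation, items in by_class.items():
            subset.extend(rng.sample(items, min(k, len(items))))
        return subset

    # Usage: train_8shot = sample_k_shot(train_examples, k=8); the full test set is
    # still used for evaluation.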

We conducted our experiments on an NVIDIA GeForce RTX 2080 Ti GPU. We set the batch size to 16 and the learning rate to 4e-5 and 4e-6 for the few-shot and full-data scenarios, respectively.

Figure 2. The $F_{1}$ scores of different $k$-shot settings compared with KnowPrompt.

4.4. Results

We first evaluate and analyse the proposed LabelPrompt method by comparing it with both fine-tuning and prompt-based methods. The experimental results are reported in Table 3 and Table 4.

Few-Shot. In Table 3, we compare our approach with the traditional fine-tuning and prompt-based methods in the few-shot scenario. Our approach outperforms the other methods in most cases. Furthermore, on the TACREV and ReTACRED datasets, which have more accurate and consistent annotations, our method outperforms all the other methods in all the few-shot settings.

As compared with the fine-tuning methods, our results show that prompt-based methods perform better in the few-shot scenarios for the NLU tasks with low resources. We attribute this to the fact that prompt-based learning effectively uses the knowledge stored in PLMs to fit the task. In contrast, fine-tuning methods rely on complex decoding modules and training strategies based on PLMs, which require sufficient data.

As compared with the previous best prompt-based methods, LabelPrompt achieves 1.5%-6.2% performance gains on both the ReTACRED and TACREV datasets, which have high labelling quality. It also achieves 1.9%-2.7% improvements on the TACRED and SemEval datasets in the 8-shot setting.

These two aspects demonstrate that the explicit label tokens can aid the pre-trained model in the few-shot scenario. We believe that these additional tokens can reduce the search space for target tokens during the early stage of model training. To further validate our model’s few-shot performance, we design an additional 4-shot setting, as shown in Fig. 2. The experimental results demonstrate that, even with only 4 instances per class, the proposed method achieves the same performance as other methods in the 8-shot setting.

Full-Data. In the full-data scenario, as shown in Table 4, the proposed method also beats all the other fine-tuning and prompt-based approaches.

In comparison with KnowPrompt, LabelPrompt improves the performance by 0.7% in TACRED and by 1.1% in SemEval. This demonstrates that the prompt-learning method can be applied to both few-shot and full-data tasks. In fact, LabelPrompt leverages the knowledge embedded in PLMs to obtain remarkable results in downstream tasks with a simple design.

So far, we have discussed the performance of the proposed method, and the results show that LabelPrompt is superior to the other methods in both the few-shot and full-data scenarios.

4.5. Ablation Study

In this section, we evaluate the performance of each component of LabelPrompt in both few-shot and full-data settings. The results are presented in Table 5 and Fig. 3.

Figure 3. Ablation experiments on different token strategies. “use Mask tokens” means we replace all label tokens in the prompt template with the special token [MASK]. “use Learnable tokens” means we replace the label tokens in the prompt template with $m$ additional learnable tokens.
Table 5. The $F_{1}$ scores (%) of the ablation experiments on ReTACRED.
Method 8-shot 16-shot 32-shot full
LabelPrompt 62.1 63.8 66.7 91.6
- w/o label prompt tokens 57.6 61.6 64.7 91.5
- w/o entity-aware module 58.9 61.7 64.7 91.2
- w/o attention query strategy 61.8 62.7 65.4 90.9
Figure 4. The left subplot shows the ON between the mask token and label tokens for different relations. The right subplot compares the data from rows 15 and 17 of the left subplot in two colours. For example, the blue histogram represents the ON between the mask token and label tokens when the true label is “org:shareholders”.

4.5.1. Impact of Label Prompt Tokens

The prompt template is one of the most important factors affecting the effectiveness of prompt-based methods, and our prompt template method relies heavily on label prompt tokens. Therefore, we demonstrate their impact with additional experiments.

First, we ablate the label prompt tokens from the input and remove the label objective $\mathcal{L}_{label}$ from the overall loss function. Without the label prompt tokens, the model must search over the full vocabulary space to predict the correct labels without any explicit label cue. This significantly increases the difficulty of the prediction task, making it challenging for the model to converge efficiently. As shown in Table 5, removing the label prompts leads to a substantial performance degradation, especially under few-shot settings. To verify that the performance gains are due to the semantic guidance from label prompts rather than simply longer prompt lengths, we evaluate two additional ablated configurations: 1) replacing all label tokens in $T(\mathbf{x})$ with $[\textrm{MASK}]$, and 2) substituting the label tokens with randomly initialised, learnable prompt embeddings.

Fig. 3 shows that both configurations result in a significant performance drop in few-shot settings. This suggests that label tokens are effective due to their high correlation with the labels in the classification task. Moreover, adding more learnable tokens does not help, as the model cannot fit them with limited training data and performs worse in full-data settings.

To visualise the effectiveness of label prompt tokens, we project the hidden vectors of mask token outputs onto a two-dimensional space based on different relations. The ReTACRED test set samples (excluding ”no relation” and ”per:country_of_birth”) are used for visualisation, with a total of 5,648 samples covering 38 different relations.

4.5.2. Impact of The Entity-Aware Module

We propose an Entity-Aware Module to improve the performance of the proposed method for relation classification. This module helps the model to focus on the relation class between the entities in a sentence. As shown in Table 5, removing the Entity-Aware Module decreases the accuracy in all scenarios. The full-data scenario is mostly affected by this removal.

In Table 6, we provide examples of relation extraction by our model with and without the entity-aware module. The entity-aware module improves the accuracy of predicting the relations between entities in a sentence, as shown by the results for No.1254 and No.9035. Without the entity-aware module, the model only considers the relations implied in the sentence, while with the entity-aware module, it takes both entities into account for more accurate relation prediction.

4.5.3. Impact of Attention Query Strategy

Label prompt tokens are not defined during the pre-training phase of PLM. Although these tokens are initialised with the semantics of the label text, they can still affect the semantics of the entire sentence when the model processes the sentence sequence. As presented in Table 5, when the attention query strategy is removed from our model, we observe varying degrees of performance degradation in all scenarios.

Table 6. Case study for the entity-aware module. The underlined text represents the ground-truth relation class.
Sentence Result
[NO.1254] It has been a little over a year since Travis the Chimp, the 14-year-old [object], 200-pound pet of Sandra Herold [subject], 71, mauled a family friend in Herold’s driveway. w/o : per:age
w/ : no_relation
[NO.9035] Knox [object]’s mother Edda Mellas was shaking with grief as her [subject] stepmother Cassandra Knox moved to comfort her. w/o : per:children
w/ : per:identity

5. Analysis

Figure 5. Illustration of the activated value sequence $s$.

In this section, we will conduct a detailed analysis of why label prompt tokens are useful for the relation classification task.

Some studies (Geva et al., 2021; Dai et al., 2022) regard the Feed-Forward Network (FFN) within the pre-trained language model as a place to store knowledge. It has been demonstrated that prompt learning is useful for PLMs because prompts can activate specific neurons in the feed-forward network for given inputs. Su et al. (2022) further considered each node in FFN as a neuron and found that the activation layer of the feed-forward network corresponds to a specific behaviour of the model.

As shown in Fig. 5, FFN in a transformer layer can be formulated as:

(13) $\mathrm{FFN}(\mathbf{x})=\mathrm{GELU}(\mathbf{x}\mathbf{W}_{1}^{\top}+\mathrm{b}_{1})\mathbf{W}_{2}+\mathrm{b}_{2},$

where $\mathbf{x}\in\mathbb{R}^{d}$ denotes the hidden state of the self-attention output, and $\mathbf{W}_{1}\in\mathbb{R}^{4d\times d}$ and $\mathbf{W}_{2}\in\mathbb{R}^{4d\times d}$ represent the weights of the two dense layers. In this paper, we consider the output of the activation function after the first dense layer in the FFN as the activated neuron sequence, denoted as $\mathbf{S}$.

For each instance, the activated neuron sequence can be obtained by:

(14) $\mathbf{S}=\{s_{c_{1}},s_{c_{2}},\dots,s_{c_{m}},s_{M}\},$

where $s_{M}$ is the activated neuron sequence of the mask token.

We define a function $\mathbf{ON}(s_{a},s_{b})$ to calculate the Overlapping rate of activated Neurons (ON) of two activated neuron sequences $s_{a}$ and $s_{b}$:

(15) $\mathbf{ON}(s_{a},s_{b})=\frac{\sum_{k=1}^{4d}\zeta(s_{a,k}>0\wedge s_{b,k}>0)}{\sum_{k=1}^{4d}\zeta(s_{a,k}>0)+\sum_{k=1}^{4d}\zeta(s_{b,k}>0)},$

where $\zeta(c)$ is an indicator function whose result is 1 if the condition $c$ is true and 0 otherwise.
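A direct implementation of Equ. (15) could look like the sketch below, where the activation vectors are assumed to have been captured (e.g. with a forward hook) from the output of the GELU in the last encoder layer.

    import torch

    def overlap_rate(s_a: torch.Tensor, s_b: torch.Tensor) -> float:
        """s_a, s_b: (4d,) post-GELU activation vectors for two tokens."""
        act_a, act_b = s_a > 0, s_b > 0                  # activated neurons
        both = (act_a & act_b).sum().item()              # numerator of Equ. (15)
        return both / (act_a.sum().item() + act_b.sum().item())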

The ON function between different tokens can be used as the associated probability of the corresponding model behaviours. Therefore, we compute the ON between the label prompt tokens $\mathcal{C}$ and the mask token [MASK] at the last encoder layer, and take it as a measure of the task-driven capability of the label prompt tokens.

The data used for visualisation and analysis contains 5,648 samples and 38 relations (excluding “no_relation” and “per:country_of_birth”) from ReTACRED. To obtain Fig. 4, we record the activated values in the last encoder layer for all the test instances and consider the neurons with positive values as activated. To avoid distribution errors caused by sampling, we average the overlapping rate over all samples of relation $i$ to obtain the $\mathbf{ON}$ value.

As shown in Fig. 4 (a), each point $r(i,j)$ is defined as:

(16) $r(i,j)=\frac{1}{|\mathcal{C}_{i}|}\sum_{u=1}^{|\mathcal{C}_{i}|}\mathbf{ON}(s_{j}^{(u)},s_{M}^{(u)})\quad\mathrm{s.t.}\quad gt=c_{i},$

which denotes the ON result between the mask token and label token $c_{j}$ when the ground-truth relation is $c_{i}$. Fig. 4 (b) shows the $\mathbf{ON}$ values for the relations “org:shareholders” and “per:city_of_birth”.

According to Fig. 4 (a), the colour of the diagonal blocks is darker. The block on the diagonal, $r(i,i)$, represents the ON result between the mask token sequence $s_{M}$ and the ground-truth label prompt token sequence $s_{c_{i}}$, i.e. the mask token always has a higher overlapping rate with the label token corresponding to the ground-truth label. This indicates that the label token can lead the mask token to activate the knowledge of the corresponding relation stored in a PLM.

Furthermore, we choose two representative relations to illustrate the results. Fig. 4 (b) shows the histograms for two rows of data from Fig. 4 (a), which clearly show the difference in the overlapping rate between the mask and label tokens. The histogram in orange is the result for the relation “org:shareholders”: compared to other label tokens, the ON result between the mask token and label token 17 is at its peak. Meanwhile, the histogram in blue shows that when the relation is “per:city_of_birth”, the overlapping rate is high not only for token 15, but also for tokens 26, 34, 36, and so on. In fact, all of these tokens also contain “city” in their labels, which is consistent with our intuition that semantically related labels have a high overlapping ratio of activated knowledge.

6. Conclusion

In this paper, we proposed LabelPrompt, a strategy that effectively applies label information to the relation classification task via prompt learning and prompt engineering. By explicitly adding label tokens to sentences before they are fed into the model, LabelPrompt enables the model to adapt effectively to the downstream task. The additional entity-aware module also alleviates the problem of predicted relations that do not correspond to the given entities.

The experimental results demonstrated that the label prompt tokens are effective in both the few-shot and full-data scenarios. The method focused more on the additional label tokens and correctly predicted the relation class of the entities in a sentence. In future work, we plan to improve our method’s performance on poorly annotated datasets in low-resource scenarios and extend our work to other text classification tasks.

Acknowledgements.
This work was supported in part by Major Project of the National Social Science Foundation of China (No. 21&ZD166), National Natural Science Foundation of China (61876072) and Natural Science Foundation of Jiangsu Province (No. BK20221535).

References

  • Alt et al. (2020) Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. 2020. TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 1558–1569. https://doi.org/10.18653/v1/2020.acl-main.142
  • Baldini Soares et al. (2019) Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the Blanks: Distributional Similarity for Relation Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 2895–2905. https://doi.org/10.18653/v1/P19-1279
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems, C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (Eds.), Vol. 26. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Califf and Mooney (1997) Mary Elaine Califf and Raymond J. Mooney. 1997. Relational Learning of Pattern-Match Rules for Information Extraction. In CoNLL97: Computational Natural Language Learning. https://aclanthology.org/W97-1002
  • Chen et al. (2022) Xiang Chen, Ningyu Zhang, Xin Xie, Shumin Deng, Yunzhi Yao, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. 2022. Knowprompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. In Proceedings of the ACM Web Conference 2022. 2778–2788.
  • Cohen et al. (2020) Amir Cohen, Shachar Rosenman, and Yoav Goldberg. 2020. Relation Extraction as Two-way Span-Prediction. ArXiv abs/2010.04829 (2020).
  • Cui et al. (2021) Leyang Cui, Yu Wu, Jian Liu, Sen Yang, and Yue Zhang. 2021. Template-Based Named Entity Recognition Using BART. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Online, 1835–1845. https://doi.org/10.18653/v1/2021.findings-acl.161
  • Dai et al. (2022) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge Neurons in Pretrained Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 8493–8502. https://doi.org/10.18653/v1/2022.acl-long.581
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186. https://www.aclweb.org/anthology/N19-1423.pdf
  • Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making Pre-trained Language Models Better Few-shot Learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 3816–3830. https://doi.org/10.18653/v1/2021.acl-long.295
  • Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer Feed-Forward Layers Are Key-Value Memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 5484–5495. https://doi.org/10.18653/v1/2021.emnlp-main.446
  • Gu et al. (2022) Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. 2022. PPT: Pre-trained Prompt Tuning for Few-shot Learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 8410–8423. https://doi.org/10.18653/v1/2022.acl-long.576
  • Guo et al. (2022) Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang. 2022. LongT5: Efficient Text-To-Text Transformer for Long Sequences. In Findings of the Association for Computational Linguistics: NAACL 2022. Association for Computational Linguistics, Seattle, United States, 724–736. https://doi.org/10.18653/v1/2022.findings-naacl.55
  • Hambardzumyan et al. (2021) Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 4921–4933. https://doi.org/10.18653/v1/2021.acl-long.381
  • Han et al. (2021) Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021. PTR: Prompt Tuning with Rules for Text Classification. arXiv preprint arXiv:2105.11259 (2021).
  • Hendrickx et al. (2010) Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of Nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics, Uppsala, Sweden, 33–38. https://aclanthology.org/S10-1006
  • Hu et al. (2022) Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Jingang Wang, Juanzi Li, Wei Wu, and Maosong Sun. 2022. Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 2225–2240. https://doi.org/10.18653/v1/2022.acl-long.158
  • Jiang et al. (2020a) Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2020a. SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 2177–2190. https://doi.org/10.18653/v1/2020.acl-main.197
  • Jiang et al. (2020b) Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020b. How Can We Know What Language Models Know? Transactions of the Association for Computational Linguistics 8 (2020), 423–438. https://doi.org/10.1162/tacl_a_00324
  • Joshi et al. (2020) Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics 8 (2020), 64–77.
  • Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 3045–3059. https://doi.org/10.18653/v1/2021.emnlp-main.243
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 4582–4597. https://doi.org/10.18653/v1/2021.acl-long.353
  • Liu et al. (2021a) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv preprint arXiv:2107.13586 (2021).
  • Liu et al. (2022) Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Dublin, Ireland, 61–68. https://doi.org/10.18653/v1/2022.acl-short.8
  • Liu et al. (2021b) Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT Understands, Too. arXiv preprint arXiv:2103.10385 (2021).
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019).
  • Lyu and Chen (2021) Shengfei Lyu and Huanhuan Chen. 2021. Relation Classification with Entity Type Restriction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Online, 390–395. https://doi.org/10.18653/v1/2021.findings-acl.34
  • Ma et al. (2022) Ruotian Ma, Xin Zhou, Tao Gui, Yiding Tan, Linyang Li, Qi Zhang, and Xuanjing Huang. 2022. Template-free Prompt Tuning for Few-shot NER. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 5721–5732. https://doi.org/10.18653/v1/2022.naacl-main.420
  • Moiseev et al. (2022) Fedor Moiseev, Zhe Dong, Enrique Alfonseca, and Martin Jaggi. 2022. SKILL: Structured Knowledge Infusion for Large Language Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 1581–1588. https://doi.org/10.18653/v1/2022.naacl-main.113
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In NAACL.
  • Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language Models as Knowledge Bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 21, 140 (2020), 1–67.
  • Shin et al. (2021) Richard Shin, Christopher Lin, Sam Thomson, Charles Chen, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. 2021. Constrained Language Models Yield Few-Shot Semantic Parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 7699–7715. https://doi.org/10.18653/v1/2021.emnlp-main.608
  • Shin et al. (2020b) Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020b. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Shin et al. (2020a) Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020a. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 4222–4235. https://doi.org/10.18653/v1/2020.emnlp-main.346
  • Stoica et al. (2021) George Stoica, Emmanouil Antonios Platanios, and Barnabas Poczos. 2021. Re-TACRED: Addressing Shortcomings of the TACRED Dataset. Proceedings of the AAAI Conference on Artificial Intelligence 35, 15 (May 2021), 13843–13850. https://ojs.aaai.org/index.php/AAAI/article/view/17631
  • Su et al. (2022) Yusheng Su, Xiaozhi Wang, Yujia Qin, Chi-Min Chan, Yankai Lin, Huadong Wang, Kaiyue Wen, Zhiyuan Liu, Peng Li, Juanzi Li, Lei Hou, Maosong Sun, and Jie Zhou. 2022. On Transferability of Prompt Tuning for Natural Language Processing. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 3949–3969. https://doi.org/10.18653/v1/2022.naacl-main.290
  • Tan et al. (2022) Zhixing Tan, Xiangwen Zhang, Shuo Wang, and Yang Liu. 2022. MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 6131–6142. https://doi.org/10.18653/v1/2022.acl-long.424
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  • Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal Adversarial Triggers for Attacking and Analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 2153–2162. https://doi.org/10.18653/v1/D19-1221
  • Wang et al. (2021) Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. 2021. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Online, 1405–1418. https://doi.org/10.18653/v1/2021.findings-acl.121
  • Wu et al. (2021) Shuang Wu, Xiaoning Song, and Zhenhua Feng. 2021. MECT: Multi-Metadata Embedding based Cross-Transformer for Chinese Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 1529–1539. https://doi.org/10.18653/v1/2021.acl-long.121
  • Xia et al. (2022) Wei Xia, Qianqian Wang, Quanxue Gao, Ming Yang, and Xinbo Gao. 2022. Self-consistent Contrastive Attributed Graph Clustering with Pseudo-label Prompt. IEEE Transactions on Multimedia (2022), 1–13. https://doi.org/10.1109/TMM.2022.3213208
  • Xu et al. (2021) Haoran Xu, Benjamin Van Durme, and Kenton Murray. 2021. BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 6663–6675. https://doi.org/10.18653/v1/2021.emnlp-main.534
  • Xue et al. (2021) Fuzhao Xue, Aixin Sun, Hao Zhang, and Eng Siong Chng. 2021. GDPNet: Refining Latent Multi-View Graph for Relation Extraction. In AAAI.
  • Yamada et al. (2020) Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6442–6454. https://doi.org/10.18653/v1/2020.emnlp-main.523
  • Yin et al. (2019) Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3914–3923. https://doi.org/10.18653/v1/D19-1404
  • Zeng et al. (2015) Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 1753–1762. https://doi.org/10.18653/v1/D15-1203
  • Zhang et al. (2022) Haoxing Zhang, Xiaofeng Zhang, Haibo Huang, and Lei Yu. 2022. Prompt-Based Meta-Learning For Few-shot Text Classification. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 1342–1357. https://aclanthology.org/2022.emnlp-main.87
  • Zhang et al. (2017) Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware Attention and Supervised Data Improve Slot Filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 35–45. https://doi.org/10.18653/v1/D17-1004
  • Zhong et al. (2022) Wanjun Zhong, Yifan Gao, Ning Ding, Yujia Qin, Zhiyuan Liu, Ming Zhou, Jiahai Wang, Jian Yin, and Nan Duan. 2022. ProQA: Structural Prompt-based Pre-training for Unified Question Answering. arXiv e-prints (2022).
  • Zhong and Chen (2021) Zexuan Zhong and Danqi Chen. 2021. A Frustratingly Easy Approach for Entity and Relation Extraction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 50–61. https://doi.org/10.18653/v1/2021.naacl-main.5
  • Zhong et al. (2021) Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021. Factual Probing Is [MASK]: Learning vs. Learning to Recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 5017–5033. https://doi.org/10.18653/v1/2021.naacl-main.398
  • Zhu et al. (2022) Yi Zhu, Xinke Zhou, Jipeng Qiang, Yun Li, Yunhao Yuan, and Xindong Wu. 2022. Prompt-Learning for Short Text Classification. arXiv preprint arXiv:2202.11345 (2022).