
LabelPrompt: Effective Prompt-based Learning for Relation Classification

Wenjie Zhang [email protected] 0000-0002-3307-6189, Jiangnan University, 1800 Lihu Avenue, Wuxi, Jiangsu, China; Xiaoning Song [email protected], Jiangnan University, 1800 Lihu Avenue, Wuxi, Jiangsu, China; Zhenhua Feng [email protected], University of Surrey, Guildford GU2 7XH, Surrey, United Kingdom; Tianyang Xu [email protected], Jiangnan University, 1800 Lihu Avenue, Wuxi, Jiangsu, China; and Xiaojun Wu [email protected], Jiangnan University, 1800 Lihu Avenue, Wuxi, Jiangsu, China
Abstract.

Recently, prompt-based learning has gained popularity across many natural language processing (NLP) tasks by reformulating them into a cloze-style format to better align pre-trained language models (PLMs) with downstream tasks. However, applying this approach to relation classification poses unique challenges. Specifically, associating natural language words that fill the masked token with semantic relation labels (e.g. “org:founded_by”) is difficult. To address this challenge, this paper presents a novel prompt-based learning method, namely LabelPrompt, for the relation classification task. Motivated by the intuition to “GIVE MODEL CHOICES!”, we first define additional tokens to represent relation labels, regard these tokens as a verbaliser with semantic initialisation, and explicitly construct them into a prompt template. Then, to mitigate the inconsistency between predicted relations and given entities, we implement an entity-aware module with contrastive learning. Last, we adopt an attention query strategy within the self-attention layers to differentiate prompt tokens from sequence tokens. Together, these strategies enhance the adaptability of prompt-based learning, especially when only a small labelled dataset is available. Comprehensive experiments on benchmark datasets demonstrate the superiority of our method, particularly in the few-shot scenario.

relation classification, prompt learning, few-shot
ccs: Computing methodologies Information extraction

1. Introduction

Pretrained Language Models (PLMs) have achieved great success in many Natural Language Processing (NLP) tasks (Jiang et al., 2020a; Xu et al., 2021; Guo et al., 2022; Zhong and Chen, 2021), such as Relation Classification (RC), Named Entity Recognition (NER), etc. The strong performance of PLMs stems from their ability to learn rich knowledge representations from large unlabelled corpora through self-supervised pretraining objectives such as Masked Language Modelling (MLM). A common paradigm is to then fine-tune the pretrained parameters on downstream task datasets, thereby transferring the embedded knowledge in the PLM to improve performance on the target task (Peters et al., 2018; Lewis et al., 2020). However, a fundamental challenge with this paradigm is the inherent mismatch between the pretraining objective of the PLM and the actual downstream task. This discrepancy in objectives can result in a sub-optimal transfer of knowledge from pretraining to fine-tuning (Liu et al., 2021a).

To bridge the mismatch between pretraining and downstream tasks, prompt-based learning has emerged as an effective approach for utilising PLMs (Lester et al., 2021; Brown et al., 2020; Wallace et al., 2019). The key idea is to reformulate a downstream task into the masked language modelling format used during pretraining. This is achieved through two main components: 1) a prompt template that prepends instructions and context to the input, and 2) an answer mapping method (known as a verbaliser) that converts the model’s output into target labels. By transforming tasks into the native pretraining format, prompt-based learning reduces the discrepancy in objectives between pretraining and fine-tuning (Tan et al., 2022; Ma et al., 2022). Despite promising results, manually engineering prompt templates and verbalisers can be expensive and time-consuming, limiting wider adoption (Liu et al., 2021a). Recent work has focused on automatically generating prompts (Lester et al., 2021; Liu et al., 2022) or learning continuous prompt representations (Shin et al., 2020a; Li and Liang, 2021) to reduce the cost of prompt engineering. Overall, prompt-based learning offers an appealing paradigm for effectively utilising PLMs, warranting further research to address current limitations.

Meanwhile, inspired by the great success of prompt-based learning, recent studies have explored applications to text classification tasks (Zhang et al., 2022). In this paper, we specifically investigate prompts for relation classification, a prevalent text classification problem. Given a pair of entities mentioned in a text, the goal of relation classification is to predict the semantic relation between them, if any (e.g., “per:employee_of”, “org:founded” and “no_relation”) (Hendrickx et al., 2010). While prompt-based learning provides an appealing approach to use pretrained language models for this task, applying prompts to relation classification also poses unique challenges.

First, relation classification presents a unique challenge compared to many NLP tasks in that the target labels contain rich semantic information, as opposed to being single words representable within the pretrained model’s vocabulary. Consequently, existing answer mapping techniques struggle to effectively project the model’s masked outputs to the complex, multi-word relation labels when reformulating relation classification as a masked language modelling task.

To improve mapping performance, some studies suggest searching external data for label-related words to design a verbaliser (Hu et al., 2022; Gu et al., 2022; Zhu et al., 2022). However, these methods have yet to achieve optimal results. Alternatively, other approaches define additional learnable tokens external to the pretrained vocabulary, and map model outputs to these new tokens for classification (Chen et al., 2022). Although these methods can mitigate token mapping issues, tuning the new tokens requires more labelled training data due to the significant gap from the pretrained vocabulary. Thus, the first challenge in relation classification is developing effective training methodologies that can enhance model performance under limited data availability.

Second, the masked language modelling (MLM) pre-training objective depends on predictions derived from neighbouring words and global sentence features. However, relation classification requires not only identifying a relation from a sentence, but also determining whether the relation holds between two given entities. Existing approaches predominantly predict relations at the sentence level, without explicitly modelling the correspondence within the entity-relation triplet $(s, r, o)$.

For instance, consider an abbreviated excerpt from the ReTACRED (Stoica et al., 2021) dataset: “Travis the Chimp, the 14-year-old, 200-pound pet of Sandra Herold, 71.” In this example, “Sandra Herold” is denoted as the subject entity while “14-year-old” is denoted as the object entity. Despite the ground truth relation existing between “Sandra Herold” and “71”, the model incorrectly predicts the relation as “per:age” between “Sandra Herold” and “14-year-old”. Evidently, the model fails to accurately capture the correspondence between the given entities and their relation. Subsequently, an additional challenge arising in relation classification is augmenting the model’s capacity to discern given entities when inferring relations.

To mitigate the above limitations of prompt-tuning for relation classification, we develop a novel LabelPrompt approach with an entity-aware module. LabelPrompt is a simple yet effective method that significantly reduces the gap between pre-training objectives and the relation classification task. Unlike previous methods, LabelPrompt enables the model to capture relation labels more intuitively by augmenting the input sequence.

In terms of prompt template studies, many researchers have focused on constructing a prompt template that meets the requirements of the task formulation. At the same time, some studies have shown that the quality of the prompt template has a significant impact on task performance. Thus, we believe it is essential to provide the pre-trained model with supplementary task-relevant information via the input. Building on this idea, we suggest an intuitive technique for providing the model with selectable options. In particular, we develop a prompt template approach that embeds relation class labels into the input example before it is fed into the model.

Specifically, to enhance the ability of masked language modelling (MLM) for relation label prediction, we propose extending the vocabulary space with custom label tokens that are not present in natural language (e.g., “[C1]”). The embeddings of these tokens are initialised with semantic information derived from the relation label texts. However, this introduces a gap between pre-training and our task, as the additional label tokens are not present in the pre-training vocabulary space. To mitigate this limitation and improve prompt-based performance in few-shot scenarios, we expand the vocabulary with additional label prompt tokens and use these label tokens to construct prompt templates on the input side. Furthermore, to prevent additional prompt tokens from affecting sentence semantics, we redesign the attention query strategy to use distinct query projections for prompt and sentence token pairs. This controls prompt-sentence interactions and allows label prompt tokens to provide task-specific information without compromising the sentence semantics encoded in the pre-trained language model. This promotes independence between prompts and sentences to improve the efficacy of relation classification.

Second, to enhance entity perception during model training, we implement an entity-aware module to assess the correlation between entities and their relations. As shown in Fig. 1, we adopt a contrastive learning-inspired approach to construct positive and negative samples, where the former consist of the given entities and the predicted relation, while the latter contain randomly selected tokens from the sentence and the ground-truth relation. This effectively constrains the relation and entities with semantic information.

Finally, extensive evaluations on popular benchmark datasets demonstrate that our proposed approach yields significant performance improvements compared to fine-tuning and other prompt-based methods. We attribute this enhancement to our method’s ability to effectively constrain the model’s search space and guide optimisation in the correct direction.

To summarise, the main contributions of the proposed LabelPrompt method include:

  • We introduce a prompt-based method that explicitly exploits relation features and significantly improves the performance of prompt-based learning in few-shot scenarios.

  • We implement an entity-aware module that employs contrastive learning to enhance entity awareness during the inference stage.

  • We design an attention query strategy within self-attention layers to differentiate between prompt and sentence tokens and improve the performance of prompt-based learning.

The rest of this paper is organised as follows. We first introduce the related work in Section 2. Then we present the proposed LabelPrompt method in Section 3. The experimental results and analysis of the proposed method are reported in Section 4 and Section 5, respectively. Last, we draw the conclusion of the proposed method in Section 6.

2. Related Work

2.1. Pre-training and Fine-tuning

Pre-trained language models are trained on large-scale unlabelled text data to learn robust general-purpose features, such as lexical, syntactic, semantic, and word representations (Liu et al., 2022). Many large models, such as GPT (Liu et al., 2021b), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2020), have been proposed, greatly promoting research in natural language processing. Most NLP tasks benefit from the knowledge contained in pre-trained language models, and increasing the model size and the amount of pre-training data further facilitates downstream tasks. By fine-tuning the parameters of PLMs with additional task-specific modules and objective functions, the models can be adapted to most downstream tasks. Generally, the pre-training and fine-tuning paradigm has become the foundation for many NLP tasks, such as named entity recognition (Wu et al., 2021), relation classification (Lyu and Chen, 2021), and question answering (Zhong et al., 2022).

2.2. Prompt-based Learning

Although fine-tuning has achieved significant success in many fields, there remains a large gap between the objectives of pre-training and fine-tuning tasks (Liu et al., 2021a). Typically, during pre-training, a PLM is trained using self-supervised corpus texts with Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) tasks (Devlin et al., 2019). This differs from other NLP tasks, such as classification and sequence tagging tasks. As a result, recent studies have sought to explore how to effectively use pre-trained language models and address this gap (Wang et al., 2021; Moiseev et al., 2022).

Prompt-based learning is a different approach that aims to reduce the gap between pre-training and fine-tuning. Rather than fine-tuning, prompt-based learning reformulates a downstream task into the form of the original PLM training task. For example, a prompt template might reformulate a machine translation instance as “English: I love this film. French: __”. A large PLM could then fill in the blank with a French translation of the English text. Essentially, prompt-based learning enables PLMs to perform downstream tasks by providing additional information in the form of a prompt (Lester et al., 2021; Gao et al., 2021).

Many studies have shown that prompt templates containing semantics can effectively activate knowledge in the model. Building on this idea, Petroni et al. (2019) considered language models as knowledge bases and proposed the LAMA (LAnguage Model Analysis) probe dataset. The LAMA probe provides manually created completion templates for exploring the knowledge in PLMs. By using hand-crafted query sentences, it extracts the knowledge base from PLMs and performs comparably to using PLMs directly. However, the manual design of templates as discrete prompts is time-consuming, expensive (Shin et al., 2021), and often requires a wealth of expert knowledge.

To extend prompt-based learning to a wider range of tasks and reduce the overhead of prompt text construction, Shin et al. (2020b) automatically generated prompts for different task sets based on a gradient search approach. Some researchers have also added a small set of task-specific prefixes to replace manual templates as continuous prompts, which are learnable by backpropagation (Li and Liang, 2021; Lester et al., 2021). Additionally, Zhong et al. (2021) suggests initialising continuous prompts based on the discrete prompts already explored by prompt search methods.

Another important component of prompt-based learning is the answer-mapping approach. Since the masked output does not always match the label, the aim of answer mapping is to project the model output into the answer space when the raw outputs cannot be used directly as task outputs. For example, sentiment analysis tasks require mapping the model outputs “{ good, fantastic, unhappy, average }” to “{ positive, negative, neutral }”. The answer-mapping method is also known as a verbaliser. A verbaliser is a projection method that constructs a mapping between the natural word space and the task label space (Hu et al., 2022). Just like the prompt template, a verbaliser also has “manual design” (Yin et al., 2019; Cui et al., 2021), “discrete answer search” (Jiang et al., 2020b; Xia et al., 2022) and “continuous answer search” strategies (Hambardzumyan et al., 2021). Equally, the quality of a verbaliser can have a significant impact on model performance.

In conclusion, prompt-based learning is widely considered as a highly effective method for utilising pre-trained language models, and it is a subject worthy of further study.

Table 1. Two examples of relation classification. Relation classification aims to predict the relation between two entities mentioned in a given sentence.
Sentence Relation
Mark Fisher [subject] writes for the Dayton Daily News [object]. per:employee_of
He has a sense of humor [subject] about his reaction [object] to that day. no_relation

2.3. Relation Classification

Relation Classification (RC) is a sub-task of Information Extraction (IE). The goal of RC is to identify the semantic relation between two entities in a given text. As shown in Table 1, given the sentence “Mark Fisher writes for the Dayton Daily News.”, the relation between the subject “Mark Fisher” and the object “Dayton Daily News” is “per:employee_of”. RC is a crucial task for many natural language understanding applications, including question answering, knowledge base construction, and text summarization (Hendrickx et al., 2010).

Early approaches for RC relied heavily on prior knowledge to extract additional features from the text (Califf and Mooney, 1997; Zeng et al., 2015). The emergence of large-scale PLMs, such as BERT (Devlin et al., 2019), has provided universal language representations that are useful for RC tasks (Vaswani et al., 2017; Cohen et al., 2020; Yamada et al., 2020). However, these methods often require significant amounts of labelled data and complex, task-specific neural modules, and tend to underperform in the few-shot scenario.

Recently, there have been studies exploring the use of prompt-based learning to leverage the knowledge contained in PLMs for relation classification. The LAMA dataset is a probe for analysing factual and commonsense knowledge in PLMs, which consists of a set of knowledge sources, each comprising a set of facts. For example, one template could be “__ is the capital of France.” Hu et al. (2022) proposed a method that uses external knowledge to construct a verbaliser for obtaining more accurate label words for text classification. Han et al. (2021) applied logic rules to construct prompts with several sub-prompts, effectively using prior knowledge in relation classification. Chen et al. (2022) proposed a prompt-based learning approach called KnowPrompt that incorporates the abundant semantics and prior knowledge in relation labels into relation classification.

All these methods aimed to effectively use prompt-based learning to narrow the gap between relation classification and pre-training tasks, and demonstrated promising results in both few-shot and full-data scenarios. However, none of these methods addressed the issue of inconsistency between prompt output and relation label texts. Inspired by KnowPrompt, we propose a new prompt-based learning approach, namely LabelPrompt, that explicitly constructs prompt templates using label tokens and adds supervised information corresponding to the labels at the model’s output.

3. The Proposed LabelPrompt Method

Figure 1. An overview of the proposed method. We input the tokens, including the label tokens, into the RoBERTa model to obtain a global feature representation and improve contextual awareness. The masked token is then used to predict the relation of the sentence. The relation-align module aligns the encoded label prompt tokens with their corresponding classes and computes the label loss to preserve the label prompt information. The entity-aware module constructs two samples based on the triplet information and the sentence tokens in the pure prompt template to constrain the correspondence between the given entities. The attention query strategy in the encoding stage facilitates independent encoding of prompt tokens and sentence tokens.

Relation Classification (RC) can be denoted as $\mathcal{T}=\{\mathcal{X},\mathcal{Y}\}$, where $\mathcal{X}$ is the set of instances and $\mathcal{Y}$ is the set of relation labels. Each instance $\mathbf{x}\in\mathcal{X}$ contains a sentence $x=\{x_{1},x_{2},\dots,x_{n}\}$, two entities ($e_{s}$, $e_{o}$) mentioned in $x$, and its relation label $y_{x}\in\mathcal{Y}$. The goal of relation classification is to predict the relation $y\in\mathcal{Y}$ between the given subject entity $e_{s}$ and object entity $e_{o}$ in sentence $x$. If there is no relation between the entities, it is considered a special type of relation, i.e. “no_relation”, which is included in the set of relations.

We will elaborate on the major components and training schemes of our approach in the following parts. As illustrated in Fig. 1, the input to the model is composed of two primary components, the prompt tokens and the sentence tokens. We can further separate the prompt tokens into label prompt tokens and typical prompt tokens.

3.1. Label Prompt

3.1.1. Label Prompt Tokens

For each relation $y\in\mathcal{Y}$, its label text is not an individual word in natural language, but a composite text, e.g. “org:top_members/employees”, “per:country_of_death”, etc. We define a label space by creating $m$ label tokens $\mathcal{C}=\{c_{1},c_{2},\dots,c_{m}\}$ and adding them to the vocabulary of the PLM, where $m$ is the size of the relation set $\mathcal{Y}$. For a better training setup, these label tokens are initialised from their knowledge-rich semantic label texts. The initialisation process is defined as follows:

(1) $C_{i}=\frac{1}{N_{i}}\sum_{j=1}^{N_{i}}\textrm{Embedding}_{\mathcal{M}}(t_{j}),$

where $C_{i}$ represents the word embedding of label token $c_{i}$, $\textrm{Embedding}_{\mathcal{M}}$ stands for the word embedding layer of $\mathcal{M}$, and $t_{j}$ denotes the $j$-th sub-text of the decomposed relation label text $T_{i}=\{t_{1},t_{2},\dots,t_{N_{i}}\}$.
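As a concrete illustration, the following PyTorch sketch shows one way to implement the label-token initialisation in Equ. (1); the tokenizer calls and the toy label subset are illustrative assumptions rather than our released implementation.

    import torch
    from transformers import RobertaTokenizer, RobertaForMaskedLM

    tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
    model = RobertaForMaskedLM.from_pretrained("roberta-large")

    relation_labels = ["org:founded_by", "per:employee_of", "no_relation"]  # toy subset
    label_tokens = [f"[C{i + 1}]" for i in range(len(relation_labels))]

    # Extend the PLM vocabulary with one new token per relation class.
    tokenizer.add_tokens(label_tokens, special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))

    embedding = model.get_input_embeddings()  # word embedding layer of M
    with torch.no_grad():
        for tok, label in zip(label_tokens, relation_labels):
            # Decompose the relation label text into sub-texts t_1, ..., t_Ni.
            pieces = label.replace(":", " ").replace("_", " ").replace("/", " ")
            sub_ids = tokenizer(pieces, add_special_tokens=False)["input_ids"]
            # Equ. (1): C_i is the mean of the sub-token embeddings.
            c_i = embedding.weight[sub_ids].mean(dim=0)
            embedding.weight[tokenizer.convert_tokens_to_ids(tok)] = c_i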

3.1.2. Label Prompt Template

To provide context for our work, we first introduce the concepts of prompt templates and the prompt template method. In this paper, a prompt template refers to additional text consisting of prompt tokens, while the template method refers to the construction function used to build the input instance $\mathbf{x}_{prompt}$ containing a specific token [MASK].

The objective of our work is to address the difficulty that prompt-based methods have in fitting the correct label tokens in the few-shot scenario, where data is limited. To achieve this, we propose a solution that converts the relation classification task into a masked language modelling one. However, previous studies have shown that this approach alone may not be effective due to the large size of the vocabulary and the newly added label tokens that were not present during the model’s pre-training stage.

To overcome this limitation, we introduce an intuitive approach of directly adding label tokens to the input sequences. Our proposed prompt template, namely the Label Prompt Template, consists of both label prompt tokens and typical prompt tokens. By incorporating the label tokens directly into the input sequences, we can improve the performance of the prompt-based model in the few-shot scenario. We represent the prompt template as $T(\mathbf{x})$:

(2) $T(\mathbf{x})=\{c_{1},c_{2},\dots,c_{m},[\textrm{SEP}],e_{s},[\textrm{MASK}],e_{o}\},$

where $c_{i}\in\mathcal{C}$ is the label token, and $e_{s}$ and $e_{o}$ are the subject and object entities. $[\textrm{CLS}]$, $[\textrm{SEP}]$ and $[\textrm{MASK}]$ are predefined in $\mathcal{M}$. The input $x$ will be converted to the input sequence as follows,

(3) $x_{prompt}=\left\{[\textrm{CLS}],T(\mathbf{x}),[\textrm{SEP}],x,[\textrm{SEP}]\right\}.$
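To make the template construction concrete, here is a minimal sketch of Equs. (2)-(3) in Python; the special-token strings and function name are illustrative assumptions, and in practice the tokenizer of $\mathcal{M}$ supplies its own [CLS]/[SEP]/[MASK] symbols.

    def build_prompt(sentence: str, subj: str, obj: str, label_tokens: list) -> str:
        # T(x) = {c_1, ..., c_m, [SEP], e_s, [MASK], e_o}
        template = " ".join(label_tokens) + f" [SEP] {subj} [MASK] {obj}"
        # x_prompt = {[CLS], T(x), [SEP], x, [SEP]}; the leading [CLS] and the
        # trailing [SEP] are typically added by the tokenizer when encoding.
        return template + " [SEP] " + sentence

    prompt = build_prompt(
        "Mark Fisher writes for the Dayton Daily News.",
        "Mark Fisher", "Dayton Daily News",
        ["[C1]", "[C2]", "[C3]"],
    )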

3.1.3. Label Prompt Learning

In this section, we introduce our key training objective. First, we define $\mathbf{h}$ as the output of the encoder of the model $\mathcal{M}$. An input sequence $x_{prompt}$ is encoded by the model to obtain $\mathbf{h}\in\mathbb{R}^{\ell\times d}$,

(4) $\mathbf{h}=\textrm{Encoder}_{\mathcal{M}}(x_{prompt}),$

where $\ell$ represents the length of $x_{prompt}$, and $d$ is the size of the hidden vectors.

We also define $h_{mask}$ as the hidden vector of the $[\textrm{MASK}]$ token, $h_{c_{i}}$ as the hidden vector of label prompt token $c_{i}$, and $h_{sub}$ and $h_{obj}$ as the hidden vectors of the subject and object entities.

Next, we define a verbaliser $V_{\phi}$ as an answer-mapping method that maps a hidden vector $h$ to a relation label,

(5) $V_{\phi}(h)=\mathbf{E}_{label}\cdot(W_{v}\cdot h+b),$

where $\mathbf{E}_{label}$ represents the label embeddings, and $W_{v}$ and $b$ are learnable parameters. We use $C_{i}\in\mathbf{E}_{label}$ to denote the embedding of label token $c_{i}$. Then we calculate the probability distribution of the mapping result of hidden vector $h$. The probability that the label of the masked token is $c_{i}$ is calculated using the softmax function,

(6) $p\left(V_{\phi}\left(h\right)=c_{i}\mid x_{prompt}\right)=\frac{\exp(C_{i}\cdot(W_{v}\cdot h+b))}{\sum_{j=1}^{m}\exp(C_{j}\cdot(W_{v}\cdot h+b))},$

where $m$ is the size of the relation set $\mathcal{Y}$.
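A small PyTorch sketch of the verbaliser in Equs. (5)-(6) is given below; the class and variable names are assumptions chosen for illustration.

    import torch
    import torch.nn as nn

    class Verbalizer(nn.Module):
        def __init__(self, hidden_size: int, label_embeddings: torch.Tensor):
            super().__init__()
            # W_v and b from Equ. (5).
            self.proj = nn.Linear(hidden_size, label_embeddings.size(-1))
            # E_label: (m, d) embedding matrix of the m label tokens.
            self.label_embeddings = nn.Parameter(label_embeddings)

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # V_phi(h) = E_label . (W_v h + b); softmax over the m classes (Equ. (6)).
            logits = self.proj(h) @ self.label_embeddings.t()
            return torch.softmax(logits, dim=-1)

    # Usage: probs = verbalizer(h_mask), where h_mask is the encoder output at the
    # [MASK] position; the argmax class gives the predicted relation.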

Last, our model optimisation aims to assign a suitable relation label to the masked token. We define the loss function,

(7) $\mathcal{L}_{mask}=\frac{1}{m}\sum_{i=1}^{m}y_{x}\log p(y_{x}|\mathbf{x}),$

as the cross-entropy loss between the predicted probability $p(y_{x}|\mathbf{x})=p(V_{\phi}(h_{mask})=\mathcal{C}\mid x_{prompt})$ and the ground truth $y$. The label token with the highest probability is selected as the predicted relation class between the entities in instance $\mathbf{x}$.

3.2. Relation-Align Module

The objective of the relation-align module is to align the label tokens with the deep decision-making layer of the model. In the context of relation extraction models, the label token on the input side may lose its inherent features in the subsequent encoder layers, which reduces its effectiveness in aiding the prediction of relations in the deep decision-making layer.

To tackle this problem, we propose the relation-align module, which is integrated at the end of our model. The verbaliser $V_{\phi}$ is applied to the hidden vector $h_{c_{i}}$ of each label token $c_{i}$ in $x_{prompt}$ to reinforce its semantic meaning, analogously to the mask loss. To accomplish this, a probability score $p(c_{i}|\mathbf{x},c_{i})=p\left(\tilde{c_{i}}=\mathcal{C}\mid x_{prompt}\right)$ is computed for each label token, which should be classified into itself by the verbaliser, where $\tilde{c_{i}}$ denotes $V_{\phi}(h_{c_{i}})$.

The cross-entropy loss for each label token is computed based on the probability distribution, and then the losses are averaged to obtain the label loss $\mathcal{L}_{label}$:

(8) $\mathcal{L}_{c_{i}}=\frac{1}{m}\sum_{i=1}^{m}y_{x}\log p(c_{i}|\mathbf{x},c_{i}),\qquad\mathcal{L}_{label}=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}_{c_{i}}.$
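The label loss can be sketched as follows, reusing the verbaliser sketched above; this is a simplified reading of Equ. (8) in which each label token's hidden vector should be classified back into its own class, and the variable names are assumptions.

    import torch
    import torch.nn.functional as F

    def label_loss(h_label_tokens: torch.Tensor, verbalizer) -> torch.Tensor:
        """h_label_tokens: (m, d) hidden vectors of the m label prompt tokens."""
        m = h_label_tokens.size(0)
        probs = verbalizer(h_label_tokens)          # (m, m) class probabilities
        targets = torch.arange(m)                   # token c_i should map to class i
        # Cross-entropy per label token, averaged over the m tokens.
        return F.nll_loss(torch.log(probs + 1e-12), targets)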

3.3. Entity-Aware Module

In this section, we introduce an entity-aware module aimed at addressing the issue of a PLM failing to consider given entities when predicting the relation between them.

Through analysis of numerous error cases, we observed that a PLM can detect a relation in a sentence, but it may not exist between the two given entities. In other words, the model does not take the given entities into account when predicting the relationship.

Drawing from TransE (Bordes et al., 2013), which suggests a correlation $s+r=o$ between the features of the entities ($s$, $o$) and the relation $r$, we propose the entity-aware module. The module employs the distance metric $d(s,r,o)=\lVert s+r-o\rVert_{2}$ to measure the distance between entities and relations. Furthermore, the dimensions of the encoder outputs for entities and relations are reduced to eliminate redundant features:

(9) $s=\phi_{sub}\cdot h_{sub},\quad o=\phi_{obj}\cdot h_{obj},\quad r=\phi_{rel}\cdot h_{mask},$

where $\phi_{sub}$, $\phi_{obj}$ and $\phi_{rel}\in\mathbb{R}^{p\times d}$ are trainable parameters, and $p$ is the length of the hidden vector after dimension reduction.

In Equ. (9), the tuple $(s,r,o)$ is considered a positive example, while negative examples $(s^{\prime},r,o^{\prime})$ are randomly sampled for the entity-aware loss, where $s^{\prime}$ and $o^{\prime}$ denote two arbitrary spans from the sentence. The entity-aware loss, denoted by $\mathcal{L}_{entity}$, is formulated as follows:

(10) $\mathcal{L}_{entity}=-\log\sigma\left(\gamma-d\left(s,r,o\right)\right)-\log\sigma\left(d\left(s^{\prime},r,o^{\prime}\right)-\gamma\right),$

where $\sigma$ is the sigmoid function and $\gamma$ is the margin, which is set to 0.3.
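A possible PyTorch sketch of the entity-aware loss in Equs. (9)-(10) is shown below; the projection sizes and how the negative spans are encoded are assumptions left to the caller.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EntityAwareLoss(nn.Module):
        def __init__(self, hidden_size: int, proj_size: int, margin: float = 0.3):
            super().__init__()
            # phi_sub, phi_obj and phi_rel from Equ. (9).
            self.phi_sub = nn.Linear(hidden_size, proj_size, bias=False)
            self.phi_obj = nn.Linear(hidden_size, proj_size, bias=False)
            self.phi_rel = nn.Linear(hidden_size, proj_size, bias=False)
            self.margin = margin

        def forward(self, h_sub, h_obj, h_mask, h_sub_neg, h_obj_neg):
            s, o, r = self.phi_sub(h_sub), self.phi_obj(h_obj), self.phi_rel(h_mask)
            s_n, o_n = self.phi_sub(h_sub_neg), self.phi_obj(h_obj_neg)
            d_pos = torch.norm(s + r - o, p=2, dim=-1)      # d(s, r, o)
            d_neg = torch.norm(s_n + r - o_n, p=2, dim=-1)  # d(s', r, o')
            # Equ. (10): positives pulled inside the margin, negatives pushed outside.
            return (-F.logsigmoid(self.margin - d_pos)
                    - F.logsigmoid(d_neg - self.margin)).mean()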

Different from previous approaches that use simple negative samples, we use more challenging negative samples and compute the contrastive loss only within the current sentence during the training phase. Our experiments reveal that negative samples drawn from other instances in the same batch can hinder the convergence of this portion of the loss, so we disregard them.

3.4. Attention Query Strategy

As shown in Equ. (3), the input sequence comprises two distinct categories of tokens, namely prompt tokens and sentence tokens. However, the inclusion of additional unknown tokens can significantly impact the semantic meaning of the sentence and subsequently, the efficacy of relation extraction.

To mitigate the adverse impact of prompt tokens on sentence semantics, we employ an attention query strategy during the PLM encoding process. This approach utilises distinct query matrices for different token pairs in the self-attention layer to minimise the influence of prompt tokens on sentence semantics and enhance the quality of relation extraction outcomes. Further details are provided in Fig. 1.

The self-attention layer in the transformer encoder can be formulated as follows:

(11) $\operatorname{SelfAttn}(E)=\operatorname{softmax}\left(\frac{QE\cdot(KE)^{T}}{\sqrt{d}}\right)\cdot VE,$

where $E$ represents the input, and $Q\in\mathbb{R}^{d\times d}$, $K\in\mathbb{R}^{d\times d}$ and $V\in\mathbb{R}^{d\times d}$ denote the mapping matrices. In our method, we select a different query mapping matrix depending on which pair of tokens is interacting: $Q_{pp}$, $Q_{ps}$, $Q_{sp}$, or $Q_{ss}$. The effective initialisation from the pre-trained parameters enables the method to maintain strong few-shot performance even when the additional query matrices are incorporated.
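The following single-head sketch illustrates how such pair-dependent query projections could be realised; the masking logic follows the description above, but the exact implementation details are assumptions rather than our released code.

    import torch
    import torch.nn as nn

    class QuerySplitAttention(nn.Module):
        def __init__(self, d: int):
            super().__init__()
            # Separate query projections for prompt/sentence (query, key) pairs.
            self.q = nn.ModuleDict({k: nn.Linear(d, d) for k in ["pp", "ps", "sp", "ss"]})
            self.k = nn.Linear(d, d)
            self.v = nn.Linear(d, d)
            self.d = d

        def forward(self, e: torch.Tensor, is_prompt: torch.Tensor) -> torch.Tensor:
            # e: (L, d) token states; is_prompt: (L,) bool mask for prompt positions.
            keys, values = self.k(e), self.v(e)
            q = {name: proj(e) for name, proj in self.q.items()}   # four (L, d) queries
            p_i, p_j = is_prompt[:, None], is_prompt[None, :]      # query side / key side
            scores = torch.where(
                p_i & p_j, q["pp"] @ keys.t(),
                torch.where(p_i & ~p_j, q["ps"] @ keys.t(),
                torch.where(~p_i & p_j, q["sp"] @ keys.t(), q["ss"] @ keys.t())))
            attn = torch.softmax(scores / self.d ** 0.5, dim=-1)
            return attn @ values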

3.5. Objective Function

We propose three loss functions to optimise the model parameters: mask loss for relation prediction, label loss for semantic consistency of label prompt tokens, and entity loss for improving entity awareness.

Our objective is to minimise the final loss $\mathcal{L}$, which is the weighted sum of these losses:

(12) $\mathcal{L}=\mathcal{L}_{mask}+\alpha_{1}\mathcal{L}_{label}+\alpha_{2}\mathcal{L}_{entity},$

where $\alpha_{1}$ and $\alpha_{2}$ are the weights for the label loss and entity loss, respectively. In this paper, we set them to 1 and 0.04 based on our empirical experimental results.

Table 2. Statistics of different datasets for relation classification.
Dataset #Training #Validation #Test #Relation
SemEval 6,507 1,493 2,717 19
TACRED 68,124 22,631 15,509 42
TACREV 68,124 22,631 15,509 42
ReTACRED 58,465 19,584 13,418 40

4. Experiment

In this section, we evaluate the proposed method on four benchmark datasets with the widely used micro $F_{1}$ metric. For multi-class classification tasks, there are two commonly used averaging methods, namely macro and micro scores. The macro $F_{1}$ score computes the global score by averaging the per-class scores, while the micro-averaging method computes the global score from the pooled True Positives (TP), False Negatives (FN), and False Positives (FP) across all classes. In this paper, we use the micro $F_{1}$ score as the primary metric for all the experiments.
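For reference, the sketch below computes micro and macro $F_{1}$ with scikit-learn on a toy prediction list; this is a generic illustration of the metric, not the evaluation script used in our experiments.

    from sklearn.metrics import f1_score

    y_true = [1, 2, 0, 2, 1, 0]   # gold relation class indices
    y_pred = [1, 2, 0, 1, 1, 0]   # predicted relation class indices

    micro_f1 = f1_score(y_true, y_pred, average="micro")  # pools TP/FP/FN over classes
    macro_f1 = f1_score(y_true, y_pred, average="macro")  # averages per-class F1 scores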

Table 3. $F_{1}$ scores (%) on TACRED, TACREV, ReTACRED and SemEval in the few-shot scenario. The best results are marked in bold.
kk-shot Methods TACRED TACREV ReTACRED SemEval
8-shot fine-tuning 12.2 13.5 28.5 41.3
GDPNet (Xue et al., 2021) 11.8 12.3 29.0 42.0
PTR (Han et al., 2021) 28.1 28.7 51.5 70.5
KnowPrompt (Chen et al., 2022) 32.0 32.1 55.3 74.3
LabelPrompt(ours) 33.9 34.8 61.5 77.0
16-shot fine-tuning 21.5 22.3 49.5 65.2
GDPNet 22.5 23.8 50.0 67.5
PTR 30.7 31.4 56.2 81.3
KnowPrompt 35.4 33.1 63.3 82.9
LabelPrompt(ours) 34.4 35.4 64.5 81.7
32-shot fine-tuning 28.0 28.2 56.0 80.1
GDPNet 28.8 29.1 56.5 81.2
PTR 32.1 32.4 62.1 84.2
KnowPrompt 36.5 34.7 65.0 84.8
LabelPrompt(ours) 35.4 36.8 66.7 84.6
Table 4. $F_{1}$ scores (%) on TACRED, TACREV, ReTACRED and SemEval in the full-data scenario. In the “Extra Data” column, “w/o” means that no additional data is used for pre-training and fine-tuning, while “w/” means that extra data or knowledge bases are used for data augmentation. The best results are marked in bold.
Model Extra Data TACRED TACREV ReTACRED SemEval
Fine-tuning pre-trained models
RoBERTa_large (Liu et al., 2019) w/o 68.7 76.0 84.9 87.6
KnowBERT (Chen et al., 2022) w/ 71.5 79.3 - 89.1
MTB (Baldini Soares et al., 2019) w/ 70.1 - - 89.5
SpanBERT (Joshi et al., 2020) w/ 70.8 78.0 85.3 -
LUKE (Yamada et al., 2020) w/ 72.7 80.6 90.3 -
Prompt tuning pre-trained models
PTR w/o 72.4 81.4 90.9 89.9
KnowPrompt w/o 72.4 82.4 91.3 90.2
LabelPrompt(ours) w/o 73.1 82.5 91.6 91.3

4.1. Datasets

To better verify the effectiveness of our idea, we use the most popular relation classification datasets, including TACRED (Zhang et al., 2017), TACREV (Alt et al., 2020), ReTACRED (Stoica et al., 2021), and SemEval (Hendrickx et al., 2010).

TACRED is one of the largest and most widely used crowd-sourced datasets in relation extraction. It is created by combining available human annotations from the TAC KBP challenges with crowd-sourcing, and it contains 42 relation types (including “no_relation”).

TACREV is a modified version of TACRED. It corrects the label errors in the validation and test sets while leaving the training set unchanged.

ReTACRED is a new completely re-annotated version of the TACRED dataset. The dataset fixes many labelling errors in the TACRED dataset and refactors the training, validation, and test sets.

SemEval is a traditional dataset in relation classification. It contains 9 directional relations and one special relation, “Other”.

Annotation quality of the datasets. In the field of natural language processing, an essential component of dataset preparation is the quality of the annotation. We compare the annotation quality of the four datasets: the results indicate that the ReTACRED dataset has the highest annotation quality among them. Although the TACREV and TACRED datasets share the same training data, Alt et al. (2020) improved the quality of annotations in the TACREV test set. The statistical details of each dataset are tabulated in Table 2.

4.2. Baselines

In this paper, we compare the proposed LabelPrompt method with several representative approaches for relation classification. LabelPrompt is a novel method that leverages label prompt tokens to enhance pre-trained language models for this task. We select baselines that cover different aspects of relation classification, such as fine-tuning, prompt-based learning, and knowledge integration.

For fine-tuning pre-trained models, RoBERTa is a very popular language model that improved BERT’s training strategy to overcome its under-training problem, and it also serves as the text encoder in this paper. We also review some other pre-trained models that incorporate different types of knowledge into their representations. For example, KnowBERT embeds multiple Knowledge Bases (KBs) into a PLM to leverage structured, human-curated knowledge. MTB builds task-agnostic relation representations from entity-linked text and fine-tunes them on supervised relation extraction datasets. SpanBERT focuses on improving span representations by defining a masked span prediction task for representing and predicting spans of text with varying lengths. LUKE predicts randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia.

For prompt-based learning models, we select two representative works. PTR (Han et al., 2021) uses logic rules to create prompts with sub-prompts that encode the prior knowledge of each class. KnowPrompt (Chen et al., 2022) injects the latent knowledge contained in relation labels into the construction and optimisation of the prompts.

4.3. Experiment Settings

We use $\textrm{RoBERTa}_{large}$ (Liu et al., 2019) as the pre-trained language model for all our experiments and follow the same settings as the previous methods. We compare the performance of our proposed method with other methods in two scenarios: full-data and few-shot. In the full-data scenario, we use all the training samples in the dataset for network tuning. In the few-shot scenario, we use the $k$-shot approach, where $k\in\{8,16,32\}$, to evaluate our model’s performance. The $k$-shot approach means that we sample $k$ instances from each class in the training/validation set as the training/validation data, while using all the test instances to evaluate the model’s performance.
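A minimal sketch of this $k$-shot sampling protocol is given below; it assumes each example is stored as a dictionary with a "relation" field, which is an illustrative data format rather than the exact one used in our code.

    import random
    from collections import defaultdict

    def sample_k_shot(examples, k, seed=42):
        """Draw k training/validation instances per relation class."""
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for ex in examples:
            by_class[ex["relation"]].append(ex)
        subset = []
        for relation, items in by_class.items():
            subset.extend(rng.sample(items, min(k, len(items))))
        return subset

    # Usage: train_8shot = sample_k_shot(train_examples, k=8); the full test set is
    # still used for evaluation.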

We conducted our experiments on an NVIDIA GeForce RTX 2080 Ti GPU. We set the batch size to 16 and the learning rate to 4e-5 and 4e-6 for the few-shot and full-data scenarios, respectively.

Figure 2. The $F_{1}$ scores of different $k$-shot settings compared with KnowPrompt.

4.4. Results

We first evaluate and analyse the proposed LabelPrompt method by comparing it with both fine-tuning and prompt-based methods. The experimental results are reported in Table 3 and Table 4.

Few-Shot. In Table 3, we compare our approach with the traditional fine-tuning and prompt-based methods in the few-shot scenario. Our approach outperforms the other methods in most cases. Furthermore, on the TACREV and ReTACRED datasets, which have more accurate and consistent annotations, our method outperforms all the other methods in all the few-shot settings.

As compared with the fine-tuning methods, our results show that prompt-based methods perform better in the few-shot scenarios for the NLU tasks with low resources. We attribute this to the fact that prompt-based learning effectively uses the knowledge stored in PLMs to fit the task. In contrast, fine-tuning methods rely on complex decoding modules and training strategies based on PLMs, which require sufficient data.

As compared with the previous best prompt-based methods, LabelPrompt achieves 1.5%-6.2% performance gains on both the ReTACRED and TACREV datasets, which have high labelling quality. It also achieves 1.9%-2.7% improvements on the TACRED and SemEval datasets in the 8-shot setting.

These two aspects demonstrate that the explicit label tokens can aid the pre-trained model in the few-shot scenario. We believe that these additional tokens can reduce the search space for target tokens during the early stage of model training. To further validate our model’s few-shot performance, we design an additional 4-shot setting, as shown in Fig. 2. The experimental results demonstrate that, even with only 4 instances per class, the proposed method achieves the same performance as other methods in the 8-shot setting.

Full-Data. In the full-data scenario, as shown in Table 4, the proposed method also beats all the other fine-tuning and prompt-based approaches.

In comparison with KnowPrompt, LabelPrompt improves the performance by 0.7% in TACRED and by 1.1% in SemEval. This demonstrates that the prompt-learning method can be applied to both few-shot and full-data tasks. In fact, LabelPrompt leverages the knowledge embedded in PLMs to obtain remarkable results in downstream tasks with a simple design.

So far, we have discussed the performance of the proposed method, and the results show that LabelPrompt is superior to the other methods in both the few-shot and full-data scenarios.

4.5. Ablation Study

In this section, we evaluate the performance of each component of LabelPrompt in both few-shot and full-data settings. The results are presented in Table 5 and Fig. 3.

Figure 3. Ablation experiments on different token strategies. “use Mask tokens” means we replace all label tokens in the prompt template with the special token [MASK]. “use Learnable tokens” means we replace the label tokens in the prompt template with $m$ additional learnable tokens.
Table 5. The $F_{1}$ scores (%) of the ablation experiments on ReTACRED.
Method 8-shot 16-shot 32-shot full
LabelPrompt 62.1 63.8 66.7 91.6
- w/o label prompt tokens 57.6 61.6 64.7 91.5
- w/o entity-aware module 58.9 61.7 64.7 91.2
- w/o attention query strategy 61.8 62.7 65.4 90.9
Figure 4. The left subplot shows the ON between the mask token and label tokens for different relations. The right subplot compares the data from rows 15 and 17 of the left subplot in two colours. For example, the blue histogram represents the ON between the mask token and label tokens when the true label is “org:shareholders”.

4.5.1. Impact of Label Prompt Tokens

The prompt template is one of the most important factors affecting the effectiveness of prompt-based methods, and our prompt template method relies heavily on label prompt tokens. Therefore, we demonstrate their impact with additional experiments.

First, we ablate the label prompt tokens from the input and remove the label objective $\mathcal{L}_{label}$ from the overall loss function. Without the label prompt tokens, the model must search over the full vocabulary space to predict the correct labels without any explicit label cue. This significantly increases the difficulty of the prediction task, making it challenging for the model to converge efficiently. As shown in Table 5, removing the label prompts leads to a substantial performance degradation, especially under few-shot settings. To verify that the performance gains are due to the semantic guidance from label prompts rather than simply longer prompt lengths, we evaluate two additional ablated configurations: 1) replacing all label tokens in $T(\mathbf{x})$ with $[\textrm{MASK}]$, and 2) substituting the label tokens with randomly initialised, learnable prompt embeddings.

Fig. 3 shows that both configurations result in a significant performance drop in few-shot settings. This suggests that label tokens are effective due to their high correlation with the labels in the classification task. Moreover, adding more learnable tokens does not help, as the model cannot fit them with limited training data and performs worse in full-data settings.

To visualise the effectiveness of label prompt tokens, we project the hidden vectors of mask token outputs onto a two-dimensional space based on different relations. The ReTACRED test set samples (excluding ”no relation” and ”per:country_of_birth”) are used for visualisation, with a total of 5,648 samples covering 38 different relations.

4.5.2. Impact of The Entity-Aware Module

We propose an Entity-Aware Module to improve the performance of the proposed method for relation classification. This module helps the model to focus on the relation class between the entities in a sentence. As shown in Table 5, removing the Entity-Aware Module decreases the accuracy in all scenarios. The full-data scenario is mostly affected by this removal.

In Table 6, we provide examples of relation extraction by our model with and without the entity-aware module. The entity-aware module improves the accuracy of predicting the relations between entities in a sentence, as shown by the results for No.1254 and No.9035. Without the entity-aware module, the model only considers the relations implied in the sentence, while with the entity-aware module, it takes both entities into account for more accurate relation prediction.

4.5.3. Impact of Attention Query Strategy

Label prompt tokens are not defined during the pre-training phase of PLM. Although these tokens are initialised with the semantics of the label text, they can still affect the semantics of the entire sentence when the model processes the sentence sequence. As presented in Table 5, when the attention query strategy is removed from our model, we observe varying degrees of performance degradation in all scenarios.

Table 6. Case study for the entity-aware module. The underlined text represents the ground-truth relation class.
Sentence Result
[NO.1254] It has been a little over a year since Travis the Chimp, the 14-year-old [object], 200-pound pet of Sandra Herold [subject], 71, mauled a family friend in Herold’s driveway. w/o : per:age
w/ : no_relation
[NO.9035] Knox [object]’s mother Edda Mellas was shaking with grief as her [subject] stepmother Cassandra Knox moved to comfort her. w/o : per:children
w/ : per:identity

5. Analysis

Figure 5. Illustration of the activated value sequence $s$.

In this section, we will conduct a detailed analysis of why label prompt tokens are useful for the relation classification task.

Some studies (Geva et al., 2021; Dai et al., 2022) regard the Feed-Forward Network (FFN) within the pre-trained language model as a place to store knowledge. It has been demonstrated that prompt learning is useful for PLMs because prompts can activate specific neurons in the feed-forward network for given inputs. Su et al. (2022) further considered each node in FFN as a neuron and found that the activation layer of the feed-forward network corresponds to a specific behaviour of the model.

As shown in Fig. 5, FFN in a transformer layer can be formulated as:

(13) $\mathrm{FFN}(\mathbf{x})=\mathrm{GELU}(\mathbf{x}\mathbf{W}_{1}^{\top}+\mathrm{b}_{1})\mathbf{W}_{2}+\mathrm{b}_{2},$

where $\mathbf{x}\in\mathbb{R}^{d}$ denotes the hidden state of the self-attention output, and $\mathbf{W}_{1}\in\mathbb{R}^{4d\times d}$ and $\mathbf{W}_{2}\in\mathbb{R}^{4d\times d}$ represent the weights of the two dense layers. In this paper, we consider the output of the activation function after the first dense layer in the FFN as the activated neuron sequence, denoted as $\mathbf{S}$.

For each instance, the activated neuron sequence can be obtained by:

(14) $\mathbf{S}=\{s_{c_{1}},s_{c_{2}},\dots,s_{c_{m}},s_{M}\},$

where $s_{M}$ is the activated neuron sequence of the mask token.

We define a function $\mathbf{ON}(s_{a},s_{b})$ to calculate the Overlapping rate of activated Neurons (ON) of two activated neuron sequences $s_{a}$ and $s_{b}$:

(15) $\mathbf{ON}(s_{a},s_{b})=\frac{\sum_{k=1}^{4d}\zeta(s_{a,k}>0\wedge s_{b,k}>0)}{\sum_{k=1}^{4d}\zeta(s_{a,k}>0)+\sum_{k=1}^{4d}\zeta(s_{b,k}>0)},$

where $\zeta(c)$ is an indicator function whose result is 1 if the condition $c$ is true and 0 otherwise.
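A direct implementation of Equ. (15) could look like the sketch below, where the activation vectors are assumed to have been captured (e.g. with a forward hook) from the output of the GELU in the last encoder layer.

    import torch

    def overlap_rate(s_a: torch.Tensor, s_b: torch.Tensor) -> float:
        """s_a, s_b: (4d,) post-GELU activation vectors for two tokens."""
        act_a, act_b = s_a > 0, s_b > 0                  # activated neurons
        both = (act_a & act_b).sum().item()              # numerator of Equ. (15)
        return both / (act_a.sum().item() + act_b.sum().item())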

The ON function between different tokens can be used as the associated probability of the corresponding model behaviours. Therefore, we compute the ON between the label prompt tokens $\mathcal{C}$ and the mask token [MASK] at the last encoder layer, and take it as a measure of the task-driven capability of the label prompt tokens.

The data used for visualisation and analysis contains 5,648 samples and 38 relations (excluding “no_relation” and “per:country_of_birth”) from ReTACRED. To obtain Fig. 4, we record the activated values in the last encoder layer for all the test instances and consider the neurons with positive values as activated. To avoid distribution errors caused by sampling, we average the overlapping rate over all samples of relation $i$ to obtain the $\mathbf{ON}$ value.

As shown in Fig. 4 (a), each point $r(i,j)$ is defined as:

(16) $r(i,j)=\frac{1}{|\mathcal{C}_{i}|}\sum_{u=1}^{|\mathcal{C}_{i}|}\mathbf{ON}(s_{j}^{(u)},s_{M}^{(u)})\quad\mathrm{s.t.}\quad gt=c_{i},$

which denotes the ON result between the mask token and label token $c_{j}$ when the ground-truth relation is $c_{i}$. Fig. 4 (b) shows the $\mathbf{ON}$ values for the relations “org:shareholders” and “per:city_of_birth”.

According to Fig. 4 (a), the colour of the diagonal blocks is darker. The block on the diagonal, $r(i,i)$, represents the ON result between the mask token sequence $s_{M}$ and the ground-truth label prompt token sequence $s_{c_{i}}$, i.e. the mask token always has a higher overlapping rate with the label token corresponding to the ground-truth label. This indicates that the label token can lead the mask token to activate the knowledge of the corresponding relation stored in a PLM.

Furthermore, we choose two representative relations to illustrate the results. Fig. 4 (b) shows the histograms for two rows of data from Fig. 4 (a), which clearly show the difference in the overlapping rate between the mask and label tokens. The histogram in orange is the result for the relation “org:shareholders”: compared to other label tokens, the ON result between the mask token and label token 17 is at its peak. Meanwhile, the histogram in blue shows that when the relation is “per:city_of_birth”, the overlapping rate is high not only for token 15, but also for tokens 26, 34, 36, and so on. In fact, all of these tokens also contain “city” in their labels, which is consistent with our intuition that semantically related labels have a high overlapping ratio of activated knowledge.

6. Conclusion

In this paper, we proposed LabelPrompt, a strategy that effectively applies label information to the relation classification task via prompt learning and prompt engineering. By explicitly adding label tokens to sentences before they are fed into the model, LabelPrompt enables the model to adapt effectively to the downstream task. The additional entity-aware module also alleviates the problem of predicted relations that do not correspond to the given entities.

The experimental results demonstrated that the label prompt tokens are effective in both the few-shot and full-data scenarios. The method focused more on the additional label tokens and correctly predicted the relation class of the entities in a sentence. In future work, we plan to improve our method’s performance on poorly annotated datasets in low-resource scenarios and extend our work to other text classification tasks.

Acknowledgements.
This work was supported in part by Major Project of the National Social Science Foundation of China (No. 21&ZD166), National Natural Science Foundation of China (61876072) and Natural Science Foundation of Jiangsu Province (No. BK20221535).

References

  • Alt et al. (2020) Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. 2020. TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 1558–1569. https://doi.org/10.18653/v1/2020.acl-main.142
  • Baldini Soares et al. (2019) Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the Blanks: Distributional Similarity for Relation Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 2895–2905. https://doi.org/10.18653/v1/P19-1279
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems, C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (Eds.), Vol. 26. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Califf and Mooney (1997) Mary Elaine Califf and Raymond J. Mooney. 1997. Relational Learning of Pattern-Match Rules for Information Extraction. In CoNLL97: Computational Natural Language Learning. https://aclanthology.org/W97-1002
  • Chen et al. (2022) Xiang Chen, Ningyu Zhang, Xin Xie, Shumin Deng, Yunzhi Yao, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. 2022. Knowprompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. In Proceedings of the ACM Web Conference 2022. 2778–2788.
  • Cohen et al. (2020) Amir Cohen, Shachar Rosenman, and Yoav Goldberg. 2020. Relation Extraction as Two-way Span-Prediction. ArXiv abs/2010.04829 (2020).
  • Cui et al. (2021) Leyang Cui, Yu Wu, Jian Liu, Sen Yang, and Yue Zhang. 2021. Template-Based Named Entity Recognition Using BART. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Online, 1835–1845. https://doi.org/10.18653/v1/2021.findings-acl.161
  • Dai et al. (2022) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge Neurons in Pretrained Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 8493–8502. https://doi.org/10.18653/v1/2022.acl-long.581
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186. https://www.aclweb.org/anthology/N19-1423.pdf
  • Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making Pre-trained Language Models Better Few-shot Learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 3816–3830. https://doi.org/10.18653/v1/2021.acl-long.295
  • Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer Feed-Forward Layers Are Key-Value Memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 5484–5495. https://doi.org/10.18653/v1/2021.emnlp-main.446
  • Gu et al. (2022) Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. 2022. PPT: Pre-trained Prompt Tuning for Few-shot Learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 8410–8423. https://doi.org/10.18653/v1/2022.acl-long.576
  • Guo et al. (2022) Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang. 2022. LongT5: Efficient Text-To-Text Transformer for Long Sequences. In Findings of the Association for Computational Linguistics: NAACL 2022. Association for Computational Linguistics, Seattle, United States, 724–736. https://doi.org/10.18653/v1/2022.findings-naacl.55
  • Hambardzumyan et al. (2021) Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 4921–4933. https://doi.org/10.18653/v1/2021.acl-long.381
  • Han et al. (2021) Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021. PTR: Prompt Tuning with Rules for Text Classification. arXiv preprint arXiv:2105.11259 (2021).
  • Hendrickx et al. (2010) Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of Nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics, Uppsala, Sweden, 33–38. https://aclanthology.org/S10-1006
  • Hu et al. (2022) Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Jingang Wang, Juanzi Li, Wei Wu, and Maosong Sun. 2022. Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 2225–2240. https://doi.org/10.18653/v1/2022.acl-long.158
  • Jiang et al. (2020a) Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2020a. SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 2177–2190. https://doi.org/10.18653/v1/2020.acl-main.197
  • Jiang et al. (2020b) Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020b. How Can We Know What Language Models Know? Transactions of the Association for Computational Linguistics 8 (2020), 423–438. https://doi.org/10.1162/tacl_a_00324
  • Joshi et al. (2020) Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics 8 (2020), 64–77.
  • Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 3045–3059. https://doi.org/10.18653/v1/2021.emnlp-main.243
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 4582–4597. https://doi.org/10.18653/v1/2021.acl-long.353
  • Liu et al. (2021a) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv preprint arXiv:2107.13586 (2021).
  • Liu et al. (2022) Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Dublin, Ireland, 61–68. https://doi.org/10.18653/v1/2022.acl-short.8
  • Liu et al. (2021b) Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT Understands, Too. arXiv preprint arXiv:2103.10385 (2021).
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019).
  • Lyu and Chen (2021) Shengfei Lyu and Huanhuan Chen. 2021. Relation Classification with Entity Type Restriction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Online, 390–395. https://doi.org/10.18653/v1/2021.findings-acl.34
  • Ma et al. (2022) Ruotian Ma, Xin Zhou, Tao Gui, Yiding Tan, Linyang Li, Qi Zhang, and Xuanjing Huang. 2022. Template-free Prompt Tuning for Few-shot NER. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 5721–5732. https://doi.org/10.18653/v1/2022.naacl-main.420
  • Moiseev et al. (2022) Fedor Moiseev, Zhe Dong, Enrique Alfonseca, and Martin Jaggi. 2022. SKILL: Structured Knowledge Infusion for Large Language Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 1581–1588. https://doi.org/10.18653/v1/2022.naacl-main.113
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In NAACL.
  • Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language Models as Knowledge Bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 21, 140 (2020), 1–67.
  • Shin et al. (2021) Richard Shin, Christopher Lin, Sam Thomson, Charles Chen, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. 2021. Constrained Language Models Yield Few-Shot Semantic Parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 7699–7715. https://doi.org/10.18653/v1/2021.emnlp-main.608
  • Shin et al. (2020b) Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020b. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Shin et al. (2020a) Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020a. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 4222–4235. https://doi.org/10.18653/v1/2020.emnlp-main.346
  • Stoica et al. (2021) George Stoica, Emmanouil Antonios Platanios, and Barnabas Poczos. 2021. Re-TACRED: Addressing Shortcomings of the TACRED Dataset. Proceedings of the AAAI Conference on Artificial Intelligence 35, 15 (May 2021), 13843–13850. https://ojs.aaai.org/index.php/AAAI/article/view/17631
  • Su et al. (2022) Yusheng Su, Xiaozhi Wang, Yujia Qin, Chi-Min Chan, Yankai Lin, Huadong Wang, Kaiyue Wen, Zhiyuan Liu, Peng Li, Juanzi Li, Lei Hou, Maosong Sun, and Jie Zhou. 2022. On Transferability of Prompt Tuning for Natural Language Processing. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 3949–3969. https://doi.org/10.18653/v1/2022.naacl-main.290
  • Tan et al. (2022) Zhixing Tan, Xiangwen Zhang, Shuo Wang, and Yang Liu. 2022. MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 6131–6142. https://doi.org/10.18653/v1/2022.acl-long.424
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  • Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal Adversarial Triggers for Attacking and Analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 2153–2162. https://doi.org/10.18653/v1/D19-1221
  • Wang et al. (2021) Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. 2021. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Online, 1405–1418. https://doi.org/10.18653/v1/2021.findings-acl.121
  • Wu et al. (2021) Shuang Wu, Xiaoning Song, and Zhenhua Feng. 2021. MECT: Multi-Metadata Embedding based Cross-Transformer for Chinese Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 1529–1539. https://doi.org/10.18653/v1/2021.acl-long.121
  • Xia et al. (2022) Wei Xia, Qianqian Wang, Quanxue Gao, Ming Yang, and Xinbo Gao. 2022. Self-consistent Contrastive Attributed Graph Clustering with Pseudo-label Prompt. IEEE Transactions on Multimedia (2022), 1–13. https://doi.org/10.1109/TMM.2022.3213208
  • Xu et al. (2021) Haoran Xu, Benjamin Van Durme, and Kenton Murray. 2021. BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 6663–6675. https://doi.org/10.18653/v1/2021.emnlp-main.534
  • Xue et al. (2021) Fuzhao Xue, Aixin Sun, Hao Zhang, and Eng Siong Chng. 2021. GDPNet: Refining Latent Multi-View Graph for Relation Extraction. In AAAI.
  • Yamada et al. (2020) Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6442–6454. https://doi.org/10.18653/v1/2020.emnlp-main.523
  • Yin et al. (2019) Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3914–3923. https://doi.org/10.18653/v1/D19-1404
  • Zeng et al. (2015) Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 1753–1762. https://doi.org/10.18653/v1/D15-1203
  • Zhang et al. (2022) Haoxing Zhang, Xiaofeng Zhang, Haibo Huang, and Lei Yu. 2022. Prompt-Based Meta-Learning For Few-shot Text Classification. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 1342–1357. https://aclanthology.org/2022.emnlp-main.87
  • Zhang et al. (2017) Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware Attention and Supervised Data Improve Slot Filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 35–45. https://doi.org/10.18653/v1/D17-1004
  • Zhong et al. (2022) Wanjun Zhong, Yifan Gao, Ning Ding, Yujia Qin, Zhiyuan Liu, Ming Zhou, Jiahai Wang, Jian Yin, and Nan Duan. 2022. ProQA: Structural Prompt-based Pre-training for Unified Question Answering. arXiv e-prints (2022).
  • Zhong and Chen (2021) Zexuan Zhong and Danqi Chen. 2021. A Frustratingly Easy Approach for Entity and Relation Extraction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 50–61. https://doi.org/10.18653/v1/2021.naacl-main.5
  • Zhong et al. (2021) Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021. Factual Probing Is [MASK]: Learning vs. Learning to Recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 5017–5033. https://doi.org/10.18653/v1/2021.naacl-main.398
  • Zhu et al. (2022) Yi Zhu, Xinke Zhou, Jipeng Qiang, Yun Li, Yunhao Yuan, and Xindong Wu. 2022. Prompt-Learning for Short Text Classification. arXiv preprint arXiv:2202.11345 (2022).