
Few-shot Open Relation Extraction with Gaussian Prototype and Adaptive Margin

Tianlin Guo*, Lingling Zhang*, Jiaxin Wang, Yunkuo Lei, Yifei Li, Haofen Wang, Jun Liu

*These authors contributed equally to the work and should be regarded as co-first authors.
Abstract

Few-shot relation extraction with none-of-the-above (FsRE with NOTA) aims at predicting labels in few-shot scenarios with unknown classes. FsRE with NOTA is more challenging than the conventional few-shot relation extraction task, since the boundaries of unknown classes are complex and difficult to learn. Meta-learning based methods, especially prototype-based methods, are the mainstream solutions to this task. They obtain the classification boundary by learning the sample distribution of each class. However, their performance is limited because few-shot overfitting and NOTA boundary confusion lead to misclassification between known and unknown classes. To this end, we propose a novel framework based on Gaussian prototype and adaptive margin named GPAM for FsRE with NOTA, which includes three modules: semi-factual representation, GMM-prototype metric learning, and decision boundary learning. The first two modules obtain better representations to solve the few-shot problem through debiased information enhancement and Gaussian space distance measurement. The third module learns more accurate classification boundaries and prototypes through an adaptive margin and negative sampling. In the training procedure of GPAM, we use a contrastive learning loss that jointly considers the effects of range and margin on the classification of known and unknown classes, ensuring the model’s stability and robustness. Extensive experiments and ablations on the FewRel dataset show that GPAM surpasses previous prototype methods and achieves state-of-the-art performance.

keywords:
Relation Extraction with None-of-the-Above, Few-shot Learning, Semi-Factual Representation, Gaussian Distance Metric, Prototype Learning
journal: Pattern Recognition
Affiliations:
[inst1] School of Computer Science and Technology and Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an 710049, China
[inst2] School of Computer Science and Technology and Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Xi’an 710049, China
[inst3] College of Design and Innovation, Tongji University, Shanghai 200092, China

Email addresses: [email protected] (Tianlin Guo), [email protected] (Lingling Zhang), [email protected] (Jiaxin Wang), [email protected] (Yunkuo Lei), [email protected] (Yifei Li), [email protected] (Haofen Wang), [email protected] (Jun Liu)
Highlights

Semi-factual representation helps alleviate the problem of information biases.

Gaussian distance metric better captures the distribution of few shots.

Adaptive margin yields more accurate decision boundaries and prototypes.

Evaluation on the FewRel dataset achieves state-of-the-art performance.

1 Introduction

Relation extraction (RE) is an important task in the field of Natural Language Processing (NLP), which aims to identify and extract semantic relationships between entities from text based on a given list of relations. For the example in Fig.1(a), the model is trained with a large number of examples from the given categories and should extract the relation capital of when given the entity pair (Beijing, China) in the sentence Beijing is the capital of China. The RE task has been extensively studied in previous work [1, 2], and existing models can already achieve good classification performance. However, in real-world settings RE faces a few-shot issue: large-scale labeled datasets are difficult to obtain. In addition, sentences that express relations not included in the given set should also be taken into consideration in practical applications, and these unknown classes are called “none of the above” (NOTA) [3], as in the example shown in Fig.1(b). One of its characteristics is that the proportion of unknown classes in each task is much larger than that of known classes. For example, for the Few-shot Relation Extraction (FsRE) with NOTA task in Fig.1(b), the number of known classes is 2, namely capital of and member of, and all the remaining classes in the relation set are unknown classes. When given the entity pair (Maya Airport, Brazzaville) in the sentence The airline’s hub is Maya Airport in Brazzaville, the model should determine that the relation is neither capital of nor member of in the relation list, and output the result NOTA. To the best of our knowledge, the performance of existing models in dealing with this problem is limited. We summarize two critical challenges in solving this task:

Figure 1: Differences between the two tasks RE and FsRE with NOTA. (a) There is a large-scale annotated dataset and all classification results in the query set are within the given set of known classes. (b) There is a few-shot dataset and the query set contains some relations that are not among the known classes and should be classified as NOTA.

1) Few-shot Overfitting. Training a classifier for each class can easily lead to overfitting when sample supervision is limited. To overcome this issue, many meta-learning approaches, especially prototype-based ones [4], have been proposed. They extract the features of each sample, obtain the prototype anchor of each class by averaging or other techniques, and classify samples based on their distances to these prototypes. However, the traditional prototype-based method has two major flaws. Firstly, directly and simply extracting sample features is incomplete and biased. Relevant studies [5, 6] have demonstrated that head and tail entity information and context information can introduce biases into RE model training, which means that over-reliance on entity or context information can cause the model to infer non-existent relationships when encountering similar entities or contexts. Taking entity bias as an example, the two sentences Beijing is the capital of China and Beijing is located in China have the same entity pair (Beijing, China) but express different relations, capital of and located in. Models trained with limited samples are more easily misled by these biases into predicting a wrong relation. Secondly, the averaging method for prototype computation may fail to accurately represent the distribution of a class. Although Che et al. [7] introduce a task-specific anchor and combine it with the original class-specific anchor to optimize the generated anchor, this method is more affected by the few-shot scenario, and it is difficult to obtain an accurate anchor position using only a few positive samples.

2) NOTA Boundary Confusion. There is a tendency toward boundary confusion in the FsRE with NOTA scenario, where instances of the NOTA class may be misclassified as one of the known classes. Only a few conventional models specifically address the NOTA problem, and they simply treat NOTA as a single class during training. For example, Liu et al. [8] leverage triplet paraphrase to pre-train low-shot relation extraction ability and match queries against relation labels including NOTA. However, since the semantics of NOTA classes differ across scenarios, learning their features as a whole fails to capture the correct distribution. This limitation can lead to errors, as the model may incorrectly assign NOTA instances to the closest known class prototype rather than recognizing them as belonging to a separate, unknown category. Meanwhile, the few-shot issue is exacerbated in the presence of the NOTA class, as the model must distinguish between known classes and the unknown class with even fewer data points, increasing the risk of misclassification.

To address these challenges, we propose the framework GPAM, a prototypical learning method using Gaussian Prototype and Adaptive Margin. As shown in Fig.2, GPAM is mainly composed of three key modules: the semi-factual representation, the GMM-prototype metric learning, and the decision boundary learning module. The first module extracts better feature representations based on debiased views. The main view and three debiased views are used to deal with the biases caused by entity and context information, as shown in Fig.2(a). This reduces the impact of biased information on model training and yields more accurate prototype representations through augmentation. The second module measures the distance between samples and class anchors more accurately. Gaussian distributions with variable mean and variance are used to characterize the distribution of each view, and mixed weights are used to aggregate the four views according to their respective contributions, as shown in Fig.2(b). The first and second modules result in a more accurate distribution and alleviate the few-shot issue, thereby yielding precise prototypes. The third module is designed to distinguish known and unknown classes and obtain more precise prototype boundaries, as shown in Fig.2(c). To better distinguish known and NOTA classes, a dynamic rather than a fixed margin is introduced, and negative sample information is used to iteratively optimize the margin range. For training, contrastive learning strategies are used to make full use of the information of all instances in a query set. This makes the decision boundaries between classes more accurate and reduces NOTA confusion.

Our main contributions can be summarized as follows:

1. We propose a novel few-shot relation extraction framework, GPAM, for the challenging NOTA task. The framework alleviates the few-shot issue while ensuring good prototype learning and boundary division.

2. We propose a novel Gaussian-distribution-based prototype distance measurement strategy and introduce semi-factual representation, which alleviates the few-shot overfitting problem.

3. We introduce a new adaptive margin and combine it with the prototype range to optimize the decision boundary via a contrastive learning loss function, which alleviates the NOTA boundary confusion problem.

4. We conduct experiments on the FewRel dataset to demonstrate the superiority of GPAM. The results show that the model achieves state-of-the-art performance and outperforms previous prototype learning methods.

2 Related Work

In this section, we briefly introduce three categories of research relevant to our work, including relation extraction, few-shot relation extraction, and few-shot relation extraction with NOTA.

2.1 Relation Extraction

Relation extraction (RE) is a fundamental task in natural language processing (NLP) that aims to identify and classify semantic relationships between entities within a text. Existing studies apply supervised deep learning techniques to extract relations based on large-scale labeled data [9, 10, 11], which has led to significant improvements in accuracy. However, data annotation is time-consuming and labor-intensive. Some researchers [12, 13, 14] use distant supervision methods to align text with external knowledge bases to obtain labels. Liang et al. [14] introduce a constraint graph to model the dependencies between labels and share information between different relation nodes to alleviate the long-tail problem. But automatically generated labels may contain noise and depend on the quality of external knowledge bases, affecting model training. With the development of few-shot learning, few-shot relation extraction has gradually become mainstream.

2.2 Few-shot Relation Extraction

Han et al. [15] are the first to propose the concept of few-shot relation extraction (FsRE), in which only a small number of data samples are provided, and introduce the relation extraction dataset FewRel. Recent work on few-shot relation extraction focuses on three main directions: meta-learning methods, especially prototype-based methods [16, 17]; parameter optimization learning methods [3, 18, 19, 20]; and large language model based methods [21]. Prototype-based methods aim to obtain the distribution of each category, named the prototype, and assign test samples to the closest one. Xiao et al. [17] propose an adaptive hybrid mechanism based on typical prototypes to integrate label information into the features of each category’s support instances. Parameter optimization learning methods use fine-tuning to optimize the parameters of a pre-trained language model to adapt to the RE task. Zhang et al. [18] take relation descriptions as prompt inputs and randomly discard some of them to simulate the scenario where support set labels are not visible. Zhang et al. [20] utilize prompt-tuning to fine-tune a PLM to integrate relation information and original prototypes. Large language model based methods leverage vast knowledge and contextual understanding capability to effectively perform tasks in few-shot scenarios. Zhang et al. [21] utilize an instruction alignment method to fine-tune LLMs and align the RE task with the QA task to enhance relation extraction performance. Our work focuses on prototype-based methods and aims to stimulate the prototype’s ability to identify unknown classes, which are called “none of the above” (NOTA).

2.3 Few-shot Relation Extraction with NOTA

When NOTA is introduced into the relation extraction task, an additional challenge arises: accurately distinguishing known classes while precisely identifying and separating the unknown classes. On image classification tasks, there are some studies on few-shot recognition with unknown classes. Song et al. [22] treat the background class as a pseudo label to train the boundary between known and unknown classes. For text classification tasks, however, there are far fewer related studies. BERT-PAIR [3] is the first method for this task, which pairs support and query samples for both known and NOTA classes based on a sequence model to calculate the similarity score. Besides, Li et al. [19] obtain pseudo labels for pre-training by extracting paraphrases and pass the universal knowledge to tiny models.

3 Problem Formalization

Based on the meta-learning framework, the data is divided into training and validation sets with non-overlapping class labels. We then construct meta-tasks from the training set and train the model to achieve optimal performance on the FsRE with NOTA task. The task is formalized as follows. An $N$-way $K$-shot $Q$-query meta-task from dataset $D$ is defined as $T=\{S,Q^{k},Q^{u}\}$, where $S=\{(x_{i},r_{i})\}_{i=1}^{NK}$, $Q^{k}=\{(x_{i},r_{i})\}_{i=1}^{|Q^{k}|}$ and $Q^{u}=\{(x_{i},r_{i})\}_{i=1}^{|Q^{u}|}$ denote the support set, the query set drawn from the known relation set $C^{k}$, and the query set drawn from the unknown relation set $C^{u}$, respectively. The model should predict the label $r_{i}$ of a query instance $x_{i}=(h_{i},r_{i},t_{i},c_{i})$, where the elements of the quadruple are the head entity, relation, tail entity, and context. Different from the traditional FsRE task, the correct relation label is $r\in\{r_{1},r_{2},...,r_{N},\text{NOTA}\}$ instead of $r\in\{r_{1},r_{2},...,r_{N}\}$.
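
To make the episode structure concrete, the sketch below shows one way a single FsRE with NOTA meta-task could be organized. The class and field names are illustrative assumptions, not taken from any released implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instance:
    head: str       # head entity h_i
    tail: str       # tail entity t_i
    context: str    # sentence c_i containing both entities
    relation: str   # gold relation label r_i, or "NOTA"

@dataclass
class MetaTask:
    support: List[Instance]        # S: N*K labeled instances from the known relations
    query_known: List[Instance]    # Q^k: queries labeled with one of the N known relations
    query_unknown: List[Instance]  # Q^u: queries whose gold label is NOTA

def label_space(known_relations: List[str]) -> List[str]:
    # FsRE with NOTA predicts over the N known relations plus the extra NOTA label.
    return known_relations + ["NOTA"]
```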

4 Methodology

The workflow of our model GPAM is shown in Fig.2. Our model consists of three core modules: 1) Semi-Factual Representation, where three debiased views are derived from the main view as semi-factual data to augment the few-shot datasets; 2) GMM-Prototype Metric Learning, where the features of the four views are fitted to a Gaussian mixture model to obtain the distance metric; 3) Decision Boundary Learning, where the prototype range and adaptive margin are used to accurately distinguish the categories and obtain the decision boundary. We introduce these modules in detail as follows.

Figure 2: Overview of Our Model GPAM.

4.1 Semi-Factual Representation

In previous studies [4], prototypes were learned solely from original sentence information, with model performance constrained by biases from entities and contexts. We follow the Semi-Factual Representation (SFR) strategy proposed in our previous work [23], which learns representations through multiple debiased views to mitigate this limitation. As shown in Fig.2(a), the details are as follows.

Main View. This view marks the head and tail entities in the raw sentence $\bm{s}=(s_{1},s_{2},...,s_{L})$ with the tokens $\langle h\rangle$, $\langle /h\rangle$ and $\langle t\rangle$, $\langle /t\rangle$, respectively. We denote the sentence after applying this view as $\bm{s}^{m}=(s_{1}^{m},s_{2}^{m},...,s_{L}^{m})$.

Head and Tail Debiased Views. These two views replace the head entity $h_{i}$ and the tail entity $t_{i}$ with their attribute features. For example, the head entity “Tennen Mountains” is replaced by its attribute “[Mountain]”. The sentences in the head debiased view and the tail debiased view are denoted by $\bm{s}^{h}=(s_{1}^{h},s_{2}^{h},...,s_{L}^{h})$ and $\bm{s}^{t}=(s_{1}^{t},s_{2}^{t},...,s_{L}^{t})$, respectively.

Context Debiased View. This view utilizes WordNet [24] to generate synonyms and randomly replaces 5% of the words in $\bm{s}$. The resulting sentence is denoted by $\bm{s}^{c}=(s_{1}^{c},s_{2}^{c},...,s_{L}^{c})$.

We then use BERT [25] to encode these four views $\bm{s}^{j}$ $(j=m,h,t,c)$, obtaining the feature representations $\bm{z}^{m}$, $\bm{z}^{h}$, $\bm{z}^{t}$, and $\bm{z}^{c}$, respectively.
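
The sketch below illustrates how the four views could be constructed from a tokenized sentence, assuming the entity spans and coarse entity attributes (e.g. from an external resource) are already available. The marker tokens, helper names, and the 5% replacement ratio handling are illustrative, not the exact implementation used by GPAM.

```python
import random
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

def main_view(tokens, h_span, t_span):
    # Main view: wrap head/tail spans with entity markers, inserting from right to left
    # so that earlier token indices remain valid.
    s = list(tokens)
    for (start, end), open_tok, close_tok in sorted(
            [(h_span, "<h>", "</h>"), (t_span, "<t>", "</t>")],
            key=lambda x: x[0][0], reverse=True):
        s.insert(end + 1, close_tok)
        s.insert(start, open_tok)
    return s

def entity_debiased_view(tokens, span, attribute):
    # Head/tail debiased view: replace the entity span with its attribute, e.g. "[Mountain]".
    start, end = span
    return tokens[:start] + [f"[{attribute}]"] + tokens[end + 1:]

def context_debiased_view(tokens, replace_ratio=0.05, seed=0):
    # Context debiased view: swap ~5% of tokens with a WordNet synonym when one exists.
    rng = random.Random(seed)
    out = list(tokens)
    for i, tok in enumerate(out):
        if rng.random() < replace_ratio:
            lemmas = [l.name() for syn in wordnet.synsets(tok)
                      for l in syn.lemmas() if l.name() != tok]
            if lemmas:
                out[i] = rng.choice(lemmas).replace("_", " ")
    return out
```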

4.2 GMM-Prototype Metric Learning

The conventional prototype method only calculates the prototype anchor and radius, that is, the prototype is only regarded as a sphere with a certain radius in high-dimensional space [7]. It only considers the mean of samples, but ignores the variance information that accurately reflects the distribution. This is prone to inaccurate description, especially for samples near the decision boundary. To alleviate this problem, we assume that the relation $r$ follows a mixed Gaussian distribution aggregated from the features of the four views and propose a Gaussian Mixture Model (GMM)-based strategy detailed as follows.

First, prompt templates are constructed for the four views and inserted into the input embedding sequence as prefix tokens to do prompt-tuning for the $K$ shots in a meta-task,

$input^{j}=[prompt^{j}]\,[\bm{z}_{1}^{j}][\bm{z}_{2}^{j}]\ldots[\bm{z}_{K}^{j}]$ (1)

where $prompt^{j}$ $(j=m,h,t,c)$ represents the prompt template corresponding to each view. Different from traditional high-dimensional distance estimation strategies, the Mahalanobis distance is used instead of the Euclidean distance to better fit the characteristics of the Gaussian distribution, utilizing both mean and variance information. Next, the mean vector $\bm{\mu}$ and diagonal variance matrix $\text{diag}(\bm{v})$ of the Gaussian space are calculated for the relation type $r$, and the following formula is used to obtain the $\bm{\mu}$, $\bm{v}$ values of the main view and the three debiased views in turn:

$\bm{\mu}^{j},\bm{v}^{j}=\text{Transformer}([prompt^{j}]\,[\bm{z}_{1}^{j}][\bm{z}_{2}^{j}]\ldots[\bm{z}_{K}^{j}];\theta)$ (2)

where $\theta$ is a learnable parameter shared by all views. The four views reflect the features of relation $r$ from different aspects; therefore, the overall feature of $r$ can be obtained by fusing them according to their respective weights. The following function is utilized to calculate the adaptive mixed Gaussian weights $w\in\mathbb{R}^{4}$ of the four views:

$w=\text{Softmax\_Linear}\big(\text{SelfAttention}([\bm{\mu}^{m};\bm{v}^{m}];[\bm{\mu}^{h};\bm{v}^{h}];[\bm{\mu}^{t};\bm{v}^{t}];[\bm{\mu}^{c};\bm{v}^{c}];\phi_{1});\phi_{2}\big)$ (3)

where $w$ is a four-dimensional vector that reflects the relative weights of the four views and differs across samples, and $\phi_{1}$ and $\phi_{2}$ are learnable parameters. Moreover, for any instance $x$ in $Q^{k}$ or $Q^{u}$, the distance between it and the relation $r$ in the mixed Gaussian prototype space can be measured as:

$d(x,r)=w^{T}[d^{m}(x,r);d^{h}(x,r);d^{t}(x,r);d^{c}(x,r)]$ (4)

where, for $j=m,h,t,c$:

$d^{j}(x,r)=(\bm{z}^{j}-\bm{\mu}^{j})^{T}\,\text{diag}(\bm{v}^{j})\,(\bm{z}^{j}-\bm{\mu}^{j})$ (5)

For the support set $S$, we apply Eq.(4) to obtain the distances between all samples and the candidate relations in the mixed Gaussian space, and average over positive instances to obtain the distribution feature of each relation category $r_{c}$.
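
A minimal PyTorch sketch of Eqs.(4) and (5) is given below: each view contributes a diagonal Mahalanobis-style distance, and the adaptive weights fuse the four views. The Transformer that produces $\bm{\mu}^{j},\bm{v}^{j}$ and the self-attention head that produces the weight logits are omitted; tensor shapes and function names are assumptions for illustration only.

```python
import torch

def diag_mahalanobis(z, mu, v, eps=1e-6):
    # Eq.(5): (z - mu)^T diag(v) (z - mu); eps keeps the diagonal strictly positive.
    diff = z - mu
    return (diff * v.clamp_min(eps) * diff).sum(-1)

def mixed_view_distance(z_views, mu_views, v_views, view_logits):
    # z_views, mu_views, v_views: dicts keyed by view j in {"m", "h", "t", "c"};
    # view_logits: unnormalized scores from the attention/linear head of Eq.(3).
    w = torch.softmax(view_logits, dim=-1)                       # adaptive mixture weights
    d = torch.stack([diag_mahalanobis(z_views[j], mu_views[j], v_views[j])
                     for j in ("m", "h", "t", "c")], dim=-1)     # per-view distances, Eq.(5)
    return (w * d).sum(-1)                                       # fused distance, Eq.(4)
```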

4.3 Decision Boundary Learning

Since, in few-shot scenarios, the distribution of positive samples of a category has a key influence on the relation, we introduce the prototype range indicator $R_{c}$ to describe the range of each category spanned by its positive samples. However, using only the single indicator $R_{c}$ to judge the category may cause misclassification of samples near the decision boundary between known and NOTA classes. To alleviate this problem, the adaptive margin of NOTA, $M_{c}$, is introduced. $M_{c}$ is another indicator that affects the boundary between positive and negative examples, reflecting the tolerance of the learned prototype to negative examples. We utilize the distances from the positive instances to the relation anchor $r_{c}$ to obtain $R_{c}$, and the distances from the negative instances to $r_{c}$ to obtain $M_{c}$, as follows:

$R_{c}=h_{\tau_{1}}\left(\left\{d(x^{+}_{i},r_{c})\right\}^{K}_{i=1};\,x^{+}_{i}\in S\right)$ (6)
$M_{c}=h_{\tau_{2}}\left(\left\{d(x^{-}_{i},r_{c})-R_{c}\right\}^{(N-1)K}_{i=1};\,x^{-}_{i}\in S\right)$ (7)

where $h_{\tau}(\cdot)$ is the quantile function, which follows the principle that most positive instances should lie within the prototype range $R_{c}$ and most negative instances should lie outside $R_{c}+M_{c}$; $\tau_{1}$ is a learnable parameter that controls the boundary range, $\tau_{2}$ is a learnable parameter that controls the tolerance for negative instances, and $x^{+}_{i}$ and $x^{-}_{i}$ denote positive and negative instances, respectively.

For the query set $Q=Q^{k}\cup Q^{u}$, we likewise utilize Eq.(4) to obtain the distances between instances and each prototype anchor. To determine whether an instance $x$ belongs to relation $r_{c}$, the classification rules are as follows:

$\left\{\begin{array}{ll}\text{if } d(x,r_{c})\leq R_{c}, & \text{the label of }x\text{ is }r_{c},\\ \text{if } d(x,r_{c})>R_{c}+M_{c}, & \text{the label of }x\text{ is not }r_{c}.\end{array}\right.$

Besides, if an instance $x$ meets the criteria of multiple classes, the class $r_{c}$ with the smallest GMM distance is selected, and if $x$ does not belong to any of the relations in the known set, it is assigned to the NOTA class. In this way we can accurately predict the labels of query instances, whether they belong to a known class or the NOTA class.
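
The range, margin, and classification rule above can be sketched as follows. How queries falling in the ambiguous band between $R_{c}$ and $R_{c}+M_{c}$ for every class are handled is not fully specified by the rule; the fallback to the nearest prototype below is our assumption, not the paper's exact procedure.

```python
import torch

def prototype_range(pos_dists, tau1):
    # Eq.(6): R_c as the tau1-quantile of positive-support distances to the anchor r_c.
    return torch.quantile(pos_dists, tau1)

def prototype_margin(neg_dists, R_c, tau2):
    # Eq.(7): M_c as the tau2-quantile of (negative distance - R_c).
    return torch.quantile(neg_dists - R_c, tau2)

def classify(query_dists, R, M):
    # query_dists, R, M: tensors of shape [N] holding d(x, r_c), R_c, M_c for the N known classes.
    inside = query_dists <= R
    if inside.any():
        # Several classes may claim the query; pick the one with the smallest GMM distance.
        candidates = torch.where(inside, query_dists,
                                 torch.full_like(query_dists, float("inf")))
        return int(candidates.argmin())
    if (query_dists > R + M).all():
        return "NOTA"            # outside every class boundary
    # Ambiguous band: assumed fallback to the nearest prototype.
    return int(query_dists.argmin())
```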

In the training procedure, we find that one important reason for the poor performance of prototype learning methods on the NOTA problem is that the boundary of NOTA is complex and difficult to describe accurately with a small number of samples. Therefore, following the method of Song et al. [22], we expand the support with negative instances located outside the $R_{c}+M_{c}$ region, in the background regions, by a certain proportion. These examples are called pseudo negative samples (PNS); they do not belong to any of the classes and serve as negative samples for all classes. According to the research of Wang et al. [26], negative instances close to the boundary have a greater impact on the classification between known and unknown classes, so a higher generation probability is assigned to these negative instances. For each meta-task, several points are sampled outside the range of $R_{c}+M_{c}$ in the feature vector space as the pseudo negative example set. Then, the probability of selecting a sample from the pseudo negative example set is calculated by the ratio of its distance from the margin boundary to the closed distribution range, according to the following formula:

$P=\text{Softmax}\left\{\left[\sum_{c=1}^{N}\frac{\left\|d(p,r_{c})-(R_{c}+M_{c})\right\|}{R_{c}}\right]^{-1}\right\}$ (8)

where $p$ is a pseudo negative sample generated by this strategy, which is added to the support set $S$ of all classes as a negative instance. After adding pseudo negative samples, we update the values of the range $R_{c}$ and the margin $M_{c}$ again by Eqs.(6) and (7).
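
A possible implementation of the selection probability in Eq.(8) is sketched below. How the candidate points outside $R_{c}+M_{c}$ are generated in feature space is left out, and the tensor layout is an assumption.

```python
import torch

def pns_selection_probs(pseudo_dists, R, M):
    # pseudo_dists: [P, N] distances d(p, r_c) from each candidate pseudo negative to the N anchors;
    # R, M: [N] prototype ranges and margins.  Candidates whose summed, range-normalized gap
    # to the margin boundary is small receive a higher selection probability (Eq.(8)).
    gap = (pseudo_dists - (R + M)).abs() / R        # distance to the boundary, scaled by R_c
    score = 1.0 / gap.sum(dim=-1)                   # inverse of the summed normalized gap
    return torch.softmax(score, dim=0)              # softmax over candidate pseudo negatives

# Candidates can then be drawn with torch.multinomial(probs, num_samples, replacement=False)
# and appended to the support set as negatives for every class.
```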

Then, the contrastive learning strategy is used to optimize our model, and GPAM loss can be formulated as:

$L_{GPAM}=\frac{1}{N}\sum_{c=1}^{N}\Big\{\lambda R_{c}^{2}
+\frac{1}{\alpha}\log\Big[1+\sum_{x^{+}_{i}\in\mathcal{S}_{c}\cup Q^{k}_{c}}e^{\alpha(d(x^{+}_{i},r_{c})-R_{c})}\Big]
+\frac{1}{\beta}\log\Big[1+\sum_{x^{-}_{i}\in\mathcal{S}_{c}\cup Q^{k}_{c}\cup Q^{u}_{c}\cup P_{c}}e^{-\beta(d(x^{-}_{i},r_{c})-(R_{c}+M_{c}))}\Big]\Big\}$ (9)

where $\lambda$ is a hyperparameter that controls the range of the prototype, and $\alpha$ and $\beta$ are adjustable hyperparameters in contrastive learning.

Our loss function can be divided into three components. The initial component focuses on minimizing the range $R_{c}$ of known classes. The middle component ensures that positive examples are positioned as close as possible to the anchors of the known classes. The final component ensures that negative examples are not only distanced from the anchors of the known classes but also placed outside the boundary $R_{c}+M_{c}$. Finally, we summarize the training and optimization process of GPAM in Algorithm 1.
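
For reference, Eq.(9) could be implemented roughly as follows. The per-class grouping of positive and negative distances and the default hyperparameter values (taken from the training details in Section 5.1) are assumptions of this sketch.

```python
import torch

def gpam_loss(pos_dists, neg_dists, R, M, lam=0.001, alpha=1.0, beta=3.0):
    """Eq.(9) averaged over the N known classes.
    pos_dists[c]: tensor of d(x+, r_c) for positives of class c (support + known queries);
    neg_dists[c]: tensor of d(x-, r_c) for negatives of class c (other classes, NOTA queries, PNS);
    R, M: tensors of shape [N] holding R_c and M_c.
    """
    N = R.shape[0]
    loss = 0.0
    for c in range(N):
        range_term = lam * R[c] ** 2                                        # shrink the class range
        pos_term = torch.log1p(                                             # pull positives inside R_c
            torch.exp(alpha * (pos_dists[c] - R[c])).sum()) / alpha
        neg_term = torch.log1p(                                             # push negatives beyond R_c + M_c
            torch.exp(-beta * (neg_dists[c] - (R[c] + M[c]))).sum()) / beta
        loss = loss + range_term + pos_term + neg_term
    return loss / N
```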

Algorithm 1 The Process of GPAM.
1: Input: Support set $S=\{x_{i},y_{i}\}^{NK}_{i=1}$; known query set $Q^{k}=\{x_{i},y_{i}\}^{|Q^{k}|}_{i=1}$; unknown query set $Q^{u}=\{x_{i},y_{i}\}^{|Q^{u}|}_{i=1}$; boundary control parameters $\tau_{1}$ and $\tau_{2}$; language model $M$;
2: Output: Optimized model parameters of GPAM.
3: Procedure GPAM(Input parameters)
   // Semi-Factual Representation Module
4: for instance $i$ in $S$, $Q^{k}$ and $Q^{u}$ do
5:     Generate semi-factual views for the input instances.
6: end for
7: Initialize variables
8: while GPAM does not converge do
9:     For instances in $S$, $Q^{k}$ and $Q^{u}$, learn the features of the four views by Eq.(2);
       // GMM-Prototype Metric Learning Module
10:    Compute the mixed Gaussian weights of the four views by Eq.(3);
11:    Compute the distances in the mixed Gaussian prototype space by Eqs.(4) and (5);
       // Decision Boundary Learning Module
12:    Compute the range and margin of prototypes by Eqs.(6) and (7);
13:    Expand the support set $S$ with pseudo negative instances by Eq.(8);
14:    Update the range and margin of prototypes by Eqs.(6) and (7);
15:    Optimize $M$ by the loss function in Eq.(9);
16:    Update parameters in the next episode.
17: end while
    return Model parameters of GPAM
18: End Procedure

5 Experiment

In this section, extensive experiments are conducted to compare the proposed GPAM with popular baselines. The detailed settings and analyses are as follows.

5.1 Settings

Datasets. We perform experiments on the public relation extraction dataset FewRel [3]. FewRel is a benchmark dataset designed for evaluating models on relation classification tasks. It features 100 diverse relation types with annotated examples sourced from Wikipedia, and contains 700 instances for each relation. We use meta-learning methods to randomly extract samples from the FewRel dataset according to different task settings to form meta-task datasets without redundant instances. Different from the conventional few-shot learning task, we add NOTA instances to the meta-task datasets according to a certain NOTA rate.
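
As an illustration of this episode construction, the sketch below samples an N-way K-shot task from a relation-keyed corpus and then replaces a NOTA-rate fraction of the queries with instances from relations outside the sampled known set. The exact sampling procedure used in our experiments may differ in details such as deduplication and the precise definition of the NOTA rate, so treat this as an assumption-labeled outline.

```python
import random

def sample_episode(data_by_relation, n_way=5, k_shot=1, q_query=5, nota_rate=0.15, seed=0):
    # data_by_relation: dict mapping relation id -> list of instances (e.g. parsed FewRel).
    rng = random.Random(seed)
    known = rng.sample(list(data_by_relation), n_way)
    support, query = [], []
    for r in known:
        pool = rng.sample(data_by_relation[r], k_shot + q_query)
        support += [(x, r) for x in pool[:k_shot]]
        query += [(x, r) for x in pool[k_shot:]]
    # Replace a NOTA-rate fraction of the queries with instances from unseen relations.
    n_nota = round(len(query) * nota_rate)
    unknown = [r for r in data_by_relation if r not in known]
    for i in rng.sample(range(len(query)), n_nota):
        r = rng.choice(unknown)
        query[i] = (rng.choice(data_by_relation[r]), "NOTA")
    return support, query
```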

Comparing Methods. We compare our model GPAM with the following outstanding baselines. These methods can be categorized into three groups. 1) FsRE models with NOTA. Proto-BERT [4]: the original prototype network algorithm; BERT-PAIR [3]: an approach to measure the similarity of sentence pairs; MCMN [8]: an approach using triplet paraphrase and meta-learning paradigm to do low-shot RE. 2) Those without NOTA. Proto-HATT [27]: a hybrid attention-based prototypical network; MLMAN [28]: a multi-level matching and aggregation prototypical network; REGRAB [29]: a Bayesian meta-learning approach to learn the posterior distribution of the prototype and solve the uncertainty of the prototype vector; CTEG [30]: a model to decouple high co-occurrence relations; HCRP [31]: an approach to introduce relation label information and distinguish task difficulty; SimpleFSRE [16]: a direct addition approach that fuses the embedding of relation description to the prototype representation; SaCon [32]: a framework using diverse viewpoints through instance-label pairs to capture intrinsic textual semantics. 3) Large language models. GPT-4o: an outstanding closed-source large language model developed by OpenAI; GLM-4: an open-source large language model developed by the Tsinghua Zhipu AI team.

Training Details. For the GMM-prototype metric learning module, the length of the prompt template is set to 100. For the decision boundary learning module, the initial values of $\lambda$, $\tau_{1}$ and $\tau_{2}$ are set to 0.001, 0.1 and 0.2 respectively, and the ratio of pseudo negative sampling is set to 0.2. In the loss function, the positive impact parameter $\alpha$ is set to 1 and the negative impact parameter $\beta$ is set to 3, so that negative instances have a greater influence in the contrastive learning process. For the training process, we choose the SGD algorithm with learning rate 0.0002 and weight decay 0.0001 as the optimizer.
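
For reproducibility, the reported hyperparameters could be collected as in the sketch below; the model object and parameter grouping are placeholders rather than the actual GPAM implementation.

```python
import torch

# Hyperparameters as reported above; names are illustrative.
config = dict(
    prompt_length=100,    # prefix prompt tokens per view
    lam=0.001,            # weight on the prototype-range term in Eq.(9)
    tau1=0.1, tau2=0.2,   # initial quantile parameters for R_c and M_c
    pns_ratio=0.2,        # pseudo negative sampling ratio
    alpha=1.0, beta=3.0,  # contrastive-loss temperatures for positives / negatives
    lr=2e-4, weight_decay=1e-4,
)

def build_optimizer(model):
    # SGD with the learning rate and weight decay reported in the training details.
    return torch.optim.SGD(model.parameters(), lr=config["lr"],
                           weight_decay=config["weight_decay"])
```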

5.2 Performance Comparison

Results on FsRE with NOTA task. Results on the FewRel dataset with NOTA are shown in Table 1, where total, known, and NOTA accuracies are evaluated individually. The observations are as follows:

1. Compared with previous methods, our GPAM clearly achieves state-of-the-art performance in all settings. The total accuracy of GPAM exceeds the previous best conventional model MCMN by 5.11%, 4.15%, 8.19%, and 7.13% on the four tasks, respectively. Benefiting from the three designed modules, GPAM achieves good results on the FsRE with NOTA task.

2. GPAM's performance on the NOTA class is particularly outstanding under the NOTA rate 0.5 setting. Compared with the previous best-performing method MCMN, GPAM improves the accuracy of NOTA class extraction by 8.85% to 10.30% at NOTA rate 0.5. The reason is that, with the GMM-based distance metric and adaptive margin introduced, GPAM achieves significant improvements in classifying both known and unknown classes.

3. As the number of shots $K$ increases, the performance of GPAM becomes higher and more stable. When the number of shots increases from 1 to 5, the accuracy on the NOTA class increases from 84.25 to 93.25 at NOTA rate 0.15, and from 90.75 to 96.10 at NOTA rate 0.5. This improvement is reasonable: with more instances we can obtain better class prototypes and more precise decision boundaries.

Table 1: Results on the FewRel validation dataset with NOTA; total, known, and NOTA accuracies are evaluated individually. The optimal values are marked in bold.

Models | 5-way-1-shot 0.15 (total / known / NOTA) | 5-way-5-shot 0.15 (total / known / NOTA) | 5-way-1-shot 0.5 (total / known / NOTA) | 5-way-5-shot 0.5 (total / known / NOTA)
Proto-BERT (NIPS2017) | 68.38 / —— / —— | 77.25 / —— / —— | 40.83 / —— / —— | 45.29 / —— / ——
BERT-PAIR (EMNLP2019) | 76.42 / 77.90 / 59.00 | 79.62 / 83.55 / 60.00 | 70.63 / 72.35 / 68.90 | 74.42 / 76.90 / 71.95
MCMN (ACL2022) | 86.47 / 87.52 / 81.20 | 90.72 / 91.56 / 79.80 | 84.29 / 85.24 / 81.90 | 89.70 / 91.29 / 85.80
GPT-4o | 65.98 / 62.06 / 85.57 | 68.75 / 66.25 / 81.25 | 71.03 / 56.29 / 85.77 | 78.30 / 71.06 / 85.53
GLM-4 | 90.24 / 90.51 / 88.89 | 91.83 / 93.33 / 84.31 | 87.43 / 91.24 / 83.62 | 84.47 / 94.37 / 74.56
GPAM (Ours) | 91.58 / 93.05 / 84.25 | 94.87 / 95.20 / 93.25 | 92.48 / 94.21 / 90.75 | 95.40 / 95.75 / 94.10

Moreover, to compare how performance changes as the NOTA rate increases, we follow the evaluation methods of BERT-PAIR [3], and all models are trained and tested under four different NOTA rates: 0%, 15%, 30%, and 50%. The experimental results are shown in Fig.3. GPAM outperforms the compared methods across all NOTA rate settings and maintains better stability. As the NOTA rate increases, the performance of traditional models such as Proto-BERT declines to varying degrees. Among the current mainstream large models, GPT-4o's performance drops dramatically after adding NOTA samples compared to the non-NOTA setting. GLM-4 performs better, but its performance gradually decreases as the NOTA rate increases, dropping by 12.92% relative to the non-NOTA setting when the NOTA rate is 0.5. These compared models exhibit significantly weaker discrimination ability for the NOTA class than for known classes.

Figure 3: Results with increasing NOTA rate from 0 to 50%. (a) 5-way-1-shot; (b) 5-way-5-shot.

Results on FsRE task. We also test performance on the traditional relation extraction task without NOTA, as shown in Table 2. It should be noted in advance that since the FewRel test set on Codalab (https://codalab.org) is no longer open to the public, the validation set is used for all our tests. Our GPAM achieves performance only slightly inferior to the SOTA model, even though it still accounts for unknown-class judgments in this zero-NOTA scenario. This shows that our method focuses on tasks including NOTA but also achieves acceptable results on the conventional FsRE task.

Table 2: Results on FewRel validation / test dataset without NOTA. Note that results of the comparing methods are from papers or Codalab. The optimal and suboptimal values are marked in bold and underline respectively.

Models | 5-way-1-shot (val / test) | 5-way-5-shot (val / test) | 10-way-1-shot (val / test) | 10-way-5-shot (val / test)
Proto-HATT (AAAI2019) | 72.65 / 74.52 | 86.15 / 88.40 | 60.13 / 62.38 | 76.20 / 80.45
MLMAN (ACL2019) | 75.01 / —— | 87.09 / 90.12 | 62.48 / —— | 77.50 / 83.05
Proto-BERT (NIPS2017) | 84.77 / 89.33 | 89.54 / 94.13 | 76.85 / 83.41 | 83.42 / 90.25
BERT-PAIR (EMNLP2019) | 85.66 / 88.32 | 89.48 / 93.22 | 76.84 / 80.63 | 81.76 / 87.02
REGRAB (ICML2020) | 87.95 / 90.30 | 92.54 / 94.25 | 80.26 / 84.09 | 86.72 / 89.93
CTEG (EMNLP2020) | 84.72 / 88.11 | 92.52 / 95.25 | 76.01 / 81.29 | 84.89 / 91.33
HCRP (EMNLP2021) | 90.90 / 93.76 | 93.22 / 95.66 | 84.11 / 89.95 | 87.79 / 92.10
SimpleFSRE (ACL2022) | 91.29 / 94.42 | 94.05 / 96.37 | 86.09 / 90.73 | 89.68 / 93.47
SimpleFSRE+SaCon (AAAI2024) | 98.17 / 97.88 | 97.98 / 98.12 | 96.21 / 96.65 | 96.46 / 96.50
GPT-4o | 87.29 / —— | 91.32 / —— | 79.35 / —— | 78.05 / ——
GLM-4 | 94.29 / —— | 97.39 / —— | 97.76 / —— | 97.02 / ——
GPAM (Ours) | 96.71 / —— | 95.25 / —— | 93.85 / —— | 94.75 / ——

5.3 Ablation Study on Semi-Factual Representation

In order to verify the role of each view and study the impact of each debiased view on the main view, we perform ablation studies on the three debiased views. The results are shown in Table 3. We make the following observations.

1. The debiased information of all views benefits all task settings. Removing any view causes performance to degrade to varying degrees.

2. The semi-factual representation strategy yields a greater improvement on more difficult tasks. Comparing the main-view-only variant with the complete model, we find that when the NOTA rate is 0.5, introducing the debiased views significantly improves performance, by 5.09% and 6.30% respectively.

3. The effect is better when the head and tail views are used together. Deleting both the head and tail views at the same time is better than deleting either one alone. This is because introducing the head or tail view alone may lead to incomplete or unbalanced information, while combining both yields a more reliable representation.

5.4 Ablation Study on GMM-Prototype Metric Learning

To validate the effectiveness of the key strategies in the GMM-prototype metric learning module and to quantify their respective impacts, we construct four variants of GPAM as follows and evaluate their performance on the FewRel dataset shown in Table 3. The analysis of the results is as follows.

Analysis on Gaussian Distance: the variant model that uses the Euclidean distance instead of the Mahalanobis distance as the metric in Eq.(5), that is, it treats the prototype as a sphere. The Mahalanobis distance shows a more significant improvement in the 5-shot scenario, with increases of 7.70% and 6.38%, respectively. This indicates that the Mahalanobis distance based on the Gaussian distribution is more advantageous than the Euclidean distance for prototype construction in scenarios with multiple samples.

Analysis on Multi-Prompt: the variant model that utilizes the same prompt template instead of multi-prompt. We modify Eq.(2) and set the prompt templates $prompt^{j}$ of the four views to be identical. The multi-prompt strategy has a more significant effect when the NOTA rate is higher, with improvements of 8.80% and 7.17% respectively.

Analysis on Mixed Weights: the variant model that uses an averaging strategy to aggregate the four views instead of mixed Gaussian weights. We remove the weight computation formula Eq.(3), set the weights of the four views to be the same, and then use Eq.(4) to obtain the aggregated prototype distance. The mixed weights strategy balances the differences in the importance of different views for different sample types, and the effect is more significant in the multi-shot case.

Analysis on Self-Attention Mechanism: the variant model that removes the self-attention mechanism used in the prototype. We modify Eq.(3) by eliminating the self-attention operation, which results in the prototype relying solely on the raw concatenated features without further refinement. The self-attention mechanism has a smaller effect than the other strategies.

Overall, all four strategies have improved the model to a certain extent, and the effects of Mahalanobis distance and multi-prompt are more significant.

Table 3: Comparison of ablation results for semi-factual representation, GMM-prototype metric learning, and decision boundary learning. “↓” indicates the accuracy drop relative to the complete GPAM.

Settings | 5-way-1-shot 0.15 | 5-way-5-shot 0.15 | 5-way-1-shot 0.5 | 5-way-5-shot 0.5
GPAM | 91.58 | 94.87 | 92.48 | 95.40
Semi-Factual Representation
- Main Only | 89.96 (↓1.60) | 91.17 (↓3.70) | 87.39 (↓5.09) | 89.10 (↓6.30)
- w/o Head Debiased | 90.75 (↓0.83) | 91.04 (↓3.83) | 89.40 (↓3.08) | 91.45 (↓3.95)
- w/o Tail Debiased | 90.21 (↓1.37) | 90.25 (↓4.62) | 88.83 (↓3.65) | 91.15 (↓4.25)
- w/o Context Debiased | 89.99 (↓1.59) | 92.67 (↓2.20) | 91.00 (↓1.48) | 91.08 (↓4.32)
- w/o Head&Tail Debiased | 90.29 (↓1.29) | 91.87 (↓3.00) | 90.50 (↓1.98) | 92.20 (↓3.20)
GMM-Prototype Metric Learning
- w/o Gaussian Distance | 87.04 (↓4.54) | 87.17 (↓7.70) | 88.20 (↓4.28) | 89.02 (↓6.38)
- w/o Multi-Prompt | 85.87 (↓5.71) | 90.54 (↓4.33) | 83.68 (↓8.80) | 88.23 (↓7.17)
- w/o Mixed Weights | 90.16 (↓1.42) | 89.42 (↓5.45) | 86.70 (↓5.78) | 90.83 (↓4.57)
- w/o Self-Attention | 91.25 (↓0.33) | 93.83 (↓1.04) | 90.73 (↓1.75) | 92.65 (↓2.75)
Decision Boundary Learning
- w/o Margin | 89.96 (↓1.62) | 86.43 (↓8.44) | 86.23 (↓6.25) | 87.42 (↓7.98)
- w/o Adaptive Margin | 85.71 (↓5.87) | 91.50 (↓3.37) | 89.77 (↓2.71) | 89.90 (↓5.50)
- w/o PNS | 90.46 (↓1.12) | 91.58 (↓3.29) | 88.94 (↓3.54) | 92.00 (↓3.40)

5.5 Ablation Study on Decision Boundary Learning

As shown in Table 3, we also conduct an ablation study on the decision boundary learning module, analyzing the impact of the presence or absence of its key strategies as follows.

Analysis on Class Margin $M_{c}$. In order to study the impact of the margin's presence and adaptivity on performance, we construct two variants of GPAM: 1) A variant that utilizes only the range indicator $R_{c}$, without the margin $M_{c}$, to classify unknown classes. We achieve this by setting $M_{c}$ to zero and removing this term from the loss function Eq.(9). In the relatively simple 5-way-1-shot 0.15 task the gain from the margin strategy is small, but in the other three, more complex tasks, the presence or absence of the margin has a large impact, with gains of more than 6%. This indicates that measuring the boundary solely with the prototype range is inaccurate, and the margin is crucial. 2) A variant that utilizes a fixed margin instead of the adaptive margin that changes with the negative instances of the prototype. We achieve this by changing $M_{c}$ in the margin computation formula Eq.(7) and the loss function Eq.(9) to a fixed value. Taking the 5-way-1-shot 0.15 task as an example, using a fixed margin reduces performance by 5.87%, which is even more significant than removing the margin entirely. One possible reason is that, when there are fewer shots, using the same $M_{c}$ for different categories leads to inaccurate boundary ranges.

Analysis on Pseudo Negative Sampling (PNS). We construct a variant model with no pseudo negative samples in the training data. The PNS strategy shows a modest improvement of only 1.12% for the 5-way-1-shot task with a NOTA rate of 0.15, while it achieves over 3% improvement on the other three tasks. We analyze the reason for these results and infer that, for the 5-way-1-shot 0.15 task, adding pseudo negative examples expands the boundary range of the prototype. As a result, more NOTA instances are misclassified as known classes, even though the overall performance is improved. In order to obtain the optimal negative sampling rate, we conduct experiments and plot the performance under different negative sampling rates in Figure 4. The performance is optimal in most cases when the negative sampling rate is 0.2, and the effect drops significantly when the value is 0.4 or higher.

Figure 4: The performance variation curves for different negative sampling rates under four task settings.
Table 4: Case study on 5-way-1-shot 0.5 meta tasks. The support set is omitted, and only the query set instances and output results are shown. Each row in the table represents the query, ground-truth, and results of different models in a meta-task.

Query | Ground-Truth | BERT-PAIR | GPAM | GPAM#1 | GPAM#2 | GPAM#3

Known queries:
(1) As surety for the accord, Lambert pledged to marry Gisela, Berengar's daughter. | P25 mother | P25 | P25 | P25 | P25 | P25
(2) Led Zeppelin, used the mobile studio to record material for the albums “Physical Graffiti” and “Houses of the Holy”. | P155 follows | P155 | P155 | P155 | P155 | P155
(3) Minsk Zoo is located in a southeast part of Minsk near Svislach River. | P206 located in body of water | P206 | P206 | P206 | P206 | P206
(4) Julius Peppers held out of team drills, and Chauncey Davis was called to take first team reps at defensive end. | P413 position played on team | P206 | P413 | P413 | P206 | P413
(5) Magas was unofficially proclaimed as the godfather of Serbian organized crime at the time. | P921 main subject | NOTA | P921 | NOTA | P921 | NOTA

Unknown (NOTA) queries:
(6) The Mary Hill Bypass, officially known as Highway 7B, runs adjacent to the Fraser River from the Pitt River Bridge on the east to the Port Mann Bridge on the west. | NOTA (P177) | NOTA | NOTA | P206 | P206 | P25
(7) He was a member of Adolph Rupp's “Fabulous Five” University of Kentucky basketball team, with Alex Groza, Wallace Jones, Cliff Barker, and Kenny Rollins. | NOTA (P641) | NOTA | NOTA | P413 | NOTA | NOTA
(8) HD 37124 c is an extrasolar planet approximately 108 light-years away in the constellation of Taurus. | NOTA (P59) | P206 | NOTA | P206 | P206 | NOTA
(9) Warren Norris (born in St. John's, Newfoundland) is a Canadian professional ice hockey centre who currently plays for EC KAC in the Austrian Hockey League. | NOTA (P641) | NOTA | P413 | NOTA | P413 | NOTA
(10) If mobilise, the NGVR would come under the operational command of the 8th Military District which was in the process of being raised under the command of Major General Basil Morris. | NOTA (P410) | NOTA | NOTA | NOTA | NOTA | NOTA

5.6 Case Study

In order to intuitively demonstrate the effect of GPAM, we conduct a case study and artificially construct a difficult and confusing 5-way-1-shot 0.5 meta-task. For test requirements, we choose support and query set instances from five known classes, and select NOTA instances for the query set from the remaining classes. We choose BERT-PAIR [3] as a baseline model and compare it with the original GPAM and its three variant models: 1) GPAM#1 removes the semi-factual representation module; 2) GPAM#2 uses the Euclidean distance instead of the Mahalanobis distance as the metric; 3) GPAM#3 removes the margin strategy and only uses the range for boundary division. We show some of the instances and the corresponding extraction results of a batch during the validation process in Table 4. Next, we analyze the results of the case study in detail.

Overall, GPAM performs best among the ten cases, correctly judging 90% of them, followed by GPAM#3. For most known-class query instances, such as (1)-(3), all five models obtain correct results. For the query instance (4), the sentence clearly implies that Chauncey Davis is the defensive end, which is a position played on team. But BERT-PAIR and GPAM#2 fail to get the correct answer and misjudge it as the relation located in body of water. For the query instance (5), which is one of the most challenging instances in this batch, Magas, as the godfather, was the mastermind of Serbian organized crime, so the relation is main subject. Only GPAM and GPAM#2 are correct; the other three models assume that the relation between the entities is none of the five and output NOTA. Instances (4) and (5) show that both the semi-factual representation and the Mahalanobis distance metric strategies contribute to solving the few-shot problem for known classes.

For the unknown-class query instances in Table 4, BERT-PAIR, GPAM and the variant GPAM#3 work relatively well in distinguishing NOTA classes, while GPAM#1 and GPAM#2 misclassify a NOTA instance as a known class multiple times. This phenomenon occurs because some NOTA class relations have semantics similar to known class relations, such as P641 (sport team) and P413 (position played on team), and P59 (constellation position) and P206 (located in body of water). We select instance (6), which several variants get wrong, for detailed analysis. The sentence describes that the road stretches from the Pitt River Bridge in the east to the Port Mann Bridge in the west and runs right next to the Fraser River. The correct relation, NOTA (P177), expresses that the bridge Port Mann Bridge spans the Fraser River, rather than located in body of water (P206) or any other known relation. This shows that our complete GPAM is more accurate in identifying NOTA classes.

6 Conclusion

In this paper, we propose a model based on Gaussian prototype and adaptive margin named GPAM to solve the few-shot relation extraction with NOTA task. The three core modules of GPAM are the semi-factual representation, the GMM-prototype metric learning and the decision boundary learning module. Besides, in decision boundary learning, we design a pseudo negative sampling strategy for NOTA scenarios to enhance the classification performance of the model. Experimental results on the FewRel dataset demonstrate that the performance of GPAM is better than that of the compared methods, and ablation experiments confirm the effectiveness of the designed modules and optimization strategies. In the future, we hope to study the ability of LLMs on the FsRE with NOTA task by combining semi-factual representation and adaptive margin.

Acknowledgment

This work was supported by the National Key Research and Development Program of China (2022YFC3303600), the National Natural Science Foundation of China (62137002, 62293553, 62293554, 62176207, and 62192781), the “LENOVO-XJTU” Intelligent Industry Joint Laboratory Project, the Natural Science Basic Research Program of Shaanxi (2023-JC-YB-593), the Youth Innovation Team of Shaanxi Universities, and the XJTU Teaching Reform Research Project “Acquisition Learning Based on Knowledge Forest”.

References

  • [1] D. Ye, Y. Lin, P. Li, M. Sun, Packed levitated marker for entity and relation extraction, in: ACL, Association for Computational Linguistics, 2022, pp. 4904–4917. doi:10.18653/V1/2022.ACL-LONG.337.
  • [2] Y. Tian, Y. Song, F. Xia, Improving relation extraction through syntax-induced pre-training with dependency masking, in: ACL, Association for Computational Linguistics, 2022, pp. 1875–1886. doi:10.18653/V1/2022.FINDINGS-ACL.147.
  • [3] T. Gao, X. Han, H. Zhu, Z. Liu, P. Li, M. Sun, J. Zhou, Fewrel 2.0: Towards more challenging few-shot relation classification, in: EMNLP/IJCNLP, Association for Computational Linguistics, 2019, pp. 6249–6254. doi:10.18653/V1/D19-1649.
  • [4] J. Snell, K. Swersky, R. S. Zemel, Prototypical networks for few-shot learning, in: NIPS, 2017, pp. 4077–4087.
  • [5] Y. Wang, M. Chen, W. Zhou, Y. Cai, Y. Liang, D. Liu, B. Yang, J. Liu, B. Hooi, Should we rely on entity mentions for relation extraction? debiasing relation extraction with counterfactual analysis, in: NAACL, Association for Computational Linguistics, 2022, pp. 3071–3081. doi:10.18653/V1/2022.NAACL-MAIN.224.
  • [6] X. Hu, Z. Hong, C. Zhang, I. King, P. S. Yu, Think rationally about what you see: Continuous rationale extraction for relation extraction, in: SIGIR.
  • [7] Y. Che, Y. An, H. Xue, Boosting few-shot open-set recognition with multi-relation margin loss, in: IJCAI, ijcai.org, 2023, pp. 3505–3513. doi:10.24963/IJCAI.2023/390.
  • [8] F. Liu, H. Lin, X. Han, B. Cao, L. Sun, Pre-training to match for unified low-shot relation extraction, in: ACL, Association for Computational Linguistics, 2022, pp. 5785–5795. doi:10.18653/V1/2022.ACL-LONG.397.
  • [9] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, B. Xu, Attention-based bidirectional long short-term memory networks for relation classification, in: ACL, The Association for Computer Linguistics, 2016. doi:10.18653/V1/P16-2034.
  • [10] H. Ye, W. Chao, Z. Luo, Z. Li, Jointly extracting relations with class ties via effective deep ranking, in: ACL, Association for Computational Linguistics, 2017, pp. 1810–1820. doi:10.18653/V1/P17-1166.
  • [11] L. B. Soares, N. FitzGerald, J. Ling, T. Kwiatkowski, Matching the blanks: Distributional similarity for relation learning, in: ACL, Association for Computational Linguistics, 2019, pp. 2895–2905. doi:10.18653/V1/P19-1279.
  • [12] X. Han, P. Yu, Z. Liu, M. Sun, P. Li, Hierarchical relation extraction with coarse-to-fine grained attention, in: EMNLP, Association for Computational Linguistics, 2018, pp. 2236–2245. doi:10.18653/V1/D18-1247.
  • [13] K. Zhou, Q. Qiao, Y. Li, Q. Li, Improving distantly supervised relation extraction by natural language inference, in: AAAI, AAAI Press, 2023, pp. 14047–14055. doi:10.1609/AAAI.V37I11.26644.
  • [14] T. Liang, Y. Liu, X. Liu, H. Zhang, G. Sharma, M. Guo, Distantly-supervised long-tailed relation extraction using constraint graphs, IEEE Trans. Knowl. Data Eng. 35 (7) (2023) 6852–6865. doi:10.1109/TKDE.2022.3177226.
  • [15] X. Han, H. Zhu, P. Yu, Z. Wang, Y. Yao, Z. Liu, M. Sun, Fewrel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation, in: EMNLP, Association for Computational Linguistics, 2018, pp. 4803–4809. doi:10.18653/V1/D18-1514.
  • [16] Y. Liu, J. Hu, X. Wan, T. Chang, A simple yet effective relation information guided approach for few-shot relation extraction, in: ACL (Findings), Association for Computational Linguistics, 2022, pp. 757–763. doi:10.18653/V1/2022.FINDINGS-ACL.62.
  • [17] Y. Xiao, Y. Jin, K. Hao, Adaptive prototypical networks with label words and joint representation learning for few-shot relation classification, IEEE Trans. Neural Networks Learn. Syst. 34 (3) (2023) 1406–1417. doi:10.1109/TNNLS.2021.3105377.
  • [18] P. Zhang, W. Lu, Better few-shot relation extraction with label prompt dropout, in: EMNLP, Association for Computational Linguistics, 2022, pp. 6996–7006. doi:10.18653/V1/2022.EMNLP-MAIN.471.
  • [19] W. Li, T. Qian, Graph-based model generation for few-shot relation extraction, in: EMNLP, Association for Computational Linguistics, 2022, pp. 62–71. doi:10.18653/V1/2022.EMNLP-MAIN.5.
  • [20] Y. Zhang, W. Huang, D. Dang, A lightweight approach based on prompt for few-shot relation extraction, Comput. Speech Lang. 84 (2024) 101580. doi:10.1016/J.CSL.2023.101580.
  • [21] K. Zhang, B. J. Gutierrez, Y. Su, Aligning instruction tasks unlocks large language models as zero-shot relation extractors, in: ACL (Findings), Association for Computational Linguistics, 2023, pp. 794–812. doi:10.18653/V1/2023.FINDINGS-ACL.50.
  • [22] N. Song, C. Zhang, G. Lin, Few-shot open-set recognition using background as unknowns, in: ACM Multimedia, ACM, 2022, pp. 5970–5979. doi:10.1145/3503161.3547933.
  • [23] J. Wang, L. Zhang, J. Liu, T. Guo, W. Wu, Learning from semi-factuals: A debiased and semantic-aware framework for generalized relation discovery, arXiv preprint arXiv:2401.06327 (2024).
  • [24] G. A. Miller, Wordnet: A lexical database for english, Commun. ACM 38 (11) (1995) 39–41. doi:10.1145/219717.219748.
  • [25] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, Association for Computational Linguistics, 2019, pp. 4171–4186. doi:10.18653/V1/N19-1423.
  • [26] X. Wang, X. Han, W. Huang, D. Dong, M. R. Scott, Multi-similarity loss with general pair weighting for deep metric learning, in: CVPR, Computer Vision Foundation / IEEE, 2019, pp. 5022–5030. doi:10.1109/CVPR.2019.00516.
  • [27] T. Gao, X. Han, Z. Liu, M. Sun, Hybrid attention-based prototypical networks for noisy few-shot relation classification, in: AAAI, AAAI Press, 2019, pp. 6407–6414. doi:10.1609/AAAI.V33I01.33016407.
  • [28] Z. Ye, Z. Ling, Multi-level matching and aggregation network for few-shot relation classification, in: ACL, Association for Computational Linguistics, 2019, pp. 2872–2881. doi:10.18653/V1/P19-1277.
  • [29] M. Qu, T. Gao, L. A. C. Xhonneux, J. Tang, Few-shot relation extraction via bayesian meta-learning on relation graphs, in: ICML, Vol. 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 7867–7876.
  • [30] Y. Wang, K. Verspoor, T. Baldwin, Learning from unlabelled data for clinical semantic textual similarity, in: ClinicalNLP@EMNLP, Association for Computational Linguistics, 2020, pp. 227–233. doi:10.18653/V1/2020.CLINICALNLP-1.25.
  • [31] J. Han, B. Cheng, W. Lu, Exploring task difficulty for few-shot relation extraction, in: EMNLP, Association for Computational Linguistics, 2021, pp. 2605–2616. doi:10.18653/V1/2021.EMNLP-MAIN.204.
  • [32] D. Luo, Y. Gan, R. Hou, R. Lin, Q. Liu, Y. Cai, W. Gao, Synergistic anchored contrastive pre-training for few-shot relation extraction, in: M. J. Wooldridge, J. G. Dy, S. Natarajan (Eds.), AAAI, AAAI Press, 2024, pp. 18742–18750. doi:10.1609/AAAI.V38I17.29838.
  • [33] E. M. Kenny, M. T. Keane, On generating plausible counterfactual and semi-factual explanations for deep learning, in: AAAI, AAAI Press, 2021, pp. 11575–11585. doi:10.1609/AAAI.V35I13.17377.
  • [34] B. Liu, H. Kang, H. Li, G. Hua, N. Vasconcelos, Few-shot open-set recognition using meta-learning, in: CVPR, Computer Vision Foundation / IEEE, 2020, pp. 8795–8804. doi:10.1109/CVPR42600.2020.00882.
  • [35] L. Wu, F. Petroni, M. Josifoski, S. Riedel, L. Zettlemoyer, Scalable zero-shot entity linking with dense entity retrieval, in: EMNLP, Association for Computational Linguistics, 2020, pp. 6397–6407. doi:10.18653/V1/2020.EMNLP-MAIN.519.

Tianlin Guo received the B.S. degree from Xi’an Jiaotong University, Xi’an, Shaanxi, China, in 2023, where he is currently pursuing the M.S. degree with the School of Computer Science and Technology. His research interests include information extraction and few-shot learning.


Lingling Zhang is currently an associate professor in computer science at Xi’an Jiaotong University. She received the PhD degree in Computing Science from Xi’an Jiaotong University in 2020. She was a visiting student with the School of Computer Science, Carnegie Mellon University, working with Prof. A. Hauptmann. Her research interests include cross-media information mining, computer vision, zero-shot learning, and few-shot learning.


Jiaxin Wang received the B.S. and M.S. degrees in communication and information systems from Northwestern Polytechnical University, Shaanxi, China, in 2017 and 2020, respectively. She is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Xi’an Jiaotong University. Her research interests include natural language processing, large language models, and few-shot learning.


Yunkuo Lei received the B.S. degree from Xi’an Jiaotong University, Xi’an, Shaanxi, China, in 2024, where he is currently pursuing the M.S. degree with the School of Computer Science and Technology. His research interests include information extraction and spiking neural networks.


Yifei Li is currently working toward the Ph.D. degree in computer science at Xi’an Jiaotong University. He obtained his bachelor’s degree from Xi’an Jiaotong University in 2022. His research interest is knowledge graphs and information extraction.


Haofen Wang is a professor with the College of Design and Innovation, Tongji University. He has taken charge of several national AI projects and published more than 100 related papers in top-tier conferences and journals. He has also served as deputy director or chair for several NGOs such as CCF, CIPS and SCS.


Jun Liu received the B.S., M.S., and Ph.D. degrees in computer science in 1995, 1998, and 2004 from Xi’an Jiaotong University, China. He is currently a professor in the School of Electronic and Information Engineering at Xi’an Jiaotong University. His research interests include text mining, data mining, intelligent network learning environments, and multimedia e-learning.