
Causal Interventions-based Few-Shot Named Entity Recognition

Zhen Yang
University of South China
[email protected]
&Yongbin Liu
University of South China
[email protected]
&Chunping Ouyang
University of South China
[email protected]
Abstract

Few-shot named entity recognition (NER) systems aim to recognize new classes of entities based on a few labeled samples. A significant challenge in the few-shot regime is that models are more prone to overfitting than in tasks with abundant samples. This heavy overfitting is mainly driven by spurious correlations caused by the selection bias of the few samples. To alleviate spurious correlations in few-shot NER, in this paper we propose a causal intervention-based few-shot NER method. Built on the prototypical network, the method intervenes in the context and the prototype via backdoor adjustment during training. Intervening in the context is particularly difficult in the one-shot scenario, so we instead intervene in the prototype via incremental learning, which also avoids catastrophic forgetting. Our experiments on different benchmarks show that our approach achieves new state-of-the-art results (achieving up to 29% absolute improvement and 12% on average for all tasks).

1 Introduction

As a fundamental task in information extraction, named entity recognition (NER) aims at locating and classifying named entities in unstructured text. Many methods Chiu and Nichols (2016); Ma and Hovy (2016); Lample et al. (2016); Peters et al. (2017) have achieved strong results on named entity recognition.

In practical applications, due to the difficulty of label collection and the high cost of manual labeling, few-shot named entity recognition was proposed and has received wide attention, with many studies appearing in recent years. Current approaches are mainly based on metric learning, in particular on prototypical networks. The prototypical network method calculates a prototype representation Snell et al. (2017) for each class and assigns a label to each query sample according to the distance between the sample and the prototype of each class Fritzler et al. (2019); Yang and Katiyar (2020); Hou et al. (2020). In addition, span-based methods Wang et al. (2021a); Yu et al. (2021); Ma et al. (2022) have emerged, helping few-shot NER models better identify entity boundaries.

Figure 1: (a) An example of spurious correlation: pigeons are easily associated with square, but not all animals are associated with square. (b) Causal graph of the example: contexts T (such as the square), entities E (such as pigeons), class label Y (such as the animal type), and C, the confounding factors brought by the samples selection bias in the few-shot task. (c) Causal graph after the do-operation.

However, these methods ignore the overfitting caused by spurious correlations in few-shot tasks. The spurious-correlation issue is less severe in tasks with abundant samples, but because of the few-samples selection bias it must be addressed in few-sample tasks. As Figure 1(a) shows, in an example of spurious correlation, the pigeon belongs to the class label animal. In the few-shot NER task, animal is easily associated with square, but square is not positively related to all animal entities. This is a spurious correlation established by the few samples.

All these problems can be attributed to the model overfitting to confounding factors in very few samples. Essentially, the few-samples selection bias is a confounder that misleads the few-shot NER model into learning spurious correlations between contexts and labels, e.g., between the context square and the ground-truth label animal in Figure 1(a). More specifically, the confounder tends to create a positive association between contexts and entities, formalized as $P(E|T)$: e.g., when the context square is encountered, it increases the likelihood that the entity is a pigeon. This confounder-driven $P(E|T)$ misleads the model into associating the non-causal but positively correlated context T with the class label Y, i.e., $P(Y|E,T)$: e.g., the context square is wrongly regarded as a stable and intrinsic feature of the class label animal, while in fact it is a non-causal feature of the class animal.

Existing metric-based methods do not eliminate the potential confounding factors in the few samples, especially the few-samples selection bias. To deal with these problems, we revisit few-shot named entity recognition from a causal viewpoint. Figure 1(b) shows the corresponding causal graph. We formulate the causalities among the entity E, the relevant context T, the class label Y, and the confounding factor C. Each directed link denotes a causal relation between two nodes.

In Figure 1(b), there exists a backdoor path T ← C → Y, which means that previous methods may mistakenly learn the spurious correlation it carries. In this paper, we propose a causal intervention-based model. We use a context-based intervention to block the spurious correlation between contexts and class labels. From the causal-inference view, as Figure 1(c) shows, we can use $P(E|do(T))$ instead of $P(E|T)$ to avoid the effect of the confounding factors. The do-operation $do(T)$ pursues the causality between the contexts T and the class labels Y without the confounder Pearl (2009). $P(Y|E,do(T))$ blocks the backdoor path and eliminates the spurious correlation introduced by the confounding factor. We could intervene on T to compute $P(Y|E,do(T))$ via a randomized controlled trial Chalmers et al. (1981), but such sample control over contexts is difficult to obtain in the one-shot task. Inspired by incremental learning Rebuffi et al. (2017); Thrun (1995); French (1999); de Masson D'Autume et al. (2019), we use previous knowledge as the previous prototype and intervene by combining the previous and current prototypes into the final prototype for the one-shot task. This also avoids catastrophic forgetting Thrun (1995). Our contributions can be summarized as follows:

  • We propose a model from a causal perspective, intervening on the contexts via the do-operation $do(T)$ and on the prototypes. This helps the model avoid overfitting to the current data. Our method eliminates the spurious correlations brought by the few-samples selection bias and enhances the model's generalization.

  • To better distinguish predefined entity classes from the other (O) class, we perform a preliminary identification of entities through span detection, which helps the model identify entity boundaries better and aids the final type identification. Unlike methods that only consider within-class features, our span detection considers features shared across classes.

  • We evaluated our method on Few-NERD and achieved state-of-the-art results. Comprehensive experiments show that our model generalizes more robustly. Our approach achieves up to 29% absolute improvement and 12% on average for all tasks.

2 Related Work

2.1 Few-Shot Learning and Meta-Learning

Recently, few-shot learning has been widely used in natural language processing Chen et al. (2019); Gao et al. (2020); Brown et al. (2020); Schick and Schütze (2020). The central problem of few-shot learning is overfitting, so researchers usually introduce source-domain data Han et al. (2018); Geng et al. (2019); Wang et al. (2021b). Few-shot learning also commonly relies on meta-learning, which was first widely used in computer vision. With the proposal of prototypical networks, metric-based methods Kulis et al. (2013); Vinyals et al. (2016); Snell et al. (2017) were widely adopted. These methods first encode the support set and obtain a prototype representation of each class from all vectors of the same class. To predict a query, they calculate the distance between each prototype and the query under some metric (generally Euclidean distance), and classify the query according to the nearest prototype. Prototypical networks have performed well on many natural language processing tasks.
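To make the metric concrete, here is a minimal sketch of prototypical-network classification (illustrative PyTorch with assumed tensor shapes, not the implementation of any cited paper):

```python
import torch

def build_prototypes(support_emb: torch.Tensor, support_labels: torch.Tensor,
                     n_classes: int) -> torch.Tensor:
    # support_emb: [n_support, dim]; support_labels: [n_support]
    # Prototype of a class = mean of its support embeddings.
    return torch.stack([support_emb[support_labels == c].mean(dim=0)
                        for c in range(n_classes)])

def classify(query_emb: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    # query_emb: [n_query, dim]; protos: [n_classes, dim]
    dists = torch.cdist(query_emb, protos)   # Euclidean distances
    return dists.argmin(dim=1)               # nearest prototype wins
```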

2.2 Few-shot Named Entity Recognition

Few-shot named entity recognition aims to use a few support samples to enable the model to recognize entity classes unseen during training. Previous work includes many token-level approaches Fritzler et al. (2019); Yang and Katiyar (2020); Hou et al. (2020). Snell et al. (2017) used a prototype network for few-shot named entity recognition. Later, inspired by feature extraction and nearest neighbors, Yang and Katiyar (2020) proposed NNShot and StructShot: NNShot uses nearest neighbors to classify entities, and StructShot builds on it by adding Viterbi decoding. Ding et al. (2021) presented Few-NERD, a large-scale human-annotated few-shot NER dataset, and evaluated ProtoBERT, NNShot and StructShot on it. In addition, prompt-based technologies Cui et al. (2021) have also appeared in this area, and Tong et al. (2021) proposes multiple prototypes for reasoning. However, these methods only learn the semantic features and intermediate representations of classes from the source domain, and their generalization to the target domain is low. Das et al. (2021) therefore proposes Gaussian embeddings and contrastive learning to improve the accuracy of few-shot NER. These works mark and assign labels to entities but ignore the integrity and boundaries of entities. Therefore, span-level approaches Wang et al. (2021a); Yu et al. (2021); Athiwaratkun et al. (2020); Wang et al. (2021c) have been proposed. ESD Wang et al. (2021a) uses span representation and span matching to enhance the completeness of entity recognition. Building on entity spans, Ma et al. (2022) argues that previous approaches contain O-class noise and lack parameter updates during transfer, and proposes to fine-tune the parameters on support instances and to locate only entities during span detection.

However, all these methods consider only the current support and query instances, causing the model to forget previous data. They also ignore the impact of context on entities, which can lead to overfitting between contexts and entities. Moreover, for support and query instances the model only performs a simple metric calculation without considering them jointly. We believe that different support instances contribute differently to a query and should not be weighted equally.

2.3 Causal Inference

The purpose of causal inference Pearl (2009) is to remove the confounders between variables and obtain their causal effects, based on which accurate predictions can be made. Considering the fitting problem and the causal effects under few samples, we aim to use causal inference to improve the robustness and transferability of the model. Many works have used causal theory to improve the robustness of models Zeng et al. (2020); Tang et al. (2020b); Wang et al. (2020); Qi et al. (2020). Also, many studies use front-door or backdoor adjustment to remove spurious correlations from confounders Yue et al. (2020); Tang et al. (2020a); Zhang et al. (2020, 2021); Liu et al. (2021) and find the causal effects of variables.

Figure 2: (a) The causal structure, where T is the context, E is the entity representation, and Y is the predicted label. There also exists C, the confounders, such as the samples selection bias. (b) Our context-based intervention, which blocks the spurious correlation between context and predicted label: we replace entities of the same type to get a new sample, and feed it into the prototype network to get the predicted label.

3 Task Formulation

NER is usually formulated as a sequence labeling problem. For a sequence $\{x_1, \ldots, x_n\}$, NER aims to assign a label to each $x_i$. The labels indicate whether $x_i$ belongs to a named entity class (such as person or location) or to no entity (the O class).

Few-shot NER uses a few data to identify classes unseen during training by adopting the N-way K-shot setting, iteratively constructing episodes. During training, each episode includes N classes with K samples per class, giving the support set $\mathcal{S}_{train} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N \times K}$. For the query set, Q samples of the same N classes are drawn, $\mathcal{Q}_{train} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N \times Q}$, where $\mathcal{S}_{train} \cap \mathcal{Q}_{train} = \varnothing$.

The model is trained by predicting $\mathcal{Q}_{train}$ from $\mathcal{S}_{train}$. For testing, we use a few-shot $\mathcal{S}_{test}$ and make predictions for $\mathcal{Q}_{test}$; similarly, $\mathcal{S}_{test} \cap \mathcal{Q}_{test} = \varnothing$. Note that the entity classes in the test set do not appear in the training set, that is, $\mathcal{Y}_{train} \cap \mathcal{Y}_{test} = \varnothing$.
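For clarity, below is a minimal sketch of episode construction under the generic N-way K-shot setting (Few-NERD's actual K~2K sentence-level sampling is more involved; the `data` structure here is a hypothetical map from class label to examples):

```python
import random

def sample_episode(data, n_way: int, k_shot: int, q_query: int):
    """Build one episode: N classes, K support and Q query samples per class."""
    classes = random.sample(list(data.keys()), n_way)
    support, query = [], []
    for c in classes:
        drawn = random.sample(data[c], k_shot + q_query)
        support += [(x, c) for x in drawn[:k_shot]]   # S_train
        query   += [(x, c) for x in drawn[k_shot:]]   # Q_train
    # Support and query are disjoint by construction: S ∩ Q = ∅.
    return support, query
```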

4 Methods

In this part, we present our method for few-shot named entity recognition. The method is composed of four main parts: entity span detection, context-based intervention, prototype-based intervention, and sample reweighting.

4.1 Causal Intervention

In the few-shot named entity recognition task, severe data selection bias causes spurious correlations, which in turn mislead the model into overfitting. We use a causal intervention-based method to resolve the spurious correlation. Figure 2 illustrates the problem in the task: T represents the context, E represents the entity, Y is the predicted label, and C is the confounders, such as the few-samples selection bias.

T → E → Y: The entity representation is learned from the contexts, and the model uses the entity representation to obtain the predicted label.

T ← C → Y: A backdoor path that leads to a spurious correlation in the model. C represents all possible confounders in the task, such as the few-samples selection bias. These biases make the context attend excessively to the current data and mislead label prediction. This backdoor path causes the spurious correlation between the context and the predicted class label, which further leads to overfitting between context and predicted label.

Then, considering that in the 1-shot setting no additional entities are available to intervene with, we analyze the model itself. Figure 3 illustrates the problem in the model. In the two figures, T represents the context, E the entity, P the entity prototype, Y the predicted label, and C the confounders.

P → Y: The model calculates the Euclidean distance to the entity prototype and obtains the predicted label.

P ← C → Y: Confounders such as the samples selection bias mislead the calculation of the entity prototype, resulting in a spurious correlation between prototype and predicted label. This spurious correlation leads to overfitting.

We expect to block the backdoor paths T ← C → Y and P ← C → Y, so we intervene on T and P. Since the confounders C are hard to observe directly, we use the front-door adjustment for the calculation, shown in Eq. 1, and obtain the final Eq. 2 (specific derivation details are in the Appendix). Here, in $\sum_{E}P(E=e|T=t)$, E is the entity representation; if we treat E as a binary classification between entity and non-entity, this term corresponds to entity detection. In $\sum_{t^{\prime}}P(Y=y|E=e,T=t^{\prime})$, $T=t^{\prime}$ means we must iterate over each context T to get the predicted label Y given E=e; we therefore propose a context-based intervention that performs entity replacement. In 1-shot there is no additional entity to intervene with, so we intervene on the prototype instead, considering both previous and current prototypes. Finally, $P(T=t^{\prime})$ denotes the probability of selecting the context $t^{\prime}$; since different contexts may deviate from the original distribution of T, we calculate weights when computing the prototype. In summary, we divide the model into four parts: entity detection, context-based causal intervention, prototype-based causal intervention, and sample reweighting.

$P(Y=y|do(T=t))=\sum_{E}P(E=e|do(T=t))\,P(Y=y|do(T=t),E=e)$  (1)

$P(Y=y|do(T=t))=\sum_{E}P(E=e|T=t)\sum_{t^{\prime}}P(Y=y|E=e,T=t^{\prime})\,P(T=t^{\prime})$  (2)
Figure 3: (a) Prototype-based intervention, where P is the entity prototype, Y is the predicted label, and C is the confounders. (b) We intervene on the prototype to block the spurious correlation between the prototype and the predicted label. During the current episode, the previous prototype is introduced as knowledge, and the two prototypes are combined into the current prototype. In the next episode, this combined prototype is introduced as the previous prototype.

4.2 Entity Detection

Corresponding to the term $\sum_{E}P(E=e|T=t)$, we propose entity detection. The traditional prototypical network averages all encoded vectors of the same class as the prototype representation of that class; it considers only the features shared within each class, not the commonality across classes. Span detection instead aims to locate all named entities in the input sequence without differentiating specific classes, exploiting the possible similarity between entities of all classes. We feed all entities as a whole into the prototype network, and likewise all non-entities, to determine all the entities in a sequence. In this way, we expect to discover features common to all entities and use them as a simple initial filter over the input. Note that when integrating the predicted labels, the dimensions must match: we therefore expand the binary entity judgment to the same dimension as the class logits.
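A minimal sketch of this binary entity detector and the dimension-matching step, under our reading of the description (illustrative PyTorch; tensor shapes are assumptions):

```python
import torch

def entity_detection_scores(token_emb, support_emb, is_entity):
    # token_emb: [n_tokens, dim]; support_emb: [n_support, dim]
    # is_entity: [n_support] bool mask over support tokens
    proto_entity = support_emb[is_entity].mean(dim=0)    # all entities as one class
    proto_other  = support_emb[~is_entity].mean(dim=0)   # all non-entities as another
    protos = torch.stack([proto_other, proto_entity])    # [2, dim]
    return -torch.cdist(token_emb, protos)               # [n_tokens, 2], higher = closer

def expand_to_classes(binary_scores, n_classes):
    # Expand the entity judgment to the class dimension so it can be
    # combined with the per-class logits (the dimension-matching step).
    return binary_scores[:, 1:2].expand(-1, n_classes)   # [n_tokens, n_classes]
```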

4.3 Context-based Causal Intervention

The few-shot task has limited samples in each episode, so the model can easily overfit the current samples. Such overfitting is caused by the spurious correlation induced by the samples selection bias. Within a sample, the entities are embedded in contexts; when samples are limited, the representation of entities easily overfits the current context, making it difficult for the model to generalize to new domains.

Figure 2 shows our method. In Fig. 2(a), we define the few-shot named entity recognition task. In this task, the confounders, such as the few-samples selection bias, create a backdoor path T ← C → Y between context and predicted label. This backdoor path causes the spurious correlation between context and predicted label, misleading the model into overfitting context to label. We want to block this backdoor path; therefore, we intervene on T to block T ← C → Y. We refer to this method as the context-based causal intervention.

The method follows Eq. 2. In $\sum_{t^{\prime}}P(Y=y|E=e,T=t^{\prime})$, $T=t^{\prime}$ means we must iterate over each context T to get the final Y given E=e; in other words, for each entity we have to traverse all contexts of the same type. As shown in Fig. 2(b), the entities in each sentence are replaced in turn with other entities of the same type, realizing this traversal. We use this process only in training; in testing we use the original data.
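A minimal sketch of this training-time entity replacement (the data structures are hypothetical: spans as (start, end, type) tuples, and `entity_pool` mapping each type to tokenized mentions of that type):

```python
import random

def replace_entities(tokens, entity_spans, entity_pool):
    """Replace each entity with another entity of the same type (training only)."""
    out, cursor = [], 0
    for start, end, etype in sorted(entity_spans):
        out += tokens[cursor:start]
        # Candidate mentions of the same type, excluding the original one.
        candidates = [m for m in entity_pool[etype] if m != tokens[start:end]]
        out += random.choice(candidates) if candidates else tokens[start:end]
        cursor = end
    out += tokens[cursor:]
    return out
```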

4.4 Prototype-based Causal Intervention

When performing the 1-shot experiment, the context-based causal intervention is not possible, since there is no additional suitable entity to intervene with. Here we consider another way, the prototype-based intervention, illustrated in Figure 3.

Previous methods average the current episode's support samples as the class prototype representation, and use the distance between query and prototype to recognize entities. However, as Figure 3(a) shows, confounders such as the few-samples selection bias exist. They create a backdoor path between the prototype and the label, causing a spurious correlation that results in overfitting in the calculation of entity prototypes. We therefore refine the model from a causal perspective: our method blocks the path P ← C → Y by intervening on P, which we call prototype-based causal intervention. Specifically, as shown in Fig. 3(b), we save the prior prototype as knowledge; in the current episode, the previous knowledge is combined with the current class representation to form the current class representation, and the same calculation is repeated in the next episode.

Through the prototype-based intervention, we block the spurious correlation between the prototype and the predicted label. This method considers both current and previous data, and prevents the model from overfitting in the prototype calculation.
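A minimal sketch of the prototype intervention as we read it (the mixing weight is an assumed hyperparameter, not a value given in the paper):

```python
import torch

class PrototypeMemory:
    """Carry the previous episode's prototypes into the current one."""
    def __init__(self, momentum: float = 0.5):
        self.momentum = momentum
        self.prev = None                      # {class_name: tensor[dim]}

    def intervene(self, current: dict) -> dict:
        # Combine previous knowledge with the current class representation.
        if self.prev is not None:
            for c in current:
                if c in self.prev:
                    current[c] = (self.momentum * self.prev[c]
                                  + (1 - self.momentum) * current[c])
        # The combined prototypes become "previous" for the next episode.
        self.prev = {c: p.detach() for c, p in current.items()}
        return current
```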

4.5 Sample Reweighting

Corresponding to the term $P(T=t^{\prime})$, we propose sample reweighting. Typical meta-learning sets the weight of every sample to 1 in the prototype calculation; we argue that each sample contributes differently to the prototype.

Specifically, for each support sample we calculate its distance to the query, then transform the distances into per-sample weights with a softmax, as shown in Eq. 3. The weights are then applied to each sample when computing the entity prototype. Meanwhile, we add the Maximum Mean Discrepancy (MMD) to the loss: taking the training data as the source domain and the test support data as the target domain, MMD measures the difference between the two domains, and minimizing it reduces their distribution gap. The final loss combines the MMD term with the classification loss, as shown in Eq. 4.

$\alpha_{i}=\mathrm{softmax}\left(h_{\theta}(x_{q})-h_{\theta}(x_{s_{i}})\right)$  (3)

$L(\theta)=\frac{1}{N}\sum_{i=1}^{N}\mathrm{CrossEntropy}\left(y_{i},h_{\theta}(x_{i})\right)+\sup_{\|f\|_{H}\leq 1}\left(E_{p}[f(s)]-E_{q}[f(t)]\right)$  (4)

where $h$ is our network, $y$ is the true label, $p$ denotes the distribution of the source domain $s$, and $q$ denotes the distribution of the target domain $t$.
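A minimal sketch of Eqs. 3-4 (one possible reading: support samples are scored by negative Euclidean distance to the mean query embedding, and the MMD term uses a simple biased RBF-kernel estimate; the kernel and bandwidth are our assumptions):

```python
import torch
import torch.nn.functional as F

def support_weights(query_emb, support_emb):
    # Eq. (3): closer support samples receive larger weights.
    d = torch.cdist(query_emb.mean(dim=0, keepdim=True), support_emb)  # [1, n_support]
    return F.softmax(-d.squeeze(0), dim=0)                             # [n_support]

def mmd_rbf(source, target, sigma: float = 1.0):
    # MMD term of Eq. (4), estimated with an RBF kernel (biased estimate).
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (k(source, source).mean() + k(target, target).mean()
            - 2 * k(source, target).mean())
```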

5 Experiments

Few-NERD (INTRA)
Model | 5-way 1~2-shot (P R F1) | 5-way 5~10-shot (P R F1) | 10-way 1~2-shot (P R F1) | 10-way 5~10-shot (P R F1)
ProtoBERT | 16.35±0.63 28.35±2.41 20.71±1.16 | 31.43±1.14 45.28±0.71 37.08±1.01 | 12.05±1.09 21.27±1.35 15.32±0.68 | 23.15±0.42 35.83±0.97 28.02±0.56
NNShot | 20.47±0.40 23.05±1.12 21.58±0.70 | 23.88±0.79 28.35±0.88 25.66±0.78 | 14.83±0.56 16.90±0.68 15.72±0.53 | 18.18±1.20 22.45±1.03 19.82±1.11
StructShot | 31.40±1.34 19.63±2.61 23.95±2.39 | 45.20±1.08 22.80±0.99 29.68±1.11 | 23.15±0.77 8.61±0.69 12.31±0.72 | 40.40±2.46 11.35±1.32 17.10±1.75
CONTAINER | 38.46±0.55 41.66±0.23 40.00±0.71 | 47.50±0.61 63.33±0.24 54.28±0.64 | 38.33±0.45 33.33±0.27 35.89±0.16 | 43.95±0.22 54.05±0.73 48.48±0.69
ESD | 42.94±4.47 32.69±1.55 37.12±0.89 | 59.55±0.89 43.83±3.08 50.50±1.79 | 37.58±1.53 26.76±1.96 31.26±0.53 | 34.89±2.75 32.00±3.42 33.38±4.71
DML | 47.30±1.94 46.64±1.27 46.97±2.36 | 57.70±2.48 61.20±1.33 59.40±2.48 | 40.36±0.67 42.30±0.73 41.30±0.89 | 51.69±2.48 49.77±1.64 50.71±4.24
Ours | 52.44±1.01 55.67±1.29 54.00±1.14 | 85.06±1.31 88.92±1.09 86.94±1.21 | 42.23±0.52 43.81±0.73 43.58±1.20 | 75.92±1.16 82.34±0.73 79.00±0.96
Table 1: Performance of state-of-the-art models on Few-NERD (INTRA)
Few-NERD (INTER)
Model | 5-way 1~2-shot (P R F1) | 5-way 5~10-shot (P R F1) | 10-way 1~2-shot (P R F1) | 10-way 5~10-shot (P R F1)
ProtoBERT | 31.45±0.74 46.44±3.40 37.49±1.63 | 46.88±0.27 59.54±1.10 52.42±0.60 | 22.17±0.92 34.72±0.52 26.98±0.79 | 50.87±1.01 63.30±0.66 56.29±0.79
NNShot | 38.32±2.24 42.82±2.34 40.31±2.30 | 39.40±1.42 43.34±7.32 42.66±1.07 | 29.52±1.15 34.06±2.27 31.54±1.63 | 33.74±0.44 41.82±0.52 37.09±0.13
StructShot | 49.45±0.60 32.44±7.77 38.78±5.70 | 42.62±6.46 32.47±5.37 35.95±1.09 | 32.54±1.42 17.54±0.72 22.61±0.95 | 41.82±0.50 44.52±0.74 42.75±0.62
CONTAINER | 46.15±2.47 60.00±1.98 52.17±2.74 | 56.41±0.87 66.66±0.53 61.11±0.76 | 43.75±1.16 58.33±1.27 50.00±1.46 | 57.57±0.46 62.29±0.75 59.84±0.36
ESD | 61.39±2.83 54.17±2.07 57.56±2.52 | 76.70±3.63 64.24±1.77 69.92±0.56 | 58.37±5.01 48.35±1.94 52.89±1.11 | 68.72±0.14 65.18±0.35 66.90±0.60
DML | 62.76±2.47 62.76±2.08 62.76±2.61 | 69.87±0.45 71.96±1.43 70.90±0.28 | 57.54±1.87 61.74±2.04 59.57±2.88 | 63.15±0.33 70.81±1.01 66.76±0.54
Ours | 65.96±0.36 73.23±2.29 69.41±1.24 | 78.74±0.38 85.29±0.28 81.89±0.33 | 59.26±0.90 62.14±0.52 60.67±0.73 | 82.13±0.15 83.58±0.24 82.13±0.10
Table 2: Performance of state-of-the-art models on Few-NERD (INTER)

5.1 Dataset

Few-NERD Ding et al. (2021) includes an annotation scheme with 8 coarse-grained entity types and 66 fine-grained entity types. Based on this, two tasks are designed: i) Few-NERD-INTRA, where all entities in the training set (source domain) and the validation and test sets (target domain) belong to different coarse-grained types; ii) Few-NERD-INTER, where the training, validation and test sets can share coarse-grained types, but the fine-grained entity types are disjoint. Few-NERD uses N-way K~2K-shot sampling, and both INTRA and INTER have four settings: 5-way 1~2-shot, 5-way 5~10-shot, 10-way 1~2-shot, and 10-way 5~10-shot.

5.2 Parameter Settings

Following previous methods Ding et al. (2021), we used the BERT-base-uncased model Devlin et al. (2018), a maximum sequence length of 32, a dropout ratio of 0.1, and the AdamW optimizer Loshchilov and Hutter (2017). For more details of the parameter settings, please refer to the Appendix.

5.3 Evaluation metrics

For the evaluation on Few-NERD, we followed Ding et al. (2021) and calculated precision (P), recall (R) and micro F1-score (F1).
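For completeness, the micro-averaged metrics are computed from entity-span counts pooled over all test episodes (a standard calculation, sketched below):

```python
def micro_prf1(tp: int, fp: int, fn: int):
    """Micro precision/recall/F1 from pooled true-positive/false-positive/false-negative counts."""
    p  = tp / (tp + fp) if tp + fp else 0.0
    r  = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```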

5.4 Baselines

For systematic comparison, we have chosen a variety of methods, including ProtoBERT, NNShot, StructShot, CONTAINER Das et al. (2021), ESD Wang et al. (2021a) and Decomposed Meta-Learning Ma et al. (2022). Please refer to the Appendix for baseline-specific details.

5.5 Results and Analysis

Tables 1 and 2 show the results of the models on the Few-NERD-INTRA and Few-NERD-INTER tasks, respectively. Comprehensive experiments show that our method achieves state-of-the-art results. Our method improves by 11-29% in the 5~10-shot settings, which demonstrates the effectiveness of context-based causal intervention and entity detection, and by 1-8% in the 1~2-shot settings, demonstrating the advantages of prototype-based causal intervention and sample reweighting. This shows that our method can block the spurious correlations in the task and prevent overfitting.

5.6 Ablation Study

To evaluate the contribution of the different components of the proposed method, we performed the following ablations. For 1~2-shot: 1) we use the basic prototype network with entity detection, prototype-based intervention and sample reweighting; 2) without prototype-based intervention, we use the basic prototype network with entity detection and sample reweighting; 3) without entity detection, we use the basic prototype network with sample reweighting and prototype-based intervention; 4) we use the basic prototype network with only the prototype-based intervention; 5) we use only sample reweighting. For 5~10-shot: 1) without sample reweighting, we apply the context-based intervention and entity detection; 2) we apply only the context-based intervention to the basic prototype network; 3) we use the basic prototype network with only entity detection.

Table 3 illustrates the contribution of each component in our proposed method; the effect decreases when any component is removed. We also make the following observations. Entity detection better corrects and improves entity recognition, yielding a 6-19% improvement. Context-based causal intervention helps us better resolve spurious correlation, improving results by up to 20%. For 1~2-shot, prototype-based causal intervention also plays a great role, with an increase of 6%. In addition, sample reweighting helps the model improve by up to 18%.

Setting | Entity Detection | Context-based Intervention | Sample Reweighting | Prototype-based Intervention | F1 score
1~2-shot | ✓ | × | ✓ | ✓ | 69.41
1~2-shot | ✓ | × | ✓ | × | 63.22
1~2-shot | × | × | ✓ | ✓ | 60.54
1~2-shot | × | × | × | ✓ | 58.26
1~2-shot | × | × | ✓ | × | 57.64
5~10-shot | ✓ | ✓ | × | × | 81.89
5~10-shot | × | ✓ | × | × | 75.40
5~10-shot | ✓ | × | × | × | 61.67
Table 3: Ablation study: F1 scores on Few-NERD (✓ = component used, × = component removed)

5.7 Experimental Analysis

How does entity detection improve entity recognition? To further illustrate the role of entity detection, we give an example. Entity detection performs well on some boundary classifications, helping identify entities and correct misidentifications. For example, with the identification of 'the', entity detection can help identify 'the' in 'The Porcellian Club' and 'The Nation Game' while ignoring 'the year' and 'the club'. Similarly, for 'American', entity detection helps us identify the 'League' after 'American' in 'American League' while ignoring the 'airline' after 'American' in 'American airline'.

Figure 4: t-SNE visualization of entity replacement on Few-NERD INTRA, 5-way 5~10-shot.

How does context-based causal intervention improve entity recognition? To alleviate the overfitting problem of the model, we propose entity replacement. To further demonstrate that the context-based causal intervention solves the overfitting problem, Figure 4 shows the t-SNE of the original prototype network and of the prototype network with entities replaced. In the original prototype network, the embeddings are very scattered, the boundaries are not obvious, and classes are easily confused. With the context-based causal intervention, the embeddings become more compact and the boundaries become clear: entities of the same type are embedded closer together, while different types are pushed apart. This optimization of the embedding space enhances few-shot named entity recognition.

How does prototype-based causal intervention improve entity recognition? For example, in the current query, suppose we want to judge the class of 'German' in 'The local dialect of the region is East Franconian German, referred to in German as Frankisch'. Since German is O-class (the other class) in the current support, the model will classify German as O-class. This is an overfitting of the current prototype calculation. When we intervene on the prototype, the effect improves. For example, in previous support data there was a case where German was classified as other-language; we transformed it into a previous prototype and introduced it in the current episode. Through the prototype-based causal intervention, the model can correct this error and correctly classify German as other-language in the current query.

Figure 5: Histogram of sample reweighting on Few-NERD INTRA, 5-way 5~10-shot.

How does sample reweighting improve entity recognition? As shown in Figure 5, we plot the data-distribution histograms of the source and target domains for the original prototype network and after sample reweighting. After sample reweighting, the distributions of the source domain and the target domain are largely consistent and the gap is reduced. This enables the model to transfer well from the source domain to the target domain, and improves few-shot named entity recognition.

6 Conclusion and Future Work

In this paper, we analyze few-shot named entity recognition from a causal perspective. There exists a spurious correlation between the context and the predicted label, which causes overfitting between them; likewise, there exists a spurious correlation between the prototype and the predicted label, which causes overfitting between the prototype calculation and the predicted label. We therefore propose a causal intervention-based few-shot named entity recognition method. First, entity boundaries are detected. Then, a context-based causal intervention replaces entities to introduce new contexts. For 1-shot, where the context-based causal intervention is not applicable, we further propose the prototype-based causal intervention, considering previous and current prototypes. Finally, we use the support and query to calculate weights that reweight the samples. Comprehensive experiments show that our method achieves significant improvements, helps models avoid overfitting, and transfers better to new domains.

In addition, the method still has some weaknesses: its time consumption is small, but its memory consumption is large. In future work, we will continue to improve the method and reduce its memory consumption.

References

  • Athiwaratkun et al. (2020) Ben Athiwaratkun, Cicero Nogueira dos Santos, Jason Krone, and Bing Xiang. 2020. Augmented natural language for generative sequence labeling. arXiv preprint arXiv:2009.13272.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Chalmers et al. (1981) Thomas C Chalmers, Harry Smith Jr, Bradley Blackburn, Bernard Silverman, Biruta Schroeder, Dinah Reitman, and Alexander Ambroz. 1981. A method for assessing the quality of a randomized control trial. Controlled clinical trials, 2(1):31–49.
  • Chen et al. (2019) Mingyang Chen, Wen Zhang, Wei Zhang, Qiang Chen, and Huajun Chen. 2019. Meta relational learning for few-shot link prediction in knowledge graphs. arXiv preprint arXiv:1909.01515.
  • Chiu and Nichols (2016) Jason PC Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. Transactions of the association for computational linguistics, 4:357–370.
  • Cui et al. (2021) Leyang Cui, Yu Wu, Jian Liu, Sen Yang, and Yue Zhang. 2021. Template-based named entity recognition using bart. arXiv preprint arXiv:2106.01760.
  • Das et al. (2021) Sarkar Snigdha Sarathi Das, Arzoo Katiyar, Rebecca J Passonneau, and Rui Zhang. 2021. Container: Few-shot named entity recognition via contrastive learning. arXiv preprint arXiv:2109.07589.
  • de Masson D’Autume et al. (2019) Cyprien de Masson D’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019. Episodic memory in lifelong language learning. Advances in Neural Information Processing Systems, 32.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Ding et al. (2021) Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Hai-Tao Zheng, and Zhiyuan Liu. 2021. Few-nerd: A few-shot named entity recognition dataset. arXiv preprint arXiv:2105.07464.
  • French (1999) Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135.
  • Fritzler et al. (2019) Alexander Fritzler, Varvara Logacheva, and Maksim Kretov. 2019. Few-shot classification in named entity recognition task. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pages 993–1000.
  • Gao et al. (2020) Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.
  • Geng et al. (2019) Ruiying Geng, Binhua Li, Yongbin Li, Xiaodan Zhu, Ping Jian, and Jian Sun. 2019. Induction networks for few-shot text classification. arXiv preprint arXiv:1902.10482.
  • Han et al. (2018) Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. Fewrel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. arXiv preprint arXiv:1810.10147.
  • Hou et al. (2020) Yutai Hou, Wanxiang Che, Yongkui Lai, Zhihan Zhou, Yijia Liu, Han Liu, and Ting Liu. 2020. Few-shot slot tagging with collapsed dependency transfer and label-enhanced task-adaptive projection network. arXiv preprint arXiv:2006.05702.
  • Kulis et al. (2013) Brian Kulis et al. 2013. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364.
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
  • Liu et al. (2021) Fangchao Liu, Lingyong Yan, Hongyu Lin, Xianpei Han, and Le Sun. 2021. Element intervention for open relation extraction. arXiv preprint arXiv:2106.09558.
  • Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Ma et al. (2022) Tingting Ma, Huiqiang Jiang, Qianhui Wu, Tiejun Zhao, and Chin-Yew Lin. 2022. Decomposed meta-learning for few-shot named entity recognition. arXiv preprint arXiv:2204.05751.
  • Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354.
  • Pearl (2009) Judea Pearl. 2009. Causality. Cambridge university press.
  • Peters et al. (2017) Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108.
  • Qi et al. (2020) Jiaxin Qi, Yulei Niu, Jianqiang Huang, and Hanwang Zhang. 2020. Two causal principles for improving visual dialog. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10860–10869.
  • Schick and Schütze (2020) Timo Schick and Hinrich Schütze. 2020. It’s not just size that matters: Small language models are also few-shot learners. arXiv preprint arXiv:2009.07118.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30.
  • Tang et al. (2020a) Kaihua Tang, Jianqiang Huang, and Hanwang Zhang. 2020a. Long-tailed classification by keeping the good and removing the bad momentum causal effect. Advances in Neural Information Processing Systems, 33:1513–1524.
  • Tang et al. (2020b) Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. 2020b. Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3716–3725.
  • Thrun (1995) Sebastian Thrun. 1995. Is learning the n-th thing any easier than learning the first? Advances in neural information processing systems, 8.
  • Tong et al. (2021) Meihan Tong, Shuai Wang, Bin Xu, Yixin Cao, Minghui Liu, Lei Hou, and Juanzi Li. 2021. Learning from miscellaneous other-class words for few-shot named entity recognition. arXiv preprint arXiv:2106.15167.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. Advances in neural information processing systems, 29.
  • Wang et al. (2021a) Peiyi Wang, Runxin Xu, Tianyu Liu, Qingyu Zhou, Yunbo Cao, Baobao Chang, and Zhifang Sui. 2021a. An enhanced span-based decomposition method for few-shot sequence labeling. arXiv preprint arXiv:2109.13023.
  • Wang et al. (2021b) Peiyi Wang, Runxin Xun, Tianyu Liu, Damai Dai, Baobao Chang, and Zhifang Sui. 2021b. Behind the scenes: An exploration of trigger biases problem in few-shot event classification. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1969–1978.
  • Wang et al. (2020) Tan Wang, Jianqiang Huang, Hanwang Zhang, and Qianru Sun. 2020. Visual commonsense r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10760–10770.
  • Wang et al. (2021c) Yaqing Wang, Haoda Chu, Chao Zhang, and Jing Gao. 2021c. Learning from language description: Low-shot named entity recognition via decomposed framework. arXiv preprint arXiv:2109.05357.
  • Yang and Katiyar (2020) Yi Yang and Arzoo Katiyar. 2020. Simple and effective few-shot named entity recognition with structured nearest neighbor learning. arXiv preprint arXiv:2010.02405.
  • Yu et al. (2021) Dian Yu, Luheng He, Yuan Zhang, Xinya Du, Panupong Pasupat, and Qi Li. 2021. Few-shot intent classification and slot filling with retrieved examples. arXiv preprint arXiv:2104.05763.
  • Yue et al. (2020) Zhongqi Yue, Hanwang Zhang, Qianru Sun, and Xian-Sheng Hua. 2020. Interventional few-shot learning. Advances in neural information processing systems, 33:2734–2746.
  • Zeng et al. (2020) Xiangji Zeng, Yunliang Li, Yuchen Zhai, and Yin Zhang. 2020. Counterfactual generator: A weakly-supervised method for named entity recognition. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7270–7280.
  • Zhang et al. (2020) Dong Zhang, Hanwang Zhang, Jinhui Tang, Xian-Sheng Hua, and Qianru Sun. 2020. Causal intervention for weakly-supervised semantic segmentation. Advances in Neural Information Processing Systems, 33:655–666.
  • Zhang et al. (2021) Wenkai Zhang, Hongyu Lin, Xianpei Han, and Le Sun. 2021. De-biasing distantly supervised named entity recognition via causal intervention. arXiv preprint arXiv:2106.09233.

Appendix A Formula Appendix

$P(Y=y|do(T=t))=\sum_{E}P(Y=y|do(T=t),E=e)\,P(E=e|do(T=t))$  (5)

$P(Y=y|do(T=t))=\sum_{E}P(Y=y|do(T=t),do(E=e))\,P(E=e|do(T=t))$  (6)

$P(Y=y|do(T=t))=\sum_{E}P(Y=y|do(T=t),do(E=e))\,P(E=e|T=t)$  (7)

$P(Y=y|do(T=t))=\sum_{E}P(Y=y|do(E=e))\,P(E=e|T=t)$  (8)

$P(Y=y|do(T=t))=\sum_{t^{\prime}}\sum_{E}P(Y=y|do(E=e),T=t^{\prime})\,P(T=t^{\prime}|do(E=e))\,P(E=e|T=t)$  (9)

$P(Y=y|do(T=t))=\sum_{t^{\prime}}\sum_{E}P(Y=y|E=e,T=t^{\prime})\,P(T=t^{\prime}|do(E=e))\,P(E=e|T=t)$  (10)

$P(Y=y|do(T=t))=\sum_{t^{\prime}}\sum_{E}P(Y=y|do(E=e),T=t^{\prime})\,P(T=t^{\prime})\,P(E=e|T=t)$  (11)

$P(Y=y|do(T=t))=\sum_{E}P(E=e|T=t)\sum_{t^{\prime}}P(Y=y|E=e,T=t^{\prime})\,P(T=t^{\prime})$  (12)
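Briefly, these steps follow the standard front-door derivation Pearl (2009): Eq. 5 marginalizes over the mediator E; Eq. 6 exchanges the observation E=e for the intervention do(E=e) (Rule 2 of the do-calculus); Eq. 7 uses P(E=e|do(T=t))=P(E=e|T=t), since the only backdoor path from T to E is blocked at the collider Y; Eq. 8 drops do(T=t) because, with E fixed, T has no remaining causal path to Y (Rule 3); Eq. 9 conditions on T' under do(E=e); Eq. 10 replaces do(E=e) with the observation E=e, since conditioning on T' blocks the backdoor path E ← T ← C → Y (Rule 2); Eq. 11 uses P(T'|do(E=e))=P(T'), since intervening on E cannot affect T (Rule 3); and Eq. 12 applies both simplifications and rearranges the sums.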

Appendix B Baseline Appendix

ProtoBERT uses BERT to get the vector representation of each token, then averages all vectors of the same type as the class representation, following the prototype network. Finally, it calculates the distance between each class representation and the query, and classifies based on the nearest class.

NNShot gets the feature representation for each token and calculates the distance between the query and each representation. Finally, the class is judged based on the nearest distance.
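A minimal sketch of the nearest-neighbor rule NNShot uses (illustrative PyTorch, not the authors' implementation):

```python
import torch

def nnshot_classify(query_emb, support_emb, support_labels):
    # Each query token takes the label of its closest support token.
    dists = torch.cdist(query_emb, support_emb)   # [n_query, n_support]
    return support_labels[dists.argmin(dim=1)]    # [n_query]
```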

StructShot adds an additional Viterbi decoder to the NNShot.

CONTAINER also uses BERT, additionally employing contrastive learning and Gaussian embeddings to get the representation of each token. It is then fine-tuned on the support set and performs inference with the nearest-neighbor method.

ESD uses inter and cross-span attention based on prototypes to get span representations. Also, it constructs multi-prototypes for O label.

Decomposed Meta-Learning treats few-shot NER as a sequence labeling problem. MAML is used to initialize the model parameters, while MAML-ProtoNet finds the optimal embedding space for entity recognition.

Appendix C Experiments Appendix

Parameter Setting We use BERT-base-uncased. Further parameter settings are shown in Table 4.

Name Value
Batch_size 20
Max_length 32
Learning rate 1e-4
Embedding dimension 768
Dropout 0.1
Table 4: The hyperparameters of experiments