Unifying Token and Span Level Supervisions for Few-Shot Sequence Labeling
Abstract.
Few-shot sequence labeling aims to identify novel classes based on only a few labeled samples. Existing methods address the data scarcity problem mainly by designing token-level or span-level labeling models based on metric learning. However, these methods are trained at only a single granularity (i.e., either token level or span level) and therefore inherit the weaknesses of that granularity. In this paper, we are the first to unify token-level and span-level supervision, and we propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot sequence labeling. CDAP contains token-level and span-level networks that are jointly trained at different granularities. To align the outputs of the two networks, we further propose a consistent loss that enables them to learn from each other. During the inference phase, we propose a consistent greedy inference algorithm that first adjusts the predicted probabilities and then greedily selects non-overlapping spans with maximum probability. Extensive experiments show that our model achieves new state-of-the-art results on three benchmark datasets. All the code and data of this work will be released at https://github.com/zifengcheng/CDAP.
1. Introduction
Sequence labeling tasks such as Named Entity Recognition (NER) and Slot Tagging (ST) are fundamental tasks in information extraction, benefiting query and document understanding in information retrieval (Guo et al., 2022; Agosti et al., 2020; Li et al., 2021), request analysis in task-oriented dialog systems (Vakulenko et al., 2021; Ma et al., 2022c), and so forth. Traditional sequence labeling methods are often supervised methods, which require a large amount of annotated data to train. However, this limits their applications in some real scenarios, where the acquisition of annotated data is laborious and expensive.
In recent years, few-shot sequence labeling has attracted much attention since it aims to extract novel classes based on only a few labeled samples (Fritzler et al., 2019; Hou et al., 2020; Yang and Katiyar, 2020; Tong et al., 2021). Table 1 shows a 2-way 1-shot example, which contains two novel classes (i.e., building-other and building-library), and each class has one sample in the support set. Few-shot sequence labeling aims to learn a model that borrows prior experience from old domains and adapts to new domains quickly with only very few samples. For example, based on the support set, the model should extract “Brooklyn Bridge” and “Oxford University Museum of Natural History” as “building-other”, and “public library” and “Radcliffe Science Library” as “building-library” in the query set.
Prevalent methods for few-shot sequence labeling usually focus on token-level metric learning (Fritzler et al., 2019; Hou et al., 2020; Yang and Katiyar, 2020; Oguz and Vu, 2021; Ma et al., 2022a; Das et al., 2022; Li et al., 2022), in which the model assigns a label to each query token based on a learned distance metric. For example, Fritzler et al. (2019) construct a prototype for each class to classify query tokens, and Das et al. (2022) optimize token-level Gaussian-distributed embeddings to classify query tokens. Recently, some methods (Yu et al., 2021a; Wang et al., 2021; Ma et al., 2022b; Wang et al., 2022) focus on span-level metric learning to measure span-level similarity for classification and achieve state-of-the-art performance. For example, Wang et al. (2021) propose a two-step model, which first detects spans and then classifies them based on a span-level metric.
Support Set | (1) The museum was established as the Municipal Gallery of Hamilton [building-other] in January 1914, and was opened to the public in June 1914, at a Hamilton Public Library [building-library] building on Main Street West. |
Query Set | (1) The Brooklyn Bridge [building-other] ceramic tiles display the bridge’s vertical cables but do not depict its diagonal cables. (2) The reading program had been discontinued, library patronage had dropped, and the denver public library [building-library] was planning to shutter the facility. (3) The current Radcliffe Science Library [building-library] building is located next to the Oxford University Museum of Natural History [building-other] and consists of three parts. |
Despite their success, these token-level and span-level metric learning methods are trained at only a single granularity, so they inevitably have some weaknesses specific to their granularity. For the token-level metric learning methods, the scheme of labeling tokens separately tends to ignore the integrity of named entities and dialog slots that consist of multiple tokens. Taking the sentence of the support set in Table 1 as an example, to extract the complete entity “Municipal Gallery of Hamilton”, each of its constituent tokens, “Municipal”, “Gallery”, “of”, and “Hamilton”, must be labeled separately, potentially leading to the loss of information about the entity as a whole. As a consequence, it is challenging for token-level representation learning to capture the full semantics of multi-token entities and slots (Wang et al., 2022). For the span-level metric learning methods, since they need to enumerate all possible spans for classification, the few-shot labels of entities and slots are easily diluted by a large number of non-entity and non-slot labels. For example, if we set the maximum span length to 4, the spans containing the word “Municipal” include “Municipal”, “the Municipal”, “Municipal Gallery”, “as the Municipal”, “Municipal Gallery of”, “established as the Municipal”, “Municipal Gallery of Hamilton”, and so forth. Among these spans, only the span “Municipal Gallery of Hamilton” is an entity, while the others are non-entities. According to our statistics on the FewNERD dataset, non-entity spans can outnumber entity spans by more than 100 to 1. As a result, when entity spans are rare in the few-shot setting, it becomes challenging for span-level metric learning methods to distinguish them from a large number of similar non-entity spans.
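The dilution effect described above can be made concrete with a minimal sketch. The helper name `spans_containing` and the whitespace tokenization are assumptions made for illustration only; they are not part of the proposed model.

```python
def spans_containing(tokens, target_index, max_len):
    """Enumerate all spans of at most max_len tokens that cover tokens[target_index].

    Spans are (start, end) pairs with inclusive boundaries.
    """
    spans = []
    n = len(tokens)
    for start in range(max(0, target_index - max_len + 1), target_index + 1):
        for end in range(target_index, min(n, start + max_len)):
            spans.append((start, end))
    return spans


# Simplified prefix of the support sentence in Table 1 (whitespace tokenization assumed).
tokens = ("The museum was established as the Municipal Gallery "
          "of Hamilton in January 1914").split()
idx = tokens.index("Municipal")

candidates = spans_containing(tokens, idx, max_len=4)
entity = (idx, idx + 3)  # "Municipal Gallery of Hamilton"
non_entities = [s for s in candidates if s != entity]

for start, end in candidates:
    tag = "entity" if (start, end) == entity else "O"
    print(f"{' '.join(tokens[start:end + 1]):35s} {tag}")
```

Even for this single word, one entity span competes with several non-entity spans, and the imbalance grows once all spans of the sentence are enumerated.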
These weaknesses are difficult to address when only a single-granularity model is trained, but they can plausibly be alleviated when both token-level and span-level labels are used for model training in a unified view. On the one hand, span-level labeling can help token-level labeling maintain entity integrity and capture the full semantics of multi-token entities. On the other hand, token-level labeling can in turn help span-level labeling distinguish easily confused spans.
To this end, we propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot sequence labeling. CDAP can jointly train token-level and span-level networks, and make them benefit from each other, to produce better predictions. Specifically, we employ an Adaptive Prototypical Network (APN) to generate adaptive prototypes specific to each query in the few-shot setting and use the APN for both token-level and span-level labeling. Besides, considering that token-level and span-level networks may produce different predictions, we further propose a consistent loss to align their predictions based on bidirectional Kullback-Leibler divergence with temperature. During the inference phase, we propose a consistent greedy inference algorithm to combine the predictions of both networks. The core idea of the algorithm is to adjust the prediction probability of spans based on the predictions of tokens within the span.
The main contributions of this paper can be summarized as follows:
• To the best of our knowledge, we are the first to unify token-level and span-level supervision to train a few-shot sequence labeling model.
• We propose a consistent dual adaptive prototypical network for few-shot sequence labeling and a consistent greedy inference algorithm to combine the results of the two networks.
• Extensive experiments show that our model achieves new state-of-the-art performance on three benchmark datasets.
The rest of this paper is organized as follows. We first review the related work in Section 2. Then, we formulate the task and describe our proposed method in detail in Sections 3 and 4, respectively. After that, we conduct experiments and analyze the results in Section 5. Finally, we conclude our work and discuss future work in Section 6.
2. Related Work
In this section, we introduce the following three research topics relevant to our work: few-shot learning, sequence labeling, and few-shot sequence labeling.
2.1. Few-Shot Learning
Few-shot learning aims to recognize novel classes with only a few samples in each class and was first proposed in the computer vision community (Fei-Fei et al., 2006). Motivated by a simple machine learning principle (i.e., train and test conditions must match), Matching Networks (Vinyals et al., 2016) first model few-shot learning as an episodic N-way K-shot task, where each episode has N classes and each class has K annotated samples (K is very small, e.g., 5). Afterwards, many methods have been proposed, such as metric learning methods, optimization methods, and generation methods. Metric learning methods aim to learn a suitable embedding space for distance metrics, including cosine similarity (Vinyals et al., 2016), Euclidean distance to prototype (i.e., mean class representation) (Snell et al., 2017), learnable relation modules (Sung et al., 2018), nearest neighbor (Yang and Katiyar, 2020), and reachable distance (Zhang et al., 2022a). For example, Snell et al. (2017) first construct prototypes by averaging the representations of each class in the support set and then optimize the network by using a softmax function over Euclidean distances between all prototypes and the query representation. Some other methods focus on improving the transferability of features, including using a cross-attention module to exploit the semantic relevance between support and query features (Hou et al., 2019), adaptive margin loss (Li et al., 2020b), and so on. Recently, some methods (Chen et al., 2019; Dhillon et al., 2020; Chen et al., 2021) propose a transfer-learning framework that pre-trains on the training set and fine-tunes on the support set of the test episode. Model-agnostic meta-learning (MAML) (Finn et al., 2017) is a typical optimization method that aims to learn a good model initialization and then quickly adapt to the test episode. Finn et al. (2018) further extend MAML, which adapts to new tasks via gradient descent, to incorporate a parameter distribution that is trained via a variational lower bound. Generation methods aim to generate samples (Wang et al., 2018; Zhang et al., 2018; Zhou et al., 2022) or features (Li et al., 2020d) to augment the training set. Specifically, Zhang et al. (2018) propose to augment vanilla few-shot classification models with a GAN for better generalization. FlipDA (Zhou et al., 2022) jointly uses a generative model and a classifier to generate label-flipped data, which yields a more significant performance improvement than label-preserved data. AFHN (Li et al., 2020d) generates diverse and discriminative features conditioned on the few labeled samples based on cWGAN.
2.2. Sequence Labeling
Sequence labeling is a fundamental task in NLP and covers named entity recognition (Lample et al., 2016), part-of-speech tagging (Li et al., 2015), slot tagging (Qin et al., 2019), and so on. Recently, neural sequence labeling models have become the mainstream approach to sequence labeling.
Typical neural sequence labeling models (Huang et al., 2015; Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) generally use CNN or LSTM to extract features and utilize softmax or Conditional Random Field (CRF) for classification. For example, Chiu and Nichols (2016) use CNN to extract character-level features, LSTM to extract word-level features, and softmax to classify words. However, Cui and Zhang (2019) find that BiLSTM-CRF does not always lead to better results compared with BiLSTM-Softmax, and propose a hierarchically-refined label attention network, which explicitly leverages label embeddings and captures potential long-term label dependency by giving each word incrementally refined label distributions with hierarchical attention. Afterwards, some work (Yu et al., 2020; Fu et al., 2021; Ouchi et al., 2020; Jiang et al., 2020; Zhu et al., 2022) adopts span-level prediction which first detects spans by enumeration or boundary identification, and then classifies spans. For example, Yu et al. (2020) use a biaffine classifier to assign scores for all spans in a sentence. Fu et al. (2021) first compare and analyze the advantages and disadvantages of the traditional token-level models and span-level models in detail, and then reveal that span prediction can serve as a system combiner to re-recognize named entities from different systems’ outputs.
Besides, four novel paradigms for sequence labeling have recently been proposed, reformulating it as sequence-to-sequence learning, generation, set prediction, and machine reading comprehension tasks, respectively. First, considering that sequence-to-sequence learning has a good ability to learn complex global label dependencies, some work applies it to sequence labeling tasks such as chunking (Zhai et al., 2017), aspect term extraction (Ma et al., 2019), emotion-cause pair extraction (Cheng et al., 2021), and dialogue act prediction (Colombo et al., 2020). For example, in order to make sequence-to-sequence learning suitable for sequence labeling, Ma et al. (2019) design gated unit networks to incorporate the corresponding word representation into the decoder and position-aware attention to pay more attention to the adjacent words of a target word. Second, some work uses generation to solve sequence labeling. Specifically, Athiwaratkun et al. (2020) propose to generate augmented output that repeats the original input sequence with additional markers that indicate the token spans and their associated labels. Yan et al. (2021) propose a unified generation framework with the pointer mechanism to tackle flat, nested, and discontinuous NER subtasks. Zhang et al. (2022b) analyze the incorrect biases in the generation process from a causality perspective and design intra- and inter-entity deconfounding data augmentation methods to eliminate confounders. Third, Tan et al. (2021) first propose a sequence-to-set network for NER, which provides a fixed set of entity queries to replace the explicit candidate spans and uses self-attention to capture the dependencies between entities. Finally, Li et al. (2020a) first propose a machine reading comprehension framework to solve the NER task and use type-specific queries that encode semantic prior information for entity categories. However, type-specific queries can only extract one entity type per inference and rely on external knowledge to construct queries. Shen et al. (2022) further set up global and learnable instance queries to extract entities from a sentence in a parallel manner.
2.3. Few-Shot Sequence Labeling
Few-shot sequence labeling aims to identify novel classes based on only a few labeled samples and has gained widespread attention due to its application value. We roughly divide the existing few-shot sequence labeling methods into three groups.
The first group consists of token-level metric learning methods. Fritzler et al. (2019) are the first to use prototypical networks to solve this task. Afterwards, L-TapNet (Hou et al., 2020) and StructShot (Yang and Katiyar, 2020) both explore CRF to model label dependence in the few-shot setting. L-TapNet (Hou et al., 2020) extends task-adaptive projection networks (TapNet) (Yoon et al., 2019) and improves label representations with label names. StructShot (Yang and Katiyar, 2020) trains the model with the traditional cross-entropy loss and uses nearest neighbor for classification. Huang et al. (2021) explore noisy supervised pre-training and self-training in the few-shot sequence labeling setting. Motivated by the observation that the non-entity class O has rich semantics, MUCO (Tong et al., 2021) introduces a binary classifier to determine whether it is an entity or not. Ma et al. (2022a) propose to leverage the semantics of label names and use two separate BERT encoders to encode labels and tokens for a token-level similarity metric. Instead of optimizing class-specific attributes, CONTAINER (Das et al., 2022) models a Gaussian embedding for each token and optimizes the inter-token distribution distance to improve generalizability.
The second group consists of span-level methods. SpanNER (Wang et al., 2021) is a two-step method, which first detects the start and end tokens to extract spans and then introduces class descriptions to classify them. Ma et al. (2022b) further propose a model-agnostic meta-learning (MAML) (Finn et al., 2017) enhanced two-step method, which uses the BIESO scheme to detect spans and prototypical networks to classify them. Retriever (Yu et al., 2021a) proposes a span-level retrieval method that learns similar contextualized representations for spans with the same label via a novel batch-softmax objective. ESD (Wang et al., 2022) proposes inter-span and cross-span attention to enhance the span representations, and uses instance span attention to construct prototypes.
The third group of methods mitigates data scarcity with other paradigms, including machine reading comprehension (Ma et al., 2021), generation (Athiwaratkun et al., 2020; Chen et al., 2022), pairwise cloze (Henderson and Vulic, 2021), and prompt tuning (Cui et al., 2021; Lee et al., 2022; Ma et al., 2022d; Hou et al., 2022). For the machine reading comprehension method, Ma et al. (2021) enumerate all the slots to extract the answer from the sentence. The MRC-based method can naturally encode the label semantics in the form of questions. For generation methods, Athiwaratkun et al. (2020) propose to generate augmented natural language output, including the original input sequence, delimiter, and label. Self-describing Networks (Chen et al., 2022) is a sequence-to-sequence generation network that can universally describe mentions using concepts, automatically map novel entity types to concepts, and adaptively recognize entities on demand. For prompt tuning methods, Template (Cui et al., 2021) treats NER as a language model ranking problem in a sequence-to-sequence framework, where the output sequence is filled by templates. Hou et al. (2022) propose an inverse prompt, predicting slot values given slot types, and further introduce an iterative prediction strategy to consider the relations between different slot types. Ma et al. (2022d) reformulate the NER task as an entity-oriented LM task, inducing the LM to predict label words at entity positions during fine-tuning, and propose a template-free prompt tuning method, EntLM. Lee et al. (2022) perform a systematic study of entity-oriented prompts (i.e., only entity tokens) and instance-oriented prompts (i.e., retrieved sentences).
3. Task Formulation
We define a sentence as and its label as , where is a span in sentence (e.g., “nimbus” and “nimbus powerplant” in Figure 1), is the corresponding label (e.g., entity class “building-other” and non-entity class O), and is the number of spans in the sentence. Few-shot sequence labeling aims to train a model in the source domain that can accurately extract and classify all novel classes with a few labeled samples in the target domain. Episode training (Vinyals et al., 2016) is a common and effective way to train the model by imitating the test situation for few-shot learning. Each training/test episode contains an N-way K-shot support set and a query set , where the support set includes N classes (N-way) and each class has K annotated examples (K-shot). In the training phase, the few-shot sequence labeling model is trained by predicting the labels of the query set according to the few labeled examples in the support set . In the testing phase, the model directly predicts the labels of the query set on the unseen test dataset according to the few labeled examples in the support set . To facilitate the illustration, we list the main notations used throughout this article in Table 2.
Notation | Description |
Token-level representation matrix for a sentence | |
Representation matrix of all tokens of class in the support set | |
Token-level representation for query token | |
Token-level adaptive prototype of class for query token | |
The length of sentence | |
The dimension of feature representation | |
The number of tokens of class in the support set | |
Representation matrix of all spans in the support set | |
Representation matrix of all spans in the query set | |
Representation matrix of all spans after cross-attention module in the support set | |
Representation matrix of all spans after cross-attention module in the query set | |
Span-level adaptive prototype of sub-class for query span | |
Representation matrix of all spans in the support set with sub-class | |
Span-level representation for query span | |
Span-level adaptive prototype of class for query span | |
The number of spans in the support set | |
The number of spans in the query set | |
The number of spans with sub-class in the support set |
4. Model
In this section, we present the details of our proposed model, as shown in Figure 1. Our model utilizes both token-level and span-level labels to mitigate the data scarcity problem in the few-shot setting from a unified perspective. We first explain the encoder module of our model (Section 4.1). Next, we present the architecture of the token-level adaptive prototypical network (Section 4.2) and the span-level adaptive prototypical network (Section 4.3). We further propose a consistent loss to make the predictions of the two networks consistent (Section 4.4). Then we introduce the final loss function and the training flow of our model (Section 4.5). Finally, we propose a consistent greedy inference algorithm to combine the outputs of the two networks during the inference phase (Section 4.6).

4.1. Encoder Module
We utilize BERT (Devlin et al., 2019) to encode sentences in the support and query set. Given a sentence , we insert a [CLS] and [SEP] token for each sentence to get the input of BERT and feed it into BERT to get its contextual representations as follows:
(1) |
We further use a linear layer to obtain the token representation as the input to the token-level network as follows:
(2) |
where and are trainable parameters. Since [CLS] and [SEP] tokens do not need to be classified, we do not feed them into the linear layer.
For the span-level network, we use boundary tokens (i.e., left and right boundary tokens) and a linear layer to obtain the span representation as the input to the span-level network. Specifically, given a span , we represent the span as:
(3) |
where and are trainable parameters, and is the concatenation operation. It is worth noting that we enumerate all spans in the sentence.
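As a rough illustration of the boundary-based span representation, the following sketch concatenates the contextual vectors of the left and right boundary tokens and applies a linear layer. The function names, the toy dimensions, and the randomly initialized weights are assumptions for illustration; in the model the token vectors come from BERT and the parameters are trained.

```python
import random


def linear(x, weight, bias):
    """Compute W x + b for one vector x, where weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def enumerate_spans(n_tokens, max_len=None):
    """All (left, right) spans with inclusive boundaries, optionally length-limited."""
    max_len = max_len or n_tokens
    return [(i, j) for i in range(n_tokens) for j in range(i, min(n_tokens, i + max_len))]


def span_representation(token_states, span, weight, bias):
    """Concatenate the left and right boundary token vectors, then project them."""
    left, right = span
    return linear(token_states[left] + token_states[right], weight, bias)


random.seed(0)
hidden, out_dim = 4, 3  # toy sizes, not the sizes used in the paper
token_states = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(5)]
weight = [[random.uniform(-1, 1) for _ in range(2 * hidden)] for _ in range(out_dim)]
bias = [0.0] * out_dim

all_spans = enumerate_spans(len(token_states))
span_vec = span_representation(token_states, (1, 3), weight, bias)
print(len(all_spans), span_vec)
```

A single-token span simply concatenates the same token vector with itself, so every span has a representation of the same size.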
4.2. Token-Level Adaptive Prototypical Network
For the token-level network, we adopt an adaptive prototypical network to classify tokens. General prototypical networks (Snell et al., 2017) simply average all the tokens of the same class to obtain the class prototype. However, the average operation does not reflect the importance of each token (Gao et al., 2019; Sun et al., 2019). Therefore, we propose to construct adaptive prototypes for each query token by considering the importance of each support token. Specifically,
(4) |
(5) |
where is representation of query token , is representation of all tokens of class in the support set, is the number of -class tokens in support set, is attention weight, and is adaptive prototype of class for query token . For clarity, we denote this attention aggregation as , i.e., .
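The attention aggregation can be sketched as follows. This is a minimal illustration that assumes dot-product scores normalized by a softmax; the exact scoring function of Equation (4) is not recoverable from the text here, and the name `adaptive_prototype` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_prototype(query, support):
    """Attention-weighted average of the support vectors of one class.

    Assumption: the attention score is the dot product between the query
    vector and each support vector.
    """
    scores = [sum(q * s for q, s in zip(query, vec)) for vec in support]
    weights = softmax(scores)
    proto = [sum(w * vec[d] for w, vec in zip(weights, support))
             for d in range(len(query))]
    return proto, weights


support = [[1.0, 0.0], [0.0, 1.0]]  # two support tokens of the same class

# A query aligned with the first support token pulls the prototype towards it.
proto, weights = adaptive_prototype([1.0, 0.0], support)

# With uniform scores the result reduces to the plain mean prototype.
mean_proto, uniform = adaptive_prototype([0.0, 0.0], support)
print(proto, weights, mean_proto)
```

The second call shows why the construction generalizes the standard prototype: when all support tokens are equally relevant to the query, the weights are uniform and the adaptive prototype equals the class mean.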
Then we utilize cross-entropy loss to optimize the network and the predicted probability distribution can be produced by a softmax function over Euclidean distances between all prototypes and query token . Specifically,
(6) |
(7) |
where is the predicted probability of -class for token , denotes Euclidean distance, and denotes the ground-truth label for token . It is worth noting that, following Ding et al. (2021), we use the IO (Inside and Outside) scheme for the token-level network rather than the BIESO (Begin, Inside, End, Singleton, and Outside) or BIO (Begin, Inside, and Outside) scheme. For example, the label space of the example in Table 1 is {building-other, building-library, Outside}, not {begin-building-other, inside-building-other, end-building-other, singleton-building-other, begin-building-library, inside-building-library, end-building-library, singleton-building-library, outside} or {begin-building-other, inside-building-other, begin-building-library, inside-building-library, outside}.
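The classification step can be illustrated with a small sketch over the IO label space of Table 1. It assumes that the logits are negative (non-squared) Euclidean distances; the prototype vectors and the query vector are made-up toy values.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_probabilities(query, prototypes):
    """Softmax over negative distances between the query and each class prototype."""
    logits = {label: -euclidean(query, proto) for label, proto in prototypes.items()}
    m = max(logits.values())
    exps = {label: math.exp(v - m) for label, v in logits.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}


def cross_entropy(probs, gold_label):
    """Negative log-likelihood of the gold label for one token."""
    return -math.log(probs[gold_label])


# Toy prototypes for the IO label space (values are illustrative only).
prototypes = {
    "building-other": [1.0, 0.0],
    "building-library": [0.0, 1.0],
    "O": [-1.0, -1.0],
}
query = [0.9, 0.1]

probs = class_probabilities(query, prototypes)
loss = cross_entropy(probs, "building-other")
print(probs, loss)
```

The closer a query token lies to a prototype, the larger its probability for that class, so minimizing the cross-entropy pulls tokens towards the prototype of their gold label.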
4.3. Span-Level Adaptive Prototypical Network
For the span-level network, we classify spans by adopting an adaptive prototypical network similar to the token-level one. The main differences are that we employ an extra cross-attention module and make a more fine-grained division of class O.
4.3.1. Cross-Attention Module
To recognize a sample from a novel class given a few labeled samples, a natural idea is to locate the most relevant regions in the pair of support and query samples (Hou et al., 2019). Thus, we use cross-attention to model the interaction between the support set and the query set and to enhance the representations of both. We denote the span representation in the support and query set as and , where and denote the number of spans in the support and query set. Then we use cross-attention to model the interaction:
(8) |
(9) |
where and denote the -row of and , denotes the span representation after cross-attention in the support set, and denotes the span representation after cross-attention in the query set. Finally, we use Transformer block (Vaswani et al., 2017) to get the enhanced representation:
(10) |
(11) |
where and denote enhanced representation of support and query set, and FFN denotes a feed forward neural network.
4.3.2. Fine-Grained Division of O
Different from entity classes, non-entity class O has rich semantics and is hard to represent with a single prototype (Tong et al., 2021; Wang et al., 2022). So we further divide class O into three sub-classes. Since we use left and right boundary tokens to represent span, we divide sub-classes based on whether the span has the same left and right boundary as entity spans. Specifically, given a sentence with entity spans , where and are the left and right boundary tokens of the -th span, other non-entity spans can be assigned with a sub-class label as follows,
where denotes the non-entity spans that have the same left boundary token as entity spans, denotes the non-entity spans that have the same right boundary token as entity spans, and denotes the non-entity spans that share neither a left nor a right boundary token with entity spans. It is worth noting that for sentences with two or more entities, there may be some non-entity spans that have the same left boundary as one entity span and the same right boundary as another entity span. We treat these non-entity spans as .
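The division rule can be sketched as a small function. The sub-class names `O_left`, `O_right`, and `O_other` are hypothetical placeholders, and the treatment of spans that share a left boundary with one entity and a right boundary with another is left as an explicit argument (`both_label`), because the sub-class chosen for that case is not recoverable from the text above.

```python
def o_subclass(span, entity_spans, *, both_label):
    """Assign a non-entity span to one of three O sub-classes.

    span and entity_spans use inclusive (left, right) token indices.
    both_label is the sub-class used when the span shares its left boundary
    with one entity and its right boundary with another (an assumption that
    the caller must supply).
    """
    if span in entity_spans:
        raise ValueError("span is an entity span, not a member of class O")
    left, right = span
    shares_left = any(left == l for l, _ in entity_spans)
    shares_right = any(right == r for _, r in entity_spans)
    if shares_left and shares_right:
        return both_label
    if shares_left:
        return "O_left"
    if shares_right:
        return "O_right"
    return "O_other"


# One entity covering tokens 6..9 (e.g., "Municipal Gallery of Hamilton").
single = [(6, 9)]
print(o_subclass((6, 7), single, both_label="O_left"))   # shares the left boundary
print(o_subclass((8, 9), single, both_label="O_left"))   # shares the right boundary
print(o_subclass((0, 1), single, both_label="O_left"))   # shares neither boundary

# Two entities: the span (2, 7) starts like the first and ends like the second.
double = [(2, 3), (5, 7)]
print(o_subclass((2, 7), double, both_label="O_left"))
```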
4.3.3. Adaptive Prototypical Network
We also construct adaptive prototypes for each query span, similar to token-level adaptive prototypical network. Since we divide class O into three sub-classes, we first adaptively construct prototype of each sub-class and then adaptively aggregate three sub-classes to get the prototype of class O. For entity classes, we can directly get the adaptive prototypes. Specifically,
(12) |
(13) |
(14) |
where denotes the adaptive prototype of sub-class for query span , denotes the representation of all spans in the support set with sub-class , denotes the number of spans with sub-class in the support set, denotes the adaptive prototype of class O for query span , is representation of all spans of class in the support set, denotes the number of -class spans in the support set, and denotes the adaptive prototype of class for query span .
After getting all prototypes , we also use softmax function over Euclidean distances between all prototypes and query span to get the predicted probability distribution, and cross-entropy loss to optimize the network.
(15) |
(16) |
where is predicted probability of -class for span , denotes the ground-truth label of span , and is the number of spans in the sentence.
4.4. Consistent Loss
Intuitively, the token-level network and span-level network should make consistent predictions in a well-trained model. Therefore, we further propose a consistent loss to make the predicted probability distributions of two networks consistent. It contains two steps, where the first step constructs two token-level probability distributions for each token from two networks and the second step measures the difference between two probability distributions.
For each token, we can obtain the token-level probability distribution directly from the token-level network. Then, among all spans containing this token, we select the span with the maximum predicted probability and take its probability distribution as the token-level probability distribution from span-level network. Finally, we use bidirectional Kullback-Leibler divergence with temperature as consistent loss to allow the two networks to learn from each other and to measure the difference between the two distributions. Specifically,
(17) |
where and are the predicted logits of token-level and span-level networks for token , denotes softmax function, and is the temperature to adjust the sharpness of the probability distribution.
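The two steps of the consistent loss can be sketched as follows. The sketch assumes that "the span with the maximum predicted probability" means the covering span whose top class probability is largest, that the two KL terms are summed, and that the loss is averaged over tokens; these choices and all function names are illustrative assumptions rather than the exact form of Equation (17).

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def span_logits_for_token(token_index, span_logits):
    """Step 1: among spans covering the token, take the most confident one."""
    covering = [logits for (left, right), logits in span_logits.items()
                if left <= token_index <= right]
    return max(covering, key=lambda logits: max(softmax(logits)))


def consistent_loss(token_logits, span_logits, temperature=2.0):
    """Step 2: bidirectional KL between the two token-level distributions."""
    total = 0.0
    for i, t_logits in enumerate(token_logits):
        p = softmax(t_logits, temperature)
        q = softmax(span_logits_for_token(i, span_logits), temperature)
        total += kl(p, q) + kl(q, p)
    return total / len(token_logits)


token_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
agreeing = {(0, 0): [2.0, 0.0, 0.0], (1, 1): [0.0, 2.0, 0.0], (0, 1): [0.0, 0.0, 1.0]}
disagreeing = dict(agreeing)
disagreeing[(0, 0)] = [0.0, 0.0, 5.0]

loss_same = consistent_loss(token_logits, agreeing)
loss_diff = consistent_loss(token_logits, disagreeing)
print(loss_same, loss_diff)
```

When both networks produce the same distribution for every token the loss vanishes, and it grows as their predictions diverge. A larger temperature flattens both distributions before they are compared.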
4.5. Training
Finally, we combine the above three loss functions to form the final loss function to jointly train our model as follows:
(18) |
where , , are hyper-parameters.
In summary, there are three steps to train our model, i.e., calculate token-level loss, calculate span-level loss, and calculate consistent loss. The whole training flow is illustrated in Figure 1 and Algorithm 1.

4.6. Consistent Greedy Inference Algorithm
Intuitively, we consider extracted entities and slots unreliable if the two networks predict different labels. So we propose a consistent greedy algorithm to combine the outputs of the two networks. As shown in Figure 2 and Algorithm 2, we first adjust the predicted probabilities of spans and then use a greedy algorithm to gradually select non-overlapping spans with the maximum probability based on the adjusted probability.
First, we consider a span to be reliable if all token-level predictions from the token-level network within the span agree with the span-level prediction from the span-level network. For example, as shown in Figure 2, the prediction of the span “powerplant” is consistent, whereas the predictions of “nimbus powerplant” and “the nimbus powerplant” are inconsistent. We penalize the probabilities of inconsistent spans by the following operation:
(19) |
where and are the predicted probabilities of span after and before consistent adjustment, is a hyper-parameter that controls the degree of penalty, and is the number of inconsistent tokens. Then, we use a greedy algorithm to gradually select non-overlapping spans with the maximum probability based on the adjusted probability. Specifically, at each step of the greedy algorithm, we select the remaining span with the maximum adjusted probability and discard the spans that overlap with it. For example, we select the span “powerplant” and discard “nimbus powerplant” and “the nimbus powerplant”. The whole inference flow is illustrated in Algorithm 2.
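The inference procedure can be sketched as below. The exact form of the penalty in Equation (19) is not recoverable from the text here, so the sketch assumes a multiplicative decay, i.e., the probability is multiplied by `penalty` once per inconsistent token. This form, the function name, the placeholder label "X", and the toy probabilities are all assumptions for illustration.

```python
def consistent_greedy_inference(span_preds, token_labels, penalty=0.5):
    """Adjust span probabilities by token-level agreement, then select greedily.

    span_preds:   dict mapping inclusive (left, right) spans to (label, probability)
                  for spans that the span-level network predicts as non-O.
    token_labels: labels predicted by the token-level network, one per token.
    penalty:      assumed multiplicative factor applied per inconsistent token.
    """
    adjusted = []
    for (left, right), (label, prob) in span_preds.items():
        inconsistent = sum(1 for i in range(left, right + 1) if token_labels[i] != label)
        adjusted.append((prob * penalty ** inconsistent, (left, right), label))

    # Greedy selection: highest adjusted probability first, skip overlapping spans.
    adjusted.sort(key=lambda item: item[0], reverse=True)
    selected = []
    for prob, (left, right), label in adjusted:
        overlaps = any(left <= r and l <= right for (l, r), _, _ in selected)
        if not overlaps:
            selected.append(((left, right), label, prob))
    return selected


tokens = ["the", "nimbus", "powerplant"]
token_labels = ["O", "O", "X"]  # token-level predictions ("X" is a placeholder class)
span_preds = {
    (2, 2): ("X", 0.6),  # "powerplant": consistent with all token predictions
    (1, 2): ("X", 0.7),  # "nimbus powerplant": one inconsistent token
    (0, 2): ("X", 0.5),  # "the nimbus powerplant": two inconsistent tokens
}

selected = consistent_greedy_inference(span_preds, token_labels)
print([(" ".join(tokens[l:r + 1]), label) for (l, r), label, _ in selected])
```

In this toy case the span "nimbus powerplant" has the highest raw probability, but after the adjustment the consistent span "powerplant" is selected first and the two overlapping spans are discarded.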
5. Experiments
In this section, we first introduce our experimental details, including the datasets, evaluation metrics, implementation details, and baselines. Then, we report the experimental results on three datasets to answer the following research questions:
• RQ1: Does our proposed model outperform existing few-shot sequence labeling methods?
• RQ2: How does each component of our model contribute to the final performance?
• RQ3: What is the effect of different inference algorithms on the performance of our method?
• RQ4: What is the effect of different consistent losses on the performance of our method?
• RQ5: How do the hyper-parameters influence the performance of our method?
Thereafter, we also explore the effectiveness of our O division strategy, evaluate the efficiency of our model, and conduct error analysis and case study to further analyze our model.
Datasets | # Sentences | # Classes | Domain |
FewNERD | 188.2K | 66 | Wikipedia |
SNIPS | 14.5K | 60 | Dialog |
Cross | 189.4K | 43 | News, Wiki, Social, and Mixed |
5.1. Datasets
We conduct experiments on three benchmark datasets: FewNERD (Ding et al., 2021), SNIPS (Coucke et al., 2018), and Cross. FewNERD and Cross are NER datasets, and SNIPS is a slot tagging dataset. The statistics of the three original datasets are shown in Table 3. The detailed description of the three datasets is as follows:
FewNERD is a NER dataset with a hierarchy of 8 coarse-grained entity types and 66 fine-grained entity types, and it has two few-shot settings: Intra and Inter. The two settings divide the dataset using coarse-grained and fine-grained entity types, respectively. For the Intra setting, all entities in the training set, development set, and test set belong to different coarse-grained types. For the Inter setting, all entities in the training set, development set, and test set belong to different fine-grained types. It is worth noting that, in order to better handle dense entities (i.e., a sentence may contain multiple entity types), Ding et al. (2021) adopt the N-way K~2K shot sampling method to construct the FewNERD dataset (each class in the support set has K~2K entities). Both the Intra and Inter settings have 4 sub-settings: 5-way 1~2 shot, 5-way 5~10 shot, 10-way 1~2 shot, and 10-way 5~10 shot. We use the public sampled dataset released by Ding et al. (2021), which contains 20,000 episodes for training, 1,000 episodes for validation, and 5,000 episodes for testing. (Footnote 1: We use the latest version of the FewNERD dataset from https://github.com/thunlp/Few-NERD, which corresponds to the results reported in https://arxiv.org/pdf/2105.07464v6.pdf.)
SNIPS is a slot tagging dataset that contains 7 domains: Weather (We), Music (Mu), Play List (Pl), Book (Bo), Search Screen (Se), Restaurant (Re), and Creative Work (Cr). Hou et al. (2020) randomly sample one domain for validation, then randomly sample a different domain for testing, and use the other domains for training. In each episode, all classes have K-shot samples in the support set and the number of classes (N) is not fixed. Each domain of the SNIPS dataset has two settings: 1-shot and 5-shot. We use the public sampled dataset provided by Hou et al. (2020) for a fair comparison. (Footnote 2: https://github.com/AtmaHou/FewShotTagging)
Cross is a NER dataset consisting of four datasets: CoNLL-2003 (Sang and Meulder, 2003), GUM (Zeldes, 2017), WNUT-2017 (Derczynski et al., 2017), and Ontonotes (Pradhan et al., 2013). The four datasets are from the News, Wiki, Social, and Mixed domains, respectively. As with the SNIPS dataset, Hou et al. (2020) use two datasets for training, one dataset for validation, and one dataset for testing. Each domain of the Cross dataset also has two settings, 1-shot and 5-shot, and we again use the public sampled dataset provided by Hou et al. (2020) for a fair comparison. (Footnote 3: https://github.com/AtmaHou/FewShotTagging)
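To make the FewNERD sampling constraint concrete, the following sketch checks whether a support set satisfies an N-way K~2K shot episode, i.e., exactly N classes with between K and 2K entity mentions each. The function name and the flat list-of-labels input are assumptions made for illustration; this is not the sampler of Ding et al. (2021).

```python
from collections import Counter

def is_valid_support_set(entity_labels, n_way, k_shot):
    """Check the N-way K~2K shot constraint on a support set.

    entity_labels: one class label per entity mention in the support set
                   (a hypothetical flat representation).
    """
    counts = Counter(entity_labels)
    has_n_classes = len(counts) == n_way
    # Every class must appear at least K and at most 2K times.
    within_shot_range = all(k_shot <= c <= 2 * k_shot for c in counts.values())
    return has_n_classes and within_shot_range


# 2-way 1~2 shot example using the two classes from Table 1.
ok = is_valid_support_set(
    ["building-other", "building-library", "building-library"], n_way=2, k_shot=1
)
```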
Models | Intra, 1~2 shot, 5 way | Intra, 1~2 shot, 10 way | Intra, 5~10 shot, 5 way | Intra, 5~10 shot, 10 way | Intra, Avg. | Inter, 1~2 shot, 5 way | Inter, 1~2 shot, 10 way | Inter, 5~10 shot, 5 way | Inter, 5~10 shot, 10 way | Inter, Avg. |
Proto | 20.76±0.84 | 15.04±0.44 | 42.54±0.94 | 35.40±0.13 | 28.44±0.59 | 38.83±1.49 | 32.45±0.79 | 58.79±0.44 | 52.92±0.37 | 45.75±0.77 |
NNShot | 25.78±0.91 | 18.27±0.41 | 36.18±0.79 | 27.38±0.53 | 26.90±0.66 | 47.24±1.00 | 38.87±0.21 | 55.64±0.63 | 49.57±2.73 | 47.83±1.14 |
StructShot | 30.21±0.90 | 21.03±1.13 | 38.00±1.29 | 26.42±0.60 | 28.92±0.98 | 51.88±0.69 | 43.34±0.10 | 57.32±0.63 | 49.57±3.08 | 50.53±1.13 |
De-MAML† | 37.49±2.10 | 27.36±2.70 | 41.26±1.34 | 35.31±5.53 | 35.36±2.92 | 61.63±1.11 | 49.24±5.10 | 54.09±2.24 | 53.20±3.44 | 54.54±2.97 |
ESD | 36.08±1.60 | 30.00±0.70 | 52.14±1.50 | 42.15±2.60 | 40.09±1.60 | 59.29±1.25 | 52.16±0.79 | 69.06±0.80 | 64.00±0.43 | 61.13±0.82 |
Ours | 41.21±0.70 | 35.36±0.91 | 52.67±0.49 | 45.83±0.81 | 43.77±0.73 | 62.79±0.29 | 56.23±0.76 | 69.83±0.40 | 65.99±0.40 | 63.71±0.46 |
5.2. Evaluation Metrics
For the FewNERD dataset, we report the micro-F1 over all test episodes following Ding et al. (2021). Specifically,
\[ F_1 = \frac{2 P R}{P + R}, \tag{20} \]
where $P$ denotes the precision calculated on all test episodes and $R$ denotes the recall calculated on all test episodes. For the SNIPS and Cross datasets, we first calculate the micro-F1 for each test episode and then report the average F1 over all test episodes as the final result following Hou et al. (2022). Specifically,
\[ F_1 = \frac{1}{n} \sum_{i=1}^{n} F_1^{(i)}, \tag{21} \]
where $n$ is the number of episodes and $F_1^{(i)}$ denotes the F1 of the $i$-th episode.
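The two protocols can give quite different numbers on the same predictions, which the following sketch illustrates. It assumes each episode is summarized by three counts (correct predictions, total predictions, total gold entities); the function names and this tuple layout are illustrative choices, not taken from any evaluation script.

```python
def f1_from_counts(n_correct, n_pred, n_gold):
    """F1 from raw counts; returns 0.0 when precision or recall is undefined."""
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pooled_micro_f1(episodes):
    """FewNERD-style: pool the counts of all episodes, then compute one F1."""
    n_correct = sum(e[0] for e in episodes)
    n_pred = sum(e[1] for e in episodes)
    n_gold = sum(e[2] for e in episodes)
    return f1_from_counts(n_correct, n_pred, n_gold)

def episode_averaged_f1(episodes):
    """SNIPS/Cross-style: compute F1 per episode, then average over episodes."""
    return sum(f1_from_counts(*e) for e in episodes) / len(episodes)


# Two toy episodes as (n_correct, n_pred, n_gold).
episodes = [(1, 1, 1), (1, 4, 4)]
pooled = pooled_micro_f1(episodes)        # counts pooled: 2 correct of 5 and 5
averaged = episode_averaged_f1(episodes)  # mean of the per-episode F1 values
```

Pooling weights every entity equally, so large episodes dominate; averaging weights every episode equally, so a small, easy episode can lift the score considerably.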
5.3. Implementation Details
We use BERT-base-uncased (Devlin et al., 2019) as our encoder and the AdamW (Loshchilov and Hutter, 2019) optimizer. For the FewNERD dataset, the learning rate is 2e-5 for BERT and 5e-4 for the other parameters, and the batch size is set to 2. For the SNIPS dataset, we use grid search to find the best hyper-parameters. The learning rate of BERT is selected from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}, the learning rate of the other parameters is selected from {5e-5, 1e-4, 3e-4, 5e-4}, and the batch size is selected from {2, 4}. We schedule the learning rate with a linear warmup phase over the first 1000 training steps, followed by a linear decay phase for the remaining steps. and are selected from {0.02, 0.05, 0.1, 0.2, 0.3} through grid search, is set to 1, and is selected from {0.01, 0.02, 0.05, 0.1}. The maximum span length is set to 8 for inference.
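The warmup-then-decay schedule can be written as a multiplier applied to each base learning rate. The sketch below is one common reading of "linear warmup then linear decay"; the total number of training steps is not stated in this section, so `total_steps` is a placeholder, and the decay-to-zero endpoint is an assumption.

```python
def linear_warmup_decay(step, total_steps, warmup_steps=1000):
    """Learning-rate multiplier in [0, 1] for a given training step.

    Rises linearly from 0 to 1 over `warmup_steps`, then falls linearly
    to 0 at `total_steps` (the endpoint of 0 is an assumption).
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


# Base learning rates reported for FewNERD; total_steps is a placeholder value.
bert_lr, other_lr, total_steps = 2e-5, 5e-4, 10_000
lr_at_500 = (bert_lr * linear_warmup_decay(500, total_steps),
             other_lr * linear_warmup_decay(500, total_steps))
```

Using one multiplier for both parameter groups keeps the ratio between the BERT learning rate and the learning rate of the other parameters constant throughout training.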
5.4. Baselines
To evaluate the effectiveness of our method, we compare our method with the following baselines:
- •
- •
- •
- TransferBERT (Hou et al., 2020) is a domain transfer model which pretrains the model on source domains and fine-tunes it on the target domain.
- •
- •
- Retriever (Yu et al., 2021a) is a span-level retrieval method that learns similar contextualized representations for spans with the same label via a novel batch-softmax objective.
- ConVEx (Henderson and Vulic, 2021) first pre-trains through a novel pairwise cloze task using Reddit data and then directly fine-tunes for the slot tagging task.
- MRC (Ma et al., 2021) enumerates all the slots to extract the answer from the sentence in a machine reading comprehension framework.
- Inverse (Hou et al., 2022) reversely predicts slot values given a slot-type prompt and further proposes an iterative prediction strategy to consider the relations between different slot types.
- De-MAML (Ma et al., 2022b) proposes a two-step MAML-enhanced model that first detects spans and then classifies them.
- ESD (Wang et al., 2022) proposes inter-span and cross-span attention to enhance the span representations, and uses instance span attention to construct prototypes.
Setting | Models | We | Mu | Pl | Bo | Se | Re | Cr | Avg. |
1-shot | TransferBERT | 55.82±2.75 | 38.01±1.74 | 45.65±2.02 | 31.63±5.32 | 21.96±3.98 | 41.79±3.81 | 38.53±7.42 | 39.06±3.86 |
1-shot | Matching | 21.74±4.60 | 10.68±1.07 | 39.71±1.81 | 58.15±0.68 | 24.21±1.20 | 32.88±0.64 | 69.66±1.68 | 36.72±1.67 |
1-shot | Proto | 46.72±1.03 | 40.07±0.48 | 50.78±2.09 | 68.73±1.87 | 60.81±1.70 | 55.58±3.56 | 67.67±1.16 | 55.77±1.70 |
1-shot | MRC | - | - | - | - | - | - | - | 69.3(unk) |
1-shot | L-TapNet+CDT | 71.53±4.04 | 60.56±0.77 | 66.27±2.71 | 84.54±1.08 | 76.27±1.72 | 70.79±1.60 | 62.89±1.88 | 70.41±1.97 |
1-shot | ESD | 78.25±1.50 | 54.74±1.02 | 71.15±1.55 | 71.45±1.38 | 67.85±0.75 | 71.52±0.98 | 78.14±1.46 | 70.44±1.23 |
1-shot | Ours | 73.56±1.58 | 58.40±1.42 | 74.20±1.24 | 76.07±1.71 | 70.64±1.49 | 72.73±1.08 | 75.32±1.09 | 71.56±1.37 |
5-shot | TransferBERT | 59.41±0.30 | 42.00±2.83 | 46.07±4.32 | 20.74±3.36 | 28.20±0.29 | 67.75±1.28 | 58.61±3.67 | 46.11±2.29 |
5-shot | Matching | 36.67±3.64 | 33.67±6.12 | 52.60±2.84 | 69.09±2.36 | 38.42±4.06 | 33.28±2.99 | 72.10±1.48 | 47.98±3.36 |
5-shot | Proto | 67.82±4.11 | 55.99±2.24 | 46.02±3.19 | 72.17±1.75 | 73.59±1.60 | 60.18±6.96 | 66.89±2.88 | 63.24±3.25 |
5-shot | Retriever | 82.95(unk) | 61.74(unk) | 71.75(unk) | 81.65(unk) | 73.10(unk) | 79.54(unk) | 51.35(unk) | 71.72(unk) |
5-shot | L-TapNet+CDT | 71.64±3.62 | 67.16±2.97 | 75.88±1.51 | 84.38±2.81 | 82.58±2.12 | 70.05±1.61 | 73.41±2.61 | 75.01±2.46 |
5-shot | Inverse | 70.63(unk) | 71.97(unk) | 78.73(unk) | 87.34(unk) | 81.95(unk) | 72.07(unk) | 74.44(unk) | 76.73(unk) |
5-shot | ConVEx | 71.5(unk) | 77.6(unk) | 79.0(unk) | 84.5(unk) | 84.0(unk) | 73.8(unk) | 67.4(unk) | 76.8(unk) |
5-shot | MRC | 89.39(unk) | 75.11(unk) | 77.18(unk) | 84.16(unk) | 73.53(unk) | 82.29(unk) | 72.51(unk) | 79.17(unk) |
5-shot | ESD | 84.50±1.06 | 66.61±2.00 | 79.69±1.35 | 82.57±1.37 | 82.22±0.81 | 80.44±0.80 | 81.13±1.84 | 79.59±1.32 |
5-shot | Ours | 86.24±1.04 | 67.66±1.01 | 82.02±1.05 | 86.47±0.92 | 82.45±1.09 | 81.38±0.55 | 80.55±1.16 | 80.97±0.97 |
Models | 1-shot, News | 1-shot, Wiki | 1-shot, Social | 1-shot, Mixed | 1-shot, Avg. | 5-shot, News | 5-shot, Wiki | 5-shot, Social | 5-shot, Mixed | 5-shot, Avg. |
TransferBERT | 4.75±1.42 | 0.57±0.32 | 2.71±0.72 | 3.46±0.54 | 2.87±0.75 | 15.36±2.81 | 3.62±0.57 | 11.08±0.57 | 35.49±7.60 | 16.39±2.89 |
Matching | 19.50±0.35 | 4.73±0.16 | 17.23±2.75 | 15.06±1.61 | 14.13±1.22 | 19.85±0.74 | 5.58±0.23 | 6.61±1.75 | 8.08±0.47 | 10.03±0.80 |
Proto | 32.49±2.01 | 3.89±0.24 | 10.68±1.40 | 6.67±0.46 | 13.43±1.03 | 50.06±1.57 | 9.54±0.44 | 17.26±2.65 | 13.59±1.61 | 22.61±1.57 |
De-MAML† | 42.40±0.93 | 4.76±0.08 | 23.87±1.49 | 14.19±1.33 | 21.31±0.96 | 37.10±2.65 | 6.21±0.26 | 18.54±0.89 | 26.41±0.48 | 22.07±1.07 |
L-TapNet+CDT | 44.30±3.15 | 12.04±0.65 | 20.80±1.06 | 15.17±1.25 | 23.08±1.53 | 45.35±2.67 | 11.65±2.34 | 23.30±2.80 | 20.95±2.81 | 25.31±2.65 |
Ours | 41.21±0.70 | 9.91±0.50 | 25.01±1.32 | 26.18±2.69 | 25.58±1.30 | 51.97±2.93 | 15.65±1.51 | 29.89±1.64 | 37.19±1.35 | 33.68±1.86 |
5.5. Results (RQ1)
To answer RQ1, we conduct an empirical study to investigate whether our proposed model achieves better performance for few-shot sequence labeling. Table 4 and Table 6 report the performance of our proposed model and the baselines on the few-shot NER datasets (i.e., the FewNERD and Cross datasets), and Table 5 reports the performance on the few-shot slot tagging dataset (i.e., the SNIPS dataset).
First, we observe the performance on the FewNERD dataset. Our model consistently outperforms all baselines under all settings. Compared with the best baseline, ESD, our model improves the average F1 by 3.68 and 2.58 under the Intra and Inter settings. Compared with the best token-level baseline, StructShot, our model improves the average F1 by 14.85 and 13.18 under the two settings. This shows that combining token-level and span-level supervision achieves better performance for few-shot sequence labeling. It is worth noting that the improvement under the Intra setting is larger than under the Inter setting. This suggests that when the gap between the source domain and the target domain is relatively large, combining the token and span levels leads to larger improvements over the single-granularity baselines.
Secondly, we observe the performance on the SNIPS dataset. Compared with the best baseline, ESD, our model improves the average F1 by 1.12 and 1.38 under the 1-shot and 5-shot settings. Compared with all baseline methods, our method achieves the best or second-best performance in most domains, which again shows the effectiveness of our model. It is worth noting that the standard deviations of our method are the lowest or second lowest in most settings, which shows that our model is robust.
Thirdly, we observe the performance on the Cross dataset. Compared with the best baseline, L-TapNet+CDT, our model improves the average F1 by 2.50 and 8.37 under the 1-shot and 5-shot settings. Compared with all baseline methods, our method achieves the best or second-best performance in nearly all domains; the only exception is the News domain under the 1-shot setting, where it ranks third. This shows the effectiveness of our model.
Model Setting | Intra | Inter |
Ours | 36.92±1.54 | 48.63±0.51 |
w/o Span-Level Loss | 26.42±0.33 | 36.25±1.21 |
w/o Token-Level Loss | 34.53±0.51 | 47.80±0.64 |
w/o Consistent Loss | 35.27±0.77 | 48.10±0.16 |
w/o Inference Algorithm | 31.13±1.66 | 45.18±0.31 |
w/o Division of O | 35.63±0.94 | 48.49±0.40 |
w/o Adaptive Prototype | 32.52±1.31 | 46.27±1.23 |
w/o Cross-Attention | 32.51±1.78 | 44.98±0.31 |
5.6. Ablation Study (RQ2)
To answer RQ2 and validate the contribution of each component, we report in Table 7 the performance obtained when each component is removed from our model individually.
First, removing any component lowers performance, which shows that all components of our model are effective. Secondly, we remove the span-level loss and the token-level loss in turn. When we ablate the span-level loss, the model uses the output of the token-level network as the result, and the performance drops to 26.42 and 36.25 under the Intra and Inter settings. This may be because it is difficult for token-level metric learning to extract entities completely and because our token-level model is parameter-efficient. When we remove the token-level loss, the performance drops to 34.53 and 47.80 under the two settings. This shows that jointly training the model with token-level labels effectively improves performance. Thirdly, we remove the consistent loss and the performance drops to 35.27 and 48.10 under the two settings. This shows that the consistent loss effectively regularizes the model and improves performance. Fourth, we remove the inference algorithm, i.e., all spans that the span-level network predicts as entities or slots are extracted. The performance drops to 31.13 and 45.18 under the two settings. This illustrates that our inference algorithm effectively combines the results of the two networks and avoids extracting overlapping spans. Finally, we ablate each component of the span network individually (i.e., the fine-grained division of O, the cross-attention module, and the adaptive prototype). All of these components improve performance. The cross-attention module brings the largest improvement, which indicates that the interaction between the support set and the query set is the key to performance. When we remove the adaptive prototype, the performance drops to 32.52 and 46.27 under the two settings, which shows that adaptively constructing prototypes for each query span is more effective. When we remove the division of class O, the performance drops to 35.63 and 48.49 under the two settings, which shows that the fine-grained division of O helps construct the prototypes of class O.
Inference Setting | F1 |
Ours | 48.63±0.51 |
Token Network | 26.94±0.96 |
Span Network | 48.05±0.54 |
Intersection | 29.93±0.71 |
Union | 41.27±0.36 |
5.7. Effects of Different Inference Algorithms (RQ3)
To answer RQ3, we explore the effects of different inference algorithms in Table 8: our proposed consistent greedy inference algorithm, combining the outputs of the two networks with an intersection or union operation (i.e., a span and its type are extracted by both networks or by at least one of them), and using only the output of the token network or of the span network (for which we also use the greedy algorithm to avoid overlapping spans).
First, our proposed inference algorithm achieves the best performance. This shows that using token-level predictions to adjust the probabilities of the span network is effective. Secondly, the span network performs better than the token network. This may be because it is difficult for token-level metric learning to extract entities completely and because our token-level model is parameter-efficient. Thirdly, we report the performance of combining the outputs of the two networks with the intersection or union operation. Neither exceeds the span network alone. This is because the token network performs poorly, so a simple combination does not improve performance. Finally, it is worth noting that the span network here (48.05) exceeds the model trained without the token-level loss in Table 7 (47.80). This shows that joint training with the token-level loss improves the performance of the span-level network.
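The intersection and union baselines can be sketched with plain set operations once both networks' outputs are expressed as (start, end, label) triples. The conversion of per-token labels into spans below assumes that a span is a maximal run of identical non-O labels; the text does not specify the tagging scheme of the token network, so this grouping rule, the function name, and the label `product` are assumptions.

```python
def token_labels_to_spans(labels, outside="O"):
    """Group maximal runs of identical non-O token labels into spans.

    Returns a set of (start, end, label) triples with inclusive indices.
    """
    spans, start = set(), None
    # A trailing sentinel closes a run that reaches the end of the sentence.
    for i, label in enumerate(list(labels) + [outside]):
        if start is not None and label != labels[start]:
            spans.add((start, i - 1, labels[start]))
            start = None
        if start is None and label != outside:
            start = i
    return spans


token_spans = token_labels_to_spans(["O", "O", "product"])
span_spans = {(1, 2, "product")}          # hypothetical span-network output

intersection = token_spans & span_spans   # extracted by both networks
union = token_spans | span_spans          # extracted by at least one network
```

In this toy case the two networks disagree on the left boundary, so the intersection is empty while the union keeps two overlapping spans. Exact-match combination therefore either discards or duplicates near-miss predictions, whereas the consistent greedy algorithm reweights them.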
5.8. Unidirectional vs. Bidirectional Consistent Loss (RQ4)
To answer RQ4, we explore in Table 9 the performance of using a unidirectional Kullback-Leibler divergence with temperature as the consistent loss, i.e., only using the span network to supervise the token network (ST) or only using the token network to supervise the span network (TS).
First, the bidirectional consistent loss performs better than both ST and TS. This shows that the bidirectional consistent loss aligns the outputs of the two networks more effectively than the unidirectional one. Secondly, ST and TS both exceed the model without the consistent loss. This shows that both unidirectional consistent losses help align the outputs and improve performance. Thirdly, TS outperforms ST. This suggests that it is more effective to use supervision from the token network to train the span network, possibly because it is easier for the span network to learn a consistent output.
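A minimal sketch of the three variants compared here is given below. It assumes that both networks produce logits over the same label set for the same unit, and that "A supervises B" corresponds to KL(A || B); the exact alignment between token-level and span-level distributions and the treatment of gradients (e.g., detaching the supervising side) are not shown and are assumptions. All function names are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistent_loss(token_logits, span_logits, temperature=1.0, direction="both"):
    """Consistency term between the two networks' predictions.

    direction: "ST" (span supervises token), "TS" (token supervises span),
               or "both" (sum of the two directions).
    """
    p_token = softmax(token_logits, temperature)
    p_span = softmax(span_logits, temperature)
    st = kl_divergence(p_span, p_token)   # assumed form of ST
    ts = kl_divergence(p_token, p_span)   # assumed form of TS
    return {"ST": st, "TS": ts, "both": st + ts}[direction]


loss = consistent_loss([2.0, 0.5, 0.1], [1.0, 1.2, 0.3])
```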
5.9. Effects of Different Consistent Losses (RQ4)
To answer RQ4, we also explore in Table 9 the performance of using different losses as the consistent loss, i.e., mean squared error (MSE), Kullback-Leibler divergence (KL), negative cosine similarity (COS), and Jensen-Shannon divergence (JS).
First, the KL loss with temperature performs best and the plain KL loss performs second best among these losses. This shows that using a temperature provides better supervision than the plain KL loss. It is worth noting that both the KL divergence and the JS divergence exceed the model without the consistent loss. This shows that the KL and JS divergences effectively measure the difference between two distributions and improve the performance of our model. Secondly, the model without the consistent loss exceeds MSE and COS. This illustrates that MSE and COS have a negative effect and are not a good choice for measuring the difference between the two probability distributions in our model.
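For reference, the three alternative losses can be written as follows for two discrete probability distributions over the same classes. This is a sketch using the standard textbook definitions; how each loss was scaled or reduced in the experiments is not stated in this section, so those details are assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse_loss(p, q):
    """Mean squared error between the two probability vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)

def negative_cosine_loss(p, q):
    """Negative cosine similarity; equals -1 when the vectors point the same way."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return -dot / (norm_p * norm_q)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log 2."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
values = (mse_loss(p, q), negative_cosine_loss(p, q), js_divergence(p, q))
```

Unlike the KL divergence, JS is symmetric and bounded, while MSE and cosine similarity treat the distributions as plain vectors and ignore their probabilistic structure.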
Consistent Loss | F1 |
Ours | 48.63±0.51 |
ST | 48.19±0.49 |
TS | 48.41±0.44 |
MSE | 48.05±0.32 |
KL | 48.41±0.13 |
COS | 47.93±0.14 |
JS | 48.38±0.46 |
5.10. Effects of Hyper-Parameters (RQ5)
To answer RQ5, we explore the performance under different hyper-parameters (i.e., the weight of the token-level loss in Eq. (18), the weight of the consistent loss in Eq. (18), and the degree of penalty in the inference algorithm in Eq. (19)). We vary one hyper-parameter at a time while fixing the others, and report the results in Fig. 3 on the FewNERD dataset under the 5-way 1~2 shot setting.
We first observe the effect of the weight of the token-level loss, which is selected from {0.02, 0.05, 0.1, 0.2, 0.3}. From Figure 3(a) and Figure 3(b), we find that the performance is relatively stable under the Inter and Intra settings, i.e., the performance fluctuates by 0.6% and 1.5%, respectively. As the weight increases, the performance first rises and then falls under both settings. This shows that a suitable weight helps improve performance. The optimal value depends on the specific setting: the model achieves the best performance when the weight is 0.1 under the Inter setting and 0.2 under the Intra setting.
Then, we observe the effects of the weight of the consistent loss and the degree of penalty, chosen from {0.02, 0.05, 0.1, 0.2, 0.3} and {0.01, 0.02, 0.05, 0.1}, respectively. The overall trend is the same as for the weight of the token-level loss. The performance is stable under both settings, i.e., it fluctuates by 0.5% and 0.4% as the weight of the consistent loss changes and by 0.2% and 0.3% as the degree of penalty changes. This shows that our model is insensitive to both hyper-parameters. As either of them increases, the overall performance again first rises and then falls under both settings. The model achieves the best performance when the weight of the consistent loss is 0.05 and the degree of penalty is 0.02 under both settings.
5.11. Comparison with Other O Divisions
Model Setting | Division Setting | F1 |
CDAP | Ours | 48.63±0.51 |
CDAP | Wang et al. (2022) | 48.39±0.68 |
ESD | Ours | 46.55±0.41 |
ESD | Wang et al. (2022) | 45.83±0.56 |
We compare our division with that of Wang et al. (2022) under two different models (i.e., CDAP and ESD) in Table 10 to show the effectiveness of our division. Wang et al. (2022) also divide class O into three sub-classes: the first contains spans that do not overlap with any entity or slot in the sentence, the second contains spans that are sub-spans of an entity or slot, and the third contains the remaining spans.
Our O division achieves better performance under both models. This illustrates the effectiveness of dividing sub-classes according to the left and right boundary tokens. It is worth noting that our division is more effective on the ESD model. This may be because the ESD model has a more limited capacity to distinguish sub-classes than our model, and our division helps it distinguish them more effectively.
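The three-way division attributed to Wang et al. (2022) can be expressed as a small rule over span boundaries. The sketch below follows the verbal description above only; the sub-class names are hypothetical placeholders (the original symbols are not available here), and our own boundary-token-based division is not shown.

```python
def o_subclass_wang(span, entities):
    """Assign an O span to one of three sub-classes.

    span: (start, end) with inclusive token indices; must not be an entity itself.
    entities: list of (start, end) gold entity or slot spans in the sentence.
    Returns "no-overlap", "sub-span", or "other" (hypothetical names).
    """
    start, end = span
    if span in entities:
        raise ValueError("span is an entity, not class O")
    # No token shared with any entity or slot.
    if all(end < e_start or start > e_end for e_start, e_end in entities):
        return "no-overlap"
    # Fully contained in some entity or slot.
    if any(e_start <= start and end <= e_end for e_start, e_end in entities):
        return "sub-span"
    return "other"


entities = [(2, 4)]
labels = [o_subclass_wang(s, entities) for s in [(0, 1), (3, 4), (1, 3), (2, 5)]]
```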
5.12. Efficiency
Since our model considers both the token level and the span level, we compare our method with token-level methods (i.e., Proto, NNShot, and StructShot) and span-level methods (i.e., ESD and De-MAML) in terms of the number of parameters and inference time in Table 11 to show its efficiency.
First, we observe the number of parameters. Proto, NNShot, and StructShot have the fewest parameters because these methods only use BERT to encode the tokens. The size of our model is similar to that of ESD and significantly smaller than that of De-MAML. This shows that our method is parameter-efficient and introduces only a small number of additional parameters (i.e., 2M) compared to Proto. In contrast, De-MAML uses two BERT encoders to encode two tasks separately, resulting in a much larger number of parameters.
Secondly, we compare the inference time of our model and the baselines on the same hardware with the same batch size (i.e., 1). The token-level methods are faster than the span-level methods. This is because the span-level methods need to enumerate all spans within a sentence whose length is less than or equal to L. Therefore, the number of spans is approximately L times the number of tokens, which brings extra computation overhead. Proto is faster than NNShot, and NNShot is faster than StructShot. This is because Proto only needs to measure the distance to each prototype, while NNShot needs to measure the distance to all the tokens in the support set. StructShot further applies a Viterbi decoder on top of NNShot to model label dependency, so more inference time is required. Then, we compare our model with the span-level methods. Our model is 13.40 ms slower than ESD and 32.49 ms faster than De-MAML under the 1~2 shot setting. Under the 5~10 shot setting, our model is 26.78 ms slower than ESD and 23.44 ms slower than De-MAML. This illustrates that the time cost of our model is acceptable. In addition, De-MAML needs two forward passes through BERT, resulting in a long inference time under the 1~2 shot setting. Our model and ESD both use a cross-attention module to model the interaction between the support set and the query set, resulting in a significant increase in inference time under the 5~10 shot setting.
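The claim that span enumeration multiplies the number of units by roughly L can be checked directly. The sketch below enumerates all spans of at most `max_len` tokens; the sentence length of 20 is an arbitrary example, while the maximum span length of 8 is the value used for inference in Section 5.3.

```python
def enumerate_spans(n_tokens, max_len):
    """All (start, end) spans with inclusive indices and length <= max_len."""
    return [
        (start, end)
        for start in range(n_tokens)
        for end in range(start, min(start + max_len, n_tokens))
    ]


n_tokens, max_len = 20, 8
spans = enumerate_spans(n_tokens, max_len)
ratio = len(spans) / n_tokens  # spans per token; bounded above by max_len
```

For a sentence of n tokens there are n - l + 1 spans of length l, so the count approaches L times n only when the sentence is much longer than L; for short sentences the ratio is noticeably smaller.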
Models | F1 | # Para. | Inference Time (1~2 shot) | Inference Time (5~10 shot) |
Ours | 53.74 | 111M | 79.43 ms | 234.23 ms |
Proto | 37.10 | 109M | 24.98 ms | 29.83 ms |
NNShot | 37.37 | 109M | 25.43 ms | 30.97 ms |
StructShot | 39.73 | 109M | 30.70 ms | 49.62 ms |
ESD | 50.61 | 111M | 66.03 ms | 207.45 ms |
De-MAML† | 44.95 | 218M | 111.92 ms | 210.79 ms |
Setting | F1 | # FP-Span | # FP-Type |
Inter | 62.79 | 7372(74.52%) | 2619(26.48%) |
Intra | 41.21 | 11264(65.46%) | 5944(34.54%) |
5.13. Error Analysis
We further analyze the prediction errors of our model in detail and divide them into 2 categories (i.e., FP-Span and FP-Type) according to whether the span is correct. FP-Span denotes extracted entities with the wrong span boundary, and FP-Type denotes extracted entities with the right boundary but the wrong entity type.
First, as shown in Table 12, FP-Span is the main prediction error under both settings. FP-Span accounts for 74.52% under the Inter setting and 65.46% under the Intra setting. This indicates that span detection is more difficult than span classification; it is the main bottleneck at present and needs further attention in the future. Secondly, compared to the Inter setting, the number of both kinds of errors rises under the more difficult Intra setting: FP-Span increases by 52.79% and FP-Type increases by 126.96%. This suggests that as the difference between the source domain and the target domain increases, entity classification is more difficult to transfer than span detection.
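The two error categories can be computed from predicted and gold (start, end, type) triples as sketched below, together with the relative-increase arithmetic used in the comparison between settings. The function names and the triple representation are illustrative assumptions; the counts passed to `relative_increase` are the ones reported in Table 12.

```python
def categorize_false_positives(predicted, gold):
    """Split wrong predictions into FP-Span and FP-Type.

    predicted, gold: sets of (start, end, type) triples.
    FP-Span: the predicted boundary matches no gold entity.
    FP-Type: the boundary matches a gold entity but the type differs.
    """
    gold_boundaries = {(start, end) for start, end, _ in gold}
    fp_span, fp_type = [], []
    for pred in sorted(predicted):
        if pred in gold:
            continue  # correct prediction
        if (pred[0], pred[1]) in gold_boundaries:
            fp_type.append(pred)
        else:
            fp_span.append(pred)
    return fp_span, fp_type

def relative_increase(old, new):
    """Percentage increase from old to new, rounded to two decimals."""
    return round(100 * (new - old) / old, 2)


gold = {(0, 1, "A"), (3, 3, "B")}
predicted = {(0, 1, "A"), (3, 3, "C"), (5, 6, "A")}
fp_span, fp_type = categorize_false_positives(predicted, gold)

span_increase = relative_increase(7372, 11264)  # FP-Span, Inter to Intra
type_increase = relative_increase(2619, 5944)   # FP-Type, Inter to Intra
```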
5.14. Case Study
Apart from the quantitative analysis, we also conduct a case study to intuitively show the effectiveness of our model and analyze it. In Table 13, we list the predictions of our model and three strong baselines (i.e., StructShot, De-MAML, and ESD) for two groups of test queries on FewNERD under the Inter 5-way 1~2 shot setting.
First, we observe the predictions for the first group (Queries 1, 2, and 3). The extraction capacity of StructShot is limited. For example, it only correctly extracts the entity “B.A.” in Query (1) and also extracts impossible entities (i.e., “the” and “of” in Query (3)). This illustrates that StructShot relies heavily on the labeled samples in the support set. Then, we observe the predictions of the span-level methods. For Query (1), De-MAML and ESD predict None, while our model extracts “B.A.”. This may be because we use the token-level loss to jointly train the model. For Query (2), our model correctly extracts and classifies “Monaco” and “Maxim’s”, whereas De-MAML fails to classify “Maxim’s” and ESD fails to extract “Maxim’s”. This shows that our model can extract multiple entities from one sentence. For Query (3), our model correctly extracts and classifies “faculty of medicine in Paris”. De-MAML and ESD extract only “faculty of medicine”, and ESD predicts the wrong type. This shows that our model can locate and extract entities that span multiple tokens.
Secondly, we display some failure examples in the second group (Queries 4, 5, and 6). For Query (4), StructShot extracts “Right” and fails to extract the simple entity “Republican”. This again shows that the extraction capacity of StructShot is limited and that it relies heavily on the labeled samples in the support set. De-MAML, ESD, and our model do not extract the entity “neo-Confederate Right” at all, possibly because this entity is more difficult to extract. For Query (5), StructShot does not extract the entity completely; only “bachelor” is extracted. De-MAML and our model also do not extract the entity completely; only “bachelor of law” is extracted. This indicates that locating the boundaries of entities remains challenging. ESD does not extract any entity, perhaps because it is trained with only span labels and the supervision is sparse. For Query (6), all methods predict None except De-MAML, which predicts the wrong entity “New American”. We inspect the corresponding support set and find that this may be due to the weak representation of the prototype constructed from the two samples of this class (i.e., Bobino and Gaumont-Palace) in the support set.
Query | StructShot | De-MAML | ESD | Ours |
(1) He decided to return to school in 1996, earning his B.A. [other-educational degree]. | earning, B.A. | None | None | B.A. |
(2) The same year, Monaco [person-actor] was named “Maxim’s [art-written art]” number one sexiest cover model of the decade. | None | Monaco, Maxim’s | Monaco | Monaco, Maxim’s |
(3) In 1864 he attained the chair of internal pathology at the faculty of medicine in Paris [building-hospital]. | the, chair, of, of medicine in | faculty of medicine | faculty of medicine | faculty of medicine in Paris |
(4) She writes that “even into the twenty-first century mainstream conservative Republican [organization-political party] politicians continued to associate themselves with issues, symbols, and organizations inspired by the neo-Confederate Right [organization-political party]”. | Right | Republican | Republican | Republican |
(5) He gained a bachelor of law degree [other-educational degree]. | bachelor | bachelor of law | None | bachelor of law |
(6) The script was published in Grove New American Theater [building-theater]. | None | New American | None | None |
6. Conclusion and Future Work
In this paper, we are the first to unify token and span level supervisions, and we propose a consistent dual adaptive prototypical network for few-shot sequence labeling. Our model contains a token-level network and a span-level network, jointly trained with token-level and span-level labels. Further, we propose a consistent loss, based on bidirectional Kullback-Leibler divergence with temperature, to make the outputs of the two networks consistent. During the inference phase, we propose a consistent greedy inference algorithm that first adjusts the predicted probabilities of the span network and then greedily selects non-overlapping spans with the maximum adjusted probability. Extensive experiments show that unifying token and span level supervisions is effective and that our model achieves new state-of-the-art results on three benchmark datasets.
In the future, we plan to deepen and widen our work in the following aspects: (1) This work is a preliminary exploration of unifying token and span level supervisions for few-shot sequence labeling. The proposed method is demonstrated to be simple and effective, but whether more complicated token-level and span-level networks are also effective has not been explored and is worthy of further study. (2) The proposed method has only been verified on English texts. Its effectiveness on datasets in other languages deserves further verification. (3) The effectiveness of our model on traditional supervised sequence labeling also deserves further exploration.
References
- Agosti et al. (2020) Maristella Agosti, Stefano Marchesin, and Gianmaria Silvello. 2020. Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval. ACM Trans. Inf. Syst. 38, 4 (2020), 38:1–38:48. https://doi.org/10.1145/3417996
- Athiwaratkun et al. (2020) Ben Athiwaratkun, Cícero Nogueira dos Santos, Jason Krone, and Bing Xiang. 2020. Augmented Natural Language for Generative Sequence Labeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020. 375–385. https://doi.org/10.18653/v1/2020.emnlp-main.27
- Chen et al. (2022) Jiawei Chen, Qing Liu, Hongyu Lin, Xianpei Han, and Le Sun. 2022. Few-shot Named Entity Recognition with Self-describing Networks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. 5711–5722. https://doi.org/10.18653/v1/2022.acl-long.392
- Chen et al. (2019) Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. 2019. A Closer Look at Few-shot Classification. In 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net. https://openreview.net/forum?id=HkxLXnAcFQ
- Chen et al. (2021) Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, and Xiaolong Wang. 2021. Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 9042–9051. https://doi.org/10.1109/ICCV48922.2021.00893
- Cheng et al. (2021) Zifeng Cheng, Zhiwei Jiang, Yafeng Yin, Na Li, and Qing Gu. 2021. A Unified Target-Oriented Sequence-to-Sequence Model for Emotion-Cause Pair Extraction. IEEE ACM Trans. Audio Speech Lang. Process. 29 (2021), 2779–2791. https://doi.org/10.1109/TASLP.2021.3102194
- Chiu and Nichols (2016) Jason P. C. Chiu and Eric Nichols. 2016. Named Entity Recognition with Bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguistics 4 (2016), 357–370. https://doi.org/10.1162/tacl_a_00104
- Colombo et al. (2020) Pierre Colombo, Emile Chapuis, Matteo Manica, Emmanuel Vignon, Giovanna Varni, and Chloé Clavel. 2020. Guiding Attention in Sequence-to-Sequence Models for Dialogue Act Prediction. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020. 7594–7601. https://ojs.aaai.org/index.php/AAAI/article/view/6259
- Coucke et al. (2018) Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, and Joseph Dureau. 2018. Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces. CoRR abs/1805.10190 (2018). http://arxiv.org/abs/1805.10190
- Cui et al. (2021) Leyang Cui, Yu Wu, Jian Liu, Sen Yang, and Yue Zhang. 2021. Template-Based Named Entity Recognition Using BART. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021. 1835–1845. https://doi.org/10.18653/v1/2021.findings-acl.161
- Cui and Zhang (2019) Leyang Cui and Yue Zhang. 2019. Hierarchically-Refined Label Attention Network for Sequence Labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. 4113–4126. https://doi.org/10.18653/v1/D19-1422
- Das et al. (2022) Sarkar Snigdha Sarathi Das, Arzoo Katiyar, Rebecca J. Passonneau, and Rui Zhang. 2022. CONTaiNER: Few-Shot Named Entity Recognition via Contrastive Learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. 6338–6353. https://doi.org/10.18653/v1/2022.acl-long.439
- Derczynski et al. (2017) Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, NUT@EMNLP 2017. Association for Computational Linguistics, 140–147. https://doi.org/10.18653/v1/w17-4418
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019. 4171–4186. https://doi.org/10.18653/v1/n19-1423
- Dhillon et al. (2020) Guneet Singh Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. 2020. A Baseline for Few-Shot Image Classification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rylXBkrYDS
- Ding et al. (2021) Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Haitao Zheng, and Zhiyuan Liu. 2021. Few-NERD: A Few-shot Named Entity Recognition Dataset. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. 3198–3213. https://doi.org/10.18653/v1/2021.acl-long.248
- Fei-Fei et al. (2006) Li Fei-Fei, Robert Fergus, and Pietro Perona. 2006. One-Shot Learning of Object Categories. IEEE Trans. Pattern Anal. Mach. Intell. 28, 4 (2006), 594–611. https://doi.org/10.1109/TPAMI.2006.79
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017. 1126–1135. http://proceedings.mlr.press/v70/finn17a.html
- Finn et al. (2018) Chelsea Finn, Kelvin Xu, and Sergey Levine. 2018. Probabilistic Model-Agnostic Meta-Learning. In Advances in Neural Information Processing Systems, NeurIPS 2018. 9537–9548. https://proceedings.neurips.cc/paper/2018/hash/8e2c381d4dd04f1c55093f22c59c3a08-Abstract.html
- Fritzler et al. (2019) Alexander Fritzler, Varvara Logacheva, and Maksim Kretov. 2019. Few-shot classification in named entity recognition task. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, SAC 2019. 993–1000. https://doi.org/10.1145/3297280.3297378
- Fu et al. (2021) Jinlan Fu, Xuanjing Huang, and Pengfei Liu. 2021. SpanNER: Named Entity Re-/Recognition as Span Prediction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. 7183–7195. https://doi.org/10.18653/v1/2021.acl-long.558
- Gao et al. (2019) Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. 2019. Hybrid Attention-Based Prototypical Networks for Noisy Few-Shot Relation Classification. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019. AAAI Press, 6407–6414. https://doi.org/10.1609/aaai.v33i01.33016407
- Guo et al. (2022) Jiafeng Guo, Yinqiong Cai, Yixing Fan, Fei Sun, Ruqing Zhang, and Xueqi Cheng. 2022. Semantic Models for the First-Stage Retrieval: A Comprehensive Review. ACM Trans. Inf. Syst. 40, 4 (2022), 66:1–66:42. https://doi.org/10.1145/3486250
- Henderson and Vulic (2021) Matthew Henderson and Ivan Vulic. 2021. ConVEx: Data-Efficient and Few-Shot Slot Labeling. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021. 3375–3389. https://doi.org/10.18653/v1/2021.naacl-main.264
- Hou et al. (2019) Ruibing Hou, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. 2019. Cross Attention Network for Few-shot Classification. In Advances in Neural Information Processing Systems, NeurIPS 2019. 4005–4016. https://proceedings.neurips.cc/paper/2019/hash/01894d6f048493d2cacde3c579c315a3-Abstract.html
- Hou et al. (2020) Yutai Hou, Wanxiang Che, Yongkui Lai, Zhihan Zhou, Yijia Liu, Han Liu, and Ting Liu. 2020. Few-shot Slot Tagging with Collapsed Dependency Transfer and Label-enhanced Task-adaptive Projection Network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. 1381–1393. https://doi.org/10.18653/v1/2020.acl-main.128
- Hou et al. (2022) Yutai Hou, Cheng Chen, Xianzhen Luo, Bohan Li, and Wanxiang Che. 2022. Inverse is Better! Fast and Accurate Prompt for Few-shot Slot Tagging. In Findings of the Association for Computational Linguistics: ACL 2022. 637–647. https://doi.org/10.18653/v1/2022.findings-acl.53
- Huang et al. (2021) Jiaxin Huang, Chunyuan Li, Krishan Subudhi, Damien Jose, Shobana Balakrishnan, Weizhu Chen, Baolin Peng, Jianfeng Gao, and Jiawei Han. 2021. Few-Shot Named Entity Recognition: An Empirical Baseline Study. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021. 10408–10423. https://doi.org/10.18653/v1/2021.emnlp-main.813
- Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR abs/1508.01991 (2015). http://arxiv.org/abs/1508.01991
- Jiang et al. (2020) Zhengbao Jiang, Wei Xu, Jun Araki, and Graham Neubig. 2020. Generalizing Natural Language Analysis through Span-relation Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online. Association for Computational Linguistics, 2120–2133. https://doi.org/10.18653/v1/2020.acl-main.192
- Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 260–270. https://doi.org/10.18653/v1/n16-1030
- Lee et al. (2022) Dong-Ho Lee, Akshen Kadakia, Kangmin Tan, Mahak Agarwal, Xinyu Feng, Takashi Shibuya, Ryosuke Mitani, Toshiyuki Sekiya, Jay Pujara, and Xiang Ren. 2022. Good Examples Make A Faster Learner: Simple Demonstration-based Learning for Low-resource NER. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. 2687–2700. https://doi.org/10.18653/v1/2022.acl-long.192
- Li et al. (2020b) Aoxue Li, Weiran Huang, Xu Lan, Jiashi Feng, Zhenguo Li, and Liwei Wang. 2020b. Boosting Few-Shot Learning With Adaptive Margin Loss. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020. Computer Vision Foundation / IEEE, 12573–12581. https://doi.org/10.1109/CVPR42600.2020.01259
- Li et al. (2022) Jing Li, Billy Chiu, Shanshan Feng, and Hao Wang. 2022. Few-Shot Named Entity Recognition via Meta-Learning. IEEE Trans. Knowl. Data Eng. 34, 9 (2022), 4245–4256. https://doi.org/10.1109/TKDE.2020.3038670
- Li et al. (2020d) Kai Li, Yulun Zhang, Kunpeng Li, and Yun Fu. 2020d. Adversarial Feature Hallucination Networks for Few-Shot Learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020. Computer Vision Foundation / IEEE, 13467–13476. https://doi.org/10.1109/CVPR42600.2020.01348
- Li et al. (2020c) Ruirui Li, Xian Wu, Xiusi Chen, and Wei Wang. 2020c. Few-Shot Learning for New User Recommendation in Location-based Social Networks. In WWW ’20: The Web Conference 2020, Taipei, Taiwan. 2472–2478. https://doi.org/10.1145/3366423.3379994
- Li et al. (2020a) Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020a. A Unified MRC Framework for Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. 5849–5859. https://doi.org/10.18653/v1/2020.acl-main.519
- Li et al. (2021) Xiangsheng Li, Jiaxin Mao, Weizhi Ma, Yiqun Liu, Min Zhang, Shaoping Ma, Zhaowei Wang, and Xiuqiang He. 2021. Topic-enhanced knowledge-aware retrieval model for diverse relevance estimation. In WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021. ACM / IW3C2, 756–767. https://doi.org/10.1145/3442381.3449943
- Li et al. (2015) Zhenghua Li, Jiayuan Chao, Min Zhang, and Wenliang Chen. 2015. Coupled Sequence Labeling on Heterogeneous Annotations: POS Tagging as a Case Study. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015. 1783–1792. https://doi.org/10.3115/v1/p15-1172
- Liu et al. (2020) Bulou Liu, Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, and Haiqing Chen. 2020. An Attention-based Deep Relevance Model for Few-shot Document Filtering. ACM Trans. Inf. Syst. 39, 1 (2020), 6:1–6:35. https://doi.org/10.1145/3419972
- Liu et al. (2022) Yonghao Liu, Mengyu Li, Ximing Li, Fausto Giunchiglia, Xiaoyue Feng, and Renchu Guan. 2022. Few-shot Node Classification on Attributed Networks with Graph Meta-learning. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 471–481. https://doi.org/10.1145/3477495.3531978
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019. https://openreview.net/forum?id=Bkg6RiCqY7
- Ma et al. (2019) Dehong Ma, Sujian Li, Fangzhao Wu, Xing Xie, and Houfeng Wang. 2019. Exploring Sequence-to-Sequence Learning in Aspect Term Extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019. 3538–3547. https://doi.org/10.18653/v1/p19-1344
- Ma et al. (2022a) Jie Ma, Miguel Ballesteros, Srikanth Doss, Rishita Anubhai, Sunil Mallya, Yaser Al-Onaizan, and Dan Roth. 2022a. Label Semantics for Few Shot Named Entity Recognition. In Findings of the Association for Computational Linguistics: ACL 2022. 1956–1971. https://doi.org/10.18653/v1/2022.findings-acl.155
- Ma et al. (2021) Jianqiang Ma, Zeyu Yan, Chang Li, and Yang Zhang. 2021. Frustratingly Simple Few-Shot Slot Tagging. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021. 1028–1033. https://doi.org/10.18653/v1/2021.findings-acl.88
- Ma et al. (2022c) Longxuan Ma, Mingda Li, Wei-Nan Zhang, Jiapeng Li, and Ting Liu. 2022c. Unstructured Text Enhanced Open-Domain Dialogue System: A Systematic Survey. ACM Trans. Inf. Syst. 40, 1 (2022), 9:1–9:44. https://doi.org/10.1145/3464377
- Ma et al. (2022d) Ruotian Ma, Xin Zhou, Tao Gui, Yiding Tan, Linyang Li, Qi Zhang, and Xuanjing Huang. 2022d. Template-free Prompt Tuning for Few-shot NER. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022. 5721–5732. https://doi.org/10.18653/v1/2022.naacl-main.420
- Ma et al. (2022b) Tingting Ma, Huiqiang Jiang, Qianhui Wu, Tiejun Zhao, and Chin-Yew Lin. 2022b. Decomposed Meta-Learning for Few-Shot Named Entity Recognition. In Findings of the Association for Computational Linguistics: ACL 2022. 1584–1596. https://doi.org/10.18653/v1/2022.findings-acl.124
- Ma and Hovy (2016) Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016. https://doi.org/10.18653/v1/p16-1101
- Oguz and Vu (2021) Cennet Oguz and Ngoc Thang Vu. 2021. Few-shot Learning for Slot Tagging with Attentive Relational Network. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021. 1566–1572. https://doi.org/10.18653/v1/2021.eacl-main.134
- Ouchi et al. (2020) Hiroki Ouchi, Jun Suzuki, Sosuke Kobayashi, Sho Yokoi, Tatsuki Kuribayashi, Ryuto Konno, and Kentaro Inui. 2020. Instance-Based Learning of Span Representations: A Case Study through Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. Association for Computational Linguistics, 6452–6459. https://doi.org/10.18653/v1/2020.acl-main.575
- Pradhan et al. (2013) Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards Robust Linguistic Analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, CoNLL 2013. ACL, 143–152. https://aclanthology.org/W13-3516/
- Qin et al. (2019) Libo Qin, Wanxiang Che, Yangming Li, Haoyang Wen, and Ting Liu. 2019. A Stack-Propagation Framework with Token-Level Intent Detection for Spoken Language Understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. Association for Computational Linguistics, 2078–2087. https://doi.org/10.18653/v1/D19-1214
- Sang and Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003. ACL, 142–147. https://aclanthology.org/W03-0419/
- Shen et al. (2022) Yongliang Shen, Xiaobin Wang, Zeqi Tan, Guangwei Xu, Pengjun Xie, Fei Huang, Weiming Lu, and Yueting Zhuang. 2022. Parallel Instance Query Network for Named Entity Recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. Association for Computational Linguistics, 947–961. https://doi.org/10.18653/v1/2022.acl-long.67
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems 30. https://proceedings.neurips.cc/paper/2017/hash/cb8da6767461f2812ae4290eac7cbc42-Abstract.html
- Sun et al. (2019) Shengli Sun, Qingfeng Sun, Kevin Zhou, and Tengchao Lv. 2019. Hierarchical Attention Prototypical Networks for Few-Shot Text Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. Association for Computational Linguistics, 476–485. https://doi.org/10.18653/v1/D19-1045
- Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. 2018. Learning to Compare: Relation Network for Few-Shot Learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018. 1199–1208. http://openaccess.thecvf.com/content_cvpr_2018/html/Sung_Learning_to_Compare_CVPR_2018_paper.html
- Tan et al. (2021) Zeqi Tan, Yongliang Shen, Shuai Zhang, Weiming Lu, and Yueting Zhuang. 2021. A Sequence-to-Set Network for Nested Named Entity Recognition. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021. ijcai.org, 3936–3942. https://doi.org/10.24963/ijcai.2021/542
- Tong et al. (2021) Meihan Tong, Shuai Wang, Bin Xu, Yixin Cao, Minghui Liu, Lei Hou, and Juanzi Li. 2021. Learning from Miscellaneous Other-Class Words for Few-shot Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. 6236–6247. https://doi.org/10.18653/v1/2021.acl-long.487
- Vakulenko et al. (2021) Svitlana Vakulenko, Evangelos Kanoulas, and Maarten de Rijke. 2021. A Large-scale Analysis of Mixed Initiative in Information-Seeking Dialogues for Conversational Search. ACM Trans. Inf. Syst. 39, 4 (2021), 49:1–49:32. https://doi.org/10.1145/3466796
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017. 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
- Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems 29. https://proceedings.neurips.cc/paper/2016/hash/90e1357833654983612fb05e3ec9148c-Abstract.html
- Wang et al. (2022) Peiyi Wang, Runxin Xu, Tianyu Liu, Qingyu Zhou, Yunbo Cao, Baobao Chang, and Zhifang Sui. 2022. An Enhanced Span-based Decomposition Method for Few-Shot Sequence Labeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022. 5012–5024. https://doi.org/10.18653/v1/2022.naacl-main.369
- Wang et al. (2021) Yaqing Wang, Haoda Chu, Chao Zhang, and Jing Gao. 2021. Learning from Language Description: Low-shot Named Entity Recognition via Decomposed Framework. In Findings of the Association for Computational Linguistics: EMNLP 2021. 1618–1630. https://doi.org/10.18653/v1/2021.findings-emnlp.139
- Wang et al. (2018) Yu-Xiong Wang, Ross B. Girshick, Martial Hebert, and Bharath Hariharan. 2018. Low-Shot Learning From Imaginary Data. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. Computer Vision Foundation / IEEE Computer Society, 7278–7286. https://doi.org/10.1109/CVPR.2018.00760
- Yan et al. (2021) Hang Yan, Tao Gui, Junqi Dai, Qipeng Guo, Zheng Zhang, and Xipeng Qiu. 2021. A Unified Generative Framework for Various NER Subtasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. Association for Computational Linguistics, 5808–5822. https://doi.org/10.18653/v1/2021.acl-long.451
- Yang and Katiyar (2020) Yi Yang and Arzoo Katiyar. 2020. Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020. 6365–6375. https://doi.org/10.18653/v1/2020.emnlp-main.516
- Yoon et al. (2019) Sung Whan Yoon, Jun Seo, and Jaekyun Moon. 2019. TapNet: Neural Network Augmented with Task-Adaptive Projection for Few-Shot Learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019. 7115–7123. http://proceedings.mlr.press/v97/yoon19a.html
- Yu et al. (2021a) Dian Yu, Luheng He, Yuan Zhang, Xinya Du, Panupong Pasupat, and Qi Li. 2021a. Few-shot Intent Classification and Slot Filling with Retrieved Examples. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021. 734–749. https://doi.org/10.18653/v1/2021.naacl-main.59
- Yu et al. (2020) Juntao Yu, Bernd Bohnet, and Massimo Poesio. 2020. Named Entity Recognition as Dependency Parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. 6470–6476. https://doi.org/10.18653/v1/2020.acl-main.577
- Yu et al. (2021b) Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and Zhiyuan Liu. 2021b. Few-Shot Conversational Dense Retrieval. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 829–838. https://doi.org/10.1145/3404835.3462856
- Zeldes (2017) Amir Zeldes. 2017. The GUM corpus: creating multilayer resources in the classroom. Lang. Resour. Evaluation 51, 3 (2017), 581–612. https://doi.org/10.1007/s10579-016-9343-x
- Zhai et al. (2017) Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. 2017. Neural Models for Sequence Chunking. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. 3365–3371. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14776
- Zhang et al. (2018) Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. 2018. MetaGAN: An Adversarial Approach to Few-Shot Learning. In Advances in Neural Information Processing Systems, NeurIPS 2018. 2371–2380. https://proceedings.neurips.cc/paper/2018/hash/4e4e53aa080247bc31d0eb4e7aeb07a0-Abstract.html
- Zhang et al. (2022a) Shichao Zhang, Jiaye Li, and Yangding Li. 2022a. Reachable Distance Function for KNN Classification. IEEE Trans. Knowl. Data Eng. (2022).
- Zhang et al. (2022b) Shuai Zhang, Yongliang Shen, Zeqi Tan, Yiquan Wu, and Weiming Lu. 2022b. De-Bias for Generative Extraction in Unified NER Task. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. Association for Computational Linguistics, 808–818. https://doi.org/10.18653/v1/2022.acl-long.59
- Zhou et al. (2022) Jing Zhou, Yanan Zheng, Jie Tang, Li Jian, and Zhilin Yang. 2022. FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. Association for Computational Linguistics, 8646–8665. https://doi.org/10.18653/v1/2022.acl-long.592
- Zhu et al. (2022) Enwei Zhu, Yiyang Liu, and Jinpeng Li. 2022. Deep Span Representations for Named Entity Recognition. CoRR abs/2210.04182 (2022). https://doi.org/10.48550/arXiv.2210.04182