
Unifying Token and Span Level Supervisions for Few-Shot Sequence Labeling

Zifeng Cheng ([email protected]), State Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Ave, Nanjing, Jiangsu, China 210023
Qingyu Zhou ([email protected]), Tencent Cloud Xiaowei, Beijing, China 100000
Zhiwei Jiang ([email protected]), State Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Ave, Nanjing, Jiangsu, China 210023
Xuemin Zhao ([email protected]), Tencent Cloud Xiaowei, Chengdu, Sichuan, China 610000
Yunbo Cao ([email protected]), Tencent Cloud Xiaowei, Beijing, China 100000
Qing Gu ([email protected]), State Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Ave, Nanjing, Jiangsu, China 210023
(2023)
Abstract.

Few-shot sequence labeling aims to identify novel classes based on only a few labeled samples. Existing methods solve the data scarcity problem mainly by designing token-level or span-level labeling models based on metric learning. However, these methods are only trained at a single granularity (i.e., either token level or span level) and have some weaknesses specific to that granularity. In this paper, we first unify token-level and span-level supervisions and propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot sequence labeling. CDAP contains a token-level network and a span-level network, jointly trained at different granularities. To align the outputs of the two networks, we further propose a consistent loss that enables them to learn from each other. During the inference phase, we propose a consistent greedy inference algorithm that first adjusts the predicted probabilities and then greedily selects non-overlapping spans with maximum probability. Extensive experiments show that our model achieves new state-of-the-art results on three benchmark datasets. All the code and data of this work will be released at https://github.com/zifengcheng/CDAP.

Few-Shot Sequence Labeling, Few-Shot Learning, Sequence Labeling
This work is supported by National Natural Science Foundation of China under Grant Nos. 61972192, 62172208, 61906085, 41972111. This work is partially supported by Collaborative Innovation Center of Novel Software Technology and Industrialization.

1. Introduction

Sequence labeling tasks such as Named Entity Recognition (NER) and Slot Tagging (ST) are fundamental tasks in information extraction, benefiting query and document understanding in information retrieval (Guo et al., 2022; Agosti et al., 2020; Li et al., 2021), request analysis in task-oriented dialog systems (Vakulenko et al., 2021; Ma et al., 2022c), and so forth. Traditional sequence labeling methods are often supervised and require a large amount of annotated data for training. This limits their application in real scenarios where the acquisition of annotated data is laborious and expensive.

In recent years, few-shot sequence labeling has attracted much attention, as it aims to extract novel classes based on only a few labeled samples (Fritzler et al., 2019; Hou et al., 2020; Yang and Katiyar, 2020; Tong et al., 2021). Table 1 shows a 2-way 1-shot example, which contains two novel classes (i.e., building-other and building-library), each with one sample in the support set. Few-shot sequence labeling aims to learn a model that borrows prior experience from old domains and adapts quickly to new domains with only very few samples. For example, based on the support set, the model should extract “Brooklyn Bridge” and “Oxford University Museum of Natural History” as “building-other”, and “denver public library” and “Radcliffe Science Library” as “building-library” in the query set.

Some prevalent methods for few-shot sequence labeling focus on token-level metric learning (Fritzler et al., 2019; Hou et al., 2020; Yang and Katiyar, 2020; Oguz and Vu, 2021; Ma et al., 2022a; Das et al., 2022; Li et al., 2022), in which the model assigns a label to each query token based on a learned distance metric. For example, Fritzler et al. (2019) construct a prototype for each class to classify query tokens, and Das et al. (2022) optimize token-level Gaussian-distributed embeddings to classify query tokens. Recently, some methods (Yu et al., 2021a; Wang et al., 2021; Ma et al., 2022b; Wang et al., 2022) focus on span-level metric learning, measuring span-level similarity for classification, and achieve state-of-the-art performance. For example, Wang et al. (2021) propose a two-step model, which first detects spans and then classifies them based on a span-level metric.

Table 1. An example of 2-way 1-shot sequence labeling task. Novel classes are building-other and building-library. It is worth noting that in this example two novel classes have the same sentence in the support set.
Support Set $\mathcal{S}$: (1) The museum was established as the Municipal Gallery of Hamilton [building-other] in January 1914, and was opened to the public in June 1914, at a Hamilton Public Library [building-library] building on Main Street West.
Query Set $\mathcal{Q}$: (1) The Brooklyn Bridge [building-other] ceramic tiles display the bridge’s vertical cables but do not depict its diagonal cables. (2) The reading program had been discontinued, library patronage had dropped, and the denver public library [building-library] was planning to shutter the facility. (3) The current Radcliffe Science Library [building-library] building is located next to the Oxford University Museum of Natural History [building-other] and consists of three parts.

Despite their success, these token-level and span-level metric learning methods are only trained at a single granularity, so they inevitably have some weaknesses specific to their granularity. For the token-level metric learning methods, their scheme of labeling tokens separately is prone to ignoring the integrality of named entities and dialog slots with multiple tokens. Taking the sentence of the support set in Table 1 as an example, to extract the complete entity “Municipal Gallery of Hamilton”, each of its constituent tokens, “Municipal”, “Gallery”, “of”, and “Hamilton”, must be labeled separately, potentially losing information about the entity as a whole. As a consequence, it is challenging for token-level representation learning to capture the full semantics of multi-token entities and slots (Wang et al., 2022). For the span-level metric learning methods, since they need to enumerate all possible spans for classification, the few-shot labels of entities and slots are easily diluted by a large number of non-entity and non-slot labels. For example, if we set the maximum span length to 4, the spans containing the word “Municipal” include “Municipal”, “the Municipal”, “Municipal Gallery”, “as the Municipal”, “Municipal Gallery of”, “established as the Municipal”, “Municipal Gallery of Hamilton”, and so forth. Among these spans, only “Municipal Gallery of Hamilton” is an entity, while the others are non-entities. According to our statistics on the real dataset FewNERD, the ratio of entity spans to non-entity spans can even exceed 1:100. As a result, when entity spans are rare in the few-shot setting, it becomes challenging for span-level metric learning methods to distinguish them from a large number of similar non-entity spans.

These weaknesses are difficult to deal with when only a single-granularity model is trained, but they can be alleviated when both token-level and span-level labels are utilized in a unified view for model training. On the one hand, span-level labeling can help token-level labeling to maintain entity integrality and capture the full semantics of multi-token entities. On the other hand, token-level labeling can in turn help span-level labeling to distinguish easily confused spans.

To this end, we propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot sequence labeling. CDAP jointly trains token-level and span-level networks and makes them benefit from each other to produce better predictions. Specifically, we employ an Adaptive Prototypical Network (APN) to generate adaptive prototypes specific to each query in the few-shot setting and use the APN for both token-level and span-level labeling. Besides, considering that the token-level and span-level networks may produce different predictions, we further propose a consistent loss to align their predictions based on a bidirectional Kullback-Leibler divergence with temperature. During the inference phase, we propose a consistent greedy inference algorithm to combine the predictions of both networks. The core idea of the algorithm is to adjust the prediction probability of each span based on the predictions of the tokens within the span.

The main contributions of this paper can be summarized as follows:

  • To the best of our knowledge, we are the first to unify token-level and span-level supervisions to train a few-shot sequence labeling model.

  • We propose a consistent dual adaptive prototypical network for few-shot sequence labeling and a consistent greedy inference algorithm to combine the results of two networks.

  • Extensive experiments show that our model achieves new state-of-the-art performance on three benchmark datasets.

The rest of this paper is organized as follows. We first review the related work in Section 2. Then, we formulate the task and describe our proposed method in detail in Sections 3 and 4, respectively. After that, we conduct experiments and analyze the results in Section 5. Finally, we conclude our work along with some future work in Section 6.

2. Related Work

In this section, we introduce the following three research topics relevant to our work: few-shot learning, sequence labeling, and few-shot sequence labeling.

2.1. Few-Shot Learning

Few-shot learning aims to recognize novel classes with only a few samples per class and was first proposed in the computer vision community (Fei-Fei et al., 2006). Motivated by a simple machine learning principle (i.e., train and test conditions must match), Matching Networks (Vinyals et al., 2016) first model few-shot learning as an episodic N-way K-shot task, where each episode has N classes and each class has K annotated samples (K is very small, e.g., 5). Afterwards, many methods have been proposed, such as metric learning methods, optimization methods, and generation methods. Metric learning methods aim to learn a suitable embedding space for distance metrics, including cosine similarity (Vinyals et al., 2016), Euclidean distance to a prototype (i.e., mean class representation) (Snell et al., 2017), learnable relation modules (Sung et al., 2018), nearest neighbor (Yang and Katiyar, 2020), and reachable distance (Zhang et al., 2022a). For example, Snell et al. (2017) first construct prototypes by averaging the representations of each class in the support set and then optimize the network using a softmax function over the Euclidean distances between all prototypes and the query representation. Some other methods have focused on improving the transferability of features, including using a cross-attention module to exploit the semantic relevance between support and query features (Hou et al., 2019), an adaptive margin loss (Li et al., 2020b), and so on. Recently, some methods (Chen et al., 2019; Dhillon et al., 2020; Chen et al., 2021) propose a transfer-learning framework that pre-trains on the training set and fine-tunes on the support set of the test episode. Model-agnostic meta-learning (MAML) (Finn et al., 2017) is a typical optimization method that aims to learn a good model initialization and then quickly adapt to test episodes. Finn et al. (2018) further extend MAML, which adapts to new tasks via gradient descent, to incorporate a parameter distribution that is trained via a variational lower bound. Generation methods aim to generate samples (Wang et al., 2018; Zhang et al., 2018; Zhou et al., 2022) or features (Li et al., 2020d) to augment the training set. Specifically, Zhang et al. (2018) propose to augment vanilla few-shot classification models with a GAN for better generalization. FlipDA (Zhou et al., 2022) jointly uses a generative model and a classifier to generate label-flipped data, which brings a more significant performance improvement than label-preserved data. AFHN (Li et al., 2020d) generates diverse and discriminative features conditioned on the few labeled samples based on a cWGAN.

Recently, few-shot learning has received increasing attention in the IR and NLP communities, including recommendation (Li et al., 2020c), document filtering (Liu et al., 2020), dense retrieval (Yu et al., 2021b), node classification (Liu et al., 2022), and so on.

2.2. Sequence Labeling

Sequence labeling is a fundamental NLP task that covers named entity recognition (Lample et al., 2016), part-of-speech tagging (Li et al., 2015), slot tagging (Qin et al., 2019), and so on. Recently, neural sequence labeling models have become the mainstream approach for sequence labeling.

Typical neural sequence labeling models (Huang et al., 2015; Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) generally use a CNN or LSTM to extract features and utilize softmax or a Conditional Random Field (CRF) for classification. For example, Chiu and Nichols (2016) use a CNN to extract character-level features, an LSTM to extract word-level features, and softmax to classify words. However, Cui and Zhang (2019) find that BiLSTM-CRF does not always lead to better results than BiLSTM-Softmax, and propose a hierarchically-refined label attention network, which explicitly leverages label embeddings and captures potential long-term label dependency by giving each word incrementally refined label distributions with hierarchical attention. Afterwards, some work (Yu et al., 2020; Fu et al., 2021; Ouchi et al., 2020; Jiang et al., 2020; Zhu et al., 2022) adopts span-level prediction, which first detects spans by enumeration or boundary identification and then classifies them. For example, Yu et al. (2020) use a biaffine classifier to assign scores to all spans in a sentence. Fu et al. (2021) first compare and analyze the advantages and disadvantages of traditional token-level models and span-level models in detail, and then reveal that span prediction can serve as a system combiner to re-recognize named entities from different systems’ outputs.

Besides, four novel paradigms for sequence labeling have recently been proposed, reformulating it as sequence-to-sequence learning, generation, set prediction, and machine reading comprehension tasks, respectively. First, considering that sequence-to-sequence learning has a good ability to learn complex global label dependencies, some work applies it to sequence labeling tasks such as chunking (Zhai et al., 2017), aspect term extraction (Ma et al., 2019), emotion-cause pair extraction (Cheng et al., 2021), and dialogue act prediction (Colombo et al., 2020). For example, in order to make sequence-to-sequence learning suitable for sequence labeling, Ma et al. (2019) design gated unit networks to incorporate the corresponding word representation into the decoder and position-aware attention to pay more attention to the adjacent words of a target word. Second, some works use generation to solve sequence labeling. Specifically, Athiwaratkun et al. (2020) propose to generate an augmented output that repeats the original input sequence with additional markers indicating the token spans and their associated labels. Yan et al. (2021) propose a unified generation framework with a pointer mechanism to tackle flat, nested, and discontinuous NER subtasks. Zhang et al. (2022b) analyze the incorrect biases in the generation process from a causality perspective and design intra- and inter-entity deconfounding data augmentation methods to eliminate confounders. Third, Tan et al. (2021) first propose a sequence-to-set network for NER, which provides a fixed set of entity queries to replace the explicit candidate spans and uses self-attention to capture the dependencies between entities. Finally, Li et al. (2020a) first propose a machine reading comprehension framework for the NER task and construct type-specific queries using semantic prior information about entity categories. However, type-specific queries can only extract one entity type per inference and rely on external knowledge to construct queries. Shen et al. (2022) further set up global and learnable instance queries to extract entities from a sentence in a parallel manner.

2.3. Few-Shot Sequence Labeling

Few-shot sequence labeling aims to identify novel classes based on only a few labeled samples and has gained widespread attention due to its application value. We roughly divide the existing few-shot sequence labeling methods into three groups.

The first group consists of token-level metric learning methods. Fritzler et al. (2019) first use prototypical networks to solve this task. Afterwards, L-TapNet (Hou et al., 2020) and StructShot (Yang and Katiyar, 2020) both explore CRFs to model label dependency in the few-shot setting. L-TapNet (Hou et al., 2020) extends task-adaptive projection networks (TapNet) (Yoon et al., 2019) and improves label representation with label names. StructShot (Yang and Katiyar, 2020) trains the model with the traditional cross-entropy loss and uses nearest-neighbor classification. Huang et al. (2021) explore noisy supervised pre-training and self-training in the few-shot sequence labeling setting. Motivated by the observation that the non-entity class O has rich semantics, MUCO (Tong et al., 2021) introduces a binary classifier to determine whether a token is an entity or not. Ma et al. (2022a) propose to leverage the semantics of label names and use two separate BERT encoders for labels and tokens in a token-level similarity metric. Instead of optimizing class-specific attributes, CONTAINER (Das et al., 2022) models a Gaussian embedding for each token and optimizes inter-token distribution distance to improve generalizability.

The second group consists of span-level methods. SpanNER (Wang et al., 2021) is a two-step method, which first detects start and end tokens to extract spans and then introduces class descriptions to classify the spans. Ma et al. (2022b) further propose a model-agnostic meta-learning (MAML) (Finn et al., 2017) enhanced two-step method, which uses the BIESO scheme to detect spans and prototypical networks to classify them. Retriever (Yu et al., 2021a) proposes a span-level retrieval method that learns similar contextualized representations for spans with the same label via a novel batch-softmax objective. ESD (Wang et al., 2022) proposes inter-span and cross-span attention to enhance span representations, and uses instance span attention to construct prototypes.

The third group of methods mitigates data scarcity with other paradigms, including machine reading comprehension (Ma et al., 2021), generation (Athiwaratkun et al., 2020; Chen et al., 2022), pairwise cloze (Henderson and Vulic, 2021), and prompt tuning (Cui et al., 2021; Lee et al., 2022; Ma et al., 2022d; Hou et al., 2022). For the machine reading comprehension method, Ma et al. (2021) enumerate all the slots to extract the answer from the sentence. The MRC-based method can naturally encode label semantics in the form of questions. For generation methods, Athiwaratkun et al. (2020) propose to generate an augmented natural language output, including the original input sequence, delimiters, and labels. Self-describing Networks (Chen et al., 2022) is a sequence-to-sequence generation network that can universally describe mentions using concepts, automatically map novel entity types to concepts, and adaptively recognize entities on demand. For prompt tuning methods, Template (Cui et al., 2021) treats NER as a language model ranking problem in a sequence-to-sequence framework, where the output sequence is filled by templates. Hou et al. (2022) propose an inverse prompt that predicts slot values given slot types, and further introduce an iterative prediction strategy to consider the relations between different slot types. Ma et al. (2022d) reformulate the NER task as an entity-oriented LM task, inducing the LM to predict label words at entity positions during fine-tuning, and propose a template-free prompt tuning method, EntLM. Lee et al. (2022) perform a systematic study of entity-oriented prompts (i.e., only entity tokens) and instance-oriented prompts (i.e., retrieved sentences).

3. Task Formulation

We define a sentence as $\boldsymbol{x}=\{x_{1},x_{2},\ldots,x_{n}\}$ and its label as $\boldsymbol{y}=\{(s_{i},y_{i})\}_{i=1}^{M}$, where $s_{i}$ is a span in sentence $\boldsymbol{x}$ (e.g., “nimbus” and “nimbus powerplant” in Figure 1), $y_{i}$ is the corresponding label (e.g., entity class “building-other” and non-entity class O), and $M$ is the number of spans in the sentence. Few-shot sequence labeling aims to train a model in the source domain that can accurately extract and classify all novel classes with a few labeled samples in the target domain. Episode training (Vinyals et al., 2016) is a common and effective way to train the model by imitating the test situation for few-shot learning. Each training/test episode contains an $N$-way $K$-shot support set $\mathcal{S}=\{(\boldsymbol{x}_{i},\boldsymbol{y}_{i})\}_{i=1}^{m_{s}}$ and a query set $\mathcal{Q}=\{(\boldsymbol{x}_{i},\boldsymbol{y}_{i})\}_{i=1}^{m_{q}}$, where the support set $\mathcal{S}$ includes $N$ classes ($N$-way) and each class has $K$ annotated examples ($K$-shot). In the training phase, the few-shot sequence labeling model is trained by predicting labels of the query set $\mathcal{Q}_{train}$ according to the labeled support set $\mathcal{S}_{train}$. In the testing phase, the model directly predicts labels of the query set $\mathcal{Q}_{test}$ on the unseen test dataset according to the labeled support set $\mathcal{S}_{test}$. To facilitate the illustration, we list the main notations used throughout this article in Table 2.

Table 2. Main Notations Used in this Article.
Notation Description
$\mathbf{H}\in\mathbb{R}^{n\times d}$ Token-level representation matrix for a sentence $\boldsymbol{x}$
$\mathbf{H}^{\mathbf{s}}_{j}\in\mathbb{R}^{C_{j}\times d}$ Representation matrix of all tokens of class $j$ in the support set
$\mathbf{h}_{i}^{\mathbf{q}}\in\mathbb{R}^{d}$ Token-level representation for query token $x_{i}$
$\mathbf{c}_{i}^{j}\in\mathbb{R}^{d}$ Token-level adaptive prototype of class $j$ for query token $x_{i}$
$n$ The length of sentence $\boldsymbol{x}$
$d$ The dimension of feature representation
$C_{j}$ The number of tokens of class $j$ in the support set
$\mathbf{S}\in\mathbb{R}^{S_{s}\times d}$ Representation matrix of all spans in the support set
$\mathbf{Q}\in\mathbb{R}^{S_{q}\times d}$ Representation matrix of all spans in the query set
$\bar{\mathbf{S}}\in\mathbb{R}^{S_{s}\times d}$ Representation matrix of all spans after the cross-attention module in the support set
$\bar{\mathbf{Q}}\in\mathbb{R}^{S_{q}\times d}$ Representation matrix of all spans after the cross-attention module in the query set
$\hat{\mathbf{o}}_{k}\in\mathbb{R}^{d}$ Span-level adaptive prototype of sub-class $O_{k}$ for query span $q_{i}$
$\mathbf{O}_{k}\in\mathbb{R}^{o_{k}\times d}$ Representation matrix of all spans in the support set with sub-class $O_{k}$
$\bar{\mathbf{q}}_{i}\in\mathbb{R}^{d}$ Span-level representation for query span $q_{i}$
$\mathbf{z}_{i}^{j}\in\mathbb{R}^{d}$ Span-level adaptive prototype of class $j$ for query span $q_{i}$
$S_{s}$ The number of spans in the support set
$S_{q}$ The number of spans in the query set
$o_{k}$ The number of spans with sub-class $O_{k}$ in the support set
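To make the episode structure of Section 3 concrete, the following is a minimal sketch of how a single $N$-way $K$-shot episode could be organized in Python; the field names and example sentences are our own illustration rather than an actual dataset schema.

```python
# A hypothetical 2-way 1-shot episode, mirroring Table 1. Spans are given as
# (start, end) word indices (inclusive) paired with their class labels.
episode = {
    "support": [  # N classes, K annotated sentences per class
        {"tokens": ["...", "Hamilton", "Public", "Library", "building", "..."],
         "spans": [((1, 3), "building-library")]},
    ],
    "query": [
        {"tokens": ["The", "Brooklyn", "Bridge", "ceramic", "tiles", "..."],
         "spans": [((1, 2), "building-other")]},
    ],
}
```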

4. Model

In this section, we present the details of our proposed model, as shown in Figure 1. Our model utilizes both token-level and span-level labels to mitigate the data scarcity problem in the few-shot setting from a unified perspective. We first explain the encoder module of our model (Section 4.1). Next, we present the model architecture of the token-level adaptive prototypical network (Section 4.2) and the span-level adaptive prototypical network (Section 4.3). We further propose a consistent loss to make the predictions of the two networks consistent (Section 4.4). Then we introduce the final loss function and the training flow of our model (Section 4.5). Finally, we propose a consistent greedy inference algorithm to combine the outputs of the two networks during the inference phase (Section 4.6).

Figure 1. The framework of our proposed model. The upper part corresponds to token-level network and the lower part corresponds to span-level network.

4.1. Encoder Module

We utilize BERT (Devlin et al., 2019) to encode sentences in the support and query sets. Given a sentence $\boldsymbol{x}=\{x_{1},x_{2},\ldots,x_{n}\}$, we insert a [CLS] and a [SEP] token into each sentence to get the input of BERT and feed it into BERT to get its contextual representations $\mathbf{U}=\{\mathbf{u}_{0},\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n},\mathbf{u}_{n+1}\}\in\mathbb{R}^{(n+2)\times d_{1}}$ as follows:

(1) $\mathbf{u}_{0},\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n},\mathbf{u}_{n+1}=\textrm{BERT}(\textrm{[CLS]},x_{1},x_{2},\ldots,x_{n},\textrm{[SEP]})$

We further use a linear layer to obtain the token representations $\mathbf{H}=\{\mathbf{h}_{1},\mathbf{h}_{2},\ldots,\mathbf{h}_{n}\}\in\mathbb{R}^{n\times d}$ as the input to the token-level network as follows:

(2) $\mathbf{h}_{i}=\mathbf{W}_{t}\mathbf{u}_{i}+\mathbf{b}_{t},\quad i=1,2,\ldots,n$

where $\mathbf{W}_{t}\in\mathbb{R}^{d\times d_{1}}$ and $\mathbf{b}_{t}\in\mathbb{R}^{d}$ are trainable parameters. Since the [CLS] and [SEP] tokens do not need to be classified, we do not feed them into the linear layer.

For the span-level network, we use boundary tokens (i.e., the left and right boundary tokens) and a linear layer to obtain the span representation as the input to the span-level network. Specifically, given a span $\mathbf{s}_{k}=\{x_{i},\ldots,x_{j}\}$, we represent the span as:

(3) $\mathbf{s}_{k}=\mathbf{W}_{s}[\mathbf{u}_{i}\oplus\mathbf{u}_{j}]+\mathbf{b}_{s}$

where $\mathbf{W}_{s}\in\mathbb{R}^{d\times 2d_{1}}$ and $\mathbf{b}_{s}\in\mathbb{R}^{d}$ are trainable parameters, and $\oplus$ is the concatenation operation. It is worth noting that we enumerate all spans in the sentence.
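As a rough illustration of Eqs. (1)-(3), the PyTorch sketch below encodes a sentence with BERT and builds the token and boundary-based span representations; the projected dimension $d$ and the assumption of one subword per word are simplifications of ours, not details from the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

d1, d = 768, 256  # BERT hidden size and projected dimension (d is our choice)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
token_proj = nn.Linear(d1, d)      # W_t, b_t in Eq. (2)
span_proj = nn.Linear(2 * d1, d)   # W_s, b_s in Eq. (3)

# Eq. (1): [CLS] x_1 ... x_n [SEP] -> contextual representations U
inputs = tokenizer("the museum was established", return_tensors="pt")
u = bert(**inputs).last_hidden_state.squeeze(0)  # (n+2, d1)

# Eq. (2): project every token except [CLS]/[SEP]
h = token_proj(u[1:-1])                          # (n, d)

def span_repr(i: int, j: int) -> torch.Tensor:
    """Eq. (3): concatenate the boundary tokens of span x_i..x_j (0-indexed
    words, assuming one subword per word) and project."""
    return span_proj(torch.cat([u[i + 1], u[j + 1]]))  # +1 skips [CLS]
```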

4.2. Token-Level Adaptive Prototypical Network

For the token-level network, we adopt an adaptive prototypical network to classify tokens. General prototypical networks (Snell et al., 2017) simply average all the tokens of the same class to get the prototype. However, the average operation does not reflect the importance of each token (Gao et al., 2019; Sun et al., 2019). Therefore, we propose to construct adaptive prototypes for each query token by considering the importance of each token. Specifically,

(4) $\alpha_{i}^{j}=\textrm{softmax}(\mathbf{H}^{\mathbf{s}}_{j}\mathbf{h}_{i}^{\mathbf{q}})$
(5) $\mathbf{c}_{i}^{j}=(\mathbf{H}^{\mathbf{s}}_{j})^{\mathrm{T}}\alpha_{i}^{j}$

where $\mathbf{h}_{i}^{\mathbf{q}}\in\mathbb{R}^{d}$ is the representation of query token $x_{i}$, $\mathbf{H}^{\mathbf{s}}_{j}\in\mathbb{R}^{C_{j}\times d}$ is the representation of all tokens of class $j$ in the support set, $C_{j}$ is the number of $j$-class tokens in the support set, $\alpha_{i}^{j}\in\mathbb{R}^{C_{j}}$ is the attention weight, and $\mathbf{c}_{i}^{j}\in\mathbb{R}^{d}$ is the adaptive prototype of class $j$ for query token $x_{i}$. For clarity, we denote this attention aggregation as $\phi$, i.e., $\mathbf{c}_{i}^{j}=\phi(\mathbf{h}_{i}^{\mathbf{q}},\mathbf{H}^{\mathbf{s}}_{j})$.

Then we utilize cross-entropy loss to optimize the network, and the predicted probability distribution can be produced by a softmax function over Euclidean distances between all prototypes and query token $x_{i}$. Specifically,

(6) $p_{i}^{t}(\hat{y}_{i}=j|x_{i})=\frac{\exp(-d(\mathbf{h}_{i}^{\mathbf{q}},\mathbf{c}_{i}^{j}))}{\sum_{p}\exp(-d(\mathbf{h}_{i}^{\mathbf{q}},\mathbf{c}_{i}^{p}))}$
(7) $\mathcal{L}_{t}=-\sum_{i=1}^{n}\log p_{i}^{t}(\hat{y}_{i}=y_{i}^{*}|x_{i})$

where $p_{i}^{t}(\hat{y}_{i}=j|x_{i})$ is the predicted probability of class $j$ for token $x_{i}$, $d$ denotes the Euclidean distance, and $y_{i}^{*}$ denotes the ground-truth label for token $x_{i}$. It is worth noting that, same as Ding et al. (2021), we use the IO (Inside and Outside) scheme for the token-level network, not the BIESO (Begin, Inside, End, Singleton, and Outside) or BIO (Begin, Inside, and Outside) scheme. For example, the label space of the example in Table 1 is {building-other, building-library, Outside}, not {begin-building-other, inside-building-other, end-building-other, singleton-building-other, begin-building-library, inside-building-library, end-building-library, singleton-building-library, outside} or {begin-building-other, inside-building-other, begin-building-library, inside-building-library, outside}.
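A minimal sketch of the token-level branch (Eqs. (4)-(7)), assuming the representations are already computed; shapes follow Table 2, and we use the plain Euclidean distance named in the text.

```python
import torch
import torch.nn.functional as F

def phi(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """Attention aggregation of Eqs. (4)-(5): query (d,), support (C_j, d)."""
    alpha = torch.softmax(support @ query, dim=0)  # attention weights, (C_j,)
    return support.t() @ alpha                     # adaptive prototype, (d,)

def token_level_loss(h_q, support_by_class, gold):
    """Eqs. (6)-(7). h_q: (n, d) query tokens; support_by_class: one (C_j, d)
    matrix per class under the IO scheme (class 0 is O); gold: (n,) label ids."""
    logits = []
    for h_i in h_q:
        protos = torch.stack([phi(h_i, S_j) for S_j in support_by_class])
        logits.append(-torch.norm(protos - h_i, dim=-1))  # -d(h_i, c_i^j)
    return F.cross_entropy(torch.stack(logits), gold)     # Eq. (7)
```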

4.3. Span-Level Adaptive Prototypical Network

For the span-level network, we classify spans by adopting an adaptive prototypical network similar to the token-level one; the differences mainly lie in that we employ an extra cross-attention module and make a more fine-grained division of class O.

4.3.1. Cross-Attention Module

To recognize a sample from a novel class given a few labeled samples, a natural idea is to locate the most relevant regions in the pair of support and query samples (Hou et al., 2019). Thus, we use cross-attention to model the interaction between the support and query sets and enhance their representations. We denote the span representation matrices of the support and query sets as $\mathbf{S}\in\mathbb{R}^{S_{s}\times d}$ and $\mathbf{Q}\in\mathbb{R}^{S_{q}\times d}$, where $S_{s}$ and $S_{q}$ denote the number of spans in the support and query sets. Then we use cross-attention to model the interaction:

(8) $\hat{\mathbf{s}}_{n}=\phi(\mathbf{s}_{n},\mathbf{Q})$
(9) $\hat{\mathbf{q}}_{n}=\phi(\mathbf{q}_{n},\mathbf{S})$

where $\mathbf{s}_{n}\in\mathbb{R}^{d}$ and $\mathbf{q}_{n}\in\mathbb{R}^{d}$ denote the $n$-th row of $\mathbf{S}$ and $\mathbf{Q}$, $\hat{\mathbf{s}}_{n}\in\mathbb{R}^{d}$ denotes the span representation after cross-attention in the support set, and $\hat{\mathbf{q}}_{n}\in\mathbb{R}^{d}$ denotes the span representation after cross-attention in the query set. Finally, we use a Transformer block (Vaswani et al., 2017) to get the enhanced representations:

(10) $\bar{\mathbf{s}}_{n}=\textrm{LayerNorm}(\mathbf{s}_{n}+\textrm{FFN}(\hat{\mathbf{s}}_{n}))$
(11) $\bar{\mathbf{q}}_{n}=\textrm{LayerNorm}(\mathbf{q}_{n}+\textrm{FFN}(\hat{\mathbf{q}}_{n}))$

where $\bar{\mathbf{s}}_{n}\in\mathbb{R}^{d}$ and $\bar{\mathbf{q}}_{n}\in\mathbb{R}^{d}$ denote the enhanced representations of the support and query sets, and FFN denotes a feed-forward neural network.
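The cross-attention module of Eqs. (8)-(11) can be sketched as below; the FFN width and the random tensors are illustrative choices of ours, and $\phi$ is applied row-wise exactly as in the token-level network.

```python
import torch
import torch.nn as nn

d = 256
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
norm = nn.LayerNorm(d)

def cross_attend(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Eqs. (8)-(11): every row of X (n_x, d) attends over Y (n_y, d)."""
    alpha = torch.softmax(X @ Y.t(), dim=-1)  # row-wise phi over the other set
    X_hat = alpha @ Y                         # Eq. (8)/(9), (n_x, d)
    return norm(X + ffn(X_hat))               # Eq. (10)/(11)

S, Q = torch.randn(40, d), torch.randn(60, d)  # support / query span matrices
S_bar, Q_bar = cross_attend(S, Q), cross_attend(Q, S)
```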

4.3.2. Fine-Grained Division of O

Different from entity classes, the non-entity class O has rich semantics and is hard to represent with a single prototype (Tong et al., 2021; Wang et al., 2022). So we further divide class O into three sub-classes. Since we use left and right boundary tokens to represent spans, we divide the sub-classes based on whether a span has the same left or right boundary as an entity span. Specifically, given a sentence with $M_{1}$ entity spans $\{(l_{i},r_{i})\}_{i=1}^{M_{1}}$, where $l_{i}$ and $r_{i}$ are the left and right boundary tokens of the $i$-th span, each other non-entity span $(l_{o},r_{o})$ is assigned a sub-class label $O_{sub}$ as follows:

$O_{sub}=\begin{cases}O_{1}, & \exists i,\ l_{o}=l_{i}\\ O_{2}, & \exists i,\ r_{o}=r_{i}\\ O_{3}, & \textrm{otherwise}\end{cases}$

where $O_{1}$ denotes the non-entity spans that have the same left boundary token as an entity span, $O_{2}$ denotes the non-entity spans that have the same right boundary token as an entity span, and $O_{3}$ denotes the non-entity spans that share neither boundary token with an entity span. It is worth noting that for sentences with two or more entities, there may be non-entity spans that share a left boundary with one entity span and a right boundary with another. We treat these non-entity spans as $O_{1}$.
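The sub-class assignment rule is simple enough to state directly in code; the spans below are hypothetical (left, right) token positions.

```python
def o_subclass(span, entity_spans):
    """Assign a non-entity span (l_o, r_o) to O1/O2/O3 as in Section 4.3.2.
    Spans sharing a left boundary with an entity go to O1, even if they also
    share a right boundary with another entity."""
    l_o, r_o = span
    if any(l_o == l for l, _ in entity_spans):
        return "O1"
    if any(r_o == r for _, r in entity_spans):
        return "O2"
    return "O3"

# With a hypothetical entity span (2, 5) ("Municipal Gallery of Hamilton"):
assert o_subclass((2, 3), [(2, 5)]) == "O1"  # same left boundary
assert o_subclass((4, 5), [(2, 5)]) == "O2"  # same right boundary
assert o_subclass((0, 1), [(2, 5)]) == "O3"  # shares neither boundary
```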

Algorithm 1 The Training Flow of the Consistent Dual Adaptive Prototypical Network
Input: The few-shot sequence labeling training set $\mathcal{D}_{tr}$
Output: A well-trained few-shot sequence labeling model
for all iteration = 1, $\cdots$, MaxIter do
     Randomly sample a mini-batch of data ($\mathcal{S}_{train}$, $\mathcal{Q}_{train}$) from $\mathcal{D}_{tr}$
     $\rhd$ Base Encoder
     $\mathbf{H}^{\mathbf{s}}, \mathbf{S} \leftarrow \textrm{BERT}(\mathcal{S}_{train})$
     $\mathbf{H}^{\mathbf{q}}, \mathbf{Q} \leftarrow \textrm{BERT}(\mathcal{Q}_{train})$
     for each $x$ in $\mathcal{Q}_{train}$ do
          $\rhd$ Token-Level Network
          Generate token-level prototypes: $\mathbf{c}_{i}^{j}=\phi(\mathbf{h}_{i}^{\mathbf{q}}, \mathbf{H}^{\mathbf{s}}_{j})$
          Calculate the token-level loss: $\mathcal{L}_{t}=-\sum_{i=1}^{n}\log\frac{\exp(-d(\mathbf{h}_{i}^{\mathbf{q}},\mathbf{c}_{i}^{*}))}{\sum_{p}\exp(-d(\mathbf{h}_{i}^{\mathbf{q}},\mathbf{c}_{i}^{p}))}$
          $\rhd$ Span-Level Network
          Enhance representations: $\bar{\mathbf{S}}, \bar{\mathbf{Q}} \leftarrow$ Cross-attention module($\mathbf{S}$, $\mathbf{Q}$)
          Generate span-level prototypes: $\mathbf{z}_{i}^{j}=\phi(\bar{\mathbf{q}}_{i}, \bar{\mathbf{S}}_{j})$
          Calculate the span-level loss: $\mathcal{L}_{s}=-\sum_{i=1}^{S_{q}}\log\frac{\exp(-d(\bar{\mathbf{q}}_{i},\mathbf{z}_{i}^{*}))}{\sum_{p}\exp(-d(\bar{\mathbf{q}}_{i},\mathbf{z}_{i}^{p}))}$
          $\rhd$ Consistent Loss Calculation
          Generate the two token-level distributions $l_{t}(x_{i})$ and $l_{s}(x_{i})$ from the two networks
          Calculate the consistent loss: $\mathcal{L}_{c}=\sum_{i=1}^{n}[D_{KL}(\sigma(l_{t}(x_{i})/T)\,||\,\sigma(l_{s}(x_{i})))+D_{KL}(\sigma(l_{s}(x_{i})/T)\,||\,\sigma(l_{t}(x_{i})))]$
          $\rhd$ Optimization
          Obtain the total loss $\mathcal{L}=\lambda\mathcal{L}_{t}+\beta\mathcal{L}_{s}+\gamma\mathcal{L}_{c}$ and update the model
     end for
end for
return The well-trained model

4.3.3. Adaptive Prototypical Network

We also construct adaptive prototypes for each query span, similar to the token-level adaptive prototypical network. Since we divide class O into three sub-classes, we first adaptively construct the prototype of each sub-class and then adaptively aggregate the three sub-class prototypes to get the prototype of class O. For entity classes, we can directly get the adaptive prototypes. Specifically,

(12) $\hat{\mathbf{o}}_{k}=\phi(\bar{\mathbf{q}}_{i},\mathbf{O}_{k}),\quad k=1,2,3$
(13) $\mathbf{z}_{i}^{0}=\phi\left(\bar{\mathbf{q}}_{i},\left[\hat{\mathbf{o}}_{1}^{\mathrm{T}};\hat{\mathbf{o}}_{2}^{\mathrm{T}};\hat{\mathbf{o}}_{3}^{\mathrm{T}}\right]\right)$
(14) $\mathbf{z}_{i}^{k}=\phi(\bar{\mathbf{q}}_{i},\bar{\mathbf{S}}_{k}),\quad k=1,2,\ldots,N$

where $\hat{\mathbf{o}}_{k}\in\mathbb{R}^{d}$ denotes the adaptive prototype of sub-class $O_{k}$ for query span $q_{i}$, $\mathbf{O}_{k}\in\mathbb{R}^{o_{k}\times d}$ denotes the representation of all spans in the support set with sub-class $O_{k}$, $o_{k}$ denotes the number of spans with sub-class $O_{k}$ in the support set, $\mathbf{z}_{i}^{0}\in\mathbb{R}^{d}$ denotes the adaptive prototype of class O for query span $q_{i}$, $\bar{\mathbf{S}}_{k}\in\mathbb{R}^{m_{k}\times d}$ is the representation of all spans of class $k$ in the support set, $m_{k}$ denotes the number of $k$-class spans in the support set, and $\mathbf{z}_{i}^{k}\in\mathbb{R}^{d}$ denotes the adaptive prototype of class $k$ for query span $q_{i}$.

After getting all prototypes $\mathbf{Z}_{i}=(\mathbf{z}_{i}^{0},\mathbf{z}_{i}^{1},\ldots,\mathbf{z}_{i}^{N})$, we also use a softmax function over Euclidean distances between all prototypes and query span $q_{i}$ to get the predicted probability distribution, and the cross-entropy loss to optimize the network.

(15) $p_{i}^{s}(\hat{y}_{i}=j|q_{i})=\frac{\exp(-d(\bar{\mathbf{q}}_{i},\mathbf{z}_{i}^{j}))}{\sum_{p}\exp(-d(\bar{\mathbf{q}}_{i},\mathbf{z}_{i}^{p}))}$
(16) $\mathcal{L}_{s}=-\sum_{i=1}^{S_{q}}\log p_{i}^{s}(\hat{y}_{i}=\bar{y}_{i}^{*}|q_{i})$

where $p_{i}^{s}(\hat{y}_{i}=j|q_{i})$ is the predicted probability of class $j$ for span $q_{i}$, $\bar{y}_{i}^{*}$ denotes the ground-truth label of span $q_{i}$, and $S_{q}$ is the number of spans in the sentence.
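Putting Eqs. (12)-(16) together, the span-level prediction for a single query span might be sketched as follows, where phi is the same attention aggregation as in the token-level sketch and shapes follow Table 2.

```python
import torch

def phi(query, support):
    """Attention aggregation, identical to the token-level sketch."""
    alpha = torch.softmax(support @ query, dim=0)
    return support.t() @ alpha

def span_level_probs(q_bar_i, support_entity, support_O):
    """q_bar_i: (d,) enhanced query span; support_entity: N matrices (m_k, d),
    one per entity class; support_O: three matrices (o_k, d) for O1-O3."""
    o_hat = torch.stack([phi(q_bar_i, O_k) for O_k in support_O])  # Eq. (12)
    z0 = phi(q_bar_i, o_hat)                                       # Eq. (13)
    z = torch.stack([z0] + [phi(q_bar_i, S_k) for S_k in support_entity])
    return torch.softmax(-torch.norm(z - q_bar_i, dim=-1), dim=0)  # Eq. (15)
```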

4.4. Consistent Loss

Intuitively, the token-level network and span-level network should make consistent predictions in a well-trained model. Therefore, we further propose a consistent loss to make the predicted probability distributions of two networks consistent. It contains two steps, where the first step constructs two token-level probability distributions for each token from two networks and the second step measures the difference between two probability distributions.

For each token, we can obtain the token-level probability distribution directly from the token-level network. Then, among all spans containing this token, we select the span with the maximum predicted probability and take its probability distribution as the token-level probability distribution from the span-level network. Finally, we use the bidirectional Kullback-Leibler divergence with temperature as the consistent loss, which allows the two networks to learn from each other while measuring the difference between the two distributions. Specifically,

(17) $\mathcal{L}_{c}=\sum_{i=1}^{n}\left[D_{KL}\left(\sigma(l_{t}(x_{i})/T)\,||\,\sigma(l_{s}(x_{i}))\right)+D_{KL}\left(\sigma(l_{s}(x_{i})/T)\,||\,\sigma(l_{t}(x_{i}))\right)\right]$

where $l_{t}(x_{i})$ and $l_{s}(x_{i})$ are the predicted logits of the token-level and span-level networks for token $x_{i}$, $\sigma$ denotes the softmax function, and $T$ is the temperature to adjust the sharpness of the probability distribution.
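Eq. (17) translates directly into PyTorch; note that F.kl_div expects log-probabilities as its first argument and probabilities as its second, and the temperature value below is illustrative.

```python
import torch.nn.functional as F

def consistent_loss(l_t, l_s, T=2.0):
    """Bidirectional KL with temperature (Eq. 17). l_t, l_s: (n, num_classes)
    logits of the token-level and span-level networks; T is a hyper-parameter."""
    # D_KL(sigma(l_t / T) || sigma(l_s)): the temperature-smoothed distribution
    # plays the role of the target inside kl_div's computation.
    kl_ts = F.kl_div(F.log_softmax(l_s, dim=-1),
                     F.softmax(l_t / T, dim=-1), reduction="sum")
    kl_st = F.kl_div(F.log_softmax(l_t, dim=-1),
                     F.softmax(l_s / T, dim=-1), reduction="sum")
    return kl_ts + kl_st
```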

4.5. Training

Finally, we combine the above three loss functions to form the final loss function to jointly train our model as follows:

(18) $\mathcal{L}=\lambda\mathcal{L}_{t}+\beta\mathcal{L}_{s}+\gamma\mathcal{L}_{c}$

where $\lambda$, $\beta$, and $\gamma$ are hyper-parameters.

In summary, there are three steps to train our model, i.e., calculate token-level loss, calculate span-level loss, and calculate consistent loss. The whole training flow is illustrated in Figure 1 and Algorithm 1.

Figure 2. The process of the consistent greedy inference algorithm. The algorithm first punishes inconsistent predictions and then greedily selects non-overlapping spans with maximum probability based on the adjusted probabilities.

4.6. Consistent Greedy Inference Algorithm

Intuitively, we consider extracted entities and slots unreliable if the two networks predict different labels. So we propose a consistent greedy algorithm to combine the outputs of the two networks. As shown in Figure 2 and Algorithm 2, we first adjust the predicted probabilities of spans and then use a greedy algorithm to gradually select non-overlapping spans with the maximum probability based on the adjusted probability.

First, we consider a span to be reliable if all token-level predictions from the token-level network within the span agree with the span-level prediction from the span-level network. For example, as shown in Figure 2, the prediction for the span “powerplant” is consistent, while the predictions for “nimbus powerplant” and “the nimbus powerplant” are inconsistent. We punish the probabilities of inconsistent spans by the following operation:

(19) $\bar{y}_{i}=\hat{y}_{i}-\delta\cdot\textrm{count}_{i}$

where $\bar{y}_{i}$ and $\hat{y}_{i}$ are the predicted probabilities of span $q_{i}$ after and before the consistency adjustment, $\delta$ is a hyper-parameter to control the degree of penalty, and $\textrm{count}_{i}$ is the number of inconsistent tokens. Then, we use a greedy algorithm to gradually select non-overlapping spans with the maximum probability based on the adjusted probabilities. Specifically, at each step of the greedy algorithm, we select the span with the maximum probability and discard the overlapping spans. For example, we select the span “powerplant” and discard “nimbus powerplant” and “the nimbus powerplant”. The whole inference flow is illustrated in Algorithm 2.

Algorithm 2 The Process of the Consistent Greedy Inference Algorithm
Input: The few-shot sequence labeling test set $\mathcal{D}_{te}$, a well-trained model, the number of episodes in the test set $n$
Output: The extracted entities or slots set S
S = $\emptyset$
for all i = 1, $\cdots$, $n$ do
     # Initialization: get data and predictions from the two networks:
     B, C = $\emptyset$, $\emptyset$
     Get the $i$-th episode data ($\mathcal{S}_{test}$, $\mathcal{Q}_{test}$) from $\mathcal{D}_{te}$
     Get the token-level predictions $\{\hat{y}_{1},\hat{y}_{2},\ldots,\hat{y}_{n}\}$ according to Eq. (6)
     Get the span-level predictions A = $\{(s_{j},\hat{y}_{j})\}_{j=1}^{\hat{M}}$ according to Eq. (15)
     # First step: punish the inconsistent predictions:
     for all k = 1, $\cdots$, $\hat{M}$ do
          Get the number of inconsistent tokens $\textrm{count}_{k}$ in span $s_{k}$
          $\bar{y}_{k}\leftarrow\hat{y}_{k}-\delta\cdot\textrm{count}_{k}$
          A $\leftarrow$ A $\setminus$ $\{(s_{k},\hat{y}_{k})\}$
          B $\leftarrow$ B $\cup$ $\{(s_{k},\bar{y}_{k})\}$
     end for
     # Second step: greedy algorithm to select spans:
     while B $\neq\emptyset$ do
          Select the span $s_{k}$ with the maximum probability in B
          C $\leftarrow$ C $\cup$ $\{s_{k}\}$
          Get the span set D that overlaps with span $s_{k}$ in B
          B $\leftarrow$ B $\setminus$ D
     end while
     S $\leftarrow$ S $\cup$ C
end for
return The extracted entities or slots set S
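For concreteness, a plain-Python sketch of Algorithm 2 for a single sentence is given below; the (left, right) span encoding and the value of $\delta$ are our own illustration.

```python
def consistent_greedy_inference(span_preds, token_labels, delta=0.05):
    """span_preds: list of ((l, r), label, prob) spans predicted as entities or
    slots by the span-level network; token_labels: per-token labels from the
    token-level network (IO scheme)."""
    # First step: punish spans whose tokens disagree with the span label.
    adjusted = []
    for (l, r), label, prob in span_preds:
        count = sum(1 for t in range(l, r + 1) if token_labels[t] != label)
        adjusted.append(((l, r), label, prob - delta * count))  # Eq. (19)
    # Second step: greedily keep the highest-probability non-overlapping spans.
    selected = []
    for (l, r), label, prob in sorted(adjusted, key=lambda x: -x[2]):
        if all(r < l2 or r2 < l for (l2, r2), _, _ in selected):
            selected.append(((l, r), label, prob))
    return selected
```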

5. Experiments

In this section, we first introduce our experimental details, including the datasets, evaluation metrics, implementation details, and baselines. Then, we report the experimental results on three datasets to answer the following research questions:

  • RQ1: Does our proposed model outperform existing few-shot sequence labeling methods?

  • RQ2: How does each of the components of our model contribute to the final performance?

  • RQ3: What is the effect of different inference algorithms on the performance of our method?

  • RQ4: What is the effect of different consistent losses on the performance of our method?

  • RQ5: How do the hyper-parameters influence the performance of our method?

Thereafter, we also explore the effectiveness of our O division strategy, evaluate the efficiency of our model, and conduct an error analysis and a case study to further analyze our model.

Table 3. Statistics of three original datasets.
Datasets # Sentences # Classes Domain
FewNERD 188.2K 66 Wikipedia
SNIPS 14.5K 60 Dialog
Cross 189.4K 43 News, Wiki, Social, and Mixed

5.1. Datasets

We conduct experiments on three benchmark datasets: FewNERD (Ding et al., 2021), SNIPS (Coucke et al., 2018), and Cross. FewNERD and Cross are NER datasets, and SNIPS is a slot tagging dataset. The statistics of the three original datasets are shown in Table 3. The detailed description of the three datasets is as follows:

FewNERD is a NER dataset with a hierarchy of 8 coarse-grained entity types and 66 fine-grained entity types, and has two few-shot settings: Intra and Inter. The two settings divide the dataset using coarse-grained and fine-grained entity types, respectively. For the Intra setting, the entities in the training, development, and test sets belong to different coarse-grained types. For the Inter setting, the entities in the training, development, and test sets belong to different fine-grained types. It is worth noting that, in order to better handle dense entities (i.e., a sentence may have multiple entity types), Ding et al. (2021) adopt an $N$-way $K$~$2K$-shot sampling method to construct the FewNERD dataset (each class in the support set has $K$~$2K$ entities). Both the Intra and Inter settings have 4 sub-settings: 5-way 1~2 shot, 5-way 5~10 shot, 10-way 1~2 shot, and 10-way 5~10 shot. We use the public sampled dataset released by Ding et al. (2021), which contains 20,000 episodes for training, 1,000 episodes for validation, and 5,000 episodes for testing. (We use the latest version of the FewNERD dataset from https://github.com/thunlp/Few-NERD, which corresponds to the results reported in https://arxiv.org/pdf/2105.07464v6.pdf.)

SNIPS is a slot tagging dataset containing 7 domains: Weather (We), Music (Mu), Play List (Pl), Book (Bo), Search Screen (Se), Restaurant (Re), and Creative Work (Cr). Hou et al. (2020) randomly sample one domain for validation, then randomly sample a different domain for testing, and use the remaining domains for training. In each episode, every class has $K$-shot samples in the support set and the number of classes ($N$) is not fixed. Each domain of the SNIPS dataset has two settings: 1-shot and 5-shot. We use the public sampled dataset (https://github.com/AtmaHou/FewShotTagging) provided by Hou et al. (2020) for a fair comparison.

Cross is a NER dataset consisting of four datasets: CoNLL-2003 (Sang and Meulder, 2003), GUM (Zeldes, 2017), WNUT-2018 (Derczynski et al., 2017), and Ontonotes (Pradhan et al., 2013). The four datasets are from the News, Wiki, Social, and Mixed domains, respectively. Same as for the SNIPS dataset, Hou et al. (2020) use two datasets for training, one for validation, and one for testing. Each domain of the Cross dataset also has two settings, 1-shot and 5-shot, and we again use the public sampled dataset (https://github.com/AtmaHou/FewShotTagging) provided by Hou et al. (2020) for a fair comparison.

Table 4. The performance (F1 score with standard deviations) on the FewNERD dataset. † denotes that we reproduce the results using the authors’ code (https://github.com/microsoft/vert-papers/tree/master/papers/DecomposedMetaNER) and do not fine-tune the model during the meta-testing phase for a fair comparison. We report the average result of 5 runs.
Models Intra Inter
1~2 shot 5~10 shot Avg. 1~2 shot 5~10 shot Avg.
5 way 10 way 5 way 10 way 5 way 10 way 5 way 10 way
Proto 20.76±0.84 15.04±0.44 42.54±0.94 35.40±0.13 28.44±0.59 38.83±1.49 32.45±0.79 58.79±0.44 52.92±0.37 45.75±0.77
NNShot 25.78±0.91 18.27±0.41 36.18±0.79 27.38±0.53 26.90±0.66 47.24±1.00 38.87±0.21 55.64±0.63 49.57±2.73 47.83±1.14
StructShot 30.21±0.90 21.03±1.13 38.00±1.29 26.42±0.60 28.92±0.98 51.88±0.69 43.34±0.10 57.32±0.63 49.57±3.08 50.53±1.13
De-MAML 37.49±2.10 27.36±2.70 41.26±1.34 35.31±5.53 35.36±2.92 61.63±1.11 49.24±5.10 54.09±2.24 53.20±3.44 54.54±2.97
ESD 36.08±1.60 30.00±0.70 52.14±1.50 42.15±2.60 40.09±1.60 59.29±1.25 52.16±0.79 69.06±0.80 64.00±0.43 61.13±0.82
Ours 41.21±0.70 35.36±0.91 52.67±0.49 45.83±0.81 43.77±0.73 62.79±0.29 56.23±0.76 69.83±0.40 65.99±0.40 63.71±0.46

5.2. Evaluation Metrics

For the FewNERD dataset, we report the micro-F1 over all test episodes following Ding et al. (2021). Specifically,

(20) $\textrm{F1}=\frac{2\times\textrm{P}_{all}\times\textrm{R}_{all}}{\textrm{P}_{all}+\textrm{R}_{all}}$

where $\textrm{P}_{all}$ denotes the precision calculated over all test episodes and $\textrm{R}_{all}$ denotes the recall calculated over all test episodes. For the SNIPS and Cross datasets, we first calculate the micro-F1 for each test episode and then report the average F1 over all test episodes as the final result, following Hou et al. (2022). Specifically,

(21) $\textrm{F1}=\frac{1}{n}\sum_{i=1}^{n}\textrm{F1}_{i}$

where $n$ is the number of episodes and $\textrm{F1}_{i}$ denotes the F1 of the $i$-th episode.
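The two aggregation schemes differ only in where the averaging happens. A small sketch, assuming each episode is summarized by (correct, predicted, gold) span counts:

```python
def micro_f1(episodes):
    """Eq. (20): precision/recall pooled over all test episodes (FewNERD)."""
    correct = sum(c for c, _, _ in episodes)
    p = correct / sum(pred for _, pred, _ in episodes)
    r = correct / sum(gold for _, _, gold in episodes)
    return 2 * p * r / (p + r)

def episode_avg_f1(episodes):
    """Eq. (21): F1 computed per episode, then averaged (SNIPS and Cross)."""
    f1s = []
    for c, pred, gold in episodes:
        p, r = c / pred, c / gold
        f1s.append(2 * p * r / (p + r) if (p + r) > 0 else 0.0)
    return sum(f1s) / len(f1s)
```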

5.3. Implementation Details

We use BERT-base-uncased (Devlin et al., 2019) as our encoder and the AdamW (Loshchilov and Hutter, 2019) optimizer. For the FewNERD dataset, the learning rate is 2e-5 for BERT and 5e-4 for the other parameters, and the batch size is set to 2. For the SNIPS dataset, we use grid search to find the best hyper-parameters: the learning rate of BERT is selected from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}, the learning rate of the other parameters from {5e-5, 1e-4, 3e-4, 5e-4}, and the batch size from {2, 4}. We schedule the learning rate so that the first 1000 training steps are a linear warmup phase and the rest is a linear decay phase. $\lambda$ and $\gamma$ are selected from {0.02, 0.05, 0.1, 0.2, 0.3} through grid search, $\beta$ is set to 1, and $\delta$ is selected from {0.01, 0.02, 0.05, 0.1}. The maximum span length is set to 8 for inference.

5.4. Baselines

To evaluate the effectiveness of our method, we compare our method with the following baselines:

  • Proto (Fritzler et al., 2019; Ding et al., 2021) represents each class by the mean of its token representations and uses Euclidean distance to predict query labels.

  • NNShot (Yang and Katiyar, 2020; Ding et al., 2021) is similar to Proto, but uses nearest neighbor classification to predict query labels.

  • StructShot (Yang and Katiyar, 2020; Ding et al., 2021) is similar to NNShot and further uses a Viterbi decoder during the inference phase to model label dependency.

  • TransferBERT (Hou et al., 2020) is a domain transfer model that pretrains the model on source domains and fine-tunes it on the target domain.

  • Matching (Vinyals et al., 2016; Hou et al., 2020) is similar to Proto but uses matching networks (Vinyals et al., 2016) for classification instead of prototypical networks.

  • L-TapNet+CDT (Hou et al., 2020) uses label names to extend task-adaptive projection networks (Yoon et al., 2019) and further utilizes collapsed dependency transfer to capture the dependencies between labels.

  • Retriever (Yu et al., 2021a) is a span-level retrieval method that learns similar contextualized representations for spans with the same label via a novel batch-softmax objective.

  • ConVEx (Henderson and Vulic, 2021) first pre-trains through a novel pairwise cloze task using Reddit data and then directly fine-tunes for the slot tagging task.

  • MRC (Ma et al., 2021) enumerates all the slots to extract the answer from the sentence in a machine reading comprehension framework.

  • Inverse (Hou et al., 2022) reversely predicts slot values given slot-type prompts and further proposes an iterative prediction strategy to consider the relations between different slot types.

  • De-MAML (Ma et al., 2022b) proposes a two-step MAML-enhanced model to first detect spans and then classify them.

  • ESD (Wang et al., 2022) proposes inter-span and cross-span attention to enhance the span representations, and uses instance span attention to construct prototypes.

Table 5. The performance (F1 scores with standard deviations) on the 7 domains of SNIPS. ‘unk’ denotes methods that do not report standard deviations in their paper. The baselines for the 1-shot and 5-shot settings differ because ConVEx, Retriever, and Inverse do not report 1-shot results in their papers. We report the average result of 10 runs. The best result is bold and the second best is underlined.
Models We Mu Pl Bo Se Re Cr Avg.
1-shot TransferBERT 55.82±2.75 38.01±1.74 45.65±2.02 31.63±5.32 21.96±3.98 41.79±3.81 38.53±7.42 39.06±3.86
Matching 21.74±4.60 10.68±1.07 39.71±1.81 58.15±0.68 24.21±1.20 32.88±0.64 69.66±1.68 36.72±1.67
Proto 46.72±1.03 40.07±0.48 50.78±2.09 68.73±1.87 60.81±1.70 55.58±3.56 67.67±1.16 55.77±1.70
MRC - - - - - - - 69.3(unk)
L-TapNet+CDT 71.53±4.04 60.56±0.77 66.27±2.71 84.54±1.08 76.27±1.72 70.79±1.60 62.89±1.88 70.41±1.97
ESD 78.25±1.50 54.74±1.02 71.15±1.55 71.45±1.38 67.85±0.75 71.52±0.98 78.14±1.46 70.44±1.23
Ours 73.56±1.58 58.40±1.42 74.20±1.24 76.07±1.71 70.64±1.49 72.73±1.08 75.32±1.09 71.56±1.37
5-shot TransferBERT 59.41±0.30 42.00±2.83 46.07±4.32 20.74±3.36 28.20±0.29 67.75±1.28 58.61±3.67 46.11±2.29
Matching 36.67±3.64 33.67±6.12 52.60±2.84 69.09±2.36 38.42±4.06 33.28±2.99 72.10±1.48 47.98±3.36
Proto 67.82±4.11 55.99±2.24 46.02±3.19 72.17±1.75 73.59±1.60 60.18±6.96 66.89±2.88 63.24±3.25
Retriever 82.95(unk) 61.74(unk) 71.75(unk) 81.65(unk) 73.10(unk) 79.54(unk) 51.35(unk) 71.72(unk)
L-TapNet+CDT 71.64±3.62 67.16±2.97 75.88±1.51 84.38±2.81 82.58±2.12 70.05±1.61 73.41±2.61 75.01±2.46
Inverse 70.63(unk) 71.97(unk) 78.73(unk) 87.34(unk) 81.95(unk) 72.07(unk) 74.44(unk) 76.73(unk)
ConVEx 71.5(unk) 77.6(unk) 79.0(unk) 84.5(unk) 84.0(unk) 73.8(unk) 67.4(unk) 76.8(unk)
MRC 89.39(unk) 75.11(unk) 77.18(unk) 84.16(unk) 73.53(unk) 82.29(unk) 72.51(unk) 79.17(unk)
ESD 84.50±1.06 66.61±2.00 79.69±1.35 82.57±1.37 82.22±0.81 80.44±0.80 81.13±1.84 79.59±1.32
Ours 86.24±1.04 67.66±1.01 82.02±1.05 86.47±0.92 82.45±1.09 81.38±0.55 80.55±1.16 80.97±0.97
Table 6. The performance (F1 score with standard deviations) on the Cross dataset. † denotes that we reproduce the results using the authors’ code and do not fine-tune the model during the meta-testing phase for a fair comparison. We report the average result of 10 runs.
Models 1-shot 5-shot
News Wiki Social Mixed Avg. News Wiki Social Mixed Avg.
TransferBERT 4.75±1.42 0.57±0.32 2.71±0.72 3.46±0.54 2.87±0.75 15.36±2.81 3.62±0.57 11.08±0.57 35.49±7.60 16.39±2.89
Matching 19.50±0.35 4.73±0.16 17.23±2.75 15.06±1.61 14.13±1.22 19.85±0.74 5.58±0.23 6.61±1.75 8.08±0.47 10.03±0.80
Proto 32.49±2.01 3.89±0.24 10.68±1.40 6.67±0.46 13.43±1.03 50.06±1.57 9.54±0.44 17.26±2.65 13.59±1.61 22.61±1.57
De-MAML 42.40±0.93 4.76±0.08 23.87±1.49 14.19±1.33 21.31±0.96 37.10±2.65 6.21±0.26 18.54±0.89 26.41±0.48 22.07±1.07
L-TapNet+CDT 44.30±3.15 12.04±0.65 20.80±1.06 15.17±1.25 23.08±1.53 45.35±2.67 11.65±2.34 23.30±2.80 20.95±2.81 25.31±2.65
Ours 41.21±0.70 9.91±0.50 25.01±1.32 26.18±2.69 25.58±1.30 51.97±2.93 15.65±1.51 29.89±1.64 37.19±1.35 33.68±1.86

5.5. Results (RQ1)

To answer RQ1, we conduct an empirical study to investigate whether our proposed model can achieve better performance for few-shot sequence labeling. Tables 4 and 6 report the performance of our proposed model and the baselines on the few-shot NER datasets (i.e., FewNERD and Cross), and Table 5 reports the performance on the few-shot slot tagging dataset (i.e., SNIPS).

First, we observe the performance on the FewNERD dataset. Our model consistently outperforms all baselines under all settings. Compared with the best baseline, ESD, our model achieves 3.68 and 2.58 average F1 improvements under the Intra and Inter settings. Compared with the best token-level baseline, StructShot, our model achieves 14.85 and 13.18 average F1 improvements under the two settings. This shows that combining token-level and span-level supervision can achieve better performance for few-shot sequence labeling. It is worth noting that the improvement under the Intra setting is higher than under the Inter setting. This suggests that when the domain gap between the source and target domains is relatively large, combining token and span levels leads to more significant improvements over the single-granularity baselines.

Secondly, we observe the performance on the SNIPS dataset. Compared with the best baseline ESD, our model achieves 1.12 and 1.38 average F1 improvements under the 1-shot and 5-shot settings, respectively. Compared with all baseline methods, our method achieves the best or second best performance in most domains, which again shows its effectiveness. It is worth noting that the standard deviations of our method are the lowest or second lowest in most settings, which shows that our model is robust.

Thirdly, we observe the performance on the Cross dataset. Compared with the best baseline L-TapNet+CDT, our model achieves 2.50 and 8.37 average F1 improvements under the 1-shot and 5-shot settings, respectively. Compared with all baseline methods, our method achieves the best or second best performance in all domains, which shows its effectiveness.

Table 7. The ablation study of our model on the development set of the FewNERD dataset under the 5-way 1~2 shot setting. The ablation study is repeated five times.
Model Setting Intra Inter
Ours 36.92±1.54 48.63±0.51
w/o Span-Level Loss 26.42±0.33 36.25±1.21
w/o Token-Level Loss 34.53±0.51 47.80±0.64
w/o Consistent Loss 35.27±0.77 48.10±0.16
w/o Inference Algorithm 31.13±1.66 45.18±0.31
w/o Division of O 35.63±0.94 48.49±0.40
w/o Adaptive Prototype 32.52±1.31 46.27±1.23
w/o Cross-Attention 32.51±1.78 44.98±0.31

5.6. Ablation Study (RQ2)

To answer RQ2, we validate the contribution of each component by removing each one from our model individually and report the results in Table 7.

First, we can see that every component improves performance, which shows that all components in our model are effective. Secondly, we remove the token-level and span-level losses. When we ablate the span-level loss, the network uses the output of the token-level network as the result, and the performance drops to 26.42 and 36.25 under the Intra and Inter settings. This may be because token-level metric learning struggles to extract complete entities and our token-level model is parameter-efficient. When we remove the token-level loss, the performance drops to 34.53 and 47.80 under the two settings, which shows that jointly training the model with token-level labels can effectively improve performance. Thirdly, we remove the consistent loss and the performance drops to 35.27 and 48.10 under the two settings, which shows that the consistent loss can effectively regularize the model and improve performance. Fourthly, we remove the inference algorithm, i.e., all spans predicted as entities or slots by the span-level network are extracted. The performance drops to 31.13 and 45.18 under the two settings, illustrating that our inference algorithm can effectively combine the results of the two networks and avoid extracting overlapping spans. Finally, we ablate each component (i.e., the fine-grained division of O, the cross-attention module, and the adaptive prototype) in the span network, and all of them improve performance. The cross-attention module brings the largest improvement, indicating that the interaction between the support set and the query set is the key to performance. When we remove the adaptive prototype, the performance drops to 32.52 and 46.27 under the two settings, showing that adaptively constructing a prototype for each query span is more effective. When we remove the division of class O, the performance drops to 35.63 and 48.49 under the two settings, showing that the fine-grained division of O can effectively construct the prototypes of class O.

Table 8. The performance of our model under different inference algorithms on the development set of the FewNERD dataset under the Inter 5-way 1~2 shot setting. The experiment is repeated five times.
Inference Setting F1
Ours 48.63±0.51
Token Network 26.94±0.96
Span Network 48.05±0.54
Intersection 29.93±0.71
Union 41.27±0.36

5.7. Effects of Different Inference Algorithms (RQ3)

To answer RQ3, we explore the effects of different inference algorithms in Table 8, including our proposed consistent greedy inference algorithm, combining the outputs of the two networks with an intersection or union operation (i.e., a span and its type are extracted by both networks or by either one), and using the outputs of the token network or the span network alone (where we also apply the greedy algorithm to avoid overlapping spans).

First, we can see that our proposed inference algorithm achieves the best performance, which shows that using token-level predictions to adjust the probabilities of the span network is effective. Secondly, the span network performs better than the token network. This may be because token-level metric learning struggles to extract complete entities and our token-level model is parameter-efficient. Thirdly, we report the performance of combining the outputs of the two networks with the intersection or union operation. Neither exceeds the span network alone, because the token network performs poorly and a simple combination cannot compensate for it. Finally, it is worth noting that the span network outperforms the w/o span-level loss variant, which shows that jointly training with the token-level loss improves the performance of the span-level network.
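To make the procedure concrete, the following is a minimal sketch of the consistent greedy inference step. The exact probability-adjustment rule is given by Eq. (19) in the paper and is not reproduced here; the adjustment below (penalizing spans whose tokens receive weak support from the token network, scaled by δ) and all names are illustrative assumptions.

```python
# Illustrative sketch of consistent greedy inference (not the exact Eq. (19)).
# spans: list of (start, end, label, prob) candidates from the span network.
# token_probs: per-token dicts mapping label -> token-network probability.

def consistent_greedy_inference(spans, token_probs, delta=0.02):
    adjusted = []
    for start, end, label, prob in spans:
        # Assumed adjustment: down-weight a span when the token network
        # gives its label weak average support over the span's tokens.
        support = sum(token_probs[i].get(label, 0.0)
                      for i in range(start, end + 1)) / (end - start + 1)
        adjusted.append((start, end, label, prob - delta * (1.0 - support)))

    # Greedily keep the highest adjusted-probability spans that do not
    # overlap any span selected so far.
    selected = []
    for start, end, label, prob in sorted(adjusted, key=lambda s: -s[3]):
        if all(end < s or start > e for s, e, _, _ in selected):
            selected.append((start, end, label, prob))
    return selected
```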

5.8. Unidirectional vs. Bidirectional Consistent Loss (RQ4)

To answer RQ4, we explore the performance of using a unidirectional Kullback-Leibler divergence with temperature as the consistent loss in Table 9, i.e., only using the span network to supervise the token network (ST) or the token network to supervise the span network (TS).

First, the bidirectional consistent loss performs better than both ST and TS, which shows that it is more effective at aligning the outputs of the two networks than a unidirectional consistent loss. Secondly, ST and TS both exceed the w/o consistent loss variant, showing that both unidirectional consistent losses help align the outputs and improve performance. Thirdly, TS outperforms ST, suggesting that supervision from the token network is more effective for training the span network, perhaps because it is easier for the span network to learn a consistent output.
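For reference, a minimal sketch of the bidirectional consistent loss is shown below, assuming the two networks produce logits over the same label set for aligned units; how the token- and span-level outputs are aligned, and any additional scaling, follow the paper's definitions rather than this sketch.

```python
import torch.nn.functional as F

def bidirectional_kl(token_logits, span_logits, tau=2.0):
    """Sketch of the bidirectional KL consistent loss with temperature tau.

    token_logits, span_logits: [batch, num_classes] logits from the
    token-level and span-level networks for aligned units (assumption).
    """
    log_t = F.log_softmax(token_logits / tau, dim=-1)
    log_s = F.log_softmax(span_logits / tau, dim=-1)
    # KL(span || token): the span network supervises the token network (ST).
    kl_st = F.kl_div(log_t, log_s.exp(), reduction="batchmean")
    # KL(token || span): the token network supervises the span network (TS).
    kl_ts = F.kl_div(log_s, log_t.exp(), reduction="batchmean")
    return 0.5 * (kl_st + kl_ts)
```

The unidirectional ST and TS variants in Table 9 correspond to keeping only the kl_st or kl_ts term, respectively.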

5.9. Effects of Different Consistent Losses (RQ4)

We further explore the performance of different choices of consistent loss in Table 9 to answer RQ4, i.e., mean squared error (MSE), Kullback-Leibler divergence (KL), negative cosine similarity (COS), and Jensen-Shannon divergence (JS).

First, the KL loss with temperature performs best and the plain KL loss second best, showing that the temperature yields better supervision signals. It is worth noting that both KL divergence and JS divergence exceed the w/o consistent loss variant, which shows that they can effectively measure the difference between the two distributions and improve performance. Secondly, w/o consistent loss exceeds MSE and COS, illustrating that MSE and COS actually hurt performance and are poor choices for measuring the difference between the two probability distributions in our model.
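The alternative losses compared in Table 9 can be sketched as follows, where p and q denote the probability distributions output by the two networks; the divergence directions are our assumption.

```python
import torch.nn.functional as F

def consistent_loss(p, q, kind="js"):
    """Consistent-loss variants from Table 9 (illustrative sketch).

    p, q: [batch, num_classes] probability distributions from the two networks.
    """
    if kind == "mse":
        return F.mse_loss(p, q)
    if kind == "cos":
        return -F.cosine_similarity(p, q, dim=-1).mean()
    if kind == "kl":  # plain KL without temperature
        return F.kl_div(q.log(), p, reduction="batchmean")
    if kind == "js":  # symmetric Jensen-Shannon divergence
        m = 0.5 * (p + q)
        return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                      + F.kl_div(m.log(), q, reduction="batchmean"))
    raise ValueError(f"unknown consistent loss: {kind}")
```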

Table 9. The performance of our model under different consistent loss settings on the development set of the FewNERD dataset under the Inter 5-way 1~2 shot setting. ST denotes using the span network to supervise the token network and TS denotes using the token network to supervise the span network. MSE, KL, COS, and JS denote mean squared error, Kullback-Leibler divergence, negative cosine similarity, and Jensen-Shannon divergence, respectively. The experiment is repeated five times.
Consistent Loss F1
Ours 48.63±0.51
ST 48.19±0.49
TS 48.41±0.44
MSE 48.05±0.32
KL 48.41±0.13
COS 47.93±0.14
JS 48.38±0.46
Figure 3. Effects of hyper-parameters under the 5-way 1~2 shot setting: panels (a) and (b) show the effects of λ, panels (c) and (d) the effects of γ, and panels (e) and (f) the effects of δ.

5.10. Effects of Hyper-Parameters (RQ5)

To answer RQ5, we explore the effects of different hyper-parameters (i.e., the weight λ of the token-level loss in Eq. (18), the weight γ of the consistent loss in Eq. (18), and the degree of penalty δ in the inference algorithm in Eq. (19)) while fixing the other hyper-parameters, on the FewNERD dataset under the 5-way 1~2 shot setting (Fig. 3).

We first observe the effects of λ, selected from {0.02, 0.05, 0.1, 0.2, 0.3}. From Figures 3(a) and 3(b), we find that the performance is relatively stable under the Inter and Intra settings, fluctuating by 0.6% and 1.5%, respectively. As λ increases, the performance first rises and then declines under both settings, which shows that a suitable λ helps improve performance. The optimal value depends on the setting: the model performs best when λ is 0.1 under the Inter setting and 0.2 under the Intra setting.

Then, we observe the effects of γ and δ, chosen from {0.02, 0.05, 0.1, 0.2, 0.3} and {0.01, 0.02, 0.05, 0.1}, respectively. The overall trend is the same as for λ. The performance is stable under both settings, fluctuating by 0.5% and 0.4% as γ changes and by 0.2% and 0.3% as δ changes, which shows that our model is insensitive to γ and δ. As γ or δ increases, the performance again first rises and then declines under both settings. The model achieves the best performance when γ is 0.05 and δ is 0.02 under both settings.
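For clarity, we assume the overall training objective of Eq. (18) combines the three losses linearly, as in the sketch below; λ weights the token-level loss and γ weights the consistent loss, set here to the best values found above. Eq. (18) defines the exact form.

```python
# Assumed composition of the overall objective (Eq. (18) gives the exact
# form): the span-level loss plus weighted token-level and consistent losses.
def total_loss(span_loss, token_loss, consistent_loss, lam=0.1, gamma=0.05):
    return span_loss + lam * token_loss + gamma * consistent_loss
```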

5.11. Comparison with Other O Divisions

Table 10. The performance of different O divisions on the development set of the FewNERD dataset under the Inter 5-way 1~2 shot setting. The upper part of the table shows the performance of our model under the two divisions, and the lower part shows the ESD model under the two divisions. All results are averaged over five runs.
Model Setting Division Setting F1
CDAP Ours 48.63±0.51
Wang et al. (2022) 48.39±0.68
ESD Ours 46.55±0.41
Wang et al. (2022) 45.83±0.56

We compare our division with that of Wang et al. (2022) under two different models (i.e., CDAP and ESD) in Table 10 to show the effectiveness of our division. Wang et al. (2022) also divide class O into three sub-classes: sub-class O1 denotes a span that does not overlap with any entity or slot in the sentence, sub-class O2 denotes a span that is a sub-span of an entity or slot, and O3 denotes the others.

We can see that our O division achieves better performance under both models. This illustrates the effectiveness of dividing sub-classes according to the left and right boundary tokens. It is worth noting that our division brings a larger gain on the ESD model. This may be because ESD's capacity to distinguish sub-classes is limited compared to our model, so our division helps it distinguish sub-classes more effectively.
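The division of Wang et al. (2022) described above can be sketched directly from its definition; our own boundary-token-based division is defined earlier in the paper and is not reproduced here.

```python
def wang_o_subclass(span, entities):
    """Sub-class of an O span under the Wang et al. (2022) division.

    span and each entity are inclusive (start, end) token offsets.
    O1: overlaps no entity; O2: is a sub-span of some entity; O3: others.
    """
    s, e = span
    if all(e < es or s > ee for es, ee in entities):
        return "O1"
    if any(es <= s and e <= ee for es, ee in entities):
        return "O2"
    return "O3"
```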

5.12. Efficiency

Since our model considers both the token level and the span level, we compare it with token-level methods (i.e., Proto, NNShot, and StructShot) and span-level methods (i.e., ESD and De-MAML) in terms of the number of parameters and inference time in Table 11 to demonstrate its efficiency.

First, we observe the number of parameters. Proto, NNShot, and StructShot have the fewest parameters because they only use BERT to encode the tokens. Our model is similar in size to ESD and significantly smaller than De-MAML, which shows that our method is parameter-efficient, introducing only 2M more parameters than Proto. In addition, De-MAML uses two BERT encoders for its two tasks, resulting in a huge number of parameters.

Secondly, we compare the inference time of our model and the baselines under the same hardware and batch size (i.e., 1). The token-level methods are faster than the span-level methods because the span-level methods must enumerate all spans within a sentence whose length is at most L; the number of spans is approximately L times the number of tokens, which brings extra computation overhead (see the sketch below). Proto is faster than NNShot, and NNShot is faster than StructShot: Proto only measures the distance to the prototypes, NNShot measures the distance to all tokens in the support set, and StructShot further uses a Viterbi decoder to model label dependencies on top of NNShot, requiring more inference time. Comparing with the span-level methods, our model is 13.40 ms slower than ESD and 32.49 ms faster than De-MAML under the 1-shot setting; under the 5-shot setting, it is 26.78 ms slower than ESD and 23.44 ms slower than De-MAML. This illustrates that the time cost of our model is acceptable and our model is efficient. In addition, De-MAML needs two forward passes through BERT, resulting in a long inference time under the 1-shot setting, while our model and ESD both use the cross-attention module to model the interaction between the support set and the query set, which significantly increases inference time under the 5-shot setting.
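As a rough illustration of this overhead, the sketch below enumerates all candidate spans of length at most L; the names are illustrative.

```python
def enumerate_spans(n_tokens, max_len):
    """All spans of length <= max_len in a sentence of n_tokens tokens.

    There are roughly n_tokens * max_len such spans, which is why
    span-level methods pay an ~L-times overhead over token-level ones.
    """
    return [(i, j) for i in range(n_tokens)
            for j in range(i, min(i + max_len, n_tokens))]

# For example, a 30-token sentence with max span length 8 yields
# len(enumerate_spans(30, 8)) == 212 candidate spans versus 30 tokens.
```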

Table 11. The number of parameters and inference time per episode on the FewNERD dataset under the Inter 5-way setting. F1 denotes the micro-F1 over the eight settings of the FewNERD dataset. # Para. denotes the number of parameters and † denotes that we do not fine-tune the model during the meta-testing phase for time efficiency.
Models F1 # Para. Inference Time (1~2 shot) Inference Time (5~10 shot)
Ours 53.74 111M 79.43 ms 234.23 ms
Proto 37.10 109M 24.98 ms 29.83 ms
NNShot 37.37 109M 25.43 ms 30.97 ms
StructShot 39.73 109M 30.70 ms 49.62 ms
ESD 50.61 111M 66.03 ms 207.45 ms
De-MAML 44.95 218M 111.92 ms 210.79 ms
Table 12. Error analysis on the test set of the FewNERD dataset under the 5-way 1~2 shot setting. # FP-Span denotes the number and proportion of extracted entities with a wrong span boundary. # FP-Type denotes the number and proportion of extracted entities with the right boundary but the wrong entity type.
Setting F1 # FP-Span # FP-Type
Inter 62.79 7372(74.52%) 2619(25.48%)
Intra 41.21 11264(65.46%) 5944(34.54%)

5.13. Error Analysis

We further analyze the prediction errors of our model in detail and divide them into two categories (i.e., FP-Span and FP-Type) according to whether the span boundary is correct. FP-Span denotes extracted entities with a wrong span boundary, and FP-Type denotes extracted entities with the right boundary but the wrong entity type.

First, as shown in Table 12, FP-Span is the main prediction error under both settings, accounting for 74.52% under the Inter setting and 65.46% under the Intra setting. This indicates that span detection is more difficult than span classification; it is the main bottleneck at present and needs further attention in the future. Secondly, compared to the Inter setting, both types of errors rise under the more difficult Intra setting: FP-Span increases by 52.79% and FP-Type by 126.96%. This suggests that as the domain difference between the source and target domains increases, entity classification is more difficult to transfer than span detection.
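The two error categories can be computed mechanically; the sketch below labels each false positive, assuming entities are represented as (start, end, type) triples.

```python
def categorize_false_positives(predicted, gold):
    """Split false positives into FP-Span and FP-Type (sketch).

    predicted, gold: sets of (start, end, type) entities. A false positive
    whose (start, end) matches some gold entity has the right boundary but
    the wrong type (FP-Type); otherwise its boundary is wrong (FP-Span).
    """
    gold_boundaries = {(s, e) for s, e, _ in gold}
    fp_span = fp_type = 0
    for s, e, t in predicted - gold:
        if (s, e) in gold_boundaries:
            fp_type += 1
        else:
            fp_span += 1
    return fp_span, fp_type
```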

5.14. Case Study

Apart from the quantitative analysis, we also conduct a case study to intuitively show the effectiveness of our model. We list the predictions of our model and three strong baselines (i.e., StructShot, De-MAML, and ESD) on two groups of test-set queries from FewNERD under the Inter 5-way 1~2 shot setting in Table 13.

First, we observe the predictions of the first group (Queries 1, 2, and 3). The extraction capacity of StructShot is limited: it only correctly extracts the entity “B.A.” in Query (1) and also extracts impossible entities (i.e., “the” and “of” in Query (3)), which illustrates that StructShot relies heavily on the labeled samples in the support set. Then, we observe the predictions of the span-level methods. For Query (1), De-MAML and ESD predict None, while our model extracts “B.A.”; this may be because we use the token-level loss to jointly train the model. For Query (2), our model correctly extracts and classifies both “Monaco” and “Maxim’s”, whereas De-MAML fails to classify “Maxim’s” and ESD fails to extract it. This shows that our model can extract multiple entities even when a sentence contains several. For Query (3), our model correctly extracts and classifies “faculty of medicine in Paris”, whereas De-MAML and ESD extract only “faculty of medicine” and ESD also predicts the wrong type. This shows that our model can locate and extract multi-token entities.

Secondly, we display some failure cases in the second group (Queries 4, 5, and 6). For Query (4), StructShot extracts “Right” and fails to extract the simple entity “Republican”, again showing that its extraction capacity is limited and relies heavily on the labeled samples in the support set. De-MAML, ESD, and our model do not extract the entity “neo-Confederate Right” at all, perhaps because this entity is more difficult to extract. For Query (5), StructShot does not extract the entity completely, extracting only “bachelor”; De-MAML and our model also extract it incompletely as “bachelor of law”. This indicates that locating entity boundaries remains challenging. ESD extracts no entity, perhaps because it is trained with only span labels and the supervision is sparse. For Query (6), all methods predict None except De-MAML, which predicts the wrong entity “New American”. Examining the corresponding support set, this may be due to the weak representation of the prototype constructed from the two samples of the class (i.e., Bobino and Gaumont-Palace) in the support set.

Table 13. Test-set predictions of our proposed model and three strong baselines for the case study on the FewNERD dataset under the Inter 5-way 1~2 shot setting. Different colors indicate different entity types in a sentence; blue indicates that the entity type does not appear in the query sample.
Query StructShot De-MAML ESD Ours
(1) He decided to return to school in 1996, earning his B.A. [other-educational degree]. earning, B.A. None None B.A.
(2) The same year, Monaco [person-actor] was named “Maxim’s [art-written art]” number one sexiest cover model of the decade. None Monaco, Maxim’s Monaco Monaco, Maxim’s
(3) In 1864 he attained the chair of internal pathology at the faculty of medicine in Paris [building-hospital]. the, chair, of, of medicine in faculty of medicine faculty of medicine faculty of medicine in Paris
(4) She writes that “even into the twenty-first century mainstream conservative Republican [organization-political party] politicians continued to associate themselves with issues, symbols, and organizations inspired by the neo-Confederate Right [organization-political party]”. Right Republican Republican Republican
(5) He gained a bachelor of law degree [other-educational degree]. bachelor bachelor of law None bachelor of law
(6) The script was published in Grove New American Theater [building-theater]. None New American None None

6. Conclusion and Future Work

In this paper, we unify token-level and span-level supervisions and propose a consistent dual adaptive prototypical network for few-shot sequence labeling. Our model contains a token-level network and a span-level network, jointly trained with token-level and span-level labels. Furthermore, we propose a consistent loss, based on bidirectional Kullback-Leibler divergence with temperature, to make the outputs of the two networks consistent. During the inference phase, we propose a consistent greedy inference algorithm that first adjusts the predicted probabilities of the span network and then greedily selects non-overlapping spans with maximum adjusted probability. Extensive experiments show that unifying token and span level supervisions is effective and that our model achieves new state-of-the-art results on three benchmark datasets.

In the future, we plan to deepen and widen our work in the following directions: (1) In this work, we unify token and span level supervisions for few-shot sequence labeling and conduct a preliminary exploration. The proposed method is simple and effective, but whether other, more complicated token-level and span-level networks are also effective under this framework remains unexplored and is worth further study. (2) The proposed method has only been verified on English texts; its effectiveness on datasets in other languages deserves further verification. (3) The effectiveness of our model on traditional supervised sequence labeling also deserves further exploration.

References

  • Agosti et al. (2020) Maristella Agosti, Stefano Marchesin, and Gianmaria Silvello. 2020. Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval. ACM Trans. Inf. Syst. 38, 4 (2020), 38:1–38:48. https://doi.org/10.1145/3417996
  • Athiwaratkun et al. (2020) Ben Athiwaratkun, Cícero Nogueira dos Santos, Jason Krone, and Bing Xiang. 2020. Augmented Natural Language for Generative Sequence Labeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020. 375–385. https://doi.org/10.18653/v1/2020.emnlp-main.27
  • Chen et al. (2022) Jiawei Chen, Qing Liu, Hongyu Lin, Xianpei Han, and Le Sun. 2022. Few-shot Named Entity Recognition with Self-describing Networks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. 5711–5722. https://doi.org/10.18653/v1/2022.acl-long.392
  • Chen et al. (2019) Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. 2019. A Closer Look at Few-shot Classification. In 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net. https://openreview.net/forum?id=HkxLXnAcFQ
  • Chen et al. (2021) Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, and Xiaolong Wang. 2021. Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 9042–9051. https://doi.org/10.1109/ICCV48922.2021.00893
  • Cheng et al. (2021) Zifeng Cheng, Zhiwei Jiang, Yafeng Yin, Na Li, and Qing Gu. 2021. A Unified Target-Oriented Sequence-to-Sequence Model for Emotion-Cause Pair Extraction. IEEE ACM Trans. Audio Speech Lang. Process. 29 (2021), 2779–2791. https://doi.org/10.1109/TASLP.2021.3102194
  • Chiu and Nichols (2016) Jason P. C. Chiu and Eric Nichols. 2016. Named Entity Recognition with Bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguistics 4 (2016), 357–370. https://doi.org/10.1162/tacl_a_00104
  • Colombo et al. (2020) Pierre Colombo, Emile Chapuis, Matteo Manica, Emmanuel Vignon, Giovanna Varni, and Chloé Clavel. 2020. Guiding Attention in Sequence-to-Sequence Models for Dialogue Act Prediction. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020. 7594–7601. https://ojs.aaai.org/index.php/AAAI/article/view/6259
  • Coucke et al. (2018) Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, and Joseph Dureau. 2018. Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces. CoRR abs/1805.10190 (2018). http://arxiv.org/abs/1805.10190
  • Cui et al. (2021) Leyang Cui, Yu Wu, Jian Liu, Sen Yang, and Yue Zhang. 2021. Template-Based Named Entity Recognition Using BART. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021. 1835–1845. https://doi.org/10.18653/v1/2021.findings-acl.161
  • Cui and Zhang (2019) Leyang Cui and Yue Zhang. 2019. Hierarchically-Refined Label Attention Network for Sequence Labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. 4113–4126. https://doi.org/10.18653/v1/D19-1422
  • Das et al. (2022) Sarkar Snigdha Sarathi Das, Arzoo Katiyar, Rebecca J. Passonneau, and Rui Zhang. 2022. CONTaiNER: Few-Shot Named Entity Recognition via Contrastive Learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. 6338–6353. https://doi.org/10.18653/v1/2022.acl-long.439
  • Derczynski et al. (2017) Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, NUT@EMNLP 2017. Association for Computational Linguistics, 140–147. https://doi.org/10.18653/v1/w17-4418
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019. 4171–4186. https://doi.org/10.18653/v1/n19-1423
  • Dhillon et al. (2020) Guneet Singh Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. 2020. A Baseline for Few-Shot Image Classification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rylXBkrYDS
  • Ding et al. (2021) Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Haitao Zheng, and Zhiyuan Liu. 2021. Few-NERD: A Few-shot Named Entity Recognition Dataset. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. 3198–3213. https://doi.org/10.18653/v1/2021.acl-long.248
  • Fei-Fei et al. (2006) Li Fei-Fei, Robert Fergus, and Pietro Perona. 2006. One-Shot Learning of Object Categories. IEEE Trans. Pattern Anal. Mach. Intell. 28, 4 (2006), 594–611. https://doi.org/10.1109/TPAMI.2006.79
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017. 1126–1135. http://proceedings.mlr.press/v70/finn17a.html
  • Finn et al. (2018) Chelsea Finn, Kelvin Xu, and Sergey Levine. 2018. Probabilistic Model-Agnostic Meta-Learning. In Advances in Neural Information Processing Systems, NeurIPS 2018. 9537–9548. https://proceedings.neurips.cc/paper/2018/hash/8e2c381d4dd04f1c55093f22c59c3a08-Abstract.html
  • Fritzler et al. (2019) Alexander Fritzler, Varvara Logacheva, and Maksim Kretov. 2019. Few-shot classification in named entity recognition task. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, SAC 2019. 993–1000. https://doi.org/10.1145/3297280.3297378
  • Fu et al. (2021) Jinlan Fu, Xuanjing Huang, and Pengfei Liu. 2021. SpanNER: Named Entity Re-/Recognition as Span Prediction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. 7183–7195. https://doi.org/10.18653/v1/2021.acl-long.558
  • Gao et al. (2019) Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. 2019. Hybrid Attention-Based Prototypical Networks for Noisy Few-Shot Relation Classification. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019. AAAI Press, 6407–6414. https://doi.org/10.1609/aaai.v33i01.33016407
  • Guo et al. (2022) Jiafeng Guo, Yinqiong Cai, Yixing Fan, Fei Sun, Ruqing Zhang, and Xueqi Cheng. 2022. Semantic Models for the First-Stage Retrieval: A Comprehensive Review. ACM Trans. Inf. Syst. 40, 4 (2022), 66:1–66:42. https://doi.org/10.1145/3486250
  • Henderson and Vulic (2021) Matthew Henderson and Ivan Vulic. 2021. ConVEx: Data-Efficient and Few-Shot Slot Labeling. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021. 3375–3389. https://doi.org/10.18653/v1/2021.naacl-main.264
  • Hou et al. (2019) Ruibing Hou, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. 2019. Cross Attention Network for Few-shot Classification. In Advances in Neural Information Processing Systems, NeurIPS 2019. 4005–4016. https://proceedings.neurips.cc/paper/2019/hash/01894d6f048493d2cacde3c579c315a3-Abstract.html
  • Hou et al. (2020) Yutai Hou, Wanxiang Che, Yongkui Lai, Zhihan Zhou, Yijia Liu, Han Liu, and Ting Liu. 2020. Few-shot Slot Tagging with Collapsed Dependency Transfer and Label-enhanced Task-adaptive Projection Network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. 1381–1393. https://doi.org/10.18653/v1/2020.acl-main.128
  • Hou et al. (2022) Yutai Hou, Cheng Chen, Xianzhen Luo, Bohan Li, and Wanxiang Che. 2022. Inverse is Better! Fast and Accurate Prompt for Few-shot Slot Tagging. In Findings of the Association for Computational Linguistics: ACL 2022. 637–647. https://doi.org/10.18653/v1/2022.findings-acl.53
  • Huang et al. (2021) Jiaxin Huang, Chunyuan Li, Krishan Subudhi, Damien Jose, Shobana Balakrishnan, Weizhu Chen, Baolin Peng, Jianfeng Gao, and Jiawei Han. 2021. Few-Shot Named Entity Recognition: An Empirical Baseline Study. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021. 10408–10423. https://doi.org/10.18653/v1/2021.emnlp-main.813
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR abs/1508.01991 (2015). http://arxiv.org/abs/1508.01991
  • Jiang et al. (2020) Zhengbao Jiang, Wei Xu, Jun Araki, and Graham Neubig. 2020. Generalizing Natural Language Analysis through Span-relation Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online. Association for Computational Linguistics, 2120–2133. https://doi.org/10.18653/v1/2020.acl-main.192
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 260–270. https://doi.org/10.18653/v1/n16-1030
  • Lee et al. (2022) Dong-Ho Lee, Akshen Kadakia, Kangmin Tan, Mahak Agarwal, Xinyu Feng, Takashi Shibuya, Ryosuke Mitani, Toshiyuki Sekiya, Jay Pujara, and Xiang Ren. 2022. Good Examples Make A Faster Learner: Simple Demonstration-based Learning for Low-resource NER. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. 2687–2700. https://doi.org/10.18653/v1/2022.acl-long.192
  • Li et al. (2020b) Aoxue Li, Weiran Huang, Xu Lan, Jiashi Feng, Zhenguo Li, and Liwei Wang. 2020b. Boosting Few-Shot Learning With Adaptive Margin Loss. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020. Computer Vision Foundation / IEEE, 12573–12581. https://doi.org/10.1109/CVPR42600.2020.01259
  • Li et al. (2022) Jing Li, Billy Chiu, Shanshan Feng, and Hao Wang. 2022. Few-Shot Named Entity Recognition via Meta-Learning. IEEE Trans. Knowl. Data Eng. 34, 9 (2022), 4245–4256. https://doi.org/10.1109/TKDE.2020.3038670
  • Li et al. (2020d) Kai Li, Yulun Zhang, Kunpeng Li, and Yun Fu. 2020d. Adversarial Feature Hallucination Networks for Few-Shot Learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020. Computer Vision Foundation / IEEE, 13467–13476. https://doi.org/10.1109/CVPR42600.2020.01348
  • Li et al. (2020c) Ruirui Li, Xian Wu, Xiusi Chen, and Wei Wang. 2020c. Few-Shot Learning for New User Recommendation in Location-based Social Networks. In WWW ’20: The Web Conference 2020, Taipei, Taiwan. 2472–2478. https://doi.org/10.1145/3366423.3379994
  • Li et al. (2020a) Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020a. A Unified MRC Framework for Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. 5849–5859. https://doi.org/10.18653/v1/2020.acl-main.519
  • Li et al. (2021) Xiangsheng Li, Jiaxin Mao, Weizhi Ma, Yiqun Liu, Min Zhang, Shaoping Ma, Zhaowei Wang, and Xiuqiang He. 2021. Topic-enhanced knowledge-aware retrieval model for diverse relevance estimation. In WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021. ACM / IW3C2, 756–767. https://doi.org/10.1145/3442381.3449943
  • Li et al. (2015) Zhenghua Li, Jiayuan Chao, Min Zhang, and Wenliang Chen. 2015. Coupled Sequence Labeling on Heterogeneous Annotations: POS Tagging as a Case Study. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015. 1783–1792. https://doi.org/10.3115/v1/p15-1172
  • Liu et al. (2020) Bulou Liu, Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, and Haiqing Chen. 2020. An Attention-based Deep Relevance Model for Few-shot Document Filtering. ACM Trans. Inf. Syst. 39, 1 (2020), 6:1–6:35. https://doi.org/10.1145/3419972
  • Liu et al. (2022) Yonghao Liu, Mengyu Li, Ximing Li, Fausto Giunchiglia, Xiaoyue Feng, and Renchu Guan. 2022. Few-shot Node Classification on Attributed Networks with Graph Meta-learning. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 471–481. https://doi.org/10.1145/3477495.3531978
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019,. https://openreview.net/forum?id=Bkg6RiCqY7
  • Ma et al. (2019) Dehong Ma, Sujian Li, Fangzhao Wu, Xing Xie, and Houfeng Wang. 2019. Exploring Sequence-to-Sequence Learning in Aspect Term Extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019. 3538–3547. https://doi.org/10.18653/v1/p19-1344
  • Ma et al. (2022a) Jie Ma, Miguel Ballesteros, Srikanth Doss, Rishita Anubhai, Sunil Mallya, Yaser Al-Onaizan, and Dan Roth. 2022a. Label Semantics for Few Shot Named Entity Recognition. In Findings of the Association for Computational Linguistics: ACL 2022. 1956–1971. https://doi.org/10.18653/v1/2022.findings-acl.155
  • Ma et al. (2021) Jianqiang Ma, Zeyu Yan, Chang Li, and Yang Zhang. 2021. Frustratingly Simple Few-Shot Slot Tagging. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021. 1028–1033. https://doi.org/10.18653/v1/2021.findings-acl.88
  • Ma et al. (2022c) Longxuan Ma, Mingda Li, Wei-Nan Zhang, Jiapeng Li, and Ting Liu. 2022c. Unstructured Text Enhanced Open-Domain Dialogue System: A Systematic Survey. ACM Trans. Inf. Syst. 40, 1 (2022), 9:1–9:44. https://doi.org/10.1145/3464377
  • Ma et al. (2022d) Ruotian Ma, Xin Zhou, Tao Gui, Yiding Tan, Linyang Li, Qi Zhang, and Xuanjing Huang. 2022d. Template-free Prompt Tuning for Few-shot NER. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022. 5721–5732. https://doi.org/10.18653/v1/2022.naacl-main.420
  • Ma et al. (2022b) Tingting Ma, Huiqiang Jiang, Qianhui Wu, Tiejun Zhao, and Chin-Yew Lin. 2022b. Decomposed Meta-Learning for Few-Shot Named Entity Recognition. In Findings of the Association for Computational Linguistics: ACL 2022. 1584–1596. https://doi.org/10.18653/v1/2022.findings-acl.124
  • Ma and Hovy (2016) Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016. https://doi.org/10.18653/v1/p16-1101
  • Oguz and Vu (2021) Cennet Oguz and Ngoc Thang Vu. 2021. Few-shot Learning for Slot Tagging with Attentive Relational Network. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021. 1566–1572. https://doi.org/10.18653/v1/2021.eacl-main.134
  • Ouchi et al. (2020) Hiroki Ouchi, Jun Suzuki, Sosuke Kobayashi, Sho Yokoi, Tatsuki Kuribayashi, Ryuto Konno, and Kentaro Inui. 2020. Instance-Based Learning of Span Representations: A Case Study through Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. Association for Computational Linguistics, 6452–6459. https://doi.org/10.18653/v1/2020.acl-main.575
  • Pradhan et al. (2013) Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards Robust Linguistic Analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, CoNLL 2013. ACL, 143–152. https://aclanthology.org/W13-3516/
  • Qin et al. (2019) Libo Qin, Wanxiang Che, Yangming Li, Haoyang Wen, and Ting Liu. 2019. A Stack-Propagation Framework with Token-Level Intent Detection for Spoken Language Understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. Association for Computational Linguistics, 2078–2087. https://doi.org/10.18653/v1/D19-1214
  • Sang and Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003. ACL, 142–147. https://aclanthology.org/W03-0419/
  • Shen et al. (2022) Yongliang Shen, Xiaobin Wang, Zeqi Tan, Guangwei Xu, Pengjun Xie, Fei Huang, Weiming Lu, and Yueting Zhuang. 2022. Parallel Instance Query Network for Named Entity Recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. Association for Computational Linguistics, 947–961. https://doi.org/10.18653/v1/2022.acl-long.67
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. Advances in neural information processing systems (2017). https://proceedings.neurips.cc/paper/2017/hash/cb8da6767461f2812ae4290eac7cbc42-Abstract.html
  • Sun et al. (2019) Shengli Sun, Qingfeng Sun, Kevin Zhou, and Tengchao Lv. 2019. Hierarchical Attention Prototypical Networks for Few-Shot Text Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. Association for Computational Linguistics, 476–485. https://doi.org/10.18653/v1/D19-1045
  • Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. 2018. Learning to Compare: Relation Network for Few-Shot Learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018. 1199–1208. http://openaccess.thecvf.com/content_cvpr_2018/html/Sung_Learning_to_Compare_CVPR_2018_paper.html
  • Tan et al. (2021) Zeqi Tan, Yongliang Shen, Shuai Zhang, Weiming Lu, and Yueting Zhuang. 2021. A Sequence-to-Set Network for Nested Named Entity Recognition. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021. ijcai.org, 3936–3942. https://doi.org/10.24963/ijcai.2021/542
  • Tong et al. (2021) Meihan Tong, Shuai Wang, Bin Xu, Yixin Cao, Minghui Liu, Lei Hou, and Juanzi Li. 2021. Learning from Miscellaneous Other-Class Words for Few-shot Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. 6236–6247. https://doi.org/10.18653/v1/2021.acl-long.487
  • Vakulenko et al. (2021) Svitlana Vakulenko, Evangelos Kanoulas, and Maarten de Rijke. 2021. A Large-scale Analysis of Mixed Initiative in Information-Seeking Dialogues for Conversational Search. ACM Trans. Inf. Syst. 39, 4 (2021), 49:1–49:32. https://doi.org/10.1145/3466796
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017. 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. Advances in neural information processing systems (2016). https://proceedings.neurips.cc/paper/2016/hash/90e1357833654983612fb05e3ec9148c-Abstract.html
  • Wang et al. (2022) Peiyi Wang, Runxin Xu, Tianyu Liu, Qingyu Zhou, Yunbo Cao, Baobao Chang, and Zhifang Sui. 2022. An Enhanced Span-based Decomposition Method for Few-Shot Sequence Labeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022. 5012–5024. https://doi.org/10.18653/v1/2022.naacl-main.369
  • Wang et al. (2021) Yaqing Wang, Haoda Chu, Chao Zhang, and Jing Gao. 2021. Learning from Language Description: Low-shot Named Entity Recognition via Decomposed Framework. In Findings of the Association for Computational Linguistics: EMNLP 2021. 1618–1630. https://doi.org/10.18653/v1/2021.findings-emnlp.139
  • Wang et al. (2018) Yu-Xiong Wang, Ross B. Girshick, Martial Hebert, and Bharath Hariharan. 2018. Low-Shot Learning From Imaginary Data. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. Computer Vision Foundation / IEEE Computer Society, 7278–7286. https://doi.org/10.1109/CVPR.2018.00760
  • Yan et al. (2021) Hang Yan, Tao Gui, Junqi Dai, Qipeng Guo, Zheng Zhang, and Xipeng Qiu. 2021. A Unified Generative Framework for Various NER Subtasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. Association for Computational Linguistics, 5808–5822. https://doi.org/10.18653/v1/2021.acl-long.451
  • Yang and Katiyar (2020) Yi Yang and Arzoo Katiyar. 2020. Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020. 6365–6375. https://doi.org/10.18653/v1/2020.emnlp-main.516
  • Yoon et al. (2019) Sung Whan Yoon, Jun Seo, and Jaekyun Moon. 2019. TapNet: Neural Network Augmented with Task-Adaptive Projection for Few-Shot Learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019. 7115–7123. http://proceedings.mlr.press/v97/yoon19a.html
  • Yu et al. (2021a) Dian Yu, Luheng He, Yuan Zhang, Xinya Du, Panupong Pasupat, and Qi Li. 2021a. Few-shot Intent Classification and Slot Filling with Retrieved Examples. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021. 734–749. https://doi.org/10.18653/v1/2021.naacl-main.59
  • Yu et al. (2020) Juntao Yu, Bernd Bohnet, and Massimo Poesio. 2020. Named Entity Recognition as Dependency Parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. 6470–6476. https://doi.org/10.18653/v1/2020.acl-main.577
  • Yu et al. (2021b) Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and Zhiyuan Liu. 2021b. Few-Shot Conversational Dense Retrieval. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 829–838. https://doi.org/10.1145/3404835.3462856
  • Zeldes (2017) Amir Zeldes. 2017. The GUM corpus: creating multilayer resources in the classroom. Lang. Resour. Evaluation 51, 3 (2017), 581–612. https://doi.org/10.1007/s10579-016-9343-x
  • Zhai et al. (2017) Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. 2017. Neural Models for Sequence Chunking. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. 3365–3371. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14776
  • Zhang et al. (2018) Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. 2018. MetaGAN: An Adversarial Approach to Few-Shot Learning. In Advances in Neural Information Processing Systems, NeurIPS 2018. 2371–2380. https://proceedings.neurips.cc/paper/2018/hash/4e4e53aa080247bc31d0eb4e7aeb07a0-Abstract.html
  • Zhang et al. (2022a) Shichao Zhang, Jiaye Li, and Yangding Li. 2022a. Reachable distance function for KNN classification. IEEE Transactions on Knowledge and Data Engineering (2022).
  • Zhang et al. (2022b) Shuai Zhang, Yongliang Shen, Zeqi Tan, Yiquan Wu, and Weiming Lu. 2022b. De-Bias for Generative Extraction in Unified NER Task. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. Association for Computational Linguistics, 808–818. https://doi.org/10.18653/v1/2022.acl-long.59
  • Zhou et al. (2022) Jing Zhou, Yanan Zheng, Jie Tang, Li Jian, and Zhilin Yang. 2022. FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. Association for Computational Linguistics, 8646–8665. https://doi.org/10.18653/v1/2022.acl-long.592
  • Zhu et al. (2022) Enwei Zhu, Yiyang Liu, and Jinpeng Li. 2022. Deep Span Representations for Named Entity Recognition. CoRR abs/2210.04182 (2022). https://doi.org/10.48550/arXiv.2210.04182 arXiv:2210.04182