
Unifying Token and Span Level Supervisions for Few-Shot Sequence Labeling

Zifeng Cheng ([email protected]), State Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Ave, Nanjing, Jiangsu, China 210023
Qingyu Zhou ([email protected]), Tencent Cloud Xiaowei, Beijing, China 100000
Zhiwei Jiang ([email protected]), State Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Ave, Nanjing, Jiangsu, China 210023
Xuemin Zhao ([email protected]), Tencent Cloud Xiaowei, Chengdu, Sichuan, China 610000
Yunbo Cao ([email protected]), Tencent Cloud Xiaowei, Beijing, China 100000
Qing Gu ([email protected]), State Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Ave, Nanjing, Jiangsu, China 210023
(2023)
Abstract.

Few-shot sequence labeling aims to identify novel classes based on only a few labeled samples. Existing methods solve the data scarcity problem mainly by designing token-level or span-level labeling models based on metric learning. However, these methods are only trained at a single granularity (i.e., either token level or span level) and have some weaknesses specific to that granularity. In this paper, we first unify token-level and span-level supervisions and propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot sequence labeling. CDAP contains a token-level network and a span-level network, jointly trained at different granularities. To align the outputs of the two networks, we further propose a consistent loss that enables them to learn from each other. During the inference phase, we propose a consistent greedy inference algorithm that first adjusts the predicted probabilities and then greedily selects non-overlapping spans with maximum probability. Extensive experiments show that our model achieves new state-of-the-art results on three benchmark datasets. All the code and data of this work will be released at https://github.com/zifengcheng/CDAP.

Few-Shot Sequence Labeling, Few-Shot Learning, Sequence Labeling
This work is supported by National Natural Science Foundation of China under Grant Nos. 61972192, 62172208, 61906085, 41972111. This work is partially supported by Collaborative Innovation Center of Novel Software Technology and Industrialization.

1. Introduction

Sequence labeling tasks such as Named Entity Recognition (NER) and Slot Tagging (ST) are fundamental tasks in information extraction, benefiting query and document understanding in information retrieval (Guo et al., 2022; Agosti et al., 2020; Li et al., 2021), request analysis in task-oriented dialog systems (Vakulenko et al., 2021; Ma et al., 2022c), and so forth. Traditional sequence labeling methods are often supervised and require a large amount of annotated data for training. This limits their application in real scenarios where the acquisition of annotated data is laborious and expensive.

In recent years, few-shot sequence labeling has attracted much attention, as it aims to extract novel classes based on only a few labeled samples (Fritzler et al., 2019; Hou et al., 2020; Yang and Katiyar, 2020; Tong et al., 2021). Table 1 shows a 2-way 1-shot example, which contains two novel classes (i.e., building-other and building-library), each with one sample in the support set. Few-shot sequence labeling aims to learn a model that borrows prior experience from old domains and adapts quickly to new domains with only very few samples. For example, based on the support set, the model should extract “Brooklyn Bridge” and “Oxford University Museum of Natural History” as “building-other”, and “denver public library” and “Radcliffe Science Library” as “building-library” in the query set.

Some prevalent methods for few-shot sequence labeling focus on token-level metric learning (Fritzler et al., 2019; Hou et al., 2020; Yang and Katiyar, 2020; Oguz and Vu, 2021; Ma et al., 2022a; Das et al., 2022; Li et al., 2022), in which the model assigns a label to each query token based on a learned distance metric. For example, Fritzler et al. (2019) construct a prototype for each class to classify query tokens, and Das et al. (2022) optimize token-level Gaussian-distributed embeddings to classify query tokens. Recently, some methods (Yu et al., 2021a; Wang et al., 2021; Ma et al., 2022b; Wang et al., 2022) focus on span-level metric learning, measuring span-level similarity for classification, and achieve state-of-the-art performance. For example, Wang et al. (2021) propose a two-step model, which first detects spans and then classifies them based on a span-level metric.

Table 1. An example of 2-way 1-shot sequence labeling task. Novel classes are building-other and building-library. It is worth noting that in this example two novel classes have the same sentence in the support set.
Support Set $\mathcal{S}$: (1) The museum was established as the Municipal Gallery of Hamilton [building-other] in January 1914, and was opened to the public in June 1914, at a Hamilton Public Library [building-library] building on Main Street West.
Query Set $\mathcal{Q}$: (1) The Brooklyn Bridge [building-other] ceramic tiles display the bridge’s vertical cables but do not depict its diagonal cables. (2) The reading program had been discontinued, library patronage had dropped, and the denver public library [building-library] was planning to shutter the facility. (3) The current Radcliffe Science Library [building-library] building is located next to the Oxford University Museum of Natural History [building-other] and consists of three parts.

Despite their success, these token-level and span-level metric learning methods are only trained at a single granularity, so they inevitably have some weaknesses specific to their granularity. For the token-level metric learning methods, their scheme of labeling tokens separately is prone to ignoring the integrality of named entities and dialog slots with multiple tokens. Taking the sentence of the support set in Table 1 as an example, to extract the complete entity “Municipal Gallery of Hamilton”, each of its constituent tokens, “Municipal”, “Gallery”, “of”, and “Hamilton”, must be labeled separately, potentially losing information about the entity as a whole. As a consequence, it is challenging for token-level representation learning to capture the full semantics of multi-token entities and slots (Wang et al., 2022). For the span-level metric learning methods, since they need to enumerate all possible spans for classification, the few-shot labels of entities and slots are easily diluted by a large number of non-entity and non-slot labels. For example, if we set the maximum span length to 4, the spans containing the word “Municipal” include “Municipal”, “the Municipal”, “Municipal Gallery”, “as the Municipal”, “Municipal Gallery of”, “established as the Municipal”, “Municipal Gallery of Hamilton”, and so forth. Among these spans, only “Municipal Gallery of Hamilton” is an entity, while the others are non-entities. According to our statistics on the real dataset FewNERD, the ratio of entity spans to non-entity spans can even exceed 1:100. As a result, when entity spans are rare in the few-shot setting, it becomes challenging for span-level metric learning methods to distinguish them from a large number of similar non-entity spans.

These weaknesses are difficult to deal with when only a single-granularity model is trained, but they can be alleviated when both token-level and span-level labels are utilized in a unified view for model training. On the one hand, span-level labeling can help token-level labeling to maintain entity integrality and capture the full semantics of multi-token entities. On the other hand, token-level labeling can in turn help span-level labeling to distinguish easily confused spans.

To this end, we propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot sequence labeling. CDAP jointly trains token-level and span-level networks and makes them benefit from each other to produce better predictions. Specifically, we employ an Adaptive Prototypical Network (APN) to generate adaptive prototypes specific to each query in the few-shot setting and use the APN for both token-level and span-level labeling. Besides, considering that the token-level and span-level networks may produce different predictions, we further propose a consistent loss to align their predictions based on a bidirectional Kullback-Leibler divergence with temperature. During the inference phase, we propose a consistent greedy inference algorithm to combine the predictions of both networks. The core idea of the algorithm is to adjust the prediction probability of each span based on the predictions of the tokens within the span.

The main contributions of this paper can be summarized as follows:

  • To the best of our knowledge, we are the first to unify token-level and span-level supervisions to train a few-shot sequence labeling model.

  • We propose a consistent dual adaptive prototypical network for few-shot sequence labeling and a consistent greedy inference algorithm to combine the results of two networks.

  • Extensive experiments show that our model achieves new state-of-the-art performance on three benchmark datasets.

The rest of this paper is organized as follows. We first review the related work in Section 2. Then, we formulate the task and describe our proposed method in detail in Sections 3 and 4, respectively. After that, we conduct experiments and analyze the results in Section 5. Finally, we conclude our work along with some future work in Section 6.

2. Related Work

In this section, we introduce the following three research topics relevant to our work: few-shot learning, sequence labeling, and few-shot sequence labeling.

2.1. Few-Shot Learning

Few-shot learning aims to recognize novel classes with only a few samples per class and was first proposed in the computer vision community (Fei-Fei et al., 2006). Motivated by a simple machine learning principle (i.e., train and test conditions must match), Matching Networks (Vinyals et al., 2016) first model few-shot learning as an episodic N-way K-shot task, where each episode has N classes and each class has K annotated samples (K is very small, e.g., 5). Afterwards, many methods have been proposed, such as metric learning methods, optimization methods, and generation methods. Metric learning methods aim to learn a suitable embedding space for distance metrics, including cosine similarity (Vinyals et al., 2016), Euclidean distance to a prototype (i.e., mean class representation) (Snell et al., 2017), learnable relation modules (Sung et al., 2018), nearest neighbor (Yang and Katiyar, 2020), and reachable distance (Zhang et al., 2022a). For example, Snell et al. (2017) first construct prototypes by averaging the representations of each class in the support set and then optimize the network using a softmax function over the Euclidean distances between all prototypes and the query representation. Some other methods have focused on improving the transferability of features, including using a cross-attention module to exploit the semantic relevance between support and query features (Hou et al., 2019), an adaptive margin loss (Li et al., 2020b), and so on. Recently, some methods (Chen et al., 2019; Dhillon et al., 2020; Chen et al., 2021) propose a transfer-learning framework that pre-trains on the training set and fine-tunes on the support set of the test episode. Model-agnostic meta-learning (MAML) (Finn et al., 2017) is a typical optimization method that aims to learn a good model initialization and then quickly adapt to test episodes. Finn et al. (2018) further extend MAML, which adapts to new tasks via gradient descent, to incorporate a parameter distribution that is trained via a variational lower bound. Generation methods aim to generate samples (Wang et al., 2018; Zhang et al., 2018; Zhou et al., 2022) or features (Li et al., 2020d) to augment the training set. Specifically, Zhang et al. (2018) propose to augment vanilla few-shot classification models with a GAN for better generalization. FlipDA (Zhou et al., 2022) jointly uses a generative model and a classifier to generate label-flipped data, which brings a more significant performance improvement than label-preserved data. AFHN (Li et al., 2020d) generates diverse and discriminative features conditioned on the few labeled samples based on a cWGAN.

Recently, few-shot learning has received increasing attention in the IR and NLP communities, including recommendation (Li et al., 2020c), document filtering (Liu et al., 2020), dense retrieval (Yu et al., 2021b), node classification (Liu et al., 2022), and so on.

2.2. Sequence Labeling

Sequence labeling is a fundamental NLP task that covers named entity recognition (Lample et al., 2016), part-of-speech tagging (Li et al., 2015), slot tagging (Qin et al., 2019), and so on. Recently, neural sequence labeling models have become the mainstream approach for sequence labeling.

Typical neural sequence labeling models (Huang et al., 2015; Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) generally use a CNN or LSTM to extract features and utilize softmax or a Conditional Random Field (CRF) for classification. For example, Chiu and Nichols (2016) use a CNN to extract character-level features, an LSTM to extract word-level features, and softmax to classify words. However, Cui and Zhang (2019) find that BiLSTM-CRF does not always lead to better results than BiLSTM-Softmax, and propose a hierarchically-refined label attention network, which explicitly leverages label embeddings and captures potential long-term label dependency by giving each word incrementally refined label distributions with hierarchical attention. Afterwards, some work (Yu et al., 2020; Fu et al., 2021; Ouchi et al., 2020; Jiang et al., 2020; Zhu et al., 2022) adopts span-level prediction, which first detects spans by enumeration or boundary identification and then classifies them. For example, Yu et al. (2020) use a biaffine classifier to assign scores to all spans in a sentence. Fu et al. (2021) first compare and analyze the advantages and disadvantages of traditional token-level models and span-level models in detail, and then reveal that span prediction can serve as a system combiner to re-recognize named entities from different systems’ outputs.

Besides, four novel paradigms for sequence labeling have recently been proposed, reformulating it as sequence-to-sequence learning, generation, set prediction, and machine reading comprehension tasks, respectively. First, considering that sequence-to-sequence learning has a good ability to learn complex global label dependencies, some work applies it to sequence labeling tasks such as chunking (Zhai et al., 2017), aspect term extraction (Ma et al., 2019), emotion-cause pair extraction (Cheng et al., 2021), and dialogue act prediction (Colombo et al., 2020). For example, in order to make sequence-to-sequence learning suitable for sequence labeling, Ma et al. (2019) design gated unit networks to incorporate the corresponding word representation into the decoder and position-aware attention to pay more attention to the adjacent words of a target word. Second, some works use generation to solve sequence labeling. Specifically, Athiwaratkun et al. (2020) propose to generate an augmented output that repeats the original input sequence with additional markers indicating the token spans and their associated labels. Yan et al. (2021) propose a unified generation framework with a pointer mechanism to tackle flat, nested, and discontinuous NER subtasks. Zhang et al. (2022b) analyze the incorrect biases in the generation process from a causality perspective and design intra- and inter-entity deconfounding data augmentation methods to eliminate confounders. Third, Tan et al. (2021) first propose a sequence-to-set network for NER, which provides a fixed set of entity queries to replace the explicit candidate spans and uses self-attention to capture the dependencies between entities. Finally, Li et al. (2020a) first propose a machine reading comprehension framework for the NER task and construct type-specific queries using semantic prior information about entity categories. However, type-specific queries can only extract one entity type per inference and rely on external knowledge to construct queries. Shen et al. (2022) further set up global and learnable instance queries to extract entities from a sentence in a parallel manner.

2.3. Few-Shot Sequence Labeling

Few-shot sequence labeling aims to identify novel classes based on only a few labeled samples and has gained widespread attention due to its application value. We roughly divide the existing few-shot sequence labeling methods into three groups.

The first group consists of token-level metric learning methods. Fritzler et al. (2019) first use prototypical networks to solve this task. Afterwards, L-TapNet (Hou et al., 2020) and StructShot (Yang and Katiyar, 2020) both explore CRFs to model label dependency in the few-shot setting. L-TapNet (Hou et al., 2020) extends task-adaptive projection networks (TapNet) (Yoon et al., 2019) and improves label representation with label names. StructShot (Yang and Katiyar, 2020) trains the model with the traditional cross-entropy loss and uses nearest-neighbor classification. Huang et al. (2021) explore noisy supervised pre-training and self-training in the few-shot sequence labeling setting. Motivated by the observation that the non-entity class O has rich semantics, MUCO (Tong et al., 2021) introduces a binary classifier to determine whether a token is an entity or not. Ma et al. (2022a) propose to leverage the semantics of label names and use two separate BERT encoders for labels and tokens in a token-level similarity metric. Instead of optimizing class-specific attributes, CONTAINER (Das et al., 2022) models a Gaussian embedding for each token and optimizes inter-token distribution distance to improve generalizability.

The second group consists of span-level methods. SpanNER (Wang et al., 2021) is a two-step method, which first detects start and end tokens to extract spans and then introduces class descriptions to classify the spans. Ma et al. (2022b) further propose a model-agnostic meta-learning (MAML) (Finn et al., 2017) enhanced two-step method, which uses the BIESO scheme to detect spans and prototypical networks to classify them. Retriever (Yu et al., 2021a) proposes a span-level retrieval method that learns similar contextualized representations for spans with the same label via a novel batch-softmax objective. ESD (Wang et al., 2022) proposes inter-span and cross-span attention to enhance span representations, and uses instance span attention to construct prototypes.

The third group of methods mitigates data scarcity with other paradigms, including machine reading comprehension (Ma et al., 2021), generation (Athiwaratkun et al., 2020; Chen et al., 2022), pairwise cloze (Henderson and Vulic, 2021), and prompt tuning (Cui et al., 2021; Lee et al., 2022; Ma et al., 2022d; Hou et al., 2022). For the machine reading comprehension method, Ma et al. (2021) enumerate all the slots to extract the answer from the sentence. The MRC-based method can naturally encode label semantics in the form of questions. For generation methods, Athiwaratkun et al. (2020) propose to generate an augmented natural language output, including the original input sequence, delimiters, and labels. Self-describing Networks (Chen et al., 2022) is a sequence-to-sequence generation network that can universally describe mentions using concepts, automatically map novel entity types to concepts, and adaptively recognize entities on demand. For prompt tuning methods, Template (Cui et al., 2021) treats NER as a language model ranking problem in a sequence-to-sequence framework, where the output sequence is filled by templates. Hou et al. (2022) propose an inverse prompt that predicts slot values given slot types, and further introduce an iterative prediction strategy to consider the relations between different slot types. Ma et al. (2022d) reformulate the NER task as an entity-oriented LM task, inducing the LM to predict label words at entity positions during fine-tuning, and propose a template-free prompt tuning method, EntLM. Lee et al. (2022) perform a systematic study of entity-oriented prompts (i.e., only entity tokens) and instance-oriented prompts (i.e., retrieved sentences).

3. Task Formulation

We define a sentence as $\boldsymbol{x}=\{x_{1},x_{2},\ldots,x_{n}\}$ and its label as $\boldsymbol{y}=\{(s_{i},y_{i})\}_{i=1}^{M}$, where $s_{i}$ is a span in sentence $\boldsymbol{x}$ (e.g., “nimbus” and “nimbus powerplant” in Figure 1), $y_{i}$ is the corresponding label (e.g., entity class “building-other” and non-entity class O), and $M$ is the number of spans in the sentence. Few-shot sequence labeling aims to train a model in the source domain that can accurately extract and classify all novel classes with a few labeled samples in the target domain. Episode training (Vinyals et al., 2016) is a common and effective way to train the model by imitating the test situation for few-shot learning. Each training/test episode contains an $N$-way $K$-shot support set $\mathcal{S}=\{(\boldsymbol{x}_{i},\boldsymbol{y}_{i})\}_{i=1}^{m_{s}}$ and a query set $\mathcal{Q}=\{(\boldsymbol{x}_{i},\boldsymbol{y}_{i})\}_{i=1}^{m_{q}}$, where the support set $\mathcal{S}$ includes $N$ classes ($N$-way) and each class has $K$ annotated examples ($K$-shot). In the training phase, the few-shot sequence labeling model is trained by predicting labels of the query set $\mathcal{Q}_{train}$ according to the labeled support set $\mathcal{S}_{train}$. In the testing phase, the model directly predicts labels of the query set $\mathcal{Q}_{test}$ on the unseen test dataset according to the labeled support set $\mathcal{S}_{test}$. To facilitate the illustration, we list the main notations used throughout this article in Table 2.

Table 2. Main Notations Used in this Article.
Notation Description
$\mathbf{H}\in\mathbb{R}^{n\times d}$ Token-level representation matrix for a sentence $\boldsymbol{x}$
$\mathbf{H}^{\mathbf{s}}_{j}\in\mathbb{R}^{C_{j}\times d}$ Representation matrix of all tokens of class $j$ in the support set
$\mathbf{h}_{i}^{\mathbf{q}}\in\mathbb{R}^{d}$ Token-level representation for query token $x_{i}$
$\mathbf{c}_{i}^{j}\in\mathbb{R}^{d}$ Token-level adaptive prototype of class $j$ for query token $x_{i}$
$n$ The length of sentence $\boldsymbol{x}$
$d$ The dimension of feature representation
$C_{j}$ The number of tokens of class $j$ in the support set
$\mathbf{S}\in\mathbb{R}^{S_{s}\times d}$ Representation matrix of all spans in the support set
$\mathbf{Q}\in\mathbb{R}^{S_{q}\times d}$ Representation matrix of all spans in the query set
$\bar{\mathbf{S}}\in\mathbb{R}^{S_{s}\times d}$ Representation matrix of all spans after the cross-attention module in the support set
$\bar{\mathbf{Q}}\in\mathbb{R}^{S_{q}\times d}$ Representation matrix of all spans after the cross-attention module in the query set
$\hat{\mathbf{o}}_{k}\in\mathbb{R}^{d}$ Span-level adaptive prototype of sub-class $O_{k}$ for query span $q_{i}$
$\mathbf{O}_{k}\in\mathbb{R}^{o_{k}\times d}$ Representation matrix of all spans in the support set with sub-class $O_{k}$
$\bar{\mathbf{q}}_{i}\in\mathbb{R}^{d}$ Span-level representation for query span $q_{i}$
$\mathbf{z}_{i}^{j}\in\mathbb{R}^{d}$ Span-level adaptive prototype of class $j$ for query span $q_{i}$
$S_{s}$ The number of spans in the support set
$S_{q}$ The number of spans in the query set
$o_{k}$ The number of spans with sub-class $O_{k}$ in the support set
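To make the episode structure of Section 3 concrete, the following is a minimal sketch of how a single $N$-way $K$-shot episode could be organized in Python; the field names and example sentences are our own illustration rather than an actual dataset schema.

```python
# A hypothetical 2-way 1-shot episode, mirroring Table 1. Spans are given as
# (start, end) word indices (inclusive) paired with their class labels.
episode = {
    "support": [  # N classes, K annotated sentences per class
        {"tokens": ["...", "Hamilton", "Public", "Library", "building", "..."],
         "spans": [((1, 3), "building-library")]},
    ],
    "query": [
        {"tokens": ["The", "Brooklyn", "Bridge", "ceramic", "tiles", "..."],
         "spans": [((1, 2), "building-other")]},
    ],
}
```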

4. Model

In this section, we present the details of our proposed model, as shown in Figure 1. Our model utilizes both token-level and span-level labels to mitigate the data scarcity problem in the few-shot setting from a unified perspective. We first explain the encoder module of our model (Section 4.1). Next, we present the model architecture of the token-level adaptive prototypical network (Section 4.2) and the span-level adaptive prototypical network (Section 4.3). We further propose a consistent loss to make the predictions of the two networks consistent (Section 4.4). Then we introduce the final loss function and the training flow of our model (Section 4.5). Finally, we propose a consistent greedy inference algorithm to combine the outputs of the two networks during the inference phase (Section 4.6).

Figure 1. The framework of our proposed model. The upper part corresponds to token-level network and the lower part corresponds to span-level network.

4.1. Encoder Module

We utilize BERT (Devlin et al., 2019) to encode sentences in the support and query sets. Given a sentence $\boldsymbol{x}=\{x_{1},x_{2},\ldots,x_{n}\}$, we insert a [CLS] and a [SEP] token into each sentence to get the input of BERT and feed it into BERT to get its contextual representations $\mathbf{U}=\{\mathbf{u}_{0},\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n},\mathbf{u}_{n+1}\}\in\mathbb{R}^{(n+2)\times d_{1}}$ as follows:

(1) $\mathbf{u}_{0},\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n},\mathbf{u}_{n+1}=\textrm{BERT}(\textrm{[CLS]},x_{1},x_{2},\ldots,x_{n},\textrm{[SEP]})$

We further use a linear layer to obtain the token representations $\mathbf{H}=\{\mathbf{h}_{1},\mathbf{h}_{2},\ldots,\mathbf{h}_{n}\}\in\mathbb{R}^{n\times d}$ as the input to the token-level network as follows:

(2) $\mathbf{h}_{i}=\mathbf{W}_{t}\mathbf{u}_{i}+\mathbf{b}_{t},\quad i=1,2,\ldots,n$

where $\mathbf{W}_{t}\in\mathbb{R}^{d\times d_{1}}$ and $\mathbf{b}_{t}\in\mathbb{R}^{d}$ are trainable parameters. Since the [CLS] and [SEP] tokens do not need to be classified, we do not feed them into the linear layer.

For the span-level network, we use boundary tokens (i.e., the left and right boundary tokens) and a linear layer to obtain the span representation as the input to the span-level network. Specifically, given a span $\mathbf{s}_{k}=\{x_{i},\ldots,x_{j}\}$, we represent the span as:

(3) $\mathbf{s}_{k}=\mathbf{W}_{s}[\mathbf{u}_{i}\oplus\mathbf{u}_{j}]+\mathbf{b}_{s}$

where $\mathbf{W}_{s}\in\mathbb{R}^{d\times 2d_{1}}$ and $\mathbf{b}_{s}\in\mathbb{R}^{d}$ are trainable parameters, and $\oplus$ is the concatenation operation. It is worth noting that we enumerate all spans in the sentence.
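As a rough illustration of Eqs. (1)-(3), the PyTorch sketch below encodes a sentence with BERT and builds the token and boundary-based span representations; the projected dimension $d$ and the assumption of one subword per word are simplifications of ours, not details from the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

d1, d = 768, 256  # BERT hidden size and projected dimension (d is our choice)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
token_proj = nn.Linear(d1, d)      # W_t, b_t in Eq. (2)
span_proj = nn.Linear(2 * d1, d)   # W_s, b_s in Eq. (3)

# Eq. (1): [CLS] x_1 ... x_n [SEP] -> contextual representations U
inputs = tokenizer("the museum was established", return_tensors="pt")
u = bert(**inputs).last_hidden_state.squeeze(0)  # (n+2, d1)

# Eq. (2): project every token except [CLS]/[SEP]
h = token_proj(u[1:-1])                          # (n, d)

def span_repr(i: int, j: int) -> torch.Tensor:
    """Eq. (3): concatenate the boundary tokens of span x_i..x_j (0-indexed
    words, assuming one subword per word) and project."""
    return span_proj(torch.cat([u[i + 1], u[j + 1]]))  # +1 skips [CLS]
```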

4.2. Token-Level Adaptive Prototypical Network

For the token-level network, we adopt an adaptive prototypical network to classify tokens. General prototypical networks (Snell et al., 2017) simply average all the tokens of the same class to get the prototype. However, the average operation does not reflect the importance of each token (Gao et al., 2019; Sun et al., 2019). Therefore, we propose to construct adaptive prototypes for each query token by considering the importance of each token. Specifically,

(4) $\alpha_{i}^{j}=\textrm{softmax}(\mathbf{H}^{\mathbf{s}}_{j}\mathbf{h}_{i}^{\mathbf{q}})$
(5) $\mathbf{c}_{i}^{j}=(\mathbf{H}^{\mathbf{s}}_{j})^{\mathrm{T}}\alpha_{i}^{j}$

where $\mathbf{h}_{i}^{\mathbf{q}}\in\mathbb{R}^{d}$ is the representation of query token $x_{i}$, $\mathbf{H}^{\mathbf{s}}_{j}\in\mathbb{R}^{C_{j}\times d}$ is the representation of all tokens of class $j$ in the support set, $C_{j}$ is the number of $j$-class tokens in the support set, $\alpha_{i}^{j}\in\mathbb{R}^{C_{j}}$ is the attention weight, and $\mathbf{c}_{i}^{j}\in\mathbb{R}^{d}$ is the adaptive prototype of class $j$ for query token $x_{i}$. For clarity, we denote this attention aggregation as $\phi$, i.e., $\mathbf{c}_{i}^{j}=\phi(\mathbf{h}_{i}^{\mathbf{q}},\mathbf{H}^{\mathbf{s}}_{j})$.

Then we utilize cross-entropy loss to optimize the network, and the predicted probability distribution can be produced by a softmax function over Euclidean distances between all prototypes and query token $x_{i}$. Specifically,

(6) $p_{i}^{t}(\hat{y}_{i}=j|x_{i})=\frac{\exp(-d(\mathbf{h}_{i}^{\mathbf{q}},\mathbf{c}_{i}^{j}))}{\sum_{p}\exp(-d(\mathbf{h}_{i}^{\mathbf{q}},\mathbf{c}_{i}^{p}))}$
(7) $\mathcal{L}_{t}=-\sum_{i=1}^{n}\log p_{i}^{t}(\hat{y}_{i}=y_{i}^{*}|x_{i})$

where $p_{i}^{t}(\hat{y}_{i}=j|x_{i})$ is the predicted probability of class $j$ for token $x_{i}$, $d$ denotes the Euclidean distance, and $y_{i}^{*}$ denotes the ground-truth label for token $x_{i}$. It is worth noting that, same as Ding et al. (2021), we use the IO (Inside and Outside) scheme for the token-level network, not the BIESO (Begin, Inside, End, Singleton, and Outside) or BIO (Begin, Inside, and Outside) scheme. For example, the label space of the example in Table 1 is {building-other, building-library, Outside}, not {begin-building-other, inside-building-other, end-building-other, singleton-building-other, begin-building-library, inside-building-library, end-building-library, singleton-building-library, outside} or {begin-building-other, inside-building-other, begin-building-library, inside-building-library, outside}.
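A minimal sketch of the token-level branch (Eqs. (4)-(7)), assuming the representations are already computed; shapes follow Table 2, and we use the plain Euclidean distance named in the text.

```python
import torch
import torch.nn.functional as F

def phi(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """Attention aggregation of Eqs. (4)-(5): query (d,), support (C_j, d)."""
    alpha = torch.softmax(support @ query, dim=0)  # attention weights, (C_j,)
    return support.t() @ alpha                     # adaptive prototype, (d,)

def token_level_loss(h_q, support_by_class, gold):
    """Eqs. (6)-(7). h_q: (n, d) query tokens; support_by_class: one (C_j, d)
    matrix per class under the IO scheme (class 0 is O); gold: (n,) label ids."""
    logits = []
    for h_i in h_q:
        protos = torch.stack([phi(h_i, S_j) for S_j in support_by_class])
        logits.append(-torch.norm(protos - h_i, dim=-1))  # -d(h_i, c_i^j)
    return F.cross_entropy(torch.stack(logits), gold)     # Eq. (7)
```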

4.3. Span-Level Adaptive Prototypical Network

For the span-level network, we classify spans by adopting an adaptive prototypical network similar to the token-level one; the differences mainly lie in that we employ an extra cross-attention module and make a more fine-grained division of class O.

4.3.1. Cross-Attention Module

To recognize a sample from a novel class given a few labeled samples, a natural idea is to locate the most relevant regions in the pair of support and query samples (Hou et al., 2019). Thus, we use cross-attention to model the interaction between the support and query sets and enhance their representations. We denote the span representation matrices of the support and query sets as $\mathbf{S}\in\mathbb{R}^{S_{s}\times d}$ and $\mathbf{Q}\in\mathbb{R}^{S_{q}\times d}$, where $S_{s}$ and $S_{q}$ denote the number of spans in the support and query sets. Then we use cross-attention to model the interaction:

(8) $\hat{\mathbf{s}}_{n}=\phi(\mathbf{s}_{n},\mathbf{Q})$
(9) $\hat{\mathbf{q}}_{n}=\phi(\mathbf{q}_{n},\mathbf{S})$

where $\mathbf{s}_{n}\in\mathbb{R}^{d}$ and $\mathbf{q}_{n}\in\mathbb{R}^{d}$ denote the $n$-th row of $\mathbf{S}$ and $\mathbf{Q}$, $\hat{\mathbf{s}}_{n}\in\mathbb{R}^{d}$ denotes the span representation after cross-attention in the support set, and $\hat{\mathbf{q}}_{n}\in\mathbb{R}^{d}$ denotes the span representation after cross-attention in the query set. Finally, we use a Transformer block (Vaswani et al., 2017) to get the enhanced representations:

(10) $\bar{\mathbf{s}}_{n}=\textrm{LayerNorm}(\mathbf{s}_{n}+\textrm{FFN}(\hat{\mathbf{s}}_{n}))$
(11) $\bar{\mathbf{q}}_{n}=\textrm{LayerNorm}(\mathbf{q}_{n}+\textrm{FFN}(\hat{\mathbf{q}}_{n}))$

where $\bar{\mathbf{s}}_{n}\in\mathbb{R}^{d}$ and $\bar{\mathbf{q}}_{n}\in\mathbb{R}^{d}$ denote the enhanced representations of the support and query sets, and FFN denotes a feed-forward neural network.
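The cross-attention module of Eqs. (8)-(11) can be sketched as below; the FFN width and the random tensors are illustrative choices of ours, and $\phi$ is applied row-wise exactly as in the token-level network.

```python
import torch
import torch.nn as nn

d = 256
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
norm = nn.LayerNorm(d)

def cross_attend(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Eqs. (8)-(11): every row of X (n_x, d) attends over Y (n_y, d)."""
    alpha = torch.softmax(X @ Y.t(), dim=-1)  # row-wise phi over the other set
    X_hat = alpha @ Y                         # Eq. (8)/(9), (n_x, d)
    return norm(X + ffn(X_hat))               # Eq. (10)/(11)

S, Q = torch.randn(40, d), torch.randn(60, d)  # support / query span matrices
S_bar, Q_bar = cross_attend(S, Q), cross_attend(Q, S)
```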

4.3.2. Fine-Grained Division of O

Different from entity classes, the non-entity class O has rich semantics and is hard to represent with a single prototype (Tong et al., 2021; Wang et al., 2022). So we further divide class O into three sub-classes. Since we use left and right boundary tokens to represent spans, we divide the sub-classes based on whether a span has the same left or right boundary as an entity span. Specifically, given a sentence with $M_{1}$ entity spans $\{(l_{i},r_{i})\}_{i=1}^{M_{1}}$, where $l_{i}$ and $r_{i}$ are the left and right boundary tokens of the $i$-th span, each other non-entity span $(l_{o},r_{o})$ is assigned a sub-class label $O_{sub}$ as follows:

$O_{sub}=\begin{cases}O_{1}, & \exists i,\ l_{o}=l_{i}\\ O_{2}, & \exists i,\ r_{o}=r_{i}\\ O_{3}, & \textrm{otherwise}\end{cases}$

where $O_{1}$ denotes the non-entity spans that have the same left boundary token as an entity span, $O_{2}$ denotes the non-entity spans that have the same right boundary token as an entity span, and $O_{3}$ denotes the non-entity spans that share neither boundary token with an entity span. It is worth noting that for sentences with two or more entities, there may be non-entity spans that share a left boundary with one entity span and a right boundary with another. We treat these non-entity spans as $O_{1}$.
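The sub-class assignment rule is simple enough to state directly in code; the spans below are hypothetical (left, right) token positions.

```python
def o_subclass(span, entity_spans):
    """Assign a non-entity span (l_o, r_o) to O1/O2/O3 as in Section 4.3.2.
    Spans sharing a left boundary with an entity go to O1, even if they also
    share a right boundary with another entity."""
    l_o, r_o = span
    if any(l_o == l for l, _ in entity_spans):
        return "O1"
    if any(r_o == r for _, r in entity_spans):
        return "O2"
    return "O3"

# With a hypothetical entity span (2, 5) ("Municipal Gallery of Hamilton"):
assert o_subclass((2, 3), [(2, 5)]) == "O1"  # same left boundary
assert o_subclass((4, 5), [(2, 5)]) == "O2"  # same right boundary
assert o_subclass((0, 1), [(2, 5)]) == "O3"  # shares neither boundary
```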

Algorithm 1 The Training Flow of the Consistent Dual Adaptive Prototypical Network
Input: The few-shot sequence labeling training set $\mathcal{D}_{tr}$
Output: A well-trained few-shot sequence labeling model
for all iteration = 1, $\cdots$, MaxIter do
     Randomly sample a mini-batch of data ($\mathcal{S}_{train}$, $\mathcal{Q}_{train}$) from $\mathcal{D}_{tr}$
     $\rhd$ Base Encoder
     $\mathbf{H}^{\mathbf{s}}, \mathbf{S} \leftarrow \textrm{BERT}(\mathcal{S}_{train})$
     $\mathbf{H}^{\mathbf{q}}, \mathbf{Q} \leftarrow \textrm{BERT}(\mathcal{Q}_{train})$
     for each $x$ in $\mathcal{Q}_{train}$ do
          $\rhd$ Token-Level Network
          Generate token-level prototypes: $\mathbf{c}_{i}^{j}=\phi(\mathbf{h}_{i}^{\mathbf{q}}, \mathbf{H}^{\mathbf{s}}_{j})$
          Calculate the token-level loss: $\mathcal{L}_{t}=-\sum_{i=1}^{n}\log\frac{\exp(-d(\mathbf{h}_{i}^{\mathbf{q}},\mathbf{c}_{i}^{*}))}{\sum_{p}\exp(-d(\mathbf{h}_{i}^{\mathbf{q}},\mathbf{c}_{i}^{p}))}$
          $\rhd$ Span-Level Network
          Enhance representations: $\bar{\mathbf{S}}, \bar{\mathbf{Q}} \leftarrow$ Cross-attention module($\mathbf{S}$, $\mathbf{Q}$)
          Generate span-level prototypes: $\mathbf{z}_{i}^{j}=\phi(\bar{\mathbf{q}}_{i}, \bar{\mathbf{S}}_{j})$
          Calculate the span-level loss: $\mathcal{L}_{s}=-\sum_{i=1}^{S_{q}}\log\frac{\exp(-d(\bar{\mathbf{q}}_{i},\mathbf{z}_{i}^{*}))}{\sum_{p}\exp(-d(\bar{\mathbf{q}}_{i},\mathbf{z}_{i}^{p}))}$
          $\rhd$ Consistent Loss Calculation
          Generate the two token-level distributions $l_{t}(x_{i})$ and $l_{s}(x_{i})$ from the two networks
          Calculate the consistent loss: $\mathcal{L}_{c}=\sum_{i=1}^{n}[D_{KL}(\sigma(l_{t}(x_{i})/T)\,||\,\sigma(l_{s}(x_{i})))+D_{KL}(\sigma(l_{s}(x_{i})/T)\,||\,\sigma(l_{t}(x_{i})))]$
          $\rhd$ Optimization
          Obtain the total loss $\mathcal{L}=\lambda\mathcal{L}_{t}+\beta\mathcal{L}_{s}+\gamma\mathcal{L}_{c}$ and update the model
     end for
end for
return The well-trained model

4.3.3. Adaptive Prototypical Network

We also construct adaptive prototypes for each query span, similar to the token-level adaptive prototypical network. Since we divide class O into three sub-classes, we first adaptively construct the prototype of each sub-class and then adaptively aggregate the three sub-class prototypes to get the prototype of class O. For entity classes, we can directly get the adaptive prototypes. Specifically,

(12) $\hat{\mathbf{o}}_{k}=\phi(\bar{\mathbf{q}}_{i},\mathbf{O}_{k}),\quad k=1,2,3$
(13) $\mathbf{z}_{i}^{0}=\phi\left(\bar{\mathbf{q}}_{i},\left[\hat{\mathbf{o}}_{1}^{\mathrm{T}};\hat{\mathbf{o}}_{2}^{\mathrm{T}};\hat{\mathbf{o}}_{3}^{\mathrm{T}}\right]\right)$
(14) $\mathbf{z}_{i}^{k}=\phi(\bar{\mathbf{q}}_{i},\bar{\mathbf{S}}_{k}),\quad k=1,2,\ldots,N$

where $\hat{\mathbf{o}}_{k}\in\mathbb{R}^{d}$ denotes the adaptive prototype of sub-class $O_{k}$ for query span $q_{i}$, $\mathbf{O}_{k}\in\mathbb{R}^{o_{k}\times d}$ denotes the representation of all spans in the support set with sub-class $O_{k}$, $o_{k}$ denotes the number of spans with sub-class $O_{k}$ in the support set, $\mathbf{z}_{i}^{0}\in\mathbb{R}^{d}$ denotes the adaptive prototype of class O for query span $q_{i}$, $\bar{\mathbf{S}}_{k}\in\mathbb{R}^{m_{k}\times d}$ is the representation of all spans of class $k$ in the support set, $m_{k}$ denotes the number of $k$-class spans in the support set, and $\mathbf{z}_{i}^{k}\in\mathbb{R}^{d}$ denotes the adaptive prototype of class $k$ for query span $q_{i}$.

After getting all prototypes $\mathbf{Z}_{i}=(\mathbf{z}_{i}^{0},\mathbf{z}_{i}^{1},\ldots,\mathbf{z}_{i}^{N})$, we also use a softmax function over Euclidean distances between all prototypes and query span $q_{i}$ to get the predicted probability distribution, and the cross-entropy loss to optimize the network.

(15) $p_{i}^{s}(\hat{y}_{i}=j|q_{i})=\frac{\exp(-d(\bar{\mathbf{q}}_{i},\mathbf{z}_{i}^{j}))}{\sum_{p}\exp(-d(\bar{\mathbf{q}}_{i},\mathbf{z}_{i}^{p}))}$
(16) $\mathcal{L}_{s}=-\sum_{i=1}^{S_{q}}\log p_{i}^{s}(\hat{y}_{i}=\bar{y}_{i}^{*}|q_{i})$

where $p_{i}^{s}(\hat{y}_{i}=j|q_{i})$ is the predicted probability of class $j$ for span $q_{i}$, $\bar{y}_{i}^{*}$ denotes the ground-truth label of span $q_{i}$, and $S_{q}$ is the number of spans in the sentence.
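Putting Eqs. (12)-(16) together, the span-level prediction for a single query span might be sketched as follows, where phi is the same attention aggregation as in the token-level sketch and shapes follow Table 2.

```python
import torch

def phi(query, support):
    """Attention aggregation, identical to the token-level sketch."""
    alpha = torch.softmax(support @ query, dim=0)
    return support.t() @ alpha

def span_level_probs(q_bar_i, support_entity, support_O):
    """q_bar_i: (d,) enhanced query span; support_entity: N matrices (m_k, d),
    one per entity class; support_O: three matrices (o_k, d) for O1-O3."""
    o_hat = torch.stack([phi(q_bar_i, O_k) for O_k in support_O])  # Eq. (12)
    z0 = phi(q_bar_i, o_hat)                                       # Eq. (13)
    z = torch.stack([z0] + [phi(q_bar_i, S_k) for S_k in support_entity])
    return torch.softmax(-torch.norm(z - q_bar_i, dim=-1), dim=0)  # Eq. (15)
```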

4.4. Consistent Loss

Intuitively, the token-level network and span-level network should make consistent predictions in a well-trained model. Therefore, we further propose a consistent loss to make the predicted probability distributions of two networks consistent. It contains two steps, where the first step constructs two token-level probability distributions for each token from two networks and the second step measures the difference between two probability distributions.

For each token, we can obtain the token-level probability distribution directly from the token-level network. Then, among all spans containing this token, we select the span with the maximum predicted probability and take its probability distribution as the token-level probability distribution from the span-level network. Finally, we use the bidirectional Kullback-Leibler divergence with temperature as the consistent loss, which allows the two networks to learn from each other while measuring the difference between the two distributions. Specifically,

(17) $\mathcal{L}_{c}=\sum_{i=1}^{n}\left[D_{KL}\left(\sigma(l_{t}(x_{i})/T)\,||\,\sigma(l_{s}(x_{i}))\right)+D_{KL}\left(\sigma(l_{s}(x_{i})/T)\,||\,\sigma(l_{t}(x_{i}))\right)\right]$

where $l_{t}(x_{i})$ and $l_{s}(x_{i})$ are the predicted logits of the token-level and span-level networks for token $x_{i}$, $\sigma$ denotes the softmax function, and $T$ is the temperature to adjust the sharpness of the probability distribution.
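Eq. (17) translates directly into PyTorch; note that F.kl_div expects log-probabilities as its first argument and probabilities as its second, and the temperature value below is illustrative.

```python
import torch.nn.functional as F

def consistent_loss(l_t, l_s, T=2.0):
    """Bidirectional KL with temperature (Eq. 17). l_t, l_s: (n, num_classes)
    logits of the token-level and span-level networks; T is a hyper-parameter."""
    # D_KL(sigma(l_t / T) || sigma(l_s)): the temperature-smoothed distribution
    # plays the role of the target inside kl_div's computation.
    kl_ts = F.kl_div(F.log_softmax(l_s, dim=-1),
                     F.softmax(l_t / T, dim=-1), reduction="sum")
    kl_st = F.kl_div(F.log_softmax(l_t, dim=-1),
                     F.softmax(l_s / T, dim=-1), reduction="sum")
    return kl_ts + kl_st
```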

4.5. Training

Finally, we combine the above three loss functions to form the final loss function to jointly train our model as follows:

(18) $\mathcal{L}=\lambda\mathcal{L}_{t}+\beta\mathcal{L}_{s}+\gamma\mathcal{L}_{c}$

where $\lambda$, $\beta$, and $\gamma$ are hyper-parameters.

In summary, there are three steps to train our model, i.e., calculate token-level loss, calculate span-level loss, and calculate consistent loss. The whole training flow is illustrated in Figure 1 and Algorithm 1.

Figure 2. The process of the consistent greedy inference algorithm. The algorithm first punishes inconsistent predictions and then greedily selects non-overlapping spans with maximum probability based on the adjusted probabilities.

4.6. Consistent Greedy Inference Algorithm

Intuitively, we consider extracted entities and slots unreliable if the two networks predict different labels. So we propose a consistent greedy algorithm to combine the outputs of the two networks. As shown in Figure 2 and Algorithm 2, we first adjust the predicted probabilities of spans and then use a greedy algorithm to gradually select non-overlapping spans with the maximum probability based on the adjusted probability.

First, we consider a span to be reliable if all token-level predictions from the token-level network within the span agree with the span-level prediction from the span-level network. For example, as shown in Figure 2, the prediction for the span “powerplant” is consistent, while the predictions for “nimbus powerplant” and “the nimbus powerplant” are inconsistent. We punish the probabilities of inconsistent spans by the following operation:

(19) $\bar{y}_{i}=\hat{y}_{i}-\delta\cdot\textrm{count}_{i}$

where $\bar{y}_{i}$ and $\hat{y}_{i}$ are the predicted probabilities of span $q_{i}$ after and before the consistency adjustment, $\delta$ is a hyper-parameter to control the degree of penalty, and $\textrm{count}_{i}$ is the number of inconsistent tokens. Then, we use a greedy algorithm to gradually select non-overlapping spans with the maximum probability based on the adjusted probabilities. Specifically, at each step of the greedy algorithm, we select the span with the maximum probability and discard the overlapping spans. For example, we select the span “powerplant” and discard “nimbus powerplant” and “the nimbus powerplant”. The whole inference flow is illustrated in Algorithm 2.

Algorithm 2 The Process of the Consistent Greedy Inference Algorithm
Input: The few-shot sequence labeling test set $\mathcal{D}_{te}$, a well-trained model, the number of episodes in the test set $n$
Output: The extracted entities or slots set S
S = $\emptyset$
for all i = 1, $\cdots$, $n$ do
     # Initialization: get data and predictions from the two networks:
     B, C = $\emptyset$, $\emptyset$
     Get the $i$-th episode data ($\mathcal{S}_{test}$, $\mathcal{Q}_{test}$) from $\mathcal{D}_{te}$
     Get the token-level predictions $\{\hat{y}_{1},\hat{y}_{2},\ldots,\hat{y}_{n}\}$ according to Eq. (6)
     Get the span-level predictions A = $\{(s_{j},\hat{y}_{j})\}_{j=1}^{\hat{M}}$ according to Eq. (15)
     # First step: punish the inconsistent predictions:
     for all k = 1, $\cdots$, $\hat{M}$ do
          Get the number of inconsistent tokens $\textrm{count}_{k}$ in span $s_{k}$
          $\bar{y}_{k}\leftarrow\hat{y}_{k}-\delta\cdot\textrm{count}_{k}$
          A $\leftarrow$ A $\setminus$ $\{(s_{k},\hat{y}_{k})\}$
          B $\leftarrow$ B $\cup$ $\{(s_{k},\bar{y}_{k})\}$
     end for
     # Second step: greedy algorithm to select spans:
     while B $\neq\emptyset$ do
          Select the span $s_{k}$ with the maximum probability in B
          C $\leftarrow$ C $\cup$ $\{s_{k}\}$
          Get the span set D that overlaps with span $s_{k}$ in B
          B $\leftarrow$ B $\setminus$ D
     end while
     S $\leftarrow$ S $\cup$ C
end for
return The extracted entities or slots set S
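For concreteness, a plain-Python sketch of Algorithm 2 for a single sentence is given below; the (left, right) span encoding and the value of $\delta$ are our own illustration.

```python
def consistent_greedy_inference(span_preds, token_labels, delta=0.05):
    """span_preds: list of ((l, r), label, prob) spans predicted as entities or
    slots by the span-level network; token_labels: per-token labels from the
    token-level network (IO scheme)."""
    # First step: punish spans whose tokens disagree with the span label.
    adjusted = []
    for (l, r), label, prob in span_preds:
        count = sum(1 for t in range(l, r + 1) if token_labels[t] != label)
        adjusted.append(((l, r), label, prob - delta * count))  # Eq. (19)
    # Second step: greedily keep the highest-probability non-overlapping spans.
    selected = []
    for (l, r), label, prob in sorted(adjusted, key=lambda x: -x[2]):
        if all(r < l2 or r2 < l for (l2, r2), _, _ in selected):
            selected.append(((l, r), label, prob))
    return selected
```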

5. Experiments

In this section, we first introduce our experimental details, including the datasets, evaluation metrics, implementation details, and baselines. Then, we report the experimental results on three datasets to answer the following research questions:

  • RQ1: Does our proposed model outperform existing few-shot sequence labeling methods?

  • RQ2: How does each of the components of our model contribute to the final performance?

  • RQ3: What is the effect of different inference algorithms on the performance of our method?

  • RQ4: What is the effect of different consistent losses on the performance of our method?

  • RQ5: How do the hyper-parameters influence the performance of our method?

Thereafter, we also explore the effectiveness of our O division strategy, evaluate the efficiency of our model, and conduct an error analysis and a case study to further analyze our model.

Table 3. Statistics of three original datasets.
Datasets # Sentences # Classes Domain
FewNERD 188.2K 66 Wikipedia
SNIPS 14.5K 60 Dialog
Cross 189.4K 43 News, Wiki, Social, and Mixed

5.1. Datasets

We conduct experiments on three benchmark datasets: FewNERD (Ding et al., 2021), SNIPS (Coucke et al., 2018), and Cross. FewNERD and Cross are NER datasets, and SNIPS is a slot tagging dataset. The statistics of the three original datasets are shown in Table 3. The detailed description of the three datasets is as follows:

FewNERD is a NER dataset with a hierarchy of 8 coarse-grained entity types and 66 fine-grained entity types, and has two few-shot settings: Intra and Inter. The two settings divide the dataset using coarse-grained and fine-grained entity types, respectively. For the Intra setting, the entities in the training, development, and test sets belong to different coarse-grained types. For the Inter setting, the entities in the training, development, and test sets belong to different fine-grained types. It is worth noting that, in order to better handle dense entities (i.e., a sentence may have multiple entity types), Ding et al. (2021) adopt an $N$-way $K$~$2K$-shot sampling method to construct the FewNERD dataset (each class in the support set has $K$~$2K$ entities). Both the Intra and Inter settings have 4 sub-settings: 5-way 1~2 shot, 5-way 5~10 shot, 10-way 1~2 shot, and 10-way 5~10 shot. We use the public sampled dataset released by Ding et al. (2021), which contains 20,000 episodes for training, 1,000 episodes for validation, and 5,000 episodes for testing. (We use the latest version of the FewNERD dataset from https://github.com/thunlp/Few-NERD, which corresponds to the results reported in https://arxiv.org/pdf/2105.07464v6.pdf.)

SNIPS is a slot tagging dataset containing 7 domains: Weather (We), Music (Mu), Play List (Pl), Book (Bo), Search Screen (Se), Restaurant (Re), and Creative Work (Cr). Hou et al. (2020) randomly sample one domain for validation, then randomly sample a different domain for testing, and use the remaining domains for training. In each episode, every class has $K$-shot samples in the support set and the number of classes ($N$) is not fixed. Each domain of the SNIPS dataset has two settings: 1-shot and 5-shot. We use the public sampled dataset (https://github.com/AtmaHou/FewShotTagging) provided by Hou et al. (2020) for a fair comparison.

Cross is a NER dataset consisting of four datasets: CoNLL-2003 (Sang and Meulder, 2003), GUM (Zeldes, 2017), WNUT-2018 (Derczynski et al., 2017), and Ontonotes (Pradhan et al., 2013). The four datasets are from the News, Wiki, Social, and Mixed domains, respectively. Same as for the SNIPS dataset, Hou et al. (2020) use two datasets for training, one for validation, and one for testing. Each domain of the Cross dataset also has two settings, 1-shot and 5-shot, and we again use the public sampled dataset (https://github.com/AtmaHou/FewShotTagging) provided by Hou et al. (2020) for a fair comparison.

Table 4. The performance (F1 score with standard deviations) on the FewNERD dataset. † denotes that we reproduce the results using the authors’ code (https://github.com/microsoft/vert-papers/tree/master/papers/DecomposedMetaNER) and do not fine-tune the model during the meta-testing phase for a fair comparison. We report the average result of 5 runs.
Models Intra Inter
1~2 shot 5~10 shot Avg. 1~2 shot 5~10 shot Avg.
5 way 10 way 5 way 10 way 5 way 10 way 5 way 10 way
Proto 20.76±0.84 15.04±0.44 42.54±0.94 35.40±0.13 28.44±0.59 38.83±1.49 32.45±0.79 58.79±0.44 52.92±0.37 45.75±0.77
NNShot 25.78±0.91 18.27±0.41 36.18±0.79 27.38±0.53 26.90±0.66 47.24±1.00 38.87±0.21 55.64±0.63 49.57±2.73 47.83±1.14
StructShot 30.21±0.90 21.03±1.13 38.00±1.29 26.42±0.60 28.92±0.98 51.88±0.69 43.34±0.10 57.32±0.63 49.57±3.08 50.53±1.13
De-MAML 37.49±2.10 27.36±2.70 41.26±1.34 35.31±5.53 35.36±2.92 61.63±1.11 49.24±5.10 54.09±2.24 53.20±3.44 54.54±2.97
ESD 36.08±1.60 30.00±0.70 52.14±1.50 42.15±2.60 40.09±1.60 59.29±1.25 52.16±0.79 69.06±0.80 64.00±0.43 61.13±0.82
Ours 41.21±0.70 35.36±0.91 52.67±0.49 45.83±0.81 43.77±0.73 62.79±0.29 56.23±0.76 69.83±0.40 65.99±0.40 63.71±0.46

5.2. Evaluation Metrics

For the FewNERD dataset, we report the micro-F1 over all test episodes following Ding et al. (2021). Specifically,

(20) $\textrm{F1}=\frac{2\times\textrm{P}_{all}\times\textrm{R}_{all}}{\textrm{P}_{all}+\textrm{R}_{all}}$

where $\textrm{P}_{all}$ denotes the precision calculated over all test episodes and $\textrm{R}_{all}$ denotes the recall calculated over all test episodes. For the SNIPS and Cross datasets, we first calculate the micro-F1 for each test episode and then report the average F1 over all test episodes as the final result, following Hou et al. (2022). Specifically,

(21) $\textrm{F1}=\frac{1}{n}\sum_{i=1}^{n}\textrm{F1}_{i}$

where $n$ is the number of episodes and $\textrm{F1}_{i}$ denotes the F1 of the $i$-th episode.
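The two aggregation schemes differ only in where the averaging happens. A small sketch, assuming each episode is summarized by (correct, predicted, gold) span counts:

```python
def micro_f1(episodes):
    """Eq. (20): precision/recall pooled over all test episodes (FewNERD)."""
    correct = sum(c for c, _, _ in episodes)
    p = correct / sum(pred for _, pred, _ in episodes)
    r = correct / sum(gold for _, _, gold in episodes)
    return 2 * p * r / (p + r)

def episode_avg_f1(episodes):
    """Eq. (21): F1 computed per episode, then averaged (SNIPS and Cross)."""
    f1s = []
    for c, pred, gold in episodes:
        p, r = c / pred, c / gold
        f1s.append(2 * p * r / (p + r) if (p + r) > 0 else 0.0)
    return sum(f1s) / len(f1s)
```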

5.3. Implementation Details

We use BERT-base-uncased (Devlin et al., 2019) as our encoder and the AdamW (Loshchilov and Hutter, 2019) optimizer. For the FewNERD dataset, the learning rate is 2e-5 for BERT and 5e-4 for the other parameters, and the batch size is set to 2. For the SNIPS dataset, we use grid search to find the best hyper-parameters: the learning rate of BERT is selected from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}, the learning rate of the other parameters from {5e-5, 1e-4, 3e-4, 5e-4}, and the batch size from {2, 4}. We schedule the learning rate so that the first 1000 training steps are a linear warmup phase and the rest is a linear decay phase. $\lambda$ and $\gamma$ are selected from {0.02, 0.05, 0.1, 0.2, 0.3} through grid search, $\beta$ is set to 1, and $\delta$ is selected from {0.01, 0.02, 0.05, 0.1}. The maximum span length is set to 8 for inference.

5.4. Baselines

To evaluate the effectiveness of our method, we compare our method with the following baselines:

  • Proto (Fritzler et al., 2019; Ding et al., 2021) represents each class by the mean of its token representations and uses Euclidean distance to predict query labels.

  • NNShot (Yang and Katiyar, 2020; Ding et al., 2021) is similar to Proto, but uses nearest neighbor classification to predict query labels.

  • StructShot (Yang and Katiyar, 2020; Ding et al., 2021) is similar to NNShot and further uses a Viterbi decoder during the inference phase to model label dependency.

  • TransferBERT (Hou et al., 2020) is a domain transfer model that pretrains the model on source domains and fine-tunes it on the target domain.

  • Matching (Vinyals et al., 2016; Hou et al., 2020) is similar to Proto but uses matching networks (Vinyals et al., 2016) for classification instead of prototypical networks.

  • L-TapNet+CDT (Hou et al., 2020) uses label names to extend task-adaptive projection networks (Yoon et al., 2019) and further utilizes collapsed dependency transfer to capture the dependencies between labels.

  • Retriever (Yu et al., 2021a) is a span-level retrieval method that learns similar contextualized representations for spans with the same label via a novel batch-softmax objective.

  • ConVEx (Henderson and Vulic, 2021) first pre-trains through a novel pairwise cloze task using Reddit data and then directly fine-tunes for the slot tagging task.

  • MRC (Ma et al., 2021) enumerates all the slots to extract the answer from the sentence in a machine reading comprehension framework.

  • Inverse (Hou et al., 2022) reversely predicts slot values given slot-type prompts and further proposes an iterative prediction strategy to consider the relations between different slot types.

  • De-MAML (Ma et al., 2022b) proposes a two-step MAML-enhanced model to first detect spans and then classify them.

  • ESD (Wang et al., 2022) proposes inter-span and cross-span attention to enhance the span representations, and uses instance span attention to construct prototypes.

Table 5. The performance (F1 scores with standard deviations) on the 7 domains of SNIPS. ‘unk’ denotes methods that do not report standard deviations in their paper. The baselines for the 1-shot and 5-shot settings differ because ConVEx, Retriever, and Inverse do not report 1-shot results in their papers. We report the average result of 10 runs. The best result is bold and the second best is underlined.
Models We Mu Pl Bo Se Re Cr Avg.
1-shot TransferBERT 55.82±2.75 38.01±1.74 45.65±2.02 31.63±5.32 21.96±3.98 41.79±3.81 38.53±7.42 39.06±3.86
Matching 21.74±4.60 10.68±1.07 39.71±1.81 58.15±0.68 24.21±1.20 32.88±0.64 69.66±1.68 36.72±1.67
Proto 46.72±1.03 40.07±0.48 50.78±2.09 68.73±1.87 60.81±1.70 55.58±3.56 67.67±1.16 55.77±1.70
MRC - - - - - - - 69.3(unk)
L-TapNet+CDT 71.53±4.04 60.56±0.77 66.27±2.71 84.54±1.08 76.27±1.72 70.79±1.60 62.89±1.88 70.41±1.97
ESD 78.25±1.50 54.74±1.02 71.15±1.55 71.45±1.38 67.85±0.75 71.52±0.98 78.14±1.46 70.44±1.23
Ours 73.56±1.58 58.40±1.42 74.20±1.24 76.07±1.71 70.64±1.49 72.73±1.08 75.32±1.09 71.56±1.37
5-shot TransferBERT 59.41±0.30 42.00±2.83 46.07±4.32 20.74±3.36 28.20±0.29 67.75±1.28 58.61±3.67 46.11±2.29
Matching 36.67±3.64 33.67±6.12 52.60±2.84 69.09±2.36 38.42±4.06 33.28±2.99 72.10±1.48 47.98±3.36
Proto 67.82±4.11 55.99±2.24 46.02±3.19 72.17±1.75 73.59±1.60 60.18±6.96 66.89±2.88 63.24±3.25
Retriever 82.95(unk) 61.74(unk) 71.75(unk) 81.65(unk) 73.10(unk) 79.54(unk) 51.35(unk) 71.72(unk)
L-TapNet+CDT 71.64±3.62 67.16±2.97 75.88±1.51 84.38±2.81 82.58±2.12 70.05±1.61 73.41±2.61 75.01±2.46
Inverse 70.63(unk) 71.97(unk) 78.73(unk) 87.34(unk) 81.95(unk) 72.07(unk) 74.44(unk) 76.73(unk)
ConVEx 71.5(unk) 77.6(unk) 79.0(unk) 84.5(unk) 84.0(unk) 73.8(unk) 67.4(unk) 76.8(unk)
MRC 89.39(unk) 75.11(unk) 77.18(unk) 84.16(unk) 73.53(unk) 82.29(unk) 72.51(unk) 79.17(unk)
ESD 84.50±1.06 66.61±2.00 79.69±1.35 82.57±1.37 82.22±0.81 80.44±0.80 81.13±1.84 79.59±1.32
Ours 86.24±1.04 67.66±1.01 82.02±1.05 86.47±0.92 82.45±1.09 81.38±0.55 80.55±1.16 80.97±0.97
Table 6. The performance (F1 score with standard deviations) on the Cross dataset. † denotes that we reproduce the results using the authors’ code and do not fine-tune the model during the meta-testing phase for a fair comparison. We report the average result of 10 runs.
Models 1-shot 5-shot
News Wiki Social Mixed Avg. News Wiki Social Mixed Avg.
TransferBERT 4.75±1.42 0.57±0.32 2.71±0.72 3.46±0.54 2.87±0.75 15.36±2.81 3.62±0.57 11.08±0.57 35.49±7.60 16.39±2.89
Matching 19.50±0.35 4.73±0.16 17.23±2.75 15.06±1.61 14.13±1.22 19.85±0.74 5.58±0.23 6.61±1.75 8.08±0.47 10.03±0.80
Proto 32.49±2.01 3.89±0.24 10.68±1.40 6.67±0.46 13.43±1.03 50.06±1.57 9.54±0.44 17.26±2.65 13.59±1.61 22.61±1.57
De-MAML 42.40±0.93 4.76±0.08 23.87±1.49 14.19±1.33 21.31±0.96 37.10±2.65 6.21±0.26 18.54±0.89 26.41±0.48 22.07±1.07
L-TapNet+CDT 44.30±3.15 12.04±0.65 20.80±1.06 15.17±1.25 23.08±1.53 45.35±2.67 11.65±2.34 23.30±2.80 20.95±2.81 25.31±2.65
Ours 41.21±0.70 9.91±0.50 25.01±1.32 26.18±2.69 25.58±1.30 51.97±2.93 15.65±1.51 29.89±1.64 37.19±1.35 33.68±1.86

5.5. Results (RQ1)

To answer RQ1, we conduct an empirical study to investigate whether our proposed model can achieve better performance for few-shot sequence labeling. Tables 4 and 6 report the performance of our proposed model and the baselines on the few-shot NER datasets (i.e., FewNERD and Cross), and Table 5 reports the performance on the few-shot slot tagging dataset (i.e., SNIPS).

First, we observe the performance on the FewNERD dataset. Our model consistently outperforms all baselines under all settings. Compared with the best baseline, ESD, our model achieves 3.68 and 2.58 average F1 improvements under the Intra and Inter settings. Compared with the best token-level baseline, StructShot, our model achieves 14.85 and 13.18 average F1 improvements under the two settings. This shows that combining token-level and span-level supervision can achieve better performance for few-shot sequence labeling. It is worth noting that the improvement under the Intra setting is higher than under the Inter setting. This suggests that when the domain gap between the source and target domains is relatively large, combining token and span levels leads to more significant improvements over the single-granularity baselines.

Secondly, we observe the performance on the SNIPS dataset. Compared with the best baseline ESD, our model achieves 1.12 and 1.38 average F1 improvements under the 1-shot and 5-shot settings, respectively. Compared with all baseline methods, our method achieves the best or second best performance in most domains, which again shows its effectiveness. It is worth noting that the standard deviations of our method are the lowest or second lowest in most settings, which shows that our model is robust.

Thirdly, we observe the performance on the Cross dataset. Compared with the best baseline L-TapNet+CDT, our model achieves 2.50 and 8.37 average F1 improvements under the 1-shot and 5-shot settings, respectively. Compared with all baseline methods, our method achieves the best or second best performance in all domains, which shows its effectiveness.

Table 7. The ablation study of our model on the development set of the FewNERD dataset under the 5-way 1~2 shot setting. The ablation study is repeated five times.
Model Setting Intra Inter
Ours 36.92±1.54 48.63±0.51
w/o Span-Level Loss 26.42±0.33 36.25±1.21
w/o Token-Level Loss 34.53±0.51 47.80±0.64
w/o Consistent Loss 35.27±0.77 48.10±0.16
w/o Inference Algorithm 31.13±1.66 45.18±0.31
w/o Division of O 35.63±0.94 48.49±0.40
w/o Adaptive Prototype 32.52±1.31 46.27±1.23
w/o Cross-Attention 32.51±1.78 44.98±0.31

5.6. Ablation Study (RQ2)

To answer RQ2, we validate the contribution of each component by removing each one from our model individually and report the results in Table 7.

First, we can see that every component improves performance, which shows that all components in our model are effective. Secondly, we remove the token-level and span-level losses. When we ablate the span-level loss, the network uses the output of the token-level network as the result, and the performance drops to 26.42 and 36.25 under the Intra and Inter settings. This may be because token-level metric learning struggles to extract complete entities and our token-level model is parameter-efficient. When we remove the token-level loss, the performance drops to 34.53 and 47.80 under the two settings, which shows that jointly training the model with token-level labels can effectively improve performance. Thirdly, we remove the consistent loss and the performance drops to 35.27 and 48.10 under the two settings, which shows that the consistent loss can effectively regularize the model and improve performance. Fourthly, we remove the inference algorithm, i.e., all spans predicted as entities or slots by the span-level network are extracted. The performance drops to 31.13 and 45.18 under the two settings, illustrating that our inference algorithm can effectively combine the results of the two networks and avoid extracting overlapping spans. Finally, we ablate each component (i.e., the fine-grained division of O, the cross-attention module, and the adaptive prototype) in the span network, and all of them improve performance. The cross-attention module brings the largest improvement, indicating that the interaction between the support set and the query set is the key to performance. When we remove the adaptive prototype, the performance drops to 32.52 and 46.27 under the two settings, showing that adaptively constructing a prototype for each query span is more effective. When we remove the division of class O, the performance drops to 35.63 and 48.49 under the two settings, showing that the fine-grained division of O can effectively construct the prototypes of class O.

Table 8. The performance of our model under different inference algorithms on the development set of the FewNERD dataset under the Inter 5-way 1~2 shot setting. The experiment is repeated five times.
Inference Setting F1
Ours 48.63±0.51
Token Network 26.94±0.96
Span Network 48.05±0.54
Intersection 29.93±0.71
Union 41.27±0.36

5.7. Effects of Different Inference Algorithms (RQ3)

To answer RQ3, we explore the effects of different inference algorithms in Table 8, including our proposed consistent greedy inference algorithm, combining the outputs of the two networks with an intersection or union operation (i.e., a span and its type are extracted by both networks or by either one), and using the outputs of the token network or the span network alone (where we also apply the greedy algorithm to avoid overlapping spans).

First, we can see that our proposed inference algorithm achieves the best performance, which shows that using token-level predictions to adjust the probabilities of the span network is effective. Secondly, the span network performs better than the token network. This may be because token-level metric learning struggles to extract complete entities and our token-level model is parameter-efficient. Thirdly, we report the performance of combining the outputs of the two networks with the intersection or union operation. Neither exceeds the span network alone, because the token network performs poorly and a simple combination cannot compensate for it. Finally, it is worth noting that the span network outperforms the w/o span-level loss variant, which shows that jointly training with the token-level loss improves the performance of the span-level network.
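To make the procedure concrete, the following is a minimal sketch of the consistent greedy inference step. The exact probability-adjustment rule is given by Eq. (19) in the paper and is not reproduced here; the adjustment below (penalizing spans whose tokens receive weak support from the token network, scaled by δ) and all names are illustrative assumptions.

```python
# Illustrative sketch of consistent greedy inference (not the exact Eq. (19)).
# spans: list of (start, end, label, prob) candidates from the span network.
# token_probs: per-token dicts mapping label -> token-network probability.

def consistent_greedy_inference(spans, token_probs, delta=0.02):
    adjusted = []
    for start, end, label, prob in spans:
        # Assumed adjustment: down-weight a span when the token network
        # gives its label weak average support over the span's tokens.
        support = sum(token_probs[i].get(label, 0.0)
                      for i in range(start, end + 1)) / (end - start + 1)
        adjusted.append((start, end, label, prob - delta * (1.0 - support)))

    # Greedily keep the highest adjusted-probability spans that do not
    # overlap any span selected so far.
    selected = []
    for start, end, label, prob in sorted(adjusted, key=lambda s: -s[3]):
        if all(end < s or start > e for s, e, _, _ in selected):
            selected.append((start, end, label, prob))
    return selected
```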

5.8. Unidirectional vs. Bidirectional Consistent Loss (RQ4)

To answer RQ4, we explore the performance of using a unidirectional Kullback-Leibler divergence with temperature as the consistent loss in Table 9, i.e., only using the span network to supervise the token network (ST) or the token network to supervise the span network (TS).

First, the bidirectional consistent loss performs better than both ST and TS, which shows that it is more effective at aligning the outputs of the two networks than a unidirectional consistent loss. Secondly, ST and TS both exceed the w/o consistent loss variant, showing that both unidirectional consistent losses help align the outputs and improve performance. Thirdly, TS outperforms ST, suggesting that supervision from the token network is more effective for training the span network, perhaps because it is easier for the span network to learn a consistent output.
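For reference, a minimal sketch of the bidirectional consistent loss is shown below, assuming the two networks produce logits over the same label set for aligned units; how the token- and span-level outputs are aligned, and any additional scaling, follow the paper's definitions rather than this sketch.

```python
import torch.nn.functional as F

def bidirectional_kl(token_logits, span_logits, tau=2.0):
    """Sketch of the bidirectional KL consistent loss with temperature tau.

    token_logits, span_logits: [batch, num_classes] logits from the
    token-level and span-level networks for aligned units (assumption).
    """
    log_t = F.log_softmax(token_logits / tau, dim=-1)
    log_s = F.log_softmax(span_logits / tau, dim=-1)
    # KL(span || token): the span network supervises the token network (ST).
    kl_st = F.kl_div(log_t, log_s.exp(), reduction="batchmean")
    # KL(token || span): the token network supervises the span network (TS).
    kl_ts = F.kl_div(log_s, log_t.exp(), reduction="batchmean")
    return 0.5 * (kl_st + kl_ts)
```

The unidirectional ST and TS variants in Table 9 correspond to keeping only the kl_st or kl_ts term, respectively.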

5.9. Effects of Different Consistent Losses (RQ4)

We further explore the performance of different choices of consistent loss in Table 9 to answer RQ4, i.e., mean squared error (MSE), Kullback-Leibler divergence (KL), negative cosine similarity (COS), and Jensen-Shannon divergence (JS).

First, the KL loss with temperature performs best and the plain KL loss second best, showing that the temperature yields better supervision signals. It is worth noting that both KL divergence and JS divergence exceed the w/o consistent loss variant, which shows that they can effectively measure the difference between the two distributions and improve performance. Secondly, w/o consistent loss exceeds MSE and COS, illustrating that MSE and COS actually hurt performance and are poor choices for measuring the difference between the two probability distributions in our model.
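The alternative losses compared in Table 9 can be sketched as follows, where p and q denote the probability distributions output by the two networks; the divergence directions are our assumption.

```python
import torch.nn.functional as F

def consistent_loss(p, q, kind="js"):
    """Consistent-loss variants from Table 9 (illustrative sketch).

    p, q: [batch, num_classes] probability distributions from the two networks.
    """
    if kind == "mse":
        return F.mse_loss(p, q)
    if kind == "cos":
        return -F.cosine_similarity(p, q, dim=-1).mean()
    if kind == "kl":  # plain KL without temperature
        return F.kl_div(q.log(), p, reduction="batchmean")
    if kind == "js":  # symmetric Jensen-Shannon divergence
        m = 0.5 * (p + q)
        return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                      + F.kl_div(m.log(), q, reduction="batchmean"))
    raise ValueError(f"unknown consistent loss: {kind}")
```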

Table 9. The performance of our model under different consistent loss settings on the development set of the FewNERD dataset under the Inter 5-way 1~2 shot setting. ST denotes using the span network to supervise the token network and TS denotes using the token network to supervise the span network. MSE, KL, COS, and JS denote mean squared error, Kullback-Leibler divergence, negative cosine similarity, and Jensen-Shannon divergence, respectively. The experiment is repeated five times.
Consistent Loss F1
Ours 48.63±0.51
ST 48.19±0.49
TS 48.41±0.44
MSE 48.05±0.32
KL 48.41±0.13
COS 47.93±0.14
JS 48.38±0.46
Figure 3. Effects of hyper-parameters under the 5-way 1~2 shot setting: panels (a) and (b) show the effects of λ, panels (c) and (d) the effects of γ, and panels (e) and (f) the effects of δ.

5.10. Effects of Hyper-Parameters (RQ5)

To answer RQ5, we explore the effects of different hyper-parameters (i.e., the weight λ of the token-level loss in Eq. (18), the weight γ of the consistent loss in Eq. (18), and the degree of penalty δ in the inference algorithm in Eq. (19)) while fixing the other hyper-parameters, on the FewNERD dataset under the 5-way 1~2 shot setting (Fig. 3).

We first observe the effects of λ, selected from {0.02, 0.05, 0.1, 0.2, 0.3}. From Figures 3(a) and 3(b), we find that the performance is relatively stable under the Inter and Intra settings, fluctuating by 0.6% and 1.5%, respectively. As λ increases, the performance first rises and then declines under both settings, which shows that a suitable λ helps improve performance. The optimal value depends on the setting: the model performs best when λ is 0.1 under the Inter setting and 0.2 under the Intra setting.

Then, we observe the effects of γ and δ, chosen from {0.02, 0.05, 0.1, 0.2, 0.3} and {0.01, 0.02, 0.05, 0.1}, respectively. The overall trend is the same as for λ. The performance is stable under both settings, fluctuating by 0.5% and 0.4% as γ changes and by 0.2% and 0.3% as δ changes, which shows that our model is insensitive to γ and δ. As γ or δ increases, the performance again first rises and then declines under both settings. The model achieves the best performance when γ is 0.05 and δ is 0.02 under both settings.
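For clarity, we assume the overall training objective of Eq. (18) combines the three losses linearly, as in the sketch below; λ weights the token-level loss and γ weights the consistent loss, set here to the best values found above. Eq. (18) defines the exact form.

```python
# Assumed composition of the overall objective (Eq. (18) gives the exact
# form): the span-level loss plus weighted token-level and consistent losses.
def total_loss(span_loss, token_loss, consistent_loss, lam=0.1, gamma=0.05):
    return span_loss + lam * token_loss + gamma * consistent_loss
```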

5.11. Comparison with Other O Divisions

Table 10. The performance of different O divisions on the development set of the FewNERD dataset under the Inter 5-way 1~2 shot setting. The upper part of the table shows the performance of our model under the two divisions, and the lower part shows the ESD model under the two divisions. All results are averaged over five runs.
Model Setting Division Setting F1
CDAP Ours 48.63±0.51
Wang et al. (2022) 48.39±0.68
ESD Ours 46.55±0.41
Wang et al. (2022) 45.83±0.56

We compare our division with that of Wang et al. (2022) under two different models (i.e., CDAP and ESD) in Table 10 to show the effectiveness of our division. Wang et al. (2022) also divide class O into three sub-classes: sub-class O1 denotes a span that does not overlap with any entity or slot in the sentence, sub-class O2 denotes a span that is a sub-span of an entity or slot, and O3 denotes the others.

We can see that our O division achieves better performance under both models. This illustrates the effectiveness of dividing sub-classes according to the left and right boundary tokens. It is worth noting that our division brings a larger gain on the ESD model. This may be because ESD's capacity to distinguish sub-classes is limited compared to our model, so our division helps it distinguish sub-classes more effectively.
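The division of Wang et al. (2022) described above can be sketched directly from its definition; our own boundary-token-based division is defined earlier in the paper and is not reproduced here.

```python
def wang_o_subclass(span, entities):
    """Sub-class of an O span under the Wang et al. (2022) division.

    span and each entity are inclusive (start, end) token offsets.
    O1: overlaps no entity; O2: is a sub-span of some entity; O3: others.
    """
    s, e = span
    if all(e < es or s > ee for es, ee in entities):
        return "O1"
    if any(es <= s and e <= ee for es, ee in entities):
        return "O2"
    return "O3"
```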

5.12. Efficiency

Since our model considers both the token level and the span level, we compare it with token-level methods (i.e., Proto, NNShot, and StructShot) and span-level methods (i.e., ESD and De-MAML) in terms of the number of parameters and inference time in Table 11 to demonstrate its efficiency.

First, we observe the number of parameters. Proto, NNShot, and StructShot have the fewest parameters because they only use BERT to encode the tokens. Our model is similar in size to ESD and significantly smaller than De-MAML, which shows that our method is parameter-efficient, introducing only 2M more parameters than Proto. In addition, De-MAML uses two BERT encoders for its two tasks, resulting in a huge number of parameters.

Secondly, we compare the inference time of our model and the baselines under the same hardware and batch size (i.e., 1). The token-level methods are faster than the span-level methods because the span-level methods must enumerate all spans within a sentence whose length is at most L; the number of spans is approximately L times the number of tokens, which brings extra computation overhead (see the sketch below). Proto is faster than NNShot, and NNShot is faster than StructShot: Proto only measures the distance to the prototypes, NNShot measures the distance to all tokens in the support set, and StructShot further uses a Viterbi decoder to model label dependencies on top of NNShot, requiring more inference time. Comparing with the span-level methods, our model is 13.40 ms slower than ESD and 32.49 ms faster than De-MAML under the 1-shot setting; under the 5-shot setting, it is 26.78 ms slower than ESD and 23.44 ms slower than De-MAML. This illustrates that the time cost of our model is acceptable and our model is efficient. In addition, De-MAML needs two forward passes through BERT, resulting in a long inference time under the 1-shot setting, while our model and ESD both use the cross-attention module to model the interaction between the support set and the query set, which significantly increases inference time under the 5-shot setting.
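As a rough illustration of this overhead, the sketch below enumerates all candidate spans of length at most L; the names are illustrative.

```python
def enumerate_spans(n_tokens, max_len):
    """All spans of length <= max_len in a sentence of n_tokens tokens.

    There are roughly n_tokens * max_len such spans, which is why
    span-level methods pay an ~L-times overhead over token-level ones.
    """
    return [(i, j) for i in range(n_tokens)
            for j in range(i, min(i + max_len, n_tokens))]

# For example, a 30-token sentence with max span length 8 yields
# len(enumerate_spans(30, 8)) == 212 candidate spans versus 30 tokens.
```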

Table 11. The number of parameters and inference time per episode on the FewNERD dataset under the Inter 5-way setting. F1 denotes the micro-F1 over the eight settings of the FewNERD dataset. # Para. denotes the number of parameters and † denotes that we do not fine-tune the model during the meta-testing phase for time efficiency.
Models F1 # Para. Inference Time (1~2 shot) Inference Time (5~10 shot)
Ours 53.74 111M 79.43 ms 234.23 ms
Proto 37.10 109M 24.98 ms 29.83 ms
NNShot 37.37 109M 25.43 ms 30.97 ms
StructShot 39.73 109M 30.70 ms 49.62 ms
ESD 50.61 111M 66.03 ms 207.45 ms
De-MAML 44.95 218M 111.92 ms 210.79 ms
Table 12. Error analysis on the test set of the FewNERD dataset under the 5-way 1~2 shot setting. # FP-Span denotes the number and proportion of extracted entities with a wrong span boundary. # FP-Type denotes the number and proportion of extracted entities with the right boundary but the wrong entity type.
Setting F1 # FP-Span # FP-Type
Inter 62.79 7372(74.52%) 2619(25.48%)
Intra 41.21 11264(65.46%) 5944(34.54%)

5.13. Error Analysis

We further analyze the prediction errors of our model in detail and divide them into two categories (i.e., FP-Span and FP-Type) according to whether the span boundary is correct. FP-Span denotes extracted entities with a wrong span boundary, and FP-Type denotes extracted entities with the right boundary but the wrong entity type.

First, as shown in Table 12, FP-Span is the main prediction error under both settings, accounting for 74.52% under the Inter setting and 65.46% under the Intra setting. This indicates that span detection is more difficult than span classification; it is the main bottleneck at present and needs further attention in the future. Secondly, compared to the Inter setting, both types of errors rise under the more difficult Intra setting: FP-Span increases by 52.79% and FP-Type by 126.96%. This suggests that as the domain difference between the source and target domains increases, entity classification is more difficult to transfer than span detection.
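The two error categories can be computed mechanically; the sketch below labels each false positive, assuming entities are represented as (start, end, type) triples.

```python
def categorize_false_positives(predicted, gold):
    """Split false positives into FP-Span and FP-Type (sketch).

    predicted, gold: sets of (start, end, type) entities. A false positive
    whose (start, end) matches some gold entity has the right boundary but
    the wrong type (FP-Type); otherwise its boundary is wrong (FP-Span).
    """
    gold_boundaries = {(s, e) for s, e, _ in gold}
    fp_span = fp_type = 0
    for s, e, t in predicted - gold:
        if (s, e) in gold_boundaries:
            fp_type += 1
        else:
            fp_span += 1
    return fp_span, fp_type
```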

5.14. Case Study

Apart from the quantitative analysis, we also conduct a case study to intuitively show the effectiveness of our model. We list the predictions of our model and three strong baselines (i.e., StructShot, De-MAML, and ESD) on two groups of test-set queries from FewNERD under the Inter 5-way 1~2 shot setting in Table 13.

First, we observe the predictions of the first group (Queries 1, 2, and 3). The extraction capacity of StructShot is limited: it only correctly extracts the entity “B.A.” in Query (1) and also extracts impossible entities (i.e., “the” and “of” in Query (3)), which illustrates that StructShot relies heavily on the labeled samples in the support set. Then, we observe the predictions of the span-level methods. For Query (1), De-MAML and ESD predict None, while our model extracts “B.A.”; this may be because we use the token-level loss to jointly train the model. For Query (2), our model correctly extracts and classifies both “Monaco” and “Maxim’s”, whereas De-MAML fails to classify “Maxim’s” and ESD fails to extract it. This shows that our model can extract multiple entities even when a sentence contains several. For Query (3), our model correctly extracts and classifies “faculty of medicine in Paris”, whereas De-MAML and ESD extract only “faculty of medicine” and ESD also predicts the wrong type. This shows that our model can locate and extract multi-token entities.

Secondly, we display some failure cases in the second group (Queries 4, 5, and 6). For Query (4), StructShot extracts “Right” and fails to extract the simple entity “Republican”, again showing that its extraction capacity is limited and relies heavily on the labeled samples in the support set. De-MAML, ESD, and our model do not extract the entity “neo-Confederate Right” at all, perhaps because this entity is more difficult to extract. For Query (5), StructShot does not extract the entity completely, extracting only “bachelor”; De-MAML and our model also extract it incompletely as “bachelor of law”. This indicates that locating entity boundaries remains challenging. ESD extracts no entity, perhaps because it is trained with only span labels and the supervision is sparse. For Query (6), all methods predict None except De-MAML, which predicts the wrong entity “New American”. Examining the corresponding support set, this may be due to the weak representation of the prototype constructed from the two samples of the class (i.e., Bobino and Gaumont-Palace) in the support set.

Table 13. Test-set predictions of our proposed model and three strong baselines for the case study on the FewNERD dataset under the Inter 5-way 1~2 shot setting. Different colors indicate different entity types in a sentence; blue indicates that the entity type does not appear in the query sample.
Query StructShot De-MAML ESD Ours
(1) He decided to return to school in 1996, earning his B.A. [other-educational degree]. earning, B.A. None None B.A.
(2) The same year, Monaco [person-actor] was named “Maxim’s [art-written art]” number one sexiest cover model of the decade. None Monaco, Maxim’s Monaco Monaco, Maxim’s
(3) In 1864 he attained the chair of internal pathology at the faculty of medicine in Paris [building-hospital]. the, chair, of, of medicine in faculty of medicine faculty of medicine faculty of medicine in Paris
(4) She writes that “even into the twenty-first century mainstream conservative Republican [organization-political party] politicians continued to associate themselves with issues, symbols, and organizations inspired by the neo-Confederate Right [organization-political party]”. Right Republican Republican Republican
(5) He gained a bachelor of law degree [other-educational degree]. bachelor bachelor of law None bachelor of law
(6) The script was published in Grove New American Theater [building-theater]. None New American None None

6. Conclusion and Future Work

In this paper, we unify token-level and span-level supervisions and propose a consistent dual adaptive prototypical network for few-shot sequence labeling. Our model contains a token-level network and a span-level network, jointly trained with token-level and span-level labels. Furthermore, we propose a consistent loss, based on bidirectional Kullback-Leibler divergence with temperature, to make the outputs of the two networks consistent. During the inference phase, we propose a consistent greedy inference algorithm that first adjusts the predicted probabilities of the span network and then greedily selects non-overlapping spans with maximum adjusted probability. Extensive experiments show that unifying token and span level supervisions is effective and that our model achieves new state-of-the-art results on three benchmark datasets.

In the future, we plan to deepen and widen our work in the following directions: (1) In this work, we unify token and span level supervisions for few-shot sequence labeling and conduct a preliminary exploration. The proposed method is simple and effective, but whether other, more complicated token-level and span-level networks are also effective under this framework remains unexplored and is worth further study. (2) The proposed method has only been verified on English texts; its effectiveness on datasets in other languages deserves further verification. (3) The effectiveness of our model on traditional supervised sequence labeling also deserves further exploration.

References

  • Agosti et al. (2020) Maristella Agosti, Stefano Marchesin, and Gianmaria Silvello. 2020. Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval. ACM Trans. Inf. Syst. 38, 4 (2020), 38:1–38:48. https://doi.org/10.1145/3417996
  • Athiwaratkun et al. (2020) Ben Athiwaratkun, Cícero Nogueira dos Santos, Jason Krone, and Bing Xiang. 2020. Augmented Natural Language for Generative Sequence Labeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020. 375–385. https://doi.org/10.18653/v1/2020.emnlp-main.27
  • Chen et al. (2022) Jiawei Chen, Qing Liu, Hongyu Lin, Xianpei Han, and Le Sun. 2022. Few-shot Named Entity Recognition with Self-describing Networks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. 5711–5722. https://doi.org/10.18653/v1/2022.acl-long.392
  • Chen et al. (2019) Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. 2019. A Closer Look at Few-shot Classification. In 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net. https://openreview.net/forum?id=HkxLXnAcFQ
  • Chen et al. (2021) Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, and Xiaolong Wang. 2021. Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 9042–9051. https://doi.org/10.1109/ICCV48922.2021.00893
  • Cheng et al. (2021) Zifeng Cheng, Zhiwei Jiang, Yafeng Yin, Na Li, and Qing Gu. 2021. A Unified Target-Oriented Sequence-to-Sequence Model for Emotion-Cause Pair Extraction. IEEE ACM Trans. Audio Speech Lang. Process. 29 (2021), 2779–2791. https://doi.org/10.1109/TASLP.2021.3102194
  • Chiu and Nichols (2016) Jason P. C. Chiu and Eric Nichols. 2016. Named Entity Recognition with Bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguistics 4 (2016), 357–370. https://doi.org/10.1162/tacl_a_00104
  • Colombo et al. (2020) Pierre Colombo, Emile Chapuis, Matteo Manica, Emmanuel Vignon, Giovanna Varni, and Chloé Clavel. 2020. Guiding Attention in Sequence-to-Sequence Models for Dialogue Act Prediction. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020. 7594–7601. https://ojs.aaai.org/index.php/AAAI/article/view/6259
  • Coucke et al. (2018) Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, and Joseph Dureau. 2018. Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces. CoRR abs/1805.10190 (2018). http://arxiv.org/abs/1805.10190
  • Cui et al. (2021) Leyang Cui, Yu Wu, Jian Liu, Sen Yang, and Yue Zhang. 2021. Template-Based Named Entity Recognition Using BART. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021. 1835–1845. https://doi.org/10.18653/v1/2021.findings-acl.161
  • Cui and Zhang (2019) Leyang Cui and Yue Zhang. 2019. Hierarchically-Refined Label Attention Network for Sequence Labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. 4113–4126. https://doi.org/10.18653/v1/D19-1422
  • Das et al. (2022) Sarkar Snigdha Sarathi Das, Arzoo Katiyar, Rebecca J. Passonneau, and Rui Zhang. 2022. CONTaiNER: Few-Shot Named Entity Recognition via Contrastive Learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. 6338–6353. https://doi.org/10.18653/v1/2022.acl-long.439
  • Derczynski et al. (2017) Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, NUT@EMNLP 2017. Association for Computational Linguistics, 140–147. https://doi.org/10.18653/v1/w17-4418
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019. 4171–4186. https://doi.org/10.18653/v1/n19-1423
  • Dhillon et al. (2020) Guneet Singh Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. 2020. A Baseline for Few-Shot Image Classification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rylXBkrYDS
  • Ding et al. (2021) Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Haitao Zheng, and Zhiyuan Liu. 2021. Few-NERD: A Few-shot Named Entity Recognition Dataset. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. 3198–3213. https://doi.org/10.18653/v1/2021.acl-long.248
  • Fei-Fei et al. (2006) Li Fei-Fei, Robert Fergus, and Pietro Perona. 2006. One-Shot Learning of Object Categories. IEEE Trans. Pattern Anal. Mach. Intell. 28, 4 (2006), 594–611. https://doi.org/10.1109/TPAMI.2006.79
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017. 1126–1135. http://proceedings.mlr.press/v70/finn17a.html
  • Finn et al. (2018) Chelsea Finn, Kelvin Xu, and Sergey Levine. 2018. Probabilistic Model-Agnostic Meta-Learning. In Advances in Neural Information Processing Systems, NeurIPS 2018. 9537–9548. https://proceedings.neurips.cc/paper/2018/hash/8e2c381d4dd04f1c55093f22c59c3a08-Abstract.html
  • Fritzler et al. (2019) Alexander Fritzler, Varvara Logacheva, and Maksim Kretov. 2019. Few-shot classification in named entity recognition task. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, SAC 2019. 993–1000. https://doi.org/10.1145/3297280.3297378
  • Fu et al. (2021) Jinlan Fu, Xuanjing Huang, and Pengfei Liu. 2021. SpanNER: Named Entity Re-/Recognition as Span Prediction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. 7183–7195. https://doi.org/10.18653/v1/2021.acl-long.558
  • Gao et al. (2019) Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. 2019. Hybrid Attention-Based Prototypical Networks for Noisy Few-Shot Relation Classification. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019. AAAI Press, 6407–6414. https://doi.org/10.1609/aaai.v33i01.33016407
  • Guo et al. (2022) Jiafeng Guo, Yinqiong Cai, Yixing Fan, Fei Sun, Ruqing Zhang, and Xueqi Cheng. 2022. Semantic Models for the First-Stage Retrieval: A Comprehensive Review. ACM Trans. Inf. Syst. 40, 4 (2022), 66:1–66:42. https://doi.org/10.1145/3486250
  • Henderson and Vulic (2021) Matthew Henderson and Ivan Vulic. 2021. ConVEx: Data-Efficient and Few-Shot Slot Labeling. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021. 3375–3389. https://doi.org/10.18653/v1/2021.naacl-main.264
  • Hou et al. (2019) Ruibing Hou, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. 2019. Cross Attention Network for Few-shot Classification. In Advances in Neural Information Processing Systems, NeurIPS 2019. 4005–4016. https://proceedings.neurips.cc/paper/2019/hash/01894d6f048493d2cacde3c579c315a3-Abstract.html
  • Hou et al. (2020) Yutai Hou, Wanxiang Che, Yongkui Lai, Zhihan Zhou, Yijia Liu, Han Liu, and Ting Liu. 2020. Few-shot Slot Tagging with Collapsed Dependency Transfer and Label-enhanced Task-adaptive Projection Network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. 1381–1393. https://doi.org/10.18653/v1/2020.acl-main.128
  • Hou et al. (2022) Yutai Hou, Cheng Chen, Xianzhen Luo, Bohan Li, and Wanxiang Che. 2022. Inverse is Better! Fast and Accurate Prompt for Few-shot Slot Tagging. In Findings of the Association for Computational Linguistics: ACL 2022. 637–647. https://doi.org/10.18653/v1/2022.findings-acl.53
  • Huang et al. (2021) Jiaxin Huang, Chunyuan Li, Krishan Subudhi, Damien Jose, Shobana Balakrishnan, Weizhu Chen, Baolin Peng, Jianfeng Gao, and Jiawei Han. 2021. Few-Shot Named Entity Recognition: An Empirical Baseline Study. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021. 10408–10423. https://doi.org/10.18653/v1/2021.emnlp-main.813
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR abs/1508.01991 (2015). http://arxiv.org/abs/1508.01991
  • Jiang et al. (2020) Zhengbao Jiang, Wei Xu, Jun Araki, and Graham Neubig. 2020. Generalizing Natural Language Analysis through Span-relation Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online. Association for Computational Linguistics, 2120–2133. https://doi.org/10.18653/v1/2020.acl-main.192
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 260–270. https://doi.org/10.18653/v1/n16-1030
  • Lee et al. (2022) Dong-Ho Lee, Akshen Kadakia, Kangmin Tan, Mahak Agarwal, Xinyu Feng, Takashi Shibuya, Ryosuke Mitani, Toshiyuki Sekiya, Jay Pujara, and Xiang Ren. 2022. Good Examples Make A Faster Learner: Simple Demonstration-based Learning for Low-resource NER. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. 2687–2700. https://doi.org/10.18653/v1/2022.acl-long.192
  • Li et al. (2020b) Aoxue Li, Weiran Huang, Xu Lan, Jiashi Feng, Zhenguo Li, and Liwei Wang. 2020b. Boosting Few-Shot Learning With Adaptive Margin Loss. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020. Computer Vision Foundation / IEEE, 12573–12581. https://doi.org/10.1109/CVPR42600.2020.01259
  • Li et al. (2022) Jing Li, Billy Chiu, Shanshan Feng, and Hao Wang. 2022. Few-Shot Named Entity Recognition via Meta-Learning. IEEE Trans. Knowl. Data Eng. 34, 9 (2022), 4245–4256. https://doi.org/10.1109/TKDE.2020.3038670
  • Li et al. (2020d) Kai Li, Yulun Zhang, Kunpeng Li, and Yun Fu. 2020d. Adversarial Feature Hallucination Networks for Few-Shot Learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020. Computer Vision Foundation / IEEE, 13467–13476. https://doi.org/10.1109/CVPR42600.2020.01348
  • Li et al. (2020c) Ruirui Li, Xian Wu, Xiusi Chen, and Wei Wang. 2020c. Few-Shot Learning for New User Recommendation in Location-based Social Networks. In WWW ’20: The Web Conference 2020, Taipei, Taiwan. 2472–2478. https://doi.org/10.1145/3366423.3379994
  • Li et al. (2020a) Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020a. A Unified MRC Framework for Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. 5849–5859. https://doi.org/10.18653/v1/2020.acl-main.519
  • Li et al. (2021) Xiangsheng Li, Jiaxin Mao, Weizhi Ma, Yiqun Liu, Min Zhang, Shaoping Ma, Zhaowei Wang, and Xiuqiang He. 2021. Topic-enhanced knowledge-aware retrieval model for diverse relevance estimation. In WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021. ACM / IW3C2, 756–767. https://doi.org/10.1145/3442381.3449943
  • Li et al. (2015) Zhenghua Li, Jiayuan Chao, Min Zhang, and Wenliang Chen. 2015. Coupled Sequence Labeling on Heterogeneous Annotations: POS Tagging as a Case Study. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015. 1783–1792. https://doi.org/10.3115/v1/p15-1172
  • Liu et al. (2020) Bulou Liu, Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, and Haiqing Chen. 2020. An Attention-based Deep Relevance Model for Few-shot Document Filtering. ACM Trans. Inf. Syst. 39, 1 (2020), 6:1–6:35. https://doi.org/10.1145/3419972
  • Liu et al. (2022) Yonghao Liu, Mengyu Li, Ximing Li, Fausto Giunchiglia, Xiaoyue Feng, and Renchu Guan. 2022. Few-shot Node Classification on Attributed Networks with Graph Meta-learning. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 471–481. https://doi.org/10.1145/3477495.3531978
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019,. https://openreview.net/forum?id=Bkg6RiCqY7
  • Ma et al. (2019) Dehong Ma, Sujian Li, Fangzhao Wu, Xing Xie, and Houfeng Wang. 2019. Exploring Sequence-to-Sequence Learning in Aspect Term Extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019. 3538–3547. https://doi.org/10.18653/v1/p19-1344
  • Ma et al. (2022a) Jie Ma, Miguel Ballesteros, Srikanth Doss, Rishita Anubhai, Sunil Mallya, Yaser Al-Onaizan, and Dan Roth. 2022a. Label Semantics for Few Shot Named Entity Recognition. In Findings of the Association for Computational Linguistics: ACL 2022. 1956–1971. https://doi.org/10.18653/v1/2022.findings-acl.155
  • Ma et al. (2021) Jianqiang Ma, Zeyu Yan, Chang Li, and Yang Zhang. 2021. Frustratingly Simple Few-Shot Slot Tagging. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021. 1028–1033. https://doi.org/10.18653/v1/2021.findings-acl.88
  • Ma et al. (2022c) Longxuan Ma, Mingda Li, Wei-Nan Zhang, Jiapeng Li, and Ting Liu. 2022c. Unstructured Text Enhanced Open-Domain Dialogue System: A Systematic Survey. ACM Trans. Inf. Syst. 40, 1 (2022), 9:1–9:44. https://doi.org/10.1145/3464377
  • Ma et al. (2022d) Ruotian Ma, Xin Zhou, Tao Gui, Yiding Tan, Linyang Li, Qi Zhang, and Xuanjing Huang. 2022d. Template-free Prompt Tuning for Few-shot NER. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022. 5721–5732. https://doi.org/10.18653/v1/2022.naacl-main.420
  • Ma et al. (2022b) Tingting Ma, Huiqiang Jiang, Qianhui Wu, Tiejun Zhao, and Chin-Yew Lin. 2022b. Decomposed Meta-Learning for Few-Shot Named Entity Recognition. In Findings of the Association for Computational Linguistics: ACL 2022. 1584–1596. https://doi.org/10.18653/v1/2022.findings-acl.124
  • Ma and Hovy (2016) Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016. https://doi.org/10.18653/v1/p16-1101
  • Oguz and Vu (2021) Cennet Oguz and Ngoc Thang Vu. 2021. Few-shot Learning for Slot Tagging with Attentive Relational Network. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021. 1566–1572. https://doi.org/10.18653/v1/2021.eacl-main.134
  • Ouchi et al. (2020) Hiroki Ouchi, Jun Suzuki, Sosuke Kobayashi, Sho Yokoi, Tatsuki Kuribayashi, Ryuto Konno, and Kentaro Inui. 2020. Instance-Based Learning of Span Representations: A Case Study through Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. Association for Computational Linguistics, 6452–6459. https://doi.org/10.18653/v1/2020.acl-main.575
  • Pradhan et al. (2013) Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards Robust Linguistic Analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, CoNLL 2013. ACL, 143–152. https://aclanthology.org/W13-3516/
  • Qin et al. (2019) Libo Qin, Wanxiang Che, Yangming Li, Haoyang Wen, and Ting Liu. 2019. A Stack-Propagation Framework with Token-Level Intent Detection for Spoken Language Understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. Association for Computational Linguistics, 2078–2087. https://doi.org/10.18653/v1/D19-1214
  • Sang and Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003. ACL, 142–147. https://aclanthology.org/W03-0419/
  • Shen et al. (2022) Yongliang Shen, Xiaobin Wang, Zeqi Tan, Guangwei Xu, Pengjun Xie, Fei Huang, Weiming Lu, and Yueting Zhuang. 2022. Parallel Instance Query Network for Named Entity Recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. Association for Computational Linguistics, 947–961. https://doi.org/10.18653/v1/2022.acl-long.67
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. Advances in neural information processing systems (2017). https://proceedings.neurips.cc/paper/2017/hash/cb8da6767461f2812ae4290eac7cbc42-Abstract.html
  • Sun et al. (2019) Shengli Sun, Qingfeng Sun, Kevin Zhou, and Tengchao Lv. 2019. Hierarchical Attention Prototypical Networks for Few-Shot Text Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. Association for Computational Linguistics, 476–485. https://doi.org/10.18653/v1/D19-1045
  • Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. 2018. Learning to Compare: Relation Network for Few-Shot Learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018. 1199–1208. http://openaccess.thecvf.com/content_cvpr_2018/html/Sung_Learning_to_Compare_CVPR_2018_paper.html
  • Tan et al. (2021) Zeqi Tan, Yongliang Shen, Shuai Zhang, Weiming Lu, and Yueting Zhuang. 2021. A Sequence-to-Set Network for Nested Named Entity Recognition. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021. ijcai.org, 3936–3942. https://doi.org/10.24963/ijcai.2021/542
  • Tong et al. (2021) Meihan Tong, Shuai Wang, Bin Xu, Yixin Cao, Minghui Liu, Lei Hou, and Juanzi Li. 2021. Learning from Miscellaneous Other-Class Words for Few-shot Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. 6236–6247. https://doi.org/10.18653/v1/2021.acl-long.487
  • Vakulenko et al. (2021) Svitlana Vakulenko, Evangelos Kanoulas, and Maarten de Rijke. 2021. A Large-scale Analysis of Mixed Initiative in Information-Seeking Dialogues for Conversational Search. ACM Trans. Inf. Syst. 39, 4 (2021), 49:1–49:32. https://doi.org/10.1145/3466796
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017. 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. Advances in neural information processing systems (2016). https://proceedings.neurips.cc/paper/2016/hash/90e1357833654983612fb05e3ec9148c-Abstract.html
  • Wang et al. (2022) Peiyi Wang, Runxin Xu, Tianyu Liu, Qingyu Zhou, Yunbo Cao, Baobao Chang, and Zhifang Sui. 2022. An Enhanced Span-based Decomposition Method for Few-Shot Sequence Labeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022. 5012–5024. https://doi.org/10.18653/v1/2022.naacl-main.369
  • Wang et al. (2021) Yaqing Wang, Haoda Chu, Chao Zhang, and Jing Gao. 2021. Learning from Language Description: Low-shot Named Entity Recognition via Decomposed Framework. In Findings of the Association for Computational Linguistics: EMNLP 2021. 1618–1630. https://doi.org/10.18653/v1/2021.findings-emnlp.139
  • Wang et al. (2018) Yu-Xiong Wang, Ross B. Girshick, Martial Hebert, and Bharath Hariharan. 2018. Low-Shot Learning From Imaginary Data. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. Computer Vision Foundation / IEEE Computer Society, 7278–7286. https://doi.org/10.1109/CVPR.2018.00760
  • Yan et al. (2021) Hang Yan, Tao Gui, Junqi Dai, Qipeng Guo, Zheng Zhang, and Xipeng Qiu. 2021. A Unified Generative Framework for Various NER Subtasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. Association for Computational Linguistics, 5808–5822. https://doi.org/10.18653/v1/2021.acl-long.451
  • Yang and Katiyar (2020) Yi Yang and Arzoo Katiyar. 2020. Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020. 6365–6375. https://doi.org/10.18653/v1/2020.emnlp-main.516
  • Yoon et al. (2019) Sung Whan Yoon, Jun Seo, and Jaekyun Moon. 2019. TapNet: Neural Network Augmented with Task-Adaptive Projection for Few-Shot Learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019. 7115–7123. http://proceedings.mlr.press/v97/yoon19a.html
  • Yu et al. (2021a) Dian Yu, Luheng He, Yuan Zhang, Xinya Du, Panupong Pasupat, and Qi Li. 2021a. Few-shot Intent Classification and Slot Filling with Retrieved Examples. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021. 734–749. https://doi.org/10.18653/v1/2021.naacl-main.59
  • Yu et al. (2020) Juntao Yu, Bernd Bohnet, and Massimo Poesio. 2020. Named Entity Recognition as Dependency Parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. 6470–6476. https://doi.org/10.18653/v1/2020.acl-main.577
  • Yu et al. (2021b) Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and Zhiyuan Liu. 2021b. Few-Shot Conversational Dense Retrieval. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 829–838. https://doi.org/10.1145/3404835.3462856
  • Zeldes (2017) Amir Zeldes. 2017. The GUM corpus: creating multilayer resources in the classroom. Lang. Resour. Evaluation 51, 3 (2017), 581–612. https://doi.org/10.1007/s10579-016-9343-x
  • Zhai et al. (2017) Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. 2017. Neural Models for Sequence Chunking. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. 3365–3371. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14776
  • Zhang et al. (2018) Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. 2018. MetaGAN: An Adversarial Approach to Few-Shot Learning. In Advances in Neural Information Processing Systems, NeurIPS 2018. 2371–2380. https://proceedings.neurips.cc/paper/2018/hash/4e4e53aa080247bc31d0eb4e7aeb07a0-Abstract.html
  • Zhang et al. (2022a) Shichao Zhang, Jiaye Li, and Yangding Li. 2022a. Reachable distance function for KNN classification. IEEE Transactions on Knowledge and Data Engineering (2022).
  • Zhang et al. (2022b) Shuai Zhang, Yongliang Shen, Zeqi Tan, Yiquan Wu, and Weiming Lu. 2022b. De-Bias for Generative Extraction in Unified NER Task. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. Association for Computational Linguistics, 808–818. https://doi.org/10.18653/v1/2022.acl-long.59
  • Zhou et al. (2022) Jing Zhou, Yanan Zheng, Jie Tang, Li Jian, and Zhilin Yang. 2022. FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022. Association for Computational Linguistics, 8646–8665. https://doi.org/10.18653/v1/2022.acl-long.592
  • Zhu et al. (2022) Enwei Zhu, Yiyang Liu, and Jinpeng Li. 2022. Deep Span Representations for Named Entity Recognition. CoRR abs/2210.04182 (2022). https://doi.org/10.48550/arXiv.2210.04182 arXiv:2210.04182