Simple and Effective Few-Shot Named Entity Recognition
with Structured Nearest Neighbor Learning
Abstract
We present a simple few-shot named entity recognition (NER) system based on nearest neighbor learning and structured inference. Our system uses a supervised NER model trained on the source domain as a feature extractor. Across several test domains, we show that a nearest neighbor classifier in this feature space is far more effective than the standard meta-learning approaches. We further propose a cheap but effective method to capture the label dependencies between entity tags without expensive CRF training. We show that our method of combining structured decoding with nearest neighbor learning achieves state-of-the-art performance on standard few-shot NER evaluation tasks, improving F1 scores by a sizable absolute margin over prior meta-learning based systems.
1 Introduction
Named entity recognition (NER) aims at identifying and categorizing spans of text into a closed set of classes, such as people, organizations, and locations. As a core language understanding task, NER is widely adopted in several domains, such as news (Tjong Kim Sang and De Meulder, 2003), medical (Stubbs and Uzuner, 2015), and social media (Derczynski et al., 2017). However, one of the primary challenges in adapting NER to new domains is the mismatch between the different domain-specific entity types. For example, only two out of the twenty-three entity types annotated in the I2B2 2014 data (Stubbs and Uzuner, 2015) can be found in the OntoNotes 5 annotations (Weischedel et al., 2013). Unfortunately, obtaining NER annotations for a novel domain can be quite expensive, often requiring domain knowledge.
Few-shot classification models (Vinyals et al., 2016; Bao et al., 2020) aim at recognizing new classes based on only a few labeled examples (the support set) from each class. In the context of NER, these few-shot classification methods can enable rapid building of NER systems for a new domain by labeling only a few examples per entity class. Several previous studies (Fritzler et al., 2019; Hou et al., 2020) propose using prototypical networks (Snell et al., 2017), a popular few-shot classification algorithm, to address the few-shot NER problem. However, these approaches achieve only low F1 scores on average when transferring knowledge between different NER datasets with one-shot or five-shot examples, warranting more effective methods for the problem.

The direct adaptation of existing few-shot classification methods to few-shot NER is challenging for two reasons. First, NER is essentially a structured learning problem: it is crucial to model label dependencies, as shown in Lample et al. (2016), rather than classifying each token independently as existing few-shot classification approaches do. Second, few-shot classification models (Snell et al., 2017) typically learn to represent each semantic class by a prototype based on the labeled examples in its support set. However, for NER, unlike the entity classes, the Outside (O) class does not represent any unified semantic meaning. In fact, tokens labeled with O in a dataset actually correspond to different semantic spaces that should be separately represented in a metric-based learning framework. For example, in Fig. 1, semantic classes such as professions (e.g., ‘minister’) and dates (e.g., ‘today’) may also belong to the O class in some NER datasets. Thus, previous approaches end up learning a noisy prototype for the O class in this low-resource setting.
In this paper, we propose a simple yet effective method, StructShot, for few-shot NER. Instead of learning a prototype for each entity class, we represent each token in the labeled examples of the support set by its contextual representation in the sentence. We learn these contextual representations by training a standard supervised NER model (Lample et al., 2016; Devlin et al., 2019) on the source domain. Whereas meta-learning approaches (Snell et al., 2017; Vinyals et al., 2016) simulate the few-shot evaluation setup during training, our approach does not need to do so. This makes it possible to deploy a unified NER system supporting both classical and emerging entity types, without the overhead of maintaining a separate few-shot system. During evaluation, StructShot uses a nearest neighbor (NN) classifier and a Viterbi decoder for prediction. As shown in Fig. 1, for each token (“president”) in the target example, the NN classifier finds its nearest token (“minister”) among the support examples, instead of relying on an erroneous class-level (Outside) prototypical representation. We further improve our nearest neighbor predictions by using a Viterbi decoder (Forney, 1973) to capture label dependencies.
We perform extensive in-domain and out-of-domain experiments for this problem. We test our systems both on identifying new types of entities in the source domain and on identifying new types of entities in various target domains, in one-shot and five-shot settings. In addition to the previous evaluation setup followed by Hou et al. (2020), we propose a more standard and reproducible evaluation setup for few-shot NER that uses the standard test and development sets from benchmark datasets of several domains. In particular, we sample support sets from the standard development set and evaluate our models on the standard test set. Across all our experiments, our proposed systems outperform previous meta-learning systems by wide absolute F1 margins.
2 Problem Statement and Setup
In this section, we formalize the task of few-shot NER and propose a standard evaluation setup to facilitate meaningful comparison of results for future research.
2.1 Few-shot NER
NER is a sequence labeling task, where each token in a sentence is either labeled as part of an entity class (e.g., Person, Location, or Organization) or labeled with the O class if it does not belong to an entity. In practice, tagging schemes such as BIO or IO are adopted to indicate whether a token is at the beginning (B-X) or inside (I-X) of an entity of type X. Few-shot NER focuses on a specific NER setting where a system is trained on annotations of one or more source domains and then tested on one or more target domains, given only a few labeled examples per entity class. It is a challenging problem since the target tag set may differ from any source tag set. To this end, few-shot NER systems need to learn to generalize to unseen entity classes using only a few labeled examples.
Formally, the task of $K$-shot NER is defined as follows: given an input sentence $\mathbf{x} = (x_1, \dots, x_T)$ and a $K$-shot support set $\mathcal{S}$ for the target tag set $\mathcal{T}$, find the best tag sequence $\mathbf{y} = (y_1, \dots, y_T)$ for $\mathbf{x}$. The $K$-shot support set contains $K$ entity examples (not tokens) for each entity class in $\mathcal{T}$.
2.2 A standard evaluation setup
Prior work on few-shot NER (Fritzler et al., 2019; Hou et al., 2020) followed the few-shot classification literature and adopted the episode evaluation methodology. Specifically, a NER system is evaluated on multiple evaluation episodes, each of which includes a sampled $K$-shot support set of labeled examples and a few sampled $K$-shot test sets. In addition to these prior practices, we propose a more realistic evaluation setting in which only the support sets are sampled and the model is tested on the standard test sets from NER benchmarks.
Test set construction
In the episode evaluation setting, test sets are sampled such that the different entity classes are equally distributed. This evaluation setup clearly does not account for the entity distributions in real data: in the I2B2 test data, the frequent DATE entity occurs 4,983 times, whereas the infrequent EMAIL entity occurs only once. As a result, the reported performance scores do not reflect the effectiveness of these models when adapting to a new domain. We propose instead to use the original test sets of the standard NER datasets to evaluate the performance of our models. Our evaluation setup does not need to randomly sample test sets, thus improving its reproducibility for future research.
Support set construction
In order to test our models in the few-shot setting, we sample support sets from the standard development set of the benchmark dataset. We account for the variance of our model performance by sampling multiple support sets and reporting the average performance on the test set for these sampled support sets. We plan to release the different support sets used for evaluation in our experiments for reproducibility.
Unlike classification tasks, a sentence in NER may contain multiple entity classes. Simply sampling sentences for each entity class thus yields many more entities of frequent classes than of infrequent ones, since sampling entities of infrequent classes is more likely to also bring in entities of frequent classes than the other way around. We therefore utilize a greedy sampling strategy to build support sets, as shown in Alg. 1: we sample sentences for entity classes in increasing order of their frequencies.
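A minimal Python sketch of such a greedy sampler is shown below; the function name, data layout, and token-level counting are our own simplifications, since Alg. 1 itself is not reproduced here.

```python
import random
from collections import Counter

def greedy_sample_support_set(sentences, entity_classes, k):
    """Greedily build a K-shot support set (illustrative sketch of Alg. 1).

    `sentences` is a list of (tokens, tags) pairs whose tags name entity
    classes ("O" for non-entity tokens).  Classes are processed in increasing
    order of corpus frequency so rare classes are covered first; for brevity
    we count tagged tokens, whereas Alg. 1 counts entity examples.
    """
    freq = Counter(t for _, tags in sentences for t in tags
                   if t in entity_classes)
    counts, support = Counter(), []
    for cls in sorted(entity_classes, key=lambda c: freq[c]):
        pool = [s for s in sentences if cls in s[1] and s not in support]
        random.shuffle(pool)
        while counts[cls] < k and pool:
            sent = pool.pop()
            support.append(sent)
            # A sentence may contain several classes; credit all of them.
            counts.update(t for t in sent[1] if t in entity_classes)
    return support
```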
3 Model
In this section, we present our few-shot NER algorithm based on structured nearest neighbor learning (StructShot). Our method uses a NER model (Lample et al., 2016; Devlin et al., 2019) trained on the source domain as a token embedder to generate contextual representations for all tokens. At inference, these static representations are simply used for nearest neighbor token classification. We also use a Viterbi decoder to capture label dependencies by leveraging tag transitions estimated from the source domain.
3.1 Nearest neighbor classification for few-shot NER
The backbone of StructShot is a simple token-level nearest neighbor classification system (NNShot). At inference, given a test example $\mathbf{x}$ and a $K$-shot entity support set $\mathcal{S}$, NNShot employs a token embedder $f_\theta$ to obtain contextual representations for all tokens in their respective sentences. NNShot simply computes a similarity score between a token $x$ in the test example and all tokens $x'$ in the support set, and assigns the token the tag of the most similar support token:

$$\hat{y} = \arg\min_{c} d_c(x), \qquad d_c(x) = \min_{x' \in \mathcal{S}_c} d\big(f_\theta(x), f_\theta(x')\big) \tag{1}$$

where $\mathcal{S}_c$ is the set of support tokens whose tags are $c$. In this work, we use the squared Euclidean distance, $d(u, v) = \lVert u - v \rVert_2^2$, for computing similarities between tokens in the nearest neighbor classification. We also perform L2-normalization on the features before computing these distances.
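To make this decision rule concrete, the following sketch implements Eq. 1 over pre-computed token embeddings; the function and tensor names are ours, not those of the released implementation.

```python
import torch
import torch.nn.functional as F

def nnshot_classify(test_emb, support_emb, support_tags):
    """Assign each test token the tag of its nearest support token (Eq. 1).

    test_emb:     (n, d) contextual embeddings of the test tokens
    support_emb:  (m, d) contextual embeddings of all support tokens
    support_tags: (m,)   integer tag ids of the support tokens
    """
    # L2-normalize features, then use squared Euclidean distances.
    test_emb = F.normalize(test_emb, dim=-1)
    support_emb = F.normalize(support_emb, dim=-1)
    dists = torch.cdist(test_emb, support_emb) ** 2  # (n, m)
    nearest = dists.argmin(dim=-1)                   # closest support token
    return support_tags[nearest]
```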
Pre-trained NER models as token embedders
Most meta-learning approaches (Snell et al., 2017; Hou et al., 2020) simulate the test-time setup during training. Hence, these approaches sample multiple support sets and test sets from the training data and learn representations that minimize the corresponding few-shot loss on the source domain. In this paper, we instead use a NER model trained on the source domain to learn token-level representations that minimize the supervised cross-entropy loss. Supervised NER models typically consist of a token embedder $f_\theta$ followed by a linear classifier $W \in \mathbb{R}^{d \times L}$, where $d$ is the token embedding size and $L$ is the number of tags.
We consider two popular neural architectures for our supervised NER model: a BiLSTM NER model (Lample et al., 2016) and a BERT-based NER model (Devlin et al., 2019); for the latter we fine-tune the cased BERT-base model. For training these models on the source domain, we follow the settings from their original papers. These models are trained to minimize the cross-entropy loss on the training data in the source domain. (If training data from more source domains is available, a similar multitask loss can be adopted.) At inference time, NNShot uses the BiLSTM and Transformer encoders just before the final linear classification layers as token embedders.
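As an illustration of this feature-extraction step, the sketch below pulls the encoder output that feeds the (discarded) classification layer out of a Hugging Face token classification model. The checkpoint name and label count are placeholders, and the released implementation may wire this differently.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder checkpoint: in our setting this would be a cased BERT-base
# model fine-tuned for supervised NER on the source domain.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=19)  # 19 is a hypothetical tag count
model.eval()

@torch.no_grad()
def embed_tokens(sentence: str) -> torch.Tensor:
    """Return contextual token representations from just before the classifier."""
    inputs = tokenizer(sentence, return_tensors="pt")
    # The last entry of `hidden_states` is the encoder output that feeds the
    # final linear classification layer, which NNShot discards.
    outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1].squeeze(0)  # (seq_len, hidden_size)
```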
3.2 Structured nearest neighbor learning
Conditional random fields (CRFs) (Lafferty et al., 2001) are the de facto method for modeling label dependencies in NER. Lample et al. (2016) use a BiLSTM embedder followed by a classification layer to produce token-tag emission scores, and learn tag-tag transition scores by jointly training a CRF layer. Adopting a similar method is challenging in the context of few-shot learning: the mismatch between the tags in the source domain and the target domain does not allow learning tag-tag transition scores of the target domain by training only on the source domain.

StructShot addresses this challenge by using an abstract tag transition distribution estimated on the source domain data. Additionally, StructShot discards the CRF training phase and only makes use of its Viterbi decoder during inference. In particular, similar to Hou et al. (2020), we utilize a transition matrix that captures transition probabilities between three abstract NER tags: O, I, and I-Other (we demonstrate the transitions with the IO tagging scheme and ignore the START and END tags for simplicity). For instance, $p(\mathrm{O} \to \mathrm{I})$ and $p(\mathrm{I} \to \mathrm{O})$ correspond to the transition probabilities between an entity tag and O, whereas $p(\mathrm{I} \to \mathrm{I})$ and $p(\mathrm{I} \to \mathrm{I\text{-}Other})$ correspond to the probabilities of transitioning from an entity tag to itself and to a different entity tag respectively. As depicted in Fig. 2, we can extend these abstract transition probabilities to an arbitrary target domain tag set by evenly distributing the abstract probability mass over the corresponding target transitions. This simple extension guarantees that the resulting target transition probabilities still form a valid distribution. In contrast, Hou et al. (2020) copy the abstract transition scores to multiple specific transitions, so the resulting target transition probabilities no longer correspond to a distribution.
The key idea in StructShot is that it estimates the abstract transition probabilities by counting how often each transition is observed in the source-domain training data. The transition probability from X to Y is

$$p(\mathrm{X} \to \mathrm{Y}) = \frac{N(\mathrm{X} \to \mathrm{Y})}{N(\ast \to \mathrm{Y})} \tag{2}$$

where $N(\mathrm{X} \to \mathrm{Y})$ and $N(\ast \to \mathrm{Y})$ are the frequencies of the transition from X to Y and the transition from any tag to Y respectively. In practice, these abstract transitions can also be drawn from a prior distribution given domain knowledge.
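The following sketch shows both steps: estimating the abstract O/I/I-Other transitions by counting (Eq. 2) and evenly distributing them over a target tag set (Fig. 2). The abstraction helper and bookkeeping reflect our reading of the text and are illustrative, not the released implementation.

```python
from collections import Counter

def estimate_abstract_transitions(tag_sequences):
    """Count abstract O / I / I-Other transitions on source tags (cf. Eq. 2)."""
    counts, totals = Counter(), Counter()
    for tags in tag_sequences:
        for prev, cur in zip(tags, tags[1:]):
            a_prev = "O" if prev == "O" else "I"
            if cur == "O":
                a_cur = "O"
            elif prev in ("O", cur):
                a_cur = "I"        # entering an entity, or staying inside one
            else:
                a_cur = "I-Other"  # switching to a different entity tag
            counts[(a_prev, a_cur)] += 1
            totals[a_cur] += 1     # N(* -> Y), the Eq. 2 denominator
    return {t: n / totals[t[1]] for t, n in counts.items()}

def expand_transitions(p, entity_tags):
    """Evenly distribute abstract probabilities over a concrete target tag set."""
    n = len(entity_tags)
    trans = {("O", "O"): p.get(("O", "O"), 0.0)}
    for e in entity_tags:
        trans[("O", e)] = p.get(("O", "I"), 0.0) / n  # O -> any entity, split
        trans[(e, "O")] = p.get(("I", "O"), 0.0)      # entity -> O
        trans[(e, e)] = p.get(("I", "I"), 0.0)        # stay in the same entity
        for e2 in entity_tags:
            if e2 != e:                               # switch entities, split
                trans[(e, e2)] = p.get(("I", "I-Other"), 0.0) / (n - 1)
    return trans
```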
For Viterbi inference, we obtain the emission probabilities for each token in the test example from NNShot, i.e., a softmax over the negative class-wise distances $d_c$ from Eq. 1:

$$p(y_t = c \mid \mathbf{x}) = \frac{\exp\big(-d_c(x_t)\big)}{\sum_{c'} \exp\big(-d_{c'}(x_t)\big)} \tag{3}$$

Given the abstract transition distribution $p(y' \mid y)$ and the emission distribution $p(y \mid \mathbf{x})$, we use a Viterbi decoder to solve the following structured inference problem:

$$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \prod_{t=1}^{T} p(y_t \mid \mathbf{x}) \times p(y_t \mid y_{t-1}) \tag{4}$$
As the emission and transition probabilities are estimated independently, we introduce a temperature hyper-parameter that re-normalizes the transition probabilities to align the emission and transition scores to a similar scale.
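Combining Eqs. 3 and 4, decoding reduces to a standard Viterbi dynamic program over the NNShot emissions. The sketch below works in log space and applies the temperature to the transition scores; all names are our own.

```python
import torch

def viterbi_decode(emissions, transitions, temperature=1.0):
    """Find the most likely tag sequence under Eq. 4.

    emissions:   (seq_len, n_tags) log emission probabilities from NNShot
    transitions: (n_tags, n_tags) transition probabilities, row = previous tag
    Illustrative sketch; the released decoder may differ in details.
    """
    # Temperature re-normalization puts the transition scores on a scale
    # comparable to the independently estimated emission scores.
    log_trans = torch.log(transitions.clamp_min(1e-12)) / temperature
    seq_len, n_tags = emissions.shape
    score = emissions[0].clone()          # best score ending in each tag
    backptr = []
    for t in range(1, seq_len):
        # score[i] + log_trans[i, j]: best way to reach tag j at step t.
        total = score.unsqueeze(1) + log_trans
        best, idx = total.max(dim=0)
        score = best + emissions[t]
        backptr.append(idx)
    # Follow the back-pointers from the best final tag.
    best_tag = int(score.argmax())
    path = [best_tag]
    for idx in reversed(backptr):
        best_tag = int(idx[best_tag])
        path.append(best_tag)
    return list(reversed(path))
```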
4 Experiments
In this section, we compare StructShot against existing methods on two few-shot NER scenarios: tag set extension and domain transfer. We adopt several benchmark NER corpora in different domains for the few-shot experiments. (When ready, the code will be published at https://github.com/asappresearch/structshot.)
4.1 Data
Table 1: Statistics of the NER datasets used in our experiments.

| Dataset | Domain | # Class | # Sent | # Entity |
|---|---|---|---|---|
| OntoNotes | General | 18 | 76,714 | 104,151 |
| CoNLL’03 | News | 4 | 20,744 | 35,089 |
| I2B2’14 | Medical | 23 | 140,817 | 29,233 |
| WNUT’17 | Social | 6 | 5,690 | 3,890 |
We experiment with standard NER datasets in four important domains: OntoNotes 5.0 (Weischedel et al., 2013) for General, CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) for News, I2B2 2014 (Stubbs and Uzuner, 2015) for Medical, and WNUT 2017 (Derczynski et al., 2017) for Social. To the best of our knowledge, these are the largest annotated NER corpora in their respective domains, and they are labeled with diverse and representative named entity types. Table 1 presents detailed statistics of these datasets. We use the OntoNotes train/development/test splits released for the CoNLL 2012 shared task (available at http://conll.cemantix.org/2012/data.html); the other datasets come with standard train/development/test splits.
4.2 Evaluation tasks
We evaluate few-shot NER systems on two real-world scenarios. For both scenarios, we experiment with one-shot and five-shot settings.
Tag set extension
Our first set of experiments is motivated by the fact that new types of entities often emerge in domains such as medical and social media. We therefore evaluate the performance of our systems on recognizing new entity types as they emerge in the source domain. We mimic this scenario by splitting the entity classes of a dataset into a source set and a target set. Specifically, we randomly split the eighteen entity classes of the OntoNotes dataset into three target entity class sets:
- Group A: {ORG, NORP, ORDINAL, WORK_OF_ART, QUANTITY, LAW}
- Group B: {GPE, CARDINAL, PERCENT, TIME, EVENT, LANGUAGE}
- Group C: {PERSON, DATE, MONEY, LOC, FAC, PRODUCT}
We evaluate our systems on each target entity set. For each experiment, we modify the training set by replacing all entity tags corresponding to the target test group with the O tag, so these target tags are never observed during training. Similarly, we modify the test set to only include annotations corresponding to the target test group, so that we evaluate our models only on tags unseen during training. As discussed in § 2, we sample multiple support sets from the development set to simulate the few-shot setting at test time.
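A hypothetical helper for this tag-masking step (assuming IO-style tags of the form I-CLASS, and not the authors' actual preprocessing script) might look like:

```python
def mask_target_tags(tag_sequence, target_classes):
    """Replace tags of held-out entity classes with O (tag set extension)."""
    return ["O" if tag != "O" and tag.split("-", 1)[1] in target_classes
            else tag
            for tag in tag_sequence]

# Example: Group A held out, so LAW mentions vanish from the training labels.
print(mask_target_tags(["O", "I-LAW", "I-PERSON"], {"ORG", "LAW"}))
# -> ['O', 'O', 'I-PERSON']
```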
Domain transfer
The second set of experiments addresses a common scenario: adapting a NER system to a novel domain. For our experiments, we use General (OntoNotes) as the source domain and test our models on the News (CoNLL), Medical (I2B2), and Social (WNUT) domains. We train our supervised NER models on the standard OntoNotes training set, whereas we evaluate the few-shot systems on the standard test sets of CoNLL, I2B2, and WNUT. The support sets are sampled from the corresponding development sets of the three corpora.
4.3 Experimental settings
We have provided details of our proposed evaluation setup for few-shot NER in § 2. We report the standard evaluation metric for NER: micro-averaged F1 score. For each experiment, we sample five support sets and report the mean and standard deviation of the corresponding F1 scores. In order to establish a comprehensive comparison with prior work, we also report episode evaluation results in the Appendix.
Competitive systems
We consider five competitive approaches in our experiments. We build BERT-based systems for all of the methods and BiLSTM-based systems for three of them. Prototypical Network (Snell et al., 2017) is a popular few-shot classification algorithm that has been adopted in most state-of-the-art (SOTA) few-shot NER systems (Fritzler et al., 2019). PrototypicalNet+P&D (Hou et al., 2020) improves upon Prototypical Network with a pair-wise embedding and a dependency transfer mechanism. (Hou et al. (2020) show that Matching Network (Vinyals et al., 2016) performs worse than Prototypical Network in their few-shot NER evaluation.) SimBERT is a nearest neighbor classifier based on the pre-trained BERT encoder without fine-tuning on any NER data. Finally, we include our proposed NNShot and StructShot described in § 3. We use the IO tagging scheme in all experiments, as we find that it performs much better than the BIO scheme for all of the considered methods.
[Table 2: One-shot F1 scores on the tag set extension task (OntoNotes Groups A–C, plus average) and the domain transfer task (CoNLL, I2B2, WNUT, plus average), for BiLSTM-based systems (Prototypical Network, NNShot, StructShot) and BERT-based systems (SimBERT, Prototypical Network, PrototypicalNet+P&D, NNShot, StructShot). Only two cells survive in this extraction, both in the BERT-based StructShot row: 27.9 (tag set extension average) and 36.6 (domain transfer average).]
[Table 3: Five-shot F1 scores for the same systems and tasks as Table 2. Only the BERT-based StructShot averages survive in this extraction: 52.9 (tag set extension) and 44.7 (domain transfer).]
Parameter tuning
We adopt the best hyperparameter values reported by Yang et al. (2018) for the BiLSTM-NER models and use the default BERT hyperparameter values provided by Hugging Face (https://huggingface.co/). Specifically, our BiLSTM-NER models adopt a one-layer word-level BiLSTM and a one-layer character-level unidirectional LSTM; the LSTM hidden sizes and input embedding sizes for the character-level and word-level models follow Yang et al. (2018). We use pre-trained GloVe vectors (Pennington et al., 2014) to initialize the word embeddings for all BiLSTM-NER models. SGD and Adam (Kingma and Ba, 2014) are used to optimize the BiLSTM-based and BERT-based models respectively. We tune the other parameters required by the different few-shot learning methods on the source domain development sets; the transition re-normalizing temperature is chosen from a small set of candidate values.
4.4 Results
The results for one-shot NER and five-shot NER are summarized in Table 2 and Table 3 respectively. As shown, our NNShot and StructShot perform significantly better than all previous methods across all evaluation settings. By modeling label dependencies with a simple Viterbi decoder, StructShot further boosts the performance of NNShot on both the five-shot tag set extension and domain transfer tasks. These gains are greater than the ones obtained by joint CRF training with the prototypical network (PrototypicalNet+P&D), suggesting that independently modeling transition and emission scores is a cheap but effective way to capture label dependencies. StructShot achieves new SOTA results on the two few-shot NER tasks, outperforming the previous SOTA system (PrototypicalNet+P&D) by wide absolute F1 margins in both the one-shot and five-shot settings.
BiLSTM vs. BERT as token embedder
The BERT-based systems considerably outperform the BiLSTM-based systems on few-shot NER. Language model pre-training is critical for low-resource natural language processing, including few-shot transfer learning (Cherry et al., 2019). However, task-specific knowledge is usually more important than the general information learned via unsupervised training. For example, the top-performing BiLSTM-based systems can beat SimBERT by sizable F1 margins in some few-shot NER settings. With fine-tuning on the OntoNotes data, NNShot outperforms SimBERT by clear F1 margins across different settings, demonstrating the effectiveness of injecting task-specific information into pre-trained language models.
Tag set extension vs. Domain transfer
The one-shot NER systems generally perform better on domain transfer than on tag set extension, while the five-shot systems work better on the tag set extension task. On the domain transfer task, the source entity classes overlap with some entity classes in the target domain, which benefits NER systems built under the extremely low-resource condition. However, in general, domain transfer is more challenging than tag set extension due to language variation across different domains. Not surprisingly, our five-shot NER systems are not only more accurate but also more robust than the one-shot systems. The standard deviations reported with multiple five-shot support sets are much lower than those obtained with one-shot support sets. This indicates that we can build more reliable few-shot NER systems given more few-shot examples in the support sets.
Episode evaluation
Finally, as shown in the Appendix, the results obtained with episode evaluation are generally better than the ones reported with our proposed evaluation setup. However, the performance trend is the same: StructShot significantly outperforms all competitors. This implies that previous studies (Fritzler et al., 2019; Hou et al., 2020) overestimate the performance of their few-shot NER systems.
Few-shot NER in practice
Although the average F1 scores of the few-shot NER systems are relatively low, we believe that few-shot NER systems are still very useful in practice. First, the few-shot NER results are reasonably good if the source and target domains are close to each other. For example, the five-shot NER system trained on the OntoNotes training set can achieve 75% F1 score on the CoNLL test set. Second, given the few-shot NER system, we are able to provide immediate support to emerging entity types without retraining and redeploying the NER model. At the same time, a more accurate NER model can be trained in parallel after collecting sufficient annotations for the new types.
4.5 Analysis
We perform analysis to investigate the impact of various tagging schemes and BERT fine-tuning objectives on few-shot NER.
Tagging scheme
When only a few entities are available in the support sets, the conventional BIO tagging scheme can harm the performance of few-shot NER systems, as it further reduces the number of labeled instances per tag class. We experiment with both the BIO and IO tagging schemes for all of the few-shot NER models. The systems equipped with the IO tagging scheme consistently outperform those with the BIO scheme; in particular, both StructShot and NNShot benefit from switching from BIO to IO on the five-shot tag set extension and domain transfer tasks.
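For reference, collapsing BIO tags into IO tags is a one-line transformation; the helper below is our own illustration rather than the authors' script.

```python
def bio_to_io(tags):
    """Map B-X and I-X both to I-X, merging the two labels per entity class."""
    return ["I-" + t[2:] if t.startswith(("B-", "I-")) else t for t in tags]

print(bio_to_io(["B-PER", "I-PER", "O", "B-ORG"]))
# -> ['I-PER', 'I-PER', 'O', 'I-ORG']
```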
Fine-tuning objective
StructShot uses the standard cross-entropy loss for NER from the original paper (Devlin et al., 2019) to fine-tune BERT on the OntoNotes data. We also experiment with fine-tuning BERT using the prototypical network objective and then using that encoder in StructShot. The results show that BERT fine-tuned with the standard NER loss performs much better than the one fine-tuned with the prototypical network loss on both the five-shot tag set extension and domain transfer tasks. This suggests that the popular meta-learning methods fall short of capturing effective representations for the few-shot NER task.
5 Discussion
In this section, we investigate two questions: 1) why is StructShot so effective? and 2) why is few-shot NER so difficult?

t-SNE visualization
We project token-level representations obtained from the BERT embedders onto a 2-dimensional space using t-SNE (Maaten and Hinton, 2008). Fig. 3 presents the visualization results on the CoNLL and WNUT test sets (we exclude I2B2 as it includes too many classes to visualize). Fine-tuning BERT on OntoNotes clearly improves task-awareness on both the CoNLL and WNUT datasets: instances of the same class lie much closer together than those obtained from the non-fine-tuned BERT model. The separation of entity classes is more evident on CoNLL due to its greater tag set overlap with OntoNotes. Instances labeled with O are spread across the space regardless of fine-tuning. This explains the effectiveness of StructShot. First, fine-tuning BERT in a conventional NER setting learns a good entity-specific metric space. Second, a nearest neighbor classifier, which emphasizes local distances, is more appropriate for assigning O to an instance.
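The visualization step itself is a routine t-SNE projection of the token embeddings; the sketch below uses scikit-learn defaults rather than the (unreported) hyper-parameters behind Fig. 3.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_token_space(token_reps, labels, title):
    """Project (n_tokens, hidden) representations to 2-D and color by class."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(token_reps)
    for cls in sorted(set(labels)):
        mask = np.array([lab == cls for lab in labels])
        plt.scatter(coords[mask, 0], coords[mask, 1], s=4, label=cls)
    plt.legend(markerscale=3)
    plt.title(title)
    plt.show()
```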
Table 4: Per-class F1 scores of the best five-shot StructShot systems on the domain transfer task.

| I2B2 class | F1 | # Entity | WNUT class | F1 | # Entity |
|---|---|---|---|---|---|
| DATE | 66.1 | 4,983 | person | 57.8 | 429 |
| CITY | 61.1 | 260 | loc. | 49.6 | 150 |
| DOCTOR | 51.1 | 1,945 | c-work | 30.1 | 142 |
| AGE | 29.3 | 768 | corp. | 16.7 | 66 |
| MED-REC | 14.4 | 428 | product | 10.9 | 127 |
| PATIENT | 14.0 | 920 | group | 5.0 | 165 |
| HOSPITAL | 10.1 | 874 | – | – | – |
| PHONE | 7.9 | 224 | – | – | – |
| IDNUM | 7.0 | 201 | – | – | – |
Per-class performance analysis
We attempt to shed some light on the second question by analyzing outputs from the best five-shot StructShot systems on the domain transfer task. The per-class F1 scores are shown in Table 4, where we exclude I2B2 classes with fewer than 200 instances in the test set. StructShot achieves reasonable performance on less ambiguous entity classes such as DATE, CITY, person, and location. However, it struggles to distinguish between highly ambiguous classes. For example, AGE, MEDICAL-RECORD, PHONE, and IDNUM are all numbers, and it is still challenging for our system to differentiate such numerical types without any domain-specific knowledge. Similarly, StructShot often predicts a PATIENT entity as DOCTOR, and it nearly always assigns the corporation label to entities of the group class. We believe that domain-specific cues like ‘Dr.’ and ‘MD.’ can be useful in resolving these ambiguities and enabling few-shot NER systems to generalize better.
6 Related Work
Meta learning
Meta learning is widely studied in the computer vision community, as the low-level features in images are transferable across classes, which enables learning from only a few examples of an unseen class. The existing approaches (Snell et al., 2017; Vinyals et al., 2016) typically focus on metric learning. Snell et al. (2017) learn a prototype representation for each class and classify test points based on the nearest prototypes. Vinyals et al. (2016) compute support-set-aware similarities between a test point and the target classes. These methods have been adapted with some success to NLP tasks including text classification (Yu et al., 2018; Geng et al., 2019; Bao et al., 2020), machine translation (Gu et al., 2018), and relation classification (Han et al., 2018). Recently, Wang et al. (2019) showed that simple feature transformations followed by nearest neighbor search can perform competitively with state-of-the-art meta-learning methods on standard computer vision classification datasets. Inspired by this approach, we evaluate the performance of nearest neighbor based classification against meta-learning methods.
Few-shot NER
A few approaches have been proposed for few-shot NER. Hofer et al. (2018) explore different pre-training and fine-tuning strategies for recognizing entities in medical text with a few examples. Fritzler et al. (2019) and Hou et al. (2020) exploit popular few-shot classification methods such as prototypical networks and matching networks, where Hou et al. (2020) additionally learn transition scores jointly, which improves performance. These approaches require complex episode training yet achieve unsatisfactory results. StructShot does not require meta-training: with a simple nearest neighbor classifier and a structured decoder, it is much more accurate than existing meta-learning based systems.
7 Conclusion
We introduce StructShot, a simple few-shot NER system that achieves SOTA performance without any few-shot specific training. We identify two weaknesses of previous systems: their handling of the O class and their modeling of label dependencies. Our system overcomes these challenges with nearest neighbor learning and structured decoding. We further propose a standard evaluation setup for few-shot NER and show that StructShot significantly outperforms prior SOTA systems on popular benchmarks across multiple domains. In the future, we plan to extend our system to other few-shot sequence tagging problems such as part-of-speech tagging and slot filling.
Acknowledgments
We thank the EMNLP reviewers for their helpful feedback. We also thank the ASAPP NLP team for their support throughout the project.
References
- Bao et al. (2020) Yujia Bao, Menghua Wu, Shiyu Chang, and Regina Barzilay. 2020. Few-shot text classification with distributional signatures. In International Conference on Learning Representations.
- Cherry et al. (2019) Colin Cherry, Greg Durrett, George Foster, Reza Haffari, Shahram Khadivi, Nanyun Peng, Xiang Ren, and Swabha Swayamdipta. 2019. Proceedings of the 2nd workshop on deep learning approaches for low-resource nlp (deeplo 2019). In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019).
- Derczynski et al. (2017) Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
- Forney (1973) G. David Forney. 1973. The Viterbi algorithm. Proceedings of the IEEE.
- Fritzler et al. (2019) Alexander Fritzler, Varvara Logacheva, and Maksim Kretov. 2019. Few-shot classification in named entity recognition task. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing.
- Geng et al. (2019) Ruiying Geng, Binhua Li, Yongbin Li, Xiaodan Zhu, Ping Jian, and Jian Sun. 2019. Induction networks for few-shot text classification. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
- Gu et al. (2018) Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
- Han et al. (2018) Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. Fewrel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
- Hofer et al. (2018) Maximilian Hofer, Andrey Kormilitzin, Paul Goldberg, and Alejo Nevado-Holgado. 2018. Few-shot learning for named entity recognition in medical text. arXiv preprint arXiv:1811.05468.
- Hou et al. (2020) Yutai Hou, Wanxiang Che, Yongkui Lai, Zhihan Zhou, Yijia Liu, Han Liu, and Ting Liu. 2020. Few-shot slot tagging with collapsed dependency transfer and label-enhanced task-adaptive projection network. In Proceedings of the Association for Computational Linguistics (ACL).
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML).
- Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
- Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Neural Information Processing Systems (NIPS).
- Stubbs and Uzuner (2015) Amber Stubbs and Özlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of Biomedical Informatics.
- Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.
- Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In Neural Information Processing Systems (NIPS).
- Wang et al. (2019) Yan Wang, Wei-Lun Chao, Kilian Q. Weinberger, and Laurens van der Maaten. 2019. Simpleshot: Revisiting nearest-neighbor classification for few-shot learning. CoRR, abs/1911.04623.
- Weischedel et al. (2013) Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA.
- Yang et al. (2018) Jie Yang, Shuailong Liang, and Yue Zhang. 2018. Design challenges and misconceptions in neural sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics (COLING).
- Yu et al. (2018) Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse few-shot text classification with multiple metrics. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Appendix A Appendix: Episode Evaluation Results
The main episode evaluation results for one-shot NER and five-shot NER are summarized in Table 5 and Table 6 respectively. We sample multiple evaluation episodes for each experiment. The performance trend matches our main results: StructShot significantly outperforms all competitors.
[Table 5: Episode evaluation, one-shot F1 scores for the same systems and tasks as Table 2. Only the BERT-based StructShot averages survive in this extraction: 30.8 (tag set extension) and 40.4 (domain transfer).]
[Table 6: Episode evaluation, five-shot F1 scores for the same systems and tasks as Table 2. Only the BERT-based StructShot averages survive in this extraction: 51.1 (tag set extension) and 50.6 (domain transfer).]