
Simple and Effective Few-Shot Named Entity Recognition
with Structured Nearest Neighbor Learning

Yi Yang
ASAPP Inc.
New York, NY 10007
[email protected]
&Arzoo Katiyar
Pennsylvania State University
University Park, PA 16802
[email protected]
  Work done at ASAPP Inc.
Abstract

We present a simple few-shot named entity recognition (NER) system based on nearest neighbor learning and structured inference. Our system uses a supervised NER model trained on the source domain as a feature extractor. Across several test domains, we show that a nearest neighbor classifier in this feature space is far more effective than the standard meta-learning approaches. We further propose a cheap but effective method to capture the label dependencies between entity tags without expensive CRF training. We show that our method of combining structured decoding with nearest neighbor learning achieves state-of-the-art performance on standard few-shot NER evaluation tasks, improving F1 scores by 6% to 16% absolute over prior meta-learning based systems.

1 Introduction

Named entity recognition (NER) aims at identifying and categorizing spans of text into a closed set of classes, such as people, organizations, and locations. As a core language understanding task, NER is widely adopted in several domains, such as news Tjong Kim Sang and De Meulder (2003), medical Stubbs and Uzuner (2015), and social media Derczynski et al. (2017). However, one of the primary challenges in adapting NER to new domains is the mismatch between the different domain-specific entity types. For example, only two out of the twenty-three entity types annotated in the I2B2 2014 Stubbs and Uzuner (2015) data can be found in the OntoNotes 5 Weischedel et al. (2013) annotations. Unfortunately, obtaining NER annotations for a novel domain can be quite expensive, often requiring domain knowledge.

Few-shot classification models Vinyals et al. (2016); Bao et al. (2020) aim at recognizing new classes based on only a few labeled examples (the support set) from each class. In the context of NER, these few-shot classification methods can enable rapid building of NER systems for a new domain by labeling only a few examples per entity class. Several previous studies (Fritzler et al., 2019; Hou et al., 2020) propose using prototypical networks Snell et al. (2017), a popular few-shot classification algorithm, to address the few-shot NER problem. However, these approaches achieve only 10% to 30% F1 scores on average when transferring knowledge between different NER datasets with one-shot or five-shot examples, warranting more effective methods for the problem.

Figure 1: A few-shot NER example. Professions (e.g., ‘minister’ and ‘president’) and dates (e.g., ‘today’ and ‘tomorrow’) are part of the O class. A nearest neighbor classifier, which uses an instance-based metric, is better at predicting the O class than methods that use a class-based metric.

The direct adaptation of existing few-shot classification methods to few-shot NER is challenging for two reasons. First, NER is essentially a structured learning problem. It is crucial to model label dependencies, as shown in Lample et al. (2016), rather than classifying each token independently as existing few-shot classification approaches do. Second, few-shot classification models Snell et al. (2017) typically learn to represent each semantic class by a prototype based on the labeled examples in its support set. However, for NER, unlike the entity classes, the Outside (O) class does not represent any unified semantic meaning. In fact, tokens labeled with O in a dataset correspond to different semantic spaces that should be separately represented in a metric-based learning framework. For example, in Fig. 1, semantic classes such as professions (e.g., ‘minister’) and dates (e.g., ‘today’) may also belong to the O class in some NER datasets. Thus, previous approaches end up learning a noisy prototype for representing the O class in this low-resource setting.

In this paper, we propose a simple yet effective method, StructShot, for few-shot NER. Instead of learning a prototype for each entity class, we represent each token in the labeled examples of the support set by its contextual representation in the sentence. We learn these contextual representations by training a standard supervised NER model Lample et al. (2016); Devlin et al. (2019) on the source domain. Whereas meta-learning approaches Snell et al. (2017); Vinyals et al. (2016) simulate the few-shot evaluation setup during training, our approach does not need to do so. This makes it possible to deploy a unified NER system supporting both classical and emerging types of entities, without the overhead of maintaining a separate few-shot system. During evaluation, StructShot uses a nearest neighbor (NN) classifier and a Viterbi decoder for prediction. As shown in Fig. 1, for each token (“president”) in the target example, the NN classifier finds its nearest token (“minister”) from the support examples, instead of relying on an erroneous class-level (Outside) prototypical representation. We further improve our nearest neighbor predictions by using a Viterbi decoder Forney (1973) to capture label dependencies.

We perform extensive in-domain and out-of-domain experiments for this problem. We test our systems on identifying new types of entities both in the source domain and in various target domains, in one-shot and five-shot settings. In addition to the previous evaluation setup followed by Hou et al. (2020), we propose a more standard and reproducible evaluation setup for few-shot NER by using standard test sets and development sets from benchmark datasets of several domains. In particular, we sample support sets from the standard development set and evaluate our models on the standard test set. For all our experiments, we find that our proposed systems outperform previous meta-learning systems by 6% to 16% absolute F1 score.

2 Problem Statement and Setup

In this section, we formalize the task of few-shot NER and propose a standard evaluation setup to facilitate meaningful comparison of results for future research.

2.1 Few-shot NER

NER is a sequence labeling task, where each token in a sentence is either labeled as part of an entity class (e.g., Person, Location, and Organization) or as the O class if it does not belong to an entity. In practice, tagging schemes such as BIO or IO are adopted to indicate whether a token is at the beginning (B-X) or inside (I-X) of an entity X. Few-shot NER focuses on a specific NER setting where a system is trained on annotations of one or more source domains $\{\mathcal{D}_{\mathcal{S}}^{(i)}\}$ and then tested on one or more target domains $\{\mathcal{D}_{\mathcal{T}}^{(i)}\}$ given only a few labeled examples per entity class. It is a challenging problem since the target tag set $\mathcal{C}_{\mathcal{T}}^{(i)}$ can be different from any source tag set $\mathcal{C}_{\mathcal{S}}^{(j)}$. To this end, few-shot NER systems need to learn to generalize to unseen entity classes using only a few labeled examples.

Formally, the task of K-shot NER is defined as follows: given an input sentence $\mathbf{x}=\{x_{t}\}_{t=1}^{T}$ and a K-shot support set for the target tag set $\mathcal{C}_{\mathcal{T}}$, find the best tag sequence $\mathbf{y}=\{y_{t}\}_{t=1}^{T}$ for $\mathbf{x}$. The K-shot support set contains K entity examples (not tokens) for each entity class in $\mathcal{C}_{\mathcal{T}}$.

2.2 A standard evaluation setup

Prior work Fritzler et al. (2019); Hou et al. (2020) on few-shot NER followed the few-shot classification literature and adopted the episode evaluation methodology. Specifically, a NER system is evaluated with respect to multiple evaluation episodes. An episode includes a sampled K-shot support set of labeled examples and a few sampled K-shot test sets. In addition to these prior practices, we propose a more realistic evaluation setting by sampling only the support sets and testing the model on the standard test sets from NER benchmarks.

Test set construction

In the episode evaluation setting, test sets are sampled such that the different entity classes are equally distributed. This evaluation setup clearly does not account for the entity distributions in the real data (in the I2B2 test data, the more frequent DATE entity occurs 4,983 times, whereas the less frequent EMAIL entity occurs only once). As a result, the reported performance scores do not reflect the effectiveness of these models when adapting to a new domain. We propose to use the original test sets of the standard NER datasets to evaluate the performance of our models. Our evaluation setup does not need to randomly sample test sets, thus improving its reproducibility for future research.

Support set construction

In order to test our models in the few-shot setting, we sample support sets from the standard development set of the benchmark dataset. We account for the variance of our model performance by sampling multiple support sets and reporting the average performance on the test set for these sampled support sets. We plan to release the different support sets used for evaluation in our experiments for reproducibility.

Unlike classification tasks, a sentence in NER may contain multiple entity classes. Thus, simply sampling K sentences for each entity class will result in many more entities of frequent classes than those of less frequent classes, as sampling entities of infrequent classes is more likely to also bring in entities of frequent classes than the other way around. Because of this, we utilize a greedy sampling strategy to build support sets, as shown in Alg. 1. In particular, we sample sentences for entity classes in increasing order of their frequencies.

0:  Input: number of shots K, labeled set 𝐗 with tag set 𝒞
1:  Sort the classes in 𝒞 in increasing order of their frequency in 𝐗
2:  𝒮 ← ∅  // Initialize the support set
3:  Count_i ← 0 for all 𝒞_i ∈ 𝒞  // Initialize counts of entity classes in 𝒮
4:  for i = 1 to |𝒞| do
5:     while Count_i < K do
6:        Sample (𝐱, 𝐲) ∈ 𝐗 s.t. 𝒞_i ∈ 𝐲, without replacement
7:        𝒮 ← 𝒮 ∪ {(𝐱, 𝐲)}
8:        Update Count_j for all 𝒞_j ∈ 𝐲
9:     end while
10:  end for
11:  return 𝒮
Algorithm 1 Greedy sampling
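As a concrete illustration, the sketch below implements this greedy sampling in Python. It assumes IO-tagged sentences represented as (tokens, tags) pairs; the function and variable names are ours, not part of any released code.

```python
import random
from collections import Counter

def entity_classes(tags):
    """Yield the class of each entity span (IO scheme: a maximal run of I-X tokens)."""
    prev = "O"
    for tag in tags:
        if tag != "O" and tag != prev:
            yield tag[2:]          # "I-PER" -> "PER"
        prev = tag

def greedy_sample_support_set(labeled_set, classes, k, seed=0):
    """Greedily sample a K-shot support set (Alg. 1).

    labeled_set: list of (tokens, tags) pairs from the development set.
    classes:     list of entity classes (without O).
    k:           required number of entities per class.
    """
    rng = random.Random(seed)
    # Sort classes by frequency in the labeled set, rarest first.
    freq = Counter(c for _, tags in labeled_set for c in entity_classes(tags))
    classes = sorted(classes, key=lambda c: freq[c])

    remaining = list(labeled_set)
    rng.shuffle(remaining)                   # popping the first match is then a random draw
    support, counts = [], Counter()
    for c in classes:
        while counts[c] < k:
            # Find a not-yet-used sentence that contains an entity of class c.
            idx = next((i for i, (_, tags) in enumerate(remaining)
                        if c in set(entity_classes(tags))), None)
            if idx is None:
                break                        # fewer than k entities of class c are available
            tokens, tags = remaining.pop(idx)    # sample without replacement
            support.append((tokens, tags))
            counts.update(entity_classes(tags))  # update counts for every class in the sentence
    return support
```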

3 Model

In this section, we present our few-shot NER algorithm based on structured nearest neighbor learning (StructShot). Our method uses a NER model Lample et al. (2016); Devlin et al. (2019) trained on the source domain, as a token embedder to generate contextual representations for all tokens. At inference, these static representations are simply used for nearest neighbor token classification. We also use a Viterbi decoder to capture label dependencies by leveraging tag transitions estimated from the source domain.

3.1 Nearest neighbor classification for few-shot NER

The backbone of StructShot is a simple token-level nearest neighbor classification system (NNShot). At inference, given a test example $\mathbf{x}=\{x_{t}\}_{t=1}^{T}$ and a K-shot entity support set $\mathcal{S}=\{(\mathbf{x}_{n}^{(sup)},\mathbf{y}_{n}^{(sup)})\}_{n=1}^{N}$ comprising N sentences, NNShot employs a token embedder $f_{\theta}(x)=\hat{x}$ to obtain contextual representations for all tokens in their respective sentences. NNShot simply computes a similarity score between a token $x$ in the test example and all tokens $\{x^{\prime}\}$ in the support set. It assigns the token $x$ the tag $c$ corresponding to the most similar token in the support set:

y^{*} = \arg\min_{c\in\{1,\cdots,C\}} d_{c}(\hat{x})    (1)
d_{c}(\hat{x}) = \min_{x^{\prime}\in\mathcal{S}_{c}} d(\hat{x},\hat{x}^{\prime}),

where $\mathcal{S}_{c}$ is the set of support tokens whose tags are $c$. In this work, we use the squared Euclidean distance, $d(\hat{x},\hat{x}^{\prime})=\|\hat{x}-\hat{x}^{\prime}\|_{2}^{2}$, for computing similarities between tokens in the nearest neighbor classification. We also perform L2-normalization on the features before computing these distances.
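The nearest neighbor step of Eq. (1) amounts to an argmin over pairwise distances between L2-normalized token embeddings. A minimal PyTorch sketch follows; the embeddings are assumed to come from the token embedder described below.

```python
import torch
import torch.nn.functional as F

def nnshot_distances(query_emb, support_emb):
    """Squared Euclidean distances between L2-normalized token embeddings.

    query_emb:   (T, D) embeddings of the T test-sentence tokens.
    support_emb: (M, D) embeddings of all M support-set tokens.
    Returns a (T, M) distance matrix.
    """
    q = F.normalize(query_emb, dim=-1)
    s = F.normalize(support_emb, dim=-1)
    return torch.cdist(q, s, p=2) ** 2

def nnshot_predict(query_emb, support_emb, support_tags):
    """Assign each test token the tag of its nearest support token (Eq. 1)."""
    dist = nnshot_distances(query_emb, support_emb)   # (T, M)
    nearest = dist.argmin(dim=-1)                     # index of the closest support token
    return [support_tags[i] for i in nearest.tolist()]
```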

Pre-trained NER models as token embedders

Most meta-learning approaches Snell et al. (2017); Hou et al. (2020) simulate the test time setup during training. Hence, these approaches sample multiple support sets and test sets from the training data and learn representations to minimize their corresponding few-shot loss on the source domain. In this paper, we instead use a NER model trained on the source domain to learn token-level representations that minimize the supervised cross-entropy loss. Supervised NER models typically consist of a token embedder $f_{\theta}(\cdot)$ followed by a linear classifier $\mathbf{W}\in\mathbb{R}^{D\times L}$, where $D$ is the token embedding size and $L$ is the number of tags.

We consider two popular neural architectures for our supervised NER model: a BiLSTM NER model Lample et al. (2016) and a BERT-based NER model Devlin et al. (2019) (we fine-tune the cased BERT-base model). For training these models on the source domain, we follow the settings from their original papers. These models are trained to minimize the cross-entropy loss $\ell(\mathbf{W}f_{\theta}(x),y)$ on the training data in the source domain (if training data from more source domains is available, a similar multitask loss can be adopted). At inference time, NNShot uses the BiLSTM and Transformer encoders just before the final linear classification layers as token embedders.
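For illustration, the snippet below shows how a fine-tuned BERT encoder can be reused as a frozen token embedder at inference time, using the Hugging Face transformers API. Pooling the first sub-token per word is one common choice and may differ from the authors' implementation; loading the plain pre-trained checkpoint here is only to keep the sketch self-contained.

```python
import torch
from transformers import BertModel, BertTokenizerFast

# In practice this would be the encoder of the NER model fine-tuned on the source domain.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
encoder = BertModel.from_pretrained("bert-base-cased")
encoder.eval()

@torch.no_grad()
def embed_tokens(words):
    """Return one contextual vector per input word (first sub-token pooling)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state[0]       # (num_subtokens, D)
    first_subtoken, seen = [], set()
    for idx, word_id in enumerate(enc.word_ids()):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            first_subtoken.append(idx)
    return hidden[first_subtoken]                      # (num_words, D)
```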

3.2 Structured nearest neighbor learning

The conditional random field (CRF) Lafferty et al. (2001) is the de facto method to model label dependencies for NER. Lample et al. (2016) use a BiLSTM embedder followed by a classification layer to produce token-tag emission scores and learn tag-tag transition scores by jointly training a CRF layer. Adopting a similar method is challenging in the context of few-shot learning: the mismatch between the tags in the source domain and the target domain does not allow learning tag-tag transition scores of the target domain by training on the source domain alone.

Figure 2: A depiction of the extension of an abstract transition matrix. An abstract transition probability is evenly split into related target transitions, which is illustrated using the cells of the same color in their corresponding rows of the two matrices.

StructShot addresses this challenge by using an abstract tag transition distribution estimated on the source domain data. Additionally, StructShot discards the CRF training phase and only makes use of its Viterbi decoder during inference. In particular, similar to Hou et al. (2020), we utilize a transition matrix that captures transition probabilities between three abstract NER tags: O, I, and I-Other (we describe the transitions with the IO tagging scheme and ignore the START and END tags for simplicity). For instance, p(O|I) and p(I|O) correspond to the transition probabilities between an entity tag and O, whereas p(I|I) and p(I-Other|I) correspond to the probabilities of transitioning from an entity tag to itself and to a different entity tag, respectively. As depicted in Fig. 2, we can extend these abstract transition probabilities to an arbitrary target domain tag set by evenly distributing each abstract transition probability over the corresponding target transitions. Our simple extension method guarantees that the resulting target transition probabilities still form a valid distribution. In contrast, Hou et al. (2020) copy these abstract transition scores to multiple specific transitions, so the resulting target transition probabilities no longer correspond to a distribution.

The key idea in StructShot is that it estimates the abstract transition probabilities by counting the number of times a particular transition was observed in the training data. The transition probability from X to Y is

p(\texttt{Y}|\texttt{X}) = \frac{N(\texttt{X}\rightarrow\texttt{Y})}{N(\cdot\rightarrow\texttt{Y})},    (2)

where $N(\texttt{X}\rightarrow\texttt{Y})$ and $N(\cdot\rightarrow\texttt{Y})$ are the frequencies of the transition from X to Y and of the transition from any tag to Y, respectively. In practice, these abstract transitions can also be drawn from a prior distribution given domain knowledge.
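To make the even-split extension of Fig. 2 concrete, the sketch below expands a given abstract transition table to a target IO tag set. The dictionary keys and helper name are illustrative, and the abstract probabilities themselves are assumed to have been estimated via Eq. (2).

```python
def expand_transitions(abstract, target_classes):
    """Expand abstract transition probabilities to a target IO tag set (Fig. 2).

    abstract: dict over the abstract tags, e.g.
              {("O","O"): p, ("O","I"): p, ("I","O"): p, ("I","I"): p, ("I","I-Other"): p}.
    target_classes: list of target entity classes, e.g. ["PER", "LOC", "ORG"].
    Returns a dict mapping (src_tag, dst_tag) over the target tags to probabilities.
    """
    entity_tags = [f"I-{c}" for c in target_classes]
    n = len(entity_tags)
    trans = {("O", "O"): abstract[("O", "O")]}
    # From O: enter any of the n entity tags, splitting p(I|O) evenly.
    for t in entity_tags:
        trans[("O", t)] = abstract[("O", "I")] / n
    # From an entity tag: leave to O, stay in the same entity,
    # or switch to one of the other n-1 entity tags, splitting p(I-Other|I) evenly.
    for s in entity_tags:
        trans[(s, "O")] = abstract[("I", "O")]
        trans[(s, s)] = abstract[("I", "I")]
        for t in entity_tags:
            if t != s:
                trans[(s, t)] = abstract[("I", "I-Other")] / max(n - 1, 1)
    return trans
```

Each target row sums to 1 whenever the corresponding abstract row does, which is what makes the extension a valid distribution.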

For Viterbi inference, we obtain the emission probabilities $p(y=c|x)$ for each token in the test example from NNShot:

p(y=c|x) = \frac{e^{-d_{c}(\hat{x})}}{\sum_{c^{\prime}}e^{-d_{c^{\prime}}(\hat{x})}}.    (3)

Given this abstract transition distribution $p(y^{\prime}|y)$ and the emission distribution $p(y|x)$, we use the Viterbi decoder to solve the following structured inference problem:

\mathbf{y}^{*} = \arg\max_{\mathbf{y}}\prod_{t=1}^{T}p(y_{t}|x)\times p(y_{t}|y_{t-1}).    (4)

As the emission and transition probabilities are estimated independently, we introduce a temperature hyper-parameter $\tau$ that re-normalizes the transition probabilities to align the emission and transition scores to a similar scale.
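Putting Eqs. (3) and (4) together, the sketch below shows one way to run the Viterbi decoder over the NNShot distances in log space. The exact form of the temperature re-normalization is our assumption (raising the transition probabilities to the power τ and re-normalizing); the text above only specifies that τ rescales the transitions.

```python
import torch
import torch.nn.functional as F

def structshot_decode(class_distances, transitions, tau=0.01):
    """Viterbi decoding for StructShot (Eqs. 3 and 4), in log space.

    class_distances: (T, C) tensor with d_c(x_t), the distance from token t to its
                     nearest support token of class c (from NNShot).
    transitions:     (C, C) tensor with transitions[i, j] = p(y_t = j | y_{t-1} = i),
                     obtained by expanding the abstract transition matrix.
    tau:             temperature applied to the transition probabilities.
    Returns the best tag sequence as a list of class indices.
    """
    log_emit = F.log_softmax(-class_distances, dim=-1)                 # Eq. (3)
    # One possible re-normalization: p^tau, re-normalized per source tag (assumption).
    log_trans = F.log_softmax(tau * torch.log(transitions), dim=-1)

    T, C = log_emit.shape
    score = log_emit[0].clone()
    backptr = []
    for t in range(1, T):
        # total[i, j] = best score ending in tag i at t-1, then moving to tag j at t.
        total = score.unsqueeze(1) + log_trans + log_emit[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backptr.append(best_prev)
    # Trace back from the best final tag.
    path = [int(score.argmax())]
    for best_prev in reversed(backptr):
        path.append(int(best_prev[path[-1]]))
    return path[::-1]
```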

4 Experiments

In this section, we compare StructShot against existing methods on two few-shot NER scenarios: tag set extension and domain transfer. We adopt several benchmark NER corpora in different domains for the few-shot experiments. (When ready, the code will be published at https://github.com/asappresearch/structshot.)

4.1 Data

Dataset Domain # Class # Sent # Entity
OntoNotes General 18 76,714 104,151
CoNLL’03 News 4 20,744 35,089
I2B2’14 Medical 23 140,817 29,233
WNUT’17 Social 6 5,690 3,890
Table 1: Data statistics. # Class corresponds to the number of entity classes labeled in a dataset.

We experiment with standard NER datasets in four important domains: OntoNotes 5.0 Weischedel et al. (2013) (General), CoNLL 2003 Tjong Kim Sang and De Meulder (2003) (News), I2B2 2014 Stubbs and Uzuner (2015) (Medical), and WNUT 2017 Derczynski et al. (2017) (Social). To the best of our knowledge, these are the largest annotated NER corpora in their respective domains. These datasets are labeled with diverse and representative named entity types. Table 1 presents detailed statistics of these datasets. We use the OntoNotes train/development/test splits released for the CoNLL 2012 shared task (available at http://conll.cemantix.org/2012/data.html); the other datasets come with their own standard train/development/test splits.

4.2 Evaluation tasks

We evaluate few-shot NER systems on two real world scenarios. For both scenarios, we experiment with both one-shot and five-shot settings.

Tag set extension

Our first set of experiments are motivated by the fact that new types of entities often emerge in some domains such as medical and social media. Thus, we evaluate the performance of our systems on recognizing new entity types as they emerge in the source domain. We mimic this scenario by splitting the entity classes of a dataset into a source set and a target set. Specifically, we randomly split the eighteen entity classes of the OntoNotes dataset into three target entity class sets:

  • Group A: {ORG, NORP, ORDINAL, WORK_OF_ART, QUANTITY, LAW}

  • Group B: {GPE, CARDINAL, PERCENT, TIME, EVENT, LANGUAGE}

  • Group C: {PERSON, DATE, MONEY, LOC, FAC, PRODUCT}

We evaluate our systems on each target entity set. For each experiment, we modify the training set by replacing all the entity tags corresponding to the target test group with the O tag. Hence, these target tags are no longer observed during training. Similarly, we modify the test set to only include annotations corresponding to the target test group such that we only evaluate our models based on the unseen tags during training. As discussed in § 2, we sample multiple support sets from the development set to simulate the few-shot setting at the test time.
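The relabeling for these experiments is mechanical; a small sketch for IO-tagged data is given below (the group definition mirrors the list above, and the helper name is ours).

```python
TARGET_GROUP_A = {"ORG", "NORP", "ORDINAL", "WORK_OF_ART", "QUANTITY", "LAW"}

def mask_tags(tags, target_classes, keep_target):
    """Relabel one IO tag sequence for the tag set extension experiments.

    keep_target=False: training data -- tags of the target group are replaced with O.
    keep_target=True:  test data     -- only tags of the target group are kept.
    """
    out = []
    for tag in tags:
        if tag == "O":
            out.append("O")
            continue
        in_target = tag[2:] in target_classes
        out.append(tag if in_target == keep_target else "O")
    return out

# Example: hide Group A entities from the training data, keep only them in the test data.
# train_tags = mask_tags(tags, TARGET_GROUP_A, keep_target=False)
# test_tags  = mask_tags(tags, TARGET_GROUP_A, keep_target=True)
```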

Domain transfer

The second set of experiments address a common scenario of adapting a NER system to a novel domain. For our experiments, we use General (OntoNotes) as the source domain and test our models on News (CoNLL), Medical (I2B2) and Social (WNUT) domains. We train our supervised NER models on the standard OntoNotes training set, whereas we evaluate the few-shot systems on standard test sets of CoNLL, I2B2, and WNUT. The support sets are sampled from the corresponding development sets of the three corpora.

4.3 Experimental settings

We have provided details of our proposed evaluation setup for few-shot NER in § 2. We report the standard evaluation metric for NER: micro-averaged F1 score. For each experiment, we sample five support sets and report the mean and standard deviation of the corresponding F1 scores. In order to establish a comprehensive comparison with prior work, we also report episode evaluation results in the Appendix.

Competitive systems

We consider five competitive approaches in our experiments. We build BERT-based systems for all the methods and BiLSTM-based systems for three of them. Prototypical Network Snell et al. (2017) is a popular few-shot classification algorithm that has been adopted in most state-of-the-art (SOTA) few-shot NER systems Fritzler et al. (2019). PrototypicalNet+P&D Hou et al. (2020) improves upon Prototypical Network by using the pair-wise embedding and dependency transfer mechanism. (Hou et al. (2020) show that Matching Network Vinyals et al. (2016) performs worse than Prototypical Network in their evaluation for few-shot NER.) SimBERT is a nearest neighbor classifier based on the pre-trained BERT encoder without fine-tuning on any NER data. Finally, we include our proposed NNShot and StructShot described in § 3. We use the IO tagging scheme for all of the experiments, as we find that it performs much better than the BIO scheme for all the considered methods.

System Tag Set Extension Domain Transfer
 Group A Group B Group C Ave. CoNLL I2B2 WNUT Ave.
BiLSTM-based systems
Prototypical Network 4.0±1.6 5.4±1.9 5.2±1.5 4.9 18.7±9.2 2.2±1.0 5.5±2.7 8.8
NNShot (ours) 15.7±7.1 25.1±7.1 22.7±7.1 21.2 46.4±11.7 7.5±2.9 6.9±3.2 20.3
StructShot (ours) 18.9±9.4 31.9±5.1 22.0±3.4 24.3 53.1±9.9 10.5±2.6 10.4±4.4 24.7
BERT-based systems
SimBERT 8.3±1.4 9.0±3.8 8.4±1.8 8.6 15.7±3.7 7.7±0.8 4.9±1.2 9.4
Prototypical Network 18.7±4.7 24.4±8.9 18.3±6.9 20.5 53.0±7.2 7.6±3.5 14.8±4.9 25.1
PrototypicalNet+P&D 18.5±4.4 24.8±9.3 20.7±8.4 21.3 56.0±7.3 7.9±3.2 18.8±5.3 27.6
NNShot (ours) 27.2±3.5 32.5±14.4* 23.8±10.2* 27.8 61.3±11.5 16.6±2.1 21.7±6.3 33.2
StructShot (ours) 27.5±4.1* 32.4±14.7 23.8±10.2* 27.9* 62.3±11.4* 22.1±3.0* 25.3±5.3* 36.6*
Table 2: F1 score results on one-shot NER for both tag set extension and domain transfer tasks. We report standard deviations from runs with five different support sets sampled from the validation sets. The best results are marked with *.
System Tag Set Extension Domain Transfer
 Group A Group B Group C Ave. CoNLL I2B2 WNUT Ave.
BiLSTM-based systems
Prototypical Network 7.4±2.7 21.8±7.6 18.2±5.6 15.8 47.6±9.0 5.9±1.1 8.8±3.3 20.8
NNShot (ours) 24.5±5.4 35.2±7.4 33.8±6.3 31.2 62.0±6.1 8.4±2.7 12.4±4.2 27.6
StructShot (ours) 26.1±6.0 46.1±6.5 38.0±1.8 36.7 63.8±6.9 13.7±0.8 15.1±4.9 30.9
BERT-based systems
SimBERT 10.1±0.8 23.0±6.7 18.0±3.5 17.0 28.6±2.5 9.1±0.7 7.7±2.2 15.1
Prototypical Network 27.1±2.4 38.0±5.9 38.4±3.3 34.5 65.9±1.6 10.3±0.4 19.8±5.0 32.0
PrototypicalNet+P&D 29.8±2.8 41.0±6.5 38.5±3.3 36.4 67.1±1.6 10.1±0.9 23.8±3.9 33.6
NNShot (ours) 44.7±2.3 53.9±7.8 53.0±2.3 50.5 74.3±2.4 23.7±1.3 23.9±5.0 40.7
StructShot (ours) 47.4±3.2* 57.1±8.6* 54.2±2.5* 52.9* 75.2±2.3* 31.8±1.8* 27.2±6.7* 44.7*
Table 3: F1 score results on five-shot NER for both tag set extension and domain transfer tasks. We report standard deviations from runs with five different support sets sampled from the validation sets. The best results are marked with *.

Parameter tuning

We adopt the best hyperparameter values reported by Yang et al. (2018) for the BiLSTM-NER models and use the default BERT hyper-parameter values provided by Hugging Face (https://huggingface.co/). Specifically, our BiLSTM-NER models adopt a one-layer word-level BiLSTM model and a one-layer character-level uni-directional LSTM model. The LSTM hidden sizes are 50 and 200 and the input embedding sizes are 30 and 100 for the character-level and word-level models respectively. We use the pre-trained 100-dimensional GloVe vectors Pennington et al. (2014) to initialize the word embeddings for all BiLSTM-NER models. SGD and Adam Kingma and Ba (2014) are utilized to optimize the BiLSTM-based and BERT-based models with learning rates 0.015 and $5\times 10^{-5}$ respectively. We tune other parameters required by the different few-shot learning methods on the source domain development sets. The transition re-normalizing temperature $\tau$ is chosen from {0.01, 0.005, 0.001}.

4.4 Results

The results for one-shot NER and five-shot NER are summarized in Table 2 and Table 3 respectively. As shown, our NNShot and StructShot perform significantly better than all previous methods across all evaluation settings. By modeling label dependencies with a simple Viterbi decoder, StructShot boosts the performance of NNShot by 2.4% and 4% F1 score on the five-shot tag set extension and domain transfer tasks on average, respectively. These performance gains are greater than the ones obtained by joint CRF training with the prototypical network (PrototypicalNet+P&D), suggesting that independently modeling transition and emission scores is a cheap but effective way to capture label dependencies. StructShot achieves new SOTA results on the two few-shot NER tasks, outperforming the previous SOTA system (PrototypicalNet+P&D) by 6% to 9% F1 score in the one-shot setting and 11% to 16% F1 score in the five-shot setting.

BiLSTM vs. BERT as token embedder

The BERT-based systems considerably outperform the BiLSTM-based systems on few-shot NER. Language model pre-training is critical for low-resource natural language processing, including few-shot transfer learning Cherry et al. (2019). However, task-specific knowledge is usually more important than the general information learned via unsupervised training. For example, the top-performing BiLSTM-based systems can beat SimBERT by up to 15% F1 score in some few-shot NER settings. With fine-tuning on the OntoNotes data, NNShot outperforms SimBERT by 20% to 35% F1 score across different settings, demonstrating the effectiveness of injecting task-specific information into pre-trained language models.

Tag set extension vs. Domain transfer

The one-shot NER systems generally perform better on domain transfer than on tag set extension, while the five-shot systems work better on the tag set extension task. On the domain transfer task, the source entity classes overlap with some entity classes in the target domain, which benefits NER systems built under the extremely low-resource condition. However, in general, domain transfer is more challenging than tag set extension due to language variation across different domains. Not surprisingly, our five-shot NER systems are not only more accurate but also more robust than the one-shot systems. The standard deviations reported with multiple five-shot support sets are much lower than those obtained with one-shot support sets. This indicates that we can build more reliable few-shot NER systems given more few-shot examples in the support sets.

Episode evaluation

Finally, as shown in the Appendix, the results obtained on episode evaluation are generally better than the ones reported with our proposed evaluation setup. However, the performance trend is the same, i.e., StructShot significantly outperforms all competitors. It implies that previous studies Fritzler et al. (2019); Hou et al. (2020) overestimate the performance of their few-shot NER systems.

Few-shot NER in practice

Although the average F1 scores of the few-shot NER systems are relatively low, we believe that few-shot NER systems are still very useful in practice. First, the few-shot NER results are reasonably good if the source and target domains are close to each other. For example, the five-shot NER system trained on the OntoNotes training set can achieve 75% F1 score on the CoNLL test set. Second, given the few-shot NER system, we are able to provide immediate support to emerging entity types without retraining and redeploying the NER model. At the same time, a more accurate NER model can be trained in parallel after collecting sufficient annotations for the new types.

4.5 Analysis

We perform analysis to investigate the impact of various tagging schemes and BERT fine-tuning objectives on few-shot NER.

Tagging scheme

When only a few entities are available in the support sets, the conventional BIO tagging scheme can harm the performance of few-shot NER systems, as it further reduces the number of labeled instances per tag class. We experiment with both the BIO and IO tagging schemes for all the few-shot NER models. The systems equipped with the IO tagging scheme always outperform those with the BIO scheme. In particular, StructShot and NNShot benefit from switching from the BIO scheme to the IO scheme by an average of 3.2% and 3.8% F1 score on the five-shot tag set extension and domain transfer tasks respectively.
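Converting BIO annotations to the IO scheme is a one-line relabeling; a sketch:

```python
def bio_to_io(tags):
    """Drop the B-/I- distinction: 'B-PER' and 'I-PER' both become 'I-PER'."""
    return ["I-" + tag[2:] if tag != "O" else "O" for tag in tags]
```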

Fine-tuning objective

StructShot exploits the standard cross-entropy loss for NER used in the original paper Devlin et al. (2019) to fine-tune BERT on the OntoNotes data. We also experiment with fine-tuning BERT using the prototypical network objective, and then utilize the encoder in StructShot. The results show that BERT fine-tuned with the standard NER loss performs much better than the one fine-tuned with the prototypical network loss, by 12% and 9% on the five-shot tag set extension and domain transfer tasks respectively. This suggests that the popular meta-learning methods fall short in capturing effective representations for the few-shot NER task.

5 Discussion

In this section, we investigate two questions: 1) why is StructShot so effective? and 2) why is few-shot NER so difficult?

Figure 3: t-SNE visualizations of the CoNLL and WNUT test sets. The representations are obtained from the pre-trained BERT-base model and the BERT-base model fine-tuned on the OntoNotes training data.

t-SNE visualization

We project token-level representations obtained from the BERT embedders onto a 2-dimensional space using t-SNE Maaten and Hinton (2008). Fig. 3 presents the visualization results on the CoNLL and WNUT test sets (we exclude I2B2 as it includes too many classes for visualization). Fine-tuning BERT on OntoNotes clearly improves task-awareness with respect to both the CoNLL and WNUT datasets, as instances of the same class are much closer together than those obtained from the non-fine-tuned BERT model. The separation of different entity classes is more evident on CoNLL due to its greater tag set overlap with OntoNotes. Instances labeled with O are spread across the space, regardless of fine-tuning. This explains the effectiveness of StructShot. First, fine-tuning BERT in a conventional NER setting learns a good entity-specific metric space. Second, the nearest neighbor classifier, which emphasizes local distances, is more appropriate for assigning O to an instance.
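For reference, the projection itself is a standard t-SNE over the token embeddings; a minimal sketch with scikit-learn, where the embedding matrix and class labels are assumed to come from the token embedder and the gold annotations:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(token_embeddings, labels):
    """Project token embeddings to 2D with t-SNE and color points by entity class."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(np.asarray(token_embeddings))
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        plt.scatter(coords[idx, 0], coords[idx, 1], s=4, label=cls)
    plt.legend(markerscale=3, fontsize="small")
    plt.show()
```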

I2B2 WNUT
Class F1 # Entity Class F1 # Entity
DATE 66.1 4,983 person 57.8 429
CITY 61.1 260 loc. 49.6 150
DOCTOR 51.1 1,945 c-work 30.1 142
AGE 29.3 768 corp. 16.7 66
MED-REC 14.4 428 product 10.9 127
PATIENT 14.0 920 group 5.0 165
HOSPITAL 10.1 874 - - -
PHONE 7.9 224 - - -
IDNUM 7.0 201 - - -
Table 4: Best per-class five-shot domain transfer results obtained from StructShot on the I2B2 and WNUT test sets, in which MED-REC, loc., c-work, and corp. correspond to MEDICAL-RECORD, location, creative-work, and corporation respectively.

Per-class performance analysis

We attempt to shed some light on the second question by analyzing outputs from the best five-shot StructShot systems on the domain transfer task. The per-class F1 scores are shown in Table 4, where we exclude I2B2 classes with fewer than 200 instances in the test set. StructShot achieves reasonable performance on less ambiguous entity classes such as DATE, CITY, person, and location. However, it struggles to distinguish between highly ambiguous classes. For example, AGE, MEDICAL-RECORD, PHONE, and IDNUM are all numbers. It is still challenging for our system to differentiate between these numerical types without any domain-specific knowledge. Similarly, StructShot often predicts a PATIENT entity as DOCTOR, and it nearly always assigns the corporation label to group entities. We believe that domain-specific cues like ‘Dr.’ and ‘MD.’ can be useful in resolving these ambiguities and enable few-shot NER systems to generalize better.

6 Related Work

Meta learning

Meta learning is widely studied in the computer vision community, as low-level features in images are transferable across classes, which enables learning from only a few examples of an unseen class. The existing approaches Snell et al. (2017); Vinyals et al. (2016) typically focus on metric learning. Snell et al. (2017) learn a prototype representation for each class and classify test points based on the nearest prototypes. Vinyals et al. (2016) compute support-set-aware similarities between a test point and the target classes. These methods have been adapted with some success to NLP tasks including text classification (Yu et al., 2018; Geng et al., 2019; Bao et al., 2020), machine translation Gu et al. (2018), and relation classification Han et al. (2018). Recently, Wang et al. (2019) show that simple feature transformations followed by nearest neighbor search can perform competitively with state-of-the-art meta-learning methods on standard computer vision classification datasets. Inspired by this approach, we evaluate the performance of nearest neighbor based classification against meta-learning methods.

Few-shot NER

A few approaches have been proposed for few-shot NER. Hofer et al. (2018) explore different pre-training and fine-tuning strategies to recognize entities in medical text with a few examples. Fritzler et al. (2019) and Hou et al. (2020) exploit popular few-shot classification methods such as Prototypical Network and Matching Network, where Hou et al. (2020) also jointly learn transition scores that improve performance. These approaches require complex episode training and achieve only unsatisfactory results. StructShot does not require meta-training. With a simple nearest neighbor classifier and a structured decoder, it is much more accurate than other existing meta-learning based systems.

7 Conclusion

We introduce StructShot, a simple few-shot NER system that achieves SOTA performance without any few-shot specific training. We identify two weaknesses of previous systems related to their handling of the O class and their modeling of label dependencies. Our system overcomes these challenges with nearest neighbor learning and structured decoding. We further propose a standard evaluation setup for few-shot NER and show that StructShot significantly outperforms prior SOTA systems on popular benchmarks across multiple domains. In the future, we want to extend our system to other few-shot sequence tagging problems such as part-of-speech tagging and slot filling.

Acknowledgments

We thank the EMNLP reviewers for their helpful feedback. We also thank the ASAPP NLP team for their support throughout the project.

References

  • Bao et al. (2020) Yujia Bao, Menghua Wu, Shiyu Chang, and Regina Barzilay. 2020. Few-shot text classification with distributional signatures. In International Conference on Learning Representations.
  • Cherry et al. (2019) Colin Cherry, Greg Durrett, George Foster, Reza Haffari, Shahram Khadivi, Nanyun Peng, Xiang Ren, and Swabha Swayamdipta. 2019. Proceedings of the 2nd workshop on deep learning approaches for low-resource nlp (deeplo 2019). In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019).
  • Derczynski et al. (2017) Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the wnut2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
  • Forney (1973) G David Forney. 1973. The viterbi algorithm. Proceedings of the IEEE.
  • Fritzler et al. (2019) Alexander Fritzler, Varvara Logacheva, and Maksim Kretov. 2019. Few-shot classification in named entity recognition task. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing.
  • Geng et al. (2019) Ruiying Geng, Binhua Li, Yongbin Li, Xiaodan Zhu, Ping Jian, and Jian Sun. 2019. Induction networks for few-shot text classification. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
  • Gu et al. (2018) Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
  • Han et al. (2018) Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. Fewrel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
  • Hofer et al. (2018) Maximilian Hofer, Andrey Kormilitzin, Paul Goldberg, and Alejo Nevado-Holgado. 2018. Few-shot learning for named entity recognition in medical text. arXiv preprint arXiv:1811.05468.
  • Hou et al. (2020) Yutai Hou, Wanxiang Che, Yongkui Lai, Zhihan Zhou, Yijia Liu, Han Liu, and Ting Liu. 2020. Few-shot slot tagging with collapsed dependency transfer and label-enhanced task-adaptive projection network. In Proceedings of the Association for Computational Linguistics (ACL).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML).
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
  • Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Neural Information Processing Systems (NIPS).
  • Stubbs and Uzuner (2015) Amber Stubbs and Özlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/uthealth corpus. Journal of biomedical informatics.
  • Tjong Kim Sang and De Meulder (2003) Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, koray kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In Neural Information Processing Systems (NIPS).
  • Wang et al. (2019) Yan Wang, Wei-Lun Chao, Kilian Q. Weinberger, and Laurens van der Maaten. 2019. Simpleshot: Revisiting nearest-neighbor classification for few-shot learning. CoRR, abs/1911.04623.
  • Weischedel et al. (2013) Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA.
  • Yang et al. (2018) Jie Yang, Shuailong Liang, and Yue Zhang. 2018. Design challenges and misconceptions in neural sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics (COLING).
  • Yu et al. (2018) Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse few-shot text classification with multiple metrics. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Appendix A Appendix: Episode Evaluation Results

The main episode evaluation results for one-shot NER and five-shot NER are summarized in Table 5 and Table 6 respectively. We sample 100 evaluation episodes for each experiment. The performance trend is the same as in our main results, in which StructShot significantly outperforms all competitors.

System Tag Set Extension Domain Transfer
 Group A Group B Group C Ave. CoNLL I2B2 WNUT Ave.
BiLSTM-based systems
Prototypical Network 4.0±1.6 5.4±1.9 5.1±1.5 4.8 15.3±7.5 3.6±1.3 2.6±1.4 7.2
NNShot (ours) 15.7±7.0 27.5±7.1 20.1±6.0 21.1 49.6±12.6 9.5±4.1 8.9±1.6 22.7
StructShot (ours) 18.9±9.7 32.0±5.1 22.0±3.3 24.3 50.0±9.2 11.0±2.0 9.9±4.2 23.6
BERT-based systems
SimBERT 13.0±1.8 14.3±3.9 9.5±1.1 12.3 19.3±4.3 16.3±2.1 5.3±0.9 13.6
Prototypical Network 25.5±3.7 30.5±6.8 21.2±5.8 25.7 59.3±6.3 19.9±2.7 15.8±4.1 31.6
PrototypicalNet+P&D 27.2±1.1 31.4±6.9 23.0±5.1 27.2 61.7±6.8 21.3±4.8 17.5±2.9 33.5
NNShot (ours) 31.3±4.5* 32.8±7.4 27.3±7.8 30.5 67.6±10.8 30.1±2.5 20.2±6.0 39.3
StructShot (ours) 30.8±5.0 33.5±7.7* 28.0±7.9* 30.8* 68.7±10.5* 32.1±1.7* 20.5±5.2* 40.4*
Table 5: F1 score results of episode evaluation on one-shot NER for both tag set extension and domain transfer tasks. We report standard deviations from runs with five different support sets sampled from the validation sets. The best results are marked with *.
System Tag Set Extension Domain Transfer
 Group A Group B Group C Ave. CoNLL I2B2 WNUT Ave.
BiLSTM-based systems
Prototypical Network 7.4±2.7 23.9±6.2 18.2±5.6 16.5 49.2±5.8 8.5±4.6 5.2±1.8 21.0
NNShot (ours) 24.5±5.8 42.3±12.9 33.8±6.3 33.5 62.1±6.8 12.4±4.2 9.0±2.6 27.8
StructShot (ours) 26.1±6.0 47.0±7.7 38.0±1.8 37.1 63.8±6.9 13.7±0.8 15.1±4.9 30.9
BERT-based systems
SimBERT 18.8±1.4 27.0±2.4 21.4±3.1 22.4 31.7±1.3 23.6±1.7 9.3±2.2 21.6
Prototypical Network 36.4±0.9 46.3±2.0 41.6±1.4 41.4 69.2±2.0 27.6±2.8 22.1±3.1 39.6
PrototypicalNet+P&D 38.5±4.1 49.5±2.3 44.3±1.2 44.1 69.6±2.3 32.2±2.1 26.0±2.1 42.6
NNShot (ours) 45.3±1.5 53.4±2.9 49.9±1.3 49.5 77.2±1.8 45.4±2.1 26.7±4.0 49.8
StructShot (ours) 47.2±0.9* 54.9±2.9* 51.2±1.4* 51.1* 77.9±1.8* 46.1±3.2* 27.9±3.2* 50.6*
Table 6: F1 score results of episode evaluation on five-shot NER for both tag set extension and domain transfer tasks. We report standard deviations from runs with five different support sets sampled from the validation sets. The best results are marked with *.