
An Embarrassingly Easy but Strong Baseline for Nested Named Entity Recognition

Hang Yan, Yu Sun, Xiaonan Li, Xipeng Qiu
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
{hyan19,lixn20,xpqiu}@fudan.edu.cn
[email protected]
   Corresponding author.
Abstract

Named entity recognition (NER) is the task of detecting and classifying entity spans in text. When entity spans overlap with each other, the task is referred to as nested NER. Span-based methods have been widely used to tackle nested NER. Most of these methods produce an $n\times n$ score matrix, where $n$ is the length of the sentence and each entry corresponds to a span. However, previous work ignores the spatial relations in the score matrix. In this paper, we propose using a Convolutional Neural Network (CNN) to model these spatial relations in the score matrix. Despite being simple, experiments on three commonly used nested NER datasets show that our model surpasses several recently proposed methods with the same pre-trained encoders. Further analysis shows that using CNN helps the model find more nested entities. Besides, we found that different papers used different sentence tokenizations for the three nested NER datasets, which influences the comparison. Thus, we release a pre-processing script to facilitate future comparison (code is available at https://github.com/yhcc/CNN_Nested_NER).

1 Introduction

Named Entity Recognition (NER) is the task of extracting entities from raw text, and it has long been a fundamental task in the Natural Language Processing (NLP) field. Previously, this task was mainly solved in the sequence labeling paradigm by assigning a label to each token Huang et al. (2015); Ma and Hovy (2016); Yan et al. (2019). However, this method is not directly applicable to the nested NER scenario, since a token may be included in two or more entities. To overcome this issue, the span-based method, which assigns labels to each span, was introduced Eberts and Ulges (2020); Li et al. (2020); Yu et al. (2020).

Eberts and Ulges (2020) used a pooling method over token representations to get the span representation, and then conducted classification on this representation. Li et al. (2020) transformed the NER task into a Machine Reading Comprehension form: they used the entity type as the query and asked the model to select the spans that belong to this entity type. Yu et al. (2020) utilized the Biaffine decoder from dependency parsing Dozat and Manning (2017) to convert span classification into classifying pairs of start and end tokens. However, these works did not take advantage of the spatial correlations between adjacent spans.

Figure 1: All valid spans of a sentence. We use the start and end tokens to pinpoint a span; for instance, “(2-4)” represents “New York University”. Spans in the two orange dotted squares indicate that the center span can have special relationships (different relations are depicted in different colors) with its surrounding spans. For example, the span “New York” (2-3) is contained in the span “New York University” (2-4); therefore, the “(2-3)” span is annotated as “d”.

As depicted in Figure 1, the spans surrounding a span have special relationships with the center span. It should be beneficial to leverage these spatial correlations. In this paper, we use the Biaffine decoder Dozat and Manning (2017) to get a 3D feature matrix in which each entry represents one span. We then view this feature matrix as an image and utilize a Convolutional Neural Network (CNN) to model the local interactions between spans.

We compare this simple method with recently proposed methods Wan et al. (2022); Li et al. (2022); Zhu and Li (2022); Yuan et al. (2022). To make sure our method is strictly comparable to theirs, we asked the authors for their versions of the data. Although all of them used the same datasets, we found that the statistics, such as the number of sentences and entities, were not the same. This was caused by the usage of distinct sentence tokenization methods, which influences the performance, as shown in our experiments. To facilitate future comparison, we release a pre-processing script (https://github.com/yhcc/CNN_Nested_NER/tree/master/preprocess) for the ACE2004, ACE2005 and Genia datasets.

Our contributions can be summarized as follows.

  • We find that adjacent spans have special correlations with each other, and we propose using CNN to model the interactions between them. Despite being very simple, this achieves a considerable performance boost on three widely used nested NER datasets.

  • We release a pre-processing script for the three nested NER datasets to facilitate direct and fair comparison.

  • The way we view the span feature matrix as an image may shed some light on future exploration of span-based methods for the nested NER task.

2 Related Work

Previously, four paradigms have been proposed to solve the nested NER task.

The first one is the sequence labeling framework Straková et al. (2019): since one token can be contained in more than one entity, the Cartesian product of the entity labels is used. However, these Cartesian labels suffer from the long-tail issue.

The second one uses hypergraphs to efficiently represent spans Lu and Roth (2015); Muis and Lu (2016); Katiyar and Cardie (2018); Wang and Lu (2018). The shortcoming of this method is its complex decoding.

The third one is the sequence-to-sequence (Seq2Seq) framework Sutskever et al. (2014); Lewis et al. (2020); Raffel et al. (2020), which generates the entity sequence directly. The entity sequence can be an entity pointer sequence Yan et al. (2021); Fei et al. (2021) or an entity text sequence Lu et al. (2022). Nevertheless, the Seq2Seq method suffers from time-consuming decoding.

The fourth one conducts span classification. Eberts and Ulges (2020) proposed to enumerate all possible spans within a sentence and use a pooling method to get the span representation, while Yu et al. (2020) proposed to use the start and end tokens of a span to pinpoint it and use the Biaffine decoder to get the score for each span. Span-based methods are friendly to parallelism and easy to decode; therefore, this formulation has been widely adopted Wan et al. (2022); Zhu and Li (2022); Li et al. (2022); Yuan et al. (2022). However, the relations between neighboring spans were ignored in previous work.

Figure 2: The proposed method. Several CNN blocks are used to model the spatial correlations between neighboring spans.

3 Proposed Method

In this section, we first introduce the nested NER task, then describe how to get the feature matrix, and finally present the CNN module that models the spatial correlations on the feature matrix. The general framework of our proposed method is shown in Figure 2.

3.1 Nested NER Task

Given an input sentence $X=[x_1, x_2, \ldots, x_n]$ with $n$ tokens, the nested NER task aims to extract all entities in $X$. Each entity can be expressed as a tuple $(s_i, e_i, t_i)$, where $s_i$ and $e_i$ are the start and end indices of the entity, $t_i \in \{1, \ldots, |T|\}$ is its entity type, and $|T|$ is the number of entity types. As the task name suggests, the entities may overlap with each other, but different entities are not allowed to have crossing boundaries. For a sentence with $n$ tokens, there are $n(n+1)/2$ valid spans.
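For concreteness, the snippet below (ours, purely illustrative) enumerates the valid spans of a short sentence and checks the $n(n+1)/2$ count:

```python
# Enumerate all (start, end) pairs with start <= end for a 5-token sentence.
n = 5
spans = [(s, e) for s in range(n) for e in range(s, n)]
assert len(spans) == n * (n + 1) // 2  # 15 valid spans when n = 5
```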

3.2 Span-based Method for Nested NER

We follow Yu et al. (2020) and formulate this task as span classification: for each valid span, the model assigns an entity label to it. The method first uses an encoder to encode the input sentence as follows:

$$\mathbf{H} = \mathrm{Encoder}(X),$$

where $\mathbf{H}\in\mathcal{R}^{n\times d}$ and $d$ is the hidden size. Various pre-trained models, such as BERT Devlin et al. (2019), are usually used as the encoder. For a word tokenized into several word pieces, we use max-pooling over its pieces' hidden states to get the word representation.
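A minimal PyTorch sketch of this aggregation step is given below; the function and variable names are ours and are not taken from the released code:

```python
import torch

def pool_wordpieces(piece_states: torch.Tensor, word2pieces: list) -> torch.Tensor:
    # piece_states: [num_pieces, d] encoder hidden states for one sentence.
    # word2pieces[i]: indices of the word pieces belonging to word i.
    word_states = [piece_states[idx].max(dim=0).values for idx in word2pieces]
    return torch.stack(word_states)  # [n, d] word-level representations

# Example: word 1 was split into pieces 1 and 2.
H = pool_wordpieces(torch.randn(4, 8), [[0], [1, 2], [3]])  # shape [3, 8]
```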

After getting the contextualized token embeddings, previous work usually concatenates them with static word embeddings and character embeddings, and then feeds the combined embedding into a BiLSTM layer Yu et al. (2020); Wan et al. (2022); Yuan et al. (2022). To keep the model less cluttered, we use neither the extra embeddings nor the BiLSTM layer.

Next, we use a multi-head Biaffine decoder Dozat and Manning (2017); Vaswani et al. (2017) to get the score matrix as follows:

$$\begin{aligned}
\mathbf{H}_s &= \mathrm{LeakyReLU}(\mathbf{H}W_s),\\
\mathbf{H}_e &= \mathrm{LeakyReLU}(\mathbf{H}W_e),\\
\mathbf{R} &= \mathrm{MHBiaffine}(\mathbf{H}_s,\mathbf{H}_e),
\end{aligned}$$

where $W_s, W_e\in\mathcal{R}^{d\times h}$, $h$ is the hidden size, $\mathrm{MHBiaffine}(\cdot,\cdot)$ is the multi-head Biaffine decoder (the detailed description is in Appendix A), and $\mathbf{R}\in\mathcal{R}^{n\times n\times r}$, where $r$ is the feature size. Each cell $(i,j)$ of $\mathbf{R}$ can be seen as the feature vector $\mathbf{v}\in\mathcal{R}^{r}$ for a span. For the lower triangle of $\mathbf{R}$ (where $i>j$), the span contains words from the $j$-th to the $i$-th; therefore, one span has two entries if it is off-diagonal.

3.3 CNN on Score Matrix

As shown in Figure 1, each cell has relations with the cells around it. Therefore, we propose using CNN to model these interactions. We repeat the following CNN block several times in our model:

$$\begin{aligned}
\mathbf{R}' &= \mathrm{Conv2d}(\mathbf{R}),\\
\mathbf{R}'' &= \mathrm{GeLU}(\mathrm{LayerNorm}(\mathbf{R}'+\mathbf{R})),
\end{aligned}$$

where $\mathrm{Conv2d}$, $\mathrm{LayerNorm}$ and $\mathrm{GeLU}$ denote the 2D CNN, layer normalization Ba et al. (2016) and the GeLU activation function Hendrycks and Gimpel (2016). The layer normalization is conducted over the feature dimension. A noticeable fact here is that since the number of tokens $n$ varies across sentences, the $\mathbf{R}$s are of different shapes. To make sure the results are the same when $\mathbf{R}$ is processed in a batch, the 2D CNN has no bias term, and all the paddings in $\mathbf{R}$ are filled with 0.

After passing through several CNN blocks, $\mathbf{R}''$ is further processed by another 2D CNN module.
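One CNN block can be sketched in PyTorch as follows. This is our own illustration, assuming $\mathbf{R}$ is stored as [batch, r, n, n] with a 0/1 padding mask; it is not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNBlock(nn.Module):
    def __init__(self, r: int, kernel_size: int = 3):
        super().__init__()
        # bias=False so padded cells receive no constant offset from the convolution.
        self.conv = nn.Conv2d(r, r, kernel_size, padding=kernel_size // 2, bias=False)
        self.norm = nn.LayerNorm(r)  # normalizes the feature dimension

    def forward(self, R: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # R: [batch, r, n, n]; mask: [batch, 1, n, n], 1 for real cells, 0 for padding.
        R_prime = self.conv(R * mask)                 # R' = Conv2d(R)
        x = (R + R_prime).permute(0, 2, 3, 1)         # move features last for LayerNorm
        x = F.gelu(self.norm(x)).permute(0, 3, 1, 2)  # R'' = GeLU(LayerNorm(R' + R))
        return x * mask                               # refill the paddings with 0
```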

3.4 The Output

We use a perceptron to get the prediction logits as follows (we use Sigmoid rather than Softmax because, in rare cases such as in the ACE2005 and Genia datasets, one span can have more than one entity tag):

$$P=\mathrm{Sigmoid}(W_o(\mathbf{R}+\mathbf{R}'')+b),$$

where $W_o\in\mathcal{R}^{|T|\times r}$, $b\in\mathcal{R}^{|T|}$ and $P\in\mathcal{R}^{n\times n\times|T|}$. We then use the binary cross entropy to calculate the loss as

$$\mathcal{L}_{BCE}=-\sum_{0\leq i,j<n}y_{ij}\log(P_{ij}).$$

Unlike previous work that uses only the upper triangle to compute the loss Yu et al. (2020); Zhu and Li (2022), we use both the upper and lower triangles. The reason is that batched computation cannot be restricted to the upper triangle alone; since the lower triangle has been computed anyway, we also use it for the output. The tags for the score matrix are symmetric, namely, the tag in the $(i,j)$-th entry is the same as in the $(j,i)$-th.
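The output layer and loss can be sketched as follows; the names are ours, and `targets` is the symmetric 0/1 tag tensor described above:

```python
import torch
import torch.nn.functional as F

def bce_loss(R: torch.Tensor, R2: torch.Tensor,
             W_o: torch.Tensor, b: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # R, R2 ("R''"): [batch, n, n, r]; W_o: [|T|, r]; b: [|T|];
    # targets: [batch, n, n, |T|] with targets[:, i, j] == targets[:, j, i].
    logits = torch.einsum('bijr,tr->bijt', R + R2, W_o) + b
    # Sigmoid + BCE fused for numerical stability; both triangles contribute,
    # so no triangle masking is needed for batched computation.
    return F.binary_cross_entropy_with_logits(logits, targets.float())
```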

| Model | # Param. (Million) | ACE2004 P | ACE2004 R | ACE2004 F1 | ACE2005 P | ACE2005 R | ACE2005 F1 |
|---|---|---|---|---|---|---|---|
| Data from Li et al. (2022) | | | | | | | |
| W2NER Li et al. (2022) [BERT-large] | 355.4 | 87.33 | 87.71 | 87.52 | 85.03 | 88.62 | 86.79 |
| Ours [BERT-large] | 345.1 | 87.82±0.38 | 87.40±0.20 | 87.61±0.18 | 86.39±0.61 | 87.24±0.34 | 86.82±0.45 |
| w.o. CNN [BERT-large] | 343.6 | 86.54±0.48 | 87.09±0.41 | 86.81±0.21 | 84.88±0.26 | 86.99±0.33 | 85.92±0.27 |
| Data from Wan et al. (2022) | | | | | | | |
| SG Wan et al. (2022) [BERT-base] | 112.3 | 86.70 | 85.93 | 86.31 | 84.37 | 85.87 | 85.11 |
| Ours [BERT-base] | 110.5 | 86.85±0.61 | 86.45±0.36 | 86.65±0.22 | 84.94±0.49 | 85.40±0.27 | 85.16±0.16 |
| w.o. CNN [BERT-base] | 109.1 | 85.79±0.46 | 85.78±0.12 | 85.78±0.22 | 82.91±0.21 | 84.89±0.23 | 83.89±0.16 |
| Data from Zhu and Li (2022) | | | | | | | |
| BS Zhu and Li (2022) [RoBERTa-base] | 125.6 | 88.43 | 87.53 | 87.98 | 86.25 | 88.07 | 87.15 |
| Ours [RoBERTa-base] | 125.6 | 87.77±0.27 | 88.28±0.36 | 88.03±0.14 | 86.58±0.78 | 87.94±0.46 | 87.25±0.48 |
| w.o. CNN [RoBERTa-base] | 125.2 | 86.71±0.27 | 87.40±0.42 | 87.05±0.18 | 85.48±0.39 | 87.54±0.59 | 86.50±0.26 |
| Data from this work | | | | | | | |
| W2NER [BERT-large]† | 355.4 | 87.17±0.11 | 87.70±0.19 | 87.43±0.11 | 85.78±0.30 | 87.81±0.24 | 86.77±0.21 |
| Ours [BERT-large] | 345.1 | 87.98±0.30 | 87.50±0.22 | 87.74±0.16 | 86.26±0.65 | 87.56±0.31 | 86.91±0.23 |
| w.o. CNN [BERT-large] | 343.6 | 86.60±0.68 | 86.48±0.36 | 86.54±0.19 | 84.91±0.34 | 87.39±0.26 | 86.13±0.30 |
| BS [RoBERTa-base]† | 125.6 | 87.32±0.40 | 86.84±0.16 | 87.08±0.24 | 86.58±0.38 | 87.84±0.59 | 87.20±0.32 |
| Ours [RoBERTa-base] | 125.6 | 87.33±0.41 | 87.29±0.25 | 87.31±0.16 | 86.70±0.29 | 88.16±0.54 | 87.42±0.26 |
| w.o. CNN [RoBERTa-base] | 125.2 | 86.09±0.36 | 86.88±0.23 | 86.48±0.17 | 85.17±0.67 | 88.03±0.05 | 86.56±0.38 |
Table 1: Results on the ACE2004 and ACE2005 datasets. Models in the same block use the same data. Results are reported as mean±standard deviation over five runs. † means our reproduction with their publicly available code.

During inference, we calculate the scores in the upper triangle as:

$$\hat{P}_{ij}=(P_{ij}+P_{ji})/2,$$

where $i\leq j$. Only this upper-triangle score is used to get the final prediction. The decoding process generally follows Yu et al. (2020). We first prune out the non-entity spans (those for which none of the scores is above 0.5), then sort the remaining spans by their maximum entity score and pick them in this order; a span whose boundary clashes with an already selected span is ignored.
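A sketch of this decoding procedure for a single sentence is given below; it is our reading of the description above, not the released code:

```python
import torch

def decode(P: torch.Tensor, threshold: float = 0.5):
    # P: [n, n, |T|] span probabilities for one sentence.
    n = P.size(0)
    P_hat = (P + P.transpose(0, 1)) / 2              # average upper and lower triangles
    candidates = []
    for i in range(n):
        for j in range(i, n):                        # upper triangle only
            score, t = P_hat[i, j].max(dim=-1)
            if score.item() > threshold:             # prune non-entity spans
                candidates.append((score.item(), i, j, t.item()))
    candidates.sort(reverse=True)                    # highest maximum entity score first
    selected = []
    for score, i, j, t in candidates:
        # Keep the span unless its boundary crosses an already selected span
        # (nesting is allowed, partial overlap is not).
        if all(not (i < s <= j < e or s < i <= e < j) for s, e, _ in selected):
            selected.append((i, j, t))
    return selected
```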

4 Experiment

4.1 Experimental Setup

To verify the effectiveness of our proposed method, we conduct experiments on three widely used nested NER datasets: ACE 2004 Doddington et al. (2004) (https://catalog.ldc.upenn.edu/LDC2005T09), ACE 2005 Walker and Consortium (2005) (https://catalog.ldc.upenn.edu/LDC2006T06) and Genia Kim et al. (2003).

Besides, we choose recently published papers as our baselines. To make sure our experiments are strictly comparable to theirs, we asked the authors for their versions of the data. The data statistics for each paper are listed in the Appendix. For ACE2004 and ACE2005, although all of them used the document split suggested by Lu and Roth (2015), they used different sentence tokenizations, resulting in different numbers of sentences and entities. To facilitate future research on nested NER, we release the pre-processing code and fix some tokenization issues to avoid including unannotated text and dropping entities. For the Genia data, we fixed some annotation conflicts (the same sentence appearing with different entity annotations). We replicate each experiment five times and report the average performance with standard deviation.

| Model | # Param. (Million) | Genia P | Genia R | Genia F1 |
|---|---|---|---|---|
| Data from Li et al. (2022) | | | | |
| W2NER | 113.6 | 83.10 | 79.76 | 81.39 |
| Ours | 112.6 | 83.18±0.24 | 79.70±0.08 | 81.40±0.11 |
| w.o. CNN | 111.1 | 80.66±0.04 | 79.76±0.07 | 80.21±0.05 |
| Data from Wan et al. (2022) | | | | |
| SG | 112.7 | 77.92 | 80.74 | 79.30 |
| Ours | 112.2 | 81.05±0.48 | 77.87±0.65 | 79.42±0.20 |
| w.o. CNN | 111.1 | 78.60±0.41 | 78.35±0.52 | 78.47±0.16 |
| Data from Yuan et al. (2022) | | | | |
| Triaffine | 526.5 | 80.42 | 82.06 | 81.23 |
| Ours | 128.42 | 83.37±0.09 | 79.43±0.15 | 81.35±0.08 |
| w.o. CNN | 111.1 | 80.87±0.23 | 79.47±0.23 | 80.16±0.16 |
| Data from this work | | | | |
| W2NER† | 113.6 | 81.58±0.61 | 79.11±0.49 | 80.32±0.23 |
| Ours | 112.6 | 81.52±0.21 | 79.17±0.18 | 80.33±0.13 |
| w.o. CNN | 111.1 | 78.59±0.28 | 79.85±0.14 | 79.22±0.12 |
Table 2: Results on the Genia dataset. “W2NER”, “SG” and “Triaffine” are from Li et al. (2022), Wan et al. (2022) and Yuan et al. (2022), respectively; all models use BioBERT-base Lee et al. (2020). Results are reported as mean±standard deviation over five runs. † means our reproduction with their publicly available code.
| | FEP | FER | NEP | NER |
|---|---|---|---|---|
| ACE2004 | | | | |
| Ours | 86.9±0.2 | 87.3±0.5 | 88.4±0.6 | 88.8±0.9 |
| w.o. CNN | 86.3±0.8 | 86.8±0.3 | 89.4±0.8 | 86.6±1.3 |
| ACE2005 | | | | |
| Ours | 86.2±0.6 | 88.3±0.1 | 91.4±0.5 | 89.0±0.8 |
| w.o. CNN | 85.2±0.7 | 87.9±0.3 | 91.3±0.5 | 86.2±0.8 |
| Genia | | | | |
| Ours | 81.7±0.2 | 79.4±0.2 | 71.7±1.6 | 75.5±1.3 |
| w.o. CNN | 79.0±0.3 | 80.0±0.1 | 72.7±1.2 | 64.8±1.0 |
Table 3: The precision and recall for flat and nested entities on the test sets of the three datasets. FEP, FER, NEP and NER denote flat entity precision, flat entity recall, nested entity precision and nested entity recall, respectively. Compared with models without CNN (“w.o. CNN”), the recall for nested entities improves significantly when CNN is used. Results are reported as mean±standard deviation.

4.2 Main Results

Results for ACE2004 and ACE2005 are listed in Table 1, and results for Genia are listed in Table 2. When using the same data as previous work, our simple CNN model surpasses the baselines with fewer or a similar number of parameters, which shows that using CNN to model the interactions between neighboring spans is beneficial to the nested NER task. Besides, in the bottom block, we reproduce some baselines on our newly processed data to facilitate future comparison. Comparing the last block (data processed by us) with the upper blocks (data from previous work) shows that different tokenizations can indeed influence performance. Therefore, we call for using the same tokenization in future comparisons.

4.3 Why CNN Helps

To study why CNN boosts performance on the nested NER datasets, we split entities into two kinds: entities that overlap with other entities, and entities that do not. The results for FEP, FER, NEP and NER (the detailed calculation of the four metrics is described in Appendix D) are listed in Table 3. Compared with models without CNN, the NER (nested entity recall) of models with CNN improves by 2.2, 2.8 and 10.7 on ACE2004, ACE2005 and Genia, respectively. Namely, much of the performance improvement can be ascribed to finding more nested entities. This is expected, as CNN can more effectively exploit neighboring entities when they are nested.

5 Conclusion

In this paper, we propose applying CNN on the score matrix of a span-based NER model. Although this method is very simple, it achieves comparable or better performance than recently proposed methods. Analysis shows that exploiting the spatial correlations between neighboring spans through CNN helps the model find more nested entities. Experiments also show that different tokenizations indeed influence performance, so it is necessary to make sure all compared baselines use the same tokenization. To facilitate future comparison, we release a new pre-processing script for the three nested NER datasets.

References

  • Ba et al. (2016) Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
  • Doddington et al. (2004) George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie M. Strassel, and Ralph M. Weischedel. 2004. The automatic content extraction (ACE) program - tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, May 26-28, 2004, Lisbon, Portugal. European Language Resources Association.
  • Dozat and Manning (2017) Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
  • Eberts and Ulges (2020) Markus Eberts and Adrian Ulges. 2020. Span-based joint entity and relation extraction with transformer pre-training. In ECAI 2020 - 24th European Conference on Artificial Intelligence, 29 August-8 September 2020, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020), volume 325 of Frontiers in Artificial Intelligence and Applications, pages 2006–2013. IOS Press.
  • Fei et al. (2021) Hao Fei, Donghong Ji, Bobo Li, Yijiang Liu, Yafeng Ren, and Fei Li. 2021. Rethinking boundaries: End-to-end recognition of discontinuous mentions with pointer networks. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 12785–12793. AAAI Press.
  • Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.
  • Katiyar and Cardie (2018) Arzoo Katiyar and Claire Cardie. 2018. Nested named entity recognition revisited. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 861–871. Association for Computational Linguistics.
  • Kim et al. (2003) Jin-Dong Kim, Tomoko Ohta, Yuka Tateisi, and Jun’ichi Tsujii. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining. In Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology, June 29 - July 3, 2003, Brisbane, Australia, pages 180–182.
  • Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinform., 36(4):1234–1240.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.
  • Li et al. (2022) Jingye Li, Hao Fei, Jiang Liu, Shengqiong Wu, Meishan Zhang, Chong Teng, Donghong Ji, and Fei Li. 2022. Unified named entity recognition as word-word relation classification. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 10965–10973. AAAI Press.
  • Li et al. (2020) Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020. A unified MRC framework for named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5849–5859. Association for Computational Linguistics.
  • Lu and Roth (2015) Wei Lu and Dan Roth. 2015. Joint mention extraction and classification with mention hypergraphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 857–867. The Association for Computational Linguistics.
  • Lu et al. (2022) Yaojie Lu, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, and Hua Wu. 2022. Unified structure generation for universal information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 5755–5772. Association for Computational Linguistics.
  • Ma and Hovy (2016) Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
  • Muis and Lu (2016) Aldrian Obaja Muis and Wei Lu. 2016. Learning to recognize discontiguous entities. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 75–84. The Association for Computational Linguistics.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
  • Straková et al. (2019) Jana Straková, Milan Straka, and Jan Hajic. 2019. Neural architectures for nested NER through linearization. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 5326–5331. Association for Computational Linguistics.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
  • Walker and Consortium (2005) C. Walker and Linguistic Data Consortium. 2005. ACE 2005 Multilingual Training Corpus. LDC corpora. Linguistic Data Consortium.
  • Wan et al. (2022) Juncheng Wan, Dongyu Ru, Weinan Zhang, and Yong Yu. 2022. Nested named entity recognition with span-level graphs. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 892–903. Association for Computational Linguistics.
  • Wang and Lu (2018) Bailin Wang and Wei Lu. 2018. Neural segmental hypergraphs for overlapping mention recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 204–214. Association for Computational Linguistics.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  • Yan et al. (2019) Hang Yan, Bocao Deng, Xiaonan Li, and Xipeng Qiu. 2019. TENER: adapting transformer encoder for named entity recognition. CoRR, abs/1911.04474.
  • Yan et al. (2021) Hang Yan, Tao Gui, Junqi Dai, Qipeng Guo, Zheng Zhang, and Xipeng Qiu. 2021. A unified generative framework for various NER subtasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 5808–5822. Association for Computational Linguistics.
  • Yu et al. (2020) Juntao Yu, Bernd Bohnet, and Massimo Poesio. 2020. Named entity recognition as dependency parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 6470–6476. Association for Computational Linguistics.
  • Yuan et al. (2022) Zheng Yuan, Chuanqi Tan, Songfang Huang, and Fei Huang. 2022. Fusing heterogeneous factors with triaffine mechanism for nested named entity recognition. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 3174–3186. Association for Computational Linguistics.
  • Zhu and Li (2022) Enwei Zhu and Jinpeng Li. 2022. Boundary smoothing for named entity recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 7096–7108. Association for Computational Linguistics.
| Dataset | Version | Sent. #Train | Sent. #Dev | Sent. #Test | Sent. Avg. Len | #Ovlp. | Ment. #Train | Ment. #Dev | Ment. #Test | Ment. Avg. Len |
|---|---|---|---|---|---|---|---|---|---|---|
| ACE2004 | W2NER | 6,802 | 813 | 897 | 20.12 | 12,571 | 22,056 | 2,492 | 3,020 | 2.5 |
| ACE2004 | SG | 6,198 | 742 | 809 | 21.55 | 12,666 | 22,195 | 2,514 | 3,034 | 2.51 |
| ACE2004 | BS | 6,799 | 829 | 879 | 20.43 | 12,679 | 22,207 | 2,511 | 3,031 | 2.51 |
| ACE2004 | Ours | 6,297 | 742 | 824 | 23.52 | 12,690 | 22,231 | 2,514 | 3,036 | 2.64 |
| ACE2005 | W2NER | 7,606 | 1,002 | 1,089 | 17.77 | 12,179 | 24,366 | 3,188 | 2,989 | 2.26 |
| ACE2005 | SG | 7,285 | 968 | 1,058 | 18.60 | 12,316 | 24,700 | 3,218 | 3,029 | 2.26 |
| ACE2005 | BS | 7,336 | 958 | 1,047 | 18.90 | 12,313 | 24,687 | 3,217 | 3,027 | 2.26 |
| ACE2005 | Ours | 7,178 | 960 | 1,051 | 20.59 | 12,405 | 25,300 | 3,321 | 3,099 | 2.40 |
| Genia | W2NER | 15,023 | 1,669 | 1,854 | 25.41 | 10,263 | 45,144 | 5,365 | 5,506 | 1.97 |
| Genia | SG | 15,022 | 1,669 | 1,855 | 26.47 | 10,412 | 47,006 | 4,461 | 5,596 | 2.07 |
| Genia | Triaffine | 16,692 | - | 1,854 | 25.41 | 10,263 | 50,509 | - | 5,506 | 1.97 |
| Genia | Ours | 15,038 | 1,765 | 1,732 | 26.47 | 10,315 | 46,203 | 4,714 | 5,119 | 2.0 |
Table 4: The statistics of the data used in each paper. “W2NER”, “SG”, “BS” and “Triaffine” are from Li et al. (2022), Wan et al. (2022), Zhu and Li (2022) and Yuan et al. (2022), respectively. (The number of entities for W2NER differs from that reported in their paper because we found some duplicated entities in their data.) Different papers used different sentence tokenizations for ACE2004 and ACE2005, resulting in different numbers of sentences in each split. To facilitate future comparison, we open-source a pre-processing script to prepare ACE2004 and ACE2005. Previously, some entities were dropped by sentence tokenization; we avoid sentence tokenization within an entity, which results in more entities. For Genia, different papers used different train/dev/test splits. Besides, the Genia data has conflicting annotations, and we remove these sentences. The data annotated with “Ours” is obtained by our pre-processing code.

Appendix A Multi-head Biaffine Decoder

The input of the multi-head Biaffine decoder is two matrices $\mathbf{H}_s, \mathbf{H}_e\in\mathcal{R}^{n\times h}$, and the output is $\mathbf{R}\in\mathcal{R}^{n\times n\times r}$. The multi-head Biaffine decoder is formulated as follows:

$$\begin{aligned}
\mathbf{S}_1[i,j] &= (\mathbf{H}_s[i] \oplus \mathbf{H}_e[j] \oplus \mathbf{w}_{i-j})W,\\
\{\mathbf{H}_s^{(k)}\}, \{\mathbf{H}_e^{(k)}\} &= \mathrm{Split}(\mathbf{H}_s), \mathrm{Split}(\mathbf{H}_e),\\
\mathbf{S}_2^{(k)}[i,j] &= \mathbf{H}_s^{(k)}[i]\,U\,\mathbf{H}_e^{(k)}[j]^{T},\\
\mathbf{S}_2 &= \mathrm{Concat}(\mathbf{S}_2^{(1)}, \ldots, \mathbf{S}_2^{(K)}),\\
\mathbf{R} &= \mathbf{S}_1 + \mathbf{S}_2,
\end{aligned}$$

where $\mathbf{H}_s, \mathbf{H}_e\in\mathcal{R}^{n\times h}$, $h$ is the hidden size, $\mathbf{w}_{i-j}\in\mathcal{R}^{c}$ is the span length embedding for length $i-j$, $W\in\mathcal{R}^{(2h+c)\times r}$, and $\mathbf{S}_1\in\mathcal{R}^{n\times n\times r}$ with $r$ the biaffine feature size. $\mathrm{Split}(\cdot)$ equally splits a matrix along the last dimension, so $\mathbf{H}_s^{(k)}, \mathbf{H}_e^{(k)}\in\mathcal{R}^{n\times h_k}$, where $h_k$ is the hidden size of each head; $U\in\mathcal{R}^{h_k\times r_k\times h_k}$, $\mathbf{S}_2\in\mathcal{R}^{n\times n\times r}$, and $\mathbf{R}\in\mathcal{R}^{n\times n\times r}$.

We did not use multiple heads for $W$, because it does not occupy many parameters, and using multiple heads for $W$ slightly harms the performance.
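A PyTorch sketch of these equations is given below. The parameter names ($W$, $U$, $\mathbf{w}_{i-j}$) follow the formulation above, while the span length embedding table size (`max_len`) and the initialization are our assumptions:

```python
import torch
import torch.nn as nn

class MultiHeadBiaffine(nn.Module):
    def __init__(self, h: int, r: int, n_heads: int, c: int = 50, max_len: int = 512):
        super().__init__()
        assert h % n_heads == 0 and r % n_heads == 0
        self.n_heads, self.h_k, self.r_k = n_heads, h // n_heads, r // n_heads
        self.W = nn.Linear(2 * h + c, r)             # projects [H_s[i]; H_e[j]; w_{i-j}] to S_1
        self.len_emb = nn.Embedding(2 * max_len, c)  # w_{i-j}, index shifted to be non-negative
        self.U = nn.Parameter(torch.randn(n_heads, self.h_k, self.r_k, self.h_k) * 0.02)
        self.max_len = max_len

    def forward(self, Hs: torch.Tensor, He: torch.Tensor) -> torch.Tensor:
        b, n, _ = Hs.shape  # Hs, He: [batch, n, h]
        idx = torch.arange(n, device=Hs.device)
        w = self.len_emb(idx[:, None] - idx[None, :] + self.max_len)  # [n, n, c]
        S1 = self.W(torch.cat([Hs[:, :, None].expand(-1, -1, n, -1),
                               He[:, None, :].expand(-1, n, -1, -1),
                               w[None].expand(b, -1, -1, -1)], dim=-1))  # [b, n, n, r]
        hs = Hs.view(b, n, self.n_heads, self.h_k)
        he = He.view(b, n, self.n_heads, self.h_k)
        # Per-head bilinear scores, then concatenate the heads to form S_2.
        S2 = torch.einsum('bikh,khrg,bjkg->bijkr', hs, self.U, he).reshape(b, n, n, -1)
        return S1 + S2  # R: [b, n, n, r]
```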

Appendix B Data

We list the statistics of each dataset in Table 4. As the table shows, the numbers of sentences, and even the numbers of entities, differ across papers; therefore, it is not fair to compare results directly. For ACE2004 and ACE2005, we release the pre-processing code that prepares the data from the LDC files, and we make sure no entities are dropped because of sentence tokenization. Thus, the ACE2004 and ACE2005 data pre-processed by this work have the most entities in Table 4. For Genia, we appeal for the usage of our train/dev/test split, which we release within the code repository. Moreover, in order to facilitate document-level NER study, we split the Genia dataset by documents, so that sentences in the train/dev/test splits come from disjoint documents; the document ratio for train/dev/test is 8:1:1. Besides, we found one conflicting document annotation in Genia, and we fixed this conflict. After comparing different versions of Genia, we found that W2NER Li et al. (2022) and Triaffine Yuan et al. (2022) dropped the spans with more than one entity tag (there are 31 such entities); thus, they have fewer nested entities than ours. SG Wan et al. (2022) includes discontinuous entities, so it has more nested entities than ours.

Appendix C Implementation Details

We used the AdamW optimizer to optimize the model and the transformers package for the pre-trained models Wolf et al. (2020). The hyper-parameter ranges used in this paper are listed in Table 5.
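A sketch of this setup is shown below, assuming the warmup factor in Table 5 denotes a linear warmup over the first 10% of training steps (the scheduler choice is our assumption):

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, lr: float, total_steps: int, warmup_factor: float = 0.1):
    # AdamW as stated above; the linear decay after warmup is assumed.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_factor * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```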

| Hyper-parameter | ACE2004 | ACE2005 | Genia |
|---|---|---|---|
| # Epoch | 50 | 50 | 5 |
| Learning rate | 2e-5 | 2e-5 | 7e-6 |
| Batch size | 48 | 48 | 8 |
| # CNN blocks | [2, 3] | [2, 3] | 3 |
| CNN kernel size | 3 | 3 | 3 |
| CNN channel dim. | [120, 200] | [120, 200] | 200 |
| # Heads | [1, 5] | [1, 5] | 4 |
| Hidden size $h$ | 200 | 200 | 400 |
| Warmup factor | 0.1 | 0.1 | 0.1 |
Table 5: The hyper-parameters used in this paper. Bracketed entries denote the search range.
| | # Ent. | # Flat Ent. | # Nested Ent. |
|---|---|---|---|
| ACE2004 | 3,036 | 1,614 | 1,422 |
| ACE2005 | 3,099 | 1,913 | 1,186 |
| Genia | 5,119 | 3,963 | 1,156 |
Table 6: The flat and nested entity statistics in the test set of each dataset.

Appendix D FEP FER NEP NER

We split entities into two kinds based on whether they overlap with other entities; the statistics for each dataset are listed in Table 6. When calculating the flat entity precision (FEP), we take all flat entities in the prediction and compute the proportion of them that appear in the gold annotations. For the flat entity recall (FER), we take all flat entities in the gold annotations and compute the proportion of them that appear in the prediction. The nested entity precision (NEP) and nested entity recall (NER) are computed analogously for nested entities.
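One possible implementation of the four metrics is sketched below; entities are (start, end, type) tuples, and we read “overlap” as interval intersection with any other entity in the same set (an assumption on our part):

```python
def flat_nested_prf(pred: set, gold: set):
    # Split a set of entities into flat (no overlap) and nested (overlapping).
    def split(ents):
        def overlaps(a, b):
            return a != b and not (a[1] < b[0] or b[1] < a[0])
        nested = {e for e in ents if any(overlaps(e, o) for o in ents)}
        return ents - nested, nested

    pred_flat, pred_nested = split(pred)
    gold_flat, gold_nested = split(gold)
    fep = len(pred_flat & gold) / max(len(pred_flat), 1)      # flat entity precision
    fer = len(gold_flat & pred) / max(len(gold_flat), 1)      # flat entity recall
    nep = len(pred_nested & gold) / max(len(pred_nested), 1)  # nested entity precision
    ner = len(gold_nested & pred) / max(len(gold_nested), 1)  # nested entity recall
    return fep, fer, nep, ner
```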