An Embarrassingly Easy but Strong Baseline for Nested Named Entity Recognition
Abstract
Named entity recognition (NER) is the task of detecting and classifying entity spans in text. When entity spans overlap with each other, the problem is referred to as nested NER. Span-based methods have been widely used to tackle nested NER. Most of these methods produce an $n \times n$ score matrix, where $n$ is the length of the sentence and each entry corresponds to a span. However, previous work ignores the spatial relations in the score matrix. In this paper, we propose using a Convolutional Neural Network (CNN) to model these spatial relations in the score matrix. Despite being simple, experiments on three commonly used nested NER datasets show that our model surpasses several recently proposed methods with the same pre-trained encoders. Further analysis shows that using a CNN helps the model find more nested entities. Besides, we found that different papers used different sentence tokenizations for the three nested NER datasets, which influences the comparison. Thus, we release a pre-processing script to facilitate future comparison (code is available at https://github.com/yhcc/CNN_Nested_NER).
1 Introduction
Named Entity Recognition (NER) is the task of extracting entities from raw text. It has been a fundamental task in the Natural Language Processing (NLP) field. Previously, this task was mainly solved under the sequence labeling paradigm by assigning a label to each token (Huang et al., 2015; Ma and Hovy, 2016; Yan et al., 2019). However, this method is not directly applicable to the nested NER scenario, since a token may be included in two or more entities. To overcome this issue, the span-based method, which assigns labels to each span, was introduced (Eberts and Ulges, 2020; Li et al., 2020; Yu et al., 2020).
Eberts and Ulges (2020) used a pooling method over token representations to get the span representation, and then conducted classification on this span representation. Li et al. (2020) transformed the NER task into a Machine Reading Comprehension form: they used the entity type as the query and asked the model to select the spans that belong to this entity type. Yu et al. (2020) utilized the Biaffine decoder from dependency parsing (Dozat and Manning, 2017) to convert span classification into classifying pairs of start and end tokens. However, these works did not take advantage of the spatial correlations between adjacent spans.

As depicted in Figure 1, the spans surrounding a given span have special relationships with it. It should be beneficial if we can leverage these spatial correlations. In this paper, we use the Biaffine decoder (Dozat and Manning, 2017) to get a 3D feature matrix in which each entry represents one span. We then view this feature matrix as an image and utilize a Convolutional Neural Network (CNN) to model the local interactions between spans.
We compare this simple method with recently proposed methods (Wan et al., 2022; Li et al., 2022; Zhu and Li, 2022; Yuan et al., 2022). To make sure our method is strictly comparable to theirs, we asked the authors for their versions of the data. Although all of them used the same datasets, we found that the statistics, such as the numbers of sentences and entities, were not the same. This was caused by the use of distinct sentence tokenization methods, which influences performance, as shown in our experiments. To facilitate future comparison, we release a pre-processing script (https://github.com/yhcc/CNN_Nested_NER/tree/master/preprocess) for the ACE2004, ACE2005, and Genia datasets.
Our contributions can be summarized as follows.

- We find that adjacent spans have special correlations with each other, and we propose using a CNN to model the interactions between them. Despite being very simple, this achieves a considerable performance boost on three widely used nested NER datasets.
- We release a pre-processing script for the three nested NER datasets to facilitate direct and fair comparison.
- The way we view the span feature matrix as an image should shed some light on future exploration of span-based methods for the nested NER task.
2 Related Work
Four kinds of paradigms have previously been proposed to solve the nested NER task.
The first one is the sequence labeling framework (Straková et al., 2019): since one token can be contained in more than one entity, the Cartesian product of the entity labels is used. However, the Cartesian labels suffer from the long-tail issue.
The second one is to use a hypergraph to efficiently represent spans (Lu and Roth, 2015; Muis and Lu, 2016; Katiyar and Cardie, 2018; Wang and Lu, 2018). The shortcoming of this method is its complex decoding.
The third one is the sequence-to-sequence (Seq2Seq) framework (Sutskever et al., 2014; Lewis et al., 2020; Raffel et al., 2020), which generates the entity sequence. The entity sequence can be an entity pointer sequence (Yan et al., 2021; Fei et al., 2021) or an entity text sequence (Lu et al., 2022). Nevertheless, the Seq2Seq method suffers from time-consuming decoding.
The fourth one is to conduct span classification. Eberts and Ulges (2020) proposed to enumerate all possible spans within a sentence and use a pooling method to get the span representation, while Yu et al. (2020) proposed to use the start and end tokens of a span to pinpoint the span and use the Biaffine decoder to get the scores for each span. Span-based methods are friendly to parallelism, and their decoding is easy. Therefore, this formulation has been widely adopted (Wan et al., 2022; Zhu and Li, 2022; Li et al., 2022; Yuan et al., 2022). However, the relations between neighboring spans were ignored in previous work.

3 Proposed Method
In this section, we first introduce the nested NER task and then describe how to get the feature matrix. After that, we present the CNN module that models the spatial correlations on the feature matrix. The general framework of our proposed method is shown in Figure 2.
3.1 Nested NER Task
Given an input sentence $X = [x_1, x_2, \ldots, x_n]$ with $n$ tokens, the nested NER task aims to extract all entities in $X$. Each entity can be expressed as a tuple $(s_i, e_i, t_i)$, where $s_i$ and $e_i$ are the start and end indices of the entity, and $t_i \in \{1, \ldots, |T|\}$ is its entity type, with $|T|$ being the number of entity types. As the task name suggests, entities may overlap with each other, but different entities are not allowed to have crossing boundaries. For a sentence with $n$ tokens, there are $n(n+1)/2$ valid spans.
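As a toy illustration (not from the released code), the valid spans of a three-token sentence can be enumerated as follows:

```python
# A toy illustration (not from the released code): enumerating the
# n(n+1)/2 valid spans of a sentence, where a span (s, e) with s <= e
# covers tokens x_s ... x_e inclusively.
tokens = ["New", "York", "University"]
n = len(tokens)

spans = [(s, e) for s in range(n) for e in range(s, n)]
assert len(spans) == n * (n + 1) // 2  # 6 valid spans for n = 3

for s, e in spans:
    print((s, e), " ".join(tokens[s:e + 1]))
```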
3.2 Span-based Method for Nested NER
We follow Yu et al. (2020) and formulate this task as a span classification task. Namely, for each valid span, the model assigns an entity label to it. The method first uses an encoder to encode the input sentence as follows:

$$\mathbf{H} = \mathrm{Encoder}(X),$$

where $\mathbf{H} \in \mathbb{R}^{n \times d}$ and $d$ is the hidden size. Various pre-trained models, such as BERT (Devlin et al., 2019), are usually used as the encoder. For a word tokenized into several word pieces, we use max-pooling to aggregate its pieces' hidden states.
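A minimal sketch of this aggregation step (names like `pool_word_states` and `word_to_pieces` are ours for illustration, not the repo's API):

```python
import torch

# A minimal sketch: `hidden` is the encoder output over word pieces; each
# word's state is the element-wise max over its pieces' hidden states.
def pool_word_states(hidden: torch.Tensor, word_to_pieces: list) -> torch.Tensor:
    # hidden: [num_pieces, d] -> returns [num_words, d]
    return torch.stack([hidden[idx].max(dim=0).values for idx in word_to_pieces])

hidden = torch.randn(5, 8)              # e.g., 5 word pieces, hidden size 8
word_to_pieces = [[0], [1, 2], [3, 4]]  # word 1 was split into pieces 1 and 2
H = pool_word_states(hidden, word_to_pieces)
print(H.shape)  # torch.Size([3, 8])
```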
After getting the contextualized token embeddings, previous work usually concatenates them with static word embeddings and character embeddings, and then feeds this combined embedding into a BiLSTM layer (Yu et al., 2020; Wan et al., 2022; Yuan et al., 2022). To keep the model less cluttered, we use neither the additional embeddings nor the BiLSTM layer.
Next, we use a multi-head Biaffine decoder (Dozat and Manning, 2017; Vaswani et al., 2017) to get the score matrix $\mathbf{R}$ as follows:

$$\mathbf{H}_s = \mathrm{MLP}_{\mathrm{start}}(\mathbf{H}), \quad \mathbf{H}_e = \mathrm{MLP}_{\mathrm{end}}(\mathbf{H}), \quad \mathbf{R} = \mathrm{MHBiaffine}(\mathbf{H}_s, \mathbf{H}_e),$$

where $\mathbf{H}_s, \mathbf{H}_e \in \mathbb{R}^{n \times h}$, $h$ is the hidden size, $\mathrm{MHBiaffine}(\cdot,\cdot)$ is the multi-head Biaffine decoder (its detailed description is in the Appendix), and $\mathbf{R} \in \mathbb{R}^{n \times n \times r}$, where $r$ is the feature size. Each cell $(i, j)$ of $\mathbf{R}$ can be seen as the feature vector $\mathbf{r} \in \mathbb{R}^r$ for a span. For the lower triangle of $\mathbf{R}$ (where $i > j$), the span contains words from the $j$-th to the $i$-th (therefore, an off-diagonal span has two entries).
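For intuition, here is a single-head simplification of this step (a sketch with our own class names; the multi-head version is described in Appendix A):

```python
import torch
import torch.nn as nn

# A single-head simplification (our naming). Cell (i, j) of the output R
# holds the feature vector of span (i, j).
class Biaffine(nn.Module):
    def __init__(self, h: int, r: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(h, r, h) * 0.02)

    def forward(self, Hs: torch.Tensor, He: torch.Tensor) -> torch.Tensor:
        # R[i, j, :] = Hs[i] U He[j]
        return torch.einsum("ih,hrk,jk->ijr", Hs, self.U, He)

n, d, h, r = 6, 16, 8, 4
H = torch.randn(n, d)            # encoder output
Hs = nn.Linear(d, h)(H)          # start-oriented MLP
He = nn.Linear(d, h)(H)          # end-oriented MLP
R = Biaffine(h, r)(Hs, He)
print(R.shape)  # torch.Size([6, 6, 4])
```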
3.3 CNN on Score Matrix
As shown in Figure 1, a cell has relations with the cells around it. Therefore, we propose using a CNN to model these interactions. We repeat the following CNN block several times in our model:

$$\mathbf{R}' = \mathrm{Conv2d}(\mathbf{R}), \quad \mathbf{R}'' = \mathrm{GeLU}(\mathrm{LayerNorm}(\mathbf{R}' + \mathbf{R})),$$

where $\mathrm{Conv2d}$, $\mathrm{LayerNorm}$, and $\mathrm{GeLU}$ are the 2D CNN, layer normalization (Ba et al., 2016), and the GeLU activation function (Hendrycks and Gimpel, 2016), respectively. The layer normalization is conducted over the feature dimension. A noticeable fact here is that, since the number of tokens varies across sentences, the $\mathbf{R}$s are of different shapes. To make sure the results are the same when $\mathbf{R}$ is processed in a batch, the 2D CNN has no bias term, and all paddings in $\mathbf{R}$ are filled with 0.
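A minimal PyTorch sketch of one such block under these constraints (our own implementation, with an assumed `mask` tensor marking real cells; the released code may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# `mask` marks the real (non-padding) cells of R, so padded cells stay
# zero and batched results match the unbatched ones.
class CNNBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # bias-free convolution, as required for padding invariance
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.norm = nn.LayerNorm(channels)  # normalizes the feature dim

    def forward(self, R: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # R: [batch, channels, n, n]; mask: [batch, 1, n, n], 1 = real cell
        R = R * mask                       # keep paddings filled with 0
        out = self.conv(R) + R             # R' + R (residual connection)
        out = out.permute(0, 2, 3, 1)      # move channels last for LayerNorm
        out = F.gelu(self.norm(out))       # R'' = GeLU(LayerNorm(R' + R))
        return out.permute(0, 3, 1, 2) * mask

block = CNNBlock(channels=16)
R = torch.randn(2, 16, 10, 10)
mask = torch.ones(2, 1, 10, 10)
print(block(R, mask).shape)  # torch.Size([2, 16, 10, 10])
```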
After passing through several CNN blocks, $\mathbf{R}''$ is further processed by another 2D CNN module.
3.4 The Output
We use a perceptron to get the prediction scores $\mathbf{P}$ as follows:

$$\mathbf{P} = \mathrm{Sigmoid}(W_o(\mathbf{R} + \mathbf{R}'') + \mathbf{b}),$$

where $W_o \in \mathbb{R}^{|T| \times r}$, $\mathbf{b} \in \mathbb{R}^{|T|}$, and $\mathbf{P} \in \mathbb{R}^{n \times n \times |T|}$. We did not use the Softmax because, in rare cases (such as in the ACE2005 and Genia datasets), one span can have more than one entity tag. We then use the binary cross entropy to calculate the loss as

$$\mathcal{L}_{\mathrm{BCE}} = -\sum_{0 \leq i, j < n} \left[ y_{ij} \log(\mathbf{P}_{ij}) + (1 - y_{ij}) \log(1 - \mathbf{P}_{ij}) \right].$$

Unlike previous work that only uses the upper triangle part to get the loss (Yu et al., 2020; Zhu and Li, 2022), we use both the upper and lower triangles to calculate the loss. The reason is that, in order to conduct batch computation, we cannot compute the upper triangle part alone. Since the lower triangle part has been computed anyway, we also use it for the output. The tag matrix for the score matrix is symmetric; namely, the tag in the $(i, j)$-th entry is the same as that in the $(j, i)$-th.
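The loss computation might therefore look like the following sketch (shapes and names are ours; the point is that the target matrix is filled symmetrically and both triangles contribute):

```python
import torch
import torch.nn.functional as F

# The binary cross entropy is averaged over BOTH triangles of the matrix.
n, num_types = 4, 3
logits = torch.randn(n, n, num_types)   # W_o(R + R'') + b, before Sigmoid

target = torch.zeros(n, n, num_types)
target[0, 2, 1] = 1                     # gold span (0, 2) with type 1 ...
target[2, 0, 1] = 1                     # ... mirrored in the lower triangle

loss = F.binary_cross_entropy_with_logits(logits, target)
print(loss.item())
```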
Table 1: Experimental results on ACE2004 and ACE2005. Our results are averaged over five runs and reported as mean±standard deviation.

| Model | # Param. (Million) | ACE2004 P | R | F1 | ACE2005 P | R | F1 |
| Data from Li et al. (2022) | | | | | | | |
| W2NER (Li et al., 2022) [BERT-large] | 355.4 | 87.33 | 87.71 | 87.52 | 85.03 | 88.62 | 86.79 |
| Ours [BERT-large] | 345.1 | 87.82±0.38 | 87.40±0.20 | 87.61±0.18 | 86.39±0.61 | 87.24±0.34 | 86.82±0.45 |
| w.o. CNN [BERT-large] | 343.6 | 86.54±0.48 | 87.09±0.41 | 86.81±0.21 | 84.88±0.26 | 86.99±0.33 | 85.92±0.27 |
| Data from Wan et al. (2022) | | | | | | | |
| SG (Wan et al., 2022) [BERT-base] | 112.3 | 86.70 | 85.93 | 86.31 | 84.37 | 85.87 | 85.11 |
| Ours [BERT-base] | 110.5 | 86.85±0.61 | 86.45±0.36 | 86.65±0.22 | 84.94±0.49 | 85.40±0.27 | 85.16±0.16 |
| w.o. CNN [BERT-base] | 109.1 | 85.79±0.46 | 85.78±0.12 | 85.78±0.22 | 82.91±0.21 | 84.89±0.23 | 83.89±0.16 |
| Data from Zhu and Li (2022) | | | | | | | |
| BS (Zhu and Li, 2022) [RoBERTa-base] | 125.6 | 88.43 | 87.53 | 87.98 | 86.25 | 88.07 | 87.15 |
| Ours [RoBERTa-base] | 125.6 | 87.77±0.27 | 88.28±0.36 | 88.03±0.14 | 86.58±0.78 | 87.94±0.46 | 87.25±0.48 |
| w.o. CNN [RoBERTa-base] | 125.2 | 86.71±0.27 | 87.40±0.42 | 87.05±0.18 | 85.48±0.39 | 87.54±0.59 | 86.50±0.26 |
| Data from this work | | | | | | | |
| W2NER [BERT-large] | 355.4 | 87.17±0.11 | 87.70±0.19 | 87.43±0.11 | 85.78±0.30 | 87.81±0.24 | 86.77±0.21 |
| Ours [BERT-large] | 345.1 | 87.98±0.30 | 87.50±0.22 | 87.74±0.16 | 86.26±0.65 | 87.56±0.31 | 86.91±0.23 |
| w.o. CNN [BERT-large] | 343.6 | 86.60±0.68 | 86.48±0.36 | 86.54±0.19 | 84.91±0.34 | 87.39±0.26 | 86.13±0.30 |
| BS [RoBERTa-base] | 125.6 | 87.32±0.40 | 86.84±0.16 | 87.08±0.24 | 86.58±0.38 | 87.84±0.59 | 87.20±0.32 |
| Ours [RoBERTa-base] | 125.6 | 87.33±0.41 | 87.29±0.25 | 87.31±0.16 | 86.70±0.29 | 88.16±0.54 | 87.42±0.26 |
| w.o. CNN [RoBERTa-base] | 125.2 | 86.09±0.36 | 86.88±0.23 | 86.48±0.17 | 85.17±0.67 | 88.03±0.05 | 86.56±0.38 |
During inference, we calculate the scores in the upper triangle part as

$$\hat{\mathbf{P}}_{ij} = (\mathbf{P}_{ij} + \mathbf{P}_{ji})/2,$$

where $i \leq j$. Then we only use this upper triangle score to get the final prediction. The decoding process generally follows Yu et al. (2020)'s method. We first prune out the non-entity spans (spans none of whose entity scores exceeds 0.5), then we sort the remaining spans in descending order of their maximum entity score and pick spans in this order; a span whose boundaries clash with those of already-selected spans is ignored.
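A sketch of this decoding procedure (our reading of it, not necessarily the repo's implementation):

```python
import torch

# Spans are kept greedily by score; a candidate is dropped when it
# partially overlaps an already-selected span without nesting.
def decode(P: torch.Tensor, threshold: float = 0.5):
    n = P.size(0)
    P = (P + P.transpose(0, 1)) / 2            # \hat{P}_ij = (P_ij + P_ji) / 2
    candidates = []
    for i in range(n):
        for j in range(i, n):                  # upper triangle only
            score, label = P[i, j].max(dim=-1)
            if score > threshold:              # prune non-entity spans
                candidates.append((score.item(), i, j, label.item()))
    candidates.sort(reverse=True)              # highest maximum score first
    selected = []
    for _, s, e, t in candidates:
        clash = any(s < s2 <= e < e2 or s2 < s <= e2 < e
                    for s2, e2, _ in selected)
        if not clash:
            selected.append((s, e, t))
    return selected

P = torch.sigmoid(torch.randn(5, 5, 3))        # probabilities for 3 types
print(decode(P))
```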
4 Experiment
4.1 Experimental Setup
To verify the effectiveness of our proposed method, we conduct experiments on three widely used nested NER datasets: ACE2004 (https://catalog.ldc.upenn.edu/LDC2005T09; Doddington et al., 2004), ACE2005 (https://catalog.ldc.upenn.edu/LDC2006T06; Walker and Consortium, 2005), and Genia (Kim et al., 2003).
Besides, we choose recently published papers as our baselines. To make sure our experiments are strictly comparable to theirs, we asked the authors for their versions of the data. The data statistics for each paper are listed in the Appendix. For ACE2004 and ACE2005, although all of them used the document split suggested by Lu and Roth (2015), they used different sentence tokenizations, resulting in different numbers of sentences and entities. To facilitate future research on nested NER, we release the pre-processing code and fix some tokenization issues to avoid including unannotated text and dropping entities. For the Genia data, we fixed some annotation conflicts (the same sentence with different entity annotations). We replicate each experiment five times and report its average performance with standard deviation.
Table 2: Experimental results on Genia. Our results are averaged over five runs and reported as mean±standard deviation.

| Model | # Param. (Million) | P | R | F1 |
| Data from Li et al. (2022) | | | | |
| W2NER | 113.6 | 83.10 | 79.76 | 81.39 |
| Ours | 112.6 | 83.18±0.24 | 79.70±0.08 | 81.40±0.11 |
| w.o. CNN | 111.1 | 80.66±0.04 | 79.76±0.07 | 80.21±0.05 |
| Data from Wan et al. (2022) | | | | |
| SG | 112.7 | 77.92 | 80.74 | 79.30 |
| Ours | 112.2 | 81.05±0.48 | 77.87±0.65 | 79.42±0.20 |
| w.o. CNN | 111.1 | 78.60±0.41 | 78.35±0.52 | 78.47±0.16 |
| Data from Yuan et al. (2022) | | | | |
| Triaffine | 526.5 | 80.42 | 82.06 | 81.23 |
| Ours | 128.42 | 83.37±0.09 | 79.43±0.15 | 81.35±0.08 |
| w.o. CNN | 111.1 | 80.87±0.23 | 79.47±0.23 | 80.16±0.16 |
| Data from this work | | | | |
| W2NER | 113.6 | 81.58±0.61 | 79.11±0.49 | 80.32±0.23 |
| Ours | 112.6 | 81.52±0.21 | 79.17±0.18 | 80.33±0.13 |
| w.o. CNN | 111.1 | 78.59±0.28 | 79.85±0.14 | 79.22±0.12 |
Table 3: Precision and recall on flat and nested entities (FEP, FER, NEP, and NER; see Appendix D), reported as mean±standard deviation.

| | FEP | FER | NEP | NER |
| ACE2004 | | | | |
| Ours | 86.9±0.2 | 87.3±0.5 | 88.4±0.6 | 88.8±0.9 |
| w.o. CNN | 86.3±0.8 | 86.8±0.3 | 89.4±0.8 | 86.6±1.3 |
| ACE2005 | | | | |
| Ours | 86.2±0.6 | 88.3±0.1 | 91.4±0.5 | 89.0±0.8 |
| w.o. CNN | 85.2±0.7 | 87.9±0.3 | 91.3±0.5 | 86.2±0.8 |
| Genia | | | | |
| Ours | 81.7±0.2 | 79.4±0.2 | 71.7±1.6 | 75.5±1.3 |
| w.o. CNN | 79.0±0.3 | 80.0±0.1 | 72.7±1.2 | 64.8±1.0 |
4.2 Main Results
Results for ACE2004 and ACE2005 are listed in Table 1, and results for Genia are listed in Table 2. When using the same data as previous work, our simple CNN model surpasses the baselines with a smaller or similar number of parameters, which shows that using a CNN to model the interactions between neighboring spans is beneficial to the nested NER task. Besides, in the bottom block, we reproduce some baselines on our newly processed data to facilitate future comparison. Comparing the last block (data processed by us) with the upper blocks (data from previous work), different tokenizations can indeed influence performance. Therefore, we appeal for the use of the same tokenization in future comparisons.
4.3 Why CNN Helps
To study why the CNN can boost performance on the nested NER datasets, we split entities into two kinds: entities that overlap with other entities, and entities that do not. The results for FEP, FER, NEP, and NER (the detailed calculation of these four metrics is described in the Appendix) are listed in Table 3. Compared with the models without CNN, the NER (nested entity recall) of the models with CNN improved by 2.2, 2.8, and 10.7 points on ACE2004, ACE2005, and Genia, respectively. Namely, much of the performance improvement can be ascribed to finding more nested entities. This is expected, as the CNN can be more effective at exploiting neighboring entities when they are nested.
5 Conclusion
In this paper, we propose using a CNN on the score matrix of a span-based NER model. Although this method is very simple, it achieves comparable or better performance than recently proposed methods. Analysis shows that exploiting the spatial correlations between neighboring spans through a CNN can help the model find more nested entities. Experiments also show that different tokenizations indeed influence performance; it is therefore necessary to make sure all compared baselines use the same tokenization. To facilitate future comparison, we release a new pre-processing script for the three nested NER datasets.
References
- Ba et al. (2016) Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
- Doddington et al. (2004) George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie M. Strassel, and Ralph M. Weischedel. 2004. The automatic content extraction (ACE) program - tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, May 26-28, 2004, Lisbon, Portugal. European Language Resources Association.
- Dozat and Manning (2017) Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
- Eberts and Ulges (2020) Markus Eberts and Adrian Ulges. 2020. Span-based joint entity and relation extraction with transformer pre-training. In ECAI 2020 - 24th European Conference on Artificial Intelligence, 29 August-8 September 2020, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020), volume 325 of Frontiers in Artificial Intelligence and Applications, pages 2006–2013. IOS Press.
- Fei et al. (2021) Hao Fei, Donghong Ji, Bobo Li, Yijiang Liu, Yafeng Ren, and Fei Li. 2021. Rethinking boundaries: End-to-end recognition of discontinuous mentions with pointer networks. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 12785–12793. AAAI Press.
- Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415.
- Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.
- Katiyar and Cardie (2018) Arzoo Katiyar and Claire Cardie. 2018. Nested named entity recognition revisited. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 861–871. Association for Computational Linguistics.
- Kim et al. (2003) Jin-Dong Kim, Tomoko Ohta, Yuka Tateisi, and Jun’ichi Tsujii. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining. In Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology, June 29 - July 3, 2003, Brisbane, Australia, pages 180–182.
- Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinform., 36(4):1234–1240.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.
- Li et al. (2022) Jingye Li, Hao Fei, Jiang Liu, Shengqiong Wu, Meishan Zhang, Chong Teng, Donghong Ji, and Fei Li. 2022. Unified named entity recognition as word-word relation classification. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 10965–10973. AAAI Press.
- Li et al. (2020) Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020. A unified MRC framework for named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5849–5859. Association for Computational Linguistics.
- Lu and Roth (2015) Wei Lu and Dan Roth. 2015. Joint mention extraction and classification with mention hypergraphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 857–867. The Association for Computational Linguistics.
- Lu et al. (2022) Yaojie Lu, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, and Hua Wu. 2022. Unified structure generation for universal information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 5755–5772. Association for Computational Linguistics.
- Ma and Hovy (2016) Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
- Muis and Lu (2016) Aldrian Obaja Muis and Wei Lu. 2016. Learning to recognize discontiguous entities. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 75–84. The Association for Computational Linguistics.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
- Straková et al. (2019) Jana Straková, Milan Straka, and Jan Hajic. 2019. Neural architectures for nested NER through linearization. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 5326–5331. Association for Computational Linguistics.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
- Walker and Consortium (2005) C. Walker and Linguistic Data Consortium. 2005. ACE 2005 Multilingual Training Corpus. LDC corpora. Linguistic Data Consortium.
- Wan et al. (2022) Juncheng Wan, Dongyu Ru, Weinan Zhang, and Yong Yu. 2022. Nested named entity recognition with span-level graphs. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 892–903. Association for Computational Linguistics.
- Wang and Lu (2018) Bailin Wang and Wei Lu. 2018. Neural segmental hypergraphs for overlapping mention recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 204–214. Association for Computational Linguistics.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- Yan et al. (2019) Hang Yan, Bocao Deng, Xiaonan Li, and Xipeng Qiu. 2019. TENER: adapting transformer encoder for named entity recognition. CoRR, abs/1911.04474.
- Yan et al. (2021) Hang Yan, Tao Gui, Junqi Dai, Qipeng Guo, Zheng Zhang, and Xipeng Qiu. 2021. A unified generative framework for various NER subtasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 5808–5822. Association for Computational Linguistics.
- Yu et al. (2020) Juntao Yu, Bernd Bohnet, and Massimo Poesio. 2020. Named entity recognition as dependency parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 6470–6476. Association for Computational Linguistics.
- Yuan et al. (2022) Zheng Yuan, Chuanqi Tan, Songfang Huang, and Fei Huang. 2022. Fusing heterogeneous factors with triaffine mechanism for nested named entity recognition. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 3174–3186. Association for Computational Linguistics.
- Zhu and Li (2022) Enwei Zhu and Jinpeng Li. 2022. Boundary smoothing for named entity recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 7096–7108. Association for Computational Linguistics.
Table 9: Statistics of the data versions used by each paper (see Appendix B).

| Dataset | Version | Sentence #Train | #Dev | #Test | Avg. Len | Mention #Ovlp. | #Train | #Dev | #Test | Avg. Len |
| ACE2004 | W2NER | 6,802 | 813 | 897 | 20.12 | 12,571 | 22,056 | 2,492 | 3,020 | 2.5 |
| | SG | 6,198 | 742 | 809 | 21.55 | 12,666 | 22,195 | 2,514 | 3,034 | 2.51 |
| | BS | 6,799 | 829 | 879 | 20.43 | 12,679 | 22,207 | 2,511 | 3,031 | 2.51 |
| | Ours | 6,297 | 742 | 824 | 23.52 | 12,690 | 22,231 | 2,514 | 3,036 | 2.64 |
| ACE2005 | W2NER | 7,606 | 1,002 | 1,089 | 17.77 | 12,179 | 24,366 | 3,188 | 2,989 | 2.26 |
| | SG | 7,285 | 968 | 1,058 | 18.60 | 12,316 | 24,700 | 3,218 | 3,029 | 2.26 |
| | BS | 7,336 | 958 | 1,047 | 18.90 | 12,313 | 24,687 | 3,217 | 3,027 | 2.26 |
| | Ours | 7,178 | 960 | 1,051 | 20.59 | 12,405 | 25,300 | 3,321 | 3,099 | 2.40 |
| Genia | W2NER | 15,023 | 1,669 | 1,854 | 25.41 | 10,263 | 45,144 | 5,365 | 5,506 | 1.97 |
| | SG | 15,022 | 1,669 | 1,855 | 26.47 | 10,412 | 47,006 | 4,461 | 5,596 | 2.07 |
| | Triaffine | 16,692 | - | 1,854 | 25.41 | 10,263 | 50,509 | - | 5,506 | 1.97 |
| | Ours | 15,038 | 1,765 | 1,732 | 26.47 | 10,315 | 46,203 | 4,714 | 5,119 | 2.0 |
Appendix A Multi-head Biaffine Decoder
The input of the multi-head Biaffine decoder is two matrices $\mathbf{H}_s, \mathbf{H}_e \in \mathbb{R}^{n \times h}$, and the output is $\mathbf{R} \in \mathbb{R}^{n \times n \times r}$. The formulation of the multi-head Biaffine decoder is as follows:

$$\mathbf{S}^{(1)}_{ij} = (\mathbf{H}_{s_i} \oplus \mathbf{w}_{j-i} \oplus \mathbf{H}_{e_j})W,$$
$$[\hat{\mathbf{H}}^{(1)}_s, \ldots, \hat{\mathbf{H}}^{(K)}_s] = \mathrm{Split}(\mathbf{H}_s), \qquad [\hat{\mathbf{H}}^{(1)}_e, \ldots, \hat{\mathbf{H}}^{(K)}_e] = \mathrm{Split}(\mathbf{H}_e),$$
$$\mathbf{S}^{(2,k)}_{ij} = \hat{\mathbf{H}}^{(k)}_{s_i} U^{(k)} \hat{\mathbf{H}}^{(k)\mathsf{T}}_{e_j}, \qquad \mathbf{S}^{(2)}_{ij} = [\mathbf{S}^{(2,1)}_{ij}; \ldots; \mathbf{S}^{(2,K)}_{ij}],$$
$$\mathbf{R}_{ij} = \mathbf{S}^{(1)}_{ij} + \mathbf{S}^{(2)}_{ij},$$

where $h$ is the hidden size, $\mathbf{w}_{j-i} \in \mathbb{R}^c$ is the span length embedding for length $j-i$, $W \in \mathbb{R}^{(2h+c) \times r}$, and $r$ is the biaffine feature size; $\mathrm{Split}$ equally splits a matrix in the last dimension, thus $\hat{\mathbf{H}}^{(k)}_s, \hat{\mathbf{H}}^{(k)}_e \in \mathbb{R}^{n \times h_k}$, where $h_k$ is the hidden size for each head; and $U^{(k)} \in \mathbb{R}^{h_k \times r_k \times h_k}$, $r_k = r/K$, and $\mathbf{R}_{ij} \in \mathbb{R}^{r}$.
We did not use multi-head for the first term $\mathbf{S}^{(1)}$, because it does not occupy too many parameters, and using multi-head for it slightly harms the performance.
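The following is one plausible PyTorch rendering of the formulation above (class and parameter names are ours; negative span lengths in the lower triangle are simply clamped to zero here, which is an assumption):

```python
import torch
import torch.nn as nn

# An affine term over head state, length embedding, and tail state, plus
# K head-wise bilinear terms whose outputs are concatenated.
class MultiHeadBiaffine(nn.Module):
    def __init__(self, h: int, r: int, c: int, K: int, max_len: int = 512):
        super().__init__()
        assert h % K == 0 and r % K == 0
        self.K = K
        self.len_emb = nn.Embedding(max_len, c)        # w_{j-i}
        self.W = nn.Linear(2 * h + c, r, bias=False)   # affine term S^(1)
        self.U = nn.Parameter(torch.randn(K, h // K, r // K, h // K) * 0.02)

    def forward(self, Hs: torch.Tensor, He: torch.Tensor) -> torch.Tensor:
        n, h = Hs.shape
        lengths = torch.arange(n)[None, :] - torch.arange(n)[:, None]
        w = self.len_emb(lengths.clamp(min=0))         # [n, n, c]
        affine = self.W(torch.cat([Hs[:, None, :].expand(n, n, h), w,
                                   He[None, :, :].expand(n, n, h)], dim=-1))
        Hs_k = Hs.view(n, self.K, -1)                  # Split(H_s)
        He_k = He.view(n, self.K, -1)                  # Split(H_e)
        # per-head bilinear terms S^(2,k), concatenated over the heads
        bilinear = torch.einsum("ika,karb,jkb->ijkr", Hs_k, self.U, He_k)
        return affine + bilinear.reshape(n, n, -1)     # R in R^{n x n x r}

Hs, He = torch.randn(6, 8), torch.randn(6, 8)
R = MultiHeadBiaffine(h=8, r=4, c=4, K=2)(Hs, He)
print(R.shape)  # torch.Size([6, 6, 4])
```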
Appendix B Data
We list the statistics of each dataset in Table 9. As shown in the table, the number of sentences and even the number of entities differ from paper to paper. Therefore, it is not fair to directly compare results. For ACE2004 and ACE2005, we release the pre-processing code that derives the data from the LDC files, and we make sure no entities are dropped because of sentence tokenization; thus, the pre-processed ACE2004 and ACE2005 data from this work have the most entities in Table 9. For Genia, we appeal for a consistent train/dev/test usage, and we release the data split within the code repository. Moreover, in order to facilitate document-level NER study, we split the Genia dataset based on documents; sentences in the train/dev/test splits therefore come from different documents, with a document ratio of 8:1:1 for train/dev/test. Besides, we found one conflicting document annotation in Genia, and we fixed this conflict. After comparing different versions of Genia, we found that W2NER (Li et al., 2022) and Triaffine (Yuan et al., 2022) dropped the spans with more than one entity tag (there are 31 such entities); thus, they have fewer nested entities than us. SG (Wan et al., 2022), on the other hand, includes discontinuous entities, so it has more nested entities than us.
Appendix C Implementation Details
We used the AdamW optimizer to optimize the model and the transformers package (Wolf et al., 2020) for the pre-trained models. The hyper-parameter ranges used in this paper are listed in Table 5.
Table 5: Hyper-parameters (brackets denote search ranges).

| | ACE2004 | ACE2005 | Genia |
| # Epoch | 50 | 50 | 5 |
| Learning rate | 2e-5 | 2e-5 | 7e-6 |
| Batch size | 48 | 48 | 8 |
| # CNN blocks | [2, 3] | [2, 3] | 3 |
| CNN kernel size | 3 | 3 | 3 |
| CNN channel dim. | [120, 200] | [120, 200] | 200 |
| # Heads | [1, 5] | [1, 5] | 4 |
| Hidden size | 200 | 200 | 400 |
| Warmup factor | 0.1 | 0.1 | 0.1 |
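For concreteness, a rough sketch of the training setup with the Table 5 values for ACE2004/ACE2005 (the checkpoint name and the number of steps per epoch are assumptions for illustration, not the repo's configuration):

```python
import torch
from torch.optim import AdamW
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained("bert-large-cased")   # assumed checkpoint
optimizer = AdamW(model.parameters(), lr=2e-5)

num_epochs, steps_per_epoch = 50, 100                   # steps assumed
total_steps = num_epochs * steps_per_epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),            # warmup factor 0.1
    num_training_steps=total_steps,
)
```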
Table 6: Numbers of flat and nested entities in the test sets.

| | # Ent. | # Flat Ent. | # Nested Ent. |
| ACE2004 | 3,036 | 1,614 | 1,422 |
| ACE2005 | 3,099 | 1,913 | 1,186 |
| Genia | 5,119 | 3,963 | 1,156 |
Appendix D FEP, FER, NEP, and NER
We split entities into two kinds based on whether they overlap with other entities; the statistics for each dataset are listed in Table 6. When calculating the flat entity precision (FEP), we take all flat entities in the prediction and calculate the ratio of them that appear in the gold annotation. For the flat entity recall (FER), we take all flat entities in the gold annotation and calculate the ratio of them that appear in the prediction. The nested entity precision (NEP) and nested entity recall (NER) are computed similarly.
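A small sketch of how these four metrics can be computed (our reading of the description above; entities are (start, end, type) tuples):

```python
# An entity is "nested" if it overlaps with another entity on the same
# side (prediction or gold), "flat" otherwise.
def overlaps(a, b):
    return not (a[1] < b[0] or b[1] < a[0])

def split_flat_nested(entities):
    nested = {e for e in entities
              for other in entities if e != other and overlaps(e, other)}
    return set(entities) - nested, nested

def flat_nested_metrics(pred, gold):
    pred_flat, pred_nested = split_flat_nested(pred)
    gold_flat, gold_nested = split_flat_nested(gold)
    fep = len(pred_flat & set(gold)) / max(len(pred_flat), 1)
    fer = len(gold_flat & set(pred)) / max(len(gold_flat), 1)
    nep = len(pred_nested & set(gold)) / max(len(pred_nested), 1)
    ner = len(gold_nested & set(pred)) / max(len(gold_nested), 1)
    return fep, fer, nep, ner

pred = [(0, 2, "PER"), (1, 1, "ORG"), (5, 6, "LOC")]
gold = [(0, 2, "PER"), (1, 1, "ORG"), (4, 6, "LOC")]
print(flat_nested_metrics(pred, gold))
```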