E-NER: Evidential Deep Learning for Trustworthy Named Entity Recognition
Abstract
Most named entity recognition (NER) systems focus on improving model performance while ignoring the need to quantify predictive uncertainty, which is critical to the reliability of NER systems in open environments. Evidential deep learning (EDL) has recently been proposed as a promising solution for explicitly modeling predictive uncertainty in classification tasks. However, directly applying EDL to NER applications faces two challenges: the problems of sparse entities and of OOV/OOD entities in NER tasks. To address these challenges, we propose a trustworthy NER framework named E-NER (code: https://github.com/Leon-bit-9527/ENER) that adds two uncertainty-guided loss terms to conventional EDL, along with a series of uncertainty-guided training strategies. Experiments show that E-NER can be applied to multiple NER paradigms to obtain accurate uncertainty estimation. Furthermore, compared to state-of-the-art baselines, the proposed method achieves better OOV/OOD detection performance and better generalization on OOV entities.
1 Introduction
Named entity recognition (NER) aims to locate and classify entities in unstructured text, such as extracting the LOCATION entity "New York" from the sentence "How far is New York from me". Thanks to the development of deep neural networks (DNNs), current NER methods have achieved remarkable performance on a wide range of benchmarks Lample et al. (2016); Yamada et al. (2020); Li et al. (2022).

Despite this progress, current NER research typically focuses on improving model performance, such as recognition accuracy and F1 scores Yu et al. (2020); Zhu and Li (2022). However, few works investigate the model's reliability. A critical aspect of model reliability is uncertainty estimation for predictive results, which characterizes the probability that a model prediction is wrong. One natural way to construct predictive uncertainty is from the maximum value of the Softmax output Yan et al. (2021); Li et al. (2022); Zhu and Li (2022) (the smaller this value, the larger the uncertainty). However, previous empirical studies show that the probabilistic predictions produced by DNN models (e.g., transformers and CNNs) are often miscalibrated Guo et al. (2017); Lee et al. (2018); Pinto et al. (2022). This natural approach may therefore over- or under-estimate the predictive uncertainty, hindering the model's reliability.
High-quality uncertainty estimation helps to improve the model's reliability in an open environment and to find valuable samples that improve training sample efficiency, thus reducing the cost of manual labeling. On the one hand, for the reliability aspect, accurate uncertainty estimation can equip the NER model with the ability to express "I do not know" on both out-of-domain (OOD) and out-of-vocabulary (OOV) samples Charpentier et al. (2020). A desired uncertainty estimation is conceptually shown in Figure 1, wherein misclassified OOV/OOD entities are assigned significantly higher uncertainty than in-domain (ID) entities. Besides, the estimated uncertainty can be further absorbed into the training process to improve the model's robustness against OOV/OOD samples. On the other hand, for the sample efficiency aspect, prior work shows that high-quality uncertainty estimation can be used to select more "informative" samples and thus reduce the number of labeled samples required for training the NER model.
To attain high-quality uncertainty estimation, evidential deep learning (EDL) Sensoy et al. (2018) provides a promising solution. EDL is superior to existing Bayesian learning-based methods Blundell et al. (2015); Kingma et al. (2015); Graves (2011) in that model uncertainty can be efficiently estimated in a single forward pass, avoiding inexact posterior approximation Kopetzki et al. (2021) and time/storage-consuming Monte Carlo sampling Gal and Ghahramani (2016). However, directly applying conventional EDL to NER applications still faces two critical challenges. (1) Sparse entities: in text corpora, entities make up only a minority of the words. For example, only 16.8% of the words in the commonly used CoNLL2003 dataset belong to entities; the remaining non-entity words are labeled with the "others" (O) class. The imbalance between entity and non-entity words can cause over-fitting and poor performance on the entity types. (2) OOV/OOD entity discrimination: in the open environment, NER training/test data typically come with OOV/OOD entities, but the optimization objective of current EDL methods lacks explicit modeling of such information.
To address these two issues, we present a trustworthy NER framework named E-NER with a series of uncertainty-guided training strategies. For the issue of sparse entities, we propose an uncertainty-guided importance weighted (IW) loss, wherein samples with higher predictive uncertainties are assigned larger weights. This loss helps model training pay more attention to entities of interest (e.g., person and location). To solve the issue of unknown entities, we present an additional regularization term that penalizes cases where labels are prone to errors by assigning higher uncertainties to the corresponding samples. We empirically show that these two uncertainty-guided loss terms improve both the quality of estimated confidence and the robustness against OOV samples.
Our contributions are summarized as follows:
- To the best of our knowledge, E-NER is the first work to explore how to leverage evidential deep learning to improve the reliability of current NER models. It demonstrates the potential of EDL to provide high-quality uncertainty estimation in NER applications; the estimated uncertainty can further be used for detecting OOD/OOV samples at test time.
- As a technical contribution, we propose two uncertainty-guided loss terms to mitigate the sparse-entity and OOV/OOD entity discrimination issues in the NER task.
- E-NER is extensively validated in a series of experiments. Compared with conventional NER methods, the results show that E-NER offers: (1) more accurate uncertainty estimation; (2) better OOV/OOD detection performance; (3) better generalization on OOV entities; and (4) better sample efficiency (i.e., fewer samples are required to reach the same performance).

2 Preliminary
This section introduces a commonly-used EDL implementation based on the Dirichlet-based model (DBM) Sensoy et al. (2018). We then describe how the DBM computes the uncertainty in a closed form.
2.1 Dirichlet-based Model
Conventional neural network classifiers typically employ a Softmax layer to provide a point estimate of the categorical distribution. In contrast, Dirichlet-based models (DBM) output the parameters of a Dirichlet distribution and use it to estimate the categorical distribution. Specifically, for the $i$-th sample (e.g., the $i$-th word in the NER task) in a $K$-class classification task, the DBM replaces the Softmax layer of the neural network with a non-negative activation function (e.g., Softplus), whose outputs $e_i$ are considered as the evidence supporting the classification. The evidence is then used to construct a Dirichlet distribution that models the distribution over classes. To this end, the parameters of the Dirichlet distribution are obtained by $\alpha_i = e_i + \mathbf{1}$, where $\mathbf{1}$ represents the vector of ones. Finally, the density function of the Dirichlet distribution is given by:
$$ D(p_i \mid \alpha_i) = \frac{1}{B(\alpha_i)} \prod_{j=1}^{K} p_{ij}^{\alpha_{ij}-1} \qquad (1) $$
where $B(\alpha_i)$ is the $K$-dimensional multinomial beta function.
To learn the model parameters, given a sample $(x_i, y_i)$, where $y_i$ is a one-hot $K$-dimensional label for sample $x_i$, previous EDL methods build the optimization objective by combining a cross-entropy classification loss $\mathcal{L}_{\mathrm{CLS}}$ and a KL penalty loss $\mathcal{L}_{\mathrm{KL}}$:
$$ \mathcal{L}_{\mathrm{EDL}} = \underbrace{\sum_{j=1}^{K} y_{ij}\,\big(\psi(S_i) - \psi(\alpha_{ij})\big)}_{\mathcal{L}_{\mathrm{CLS}}\ \text{(2a)}} \; + \; \lambda\,\underbrace{\mathrm{KL}\big(\mathrm{Dir}(p_i \mid \tilde{\alpha}_i)\,\big\|\,\mathrm{Dir}(p_i \mid \mathbf{1})\big)}_{\mathcal{L}_{\mathrm{KL}}\ \text{(2b)}} \qquad (2) $$
where $\psi(\cdot)$ is the digamma function, $S_i = \sum_{j=1}^{K} \alpha_{ij}$ denotes the Dirichlet strength, $\lambda$ is the balance factor, $\mathrm{Dir}(p_i \mid \mathbf{1})$ is a special case equivalent to the uniform distribution, and $\tilde{\alpha}_i = y_i + (1 - y_i) \odot \alpha_i$ denotes the masked parameters, where $\odot$ refers to the Hadamard (element-wise) product; the masking removes the non-misleading evidence from the predicted parameters $\alpha_i$. Intuitively, the first term in Eq. 2 measures the classification performance, while the second term acts as a regularizer that penalizes misleading evidence by encouraging the associated distribution to be close to the uniform distribution (see Appendix §C.3 for details).
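To make Eq. 2 concrete, the following PyTorch sketch implements both terms under the definitions above. The function names and the mean reduction over the batch are our own choices, not taken from the released code.

```python
import math
import torch

def kl_to_uniform(alpha: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """KL( Dir(p | alpha_tilde) || Dir(p | 1) ) of Eq. 2(b), per sample."""
    alpha_t = y + (1.0 - y) * alpha                  # masked parameters alpha_tilde
    S_t = alpha_t.sum(dim=-1, keepdim=True)          # masked Dirichlet strength
    K = alpha.shape[-1]
    return (torch.lgamma(S_t.squeeze(-1)) - math.lgamma(K)
            - torch.lgamma(alpha_t).sum(-1)
            + ((alpha_t - 1.0)
               * (torch.digamma(alpha_t) - torch.digamma(S_t))).sum(-1))

def edl_loss(alpha: torch.Tensor, y: torch.Tensor, lam: float) -> torch.Tensor:
    """Eq. 2: digamma cross-entropy (2a) plus the weighted KL regularizer (2b)."""
    S = alpha.sum(dim=-1, keepdim=True)              # Dirichlet strength S_i
    cls = (y * (torch.digamma(S) - torch.digamma(alpha))).sum(-1)
    return (cls + lam * kl_to_uniform(alpha, y)).mean()
```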
2.2 Uncertainty Estimation of DBM
Once we obtain the Dirichlet distribution for prediction, we can estimate the predictive uncertainty in a closed form. To this end, EDL provides two quantities: belief mass and uncertainty mass. The belief mass represents the probability of evidence assigned to each category, and the uncertainty mass provides the uncertainty estimate. Specifically, for the sample $x_i$, the belief mass $b_{ij}$ of class $j$ and the uncertainty $u_i$ are computed as:
$$ b_{ij} = \frac{e_{ij}}{S_i} = \frac{\alpha_{ij} - 1}{S_i}, \qquad u_i = \frac{K}{S_i} \qquad (3) $$
with the restriction that $u_i + \sum_{j=1}^{K} b_{ij} = 1$. The belief mass and the uncertainty mass are used to guide the training process in our proposed framework (see Section §3.3).
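In code, Eq. 3 amounts to a few tensor operations on the Dirichlet parameters (a sketch; `alpha` is assumed to have shape `(N, K)`):

```python
def belief_and_uncertainty(alpha: torch.Tensor):
    """Eq. 3: b_ij = e_ij / S_i = (alpha_ij - 1) / S_i and u_i = K / S_i."""
    S = alpha.sum(dim=-1, keepdim=True)    # Dirichlet strength S_i
    belief = (alpha - 1.0) / S             # evidence e_i = alpha_i - 1
    uncertainty = alpha.shape[-1] / S      # K / S_i; u_i + sum_j b_ij = 1
    return belief, uncertainty.squeeze(-1)
```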
3 E-NER Architecture
In this section, we describe the three core modules of E-NER and provide an overview of the system architecture in Figure 2. Additionally, we revise the learning strategy of EDL by incorporating importance weights (IW) to address the sparse entities problem and uncertainty mass optimization (UNM) to model the uncertainty of mispredicted entities.
3.1 NER Feature Extraction
Given a word sequence $X = \{x_1, \ldots, x_n\}$ and a target sequence $Y$, the words in the sentence are first preprocessed into the input form required by the corresponding NER method. The processed input is then fed into an encoder (e.g., BERT Devlin et al. (2019)) to compute the hidden representation $H \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ denotes the dimension of the hidden representation. The input format for NER models varies with the paradigm used. Three NER paradigms are considered in this study: sequence labeling (Figure 2(a)), span-based (Figure 2(b)), and Seq2Seq (Figure 2(c)). The specific formats for these paradigms are provided in Appendix §A. Note that for the Seq2Seq (sequence-to-sequence) paradigm, we choose a pointer-based model Yan et al. (2021), so we do not need to learn over the entire vocabulary.
3.2 Dirichlet-based Prediction Layer
Once we obtain the hidden representation, we introduce a Dirichlet-based layer to produce the final predictive distribution. Precisely, for the $i$-th sample, the hidden representation $h_i$ is fed to a fully connected layer to produce logits, which are transformed into Dirichlet parameters $\alpha_i$ as described in Section §2.1. Finally, as shown in Figure 2, a single forward pass using Eq. 3 suffices to calculate the uncertainty $u_i$, while the probability distribution $p_i$ and prediction $\hat{y}_i$ are calculated as follows:
$$ p_{ij} = \frac{\alpha_{ij}}{S_i}, \qquad \hat{y}_i = \arg\max_{j} p_{ij} \qquad (4) $$
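A minimal prediction head consistent with this description is sketched below; the Softplus evidence and $\alpha = e + 1$ follow Section §2.1, while the module name and interface are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirichletHead(nn.Module):
    """Maps hidden states to Dirichlet parameters; one forward pass yields
    probabilities and predictions (Eq. 4) together with uncertainty (Eq. 3)."""

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, h: torch.Tensor):
        evidence = F.softplus(self.fc(h))      # non-negative evidence e_i
        alpha = evidence + 1.0                 # Dirichlet parameters alpha_i
        S = alpha.sum(dim=-1, keepdim=True)    # Dirichlet strength S_i
        prob = alpha / S                       # p_ij = alpha_ij / S_i
        pred = prob.argmax(dim=-1)             # hat{y}_i = argmax_j p_ij
        u = (alpha.shape[-1] / S).squeeze(-1)  # u_i = K / S_i
        return alpha, prob, pred, u
```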
3.3 E-NER Model Learning
Overview. The objective of EDL training is to minimize the sum of losses over all words. Due to the sparse-entity and OOV/OOD entity issues, directly applying EDL to NER leads to suboptimal uncertainty estimates. We improve conventional EDL by incorporating belief mass and uncertainty into network training. Specifically, two key modifications are introduced: (1) we compute importance weights for each sample based on the belief mass to reweight the original classification loss in Eq. 2(a); (2) we introduce an additional term that increases the uncertainty of mispredicted instances, which explicitly improves the quality of uncertainty estimation and helps OOD entity detection.

Importance Weight. Due to the inherent imbalance between entities and non-entities in NER datasets, conventional EDL methods tend to overfit non-entities and assign high uncertainty estimates to entities. To make training focus more on entities and increase the evidence for the ground-truth category, we use the belief mass of the ground-truth category to compute a category-level uncertainty for each instance and use it to adjust the loss. Specifically, for the $i$-th sample with ground-truth class $m$, we use $1 - b_{im}$ as the category-level uncertainty, which serves as the importance weight of entity categories during training. To this end, we replace the one-hot ground-truth representation $y_i$ with an importance-weighted (IW) target $\tilde{y}_i = (1 - b_i) \odot y_i$, and Eq. 2(a) is adjusted to:
$$ \mathcal{L}_{\mathrm{IW}} = \sum_{j=1}^{K} \tilde{y}_{ij}\,\big(\psi(S_i) - \psi(\alpha_{ij})\big) \qquad (5) $$
As illustrated in Figure 3(b), when the belief mass of the ground-truth category is high, indicating a confident prediction, the assigned importance weight (IW) is small. Conversely, Figure 3(c) presents a small belief mass, indicating an uncertain prediction, so the IW is large. In this manner, training focuses more on sparse but valuable entities.
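A sketch of the importance-weighted term, under our reading of Eq. 5 where the one-hot target is scaled by $1 - b_i$; detaching the weight so that it acts as a coefficient rather than an extra gradient path is our assumption:

```python
def iw_cls_loss(alpha: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Eq. 5 sketch: weight each sample's loss by 1 - belief of its gold class."""
    S = alpha.sum(dim=-1, keepdim=True)
    belief = (alpha - 1.0) / S
    y_tilde = (1.0 - belief.detach()) * y    # importance-weighted target
    return (y_tilde * (torch.digamma(S) - torch.digamma(alpha))).sum(-1).mean()
```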
Uncertainty Mass Optimization. Assigning high uncertainty to OOV/OOD entities (see Figure 3(d) for an example) facilitates OOV/OOD entity detection. However, ground-truth OOV/OOD samples are not available during training. One solution is to synthesize such data on the boundary of the in-domain region via a generative model Lee et al. (2018). In this paper, we take a more convenient approach: we treat hard samples, which are often outliers and remain mispredicted even after adequate model training, as OOV/OOD samples. In this way, the model learns to detect OOV/OOD data. Specifically, uncertainty mass optimization (UNM) assigns higher uncertainty to more error-prone samples, allowing the model to express a lack of evidence, by adding an uncertainty mass penalty term over the wrongly predicted samples:
$$ \mathcal{L}_{\mathrm{UNM}} = -\lambda_t \sum_{i \in \mathcal{D}_{\mathrm{mis}}} \log u_i \qquad (6) $$
where $\mathcal{D}_{\mathrm{mis}}$ denotes the set of mispredicted samples.
The coefficient $\lambda_t$ depends on a small positive constant $\lambda_0$, the current training epoch $t$, and the total number of training epochs $T$. As $t$ increases towards $T$, $\lambda_t$ increases monotonically from $\lambda_0$ to 1.0. This allows the network to initially focus on optimizing classification and gradually shift its emphasis towards optimizing UNM as training progresses.
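One annealing schedule consistent with this description is $\lambda_t = \lambda_0^{\,1 - t/T}$, which equals $\lambda_0$ at $t = 0$ and reaches 1.0 at $t = T$; the sketch below also uses $-\log u_i$ on mispredicted samples as the uncertainty-mass penalty. Both choices are our reconstruction of Eq. 6, not verbatim from the released code.

```python
def unm_loss(alpha: torch.Tensor, y: torch.Tensor,
             epoch: int, total_epochs: int, lam0: float = 1e-2) -> torch.Tensor:
    """Eq. 6 sketch: push uncertainty u_i toward 1 on mispredicted samples."""
    lam_t = lam0 ** (1.0 - epoch / total_epochs)     # grows from lam0 to 1.0
    S = alpha.sum(dim=-1, keepdim=True)
    u = (alpha.shape[-1] / S).squeeze(-1)            # u_i = K / S_i
    wrong = (alpha.argmax(-1) != y.argmax(-1)).float()
    n_wrong = wrong.sum().clamp(min=1.0)             # avoid division by zero
    return -lam_t * (wrong * torch.log(u + 1e-8)).sum() / n_wrong
```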
Overall Loss. The overall loss function combines three components: the importance-weighted classification loss $\mathcal{L}_{\mathrm{IW}}$, the KL divergence penalty loss $\mathcal{L}_{\mathrm{KL}}$, and the uncertainty mass loss $\mathcal{L}_{\mathrm{UNM}}$ on mispredicted entities:
$$ \mathcal{L}_{\text{E-NER}} = \mathcal{L}_{\mathrm{IW}} + \lambda\,\mathcal{L}_{\mathrm{KL}} + \mathcal{L}_{\mathrm{UNM}} \qquad (7) $$
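Combining the pieces, a training step minimizes the sum of the three terms; `lam` is the KL balance factor of Eq. 2 (its default value here is a placeholder, not the paper's setting):

```python
def e_ner_loss(alpha: torch.Tensor, y: torch.Tensor,
               epoch: int, total_epochs: int, lam: float = 0.1) -> torch.Tensor:
    """Eq. 7: importance-weighted CE + KL penalty + uncertainty-mass term."""
    return (iw_cls_loss(alpha, y)
            + lam * kl_to_uniform(alpha, y).mean()
            + unm_loss(alpha, y, epoch, total_epochs))
```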
4 Experiments
4.1 Research Questions
In this section, we design extensive experiments to validate whether the proposed method obtains high-quality uncertainty estimation. Concretely, the following four research questions will be investigated.
RQ1: Does E-NER improve the quality of confidence estimation compared to prior work?
RQ2: Can the uncertainty provided by E-NER achieve better OOV/OOD detection performance?
RQ3: Can E-NER improve the model generalization ability on OOV samples?
RQ4: Can E-NER help to find valuable instances to improve the sample efficiency of NER model training?
Following these four research questions, we provide further discussions on our method including ablation studies and limitations.
Table 1: Statistics of the datasets from different domains.

| Dataset | Sentences | Types | Domain |
| --- | --- | --- | --- |
| CoNLL2003 | 22,137 | 4 | Newswire |
| OntoNotes 5.0 | 76,714 | 18 | General |
| WikiGold | 1,696 | 4 | General |

Table 2: Statistics of the OOV datasets.

| Dataset | Sentences | Entities | OOV Rate |
| --- | --- | --- | --- |
| TwitterNER | 3,257 | 3,990 | 0.62 |
| CoNLL2003-Typos | 2,676 | 4,130 | 0.71 |
| CoNLL2003-OOV | 3,685 | 5,648 | 0.96 |
4.2 Datasets and Metrics
Datasets from Different Domains. To answer the above research questions, we choose three widely used datasets: CoNLL2003 Tjong Kim Sang and De Meulder (2003), OntoNotes 5.0 Weischedel et al. (2013) (https://catalog.ldc.upenn.edu/LDC2013T19), and WikiGold Balasuriya et al. (2009). The statistics are displayed in Table 1.
OOV Datasets. We further choose three public OOV datasets, including TwitterNER Zhang et al. (2018), CoNLL2003-Typos Wang et al. (2021), and CoNLL2003-OOV Wang et al. (2021). The statistics are displayed in Table 2.
Metrics. We evaluate the results using three metrics: F1, Expected Calibration Error (ECE), and Area Under the ROC Curve (AUC). F1 is a commonly used performance indicator in NER. ECE is a metric that measures the confidence calibration of a model, with a low score indicating a well-calibrated model. AUC is a commonly used metric for evaluating the performance of binary classifiers, and we use it to evaluate the OOV/OOD detection performance. Their detailed computations are described in the Appendix §C.2.
Table 3: AUC scores for Typos/OOV/OOD detection, using confidence (Con) or uncertainty (Unc) as the detection score.

| Setting | Typos (Con) | Typos (Unc) | OOV (Con) | OOV (Unc) | OOD (Con) | OOD (Unc) |
| --- | --- | --- | --- | --- | --- | --- |
| BERT-Tagger Devlin et al. (2019) | 0.812 | 0.812 | 0.689 | 0.751 | 0.674 | 0.756 |
| -EDL | 0.805 | 0.808 | 0.699 | 0.759 | 0.693 | 0.767 |
| -E-NER (ours) | 0.820 | 0.817 | 0.700 | 0.760 | 0.769 | 0.799 |
| SpanNER Fu et al. (2021) | 0.717 | 0.783 | 0.614 | 0.773 | 0.623 | 0.799 |
| -EDL | 0.701 | 0.759 | 0.607 | 0.760 | 0.620 | 0.792 |
| -E-NER (ours) | 0.741 | 0.792 | 0.640 | 0.796 | 0.676 | 0.824 |
| Seq2Seq Yan et al. (2021) | 0.825 | 0.833 | 0.724 | 0.794 | 0.797 | 0.820 |
| -EDL | 0.829 | 0.830 | 0.729 | 0.787 | 0.793 | 0.818 |
| -E-NER (ours) | 0.824 | 0.841 | 0.743 | 0.803 | 0.822 | 0.847 |
4.3 Experiment Setting
We conduct experiments on three popular NER paradigms: sequence labeling, span-based, and Seq2Seq. The following three models are chosen for evaluating each paradigm.
BERT-Tagger Devlin et al. (2019). It follows the classical paradigm, recognizing entities via sequence labeling.
SpanNER (https://github.com/neulab/spanner) Fu et al. (2021). It enumerates all spans and detects entities among them. For simplicity, we use the original span-based method, without any constraints or data processing.
Seq2Seq (https://github.com/yhcc/BARTNER) Yan et al. (2021). It is a generative model based on BART that requires neither additional labeling strategies nor entity enumeration.
In the experiments, all the reported results are the average of five runs. The experiment details are introduced in Appendix §C.
Table 4: F1 and ECE on CoNLL2003 and OntoNotes 5.0.

| Setting | CoNLL2003 F1 (↑) | CoNLL2003 ECE (↓) | OntoNotes 5.0 F1 (↑) | OntoNotes 5.0 ECE (↓) |
| --- | --- | --- | --- | --- |
| BERT-Tagger | 91.32 | 0.0845 | 88.20 | 0.1053 |
| -EDL | 91.36 | 0.0755 | 88.09 | 0.0838 |
| -E-NER (ours) | 91.55 | 0.0739 | 88.74 | 0.0603 |
| SpanNER | 91.94 | 0.0673 | 87.82 | 0.0609 |
| -EDL | 91.97 | 0.0481 | 87.39 | 0.0474 |
| -E-NER (ours) | 92.06 | 0.0414 | 88.44 | 0.0434 |
| Seq2Seq | 93.05 | 0.0324 | 89.89 | 0.0375 |
| -EDL | 92.84 | 0.0322 | 90.22 | 0.0329 |
| -E-NER (ours) | 93.15 | 0.0225 | 90.64 | 0.0328 |

4.4 Research Question Discussions
4.4.1 Confidence Estimation Quality
To answer the first research question, an important concept should first be clarified: what counts as well-qualified confidence? Confidence should correlate positively with performance, meaning that higher confidence should indicate better accuracy and vice versa, as depicted by the dashed line in Figure 4. Our findings reveal that on both datasets, Softmax falls far below the perfectly calibrated line, indicating that its confidence does not reflect performance well; it is an example of over-confidence. In contrast, E-NER approaches the perfectly calibrated line, suggesting that E-NER produces well-qualified confidence.
We further evaluate all paradigms and present the results in Table 4. E-NER consistently performs best across all paradigms, demonstrating that it can be effectively applied in various frameworks. Comparing EDL to the original models, we observe that while EDL improves confidence estimation, it can also degrade performance. For example, on the OntoNotes 5.0 dataset, EDL performs worse than BERT-Tagger and SpanNER in terms of F1. This highlights the limitation of directly applying EDL. In contrast, E-NER performs best on both metrics, demonstrating that it provides better-qualified confidence without harming performance, and even achieves slight improvements in all settings. A typical reliability diagram is included in Appendix §B.1 for a more detailed view.
4.4.2 OOV/OOD Detection
The typical usage of uncertainty is to detect whether an instance is OOV/OOD, as large uncertainty tends to reveal unnatural instances such as OOV and OOD samples. To evaluate uncertainty from this angle (RQ2), we choose three binary detection tasks: typos, OOV, and OOD. The results are shown in Table 3.
Firstly, compared to the original model of each paradigm, EDL does not improve detection performance in most experiments across the three paradigms. This verifies that EDL is not effective at addressing the OOV/OOD entity discrimination challenge of NER. E-NER, in contrast, significantly outperforms both the original models and EDL across paradigms. In particular, in span-based OOD detection, E-NER outperforms SpanNER by +5.3% and EDL by +5.6% AUC when using confidence for detection. This demonstrates the effectiveness of E-NER in distinguishing whether an entity is OOV/OOD. Note that uncertainty is a better detection score than confidence in most cases.
Table 5: F1 on three OOV datasets.

| Methods | TwitterNER | CoNLL2003-Typos | CoNLL2003-OOV |
| --- | --- | --- | --- |
| VaniIB Alemi et al. (2017) | 71.19 | 83.49 | 70.12 |
| DataAug Dai and Adel (2020) | 73.69 | 81.73 | 69.60 |
| SpanNER (BERT-large) | 71.57 | 81.83 | 64.43 |
| SpanNER (RoBERTa-large) | 71.70 | 82.85 | 64.70 |
| SpanNER (AlBERT-large) | 70.33 | 82.49 | 64.12 |
| EDL-SpanNER (BERT-large) | 74.14 | 82.89 | 68.40 |
| E-SpanNER (BERT-base) | 74.94 | 83.31 | 67.99 |
| E-SpanNER (BERT-large) | 75.64 | 83.64 | 69.71 |
| E-NER vs. SpanNER | 4.07↑ | 1.81↑ | 5.28↑ |
4.4.3 Generalization on OOV Samples
Another benefit of well-qualified confidence is robustness to noise, since the model is properly calibrated without over- or under-confidence. We therefore investigate E-NER's ability to generalize to OOV samples (RQ3). The results on three OOV datasets are reported in Table 5.
It is first observed that E-NER (BERT-large) achieves the best performance on the TwitterNER and CoNLL2003-Typos datasets, and competitive performance on CoNLL2003-OOV. Compared with the strong baseline SpanNER (BERT-large), E-NER (BERT-large) outperforms it by +4.07%, +1.81%, and +5.28% on the three datasets, respectively. This validates the generalization ability of our approach. Secondly, comparing EDL-SpanNER (BERT-large) and E-SpanNER (BERT-large), our method also achieves consistently better performance. This further validates that the two proposed uncertainty-guided loss terms effectively promote robustness against OOV samples.
Table 6: In-domain sample selection (F1 at a fixed labeling ratio).

| Setting | CoNLL2003 Ratio | CoNLL2003 F1 (↑) | OntoNotes 5.0 Ratio | OntoNotes 5.0 F1 (↑) |
| --- | --- | --- | --- | --- |
| Random | 5.5% | 85.39 | 3.0% | 79.47 |
| Entropy | 5.5% | 88.29 | 3.0% | 84.80 |
| MC dropout | 5.5% | 88.67 | 3.0% | 86.06 |
| EDL | 5.5% | 90.51 | 3.0% | 86.25 |
| E-NER | 5.5% | 90.88 | 3.0% | 86.68 |

Table 7: Cross-domain sample selection (F1 at a fixed labeling ratio).

| Setting | Ratio | F1 (↑) | Ratio | F1 (↑) |
| --- | --- | --- | --- | --- |
| Random | 4.8% | 53.67 | 4.7% | 84.23 |
| Entropy | 4.8% | 80.63 | 4.7% | 88.81 |
| MC dropout | 4.8% | 82.87 | 4.7% | 90.32 |
| EDL | 4.8% | 83.32 | 4.7% | 90.12 |
| E-NER | 4.8% | 84.08 | 4.7% | 90.52 |
4.4.4 Sample Efficiency
In active learning, a sample's uncertainty can be used for data selection, so whether the selected samples are valuable also reflects the quality of the uncertainty estimates. To evaluate E-NER from this perspective (RQ4), we design in-domain and cross-domain sample-selection experiments. The results are displayed in Table 6 and Table 7, respectively.
Using the same share of samples, E-NER consistently achieves the best performance in both the in-domain and cross-domain settings, verifying that the uncertainty predicted by E-NER has better quality. Concretely, MC dropout obtains uncertainty from multiple runs of sub-models, which costs time and memory. Though it outperforms naive random selection and Softmax entropy, MC dropout still underperforms EDL and E-NER, both of which compute the uncertainty directly in one forward pass. We also see that EDL does not always outperform MC dropout, as shown in the cross-domain experiment. E-NER, which targets the two NER-specific issues, is effective across the board and better handles the challenges of an open environment.
4.5 Further Analysis
Table 8: Ablation study on the two uncertainty-guided loss terms.

| Setting | CoNLL2003 F1 | CoNLL2003 ECE | OntoNotes 5.0 F1 | OntoNotes 5.0 ECE |
| --- | --- | --- | --- | --- |
| E-NER | 92.06 | 0.041 | 88.44 | 0.043 |
| w/o UNM | 92.10 | 0.058 | 88.21 | 0.051 |
| w/o IW | 91.95 | 0.045 | 87.77 | 0.042 |
Ablation Study. To explore the effect of each loss term, an ablation study is presented in Table 8. Removing either loss term causes declines on most evaluation metrics. Concretely, removing IW decreases the F1 score more than removing UNM, while removing UNM causes a larger degradation in ECE. Overall, this indicates that both proposed uncertainty-guided terms are effective.
Why E-NER Works. We incorporate two uncertainty-guided loss terms into EDL. Firstly, IW is designed for the sparse-entity issue, which causes class imbalance. Using uncertainties as weights helps training pay more attention to entities of interest; as reported in Table 8, IW is effective in improving the F1 score. Secondly, UNM is proposed to handle OOV/OOD entities. Such entities should receive larger uncertainties than normal ones, but naive EDL does not model this explicitly. E-NER increases the uncertainty of mispredictions, which behave similarly to OOV/OOD entities. As shown in Table 8, UNM improves the quality of uncertainty estimation. These two loss terms target different NER issues, and using uncertainty (IW) and learning uncertainty (UNM) jointly allows E-NER to perform well across experimental settings. We further showcase actual predictions in Appendix §B.2.
5 Related Work
NER Paradigm. NER is a fundamental task in information extraction. The mainstream NER methods can be divided into three categories: sequence labeling, span-based, and Seq2Seq. Sequence labeling methods assign a label to each token in a sentence to identify flat entities and are better at handling longer entities with lower label consistency Fu et al. (2021). Span-based methods, which enumerate and classify entity spans in a sentence up to a maximum span length, perform better on sentences with OOV words and entities of medium length Alemi et al. (2017); Dai and Adel (2020); Fu et al. (2021). Seq2Seq methods directly generate the entities and corresponding labels in the sentence and are capable of handling various NER subtasks uniformly Yan et al. (2021). Recently, NER systems have been undergoing a paradigm shift Akbik et al. (2018); Yan et al. (2019), using one paradigm to handle multiple types of NER tasks. Zhang et al. (2022) analyze the incorrect bias in Seq2Seq from the perspective of causality and design a data augmentation method based on backdoor adjustment, making Seq2Seq more suitable for unified NER tasks.
Uncertainty Estimation. Bayesian deep learning uses Bayesian principles to estimate uncertainty in DNN parameters. However, modeling uncertainty in network parameters does not guarantee accurate estimation of predictive uncertainty Sensoy et al. (2021). Recently, there has been a trend of using the output of neural networks to estimate the parameters of a Dirichlet distribution for uncertainty estimation Sensoy et al. (2018); Malinin and Gales (2018). EDL Sensoy et al. (2018) has the advantages of generalizability and low computational cost, making it applicable to various tasks Han et al. (2021); Hu and Khan (2021). However, its uncertainty estimates have difficulty expressing uncertainty outside the domain Amini et al. (2020); Hu and Khan (2021). In contrast, Prior Networks Malinin and Gales (2018) require OOD data during training to distinguish in-distribution (ID) and OOD data. When an NER model encounters unseen entities (e.g., OOV and OOD), it easily makes unreliable predictions; this problem is often approached from the perspective of data augmentation or information theory Fukuda et al. (2020); Wang et al. (2022), but there is no guarantee that these methods achieve a balance between performance and robustness.
6 Conclusion
In this work, we study the problem of trustworthy NER by leveraging evidential deep learning. To address the issues of sparse entities and OOV/OOD entities, we propose E-NER with two uncertainty-guided loss terms. Extensive experimental results demonstrate that the proposed method can be effectively applied to various NER paradigms. The uncertainty estimation quality of E-NER is improved without harming performance. Additionally, the well-qualified uncertainties contribute to detecting OOV/OOD, generalization, and sample selection. These results validate the superiority of E-NER on real-world problems.
Limitations
Our work is the first attempt to explore how evidential deep learning can be used to improve the reliability of current NER models. Despite the improved performance and robustness, our work has limitations that may guide our future work.
First, we propose a simple method that treats hard samples (such as outliers) in the dataset as OOV/OOD samples, enabling the model to detect OOV/OOD data with minimal cost. However, there is still a gap between these hard samples and real OOV/OOD data. OOV/OOD detection performance could be further improved by incorporating more real OOV/OOD samples, for example, real OOD data from other domains, well-designed adversarial examples, or OOV samples generated by data augmentation techniques.
Acknowledgements
We sincerely thank all the anonymous reviewers for providing valuable feedback. This work is supported by the youth program of the National Science Fund of Tianjin, China (Grant No. 22JCQNJC01340), the Fundamental Research Funds for the Central Universities, Nankai University (Grant No. 63221028), and the key program of the National Science Fund of Tianjin, China (Grant No. 21JCZDJC00130).
References
- Akbik et al. (2018) Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 1638–1649.
- Alemi et al. (2017) Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. 2017. Deep variational information bottleneck. In International Conference on Learning Representations (ICLR), pages 1–19.
- Amini et al. (2020) Alexander Amini, Wilko Schwarting, Ava Soleimany, and Daniela Rus. 2020. Deep evidential regression. In Advances in Neural Information Processing Systems (NeurIPS), pages 14927–14937.
- Balasuriya et al. (2009) Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran. 2009. Named entity recognition in Wikipedia. In Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web), pages 10–18.
- Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight uncertainty in neural network. In International conference on machine learning (ICML), pages 1613–1622.
- Charpentier et al. (2020) Bertrand Charpentier, Daniel Zügner, and Stephan Günnemann. 2020. Posterior network: Uncertainty estimation without ood samples via density-based pseudo-counts. In Advances in Neural Information Processing Systems (NeurIPS), pages 1356–1367.
- Dai and Adel (2020) Xiang Dai and Heike Adel. 2020. An analysis of simple data augmentation for named entity recognition. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pages 3861–3867.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 4171–4186.
- Fu et al. (2021) Jinlan Fu, Xuanjing Huang, and Pengfei Liu. 2021. SpanNER: Named entity re-/recognition as span prediction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 7183–7195.
- Fukuda et al. (2020) Nobukazu Fukuda, Naoki Yoshinaga, and Masaru Kitsuregawa. 2020. Robust Backed-off Estimation of Out-of-Vocabulary Embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4827–4838.
- Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning (ICML), pages 1050–1059.
- Graves (2011) Alex Graves. 2011. Practical variational inference for neural networks. In Advances in neural information processing systems (NeurIPS), page 2348–2356.
- Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1321–1330.
- Han et al. (2021) Zongbo Han, Changqing Zhang, Huazhu Fu, and Joey Tianyi Zhou. 2021. Trusted multi-view classification. In International Conference on Learning Representations (ICLR), pages 1–16.
- Hu and Khan (2021) Yibo Hu and Latifur Khan. 2021. Uncertainty-aware reliable text classification. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (SIGKDD), pages 628–636.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), pages 1–15.
- Kingma et al. (2015) Durk P Kingma, Tim Salimans, and Max Welling. 2015. Variational dropout and the local reparameterization trick. In Advances in neural information processing systems (NeurIPS), pages 2575–2583.
- Kopetzki et al. (2021) Anna-Kathrin Kopetzki, Bertrand Charpentier, Daniel Zügner, Sandhya Giri, and Stephan Günnemann. 2021. Evaluating robustness of predictive uncertainty estimation: Are dirichlet-based models reliable? In International Conference on Machine Learning (ICML), pages 5707–5718.
- Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 260–270.
- Lee et al. (2018) Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. 2018. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations (ICLR), pages 1–16.
- Li et al. (2022) Jingye Li, Hao Fei, Jiang Liu, Shengqiong Wu, Meishan Zhang, Chong Teng, Donghong Ji, and Fei Li. 2022. Unified named entity recognition as word-word relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence(AAAI), pages 10965–10973.
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), pages 1–18.
- Lu and Roth (2015) Wei Lu and Dan Roth. 2015. Joint mention extraction and classification with mention hypergraphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 857–867.
- Malinin and Gales (2018) Andrey Malinin and Mark Gales. 2018. Predictive uncertainty estimation via prior networks. In Advances in neural information processing systems (NeurIPS), page 7047–7058.
- Pinto et al. (2022) Francesco Pinto, Philip HS Torr, and Puneet K Dokania. 2022. An impartial take to the cnn vs transformer robustness contest. In European Conference on Computer Vision (ECCV), pages 466–480.
- Sensoy et al. (2018) Murat Sensoy, Lance M. Kaplan, and Melih Kandemir. 2018. Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems (NeurIPS), page 3183–3193.
- Sensoy et al. (2021) Murat Sensoy, Maryam Saleki, Simon Julier, Reyhan Aydogan, and John Reid. 2021. Misclassification risk and uncertainty quantification in deep classifiers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2484–2492.
- Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (HLT-NAACL), pages 142–147.
- Wang et al. (2022) Xiao Wang, Shihan Dou, Limao Xiong, Yicheng Zou, Qi Zhang, Tao Gui, Liang Qiao, Zhanzhan Cheng, and Xuanjing Huang. 2022. MINER: Improving out-of-vocabulary named entity recognition from an information theoretic perspective. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL), pages 5590–5600.
- Wang et al. (2021) Xiao Wang, Qin Liu, Tao Gui, Qi Zhang, Yicheng Zou, Xin Zhou, Jiacheng Ye, Yongxin Zhang, Rui Zheng, and Zexiong Pang. 2021. TextFlint: Unified multilingual robustness evaluation toolkit for natural language processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations (ACL-IJCNLP), pages 347–355.
- Weischedel et al. (2013) Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0. In 3. Abacus Data Network.
- Yamada et al. (2020) Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454.
- Yan et al. (2019) Hang Yan, Bocao Deng, Xiaonan Li, and Xipeng Qiu. 2019. TENER: adapting transformer encoder for named entity recognition. CoRR, abs/1911.04474.
- Yan et al. (2021) Hang Yan, Tao Gui, Junqi Dai, Qipeng Guo, Zheng Zhang, and Xipeng Qiu. 2021. A unified generative framework for various NER subtasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 5808–5822.
- Yu et al. (2020) Juntao Yu, Bernd Bohnet, and Massimo Poesio. 2020. Named entity recognition as dependency parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 6470–6476.
- Zhang et al. (2018) Qi Zhang, Jinlan Fu, Xiaoyu Liu, and Xuanjing Huang. 2018. Adaptive co-attention network for named entity recognition in tweets. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), page 5674–5681.
- Zhang et al. (2022) Shuai Zhang, Yongliang Shen, Zeqi Tan, Yiquan Wu, and Weiming Lu. 2022. De-bias for generative extraction in unified NER task. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL), pages 808–818.
- Zhu and Li (2022) Enwei Zhu and Jinpeng Li. 2022. Boundary smoothing for named entity recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 7096–7108.
Table 9: Comparison of the three NER paradigms.

| | BERT-Tagger | SpanNER | Seq2Seq |
| --- | --- | --- | --- |
| Input | | | |
| Processing | – | Enumerate all spans | Obtain start and end indexes of entities |
| Hidden state | | | |
| Inference | Token-level classification | Span-level classification | Target sequence generation |
Appendix A NER Paradigms
Here we introduce three popular NER paradigms, shown in Table 9.
BERT-Tagger. It follows the sequence labeling paradigm, which assigns a tagging label to each word in a sequence $X$. We use BERT-Tagger Devlin et al. (2019) as the baseline method for sequence labeling. The labeling scheme adopts the BIO tag set, which marks the beginning and interior of an entity, or other (non-entity) words. $X$ is fed to BERT to obtain hidden states, followed by a nonlinear classifier that classifies each word.
SpanNER. Given an input sentence $X$, SpanNER enumerates all spans up to a maximum length and obtains a span set; it then assigns each span an entity label Fu et al. (2021). The maximum span length is set manually. For a sentence of length $n$ with the maximum span length set to 2, the span set can be written by its start-end index pairs as $\{(1,1), (1,2), (2,2), (2,3), \ldots, (n,n)\}$. Each span is fed into the encoder to obtain a vector representation.
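For illustration, enumerating all spans up to a maximum length is a double loop over start and end indexes (a sketch; the released SpanNER code may organize this differently):

```python
def enumerate_spans(n: int, max_len: int):
    """All (start, end) pairs with 1 <= end - start + 1 <= max_len,
    using 1-based inclusive indexes as in the paper."""
    return [(i, j) for i in range(1, n + 1)
                   for j in range(i, min(i + max_len - 1, n) + 1)]

# enumerate_spans(3, 2) -> [(1, 1), (1, 2), (2, 2), (2, 3), (3, 3)]
```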
Seq2Seq. As presented in Table 9, given an input sentence $X$, the target sequence is represented as a list of entity spans and types, e.g., $Y = \{s_1, e_1, t_1, \ldots\}$ (symbols ours, as the original notation was lost in extraction). Taking the first entity as an example, its beginning and end indexes are $s_1$ and $e_1$, and its entity category is $t_1$. This method learns in a sequence-to-sequence manner Yan et al. (2021).
Appendix B Additional Experimental Analysis
B.1 Reliability Diagrams
We further depict reliability diagrams to evaluate the quality of uncertainty estimation. As shown in Figure 5 and Figure 6, the confidence range is equally divided into ten bins, and the subset of samples falling in each bin is used to compute the accuracy.
As shown in Figure 5, the accuracy falls well below the Softmax confidence, indicating over-confidence. Compared with Softmax, E-NER nearly reaches the perfectly calibrated line and has a much smaller ECE score. This suggests that E-NER yields well-qualified confidence and is thus more trustworthy. The observations in Figure 6 are similar, demonstrating the reliability of the proposed approach for OOD entities.
Table 10: Case study. Label mapping: {MIS: miscellaneous; PER: person; ORG: organization; O: non-entity}; each entity cell reports {Prediction; Confidence%; Uncertainty%}. (The entity mentions and per-entity predictions were lost in extraction.)

| Case | Sentence | Softmax+Entropy | E-NER |
| --- | --- | --- | --- |
| I (ID) | A visit to the computer centre offering services found a official clicking away on his mouse. | | |
| II (ID) | have injury doubts about striker | | |
| III (OOV) | But the , a global computer network. | | |
| IV (OOD) | Redesignated 65 on 24 July 1943. | | |
B.2 Case Study
As presented in Table 10, we conduct a case study on four typical cases, covering ID, OOV, and OOD samples. The uncertainty of Softmax is computed with entropy.
The first case contains two entities. Softmax and E-NER both predict the wrong category for the first entity, with confidence scores of 99.9% and 42.0%, respectively. This shows that Softmax is over-confident even on erroneous results, whereas E-NER outputs a larger uncertainty score, signalling that it is unsure about the prediction. The second case also contains two entities. Softmax wrongly predicts the first entity with high confidence (98.8%), while E-NER correctly detects the entity category.
Moreover, the third sentence contains an entity that is OOV due to a misspelling. Softmax assigns it the wrong category with a confidence score of 90.5%, again showing over-confidence on errors. On the contrary, E-NER assigns a large uncertainty score to the OOV sample and correctly predicts the entity category. Similarly, the last case contains an OOD entity, for which E-NER outputs a much larger uncertainty score than Softmax.
Based on these observations, we draw the following conclusions: 1) Softmax is over-confident, even on erroneous predictions and on OOV and OOD samples; 2) E-NER recognizes entities accurately and yields well-qualified uncertainties for erroneous, OOV, and OOD samples. This contributes to the reliability and robustness of E-NER.
Appendix C Implementation Details
C.1 Model Parameters
In this paper, we implement three NER methods: BERT-Tagger, SpanNER, and Seq2Seq. The test set is evaluated with the best model selected on the development set. The implementation details are as follows.
BERT-Tagger. BERT-Tagger (https://github.com/google-research/bert) adopts BERT-large-cased as the base encoder Devlin et al. (2019). We set the dropout rate to 0.2, the training batch size to 16, and the weight decay to 0.02. All models in this paradigm use the Adam optimizer Kingma and Ba (2015) with a learning rate of 2e-5. Sentences are truncated to a maximum length of 256. The initial value of the annealing coefficient $\lambda_0$ is set to 1e-2.
SpanNER. Following the original SpanNER (https://github.com/neulab/spanner) Fu et al. (2021), we adopt BERT-large-uncased as the base encoder Devlin et al. (2019). The dropout rate is set to 0.2. All models in this paradigm are trained with the AdamW optimizer Loshchilov and Hutter (2019) with a learning rate of 1e-5 and a training batch size of 10. To improve training efficiency, sentences are truncated to a maximum length of 128, and the maximum span length is set to 4. The number of MC dropout samples is set to 5 in the experiments. The initial value of $\lambda_0$ is set to 1e-2. We use heuristic decoding and retain the highest-probability span for flat entity recognition in span-based methods.
Seq2Seq. Following Yan et al. (2021), we exploit the BART-large model (https://github.com/yhcc/BARTNER). The BART model is fine-tuned with slanted triangular learning rate warmup; the warmup ratio is set to 0.01. The training batch size is set to 16. The initial value of $\lambda_0$ is set to 1e-3.
C.2 Evaluation Metrics
ECE. The expected calibration error evaluates the expected difference between model confidence and accuracy Guo et al. (2017); Figure 6 depicts this difference geometrically. The concrete formulation is as follows:
$$ \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \,\Big| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \Big| \qquad (8) $$
where $B_m$ represents the $m$-th bin and $M$ the total number of bins (set to 10 in our experiments), $N$ denotes the total number of samples, $|B_m|$ the number of samples in the $m$-th bin, $\mathrm{acc}(B_m)$ the accuracy, and $\mathrm{conf}(B_m)$ the average confidence within the $m$-th bin.
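A direct implementation of Eq. 8 with equal-width bins (a sketch; boundary conventions for the bins vary across libraries):

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Eq. 8: sum over bins of (|B_m| / N) * |acc(B_m) - conf(B_m)|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(conf)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)    # samples falling in bin B_m
        if in_bin.any():
            ece += in_bin.sum() / n * abs(correct[in_bin].mean()
                                          - conf[in_bin].mean())
    return ece
```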
AUC. The area under the curve (AUC; computed with scikit-learn's sklearn.metrics) is a commonly used metric for evaluating the performance of binary classifiers. The formulation is as follows:
$$ \mathrm{AUC} = \frac{1}{|\mathcal{D}^+|\,|\mathcal{D}^-|} \sum_{x^+ \in \mathcal{D}^+} \sum_{x^- \in \mathcal{D}^-} \mathbb{1}\big[ f(x^+) > f(x^-) \big] \qquad (9) $$
where $\mathcal{D}^-$ is the set of negative examples, $\mathcal{D}^+$ is the set of positive examples, $f(\cdot)$ is the detection score, and $\mathbb{1}[\cdot]$ is an indicator function that returns 1 if its condition holds and 0 otherwise.
In this paper, we evaluate the performance of OOV/OOD detection using the AUC metric. Specifically, we consider two settings for the AUC score:
- Con. It uses confidence as the detection score. Correctly recognized entities are positive examples ($\mathcal{D}^+$), and entity recognition errors are negative examples ($\mathcal{D}^-$).
- Unc. It uses uncertainty as the detection score. Wrong predictions on OOV/OOD entities are positive examples ($\mathcal{D}^+$), and correct predictions on in-domain entities are negative examples ($\mathcal{D}^-$). This setting assesses the capability to detect OOV/OOD entities. (See the sketch after this list.)
C.3 EDL Optimization Function
In this section, we give a detailed formulation of the EDL optimization function. Eq. 1 introduces the density of the Dirichlet distribution. The classification term of EDL is the expected cross-entropy under this density:
$$ \mathcal{L}_{\mathrm{CLS}} = \int \Big[ \sum_{j=1}^{K} -y_{ij} \log p_{ij} \Big] \frac{1}{B(\alpha_i)} \prod_{j=1}^{K} p_{ij}^{\alpha_{ij}-1} \, dp_i = \sum_{j=1}^{K} y_{ij} \big( \psi(S_i) - \psi(\alpha_{ij}) \big) \qquad (10) $$
The KL divergence between a Dirichlet distribution with the masked parameters and the uniform Dirichlet, which serves as the penalty term in EDL, takes the following form:
$$ \mathrm{KL}\big(\mathrm{Dir}(p_i \mid \tilde{\alpha}_i) \,\big\|\, \mathrm{Dir}(p_i \mid \mathbf{1})\big) = \log\!\Bigg( \frac{\Gamma\big(\sum_{j=1}^{K} \tilde{\alpha}_{ij}\big)}{\Gamma(K) \prod_{j=1}^{K} \Gamma(\tilde{\alpha}_{ij})} \Bigg) + \sum_{j=1}^{K} (\tilde{\alpha}_{ij} - 1) \Big[ \psi(\tilde{\alpha}_{ij}) - \psi\Big( \sum_{k=1}^{K} \tilde{\alpha}_{ik} \Big) \Big] \qquad (11) $$
Finally, the overall EDL learning objective is:
$$ \mathcal{L}_{\mathrm{EDL}} = \mathcal{L}_{\mathrm{CLS}} + \lambda\, \mathcal{L}_{\mathrm{KL}} \qquad (12) $$