
Named Entity Linking with Entity Representation by Multiple Embeddings

Oleg Vasilyev, Alex Dauenhauer, Vedant Dharnidharka, John Bohannon
Primer Technologies Inc.
San Francisco, California
oleg,alex.dauenhauer,[email protected]
Abstract

We propose a simple and practical method for named entity linking (NEL), based on representing an entity by multiple embeddings. To explore this method, and to review its dependence on parameters, we measure its performance on Namesakes, a highly challenging dataset of ambiguously named entities. Our observations suggest that the minimal number of mentions required to create a knowledge base (KB) entity is very important for NEL performance. The number of embeddings is less important and can be kept small, as few as 10 or fewer. We show that our representations of KB entities can be adjusted using only KB data, and that the adjustment can improve NEL performance. We also compare the NEL performance of embeddings obtained from tuning a language model on diverse news texts as opposed to tuning on more uniform texts from the public XSum and CNN / Daily Mail datasets. We find that tuning on diverse news provides better embeddings.

1 Introduction

Named entity linking (NEL) is the task of linking a mention of an entity in a text to the correct reference entity in a knowledge base (KB) Rao et al. (2012); Yang and Chang (2015); Sorokin and Gurevych (2018); Kolitsas et al. (2018); Logeswaran et al. (2019); Wu et al. (2020); Li et al. (2020); Sevgili et al. (2021). Here we consider NEL in a specific setting, with the intention of presenting our NEL method and probing the difficulty of dealing with namesakes:

1. The mention of interest is assumed to be already located in the text, i.e. the named entity recognition task is done.
2. Only the local context surrounding the mention of interest is used for the linking; no other mentions in the text are used.
3. The KB is fixed and built on reliable data.
4. Both the KB and the pool of mentions are mostly composed of namesakes.
5. A mention may have a corresponding entity in the KB, or may not. We call the former a familiar mention and the latter a stranger mention.

Point 1 means that the named entity recognition task is assumed to be done, leaving us with NEL in a narrow sense Rao et al. (2012); Wu et al. (2020); Logeswaran et al. (2019). Point 2 makes the problem better defined. Using other mentions of the same entity or related entities can help NEL, but our focus is on imitating the more difficult case of lone mentions (with no related mentions in the vicinity in the text). For the use of related named entities see, for example, Zaporojets et al. (2022).

Point 3 leaves out the question of growing or improving the KB with the encountered mentions (familiar or stranger). A KB built on reliable data allows us to isolate the effect of KB pollution by intentionally adding wrong data, as we do in this paper. Point 4 makes it easier to reveal NEL errors and to observe the dependence of our method on its parameters.

Points 3 and 4 are both satisfied by choosing the recent Namesakes dataset Vasilyev et al. (2021b, a), a dataset of human-labeled ambiguously named entities. Another recent dataset, Ambiguous Entity Retrieval (AmbER) Chen et al. (2021), does include subsets of identically named entities (for fact checking, slot filling, and question answering tasks), but it is automatically generated. Most existing NEL-related datasets do not focus on highly ambiguous names Ratinov et al. (2011); Hoffart et al. (2011); Ferragina and Scaiella (2012); Ji et al. (2017); Guo and Barbosa (2018).

In this paper we focus on presenting our NEL method. We test it on KB entities and mentions taken from Namesakes: this challenging dataset helps reveal the behavior of our NEL method and its dependence on parameters. Our contributions:

1. We introduce a simple and practical representation of an entity in a KB, and explore NEL to such representations on the example of highly ambiguous mentions from Namesakes.
2. We suggest an adjustment of the KB based only on its own entities, and show that it helps reduce NEL errors.

In Section 2 we introduce our entity representation and the NEL procedure for a KB with such representations. In Section 3 we explain how we use the Namesakes dataset in our NEL evaluation experiments. In Section 4 we present the experiments and results.

2 Named Entity as a Set of Embeddings

2.1 Knowledge Base Entity

A named entity can be described in very different contexts Ma et al. (2021); FitzGerald et al. (2021). The same person can be a scientist and a dissident, the same location can be described by its nature or by its social events, and so on. This is the motivation for representing an entity by multiple embeddings, at least if each embedding is created from a mention in a specific context.

Our representation of a KB entity E_a is composed of multiple embeddings, and consists of:

1. Normalized embeddings e_i, their norms |e_i|, and assigned thresholds t_i, initially set to t_i = -1. The number of embeddings is restricted by agglomerative clustering (Appendix B): i <= N_E.
2. An entity threshold T.
3. Entity surface names s_k.
4. References to similar entities E_b.

In this and the next subsections we explain the details of this representation, and our NEL procedure that uses it.

A fundamental element used in building an entity is an embedding of a mention of this entity in some context. We tuned a pretrained BERT Devlin et al. (2019) language model, 'bert-base-uncased', accessed via the transformers library (Wolf et al., 2020), on generic random news with named entities located in the texts. The only goal of the tuning is to enhance LM performance on the mentions of named entities, without changing the LM objective or specializing the LM on any particular set of named entities. The tuning and inference are as follows:

1. At tuning, only the named entities, i.e. all the located mentions within the input window of the text, serve as the labels for prediction. In the input, each mention is either left unchanged (with probability 0.5) or replaced by another random mention from the same text.
2. At inference the text is kept as it is, and the model is run only once on each input-size chunk of the text. For each named entity that occurs in the chunk, the embedding is picked up from the first token of the entity surface form.

Throughout the paper, except in Section 4.3, we use the model tuned on random generic news. In Section 4.3 we use a model tuned on texts of a more uniform style: texts from the XSum Narayan et al. (2018) and CNN / Daily Mail Hermann et al. (2015); Nallapati et al. (2016) datasets. This allows us to observe the effect of using embeddings from a model exposed to a smaller variety of styles. For more details of the tuning see Appendix A.

There can be many mentions available for building a KB entity, even if only reliable verified mentions are used. If the number of embeddings obtained from the mentions is higher than N_E, the embeddings are clustered, and only N_E 'central' embeddings (closest to the centers of the clusters) are stored. We use agglomerative clustering (Appendix B). The surface names s_k of all the mentions used for creating an entity are stored in the entity.
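For concreteness, the following is a minimal sketch (in Python, with NumPy) of a container for this entity representation. The class and field names are ours, introduced only for illustration; they are not part of the paper's implementation.

from dataclasses import dataclass, field
import numpy as np

@dataclass
class KBEntity:
    # Illustrative container for one KB entity; names are ours, not the paper's.
    embeddings: np.ndarray          # (n, dim), normalized, n <= N_E after clustering
    norms: np.ndarray               # original norms |e_i| of the kept embeddings
    thresholds: np.ndarray          # per-embedding thresholds t_i, initialized to -1
    T: float = -1.0                 # entity threshold, Eq. (3)
    surface_names: set = field(default_factory=set)  # all surface names s_k of the used mentions
    similar: list = field(default_factory=list)      # references to surface-similar entities E_b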

2.2 Linking a Mention to KB

In linking a mention to the KB we use only the normalized embedding e of the mention. We define the similarity S of the mention to an entity via the scalar products with its embeddings e_i:

S=\max_{i}\left[(e\cdot e_{i})/\max(T,t_{i})\right]   (1)

Here all the thresholds t_i are set to -1, and are irrelevant unless adjusted as described in Section 2.3; the entity's threshold T is defined further below.

The KB entity with the highest similarity S is the candidate for linking the mention to. We set a linking threshold T_L: the mention is linked to the candidate entity only if

S\geq T_{L}   (2)

Otherwise the mention is left unlinked (unassociated with any KB entity). It is natural to assume T_L = 1, but our results will show that we had to lower it.

The entity's threshold T is defined from the assumption that any embedding of the entity would have to successfully link to the entity (with T_L = 1):

T=\min_{j}\left[\max_{i\neq j}(e_{j}\cdot e_{i})\right]   (3)

This definition makes sense only if there are at least two embeddings in the entity, hence we create a KB entity only if at least two mentions are available (our observations in Section 4 suggest a stricter requirement).
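A minimal sketch of these computations, assuming NumPy arrays and the illustrative KBEntity container sketched in Section 2.1, might look as follows; the function names are ours and the code is not the paper's implementation.

import numpy as np

def entity_threshold(embs: np.ndarray) -> float:
    # Eq. (3): for each embedding e_j, find its best match among the other
    # embeddings of the same entity, then take the minimum over j.
    sims = embs @ embs.T
    np.fill_diagonal(sims, -np.inf)
    return float(sims.max(axis=1).min())

def similarity(e: np.ndarray, ent) -> float:
    # Eq. (1): S = max_i (e . e_i) / max(T, t_i).
    dots = ent.embeddings @ e
    return float(np.max(dots / np.maximum(ent.T, ent.thresholds)))

def link(e: np.ndarray, candidates, T_L: float):
    # Eq. (2): link to the most similar candidate entity if S >= T_L, else leave unlinked.
    best, best_s = None, -np.inf
    for ent in candidates:
        s = similarity(e, ent)
        if s > best_s:
            best, best_s = ent, s
    return best if best_s >= T_L else None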

When linking a mention to the KB we compute the similarities not to all KB entities, but only to the entities that have a surface name at least somewhat similar to the mention's. For this purpose, the KB stores a map from every word of every surface name to the entities that have this word in one of their surface names:

map:\;w\rightarrow\{E_{a}\,|\,w\ \text{in}\ s_{k}\ \text{in}\ E_{a}\}   (4)

When linking a mention to the KB, all KB entities that are mapped from at least one word of the mention's surface name are considered as candidates for linking. The similarities of the mention's embedding to these candidates are then calculated by Eq. 1; the mention is linked to the candidate with the strongest similarity if that similarity exceeds the threshold, Eq. 2. In general, KB entities with similar (by some measure) embeddings could also be considered when selecting the candidates, but here we focus on the namesakes.
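As an illustration, the word map of Eq. 4 and the candidate selection could be sketched as follows; again, the names are ours, and the simple lowercased whitespace split is an assumption about how surface names are tokenized.

from collections import defaultdict

def build_surface_map(entities):
    # Eq. (4): map every word of every surface name to the entities using that word.
    surface_map = defaultdict(set)
    for ent_id, ent in enumerate(entities):
        for name in ent.surface_names:
            for word in name.lower().split():
                surface_map[word].add(ent_id)
    return surface_map

def candidate_entities(mention_surface, surface_map, entities):
    # Candidates: entities sharing at least one surface-name word with the mention.
    ids = set()
    for word in mention_surface.lower().split():
        ids |= surface_map.get(word, set())
    return [entities[i] for i in ids]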

2.3 Knowledge Base Adjustment

Can we improve the KB right after it is created, even without using any knowledge about the texts and mentions on which NEL will be used or evaluated? We suggest adjusting the thresholds t_i in each entity E_a by considering the relation of E_a to its similar entities E_b.

Each entity E_a stores references to its most 'similar' entities E_b. For an entity with surface names s_k we select its similar entities using a (non-symmetric) surface similarity, which we define as

L=\max_{m}\sum_{k}\sum_{w\in s_{k}\cap s_{m}}l(w)   (5)

where s_m are the surface names of another KB entity, w is any word in s_k that also exists in s_m, and l(w) is the number of characters in the word w. For each KB entity E_a we find the N_S = 10 entities E_b with the highest L > 0 (there may be fewer than 10 because of the requirement L > 0). References to these 'surface-similar' entities E_b are stored in the entity E_a.
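A small sketch of Eq. 5, under the same assumption of whitespace word splitting (the function name is ours):

def surface_similarity(names_a, names_b):
    # Eq. (5): L = max over surface names s_m of the other entity of the total
    # character length of words shared between s_m and all surface names s_k.
    best = 0
    for s_m in names_b:
        words_m = set(s_m.lower().split())
        total = sum(len(w)
                    for s_k in names_a
                    for w in set(s_k.lower().split()) & words_m)
        best = max(best, total)
    return best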

In order to adjust a KB entity E_a, we impose a requirement on each embedding e from each similar entity E_b: the embedding e should not be able to link to the entity E_a (because E_a and E_b are different entities).

The KB is adjusted to satisfy this requirement by iterating through all KB entities E_a; for each entity E_a, iterating through its similar entities E_b; and for each pair E_a and E_b, iterating through the embeddings e_i of E_a and e_j of E_b. The threshold t_i for e_i is adjusted as:

t_{i}\rightarrow c\,(e_{j}\cdot e_{i})\quad\text{if}\quad(e_{j}\cdot e_{i})>\max(T,t_{i})   (6)

Here T is the threshold of the entity E_a. We use c = 1.01, just enough to make the linking impossible. For clarity, the adjustment procedure is also given as pseudo-code in Figure 1.

Given: Knowledge Base KB = {E_a};
Each entity E_a includes:
       Threshold T = E_a.T
       Embeddings E_a.embeddings = {e_i}
           Each e_i has its own threshold t_i = -1
       Similar entities E_a.similar = {E_b}
Adjustment coefficient c > 1 (e.g. c = 1.01)
Adjustment procedure:
for E_a in KB:
      for E_b in E_a.similar:
          for e_i in E_a.embeddings:
              for e_j in E_b.embeddings:
                   if (e_i * e_j) > max(E_a.T, t_i):
                       t_i = c * (e_i * e_j)
                   if (e_i * e_j) > max(E_b.T, t_j):
                       t_j = c * (e_i * e_j)
Figure 1: Adjustment of Knowledge Base.
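A runnable sketch of this procedure, assuming the illustrative KBEntity container from Section 2.1 and the iteration order by dissimilarity of Eq. 7 (function names are ours):

import numpy as np

def dissimilarity(ent) -> float:
    # Eq. (7): sum of (1 - e_i . e_k) over pairs of the entity's embeddings.
    sims = ent.embeddings @ ent.embeddings.T
    iu = np.triu_indices(len(ent.embeddings), k=1)
    return float((1.0 - sims[iu]).sum())

def adjust_kb(entities, c: float = 1.01):
    # Raise per-embedding thresholds t_i so that embeddings of surface-similar
    # entities can no longer link across entities (Eq. (6), Figure 1).
    for ent_a in sorted(entities, key=dissimilarity, reverse=True):
        for ent_b in ent_a.similar:
            dots = ent_a.embeddings @ ent_b.embeddings.T   # (n_a, n_b) scalar products
            for i in range(dots.shape[0]):
                for j in range(dots.shape[1]):
                    d = dots[i, j]
                    if d > max(ent_a.T, ent_a.thresholds[i]):
                        ent_a.thresholds[i] = c * d
                    if d > max(ent_b.T, ent_b.thresholds[j]):
                        ent_b.thresholds[j] = c * d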

For definiteness, we iterate through the entities E_a in the KB in order from larger to smaller 'dissimilarity' of the entity, where we define the dissimilarity of E_a as the sum

\sum_{i,k}\left(1-(e_{i}\cdot e_{k})\right)   (7)

with summation over all pairs of embeddings of E_a. The motivation for this order is to start with the entities that might be more prone to conflicts and adjustments in the KB. However, in the experiments described in this paper there was no noticeable difference between this version and versions with somewhat different definitions of 'dissimilarity', or even with the opposite order of iteration.

We do not consider here an alternative possibility: instead of increasing an embedding's threshold, we can rotate the embeddings away from each other in order to decrease their product below the threshold; for more detail see Appendix C.

3 Evaluation on Namesakes

The Namesakes dataset consists of three parts Vasilyev et al. (2021b):

1. Entities: human-labeled mentions of named entities from Wikipedia entries.
2. News: human-labeled mentions of named entities from news texts.
3. Backlinks: mentions of entities linked to the Wikipedia entries used in Entities.

According to Vasilyev et al. (2021a), the mentions in all the parts are selected with the goal of creating high ambiguity of their surface names.

We create the KB from Entities, and use News and Backlinks as sources of mentions for evaluating NEL. These evaluation mentions are a mix of familiar and stranger mentions. The stranger mentions appear for two reasons. First, some of the labeled mentions in News have the same surface names as labeled mentions in Entities, but represent entities that do not exist in Entities. Second, the requirement of a certain minimal number of mentions for creating a KB entity can leave some mentions in both News and Backlinks without counterpart KB entities.

NEL performance can be characterized by three indicators:

1. Fraction F_FW of familiar mentions linked to an incorrect KB entity.
2. Fraction F_FN of familiar mentions not linked to the KB.
3. Fraction F_SL of stranger mentions linked to the KB.

For clarity: if NEL of N_F familiar mentions results in linking N_FW mentions to wrong KB entities and in leaving N_FN mentions unlinked, then F_FW = N_FW / N_F and F_FN = N_FN / N_F. If NEL of N_S stranger mentions results in N_SL linked mentions, then F_SL = N_SL / N_S.
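These fractions are straightforward to compute from per-mention linking results; a small sketch (the function name and the result format are ours, introduced only for illustration):

def nel_error_rates(results):
    # results: list of (is_familiar, predicted_id, gold_id) triples, where
    # predicted_id is None for an unlinked mention and gold_id is None for strangers.
    # Assumes at least one familiar and one stranger mention are present.
    n_f = n_fw = n_fn = n_s = n_sl = 0
    for is_familiar, pred, gold in results:
        if is_familiar:
            n_f += 1
            if pred is None:
                n_fn += 1
            elif pred != gold:
                n_fw += 1
        else:
            n_s += 1
            if pred is not None:
                n_sl += 1
    return n_fw / n_f, n_fn / n_f, n_sl / n_s   # F_FW, F_FN, F_SL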

The lower each of these indicators, the better. The first indicator accounts for the worst kind of error: the mention is familiar, but it is wrongly identified. In a scenario of a growing KB this would also degrade KB quality. The second indicator accounts for the most innocent error: the mention is not identified (although it could be), but at least no wrong identity is given. The third indicator accounts for errors that are as bad as the first kind, with the arguable excuse that stranger mentions are more difficult for NEL.

4 Experiments

4.1 Linking to Entities of Namesakes

We present here evaluation results in terms of the three indicators introduced in the previous section. In Figure 2 we show the level of NEL errors for evaluating all the mentions from News.

Figure 2: Dependency of NEL errors on the min number of mentions. KB is made of Namesakes Entities; evaluation mentions are from Namesakes News. Max number of embeddings per entity N_E = 4. The errors are shown for threshold values T_L = 0.825, 0.850, 0.875.

We observe that the minimal number of mentions allowed for creating a KB entity plays an important role in reducing the number of errors, even though the entity mentions were clustered into only 4 embeddings.

The role of the linking threshold T_L is, as expected, a trade-off between the performance on familiar mentions and on stranger mentions. A higher threshold decreases the fraction of stranger mentions linked to the KB, while increasing the fraction of familiar mentions not linked to the KB.

In Figure 3 we again show the dependency of NEL errors on the minimal number of mentions per KB entity, but now we evaluate linking of Backlinks mentions to the KB (this is the only difference between the settings of Figures 2 and 3).

Figure 3: Dependency of NEL errors on the min number of mentions. KB is made of Namesakes Entities; evaluation mentions are from Namesakes Backlinks. Max number of embeddings per entity N_E = 4. The errors are shown for threshold values T_L = 0.825, 0.850, 0.875.

We observe a comparable level of errors in linking a familiar mention to a wrong KB entity and in wrongly linking a stranger mention to the KB, but a much higher fraction of unlinked familiar mentions. We speculate that the reason is that a named entity mention in a Wikipedia backlink usually comes with less context than a mention in the news.

The number of samples participating in the evaluation is presented in the left panes of Tables 1 and 2. A relaxed requirement on the minimal number of mentions per entity allows a larger KB and makes more mentions familiar. The Backlinks part of Namesakes provides more mentions for evaluation.

Figure 4: Dependency of NEL errors on the max number of embeddings. KB is made of Namesakes Entities; evaluation mentions are from Namesakes News. Min number of mentions per entity: 10. The errors are shown for threshold values T_L = 0.825, 0.850, 0.875.
Figure 5: Dependency of NEL errors on the max number of embeddings. KB is made of Namesakes Entities; evaluation mentions are from Namesakes Backlinks. Min number of mentions per entity: 10. The errors are shown for threshold values T_L = 0.825, 0.850, 0.875.

In Figure 4 we show that the limit on the number of stored embeddings can be as low as 4, at least judging by the evaluation on Namesakes News. The evaluation on Namesakes Backlinks (Figure 5) shows only a very weak dependency on the maximal number of embeddings. We suggest that it may be helpful to store more embeddings, depending on the type and cleanliness of the data involved in creating the KB. We present more evidence for this in Section 4.2.

The numbers of samples involved in the evaluations of Figures 4 and 5 are given in the last rows of the left panes of Tables 1 and 2, respectively.

Figure 6: Dependency of NEL errors on the linking threshold T_L. Evaluation mentions are from News. KB entities are created with a min number of mentions of 10 and a max number of embeddings N_E = 4.
Figure 7: Dependency of NEL errors on the linking threshold T_L. Evaluation mentions are from Backlinks. KB entities are created with a min number of mentions of 10 and a max number of embeddings N_E = 4.

Increasing the linking threshold T_L, as expected, suppresses wrong linking but increases the fraction of unlinked mentions. We show an example of this dependency over a wide range of threshold values in Figure 6 for NEL applied to the mentions from News, and in Figure 7 for NEL applied to the mentions from Backlinks. The main difference between the figures is the higher fraction of unlinked familiar mentions from Backlinks; this again suggests that many mentions in Namesakes Backlinks must have limited context. The choice of T_L should be guided by the required trade-off between the error F_SL on one side and the errors F_FW and F_FN on the other.

4.2 Linking to Polluted KB

We consider here the effect of lowering KB quality on NEL. When creating KB entities, we add erroneous mentions, imitating the real-life situation of not fully reliable sources. Such a polluted KB should increase NEL errors. We also expect that the KB adjustment described in Section 2.3 can alleviate the effect of the pollution.

For pollution we use the "Other"-tagged mentions from Namesakes Entities Vasilyev et al. (2021a): these mentions have the surface names of the considered entity (Wikipedia entry) but represent some different entity mentioned in the same entry. The pollution can be characterized by the fraction of polluted KB entities, and by the average fraction of wrong mentions used in creating a polluted KB entity. We set both these pollution levels to 0.5, meaning that we pollute every second entity, and that each polluted entity is created with up to 50% added wrong mentions (subject to the availability of 'Other' mentions in the corresponding document in Namesakes Entities).

In Figure 8 we show how much KB pollution affects NEL, and how much our KB adjustment, described in Section 2.3, can alleviate the effect of pollution.

Figure 8: NEL errors for linking mentions from News to the KB: original (solid line), polluted (dotted) and adjusted after pollution (dashed). KB entities are created with a max number of embeddings N_E = 4. Linking threshold T_L = 0.85.

The pollution somewhat increases the fraction of familiar mentions linked to wrong KB entities. It increases the fraction of stranger mentions wrongly linked to the KB even more. The KB adjustment reduces this effect.

Figure 9: NEL errors for linking mentions from Backlinks to the KB: original (solid line), polluted (dotted) and adjusted after pollution (dashed). KB entities are created with a max number of embeddings N_E = 4. Linking threshold T_L = 0.85.

Figure 9 shows the evaluation for linking Backlinks mentions to the clean, polluted, and polluted-and-adjusted KB. Here we observe that the effects of pollution and of adjustment on the fraction of unlinked familiar mentions F_FN are opposite: polluted KB entities encourage linking, wrong or not. Of course, the effects on wrongly linked familiar mentions F_FW and wrongly linked stranger mentions F_SL are more important.

The number of samples participating in the evaluation is presented in Tables 1 and 2.

             clean                           polluted
M      KB      familiar  stranger      KB      familiar  stranger
5      2129    217       58            2440    224       51
6      1560    211       64            2034    212       63
7      1086    194       81            1509    195       80
8      733     191       84            1037    195       80
9      483     187       88            778     189       86
10     318     187       88            547     188       87
Table 1: Size of KB (number of entities) and number of familiar and stranger mentions from Namesakes News participating in the NEL evaluation. Here M is the min number of mentions allowed for creating a KB entity. The left pane is for the clean KB; it corresponds to Figure 2 and to the solid line in Figure 8. The right pane is for the polluted (adjusted or not) KB; it corresponds to the dotted and dashed lines in Figure 8. The total number of evaluation mentions (familiar and stranger) in each row of each pane is 275.
             clean                           polluted
M      KB      familiar  stranger      KB      familiar  stranger
5      2129    18203     10409         2440    22473     6139
6      1560    15071     13541         2034    20793     7819
7      1086    10756     17856         1509    15806     12806
8      733     8171      20441         1037    13345     15267
9      483     5528      23084         778     12139     16473
10     318     3986      24626         547     8895      19717
Table 2: Size of KB (number of entities) and number of familiar and stranger mentions from Namesakes Backlinks participating in the NEL evaluation. Here M is the min number of mentions allowed for creating a KB entity. The left pane is for the clean KB; it corresponds to Figure 3 and to the solid line in Figure 9. The right pane is for the polluted (adjusted or not) KB; it corresponds to the dotted and dashed lines in Figure 9. The total number of evaluation mentions (familiar and stranger) in each row of each pane is 28612.

Pollution changes the dependency of NEL performance on the max number of entity embeddings N_E: in Figure 10 the stranger-mention errors still decrease up to N_E = 7.

Figure 10: NEL errors for linking mentions from News to the KB: original (solid line), polluted (dotted) and adjusted after pollution (dashed). KB entities are created with a min number of mentions of 10. Linking threshold T_L = 0.85.
Figure 11: NEL errors for linking mentions from Backlinks to the KB: original (solid line), polluted (dotted) and adjusted after pollution (dashed). KB entities are created with a min number of mentions of 10. Linking threshold T_L = 0.85.

Figure 11 shows even more change for Backlinks, compared to the almost flat dependency in Figure 5.

4.3 Embeddings Tuned on More Uniform Texts

In the previous subsections we observed NEL where the evaluation mentions are quite different from the mentions used to create the KB entities: KB entities were created from mentions of an entity on its own Wikipedia page; Backlinks mentions come from mentions of the entities in Wikipedia backlinks; News mentions come from generic news texts. The difference between the KB entities and the evaluation mentions adds to the difficulty of NEL.

In this subsection we consider one more variation that may cause additional difficulty for NEL: we use a model tuned not on generic news (see Section 2.1), but on texts of a more uniform style: a mix of texts from the XSum Narayan et al. (2018) and CNN / Daily Mail Hermann et al. (2015); Nallapati et al. (2016) datasets (we take the texts of the documents, not the summaries). The details of the data and tuning are in Appendix A.

From Figure 12 we observe that the optimal value of the threshold is now different, and that NEL is more sensitive to the threshold.

Figure 12: Dependency of NEL errors on the min number of mentions. Embeddings are from the model tuned on XSum+CNN+DM. KB is made of Namesakes Entities; evaluation mentions are from Namesakes News. Max number of embeddings per entity N_E = 4. The errors are shown for threshold values T_L = 0.950, 0.975, 1.000.

From Figure 13 we observe that the effect of the KB adjustment can be stronger. The adjustment, while strongly reducing the most serious errors, F_FW and F_SL, increases the fraction of unlinked familiar mentions F_FN. The latter is understandable as a consequence of increasing the individual embedding thresholds in KB entities.

Figure 13: NEL errors for linking mentions from News to the KB: original (solid line), polluted (dotted) and adjusted after pollution (dashed). Embeddings are from the model tuned on XSum+CNN+DM. KB entities are created with a max number of embeddings N_E = 4. Linking threshold T_L = 0.975.

From Figure 14 we observe that the choice of the KB entity's max number of embeddings N_E may also be affected by the type of model used for the embeddings (compare with Figure 4).

Figure 14: NEL errors for linking mentions from News to the KB. Embeddings are from the model tuned on XSum+CNN+DM. KB entities are created with a min number of mentions of 10. The errors are shown for linking threshold values T_L = 0.950, 0.975, 1.000.

5 Conclusion

We introduced a simple and practical representation of named entities by multiple embeddings. We reviewed NEL performance for linking text mentions to a KB of such entities, using the Namesakes dataset Vasilyev et al. (2021a) as the source both of the mentions and of the data for building the KB. As a dataset of ambiguous named entities, Namesakes makes NEL difficult and helps to reveal the errors and to observe the behavior and dependencies of our NEL method.

We observed that the requirement of a minimal number of mentions for creating a KB entity is important for NEL performance: requiring a minimum of 10 mentions gives much better results than more relaxed settings (Figures 2, 3). We described a KB adjustment based only on KB data, and showed that the adjustment helps to reduce NEL errors when KB entities are polluted by an admixture of wrong mentions (Figures 8, 9, 13).

Throughout the paper we evaluated our NEL method with a small static KB (albeit on intentionally challenging data). More general scenarios can be considered, such as growing and adjusting the KB with linked mentions. Even within the limits of a static KB, there are interesting issues left out of our consideration here. KB pollution can be moderated by heuristic algorithms that filter the mentions for KB entities. Also, in realistic ingestion of data into a large-scale KB (e.g. all Wikipedia named entities), many more mentions are available for some entities. From our preliminary observations (not included in the paper) we speculate that the optimal number of embeddings can be higher but still within reasonable limits (up to 20), and that the effect of KB adjustment can be stronger.

Limitations

We acknowledge the following limitations of this work:

1. Our observations are limited to the case of a fixed KB. We left out the consideration of adding newly linked mentions to the KB, and of the corresponding changes in KB entity representations and in NEL quality.
2. We considered linking of 'lone' mentions, without using the advantage of coreference with mentions of the same entity or related entities in the text.
3. We investigated the behavior and dependencies of NEL with our representations on a specific dataset (Namesakes), which is challenging due to its high concentration of namesakes. Evaluation of the method on a large-scale KB (for example, a KB of all Wikipedia named entities or of entities from some other documents) is left out of this paper. An immediate difference is that such a KB is inevitably polluted, albeit not as much as in our considerations in the paper.
4. We explored linking of mentions from two kinds of texts: news and backlinks, the latter being a fairly artificial example. Realistically, mentions come from documents of very different formats and styles. While we did imitate here the difference between the mentions on which the KB is built (mentions from their own Wikipedia entries) and the evaluation mentions (mentions from news or backlinks), these are still just two examples.

Acknowledgments

We thank Randy Sawaya for review of the paper and valuable feedback.

References

Appendix A Tuning LM on Named Entities

A.1 Tuning

We obtained embeddings by using a pretrained language model (LM) tuned on located mentions of entities (Section 2.1). The texts for the tuning (9831 texts) were taken randomly from generic news, and processed by the named entity recognition (NER) model 'dbmdz/bert-large-cased-finetuned-conll03-english', accessed via the transformers library (Wolf et al., 2020). (As explained in the Introduction, we focus on NEL under the assumption that the mentions of named entities are already located in the text.)

The purpose is only to tune the pretrained model to do the LM task on already located mentions of named entities, without specializing it on some specific dataset of named entities. The numbers of located mentions of the different identified types were comparable: approximately 7.6 locations, 10.4 persons, 11.8 organizations and 6.4 miscellaneous named entities per text, with an average text length of 3300 characters. In this work we do not use the types of the located mentions.

The located mentions in the texts are marked by enclosing them in square brackets, e.g. "… Minority Leader [Kevin McCarthy] and other…" (after making sure in advance that any pre-existing square brackets in the text are replaced by round brackets). This procedure is done both for tuning the LM and for inference. During tuning, the name in the brackets is sometimes (with probability 0.5) replaced by another random name from the text. At inference there are no replacements (Section A.2).

The labels for tuning are all the tokens within all the square brackets (located by the NER model) that happen to occur within the LM input. We tuned the pretrained 'bert-base-uncased' LM (Wolf et al., 2020) on texts from random daily news. Each tuning input is composed of whole sentences: as many sentences as fit into the maximal input size (512 tokens for this LM).
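A minimal sketch of how a tuning text could be prepared under this scheme; the function name and the (start, end) mention format are ours, introduced only for illustration, and we assume the mention spans are sorted and non-overlapping.

import random

def make_training_text(text, mentions, p_replace=0.5):
    # mentions: sorted, non-overlapping (start, end) character spans of located mentions.
    # Each mention is bracketed; with probability p_replace its surface form is
    # replaced by another mention drawn at random from the same text.
    surfaces = [text[s:e] for s, e in mentions]
    out, prev = [], 0
    for (s, e), surface in zip(mentions, surfaces):
        out.append(text[prev:s].replace("[", "(").replace("]", ")"))
        shown = random.choice(surfaces) if random.random() < p_replace else surface
        out.append("[" + shown + "]")
        prev = e
    out.append(text[prev:].replace("[", "(").replace("]", ")"))
    return "".join(out)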

A.2 Inference

At inference the procedure is similar: the text must be processed by the NER model; the recognized named entities must be bracketed with '[]' (but never replaced). The text is then processed in chunks, each chunk consisting of whole sentences: as many sentences as fit into the LM maximal input size. For each named entity mention (name in brackets), the embedding is taken for the first token of the mention from the last hidden layer of the LM.

However, for the inference in our experiments on Namesakes, we use the already recognized and labeled named entities of Namesakes, so the NER step is not needed: all the mentions are already located, both for creating KB entities and for NEL evaluation.
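A minimal sketch of the embedding extraction, assuming the Hugging Face transformers library; 'bert-base-uncased' stands in for the tuned checkpoint, the bracket marking is omitted, and the mention is identified simply by its starting character offset.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # stand-in for the tuned LM
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def mention_embedding(text, mention_char_start):
    # Return the normalized last-hidden-layer embedding of the first token
    # of the mention that starts at character offset mention_char_start.
    enc = tokenizer(text, return_offsets_mapping=True, truncation=True,
                    max_length=512, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0]
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    for idx, (start, end) in enumerate(offsets.tolist()):
        if start == mention_char_start and end > start:
            e = hidden[idx]
            return e / e.norm()
    raise ValueError("mention start not found in the tokenized input")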

A.3 Tuning on texts from XSum, CNN / Daily Mail

In order to have a model trained on less varied, more uniform texts, we also tuned the same bert-base-uncased model on a mix of texts from the well-known XSum Narayan et al. (2018) and CNN / Daily Mail Hermann et al. (2015); Nallapati et al. (2016) datasets. (We observed the effects of switching to this model in Section 4.3.)

We created a training set of 33,000 texts composed of 11,000 randomly selected texts from each of the primary sources (XSum, CNN, and Daily Mail), with a validation set of 1,500 texts and a test set of 1,500 texts (500 texts randomly selected from each primary source). Entity extraction and model tuning were conducted using the same models and strategy described in Appendix A.1. This resulted in a training set of 65,530 individual samples (a single sample being a context window of 512 tokens containing at least one entity), a validation set of 3,005 samples, and a test set of 3,030 samples. Model tuning was conducted for a single epoch on an NVIDIA Tesla T4 GPU, and optimized on the validation set.

Appendix B Clustering Embeddings for KB Entity

As explained in Section 2.1, a KB entity stores a limited number N_E of embeddings. When the number of available reliable mentions exceeds N_E, the corresponding normalized embeddings are clustered. For this purpose we use agglomerative clustering with Euclidean affinity and average linkage.

From each cluster we select one representative embedding: the embedding closest (by Euclidean distance) to the center of the cluster. The center is defined as the average of all embeddings of the cluster.
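A minimal sketch of this reduction using scikit-learn (the function name is ours; Euclidean distance is the library default for agglomerative clustering):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def central_embeddings(embs, n_max):
    # Reduce a set of normalized mention embeddings to at most n_max representatives:
    # cluster them, then keep the member closest to each cluster's mean.
    if len(embs) <= n_max:
        return embs
    labels = AgglomerativeClustering(n_clusters=n_max, linkage="average").fit_predict(embs)
    reps = []
    for c in range(n_max):
        members = embs[labels == c]
        center = members.mean(axis=0)
        reps.append(members[np.argmin(np.linalg.norm(members - center, axis=1))])
    return np.stack(reps)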

Appendix C Adjustment of KB by rotation of embeddings

The KB adjustment considered in the paper is defined in Section 2.3. We adjusted the individual thresholds of entity representation embeddings in order to prevent linking of one KB entity to another. Here we point out an alternative way to prevent such linking: we can rotate embeddings away from each other.

Suppose that an embedding e_j from an entity E_b is able to link to the entity E_a by being too similar:

(e_{j}\cdot e_{i})>T   (8)

Here e_i is one of the embeddings of E_a, and T is the threshold of E_a. We can then replace the embedding e_i by the embedding

e_{i}^{*}=\cos(\alpha)\,e_{i}+\sin(\alpha)\,e_{j}   (9)

with the goal of having

(e_{j}\cdot e_{i}^{*})=T/c\equiv t;\qquad c>1   (10)

where the coefficient c is slightly higher than 1, for example c = 1.01, and we denote by s\equiv(e_{j}\cdot e_{i}) the product before the replacement. The solution is simple:

\cos(\alpha)=\frac{ts+\sqrt{1+s^{2}-t^{2}}}{1+s^{2}}   (11)
\sin(\alpha)=-\sqrt{1-\cos^{2}(\alpha)}   (12)

There can also be a variation in which we change both e_j and e_i. Such alternatives are as simple to process as the threshold adjustment considered throughout the paper. However, the adjustment of thresholds can be undone and redone at any time, while to do the same after rotating embeddings we would have to store the original embeddings. This makes the rotation version more expensive in storage. Performance-wise, for NEL on Namesakes, we have not seen advantages of rotations over threshold adjustments.
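A minimal sketch of one such replacement, following Eqs. 9-12 (the function name is ours; the result is not re-normalized, in line with the construction above):

import numpy as np

def rotate_away(e_i, e_j, T, c=1.01):
    # Replace e_i so that its product with e_j drops to t = T / c (Eqs. 9-12).
    s = float(e_j @ e_i)                      # product before the replacement
    t = T / c
    cos_a = (t * s + np.sqrt(1.0 + s * s - t * t)) / (1.0 + s * s)
    sin_a = -np.sqrt(max(0.0, 1.0 - cos_a * cos_a))
    return cos_a * e_i + sin_a * e_j          # new e_i; e_j is left unchanged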

Both the threshold adjustment version and the adjustment by rotations can be made more 'iterative': each individual adjustment (the increase of a threshold or the angle of rotation) can be made smaller, with multiple passes of iterations over all KB entities, possibly even in random order. The passes stop when the number of adjustments made during a pass is zero or below a certain limit. For larger data such versions may give better results, but the cost of multiple passes makes this impractical.