
Event Linking: Grounding Event Mentions to Wikipedia

Xiaodong Yu1    Wenpeng Yin2    Nitish Gupta3    Dan Roth1
1University of Pennsylvania   2Penn State University   3Google Research
{xdyu, danroth}@seas.upenn.edu
[email protected]
[email protected]
Abstract

Comprehending an article requires understanding its constituent events. However, the context in which an event is mentioned often lacks details about the event. A question arises: how can the reader obtain more knowledge about this particular event beyond what the local context in the article provides?

This work defines Event Linking, a new natural language understanding task at the event level. Event linking tries to link an event mention appearing in an article to the most appropriate Wikipedia page. This page is expected to provide rich knowledge about what the event mention refers to. To standardize research in this new direction, our contribution is four-fold. First, this is the first work in the community that formally defines the Event Linking task. Second, we collect a dataset for this new task. Specifically, we automatically gather the training set from Wikipedia and then create two evaluation sets: one from the Wikipedia domain, reporting in-domain performance, and a second from the real-world news domain, evaluating out-of-domain performance. Third, we retrain and evaluate two state-of-the-art (SOTA) entity linking models, showing the challenges of event linking, and we propose an event-specific linking system, EveLink, to set a competitive result for the new task. Fourth, we conduct a detailed and insightful analysis to help understand the task and the limitations of the current model. Overall, as our analysis shows, Event Linking is a challenging and essential task that requires more effort from the community. (Data and code are available at http://cogcomp.org/page/publication_view/996.)

1 Introduction

Figure 1: Examples of event linking and entity linking. The left side is the local context, and the right side contains Wikipedia pages. The entity linking model connects the entity “Boston” to the Wikipedia page “Boston”, while the event linking model links the event “detonated” to the Wikipedia page “Boston Marathon Bombing”, which is more relevant to the local context.

Grounding is a process of disambiguation and knowledge acquisition, and it is an important task for natural language understanding. Entity linking, which grounds entity mentions to a knowledge base (usually Wikipedia) (Bunescu and Pasca, 2006; Mihalcea and Csomai, 2007; Ratinov et al., 2011; Gupta et al., 2017; Wu et al., 2020), has been shown to be important for natural language understanding tasks such as question answering, recommendation systems, and dialogue generation. Despite the significant progress brought by entity linking, we argue that grounding entities may not provide enough of the background knowledge that is often needed to support text understanding. Consider the example in Figure 1: an entity linking model will link the entity “Boston” to the Wikipedia page “Boston”, which introduces the history and culture of the city of Boston. The information we can get from the page “Boston” is irrelevant to the local context. To really help understand this sentence, we need to link the event centered on the verb “detonated” to the Wikipedia page “Boston Marathon Bombing”. We call this process of grounding events Event Linking.

In this paper, we formulate this Event Linking task for the first time, analyze the differences from related tasks and the challenges of the new task, and carefully design a benchmark dataset for it. We automatically collect training data from the hyperlinks in Wikipedia, and create two evaluation sets to evaluate both in-domain and out-of-domain performance. For in-domain evaluation, the test data also comes from hyperlinks in Wikipedia. To prevent models from overfitting, the test data is balanced between hard and easy cases, determined by whether the event is seen in training and by the similarity between the surface forms of event mentions and Wikipedia titles. For out-of-domain evaluation, we annotate real-world news articles spanning 20 years collected from The New York Times. Considering the sparsity of events in Wikipedia, we also add “Nil” annotations to the test data, indicating that those events do not exist in Wikipedia and that the model needs to tag them as “Nil”.

Technically, we propose an event linking model, EveLink, which uses the entities in the local context as arguments of the event structure to better represent the event mention. EveLink outperforms two SOTA entity linking models, BLINK (Wu et al., 2020) and GENRE (Cao et al., 2021), and achieves strong performance on the event linking test set, especially on seen events and easy cases. A detailed error analysis shows the difficulties of the new task and the limitations of the current model.

To conclude, our contributions are four-fold: (i) We formulate the task of Event Linking. (ii) We collect training data for this task, and design both in-domain and out-of-domain test data, with a balanced ratio of hard and easy cases to ensure dataset quality. (iii) Our proposed approach, EveLink, shows promising performance in experiments and sets a competitive baseline for future work. (iv) Our in-depth analysis provides a better understanding of this new problem, the challenges in different domains, and the new approach.

2 Grounding Events in Wikipedia

Given an article and an event mention $m$ in it, event linking tries to find a title $t$, from all English Wikipedia titles (around 5 million titles), that provides the best explanation of $m$. An event mention is defined as a verb or nominal that refers to an event. A correct title is defined as follows: as long as a Wikipedia page is about this event, or any subsection of the page introduces this event, we regard its title as correct. In this paper, all models assume gold event mentions are given. For each event mention, a system is expected to label it with the correct Wikipedia page, or with a “Nil” tag if the event does not exist in Wikipedia. Accuracy is adopted as the official evaluation metric.

Event Linking vs. Entity Linking.

Relatedness: (i) Both tasks link an object (event/entity) from an article to Wikipedia; (ii) Some events, such as “World War II”, are also entities; in this case, the two tasks are the same. Distinctions: (i) Entities are mostly consecutive text spans. Events, in contrast, are more structured objects, consisting of a trigger and a number of arguments. An event trigger is usually a general verb, which may not refer to a specific event on its own without knowledge of the event arguments. The more complex structure of events makes event linking a more challenging task and requires a deeper understanding of the local context; (ii) Unlike entities, which have large coverage in Wikipedia, many events do not have a record in Wikipedia. Considering this sparsity, we require models to tag event mentions that do not exist in Wikipedia as “Nil”.

Why Event Linking?

Except for the events that are also entities, events are generally information units of larger granularity. As shown in Figure 1, a better comprehension of events, such as through linking to Wikipedia, is expected to better facilitate text understanding.

Challenges specific to Event Linking.

Figure 2: Example of an event mention that only exists in a subsection of a Wikipedia page. The event “rebuilt” does not have its own page, but is mentioned in a subsection of the page “Manhattan Bridge”.
Figure 3: Example of hierarchical events. The event “draft” of Tom Brady is mentioned in the page “Tom Brady”, and is also a sub-event of “2000 NFL Draft”, which is in turn a sub-event of “National Football League Draft”.

(i) The correct title for some event mentions may not be unique. The same event can be introduced in several pages. For example, “Invasion of Poland” and “Occupation of Poland (1939–1945)” both introduce the event in which the German Army invaded Poland in 1939. Deciding the ground-truth set and evaluating in this situation are not trivial.

(ii) Events may only exist in a subsection of a Wikipedia page. Only a limited number of famous events have their own pages, while many less famous events only exist in subsections of other pages. Consider the example in Figure 2: the event “rebuilt” of the Manhattan Bridge does not have its own Wikipedia page, but it is mentioned in the subsection “Reconstruction” of the page “Manhattan Bridge”. Linking these events requires a model to understand the whole page instead of just encoding the first paragraph.

(iii) Events have a hierarchical structure. Events of larger granularity consist of many sub-events, and these sub-events may have their own Wikipedia pages or only be mentioned in the pages of the larger events. Ideally, the model should always link the event mention to the most appropriate page: if the sub-event page exists, link to the sub-event page; otherwise, link to the page of the larger event. However, the term “appropriate” can be unclear because of the event hierarchy. As Figure 3 shows, the Wikipedia page “Tom Brady” is most specific to the event “drafted”. On the other hand, the draft of Tom Brady is a sub-event of “2000 NFL Draft”, which is in turn a sub-event of “National Football League Draft”. Annotators prefer to link this event to “Tom Brady”, while Wikipedia hyperlinks link the event to “National Football League Draft”. The hierarchy of events makes the standard for the correct title inconsistent.

3 Related Work

Entity Linking. As described in the previous section, entity linking has been extensively studied for many years (Bunescu and Pasca, 2006; Mihalcea and Csomai, 2007; Ratinov et al., 2011; Gupta et al., 2017; Wu et al., 2020; Cao et al., 2021). Though both entity linking and event linking can be regarded as tasks that link document content to a knowledge base, we argue that entity linking is mostly about linking a text span, while event linking is about linking an event structure centered on a predicate, which is more challenging because the predicate span is usually a general verb. In the experiment section, we show that simply retraining an entity linking model on event linking data, without considering the event structure, does not perform well. Humeau et al. (2019) and Wu et al. (2020) use a bi-encoder/cross-encoder architecture to train the candidate generation/ranking models for entity linking. Considering the structure of events, which entities do not have, EveLink extends their model by adding structure information to the event mention representation.
“Event Linking”. We note that the term “event linking” has been used before in the literature (Nothman et al., 2012; Krause et al., 2016). However, those works essentially perform cross-document event coreference: determining whether a given event mention refers to another event mention (in the same or another document). We, on the other hand, link an event mention to a Wikipedia concept with a different purpose: acquiring external knowledge about the event that is often beyond what we can obtain from the local context. Our definition of event linking can not only improve the understanding of the article, but also pave the way for the intensively studied event coreference task and other event relation identification problems.
Data. Eirew et al. (2021) collect training data from Wikipedia hyperlinks for event coreference, while we use similar methods to collect data for event linking. In this work, we use the FIGER type of a title to identify event titles, while Eirew et al. (2021) use the Wikipedia infobox. There also exist other event knowledge bases, such as EventKG (Gottschalk and Demidova, 2018). Because we use hyperlinks in Wikipedia as the source of training data and do not limit the candidate space to event titles only, in this work we focus on linking event mentions to Wikipedia, and the candidate space is all Wikipedia titles.

4 Data Construction

We collect training data and in-domain test data from Wikipedia automatically, and manually annotate a test set in the news domain for out-of-domain evaluation. Table 1 lists some data examples, and Table 2 shows detailed statistics.

4.1 Wikipedia

Domain | Event mention in local context | Wikipedia title
Wiki | At the start of the wartime 1940s, he had four releases. | World War II
Wiki | Henry Louis Gates, a black Harvard University professor who was arrested after police mistakenly thought he was breaking into his own home in Cambridge, Massachusetts. | Henry Louis Gates arrest controversy
Wiki | Ibrox hosted four Scotland games in the first phase, starting with a 1994 World Cup qualifier against Portugal in October 1992. | 1994 FIFA World Cup qualification
NYT | The Nets offered Sam Cassell and a first-round draft pick for Marbury. | Sam Cassell
NYT | A man who killed his former wife, a bartender and a cook in 1984 was executed by injection early today. | Godinez v. Moran
NYT | A 45-year-old fashion photographer was shot and killed in his West Village apartment yesterday morning, the police said. | Nil

Table 1: Data examples. The upper part is data collected from Wikipedia hyperlinks. The lower part is annotated New York Times (NYT) paragraphs. Event mentions are highlighted in red.
              | Train (Wiki) | Dev (Wiki) | Test (Wiki) | Test (NYT)
Verb          | 33,213 | 8,346 | 9,633 | 1,319
  Seen Event  | - | 1,814 | 2,913 | 0
  Unseen Form | - | 2,585 | 3,828 | 75
  Unseen Event| - | 3,947 | 2,892 | 435
  Nil         | - | - | - | 809
Nominal       | 33,213 | 8,346 | 9,633 | 443
  Hard        | - | 4,173 | 4,817 | 244
  Easy        | - | 4,173 | 4,817 | 15
  Nil         | - | - | - | 184
Total         | 66,426 | 16,692 | 19,266 | 1,762

Table 2: Wikipedia and New York Times (NYT) data statistics. NYT is only for testing.

We first collect all hyperlinks (hypertext, title) in Wikipedia text, each of which links a hypertext span to a Wikipedia title. Then, we map the Freebase type of each Wikipedia title to FIGER types (Ling and Weld, 2012), and all titles with the type “Event” are regarded as event titles. All hypertexts linked to these event titles are regarded as event mentions.
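To make the collection step concrete, below is a minimal Python sketch of the filtering procedure, assuming the hyperlinks have already been extracted as (hypertext, title, context) tuples and that a title-to-FIGER-type mapping is available; the inputs and the function name are illustrative placeholders rather than the exact pipeline used in this work.

```python
from typing import Dict, List, Set, Tuple

def collect_event_mentions(
    hyperlinks: List[Tuple[str, str, str]],   # (hypertext, linked Wikipedia title, surrounding context)
    figer_types: Dict[str, Set[str]],         # Wikipedia title -> FIGER types (mapped from Freebase)
) -> List[dict]:
    """Keep only hyperlinks whose target title has the FIGER type 'Event'."""
    event_titles = {title for title, types in figer_types.items() if "Event" in types}
    mentions = []
    for hypertext, title, context in hyperlinks:
        if title in event_titles:
            mentions.append({"mention": hypertext, "gold_title": title, "context": context})
    return mentions
```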

Because the same event mention in one Wikipedia page is hyperlinked only once, and editors tend to hyperlink more nominal mentions than verb mentions, verb mentions are highly limited in Wikipedia. To balance the sizes of verbs and nominals, we use the spaCy part-of-speech model (https://spacy.io/usage/linguistic-features#pos-tagging) to keep all verb mentions, and sample the same number of nominals. To prevent models from overfitting, we design hard and easy cases for verbs and nominals:

Verbs: We classify each verb mention mainly by whether the surface form (S) of the verb is seen in the training data and whether the gold event title (T) is seen in the training data. If both S and T are seen in the training data, we call it a Seen Event. If T is seen in the training data but S is new, we call it an Unseen Form. If T is never seen in the training data, we call it an Unseen Event. Under this setting, “Seen Event” cases are regarded as easy cases, and the other two as hard cases. Because of the limited number of verb mentions, all event titles with 5 or fewer verb mentions are used as “Unseen Event”.

Nominals: We classify each nominal mention mainly by its surface-form similarity to the gold title. We calculate the Jaccard similarity between the nominal mention and the gold title by taking 3-grams of their surface forms. If the similarity is lower than 0.1, we regard it as a hard nominal; otherwise, it is an easy nominal. We then sample the same number of hard and easy cases (see the sketch below).
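The following is a small sketch of this hard/easy case design for both verbs and nominals, under the assumption that “3-grams” refers to character trigrams of the surface forms; the function names are ours.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of a lowercased surface form."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard_trigram_similarity(mention: str, title: str) -> float:
    a, b = char_ngrams(mention), char_ngrams(title)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def nominal_case(mention: str, gold_title: str, threshold: float = 0.1) -> str:
    # Nominals with low surface-form similarity to the gold title are the hard cases.
    return "hard" if jaccard_trigram_similarity(mention, gold_title) < threshold else "easy"

def verb_case(surface_seen: bool, title_seen: bool) -> str:
    # Verb split by whether the surface form (S) and the gold title (T) appear in training data.
    if title_seen and surface_seen:
        return "Seen Event"    # easy case
    if title_seen:
        return "Unseen Form"   # hard case
    return "Unseen Event"      # hard case
```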

4.2 New York Times

We sample 2,500 lead paragraphs from The New York Times Annotated Corpus (Sandhaus, 2008), which contains New York Times articles from 1987 to 2006. We first use an off-the-shelf verb and nominal SRL model (https://cogcomp.seas.upenn.edu/page/demo_view/SRLEnglish) to extract event mention candidates, and then use Amazon Mechanical Turk to annotate the corresponding Wikipedia titles of the predicted mention candidates. To ensure the quality of the annotation, we design our annotation process in two rounds:
First round. Annotators need to answer whether they think the predicted mention is an event or not. If they think it is an event, they need to find the corresponding Wikipedia title; otherwise they submit “Nil”. Each mention is annotated by three annotators. If all of them submit “Nil”, we include this event mention as a “Nil” example in the final test data. To prevent annotators from simply submitting “Nil”, 10% of the event mentions are relatively easy cases from the Wiki data for which we know the answers. We randomly insert them into the input data for Mechanical Turk (annotators are unaware of this) to evaluate each annotator’s accuracy. Only annotations from annotators with an accuracy higher than 90% are accepted.
Second round. This round verifies the annotated results from the first round. Each mention with its annotated title is verified by another three annotators. They need to read the page and decide whether it introduces the mentioned event. If the majority vote is “yes”, we include it in the final test data. Because of majority voting, some annotations on which not all annotators agree are included. The inter-annotator agreement is 63.74 Fleiss’ kappa.

4.3 Domain Analysis

Event linking in the news domain is more challenging than in the Wikipedia domain for the following reasons:

(i) News articles describe an event at a different granularity than Wikipedia does, usually with more details. For example, here is a piece of news about “Iraq_War”: “A contractor working for the American firm Kellogg Brown & Root was wounded in a mortar attack in Baghdad.” The event “wounded” here is a very small event within the Iraq War, but it is what daily news would report. On the other hand, an event mention that links to “Iraq_War” in the Wikipedia domain reads: “When touring in Europe, the US went to war in Iraq.” The different granularity in representing events makes the task slightly different in the two domains: event linking in the Wikipedia domain is more like event coreference, while event linking in the news domain involves more sub-event relation extraction.

(ii) As analyzed in Section 2, event linking is challenging because some event mentions may only exist in a subsection of the correct page, and the correct title is not consistent because of the event hierarchy. These problems, however, mainly arise in the news domain. First, mentions hyperlinked in Wikipedia usually have their own pages instead of only appearing in subsections, whereas in the news domain we also annotate events that only exist in subsections of a Wikipedia page. Second, in the Wikipedia domain, the gold title for the same kind of event mention is usually consistent. For example, all of the event mentions “drafted” for football players link to “National Football League Draft” instead of the page of the specific player. However, the annotation standard of NYT is not always consistent with Wikipedia hyperlinks. For example, annotators would link event mentions about a sports player’s draft to the page of that specific player instead of the general concept page “National Football League Draft”. These problems make data annotation and model evaluation in the news domain very challenging.

For these reasons, we believe that, for some cases in the news domain, there are multiple correct titles instead of just one. Ranking the annotated title second may simply mean that the top-ranked title is also correct. To relax the evaluation metric, for the news domain we also report Accuracy@5: if the annotated title is ranked among the top 5 candidates, the prediction is counted as correct (see the sketch below).
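A minimal sketch of this relaxed metric is given below; the ranked candidate lists and gold titles are assumed to be available, and the function name is illustrative.

```python
from typing import List

def accuracy_at_k(ranked_candidates: List[List[str]], gold_titles: List[str], k: int = 5) -> float:
    """Fraction of mentions whose annotated title appears among the top-k ranked candidates."""
    hits = sum(gold in candidates[:k] for candidates, gold in zip(ranked_candidates, gold_titles))
    return hits / len(gold_titles)

# Accuracy@1 is the usual accuracy; Accuracy@5 additionally credits predictions where the
# annotated title is ranked anywhere in the top 5 candidates.
```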

Models | Verb: Seen | Verb: Unseen Form | Verb: Unseen | Verb: Overall | Nominal: Hard | Nominal: Easy | Nominal: Overall | Verb + Nominal
Glove | 23.70 | 16.57 | 14.89 | 18.22 | 3.08 | 84.60 | 43.88 | 30.98
BM25 | 32.17 | 47.41 | 61.86 | 47.14 | 22.54 | 33.22 | 27.88 | 37.51
BLINK-Entity | 88.91 | 69.64 | 62.31 | 73.37 | 67.93 | 95.20 | 81.57 | 77.42
BLINK-Event | 80.88 | 85.84 | 84.54 | 83.95 | 79.39 | 89.10 | 84.24 | 84.10
GENRE-Entity | 92.41 | 73.48 | 59.13 | 74.90 | 79.84 | 96.53 | 88.19 | 81.54
GENRE-Event | 98.80 | 87.30 | 58.85 | 82.24 | 85.34 | 94.64 | 89.99 | 86.12
EveLink | 93.99 | 92.74 | 93.91 | 93.47 | 89.79 | 95.52 | 92.65 | 93.06

Table 3: Recall on Wikipedia Test. “Seen” means both the surface form of the mention and the gold title are seen in training. “Unseen Form” means the surface form of the mention is new, but the gold title is seen in training. “Unseen” means that the gold title is unseen in training. BLINK-Entity is the original BLINK model trained on the entity linking dataset. BLINK-Event is trained on the new event linking dataset. More details in Section 6.
Models | Verb: Seen | Verb: Unseen Form | Verb: Unseen | Verb: Overall | Nominal: Hard | Nominal: Easy | Nominal: Overall | Verb + Nominal
Prior | 62.21 | 2.38 | 1.24 | 38.81 | 34.65 | 85.99 | 61.65 | 54.79
BLINK-Entity | 64.13 | 48.56 | 45.92 | 52.48 | 46.79 | 88.27 | 67.53 | 60.00
BLINK-Event | 77.72 | 69.78 | 62.72 | 70.06 | 62.59 | 82.29 | 72.44 | 71.25
GENRE-Entity | 75.04 | 57.00 | 44.85 | 58.81 | 65.29 | 90.91 | 78.10 | 68.45
GENRE-Event | 95.50 | 73.80 | 45.16 | 71.76 | 72.60 | 88.04 | 80.32 | 76.04
EveLink | 91.21 | 80.30 | 78.08 | 82.93 | 75.90 | 89.70 | 82.80 | 82.87

Table 4: Accuracy on Wikipedia Test.

5 Model

In this section, we propose EveLink as the first event linking model. We first introduce the representation of event mentions and event titles in Section 5.1, and then introduce the model architecture in Section 5.2.

5.1 Event Representation

A key difference between an entity and an event is that the context of an entity is more diverse than the context of an event. For example, when the entity “China” is mentioned in a sentence, it is unclear which entities or events are likely to be mentioned together with it. However, if a verb like “invade” is used to refer to the event “Battle of France” in a sentence, it is very likely that entities like “Germany”, “Italy”, and “France” will also be mentioned. This shows that an event is defined by its arguments, and these arguments will, with high probability, also be mentioned in the local context, because the verb by itself cannot refer to a specific event. Given this observation, we expect the entities in the local context of the event mention to overlap with the entities in the correct Wikipedia page, and these entities can be used to help the model better represent events. To embed this entity information explicitly into the event representation, we use a method similar to how Vyas and Ballesteros (2021) embed entity attribute information into the entity representation.

Event mentions: To represent event mentions in the local context, we first use an off-the-shelf named entity recognition model (https://cogcomp.seas.upenn.edu/page/demo_view/NEREnglish) trained on the 18-type OntoNotes dataset (Weischedel et al., 2013) to extract the entities around the event. We simply define the context window as the 500 characters around the event mention. After predicting all entities $e_i$ with their types $t_i$, we represent the event mention as:

$$
\begin{aligned}
r_1 &= [\mathrm{CLS}]\ \mathrm{ctxt}_l\ [\mathrm{M}_s]\ \mathrm{m}\ [\mathrm{M}_e]\ \mathrm{ctxt}_r && (1)\\
r_2 &= [\mathrm{t}_{1_s}]\ \mathrm{e}_1\ [\mathrm{t}_{1_e}]\ \cdots\ [\mathrm{t}_{n_s}]\ \mathrm{e}_n\ [\mathrm{t}_{n_e}] && (2)\\
r_m &= r_1\ [\mathrm{SEP}]\ r_2\ [\mathrm{SEP}] && (3)
\end{aligned}
$$

where $\mathrm{m}$, $\mathrm{ctxt}_l$, $\mathrm{ctxt}_r$, and $\mathrm{e}_i$ are the tokens of the event mention, the context on the left of the mention, the context on the right of the mention, and the predicted entities, respectively. $[\mathrm{M}_s]$ and $[\mathrm{M}_e]$ are special tokens to tag the start and end of the event mention. $[\mathrm{t}_{i_s}]$ and $[\mathrm{t}_{i_e}]$ are special tokens to tag the start and end of the entity whose type is $t_i$. $r_m$ is the final representation of the event mention.
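To make Eqs. (1)–(3) concrete, the following sketch serializes the mention side into an input string; the exact special-token strings and the function name are our own placeholders, not necessarily those used in the released code.

```python
from typing import List, Tuple

def build_mention_input(
    text: str,
    mention_start: int,
    mention_end: int,
    entities: List[Tuple[str, str]],   # (entity surface form, NER type), e.g. ("Boston", "GPE")
    window: int = 500,                 # 500-character context window around the mention
) -> str:
    ctxt_l = text[max(0, mention_start - window):mention_start]
    mention = text[mention_start:mention_end]
    ctxt_r = text[mention_end:mention_end + window]
    # r_1: local context with the event mention wrapped in boundary tokens (Eq. 1);
    # the [CLS] token is added later by the tokenizer.
    r1 = f"{ctxt_l} [M_s] {mention} [M_e] {ctxt_r}"
    # r_2: surrounding entities wrapped in type-specific boundary tokens (Eq. 2).
    r2 = " ".join(f"[{t}_s] {e} [{t}_e]" for e, t in entities)
    # r_m: both parts joined with [SEP] (Eq. 3).
    return f"{r1} [SEP] {r2} [SEP]"
```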

Title: To represent Wikipedia titles, since important entities are already hyperlinked in the page contents, we take the first ten hyperlinked spans as entities, and represent the title by:

$$
\begin{aligned}
r_3 &= [\mathrm{CLS}]\ \mathrm{title}\ [\mathrm{TITLE}]\ \mathrm{description} && (4)\\
r_4 &= \mathrm{h}_1\ [\mathrm{SEP}]\ \mathrm{h}_2\ [\mathrm{SEP}]\ \cdots\ [\mathrm{SEP}]\ \mathrm{h}_n && (5)\\
r_t &= r_3\ [\mathrm{SEP}]\ r_4\ [\mathrm{SEP}] && (6)
\end{aligned}
$$

where $\mathrm{title}$, $\mathrm{h}_i$, and $\mathrm{description}$ are the tokens of the title, the hyperlinked spans, and the content of the Wikipedia page, respectively. We simply take the first 2,000 characters as the description. $[\mathrm{TITLE}]$ is the special token to separate the title and the description. $r_t$ is the final representation of Wikipedia titles.
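The title side, Eqs. (4)–(6), can be serialized analogously; again, the exact formatting and the function name are illustrative.

```python
from typing import List

def build_title_input(title: str, page_text: str, hyperlinked_spans: List[str]) -> str:
    description = page_text[:2000]               # first 2,000 characters of the page content
    r3 = f"{title} [TITLE] {description}"        # r_3: title and description (Eq. 4)
    r4 = " [SEP] ".join(hyperlinked_spans[:10])  # r_4: first ten hyperlinked spans (Eq. 5)
    return f"{r3} [SEP] {r4} [SEP]"              # r_t (Eq. 6); [CLS] is added by the tokenizer
```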

Models | Verb: Seen | Verb: Unseen Form | Verb: Unseen | Verb: Overall | Nominal: Hard | Nominal: Easy | Nominal: Overall | Verb + Nominal
Glove | - | 0.00 | 0.70 | 0.60 | 0.00 | 33.33 | 1.66 | 0.94
BM25 | - | 28.38 | 41.74 | 39.80 | 45.27 | 31.25 | 44.40 | 41.35
BLINK-Entity | - | 4.00 | 6.67 | 6.27 | 7.79 | 60.00 | 10.81 | 7.80
BLINK-Event | - | 35.14 | 37.39 | 37.06 | 37.45 | 75.00 | 39.77 | 37.97
GENRE-Entity | - | 18.92 | 9.86 | 11.18 | 9.88 | 62.50 | 13.13 | 11.83
GENRE-Event | - | 56.77 | 17.66 | 23.33 | 21.81 | 31.25 | 22.39 | 23.02
EveLink | - | 52.70 | 59.40 | 58.43 | 51.03 | 93.75 | 53.68 | 56.83

Table 5: Recall on New York Times data. Because “Nil” mentions do not have a Wikipedia title, recall is only evaluated on the mentions that exist in Wikipedia.
Models | Verb: Unseen Form | Verb: Unseen | Verb: Overall | Nominal: Hard | Nominal: Easy | Nominal: Overall | Verb + Nominal: Accu@5 | Verb + Nominal: Accu@1
Prior | 0.00 | 0.00 | 0.00 | 0.00 | 6.25 | 0.39 | 0.52 | 0.13
BLINK-Entity | 1.33 | 2.76 | 2.55 | 4.92 | 33.33 | 6.56 | 11.44 | 3.90
BLINK-Event | 17.57 | 5.28 | 7.06 | 11.11 | 37.50 | 12.74 | 17.04 | 8.97
GENRE-Entity | 8.11 | 5.73 | 6.08 | 3.29 | 31.25 | 5.02 | 11.83 | 5.72
GENRE-Event | 39.19 | 8.03 | 12.55 | 7.82 | 31.25 | 9.27 | 23.02 | 11.44
EveLink | 28.37 | 13.07 | 15.29 | 14.81 | 43.75 | 16.60 | 29.13 | 15.73

Table 6: Accuracy on New York Times data without Nil. Only event mentions that exist in Wikipedia are given. Accu@5 means the correct title is ranked in the top 5. Accu@1 means the correct title is top 1.
Models | Verb: Unseen Form | Verb: Unseen | Verb: Nil | Verb: Overall | Nominal: Hard | Nominal: Easy | Nominal: Nil | Nominal: Overall | Verb + Nominal: Accu@5 | Verb + Nominal: Accu@1
BLINK-Entity | 2.70 | 1.15 | 79.85 | 49.51 | 1.23 | 25.00 | 63.04 | 27.77 | 57.26 | 44.04
BLINK-Event | 12.16 | 1.61 | 90.85 | 56.94 | 4.53 | 37.50 | 88.59 | 40.63 | 58.45 | 52.84
EveLink | 17.57 | 4.59 | 93.08 | 59.59 | 7.00 | 43.75 | 89.67 | 42.66 | 59.70 | 55.33

Table 7: Accuracy on New York Times data with Nil. We simply predict all mentions with a probability lower than 50% as “Nil”.

5.2 Model Architecture

Similar to Wu et al. (2020), we first use a bi-encoder architecture to efficiently generate candidates, and then use a cross-encoder architecture, which requires more computation, to rank the candidates.

Candidate Generation. We use a bi-encoder architecture to train the candidate generation model. We use two independent BERT transformers (Devlin et al., 2019) to encode the representations of event mentions $r_m$ and Wikipedia titles $r_t$, and use the outputs of the two $[\mathrm{CLS}]$ tokens in $r_m$ and $r_t$ as the event mention vector $v_m$ and the title vector $v_t$. Then, we maximize the dot product between the event mention vector $v_m$ and the correct title vector $v_t$ within a batch with randomly selected negatives. At inference time, the representations of all titles are cached; for each event mention, we calculate the dot product between its representation and the representations of all titles, and the titles with the highest scores become the candidates.
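Below is a minimal PyTorch sketch of the bi-encoder scoring and in-batch training signal, assuming the Hugging Face transformers library and the serialized inputs from Section 5.1; model loading, negative sampling details, and the training loop are simplified.

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
# In practice, the special mention/entity boundary tokens would be added to the tokenizer
# vocabulary and the embedding matrix resized accordingly.
mention_encoder = BertModel.from_pretrained("bert-large-uncased")  # encodes r_m
title_encoder = BertModel.from_pretrained("bert-large-uncased")    # encodes r_t

def encode(encoder: BertModel, texts: list) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
    # Use the [CLS] vector as the mention / title representation.
    return encoder(**batch).last_hidden_state[:, 0]

def in_batch_loss(mention_texts: list, gold_title_texts: list) -> torch.Tensor:
    v_m = encode(mention_encoder, mention_texts)    # (B, H) event mention vectors
    v_t = encode(title_encoder, gold_title_texts)   # (B, H) gold title vectors
    scores = v_m @ v_t.T                            # dot products; other rows act as negatives
    labels = torch.arange(scores.size(0))
    return F.cross_entropy(scores, labels)
```

At inference time, the title vectors would be precomputed and cached, and the top-scoring titles by dot product kept as candidates, as described above.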

Candidate Ranking. For each event mention, we take 30 candidates from the candidate generation model as the training data for the ranking model, and use a cross-encoder architecture to train the candidate ranking model. We concatenate the representation of the event mention $r_m$ and the title $r_t$, use one BERT transformer to encode the concatenated representation, and use the output of the $[\mathrm{CLS}]$ token as the final vector $v$. Then we maximize the dot product between the vector $v$ of the correct title and an additional linear layer $W$.
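A matching sketch of the cross-encoder ranking step is shown below: each mention-candidate pair is encoded jointly and scored by a linear layer over the $[\mathrm{CLS}]$ vector; batching and optimization details are again simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
cross_encoder = BertModel.from_pretrained("bert-large-uncased")
scorer = nn.Linear(cross_encoder.config.hidden_size, 1)  # the additional linear layer W

def rank_loss(mention_text: str, candidate_title_texts: list, gold_index: int) -> torch.Tensor:
    # One concatenated input per (mention, candidate title) pair.
    batch = tokenizer([mention_text] * len(candidate_title_texts), candidate_title_texts,
                      padding=True, truncation=True, max_length=256, return_tensors="pt")
    v = cross_encoder(**batch).last_hidden_state[:, 0]   # [CLS] vector for each pair
    scores = scorer(v).squeeze(-1)                        # W · v for each candidate
    # Softmax over the candidate set, with the correct title as the target.
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_index]))
```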

6 Experiments

In this section, we evaluate the in-domain performance on the Wiki test set and the out-of-domain performance on the NYT test set, and conduct an error analysis. Implementation details are in Appendix A.

Baselines. Since there is no existing event linking system, we compare with previous entity linking systems. In this paper, we mainly compare our system with two SOTA entity linking models, BLINK (Wu et al., 2020) and GENRE (Cao et al., 2021). To make a fair comparison, BLINK and GENRE have the following two setups:

BLINK/GENRE-Entity: Since a large portion of event mentions are nominals, which are also a kind of entity, it is interesting to see how a SOTA entity linking system performs on event linking. Therefore, we directly test the BLINK/GENRE models pretrained specifically for entity linking. Note that the entity linking training data contains 9 million examples, much larger than the 66k examples in our event linking training data.

BLINK/GENRE-Event: This adopts the same algorithm as the original BLINK/GENRE system, but is trained on our event linking training set.

For all experiments, BLINK-Entity retrieves 10 candidates from candidate generation, and both BLINK-Event and EveLink retrieve 100 candidates from candidate generation. These numbers are tuned on the dev data. GENRE is a generation model, which does not use the same pipeline of candidate generation and ranking; we follow the original setting and use beam search with 5 beams.

Besides the SOTA entity linking systems, we also evaluate the performance of BM25, GloVe vector cosine similarity between event mentions and titles (Pennington et al., 2014), and the prior distribution. Because event mentions are limited in Wikipedia, to fairly estimate the prior distribution of the event titles, we only evaluate event mentions that appear at least 10 times in Wikipedia.

In-domain experiment on Wikipedia.

We evaluate EveLink on the Wikipedia test set as the in-domain performance. We report the recall of candidate generation in Table 3 and the accuracy of candidate ranking in Table 4. As shown in Table 3 and Table 4, EveLink outperforms the baseline models by a large margin: 6.94 points in recall and 6.38 points in accuracy. EveLink also achieves high performance on seen verbs and easy nominals, around 90 points in accuracy, but relatively low performance on the other, hard cases, which leaves a large space for future work.

Out-of-domain experiment on News.

We evaluate EveLink on the NYT test set as the out-of-domain performance. In Table 5, we evaluate the recall of candidate generation. Because “Nil” mentions do not have correct titles in Wikipedia, we only evaluate the recall of event mentions that exist in Wikipedia. Though the recall of EveLink is much higher than the recall of the other baseline models (56.83 vs. 37.97), the recall drops significantly compared with the recall on the Wikipedia test set (56.83 vs. 93.06). In Table 6, we evaluate the accuracy on the event mentions that exist in Wikipedia, which is the same setting as the experiments in the Wikipedia domain, and again the accuracy drops significantly, from 82.87 to 15.73. Even if we accept 5 predictions instead of just one to address the multiple-correct-titles problem, the Accuracy@5 is 29.13, which is still low. A detailed error analysis is in Section 6. In Table 7, we evaluate the accuracy on all event mentions, including Nil. Because we do not have Nil examples in the training and development data, we simply predict all event mentions with a probability lower than 50% as “Nil” (a sketch of this heuristic is shown below), and leave better solutions to future work. GENRE is not tested on Nil mentions because it is unclear how to obtain its prediction probability.
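A small sketch of this “Nil” heuristic follows; we interpret the threshold as a softmax probability of 0.5 over the candidate scores, which is our reading of the description above, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def predict_with_nil(candidate_titles: list, candidate_scores: torch.Tensor,
                     threshold: float = 0.5) -> str:
    """Return the top candidate, or "Nil" if its softmax probability is below the threshold."""
    probs = F.softmax(candidate_scores, dim=-1)
    best = int(torch.argmax(probs))
    return candidate_titles[best] if probs[best].item() >= threshold else "Nil"
```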

Analysis.

We investigate the following questions:

Models | Wiki Test | NYT (no Nil) | NYT
EveLink | 82.87 | 15.73 | 55.33
 - type | 81.39 | 11.70 | 55.96
 - entities | 71.25 | 8.97 | 52.84

Table 8: Ablation study of EveLink.

$\mathcal{Q}_1$: Where does the gain come from, compared with the BLINK system?

We conduct an ablation study in Table 8. Explicitly adding entities to the event representation boosts the performance by 10.14 accuracy points on the Wiki test data and 2.73 points on the NYT data. Adding entity types further improves the performance by 1.48 points on Wiki and 4.03 points on NYT data.

$\mathcal{Q}_2$: What are the error patterns of EveLink?

We collect several error patterns that are common to both domains, and patterns that appear mostly in the news domain. Error patterns common to both domains:

(i) Repeating events. Among the errors, we find many repeating events, such as award ceremonies or sports games, that happen every few years, and the model usually cannot find the correct year of the event if the year is not explicitly mentioned in the context. For example:

In 1995, his debut season, Biddiscombe made two appearances, ... The following year he earned a Rising Star nomination for his performance...

In this example, the gold event is “1996 AFL Rising Star”, and the prediction is “1998 AFL Rising Star”, even though there is a temporal hint (the year following 1995 is 1996) indicating that the correct answer should be the 1996 award. There are many similar errors when linking awards or games, which shows that deeper temporal understanding is necessary for future work.

(ii) Unrelated context. EveLink relies on the surrounding entities to link event mentions; however, the context is not always related, and the surrounding entities cannot always help with linking. For example:

Returning to his country at the end of the conflict and another begun, Barinaga rejected an offer from Athletic Bilbao, moving to Real Madrid instead.

In this example, the gold event is “World War II”, but the prediction is “1939–40 La Liga”. All the entities, like “Barinaga”, “Athletic Bilbao” and “Real Madrid”, are about football, which is unrelated to the war. To link to the correct page, the model needs to know what the second conflict is, which indicates that using only the local context may not be enough.
Error patterns specific to the news domain:
(i): Subsection events. Some events do not have their own pages, and are only introduced in the subsections of other pages. For example:

The Philippine government lifted its five-year ban on the return of Imelda Marcos today and said the widow of the late President Ferdinand Marcos was free to come home from exile in the United States.

In this example, the return of Imelda Marcos is introduced in the subsection “Return from exile (1991–present)” of the page “Imelda Marcos”. However, we only use the first 2,000 characters of the page contents to represent the title “Imelda Marcos”, which contain no information about the return from exile. A document-level representation may be a potential solution for future work.

(ii): Sub-events. Some events are sub-events of other larger events. For example:

Stepping in at the 11th hour, Hillary Rodham Clinton will campaign in Florida on Saturday for her brother, Hugh Rodham, in his bid for a United States Senate seat.

This event is a sub-event of “1994 United States Senate election in Florida”, which has different event arguments, so the names in the local context do not overlap with the names in the page.

In this work, we discuss many challenges of the task in different domains, but EveLink cannot address all of them. We leave them to future work.

7 Conclusion

In this work, we formulate Event Linking, a challenging but essential task, together with a carefully designed Wikipedia dataset and an NYT test set, and propose EveLink, an event linking model, as a baseline for future work.

8 Limitations

In this section, we discuss the limitations of our work.

  • We only focus on event linking to English Wikipedia in this work. We leave multilingual event linking to future work.

  • The performance of EveLink on hard cases is still low, for example, on events that only exist in a subsection of a Wikipedia page.

  • In this work, we simply predict all mentions with a prediction probability lower than 50% as “Nil”. We leave better solutions to future work.

Acknowledgments

This work was supported by Contract FA8750-19-2-1004 with the US Defense Advanced Research Projects Agency (DARPA), the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the BETTER Program, and a Focused Award from Google. Approved for Public Release, Distribution Unlimited. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, the Department of Defense, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References

  • Bunescu and Pasca (2006) Razvan Bunescu and Marius Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation.
  • Cao et al. (2021) Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2021. Autoregressive entity retrieval. ArXiv, abs/2010.00904.
  • Devlin et al. (2019) J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
  • Eirew et al. (2021) Alon Eirew, Arie Cattan, and Ido Dagan. 2021. WEC: Deriving a large-scale cross-document event coreference dataset from Wikipedia. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2498–2510, Online. Association for Computational Linguistics.
  • Gottschalk and Demidova (2018) Simon Gottschalk and Elena Demidova. 2018. EventKG: A multilingual event-centric temporal knowledge graph. In Proceedings of the Extended Semantic Web Conference (ESWC 2018). Springer.
  • Gupta et al. (2017) Nitish Gupta, Sameer Singh, and Dan Roth. 2017. Entity linking via joint encoding of types, descriptions, and context. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2681–2690, Copenhagen, Denmark. Association for Computational Linguistics.
  • Humeau et al. (2019) Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969.
  • Krause et al. (2016) Sebastian Krause, Feiyu Xu, Hans Uszkoreit, and Dirk Weissenborn. 2016. Event linking with sentential features from convolutional neural networks. In Proceedings of CoNLL, pages 239–249.
  • Ling and Weld (2012) Xiao Ling and Daniel S Weld. 2012. Fine-Grained Entity Recognition. In Proceedings of the National Conference on Artificial Intelligence (AAAI).
  • Mihalcea and Csomai (2007) Rada Mihalcea and Andras Csomai. 2007. Wikify!: Linking documents to encyclopedic knowledge. In Proc. of the ACM Conference on Information and Knowledge Management (CIKM).
  • Nothman et al. (2012) Joel Nothman, Matthew Honnibal, Ben Hachey, and James R. Curran. 2012. Event linking: Grounding event reference in a news archive. In Proceedings of ACL, pages 228–232.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
  • Ratinov et al. (2011) Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1375–1384, Portland, Oregon, USA. Association for Computational Linguistics.
  • Sandhaus (2008) Evan Sandhaus. 2008. The New York Times Annotated Corpus LDC2008T19. Web Download.
  • Vyas and Ballesteros (2021) Yogarshi Vyas and Miguel Ballesteros. 2021. Linking entities to unseen knowledge bases with arbitrary schemas. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 834–844, Online. Association for Computational Linguistics.
  • Weischedel et al. (2013) Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA, 23.
  • Wu et al. (2020) Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020. Scalable zero-shot entity linking with dense entity retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6397–6407, Online. Association for Computational Linguistics.

Appendix A Implementations

We use 4 Nvidia RTX A6000 48GB GPUs for model training and evaluation. For both the candidate generation model and the candidate ranking model, we train for 10 epochs with a learning rate of 1e-5, and use BERT-large-uncased as the pretrained language model (Devlin et al., 2019). The maximum number of tokens for both the event mention representation and the Wikipedia title representation is 256. Top-K is chosen from [5, 10, 30, 50, 70, 100] and tuned on the development data. The GloVe version we use is "glove-wiki-gigaword-100".

The Wikipedia dump version we use is 2020/03/01, which is also released to the public together with our annotated data.

Appendix B Data Annotation

We require all annotators from Amazon Mechanical Turk to be English speakers with an acceptance rate higher than 95%. All annotators are native English speakers and are paid more than 10 US dollars per hour.