
Verb Knowledge Injection for Multilingual Event Processing

Olga Majewska1, Ivan Vulić1,  Goran Glavaš2,  Edoardo M. Ponti1,3,  Anna Korhonen1
1Language Technology Lab, TAL, University of Cambridge, UK
2 Data and Web Science Group, University of Mannheim, Germany
3 Mila – Quebec AI Institute, Montreal, Canada
1{om304,iv250,ep490,alk23}@cam.ac.uk
2[email protected]
Abstract

In parallel to their overwhelming success across NLP tasks, the language ability of deep Transformer networks pretrained via language modeling (LM) objectives has undergone extensive scrutiny. While probing has revealed that these models encode a range of syntactic and semantic properties of a language, they are still prone to falling back on superficial cues and simple heuristics to solve downstream tasks, rather than leveraging deeper linguistic knowledge. In this paper, we target one such area of deficiency, verbal reasoning. We investigate whether injecting explicit information on verbs’ semantic-syntactic behaviour improves the performance of LM-pretrained Transformers in event extraction tasks – downstream tasks for which accurate verb processing is paramount. Concretely, we impart verb knowledge from curated lexical resources into dedicated adapter modules (dubbed verb adapters), allowing it to complement, in downstream tasks, the language knowledge obtained during LM-pretraining. We first demonstrate that injecting verb knowledge leads to performance gains in English event extraction. We then explore the utility of verb adapters for event extraction in other languages: we investigate (1) zero-shot language transfer with multilingual Transformers, as well as (2) transfer via (noisy automatic) translation of English verb-based lexical constraints. Our results show that the benefits of verb knowledge injection indeed extend to other languages, even when verb adapters are trained on noisily translated constraints.

1 Introduction

Large Transformer-based encoders, pretrained with self-supervised language modeling (LM) objectives, form the backbone of state-of-the-art models for most Natural Language Processing (NLP) tasks Devlin et al. (2019); Yang et al. (2019b); Liu et al. (2019b). Recent probing experiments showed that they implicitly extract a non-negligible amount of linguistic knowledge from text corpora in an unsupervised fashion (Hewitt and Manning, 2019; Vulić et al., 2020; Rogers et al., 2020, inter alia). In downstream tasks, however, they often rely on spurious correlations and superficial cues (Niven and Kao, 2019) rather than a deep understanding of language meaning (Bender and Koller, 2020), which is detrimental to both generalisation and interpretability (McCoy et al., 2019).

In this work, we focus on a specific facet of linguistic knowledge, namely reasoning about events. For instance, in the sentence “Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather”, an event of coming occurs in the past, with Buck Mulligan as a participant, simultaneously to an event of bearing with an additional participant, a bowl. Identifying tokens in the text that mention events and classifying the temporal and causal relations among them (Ponti and Korhonen, 2017) is crucial to understand the structure of a story or dialogue (Carlson et al., 2002; Miltsakaki et al., 2004) and to ground a text in real-world facts (Doddington et al., 2004).

Verbs (with their arguments) are prominently used for expressing events (with their participants). Thus, fine-grained knowledge about verbs, such as the syntactic patterns in which they partake and the semantic frames they evoke, may help pretrained encoders achieve a deeper understanding of text and improve their performance in event-oriented downstream tasks. Fortunately, there already exist expert-curated computational resources that organise verbs into classes based on their syntactic-semantic properties (Jackendoff, 1992; Levin, 1993). In particular, here we consider VerbNet (Schuler, 2005) and FrameNet (Baker et al., 1998) as rich sources of verb knowledge.

Expanding a line of research on injecting external linguistic knowledge into (LM-)pretrained models Peters et al. (2019); Levine et al. (2020); Lauscher et al. (2020b), we integrate verb knowledge into contextualised representations for the first time. We devise a new method to distill verb knowledge into dedicated adapter modules (Houlsby et al., 2019; Pfeiffer et al., 2020b), which reduce the risk of (catastrophic) forgetting of distributional knowledge and allow for seamless integration with other types of knowledge.

We hypothesise that complementing pretrained encoders through verb knowledge in such modular fashion should benefit model performance in downstream tasks that involve event extraction and processing. We first put this hypothesis to the test in English monolingual event identification and classification tasks from the TempEval (UzZaman et al., 2013) and ACE (Doddington et al., 2004) datasets. Foreshadowing, we report modest but consistent improvements in the former, and significant performance boosts in the latter, thus verifying that verb knowledge is indeed paramount for deeper understanding of events and their structure.

Moreover, we note that expert-curated resources are not available for most of the languages spoken worldwide. Therefore, we also investigate the effectiveness of transferring verb knowledge across languages, and in particular from English to Spanish, Arabic and Mandarin Chinese. Concretely, we compare (1) zero-shot model transfer based on massively multilingual encoders and English constraints with (2) automatic translation of English constraints into the target language. Not only do the results demonstrate that both techniques are successful, but they also shed some light on an important linguistic question: to what extent can verb classes (and predicate–argument structures) be considered cross-lingually universal, rather than varying across languages (Hartmann et al., 2013)?

Overall, our main contributions consist in 1) mitigating the limitations of pretrained encoders regarding event understanding by supplying verb knowledge from external resources; 2) proposing a new method to do so in a modular way through adapter layers; 3) exploring techniques to transfer verb knowledge to resource-poor languages. The gains in performance observed across four diverse languages and several event processing tasks and datasets warrant the conclusion that complementing distributional knowledge with curated verb knowledge is both beneficial and cost-effective.

2 Verb Knowledge for Event Processing

Figure 1: Framework for injecting verb knowledge into a pretrained Transformer encoder for event processing tasks. 1) Dedicated verb adapter parameters trained to recognise pairs of verbs from the same VerbNet (VN) class or FrameNet (FN) frame; 2) Fine-tuning for an event extraction task (e.g., event trigger identification and classification UzZaman et al. (2013)): a) full fine-tuning – Transformer’s original parameters and verb adapters both fine-tuned for the task; b) task adapter (TA) fine-tuning – additional task adapter is mounted on top of verb adapter and tuned for the task. For simplicity, we show only a single transformer layer; verb- and task-adapters are used in all Transformer layers. Snowflakes denote frozen parameters in the respective training step.

Figure 1 illustrates our framework for injecting verb knowledge from VerbNet or FrameNet and leveraging it in downstream event extraction tasks. First, we inject the external verb knowledge, formulated as so-called lexical constraints Mrkšić et al. (2017); Ponti et al. (2019) (in our case, verb pairs; see §2.1), into a (small) additional set of adapter parameters (§2.2) Houlsby et al. (2019). In the second step (§2.3), we combine the language knowledge encoded by the Transformer’s original parameters and the verb knowledge from the verb adapters to solve a particular event extraction task. To this end, we either a) fine-tune both sets of parameters (1. the pretrained LM; 2. the verb adapters) or b) freeze both sets of parameters and insert an additional set of task-specific adapter parameters. In both cases, the task-specific training is informed both by the general language knowledge captured in the pretrained LM and by the specialised verb knowledge captured in the verb adapters.

2.1 Sources of Verb Lexical Knowledge

Given the inter-connectedness between verbs’ meaning and syntactic behaviour Levin (1993); Schuler (2005), we assume that refining latent representation spaces with semantic content and predicate-argument properties of verbs will have a positive effect on event extraction tasks, which strongly revolve around verbs. Lexical classes, defined in terms of verbs’ shared semantic-syntactic properties, provide a mapping between the verbs’ senses and the morpho-syntactic realisation of their arguments Jackendoff (1992); Levin (1993). The potential of verb classifications lies in their predictive power: for any given verb, a set of rich semantic-syntactic properties can be inferred based on its class membership. In this work, we explicitly harness this rich linguistic knowledge to help LM-pretrained Transformers capture regularities in the properties of verbs and their arguments, and consequently improve their ability to reason about events. We select two major English lexical databases – VerbNet Schuler (2005) and FrameNet Baker et al. (1998) – as sources of verb knowledge at the semantic-syntactic interface, each representing a different lexical framework. Despite the different theories underpinning the two resources, their organisational units – verb classes and semantic frames – both capture regularities in verbs’ semantic-syntactic properties.[1]

[1] Initially, we also considered WordNet for creating verb constraints. While it provides records of verbs’ senses and (some) semantic relations between them, WordNet lacks comprehensive information about the (semantic-)syntactic frames in which verbs participate. We thus believe that verb knowledge from WordNet would be less effective in downstream event extraction tasks than that from VerbNet and FrameNet.

VerbNet (VN) Schuler (2005); Kipper et al. (2006) is the largest verb-focused lexicon currently available. It organises verbs into classes based on the overlap in their semantic properties and syntactic behaviour, building on the premise that a verb’s predicate-argument structure informs its meaning Levin (1993). Each entry provides a set of thematic roles and selectional preferences for the verbs’ arguments; it also lists the syntactic contexts characteristic of the class members. The classification is hierarchical, starting from broader classes and spanning several granularity levels, where each subclass further refines the semantic-syntactic properties inherited from its parent class.[2] VerbNet’s reliance on semantic-syntactic coherence as a class membership criterion means that semantically related verbs may end up in different classes because of differences in their combinatorial properties.[3] Although the sets of syntactic descriptions and corresponding semantic roles defining each VerbNet class are English-specific, the underlying notion of a semantically-syntactically defined verb class is thought to apply cross-lingually Jackendoff (1992); Levin (1993), and its translatability has been demonstrated in previous work Vulić et al. (2017); Majewska et al. (2018). The current version of English VerbNet contains 329 main classes.

[2] For example, the top-level class ‘free-80’, which includes verbs like liberate, discharge, and exonerate that participate in a NP V NP PP.theme frame (e.g., It freed him of guilt.), contains a subset of verbs participating in a syntactic frame NP V NP S_ING (‘free-80-1’), within which there exists an even more constrained subset of verbs appearing with prepositional phrases headed specifically by the preposition ‘from’ (e.g., The scientist purified the water from bacteria.).
[3] E.g., the verbs split and separate are members of two different classes with identical sets of arguments’ thematic roles, but with discrepancies in their syntactic realisations (e.g., the syntactic frame NP V NP apart is only permissible for the ‘split-23.2’ verbs: The book’s front and back cover split apart, but not *The book’s front and back cover separated apart).

FrameNet (FN) Baker et al. (1998), in contrast to the syntactically-driven class divisions in VerbNet, is more semantically oriented. Grounded in the theory of frame semantics Fillmore (1976, 1977, 1982), it organises concepts according to the semantic frames they evoke, i.e., schematic representations of situations and events, each characterised by a set of typical roles assumed by its participants. The word senses associated with each frame (FrameNet’s lexical units) are similar in terms of their semantic content, as well as their typical argument structures.[4] Currently, English FN includes a total of 1,224 frames, and its annotations illustrate the typical syntactic realisations of the elements of each frame. Frames themselves are, however, semantically defined: this means that they may be shared even across languages with different syntactic properties (e.g., descriptions of transactions will include the same frame elements Buyer, Seller, Goods, Money in most languages). Indeed, English FN has inspired similar projects in other languages: e.g., Spanish Subirats and Sato (2004), Swedish Heppin and Gronostaj (2012), Japanese Ohara (2012), and Danish Bick (2011).

[4] For example, verbs such as beat, hit, smash, and crush evoke the ‘Cause_harm’ frame describing situations in which an Agent or a Cause causes injury to a Victim (or their Body_part), e.g., A falling rock [Cause] CRUSHED the hiker’s ankle [Body_part], or The bodyguard [Agent] was caught BEATING the intruder [Victim]. Note that frame-evoking elements need not be verbs; the same frame can also be evoked by nouns, e.g., strike or poisoning.

2.2 Training Verb Adapters

Training Task and Data Generation. In order to encode information about verbs’ membership in VN classes or FN frames into a pretrained Transformer, we devise an intermediary training task in which we train a dedicated VN-/FN-knowledge adapter (hereafter VN-Adapter and FN-Adapter). We frame the task as binary word-pair classification: we predict if two verbs belong to the same VN class or FN frame. We extract training instances from FN and VN independently. This allows for a separate analysis of the impact of verb knowledge from each resource.

We generate positive training instances by extracting all unique verb pairings from the set of members of each main VN class/FN frame (e.g., walk–march), resulting in 181,882 positive instances created from VN and 57,335 from FN. We then generate $k=3$ negative examples for each positive example in a training batch by combining controlled and random sampling. In controlled sampling, we follow prior work on semantic specialisation Wieting et al. (2015); Glavaš and Vulić (2018b); Ponti et al. (2019); Lauscher et al. (2020b). For each positive example $p=(w_1, w_2)$ in the training batch $B$, we create two negatives $\hat{p}_1=(\hat{w}_1, w_2)$ and $\hat{p}_2=(w_1, \hat{w}_2)$; $\hat{w}_1$ is the verb from batch $B$ other than $w_1$ that is closest to $w_2$ in terms of their cosine similarity in an auxiliary static word embedding space $\mathbf{X}_{aux} \in \mathbb{R}^d$; conversely, $\hat{w}_2$ is the verb from $B$ other than $w_2$ closest to $w_1$. We additionally create one negative instance $\hat{p}_3=(\hat{w}_1, \hat{w}_2)$ by randomly sampling $\hat{w}_1$ and $\hat{w}_2$ from batch $B$, not considering $w_1$ and $w_2$. We make sure that negative examples are not present in the global set of all positive verb pairs from the resource.
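A minimal sketch of this negative sampling scheme (the $k=3$ [ccr] configuration used later in §3) is given below; the embeddings here are random stand-ins for a real auxiliary static space, and the final check against the global set of positive pairs is elided for brevity.

```python
import numpy as np

def sample_negatives(batch, aux_emb, rng):
    """For each positive pair (w1, w2) in the batch, create two controlled
    negatives and one random negative (the k=3 [ccr] configuration)."""
    verbs = sorted({w for pair in batch for w in pair})

    def closest(target, exclude):
        # batch verb other than `target` and `exclude` that is most similar
        # to `target` by cosine similarity in the auxiliary static space
        t = aux_emb[target] / np.linalg.norm(aux_emb[target])
        cands = [v for v in verbs if v not in (target, exclude)]
        sims = [t @ (aux_emb[v] / np.linalg.norm(aux_emb[v])) for v in cands]
        return cands[int(np.argmax(sims))]

    negatives = []
    for w1, w2 in batch:
        negatives.append((closest(w2, exclude=w1), w2))       # p^_1 = (w^_1, w2)
        negatives.append((w1, closest(w1, exclude=w2)))       # p^_2 = (w1, w^_2)
        pool = [v for v in verbs if v not in (w1, w2)]
        negatives.append(tuple(rng.choice(pool, 2, replace=False)))  # p^_3
    return negatives

rng = np.random.default_rng(0)
batch = [("walk", "march"), ("free", "liberate"), ("split", "separate")]
aux_emb = {w: rng.normal(size=300) for w in {v for p in batch for v in p}}
print(sample_negatives(batch, aux_emb, rng))
```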

Similar to Lauscher et al. (2020b), we tokenise each (positive and negative) training instance into WordPiece tokens, prepended with the sequence start token [CLS], and with [SEP] tokens in between the verbs and at the end of the input sequence. We consider the representation of the [CLS] token, $\mathbf{x}_{\mathit{CLS}} \in \mathbb{R}^h$ (with $h$ as the hidden state size of the Transformer), output by the last Transformer layer, to be the latent representation of the verb pair, and feed it to a simple binary classifier:[5]

$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{x}_{\mathit{CLS}}\,\mathbf{W}_{cl} + \mathbf{b}_{cl})$    (1)

with $\mathbf{W}_{cl} \in \mathbb{R}^{h \times 2}$ and $\mathbf{b}_{cl} \in \mathbb{R}^2$ as the classifier’s trainable parameters. We train by minimising the standard cross-entropy loss ($\mathcal{L}_{\mathit{VERB}}$ in Figure 1).

[5] We also experimented with sentence-level tasks for injecting verb knowledge, with target verbs presented in sentential contexts drawn from example sentences from VN/FN: we fed (a) pairs of sentences in a binary classification setup (e.g., Jackie leads Rose to the store. – Jackie escorts Rose.); and (b) individual sentences in a multi-class classification setup (predicting the correct VN class/FN frame). Both these variants with sentence-level input, however, led to weaker downstream performance.
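For concreteness, the following sketch implements the verb-pair classification objective using the HuggingFace transformers API (an illustrative choice on our part, not necessarily the authors’ exact code); the softmax of Eq. (1) is folded into PyTorch’s cross-entropy loss.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")

W_cl = nn.Linear(bert.config.hidden_size, 2)  # W_cl and b_cl from Eq. (1)
loss_fn = nn.CrossEntropyLoss()               # the cross-entropy loss L_VERB

def verb_pair_loss(w1: str, w2: str, label: int) -> torch.Tensor:
    # Builds "[CLS] w1 [SEP] w2 [SEP]"; the tokenizer adds the special tokens.
    enc = tokenizer(w1, w2, return_tensors="pt")
    x_cls = bert(**enc).last_hidden_state[:, 0]  # [CLS] from the last layer
    logits = W_cl(x_cls)                         # softmax lives inside the loss
    return loss_fn(logits, torch.tensor([label]))

loss = verb_pair_loss("walk", "march", label=1)  # same VN class -> positive
loss.backward()
```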

Adapter Architecture. Instead of directly fine-tuning all parameters of the pretrained Transformer, we opt for storing verb knowledge in a separate set of adapter parameters, keeping the verb knowledge separate from the general language knowledge acquired in pretraining. This (1) allows downstream training to flexibly combine the two sources of knowledge, and (2) bypasses the issues with catastrophic forgetting and interference Hashimoto et al. (2017); de Masson d’Autume et al. (2019).

We adopt the adapter architecture of Pfeiffer et al. (2020a, c), which exhibits performance comparable to the more commonly used architecture of Houlsby et al. (2019), while being computationally more efficient. In each Transformer layer $l$, we insert a single adapter module ($\mathit{Adapter}_l$) after the feed-forward sub-layer. The adapter module itself is a two-layer feed-forward neural network with a residual connection, consisting of a down-projection $\mathbf{D} \in \mathbb{R}^{h \times m}$, a GeLU activation Hendrycks and Gimpel (2016), and an up-projection $\mathbf{U} \in \mathbb{R}^{m \times h}$, where $h$ is the hidden size of the Transformer model and $m$ is the dimension of the adapter:

$\mathit{Adapter}_l(\mathbf{h}_l, \mathbf{r}_l) = \mathbf{U}_l\big(\mathrm{GeLU}(\mathbf{D}_l(\mathbf{h}_l))\big) + \mathbf{r}_l$    (2)

where $\mathbf{r}_l$ is the residual connection, i.e., the output of the Transformer’s feed-forward layer, and $\mathbf{h}_l$ is the Transformer hidden state, i.e., the output of the subsequent layer normalisation.
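In PyTorch, the adapter module of Eq. (2) can be sketched as follows; the hidden size and reduction factor match the BERT Base configuration used later in §3, and the module names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Pfeiffer-style adapter (Eq. 2): down-projection D, GeLU,
    up-projection U, plus a residual connection r."""

    def __init__(self, hidden_size: int = 768, reduction_factor: int = 16):
        super().__init__()
        m = hidden_size // reduction_factor     # adapter dim m (48 for BERT Base)
        self.down = nn.Linear(hidden_size, m)   # D in R^{h x m}
        self.up = nn.Linear(m, hidden_size)     # U in R^{m x h}

    def forward(self, h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # h: output of the layer normalisation;
        # r: residual from the Transformer's feed-forward sub-layer
        return self.up(F.gelu(self.down(h))) + r

x = torch.randn(2, 12, 768)                   # (batch, tokens, hidden)
out = Adapter()(F.layer_norm(x, (768,)), x)   # one adapter call per layer
```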

2.3 Downstream Fine-Tuning for Event Tasks

With verb knowledge from VN/FN injected into the parameters of VN-/FN-Adapters, we proceed to the downstream fine-tuning for a concrete event extraction task. Tasks that we experiment with (see §3) are (1) token-level event trigger identification and classification and (2) span extraction for event triggers and arguments (a sequence labeling task). For the former, we mount a classification head – a simple single-layer feed-forward softmax regression classifier – on top of the Transformer augmented with VN-/FN-Adapters. For the latter, we follow the architecture from prior work M’hamdi et al. (2019); Wang et al. (2019) and add a CRF layer Lafferty et al. (2001) on top of the sequence of Transformer’s outputs (for subword tokens), in order to learn inter-dependencies between output tags and determine the optimal tag sequence.
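As an illustration of the span-extraction head, below is a minimal PyTorch sketch pairing a linear emission layer with a CRF. We use the third-party pytorch-crf package as a stand-in CRF implementation (not the authors’ code), and the tag count of 9 is an arbitrary placeholder for a BIO-style tag set.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf; a stand-in CRF implementation

class SpanTagger(nn.Module):
    """Token emissions from the encoder feed a CRF that scores tag sequences,
    capturing inter-dependencies between output tags."""

    def __init__(self, hidden_size: int = 768, num_tags: int = 9):
        super().__init__()
        self.emit = nn.Linear(hidden_size, num_tags)  # per-subword tag scores
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, hidden_states, tags=None, mask=None):
        emissions = self.emit(hidden_states)
        if tags is not None:                  # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # optimal tag sequence

tagger = SpanTagger()
h = torch.randn(2, 16, 768)                   # encoder outputs for 16 subwords
tags = torch.zeros(2, 16, dtype=torch.long)
loss = tagger(h, tags)                        # Viterbi decoding: tagger(h)
```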

For both types of downstream tasks, we propose and evaluate two different fine-tuning regimes: (1) full downstream fine-tuning, in which we update both the original Transformer’s parameters and VN-/FN-Adapters (see 2a in Figure 1); and (2) task-adapter (TA) fine-tuning, where we keep both Transformer’s original parameters and VN-/FN-Adapters frozen, while stacking a new trainable task adapter on top of the VN-/FN-Adapter in each Transformer layer (see 2b in Figure 1).
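To make the two regimes concrete, here is a minimal sketch of the parameter-freezing logic; the module-name substrings ("verb_adapter", "task_adapter", "classifier") are illustrative assumptions, not names from the authors’ implementation.

```python
import torch.nn as nn

def set_trainable(model: nn.Module, trainable_substrings) -> None:
    """Freeze every parameter except those whose names contain one of the
    given substrings (e.g. the task adapter and the task head)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)

# (1) full fine-tuning: BERT and the verb adapter are both updated
# set_trainable(model, ["bert", "verb_adapter", "classifier"])

# (2) task-adapter fine-tuning: only the stacked task adapter and the
#     task head are updated; BERT and the verb adapter stay frozen
# set_trainable(model, ["task_adapter", "classifier"])
```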

2.4 Cross-Lingual Transfer

Creation of curated resources like VN or FN takes years of expert linguistic labour. Consequently, such resources do not exist for the vast majority of languages. Given the inherent cross-lingual nature of verb classes and semantic frames (see §2.1), we investigate the potential for verb knowledge transfer from English to target languages, without any manual target-language adjustments. Massively multilingual Transformers, such as multilingual BERT (mBERT) Devlin et al. (2019) or XLM-R Conneau et al. (2020), have become the de facto standard mechanisms for zero-shot (zs) cross-lingual transfer. We adopt mBERT in our first language transfer approach: we fine-tune mBERT first on the English verb knowledge, then on the English task data, and finally simply make task predictions on the target-language input.

Our second transfer approach, dubbed vtrans, is inspired by the work on cross-lingual transfer of semantic specialisation for static word embedding spaces Glavaš et al. (2019); Ponti et al. (2019); Wang et al. (2020b).

              VerbNet  FrameNet
English (EN)  181,882    57,335
Spanish (ES)   96,300    36,623
Chinese (ZH)   60,365    21,815
Arabic (AR)    70,278    24,551
Table 1: Number of positive training verb pairs in English, and in each target language obtained via the vtrans method (§2.4).

Starting from a set of positive pairs $P$ from English VN/FN, vtrans involves three steps: (1) automatic translation of the verbs in each pair into the target language; (2) filtering of the noisy target-language pairs by means of a relation prediction model trained on the English examples; and (3) training the verb adapters, injected into either mBERT or a target-language BERT, with the target-language verb pairs. For (1), we translate the verbs by retrieving their nearest neighbours in the target language from a shared cross-lingual embedding space, aligned using the Relaxed Cross-domain Similarity Local Scaling (RCSLS) model of Joulin et al. (2018). Such a translation procedure is liable to error due to an imperfect cross-lingual embedding space as well as polysemy and out-of-context word translation. We mitigate these issues in step (2), where we purify the set of noisily translated target-language verb pairs by means of a neural lexico-semantic relation prediction model, the Specialization Tensor Model (STM) Glavaš and Vulić (2018a), here adjusted for binary classification. We train the STM on the same task as the verb adapters during verb knowledge injection (§2.2): to distinguish (positive) verb pairs from the same English VN class/FN frame from those from different VN classes/FN frames. In training, the inputs to the STM are static word embeddings of English verbs taken from a shared cross-lingual word embedding space. We then make predictions in the target language by feeding the STM vectors of target-language verbs (from the noisily translated verb pairs), taken from the same cross-lingual word embedding space (we provide more details on STM training in Appendix C). Finally, in step (3), we retain only the target-language verb pairs identified by the STM as positive and perform direct monolingual FN-/VN-Adapter training in the target language, following the same protocol used for English (§2.2).
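A schematic sketch of the first two vtrans steps, assuming verbs are keyed to vectors in the shared RCSLS-aligned space and the trained STM is available as an opaque predict callable:

```python
import numpy as np

def translate_pairs(en_pairs, en_emb, tgt_emb):
    """Step (1): map each English verb to its nearest target-language
    neighbour in the shared RCSLS-aligned embedding space."""
    tgt_words = list(tgt_emb)
    T = np.stack([tgt_emb[w] for w in tgt_words])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)  # unit-normalise targets

    def nearest(word):
        v = en_emb[word]
        return tgt_words[int(np.argmax(T @ (v / np.linalg.norm(v))))]

    return [(nearest(w1), nearest(w2)) for w1, w2 in en_pairs]

def filter_pairs(tgt_pairs, stm_predict):
    """Step (2): keep only the noisily translated pairs that the
    English-trained relation classifier labels as positive."""
    return [pair for pair in tgt_pairs if stm_predict(pair) == 1]

# Usage sketch (stm.predict is an assumed interface to the trained STM):
# es_pairs = filter_pairs(translate_pairs(en_pairs, en_ft, es_ft), stm.predict)
```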

3 Experimental Setup

Event Processing Tasks and Data. In light of the pivotal role of verbs in encoding the unfolding of actions and occurrences in time, as well as the nature of the relations between their participants, sensitivity to the cues they provide is especially important in event processing tasks. There, systems are tasked with detecting that something happened, identifying what type of occurrence took place, and determining what entities were involved. Verbs typically act as the organisational core of each such event schema,[6] carrying much of the semantic and structural weight. Therefore, a model’s grasp of verbs’ properties should have a bearing on ultimate task performance. Based on this assumption, we select event extraction and classification as suitable evaluation tasks to profile the methods from §2.

[6] Event expressions are not, however, restricted to verbs: adjectives, nominalisations, or prepositional phrases can also act as event triggers (consider, e.g., Two weeks after the murder took place…, Following the recent acquisition of the company’s assets…).

These tasks and the corresponding data are based on two prominent frameworks for annotating event expressions: TimeML Pustejovsky et al. (2003, 2005) and the Automatic Content Extraction (ACE) program Doddington et al. (2004). First, we rely on the TimeML-annotated corpus from the TempEval tasks Verhagen et al. (2010); UzZaman et al. (2013), which target automatic identification of temporal expressions, events, and temporal relations. Second, we use the ACE dataset, which provides annotations for entities, the relations between them, and the events in which they participate in newspaper and newswire text. We provide more details about the frameworks and their corresponding annotation schemes in Appendix A.

Task 1: Trigger Identification and Classification (TempEval). We frame the first event extraction task as a token-level classification problem: predicting whether a token triggers an event and assigning it to one of the following event types: OCCURRENCE (e.g., died, attacks), STATE (e.g., share, assigned), REPORTING (e.g., announced, said), I-ACTION (e.g., agreed, trying), I-STATE (e.g., understands, wants, consider), ASPECTUAL (e.g., ending, began), and PERCEPTION (e.g., watched, spotted).[7] We use the TempEval-3 data for English and Spanish UzZaman et al. (2013), and the TempEval-2 data for Chinese Verhagen et al. (2010) (see Table 2 for dataset sizes).

[7] E.g., in the sentence “The rules can also affect small businesses, which sometimes pay premiums tied to employees’ health status and claims history.”, affect and pay are event triggers of type STATE and OCCURRENCE, respectively.

                    Train    Test
TempEval  English  830,005   7,174
          Spanish   51,511   5,466
          Chinese   23,180   5,313
ACE       English      529      40
          Chinese      573      43
          Arabic       356      27
Table 2: Number of tokens (TempEval) and documents (ACE) in the training and test sets.

Task 2: Trigger and Argument Identification and Classification (ACE). In this sequence-labeling task, we detect and label event triggers and their arguments, with four individually scored subtasks: (i) trigger identification, where we identify the key word conveying the nature of the event; (ii) trigger classification, where we classify the trigger word into one of the predefined categories; (iii) argument identification, where we predict whether an entity mention is an argument of the event identified in (i); and (iv) argument classification, where the correct role needs to be assigned to the identified event arguments. We use the ACE data available for English, Chinese, and Arabic.[8]

[8] The ACE annotations distinguish 34 trigger types (e.g., Business:Merge-Org, Justice:Trial-Hearing, Conflict:Attack) and 35 argument roles. Following previous work Hsi et al. (2016), we conflate eight time-related argument roles (e.g., ‘Time-At-End’, ‘Time-Before’, ‘Time-At-Beginning’) into a single ‘Time’ role in order to alleviate training data sparsity.
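The role conflation described in the footnote amounts to a simple mapping; only three of the eight time roles are named in the text, so the remaining five below are filled in from the ACE 2005 guidelines and should be read as an assumption of this sketch.

```python
# Roles not named in the text (Time-After, Time-Starting, Time-Ending,
# Time-Holds, Time-Within) are an assumption based on the ACE 2005 guidelines.
TIME_ROLES = {
    "Time-At-End", "Time-Before", "Time-At-Beginning",
    "Time-After", "Time-Starting", "Time-Ending", "Time-Holds", "Time-Within",
}

def conflate_role(role: str) -> str:
    """Collapse all fine-grained time-related roles into a single 'Time' role."""
    return "Time" if role in TIME_ROLES else role

assert conflate_role("Time-Before") == "Time"
assert conflate_role("Attacker") == "Attacker"
```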

Event extraction as specified in these two frameworks is a challenging, highly context-sensitive problem, where different words (most often verbs) may trigger the same type of event, and conversely, the same word (verb) can evoke different types of event schemata depending on the context. Adopting these tasks as our experimental setup thus tests whether leveraging fine-grained curated knowledge of verbs’ semantic-syntactic behaviour can improve models’ reasoning about event-triggering predicates and their arguments.

Model Configurations. For each task, we compare the performance of the underlying “vanilla” BERT-based model (see §2.3) against its variant with an added VN-Adapter or FN-Adapter[9] (see §2.2) in two regimes: (a) full fine-tuning, and (b) task adapter (TA) fine-tuning (see Figure 1 again). To ensure that any performance gains are not merely due to the increased parameter capacity offered by the adapter module, we evaluate an additional setup in which we replace the knowledge adapter with a randomly initialised adapter module of the same size (+Random). Additionally, we examine the impact of increasing the capacity of the trainable task adapter by replacing it with a ‘Double Task Adapter’ (2TA), i.e., a task adapter with double the number of trainable parameters compared to the base architecture described in §2.2.

[9] We also experimented with inserting both verb adapters simultaneously; however, this resulted in weaker downstream performance than adding each separately, a likely product of the partly redundant, partly conflicting information encoded in these adapters (see §2.1 for a comparison of VN and FN).

Training Details: Verb Adapters. We experimented with $k \in \{2,3,4\}$ negative examples and the following combinations of controlled ($c$) and randomly ($r$) sampled negatives (see §2.2): $k=2$ [cc], $k=3$ [ccr], $k=4$ [ccrr]. In our preliminary experiments we found the $k=3$ [ccr] configuration to yield the best-performing adapter modules. The downstream evaluation and analysis presented in §4 is therefore based on this setup.

Our VN- and FN-Adapters are injected into the cased variant of the BERT Base model. Following Pfeiffer et al. (2020a), we train the adapters for 30 epochs using the Adam algorithm Kingma and Ba (2015), a learning rate of 1e-4, and an adapter reduction factor of 16, i.e., $m=48$. Our batch size is 64, comprising 16 positive examples and $3 \times 16 = 48$ negative examples (since $k=3$). We provide more details on the hyperparameter search in Appendix B.

Downstream Task Fine-Tuning. In downstream fine-tuning on Task 1 (TempEval), we train for 10 epochs in batches of size 32, with a learning rate of 1e-4 and a maximum input sequence length of $T=128$ WordPiece tokens. In Task 2 (ACE), in light of greater data sparsity,[10] we search for the optimal hyperparameter configuration for each language and evaluation setup from the following grid: learning rate $l \in \{1e{-}5, 1e{-}6\}$ and number of epochs $n \in \{3, 5, 10, 25, 50\}$ (with a maximum input sequence length of $T=128$).

[10] Most event types in ACE ($\approx$70%) have fewer than 100 labeled instances, and three have fewer than 10 Liu et al. (2018).

Transfer Experiments. For zero-shot (zs) transfer experiments, we leverage mBERT, to which we add the VN- or FN-Adapter trained on the English VN/FN data. We train the model on the English training data available for each task and evaluate it on the test set in the target language. For the vtrans approach (see §2.4), we use language-specific BERT models readily available for our target languages, and leverage target-language adapters trained on translated and automatically refined verb pairs. The model, with or without the target-language VN-/FN-Adapter, is trained and evaluated on the training and test data available in the target language. We carry out this procedure for three target languages (see Table 1). We use the same negative sampling configuration that proved strongest in our English experiments ($k=3$ [ccr]).

4 Results and Discussion

                         FFT   +Random  +FN   +VN  |  TA    +Random  +FN   +VN
TempEval  T-ident&class  73.6   73.5    73.6  73.6 |  74.5   74.4    75.0  75.2
ACE       T-ident        69.3   69.6    70.8  70.3 |  65.1   65.0    65.7  66.4
          T-class        65.3   65.5    66.7  66.2 |  58.0   58.5    59.5  60.2
          ARG-ident      33.8   33.5    34.2  34.6 |   2.1    1.9     2.3   2.5
          ARG-class      31.6   31.6    32.2  32.8 |   0.6    0.6     0.8   0.8
Table 3: Results on English TempEval and ACE test sets for the full fine-tuning (FFT) setup and the task adapter (TA) setup. Provided are average F1 scores over 10 runs, with statistically significant (paired t-test; p<0.05) improvements over both baselines marked in bold.
                    FFT   +Random  +FN   +VN  |  TA    +Random  +FN   +VN
Spanish  mBERT-zs   37.2   37.2    37.0  36.6 |  38.0   38.0    38.6  36.5
         ES-BERT    77.7   77.1    77.6  77.4 |  70.0   70.0    70.7  70.6
         ES-mBERT   73.5   73.6    74.4  74.1 |  65.3   65.4    65.8  66.2
Chinese  mBERT-zs   49.9   49.9    50.5  47.9 |  49.2   49.5    50.1  48.2
         ZH-BERT    82.0   81.6    81.8  81.8 |  76.2   76.3    75.9  76.9
         ZH-mBERT   80.2   80.1    79.9  80.0 |  71.8   71.8    72.1  71.9
Table 4: Results on Spanish and Chinese TempEval test sets for the full fine-tuning (FFT) and task adapter (TA) setups, for zero-shot (zs) transfer with mBERT and monolingual target-language evaluation with language-specific BERT (ES-BERT / ZH-BERT) or mBERT (ES-mBERT / ZH-mBERT), with FN/VN adapters trained on vtrans-translated verb pairs (see §2.4). F1 scores are averaged over 10 runs, with statistically significant (paired t-test; p<0.05) improvements over both baselines marked in bold.
                             FFT   +Random  +FN   +VN  |  TA    +Random  +FN   +VN
Arabic   mBERT-zs  T-ident   15.8   13.5    17.2  16.3 |  29.4   30.3    32.9  32.4
                   T-class   14.2   12.2    16.1  15.6 |  25.6   26.3    27.8  28.4
                   ARG-ident  1.2    0.6     2.1   2.7 |   2.0    3.3     3.3   3.6
                   ARG-class  0.9    0.4     1.5   1.9 |   1.2    1.6     1.6   1.3
         AR-BERT   T-ident   68.8   68.9    70.2  68.6 |  24.0   21.3    24.6  23.5
                   T-class   63.6   62.8    64.4  62.8 |  22.0   19.5    23.1  22.3
                   ARG-ident 31.7   29.3    34.0  33.4 |    –      –       –     –
                   ARG-class 28.4   26.7    30.3  29.7 |    –      –       –     –
Chinese  mBERT-zs  T-ident   36.9   36.7    42.1  36.8 |  47.8   49.4    55.0  55.4
                   T-class   27.9   25.2    30.9  29.8 |  38.6   40.1    43.5  44.9
                   ARG-ident  4.3    3.1     5.5   6.1 |   5.1    6.0     7.6   8.4
                   ARG-class  3.9    2.7     4.9   5.2 |   3.5    4.7     5.7   7.1
         ZH-BERT   T-ident   75.5   74.9    74.5  74.9 |  69.8   69.3    70.0  70.2
                   T-class   67.9   68.2    68.0  68.6 |  58.4   57.5    59.9  60.0
                   ARG-ident 27.3   26.1    29.8  28.8 |    –      –       –     –
                   ARG-class 25.8   25.2    28.2  27.2 |    –      –       –     –
Table 5: Results on Arabic and Chinese ACE test sets for the full fine-tuning (FFT) setup and the task adapter (TA) setup, for zero-shot (zs) transfer with mBERT and the vtrans transfer approach with language-specific BERT (AR-BERT / ZH-BERT) and FN/VN adapters trained on noisily translated verb pairs (§2.4). F1 scores are averaged over 5 runs; statistically significant (paired t-test; p<0.05) improvements over both baselines are marked in bold. Dashes denote argument-task setups omitted due to the computational cost of TA-based training (see footnote 13).

4.1 Main Results

English Event Processing. Table 3 shows the performance on English Task 1 (TempEval) and Task 2 (ACE). First, we note that the computationally more efficient setup with a dedicated task adapter (TA) yields higher absolute scores than full fine-tuning (FFT) on TempEval. When the underlying BERT is frozen along with the added FN-/VN-Adapter, the TA is forced to encode additional task-specific knowledge into its parameters, beyond what is provided in the verb adapter; this results in the two strongest overall scores, both from the +FN/VN setups. In Task 2, the primacy of TA-based training is overturned in favour of full fine-tuning. Encouragingly, the boosts provided by verb adapters are visible regardless of the chosen task fine-tuning regime, that is, regardless of whether the underlying BERT’s parameters remain fixed or not. We notice consistent, statistically significant[11] improvements in the +VN setup, although the performance of the TA-based setups clearly suffers in argument (arg) tasks due to the decreased trainable parameter capacity. The lack of visible improvements from the Random Adapter supports the interpretation that the performance gains indeed stem from the added useful ‘non-random’ signal in the verb adapters.

[11] We test significance with the Student’s t-test, with the significance level set at α=0.05, over sets of model F1 scores.

Multilingual Event Processing. Table 4 compares the performance of zero-shot (zs) transfer and monolingual target-language training (via the vtrans approach) on TempEval in Spanish and Chinese. For both we see that the addition of the FN-Adapter in the TA-based setup boosts zero-shot transfer. Benefits of this knowledge injection extend to the full fine-tuning setup in Chinese, achieving the top score overall.

In monolingual evaluation, we observe consistent gains from the added transferred knowledge (i.e., the vtrans approach) in Spanish, while in Chinese performance boosts come from the transferred VerbNet-style class membership information (+VN). These results suggest that even the noisily translated verb pairs carry enough useful signal through to the target language. To tease apart the contributions of the language-specific encoders and the transferred verb knowledge to task performance, we carry out an additional monolingual evaluation, substituting the monolingual target-language BERT with the massively multilingual encoder trained on the (noisy) target-language verb signal (ES-mBERT/ZH-mBERT). Notably, although the performance of the massively multilingual model is lower than that of the language-specific BERTs in absolute terms, the addition of the transferred verb knowledge helps reduce the gap between the two encoders, with tangible gains achieved over the baselines in Spanish (see the discussion in §4.2).[12]

[12] Given that analogous patterns were observed in the relative scores of mBERT and language-specific BERTs in monolingual evaluation on ACE (Task 2), for brevity we show the vtrans results with mBERT on TempEval only.

In ACE, the top performance scores are achieved in the monolingual full fine-tuning setting; as seen in English, keeping the full capacity of BERT’s parameters unfrozen noticeably helps performance.[13] In Arabic, FN knowledge provides performance boosts across the four tasks and with both the zero-shot (zs) and monolingual (vtrans) transfer approaches, whereas the addition of the VN-Adapter boosts scores in arg tasks. The usefulness of FN knowledge extends to zero-shot transfer in Chinese, and both adapters benefit the arg tasks in the monolingual (vtrans) transfer setup. Notably, in zero-shot transfer, the highest scores are achieved with task adapter (TA) fine-tuning, where the inclusion of the knowledge adapters offers additional performance gains. Overall, however, the argument tasks elude the restricted capacity of the TA-based setup, with very low scores across the board.

[13] This is especially the case in arg tasks, where the TA-based setup fails to achieve meaningful improvements over zero, even with extended training of up to 100 epochs. Due to the computational burden of such long training, the results in this setup are limited to trigger tasks (after 50 epochs).

4.2 Further Discussion

Zero-shot Transfer vs Monolingual Training. The results reveal a considerable gap between the performance of zero-shot transfer and monolingual fine-tuning. The event extraction tasks pose a significant challenge to the zero-shot transfer via mBERT, where downstream event extraction training data is in English; however, mBERT exhibits much more robust performance in the monolingual setup, when presented with training data for event extraction tasks in the target language – here it trails language-specific BERT models by less than 5 points (see Table 4). This is an encouraging result, given that LM-pretrained language-specific Transformers currently exist only for a narrow set of well-resourced languages: for all other languages – should there be language-specific event extraction data – one needs to resort to massively multilingual Transformers. What is more, mBERT’s performance is further improved by the inclusion of transferred verb knowledge (the vtrans approach, see §2.4): in Spanish, where the greater typological vicinity to English (compared to Chinese) renders direct transfer of semantic-syntactic information more viable, the addition of verb adapters (trained on noisy Spanish constraints) yields significant improvements both in the FFT and the TA setup. These results confirm the effectiveness of lexical knowledge transfer (i.e., the vtrans approach) observed in previous work Ponti et al. (2019); Wang et al. (2020b) in the context of semantic specialisation of static word embedding spaces.

Double Task Adapter. The addition of a verb adapter increases the parameter capacity of the underlying pretrained model. To verify whether increasing the number of trainable parameters in TA cancels out the benefits from the frozen verb adapter, we run additional evaluation in the TA-based setup, but with trainable task adapters double the size of the standard TA (2TA). Promisingly, we see in Tables 6 and 7 that the relative performance gains from FN/VN adapters are preserved regardless of the added trainable parameter capacity. As expected, the increased task adapter size helps argument tasks in ACE, where verb adapters produce additional gains. Overall, this suggests that verb adapters indeed encode additional, non-redundant information beyond what is offered by the pretrained model alone, and boost the dedicated task adapter in solving the problem at hand.

                    2TA   +FN   +VN
English  EN-BERT    74.5  74.8  74.8
Spanish  mBERT-zs   37.7  38.3  37.1
         ES-BERT    73.1  73.6  73.6
Chinese  mBERT-zs   49.1  50.1  48.8
         ZH-BERT    78.1  78.1  78.6
Table 6: Results on TempEval for the Double Task Adapter-based approaches (2TA). Significant improvements (paired t-test; p<0.05) marked in bold.
                       2TA   +FN   +VN
EN  EN-BERT  T-ident   67.5  68.1  68.9
             T-class   61.6  62.6  62.7
             ARG-ident  6.2   8.9   7.1
             ARG-class  3.9   6.7   5.0
AR  mBERT-zs T-ident   31.2  32.6  31.7
             T-class   26.3  27.1  29.3
             ARG-ident  5.9   6.0   6.9
             ARG-class  3.9   4.1   4.3
    AR-BERT  T-ident   40.6  42.3  43.0
             T-class   36.9  38.1  39.5
             ARG-ident   –     –     –
             ARG-class   –     –     –
ZH  mBERT-zs T-ident   54.6  56.3  58.1
             T-class   45.6  46.2  46.9
             ARG-ident  9.2  10.8  11.3
             ARG-class  8.0   8.5   9.9
    ZH-BERT  T-ident   72.3  73.1  72.0
             T-class   59.6  63.0  61.3
             ARG-ident  2.6   2.8   3.3
             ARG-class  2.3   2.6   2.9
Table 7: Results on ACE for the Double Task Adapter-based approaches (2TA). Significant improvements (paired t-test; p<0.05) marked in bold. Dashes denote argument-task setups that were not run (see footnote 13).
          FFT +FN-ES     TA +FN-ES     2TA +FN-ES
ES-BERT   78.0 (+0.4)    70.9 (+0.2)   73.8 (+0.2)
Table 8: Results (F1 scores) on Spanish TempEval for different configurations of Spanish BERT with an added Spanish FN-Adapter (FN-ES), trained on clean Spanish FN constraints. Numbers in brackets indicate relative performance w.r.t. the corresponding setup with an FN-Adapter trained on the (larger) set of noisy Spanish constraints obtained through automatic translation of verb pairs from English FN (the vtrans approach).

Cleanliness of Verb Knowledge. Gains from verb adapters suggest that there is potential to find supplementary information within structured lexical resources that can support distributional models in tackling tasks where nuanced knowledge of verb behaviour is important. The fact that we obtain the best transfer performance through noisy translation of English verb knowledge suggests that these benefits transcend language boundaries.

There are, however, two main limitations to the translation-based (vtrans) approach we used to train our target-language verb adapters (especially in the context of VerbNet constraints): (1) noisy translation based on cross-lingual semantic similarity may already break the VerbNet class membership alignment (i.e., words close in meaning may belong to different VerbNet classes due to differences in syntactic behaviour); and (2) verb classes are language-specific: given the delicate language-specific interplay of semantic and syntactic information, they cannot be directly ported to another language without adjustments. This is in contrast to the proven cross-lingual portability of synonymy and antonymy relations, shown in previous work on semantic specialisation transfer Mrkšić et al. (2017); Ponti et al. (2019), which rely on semantics alone. In the case of VerbNet, despite the cross-lingual applicability of a semantically-syntactically defined verb class as a lexical organisational unit, the fine-grained class divisions and exact class memberships may be too English-specific to allow direct automatic translation. On the contrary, the semantically driven FrameNet lends itself better to cross-lingual transfer, given that it focuses on the functions and roles played by event participants rather than their surface realisations (see §2.1). Indeed, although the FN and VN adapters both offer performance gains in our evaluation, the somewhat more consistent improvements from the FN-Adapter may be symptomatic of the resource’s greater cross-lingual portability.

To quickly verify whether noisy translation and direct transfer from English curb the usefulness of the injected verb knowledge, we additionally evaluate the injection of clean verb knowledge obtained from a small lexical resource available in one of the target languages: Spanish FrameNet Subirats and Sato (2004). Using the procedure described in §2.2, we derive 2,886 positive verb pairs from Spanish FN and train a Spanish FN-Adapter (on top of Spanish BERT) on this much smaller but clean set of Spanish FN constraints. The results in Table 8 show that, despite having 12 times fewer positive examples for training the verb adapter than the translation-based approach, the ‘native’ Spanish verb adapter outperforms its vtrans-based counterpart (Table 4), compensating for its limited coverage with gold-standard accuracy. Nonetheless, the challenge in using native resources in other languages lies in their very limited availability and their expensive, time-consuming manual construction. Our results reaffirm the usefulness of language-specific expert-curated resources and their ability to enrich state-of-the-art NLP models. This, in turn, suggests that work on optimising resource creation methodologies merits future research efforts on a par with modeling work.

5 Related Work

5.1 Event Extraction

The cost and complexity of event annotation requires robust transfer solutions capable of making fine-grained predictions in the face of data scarcity. Traditional event extraction methods relied on hand-crafted, language-specific features Ahn (2006); Gupta and Ji (2009); Llorens et al. (2010); Hong et al. (2011); Li et al. (2013); Glavaš and Šnajder (2015) (e.g., POS tags, entity knowledge, morphological and syntactic information), which limited their generalisation ability and effectively prevented language transfer.

More recent approaches commonly resorted to word embedding input and neural text encoders such as recurrent nets Nguyen et al. (2016); Duan et al. (2017); Sha et al. (2018) and convolutional nets Chen et al. (2015); Nguyen and Grishman (2015), as well as graph neural networks Nguyen and Grishman (2018); Yan et al. (2019) and adversarial networks Hong et al. (2018); Zhang and Ji (2018). As in most other NLP tasks, the most recent empirical advances in event trigger and argument extraction have been achieved through fine-tuning of LM-pretrained Transformer networks Yang et al. (2019a); Wang et al. (2019); M’hamdi et al. (2019); Wadden et al. (2019); Liu et al. (2020).

Limited training data nonetheless remains an obstacle, especially when facing previously unseen event types. Such data scarcity issues have been addressed through data augmentation methods – automatic data annotation Chen et al. (2017); Zheng (2018); Araki and Mitamura (2018) and bootstrapping for training data generation Ferguson et al. (2018); Wang et al. (2019). The recent release of MAVEN Wang et al. (2020c), a large English event detection dataset with annotations of event triggers only, partially remedies training data scarcity. MAVEN also demonstrates that even state-of-the-art Transformer-based models fail to yield satisfying event detection performance in the general domain. Since datasets of similar size are unlikely to appear for other event extraction tasks (e.g., event argument extraction), and especially for other languages, the need for external event-related knowledge and transfer learning approaches, such as the ones introduced in this work, is only emphasised.

Beyond event trigger (and argument)-oriented frameworks such as ACE and its lightweight variant ERE Aguilar et al. (2014); Song et al. (2015), several other event-focused datasets exist which frame the problem either as a slot-filling task Grishman and Sundheim (1996) or as an open-domain problem of extracting unconstrained event types and schemata from text Allan (2002); Minard et al. (2016); Araki and Mitamura (2018); Liu et al. (2019a). Small domain-specific datasets have also been constructed for event detection in biomedicine Kim et al. (2008); Thompson et al. (2009); Buyko et al. (2010); Nédellec et al. (2013), as well as in literary texts Sims et al. (2019) and on Twitter Ritter et al. (2012); Guo et al. (2013).

5.2 Semantic Specialisation

Representation spaces induced through self-supervised objectives from large corpora, be it the word embedding spaces Mikolov et al. (2013); Bojanowski et al. (2017) or those spanned by LM-pretrained Transformers Devlin et al. (2019); Liu et al. (2019b), encode only distributional knowledge, i.e., knowledge obtainable from large corpora. A large body of work focused on semantic specialisation (i.e., refinement) of such distributional spaces by means of injecting lexico-semantic knowledge from external resources such as WordNet Fellbaum (1998), BabelNet Navigli and Ponzetto (2010) or ConceptNet Liu and Singh (2004) expressed in the form of lexical constraints (Faruqui et al., 2015; Mrkšić et al., 2017; Glavaš and Vulić, 2018c; Kamath et al., 2019; Lauscher et al., 2020b, inter alia).

Joint specialisation models (Yu and Dredze, 2014; Nguyen et al., 2017; Lauscher et al., 2020b; Levine et al., 2020, inter alia) train the representation space from scratch on the large corpus, but augment the self-supervised training objective with an additional objective based on external lexical constraints. Lauscher et al. (2020b) add to the Masked LM (MLM) and next sentence prediction (NSP) pretraining objectives of BERT Devlin et al. (2019) an objective that predicts pairs of synonyms and first-order hyponymy-hypernymy pairs, aiming to improve word-level semantic similarity in BERT’s representation space. In a similar vein, Levine et al. (2020) add the objective that predicts WordNet supersenses. While joint specialisation models allow the external knowledge to shape the representation space from the very beginning of the distributional training, this also means that any change in lexical constraints implies a new, computationally expensive pretraining from scratch.

Retrofitting and post-specialisation methods (Faruqui et al., 2015; Mrkšić et al., 2017; Vulić et al., 2018; Ponti et al., 2018; Glavaš and Vulić, 2019; Lauscher et al., 2020a; Wang et al., 2020a, inter alia), in contrast, start from a pretrained representation space (word embedding space or a pretrained encoder) and fine-tune it using external lexico-semantic knowledge. Wang et al. (2020a) fine-tune the pre-trained RoBERTa Liu et al. (2019b) with lexical constraints obtained automatically via dependency parsing, whereas Lauscher et al. (2020a) use lexical constraints derived from ConceptNet to inject knowledge into BERT: both adopt adapter-based fine-tuning, storing the external knowledge in a separate set of parameters. In our work, we adopt a similar adapter-based specialisation approach. However, focusing on event-oriented downstream tasks, our lexical constraints reflect verb class memberships and originate from VerbNet and FrameNet.

6 Conclusion

We have investigated the potential of leveraging knowledge about semantic-syntactic behaviour of verbs to improve the capacity of large pretrained models to reason about events in diverse languages. We have proposed an auxiliary pretraining task to inject information about verb class membership and semantic frame-evoking properties into the parameters of dedicated adapter modules, which can be readily employed in other tasks where verb reasoning abilities are key. We demonstrated that state-of-the-art Transformer-based models still benefit from the gold standard linguistic knowledge stored in lexical resources, even those with limited coverage. Crucially, we showed that the benefits of the information available in resource-rich languages can be extended to other, resource-leaner languages through translation-based transfer of verb class/frame membership information.

In future work, we will incorporate our verb knowledge modules into alternative, more sophisticated approaches to cross-lingual transfer to explore the potential for further improvements in low-resource scenarios. Further, we will extend our approach to specialised domains where small-scale but high-quality lexica are available, to support distributional models in dealing with domain-sensitive verb-oriented problems.

Acknowledgments

This work is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no 648909) awarded to Anna Korhonen. The work of Goran Glavaš is supported by the Baden-Württemberg Stiftung (Eliteprogramm, AGREE grant).


Appendix A Frameworks for Annotating Event Expressions

Two prominent frameworks for annotating event expressions are TimeML Pustejovsky et al. (2003, 2005) and the Automatic Content Extraction (ACE) program Doddington et al. (2004). TimeML was developed as a rich markup language for annotating event and temporal expressions, addressing the problems of identifying event predicates and anchoring them in time, determining their relative ordering and temporal persistence (i.e., how long the consequences of an event last), as well as tackling contextually underspecified temporal expressions (e.g., last month, two days ago). Currently available English corpora annotated with the TimeML scheme include the TimeBank corpus Pustejovsky et al. (2003), a human-annotated collection of 183 newswire texts (including 7,935 annotated events, comprising both punctual occurrences and states which extend over time), and the AQUAINT corpus, with 80 newswire documents grouped by their covered stories, which allows tracing the progress of events through time Derczynski (2017). Both corpora, supplemented with a large, automatically TimeML-annotated training corpus, are used in the TempEval-3 task Verhagen and Pustejovsky (2008); UzZaman et al. (2013), which targets automatic identification of temporal expressions, events, and temporal relations.

The ACE dataset provides annotations for entities, the relations between them, and the events in which they participate in newspaper and newswire text. For each event, it identifies its lexical instantiation, i.e., the trigger, and its participants, i.e., the arguments, and the roles they play in the event. For example, an event of type “Conflict:Attack” (“It could swell to as much as $500 billion if we go to war in Iraq.”), triggered by the noun ‘war’, involves two arguments, the “Attacker” (“we”) and the “Place” (“Iraq”), each of which is annotated with an entity label (“GPE:Nation”).

Appendix B Adapter Training: Hyperparameter Search

We experimented with $n \in \{10, 15, 20, 30\}$ training epochs, as well as an early stopping approach using validation loss on a small held-out validation set as the stopping criterion, with a patience argument $p \in \{2, 5\}$; we found the adapters trained for the full 30 epochs to perform most consistently across tasks.

The size of the training batch varies based on the number $k$ of negative examples generated from the starting batch $B$ of positive pairs: e.g., by generating $k=3$ negative examples for each of 8 positive examples in the starting batch, we end up with a training batch of total size $8 + 3 \times 8 = 32$. We experimented with starting batches of size $B \in \{8, 16\}$ and found the configuration $k=3$, $B=16$ to yield the strongest results (reported in this paper).

Appendix C STM Training Details

We train the STM using the sets of English positive examples from each lexical resource (Table 1). Negative examples are generated using controlled sampling (see §2.2) with a $k=2$ [cc] configuration, ensuring that the generated negatives do not constitute positive constraints in the global set. We use pre-trained 300-dimensional static distributional word vectors computed on Wikipedia data with the fastText model Bojanowski et al. (2017), cross-lingually aligned using the RCSLS model of Joulin et al. (2018), to induce the shared cross-lingual embedding space for each source-target language pair. The STM is trained using the Adam optimizer Kingma and Ba (2015), a learning rate $l = 1e{-}4$, and a batch size of 32 (positive and negative) training examples, for a maximum of 10 iterations. We set the values of the other training hyperparameters as in Ponti et al. (2019), i.e., the number of specialisation tensor slices $K=5$ and the size of the specialised vectors $h=300$.