
Verb Knowledge Injection for Multilingual Event Processing

Olga Majewska1, Ivan Vulić1,  Goran Glavaš2,  Edoardo M. Ponti1,3,  Anna Korhonen1
1Language Technology Lab, TAL, University of Cambridge, UK
2 Data and Web Science Group, University of Mannheim, Germany
3 Mila – Quebec AI Institute, Montreal, Canada
1{om304,iv250,ep490,alk23}@cam.ac.uk
2[email protected]
Abstract

In parallel to their overwhelming success across NLP tasks, the language ability of deep Transformer networks pretrained via language modeling (LM) objectives has undergone extensive scrutiny. While probing has revealed that these models encode a range of syntactic and semantic properties of a language, they are still prone to falling back on superficial cues and simple heuristics to solve downstream tasks, rather than leveraging deeper linguistic knowledge. In this paper, we target one such area of deficiency, verbal reasoning. We investigate whether injecting explicit information on verbs’ semantic-syntactic behaviour improves the performance of LM-pretrained Transformers in event extraction tasks – downstream tasks for which accurate verb processing is paramount. Concretely, we impart verb knowledge from curated lexical resources into dedicated adapter modules (dubbed verb adapters), allowing it to complement, in downstream tasks, the language knowledge obtained during LM-pretraining. We first demonstrate that injecting verb knowledge leads to performance gains in English event extraction. We then explore the utility of verb adapters for event extraction in other languages: we investigate (1) zero-shot language transfer with multilingual Transformers, as well as (2) transfer via (noisy automatic) translation of English verb-based lexical constraints. Our results show that the benefits of verb knowledge injection indeed extend to other languages, even when verb adapters are trained on noisily translated constraints.

1 Introduction

Large Transformer-based encoders, pretrained with self-supervised language modeling (LM) objectives, form the backbone of state-of-the-art models for most Natural Language Processing (NLP) tasks Devlin et al. (2019); Yang et al. (2019b); Liu et al. (2019b). Recent probing experiments showed that they implicitly extract a non-negligible amount of linguistic knowledge from text corpora in an unsupervised fashion (Hewitt and Manning, 2019; Vulić et al., 2020; Rogers et al., 2020, inter alia). In downstream tasks, however, they often rely on spurious correlations and superficial cues (Niven and Kao, 2019) rather than a deep understanding of language meaning (Bender and Koller, 2020), which is detrimental to both generalisation and interpretability (McCoy et al., 2019).

In this work, we focus on a specific facet of linguistic knowledge, namely reasoning about events. For instance, in the sentence “Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather”, an event of coming occurs in the past, with Buck Mulligan as a participant, simultaneously to an event of bearing with an additional participant, a bowl. Identifying tokens in the text that mention events and classifying the temporal and causal relations among them (Ponti and Korhonen, 2017) is crucial to understand the structure of a story or dialogue (Carlson et al., 2002; Miltsakaki et al., 2004) and to ground a text in real-world facts (Doddington et al., 2004).

Verbs (with their arguments) are prominently used for expressing events (with their participants). Thus, fine-grained knowledge about verbs, such as the syntactic patterns in which they partake and the semantic frames they evoke, may help pretrained encoders achieve a deeper understanding of text and improve their performance in event-oriented downstream tasks. Fortunately, there already exist expert-curated computational resources that organise verbs into classes based on their syntactic-semantic properties (Jackendoff, 1992; Levin, 1993). In particular, here we consider VerbNet (Schuler, 2005) and FrameNet (Baker et al., 1998) as rich sources of verb knowledge.

Expanding a line of research on injecting external linguistic knowledge into (LM-)pretrained models Peters et al. (2019); Levine et al. (2020); Lauscher et al. (2020b), we integrate verb knowledge into contextualised representations for the first time. We devise a new method to distill verb knowledge into dedicated adapter modules (Houlsby et al., 2019; Pfeiffer et al., 2020b), which reduce the risk of (catastrophic) forgetting of distributional knowledge and allow for seamless integration with other types of knowledge.

We hypothesise that complementing pretrained encoders through verb knowledge in such modular fashion should benefit model performance in downstream tasks that involve event extraction and processing. We first put this hypothesis to the test in English monolingual event identification and classification tasks from the TempEval (UzZaman et al., 2013) and ACE (Doddington et al., 2004) datasets. Foreshadowing, we report modest but consistent improvements in the former, and significant performance boosts in the latter, thus verifying that verb knowledge is indeed paramount for deeper understanding of events and their structure.

Moreover, we note that expert-curated resources are not available for most of the languages spoken worldwide. Therefore, we also investigate the effectiveness of transferring verb knowledge across languages, and in particular from English to Spanish, Arabic and Mandarin Chinese. Concretely, we compare (1) zero-shot model transfer based on massively multilingual encoders and English constraints with (2) automatic translation of English constraints into the target language. Not only do the results demonstrate that both techniques are successful, but they also shed some light on an important linguistic question: to what extent can verb classes (and predicate–argument structures) be considered cross-lingually universal, rather than varying across languages (Hartmann et al., 2013)?

Overall, our main contributions consist in 1) mitigating the limitations of pretrained encoders regarding event understanding by supplying verb knowledge from external resources; 2) proposing a new method to do so in a modular way through adapter layers; 3) exploring techniques to transfer verb knowledge to resource-poor languages. The gains in performance observed across four diverse languages and several event processing tasks and datasets warrant the conclusion that complementing distributional knowledge with curated verb knowledge is both beneficial and cost-effective.

2 Verb Knowledge for Event Processing

Figure 1: Framework for injecting verb knowledge into a pretrained Transformer encoder for event processing tasks. 1) Dedicated verb adapter parameters trained to recognise pairs of verbs from the same VerbNet (VN) class or FrameNet (FN) frame; 2) Fine-tuning for an event extraction task (e.g., event trigger identification and classification UzZaman et al. (2013)): a) full fine-tuning – Transformer’s original parameters and verb adapters both fine-tuned for the task; b) task adapter (TA) fine-tuning – additional task adapter is mounted on top of verb adapter and tuned for the task. For simplicity, we show only a single transformer layer; verb- and task-adapters are used in all Transformer layers. Snowflakes denote frozen parameters in the respective training step.

Figure 1 illustrates our framework for injecting verb knowledge from VerbNet or FrameNet and leveraging it in downstream event extraction tasks. First, we inject the external verb knowledge, formulated as so-called lexical constraints Mrkšić et al. (2017); Ponti et al. (2019) (in our case, verb pairs; see §2.1), into a (small) additional set of adapter parameters (§2.2) Houlsby et al. (2019). In the second step (§2.3), we combine the language knowledge encoded by the Transformer’s original parameters and the verb knowledge from the verb adapters to solve a particular event extraction task. To this end, we either a) fine-tune both sets of parameters (1. the pretrained LM; 2. the verb adapters) or b) freeze both sets of parameters and insert an additional set of task-specific adapter parameters. In both cases, the task-specific training is informed both by the general language knowledge captured in the pretrained LM and by the specialised verb knowledge captured in the verb adapters.

2.1 Sources of Verb Lexical Knowledge

Given the inter-connectedness between verbs’ meaning and syntactic behaviour Levin (1993); Schuler (2005), we assume that refining latent representation spaces with semantic content and predicate-argument properties of verbs will have a positive effect on event extraction tasks, which strongly revolve around verbs. Lexical classes, defined in terms of verbs’ shared semantic-syntactic properties, provide a mapping between the verbs’ senses and the morpho-syntactic realisation of their arguments Jackendoff (1992); Levin (1993). The potential of verb classifications lies in their predictive power: for any given verb, a set of rich semantic-syntactic properties can be inferred based on its class membership. In this work, we explicitly harness this rich linguistic knowledge to help LM-pretrained Transformers capture regularities in the properties of verbs and their arguments, and consequently improve their ability to reason about events. We select two major English lexical databases – VerbNet Schuler (2005) and FrameNet Baker et al. (1998) – as sources of verb knowledge at the semantic-syntactic interface, each representing a different lexical framework. Despite the different theories underpinning the two resources, their organisational units – verb classes and semantic frames – both capture regularities in verbs’ semantic-syntactic properties.[1]

[1] Initially, we also considered WordNet for creating verb constraints. While it provides records of verbs’ senses and (some) semantic relations between them, WordNet lacks comprehensive information about the (semantic-)syntactic frames in which verbs participate. We thus believe that verb knowledge from WordNet would be less effective in downstream event extraction tasks than that from VerbNet and FrameNet.

VerbNet (VN) Schuler (2005); Kipper et al. (2006) is the largest verb-focused lexicon currently available. It organises verbs into classes based on the overlap in their semantic properties and syntactic behaviour, building on the premise that a verb’s predicate-argument structure informs its meaning Levin (1993). Each entry provides a set of thematic roles and selectional preferences for the verbs’ arguments; it also lists the syntactic contexts characteristic of the class members. The classification is hierarchical, starting from broader classes and spanning several granularity levels, where each subclass further refines the semantic-syntactic properties inherited from its parent class.[2] VerbNet’s reliance on semantic-syntactic coherence as a class membership criterion means that semantically related verbs may end up in different classes because of differences in their combinatorial properties.[3] Although the sets of syntactic descriptions and corresponding semantic roles defining each VerbNet class are English-specific, the underlying notion of a semantically-syntactically defined verb class is thought to apply cross-lingually Jackendoff (1992); Levin (1993), and its translatability has been demonstrated in previous work Vulić et al. (2017); Majewska et al. (2018). The current version of English VerbNet contains 329 main classes.

[2] For example, the top-level class ‘free-80’, which includes verbs like liberate, discharge, and exonerate that participate in a NP V NP PP.theme frame (e.g., It freed him of guilt.), contains a subset of verbs participating in a syntactic frame NP V NP S_ING (‘free-80-1’), within which there exists an even more constrained subset of verbs appearing with prepositional phrases headed specifically by the preposition ‘from’ (e.g., The scientist purified the water from bacteria.).
[3] E.g., the verbs split and separate are members of two different classes with identical sets of arguments’ thematic roles, but with discrepancies in their syntactic realisations (e.g., the syntactic frame NP V NP apart is only permissible for the ‘split-23.2’ verbs: The book’s front and back cover split apart, but not *The book’s front and back cover separated apart).

FrameNet (FN) Baker et al. (1998), in contrast to the syntactically-driven class divisions in VerbNet, is more semantically oriented. Grounded in the theory of frame semantics Fillmore (1976, 1977, 1982), it organises concepts according to the semantic frames they evoke, i.e., schematic representations of situations and events, each characterised by a set of typical roles assumed by its participants. The word senses associated with each frame (FrameNet’s lexical units) are similar in terms of their semantic content, as well as their typical argument structures.[4] Currently, English FN includes a total of 1,224 frames, and its annotations illustrate the typical syntactic realisations of the elements of each frame. Frames themselves are, however, semantically defined: this means that they may be shared even across languages with different syntactic properties (e.g., descriptions of transactions will include the same frame elements Buyer, Seller, Goods, Money in most languages). Indeed, English FN has inspired similar projects in other languages: e.g., Spanish Subirats and Sato (2004), Swedish Heppin and Gronostaj (2012), Japanese Ohara (2012), and Danish Bick (2011).

[4] For example, verbs such as beat, hit, smash, and crush evoke the ‘Cause_harm’ frame describing situations in which an Agent or a Cause causes injury to a Victim (or their Body_part), e.g., A falling rock [Cause] CRUSHED the hiker’s ankle [Body_part], or The bodyguard [Agent] was caught BEATING the intruder [Victim]. Note that frame-evoking elements need not be verbs; the same frame can also be evoked by nouns, e.g., strike or poisoning.

2.2 Training Verb Adapters

Training Task and Data Generation. In order to encode information about verbs’ membership in VN classes or FN frames into a pretrained Transformer, we devise an intermediary training task in which we train a dedicated VN-/FN-knowledge adapter (hereafter VN-Adapter and FN-Adapter). We frame the task as binary word-pair classification: we predict if two verbs belong to the same VN class or FN frame. We extract training instances from FN and VN independently. This allows for a separate analysis of the impact of verb knowledge from each resource.

We generate positive training instances by extracting all unique verb pairings from the set of members of each main VN class/FN frame (e.g., walk–march), resulting in 181,882 positive instances created from VN and 57,335 from FN. We then generate $k=3$ negative examples for each positive example in a training batch by combining controlled and random sampling. In controlled sampling, we follow prior work on semantic specialisation Wieting et al. (2015); Glavaš and Vulić (2018b); Ponti et al. (2019); Lauscher et al. (2020b). For each positive example $p=(w_1, w_2)$ in the training batch $B$, we create two negatives $\hat{p}_1=(\hat{w}_1, w_2)$ and $\hat{p}_2=(w_1, \hat{w}_2)$; $\hat{w}_1$ is the verb from batch $B$ other than $w_1$ that is closest to $w_2$ in terms of their cosine similarity in an auxiliary static word embedding space $\mathbf{X}_{aux} \in \mathbb{R}^d$; conversely, $\hat{w}_2$ is the verb from $B$ other than $w_2$ closest to $w_1$. We additionally create one negative instance $\hat{p}_3=(\hat{w}_1, \hat{w}_2)$ by randomly sampling $\hat{w}_1$ and $\hat{w}_2$ from batch $B$, not considering $w_1$ and $w_2$. We make sure that negative examples are not present in the global set of all positive verb pairs from the resource.
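A minimal sketch of this negative sampling scheme (the $k=3$ [ccr] configuration used later in §3) is given below; the embeddings here are random stand-ins for a real auxiliary static space, and the final check against the global set of positive pairs is elided for brevity.

```python
import numpy as np

def sample_negatives(batch, aux_emb, rng):
    """For each positive pair (w1, w2) in the batch, create two controlled
    negatives and one random negative (the k=3 [ccr] configuration)."""
    verbs = sorted({w for pair in batch for w in pair})

    def closest(target, exclude):
        # batch verb other than `target` and `exclude` that is most similar
        # to `target` by cosine similarity in the auxiliary static space
        t = aux_emb[target] / np.linalg.norm(aux_emb[target])
        cands = [v for v in verbs if v not in (target, exclude)]
        sims = [t @ (aux_emb[v] / np.linalg.norm(aux_emb[v])) for v in cands]
        return cands[int(np.argmax(sims))]

    negatives = []
    for w1, w2 in batch:
        negatives.append((closest(w2, exclude=w1), w2))       # p^_1 = (w^_1, w2)
        negatives.append((w1, closest(w1, exclude=w2)))       # p^_2 = (w1, w^_2)
        pool = [v for v in verbs if v not in (w1, w2)]
        negatives.append(tuple(rng.choice(pool, 2, replace=False)))  # p^_3
    return negatives

rng = np.random.default_rng(0)
batch = [("walk", "march"), ("free", "liberate"), ("split", "separate")]
aux_emb = {w: rng.normal(size=300) for w in {v for p in batch for v in p}}
print(sample_negatives(batch, aux_emb, rng))
```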

Similar to Lauscher et al. (2020b), we tokenise each (positive and negative) training instance into WordPiece tokens, prepended with the sequence start token [CLS], and with [SEP] tokens in between the verbs and at the end of the input sequence. We consider the representation of the [CLS] token, $\mathbf{x}_{\mathit{CLS}} \in \mathbb{R}^h$ (with $h$ as the hidden state size of the Transformer), output by the last Transformer layer, to be the latent representation of the verb pair, and feed it to a simple binary classifier:[5]

$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{x}_{\mathit{CLS}}\,\mathbf{W}_{cl} + \mathbf{b}_{cl})$    (1)

with $\mathbf{W}_{cl} \in \mathbb{R}^{h \times 2}$ and $\mathbf{b}_{cl} \in \mathbb{R}^2$ as the classifier’s trainable parameters. We train by minimising the standard cross-entropy loss ($\mathcal{L}_{\mathit{VERB}}$ in Figure 1).

[5] We also experimented with sentence-level tasks for injecting verb knowledge, with target verbs presented in sentential contexts drawn from example sentences from VN/FN: we fed (a) pairs of sentences in a binary classification setup (e.g., Jackie leads Rose to the store. – Jackie escorts Rose.); and (b) individual sentences in a multi-class classification setup (predicting the correct VN class/FN frame). Both these variants with sentence-level input, however, led to weaker downstream performance.
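For concreteness, the following sketch implements the verb-pair classification objective using the HuggingFace transformers API (an illustrative choice on our part, not necessarily the authors’ exact code); the softmax of Eq. (1) is folded into PyTorch’s cross-entropy loss.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")

W_cl = nn.Linear(bert.config.hidden_size, 2)  # W_cl and b_cl from Eq. (1)
loss_fn = nn.CrossEntropyLoss()               # the cross-entropy loss L_VERB

def verb_pair_loss(w1: str, w2: str, label: int) -> torch.Tensor:
    # Builds "[CLS] w1 [SEP] w2 [SEP]"; the tokenizer adds the special tokens.
    enc = tokenizer(w1, w2, return_tensors="pt")
    x_cls = bert(**enc).last_hidden_state[:, 0]  # [CLS] from the last layer
    logits = W_cl(x_cls)                         # softmax lives inside the loss
    return loss_fn(logits, torch.tensor([label]))

loss = verb_pair_loss("walk", "march", label=1)  # same VN class -> positive
loss.backward()
```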

Adapter Architecture. Instead of directly fine-tuning all parameters of the pretrained Transformer, we opt for storing verb knowledge in a separate set of adapter parameters, keeping the verb knowledge separate from the general language knowledge acquired in pretraining. This (1) allows downstream training to flexibly combine the two sources of knowledge, and (2) bypasses the issues with catastrophic forgetting and interference Hashimoto et al. (2017); de Masson d’Autume et al. (2019).

We adopt the adapter architecture of Pfeiffer et al. (2020a, c), which exhibits performance comparable to the more commonly used architecture of Houlsby et al. (2019), while being computationally more efficient. In each Transformer layer $l$, we insert a single adapter module ($\mathit{Adapter}_l$) after the feed-forward sub-layer. The adapter module itself is a two-layer feed-forward neural network with a residual connection, consisting of a down-projection $\mathbf{D} \in \mathbb{R}^{h \times m}$, a GeLU activation Hendrycks and Gimpel (2016), and an up-projection $\mathbf{U} \in \mathbb{R}^{m \times h}$, where $h$ is the hidden size of the Transformer model and $m$ is the dimension of the adapter:

$\mathit{Adapter}_l(\mathbf{h}_l, \mathbf{r}_l) = \mathbf{U}_l\big(\mathrm{GeLU}(\mathbf{D}_l(\mathbf{h}_l))\big) + \mathbf{r}_l$    (2)

where $\mathbf{r}_l$ is the residual connection, i.e., the output of the Transformer’s feed-forward layer, and $\mathbf{h}_l$ is the Transformer hidden state, i.e., the output of the subsequent layer normalisation.
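In PyTorch, the adapter module of Eq. (2) can be sketched as follows; the hidden size and reduction factor match the BERT Base configuration used later in §3, and the module names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Pfeiffer-style adapter (Eq. 2): down-projection D, GeLU,
    up-projection U, plus a residual connection r."""

    def __init__(self, hidden_size: int = 768, reduction_factor: int = 16):
        super().__init__()
        m = hidden_size // reduction_factor     # adapter dim m (48 for BERT Base)
        self.down = nn.Linear(hidden_size, m)   # D in R^{h x m}
        self.up = nn.Linear(m, hidden_size)     # U in R^{m x h}

    def forward(self, h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # h: output of the layer normalisation;
        # r: residual from the Transformer's feed-forward sub-layer
        return self.up(F.gelu(self.down(h))) + r

x = torch.randn(2, 12, 768)                   # (batch, tokens, hidden)
out = Adapter()(F.layer_norm(x, (768,)), x)   # one adapter call per layer
```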

2.3 Downstream Fine-Tuning for Event Tasks

With verb knowledge from VN/FN injected into the parameters of VN-/FN-Adapters, we proceed to the downstream fine-tuning for a concrete event extraction task. Tasks that we experiment with (see §3) are (1) token-level event trigger identification and classification and (2) span extraction for event triggers and arguments (a sequence labeling task). For the former, we mount a classification head – a simple single-layer feed-forward softmax regression classifier – on top of the Transformer augmented with VN-/FN-Adapters. For the latter, we follow the architecture from prior work M’hamdi et al. (2019); Wang et al. (2019) and add a CRF layer Lafferty et al. (2001) on top of the sequence of Transformer’s outputs (for subword tokens), in order to learn inter-dependencies between output tags and determine the optimal tag sequence.
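As an illustration of the span-extraction head, below is a minimal PyTorch sketch pairing a linear emission layer with a CRF. We use the third-party pytorch-crf package as a stand-in CRF implementation (not the authors’ code), and the tag count of 9 is an arbitrary placeholder for a BIO-style tag set.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf; a stand-in CRF implementation

class SpanTagger(nn.Module):
    """Token emissions from the encoder feed a CRF that scores tag sequences,
    capturing inter-dependencies between output tags."""

    def __init__(self, hidden_size: int = 768, num_tags: int = 9):
        super().__init__()
        self.emit = nn.Linear(hidden_size, num_tags)  # per-subword tag scores
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, hidden_states, tags=None, mask=None):
        emissions = self.emit(hidden_states)
        if tags is not None:                  # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # optimal tag sequence

tagger = SpanTagger()
h = torch.randn(2, 16, 768)                   # encoder outputs for 16 subwords
tags = torch.zeros(2, 16, dtype=torch.long)
loss = tagger(h, tags)                        # Viterbi decoding: tagger(h)
```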

For both types of downstream tasks, we propose and evaluate two different fine-tuning regimes: (1) full downstream fine-tuning, in which we update both the original Transformer’s parameters and VN-/FN-Adapters (see 2a in Figure 1); and (2) task-adapter (TA) fine-tuning, where we keep both Transformer’s original parameters and VN-/FN-Adapters frozen, while stacking a new trainable task adapter on top of the VN-/FN-Adapter in each Transformer layer (see 2b in Figure 1).
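To make the two regimes concrete, here is a minimal sketch of the parameter-freezing logic; the module-name substrings ("verb_adapter", "task_adapter", "classifier") are illustrative assumptions, not names from the authors’ implementation.

```python
import torch.nn as nn

def set_trainable(model: nn.Module, trainable_substrings) -> None:
    """Freeze every parameter except those whose names contain one of the
    given substrings (e.g. the task adapter and the task head)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)

# (1) full fine-tuning: BERT and the verb adapter are both updated
# set_trainable(model, ["bert", "verb_adapter", "classifier"])

# (2) task-adapter fine-tuning: only the stacked task adapter and the
#     task head are updated; BERT and the verb adapter stay frozen
# set_trainable(model, ["task_adapter", "classifier"])
```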

2.4 Cross-Lingual Transfer

Creation of curated resources like VN or FN takes years of expert linguistic labour. Consequently, such resources do not exist for the vast majority of languages. Given the inherent cross-lingual nature of verb classes and semantic frames (see §2.1), we investigate the potential for verb knowledge transfer from English to target languages, without any manual target-language adjustments. Massively multilingual Transformers, such as multilingual BERT (mBERT) Devlin et al. (2019) or XLM-R Conneau et al. (2020), have become the de facto standard mechanisms for zero-shot (zs) cross-lingual transfer. We adopt mBERT in our first language transfer approach: we fine-tune mBERT first on the English verb knowledge, then on the English task data, and finally simply make task predictions on the target-language input.

Our second transfer approach, dubbed vtrans, is inspired by the work on cross-lingual transfer of semantic specialisation for static word embedding spaces Glavaš et al. (2019); Ponti et al. (2019); Wang et al. (2020b).

              VerbNet  FrameNet
English (EN)  181,882    57,335
Spanish (ES)   96,300    36,623
Chinese (ZH)   60,365    21,815
Arabic (AR)    70,278    24,551
Table 1: Number of positive training verb pairs in English, and in each target language obtained via the vtrans method (§2.4).

Starting from a set of positive pairs $P$ from English VN/FN, vtrans involves three steps: (1) automatic translation of the verbs in each pair into the target language; (2) filtering of the noisy target-language pairs by means of a relation prediction model trained on the English examples; and (3) training the verb adapters, injected into either mBERT or a target-language BERT, with the target-language verb pairs. For (1), we translate the verbs by retrieving their nearest neighbours in the target language from a shared cross-lingual embedding space, aligned using the Relaxed Cross-domain Similarity Local Scaling (RCSLS) model of Joulin et al. (2018). Such a translation procedure is liable to error due to an imperfect cross-lingual embedding space as well as polysemy and out-of-context word translation. We mitigate these issues in step (2), where we purify the set of noisily translated target-language verb pairs by means of a neural lexico-semantic relation prediction model, the Specialization Tensor Model (STM) Glavaš and Vulić (2018a), here adjusted for binary classification. We train the STM on the same task as the verb adapters during verb knowledge injection (§2.2): to distinguish (positive) verb pairs from the same English VN class/FN frame from those from different VN classes/FN frames. In training, the inputs to the STM are static word embeddings of English verbs taken from a shared cross-lingual word embedding space. We then make predictions in the target language by feeding the STM vectors of target-language verbs (from the noisily translated verb pairs), taken from the same cross-lingual word embedding space (we provide more details on STM training in Appendix C). Finally, in step (3), we retain only the target-language verb pairs identified by the STM as positive and perform direct monolingual FN-/VN-Adapter training in the target language, following the same protocol used for English (§2.2).
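A schematic sketch of the first two vtrans steps, assuming verbs are keyed to vectors in the shared RCSLS-aligned space and the trained STM is available as an opaque predict callable:

```python
import numpy as np

def translate_pairs(en_pairs, en_emb, tgt_emb):
    """Step (1): map each English verb to its nearest target-language
    neighbour in the shared RCSLS-aligned embedding space."""
    tgt_words = list(tgt_emb)
    T = np.stack([tgt_emb[w] for w in tgt_words])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)  # unit-normalise targets

    def nearest(word):
        v = en_emb[word]
        return tgt_words[int(np.argmax(T @ (v / np.linalg.norm(v))))]

    return [(nearest(w1), nearest(w2)) for w1, w2 in en_pairs]

def filter_pairs(tgt_pairs, stm_predict):
    """Step (2): keep only the noisily translated pairs that the
    English-trained relation classifier labels as positive."""
    return [pair for pair in tgt_pairs if stm_predict(pair) == 1]

# Usage sketch (stm.predict is an assumed interface to the trained STM):
# es_pairs = filter_pairs(translate_pairs(en_pairs, en_ft, es_ft), stm.predict)
```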

3 Experimental Setup

Event Processing Tasks and Data. In light of the pivotal role of verbs in encoding the unfolding of actions and occurrences in time, as well as the nature of the relations between their participants, sensitivity to the cues they provide is especially important in event processing tasks. There, systems are tasked with detecting that something happened, identifying what type of occurrence took place, and determining what entities were involved. Verbs typically act as the organisational core of each such event schema,[6] carrying much of the semantic and structural weight. Therefore, a model’s grasp of verbs’ properties should have a bearing on ultimate task performance. Based on this assumption, we select event extraction and classification as suitable evaluation tasks to profile the methods from §2.

[6] Event expressions are not, however, restricted to verbs: adjectives, nominalisations, or prepositional phrases can also act as event triggers (consider, e.g., Two weeks after the murder took place…, Following the recent acquisition of the company’s assets…).

These tasks and the corresponding data are based on two prominent frameworks for annotating event expressions: TimeML Pustejovsky et al. (2003, 2005) and the Automatic Content Extraction (ACE) program Doddington et al. (2004). First, we rely on the TimeML-annotated corpus from the TempEval tasks Verhagen et al. (2010); UzZaman et al. (2013), which target automatic identification of temporal expressions, events, and temporal relations. Second, we use the ACE dataset, which provides annotations for entities, the relations between them, and the events in which they participate in newspaper and newswire text. We provide more details about the frameworks and their corresponding annotation schemes in Appendix A.

Task 1: Trigger Identification and Classification (TempEval). We frame the first event extraction task as a token-level classification problem: predicting whether a token triggers an event and assigning it to one of the following event types: OCCURRENCE (e.g., died, attacks), STATE (e.g., share, assigned), REPORTING (e.g., announced, said), I-ACTION (e.g., agreed, trying), I-STATE (e.g., understands, wants, consider), ASPECTUAL (e.g., ending, began), and PERCEPTION (e.g., watched, spotted).[7] We use the TempEval-3 data for English and Spanish UzZaman et al. (2013), and the TempEval-2 data for Chinese Verhagen et al. (2010) (see Table 2 for dataset sizes).

[7] E.g., in the sentence “The rules can also affect small businesses, which sometimes pay premiums tied to employees’ health status and claims history.”, affect and pay are event triggers of type STATE and OCCURRENCE, respectively.

                    Train    Test
TempEval  English  830,005   7,174
          Spanish   51,511   5,466
          Chinese   23,180   5,313
ACE       English      529      40
          Chinese      573      43
          Arabic       356      27
Table 2: Number of tokens (TempEval) and documents (ACE) in the training and test sets.

Task 2: Trigger and Argument Identification and Classification (ACE). In this sequence-labeling task, we detect and label event triggers and their arguments, with four individually scored subtasks: (i) trigger identification, where we identify the key word conveying the nature of the event; (ii) trigger classification, where we classify the trigger word into one of the predefined categories; (iii) argument identification, where we predict whether an entity mention is an argument of the event identified in (i); and (iv) argument classification, where the correct role needs to be assigned to the identified event arguments. We use the ACE data available for English, Chinese, and Arabic.[8]

[8] The ACE annotations distinguish 34 trigger types (e.g., Business:Merge-Org, Justice:Trial-Hearing, Conflict:Attack) and 35 argument roles. Following previous work Hsi et al. (2016), we conflate eight time-related argument roles (e.g., ‘Time-At-End’, ‘Time-Before’, ‘Time-At-Beginning’) into a single ‘Time’ role in order to alleviate training data sparsity.
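The role conflation described in the footnote amounts to a simple mapping; only three of the eight time roles are named in the text, so the remaining five below are filled in from the ACE 2005 guidelines and should be read as an assumption of this sketch.

```python
# Roles not named in the text (Time-After, Time-Starting, Time-Ending,
# Time-Holds, Time-Within) are an assumption based on the ACE 2005 guidelines.
TIME_ROLES = {
    "Time-At-End", "Time-Before", "Time-At-Beginning",
    "Time-After", "Time-Starting", "Time-Ending", "Time-Holds", "Time-Within",
}

def conflate_role(role: str) -> str:
    """Collapse all fine-grained time-related roles into a single 'Time' role."""
    return "Time" if role in TIME_ROLES else role

assert conflate_role("Time-Before") == "Time"
assert conflate_role("Attacker") == "Attacker"
```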

Event extraction as specified in these two frameworks is a challenging, highly context-sensitive problem, where different words (most often verbs) may trigger the same type of event, and conversely, the same word (verb) can evoke different types of event schemata depending on the context. Adopting these tasks as our experimental setup thus tests whether leveraging fine-grained curated knowledge of verbs’ semantic-syntactic behaviour can improve models’ reasoning about event-triggering predicates and their arguments.

Model Configurations. For each task, we compare the performance of the underlying “vanilla” BERT-based model (see §2.3) against its variant with an added VN-Adapter or FN-Adapter[9] (see §2.2) in two regimes: (a) full fine-tuning, and (b) task adapter (TA) fine-tuning (see Figure 1 again). To ensure that any performance gains are not merely due to the increased parameter capacity offered by the adapter module, we evaluate an additional setup in which we replace the knowledge adapter with a randomly initialised adapter module of the same size (+Random). Additionally, we examine the impact of increasing the capacity of the trainable task adapter by replacing it with a ‘Double Task Adapter’ (2TA), i.e., a task adapter with double the number of trainable parameters compared to the base architecture described in §2.2.

[9] We also experimented with inserting both verb adapters simultaneously; however, this resulted in weaker downstream performance than adding each separately, a likely product of the partly redundant, partly conflicting information encoded in these adapters (see §2.1 for a comparison of VN and FN).

Training Details: Verb Adapters. We experimented with $k \in \{2,3,4\}$ negative examples and the following combinations of controlled ($c$) and randomly ($r$) sampled negatives (see §2.2): $k=2$ [cc], $k=3$ [ccr], $k=4$ [ccrr]. In our preliminary experiments we found the $k=3$ [ccr] configuration to yield the best-performing adapter modules. The downstream evaluation and analysis presented in §4 is therefore based on this setup.

Our VN- and FN-Adapters are injected into the cased variant of the BERT Base model. Following Pfeiffer et al. (2020a), we train the adapters for 30 epochs using the Adam algorithm Kingma and Ba (2015), a learning rate of 1e-4, and an adapter reduction factor of 16, i.e., $m=48$. Our batch size is 64, comprising 16 positive examples and $3 \times 16 = 48$ negative examples (since $k=3$). We provide more details on the hyperparameter search in Appendix B.

Downstream Task Fine-Tuning. In downstream fine-tuning on Task 1 (TempEval), we train for 10 epochs in batches of size 32, with a learning rate of 1e-4 and a maximum input sequence length of $T=128$ WordPiece tokens. In Task 2 (ACE), in light of greater data sparsity,[10] we search for the optimal hyperparameter configuration for each language and evaluation setup from the following grid: learning rate $l \in \{1e{-}5, 1e{-}6\}$ and number of epochs $n \in \{3, 5, 10, 25, 50\}$ (with a maximum input sequence length of $T=128$).

[10] Most event types in ACE ($\approx$70%) have fewer than 100 labeled instances, and three have fewer than 10 Liu et al. (2018).

Transfer Experiments. For zero-shot (zs) transfer experiments, we leverage mBERT, to which we add the VN- or FN-Adapter trained on the English VN/FN data. We train the model on the English training data available for each task and evaluate it on the test set in the target language. For the vtrans approach (see §2.4), we use language-specific BERT models readily available for our target languages, and leverage target-language adapters trained on translated and automatically refined verb pairs. The model, with or without the target-language VN-/FN-Adapter, is trained and evaluated on the training and test data available in the target language. We carry out this procedure for three target languages (see Table 1). We use the same negative sampling configuration that proved strongest in our English experiments ($k=3$ [ccr]).

4 Results and Discussion

                         FFT   +Random  +FN   +VN  |  TA    +Random  +FN   +VN
TempEval  T-ident&class  73.6   73.5    73.6  73.6 |  74.5   74.4    75.0  75.2
ACE       T-ident        69.3   69.6    70.8  70.3 |  65.1   65.0    65.7  66.4
          T-class        65.3   65.5    66.7  66.2 |  58.0   58.5    59.5  60.2
          ARG-ident      33.8   33.5    34.2  34.6 |   2.1    1.9     2.3   2.5
          ARG-class      31.6   31.6    32.2  32.8 |   0.6    0.6     0.8   0.8
Table 3: Results on English TempEval and ACE test sets for the full fine-tuning (FFT) setup and the task adapter (TA) setup. Provided are average F1 scores over 10 runs, with statistically significant (paired t-test; p<0.05) improvements over both baselines marked in bold.
                    FFT   +Random  +FN   +VN  |  TA    +Random  +FN   +VN
Spanish  mBERT-zs   37.2   37.2    37.0  36.6 |  38.0   38.0    38.6  36.5
         ES-BERT    77.7   77.1    77.6  77.4 |  70.0   70.0    70.7  70.6
         ES-mBERT   73.5   73.6    74.4  74.1 |  65.3   65.4    65.8  66.2
Chinese  mBERT-zs   49.9   49.9    50.5  47.9 |  49.2   49.5    50.1  48.2
         ZH-BERT    82.0   81.6    81.8  81.8 |  76.2   76.3    75.9  76.9
         ZH-mBERT   80.2   80.1    79.9  80.0 |  71.8   71.8    72.1  71.9
Table 4: Results on Spanish and Chinese TempEval test sets for the full fine-tuning (FFT) and task adapter (TA) setups, for zero-shot (zs) transfer with mBERT and monolingual target-language evaluation with language-specific BERT (ES-BERT / ZH-BERT) or mBERT (ES-mBERT / ZH-mBERT), with FN/VN adapters trained on vtrans-translated verb pairs (see §2.4). F1 scores are averaged over 10 runs, with statistically significant (paired t-test; p<0.05) improvements over both baselines marked in bold.
                             FFT   +Random  +FN   +VN  |  TA    +Random  +FN   +VN
Arabic   mBERT-zs  T-ident   15.8   13.5    17.2  16.3 |  29.4   30.3    32.9  32.4
                   T-class   14.2   12.2    16.1  15.6 |  25.6   26.3    27.8  28.4
                   ARG-ident  1.2    0.6     2.1   2.7 |   2.0    3.3     3.3   3.6
                   ARG-class  0.9    0.4     1.5   1.9 |   1.2    1.6     1.6   1.3
         AR-BERT   T-ident   68.8   68.9    70.2  68.6 |  24.0   21.3    24.6  23.5
                   T-class   63.6   62.8    64.4  62.8 |  22.0   19.5    23.1  22.3
                   ARG-ident 31.7   29.3    34.0  33.4 |    –      –       –     –
                   ARG-class 28.4   26.7    30.3  29.7 |    –      –       –     –
Chinese  mBERT-zs  T-ident   36.9   36.7    42.1  36.8 |  47.8   49.4    55.0  55.4
                   T-class   27.9   25.2    30.9  29.8 |  38.6   40.1    43.5  44.9
                   ARG-ident  4.3    3.1     5.5   6.1 |   5.1    6.0     7.6   8.4
                   ARG-class  3.9    2.7     4.9   5.2 |   3.5    4.7     5.7   7.1
         ZH-BERT   T-ident   75.5   74.9    74.5  74.9 |  69.8   69.3    70.0  70.2
                   T-class   67.9   68.2    68.0  68.6 |  58.4   57.5    59.9  60.0
                   ARG-ident 27.3   26.1    29.8  28.8 |    –      –       –     –
                   ARG-class 25.8   25.2    28.2  27.2 |    –      –       –     –
Table 5: Results on Arabic and Chinese ACE test sets for the full fine-tuning (FFT) setup and the task adapter (TA) setup, for zero-shot (zs) transfer with mBERT and the vtrans transfer approach with language-specific BERT (AR-BERT / ZH-BERT) and FN/VN adapters trained on noisily translated verb pairs (§2.4). F1 scores are averaged over 5 runs; statistically significant (paired t-test; p<0.05) improvements over both baselines are marked in bold. Dashes denote argument-task setups omitted due to the computational cost of TA-based training (see footnote 13).

4.1 Main Results

English Event Processing. Table 3 shows the performance on English Task 1 (TempEval) and Task 2 (ACE). First, we note that the computationally more efficient setup with a dedicated task adapter (TA) yields higher absolute scores than full fine-tuning (FFT) on TempEval. When the underlying BERT is frozen along with the added FN-/VN-Adapter, the TA is forced to encode additional task-specific knowledge into its parameters, beyond what is provided in the verb adapter; this results in the two strongest overall scores, both from the +FN/VN setups. In Task 2, the primacy of TA-based training is overturned in favour of full fine-tuning. Encouragingly, the boosts provided by verb adapters are visible regardless of the chosen task fine-tuning regime, that is, regardless of whether the underlying BERT’s parameters remain fixed or not. We notice consistent, statistically significant[11] improvements in the +VN setup, although the performance of the TA-based setups clearly suffers in argument (arg) tasks due to the decreased trainable parameter capacity. The lack of visible improvements from the Random Adapter supports the interpretation that the performance gains indeed stem from the added useful ‘non-random’ signal in the verb adapters.

[11] We test significance with the Student’s t-test, with the significance level set at α=0.05, over sets of model F1 scores.

Multilingual Event Processing. Table 4 compares the performance of zero-shot (zs) transfer and monolingual target-language training (via the vtrans approach) on TempEval in Spanish and Chinese. For both we see that the addition of the FN-Adapter in the TA-based setup boosts zero-shot transfer. Benefits of this knowledge injection extend to the full fine-tuning setup in Chinese, achieving the top score overall.

In monolingual evaluation, we observe consistent gains from the added transferred knowledge (i.e., the vtrans approach) in Spanish, while in Chinese performance boosts come from the transferred VerbNet-style class membership information (+VN). These results suggest that even the noisily translated verb pairs carry enough useful signal through to the target language. To tease apart the contributions of the language-specific encoders and the transferred verb knowledge to task performance, we carry out an additional monolingual evaluation, substituting the monolingual target-language BERT with the massively multilingual encoder trained on the (noisy) target-language verb signal (ES-mBERT/ZH-mBERT). Notably, although the performance of the massively multilingual model is lower than that of the language-specific BERTs in absolute terms, the addition of the transferred verb knowledge helps reduce the gap between the two encoders, with tangible gains achieved over the baselines in Spanish (see the discussion in §4.2).[12]

[12] Given that analogous patterns were observed in the relative scores of mBERT and language-specific BERTs in monolingual evaluation on ACE (Task 2), for brevity we show the vtrans results with mBERT on TempEval only.

In ACE, the top performance scores are achieved in the monolingual full fine-tuning setting; as seen in English, keeping the full capacity of BERT’s parameters unfrozen noticeably helps performance.[13] In Arabic, FN knowledge provides performance boosts across the four tasks and with both the zero-shot (zs) and monolingual (vtrans) transfer approaches, whereas the addition of the VN-Adapter boosts scores in arg tasks. The usefulness of FN knowledge extends to zero-shot transfer in Chinese, and both adapters benefit the arg tasks in the monolingual (vtrans) transfer setup. Notably, in zero-shot transfer, the highest scores are achieved with task adapter (TA) fine-tuning, where the inclusion of the knowledge adapters offers additional performance gains. Overall, however, the argument tasks elude the restricted capacity of the TA-based setup, with very low scores across the board.

[13] This is especially the case in arg tasks, where the TA-based setup fails to achieve meaningful improvements over zero, even with extended training of up to 100 epochs. Due to the computational burden of such long training, the results in this setup are limited to trigger tasks (after 50 epochs).

4.2 Further Discussion

Zero-shot Transfer vs Monolingual Training. The results reveal a considerable gap between the performance of zero-shot transfer and monolingual fine-tuning. The event extraction tasks pose a significant challenge to the zero-shot transfer via mBERT, where downstream event extraction training data is in English; however, mBERT exhibits much more robust performance in the monolingual setup, when presented with training data for event extraction tasks in the target language – here it trails language-specific BERT models by less than 5 points (see Table 4). This is an encouraging result, given that LM-pretrained language-specific Transformers currently exist only for a narrow set of well-resourced languages: for all other languages – should there be language-specific event extraction data – one needs to resort to massively multilingual Transformers. What is more, mBERT’s performance is further improved by the inclusion of transferred verb knowledge (the vtrans approach, see §2.4): in Spanish, where the greater typological vicinity to English (compared to Chinese) renders direct transfer of semantic-syntactic information more viable, the addition of verb adapters (trained on noisy Spanish constraints) yields significant improvements both in the FFT and the TA setup. These results confirm the effectiveness of lexical knowledge transfer (i.e., the vtrans approach) observed in previous work Ponti et al. (2019); Wang et al. (2020b) in the context of semantic specialisation of static word embedding spaces.

Double Task Adapter. The addition of a verb adapter increases the parameter capacity of the underlying pretrained model. To verify whether increasing the number of trainable parameters in TA cancels out the benefits from the frozen verb adapter, we run additional evaluation in the TA-based setup, but with trainable task adapters double the size of the standard TA (2TA). Promisingly, we see in Tables 6 and 7 that the relative performance gains from FN/VN adapters are preserved regardless of the added trainable parameter capacity. As expected, the increased task adapter size helps argument tasks in ACE, where verb adapters produce additional gains. Overall, this suggests that verb adapters indeed encode additional, non-redundant information beyond what is offered by the pretrained model alone, and boost the dedicated task adapter in solving the problem at hand.

                    2TA   +FN   +VN
English  EN-BERT    74.5  74.8  74.8
Spanish  mBERT-zs   37.7  38.3  37.1
         ES-BERT    73.1  73.6  73.6
Chinese  mBERT-zs   49.1  50.1  48.8
         ZH-BERT    78.1  78.1  78.6
Table 6: Results on TempEval for the Double Task Adapter-based approaches (2TA). Significant improvements (paired t-test; p<0.05) marked in bold.
                       2TA   +FN   +VN
EN  EN-BERT  T-ident   67.5  68.1  68.9
             T-class   61.6  62.6  62.7
             ARG-ident  6.2   8.9   7.1
             ARG-class  3.9   6.7   5.0
AR  mBERT-zs T-ident   31.2  32.6  31.7
             T-class   26.3  27.1  29.3
             ARG-ident  5.9   6.0   6.9
             ARG-class  3.9   4.1   4.3
    AR-BERT  T-ident   40.6  42.3  43.0
             T-class   36.9  38.1  39.5
             ARG-ident   –     –     –
             ARG-class   –     –     –
ZH  mBERT-zs T-ident   54.6  56.3  58.1
             T-class   45.6  46.2  46.9
             ARG-ident  9.2  10.8  11.3
             ARG-class  8.0   8.5   9.9
    ZH-BERT  T-ident   72.3  73.1  72.0
             T-class   59.6  63.0  61.3
             ARG-ident  2.6   2.8   3.3
             ARG-class  2.3   2.6   2.9
Table 7: Results on ACE for the Double Task Adapter-based approaches (2TA). Significant improvements (paired t-test; p<0.05) marked in bold. Dashes denote argument-task setups that were not run (see footnote 13).
          FFT +FN-ES     TA +FN-ES     2TA +FN-ES
ES-BERT   78.0 (+0.4)    70.9 (+0.2)   73.8 (+0.2)
Table 8: Results (F1 scores) on Spanish TempEval for different configurations of Spanish BERT with an added Spanish FN-Adapter (FN-ES), trained on clean Spanish FN constraints. Numbers in brackets indicate relative performance w.r.t. the corresponding setup with an FN-Adapter trained on the (larger) set of noisy Spanish constraints obtained through automatic translation of verb pairs from English FN (the vtrans approach).

Cleanliness of Verb Knowledge. Gains from verb adapters suggest that there is potential to find supplementary information within structured lexical resources that can support distributional models in tackling tasks where nuanced knowledge of verb behaviour is important. The fact that we obtain the best transfer performance through noisy translation of English verb knowledge suggests that these benefits transcend language boundaries.

There are, however, two main limitations to the translation-based (vtrans) approach we used to train our target-language verb adapters (especially in the context of VerbNet constraints): (1) noisy translation based on cross-lingual semantic similarity may already break the VerbNet class membership alignment (i.e., words close in meaning may belong to different VerbNet classes due to differences in syntactic behaviour); and (2) verb classes are language-specific: given the delicate language-specific interplay of semantic and syntactic information, they cannot be directly ported to another language without adjustments. This is in contrast to the proven cross-lingual portability of synonymy and antonymy relations, shown in previous work on semantic specialisation transfer Mrkšić et al. (2017); Ponti et al. (2019), which rely on semantics alone. In the case of VerbNet, despite the cross-lingual applicability of a semantically-syntactically defined verb class as a lexical organisational unit, the fine-grained class divisions and exact class memberships may be too English-specific to allow direct automatic translation. On the contrary, the semantically driven FrameNet lends itself better to cross-lingual transfer, given that it focuses on the functions and roles played by event participants rather than their surface realisations (see §2.1). Indeed, although the FN and VN adapters both offer performance gains in our evaluation, the somewhat more consistent improvements from the FN-Adapter may be symptomatic of the resource’s greater cross-lingual portability.

To quickly verify whether noisy translation and direct transfer from English curb the usefulness of the injected verb knowledge, we additionally evaluate the injection of clean verb knowledge obtained from a small lexical resource available in one of the target languages: Spanish FrameNet Subirats and Sato (2004). Using the procedure described in §2.2, we derive 2,886 positive verb pairs from Spanish FN and train a Spanish FN-Adapter (on top of Spanish BERT) on this much smaller but clean set of Spanish FN constraints. The results in Table 8 show that, despite having 12 times fewer positive examples for training the verb adapter than the translation-based approach, the ‘native’ Spanish verb adapter outperforms its vtrans-based counterpart (Table 4), compensating for its limited coverage with gold-standard accuracy. Nonetheless, the challenge in using native resources in other languages lies in their very limited availability and their expensive, time-consuming manual construction. Our results reaffirm the usefulness of language-specific expert-curated resources and their ability to enrich state-of-the-art NLP models. This, in turn, suggests that work on optimising resource creation methodologies merits future research efforts on a par with modeling work.

5 Related Work

5.1 Event Extraction

The cost and complexity of event annotation requires robust transfer solutions capable of making fine-grained predictions in the face of data scarcity. Traditional event extraction methods relied on hand-crafted, language-specific features Ahn (2006); Gupta and Ji (2009); Llorens et al. (2010); Hong et al. (2011); Li et al. (2013); Glavaš and Šnajder (2015) (e.g., POS tags, entity knowledge, morphological and syntactic information), which limited their generalisation ability and effectively prevented language transfer.

More recent approaches commonly resorted to word embedding input and neural text encoders such as recurrent nets Nguyen et al. (2016); Duan et al. (2017); Sha et al. (2018) and convolutional nets Chen et al. (2015); Nguyen and Grishman (2015), as well as graph neural networks Nguyen and Grishman (2018); Yan et al. (2019) and adversarial networks Hong et al. (2018); Zhang and Ji (2018). As in most other NLP tasks, the most recent empirical advances in event trigger and argument extraction have been achieved through fine-tuning of LM-pretrained Transformer networks Yang et al. (2019a); Wang et al. (2019); M’hamdi et al. (2019); Wadden et al. (2019); Liu et al. (2020).

Limited training data nonetheless remains an obstacle, especially when facing previously unseen event types. Such data scarcity issues have been addressed through data augmentation methods – automatic data annotation Chen et al. (2017); Zheng (2018); Araki and Mitamura (2018) and bootstrapping for training data generation Ferguson et al. (2018); Wang et al. (2019). The recent release of MAVEN Wang et al. (2020c), a large English event detection dataset with annotations of event triggers only, partially remedies training data scarcity. MAVEN also demonstrates that even state-of-the-art Transformer-based models fail to yield satisfying event detection performance in the general domain. Since datasets of similar size are unlikely to appear for other event extraction tasks (e.g., event argument extraction), and especially for other languages, the need for external event-related knowledge and transfer learning approaches, such as the ones introduced in this work, is only emphasised.

Beyond event trigger (and argument)-oriented frameworks such as ACE and its lightweight variant ERE Aguilar et al. (2014); Song et al. (2015), several other event-focused datasets exist which frame the problem either as a slot-filling task Grishman and Sundheim (1996) or as an open-domain problem of extracting unconstrained event types and schemata from text Allan (2002); Minard et al. (2016); Araki and Mitamura (2018); Liu et al. (2019a). Small domain-specific datasets have also been constructed for event detection in biomedicine Kim et al. (2008); Thompson et al. (2009); Buyko et al. (2010); Nédellec et al. (2013), as well as in literary texts Sims et al. (2019) and on Twitter Ritter et al. (2012); Guo et al. (2013).

5.2 Semantic Specialisation

Representation spaces induced through self-supervised objectives from large corpora, be it the word embedding spaces Mikolov et al. (2013); Bojanowski et al. (2017) or those spanned by LM-pretrained Transformers Devlin et al. (2019); Liu et al. (2019b), encode only distributional knowledge, i.e., knowledge obtainable from large corpora. A large body of work focused on semantic specialisation (i.e., refinement) of such distributional spaces by means of injecting lexico-semantic knowledge from external resources such as WordNet Fellbaum (1998), BabelNet Navigli and Ponzetto (2010) or ConceptNet Liu and Singh (2004) expressed in the form of lexical constraints (Faruqui et al., 2015; Mrkšić et al., 2017; Glavaš and Vulić, 2018c; Kamath et al., 2019; Lauscher et al., 2020b, inter alia).

Joint specialisation models (Yu and Dredze, 2014; Nguyen et al., 2017; Lauscher et al., 2020b; Levine et al., 2020, inter alia) train the representation space from scratch on the large corpus, but augment the self-supervised training objective with an additional objective based on external lexical constraints. Lauscher et al. (2020b) add to the Masked LM (MLM) and next sentence prediction (NSP) pretraining objectives of BERT Devlin et al. (2019) an objective that predicts pairs of synonyms and first-order hyponymy-hypernymy pairs, aiming to improve word-level semantic similarity in BERT’s representation space. In a similar vein, Levine et al. (2020) add the objective that predicts WordNet supersenses. While joint specialisation models allow the external knowledge to shape the representation space from the very beginning of the distributional training, this also means that any change in lexical constraints implies a new, computationally expensive pretraining from scratch.

Retrofitting and post-specialisation methods (Faruqui et al., 2015; Mrkšić et al., 2017; Vulić et al., 2018; Ponti et al., 2018; Glavaš and Vulić, 2019; Lauscher et al., 2020a; Wang et al., 2020a, inter alia), in contrast, start from a pretrained representation space (word embedding space or a pretrained encoder) and fine-tune it using external lexico-semantic knowledge. Wang et al. (2020a) fine-tune the pre-trained RoBERTa Liu et al. (2019b) with lexical constraints obtained automatically via dependency parsing, whereas Lauscher et al. (2020a) use lexical constraints derived from ConceptNet to inject knowledge into BERT: both adopt adapter-based fine-tuning, storing the external knowledge in a separate set of parameters. In our work, we adopt a similar adapter-based specialisation approach. However, focusing on event-oriented downstream tasks, our lexical constraints reflect verb class memberships and originate from VerbNet and FrameNet.

6 Conclusion

We have investigated the potential of leveraging knowledge about semantic-syntactic behaviour of verbs to improve the capacity of large pretrained models to reason about events in diverse languages. We have proposed an auxiliary pretraining task to inject information about verb class membership and semantic frame-evoking properties into the parameters of dedicated adapter modules, which can be readily employed in other tasks where verb reasoning abilities are key. We demonstrated that state-of-the-art Transformer-based models still benefit from the gold standard linguistic knowledge stored in lexical resources, even those with limited coverage. Crucially, we showed that the benefits of the information available in resource-rich languages can be extended to other, resource-leaner languages through translation-based transfer of verb class/frame membership information.

In future work, we will incorporate our verb knowledge modules into alternative, more sophisticated approaches to cross-lingual transfer to explore the potential for further improvements in low-resource scenarios. Further, we will extend our approach to specialised domains where small-scale but high-quality lexica are available, to support distributional models in dealing with domain-sensitive verb-oriented problems.

Acknowledgments

This work is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no 648909) awarded to Anna Korhonen. The work of Goran Glavaš is supported by the Baden-Württemberg Stiftung (Eliteprogramm, AGREE grant).


Appendix A Frameworks for Annotating Event Expressions

Two prominent frameworks for annotating event expressions are TimeML Pustejovsky et al. (2003, 2005) and the Automatic Content Extraction (ACE) program Doddington et al. (2004). TimeML was developed as a rich markup language for annotating event and temporal expressions, addressing the problems of identifying event predicates and anchoring them in time, determining their relative ordering and temporal persistence (i.e., how long the consequences of an event last), as well as tackling contextually underspecified temporal expressions (e.g., last month, two days ago). Currently available English corpora annotated with the TimeML scheme include the TimeBank corpus Pustejovsky et al. (2003), a human-annotated collection of 183 newswire texts (including 7,935 annotated events, comprising both punctual occurrences and states which extend over time), and the AQUAINT corpus, with 80 newswire documents grouped by their covered stories, which allows tracing the progress of events through time Derczynski (2017). Both corpora, supplemented with a large, automatically TimeML-annotated training corpus, are used in the TempEval-3 task Verhagen and Pustejovsky (2008); UzZaman et al. (2013), which targets automatic identification of temporal expressions, events, and temporal relations.

The ACE dataset provides annotations for entities, the relations between them, and the events in which they participate in newspaper and newswire text. For each event, it identifies its lexical instantiation, i.e., the trigger, and its participants, i.e., the arguments, and the roles they play in the event. For example, an event of type “Conflict:Attack” (“It could swell to as much as $500 billion if we go to war in Iraq.”), triggered by the noun ‘war’, involves two arguments, the “Attacker” (“we”) and the “Place” (“Iraq”), each of which is annotated with an entity label (“GPE:Nation”).

Appendix B Adapter Training: Hyperparameter Search

We experimented with $n \in \{10, 15, 20, 30\}$ training epochs, as well as an early stopping approach using validation loss on a small held-out validation set as the stopping criterion, with a patience argument $p \in \{2, 5\}$; we found the adapters trained for the full 30 epochs to perform most consistently across tasks.

The size of the training batch varies based on the number $k$ of negative examples generated from the starting batch $B$ of positive pairs: e.g., by generating $k=3$ negative examples for each of 8 positive examples in the starting batch, we end up with a training batch of total size $8 + 3 \times 8 = 32$. We experimented with starting batches of size $B \in \{8, 16\}$ and found the configuration $k=3$, $B=16$ to yield the strongest results (reported in this paper).

Appendix C STM Training Details

We train the STM using the sets of English positive examples from each lexical resource (Table 1). Negative examples are generated using controlled sampling (see §2.2) with a $k=2$ [cc] configuration, ensuring that the generated negatives do not constitute positive constraints in the global set. We use pre-trained 300-dimensional static distributional word vectors computed on Wikipedia data with the fastText model Bojanowski et al. (2017), cross-lingually aligned using the RCSLS model of Joulin et al. (2018), to induce the shared cross-lingual embedding space for each source-target language pair. The STM is trained using the Adam optimizer Kingma and Ba (2015), a learning rate $l = 1e{-}4$, and a batch size of 32 (positive and negative) training examples, for a maximum of 10 iterations. We set the values of the other training hyperparameters as in Ponti et al. (2019), i.e., the number of specialisation tensor slices $K=5$ and the size of the specialised vectors $h=300$.