Trigger-free Event Detection via Derangement Reading Comprehension

Jiachen Zhao
HKUST
Hong Kong, China
Haiqin Yang
International Digital Economy Academy
Shenzhen, China

Abstract

Event detection (ED), which aims to detect events in text and categorize them, is vital to understanding actual happenings in real life. However, mainstream event detection models require high-quality expert annotations of triggers, which are often costly and thus deter the application of ED to new domains. Therefore, in this paper, we focus on low-resource ED without triggers and aim to tackle the following formidable challenges: multi-label classification, insufficient clues, and imbalanced event distribution. We propose a novel trigger-free ED method via a Derangement mechanism on a machine Reading Comprehension (DRC) framework. More specifically, we treat the input text as Context and concatenate it with all event type tokens, which are deemed Answers to an omitted default question. We can thus leverage the self-attention in pre-trained language models to absorb semantic relations between the input text and the event types. Moreover, we design a simple yet effective event derangement module (EDM) to prevent major events from being excessively learned so as to yield a more balanced training process. The experimental results show that our proposed trigger-free ED model is remarkably competitive with mainstream trigger-based models, showing its strong performance on low-resource event detection.

1 Introduction

The task of event detection (ED), which aims to spot the appearance of predefined event types in texts and classify them, is vital to understanding the actual happenings in real life Edouard (2017); Saeed et al. (2019). Take an example from the Automatic Content Extraction (ACE) corpus:

S: And they sent him to Baghdad and killed.

This sentence contains two events, Transport and Die. A desirable ED system should correctly identify these two events simultaneously. At first glance, this task can be arduous and challenging because event types exist only implicitly in sentences. Therefore, in the literature, researchers usually tackle this problem via a two-stage trigger-based framework. Triggers, i.e., words or phrases providing the clearest indication of an event occurrence, are first identified and then events are recognized accordingly Ahn (2006); Li et al. (2013); Chen et al. (2015). For example, in the above sentence, “sent” and “killed” are the triggers for the events Transport and Die, respectively.

However, trigger-based models are difficult to transfer to new domains because they require massive high-quality trigger annotation Lai et al. (2020); Lu et al. (2021). The annotation process is usually expensive and time-consuming Shen et al. (2021), and may take linguistics experts multiple rounds to screen the data Doddington et al. (2004). An effective but challenging solution to arduous labeling is event detection without triggers Zeng et al. (2018); Liu et al. (2019); Zheng et al. (2019). However, such a forbidding low-resource setting leads to insufficient clues, making trigger-free approaches inferior to trigger-based methods Liu et al. (2019). As we hope to extend ED to more domains with less human effort in real-world applications, a key question arises: how can we design a trigger-free ED model that is competitive with trigger-based models?

To fill this gap, we focus on event detection without triggers to reduce the strenuous effort of data labeling. We aim at tackling the following formidable challenges: (1) Multi-label issue: each input sentence may hold zero or multiple events, which can be formulated as a challenging multi-label classification task. (2) Insufficient clues: triggers are important for detecting events Zhang et al. (2020); Ebner et al. (2020). Without explicitly annotated triggers, we may lack sufficient clues to identify the event types and need to seek alternatives to shed light on the correlation between words and the event types. (3) Imbalanced event distribution: as shown in Fig. 2, events may follow the Matthew effect. Some events dominate the data while others contain only several instances. The imbalanced event distribution poses significant obstacles to learning and detecting minor events.

Hence, we propose a trigger-free ED method via a Derangement mechanism on a machine Reading Comprehension (DRC) framework to tackle these challenges. Figure 1 illustrates our proposed framework with three main modules: the RC encoder, the event derangement module (EDM), and the multi-label classifier. In the RC encoder, the input sentence, deemed as “Context”, and all event tokens, appended as “Answers”, are fed into BERT Devlin et al. (2019) simultaneously. Such a design allows the model to absorb all information without explicitly indicating triggers and enables the model to automatically learn helpful semantic relations between input texts and event tokens through the self-attention mechanism of Transformer Vaswani et al. (2017). During training, the EDM is activated with a certain probability only when the ground-truth events are major events. By perturbing the order of other event tokens, the model can avoid excessive updates on the major events, which implicitly under-samples the training instances of the major events and thus yields a more balanced training process. Finally, the learned contextual representations of event tokens are fed into a multi-label classifier to produce the probability of each event type in the input text.

In summary, the contribution of our work is threefold: (1) We propose a competitive paradigm for an important task, namely multi-label event detection without triggers. Through a simplified machine reading comprehension framework, we can directly capture the semantic relation between input texts and event types without explicitly annotated triggers. (2) During training, we implement a simple yet effective module, i.e., the event derangement module, to overcome the imbalanced learning issue. By perturbing the order of event tokens, we prevent major events from being excessively learned by the model, which makes the training process more balanced. (3) We report that our proposal can achieve state-of-the-art performance on event detection on public benchmark datasets. Further gradient explanation also indicates that our trigger-free model can spot and link triggers to the corresponding events by itself and identify related event arguments as well.

Figure 1: Our proposed DRC is on top of BERT. It consists of three main modules: RC encoder, the event derangement module (EDM), and the multi-label classifier. The EDM is amplified in the upper-left corner for better illustration; see more description in the main text.

2 Related Work

Existing event detection methods usually rely on triggers to detect events. These approaches need to explicitly identify triggers and then assign them predefined event types. For example, in Li et al. (2013), a structured perceptron is applied to hand-crafted features to identify triggers. In Nguyen et al. (2016), triggers and arguments are jointly identified by bidirectional recurrent neural networks. Additionally, some approaches formulate the task as machine reading comprehension or question answering Du and Cardie (2020); Liu et al. (2020). A predefined question template concatenated with the input sentence is fed into a language model to identify the corresponding triggers. Searching for the optimal result among multiple predefined templates may be needed during inference.

To reduce the massive annotation of triggers in new domains, more and more proposals have reformulated event detection as a low-resource natural language processing task by discarding some information Liu et al. (2019); Lu et al. (2021) or only using partial data Lai et al. (2020); Hsu et al. (2022). Event detection without triggers is proposed in Liu et al. (2019), which incorporates features of event types into input sentences via the attention mechanism on an LSTM model. The proposal does not utilize recently developed pre-trained language models and yields much worse performance than trigger-based methods. Moreover, it splits a multi-label event detection task into multiple binary classification tasks, one for each event, which leads to slow inference. Most recently, Lu et al. (2021) proposed a sequence-to-structure framework to extract events in an end-to-end manner. Though it makes token-level annotations (i.e., trigger offsets) unnecessary, the proposed method still requires strenuous effort to specify trigger words. On the other hand, some approaches can attain satisfactory results with only part of the raw data labeled Lai et al. (2020); Hsu et al. (2022). For example, Lai et al. (2020) adopt a prototypical framework to classify representative vectors computed from the embeddings of input texts. Hsu et al. (2022) feed input texts and a manually designed prompt into a generation-based neural network to output a natural sentence for further event extraction. Nevertheless, these approaches are less useful when more raw data is accessible, since part of the data is left unused in order to reduce annotation.

3 Methodology

3.1 Task Definition

Following Ahn (2006); Ji and Grishman (2008); Liu et al. (2019), we are given a set of training data, $\{(x_{i},y_{i})\}_{i=1}^{N}$, where $N$ is the number of sentence-event pairs. $x_{i}=w_{i1}w_{i2}\ldots w_{i|x_{i}|}$ is the $i$-th sentence with $|x_{i}|$ tokens and $y_{i}\subseteq\mathcal{S}$ is an event set, which records all related event(s). $\mathcal{S}=\{e_{1},e_{2},\ldots,e_{n}\}$ consists of all $n$ events, including an additional “negative” event meaning that the sentence does not contain any events. Our goal is to train a model that detects the corresponding event type(s) as accurately as possible given an input sentence. This can be formulated as a multi-label classification task. The key questions are: (1) how to learn more precise representations that embed the semantic information between texts and event types, and (2) how to perform the multi-label classification effectively.

Major Events vs. Minor Events

Imbalanced event distribution is a major issue in our setting. Traditionally, the Imbalance Ratio (IR) Galar et al. (2012) is a typical metric to estimate the imbalance of the data. However, IR provides little distribution information about the middle classes Ortigosa-Hernández et al. (2017). To articulate the difference between major events and minor events, we borrow the definition in Dong et al. (2019). We first sort all event types in descending order with respect to the number of instances in each class and obtain the sorted sequence:

$$S_{\mbox{SA}}=e_{1}\ldots e_{n},\quad\mbox{where}~|e_{i}|\geq|e_{i+1}|. \tag{1}$$

Here, $e_{i}$ denotes the $i$ -th event type with $|{e_{i}}|$ instances.

Then, we define the set of major events as the top- $k$ elements in $S_{\mbox{SA}}$ while the remaining elements as the minor events:

$$E_{\mbox{Major}}=\{e_{i}\mid i=1,2,\ldots,k\}, \tag{2}$$
$$E_{\mbox{Minor}}=\{e_{i}\mid i=k+1,\ldots,n\}, \tag{3}$$

where $k$ is determined by the hyperparameter $\alpha$ and rounded to the nearest integer when the result is not an integer. Here, $\alpha$ indicates the percentage of the major events in all $N$ sentence-event pairs:

$$\alpha\cdot N=\sum_{i=1}^{k}|e_{i}|. \tag{4}$$

Usually, $\alpha$ is simply set to 0.5, as in Dong et al. (2019).
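The split defined by Eqs. (1)-(4) can be illustrated with a short Python sketch. It is only one possible reading of the rounding rule: we assume that $k$ is the prefix length whose cumulative instance count is closest to $\alpha N$, and that ties in the sort are broken alphabetically. The function name `split_major_minor` and the toy labels are hypothetical.

```python
from collections import Counter


def split_major_minor(event_labels, alpha=0.5):
    """Split event types into major and minor sets following Eqs. (1)-(4).

    event_labels: iterable of event-type names, one per sentence-event pair.
    alpha: target fraction of pairs covered by the major events.
    Returns (sorted_events, major_events, minor_events).
    """
    counts = Counter(event_labels)
    # Eq. (1): sort event types by instance count in descending order.
    # Alphabetical tie-breaking is an assumption made for determinism.
    sorted_events = sorted(counts, key=lambda e: (-counts[e], e))
    target = alpha * sum(counts.values())

    # Eq. (4): choose k so that the cumulative count of the first k
    # event types is as close as possible to alpha * N (assumed rule).
    best_k, best_gap, running = 1, float("inf"), 0
    for k, event in enumerate(sorted_events, start=1):
        running += counts[event]
        gap = abs(running - target)
        if gap < best_gap:
            best_k, best_gap = k, gap

    # Eqs. (2)-(3): the top-k types are major, the rest are minor.
    return sorted_events, sorted_events[:best_k], sorted_events[best_k:]


labels = ["Attack"] * 5 + ["Transport"] * 3 + ["Die"] * 2 + ["Marry"]
s_sa, major, minor = split_major_minor(labels, alpha=0.5)
```

In this toy example the single most frequent type already covers about half of the pairs, so it alone forms the major set.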

3.2 Our Proposal

Figure 1 outlines the overall structure of our proposed DRC, which consists of three main modules: the RC encoder, the multi-label classifier, and the event derangement module (EDM).

Algorithm 1 Event Derangement

Input: Input sentence $x$; the initial event sequence $S_{\mbox{init}}$; the sequence of all event types in descending order $S_{\mbox{SA}}$; probability $q$; number $r$
Output: Deranged sequence of event tokens $S_{\mbox{O}}$
1: Initialize $E_{\mbox{GT}}$ as the set of the ground truth event types implied by $x$
2: Initialize $E_{\mbox{D}}$ with $r$ events that are not in $E_{\mbox{GT}}$ from the beginning of $S_{\mbox{SA}}$
3: Initialize $E_{tmp}=\emptyset$ # a helper set to record the selected event types during derangement
4: Initialize $S_{\mbox{O}}=[]$
5: Generate $rand$ uniformly from [0, 1]
6: if $E_{\mbox{GT}}\cap E_{\mbox{Major}}\neq\emptyset$ and $rand<q$ then
7:   for $e_{curr}\in S_{\mbox{init}}$ do
8:     if $e_{curr}\in E_{\mbox{D}}$ then
9:       Randomly select $e$ from $E_{\mbox{D}}$ such that $e\neq e_{curr}$ and $e\notin E_{tmp}$
10:       Append $e$ to $S_{\mbox{O}}$
11:       Add $e$ to $E_{tmp}$
12:     else
13:       Append $e_{curr}$ to $S_{\mbox{O}}$
14:     end if
15:   end for
16: else
17:   $S_{\mbox{O}}=S_{\mbox{init}}$
18: end if
19: Return $S_{\mbox{O}}$

RC Encoder

Our proposal is based on BERT due to its power in learning the contextual representation in the sequence of tokens Devlin et al. (2019). We present a simplified machine reading comprehension (MRC) framework:

[CLS] Context [SEP] Answers

where Context is the input sentence and Answers is the sequence of all event types. This setup is close to multiple-choice MRC. That is, it views the input sentence as Context and event types as the multiple choices (or Answers) with an omitted default question: “What is the event type/what are the event types in the Context?”. With both input texts and event tokens as the input of BERT, we can utilize BERT to automatically capture the relation between input texts and event types without explicitly indicating the triggers.

In the implementation, given a training set, we first generate a random event order index $I_{\mbox{init}}=s_{1}\ldots s_{n}$ , which is a permutation of $\{1,\ldots,n\}$ , and obtain its initial sequence of event tokens $S_{\mbox{init}}=e_{s_{1}}\ldots e_{s_{n}}$ . The event sequence $S_{\mbox{init}}$ is kept fixed for both training and testing. Hence, given a sentence $x=w_{1}\ldots w_{|x|}$ , we obtain

$$\mbox{Input}=[\mbox{CLS}]\,w_{1}\,\ldots\,w_{|x|}\,[\mbox{SEP}]\,e_{s_{1}}\,\ldots\,e_{s_{n}}. \tag{5}$$

To avoid word-piece segmentation, we enclose each event type in square brackets, e.g., the event token of Transport is converted to “[Transport]”. This allows us to learn more precise event token representations and yields better performance (experimental results are shown in Appendix A.1). Additionally, we apply position embeddings to event tokens based on their order in $S_{\mbox{init}}$, following the standard setup of BERT. This makes BERT order-sensitive and further helps BERT distinguish event types; see more discussion in Sec. 5.1.
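As a rough illustration of Eq. (5), the following Python sketch assembles the token sequence from a sentence and a fixed event order. It only builds the list of tokens; registering the bracketed event tokens in an actual tokenizer vocabulary is not shown. The function name `build_rc_input` and the 0-based `order` argument (standing in for $I_{\mbox{init}}$, which the text indexes from 1) are our own assumptions.

```python
def build_rc_input(sentence_tokens, event_types, order):
    """Return the token sequence [CLS] Context [SEP] Answers of Eq. (5).

    sentence_tokens: list of word tokens of the input sentence.
    event_types: list of event-type names.
    order: a permutation of range(len(event_types)), fixed once and
           reused for training and testing (0-based by assumption).
    """
    if sorted(order) != list(range(len(event_types))):
        raise ValueError("order must be a permutation of the event indices")
    # Each event type becomes a single bracketed token, e.g. "[Transport]".
    event_tokens = [f"[{event_types[i]}]" for i in order]
    return ["[CLS]", *sentence_tokens, "[SEP]", *event_tokens]


example_input = build_rc_input(
    ["they", "sent", "him"],
    ["Transport", "Die", "Attack"],
    order=[2, 0, 1],
)
```

Because the same `order` is reused for every sentence, each event token always occupies the same relative slot after `[SEP]`, which is what lets position embeddings carry event identity.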

After that, we learn the hidden representations:

$$h_{[\mbox{CLS}]},h_{1}^{w},\ldots,h_{|x|}^{w},h_{[\mbox{SEP}]},h_{1}^{e},\ldots,h_{n}^{e}=\text{BERT}(\mbox{Input}), \tag{6}$$

where $h_{i}^{w}$ is the hidden state of the $i$ -th input token and $h_{i}^{e}$ is the hidden state of the corresponding event type, namely $e_{s_{i}}$ .

Multi-label Classifier

After learning the contextualized representations of the Input, we construct the multi-label classifier. Traditional methods usually apply a Multi-Layer Perceptron (MLP) on the [CLS] token to yield the classifier. In contrast, we feed all hidden states of event types to an MLP for the classification, a choice supported by the evaluation in Appendix A.2. Hence, given an input sentence $x$, we compute the predicted probability for the corresponding events by

$$\hat{p}=\sigma\left(\text{MLP}\left(h_{1}^{e},\ldots,h_{n}^{e}\right)\right). \tag{7}$$

Since each entry of $\hat{p}$ lies between 0 and 1, for simplicity, we follow Liu et al. (2019) and assign an event label whenever its predicted probability satisfies $\hat{p}\geq 0.5$.

Our model can then be trained by minimizing the following loss:

$$\mathcal{L}\propto-\sum_{i=1}^{N}\sum_{j=1}^{n}\left(p_{ij}\log(\hat{p}_{ij})+(1-p_{ij})\log(1-\hat{p}_{ij})\right) \tag{8}$$

where $p_{ij}=1$ indicates that the $j$-th event type is a ground-truth event of the $i$-th input text. Different from Liu et al. (2019), which converts the multi-label classification task into a series of binary classification tasks, our proposal can output all event type(s) simultaneously.
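The decision rule of Eq. (7) and the per-sentence term of the loss in Eq. (8) can be sketched in plain Python. The logits below stand in for the MLP outputs, which are not computed here; the function names and the clipping constant `eps` are assumptions added for illustration and numerical safety.

```python
import math


def sigmoid(z):
    """Logistic function used in Eq. (7)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_events(logits, event_names, threshold=0.5):
    """Return every event type whose probability reaches the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [name for name, p in zip(event_names, probs) if p >= threshold]


def bce_loss(logits, targets, eps=1e-12):
    """Binary cross-entropy summed over event types for one sentence,
    i.e. the inner sum over j in Eq. (8)."""
    loss = 0.0
    for z, t in zip(logits, targets):
        # Clip the probability to avoid log(0); eps is an assumption.
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss


names = ["Transport", "Attack", "Die"]
predicted = predict_events([2.0, -1.0, 0.0], names)
loss_single = bce_loss([0.0], [1])
```

All event types are scored in one pass, so a sentence with several events receives several labels at once rather than requiring one binary decision per forward pass.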

EDM

We describe the implementation of EDM here. More detailed explanations of the design of EDM, with supporting experiments, are provided in Sec. 5.2. During training, the event derangement module is used to mitigate the imbalanced learning issue. In combinatorics, a derangement is a permutation of the elements of a set in which no element appears in its original position. In our implementation, $E_{\mbox{D}}$ only consists of $r$ events from the beginning of $S_{\mbox{SA}}$ excluding those in $E_{\mbox{GT}}$; see line 2 of Algo. 1. Line 6 of Algo. 1 shows that the derangement procedure is conducted with probability $q$ only when the target (i.e., the ground-truth) events are major events. In lines 7-10 of Algo. 1, we derange the sequence $S_{\mbox{init}}$ by switching different events in $E_{\mbox{D}}$.

The event derangement module can prohibit the model from excessively learning major events, which works similarly to under-sampling the training instances of major events. The training process will be more balanced via EDM.
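The procedure of Algo. 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function and argument names are ours, and we replace the per-slot random draw of line 9 with rejection sampling over whole permutations, because a greedy draw can reach a final slot with no valid candidate left. We also assume that the sequence is left unchanged when fewer than two events are available for derangement.

```python
import random


def derange_events(s_init, s_sa, e_gt, e_major, q, r, rng=random):
    """Return the (possibly) deranged event sequence S_O of Algo. 1.

    s_init: fixed initial event order; s_sa: events sorted by frequency;
    e_gt: ground-truth events of the sentence; e_major: major events;
    q: derangement probability; r: number of events to derange.
    """
    e_gt = set(e_gt)
    # Line 2: the first r events of S_SA that are not ground-truth events.
    e_d = set([e for e in s_sa if e not in e_gt][:r])
    # Line 6: derange only if a ground-truth event is major, with prob. q.
    if not (e_gt & set(e_major)) or rng.random() >= q or len(e_d) < 2:
        return list(s_init)
    # Lines 7-15: refill the slots held by E_D with a derangement of E_D.
    slots = [e for e in s_init if e in e_d]
    perm = slots[:]
    while any(a == b for a, b in zip(slots, perm)):
        rng.shuffle(perm)  # rejection sampling (our substitution)
    replacement = iter(perm)
    return [next(replacement) if e in e_d else e for e in s_init]


s_init = ["Die", "Attack", "Marry", "Transport", "Negative"]
s_sa = ["Attack", "Transport", "Die", "Negative", "Marry"]
deranged = derange_events(
    s_init, s_sa, e_gt={"Attack"}, e_major={"Attack", "Transport"},
    q=1.0, r=3, rng=random.Random(0),
)
```

In the example, the ground-truth event and the events outside $E_{\mbox{D}}$ keep their positions, while the three selected events all move to a different slot.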

4 Experiments

We present the experimental setups and overall performance in the following.

Methods	Subtypes P (%)	Subtypes R (%)	Subtypes F1 (%)	Main P (%)	Main R (%)	Main F1 (%)
TBNNAM Liu et al. (2019)	76.2	64.5	69.9	-	-	-
TEXT2EVENT Lu et al. (2021)	69.6	74.4	71.9	-	-	-
DEGREE Hsu et al. (2022)	-	-	73.3	-	-	-
BERT_RC_Trigger Du and Cardie (2020)	71.7	73.7	72.3	-	-	-
RCEE_ER Liu et al. (2020)	75.6	74.2	74.9	-	-	-
DMBERT + Boot Wang et al. (2019)	77.9	72.5	75.1	-	-	-
CLEVE Wang et al. (2021)	78.1	81.5	79.8	-	-	-
BERT Finetune	72.8	68.7	70.7	78.0	70.8	74.2
Our ED_RC	76.9	72.3	74.7	78.9	75.4	77.1
Our ED_DRC	79.5	76.8	78.1	78.7	79.0	78.9

Table 1: Event detection results on both the event subtypes and event main types of the ACE2005 corpus.

4.1 Experimental Setups

Datasets and Evaluation

We conducted experiments on two benchmark datasets:

•

The ACE2005 corpus consists of 8 event main types and 33 subtypes Doddington et al. (2004). The data distribution of event subtypes is heavily imbalanced (IR $\approx$ 605.5), as shown in Fig. 2(a). For example, the types Attack, Transport, and Die account for over half of the total training data. Meanwhile, the distribution of event main types is more balanced (IR $\approx$ 13.1; see Fig. 2(b)). For fair comparison, we follow the evaluation of Li et al. (2013); Liu et al. (2019, 2020), i.e., we randomly select 30 articles from different genres as the validation set, conduct a blind test on a separate set of 40 ACE2005 newswire documents, and use the remaining 529 articles as the training set.
•

The TAC-KBP-2015 corpus Ellis et al. (2015) is annotated with event nuggets in 38 types. We process the data following Peng et al. (2016). The data distribution is more balanced than ACE2005 with IR $\approx 61.5$ as shown in Fig. 2(c).

The standard metrics: Precision, Recall, and F1 scores, are applied to evaluate the model performance.
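For reference, Precision, Recall, and F1 over multi-label predictions can be computed as in the sketch below. The text does not state the averaging scheme or whether the “negative” type is counted, so micro-averaging over predicted and gold (sentence, event) pairs is an assumption here, as is the function name `micro_prf`.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1.

    gold, pred: lists of sets of event types, one set per sentence.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correct predictions
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold_sets = [{"Transport", "Die"}, {"Attack"}, set()]
pred_sets = [{"Transport"}, {"Attack", "Die"}, set()]
p_score, r_score, f_score = micro_prf(gold_sets, pred_sets)
```

Here two of the three predicted events are correct and two of the three gold events are recovered, so all three scores equal 2/3.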

Implementation Details

Our implementation is in PyTorch. The bert-base-uncased model from Hugging Face Wolf et al. (2019) is adopted as the backbone. The MLP consists of two layers with sizes of 768 and the number of event types (e.g., 34 for ACE2005 and 39 for TAC-KBP-2015, including a “negative” type) to predict the probability of the input sentence being assigned to the corresponding event types. We follow Dong et al. (2019) to set $\alpha$ to 0.5 and round $k$ to the nearest integer based on Eq. (4). In EDM, the derangement probability $q$ is set to 0.2. Based on our validation observations (see more supporting results in Appendix A.5), we set the number of deranged tokens $r$ to 24 for both the event subtypes in ACE2005 and TAC-KBP-2015, and to 3 for the event main types in ACE2005. The batch size is set to 8 and the dropout rate to 0.1. Adam Kingma and Ba (2015) is used as the optimizer with a learning rate of $2\times 10^{-5}$. We train our models for 10 epochs, which gives the best performance. All experiments are conducted on an NVIDIA A100 GPU and take around 1.5 hours.
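The reported settings can be collected in a small configuration object, sketched below. The class name `DRCConfig` and its field names are hypothetical; only the values are taken from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DRCConfig:
    """Hyperparameters as reported in the text (field names are ours)."""
    backbone: str = "bert-base-uncased"
    hidden_size: int = 768        # size of the first MLP layer
    alpha: float = 0.5            # share of pairs covered by major events
    derange_prob_q: float = 0.2   # probability of applying EDM
    num_deranged_r: int = 24      # value reported for event subtypes
    batch_size: int = 8
    dropout: float = 0.1
    learning_rate: float = 2e-5
    epochs: int = 10


subtype_config = DRCConfig()
# Event main types in ACE2005 use a smaller number of deranged tokens.
main_type_config = DRCConfig(num_deranged_r=3)
```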

4.2 Overall Performance

We name our proposal without and with EDM as ED_RC and ED_DRC, respectively, and compare them with several competitive baselines on ACE2005. For low-resource ED, we consider 1) TBNNAM Liu et al. (2019): an LSTM model detecting events without triggers; 2) TEXT2EVENT Lu et al. (2021): a sequence-to-structure model that directly learns from parallel text-record annotation and requires no trigger offsets; 3) DEGREE Hsu et al. (2022): a generation-based model that leverages manually designed prompts so as to employ only partial data. For trigger-based ED, we consider 4) BERT_RC_Trigger Du and Cardie (2020) and 5) RCEE_ER Liu et al. (2020): both BERT-based models casting event extraction as an MRC task; 6) DMBERT Wang et al. (2019): a BERT-based model leveraging adversarial training for weakly supervised events, where DMBERT Boot stands for bootstrapped DMBERT; 7) CLEVE Wang et al. (2021): a contrastive pre-training framework that exploits the rich event knowledge from large-scale unsupervised data. The baselines are only evaluated on the event subtypes.

Table 1 reports the overall performance on the ACE2005 corpus. It shows that

•

Although our proposed ED_RC does not have access to the triggers, it attains much better performance than the other low-resource ED models, namely TBNNAM, TEXT2EVENT, and DEGREE. Its performance is also competitive with DMBERT and RCEE_ER, reaching a 74.7% F1 score, which is only 0.4 points below that of DMBERT Boot, the best baseline other than CLEVE. The result shows that our proposed RC framework is capable of learning relations between given texts and event types even without triggers.
•

After including EDM, our proposed ED_DRC outperforms all compared methods except CLEVE on all three metrics. Despite the significant lack of clues, our trigger-free ED_DRC attains an F1 score close to CLEVE’s (78.1% vs. 79.8%). Additionally, the F1 score of ED_DRC is 3.0 points higher than that of DMBERT Boot.
•

To verify the consistency of our proposal, we also evaluate the performance on event main types. In this case, our proposed ED_RC and ED_DRC again improve the F1 score, by 2.9 and 4.7 points over the finetuned BERT, respectively.

Methods	P	R	F1
BERT Finetune	84.1	65.0	71.7
Our ED_RC	77.4	74.8	76.1
Our ED_DRC	79.8	75.2	77.4

Table 2: Event detection results on TAC-KBP-2015.

To further test the generalization of our proposal, we report results on a different dataset, TAC-KBP-2015 (see Table 2). They demonstrate that our ED_RC outperforms the finetuned base BERT by 4.4 points in F1 score. The event derangement module further improves the F1 score by 1.3 points in ED_DRC. This increment is smaller than that on the ACE2005 dataset, which can be attributed to the more balanced event type distribution of TAC-KBP-2015. All in all, our proposal is shown to be effective on both the ACE2005 English dataset and the TAC-KBP-2015 dataset.

5 More Analysis

In this section, we will further analyze the effect of each component in our proposal and explain the underlying mechanism.

Methods	P	R	F1
ED_RC_Same	75.7	71.6	73.6
ED_RC	76.9	72.3	74.7
ED_RC_Shuffle_Test	18.2	9.2	12.2
ED_DRC_Shuffle_Test	66.0	45.1	53.6
ED_DRC	79.5	76.8	78.1

Table 3: Evaluation results on different kinds of event orders for ACE2005.

5.1 Effect of Event Orders

Table 3 explores the effect of the event orders in different cases to understand the underlying mechanism of our proposal. The first two rows suggest that event orders help our ED_RC detect events. Those two rows record the results of ED_RC_Same and ED_RC, where ED_RC_Same applies the same position embedding to all event types to eliminate the difference in the event orders. In contrast, ED_RC applies a different position embedding to each event type. The results of the first two rows show that by leveraging the positional difference of event tokens, ED_RC gains a 1.1-point improvement in F1 score.

Additionally, our ED_RC is shown to be event-order-sensitive. This can be verified by the results of ED_RC and ED_RC_Shuffle_Test. Here, ED_RC_Shuffle_Test is trained with the same event order of ED_RC, but tested with a shuffled event order sequence. By confusing ED_RC with a different event order sequence during inference, we obtain a devastating drop on the F1 score, from 74.7% to 12.2%. This implies that our ED_RC mainly relies on the order of events to recognize them.

Furthermore, by performing derangement on ED_RC_Shuffle_Test, we obtain significant performance improvement on ED_DRC_Shuffle_Test. We conjecture that our proposed event derangement module undermines the excessive reliance of ED_RC on event orders by introducing perturbation. The model is thus made to learn more about semantics between the input text and the event types, which helps ED_DRC outperform ED_RC.

5.2 Effect of EDM

Fig. 3 shows the training losses of ED_DRC and ED_RC on ACE2005 with respect to the major events and the minor events to illustrate how EDM works. To amplify the effect, we test an extreme case, i.e., setting $q=1.0$ , which means that the event derangement will be conducted for each instance whose ground-truth events are major events. From Fig. 3, we observe that

•

The solid line with circle markers records the loss of ED_DRC on the major events and shows that the loss drops much faster and is much smaller (close to zero) than the counterpart of ED_RC.
•

The solid line with ’x’ markers records the loss of ED_DRC on the minor events and shows that the loss is relatively higher than the counterpart of ED_RC before convergence.

We conjecture that the swift convergence of ED_DRC on major events is attributed to the positional hint of ground-truth (major) events given by derangement, because our model is sensitive to event orders. During derangement, the positions of ground-truth (major) events are preserved while some surrounding events are deranged. Such positional hints help the model to recognize ground-truth events without much effort. The training loss on the major events is marginal, which yields little gradient update on the major events in ED_DRC. The major events are thus prevented from being excessively learned. Meanwhile, the relatively higher loss on the minor events makes ED_DRC focus more on updating the model when the training instances are from the minor events. In short, our derangement procedure implicitly under-samples the instances of major events during training, which makes the training process more balanced. Thanks to EDM, our ED_DRC increases the F1 score of ED_RC by 3.4 points, as shown in Table 1. More significantly, the F1 score on the minor events is improved from 69.1% to 72.4%.

5.3 Gradient Explanation

In order to interpret how our model understands input texts and identifies event types, we compute the gradients with respect to the embeddings of the input text. Those gradients quantify the influence of changes in the tokens on the predictions. In the literature, the gradient explanation has been verified as a more stable method to explain the attention model Adebayo et al. (2018) than the attention weights in BERT, because the attention weights may be misleading Jain and Wallace (2019) or are not directly interpretable Brunner et al. (2020). Here, we pick the example from ACE2005 in Sec. 1 and select five events by the following criteria: “Die” and “Transport” are the target events; “Negative” and “Attack” are the two most common event types; and “Execute” is a minor event semantically close to the target events. Figure 4 clearly shows that

•

For the event of “Die”, our ED_DRC can automatically focus on its trigger word “killed”, while for the event “Transport”, the trigger “sent” is also highlighted by the model. But for non-target events, our ED_DRC attains low gradients on the triggers or gets high gradients on unrelated tokens (e.g., “to”).
•

Surprisingly, our ED_DRC can also spot the arguments related to the target events. For example, for the event of “Die”, “Baghdad” yields a significantly higher gradient, which corresponds to the argument of PLACE. Similarly, for the event of “Transport”, “they” and “him” also yield relatively larger gradients, which correspond to the arguments of AGENT and ARTIFACT, respectively.

In summary, these observations indicate that our proposed ED_DRC can successfully link triggers to the corresponding target events and may take related event arguments into consideration as well for identifying events. More gradient visualization results on samples from TAC-KBP-2015 are provided in Fig. 7 of Appendix A.6 and reveal similar observations.
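The idea of gradient-based token attribution can be illustrated with a toy Python sketch. It is not the procedure used for Figure 4: central finite differences stand in for backpropagation through BERT, the L2 norm used to aggregate per-dimension gradients into one score per token is an assumption, and `toy_score` is a hypothetical stand-in for the predicted probability of one event type.

```python
import math


def token_saliency(score_fn, embeddings, eps=1e-4):
    """L2 norm of d(score)/d(embedding) for each token.

    score_fn: maps a list of embedding vectors to a scalar score.
    embeddings: list of per-token embedding vectors (lists of floats).
    Gradients are approximated with central finite differences.
    """
    saliency = []
    for t, vec in enumerate(embeddings):
        grad_sq = 0.0
        for d in range(len(vec)):
            plus = [list(v) for v in embeddings]
            minus = [list(v) for v in embeddings]
            plus[t][d] += eps
            minus[t][d] -= eps
            g = (score_fn(plus) - score_fn(minus)) / (2 * eps)
            grad_sq += g * g
        saliency.append(math.sqrt(grad_sq))
    return saliency


def toy_score(emb):
    """Hypothetical scorer that depends mostly on the second token."""
    return 0.1 * emb[0][0] + 3.0 * emb[1][0]


toy_embeddings = [[0.2, -0.1], [0.5, 0.3], [0.0, 0.1]]
scores = token_saliency(toy_score, toy_embeddings)
```

The second token receives the largest score because the toy scorer is most sensitive to it, which mirrors how a trigger word would stand out for its target event.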

6 Conclusion and Future Work

In this paper, we propose a novel trigger-free event detection method via Derangement mechanism on a machine Reading Comprehension framework. By adopting BERT in the machine reading comprehension framework, we can absorb the semantic relations between the original input text and the event type(s). Moreover, the proposed event derangement module can mitigate the imbalanced training process. Via position perturbation on the major events, we can prevent the major events from being excessively learned while focusing more on updating the model when the training instances are from the minor events. We conduct empirical evaluation to show that our proposal achieves state-of-the-art performance over previous methods. Meanwhile, we make further analysis to uncover why our proposal works. The provided gradient visualization results imply that our proposed trigger-free event detection method can highlight the triggers for the corresponding events by itself and signify the related event arguments as well.

Several promising directions can be considered in the future. First, since our proposal is event-order-sensitive, it is worth exploring how to generate an optimal initial event order to further improve the performance. Second, the gradients of our model with respect to input words have been shown to be effective in highlighting triggers and arguments. It may be meaningful to leverage those gradients to help extract the key information of events, e.g., arguments. Third, it would be worthwhile to adapt our proposal to other information extraction tasks, e.g., relation extraction, to extend its application scope.

References

Adebayo et al. (2018) Julius Adebayo, Justin Gilmer, Michael Muelly, Ian J. Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 9525–9536.
Ahn (2006) David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8.
Brunner et al. (2020) Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. 2020. On identifiability in transformers. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Chen et al. (2015) Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 167–176. The Association for Computer Linguistics.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
Doddington et al. (2004) George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The automatic content extraction (ACE) program – tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA).
Dong et al. (2019) Qi Dong, Shaogang Gong, and Xiatian Zhu. 2019. Imbalanced deep learning by minority class incremental rectification. IEEE Trans. Pattern Anal. Mach. Intell., 41(6):1367–1381.
Du and Cardie (2020) Xinya Du and Claire Cardie. 2020. Event extraction by answering (almost) natural questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 671–683. Association for Computational Linguistics.
Ebner et al. (2020) Seth Ebner, Patrick Xia, Ryan Culkin, Kyle Rawlins, and Benjamin Van Durme. 2020. Multi-sentence argument linking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8057–8077. Association for Computational Linguistics.
Edouard (2017) Amosse Edouard. 2017. Event detection and analysis on short text messages. (Détection d’événement et analyse des messages courts). Ph.D. thesis, University of Côte d’Azur, France.
Ellis et al. (2015) Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie M. Strassel. 2015. Overview of linguistic resources for the TAC KBP 2015 evaluations: Methodologies and results. In Proceedings of the 2015 Text Analysis Conference, TAC 2015, Gaithersburg, Maryland, USA, November 16-17, 2015. NIST.
Galar et al. (2012) Mikel Galar, Alberto Fernández, Edurne Barrenechea Tartas, Humberto Bustince Sola, and Francisco Herrera. 2012. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C, 42(4):463–484.
Hsu et al. (2022) I-Hung Hsu, Kuan-Hao Huang, Elizabeth Boschee, Scott Miller, Prem Natarajan, Kai-Wei Chang, and Nanyun Peng. 2022. DEGREE: A data-efficient generation-based event extraction model. In 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
Jain and Wallace (2019) Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 3543–3556. Association for Computational Linguistics.
Ji and Grishman (2008) Heng Ji and Ralph Grishman. 2008. Refining event extraction through cross-document inference. In ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA, pages 254–262. The Association for Computer Linguistics.
Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Lai et al. (2020) Viet Dac Lai, Thien Huu Nguyen, and Franck Dernoncourt. 2020. Extensively matching for few-shot learning event detection. In Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events, NUSE@ACL 2020, Online, July 9, 2020, pages 38–45. Association for Computational Linguistics.
Li et al. (2013) Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers, pages 73–82. The Association for Computer Linguistics.
Liu et al. (2020) Jian Liu, Yubo Chen, Kang Liu, Wei Bi, and Xiaojiang Liu. 2020. Event extraction as machine reading comprehension. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 1641–1651. Association for Computational Linguistics.
Liu et al. (2019) Shulin Liu, Yang Li, Feng Zhang, Tao Yang, and Xinpeng Zhou. 2019. Event detection without triggers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 735–744. Association for Computational Linguistics.
Lu et al. (2021) Yaojie Lu, Hongyu Lin, Jin Xu, Xianpei Han, Jialong Tang, Annan Li, Le Sun, Meng Liao, and Shaoyi Chen. 2021. Text2event: Controllable sequence-to-structure generation for end-to-end event extraction. In ACL/IJCNLP, pages 2795–2806. Association for Computational Linguistics.
Nguyen et al. (2016) Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 300–309. The Association for Computational Linguistics.
Ortigosa-Hernández et al. (2017) Jonathan Ortigosa-Hernández, Iñaki Inza, and José Antonio Lozano. 2017. Measuring the class-imbalance extent of multi-class problems. Pattern Recognit. Lett., 98:32–38.
Peng et al. (2016) Haoruo Peng, Yangqiu Song, and Dan Roth. 2016. Event detection and co-reference with minimal supervision. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 392–402. The Association for Computational Linguistics.
Saeed et al. (2019) Zafar Saeed, Rabeeh Ayaz Abbasi, Onaiza Maqbool, Abida Sadaf, Imran Razzak, Ali Daud, Naif Radi Aljohani, and Guandong Xu. 2019. What’s happening around the world? A survey and framework on event detection techniques on twitter. J. Grid Comput., 17(2):279–312.
Shen et al. (2021) Jiaming Shen, Yunyi Zhang, Heng Ji, and Jiawei Han. 2021. Corpus-based open-domain event type induction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 5427–5440. Association for Computational Linguistics.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.
Wang et al. (2019) Xiaozhi Wang, Xu Han, Zhiyuan Liu, Maosong Sun, and Peng Li. 2019. Adversarial training for weakly supervised event detection. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 998–1008. Association for Computational Linguistics.
Wang et al. (2021) Ziqi Wang, Xiaozhi Wang, Xu Han, Yankai Lin, Lei Hou, Zhiyuan Liu, Peng Li, Juanzi Li, and Jie Zhou. 2021. CLEVE: contrastive pre-training for event extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 6283–6297. Association for Computational Linguistics.
Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface’s transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.
Zeng et al. (2018) Ying Zeng, Yansong Feng, Rong Ma, Zheng Wang, Rui Yan, Chongde Shi, and Dongyan Zhao. 2018. Scale up event extraction learning via automatic training data generation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 6045–6052. AAAI Press.
Zhang et al. (2020) Zhisong Zhang, Xiang Kong, Zhengzhong Liu, Xuezhe Ma, and Eduard H. Hovy. 2020. A two-step approach for implicit event argument detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7479–7485. Association for Computational Linguistics.
Zheng et al. (2019) Shun Zheng, Wei Cao, Wei Xu, and Jiang Bian. 2019. Doc2edag: An end-to-end document-level framework for chinese financial event extraction. In EMNLP-IJCNLP, pages 337–346. Association for Computational Linguistics.

Appendix A Appendix

We provide more analysis to understand our proposal.

Conversion	P	R	F1
Original	75.2	67.6	71.7
New	77.3	68.2	72.5

Table 4: Results of different conversion ways of event tokens.

A.1 Effect of Event Tokens Conversion

There are two intuitive ways to treat the event tokens in our proposed framework. One is to treat them as existing words in the BERT dictionary, so that we can initialize the event representations with BERT’s pre-trained word embeddings. The other is to treat them as new words, so that the event representations are learned from scratch. Accordingly, we can either feed the original event words directly into the DRC framework or enclose each event word in square brackets to convert it into a new word in the BERT dictionary, e.g., “Transport” to “[Transport]”.

Table 4 reports the comparison and shows that converting event types into new words improves all three metrics over treating them as original words in the BERT dictionary (e.g., F1 rises from 71.7 to 72.5). We conjecture that the gap arises from WordPiece Wu et al. (2016) in the BERT implementation: when an event word is relatively long, BERT splits it into several pieces, which makes it difficult to precisely absorb the semantic relation between the words in the input text and the event types. In contrast, when we treat an event word as a new word, BERT handles it as a whole. Although BERT then learns the event representations from scratch, this still helps establish the semantic relationship between words and event types.
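The splitting effect conjectured above can be illustrated with a toy greedy longest-match-first tokenizer in the spirit of WordPiece. This is a simplified sketch, not BERT's actual tokenizer: the helper `wordpiece` and the vocabulary `toy_vocab` are hypothetical, and a real pipeline additionally normalizes the text and splits on punctuation before applying WordPiece.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word (simplified WordPiece)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the "##" prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real BERT vocabulary has about 30k entries.
toy_vocab = {"transfer", "##-", "##owner", "##ship", "transport"}

# "Original" conversion: the event word is split into several pieces.
as_old_word = wordpiece("transfer-ownership", toy_vocab)

# "New" conversion: the bracketed event word is added to the vocabulary
# and is therefore kept as a single token.
as_new_word = wordpiece("[transfer-ownership]", toy_vocab | {"[transfer-ownership]"})

print(as_old_word)
print(as_new_word)
```

In the first case the event type is spread over four pieces, each with its own embedding; in the second it is a single token whose embedding must be learned from scratch.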

A.2 Inputs for the Multi-label Classifier

There are two kinds of inputs for the multi-label classifier: the representation of the [CLS] token, or the event representations. We feed each of these inputs into the same MLP to predict the probability that an input sentence $x$ belongs to each event type.

Input	P	R	F1
[CLS]	77.3	68.2	72.5
All event tokens	76.9	72.3	74.7

Table 5: Results of different inputs for the multi-label classifier.

Table 5 reports the performance of the two inputs and shows that feeding the event representations to the classifier improves our ED_RC on Recall (68.2 to 72.3) and F1 (72.5 to 74.7) while keeping Precision competitive (76.9 vs. 77.3), compared with using only the representation of the [CLS] token. We conjecture that the event representations inject more information into the multi-label classifier than the [CLS] representation alone.

A.3 Limitation of EDM

We conduct an evaluation on a more balanced dataset to investigate the limitation of EDM. We first select seven relatively balanced event types from the subtypes of the ACE2005 corpus, yielding an imbalance ratio of around 1.8; see the data distribution in Fig. 5. In the experiment, we set $q$ to 0.2 and $r$ to 6, which gives good performance for ED_DRC.
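For reference, an imbalance ratio of this kind can be computed as the count of the most frequent event type divided by the count of the least frequent one. The sketch below assumes this common definition; the function name and the counts are hypothetical and are not taken from ACE2005.

```python
def imbalance_ratio(counts):
    """Ratio of the largest to the smallest class count (assumed definition)."""
    values = list(counts.values())
    if not values or min(values) <= 0:
        raise ValueError("every class needs at least one instance")
    return max(values) / min(values)


# Hypothetical instance counts for three event types (illustration only).
toy_counts = {"type_a": 90, "type_b": 70, "type_c": 50}
ratio = imbalance_ratio(toy_counts)
print(ratio)
```

A ratio close to 1 indicates a nearly balanced label distribution, whereas a large ratio indicates that a few event types dominate.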

Table 6 reports the comparison results of ED_RC and ED_DRC and shows that ED_RC attains satisfactory results and beats ED_DRC in all three metrics. The results imply that the derangement procedure plays an important role when the dataset is more imbalanced. When the dataset is relatively balanced, we can turn to ED_RC and attain good performance due to the power of self-attention in BERT.

Model	P	R	F1
ED_RC	76.4	77.8	77.1
ED_DRC	75.0	76.3	75.6

Table 6: The performance of our ED_RC and ED_DRC on a more balanced dataset.

A.4 Error Analysis

We conduct an error analysis on the test set in this section. There are three main kinds of errors:

– The main error comes from event mis-classification, accounting for $52.9\%$ of the total errors. This category also includes cases where ED_DRC detects more event types than the ground truth. The event type that ED_DRC over-predicts most often is Attack. A typical example is given below:

S: The officials, who spoke on … 26 words omitted … on the U.S.-backed war resolution.

ED_DRC assigns this sentence to the event Attack, whereas the ground truth is Meet. This error is understandable because the word “war” is a common trigger of Attack, which leads ED_DRC to mis-classify the sentence. In this dataset, Attack is the dominant event type, so texts of other events are likely to be classified as Attack when they contain words similar to the triggers of Attack.

– The second type of error is that ED_DRC outputs fewer event types than the ground truth, which accounts for $28.9\%$ of the errors. The event types most frequently missed are Transfer-Money and Transfer-Ownership. One typical example is

S: Until Basra, U.S. and British troops … 6 words omitted … they seized nearby Umm Qasr … 3 words omitted … secure key oil fields.

ED_DRC fails to identify the event Transfer-Ownership, which is indicated by the trigger “secure”, while recognizing the event Attack, implied by the trigger “seized”. On the one hand, the imbalance ratio of Attack to Transfer-Ownership is 14.2, so ED_DRC has far fewer training instances from which to learn the patterns of Transfer-Ownership than those of Attack. On the other hand, deeper semantic knowledge is needed to understand Transfer-Ownership, whose trigger words are more diverse and changeable; they include, for example, “sold”, “acquire”, and “bid”.

– The third type of error concerns sentences that contain no event. When there are no event types in a sentence, ED_DRC may fail to classify it as negative. This is because there are insufficient clues for ED_DRC to learn the patterns of the negative type. ED_DRC also turns out to give low predicted probabilities for all event types.
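The three error types above can be assigned automatically by comparing predicted and gold label sets. The following is a minimal sketch under assumed bucketing rules: the function `categorize` and its category names are hypothetical, and the exact procedure used to obtain the reported percentages is not specified here.

```python
from collections import Counter


def categorize(pred, gold):
    """Assign one prediction to an error bucket (assumed rules, for illustration)."""
    pred, gold = set(pred), set(gold)
    if pred == gold:
        return "correct"
    if not gold:
        return "event_on_negative"       # third type: events predicted for a no-event sentence
    if pred < gold:
        return "missing_events"          # second type: fewer event types than the ground truth
    return "misclassified_or_extra"      # first type: wrong or additional event types


# Label sets mirroring the two examples in the text, plus one no-event sentence.
examples = [
    ({"Attack"}, {"Meet"}),
    ({"Attack"}, {"Attack", "Transfer-Ownership"}),
    ({"Attack"}, set()),
]
tally = Counter(categorize(pred, gold) for pred, gold in examples)
print(tally)
```

Dividing each tally by the number of non-correct predictions gives per-category shares comparable in form to the percentages reported above.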

A.5 Effect of Hyperparameters

In this section, we show how to adjust the derangement probability $q$ and the size of the derangement set $r$ by conducting ablation studies where $q$ is selected from $\left\{0.1,\;0.2,\;0.4,\;0.5,\;0.7\right\}$ and $r$ is selected from $\{3,6,\ldots,33\}$, i.e., equally dividing all event types into 10 buckets. Larger values of $q$ are omitted because they usually cause the model to fail at detecting major events. Figure 6(a) shows our experimental results on the validation set. The best performance is attained when $q=0.2$ and $r=24$, which are then used during testing. The trends suggest that a small $q$ should be selected, as it usually yields better performance than a larger one. We suggest choosing $r$ between 15 and 25, which is approximately one half to two thirds of the total number of event tokens. Since $r$ determines the scale of the perturbation caused by derangement, a small $r$ may cause negligible perturbation, whereas a relatively large $r$ may also perturb the minor events.
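To make the roles of $q$ and $r$ concrete, the following is a minimal sketch of a position derangement, not the authors' implementation. It assumes that the derangement set consists of the first $r$ entries of the event token list and that the derangement is applied to an instance with probability $q$; the conditions under which EDM is triggered for a given training instance are not modeled. All function and token names are hypothetical.

```python
import random


def random_derangement(n, rng):
    """Sample a permutation of range(n) with no fixed point (rejection sampling)."""
    if n == 1:
        raise ValueError("no derangement exists for a single element")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm


def maybe_derange(event_tokens, q, r, rng):
    """With probability q, derange the first r event tokens; keep the rest in place."""
    tokens = list(event_tokens)
    if rng.random() >= q:
        return tokens                      # no perturbation for this instance
    r = min(r, len(tokens))
    perm = random_derangement(r, rng)
    return [tokens[p] for p in perm] + tokens[r:]


rng = random.Random(0)
tokens = [f"[E{i}]" for i in range(33)]    # placeholder event tokens
deranged = maybe_derange(tokens, q=1.0, r=24, rng=rng)
untouched = maybe_derange(tokens, q=0.0, r=24, rng=rng)
```

With `q=1.0` every one of the first 24 tokens leaves its original position while the remaining tokens stay fixed; with `q=0.0` the order is never changed. This matches the intuition that $q$ controls how often the perturbation happens and $r$ controls how many event positions it touches.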

Although these two parameters may be data-dependent, we observed similar trends for TAC-KBP-2015. We select $q$ from $\left\{0.1,\;0.2,\;0.3\right\}$ and $r$ from $\{3,6,\ldots,33\}$. Figure 6(b) shows the performance on TAC-KBP with respect to $r$ for different $q$. The best performance is attained when $q=0.2$ and $r=21$, reaching an F1 score of 77.8%. The trends remain largely the same as those on the ACE2005 dataset, and the best performance again occurs when $r$ is between 15 and 25.

A.6 Gradient Explanation on TAC-KBP-2015

We conduct gradient explanation on our DRC framework as in Sec. 5.3. We randomly choose instances from the test set and visualize the gradients with respect to the event types correctly predicted by our ED_DRC. As shown in Fig. 7, “nominated”, “leaked”, and “sentences” are the triggers of the three sentences, respectively, and receive significantly larger positive gradients than the other words. This shows that our DRC framework can automatically learn to spot triggers and relate them to event types in practice.