
TacoERE: Cluster-aware Compression for Event Relation Extraction

Abstract

Event relation extraction (ERE) is a critical and fundamental challenge for natural language processing. Existing work mainly focuses on directly modeling the entire document, which cannot effectively handle long-range dependencies and information redundancy. To address these issues, we propose a clusTer-aware compression method for improving Event Relation Extraction (TacoERE), which explores a compression-then-extraction paradigm. Specifically, we first introduce document clustering for modeling event dependencies. It splits the document into intra- and inter-clusters, where intra-clusters aim to enhance the relations within the same cluster, while inter-clusters attempt to model the related events at arbitrary distances. Secondly, we utilize cluster summarization to simplify and highlight important text content of clusters for mitigating information redundancy and event distance. We have conducted extensive experiments on both pre-trained language models, such as RoBERTa, and large language models, such as ChatGPT and GPT-4, on three ERE datasets, i.e., MAVEN-ERE, EventStoryLine and HiEve. Experimental results demonstrate that TacoERE is an effective method for ERE.

Keywords: Event Relation Extraction, Compression-then-Extraction, Large Language Model




Yong Guan1, Xiaozhi Wang1, Lei Hou1*, Juanzi Li1, Jeff Pan2,
Jiaoyan Chen3, Freddy Lecue4
1Department of Computer Science and Technology, Tsinghua University, Beijing, China
2School of Informatics, The University of Edinburgh, UK
3Department of Computer Science, The University of Manchester, UK
4INRIA, France
{gy2022, wangxz20}@mails.tsinghua.edu.cn,   {houlei, lijuanzi}@tsinghua.edu.cn
*Corresponding author


1.   Introduction

Event Relation Extraction (ERE) aims to predict relations, such as causal and subevent relations, between event mentions or trigger words in a document Fan et al. (2022). As shown in Figure 1, given a document with event mentions/trigger words, an ERE model is expected to predict relations among the three mentioned events, such as cyclone (e1) →(subevent) originated (e2) →(precondition) reached (e4). ERE can not only facilitate deep understanding of text Wang et al. (2020), but also benefit various downstream tasks, such as question answering Khashabi et al. (2018) and information retrieval Pang et al. (2020).

With the widespread adoption of deep neural networks in natural language processing (NLP), event relation extraction systems have undergone a paradigm shift to supervised neural models that encode the document as a clue for predicting relations Cao et al. (2021); Xu et al. (2021); Chen et al. (2022). However, there are still two challenges: long-range dependencies and information redundancy. Specifically, long-range dependencies means that events may be scattered across multiple sentences potentially far away from each other. In such contexts, existing ERE methods have difficulty capturing the dependencies among events. Consider the example in Figure 1: events cyclone (e1) and destroyed (e7), located in sentences S1 and S11 respectively, are related with a cause relation. Information redundancy refers to the existence of information irrelevant to relation prediction. For example, identifying the relation between events originated (e2) and reaching (e3) only depends on sentences S2 and S3, while sentences S9, S10, and S11 are irrelevant for identifying this relation.

Refer to caption
Figure 1: An example from MAVEN-ERE. Words in bold italics are trigger words of events. [Si] denotes the i-th sentence index, and precon. is the abbreviation of precondition. The solid and dashed arrows indicate relations within and among clusters, respectively. Sentences in different colors belong to different clusters.

To tackle these challenges, the major current approaches select sentences Wang et al. (2020); Man et al. (2022) or remove sentences Xu et al. (2022). However, these methods neither completely eliminate long-range dependencies nor remove irrelevant information at finer granularities (e.g., clauses/phrases). As shown in Figure 1, for long-range dependencies, deleting sentences S6, S7, and S8 still leaves a considerable distance between sentences S1 and S11, so the dependency on sentence distance remains. For information redundancy, the sub-sentence of sentence S2, "tenth hurricane, and fifth major hurricane of the season", is still useless, from a human understanding, for predicting the relation between originated (e2) and reaching (e3). This motivates our hypothesis that compression via summarization might be a better strategy than sentence filtering.

In this paper, we propose TacoERE, a clusTer-aware compression method for improving Event Relation Extraction, which explores a compression-then-extraction paradigm to extract event relations. Specifically, TacoERE first uses document clustering to split the document into intra- and inter-clusters [1], where intra-clusters aim to enhance the relations within the same cluster, while inter-clusters attempt to model related events at arbitrary distances. For instance, events cyclone (e1) in sentence S1 and destroyed (e7) in sentence S11 belong to different clusters, and combining these two clusters can effectively reduce the event distance and facilitate modelling their relation. In this way, dependencies between related events at any distance in the document can be modelled. Cluster summarization is then adopted to generate summaries for every cluster, which further simplifies and highlights important content. At last, the generated summaries of intra- and inter-clusters are utilized to predict relations.

[1] A document generally uses multiple sub-topics to organise its content Hearst and Plaunt (1993). As shown in Figure 1, the three sub-topics, event background, development process, and resulting impact, collectively constitute the entire article. For simplicity, we use the term "cluster" to denote the sentences within the same sub-topic.

For evaluation, we validate our ideas on both small-scale pre-trained language models (PLMs), such as RoBERTa, and large language models (LLMs), such as ChatGPT (https://chat.openai.com/chat) and GPT-4 OpenAI (2023), and conduct extensive experiments on three ERE datasets, namely, MAVEN-ERE (Wang et al., 2022), EventStoryLine (Caselli and Vossen, 2016) and HiEve (Glavas et al., 2014). Experimental results demonstrate that our approach can effectively improve the performance of event relation extraction models. Our contributions are summarized as follows:

  • We propose a novel cluster-aware compression method for event relation extraction, namely, TacoERE, which explores a compression-then-extraction paradigm to extract relations.

  • We utilize document clustering to split the document into intra- and inter-clusters to allow the modeling of dependencies without any reliance on event distance. We propose cluster summarization to simplify and spotlight important text content to mitigate the impact of information redundancy and event distance.

  • Extensive experiments have been conducted on both PLMs, such as RoBERTa, and LLMs, such as ChatGPT and GPT-4, on three ERE datasets, i.e., MAVEN-ERE, EventStoryLine and HiEve. Our TacoERE outperforms existing methods, especially on LLMs, with improvements by 11.2% and 9.1% on ChatGPT and GPT-4 respectively.

Refer to caption
Figure 2: Model structure of TacoERE.

2.   TacoERE

We consider a document D = {s_1, s_2, ..., s_n} with n sentences, annotated with event mentions/trigger words E = {e_1, e_2, ..., e_m}. The task of event relation extraction is, given an annotated document D and a pre-defined relation set \mathcal{R}, to predict the relations between all event pairs {(e_i, e_j)} in D.

In this paper, we propose TacoERE, a novel cluster-aware compression method for improving event relation extraction, exploring a compression-then-extraction paradigm to extract relations. Figure 2 shows the overview of our framework: we first present Document Compression, including Document Clustering (Section 2.1) and Cluster Summarization (Section 2.2). Document Clustering splits the document into intra- and inter-clusters to allow the modeling of dependencies without considering the event distance among sentences. Cluster Summarization encourages the model to simplify and spotlight important content of clusters for mitigating information redundancy and event distance. We then describe Relation Prediction in Section 2.3, which utilizes the content from Cluster Summarization to predict relations. Last, in Section 2.4, we employ a REINFORCE algorithm that jointly optimizes Cluster Summarization and Relation Prediction. Before joint training, we introduce a pretraining module for Cluster Summarization with event chains to teach the model to learn better content representations.

2.1.   Document Clustering

Document clustering aims to split the document into intra- and inter-clusters, where intra-clusters aim to enhance the relations within the same cluster, while inter-clusters attempt to model related events at arbitrary distances. Our observation is that predicting the relations of event pairs relies on a limited amount of content rather than the entire document. For example, in Figure 1, the relation between events originated (e2) and reaching (e3) can be deduced from sentences S2 and S3 alone. Moreover, sentences of a document within the same cluster are more likely to involve event relations; for instance, the development process cluster in blue (sentences S2-S4) in Figure 1 contains three relations. As a result, we split D into K mutually independent intra-clusters at the sentence level. To obtain the intra-clusters, arbitrary clustering methods, such as traditional machine learning or deep neural networks, can be applied. In our experiments, we directly utilize the effective and widely used K-means algorithm Guan et al. (2022); Rakib et al. (2020). Specifically, we extract multiple features to enhance the clustering, including cluster words extracted by an LDA model, trigger words, tf-idf, and sentence representations encoded by RoBERTa.

J = \sum^{K}_{i=1}\sum^{n}_{j=1} p_{ij}\|v_{j}-u_{i}\|^{2}   (1)

where J is the clustering objective, v_j is the feature vector of sentence s_j, u_i is the center of the i-th cluster, and p_{ij} indicates whether sentence s_j is assigned to the i-th cluster.
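To make this clustering step concrete, the following is a minimal sketch of how the intra-clusters can be obtained with scikit-learn and HuggingFace Transformers. The feature mix here (tf-idf plus RoBERTa sentence embeddings) is a simplification of the feature set listed above, and the function names are illustrative rather than the authors' released code.

import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModel, AutoTokenizer

def sentence_features(sentences):
    # Sparse lexical features
    tfidf = TfidfVectorizer().fit_transform(sentences).toarray()
    # RoBERTa sentence representations (embedding of the first token)
    tok = AutoTokenizer.from_pretrained("roberta-base")
    enc = AutoModel.from_pretrained("roberta-base")
    with torch.no_grad():
        batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
        emb = enc(**batch).last_hidden_state[:, 0].numpy()
    return np.concatenate([tfidf, emb], axis=1)

def intra_clusters(sentences, K=3):
    # Split a document into K intra-clusters at the sentence level (Eq. 1)
    v = sentence_features(sentences)
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(v)
    # Sentence indices inside each cluster keep their original document order
    return [[j for j, l in enumerate(labels) if l == k] for k in range(K)]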

Beyond individual intra-clusters, event relations also occur among different intra-clusters, as can be seen in Figure 1: the relation between event cyclone (e1) in the background cluster and destroyed (e7) in the impact cluster. Thus, we fuse any two intra-clusters into an inter-cluster. The order of sentences in intra- and inter-clusters follows their original order in D. Together, the intra- and inter-clusters cover all possible relations in D.
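Building on the intra_clusters sketch above, the inter-clusters can be formed by fusing every pair of intra-clusters while preserving the original sentence order:

from itertools import combinations

def inter_clusters(intra):
    # intra: list of sentence-index lists returned by intra_clusters()
    return [sorted(a + b) for a, b in combinations(intra, 2)]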

2.2.   Cluster Summarization

Cluster summarization aims to simplify and highlight the important text content of clusters for mitigating information redundancy and event distance. Intra- and inter-clusters directly group the sentences of related sub-topics, so they inevitably contain redundant information and ignore coherence, which may hinder performance Gao et al. (2021). Text summarization is a technique that can effectively simplify and spotlight important text content, and thus addresses this problem to some extent. To this end, we further utilize a summarization model to generate summaries C^a and C^r for intra- and inter-clusters respectively, where {C^a, C^r} ∈ C. In this paper, we utilize a transformer-based Vaswani et al. (2017) encoder-decoder framework for summarization. In particular, a pretrained language model is used as the encoder to learn the contextual representation of the input, and a transformer-based decoder is utilized to generate its summary word by word.
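As an illustration of this architecture, the sketch below pairs a pretrained encoder with a small Transformer decoder (2 layers and 8 heads, matching the training details in Section 3.1). It is a rough approximation of the summarization module under these assumptions, not the authors' implementation.

import torch
import torch.nn as nn
from transformers import AutoModel

class ClusterSummarizer(nn.Module):
    def __init__(self, plm="bert-base-uncased", layers=2, heads=8):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm)
        d = self.encoder.config.hidden_size
        vocab = self.encoder.config.vocab_size
        self.embed = nn.Embedding(vocab, d)
        dec_layer = nn.TransformerDecoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=layers)
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, src_ids, src_mask, tgt_ids):
        # Encode the cluster text, then decode the summary autoregressively
        memory = self.encoder(input_ids=src_ids, attention_mask=src_mask).last_hidden_state
        tgt = self.embed(tgt_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)  # per-token vocabulary logits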

2.3.   Relation Prediction

Relation prediction constructs the event relation prediction process from the text content produced by cluster summarization. We first obtain the contextual representation of each token in the document D, C^a, and C^r respectively. Taking D as an example, we leverage the pre-trained language model RoBERTa Liu et al. (2019) as the encoder. Since an event mention/trigger may contain multiple words (e.g., take place), and an individual word in the document may be split into sub-words by wordpiece (e.g., "summarization" is split into the three sub-words "sum", "mar", and "ization"), we adopt the LogSumExp pooling method Zhou et al. (2021) over all its mention (sub-word) embeddings in the last encoding layer as the event representation. Then, we obtain the overall representation of event pairs in D. For the cluster summaries C^a and C^r, we extract the events that appear in the original document D.
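A small sketch of the LogSumExp pooling over an event's word-piece embeddings (the positions below are hypothetical):

import torch

def event_representation(hidden_states, piece_idx):
    # hidden_states: [seq_len, dim] last-layer states; piece_idx: mention piece positions
    pieces = hidden_states[piece_idx]        # [num_pieces, dim]
    return torch.logsumexp(pieces, dim=0)    # [dim], a smooth max over the pieces

# e.g., the event "take place" tokenized into pieces at positions 4 and 5:
# e = event_representation(last_hidden_state[0], torch.tensor([4, 5]))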

Event pairs in C^a and C^r all appear in D, which means one event pair may occur multiple times. Different from existing work Adhikari et al. (2019a); Beltagy et al. (2020), which aggregates all identical event representations to get the final event pair representations, we select the most relevant representation for each event pair. The selection proceeds as follows: (1) for the same event pair, the representation in C^r has higher priority than that in C^a; (2) for event pairs appearing in neither C^a nor C^r, we directly use the representation in D. Finally, a two-layer feed-forward network with softmax is adopted to learn the class probabilities from the event pair representations. For the training objective of relation prediction, \mathcal{L}_{rp}, we use the cross-entropy function as follows:

\mathcal{L}_{rp} = -\sum_{i\neq j}\sum_{\mathcal{R}}\{r_{ij}\log P_{ij}+(1-r_{ij})\log(1-P_{ij})\}   (2)

where P_{ij} is the predicted probability of the relation between events e_i and e_j, and r_{ij} ∈ \mathcal{R} is the gold relation label.
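Putting the prediction step together, here is a hedged sketch of the pair-representation selection and the two-layer feed-forward classifier; the dictionary-based lookup and helper names are illustrative, and the objective of Eq. (2) is realized here as a multi-label binary cross-entropy.

import torch
import torch.nn as nn

class RelationHead(nn.Module):
    def __init__(self, dim, num_relations):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, num_relations),
        )

    def forward(self, head_event, tail_event):
        pair = torch.cat([head_event, tail_event], dim=-1)
        return self.ffn(pair)  # relation logits for this event pair

def select_pair(pair_key, rep_inter, rep_intra, rep_doc):
    # Priority: inter-cluster summary (C^r) > intra-cluster summary (C^a) > document (D)
    if pair_key in rep_inter:
        return rep_inter[pair_key]
    if pair_key in rep_intra:
        return rep_intra[pair_key]
    return rep_doc[pair_key]

loss_fn = nn.BCEWithLogitsLoss()  # applied to the logits and 0/1 relation labels (Eq. 2)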

2.4.   Training Phase

The training phase comprises both Joint Training and Pretraining for Cluster Summarization: Joint Training jointly optimizes cluster summarization and relation prediction, while Pretraining for Cluster Summarization teaches the model to learn better content representations.

Joint Training  As the relation prediction performance cannot be directly back-propagated to the cluster summarization process, we employ a REINFORCE algorithm Williams (1992) that treats the event relation prediction performance as a reward for the cluster summaries to train the cluster summarization process. Besides, we also consider event-chain information from the clusters and their summaries to enrich the overall reward function \mathcal{R}(C).

Performance-based Reward \mathcal{R}^{per}(C) uses the relation prediction performance as the direct training signal. We calculate \mathcal{R}^{per}(C) based on the performance over all document event pairs. In particular, for each event pair (e_i, e_j), \mathcal{R}^{per}(C) = 1 if the relation prediction model predicts the true relation between e_i and e_j, and 0 otherwise.

Summary-based Reward \mathcal{R}^{ec}(C) encourages the correlation between the clusters and their summaries to train cluster summarization. The motivation of \mathcal{R}^{ec}(C) is that a summary expresses the important information of the text, and the two are consistent at the semantic level Guan et al. (2021b). In Section 2.1, we split the document into intra- and inter-clusters. Intra-clusters are independent of each other and cover the content of the entire document. Thus, we only adopt the intra-clusters and their summaries to calculate the reward \mathcal{R}^{ec}(C). Similar to existing work Pasunuru and Bansal (2018); Paulus et al. (2018), we utilize the popular automatic evaluation metric for summarization, ROUGE Lin (2004), as the reward function. However, this metric mainly focuses on phrase matching/n-gram overlap while assuming equal contributions from each word. To address this issue, we introduce a salience function that gives higher weight to trigger words.

P = \frac{\sum_{k}(\mathit{LCS}_{\cup}(C^{a}_{k},T^{a}_{k})+\sum_{l}\eta(w_{kl}))}{|D|}   (3)
R = \frac{\sum_{k}(\mathit{LCS}_{\cup}(C^{a}_{k},T^{a}_{k})+\sum_{l}\eta(w_{kl}))}{\sum_{k}|C^{a}_{k}|}   (4)
\mathcal{R}^{ec}(C) = \frac{(1+\sigma^{2})RP}{R+\sigma^{2}P}   (5)

where C^{a}_{k} is the summary of the k-th intra-cluster T^{a}_{k}, \mathit{LCS}_{\cup}(\cdot) is the union longest common subsequence, \sigma is defined as in Lin (2004), and \eta(\cdot) measures whether w_{kl} is a trigger word.
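A simplified sketch of how the reward in Eqs. (3)-(5) can be computed: it approximates the union LCS with a plain LCS over tokens and treats the trigger-word weight and sigma as illustrative values.

def lcs_len(a, b):
    # Standard dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def summary_reward(summaries, clusters, doc_len, triggers, weight=1.0, sigma=1.0):
    # summaries/clusters: lists of token lists; triggers: set of trigger words
    num = 0.0
    for c_sum, cluster in zip(summaries, clusters):
        num += lcs_len(c_sum, cluster)
        num += weight * sum(1 for w in c_sum if w in triggers)  # eta(.) bonus
    p = num / doc_len                          # Eq. (3)
    r = num / sum(len(s) for s in summaries)   # Eq. (4)
    return (1 + sigma ** 2) * r * p / (r + sigma ** 2 * p + 1e-8)  # Eq. (5)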

After obtaining the rewards \mathcal{R}^{per}(C) and \mathcal{R}^{ec}(C), the overall reward is calculated as \mathcal{R}(C) = \alpha\mathcal{R}^{per}(C) + \beta\mathcal{R}^{ec}(C), where \alpha and \beta are trade-off parameters. Following existing work Man et al. (2022), we minimize the negative expected reward \mathcal{R}(C) over the possible choices of summaries:

\mathcal{L}_{sum} = -\mathbb{E}_{C^{\prime}\sim P(C^{\prime}|e_{i},e_{j},D)}[\mathcal{R}(C^{\prime})]   (6)

The gradient can then be approximated with a single roll-out sample:

\nabla\mathcal{L}_{sum} = -(\mathcal{R}(C)-\theta)\nabla\log P(C|e_{i},e_{j},D)   (7)

where \theta is a baseline used to reduce variance.
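A minimal sketch of the resulting policy-gradient loss, with alpha and beta set as in Section 3.1; log_prob is assumed to be the summed token log-probabilities of the sampled summaries under the summarization model, and baseline plays the role of theta.

def reinforce_loss(log_prob, perf_reward, sum_reward, baseline, alpha=1.0, beta=0.1):
    reward = alpha * perf_reward + beta * sum_reward   # R(C)
    return -(reward - baseline) * log_prob             # Eq. (7)

# Joint training step (sketch): loss = loss_rp + reinforce_loss(...); loss.backward()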

Pretraining for Cluster Summarization  This module teaches the model to learn better content representations for the cluster summarization process with event chains before the joint training phase. As presented in Section 2.2, we utilize a summarization model to compress the different clusters for predicting event relations. However, abstractive summarization methods often generate new words, so ensuring that the generated summaries contain the events of the original document is a key problem. Inspired by Narayan et al. (2021, 2022), we utilize event chains, which order the events in the summary, as an intermediate summary representation to better guide summary generation. In particular, we concatenate the event chain with the corresponding summary as a unified sequence, such as "[EVENTCHAIN] originated | organized | reaching | reached | … [SUMMARY] paul originated from a trough of low pressure …". During decoding, the model must generate the event chain followed by the summary. The model structure is similar to the summarization model in Section 2.2, and we select existing abstractive summarization datasets, such as CNN/DailyMail, for the pretraining.
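For illustration, the target sequence of this pretraining can be assembled as follows (the markers follow the example above):

def build_target(event_chain, summary):
    # Prepend the ordered event chain to the reference summary as one sequence
    return "[EVENTCHAIN] " + " | ".join(event_chain) + " [SUMMARY] " + summary

# build_target(["originated", "organized", "reaching"], "paul originated from a trough of low pressure ...")
# -> "[EVENTCHAIN] originated | organized | reaching [SUMMARY] paul originated from ..."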

3.   Experiments

In this section, we first introduce experiment setup (Section 3.1), and then report the results and analysis (Section 3.2 - 3.5). Furthermore, we also conduct experiments on LLMs in Section 3.6.

3.1.   Experiment Setup

Datasets  We conduct experiments on three datasets: MAVEN-ERE (Wang et al., 2022), EventStoryLine (Caselli and Vossen, 2016) and HiEve (Glavas et al., 2014). MAVEN-ERE is a unified large-scale human-annotated ERE dataset, which contains 4,480 documents, 112,276 events, 57,992 causal relations, and 15,841 subevent relations in total. We split the data into train, dev, and test sets of 2,913, 710, and 857 documents as in Wang et al. (2022). EventStoryLine (version 0.9) contains 258 documents, 22 topics, 5,334 events, and 5,655 causal relations. Following existing works Gao et al. (2019); Tran Phu and Nguyen (2021), we use the last two topics as the development set, and the other 20 topics are evaluated with 5-fold cross-validation. HiEve contains 100 documents, 3,185 events, and 3,648 subevent relations. Similar to Zhou et al. (2020); Wang et al. (2020), we split the 100 documents into 60 for training, 20 for validation, and 20 for testing.

Evaluation Details  We adopt similar settings to Wang et al. (2022), which are closer to real situations, yet more challenging, from three aspects: (1) we consider the relation directions for relation prediction. In addition, causal/subevent relations may be defined in several sub-relations. For example, the causal relation in MAVEN-ERE has two sub-relations “CAUSE” and “PRECONDITION”; (2) we report the overall score rather than the individual sub-relation score; (3) we do not down-sample the negative instances.

Following the existing works Gao et al. (2019); Zhou et al. (2020); Tran Phu and Nguyen (2021), we use standard Precision (P), Recall (R), and F1-score (F1) as the metrics.

Methods              P     R     F1
MAVEN-ERE
BERT                 31.6  28.2  29.9
RoBERTa              33.8  29.5  31.5
Hierarchical         31.8  29.2  30.6
SIEF                 33.6  30.8  32.3
TacoERE (PLMs)       34.8  32.4  34.1
EventStoryLine
BERT                 30.3   9.4  12.8
RoBERTa              31.1  10.7  14.4
Hierarchical         30.1  10.2  13.1
SIEF                 32.4  11.3  14.8
SCS-EERE             32.7  10.9  15.1
TacoERE (PLMs)       32.9  12.3  16.4
Table 1: Model performance of causal relation on MAVEN-ERE and EventStoryLine.

Training Details  For cluster summarization, we utilize BERT as the document encoder, and a transformer-based decoder with 2 layers and 8 attention heads. For document clustering, the number of intra-clusters K is a hyperparameter, and we set K = 3, which gives the best performance in our experiments. For relation prediction, we adopt RoBERTa as the document encoder. The implementations of BERT and RoBERTa are based on the PyTorch version of the HuggingFace Transformers library (https://github.com/huggingface). We adopt the Adam optimizer with a learning rate of 5e-4. The trade-off parameters \alpha and \beta are set to 1.0 and 0.1 respectively. Each training and testing process runs on two NVIDIA GeForce RTX 3090 GPUs.
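For reference, the reported hyperparameters can be collected into a simple configuration sketch (the field names are illustrative, not the authors' released configuration):

config = {
    "num_intra_clusters": 3,   # K
    "decoder_layers": 2,
    "decoder_heads": 8,
    "optimizer": "adam",
    "learning_rate": 5e-4,
    "alpha": 1.0,              # weight of the performance-based reward
    "beta": 0.1,               # weight of the summary-based reward
}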

For LLMs implementation, we use the model API provided by OpenAI, where “gpt-4”, “gpt-3.5-turbo”, and “text-davinci-003” refer to models GPT-4, ChatGPT, and Text-Davinci-003 respectively.
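As an illustration of how these models can be queried, the sketch below uses the pre-1.0 openai Python package; the prompt handling is ours, not the paper's exact setup.

import openai

openai.api_key = "YOUR_API_KEY"

def ask_chat_model(prompt, model="gpt-4"):   # or "gpt-3.5-turbo" for ChatGPT
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]

def ask_davinci(prompt):
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=256, temperature=0)
    return resp["choices"][0]["text"]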

Baselines  The following models, including small-scale PLMs and LLMs, have been compared in our experiments.

For small-scale PLMs, we first select BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019). In addition, we also experiment with three strong and relevant models: (1) Hierarchical Adhikari et al. (2019b), which uses a pretrained model to encode different chunks of the document and an additional BiLSTM to aggregate the representations; (2) SIEF Xu et al. (2022), which randomly removes useless sentences for prediction; (3) SCS-EERE Man et al. (2022), which selects a sentence set for each event pair for prediction.

The evaluated LLMs include: (1) GPT-4, an advanced and improved iteration of the GPT series, demonstrating human-level performance and significant enhancements in various aspects; (2) ChatGPT, an advanced conversational AI model that provides contextually relevant and coherent responses aligned with human expectations; (3) Text-Davinci-003, a variant of the GPT-3.5 series, offering improved performance over GPT-3 through further instruction tuning.

Methods              P     R     F1
MAVEN-ERE
BERT                 27.5  24.7  26.8
RoBERTa              29.8  25.6  27.5
Hierarchical         28.4  25.4  27.1
SIEF                 30.2  26.4  28.7
TacoERE (PLMs)       31.8  28.9  30.6
HiEve
BERT                 19.8  15.2  16.3
RoBERTa              20.2  16.1  17.8
Hierarchical         21.4  17.3  16.7
SIEF                 21.8  17.4  18.6
SCS-EERE             20.6  19.7  19.2
TacoERE (PLMs)       22.6  19.5  20.8
Table 2: Model performance of subevent relation on MAVEN-ERE and HiEve.

3.2.   Overall Results

We report results for the causal relation on MAVEN-ERE and EventStoryLine, and for the subevent relation on MAVEN-ERE and HiEve.

Performance of causal relation  The results are shown in Table 1, where TacoERE (PLMs) denotes the implementation of our compression-then-extraction method with PLMs. Experimental results demonstrate that TacoERE (PLMs) outperforms all the baselines on both MAVEN-ERE and EventStoryLine. We also have the following four observations: (1) compared with the pretrained model BERT, TacoERE (PLMs) achieves improvements of 4.2% in F1-score on MAVEN-ERE and 3.6% on EventStoryLine; (2) compared with Hierarchical, which models different chunks of the document, TacoERE (PLMs) achieves improved performance, validating the effectiveness of cluster summarization, which further mitigates information redundancy and event distance; (3) compared with SIEF, TacoERE (PLMs) achieves better performance, even though SIEF is designed to model the important content by removing sentences from the original document. This indicates that our compression-then-extraction method can effectively alleviate the problem of long-range dependencies; (4) SCS-EERE performs better than Hierarchical and SIEF. It adopts the straightforward idea of selecting a set of sentences for each event pair. However, this can cost too much computation time when applied to large-scale datasets, whereas we directly predict all event relations in a document at one time.

Performance of subevent relation  Table 2 presents the detailed results on MAVEN-ERE and HiEve, where our method again outperforms all the baselines. It improves upon the pretrained model BERT by 3.8% and 4.5% in F1-score on the two datasets, respectively. In all, the results in Tables 1 and 2 clearly demonstrate the benefits of compression-then-extraction for event relation extraction.

Refer to caption
Figure 3: Model performance on different distance between related events (measured in #words).

3.3.   Impact on Event Distance

To better understand the contributions of our method on long-range dependencies, we show the performance under different word distances between related events in Figure 3.

Compared with existing methods, our method achieves a clear improvement in dealing with long-range dependencies. We can also find: (1) as the word distance increases, the improvement of our method over the baselines shows an overall monotonic upward trend; (2) the overall F1-score on event pairs with long distances is much lower than that on short ones, indicating the challenge of long-range dependency; (3) in particular, compared with the two baselines that directly model the entire document (RoBERTa) or compress the document by removing irrelevant sentences (SIEF), our method obtains a larger improvement, especially when the word distance between related events is greater than or equal to 4, which further indicates that TacoERE (PLMs) better handles long-range dependencies.

Methods              EP    P     R     F1
OneSum               27.3  32.3  30.4  31.8
AvgSum               50.4  33.7  31.4  32.5
TacoERE (PLMs)       78.1  34.8  32.4  34.1
Table 3: Model performance on different document compression strategies. EP is the ratio of events with relations that are preserved in the summaries to those in the document.

3.4.   Document Compression Evaluation

In this section, we try different strategies to obtain the clusters and evaluate the performance to verify the effectiveness of our document compression. We consider the following two strategies: (1) "OneSum", which generates only one summary for each document; (2) "AvgSum", which splits the document evenly into sentence-based chunks and then generates a summary for each chunk.

The results are shown in Table 3, and we have the following three observations: (1) compared with the other two methods, i.e., OneSum and AvgSum, our proposed TacoERE (PLMs) achieves the best performance; (2) AvgSum obtains the second-best results. However, its chunks are independent of each other, so it ignores the relation information among chunks, which may prevent it from achieving better performance; (3) for the metric EP, OneSum only preserves 27.3% of the document's events, because a summary mainly focuses on the important content of the document and is relatively short compared to the input document. As a whole, model performance gradually improves as more events are preserved.

3.5.   Ablation Study

In addition to the document compression strategy, we also conduct experiments ablating intra-clusters (w/o intra-clusters), inter-clusters (w/o inter-clusters) and cluster summarization (w/o summarization) to understand their contributions. The results are shown in Table 4. We can observe that: (1) comparing the results of w/o (without) intra-clusters and w/o inter-clusters with the full model, the full model works better. The use of intra- and inter-clusters helps the model to better understand the event relations both within and among sentences; (2) the performance of TacoERE drops under all three variations, which proves that each component contributes to the overall performance; (3) removing intra-clusters causes a sharper performance drop than removing inter-clusters.

Methods              MAVEN-ERE
                     P     R     F1
TacoERE (PLMs)       34.8  32.4  34.1
w/o intra-clusters   32.4  30.9  32.3
w/o inter-clusters   32.7  31.2  32.8
w/o summarization    31.8  31.3  31.9
Table 4: Ablation study.
Methods              Text-Davinci-003      ChatGPT               GPT-4
                     P     R     F1        P     R     F1        P     R     F1
Document             13.8   6.2   8.5      21.7  32.2  25.9      27.1  41.5  32.8
Sentence Pair        21.9   7.1  10.7      24.3  31.2  27.3      33.4  38.6  35.7
Document Clustering  17.3   8.1  10.9      24.6  32.9  28.2      31.9  47.1  38.1
TacoERE (LLMs)       30.2   8.9  13.8      31.3  45.6  37.1      38.9  45.5  41.9
Table 5: Model performance of causal relation on different LLMs. Experiments are under the 2-shot setting.
Refer to caption
Figure 4: Case analysis of relation prediction.

3.6.   Evaluation on LLMs

The aforementioned experiments have highlighted the efficacy of TacoERE with PLMs in enhancing ERE performance. To further validate its effectiveness and robustness, we conduct extensive experiments on LLMs. As illustrated in Figure 2, our framework primarily comprises three key components: Document Clustering, Cluster Summarization, and Relation Prediction. We introduce TacoERE (LLMs), which directly leverages LLMs to implement these three components. To thoroughly validate our approach, we configure the task as relation prediction between individual event pairs. For testing, we randomly sample 50 documents from MAVEN-ERE, resulting in 646 causal relations. We compare TacoERE (LLMs) with three variants: (1) Document, which uses the entire document to predict relations; (2) Sentence Pair, which uses the sentences containing the event pair to predict relations; (3) Document Clustering, which uses the sentences within a cluster to predict relations.
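For illustration, a hypothetical 2-shot prompt skeleton for the relation prediction step of TacoERE (LLMs) could look as follows; the wording and the label set are assumptions, not the paper's exact prompt.

PROMPT = """You are given a compressed context (cluster summaries) and two events.
Decide the causal relation between the events: CAUSE, PRECONDITION, or NONE.

{demonstrations}

Context: {summaries}
Event 1: {event1}
Event 2: {event2}
Relation:"""

def build_prompt(demonstrations, summaries, event1, event2):
    return PROMPT.format(demonstrations=demonstrations, summaries=summaries,
                         event1=event1, event2=event2)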

The results are shown in Table 5. Our TacoERE (LLMs) achieves the best performance across all three models, with improvements of 11.2% and 9.1% on ChatGPT and GPT-4, respectively. We also have the following four observations: (1) from the model perspective, GPT-4 achieves the highest F1 score of 41.9%, followed by ChatGPT; (2) compared with Document, Sentence Pair obtains better results, indicating that predicting the relation between events does not depend on the whole document; (3) compared with Document and Sentence Pair, Document Clustering achieves improved results, indicating that our method can reduce redundant information while retaining information useful for relation prediction; (4) compared with Document Clustering, TacoERE (LLMs) achieves the best performance, suggesting that reducing redundant information and shortening event distance further improve performance.

3.7.   Case Study

We display a case in Figure 4 to qualitatively analyze the predictions of our TacoERE (LLMs) and the different comparison settings, i.e., Document, Sentence Pair, and Document Clustering. TacoERE (LLMs) means we use the content from Cluster Summarization to predict relations. We can see that, for each event pair, the prediction does not rely on the whole document, and that some dependency information may be ignored when using Document Clustering alone for prediction. Our proposed TacoERE (LLMs) with compression-then-extraction is an effective means to enhance event relation extraction.

4.   Related Work

4.1.   Event Relation Extraction

Event Relation Extraction is a challenging task in natural language processing, especially for events scattered across different sentences Gao et al. (2019); Chen et al. (2022). Recently, deep learning based methods have become the mainstream Cao et al. (2021); Xu et al. (2021), and extensive explorations have been made, such as joint reasoning methods which extract multiple relations simultaneously Ning et al. (2018); Han et al. (2019), and graph-based methods which use event mentions as nodes and model the document as a graph Tran Phu and Nguyen (2021); Fan et al. (2022); Guo et al. (2023). However, these works use the entire document as input and thus cannot handle the long-range dependency problem well. In contrast, we improve event relation extraction by processing the document in advance with cluster-aware compression.

Currently, a series of LLMs have been developed, such as the GPT series, LaMDA Thoppilan et al. (2022), and PaLM Chowdhery et al. (2022), and have achieved remarkable performance in various fields. Among them, the GPT series, i.e., GPT-4 and ChatGPT, is undoubtedly the most popular. Thus, to verify the effectiveness of our method, we conduct extensive experiments on these models.

4.2.   Controlled Text Summarization

With the development of deep learning and the increasing demand for generation quality, a growing number of studies focus on controlled text summarization Dou et al. (2021); Guan et al. (2021a), such as designing copy mechanisms to directly copy important words from the input See et al. (2017), extracting fact triples for modeling Cao et al. (2017), and extracting templates from the training data to guide summary generation Wang et al. (2019). However, these methods typically design an additional module or rely on existing third-party content selectors. In contrast, our method adopts the events in the document as an intermediate representation to better guide summary generation.

5.   Conclusion

In this paper, we propose a novel cluster-aware compression method for event relation extraction, namely TacoERE, which explores a compression-then-extraction paradigm to extract event relations. TacoERE first splits the document into intra- and inter-clusters to allow the modeling of dependencies without considering the event distance among sentences. Then, cluster summarization is adopted to simplify and highlight the important text of clusters, further mitigating information redundancy and event distance. Extensive experiments have been conducted on both small-scale PLMs such as RoBERTa, and LLMs such as ChatGPT and GPT-4. Experimental results demonstrate that our proposed TacoERE with compression-then-extraction is an effective method for improving event relation extraction.

6.   Bibliographical References


7.   Language Resource References


 

  • Caselli and Vossen (2016) Caselli, Tommaso and Vossen, Piek. 2016. The Storyline Annotation and Representation Scheme (StaR): A Proposal. In Proceedings of the 2nd Workshop on Computing News Storylines. PID https://github.com/tommasoc80/EventStoryLine.
  • Glavas et al. (2014) Goran Glavas and Jan Snajder and Marie-Francine Moens and Parisa KordJamshidi. 2014. HiEve: A Corpus for Extracting Event Hierarchies from News Stories. In Proceedings of LREC. PID http://takelab.fer.hr/hievents.rar.
  • Wang et al. (2022) Wang, Xiaozhi and Chen, Yulin and Ding, Ning and Peng, Hao and Wang, Zimu and Lin, Yankai and Han, Xu and Hou, Lei and Li, Juanzi and Liu, Zhiyuan and Li, Peng and Zhou, Jie. 2022. MAVEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction. In Proceedings of EMNLP. PID https://github.com/THU-KEG/MAVEN-ERE.