
HeadlineCause: A Dataset of News Headlines for Detecting Causalities

Ilya Gusev
Moscow Institute of Physics and Technology
Moscow, Russia
[email protected]
Alexey Tikhonov
Yandex
Berlin, Germany
[email protected]
Abstract

Detecting implicit causal relations in texts is a task that requires both common sense and world knowledge. Existing datasets are focused either on commonsense causal reasoning or explicit causal relations. In this work, we present HeadlineCause, a dataset for detecting implicit causal relations between pairs of news headlines. The dataset includes over 5000 headline pairs from English news and over 9000 headline pairs from Russian news labeled through crowdsourcing. The pairs range from completely unrelated or merely sharing a general topic to pairs connected by causation or refutation relations. We also present a set of models and experiments that demonstrate the dataset's validity, including a multilingual XLM-RoBERTa based model for causality detection and a GPT-2 based model for predicting possible effects.

1 Introduction

Causality is a crucial concept in many human activities. Automatic inference of causal relations from texts is vital for any model attempting to analyze documents or predict future events. This is especially true for news. At the same time, it is still a challenging task for text understanding models, as it requires both common sense and world knowledge.

From a practical standpoint, news aggregators are interested in detecting causal relations. First, they should understand which news documents refute others to build a relevant and up-to-date news feed, and a refutation is a special kind of causal relation. For example, almost every big accident produces many death toll updates that refute each other. The rise of fake news also increases the need to detect refutations. Second, news aggregators should be able to differentiate between news from different sources about the same event and news on the same topic but about different events. Cause-effect event pairs are hard negative samples for this task.

There are several types of causal relations. In this work, we focus on implicit inter-sentence causal relations. This means that the cause and the effect appear in different sentences (in our case, even in different texts) and there are no explicit linking words between them.

This paper introduces a dataset of news headline pairs in English and Russian with causality labels obtained through crowdsourcing. We deliberately chose not to include texts of news documents in this dataset as almost every headline contains only one fact and roughly corresponds to a notion of an event. Furthermore, using headlines is much easier than detecting causalities between different parts of texts. For the same reasons, headlines were used in other works [Radinsky et al., 2012].

Natural language understanding benchmarks such as SuperGLUE [Wang et al., 2019] and RussianSuperGLUE [Shavrina et al., 2020] were introduced recently and are a great way to track natural language research progress. These NLU benchmarks have inspired this work. The only task dealing with causality in these benchmarks is COPA [Roemmele et al., 2011] (PARus in the Russian version). The examples in this task are from the general domain and do not fully represent causal relations in other domains. Moreover, COPA was deliberately built as a benchmark rather than a dataset for model training.

This work was motivated by a desire to understand how well modern models handle implicit causal relations. Ultimately, we would like to predict, for every event, which other events led to it, as well as possible future events.

To prove our dataset useful, we analyzed its contents, trained several BERT-family classifiers to detect causalities in previously unseen headlines, and checked their performance. We also trained GPT-2 based models to predict future headlines based on the current ones.

The resulting dataset is one of the few datasets on implicit inter-sentence causal relations. Using world knowledge and common sense is the only way to infer a causal relation for many samples. Embedding that knowledge into the models is the main challenge the dataset poses for a research community.

As for potential negative social impact, we do not see any direct malicious applications of our work.

The data probably do not contain offensive content, as news agencies usually do not produce it, and a keyword search returned nothing. Still, there are news documents in the dataset on several topics that some people may consider sensitive, such as deaths or crimes.

2 Related work

In recent years, several surveys of causality extraction have appeared [Asghar, 2016, Xu et al., 2020, Yang et al., 2021]. We will not repeat all the works covered there; instead, we highlight the most significant ones.

Most papers describe methods for extracting explicit causal relations. For example, such relations can be collected with a set of linguistic patterns [Khoo et al., 1998, 2000, Girju and Moldovan, 2002] or with machine learning methods, such as decision trees [Girju, 2003] or SVMs [Bethard and Martin, 2008] over syntactic and semantic features. Riaz and Girju [2013] explore causal associations of verb-verb pairs for this purpose.

As for effect prediction, there is the work of Radinsky et al. [2012] on predicting future events by building a generalizing abstraction tree over given event pairs. A new event is matched to a node of this tree, and the associated prediction rule is applied to produce effects. The authors obtain the event pairs from news headlines, but, in contrast to our work, the cause and the effect must be in the same headline.

The Topic Detection and Tracking (TDT) initiative [Allan et al., 1998] and its successors are a related research area we should mention. The area is mainly about detecting news events and topics and composing storylines. Moreover, an already clustered news collection makes composing causality graphs or predicting new events much more manageable.

Radinsky and Horvitz [2013] use TDT methods to compose storylines through text clustering. They then use these storylines as a heuristic for identifying possible causal relationships among events and, over these storylines, predict the probabilities of various future events. This work is very close to ours: the causal relations in it are implicit, and we also utilize text clustering as one of the heuristics for sampling candidate pairs.

The field of event evolution [Yang et al., 2009, Liu et al., 2020] is another TDT successor. The event evolution graph built by Yang et al. [2009] can also be seen as a causality graph.

There were also attempts to build causality datasets. The Event StoryLine Corpus [Caselli and Vossen, 2017] is one of these attempts focusing on complex annotations of the events and links between them. The modification by Caselli and Inel [2018] involves mining causal relations through crowdsourcing to enhance the dataset. The other dataset, Altlex [Hidey and McKeown, 2016], leverages parallel Wikipedia corpora to identify new causality markers.

Several recent papers focus on implicit inter-sentence causal relations, including Jin et al. [2020] and Hosseini et al. [2021]. Jin et al. [2020] focus on models for extracting these relations from Chinese corpora, and Hosseini et al. [2021] use BERT and its modifications to detect the directionality of these relations.

There is also a paper by Laban et al. [2021] that focuses on converting an event detection task into a classical NLU task on headlines. The main idea and methodology of this work are very similar to ours, but the target class differs: our main goal is to predict headlines with causal relations, whereas their goal is to predict headlines about the same event.

3 Data

3.1 Definitions

Same headlines: two headlines are considered the same if they are about the same thing or differ only in minor details. In other words, if they describe the same event, they should be considered the same.

An example of headlines we consider the same (sample en_tg_572):
A: Exclusive: NextVR acquired by Apple (Updated)
B: Apple Buys Virtual Reality Company NextVR

Causality: the first headline causes the second headline if the second headline is impossible without the first one. If the first event had not happened, the second event could not have happened either.

An example of a news headline pair with a causal relation (sample en_tg_1153):
A: Oklahoma spent $2 million on malaria drug touted by Trump
B: Gov. Kevin Stitt defends $2 million purchase of malaria drug touted by Trump

This type of causality is known as necessity causality. There are other possible definitions of causality, including sufficient causality or a cost-based concept of causality. They are described in Roemmele et al. [2011]. We have several reasons to use this particular definition. First, it is aligned with our goals stated in the introduction. Second, it is easy enough to be used in a crowdsourcing project.

Refutation: the second headline refutes the first one if the second headline makes the first one irrelevant. Refutations are a subset of causal pairs, and every refuting headline is an effect of some cause, but not every effect of a cause is a refutation.

An example of a news headline pair with a refutation (sample en_tg_496):
A: Report: Microsoft acquiring Microvision, a leader in ultra-miniature projection display
B: Microsoft denies MicroVision acquisition

3.2 Sources

We used two sources of news documents: the Lenta corpus (https://github.com/yutkin/Lenta.Ru-News-Dataset) and documents from the Telegram Data Clustering Contest (https://contest.com/docs/data_clustering2). We additionally parsed the Lenta website (https://lenta.ru) to obtain fresher news documents and hyperlinks between documents. Lenta is one of the oldest Russian news websites, and the dataset contains over 800 thousand news documents from 1999 to 2020. The Telegram news dataset was published in 2020 and contains data from hundreds of sources in over ten languages from October 2019 to May 2020.

The original data for the Telegram dataset contains various types of documents. We trained a FastText classifier on separate crowdsourcing annotations and open datasets to differentiate news from other documents. These annotations and the classifier itself are available in a separate GitHub repository (https://github.com/IlyaGusev/tgcontest). The classifier is not ideal, but the majority of the resulting documents are news.
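As a rough illustration, a FastText classifier of this kind can be trained from labeled examples in FastText's text format; the file name and hyperparameters below are placeholders, not the exact configuration from the tgcontest repository.

```python
# A minimal sketch of training a FastText document-type classifier.
# The training file is assumed to use the "__label__<class> <text>" format;
# its name and the hyperparameters are illustrative.
import fasttext

model = fasttext.train_supervised(
    input="doc_type_train.txt",  # hypothetical training file
    lr=0.5,
    epoch=10,
    wordNgrams=2,
)

# Predict the document type of a single headline.
labels, probs = model.predict("Apple Buys Virtual Reality Company NextVR")
print(labels, probs)
```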

The licenses for both datasets are not provided. Our dataset uses only news headlines, provides links for all used documents, and provides authors where possible, which can be considered fair use. We also emailed Lenta asking for permission to publish their data.

3.3 Candidates sampling

For both datasets, we used four filters to extract candidates for annotation:

  • A presence of a hyperlink between two documents

  • An affiliation of documents to the same website

  • A cosine distance between LaBSE embeddings [Feng et al., 2020] with a threshold

  • A presence of different locations in headlines

We combined these filters in various combinations in different annotation pools to collect diverse data. We did not have a specific algorithm or scheme for applying these filters and used some of them as different problems emerged. For example, at some point we observed that our trained model made errors when linking news headlines about similar events in different locations, so we annotated more data with the last filter. The first two filters were used as the main ones.
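The LaBSE-based filter can be sketched as follows; the model identifier and the distance threshold are assumptions for illustration, since the exact values varied between annotation pools.

```python
# A sketch of the LaBSE-based candidate filter: keep headline pairs whose
# cosine distance is below an upper threshold. Threshold value is illustrative.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_distances

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def labse_candidates(headlines_a, headlines_b, max_distance=0.4):
    # headlines_a[i] and headlines_b[i] form one candidate pair.
    emb_a = encoder.encode(headlines_a)
    emb_b = encoder.encode(headlines_b)
    distances = cosine_distances(emb_a, emb_b).diagonal()
    return [
        (a, b, d)
        for a, b, d in zip(headlines_a, headlines_b, distances)
        if d < max_distance
    ]
```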

Another problem we detected was a subset of headlines that our models considered causes without even looking at the effects. This type of error is common for sentence-pair inference tasks [Gururangan et al., 2018]. To handle this, we trained a single-headline classifier to estimate the a priori probability of a headline being a cause or an effect, sampled negative pairs for the original task from the strongest examples, and annotated them.

Figure 1: A dependency between the LaBSE distance and the number of causal pairs
Figure 2: A dependency between the LaBSE distance and the ratio of causal pairs

To estimate the final contribution of the sampling method based on the cosine distance between LaBSE embeddings, we plotted the dependency between this distance and the number of causal pairs (Figure 1). We used this distance as an upper threshold in several candidate samplings. The figure shows that for the Russian annotations we did not lose many causal pairs above the threshold, as the number of causal pairs decreases very fast with increasing distance. For the English annotations, this regularity is not as apparent, so we additionally plotted the ratio of causal pairs (Figure 2). One can see that the ratio of causal pairs in English also decreases with the distance. A minimal plotting sketch is given below.
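A sketch of producing such plots from aggregated annotations, assuming a pandas DataFrame with hypothetical `labse_distance` and `label` columns; the column names, label strings, and binning are illustrative.

```python
# Plot the number and the ratio of causal pairs per LaBSE-distance bin.
import pandas as pd
import matplotlib.pyplot as plt

def plot_causal_vs_distance(df: pd.DataFrame, n_bins: int = 20):
    df = df.copy()
    df["is_causal"] = df["label"].isin(["left_right_cause", "right_left_cause"])
    df["bin"] = pd.cut(df["labse_distance"], bins=n_bins)
    grouped = df.groupby("bin")["is_causal"]

    grouped.sum().plot(kind="bar", title="Number of causal pairs per distance bin")
    plt.show()
    grouped.mean().plot(kind="bar", title="Ratio of causal pairs per distance bin")
    plt.show()
```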

3.4 Annotation

Figure 3: English annotation interface for desktop devices

We annotated every candidate pair with Yandex Toloka (https://toloka.ai/), a crowdsourcing platform. We chose this platform as we are very familiar with it, and it has many workers who are native Russian speakers, which is useful for the Russian annotation. The task was to determine the relationship between two headlines, A and B. There were seven possible options: the titles are almost the same, A causes B, B causes A, A refutes B, B refutes A, A is linked with B in another way, A is not linked to B. The annotation guidelines were in Russian (https://ilyagusev.github.io/HeadlineCause/toloka/ru/instruction.html) for Russian news and in English (https://ilyagusev.github.io/HeadlineCause/toloka/en/instruction.html) for English news. Ten workers annotated every pair. The annotation interface is shown in Figure 3. The total annotation budget was $2,173, with an estimated hourly wage of $1.13 paid to participants. Initially, the wage was 45 cents per hour, but we reconsidered it for ethical reasons, as it was lower than the minimum wage in Russia. Annotation management was semi-automatic. Scripts are available in the project repository (https://github.com/IlyaGusev/HeadlineCause).

As for quality control, we required annotators to pass training and an exam, and their work was continuously evaluated through control pairs ("honeypots"). The threshold was 70% correct examples for the training and 80% correct examples for the exam and honeypots. No additional language proficiency check was performed, as these thresholds should filter out workers lacking it. All examples from the training and the exam are also available in the project repository.

Table 1: Annotation statistics

(a) Overall numbers
                                      English   Russian
Total number of pairs                 10078     11649
Number of pairs with links            8737      5241
Number of pairs from the same source  8139      8278
Number of workers                     180       457
Average number of tasks per worker    560       255
Total budget                          $1,008    $1,165

(b) English project, top-6 countries
Country           Workers
India             26
Kenya             24
The Philippines   19
Turkey            10
Nigeria           9
Pakistan          7

Annotation statistics are presented in Table 1. The high numbers of pairs with links and pairs from the same source are caused by the filtering system. As for the number of annotators, the Russian annotation was done earlier than the English one and was split into two long periods, so more workers were involved.

3.5 Aggregation

Table 2: Task labels alignment
Full                    Simple
Left-right causality    Left-right causality
Left-right refutation   Left-right causality
Right-left causality    Right-left causality
Right-left refutation   Right-left causality
Same event              No causality
Other relationship      No causality
No relationship         No causality

We aggregate annotations in two settings. The first setting, Full, includes all seven possible classes. The second setting, Simple, unites some of the classes to simplify the task. The alignment between labels is presented in Table 2. We include only samples with an agreement of 70% or higher in the final dataset. The agreement is calculated relative to the setting, with three labels for the Simple setting and seven labels for the Full setting.

Table 3: Agreement distribution for both languages and both settings. Every sample was annotated by ten people; α is Krippendorff's alpha [Krippendorff, 2011], computed with the NLTK package [Bird et al., 2009].
                        English, Simple  English, Full  Russian, Simple  Russian, Full
10 votes                966 (10%)        167 (2%)       5379 (46%)       2783 (24%)
9 votes                 1476 (14%)       450 (4%)       2058 (18%)       1532 (13%)
8 votes                 1568 (16%)       856 (8%)       1384 (12%)       1497 (13%)
7 votes                 1703 (17%)       1227 (12%)     1059 (9%)        1603 (14%)
6 votes                 1905 (19%)       1734 (17%)     983 (8%)         1770 (15%)
5 votes                 1784 (18%)       2309 (23%)     683 (6%)         1597 (14%)
4 votes                 676 (7%)         2151 (22%)     103 (1%)         717 (6%)
3 votes                 0 (0%)           1127 (11%)     0 (0%)           150 (1%)
2 votes                 0 (0%)           57 (1%)        0 (0%)           0 (0%)
Total                   10078            10078          11649            11649
Average agreement       0.699            0.548          0.862            0.745
α, all samples          0.289            0.255          0.598            0.548
α, 7 or more votes      0.458            0.551          0.708            0.733

The agreement distribution for both tasks is in Table 3. Agreement between workers differs considerably between Russian and English. There are several possible reasons for this. First, the English workers are less homogeneous than the Russian ones, as they come from a more extensive list of countries, and English is not native for some of them. Second, there is a difference in the complexity of the task itself: the English headlines are more challenging, as they mention a broader set of entities, including local ones. The diversity of news agencies is also greater in the English dataset than in the Russian one.
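For reference, Krippendorff's alpha values like those in Table 3 can be computed with NLTK's AnnotationTask; the worker, item, and label identifiers below are illustrative.

```python
# A sketch of computing Krippendorff's alpha over (worker, item, label) triples.
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import binary_distance

raw_annotations = [
    ("worker_1", "pair_42", "left_right_cause"),
    ("worker_2", "pair_42", "no_causality"),
    ("worker_1", "pair_43", "no_causality"),
    ("worker_2", "pair_43", "no_causality"),
]
task = AnnotationTask(data=raw_annotations, distance=binary_distance)
print("Krippendorff's alpha:", task.alpha())
```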

The distribution of English-speaking workers over the most popular countries is in Table 1. Most of the workers are probably not native speakers. This affects annotation quality, but aggregation with an overlap of ten and tight quality control should help maintain it.

We use the majority vote (MV) aggregation method. There are several possible alternatives: the Dawid and Skene [1979] method, aggregation by skill, and others. Some of them are supported by the crowdsourcing platform itself. We chose MV as it is easily interpretable and yielded consistent results in our experiments. We also tried the Dawid-Skene method, but training on the resulting annotations yielded poor metrics, so we abandoned it early. However, we still do not know whether the poor results came from the method itself or from the complexity of the examples it brings in; this is a subject for future experiments.
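A minimal sketch of this aggregation scheme, combining majority vote, the 70% agreement threshold, and the Full-to-Simple mapping from Table 2; the label strings are illustrative.

```python
# Majority-vote aggregation with an agreement threshold and label alignment.
from collections import Counter

FULL_TO_SIMPLE = {
    "left_right_cause": "left_right_cause",
    "left_right_refute": "left_right_cause",
    "right_left_cause": "right_left_cause",
    "right_left_refute": "right_left_cause",
    "same_event": "no_causality",
    "other_relationship": "no_causality",
    "no_relationship": "no_causality",
}

def aggregate(votes, simple=True, min_agreement=0.7):
    # votes: list of per-worker labels for one pair (ten in our setup).
    if simple:
        votes = [FULL_TO_SIMPLE[v] for v in votes]
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label, agreement) if agreement >= min_agreement else (None, agreement)
```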

Table 4: Simple task aggregated data statistics after excluding samples with less than seven votes and additional postprocessing (Section 3.6)
                       English      Russian
Left-right causality   720 (13%)    1173 (12%)
Right-left causality   610 (11%)    1224 (13%)
No causality           4086 (76%)   7156 (75%)
Total                  5416         9553

The final statistics for aggregated annotations are presented in Table 4 and Table 5. All datasets are imbalanced. For the Simple task, only 25% of samples contain causal relations.

Table 5: Full task aggregated data statistics after excluding samples with less than seven votes and additional postprocessing (Section 3.6)
                        English     Russian
Left-right causality    428 (17%)   914 (13%)
Right-left causality    386 (15%)   966 (13%)
Left-right refutation   61 (2%)     126 (2%)
Right-left refutation   34 (1%)     127 (2%)
Same event              254 (10%)   780 (11%)
Other relationship      813 (32%)   1655 (23%)
No relationship         536 (21%)   2575 (36%)
Total                   2512        7143

3.6 Postprocessing

We remove pairs that are not consistent with their timestamps. In other words, if a temporally later headline is annotated as a cause of a temporally earlier headline, we consider the annotation of this pair inaccurate. There were 981 (10%) such pairs in the English dataset and 540 (5%) in the Russian one.

These numbers drop to 202 (5%) and 132 (2%) if we consider only pairs with more than 70% agreement (Simple task) and from the same sources. Different sources can have different timestamp policies or react to events with different promptness, so it can be incorrect to compare their timestamps.
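A sketch of this consistency check, assuming each pair record carries the aggregated Simple label and the two headline timestamps; the field names are hypothetical.

```python
# Drop pairs where the headline annotated as the cause was published
# after its effect.
def filter_time_consistent(pairs):
    def is_consistent(pair):
        label = pair["label"]
        left_ts, right_ts = pair["left_timestamp"], pair["right_timestamp"]
        if label == "left_right_cause":
            return left_ts <= right_ts
        if label == "right_left_cause":
            return right_ts <= left_ts
        return True

    return [p for p in pairs if is_consistent(p)]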

3.7 Splits

We split the dataset into train, validation, and test sets by time. The main reason is a possible entity and event bias in the news domain. Trained models should work well with unseen entities, locations, and events, and the best way to emulate these factors is to split by time. We take the maximum of the left and right timestamps as the timestamp of a pair. The training dataset contains the first 80% of pairs, the validation dataset contains the next 10%, and the test dataset the remaining 10%.

The Lenta and Telegram corpora have very different densities of news over time and cover different time spans. We therefore split the two datasets separately and unite the resulting splits, so that samples from both sources are equally represented in the train, validation, and test sets.
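A sketch of the split procedure under these assumptions; the field names are hypothetical.

```python
# Time-based 80/10/10 split, done per source corpus and then united.
def time_split(pairs, train_frac=0.8, val_frac=0.1):
    # A pair's timestamp is the maximum of its two headline timestamps.
    pairs = sorted(pairs, key=lambda p: max(p["left_timestamp"], p["right_timestamp"]))
    n = len(pairs)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return pairs[:train_end], pairs[train_end:val_end], pairs[val_end:]

def split_and_unite(corpora):
    train, val, test = [], [], []
    for pairs in corpora:  # e.g. [lenta_pairs, telegram_pairs]
        tr, va, te = time_split(pairs)
        train += tr
        val += va
        test += te
    return train, val, test
```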

3.8 Augmentations

We apply two augmentations (https://github.com/IlyaGusev/HeadlineCause/blob/main/headline_cause/augment.py) to the train and validation datasets. The first one adds symmetrical pairs to encourage the model to stay consistent when the headlines are swapped. The second one adds typos to the left, right, or both headlines to make the model more robust.

The symmetrical augmentation doubles the size of the dataset. Typos are applied to 5% of the dataset, with the original pairs preserved; a typo is simply a swap of two adjacent letters.
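A sketch of both augmentations; the label names are illustrative, and for simplicity the typo is applied to both headlines here, whereas the released script may apply it to either side.

```python
# Symmetric-pair and adjacent-letter-swap typo augmentations.
import random

SWAPPED_LABEL = {
    "left_right_cause": "right_left_cause",
    "right_left_cause": "left_right_cause",
    "no_causality": "no_causality",
}

def add_typo(text):
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def augment(pairs, typo_rate=0.05):
    augmented = list(pairs)
    # Symmetric augmentation: swap headlines and the label direction.
    augmented += [(b, a, SWAPPED_LABEL[label]) for a, b, label in pairs]
    # Typo augmentation: applied to a fraction of pairs, originals preserved.
    for a, b, label in random.sample(pairs, int(len(pairs) * typo_rate)):
        augmented.append((add_typo(a), add_typo(b), label))
    return augmented
```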

4 Experiments

To demonstrate that our dataset is helpful, we trained several models for the Simple and Full tasks. The main models use XLM-RoBERTa-large [Conneau et al., 2020] as the pretrained base model. It is multilingual and handles both Russian and English inputs while providing good classification quality.

Training was done on a GPU on the Google Colab Pro platform. The code is publicly available. One entire training run for one task takes at most 120 minutes. We trained and evaluated every model three times with different random seeds.

The exact hyperparameters can be found in the training notebook itself. They are standard for BERT-family models.
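A minimal fine-tuning sketch with Hugging Face Transformers, reflecting the described setup; the hyperparameters, column names, and toy data are placeholders rather than the exact values from the training notebook.

```python
# Fine-tuning XLM-RoBERTa-large on headline pairs (Simple task, 3 classes).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Toy data in place of the real train split; column names are assumptions.
raw = {
    "left_title": ["Report: Microsoft acquiring Microvision"],
    "right_title": ["Microsoft denies MicroVision acquisition"],
    "label": [1],
}

def encode(batch):
    # Headline pairs are fed as a standard two-sentence input.
    return tokenizer(batch["left_title"], batch["right_title"],
                     truncation=True, max_length=64, padding="max_length")

train_dataset = Dataset.from_dict(raw).map(encode, batched=True)

args = TrainingArguments(output_dir="headline_cause_simple",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```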

4.1 Simple task

For this task, we consider the causality ROC AUC over two classes as the main metric. To calculate it, we unite the Left-right and Right-left classes so that we can vary a classifier threshold.
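One plausible way to compute this metric, assuming the model returns per-class probabilities in the order [no causality, left-right, right-left]; the class order is an assumption.

```python
# Binary causality ROC AUC: unite the two directed classes into one.
import numpy as np
from sklearn.metrics import roc_auc_score

def causality_roc_auc(labels, probs):
    # labels: integer class ids, 0 == no causality.
    # probs: array of shape (n_samples, 3) with class probabilities.
    y_true = np.asarray(labels) != 0
    probs = np.asarray(probs)
    y_score = probs[:, 1] + probs[:, 2]
    return roc_auc_score(y_true, y_score)
```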

We used a CatBoost [Prokhorenkova et al., 2018] classifier over TF-IDF features as a baseline model. The primary reason for using such a model is to check for lexical biases and data leaks. For this task, the causality ROC AUC was 71% for Russian and 62% for English, so we concluded there were no major leaks.
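A sketch of such a baseline; the feature construction details (concatenation separator, toy data, number of iterations) are assumptions.

```python
# TF-IDF features over concatenated headlines, fed to a CatBoost classifier.
from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy pairs in place of the real training split: (left, right, label).
train_pairs = [
    ("Report: Microsoft acquiring Microvision", "Microsoft denies MicroVision acquisition", 1),
    ("Apple buys NextVR", "Oklahoma spent $2 million on malaria drug", 0),
]
texts = [left + " ||| " + right for left, right, _ in train_pairs]
labels = [label for _, _, label in train_pairs]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

clf = CatBoostClassifier(iterations=100, verbose=False)
clf.fit(X, labels)
```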

The results for the Simple task are in Table 6. The models work better on Russian, as the Russian training dataset is larger and the average agreement on the remaining samples is higher than for English. The final score for both languages is over 95%, which means the models predict test set labels far better than random.

Table 6: Simple task EN+RU XLM-RoBERTa results on the test sets, 3 runs
                     English                    Russian
                     Samples      Score, %      Samples      Score, %
No causality F1      421 (78%)    94.1 ± 0.2    782 (82%)    94.7 ± 0.4
Left-right F1        65 (12%)     75.2 ± 1.4    99 (10%)     76.7 ± 2.0
Right-left F1        56 (10%)     70.0 ± 1.5    76 (8%)      69.9 ± 2.0
Accuracy             542          89.4 ± 0.2    957          90.9 ± 0.7
Causality ROC AUC    542          96.3 ± 0.2    957          95.6 ± 0.2

We use the CheckList methodology [Ribeiro et al., 2020] to evaluate different aspects of the model. Table 7 does not include tests that the models pass without failures, only those that fail in a considerable number of cases. The typos test probes the robustness of the models, the order-swapping tests check their logic, and the different-locations test inspects whether they can separate causes and effects that are certainly not connected. The swap-order test is also similar to the Commutative Test from Laban et al. [2021]. The first two groups of tests match the augmentation methods introduced to reduce the failure rate; a sketch of such a test is given after Table 7.

Table 7: Simple task EN+RU XLM-RoBERTa checklist results, best model
Test type and description English, failure rate Russian, failure rate
INV: Adding typos 3.5% 2.9%
INV: Swapping order of not causal pairs 2.8% 2.0%
DIR: Swapping order of causal pairs 22.0% 12.2%
MFT: Explicit refutations with different locations 9.5% 2.9%
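For illustration, the directional swap-order test can be written without the checklist library as follows; `predict` is an assumed function that returns Simple labels for a list of (left, right) headline pairs.

```python
# DIR test sketch: for pairs predicted as causal, swapping the headlines
# should flip the predicted direction.
FLIPPED = {
    "left_right_cause": "right_left_cause",
    "right_left_cause": "left_right_cause",
}

def swap_order_failure_rate(pairs, predict):
    original = predict(pairs)
    swapped = predict([(b, a) for a, b in pairs])
    causal = [(o, s) for o, s in zip(original, swapped) if o in FLIPPED]
    failures = sum(1 for o, s in causal if s != FLIPPED[o])
    return failures / len(causal) if causal else 0.0
```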

We also hypothesized that the accuracy of the models is higher for samples with higher agreement. Indeed, as shown in Figure 4, there is an almost linear dependence between these two parameters, so we conclude that the agreement can be used as a proxy for the complexity of a specific pair.

Figure 4: The dependence of the accuracy of the Simple model on annotators’ agreement

4.2 Full task

The results for this task are in Table 8. We provide the F-score for every class and the total multiclass accuracy. The number of samples with refutation is too small to determine whether we can reliably detect refutations with our model.

Table 8: Full task EN+RU XLM-RoBERTa results on the test sets, 3 runs
                           English                    Russian
                           Samples     Score, %       Samples     Score, %
No relationship F1         68 (27%)    87.7 ± 1.8     315 (44%)   95.7 ± 0.3
Same event F1              22 (9%)     81.2 ± 5.2     71 (10%)    90.7 ± 1.9
Other relationship F1      67 (27%)    80.5 ± 1.5     162 (22%)   81.3 ± 1.6
Left-right causality F1    43 (17%)    94.3 ± 0.9     77 (11%)    84.7 ± 1.1
Right-left causality F1    39 (15%)    84.1 ± 0.9     58 (8%)     77.1 ± 1.9
Left-right refutation F1   5 (2%)      25.7 ± 23.7    16 (2%)     53.8 ± 5.4
Right-left refutation F1   8 (3%)      48.1 ± 7.0     16 (2%)     74.8 ± 1.5
Total number of pairs      252                        715
Accuracy                   83.5 ± 0.2                 87.9 ± 0.9

4.3 GPT-2

Additionally, we trained a GPT-2 [Radford et al., 2019] model to predict effect headlines from cause headlines. Examples of such predictions are in Table 9. One can see that some continuations are reasonable, and it is possible to select the plausible ones. Still, many continuations are incorrect; a larger dataset and a larger model would probably fix that.

In the future, the generator can be used to create an augmented dataset, as it can produce grammatically correct but meaningless continuations. Correctly identifying such examples will be a much harder task for a detection model. A generation sketch is given after Table 9.

Table 9: GPT generation examples
Armed protesters demonstrate inside Michigan state capitol
=> The Michigan capitol clashes with armed protesters.
=> Govt defends lockdown in Michigan state capitol.
=> State capitol protesters demand lockdown of state capitol, condemn lockdown.
=> More arrests as protesters protest state capitol.
=> The Michigan capitol is being shut down, protesters march in defiance of state law.
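A generation sketch with Hugging Face Transformers; the checkpoint path and the "cause => effect" prompt format are assumptions for illustration.

```python
# Sample several candidate effect headlines from a fine-tuned GPT-2 model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/finetuned-gpt2"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

prompt = "Armed protesters demonstrate inside Michigan state capitol =>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs,
                         max_new_tokens=30,
                         do_sample=True,
                         top_p=0.95,
                         num_return_sequences=5,
                         pad_token_id=tokenizer.eos_token_id)
prompt_len = inputs["input_ids"].shape[1]
for out in outputs:
    print(tokenizer.decode(out[prompt_len:], skip_special_tokens=True))
```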

5 Discussion

No span annotations. This work differs from other papers in the field in terms of annotation format. Usually, in causal relation extraction, particular verbs or noun phrases are annotated. Instead, we label a whole headline pair. The primary motivation is that classification is a much more manageable task for crowdsourcing than true relation extraction. We were also inspired by sentence-level NLU tasks with no spans.

Unclear method of candidate sampling. We did not develop a specific and reliable scheme for sampling candidates and encountered two related problems during the annotation process. The first problem was the single-sentence bias. The second problem was the detection of causation in headlines with different locations but similar events. Both problems are described in Section 3.3.

Annotation aggregation. The Majority vote is probably not the best choice, as it could leave only simple pairs. Future experiments should determine the best method.

Poor refutation annotation. The collection of refutation relations was one of the main goals of our work, and it was not fully achieved. It is possible to collect more refutations from the same document collections; for instance, one can utilize active learning.

Disparity between Russian and English parts. The English part of the dataset has a lower inter-annotator agreement for several reasons we discussed in Section 3.5. It also affects Figure 2 and Figure 4.

No ablation study for augmentations. The augmentations were based on the checklist’s tests, and they improve results on them, but we do not present an ablation study here.

6 Conclusion

This paper introduced HeadlineCause, a novel dataset for implicit inter-sentence causation detection based on news headlines in Russian and English. We described the annotation process and several possible biases that we detected and tried to avoid. The dataset differs from other datasets for causal relation extraction and is more similar to NLU datasets. We also presented baselines for this dataset. We believe that HeadlineCause can be successfully used to train causal relation detection models, with subsequent composition of causation graphs.

References

  • Allan et al. [1998] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study: Final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 194–218, Lansdowne, VA, USA, Feb. 1998.
  • Asghar [2016] N. Asghar. Automatic extraction of causal relations from natural language texts: A comprehensive survey, 2016.
  • Bethard and Martin [2008] S. Bethard and J. H. Martin. Learning semantic links from a corpus of parallel temporal and causal relations. In Proceedings of ACL-08: HLT, Short Papers, pages 177–180, Columbus, Ohio, June 2008. Association for Computational Linguistics. URL https://aclanthology.org/P08-2045.
  • Bird et al. [2009] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc., 2009.
  • Caselli and Inel [2018] T. Caselli and O. Inel. Crowdsourcing StoryLines: Harnessing the crowd for causal relation annotation. In Proceedings of the Workshop Events and Stories in the News 2018, pages 44–54, Santa Fe, New Mexico, U.S.A, Aug. 2018. Association for Computational Linguistics. URL https://aclanthology.org/W18-4306.
  • Caselli and Vossen [2017] T. Caselli and P. Vossen. The event StoryLine corpus: A new benchmark for causal and temporal relation extraction. In Proceedings of the Events and Stories in the News Workshop, pages 77–86, Vancouver, Canada, Aug. 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-2711. URL https://aclanthology.org/W17-2711.
  • Conneau et al. [2020] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747.
  • Dawid and Skene [1979] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28, 1979. ISSN 00359254, 14679876. URL http://www.jstor.org/stable/2346806.
  • Feng et al. [2020] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang. Language-agnostic bert sentence embedding, 2020.
  • Girju [2003] R. Girju. Automatic detection of causal relations for question answering. In Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering - Volume 12, MultiSumQA ’03, page 76–83, USA, 2003. Association for Computational Linguistics. doi: 10.3115/1119312.1119322. URL https://doi.org/10.3115/1119312.1119322.
  • Girju and Moldovan [2002] R. Girju and D. I. Moldovan. Text mining for causal relations. In Proceedings of the Fifteenth International Florida Artificial Intelligence Research Society Conference, page 360–364. AAAI Press, 2002. ISBN 157735141X.
  • Gururangan et al. [2018] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith. Annotation artifacts in natural language inference data. In NAACL, 2018.
  • Hidey and McKeown [2016] C. Hidey and K. McKeown. Identifying causal relations using parallel Wikipedia articles. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1424–1433, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1135. URL https://aclanthology.org/P16-1135.
  • Hosseini et al. [2021] P. Hosseini, D. A. Broniatowski, and M. Diab. Predicting directionality in causal relations in text. arXiv preprint arXiv:2103.13606, 2021.
  • Jin et al. [2020] X. Jin, X. Wang, X. Luo, S. Huang, and S. Gu. Inter-sentence and implicit causality extraction from chinese corpus. Advances in Knowledge Discovery and Data Mining, 12084:739 – 751, 2020.
  • Khoo et al. [1998] C. S. Khoo, J. Kornfilt, R. N. Oddy, and S. H. Myaeng. Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing. Literary and Linguistic Computing, 13(4):177–186, 12 1998. ISSN 0268-1145. doi: 10.1093/llc/13.4.177. URL https://doi.org/10.1093/llc/13.4.177.
  • Khoo et al. [2000] C. S. G. Khoo, S. Chan, and Y. Niu. Extracting causal knowledge from a medical database using graphical patterns. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL ’00, page 336–343, USA, 2000. Association for Computational Linguistics. doi: 10.3115/1075218.1075261. URL https://doi.org/10.3115/1075218.1075261.
  • Krippendorff [2011] K. Krippendorff. Computing krippendorff’s alpha-reliability. 2011.
  • Laban et al. [2021] P. Laban, L. Bandarkar, and M. A. Hearst. News headline grouping as a challenging nlu task. In NAACL 2021. Association for Computational Linguistics, 2021.
  • Liu et al. [2020] Y. Liu, H. Peng, J. Li, Y. Song, and X. Li. Event detection and evolution in multi-lingual social streams. Frontiers of Computer Science, 14, 10 2020. doi: 10.1007/s11704-019-8201-6.
  • Prokhorenkova et al. [2018] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin. Catboost: Unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 6639–6649, Red Hook, NY, USA, 2018. Curran Associates Inc.
  • Radford et al. [2019] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
  • Radinsky and Horvitz [2013] K. Radinsky and E. Horvitz. Mining the web to predict future events. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13, page 255–264, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450318693. doi: 10.1145/2433396.2433431. URL https://doi.org/10.1145/2433396.2433431.
  • Radinsky et al. [2012] K. Radinsky, S. Davidovich, and S. Markovitch. Learning causality for news events prediction. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, page 909–918, New York, NY, USA, 2012. Association for Computing Machinery. ISBN 9781450312295. doi: 10.1145/2187836.2187958. URL https://doi.org/10.1145/2187836.2187958.
  • Riaz and Girju [2013] M. Riaz and R. Girju. Toward a better understanding of causality between verbal events: Extraction and analysis of the causal power of verb-verb associations. In Proceedings of the SIGDIAL 2013 Conference, pages 21–30, Metz, France, Aug. 2013. Association for Computational Linguistics. URL https://aclanthology.org/W13-4004.
  • Ribeiro et al. [2020] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond accuracy: Behavioral testing of nlp models with checklist. In Association for Computational Linguistics (ACL), 2020.
  • Roemmele et al. [2011] M. Roemmele, C. A. Bejan, and A. S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011. URL https://people.ict.usc.edu/~gordon/publications/AAAI-SPRING11A.PDF.
  • Shavrina et al. [2020] T. Shavrina, A. Fenogenova, A. Emelyanov, D. Shevelev, E. Artemova, V. Malykh, V. Mikhailov, M. Tikhonova, A. Chertok, and A. Evlampiev. Russiansuperglue: A russian language understanding evaluation benchmark. arXiv preprint arXiv:2010.15925, 2020.
  • Wang et al. [2019] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint 1905.00537, 2019.
  • Xu et al. [2020] J. Xu, W. Zuo, S. Liang, and X. Zuo. A review of dataset and labeling methods for causality extraction. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1519–1531, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.133. URL https://aclanthology.org/2020.coling-main.133.
  • Yang et al. [2009] C. C. Yang, X. Shi, and C.-P. Wei. Discovering event evolution graphs from news corpora. Trans. Sys. Man Cyber. Part A, 39(4):850–863, July 2009. ISSN 1083-4427. doi: 10.1109/TSMCA.2009.2015885. URL https://doi.org/10.1109/TSMCA.2009.2015885.
  • Yang et al. [2021] J. Yang, S. C. Han, and J. Poon. A survey on extraction of causal relations from natural language text, 2021.

Checklist

  1. For all authors…

     (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] Section 1
     (b) Did you describe the limitations of your work? [Yes] Section 5
     (c) Did you discuss any potential negative societal impacts of your work? [Yes] Section 1, last paragraph
     (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] Section 1, last paragraph

  2. If you are including theoretical results…

     (a) Did you state the full set of assumptions of all theoretical results? [N/A]
     (b) Did you include complete proofs of all theoretical results? [N/A]

  3. If you ran experiments (e.g. for benchmarks)…

     (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
     (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
     (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Table 6 and Table 8
     (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Section 4

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

     (a) If your work uses existing assets, did you cite the creators? [Yes] Footnotes in Section 3.2
     (b) Did you mention the license of the assets? [Yes] Section 3.2
     (c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
     (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [Yes]
     (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] Section 1, last paragraph

  5. If you used crowdsourcing or conducted research with human subjects…

     (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] Section 3.4, Figure 3
     (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
     (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [Yes] Section 3.4, Table 1