UniCausal: Unified Benchmark and Repository for Causal Text Mining
Abstract
Current causal text mining datasets vary in objectives, data coverage, and annotation schemes. These inconsistencies limit modeling capabilities and prevent fair comparisons of model performance. Furthermore, few datasets include cause-effect span annotations, which are needed for end-to-end causal relation extraction. To address these issues, we propose UniCausal, a unified benchmark for causal text mining across three tasks: (I) Causal Sequence Classification, (II) Cause-Effect Span Detection, and (III) Causal Pair Classification. We consolidated and aligned annotations from six high-quality, mainly human-annotated corpora, resulting in a total of 58,720, 12,144, and 69,165 examples for the three tasks respectively. Since the definition of causality can be subjective, our framework was designed to allow researchers to work on some or all datasets and tasks. To create an initial benchmark, we fine-tuned BERT pre-trained language models on each task, achieving 70.10% Binary F1, 52.42% Macro F1, and 84.68% Binary F1 scores respectively.
Keywords: datasets, causal text mining, causal relation extraction
1 Introduction
Causal text mining relates to the extraction of causal information from text. Given an input text, we are interested in whether and where causal information occurs. Researchers can use extracted causal information as a knowledge base [9, 16, 15], or for summarization and prediction [24]. Since causality is an important part of human cognition, causal text mining has important natural language understanding applications [8, 29, 6]. Figure 1 illustrates three causal text mining tasks (Sequence Classification, Span Detection and Pair Classification) and their expected output.

Currently, large and diverse causal text mining corpora are limited [1]. Across datasets, annotation guidelines vary [34]. These issues hinder both modeling capabilities and fair comparisons between models. Additionally, working on independent tasks and datasets runs the risk of training task-specialized and dataset-specific models that are not generalizable.
The contributions of UniCausal are as follows:
• To the best of our knowledge, we are the first to produce a unified benchmark and resource for causal text mining. Apart from consolidating six datasets for three tasks, we also employed competitive natural language processing (NLP) models to obtain baseline scores, reported in this paper.
• Our framework provides a seamless way for researchers to design individual or joint models, while benchmarking their performance against clearly defined test sets across some or all of the processed corpora.
• Our code and processed data are available online, and our trained baseline model checkpoints are uploaded to the Huggingface Hub. Our repository is at https://github.com/tanfiona/UniCausal; links to our trained baseline models are available in the repository.
2 Related Work
In this paper, we are interested in supporting end-to-end causal relation extraction. More specifically, given an input sequence, a model should be designed to detect whether the sequence contains causal relations, and if so, where its causal arguments are.
2.1 Tasks
In earlier work, researchers extracted causal relations and directly assessed the validity of the extracted relations [8, 12]. In recent years, a two-step approach has become increasingly popular: after the successful identification of causal sequences (Causal Classification), detection of the cause-effect spans is conducted on the positive sequences (Cause-Effect Span Detection). This step-wise approach is practiced by shared tasks [17, 18, 31, 30].
2.2 Datasets
Research in causal text mining has been limited by data deficiency issues [34]. The lack of standardized datasets hinders comparisons in performance across models [1]. In the next few paragraphs, we describe some common causal text mining datasets.
AltLex [11] (https://github.com/chridey/altlex) investigates causal relations expressed through alternative lexicalization (AltLex) connectives in single sentences. AltLex connectives are an open class of markers with varied linguistic constructions [33], like “so (close) that” and “This may help explain why”. The limitations of the AltLex dataset are that it (1) is small in size, (2) ignores explicit and implicit signals in causal relations, (3) ignores inter-sentence causal relations, and (4) assumes the Cause and Effect spans to be all the words before and after the signal.
BECAUSE 2.0 [7] (https://github.com/duncanka/BECAUSE/tree/2.0) contains annotations for causal language in single sentences. Cause, Effect or Connective spans were annotated based on principles of Construction Grammar. Documents and articles were selected from four data sources: Congressional Hearings from the 2014 NLP Unshared Task in PoliInformatics (CHRG), Penn Treebank (PTB), Manually Annotated Sub-Corpus (MASC), and the New York Times Annotated Corpus (NYT). The limitations of BECAUSE 2.0 are that it (1) is relatively small in size and (2) ignores inter-sentence causal relations.
CausalTimeBank (CTB) [19, 20] (https://github.com/paramitamirza/Causal-TimeBank) annotated only explicit causal relations in the TempEval-3 corpus using a rule-based algorithm. EventStoryLine (ESL) [4] (https://github.com/tommasoc80/EventStoryLine) annotated both explicit and implicit causal relations in the Event Coreference Bank+. Both CTB and ESL include annotations of intra- and inter-sentence causal relations between events. These two datasets are popular amongst researchers studying Event Causality Identification (ECI) [35, 36, 3], which aims to classify whether a pair of events is causal given its context. However, these two datasets have limited usage outside the event text mining space because they only annotate events and, furthermore, exclude the context of the event from the argument span.
For Penn Discourse Treebank V3.0 (PDTB) [33] (https://catalog.ldc.upenn.edu/LDC2019T05) and SemEval 2010 Task 8 (SemEval) [10], causal relations were not the original focus of the dataset; they are one of the many relation types annotated. PDTB annotated discourse relations between arguments, expressed either explicitly, implicitly or in AltLex forms. The main limitation of PDTB is that causal relations expressed within clauses are not annotated. SemEval was annotated for the purpose of semantic relation classification, and only noun phrases with common-noun heads were accepted as relation arguments. SemEval’s limitations are that (1) the context is not included in the argument span and (2) it ignores inter-sentence causal relations.
Table 1: Summary of differences across the six source corpora.

| Corpus | Source | Inter-sent | Linguistic | Arguments |
|---|---|---|---|---|
| AltLex [11] | News | No | AltLex | Words before/after signal |
| BECAUSE 2.0 [7] | News, Congress Hearings | No | Explicit | Phrases |
| CausalTimeBank (CTB) [19] | News | Yes | All | Event head word(s) |
| EventStoryLine V1.0 (ESL) [4] | News | Yes | All | Event head word(s) |
| Penn Discourse Treebank V3.0 (PDTB) [33] | News | Yes | All | Clauses |
| SemEval 2010 Task 8 (SemEval) [10] | Web | No | All | Noun phrases |
Table 1 summarizes the differences across the six datasets. The current disarray across causal text mining datasets leads to three key missed opportunities: Firstly, for training, researchers are unable to seamlessly increase models’ exposure to a wide range of examples. Secondly, for evaluation, researchers cannot make fair comparisons of model performance with one another. Thirdly, it is inconvenient for researchers to test their models’ generalizability to other corpora. To address these issues, we propose UniCausal, a large consolidated resource of annotated texts for causal text mining. We relied on the above six high quality corpora and aligned each corpus’ definitions, where possible, to cater to the three causal text mining tasks. With the exception of CTB, all other datasets were annotated by humans.
2.3 Other large causal resources
Although some large corpora or knowledge bases (KBs) that include causal relations already exist, they are annotated in a semi-supervised manner and constructed using rule-based methods (e.g., bootstrapped versions of AltLex [11] and SCITE [14]; causal KBs such as CauseNet [9], CausalNet [16] and CausalBank [15]; and semantic KBs that include causal relations, like FrameNet [26] and ConceptNet [28]). For example, CauseNet [9] detected causal relations automatically using causal dependency path patterns obtained in a bootstrapped fashion. Examples from such corpora are of lower quality and have less linguistic variation. Thus, although they are useful databases of common causal relations, they contribute minimally to training and fair testing of models’ reasoning abilities. We perform a short study on this matter in Section 4.3. In contrast, our corpus aims to capture linguistic, syntactic and semantic variation. Furthermore, by training a text mining model on our extensive corpus, researchers can potentially create an even larger causal KB by extracting more relations than rule-based methods can.
3 Methodology
3.1 Creation of UniCausal
3.1.1 Causal Text Mining Datasets
3.1.2 Causal Text Mining Tasks
There are three causal text mining tasks that we focus on, corresponding to the tasks shown in Figure 1:
(I) Sequence Classification: Given an example, does it contain any causal relations?
(II) Span Detection: Given a causal example, which words in the input text correspond to the Cause and Effect arguments? Identify up to three causal relations and their spans.
(III) Pair Classification: Given the marked argument or entity pair, are they causally related such that the first argument (ARG0) causes the second argument (ARG1)? Since pairs can be Non-causal, we marked the arguments with ARG0 and ARG1 instead of CAUSE and EFFECT.
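To make the three formats concrete, below is an illustrative sketch (the sentence and field names are our own, not taken from the corpora) of how a single causal sentence could be rendered for each task:

```python
# Illustrative only: one causal sentence rendered as an input/output pair for
# each of the three tasks (field names and the sentence are hypothetical).
example = "She fell because the floor was wet."

# (I) Sequence Classification: does the text contain a causal relation?
seq_clf = {"text": example, "label": 1}          # 1 = Causal, 0 = Non-causal

# (II) Span Detection: which words form the Cause and Effect arguments?
span = {
    "text": example,
    "cause": "the floor was wet",
    "effect": "She fell",
}

# (III) Pair Classification: does the marked ARG0 cause the marked ARG1?
pair_clf = {
    "text": "<ARG1>She fell</ARG1> because <ARG0>the floor was wet</ARG0>.",
    "label": 1,                                   # 1 = ARG0 causes ARG1
}
```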
3.1.3 Data Processing
For every dataset, we only keep examples of three sentences or fewer. For Causal sequences, only the sentences that contain the arguments were retained. We split each dataset into train and test sets based on previous works’ recommendations or, where unavailable, randomly. Finally, we process each dataset to fit the format required by each of the three tasks, described below.
Let a unique sequence of text $i$ be represented by a vector of $n$ word tokens $x_i = [x_{i1}, \dots, x_{in}]$. Each sequence has a binary label $y_i \in \{0, 1\}$, representing Causal or Non-causal respectively.
(I) Sequence Classification: Uses both Causal and Non-causal texts. Each example text $x_i$ is unique with a target label $y_i$.
(II) Span Detection: Uses only Causal texts. Each example text is unique, and at the current stage, we focus on examples with up to three cause-effect relations only. We approach the task as token classification, where the annotated spans in the texts were converted to BIO format (Begin (B), Inside (I), Outside (O)) [25] for two types (Cause (C), Effect (E)). Therefore, there are five possible labels per word: B-C, I-C, B-E, I-E and O. The corresponding target token vector is $t_i = [t_{i1}, \dots, t_{in}]$, where each $t_{ij}$ represents one of the five labels. For examples with multiple relations, we sort them based on the location of the B-C, followed by B-E if tied. See Figure 1’s spans for an example. For multiple causal relations, $x_i$ has multiple token vectors $t_i^{(r)}$, where $r \in \{1, 2, 3\}$, since we permit up to three causal relations per unique text.
(III) Pair Classification: Uses both Causal and Non-causal texts. Each example text is unique after taking into account where the special tokens ARG0 and ARG1 are located. For a sequence of text with $n$ word tokens, we include special beginning and end tokens ((<ARG0>, </ARG0>) marks the boundaries of a Cause span, while (<ARG1>, </ARG1>) marks the boundaries of a corresponding Effect span) such that the input word vector is now of length $n + 2a$, where $a$ represents the number of arguments in the example. Each $x_i$ can have multiple versions, depending on the location of the special tokens. For example, in Figure 1, there are three Pair Classification examples for one Sequence Classification example.
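As a minimal, hypothetical sketch (not the repository’s exact code), the conversion of one annotated cause-effect relation into the five BIO-CE token labels used for Span Detection could look as follows:

```python
# A minimal sketch of converting one cause-effect span annotation into the
# five BIO-CE token labels (B-C, I-C, B-E, I-E, O).
def spans_to_bio(tokens, cause_idx, effect_idx):
    """tokens: list of words; cause_idx/effect_idx: (start, end) word indices
    (end exclusive) of the Cause and Effect spans."""
    tags = ["O"] * len(tokens)
    for (start, end), prefix in [(cause_idx, "C"), (effect_idx, "E")]:
        tags[start] = f"B-{prefix}"
        for j in range(start + 1, end):
            tags[j] = f"I-{prefix}"
    return tags

tokens = "She fell because the floor was wet .".split()
print(spans_to_bio(tokens, cause_idx=(3, 7), effect_idx=(0, 2)))
# ['B-E', 'I-E', 'O', 'B-C', 'I-C', 'I-C', 'I-C', 'O']
```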
Table 2: Final number of examples per corpus, split, and task.

| Corpus | Split | (I) Seq: Non-causal | (I) Seq: Causal | (II) Span: Causal | (III) Pair: Non-causal | (III) Pair: Causal |
|---|---|---|---|---|---|---|
| AltLex | Train | 277 | 300 | 300 | 296 | 315 |
| AltLex | Test | 286 | 115 | 115 | 289 | 127 |
| BECAUSE | Train | 183 | 716 | 716 | 266 | 902 |
| BECAUSE | Test | 10 | 41 | 41 | 14 | 46 |
| CTB | Train | 1,651 | 234 | - | 3,047 | 270 |
| CTB | Test | 274 | 42 | - | 444 | 48 |
| ESL | Train | 957 | 1,043 | - | - | - |
| ESL | Test | 119 | 113 | - | - | - |
| PDTB | Train | 24,901 | 8,917 | 8,917 | 32,587 | 9,809 |
| PDTB | Test | 5,796 | 2,055 | 2,055 | 7,694 | 2,294 |
| SemEval | Train | 6,976 | 999 | - | 6,997 | 1,003 |
| SemEval | Test | 2,387 | 328 | - | 2,389 | 328 |
| Total | | 43,817 | 14,903 | 12,144 | 54,023 | 15,142 |
The final data sizes are reflected in Table 2. The number of Span Detection examples tallies with the number of positive Sequence Classification examples because multiple cause-effect relation spans (i.e. $t_i^{(1)}$ and $t_i^{(2)}$) were grouped into a unique example (i.e. the same $x_i$), which we term a ‘grouped’ example. At evaluation, performance metrics were calculated at the ‘ungrouped’ level so that every causal relation is evaluated equally.
Since each data source has a different format, our code extracts the text sequences and relations from several annotation types: it handles BECAUSE’s ‘brat’, ESL’s ‘CAT’ and CTB’s ‘TimeML’/‘XML’ data formats, while PDTB uses its own standoff annotation format. The AltLex and SemEval datasets are more user-friendly, in that they are stored in ‘CSV’ and ‘JSON’ formats and can be interpreted directly from a single file. Due to space constraints, we describe how we handle the different annotation guidelines and our data processing steps for each corpus in detail in our Supplementary Material online. We also upload the data processing code for each source to our repository. The final, post-processed datasets are all stored as ‘CSV’ for convenience. We also built a custom dataset loader based on Huggingface’s load_dataset function, such that users only need to indicate the datasets of interest either as a list within the script (e.g. dataset_name=['altlex','because']) or directly on the command line (e.g. --dataset_name altlex because).
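For illustration, assuming the post-processed CSV splits follow a layout like data/splits/<corpus>_<split>.csv (the paths are hypothetical, not the repository’s actual layout), a comparable way to load a subset of corpora with Huggingface’s load_dataset is:

```python
# A minimal sketch (file paths are hypothetical) of loading post-processed
# CSV splits with Hugging Face datasets, mirroring the dataset_name-style
# selection of corpora described above.
from datasets import load_dataset

selected = ["altlex", "because"]  # analogous to dataset_name=['altlex','because']
data_files = {
    "train": [f"data/splits/{name}_train.csv" for name in selected],
    "test": [f"data/splits/{name}_test.csv" for name in selected],
}
dataset = load_dataset("csv", data_files=data_files)
print(dataset["train"][0])  # e.g. text, sequence label, token tags, pair columns
```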
Table 3: Baseline performance (mean ± standard deviation over five random seeds) when training on all datasets and testing on all and each dataset.

| Test Set | (I) P | (I) R | (I) F1 | (I) Acc | (II) P | (II) R | (II) F1 | (III) P | (III) R | (III) F1 | (III) Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| All | 71.13 ±0.80 | 69.14 ±1.60 | 70.10 ±0.58 | 86.27 ±0.15 | 46.33 ±1.22 | 60.35 ±0.30 | 52.42 ±0.90 | 85.44 ±0.96 | 83.93 ±0.44 | 84.68 ±0.27 | 93.68 ±0.16 |
| AltLex | 50.76 ±1.61 | 63.48 ±4.60 | 56.37 ±2.49 | 71.87 ±1.19 | 27.74 ±1.20 | 42.99 ±0.85 | 33.72 ±1.12 | 82.60 ±1.99 | 87.09 ±1.53 | 84.76 ±0.66 | 90.43 ±0.55 |
| BECAUSE | 92.32 ±1.69 | 70.24 ±2.04 | 79.77 ±1.68 | 71.37 ±2.24 | 32.51 ±2.82 | 44.30 ±2.33 | 37.47 ±2.57 | 87.93 ±1.73 | 94.78 ±1.94 | 91.21 ±1.18 | 86.00 ±1.90 |
| CTB | 42.37 ±2.11 | 66.19 ±4.26 | 51.58 ±1.82 | 83.48 ±1.21 | - | - | - | 75.66 ±3.61 | 72.50 ±6.81 | 73.94 ±4.68 | 95.04 ±0.78 |
| ESL | 76.11 ±2.04 | 67.43 ±3.45 | 71.45 ±1.89 | 73.79 ±1.34 | - | - | - | - | - | - | - |
| PDTB | 72.59 ±0.61 | 66.34 ±1.63 | 69.31 ±0.70 | 84.63 ±0.17 | 47.77 ±1.22 | 61.54 ±0.29 | 53.78 ±0.88 | 84.56 ±1.17 | 82.04 ±0.46 | 83.28 ±0.36 | 92.43 ±0.23 |
| SemEval | 73.39 ±1.18 | 89.51 ±1.59 | 80.64 ±0.46 | 94.81 ±0.16 | - | - | - | 93.38 ±0.88 | 96.10 ±0.59 | 94.71 ±0.23 | 98.70 ±0.07 |

(I) = Sequence Classification, (II) = Span Detection, (III) = Pair Classification.
Table 4: Snapshot of evaluation metrics reported by previous works on each dataset, compared against our baselines.

(I) Sequence Classification

| Corpus | Source | Features | Model | P | R | F1 | Acc |
|---|---|---|---|---|---|---|---|
| AltLex | [11] | Lexical | Support Vector Machine | 61.98 | 58.51 | 60.19 | 67.68 |
| | Ours (All) | BERT | BERT+LR | 50.76 | 63.48 | 56.37 | 71.87 |
| | Ours (AltLex) | BERT | BERT+LR | 50.58 | 53.57 | 51.85 | 71.52 |
| BECAUSE | [37] | Discourse, SA | PSAN | - | - | 81.70 | - |
| | Ours (All) | BERT | BERT+LR | 92.32 | 70.24 | 79.77 | 71.37 |
| | Ours (BECAUSE) | BERT | BERT+LR | 86.20 | 96.01 | 90.77 | 84.31 |
| CTB | [13]^ | n-gram | LR | 100.00 | 22.22 | 36.36 | - |
| | | word2vec | BIGRUATT | 67.04 | 73.89 | 69.98 | - |
| | | ELMO | BIGRUATT | 81.29 | 70.28 | 75.08 | - |
| | | BERT | BERT+LR | 71.17 | 93.33 | 80.55 | - |
| | | BERT | BERT+BIGRUATT | 74.52 | 86.94 | 80.06 | - |
| | Ours (All) | BERT | BERT+LR | 42.37 | 66.19 | 51.58 | 83.48 |
| | Ours (CTB) | BERT | BERT+LR | 71.46 | 58.57 | 63.65 | 91.27 |
| ESL | [13]^ | n-gram | LR | 100.00 | 27.27 | 42.86 | - |
| | | word2vec | BIGRUATT | 70.09 | 60.91 | 63.65 | - |
| | | ELMO | BIGRUATT | 77.47 | 59.09 | 66.55 | - |
| | | BERT | BERT+LR | 62.44 | 87.17 | 72.35 | - |
| | | BERT | BERT+BIGRUATT | 66.15 | 83.64 | 73.09 | - |
| | Ours (All) | BERT | BERT+LR | 76.11 | 67.43 | 71.45 | 73.79 |
| | Ours (ESL) | BERT | BERT+LR | 75.90 | 87.79 | 81.21 | 80.17 |
| PDTB | [23] | Lexical | Shallow CNN | 39.80 | 75.29 | 52.04 | 63.00 |
| | | Lexical | FFNN | 42.04 | 71.74 | 53.01 | 66.44 |
| | | Lexical, positional, event | FFNN | 42.37 | 76.45 | 54.52 | 66.35 |
| | [37] | Discourse, SA | PSAN | - | - | 76.60 | - |
| | [32] | BERT | BERT+LR | - | - | 74.45 | - |
| | Ours (All) | BERT | BERT+LR | 72.59 | 66.34 | 69.31 | 84.63 |
| | Ours (PDTB) | BERT | BERT+LR | 73.54 | 67.35 | 70.31 | 85.11 |
| SemEval | [22] | n-gram | Random Forest | - | - | 52.80 | - |
| | | n-gram | LR | - | - | 81.90 | - |
| | | word2vec | LSTM | - | - | 85.60 | - |
| | | word2vec | LSTM + Self-Attention | - | - | 86.90 | - |
| | [13]^ | n-gram | LR | 88.67 | 66.83 | 76.22 | - |
| | | word2vec | BIGRUATT | 93.96 | 87.59 | 90.64 | - |
| | | ELMO | BIGRUATT | 94.45 | 91.26 | 92.81 | - |
| | | BERT | BERT+LR | 86.62 | 97.09 | 91.55 | - |
| | | BERT | BERT+BIGRUATT | 86.80 | 96.63 | 91.45 | - |
| | Ours (All) | BERT | BERT+LR | 73.39 | 89.51 | 80.64 | 94.81 |
| | Ours (SemEval) | BERT | BERT+LR | 87.84 | 91.40 | 89.58 | 97.43 |

(III) Pair Classification

| Corpus | Source | Features | Model | P | R | F1 | Acc |
|---|---|---|---|---|---|---|---|
| SemEval | [2] | Lexical, semantic, dependency | Bayesian Classifier | - | - | 66.00 | 93.00 |
| | | word2vec | CNN | - | - | 66.00 | 88.00 |
| | | GrammarTags | CNN | - | - | 86.60 | 93.00 |
| | Ours (All) | BERT | BERT+LR | 93.38 | 96.10 | 94.71 | 98.70 |
| | Ours (SemEval) | BERT | BERT+LR | 93.96 | 95.67 | 94.80 | 98.73 |
3.2 Baseline Model
Transformer-based pre-trained language models are the state of the art in NLP. To create our initial benchmark, we used pre-trained Bidirectional Encoder Representations from Transformers (BERT) models [5]. First, sequences are tokenized and encoded into token embeddings $h_i = [h_{i1}, \dots, h_{in}]$. Special start ([CLS]) and end ([SEP]) tokens are added to the input sequence. For the Pair Classification task only, we added four special tokens to the vocabulary to represent the boundaries of the two arguments. The BERT encoder is fine-tuned to our task during training.
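A short sketch of the special-token step using the Huggingface transformers API is shown below (the bert-base-cased checkpoint is an assumption; the paper does not state which BERT variant was used):

```python
# Sketch: add the four argument boundary tokens for Pair Classification and
# resize BERT's embedding matrix accordingly. Checkpoint name is assumed.
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

new_tokens = ["<ARG0>", "</ARG0>", "<ARG1>", "</ARG1>"]
tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens

enc = tokenizer(
    "<ARG1>She fell</ARG1> because <ARG0>the floor was wet</ARG0>.",
    return_tensors="pt",
)
```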
3.2.1 Sequence and Pair Classification
For each classification task, we pool the token embeddings into a sequence embedding by extracting the embedding at the [CLS] token, $h_{i,[\mathrm{CLS}]}$. The sequence embedding is then fed into a sequence classifier that predicts logits $\hat{y}_i$ for the two labels, Causal and Non-causal. We compare the logits with the true sequence label $y_i$ to calculate the Cross-Entropy (CE) Loss for learning:
$\mathcal{L}_{\mathrm{seq}} = -\sum_{c \in \{0,1\}} \mathbb{1}[y_i = c] \log p_{i,c}, \quad p_i = \mathrm{softmax}(\hat{y}_i)$   (1)
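The following minimal sketch (variable names and sizes are ours, and the BERT output is mocked with random tensors) illustrates the [CLS]-pooled classification head and CE loss described above:

```python
# Sketch of the sequence classification head: pool the [CLS] embedding,
# project to two logits, and compute cross-entropy loss (cf. Eq. 1).
import torch
import torch.nn as nn

hidden_size, batch_size, seq_len = 768, 4, 32
token_embeddings = torch.randn(batch_size, seq_len, hidden_size)  # stand-in for BERT output
labels = torch.tensor([1, 0, 1, 0])               # 1 = Causal, 0 = Non-causal

cls_embedding = token_embeddings[:, 0, :]         # embedding at the [CLS] token
classifier = nn.Linear(hidden_size, 2)            # sequence classifier
logits = classifier(cls_embedding)                # logits for the two labels
loss = nn.CrossEntropyLoss()(logits, labels)      # cross-entropy loss
```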
3.2.2 Span Detection
We feed the token embeddings into a token classifier, which returns predicted logits $\hat{t}_{ij}$ over the five BIO-CE token labels for every token; the predicted cause-effect span is obtained by taking the argmax over these logits. Note that given this simple set-up, the span detection model can only predict one cause-effect relation per input sequence. Again, the logits and the true token vector $t_i$ are used to calculate the CE Loss:
$\mathcal{L}_{\mathrm{span}} = -\frac{1}{n} \sum_{j=1}^{n} \sum_{c=1}^{5} \mathbb{1}[t_{ij} = c] \log p_{ij,c}, \quad p_{ij} = \mathrm{softmax}(\hat{t}_{ij})$   (2)
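A corresponding sketch of the token classification head and argmax decoding (again, names, sizes, and the mocked BERT output are illustrative only):

```python
# Sketch of the span-detection head: classify every token into one of the
# five BIO-CE labels and decode by argmax; train with cross-entropy (cf. Eq. 2).
import torch
import torch.nn as nn

labels = ["B-C", "I-C", "B-E", "I-E", "O"]
hidden_size, batch_size, seq_len = 768, 4, 32
token_embeddings = torch.randn(batch_size, seq_len, hidden_size)  # stand-in for BERT output

token_classifier = nn.Linear(hidden_size, len(labels))
logits = token_classifier(token_embeddings)           # (batch, seq_len, 5)
pred_ids = logits.argmax(dim=-1)                      # one label id per token
pred_tags = [[labels[i] for i in row] for row in pred_ids.tolist()]

true_ids = torch.randint(0, len(labels), (batch_size, seq_len))  # toy gold labels
loss = nn.CrossEntropyLoss()(logits.view(-1, len(labels)), true_ids.view(-1))
```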
3.2.3 Evaluation Metrics
For the two Classification tasks, we calculate the Accuracy (Acc), Precision (P), Recall (R) and F1 scores per experiment. For Span Detection, we follow the evaluation metrics from earlier Cause-Effect Span Detection shared tasks [17, 18, 30] and use Macro P, R and F1. The token classification evaluation scheme by seqeval [21, 25] reverts the BIO-formatted labels to their original form (i.e. Cause (C) and Effect (E)) for evaluation. Our default evaluation scripts report metrics over all corpora combined and for each corpus individually. In the next section, we report the average and standard deviation of scores obtained from multiple runs using five random seeds.
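For reference, a small sketch of how seqeval computes these metrics on BIO-CE tags (the tag sequences below are toy examples):

```python
# Sketch: seqeval reverts BIO-CE tags to their entity types (C, E) and
# reports macro-averaged P/R/F1 plus a per-type breakdown.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-E", "I-E", "O", "B-C", "I-C", "I-C", "I-C", "O"]]
y_pred = [["B-E", "I-E", "O", "B-C", "I-C", "I-C", "O", "O"]]

print(f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred))  # per-type scores for C and E
```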
4 Experiments
In this section, we describe experiments performed on the UniCausal corpus.
4.1 Baseline Performance
In Table 3, we present the performance of the baseline BERT models when trained on all datasets, and tested on all and each dataset. Across all test sets, baseline models achieved 70.10% Binary F1 score for Sequence Classification, 52.42% Macro F1 score for Span Detection, and 84.68% Binary F1 score for Pair Classification.
Overall, regardless of the dataset, performance on Pair Classification is always better than on Sequence Classification, and F1 scores for Span Detection are poor in comparison to both Classification tasks. This finding correlates with the difficulty of each task: for Pair Classification, since prompts that already identify the arguments are provided, it is arguably a simpler task than Sequence Classification. Span Detection is much more challenging than both Classification tasks because it involves accurately identifying the words that correspond to the cause and effect, not merely recognising that they exist. Furthermore, the baseline token classification set-up is simplistic and unable to handle multiple cause-effect span relations in the same sequence: for each input text, only one pair of Cause and Effect is predicted, so if multiple relations exist, only one pair can be predicted correctly at best.
In Table 4, we provide a snapshot of evaluation metrics reported by previous works on the datasets. It is challenging to make claims about model superiority from this table alone, since different papers used different train-test splits and some altered the dataset composition by rebalancing it. Nevertheless, for datasets like AltLex and SemEval, the evaluation set was predefined by the dataset creators, so comparisons between previous work and ours can be made concretely. For Sequence Classification on AltLex, [11]’s handcrafted lexical features fed through a Support Vector Machine achieved an F1 score of 60.19%, beating our models. For Sequence Classification on SemEval, our best F1 score of 89.58% surpasses the methods covered by [22], which, at best, achieved 86.90% using word2vec embeddings fed through a Long Short-Term Memory network with self-attention. Finally, for Pair Classification on SemEval, our BERT-based model consistently surpasses the Bayesian Classifier and Convolutional Neural Network methods explored by [2].
All in all, our baseline model is simple but competitive. From Table 4 alone, it is apparent that the Causal Text Mining community lacks a consistent way to benchmark performance. Therefore, we hope that from here on, the scores presented in Table 3 will serve as an initial, universal baseline score for the Causal Text Mining community to beat.
4.2 Impact of Datasets
In Table 5, we present F1 scores when training and testing on different corpora. This table reflects how compatible each corpus is with the others. When testing on all datasets, we notice that training on all datasets returns the best performance across all tasks by a large margin; training on any single dataset was unable to achieve similar performance. Meanwhile, the generalized model trained on all datasets did not always return the best performance for each individual corpus. Given the differences in definitions and linguistic coverage of each dataset, it is expected that for some datasets, specializing in their own data distribution leads to better performance. However, such specialized models are more likely to overfit and lack generalizability. Thus, good performance on one dataset but not others should be treated with caution.
Table 5: F1 scores when training (rows) and testing (columns) on different corpora.

(I) Sequence Classification

| Training Set \ Test Set | All | AltLex | BECAUSE | CTB | ESL | PDTB | SemEval |
|---|---|---|---|---|---|---|---|
| All | 70.10 ±0.58 | 56.37 ±2.49 | 79.77 ±1.68 | 51.58 ±1.82 | 71.45 ±1.89 | 69.31 ±0.70 | 80.64 ±0.46 |
| AltLex | 32.93 ±3.57*** | 51.85 ±2.53 | 36.47 ±11.18*** | 38.21 ±6.20* | 53.30 ±8.37** | 22.91 ±5.79*** | 55.83 ±6.68*** |
| BECAUSE | 39.15 ±0.99*** | 47.02 ±1.52** | 90.77 ±2.22*** | 25.17 ±1.34*** | 63.49 ±1.94** | 42.49 ±0.68*** | 23.71 ±1.93*** |
| CTB | 33.49 ±5.48*** | 55.91 ±7.63 | 54.73 ±9.40** | 63.65 ±5.55** | 33.26 ±15.44** | 25.97 ±3.73*** | 51.76 ±13.85** |
| ESL | 39.62 ±0.89*** | 46.29 ±1.15** | 90.12 ±1.05*** | 30.84 ±1.35*** | 81.21 ±2.35*** | 42.55 ±1.25*** | 26.15 ±2.62*** |
| PDTB | 60.99 ±0.76*** | 48.94 ±1.88** | 69.61 ±2.16** | 39.54 ±1.88*** | 38.71 ±3.15*** | 70.31 ±0.56* | 19.75 ±3.35*** |
| SemEval | 28.25 ±0.86*** | 28.95 ±1.74*** | 16.91 ±3.40*** | 38.51 ±3.44** | 45.95 ±3.50*** | 10.11 ±1.61*** | 89.58 ±0.71*** |

(II) Span Detection

| Training Set \ Test Set | All | AltLex | BECAUSE | PDTB |
|---|---|---|---|---|
| All | 52.42 ±0.90 | 33.72 ±1.12 | 37.47 ±2.57 | 53.78 ±0.88 |
| AltLex | 6.20 ±0.74*** | 21.45 ±1.87*** | 11.51 ±1.63*** | 5.47 ±0.76*** |
| BECAUSE | 12.74 ±0.35*** | 7.38 ±2.19*** | 37.79 ±5.77 | 12.60 ±0.34*** |
| PDTB | 51.97 ±0.48 | 6.73 ±0.94*** | 35.84 ±2.42 | 55.02 ±0.38* |

(III) Pair Classification

| Training Set \ Test Set | All | AltLex | BECAUSE | CTB | PDTB | SemEval |
|---|---|---|---|---|---|---|
| All | 84.68 ±0.27 | 84.76 ±0.66 | 91.21 ±1.18 | 73.94 ±4.68 | 83.28 ±0.36 | 94.71 ±0.23 |
| AltLex | 31.83 ±3.93*** | 80.57 ±2.48* | 48.44 ±20.00** | 20.06 ±7.14*** | 25.11 ±8.75*** | 57.72 ±14.52** |
| BECAUSE | 36.40 ±0.64*** | 47.99 ±1.33*** | 90.01 ±1.95 | 23.58 ±1.52*** | 38.39 ±0.37*** | 25.23 ±2.02*** |
| CTB | 20.17 ±5.78*** | 19.16 ±15.64*** | 22.00 ±10.92*** | 73.29 ±6.14 | 7.02 ±6.06*** | 63.69 ±5.65*** |
| PDTB | 68.13 ±0.88*** | 40.34 ±1.52*** | 82.59 ±2.17*** | 26.74 ±2.42*** | 83.70 ±0.34 | 33.64 ±1.76*** |
| SemEval | 26.66 ±1.86*** | 37.07 ±6.58*** | 25.70 ±11.46*** | 50.63 ±1.74*** | 8.08 ±3.20*** | 94.80 ±0.28 |
4.3 Adding CauseNet to investigate the importance of linguistic variation in examples
Researchers might want to incorporate custom data to train and test their models. The modular structure of our code, in terms of data loading and evaluation, allows this. To illustrate, we obtained around 50,000 training and 5,000 testing causal examples from CauseNet [9] suitable for Pair Classification. After data processing, it is straightforward to load CauseNet alongside the other datasets by passing ‘causenet’ to the dataset_name argument.
In terms of F1, we only found improvements in performance for CTB (77.08%) and AltLex (85.04%), which are rule-based and relatively template-based respectively. Unsurprisingly, the model also performs perfectly on CauseNet itself. BECAUSE (89.80%), PDTB (83.20%), and SemEval (94.31%) saw performance drop instead. Despite incorporating a large number of rule-based causal examples, the model does not learn more about the semantics of causal relations. We believe this is due to the lack of linguistic variation covered. This again motivates our focus on including mainly human-annotated data in UniCausal, for both better training and fairer testing.
5 Conclusion
We propose UniCausal, a unified resource and benchmark for causal text mining. Our code was designed to allow researchers to work on some or all datasets and tasks, while still comparing their performance fairly against ours or others’. Researchers can easily include new datasets too. In this paper, we provided evaluation metrics per dataset as an initial benchmark for future researchers to compete against.
We hope to see researchers use UniCausal to design joint models that concurrently learn from multiple causal text mining tasks and datasets. A unified model will be able to collaboratively learn about causality from various objectives and knowledge sources, to be more universally adaptable and generalizable to unseen examples.
For future work, we intend to include more datasets relevant to causal text mining, like the Son Facebook dataset [27], FinCausal [17, 18], and the Causal News Corpus [31, 30]. We also plan to expand UniCausal to include longer examples with more sentences. Finally, we will replicate more models to include in our benchmark.
References
- [1] Asghar, N.: Automatic extraction of causal relations from natural language texts: A comprehensive survey. CoRR abs/1605.07895 (2016), http://arxiv.org/abs/1605.07895
- [2] Ayyanar, R., Koomullil, G., Ramasangu, H.: Causal relation classification using convolutional neural networks and grammar tags. In: 2019 IEEE 16th India Council International Conference (INDICON). pp. 1–3 (2019). https://doi.org/10.1109/INDICON47234.2019.9028985
- [3] Cao, P., Zuo, X., Chen, Y., Liu, K., Zhao, J., Chen, Y., Peng, W.: Knowledge-enriched event causality identification via latent structure induction networks. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 4862–4872. Association for Computational Linguistics, Online (Aug 2021). https://doi.org/10.18653/v1/2021.acl-long.376, https://aclanthology.org/2021.acl-long.376
- [4] Caselli, T., Vossen, P.: The event StoryLine corpus: A new benchmark for causal and temporal relation extraction. In: Proceedings of the Events and Stories in the News Workshop. pp. 77–86. Association for Computational Linguistics, Vancouver, Canada (Aug 2017). https://doi.org/10.18653/v1/W17-2711, https://aclanthology.org/W17-2711
- [5] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423
- [6] Dunietz, J., Burnham, G., Bharadwaj, A., Rambow, O., Chu-Carroll, J., Ferrucci, D.: To test machine comprehension, start by defining comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 7839–7859. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.701, https://aclanthology.org/2020.acl-main.701
- [7] Dunietz, J., Levin, L., Carbonell, J.: The BECauSE corpus 2.0: Annotating causality and overlapping relations. In: Proceedings of the 11th Linguistic Annotation Workshop. pp. 95–104. Association for Computational Linguistics, Valencia, Spain (Apr 2017). https://doi.org/10.18653/v1/W17-0812, https://aclanthology.org/W17-0812
- [8] Girju, R.: Automatic detection of causal relations for question answering. In: Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering. pp. 76–83. Association for Computational Linguistics, Sapporo, Japan (Jul 2003). https://doi.org/10.3115/1119312.1119322, https://aclanthology.org/W03-1210
- [9] Heindorf, S., Scholten, Y., Wachsmuth, H., Ngomo, A.N., Potthast, M.: Causenet: Towards a causality graph extracted from the web. In: d’Aquin, M., Dietze, S., Hauff, C., Curry, E., Cudré-Mauroux, P. (eds.) CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020. pp. 3023–3030. ACM (2020). https://doi.org/10.1145/3340531.3412763, https://doi.org/10.1145/3340531.3412763
- [10] Hendrickx, I., Kim, S.N., Kozareva, Z., Nakov, P., Ó Séaghdha, D., Padó, S., Pennacchiotti, M., Romano, L., Szpakowicz, S.: SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In: Proceedings of the 5th International Workshop on Semantic Evaluation. pp. 33–38. Association for Computational Linguistics, Uppsala, Sweden (Jul 2010), https://aclanthology.org/S10-1006
- [11] Hidey, C., McKeown, K.: Identifying causal relations using parallel Wikipedia articles. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1424–1433. Association for Computational Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-1135, https://aclanthology.org/P16-1135
- [12] Ittoo, A., Bouma, G.: Minimally-supervised learning of domain-specific causal relations using an open-domain corpus as knowledge base. Data Knowl. Eng. 88, 142–163 (2013). https://doi.org/10.1016/j.datak.2013.08.004, https://doi.org/10.1016/j.datak.2013.08.004
- [13] Kyriakakis, M., Androutsopoulos, I., Saudabayev, A., Ginés i Ametllé, J.: Transfer learning for causal sentence detection. In: Proceedings of the 18th BioNLP Workshop and Shared Task. pp. 292–297. Association for Computational Linguistics, Florence, Italy (Aug 2019). https://doi.org/10.18653/v1/W19-5031, https://aclanthology.org/W19-5031
- [14] Li, Z., Li, Q., Zou, X., Ren, J.: Causality extraction based on self-attentive bilstm-crf with transferred embeddings. Neurocomputing 423, 207–219 (2021). https://doi.org/10.1016/j.neucom.2020.08.078, https://doi.org/10.1016/j.neucom.2020.08.078
- [15] Li, Z., Ding, X., Liu, T., Hu, J.E., Durme, B.V.: Guided generation of cause and effect. In: Bessiere, C. (ed.) Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020. pp. 3629–3636. ijcai.org (2020). https://doi.org/10.24963/ijcai.2020/502, https://doi.org/10.24963/ijcai.2020/502
- [16] Luo, Z., Sha, Y., Zhu, K.Q., Hwang, S., Wang, Z.: Commonsense causal reasoning between short texts. In: Baral, C., Delgrande, J.P., Wolter, F. (eds.) Principles of Knowledge Representation and Reasoning: Proceedings of the Fifteenth International Conference, KR 2016, Cape Town, South Africa, April 25-29, 2016. pp. 421–431. AAAI Press (2016), http://www.aaai.org/ocs/index.php/KR/KR16/paper/view/12818
- [17] Mariko, D., Abi-Akl, H., Labidurie, E., Durfort, S., De Mazancourt, H., El-Haj, M.: The financial document causality detection shared task (FinCausal 2020). In: Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation. pp. 23–32. COLING, Barcelona, Spain (Online) (Dec 2020), https://aclanthology.org/2020.fnp-1.3
- [18] Mariko, D., Akl, H.A., Labidurie, E., Durfort, S., de Mazancourt, H., El-Haj, M.: The financial document causality detection shared task (FinCausal 2021). In: Proceedings of the 3rd Financial Narrative Processing Workshop. pp. 58–60. Association for Computational Linguistics, Lancaster, United Kingdom (15-16 Sep 2021), https://aclanthology.org/2021.fnp-1.10
- [19] Mirza, P., Sprugnoli, R., Tonelli, S., Speranza, M.: Annotating causality in the TempEval-3 corpus. In: Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL). pp. 10–19. Association for Computational Linguistics, Gothenburg, Sweden (Apr 2014). https://doi.org/10.3115/v1/W14-0702, https://aclanthology.org/W14-0702
- [20] Mirza, P., Tonelli, S.: An analysis of causality between events and its relation to temporal information. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. pp. 2097–2106. Dublin City University and Association for Computational Linguistics, Dublin, Ireland (Aug 2014), https://aclanthology.org/C14-1198
- [21] Nakayama, H.: seqeval: A python framework for sequence labeling evaluation (2018), https://github.com/chakki-works/seqeval, software available from https://github.com/chakki-works/seqeval
- [22] Niki, Y., Sakaji, H., Izumi, K., Matsushima, H.: Causality existence classification from multilingual texts using end-to-end LSTM models. In: Papapetrou, P., Cheng, X., He, Q. (eds.) 2019 International Conference on Data Mining Workshops, ICDM Workshops 2019, Beijing, China, November 8-11, 2019. pp. 17–23. IEEE (2019). https://doi.org/10.1109/ICDMW.2019.00011, https://doi.org/10.1109/ICDMW.2019.00011
- [23] Ponti, E.M., Korhonen, A.: Event-related features in feedforward neural networks contribute to identifying causal relations in discourse. In: Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics. pp. 25–30. Association for Computational Linguistics, Valencia, Spain (Apr 2017). https://doi.org/10.18653/v1/W17-0903, https://aclanthology.org/W17-0903
- [24] Radinsky, K., Horvitz, E.: Mining the web to predict future events. In: Leonardi, S., Panconesi, A., Ferragina, P., Gionis, A. (eds.) Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, Rome, Italy, February 4-8, 2013. pp. 255–264. ACM (2013). https://doi.org/10.1145/2433396.2433431, https://doi.org/10.1145/2433396.2433431
- [25] Ramshaw, L., Marcus, M.: Text chunking using transformation-based learning. In: Third Workshop on Very Large Corpora (1995), https://aclanthology.org/W95-0107
- [26] Ruppenhofer, J., Ellsworth, M., Schwarzer-Petruck, M., Johnson, C.R., Scheffczyk, J.: Framenet ii: Extended theory and practice. Tech. rep., International Computer Science Institute (2016)
- [27] Son, Y., Bayas, N., Schwartz, H.A.: Causal explanation analysis on social media. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 3350–3359. Association for Computational Linguistics, Brussels, Belgium (Oct-Nov 2018). https://doi.org/10.18653/v1/D18-1372, https://aclanthology.org/D18-1372
- [28] Speer, R., Chin, J., Havasi, C.: Conceptnet 5.5: An open multilingual graph of general knowledge. In: Singh, S.P., Markovitch, S. (eds.) Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. pp. 4444–4451. AAAI Press (2017), http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972
- [29] Stasaski, K., Rathod, M., Tu, T., Xiao, Y., Hearst, M.A.: Automatically generating cause-and-effect questions from passages. In: Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications. pp. 158–170. Association for Computational Linguistics, Online (Apr 2021), https://aclanthology.org/2021.bea-1.17
- [30] Tan, F.A., Hettiarachchi, H., Hürriyetoğlu, A., Caselli, T., Uca, O., Liza, F.F., Oostdijk, N.: Event causality identification with causal news corpus - shared task 3, CASE 2022. In: Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE). pp. 195–208. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid) (Dec 2022), https://aclanthology.org/2022.case-1.28
- [31] Tan, F.A., Hürriyetoğlu, A., Caselli, T., Oostdijk, N., Nomoto, T., Hettiarachchi, H., Ameer, I., Uca, O., Liza, F.F., Hu, T.: The causal news corpus: Annotating causal relations in event sentences from news. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference. pp. 2298–2310. European Language Resources Association, Marseille, France (Jun 2022), https://aclanthology.org/2022.lrec-1.246
- [32] Tan, F.A., Hürriyetoğlu, A., Caselli, T., Oostdijk, N., Nomoto, T., Hettiarachchi, H., Ameer, I., Uca, O., Liza, F.F., Hu, T.: The causal news corpus: Annotating causal relations in event sentences from news. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference. pp. 2298–2310. European Language Resources Association, Marseille, France (Jun 2022), https://aclanthology.org/2022.lrec-1.246
- [33] Webber, B., Prasad, R., Lee, A., Joshi, A.: The penn discourse treebank 3.0 annotation manual. Philadelphia, University of Pennsylvania (2019)
- [34] Yang, J., Han, S.C., Poon, J.: A survey on extraction of causal relations from natural language text. Knowledge and Information Systems (Mar 2022). https://doi.org/10.1007/s10115-022-01665-w, https://doi.org/10.1007/s10115-022-01665-w
- [35] Zuo, X., Cao, P., Chen, Y., Liu, K., Zhao, J., Peng, W., Chen, Y.: Improving event causality identification via self-supervised representation learning on external causal statement. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 2162–2172. Association for Computational Linguistics, Online (Aug 2021). https://doi.org/10.18653/v1/2021.findings-acl.190, https://aclanthology.org/2021.findings-acl.190
- [36] Zuo, X., Chen, Y., Liu, K., Zhao, J.: KnowDis: Knowledge enhanced data augmentation for event causality detection via distant supervision. In: Proceedings of the 28th International Conference on Computational Linguistics. pp. 1544–1550. International Committee on Computational Linguistics, Barcelona, Spain (Online) (Dec 2020). https://doi.org/10.18653/v1/2020.coling-main.135, https://aclanthology.org/2020.coling-main.135
- [37] Zuo, X., Chen, Y., Liu, K., Zhao, J.: Towards causal explanation detection with pyramid salient-aware network. In: Proceedings of the 19th Chinese National Conference on Computational Linguistics. pp. 903–914. Chinese Information Processing Society of China, Haikou, China (Oct 2020), https://aclanthology.org/2020.ccl-1.84