Knowledge-Augmented Language Models for Cause-Effect Relation Classification
Abstract
Previous studies have shown the efficacy of knowledge augmentation methods in pretrained language models. However, these methods behave differently across domains and downstream tasks. In this work, we investigate the augmentation of pretrained language models with commonsense knowledge for the cause-effect relation classification and commonsense causal reasoning tasks. After automatically verbalizing ATOMIC, a wide-coverage commonsense reasoning knowledge graph, and GLUCOSE, a dataset of implicit commonsense causal knowledge, we continually pretrain BERT and RoBERTa on the verbalized data. We then evaluate the resulting models on cause-effect pair classification and on answering commonsense causal reasoning questions. Our results show that continually pretrained language models augmented with commonsense knowledge outperform our baselines on two commonsense causal reasoning benchmarks, COPA and BCOPA-CE, and on the Temporal and Causal Reasoning (TCR) dataset, without any changes to the model architecture or quality-enhanced fine-tuning data.
1 Introduction
Automatic extraction and classification of causal relations in text have long been important yet challenging tasks in natural language understanding. Early methods in the 80s and 90s Joskowicz et al. (1989); Kaplan and Berry-Rogghe (1991); Garcia et al. (1997); Khoo et al. (1998) mainly relied on hand-crafted rules to find cause-effect relations. Starting in 2000, machine learning tools were utilized in building causal relation extraction models Girju (2003); Chang and Choi (2004, 2006); Blanco et al. (2008); Do et al. (2011); Hashimoto et al. (2012); Hidey and McKeown (2016). Word embeddings and Pretrained Language Models (PLMs) have also been leveraged in recent years to train models for understanding causality in language Dunietz et al. (2018); Pennington et al. (2014); Dasgupta et al. (2018); Gao et al. (2019). Knowledge Graphs (KGs) have likewise been used in combination with pretrained language models to address commonsense reasoning Li et al. (2020); Guan et al. (2020). Despite all these efforts, the true capability of pretrained language models to understand causality in text remains an open question.

In this work, motivated by the success of continual pretraining of PLMs for downstream tasks Gururangan et al. (2020), we explore the impact of commonsense knowledge injection, as a form of continual pretraining, on causal reasoning and cause-effect relation classification. It is worth highlighting that even though studies have shown the efficacy of knowledge injection via continual pretraining for commonsense reasoning Guan et al. (2020), the performance of these techniques depends heavily on the domain and downstream task Gururangan et al. (2020). Moreover, to the best of our knowledge, studies on the effect of commonsense knowledge injection on causal relation classification remain limited Dalal et al. (2021). Our contributions are as follows:
• We study the performance of PLMs augmented with commonsense knowledge on the less investigated task of cause-effect relation classification.
• We demonstrate that a simple masked language modeling framework using automatically verbalized commonsense knowledge, without any further model improvement (e.g., a new architecture or loss function) or quality-enhanced fine-tuning data, can significantly boost the performance of PLMs on cause-effect pair classification.
• We publicly release our knowledge graph verbalization code and continually pretrained models.
2 Method
The overview of our method is shown in Figure 1 (code and models are publicly available at https://github.com/phosseini/causal-reasoning). In our framework, we start by verbalizing the ATOMIC knowledge graph Hwang et al. (2021) and GLUCOSE Mostafazadeh et al. (2020) into natural language text. Then we continually pretrain BERT Devlin et al. (2018) and RoBERTa Liu et al. (2019) using Masked Language Modeling (MLM) and evaluate the performance of the resulting models on different benchmarks. We delineate each of these steps in the following sections.
2.1 ATOMIC to Text
Samples in ATOMIC are stored as triples of the form (head/subject, relation, tail/target) in three splits: train, development, and test. We only use the train and development sets here. ATOMIC has 23 relation types grouped into three categories: commonsense relations of social interactions, physical-entity commonsense relations, and event-centric commonsense relations. In the rest of the paper, we refer to these categories as social, physical, and event, respectively. The distribution of these relations is shown in Figure 2. Each relation in ATOMIC is associated with a human-readable template; for example, the templates for xEffect and HasPrerequisite are "as a result, PersonX will" and "to do this, one requires", respectively. We use these templates to convert ATOMIC triples into natural language sentences (verbalization) by concatenating the subject, the relation template, and the target.

Before verbalizing triples, we remove all duplicates and discard all triples whose target value is none. Moreover, we discard all triples that include a blank: since masked language modeling requires the gold value of each masked token, a triple that already contains a blank (a masked token/word) may not help our pretraining. For instance, in the triple [PersonX affords another ___, xAttr, useful], it is hard to know why, or understand what it means for, a person to be useful without knowing what they afforded. This preprocessing step yields 782,848 triples: 121,681, 177,706, and 483,461 from the event, physical, and social categories, respectively.
Examples of converting triples to text are shown in Figure 3.
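To make the filtering and concatenation concrete, here is a minimal sketch of this step. The helper name and the example triple are ours, and only two of the 23 relation templates are filled in; the exact preprocessing order in our pipeline may differ.

```python
from typing import Optional

# Relation name -> human-readable template. The two entries below are the
# templates mentioned in the text; the remaining ATOMIC relations would be
# added the same way.
TEMPLATES = {
    "xEffect": "as a result, PersonX will",
    "HasPrerequisite": "to do this, one requires",
}

def verbalize_atomic(head: str, relation: str, tail: str) -> Optional[str]:
    """Turn an ATOMIC (head, relation, tail) triple into a sentence.

    Returns None for triples we discard: those whose target is `none` and
    those containing a blank, since MLM needs the gold value of every
    masked token.
    """
    if tail.strip().lower() == "none" or "___" in f"{head} {tail}":
        return None
    return f"{head} {TEMPLATES[relation]} {tail}"

# Illustrative triple (not taken from ATOMIC):
print(verbalize_atomic("PersonX pays PersonY a compliment", "xEffect", "smile"))
# -> PersonX pays PersonY a compliment as a result, PersonX will smile
```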

2.2 GLUCOSE to Text
GLUCOSE is a large-scale dataset of implicit commonsense causal knowledge. Each data point in GLUCOSE includes ten dimensions of causal explanation for a selected sentence in a story, focusing on events, states, motivations, and emotions. Half of these dimensions are specific causal statements, and the remaining half are general rules that capture the implicit commonsense knowledge. Using slightly modified versions of the templates provided for causal connectives in GLUCOSE, we concatenate the two spans of a causal relation with the relation's template to form a verbalized sample. The causal connectives in GLUCOSE are: [>Causes/Enables>, >Motivates>, >Enables>, >Causes>, >Results in>]. Verbalization of a GLUCOSE sample is shown in Figure 4 and sketched in code below. In the end, we randomly split the verbalized samples into train (90%) and development (10%) sets.
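The following sketch illustrates both the span concatenation and the random split. The verbalized phrases in the connective map are illustrative stand-ins for our slightly modified GLUCOSE templates, and the function names are ours.

```python
import random
from typing import List, Tuple

# Connective -> verbalized phrase (illustrative values, not the exact
# modified templates used in our pipeline).
CONNECTIVES = {
    ">Causes/Enables>": "causes or enables",
    ">Motivates>": "motivates",
    ">Enables>": "enables",
    ">Causes>": "causes",
    ">Results in>": "results in",
}

def verbalize_glucose(span1: str, connective: str, span2: str) -> str:
    """Concatenate the two spans of a causal relation with its template."""
    return f"{span1} {CONNECTIVES[connective]} {span2}"

def train_dev_split(samples: List[str],
                    dev_ratio: float = 0.1) -> Tuple[List[str], List[str]]:
    """Randomly split verbalized samples into train (90%) and dev (10%)."""
    shuffled = samples[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - dev_ratio))
    return shuffled[:cut], shuffled[cut:]
```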

2.3 Checking Grammar
When we verbalize ATOMIC and GLUCOSE samples into natural language text, we ideally want grammatically correct sentences. The human-readable templates provided by ATOMIC and GLUCOSE do not always render into error-free sentences. To address this, we use an open-source grammar and spell checker, LanguageTool (https://tinyurl.com/yc77k3fb), to double-check the converted triples and ensure they contain no obvious grammatical mistakes or spelling errors. Similar approaches involving deterministic grammatical transformations have previously been used to convert KG triples into coherent sentences Davison et al. (2019). It is worth pointing out that Data-to-Text generation (KG verbalization) is a separate task in its own right with dedicated prior work Agarwal et al. (2021). We leave investigating the effects of other Data-to-Text and grammar-checking methods to future research.
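As an illustration, such a check can be run from Python via the language_tool_python wrapper; using this particular wrapper is an assumption of the sketch (any LanguageTool client would work similarly), and the function name is ours.

```python
import language_tool_python

# Start a local LanguageTool instance for English.
tool = language_tool_python.LanguageTool("en-US")

def check_and_fix(sentence: str) -> str:
    """Return the sentence with LanguageTool's suggested corrections applied."""
    matches = tool.check(sentence)  # list of detected grammar/spelling issues
    return tool.correct(sentence) if matches else sentence
```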
2.4 Continual Pretraining
As mentioned earlier, we use MLM (via Huggingface's BertForMaskedLM) to continually pretrain our PLMs, bert-large-cased and roberta-large. We follow the same procedure as BERT to create the input data for pretraining (e.g., the number of tokens to mask in input examples). We run the pretraining on the train and development splits of ATOMIC and GLUCOSE (separately) as our training and evaluation sets, respectively, for 10 epochs on a Google Colab TPU v2 using the PyTorch/XLA package, with a maximum sequence length of 30 (99.99% of verbalized instances have 30 tokens or fewer) and a batch size of 128. To avoid overfitting, we use early stopping on the evaluation loss with a patience of 5, and select the best model based on the lowest evaluation loss at the end of training.
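The sketch below condenses this setup using Huggingface's Trainer. It assumes the verbalized splits live in plain-text files (train.txt and dev.txt are placeholders, one sentence per line); it is not our exact training script, and argument names may differ slightly across transformers releases. The TPU/XLA launch details are omitted.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = BertForMaskedLM.from_pretrained("bert-large-cased")

# "train.txt" / "dev.txt" are placeholders for the verbalized splits.
raw = load_dataset("text", data_files={"train": "train.txt", "dev": "dev.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=30),
    batched=True, remove_columns=["text"])

# BERT-style masking: 15% of tokens are selected for masking.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mlm-ckpt",
    num_train_epochs=10,
    per_device_train_batch_size=128,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # keep the lowest-eval-loss checkpoint
    metric_for_best_model="eval_loss",
)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["dev"],
                  data_collator=collator,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=5)])
trainer.train()
```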

3 Experiments
3.1 Benchmarks
We chose multiple benchmarks of commonsense causal reasoning and cause-effect relation classification to thoroughly test the effects of our newly trained models. These benchmarks are: 1) the Temporal and Causal Reasoning (TCR) dataset Ning et al. (2018), a benchmark for joint reasoning about temporal and causal relations; 2) the Choice Of Plausible Alternatives (COPA) dataset Roemmele et al. (2011), a widely used and notable benchmark Rogers et al. (2021) for commonsense causal reasoning; and 3) BCOPA-CE Han and Wang (2021), a newer benchmark inspired by COPA whose unbiased token distributions make it more challenging. For COPA-related experiments, since COPA has no training set, we use COPA's development set to fine-tune our models and test them on COPA's test set (COPA-test) and on BCOPA-CE. For hyperparameter tuning, we randomly split COPA's development set into train (90%) and dev (10%) portions and find the best learning rate, batch size, and number of training epochs based on evaluation accuracy on the dev portion. Then, using COPA's original development set and the best hyperparameters, we fine-tune our models and evaluate them on the test set. For TCR, since there is no development set and TCR's train split is too small to carve one out, we skip hyperparameter tuning and fine-tune all models for 10 epochs with a batch size of 8 and a learning rate of 2e-5 on the train set, evaluating the fine-tuned models on the test set. In all experiments, we report the average performance across eight runs with different random seeds.
3.2 Models and Baseline
We use the bert-large-cased and roberta-large pretrained models as our baselines. For COPA and BCOPA-CE, we convert all instances to SWAG-formatted data Zellers et al. (2018) and use Huggingface's BertForMultipleChoice, a BERT model with a multiple-choice classification head on top. For TCR, we convert every instance by adding special tokens to the input sequence as event boundaries and use the R-BERT model Wu and He (2019) (we use the implementation at https://github.com/monologg/R-BERT). We chose R-BERT for relation classification because it not only leverages the pretrained embeddings but also transfers information about the target entities (e.g., the events in a relation) through the model's architecture by incorporating their encodings. Examples of COPA and TCR instances are shown in Figure 6; BCOPA-CE has the same format as COPA.
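To make the COPA setup concrete, here is a minimal scoring sketch with BertForMultipleChoice. The premise and alternatives are invented for illustration, and this shows inference only, not our fine-tuning loop.

```python
import torch
from transformers import AutoTokenizer, BertForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = BertForMultipleChoice.from_pretrained("bert-large-cased")

# An invented COPA-style instance: one premise paired with two alternatives.
premise = "The man lost his balance on the ladder."
choices = ["He fell off the ladder.", "He painted the ceiling."]

# Encode each (premise, choice) pair; BertForMultipleChoice expects tensors
# of shape (batch_size, num_choices, seq_len).
enc = tokenizer([premise] * len(choices), choices,
                return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, num_choices)
print("predicted alternative:", logits.argmax(dim=-1).item())
```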

4 Results and Discussion
Results of our experiments on TCR are shown in Table 1. Our best model, continually pretrained with GLUCOSE, significantly outperforms both our baseline and the joint inference framework of Ning et al. (2018), which is formulated as an integer linear programming (ILP) problem.
Table 1: Accuracy on the TCR test set.

| Model | Acc (%) |
|---|---|
| Joint system Ning et al. (2018) | 77.3 |
| Our Models | |
| BERT-Large (baseline) | 79.1 (0.1) |
| ATOMIC-BERT-Large | 80.9 (0.11) |
| GLUCOSE-BERT-Large | 83.9 (0.02) |
Results of experiments on COPA-test are shown in Table 2. All our models significantly outperform our baselines, and the gap between the baseline and the best model is larger for the roberta models. Moreover, GLUCOSE models, despite being trained on significantly fewer data points (70k), perform on par with, and even slightly better than, models trained on ATOMIC (121k for event only and 780k for all three types). We also observe that ATOMIC models continually pretrained using only event relations achieve almost the same performance as models trained on all three relation types with 6x more training data points. Taking a closer look at each relation type, one reason may be that event-centric relations in ATOMIC specifically capture commonsense knowledge about event interactions for understanding likely causal relations between events in the world Hwang et al. (2021). In addition, event relations have relatively longer contexts (in number of tokens) than the average across all three relation types combined, giving the model more context to learn from.
Table 2: Accuracy on COPA-test.

| Model | Acc (%) |
|---|---|
| PMI Roemmele et al. (2011) | 58.8 |
| b-l-reg Han and Wang (2021) | 71.1 |
| Google T5-base Raffel et al. (2019) | 71.2 |
| BERT-Large Kavumba et al. (2019) | 76.5 |
| CausalBERT Li et al. (2020) | 78.6 |
| BERT-SocialIQA Sap et al. (2019)∗ | 80.1 |
| Google T5-11B Raffel et al. (2019) | 94.8 |
| DeBERTa-1.5B He et al. (2020) | 96.8 |
| Our Models | |
| BERT-Large (baseline) | 75.5 (0.07) |
| ATOMIC-BERT-Large | |
| – Event, Physical, Social | 79.1 (0.03) |
| – Event only | 79.1 (0.01) |
| GLUCOSE-BERT-Large | 79.9 (0.02) |
| RoBERTa-Large (baseline) | 74.1 (0.11) |
| ATOMIC-RoBERTa-Large | |
| – Event, Physical, Social | 83.9 (0.02) |
| – Event only | 84.9 (0.03) |
| GLUCOSE-RoBERTa-Large | 85.7 (0.03) |
Three points are worth mentioning when comparing our models with others on COPA. First, our models, BERT-Large and RoBERTa-Large, have far fewer parameters than the state-of-the-art models Google T5-11B (roughly 32x larger) and DeBERTa-1.5B (roughly 4x larger), showing how smaller models can remain competitive by benefiting from continual pretraining. Second, we have not yet applied any model improvement methods, such as the margin-based loss introduced by Li et al. (2019) and used in CausalBERT Li et al. (2020), the extra regularization loss proposed by Han and Wang (2021), or fine-tuning with the quality-enhanced training data, BCOPA, introduced by Kavumba et al. (2019). As a result, there is still considerable room to improve the current models, which would be a natural next step. Third, we achieved performance on par with BERT-SocialIQA Sap et al. (2019) (our best single-seed BERT and RoBERTa runs reached 81.8% and 88.8% accuracy, respectively) without using crowdsourcing or any manual rewriting/correction, which is expensive, to verbalize the KG triples for our pretraining data.
We also evaluated our models on the Easy and Hard question splits of COPA-test, separated by Kavumba et al. (2019), to see how they perform on harder questions that do not contain superficial cues. Results are shown in Table 3. Our models significantly outperformed the baselines not only on the Easy questions but also on the Hard ones.
Table 3: Accuracy on the Easy and Hard subsets of COPA-test.

| Model | Easy | Hard |
|---|---|---|
| BERT-Large Kavumba et al. (2019) | 83.9 (0.04) | 71.9 (0.03) |
| RoBERTa-Large Kavumba et al. (2019) | 91.6 (0.01) | 85.3 (0.02) |
| Our Models | | |
| BERT-Large (baseline) | 84.7 (0.05) | 69.8 (0.09) |
| ATOMIC-BERT-Large | | |
| – Event, Physical, Social | 90.6 (0.02) | 72.1 (0.03) |
| – Event only | 88.6 (0.02) | 73.2 (0.02) |
| GLUCOSE-BERT-Large | 89.1 (0.02) | 74.2 (0.03) |
| RoBERTa-Large (baseline) | 80.5 (0.01) | 70.2 (0.12) |
| ATOMIC-RoBERTa-Large | | |
| – Event, Physical, Social | 87.5 (0.02) | 81.7 (0.03) |
| – Event only | 90.7 (0.03) | 81.3 (0.04) |
| GLUCOSE-RoBERTa-Large | 89.6 (0.05) | 83.3 (0.03) |
Table 4: Accuracy on BCOPA-CE.

| Model | Acc (%) |
|---|---|
| b-l-aug Han and Wang (2021) | 51.1 |
| b-l-reg Han and Wang (2021) | 64.1 |
| Our Models | |
| BERT-Large (baseline) | 51.5 (0.01) |
| ATOMIC-BERT-Large | |
| – Event only | 53.2 (0.01) |
| – Event, Physical, Social | 53.5 (0.02) |
| GLUCOSE-BERT-Large | 54.7 (0.02) |
| RoBERTa-Large (baseline) | 56.5 (0.06) |
| ATOMIC-RoBERTa-Large | |
| – Event only | 64.2 (0.04) |
| – Event, Physical, Social | 61.8 (0.04) |
| GLUCOSE-RoBERTa-Large | 66.1 (0.03) |
4.1 BCOPA-CE: Prompt vs. No Prompt
Results of experiments on BCOPA-CE are shown in Table 4. As expected from the results reported by Han and Wang (2021), we initially observed that our models perform close to a random baseline. Since we do not use the question type when encoding input sequences, we examined whether adding the question type as a prompt to the input sequences would improve performance. We added "It is because" and "As a result," as prompts for asks-for="cause" and asks-for="effect", respectively. With question types added as prompts to the input sequences of correct and incorrect answers in the test set, the new models outperformed the baseline, and our best performing model surpassed Han and Wang (2021)'s b-l-aug and b-l-reg models, which are fine-tuned on the same data as ours.
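A minimal sketch of this prompt injection follows; prefixing the answer alternative (rather than some other position in the encoded sequence) is an assumption of the illustration, and the helper name is ours.

```python
# The two question-type prompts used in our BCOPA-CE experiments.
PROMPTS = {"cause": "It is because", "effect": "As a result,"}

def add_prompt(alternative: str, asks_for: str) -> str:
    """Prefix an answer alternative with its question-type prompt."""
    return f"{PROMPTS[asks_for]} {alternative}"

print(add_prompt("the hot water was gone.", "cause"))
# -> It is because the hot water was gone.
```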
5 Conclusion
We introduced a simple framework for augmenting PLMs with commonsense knowledge obtained by automatically verbalizing ATOMIC and GLUCOSE. Our results show that commonsense-knowledge-augmented PLMs outperform the original PLMs on cause-effect pair classification and on answering commonsense causal reasoning questions. As a next step, it would be interesting to see how previously proposed model improvement methods or unbiased fine-tuning datasets can further enhance the performance of our knowledge-augmented models.
References
- Agarwal et al. (2021) Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565.
- Blanco et al. (2008) Eduardo Blanco, Nuria Castell, and Dan I Moldovan. 2008. Causal relation extraction. In Lrec.
- Chang and Choi (2004) Du-Seong Chang and Key-Sun Choi. 2004. Causal relation extraction using cue phrase and lexical pair probabilities. In International Conference on Natural Language Processing, pages 61–70. Springer.
- Chang and Choi (2006) Du-Seong Chang and Key-Sun Choi. 2006. Incremental cue phrase learning and bootstrapping method for causality extraction using cue phrase and word pair probabilities. Information processing & management, 42(3):662–678.
- Dalal et al. (2021) Dhairya Dalal, Mihael Arcan, and Paul Buitelaar. 2021. Enhancing multiple-choice question answering with causal knowledge. In Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 70–80.
- Dasgupta et al. (2018) Tirthankar Dasgupta, Rupsa Saha, Lipika Dey, and Abir Naskar. 2018. Automatic extraction of causal relations from text using linguistically informed deep neural networks. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 306–316.
- Davison et al. (2019) Joe Davison, Joshua Feldman, and Alexander M Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Do et al. (2011) Quang Xuan Do, Yee Seng Chan, and Dan Roth. 2011. Minimally supervised event causality identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 294–303. Association for Computational Linguistics.
- Dunietz et al. (2018) Jesse Dunietz, Jaime G Carbonell, and Lori Levin. 2018. Deepcx: A transition-based approach for shallow semantic parsing with complex constructional triggers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1691–1701.
- Gao et al. (2019) Lei Gao, Prafulla Kumar Choubey, and Ruihong Huang. 2019. Modeling document-level causal structures for event causal relation identification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1808–1817.
- Garcia et al. (1997) Daniela Garcia et al. 1997. COATIS, an NLP system to locate expressions of actions connected by causality links. In International Conference on Knowledge Engineering and Knowledge Management, pages 347–352. Springer.
- Girju (2003) Roxana Girju. 2003. Automatic detection of causal relations for question answering. In Proceedings of the ACL 2003 workshop on Multilingual summarization and question answering-Volume 12, pages 76–83. Association for Computational Linguistics.
- Guan et al. (2020) Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. 2020. A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8:93–108.
- Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360.
- Han and Wang (2021) Mingyue Han and Yinglin Wang. 2021. Doing good or doing right? exploring the weakness of commonsense causal reasoning models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 151–157, Online. Association for Computational Linguistics.
- Hashimoto et al. (2012) Chikara Hashimoto, Kentaro Torisawa, Stijn De Saeger, Jong-Hoon Oh, and Jun’ichi Kazama. 2012. Excitatory or inhibitory: A new semantic orientation extracts contradiction and causality from the web. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 619–630. Association for Computational Linguistics.
- He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654.
- Hidey and McKeown (2016) Christopher Hidey and Kathy McKeown. 2016. Identifying causal relations using parallel Wikipedia articles. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1424–1433, Berlin, Germany. Association for Computational Linguistics.
- Hwang et al. (2021) Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2021. COMET-ATOMIC 2020: On symbolic and neural commonsense knowledge graphs. In AAAI.
- Joskowicz et al. (1989) Leo Joskowicz, T Ksiezyck, and Ralph Grishman. 1989. Deep domain models for discourse analysis. In [1989] Proceedings. The Annual AI Systems in Government Conference, pages 195–200. IEEE.
- Kaplan and Berry-Rogghe (1991) Randy M Kaplan and Genevieve Berry-Rogghe. 1991. Knowledge-based acquisition of causal relationships in text. Knowledge Acquisition, 3(3):317–337.
- Kavumba et al. (2019) Pride Kavumba, Naoya Inoue, Benjamin Heinzerling, Keshav Singh, Paul Reisert, and Kentaro Inui. 2019. When choosing plausible alternatives, Clever Hans can be clever. EMNLP 2019, page 33.
- Khoo et al. (1998) Christopher SG Khoo, Jaklin Kornfilt, Robert N Oddy, and Sung Hyon Myaeng. 1998. Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing. Literary and Linguistic Computing, 13(4):177–186.
- Li et al. (2019) Zhongyang Li, Tongfei Chen, and Benjamin Van Durme. 2019. Learning to rank for plausible plausibility. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4818–4823.
- Li et al. (2020) Zhongyang Li, Xiao Ding, Ting Liu, J Edward Hu, and Benjamin Van Durme. 2020. Guided generation of cause and effect. IJCAI.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Mostafazadeh et al. (2020) Nasrin Mostafazadeh, Aditya Kalyanpur, Lori Moon, David Buchanan, Lauren Berkowitz, Or Biran, and Jennifer Chu-Carroll. 2020. GLUCOSE: Generalized and contextualized story explanations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4569–4586.
- Ning et al. (2018) Qiang Ning, Zhili Feng, Hao Wu, and Dan Roth. 2018. Joint reasoning for temporal and causal relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2278–2288, Melbourne, Australia. Association for Computational Linguistics.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Roemmele et al. (2011) Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
- Rogers et al. (2021) Anna Rogers, Matt Gardner, and Isabelle Augenstein. 2021. QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension. arXiv preprint arXiv:2107.12708.
- Sap et al. (2019) Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, Hong Kong, China. Association for Computational Linguistics.
- Wu and He (2019) Shanchan Wu and Yifan He. 2019. Enriching pre-trained language model with entity information for relation classification. In Proceedings of the 28th ACM international conference on information and knowledge management, pages 2361–2364.
- Zellers et al. (2018) Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104.
Appendix A Contribution of Augmented Knowledge
Table 5: COPA test samples correctly classified by our best model but not by the baseline, paired with semantically similar GLUCOSE entries.

| COPA Test Sample | GLUCOSE Similar Entry |
|---|---|
| The family went to the zoo. The children admired the animals. (ask-for=result) | The kids are excited to see they are at the zoo because the kids like(s) the zoo. |
| The phone rang. The man picked up the phone. (ask-for=result) | The guy answers the phone because the phone is ringing. |
| The trash bag was full. I took it to the dumpster. (ask-for=result) | I pick up the bag since the trash bag is full. |
| The runner sensed his competitor gaining on him. He sped up his pace. (ask-for=result) | Sam ran as fast as he could since sam feel(s) competitive. |
| The man got out of the shower. The hot water was gone. (ask-for=cause) | All the hot water is gone because my wife just used the shower. |
| The criminal was executed. He was convicted of murder. (ask-for=cause) | The judge convicts him because he is guilty. |
| The boy’s forehead felt hot. His mother took his temperature. (ask-for=result) | Sean’s mom takes his temperature caused sean’s mom finds out he has a fever. |
| The fish bit the line. The fisherman reeled in the fish. (ask-for=result) | A huge fish gets on the line. As a result bob has a bite. |
| The man went to the doctor. The man felt ill. (ask-for=cause) | Tom goes to the doctor because tom feel(s) sick. |
| An unfamiliar car parked outside my house. I became suspicious. (ask-for=result) | I notice an unfamiliar car. As a result I feel(s) curiosity. |
We did further analysis to better understand how the augmented knowledge did or did not help the PLMs achieve better results on our benchmarks. Even though knowing exactly how data points from ATOMIC and GLUCOSE contributed to the performance improvements is hard and would require a more rigorous analysis, we found it helpful to investigate the semantic overlap between the augmented data and our benchmarks' samples, checking whether the injected knowledge has any context similarity with what our models were tested on. For each benchmark, we picked our best performing model and the baseline and collected all test samples that each model predicted correctly across all random seed runs. We then formed the set of samples correctly predicted by our best model that the baseline failed on, and measured the semantic similarity of each sample in that set with all data points in ATOMIC and GLUCOSE. To measure semantic similarity, we leveraged Sentence Transformers Reimers and Gurevych (2019) (https://github.com/UKPLab/sentence-transformers). In particular, after computing the embeddings of the samples (with the sentence-transformers/all-mpnet-base-v2 model on HuggingFace), we computed the cosine similarity between pairs of embeddings and kept pairs with at least 50% similarity. Our reasoning was that if a data point in ATOMIC or GLUCOSE has high semantic similarity (in terms of the interactions between events) with a data point in the benchmark, that similarity may have contributed to the augmented model's performance improvement.
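A condensed sketch of this similarity filtering follows, using the sentence-transformers package and the model named above; the example pair is taken from Table 5, and the 0.5 threshold corresponds to the 50% similarity cutoff.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

copa_samples = ["An unfamiliar car parked outside my house. I became suspicious."]
glucose_entries = ["I notice an unfamiliar car. As a result I feel(s) curiosity."]

emb_copa = model.encode(copa_samples, convert_to_tensor=True)
emb_glucose = model.encode(glucose_entries, convert_to_tensor=True)

# Pairwise cosine similarities; keep pairs scoring at least 0.5.
scores = util.cos_sim(emb_copa, emb_glucose)
pairs = [(i, j)
         for i in range(len(copa_samples))
         for j in range(len(glucose_entries))
         if scores[i, j] >= 0.5]
print(pairs)
```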
Table 5 shows examples of correctly classified samples with high context similarity to entries in GLUCOSE. Out of 70,730 training samples in GLUCOSE, 3,588 and 253 pairs have at least 0.5 and 0.6 cosine similarity with a sample in COPA, respectively. As can be seen, the samples in each pair are not necessarily exact matches but share a similar context. For instance, an entry in GLUCOSE tells us that noticing an unfamiliar car results in feeling curious, and this is exactly what is asked in a COPA question where being suspicious is the plausible result of seeing an unfamiliar car parked outside the house. Such examples suggest that the model may have learned the relation between seeing an unfamiliar object and a feeling of curiosity during continual pretraining, which later helped it predict the correct answer when two similar events appear in a question. We cannot claim that this context similarity is the cause of the augmented models' performance gains; however, it is still interesting to see that feeding a model explicit causal statements potentially helps it express causal knowledge that may or may not already be encoded in it, as also stated in previous work Hwang et al. (2021).