ExClaim: Explainable Neural Claim Verification Using Rationalization
Abstract
With the advent of deep learning, text generation language models have improved dramatically, producing text that approaches the quality of human-written text. This can lead to rampant misinformation because content can now be created cheaply and distributed quickly. Automated claim verification methods exist to validate claims, but they lack foundational data and often use mainstream news as evidence sources, which are strongly biased towards a specific agenda. Current claim verification methods use deep neural network models and complex algorithms to achieve high classification accuracy, but at the expense of model explainability. The models are black boxes, and their decision-making processes and the steps taken to arrive at a final prediction are obfuscated from the user. We introduce a novel claim verification approach, namely ExClaim, that attempts to provide an explainable claim verification system with foundational evidence. Inspired by the legal system, ExClaim leverages rationalization to provide a verdict for the claim and justifies the verdict through a natural language explanation (rationale) that describes the model’s decision-making process. ExClaim treats the verdict classification task as a question-answer problem and achieves an F1 score of 0.93. It also provides subtask explanations to justify the intermediate outcomes. Statistical and Explainable AI (XAI) evaluations are conducted to ensure valid and trustworthy outcomes. Ensuring claim verification systems are assured, rational, and explainable is an essential step toward improving Human-AI trust and the accessibility of black-box systems.
Index Terms:
Natural Language Processing, Rationalization, Claim Verification, Explainable AI, Language Modeling
I Introduction

Information on the web has grown remarkably in recent decades, and misinformation has become a significant challenge. Misinformation can weaken public trust in institutions and economies. For instance, in 2013 the Associated Press published a false tweet claiming that President Barack Obama had suffered injuries from an explosion at the White House, and within seconds the S&P 500 lost $130 billion in stock value [1]. The impact of such information falls not only on global economies but also on all aspects of our lives, since it has become commonplace to consume news digitally. A leading cause of the rise of misinformation is that it can now be created more cheaply and distributed faster on the internet than through traditional platforms such as newspapers and television [2]. The growth of social media usage has also allowed effortless mass sharing of information.
The recent progress with Language Models (LM) in Natural Language Processing (NLP) has made it even easier, cheaper, and faster for a machine to generate artificial text [3]. Generative LMs such as GPT-3 have become sophisticated at text generation, and their results are almost on par with human-written text in terms of readability, coherence, and grammatical accuracy [4]. In [5], the authors demonstrate that LMs are growing tenfold every year in parameter count and model size, and that substantial further progress in natural language text generation is bound to happen. These models may reach (or even surpass) human performance within a few years.
Such advances will likely lead to even more rampant misinformation than we see online today. There has been a growing interest in identifying the authenticity and truthfulness of online content. Professional fact-checkers and independent fact-checking organizations exist, but their work cannot scale linearly with the growth of digital content. This has led to increasing interest in automated fact-checking, also known as claim verification [6]. By fusing deep learning techniques with complex networks and algorithms, claim verification systems have achieved respectable performance but have become black boxes. Optimizing for task performance comes at the expense of model explainability. The developers of the model and the end-users, whom we refer to as nonexperts, do not fully know the model’s internals or understand its decision-making processes. Furthermore, [6] points out that no existing system offers explanations for sub-tasks in the claim verification pipeline.
Currently, a claim is fed into the system, the model checks it against an evidence dataset, and it returns a binary verdict as the output: true if the evidence supports the claim, otherwise false. This approach has become commonplace [7]; however, many questions remain unanswered. What evidence influenced the prediction? Is the evidence trustworthy? How can the nonexpert understand the model internals? These are all missing components of the process, and there are no widely adopted standards to assure the outcomes of these black-box systems [8].
Explainability techniques are available, but many are insufficient because they require specialized domain knowledge to understand. Claim verification systems are tools that would generally be used by a wide public audience. They need to be explainable in the sense that they appropriately justify their outcomes and are comprehensible to nontechnical users. Local Interpretable Model-agnostic Explanations (LIME), attention scores, and saliency heatmaps are generally helpful to the model developer, but not to the end-user.
An emerging NLP explainability technique is rationalization [9]. It provides a rationale, also known as a natural language explanation (NLE), to justify a model’s prediction. The reasoning behind a model’s prediction can be understood simply by reading the rationale, thereby revealing the model’s decision-making process. Rationalization is an attractive explainability technique because it is intuitive, human-comprehensible, and does not require domain knowledge [10]. It is a type of local explanation because there is a unique explanation for each prediction [11]. Rationalization frames explainability as an outcome-explanation problem: it looks at explainability from the perspective of an end-user whose aim is to understand how the model arrives at its final outcome [12]. This process is encapsulated in the rationale statement.
Rationales are also widely used in the legal system. A judge rules a verdict, and it cannot be upheld without a legal opinion, a written explanation laying out the rationale for the ruling. Inspired by this, we propose a novel approach, namely ExClaim, which leverages rationalization for an explainable neural claim verification system that is accessible and trustworthy to the nonexpert. It not only predicts a verdict but also defends and rationalizes its output as an NLE. We also introduce NLP assurance and incorporate extensive explainable evaluations to assure that the verdict and NLE outcomes are valid, helping reinforce Human-AI trust [13]. The contributions of this paper are as follows:
1. Introduces a new benchmark claim verification dataset with credible foundational information and a novel explainable claim verification approach called ExClaim.
2. Uses transfer learning and presents an effective method to generate abstractive rationales in an unsupervised manner without any ground truth.
3. Treats the verdict classification task as a question-answer problem and demonstrates that it can significantly improve classification performance.
4. Presents the problem of claim verification through the lens of explainability and is the first to introduce sub-task explanations in the claim verification space.
The structure of this paper is as follows: in Section 2, we present related work. In Section 3, we introduce a new benchmark claim verification dataset. In Section 4, we describe our experimental methodology, and Section 5 presents the experimental results. Lastly, in Section 6, we provide conclusions and future directions.
II Related Work
Several studies have been conducted on the FEVER, LIAR, and MultiFC datasets, the most extensive datasets available for claim verification. FEVER does not contain real-world claims; instead, its claims are generated from Wikipedia. MultiFC uses the top k Google search results as evidence for claims, which is not reliable, and quality can vary drastically. Hence, we disregarded them in this study and focused on Politifact-based data. The LIAR dataset by [14] was the first large-scale Politifact claim verification dataset, and it led to an increase in research in this field. It is a six-way classification dataset, and [14] presents a Convolutional Neural Network (CNN) model that uses the claim and metadata such as speaker, party, and state, amongst others, to achieve a 0.277 accuracy. Using metadata could potentially introduce unwanted biases towards certain parties and groups: the model will be influenced more by those details than by the evidence for a claim. In [15], the authors use an attention mechanism over the speaker and topic and achieve a 0.41 accuracy.
In [16], the authors publish their LIAR-PLUS dataset with human-written justifications from each Politifact article; their model uses a P-BiLSTM with the claim and justification as inputs and achieves a 0.70 accuracy on binary classification of the Politifact data. The current SOTA performance on binary verdict classification on LIAR-PLUS is by [17], whose siamese network achieves a 0.82 accuracy. Although these datasets differ from the ExClaim dataset, they shed light on the progress of verdict classification using Politifact claims. ExClaim serves as a new benchmark dataset with more credible information, and our results in Section 5 serve as the new baselines.
In the previously discussed Politifact work, the role of explainability and methods to assure the black-box predictions are untouched. The majority of explanation methods in the broader claim verification space are quantitative. Papers [18], [19], [20], [21], and [22] all propose attention-based explanations using BiLSTMs, LSTMs, CNNs, co-attention networks, and decision trees, respectively. In [23] and [24], the authors use rule discovery to derive knowledge graph and Horn rule explanations. Given the low accessibility of these explanation methods, [25] presented, in 2020, the first study that jointly explains and predicts a verdict using a rationale. They use a supervised extraction technique based on DistilBERT and perform an annotation task to evaluate rationale quality and interpretability. In [26], the authors demonstrate this technique on a public health claims dataset. Progress on natural language explanations as an explainability technique has been minimal and constrained to supervised methods.
III ExClaim Foundational Dataset
Table I: Dataset Statistics

Statistic | Value
---|---
Total Articles | 4006
Training Set Articles | 2804
Validation Set Articles | 601
Test Set Articles | 601
Total Support Labels | 2013
Total Refute Labels | 1993
Average Token Length of Claim | 17
Average Token Length of Evidence | 449
In [6], the authors note that the availability of datasets for experimentation is the current bottleneck in automatic claim verification. The two datasets that we first considered were LIAR [14] and LIAR-PLUS [16]. These datasets were developed by scraping Politifact, an independent fact-checking platform. However, they were not sufficient for our task because we found problems with their content, and a few features that we required were missing. Every Politifact article includes a claim, a six-class verdict, ruling comments written by a human fact-checker, a short justification summary at the end of the article, and metadata such as speaker, subject, and other components. The LIAR and LIAR-PLUS datasets exclude the ruling comments, and previous work relies primarily on the justification for claim verification. Frequently, this justification summary misses essential information from the ruling comments and serves more as a closure to the article. Therefore, these datasets were not a reliable structure for our task.
We developed a new claim verification dataset using Politifact for our experiment that includes information on 4000+ articles, as shown in Table I. Our dataset consists of the following features: Claim, Date, Source, Verdict, the ruling comments (which we refer to as the Evidence), and the URL. Determining the authenticity of the evidence and ensuring that it is neutral is crucial for claim verification. If the evidence includes polarizing language and carries a strong true or false sentiment, the model uses those features to predict a verdict instead of truly learning the evidence content. Additionally, Politifact often cites mainstream news outlets as sources. Most of the time, these sources are heavily biased and skew to the left or right in their content production. Due to these factors, data cleaning is performed based on the following criteria.
When verifying claims, credible and trustworthy evidence is essential. Paper [8] points out that evidence is primarily available from reputable institutions such as world governments and international organizations; news outlets are mostly secondary sources. We identified the 30 most popular news platforms and removed evidence paragraphs citing them. Our evidence primarily includes information from the U.S. government (state and federal laws, and government departments and agencies such as the Centers for Disease Control and Prevention, the Bureau of Labor Statistics, and the Census Bureau, amongst others) and international organizations (the United Nations, World Health Organization, World Trade Organization, and others). Politifact includes six classes for the verdict (True, Mostly True, Half True, Mostly False, False, Pants on Fire). We only scraped articles with True or False labels since the accuracy of the other classes was not clear. Claim verification systems that output True or False as the verdict can be misleading because there is always a possibility of error. We pre-process the verdict labels in our dataset and convert True and False into Supports and Refutes, respectively. From the end user’s perspective, this approach makes it explicit that the evidence at hand either supports or refutes the claim; the end user can then conclude whether the claim is true or false from the given outputs. We release this as the ExClaim Foundational dataset, a new benchmark claim verification dataset with foundational sources (https://github.com/AI-VTRC/ExClaim).
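To make the cleaning criteria above concrete, the following is a minimal sketch of the preprocessing described in this section. The column names, the outlet list, and the small in-memory example frame are illustrative assumptions rather than the exact artifacts used to build the dataset.

```python
import pandas as pd

# Hypothetical list standing in for the 30 most popular news outlets used for filtering.
NEWS_OUTLETS = ["cnn.com", "foxnews.com", "nytimes.com"]

def drop_news_paragraphs(evidence: str) -> str:
    """Keep only evidence paragraphs that do not cite a mainstream news outlet."""
    kept = [p for p in evidence.split("\n")
            if not any(outlet in p.lower() for outlet in NEWS_OUTLETS)]
    return "\n".join(kept)

# Tiny illustrative frame; the real dataset also has Date, Source, and URL columns.
df = pd.DataFrame({
    "claim": ["Example claim."],
    "verdict": ["False"],
    "evidence": ["Paragraph citing cdc.gov data.\nParagraph citing cnn.com coverage."],
})
df = df[df["verdict"].isin(["True", "False"])]                        # keep binary-labeled articles only
df["verdict"] = df["verdict"].map({"True": "Supports", "False": "Refutes"})
df["evidence"] = df["evidence"].apply(drop_news_paragraphs)
```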

IV Methods
The Principle of Factor Sparsity, also known as the Pareto principle, states that for many outcomes in a system, 80% of the effects come from 20% of the causes [27]. This principle generally holds true in many situations and has been successfully applied in various disciplines such as medicine, engineering, and economics [28, 29]. In computer science, it has been shown useful for the optimization of genetic algorithms [30]. It has also been applied in software testing, and [31] demonstrates that on average “20% of the code has 80% of the errors”.
Inspired by the Pareto principle, we hypothesize that on average only 20% of the information in the evidence is important for verifying a claim and the remaining 80% is nonessential. This is similar to how humans approach manual claim verification: a few key pieces of information in the evidence most influence our decision and compel us to rule that a claim is either true or false. Paper [32] categorizes this as an explain-then-predict model, and [33, 34, 35] cite improvements in task performance after using similar techniques. This lays the foundation for our ExClaim approach, as illustrated in Figure 2.
IV-A Rationale Generation
To determine the salient information in the evidence, we use transfer learning and treat rationale generation as an unsupervised summarization task using language models. Summarization techniques have been shown to distill the most important information from a given source with high accuracy [36]. We specifically use abstractive summarization because extractive techniques lack the coherence, readability, and overall quality of human-written summaries [37, 38]. We refer to the generated summary as the rationales, which later become part of the NLE justifying the verdict prediction.
Language models perform well at summarization, as first demonstrated by [39], where a BERT-based abstractive summarization model outperforms most non-Transformer-based models. However, the majority of current language models have a maximum sequence length of 512 tokens. As Table I shows, the average sequence length of the evidence is 449 tokens, with some articles over the 512-token limit. To minimize truncation, we leverage the BART language model, which has a maximum sequence length of 1024 tokens and outperformed previous work on summarization at its publication [40]. BART is pre-trained on the English language, and we specifically use the bart-large-cnn checkpoint from Hugging Face, which has been fine-tuned on a large corpus of text-summary pairs [41]. We learn a function $f(E) = R$ where, given the evidence $E = (e_1, e_2, \ldots, e_n)$, it predicts the paragraph of rationales $R = (r_1, r_2, \ldots, r_m)$ with $m < n$. The terms $e_i$ and $r_i$ are the $i$-th words of the evidence and the rationales paragraph, respectively.
The function $f$ tokenizes the input $E$ and passes it to a bidirectional encoder, and $R$ is generated through an autoregressive decoder over the final hidden layer. We use this model to perform inference on each evidence sample in the train, validation, and test sets to generate their respective rationales. The output is set to a maximum sequence length of 120 tokens and a minimum of 75 tokens, which we found to be optimal after experimentation. The claim was not an input in this process, to ensure that the rationales are balanced and not heavily biased towards either side of the verdict.
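As a concrete illustration of this inference step, the sketch below runs the bart-large-cnn checkpoint through the Hugging Face summarization pipeline with the 120/75 token bounds described above; any generation settings beyond those bounds are assumptions.

```python
from transformers import pipeline

# bart-large-cnn checkpoint named in the text; inputs beyond BART's 1024-token limit are truncated.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def generate_rationales(evidence: str) -> str:
    """Return the abstractive summary of the evidence, used as the rationale paragraph R."""
    out = summarizer(evidence, max_length=120, min_length=75, truncation=True)
    return out[0]["summary_text"]

# Usage: rationales = generate_rationales(evidence_text)  # one evidence passage at a time
```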
IV-B Verdict Classification
In [42], the authors demonstrate the benefits of multitask learning where downstream tasks are viewed as question-answer problems using a pre-trained language model. Based on our findings, this method has not been explored in the claim verification domain. Given its promising results in other domains, we frame the ExClaim classification task as a question-answer problem.
We leverage a pre-trained T5 model, a text-to-text generation model [43] with SOTA results on many NLP tasks, including question answering. We specifically use T5’s capability on the Choice of Plausible Alternatives (COPA) task. COPA was introduced by [44] and is a type of question-answering problem that examines automated commonsense causal reasoning. The objective is, given a question, a premise as context, and two answer choices, to perform binary classification and determine the correct answer choice. For this task, we conceive a function $g(S) = V$ where $S$ is the input prompt sequence and $V$ is the binary verdict prediction with choices $V_1$ and $V_2$. We formulate $S$ as follows: the claim $C = (c_1, c_2, \ldots, c_k)$ serves as the question, with $c_i$ indicating the $i$-th word of the claim; the generated paragraph of rationales $R$ serves as the premise; and the possible verdicts serve as the choices $V_1$ and $V_2$. The sequence $S$ is fed in as “copa choice1: $V_1$ choice2: $V_2$ premise: $R$ question: $C$”.
We fine-tune the T5 model on our training dataset with the sequence $S$ for each sample. We set the batch size to 8, use cross-entropy as the loss function and AdamW as the optimizer with a 2e-5 learning rate, and train the model for 20 epochs [45]. We evaluate the model’s performance on the validation set every 350 steps. The majority of the input sequences were below T5’s maximum sequence length of 512 tokens, and if the length of $S$ went over the limit, T5 does not truncate automatically since it uses relative positional bucketing, or relative attention [46]. It continues to work until it runs out of memory, which never occurred during our experimentation.
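The following sketch illustrates one training step of this setup: the COPA-style prompt $S$ is built from the claim and rationales, and the model is optimized with AdamW at a 2e-5 learning rate. The checkpoint size, the ordering of the two verdict choices in the prompt, and the use of the verdict string itself as the decoder target are assumptions; batching, epochs, and the evaluation loop are omitted.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")   # checkpoint size is an assumption
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def build_prompt(claim: str, rationales: str) -> str:
    """COPA-style prompt S: verdict choices, rationales as premise, claim as question."""
    return (f"copa choice1: Refutes choice2: Supports "
            f"premise: {rationales} question: {claim}")

# One training step on a single example.
claim, rationales, verdict = "Example claim.", "Example rationale paragraph.", "Supports"
inputs = tokenizer(build_prompt(claim, rationales), return_tensors="pt")
labels = tokenizer(verdict, return_tensors="pt").input_ids   # verdict text as target (an assumption)
loss = model(**inputs, labels=labels).loss                   # cross-entropy over target tokens
loss.backward()
optimizer.step()
```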
IV-B1 Baselines

We also utilize other architectures as baselines and they are as follows.
LSTM: Previous work such as [14, 16] has relied heavily on LSTM networks for verdict classification. We designed a multi-input LSTM classification model. It takes the claim and evidence as two inputs, which are converted into GloVe embeddings and individually passed through an embedding layer and an LSTM layer with 128 features; the outputs are concatenated and passed through a dense layer with ReLU activation and a final dense layer with two neurons and softmax activation for the binary prediction output.
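A minimal Keras sketch of this multi-input baseline is shown below. The vocabulary size, sequence lengths, and hidden dense width are illustrative assumptions, and the embedding layers would be initialized with GloVe vectors in practice.

```python
from tensorflow.keras import layers, Model

VOCAB, EMB_DIM, CLAIM_LEN, EVID_LEN = 20000, 100, 32, 512   # assumed sizes

def branch(seq_len: int, name: str):
    """One input branch: token ids -> embedding (GloVe weights in practice) -> LSTM(128)."""
    inp = layers.Input(shape=(seq_len,), dtype="int32", name=name)
    x = layers.Embedding(VOCAB, EMB_DIM)(inp)
    x = layers.LSTM(128)(x)
    return inp, x

claim_in, claim_vec = branch(CLAIM_LEN, "claim")
evid_in, evid_vec = branch(EVID_LEN, "evidence")
merged = layers.Concatenate()([claim_vec, evid_vec])
hidden = layers.Dense(64, activation="relu")(merged)          # hidden width is an assumption
output = layers.Dense(2, activation="softmax")(hidden)        # binary Supports/Refutes output

model = Model([claim_in, evid_in], output)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```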
DistilBERT: The DistilBERT model is also used often in the claim verification space. We fine-tune a pre-trained DistilBERT model, set the batch size to 8, enable a sliding window to prevent truncation, use AdamW for optimization, and train for 2 epochs.
BART: We use a BART zero-shot classification model, specifically the fine-tuned bart-large-mnli checkpoint from Hugging Face, and we directly perform inference with a zero-shot problem setup.
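A minimal sketch of this zero-shot setup with the Hugging Face pipeline follows; how the claim and evidence are combined into a single input sequence, and the exact label phrasing, are assumptions.

```python
from transformers import pipeline

zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

claim = "Example claim."
evidence = "Example evidence passage."
# Passing the claim together with its evidence as one sequence is an assumed setup.
result = zero_shot(f"{claim} {evidence}", candidate_labels=["Supports", "Refutes"])
predicted_verdict = result["labels"][0]   # labels come back sorted by descending score
```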
IV-C Natural Language Explanation
The NLE is a generative task using a language model. However, during experimentation with SOTA natural language generation models such as GPT-2 [48] and GPT-3 [49], we encountered challenges with language model hallucination: the model hallucinates by introducing unintended and irrelevant text during generation. Paper [50] suggests that such problems diminish user trust and fail to meet user expectations. For a claim verification system in particular, introducing unintended text makes the entire system worthless and can lead to dangerous consequences. Instead, we rely on the abstractive summarization used to generate the rationales (our original approach was extractive); with abstractive summarization, the rationales reflect the natural style of human-written text. We formulate the NLE by concatenating the claim $C$, the rationales $R$, and the verdict $V$. The resulting NLE, $X$, takes the form “The evidence $V$ the claim $C$ because $R$”.
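The template filling itself is a simple string operation; the sketch below is a minimal illustration, and the quoting and casing of the inserted pieces are presentational assumptions.

```python
def build_nle(claim: str, rationales: str, verdict: str) -> str:
    """Fill the NLE template X with the verdict V, claim C, and rationales R."""
    return f'The evidence {verdict.lower()} the claim "{claim}" because {rationales}'

nle = build_nle("Example claim.", "example rationale paragraph", "Refutes")
# -> 'The evidence refutes the claim "Example claim." because example rationale paragraph'
```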
V Results and Discussion
In this section, we describe the experimental results with supporting analysis.
V-A SHAP Explanation
The process of extracting the rationales is an intermediate step in the ExClaim approach, and there is no ground truth data available to assess its accuracy. We used an explainability technique called SHAP, which leverages game theory to explain the output of any ML model [47]. SHAP is based on Shapley values from economics, which measure the marginal contribution of a feature towards the final outcome of a system. These values are computed by meticulously perturbing the input features and observing how those changes affect the final prediction of the model. While performing inference to generate the rationales, we also pass our model to a SHAP explainer. Figure 3 shows the output explanation, where the explainer provides Shapley values for input features in the evidence text that influenced the generation of tokens in the rationales $R$. The red highlights indicate features with negative Shapley values, i.e., negative contributions, and similarly, blue represents positive contributions and values. SHAP is an accessible explanation method for the nonexpert because the explainer illustrates the marginal influence of each component by visualizing the Shapley values with colored highlights.
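The sketch below shows one way to attach a SHAP explainer to the BART summarizer, following SHAP's text-to-text examples; exact behavior can vary across shap and transformers versions, so treat it as an assumption-laden illustration rather than the exact setup used for Figure 3.

```python
import shap
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

# SHAP wraps the seq2seq model and tokenizer for text-to-text explanations.
explainer = shap.Explainer(model, tokenizer)
shap_values = explainer(["Example evidence passage to be summarized into rationales."])
shap.plots.text(shap_values)   # colored highlights of each input token's contribution
```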
Table II: Verdict classification results (macro F1)

Model | Validation | Test
---|---|---
BART Zero-Shot | - | 0.58
LSTM | 0.55 | 0.47
DistilBERT | 0.82 | 0.81
T5 | 0.85 | 0.85
ExClaim T5 | 0.92 | 0.93
V-B Verdict Classification
We measure the performance of our verdict classification models using the macro F1 score. The models we experimented with and their results on the validation and test datasets are shown in Table II. Since the BART Zero-Shot model was not fine-tuned on this dataset, we only include its F1 score from test set inference. Our ExClaim T5 COPA approach outperforms the baselines, specifically the DistilBERT model, by 10 points on the validation set and 12 points on the test set. An advantage of language models is that they are pre-trained on existing knowledge; when fine-tuned on downstream tasks, they can use transfer learning and leverage the pre-existing data to improve task performance. Language models such as BART, DistilBERT, and T5 were pre-trained on vast amounts of internet data, which hints at why the pre-trained language models performed better than the LSTM model. Amongst the language models, zero-shot classification performed poorly, a potential reason being that fine-tuning generally helps improve task performance and accuracy.
Table III: NLI evaluation results

NLI Result | Count | Percentage
---|---|---
Entailment | 168 | 27.9%
Neutral | 253 | 42.0%
Contradiction | 180 | 29.9%
V-C Natural Language Inference
The NLE is open-ended text without ground truth data, which eliminates any possibility of using ROUGE-N or BLEU scores for evaluation. The work by [51] demonstrates that Natural Language Inference (NLI) can be confidently used to evaluate the correctness of abstractive summarization results without any human-written references. In [52], the authors claim that assessing the accuracy of textual entailment can be a reasonable evaluation method. Given a premise and a hypothesis, NLI attempts to determine whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) [53].
We use the NLI textual entailment technique to automatically evaluate our NLEs. We used a pre-trained T5-large model, specifically the CB task, which was fine-tuned on the NLI dataset of [54]. We experimented with other T5 NLI tasks such as QNLI, MNLI, and RTE, and found that CB performed best. For each generated NLE $X$ in the test set, we formulate the NLI input for our model as “cb hypothesis: $C$ premise: $X$”. Essentially, we test whether the claim can be logically deduced from the NLE. The results are shown in Table III. Neutral NLEs are the majority at 42%, with contradiction and entailment following. A potential reason for this is that we currently rely on transfer learning for the NLI evaluation, and the model has not been fine-tuned on our dataset; identifying and establishing a logical connection between the NLE and the claim can be challenging. An alternative reason is that the quality of the rationales may be good enough to avoid contradiction the majority of the time, but not good enough for the premise to entail the hypothesis, which may be why neutral is the majority.
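A minimal sketch of this NLI check follows: the claim is the hypothesis, the NLE is the premise, and the pre-trained t5-large checkpoint generates the CB label. The decoding settings are assumptions.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

def nli_label(claim: str, nle: str) -> str:
    """Return the CB-task label ('entailment', 'contradiction', or 'neutral') for claim vs. NLE."""
    prompt = f"cb hypothesis: {claim} premise: {nle}"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=5)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Usage: nli_label("Example claim.", "The evidence refutes the claim ... because ...")
```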
V-D Manual Evaluation
In addition to the automatic evaluation with NLI, we also conduct a manual evaluation of the generated NLEs to assess their quality. For the evaluation setup, we randomly selected 100 claims and their respective NLEs from our test set. Three annotators were shown each claim and its NLE and tasked with judging the following criteria: Plausibility, Fluency, and Correctness. The annotators were required to rely only on the information provided, with each working individually. Since non-expert annotators might find it difficult to give thoughtful judgment, we adopt a rating scale from 1 to 5 for each criterion, as presented in Table IV.
We define plausibility as how convincingly the NLE explains the verdict prediction [55]; its rating scale is adopted from [56]. Fluency is defined by [57] as having correct spelling, good grammar, flow, and tone; its rating scale is adopted from [57]. Correctness is defined as the likelihood of the NLE and the verdict being true; its scale is from [56].
Table IV: Rating scales for the manual evaluation criteria

Plausibility | Rating
---|---
Very Convincing | 5
Slightly Convincing | 4
Slightly Not Convincing | 3
Not Convincing | 2
Can Not Judge | 1

Fluency | Rating
---|---
Flawless English | 5
Good English | 4
Non-native English | 3
Disfluent English | 2
Incomprehensible | 1

Correctness | Rating
---|---
Absolutely True | 5
Probably True | 4
Probably Not True | 3
Absolutely Not True | 2
Can Not Judge | 1
Figure 4 shows the results of the manual evaluation. We use DistilBERT as a comparison since it is often used in the claim verification space [25]. On average, ExClaim outperformed DistilBERT on all of the ratings. A major contributor is that ExClaim is an ensemble approach that leverages two strong language models (BART and T5) and is therefore better able to handle our dynamic context of rationale creation, verdict classification, and NLE generation. Both approaches have the same criteria ranking, with fluency rated highest, then plausibility, and lastly correctness. This is expected for fluency since language models generally produce proper text without major grammatical errors. Plausibility was the lowest on average, indicating that the chosen rationales could be more persuasive. DistilBERT has a lower rating for plausibility, possibly due to its limitations with text summarization: the BART model used in ExClaim for rationale generation outperforms BERT-based models at text summarization. Further, the DistilBERT-based approach uses extractive summarization, which creates challenges for the coherence and convincingness of the rationales. The correctness ratings demonstrate a decent likelihood that the rationales presented with the verdict are accurate for the claim. The T5 model in ExClaim excels at transfer learning and is therefore better able to classify the verdict with respect to the rationales presented at inference time. In ExClaim, the verdict and rationales tend to have strong entailment in the NLE, whereas with DistilBERT they may be more random and contradictory.

VI Conclusions
In this paper, we introduced a new benchmark dataset with foundational information (from government policy documents and international organizations) and presented promising baseline results on the verdict classification task using ExClaim’s question-answer approach. We also demonstrated an unsupervised method that uses transfer learning and abstractive summarization to generate rationales, which are then used to determine the verdict and generate an NLE. We showed multiple explainability methods to assure the final outcomes as well as the intermediate sub-task outcomes, which has not been done before in a claim verification system. We presented a novel claim verification approach and approached the problem with rigorous explainability to ensure the system is accessible, transparent, and trustworthy to the nonexpert.
Certain limitations do exist, such as the verdict prediction and NLE being heavily affected by the quality of the available foundational data. Additionally, claims need to be closely related to the information included in the foundational dataset. Many promising directions for future work exist to further progress in explainable claim verification: for example, developing better abstractive methods for NLE generation from a given claim and rationales while avoiding LM hallucination. Obtaining the evidence directly from the sources listed on Politifact could yield more plausible rationales than relying on the ruling comments. Rationale generation techniques that do not rely on summarization have also been explored very little. Lastly, incorporating AI assurance methods in NLP, as demonstrated by [13], is also a viable path toward improving general model trustworthiness.
References
- [1] K. Shu, S. Wang, D. Lee, and H. Liu, “Mining disinformation and fake news: Concepts, methods, and recent advancements,” in Disinformation, Misinformation, and Fake News in Social Media. Springer, 2020, pp. 1–19.
- [2] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake news detection on social media: A data mining perspective,” ACM SIGKDD explorations newsletter, vol. 19, no. 1, pp. 22–36, 2017.
- [3] J. Li, T. Tang, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “A survey of pretrained language models based text generation,” arXiv preprint arXiv:2201.05273, 2022.
- [4] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022.
- [5] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., “Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model,” arXiv preprint arXiv:2201.11990, 2022.
- [6] N. Kotonya and F. Toni, “Explainable automated fact-checking: A survey,” arXiv preprint arXiv:2011.03870, 2020.
- [7] Z. Liu, C. Xiong, M. Sun, and Z. Liu, “Fine-grained fact verification with kernel graph attention network,” arXiv preprint arXiv:1910.09796, 2019.
- [8] X. Zhou and R. Zafarani, “A survey of fake news: Fundamental theories, detection methods, and opportunities,” ACM Computing Surveys (CSUR), vol. 53, no. 5, pp. 1–40, 2020.
- [9] S. Gurrapu, A. Kulkarni, L. Huang, I. Lourentzou, L. Freeman, and F. Batarseh, “Rationalization for explainable nlp: A survey (upcoming),” 2022.
- [10] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, and P. Sen, “A survey of the state of explainable AI for natural language processing,” in Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. Suzhou, China: Association for Computational Linguistics, Dec. 2020, pp. 447–459. [Online]. Available: https://aclanthology.org/2020.aacl-main.46
- [11] S. Luo, H. Ivison, C. Han, and J. Poon, “Local interpretations for explainable natural language processing: A survey,” 2021. [Online]. Available: https://arxiv.org/abs/2103.11072
- [12] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi, “A survey of methods for explaining black box models,” ACM Comput. Surv., vol. 51, no. 5, aug 2018. [Online]. Available: https://doi.org/10.1145/3236009
- [13] F. A. Batarseh, L. Freeman, and C.-H. Huang, “A survey on artificial intelligence assurance,” Journal of Big Data, 2021.
- [14] W. Y. Wang, “‘Liar, liar pants on fire’: A new benchmark dataset for fake news detection,” arXiv preprint arXiv:1705.00648, 2017.
- [15] Y. Long, “Fake news detection through multi-perspective speaker profiles.” Association for Computational Linguistics, 2017.
- [16] T. Alhindi, S. Petridis, and S. Muresan, “Where is your evidence: Improving fact-checking by justification modeling,” in Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), 2018, pp. 85–90.
- [17] S. Bhatt, N. Goenka, S. Kalra, and Y. Sharma, “Fake news detection: Experiments and approaches beyond linguistic features,” in Data Management, Analytics and Innovation. Springer, 2022, pp. 113–128.
- [18] K. Popat, S. Mukherjee, J. Strötgen, and G. Weikum, “Where the truth lies: Explaining the credibility of emerging claims on the web and social media,” in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, pp. 1003–1012.
- [19] K. Shu, L. Cui, S. Wang, D. Lee, and H. Liu, “defend: Explainable fake news detection,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 395–405.
- [20] F. Yang, S. K. Pentyala, S. Mohseni, M. Du, H. Yuan, R. Linder, E. D. Ragan, S. Ji, and X. Hu, “Xfake: Explainable fake news detector with visualizations,” in The World Wide Web Conference, 2019, pp. 3600–3604.
- [21] Y.-J. Lu and C.-T. Li, “Gcan: Graph-aware co-attention networks for explainable fake news detection on social media,” arXiv preprint arXiv:2004.11648, 2020.
- [22] L. Wu, Y. Rao, Y. Zhao, H. Liang, and A. Nazir, “Dtca: Decision tree-based co-attention networks for explainable claim verification,” arXiv preprint arXiv:2004.13455, 2020.
- [23] M. H. Gad-Elrab, D. Stepanova, J. Urbani, and G. Weikum, “Exfakt: A framework for explaining facts over knowledge graphs and text,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019, pp. 87–95.
- [24] N. Ahmadi, J. Lee, P. Papotti, and M. Saeed, “Explainable fact checking with probabilistic answer set programming,” arXiv preprint arXiv:1906.09198, 2019.
- [25] P. Atanasova, J. G. Simonsen, C. Lioma, and I. Augenstein, “Generating fact checking explanations,” arXiv preprint arXiv:2004.05773, 2020.
- [26] N. Kotonya and F. Toni, “Explainable automated fact-checking for public health claims,” arXiv preprint arXiv:2010.09926, 2020.
- [27] D. Kiran, “Total quality management: An overview, pareto principle,” Total Quality Management, pp. 1–14, 2017.
- [28] K. Tanabe, “Pareto’s 80/20 rule and the gaussian distribution,” Physica A: Statistical Mechanics and its Applications, vol. 510, pp. 635–640, 2018.
- [29] A. Delers, The Pareto Principle for Business Management: Expand Your Business with the 80/20 Rule. 50 Minutes, 2015.
- [30] M. Gen and R. Cheng, Genetic algorithms and engineering optimization. John Wiley & Sons, 1999, vol. 7.
- [31] R. S. Pressman, Software engineering: a practitioner’s approach. Palgrave macmillan, 2005.
- [32] Z. Zhang, K. Rudra, and A. Anand, “Explain and predict, and then predict again,” in Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 418–426.
- [33] E. Lehman, J. DeYoung, R. Barzilay, and B. C. Wallace, “Inferring which medical treatments work from reports of clinical trials,” arXiv preprint arXiv:1904.01606, 2019.
- [34] J. Bastings, W. Aziz, and I. Titov, “Interpretable neural predictions with differentiable binary variables,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 2963–2977. [Online]. Available: https://aclanthology.org/P19-1284
- [35] T. Lei, R. Barzilay, and T. Jaakkola, “Rationalizing neural predictions,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 107–117. [Online]. Available: https://aclanthology.org/D16-1011
- [36] W. S. El-Kassas, C. R. Salama, A. A. Rafea, and H. K. Mohamed, “Automatic text summarization: A comprehensive survey,” Expert Systems with Applications, vol. 165, p. 113679, 2021.
- [37] S. Gupta and S. Gupta, “Abstractive summarization: An overview of the state of the art,” Expert Systems with Applications, vol. 121, pp. 49–65, 2019.
- [38] J.-g. Yao, X. Wan, and J. Xiao, “Recent advances in document summarization,” Knowledge and Information Systems, vol. 53, no. 2, pp. 297–336, 2017.
- [39] Y. Liu and M. Lapata, “Text summarization with pretrained encoders,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 3730–3740. [Online]. Available: https://aclanthology.org/D19-1387
- [40] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019.
- [41] ——, “Bart-large-cnn · hugging face,” 2019. [Online]. Available: https://huggingface.co/facebook/bart-large-cnn
- [42] B. McCann, N. S. Keskar, C. Xiong, and R. Socher, “The natural language decathlon: Multitask learning as question answering,” arXiv preprint arXiv:1806.08730, 2018.
- [43] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020. [Online]. Available: http://jmlr.org/papers/v21/20-074.html
- [44] M. Roemmele, C. A. Bejan, and A. S. Gordon, “Choice of plausible alternatives: An evaluation of commonsense causal reasoning.” in AAAI spring symposium: logical formalizations of commonsense reasoning, 2011, pp. 90–95.
- [45] I. Loshchilov and F. Hutter, “Fixing weight decay regularization in adam,” 2018.
- [46] J. Wu, R. Zhang, Y. Mao, and J. Chen, “On scalar embedding of relative positions in attention models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 16, 2021, pp. 14050–14057.
- [47] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017.
- [48] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
- [49] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
- [50] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, 2022.
- [51] R. Bora-Kathariya and Y. Haribhakta, “Natural language inference as an evaluation measure for abstractive summarization,” in 2018 4th International Conference for Convergence in Technology (I2CT). IEEE, 2018, pp. 1–4.
- [52] A. Poliak, “A survey on recognizing textual entailment as an nlp evaluation,” arXiv preprint arXiv:2010.03061, 2020.
- [53] J. Bos and K. Markert, “Recognising textual entailment with logical inference,” in Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005, pp. 628–635.
- [54] M.-C. De Marneffe, M. Simons, and J. Tonhauser, “The commitmentbank: Investigating projection in naturally occurring discourse,” in proceedings of Sinn und Bedeutung, vol. 23, no. 2, 2019, pp. 107–124.
- [55] A. Chan, M. Sanjabi, L. Mathias, L. Tan, S. Nie, X. Peng, X. Ren, and H. Firooz, “Unirex: A unified learning framework for language model rationale extraction,” arXiv preprint arXiv:2112.08802, 2021.
- [56] M. H. Gad-Elrab, D. Stepanova, J. Urbani, and G. Weikum, “Exfakt: A framework for explaining facts over knowledge graphs and text,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019, pp. 87–95.
- [57] A. Mutton, M. Dras, S. Wan, and R. Dale, “Gleu: Automatic evaluation of sentence-level fluency,” in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 2007, pp. 344–351.