
LLM-based Corroborating and Refuting Evidence Retrieval
for Scientific Claim Verification

Siyuan Wang1, James R. Foulds2, Md Osman Gani2, Shimei Pan2 (corresponding author)
Abstract

In this paper, we introduce CIBER (Claim Investigation Based on Evidence Retrieval), an extension of the Retrieval-Augmented Generation (RAG) framework designed to identify corroborating and refuting documents as evidence for scientific claim verification. CIBER addresses the inherent uncertainty in Large Language Models (LLMs) by evaluating response consistency across diverse interrogation probes. By focusing on the behavioral analysis of LLMs without requiring access to their internal information, CIBER is applicable to both white-box and black-box models. Furthermore, CIBER operates in an unsupervised manner, enabling easy generalization across various scientific domains. Comprehensive evaluations conducted using LLMs with varying levels of domain expertise and linguistic proficiency reveal CIBER’s superior performance compared to conventional RAG approaches. These findings not only highlight the effectiveness of CIBER but also provide valuable insights for future advancements in LLM-based scientific claim verification.

Introduction

Figure 1: CIBER Architecture

Recent advances in large language model (LLM) technology promise to redefine how we interact with language technologies and use them in the new digital era (Panagoulias et al. 2023; Wei et al. 2023; Guo et al. 2023; Yang and Luo 2023). One common challenge faced by LLMs is “hallucination,” where they may produce answers that seem correct but are actually inaccurate or misleading (Zhang et al. 2023). This can be particularly problematic in scientific investigations, where the accuracy and reliability of evidence and claims are critical. Hallucinations often arise when the knowledge necessary to answer a specific user query is absent from, or inadequately represented in, an LLM’s training data. Retrieval-Augmented Generation (RAG), a technique designed to improve the reliability of LLMs (Lewis et al. 2020), can mitigate this issue by dynamically bringing in external knowledge sources relevant to a user’s query. Prior research has demonstrated that RAG can significantly reduce the hallucination rate of LLMs and enhance their response reliability (Feldman, Foulds, and Pan 2023, 2024). Since its debut, RAG has rapidly gained popularity (Siriwardhana et al. 2023; Chen et al. 2024; Hofstätter et al. 2023) and has been integrated into numerous AI applications and services. Although RAG can significantly reduce hallucination, the accuracy of LLM responses can still fluctuate due to factors such as the topic and wording of user queries, the relevance and consistency of retrieved documents, and the language comprehension and reasoning abilities of the LLM employed, which can still lead to erroneous responses. For instance, extensive evaluation on benchmark datasets indicates that RAG still struggles with noise robustness, negative rejection, and information integration (Chen et al. 2024).

To further reduce these errors, we introduce CIBER, which is designed to identify and retrieve scientific documents as corroborating or refuting evidence for claim verification. A claim could range from a causal statement like “Human activities may cause climate change” to a hypothesis such as “Dysregulation of microRNA expression may contribute to the pathogenesis of cardiovascular diseases such as hypertension.” Given a claim $C$ from a user (e.g., a scientist), CIBER can automatically (1) retrieve relevant scientific publications from reliable sources (e.g., The New England Journal of Medicine for biomedical research, or NeurIPS and ICML for machine learning), (2) validate the claim $C$ against each of the retrieved publications (e.g., full papers or abstracts) through multi-faceted interrogations, providing a verdict on its truthfulness along with a confidence score, and (3) present the most representative papers that either support or refute the claim.

Figure 1 depicts the system diagram of CIBER, which is an extension of a typical RAG architecture that includes (1) a text embedding model (Figure 1(a)) that converts text from both external sources (e.g., scientific publications) and user queries (e.g., a claim to be verified) into embedding vectors that capture their semantics, (2) a vector database (Figure 1(b)) enabling similarity-based retrieval of relevant documents given a user query, and (3) an LLM (Figure 1(c)) that analyzes the user query as well as the retrieved documents to generate an answer. In CIBER, we focus on enhancing the last stage of RAG (the new CIBER components are shown in the gray box in Figure 1). Specifically, we employ Multi-Aspect Interrogation (MAI) (Figure 1(d)) to generate different probes to assess the reliability of LLM responses. The Response Resolution (RR) module (Figure 1(e)) parses the LLM responses and maps them to one of three canonical answers: Support, Refute, or Neutral. The Verdict and Confidence (V&C) module (Figure 1(f)) aggregates the responses from all the probes and determines whether the input claim $C$ is supported or refuted by the combined evidence, along with a confidence score. Documents with the highest confidence scores can be presented to the user as representative work that either supports or refutes the claim. The main contributions of our project include:

  • Propose a new framework, CIBER, that reduces hallucination in LLM generation further than a typical RAG. CIBER is unsupervised, making it applicable in diverse scientific fields. Moreover, it does not require access to LLM internal information (e.g., model parameters or training data), making it suitable for both white-box and black-box LLMs.

  • Develop various methods to systematically integrate and fuse the evidence gathered from different interrogation probes to determine the truthfulness of a claim, along with an associated confidence score.

  • Create two new synthetic and two new real datasets to assess the effectiveness of the proposed method.

Related Work

RAG has experienced significant development in recent years. Initially, RAG systems focused on directly augmenting Large Language Models (LLMs) with retrieved knowledge through enhanced pre-training techniques (Lewis et al. 2020; Borgeaud et al. 2022). However, with advanced LLMs demonstrating strong contextual learning abilities, RAG research has transitioned towards providing improved and more relevant contextual information. These retrieval-based enhancements include improved source selection (Li, Nie, and Liang 2023) and query expansion (Ma et al. 2023; Peng et al. 2023), refined content indexing (Wang et al. 2024), enhanced content ranking (Zhuang et al. 2023), and advanced iterative and recursive retrieval techniques (Shao et al. 2023; Trivedi et al. 2022). In contrast, our focus is on enhancing the generation stage of RAG, where we systematically measure the uncertainty in LLM responses to a variety of interrogation probes to assess the reliability of these responses and derive the final verdicts.

There is also a large body of work on non-RAG-based automated claim verification. It often involves four stages, beginning with claim detection, where claims are selected for verification based on their check-worthiness (Hassan, Li, and Tremayne 2015). Evidence retrieval aims to find evidence indicating a claim’s veracity, often using metadata or stance detection techniques (Ferreira and Vlachos 2016; Hanselowski et al. 2019). Verdict prediction determines the truthfulness of claims (Nakashole and Mitchell 2014; Augenstein et al. 2019). Additionally, knowledge graph-based fact verification has been proposed to assess the veracity of extracted claims by leveraging structured knowledge bases (Tchechmedjiev et al. 2019). Finally, justification generation explains the reasoning behind verdicts, with strategies including attention weights, logic-based systems, or textual explanations (Popat et al. 2018; Ahmadi et al. 2019; Atanasova 2024). Among the four tasks in the claim extraction and verification pipeline, we focus only on evidence retrieval and verdict prediction.

Research Questions

In this study, we focus on three research questions:

  • (RQ1) How does the performance of CIBER compare with that of a typical RAG? How does its performance vary with different LLMs with diverse language understanding and reasoning abilities?

  • (RQ2) How do various interrogation strategies within the MAI module influence the performance of CIBER?

  • (RQ3) What effects do different evidence fusion strategies within the V&C module have on CIBER performance?

In the following, we provide details on the design and implementation of CIBER.

|          | Verdict = Support | Verdict = Refute | Verdict = Neutral |
| $P_{AG}$ | $Prob(r_i=S)$ | $Prob(r_i=R)$ | $Prob(r_i=N)$ |
| $P_{CF}$ | $\alpha \cdot Prob(r_i=S)$ | $\alpha \cdot Prob(r_i=R)$ | $Prob(r_i=N) + (1-\alpha)\,\big(Prob(r_i=S) + Prob(r_i=R)\big)$ |
Table 1: Mass function $m(\cdot)$ used in Dempster-Shafer Belief Update.

Methodology

In this section, we describe the main CIBER modules: MAI, RR, and V&C.

Multi-Aspect Interrogation (MAI)

MAI is designed to assess the consistency of LLM responses under various probes with diverse lexical and logical variations. MAI begins by verifying a claim $C$ within the context of each retrieved paper/abstract. This step is critical in verifying scientific claims, since the specialized knowledge necessary to understand or verify such claims may not be adequately represented in conventional LLM training datasets. By anchoring the LLM’s responses in a specific scientific study, this contextual grounding enables the LLM to provide more accurate and relevant information in response to specific scientific claims. Specifically, given an input claim $C$, a retrieved publication $A$, and an LLM $L$, we construct the Original Probe $p_O$ (e.g., “Based on the study presented in Paper A, is Claim C true?”). We also create additional probes to interrogate $L$ from various perspectives. To test $L$’s logical consistency, we design two sets of probes: $P_{AG}$ (“AGree Probes”) and $P_{CF}$ (“ConFlict Probes”). $P_{AG}$ includes probes whose LLM responses should align (or agree) with those from $p_O$, while $P_{CF}$ contains probes whose responses should contradict (or disagree with) those from $p_O$. Moreover, to test $L$’s ability to understand user queries with different lexical and syntactic variations, for each $p_i$ in either $P_{AG}$ or $P_{CF}$, we add $g(p_i)$, a paraphrase of $p_i$. To illustrate, imagine a climate researcher seeking to understand the causal link between human activities and climate change. Instead of directly querying an LLM like ChatGPT with “Can human activities cause climate change?”, which may yield inaccurate results, our system begins by accessing papers from high-quality venues (e.g., the journal Nature Climate Change). For each paper/abstract, the system examines the LLM responses from various perspectives. In this example, $p_O$ can be a query such as “Based on the study described in paper $A$, is the claim $C$ true?” We can populate $P_{AG}$ with $p_O$ plus paraphrases of $p_O$ such as “Is claim $C$ supported by the study described in paper $A$?” We can populate $P_{CF}$ with $\neg p_O$, such as “Based on the study described in paper $A$, is the claim $C$ false?”, and paraphrases of $\neg p_O$, such as “Is claim $C$ refuted by the study described in paper $A$?”
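To make the probe construction concrete, the following is a minimal sketch of how the $P_{AG}$ and $P_{CF}$ probe sets could be assembled for a claim. The probe wording and the single hand-written paraphrase per probe are illustrative assumptions, not the exact prompts used in CIBER; in practice, paraphrases could come from a rewriting model or another LLM call.

```python
# A minimal sketch of Multi-Aspect Interrogation (MAI) probe construction.
# The templates below are illustrative; they are not the exact CIBER prompts.

def build_probes(claim: str) -> dict:
    """Return the agree (P_AG) and conflict (P_CF) probe sets for a claim."""
    p_o = f"Based on the study described in the paper, is the claim '{claim}' true?"
    neg_p_o = f"Based on the study described in the paper, is the claim '{claim}' false?"

    p_ag = [p_o,
            f"Is the claim '{claim}' supported by the study described in the paper?"]
    p_cf = [neg_p_o,
            f"Is the claim '{claim}' refuted by the study described in the paper?"]
    return {"AG": p_ag, "CF": p_cf}

probes = build_probes("Human activities may cause climate change")
```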

Response Resolution (RR)

We employ specific prompt engineering strategies to facilitate the parsing of LLM responses. For GPT-2, which operates in a sentence completion mode, we append adverbs like “relatively” or “quite” to the prompts so that the words completed by the LLM are more constrained. For instance, in the prompt “Based on the study described in the paper, the likelihood that ‘human activities may cause climate change’ is relatively/quite [Blank],” the appended adverbs significantly limit the potential LLM responses to words like “high” or “low.” In contrast, without these adverbs, the expressions in [Blank] could vary widely, ranging from “contingent on extensive scientific evidence” to “subject to further investigation.” Similarly, for GPT-3.5 and GPT-4, which operate in a Q&A mode, we append specific instructions to constrain the response format: “Please answer with either yes or no. If you are not sure, please say ‘I am not sure’.” While these prompt engineering strategies work most of the time, there is no guarantee that the LLMs will always follow the instructions. For example, GPT-3.5/GPT-4 may occasionally generate a response like “According to the abstract provided, the statement ‘human activities may cause climate change’ is false.” To process these responses, we developed a simple lexicon and regular expression-based parser to map them to their canonical forms. For example, for GPT-2, we extract the words after the adverb “relatively” or “quite” and map them to “Support,” “Refute,” or “Neutral.” To parse the responses from GPT-3.5 and GPT-4, we developed a regular expression-based parser that identifies direct answers such as “Yes,” “No,” and “I am not sure,” as well as other variants such as “is correct” and “is not false.”
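As an illustration of the response resolution step, a minimal lexicon/regex parser for Q&A-style responses might look as follows. The specific patterns are assumptions for illustration, not the exact lexicon used in CIBER.

```python
import re

def resolve_response(text: str) -> str:
    """Map a raw LLM response to a canonical verdict:
    'S' (Support), 'R' (Refute), or 'N' (Neutral)."""
    t = text.strip().lower()
    if re.search(r"\b(i am not sure|not sure|cannot determine)\b", t):
        return "N"
    # Check affirmations (including double negations like "is not false") first.
    if re.match(r"^yes\b", t) or re.search(r"\bis (correct|true|not false)\b", t):
        return "S"
    if re.match(r"^no\b", t) or re.search(r"\bis (incorrect|false|not true)\b", t):
        return "R"
    return "N"  # fall back to Neutral when no pattern matches

print(resolve_response("Yes, the claim is supported by the abstract."))  # -> S
```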

Verdict and Confidence (V&C)

Given a claim $C$ and a retrieved publication $A$, the V&C module gathers all the LLM responses from all the probes in $P_{AG}$ and $P_{CF}$ and generates the final verdict $V$ plus a confidence score $CS$. Because an LLM typically employs a stochastic text generation process, sending the same probe to the LLM multiple times can result in varied responses. To address this uncertainty, we issue each probe in $P_{AG}$ and $P_{CF}$ $K$ times, recording all $K$ LLM responses per probe. We explored three fusion strategies to combine the evidence from all the probes.
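As a sketch of this evidence-gathering step, the snippet below issues each probe $K$ times and turns the resolved responses into an empirical distribution over {S, R, N}, which the fusion strategies below consume. Here `query_llm` and `resolve_response` are hypothetical helpers standing in for the LLM API call and the RR module.

```python
from collections import Counter

def response_distribution(probes, query_llm, resolve_response, k=10):
    """Estimate Prob(r = S/R/N) from k runs of every probe in a probe set."""
    counts = Counter()
    for probe in probes:
        for _ in range(k):
            counts[resolve_response(query_llm(probe))] += 1
    total = sum(counts.values())
    return {v: counts[v] / total for v in ("S", "R", "N")}
```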

Weighted Proportions (WP)

Before consolidating the evidence from all probes, we invert the LLM responses generated from $p_i \in P_{CF}$, because a “Support” response to $p_i \in P_{CF}$ is equivalent to a “Refute” response to $p_j \in P_{AG}$. Subsequently, we employ the following formulas to calculate the verdict $V_{WP}$.

WP_{S} = \alpha \cdot Prob(r_{i}=S, i \in P_{AG}) + (1-\alpha) \cdot Prob(r_{i}=S, i \in P_{CF})   (1)
WP_{R} = \alpha \cdot Prob(r_{i}=R, i \in P_{AG}) + (1-\alpha) \cdot Prob(r_{i}=R, i \in P_{CF})   (2)
WP_{N} = \alpha \cdot Prob(r_{i}=N, i \in P_{AG}) + (1-\alpha) \cdot Prob(r_{i}=N, i \in P_{CF})   (3)
V_{WP} = \operatorname*{arg\,max}_{Q \in \{S,R,N\}} WP_{Q}   (4)

where $r_i$ represents an LLM response generated with probe $i$, and $Prob(r_i)$ denotes the probability of observing response $r_i$ among the responses. Here, $r_i$ can take one of three values: “(S)upport,” “(R)efute,” or “(N)eutral.” $\alpha$ serves as a trade-off parameter that regulates the relative importance assigned to the evidence derived from the probes in $P_{AG}$ compared to those in $P_{CF}$. The final confidence score is $CS_{WP} = WP_{V_{WP}}$.
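A minimal sketch of the WP computation is shown below, assuming `p_ag` and `p_cf` hold the empirical verdict distributions for $P_{AG}$ and $P_{CF}$ (with the $P_{CF}$ responses already inverted as described above).

```python
def weighted_proportions(p_ag, p_cf, alpha=0.5):
    """Weighted Proportions fusion (Eqs. 1-4)."""
    wp = {v: alpha * p_ag[v] + (1 - alpha) * p_cf[v] for v in ("S", "R", "N")}
    verdict = max(wp, key=wp.get)   # arg max over S, R, N
    return verdict, wp[verdict]     # CS_WP is the WP value of the chosen verdict

# Example: both probe sets lean towards Support.
print(weighted_proportions({"S": 0.8, "R": 0.1, "N": 0.1},
                           {"S": 0.7, "R": 0.2, "N": 0.1}, alpha=0.6))
```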

Weighted Information Gain (WIG)

Similar to WP, we begin by inverting the LLM responses generated from $p_i \in P_{CF}$. Subsequently, we use information theory-based metrics to evaluate the uncertainty in these LLM responses. Information theory is a mathematical framework for quantifying the amount of information in data (Shannon 1948); a central concept is entropy, which measures the uncertainty in a dataset. In addition to the entropy $E$, we also compute the Information Gain (IG), which quantifies the reduction in uncertainty given the evidence from the LLM responses.

E_{AG} = -\sum_{r_{i} \in \{S,R,N\}} Prob(r_{i}, i \in P_{AG}) \log Prob(r_{i}, i \in P_{AG})   (5)
E_{CF} = -\sum_{r_{i} \in \{S,R,N\}} Prob(r_{i}, i \in P_{CF}) \log Prob(r_{i}, i \in P_{CF})   (6)
IG_{AG} = \log M - E_{AG}   (7)
IG_{CF} = \log M - E_{CF},   (8)

where $M$ represents the number of possible verdicts, which is 3 (“Support,” “Refute,” and “Neutral”). Next, we compute the Weighted Information Gain (WIG) and the verdict $V_{WIG}$:

WIG_{S} = \alpha \cdot IG_{AG} \cdot Prob(r_{i}=S, i \in P_{AG}) + (1-\alpha) \cdot IG_{CF} \cdot Prob(r_{i}=S, i \in P_{CF})   (9)
WIG_{R} = \alpha \cdot IG_{AG} \cdot Prob(r_{i}=R, i \in P_{AG}) + (1-\alpha) \cdot IG_{CF} \cdot Prob(r_{i}=R, i \in P_{CF})   (10)
WIG_{N} = \alpha \cdot IG_{AG} \cdot Prob(r_{i}=N, i \in P_{AG}) + (1-\alpha) \cdot IG_{CF} \cdot Prob(r_{i}=N, i \in P_{CF})   (11)
V_{WIG} = \operatorname*{arg\,max}_{Q \in \{S,R,N\}} WIG_{Q}.   (12)

The final confidence score is $CS_{WIG} = WIG_{V_{WIG}}$.
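The WIG strategy can be sketched as follows (natural logarithm assumed, matching Eqs. 5-12); again, `p_ag` and `p_cf` denote the inverted empirical verdict distributions.

```python
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def weighted_information_gain(p_ag, p_cf, alpha=0.5, m=3):
    """Weighted Information Gain fusion (Eqs. 5-12)."""
    ig_ag = math.log(m) - entropy(p_ag)   # information gain of the P_AG responses
    ig_cf = math.log(m) - entropy(p_cf)   # information gain of the P_CF responses
    wig = {v: alpha * ig_ag * p_ag[v] + (1 - alpha) * ig_cf * p_cf[v]
           for v in ("S", "R", "N")}
    verdict = max(wig, key=wig.get)
    return verdict, wig[verdict]          # CS_WIG
```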

Weighted Belief Update (WBU)

The third fusion approach is based on the Dempster–Shafer theory (DST) of evidence and belief update (Dempster 2008; Shafer 1976). DST provides a framework for reasoning under uncertainty by combining evidence from multiple sources. The theory is particularly useful when evidence is incomplete or conflicting, providing a systematic way to manage uncertainty. As in the approaches above, we begin by inverting the LLM responses generated from $p_i \in P_{CF}$. Given the LLM responses generated from the probes in $P_{AG}$ and $P_{CF}$ and the verdicts “Support,” “Refute,” and “Neutral,” we define the mass function $m(\cdot)$ as in Table 1. We then apply Dempster’s rule of combination to fuse the evidence and update the beliefs:

m(S) = (m_{AG} \oplus m_{CF})(S) = \frac{1}{1-K} \sum_{V_{1} \cap V_{2} = S} m_{AG}(V_{1})\, m_{CF}(V_{2}) = \frac{1}{1-K} \big( m_{AG}(S)\, m_{CF}(S) + m_{AG}(S)\, m_{CF}(N) + m_{AG}(N)\, m_{CF}(S) \big)   (13)
m(R) = (m_{AG} \oplus m_{CF})(R) = \frac{1}{1-K} \sum_{V_{1} \cap V_{2} = R} m_{AG}(V_{1})\, m_{CF}(V_{2}) = \frac{1}{1-K} \big( m_{AG}(R)\, m_{CF}(R) + m_{AG}(R)\, m_{CF}(N) + m_{AG}(N)\, m_{CF}(R) \big)   (14)
m(N) = (m_{AG} \oplus m_{CF})(N) = \frac{1}{1-K} \sum_{V_{1} \cap V_{2} = N} m_{AG}(V_{1})\, m_{CF}(V_{2}) = \frac{1}{1-K} m_{AG}(N)\, m_{CF}(N)   (15)
K = \sum_{V_{1} \cap V_{2} = \emptyset} m_{AG}(V_{1})\, m_{CF}(V_{2}) = m_{AG}(R)\, m_{CF}(S) + m_{AG}(S)\, m_{CF}(R).   (16)

The verdict can then be generated based on the updated belief:

V_{WBU} = \operatorname*{arg\,max}_{Q \in \{S,R,N\}} m(Q)   (17)

The confidence score is $CS_{WBU} = m(V_{WBU})$.
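A minimal sketch of the WBU strategy, combining the Table 1 mass functions with Dempster's rule (Eqs. 13-16), is given below; `p_ag` and `p_cf` are again the inverted empirical verdict distributions.

```python
def mass_functions(p_ag, p_cf, alpha=0.5):
    """Mass functions m_AG and m_CF as defined in Table 1."""
    m_ag = dict(p_ag)
    m_cf = {"S": alpha * p_cf["S"],
            "R": alpha * p_cf["R"],
            "N": p_cf["N"] + (1 - alpha) * (p_cf["S"] + p_cf["R"])}
    return m_ag, m_cf

def dempster_combine(m_ag, m_cf):
    """Dempster's rule of combination over {S, R, N} (Eqs. 13-16).
    Assumes the two sources are not in complete conflict (K < 1)."""
    k = m_ag["R"] * m_cf["S"] + m_ag["S"] * m_cf["R"]   # conflict mass K
    norm = 1.0 - k
    m = {"S": (m_ag["S"] * m_cf["S"] + m_ag["S"] * m_cf["N"] + m_ag["N"] * m_cf["S"]) / norm,
         "R": (m_ag["R"] * m_cf["R"] + m_ag["R"] * m_cf["N"] + m_ag["N"] * m_cf["R"]) / norm,
         "N": (m_ag["N"] * m_cf["N"]) / norm}
    verdict = max(m, key=m.get)
    return verdict, m[verdict]                          # CS_WBU
```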

Meta Verdict and Confidence

We can employ an ensemble method to combine the verdicts generated by each fusion strategy. In this investigation, we simply apply a majority voting strategy over all the verdicts to compute $V_{M}$. As the confidence scores $CS_{Q}$, $Q \in \{WP, WIG, WBU\}$, have different ranges, we first normalize them so that all of them fall within [0, 1]. The final confidence score $CS_{M}$ is the average of the normalized individual confidence scores.

V_{M} = Mode(V_{WP}, V_{WIG}, V_{WBU})   (18)
CS_{M} = \sum_{Q \in \{WP, WIG, WBU\}} \frac{1}{3} Norm(CS_{Q})   (19)
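A sketch of this meta step is shown below; `normalize` is a placeholder, since the paper only states that each confidence score is rescaled to [0, 1] before averaging.

```python
from statistics import mode

def meta_verdict(results, normalize=lambda cs: cs):
    """Majority vote over strategy verdicts (Eq. 18) and the average of the
    normalized confidence scores (Eq. 19). `results` maps a strategy name
    (e.g., 'WP') to its (verdict, confidence) pair."""
    verdicts = [v for v, _ in results.values()]
    v_m = mode(verdicts)
    cs_m = sum(normalize(cs) for _, cs in results.values()) / len(results)
    return v_m, cs_m

print(meta_verdict({"WP": ("S", 0.72), "WIG": ("S", 0.41), "WBU": ("R", 0.55)}))
```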

Evaluation

Figure 2: Examples of Synthetic and Real Abstracts
| Data  | Model | GPT-2 Acc. | GPT-2 F1 | GPT-3.5 Acc. | GPT-3.5 F1 | GPT-4 Acc. | GPT-4 F1 |
| CSyn  | RAG   | 0.2667 | 0.2100 | 0.7083 | 0.6428 | 0.7083 | 0.6847 |
| CSyn  | CIBER | 0.3583 | 0.2840 | 0.8583 | 0.8329 | 0.8167 | 0.7993 |
| CReal | RAG   | 0.3250 | 0.2602 | 0.4667 | 0.4098 | 0.5083 | 0.4053 |
| CReal | CIBER | 0.3333 | 0.2600 | 0.5917 | 0.5539 | 0.5667 | 0.4567 |
| ASyn  | RAG   | 0.3417 | 0.2949 | 0.4750 | 0.4688 | 0.4833 | 0.4111 |
| ASyn  | CIBER | 0.3500 | 0.2630 | 0.6333 | 0.5875 | 0.6500 | 0.5794 |
| AReal | RAG   | 0.3167 | 0.2750 | 0.4250 | 0.4212 | 0.5500 | 0.4760 |
| AReal | CIBER | 0.3583 | 0.2660 | 0.5833 | 0.5157 | 0.6333 | 0.5531 |
| All   | RAG   | 0.3125 | 0.2600 | 0.5188 | 0.4857 | 0.5625 | 0.4943 |
| All   | CIBER | 0.3500 | 0.2683 | 0.6667 | 0.6225 | 0.6667 | 0.5971 |
Table 2: Performance of CIBER versus RAG with different LLMs on five datasets: CSyn, CReal, ASyn, AReal, and All.

To systematically assess the effectiveness of our proposed methods, we created two synthetic and two real paper abstract datasets with ground truth labels.

Ground Truth Datasets

The first two datasets, one synthetic and one real, were created to evaluate the validity of the claim regarding the impact of human activities on climate change. The synthetic dataset was generated using ChatGPT 3.5 Turbo with the prompt “Please generate a paper abstract summarizing a scientific study investigating the causal relationship between human activities and climate change. The conclusion should support (or refute, or remain neutral respectively) on the assertion that human activities can cause climate change.” The climate synthetic dataset comprises a total of 60 abstracts, with 20 in each of the three categories: supporting, refuting, and neutral on the input claim. Considering that synthetic abstracts may be subject to the limitations inherent in LLMs, we complemented them with a real dataset sourced from survey papers (Lynas, Houlton, and Perry 2021; Cook et al. 2013) covering over 3,000 papers with established ground truth labels. From these, we randomly sampled 20 abstracts for each of the three categories, resulting in a total of 60 real climate paper abstracts.

We also created a synthetic and a real dataset to evaluate the assertion regarding the purported association between vaccination and autism. However, generating abstracts supporting this claim proved challenging, as such content was deemed false and harmful by GPT-3.5. To circumvent this, when GPT-3.5 generated an abstract refuting the claim (which it readily did), we prompted it to generate another abstract with the opposing view. The final autism synthetic dataset includes 20 abstracts for each of the three categories. We further compiled a real dataset pertaining to the autism claim based on several survey papers (Doja and Roberts 2006; Folb et al. 2004; Stratton et al. 2001; Wilson et al. 2003; Madsen and Vestergaard 2004; Mohammed et al. 2022; Boretti 2021), extracting 20 abstracts with ground truth labels for each of the three categories. Figure 2(a) shows examples from the synthetic dataset and Figure 2(b) examples from the real dataset.

The synthetic and real datasets exhibit distinct characteristics. Synthetic abstracts tend to be more general. Vocabulary-wise, they use words more closely tied to the given claims. In contrast, real abstracts have a higher degree of specificity and nuance, often requiring deeper domain knowledge for accurate understanding.

Experiments

We conducted experiments to answer the main research questions. To test how effectively CIBER works with LLMs of varying capabilities, we employed the OpenAI GPT-2 model from Hugging Face and accessed the GPT-3.5-Turbo and GPT-4-Turbo models via OpenAI’s APIs. In our experiments, for each prompt in $P_{AG}$ and $P_{CF}$, we recorded the LLM responses from 10 random runs. We also performed a grid search to determine the best $\alpha$, as sketched below.
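The grid search over the trade-off parameter $\alpha$ can be sketched as follows; `evaluate_accuracy` is a hypothetical function that runs CIBER with a given $\alpha$ on a labeled development set, and the 0.1-spaced grid is an assumption for illustration.

```python
def grid_search_alpha(evaluate_accuracy, grid=None):
    """Pick the alpha with the highest development-set accuracy."""
    grid = grid if grid is not None else [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    scores = {a: evaluate_accuracy(a) for a in grid}
    best_alpha = max(scores, key=scores.get)
    return best_alpha, scores[best_alpha]
```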

CIBER performance with different LLMs

To answer RQ1, we compare the performance of CIBER with that of a traditional RAG using three different LLMs with abilities ranging from low (GPT-2) to typical (GPT-3.5-Turbo) to state-of-the-art (GPT-4-Turbo). We present the evaluation results on five datasets: CSyn, the climate change synthetic dataset; CReal, the climate change real dataset; ASyn, the autism synthetic dataset; and AReal, the autism real dataset. In addition, we compute the overall performance on a combination of all four datasets (All).

As shown in Table 2, significant performance differences exist among CIBER configurations utilizing different LLMs. Specifically, CIBER with GPT-2 generally demonstrated poor reliability, with an accuracy of 0.338 and an F1 score of 0.268 on the All dataset. More advanced LLMs performed much better. Both GPT-3.5 and GPT-4 exhibited comparable accuracy (0.667), although GPT-3.5 outperformed GPT-4 in terms of F1 score (0.623 versus 0.597) on the All dataset.

Moreover, CIBER outperformed RAG across all three LLMs. The most significant enhancements on the All dataset were observed with GPT-3.5, which exhibited a substantial 14.8% increase in accuracy and a 13.7% improvement in F1 score. Similarly, GPT-4 achieved significant gains, with a 10.4% enhancement in accuracy and a 10.3% increase in F1 score. Comparatively, the enhancements with GPT-2 were the least, with a modest 3.75% improvement in accuracy and a marginal 0.8% increase in F1 score.

From this result, it’s evident that CIBER effectively improves the performance of claim verification beyond what was achieved with conventional RAG methods. Given the nuanced nature of scientific literature and the requirement for precise language comprehension, advanced models like GPT-3.5 and GPT-4 exhibit much better performance compared to smaller models like GPT-2.

Impact of incorporating diverse interrogation strategies

| Data  | Model     | GPT-2 Acc. | GPT-2 F1 | GPT-3.5 Acc. | GPT-3.5 F1 | GPT-4 Acc. | GPT-4 F1 |
| CSyn  | RAG       | 0.2667 | 0.2100 | 0.7083 | 0.6428 | 0.7083 | 0.6847 |
| CSyn  | CIBER-AG  | 0.3167 | 0.1892 | 0.8583 | 0.8329 | 0.8250 | 0.8075 |
| CSyn  | CIBER-CF  | 0.3333 | 0.2358 | 0.6583 | 0.5605 | 0.6667 | 0.6117 |
| CSyn  | CIBER-ALL | 0.3583 | 0.2840 | 0.8583 | 0.8329 | 0.8167 | 0.7993 |
| CReal | RAG       | 0.3250 | 0.2602 | 0.4667 | 0.4098 | 0.5083 | 0.4053 |
| CReal | CIBER-AG  | 0.3250 | 0.2063 | 0.6000 | 0.5615 | 0.5333 | 0.4287 |
| CReal | CIBER-CF  | 0.3333 | 0.2174 | 0.4417 | 0.3648 | 0.5500 | 0.4457 |
| CReal | CIBER-ALL | 0.3333 | 0.2600 | 0.5917 | 0.5539 | 0.5667 | 0.4567 |
| ASyn  | RAG       | 0.3417 | 0.2949 | 0.4750 | 0.4688 | 0.4833 | 0.4111 |
| ASyn  | CIBER-AG  | 0.3250 | 0.1940 | 0.6000 | 0.5567 | 0.6500 | 0.5712 |
| ASyn  | CIBER-CF  | 0.3500 | 0.2551 | 0.3583 | 0.2954 | 0.4667 | 0.3612 |
| ASyn  | CIBER-ALL | 0.3500 | 0.2630 | 0.6333 | 0.5875 | 0.6500 | 0.5794 |
| AReal | RAG       | 0.3167 | 0.2750 | 0.4250 | 0.4212 | 0.5500 | 0.4760 |
| AReal | CIBER-AG  | 0.3250 | 0.1869 | 0.5333 | 0.4739 | 0.6333 | 0.5508 |
| AReal | CIBER-CF  | 0.3167 | 0.2038 | 0.3583 | 0.2988 | 0.5000 | 0.4355 |
| AReal | CIBER-ALL | 0.3583 | 0.2660 | 0.5833 | 0.5157 | 0.6333 | 0.5531 |
| All   | RAG       | 0.3125 | 0.2600 | 0.5188 | 0.4857 | 0.5625 | 0.4943 |
| All   | CIBER-AG  | 0.3229 | 0.1941 | 0.6479 | 0.6063 | 0.6604 | 0.5896 |
| All   | CIBER-CF  | 0.3333 | 0.2280 | 0.4542 | 0.3799 | 0.5459 | 0.4635 |
| All   | CIBER-ALL | 0.3500 | 0.2683 | 0.6667 | 0.6225 | 0.6667 | 0.5971 |
Table 3: The impact of interrogation strategies on system performance on five datasets: CSyn, CReal, ASyn, AReal, and All.

To answer RQ2, we conducted experiments to compare versions of CIBER employing different interrogation strategies: $CIBER_{AG}$, which only employs the probes in $P_{AG}$; $CIBER_{CF}$, which only employs the probes in $P_{CF}$; and $CIBER_{ALL}$, which employs the probes in both.

As shown in Table 3, $CIBER_{CF}$ exhibited a significant disadvantage compared to $CIBER_{AG}$ with GPT-3.5 (a 19.4% decrease in accuracy and a 22.6% decrease in F1 on the All dataset) and GPT-4 (an 11.5% decrease in accuracy and a 12.6% decrease in F1 on the All dataset). However, this disadvantage was not observed with GPT-2; in fact, $CIBER_{CF}$ held a slight edge over $CIBER_{AG}$ (a 1% increase in accuracy and a 3.4% improvement in F1 score). Upon reviewing the log file, we found that this may stem from how we processed the responses to $P_{CF}$. While we employed GPT-2 in an auto-completion mode, GPT-3.5 and GPT-4 were run in Q&A mode. To make response parsing easier, we specifically instructed GPT-3.5 and GPT-4 to answer with “Yes,” “No,” or “I am not sure,” whose interpretation under negative probes can be ambiguous. To illustrate this, we show two responses from GPT-3.5 in our query log. In addition to the “Yes” and “No” answers we requested, the model occasionally produces supplementary information that helps us interpret its responses more precisely:

Prompt: Based on the abstract, is the following claim “Human activities may cause climate change” false?
(Response 1) Yes, the statement ”Human activities may cause climate change” is not necessarily false based on the information provided in the abstract.
(Response 2) No. The statement ”Human activities may cause climate change” is not false based on the information provided in the abstract.

Since our RR component simply extracts “Yes” and “No” from the responses, it assigns different verdicts to these two responses, even though both support the original claim. We expect that adopting better response resolution strategies in the future could significantly improve the effectiveness of the probes in $P_{CF}$ for GPT-3.5 and GPT-4.

Nonetheless, even with our current simple response resolution strategies for GPT-3.5 and GPT-4, aggregating the probes from both $P_{AG}$ and $P_{CF}$ yielded a consistent performance enhancement for $CIBER_{ALL}$ on the All dataset compared to either $CIBER_{AG}$ or $CIBER_{CF}$ alone.

Figure 3: Correlations Between Different Metrics used in Combining Evidences
| Data  | Model      | GPT-2 Acc. | GPT-2 F1 | GPT-3.5 Acc. | GPT-3.5 F1 | GPT-4 Acc. | GPT-4 F1 |
| CSyn  | RAG        | 0.2667 | 0.2100 | 0.7083 | 0.6428 | 0.7083 | 0.6847 |
| CSyn  | CIBER-WP   | 0.3583 | 0.2720 | 0.8500 | 0.8188 | 0.8000 | 0.7758 |
| CSyn  | CIBER-WIG  | 0.3417 | 0.3274 | 0.8583 | 0.8329 | 0.8250 | 0.8075 |
| CSyn  | CIBER-WBU  | 0.3583 | 0.2810 | 0.8583 | 0.8329 | 0.8500 | 0.8367 |
| CSyn  | CIBER-ALL  | 0.3583 | 0.2840 | 0.8583 | 0.8329 | 0.8167 | 0.7993 |
| CReal | RAG        | 0.3250 | 0.2602 | 0.4667 | 0.4098 | 0.5083 | 0.4053 |
| CReal | CIBER-WP   | 0.3667 | 0.3025 | 0.5917 | 0.5530 | 0.5667 | 0.4567 |
| CReal | CIBER-WIG  | 0.3500 | 0.3045 | 0.6000 | 0.5615 | 0.5667 | 0.4567 |
| CReal | CIBER-WBU  | 0.3500 | 0.2610 | 0.6083 | 0.5680 | 0.5750 | 0.4639 |
| CReal | CIBER-ALL  | 0.3333 | 0.2600 | 0.5917 | 0.5539 | 0.5667 | 0.4567 |
| ASyn  | RAG        | 0.3417 | 0.2949 | 0.4750 | 0.4688 | 0.4833 | 0.4111 |
| ASyn  | CIBER-WP   | 0.3583 | 0.2742 | 0.6250 | 0.6066 | 0.5917 | 0.5194 |
| ASyn  | CIBER-WIG  | 0.3500 | 0.2741 | 0.6333 | 0.5898 | 0.6500 | 0.5794 |
| ASyn  | CIBER-WBU  | 0.3417 | 0.2587 | 0.6167 | 0.5738 | 0.6500 | 0.5794 |
| ASyn  | CIBER-ALL  | 0.3500 | 0.2630 | 0.6333 | 0.5875 | 0.6500 | 0.5794 |
| AReal | RAG        | 0.3167 | 0.2750 | 0.4250 | 0.4212 | 0.5500 | 0.4760 |
| AReal | CIBER-WP   | 0.3667 | 0.2726 | 0.5750 | 0.5285 | 0.5917 | 0.5217 |
| AReal | CIBER-WIG  | 0.3583 | 0.2951 | 0.5833 | 0.5176 | 0.6333 | 0.5531 |
| AReal | CIBER-WBU  | 0.3583 | 0.2703 | 0.5750 | 0.5098 | 0.6333 | 0.5508 |
| AReal | CIBER-ALL  | 0.3583 | 0.2660 | 0.5833 | 0.5157 | 0.6333 | 0.5531 |
| All   | RAG        | 0.3125 | 0.2600 | 0.5188 | 0.4857 | 0.5625 | 0.4943 |
| All   | CIBER-WP   | 0.3625 | 0.2803 | 0.6604 | 0.6267 | 0.6375 | 0.5684 |
| All   | CIBER-WIG  | 0.3500 | 0.3003 | 0.6687 | 0.6255 | 0.6688 | 0.5992 |
| All   | CIBER-WBU  | 0.3521 | 0.2678 | 0.6646 | 0.6211 | 0.6771 | 0.6077 |
| All   | CIBER-ALL  | 0.3500 | 0.2683 | 0.6667 | 0.6225 | 0.6667 | 0.5971 |
Table 4: The impact of evidence fusion strategies on system performance on five datasets: CSyn, CReal, ASyn, AReal, and All.

Impact of different evidence fusion strategies

To answer RQ3, we performed experiments comparing versions of CIBER with different evidence fusion strategies: $CIBER_{WP}$, which utilizes the weighted proportion-based fusion strategy; $CIBER_{WIG}$, which employs the weighted information gain-based strategy; $CIBER_{WBU}$, which employs the weighted belief update-based fusion strategy; and $CIBER_{ALL}$, which employs a majority voting strategy to aggregate the verdicts from the individual strategies.

Based on the results presented in Table 4, each of the three evidence fusion strategies individually outperformed traditional RAG by a considerable margin on the All dataset. However, there is no clear pattern regarding which strategy is the top performer, as each strategy claims the top spot in one-third of the tests on the All dataset. Furthermore, the simple majority voting-based verdict aggregation strategy did not yield a superior model.

To explore the relationship between different fusion strategies and different LLMs, we computed the correlations among nine variables comprising combinations of three fusion metrics ($WP$, $WIG$, and $WBU$) and three LLMs (GPT-2, GPT-3.5, and GPT-4). The resulting correlation matrix is depicted in Figure 3, where darker areas represent stronger correlations. As shown in the figure, the most prominent correlations are observed along the diagonal among the variables within each LLM. Specifically, among the GPT-2-related strategies, the highest correlation is between $WIG$ and $WP$ ($\rho=0.55$), followed by $WIG$ and $WBU$ ($\rho=0.34$). Among the GPT-3.5-related variables, the highest correlation occurs between $WIG$ and $WP$ ($\rho=0.72$), followed by $WIG$ and $WBU$ ($\rho=0.50$). For GPT-4, the most highly correlated variables are $WIG$ and $WBU$ ($\rho=0.58$). Across different LLMs, the highest correlation occurs between $WP_{GPT3.5}$ and $WP_{GPT4}$ ($\rho=0.45$).

The correlation results suggest that while there are significant correlations between various fusion metrics, most of them are moderate or low, with the exception of $WIG_{GPT3.5}$ and $WP_{GPT3.5}$. This implies that the different fusion strategies may be complementary. As a result, instead of employing simple majority voting, exploring more sophisticated ensemble methods for verdict aggregation could be a promising avenue for future research.

Conclusions and Future Work

In this paper, we introduce CIBER, a novel framework designed to enhance Retrieval-Augmented Generation (RAG) systems for evidence retrieval and scientific claim verification. CIBER focuses on systematically addressing the inherent uncertainties in LLM outputs. CIBER is quite general and applicable across diverse scenarios: it relies on LLM behavioral analysis, which does not require access to LLM internal information, making it suitable for both white-box and black-box LLMs, and it is unsupervised, making it easily generalizable to different scientific fields. Our evaluation results demonstrate that CIBER achieves significant performance improvements over traditional RAG approaches, particularly benefiting advanced LLMs like GPT-3.5 and GPT-4. Furthermore, we have curated several ground truth datasets, two synthetic and two real, which we plan to share with the research community.

Through this exploration, we have also identified potential areas for future enhancement, including improving LLM response resolution and developing more sophisticated ensemble methods for combining verdicts from different fusion strategies.

While CIBER aims to mitigate hallucinations in LLM generation, which can help reduce the risks associated with spreading LLM-generated misinformation, it’s important to acknowledge that CIBER is still far from perfect. There is a possibility that CIBER could inadvertently generate or cite information that is untrue, particularly if the retrieved content contains misinformation or misleading information. As a result, continuous improvement of CIBER is critical for the safe adoption of LLMs in the real world.

References

  • Ahmadi et al. (2019) Ahmadi, N.; Lee, J.; Papotti, P.; and Saeed, M. 2019. Explainable fact checking with probabilistic answer set programming. arXiv preprint arXiv:1906.09198.
  • Atanasova (2024) Atanasova, P. 2024. Generating fact checking explanations. In Accountable and Explainable Methods for Complex Reasoning over Text, 83–103. Springer.
  • Augenstein et al. (2019) Augenstein, I.; Lioma, C.; Wang, D.; Lima, L. C.; Hansen, C.; Hansen, C.; and Simonsen, J. G. 2019. MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims. arXiv preprint arXiv:1909.03242.
  • Boretti (2021) Boretti, A. 2021. Reviewing the association between aluminum adjuvants in the vaccines and autism spectrum disorder. Journal of Trace Elements in Medicine and Biology, 66: 126764.
  • Borgeaud et al. (2022) Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; Van Den Driessche, G. B.; Lespiau, J.-B.; Damoc, B.; Clark, A.; et al. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, 2206–2240. PMLR.
  • Chen et al. (2024) Chen, J.; Lin, H.; Han, X.; and Sun, L. 2024. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, 17754–17762.
  • Cook et al. (2013) Cook, J.; Nuccitelli, D.; Green, S. A.; Richardson, M.; Winkler, B.; Painting, R.; Way, R.; Jacobs, P.; and Skuce, A. 2013. Quantifying the consensus on anthropogenic global warming in the scientific literature. Environmental research letters, 8(2): 024024.
  • Dempster (2008) Dempster, A. P. 2008. Upper and lower probabilities induced by a multivalued mapping. In Classic works of the Dempster-Shafer theory of belief functions, 57–72. Springer.
  • Doja and Roberts (2006) Doja, A.; and Roberts, W. 2006. Immunizations and autism: a review of the literature. Canadian Journal of Neurological Sciences, 33(4): 341–346.
  • Feldman, Foulds, and Pan (2023) Feldman, P.; Foulds, J. R.; and Pan, S. 2023. Trapping LLM hallucinations using tagged context prompts. arXiv preprint arXiv:2306.06085.
  • Feldman, Foulds, and Pan (2024) Feldman, P.; Foulds, R., James; and Pan, S. 2024. RAGged Edges: The Double-Edged Sword of Retrieval-Augmented Chatbots. arXiv preprint arXiv:2403.01193.
  • Ferreira and Vlachos (2016) Ferreira, W.; and Vlachos, A. 2016. Emergent: a novel data-set for stance classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies. ACL.
  • Folb et al. (2004) Folb, P. I.; Bernatowska, E.; Chen, R.; Clemens, J.; Dodoo, A. N.; Ellenberg, S. S.; Farrington, C. P.; John, T. J.; Lambert, P.-H.; MacDonald, N. E.; et al. 2004. A global perspective on vaccine safety and public health: the Global Advisory Committee on Vaccine Safety. American journal of public health, 94(11): 1926–1931.
  • Guo et al. (2023) Guo, Z.; Wang, P.; Huang, L.; and Cho, J.-H. 2023. Authentic Dialogue Generation to Improve Youth’s Awareness of Cybergrooming for Online Safety. In 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), 64–69.
  • Hanselowski et al. (2019) Hanselowski, A.; Stab, C.; Schulz, C.; Li, Z.; and Gurevych, I. 2019. A richly annotated corpus for different tasks in automated fact-checking. arXiv preprint arXiv:1911.01214.
  • Hassan, Li, and Tremayne (2015) Hassan, N.; Li, C.; and Tremayne, M. 2015. Detecting check-worthy factual claims in presidential debates. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 1835–1838.
  • Hofstätter et al. (2023) Hofstätter, S.; Chen, J.; Raman, K.; and Zamani, H. 2023. FiD-Light: Efficient and effective retrieval-augmented text generation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1437–1447.
  • Lewis et al. (2020) Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33: 9459–9474.
  • Li, Nie, and Liang (2023) Li, X.; Nie, E.; and Liang, S. 2023. From classification to generation: Insights into crosslingual retrieval augmented ICL. arXiv preprint arXiv:2311.06595.
  • Lynas, Houlton, and Perry (2021) Lynas, M.; Houlton, B. Z.; and Perry, S. 2021. Greater than 99% consensus on human caused climate change in the peer-reviewed scientific literature. Environmental Research Letters, 16(11): 114005.
  • Ma et al. (2023) Ma, X.; Gong, Y.; He, P.; Zhao, H.; and Duan, N. 2023. Query rewriting for retrieval-augmented large language models. arXiv preprint arXiv:2305.14283.
  • Madsen and Vestergaard (2004) Madsen, K. M.; and Vestergaard, M. 2004. MMR vaccination and autism: what is the evidence for a causal association? Drug safety, 27: 831–840.
  • Mohammed et al. (2022) Mohammed, S. A.; Rajashekar, S.; Ravindran, S. G.; Kakarla, M.; Gambo, M. A.; Salama, M. Y.; Ismail, N. H.; Tavalla, P.; Uppal, P.; Hamid, P.; et al. 2022. Does Vaccination Increase the Risk of Autism Spectrum Disorder? Cureus, 14(8).
  • Nakashole and Mitchell (2014) Nakashole, N.; and Mitchell, T. 2014. Language-aware truth assessment of fact candidates. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1009–1019.
  • Panagoulias et al. (2023) Panagoulias, D. P.; Palamidas, F. A.; Virvou, M.; and Tsihrintzis, G. A. 2023. Rule-Augmented Artificial Intelligence-empowered Systems for Medical Diagnosis using Large Language Models. In 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), 70–77. Los Alamitos, CA, USA: IEEE Computer Society.
  • Peng et al. (2023) Peng, W.; Li, G.; Jiang, Y.; Wang, Z.; Ou, D.; Zeng, X.; Chen, E.; et al. 2023. Large language model based long-tail query rewriting in TaoBao search. arXiv preprint arXiv:2311.03758.
  • Popat et al. (2018) Popat, K.; Mukherjee, S.; Yates, A.; and Weikum, G. 2018. Declare: Debunking fake news and false claims using evidence-aware deep learning. arXiv preprint arXiv:1809.06416.
  • Shafer (1976) Shafer, G. 1976. A mathematical theory of evidence, volume 42. Princeton university press.
  • Shannon (1948) Shannon, C. E. 1948. A mathematical theory of communication. The Bell system technical journal, 27(3): 379–423.
  • Shao et al. (2023) Shao, Z.; Gong, Y.; Shen, Y.; Huang, M.; Duan, N.; and Chen, W. 2023. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294.
  • Siriwardhana et al. (2023) Siriwardhana, S.; Weerasekera, R.; Wen, E.; Kaluarachchi, T.; Rana, R.; and Nanayakkara, S. 2023. Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Transactions of the Association for Computational Linguistics, 11: 1–17.
  • Stratton et al. (2001) Stratton, K.; Gable, A.; Shetty, P.; McCormick, M.; of Medicine (US) Immunization Safety Review Committee, I.; et al. 2001. Immunization safety review: measles-mumps-rubella vaccine and autism. Immunization Safety Review: Measles-Mumps-Rubella Vaccine and Autism.
  • Tchechmedjiev et al. (2019) Tchechmedjiev, A.; Fafalios, P.; Boland, K.; Gasquet, M.; Zloch, M.; Zapilko, B.; Dietze, S.; and Todorov, K. 2019. ClaimsKG: A knowledge graph of fact-checked claims. In The Semantic Web–ISWC 2019: 18th International Semantic Web Conference, Auckland, New Zealand, October 26–30, 2019, Proceedings, Part II 18, 309–324. Springer.
  • Trivedi et al. (2022) Trivedi, H.; Balasubramanian, N.; Khot, T.; and Sabharwal, A. 2022. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509.
  • Wang et al. (2024) Wang, Y.; Lipka, N.; Rossi, R. A.; Siu, A.; Zhang, R.; and Derr, T. 2024. Knowledge graph prompting for multi-document question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, 19206–19214.
  • Wei et al. (2023) Wei, J.; Courbis, A.-L.; Lambolais, T.; Xu, B.; Bernard, P. L.; and Dray, G. 2023. Zero-shot Bilingual App Reviews Mining with Large Language Models. In 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), 898–904.
  • Wilson et al. (2003) Wilson, K.; Mills, E.; Ross, C.; McGowan, J.; and Jadad, A. 2003. Association of autistic spectrum disorder and the measles, mumps, and rubella vaccine: a systematic review of current epidemiological evidence. Archives of Pediatrics & Adolescent Medicine, 157(7): 628–634.
  • Yang and Luo (2023) Yang, B.; and Luo, X. 2023. Recent Progress on Named Entity Recognition Based on Pre-trained Language Models. In 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), 799–804.
  • Zhang et al. (2023) Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Zhang, Y.; Chen, Y.; Wang, L.; Luu, A. T.; Bi, W.; Shi, F.; and Shi, S. 2023. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv:2309.01219.
  • Zhuang et al. (2023) Zhuang, S.; Liu, B.; Koopman, B.; and Zuccon, G. 2023. Open-source large language models are strong zero-shot query likelihood models for document ranking. arXiv preprint arXiv:2310.13243.