
Luna: An Evaluation Foundation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost

Masha Belyi   Robert Friel   Shuai Shao   Atindriyo Sanyal

Galileo Technologies Inc.
{masha,rob,ss,atin}@rungalileo.io
Abstract

Retrieval-Augmented Generation (RAG) systems have become pivotal in enhancing the capabilities of language models by incorporating external knowledge retrieval mechanisms. However, a significant challenge in deploying these systems in industry applications is the detection and mitigation of hallucinations: instances where the model generates information that is not grounded in the retrieved context. Addressing this issue is crucial for ensuring the reliability and accuracy of responses generated by large language models (LLMs) in diverse industry settings. Current hallucination detection techniques fail to deliver accuracy, low latency, and low cost simultaneously. We introduce Luna: a DeBERTa-large (440M) encoder, fine-tuned for hallucination detection in RAG settings. We demonstrate that Luna outperforms GPT-3.5 and commercial evaluation frameworks on the hallucination detection task, with 97% and 91% reductions in cost and latency, respectively. Luna is lightweight and generalizes across multiple industry verticals and out-of-domain data, making it an ideal candidate for industry LLM applications.

* These authors contributed equally to this work.

1 Introduction

Large Language Models (LLMs) are broadly used in industry dialogue applications due to their impressive ability to hold a natural conversation and succeed on a variety of reasoning tasks (Zhao et al., 2023). A key challenge in deploying customer-facing LLMs is their propensity for hallucinations, where the model presents cohesive, but factually incorrect information in conversation with a user (Roller et al., 2021; Lin et al., 2022). Retrieval-augmented generation (RAG), a technique for incorporating knowledge relevant to each user query in the LLM prompt, effectively reduces LLM hallucinations in production systems (Lewis et al., 2020). Yet, LLMs still often respond with nonfactual information that contradicts the knowledge supplied by RAG (Shuster et al., 2021; Magesh et al., 2024).

Figure 1: Luna is a lightweight DeBERTa-large encoder, fine-tuned for hallucination detection in RAG settings. Luna outperforms zero-shot hallucination detection models (GPT-3.5, ChainPoll GPT-3.5 ensemble) and RAG evaluation frameworks (RAGAS, Trulens) at a fraction of the cost and with millisecond inference speed.

Causes of hallucinations have been extensively studied across different LLM tasks (Zheng et al., 2024; Cao et al., 2022; Das et al., 2022). Key contributing factors include knowledge cutoff (Vu et al., 2023), randomness (Lee et al., 2022), faulty training data (Dziri et al., 2022a; Lin et al., 2022; McKenna et al., 2023), and finetuning with large amounts of new knowledge (Gekhman et al., 2024). Apart from RAG, proposed mitigation solutions explore prompt engineering with chain of thought (Wei et al., 2022), finetuning (Zhang et al., 2024), reinforcement learning with human feedback (Ouyang et al., 2022), and specialized hallucination detection models (Wu et al., 2023; Lin et al., 2022). For RAG specifically, evaluation frameworks like RAGAS (Es et al., 2024), Trulens (https://www.trulens.org/), and ARES (Saad-Falcon et al., 2024) have emerged to offer automated hallucination detection at scale. However, these approaches rely on static prompts (RAGAS, Trulens) or finetuning on in-domain data (ARES), which limits their capacity to generalize to a breadth of industry applications. Gao et al. (2023) and Wu et al. (2023) take this a step further and successfully suppress hallucinations in LLM responses with a detect-and-replace technique. However, due to the prohibitively slow latency of their LLM evaluation models, real-time hallucination prevention in production systems remains a challenge.

Customer-facing dialogue applications necessitate a hallucination detection system with high accuracy, low cost, and low latency, such that hallucinations are caught and resolved before reaching the user. Few/zero-shot LLM approaches fail to meet the strict latency requirement due to model size. Moreover, though commercial LLMs like OpenAI's GPT models (OpenAI, 2023) achieve strong performance, querying customer data through third-party APIs is both costly and undesirable for privacy and security reasons. Finetuned BERT-size models can achieve performance competitive with LLM judges (Bohnet et al., 2023; Saad-Falcon et al., 2024; Gao et al., 2023; Li et al., 2024; Yue et al., 2023), offering lower latency and local execution. However, these models require annotated data for finetuning and have not been evaluated for large-scale, cross-domain applications.

In this paper, we introduce Luna, a lightweight RAG hallucination detection model that generalizes across multiple industry-specific domains and scales well for real-time deployment. Luna is a 440M parameter DeBERTa-large encoder that is finetuned on carefully curated real-world RAG data. From analysis of RAG in production settings, we identify long-context RAG evaluation as a previously unaddressed challenge and propose a novel solution that facilitates high-precision long-context RAG hallucination detection. Through extensive benchmarking, we demonstrate that Luna outperforms zero-shot prompting and RAG evaluation frameworks on the hallucination detection task.

Our approach is closest to the concurrently proposed ARES automated RAG evaluation framework (Saad-Falcon et al., 2024), with a few key differences: (1) ARES requires a validation set of in-domain annotated data to finetune a custom evaluation model, while Luna is pre-trained on a cross-domain corpus for built-in generalization; (2) Luna accurately detects hallucinations on long RAG contexts; and (3) Luna is optimized to process up to 16k tokens in milliseconds on deployment hardware.

2 Related Work

Hallucination detection

Prior work on hallucination detection in natural language generation (NLG) is vast (Ji et al., 2023). SelfCheckGPT (Manakul et al., 2023) and Agrawal et al. (2024) are examples of heuristic consistency-based methods that detect unreliable LLM outputs by comparing multiple sampled responses from the same LLM. Others look to the internal state of the LLM, such as hidden layer activations (Azaria and Mitchell, 2023) and token-level uncertainty (Varshney et al., 2023), as a proxy signal for hallucinations. Kadavath et al. (2022) prompt the generating LLM to introspect and evaluate its own responses. More generally, zero-shot (Es et al., 2024) and finetuned (Wu et al., 2023; Yue et al., 2023; Muller et al., 2023) LLM judges leverage LLMs' inherent reasoning abilities to evaluate other LLM generations. Similarly, general-purpose finetuned LLM evaluators (Kim et al., 2024) that have been shown to correlate with human judgements can also be applied to hallucination detection.

Our approach of finetuning a small LM evaluator, similar to Gao et al. (2023) and Saad-Falcon et al. (2024), is the first to evaluate and optimize such a model for industry applications under strict performance, cost, and latency constraints.

NLI for closed-domain Hallucination Detection

Existing research draws parallels between the hallucination detection task and the concept of entailment in Natural Language Inference (NLI). The goal of NLI is to determine the relationship between a premise and a hypothesis, which can be one of entailment, contradiction, or neutral. In the past, NLI models have been used to evaluate factual consistency on closed-domain NLG tasks (Honovich et al., 2022; Dziri et al., 2022b). The Attributable to Identified Sources (AIS) framework, introduced by Rashkin et al. (2023), formally unifies the notions of factuality, attribution, hallucination, faithfulness, and groundedness, all terms used to measure the extent to which an LLM response is attributable to some source of ground truth. In follow-up work, NLI entailment has been shown to correlate with AIS scores (Gao et al., 2023; Bohnet et al., 2023; Li et al., 2024) and has become a standard baseline for AIS and hallucination detection models.

In this work, we use pre-trained NLI model weights as the starting point for Luna finetuning.

3 Luna Model

We fine-tune a DeBERTa-v3-Large (He et al., 2023) NLI checkpoint (https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli) from Laurer et al. (2022) with a shallow hallucination classifier on each response token. We train on the task of identifying supported tokens in the response, given a query and retrieved context. Framing the problem in this way makes our work comparable to recent automated RAG evaluation efforts. Our definition of support is synonymous with the answer faithfulness metric explored in RAGAS (Es et al., 2024) and ARES (Saad-Falcon et al., 2024), Trulens groundedness, and attribution (Li et al., 2024). At inference, we treat spans with low support probabilities as hallucinated spans.

Similar to Gao et al. (2023) and Wu et al. (2023), we aim to identify hallucinated spans in the response, rather than the less granular example-level hallucination boolean. While predicting spans is a more challenging task, it yields a more informative prediction to the end-user. Further, this approach sets us up for long-context prediction, which we discuss in detail next.

Figure 2: Distribution of RAG context token lengths in our QA RAG training split.
Figure 3: Long RAG context with naive chunking example. Naive context chunking leads to hallucination false positives when supporting information is scattered throughout the context. Without insight into which specific spans were supported/not supported by the context, it is impossible to arrive at the correct conclusion that the response in this example does NOT contain hallucinations.
Figure 4: Illustration of Luna's token-level predictions for the example in Figure 3. Luna's token-level predictions are aggregated over context windows into a high-precision hallucination probability score.

3.1 Long Context RAG

In practice, we find that context length limitations are a significant pain point in industry applications. Custom RAG setups may retrieve a large number of context documents from various sources, or choose not to chunk the documents before passing them into the retriever. This results in long inputs to the RAG generator and evaluation models, sometimes even exceeding the token limit of select commercial LLMs. In Figure 2 we visualize the context length distribution of our curated RAG dataset (detailed in Section 4.1). While our base DeBERTa model can technically handle sequences of up to 24k tokens (He et al., 2021), the computational complexity of transformer attention layers scales quadratically with input length. Moreover, though long-context LLMs like Claude-3 are becoming competitive on LLM leaderboards (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), research shows that these models suffer from information loss (Liu et al., 2023) and may not be suitable for long-context RAG evaluation.

A naive solution is to chunk long-context RAG inputs into short segments and process them through the evaluator model in batches. Model predictions can then be aggregated over batch rows to predict example-level hallucination probabilities. Figure 3 illustrates how such chunking may result in false positives in cases where supporting information is scattered throughout the long context document(s). Instead, we leverage span-level predictions for a high-precision classifier over long sequence inputs.

3.2 Long Context Chunking

Consider a single input into the RAG evaluation model that consists of C context tokens $[c_1 \dots c_C]$, Q question tokens $[q_1 \dots q_Q]$, and R response tokens $[r_1 \dots r_R]$. Assume we are working with an evaluator model that accepts a maximum sequence length L, and that $Q+R<L$, but C is much larger (the same approach easily extends to cases where $R>L$). To fit the example into the model we break it up into windows of length L, such that each window contains the question, the response, and a subset of the context tokens:

$w_i = [c_{i_1} \dots c_{i_l}] \oplus [q_1 \dots q_Q] \oplus [r_1 \dots r_R]$ (1)

where $l = L - Q - R$, and there are $\frac{N}{l}$ windows per example. In Figure 3 there are three such windows. Our model outputs support probabilities $p^i$ for each of the R response tokens in $w_i$ as:

$P_S(w_i) = [p_1^i \dots p_R^i]$ (2)

We train with a cross-entropy loss on each token output. During training, we leverage granular token-level support labels (Section 4.2) to adjust the training labels in each batch based on which context tokens are present in the window. For example, in Figure 3, "Washington, D.C., the capital of the US" is supported in window 1, nothing is supported in window 2, and "was founded in 1791" is supported in window 3.

At inference, we aggregate example-level support probabilities by taking the token-level maximum over windows. Refer to Figure 4 for a visual illustration of the steps described by Equations 3-5 below. The example-level support probability for token j is defined as:

$p_j = \max_{1 \leq i \leq |w|}(p_j^i)$ (3)

where $|w| = \frac{N}{l}$ is the total number of windows we created in (1). To produce an example-level label, we take the minimum over the R tokens:

$P_S = \min(p_1 \dots p_R)$ (4)

so that the overall support probability is no greater than the support probability of the least supported token in the response. Finally, we derive the example hallucination probability $P_H$ as

$P_H = 1 - P_S$ (5)
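To make the windowing and aggregation concrete, the sketch below implements Equations 1 and 3-5 in plain Python/NumPy. The function names and the `support_model` callable are illustrative placeholders rather than Luna's actual implementation.

```python
import numpy as np

def make_windows(context_ids, question_ids, response_ids, max_len):
    """Split a long context into chunks so that every window contains the full
    question and response plus a slice of the context (Eq. 1)."""
    chunk_len = max_len - len(question_ids) - len(response_ids)
    assert chunk_len > 0, "question + response must fit within max_len"
    return [
        context_ids[start:start + chunk_len] + question_ids + response_ids
        for start in range(0, len(context_ids), chunk_len)
    ]

def hallucination_probability(windows, support_model):
    """Aggregate token-level support probabilities over windows (Eqs. 3-5).
    `support_model(window)` is assumed to return one support probability per
    response token in the window."""
    per_window = np.stack([support_model(w) for w in windows])  # (num_windows, R)
    token_support = per_window.max(axis=0)  # Eq. 3: max over windows, per token
    example_support = token_support.min()   # Eq. 4: min over response tokens
    return 1.0 - example_support            # Eq. 5: hallucination probability
```

For the example in Figure 3, the tokens about the capital receive high support in window 1 and the founding date in window 3, so the max-then-min aggregation correctly yields a low hallucination probability.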

3.3 Training

To leverage the full pre-trained NLI model, we initialize the hallucination prediction head with weights from the NLI classification head. The original NLI head is a 3-class single-layer perceptron with a neuron for each NLI class (entailment, contradiction, and neutral). During training, we optimize for low entailment probability and high contradiction probability for hallucinated tokens (and the opposite for supported tokens). At inference, we output the probability of entailment for each token.
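A minimal sketch of this warm start with HuggingFace Transformers is shown below. The wrapper class, the attribute access (`.deberta`, `.classifier`), and the choice to read out only the per-token entailment probability at inference are our assumptions about one possible wiring, not Luna's released code.

```python
from torch import nn
from transformers import AutoModelForSequenceClassification

CKPT = "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"

class TokenSupportModel(nn.Module):
    """DeBERTa encoder with a shallow per-token head warm-started from the NLI head."""

    def __init__(self, ckpt=CKPT):
        super().__init__()
        nli = AutoModelForSequenceClassification.from_pretrained(ckpt)
        self.encoder = nli.deberta  # shared encoder layers
        self.head = nn.Linear(nli.config.hidden_size, nli.config.num_labels)
        # Copy the sequence-level NLI classifier into the token-level head
        # (assumes the pooler and encoder hidden sizes match for this checkpoint).
        self.head.load_state_dict(nli.classifier.state_dict())
        self.entail_id = nli.config.label2id.get("entailment", 0)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                      # (batch, seq_len, hidden_size)
        logits = self.head(hidden)               # (batch, seq_len, 3 NLI classes)
        # Per-token support probability = probability of the entailment class.
        return logits.softmax(dim=-1)[..., self.entail_id]
```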

We apply data transformation techniques to introduce additional variability for better generalization during training. Transformations include dropping and inserting context documents, and shuffling questions and responses between examples in a batch. Training labels are adjusted accordingly with each transformation; two of these transformations are sketched below.
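The sketch below illustrates two such transformations; the field names, probabilities, and label bookkeeping are simplified assumptions for exposition (for instance, it ignores the "generally supported" case described in Section 4.2).

```python
import random

def drop_context_doc(ex, p_drop=0.2):
    """Randomly drop one retrieved document and re-derive token labels.
    `ex["context_docs"]` maps doc id -> text; `ex["token_support"]` holds,
    per response token, the set of doc ids that support it."""
    if len(ex["context_docs"]) > 1 and random.random() < p_drop:
        drop_id = random.choice(list(ex["context_docs"]))
        del ex["context_docs"][drop_id]
        remaining = set(ex["context_docs"])
        # A token stays supported (label 1) only if a remaining doc supports it.
        ex["labels"] = [int(bool(s & remaining)) for s in ex["token_support"]]
    return ex

def swap_responses(batch, p_swap=0.1):
    """Occasionally pair a question/context with the response of another example;
    the mismatched response is labeled fully unsupported."""
    if len(batch) > 1 and random.random() < p_swap:
        i, j = random.sample(range(len(batch)), 2)
        batch[i]["response"] = batch[j]["response"]
        batch[i]["labels"] = [0] * len(batch[j]["labels"])
    return batch
```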

The model trains for 3 epochs with cross-entropy loss on the output of each response token. We initialize the learning rate to $5^{-6}$ for the base model layers and $2^{-5}$ for the classification head, and train with warmup and a linear decay rate.
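A sketch of the corresponding optimizer setup is below. The optimizer choice (AdamW) and the warmup fraction are our assumptions; the two learning rates are the values reported above, passed in by the caller.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, lr_encoder, lr_head, warmup_frac=0.06):
    """Discriminative learning rates: a smaller rate for the pretrained encoder
    layers than for the classification head, with warmup and linear decay."""
    optimizer = torch.optim.AdamW([
        {"params": model.encoder.parameters(), "lr": lr_encoder},
        {"params": model.head.parameters(), "lr": lr_head},
    ])
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_frac * total_steps),  # warmup length is a placeholder
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```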

| Domain | train | val | test | %H |
|---|---|---|---|---|
| customer support | 4k | 600 | 600 | 22% |
| finance | 38k | 5k | 5k | 5% |
| biomedical research | 22k | 3k | 3k | 20% |
| legal | 1.5k | 500 | 500 | 6% |
| general knowledge | 9.5k | 2k | 2k | 18% |
Table 1: RAG QA data statistics. RAG context and questions are sourced from open-book QA datasets that cover five industry-specific domains. RAG responses are generated with GPT-3.5 and Claude-3-Haiku, and annotated with GPT-4-turbo. %H indicates the fraction of hallucinated responses in each domain.

4 Data

4.1 RAG QA dataset

We recycle open-book QA datasets to construct a RAG QA dataset. Our goal is to simulate natural RAG examples that may occur in production settings. We sample data from five industry verticals: customer support (DelucionQA (Sadat et al., 2023), EManual (Nandy et al., 2021), TechQA (Castelli et al., 2020)), finance and numerical reasoning (FinQA (Chen et al., 2021), TAT-QA (Zhu et al., 2021)), biomedical research (PubmedQA (Jin et al., 2019), CovidQA (Möller et al., 2020)), legal (CUAD (Hendrycks et al., 2021)), and general knowledge (HotpotQA (Yang et al., 2018), MS Marco (Nguyen et al., 2016), HAGRID (Kamalloo et al., 2023), ExpertQA (Malaviya et al., 2024)). The combined dataset contains examples from a variety of difficult RAG task types, including numerical reasoning over tables, inference over multiple context documents, and retrieval from long contexts. We reserve ~20% of the dataset for validation and testing. Table 1 reports statistics of the data splits.

For each component dataset, we ignore the ground truth responses and generate two new responses per input with GPT-3.5 and Claude-3-Haiku. These models exhibit strong reasoning and conversational abilities (Chiang et al., 2024) at a low price point, which makes them realistic candidates for production RAG systems. We set temperature to 1 for generation to encourage diversity and potential hallucinations in the responses. Next, we describe how we annotate the data for training.
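Below is a minimal sketch of the generation call for the GPT-3.5 responses, using the prompt template from Appendix A and the OpenAI Python client; the Claude-3-Haiku responses would use the analogous Anthropic client. The helper name is ours and error handling is omitted.

```python
from openai import OpenAI

# Prompt template from Appendix A.
PROMPT = """Use the following pieces of context to answer the question.

{documents}

Question: {question}"""

client = OpenAI()

def generate_response(documents, question, model="gpt-3.5-turbo"):
    """Generate a RAG response at temperature 1 to encourage diversity
    (and, potentially, hallucinations)."""
    prompt = PROMPT.format(documents="\n".join(documents), question=question)
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1,
    )
    return completion.choices[0].message.content
```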

4.2 Labeling

We leverage GPT-4-turbo to annotate the RAG QA dataset. Refer to Section 8.1 for a discussion on the limitations of this approach.

Before annotation, we split the context and response into sentences using nltk (Bird and Loper, 2004). We pass the question along with the tokenized context and response sentences to GPT-4-turbo for annotation. For each sentence in the response, we instruct the LLM to identify which context sentences, if any, support the claim in the response. Tokens in sentences without any support are treated as hallucinations. We find that LLM responses often contain transition sentences and general statements that, while not supported by any specific context span, are generally grounded in the question and provided context. We instruct the annotator to label these as "generally supported", which we post-process to indicate support in every context window during training. Statements highlighting lack of sufficient information to answer the question also fall into this category.

We take measures to ensure high quality labels from our LLM annotator. First, we use chain-of-thought (Wei et al., 2022), which has been shown to increase agreement between LLM and human judgements (He et al., 2024). Next, we request both response-level and sentence-level annotations that we compare to identify potentially noisy labels. For example, if GPT-4 deems a response supported by the context as a whole, but identifies no supporting information for one or more claims in the response, we send the example for re-annotation. We re-annotate examples up to 3 times, after which <2% of the data are still conflicting. After manual inspection, we find that the majority of the conflicts arise from partially supported sentences. Since our annotation scheme is binary on the sentence level (the full sentence is either supported or not), we resolve all tokens in partially supported sentences to "not supported" on both the sentence and example level.
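The sketch below shows the shape of this annotation pipeline: sentence splitting with nltk and a request that maps each response sentence to its supporting context sentences. The prompt wording and the answer format are illustrative assumptions, not the exact annotation prompt.

```python
from nltk.tokenize import sent_tokenize  # may require nltk.download("punkt")

def build_annotation_prompt(question, context, response):
    """Number context and response sentences and ask the annotator LLM which
    context sentences (if any) support each response sentence."""
    ctx = "\n".join(f"C{i}: {s}" for i, s in enumerate(sent_tokenize(context)))
    resp = "\n".join(f"R{i}: {s}" for i, s in enumerate(sent_tokenize(response)))
    instructions = (
        "For each response sentence, think step by step, then list the ids of the "
        "context sentences that support it, or 'general' if it is generally grounded, "
        "or 'none' if it is unsupported."
    )
    return f"Question: {question}\n\nContext:\n{ctx}\n\nResponse:\n{resp}\n\n{instructions}"

def sentence_labels(parsed_annotation):
    """Binary support labels from the parsed annotation, e.g.
    [["C0", "C3"], ["general"], ["none"]] -> [1, 1, 0]."""
    return [0 if support == ["none"] else 1 for support in parsed_annotation]
```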

| Method | QA P | QA R | QA F1 | D2T P | D2T R | D2T F1 | Summ P | Summ R | Summ F1 | Overall P | Overall R | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Prompt (gpt-3.5-turbo) | 18.8 | 84.4 | 30.8 | 65.1 | 95.5 | 77.4 | 23.4 | 89.2 | 37.1 | 37.1 | 92.3 | 52.9 |
| Prompt (gpt-4-turbo) | 33.2 | 90.6 | 45.6 | 64.3 | 100.0 | 78.3 | 31.5 | 97.6 | 47.6 | 46.9 | 97.9 | 63.4 |
| SelfCheckGPT (gpt-3.5-turbo) | 35.0 | 58.0 | 43.7 | 68.2 | 82.8 | 74.8 | 31.1 | 56.5 | 40.1 | 49.7 | 71.9 | 58.8 |
| LMvLM (gpt-4-turbo) | 18.7 | 76.9 | 30.1 | 68.0 | 76.7 | 72.1 | 23.2 | 81.9 | 36.2 | 36.2 | 77.8 | 49.4 |
| Finetuned Llama-2-13B | 61.6 | 76.3 | 68.2 | 85.4 | 91.0 | 88.1 | 64.0 | 54.9 | 59.1 | 76.9 | 80.7 | 78.7 |
| ChainPoll (gpt-3.5-turbo) | 33.5 | 51.3 | 40.5 | 84.6 | 35.1 | 49.6 | 45.8 | 48.0 | 46.9 | 54.8 | 40.6 | 46.7 |
| RAGAS Faithfulness | 31.2 | 41.9 | 35.7 | 79.2 | 50.8 | 61.9 | 64.2 | 29.9 | 40.8 | 62.0 | 44.8 | 52.0 |
| Trulens Groundedness | 22.8 | 92.5 | 36.6 | 66.9 | 96.5 | 79.0 | 40.2 | 50.0 | 44.5 | 46.5 | 85.8 | 60.4 |
| Luna | 37.8 | 80.0 | 51.3 | 64.9 | 91.2 | 75.9 | 40.0 | 76.5 | 52.5 | 52.7 | 86.1 | 65.4 |
Table 2: Response-level results on the RAGTruth hallucination prediction task (P/R/F1 = Precision/Recall/F1; QA = Question Answering, D2T = Data-to-Text Writing, Summ = Summarization). The first five rows are RAGTruth baselines reported in Wu et al. (2023); the remainder are our own baselines and model. RAGAS and Trulens are evaluation frameworks that query GPT-3.5-turbo for hallucination detection. ChainPoll is our gpt-3.5-turbo ensemble prompt baseline. ChainPoll, RAGAS, Trulens, and Luna probability thresholds were tuned for best Overall F1. Luna outperforms all prompt-based approaches and narrows the gap between the other baselines and the 13B fine-tuned Llama, at a fraction of the cost.
| Method | Customer Support | Financial Reasoning | General Knowledge | Legal | Biomed | Overall |
|---|---|---|---|---|---|---|
| GPT-4-turbo annotator | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Prompt (gpt-3.5-turbo) | 0.68 | 0.67 | 0.67 | 0.63 | 0.64 | 0.66 |
| ChainPoll (gpt-3.5-turbo) | 0.76 | 0.74 | 0.75 | 0.71 | 0.71 | 0.74 |
| RAGAS Faithfulness | 0.62 | 0.60 | 0.60 | 0.58 | 0.54 | 0.61 |
| Trulens Groundedness | 0.56 | 0.56 | 0.65 | 0.34 | 0.68 | 0.56 |
| Luna (in-domain) | 0.76 | 0.82 | 0.81 | 0.78 | 0.83 | 0.80 |
| Luna (OOD) | 0.74 | 0.64 | - | 0.79 | - | - |
Table 3: AUROC on the hallucination detection task on the RAG QA test set. Luna (in-domain) is our model trained on the combined train splits from each domain. Luna (OOD) is the same model trained on a subset of the General Knowledge and Biomed domains.

5 Evaluation

5.1 Datasets

We evaluate Luna on a combination of existing academic benchmarks (RAGTruth) and real-world RAG data.

RAGTruth

RAGTruth is an expert-annotated corpus of 18k RAG examples with LLM-generated responses. The data are split into three RAG task types: Question Answering (QA), Data-to-text Writing, and News Summarization. Since Luna is only trained on QA RAG examples, we use this benchmark to evaluate our model’s generalization to other RAG task types.

RAG QA Test Set

We also evaluate Luna on a held-out split of our RAG QA dataset (Section 4.1). This serves as an in-domain test set for evaluating Luna performance across industry verticals.

5.2 Baselines

Zero-shot prompting

We evaluate GPT-3.5-turbo and GPT-4-turbo models from OpenAI as baselines. We prompt the LLMs to return an example-level boolean indicating whether or not a RAG response is supported by the associated RAG context. For RAGTruth we also include all baselines reported in the original paper.

Ensemble prompting

LLM ensembles have been shown to outperform single model judges by eliminating bias (Friel and Sanyal, 2023; Verga et al., 2024). We leverage ChainPoll (Friel and Sanyal, 2023) with a chain-of-thought prompt for a stronger GPT-3.5-turbo baseline.

RAG Evaluation Frameworks

We evaluate two commercial RAG evaluation frameworks: RAGAS (v0.1.7) (Es et al., 2024) and Trulens (v0.13.4). We report the RAGAS Faithfulness and Trulens Groundedness metrics, which are designed for hallucination detection.

5.3 Metrics

For comparison with RAGTruth baselines, we report the best Precision, Recall, and F1 scores on RAGTruth. We tune model output probability thresholds for the best overall F1 and report all metrics at this optimal threshold. For other benchmarks, we report the area under the ROC curve (AUROC), which we consider a more informative metric that circumvents the need for threshold tuning.
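Concretely, the evaluation boils down to the snippet below (a sketch using scikit-learn; the function and variable names are ours): AUROC over example-level hallucination probabilities, plus the operating point that maximizes F1 for the RAGTruth comparison.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

def evaluate(y_true, hallucination_probs):
    """y_true: 1 = hallucinated response, 0 = supported response."""
    auroc = roc_auc_score(y_true, hallucination_probs)
    precision, recall, thresholds = precision_recall_curve(y_true, hallucination_probs)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-8, None)
    best = f1[:-1].argmax()  # the final P/R point has no associated threshold
    return {
        "auroc": auroc,
        "best_f1": f1[best],
        "precision": precision[best],
        "recall": recall[best],
        "threshold": thresholds[best],
    }
```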

6 Results

On the RAGTruth dataset, Luna outperforms all prompt-based approaches on the QA and Summarization tasks, and is competitive with GPT-3.5 evaluators on the Data-to-Text Writing task (Table 2). Overall, Luna is second only to the finetuned Llama-2-13B, which is expected given the significant difference in size between the two models (440M vs 13B). It is important to note that the Llama-2-13B baseline was trained on a subset of RAGTruth, whereas Luna was trained on a QA-only dataset with a different data distribution. Nevertheless, we find that Luna generalizes well to the out-of-domain task types. Additionally, the gains in cost and inference speed we achieve with the lightweight Luna model (Sections 7.2, 7.3) offset the performance gap.

Results on the RAG QA test set are reported in Table 3 and follow a similar pattern. Luna outperforms the baselines across all verticals.

We also evaluate the model’s cross-domain generalization by training on a subset of General Knowledge and Biomedical Domains, and evaluating on the others. We refer to this model as LunaOOD. We find that LunaOOD still outperforms most baselines on the out-of-domain subsets. However, generalization to the Financial Reasoning domain is weak. Examples in this domain require reasoning over tabular data, which LunaOOD never observes in training. Fine-tuning on the Financial Reasoning domain greatly boosts performance, increasing AUROC from 0.64 to 0.82.

| Method | 0-5k (n=223) | 5k-16k (n=209) | 16k+ (n=78) |
|---|---|---|---|
| Prompt (gpt-3.5-turbo) | 0 | -12.11% | -100% |
| ChainPoll (gpt-3.5-turbo) | 0 | -8.97% | -100% |
| RAGAS Faithfulness | 0 | -4.36% | -100% |
| Trulens Groundedness | 0 | -6.38% | -100% |
| Luna | 0 | -12.55% | -31.98% |
| Luna (example) | 0 | -21.44% | -43.75% |
Table 4: Relative hallucination detection performance of various models on short (0-5k), medium (5k-16k), and long (16k+) context lengths; parenthetical counts are test examples per bucket. Luna is our best fine-tuned DeBERTa-large model, and Luna (example) is a version of Luna that makes hallucination predictions at the example level. All GPT-3.5-based baselines (including RAGAS and Trulens) fail on input lengths >16k, while Luna maintains 88% and 68% of its performance on medium (5k-16k) and long (16k+) context lengths, respectively. Luna (example) also struggles more with long context lengths than Luna.

7 Discussion

7.1 Long Context Hallucination Detection

In Table 4 we report Luna’s performance against baselines on a range of RAG context lengths. For this analysis we sample data from CUAD (Hendrycks et al., 2021), one of the RAG QA component datasets, which passes full-length legal contracts as context inputs into RAG. This dataset contains the largest range of context lengths in RAG QA.

We find that the performance of all models inversely correlates with context length. However, while the GPT-3.5-powered baselines fail completely at the GPT-3.5 context limit (16k tokens), Luna maintains 68% of its performance on that subset.

To validate the efficacy of our span-level prediction and long context chunking approach (Section 3.2), we perform an ablation study where we compare our best model to a version of Luna that makes example-level predictions, referred to as Luna (example) in Table 4. As shown in Figure 3, we expect Luna (example) to perform worse on long contexts. Our findings confirm this hypothesis: although the hallucination detection performance of both models degrades with increasing context length, Luna (example) exhibits a greater degradation than Luna.

7.2 Cost vs Accuracy Trade-offs

API-based hallucination detection methods accrue substantial costs if used continuously in production settings. Luna outperforms GPT-3.5-based approaches while operating at a fraction of the cost. In Figure 1 we illustrate the trade-off between monthly maintenance costs and accuracy for Luna versus our GPT-3.5-based baselines. Costs are estimated assuming an average throughput of 10 queries per second, with an average query length of 4000 tokens. We use OpenAI API (https://openai.com/api/pricing/) and AWS cloud (https://aws.amazon.com/ec2/pricing/on-demand/) pricing at the time of writing. Detailed cost calculations can be found in Appendix B.

Although we do not explicitly compare pricing against larger fine-tuned models such as Llama-2-13B, we note that hosting a multi-billion parameter model demands substantially more compute resources than Luna, which would be reflected in the overall cost.

7.3 Latency Optimizations

We optimize Luna and its deployment architecture to process up to 16k input tokens in under one second on an NVIDIA L4 GPU. To achieve this, we deploy an ONNX-traced model on NVIDIA Triton server with the TensorRT backend. We leverage Triton's Business Logic Scripting (BLS) to optimize the data flow and orchestration between GPU and CPU resources. BLS intelligently allocates resources based on the specific requirements of each inference request, ensuring that both GPU and CPU are utilized effectively and that neither resource becomes a bottleneck. We also tune our inference model's maximum input length for optimal performance. While increasing the maximum sequence length would reduce the size and number of batches processed by the model (see Section 3.2), transformer layer computational complexity also scales quadratically with input length. We determine a maximum length of 512 tokens to be the most effective. Finally, we optimize pre- and post-processing Python code for maximum efficiency. Table 5 in Appendix C details the latency reductions achieved at each optimization step.
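As an illustration, the ONNX export step could look roughly like the sketch below, assuming a model whose forward pass takes (input_ids, attention_mask); the input/output names, opset version, and dynamic axes are our assumptions, and the Triton/TensorRT configuration is not shown.

```python
import torch

def export_onnx(model, tokenizer, path="luna.onnx", max_len=512):
    """Trace the fine-tuned encoder to ONNX with dynamic batch and sequence axes,
    ready to be compiled by the TensorRT backend behind Triton."""
    model.eval()
    dummy = tokenizer(
        "placeholder input", return_tensors="pt",
        padding="max_length", truncation=True, max_length=max_len,
    )
    torch.onnx.export(
        model,
        (dummy["input_ids"], dummy["attention_mask"]),
        path,
        input_names=["input_ids", "attention_mask"],
        output_names=["support_probs"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "attention_mask": {0: "batch", 1: "sequence"},
            "support_probs": {0: "batch", 1: "sequence"},
        },
        opset_version=17,
    )
```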

8 Conclusion

In this work we introduced Luna: a cost-effective hallucination detection model with millisecond inference speed. Luna eliminates the dependency on slow and expensive third-party API calls, and enables practitioners to effectively address hallucinations in production. The proposed model can be hosted on a local GPU, guaranteeing privacy that third-party APIs cannot.

8.1 Limitations

Closed Domain Hallucinations

Luna’s efficacy is limited to closed domain hallucination detection in RAG settings. Due to its size, Luna lacks the necessary world knowledge to detect open domain hallucinations. For open-domain applications, Luna relies on a high-quality RAG retriever to provide the necessary context knowledge for an input query.

LLM Annotations

LLMs' remarkable zero-shot abilities have encouraged researchers to consider LLMs for annotation and synthetic data generation. Replacing human annotators with LLMs offers substantial efficiency and cost savings (Wang et al., 2021). However, LLM performance on various annotation tasks is still controversial, with some studies reporting high correlations between LLM and human judgements (Chiang and Lee, 2023; He et al., 2024; Verga et al., 2024), while others advise caution (Li et al., 2023; Wang et al., 2024).

In this work, we recognize the potential noise and bias introduced in our training and evaluation data by automated GPT-4-turbo annotations. We hypothesize that our model derives greater advantages from training on a large-scale dataset, facilitated by low-cost LLM annotation, than it is hindered by potential noise within the data. After taking steps to ensure annotation quality (Section 4.2), we observe competitive performance on RAGTruth, a human-annotated benchmark in Section 6. This evaluation provides external validation for our model outputs, although we acknowledge that performance could potentially be enhanced with higher quality annotation sources.

Sentence-level annotations

Luna is trained on sentence-level annotations, i.e. there is an assumption that a sentence is either supported or not supported. This is most often the case, but future work can explore token-level labels for compound sentences with partially supported claims.

8.2 Future Work

Hallucinations in RAG output highlight weaknesses of the generator model. However, it is equally important to consider the quality of the retriever and its contribution to the overall performance of a RAG system. A sub-optimal retriever may supply irrelevant context to the generator, making it difficult for the generator to produce an accurate response. A comprehensive RAG evaluation model should therefore assess all dimensions of the RAG system. To this end, metrics like context relevance have been explored to assess the quality of retrieved RAG contexts (Es et al., 2024; Saad-Falcon et al., 2024).

In future work, we propose to leverage Luna for measuring a comprehensive suite of RAG metrics. One cost-effective approach could be to augment the current DeBERTa architecture with additional prediction heads that output multiple metrics in one forward pass. We hypothesize that the shared weights of the base encoder layers may enhance the performance of each head.

References

Appendix A Response Generation Prompt

We use the following prompt template to generate LLM responses for each sample in our QA RAG dataset. Context documents, separated by line breaks, along with the question are slotted in for each generation sample.

Use the following pieces of context to answer the question.

{documents}

Question: {question}

Appendix B Cost Calculations

Costs are estimated assuming average throughput of 10 queries per second (qps), with average RAG query length of 4000 tokens, and NVIDIA L4 GPU deployment hardware. When estimating LLM cost for >1qps we assume concurrency is implemented to process multiple queries in parallel.

Luna Costs

Empirically, we find that each L4 GPU can serve up to 4 qps. At the time of writing, the monthly cost of running a g6.2xlarge GPU instance on AWS cloud is $700 (https://aws.amazon.com/ec2/pricing/on-demand/). Thus, we estimate the total monthly cost for 10 qps throughput as

$\$700 \times \frac{10}{4} = \$1750$ (6)

OpenAI Costs

At the time of writing, querying GPT-3.5-turbo through the OpenAI API costs $0.50 / 1M input tokens and $1.50 / 1M output tokens (https://openai.com/api/pricing/). In our test set, we observe an average output length from GPT-3.5 of 200 tokens. Using an average input length of 4000 tokens, the cost of a single query is roughly

$(4\text{k} \times \$0.5 + 200 \times \$1.5)/1\text{M} = \$0.0023$ (7)

Using 2,592,000 seconds/month, the monthly cost of serving 10qps with GPT-3.5 is:

$10\,\text{qps} \times 2{,}592{,}000 \times \$0.0023 = \$59{,}616$ (8)

With ChainPoll ensemble, we request 3 outputs per query, bringing the cost of a single query up to

$(4\text{k} \times \$0.5 + 3 \times 200 \times \$1.5)/1\text{M} = \$0.0029$ (9)

And the total monthly cost for 10qps to:

$10\,\text{qps} \times 2{,}592{,}000 \times \$0.0029 = \$75{,}168$ (10)

RAGAS Costs

RAGAS makes 2 OpenAI API calls per input RAG example. The first query extracts a list of claims from the response. The second requests the LLM to evaluate the faithfulness of each extracted claim to the RAG context. We estimate that the output length of the first query is roughly equal to the length of the RAG response, and the output length of the second query is roughly 3x the length of the response, since it includes the original claims followed by a faithfulness score and an explanation. Factoring in the overhead token length of each prompt, we calculate the cost per query to be

$\text{Query}_1 = \$380/1\text{M}$ (11)
$\text{Query}_2 = \$2730/1\text{M}$ (12)

Then, the monthly cost of serving 10qps is:

$10\,\text{qps} \times 2{,}592{,}000 \times (\$380 + \$2730)/1\text{M} = \$79{,}937$ (13)

Trulens Costs

Trulens makes 1 OpenAI API call per sentence in the response. For this calculation, we estimate 3 sentences per response, which aligns with our observations on the QA RAG dataset. Each query returns the original sentence, a groundedness score (1-10), and an explanation. Here we assume that the token length of the explanation is roughly equal to the token length of the input sentence. The cost of a single query is roughly

$(4\text{k} \times \$0.5 + 2 \times 75 \times \$1.5)/1\text{M} = \$0.0022$ (14)

Using 2,592,000 seconds/month, the monthly cost of serving 10qps with Trulens is:

$10\,\text{qps} \times 2{,}592{,}000 \times 3 \times \$0.0022 = \$173{,}016$ (15)
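For reference, the arithmetic in this appendix can be reproduced with a few lines of Python; the helper below simply restates Equations 6-10 (prices in dollars per 1M tokens, as above).

```python
SECONDS_PER_MONTH = 2_592_000
QPS = 10

def gpt35_query_cost(input_toks, output_toks, in_price=0.5, out_price=1.5):
    """Per-query cost in dollars for GPT-3.5-turbo pricing ($ per 1M tokens)."""
    return (input_toks * in_price + output_toks * out_price) / 1e6

# Single GPT-3.5 judge, Eqs. (7)-(8): ~$0.0023/query, ~$59,616/month.
monthly_gpt35 = QPS * SECONDS_PER_MONTH * gpt35_query_cost(4000, 200)

# ChainPoll ensemble (3 generations per query), Eqs. (9)-(10): ~$75,168/month.
monthly_chainpoll = QPS * SECONDS_PER_MONTH * gpt35_query_cost(4000, 3 * 200)

# Luna on AWS, ~4 qps per $700/month g6.2xlarge instance, Eq. (6): $1,750/month.
monthly_luna = 700 * (QPS / 4)
```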

Appendix C Latency Optimizations

We optimize Luna and its deployment architecture to process up to 16k input tokens in under one second on NVIDIA L4 GPU. Table 5 details the latency reductions and how they were achieved.

| Optimization | s/16k |
|---|---|
| baseline | 3.27 |
| TensorRT backend | 2.09 |
| efficient pre- and post-processing code | 1.79 |
| 512 max model length | 0.98 |
| BLS | 0.92 |
Table 5: Impact of latency optimizations on Luna inference speed, reported in seconds to process 16k input tokens.

Appendix D Latency Comparison

We empirically estimate the latency of Luna and each baseline model. Luna latency is discussed in Appendix C. For LLM-based models that query the OpenAI API, we calculate the average latency per query after querying the API multiple times with an input of 4000 tokens, split between 3800 tokens for the context, 25 tokens for the question, and 75 tokens for the response.
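The measurement itself is straightforward; a sketch of the timing loop we have in mind is below (the callable and trial count are placeholders).

```python
import time
from statistics import mean

def average_latency(run_query, n_trials=20):
    """Average wall-clock latency of one evaluation query. `run_query` is any
    callable that sends a single ~4k-token request and blocks until it returns."""
    timings = []
    for _ in range(n_trials):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return mean(timings)
```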

| Model | s/4k | % change |
|---|---|---|
| Luna | 0.23 | - |
| GPT-3.5 | 2.5 | -91% |
| ChainPoll n=3 | 3.0 | -93% |
| Trulens | 3.4 | -93% |
| RAGAS | 5.4 | -96% |
Table 6: Model latency (in seconds), comparing Luna to LLM baselines. We also report the % difference between Luna and the LLM-based models.