On the Capacity of Citation Generation by Large Language Models
Abstract
Retrieval-augmented generation (RAG) appears to be a promising method for alleviating the “hallucination” problem in large language models (LLMs), since it can incorporate external, traceable resources into response generation. The essence of RAG in combating hallucination lies in accurately attributing claims in responses to the corresponding retrieved documents. However, most existing works focus on improving the quality of the generated responses, while largely overlooking the LLM’s ability to attribute sources accurately. In this study, we conduct a systematic analysis of the capabilities of LLMs in generating citations during response generation, and further introduce a novel method to enhance their citation generation abilities. Specifically, we evaluate both the correctness and the citation quality of seven widely used LLMs on three datasets drawn from two benchmarks. Meanwhile, we introduce new citation evaluation metrics to eliminate the over-penalization of unnecessary and excessive citations in existing metrics. Furthermore, we propose a Generate-then-Refine method that completes relevant citations and removes irrelevant ones without altering the response text. Results on the WebGLM-QA, ASQA and ELI5 datasets show that our method substantially improves the quality of citations in responses generated by LLMs.
Keywords:
Large Language Model · Retrieval-Augmented Generation · Citation Generation
1 Introduction
Recently, large language models (LLMs) [1] have demonstrated outstanding performance across various natural language processing tasks, showing remarkable generative capabilities for complex questions [2]. However, LLMs also face the well-known “hallucination” issue, as they tend to produce fabricated content for unknown questions, which largely hinders their practical use in risk-sensitive applications such as medical or legal consultation. To this end, retrieval-augmented generation (RAG) has emerged as a promising method to incorporate real-time and factual knowledge into response generation [3].
While RAG can enhance LLMs by leveraging external resources through in-context learning, it is crucial to acknowledge that its core lies in providing citations for the statements generated in responses. However, recent advances in RAG have mainly focused on building complex architectures to improve the quality of retrieved content [5]. For example, FLARE [6] retrieves information iteratively and actively by monitoring the confidence of generated tokens. ITER-RETGEN [7] also enhances retrieval by using generated content, achieving an iterative retrieval-generation flow. More recently, attribution of responses has attracted considerable attention in both academia and industry. For example, Gao et al. [8] propose ALCE, the first benchmark for automatic evaluation of LLMs’ citations. Besides, Bing Chat (https://www.bing.com/new) and Perplexity (https://www.perplexity.ai) have already implemented citation generation in their online systems.
In existing works, there are generally two types of methods to provide citations for responses: pre-hoc citation and post-hoc citation [9]. The pre-hoc method treats citations as regular tokens and generates them directly during the inference process of the LLM, which places high demands on the capabilities of the LLM [10]. In contrast, the post-hoc method first generates a response without citations and then matches the content of the response with the references to determine whether citations need to be added [8].
In fact, the pre-hoc method for citation generation often results in better consistency between the responses and the references, as it fully leverages the LLMs’ excellent natural language understanding capabilities. Nonetheless, generating accurate citations is still a significant challenge for LLMs. This task demands that LLMs analyze multiple references, provide a coherent and comprehensive response, and determine precisely when to incorporate citations.
In this study, we systematically analyze the latest LLMs’ abilities to generate citations during response generation and introduce a novel method to enhance citation quality. We use two basic methods, few-shot prompting and fine-tuning, to guide LLMs in generating responses with citations. We then evaluate the correctness of responses and the quality of citations on three long-form question answering (LFQA) datasets drawn from two benchmarks. For evaluation, we find that the metrics in ALCE [8] excessively penalize responses that contain statements not requiring citations or that include extensive citations. Therefore, we introduce more comprehensive metrics to evaluate citation quality: for citation recall, we exclude statements that do not require citations, and for citation precision, we redefine the concept of “relevant”.
Moreover, we integrate pre-hoc and post-hoc methods to introduce the Generate-then-Refine approach. This approach adds relevant citations that were not initially generated and removes irrelevant citations that were included, thereby improving citation quality without altering the response text itself. We conduct experiments on three LFQA datasets: WebGLM-QA [11], ASQA [12], and ELI5 [13]. The experimental results demonstrate that our proposed method significantly improves citation quality.
In summary, our contributions are threefold: (1) we analyze the latest LLMs’ ability to generate citations; (2) we introduce more comprehensive metrics for evaluating citation quality; (3) we propose the Generate-then-Refine approach, which substantially enhances the citation quality in responses.
2 Related Work
In this section, we review existing relevant work from two perspectives: citation generation and citation evaluation. Some researchers also refer to the process of associating responses with their corresponding supporting references as “attribution”, and we include these works as well.
2.0.1 Citation Generation
Recently, a host of works in the RAG field have required LLMs to provide citations while generating responses. Nakano et al. [10] presented WebGPT, which fine-tunes GPT-3 to answer long-form questions based on a web-browsing environment; it is one of the earliest works enabling LLMs to generate responses with citations. Menick et al. [14] used reinforcement learning from human preferences to train language models that generate responses while also citing specific evidence to support their claims. Qian et al. [15] introduced the ReGen framework, which enhances the factuality of generation and supports generating responses with citations. Liu et al. [11] presented WebGLM, which employs a rule-based approach to match responses and references in order to filter high-quality training data containing citations, and fine-tunes LLMs to learn to incorporate citations into answers. Qin et al. [16] presented WebCPM, a fine-tuned language model that imitates human web-search behavior, treating “Quote” as an action to extract content from the current web page as supporting evidence during response generation. Gao et al. [8] used a few-shot method to guide LLMs in generating citations and also provided a post-hoc citation option to add citations to the responses. Sun et al. [17] introduced an approach named VTG that incorporates evolving memory and self-reflection and supports evidence verification and retrieval, helping models rethink and reflect on the relationship between claims and citations. Huang et al. [18] proposed a training framework using fine-grained rewards to teach LLMs to generate highly supportive and relevant citations.
2.0.2 Citation Evaluation
It is crucial to quantitatively evaluate the quality of citations in responses once models are capable of generating them. Rashkin et al. [19] proposed a manual evaluation framework named AIS for measuring whether model-generated statements are supported by underlying sources. Based on this, Gao et al. [20] introduced an automated metric, AutoAIS, which approximates human AIS judgments using an NLI model. Bohnet et al. [21] subsequently defined a reproducible evaluation framework for attributed QA, using human annotations as the gold standard and employing AutoAIS as an automatic evaluation metric. Liu et al. [22] manually evaluated the citations included in popular generative search engines from the perspectives of comprehensiveness and accuracy. Liu et al. [11] manually evaluated the relationships between answers generated by LLMs and their corresponding references for citation accuracy. Yue et al. [23] defined different types of attribution errors and employed two approaches, prompting LLMs and fine-tuning smaller LMs, for automatic evaluation of attribution. Gao et al. [8] proposed ALCE, the first benchmark for automatic evaluation of LLMs’ citations, which defines citation recall and citation precision to measure citation quality and uses an NLI model to determine whether responses are supported by the cited references. Kamalloo et al. [24] established an attribution dataset in which LLMs first generate answers with citations, which are then annotated by humans for informativeness and attributability. Hu et al. [25] defined more fine-grained attribution categories and proposed an automatic method for generating attributed-QA benchmarks using knowledge graphs.
3 Analysis of Citation Generation by Large Language Models
In this section, we conduct a comprehensive evaluation and analysis of the latest LLMs’ ability to generate citations. We employ two basic methods, few-shot and fine-tuning, to guide LLMs in generating responses with citations.
3.1 Datasets
We select three LFQA datasets for our experiments. (1) WebGLM-QA [11] consists of 43,579 samples in the train split, 1,000 in the validation split, and 400 in the test split. Each sample contains a question, an answer, and a set of references. The answers and accompanying citations in the dataset were generated by GPT-3 [1] through in-context learning. Liu et al. [11] applied a series of rules to filter the dataset: they used ROUGE-1 [26] to measure the similarity between answer segments and their corresponding references in order to remove inaccurately labeled, irrelevant citations, and they applied further rule-based filtering to alleviate issues such as hallucination, too few citations, and low-quality citations. (2) ASQA [12] is a factoid QA dataset whose questions often contain ambiguities, resulting in multiple answers under different interpretations. Responses to these ambiguous questions should synthesize factual information from multiple sources to form the final answer. The ALCE benchmark [8] randomly selected 948 samples from the original ASQA dataset and added retrieved passages to construct a test set. (3) ELI5 [13] is also an LFQA dataset, with questions collected from the Reddit forum “Explain Like I’m Five”, consisting primarily of “How” and “Why” questions. For these types of questions, good answers are often quite detailed and cannot be adequately addressed with brief responses or by simply extracting words or phrases from the context. Similarly, the ALCE benchmark [8] selected 1,000 samples to construct a test set.
3.2 Evaluation
We evaluate responses for both correctness and citation quality. Although correctness is not our main focus, it remains a crucial aspect of evaluation. We use the well-established BLEU-4 [27] and ROUGE-L [26] metrics to measure correctness.
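The paper does not tie these metrics to a particular implementation; the following minimal sketch, assuming the sacrebleu and rouge-score Python packages as stand-ins, illustrates how correctness could be scored for a single response.

```python
# Hedged sketch: the text does not specify which BLEU-4/ROUGE-L implementations are used;
# here we assume the `sacrebleu` and `rouge-score` packages.
import sacrebleu
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def correctness_scores(prediction: str, reference: str) -> dict:
    """Return BLEU-4 and ROUGE-L F1 (both on a 0-100 scale) for one response."""
    bleu4 = sacrebleu.sentence_bleu(prediction, [reference]).score  # uses up to 4-grams
    rouge_l = _rouge.score(reference, prediction)["rougeL"].fmeasure * 100
    return {"BLEU-4": bleu4, "ROUGE-L": rouge_l}
```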
For citation quality evaluation, we initially adopt two metrics defined in ALCE [8]: citation recall and citation precision.
3.2.1 Citation Recall
Before evaluation, each response is segmented into statements $s_1, s_2, \ldots, s_n$. Citation recall is computed on a per-statement basis, where each statement receives a binary recall score; the citation recall of a system is the average of the recall scores across all statements in the entire dataset. For each statement $s_i$ with cited passages $\mathcal{C}_i$, the recall score is 1 if and only if $s_i$ contains at least one citation and $\phi(\mathrm{concat}(\mathcal{C}_i), s_i) = 1$, where $\phi$ is the NLI model that outputs 1 if the premise entails the hypothesis and 0 otherwise [8], and $\mathrm{concat}(\mathcal{C}_i)$ denotes the concatenation of all passages cited by $s_i$.
This calculation is too strict for some responses. Upon reviewing the experimental results, we found that not all statements necessarily require citations. Commonsense statements such as “Humans can walk but cannot fly” or transitional statements like “Next, I will answer the question from the following aspects” do not need any citations. The above metric may therefore underestimate the quality of certain responses.
Based on this issue, we define a more lenient metric. If a statement $s_i$ does not contain any citation and $\phi(\mathrm{concat}(\mathcal{D}), s_i) = 0$, where $\mathcal{D}$ denotes the set of all retrieved passages that serve as context for the LLM in the current sample, then $s_i$ is regarded as not requiring a citation: its recall score is not computed and it is excluded from the final average.
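To make the lenient recall concrete, the sketch below (not the authors' released code; the NLI callable and the statement segmentation are assumed inputs) returns per-statement recall scores for one response, skipping statements judged not to require a citation; the dataset-level citation recall is the mean of the scores collected over all responses.

```python
# Hedged sketch of the lenient citation recall described above; `nli` stands in for the
# NLI model phi, and statement segmentation is assumed to be done beforehand.
from typing import Callable, List

def statement_recall_scores(
    statements: List[str],        # statements s_1..s_n segmented from one response
    citations: List[List[str]],   # citations[i] = passages cited by statement s_i
    retrieved: List[str],         # all retrieved passages D given to the LLM as context
    nli: Callable[[str, str], int],
) -> List[int]:
    scores: List[int] = []
    for s_i, cited in zip(statements, citations):
        if not cited and nli(" ".join(retrieved), s_i) == 0:
            # No citation and not entailed by the retrieved context: treated as a statement
            # that does not require a citation, so it is excluded from the average.
            continue
        scores.append(1 if cited and nli(" ".join(cited), s_i) == 1 else 0)
    return scores  # dataset-level citation recall = mean over the scores of all responses
```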
3.2.2 Citation Precision
Similar to citation recall, before calculating citation precision each response is segmented into statements. However, citation precision is computed on a per-citation basis, where each citation receives a binary precision score; the citation precision of a system is the average of the precision scores across all citations in the entire dataset. Citation precision focuses on whether each citation $c_{i,j}$ is relevant to its statement $s_i$. A citation $c_{i,j}$ is “irrelevant” if $c_{i,j}$ by itself cannot support $s_i$ and removing it does not affect the ability of the rest of the citations to support $s_i$ [8].
This definition of “relevant” may overly penalize answers that have excessive citations. If two references contain the same information and happen to be cited together in a statement, the above method may misjudge both citations as irrelevant, even though they both contribute to supporting the statement.
Due to this issue, we redefine “relevant”. For a citation $c_{i,j}$, if it can support the statement $s_i$ independently, or if it can support $s_i$ after being combined with a subset of the remaining citations that cannot support $s_i$ on their own, we consider the citation relevant. Formally, $c_{i,j}$ is “relevant” if either of the following two conditions is satisfied:

$$\phi(c_{i,j}, s_i) = 1 \quad \text{or} \quad \exists\, \mathcal{C}' \subseteq \mathcal{C}_i \setminus \{c_{i,j}\}:\ \phi(\mathrm{concat}(\mathcal{C}'), s_i) = 0 \ \wedge\ \phi(\mathrm{concat}(\mathcal{C}' \cup \{c_{i,j}\}), s_i) = 1. \tag{1}$$
Unlike the conciseness pursued by citation precision in ALCE, our redefined citation precision allows LLMs to generate comprehensive citations in responses. However, whether conciseness or comprehensiveness is preferable depends on specific scenarios, making it difficult to conclusively determine which evaluation approach is better. Therefore, we report the results of both metrics in our subsequent experiments.
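A direct way to read condition (1) is as a subset search over the remaining citations of the statement. The sketch below is one possible realization under that reading; the nli callable stands in for the NLI model $\phi$ and is an assumption rather than the authors' implementation.

```python
# Hedged sketch of the redefined "relevant" check in Eq. (1); `nli` stands in for the
# NLI model phi, and the subset search is a direct, unoptimized reading of the condition.
from itertools import combinations
from typing import Callable, List

def is_relevant(citation: str, other_citations: List[str], statement: str,
                nli: Callable[[str, str], int]) -> bool:
    # Condition 1: the citation supports the statement on its own.
    if nli(citation, statement) == 1:
        return True
    # Condition 2: some subset of the remaining citations does not support the statement
    # by itself, but does support it once this citation is added.
    for r in range(1, len(other_citations) + 1):
        for subset in combinations(other_citations, r):
            without = " ".join(subset)
            with_candidate = without + " " + citation
            if nli(without, statement) == 0 and nli(with_candidate, statement) == 1:
                return True
    return False
```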
3.3 Implementation Details
We conduct experiments on seven representative recent LLMs: GPT-3.5-turbo-0125 [28], Llama-2-7b-chat [29], Llama-2-13b-chat [29], Mistral-7B-Instruct-v0.2 [30], Meta-Llama-3-8B-Instruct [31], glm-4-9b-chat [32], and Qwen2-7B-Instruct [33]. We fine-tune the LLMs on the train split of WebGLM-QA and evaluate them on the test split of WebGLM-QA as well as on the oracle versions of ASQA and ELI5 in ALCE [11, 8, 12, 13]. We use t5_xxl_true_nli_mixture [34] as the NLI model $\phi$ when evaluating citation-related metrics.
In the few-shot experiments, we provide two examples for each input. In the fine-tuning experiments, we use the LoRA [35] method to fine-tune the six open-source LLMs. To facilitate reproducibility and avoid the bias introduced by sampling during decoding, we employ greedy decoding for all open-source models.
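For reference, querying the NLI model could look like the following minimal sketch; we assume the publicly released google/t5_xxl_true_nli_mixture checkpoint on Hugging Face and the "premise: ... hypothesis: ..." input format commonly used with TRUE-style models, which may differ from the exact setup used in the experiments.

```python
# Hedged sketch of the NLI judge; checkpoint name and the "premise:/hypothesis:" input
# format follow the public TRUE release and may need adjusting.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/t5_xxl_true_nli_mixture")
model = AutoModelForSeq2SeqLM.from_pretrained("google/t5_xxl_true_nli_mixture")

def nli(premise: str, hypothesis: str) -> int:
    """Return 1 if the model judges that the premise entails the hypothesis, else 0."""
    inputs = tokenizer(f"premise: {premise} hypothesis: {hypothesis}",
                       return_tensors="pt", truncation=True, max_length=2048)
    output_ids = model.generate(**inputs, max_new_tokens=2)
    return 1 if tokenizer.decode(output_ids[0], skip_special_tokens=True).strip() == "1" else 0
```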
3.4 Results
The results on WebGLM-QA, ASQA, and ELI5 are reported in Tables 1, 2, and 3, respectively, and we make the following observations.

First, early open-source models lack the ability to generate citations. Earlier released LLMs such as the Llama-2 series, whether with 7B or 13B parameters, perform poorly in the few-shot experiments across all three datasets. In contrast, the Llama-3 model, developed by the same team as Llama-2, shows significantly better citation generation capabilities; in the few-shot experiments on ASQA and ELI5, Llama-3 even surpasses GPT-3.5-turbo by a wide margin. Other recently released LLMs also show nearly satisfactory ability in the few-shot setting. One possible reason is that, as LLMs are increasingly used in RAG tasks, model developers have started to pay attention to the attribution capabilities of LLMs and have conducted additional training on related tasks.
Second, LLMs benefit significantly from fine-tuning in their citation generation capabilities. After fine-tuning, all open-source models demonstrate substantial improvements on WebGLM-QA, in both response correctness and citation quality, compared to their few-shot results. The previously underperforming Llama-2 series models reach performance levels comparable to the other models, and Llama-2-13b even surpasses GPT-3.5-turbo in citation generation.
Third, fine-tuned LLMs do not generalize well. The models fine-tuned on WebGLM-QA perform significantly better on its test set than the original models do in the few-shot setting. However, their results on ASQA are mediocre, even falling short of their own few-shot results. Only the Llama-2 series models, which originally lacked attribution capabilities, show improved performance across all three datasets after fine-tuning. This illustrates that, even with similar task formats, model performance can vary greatly with changes in data distribution.
Furthermore, GPT-3.5-turbo demonstrates notably strong attribution capabilities in the few-shot experiment on WebGLM-QA compared to the open-source models, yet its performance is less impressive on ASQA and ELI5. This discrepancy might be due to the model’s heightened sensitivity to the examples provided in the few-shot prompt: we use identical examples as context across the three datasets, and these examples may differ substantially from the actual samples being evaluated.
Table 1. Correctness and citation quality on the WebGLM-QA test set.
Models | BLEU-4 | ROUGE-L | Recall (ALCE) | Precision (ALCE) | F1 (ALCE) | Recall (Ours) | Precision (Ours) | F1 (Ours) |
---|---|---|---|---|---|---|---|---|
Few-Shot | ||||||||
gpt-3.5-turbo | 57.44 | 41.41 | 74.21 | 74.91 | 74.56 | 78.31 | 78.18 | 78.24 |
llama2-7b | 37.27 | 40.24 | 27.85 | 54.24 | 36.80 | 28.95 | 57.33 | 38.47 |
llama2-13b | 40.13 | 40.48 | 30.08 | 50.94 | 37.82 | 31.78 | 55.63 | 40.45 |
mistral-7b | 65.51 | 47.21 | 73.03 | 66.39 | 69.55 | 74.31 | 62.84 | 68.10 |
llama3-8b | 54.38 | 45.32 | 72.88 | 72.99 | 72.93 | 74.58 | 69.25 | 71.82 |
glm4-9b | 50.38 | 44.44 | 60.98 | 73.94 | 66.84 | 62.11 | 72.09 | 66.73 |
qwen2-7b | 50.61 | 39.86 | 63.07 | 66.55 | 64.76 | 63.56 | 65.52 | 64.53 |
Fine-Tuning | ||||||||
llama2-7b | 69.61 | 55.59 | 78.96 | 79.84 | 79.40 | 79.17 | 75.94 | 77.52 |
llama2-13b | 71.98 | 57.32 | 79.56 | 82.92 | 81.21 | 80.21 | 78.21 | 79.20 |
mistral-7b | 70.02 | 56.31 | 79.52 | 81.52 | 80.51 | 80.13 | 77.39 | 78.74 |
llama3-8b | 70.73 | 57.25 | 80.00 | 82.98 | 81.46 | 80.47 | 77.80 | 79.11 |
glm4-9b | 71.31 | 57.37 | 79.07 | 83.05 | 81.01 | 80.00 | 77.94 | 78.96 |
qwen2-7b | 70.88 | 55.07 | 77.51 | 80.46 | 78.96 | 78.48 | 72.90 | 75.59 |
Table 2. Correctness and citation quality on ASQA.
Models | BLEU-4 | ROUGE-L | Recall (ALCE) | Precision (ALCE) | F1 (ALCE) | Recall (Ours) | Precision (Ours) | F1 (Ours) |
---|---|---|---|---|---|---|---|---|
Few-Shot | ||||||||
gpt-3.5-turbo | 28.56 | 27.18 | 52.56 | 51.87 | 52.21 | 53.99 | 67.57 | 60.02 |
llama2-7b | 28.88 | 27.35 | 19.75 | 33.47 | 24.84 | 21.39 | 41.85 | 28.31 |
llama2-13b | 34.34 | 28.17 | 22.25 | 36.91 | 27.76 | 23.62 | 42.81 | 30.44 |
mistral-7b | 45.31 | 30.33 | 57.05 | 57.77 | 57.41 | 60.14 | 58.65 | 59.39 |
llama3-8b | 23.62 | 30.57 | 67.38 | 64.77 | 66.05 | 68.53 | 67.47 | 68.00 |
glm4-9b | 37.45 | 31.34 | 58.50 | 61.84 | 60.12 | 59.85 | 65.36 | 62.48 |
qwen2-7b | 20.75 | 29.13 | 56.75 | 57.60 | 57.17 | 57.27 | 62.96 | 59.98 |
Fine-Tuning | ||||||||
llama2-7b | 34.92 | 30.69 | 60.02 | 46.72 | 52.54 | 61.61 | 51.38 | 56.03 |
llama2-13b | 34.37 | 30.94 | 60.35 | 45.26 | 51.73 | 62.93 | 50.71 | 56.16 |
mistral-7b | 42.97 | 31.59 | 60.08 | 49.19 | 54.09 | 62.01 | 52.80 | 57.04 |
llama3-8b | 40.07 | 31.75 | 63.62 | 49.58 | 55.73 | 64.02 | 52.93 | 57.95 |
glm4-9b | 42.30 | 31.90 | 61.64 | 41.77 | 49.80 | 62.03 | 44.02 | 51.50 |
qwen2-7b | 39.54 | 31.04 | 58.26 | 44.00 | 50.14 | 60.56 | 46.60 | 52.67 |
Table 3. Correctness and citation quality on ELI5.
Models | BLEU-4 | ROUGE-L | Recall (ALCE) | Precision (ALCE) | F1 (ALCE) | Recall (Ours) | Precision (Ours) | F1 (Ours) |
---|---|---|---|---|---|---|---|---|
Few-Shot | ||||||||
gpt-3.5-turbo | 29.06 | 15.01 | 23.33 | 24.87 | 24.08 | 25.01 | 47.24 | 32.71 |
llama2-7b | 22.06 | 15.78 | 15.82 | 39.13 | 22.53 | 16.76 | 42.14 | 23.98 |
llama2-13b | 23.93 | 16.49 | 15.40 | 32.88 | 20.98 | 16.69 | 37.66 | 23.13 |
mistral-7b | 33.50 | 16.86 | 43.73 | 40.85 | 42.24 | 45.03 | 45.95 | 45.49 |
llama3-8b | 27.38 | 16.75 | 42.98 | 46.48 | 44.66 | 45.09 | 52.88 | 48.68 |
glm4-9b | 26.08 | 16.61 | 29.59 | 44.54 | 35.56 | 30.95 | 48.89 | 37.90 |
qwen2-7b | 27.90 | 15.32 | 35.40 | 41.70 | 38.29 | 35.96 | 46.14 | 40.42 |
Fine-Tuning | ||||||||
llama2-7b | 31.69 | 17.53 | 49.44 | 51.66 | 50.53 | 49.93 | 55.13 | 52.40 |
llama2-13b | 31.54 | 17.48 | 47.01 | 51.00 | 48.92 | 47.88 | 56.46 | 51.82 |
mistral-7b | 30.60 | 17.57 | 48.92 | 54.46 | 51.54 | 49.82 | 57.42 | 53.35 |
llama3-8b | 32.30 | 17.66 | 48.71 | 52.62 | 50.59 | 49.35 | 57.15 | 52.96 |
glm4-9b | 31.64 | 17.61 | 48.69 | 52.70 | 50.62 | 49.84 | 55.07 | 52.32 |
qwen2-7b | 32.40 | 17.06 | 44.81 | 52.62 | 48.40 | 45.94 | 54.41 | 49.82 |
4 Generate-then-Refine
Based on the previous experiments and analysis, we have identified that there is still considerable room for improvement in the quality of citations within responses. Inspired by post-hoc methods [9], we propose a Generate-then-Refine approach that aims to improve citation quality without altering the response text. Previous post-hoc methods relied heavily on rule-based matching such as text overlap, which is ineffective for semantic matching. Leveraging the powerful natural language understanding capabilities of LLMs, we instead fine-tune an LLM to serve as a robust refiner.
4.1 Methods
We aim for the refiner to have three capabilities: (1) keep relevant citations within the response; (2) add necessary citations that are missing; (3) remove any irrelevant citations that are present.
To fine-tune an LLM to develop the aforementioned abilities, we first need to construct training data. The most straightforward idea is to create a set of responses with poor citation quality, each paired with a corresponding response that has perfect citation quality. We attempted to use the answers from the WebGLM-QA dataset as positive responses and to generate negative responses by randomly adding or deleting citations. Unfortunately, this approach proved ineffective, primarily because the citation quality in the dataset is not high enough: an evaluation of the dataset’s answers revealed a citation recall of only 73.77% and a citation precision of only 69.50%, which does not even reach the citation quality of the responses generated by the fine-tuned open-source models.
Due to this issue, we rely on an NLI model to help us construct high-quality target responses. We split the original responses from the dataset into statements. For each statement, after removing the existing citations, we enumerate all possible citation combinations (i.e., subsets of the provided references) and use an NLI model to determine whether each combination supports the statement. We then incorporate the resulting gold citations into the dataset. Following these operations, we obtain a dataset containing four fields, question, references, statement, and target citations, which is used for training the refiner. A sketch of this construction is given below.
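The following sketch illustrates this construction for a single statement; the nli callable is a stand-in for the NLI model, and collapsing the supporting combinations into a gold set via the union of minimal supporting combinations is an assumption made for illustration, not a detail specified above.

```python
# Hedged sketch of gold-citation construction for one statement.  `nli` stands in for the
# NLI model; taking the union of minimal supporting combinations is an illustrative choice.
from itertools import combinations
from typing import Callable, Dict, List, Set

def gold_citations(statement: str, references: Dict[int, str],
                   nli: Callable[[str, str], int]) -> Set[int]:
    supporting: List[Set[int]] = []
    ids = list(references)
    for r in range(1, len(ids) + 1):                      # enumerate all non-empty subsets
        for combo in combinations(ids, r):
            premise = " ".join(references[i] for i in combo)
            if nli(premise, statement) == 1:              # subset entails the statement
                supporting.append(set(combo))
    # Keep only minimal supporting combinations (no supporting proper subset) ...
    minimal = [c for c in supporting if not any(other < c for other in supporting)]
    # ... and merge them into the gold citation set for this statement.
    return set().union(*minimal) if minimal else set()
```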
To avoid altering the original text of the answers, we only require the refiner to output the ids of the references that the statement should actually cite. Since the refiner outputs ids rather than complete statements, the additional computational overhead of applying this method in a RAG scenario is minimal. Moreover, in our Generate-then-Refine approach, generation and refinement are decoupled, allowing the refiner to enhance the citation quality of responses generated by any method.
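For illustration, one refiner training example might be organized as follows; the field names and the example content are hypothetical and only meant to show that the refiner's target is a list of reference ids rather than rewritten text.

```python
# Illustrative (hypothetical) layout of one refiner training example; field names and
# contents are placeholders, not taken from the constructed dataset.
refiner_example = {
    "question": "Why does the sky appear blue?",
    "references": {
        1: "Rayleigh scattering causes shorter (blue) wavelengths of sunlight to scatter more ...",
        2: "The ozone layer absorbs most of the Sun's ultraviolet radiation ...",
    },
    "statement": "Shorter blue wavelengths are scattered more strongly by air molecules [2].",
    # The refiner only outputs reference ids, so the statement text is never rewritten;
    # here the target replaces the unsupported citation [2] with the supporting reference [1].
    "target_citations": [1],
}
```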
4.2 Results
In this section, we fine-tune a Mistral-7B model [30] to serve as the refiner. The main results are summarized in Tables 4, 5, and 6, and we make the following key observations.
First, the responses generated by both few-shot and fine-tuned models achieve significant improvements in citation quality after refining. All models show improvements in citation F1 across the three datasets. In the few-shot experiments on WebGLM-QA, the Llama2 series models initially perform poorly but gain 22.18 and 23.73 points in citation F1, respectively, with the help of the refiner, narrowing the performance gap with the other models. On the other two datasets, the Llama2 series models also gain nearly 29 and 18 points, respectively.
Second, the refiner exhibits excellent generalization. The previous experiments showed that the fine-tuning method does not generalize well, as the model’s capabilities are constrained by the data distribution; for instance, the fine-tuned LLMs do not perform as well on ASQA as the few-shot method. However, after refining, all six fine-tuned LLMs achieve an increase of more than 20 points in citation F1 on ASQA, and all except Qwen2-7b surpass the corresponding few-shot results. These results indicate that changes in the data distribution have minimal impact on the refiner.
Third, the refiner enhances citation quality primarily through improved citation precision. In many cases citation recall actually decreases slightly, but the increase in citation precision is substantial; for example, the fine-tuned glm4-9b model gains more than 40 points in citation precision on ASQA. This illustrates that the refiner effectively captures the relationship between statements and references, accurately determining whether a reference truly supports the response.
Table 4. Citation quality on the WebGLM-QA test set after refining; values in parentheses are changes relative to the corresponding results without refining.
Models | Recall (ALCE) | Precision (ALCE) | F1 (ALCE) | Recall (Ours) | Precision (Ours) | F1 (Ours) |
---|---|---|---|---|---|---|
Few-Shot + Refine | ||||||
gpt-3.5-turbo | 75.20(+0.99) | 81.81(+6.90) | 78.37(+3.81) | 80.40(+2.09) | 88.54(+10.36) | 84.27(+6.03) |
llama2-7b | 49.02(+21.17) | 77.28(+23.04) | 59.99(+23.19) | 51.03(+22.08) | 74.76(+17.43) | 60.66(+22.18) |
llama2-13b | 52.30(+22.22) | 77.89(+26.95) | 62.58(+24.76) | 55.09(+23.31) | 76.88(+21.25) | 64.19(+23.73) |
mistral-7b | 73.33(+0.30) | 83.53(+17.14) | 78.10(+8.55) | 75.81(+1.50) | 83.67(+20.83) | 79.55(+11.45) |
llama3-8b | 69.28(-3.60) | 81.56(+8.57) | 74.92(+1.99) | 72.58(-2.00) | 86.01(+16.76) | 78.73(+6.91) |
glm4-9b | 67.76(+6.78) | 83.31(+9.37) | 74.73(+7.90) | 70.26(+8.15) | 83.62(+11.53) | 76.36(+9.63) |
qwen2-7b | 71.11(+8.04) | 82.11(+15.56) | 76.22(+11.45) | 72.98(+9.42) | 80.78(+15.26) | 76.68(+12.16) |
Fine-Tuning + Refine | ||||||
llama2-7b | 77.74(-1.22) | 89.03(+9.19) | 83.00(+3.61) | 79.25(+0.08) | 89.79(+13.85) | 84.19(+6.67) |
llama2-13b | 78.01(-1.55) | 88.89(+5.97) | 83.10(+1.89) | 79.72(-0.49) | 90.17(+11.96) | 84.62(+5.43) |
mistral-7b | 75.99(-3.53) | 87.52(+6.00) | 81.35(+0.84) | 78.40(-1.73) | 89.04(+11.65) | 83.38(+4.65) |
llama3-8b | 78.08(-1.92) | 88.96(+5.98) | 83.17(+1.70) | 79.36(-1.11) | 90.21(+12.41) | 84.44(+5.33) |
glm4-9b | 78.20(-0.87) | 88.97(+5.92) | 83.24(+2.23) | 80.01(+0.01) | 90.02(+12.08) | 84.72(+5.76) |
qwen2-7b | 72.97(-4.54) | 87.62(+7.16) | 79.63(+0.67) | 75.01(-3.47) | 89.47(+16.57) | 81.60(+6.02) |
Table 5. Citation quality on ASQA after refining; values in parentheses are changes relative to the corresponding results without refining.
Models | Recall (ALCE) | Precision (ALCE) | F1 (ALCE) | Recall (Ours) | Precision (Ours) | F1 (Ours) |
---|---|---|---|---|---|---|
Few-Shot + Refine | ||||||
gpt-3.5-turbo | 53.82(+1.26) | 55.77(+3.90) | 54.78(+2.56) | 57.42(+3.46) | 87.10(+19.53) | 69.23(+9.21) |
llama2-7b | 42.29(+22.54) | 67.33(+33.86) | 51.95(+27.11) | 46.12(+24.73) | 75.66(+33.81) | 57.31(+29.00) |
llama2-13b | 43.94(+21.69) | 68.20(+31.29) | 53.45(+25.68) | 48.65(+25.03) | 74.57(+31.76) | 58.88(+28.44) |
mistral-7b | 60.09(+3.04) | 72.85(+15.08) | 65.86(+8.45) | 65.54(+5.40) | 82.44(+23.79) | 73.02(+13.64) |
llama3-8b | 65.40(-1.98) | 68.09(+3.32) | 66.72(+0.67) | 71.26(+2.73) | 84.39(+16.92) | 77.27(+9.28) |
glm4-9b | 64.63(+6.13) | 70.51(+8.67) | 67.44(+7.32) | 69.15(+9.30) | 82.12(+16.76) | 75.08(+12.60) |
qwen2-7b | 65.80(+9.05) | 66.23(+8.63) | 66.01(+8.84) | 73.63(+16.36) | 82.84(+19.88) | 77.96(+17.98) |
Fine-Tuning + Refine | ||||||
llama2-7b | 65.14(+5.12) | 69.04(+22.32) | 67.03(+14.49) | 74.09(+12.48) | 84.82(+33.44) | 79.09(+23.06) |
llama2-13b | 65.88(+5.53) | 69.32(+24.06) | 67.56(+15.83) | 74.32(+11.39) | 84.58(+33.87) | 79.12(+22.96) |
mistral-7b | 66.11(+6.03) | 73.38(+24.19) | 69.56(+15.46) | 72.51(+10.50) | 85.78(+32.98) | 78.59(+21.55) |
llama3-8b | 68.28(+4.66) | 74.17(+24.59) | 71.10(+15.37) | 74.20(+10.18) | 85.58(+32.65) | 79.48(+21.54) |
glm4-9b | 65.89(+4.25) | 72.21(+30.44) | 68.91(+19.11) | 71.47(+9.44) | 84.41(+40.39) | 77.40(+25.91) |
qwen2-7b | 60.56(+2.30) | 69.02(+25.02) | 64.51(+14.38) | 67.47(+6.91) | 83.96(+37.36) | 74.82(+22.15) |
Table 6. Citation quality on ELI5 after refining; values in parentheses are changes relative to the corresponding results without refining.
Models | Recall (ALCE) | Precision (ALCE) | F1 (ALCE) | Recall (Ours) | Precision (Ours) | F1 (Ours) |
---|---|---|---|---|---|---|
Few-Shot + Refine | ||||||
gpt-3.5-turbo | 23.69(+0.36) | 39.54(+14.67) | 29.63(+5.55) | 26.11(+1.10) | 72.33(+25.09) | 38.37(+5.66) |
llama2-7b | 28.34(+12.52) | 61.56(+22.43) | 38.81(+16.28) | 29.74(+12.98) | 70.27(+28.13) | 41.79(+17.81) |
llama2-13b | 27.08(+11.68) | 61.35(+28.47) | 37.57(+16.60) | 28.77(+12.08) | 71.77(+34.11) | 41.07(+17.95) |
mistral-7b | 45.27(+1.54) | 60.69(+19.84) | 51.86(+9.62) | 48.36(+3.33) | 74.75(+28.80) | 58.73(+13.24) |
llama3-8b | 42.94(-0.04) | 55.40(+8.92) | 48.38(+3.72) | 47.54(+2.45) | 76.86(+23.98) | 58.74(+10.07) |
glm4-9b | 35.47(+5.88) | 61.21(+16.67) | 44.91(+9.36) | 38.03(+7.08) | 72.64(+23.75) | 49.92(+12.02) |
qwen2-7b | 42.79(+7.39) | 60.70(+19.00) | 50.20(+11.90) | 45.06(+9.10) | 70.22(+24.08) | 54.89(+14.48) |
Fine-Tuning + Refine | ||||||
llama2-7b | 48.32(-1.12) | 66.62(+14.96) | 56.01(+5.49) | 51.19(+1.26) | 80.33(+25.20) | 62.53(+10.13) |
llama2-13b | 47.92(+0.91) | 66.19(+15.19) | 55.59(+6.67) | 51.28(+3.40) | 80.67(+24.21) | 62.70(+10.88) |
mistral-7b | 48.23(-0.69) | 66.14(+11.68) | 55.78(+4.24) | 51.25(+1.43) | 80.38(+22.96) | 62.59(+9.24) |
llama3-8b | 48.69(-0.02) | 65.24(+12.62) | 55.76(+5.17) | 51.65(+2.30) | 79.13(+21.98) | 62.50(+9.54) |
glm4-9b | 48.82(+0.13) | 67.43(+14.73) | 56.64(+6.02) | 52.18(+2.34) | 79.82(+24.75) | 63.11(+10.78) |
qwen2-7b | 43.45(-1.36) | 63.91(+11.29) | 51.73(+3.33) | 46.74(+0.80) | 79.36(+24.95) | 58.83(+9.01) |
4.3 Additional Evaluation
To further demonstrate the effectiveness of our proposed Generate-then-Refine method, we conduct additional evaluations. Since we use an NLI model both to determine the gold citations when constructing the refiner’s training data and to evaluate citation quality, we need to confirm whether the improvement in citation quality is genuine or merely aligned with the NLI model’s preferences.
Thus, we replace the NLI model with GPT-3.5-turbo when measuring citation recall and citation precision. We instruct GPT to output “Yes” only if it believes the cited references support the statement, and “No” otherwise. We conduct experiments with the Llama2-7b and Llama3-8b models on the test set of WebGLM-QA.
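A minimal sketch of this GPT-based judge is given below; the exact prompt wording is an assumption, as the text only specifies that the model must answer “Yes” or “No”.

```python
# Hedged sketch of replacing the NLI judge with GPT-3.5-turbo; the prompt wording is assumed.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def gpt_supports(cited_passages: str, statement: str) -> int:
    """Return 1 if GPT-3.5-turbo answers 'Yes' (the cited references support the statement)."""
    prompt = (
        "References:\n" + cited_passages + "\n\n"
        "Statement: " + statement + "\n\n"
        "Do the references support the statement? Answer 'Yes' or 'No' only."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return 1 if resp.choices[0].message.content.strip().lower().startswith("yes") else 0
```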
The experimental results are shown in Fig. 1. Although the NLI model and GPT-3.5-turbo differ noticeably in their judgment criteria for citation recall and citation precision, their evaluation results show a clear positive correlation. From the perspective of the relative quality of citations, their evaluations are consistent: responses with higher citation quality as measured by the NLI model are also recognized as such by GPT-3.5-turbo. This indicates that the improvement in citation quality brought about by our proposed method is not an artifact of the NLI model’s preferences; the improvements brought by the refiner are genuine.
Meanwhile, we also attempted to guide the LLM to become an excellent refiner using the few-shot method, which would simplify the pipeline if effective. Unfortunately, none of the LLMs we tried inherently possessed strong refining capabilities, and using the few-shot method for refining significantly reduced citation quality.
Our proposed Generate-then-Refine method combines pre-hoc and post-hoc citation. To obtain more comprehensive experimental results, we removed the generated citations and re-added them using a rule-based post-hoc method, again with the Llama2-7b and Llama3-8b models on the WebGLM-QA test set. We used BLEU-4 [27] and ROUGE-L [26] to match the statements in the answers against the references, with the similarity threshold set to 0.3: whenever the similarity score exceeded the threshold, we added a citation to the corresponding answer statement.
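The rule-based post-hoc baseline can be sketched as follows; the segmentation granularity and the use of the rouge-score package are assumptions, and only the ROUGE-L variant is shown.

```python
# Hedged sketch of the rule-based post-hoc baseline: attach reference i to a statement
# whenever their ROUGE-L F1 exceeds 0.3.  Segmentation and matching granularity are assumed.
from typing import Dict, List
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def post_hoc_cite(statements: List[str], references: Dict[int, str],
                  threshold: float = 0.3) -> List[List[int]]:
    """For each answer statement, return ids of references whose ROUGE-L F1 exceeds the threshold."""
    cited: List[List[int]] = []
    for s in statements:
        ids = [rid for rid, ref in references.items()
               if _scorer.score(ref, s)["rougeL"].fmeasure > threshold]
        cited.append(ids)
    return cited
```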
The experimental results are shown in Fig. 2. It can be observed that the post-hoc method only works for answers generated by models that lack attribution capabilities. Once a model has good attribution capabilities, the post-hoc method performs worse than the pre-hoc method in both citation recall and citation precision. Moreover, because the similarity threshold is difficult to set, the BLEU-based method performs significantly worse than the ROUGE-based method in citation precision.
5 Conclusion
In this work, we comprehensively evaluate the ability of the latest LLMs to generate citations in their responses. We introduce new citation evaluation metrics to address shortcomings in the existing evaluation framework. To improve the citation quality of LLMs’ responses, we propose the Generate-then-Refine method, which fine-tunes a model to serve as a refiner. Our experiments show that this method substantially improves the quality of citations.
5.0.1 Acknowledgements
This work was funded by the National Natural Science Foundation of China (NSFC) under Grants No. 62372431 and 62472408, the Strategic Priority Research Program of the CAS under Grants No. XDB0680102, XDB0680301, the National Key Research and Development Program of China under Grants No. 2023YFA1011602, the Youth Innovation Promotion Association CAS under Grants No. 2021100, the Lenovo-CAS Joint Lab Youth Scientist Project, and the project under Grants No. JCKY2022130C039.
References
- [1] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, pp. 1877-1901 (2020)
- [2] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., et al.: Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), 1-38 (2023)
- [3] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., et al.: Retrieval-augmented generation for knowledge-intensive nlp tasks. In: Advances in Neural Information Processing Systems, pp. 9459-9474 (2020)
- [4] Shuster, K., Poff, S., Chen, M., Kiela, D., Weston, J.: Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567 (2021)
- [5] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., et al.: Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023)
- [6] Jiang, Z., Xu, F. F., Gao, L., Sun, Z., Liu, Q., et al.: Active retrieval augmented generation. arXiv preprint arXiv:2305.06983 (2023)
- [7] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., et al.: Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294 (2023)
- [8] Gao, T., Yen, H., Yu, J., Chen, D.: Enabling large language models to generate text with citations. arXiv preprint arXiv:2305.14627 (2023)
- [9] Huang, J., Chang, K. C. C.: Citation: A Key to Building Responsible and Accountable Large Language Models. arXiv preprint arXiv:2307.02185 (2023)
- [10] Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., et al.: Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 (2021)
- [11] Liu, X., Lai, H., Yu, H., Xu, Y., Zeng, A., et al.: WebGLM: Towards an efficient web-enhanced question answering system with human preferences. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4549-4560 (2023)
- [12] Stelmakh, I., Luan, Y., Dhingra, B., Chang, M. W.: ASQA: Factoid questions meet long-form answers. arXiv preprint arXiv:2204.06092 (2022)
- [13] Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., et al.: ELI5: Long form question answering. arXiv preprint arXiv:1907.09190 (2019)
- [14] Menick, J., Trebacz, M., Mikulik, V., Aslanides, J., Song, F., et al.: Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147 (2022)
- [15] Qian, H., Zhu, Y., Dou, Z., Gu, H., Zhang, X., et al.: Webbrain: Learning to generate factually correct articles for queries by grounding on large web corpus. arXiv preprint arXiv:2304.04358 (2023)
- [16] Qin, Y., Cai, Z., Jin, D., Yan, L., Liang, S., et al.: Webcpm: Interactive web search for chinese long-form question answering. arXiv preprint arXiv:2305.06849 (2023)
- [17] Sun, H., Cai, H., Wang, B., Hou, Y., Wei, X., et al.: Towards verifiable text generation with evolving memory and self-reflection. arXiv preprint arXiv:2312.09075 (2023)
- [18] Huang, C., Wu, Z., Hu, Y., Wang, W.: Training language models to generate text with citations via fine-grained rewards. arXiv preprint arXiv:2402.04315 (2024)
- [19] Rashkin, H., Nikolaev, V., Lamm, M., Aroyo, L., Collins, M., et al.: Measuring attribution in natural language generation models. Computational Linguistics 49(4), 777-840 (2023)
- [20] Gao, L., Dai, Z., Pasupat, P., Chen, A., Chaganty, A. T., et al.: Rarr: Researching and revising what language models say, using language models. arXiv preprint arXiv:2210.08726 (2022)
- [21] Bohnet, B., Tran, V. Q., Verga, P., Aharoni, R., Andor, D., et al.: Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037 (2022)
- [22] Liu, N. F., Zhang, T., Liang, P.: Evaluating verifiability in generative search engines. arXiv preprint arXiv:2304.09848 (2023)
- [23] Yue, X., Wang, B., Chen, Z., Zhang, K., Su, Y., et al.: Automatic evaluation of attribution by large language models. arXiv preprint arXiv:2305.06311 (2023)
- [24] Kamalloo, E., Jafari, A., Zhang, X., Thakur, N., Lin, J.: Hagrid: A human-llm collaborative dataset for generative information-seeking with attribution. arXiv preprint arXiv:2307.16883 (2023)
- [25] Hu, N., Chen, J., Wu, Y., Qi, G., Bi, S., et al.: Benchmarking large language models in complex question answering attribution using knowledge graphs. arXiv preprint arXiv:2401.14640 (2024)
- [26] Lin, C. Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp. 74-81 (2004)
- [27] Papineni, K., Roukos, S., Ward, T., Zhu, W. J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
- [28] OpenAI: Introducing ChatGPT. https://openai.com/blog/chatgpt, last accessed 2024/07/13
- [29] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
- [30] Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., et al.: Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)
- [31] Meta: Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3, last accessed 2024/07/13
- [32] GLM Team, et al.: ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793 (2024)
- [33] Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., et al.: Qwen2 Technical Report. arXiv preprint arXiv:2407.10671 (2024)
- [34] Honovich, O., Aharoni, R., Herzig, J., Taitelbaum, H., Kukliansy, D., et al.: TRUE: Re-evaluating factual consistency evaluation. arXiv preprint arXiv:2204.04991 (2022)
- [35] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., et al.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)