One Token Can Help! Learning Scalable and Pluggable Virtual Tokens
for Retrieval-Augmented Large Language Models

Yutao Zhu, Zhaoheng Huang, Zhicheng Dou, Ji-Rong Wen
Corresponding author.
Abstract

Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved information or directly fine-tune LLMs to adapt to RAG scenarios. Although fine-tuning can yield better performance, it often compromises the LLMs’ general generation capabilities by modifying their parameters. This limitation poses challenges in practical applications, especially when LLMs are already deployed, as parameter adjustments may affect their original functionality. To address this, we propose a novel method that involves learning scalable and pluggable virtual tokens for RAG. By maintaining the LLMs’ original parameters and fine-tuning only the embeddings of these pluggable tokens, our approach not only enhances LLMs’ performance but also preserves their general generation capabilities. Furthermore, we design several training strategies to improve the scalability, flexibility, and generalizability of our method. Comprehensive experiments across 12 question-answering tasks demonstrate the superiority of our approach.

Introduction

Large language models (LLMs) have achieved remarkable performance across various natural language processing tasks (Brown et al. 2020; OpenAI 2023; Touvron et al. 2023). Despite their extensive parameters enabling them to learn rich knowledge during pre-training, LLMs may still generate hallucinated, outdated, or inaccurate content, especially in scenarios requiring long-tail knowledge (Ji et al. 2023; Zhang et al. 2023b). To address this problem, retrieval-augmented generation (RAG) has emerged as a pivotal strategy. By explicitly decoupling knowledge retrieval from the backbone LLMs, such architectures have achieved more accurate and reliable content generation and shown particularly enhanced performance on knowledge-intensive tasks such as open-domain question answering (Petroni et al. 2021; Tan et al. 2024; Jin et al. 2024b).

Existing efforts in RAG development can be roughly categorized into two groups (as illustrated in Figure 1). The first group leverages the in-context learning capabilities of LLMs by incorporating retrieved information into the input along with appropriate prompts (Shi et al. 2023; Ram et al. 2023). This allows for straightforward application to any off-the-shelf LLM without tuning its parameters. However, its effectiveness largely depends on the human experience in crafting effective prompts and the LLM’s ability to interpret these prompts. The second group focuses on training LLMs to enhance their performance in RAG scenarios. This training might involve either end-to-end pre-training (Guu et al. 2020; Borgeaud et al. 2022) or fine-tuning (Lin et al. 2023; Wang et al. 2023b) for specific tasks. These approaches can often lead to better performance, but they require significant computational resources. Recently, parameter-efficient fine-tuning techniques, such as LoRA (Hu et al. 2022), have been widely studied, significantly reducing training costs. These methods can optimize the LLMs’ parameters for RAG, but unfortunately compromise the model’s general abilities in non-RAG scenarios, such as commonsense reasoning and in-context learning. All these limitations prevent their application to LLMs already operational in real-world settings.

Figure 1: Illustration of existing methods for RAG and our proposed method. Our method can improve LLMs’ performance in RAG scenarios by incorporating trainable virtual tokens, and these tokens can be removed to preserve the general generation abilities in non-RAG scenarios.

Therefore, a critical research problem arises: Is it possible to enhance LLMs’ performance under RAG scenarios while preserving their general generation capabilities? To achieve this, we introduce a novel, lightweight tuning method named SPRING, which learns Scalable and Pluggable viRtual tokens for retrIeval-augmeNted Generation. Our basic idea is to add trainable virtual tokens to help LLMs learn to solve RAG tasks. Through fine-tuning, these virtual tokens effectively enhance the LLM’s capability to understand retrieved information and its correlation with user inputs. Importantly, as the LLM’s original parameters are frozen, its general generation abilities are preserved without any loss. During inference, when retrieval is triggered, these trained virtual tokens can be simply added to the prompt, which includes both the retrieved results and user input, thereby significantly enhancing performance. Moreover, we employ a scalable training approach, allowing the number of virtual tokens to be adjusted according to the needs of the inference scenario. Various training strategies have been implemented to further improve the generalizability of our method, ensuring robustness regardless of the number of retrieved results.

In experiments, SPRING is trained with the base and instruction fine-tuned versions of Mistral-7b, LLaMA-2-7b, and LLaMA-2-13b models and evaluated on 12 commonly used QA datasets, covering both in-domain and out-of-domain scenarios. The experimental results demonstrate that SPRING not only effectively improves the RAG performance of LLMs but also successfully preserves their general generation capabilities. Overall, the SPRING method exhibits four main characteristics:

\bullet Lightweight yet effective. Instead of updating the full parameters of the LLMs, we opt to freeze the pre-trained models and only learn the embeddings for the added virtual tokens. For example, adding 50 tokens to the Mistral-7b model introduces only 0.2M parameters in total. Despite these minimal parameters, SPRING improves the average EM and F1 scores by more than 43% and 17% across 12 QA datasets, respectively.

\bullet Scalable. With our proposed scalable training approach, SPRING can be effective with any number of virtual tokens ($k\in[1,50]$ in our experiments). Remarkably, even just one token can substantially improve the LLMs’ performance in RAG scenarios.

\bullet Pluggable. Owing to its lightweight design, SPRING can be applied in a plug-and-play manner. When retrieval is triggered, simply adding the virtual tokens can lead to better performance. In non-RAG scenarios, the virtual tokens are not added so the LLMs’ original capabilities can be well preserved. This characteristic is crucial for LLMs that have already been deployed for practical use.

\bullet Generalizable. Our robust training strategies ensure that SPRING is adaptable to different retrievers and various numbers of retrieved results. Consequently, there is no need to retrain SPRING with each update to the retrieval system, enhancing its practicality and efficiency.

Related Work

Retrieval-Augmented Generation

Compared to standard text generation, retrieval-augmented generation (RAG) incorporates a retrieval module that accesses external knowledge to enhance generation quality (Lewis et al. 2020; Guu et al. 2020; Zhu et al. 2023; Jin et al. 2024a). The mainstream RAG follows a “retrieve-then-read” paradigm, where the retrieval module provides external knowledge as additional context, which is then read by generation models to produce the final output (Shi et al. 2023; Ram et al. 2023; Borgeaud et al. 2022; Lin et al. 2023; Zhu et al. 2024). To optimize the use of external knowledge, some methods focus on crafting effective prompts that guide the utilization of retrieved information (Shi et al. 2023; Ram et al. 2023). These prompt-based methods are applicable to any LLM without tuning its parameters. However, they depend heavily on skillful prompt writing and the LLMs’ ability to understand instructions. In contrast, other studies attempt to directly train the model to better use the retrieved knowledge. For example, REALM (Guu et al. 2020) and RETRO (Borgeaud et al. 2022) incorporate retrieval in end-to-end retrieval-augmented pre-training. RA-DIT (Lin et al. 2023) employs fine-tuning to enhance LLMs’ retrieval understanding. These tuning-based methods often yield better performance than prompt-based methods by optimizing model parameters for RAG. However, they may compromise the LLMs’ general capabilities, particularly in non-retrieval scenarios. Different from existing methods, we design a new lightweight tuning method for RAG. It is a plug-and-play module that enhances RAG performance using trainable virtual tokens, which can be removed in non-RAG scenarios to preserve the LLMs’ general generation abilities.

Parameter-Efficient Fine-Tuning

The paradigm of “pre-training then fine-tuning” has demonstrated efficacy across various natural language (Devlin et al. 2019; Raffel et al. 2020; Radford et al. 2019) and vision tasks (He et al. 2020; Dosovitskiy et al. 2021; Chen et al. 2020). The common fine-tuning process involves tuning all parameters of a model, which is computationally intensive, especially for LLMs. To address this, parameter-efficient fine-tuning (PEFT) (Mangrulkar et al. 2022) approaches have been developed. These approaches freeze most of the pre-trained models’ parameters, yet still manage to achieve comparable performance on downstream tasks. PEFT has been widely studied (Wan et al. 2023), and typical methods include adapter-based tuning (Houlsby et al. 2019; Lin, Madotto, and Fung 2020; Rebuffi, Bilen, and Vedaldi 2017; Chen et al. 2023), low-rank adaptation (LoRA) (Hu et al. 2022; Dettmers et al. 2023), and prompt tuning (Li and Liang 2021; Lester, Al-Rfou, and Constant 2021; Liu et al. 2021b, a; Qin and Eisner 2021). Adapter-based tuning inserts lightweight modules into a model’s existing layers and has been extended to various domains (Gao et al. 2024; Hu et al. 2023; Zhang et al. 2023a). LoRA (Hu et al. 2022) introduces trainable low-rank matrices that adjust the model’s weight updates, achieving promising fine-tuning performance on LLMs (Hu et al. 2023). Prompt tuning incorporates a series of trainable prompt tokens into LLMs. These tokens can be inserted either into the input layer only (Lester, Al-Rfou, and Constant 2021; Liu et al. 2021b) or into all of the intermediate layers (Li and Liang 2021; Liu et al. 2021a). In this paper, we propose a novel prompt tuning method, SPRING, specifically designed for RAG scenarios. Our method introduces virtual tokens between retrieved results and the input, exploiting the auto-regressive generation paradigm to improve the model’s ability to utilize retrieved information. Additionally, it is designed to be scalable and pluggable, thus broadening its application scope while preserving the original generative capabilities of LLMs.

Figure 2: Illustration of SPRING. Only the embeddings of the added $n$ tokens are trainable during fine-tuning. The added tokens are scalable: any first $k$ ($k\leq n$) tokens can be used in inference.

Methodology

To take advantage of both the flexibility of prompt-based methods and the efficacy of fine-tuning-based methods, we propose SPRING to learn scalable and pluggable virtual tokens for retrieval-augmented generation (RAG).

Problem Formulation

Language models are designed to calculate the probability distribution over sequences of natural language texts. Auto-regressive models are commonly used for this through next-token prediction:

p_{\text{LM}}=\prod_{i=1}^{m}p_{\theta}(x_{i}\mid x_{<i}),   (1)

where $x_{<i}$ denotes the sequence of tokens preceding $x_{i}$ at each step, and $\theta$ represents the parameters of the model. For RAG, a retrieval corpus $\mathcal{D}$ and a retriever $M$ are introduced. Then, the generation process is conditioned on both $x_{<i}$ and the retrieved results $R=M_{\mathcal{D}}(x_{<i})$ as:

p_{\text{RAG}}=\prod_{i=1}^{m}p_{\theta}(x_{i}\mid R;x_{<i}),   (2)
p_{\text{RAG-QA}}=\prod_{i=1}^{m}p_{\theta}(a_{i}\mid R;Q;a_{<i}).   (3)

Note that here $x_{<i}$ serves as the query for retrieval. In question-answering (QA) tasks, $x_{<i}$ is usually the question $Q$, and the learning objective is to generate the correct answer $A=\{a_{i}\}_{i=1}^{m}$. The retriever can yield multiple passages, which are concatenated into a long text sequence using a proper separator such as “\n\n”. For brevity, this formulation directly concatenates the retrieved results $R$ with the question $Q$, omitting more complex prompt designs. Henceforth, we use the QA notation, as our evaluation is performed on QA tasks.
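As an illustration, the minimal sketch below shows one way the input in Equation (3) can be assembled, assuming the retriever returns a list of passage strings; the retriever call in the usage comment is hypothetical, and only the concatenation mirrors the formulation above.

```python
# A minimal sketch of assembling the RAG-QA input in Equation (3); the
# retriever API in the usage comment is hypothetical.
def build_rag_input(passages: list[str], question: str) -> str:
    retrieved = "\n\n".join(passages)    # concatenate R with "\n\n" separators
    return f"{retrieved}\n\n{question}"  # prepend R directly to the question Q

# Example usage (assuming some retriever object):
# passages = retriever.search(question, top_k=3)
# prompt = build_rag_input(passages, "Who wrote The Old Man and the Sea?")
```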

Scalable and Pluggable Virtual Tokens for RAG

Our SPRING method, shown on the left side of Figure 2, introduces trainable virtual tokens into the input to optimize LLMs for RAG scenarios. Specifically, following the notation in Equation (3), we add $n$ trainable tokens $T=[t_{1},t_{2},\cdots,t_{n}]$ between the retrieved results $R$ and the input $Q$. The generation process can then be described as:

p_{\text{SPRING}}=\prod_{i=1}^{m}p_{\theta,\delta}(a_{i}\mid R;[t_{1},t_{2},\cdots,t_{n}];Q;a_{<i}),

where $\delta\in\mathbb{R}^{n\times d}$ represents the added parameters of the trainable tokens (i.e., their embeddings), and $d$ is the embedding size of the LLM. $\theta$ denotes the parameters of the backbone LLM, which are frozen during training. Given that $|\delta|\ll|\theta|$, our method is highly efficient for training. For example, with the Mistral-7b model (where $d=4,096$), when $n=50$ tokens are added, we only add and train $50\times 4,096=0.2\text{M}$ parameters, approximately 0.003% of the full model.
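This parameterization can be set up in a few lines. The sketch below is an illustration under stated assumptions (the model name and random initialization are placeholders, and the forward pass that injects the embeddings between $R$ and $Q$ is omitted); it shows that only the $n\times d$ virtual-token matrix carries gradients while the backbone stays frozen.

```python
import torch
from transformers import AutoModelForCausalLM

# Freeze the backbone LLM (theta) and create the trainable virtual-token
# embedding matrix (delta). Model name and initialization are illustrative.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
for param in model.parameters():
    param.requires_grad = False                     # theta is frozen

n, d = 50, model.config.hidden_size                 # d = 4096 for Mistral-7b
virtual_embeddings = torch.nn.Parameter(torch.randn(n, d) * 0.02)  # delta

print(f"trainable parameters: {virtual_embeddings.numel()}")  # 50 * 4096 = 204,800
```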

Importantly, we place the virtual tokens $T$ between the retrieved results $R$ and the question $Q$ for two main reasons: (1) In the auto-regressive generation paradigm, positioning the tokens after the retrieved results allows them to attend to this information, thereby aiding the model’s comprehension. (2) Recent studies have indicated that LLMs are particularly sensitive to the end of an input (Liu et al. 2023). By consistently placing these virtual tokens before the question across all test samples, we aim to mitigate any potential adverse effects on the understanding of the question.

Scalable

In practical deployments, LLMs are often constrained by their maximum input lengths, limiting the number of tokens available for retrieval augmentation (especially when the retrieved results are very long). Therefore, it is desirable to design a mechanism so that any number of virtual tokens can be used at inference time to improve RAG performance. To achieve this, we propose an optimization strategy that works like a “spring” (as shown in Figure 2). Specifically, for a given sample $\{R,Q,A\}$ with $n$ added tokens in total, we randomly select a number $k$ ($k\leq n$) and utilize the first $k$ virtual tokens $t_{1:k}$ to construct the training example as $[R;t_{1},t_{2},\cdots,t_{k};Q]$. This method allows for the flexible optimization of any number of virtual tokens. Consequently, the number of virtual tokens incorporated during inference can be arbitrarily selected based on the requirements of the application. The effectiveness of this strategy and its comparison with other methods are further discussed in our experiments.
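A single training example under this strategy can be constructed as in the sketch below. It assumes `virtual_embeddings` is the trainable matrix from above and `embedding_layer` is the LLM’s frozen input-embedding layer; batching and the loss mask over the answer tokens are omitted.

```python
import random
import torch

# A minimal sketch of the scalable ("spring") training strategy: draw a random
# k <= n and insert only the first k virtual-token embeddings between the
# retrieved results R and the question Q. The result is fed to the LLM via
# inputs_embeds; only the answer tokens A would contribute to the loss.
def build_training_embeddings(retrieved_ids, question_ids, answer_ids,
                              virtual_embeddings, embedding_layer):
    n = virtual_embeddings.shape[0]
    k = random.randint(1, n)                 # k is drawn uniformly from [1, n]
    return torch.cat([
        embedding_layer(retrieved_ids),      # R: retrieved passages
        virtual_embeddings[:k],              # first k virtual tokens t_{1:k}
        embedding_layer(question_ids),       # Q: the question
        embedding_layer(answer_ids),         # A: supervision target
    ], dim=0)
```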

Pluggable

Due to its designed structure, our method provides considerable flexibility in application. Practically, if user input is assessed to require external knowledge, our virtual tokens can be simply appended after the retrieval results and then fed, along with the user input, into the LLM for generation. In contrast, if the user input does not necessitate retrieval, it can be processed directly by the LLM. As our approach does not adjust the original parameters of the LLM, it preserves the model’s inherent capabilities. This feature is particularly important for industry or business since existing LLMs may have already been deployed for multiple purposes; our method enhances the retrieval understanding capabilities of these models without compromising their existing functionalities.

Inference

We illustrate the instructions for using our SPRING on the right side of Figure 2. After training, the embeddings of the added tokens have been optimized for RAG, but these tokens do not correspond to any existing tokens in the vocabulary. To make them easy to use, we can add some special tokens (e.g., [r1], …, [r50]) to the vocabulary and initialize their embeddings with the trained embeddings. Then, during inference, after obtaining the retrieved results $R$, we can add any number of these special tokens (e.g., [r1] … [rk]) after $R$ and input them with the question $Q$ to the LLMs for generation. We also provide an example code snippet in the Appendix for using our method in practice.

We refer to our method as SPRING due to its scalable and pluggable nature, making it particularly well-suited for enhancing existing LLMs that have already been deployed. Additionally, it effectively bridges the gap between retrieved results and user input, significantly improving the LLMs’ capabilities in understanding the retrieved external knowledge.

with Retrieval without Retrieval
Dataset Metric Concat Prompt Prefix LoRA SPRING SPRING+ Concat Prompt Prefix LoRA SPRING SPRING+
Tuning Parameters 0 0 0.2M 4M 0.2M 4.2M 0 0 0.2M 4M 0.2M 4.2M
Trivia QA EM 0.00 57.79 11.74 62.76 65.71 63.89 0.01 39.90 0.00 0.03 46.56 43.37
F1 65.60 80.33 59.97 85.44 85.26 85.94 63.96 69.72 28.30 34.91 74.48 74.89
NQ EM 0.00 28.99 13.04 47.95 42.35 49.78 0.00 13.36 0.00 0.00 18.80 25.54
F1 41.77 58.72 38.22 74.15 70.73 75.22 43.74 48.63 17.74 29.10 55.75 60.82
HQA EM 0.00 26.36 5.79 39.95 35.26 41.14 0.00 17.07 0.00 0.03 20.15 23.38
F1 44.91 56.15 42.59 68.93 65.44 69.95 47.54 49.50 17.45 27.57 54.79 57.99
SQuAD EM 0.00 23.92 7.19 35.71 33.67 35.98 0.00 8.61 0.00 0.00 12.71 13.90
F1 43.05 57.66 39.75 68.05 66.99 68.41 43.32 46.81 21.88 27.51 53.58 54.93
WebQ EM 0.00 17.53 4.44 43.65 31.84 48.10 0.00 14.79 0.00 0.00 24.95 28.81
F1 37.46 52.10 31.55 71.99 64.78 73.88 44.34 50.60 20.14 32.36 59.83 62.95
2Wiki EM 0.00 22.64 4.38 35.93 31.80 37.12 0.00 23.45 0.00 0.01 24.62 28.60
F1 47.77 55.58 41.82 63.85 62.03 64.60 52.83 53.55 21.14 37.09 56.83 59.12
CoQA EM 0.00 8.20 1.56 12.89 13.28 13.87 0.00 8.59 0.00 0.00 9.96 12.50
F1 27.98 36.72 20.02 41.19 42.41 42.04 32.97 36.58 13.15 18.99 39.96 41.36
MS MARCO EM 0.00 5.73 0.60 8.13 6.57 8.27 0.00 2.56 0.00 0.01 2.09 3.24
F1 56.44 53.81 50.56 54.81 53.48 55.90 49.50 47.75 47.44 52.44 51.41 49.84
PopQA* EM 0.00 39.79 10.02 47.15 48.71 46.98 0.00 16.05 0.00 0.00 20.25 18.70
F1 56.49 68.26 44.54 73.12 73.90 73.29 53.61 54.85 20.39 25.09 58.32 58.05
Fermi* EM 0.00 0.06 0.00 0.12 0.18 0.24 0.00 0.06 0.06 0.00 0.06 0.19
F1 21.66 18.16 12.65 29.63 31.15 30.67 25.51 17.84 20.33 25.42 29.28 30.37
Musique* EM 4.01 4.05 0.04 10.86 8.80 12.58 0.00 1.93 0.66 0.17 3.64 4.43
F1 38.79 38.64 20.91 52.65 48.74 53.65 42.30 36.18 35.22 45.31 44.93 47.70
Bamboogle* EM 12.80 12.80 0.00 24.22 22.66 28.13 0.00 4.69 3.20 0.00 12.00 12.80
F1 45.81 48.13 13.67 59.05 56.95 62.00 42.61 42.24 36.22 45.14 47.69 46.89
Average EM 1.40 19.80 4.90 30.78 28.40 32.17 0.00 12.59 0.33 0.02 16.32 17.96
F1 43.98 51.30 34.69 61.91 60.15 62.96 45.19 46.19 24.95 33.41 52.24 53.74
Table 1: Evaluation results of different methods on twelve QA datasets. The retriever is the E5-large model, and the number of retrieved passages is set to three. The number of virtual tokens used in SPRING is set to 50. PopQA, Fermi, Musique, and Bamboogle are unseen during training. “Prefix” stands for prefix-tuning, and “SPRING+” is trained based on LoRA’s checkpoint. The best results are in bold.

Experiment

Datasets and Retrievers

We conduct experiments on twelve commonly used question-answering datasets, including TriviaQA (TQA) (Joshi et al. 2017), Natural Questions (NQ) (Kwiatkowski et al. 2019), HotpotQA (HQA) (Yang et al. 2018), SQuAD 1.1 (Rajpurkar et al. 2016), Web Questions (WebQ) (Berant et al. 2013), 2WikiMultiHopQA (2Wiki) (Ho et al. 2020), CoQA (Reddy, Chen, and Manning 2019), MS MARCO (Nguyen et al. 2016), PopQA (Mallen et al. 2023), Fermi (Kalyan et al. 2021), Musique (Trivedi et al. 2022), and Bamboogle (Press et al. 2023). These datasets are publicly available on HuggingFace or their official websites. To evaluate the generalizability of the methods, we select PopQA, Fermi, Musique, and Bamboogle as held-out datasets. We mix the training sets of all remaining datasets for training. For all datasets, we prioritize the use of test sets for evaluation purposes. In cases where the test set is not available, we utilize the development set instead. It is worth noting that, though some datasets provide gold reference passages for the answers, we do not use them; instead, we use passages retrieved from the retrieval corpus in both the training and inference stages, which aligns with practical applications. Exact match (EM) and F1 score are employed as evaluation metrics.
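For reference, the sketch below gives the standard SQuAD-style definitions of EM and token-level F1; the exact normalization used in our evaluation scripts may differ slightly.

```python
import re
import string
from collections import Counter

# Standard answer normalization: lowercase, strip punctuation and articles,
# collapse whitespace.
def normalize(text: str) -> str:
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```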

For the retrieval sets, we follow previous studies (Yoran et al. 2023) and use the combination of the Wikipedia and MS MARCO datasets as the retrieval corpus. Wikipedia contains high-quality human knowledge, which is helpful for many knowledge-intensive tasks. MS MARCO contains a large number of web pages, which can provide the information necessary for answering many natural language questions. We use versions of these datasets that have already been preprocessed into passages and released on HuggingFace (Wikipedia passages: https://huggingface.co/datasets/Tevatron/wikipedia-nq-corpus; MS MARCO passages: https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus). The Wikipedia set has 21M passages, while the MS MARCO set has 8M passages. More details are provided in the Appendix.

We use E5-large (Wang et al. 2022) as the main retriever in our experiments. The impact of other retrievers, i.e., BM25 (Robertson and Zaragoza 2009), BGE-base (Xiao et al. 2023), and E5-base, is studied in our further analysis. Among these retrievers, BM25 is a non-neural sparse retrieval algorithm, while others are neural-based dense retrievers. In general, dense retrievers perform better on several benchmarks (Muennighoff et al. 2023).
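The sketch below illustrates how an E5-style dense retriever scores query-passage pairs, assuming the usual "query: "/"passage: " prefixes and mean pooling; corpus indexing is omitted and the model name is illustrative, so this is not the exact retrieval pipeline used in our experiments.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# E5-style dense retrieval: encode texts, mean-pool over the attention mask,
# L2-normalize, and score with a dot product (cosine similarity).
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large")
encoder = AutoModel.from_pretrained("intfloat/e5-large")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling
    return F.normalize(pooled, dim=-1)

query_emb = embed(["query: who wrote the old man and the sea"])
passage_emb = embed(["passage: Ernest Hemingway wrote The Old Man and the Sea."])
score = (query_emb @ passage_emb.T).item()
```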

Baseline Methods

We consider both the base and instruction fine-tuned versions of Mistral-7b, LLaMA-2-7b, and LLaMA-2-13b as the backbone models, and compare our SPRING with the following baselines.

\bullet Concat: This method directly concatenates the retrieval results and the question for evaluation.

\bullet Prompt: This method uses a manually-crafted prompt to indicate the use of retrieval information (details are provided in Appendix).

\bullet Prefix-tuning (Li and Liang 2021): This method uses prefix-tuning to fine-tune the backbone models. To make a fair comparison with our method, we add 50 prefix tokens for training.

\bullet LoRA (Hu et al. 2022): This method uses LoRA to fine-tune the backbone models. We use the hyperparameters suggested by LLaMA’s official guidance (LLaMA Recipes: https://github.com/meta-llama/llama-recipes/blob/main/src/llama_recipes/configs/peft.py). To further validate the effectiveness of our SPRING on models that have already been optimized for RAG, we also train our method based on the LoRA checkpoint, and denote this variant as SPRING+.

Implementation Details

We use PyTorch (Paszke et al. 2019) and the HuggingFace Accelerate library to implement all methods. The learning rate is set to 1e-4 with a warm-up ratio of 0.1. All methods are trained for three epochs, with a training batch size of 256. We use eight NVIDIA A800 GPUs for training. Training our SPRING for Mistral-7b models consumes around 2.2 hours per epoch. The embeddings of the virtual tokens are initialized with the embeddings of the prompt: “According to the previous relevant passages, please answer the following question. Only return the answer without any other words.” Following the settings of prefix-tuning, if the number of tokens required exceeds those available in the prompt, the prompt is repeated to complete the initialization; if fewer are needed, the prompt is truncated accordingly. Additionally, we experiment with random initialization of tokens but observe that its performance is slightly worse than that achieved through prompt-based initialization. Our code is available at https://github.com/DaoD/SPRING.
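The prompt-based initialization can be sketched as follows; the snippet assumes the HuggingFace Transformers API and is an illustration rather than our exact training code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize the n virtual-token embeddings from the token embeddings of a
# natural-language prompt, repeating or truncating it to exactly n tokens.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

init_prompt = ("According to the previous relevant passages, please answer the "
               "following question. Only return the answer without any other words.")
n = 50
ids = tokenizer(init_prompt, add_special_tokens=False)["input_ids"]
while len(ids) < n:          # repeat the prompt if it is shorter than n tokens
    ids = ids + ids
ids = ids[:n]                # truncate to exactly n tokens

with torch.no_grad():
    init = model.get_input_embeddings()(torch.tensor(ids)).clone()
virtual_embeddings = torch.nn.Parameter(init)   # only these embeddings are trained
```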

Experimental Results

We fine-tune the prefix-tuning, LoRA, and SPRING methods on RAG tasks, and then evaluate their performance in scenarios both with (RAG) and without (non-RAG) retrieval. For SPRING, we use $k=50$ virtual tokens for inference by default, and the impact of the token quantity $k$ is discussed later. The experimental results are shown in Table 1. To save space, we only show the results based on Mistral-7b-instruct; other results are provided in the Appendix.

We can observe: (1) It is evident that SPRING significantly improves the RAG performance of the original LLM with manually-crafted prompts (the average EM and F1 scores are improved by 43.4% and 17.3%, respectively). It outperforms LoRA on certain datasets, such as TriviaQA and CoQA. Given that SPRING involves only 0.2M trainable parameters, these results demonstrate its remarkable efficiency and effectiveness. (2) While LoRA achieves slightly better performance on some datasets, it adjusts the LLMs’ original parameters, which adversely impacts their performance in non-RAG scenarios: a significant drop is observed, with results even far worse than the original models. This challenge also extends to other general generation tasks, which will be discussed in the next section. (3) In non-RAG evaluation, only SPRING and SPRING+ demonstrate better performance than the Prompt method. This indicates that even in the absence of retrieved results, adding virtual tokens is still beneficial. We speculate that beyond simply utilizing retrieved results, virtual tokens can help the LLM understand the task goal and format (e.g., the task is question-answering rather than text continuation). (4) Based on LoRA’s checkpoint, SPRING+ achieves the best performance on most datasets. Additionally, all backbone models show improvements with SPRING (see the Appendix). These findings verify the versatility and flexibility of our approach, confirming its suitability for enhancing various LLMs in RAG scenarios. (5) Using manually-crafted prompts is effective for improving LLMs’ performance on RAG tasks. However, this improvement is limited as no training is involved. (6) SPRING achieves robust performance on the held-out datasets, validating the good generalizability of our method. (7) Interestingly, prefix-tuning does not perform well for RAG, highlighting that the insertion position of the virtual tokens in SPRING is both reasonable and effective. We provide more analysis in the Appendix.

Further Analysis

We further conduct a series of experiments to investigate the impact of different settings in SPRING. All the following experiments are conducted based on fine-tuning the Mistral-7b-instruct model.

Dataset n-shot LoRA SPRING Diff
BoolQ 0 79.30 82.97 3.67
CommonsenseQA 0 55.45 63.80 8.35
CommonsenseQA 4 59.87 67.07 7.20
GSM8K 8 17.33 31.89 14.56
MMLU 0 51.30 53.62 2.32
MMLU 5 48.76 54.96 6.20
Table 2: Performance comparison on other datasets.

Performance on Other Tasks

To examine the impact of different fine-tuning methods on the inherent capabilities of LLMs, we evaluate the performance of models fine-tuned by LoRA and SPRING on several other (non-RAG) tasks. These tasks are commonly used to evaluate LLMs’ reasoning, mathematical abilities, and world knowledge, including BoolQ (Clark et al. 2019), CommonsenseQA (Talmor et al. 2019), GSM8K (Cobbe et al. 2021), and MMLU (Hendrycks et al. 2021). The experimental results are shown in Table 2. (We notice that our results are quite different from those officially reported, which we attribute to the impact of different prompts. Since official prompts for testing are unavailable, we provide the prompts we used in the Appendix to facilitate reproducibility.) From the results, we can observe: (1) Thanks to the plug-and-play design of our method, SPRING can revert to the original LLM by simply not using the virtual tokens. Therefore, it successfully preserves the original capabilities of the LLMs. In contrast, LoRA, which adjusts the model’s parameters for RAG tasks, inevitably compromises the model’s performance on other tasks. (2) A noticeable decline is observed in the few-shot evaluation, reflecting a decrease in the in-context learning abilities of LLMs. This decline may stem from the fact that RAG fine-tuning does not incorporate in-context learning capabilities. Besides, fine-tuning for RAG tasks may lead the model to overfit to specific task formats (prompts), thereby impairing its general generation abilities (more empirical studies are detailed in the Appendix).

Figure 3: Average performance on nine QA datasets with various numbers of virtual tokens.

Impact of Token Quantity

In SPRING, we design a scalable training approach that enables the use of an arbitrary number of virtual tokens at inference time. To validate its effectiveness, we test the performance of our method with various numbers of virtual tokens and compare it with a variant model trained with a fixed number of tokens ($k=50$). The experimental results are illustrated in Figure 3. In general, we can observe that the performance of SPRING increases as more virtual tokens are used. Surprisingly, SPRING can significantly enhance LLMs’ performance in RAG scenarios with just a single token, which is very encouraging. (This varies across different LLMs; see the Appendix.) In comparison, training with a fixed number of tokens limits the flexibility of SPRING, meaning it can only be used with the same number of tokens at inference (i.e., $k=50$).

Prompt SPRING
Retriever EM F1 EM F1
BM25 21.23 54.94 30.94 62.73
BGE-base 23.07 56.12 31.81 63.46
E5-base 24.38 56.84 33.34 64.49
E5-large 25.66 57.70 34.35 65.00
Average 23.58 56.40 32.61 63.92
Variance 2.69 1.02 1.75 0.78
Table 3: Average performance on nine QA datasets with different retrievers.

Effects of Different Retrievers

In our experiments, SPRING is fine-tuned using passages retrieved by E5-large. To investigate its effectiveness with other retrievers, we conduct an experiment by testing its performance with passages retrieved by BM25, BGE-base, and E5-base. The results are presented in Table 3. First, SPRING achieves consistent improvement over the original model using a manually crafted prompt, thereby confirming the generalizability of our approach. Second, compared to the original model, the performance gap (variance) among different retrievers becomes smaller, highlighting SPRING’s robustness to variations in retrievers. Finally, even when fine-tuned with a superior retriever (i.e., E5-large), SPRING maintains strong performance with less effective retrievers (such as BM25). This indicates that our method can effectively adapt to varying quality of retrieved results. Hence, there is no need to retrain the virtual tokens with each update of the retriever in practical applications, significantly enhancing its applicability.

Figure 4: Average performance on nine QA datasets with different numbers of retrieved passages.

Influence of Retrieved Passages

During the fine-tuning of SPRING, we construct training samples by randomly selecting the top-$m$ ($m\in[1,5]$) retrieved passages. This aims to enhance SPRING’s adaptability by ensuring it can operate effectively with varying numbers of retrieved passages in real-world scenarios. To evaluate the effect of this training strategy, we test SPRING’s performance across a range from zero to five passages. Figure 4 illustrates the results. We find that SPRING’s performance gradually improves as more retrieved passages are used ($m=0\rightarrow 4$), suggesting that more retrieved passages contribute valuable knowledge for question answering. However, the performance peaks at four passages and declines when more passages are added. This decrease could be attributed to noise accumulation within the retrieved knowledge, a phenomenon also reported in recent studies (Yoran et al. 2023; Wu et al. 2024). Despite this, the use of retrieved passages still results in performance gains compared to scenarios without retrieval ($m=0$), highlighting again the benefits of RAG.

Training → TQA NQ Mix
Test ↓ EM F1 EM F1 EM F1
TQA 62.80 84.65 65.51 84.73 65.71 85.26
NQ 32.19 64.98 45.26 72.68 42.35 70.73
HQA 22.97 56.95 28.11 59.45 35.26 65.44
SQuAD 25.24 61.39 30.81 64.32 33.67 66.99
WebQ 26.03 62.40 34.13 67.30 31.84 64.78
2Wiki 17.32 54.41 25.43 58.21 31.80 62.03
CoQA 6.05 36.61 7.62 38.54 13.28 42.41
MARCO 3.26 31.88 3.88 33.33 6.57 53.48
PopQA 44.07 72.50 48.28 74.07 48.71 73.90
Average 26.66 58.42 32.11 61.40 34.35 65.00
Table 4: Performance comparison between training on a specific dataset or a mixture of all datasets.

Cross-Dataset Generalizability

Inspired by previous studies in multi-task learning (Raffel et al. 2020; Khashabi et al. 2020), we mix eight QA datasets for training as they require similar LLM capabilities (e.g., reasoning). To study the impact of this strategy, we conduct experiments by training SPRING on each dataset individually and then testing its performance on the others. Table 4 shows partial results, and more results are available in the Appendix. As indicated, training on a mixed dataset generally enhances performance on most datasets, thereby validating the benefits of multi-task learning. While training on a single dataset, such as NQ, may yield superior results on its specific test set, such improvements often fail to generalize to other datasets. Notably, training solely on NQ may negatively impact performance on MS MARCO, where the original LLM using a prompt could outperform it. These findings inspire us to carefully consider the interaction between different datasets when applying our method in future applications.

Conclusion

In this paper, we introduced scalable and pluggable virtual tokens for retrieval-augmented large language models. Our method, SPRING, serves as a parameter-efficient fine-tuning approach that significantly enhances RAG performance with the addition of only 0.2M trainable parameters. More importantly, the plug-and-play nature of our approach successfully preserves the performance of LLMs on non-RAG tasks, while its scalable training strategy broadens the method’s flexibility in application. Through extensive experiments across various datasets, we have demonstrated the effectiveness, generalizability, flexibility, and high efficiency of our method. We believe that our research will foster further integration of information retrieval and LLMs, and advance the development of other parameter-efficient fine-tuning technologies for LLMs.

Acknowledgment

This work was supported by National Natural Science Foundation of China (Grant No. 62402497 and 62272467), Beijing Natural Science Foundation L233008, and Beijing Municipal Science and Technology Project No. Z231100010323009. The work was partially done at the Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE.

References

  • Asai et al. (2024) Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; and Hajishirzi, H. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In ICLR.
  • Berant et al. (2013) Berant, J.; Chou, A.; Frostig, R.; and Liang, P. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP, 1533–1544.
  • Borgeaud et al. (2022) Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; van den Driessche, G.; Lespiau, J.; Damoc, B.; Clark, A.; de Las Casas, D.; Guy, A.; Menick, J.; Ring, R.; Hennigan, T.; Huang, S.; Maggiore, L.; Jones, C.; Cassirer, A.; Brock, A.; Paganini, M.; Irving, G.; Vinyals, O.; Osindero, S.; Simonyan, K.; Rae, J. W.; Elsen, E.; and Sifre, L. 2022. Improving Language Models by Retrieving from Trillions of Tokens. In ICML, 2206–2240.
  • Brown et al. (2020) Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. In NeurIPS.
  • Chen et al. (2020) Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. E. 2020. A Simple Framework for Contrastive Learning of Visual Representations. In ICML, 1597–1607.
  • Chen et al. (2023) Chen, Z.; Duan, Y.; Wang, W.; He, J.; Lu, T.; Dai, J.; and Qiao, Y. 2023. Vision Transformer Adapter for Dense Predictions. In ICLR.
  • Clark et al. (2019) Clark, C.; Lee, K.; Chang, M.; Kwiatkowski, T.; Collins, M.; and Toutanova, K. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In NAACL-HLT, 2924–2936.
  • Cobbe et al. (2021) Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; Hesse, C.; and Schulman, J. 2021. Training Verifiers to Solve Math Word Problems. CoRR.
  • Dettmers et al. (2023) Dettmers, T.; Pagnoni, A.; Holtzman, A.; and Zettlemoyer, L. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. In NeurIPS.
  • Devlin et al. (2019) Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, 4171–4186.
  • Dosovitskiy et al. (2021) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR.
  • Fang, Meng, and Macdonald (2024) Fang, J.; Meng, Z.; and Macdonald, C. 2024. TRACE the Evidence: Constructing Knowledge-Grounded Reasoning Chains for Retrieval-Augmented Generation. CoRR.
  • Feng et al. (2024) Feng, Z.; Feng, X.; Zhao, D.; Yang, M.; and Qin, B. 2024. Retrieval-Generation Synergy Augmented Large Language Models. In ICASSP, 11661–11665.
  • Gao et al. (2024) Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; and Qiao, Y. 2024. CLIP-Adapter: Better Vision-Language Models with Feature Adapters. Int. J. Comput. Vis., (2): 581–595.
  • Guu et al. (2020) Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; and Chang, M. 2020. Retrieval Augmented Language Model Pre-Training. In ICML, 3929–3938.
  • He et al. (2020) He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. B. 2020. Momentum Contrast for Unsupervised Visual Representation Learning. In CVPR, 9726–9735.
  • Hendrycks et al. (2021) Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; and Steinhardt, J. 2021. Measuring Massive Multitask Language Understanding. In ICLR.
  • Ho et al. (2020) Ho, X.; Nguyen, A. D.; Sugawara, S.; and Aizawa, A. 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In COLING, 6609–6625.
  • Houlsby et al. (2019) Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; de Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; and Gelly, S. 2019. Parameter-Efficient Transfer Learning for NLP. In ICML, 2790–2799.
  • Hu et al. (2022) Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In ICLR.
  • Hu et al. (2023) Hu, Z.; Wang, L.; Lan, Y.; Xu, W.; Lim, E.; Bing, L.; Xu, X.; Poria, S.; and Lee, R. K. 2023. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. In EMNLP, 5254–5276.
  • Ji et al. (2023) Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; and Fung, P. 2023. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv., (12): 248:1–248:38.
  • Jiang et al. (2023a) Jiang, H.; Wu, Q.; Luo, X.; Li, D.; Lin, C.; Yang, Y.; and Qiu, L. 2023a. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. CoRR.
  • Jiang et al. (2023b) Jiang, Z.; Xu, F. F.; Gao, L.; Sun, Z.; Liu, Q.; Dwivedi-Yu, J.; Yang, Y.; Callan, J.; and Neubig, G. 2023b. Active Retrieval Augmented Generation. In EMNLP, 7969–7992.
  • Jin et al. (2024a) Jin, J.; Zhu, Y.; Yang, X.; Zhang, C.; and Dou, Z. 2024a. FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research. CoRR.
  • Jin et al. (2024b) Jin, J.; Zhu, Y.; Zhou, Y.; and Dou, Z. 2024b. BIDER: Bridging Knowledge Inconsistency for Efficient Retrieval-Augmented LLMs via Key Supporting Evidence. CoRR.
  • Joshi et al. (2017) Joshi, M.; Choi, E.; Weld, D. S.; and Zettlemoyer, L. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In ACL, 1601–1611.
  • Kalyan et al. (2021) Kalyan, A.; Kumar, A.; Chandrasekaran, A.; Sabharwal, A.; and Clark, P. 2021. How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI. In EMNLP, 7318–7328.
  • Khashabi et al. (2020) Khashabi, D.; Min, S.; Khot, T.; Sabharwal, A.; Tafjord, O.; Clark, P.; and Hajishirzi, H. 2020. UnifiedQA: Crossing Format Boundaries With a Single QA System. In Findings of EMNLP, 1896–1907.
  • Kim et al. (2024) Kim, J.; Nam, J.; Mo, S.; Park, J.; Lee, S.; Seo, M.; Ha, J.; and Shin, J. 2024. SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs. In ICLR.
  • Kwiatkowski et al. (2019) Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A. P.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; Toutanova, K.; Jones, L.; Kelcey, M.; Chang, M.; Dai, A. M.; Uszkoreit, J.; Le, Q.; and Petrov, S. 2019. Natural Questions: a Benchmark for Question Answering Research. Trans. Assoc. Comput. Linguistics, 452–466.
  • Lester, Al-Rfou, and Constant (2021) Lester, B.; Al-Rfou, R.; and Constant, N. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In EMNLP, 3045–3059.
  • Lewis et al. (2020) Lewis, P. S. H.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; Riedel, S.; and Kiela, D. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS.
  • Li and Liang (2021) Li, X. L.; and Liang, P. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In ACL, 4582–4597.
  • Li et al. (2023) Li, Y.; Dong, B.; Guerin, F.; and Lin, C. 2023. Compressing Context to Enhance Inference Efficiency of Large Language Models. In EMNLP, 6342–6353.
  • Lin et al. (2023) Lin, X. V.; Chen, X.; Chen, M.; Shi, W.; Lomeli, M.; James, R.; Rodriguez, P.; Kahn, J.; Szilvasy, G.; Lewis, M.; Zettlemoyer, L.; and Yih, S. 2023. RA-DIT: Retrieval-Augmented Dual Instruction Tuning. CoRR.
  • Lin, Madotto, and Fung (2020) Lin, Z.; Madotto, A.; and Fung, P. 2020. Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning. In Findings of EMNLP, 441–459.
  • Liu et al. (2023) Liu, N. F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; and Liang, P. 2023. Lost in the Middle: How Language Models Use Long Contexts. CoRR.
  • Liu et al. (2021a) Liu, X.; Ji, K.; Fu, Y.; Du, Z.; Yang, Z.; and Tang, J. 2021a. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks. CoRR.
  • Liu et al. (2021b) Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; and Tang, J. 2021b. GPT Understands, Too. CoRR.
  • Mallen et al. (2023) Mallen, A.; Asai, A.; Zhong, V.; Das, R.; Khashabi, D.; and Hajishirzi, H. 2023. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. In ACL, 9802–9822.
  • Mangrulkar et al. (2022) Mangrulkar, S.; Gugger, S.; Debut, L.; Belkada, Y.; Paul, S.; and Bossan, B. 2022. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods.
  • Muennighoff et al. (2023) Muennighoff, N.; Tazi, N.; Magne, L.; and Reimers, N. 2023. MTEB: Massive Text Embedding Benchmark. In EACL, 2006–2029.
  • Nguyen et al. (2016) Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; Tiwary, S.; Majumder, R.; and Deng, L. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In NeurIPS.
  • OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. CoRR.
  • Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Köpf, A.; Yang, E. Z.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS, 8024–8035.
  • Petroni et al. (2021) Petroni, F.; Piktus, A.; Fan, A.; Lewis, P. S. H.; Yazdani, M.; Cao, N. D.; Thorne, J.; Jernite, Y.; Karpukhin, V.; Maillard, J.; Plachouras, V.; Rocktäschel, T.; and Riedel, S. 2021. KILT: a Benchmark for Knowledge Intensive Language Tasks. In NAACL-HLT, 2523–2544.
  • Press et al. (2023) Press, O.; Zhang, M.; Min, S.; Schmidt, L.; Smith, N. A.; and Lewis, M. 2023. Measuring and Narrowing the Compositionality Gap in Language Models. In Findings of EMNLP, 5687–5711.
  • Qin and Eisner (2021) Qin, G.; and Eisner, J. 2021. Learning How to Ask: Querying LMs with Mixtures of Soft Prompts. In NAACL-HLT, 5203–5212.
  • Radford et al. (2019) Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners.
  • Raffel et al. (2020) Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res., 140:1–140:67.
  • Rajpurkar et al. (2016) Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100, 000+ Questions for Machine Comprehension of Text. In EMNLP, 2383–2392.
  • Ram et al. (2023) Ram, O.; Levine, Y.; Dalmedigos, I.; Muhlgay, D.; Shashua, A.; Leyton-Brown, K.; and Shoham, Y. 2023. In-Context Retrieval-Augmented Language Models. CoRR.
  • Rebuffi, Bilen, and Vedaldi (2017) Rebuffi, S.; Bilen, H.; and Vedaldi, A. 2017. Learning multiple visual domains with residual adapters. In NeurIPS, 506–516.
  • Reddy, Chen, and Manning (2019) Reddy, S.; Chen, D.; and Manning, C. D. 2019. CoQA: A Conversational Question Answering Challenge. Trans. Assoc. Comput. Linguistics, 249–266.
  • Robertson and Zaragoza (2009) Robertson, S. E.; and Zaragoza, H. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr., (4): 333–389.
  • Shi et al. (2023) Shi, W.; Min, S.; Yasunaga, M.; Seo, M.; James, R.; Lewis, M.; Zettlemoyer, L.; and Yih, W. 2023. REPLUG: Retrieval-Augmented Black-Box Language Models. CoRR.
  • Talmor et al. (2019) Talmor, A.; Herzig, J.; Lourie, N.; and Berant, J. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In NAACL-HLT, 4149–4158.
  • Tan et al. (2024) Tan, J.; Dou, Z.; Zhu, Y.; Guo, P.; Fang, K.; and Wen, J. 2024. Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs. CoRR.
  • Touvron et al. (2023) Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR.
  • Trivedi et al. (2022) Trivedi, H.; Balasubramanian, N.; Khot, T.; and Sabharwal, A. 2022. MuSiQue: Multihop Questions via Single-hop Question Composition. Trans. Assoc. Comput. Linguistics, 539–554.
  • Trivedi et al. (2023) Trivedi, H.; Balasubramanian, N.; Khot, T.; and Sabharwal, A. 2023. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. In ACL, 10014–10037.
  • Wan et al. (2023) Wan, Z.; Wang, X.; Liu, C.; Alam, S.; Zheng, Y.; Liu, J.; Qu, Z.; Yan, S.; Zhu, Y.; Zhang, Q.; Chowdhury, M.; and Zhang, M. 2023. Efficient Large Language Models: A Survey. CoRR.
  • Wang et al. (2022) Wang, L.; Yang, N.; Huang, X.; Jiao, B.; Yang, L.; Jiang, D.; Majumder, R.; and Wei, F. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. CoRR.
  • Wang et al. (2023a) Wang, Y.; Li, P.; Sun, M.; and Liu, Y. 2023a. Self-Knowledge Guided Retrieval Augmentation for Large Language Models. In Findings of EMNLP, 10303–10315.
  • Wang et al. (2023b) Wang, Z.; Araki, J.; Jiang, Z.; Parvez, M. R.; and Neubig, G. 2023b. Learning to Filter Context for Retrieval-Augmented Generation. CoRR.
  • Wu et al. (2024) Wu, S.; Xie, J.; Chen, J.; Zhu, T.; Zhang, K.; and Xiao, Y. 2024. How Easily do Irrelevant Inputs Skew the Responses of Large Language Models? CoRR.
  • Xiao et al. (2023) Xiao, S.; Liu, Z.; Zhang, P.; and Muennighof, N. 2023. C-Pack: Packaged Resources To Advance General Chinese Embedding. CoRR.
  • Xu, Shi, and Choi (2024) Xu, F.; Shi, W.; and Choi, E. 2024. RECOMP: Improving Retrieval-Augmented LMs with Context Compression and Selective Augmentation. In ICLR.
  • Yang et al. (2018) Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W. W.; Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In EMNLP, 2369–2380.
  • Yoran et al. (2023) Yoran, O.; Wolfson, T.; Ram, O.; and Berant, J. 2023. Making Retrieval-Augmented Language Models Robust to Irrelevant Context. CoRR.
  • Yu et al. (2023) Yu, Z.; Xiong, C.; Yu, S.; and Liu, Z. 2023. Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In. In ACL, 2421–2436.
  • Zhang et al. (2023a) Zhang, R.; Han, J.; Zhou, A.; Hu, X.; Yan, S.; Lu, P.; Li, H.; Gao, P.; and Qiao, Y. 2023a. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. CoRR.
  • Zhang et al. (2023b) Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Zhang, Y.; Chen, Y.; Wang, L.; Luu, A. T.; Bi, W.; Shi, F.; and Shi, S. 2023b. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. CoRR.
  • Zhu et al. (2023) Zhu, Y.; Yuan, H.; Wang, S.; Liu, J.; Liu, W.; Deng, C.; Dou, Z.; and Wen, J. 2023. Large Language Models for Information Retrieval: A Survey. CoRR.
  • Zhu et al. (2024) Zhu, Y.; Zhang, P.; Zhang, C.; Chen, Y.; Xie, B.; Dou, Z.; Liu, Z.; and Wen, J. 2024. INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning. CoRR.
Figure 5: An example code snippet of using our SPRING in practice.

Example Code for Usage

We provide an example of how to use our SPRING in practice. During the initialization phase, SPRING merges the trained embeddings with the original embeddings of the LLM, and assigns these new embeddings to newly added special tokens. These special tokens are then inserted between the retrieved results and the user input for generation.
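A rough sketch of this usage is given below; it assumes the trained embeddings are stored as a tensor file (the file and model names are illustrative) and may differ from the released snippet shown in Figure 5.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1) Add special tokens [r1]..[r50] and map them to the trained SPRING embeddings.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

n = 50
special_tokens = [f"[r{i}]" for i in range(1, n + 1)]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
model.resize_token_embeddings(len(tokenizer))

trained = torch.load("spring_virtual_token_embeddings.pt")   # assumed file, shape [n, d]
token_ids = tokenizer.convert_tokens_to_ids(special_tokens)
with torch.no_grad():
    model.get_input_embeddings().weight[token_ids] = trained

# 2) Insert any first k special tokens between the retrieved passages and the question.
k = 10
retrieved = "Passage 1 ...\n\nPassage 2 ..."
question = "Who wrote The Old Man and the Sea?"
prompt = f"{retrieved}\n\n{' '.join(special_tokens[:k])}\n\n{question}"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```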

QA Dataset # Training # Test Average Input Length Average Output Length
Trivia QA 78,785 11,313 20.93 5.16
Natural Questions 79,168 3,610 11.31 4.66
Hotpot QA 90,447 7,405 26.13 4.38
SQuAD 1.1 87,599 10,570 13.89 5.56
Web Questions 3,778 2,032 10.04 4.10
2WikiMultiHopQA 15,000 12,576 19.33 4.10
CoQA 7,199 500 7.89 3.89
MS MARCO 80,000 55,636 7.76 20.75
PopQA - 14,267 9.24 3.48
BoolQ - 3,270 10.70 1.00
CommonsenseQA - 1,221 15.91 1.00
GSM8K - 1,319 64.90 3.29
MMLU - 14,042 63.91 9.62
Retrieval Set # Passages Average Length
Wikipedia Passages 21,015,324 169.57
MS MARCO Passages 8,841,823 95.73
Table 5: Statistics of all datasets. The input/output length is the number of tokens calculated by the Mistral tokenizer.

Datasets

We use the following datasets in our experiments, and their statistics are shown in Table 5.

(1) TriviaQA (Joshi et al. 2017) is a reading comprehension dataset including question-answering pairs created by trivia enthusiasts, accompanied by independently gathered evidence documents. The questions are relatively complex and compositional, which require advanced reasoning to derive answers. This dataset is licensed under Apache 2.0 License.

(2) Natural Questions (Kwiatkowski et al. 2019) is a question-answering dataset. The questions are collected from the Google search engine, with answers annotated based on the Wikipedia pages. All questions are from real users. This dataset is licensed under Apache 2.0 License.

(3) HotpotQA (Yang et al. 2018) is a question-answering dataset with natural and multi-hop questions. All question-answer pairs are Wikipedia-based, where the questions are diverse and include factoid comparison questions. This dataset is licensed under CC BY-SA 4.0 License.

(4) SQuAD 1.1 (Rajpurkar et al. 2016) is a reading comprehension dataset derived from Wikipedia articles. The questions are posed by crowd workers while the answers are text segments from the corresponding passages. This dataset is licensed under CC BY-SA 4.0 License.

(5) Web Questions (Berant et al. 2013) is a question-answering dataset using Freebase as the knowledge base. The questions are collected via the Google Suggest API, and the answers are defined as Freebase entities. This dataset is licensed under CC BY 4.0 License.

(6) 2WikiMultiHopQA (Ho et al. 2020) is a multi-hop question-answering dataset designed to test reasoning and inference skills. It introduces evidence information containing a reasoning path to ensure the quality and multi-hop nature of the questions. This dataset is licensed under Apache 2.0 License.

(7) CoQA (Reddy, Chen, and Manning 2019) is a conversational question-answering dataset, which is obtained from human conversations about text passages from various domains. To keep the completeness of the question, we only use the first-round questions and their corresponding answers. This dataset contains data from several sources, under the license of CC BY-SA 4.0, MSR-LA, Apache, and RACE-specific License.

(8) MS MARCO (Nguyen et al. 2016) is a question-answering dataset featuring real Bing questions and human-generated answers. Since MS MARCO is extremely large, to balance the data amount in each dataset, we randomly sample 80,000 examples from the training set for fine-tuning. Besides, we also filter out questions without answers in the test set for evaluation. This dataset is licensed under the MIT License.

(9) PopQA (Mallen et al. 2023) is an open-domain QA dataset with fine-grained and long-tail Wikidata entities, Wikipedia page views, and the information of relationship type. Since LLMs often struggle with less popular factual knowledge, retrieval augmentation is particularly beneficial for this dataset. This dataset is licensed under the MIT License.

(10) BoolQ (Clark et al. 2019) is a reading comprehension dataset, where each sample consists of a question, a passage excerpt, and a yes/no answer. The questions are often complex and require difficult entailment-like inference to solve. Therefore, it is often used to evaluate LLMs’ text comprehension capability. This dataset is licensed under CC BY-SA 3.0 License.

(11) CommonsenseQA (Talmor et al. 2019) is a commonsense question-answering dataset with multiple-choice questions designed to distinguish between related concepts. It is commonly used for evaluating LLMs’ commonsense reasoning ability. This dataset is licensed under the MIT License.

(12) GSM8K (Cobbe et al. 2021) is a dataset that contains high-quality linguistically diverse grade school math word problems to measure models’ ability of multi-step mathematical reasoning. This dataset is licensed under the MIT License.

(13) MMLU (Hendrycks et al. 2021) is a comprehensive aggregated benchmark. It contains 57 tasks to measure LLMs’ knowledge across different subjects. This dataset is licensed under the MIT License.

Figure 6: The prompt used in the baseline methods. “{x}” is the placeholder which will be replaced by the real data.
Figure 7: The prompt used in the BoolQ and CommonsenseQA datasets. “{x}” is the placeholder which will be replaced by the real data.
Figure 8: The prompt used in the GSM8K and MMLU datasets. “{x}” is the placeholder which will be replaced by the real data.

Prompt

Figure 6 illustrates the prompts used in the baseline methods. In the Concat method, we simply concatenate the retrieval results and the question for evaluation. In the Prompt method, we use a manually crafted prompt to instruct the model to incorporate the retrieved information.

Figure 7 and Figure 8 present the prompts used for the BoolQ, CommonsenseQA, GSM8K, and MMLU datasets. We follow the practices of lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) and adopt different evaluation strategies for these datasets. Specifically, for BoolQ, we compare the generation probabilities of the tokens “true” and “false”, and select the token with the higher probability as the prediction. For CommonsenseQA and MMLU, we determine the answer by comparing the generation probability of each option’s content and choosing the option with the highest probability. For GSM8K, we provide LLMs with chain-of-thought demonstrations and extract the answer from the generated text. It is worth noting that the choice of prompt significantly affects the performance of LLMs, so we select the most effective prompt for the vanilla model and apply it across the other methods for evaluation. We also observe that our results are worse than those officially reported, and we attribute the gap to the different prompts used for evaluation. Unfortunately, the official prompts are not publicly released, so we are limited to selecting the best-performing prompt from our experiments.
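For clarity, the snippet below sketches this likelihood-comparison protocol with a HuggingFace causal LM. It is an illustrative simplification rather than the exact lm-evaluation-harness implementation; the checkpoint name and the helper functions (e.g., continuation_logprob) are our own, and tokenization boundary effects are ignored.

```python
# Minimal sketch of likelihood-based answer selection (illustrative only,
# not the exact lm-evaluation-harness implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids.to(model.device)
    logprobs = model(full_ids).logits.log_softmax(dim=-1)
    # Score only the continuation tokens (logits at position t predict token t+1).
    cont_len = full_ids.shape[1] - prompt_ids.shape[1]
    target_ids = full_ids[0, -cont_len:]
    token_logprobs = logprobs[0, -cont_len - 1:-1].gather(-1, target_ids.unsqueeze(-1))
    return token_logprobs.sum().item()

# BoolQ: pick whichever of "true"/"false" is more likely given the prompt.
def answer_boolq(prompt: str) -> str:
    return max(["true", "false"], key=lambda c: continuation_logprob(prompt, " " + c))

# CommonsenseQA / MMLU: score each option's content and pick the most likely one.
def answer_multichoice(prompt: str, options: list[str]) -> str:
    return max(options, key=lambda o: continuation_logprob(prompt, " " + o))
```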

| Dataset | Mistral-7b-base, Prompt (EM / F1) | Mistral-7b-base, SPRING (EM / F1) | Mistral-7b-instruct, Prompt (EM / F1) | Mistral-7b-instruct, SPRING (EM / F1) |
| TQA | 0.00 / 17.03 | 68.54 / 86.89 | 57.79 / 80.33 | 65.71 / 85.26 |
| NQ | 0.08 / 7.47 | 46.90 / 73.28 | 28.99 / 58.72 | 42.35 / 70.73 |
| HQA | 0.00 / 12.22 | 39.78 / 68.48 | 26.36 / 56.15 | 35.26 / 65.44 |
| SQuAD | 0.01 / 12.04 | 37.13 / 69.02 | 23.92 / 57.66 | 33.67 / 66.99 |
| WebQ | 0.00 / 11.61 | 35.25 / 66.44 | 17.53 / 52.10 | 31.84 / 64.78 |
| 2Wiki | 0.00 / 14.48 | 35.42 / 64.13 | 22.64 / 55.58 | 31.80 / 62.03 |
| CoQA | 0.00 / 8.15 | 15.23 / 43.94 | 8.20 / 36.72 | 13.28 / 42.41 |
| MS MARCO | 0.08 / 21.21 | 7.59 / 53.90 | 5.73 / 53.81 | 6.57 / 53.48 |
| PopQA | 0.00 / 14.72 | 50.37 / 74.84 | 39.79 / 68.26 | 48.71 / 73.90 |
| Average | 0.02 / 13.21 | 37.36 / 66.77 | 25.66 / 57.70 | 34.35 / 65.00 |
Table 6: Performance of SPRING with Mistral-7b models.
| Dataset | LLaMA-2-7b-base, Prompt (EM / F1) | LLaMA-2-7b-base, SPRING (EM / F1) | LLaMA-2-7b-chat, Prompt (EM / F1) | LLaMA-2-7b-chat, SPRING (EM / F1) |
| TQA | 0.01 / 34.37 | 65.42 / 85.30 | 59.84 / 81.75 | 66.95 / 85.59 |
| NQ | 0.00 / 22.31 | 43.48 / 71.04 | 32.46 / 62.10 | 42.13 / 70.57 |
| HQA | 0.00 / 26.17 | 35.48 / 65.59 | 26.39 / 56.58 | 33.88 / 64.47 |
| SQuAD | 0.00 / 26.49 | 35.10 / 67.34 | 24.57 / 59.40 | 33.36 / 66.60 |
| WebQ | 0.00 / 23.76 | 31.20 / 63.73 | 20.85 / 57.22 | 31.10 / 64.31 |
| 2Wiki | 0.00 / 25.83 | 30.61 / 60.90 | 20.69 / 55.82 | 30.86 / 61.45 |
| CoQA | 0.00 / 16.08 | 13.67 / 42.73 | 8.59 / 38.56 | 14.06 / 43.99 |
| MS MARCO | 0.00 / 51.75 | 7.43 / 54.11 | 5.32 / 48.42 | 7.18 / 52.85 |
| PopQA | 0.00 / 32.17 | 49.93 / 74.41 | 45.08 / 72.23 | 49.00 / 73.98 |
| Average | 0.00 / 28.77 | 34.70 / 65.02 | 27.09 / 59.12 | 34.28 / 64.87 |
Table 7: Performance of SPRING with LLaMA-2-7b models.
| Dataset | LLaMA-2-13b-base, Prompt (EM / F1) | LLaMA-2-13b-base, SPRING (EM / F1) | LLaMA-2-13b-chat, Prompt (EM / F1) | LLaMA-2-13b-chat, SPRING (EM / F1) |
| TQA | 0.26 / 30.77 | 66.95 / 86.48 | 63.54 / 82.54 | 68.52 / 86.62 |
| NQ | 0.57 / 19.54 | 44.91 / 71.70 | 36.66 / 67.03 | 43.80 / 71.15 |
| HQA | 0.07 / 19.72 | 37.24 / 66.81 | 29.27 / 59.36 | 36.23 / 66.11 |
| SQuAD | 0.26 / 22.79 | 36.35 / 68.48 | 28.45 / 61.92 | 34.80 / 67.76 |
| WebQ | 0.24 / 20.77 | 32.32 / 64.54 | 20.61 / 55.75 | 31.01 / 64.17 |
| 2Wiki | 0.10 / 22.63 | 33.32 / 62.85 | 28.01 / 59.20 | 32.32 / 62.43 |
| CoQA | 0.39 / 14.85 | 14.84 / 44.00 | 9.96 / 39.27 | 14.26 / 44.49 |
| MS MARCO | 0.57 / 47.30 | 7.36 / 53.71 | 6.24 / 47.68 | 7.51 / 52.94 |
| PopQA | 0.13 / 24.06 | 50.08 / 74.72 | 42.49 / 69.67 | 50.29 / 74.84 |
| Average | 0.29 / 24.71 | 35.93 / 65.92 | 29.47 / 60.27 | 35.42 / 65.61 |
Table 8: Performance of SPRING with LLaMA-2-13b models.
| Test ↓ / Training → | TQA (EM / F1) | NQ (EM / F1) | HQA (EM / F1) | SQuAD (EM / F1) | Mix (EM / F1) |
| TQA | 62.80 / 84.65 | 65.51 / 84.73 | 66.42 / 85.26 | 67.48 / 85.76 | 65.71 / 85.26 |
| NQ | 32.19 / 64.98 | 45.26 / 72.68 | 40.79 / 70.18 | 37.88 / 67.23 | 42.35 / 70.73 |
| HQA | 22.97 / 56.95 | 28.11 / 59.45 | 36.30 / 66.33 | 30.43 / 61.46 | 35.26 / 65.44 |
| SQuAD | 25.24 / 61.39 | 30.81 / 64.32 | 33.37 / 66.36 | 35.01 / 67.83 | 33.67 / 66.99 |
| WebQ | 26.03 / 62.40 | 34.13 / 67.30 | 29.15 / 64.04 | 28.27 / 63.38 | 31.84 / 64.78 |
| 2Wiki | 17.32 / 54.41 | 25.43 / 58.21 | 30.42 / 61.62 | 25.48 / 58.26 | 31.80 / 62.03 |
| CoQA | 6.05 / 36.61 | 7.62 / 38.54 | 12.50 / 43.12 | 7.62 / 37.74 | 13.28 / 42.41 |
| MS MARCO | 3.26 / 31.88 | 3.88 / 33.33 | 5.18 / 36.12 | 3.89 / 45.80 | 6.57 / 53.48 |
| PopQA | 44.07 / 72.50 | 48.28 / 74.07 | 48.80 / 74.32 | 48.40 / 73.83 | 48.71 / 73.90 |
| Average | 26.66 / 58.42 | 32.11 / 61.40 | 33.66 / 63.04 | 31.61 / 62.37 | 34.35 / 65.00 |
Table 9: Full results of training on a specific dataset and a mixture of all datasets.

Full Experimental Results

In the main body of our paper, we present results using Mistral-7b-instruct as the backbone model. We also apply our method to Mistral-7b-base, LLaMA-2-7b-chat, LLaMA-2-7b-base, LLaMA-2-13b-base, and LLaMA-2-13b-chat; their results are shown in Table 6, Table 7, and Table 8. We observe that SPRING achieves consistent improvements for all models. Specifically, using prompts is very effective for instruction-tuned models (i.e., Mistral-instruct and LLaMA-chat) but less so for base models, because base models struggle to follow the instructions to read the retrieved results and provide correct answers. Mistral models perform slightly better than LLaMA models of the same size, which is consistent with their officially reported benchmark performance. Furthermore, larger models tend to perform better than smaller ones, and SPRING brings improvements across models of different sizes. In the future, we plan to extend our method to much larger models (such as 30B or 70B) when more computing resources become available.

| Dataset | [T,R,Q] (EM / F1) | SPRING, [R,T,Q] (EM / F1) | [R,Q,T] (EM / F1) |
| TQA | 11.74 / 59.97 | 65.71 / 85.26 | 64.90 / 84.65 |
| NQ | 13.04 / 38.22 | 42.35 / 70.73 | 43.16 / 71.30 |
| HQA | 5.79 / 42.59 | 35.26 / 65.44 | 34.00 / 64.27 |
| SQuAD | 7.19 / 39.75 | 33.67 / 66.99 | 34.09 / 66.75 |
| WebQ | 4.44 / 31.55 | 31.84 / 64.78 | 31.35 / 64.84 |
| 2Wiki | 4.38 / 41.82 | 31.80 / 62.03 | 30.64 / 61.01 |
| CoQA | 1.56 / 20.02 | 13.28 / 42.41 | 12.30 / 41.18 |
| MS MARCO | 0.60 / 50.56 | 6.57 / 53.48 | 6.94 / 49.40 |
| PopQA | 10.02 / 44.54 | 48.71 / 73.90 | 46.55 / 72.33 |
| Average | 6.53 / 41.00 | 34.35 / 65.00 | 33.77 / 63.97 |
Table 10: Performance of different virtual token insertion positions. The best results are in bold.

Impact of Token Insertion Position

In SPRING, virtual tokens (T) are strategically inserted between the retrieved results (R) and the question (Q). Generally, there are three possible insertion positions (a construction sketch is provided after this list):

  • [T,R,Q]: The virtual tokens are added at the beginning of the sequence, akin to prefix-tuning. In this setup, due to the auto-regressive nature of LLMs, these tokens cannot access subsequent sequence information.

  • [R,T,Q]: This is the placement used in SPRING. The virtual tokens can attend to the retrieved results, while the question remains immediately adjacent to the answer to be generated.

  • [R,Q,T]: The virtual tokens are added at the end of the sequence. They can attend to all preceding information, including both the retrieved results and the question.
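To make the three placements concrete, the snippet below sketches how the input embedding sequence can be assembled from the frozen backbone’s token embeddings and a small trainable virtual-token table. It is a minimal sketch under our own naming (e.g., virtual_embeds, num_virtual_tokens) and is not the official SPRING implementation.

```python
# Illustrative sketch of the three insertion positions, assuming a HuggingFace
# causal LM whose original weights stay frozen and a small trainable embedding
# table for the virtual tokens. All names here are ours, for illustration only.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
for p in model.parameters():
    p.requires_grad_(False)  # keep the backbone frozen

num_virtual_tokens = 50  # assumed token budget
hidden_size = model.config.hidden_size
virtual_embeds = nn.Parameter(torch.randn(num_virtual_tokens, hidden_size, dtype=model.dtype) * 0.02)

def embed(text: str) -> torch.Tensor:
    """Look up frozen input embeddings for a piece of text."""
    ids = tokenizer(text, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
    return model.get_input_embeddings()(ids)  # (1, len, hidden)

def build_inputs(retrieved: str, question: str, position: str) -> torch.Tensor:
    r, q = embed(retrieved), embed(question)
    t = virtual_embeds.unsqueeze(0).to(model.device)  # (1, k, hidden)
    if position == "[T,R,Q]":      # prefix-tuning-style placement
        parts = [t, r, q]
    elif position == "[R,T,Q]":    # SPRING's placement: between retrieval and question
        parts = [r, t, q]
    else:                          # "[R,Q,T]": appended at the end
        parts = [r, q, t]
    return torch.cat(parts, dim=1)

inputs_embeds = build_inputs("<retrieved passages>", "<user question>", "[R,T,Q]")
outputs = model(inputs_embeds=inputs_embeds)  # only `virtual_embeds` would receive gradients
```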

We compare the performance of these three strategies. The results are shown in Table 10. It is evident that SPRING achieves the best performance among the three strategies, which demonstrates the effectiveness of our token placement design. Placing virtual tokens at the beginning of the sequence (prefix-tuning) is the worst strategy, because the added tokens cannot adapt based on the retrieved results or the question during fine-tuning. This result is different from those reported in the original prefix-tuning study (Li and Liang 2021), which suggested that prefix-tuning can yield better performance. We attribute this discrepancy to the different targets of general generation tasks versus RAG tasks. In general generation, all contexts are crucial for predicting the next token, hence adding trainable tokens at the beginning can effectively influence the activation of contexts. In contrast, in RAG tasks, the retrieved information is supplementary to answering the user’s question, so the virtual tokens are added to help LLMs understand and leverage the retrieved information rather than enhance the retrieved information itself. This speculation is further validated by the strong performance achieved by adding tokens at the end of the sequence, where they can also access the retrieved results. Among the three strategies, SPRING is the most effective, not only allowing the virtual tokens to utilize the retrieved information optimally but also maintaining coherence between the question and the answer being generated.

Prompt 1: According to the previous relevant passages, please answer the following question. Only return the answer without any other words.\n
Prompt 2: According to the previous relevant passages, please answer the following question.\n
Prompt 3: Answer the following question based on the provided passages.\n
Prompt 4: (None)
Table 11: The prompts used for exploring their impact on inference.
Figure 9: Performance under various prompts.

Impact of Prompt after Fine-tuning

In our experiments, we notice that models fine-tuned by certain methods (such as LoRA) are very sensitive to the prompts used for inference. To investigate this problem, we conduct an empirical study. We mainly focus on the impact of the prompt P inserted between the retrieved results and the question, formatted as [R; P; Q]. We consider the four prompts listed in Table 11. Specifically, for LoRA, we use the first prompt during training and test its performance with all four prompts. SPRING does not need a prompt during training, which can be seen as using the fourth prompt (none); we also evaluate its performance with all four prompts.
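For reference, the small sketch below shows how the four prompts from Table 11 can be spliced into the [R; P; Q] input at inference time. The surrounding whitespace, and where SPRING’s virtual tokens would additionally be inserted, are assumptions for illustration rather than the exact setup.

```python
# Rendering the same (R, Q) pair under the four prompt variants from Table 11.
PROMPTS = {
    1: "According to the previous relevant passages, please answer the following question. Only return the answer without any other words.\n",
    2: "According to the previous relevant passages, please answer the following question.\n",
    3: "Answer the following question based on the provided passages.\n",
    4: "",  # no prompt; the setting SPRING is trained with
}

def build_rag_input(retrieved: str, question: str, prompt_id: int) -> str:
    """Concatenate retrieved passages, an optional instruction prompt, and the question."""
    return f"{retrieved}\n{PROMPTS[prompt_id]}{question}\n"

variants = [build_rag_input("<retrieved passages>", "<question>", i) for i in PROMPTS]
```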

The experimental results are shown in Figure 9. They clearly reflect the prompt dependency of models fine-tuned with LoRA. LoRA achieves its highest EM score of 37.13 with the first prompt, which is the one used during its training. The performance remains robust with the second prompt, which is very similar to the first but omits the last sentence. However, the third prompt, while semantically similar to the first, uses different phrasing and significantly degrades LoRA’s performance. Notably, when no prompt is used (the fourth prompt), LoRA’s performance drops sharply, falling even below that of the original model. These results indicate that fine-tuning with LoRA may lead to overfitting, which may also explain why models tuned with LoRA cannot perform well on non-RAG tasks. In comparison, our SPRING maintains consistent performance across all prompts, demonstrating its robustness and generalizability.

Figure 10: Average performance on nine QA datasets with various numbers of virtual tokens.

Impact of Token Quantity

In the main body of our paper, we explore the impact of token quantity based on the Mistral-7b-instruct model. Since this impact may vary across models, we extend our analysis to the LLaMA-2-7b-chat model. The results are shown in Figure 10. The general trend is similar to that observed with the Mistral model: SPRING’s performance gradually improves as more virtual tokens are used. However, LLaMA requires at least three virtual tokens to surpass the original model with prompts. This indicates that different models may require different numbers of virtual tokens to effectively learn and perform RAG tasks.
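As a rough illustration, the snippet below shows how such a sweep could be run by reusing the virtual-token table from the earlier sketch and slicing its first k rows; this slicing assumes the scalable training strategy described in the main paper and is not the official implementation.

```python
# Illustrative sweep over the number of virtual tokens used at inference time.
# Reuses the `virtual_embeds` table and the [R, T, Q] concatenation from the
# earlier sketch; taking only the first k rows is an assumption tied to the
# scalable training strategy described in the main paper.
import torch

def inputs_with_k_virtual_tokens(r_embeds: torch.Tensor, q_embeds: torch.Tensor,
                                 virtual_embeds: torch.Tensor, k: int) -> torch.Tensor:
    """Build [R, T_1..k, Q] input embeddings using only the first k virtual tokens."""
    t = virtual_embeds[:k].unsqueeze(0).to(r_embeds.device, dtype=r_embeds.dtype)
    return torch.cat([r_embeds, t, q_embeds], dim=1)

# Sweeping k (e.g., 1 to 50) and recording EM/F1 per k reproduces the trend in Figure 10.
```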

| Method | NQ (EM) | TriviaQA (EM) | HotpotQA (F1) | 2Wiki (F1) | PopQA (F1) | WebQA (EM) |
| Naive Generation | 22.6 | 55.7 | 28.4 | 33.9 | 21.7 | 18.8 |
| Standard RAG | 35.1 | 58.9 | 35.3 | 21.0 | 36.7 | 15.7 |
| AAR (Yu et al. 2023) | 30.1 | 56.8 | 33.4 | 19.8 | 36.1 | 16.1 |
| LongLLMLingua (Jiang et al. 2023a) | 32.2 | 59.2 | 37.5 | 25.0 | 38.7 | 17.5 |
| RECOMP (Xu, Shi, and Choi 2024) | 33.1 | 56.4 | 37.5 | 32.4 | 39.9 | 20.2 |
| Selective-Context (Li et al. 2023) | 30.5 | 55.6 | 34.4 | 18.5 | 33.5 | 17.3 |
| Trace (Fang, Meng, and Macdonald 2024) | 30.7 | 50.2 | 34.0 | 15.5 | 37.4 | 19.9 |
| SuRe (Kim et al. 2024) | 37.1 | 53.2 | 33.4 | 20.6 | 48.1 | 24.2 |
| REPLUG (Shi et al. 2023) | 28.9 | 57.7 | 31.2 | 21.1 | 27.8 | 20.2 |
| SKR (Wang et al. 2023a) | 33.2 | 56.0 | 32.4 | 23.4 | 31.7 | 17.0 |
| Self-RAG (Asai et al. 2024) | 36.4 | 38.2 | 29.6 | 25.1 | 32.7 | 21.9 |
| FLARE (Jiang et al. 2023b) | 22.5 | 55.8 | 28.0 | 33.9 | 20.7 | 20.2 |
| ITRG (Feng et al. 2024) | 36.8 | 60.1 | 38.3 | 21.6 | 37.9 | 18.2 |
| IRCoT (Trivedi et al. 2023) | 33.3 | 56.9 | 41.5 | 32.4 | 45.6 | 20.7 |
| SPRING | 37.9 | 64.6 | 42.6 | 37.3 | 54.8 | 27.7 |
Table 12: Comparison with other RAG methods. All experimental results are cited from FlashRAG (Jin et al. 2024a). Only SPRING uses LLaMA-2-7b as the backbone model, while the other methods use LLaMA-3-8b. Note that, because the retrieval corpus and experimental settings used in FlashRAG differ from those in our experiments, the results here also differ from those reported in our paper.

Comparison with Other RAG Methods

We further evaluate SPRING against other RAG methods. The experimental results, provided by FlashRAG (Jin et al. 2024a), are shown in Table 12. SPRING outperforms all other methods under the same settings. It is worth noting that only SPRING uses LLaMA-2-7b as the backbone model, whereas the other methods employ the more advanced LLaMA-3-8b, which further highlights the superiority of our method. Essentially, most existing methods rely primarily on complex prompt designs without fine-tuning, so they cannot achieve performance comparable to SPRING.