
KnowTuning: Knowledge-aware Fine-tuning for Large Language Models

Yougang Lyu1,3   Lingyong Yan2   Shuaiqiang Wang2  Haibo Shi2  Dawei Yin2
Pengjie Ren1  Zhumin Chen1  Maarten de Rijke3   Zhaochun Ren4
1Shandong University, Qingdao, China   2Baidu Inc., Beijing, China
3University of Amsterdam, Amsterdam, The Netherlands
4Leiden University, Leiden, The Netherlands
[email protected], {yanlingyong, wangshuaiqiang}@baidu.com
[email protected], [email protected], [email protected]
[email protected], [email protected], [email protected]
 Corresponding author.
Abstract

Despite their success at many natural language processing (NLP) tasks, large language models (LLMs) still struggle to effectively leverage knowledge for knowledge-intensive tasks, manifesting limitations such as generating incomplete, non-factual, or illogical answers. These limitations stem from inadequate knowledge awareness of LLMs during vanilla fine-tuning. To address these problems, we propose a knowledge-aware fine-tuning (KnowTuning) method to improve the fine-grained and coarse-grained knowledge awareness of LLMs. We devise a fine-grained knowledge augmentation stage to train LLMs to identify difficult fine-grained knowledge in answers. We also propose a coarse-grained knowledge comparison stage to train LLMs to distinguish between reliable and unreliable knowledge, in three aspects: completeness, factuality, and logicality. Extensive experiments on both generic and medical question answering (QA) datasets confirm the effectiveness of KnowTuning, through automatic and human evaluations, across various sizes of LLMs. We further verify that KnowTuning generates more facts with a lower factual error rate under fine-grained facts evaluation.


1 Introduction

Large language models (LLMs) have become a default solution for many natural language processing (NLP) scenarios, including the question answering (QA) task Brown et al. (2020); Ouyang et al. (2022); Qin et al. (2023). To achieve strong performance, most LLMs first accumulate substantial knowledge by pre-training on extensive datasets Jiang et al. (2023); Touvron et al. (2023). Then, in the supervised fine-tuning (SFT) stage, these LLMs further learn downstream domain knowledge and how to exploit the corresponding knowledge to answer diverse questions Wei et al. (2022); Chung et al. (2022); Wang et al. (2023f); Peng et al. (2023); Kang et al. (2023); Wang et al. (2023c).

Figure 1: Illustrations of vanilla fine-tuned LLMs lacking knowledge awareness. (a) Fine-grained knowledge awareness: vanilla fine-tuned LLMs struggle to identify the fine-grained knowledge needed to answer a specific question precisely. (b) Coarse-grained knowledge awareness: vanilla fine-tuned LLMs cannot effectively distinguish between reliable and unreliable knowledge in answers.

However, fine-tuned LLMs often struggle to effectively leverage knowledge for complex knowledge-intensive question answering Yu et al. (2023a); Bai et al. (2023); Chen et al. (2023b); Chang et al. (2023). Concretely, many recent studies indicate that LLMs are susceptible to generating incomplete answers, offering incomprehensive and insufficient knowledge Singhal et al. (2022); Bian et al. (2024); Xu et al. (2023a); non-factual answers, delivering factually incorrect knowledge Wang et al. (2023a); Min et al. (2023); Wang et al. (2023b); or illogical answers, providing incoherent and poorly structured knowledge Chen et al. (2023b); Zhong et al. (2023); Kang et al. (2023). Although the recent FactTune method Tian et al. (2023) improves the factuality of answers by increasing the proportion of correct facts, it ignores other critical aspects, such as completeness Min et al. (2023) and logicality Xu et al. (2023a).

We hypothesize that these limitations of LLMs arise from insufficient fine-grained and coarse-grained knowledge awareness during vanilla fine-tuning Bian et al. (2024); Ji et al. (2023); Dou et al. (2023); Hua et al. (2024). On the one hand, as illustrated in Figure 1, at the fine-grained level, vanilla fine-tuned LLMs face difficulties in identifying detailed atomic knowledge within the answer, leading to inadequate awareness of fine-grained knowledge. On the other hand, at the coarse-grained level, LLMs frequently fail to distinguish between reliable and unreliable knowledge in answers, indicating a lack of coarse-grained knowledge awareness. Consequently, there is a pressing need for designing knowledge-aware fine-tuning methods. This leads to our central research question: how can we effectively improve both the fine-grained and coarse-grained knowledge awareness of LLMs to address complex knowledge-intensive tasks?

To this end, we propose a novel knowledge-aware fine-tuning method, named KnowTuning, which aims to improve the fine-grained and coarse-grained knowledge awareness of LLMs. KnowTuning consists of two stages: (i) fine-grained knowledge augmentation, and (ii) coarse-grained knowledge comparison. In the first stage, we filter difficult atomic knowledge with high perplexity from original answers, and rewrite fine-grained QA pairs based on the filtered knowledge. We then use both the original and fine-grained QA pairs to train LLMs. In the second stage, we adopt several knowledge-disturbing techniques to construct coarse-grained knowledge comparison sets along three dimensions: completeness, factuality, and logicality. Specifically, we generate answers that are worse in terms of completeness, factuality, or logicality by deleting, revising, and shuffling the atomic knowledge. In addition, we rephrase original answers based on the atomic knowledge to prevent overfitting. Finally, we combine the rephrased answers and the answers with worse completeness, factuality, and logicality into our knowledge comparison sets, and adopt direct preference optimization (DPO) Rafailov et al. (2023) to optimize LLMs on these coarse-grained knowledge comparison sets.

We conduct experiments on a generic QA dataset and a medical QA dataset using automatic and human evaluations. Experimental results demonstrate the effectiveness of KnowTuning in terms of completeness, factuality, and logicality across various sizes of LLMs. Furthermore, we demonstrate that KnowTuning not only generates more facts but also reduces the factual error rate under fine-grained facts evaluation.

In summary, our main contributions are:

  • We focus on systematically enhancing the knowledge awareness of LLMs at both fine-grained and coarse-grained levels to address complex knowledge-intensive tasks.

  • We introduce KnowTuning, a novel method that fine-tunes LLMs through fine-grained knowledge augmentation and coarse-grained knowledge comparison to improve their knowledge awareness at both levels.

  • We demonstrate the effectiveness of KnowTuning on generic and medical domain QA datasets through automatic and human evaluations, across various sizes of LLMs. Furthermore, KnowTuning generates more facts with a lower factual error rate under fine-grained facts evaluation. The code is available at https://github.com/youganglyu/KnowTuning

Figure 2: Overview of KnowTuning. KnowTuning leverages fine-grained knowledge augmentation and coarse-grained knowledge comparison to improve the knowledge awareness of LLMs.

2 Related Work

2.1 LLMs for Knowledge-intensive Tasks

LLMs have been applied to various knowledge-intensive tasks Moiseev et al. (2022); Yu et al. (2023b); Khattab et al. (2022); Tian et al. (2023); Zhang et al. (2023a); Xu et al. (2023b); Mishra et al. (2023); Nguyen et al. (2023); Zhang et al. (2024). Previous work mainly focuses on knowledge-intensive tasks with short-form answers. Liu et al. (2022b) use few-shot demonstrations to elicit relevant knowledge statements from LLMs for QA tasks. Liu et al. (2022a) train a neural model to generate relevant knowledge through reinforcement learning for QA tasks. Liu et al. (2023a) propose a unified model for generating relevant knowledge and solving QA tasks.

However, these methods primarily address multiple-choice QA, rather than the more complex open-ended knowledge-intensive QA tasks Krishna et al. (2021); Kadavath et al. (2022); Liu et al. (2022a, 2023a); Kang et al. (2023), which aim to solve questions that require detailed explanations and extensive domain knowledge. Recent research indicates that LLMs face challenges in tackling complex knowledge-intensive QA tasks Yu et al. (2023a); Bai et al. (2023); Chang et al. (2023). In particular, they are prone to generating responses that are non-factual Lee et al. (2022); Sun et al. (2023); Su et al. (2022), incomplete Singhal et al. (2022); Bian et al. (2024), or illogical Chen et al. (2023b); Zhong et al. (2023). Recently, for open-ended knowledge-intensive tasks, Tian et al. (2023) propose FactTune to improve factuality. Specifically, they first automatically evaluate the proportion of correct facts in candidate answers as factuality scores, and then fine-tune LLMs to increase the likelihood of generating answers with higher factuality scores. In contrast, we focus on improving the knowledge awareness of LLMs in multiple essential aspects simultaneously, for solving complex knowledge-intensive QA tasks.

2.2 Fine-tuning for LLMs

Fine-tuning optimizes pre-trained LLMs to further learn downstream domain knowledge and how to exploit the corresponding knowledge to answer diverse questions Brown et al. (2020); Ouyang et al. (2022). Previously, fine-tuning mainly focused on enhancing the general-purpose QA abilities of LLMs Wang et al. (2022); Wei et al. (2022); Longpre et al. (2023). These approaches mainly adopt human-annotated datasets to build the QA dataset. Recently, an alternative strategy generates QA datasets by using advanced LLMs to create answers to a variety of questions Wang et al. (2023f); Shumailov et al. (2023).

Another line of fine-tuning methods fuses information about the quality of the generated answers into the supervision signals Zhao et al. (2023); Guo et al. (2023); Wang et al. (2023d); Dong et al. (2023); Chen et al. (2024); Zhao et al. (2024). Rafailov et al. (2023) propose direct preference optimization (DPO) to directly optimize LLMs on pair-wise comparison sets. Song et al. (2023) propose preference ranking optimization (PRO) to fine-tune LLMs on list-wise comparison sets. Yuan et al. (2023) propose a margin-rank loss to optimize LLMs on comparison sets. Since collecting large-scale human judgments of the quality of generated answers is expensive, Bai et al. (2022) and Lee et al. (2023) propose reinforcement learning from AI feedback (RLAIF) methods that leverage off-the-shelf LLMs to annotate general helpfulness scores. In contrast, our work focuses on enhancing the fine-grained and coarse-grained knowledge awareness of LLMs to improve performance in terms of completeness, factuality, and logicality simultaneously.

3 Method

In this section, we detail the KnowTuning method. First, we introduce preliminaries. Then we describe the fine-grained knowledge augmentation stage and the coarse-grained knowledge comparison stage. Finally, we explain the training process of KnowTuning.

3.1 Preliminaries

Supervised fine-tuning. Supervised fine-tuning (SFT) aims to train pre-trained LLMs to understand and answer natural language questions. Formally, we are given a QA dataset $\mathcal{D}=\{(q_i,a_i)\}_{i=1}^{N}$, where $q_i$ and $a_i$ denote a question and its corresponding answer. The training objective of SFT is to minimize the following loss:

\mathcal{L}_{\mathrm{sft}} = -\sum_{j=1}^{|a_i|} \log P_{\pi_{\mathrm{sft}}}(a_{i,j} \mid a_{i,<j}, q_i),    (1)

where $a_{i,j}$ denotes the $j$-th token of $a_i$.
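To make Eq. 1 concrete, the following is a minimal PyTorch sketch of the SFT objective for a single QA pair; the helper name and the use of -100 to mask question tokens are our illustrative choices, not details specified in the paper.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, question_ids: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the answer tokens given the question (Eq. 1)."""
    input_ids = torch.cat([question_ids, answer_ids], dim=-1).unsqueeze(0)
    labels = input_ids.clone()
    labels[:, : question_ids.size(-1)] = -100  # only answer tokens contribute to the loss
    logits = model(input_ids).logits
    # Shift so that the logits at position t predict the token at position t+1.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```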

Atomic knowledge. Since individual facts can well cover the knowledge in answers Nenkova and Passonneau (2004); Zhang and Bansal (2021); Liu et al. (2023b); Min et al. (2023); Wei et al. (2024), we break an answer into individual facts as atomic knowledge. Each piece of atomic knowledge is a short statement conveying a single fact, a more fine-grained unit than a sentence. Specifically, we extract the atomic knowledge set $\mathcal{K}_i$ from the original answer $a_i$ as follows:

\mathcal{K}_i = \{k_i^j\}_{j=1}^{|\mathcal{K}_i|} = \operatorname{Extract}(a_i),    (2)

where $\operatorname{Extract}(\cdot)$ is implemented by prompting OpenAI models to extract atomic knowledge, following Min et al. (2023).
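The paper implements $\operatorname{Extract}(\cdot)$ by prompting gpt-3.5-turbo (the actual prompt appears in Figure 5). Below is a hedged sketch of how such a call could look with the OpenAI Python client; the prompt wording and function name are ours, not the paper's.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt; the paper's actual prompt is given in Figure 5.
EXTRACT_PROMPT = (
    "Break the following answer into independent atomic facts, one per line, "
    "each conveying exactly one piece of information.\n\nAnswer: {answer}"
)

def extract_atomic_knowledge(answer: str) -> list[str]:
    """Extract(a_i): split an answer into a set of atomic facts (Eq. 2)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": EXTRACT_PROMPT.format(answer=answer)}],
        temperature=0.0,
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("- ").strip() for line in lines if line.strip()]
```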

3.2 Fine-grained Knowledge Augmentation

As illustrated in Figure 2, to improve the fine-grained knowledge awareness of LLMs, we filter atomic knowledge that is difficult for LLMs, and rewrite fine-grained QA pairs based on the difficult knowledge. We then use both the original and fine-grained QA pairs to train LLMs. To filter the difficult atomic knowledge, we first compute the generation perplexity $ppl_i^j$ of each atomic knowledge $k_i^j$ conditioned on $q_i$ as follows:

ppl_i^j = \sqrt[|k_i^j|]{\prod_{m=1}^{|k_i^j|} \frac{1}{P_{\pi_{\mathrm{sft}}}(k_{i,m}^j \mid k_{i,<m}^j, q_i)}}.    (3)

Since high perplexity $ppl$ indicates a lack of knowledge awareness of LLMs on specific atomic knowledge, we select $\alpha$ percent of the atomic knowledge set $\mathcal{K}_i$ in descending order of perplexity to form the difficult knowledge set $\mathcal{K}_i^*$. Then, we rewrite the question $q_i$ as a fine-grained question $q_i^*$ relevant to the difficult knowledge $\mathcal{K}_i^*$, as follows:

q_i^* = \operatorname{Rewrite}(q_i, \mathcal{K}_i^*),    (4)

where $\operatorname{Rewrite}(\cdot)$ is implemented by prompting OpenAI models. In addition, we rewrite the answer based on the difficult knowledge set as the fine-grained answer:

a_i^* = \operatorname{Rewrite}(\mathcal{K}_i^*).    (5)

Finally, we combine the original QA dataset $\mathcal{D}$ and the fine-grained QA pairs into the fine-grained knowledge augmentation dataset $\mathcal{D}_{ka}$:

\mathcal{D}_{ka} = \mathcal{D} \cup \{(q_i^*, a_i^*)\}_{i=1}^{N}.    (6)
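The following sketch outlines the difficulty filtering of Eq. 3 with a Hugging Face causal LM; all names are illustrative, and the $\operatorname{Rewrite}(\cdot)$ calls of Eqs. 4–5 would be prompted-LLM calls analogous to the extraction sketch above.

```python
import math
import torch

@torch.no_grad()
def knowledge_perplexity(model, tokenizer, question: str, fact: str) -> float:
    """Perplexity of one atomic fact conditioned on the question (Eq. 3)."""
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    k_ids = tokenizer(fact, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([q_ids, k_ids], dim=-1)
    log_probs = model(input_ids).logits[:, :-1, :].log_softmax(-1)
    # Logits at position t predict the token at position t+1; keep fact tokens only.
    positions = range(q_ids.size(-1) - 1, input_ids.size(-1) - 1)
    token_lp = [log_probs[0, t, input_ids[0, t + 1]].item() for t in positions]
    return math.exp(-sum(token_lp) / len(token_lp))

def difficult_knowledge(model, tokenizer, question, facts, alpha=0.5):
    """Select the alpha fraction of facts with the highest perplexity (the set K*)."""
    ranked = sorted(
        facts,
        key=lambda k: knowledge_perplexity(model, tokenizer, question, k),
        reverse=True,
    )
    return ranked[: max(1, int(alpha * len(ranked)))]
```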

3.3 Coarse-grained Knowledge Comparison

To improve coarse-grained knowledge awareness of LLMs in terms of completeness, factuality and logicality, we construct three comparison sets by deleting, revising, and shuffling atomic knowledge.

Knowledge completeness comparison. To improve the knowledge completeness awareness of LLMs, we construct the knowledge completeness comparison set by randomly deleting atomic knowledge. Specifically, we first randomly delete atomic knowledge $k$ from the atomic knowledge set $\mathcal{K}_i$ to form an incomplete knowledge set:

\mathcal{K}_i^c = \operatorname{Delete}(\mathcal{K}_i),    (7)

where $\operatorname{Delete}(\cdot)$ randomly deletes $\beta$ percent of the atomic knowledge $k$. Then, we concatenate the leftover atomic knowledge of the incomplete knowledge set into an incomplete answer:

a_i^c = \operatorname{Concat}(\mathcal{K}_i^c).    (8)

In addition, to avoid overfitting on the original answers Jain et al. (2023), we rephrase the original answers based on the original atomic knowledge set as:

a_i^r = \operatorname{Rewrite}(\mathcal{K}_i).    (9)

Finally, we combine the rephrased answer $a_i^r$ and the incomplete answer $a_i^c$ into the knowledge completeness comparison set as follows:

\mathcal{D}_{kcc} = \{(q_i, (a_i^r, a_i^c))\}_{i=1}^{N}.    (10)

Knowledge factuality comparison. To improve the knowledge factuality awareness of LLMs, we construct the knowledge factuality comparison set by revising atomic knowledge into non-factual atomic knowledge. Specifically, we first revise the atomic knowledge set $\mathcal{K}_i$ as follows:

\mathcal{K}_i^f = \operatorname{Revise}(\mathcal{K}_i),    (11)

where $\operatorname{Revise}(\cdot)$ is implemented by prompting OpenAI models to revise atomic knowledge into incorrect atomic knowledge. Then, we concatenate all atomic knowledge in the non-factual knowledge set as:

a_i^f = \operatorname{Concat}(\mathcal{K}_i^f).    (12)

Finally, we combine the rephrased answer $a_i^r$ and the non-factual answer $a_i^f$ into the knowledge factuality comparison set as follows:

\mathcal{D}_{kfc} = \{(q_i, (a_i^r, a_i^f))\}_{i=1}^{N}.    (13)

Knowledge logicality comparison. To improve the knowledge logicality awareness of LLMs, we construct the knowledge logicality comparison set by randomly shuffling atomic knowledge. Specifically, we first randomly shuffle all atomic knowledge in the atomic knowledge set $\mathcal{K}_i$ to form an illogical knowledge set:

\mathcal{K}_i^l = \operatorname{Shuffle}(\mathcal{K}_i),    (14)

where $\operatorname{Shuffle}(\cdot)$ is implemented by shuffling the order of all atomic knowledge $k$ in the atomic knowledge set $\mathcal{K}_i$. Then, we follow the shuffled order to concatenate all atomic knowledge in the illogical knowledge set into an illogical answer:

a_i^l = \operatorname{Concat}(\mathcal{K}_i^l).    (15)

Next, we combine the rephrased answer $a_i^r$ and the illogical answer $a_i^l$ into the knowledge logicality comparison set as follows:

\mathcal{D}_{klc} = \{(q_i, (a_i^r, a_i^l))\}_{i=1}^{N}.    (16)

Finally, we combine the knowledge completeness comparison set, the knowledge factuality comparison set, and the knowledge logicality comparison set as the coarse-grained knowledge comparison set:

\mathcal{D}_{kc} = \mathcal{D}_{kcc} \cup \mathcal{D}_{kfc} \cup \mathcal{D}_{klc}.    (17)
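A minimal sketch of how the three comparison sets of Eqs. 7–17 could be assembled; rewrite_fn and revise_fn stand in for the paper's prompted OpenAI calls (Figures 7–8), and all names are illustrative.

```python
import random

def build_comparison_pairs(question, facts, rewrite_fn, revise_fn, beta=0.5):
    """Return (question, better, worse) pairs for D_kcc, D_kfc, and D_klc."""
    better = rewrite_fn(facts)  # a^r: rephrased answer from the full fact set (Eq. 9)

    # Completeness: delete beta percent of the atomic facts, keeping order (Eqs. 7-8).
    keep_n = max(1, round((1 - beta) * len(facts)))
    dropped = set(random.sample(range(len(facts)), len(facts) - keep_n))
    incomplete = " ".join(f for i, f in enumerate(facts) if i not in dropped)

    # Factuality: revise the facts into non-factual statements (Eqs. 11-12).
    nonfactual = " ".join(revise_fn(facts))

    # Logicality: shuffle the order of the atomic facts (Eqs. 14-15).
    shuffled = list(facts)
    random.shuffle(shuffled)
    illogical = " ".join(shuffled)

    return {
        "kcc": (question, better, incomplete),
        "kfc": (question, better, nonfactual),
        "klc": (question, better, illogical),
    }
```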

3.4 Training

To improve the knowledge awareness of LLMs for solving complex knowledge-intensive tasks, KnowTuning includes fine-grained knowledge augmentation training and coarse-grained knowledge comparison training. Specifically, we first train LLMs on the fine-grained knowledge augmentation dataset $\mathcal{D}_{ka}$, resulting in a model denoted as $\pi_{ka}$. To improve the coarse-grained knowledge awareness of $\pi_{ka}$, we rewrite the DPO Rafailov et al. (2023) loss as follows:

\mathcal{L}_{dpo} = -\mathbb{E}_{(q,(a_w,a_l))\sim\mathcal{D}_{kc}}\left[\log\sigma\left(\beta\log\frac{\pi_{kc}(a_w\mid q)}{\pi_{ka}(a_w\mid q)} - \beta\log\frac{\pi_{kc}(a_l\mid q)}{\pi_{ka}(a_l\mid q)}\right)\right],    (18)

where $(a_w,a_l)$ denotes the answer pair for a question $q\in\mathcal{D}_{kc}$, and $a_w$ is the better answer. To maintain coarse-grained knowledge awareness of better answers, we add the SFT loss to the coarse-grained knowledge comparison loss:

\mathcal{L}_{kc} = \mathcal{L}_{dpo} + \gamma\,\mathcal{L}_{\mathrm{sft}},    (19)

where $\mathcal{L}_{\mathrm{sft}}$ is computed on the better answers $a_w$ and $\gamma$ is a scalar weighting hyperparameter.
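As a sketch, the combined objective of Eqs. 18–19 can be computed from summed log-probabilities of the better (w) and worse (l) answers under the policy $\pi_{kc}$ and the frozen reference $\pi_{ka}$; all argument names below are ours, not the paper's.

```python
import torch.nn.functional as F

def knowtuning_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
                    sft_nll_w, beta=0.1, gamma=0.2):
    """DPO loss on a knowledge comparison pair plus an SFT term (Eqs. 18-19).

    Each *_logp_* argument is a tensor holding the summed log-probability of the
    better (w) or worse (l) answer; sft_nll_w is the SFT loss on the better answer.
    """
    dpo = -F.logsigmoid(
        beta * (policy_logp_w - ref_logp_w) - beta * (policy_logp_l - ref_logp_l)
    )
    return dpo + gamma * sft_nll_w
```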

4 Experiments

4.1 Research Questions

We aim to answer the following research questions in our experiments: RQ1: How does KnowTuning perform on generic and medical QA under automatic evaluation and human evaluation? RQ2: How does KnowTuning perform on generic and medical QA under fine-grained facts evaluation? RQ3: How do fine-grained knowledge augmentation and coarse-grained knowledge comparison affect the performance of KnowTuning?

Backbone Language Model: Llama2-7b-base
Method      |  Dolly            |  MedQuAD          |  NQ               |  ELI5
            |  METEOR BERTScore |  METEOR BERTScore |  METEOR BERTScore |  METEOR BERTScore
Base        |  12.29    78.07   |  12.79    78.44   |   5.10    72.70   |   9.09    76.05
SFT         |  14.01    84.38   |  19.95    80.97   |   7.55    76.71   |  11.96    79.65
RLAIF       |  17.60    85.31   |  20.60    83.82   |  10.77    79.62   |  13.66    80.41
FactTune    |  16.84    85.16   |  21.82    82.99   |  10.08    79.09   |  14.19    80.83
KnowTuning  |  19.56    86.37   |  24.71    84.28   |  12.22    80.54   |  16.32    81.74

Backbone Language Model: Llama2-13b-base
Base        |  11.59    77.90   |  12.12    78.29   |   5.51    73.80   |   7.79    75.63
SFT         |  15.31    84.39   |  19.66    82.34   |   8.70    78.18   |  12.00    81.21
RLAIF       |  19.03    85.43   |  20.37    83.13   |  11.79    80.30   |  13.61    82.06
FactTune    |  18.59    85.38   |  21.42    83.49   |  11.37    80.02   |  13.74    82.16
KnowTuning  |  20.01    86.32   |  25.21    84.41   |  12.56    80.74   |  14.45    83.06

Table 1: Lexicon-based (METEOR) and semantic-based (BERTScore) evaluation on generic and medical QA. The best performance is highlighted in bold.

4.2 Datasets

We conduct experiments on general domain and domain-specific knowledge-intensive question-answering datasets:

  • Dolly Conover et al. (2023) is a general domain QA dataset carefully curated by thousands of human annotators. Since we focus on open-ended generic domain QA, we select QA pairs from the “open_qa” and “general_qa” categories.

  • MedQuAD Abacha and Demner-Fushman (2019) is a medical domain QA dataset collected from 12 National Institutes of Health websites. Following August et al. (2022), we select QA pairs of the category “Information”, which give detailed information about medical terms.

To evaluate performance across a wider range of knowledge-intensive tasks, we further evaluate generic QA models on two representative test sets from the knowledge-intensive language tasks (KILT) benchmark Petroni et al. (2021):

  • NQ Kwiatkowski et al. (2019) consists of real questions directed to the Google search engine. Every question is paired with a corresponding Wikipedia page that includes a detailed long-form answer and a concise short answer. We filter questions and corresponding long answers as testing QA pairs.

  • ELI5 Fan et al. (2019) includes a set of question-answer-evidence triples. The questions are complex, and the responses are comprehensive, explanatory, and presented in a free-form style. We filter questions and corresponding answers as testing QA pairs.

More details of datasets are in Appendix A.

4.3 Baselines

We compare our model with the following baselines:

  • Base denotes testing Llama2-base models Touvron et al. (2023) under the zero-shot setting.

  • SFT Ouyang et al. (2022) represents vanilla fine-tuning backbone LLMs on QA datasets according to Eq. 1.

  • RLAIF Bai et al. (2022); Lee et al. (2023) leverages LLMs to annotate overall helpfulness scores for candidate answers, and constructs overall helpfulness comparison sets based on the scores.

  • FactTune  Tian et al. (2023) constructs factuality comparison sets by calculating the proportion of correct facts in candidate answers.

More details of baselines are in Appendix B.

Backbone Language Model: Llama2-7b-base
Method                  | Dataset  |   Completeness    |    Factuality     |    Logicality     | Avg. gap
                        |          |  Win   Tie  Lose  |  Win   Tie  Lose  |  Win   Tie  Lose  |
KnowTuning vs Base      | Dolly    | 88.50  3.00  8.50 | 73.00 20.00  7.00 | 80.50 12.00  7.50 |  +73.00
KnowTuning vs SFT       | Dolly    | 78.50  5.50 16.00 | 37.00 46.50 16.50 | 50.50 34.00 15.50 |  +39.33
KnowTuning vs RLAIF     | Dolly    | 69.50  5.00 25.50 | 32.00 49.00 19.00 | 46.50 39.00 14.50 |  +29.67
KnowTuning vs FactTune  | Dolly    | 64.50 10.00 25.50 | 30.00 53.00 17.00 | 31.50 55.50 13.00 |  +23.50
KnowTuning vs Base      | MedQuAD  | 93.00  3.00  4.00 | 72.50 20.50  7.00 | 85.00  8.50  6.50 |  +77.67
KnowTuning vs SFT       | MedQuAD  | 81.00  3.50 15.50 | 46.50 37.50 16.00 | 64.50 21.50 14.00 |  +48.83
KnowTuning vs RLAIF     | MedQuAD  | 85.00  2.50 12.50 | 41.00 38.50 20.50 | 50.50 30.00 19.50 |  +41.33
KnowTuning vs FactTune  | MedQuAD  | 83.00  3.50 13.50 | 40.50 36.50 23.00 | 50.50 31.50 18.00 |  +39.83

Backbone Language Model: Llama2-13b-base
KnowTuning vs Base      | Dolly    | 85.50  6.50  8.00 | 66.00 24.50  9.50 | 81.00 13.00  6.00 |  +69.67
KnowTuning vs SFT       | Dolly    | 77.00  5.00 18.00 | 35.50 49.50 15.00 | 45.00 40.00 15.00 |  +36.50
KnowTuning vs RLAIF     | Dolly    | 73.50  4.00 22.50 | 33.50 52.50 14.00 | 46.50 40.50 13.00 |  +34.67
KnowTuning vs FactTune  | Dolly    | 68.50  6.50 25.00 | 30.50 55.00 14.50 | 36.00 54.00 10.00 |  +28.50
KnowTuning vs Base      | MedQuAD  | 92.50  2.50  5.00 | 73.50 17.50  9.00 | 84.00  8.00  8.00 |  +76.00
KnowTuning vs SFT       | MedQuAD  | 86.50  3.50 10.00 | 45.50 41.00 13.50 | 60.00 31.00  9.00 |  +53.16
KnowTuning vs RLAIF     | MedQuAD  | 82.50  5.00 12.50 | 38.50 48.00 13.50 | 54.00 38.50  7.50 |  +47.17
KnowTuning vs FactTune  | MedQuAD  | 78.00  4.50 17.50 | 37.00 47.00 16.00 | 48.50 39.50 12.00 |  +39.33

Table 2: Main results on generic QA and medical QA datasets evaluated by GPT-4. Scores marked with * mean KnowTuning outperforms the baseline significantly with p-value < 0.05 (sign. test), following Guan et al. (2021).

4.4 Evaluation Metrics

We present our experimental results using two kinds of evaluation: automatic evaluation and human evaluation. Following previous studies Clinciu et al. (2021); Slobodkin et al. (2023), we employ two automatic metrics for absolute quality evaluation: the lexicon-based metric METEOR Banerjee and Lavie (2005) and the semantic-based metric BERTScore Zhang et al. (2019). Since recent studies show that GPT-4 can effectively evaluate the quality of LLM answers Zheng et al. (2024a); Dubois et al. (2023); Fu et al. (2023), we also conduct GPT-4 pairwise evaluation. Specifically, given the golden label as a reference, we employ GPT-4 to rate generated answers on three aspects, completeness, factuality, and logicality, on a scale of 1 to 10. Following Singhal et al. (2022); Zheng et al. (2024a); Zhang et al. (2023b), we define completeness, factuality, and logicality as: (i) Completeness: whether the answers provide comprehensive and sufficient knowledge for the questions. (ii) Factuality: whether the knowledge in the answers is factually correct. (iii) Logicality: whether the knowledge in the answers is logically structured. Following Li et al. (2023); Chen et al. (2023a), we define “Win-Tie-Lose” as: (i) Win: KnowTuning wins twice, or wins once and ties once. (ii) Tie: KnowTuning ties twice, or wins once and loses once. (iii) Lose: KnowTuning loses twice, or loses once and ties once.
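Since each answer pair is judged in both positions over two runs (see Appendix C.1), the pairwise labels can be aggregated as below; this is a direct transcription of the Win-Tie-Lose definition above, with names of our choosing.

```python
def aggregate_two_runs(run1: str, run2: str) -> str:
    """Map two position-swapped verdicts for KnowTuning ('win'/'tie'/'lose')
    to the final Win-Tie-Lose label used in Tables 2 and 3."""
    score = {"win": 1, "tie": 0, "lose": -1}
    total = score[run1] + score[run2]
    if total >= 1:    # two wins, or one win and one tie
        return "Win"
    if total <= -1:   # two losses, or one loss and one tie
        return "Lose"
    return "Tie"      # two ties, or one win and one loss
```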

We also employ human judgments as the gold standard for assessing the quality of answers. Specifically, human evaluators perform pair-wise comparisons of the top-performing models identified in automatic evaluations. They are presented with a question with a golden answer, and asked to judge two generated answers on three aspects: completeness, factuality, and logicality.

To evaluate the capabilities of LLMs at a fine-grained level, we follow Min et al. (2023) and conduct fine-grained facts evaluation. Specifically, we first break candidate answers into individual facts, and use gpt-3.5-turbo to measure the correctness of each fact based on the golden answer as a reference. Following Tian et al. (2023), we report the number of correct facts (# Correct), the number of incorrect facts (# Incorrect), the number of total facts (# Total), and the proportion of correct facts out of the total number of extracted facts (% Correct). More details of the evaluation are in Appendix C.
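Given per-fact correctness judgments from gpt-3.5-turbo, the four reported metrics reduce to simple counts; a minimal sketch:

```python
def fact_metrics(judgments: list[bool]) -> dict:
    """Summarize per-fact correctness judgments into the metrics of Table 4."""
    n_correct = sum(judgments)
    n_total = len(judgments)
    return {
        "# Correct": n_correct,
        "# Incorrect": n_total - n_correct,
        "# Total": n_total,
        "% Correct": 100.0 * n_correct / n_total if n_total else 0.0,
    }
```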

4.5 Implementation Details

We employ Llama2-base models of different sizes (7b and 13b) as our backbone models for training. We adopt the Alpaca template Taori et al. (2023) for training and inference. The OpenAI model used for $\operatorname{Extract}(\cdot)$, $\operatorname{Rewrite}(\cdot)$, and $\operatorname{Revise}(\cdot)$ is gpt-3.5-turbo. More details of the implementation are in Appendix D.

5 Experimental Results and Analysis

To answer our research questions, we conduct generic domain and medical domain QA experiments, fine-grained facts evaluation, and ablation studies. In addition, we conduct a case study to gain further understanding of the effectiveness of KnowTuning.

Backbone Language Model: Llama2-7b-base
Method                  | Dataset  |   Completeness    |    Factuality     |    Logicality     | Avg. gap
                        |          |  Win   Tie  Lose  |  Win   Tie  Lose  |  Win   Tie  Lose  |
KnowTuning vs FactTune  | Dolly    | 61.00 12.00 27.00 | 28.00 58.50 13.50 | 33.50 50.00 16.50 |  +21.83
KnowTuning vs FactTune  | MedQuAD  | 73.00  9.00 18.00 | 40.00 43.00 17.00 | 45.50 36.00 18.50 |  +35.00

Backbone Language Model: Llama2-13b-base
KnowTuning vs FactTune  | Dolly    | 58.00 11.00 31.00 | 32.50 56.50 11.00 | 35.00 53.00 12.00 |  +23.83
KnowTuning vs FactTune  | MedQuAD  | 78.00  6.50 15.50 | 43.00 45.50 11.50 | 39.00 45.50 15.50 |  +39.17

Table 3: Human evaluation results on generic domain and medical domain QA datasets. Scores marked with * mean KnowTuning surpasses FactTune significantly with p-value < 0.05 (sign. test).
Backbone Language Model: Llama2-7b-base
Method      |                     Dolly                      |                    MedQuAD
            | # Correct↑  # Incorrect↓  # Total↑  % Correct↑ | # Correct↑  # Incorrect↓  # Total↑  % Correct↑
Base        |    6.15        3.62          9.77     62.94    |    6.54        3.42          9.96     65.66
SFT         |    7.77        1.85          9.62     80.77    |   16.11        1.73         17.84     90.30
RLAIF       |   11.23        2.10         13.33     84.25    |   10.86        0.95         11.81     91.96
FactTune    |   11.25        1.92         13.17     85.42    |   12.83        0.83         13.66     93.92
KnowTuning  |   14.40        2.36         16.76     85.92    |   18.04        0.98         19.02     94.85

Backbone Language Model: Llama2-13b-base
Base        |    9.57        4.28         13.85     69.10    |    7.96        3.50         11.46     69.46
SFT         |    9.96        2.21         12.17     81.84    |   16.82        1.66         18.48     91.02
RLAIF       |   10.72        2.16         12.88     83.23    |   13.01        1.16         14.17     91.81
FactTune    |   12.73        2.12         14.85     85.72    |   13.02        1.01         14.03     92.80
KnowTuning  |   15.44        2.20         17.64     87.53    |   19.01        1.11         20.12     94.48

Table 4: Fine-grained facts evaluation on generic and medical QA. The best performance is highlighted in bold.

5.1 Main Results (RQ1)

Automatic evaluation. Table 1 and Table 2 present the absolute quality evaluation results and the reference-based GPT-4 evaluation results for both generic and medical domain QA datasets. Across all metrics, KnowTuning outperforms the baseline models in these domains. Based on the results, we have three main observations:

  • KnowTuning demonstrates effectiveness under lexicon-based and semantic-based evaluations. As shown in Table 1, our method consistently improves the absolute quality of answers for general and medical QA tasks. Furthermore, these results illustrate the ability of our method to generalize to a wider range of knowledge-intensive datasets, such as NQ and ELI5.

  • KnowTuning consistently outperforms baselines in terms of completeness, factuality, and logicality, across generic and domain-specific QA datasets. Compared with Base and SFT, KnowTuning focuses on improving the fine-grained and coarse-grained knowledge awareness of LLMs, which significantly improves performance. Compared with RLAIF and FactTune, KnowTuning is more effective in improving the performance of LLMs on complex knowledge-intensive QA in multiple aspects. The reason is that RLAIF improves performance by calculating overall helpfulness scores and FactTune focuses on improving factuality; both ignore improving the knowledge awareness of LLMs in multiple essential aspects simultaneously.

  • KnowTuning demonstrates effectiveness on LLMs of different sizes. We observe that KnowTuning consistently improves the performance of QA tasks on LLMs of different scales (7b and 13b). This finding aligns with Bian et al. (2024) and Mecklenburg et al. (2024): LLMs learn a lot of generic knowledge during the pre-training stage but still need to learn downstream domain knowledge and explore how to effectively leverage knowledge for solving knowledge-intensive QA tasks.

Human evaluation. Human evaluations are crucial for accurately assessing the quality of answers. As shown in Table 3, to facilitate the human annotation process, we focus on comparing KnowTuning with the state-of-the-art baseline FactTune:

  • Our findings indicate that KnowTuning consistently surpasses FactTune in terms of completeness, factuality, and logicality performance across various sizes of LLMs under human evaluation.

  • KnowTuning demonstrates superior performance on both generic and medical domain QA under human evaluation, in terms of completeness, factuality, and logicality.

Method              |   Completeness    |    Factuality     |    Logicality     | Avg. gap
                    |  Win   Tie  Lose  |  Win   Tie  Lose  |  Win   Tie  Lose  |
-KA vs KnowTuning   | 32.50 20.00 47.50 | 16.00 57.50 26.50 | 12.50 61.50 26.00 |  -13.00
-KCC vs KnowTuning  | 18.50 31.00 50.50 | 11.00 72.50 16.50 | 10.50 61.50 28.00 |  -18.33
-KFC vs KnowTuning  | 23.00 28.50 48.50 |  8.50 70.50 21.00 | 12.00 60.50 27.50 |  -17.83
-KLC vs KnowTuning  | 25.50 27.50 47.00 | 12.00 73.00 15.00 |  9.50 60.00 30.50 |  -15.17
-KC vs KnowTuning   | 11.50  6.00 82.50 | 16.00 52.00 32.00 | 15.50 40.50 44.00 |  -38.50

Table 5: Ablation study evaluated by GPT-4 on the generic QA dataset. The backbone model is Llama2-7b-base. -KA indicates the exclusion of fine-grained knowledge augmentation, -KCC the exclusion of the completeness comparison, -KFC the exclusion of the factuality comparison, -KLC the exclusion of the logicality comparison, and -KC the exclusion of all coarse-grained knowledge comparisons.

5.2 Fine-grained Fact Evaluation (RQ2)

To evaluate the ability of methods to generate correct facts at the fine-grained level, we conduct fine-grained facts evaluation experiments. Based on the results in Table 4, we have two main observations:

  • KnowTuning generates answers with a higher proportion of correct facts across various sizes of LLMs. Compared to baselines, KnowTuning generates more facts with a lower factual error rate across different sizes of LLMs. Although RLAIF and FactTune improve the proportion of correct facts, they ignore fine-grained knowledge augmentation and coarse-grained knowledge completeness awareness. Note that even though FactTune generates fewer incorrect facts, KnowTuning outperforms FactTune on the more critical metric of the percentage of correct facts.

  • KnowTuning generates larger amounts of correct facts across generic and domain-specific QA datasets. Compared to SFT, we observe that KnowTuning consistently generates more correct facts across generic and domain-specific QA datasets. However, in the specific medical domain QA, RLAIF and FactTune generate fewer correct facts than SFT. This is because LLMs learn a large amount of generic knowledge during the pre-training stage, yet still lack domain-specific knowledge for downstream tasks Mecklenburg et al. (2024). This underscores the necessity for enhancing fine-grained knowledge awareness in domain-specific, knowledge-intensive QA tasks, as well as the need to improve coarse-grained knowledge awareness across key aspects of completeness, factuality, and logicality.

5.3 Ablation Studies (RQ3)

In Table 5, we compare KnowTuning with several ablative variants. The variants are as follows: (i) -KA: we remove the fine-grained knowledge augmentation. (ii) -KCC: we remove knowledge completeness comparison set. (iii) -KFC: we remove knowledge factuality comparison set. (iv) -KLC: we remove knowledge logicality comparison set. (v) -KC: we remove all coarse-grained knowledge comparison sets. Our findings are as follows:

  • Removing the fine-grained knowledge augmentation. We observe that removing fine-grained knowledge augmentation (-KA) decreases the performance of all three aspects. This indicates that fine-grained knowledge augmentation is effective for improving fine-grained knowledge awareness of LLMs.

  • Removing the coarse-grained knowledge comparison. The absence of coarse-grained knowledge comparisons results in substantial performance degradation in knowledge-intensive QA tasks. Specifically, removing the knowledge completeness comparison (-KCC) adversely affects completeness, the elimination of the knowledge factuality comparison (-KFC) undermines factuality, and the removal of the knowledge logicality comparison (-KLC) diminishes logicality. Although deleting and revising atomic knowledge can impact logicality, shuffling has been found more effective in improving coarse-grained logicality for LLMs. Furthermore, removing all coarse-grained knowledge comparison sets (-KC) results in a significant drop in performance across all aspects of the knowledge-intensive QA task.

5.4 Case Study

We conduct several case studies and find that KnowTuning is more effective at generating complete, factual and logical answers than baselines across various sizes of LLMs. More details of our case study results are in Appendix E.

6 Conclusions

In this paper, we focus on improving the knowledge awareness of LLMs via fine-tuning for complex knowledge-intensive tasks. We have proposed KnowTuning to fine-tune LLMs through fine-grained knowledge augmentation and coarse-grained knowledge comparison stages. We have conducted comprehensive experiments on generic and medical domain QA datasets, demonstrating the effectiveness of KnowTuning through automatic and human evaluations, across various sizes of LLMs. Moreover, KnowTuning generates more facts with a lower factual error rate under fine-grained facts evaluation.

Limitations

In this study, KnowTuning is mainly aimed at generic and medical knowledge-intensive tasks; in future work, we plan to adapt KnowTuning to other tasks such as legal domain QA Zhong et al. (2020); Lyu et al. (2022, 2023a) and mathematical reasoning Luo et al. (2023). Moreover, our efforts have been concentrated on enhancing the knowledge awareness of LLMs during the fine-tuning stage. Future studies will explore improving the knowledge awareness of LLMs in the pre-training stage Rosset et al. (2020).

Ethical Considerations

KnowTuning mainly focuses on completeness, factuality, and logicality, but not social bias Pitoura et al. (2017); Lyu et al. (2023b) or the potential for generating harmful or toxic content Song et al. (2024); Hewitt et al. (2024); Gao et al. (2024). We plan to adopt our method to reduce social bias and harmful content at fine-grained and coarse-grained levels in future work.

Acknowledgments

This work was supported by the Natural Science Foundation of China (62272274, 62372275, 62102234, 62202271, 62072279), the National Key R&D Program of China with grant No.2022YFC3303004, the Natural Science Foundation of Shandong Province (ZR2021QF129), the China Scholarship Council under grant number 202306220180, the Dutch Research Council (NWO), under project numbers 024.004.022, NWA.1389.20.183, and KICH3.LTP.20.006, and the European Union’s Horizon Europe program under grant agreement No 101070212. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

References

Appendix

Appendix A Details of Datasets

  • Dolly Conover et al. (2023): Given our focus on open-ended generic domain QA, we select QA pairs categorized under “open_qa” and “general_qa” for our dataset. We use 4,000 QA pairs for training, 200 QA pairs for validation, and 200 QA pairs for testing.

  • MedQuAD Abacha and Demner-Fushman (2019): The dataset covers 37 different question types. In this paper, following August et al. (2022), we select QA pairs of the category “Information”, which give definitions and information about medical terms. We use 4,000 QA pairs for training, 200 QA pairs for validation, and 200 QA pairs for testing.

  • NQ Kwiatkowski et al. (2019): We filter 200 questions and corresponding long answers as testing QA pairs from the development set. The length of these long answers ranges from 100 to 500.

  • ELI5 Fan et al. (2019): We filter 200 questions in the test set and the corresponding highest scoring answers as testing QA pairs.

Appendix B Details of Baselines

  • Base: We adopt the Alpaca template Taori et al. (2023) for testing the Llama2-base model Touvron et al. (2023) under zero-shot setting.

  • SFT: We follow standard vanilla fine-tuning loss in Eq. 1 to train LLMs on original QA datasets.

  • RLAIF Bai et al. (2022); Lee et al. (2023): We leverage gpt-3.5-turbo to annotate overall helpfulness scores and construct generic helpfulness comparison sets. We adopt DPO Rafailov et al. (2023) for generic helpfulness comparison sets optimization.

  • FactTune Tian et al. (2023): We follow Min et al. (2023) to first break each candidate answer into individual facts, and prompt LLMs to measure the correctness of each fact based on the golden answer as a reference (https://github.com/shmsw25/FActScore). Then, we construct factuality comparison sets by the percentage of correct facts. Finally, we adopt DPO Rafailov et al. (2023) for factuality comparison set optimization.

Appendix C Details of Evaluation

C.1 GPT-4 Evaluation

This section provides specifics of the GPT-4 prompt utilized for reference-based evaluation, employing gpt-4-turbo. Figure 3 illustrates the prompt adapted from Zheng et al. (2024a), aimed at assessing the completeness, factuality, and logicality of answers. To avoid positional bias Ko et al. (2020); Wang et al. (2023e), we evaluate each answer in both positions during two separate runs.

Figure 3: Prompts for GPT-4 evaluation.
Figure 4: Instructions for human evaluation.

C.2 Human Evaluation

For the human evaluation, we hired people with undergraduate degrees and undergraduate medical degrees to annotate the generic QA and medical QA test sets, respectively, to ensure the trustworthiness of the human evaluations. We allowed the human evaluators to access Wikipedia to further validate knowledge during the evaluation process. Instructions for human evaluation are depicted in Figure 4.

C.3 Fine-grained Facts Evaluation

Following Min et al. (2023), we first break candidate answers into individual facts, and use gpt-3.5-turbo to measure the correctness of each fact based on the golden answer as a reference (https://github.com/shmsw25/FActScore).

Appendix D Details of Implementation

D.1 Prompts for Extracting, Rewriting, and Revising

Details of the prompts used in $\operatorname{Extract}(\cdot)$, $\operatorname{Rewrite}(\cdot)$, and $\operatorname{Revise}(\cdot)$ are provided here. Figures 5, 6, 7, and 8 display the prompts for extracting atomic knowledge, rewriting fine-grained questions, rewriting fine-grained answers, and revising atomic knowledge into non-factual knowledge, respectively.

Figure 5: Prompts for extracting atomic knowledge in the answer Min et al. (2023).
Figure 6: Prompts for rewriting fine-grained questions.
Figure 7: Prompts for rewriting fine-grained answers.
Figure 8: Prompts for revising atomic facts into incorrect facts.

D.2 Reliability of Atomic Knowledge Extraction

To evaluate the reliability of atomic knowledge extraction, we first sample 50 instances from the generic QA dataset Dolly. We manually check these instances and find that only 3 require further separation or merging of atomic facts, illustrating the reliability of extracting atomic facts using gpt-3.5-turbo.

D.3 Training

During the training phase, the AdamW optimizer Loshchilov and Hutter (2019) is utilized with initial learning rates of $5\cdot10^{-5}$ for SFT and $1\cdot10^{-5}$ for DPO. The batch sizes for SFT and DPO are set to 32 and 16, respectively, with SFT undergoing 3 epochs of training and DPO 1 epoch. The filtering and deleting percentages, $\alpha$ and $\beta$, are both fixed at 0.5. The scalar weighting hyperparameter $\gamma$ is set to 0.2. We determine the hyperparameters through pilot experiments. Training leverages PEFT Mangrulkar et al. (2022), LLaMA-Factory Zheng et al. (2024b), and LoRA Hu et al. (2022).
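For reference, a hedged sketch of a LoRA setup with PEFT follows; the paper does not report adapter hyperparameters such as rank or target modules, so the values below are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=8,                                  # LoRA rank (assumed, not reported)
    lora_alpha=16,                        # scaling factor (assumed)
    lora_dropout=0.05,                    # dropout on LoRA branches (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```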

D.4 Cost Analysis

The cost of KnowTuning is lower than that of the baseline methods RLAIF and FactTune. Specifically, in the generic domain QA dataset Dolly, the costs are as follows: KnowTuning is $8.45, RLAIF is $9.94, and FactTune is $10.53. This cost difference arises because RLAIF necessitates pairwise comparisons for assessing the overall helpfulness of all candidate answers, while FactTune requires a detailed factuality evaluation for each fact across all candidate answers, thereby increasing their dataset comparison construction costs.

Figure 9: Case study for intuitive comparisons on the generic QA dataset based on Llama2-7b-base.
Figure 10: Case study for intuitive comparisons on the generic QA dataset based on Llama2-13b-base.

Appendix E Details of Case Study

As illustrated in Figures 9 and 10, the case studies evaluate answers generated by four methods: SFT, RLAIF, FactTune, and KnowTuning across various sizes. Our findings indicate that KnowTuning excels at producing answers that are more complete, factual, and logical across various sizes of LLMs, as detailed below:

  • As shown in Figure 9, for the case study based on Llama2-7b-base, KnowTuning generates more complete and logical answers compared to all baselines. Although RLAIF produces more knowledge than SFT, its answers are less logical because it does not explicitly focus on logicality optimization. FactTune, on the other hand, focuses on improving the percentage of correct facts and performs poorly in terms of answer completeness and logic. This illustrates the need for multiple aspects of coarse-grained knowledge awareness.

  • As shown in Figure 10, for the case study based on Llama2-13b-base, KnowTuning generates content that is more informative and factual, and the knowledge is more logically connected. Although RLAIF generates multiple aspects of knowledge, it does not provide fine-grained knowledge in the answer. FactTune generates detailed information such as Canada's domestic population and GDP, but provides factually incorrect information. This further underscores the critical need for enhanced fine-grained knowledge awareness.