
Knowledgeable Preference Alignment for LLMs in
Domain-specific Question Answering

Yichi Zhang*, Zhuo Chen*, Yin Fang, Yanxi Lu, Fangming Li,
Wen Zhang†, Huajun Chen†
Zhejiang University
Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
NAIE, Huawei Technologies Co., Ltd.
{zhangyichi2022,zhuo.chen,zhang.wen,huajunsir}@zju.edu.cn
https://github.com/zjukg/KnowPAT
* Equal contribution.  † Corresponding authors.
Abstract

Deploying large language models (LLMs) for domain-specific question answering (QA) in real scenarios is a key thrust of LLM applications, yet it poses numerous challenges, especially in ensuring that responses both accommodate user requirements and appropriately leverage domain-specific knowledge bases. These are the two major difficulties that vanilla fine-tuning falls short of addressing. We view both requirements as the need to align the model's preference with humans'. Thus, we introduce Knowledgeable Preference AlignmenT (KnowPAT), which constructs two kinds of preference sets to tackle the two issues. We further design a new alignment objective to align the LLM preference with different human preferences uniformly, aiming to optimize LLM performance in real-world, domain-specific QA settings. Extensive experiments and comprehensive comparisons with 15 baseline methods illustrate that KnowPAT is a superior pipeline for real-scenario domain-specific QA with LLMs.

1 Introduction

Figure 1: A simple case of intelligent service for cloud products. This simple example illustrates the importance of selectively using retrieved knowledge, as MAC is a term in computer networking rather than a kind of computer or lipstick in the user's context.

In contemporary digital commerce platforms, the deployment of automated and intelligent question-answering (QA) services is a pivotal task to augment service quality. These services are designed to furnish answers to domain-specific customer queries. Building such a domain-specific QA system, while highly sought after, remains a daunting challenge in practical scenarios.

Domain-specific QA necessitates a comprehensive understanding of a specific domain to answer specialized questions. However, traditional deep learning models Devlin et al. (2019); Raffel et al. (2020) still lack sufficient domain-specific expertise. This makes domain knowledge bases (KBs) Liang et al. (2022) a pivotal tool for storing and querying domain knowledge. KBs store human knowledge in triple form, offering a unified, maintainable, and extensible representation of knowledge from heterogeneous sources. The utility of KBs has already been demonstrated across various application scenarios such as E-commerce Zhu et al. (2021) and health care Li et al. (2020). Within the context of QA, incorporating external knowledge sources represents a promising approach known as KBQA Jiang et al. (2023b).

Meanwhile, as large language models (LLMs) West et al. (2023) achieve significant progress and exhibit substantial proficiency across numerous NLP fields Zhu et al. (2023), applying LLMs to various downstream tasks has become a predominant trend in industry Zhang et al. (2023a). In contrast to earlier pre-trained language models Devlin et al. (2019); Raffel et al. (2020), LLMs trained on massive corpora exhibit stronger text generation capabilities and interact better with humans Ouyang et al. (2022). To adapt LLMs for downstream usage, supervised fine-tuning (SFT) Zhang et al. (2023e) is applied to fit the model to specific tasks and data. However, applying LLMs to real-scenario QA with an external KB remains an underexplored domain, with limited work addressing this intersection.

Our goal entails the resolution of a challenge in real-world applications: how can LLMs be used to solve real-scenario QA problems supported by external knowledge bases? A generic pipeline for this problem is retrieval-augmented generation (RAG) Tian et al. (2023), which first retrieves knowledge triples relevant to the question as reference data and subsequently fine-tunes the LLM with a knowledge-enhanced prompt. However, this conventional approach often encounters obstacles in practical scenarios. Firstly, the LLM-produced responses must prioritize user-friendliness, avoiding any generation of inappropriate or unfriendly content. Secondly, the retrieved knowledge is not invariably useful, necessitating that LLMs develop the capacity to exploit knowledge judiciously. Figure 1 illustrates a simple case in which the retrieved knowledge is not actually needed (e.g., knowledge describing MAC as a kind of lipstick), which requires the LLM to utilize the retrieved knowledge selectively instead of generating answers without thoughtful consideration. These two issues collectively constitute the preference problem of LLMs: LLMs have a style preference for how they generate content and a knowledge preference for how they selectively use the retrieved knowledge in the prompt. For a practical application, the preferences of LLMs need to align with human expectations and requirements for better service. This refers to preference alignment (PA) Yuan et al. (2023), a burgeoning topic in the LLM community, which incorporates human preference to tune LLMs during training. PA aims to steer the model toward generating human-preferred content and away from unpreferred content. However, the scenarios faced by current PA works tend to be generic; no research has been explicitly directed towards domain-specific applications such as ours, providing impetus for further exploration.

In this paper, we propose a novel three-step Knowledgeable Preference AlignmenT (KnowPAT) pipeline to address the domain-specific QA task for a real-scenario LLM application. KnowPAT proposes knowledgeable preference set construction to incorporate domain KBs into the construction of knowledgeable preference data. Besides, a new alignment objective is designed to optimize the LLM with the knowledge preference. Our contributions can be summarized as three-fold:

(1). To the best of our knowledge, this is the first work that introduces preference alignment for domain-specific QA with LLMs and domain KBs, an industrial practice with practical applications.

(2). We propose a knowledgeable preference alignment (KnowPAT) framework to incorporate KBs into the preference alignment process of LLMs. We balance the need for both style and knowledge preference and devise a new training objective to align the LLM with human preference.

(3). We conduct comprehensive experiments to validate the effectiveness of our method on two datasets, showing that KnowPAT outperforms 15 existing baselines.

2 Problem Setting

In this section, we will first introduce our problem scenario and basic notations.

Figure 2: The overall architecture of KnowPAT. We design three important modules in our framework: unsupervised knowledge retrieval, knowledgeable preference set construction, and knowledgeable preference alignment.

Our overall target is to fine-tune an LLM $\mathcal{M}$ with our QA dataset $\mathcal{D}=\{(q_i,a_i)\mid i=1,2,\dots,N\}$, where $(q_i,a_i)$ is a question-answer pair. The questions in the dataset are all about common usage issues with our cloud products, and the questions and answers are manually collected and labeled, yielding golden answers with decent and knowledgeable responses. For vanilla fine-tuning (VFT), we first wrap the QA pair with a prompt template $\mathcal{I}$, and the model $\mathcal{M}$ is autoregressively Brown et al. (2020) optimized as:

\mathcal{L}_{ft}=-\frac{1}{|a_i|}\sum_{j=1}^{|a_i|}\log P_{\mathcal{M}}(a_{i,j}\mid\mathcal{I},q_i,a_{i,<j})    (1)

where $a_{i,j}$ is the $j$-th token of $a_i$ and $P_{\mathcal{M}}$ denotes the token probability predicted by the model $\mathcal{M}$. With such a training objective, the training QA data serves as the supervision signal to tune the model $\mathcal{M}$ to the QA scenario. Besides, as this is a domain-specific task, we maintain a domain knowledge base (domain KB) $\mathcal{B}$. The domain KB can take many forms, such as domain knowledge graphs or documents. We denote by $k_i\in\mathcal{B}$ one piece of support knowledge in the domain KB. By retrieving the top-$k$ knowledge with the highest relevance, the input prompt incorporates the retrieved knowledge $\mathcal{K}$. Thus, $\mathcal{M}$ can learn the relevant knowledge during the VFT process, which constitutes a general retrieval-augmented generation (RAG) pipeline for domain-specific LLM applications.
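To make Eq. (1) concrete, below is a minimal PyTorch sketch of the vanilla fine-tuning loss, assuming a Hugging Face causal LM; the checkpoint name is a placeholder and the prompt construction is simplified. Prompt (and retrieved-knowledge) tokens are masked with -100 so the loss averages only over the answer tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "FlagAlpha/Atom-7B-Chat"  # placeholder checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def vft_loss(prompt: str, answer: str) -> torch.Tensor:
    """Negative average log-likelihood of the answer tokens given the wrapped prompt (Eq. 1)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=-1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(-1)] = -100  # ignore prompt/knowledge positions in the loss
    return model(input_ids=input_ids, labels=labels).loss  # cross-entropy over the answer tokens
```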

However, such a VFT approach cannot achieve satisfactory results for domain-specific QA. On the one hand, applications in real scenarios should be user-friendly; otherwise, they will not bring commercial value. Thus, the text style of the generated response should be more acceptable to users. On the other hand, the knowledge retrieval process is unsupervised, and the usefulness of the retrieved knowledge is hard to guarantee, which means that the model $\mathcal{M}$ needs to acquire the ability to judge and selectively utilize the knowledge triples. Therefore, we should improve the basic VFT to solve these two problems.

In fact, both of these problems can be summarized as model preference. The LLM $\mathcal{M}$ has its style preference for generating text and its knowledge preference for selectively utilizing the retrieved knowledge. For the model to be practically applicable, the model preference should align with human preference, aiming to generate high-quality answers that humans prefer. Preference alignment (PA) is an important topic for LLMs. To apply PA during LLM fine-tuning, we sample a preference set $\mathcal{P}=\{b_1,b_2,\dots,b_l\}$ with $l$ different answers for each QA pair $(q,a)$. We denote by $r_i$ the preference score of each answer $b_i$, where a higher $r_i$ means that humans prefer this answer more. During training, we define another objective $\mathcal{L}_{align}$ to align the model $\mathcal{M}$ with the preference set $\mathcal{P}$, aiming to increase the probability of human-preferred answers and simultaneously decrease the probability of unpreferred answers. The human preference of each answer is represented by the preference score $r$. The overall training objective then becomes $\mathcal{L}=\mathcal{L}_{ft}+\mathcal{L}_{align}$. With such a multi-task objective, the LLM is fine-tuned to fit the golden answers while avoiding unpreferred results. The next question is how to generate a preference set that reflects both the style and knowledge preferences.

3 Our KnowPAT Pipeline

In this section, we present our knowledgeable preference alignment (KnowPAT) pipeline, which consists of three key parts: unsupervised knowledge retrieval, knowledgeable preference set construction, and fine-tuning with preference alignment. Figure 2 gives an intuitive view of the three parts of our pipeline design.

3.1 Unsupervised Knowledge Retrieval

The first key part is unsupervised knowledge retrieval, which aims to link the knowledge in the KB $\mathcal{B}$ to each question $q_i$. We design a simple semantic-similarity-based retriever $\mathcal{H}$ to achieve this goal. The similarity between the $i$-th question $q_i$ and the $j$-th knowledge $k_j$ is:

sim(i,j)=\mathbf{Cosine}(\mathcal{H}(q_i),\mathcal{H}(k_j))    (2)

where the retriever $\mathcal{H}$ serves as a textual encoder and we treat both the question and the knowledge as text sequences to get their sentence representations. The similarity is the cosine similarity of the two representations. We retrieve the top-$k$ knowledge with the highest similarities for each question $q_i$ and denote the retrieved knowledge (RK) as $\mathcal{K}$. RK is added into the input prompt as background knowledge for the current question.
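For illustration, the following is a minimal sketch of the unsupervised retrieval step in Eq. (2), assuming the sentence-transformers library with a BGE-style encoder; the serialization of triples as plain text and the value of k are illustrative assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-base-zh-v1.5")  # the retriever H, encoding text into vectors

def retrieve_topk(question: str, knowledge_texts: list[str], k: int = 3) -> list[str]:
    """Return the k knowledge pieces with the highest cosine similarity to the question."""
    # Normalized embeddings make the dot product equal to the cosine similarity in Eq. (2).
    q_vec = encoder.encode([question], normalize_embeddings=True)        # shape (1, d)
    k_vecs = encoder.encode(knowledge_texts, normalize_embeddings=True)  # shape (n, d)
    sims = (k_vecs @ q_vec.T).squeeze(-1)                                # shape (n,)
    top_idx = np.argsort(-sims)[:k]
    return [knowledge_texts[i] for i in top_idx]

# Usage: triples are serialized as plain text before encoding (an illustrative choice).
kb = ["(EIP, used for, IP Binding)", "(MAC, is a, link-layer address)"]
print(retrieve_topk("What is a MAC address conflict?", kb, k=1))
```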

This process is unsupervised as we have no manually labeled question-knowledge pairs. Besides, our model will be deployed for real-scenario usage, so it also requires strong zero-shot generalization capabilities to new questions. For these two reasons, the retrieved knowledge $\mathcal{K}$ might be noisy and fail to provide useful background knowledge. We argue that the LLM $\mathcal{M}$ should learn the knowledge preference to select helpful information from the retrieved knowledge $\mathcal{K}$.

3.2 Knowledgeable Preference Set Construction

Motivated by this goal, we propose a knowledgeable preference set construction process that incorporates retrieved knowledge into preference set construction, consisting of two parts: the style preference set and the knowledge preference set.

For the style preference set (SPS) $\mathcal{P}_s$, we select $l-1$ different LLMs denoted $\mathcal{M}_1,\mathcal{M}_2,\dots,\mathcal{M}_{l-1}$. These LLMs have different textual comprehension and expression skills, so they can generate answers with different text styles. Their ability to answer domain-specific questions, and the quality of their answers, are inferior to the human-labeled golden answers. The $l-1$ answers generated in this way, together with the golden answer, form a style preference set $\mathcal{P}_s=\{b_1,b_2,\dots,b_l\}$ of length $l$. For the knowledge preference set (KPS), we assume that knowledge with high similarity that does not reach the top-$k$ rank is more likely to be unhelpful for the input question. We can obtain preference data of different quality by retrieving some relatively worse knowledge and prompting the model to generate responses with knowledge of different quality. In our design, we retrieve 3 groups of knowledge $\mathcal{K}_1,\mathcal{K}_2,\mathcal{K}_3$ from the CPKG (cloud product knowledge graph). $\mathcal{K}_1$ is the retrieved top-$k$ knowledge; $\mathcal{K}_2=\emptyset$ is an empty set with no retrieved knowledge; $\mathcal{K}_3$ is the knowledge ranked from $k+1$ to $2k$ by similarity, which we regard as easily misused knowledge with relatively high semantic similarity. Then we wrap each knowledge group $\mathcal{K}_i$ with the input prompt $\mathcal{I}$, feed it to the LLM $\mathcal{M}$, and generate different answers. These 3 generated answers and the golden answer form a knowledge preference set $\mathcal{P}_k=\{c_1,c_2,c_3,c_4\}$.

By doing this, we obtain two preference sets for each QA pair. To simplify the setting, we set $l=4$ so that the two sets have the same size. Besides, we design a rule-based strategy to decide the preference score $r$ for each answer. For the style preference set $\mathcal{P}_s$, the high-quality golden answer $b_1$ is assigned the highest score, and the scores of answers from other LLMs are determined by their general capabilities. In practice, we choose three different LLMs: ChatGPT ($b_2$) Ouyang et al. (2022), ChatGLM-6B ($b_3$) Zeng et al. (2023), and Vicuna-7B ($b_4$). Several LLM leaderboards indicate that the three are ranked in order of ability as ChatGPT > ChatGLM > Vicuna. Besides, after verification by human experts, we believe that the quality of the answers generated by these three models in our QA scenario also conforms to this order. Thus, the preference scores are assigned as $r_1>r_2>r_3>r_4$.

Meanwhile, for the knowledge preference set $\mathcal{P}_k$, the golden answer $c_1$ still has the highest preference score $r_1$. The answer $c_2$ generated with the top-$k$ knowledge $\mathcal{K}_1$ has the second-highest preference. The answer $c_3$ generated with no extra knowledge ($\mathcal{K}_2$) has the third-highest preference, and the answer $c_4$ generated with knowledge $\mathcal{K}_3$ is the worst. We found in our actual tests that the mismatch rate between the retrieved knowledge $\mathcal{K}_3$ and the question $q$ is very high and easily misleads the model $\mathcal{M}$, so we set its score lower than that of the empty-knowledge case $\mathcal{K}_2$. Thus, for the knowledge preference set $\mathcal{P}_k$, the preference scores are still ordered as $r_1>r_2>r_3>r_4$. For each QA pair, we construct two preference sets, and we finally obtain the whole preference data with $2N$ preference sets. The preference data participate in the fine-tuning process to control the style preference and knowledge preference of the model $\mathcal{M}$. Note that the sizes of the two preference sets need not be strictly the same; we adopt the above formulation for the sake of uniform notation in our paper.
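To summarize the construction procedure, here is a minimal sketch of building the two preference sets for one QA pair; the generate() helpers and model handles are hypothetical placeholders, and the scores simply encode the rule-based ordering $r_1>r_2>r_3>r_4$.

```python
from dataclasses import dataclass

@dataclass
class PreferenceSet:
    answers: list[str]  # candidate answers b_1..b_l (or c_1..c_4), most preferred first
    scores: list[int]   # rule-based human preference scores, higher = more preferred

def build_style_set(golden: str, question: str, other_llms) -> PreferenceSet:
    # other_llms is ordered by general capability, e.g. [chatgpt, chatglm, vicuna] (placeholders).
    answers = [golden] + [llm.generate(question) for llm in other_llms]
    return PreferenceSet(answers=answers, scores=[4, 3, 2, 1])

def build_knowledge_set(golden: str, question: str, llm, retrieve_topk, kb, k: int = 3) -> PreferenceSet:
    k1 = retrieve_topk(question, kb, k)          # top-k knowledge
    k2 = []                                      # empty knowledge set
    k3 = retrieve_topk(question, kb, 2 * k)[k:]  # ranks k+1..2k: easily misused knowledge
    answers = [golden] + [llm.generate(question, knowledge=kg) for kg in (k1, k2, k3)]
    return PreferenceSet(answers=answers, scores=[4, 3, 2, 1])
```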

3.3 Fine-tuning and Preference Alignment

In addition to the vanilla fine-tuning loss $\mathcal{L}_{ft}$ with the golden answer, the preference data also participate in the training process. For each preference set, the preference score $r_i$ of the $i$-th answer represents our degree of preference. We expect the model $\mathcal{M}$ to align with our preference. Thus, we design another score to represent the preference of the model, denoted as:

\mathcal{S}_i=\frac{1}{|a_i|}\sum_{j=1}^{|a_i|}\log P_{\mathcal{M}}(a_{i,j}\mid\mathcal{I},q_i,a_{i,<j})    (3)

The score $\mathcal{S}_i$ is the average log-likelihood of the answer tokens conditioned on the given prompt template $\mathcal{I}$ and question $q_i$. A higher score means the model assigns a higher probability to the current answer. To align the model preference with our expectation, we design a new alignment objective for our scenario, denoted as:

\mathcal{L}_{align}=-\sum_{i=1}^{|\mathcal{P}|-1}\left(\log\sigma(\mathcal{S}_i)+\sum_{r_j<r_i}\log\sigma(-\mathcal{S}_j)\right)    (4)

where $\sigma$ is the sigmoid function. This objective, newly proposed by us, achieves the preference alignment process by contrasting the preferred answer with the unpreferred answers. It is worth noting that the human preference scores $r_i$ only determine the ordering of the different answers and are not directly involved in the computation or gradient accumulation. Existing methods such as RRHF Yuan et al. (2023) and SLiC-HF Zhao et al. (2023) apply a margin-rank loss of the form $\sum_{r_j<r_i}\max(0,\lambda-\mathcal{S}_i+\mathcal{S}_j)$ to achieve preference alignment. However, their design only optimizes the model preference when the model preference score $\mathcal{S}$ of a human-preferred answer is lower than that of an unpreferred answer (formally, when $\mathcal{S}_i<\mathcal{S}_j$ with $r_j<r_i$). We argue that the preference should still be optimized beyond this situation and propose the above training objective to continuously decrease the probability of the unpreferred answers. Meanwhile, as different answers have different text quality and preference degrees, we further design an adaptive weight to control the influence of each preferred answer, denoted as:

\mu_i=\frac{\mathcal{S}_i-\mathcal{S}_{min}}{\mathcal{S}_{max}-\mathcal{S}_{min}}    (5)

where $\mathcal{S}_{max}$ and $\mathcal{S}_{min}$ are the maximum and minimum model preference scores in a preference set $\mathcal{P}$. With such an adaptive weight, the influence of answers with different preferences can be dynamically adjusted. The alignment loss then becomes:

\mathcal{L}_{align}=\sum_{i=1}^{|\mathcal{P}|-1}\mu_i\left(\log(1+e^{-\mathcal{S}_i})+\sum_{r_j<r_i}\log(1+e^{\mathcal{S}_j})\right)    (6)

The final training objective remains a multi-task objective, and we add a hyper-parameter $\lambda$ as the coefficient of the alignment loss:

\mathcal{L}=\mathcal{L}_{ft}+\frac{\lambda}{|\mathcal{P}|-1}\mathcal{L}_{align}    (7)

where $|\mathcal{P}|-1$ is the number of preferred-unpreferred contrasts, used to normalize the alignment loss. For each preference set constructed in the previous section, the model is trained and optimized with this objective.
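For clarity, the following is a minimal PyTorch sketch of Eqs. (3)-(7), assuming the candidate answers of one preference set are sorted by descending human preference and that their average log-likelihoods $\mathcal{S}_i$ have already been computed (e.g., as the negative of the per-answer loss in Eq. (1)).

```python
import torch
import torch.nn.functional as F

def knowpat_alignment_loss(seq_scores: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """seq_scores: shape (l,), the scores S_i of Eq. (3), sorted by descending human preference."""
    l = seq_scores.size(0)
    s_min, s_max = seq_scores.min(), seq_scores.max()
    mu = (seq_scores - s_min) / (s_max - s_min + 1e-8)  # Eq. (5): adaptive weights in [0, 1]
    align = seq_scores.new_zeros(())
    for i in range(l - 1):                               # the least preferred answer is never an anchor
        pos = F.softplus(-seq_scores[i])                 # log(1 + exp(-S_i)): pull the preferred answer up
        neg = F.softplus(seq_scores[i + 1:]).sum()       # sum over r_j < r_i of log(1 + exp(S_j)): push the rest down
        align = align + mu[i] * (pos + neg)              # Eq. (6)
    return (lam / (l - 1)) * align                       # alignment term of Eq. (7)

# Usage (illustrative): total loss of Eq. (7) for one training example.
# total = vft_loss(prompt, golden_answer) + knowpat_alignment_loss(scores)
```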

4 Experiments and Analysis

Table 1: The experimental results for traditional text generation metrics on two datasets. The red numbers represent the improvement of KnowPAT on each dataset. The best baseline performance is underlined.
Dataset Type Setting BLEU-1 BLEU-2 BLEU-3 BLEU-4 ROUGE-1 ROUGE-2 ROUGE-L METEOR
CPQA Zero-shot Vicuna 14.18 7.89 5.02 2.69 16.31 6.15 15.69 17.96
ChatGLM 14.21 8.36 5.41 2.79 15.38 5.64 14.75 19.34
Baichuan 15.51 9.08 5.86 2.87 16.74 6.64 15.81 19.71
Atom 10.07 4.11 2.06 8.15 6.24 1.99 6.02 11.31
ChatGPT 13.09 7.72 4.93 2.59 16.96 6.68 16.15 19.52
In-context Learning 1-shot 8.97 3.84 1.88 0.53 7.49 1.99 7.31 10.41
2-shot 9.11 3.84 1.85 0.50 7.34 1.82 7.01 9.88
4-shot 8.18 3.42 1.65 0.48 7.07 2.04 6.91 8.83
8-shot 7.79 3.29 1.70 0.79 6.57 1.38 6.41 8.19
Fine-tuning Atom 14.89 9.35 7.33 6.05 14.77 5.57 14.61 15.99
Alignment DPO 18.31 12.07 9.63 7.81 17.74 6.61 17.38 18.81
RRHF 11.99 6.32 4.52 3.47 12.56 4.08 12.29 12.62
SLiC 16.55 10.34 7.99 6.53 14.69 5.03 14.48 16.95
PRO 18.27 12.36 10.04 8.41 17.07 6.75 16.85 19.17
AFT-BC 18.39 12.17 9.86 7.81 18.09 7.14 17.76 19.48
AFT-DC 15.34 8.44 5.94 4.35 14.51 5.59 14.15 16.31
KnowPAT 21.87 15.59 13.21 11.14 19.91 8.31 19.62 22.42
+18.92% +26.13% +31.57% +32.46% +10.06% +16.38% +10.47% +15.09%
RJUA-QA Fine-tuning Atom 23.12 15.17 10.89 8.31 7.92 0.79 7.17 21.26
Alignment DPO 23.87 15.81 11.67 8.99 11.69 2.53 9.66 22.43
RRHF 22.32 14.94 10.86 8.41 7.39 1.40 5.92 21.66
SLiC 23.51 15.46 11.27 8.69 8.68 1.17 7.70 22.14
PRO 24.01 16.05 11.76 9.06 12.50 1.69 9.98 22.38
AFT-BC 24.43 16.27 12.10 9.37 9.06 2.03 7.31 23.30
AFT-DC 20.81 13.01 9.15 6.92 6.40 0.65 5.08 20.02
KnowPAT 25.61 18.04 13.95 11.38 10.75 4.26 10.46 24.48
+4.83% +10.88% +15.28% +21.45% - +68.38% +4.82% +5.64%
Table 2: The experimental results of model-based metrics. We report the BERTScore, reward score, and perplexity (PPL) for KnowPAT and the baseline methods. The best result of each metric is bold and the second best is underlined.
Setting BERTScore\uparrow Reward\uparrow PPL\downarrow
Fine Tuning 66.24 -1.64 31.13
RRHF 64.48 -1.67 31.26
SLiC 66.69 -1.74 32.51
PRO 67.41 -1.78 32.37
AFT 66.16 -2.25 30.11
KnowPAT 69.34 -1.69 29.93

In this section, we present the detailed experimental settings and analyze the experiment results to investigate the following four research questions:

(i) RQ1: How does KnowPAT perform compared with the baseline methods?

(ii) RQ2: Do the proposed modules in KnowPAT really benefit its performance?

(iii) RQ3: Are there intuitive cases that demonstrate the effectiveness of KnowPAT?

(iv) RQ4: Does the LLM retain its general ability rather than suffering catastrophic forgetting?

These four questions evaluate our approach on four dimensions: performance, design soundness, intuition, and usability in real scenarios. We will answer the four questions in the following sections.

4.1 Experiment Settings

4.1.1 Dataset Information

Our experiments are performed on both private and public datasets. The private CPQA dataset consists of a CPKG and QA pairs. The public dataset is RJUA-QA Lyu et al. (2023), which is a urology domain QA dataset. The detailed dataset information is presented in Appendix B.1.

4.1.2 Baseline Methods

To make a comprehensive study, we select four different types of baseline methods to demonstrate the effectiveness of our preference alignment approach. We not only want to show that alignment is a better framework for LLM application compared with other paradigms (e.g., zero-shot reasoning, in-context learning Dong et al. (2023), vanilla fine-tuning Ouyang et al. (2022); Fang et al. (2023)), but also that our method is better than other preference alignment methods Yuan et al. (2023); Zhao et al. (2023); Song et al. (2023); Wang et al. (2023b); Rafailov et al. (2023). Detailed information on the baselines is shown in Appendix B.2.

4.1.3 Evaluation Metrics

To make a comprehensive evaluation of the experimental results, we employ evaluation metrics from three aspects: traditional text generation metrics (BLEU Papineni et al. (2002), ROUGE Lin (2004), CIDEr Vedantam et al. (2015), and METEOR Banerjee and Lavie (2005)), model-based metrics (BERTScore Zhang et al. (2020a), PPL), and manual evaluation. Detailed information on the evaluation metrics is provided in Appendix B.3.

4.1.4 Implementation Details

In our experiment, we select Atom-7B (https://github.com/FlagAlpha/Llama2-Chinese) as the backbone LLM $\mathcal{M}$, an open-source version of Llama2 Touvron et al. (2023b, a) with Chinese vocabulary extension. As our dataset is mainly in Chinese, we choose Atom-7B-chat as our backbone model for experiments. Another consideration is that using an open-source Llama-architecture model enhances the generality of our method within the open-source LLM community ecosystem. For unsupervised triple linking, BGE-base-zh-v1.5 Xiao et al. (2023) is applied as the retriever $\mathcal{H}$ to encode and retrieve relevant knowledge candidates.

During training, we tune the backbone model with bf16 float precision. The number of training epochs is set to 3, and the gradient accumulation step is set to 8. We optimize the model using the AdamW optimizer Loshchilov and Hutter (2019) with a fixed learning rate of $3e^{-4}$. The coefficient hyper-parameter $\lambda$ is searched in $\{1, 0.1, 0.01, 0.001\}$.
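As a reference, the hyper-parameters above can be expressed as a Hugging Face TrainingArguments configuration; the output directory and per-device batch size are illustrative placeholders not specified in the paper.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./knowpat-atom-7b",   # placeholder path
    num_train_epochs=3,
    gradient_accumulation_steps=8,
    learning_rate=3e-4,
    bf16=True,                        # bf16 float precision
    per_device_train_batch_size=1,    # assumption, not stated in the paper
    optim="adamw_torch",              # AdamW optimizer
)
# The alignment coefficient lambda is searched over {1, 0.1, 0.01, 0.001} separately.
```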

4.2 Main Results (Q1)

The main results on the traditional metrics are shown in Table 1. As mentioned before, the traditional metrics measure the similarity between the generated answer and the golden answer. From the results, we observe that KnowPAT achieves obvious improvements compared with the baseline methods. We can also see that KnowPAT achieves a more significant improvement on BLEU-3 (42.03%) / BLEU-4 (43.99%) than on BLEU-1 (22.67%) / BLEU-2 (34.79%), which means that KnowPAT makes more significant progress in capturing complex phrases and discourse. In our cloud product QA scenario, these complex phrases are usually specialized terms that have a critical impact on the quality of the answer.

Besides, we evaluate our method with three model-based metrics, BERTScore Zhang et al. (2020a), reward score Yuan et al. (2023), and PPL Yuan et al. (2023), as shown in Table 2. We observe that KnowPAT still achieves good performance on model-based metrics such as BERTScore and PPL, which means that the results generated by KnowPAT are more acceptable to language models. KnowPAT also achieves relatively good results on the reward score.

Further, we conduct a human evaluation of our method against the baseline methods. Results from two models are shown to the human evaluator anonymously, and the evaluator chooses the better one. The model that generated the chosen result receives one point, and the competition results are shown in Figure 3. We observe from the figure that our method generates answers that are more acceptable to humans compared with other baselines, maintaining a relatively high win rate. Only in a small number of cases does KnowPAT perform worse than the baselines; most of the time KnowPAT is equal or better. Therefore, combining the above three perspectives of evaluation, we conclude that KnowPAT achieves outperforming results in the cloud product QA scenario.

Figure 3: The human evaluation results. For each competition, we randomly select 100 questions and compare the generated results of the two methods.

4.3 Ablation Study (Q2)

We conduct ablation experiments to verify the validity of each module design and evaluate the effectiveness of the components designed in KnowPAT. The results in Table 3 show that the fine-tuning objective $\mathcal{L}_{ft}$ and the alignment objective $\mathcal{L}_{align}$ both contribute to the model performance. Without fine-tuning (FT), the model performance drops severely, as the LLM is not tuned to fit the golden answers. Besides, both preference sets (SPS and KPS) contribute to the performance. The adaptive weights (AW) control the contribution of samples of different quality to the loss, which is also effective in KnowPAT.

Besides, we demonstrate the necessity of the CPKG with two groups of experiments. w/o RK denotes the experiment that removes the retrieved knowledge from the input prompt during the fine-tuning and preference alignment process. w/o KG denotes the experiment without the KG in the whole process, which means the KPS and the RK in the input prompt are both removed. From the results of these two groups of experiments, we observe that the CPKG plays a remarkable role in KnowPAT. In the design of KnowPAT, the CPKG not only serves as an external knowledge source during training but also participates in the preference set construction process, which is important to the model performance. In summary, each detailed design in KnowPAT has its unique role and contributes to the overall performance.

Table 3: The ablation study results. We evaluate various stripped-down versions of our model to compare the performance gain brought by different components. The full names of these abbreviations are as follows: FT (fine-tuning); AW (adaptive weight); SPS (style preference sets); KPS (knowledge preference sets); RK (retrieved knowledge).
Setting BLEU-1\uparrow ROUGE-1\uparrow Reward\uparrow PPL\downarrow
KnowPAT 22.56 20.28 -1.69 29.93
   w/o FT 13.17 12.91 -2.14 31.96
   w/o AW 21.87 19.91 -1.71 30.84
   w/o SPS 17.57 17.66 -1.75 31.08
   w/o KPS 16.12 16.51 -1.79 30.82
   w/o RK 17.46 17.56 -1.89 30.85
   w/o KG 15.09 16.55 -2.09 33.50

4.4 Case Study (Q3)

Table 4: The case study results for ground truth (GT), our KnowPAT predictions, and RRHF Yuan et al. (2023) results. The original Chinese text has been translated into English for clarity.
Question
Please provide the steps for handling IOPS
detection errors.
GT
It is recommended to replace the disk with one
that meets the IOPS specification.
Ours
It is recommended to replace the server with
device that meets the IOPS specifications.
RRHF
After ADAC troubleshooting, restart the business
and check whether it is valid.
Question
What is the explanation for the hwFlowRestoreFailed
alarm in CloudEngine 1800V product?
GT
The switch flow table restore failed (host_ip=
[host_ip], host_name=[host_name])
Ours
The switch flow table restore failed. (host_ip=
[host_ip], host_name=[host_name])
RRHF Flow table restore failed

To provide an intuition of the effectiveness of our method, we conduct a case study as shown in Table 4. We observe that the answers generated by KnowPAT are more similar to the golden answers while keeping a user-friendly tone and providing sufficient information, such as the host parameters in the second case. This suggests that the model learns appropriate style preferences. Besides, the retrieved knowledge in the first case consists of triples such as (EIP, used for, IP Binding) and (Select Box, belongs to, Alarm Management Component), none of which is helpful for answering this question. However, KnowPAT is not misled by this useless knowledge and generates the correct answer, while RRHF falls into the trap.

4.5 Knowledge Retention Analysis (Q4)

Figure 4: The commonsense ability on five domains.

As a project that needs to be deployed in real-world scenarios, the general ability of the trained model should also be carefully evaluated, because users may ask various kinds of questions once they adopt the model. We expect the model to keep the existing knowledge learned during pre-training while obtaining new knowledge about our domain. Thus, we conduct a commonsense evaluation of the trained models on the CMMLU Li et al. (2023) dataset, a benchmark for evaluating LLMs' Chinese ability. The evaluation result is shown in Figure 4. We report the general ability on five distinct commonsense domains (history, clinical, politics, computer science, and economics) for KnowPAT, vanilla Atom-7B (none), and other PA methods. As can be seen from the radar chart, there is a relatively significant decline in KnowPAT's ability in the clinical domain, but in politics, history, and economics it still maintains the ability of the original backbone model and even grows slightly. PRO, while unexpectedly showing a significant improvement on the economics questions, shows a more pronounced performance degradation than KnowPAT in several other areas. Taken together, such variations in KnowPAT's general ability are acceptable for our cloud product QA scenario.

5 Related Works

Preference alignment (PA) Wang et al. (2023d); Cheng et al. (2023) seeks to tailor pre-trained LLMs to align with human preferences (feedback) Ouyang et al. (2022). RLHF is a landmark work for PA, which leverages reinforcement learning (RL) Schulman et al. (2017) to align LLMs with human preference. Due to the sensitivity of RL hyper-parameters and the intricate three-stage process of RLHF, many PA approaches have been proposed to address these challenges. For example, RRHF Yuan et al. (2023) proposes a margin-rank loss to optimize LLMs without the need for an extra reward model. PRO Song et al. (2023) optimizes complex preference data with a list-wise contrastive loss. DPO Rafailov et al. (2023) proposes a direct preference optimization method that treats the LLM itself as a reward model. AFT Wang et al. (2023b) proposes a ranking-feedback boundary-constrained alignment loss to optimize the preference data. Besides, our work also relates to large language model applications and knowledge-enhanced QA. We give brief introductions to these fields in Appendices A.1 and A.2.

6 Conclusion

In this paper, we introduce a novel framework, knowledgeable preference alignment (KnowPAT), for domain-specific QA tasks in cloud product services, leveraging LLMs and KGs in a practical application setting. Our approach constructs knowledgeable preference sets by retrieving and utilizing knowledge triples to generate answers of different quality. A new alignment objective is designed to unleash the power of the preference sets. Comprehensive experiments demonstrate that our method surpasses existing solutions for this real-world challenge. Looking ahead, we aim to apply KnowPAT to more real scenarios such as enterprise-class services and to further investigate the potential of KG-enhanced LLM applications.

Acknowledgment

This work is funded by the National Natural Science Foundation of China (NSFC62306276 / NSFCU23B2055 / NSFCU19B2027 / NSFC91846204), Zhejiang Provincial Natural Science Foundation of China (No. LQ23F020017), Ningbo Natural Science Foundation (2023J291), Yongjiang Talent Introduction Programme (2022A-238-G), and the Fundamental Research Funds for the Central Universities (226-2023-00138).

Limitations

In this paper, we mainly focus on a real-world application problem: aligning LLMs with knowledge preference for better domain-specific QA. There are still some limitations in our work.

Domain-specific scenario. Our approach is designed for a specific domain (cloud product QA in our paper), and its effectiveness on general domains and open-source datasets is still subject to further validation. This will be the goal of our future endeavours.

Forms of external knowledge. In our paper, we apply knowledge bases to store the external background knowledge for the QA tasks. This is a convenient and efficient way of storing knowledge for our scenario, but in other scenarios, knowledge may be stored in other forms (e.g., unstructured text). Therefore, a more general framework that processes external knowledge in any format (KGs, unstructured text, documents) should be considered for broader usage, which is also our future plan.

Ethical Considerations

In this paper, we employ open-source LLMs to validate the effectiveness of our approach. Besides, the dataset we used is manually labeled with golden answers by domain experts who are legally employed, work at a suitable intensity, and are paid well above average wages. Their rights are well protected at work. The content of the dataset mainly concerns questions about our cloud product usage and does not involve private information or sensitive data of the target users. We promise that the content and collection procedure of our dataset do not violate scientific ethics.

References

  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: visual question answering. In ICCV, pages 2425–2433. IEEE Computer Society.
  • Baek et al. (2023) Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023. Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. CoRR, abs/2306.04136.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL, pages 65–72. Association for Computational Linguistics.
  • Bao et al. (2023a) Keqin Bao, Jizhi Zhang, Wenjie Wang, Yang Zhang, Zhengyi Yang, Yancheng Luo, Fuli Feng, Xiangnan He, and Qi Tian. 2023a. A bi-step grounding paradigm for large language models in recommendation systems. CoRR, abs/2308.08434.
  • Bao et al. (2023b) Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023b. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. In RecSys, pages 1007–1014. ACM.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In NeurIPS.
  • Chen et al. (2023a) Hao Chen, Runfeng Xie, Xiangyang Cui, Zhou Yan, Xin Wang, Zhanwei Xuan, and Kai Zhang. 2023a. LKPNR: LLM and KG for personalized news recommendation framework. CoRR, abs/2308.12028.
  • Chen et al. (2021) Zhuo Chen, Jiaoyan Chen, Yuxia Geng, Jeff Z. Pan, Zonggang Yuan, and Huajun Chen. 2021. Zero-shot visual question answering using knowledge graph. In ISWC, volume 12922 of Lecture Notes in Computer Science, pages 146–162. Springer.
  • Chen et al. (2022) Zhuo Chen, Yufeng Huang, Jiaoyan Chen, Yuxia Geng, Yin Fang, Jeff Z. Pan, Ningyu Zhang, and Wen Zhang. 2022. Lako: Knowledge-driven visual question answering via late knowledge-to-text injection. In IJCKG, pages 20–29. ACM.
  • Chen et al. (2023b) Zhuo Chen, Wen Zhang, Yufeng Huang, Mingyang Chen, Yuxia Geng, Hongtao Yu, Zhen Bi, Yichi Zhang, Zhen Yao, Wenting Song, Xinliang Wu, Yi Yang, Mingyi Chen, Zhaoyang Lian, Yingying Li, Lei Cheng, and Huajun Chen. 2023b. Tele-knowledge pre-training for fault analysis. In ICDE, pages 3453–3466. IEEE.
  • Cheng et al. (2023) Pengyu Cheng, Jiawen Xie, Ke Bai, Yong Dai, and Nan Du. 2023. Everyone deserves A reward: Learning customized human preferences. CoRR, abs/2309.03126.
  • Cui et al. (2017) Wanyun Cui, Yanghua Xiao, Haixun Wang, Yangqiu Song, Seung-won Hwang, and Wei Wang. 2017. KBQA: learning question answering over QA corpora and knowledge bases. Proc. VLDB Endow., 10(5):565–576.
  • Dao et al. (2023) Xuan-Quy Dao, Ngoc-Bich Le, Xuan-Dung Phan, and Bac-Bien Ngo. 2023. An evaluation of chatgpt’s proficiency in english language testing of the vietnamese national high school graduation examination. Available at SSRN 4473369.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), pages 4171–4186. Association for Computational Linguistics.
  • Dong et al. (2023) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A survey for in-context learning. CoRR, abs/2301.00234.
  • Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: general language model pretraining with autoregressive blank infilling. In ACL (1), pages 320–335. Association for Computational Linguistics.
  • Fang et al. (2023) Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. 2023. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. CoRR, abs/2306.08018.
  • Gao et al. (2021) Shen Gao, Xiuying Chen, Zhaochun Ren, Dongyan Zhao, and Rui Yan. 2021. Meaningful answer generation of e-commerce question-answering. ACM Trans. Inf. Syst., 39(2):18:1–18:26.
  • Gao et al. (2019) Shen Gao, Zhaochun Ren, Yihong Eric Zhao, Dongyan Zhao, Dawei Yin, and Rui Yan. 2019. Product-aware answer generation in e-commerce question-answering. In WSDM, pages 429–437. ACM.
  • Huang et al. (2023) Quzhe Huang, Mingxu Tao, Chen Zhang, Zhenwei An, Cong Jiang, Zhibin Chen, Zirui Wu, and Yansong Feng. 2023. Lawyer llama technical report.
  • Jiang et al. (2023a) Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023a. Structgpt: A general framework for large language model to reason over structured data. CoRR, abs/2305.09645.
  • Jiang et al. (2023b) Jinhao Jiang, Kun Zhou, Xin Zhao, and Ji-Rong Wen. 2023b. Unikgqa: Unified retrieval and reasoning for solving multi-hop question answering over knowledge graph. In ICLR. OpenReview.net.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769–6781. Association for Computational Linguistics.
  • Li et al. (2023) Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. 2023. Cmmlu: Measuring massive multitask language understanding in chinese.
  • Li et al. (2020) Linfeng Li, Peng Wang, Jun Yan, Yao Wang, Simin Li, Jinpeng Jiang, Zhe Sun, Buzhou Tang, Tsung-Hui Chang, Shenghui Wang, and Yuting Liu. 2020. Real-world data medical knowledge graph: construction and applications. Artif. Intell. Medicine, 103:101817.
  • Liang et al. (2022) Ke Liang, Lingyuan Meng, Meng Liu, Yue Liu, Wenxuan Tu, Siwei Wang, Sihang Zhou, X Liu, and F Sun. 2022. A survey of knowledge graph reasoning on graph types: Static, dynamic, and multimodal.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  • Liu et al. (2020) Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020. K-BERT: enabling language representation with knowledge graph. In AAAI, pages 2901–2908. AAAI Press.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In ICLR (Poster). OpenReview.net.
  • Luo et al. (2023) Haoran Luo, Haihong E, Zichen Tang, Shiyao Peng, Yikai Guo, Wentai Zhang, Chenghao Ma, Guanting Dong, Meina Song, and Wei Lin. 2023. Chatkbqa: A generate-then-retrieve framework for knowledge base question answering with fine-tuned large language models. CoRR, abs/2310.08975.
  • Lyu et al. (2023) Shiwei Lyu, Chenfei Chi, Hongbo Cai, Lei Shi, Xiaoyan Yang, Lei Liu, Xiang Chen, Deng Zhao, Zhiqiang Zhang, Xianguo Lyu, et al. 2023. Rjua-qa: A comprehensive qa dataset for urology. arXiv preprint arXiv:2312.09785.
  • Nguyen and Nguyen (2023) Duc-Vu Nguyen and Quoc-Nam Nguyen. 2023. Evaluating the symbol binding ability of large language models for multiple-choice questions in vietnamese general education. arXiv preprint arXiv:2310.12059.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318. ACL.
  • Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. CoRR, abs/2305.18290.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. CoRR, abs/1707.06347.
  • Song et al. (2023) Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. 2023. Preference ranking optimization for human alignment. CoRR, abs/2306.17492.
  • Su et al. (2019) Dan Su, Yan Xu, Genta Indra Winata, Peng Xu, Hyeondey Kim, Zihan Liu, and Pascale Fung. 2019. Generalizing question answering system with pre-trained language model fine-tuning. In MRQA@EMNLP, pages 203–211. Association for Computational Linguistics.
  • Sun et al. (2020) Rui Sun, Xuezhi Cao, Yan Zhao, Junchen Wan, Kun Zhou, Fuzheng Zhang, Zhongyuan Wang, and Kai Zheng. 2020. Multi-modal knowledge graphs for recommender systems. In CIKM, pages 1405–1414. ACM.
  • Thakkar et al. (2023) Megh Thakkar, Tolga Bolukbasi, Sriram Ganapathy, Shikhar Vashishth, Sarath Chandar, and Partha Talukdar. 2023. Self-influence guided data reweighting for language model pre-training.
  • Tian et al. (2023) Yijun Tian, Huan Song, Zichen Wang, Haozhu Wang, Ziqing Hu, Fang Wang, Nitesh V. Chawla, and Panpan Xu. 2023. Graph neural prompting with large language models. CoRR, abs/2309.15427.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In CVPR, pages 4566–4575. IEEE Computer Society.
  • Wang et al. (2023a) Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023a. Huatuo: Tuning llama model with chinese medical knowledge.
  • Wang et al. (2023b) Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, and Zhifang Sui. 2023b. Making large language models better reasoners with alignment. CoRR, abs/2309.02144.
  • Wang et al. (2018) Peng Wang, Qi Wu, Chunhua Shen, Anthony R. Dick, and Anton van den Hengel. 2018. FVQA: fact-based visual question answering. IEEE Trans. Pattern Anal. Mach. Intell., 40(10):2413–2427.
  • Wang et al. (2017) Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE Trans. Knowl. Data Eng., 29(12):2724–2743.
  • Wang et al. (2019) Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. KGAT: knowledge graph attention network for recommendation. In KDD, pages 950–958. ACM.
  • Wang et al. (2023c) Yanan Wang, Michihiro Yasunaga, Hongyu Ren, Shinya Wada, and Jure Leskovec. 2023c. Vqa-gnn: Reasoning with multimodal knowledge via graph neural networks for visual question answering.
  • Wang et al. (2023d) Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023d. Aligning large language models with human: A survey. CoRR, abs/2307.12966.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
  • West et al. (2023) Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D. Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, Benjamin Newman, Pang Wei Koh, Allyson Ettinger, and Yejin Choi. 2023. The generative ai paradox: "what it can create, it may not understand".
  • Xiao et al. (2023) Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. C-pack: Packaged resources to advance general chinese embedding.
  • Yasunaga et al. (2021) Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. 2021. QA-GNN: reasoning with language models and knowledge graphs for question answering. In NAACL-HLT, pages 535–546. Association for Computational Linguistics.
  • Yoon et al. (2019) Wonjin Yoon, Jinhyuk Lee, Donghyeon Kim, Minbyul Jeong, and Jaewoo Kang. 2019. Pre-trained language model for biomedical question answering. In PKDD/ECML Workshops (2), volume 1168 of Communications in Computer and Information Science, pages 727–740. Springer.
  • Yuan et al. (2023) Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. RRHF: rank responses to align language models with human feedback without tears. CoRR, abs/2304.05302.
  • Zeng et al. (2023) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130B: an open bilingual pre-trained model. In ICLR. OpenReview.net.
  • Zhang et al. (2023a) Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023a. Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation. In RecSys, pages 993–999. ACM.
  • Zhang et al. (2023b) Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023b. Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation. In RecSys, pages 993–999. ACM.
  • Zhang et al. (2023c) Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023c. Recommendation as instruction following: A large language model empowered recommendation approach. CoRR, abs/2305.07001.
  • Zhang et al. (2023d) Lingxi Zhang, Jing Zhang, Yanling Wang, Shulin Cao, Xinmei Huang, Cuiping Li, Hong Chen, and Juanzi Li. 2023d. FC-KBQA: A fine-to-coarse composition framework for knowledge base question answering. In ACL (1), pages 1002–1017. Association for Computational Linguistics.
  • Zhang et al. (2023e) Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, and Guoyin Wang. 2023e. Instruction tuning for large language models: A survey. CoRR, abs/2308.10792.
  • Zhang et al. (2020a) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020a. Bertscore: Evaluating text generation with BERT. In ICLR. OpenReview.net.
  • Zhang et al. (2021) Wen Zhang, Chi Man Wong, Ganqiang Ye, Bo Wen, Wei Zhang, and Huajun Chen. 2021. Billion-scale pre-trained e-commerce product knowledge graph model. In ICDE, pages 2476–2487. IEEE.
  • Zhang et al. (2020b) Yong Zhang, Ming Sheng, Rui Zhou, Ye Wang, Guangjie Han, Han Zhang, Chunxiao Xing, and Jing Dong. 2020b. HKGB: an inclusive, extensible, intelligent, semi-auto-constructed knowledge graph framework for healthcare with clinicians’ expertise incorporated. Inf. Process. Manag., 57(6):102324.
  • Zhao et al. (2023) Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. 2023. Slic-hf: Sequence likelihood calibration with human feedback. CoRR, abs/2305.10425.
  • Zhu et al. (2023) Yuqi Zhu, Xiaohan Wang, Jing Chen, Shuofei Qiao, Yixin Ou, Yunzhi Yao, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023. Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities. CoRR, abs/2305.13168.
  • Zhu et al. (2021) Yushan Zhu, Huaixiao Zhao, Wen Zhang, Ganqiang Ye, Hui Chen, Ningyu Zhang, and Huajun Chen. 2021. Knowledge perceived multi-modal pretraining in e-commerce. In ACM Multimedia, pages 2744–2752. ACM.

Appendix

Appendix A Related Works

A.1 KG-enhanced Question Answering

Knowledge graphs (KGs) Wang et al. (2017); Liang et al. (2022) are complex semantic webs that model world knowledge as structural triples of the form (head entity, relation, tail entity). KGs serve as external knowledge sources and benefit many AI tasks such as language model pre-training Liu et al. (2020), question answering Yasunaga et al. (2021); Wang et al. (2023c), and recommendation systems Wang et al. (2019); Sun et al. (2020). Besides, domain-specific KGs are important infrastructure in the internet industry for providing exact factual knowledge and are widely leveraged in E-commerce Zhu et al. (2021); Zhang et al. (2021), telecom fault analysis Chen et al. (2023b), health care Li et al. (2020); Zhang et al. (2020b), and so on. Utilizing KGs in real industry applications is a popular topic. In our scenario, we construct a domain-specific KG for cloud service products to benefit our question answering (QA) task. QA stands as a cornerstone in NLP, aiming to equip machines with the capability to autonomously respond to human queries Su et al. (2019); Yoon et al. (2019). QA tasks take various forms. Some require selection from multiple choices, as seen in certain knowledge base QA (KBQA) Cui et al. (2017); Tian et al. (2023); Baek et al. (2023) and visual question answering (VQA) Antol et al. (2015); Chen et al. (2021); Wang et al. (2018) settings. Conversely, tasks like open-domain QA often challenge systems to directly produce textual responses without a set answer pool Gao et al. (2021); Karpukhin et al. (2020). In the last few years, fine-tuning pre-trained language models has been a leading approach for QA tasks. Models like BERT Devlin et al. (2019) and T5 Raffel et al. (2020) have achieved notable performance when adapted with question-answer pairs.

We hold that QA is not merely an academic pursuit; it acts as a bridge, facilitating the adoption of AI technologies in real-world applications. Numerous industrial efforts have been directed toward developing domain-specific QA systems to meet the needs of their users Gao et al. (2021, 2019). Such systems often rely on domain-specific knowledge bases, like knowledge graphs (KGs), to provide relevant information for the posed questions. Our current investigation aligns with this trend, focusing on a domain-specific QA scenario for cloud service products. Moreover, our approach diverges from recent KG-based QA systems Jiang et al. (2023b, a); Luo et al. (2023); Zhang et al. (2023d); Chen et al. (2022) that utilize prompts for dialog with (large) language models to facilitate path reasoning and refine the scope of KG retrieval. We propose a knowledgeable preference alignment framework that enhances KG-aware QA with knowledge preference.

A.2 Large Language Model Application

Prominent large language models (LLMs) like GPT OpenAI (2023); Brown et al. (2020); Ouyang et al. (2022) and GLM Zeng et al. (2023); Du et al. (2022) are sparking a wave of research in the community due to their generalization ability in many NLP tasks such as relation extraction Zhu et al. (2023), algebraic reasoning Wei et al. (2022), and question answering Dao et al. (2023); Nguyen and Nguyen (2023). Most LLMs leverage the transformer Vaswani et al. (2017) architecture and benefit from training on vast corpora Thakkar et al. (2023) through autoregressive tasks. Deploying and applying LLMs in real-life scenarios is also a major topic in industry today, and several efforts have been made. For example, many works Zhang et al. (2023a); Bao et al. (2023b); Zhang et al. (2023c); Bao et al. (2023a); Zhang et al. (2023b); Chen et al. (2023a) attempt to build recommendation systems with LLMs. Some works like Huatuo Wang et al. (2023a) and LawyerLlama Huang et al. (2023) develop LLMs for domain-specific usage.

Our work proposes a knowledgeable preference alignment framework to incorporate the domain-specific KG into the preference alignment pipeline for the LLM application. By constructing a knowledgeable preference set, the LLMs are trained to align the knowledge preference with humans and select better factual knowledge in the input prompt to solve the QA task.

Appendix B Experiment Details

B.1 Dataset Details

The detailed information about our datasets is shown in this section. We evaluate the performance of KnowPAT on two datasets: one private dataset, CPQA, constructed by us, and one public dataset, RJUA-QA Lyu et al. (2023). They are both domain-specific datasets.

  • CPQA is a cloud-product-domain dataset labeled by a team of human experts, containing 8909 QA pairs. CPQA employs a cloud product knowledge graph (CPKG) as the domain KB.

  • RJUA-QA is an open-source urology-domain dataset extracted from real-world medical records, containing 2132 QA pairs. RJUA-QA labels a series of medical context documents for each QA pair, and we collect these contexts in the form of documents as the domain KB.

B.2 Baseline Details

(i) Zero-shot approach, which directly prompts the LLM with the input question to get the answer without training.

(ii) In-context learning Dong et al. (2023) approach, which samples a few ($k$-shot) QA pairs from the training dataset as demonstrations and gets the answers from the LLM without training.

(iii) Vanilla fine-tuning approach, which fine-tunes the LLM using the QA pairs with or without retrieved knowledge as in Equation 1. The fine-tuning baseline with retrieved knowledge is also known as the retrieval-augmented generation (RAG) method.

(iv) Preference alignment approaches, which introduce additional preference alignment objectives during training to align with human preference. We select five existing state-of-the-art (SOTA) PA methods including RRHF Yuan et al. (2023), SLiC-HF Zhao et al. (2023), DPO Rafailov et al. (2023), PRO Song et al. (2023), AFT (both AFT-BC and AFT-DC) Wang et al. (2023b) as our baselines.

B.3 Evaluation Details

We select three types of metrics to evaluate our method against baselines. The detailed information on the metrics is listed in the following:

(i) Traditional text generation metrics. We select several traditional text generation metrics such as BLEU Papineni et al. (2002), ROUGE Lin (2004), CIDEr Vedantam et al. (2015), and METEOR Banerjee and Lavie (2005) to evaluate the generated answers. However, these metrics mainly measure the text-level similarity between the generated answers and the golden answers, so they cannot fully reflect semantic relevance or depth of understanding.

(ii) Model-based metrics. To evaluate the semantic similarity between the generated answers and the golden answers, we employ several model-based metrics such as BERTScore Zhang et al. (2020a), perplexity (PPL), and the preference score. These metrics evaluate the generated answers using various language models. BERTScore employs BERT Devlin et al. (2019) to calculate the semantic similarity between two sentences. PPL measures the ability of the LLM to understand and predict the entire sentence; a minimal sketch of the PPL computation is given at the end of this subsection. The preference score is the score $\mathcal{S}$ defined in Equation 3, reflecting the model's preference degree for the current answer.

(iii) Manual evaluation metrics. We employ human labelers to evaluate the results from different methods. The labeler makes a judgment on two answers from unknown sources in a single-blind situation, chooses the better one, and counts the results. The comparison result in each turn is recorded as win/tie/lose.

The three categories of metrics capture complementary aspects of the results at three levels: similarity at the textual level, similarity at the semantic level, and human preference.
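As mentioned in (ii), below is a minimal sketch of the perplexity computation with a causal LM, assuming the Hugging Face transformers API; the checkpoint name is a placeholder. PPL is the exponential of the mean negative log-likelihood per token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("FlagAlpha/Atom-7B-Chat")  # placeholder checkpoint
lm = AutoModelForCausalLM.from_pretrained("FlagAlpha/Atom-7B-Chat", torch_dtype=torch.bfloat16)

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(input_ids=ids, labels=ids).loss  # mean negative log-likelihood per token
    return torch.exp(loss).item()
```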

B.4 Implementation Details

For the zero-shot baselines, we select several different LLMs: ChatGPT Ouyang et al. (2022), ChatGLM-6B Zeng et al. (2023), Baichuan-7B (https://github.com/baichuan-inc/Baichuan-7B), Vicuna-7B (https://github.com/lm-sys/FastChat), and Atom-7B-CP. For in-context learning, we sample 1, 2, 4, and 8-shot QA pairs as demonstrations to support the input question. For the PA methods, we leverage the official code of RRHF Yuan et al. (2023) and implement the other PA methods (SLiC-HF Zhao et al. (2023), PRO Song et al. (2023), AFT Wang et al. (2023b)) based on this code to reproduce the results on our preference dataset. The selection of hyper-parameters follows the original papers. Atom-7B-CP is employed as the backbone model for all the baseline methods, including in-context learning, vanilla fine-tuning, and the PA methods.