
Blending Reward Functions via Few Expert Demonstrations for Faithful and Accurate Knowledge-Grounded Dialogue Generation

Wanyu Du  Yangfeng Ji
Department of Computer Science
University of Virginia
Charlottesville, VA 22903
{wd5jq,yangfeng}@virginia.edu
Abstract

The development of trustworthy conversational information-seeking systems relies on dialogue models that can generate faithful and accurate responses based on relevant knowledge texts. However, two main challenges hinder this task. Firstly, language models may generate hallucinations due to data biases present in their pretraining corpus. Secondly, knowledge texts often contain redundant and irrelevant information that distracts the model’s attention from the relevant text span. Previous works use additional data annotations on the knowledge texts to learn a knowledge identification module in order to bypass irrelevant information, but collecting such high-quality span annotations can be costly. In this work, we leverage reinforcement learning algorithms to overcome the above challenges by introducing a novel reward function. Our reward function combines an accuracy metric and a faithfulness metric to provide a balanced quality judgment of generated responses, which can be used as a cost-effective approximation to a human preference reward model when only a few preference annotations are available. Empirical experiments on two conversational information-seeking datasets demonstrate that our method can compete with other strong supervised learning baselines.

1 Introduction

History User: What kind of rockets does SpaceX have? ||Agent: One is called the Falcon Heavy and can put about 8,000 kilograms into geostationary transfer orbit with side boosters. ||User: Wow, that’s pretty cool. Do you know anything about space elevators?
Knowledge Text The design would permit vehicles to travel along the cable from a planetary surface, such as the Earth’s, directly into space or orbit, without the use of large rockets.
Human Reference Yeah, they would allow vehicles to move along a cable into space, instead of on a rocket.
LLM Output Yes, a space elevator is a proposed type of transportation system. It consists of a cable anchored to the surface of a planet, with the other end in space. The cable would be held up by the centrifugal force generated by its orbital motion around. (unfaithful)
History User: How do i apply for Medicare?
Knowledge Text Benefits Planner: Retirement | Applying for Medicare Only | Social Security Administration // Applying for Medicare Only. If you are within three months of age 65 or older and not ready to start your monthly Social Security benefits yet, you can use our online retirement application to sign up just for Medicare and wait to apply for your retirement or spouse’s benefits later. Before you decide, you need to be sure that you understand how waiting until later will affect: the lifetime benefits we can pay on your account and your health insurance coverage.
Human Reference You can use our online retirement application to sign up just for Medicare and wait to apply for your retirement or spouse’s benefits.
LLM Output You can use the Social Security Administration’s online retirement application to sign up for Medicare only. Be aware that if you wait to apply for retirement or spouse benefits later, it may affect the benefits you receive and your health insurance coverage. (semantically incorrect)
Table 1: Two example responses generated by an LLM (text-davinci-003) in knowledge-grounded information-seeking conversations. The green text highlights the ground-truth knowledge span. The gray text is the unfaithful model-generated response caused by the bias from the LLM. The red text is the incorrect model-generated response misled by the text that contains redundant information.

Recent large language models (LLMs) have enabled conversational information-seeking systems to exhibit remarkable proficiency in producing fluent and coherent responses (Thoppilan et al., 2022; Nakano et al., 2022; Menick et al., 2022; Ouyang et al., 2022). However, these models sometimes fail to generate faithful and accurate responses supported by verified knowledge texts. This undesirable model behavior stems from three distinct sources: the bias inherent in LLMs, the irrelevant information in input knowledge texts, and the characteristics of the employed learning algorithms. Firstly, LLMs are likely to generate texts that are most frequently seen during pretraining (Kandpal et al., 2022; Wang et al., 2023), and may either disregard the knowledge texts or generate additional information not provided in them, as illustrated in Table 1. This problem is also described as hallucination in other works (Maynez et al., 2020; Cao et al., 2022; Dziri et al., 2022a). Secondly, the input knowledge texts often contain redundant and irrelevant information that diverts the model’s attention away from the relevant text span (Chen et al., 2021), as demonstrated in Table 1. Lastly, the model is trained to maximize the likelihood of ground-truth responses, which is prone to exposure bias (Ranzato et al., 2016; Du and Ji, 2019) and sometimes fails to align with the factual content of the input knowledge texts (Wang and Sennrich, 2020; Li et al., 2022a).

Knowledge-grounded dialogue generation techniques are commonly leveraged in building conversational information-seeking systems. Aligning with the problem-source analysis above, we classify related prior works into three categories. The first category proposes to use additional data annotations on the knowledge texts to filter out redundant and irrelevant information, such as learning a knowledge selection module (Dinan et al., 2019; Feng et al., 2021; Kim et al., 2020; Wu et al., 2021b) or leveraging cross-attention scores to identify important texts (Zhao et al., 2020a; Xu et al., 2021). While directly learning from additional data annotations can improve the accuracy of locating the intended text spans, high-quality annotated data is often expensive to collect. The second group of works modifies the model architecture (Prabhumoye et al., 2021) or refines the decoding process (Tian et al., 2020) to force the model to generate responses that are more faithful to the knowledge texts. The third line of works switches to alternative learning algorithms to explore the optimal dialogue generation policy, such as unlikelihood training (Li et al., 2020), imitation learning (Liu et al., 2018), and reinforcement learning (Li et al., 2016; Stiennon et al., 2020; Ramamurthy et al., 2023). However, previous learning algorithms focus on optimizing the semantic coherence and fluency of generated responses, while improving the faithfulness and accuracy of generated responses in knowledge-grounded conversations remains understudied.

In this work, we apply reinforcement learning (RL) algorithms to learn a faithful and accurate dialogue generation policy. On one hand, fine-tuning LLMs with RL on downstream datasets can help alleviate the bias learned from the pretraining corpus; on the other hand, an appropriate reward function can guide LLMs to generate responses that align with the relevant knowledge text. Our key contribution is the design of a novel reward function that combines two automatic metrics via expert demonstrations for effective evaluation. In this design, each metric addresses one specific concern of response generation discussed above: (1) the accuracy metric measures the similarity between model-generated responses and ground-truth references, aiming to make the generated response coherent with the dialogue context; (2) the faithfulness metric evaluates the similarity between model-generated responses and input knowledge texts, aiming to align the factual content of the generated response with the knowledge text.

We blend the two automatic metrics to approximate a balanced quality judgment of generated responses. The blending coefficient is learned from a few expert demonstrations of pair-wise quality judgments on two LLMs’ outputs. Our reward function can be used as a cost-effective approximation to a human preference reward model (Stiennon et al., 2020) when only a few (e.g., 25) human preference annotations are available. This approach improves both faithfulness and accuracy in knowledge-grounded dialogue generation while reducing the reliance on massive human preference annotations. Empirical experiments on two information-seeking conversation benchmark datasets, MultiDoc2Dial (Feng et al., 2021) and FaithDial (Dziri et al., 2022a), show that our method obtains improved performance in faithfulness and accuracy compared with other strong supervised learning baselines.

We summarize the contributions of this work as follows:

  1. Identifying three major sources of the problem of generating unfaithful and inaccurate responses in knowledge-grounded conversations.

  2. Proposing a new reward function for reinforcement learning algorithms that can improve the faithfulness and accuracy of generated responses.

  3. Conducting empirical experiments to demonstrate the effectiveness of our method compared with strong supervised learning baselines.

2 Related Work

Knowledge-Grounded Dialogue Generation.

Previous works in knowledge-grounded dialogue generation train language models conditioned on knowledge texts, with the goal of maximizing the likelihood of ground-truth responses (Ghazvininejad et al., 2018; Dinan et al., 2019; Zhang et al., 2020b; Bao et al., 2022; Peng et al., 2022). Some works apply multi-task learning and transfer learning techniques to improve dialogue generation quality via joint learning with other text generation tasks (Shuster et al., 2020; Raffel et al., 2020a; Li et al., 2022b). In addition, Zhan et al. (2021) learn the knowledge transition in multi-turn conversations to better select knowledge texts for response generation. Other works leverage the retriever-reader architecture to learn knowledge text representations for improving generation quality (Lewis et al., 2020; Izacard and Grave, 2021). In contrast, this work leverages reinforcement learning algorithms to learn a faithful and accurate dialogue generation policy.

Knowledge Identification in Document-Grounded Dialogues.

A majority of works incorporate a knowledge identification module in the document-grounded dialogue generation task (Dinan et al., 2019; Kim et al., 2020; Chen et al., 2020; Feng et al., 2021). The knowledge identification module prevents the language model from attending to irrelevant knowledge texts and avoids generating inappropriate responses (Chen et al., 2020). One line of works learns a knowledge identification module with explicit span annotations before generation (Wu et al., 2021b; Zhao et al., 2020b). Another line of works models the grounding knowledge texts as latent variables (Zhao et al., 2020a; Kim et al., 2020). This work requires neither data annotations on knowledge texts nor modifications to the model architecture. We fine-tune LLMs with RL using our novel reward function to achieve improved accuracy and faithfulness.

Faithful Text Generation.

Faithful text generation is concerned with whether the generated content is factually consistent with the input information (Li et al., 2022a). Some works improve the factual consistency of dialogues with a natural language inference model that selects the most faithful candidates during inference (Welleck et al., 2019; Qin et al., 2021), or use the entailment score from the inference model as a reward to learn a better dialogue policy (Song et al., 2020; Mesgar et al., 2021). Tian et al. (2020) design constrained decoding strategies to improve faithfulness. Nye et al. (2021) propose a dual-system approach, in which the first system generates a set of candidate responses and the second system checks whether the generated responses contain contradictions or commonsense violations. Our method combines the faithfulness and accuracy metrics to better approximate human preference in the low-data setting.

3 Learning Faithful and Accurate Generation Policy with RL

3.1 Problem Definition

Given the knowledge text $K_n$ and the conversation history $X=(\bm{u}_0,\cdots,\bm{u}_{n-1})$, the task is to generate a response $\bm{u}_n$ that is faithful to $K_n$ and coherent with the conversation history $X$. As the coherence evaluation in dialogue generation varies across different specific tasks and domains, in this work, we follow prior works (Feng et al., 2021; Dziri et al., 2022a) and approximate it by calculating the accuracy between the generated response $\bm{u}_n$ and the ground-truth reference $\bm{y}_n$.

Following (Sutton and Barto, 2018; Ramamurthy et al., 2023), we formulate the response generation $\bm{u}_n=(a_0,\cdots,a_T)$ as a Markov Decision Process (MDP) $\langle\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma\rangle$. $\mathcal{S}$ is a finite set of states, where the initial state $s_0\in\mathcal{S}$ is the concatenation of the input conversation history $X$ and the knowledge text $K_n$. $\mathcal{A}$ is a finite set of actions, where an action $a_t\in\mathcal{A}$ is a token from our vocabulary $\mathcal{V}$. $\mathcal{P}:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$ is a transition function that determines the next state $s_{t+1}$ given the current state-action pair $(s_t,a_t)$. $\mathcal{R}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is a reward function that returns a real number given the current state-action pair $(s_t,a_t)$. $\gamma\in[0,1]$ is a discount factor. Each episode in the MDP begins by sampling a datapoint $(\bm{u}_0,\cdots,\bm{u}_{n-1},\bm{y}_n,K_n)$ from the dataset, and ends when the current time step $t$ exceeds the horizon $T$ or an end-of-sentence token is generated.
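To make the episode construction concrete, the sketch below shows one way the initial state $s_0$ could be assembled from the conversation history and knowledge text; the tokenizer choice, separator string, and truncation length are illustrative assumptions, not the paper's exact preprocessing.

```python
# A minimal sketch of building the initial MDP state s_0, assuming a
# HuggingFace-style tokenizer; the "<knowledge>" separator is hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

def build_initial_state(history, knowledge, max_len=512):
    """Concatenate the conversation history (u_0, ..., u_{n-1}) and the
    knowledge text K_n into the initial state s_0."""
    context = " || ".join(history)                 # turn separator as in Table 1
    source = f"{context} <knowledge> {knowledge}"  # assumed input format
    return tokenizer(source, truncation=True, max_length=max_len,
                     return_tensors="pt").input_ids
```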

3.2 Proximal Policy Optimization (PPO)

The policy $\pi_\theta:\mathcal{S}\to\mathcal{A}$ is a function that selects an action in a given state in order to maximize the long-term discounted reward over a trajectory, $\mathbb{E}_{\pi}[\sum_{t=0}^{T}\gamma^{t}R_{t}]$. In this work, we initialize the policy $\pi_\theta$ with a pre-trained language model $\pi_0$. We learn the policy using the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017), which is an effective actor-critic algorithm for many text generation tasks (Stiennon et al., 2020; Ramamurthy et al., 2023). The advantage is approximated using Generalized Advantage Estimation (Schulman et al., 2018): $A_t=\sum_{\tau=t}^{T}(\gamma\lambda)^{\tau-t}(R_{\tau}+\gamma V_{\phi}(s_{\tau+1})-V_{\phi}(s_{\tau}))$, where $\lambda$ is a hyper-parameter, $\gamma$ is the discount factor, $R_t$ is the reward assigned to $a_t$, and $V_{\phi}(s_t)$ is the value of state $s_t$ given by the value network $V_{\phi}$. Note that the reward $R_t$ is regularized with a token-level KL penalty (Wu et al., 2021a) to prevent the updated policy $\pi_\theta$ from deviating too far from the pre-trained language model $\pi_0$:

$R_t = R_t - \beta\,\text{KL}\big(\pi_\theta(a_t|s_t)\,\|\,\pi_0(a_t|s_t)\big)$ (1)

where $\beta$ is a dynamically adapted coefficient (Ziegler et al., 2020). During training, the policy network $\pi_\theta$ and the value network $V_\phi$ are optimized jointly.[1]

[1] In this work, we use the RL4LMs library to learn the response generation policy; please refer to Ramamurthy et al. (2023) for more implementation details.
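As a concrete illustration, the sketch below shows one way to compute the KL-regularized token rewards of Equation 1 and the GAE advantages for a single trajectory; approximating the token-level KL by the log-probability difference of the sampled action, and the default values of $\gamma$ and $\lambda$, are our own assumptions rather than details from the paper.

```python
import torch

def kl_penalized_rewards(rewards, logprobs_policy, logprobs_ref, beta):
    """Equation 1: R_t <- R_t - beta * KL, with the token-level KL
    approximated by log pi_theta(a_t|s_t) - log pi_0(a_t|s_t) (assumption)."""
    return rewards - beta * (logprobs_policy - logprobs_ref)

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory of length T,
    using the standard backward recursion (equivalent to the sum above)."""
    T = rewards.size(0)
    advantages = torch.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # V(s_{T+1}) = 0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages
```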

In-Context Example
Instruction: The following is a conversation with an AI assistant. The assistant is providing an answer based on a knowledge passage.
Human: I recently discovered rap music and I’m so intrigued by it! Do you listen to rap music?
Knowledge Text: Rapping is a musical form of vocal delivery that incorporates ”rhyme, rhythmic speech, and street vernacular”, which is performed or chanted in a variety of ways.
AI: No, I’m a bot and can’t hear. I know that it’s a form of music that involves chanting and rhythmic speech.
LLM Input
Human: Chevrolet is my all time favorite car brand, have you heard of it?
Knowledge Text: The Chevrolet Chevy II/Nova was a small automobile manufactured by Chevrolet, and produced in five generations for the 1962 through 1979, and 1985 through 1988 model years.
LLM Output
Model Output 1: Yes, I’m familiar with Chevrolet. It’s a car brand that has been around since the early 1900s and has produced popular models like the Chevy II/Nova, Corvette, Camaro, and Impala.
Model Output 2: Yes, I heard of Chevrolet. It’s a popular car brand that has been manufacturing cars for several decades. Chevy II/Nova was produced in five generations between 1962 and 1988.
Table 2: An example from the validation set of FaithDial used by a human expert to perform a pair-wise comparison between the outputs of text-davinci-003 and GPT-3.5-turbo. The gray text is the unfaithful model-generated response caused by the bias from the LLM.

3.3 Blending Reward Functions via Expert Demonstrations

Blending Reward Functions.

We propose leveraging existing automatic metrics to evaluate the faithfulness and accuracy of generated responses and directly learn a policy that balances the trade-off between these two factors. This approach eliminates the need for large-scale human annotations and utilizes the available metrics to guide the learning process effectively. By optimizing the policy based on these metrics, the model achieves improved performance in generating faithful and accurate responses.

Formally, the new reward function is defined as:

$R_t = \alpha\, R_t^{\text{acc}}(\bm{u}, \bm{y}) + (1-\alpha)\, R_t^{\text{faith}}(\bm{u}, K)$ (2)

where $R_t^{\text{acc}}$ is the accuracy metric measuring the similarity between the generated response $\bm{u}$ and the ground-truth reference $\bm{y}$, $R_t^{\text{faith}}$ is the faithfulness metric evaluating the factual consistency between the generated response $\bm{u}$ and the knowledge text $K$, and $\alpha\in[0,1]$ is a coefficient used to balance the accuracy and faithfulness of generated responses. In this work, we choose SacreBLEU (Post, 2018) as $R_t^{\text{acc}}$ and BERTScore (Zhang et al., 2020a) as $R_t^{\text{faith}}$, as they are recognized as effective evaluation metrics for the knowledge-grounded dialogue generation task (Dziri et al., 2022a, b). Note that $R_t^{\text{acc}}$ and $R_t^{\text{faith}}$ are only assigned to the final token of the generated response. This reward is also regularized with the token-level KL penalty in the same way as in Equation 1.
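A minimal sketch of how the blended terminal reward in Equation 2 could be computed with off-the-shelf packages (sacrebleu and bert-score) is shown below; rescaling SacreBLEU to [0, 1] is an implementation assumption on our part, not a detail specified in the paper.

```python
import sacrebleu
from bert_score import score as bert_score

def blended_reward(response, reference, knowledge, alpha):
    """Equation 2: R = alpha * R_acc(u, y) + (1 - alpha) * R_faith(u, K),
    assigned to the final token of the generated response."""
    # Accuracy term: SacreBLEU between the response and the ground-truth reference.
    r_acc = sacrebleu.sentence_bleu(response, [reference]).score / 100.0
    # Faithfulness term: BERTScore F1 between the response and the knowledge text.
    _, _, f1 = bert_score([response], [knowledge], lang="en", verbose=False)
    r_faith = f1.item()
    return alpha * r_acc + (1.0 - alpha) * r_faith
```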

Learning $\alpha$ from Expert Demonstrations.

The selection of the coefficient $\alpha$ is crucial for learning an effective policy.[2] Grid search on the validation set is a common strategy, but it can be computationally intensive and prone to overfitting on the validation set. To address this issue, we propose to learn the coefficient $\alpha$ from a small number of expert pair-wise comparison demonstrations. Since the primary requirement of the reward function is to differentiate faithful and accurate responses, leveraging expert demonstrations can provide valuable guidance for determining an appropriate value of $\alpha$. Additionally, this approach reduces the computational burden and helps ensure the generalizability of the learned policy.

[2] Note that we cannot learn the coefficient $\alpha$ during training, because the reward is part of PPO's learning objective.

Specifically, we leverage two state-of-the-art large language models (LLMs), text-davinci-003 and GPT-3.5-turbo,[3] to generate 25 responses each based on one in-context example. We then shuffle the presentation order and ask an NLP expert to perform pair-wise comparisons between the two models’ outputs. An example is provided in Table 2. Next, we compute the reward in Equation 2 for both models’ outputs and align our reward comparison results with the expert pair-wise comparison results. The alignment is done by iterating over values of $\alpha$ and finding the value that maximizes the Pearson correlation coefficient (Pearson, 1895) between the expert pair-wise comparison results and our reward-based pair-wise comparison results. By learning $\alpha$ from these demonstrations, we can effectively calibrate the balance between faithfulness and accuracy in the generated responses.

[3] https://platform.openai.com/docs/api-reference

Input: LLMs’ outputs $\{(\bm{u}_n^1, \bm{u}_n^2)\}_{n=1}^{N}$. The expert demonstrations of pair-wise comparison on the two LLMs’ outputs $\{\hat{p}_n\}_{n=1}^{N}$.
Output: The optimal blending coefficient $\alpha_{human}$.
1: for $\alpha = 0.00, \dots, 1.00$ do
2:    Compute our reward on the two LLMs’ outputs using Equation 2 and get the scores $\{(r_n^1, r_n^2)\}_{n=1}^{N}$.
3:    Get the pair-wise comparison of our reward $p_n^{\alpha} = \operatorname{argmax}(r_n^1, r_n^2)$, for $n = 1, \dots, N$.
4:    Compute the Pearson correlation coefficient $r$ between $\{\hat{p}_n\}_{n=1}^{N}$ and $\{p_n^{\alpha}\}_{n=1}^{N}$.
5: end for
6: Save the $\alpha$ that achieves the highest Pearson correlation coefficient $r_{\max}$ as $\alpha_{human}$.
Algorithm 1: Learning $\alpha$ from expert demonstrations
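For reference, a straightforward Python sketch of Algorithm 1 is given below; it assumes the expert preference $\hat{p}_n$ is encoded as 1 when the first model's output is preferred and 2 otherwise, and it reuses a blended reward function such as the one sketched after Equation 2.

```python
import numpy as np
from scipy.stats import pearsonr

def learn_alpha(outputs1, outputs2, references, knowledges, expert_prefs,
                reward_fn, step=0.01):
    """Grid-search alpha in [0, 1] and keep the value whose reward-based
    preferences correlate best with the expert preferences (Algorithm 1)."""
    best_alpha, best_corr = 0.0, -1.0
    for alpha in np.arange(0.0, 1.0 + step, step):
        reward_prefs = []
        for u1, u2, y, k in zip(outputs1, outputs2, references, knowledges):
            r1 = reward_fn(u1, y, k, alpha)
            r2 = reward_fn(u2, y, k, alpha)
            reward_prefs.append(1 if r1 >= r2 else 2)  # argmax over the pair
        corr, _ = pearsonr(expert_prefs, reward_prefs)
        if corr > best_corr:
            best_alpha, best_corr = alpha, corr
    return best_alpha, best_corr
```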

4 Experiments

The experiments in this section are designed to answer the following research questions:

  1. RQ1: Does there exist a trade-off between faithfulness and accuracy in the knowledge-grounded dialogue generation task?

  2. RQ2: Can the proposed method improve the faithfulness and accuracy of model-generated responses?

  3. RQ3: Can expert demonstrations help effectively calibrate the value of $\alpha$ to learn a better policy?

Dataset | Method | SacreBLEU (Acc.) | ROUGE-L (Acc.) | BERTScore (Faith.) | Token-F1 (Faith.) | Overall
MultiDoc2Dial | R3 | 31.10 | 41.40 | - | - | -
MultiDoc2Dial | T5-SFT | 25.38 | 41.13 | 91.10 | 51.61 | 209.22
MultiDoc2Dial | T5-PPO-RoBERTa | 30.51 | 42.56 | 90.66 | 47.05 | 210.78
MultiDoc2Dial | T5-PPO-Ours | 31.15 | 43.28 | 91.45 | 51.81 | 217.69
FaithDial | T5-CTRL | 13.75 | 38.57 | 94.42 | 70.91 | 217.65
FaithDial | T5-SFT | 13.69 | 39.58 | 95.13 | 75.49 | 223.89
FaithDial | T5-PPO-RoBERTa | 11.87 | 36.75 | 92.06 | 52.05 | 192.73
FaithDial | T5-PPO-Ours | 12.75 | 36.98 | 98.79 | 94.71 | 243.23
Table 3: Accuracy (SacreBLEU, ROUGE-L) and faithfulness (BERTScore, Token-F1) evaluation results on the test sets of MultiDoc2Dial and FaithDial. The results of R3 are reprinted from the original paper (Bansal et al., 2022). The results of T5-CTRL come from the code and data released by Dziri et al. (2022a). The other results come from our implementation. The Overall score is the sum of the accuracy and faithfulness scores.
Dataset | $\alpha$ | SacreBLEU (Acc.) | ROUGE-L (Acc.) | BERTScore (Faith.) | Token-F1 (Faith.) | Overall
MultiDoc2Dial | $\alpha_{human}$ | 31.15 | 43.28 | 91.45 | 51.81 | 217.69
MultiDoc2Dial | $\alpha=1.00$ | 30.60 | 43.04 | 91.67 | 51.80 | 217.11
MultiDoc2Dial | $\alpha=0.00$ | 27.45 | 44.44 | 91.37 | 51.12 | 214.38
MultiDoc2Dial | $\alpha=0.25$ | 29.59 | 44.32 | 91.55 | 52.19 | 217.65
FaithDial | $\alpha_{human}$ | 12.75 | 36.98 | 98.79 | 94.71 | 243.23
FaithDial | $\alpha=1.00$ | 13.09 | 38.36 | 95.97 | 81.98 | 229.40
FaithDial | $\alpha=0.00$ | 12.59 | 36.95 | 91.57 | 54.47 | 195.58
FaithDial | $\alpha=0.85$ | 13.30 | 36.94 | 98.30 | 92.28 | 240.82
Table 4: Model performance under different values of $\alpha$ on the test sets of MultiDoc2Dial and FaithDial. The Overall score is the sum of the accuracy and faithfulness scores.

4.1 Experimental Setups

Benchmark Datasets.

We choose two information-seeking conversation datasets as our benchmarks: MultiDoc2Dial (Feng et al., 2021) and FaithDial (Dziri et al., 2022a). Both datasets involve two participants in each conversation: a user (or seeker) who initiates the conversation with a question, and a system (or wizard) who answers the user’s question by referring to a piece of knowledge text. Each conversation contains several turns and may involve topic shifts. The dataset statistics are shown in Table 5.

Dataset | Train | Valid | Test | Know. Tok.
MultiDoc2Dial | 21,453 | 4,201 | 4,094 | 106
FaithDial | 18,357 | 3,417 | 3,539 | 27
Table 5: Benchmark dataset statistics, where Know. Tok. indicates the average number of tokens per input knowledge text.

Competitive Methods.

We compare with the following knowledge-grounded dialogue generation methods:

  • R3 (Bansal et al., 2022): a retriever-reranker-reader system that achieves state-of-the-art performance on MultiDoc2Dial. The system uses a bi-encoder DistilSPLADE (Formal et al., 2021) retriever to fetch the top-100 relevant knowledge passages from the corpus, then applies a RoBERTa-based (Liu et al., 2019) cross-encoder to rerank them, and finally passes the top-10 reranked passages to a T5-based FiD (Izacard and Grave, 2021) reader to generate the response.

  • T5-CTRL (Dziri et al., 2022a): a controlled generation method that achieves state-of-the-art performance on FaithDial. Following Rashkin et al. (2021), it sets control feature tokens based on measures of entailment, lexical precision, and objective voice of the ground-truth response, to steer a T5-base model (Raffel et al., 2020b) toward generating responses faithful to the input knowledge texts.

  • T5-SFT: a supervised fine-tuning baseline for both datasets. This method directly fine-tunes the T5-base model with maximum likelihood estimation on ground-truth responses in the full training set.

  • T5-PPO-RoBERTa: a baseline reward function for PPO, implemented as a RoBERTa-base model that discriminates whether the current output is the ground-truth response given the conversation history and knowledge text. The purpose of this reward model is to encourage the model-generated response to match the ground-truth reference. The reward model is trained with an equal number of positive and negative examples, where the negative examples come from the outputs of T5-SFT. For PPO training, we use T5-SFT to initialize the policy network and value network.

  • T5-PPO-Ours: our proposed reward function for PPO introduced in subsection 3.3. We use T5-SFT to initialize the policy network and value network, and update them jointly with PPO.

History User: I am aspiring to be an accountant. I want to know more about accounting.
Knowledge Text Accounting, which has been called the ”language of business”, measures the results of an organization’s economic activities and conveys this information to a variety of users, including investors, creditors, management, and regulators.
Human Reference Accounting has been called the ”language of business”.
T5-SFT Accounting has been called the ”language of business”. It measures the results of an organization’s economic activities and conveys this information to a variety of users.
T5-PPO-Acc. ($\alpha=1.00$) I see, did you know that accounting has been called the ”language of business”? It measures the results of an organization’s economic activities and conveys this information to a variety of users?
T5-PPO-Faith. ($\alpha=0.00$) Ah, accounting is called the language of business.
T5-PPO-Ours ($\alpha_{human}$) Accounting has been called the ”language of business”, measures the results of an organization’s economic activities and conveys this information to a variety of users, including investors, creditors, management, and regulators.
Table 6: Different model-generated responses sampled from the test set of FaithDial. The green text highlights the ground-truth knowledge span.

Implementation Details.

We choose T5-base as our backbone language model for all experiments. For the input knowledge text in MultiDoc2Dial, we use the ground-truth knowledge passage preprocessed by the official code,[4] because we focus on the agent response generation subtask. For the T5-SFT baseline, we fine-tune the model on the ground-truth responses from the full training set. We fine-tune the model for 10 epochs using the AdamW optimizer (Loshchilov and Hutter, 2019) with a linearly decaying learning rate starting from $1\times 10^{-5}$. During inference, we use beam search with a beam size of 4 to generate the final response.

[4] https://github.com/IBM/multidoc2dial/blob/main/scripts/run_data_preprocessing.sh

The RoBERTa-base reward model in T5-PPO-RoBERTa is trained to discriminate the T5-SFT generated responses from the ground-truth responses. We train the reward model for 10 epochs using the AdamW optimizer with a constant learning rate of $1\times 10^{-6}$. The reward model achieves 89% accuracy on the test set of MultiDoc2Dial and 96% accuracy on the test set of FaithDial.
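For illustration, a hedged sketch of how such a discriminator could be used as a scalar reward at PPO time is given below; the input formatting and the use of the positive-class probability as the reward are our assumptions rather than details from the paper.

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# In practice, load the fine-tuned reward-model checkpoint instead of roberta-base.
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.eval()

@torch.no_grad()
def roberta_reward(history, knowledge, response):
    """Score how likely the response is the ground-truth one, given the
    conversation history and knowledge text (label 1 = ground truth)."""
    text = f"{history} </s> {knowledge} </s> {response}"  # hypothetical input format
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```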

For learning α\alpha from expert demonstrations, we collect 25 pair-wise comparison demonstrations on each dataset respectively. We find the optimal value of αhuman\alpha_{human} on MultiDoc2Dial is 0.04 with a Pearson correlation coefficient of 0.2278, and the optimal value of αhuman\alpha_{human} on FaithDial is 0.92 with a Pearson correlation coefficient of 0.2865.

For all PPO experiments, the policy network and value network share the same base model initialized from T5-SFT but have separate final output layers. The output layer of the value network is a linear layer that maps the last hidden state to a scalar value. We update the parameters for 10,000 iterations using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of $5\times 10^{-7}$. The policy is evaluated on the full validation set every 100 iterations, and the final policy is the one that achieves the highest total score in accuracy and faithfulness on the validation set. We use top-k ($k=50$) sampling to generate trajectories during training, and beam search with a beam size of 4 to generate the final response during testing.
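The sketch below illustrates one way to realize the shared backbone with separate output layers described above, assuming a T5 encoder-decoder; it is a simplification of how such policy/value networks are typically organized (e.g., in RL4LMs), not the library's exact code.

```python
import torch.nn as nn
from transformers import T5ForConditionalGeneration

class PolicyValueModel(nn.Module):
    """Policy and value networks share the T5 backbone (initialized from the
    T5-SFT checkpoint) but keep separate output layers."""
    def __init__(self, sft_checkpoint="t5-base"):  # path to T5-SFT is an assumption
        super().__init__()
        self.backbone = T5ForConditionalGeneration.from_pretrained(sft_checkpoint)
        # Value head: linear map from the last decoder hidden state to a scalar.
        self.value_head = nn.Linear(self.backbone.config.d_model, 1)

    def forward(self, input_ids, attention_mask, decoder_input_ids):
        out = self.backbone(input_ids=input_ids,
                            attention_mask=attention_mask,
                            decoder_input_ids=decoder_input_ids,
                            output_hidden_states=True)
        hidden = out.decoder_hidden_states[-1]        # (batch, seq_len, d_model)
        values = self.value_head(hidden).squeeze(-1)  # per-token state values V(s_t)
        return out.logits, values
```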

Evaluation Metrics.

We follow prior work (Feng et al., 2021; Dziri et al., 2022a) to evaluate the accuracy and faithfulness of model-generated responses. Specifically, we use SacreBLEU (Post, 2018) and ROUGE-L (Lin, 2004) to evaluate accuracy, which measures the similarity between model-generated responses and ground-truth responses; we use the F1 measure of BERTScore (Zhang et al., 2020a) and the token-level F1 score to evaluate faithfulness, which measures the similarity between model-generated responses and input knowledge texts.
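A minimal sketch of these four metrics using common packages (sacrebleu, rouge-score, bert-score) is shown below; the whitespace-token F1 formulation is a standard choice that we assume here rather than the benchmarks' exact evaluation scripts.

```python
from collections import Counter

import sacrebleu
from bert_score import score as bert_score
from rouge_score import rouge_scorer

def token_f1(response, knowledge):
    """Token-level F1 between a response and its knowledge text
    (assumed whitespace tokenization)."""
    pred, ref = response.split(), knowledge.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(responses, references, knowledges):
    # Accuracy: similarity between responses and ground-truth references.
    bleu = sacrebleu.corpus_bleu(responses, [references]).score
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = 100 * sum(scorer.score(ref, hyp)["rougeL"].fmeasure
                        for hyp, ref in zip(responses, references)) / len(responses)
    # Faithfulness: similarity between responses and input knowledge texts.
    _, _, f1 = bert_score(responses, knowledges, lang="en", verbose=False)
    tok_f1 = 100 * sum(token_f1(hyp, k)
                       for hyp, k in zip(responses, knowledges)) / len(responses)
    return {"SacreBLEU": bleu, "ROUGE-L": rouge_l,
            "BERTScore": 100 * f1.mean().item(), "Token-F1": tok_f1}
```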

History User: Can those over the age of 18 drive without supervision while holding a learner permit?
Knowledge Text Learner permit restrictions // No matter what age you are, if you hold a learner permit, you may not drive unless accompanied by a supervising driver age 21 or older who has a valid license to operate the vehicle you are driving. For example, only a person with a motorcycle license may supervise a person learning to drive a motorcycle. You may not drive with a learner permit on any street within a park in New York City on any bridge or tunnel under the jurisdiction of the Tri-borough Bridge and Tunnel Authority on the Cross County, Hutchinson River, Saw Mill River, or Taconic State parkways in Westchester County in a DMV road test area.
Human Reference No. Learner permit holders of any age can’t drive without a supervising driver.
T5-SFT Yes, (unfaithful) you may not drive unless accompanied by a supervising driver age 21 or older who has a valid license to operate the vehicle you are driving. For example, only a person with a motorcycle license may supervise (incorrect span)
T5-PPO-Acc. ($\alpha=1.00$) You may not drive with a learner permit on any street within a park in New York City on any bridge or tunnel under the jurisdiction of the Tri-borough Bridge and Tunnel Authority on the Cross County. (incorrect span)
T5-PPO-Faith. ($\alpha=0.00$) Yes. (unfaithful)
T5-PPO-Ours ($\alpha_{human}$) No. You may not drive unless accompanied by a supervising driver age 21 or older who has a valid license to operate the vehicle you are driving.
Table 7: Different model-generated responses sampled from the test set of MultiDoc2Dial. The green text highlights the ground-truth knowledge span. The gray text is the unfaithful model-generated response caused by the bias from the LLM. The red text is the incorrect model-generated response misled by the text that contains redundant information.

4.2 Result Analysis

RQ1: Faithfulness vs. Accuracy.

We first investigate whether there exists a trade-off between faithfulness and accuracy in the knowledge-grounded dialogue generation task. The empirical results in Table 3 indicate that the answer depends on the input knowledge text. When the input knowledge text contains redundant and irrelevant information, as in the case of MultiDoc2Dial, a good method can improve both accuracy and faithfulness, because the ground-truth reference helps locate the relevant text span in a long knowledge passage.

On the other hand, when the input knowledge text only contains the precise information needed for the answer, as in the case of FaithDial, the model may learn to directly copy from the knowledge text, which improves the performance in faithfulness but degrades the performance in accuracy. In this case, a good method should balance the performance improvement trade-off between faithfulness and accuracy.

RQ2: Our Method v.s. Others.

Generally speaking, our method achieves the best overall performance compared with other competitive baselines on both datasets, as shown in Table 3. On MultiDoc2Dial, T5-PPO-Ours achieves the best performance in both accuracy and faithfulness evaluation, indicating that our learning method has an advantage in locating the relevant information within a long knowledge document. T5-PPO-RoBERTa achieves improved accuracy but degraded faithfulness compared with T5-SFT, because its reward function is designed only to encourage the model-generated response to match the ground-truth reference.

On FaithDial, T5-PPO-Ours achieves the best performance in faithfulness but degraded performance in accuracy compared with T5-SFT. As discussed in RQ1, the knowledge text in FaithDial contains exactly the relevant information for the answer; consequently, the model learns to directly copy the knowledge text as its response when further trained with PPO. A generation example is provided in Table 6. Besides, T5-PPO-RoBERTa achieves the worst performance even though its reward model reaches 96% test accuracy, which indicates that the reward model also learns shortcut features in model-generated responses (e.g., the n-gram overlap between the response and the knowledge text).

RQ3: Calibrating $\alpha$.

As illustrated in Table 4, we experiment with different values of $\alpha$ for our reward function and find that the value learned from expert demonstrations, $\alpha_{human}$, achieves the best overall performance on both datasets. Using a single metric (i.e., $\alpha=0.00$ or $\alpha=1.00$) only achieves good performance along a single evaluation dimension and still produces inaccurate or unfaithful responses, as shown in Table 7.

Additionally, we compare $\alpha_{human}$ with the values found by grid search that achieve the highest total score on the validation set, i.e., $\alpha=0.25$ and $\alpha=0.85$. Surprisingly, we find that $\alpha_{human}$ achieves better overall performance on both datasets, which indicates that a few expert demonstrations not only help calibrate $\alpha$ to learn a better policy, but also provide good generalization ability for the reward function.

5 Conclusion

This work investigates how to improve faithfulness and accuracy in knowledge-grounded dialogue generation tasks. Firstly, we identify three major sources of the problem of generating unfaithful and inaccurate responses: the bias in LLMs, the irrelevant information in knowledge texts, and the characteristics of the supervised learning algorithm. Then, we address the problem by applying a reinforcement learning algorithm with a novel reward function. Our reward function can serve as a cost-effective approximation to a human preference reward model that would otherwise be learned from massive high-quality human preference annotations. Finally, we validate the effectiveness of our method on two information-seeking conversation datasets. The empirical results show that our method can outperform other strong supervised learning baselines.

Limitations

The good performance of the PPO algorithm requires not only a good reward function but also a good initial policy. The initial policy ensures that the trajectories sampled from the current policy are of high quality, which benefits convergence to a better policy. Therefore, pretraining better language models is always helpful for this method. In addition, on-policy RL algorithms require large GPU memory, so more memory-efficient training or model compression techniques could further benefit RL training.

Ethics Statement

This work complies with the ACL Ethics Policy. Both benchmark datasets and baseline models are collected from public academic resources and do not contain harmful, unfair, or discriminating content. We will also open-source our data and code once this paper is made public, in order to provide easily reproducible experiment configurations for future research.

References