
Blending Reward Functions via Few Expert Demonstrations for Faithful and Accurate Knowledge-Grounded Dialogue Generation

Wanyu Du  Yangfeng Ji
Department of Computer Science
University of Virginia
Charlottesville, VA 22903
{wd5jq,yangfeng}@virginia.edu
Abstract

The development of trustworthy conversational information-seeking systems relies on dialogue models that can generate faithful and accurate responses based on relevant knowledge texts. However, two main challenges hinder this task. Firstly, language models may generate hallucinations due to data biases present in their pretraining corpus. Secondly, knowledge texts often contain redundant and irrelevant information that distracts the model’s attention from the relevant text span. Previous works use additional data annotations on the knowledge texts to learn a knowledge identification module in order to bypass irrelevant information, but collecting such high-quality span annotations can be costly. In this work, we leverage reinforcement learning algorithms to overcome the above challenges by introducing a novel reward function. Our reward function combines an accuracy metric and a faithfulness metric to provide a balanced quality judgment of generated responses, which can be used as a cost-effective approximation to a human preference reward model when only a few preference annotations are available. Empirical experiments on two conversational information-seeking datasets demonstrate that our method can compete with other strong supervised learning baselines.

1 Introduction

History User: What kind of rockets does SpaceX have? ||Agent: One is called the Falcon Heavy and can put about 8,000 kilograms into geostationary transfer orbit with side boosters. ||User: Wow, that’s pretty cool. Do you know anything about space elevators?
Knowledge Text The design would permit vehicles to travel along the cable from a planetary surface, such as the Earth’s, directly into space or orbit, without the use of large rockets.
Human Reference Yeah, they would allow vehicles to move along a cable into space, instead of on a rocket.
LLM Output Yes, a space elevator is a proposed type of transportation system. It consists of a cable anchored to the surface of a planet, with the other end in space. The cable would be held up by the centrifugal force generated by its orbital motion around. (unfaithful)
History User: How do i apply for Medicare?
Knowledge Text Benefits Planner: Retirement | Applying for Medicare Only | Social Security Administration // Applying for Medicare Only. If you are within three months of age 65 or older and not ready to start your monthly Social Security benefits yet, you can use our online retirement application to sign up just for Medicare and wait to apply for your retirement or spouse’s benefits later. Before you decide, you need to be sure that you understand how waiting until later will affect: the lifetime benefits we can pay on your account and your health insurance coverage.
Human Reference You can use our online retirement application to sign up just for Medicare and wait to apply for your retirement or spouse’s benefits.
LLM Output You can use the Social Security Administration’s online retirement application to sign up for Medicare only. Be aware that if you wait to apply for retirement or spouse benefits later, it may affect the benefits you receive and your health insurance coverage. (semantically incorrect)
Table 1: Two example responses generated by an LLM (text-davinci-003) in knowledge-grounded information-seeking conversations. The green text highlights the ground-truth knowledge span. The gray text is the unfaithful model-generated response caused by the bias from the LLM. The red text is the incorrect model-generated response misled by the text that contains redundant information.

Recent large language models (LLMs) have enabled conversational information-seeking systems to exhibit remarkable proficiency in producing fluent and coherent responses (Thoppilan et al., 2022; Nakano et al., 2022; Menick et al., 2022; Ouyang et al., 2022). However, these models sometimes fail to generate faithful and accurate responses supported by verified knowledge texts. This undesirable model behavior stems from three distinct sources: the bias inherent in LLMs, the irrelevant information in input knowledge texts, and the characteristics of the employed learning algorithms. Firstly, LLMs are likely to generate texts that are most frequently seen during pretraining (Kandpal et al., 2022; Wang et al., 2023), and may either disregard the knowledge texts or generate additional information not provided in them, as illustrated in Table 1. This problem is also described as hallucination in other works (Maynez et al., 2020; Cao et al., 2022; Dziri et al., 2022a). Secondly, the input knowledge texts often contain redundant and irrelevant information that diverts the model’s attention away from the relevant text span (Chen et al., 2021), as demonstrated in Table 1. Lastly, the model is trained to maximize the likelihood of ground-truth responses, which is prone to exposure bias (Ranzato et al., 2016; Du and Ji, 2019) and sometimes fails to align with the factual content of the input knowledge texts (Wang and Sennrich, 2020; Li et al., 2022a).

Knowledge-grounded dialogue generation techniques are commonly leveraged in building conversational information-seeking systems. Aligning with the problem-source analysis above, we classify related prior works into three categories. The first category proposes to use additional data annotations on the knowledge texts to filter out redundant and irrelevant information, such as learning a knowledge selection module (Dinan et al., 2019; Feng et al., 2021; Kim et al., 2020; Wu et al., 2021b) or leveraging cross-attention scores to identify important texts (Zhao et al., 2020a; Xu et al., 2021). While directly learning from additional data annotations can improve the accuracy of locating the intended text spans, high-quality annotated data is often expensive to collect. The second group of works modifies the model architecture (Prabhumoye et al., 2021) or refines the decoding process (Tian et al., 2020) to force the model to generate responses that are more faithful to the knowledge texts. The third line of works switches to alternative learning algorithms to explore the optimal dialogue generation policy, such as unlikelihood training (Li et al., 2020), imitation learning (Liu et al., 2018), and reinforcement learning (Li et al., 2016; Stiennon et al., 2020; Ramamurthy et al., 2023). However, previous learning algorithms focus on optimizing the semantic coherence and fluency of generated responses, while improving the faithfulness and accuracy of generated responses in knowledge-grounded conversations remains understudied.

In this work, we apply reinforcement learning (RL) algorithms to learn a faithful and accurate dialogue generation policy. On one hand, fine-tuning LLMs with RL on downstream datasets can help alleviate the bias learned from the pretraining corpus; on the other hand, an appropriate reward function can guide LLMs to generate responses that align with the relevant knowledge text. Our key contribution is the design of a novel reward function that combines two automatic metrics via expert demonstrations for effective evaluation. In this design, each metric addresses one specific concern of response generation discussed above: (1) the accuracy metric measures the similarity between model-generated responses and ground-truth references, aiming to make the generated response coherent with the dialogue context; (2) the faithfulness metric evaluates the similarity between model-generated responses and input knowledge texts, aiming to align the factual content of the generated response with the knowledge text.

We blend the two automatic metrics to approximate a balanced quality judgment of generated responses. The blending coefficient is learned from a few expert demonstrations of pair-wise quality judgments on two LLMs’ outputs. Our reward function can be used as a cost-effective approximation to a human preference reward model (Stiennon et al., 2020) when only a few (e.g., 25) human preference annotations are available. This approach improves both faithfulness and accuracy in knowledge-grounded dialogue generation while reducing the reliance on massive human preference annotations. Empirical experiments on two information-seeking conversation benchmark datasets, MultiDoc2Dial (Feng et al., 2021) and FaithDial (Dziri et al., 2022a), show that our method obtains improved performance in faithfulness and accuracy compared with other strong supervised learning baselines.

We summarize the contributions of this work as follows:

  1. Identifying three major sources of the problem of generating unfaithful and inaccurate responses in knowledge-grounded conversations.

  2. Proposing a new reward function for reinforcement learning algorithms that can improve the faithfulness and accuracy of generated responses.

  3. Conducting empirical experiments to demonstrate the effectiveness of our method compared with strong supervised learning baselines.

2 Related Work

Knowledge-Grounded Dialogue Generation.

Previous works in knowledge-grounded dialogue generation train language models conditioned on knowledge texts, with the goal of maximizing the likelihood of ground-truth responses (Ghazvininejad et al., 2018; Dinan et al., 2019; Zhang et al., 2020b; Bao et al., 2022; Peng et al., 2022). Some works apply multi-task learning and transfer learning techniques to improve dialogue generation quality via joint learning with other text generation tasks (Shuster et al., 2020; Raffel et al., 2020a; Li et al., 2022b). In addition, Zhan et al. (2021) learn the knowledge transition in multi-turn conversations to better select knowledge texts for response generation. Other works leverage the retriever-reader architecture to learn knowledge text representations for improving generation quality (Lewis et al., 2020; Izacard and Grave, 2021). In contrast, this work leverages reinforcement learning algorithms to learn a faithful and accurate dialogue generation policy.

Knowledge Identification in Document-Grounded Dialogues.

A majority of works incorporate a knowledge identification module in the document-grounded dialogue generation task (Dinan et al., 2019; Kim et al., 2020; Chen et al., 2020; Feng et al., 2021). The knowledge identification module prevents the language model from attending to irrelevant knowledge texts and avoids generating inappropriate responses (Chen et al., 2020). One line of works learns a knowledge identification module with explicit span annotations before generation (Wu et al., 2021b; Zhao et al., 2020b). Another line of works models the grounding knowledge texts as latent variables (Zhao et al., 2020a; Kim et al., 2020). This work requires neither data annotations on knowledge texts nor modifications to the model architecture. We fine-tune LLMs with RL using our novel reward function to achieve improved accuracy and faithfulness.

Faithful Text Generation.

Faithful text generation is concerned with whether the generated content is factually consistent with the input information (Li et al., 2022a). Some works improve the factual consistency of dialogues with a natural language inference model that selects the most faithful candidates during inference (Welleck et al., 2019; Qin et al., 2021), or use the entailment score from the inference model as a reward to learn a better dialogue policy (Song et al., 2020; Mesgar et al., 2021). Tian et al. (2020) design constrained decoding strategies to improve faithfulness. Nye et al. (2021) propose a dual-system approach, in which the first system generates a set of candidate responses and the second system checks whether the generated responses contain contradictions or commonsense violations. Our method combines the faithfulness and accuracy metrics to better approximate human preference in the low-data setting.

3 Learning Faithful and Accurate Generation Policy with RL

3.1 Problem Definition

Given the knowledge text $K_n$ and the conversation history $X=(\bm{u}_0,\cdots,\bm{u}_{n-1})$, the task is to generate a response $\bm{u}_n$ that is faithful to $K_n$ and coherent with the conversation history $X$. As the coherence evaluation in dialogue generation varies across different specific tasks and domains, in this work, we follow prior works (Feng et al., 2021; Dziri et al., 2022a) and approximate it by calculating the accuracy between the generated response $\bm{u}_n$ and the ground-truth reference $\bm{y}_n$.

Following (Sutton and Barto, 2018; Ramamurthy et al., 2023), we formulate the response generation $\bm{u}_n=(a_0,\cdots,a_T)$ as a Markov Decision Process (MDP) $\langle\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma\rangle$. $\mathcal{S}$ is a finite set of states, where the initial state $s_0\in\mathcal{S}$ is the concatenation of the input conversation history $X$ and the knowledge text $K_n$. $\mathcal{A}$ is a finite set of actions, where an action $a_t\in\mathcal{A}$ is a token from our vocabulary $\mathcal{V}$. $\mathcal{P}:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$ is a transition function that determines the next state $s_{t+1}$ given the current state-action pair $(s_t,a_t)$. $\mathcal{R}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is a reward function that returns a real number given the current state-action pair $(s_t,a_t)$. $\gamma\in[0,1]$ is a discount factor. Each episode in the MDP begins by sampling a datapoint $(\bm{u}_0,\cdots,\bm{u}_{n-1},\bm{y}_n,K_n)$ from the dataset, and ends when the current time step $t$ exceeds the horizon $T$ or an end-of-sentence token is generated.
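To make the episode construction concrete, the sketch below shows one way the initial state $s_0$ could be assembled from the conversation history and knowledge text; the tokenizer choice, separator string, and truncation length are illustrative assumptions, not the paper's exact preprocessing.

```python
# A minimal sketch of building the initial MDP state s_0, assuming a
# HuggingFace-style tokenizer; the "<knowledge>" separator is hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

def build_initial_state(history, knowledge, max_len=512):
    """Concatenate the conversation history (u_0, ..., u_{n-1}) and the
    knowledge text K_n into the initial state s_0."""
    context = " || ".join(history)                 # turn separator as in Table 1
    source = f"{context} <knowledge> {knowledge}"  # assumed input format
    return tokenizer(source, truncation=True, max_length=max_len,
                     return_tensors="pt").input_ids
```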

3.2 Proximal Policy Optimization (PPO)

The policy $\pi_\theta:\mathcal{S}\to\mathcal{A}$ is a function that selects an action in a given state in order to maximize the long-term discounted reward over a trajectory, $\mathbb{E}_{\pi}[\sum_{t=0}^{T}\gamma^{t}R_{t}]$. In this work, we initialize the policy $\pi_\theta$ with a pre-trained language model $\pi_0$. We learn the policy using the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017), which is an effective actor-critic algorithm for many text generation tasks (Stiennon et al., 2020; Ramamurthy et al., 2023). The advantage is approximated using Generalized Advantage Estimation (Schulman et al., 2018): $A_t=\sum_{\tau=t}^{T}(\gamma\lambda)^{\tau-t}(R_{\tau}+\gamma V_{\phi}(s_{\tau+1})-V_{\phi}(s_{\tau}))$, where $\lambda$ is a hyper-parameter, $\gamma$ is the discount factor, $R_t$ is the reward assigned to $a_t$, and $V_{\phi}(s_t)$ is the value of state $s_t$ given by the value network $V_{\phi}$. Note that the reward $R_t$ is regularized with a token-level KL penalty (Wu et al., 2021a) to prevent the updated policy $\pi_\theta$ from deviating too far from the pre-trained language model $\pi_0$:

$R_t = R_t - \beta\,\text{KL}\big(\pi_\theta(a_t|s_t)\,\|\,\pi_0(a_t|s_t)\big)$ (1)

where $\beta$ is a dynamically adapted coefficient (Ziegler et al., 2020). During training, the policy network $\pi_\theta$ and the value network $V_\phi$ are optimized jointly.[1]

[1] In this work, we use the RL4LMs library to learn the response generation policy; please refer to Ramamurthy et al. (2023) for more implementation details.
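As a concrete illustration, the sketch below shows one way to compute the KL-regularized token rewards of Equation 1 and the GAE advantages for a single trajectory; approximating the token-level KL by the log-probability difference of the sampled action, and the default values of $\gamma$ and $\lambda$, are our own assumptions rather than details from the paper.

```python
import torch

def kl_penalized_rewards(rewards, logprobs_policy, logprobs_ref, beta):
    """Equation 1: R_t <- R_t - beta * KL, with the token-level KL
    approximated by log pi_theta(a_t|s_t) - log pi_0(a_t|s_t) (assumption)."""
    return rewards - beta * (logprobs_policy - logprobs_ref)

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory of length T,
    using the standard backward recursion (equivalent to the sum above)."""
    T = rewards.size(0)
    advantages = torch.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # V(s_{T+1}) = 0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages
```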

In-Context Example
Instruction: The following is a conversation with an AI assistant. The assistant is providing an answer based on a knowledge passage.
Human: I recently discovered rap music and I’m so intrigued by it! Do you listen to rap music?
Knowledge Text: Rapping is a musical form of vocal delivery that incorporates ”rhyme, rhythmic speech, and street vernacular”, which is performed or chanted in a variety of ways.
AI: No, I’m a bot and can’t hear. I know that it’s a form of music that involves chanting and rhythmic speech.
LLM Input
Human: Chevrolet is my all time favorite car brand, have you heard of it?
Knowledge Text: The Chevrolet Chevy II/Nova was a small automobile manufactured by Chevrolet, and produced in five generations for the 1962 through 1979, and 1985 through 1988 model years.
LLM Output
Model Output 1: Yes, I’m familiar with Chevrolet. It’s a car brand that has been around since the early 1900s and has produced popular models like the Chevy II/Nova, Corvette, Camaro, and Impala.
Model Output 2: Yes, I heard of Chevrolet. It’s a popular car brand that has been manufacturing cars for several decades. Chevy II/Nova was produced in five generations between 1962 and 1988.
Table 2: An example from the validation set of FaithDial used by a human expert to perform a pair-wise comparison between the outputs of text-davinci-003 and GPT-3.5-turbo. The gray text is the unfaithful model-generated response caused by the bias from the LLM.

3.3 Blending Reward Functions via Expert Demonstrations

Blending Reward Functions.

We propose leveraging existing automatic metrics to evaluate the faithfulness and accuracy of generated responses and directly learn a policy that balances the trade-off between these two factors. This approach eliminates the need for large-scale human annotations and utilizes the available metrics to guide the learning process effectively. By optimizing the policy based on these metrics, the model achieves improved performance in generating faithful and accurate responses.

Formally, the new reward function is defined as:

$R_t = \alpha\, R_t^{\text{acc}}(\bm{u}, \bm{y}) + (1-\alpha)\, R_t^{\text{faith}}(\bm{u}, K)$ (2)

where $R_t^{\text{acc}}$ is the accuracy metric measuring the similarity between the generated response $\bm{u}$ and the ground-truth reference $\bm{y}$, $R_t^{\text{faith}}$ is the faithfulness metric evaluating the factual consistency between the generated response $\bm{u}$ and the knowledge text $K$, and $\alpha\in[0,1]$ is a coefficient used to balance the accuracy and faithfulness of generated responses. In this work, we choose SacreBLEU (Post, 2018) as $R_t^{\text{acc}}$ and BERTScore (Zhang et al., 2020a) as $R_t^{\text{faith}}$, as they are recognized as effective evaluation metrics for the knowledge-grounded dialogue generation task (Dziri et al., 2022a, b). Note that $R_t^{\text{acc}}$ and $R_t^{\text{faith}}$ are only assigned to the final token of the generated response. This reward is also regularized with the token-level KL penalty in the same way as in Equation 1.
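A minimal sketch of how the blended terminal reward in Equation 2 could be computed with off-the-shelf packages (sacrebleu and bert-score) is shown below; rescaling SacreBLEU to [0, 1] is an implementation assumption on our part, not a detail specified in the paper.

```python
import sacrebleu
from bert_score import score as bert_score

def blended_reward(response, reference, knowledge, alpha):
    """Equation 2: R = alpha * R_acc(u, y) + (1 - alpha) * R_faith(u, K),
    assigned to the final token of the generated response."""
    # Accuracy term: SacreBLEU between the response and the ground-truth reference.
    r_acc = sacrebleu.sentence_bleu(response, [reference]).score / 100.0
    # Faithfulness term: BERTScore F1 between the response and the knowledge text.
    _, _, f1 = bert_score([response], [knowledge], lang="en", verbose=False)
    r_faith = f1.item()
    return alpha * r_acc + (1.0 - alpha) * r_faith
```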

Learning $\alpha$ from Expert Demonstrations.

The selection of the coefficient $\alpha$ is crucial for learning an effective policy.[2] Grid search on the validation set is a common strategy, but it can be computationally intensive and prone to overfitting on the validation set. To address this issue, we propose to learn the coefficient $\alpha$ from a small number of expert pair-wise comparison demonstrations. Since the primary requirement of the reward function is to differentiate faithful and accurate responses, leveraging expert demonstrations can provide valuable guidance for determining an appropriate value of $\alpha$. Additionally, this approach reduces the computational burden and helps ensure the generalizability of the learned policy.

[2] Note that we cannot learn the coefficient $\alpha$ during training, because the reward is part of PPO's learning objective.

Specifically, we leverage two state-of-the-art large language models (LLMs), text-davinci-003 and GPT-3.5-turbo,[3] to generate 25 responses each based on one in-context example. We then shuffle the presentation order and ask an NLP expert to perform pair-wise comparisons between the two models’ outputs. An example is provided in Table 2. Next, we compute the reward in Equation 2 for both models’ outputs and align our reward comparison results with the expert pair-wise comparison results. The alignment is done by iterating over values of $\alpha$ and finding the value that maximizes the Pearson correlation coefficient (Pearson, 1895) between the expert pair-wise comparison results and our reward-based pair-wise comparison results. By learning $\alpha$ from these demonstrations, we can effectively calibrate the balance between faithfulness and accuracy in the generated responses.

[3] https://platform.openai.com/docs/api-reference

Input: LLMs’ outputs $\{(\bm{u}_n^1, \bm{u}_n^2)\}_{n=1}^{N}$. The expert demonstrations of pair-wise comparison on the two LLMs’ outputs $\{\hat{p}_n\}_{n=1}^{N}$.
Output: The optimal blending coefficient $\alpha_{human}$.
1: for $\alpha = 0.00, \dots, 1.00$ do
2:    Compute our reward on the two LLMs’ outputs using Equation 2 and get the scores $\{(r_n^1, r_n^2)\}_{n=1}^{N}$.
3:    Get the pair-wise comparison of our reward $p_n^{\alpha} = \operatorname{argmax}(r_n^1, r_n^2)$, for $n = 1, \dots, N$.
4:    Compute the Pearson correlation coefficient $r$ between $\{\hat{p}_n\}_{n=1}^{N}$ and $\{p_n^{\alpha}\}_{n=1}^{N}$.
5: end for
6: Save the $\alpha$ that achieves the highest Pearson correlation coefficient $r_{\max}$ as $\alpha_{human}$.
Algorithm 1: Learning $\alpha$ from expert demonstrations
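For reference, a straightforward Python sketch of Algorithm 1 is given below; it assumes the expert preference $\hat{p}_n$ is encoded as 1 when the first model's output is preferred and 2 otherwise, and it reuses a blended reward function such as the one sketched after Equation 2.

```python
import numpy as np
from scipy.stats import pearsonr

def learn_alpha(outputs1, outputs2, references, knowledges, expert_prefs,
                reward_fn, step=0.01):
    """Grid-search alpha in [0, 1] and keep the value whose reward-based
    preferences correlate best with the expert preferences (Algorithm 1)."""
    best_alpha, best_corr = 0.0, -1.0
    for alpha in np.arange(0.0, 1.0 + step, step):
        reward_prefs = []
        for u1, u2, y, k in zip(outputs1, outputs2, references, knowledges):
            r1 = reward_fn(u1, y, k, alpha)
            r2 = reward_fn(u2, y, k, alpha)
            reward_prefs.append(1 if r1 >= r2 else 2)  # argmax over the pair
        corr, _ = pearsonr(expert_prefs, reward_prefs)
        if corr > best_corr:
            best_alpha, best_corr = alpha, corr
    return best_alpha, best_corr
```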

4 Experiments

The experiments in this section are designed to answer the following research questions:

  1. RQ1: Does there exist a trade-off between faithfulness and accuracy in the knowledge-grounded dialogue generation task?

  2. RQ2: Can the proposed method improve the faithfulness and accuracy of model-generated responses?

  3. RQ3: Can expert demonstrations help effectively calibrate the value of $\alpha$ to learn a better policy?

Dataset | Method | SacreBLEU (Acc.) | ROUGE-L (Acc.) | BERTScore (Faith.) | Token-F1 (Faith.) | Overall
MultiDoc2Dial | R3 | 31.10 | 41.40 | - | - | -
MultiDoc2Dial | T5-SFT | 25.38 | 41.13 | 91.10 | 51.61 | 209.22
MultiDoc2Dial | T5-PPO-RoBERTa | 30.51 | 42.56 | 90.66 | 47.05 | 210.78
MultiDoc2Dial | T5-PPO-Ours | 31.15 | 43.28 | 91.45 | 51.81 | 217.69
FaithDial | T5-CTRL | 13.75 | 38.57 | 94.42 | 70.91 | 217.65
FaithDial | T5-SFT | 13.69 | 39.58 | 95.13 | 75.49 | 223.89
FaithDial | T5-PPO-RoBERTa | 11.87 | 36.75 | 92.06 | 52.05 | 192.73
FaithDial | T5-PPO-Ours | 12.75 | 36.98 | 98.79 | 94.71 | 243.23
Table 3: Accuracy (SacreBLEU, ROUGE-L) and faithfulness (BERTScore, Token-F1) evaluation results on the test sets of MultiDoc2Dial and FaithDial. The results of R3 are reprinted from the original paper (Bansal et al., 2022). The results of T5-CTRL come from the code and data released by Dziri et al. (2022a). The other results come from our implementation. The Overall score is the sum of the accuracy and faithfulness scores.
Dataset | $\alpha$ | SacreBLEU (Acc.) | ROUGE-L (Acc.) | BERTScore (Faith.) | Token-F1 (Faith.) | Overall
MultiDoc2Dial | $\alpha_{human}$ | 31.15 | 43.28 | 91.45 | 51.81 | 217.69
MultiDoc2Dial | $\alpha=1.00$ | 30.60 | 43.04 | 91.67 | 51.80 | 217.11
MultiDoc2Dial | $\alpha=0.00$ | 27.45 | 44.44 | 91.37 | 51.12 | 214.38
MultiDoc2Dial | $\alpha=0.25$ | 29.59 | 44.32 | 91.55 | 52.19 | 217.65
FaithDial | $\alpha_{human}$ | 12.75 | 36.98 | 98.79 | 94.71 | 243.23
FaithDial | $\alpha=1.00$ | 13.09 | 38.36 | 95.97 | 81.98 | 229.40
FaithDial | $\alpha=0.00$ | 12.59 | 36.95 | 91.57 | 54.47 | 195.58
FaithDial | $\alpha=0.85$ | 13.30 | 36.94 | 98.30 | 92.28 | 240.82
Table 4: Model performance under different values of $\alpha$ on the test sets of MultiDoc2Dial and FaithDial. The Overall score is the sum of the accuracy and faithfulness scores.

4.1 Experimental Setups

Benchmark Datasets.

We choose two information-seeking conversation datasets as our benchmarks: MultiDoc2Dial (Feng et al., 2021) and FaithDial (Dziri et al., 2022a). Both datasets involve two participants in each conversation: a user (or seeker) who initiates the conversation with a question, and a system (or wizard) who answers the user’s question by referring to a piece of knowledge text. Each conversation contains several turns and may involve topic shifts. The dataset statistics are shown in Table 5.

Dataset | Train | Valid | Test | Know. Tok.
MultiDoc2Dial | 21,453 | 4,201 | 4,094 | 106
FaithDial | 18,357 | 3,417 | 3,539 | 27
Table 5: Benchmark dataset statistics, where Know. Tok. indicates the average number of tokens per input knowledge text.

Competitive Methods.

We compare with the following knowledge-grounded dialogue generation methods:

  • R3 (Bansal et al., 2022): a retriever-reranker-reader system that achieves state-of-the-art performance on MultiDoc2Dial. The system uses a bi-encoder DistilSPLADE (Formal et al., 2021) retriever to fetch the top-100 relevant knowledge passages from the corpus, then applies a RoBERTa-based (Liu et al., 2019) cross-encoder to rerank them, and finally passes the top-10 reranked passages to a T5-based FiD (Izacard and Grave, 2021) reader to generate the response.

  • T5-CTRL (Dziri et al., 2022a): a controlled generation method that achieves state-of-the-art performance on FaithDial. Following Rashkin et al. (2021), it sets control feature tokens based on measures of entailment, lexical precision, and objective voice of the ground-truth response, to steer a T5-base model (Raffel et al., 2020b) toward generating responses faithful to the input knowledge texts.

  • T5-SFT: a supervised fine-tuning baseline for both datasets. This method directly fine-tunes the T5-base model with maximum likelihood estimation on ground-truth responses in the full training set.

  • T5-PPO-RoBERTa: a baseline reward function for PPO, implemented as a RoBERTa-base model that discriminates whether the current output is the ground-truth response given the conversation history and knowledge text. The purpose of this reward model is to encourage the model-generated response to match the ground-truth reference. The reward model is trained with an equal number of positive and negative examples, where the negative examples come from the outputs of T5-SFT. For PPO training, we use T5-SFT to initialize the policy network and value network.

  • T5-PPO-Ours: our proposed reward function for PPO introduced in subsection 3.3. We use T5-SFT to initialize the policy network and value network, and update them jointly with PPO.

History User: I am aspiring to be an accountant. I want to know more about accounting.
Knowledge Text Accounting, which has been called the ”language of business”, measures the results of an organization’s economic activities and conveys this information to a variety of users, including investors, creditors, management, and regulators.
Human Reference Accounting has been called the ”language of business”.
T5-SFT Accounting has been called the ”language of business”. It measures the results of an organization’s economic activities and conveys this information to a variety of users.
T5-PPO-Acc. ($\alpha=1.00$) I see, did you know that accounting has been called the ”language of business”? It measures the results of an organization’s economic activities and conveys this information to a variety of users?
T5-PPO-Faith. ($\alpha=0.00$) Ah, accounting is called the language of business.
T5-PPO-Ours ($\alpha_{human}$) Accounting has been called the ”language of business”, measures the results of an organization’s economic activities and conveys this information to a variety of users, including investors, creditors, management, and regulators.
Table 6: Different model-generated responses sampled from the test set of FaithDial. The green text highlights the ground-truth knowledge span.

Implementation Details.

We choose T5-base as our backbone language model for all experiments. For the input knowledge text in MultiDoc2Dial, we use the ground-truth knowledge passage preprocessed by the official code,[4] because we focus on the agent response generation subtask. For the T5-SFT baseline, we fine-tune the model on the ground-truth responses from the full training set. We fine-tune the model for 10 epochs using the AdamW optimizer (Loshchilov and Hutter, 2019) with a linearly decaying learning rate starting from $1\times 10^{-5}$. During inference, we use beam search with a beam size of 4 to generate the final response.

[4] https://github.com/IBM/multidoc2dial/blob/main/scripts/run_data_preprocessing.sh

The RoBERTa-base reward model in T5-PPO-RoBERTa is trained to discriminate the T5-SFT generated responses from the ground-truth responses. We train the reward model for 10 epochs using the AdamW optimizer with a constant learning rate of $1\times 10^{-6}$. The reward model achieves 89% accuracy on the test set of MultiDoc2Dial and 96% accuracy on the test set of FaithDial.
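For illustration, a hedged sketch of how such a discriminator could be used as a scalar reward at PPO time is given below; the input formatting and the use of the positive-class probability as the reward are our assumptions rather than details from the paper.

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# In practice, load the fine-tuned reward-model checkpoint instead of roberta-base.
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.eval()

@torch.no_grad()
def roberta_reward(history, knowledge, response):
    """Score how likely the response is the ground-truth one, given the
    conversation history and knowledge text (label 1 = ground truth)."""
    text = f"{history} </s> {knowledge} </s> {response}"  # hypothetical input format
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```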

For learning α\alpha from expert demonstrations, we collect 25 pair-wise comparison demonstrations on each dataset respectively. We find the optimal value of αhuman\alpha_{human} on MultiDoc2Dial is 0.04 with a Pearson correlation coefficient of 0.2278, and the optimal value of αhuman\alpha_{human} on FaithDial is 0.92 with a Pearson correlation coefficient of 0.2865.

For all PPO experiments, the policy network and value network share the same base model initialized from T5-SFT but have separate final output layers. The output layer of the value network is a linear layer that maps the last hidden state to a scalar value. We update the parameters for 10,000 iterations using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of $5\times 10^{-7}$. The policy is evaluated on the full validation set every 100 iterations, and the final policy is the one that achieves the highest total score in accuracy and faithfulness on the validation set. We use top-k ($k=50$) sampling to generate trajectories during training, and beam search with a beam size of 4 to generate the final response during testing.
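The sketch below illustrates one way to realize the shared backbone with separate output layers described above, assuming a T5 encoder-decoder; it is a simplification of how such policy/value networks are typically organized (e.g., in RL4LMs), not the library's exact code.

```python
import torch.nn as nn
from transformers import T5ForConditionalGeneration

class PolicyValueModel(nn.Module):
    """Policy and value networks share the T5 backbone (initialized from the
    T5-SFT checkpoint) but keep separate output layers."""
    def __init__(self, sft_checkpoint="t5-base"):  # path to T5-SFT is an assumption
        super().__init__()
        self.backbone = T5ForConditionalGeneration.from_pretrained(sft_checkpoint)
        # Value head: linear map from the last decoder hidden state to a scalar.
        self.value_head = nn.Linear(self.backbone.config.d_model, 1)

    def forward(self, input_ids, attention_mask, decoder_input_ids):
        out = self.backbone(input_ids=input_ids,
                            attention_mask=attention_mask,
                            decoder_input_ids=decoder_input_ids,
                            output_hidden_states=True)
        hidden = out.decoder_hidden_states[-1]        # (batch, seq_len, d_model)
        values = self.value_head(hidden).squeeze(-1)  # per-token state values V(s_t)
        return out.logits, values
```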

Evaluation Metrics.

We follow prior work (Feng et al., 2021; Dziri et al., 2022a) to evaluate the accuracy and faithfulness of model-generated responses. Specifically, we use SacreBLEU (Post, 2018) and ROUGE-L (Lin, 2004) to evaluate accuracy, which measures the similarity between model-generated responses and ground-truth responses; we use the F1 measure of BERTScore (Zhang et al., 2020a) and the token-level F1 score to evaluate faithfulness, which measures the similarity between model-generated responses and input knowledge texts.
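A minimal sketch of these four metrics using common packages (sacrebleu, rouge-score, bert-score) is shown below; the whitespace-token F1 formulation is a standard choice that we assume here rather than the benchmarks' exact evaluation scripts.

```python
from collections import Counter

import sacrebleu
from bert_score import score as bert_score
from rouge_score import rouge_scorer

def token_f1(response, knowledge):
    """Token-level F1 between a response and its knowledge text
    (assumed whitespace tokenization)."""
    pred, ref = response.split(), knowledge.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(responses, references, knowledges):
    # Accuracy: similarity between responses and ground-truth references.
    bleu = sacrebleu.corpus_bleu(responses, [references]).score
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = 100 * sum(scorer.score(ref, hyp)["rougeL"].fmeasure
                        for hyp, ref in zip(responses, references)) / len(responses)
    # Faithfulness: similarity between responses and input knowledge texts.
    _, _, f1 = bert_score(responses, knowledges, lang="en", verbose=False)
    tok_f1 = 100 * sum(token_f1(hyp, k)
                       for hyp, k in zip(responses, knowledges)) / len(responses)
    return {"SacreBLEU": bleu, "ROUGE-L": rouge_l,
            "BERTScore": 100 * f1.mean().item(), "Token-F1": tok_f1}
```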

History User: Can those over the age of 18 drive without supervision while holding a learner permit?
Knowledge Text Learner permit restrictions // No matter what age you are, if you hold a learner permit, you may not drive unless accompanied by a supervising driver age 21 or older who has a valid license to operate the vehicle you are driving. For example, only a person with a motorcycle license may supervise a person learning to drive a motorcycle. You may not drive with a learner permit on any street within a park in New York City on any bridge or tunnel under the jurisdiction of the Tri-borough Bridge and Tunnel Authority on the Cross County, Hutchinson River, Saw Mill River, or Taconic State parkways in Westchester County in a DMV road test area.
Human Reference No. Learner permit holders of any age can’t drive without a supervising driver.
T5-SFT Yes, (unfaithful) you may not drive unless accompanied by a supervising driver age 21 or older who has a valid license to operate the vehicle you are driving. For example, only a person with a motorcycle license may supervise (incorrect span)
T5-PPO-Acc. ($\alpha=1.00$) You may not drive with a learner permit on any street within a park in New York City on any bridge or tunnel under the jurisdiction of the Tri-borough Bridge and Tunnel Authority on the Cross County. (incorrect span)
T5-PPO-Faith. ($\alpha=0.00$) Yes. (unfaithful)
T5-PPO-Ours ($\alpha_{human}$) No. You may not drive unless accompanied by a supervising driver age 21 or older who has a valid license to operate the vehicle you are driving.
Table 7: Different model-generated responses sampled from the test set of MultiDoc2Dial. The green text highlights the ground-truth knowledge span. The gray text is the unfaithful model-generated response caused by the bias from the LLM. The red text is the incorrect model-generated response misled by the text that contains redundant information.

4.2 Result Analysis

RQ1: Faithfulness vs. Accuracy.

We first investigate whether there exists a trade-off between faithfulness and accuracy in the knowledge-grounded dialogue generation task. The empirical results in Table 3 indicate that the answer depends on the input knowledge text. When the input knowledge text contains redundant and irrelevant information, as in the case of MultiDoc2Dial, a good method can improve both accuracy and faithfulness, because the ground-truth reference helps locate the relevant text span in a long knowledge passage.

On the other hand, when the input knowledge text only contains the precise information needed for the answer, as in the case of FaithDial, the model may learn to directly copy from the knowledge text, which improves the performance in faithfulness but degrades the performance in accuracy. In this case, a good method should balance the performance improvement trade-off between faithfulness and accuracy.

RQ2: Our Method v.s. Others.

Generally speaking, our method achieves the best overall performance compared with other competitive baselines on both datasets, as shown in Table 3. On MultiDoc2Dial, T5-PPO-Ours achieves the best performance in both accuracy and faithfulness evaluation, indicating that our learning method has an advantage in locating the relevant information within a long knowledge document. T5-PPO-RoBERTa achieves improved accuracy but degraded faithfulness compared with T5-SFT, because its reward function is designed only to encourage the model-generated response to match the ground-truth reference.

On FaithDial, T5-PPO-Ours achieves the best performance in faithfulness but degraded performance in accuracy compared with T5-SFT. As discussed in RQ1, the knowledge text in FaithDial contains exactly the relevant information for the answer; consequently, the model learns to directly copy the knowledge text as its response when further trained with PPO. A generation example is provided in Table 6. Besides, T5-PPO-RoBERTa achieves the worst performance even though its reward model reaches 96% test accuracy, which indicates that the reward model also learns shortcut features in model-generated responses (e.g., the n-gram overlap between the response and the knowledge text).

RQ3: Calibrating $\alpha$.

As illustrated in Table 4, we experiment with different values of $\alpha$ for our reward function and find that the value learned from expert demonstrations, $\alpha_{human}$, achieves the best overall performance on both datasets. Using a single metric (i.e., $\alpha=0.00$ or $\alpha=1.00$) only achieves good performance along a single evaluation dimension and still produces inaccurate or unfaithful responses, as shown in Table 7.

Additionally, we compare $\alpha_{human}$ with the values found by grid search that achieve the highest total score on the validation set, i.e., $\alpha=0.25$ and $\alpha=0.85$. Surprisingly, we find that $\alpha_{human}$ achieves better overall performance on both datasets, which indicates that a few expert demonstrations not only help calibrate $\alpha$ to learn a better policy, but also provide good generalization ability for the reward function.

5 Conclusion

This work investigates how to improve faithfulness and accuracy in knowledge-grounded dialogue generation tasks. Firstly, we identify three major sources of the problem of generating unfaithful and inaccurate responses: the bias in LLMs, the irrelevant information in knowledge texts, and the characteristics of the supervised learning algorithm. Then, we address the problem by applying a reinforcement learning algorithm with a novel reward function. Our reward function can serve as a cost-effective approximation to a human preference reward model that would otherwise be learned from massive high-quality human preference annotations. Finally, we validate the effectiveness of our method on two information-seeking conversation datasets. The empirical results show that our method can outperform other strong supervised learning baselines.

Limitations

The good performance of the PPO algorithm requires not only a good reward function but also a good initial policy. The initial policy ensures that the trajectories sampled from the current policy are of high quality, which benefits convergence to a better policy. Therefore, pretraining better language models is always helpful for this method. In addition, on-policy RL algorithms require large GPU memory, so more memory-efficient training or model compression techniques could further benefit RL training.

Ethics Statement

This work complies with the ACL Ethics Policy. Both benchmark datasets and baseline models are collected from public academic resources and do not contain harmful, unfair, or discriminating content. We will also open-source our data and code once this paper is made public, in order to provide easily reproducible experiment configurations for future research.

References