"What do others think?":
Task-Oriented Conversational Modeling with Subjective Knowledge
Abstract
Task-oriented Dialogue (TOD) systems assist users in accomplishing specific goals, such as booking a hotel or a restaurant. Traditional TOD systems rely on domain-specific APIs/DBs or external factual knowledge to generate responses, which cannot accommodate subjective user requests (e.g., “Is the WIFI reliable?” or “Does the restaurant have a good atmosphere?”). To address this issue, we propose a novel task of subjective-knowledge-based TOD (SK-TOD). We also propose the first corresponding dataset, which contains subjective knowledge-seeking dialogue contexts and manually annotated responses grounded in subjective knowledge sources. When evaluated with existing TOD approaches, we find that this task poses new challenges such as aggregating diverse opinions from multiple knowledge snippets. We hope this task and dataset can promote further research on TOD and subjective content understanding. The code and the dataset are available at https://github.com/alexa/dstc11-track5.
"What do others think?":
Task-Oriented Conversational Modeling with Subjective Knowledge
Chao Zhao 1, Spandana Gella 2, Seokhwan Kim 2, Di Jin 2, Devamanyu Hazarika 2, Alexandros Papangelis 2, Behnam Hedayatnia 2, Mahdi Namazifar 2, Yang Liu 2, Dilek Hakkani-Tur 2
1 UNC Chapel Hill   2 Amazon, Alexa
[email protected]   {sgella,seokhwk,djinamzn}@amazon.com   {dvhaz,papangea,behnam,mahdinam,yangliud,hakkanit}@amazon.com
1 Introduction
Task-oriented Dialogue (TOD) systems assist users in accomplishing specific goals, such as booking a hotel or a restaurant. Most TOD solutions are based on domain APIs Budzianowski et al. (2018); Rastogi et al. (2020) and structured databases Eric et al. (2017); Wu et al. (2019), which can only handle a limited range of scenarios within the scope of those APIs/DBs. To further extend the scope of task-oriented assistance, recent works Dimitrakis et al. (2018); Kim et al. (2020, 2021); Feng et al. (2020, 2021); Majumder et al. (2022) incorporate unstructured textual information retrieved from the Internet into dialogue modeling. Most of these works focus on factual knowledge sources such as frequently asked questions (FAQs) of online products or government service guides. We refer to these models as Fact-TOD models.

However, in many TOD tasks, users care not only about factual information but also about subjective insights, such as the experiences, opinions, and preferences of other customers. For instance, when booking a hotel or a restaurant, users often inquire about subjective aspects like “Is the WIFI reliable?” or “Does the restaurant have a good atmosphere?”. To respond to such user requests, an agent needs to seek information from subjective knowledge sources, such as online customer reviews. While subjective knowledge has been specifically studied in other NLP problems such as opinion mining Liu and Zhang (2012) and question answering Bjerva et al. (2020), incorporating it into TOD has not received significant attention.
In this work, we argue that it is important to enable the TOD model to leverage subjective knowledge for more effective task-oriented assistance. To this end, we propose a novel task of subjective-knowledge-based task-oriented dialogue (SK-TOD). SK-TOD focuses on responding to user requests that seek subjective information by incorporating user reviews as subjective knowledge. Figure 1 shows three examples of such requests, where customers ask about the WiFi quality of various hotels. User reviews are valuable resources for subjective information because even for the same aspect of a product or service, customers may have different opinions and leave either positive or negative reviews. As a result, a TOD system should consider multiple reviews to provide a comprehensive representation of user opinions. Ideally, the system’s response should include both positive and negative opinions, along with their respective proportions (as exemplified in Dialogue 3). This two-sided response has been recognized as more credible and valuable for customers Kamins et al. (1989); Lee et al. (2008); Baek et al. (2012), thereby fostering trust in the TOD system.
Incorporating subjective knowledge into TOD introduces two unique challenges. Firstly, unlike in Fact-TOD where selecting a few relevant knowledge snippets suffices, the SK-TOD model must consider all relevant knowledge snippets. In other words, both precision and recall matter during this process. Secondly, the model needs to aggregate these knowledge snippets into a concise response that can faithfully reflect the diversity and proportion of opinions expressed. Conquering these challenges requires a large-scale dataset with subjective-knowledge-grounded responses, which, to our best knowledge, is not publicly available.
To facilitate the research in subjective-knowledge-grounded TOD, we have collected a large-scale dataset, which contains 19,696 subjective knowledge-seeking dialogue contexts and manually annotated responses that are grounded on 143 entities and 1,430 reviews (8,013 sentences). We evaluate the performance of strong baselines on the SK-TOD task. Results show that there is a significant gap between human-generated and machine-generated responses, particularly in terms of the faithfulness of the sentiment proportion. To address this issue, we propose a model that incorporates review understanding into SK-TOD. We experimentally demonstrate that responses generated by this model more effectively capture the sentiment proportion. Our contributions are three-fold:
• We introduce a novel task of subjective-knowledge-based TOD (SK-TOD);
• We create and release a large-scale, human-annotated dataset designed for this task;
• We propose a new model and conduct extensive experiments on the proposed task.
2 Related Work
2.1 Knowledge-Grounded Dialogue
Knowledge-grounded response generation is popular in the open-domain dialogue. Numerous external knowledge sources have been explored, from structured knowledge such as fact tables Moghe et al. (2018); Liu et al. (2018) and knowledge graphs Zhang et al. (2020a); Moon et al. (2019); Tuan et al. (2019), to unstructured knowledge such as Wikipedia articles Vougiouklis et al. (2016); Zhou et al. (2018); Dinan et al. (2018), news articles Majumder et al. (2020), web pages Long et al. (2017); Galley et al. (2019); Komeili et al. (2022), narratives Xu et al. (2021); Gopalakrishnan et al. (2019), user reviews and comments Moghe et al. (2018); Ghazvininejad et al. (2018), and so on. Grounding on external knowledge makes the response more informative and meaningful when compared with models that solely rely on the dialog context.
Regarding task-oriented dialogues, previous works have primarily relied on domain-specific APIs and databases to support the dialogue response Levin et al. (2000); Singh et al. (2002); Williams and Young (2007); Eric et al. (2017); Wu et al. (2019), which can only support a limited scope of user queries. Later works ground task-oriented dialogues in web pages Penha et al. (2019); Chen et al. (2022), government service documents Saeidi et al. (2018); Feng et al. (2020, 2021), and FAQ knowledge snippets Kim et al. (2020, 2021). Different from these works, which utilize factual knowledge, we generate responses using subjective knowledge and ground them in multiple knowledge snippets. While Majumder et al. (2022) also explored grounding TOD in user reviews, they did not consider the diversity of opinions.
Dataset | Size | Manual | Dial | TOD | Query | Aspect | Senti | Mul-Knwl | Senti-%
Semeval/MAMS Pontiki et al. (2016); Jiang et al. (2019) | 5K/22K | ✗ | n/a | ✗ | ✗ | n/a | |||
Space Angelidis et al. (2021) | 1K | ✗ | n/a | ✗ | ✗ | ||||
Yelp/Amazon Chu and Liu (2019); Bražinskas et al. (2020) | 200/180 | ✗ | n/a | ✗ | ✗ | ✗ | |||
Justify-Rec Ni et al. (2019) | 1.3M | ✗ | ✗ | n/a | ✗ | ✗ | ✗ | ||
AmazonQA McAuley and Yang (2016) | 309K | ✗ | ✗ | n/a | ✗ | ✗ | ✗ | n/a | |
SubjQA Bjerva et al. (2020) | 10K | ✗ | ✗ | n/a | ✗ | n/a | |||
Holl-E Moghe et al. (2018) | 9K | ✗ | ✗ | ✗ | ✗ | ✗ | |||
Foursquare Ghazvininejad et al. (2018) | 1M | ✗ | ✗ | ✗ | ✗ | ✗ | n/a | ||
SK-TOD (Ours) | 20K |
2.2 Subjective Content Understanding
Besides being used as an external knowledge source in dialogue systems, subjective content, especially user reviews, has been studied in various non-conversational NLP tasks. For example, opinion mining Pontiki et al. (2016); Jiang et al. (2019) focuses on extracting opinions and sentiments from user reviews. Opinion summarization Chu and Liu (2019); Zhao and Chaturvedi (2020); Bražinskas et al. (2020); Angelidis et al. (2021) distills multiple opinions into concise summaries. Subjective question answering McAuley and Yang (2016); Bjerva et al. (2020) has been proposed to answer questions based on user reviews. Explainable recommendation Ni et al. (2019) aims to generate review-based explanations for the items recommended by a recommendation system. Table 1 provides detailed comparisons between SK-TOD and these subjective-content-based benchmarks. Generally, SK-TOD requires creating a response that is appropriate to the dialogue context. It also requires grounding in multiple subjective knowledge snippets and explicitly considering the diversity of opinions and the proportion of sentiments.
3 Problem Formulation
Formally, we have a dialogue context $C = \{u_1, s_1, \ldots, u_{t-1}, s_{t-1}, u_t\}$ between a user and a system, where each user utterance $u_i$ is followed by a system response utterance $s_i$, except for the last user utterance $u_t$. The dialogue involves one or more entities, denoted as $E$. Alongside the dialogue, we have a subjective knowledge source $\mathcal{K}$ containing all the entities and their corresponding customer reviews. Each entity $e \in E$ is associated with multiple reviews $R_e = \{r_1, \ldots, r_n\}$. Each review $r$ can be divided into segments $\{k_1, \ldots, k_m\}$, such as paragraphs, sentences, or sub-sentential units. In this work, we regard each review sentence as a knowledge snippet $k$.
The SK-TOD task aims to identify whether $u_t$ is a subjective knowledge-seeking request and, if it is, to select the relevant knowledge snippets $K \subseteq \mathcal{K}$ from the knowledge source and finally generate a response $s_t$ grounded on $C$ and $K$.
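To make the task input and output concrete, the following sketch shows one possible in-memory representation of an SK-TOD instance; the field names are illustrative and do not correspond to the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KnowledgeSnippet:
    entity_id: str    # entity the review refers to (hotel/restaurant)
    review_id: str    # review the sentence was taken from
    sentence: str     # one review sentence, used as a knowledge snippet

@dataclass
class SKTODInstance:
    context: List[str]                      # u_1, s_1, ..., u_t (alternating user/system turns)
    is_subjective_request: bool             # KTD label for the last user turn u_t
    relevant_snippets: List[KnowledgeSnippet] = field(default_factory=list)  # gold K
    response: str = ""                      # human-written response grounded in K

# A toy instance mirroring the WiFi-quality requests in Figure 1.
example = SKTODInstance(
    context=["I need a hotel in the centre.", "Cityroomz is available.", "Is the wifi reliable there?"],
    is_subjective_request=True,
    relevant_snippets=[KnowledgeSnippet("hotel_cityroomz", "review_3",
                                        "The wifi kept dropping in the evenings.")],
    response="Most guests report that the wifi is unreliable, especially in the evenings.",
)
```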
4 Data Collection and Statistics
We ground the data collection in MultiWOZ Budzianowski et al. (2018); Eric et al. (2020). We select dialogues from the domains of hotels and restaurants. The data collection is conducted by a group of crowd workers through Amazon Mechanical Turk (AMT). To control the data quality, we only choose workers that are pre-qualified. More details can be found in Appendix A.
4.1 Annotation Guideline
Dialogues in MultiWOZ are collected based on single or multiple entities as the back-end database. To create a subjective knowledge source to support the SK-TOD task, we first collect multiple user reviews for each entity. To control the review collection, we provide the reviewer’s persona, as well as the aspects and sentiments of reviews to workers. We then ask workers to write a review with all the given information included. After collecting the reviews, we also annotate the aspect and sentiment information for each review sentence. Overall, we select 33 hotels and 110 restaurants from MultiWOZ, and collect 10 reviews for each entity. On average, each review contains 5.6 sentences and 56.71 tokens. More details about the review collection can be found in Appendix A.
After obtaining the reviews, we go back to the dialogue data to create the subjective user request. Following a similar procedure in Kim et al. (2020), for each dialogue, we provide an aspect that users are interested in (e.g., WIFI-quality of the hotel) and then ask the worker to insert a subjective user request into the dialogue. Workers are requested to carefully select the insertion position and write an utterance to maintain coherence and naturalness in the dialogue flow. Finally, we use the partial dialog until this newly inserted turn as an instance in our data. Utterances that come after the insertion position are removed from the dialogue instance.
So far, we have collected the dialogue context $C$ and the subjective knowledge source $\mathcal{K}$. The final step is to ground the dialogue in the knowledge source. We first ask workers to identify entities that are relevant to the subjective user request as gold entities. We then align the user request and review sentences of the gold entities by matching their aspect. For example, if the aspect of a user request is about the “WIFI quality” of a hotel, all review sentences discussing the “WIFI quality” of that specific hotel are considered relevant knowledge snippets. Note that the aspect information is only used to build the dataset and is not included in the problem formulation of SK-TOD, which means it is not available for model training. The goal of SK-TOD is to handle user requests with arbitrary aspects, and therefore we do not define a taxonomy of aspects as is done in dialogue state tracking. Finally, we provide the dialogue context and all related knowledge snippets and ask workers to write a natural and faithful response. We explicitly instruct workers to consider the diversity and proportion of opinions in all relevant knowledge snippets during response creation. Detailed instructions can be found in Appendix A.
4.2 Quality Control
To ensure the quality of our dataset, we took great care in selecting pre-qualified workers and designing annotation interfaces. We further conducted a human verification task on the entire dataset to identify invalid instances. The annotation showed that 81.89% of subjective-knowledge-seeking user turns are valid, with an Inter-Annotator Agreement (IAA) score of 0.9369 in Gwet’s gamma. For agent response turns, 96.78% were valid, with an IAA score of 0.9497 in Gwet’s gamma. Any invalid instances were filtered out or manually corrected before finalizing the dataset. We paid workers an average of $13.82/hr for data annotation and $14.77/hr for data verification. Both exceed the local living minimum wage. The details of our payment settings are elaborated on in Appendix A.
4.3 Data Statistics
Train | Val | Test | |
# instances | 14768 | 2129 | 2799 |
# seen instances | 14768 | 1471 | 1547 |
# unseen instances | 0 | 658 | 1252 |
# multi-entity instances | 412 | 199 | 436 |
Knowledge Snippets | |||
Avg. # snippets per instance | 3.80 | 4.07 | 4.21 |
Avg. # tokens per snippet | 14.68 | 15.49 | 14.5 |
Dialogue | |||
Avg. # utterances per instance | 9.29 | 9.44 | 9.36
Avg. # tokens per request | 8.65 | 8.94 | 9.12 |
Avg. # tokens per response | 24.18 | 23.61 | 23.86 |
We collected a total of 19,696 instances consisting of subjective user requests and subjective-knowledge-grounded responses. The average length of the subjective user request and the agent response is 8.75 and 24.07 tokens, respectively. While most of the instances contain a single entity, there are 1,047 instances where multiple entities are compared (like Dialogue 2 in Figure 1). On average, each instance requires 3.88 subjective knowledge snippets. To help identify the subjective knowledge-seeking user request, we also randomly sample another 18,383 dialogues with non-subjective user requests from the original MultiWOZ dataset.
We split the dataset into training (75%), validation (10.8%), and test (14.2%) sets. Table 2 presents the detailed statistics of each subset. Both the validation and test sets contain two subsets: the seen subset where the aspects of these instances are included in the training set, and the unseen subset where the aspects are not included in the training set. The unseen subset is designed to evaluate models’ ability to generalize to arbitrary aspects.
5 Subjective-Knowledge-Grounded TOD

In this section, we describe the method for SK-TOD. As shown in Figure 2, we follow the pipeline introduced by Kim et al. (2020) which comprises four sequential sub-tasks: knowledge-seeking turn detection (KTD), entity tracking (ET), knowledge selection (KS), and response generation (RG). We elaborate on each subtask below.
5.1 Knowledge-Seeking Turn Detection
The goal of KTD is to identify the user request that requires subjective knowledge. We regard it as a binary classification problem, where the input is the dialogue context and the output is a binary indicator.
We employ a pre-trained language model (e.g., BERT Devlin et al. (2019)) to encode the dialogue context $C$ and adopt the hidden state of the first token as its representation $h_C$. Then we apply a classifier to obtain the probability that the current user request is seeking subjective knowledge. That is,

$$h_C = \mathrm{Enc}(C), \qquad p(y = 1 \mid C) = \sigma(\mathbf{w}^{\top} h_C + b). \quad (1)$$
The model is finetuned with the binary cross-entropy loss.
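As an illustration, a minimal KTD training step could be implemented as in the sketch below, using a single-logit classification head with a sigmoid and binary cross-entropy; the checkpoint name, separator, and input formatting are placeholder choices rather than our exact configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# num_labels=1 gives a single logit, to which we apply a sigmoid for the
# binary "subjective knowledge-seeking or not" decision.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

def ktd_loss(dialogue_context: list, label: float) -> torch.Tensor:
    # Concatenate the dialogue turns into a single input sequence.
    text = " [SEP] ".join(dialogue_context)
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    logit = model(**inputs).logits.squeeze(-1)           # shape: (1,)
    target = torch.tensor([label])
    return torch.nn.functional.binary_cross_entropy_with_logits(logit, target)

loss = ktd_loss(["I need a hotel.", "Cityroomz is available.", "Is the wifi reliable?"], 1.0)
loss.backward()  # one optimizer step (e.g., AdamW) would follow
```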
5.2 Entity Tracking
The goal of ET is to identify the entities that are relevant to the user request. It can help to reduce the number of candidates during the knowledge selection step.
We adopt a word-matching-based method used by Jin et al. (2021) to extract relevant entities. It first normalizes entity names in the knowledge source using a set of heuristic rules. Then a fuzzy n-gram matching is performed between the normalized entity and all dialogue turns. To find the entities that are relevant to the last user request, we choose the last dialogue turn in which the entities are detected and use these entities as the output of ET. We leave the tracking of aspects being questioned over multiple turns as future work.
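A simplified version of this matching heuristic is sketched below; the normalization rules and the similarity threshold are illustrative stand-ins for the heuristics of Jin et al. (2021).

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Illustrative normalization: lowercase and drop generic words such as "hotel"/"restaurant".
    name = name.lower()
    return re.sub(r"\b(the|hotel|restaurant|guesthouse)\b", " ", name).strip()

def fuzzy_match(entity: str, turn: str, threshold: float = 0.85) -> bool:
    # Slide an n-gram window of the entity's length over the turn and keep the best ratio.
    ent_tokens, turn_tokens = normalize(entity).split(), turn.lower().split()
    n = len(ent_tokens)
    for i in range(max(len(turn_tokens) - n + 1, 1)):
        window = " ".join(turn_tokens[i:i + n])
        if SequenceMatcher(None, " ".join(ent_tokens), window).ratio() >= threshold:
            return True
    return False

def track_entities(dialogue_context: list, entity_names: list) -> list:
    # Return the entities mentioned in the most recent turn that mentions any entity.
    for turn in reversed(dialogue_context):
        matched = [e for e in entity_names if fuzzy_match(e, turn)]
        if matched:
            return matched
    return []

print(track_entities(["Does the Cityroomz have good wifi?"], ["Cityroomz", "Hamilton Lodge"]))
```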
5.3 Knowledge Selection
The goal of KS is to select the knowledge snippets that are relevant to the user’s request. The inputs are the dialogue context $C$ and a set of candidate knowledge snippets $\mathcal{K}_E$, which is the union of all knowledge snippets of the relevant entities in $E$. The output is a subset $K$ of relevant knowledge snippets. Note that there might be multiple relevant knowledge snippets in $K$.
To select relevant knowledge snippets, we calculate the relevance score between the dialogue context $C$ and a knowledge snippet $k$. We regard this as a pairwise text scoring problem and consider two popular approaches: bi-encoder Mazaré et al. (2018) and cross-encoder Wolf et al. (2019). Generally, the bi-encoder approach is more efficient, while the cross-encoder approach is more accurate.
For the bi-encoder approach, we encode $C$ and $k$ separately using the same pre-trained encoder and obtain two representations, $h_C$ and $h_k$. Following Reimers and Gurevych (2019), we use the concatenation of $h_C$, $h_k$, and $|h_C - h_k|$ as features and apply a classifier to obtain the probability of relevance. That is,

$$p(y = 1 \mid C, k) = \sigma\big(\mathbf{W}\,[h_C; h_k; |h_C - h_k|] + b\big). \quad (2)$$
For the cross-encoder approach, we encode the concatenation of $C$ and $k$ to obtain a contextualized representation $h_{C,k}$, from which the relevance probability is predicted. That is,

$$h_{C,k} = \mathrm{Enc}([C; k]), \qquad p(y = 1 \mid C, k) = \sigma(\mathbf{w}^{\top} h_{C,k} + b). \quad (3)$$
During training, we use all relevant knowledge snippets to construct positive $(C, k)$ pairs. Due to the large number of irrelevant knowledge snippets, we randomly sample the same number of irrelevant snippets to form negative pairs. We optimize the model using the binary cross-entropy loss. During inference, we predict the relevance probability for all knowledge snippets in the candidate set. Since both precision and recall are crucial in KS, instead of selecting the top few results, we use a threshold, estimated from the validation set, to determine the relevancy of each knowledge snippet.
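The threshold estimation step can be illustrated with the following sketch, which sweeps candidate thresholds over validation-set relevance probabilities and keeps the one with the best F1; it is a schematic example rather than our exact implementation.

```python
import numpy as np

def select_snippets(probs: np.ndarray, threshold: float) -> np.ndarray:
    # Keep every snippet whose relevance probability clears the threshold
    # (rather than a fixed top-k), since both precision and recall matter.
    return probs >= threshold

def tune_threshold(val_probs: np.ndarray, val_labels: np.ndarray) -> float:
    # Sweep candidate thresholds and keep the one with the best snippet-level F1.
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.05, 0.95, 19):
        pred = select_snippets(val_probs, t)
        tp = np.sum(pred & (val_labels == 1))
        precision = tp / max(pred.sum(), 1)
        recall = tp / max((val_labels == 1).sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-8)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Toy example: probabilities for five candidate snippets and their gold labels.
probs = np.array([0.92, 0.10, 0.77, 0.40, 0.65])
labels = np.array([1, 0, 1, 0, 1])
print(tune_threshold(probs, labels))
```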
5.4 Response Generation
The goal of RG is to create an utterance that addresses the user’s request. This response is generated based on the dialogue context $C$ and the set of relevant knowledge snippets $K$. To accomplish this, we concatenate $C$ and $K$ as the input and use a pre-trained generation model to generate the response $s_t$. We consider both decoder-only models, such as GPT-2 Radford et al., and encoder-decoder models, such as BART Lewis et al. (2020). The model is trained to maximize the generation probability $p(s_t \mid C, K)$.
To accurately capture the diversity and proportion of opinions, the model needs to understand the sentiment polarity of each knowledge snippet, which is challenging due to the lack of direct supervision. To address this issue, we apply a state-of-the-art aspect-based sentiment analysis (ABSA) model Zhang et al. (2021) to predict the sentiment $a_k$ for each knowledge snippet $k \in K$. Then we incorporate the sentiment information into RG by maximizing $p(s_t \mid C, K, A)$, where $A$ denotes the set of predicted sentiments.
More specifically, we first convert the predicted $a_k$ into a natural language description using templates, and then append it to the end of the corresponding $k$ as the enhanced input of RG. For example, given the knowledge snippet $k$ = “The ambience was so fun.”, the ABSA model detects the aspect-based sentiment as (“ambience”, “positive”). We first convert this sentiment into the natural language description “ambience is great.” and then enhance the knowledge snippet as “The ambience was so fun. ambience is great.”. We refer to Appendix B for more details.
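The following sketch illustrates this template-based enhancement; the templates and the `predict_absa` stub stand in for the ABSA model of Zhang et al. (2021) and are purely illustrative.

```python
from typing import List, Tuple

# Hypothetical templates mapping a predicted sentiment polarity to a short
# natural-language description of the aspect.
TEMPLATES = {
    "positive": "{aspect} is great.",
    "negative": "{aspect} is bad.",
    "neutral": "{aspect} is okay.",
}

def predict_absa(snippet: str) -> List[Tuple[str, str]]:
    # Placeholder for the ABSA model; it should return (aspect, sentiment) pairs.
    # Here we hard-code the example from the paper for illustration.
    return [("ambience", "positive")]

def enhance_snippet(snippet: str) -> str:
    # Append one templated sentence per predicted (aspect, sentiment) pair.
    descriptions = [TEMPLATES[sent].format(aspect=asp) for asp, sent in predict_absa(snippet)]
    return snippet + " " + " ".join(descriptions)

print(enhance_snippet("The ambience was so fun."))
# -> "The ambience was so fun. ambience is great."
```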
6 Experiments on Sub-Tasks
We first conduct experiments on each individual subtask. To avoid any error accumulation from upstream tasks, we use the gold output of the previous task as the input to the current target task. The detailed experimental setup can be found in Appendix C.
6.1 Knowledge-Seeking Turn Detection
Setting We conduct experiments using various pre-trained language models, including BERT Devlin et al. (2019), RoBERTa Liu et al. (2019), ALBERT Lan et al. (2020), and DeBERTa He et al. (2021). We use the base version of all pre-trained models.
Evaluation We report precision, recall, F1 score, and accuracy.
Acc | P | R | F | |
---|---|---|---|---|
BERT | 99.67 | 99.75 | 99.61 | 99.68 |
RoBERTa | 99.74 | 99.86 | 99.64 | 99.75 |
ALBERT | 99.49 | 99.64 | 99.36 | 99.50 |
DeBERTa | 99.71 | 99.86 | 99.57 | 99.71 |
Results Table 3 shows the results of the KTD task. All models achieve similar and near-perfect performance, which is in line with the findings of Kim et al. (2020). This demonstrates that it is feasible to identify the user requests that require subjective knowledge, allowing them to be explicitly addressed by an SK-TOD component. However, this KTD classifier’s performance may be specific to this dataset or similar domains, and its generalizability to unseen domains or knowledge types requires further exploration in future work.
6.2 Entity Tracking
Setting We follow the setting of Jin et al. (2021) to run the ET method.
Evaluation We report the instance-level accuracy score. An instance is regarded as accurate only if the predicted entities match exactly with the gold entities.
Results The fuzzy n-gram matching method achieves an instance-level accuracy of 92.18%. We further analyzed the type of errors. For 1.8% of the instances, there is at least one gold entity missing from the predicted entities. For 7.6% of the instances, the predicted entities contain at least one spurious entity. The latter error case can be further reduced by using model-based matching approaches, which we leave as future work.
6.3 Knowledge Selection
Setting We fine-tune the KS models following the same setting as in the KTD task. Additionally, we compare them with traditional information retrieval (IR) baselines, such as TF-IDF Manning et al. (2008) and BM25 Robertson et al. (2009).
Evaluation Knowledge selection can be viewed as either a classification task or a retrieval task. For classification, we use precision, recall, and F1 measures. We calculate these measures at both the instance level and the snippet level. For the instance level, we first calculate P/R/F1 for each instance, and then take the average over all instances as the final scores. For the snippet level, instead of computing P/R/F1 for each instance, we calculate these scores over all $(C, k)$ pairs in the entire dataset. Regarding retrieval evaluation, we use mean average precision (mAP) as the metric, which does not depend on a specific threshold value and reflects the overall ranking positions of all relevant knowledge snippets. Since the total number of relevant knowledge snippets can vary for each instance, we do not include top-K-based measures like Precision@K or Recall@K, which are commonly used in other Fact-TOD and knowledge-grounded open-domain dialogue tasks.
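For clarity, the sketch below shows one way to compute the instance-level and snippet-level P/R/F1 as well as mAP with scikit-learn; it illustrates the metric definitions above and is not our evaluation script.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, average_precision_score

def instance_level_f1(per_instance_preds, per_instance_labels):
    # Compute P/R/F1 per instance, then average the per-instance scores.
    scores = [precision_recall_fscore_support(y, p, average="binary", zero_division=0)[:3]
              for y, p in zip(per_instance_labels, per_instance_preds)]
    return np.mean(scores, axis=0)  # (P, R, F1)

def snippet_level_f1(per_instance_preds, per_instance_labels):
    # Pool all (context, snippet) pairs across the dataset before scoring.
    y = np.concatenate(per_instance_labels)
    p = np.concatenate(per_instance_preds)
    return precision_recall_fscore_support(y, p, average="binary", zero_division=0)[:3]

def mean_average_precision(per_instance_scores, per_instance_labels):
    # Average precision per instance from the raw relevance scores, then mean over instances.
    return np.mean([average_precision_score(y, s)
                    for y, s in zip(per_instance_labels, per_instance_scores)])

# Toy example with two instances.
labels = [np.array([1, 0, 1]), np.array([0, 1])]
preds  = [np.array([1, 0, 0]), np.array([0, 1])]
scores = [np.array([0.9, 0.2, 0.4]), np.array([0.1, 0.8])]
print(instance_level_f1(preds, labels), snippet_level_f1(preds, labels),
      mean_average_precision(scores, labels))
```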
Instance-level | Snippet-level | mAP | ||||||
P | R | F | P | R | F | |||
IR Baselines | ||||||||
TF-IDF | 34.61 | 70.33 | 40.46 | 23.81 | 65.00 | 34.85 | 45.97 | |
BM25 | 31.38 | 40.95 | 32.21 | 31.14 | 32.42 | 31.77 | 45.42 | |
Bi-encoder | ||||||||
BERT | 56.66 | 70.06 | 59.31 | 58.87 | 74.69 | 65.84 | 71.59 | |
RoBERTa | 60.98 | 83.06 | 66.47 | 54.40 | 85.38 | 66.46 | 77.25 | |
ALBERT | 70.21 | 78.74 | 70.43 | 63.13 | 78.90 | 70.14 | 81.62 | |
DeBERTa | 71.46 | 83.18 | 72.44 | 62.64 | 83.50 | 71.58 | 83.43 | |
Cross-encoder | ||||||||
BERT | 85.18 | 86.01 | 83.33 | 82.40 | 83.82 | 83.11 | 90.06 | |
RoBERTa | 81.59 | 83.62 | 80.53 | 82.20 | 80.77 | 81.48 | 88.98 | |
ALBERT | 86.18 | 87.29 | 84.22 | 83.56 | 84.78 | 84.16 | 90.50 | |
DeBERTa | 86.07 | 87.64 | 84.6 | 82.70 | 85.71 | 84.18 | 91.84 | |
Seen | 88.80 | 93.45 | 89.93 | 90.83 | 89.90 | 90.37 | 95.70 | |
Unseen | 82.68 | 80.47 | 78.03 | 69.98 | 78.29 | 73.90 | 87.07 |
Results Table 4 shows the results of the KS task. Firstly, when comparing our models with IR baselines, all of the trained models outperform the baselines, indicating that the KS model can benefit from the annotated training data. We then compare bi-encoder models and cross-encoder models, and as expected, cross-encoder models outperform bi-encoder models by a large margin. When comparing the performance of different pre-trained models, there is a notable difference among the models under the bi-encoder setting. The variance becomes smaller when applying the cross-encoder architecture. DeBERTa achieves the best performance on all measures in both the bi-encoder and cross-encoder settings.
Finally, we compare the performance between the seen subset and the unseen subset. At the bottom of Table 4, we list the performance of DeBERTa on both the seen and unseen test subsets. The results reveal a large gap between the performance of the two subsets, indicating that one of the challenges for the KS model is to generalize from seen aspects to unseen aspects.
6.4 Response Generation
Setting We experiment with decoder-only generation models such as GPT-2 Radford et al. and DialoGPT Zhang et al. (2020c), as well as encoder-decoder models such as BART Lewis et al. (2020) and T5 Raffel et al. (2020). We use the base version of all pre-trained models. We also include two ABSA-enhanced models, BART+ABSA and T5+ABSA. During decoding, we use beam search with top-K sampling Fan et al. (2018), setting the beam size to 5 and sampling from the top 50 tokens. We also compare with a random extractive baseline (EXT), where the response is created by randomly selecting a relevant knowledge snippet.
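A minimal decoding call consistent with this setting is sketched below; the checkpoint name, the separator token, and the exact input formatting are illustrative assumptions rather than our released configuration.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

def generate_response(context_turns, snippets):
    # Concatenate the dialogue context and the selected knowledge snippets;
    # "<knowledge>" is a plain-text separator here, not a special token.
    source = " ".join(context_turns) + " <knowledge> " + " ".join(snippets)
    inputs = tokenizer(source, truncation=True, max_length=1024, return_tensors="pt")
    # Beam search combined with top-k sampling, matching the decoding setup above.
    output = model.generate(**inputs, num_beams=5, do_sample=True, top_k=50, max_new_tokens=64)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```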
Evaluation Following the evaluation of other generation tasks, we employ several automatic evaluation metrics, including BLEU Papineni et al. (2002), ROUGE Lin (2004), METEOR Banerjee and Lavie (2005), and BERTScore Zhang et al. (2020b), to evaluate the quality of the generated responses against the reference responses. We also conduct a human evaluation, where we ask crowd workers to rate the quality of responses.
Results
BLEU | R-1 | R-2 | R-L | MT | BS | Len | |
EXT | 2.89 | 23.17 | 6.53 | 18.33 | 9.62 | 30.83 | 14.93 |
GPT2 | 9.04 | 33.9 | 13.52 | 26.73 | 16.27 | 39.73 | 22.66 |
DialoGPT | 9.19 | 33.6 | 13.62 | 26.81 | 16.15 | 39.72 | 22.05 |
BART | 10.8 | 36.35 | 15.04 | 28.57 | 17.96 | 41.12 | 24.02 |
BART+ABSA | 10.78 | 36.30 | 15.36 | 28.47 | 18.06 | 41.75 | 23.66
T5 | 10.72 | 36.50 | 15.57 | 28.81 | 18.33 | 40.84 | 25.36 |
T5+ABSA | 10.97 | 36.66 | 15.51 | 28.88 | 18.15 | 40.94 | 24.75
As presented in Table 5, machine-generated responses significantly outperform the extractive responses. Encoder-decoder models achieve better performance across all automatic measures compared to GPT-based models, indicating that they are more suitable for this task. They also tend to generate longer responses. There is no clear difference in the automatic measures when comparing BART and T5. Among the ABSA-enhanced models, BART+ABSA achieves the best performance on BERTScore, while T5+ABSA achieves the best scores on BLEU and ROUGE.
Human Evaluation To obtain a more reliable assessment of response quality, we also conduct a human evaluation on AMT. We use the same group of workers involved in the data collection process. During the evaluation, we show the dialogue context, the oracle knowledge snippets, and all responses (both the reference and the generated responses) to the workers. We randomly sample 240 instances from the test set for evaluation. For each instance, we ask three independent workers to compare the responses based on three measures:
• Appropriateness: whether the response is fluent and naturally connected to the dialogue context.
• Aspect Accuracy: whether the response provides relevant and useful information about the aspect that the user queried.
• Sentiment Accuracy: whether the sentiment proportion provided by the response is consistent with that of the subjective knowledge.
For sentiment accuracy, we first ask workers to annotate the sentiment label of each knowledge snippet, and then evaluate each response. All three measures are evaluated using a 5-Point Likert scale. The system-level score is computed as the average score over all instances and workers for each system. The compensation for workers was set at $0.25 for the tasks of appropriateness and aspect accuracy, and $0.4 for the task of sentiment accuracy. The average hourly pay for the crowd workers was $15.25/hr, $14.40/hr, and $14.85/hr for each evaluation task, exceeding the local living minimum wage.
Table 6 shows the results of human evaluation for response generation. The inter-annotator agreement scores for each task are 0.7270, 0.7535, and 0.6239 in Gwet’s gamma, respectively. The results show that machine-generated responses are comparable to the references in terms of appropriateness and aspect accuracy. Moreover, incorporating ABSA can improve the model’s performance in sentiment accuracy. However, there is still a large gap in sentiment accuracy between the best model-generated responses and the references, indicating that faithfully aggregating sentiment information from multiple knowledge snippets is still a challenging task for current models.
Approp. | Asp-Acc | Senti-Acc | |
---|---|---|---|
EXT | 2.65 | 3.32 | 3.13 |
GPT2 | 4.55 | 4.54 | 3.20 |
BART | 4.55 | 4.68 | 3.56 |
BART+ABSA | 4.58 | 4.66 | 3.80
T5 | 4.40 | 4.63 | 3.87 |
T5+ABSA | 4.49 | 4.67 | 3.98
Reference | 4.70 | 4.77 | 4.50 |
Dialogue History:
U: Are there any hotels (not guesthouses) that have 0 stars?
S: Yes, actually. Cityroomz is a moderately priced 0 star hotel.
U: Do they have free wifi?
S: Yes, they do have free wifi! Would you like me to book a stay for you?
U: Does the Cityroomz have strong water pressure in the shower?
Reviews:
Review 1: However, I noticed that when taking a shower there was very little water pressure which made the shower disappointing.
Review 2: The water pressure is not good and it comes out really slow.
Review 3: Everything else was a disaster, especially the lack of cleanliness in the room when we checked in, and the incredibly low water pressure in the bathroom when we took our morning showers.
Review 4: I encounter a problem when trying to shower and washing hands.
Review 5: There is no luggage storage service or anything like that, and the water pressure is much to be desired.
Model Outputs:
GPT-2: The Cityroomz gets some mixed reviews on their water pressure. Some guests found the water pressure to be very good while others found it to be disappointing. Would you like to know more about them?
BART: The Cityroomz hotel has very low water pressure according to most guests. Only one person had a good experience with it.
BART+ABSA: The Cityroomz has a lot of complaints about their water pressure being a bit low. Would you like me to find another place with better ratings?
T5: The Cityroomz has mixed reviews about their water pressure. Some guests found the water pressure to be very low while others found it to be disappointing. Would you like to know more about them?
T5+ABSA: The Cityroomz has a lot of reviews that say the water pressure is very low and disappointing. Do you want to look at some other places?
Reference:
No, guests consistently complain about the water pressure, unfortunately. Will that be okay or should I do another search?
Qualitative Analysis Table 7 shows an example of responses generated by the various systems. In this example, all the reviews express negative opinions about the water pressure. However, the responses generated by GPT-2 and BART include positive opinions. T5 correctly mentions the negative opinions, but the generated response is not natural and coherent. By incorporating the ABSA model, both BART+ABSA and T5+ABSA correctly generate responses containing only negative opinions.
KS | RG | |||||
---|---|---|---|---|---|---|
Macro-F | mAP | BLEU | R-L | BS | ||
RG | - | - | 10.80 | 28.52 | 41.12 | |
+KS | 84.60 | 91.84 | 10.20 | 27.78 | 40.64 | |
+ET+KS | 83.47 | 90.45 | 10.29 | 27.80 | 40.56 | |
+KTD+ET+KS | 83.46 | 90.45 | 10.27 | 27.79 | 40.55 |
KTD | KS | RG | ||||||
---|---|---|---|---|---|---|---|---|
Acc | Macro-F | mAP | BLEU | R-L | BS | |||
Fact-TOD | 87.62 | 59.55 | 76.69 | 6.15 | 23.25 | 33.16 | ||
SK-TOD | 99.71 | 84.60 | 91.84 | 10.80 | 28.57 | 41.12 |
7 Experiments on End-2-End Evaluation
In the previous section, we use gold information as input for each module to avoid error accumulation. In this section, we evaluate the entire pipeline in an end-to-end manner, where the input of each subtask is predicted by the previous component. We gradually add KS, ET, and KTD to the pipeline, and list the performance of KS and RG in Table 8.
The results show that errors introduced during KS decrease the quality of response generation. However, ET and KTD do not have a significant impact on the performance of downstream tasks. This is because the ET and KTD outputs contain fewer noisy predictions than the KS outputs.
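For reference, the end-to-end pipeline simply chains the four components, as in the sketch below; the callables passed in stand in for the trained KTD, ET, KS, and RG modules described in Section 5 and are hypothetical interfaces.

```python
def sk_tod_pipeline(context, knowledge_source, ktd, et, ks, rg):
    """Chain the four sub-modules; each argument after knowledge_source is a callable."""
    if not ktd(context):                              # knowledge-seeking turn detection
        return None                                   # fall back to the API/DB-based TOD system
    entities = et(context, list(knowledge_source))    # entity tracking
    candidates = [s for e in entities for s in knowledge_source[e]]
    snippets = ks(context, candidates)                # thresholded knowledge selection
    return rg(context, snippets)                      # response generation grounded in snippets

# Toy wiring with trivial stand-ins for the learned components.
reply = sk_tod_pipeline(
    context=["Is the wifi reliable at Cityroomz?"],
    knowledge_source={"Cityroomz": ["The wifi kept dropping.", "Great breakfast."]},
    ktd=lambda c: True,
    et=lambda c, ents: ents,
    ks=lambda c, cands: [s for s in cands if "wifi" in s.lower()],
    rg=lambda c, k: "Guests report that the wifi is unreliable." if k else "No reviews found.",
)
print(reply)
```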
8 Comparison with Fact-TOD
One difference between SK-TOD and Fact-TOD is that responses in SK-TOD are grounded on subjective knowledge instead of factual knowledge. In this section, we investigate whether a Fact-TOD model can ground on subjective knowledge to address subjective requests. To this end, we re-train our KTD (DeBERTa), KS (DeBERTa cross-encoder), and RG (BART) models using the FAQ-grounded TOD data provided by Kim et al. (2020). We then apply them to the test set of SK-TOD without further training. We compare the results of each sub-task obtained with the Fact-TOD models against the results of models trained on SK-TOD.
As shown in Table 9, for all tasks, there is a significant performance gap between models trained on Fact-TOD and on SK-TOD. We also provide sampled outputs of the Fact-TOD model and the SK-TOD model in Table 10. By inspecting the model outputs, we observe that the Fact-TOD model tends to ground on and copy information from only a single knowledge snippet. This behavior indicates that it is difficult to apply the Fact-TOD model to the SK-TOD task directly, as the Fact-TOD model lacks the ability to effectively aggregate information from multiple knowledge snippets, especially when there are diverse and contradictory opinions. The results also highlight that, compared to Fact-TOD, SK-TOD poses new challenges in terms of subjective content understanding and dialogue modeling when integrating subjective knowledge into the responses.
Dialogue History:
U: I am looking for a guesthouse to stay at that has free wifi.
S: Ok. Were you hoping for a particular price range or part of town?
U: I don’t care about the part of town or the price range, but it needs to be a three star hotel.
S: The Hamilton Lodge would be perfect! It is a moderately priced 3 star guesthouse on the north side of town. Does that sound okay to you?
U: Does it have comfortable beds?
Reviews:
Review 1: The bed really needed a new mattress, too.
Review 2: That being said, the room was spacious and the bed was very comfortable.
Review 3: We both slept extremely well every night we were there.
Model Outputs:
Fact-TOD: Yes, the Hamilton Lodge has a new mattress. Would you like to make a reservation?
SK-TOD: The Hamilton Lodge has really comfortable beds according to most guests, but one guest did say that the bed needed a new mattress.
Reference:
There are some mixed reviews on the beds. Some say they’re very comfortable while others were not impressed. Would you like me to find another place with better reviews?
9 Conclusion
In this paper, we have introduced SK-TOD: a novel task focused on subjective-knowledge-based task-oriented dialogue response generation. We create and release a large-scale, manually-annotated dataset for this task. Incorporating subjective knowledge requires models to accurately identify all relevant knowledge snippets and faithfully aggregate the information into concise and contextually appropriate responses, which brings unique challenges to this task. Experiments with strong baselines show that there is a significant performance gap between human-generated and machine-generated responses, particularly in faithfully capturing the diversity and proportion of opinions present in the subjective knowledge. We hope this task together with the provided dataset can promote future research on knowledge-grounded TOD systems and subjective content understanding.
Limitations
The dataset we collected contains two domains, restaurants and hotels. However, to evaluate the model’s ability to generalize across different domains, it would be beneficial to include more domains in the dataset. Additionally, to address privacy and copyright concerns, we used crowd-sourcing to collect review data, resulting in fewer and shorter reviews than those found in real-world scenarios. This limitation can be mitigated by sampling informative and reliable reviews from real-world data. Regarding the model, we did not investigate more complex models, such as large language models and novel architectures. However, we provide a strong baseline method that will serve as a benchmark for more advanced methods by the research community.
Ethical Considerations
To build our dataset, we collected the dialogue data by augmenting MultiWOZ 2.1, which is a publicly available English dialogue dataset under MIT license. Additionally, we collected the review data using crowd-sourcing, where we provided crowd workers with the reviewer’s persona, as well as the aspects and sentiments of reviews. This controlled review collection process helps to exclude offensive or harmful content from the reviews. It also helps to avoid privacy or copyright issues when making the dataset publicly available. Our dataset is available under the CDLA-Sharing 1.0 license.
References
- Angelidis et al. (2021) Stefanos Angelidis, Reinald Kim Amplayo, Yoshihiko Suhara, Xiaolan Wang, and Mirella Lapata. 2021. Extractive opinion summarization in quantized transformer spaces. Transactions of the Association for Computational Linguistics, 9:277–293.
- Baek et al. (2012) Hyunmi Baek, JoongHo Ahn, and Youngseok Choi. 2012. Helpfulness of online consumer reviews: Readers’ objectives and review cues. International Journal of Electronic Commerce, 17(2):99–126.
- Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, pages 65–72. Association for Computational Linguistics.
- Bjerva et al. (2020) Johannes Bjerva, Nikita Bhutani, Behzad Golshan, Wang-Chiew Tan, and Isabelle Augenstein. 2020. SubjQA: A Dataset for Subjectivity and Review Comprehension. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5480–5494, Online. Association for Computational Linguistics.
- Bražinskas et al. (2020) Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization as copycat-review generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5151–5169.
- Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.
- Chen et al. (2022) Zhiyu Chen, Bing Liu, Seungwhan Moon, Chinnadhurai Sankar, Paul Crook, and William Yang Wang. 2022. KETOD: Knowledge-enriched task-oriented dialogue. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2581–2593, Seattle, United States. Association for Computational Linguistics.
- Chu and Liu (2019) Eric Chu and Peter Liu. 2019. Meansum: a neural model for unsupervised multi-document abstractive summarization. In International Conference on Machine Learning, pages 1223–1232. PMLR.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Dimitrakis et al. (2018) Eleftherios Dimitrakis, Konstantinos Sgontzos, Panagiotis Papadakos, Yannis Marketakis, Alexandros Papangelis, Yannis Stylianou, and Yannis Tzitzikas. 2018. On finding the relevant user reviews for advancing conversational faceted search. In EMSASW@ ESWC, pages 22–31.
- Dinan et al. (2018) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of wikipedia: Knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241.
- Eric et al. (2020) Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, Adarsh Kumar, Anuj Goyal, Peter Ku, and Dilek Hakkani-Tur. 2020. MultiWOZ 2.1: A consolidated multi-domain dialogue dataset with state corrections and state tracking baselines. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 422–428, Marseille, France. European Language Resources Association.
- Eric et al. (2017) Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49, Saarbrücken, Germany. Association for Computational Linguistics.
- Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.
- Feng et al. (2021) Song Feng, Siva Sankalp Patel, Hui Wan, and Sachindra Joshi. 2021. MultiDoc2Dial: Modeling dialogues grounded in multiple documents. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6162–6176, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Feng et al. (2020) Song Feng, Hui Wan, Chulaka Gunasekara, Siva Patel, Sachindra Joshi, and Luis Lastras. 2020. doc2dial: A goal-oriented document-grounded dialogue dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8118–8128, Online. Association for Computational Linguistics.
- Galley et al. (2019) Michel Galley, Chris Brockett, Xiang Gao, Jianfeng Gao, and Bill Dolan. 2019. Grounded response generation task at dstc7. In AAAI Dialog System Technology Challenges Workshop.
- Ghazvininejad et al. (2018) Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, William B. Dolan, Jianfeng Gao, Wen tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In AAAI.
- Gopalakrishnan et al. (2019) Karthik Gopalakrishnan, Behnam Hedayatnia, Qinglang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tür, and Amazon Alexa AI. 2019. Topical-chat: Towards knowledge-grounded open-domain conversations. In INTERSPEECH, pages 1891–1895.
- Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677.
- He et al. (2021) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.
- Jiang et al. (2019) Qingnan Jiang, Lei Chen, Ruifeng Xu, Xiang Ao, and Min Yang. 2019. A challenge dataset and effective models for aspect-based sentiment analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6280–6285, Hong Kong, China. Association for Computational Linguistics.
- Jin et al. (2021) Di Jin, Seokhwan Kim, and Dilek Hakkani-Tur. 2021. Can i be of further assistance? using unstructured knowledge access to improve task-oriented conversational modeling. In Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021), pages 119–127.
- Kamins et al. (1989) Michael A Kamins, Meribeth J Brand, Stuart A Hoeke, and John C Moe. 1989. Two-sided versus one-sided celebrity endorsements: The impact on advertising effectiveness and credibility. Journal of advertising, 18(2):4–10.
- Kim et al. (2020) Seokhwan Kim, Mihail Eric, Karthik Gopalakrishnan, Behnam Hedayatnia, Yang Liu, and Dilek Hakkani-Tur. 2020. Beyond domain APIs: Task-oriented conversational modeling with unstructured knowledge access. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 278–289, 1st virtual meeting. Association for Computational Linguistics.
- Kim et al. (2021) Seokhwan Kim, Yang Liu, Di Jin, Alexandros Papangelis, Karthik Gopalakrishnan, Behnam Hedayatnia, and Dilek Hakkani-Tür. 2021. “how robust ru?”: Evaluating task-oriented dialogue systems on spoken conversations. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1147–1154. IEEE.
- Komeili et al. (2022) Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. Internet-augmented dialogue generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8460–8478, Dublin, Ireland. Association for Computational Linguistics.
- Lan et al. (2020) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations.
- Lee et al. (2008) Jumin Lee, Do-Hyung Park, and Ingoo Han. 2008. The effect of negative online consumer reviews on product attitude: An information processing view. Electronic commerce research and applications, 7(3):341–352.
- Levin et al. (2000) Esther Levin, Roberto Pieraccini, and Wieland Eckert. 2000. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on speech and audio processing, 8(1):11–23.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Liu and Zhang (2012) Bing Liu and Lei Zhang. 2012. A survey of opinion mining and sentiment analysis. In Mining text data, pages 415–463. Springer.
- Liu et al. (2018) Shuman Liu, Hongshen Chen, Zhaochun Ren, Yang Feng, Qun Liu, and Dawei Yin. 2018. Knowledge diffusion for neural dialogue generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1498, Melbourne, Australia. Association for Computational Linguistics.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Long et al. (2017) Yinong Long, Jianan Wang, Zhen Xu, Zongsheng Wang, Baoxun Wang, and Zhuoran Wang. 2017. A knowledge enhanced generative conversational service agent. In Proceedings of the 6th Dialog System Technology Challenges (DSTC6) Workshop.
- Loshchilov and Hutter (2018) Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In International Conference on Learning Representations.
- Majumder et al. (2022) Bodhisattwa Prasad Majumder, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Julian McAuley. 2022. Achieving conversational goals with unsupervised post-hoc knowledge injection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3140–3153, Dublin, Ireland. Association for Computational Linguistics.
- Majumder et al. (2020) Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, and Julian McAuley. 2020. Interview: Large-scale modeling of media dialog with discourse patterns and knowledge grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8129–8141, Online. Association for Computational Linguistics.
- Manning et al. (2008) Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge university press.
- Mazaré et al. (2018) Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779, Brussels, Belgium. Association for Computational Linguistics.
- McAuley and Yang (2016) Julian McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of the 25th International Conference on World Wide Web, pages 625–635.
- Moghe et al. (2018) Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. 2018. Towards exploiting background knowledge for building conversation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2322–2332.
- Moon et al. (2019) Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854, Florence, Italy. Association for Computational Linguistics.
- Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 188–197.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318. ACL.
- Penha et al. (2019) Gustavo Penha, Alexandru Balan, and Claudia Hauff. 2019. Introducing mantis: a novel multi-domain information seeking dialogues dataset. arXiv preprint arXiv:1912.04639.
- Pontiki et al. (2016) Maria Pontiki, Dimitrios Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad Al-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, et al. 2016. Semeval-2016 task 5: Aspect based sentiment analysis. In International workshop on semantic evaluation, pages 19–30.
- Pontiki et al. (2015) Maria Pontiki, Dimitrios Galanis, Harris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. Semeval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pages 486–495.
- (50) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67.
- Rastogi et al. (2020) Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8689–8696.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389.
- Saeidi et al. (2018) Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of natural language rules in conversational machine reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2087–2097, Brussels, Belgium. Association for Computational Linguistics.
- Singh et al. (2002) Satinder Singh, Diane Litman, Michael Kearns, and Marilyn Walker. 2002. Optimizing dialogue management with reinforcement learning: Experiments with the njfun system. Journal of Artificial Intelligence Research, 16:105–133.
- Tuan et al. (2019) Yi-Lin Tuan, Yun-Nung Chen, and Hung-yi Lee. 2019. DyKgChat: Benchmarking dialogue generation grounding on dynamic knowledge graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1855–1865, Hong Kong, China. Association for Computational Linguistics.
- Vougiouklis et al. (2016) Pavlos Vougiouklis, Jonathon Hare, and Elena Simperl. 2016. A neural network approach for knowledge-driven response generation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3370–3380, Osaka, Japan. The COLING 2016 Organizing Committee.
- Williams and Young (2007) Jason D Williams and Steve Young. 2007. Partially observable markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- Wolf et al. (2019) Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. Transfertransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
- Wu et al. (2019) Chien-Sheng Wu, Richard Socher, and Caiming Xiong. 2019. Global-to-local memory pointer networks for task-oriented dialogue. In International Conference on Learning Representations.
- Xu et al. (2021) Jun Xu, Zeyang Lei, Haifeng Wang, Zheng-Yu Niu, Hua Wu, and Wanxiang Che. 2021. Enhancing dialog coherence with event graph grounded content planning. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 3941–3947.
- Zhang et al. (2020a) Houyu Zhang, Zhenghao Liu, Chenyan Xiong, and Zhiyuan Liu. 2020a. Grounded conversation generation as guided traverses in commonsense knowledge graphs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2031–2043. Association for Computational Linguistics.
- Zhang et al. (2020b) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020b. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
- Zhang et al. (2021) Wenxuan Zhang, Yang Deng, Xin Li, Yifei Yuan, Lidong Bing, and Wai Lam. 2021. Aspect sentiment quad prediction as paraphrase generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9209–9219, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Zhang et al. (2020c) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and William B Dolan. 2020c. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 270–278.
- Zhao and Chaturvedi (2020) Chao Zhao and Snigdha Chaturvedi. 2020. Weakly-supervised opinion summarization by leveraging external information. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9644–9651.
- Zhou et al. (2018) Kangyan Zhou, Shrimai Prabhumoye, and Alan W Black. 2018. A dataset for document grounded conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 708–713, Brussels, Belgium. Association for Computational Linguistics.
Appendix A Data Collection
In this section, we describe the data collection process in more detail. Data collection is conducted by a group of crowd workers on Amazon Mechanical Turk. To control data quality, we restrict the worker pool to English speakers located in the US, CA, and GB, and workers are eligible for annotation only if they pass our pre-qualification tests. During data collection, we also manually validate annotation quality in several rounds to filter out workers with low-quality annotations.
During review collection, we provide workers with the reviewer’s persona as well as the aspects and sentiments to be covered in the review. The persona is randomly sampled from a pre-defined set of personas. For the aspects and sentiments, we first define 26 common aspects for hotel and restaurant reviews (e.g., WIFI-quality and room-bed for hotels, food-quality and indoor-decor for restaurants). We then randomly select the target aspects to be addressed in each review; the number of aspects is also chosen at random. To mimic the sentiment distribution of real reviews, the sentiment of each aspect is sampled based on the actual average ratings taken from Yelp. Figure 3 shows the interface for review collection. We pay workers $1.00 per task.
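For illustration, the snippet below sketches one way the aspect and sentiment sampling described above could be implemented. The aspect names (a small subset of the 26), the rating-to-probability mapping, and all function names are assumptions made for this sketch rather than the exact procedure used to build the dataset.

import random

# Illustrative subset of the 26 pre-defined aspects (assumed names).
HOTEL_ASPECTS = ["WIFI-quality", "room-bed", "staff-friendliness"]

def positive_probability(avg_yelp_rating, max_rating=5.0):
    # Assumed mapping: a higher average Yelp rating gives a higher
    # chance of sampling a positive opinion for each aspect.
    return avg_yelp_rating / max_rating

def sample_review_prompt(avg_yelp_rating, max_aspects=3):
    # Randomly choose how many and which aspects the worker should cover.
    num_aspects = random.randint(1, max_aspects)
    aspects = random.sample(HOTEL_ASPECTS, k=min(num_aspects, len(HOTEL_ASPECTS)))
    p_pos = positive_probability(avg_yelp_rating)
    # Sample one sentiment per aspect so the corpus mirrors the real rating distribution.
    return {a: ("positive" if random.random() < p_pos else "negative") for a in aspects}

if __name__ == "__main__":
    print(sample_review_prompt(avg_yelp_rating=4.2))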
During user request collection, we ask workers to select the best position to insert a user request, considering every possible position in the given dialogue. Figure 4 shows the interface for user request collection. We pay workers $0.15 per task.
During response generation, we explicitly ask workers to consider the information in all snippets to create a natural and faithful response. Figure 5 shows the interface for response generation. We pay workers $0.25 per task. Below we list the complete instructions that we provide to workers.
• Please read ALL the customer reviews carefully.
• Please read the conversation carefully.
• Write down a response to the customer to answer the question and continue the conversation.
• You must read EVERY REVIEW COMMENT carefully. Each sentence was written by different people with potentially different opinions.
• Your response MUST include your SUMMARY of ALL the review sentences.
• If there’s any conflict or different opinions in the reviews, your response MUST describe the minority opinion as well.
• Your response MUST be based on the contents in given review comments only.
• Please keep the way of speaking as similar as possible to the previous utterances spoken by the agent.
Appendix B Aspect Based Sentiment Analysis
To enhance the model’s ability to understand the sentiment polarity of each individual knowledge snippet, we apply PGen Zhang et al. (2021), a state-of-the-art aspect-based sentiment analysis model, to predict the sentiment of every knowledge snippet.
PGen casts aspect-based sentiment analysis as a sequence generation problem: the input is a review sentence, and the output is a natural language description of the aspect and its sentiment. For example, given the review sentence “The ambience was so fun.”, where the aspect term is “ambience” and the sentiment polarity is “positive”, PGen transforms the aspect term and the sentiment polarity into the natural language description “ambience is great.” using templates: the aspect term is kept unchanged, and the positive/neutral/negative sentiment polarities are mapped to one of the three tokens “great”, “ok”, and “bad”, respectively. The model is a BART-base model trained on the SemEval aspect-based sentiment analysis datasets Pontiki et al. (2015, 2016).
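As a concrete illustration, a minimal sketch of this template mapping is given below; the function name and the example check are ours and are not part of the PGen implementation.

# Assumed helper illustrating the template described above: the aspect term is
# kept unchanged and the polarity is mapped to "great" / "ok" / "bad".
POLARITY_TO_WORD = {"positive": "great", "neutral": "ok", "negative": "bad"}

def to_target_sentence(aspect_term, polarity):
    # e.g., ("ambience", "positive") -> "ambience is great."
    return f"{aspect_term} is {POLARITY_TO_WORD[polarity]}."

assert to_target_sentence("ambience", "positive") == "ambience is great."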
Appendix C Training Details
For KTD and KS, the implementation is based on the Transformers library Wolf et al. (2020). During training, we use AdamW Loshchilov and Hutter (2018) with a learning rate of and a batch size of 16. We apply warmup Goyal et al. (2017) for the first 500 steps and early stopping based on model performance on the validation set. Models are trained on a Tesla V100 GPU with 16 GB of memory; training takes 1 hour for a KTD model and 5 hours for a KS model.
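The snippet below sketches this optimization setup with the PyTorch and Transformers APIs; the model name, the learning rate, and the total number of training steps are placeholders for illustration, not the values used in our experiments.

import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

# Minimal sketch: AdamW with linear warmup over the first 500 steps.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")  # example model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder learning rate
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000  # assumed total steps
)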
During inference, we set the classification threshold to 0 for KTD, as we observe that KTD results are insensitive to the threshold. For the KS model, however, the choice of threshold can greatly affect the precision and recall scores. We therefore choose the best threshold based on the scores on the validation set, using a grid search from -5 to 5. The optimal thresholds for BERT, RoBERTa, ALBERT, and DeBERTa are 2.25, 1, 1.75, and 2 in the bi-encoder setting, and 3.1, 4.6, 3.25, and 3.4 in the cross-encoder setting.
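A minimal sketch of this threshold search is shown below; the grid step size and the use of F1 as the selection metric are assumptions for illustration, and all names are ours.

import numpy as np
from sklearn.metrics import f1_score

def pick_threshold(val_scores, val_labels, low=-5.0, high=5.0, step=0.05):
    # Sweep candidate thresholds over the validation scores and keep the one
    # with the best F1 (assumed selection metric).
    best_t, best_f1 = low, -1.0
    for t in np.arange(low, high + step, step):
        preds = (np.asarray(val_scores) > t).astype(int)
        f1 = f1_score(val_labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t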
For the ET model, we follow the setting of Jin et al. (2021) to identify entities. More specifically, we perform fuzzy n-gram matching between an entity and the utterance, where n equals the length of the entity mention. The n-gram matching score is calculated as the ratio of the longest common sequence between the two n-grams. We set the matching threshold to 0.95.
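The snippet below sketches this fuzzy matching procedure; using difflib’s SequenceMatcher to compute the longest-common-sequence ratio is our assumption about how the score could be computed, and the function and variable names are illustrative.

from difflib import SequenceMatcher

def fuzzy_entity_match(entity, utterance, threshold=0.95):
    # Compare the entity mention against every n-gram of the utterance,
    # where n equals the length of the entity mention in tokens.
    entity_tokens = entity.lower().split()
    utt_tokens = utterance.lower().split()
    n = len(entity_tokens)
    best = 0.0
    for i in range(len(utt_tokens) - n + 1):
        ngram = " ".join(utt_tokens[i:i + n])
        # Similarity based on the longest matching subsequences of the two strings.
        best = max(best, SequenceMatcher(None, " ".join(entity_tokens), ngram).ratio())
    return best >= threshold

print(fuzzy_entity_match("gonville hotel", "is the wifi at the Gonville Hotel reliable?"))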
For the RG model, during training we use AdamW with a learning rate of and a batch size of 16. We apply warmup for the first 500 steps and early stopping based on model perplexity on the development set. The model is trained on a Tesla V100 GPU with 16 GB of memory for 2 hours.