
¹ School of Computer Science and Technology, Shanghai University of Electric Power, 201306, China
  Email: [email protected], [email protected]
² Chair of Database Systems and Data Mining, University of Munich, 80538 Munich
  Email: [email protected]
³ School of Electronics and Information Engineering, Tongji University, 201804, China
⁴ Data Intelligence Center, Alibaba Local Life Service Co., Ltd, 200062, China

Open-domain Dialogue Generation Grounded with Dynamic Multi-form Knowledge Fusion

Feifei Xu¹, Shanlin Zhou¹, Xinpeng Wang³, Yunpu Ma² (✉), Wenkai Zhang¹, Zhisong Li⁴

F. Xu, S. Zhou, and X. Wang are first authors with equal contributions.
Abstract

Open-domain multi-turn conversations face the challenge of enriching and expanding the content of the conversation. Recently, many approaches have been proposed that use external knowledge to generate semantically rich and informative conversations. Two types of knowledge have been studied for knowledge-aware open-domain dialogue generation: structured triples from knowledge graphs and unstructured texts from documents. To exploit both the abundant latent knowledge in unstructured documents and the information-expansion capability of structured knowledge graphs, this paper presents a new dialogue generation model, the Dynamic Multi-form Knowledge Fusion based Open-domain Chatting Machine (DMKCM). In particular, DMKCM applies an indexed text corpus (a virtual Knowledge Base) to locate relevant documents as the 1st hop, and then expands the content of the dialogue and its 1st hop using a commonsense knowledge graph to obtain apposite triples as the 2nd hop. To merge these two forms of knowledge into the dialogue effectively, we design a dynamic virtual knowledge selector and a controller that help enrich and expand the knowledge space. Moreover, DMKCM adopts a novel dynamic knowledge memory module that effectively uses historical reasoning knowledge to generate better responses. Experimental results indicate the effectiveness of our method in terms of dialogue coherence and informativeness.

Keywords:
Conversation Generation · Virtual Knowledge Base · Commonsense Knowledge Graph · Dynamic Fusion

1 Introduction

Open-domain conversation aims to meet human needs for dialogue understanding and emotional resonance while keeping the conversation going. However, purely data-driven multi-turn conversation models often generate simple and repetitive content [15, 19]. To address this issue, previous studies add persona information documents [16] or guide the conversation topic [22] to improve dialogue informativeness and diversity.

Notably, more recent studies investigate external knowledge as an additional input to conversations [27, 24, 26, 14, 5], including knowledge graphs (denoted as KGs) [27, 24, 26] and unstructured texts [14, 5]. Methods based on KGs show that KGs organize information around entities, making reasoning easy. Nevertheless, extracting relations to build a knowledge graph usually loses information, and simply applying and reformulating KG triples often yields less informative responses; e.g., KnowHRL [22] adds keywords from KGs through reasoning strategies to guide topics, but the informativeness of the conversation does not increase significantly. Informative texts, e.g., comments about movies, can provide rich knowledge for generation, but their unstructured representation requires the language model to perform knowledge selection or attention over the knowledge texts; e.g., SKT [5] designs a complex screening process to use document knowledge. In general, these works cannot avoid the problems that KGs are incomplete and that document processing is complicated. A very recent work, MKST [26], is the first attempt to apply different forms of knowledge in conversation: it extracts the entities mentioned in the sentences and links them to their corresponding entities in KGs as label knowledge, designs a multi-knowledge-aware encoder to encode label, unstructured, and dialogue information together, and generates with a knowledge-aware decoder. However, its label knowledge is not obtained through reasoning, which may not help the further expansion of dialogue. Moreover, MKST relies on dialogue datasets with background knowledge, e.g., the Wizard-of-Wikipedia dataset.

To address the above problems, we propose a new multi-turn dialogue generation model, the Dynamic Multi-form Knowledge Fusion based Open-domain Chatting Machine (DMKCM). Its goal is to simultaneously fuse the abundant knowledge in an indexed corpus (a virtual Knowledge Base, or virtual KB) and the information-expansion capability of a commonsense knowledge graph (commonsense KG) to enrich and expand the informativeness of multi-turn conversations. The differences and functions of these two types of external knowledge are summarized as follows:

  • Virtual KB: This kind of knowledge base is usually an indexed corpus in which each document links to related documents through keywords. Each document in this base expresses a complete meaning.

  • Commonsense KG: This kind of knowledge graph consists of triples $[head\_entity, relation, tail\_entity]$, also called commonsense facts. These commonsense facts can enhance language representation on the commonsense level and even expand topics through reasoning, i.e., by traversing entities and relations.

Figure 1: An Example of Knowledge Fusion in a Conversation. Yellow indicates keywords from the 1st hop, blue indicates entities of the 2nd hop, and red indicates keywords from history virtual knowledge. Different colored circles and dotted arrows point out the source of latent knowledge in the response; black arrows indicate the flow of information.

For DMKCM, we design two branches: a dialogue branch (green blocks on the left of Fig. 2) and a knowledge branch (orange blocks on the left of Fig. 2). The dialogue branch generates responses by exchanging knowledge with the knowledge branch. On the knowledge branch, we apply separate reasoning strategies to the virtual KB and the commonsense KG to obtain passages (1st hop) and entities (2nd hop) related to the current dialogue. The 1st hop serves to enrich the response with information. The 2nd hop captures concepts that shift as the conversation unfolds, which helps generate more meaningful conversations; for example, "diet" and "weight" (concepts in "Historical context" and "POST" in Fig. 2) hop along commonsense relations to related concepts such as "healthy" and "exercise" in "Related Entities" of Fig. 1. This is a typical pattern in natural conversations. In addition, before the 1st hop, we also expand concepts via the commonsense KG to calculate filtering scores, e.g., "people" (from "POST" in Fig. 2) to "overweight". Using these filtering scores to select the results inferred from the virtual KB helps remove potential noise in the reasoning results. When the topic shifts, it is hard to find suitable knowledge from the current 1st hop to generate a response. We find that the history 1st hop (history virtual knowledge) can solve this issue; e.g., "fatty food" and "healthy" in the history virtual knowledge bank relate to the response in Fig. 1. Therefore, history virtual knowledge is dynamically stored in the history virtual knowledge bank and provides knowledge support for the current turn. This smooths topic transitions in the current dialogue and also enriches the response. Our work improves model explainability of knowledge extraction, helps generate informative responses, and expands the topic of conversation to a certain extent. Explainability is important for dialogue in information-oriented scenarios, where a user needs to know how new knowledge in the chatbot's responses is linked to the knowledge in their utterances, as Fig. 1 shows. Our experiments on two conversation datasets, Persona-Chat [25] and DailyDialog [8], demonstrate the effectiveness of DMKCM.

In summary, the following contributions are made in this paper:

  • This paper proposes a novel dialogue generation model, DMKCM, to dynamically fuse multi-form knowledge into generation. To the best of our knowledge, this work is the first attempt to fuse a virtual KB and a commonsense KG into dialogue to obtain better responses.

  • To adapt to the open-domain dialogue task, we construct a new virtual knowledge base from ROCStories, a commonsense story dataset.

  • We find that history virtual knowledge helps generate better responses, and we propose a new dynamically delayed updating strategy to store and filter it.

  • Experimental results and case studies show the superior performance of our model. Various evaluation metrics indicate that DMKCM not only maximizes the benefit of the acquired knowledge but also generates more informative and coherent conversations.

2 Related Work

2.1 Dialogue Generation with External Knowledge

Many works have shown that external knowledge can facilitate dialogue generation. [27] presents a novel open-domain dialogue generation method demonstrating how large-scale commonsense knowledge can facilitate language understanding and generation. [4] proposes latent relation language models, a class of language models that parameterize the joint distribution over the words in a document and the relevant entities via knowledge graph relations. For external documents, [10] incorporates them into the response generation procedure for customer service dialogues. GLKS [14] adopts a global-to-local guiding method and uses the dialogue context to select important n-gram information from the document to guide the generation process. However, knowledge graphs lose facts during construction, and external texts require complicated processing; both forms of knowledge therefore still have limitations in exploiting external knowledge.

2.2 Virtual Knowledge Base

A Virtual Knowledge Base (virtual KB) is an indexed corpus that is treated as a knowledge base containing entities and texts. It has been widely employed in open-domain Question Answering (QA) [3, 11, 2]. A virtual KB accomplishes QA tasks by answering queries with spans from the corpus, ensuring that facts are preserved rather than lost in relation extraction. However, to the best of our knowledge, virtual KBs have not yet been applied to open-domain dialogue generation. DrKIT [2] is a state-of-the-art algorithm for reasoning over a virtual KB in QA: it traverses textual data like a KG, softly following paths of relations between entity mentions in the corpus. Inspired by this, our model DMKCM includes a reasoning strategy based on DrKIT for retrieving information related to the dialogue. To better fit our task, we build our virtual KB from the indexed ROCStories corpus [12] of commonsense stories, instead of the more specialized Wikipedia.
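To make the idea concrete, the sketch below (Python; the helper names and the naive length-based keyword heuristic are our illustrative assumptions, not the exact preprocessing used in this paper) shows how a story corpus can be indexed so that titles act as entities and stories sharing keywords become linked, KG-style:

```python
from collections import defaultdict

def build_virtual_kb(stories):
    """Index a story corpus as a virtual KB: each unique title acts as an
    entity, and two titles are linked when their stories share a keyword.
    `stories` maps title -> story text; the length-based keyword filter is
    a naive stand-in for real preprocessing."""
    keyword_index = defaultdict(set)              # keyword -> set of titles
    for title, text in stories.items():
        for word in set(text.lower().split()):
            if len(word) > 4:                     # crude keyword heuristic
                keyword_index[word].add(title)

    links = defaultdict(set)                      # title -> related titles
    for titles in keyword_index.values():
        for t in titles:
            links[t] |= titles - {t}
    return keyword_index, links
```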

3 Model

3.1 Overview

The overview of DMKCM is shown in Fig. 2. DMKCM consists of two branches: the dialogue branch and the knowledge branch. The dialogue branch (green blocks on the left of Fig. 2) generates the conversation with an encoder-decoder model and exchanges information with the knowledge branch to improve the informativeness of responses. The knowledge branch (orange blocks on the left of Fig. 2) reasons over, stores, merges, and expands knowledge through the Virtual Knowledge reasoning module (VK-reasoning), the Dynamic Virtual Knowledge memory module (DVK-memory), the Dynamic Virtual Knowledge selector module (DVK-selector), and the Commonsense Knowledge expansion module (CK-expansion).

Figure 2: The left shows the overview of DMKCM, including the dialogue branch (green blocks) and the knowledge branch (orange blocks). The right shows details of the knowledge branch, which includes Knowledge Acquisition and Knowledge Selector. Knowledge Acquisition has three modules: VK-reasoning, DVK-memory, and CK-expansion. Knowledge Selector represents the process of DVK-selector; the merge corresponds to Eq. (7).

Before presenting the details of our dialogue generation approach, we first introduce our notation and key concepts. Formally, suppose we have a conversation dataset $D=\left\{\left(C^{i},X^{i},R^{i}\right)\right\}_{i=1}^{N}$, where $C^{i}$ is the conversational context before the $i$-th turn, $X^{i}$ is the $i$-th user utterance, and $R^{i}$ is the response to $X^{i}$. The goal of this task is to estimate a generation probability distribution $P(R\mid[C,X],K)$ from $D$, so that one can generate a response for $[C,X]$ following $P(R\mid[C,X],K)$, where $[C,X]$ denotes the concatenation of the context $C$ and the current user utterance $X$, $K$ is the knowledge from the knowledge branch, and $R$ is the corresponding response. Assume that there are already $n-1$ turns in a dialogue. We use a Transformer encoder (T_enc) to encode $X^{n}$ and $[C^{n},X^{n}]$ separately, obtaining the last hidden states $h_{p}^{n}$ and $h_{e}^{n}$, which represent the encoded semantic information of $X^{n}$ and $[C^{n},X^{n}]$, respectively. Next, we elaborate on each module.

3.2 Knowledge Branch

Firstly, VK-reasoning retrieves and filters candidate documents related to the user utterance $X^{n}$ from the virtual KB as the 1st hop. We send $X^{n}$, the candidate documents, and $C^{n}$ to CK-expansion, which expands commonsense concepts for a better response. DVK-memory is a dynamic transfer module: it stores the encoded vectors of the 1st hops from the previous $n-1$ turns, which we name history virtual knowledge. We then filter the related history virtual knowledge and send it to DVK-selector as a knowledge supplement. Next, DVK-selector dynamically integrates the 1st hop, the history virtual knowledge, and the dialogue information for the decoder. Notably, after this process, the encoded information of the current 1st hop is updated into DVK-memory. Before generating a response, CK-expansion expands the words of the input information by traversing a commonsense KG to capture concepts that shift within the conversation; these extended concepts are denoted as the 2nd hop.

3.2.1 Virtual Knowledge Reasoning (VK-reasoning)

Corresponding to our open-domain dialogue task, we select ROCStories [12], a commonsense story corpus, as the source of our virtual KB. Firstly, we index each unique title of ROCStories as an entity in the virtual KB. Then, to simulate the relational patterns of KGs over text, we traverse each story to connect related titles (entities). We use the recent reasoning algorithm DrKIT [2] to train a reasoning model on our conversation dataset and virtual KB. With this trained model, we reason over the virtual KB and obtain the candidate documents $K_{D}^{n}$ related to $X^{n}$. In particular, to obtain documents more relevant to $X^{n}$, we list the related words of each word in $X^{n}$ from a commonsense KG. The number of co-occurrences between each document and this list is taken as the document's filtering score, and we keep the top $T$ candidate documents from $K_{D}^{n}$ by this score. For convenience, these filtered candidate documents are denoted $K_{V}^{n}$ (1st hop).
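A minimal sketch of this filtering step, assuming the reasoned candidates and the commonsense-expanded word list arrive as plain strings and token lists (the function and variable names are ours):

```python
def filter_candidates(candidates, related_words, top_t=5):
    """Keep the top-T candidate documents (K_V^n, the 1st hop), scoring
    each document by co-occurrence with the commonsense-expanded word
    list derived from the user utterance X^n."""
    related = set(related_words)
    scored = [(sum(tok in related for tok in doc.lower().split()), doc)
              for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_t]]
```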

As shown on the right of Fig. 2, we encode $K_{V}^{n}$ for DVK-selector. Concretely, the candidate documents are encoded one by one with the Transformer encoder. We take the last hidden state $h_{V_{t}}^{n}$ from the encoder layer, which represents the encoding of the $t$-th candidate document, with $t$ ranging from 1 to $T$, the total number of candidate documents. We then stack the last hidden states into a matrix $H_{V}^{n}$:

$H_{V}^{n}=\left[h_{V_{1}}^{n},h_{V_{2}}^{n},\ldots,h_{V_{T}}^{n}\right]^{T}$, (1)
$h_{V_{t}}^{n}=T\_enc\left(K_{V_{t}}^{n}\right),\quad t=1,\ldots,T$. (2)
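A minimal PyTorch sketch of Eqs. (1)-(2), under the assumption that candidate documents arrive as padded token-id tensors and that "last hidden state" means the final token position (details the paper leaves open):

```python
import torch
import torch.nn as nn

class DocEncoder(nn.Module):
    """Encode each of the T candidate documents with a shared Transformer
    encoder and stack the last hidden states into H_V^n (Eqs. 1-2)."""
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, docs):                    # docs: (T, seq_len) token ids
        states = self.encoder(self.embed(docs))  # (T, seq_len, d_model)
        return states[:, -1, :]                  # H_V^n: (T, d_model)
```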

3.2.2 Dynamic Virtual Knowledge Memory (DVK-memory)

When the topic shifts, e.g., "soda" in "A4" of Table 1, VK-reasoning may find documents about "soda" instead of "weight" and "healthy", which represent the topic of the context. This gives little support for the current conversation because, with only the 1st-hop knowledge, our model lacks the practical knowledge needed to generate a response.

Table 1: An Example of Conversation Topic Shift in DailyDialog. The topic words in the dialogue are emphasized in bold, and the words in red indicate the words that deviate from the current dialogue topic.
A1: What do you think about all the different diets that people go on ?
B1: …a balanced diet and to never get overweight
A2: But…What should they do to lose weight ?
B2: They need to eat healthy foods…cut out fattening foods altogether…
A4: How about drinking soda?
B4: …gain weight by drinking far too much soda … no nutritional value..

Thus, we design DVK-memory to dynamically store and filter history virtual knowledge representations, which supplements the 1st hop with virtual knowledge and ultimately smooths topic transitions in the current dialogue. This process applies a dynamic delayed updating strategy (Algorithm 1).

Input: the $i$-th ($1 \leq i \leq N$) turn output of VK-reasoning: $H_{V}^{i}$
Output: encoded vector of related history virtual knowledge: $h_{m}^{i}$

for each $i \in [1,N]$ do
    if $i = 1$ then
        Run DVK-selector;
        Add $H_{V}^{i}$ into memory $M^{i+1}$;
    else
        Extract the historical knowledge representation $h_{m}^{i}$ of documents related to $h_{p}^{i}$ from $M^{i}$;
        Run DVK-selector;
        Add $H_{V}^{i}$ into memory $M^{i+1}$;

Algorithm 1: Dynamically Store and Filter History Virtual Knowledge

We denote $M=\left\{M^{n}\right\}_{n=2}^{N}$ as the set of history virtual knowledge representations, where $M^{n}$ is the historical knowledge representation for the $n$-th dialogue turn:

$M^{n}=\left[H_{V}^{1},\ldots,H_{V}^{n-1}\right]^{T}$. (3)

Then, we apply the attention mechanism of [1] to extract the historical knowledge representation $h_{m}^{n}$ from $M^{n}$ that is related to the current user utterance representation $h_{p}^{n}$:

$h_{m}^{n}=\sum_{i=2}^{n-1}\alpha_{w,k}^{i}M_{k}^{i}$, (4)
$\alpha_{w,k}^{i}=\frac{\exp\left(S_{w,k}^{i}\right)}{\sum_{i=1}^{n}\exp\left(S_{w,k}^{i}\right)}$, (5)
$S_{w,k}^{i}=V_{a}^{T}\tanh\left(W_{h}\left[h_{p_{w}}^{i};M_{k}^{i}\right]\right)$, (6)

where $M_{k}^{i}$ is the hidden state at the $k$-th position of the history virtual knowledge $M^{i}$, and $h_{p_{w}}^{i}$ is the $w$-th token vector in $h_{p}^{i}$. $V_{a}^{T}$ and $W_{h}$ are trainable parameters. $S_{w,k}^{i}$ is the unnormalized attention weight produced by an attention network, and $\alpha_{w,k}^{i}$ is its normalized counterpart.
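The following PyTorch sketch combines Algorithm 1's delayed update with the additive attention of Eqs. (4)-(6); the tensor shapes and the flattening of the memory bank are our assumptions:

```python
import torch
import torch.nn as nn

class DVKMemory(nn.Module):
    """Sketch of DVK-memory: additive attention (Eqs. 4-6) extracts the
    history virtual knowledge h_m^n relevant to the utterance tokens
    h_p^n; writing H_V^n is delayed until after the selector runs."""
    def __init__(self, d_model=512):
        super().__init__()
        self.W_h = nn.Linear(2 * d_model, d_model, bias=False)
        self.v_a = nn.Linear(d_model, 1, bias=False)
        self.bank = []                          # list of past H_V^i tensors

    def read(self, h_p):                        # h_p: (W, d) utterance tokens
        if not self.bank:
            return None                         # first turn: no history yet
        M = torch.cat(self.bank, dim=0)         # (K, d) flattened history
        # Score every (token w, memory slot k) pair: Eq. (6).
        pairs = torch.cat([h_p.unsqueeze(1).expand(-1, M.size(0), -1),
                           M.unsqueeze(0).expand(h_p.size(0), -1, -1)], dim=-1)
        scores = self.v_a(torch.tanh(self.W_h(pairs))).squeeze(-1)  # (W, K)
        alpha = torch.softmax(scores, dim=-1)   # Eq. (5)
        return alpha @ M                        # Eq. (4): h_m^n, (W, d)

    def write(self, H_V):                       # delayed update (Algorithm 1)
        self.bank.append(H_V)
```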

3.2.3 Dynamic Virtual Knowledge Selector (DVK-selector)

We apply the multi-head attention mechanism (MultiHead) [21] to extract features of the current virtual knowledge $H_{V}^{n}$ and the historical knowledge $h_{m}^{n}$, conditioned on the dialogue semantic information $h_{e}^{n}$. A gate then fuses the information into $A_{merge}^{n}$. Specifically,

$A_{merge}^{n}=\begin{cases}\mu A_{V}^{n}+h_{e}^{n}, & n=1\\ \mu A_{V}^{n}+(1-\mu)A_{M}^{n}+h_{e}^{n}, & n>1\end{cases}$ (7)
$A_{M}^{n}=MultiHead\left(h_{e}^{n},h_{m}^{n},h_{m}^{n}\right)$, (8)
$A_{V}^{n}=MultiHead\left(h_{e}^{n},H_{V}^{n},H_{V}^{n}\right)$, (9)
$\mu=\mathrm{sigmoid}\left(W_{g}h_{e}^{n}\right)$. (10)

Here, the sigmoid function yields the gating parameter $\mu$ for fusion, and $W_{g}$ is a trainable parameter. $A_{V}^{n}$ denotes the current virtual knowledge features related to $h_{e}^{n}$, and $A_{M}^{n}$ the historical knowledge features related to $h_{e}^{n}$. In particular, since DVK-memory uses a delayed updating strategy, the $h_{m}^{n}$ term is omitted when $n=1$.
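A possible PyTorch realization of Eqs. (7)-(10), assuming batch-first tensors and a per-position scalar gate (the paper does not specify the gate's dimensionality):

```python
import torch
import torch.nn as nn

class DVKSelector(nn.Module):
    """Sketch of DVK-selector (Eqs. 7-10): multi-head attention extracts
    virtual-knowledge features A_V and history features A_M conditioned
    on the dialogue encoding h_e, then a sigmoid gate fuses them."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_m = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.W_g = nn.Linear(d_model, 1)

    def forward(self, h_e, H_V, h_m=None):
        # h_e: (B, L, d) dialogue encoding; H_V: (B, T, d); h_m: (B, W, d)
        A_V, _ = self.attn_v(h_e, H_V, H_V)     # Eq. (9)
        mu = torch.sigmoid(self.W_g(h_e))       # Eq. (10): gate in [0, 1]
        if h_m is None:                         # first turn (n = 1)
            return mu * A_V + h_e               # Eq. (7), upper branch
        A_M, _ = self.attn_m(h_e, h_m, h_m)     # Eq. (8)
        return mu * A_V + (1 - mu) * A_M + h_e  # Eq. (7), lower branch
```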

3.2.4 Commonsense Knowledge Expansion (CK-expansion)

To expand concepts and further enhance informativeness, we expand the entities of $K_{V}^{n}$, $C^{n}$, and $X^{n}$ by searching neighbor nodes in a commonsense KG. We use $K_{C}^{n}=(k_{h}^{n},k_{r}^{n},k_{t}^{n})$ to represent the knowledge triples connecting the original and expanded entities, where $k_{h}^{n}$ is the set of words (entities) from $K_{V}^{n}$, $C^{n}$, and $X^{n}$; $k_{t}^{n}$ are the entities expanded via the KG; and $k_{r}^{n}$ is the relation between $k_{h}^{n}$ and $k_{t}^{n}$ in the KG. Inspired by the ability of GCNs to encode graph structure, we use a multi-layer CompGCN (M_CompGCN) [20] to encode the knowledge triples by combining node and relation embeddings:

$h_{K_{h}}^{n},h_{K_{r}}^{n},h_{K_{t}}^{n}=M\_CompGCN(K_{C}^{n})$. (11)

We use the dialogue context encoding $h_{e}^{n}$ to compute attention weights $\beta^{i}$ over the encoded heads $h_{K_{h}}^{n}$ and relations $h_{K_{r}}^{n}$, and then aggregate the encoded tails $h_{K_{t}}^{n}$, obtaining the representation of the knowledge triples $h_{k_{C}}^{n}$:

$h_{k_{C}}^{n}=\sum_{i=1}^{k}\beta^{i}h_{k_{t}}^{i}$, (12)
$\beta^{i}=\mathrm{Softmax}\left(h_{e}^{i}\left[h_{k_{h}}^{i}+h_{k_{r}}^{i}\right]\right)$, (13)

where $k$ is the number of triples.
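A small sketch of Eqs. (12)-(13), taking the CompGCN outputs of Eq. (11) as given and treating the dialogue encoding as a single context vector (an assumption; the paper leaves the pooling unspecified):

```python
import torch

def attend_triples(h_e, h_head, h_rel, h_tail):
    """Weight each expanded triple by how well its composed head+relation
    embedding matches the dialogue encoding h_e, then aggregate the tail
    embeddings (Eqs. 12-13). h_head/h_rel/h_tail: (k, d); h_e: (d,)."""
    keys = h_head + h_rel                       # composed head-relation key
    beta = torch.softmax(keys @ h_e, dim=0)     # Eq. (13): (k,) weights
    return (beta.unsqueeze(-1) * h_tail).sum(dim=0)   # Eq. (12): h_{k_C}^n
```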

3.3 Generation

We use a Transformer decoder (T_dec) to generate words:

$h_{d}^{n}=T\_dec(y_{t-1}^{n},A_{merge}^{n})$. (14)

Then, a Controller is designed, in which the decoded hidden state $h_{d}^{n}$ is mapped to the vocabulary size and the softmax function outputs the word probabilities $P_{v}$:

$P_{v}=\mathrm{Softmax}(W_{v}h_{d}^{n})$. (15)

In addition, we can also generate knowledgeable words by using the knowledge expansion representation encoded by CK-expansion:

$P_{K_{C}}=\mathrm{Softmax}\left(\sum_{i=1}^{l}\gamma_{i}^{n}h_{k_{C}}^{n}\right)$, (16)
$\gamma_{i}^{n}=\mathrm{Softmax}(h_{d}^{n}W_{k}h_{K_{C}}^{n})$. (17)

We obtain the attention weight $\gamma_{i}^{n}$ by attending from the decoded hidden state $h_{d}^{n}$ to $h_{k_{C}}^{n}$, which makes the model focus on the relevant knowledge triples; we then choose knowledge entities according to the entity probability $P_{K_{C}}$ of the weighted knowledge after the softmax function.

The final generated words consider both the standard vocabulary distribution and the knowledge entity distribution. We use a soft gate probability $g_{t}$ to choose whether a generated word comes from the standard vocabulary or the knowledge entities:

$y_{t}=g_{t}\cdot P_{v}+(1-g_{t})\cdot P_{K_{C}}$, (18)
$g_{t}=\sigma(h_{d}^{n})$. (19)
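A compact sketch of the Controller's mixing step (Eqs. 15, 18-19); for shape compatibility we assume the entity distribution has already been scattered onto vocabulary indices, and we pool $h_{d}^{n}$ to a scalar for the gate, both illustrative choices the paper does not pin down:

```python
import torch

def mix_distributions(h_d, W_v, P_kc):
    """Mix the standard vocabulary distribution with the knowledge-entity
    distribution via a soft gate. h_d: (d,) decoder state; W_v: (V, d)
    output projection; P_kc: (V,) entity distribution mapped to vocab
    indices (an assumption made here for shape compatibility)."""
    P_v = torch.softmax(W_v @ h_d, dim=-1)      # Eq. (15)
    g_t = torch.sigmoid(h_d.mean())             # Eq. (19), assumed pooling
    return g_t * P_v + (1 - g_t) * P_kc         # Eq. (18)
```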

3.4 Training

To train the proposed model, we minimize the negative log-likelihood

$L_{NLL}=-\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T}\log P\left(y_{t}^{(n)}\mid y_{<t}^{(n)},X^{(n)},K^{(n)}\right)$, (20)

where $N$ is the total number of samples in the dataset, and $T$ is the number of timesteps of the $n$-th turn response. $X^{(n)}$ denotes the $n$-th turn user utterance, and $K^{(n)}$ the $n$-th turn knowledge.
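In code, Eq. (20) reduces to token-level cross-entropy over the response, masking padding; a minimal sketch (the pad id is an assumption):

```python
import torch.nn.functional as F

def nll_loss(logits, targets, pad_id=0):
    """Token-level negative log-likelihood of Eq. (20).
    logits: (B, T, V) decoder outputs; targets: (B, T) gold token ids."""
    return F.cross_entropy(logits.transpose(1, 2),  # (B, V, T) as expected
                           targets, ignore_index=pad_id)
```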

4 Experiments

4.1 Dataset

Conversation Corpus: We choose DailyDialog [8] and PersonaChat [25] as our datasets. In our work, four turns of dialogue form one training sample, and pre-processed statistics of both datasets are shown in Fig. 3. Commonsense Knowledge Corpus: ConceptNet [17] is a semantic network designed to help computers understand the meanings of the words people use; its English vocabulary contains approximately 1,500,000 nodes. Source of Virtual KB: ROCStories [12] is a commonsense story corpus containing 98,161 five-sentence stories.

Figure 3: Statistics of the Two Datasets.

4.2 Comparison Method

We compare our model with representative baselines to investigate its effectiveness: (1) Attn-S2S [18]: a classic method that applies simple attention [1] to the input context on top of a sequence-to-sequence model; (2) Transformer [21]: a popular architecture based solely on attention mechanisms; (3) Dir-VHRED [23]: a recent latent variable hierarchical recurrent encoder-decoder model that characterizes the latent variables with a Dirichlet distribution instead of the traditional Gaussian; (4) GLKS [14]: a recent unstructured-knowledge-based dialogue generation model that performs global-to-local knowledge selection to improve the quality of the selected unstructured knowledge in background-based conversations; (5) CCM [27]: a commonsense knowledge-aware conversation model that leverages commonsense knowledge from ConceptNet [17] through two graph attention mechanisms to facilitate informative response generation; (6) MKST [26]: a recent universal transformer-based architecture that fuses label and background knowledge in open-domain conversation. Since MKST relies on datasets with background knowledge, we compare it with our model only on PersonaChat, which has background information.

4.3 Implementation Details

We conduct the experiments with a Transformer structure (our baseline) with 8 heads, 6 layers, and 512-dimensional hidden states, plus a 2-layer GCN model. In VK-reasoning, we set the number of reasoned candidate documents to 10 and the number of filtered candidate documents to 5. During CK-expansion, we search the neighbors of nodes and preserve the top 100 neighbor nodes. When processing the datasets, the history context consists of the previous turns of the current conversation. We train the model with the Adam optimizer [6] and adjust the learning rate with Adam warmup.

4.4 Evaluation Metrics

To analyze and evaluate our model more comprehensively, we use both automatic and human evaluations. Automatic Evaluation: Following previous work, we adopt several widely used automatic metrics: PPL, BLEU-1/2/3/4 [13], and Distinct-1/2 (Dist-1, Dist-2) [7], which reveal the quality, coherence, and diversity of the generated responses. PPL is the perplexity score, which measures the quality of the language model. BLEU computes word-based precision between a generated response and the gold response. Distinct evaluates the informativeness and diversity of the predicted responses. Human Evaluation: Automatic evaluation metrics are known to have limitations in evaluating human conversations [9]. We randomly sample 200 test samples for human evaluation and define three metrics for responses: (1) Fluency (Flu.), the degree of fluency and human readability; (2) Informativeness (Inf.), the degree of knowledge in the response; (3) Appropriateness (App.), the degree of relevance to the given context. Each response is rated by three annotators on a 3-point scale from 0 to 2, and we average the scores as the final result for each metric.
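For reference, Distinct-n [7] is straightforward to compute; a short sketch:

```python
def distinct_n(responses, n=1):
    """Distinct-n: ratio of unique n-grams to total n-grams across all
    generated responses; higher values indicate more diverse output."""
    total, unique = 0, set()
    for resp in responses:
        tokens = resp.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```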

5 Results and Discussion

5.1 Performance Evaluation

5.1.1 Automatic Evaluation

Table 2 lists the automatic evaluation results for each model. Our model outperforms almost all baselines on both corpora. Regarding model quality, our PPL is the lowest, indicating that our generated responses are more grammatical. Regarding coherence, DMKCM obtains higher BLEU values, demonstrating that in most cases our model generates responses more similar to the gold responses than the baselines do. It can be inferred that our model effectively obtains useful information from the historical context and the historical knowledge in memory to help generate a response. Regarding diversity, the Dist-1/2 metrics show that the models leveraging external knowledge, e.g., GLKS, CCM, and MKST, generate more meaningful and diverse responses than the models without knowledge. According to Table 2, DMKCM also surpasses MKST, the latest multi-knowledge-based dialogue generation model, on these metrics, which signifies the effectiveness of our use of structured and unstructured knowledge and of our fusion method.

Table 2: Automatic Evaluation Results of the Proposed Model and the Baseline Models. Numbers in bold indicate the best-performing model on the corresponding metrics.
Dataset Model PPL BLEU-1 BLEU-2 BLEU-3 BLEU-4 Dist-1 Dist-2
PersonaChat Attn-S2S 7.0079 0.4372 0.2525 0.1458 0.0842 0.0185 0.1208
TRANSFORMER 6.5172 0.5023 0.2900 0.1675 0.0967 0.0425 0.2239
Dir-VHRED 8.7117 0.4428 0.2557 0.1476 0.0852 0.0164 0.0795
GLKS 7.5668 0.4684 0.2705 0.1562 0.0903 0.0420 0.1556
CCM 7.3534 0.4703 0.2715 0.1568 0.0906 0.0596 0.2373
MKST 7.1307 0.4460 0.2577 0.1491 0.0864 0.0808 0.2907
DMKCM (our model) 5.4549 0.5452 0.3149 0.1852 0.1087 0.0903 0.3328
DailyDialog Attn-S2S 7.7451 0.4166 0.2406 0.1390 0.0803 0.0281 0.2010
TRANSFORMER 10.1355 0.4205 0.2428 0.1403 0.0811 0.0633 0.3096
Dir-VHRED 9.3176 0.4276 0.2469 0.1426 0.0824 0.0399 0.2052
GLKS 7.7635 0.4449 0.2570 0.1485 0.0859 0.0738 0.3496
CCM 7.5592 0.4875 0.2773 0.1560 0.0860 0.0598 0.2350
DMKCM (our model) 5.7692 0.4928 0.2846 0.1645 0.0951 0.0801 0.3589

5.1.2 Human Evaluation

Fig. 4 shows the human evaluation results of DMKCM compared with the baselines as radar charts, whose three vertices represent fluency, informativeness, and appropriateness. DMKCM performs best on both datasets. In particular, informativeness shows the clearest advantage over the baselines, indicating that our fusion of multi-form knowledge is effective and generates coherent and informative responses.

(a) Results of PersonaChat
(b) Results of DailyDialog
Figure 4: Comparison of Human Evaluation Results. Ratings range from 0 to 2; higher is better.

5.2 Ablation Study

As shown in Table 3, we analyze the effectiveness of each module of DMKCM in the following settings: (1) - 2Hop: DMKCM without CK-expansion; (2) - Mem and 2Hop: DMKCM without DVK-memory and CK-expansion; (3) - 1Hop and Mem: DMKCM without VK-reasoning and DVK-memory; (4) - 1Hop, Mem, and 2Hop: the baseline (Transformer). From the results, we observe that the performance of setting (1) drops sharply compared with the full model. This is within our expectations, since CK-expansion helps capture extra information from the post, which improves the diversity of the generated responses; it also shows that fusing structured knowledge effectively helps dialogue generation. Setting (1) outperforms setting (2), which verifies that retaining history virtual knowledge via DVK-memory effectively helps dialogue generation. Setting (3) concerns unstructured knowledge, and its poor results prove the effectiveness of VK-reasoning and DVK-memory: the reasoned virtual knowledge affects response generation. That setting (1) outperforms setting (4) further shows that all modules play important roles in our model. In summary, the modules of DMKCM designed to fuse structured and unstructured knowledge improve response generation in terms of informativeness and coherence.

Table 3: Results of ablation study.
Dataset Model PPL BLEU-1 BLEU-2 BLEU-3 BLEU-4 Dist-1 Dist-2 Flu./Inf./App.
PersonaChat DMKCM 5.4549 0.5452 0.3149 0.1852 0.1087 0.0903 0.3328 1.62/1.43/1.35
- 2Hop 5.5568 0.4516 0.2608 0.1506 0.0870 0.0838 0.3233 1.58/1.38/1.28
- Mem and 2Hop 6.1340 0.5005 0.2890 0.1669 0.0964 0.0782 0.2712 1.53/1.31/1.18
- 1Hop and Mem 6.7423 0.4843 0.2796 0.1615 0.0933 0.0862 0.3134 1.52/1.40/1.26
- 1Hop, Mem and 2Hop 6.5172 0.5023 0.2900 0.1675 0.0967 0.0425 0.2239 1.46/1.22/1.08
DailyDialog DMKCM 5.7692 0.4928 0.2846 0.1645 0.0951 0.0801 0.3589 1.58/1.48/1.44
- 2Hop 7.6755 0.4643 0.2682 0.1550 0.0896 0.0744 0.3498 1.51/1.34/1.31
- Mem and 2Hop 8.8904 0.4632 0.2677 0.1547 0.0894 0.0683 0.3414 1.48/1.32/1.29
- 1Hop and Mem 8.0996 0.4572 0.2641 0.1526 0.0882 0.0792 0.3505 1.50/1.35/1.33
- 1Hop, Mem and 2Hop 10.1355 0.4205 0.2428 0.1403 0.0811 0.0633 0.3096 1.44/1.21/1.25

5.3 Case Study

Sample conversations in Table 4 indicate that DMKCM generates better responses than the baselines on both conversation datasets. Traditional end-to-end models, e.g., Attn-S2S and Dir-VHRED, tend to generate simple, safe, and incoherent responses without access to knowledge. Knowledge-based models, e.g., CCM, GLKS, and MKST, generate informative responses but still produce irrelevant ones. In contrast, DMKCM fuses knowledge from the knowledge graph and the virtual knowledge as a whole and encodes more related information via the DVK-selector and Controller modules, which supports generating more informative and coherent responses; the effect of these modules is illustrated in Fig. 5. It thus generates more reasonable responses through better use of knowledge.

Table 4: Case Study of Generated Responses.
- PersonaChat DailyDialog
- Historical Contexts: A1: Hey, how’s it going? I am eating pizza. B1: Not too bad. Pizza sounds great, I am on a diet. Post: A2: I’m a simple guy, I like pizza and trucks. But not diets. Historical Contexts: A1: Say, Jim, how about going for a few beers after dinner? B1: You know that is tempting but is really not good for our fitness. Post: A2: What do you mean? It will help us to relax.
Attn-S2S I love the walking. I like spicy food. Do you really think it?
Transformer That is cool. I like to eat healthy. I am sorry, but i am not sure that it will be a good idea.
Dir-VHRED I am always looking for ways to make money, I can make some money. But you know, it’s not so much. Now that you have to do the drying up.
GLKS I love food too much. I am trying to open a restaurant next day. Do you really think so?
CCM I am a good aggressive. I am a good teacher. I am like to get harvesting of the day.
MKST I like to work out. -
DMKCM I am a waitress at a small restaurant. I am a vegan, but I like pizza. Do you think so? I do not really think we should get looking forward to it, it will just make us fat.
(a) A generated response example (knowledge from the 1st hop)
(b) A generated response example (knowledge from history virtual knowledge and the 2nd hop)
Figure 5: Examples of DMKCM.

6 Conclusion

To address the challenge of uninformative responses in multi-turn dialogue generation, we propose a novel model, DMKCM. Since existing methods of introducing knowledge into dialogue generation have limitations, we combine a virtual KB and a commonsense KG to help generate better responses. In addition, we find that history virtual knowledge can improve responses, and we propose a new dynamically delayed updating strategy to store and filter it. Experimental results on two datasets show that DMKCM generates more informative dialogues with appropriate content ordering.

References

  • [1] Bahdanau, D., Cho, K.H., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015 (2015)
  • [2] Dhingra, B., Zaheer, M., Balachandran, V., Neubig, G., Salakhutdinov, R., Cohen, W.W.: Differentiable reasoning over a virtual knowledge base. In: International Conference on Learning Representations (2020)
  • [3] Godbole, A., Kavarthapu, D., Das, R., Gong, Z., Singhal, A., Zamani, H., Yu, M., Gao, T., Guo, X., Zaheer, M., et al.: Multi-step entity-centric information retrieval for multi-hop question answering. In: EMNLP 2019 MRQA Workshop. p. 113 (2019)
  • [4] Hayashi, H., Hu, Z., Xiong, C., Neubig, G.: Latent relation language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 7911–7918 (2020)
  • [5] Kim, B., Ahn, J., Kim, G.: Sequential latent knowledge selection for knowledge-grounded dialogue. In: International Conference on Learning Representations (2020)
  • [6] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [7] Li, J., Galley, M., Brockett, C., Gao, J., Dolan, B.: A diversity-promoting objective function for neural conversation models. In: HLT-NAACL (2016)
  • [8] Li, Y., Su, H., Shen, X., Li, W., Cao, Z., Niu, S.: Dailydialog: A manually labelled multi-turn dialogue dataset. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 986–995 (2017)
  • [9] Liang, W., Liang, K.H., Yu, Z.: Herald: An annotation efficient method to detect user disengagement in social conversations. In: ACL/IJCNLP (2021)
  • [10] Long, Y., Wang, J., Xu, Z., Wang, Z., Wang, B., Wang, Z.: A knowledge enhanced generative conversational service agent. In: Proceedings of the 6th Dialog System Technology Challenges (DSTC6) Workshop (2017)
  • [11] Moldovan, D., Pasca, M., Harabagiu, S., Surdeanu, M.: Performance issues and error analysis in an open-domain question answering system. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 33–40 (2002)
  • [12] Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., Allen, J.: A corpus and cloze evaluation for deeper understanding of commonsense stories. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 839–849 (2016)
  • [13] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
  • [14] Ren, P., Chen, Z., Monz, C., Ma, J., de Rijke, M.: Thinking globally, acting locally: Distantly supervised global-to-local knowledge selection for background based conversation. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 8697–8704 (2020)
  • [15] Shang, L., Lu, Z., Li, H.: Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364 (2015)
  • [16] Song, H., Zhang, W.N., Hu, J., Liu, T.: Generating persona consistent dialogues by exploiting natural language inference. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 8878–8885 (2020)
  • [17] Speer, R., Chin, J., Havasi, C.: Conceptnet 5.5: An open multilingual graph of general knowledge. In: Thirty-first AAAI conference on artificial intelligence (2017)
  • [18] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. pp. 3104–3112 (2014)
  • [19] Tang, J., Zhao, T., Xiong, C., Liang, X., Xing, E., Hu, Z.: Target-guided open-domain conversation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 5624–5634 (2019)
  • [20] Vashishth, S., Sanyal, S., Nitin, V., Talukdar, P.: Composition-based multi-relational graph convolutional networks. In: International Conference on Learning Representations (2020)
  • [21] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
  • [22] Xu, J., Wang, H., Niu, Z., Wu, H., Che, W.: Knowledge graph grounded goal planning for open-domain conversation generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 9338–9345 (2020)
  • [23] Zeng, M., Wang, Y., Luo, Y.: Dirichlet latent variable hierarchical recurrent encoder-decoder in dialogue generation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 1267–1272 (2019)
  • [24] Zhang, H., Liu, Z., Xiong, C., Liu, Z.: Grounded conversation generation as guided traverses in commonsense knowledge graphs. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 2031–2043 (2020)
  • [25] Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personalizing dialogue agents: I have a dog, do you have pets too? In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2204–2213 (2018)
  • [26] Zhao, X., Wang, L., He, R., Yang, T., Chang, J., Wang, R.: Multiple knowledge syncretic transformer for natural dialogue generation. In: Proceedings of The Web Conference 2020. pp. 752–762 (2020)
  • [27] Zhou, H., Young, T., Huang, M., Zhao, H., Xu, J., Zhu, X.: Commonsense knowledge aware conversation generation with graph attention. In: IJCAI. pp. 4623–4629 (2018)