Which Kind Is Better in Open-domain Multi-turn Dialog, Hierarchical or Non-hierarchical Models? An Empirical Study
Abstract.
Currently, open-domain generative dialog systems have attracted considerable attention in academia and industry. Despite the success of single-turn dialog generation, multi-turn dialog generation is still a big challenge. So far, there are two kinds of models for open-domain multi-turn dialog generation: hierarchical and non-hierarchical models. Recently, some works have shown that hierarchical models are better than non-hierarchical models under their experimental settings, while other works demonstrate the opposite conclusion. Due to the lack of adequate comparisons, it is not clear which kind of model is better in open-domain multi-turn dialog generation. Thus, in this paper, we systematically measure nearly all representative hierarchical and non-hierarchical models over the same experimental settings to check which kind is better. Through extensive experiments, we reach the following three important conclusions: (1) Nearly all hierarchical models are worse than non-hierarchical models in open-domain multi-turn dialog generation, except for the HRAN model. Further analysis shows that the excellent performance of HRAN mainly depends on its word-level attention mechanism; (2) The performance of the other hierarchical models also obtains a great improvement when the word-level attention mechanism is integrated into them. The modified hierarchical models even significantly outperform the non-hierarchical models; (3) The reason why the word-level attention mechanism is so powerful for hierarchical models is that it can leverage context information more effectively, especially the fine-grained information. Besides, we have implemented all of the models and released the code (anonymous link: https://github.com/anonymous/xxx).
1. Introduction
Recently, open-domain generative dialog systems have attracted increasing attention due to their promising potential and alluring commercial value. With the rapid development of deep learning, neural networks can generate a very fluent response based on one user's utterance, which is usually called open-domain single-turn dialog generation, as in CCM (Zhou et al., 2018) and DialoGPT (Zhang et al., 2019c). However, daily conversations between two humans actually proceed in a multi-turn manner, with multiple utterances forming the conversation context (dialog history). Unlike in the single-turn setting, multi-turn dialog models need to make full use of these multiple utterances to generate a coherent response. For example, as shown in Table 1, single-turn dialog models can only focus on the last utterance in the conversation context, which leads to the bad response in the second conversation. Suitable and coherent responses can only be generated by fully considering the information in the multiple utterances, which is still a big challenge.
(1) | Context | …
| Single-turn | Give her a surprise !
| Multi-turn | Give her a surprise !
(2) | Context | …
| Single-turn | Give her a surprise !
| Multi-turn | Tell her the truth. Next time, keep it up !
Figure 1. The architectures of (a) hierarchical models and (b) non-hierarchical models.
So far, there are two kinds of models for modeling multi-turn conversations: hierarchical and non-hierarchical models. Hierarchical models, as shown in Figure 1 (a), usually contain two encoders and one decoder: (1) the word encoder expresses each sentence in a multi-turn conversation as a dense vector that represents the semantics of the input message, where GRUs (Bahdanau et al., 2014) and LSTMs (Hochreiter and Schmidhuber, 1997) are most commonly used (Zhang et al., 2019b); (2) the context encoder captures the context-level information based on the semantics of the utterances, where RNNs (Bahdanau et al., 2014) and the transformer architecture (Vaswani et al., 2017) are most commonly used in recent works (Zhang et al., 2018a; Zhang et al., 2019b); (3) the decoder finally generates a context-sensitive response according to the semantic representation of the multi-turn conversation. Non-hierarchical models, as shown in Figure 1 (b), leverage the basic Seq2Seq encoder-decoder architecture (Sutskever et al., 2014) to directly map the input sequence to the target sequence using deep neural networks. In the multi-turn setting, researchers usually simply concatenate the multiple sentences with a separator.
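As a minimal illustration of this non-hierarchical input convention (our own sketch; the `<sep>` token is an assumed separator, not a choice prescribed by any of the compared papers):

```python
def flatten_context(utterances, sep="<sep>"):
    """Concatenate a multi-turn context into a single Seq2Seq source string."""
    return f" {sep} ".join(utterances)

history = ["I want to buy a gift for my mom .",
           "Her birthday is tomorrow ."]
print(flatten_context(history))
# I want to buy a gift for my mom . <sep> Her birthday is tomorrow .
```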
Recently, some works have shown that hierarchical models have a more powerful capability to leverage the multi-turn context than non-hierarchical models (Tian et al., 2017; Xing et al., 2017; Zhang et al., 2019b). Meanwhile, the experiments of other works (Wang et al., 2018; Xu et al., 2019) demonstrate the opposite conclusion, i.e., hierarchical models are worse than non-hierarchical models under their experimental settings. Due to the lack of systematic comparisons, it is still not clear which kind of model is better in open-domain multi-turn dialog generation. Thus, in this paper, we systematically measure nearly all representative hierarchical and non-hierarchical models over the same experimental settings to check which kind is better.
From extensive experiments, we obtain three important conclusions: (1) Nearly all of the hierarchical models are worse than the non-hierarchical models in open-domain multi-turn dialog generation, except for the HRAN model (Xing et al., 2017). Further analysis shows that the excellent performance of HRAN mainly depends on its word-level attention mechanism, which is usually ignored by recent research (Zhang et al., 2019b; Zhang et al., 2018a; Chen et al., 2018); (2) The performance of the other existing hierarchical models also obtains a great improvement when the word-level attention mechanism is integrated into them. Moreover, the modified hierarchical models even significantly outperform the non-hierarchical models; (3) The reason why the word-level attention mechanism is so powerful for hierarchical models is that it can leverage the context information more effectively, especially the fine-grained information.
In this paper, our main contributions are three-fold:
• To the best of our knowledge, this is the first work to study the fundamental question of which kind of model is better in open-domain multi-turn dialog generation. We systematically compare the existing hierarchical and non-hierarchical models in open-domain multi-turn dialog generation.
• Extensive experiments demonstrate three important conclusions: (1) Nearly all hierarchical models are worse than the fundamental non-hierarchical models, except for the HRAN model; (2) The word-level attention mechanism, which is usually ignored by existing research, greatly improves the performance of hierarchical models. The modified hierarchical models even significantly outperform the state-of-the-art non-hierarchical models; (3) Quantitative and qualitative analyses demonstrate that the word-level attention mechanism can help hierarchical models leverage the multi-turn context more effectively, especially the fine-grained information.
• We have implemented all of the chosen models and released the code, which will be quite helpful for the dialog system research community.
2. Related Work
2.1. Generative Multi-turn Dialog
Despite the success of single-turn dialog generation, multi-turn dialog generation is still a big challenge. So far, there are two kinds of models for open-domain multi-turn dialog generation: hierarchical and non-hierarchical models.
Hierarchical Models: The most important work on hierarchical models is the HRED model (Serban et al., 2015), which contains word-level and utterance-level encoders: (1) the word-level encoder mainly focuses on representing utterances using an RNN (Cho et al., 2014); (2) the utterance-level encoder leverages the utterance representations generated by the word-level encoder to capture the session-level information. Based on HRED, many hierarchical models have been proposed, and their main architectures are consistent with HRED. First, WSeq (Tian et al., 2017) is a hierarchical model that uses the cosine similarity to weight the utterance representations generated by the utterance-level encoder. Then, VHRED (Serban et al., 2016) introduces a latent variable into the HRED model for generating more diverse responses. Furthermore, the HRAN model (Xing et al., 2017) adds a hierarchical attention mechanism (word-level and utterance-level attention) to HRED. Moreover, the DSHRED model (Zhang et al., 2018a) extends HRED to generate context-sensitive responses, using dynamic and static attention mechanisms to focus on the last utterance in the conversation. Recently, the ReCoSa model (Zhang et al., 2019b) replaces the RNN-based utterance-level encoder with a multi-head self-attention module to detect multiple relevant sentences in the conversation context, and shows state-of-the-art performance.
Non-hierarchical Models: The motivation of the non-hierarchical models is to simplify multi-turn dialog modeling into the single-turn manner, i.e., to simply concatenate the multiple sentences into one sentence. Non-hierarchical models usually make use of the basic Seq2Seq with attention architecture. The NRM model (Shang et al., 2015) is the first non-hierarchical model for dialog generation, which has RNN-based encoder and decoder modules. Recently, the transformer architecture (Vaswani et al., 2017) has shown a more powerful capability than RNN models for modeling long sequences, which is very suitable for processing the multi-turn context, and some works use the transformer for open-domain dialog generation, such as DialoGPT (Zhang et al., 2019c) and Meena (Adiwardana et al., 2020).
Some studies show that hierarchical models are better than non-hierarchical models for multi-turn dialogue modeling (Serban et al., 2015, 2016; Tian et al., 2017; Xing et al., 2017; Zhang et al., 2019b; Zhang et al., 2018a). Meanwhile, some other works reach the opposite conclusion under their experimental settings (Wang et al., 2018; Xu et al., 2019). Due to the lack of systematic comparisons between hierarchical and non-hierarchical models, it is still not clear which kind of model is better. So we conduct extensive experiments to investigate this question in this paper.
2.2. Attention Mechanism
In natural language generation, the attention mechanism is usually used to improve the performance of the encoder-decoder architecture, as it can provide a high-quality context representation (context vector) for decoding. The attention mechanism can be described as mapping a query and a set of key-value pairs to a context vector $c$ (Vaswani et al., 2017), where the query, keys, values, and $c$ are all vectors. $c$ is computed as a weighted sum of the values, where the weight assigned to each value is computed by an attention module applied to the query and the corresponding key.
For non-hierarchical models, the attention weights are calculated as follows:

(1) $e_{t,i} = \eta(s_t, h_i), \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n}\exp(e_{t,j})}$

where $s_t$ is the hidden state of the decoder at step $t$, which is also the query, and $h_i$ is the $i$-th hidden state of the encoder, which is also the $i$-th key (there are $n$ key-value pairs). In most works, the module $\eta$ is a one-layer neural network, which generates the weight score $e_{t,i}$. Then, the softmax function is used to normalize all the weights $\alpha_{t,i}$. Finally, the context vector $c_t$ can be represented as:

(2) $c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i$

where $h_i$ is also the $i$-th value (the same as the key).
For hierarchical models, there are two kinds of attention mechanisms (Xing et al., 2017): word-level attention and utterance-level attention. It should be noted that the calculation processes of the two kinds of attention mechanisms are the same as in Formula (1) and Formula (2). The context vector of hierarchical models is obtained by the following steps. First of all, suppose the multi-turn context contains $m$ utterances; then $m$ different context vectors $\{c_t^1, \dots, c_t^m\}$ are obtained by word-level attention. Then, these context vectors are fed as the input vectors to the context encoder, and the hidden states $\{l_1, \dots, l_m\}$ of the context encoder are obtained. Finally, the utterance-level attention is used to obtain the final context vector $c_t$ for decoding based on $\{l_1, \dots, l_m\}$.
After obtaining the context vector $c_t$, the decoder generates the context-sensitive response, and the $t$-th token $y_t$ is generated as:

(3) $y_t = f(y_{t-1}, s_{t-1}, c_t)$

where the function $f$ is an RNN or transformer decoder.
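To make the two-level computation concrete, here is a minimal PyTorch sketch of the attention path described by Formulas (1)-(3) in a hierarchical model. It is our own illustration rather than code from any of the compared papers: the module name `AdditiveAttention`, the helper `hierarchical_context`, and the hidden size of 512 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """The one-layer attention module eta of Formulas (1)-(2)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, query, keys):
        # query: [batch, hidden] (decoder state s_t); keys = values: [batch, n, hidden]
        q = query.unsqueeze(1).expand(-1, keys.size(1), -1)
        e = self.v(torch.tanh(self.proj(torch.cat([q, keys], dim=-1)))).squeeze(-1)
        alpha = F.softmax(e, dim=-1)                         # Formula (1)
        c = torch.bmm(alpha.unsqueeze(1), keys).squeeze(1)   # Formula (2)
        return c, alpha

word_attn = AdditiveAttention(512)
utt_attn = AdditiveAttention(512)
context_rnn = nn.GRU(512, 512, batch_first=True)

def hierarchical_context(s_t, word_states):
    """word_states: list of m tensors [batch, len_i, hidden], one per utterance."""
    word_contexts = [word_attn(s_t, h)[0] for h in word_states]  # word-level attention
    l, _ = context_rnn(torch.stack(word_contexts, dim=1))        # context-encoder states
    c_t, _ = utt_attn(s_t, l)                                    # utterance-level attention
    return c_t
```

Note that in this formulation the word-level attention is re-queried at every decoding step, which is exactly the efficiency cost discussed in the conclusion.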
3. Systematic Comparisons
In this section, we show the details of our systematic comparisons of the existing models for open-domain multi-turn dialog generation. First of all, the experimental settings are described in Section 3.1. Then, we systematically compare the representative hierarchical and non-hierarchical models on four open-domain dialog datasets in Section 3.2. Moreover, in order to check whether the word-level attention mechanism brings a consistent improvement to hierarchical models, we add it into the other hierarchical models and report the results in Section 3.3. Finally, in Sections 3.4 and 3.5, quantitative and qualitative analyses are presented to explain why the word-level attention mechanism is so effective.
Due to the page limitation, we only show partial results in this paper; more details can be found in the GitHub repository (anonymous link: https://github.com/g32M7fT6b8Y/External-Experiments). Note that the conclusions drawn from the partial results in this paper are consistent with the full results.
3.1. Experimental Setting
3.1.1. Chosen Datasets
In order to systematically compare the models, we choose four popular English open-domain multi-turn dialog datasets:
• DailyDialog (Li et al., 2017): DailyDialog is a manually labelled multi-turn dialogue dataset that reflects daily communication.
• EmpChat (Rashkin et al., 2019): EmpChat, also called EmpatheticDialogues, is a new benchmark for empathetic dialog generation, a novel dataset of 25k conversations grounded in emotional situations. In this work, we simply ignore the emotion labels and situation information.
• DSTC7-AVSD (AlAmri et al., 2019): The DSTC7-AVSD dataset contains multimodal conversations involving video and text. The task aims to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. Here, we simply ignore the video and audio modalities and construct the corpus using only the text.
• PersonaChat (Zhang et al., 2018b): PersonaChat is the first work to introduce profile information (sentences about the persona) as a condition to maintain a consistent personality during dialog generation. In this work, the persona sentences of the speaker are placed at the forefront of the conversation context.
These datasets are carefully pre-processed, and the statistics of the processed datasets can be found in Table 2.
Dataset | Turn max | Turn avg | Turn min | Length max | Length avg | Length min | Vocab
DailyDialog | 34 | 5.09 | 1 | 262 | 14.46 | 2 | 23,869 |
PersonaChat | 54 | 11.96 | 4 | 61 | 11.02 | 2 | 18,096 |
DSTC7-AVSD | 36 | 13.26 | 2 | 69 | 11.27 | 2 | 10,850 |
EmpChat | 7 | 2.23 | 1 | 111 | 14.95 | 2 | 37,109 |
3.1.2. Chosen Models
We implement nearly all of the representative hierarchical and non-hierarchical dialogue models under consistent settings for systematic comparisons.
Hierarchical Models:
• HRED (Serban et al., 2015): HRED is the fundamental hierarchical model, which contains a word-level encoder, an utterance-level (context) encoder, and a decoder, as described in Section 2.1.
• VHRED (Serban et al., 2016): VHRED adds latent stochastic variables before decoding on top of the HRED context encoder, which aims to generate more diverse utterances. In our implementation, the KL annealing method (Bowman et al., 2015) is used, the same as in the original paper.
• WSeq (Tian et al., 2017): WSeq uses the cosine similarity to weight the utterance representations generated by the utterance-level encoder, as described in Section 2.1.
• HRAN (Xing et al., 2017): HRAN contains a hierarchical recurrent attention mechanism, i.e., word-level and utterance-level attention, to fully leverage the context information.
• DSHRED (Zhang et al., 2018a): DSHRED uses dynamic and static attention mechanisms to generate more context-sensitive responses. The query of the dynamic attention is the decoder hidden state, and the query of the static attention is the hidden state of the last utterance in the conversation.
• ReCoSa (Zhang et al., 2019b): ReCoSa leverages multi-head self-attention to detect multiple relevant utterances in the context and achieves state-of-the-art performance. We ran the original code of ReCoSa (https://github.com/zhanghainan/ReCoSa), but it fails to generate any meaningful responses and is very hard to converge. In this paper, we replace the transformer decoder of ReCoSa with an RNN-based decoder and only use the multi-head self-attention in the encoder, which is more stable and effective than the original version of ReCoSa.
Note that there are also some other hierarchical models based on the HRED model, such as Dir-VHRED (Zeng et al., 2019). Because the improvements of these works are small and they are not representative, we do not include them in this paper; this does not affect our conclusions.
Non-hierarchical Models: The non-hierarchical models usually make use of the Seq2Seq architecture and mainly fall into the following two kinds:
• Seq2Seq+attn (Cho et al., 2014): The Seq2Seq architecture was first proposed for neural machine translation and is also widely used for constructing generative open-domain dialog systems. Our implementation is consistent with the original paper, and GRUs are used as the components of the encoder and decoder modules. Besides, the attention module is also added (Bahdanau et al., 2014).
• Seq2Seq+trs (Vaswani et al., 2017): The transformer model leverages the multi-head self-attention mechanism to model the very long conversation context, and shows very powerful performance on various natural language generation tasks. In our experiments, we find that the vanilla transformer model is very sensitive to the hyper-parameters and its performance is very unstable, failing to achieve good results. So we modify the vanilla transformer model by applying the multi-head self-attention module on the Seq2Seq+attn model (a sketch is given below).
Note that some transformer-based non-hierarchical models have been proposed recently, such as DialoGPT (Zhang et al., 2019c). They are essentially the same as the Seq2Seq+trs model. Thus, in this paper, we only evaluate the Seq2Seq+trs model.
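The description of this modification is brief, so the following is only one plausible reading of it: a sketch in which the GRU encoder outputs are refined by a multi-head self-attention layer before being handed to the attentive decoder. The class name and the forward logic are our assumptions; the layer counts and sizes follow Table 3.

```python
import torch
import torch.nn as nn

class GRUSelfAttnEncoder(nn.Module):
    """Sketch of the Seq2Seq+trs encoder: a bidirectional GRU whose outputs
    are refined by multi-head self-attention (hypothetical implementation)."""
    def __init__(self, vocab_size, embed_size=256, hidden_size=512, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.gru = nn.GRU(embed_size, hidden_size // 2, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.self_attn = nn.MultiheadAttention(hidden_size, heads,
                                               batch_first=True)

    def forward(self, tokens):                 # tokens: [batch, seq]
        h, _ = self.gru(self.embed(tokens))    # [batch, seq, hidden]
        out, _ = self.self_attn(h, h, h)       # self-attention over the context
        return out                             # keys/values for decoder attention
```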
3.1.3. Chosen Metrics
In this paper, we choose the following automatic metrics to measure the performance of the hierarchical and non-hierarchical models. It should be noted that we do not choose word-overlap-based metrics such as BLEU and ROUGE, because some studies (Liu et al., 2016; Zhang et al., 2018a) show that these metrics are unsatisfactory for evaluating open-domain dialog generation. Recently, researchers tend to use human evaluation to evaluate open-domain multi-turn dialog systems. However, human evaluations are very expensive, time-consuming, and irreproducible. Besides, there are lots of comparisons between the hierarchical and non-hierarchical models, so it is impractical for us to collect human annotations in this work. Thus, in addition to the most commonly used automatic metrics, we also introduce two state-of-the-art automatic metrics, BERTScore and BERT-RUBER, to accurately measure the performance of these models.
• Dist-1/2 (Li et al., 2015): The two metrics Dist-1/2 measure the degree of diversity of the responses by calculating the number of distinct unigrams and bigrams (see the sketch after this list). If the two metrics are low, the responses generated by the models are very likely to be safe responses, i.e., bad, meaningless responses.
• Embedding-based metrics (Liu et al., 2016): Embedding-based metrics measure the performance by calculating the similarity of sentence embeddings between the ground-truth and the generated response. In this paper, we use the Embedding Average, Vector Extrema, and Greedy Matching methods, hereafter called Average, Extrema, and Greedy. Besides, the GoogleNews word2vec model is used for the word embeddings (https://github.com/mmihaltz/word2vec-GoogleNews-vectors).
• BERTScore (Zhang et al., 2019a): BERTScore leverages tf-idf weighted BERT embeddings to calculate the semantic similarity between responses and ground-truths, and shows a very powerful capability to measure the quality of natural language. The latest BERTScore version 0.3.0 is used in this work (https://github.com/Tiiiger/bert_score).
• BERT-RUBER (Tao et al., 2017): Extensive experiments (Liu et al., 2016) have demonstrated that word-overlap-based metrics (BLEU, ROUGE) and embedding-based metrics show relatively low correlation with human judgments. So a learning-based metric called RUBER (Tao et al., 2017) was proposed, which is trained by negative sampling and shows a very high correlation with human judgments. Furthermore, BERT-RUBER (Ghazarian et al., 2019) leverages BERT contextual word embeddings to improve the performance of RUBER and shows the closest correlation with human judgments. In this work, we implement BERT-RUBER and have released the code (anonymous link: https://github.com/anonymous/xxx).
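As a reference point, Dist-n can be computed in a few lines. This is our own sketch, not the evaluation code we released; it normalizes by the total n-gram count, which is one common variant of the Li et al. (2015) definition:

```python
from collections import Counter

def distinct_n(responses, n):
    """Dist-n: distinct n-grams divided by total n-grams, across all responses."""
    ngrams = Counter()
    for response in responses:
        tokens = response.split()
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

responses = ["give her a surprise !", "tell her the truth ."]
print(distinct_n(responses, 1), distinct_n(responses, 2))  # Dist-1, Dist-2
```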
3.1.4. Parameter Settings
All of the models are implemented in PyTorch. For all models, early stopping and dropout (Srivastava et al., 2014) are used to avoid overfitting. It should be noted that GRUs (Cho et al., 2014) are used for all RNN cells. All hyperparameter settings can be found in Table 3 and are kept consistent across all of the chosen hierarchical and non-hierarchical models.
Hyperparameter | Value |
Learning rate schedule | ReduceLROnPlateau |
Learning rate decay ratio | 0.5 |
Patience of LR decay | 10 |
GRU hidden size | 512 |
Utterance encoder layer | 2 |
Bidirectional encoder | True |
Context encoder layer | 1 |
Decoder layer | 2 |
Gradient clip | 3.0 |
Learning rate | 1e-4 |
Optimizer | Adam |
Dropout ratio | 0.3 |
Weight decay | 1e-6 |
Epochs | 100 |
Word embed size | 256 |
Seed | 30 |
Multi-head | 8 |
512 | |
Transformer layer | 3 |
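For concreteness, the optimization settings in Table 3 translate into roughly the following PyTorch setup (a sketch; the `nn.GRU` stand-in takes the place of any of the chosen models):

```python
import torch
import torch.nn as nn

torch.manual_seed(30)                      # Seed
model = nn.GRU(256, 512, num_layers=2)     # stand-in for any chosen model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=10)    # LR decay ratio 0.5, patience 10

# In the training loop, gradients are clipped before each update:
#   nn.utils.clip_grad_norm_(model.parameters(), 3.0)
# and the scheduler steps on the validation loss after each epoch:
#   scheduler.step(val_loss)
```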
3.2. Hierarchical vs. Non-hierarchical
Model | Dist-1 | Dist-2 | Average | Extrema | Greedy | BERTScore | BERT-RUBER |
Seq2Seq+attn | 2.25 | 10.77 | 61.63 | 77.06 | 49.80 | 13.75 | 57.09 |
Seq2Seq+trs | 2.64 | 13.88 | 62.53 | 77.45 | 51.29 | 15.76 | 64.23 |
HRED | 1.46 | 6.71 | 60.27 | 76.09 | 49.08 | 13.41 | 58.81 |
WSeq | 1.17 | 5.46 | 60.97 | 76.41 | 48.92 | 13.10 | 59.60 |
VHRED | 1.62 | 6.90 | 59.90 | 75.50 | 48.65 | 12.27 | 60.62 |
DSHRED | 1.85 | 9.05 | 61.39 | 76.87 | 49.67 | 14.43 | 62.10 |
ReCoSa | 2.19 | 10.82 | 62.51 | 77.54 | 50.81 | 15.67 | 62.53 |
HRAN | 2.67 | 13.71 | 62.88 | 77.36 | 51.33 | 15.61 | 66.42 |
Model | Dist-1 | Dist-2 | Average | Extrema | Greedy | BERTScore | BERT-RUBER |
Seq2Seq+attn | 0.79 | 4.57 | 63.36 | 83.52 | 48.87 | 16.41 | 41.05 |
Seq2Seq+trs | 0.77 | 4.76 | 64.65 | 83.74 | 49.38 | 16.03 | 42.48 |
HRED | 0.34 | 1.98 | 62.76 | 83.30 | 48.46 | 15.82 | 40.74 |
WSeq | 0.41 | 2.41 | 63.25 | 83.53 | 48.43 | 15.96 | 42.06 |
VHRED | 0.50 | 2.78 | 63.04 | 83.40 | 48.61 | 16.29 | 41.59 |
DSHRED | 0.44 | 2.63 | 63.02 | 83.44 | 48.73 | 15.96 | 42.16 |
ReCoSa | 0.47 | 2.93 | 63.29 | 83.31 | 48.56 | 15.65 | 41.84 |
HRAN | 0.67 | 4.04 | 63.65 | 83.47 | 49.50 | 16.99 | 42.97 |
Figure 2. Performance comparison between the hierarchical and non-hierarchical models.
As shown in Figure 2 and Table 4, we can draw the following conclusions:
• As shown in Figure 2, nearly all of the hierarchical models are worse than the non-hierarchical models, except for the HRAN model. The worse performance of the hierarchical models is caused by the complicated hierarchical architecture, which makes them easily forget the essential fine-grained information, such as the valuable tokens. It should be noted that ReCoSa and DSHRED perform better than non-hierarchical models in their papers, but they perform worse in our experiments. The reasons may be as follows: (1) the implementations in their works differ from ours, e.g., in parameters and training settings, whereas in our work the training and parameter settings of all the models are kept consistent to guarantee the fairness of the comparisons; (2) in their experiments, they only use one or two datasets to evaluate the models, which may be insufficient.
• As shown in Table 4, the HRAN model significantly outperforms the other hierarchical models and the non-hierarchical models on the state-of-the-art automatic metrics BERTScore and BERT-RUBER. The main difference between the HRAN model and the other hierarchical models is its word-level attention mechanism, which likely explains its powerful capability.
3.3. Does Word-level Attention Work Well for Other Hierarchical Models?
Through the systematic comparisons of the hierarchical and non-hierarchical models, it can be found that the word-level attention has the potential to improve the performance of hierarchical models. In order to check whether the word-level attention mechanism brings a consistent improvement to the hierarchical architecture, we add it into the other hierarchical models and report the comparisons in this section.
Figure 3. BERT-RUBER scores of the original hierarchical models (red), the modified hierarchical models with word-level attention (yellow), and the non-hierarchical models (blue).
3.3.1. Chosen Hierarchical Models
It should be noted that not all of the hierarchical models can be equipped with the word-level attention mechanism. For example, the VHRED model cannot, because of its latent variable mechanism. In this paper, we add the word-level attention mechanism into the HRED, WSeq, DSHRED, and ReCoSa models, and obtain the corresponding modified models; for example, DSHRED+WA denotes DSHRED with the word-level attention mechanism. A sketch of the modification is given below.
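Concretely, reusing the `AdditiveAttention` sketch from Section 2.2, the +WA variants can be illustrated as replacing each utterance's fixed representation with an attention read over its word-encoder states. This is our own illustration; the published models' exact implementations may differ.

```python
# Plain hierarchical model: each utterance is represented by the final
# hidden state of the word-level encoder.
# +WA variant: each utterance is represented by word-level attention over
# all word-encoder states, queried by the current decoder state s_t.
def utterance_inputs(s_t, word_states, use_word_attention):
    # word_states: list of m tensors [batch, len_i, hidden]
    if use_word_attention:
        return [word_attn(s_t, h)[0] for h in word_states]  # +WA
    return [h[:, -1, :] for h in word_states]               # final state only
```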
3.3.2. Results
Model | Dist-1 | Dist-2 | Average | Extrema | Greedy | BERTScore | BERT-RUBER |
Seq2Seq+attn | 2.25 | 10.77 | 61.63 | 77.06 | 49.80 | 13.75 | 57.09 |
Seq2Seq+trs | 2.64 | 13.88 | 62.53 | 77.45 | 51.29 | 15.76 | 64.23 |
HRED+WA | 2.67 | 13.71 | 62.88 | 77.36 | 51.33 | 15.61 | 66.42 |
WSeq+WA | 2.68 | 13.65 | 61.66 | 77.12 | 50.31 | 14.93 | 66.52 |
DSHRED+WA | 2.85 | 14.66 | 62.32 | 77.09 | 51.75 | 16.48 | 67.27 |
ReCoSa+WA | 3.13 | 15.06 | 61.33 | 76.56 | 49.84 | 14.45 | 59.73 |
Model | Dist-1 | Dist-2 | Average | Extrema | Greedy | BERTScore | BERT-RUBER |
Seq2Seq+attn | 0.79 | 4.57 | 63.36 | 83.52 | 48.87 | 16.41 | 41.05 |
Seq2Seq+trs | 0.77 | 4.76 | 64.65 | 83.74 | 49.38 | 16.03 | 42.48 |
HRED+WA | 0.67 | 4.04 | 63.65 | 82.47 | 49.50 | 16.99 | 42.97 |
WSeq+WA | 0.59 | 3.20 | 62.94 | 83.43 | 48.69 | 16.52 | 42.55 |
DSHRED+WA | 0.80 | 4.84 | 64.14 | 83.64 | 49.48 | 16.68 | 44.71 |
ReCoSa+WA | 0.62 | 3.39 | 63.39 | 83.74 | 48.78 | 15.34 | 42.97 |
The comparisons between the modified hierarchical models on the BERT-RUBER metric are shown in Figure 3, and we can draw the following conclusions:
• Compared with the original hierarchical models (red bars), the word-level attention mechanism significantly improves the performance of the modified models (yellow bars) on the BERT-RUBER metric, which means that the word-level attention mechanism is effective for the hierarchical architecture.
• Nearly all of the modified hierarchical models (yellow bars) significantly outperform the state-of-the-art non-hierarchical models (blue bars).
• The improvement of the DSHRED model is the most significant: 6.61% and 5.17% improvements are achieved on the EmpChat and DailyDialog datasets, respectively.
• Although the word-level attention mechanism indeed improves the ReCoSa model, the improvement is not as large as that of the other hierarchical models. Through careful analysis, the reason may be an incompatibility between the multi-head self-attention mechanism and the vanilla attention mechanism. In the future, we will study how to combine these two attention mechanisms more effectively.
Moreover, all of the automatic metrics of these modified hierarchical models are shown in Table 5. From these results, it can be observed that: (1) on most of the automatic metrics, especially BERTScore and BERT-RUBER, the word-level attention greatly improves the hierarchical models, which then significantly outperform the non-hierarchical models; (2) compared with the other modified hierarchical models, the DSHRED+WA model achieves the best performance on most of the automatic metrics, especially the state-of-the-art metric BERT-RUBER, while the other modified hierarchical models do not outperform the non-hierarchical models significantly.
3.4. Why Is Word-level Attention So Effective? Quantitative Analysis
Through the above discussion and analysis, it is clear that the word-level attention mechanism is necessary for hierarchical models, as it can significantly improve their performance. But it is still unclear why the word-level attention is so effective.
Intuitively, word-level attention can generate better utterance representations for the context encoder of hierarchical models. Besides, a lot of fine-grained information, i.e., word-level information, can be leveraged effectively. In other words, word-level attention can help the model leverage the information in the multi-turn conversation context effectively. In the HRAN paper (Xing et al., 2017), only an ablation study is conducted to analyze the contribution of the word-level attention, which is coarse and insufficient. But how can we quantitatively measure a model's capability of using contextual information?
The perturbation test (Khandelwal et al., 2018; Sankar et al., 2019) is a novel and effective method to measure the capability of generative models to utilize context information in natural language processing. The central premise of the perturbation test is that models make minimal use of certain types of information if they are insensitive to perturbations that destroy it. Specifically, 10 kinds of perturbations are injected into the multi-turn conversation context only during the test stage (Sankar et al., 2019), and the decrease in performance is reported. If the performance of a model decreases sharply, it means that the model effectively uses the context information; otherwise, it does not. However, the original perturbation test (Sankar et al., 2019) uses the perplexity metric to evaluate the generative model's performance, which is unsuitable for dialogue modeling (Liu et al., 2016). So in our work, we replace perplexity with the state-of-the-art metrics BERTScore and BERT-RUBER. The perturbations are as follows (a code sketch of them is given after the lists):
Model | DailyDialog | EmpChat | PersonaChat | DSTC7-AVSD |
Seq2Seq+trs | -7.46 | -12.02 | -14.01 | -17.34 |
HRED | -5.83 | -11.55 | -13.80 | -17.29 |
HRED+WA | -6.64 | -14.10 | -17.15 | -21.04 |
WSeq | -6.04 | -11.80 | -13.38 | -17.96 |
WSeq+WA | -7.05 | -12.77 | -17.72 | -20.48 |
DSHRED | -7.38 | -11.81 | -15.00 | -17.23 |
DSHRED+WA | -7.28 | -13.34 | -14.93 | -17.29 |
ReCoSa | -6.67 | -11.74 | -14.49 | -17.08 |
ReCoSa+WA | -5.98 | -12.20 | -15.62 | -17.22 |
Model | DailyDialog | EmpChat | PersonaChat | DSTC7-AVSD |
Seq2Seq+trs | -5.37 | -2.01 | -2.96 | -8.29 |
HRED | -3.55 | -1.69 | -4.08 | -9.04 |
HRED+WA | -4.93 | -2.10 | -4.13 | -9.13 |
WSeq | -3.15 | -1.91 | -2.92 | -6.73 |
WSeq+WA | -4.33 | -2.14 | -4.71 | -9.68 |
DSHRED | -4.00 | -2.02 | -3.54 | -7.62 |
DSHRED+WA | -5.84 | -2.11 | -4.88 | -9.63 |
ReCoSa | -4.97 | -1.42 | -3.18 | -7.33 |
ReCoSa+WA | -4.50 | -1.42 | -2.50 | -6.96 |
3.4.1. Utterance-level perturbation
• Shuffle: shuffles the sequence of utterances in the multi-turn dialog history.
• Reverse: reverses the order of utterances in the history (but maintains the word order within each utterance).
• Drop first: drops the first sentence in the dialog history.
• Drop last: drops the last sentence (i.e., the query) in the dialog history.
• Truncate: truncates the dialog history to contain only the $k$ most recent utterances, where $k \le m$ and $m$ is the number of utterances in the multi-turn conversation. In this paper, $k$ is 1, i.e., only the last utterance is kept as the conversation context.
3.4.2. Word-level perturbation
• Word shuffle: randomly shuffles the words within each utterance.
• Word reverse: reverses the ordering of the words in each utterance.
• Word drop: drops 30% of the words uniformly in each utterance.
• Noun drop: drops all the nouns.
• Verb drop: drops all the verbs.
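These perturbations are simple list operations over the utterance history. The following is our own sketch (function names are ours; the noun and verb drops are omitted since they additionally require a POS tagger, e.g., from NLTK):

```python
import random

def shuffle_utterances(history):      # Shuffle
    return random.sample(history, len(history))

def reverse_utterances(history):      # Reverse (word order kept within turns)
    return history[::-1]

def drop_first(history):              # Drop first
    return history[1:]

def drop_last(history):               # Drop last (the query)
    return history[:-1]

def truncate(history, k=1):           # Truncate to the k most recent turns
    return history[-k:]

def word_shuffle(history):            # Word shuffle within each utterance
    return [" ".join(random.sample(u.split(), len(u.split()))) for u in history]

def word_reverse(history):            # Word reverse within each utterance
    return [" ".join(reversed(u.split())) for u in history]

def word_drop(history, p=0.3):        # Word drop: remove 30% of words uniformly
    return [" ".join(w for w in u.split() if random.random() >= p)
            for u in history]
```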
The performance decreases on BERT-RUBER and BERTScore caused by the perturbations are shown in Table 6, where the average decrease over the 10 perturbations is reported. From these results, we can draw the following conclusions: (1) the performance decrease of most modified models is larger than that of the original hierarchical models, which means that they leverage the multi-turn context information more effectively; (2) in most cases, the modified hierarchical models are more sensitive to the perturbations than the non-hierarchical models, which means that the modified hierarchical models leverage the context information more effectively than the non-hierarchical models. This also explains why the modified hierarchical models significantly outperform the non-hierarchical models.
The quantitative analysis from the perturbation test demonstrates that the modified hierarchical models leverage the multi-turn context more effectively, which explains why the word-level attention mechanism is necessary for hierarchical models.
Context | A man is standing against the wall of a house or building when another man walks up from the side. The man who walks up is carrying a bag and can be heard. There is a man that is holding a cellphone and texting. A man then enters and places things in a counter. How many persons are in the video ?
Ground-Truth | There are two people in the video |
Seq2Seq+trs | There is only one person in the video . |
HRED | There is one man in the video . |
HRED+WA | There are two people in the video . |
Figure 4. Attention heatmaps of the HRED+WA and Seq2Seq+trs models on the example in Table 7.
3.5. Why Is Word-level Attention So Effective? Qualitative Analysis
In this section, in order to qualitatively show the capability of the word-level attention mechanism, we show the attention weight heatmaps of the HRED+WA, HRED, and Seq2Seq+trs models on a real example from the DSTC7-AVSD dataset, shown in Table 7. In this example, the query asks the dialog model the question "How many persons are in the video?" The model needs to make full use of the information in the context to give the correct answer, "two people". Besides, the most valuable utterance is the first sentence, which contains the keyword "another". As shown in Table 7, only the HRED+WA model provides the correct answer, which means that the other models use the context information incorrectly or inadequately.
Firstly, we compare the HRED and HRED+WA models. In this comparison, the third token in the response is crucial; for example, the token "one" leads to the wrong result. In order to qualitatively analyze why HRED decides to generate the token "one", the context-level attention scores are shown in Figure 5 (b). It can be found that the HRED model ignores the information of the first sentence: the attention score of the first utterance is 8.9e-3. The reasons for HRED's unsatisfactory performance may be as follows: (1) the complicated hierarchical architecture makes HRED easily forget the essential word-level information; (2) the utterance representations generated by the word-level encoder are unsatisfactory. By contrast, HRED+WA generates more appropriate attention scores: the attention score of the first utterance is 0.134, which is much higher than that of HRED. Because the HRED+WA model collects the essential word-level information and generates better utterance representations, it can focus on the valuable utterance and generate the correct answer.
Then, we also compare the HRED+WA and Seq2Seq+trs models. The attention heatmaps are shown in Figure 4. It can be found that the non-hierarchical model Seq2Seq+trs tends to leverage the nearby context and barely focuses on the valuable first utterance far away; this phenomenon has already been reported by recent works (Sankar et al., 2019; Khandelwal et al., 2018). The HRED+WA model, in contrast, can effectively focus on the fine-grained word-level information "another" in the first utterance. Specifically, the corresponding attention score of HRED+WA is 0.0504, which is much higher than the 7.4e-4 of the Seq2Seq+trs model. Compared with the non-hierarchical model, the modified hierarchical models make full use of the context information and focus on the fine-grained information to generate more context-sensitive responses.
Figure 5. Context-level attention scores of the HRED and HRED+WA models on the example in Table 7.
4. Conclusions and Future Work
Open-domain multi-turn dialog generation is a big challenge. Due to the lack of adequate and systematic comparisons between hierarchical and non-hierarchical models, it is still not clear which kind of model is better in open-domain multi-turn dialog generation. Thus, in this paper, we systematically measure nearly all representative hierarchical and non-hierarchical models over the same experimental settings to check which kind is better.
Extensive experiments demonstrate three important conclusions: (1) Nearly all hierarchical models are worse than non-hierarchical models, except for the HRAN model, which contains the word-level attention mechanism; (2) The performance of hierarchical models can be greatly improved by integrating the word-level attention mechanism, and the modified hierarchical models even significantly outperform the state-of-the-art non-hierarchical models; (3) The reason why the word-level attention mechanism is so powerful for hierarchical models is that it can leverage the context more effectively, especially the fine-grained information.
Although extensive experiments demonstrate that the word-level attention mechanism is very important for hierarchical models, it still has some notable weaknesses. For example, the training and inference stages of the modified hierarchical models are slower than those of the hierarchical models without the word-level attention mechanism: when decoding every token, the context-level encoder needs to re-process the utterance representations generated by the word-level attention, which is very time-consuming. So in the future, we would like to improve the efficiency of the word-level attention mechanism and accelerate the training and inference stages.
References
- Adiwardana et al. (2020) Daniel De Freitas Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a Human-like Open-Domain Chatbot. ArXiv abs/2001.09977 (2020).
- AlAmri et al. (2019) Huda AlAmri, Vincent Cartillier, Abhishek Das, Jue Wang, Stefan Lee, Peter Anderson, Irfan Essa, Devi Parikh, Dhruv Batra, Anoop Cherian, Tim K. Marks, and Chiori Hori. 2019. Audio Visual Scene-Aware Dialog. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), 7550–7559.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014).
- Bowman et al. (2015) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2015. Generating Sentences from a Continuous Space. In CoNLL.
- Chen et al. (2018) Hongshen Chen, Zhaochun Ren, Jiliang Tang, Yihong Eric Zhao, and Dawei Yin. 2018. Hierarchical Variational Memory Network for Dialogue Generation. In WWW.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. ArXiv abs/1406.1078 (2014).
- Ghazarian et al. (2019) Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng. 2019. Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings. ArXiv abs/1904.10635 (2019).
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9 (1997), 1735–1780.
- Khandelwal et al. (2018) Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context. In ACL.
- Li et al. (2015) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B. Dolan. 2015. A Diversity-Promoting Objective Function for Neural Conversation Models. ArXiv abs/1510.03055 (2015).
- Li et al. (2017) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In IJCNLP.
- Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. ArXiv abs/1603.08023 (2016).
- Rashkin et al. (2019) Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset. In ACL.
- Sankar et al. (2019) Chinnadhurai Sankar, S. Subramanian, Christopher Joseph Pal, A. P. Sarath Chandar, and Yoshua Bengio. 2019. Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study. In ACL.
- Serban et al. (2015) Iulian Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2015. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI.
- Serban et al. (2016) Iulian Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2016. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In AAAI.
- Shang et al. (2015) Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural Responding Machine for Short-Text Conversation. In ACL.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (2014), 1929–1958.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS.
- Tao et al. (2017) Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2017. RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems. ArXiv abs/1701.03079 (2017).
- Tian et al. (2017) Zhiliang Tian, Rui Yan, Lili Mou, Yiping Song, Yansong Feng, and Dongyan Zhao. 2017. How to Make Context More Useful? An Empirical Study on Context-Aware Neural Conversational Models. In ACL.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS.
- Wang et al. (2018) Wenjie Wang, Minlie Huang, Xin-Shun Xu, Fumin Shen, and Liqiang Nie. 2018. Chat More: Deepening and Widening the Chatting Topic via A Deep Model. In SIGIR ’18.
- Xing et al. (2017) Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, and Wei-Ying Ma. 2017. Hierarchical Recurrent Attention Network for Response Generation. ArXiv abs/1701.07149 (2017).
- Xu et al. (2019) Zhen Xu, Chengjie Sun, Yinong Long, Bingquan Liu, Baoxun Wang, Mingjiang Wang, Min Zhang, and Xiaolong Wang. 2019. Dynamic Working Memory for Context-Aware Response Generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (2019), 1419–1431.
- Zeng et al. (2019) Min Zeng, Yisen Wang, and Yuan Luo. 2019. Dirichlet Latent Variable Hierarchical Recurrent Encoder-Decoder in Dialogue Generation. In EMNLP/IJCNLP.
- Zhang et al. (2019b) Hainan Zhang, Yanyan Lan, Liang Pang, Jiafeng Guo, and Xueqi Cheng. 2019b. ReCoSa: Detecting the Relevant Contexts with Self-Attention for Multi-turn Dialogue Generation. In ACL.
- Zhang et al. (2018b) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018b. Personalizing Dialogue Agents: I have a dog, do you have pets too?. In ACL.
- Zhang et al. (2019a) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019a. BERTScore: Evaluating Text Generation with BERT. ArXiv abs/1904.09675 (2019).
- Zhang et al. (2018a) Weinan Zhang, Yiming Cui, Yifa Wang, Qingfu Zhu, Lingzhi Li, Lianqiang Zhou, and Ting Liu. 2018a. Context-Sensitive Generation of Open-Domain Conversational Responses. In COLING.
- Zhang et al. (2019c) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and William B. Dolan. 2019c. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. ArXiv abs/1911.00536 (2019).
- Zhou et al. (2018) Hao Zhou, Tom Young, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2018. Commonsense Knowledge Aware Conversation Generation with Graph Attention. In IJCAI.