“Wait, I’m Still Talking!”
Predicting the Dialogue Interaction Behavior
Using Imagine-Then-Arbitrate Model

Zehao Lin1, Xiaoming Kang2, Guodun Li1, Feng Ji2, Haiqing Chen2, Yin Zhang1
1College of Computer Science and Technology, Zhejiang University
2DAMO Academy, Alibaba Group
[email protected], [email protected], [email protected], {zhongxiu.jf, haiqing.chenhq}@alibaba-inc.com, [email protected]
Abstract

Producing natural and accurate responses like human beings is the ultimate goal of intelligent dialogue agents. So far, most past work concentrates on selecting or generating one pertinent and fluent response according to the current query and its context. These models work in a one-to-one setting, making one response to one utterance each round. However, in real human-human conversations, people often send several short messages in sequence for readability instead of one long message per turn. Messages therefore do not end with an explicit ending signal, which is crucial for an agent deciding when to reply. So the first step for an intelligent dialogue agent is not replying but deciding whether it should reply at the moment. To address this issue, we propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to help the agent decide whether to wait or to respond directly. Our method has two imaginator modules and an arbitrator module. The two imaginators learn the agent’s and the user’s speaking styles respectively and generate possible utterances which, combined with the dialogue history, serve as the input of the arbitrator. The arbitrator then decides whether to wait or to respond to the user directly. To verify the performance and effectiveness of our method, we prepared two dialogue datasets and compared our approach with several popular models. Experimental results show that our model performs well on the ending prediction problem and outperforms the baseline models.

1 Introduction

Figure 1: A multi-turn dialogue fragment. In this case, the user sends split utterances within a turn, e.g. U1 is split into {U11, U12, U13}.

All species are unique, but language makes humans the most unique Premack (2004). Dialogues, especially spoken and written dialogues, are fundamental communication mechanisms for human beings. In real life, a great deal of business and entertainment is conducted via dialogue. This makes it significant and valuable to build intelligent dialogue products. So far there are quite a few business applications of dialogue techniques, e.g. personal assistants, intelligent customer service and chitchat companions.

The quality of the response is always the most important metric for a dialogue agent, and it is the target of most existing work, whose models search for the best response. Some works incorporate knowledge Madotto et al. (2018); Lin et al. (2019) to improve the success rate of task-oriented dialogue models, while others Vinyals et al. (2015) address the rare-word problem and make responses more fluent and informative.

Despite the heated competition among models, the pace of interaction is also important for human-computer dialogue agents, yet it has drawn little or no attention. Figure 1 shows a typical dialogue fragment in an instant messaging program. A user is asking the service about the schedule of the theater. The user first says hello (U11), then describes the demand (U12), and then asks for a suggested arrangement (U13), each of which is sent as a single message within one turn. The agent doesn’t answer (A2) until the user finishes the description and poses the question. The user then makes a decision (U21) and asks a new question (U22), and the agent replies with (A3). It is quite normal and natural that the user sends several messages in one turn and the agent waits until the user has finished the last message; otherwise the pace of the conversation is disrupted. However, existing dialogue agents cannot handle this scenario well and reply immediately to every utterance received.

There are two issues when applying existing dialogue agents to real-life conversation. Firstly, when a user sends a short utterance as the start of a conversation, the agent has to make a decision to avoid generating bad responses based on a semantically incomplete utterance. Secondly, a dialogue agent cutting into the conversation at an unreasonable time could confuse the user and disrupt the pace of the conversation, leading to nonsensical interactions.

To address these two issues, in this paper we propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to recognize whether it is the appropriate moment for the agent to reply when it receives a message from the user. Our method has two imaginator modules and an arbitrator module. The imaginators learn the agent’s and the user’s speaking styles respectively. The arbitrator uses the dialogue history and the imagined future utterances generated by the two imaginators to decide whether the agent should wait for the user or respond directly.

In summary, this paper makes the following contributions:

  • We first address an interaction problem, namely whether the dialogue model should wait for the end of the utterance or respond directly in order to simulate real-life conversation, and we try several popular baseline models on it.

  • We propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to solve the problem mentioned above, based on both the historical conversation information and the predicted possible future utterances.

  • We modify two popular dialogue datasets to simulate real human dialogue interaction behavior.

  • Experimental results demonstrate that our model performs well on the ending prediction problem and that the proposed imaginator modules significantly help the arbitrator outperform the baseline models.

2 Related Work

2.1 Dialogue System

Creating a perfect artificial human-computer dialogue system has always been an ultimate goal of natural language processing. In recent years, deep learning has become a basic technique in dialogue systems. Much work has investigated applying neural networks to the components of dialogue systems or to end-to-end dialogue frameworks Yan et al. (2017); Lipton et al. (2018). The advantage of deep learning is its ability to leverage large amounts of data from the internet, sensors, etc. Large conversation datasets and deep learning techniques like SEQ2SEQ Sutskever et al. (2014) and the attention mechanism Luong et al. (2015) help the model understand utterances, retrieve background knowledge and generate responses.

Figure 2: Model Overview. (a) The agent and user imaginators are trained on the same dialogues but with different samples. (b) During training and inference, the arbitrator uses the dialogue history and the predictions of the two trained imaginators.

2.2 Classification in Dialogue

Though end-to-end methods play an increasingly important role in dialogue systems, text classification modules Jiang et al. (2018); Kowsari et al. (2017) remain very useful in many problems like emotion recognition Song et al. (2019), gender recognition Hoyle et al. (2019), verbal intelligence, etc. Several widely used text classification methods have been proposed, e.g. Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). Typically an RNN is trained to recognize patterns across time, while a CNN learns to recognize patterns across space. Kim (2014) proposed TextCNNs trained on top of pre-trained word vectors for sentence-level classification tasks, and achieved excellent results on multiple benchmarks.

Besides RNNs and CNNs, Vaswani et al. (2017) proposed a new network architecture called Transformer, based solely on attention mechanisms, which obtained promising performance on many NLP tasks. To make the best use of unlabeled data, Devlin et al. (2018) introduced a new language representation model called BERT, based on the Transformer, and obtained state-of-the-art results.

2.3 Dialogue Generation

Different from retrieval methods, Natural Language Generation (NLG) tries to convert a communication goal, selected by the dialogue manager, into a natural language form. It reflects the naturalness of a dialogue system, and thus the user experience. The traditional template or rule-based approach mainly consists of a set of templates, rules and hand-crafted heuristics designed by domain experts. This makes it labor-intensive yet rigid, motivating researchers to find more data-driven approaches Ghazvininejad et al. (2018); Lin et al. (2019) that aim to optimize a generation module from corpora. One of them, the Semantically Controlled LSTM (SC-LSTM) Wen et al. (2015), a variant of LSTM Hochreiter and Schmidhuber (1997), adds semantic control to language generation with an extra component.

3 Task Definition

In this section we describe the task through a scenario and then define it formally.

As shown in Figure 1, there are two participants in a conversation. One is the dialogue agent, and the other is a real human user. The agent’s behavior is similar to that of most chatbots, except that it doesn’t reply to every sentence received. Instead, the agent judges when the right time to reply is.

Our problem is formulated as follows. There is a conversation history represented as a sequence of utterances $X=\{x_{1},x_{2},\dots,x_{m}\}$, where each utterance $x_{i}$ is itself a sequence of words $x_{i_{1}},x_{i_{2}},x_{i_{3}},\dots,x_{i_{n}}$. In addition, each utterance carries some tags:

  • turn tags $t_{0},t_{1},t_{2},\dots,t_{k}$, which show the turn of the whole conversation that the utterance belongs to.

  • speaker identification tags $agent$ or $user$, which show who sends the utterance.

  • subturn tags ${st}_{0},{st}_{1},{st}_{2},\dots,{st}_{j}$ for the user, which indicate the subturn of turn $t_{i}$ that an utterance is in. Note that an utterance is labelled ${st}_{0}$ even if its turn is not split into subturns.

Now, given a dialogue history $X$ and tags $T$, the goal of the model is to predict a label $Y\in\{0,1\}$, the action the agent takes, where $Y=0$ means the agent waits for the user’s next message and $Y=1$ means the agent replies immediately. Formally, we choose the label that maximizes the following probability:

$$Y=\arg\max_{y}P\left(y|X,T\right) \tag{1}$$
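
To make the task definition concrete, the following minimal sketch shows one way the tagged utterances and the wait/reply labels could be represented. The `Utterance` class, the `wait_or_reply_samples` helper and the rule for deriving $Y$ from the next speaker are illustrative assumptions rather than the released implementation, and the turn numbering in the example is likewise assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    speaker: str      # identity tag: "agent" or "user"
    turn: int         # turn tag t_k
    subturn: int      # subturn tag st_j (0 when the turn is not split)
    words: List[str]  # the word sequence x_i


def wait_or_reply_samples(dialogue: List[Utterance]) -> List[Tuple[List[Utterance], int]]:
    """Return (history X, label Y) pairs, one per user utterance that has a successor.

    Assumed labelling rule: Y = 0 (wait) if the user also sends the next
    message, Y = 1 (reply) if the agent speaks next.
    """
    samples = []
    for i in range(len(dialogue) - 1):
        if dialogue[i].speaker != "user":
            continue
        label = 1 if dialogue[i + 1].speaker == "agent" else 0
        samples.append((dialogue[: i + 1], label))
    return samples


# Utterance order of Figure 1 (A1, U11, U12, U13, A2, U21, U22, A3);
# the words are placeholders, not the text shown in the figure.
order = [("agent", 1, 0), ("user", 1, 0), ("user", 1, 1), ("user", 1, 2),
         ("agent", 2, 0), ("user", 2, 0), ("user", 2, 1), ("agent", 3, 0)]
dialogue = [Utterance(s, t, st, ["..."]) for s, t, st in order]
labels = [y for _, y in wait_or_reply_samples(dialogue)]
print(labels)  # [0, 0, 1, 0, 1]
```

Under this assumed rule, U11, U12 and U21 are labelled "wait", while U13 and U22 are labelled "reply".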

4 Proposed Framework

Basically, the task can be simplified to a plain text classification problem. However, traditional classification models only use the dialogue history $X$ to predict the ground truth label, and the label alone ignores all the context information in the next utterance. To make the best use of the training data, we propose a novel Imagine-then-Arbitrate (ITA) model that takes $X$, the ground truth label and the possible future utterances $X^{\prime}$ into consideration. In this section, we describe the architecture of our model and how it works in detail.

4.1 Imaginator

An imaginator is a natural language generator that generates the next sentence given the dialogue history. There are two imaginators in our method, the agent’s imaginator and the user’s imaginator. The goal of the two imaginators is to learn the agent’s and the user’s speaking styles respectively and to generate possible future utterances.

As shown in Figure 2 (a), an imaginator is itself a sequence generation model. We use one-hot embedding to convert all words and related tags, e.g. turn tags and placeholders, to one-hot vectors $w_{n}\in\mathbf{R}^{V}$, where $V$ is the size of the vocabulary. Then we extend each word $x_{i_{j}}$ in utterance $x_{i}$ by concatenating the token itself with its turn tag, identity tag and subturn tag. We adopt SEQ2SEQ as the basic architecture and LSTMs as the encoder and decoder networks. The LSTMs encode each extended word $w_{t}$ as a continuous vector $h_{t}$ at each time step $t$. The process can be formulated as follows:

$$\begin{array}{l}
f_{t}=\sigma\left(W_{f}e(w_{t})+U_{f}h_{t-1}+b_{f}\right)\\
i_{t}=\sigma\left(W_{i}e(w_{t})+U_{i}h_{t-1}+b_{i}\right)\\
o_{t}=\sigma\left(W_{o}e(w_{t})+U_{o}h_{t-1}+b_{o}\right)\\
g_{t}=\tanh\left(W_{g}e(w_{t})+U_{g}h_{t-1}+b_{g}\right)\\
c_{t}=f_{t}\odot c_{t-1}+i_{t}\odot g_{t}\\
h_{t}=o_{t}\odot\tanh\left(c_{t}\right)
\end{array} \tag{2}$$

where $e(w_{t})$ is the embedding of the extended word $w_{t}$, and $W_{f}$, $U_{f}$, $W_{i}$, $U_{i}$, $W_{o}$, $U_{o}$, $W_{g}$, $U_{g}$ and the biases $b_{f}$, $b_{i}$, $b_{o}$, $b_{g}$ are learnt parameters.
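
A single step of the recurrence in equation (2) can be sketched in plain Python as follows. This is only an illustration of the gate arithmetic: the helper names (`matvec`, `lstm_step`), the dictionary layout of the parameters and the tiny dimensions are assumptions, and a real imaginator would rely on a deep learning library.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
GateParams = Tuple[Matrix, Matrix, Vector]  # (W, U, b) for one gate


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(e_w: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, GateParams]) -> Tuple[Vector, Vector]:
    """One step of equation (2); params maps "f", "i", "o", "g" to (W, U, b)."""

    def gate(name: str, act: Callable[[float], float]) -> Vector:
        W, U, b = params[name]
        pre = [wx + uh + bias
               for wx, uh, bias in zip(matvec(W, e_w), matvec(U, h_prev), b)]
        return [act(p) for p in pre]

    f = gate("f", sigmoid)     # forget gate f_t
    i = gate("i", sigmoid)     # input gate i_t
    o = gate("o", sigmoid)     # output gate o_t
    g = gate("g", math.tanh)   # candidate cell state g_t
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Toy check with all-zero parameters: every sigmoid gate is 0.5 and g_t is 0,
# so c_t = 0.5 * c_{t-1} and h_t = 0.5 * tanh(c_t).
zero: GateParams = ([[0.0, 0.0]], [[0.0]], [0.0])
h, c = lstm_step([0.3, -0.2], [0.1], [1.0], {k: zero for k in "fiog"})
```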

Though trained on the same dataset, the two imaginators learn different roles independently. So we split the same piece of dialogue into different samples for the different imaginators. For example, as shown in Figures 1 and 2 (a), we use the utterances (A1, U11, U12) as dialogue history input and U13 as ground truth to train the user imaginator, and use the utterances (A1, U11, U12, U13) as dialogue history and A2 as ground truth to train the agent imaginator.
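
The sample split described above could look like the sketch below. The function name `imaginator_samples` is hypothetical, and the choice to emit one sample for every utterance after the first (rather than only the two samples named in the example) is an assumption.

```python
from typing import List, Tuple

Turn = Tuple[str, str]            # (speaker, utterance id or text)
Sample = Tuple[List[str], str]    # (dialogue history, ground truth utterance)


def imaginator_samples(dialogue: List[Turn]) -> Tuple[List[Sample], List[Sample]]:
    """Split one dialogue into user-imaginator and agent-imaginator samples.

    Each utterance after the first becomes the target of a sample whose
    history is everything said before it; the speaker of the target decides
    which imaginator the sample trains.
    """
    user_samples: List[Sample] = []
    agent_samples: List[Sample] = []
    for i in range(1, len(dialogue)):
        history = [text for _, text in dialogue[:i]]
        speaker, target = dialogue[i]
        bucket = user_samples if speaker == "user" else agent_samples
        bucket.append((history, target))
    return user_samples, agent_samples


fragment = [("agent", "A1"), ("user", "U11"), ("user", "U12"),
            ("user", "U13"), ("agent", "A2")]
user_samples, agent_samples = imaginator_samples(fragment)
print(user_samples[-1])   # (['A1', 'U11', 'U12'], 'U13')
print(agent_samples[-1])  # (['A1', 'U11', 'U12', 'U13'], 'A2')
```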

During training, the encoder runs as in equation (2), and the decoder is an LSTM with the same structure, except that $h_{t}$ is fed to a Softmax layer with $W_{v}\in\mathbf{R}^{h\times V}$, $b_{v}\in\mathbf{R}^{V}$, which produces a probability distribution $p_{t}$ over all words. Formally:

$$p_{t}=\mathrm{Softmax}(W_{v}h_{t}+b_{v}) \tag{3}$$

At time step $t$ the decoder selects the word with the highest probability in $p_{t}$, and the imaginator’s loss is the sum of the negative log likelihood of the correct word at each step:

$$L=-\sum_{t=1}^{N}\log(p_{t}) \tag{4}$$

where $\log(p_{t})$ is evaluated at the correct word of step $t$ and $N$ is the length of the generated sentence. During inference, we also apply beam search to improve the generation performance.
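
Equations (3) and (4) can be illustrated with a small pure-Python sketch. It takes the decoder outputs $W_{v}h_{t}+b_{v}$ as given logits; the function names and the toy vocabulary of four words are assumptions made for the example.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax, as in equation (3)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_nll(step_logits: List[List[float]], target_ids: List[int]) -> float:
    """Equation (4): sum over steps of -log p_t(correct word)."""
    return -sum(math.log(softmax(logits)[target])
                for logits, target in zip(step_logits, target_ids))


# With uniform logits over a vocabulary of 4 words, each step costs log(4).
uniform_steps = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = sequence_nll(uniform_steps, [0, 1, 2])
```

A decoder that puts more probability mass on the correct word at each step obtains a lower value of this loss.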

Finally, the trained agent imaginator and user imaginator are obtained.

4.2 Arbitrator

The arbitrator module is fundamentally a text classifier. However, in this task, we want the module to make maximal use of both the dialogue history and the semantic information of the ground truth. So we turn the problem of predicting $Y$ from $X$ in equation (1) into:

$$\begin{aligned}
R_{agent}&=\mathbf{IG}_{agent}(X,T)\\
R_{user}&=\mathbf{IG}_{user}(X,T)\\
R^{\prime}&=\arg\max_{y}P\left(y|X,T,R_{agent},R_{user}\right)\\
R^{\prime}&\in\{0,1\}
\end{aligned} \tag{5}$$

where $\mathbf{IG}_{agent}$ and $\mathbf{IG}_{user}$ are the trained agent imaginator and user imaginator respectively, and $R^{\prime}$ is a selection indicator: $R^{\prime}=1$ means selecting $R_{agent}$, whereas $R^{\prime}=0$ means selecting $R_{user}$. Thus we (1) introduce the semantic information of the generation ground truth and of the predicted possible future utterances and (2) turn the label prediction problem into a response selection problem.
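
The control flow of equation (5) can be sketched as a small function that calls the two imaginators and then the arbitrator. All names here are hypothetical, the tags $T$ are omitted for brevity, and the toy imaginators and the question-mark heuristic are stand-ins so that the sketch runs; they are not the trained modules.

```python
from typing import Callable, List

History = List[str]
Imaginator = Callable[[History], str]
Selector = Callable[[History, str, str], int]


def ita_decide(history: History, imagine_agent: Imaginator,
               imagine_user: Imaginator, select: Selector) -> int:
    """Return R' from equation (5): 1 = reply now, 0 = wait for the user."""
    r_agent = imagine_agent(history)   # R_agent = IG_agent(X, T)
    r_user = imagine_user(history)     # R_user  = IG_user(X, T)
    return select(history, r_agent, r_user)


# Toy stand-ins; real modules would be trained networks.
def toy_agent(history: History) -> str:
    return "would you like me to book it for you"


def toy_user(history: History) -> str:
    return "thanks for all your help"


def toy_select(history: History, r_agent: str, r_user: str) -> int:
    # Hypothetical heuristic: reply only when the last message looks like a question.
    return 1 if history[-1].rstrip().endswith("?") else 0


decision_wait = ita_decide(["hello"], toy_agent, toy_user, toy_select)
decision_reply = ita_decide(["hello", "can you book it ?"], toy_agent, toy_user, toy_select)
```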

We adopt several architectures like Bi-GRUs, TextCNNs and BERT as the basis of arbitrator module. We will show how to build an arbitrator by taking TextCNNs as an example.

As shown in Figure 2, three CNNs with the same structure take the inferred responses $R_{agent}$ and $R_{user}$ and the dialogue history $X$ with tags $T$ as input. For each raw word sequence $x_{1},\dots,x_{n}$, we embed each word as a one-hot vector $w_{i}\in\mathbf{R}^{V}$. By looking up a word embedding matrix $E\in\mathbf{R}^{V\times d}$, the input text is represented as an input matrix $Q\in\mathbf{R}^{l\times d}$, where $l$ is the length of the word sequence and $d$ is the dimension of the word embedding features. The matrix is then fed into a convolution layer where a filter $W\in\mathbf{R}^{k\times d}$ is applied:

$$c_{i}=f(W\cdot Q_{i:i+k-1}+b) \tag{6}$$

where $Q_{i:i+k-1}$ is a window of $k$ token representations, the function $f$ is $ReLU$, and $W$ and $b$ are learnt parameters. Applying this filter to the $m$ possible windows $Q_{i:i+k-1}$ yields a feature map:

$$\mathbf{c}=[c_{1},c_{2},\dots,c_{l-k+1}] \tag{7}$$

where $\mathbf{c}\in\mathbf{R}^{l-k+1}$ for $m$ filters. We use $j$ different filter sizes in parallel in the same convolution layer. This means we have $m_{1},m_{2},\dots,m_{j}$ windows at the same time, so formally:

$$\mathbf{C}=[\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{j}] \tag{8}$$

Then we apply a max-over-time pooling operation to capture the most important feature:

$$\hat{C}=\max\{\mathbf{C}\} \tag{9}$$

and thus obtain the final feature map of the input sequence.
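
Equations (6) to (9) can be traced on a toy input with the pure-Python sketch below. It assumes one filter per filter size and reads the pooling of equation (9) as a maximum over each feature map separately; both are simplifying assumptions, and the function names are illustrative.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def conv_feature_map(Q: Matrix, W: Matrix, b: float) -> List[float]:
    """Equations (6)-(7): c_i = ReLU(W . Q[i:i+k] + b) for every window of k rows.

    Assumes the sequence length l is at least the filter height k.
    """
    k = len(W)
    feature_map = []
    for i in range(len(Q) - k + 1):
        window = Q[i:i + k]
        s = sum(w * q
                for w_row, q_row in zip(W, window)
                for w, q in zip(w_row, q_row))
        feature_map.append(max(0.0, s + b))
    return feature_map


def text_cnn_features(Q: Matrix, filters: List[Tuple[Matrix, float]]) -> List[float]:
    """Equations (8)-(9): max-over-time pooling of each filter's feature map."""
    return [max(conv_feature_map(Q, W, b)) for W, b in filters]


# Toy input with l = 4 tokens and d = 2 embedding dimensions.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filter_k2 = ([[1.0, 0.0], [0.0, 1.0]], 0.0)               # k = 2
filter_k3 = ([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], -3.0)  # k = 3
print(conv_feature_map(Q, *filter_k2))              # [2.0, 1.0, 1.0]
print(text_cnn_features(Q, [filter_k2, filter_k3]))  # [2.0, 1.0]
```

The first feature map has $l-k+1=3$ entries, matching equation (7), and pooling keeps the largest activation of each filter.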

We apply the same CNNs to get the feature maps of $X$, $R_{agent}$ and $R_{user}$:

$$\begin{aligned}
\hat{C}_{his}&=TextCNNs(X)\\
\hat{C}_{agent}&=TextCNNs(R_{agent})\\
\hat{C}_{user}&=TextCNNs(R_{user})
\end{aligned} \tag{10}$$

where the function $TextCNNs(\cdot)$ follows equations (6) to (9). We then have two possible dialogue paths, $X$ with $R_{agent}$ and $X$ with $R_{user}$, with representations $D_{agent}$ and $D_{user}$:

$$\begin{aligned}
D_{agent}&=W_{1}[\hat{C}_{his};\hat{C}_{agent}]+b_{1}\\
D_{user}&=W_{2}[\hat{C}_{his};\hat{C}_{user}]+b_{2}
\end{aligned} \tag{11}$$

The arbitrator then calculates the probability of the two possible dialogue paths:

$$\begin{aligned}
D&=W_{3}[D_{agent};D_{user}]+b_{3}\\
P&=\mathrm{Softmax}(W_{4}D+b_{4})
\end{aligned} \tag{12}$$

Through the learnt parameters $W_{4}$ and $b_{4}$, we get a two-dimensional probability distribution $P$, in which the most reasonable response has the maximum probability. This also indicates whether the agent should wait or not.
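
The combination step of equations (11) and (12) amounts to three affine layers followed by a softmax. The sketch below uses hand-picked toy weights and one-dimensional features purely for illustration; the parameter dictionary and the convention that index 0 of $P$ means "wait" and index 1 means "reply" (chosen to match $R^{\prime}$ in equation (5)) are assumptions.

```python
import math
from typing import Dict, List

Vector = List[float]


def affine(W: List[Vector], x: Vector, b: Vector) -> Vector:
    """Compute W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z: Vector) -> Vector:
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def arbitrate(c_his: Vector, c_agent: Vector, c_user: Vector, p: Dict[str, list]) -> Vector:
    """Equations (11)-(12); list concatenation plays the role of [a; b]."""
    d_agent = affine(p["W1"], c_his + c_agent, p["b1"])
    d_user = affine(p["W2"], c_his + c_user, p["b2"])
    d = affine(p["W3"], d_agent + d_user, p["b3"])
    return softmax(affine(p["W4"], d, p["b4"]))


# Hand-picked toy parameters, not learnt values.
params = {
    "W1": [[1.0, 1.0]], "b1": [0.0],
    "W2": [[1.0, 1.0]], "b2": [0.0],
    "W3": [[1.0, 0.0], [0.0, 1.0]], "b3": [0.0, 0.0],
    "W4": [[0.0, 1.0], [1.0, 0.0]], "b4": [0.0, 0.0],
}
P = arbitrate([1.0], [2.0], [0.5], params)
decision = max(range(2), key=lambda idx: P[idx])  # assumed: 0 = wait, 1 = reply
```

With these toy values the agent path scores higher, so the sketch chooses to reply.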

The total loss function of the whole arbitrator module is the negative log likelihood of the probability of choosing the correct action:

$$L=-\sum_{i=1}^{N}\log P(R_{i}^{\prime}=Y_{i}) \tag{13}$$

where $N$ is the number of samples and $Y_{i}$ is the ground truth label of the $i$-th sample.

The arbitrator modules based on Bi-GRU and BERT are implemented similarly to the TextCNNs one.

| Datasets | MultiWoz Train | MultiWoz Valid | MultiWoz Test | DailyDialogue Train | DailyDialogue Valid | DailyDialogue Test |
| --- | --- | --- | --- | --- | --- | --- |
| Vocabulary Size | 2443 | | | 6219 | | |
| Dialogues | 8423 | 1000 | 1000 | 1118 | 1000 | 1000 |
| Avg. Turns/Dialogue | 6.32 | 6.97 | 6.98 | 4.09 | 4.21 | 4.03 |
| Avg. Split User Turns | 1.89 | 1.92 | 1.94 | 2.09 | 2.12 | 2.12 |
| Avg. Utterance Length | 10.54 | 10.7 | 10.56 | 8.71 | 8.54 | 8.75 |
| Avg. Agent’s Utterance | 14.43 | 14.78 | 14.69 | 12.04 | 11.81 | 12.17 |
| Avg. User’s Utterance | 6.18 | 6.28 | 6.17 | 5.91 | 5.87 | 5.96 |
| Agent Wait Samples | 53249 | 6970 | 6983 | 41547 | 3846 | 3689 |
| Agent Reply Samples | 47341 | 6410 | 6573 | 49540 | 4717 | 4510 |

Table 1: Datasets Statistics. Note that the statistics are based on the modified datasets described in Section 5.2. The vocabulary size is reported once per dataset.

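| Imaginator model | Imaginator | MultiWoz Agent | MultiWoz User | MultiWoz Arbitrator | DailyDialogue Agent | DailyDialogue User | DailyDialogue Arbitrator |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline: TextCNNs | | N/A | N/A | 77.68 | N/A | N/A | 75.79 |
| LSTM | Agent Imaginator | 11.77 | 0.80 | | 4.51 | 0.61 | |
| LSTM | User Imaginator | 0.3 | 8.87 | 80.04 | 0.15 | 8.70 | 76.37 |
| LSTM+Attn. | Agent Imaginator | 12.47 | 0.72 | | 19.19 | 0.60 | |
| LSTM+Attn. | User Imaginator | 0.24 | 9.71 | 80.75 | 0.26 | 24.52 | 79.02 |
| LSTM (with GLOVE) + Attn. | Agent Imaginator | 13.37 | 0.67 | | 19.01 | 0.67 | |
| LSTM (with GLOVE) + Attn. | User Imaginator | 0.51 | 10.61 | 80.38 | 0.21 | 24.65 | 78.56 |

Table 2: Generation performance of the different imaginators (Agent and User columns, in BLEU score) and accuracy of the same TextCNNs based arbitrator (Arbitrator columns). Each arbitrator accuracy applies to the agent/user imaginator pair of that model. Better results between imaginators are in BOLD and best results on datasets are in RED.
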
| Dataset | MultiWoz | DailyDialogue |
| --- | --- | --- |
| Random | 51.51 | 55.00 |
| Bi-GRU | 79.12 | 75.23 |
| ITA-GRU | 82.03 | 77.80 |
| TextCNNs | 77.68 | 75.79 |
| ITA-TextCNN | 80.75 | 79.02 |
| BERT | 80.75 | 78.68 |
| ITA-BERT | 82.73 | 79.35 |

Table 3: Accuracy results on the two datasets. Better results between baselines and corresponding ITA models are in BOLD and best results on datasets are in RED. The Random result is the accuracy of a script that makes random decisions.

5 Experimental Setup

5.1 Datasets

As the proposed approach mainly concentrates on human-computer interaction, we select and modify two datasets with very different styles to test the performance of our method. One is the task-oriented dialogue dataset MultiWoz 2.0 and the other is the chitchat dataset DailyDialogue. Both datasets are collected from human-to-human conversations. We evaluate and compare the results with the baseline methods in multiple dimensions. Table 1 shows the statistics of the datasets.

  • MultiWOZ 2.0 Budzianowski et al. (2018). The Multi-Domain Wizard-of-Oz dataset (MultiWOZ) is a fully-labeled collection of human-human written conversations. Compared with previous task-oriented dialogue datasets, e.g. DSTC 2 Henderson et al. (2014) and KVR Eric and Manning (2017), it is a much larger multi-turn conversational corpus that spans several domains and topics.

  • DailyDialogue Li et al. (2017). DailyDialogue is a high-quality multi-turn dialogue dataset, which contains conversations about daily life. Compared to the task-oriented dialogue datasets, the speakers’ behavior is more unpredictable and complex.

5.2 Datasets Modification

Because the task we concentrate on is different from traditional ones, to make the datasets fit our problems and real life, we modify the datasets with the following steps:

| | Example |
| --- | --- |
| Dialogue History | User: actually |
| | User: can you suggest [value_count] of them |
| | User: can i get their contact info as well |
| | Agent: sure , i would suggest the [restaurant_name] at [restaurant_address] . you can reach them at [restaurant_phone] . i could reserve it for you |
| | User: no |
| | User: that ‘s ok |
| | User: i can take it from here |
| Ground Truth | User: thank,you for all your help |
| Imaginator Prediction | Agent imaginator: would you like me to book it for you |
| | User imaginator: thanks for all your help |
| Arbitrator Selection | User imaginator |

Table 4: An example of the imaginators’ generation and the arbitrator’s selection.

  • Drop Slots and Values For task-oriented dialogue, slot labels are important for navigating the system to complete a specific task. However, those labels and the exact values from the ontology files do not essentially benefit our task. So we replace all specific values with a slot placeholder in the preprocessing step.

  • Split Utterances Existing datasets concentrate on the dialogue content and combine multiple sentences into one utterance per turn when gathering the data. In this step, we randomly split each combined utterance into multiple utterances according to the punctuation, using a fixed probability to decide whether the preprocessing program splits a given sentence.

  • Add Turn Tag We add turn tags, subturn tags and role tags to each split and original sentence to (1) label the speaker role and dialogue turns and (2) tag the ground truth for training and testing the supervised baselines and our model.

Finally, we have the modified datasets, which imitate real-life human chatting behavior as shown in Figure 1. Our datasets and code will be released to the public for further research in both academia and industry.
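
The Split Utterances step could be sketched as follows. The punctuation set, the choice to keep the punctuation attached to the preceding piece, the parameter name `split_prob` and the function name are all assumptions, since the text only states that splitting follows the punctuation with a fixed probability.

```python
import random
import re
from typing import List


def split_utterance(text: str, split_prob: float, rng: random.Random) -> List[str]:
    """Randomly split one combined utterance into several messages at punctuation.

    After every piece that ends at a punctuation mark, a new message is
    started with probability split_prob; otherwise the next piece is appended
    to the current message.
    """
    pieces = [p.strip() for p in re.findall(r"[^,.?!]+[,.?!]*", text) if p.strip()]
    if not pieces:
        return []
    messages = [pieces[0]]
    for piece in pieces[1:]:
        if rng.random() < split_prob:
            messages.append(piece)                    # start a new sub-turn
        else:
            messages[-1] = messages[-1] + " " + piece  # stay in the same message
    return messages


rng = random.Random(0)
combined = "no , that 's ok . i can take it from here"
always_split = split_utterance(combined, 1.0, rng)
never_split = split_utterance(combined, 0.0, rng)
print(always_split)  # ['no ,', "that 's ok .", 'i can take it from here']
```

With `split_prob=1.0` every punctuation mark starts a new message, and with `split_prob=0.0` the utterance is left as a single message; intermediate values produce a random mix of the two.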

5.3 Evaluation Method

To compare with the baselines in multiple dimensions and test the model’s performance, we use the overall Bilingual Evaluation Understudy (BLEU) score Papineni et al. (2002) to evaluate the imaginators’ generation performance. For the arbitrator, we use classification accuracy, i.e. the ratio of correctly classified samples among all samples.
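
For reference, the accuracy used for the arbitrator is the plain ratio of correct wait/reply decisions. A minimal sketch, with a hypothetical function name and toy labels:

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of samples whose predicted action equals the ground truth."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example: three of the four decisions match the ground truth.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(score)  # 0.75
```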

5.4 Baselines and Training Setup

The hyper-parameter settings adopted in baselines and our model are the best practice settings for each training set. All models are tested with various hyper-parameter settings to get their best performance. Baseline models are Bidirectional Gated Recurrent Units (Bi-GRUs) Chung et al. (2014), TextCNNs Kim (2014) and BERT Devlin et al. (2018).

6 Experimental Results and Analysis

6.1 Results

In Table 2, we show the generation abilities of different imaginators and their performance with the same TextCNN based arbitrator. Firstly, we gathered the generation results of the agent and user imaginators based on LSTM, LSTM-attention and LSTM-attention with GLOVE pretrained word embeddings. According to the BLEU metric, the latter two models achieve higher but similar results. Secondly, with the arbitrator fixed to the TextCNNs model, the latter two also obtain similar accuracy and significantly outperform the others, including the TextCNNs baseline.

The performance of different arbitrators with the same LSTM-attention imaginators is shown in Table 3. These results can be compared directly with the corresponding baseline models. The imaginators with the BERT based arbitrator achieve the best results on both datasets, and all ITA models beat their baseline models.

We also present an example of how our model runs in Table 4. The imaginators predict the agent’s and the user’s utterances according to the dialogue history, and then the arbitrator selects the user imaginator’s prediction. It is worth noting that the agent imaginator also generates a high-quality sentence if only the generation quality is considered. However, given the dialogue history, it is not a good choice, since its meaning repeats what the agent said in its last turn.

6.2 Analysis

6.2.1 Imaginators Benefit the Performance

Table 2 includes the experimental results on the MultiWOZ dataset. The LSTM based agent imaginator obtains a BLEU score of 11.77 on agent samples, in which the ground truth is the agent’s utterance, and 0.80 on user samples. Meanwhile, the user imaginator obtains a BLEU score of 0.3 on agent samples and 8.87 on user samples. Similar results are observed in the other imaginators’ experiments. Although these comparisons are unfair to some extent, since we do not have the agent’s and the user’s real utterances at the same time and under the same dialogue history, the results show that the imaginators did learn the speaking styles of the agent and the user respectively. So the generation of the suitable imaginator will be more similar to the ground truth, as in the example shown in Table 4, which means that this response is more semantically suitable given the dialogue history.

If we fix the agent and user imaginators to the LSTM-attention model, the arbitrators achieve different performance depending on their architecture, as shown in Table 3. As expected, the ITA models beat their base models by nearly $2\sim3\%$, and the ITA-BERT model beats all other ITA models.

From all these results, we conclude that the imaginators significantly help the arbitrator predict the dialogue interaction behavior by providing the semantic information of the possible future agent and user responses.
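
As a quick arithmetic check of the gains over the base models, the differences can be recomputed from the accuracies reported in Table 3. The snippet only subtracts the reported numbers and introduces no new results; the variable names are illustrative.

```python
# (baseline accuracy, ITA accuracy) pairs copied from Table 3.
table3 = {
    "MultiWoz": {
        "Bi-GRU": (79.12, 82.03),
        "TextCNNs": (77.68, 80.75),
        "BERT": (80.75, 82.73),
    },
    "DailyDialogue": {
        "Bi-GRU": (75.23, 77.80),
        "TextCNNs": (75.79, 79.02),
        "BERT": (78.68, 79.35),
    },
}

# Absolute gain of each ITA model over its baseline, in accuracy points.
gains = {
    dataset: {model: round(ita - base, 2) for model, (base, ita) in rows.items()}
    for dataset, rows in table3.items()
}

for dataset, rows in gains.items():
    print(dataset, rows)
```

On MultiWoz the gains come out at 2.91, 3.07 and 1.98 points; on DailyDialogue they are 2.57, 3.23 and 0.67 points, so the BERT gain on DailyDialogue is smaller than the others.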

6.2.2 Relation of Imaginators and Arbitrator’s Performance

As shown in the DailyDialogue columns of Table 2, the attention mechanism helps in learning the generation task. The LSTM-attention and LSTM-attention-GLOVE based imaginators obtain BLEU scores of more than 19 and 24 on their corresponding targets, while the LSTMs without attention obtain only 4.51 and 8.70. These results also affect the arbitrator. The imaginators with the attention mechanism reach accuracy scores of 79.02 and 78.56, significantly better than the others. The evidence also exists in the results on MultiWoz: all imaginators have similar generation performance, so the arbitrators obtain similar accuracy scores.

From those results, we conclude that there is a positive correlation between the performance of the imaginators and that of the arbitrator. However, some problems remain. It is not easy to evaluate the performance of dialogue generation. In the results on MultiWoz, the LSTM-attention-GLOVE based imaginators obtain slightly higher BLEU scores than the LSTM-attention based ones, but the arbitrator results are the opposite. This may indicate that (1) when the imaginators’ performance is high enough, the arbitrator’s performance becomes stable and (2) the BLEU score does not perfectly reflect the contribution to the arbitrator. We leave these hypotheses to future work.

7 Conclusion

We first address an interaction problem, namely whether the dialogue model should wait for the end of the utterance or reply directly in order to simulate the user’s real-life conversation behavior, and propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to deal with it. Our model introduces imagined possible future semantic information for the prediction. We modified two popular dialogue datasets to fit the real situation. It is reasonable that the additional information helps the arbitrator, even though it is only imagined.

References

  • Budzianowski et al. [2018] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. MultiWOZ - a large-scale multi-domain wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 EMNLP, pages 5016–5026, Brussels, Belgium, October-November 2018. ACL.
  • Chung et al. [2014] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, 2014.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Eric and Manning [2017] Mihail Eric and Christopher D Manning. Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414, 2017.
  • Ghazvininejad et al. [2018] Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. A knowledge-grounded neural conversation model. In Thirty-Second AAAI, 2018.
  • Henderson et al. [2014] Matthew Henderson, Blaise Thomson, and Jason D. Williams. The second dialog state tracking challenge. In Proceedings of the 15th SIGDIAL, pages 263–272, Philadelphia, PA, U.S.A., June 2014. ACL.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hoyle et al. [2019] Alexander Miserlis Hoyle, Lawrence Wolf-Sonkin, Hanna Wallach, Isabelle Augenstein, and Ryan Cotterell. Unsupervised discovery of gendered language through latent-variable modeling. In Proceedings of the 57th ACL, pages 1706–1716, Florence, Italy, July 2019. ACL.
  • Jiang et al. [2018] Mingyang Jiang, Yanchun Liang, Xiaoyue Feng, Xiaojing Fan, Zhili Pei, Yu Xue, and Renchu Guan. Text classification based on deep belief network and softmax regression. Neural Computing and Applications, 29(1):61–70, 2018.
  • Kim [2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 EMNLP, pages 1746–1751, 2014.
  • Kowsari et al. [2017] Kamran Kowsari, Donald E Brown, Mojtaba Heidarysafa, Kiana Jafari Meimandi, Matthew S Gerber, and Laura E Barnes. Hdltex: Hierarchical deep learning for text classification. In 2017 ICMLA, pages 364–371. IEEE, 2017.
  • Li et al. [2017] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the 8th IJCNLP, pages 986–995, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing.
  • Lin et al. [2019] Zehao Lin, Xinjing Huang, Feng Ji, Haiqing Chen, and Yin Zhang. Task-oriented conversation generation using heterogeneous memory networks. In Proceedings of the 2019 EMNLP-IJCNLP, pages 4557–4566, Hong Kong, China, November 2019. ACL.
  • Lipton et al. [2018] Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. Bbq-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In AAAI 2018, February 2018.
  • Luong et al. [2015] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the EMNLP 2015, Lisbon, Portugal, pages 1412–1421, 2015.
  • Madotto et al. [2018] Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. Mem2seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th ACL, pages 1468–1478, 2018.
  • Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th ACL, July 6-12, 2002, Philadelphia, PA, USA., pages 311–318, 2002.
  • Premack [2004] David Premack. Is language the key to human intelligence? Science, 303(5656):318–320, 2004.
  • Song et al. [2019] Zhenqiao Song, Xiaoqing Zheng, Lu Liu, Mu Xu, and Xuanjing Huang. Generating responses with a specific emotion in dialog. In Proceedings of the 57th ACL, pages 3685–3695, Florence, Italy, July 2019. ACL.
  • Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112, 2014.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Vinyals et al. [2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2692–2700, 2015.
  • Wen et al. [2015] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. Semantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745, 2015.
  • Yan et al. [2017] Zhao Yan, Nan Duan, Peng Chen, Ming Zhou, Jianshe Zhou, and Zhoujun Li. Building task-oriented dialogue systems for online shopping. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pages 4618–4626, 2017.