CogIntAc: Modeling the Relationships between Intention, Emotion and Action in Interactive Process from Cognitive Perspective
Abstract
Intention, emotion and action are important psychological factors in human activities, which play an important role in the interaction between individuals. How to model the interaction process between individuals by analyzing the relationships of their intentions, emotions, and actions at the cognitive level is challenging. In this paper, we propose a novel cognitive framework of individual interaction. The core of the framework is that individuals achieve interaction through external actions driven by their inner intentions. Based on this idea, the interactions between individuals can be constructed by establishing relationships between intention, emotion and action. Furthermore, we analyze the interaction between individuals and give a reasonable explanation for the prediction results. To verify the effectiveness of the framework, we reconstruct a dataset and propose three tasks as well as the corresponding baseline models, including action abduction, emotion prediction and action generation. The novel framework shows an interesting perspective on mimicking the mental state of human beings in cognitive science.
Index Terms:
action abduction, emotion prediction, action generation, interaction
I Introduction
In the process of human interaction, analyzing the relationships between IEA (Intention, Emotion and Action) has long been a research topic in Artificial Intelligence [1, 2]. Individual activities are the process of interacting with the external world driven by intention [3, 4]. Specifically, the individual influences the external world via action [5], while the external world influences the individual via emotion [3]. During this process, the individual's inner intention is the essence of the interaction [6, 7]. The whole interactive process is realized and completed through the mutual cooperation of IEA. Analyzing and modeling the relationships between IEA at the cognitive level is of great significance to an in-depth understanding of the essence of human activities. It also benefits various applications, including service satisfaction analysis [8], intelligent agents [9], understanding intention [10], analyzing the emotion of consumers in live conversations [11], and so on.

Previous work [12, 13, 14, 15] mainly focuses on emotion recognition or intention recognition of individuals, but seldom considers these factors across multiple individuals. Although some methods [16, 17] model the interaction between the emotion recognition task and the intention recognition task, they only predict labels from the representation of utterances or a graph attention network and do not take the inherent influence between the IEA of multiple individuals into account. These works also lack analysis and explanation of emotion prediction and emotional causes. RAIN [18] explores the mutual relationships between intention and emotion, but ignores the response generation during the interaction.
To illustrate the process of interaction, an example is shown in Fig. 1 (the focus is on two individuals), where intention and emotional expectation constitute the motives that trigger the action; that is, driven by intention, the action is generated in the direction of the emotional expectation. Emotional expectation is defined as the desired emotion towards achieving one's own goals. For the speaker, the goal is to accomplish his intentions, so the expectation is always positive. For the listener, the goal is to generate the action in the direction of the speaker's emotional reaction, which in turn can be regarded as the emotional expectation of the listener per se. Emotion can be further decomposed into emotional expectation and emotional reaction. When the speaker generates an action, the listener reacts by generating corresponding motives, based on which the speaker will generate the emotional reaction. The above process is defined as an interaction chain (e.g., ① ② ③ ④). Before the speaker responds in the next turn, the listener generates no emotional reaction. For the listener, the intention is influenced by the speaker, and the intention accept or reject is produced. These two intentions correspond to two kinds of emotional expectations and actions, respectively. Namely, the listener's action is driven by the intention in the direction of the speaker's emotional reaction. Finally, the speaker's emotional reaction is determined by whether the intention is satisfied: when satisfied, the speaker's emotion is positive [19]; otherwise, non-positive.
We propose a novel Cognitive framework of individual InterAction (CogIntAc) to model the interaction between individuals. The framework contains three aspects: (1) The mental states [20, 21] of the speaker and listener are linked by actions to realize social interaction [22]. (2) Intention is the essence of action: the action is generated in the direction of the emotional expectation under the drive of intention, and by analyzing and understanding the action, the model can in turn obtain the intention via action abduction [23]. (3) The emotional reaction is determined by whether the action satisfies the intention, namely, it is a combined result of intention (mental state) and action (external world).
Around CogIntAc, we further propose three novel tasks to examine the framework. Considering that there is currently no dataset available, a dataset (see Sec. V-A) is manually created, which contains 2,106 single-turn dialogues. Each instance consists of utterances, intention and emotion labels, and a satisfaction label.
The contributions can be summarized as follows:
• We propose a novel framework, CogIntAc, with the concepts of a complete interaction and an interaction chain in Sec. III-B, taking intention, emotion and action as the basic elements to model the process of human interaction.
• Guided by the relationships between the nodes in the interaction chain, three tasks and corresponding baselines are proposed, including action abduction of the speaker, emotion prediction of the speaker and action generation of the listener.
• We reconstruct a new dataset CogIEA (Intention, Emotion, Action) to verify the effectiveness of the framework.
• Experiments show that with the help of CogIntAc, we can analyze the interaction at a deeper level and give a more reasonable explanation for the analysis results.

II Related Work
Intention Recognition In dialogue systems, intention recognition requires models to predict the intention of the interlocutor. With the development of deep learning, [12] propose a multi-task model for conversation modeling, which is optimized with dialogue act prediction and response selection. [14] introduce a hierarchical encoder with an attention mechanism for intention recognition. Few works [24] consider emotion, and they only take emotion classification as a joint task to boost intention recognition. These methods primarily study individual intention recognition and ignore the influence of emotion on intention in the mental state.
Emotion Recognition Previous studies [25, 26, 27, 28] show that in addition to the content of the response, emotional communication between machine and human also plays an important role. [29] consider the role of inter-speaker dependency relations while classifying emotions. [13] design a network to keep track of the individual party states throughout the conversation and use this information for emotion classification. [16] propose DCR-Net to explicitly consider the cross-impact and model the interaction between the recognition of emotion and intention. [15] incorporate different elements of commonsense for emotion recognition. These methods focus on recognizing an individual's emotion but likewise ignore the interaction between multiple individuals.
Response Generation There is plenty of work studying emotional response generation. The mainstream approach utilizes emotion-related factors, such as emotion vectors, emotional memory, emotion dictionaries, psychological intention, etc., to generate emotional responses. For example, [30] propose three novel ways to incorporate emotional aspects into an LSTM-based Seq2Seq neural conversation generation model. Besides, [31] are the first to propose generating responses that are appropriate not only in content but also in emotion, with three mechanisms. [32] generate meaningful responses with a coherent structure for a post and express the desired emotion within a unified framework. GLHG [33] aims to capture the global cause, local intentions and dialog history, and models the hierarchical relationships between them in emotional support conversation [34]. However, in psychology, the fundamental reason humans act on the external world is that motivation serves as the starting point of behavior. Therefore, we focus on whether the generated response is consistent with the intention and in the direction of the speaker's emotion.
III Analysis of Individual Interaction
The interaction between human beings is a process in which humans interact through the external world driven by their mental states. We first build the Individual Activity Model (IAM), and then introduce the interactive framework CogIntAc between individuals as well as the task definitions.
III-A Individual Activity Model
Human individual activity is a process of interaction with the external world driven by the mental state. In Fig. 2 (a), the mental state includes intention, emotional expectation and emotional reaction. The external world includes environmental factors and language actions; the main focus of this paper is on language actions. The circled numbers ⓝ indicate the information flows. The definitions of these elements are as follows:
Intention As the psychological background to stimulate and guide action, intention is the essence of activity. In this paper, it particularly refers to the interaction intentions [11].
Emotional Expectation The definition of emotional expectation is the desired emotion towards achieving one’s goals, as demonstrated in Sec. I.
Emotional Reaction It is the psychological reaction of an individual’s inner satisfaction with the external world, which is formed by intention (internal causes) and external world (external causes).
Action An individual’s words and deeds that influence the external world driven by his own intention.
According to the correlations among the elements above, the following relationships exist. (1) Intention vs. Action: the action can be predicted from the intention and emotional expectation; the intention can be analyzed by action abduction. (2) Emotional Reaction vs. Intention: the emotional reaction can be predicted from the intention and the external world.
III-B CogIntAc
From the perspective of interaction, we analyze the relationships between intention, emotion and action, and propose CogIntAc based on the IAM in Sec. III-A. We define the concepts of a complete interaction and an interaction chain. The framework is introduced in detail from four aspects.
CogIntAc Composition The speaker and listener interact with each other in CogIntAc. As shown in Fig. 2 (b), each of them is represented as an IAM structure, as introduced in Sec. III-A, including four basic elements. Specifically, the intention classification is expanded into seven categories based on [11]: request, suggest, command, accept, reject, question, inform. The emotional expectation is usually marked as happiness or content for the speaker; for the listener, it is marked as the emotional reaction of the speaker. There are six labels of emotional reaction in CogIEA (i.e., happy, content, neutral, sadness, anger, disgust).
Individual Interaction Process The interaction process is defined as a complete interaction, starting with the speaker’s intention and ending with the speaker’s emotional reaction. Specifically, starting from the intention, the speaker generates action in the direction of his emotional expectation driven by the intention. Then, the listener generates the corresponding motives to respond. Finally, the speaker generates an emotional reaction based on that response.
At the same time, we propose a new concept, the interaction chain (e.g., motives_s → action_s → action_r → emotional reaction_s, where the subscripts s and r indicate the speaker and listener), which is a logical sequential chain between the speaker and listener in CogIntAc, as shown by the gray arrows in Fig. 2 (b). The circled numbers ⓝ indicate the information flows. Each node in the interaction chain has an impact on the final emotional reaction.
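To make the chain concrete, the following minimal Python sketch represents one complete interaction and the nodes of its chain. This is our own illustration; the class and field names (e.g., InteractionChain) are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class IndividualState:
    """Mental-state elements of one participant (IAM structure)."""
    intention: str                             # e.g. "request", "accept", "reject"
    emotional_expectation: str                 # desired emotion toward one's goal
    action: str                                # the uttered sentence
    emotional_reaction: Optional[str] = None   # filled only for the speaker

@dataclass
class InteractionChain:
    """One complete interaction: motives_s -> action_s -> action_r -> emotional reaction_s."""
    speaker: IndividualState
    listener: IndividualState
    satisfied: Optional[bool] = None           # whether action_r satisfies intention_s

    def nodes(self) -> List[str]:
        # Logical sequential chain between speaker and listener.
        return [
            f"motives_s({self.speaker.intention}, {self.speaker.emotional_expectation})",
            f"action_s({self.speaker.action})",
            f"action_r({self.listener.action})",
            f"emotional_reaction_s({self.speaker.emotional_reaction})",
        ]
```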

III-C Relationships Analysis
Since the interaction between individuals develops along the interaction chain, we can reveal the essence of how two individuals complete an interaction, and give a reasonable explanation for the analysis results by analyzing the causes of each node on the interaction chain. According to the different roles of the speaker and listener in the interaction process, we focus on analyzing the relationships between the speaker's intention, emotion and action, and on predicting the listener's action. The subscripts s and r indicate the speaker and listener, respectively.
Action vs. Intention of the speaker As shown in Fig. 3 (a), the gray arrows indicate the following relationships. (1) The action can be predicted when the intention and emotional expectation are known. (2) The intention can be obtained by action abduction. In this paper, considering that the action of the speaker is often observable, the main focus is on the latter, i.e., predicting the intention of the speaker.
Intention vs. Emotion of the speaker As shown in Fig. 3 (b), the gray arrows indicate the relationship: the emotion_s can be inferred when the intention_s and the action_r are known.
Action vs. Motives of the Listener As shown in Fig. 3 (c), the gray arrows indicate the relationship. It consists of two parts: motives inference of the listener and action generation of the listener. The intention_r is influenced by the intention_s, and the emotional expectation_r can be represented with the emotional reaction_s. Then, the action_r is generated in the direction of the emotional expectation_r under the drive of the intention_r.
The action_r in the interaction chain determines the final emotional reaction_s. Meanwhile, the analysis and prediction of the action_r are of great significance for the decision-making of the response strategy.
III-D Task Definition
Three tasks are proposed for relationships analysis in Sec. III-C, including Action Abduction, Emotion Prediction and Action Generation.
Action Abduction Task Description: Given the action_s, the task aims to predict the intention_s.
Task Input/Output: The action_s as input; the intention of the speaker as output.
Emotion Prediction Task Description: Given the action_s and the action_r, the task aims to predict the emotional reaction of the speaker.
Task Input/Output: The action_s and action_r as input; the emotional reaction of the speaker as output.
Action Generation Task Description: Given the action_s and the emotional reaction_s, the task aims to predict the action of the listener.
Task Input/Output: The action_s and emotional reaction_s as input; the action of the listener as output.
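For concreteness, below is a hypothetical CogIEA-style instance together with the input/output mapping of the three tasks; the utterances, labels and field names are invented for illustration only and are not taken from the dataset.

```python
# Hypothetical single-turn instance; field names are our own convention.
example = {
    "action_s": "Could you pass me the menu, please?",   # speaker's utterance
    "action_r": "Sure, here you are.",                   # listener's utterance
    "intention_s": "request",                            # one of the 7 intention labels
    "emotional_reaction_s": "content",                   # one of the 6 emotion labels
    "satisfaction": 1,                                    # 1 if intention_s is satisfied
}

# Task 1 (Action Abduction):   action_s                       -> intention_s
# Task 2 (Emotion Prediction): action_s, action_r             -> emotional_reaction_s (+ satisfaction)
# Task 3 (Action Generation):  action_s, emotional_reaction_s -> action_r
```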

IV Model
We design three components, one for each of the three tasks in Sec. III-D. The model architectures for the tasks are shown in Fig. 4. Before proceeding, the formulation used in this paper is first described. Given a conversation, i.e., a set of two utterances (actions) $u_s$ and $u_r$, each of which consists of a sequence of $T$ words $u = \{w_1, w_2, \ldots, w_T\}$, $y^{int}$ is the corresponding intention label, $y^{emo}$ is the corresponding emotional reaction label of the speaker, and $y^{sat}$ is the corresponding satisfaction label, with $y^{int} \in \{1, \ldots, 7\}$ and $y^{emo} \in \{1, \ldots, 6\}$.
IV-A Action Abduction Module
In this section, an Action Abduction Module is proposed to obtain the intention of the speaker. For the given utterance $u_s$, an encoder (an LSTM model or a pre-trained language model) models the contextual semantic information, using the first hidden state $h_0$ as the representation of the intention of the speaker, which is then fed into a Multilayer Perceptron (MLP) to output a probability distribution, as:

$$p_{enc} = \mathrm{softmax}\big(\mathrm{MLP}(h_0)\big), \quad h_0 = \mathrm{Encoder}(u_s) \tag{1}$$
Intention Dictionary Observing that certain specific words such as ask for, proposal can reflect intention, we extract keywords and construct an Intention Dictionary (IntDic) as the intention knowledge base by a statistical method. The IntDic outputs a probability distribution $p_{dic}$ of the keyword over all the corresponding intentions.
Abductive Classifier Finally, with the Abductive Classifier (which can be regarded as an MLP), the probabilities $p_{enc}$ and $p_{dic}$ are jointly considered to obtain the speaker's intention distribution $p_{int}$. We optimize the average cross entropy loss over all examples.
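The following is a minimal PyTorch sketch of such a module, assuming a RoBERTa encoder and a toy keyword dictionary. The class names, the illustrative dictionary entries and the way the two distributions are combined are our own assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the Action Abduction Module (encoder + IntDic + abductive classifier).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

INTENTIONS = ["request", "suggest", "command", "accept", "reject", "question", "inform"]

# Intention Dictionary (IntDic): keyword -> probability over the 7 intentions.
# In the paper it is built by a statistical method; two entries are shown for illustration.
INT_DIC = {
    "ask for": torch.tensor([0.8, 0.05, 0.05, 0.0, 0.0, 0.05, 0.05]),
    "proposal": torch.tensor([0.1, 0.8, 0.05, 0.0, 0.0, 0.0, 0.05]),
}

class ActionAbduction(nn.Module):
    def __init__(self, plm: str = "roberta-base", n_int: int = len(INTENTIONS)):
        super().__init__()
        self.tok = AutoTokenizer.from_pretrained(plm)
        self.enc = AutoModel.from_pretrained(plm)
        hid = self.enc.config.hidden_size
        self.mlp = nn.Sequential(nn.Linear(hid, hid), nn.Tanh(), nn.Linear(hid, n_int))
        self.abductive = nn.Linear(2 * n_int, n_int)   # jointly weighs encoder and IntDic

    def forward(self, action_s: str) -> torch.Tensor:
        inputs = self.tok(action_s, return_tensors="pt", truncation=True)
        h0 = self.enc(**inputs).last_hidden_state[:, 0]          # first hidden state
        p_enc = torch.softmax(self.mlp(h0), dim=-1)
        p_dic = torch.zeros_like(p_enc)                          # IntDic distribution
        for keyword, dist in INT_DIC.items():
            if keyword in action_s.lower():
                p_dic = dist.unsqueeze(0)
        logits = self.abductive(torch.cat([p_enc, p_dic], dim=-1))
        return logits                                            # trained with cross entropy
```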
IV-B Emotion Prediction Module
For the emotion prediction task, the model predicts the emotional reaction of the speaker. As shown in Fig. 4 (b), we first model the intention of the speaker to obtain the intention vector $v_{int}$. This part has been implemented in Sec. IV-A, as:

$$v_{int} = [\,h_0\,;\,p_{int}\,] \in \mathbb{R}^{d} \tag{2}$$

where $[\cdot\,;\cdot]$ indicates vector concatenation and $d$ represents the dimension of $v_{int}$.
Similarly, the hidden state $h_r$ of the action of the listener is obtained with the encoder module.
Double-head Classifier A Double-head Classifier is designed for predicting the emotion of the speaker and determining whether the intention of the speaker is satisfied. A fusion mechanism [35, 36] is introduced to better fuse the intention of the speaker and the action of the listener:

$$f = \mathrm{Fusion}(v_{int}, h_r) \tag{3}$$
Then, we use two MLPs for multi-task learning, and the average cross entropy loss is optimized with Adam [37]. Furthermore, an interpretation template is constructed to explain the cause of the emotion, namely, the template is used to generate a more reasonable explanation, for example: the speaker's emotion is happy because his intention is satisfied by the listener.
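Below is a minimal sketch of how such a double-head module could look. The exact fusion mechanism of [35, 36] is not reproduced; a simple gated fusion stands in for it, the intention vector and listener hidden state are assumed to share the same dimension, and all module and variable names are our own.

```python
# Minimal sketch of the Emotion Prediction Module with a Double-head Classifier.
import torch
import torch.nn as nn

EMOTIONS = ["happy", "content", "neutral", "sadness", "anger", "disgust"]

class EmotionPrediction(nn.Module):
    def __init__(self, hid: int, n_emo: int = len(EMOTIONS)):
        super().__init__()
        self.gate = nn.Linear(2 * hid, hid)     # illustrative gated fusion, not the paper's exact mechanism
        self.emo_head = nn.Linear(hid, n_emo)   # head 1: emotional reaction of the speaker
        self.sat_head = nn.Linear(hid, 2)       # head 2: is intention_s satisfied?

    def forward(self, v_int: torch.Tensor, h_r: torch.Tensor):
        # v_int: intention vector of the speaker (Sec. IV-A); h_r: hidden state of action_r.
        g = torch.sigmoid(self.gate(torch.cat([v_int, h_r], dim=-1)))
        fused = g * v_int + (1 - g) * h_r
        return self.emo_head(fused), self.sat_head(fused)

def multitask_loss(emo_logits, sat_logits, emo_y, sat_y):
    # Average cross entropy over the two tasks (multi-task learning).
    ce = nn.CrossEntropyLoss()
    return ce(emo_logits, emo_y) + ce(sat_logits, sat_y)

def explain(emotion: str, satisfied: bool) -> str:
    # Interpretation template for the emotion cause, following the example in the text.
    outcome = "satisfied" if satisfied else "not satisfied"
    return f"The speaker's emotion is {emotion} because his intention is {outcome} by the listener."
```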
IV-C Action Generation Module
For the listener, the model in Fig. 4 (c) predicts the action of the listener. Under the guidance of CogIntAc, we first infer the intention of the listener with the Intention Inference Module, because the intention of the listener leads to the action. As for the emotional expectation, it is defined as the emotional reaction of the speaker. Similarly, we construct the template "The emotional expectation of the listener is xxx" to enrich the complete semantic information. Finally, the Generator Module (BART, GPT-2) outputs a response.
Intention Inference Module The module uses an MLP to transform the intention space from the speaker to the listener, as:

$$p^{r}_{int} = \mathrm{MLP}(p^{s}_{int}) \tag{4}$$
| Model | Action Abduction | | | Emotion Prediction | | | Satisfaction Prediction | | |
|---|---|---|---|---|---|---|---|---|---|
| | P | R | F1 | P | R | F1 | P | R | F1 |
| GRU [38] | 46.32 | 41.27 | 43.65 | 42.47 | 41.13 | 41.79 | 72.20 | 72.59 | 72.39 |
| GRU+Attention [39] | 48.17 | 41.53 | 44.60 | 43.55 | 41.85 | 42.68 | 73.16 | 74.55 | 73.84 |
| BERT [40] | 65.28 | 65.73 | 65.50 | 55.14 | 55.49 | 55.31 | 81.77 | 81.56 | 81.66 |
| RoBERTa-base [41] | 68.00 | 67.71 | 67.85 | 57.05 | 57.48 | 57.26 | 83.05 | 82.87 | 82.95 |
| RoBERTa-large [41] | 71.88 | 70.63 | 71.25 | 58.68 | 59.05 | 58.85 | 85.94 | 85.92 | 85.93 |
| +IntDic | 73.25 | 72.19 | 72.71 | 60.81 | 61.53 | 61.16 | 88.10 | 88.18 | 88.14 |
| +Fusion Mechanism | - | - | - | 58.37 | 61.41 | 59.85 | 88.84 | 88.64 | 88.74 |
| +Multi-task | - | - | - | 59.79 | 59.84 | 59.81 | 87.88 | 88.33 | 88.11 |
| +All | 73.25 | 72.19 | 72.71 | 64.80 | 62.21 | 63.47 | 89.69 | 89.76 | 89.72 |
| Human Performance | 91.42 | 89.58 | 90.49 | 87.46 | 86.15 | 86.80 | 93.65 | 93.57 | 93.60 |

Generator Module The generator module generates the response by using a combination of the intention $p^{r}_{int}$, the emotional expectation template sentence $t$ and the action of the speaker $u_s$, as:

$$u_r = \mathrm{Generator}\big([\,p^{r}_{int}\,;\,t\,;\,u_s\,]\big) \tag{5}$$
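The module can be sketched as follows: an MLP infers the listener's intention, the emotional-expectation template is built from the speaker's emotional reaction, and a seq2seq PLM generates the response. The prompt format, the `facebook/bart-base` checkpoint and all function names are our own assumptions for illustration.

```python
# Minimal sketch of the Action Generation Module.
import torch
import torch.nn as nn
from transformers import BartTokenizer, BartForConditionalGeneration

INTENTIONS = ["request", "suggest", "command", "accept", "reject", "question", "inform"]

class IntentionInference(nn.Module):
    """MLP that transforms the intention space from speaker to listener (Eq. 4)."""
    def __init__(self, n_int: int = len(INTENTIONS)):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_int, n_int), nn.ReLU(), nn.Linear(n_int, n_int))

    def forward(self, p_int_s: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.mlp(p_int_s), dim=-1)

def generate_action_r(action_s: str, intention_r: str, emotional_expectation: str) -> str:
    tok = BartTokenizer.from_pretrained("facebook/bart-base")
    gen = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
    # Combine the inferred intention, the emotional-expectation template sentence
    # and the speaker's action as the generator input (Eq. 5).
    prompt = (f"The intention of the listener is {intention_r}. "
              f"The emotional expectation of the listener is {emotional_expectation}. "
              f"{action_s}")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = gen.generate(ids, max_length=40, num_beams=4)
    return tok.decode(out[0], skip_special_tokens=True)
```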
V Experiment
In this section, we present our CogIEA dataset and some baselines for the three tasks with the results.
V-A Data Collection
Basic Statistics The dialogue dataset CogIEA is collected from DailyDialog [11] and IEMOCAP [42] and contains 2,106 single-turn conversations. For the annotation, six emotion labels and seven intention labels are defined, as described in Sec. III-B. For example, utterances containing obvious request words, such as would like, ask for and so on, are annotated as request; utterances which express yes or no are annotated as accept or reject. Besides, a satisfaction label is added to indicate whether the action of the listener satisfies the intention of the speaker. Fig. 5 (c) shows the detailed statistics of the final dataset. We also count the distributions of emotion and intention to give a brief view of the dataset, as shown in Fig. 5 (a)(b).
Annotation Criteria The annotators first determine the intentions of the speaker and the listener according to the dialogue. We expand the intention classification into seven categories based on [11]: request, suggest, command, accept, reject, question, inform. As for emotion, there are six emotional reaction labels in CogIEA (i.e., happy, content, neutral, sadness, anger, disgust). The annotators are then asked to identify whether the speaker's intention is satisfied by the listener's action. If the intention is satisfied, they annotate a positive emotion; otherwise, a neutral or negative emotion. We assign each example to three annotators. When the results of two of them are consistent, the data is retained; otherwise, an expert is invoked to make the final decision, ensuring the quality of data annotation. We removed the examples where no emotion or intention was selected. Each example is paid $0.1.
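The retention rule described above can be summarized by the following sketch, assuming each example carries three annotators' labels; the function and argument names are our own.

```python
# Sketch of the annotation-retention rule: keep a label when two of three annotators
# agree, otherwise defer to an expert's final decision.
from collections import Counter
from typing import List, Optional

def resolve_label(annotations: List[str], expert_label: Optional[str] = None) -> Optional[str]:
    label, count = Counter(annotations).most_common(1)[0]
    if count >= 2:
        return label
    return expert_label   # expert decides; examples with no label selected are removed
```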
V-B Experimental Setting
We separate the CogIEA dataset into training (80%), validation (10%) and test (10%) sets. We train our models on the training set, select the best-performing model on the validation set, and then report results on the test set. We set word embeddings to a size of 300 and initialize them with GloVe embeddings [43]. In addition, we perform a grid search over the hyper-parameter settings (with a learning rate from {0.01, 0.4} for GRU or {1e-5, 3e-5} for PLMs, a batch size from {16, 32}, and epochs from {3, 20}). Models are trained to minimize the cross entropy with Adam [37]. As for the evaluation metrics, we use Precision (P), Recall (R) and F1 for tasks one and two. Automatic evaluation (BLEU, ROUGE) [44, 45] and human evaluation [46] are used for task three.
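The split and grid search can be sketched as follows; the random seed and helper names are our own assumptions, not part of the paper's released code.

```python
# Sketch of the 80/10/10 split and the PLM hyper-parameter grid described above.
import itertools
import random

def split_cogiea(examples, seed=42):
    random.Random(seed).shuffle(examples)
    n = len(examples)
    train = examples[: int(0.8 * n)]
    val = examples[int(0.8 * n): int(0.9 * n)]
    test = examples[int(0.9 * n):]
    return train, val, test

# Grid search over the hyper-parameter settings for PLM-based models.
grid = itertools.product([1e-5, 3e-5],   # learning rate
                         [16, 32],       # batch size
                         [3, 20])        # epochs
for lr, batch_size, epochs in grid:
    pass  # train on the training set, select the best model on the validation set
```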
In the following, we provide some baselines and proposed modules for the three tasks, respectively. (The data is available at: https://github.com/pengwei-iie/CogIntAc)
GRU [38], where a GRU model is first used to construct the utterance representations, followed by an MLP and a softmax layer. The RNN is a 2-layer GRU with 128 hidden units.
GRU+Attention [39], where a GRU model with self-attention models the semantic information of the utterance.
BERT [40], where the BERT is used as the contextual encoder, then with a MLP and softmax layer.
RoBERTa [41], which is pre-trained with more data and a larger model for better performance. We introduce it as a stronger baseline.
BART [47], a denoising autoencoder for pretraining sequence-to-sequence models. We consider it and GPT-2 as the generator module.
GPT-2 [48], which is a 1.5B-parameter Transformer, used for the generation task. The PLMs have the same hyperparameters as given in the papers [40, 41].

| Model | B-1 | B-2 | B-4 | R-1 | R-2 | R-L | Coherent | Consistent | Intention | Emotion |
|---|---|---|---|---|---|---|---|---|---|---|
| BART-large∗ [47] | 23.55 | 2.61 | 0.16 | 14.47 | 2.55 | 13.30 | 2.61 | 2.44 | 2.07 | 1.96 |
| GPT-2-large∗ [48] | 25.91 | 7.55 | 2.73 | 7.05 | 1.37 | 7.12 | 2.63 | 2.51 | 2.46 | 2.27 |
| BART-large (ours) | 27.10 | 4.89 | 0.36 | 16.31 | 3.87 | 15.21 | 2.67 | 2.53 | 2.33 | 2.30 |
| GPT-2-large (ours) | 25.91 | 8.67 | 3.08 | 8.87 | 2.07 | 8.66 | 2.71 | 2.76 | 2.68 | 2.53 |
V-C Experimental Results
For thorough comparisons, we implement some baselines to the three tasks, including Action Abduction, Emotion Prediction, Action Generation.
Action Abduction of Speaker The Action Abduction Module, depicted in Fig. 4 (a), aims to obtain the intention of the speaker. As shown in the first column of Table I, among all baselines, RoBERTa performs the best. After utilizing the IntDic, the performance improves further. This shows that the prior statistical knowledge is effective for the action abduction task of predicting the intention.
Emotion Prediction of Speaker The results are shown in the second and third columns of Table I. In addition to IntDic, the fusion mechanism and multi-task learning also bring improvements, and the model achieves the best results with all the components together. Another interesting point is that the satisfaction prediction task scores highest because the model only needs to determine whether the action meets the intention; moreover, when we add joint training for the tasks, we find that the other tasks are also improved.
Action Generation of Listener The Action Generation Module generates the response of the listener. We ask testers to rate responses on the following criteria: (1) Coherent: is the response on topic and does it strongly acknowledge the speaker; (2) Consistent: does the response make logical sense given the context; (3) Follows intention: does the response contain a reasonable intention; and (4) Follows emotion: does the response satisfy the speaker's emotional label. The results in Table II demonstrate that (1) compared with GPT-2-large∗ and BART-large∗, our models achieve improvements in both automatic and human evaluation; (2) the GPT-2 models achieve higher ratings on the human evaluation quality metrics, their logic and coherence are relatively good, and their expressions are more diversified.
V-D Qualitative Analysis
We present dialogue cases and explanations of the emotion to demonstrate how our model performs on these tasks. As shown in Fig. 6, in the first case the speaker requests a burger meal and the intention of the listener is question. The baseline model, which takes only the context as input, generates a sentence with a logical error: I'd like a double ... Our model produces a reasonable response which is consistent with the intention question, and the response also satisfies the speaker's emotional label. In the second case, the baseline outputs a reasonable utterance, but it satisfies neither the follows intention nor the follows emotion label; ours generates a better one. As for the explanation of the emotion, we use the generative template as demonstrated in Sec. IV-B.
VI Conclusion
In this paper, we present the novel Cognitive framework of individual InterAction (CogIntAc) with the concepts of a complete interaction and an interaction chain from the perspective of psychology. Guided by the framework, we establish the interactive relationships by understanding and analyzing the relationships between the elements in CogIntAc. Furthermore, we reconstruct a dataset CogIEA and introduce three tasks as well as the corresponding baseline models. Experimental results and qualitative analysis show that interactive actions can be predicted by analyzing the nodes in the interaction chain, and the analysis results can be explained reasonably. In future work, we intend to extend the framework to multi-turn dialogues for further analysis.
Acknowledgment
We thank all anonymous reviewers for their constructive comments. This work is supported by the National Natural Science Foundation of China (No. U21B2009).
References
- [1] R. W. Picard, Affective computing. MIT Press, 1997.
- [2] A. H. Maslow, A theory of human motivation. Simon and Schuster, 2013.
- [3] J. J. Campos, R. G. Campos, and K. C. Barrett, “Emergent themes in the study of emotional development and emotion regulation.” Developmental Psychology, 1989.
- [4] J. Reeve, Understanding motivation and emotion. John Wiley & Sons, 2014.
- [5] E. A. Minton, Belief systems, religion, and behavioral economics: Marketing in multicultural environments. Business Expert Press, 2013.
- [6] M. Bratman et al., Intention, plans, and practical reason. Harvard University Press Cambridge, MA, 1987.
- [7] P. A. Gable and E. Harmon-Jones, “Approach-motivated positive affect reduces breadth of attention,” Psychological Science, 2010.
- [8] K. V. Montfort, E. Masurel, and I. V. Rijn, “Service satisfaction: An empirical analysis of consumer satisfaction in financial services,” Service Industries Journal, 2000.
- [9] L. Padgham and M. Winikoff, Developing intelligent agent systems: A practical guide. John Wiley & Sons, 2005.
- [10] Y. Sun, Y. Shan, C. Tang, Y. Hu, Y. Dai, J. Yu, J. Sun, F. Huang, and L. Si, “Unsupervised learning of deterministic dialogue structure with edge-enhanced graph auto-encoder,” in AAAI. AAAI Press, 2021, pp. 13 869–13 877. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/17634
- [11] Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu, “Dailydialog: A manually labelled multi-turn dialogue dataset,” in IJCNLP, 2017.
- [12] H. Kumar, A. Agarwal, and S. Joshi, “A practical dialogue-act-driven conversation model for multi-turn response selection,” in EMNLP-IJCNLP, 2019.
- [13] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. F. Gelbukh, and E. Cambria, “Dialoguernn: An attentive RNN for emotion detection in conversations,” in AAAI, 2019.
- [14] P. Colombo, E. Chapuis, M. Manica, E. Vignon, G. Varni, and C. Clavel, “Guiding attention in sequence-to-sequence models for dialogue act prediction,” in AAAI, 2020.
- [15] D. Ghosal, N. Majumder, A. F. Gelbukh, R. Mihalcea, and S. Poria, “COSMIC: commonsense knowledge for emotion identification in conversations,” in EMNLP, 2020.
- [16] L. Qin, W. Che, Y. Li, M. Ni, and T. Liu, “Dcr-net: A deep co-interactive relation network for joint dialog act recognition and sentiment classification,” in AAAI, 2020.
- [17] L. Qin, Z. Li, W. Che, M. Ni, and T. Liu, “Co-gat: A co-interactive graph attention network for joint dialog act recognition and sentiment classification,” CoRR, vol. abs/2012.13260, 2020. [Online]. Available: https://arxiv.org/abs/2012.13260
- [18] W. Peng, Y. Hu, L. Xing, Y. Xie, X. Zhang, and Y. Sun, “Modeling intention, emotion and external world in dialogue systems,” in ICASSP, 2022, pp. 7042–7046.
- [19] M. Standage, J. L. Duda, and A. M. Pensgaard, “The effect of competitive outcome and task-involving, ego-involving, and cooperative structures on the psychological well-being of individuals engaged in a co-ordination task: A self-determination approach,” Motivation and emotion, 2005.
- [20] H. Rashkin, A. Bosselut, M. Sap, K. Knight, and Y. Choi, “Modeling naive psychology of characters in simple commonsense stories,” in ACL, 2018.
- [21] H. Rashkin, M. Sap, E. Allaway, N. A. Smith, and Y. Choi, “Event2mind: Commonsense inference on events, intents, and reactions,” in ACL, 2018.
- [22] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi, “Social iqa: Commonsense reasoning about social interactions,” in EMNLP-IJCNLP, 2019.
- [23] C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, W. Yih, and Y. Choi, “Abductive commonsense reasoning,” in ICLR, 2020.
- [24] T. Saha, A. Patra, S. Saha, and P. Bhattacharyya, “Towards emotion-aided multi-modal dialogue act classification,” in ACL, 2020.
- [25] D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, and R. Zimmermann, “ICON: interactive conversational memory network for multimodal emotion detection,” in EMNLP, 2018.
- [26] P. Colombo, W. Witon, A. Modi, J. Kennedy, and M. Kapadia, “Affect-driven dialog generation,” in NAACL-HLT, 2019.
- [27] X. Zhou and W. Y. Wang, “Mojitalk: Generating emotional responses at scale,” in ACL, 2018.
- [28] P. Zhong, D. Wang, and C. Miao, “An affect-rich neural conversational model with biased attention and weighted cross-entropy loss,” in AAAI, 2019.
- [29] D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L. Morency, and R. Zimmermann, “Conversational memory network for emotion recognition in dyadic dialogue videos,” in NAACL-HLT, 2018.
- [30] N. Asghar, P. Poupart, J. Hoey, X. Jiang, and L. Mou, “Affective neural response generation,” in European Conference on Information Retrieval, 2018.
- [31] H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu, “Emotional chatting machine: Emotional conversation generation with internal and external memory,” in AAAI, 2018.
- [32] Z. Song, X. Zheng, L. Liu, M. Xu, and X. Huang, “Generating responses with a specific emotion in dialog,” in ACL, 2019.
- [33] W. Peng, Y. Hu, L. Xing, Y. Xie, Y. Sun, and Y. Li, “Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation,” CoRR, vol. abs/2204.12749, 2022.
- [34] S. Liu, C. Zheng, O. Demasi, Z. Yu, Y. Jiang, M. Huang, and et al., “Towards emotional support dialog systems,” in ACL/IJCNLP, 2021.
- [35] W. Wang, C. Wu, and M. Yan, “Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering,” in ACL, 2018.
- [36] W. Peng, Y. Hu, J. Yu, L. Xing, and Y. Xie, “APER: adaptive evidence-driven reasoning network for machine reading comprehension with unanswerable questions,” Knowl. Based Syst., vol. 229, p. 107364, 2021. [Online]. Available: https://doi.org/10.1016/j.knosys.2021.107364
- [37] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
- [38] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” CoRR, vol. abs/1412.3555, 2014. [Online]. Available: http://arxiv.org/abs/1412.3555
- [39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” ArXiv, 2017.
- [40] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT, 2019.
- [41] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” CoRR, 2019.
- [42] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: interactive emotional dyadic motion capture database,” Lang. Resour. Evaluation, 2008.
- [43] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in EMNLP, 2014.
- [44] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries acl,” in Proceedings of Workshop on Text Summarization Branches Out Post Conference Workshop of ACL, 2004.
- [45] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in ACL, 2002.
- [46] P. Gupta, J. P. Bigham, Y. Tsvetkov, and A. Pavel, “Controlling dialogue generation with semantic exemplars,” CoRR, vol. abs/2008.09075, 2020. [Online]. Available: https://arxiv.org/abs/2008.09075
- [47] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in ACL, 2020.
- [48] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” in OpenAI Blog, 2019.