Discourse-Aware Emotion Cause Extraction in Conversations

Dexin Kong1, Nan Yu1, Yun Yuan1, Guohong Fu1,2 , Chen Gong1,2
1School of Computer Science and Technology, Soochow University, China
2Institute of Artificial Intelligence, Soochow University, China
[email protected];
{nyu,yyuanwind}@stu.suda.edu.cn;
{ghfu, gongchen18}@suda.edu.cn
 Corresponding author.
Abstract

Emotion Cause Extraction in Conversations (ECEC) aims to extract the utterances that contain the cause of an emotion in a conversation. Most prior research focuses on modelling conversational contexts with sequential encoding, ignoring the informative interactions between utterances and conversation-specific features for ECEC. In this paper, we investigate the importance of discourse structures in handling utterance interactions and conversation-specific features for ECEC. To this end, we propose a discourse-aware model (DAM) for this task. Concretely, we jointly model ECEC with discourse parsing using a multi-task learning (MTL) framework and explicitly encode discourse structures via a gated graph neural network (gated GNN), integrating rich utterance interaction information into our model. In addition, we use the gated GNN to further enhance our ECEC model with conversation-specific features. Results on the benchmark corpus show that DAM outperforms the state-of-the-art (SOTA) systems in the literature. This suggests that the discourse structure may contain a potential link between emotional utterances and their corresponding cause expressions. It also verifies the effectiveness of conversation-specific features. The code for this paper will be available on GitHub (http://github.com/).

1 Introduction

Emotion cause extraction in conversations (ECEC) is an important task in conversation analysis. It has received increasing attention Poria et al. (2021); Wang et al. (2021b) with the deluge of open conversational data on social media platforms. Similar to text-level emotion cause extraction (ECE) Lee et al. (2010), this task aims to identify the utterances that explain the cause of an emotion in a given conversation. The right part of Fig. 1 shows an example of ECEC. The emotion of the target utterance “That sounds wonderful. Will there be anyone there that I know?” is “happiness”. Its cause spans are “inviting you to a dinner party”, “it would be fun”, and “It will give my wife a chance to dress up”. Explicit emotion causes of this kind could be helpful for many downstream tasks, such as opinion mining Choi et al. (2005); Das and Bandyopadhyay (2010).

Figure 1: An example of ECEC. Left arcs are predicted discourse structures. Right arcs are emotion-cause pair annotations.

ECE has been investigated intensively since early research Lee et al. (2010); Chen et al. (2010). It can be treated as a classification task that takes the contexts of a document as input and determines whether each clause is a cause. Recently, several neural models for ECE have been proposed Ding and Kejriwal (2020); Li et al. (2021c); Hu et al. (2021a), using pre-trained language models such as BERT Devlin et al. (2018) and RoBERTa Liu et al. (2019) to represent the plain texts. Compared with text-level ECE, research on emotion cause extraction in conversations is at a preliminary stage. Poria et al. (2021) propose this task for the first time and apply neural approaches from text-level emotion cause extraction Wei et al. (2020); Ding et al. (2020a, b).

The approaches above treat conversational contexts as ordinary plain text and use a Transformer-based encoder Liu et al. (2019) to learn the conversational context representation for the ECEC task Poria et al. (2021). They ignore the interaction information between utterances and conversation-specific features. This problem can become serious when performing the ECEC task on long conversations.

The discourse structures in a conversation represent the interactive relationships between utterances. Intuitively, these structures contain the potential links between emotional utterances and their corresponding cause expressions. As shown by the solid arcs on the left in Fig. 1, most of the emotion-cause pairs are linked with the target utterance by discourse relations. In addition, several findings of previous studies on ECE support our point of view. For instance, Hu et al. (2021b) and Ding et al. (2019) argue that the interaction features between sentences could be useful information for the ECE task.

In this paper, we propose a discourse-aware model for ECEC. Concretely, we model ECEC and discourse parsing in conversations Afantenos et al. (2015) jointly. The two tasks use a shared pre-trained language model (PLM) to represent conversational contexts. The discourse parsing task can integrate rich utterance interaction information into the shared PLM. Besides, we use a gated graph neural network (gated GNN) Li et al. (2016) to explicitly encode the discourse structures generated by the discourse parser. In addition, we follow Wang et al. (2021a) in exploiting a gated GNN to further integrate conversation-specific features, such as the relative utterance distance and speakers, into our model.

We conduct the experiments on the standard benchmark dataset of the ECEC task to verify our DAM model. Experiments show that the utterance interaction features are effective for this task. When the conversation-specific features are integrated by gated GNN, the proposed model is able to obtain further improvements.

To sum up, we make three main contributions as follows:

  • We propose a discourse-aware model for ECEC named DAM using multi-task learning and a gated GNN, which is able to integrate rich utterance interaction features for ECEC.

  • We further exploit a gated GNN to capture conversation-specific features such as the relative utterance distance and speakers for ECEC.

  • We advance the performance of the SOTA models for ECEC.

We organize the rest of this paper as follows. Section 2 introduces the related work. Section 3 introduces the proposed model, including the multi-task learning framework and the gated GNN module. Sections 4 and 5 describe our experiments on a benchmark dataset, verifying the effectiveness of our proposed approach. Finally, we conclude in Section 6.

2 Related Work

Prior research on emotion cause extraction can be divided into two categories according to the source text type: text-level emotion cause extraction (ECE) and emotion cause extraction in conversation (ECEC) Poria et al. (2021). For the ECE task, early research adopts linguistic rules Lee et al. (2010); Chen et al. (2010); Russo et al. (2011) and traditional machine learning Gui et al. (2014, 2016, 2018); Xu et al. (2017). In recent years, several neural network models have been introduced to the ECE task at different granularities, such as clause-level Diao et al. (2020); Ding and Kejriwal (2020); Hu et al. (2021a) and span-level Li et al. (2021c, a); Qian et al. (2021); Turcan et al. (2021); Li et al. (2021b). Beyond text-level emotion cause extraction, Poria et al. (2021) introduce a new conversational dataset, RECCON, together with Transformer-based baseline models.

Many researchers have recognized that the ECE task is connected to the emotion recognition task. To make full use of the interaction between tasks, they use a multi-task framework to carry out the ECE task Chen et al. (2018); Wu et al. (2020). In addition, some researchers take causal reasoning as an auxiliary task Fan et al. (2020); Turcan et al. (2021) to enhance the reasoning ability of the model. However, most of these methods focus only on the interaction between tasks without considering the dependency between clauses or utterances, which can easily cause long-distance information loss. Chen et al. (2020); Hu et al. (2021b) use GCN and other methods to capture the dependency between clauses. Although these ECE methods recognize the importance of interaction information between clauses, they only use simple structural information such as distance to model the dependency between clauses. We use discourse structures to make full use of the interaction information between utterances, which can be beneficial for encoding long-distance dependencies.

Most existing ECEC works do not consider the influence of the discourse relations between utterances or of conversation-specific features. Intuitively, utterance interaction information such as discourse structures is promising for the ECEC task. In this paper, we employ discourse parsing as an auxiliary task in a multi-task learning framework He et al. (2021); Fan et al. (2021) to capture utterance interaction information. To model conversation-specific features, we follow Wang et al. (2021a); Li et al. (2016) and utilize a gated graph neural network (gated GNN). We also use the gated GNN to further enhance the discourse structure by encoding a discourse graph. Finally, a gate-controlled mechanism is applied to alleviate the error propagation problem from predicted discourse structures.

3 Our Proposed Model

Figure 2: Framework of our model.

In this section, we present the proposed discourse-aware model (DAM) for ECEC. As shown in Fig. 2, we use an MTL framework to model the ECEC task and discourse parsing jointly. It can integrate rich utterance interaction information into our conversational context representations. We use hard parameter sharing to combine these two tasks, and use RoBERTa Liu et al. (2019) to obtain the shared conversational context representations.

To further enhance utterance interaction information, we encode conversational discourse structures with a gated graph neural network (gated GNN). In addition, we use the gated GNN to encode conversation-specific features such as speakers and relative utterance distance. Concretely, we use a gate-controlled mechanism to alleviate the error propagation problem caused by predicted discourse relations. After that, we concatenate the resulting representations with the global encoding information and pass them through a linear layer to obtain the classification result of the ECEC task.

As for the auxiliary discourse parsing task, we use a bi-directional GRU (BiGRU) Cho et al. (2014) to obtain high-level hidden representations. We then feed these high-level hidden representations to the link predictor and the relation classifier to obtain the corresponding discourse relations.

3.1 Multi-Task Learning Framework

3.1.1 ECEC Model

Given a conversation with $n$ utterances $\{u_{1},u_{2},...,u_{n}\}$, for each target utterance $u_{t}$ with emotion $E_{t}$ and its historical utterance set $H(u_{t})$, the ECEC task aims to extract the cause utterance set $C(u_{t})$ of the emotion expressed by $u_{t}$. Therefore, this task can be regarded as a triple classification Poria et al. (2021). For each utterance $u_{i}$ in $H(u_{t})$, the triple $(u_{t},u_{i},H(u_{t}))$ is taken as a positive case if $u_{i}\in C(u_{t})$. Otherwise, it is classified as a negative case.
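The triple construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_triples`, the toy conversation, and the choice to let $H(u_{t})$ include $u_{t}$ itself are assumptions, not details stated in the paper.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, List[str]]


def build_triples(utterances: List[str], t: int, causes: Set[int]) -> List[Tuple[Triple, int]]:
    """Pair every u_i in H(u_t) with the target u_t and a binary label."""
    # Assumption: H(u_t) holds every utterance up to and including u_t.
    history = utterances[: t + 1]
    target = utterances[t]
    examples = []
    for i, u_i in enumerate(history):
        label = 1 if i in causes else 0  # positive iff u_i is in C(u_t)
        examples.append(((target, u_i, history), label))
    return examples


# Hypothetical three-utterance conversation; utterance 0 is the annotated cause.
conversation = ["I passed the exam!", "Congratulations!", "Thanks, I am so happy."]
triples = build_triples(conversation, t=2, causes={0})
labels = [label for _, label in triples]
```

Each target utterance thus yields one classification example per utterance in its history, which is why negative examples greatly outnumber positive ones.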

Utterance Representations

To compare with the recent Transformer-based model Poria et al. (2021), we also prepend a [CLS] token and the emotion label [$E_{t}$] of the target utterance to the input. We concatenate the elements of the triple as extra inputs, as shown in Fig. 2. Before building the graph of the gated GNN, we preprocess the input of the ECEC task at the select and pooler layer. We extract the hidden state $\mathbf{v}_{cls}$ of the [CLS] token as well as the hidden states of all tokens belonging to $H(u_{t})$ for later use. The hidden states of the tokens belonging to the same utterance are represented as $\{\mathbf{v}^{i}_{1},\mathbf{v}^{i}_{2},...,\mathbf{v}^{i}_{m_{i}}\}$, where $i$ denotes the $i^{th}$ utterance in $H(u_{t})$ and $m_{i}$ is the number of tokens in the $i^{th}$ utterance. For convenience and simplicity, these hidden states are summed to form the utterance-level representation: $\mathbf{v}^{i}=\sum\limits_{j=1}^{m_{i}}\mathbf{v}^{i}_{j}$.
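As a minimal sketch of this pooling step, the following plain-Python function sums the token hidden states of one utterance. The name `sum_pool` and the toy three-dimensional vectors are hypothetical; in the model the vectors would be RoBERTa hidden states.

```python
from typing import List

Vector = List[float]


def sum_pool(token_states: List[Vector]) -> Vector:
    """Sum the token hidden states of one utterance into a single vector."""
    if not token_states:
        raise ValueError("an utterance needs at least one token")
    dim = len(token_states[0])
    pooled = [0.0] * dim
    for state in token_states:
        for k in range(dim):
            pooled[k] += state[k]
    return pooled


# Two toy token vectors standing in for the hidden states of one utterance.
tokens_u1 = [[1.0, 0.5, -1.0], [2.0, 0.5, 1.0]]
v_1 = sum_pool(tokens_u1)
```

Summation keeps the representation size independent of the utterance length, at the cost of making longer utterances produce vectors of larger norm.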

ECEC Encoder

In the ECEC Encoder, we build a fully connected graph according to the discourse relations, speaker, and relative utterance distance. Then, we use the gated GNN layer to encode inter-utterance representations and output hidden states $\mathbf{h}_{i}$. See Section 3.2 for more details.

ECEC Classification

After obtaining the output hidden states $\mathbf{h}_{i}$ of the encoder, we apply a linear layer followed by the softmax function to get the probability distribution $\mathbf{P}_{i}$ and generate the prediction $\widehat{y}_{i}$ as follows:

$$\mathbf{P}_{i}=\mathrm{softmax}(\mathbf{W}_{s}\mathbf{h}_{i}+\mathbf{b}_{s}) \tag{1}$$
$$\widehat{y}_{i}=\mathop{\arg\max}(\mathbf{P}_{i}) \tag{2}$$

where $\mathbf{W}_{s}$ and $\mathbf{b}_{s}$ are the weight matrix and bias of the linear layer before the softmax.
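Equations (1) and (2) can be illustrated with a small, dependency-free sketch. The weights and the three-dimensional hidden state below are made-up toy values, and the helper names `softmax` and `classify` are assumptions for illustration only.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h: Vector, W: Matrix, b: Vector) -> Tuple[Vector, int]:
    """Eq. (1): P = softmax(W h + b); Eq. (2): y_hat = argmax(P)."""
    logits = [sum(w * x for w, x in zip(row, h)) + b_z for row, b_z in zip(W, b)]
    probs = softmax(logits)
    y_hat = max(range(len(probs)), key=probs.__getitem__)
    return probs, y_hat


# Toy values: two classes (negative = 0, positive = 1), hidden size 3.
h_i = [1.0, -2.0, 0.5]
W_s = [[0.2, 0.1, 0.0], [0.5, -0.3, 0.4]]
b_s = [0.0, 0.1]
P_i, y_hat = classify(h_i, W_s, b_s)
```

With these toy values the second logit is the larger one, so the example is classified as positive.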

Loss Function

To train the model, we compute the negative log-likelihood of the training data as the loss of the main task, $L_{e}$:

$$L_{e}=-\sum_{v\in y_{V}}\sum_{z=1}^{Z}\mathbf{Y}_{vz}\log\mathbf{P}_{vz} \tag{3}$$

where $y_{V}$ is the set of gold labels and $\mathbf{Y}$ is the label indicator matrix.
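Because $\mathbf{Y}$ is an indicator matrix, the inner sum of Eq. (3) simply selects the log-probability of the gold class. A minimal sketch, with a hypothetical function name and toy probabilities:

```python
import math
from typing import List


def nll_loss(probs: List[List[float]], gold: List[int]) -> float:
    """Eq. (3): sum of -log P[v][gold[v]]; the one-hot Y selects the gold class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, gold))


# Two toy examples with predicted distributions over {negative, positive}.
P = [[0.9, 0.1], [0.2, 0.8]]
gold = [0, 1]
L_e = nll_loss(P, gold)
```

The loss is zero only when every gold class receives probability one, and it grows without bound as a gold-class probability approaches zero.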

3.1.2 Discourse Parser

Given a conversation with $n$ elementary discourse units (EDUs) $\{edu_{1},edu_{2},...,edu_{n}\}$, discourse parsing aims to predict dependency links and the corresponding relation types $\{(edu_{j},edu_{i},r_{ji})|j\neq i\}$ between the EDUs, where $(edu_{j},edu_{i},r_{ji})$ stands for a link of relation type $r_{ji}$ from $edu_{j}$ to $edu_{i}$.

Discourse Representations

All EDUs from a conversation are fed into RoBERTa$_{\textrm{BASE}}$ to encode token-level information. For the $i^{th}$ EDU $edu_{i}$, the [CLS] token embedding is taken as the representation of $edu_{i}$, denoted as $\mathbf{x}_{i}^{u}$. We use a bi-directional GRU (BiGRU) to encode clause-level contextual information. Specifically, we take the EDU representations $\{\mathbf{x}_{1}^{u},\mathbf{x}_{2}^{u},...,\mathbf{x}_{n}^{u}\}$ as the input of the BiGRU to get the hidden representations $\{\mathbf{h}_{1}^{u},\mathbf{h}_{2}^{u},...,\mathbf{h}_{n}^{u}\}$.

Then we follow Shi and Huang (2019), taking $\mathbf{h}_{<i,i}=\{\mathbf{h}_{1,i},...,\mathbf{h}_{i-1,i}\}$ as the input of the link predictor and the relation classifier. The link predictor predicts the parent of $edu_{i}$. The relation classifier is responsible for predicting the relation type $r_{ji}$ between $edu_{i}$ and $edu_{j}$ if $edu_{j}$ is the predicted parent of $edu_{i}$.

Link Prediction and Relation Classification

The link predictor and the relation classifier have similar structures. They first convert the input vector $\mathbf{h}_{i,j}$ $(j<i)$ into a hidden representation through a linear layer:

$$\mathbf{L}_{i,j}^{\mathcal{D}}=\mathrm{tanh}(\mathbf{W}_{\mathcal{D}}\mathbf{h}_{i,j}+\mathbf{b}_{\mathcal{D}}) \tag{4}$$

where $\mathbf{W}_{\mathcal{D}}$ and $\mathbf{b}_{\mathcal{D}}$ are learnable weights, and $\mathcal{D}\in\{link,rel\}$ indicates whether they belong to the link predictor or the relation classifier. The link predictor adopts the softmax function to obtain the probability that $edu_{j}$ is the parent $p_{i}$ of $edu_{i}$ as follows:

$$\mathbf{P}(p_{i})=\mathrm{softmax}(\mathbf{W^{\prime}}_{link}\mathbf{L}_{i,j}^{link}+\mathbf{b^{\prime}}_{link}) \tag{5}$$

where $\mathbf{W^{\prime}}_{link}\in\mathbf{R}^{1\times d_{l}}$ and $\mathbf{b^{\prime}}_{link}\in\mathbf{R}$ are also learnable weights, and $d_{l}$ is the dimension of $\mathbf{L}_{i,j}^{link}$. Hence, the predicted $p_{i}$ is chosen as follows:

$$p_{i}=\mathop{\arg\max}(\mathbf{P}(p_{i})) \tag{6}$$

The relation classifier predicts the relation type $r_{ji}$ as follows:

$$\mathbf{P}(r)=\mathrm{softmax}(\mathbf{W^{\prime}}_{rel}\mathbf{L}_{i,j}^{rel}+\mathbf{b^{\prime}}_{rel}) \tag{7}$$

where $\mathbf{W^{\prime}}_{rel}\in\mathbf{R}^{K\times d_{r}}$ and $\mathbf{b^{\prime}}_{rel}\in\mathbf{R}^{K}$ are learnable weights, $K$ is the number of relation types, and $d_{r}$ is the dimension of $\mathbf{L}_{i,j}^{rel}$.
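The link predictor of Eqs. (4) to (6) can be sketched as follows. The sketch reads Eq. (5) as a softmax over the scalar scores of all candidate parents $j<i$, which is our interpretation of the $1\times d_{l}$ output weight; the function names, the identity weights, and the toy pair vectors are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear(W: Matrix, x: Vector, b: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def softmax(scores: Vector) -> Vector:
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_parent(pair_states: List[Vector], W_link: Matrix, b_link: Vector,
                   w_out: Vector, b_out: float) -> Tuple[Vector, int]:
    scores = []
    for h_ij in pair_states:  # one pair vector per candidate parent j < i
        hidden = [math.tanh(z) for z in linear(W_link, h_ij, b_link)]  # Eq. (4)
        scores.append(sum(w * l for w, l in zip(w_out, hidden)) + b_out)
    probs = softmax(scores)  # Eq. (5), normalised over the candidates (assumed)
    parent = max(range(len(probs)), key=probs.__getitem__)  # Eq. (6)
    return probs, parent


# Three toy candidate parents for one EDU, with hidden size 2.
pair_states = [[0.1, 0.0], [1.0, 1.0], [-1.0, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
probs, parent = predict_parent(pair_states, identity, [0.0, 0.0], [1.0, 1.0], 0.0)
```

The relation classifier of Eq. (7) has the same shape, except that its output layer produces $K$ scores for a single pair rather than one score per candidate.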

Loss Function

As with the ECEC task, we compute the negative log-likelihood loss as follows:

$$L_{link}=-\sum_{d}\sum_{i=1}^{n}\log\mathbf{P}(p_{i}=p^{*}_{i}) \tag{8}$$
$$L_{rel}=-\sum_{d}\sum_{i=1}^{n}\log\mathbf{P}(r_{ji}=r^{*}_{ji}) \tag{9}$$
$$L_{dp}=L_{link}+L_{rel} \tag{10}$$

where $d$ is a conversation in the training data, and $p^{*}_{i}$ and $r^{*}_{ji}$ are the gold labels.

3.1.3 Training

During training, two models are learned simultaneously for the two sub-tasks (ECEC and discourse parsing). The bottom RoBERTa encoding layer is shared by the two tasks. The total training objective is defined as:

$$L=\alpha L_{e}+\beta L_{dp} \tag{11}$$

where $\alpha$ and $\beta$ are hyperparameters.

3.2 ECEC Encoder

3.2.1 Graph Construction

Following Wang et al. (2021a); Li et al. (2016), we adopt a gated graph neural network (gated GNN) to capture conversation-specific features. A fully connected graph is the input of the gated GNN, and each utterance in $H(u_{t})$ is regarded as a node of the graph. We explain the construction rules of the discourse relation, speaker, and relative distance graphs separately as follows.

Discourse Relation Graph:

To model the discourse relations between two utterances in a conversation, we use the discourse parser proposed by Wang et al. (2021a) to obtain the discourse relations between utterances. We train the discourse parser on the STAC dataset Afantenos et al. (2015). Then, we input an entire conversation into the discourse parser:

$$\{(i,j,r_{ij}),...\}=\texttt{Parser}(u_{1},...,u_{n}) \tag{12}$$

where $(i,j,r_{ij})$ denotes a link of relation type $r_{ij}$ from $u_{i}$ to $u_{j}$, with $i,j=1,2,...,n$ and $j>i$. We connect two nodes with an edge if there is a discourse relation between them, and we give different weights to different discourse relation types. These predicted discourse relations are not always correct, so they may lead to error propagation in later use.

Speaker Graph:

Plain text is a continuous document written by a single author, while a conversation is an interaction between multiple speakers, which is the most obvious difference between plain text and conversation. Therefore, we incorporate conversation-specific features into the model through a fully connected graph. An edge of the same type connects two adjacent utterances by the same speaker.

Relative Distance Graph:

For the ECEC task, it is very important to establish the long-distance dependency between utterances. To better model this dependency, we add relative utterance distance features. There is an edge between any two nodes and a self-loop edge on each node. Each edge represents the relative utterance distance between the nodes it links.
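The three construction rules above can be combined into one fully connected graph whose edges carry speaker, distance, and discourse-relation features. The sketch below is a simplified illustration: the dictionary layout, the boolean same-speaker flag, the signed distance `j - i`, and the `"none"` placeholder for pairs without a predicted relation are all assumptions rather than the authors' encoding.

```python
from typing import Dict, List, Tuple

Edge = Dict[str, object]


def build_edges(speakers: List[str],
                relations: Dict[Tuple[int, int], str]) -> Dict[Tuple[int, int], Edge]:
    """Return one feature record per ordered node pair, self-loops included."""
    n = len(speakers)
    edges: Dict[Tuple[int, int], Edge] = {}
    for i in range(n):
        for j in range(n):  # i == j gives the self-loop edge
            edges[(i, j)] = {
                "same_speaker": speakers[i] == speakers[j],
                "distance": j - i,  # assumed signed relative distance
                "relation": relations.get((i, j), "none"),
            }
    return edges


# Toy history of three utterances with two predicted discourse links.
speakers = ["A", "B", "A"]
relations = {(0, 1): "Elaboration", (1, 2): "Acknowledgement"}
edges = build_edges(speakers, relations)
```

In the model each of these three features would be mapped to a learnable embedding before being passed to the gated GNN.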

3.2.2 Gated GNN

Each node and edge in the graph is represented as a learnable vector for the gated GNN. We directly use the utterance-level representation $\mathbf{v}^{i}$ to initialize the vector representation of the corresponding node. The edge between any two nodes is initialized by a learnable vector $\mathbf{e^{\prime}}_{ij}$ as follows:

$$\mathbf{e^{\prime}}_{ij}=[\mathbf{s}_{ij},\mathbf{d}_{ij},\mathbf{r}_{ij}] \tag{13}$$

where $\mathbf{s}_{ij}$, $\mathbf{d}_{ij}$, and $\mathbf{r}_{ij}$ denote the speaker, relative utterance distance, and discourse relation type vectors between utterance $u_{i}$ and utterance $u_{j}$, respectively. Note that the discourse relations between utterances are predicted by the discourse parser trained on the STAC dataset, so there is a problem of error propagation.

After directly concatenating the three vectors, we apply a learnable gate module to $\mathbf{e^{\prime}}_{ij}$ to selectively forget certain information. According to our experiments, it can control duplicated discourse information and relieve the issue of error propagation.

$$\mathbf{e}_{ij}=\mathbf{e^{\prime}}_{ij}\odot\sigma(\mathbf{W}_{f}\mathbf{e}_{ij}^{\prime}+\mathbf{b}_{f}) \tag{14}$$

where $\sigma$ and $\odot$ denote the sigmoid function and the element-wise product, respectively, and $\mathbf{W}_{f}$ and $\mathbf{b}_{f}$ are the weight matrix and bias.
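The edge initialization and gating of Eqs. (13) and (14) can be sketched with toy vectors. The four-dimensional edge, the all-zero gate weights, and the helper names are hypothetical and chosen only so that the result is easy to check by hand: with zero weights the gate equals $\sigma(0)=0.5$ in every dimension.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_edge(e_prime: Vector, W_f: Matrix, b_f: Vector) -> Vector:
    """Eq. (14): e = e' * sigmoid(W_f e' + b_f), multiplied element by element."""
    pre = [sum(w * v for w, v in zip(row, e_prime)) + b_k for row, b_k in zip(W_f, b_f)]
    return [v * sigmoid(z) for v, z in zip(e_prime, pre)]


# Eq. (13): concatenate toy speaker, distance, and relation vectors.
s_ij, d_ij, r_ij = [1.0], [2.0], [-4.0, 0.5]
e_prime = s_ij + d_ij + r_ij

# Zero weights and bias give a gate of 0.5 everywhere, halving each feature.
W_f = [[0.0] * 4 for _ in range(4)]
b_f = [0.0] * 4
e_ij = gate_edge(e_prime, W_f, b_f)
```

With trained weights the gate can push individual dimensions toward zero, which is how unreliable discourse-relation features could be suppressed.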

After that, we apply structure-aware scaled dot-product attention to update the hidden states of the nodes. More details can be found in Wang et al. (2021a). We iterate $T$ times to update these hidden states. Finally, we concatenate the hidden state vectors of the two directions, $\mathbf{e}_{ij}^{T}$ and $\mathbf{e}_{ji}^{T}$, in the top layer:

$$\mathbf{\widehat{E}}_{i,j}=[\mathbf{e}_{ij}^{T};\mathbf{e}_{ji}^{T}] \tag{15}$$

where the subscripts $i,j$ denote the $i^{th}$ and $j^{th}$ utterances. The GNN encoding representation $\mathbf{\widehat{E}}_{i,t}$ and the hidden representation $\mathbf{v}_{cls}$ of [CLS] from the previous stage are concatenated as the final representation $\mathbf{h}_{i}$:

$$\mathbf{h}_{i}=[\mathbf{v}_{cls};\mathbf{\widehat{E}}_{i,t}] \tag{16}$$

where $i$ denotes the $i^{th}$ utterance to be classified and $t$ denotes the target utterance, which carries a non-neutral emotion.

4 Experiment Settings

4.1 Datasets

To verify the effectiveness of utterance interaction information, we conduct experiments on the RECCON dataset Poria et al. (2021). We use the Fold1 dataset Poria et al. (2021), generated by a negative sampling strategy, as our experimental dataset. The training, validation, and test sets contain 27915, 1185, and 7224 samples, respectively.

Moreover, we use the STAC dataset Afantenos et al. (2015) to train the auxiliary discourse parsing task. STAC contains 1091 conversations with 10677 EDUs and 11348 discourse relations.

4.2 Hyperparameters

We use RoBERTa$_{\textrm{BASE}}$ to encode the conversational contexts and adopt the AdamW algorithm to optimize the parameters of our model. The initial learning rate is set to 1e-5 and the batch size to 8. $\alpha$ and $\beta$ are set to 1 and 0.25, respectively. The dimensions of the edge and node states in the gated GNN are both 768, and the dimensions of $\mathbf{s}_{ij}$, $\mathbf{d}_{ij}$ and $\mathbf{r}_{ij}$ are set to 192, 192 and 384, respectively. We train for 10 epochs on the training set and save the best model according to its performance on the validation set. We then evaluate this best model on the test set.

4.3 Evaluation

For fair comparison, we use F1-scores to evaluate our proposed model. Pos.F1 and Neg.F1 represent the F1-score on the positive and the negative examples, respectively. In addition, we report the overall MacroF1 score based on the Pos.F1 and the Neg.F1.
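The three scores can be computed as in the sketch below. It assumes that MacroF1 is the unweighted mean of Pos.F1 and Neg.F1, which agrees with the DAM row reported in Table 1 ((67.91 + 89.55) / 2 = 78.73); the function name and the toy labels are hypothetical.

```python
from typing import List


def f1_for_class(gold: List[int], pred: List[int], cls: int) -> float:
    """F1 score obtained by treating `cls` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy gold labels and predictions (1 = cause, 0 = not a cause).
gold = [1, 0, 0, 1, 0, 0]
pred = [1, 0, 1, 0, 0, 0]
pos_f1 = f1_for_class(gold, pred, cls=1)
neg_f1 = f1_for_class(gold, pred, cls=0)
macro_f1 = (pos_f1 + neg_f1) / 2  # assumed: unweighted mean of the two scores
```

Because negative examples dominate the data, Neg.F1 tends to be high for every system, so Pos.F1 is the more discriminative of the two scores.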

4.4 Benchmark Models

| Model | Pos.F1 | Neg.F1 | MacroF1 |
|---|---|---|---|
| RankCP | 33.00 | 97.30 | 65.15 |
| ECPE-MLL | 48.48 | 94.68 | 71.59 |
| ECPE-2D | 55.50 | 94.96 | 75.23 |
| SIMP$_{\textrm{BASE}}$ | 64.28 | 88.74 | 76.51 |
| SIMP$_{\textrm{LARGE}}$ | 66.23 | 87.89 | 77.06 |
| DAM (ours) | 67.91 | 89.55 | 78.73 |

Table 1: Results of our model compared with other models. SIMP$_{\textrm{BASE}}$ and SIMP$_{\textrm{LARGE}}$ are the Transformer-based models proposed by Poria et al. (2021), initialized with RoBERTa$_{\textrm{BASE}}$ and RoBERTa$_{\textrm{LARGE}}$, respectively.

We first compare our DAM with three ECE SOTA models proposed by previous research in Table 1.

  1. RankCP Wei et al. (2020): It uses a graph attention network to encode the representations of document clauses;

  2. ECPE-2D Ding et al. (2020a): It integrates the representation, interaction and prediction of 2D emotion-cause pairs through joint learning;

  3. ECPE-MLL Ding et al. (2020b): It uses a multi-label learning scheme for training and achieves good results.

Poria et al. (2021) transfer these three models to the conversational dataset and propose simple but stronger Transformer-based models (SIMP$_{\textrm{BASE}}$ and SIMP$_{\textrm{LARGE}}$).

4.5 Main Results

We can draw the following conclusions. First, because of the differences between plain text and conversation, the previous complex deep learning methods designed for plain-text documents, such as RankCP, ECPE-MLL and ECPE-2D, perform even worse than the simple SIMP$_{\textrm{BASE}}$ model. Second, our model improves MacroF1 by 2.22 points (78.73 vs. 76.51) over the baseline model encoded by the same RoBERTa$_{\textrm{BASE}}$. This may be attributed to the potential links between emotional utterances and their corresponding cause expressions contained in the discourse structures. Conversation-specific features can also help models understand conversations.

5 Results and Analysis

We next conduct further experiments to explore which parts of our model contribute and what they improve. We also study how to combine the different kinds of structural information so as to make the most of them.

5.1 Ablation Study

| Model | Pos.F1 | Neg.F1 | MacroF1 |
|---|---|---|---|
| DAM | 67.91 | 89.55 | 78.73 |
| W/O multi-task | 65.76 | 87.44 | 76.60 |
| W/O gated-gnn | 66.61 | 88.13 | 77.37 |
| W/O speaker | 65.32 | 89.38 | 77.35 |
| W/O distance | 66.58 | 88.44 | 77.51 |
| W/O gate | 66.32 | 89.02 | 77.67 |

Table 2: Ablation results. “W/O” stands for “without”.

The main results above show the effectiveness of our model. As shown in Table 2, removing the auxiliary task decreases MacroF1 by 2.13 points relative to DAM. This indicates that the multi-task framework helps our model capture discourse structure information, which improves its ability to establish relationships between utterances. The model without the gated GNN loses 1.36 points relative to DAM, showing that encoding conversational context information helps in understanding conversations. We also verify the influence of the speaker and relative utterance distance features separately, and performance decreases in both cases. This demonstrates that structural information carrying conversational features has a significant impact on the model. Finally, removing only the gate mechanism decreases performance by 1.06 points. We argue that the gate module alleviates the error propagation caused by directly using the pseudo discourse relations.

5.2 Long-distance Dependency Analysis

Figure 3: Performance at different relative utterance distances between $u_{t}$ and $u_{i}$.

To verify whether our model can alleviate the long-distance dependency problem, we present results at different relative utterance distances between $u_{t}$ and $u_{i}$ in Fig. 3. First, regardless of the relative utterance distance, our DAM model performs better than the previous SIMP$_{\textrm{BASE}}$ model. This means that combining the utterance interaction features makes our model more robust with respect to distance. Second, our model performs significantly better than SIMP$_{\textrm{BASE}}$ within a turn or a distance of 3. Beyond a distance of 3, however, the improvement is no longer significant. This suggests that the long-distance dependency problem is significant in the ECEC task. Although our approach helps to ameliorate the problem, deeper reasoning needs to be explored for very long turns or distances.

5.3 Effect of Different Fusion Methods

Figure 4: A conversation example from the test dataset along with the predicted discourse relations, ground truth, SIMP$_{\textrm{BASE}}$ results and DAM results. "C-Q" is short for "Clarification_question", "Elab" for "Elaboration", "Res" for "Result", "Ack" for "Acknowledgement", and "Com" for "Comment". $u_{i}$ corresponds to the $i^{th}$ utterance of the conversation.

| Model | Pos.F1 | Neg.F1 | MacroF1 |
|---|---|---|---|
| SIMP$_{\textrm{BASE}}$ | 64.28 | 88.74 | 76.51 |
| SIMP$_{\textrm{LARGE}}$ | 66.23 | 87.89 | 77.06 |
| DAM | 67.91 | 89.55 | 78.73 |
| DAM-mtl | 66.61 | 88.13 | 77.37 |
| DAM-cat | 65.04 | 88.36 | 76.70 |
| DAM-gnn | 66.91 | 89.24 | 78.07 |
| DAM-mtl-gnn | 66.32 | 89.02 | 77.67 |
| DAM-arc | 66.70 | 88.87 | 77.79 |

Table 3: Results of different fusion methods for discourse relations. DAM is our final strategy.

In this section, we study how to integrate discourse relations so as to make the most of them. Chen et al. (2020); Ding et al. (2020a) use a graph convolutional network or joint learning to model the dependency relations between clauses. In this paper, we design and experiment with six alternative methods for incorporating discourse relations. Table 3 shows the performance of these methods, and we can draw the following conclusions. First, discourse structure information can be strengthened by three strategies (DAM-mtl/cat/gnn), all of which perform better than SIMP$_{\textrm{BASE}}$, which is initialized with RoBERTa$_{\textrm{BASE}}$. However, the improvement of DAM-cat is slight. This may be because it suffers from serious error propagation when directly using hidden states that are not exactly correct. When we combine DAM-mtl and DAM-gnn (DAM-mtl-gnn), the performance degrades, possibly because of duplicated discourse information or error propagation. Adding the gate module on top of DAM-mtl-gnn (DAM) effectively alleviates these problems and performs best, so we choose this fusion method as our final method. Moreover, we also try ignoring the specific discourse relation types and using only whether a discourse relation exists (DAM-arc), but this variant is still not comparable with DAM.

5.4 Case Study

For the case study, we select an example in the test dataset to verify that discourse structure information can help our model to better model long-distance dependencies.

Due to space limitations, we only show a subset of the emotion causes and discourse relations. As shown in Fig. 4, the cause utterance set of $u_{10}$ is $C(u_{10})=\{u_{4},u_{6},u_{8}\}$, and the cause utterance set of $u_{12}$ is $C(u_{12})=\{u_{4},u_{6},u_{8},u_{10}\}$. The SIMP$_{\textrm{BASE}}$ model can only predict the cause utterances at a short relative distance and performs poorly on cause utterances at a long relative distance. The discourse relation diagram shows that the target utterances and each utterance in their cause utterance sets are related to each other. After we incorporate the discourse structure information, the model can better capture long-distance dependencies. Some cause utterances are still missed, such as the cause utterance $u_{4}$ of the target utterance $u_{12}$; we conjecture that this is caused by the error propagation of the discourse parser.

6 Conclusion and Future Work

In this paper, we introduce a discourse-aware model for emotion cause extraction in conversation. Specifically, we model ECEC jointly with discourse parsing in conversations using a multi-task framework. This helps the shared pre-trained language model learn better discourse structure representations of conversations. In addition, the model employs a graph neural network to encode conversation-specific features, such as relative utterance distance and speakers, and to further enhance discourse structures. Both components help the model integrate rich utterance interaction information and mitigate the long-distance dependency problem. Finally, we utilize a gate-controlled module to alleviate the error propagation problem from predicted discourse relations. Experiments on the benchmark show that our model reaches a new SOTA.

Future work may explore how discourse structure information can be used to extract both the emotion and the cause without the emotion information of the target utterance. Besides, we still need to explore more methods to address the problems of long-distance dependency and error propagation.

References