Question-Interlocutor Scope Realized Graph Modeling over Key Utterances for Dialogue Reading Comprehension

Jiangnan Li^1,2, Mo Yu³, Fandong Meng³, Zheng Lin^1,2, Peng Fu¹, Weiping Wang¹, Jie Zhou³

Abstract

In this work, we focus on dialogue reading comprehension (DRC), a task extracting answer spans for questions from dialogues. Dialogue context modeling in DRC is tricky due to complex speaker information and noisy dialogue context. To solve the two problems, previous research proposes two self-supervised tasks respectively: guessing who a randomly masked speaker is according to the dialogue and predicting which utterance in the dialogue contains the answer. Although these tasks are effective, there are still urging problems: (1) randomly masking speakers regardless of the question cannot map the speaker mentioned in the question to the corresponding speaker in the dialogue, and ignores the speaker-centric nature of utterances. This leads to wrong answer extraction from utterances in unrelated interlocutors’ scopes; (2) the single utterance prediction, preferring utterances similar to the question, is limited in finding answer-contained utterances not similar to the question. To alleviate these problems, we first propose a new key utterances extracting method. It performs prediction on the unit formed by several contiguous utterances, which can realize more answer-contained utterances. Based on utterances in the extracted units, we then propose Question-Interlocutor Scope Realized Graph (QuISG) modeling. As a graph constructed on the text of utterances, QuISG additionally involves the question and question-mentioning speaker names as nodes. To realize interlocutor scopes, speakers in the dialogue are connected with the words in their corresponding utterances. Experiments on the benchmarks show that our method can achieve better and competitive results against previous works.

Introduction

Beyond the formal forms of text, dialogues are one of the most frequently used media that people communicate with others to informally deliver their emotions (Poria et al. 2019), opinions (Cox et al. 2020), and intentions (Qin et al. 2021). Moreover, dialogues are also a crucial type of information carriers in literature, such as novels and movies (Kociský et al. 2018), for people to understand the characters and plots (Sang et al. 2022) in their reading and entertainment behaviors. Therefore, how to comprehend dialogues is a key step for machines to act like human.

Refer to caption — Figure 1: Two questions with related dialogue clips that the baseline SelfSuper (Li and Zhao 2021) fails. Utter. #9 is too long, so we omit some parts of the utterance.

Despite the importance and value of dialogues, reading comprehension over dialogues (DRC) lags behind those over formal text, e.g., news articles and Wikipedia text.¹¹1Note there is a direction of conversational question answering (Reddy, Chen, and Manning 2018; Choi et al. 2018) differing from the DRC task here. For the former, QA are formed as a dialogue, and the model is required to understand Wikipedia articles to derive the answers. There are several reasons for the challenges in dialogue reading comprehension. (1) As shown in Fig. 1, utterances in dialogues involve mostly informal oral language. (2) An utterance itself is usually short and incomplete, and therefore understanding it highly depends on its dialogue context. Finally, (3) dialogue texts can be fluctuated. For example, people make a slip of language sometimes, and may evoke emotional expression. The aforementioned challenges in (1) and (3) can be alleviated with pretrained models with domain-specific training data and enhancement of robustness. However, the challenge (2) is a major scientific problem in DRC.

In previous works, Li and Zhao (2021) (abbreviated as SelfSuper) point out that dialogue context modeling in DRC faces two challenges: complex speaker information and noisy question-unrelated context. For speaker modeling, SelfSuper design a self-supervised task guessing who a randomly masked speaker is according to the dialogue context (e.g., masking “Monica Geller” of #10). To reduce noise, SelfSuper design another task to predict whether an utterance contains the answer. Although decent performance can be achieved, several urging problems still exist.

Firstly, speaker guessing does not aware the speaker information in questions and the interlocutor scope. As randomly masking is independent to the question, it cannot tell which speaker in the dialogue is related to the speaker mentioned in the question, e.g., Joey Tribbiani to Joey in Q1 of Fig. 1. As for the interlocutor scope, we define it as the utterances said by the corresponding speaker. We point that utterances have a speaker-centric nature: First, each utterance has target listeners. For example, in Utter. #10 of Fig. 1, it requires to understand that Joey is a listener, so “you had the night” is making fun of Joey from Monica’s scope. Second, an utterance reflects the message of experience of its speaker. For example, to answer Q1 in Fig. 1, it requires to understand “stayed up all night talking” is the experience appearing in Joey’s scope. Due to ignoring the question-mentioned interlocutor and its scope, SelfSuper provides a wrong answer.

Secondly, answer-contained utterance (denoted as key utterance by SelfSuper) prediction prefers utterances similar to the question, failing to find key utterances not similar to the question. The reason for this is that answers are likely to appear in utterances similar to the question. For example, about 77% questions has answers in top-5 utterances similar to the question according to SimCSE (Gao, Yao, and Chen 2021) in the dev set of FriendsQA (Yang and Choi 2019). Furthermore, the utterances extracted by the key utterance prediction have over 82% overlaps with the top-5 utterances. Therefore, there are considerable key utterances have been ignored, leading to overrated attention to similar utterances, e.g., Q2 in Fig. 1. In fact, answer-contained utterances are likely to appear as the similar utterances or near the similar utterances because contiguous utterances in local context tend to be in a topic relevant to the question. However, the single utterance prediction cannot realize this observation.

To settle aforementioned problems, so that more answer-contained utterances can be found and answering process realizes the question and interlocutor scopes, we propose a new pipeline framework for DRC. We first propose a new key utterances extracting method. The method slides a window through the dialogue, where contiguous utterances in the window are regarded as a unit. The prediction is made on these units. Once a unit containing the answer, all utterances in it are selected as key utterances. Based on these extracted utterances, we then propose Question-Interlocutor Scope Realized Graph (QuISG) modeling. Instead of treating utterances as a plain sequence, QuISG constructs a graph structure over the contextualized text embeddings. The question and speaker names mentioned in the question are explicitly present in QuISG as nodes. The question-mentioning speaker then connects with its corresponding speaker in the dialogue. Furthermore, to remind the model of interlocutor scopes, QuISG connects every speaker node in the dialogue with words from the speaker’s scope. We verify our model on two representative DRC benchmarks, FriendsQA and Molweni (Li et al. 2020). Our model achieves better and competitive performance against baselines on both benchmarks, and further experiments indicate the efficacy of our proposed method.

Related Work

Dialogue Reading Comprehension.

Unlike traditional Machine Reading Comprehension (Rajpurkar et al. 2016), Dialogue Reading Comprehension (DRC) aims to answer a question according to the given dialogue. There are several related but different types of conversational question answering: CoQA (Reddy, Chen, and Manning 2018) conversationally asks questions after reading Wikipedia articles. QuAC (Choi et al. 2018) forms a dialogue of QA between a student and a teacher about Wikipedia articles. DREAM (Sun et al. 2019) tries to answer multi-choice questions over dialogues of English exams. To understand characteristics of speakers, Sang et al. (2022) propose TVShowGuess in a multi-choice style to predict unknown speakers in dialogues. Conversely, we focus on DRC extracting answer spans from a dialogue for an independent question (Yang and Choi 2019). For our focused DRC, as dialogues are a kind of special text, Li and Choi (2020) propose several pretrained and downstream tasks based on utterances in dialogues. To consider coreference and relationship of speakers, Liu et al. (2020) introduce the two types of knowledge from other dialogue-related tasks. Besides, Li et al. (2021); Ma, Zhang, and Zhao (2021) model the knowledge of discourse structure for dialogues. To model the complex speaker information and noisy dialogue context, Li and Zhao (2021) propose two self supervised tasks, i.e., masked-speaker guessing and key utterance prediction. However, existing work ignores explicitly modeling the question and speaker scopes and suffers from low key-utterance coverage.

Dialogue Modeling with Graph Representations.

In many QA tasks (Yang et al. 2018; Talmor et al. 2019), graphs are the main carrier for reasoning (Qiu et al. 2019; Fang et al. 2020; Yasunaga et al. 2021). As for dialogue understanding, graphs are still a hotspot for various purposes. In dialogue emotion recognition, graphs are constructed to consider the interactions between different parties of speakers (Ghosal et al. 2019; Ishiwatari et al. 2020; Shen et al. 2021). In dialogue act classification, graphs model the cross-utterances and cross-tasks information (Qin et al. 2021). In dialogue semantic modeling, Bai et al. (2021) extend AMR (Banarescu et al. 2013) to construct graphs for dialogues. As for our focused DRC, graphs are constructed for knowledge propagation between utterances by works (Liu et al. 2020; Li et al. 2021; Ma, Zhang, and Zhao 2021) mentioned above.

Framework

Task Definition

Given a dialogue consisting of N utterances: $\mathcal{D}$ = $[utter_{1}$ $,utter_{2},...,utter_{N}]$ , the task aims to extract the answer span $a$ for a question $q=[qw_{1},qw_{2},...,qw_{L_{q}}]$ from $\mathcal{D}$ , where $qw_{i}$ is the i-th word in $q$ and $L_{q}$ is the length of $q$ . In $\mathcal{D}$ , each utterance $utter_{i}=\{\mathrm{speaker}:s_{i},\mathrm{text}:t_{i}\}$ contains its corresponding speaker (e.g., $s_{i}=$ “Chandler Bing”) and text content $t_{i}=[tw_{1},tw_{2},...,tw_{L_{i}}]$ , where $tw_{j}$ the j-th word in $t_{i}$ and $L_{i}$ is the length of $t_{i}$ . For some unanswerable questions, there is no answer span to be found in $\mathcal{D}$ . Under such a circumstance, $a$ is assigned to be null.

Conversational Context Encoder

Dialogue is a process whose utterances are not isolated but contextually interact with each other. Therefore, encoding words in an utterance requires considering contextual information from other utterances. To better encode words contextually using pretrained models (PTM), following previous work (Li and Zhao 2021), we chronologically concatenate utterances in the same conversation to form a text sequence: $\mathcal{C}$ = “ $s_{1}$ : $t_{1}$ $\mathrm{[SEP]}$ … $\mathrm{[SEP]}$ $s_{N}$ : $t_{N}$ ”, where [SEP] is a special token of PTM for separation. Holding the conversational context $\mathcal{C}$ , PTM can deeply encode $\mathcal{C}$ with the question $q$ to make it question-aware by concatenating them as $QC$ = “ $\mathrm{[CLS]}\ q\ \mathrm{[SEP]}\ \mathcal{C}\ \mathrm{[SEP]}$ ” (it is okay that $\mathcal{C}$ goes first), where [CLS] is also a special token for classification.

Following Li and Zhao (2021), we utilize the ELECTRA discriminator to encode the sequence $QC$ :

H_{QC}=\mathrm{ELECTRA}(QC),

(1)

where $H_{QC}\in\mathbb{R}^{L_{QC}\times d_{h}}$ , $L_{QC}$ is the length of $QC$ , and $d_{h}$ is the hidden size of PLM. $H_{QC}$ can be split into $H_{Q}\in\mathbb{R}^{L_{q}\times d_{h}}$ and $H_{C}\in\mathbb{R}^{L_{C}}$ according the position of [SEP] between $q$ and $C$ , where $L_{C}$ is the length of $\mathcal{C}$ .

Key Utterances Extractor

Treating every single utterance as a unit to pair with the question prefers utterances similar to the question. However, the utterance containing the answer is not always in the case, where it can appear near the similar utterance within several steps due to the topic relevance. Therefore, we apply a window along the dialogue. Utterances in the window are treated as a unit so that the similar utterance and the answer-contained utterance can co-occur and more answer-contained utterances can be realized.

Training the Extractor.

With the window whose size is $m$ , [ $utter_{i}$ , $utter_{i+1}$ , …, $utter_{i+m}$ ] is grouped. Mapping the start ( $st_{i}$ ) and end ( $ed_{i}$ ) position of the unit in $\mathcal{C}$ , the representation of the unit can be computed by:

H_{u_{i}}^{k}=\mathrm{Maxpooling}(H_{C}[st_{i}:ed_{i}]).

(2)

Similarly, the representation of the question is computed by $H_{q}^{k}=\mathrm{Maxpooling}(H_{Q})$ . The correlation score of the unit and the question is then computed by:

y_{i}=\mathrm{sigmoid}(\mathrm{Linear}(H_{u_{i}}^{k}||H_{q}^{k})),

(3)

where $\mathrm{Linear}(\cdot)$ is a linear unit mapping the dimension from $\mathbb{R}^{2d_{h}}$ to $\mathbb{R}$ and $||$ is the concatenating operation. For the unit, if any utterances in it contains the answer, the label $y_{i}^{k}$ of this unit is set to 1, otherwise 0. Therefore, the training objective of the key utterances extractor on the dialogue $\mathcal{D}$ is:

\mathcal{J}_{k}=-\sum_{i=1}^{N-m}{[(1-y_{i}^{k})log(1-y_{i})+y_{i}^{k}log(y_{i})]}.

(4)

Extracting key utterances.

The trained model predicts whether a unit is correlated to the question. If $y_{i}>0.5$ , the unit is regarded as a question-related unit and utterances inside are all regarded as key utterances. To avoid involving too many utterances as key utterances, we further rank all the units by $y_{i}$ and pick up top- $k$ units whose $y_{i}>0.5$ . For a question $q$ , we keep a key utterance set $key=(\cdot)$ to store the extracted key utterances.

Specifically, when the $i$ -th unit is among the top- $k$ units and holds $y_{i}>0.5$ , [ $utter_{i}$ , …, $utter_{i+m}$ ] are all considered to be added into $key$ . If $utter_{i}$ does not exist in $key$ , then $key$ .add( $utter_{i}$ ) is triggered, otherwise skipped. After processing all units satisfying the conditions, $key$ sorts key utterances in it by sort( $key$ , 1 $\rightarrow$ N), where 1 $\rightarrow$ N denotes chronological order.

We obverse that, in most cases, key utterances in $key$ are contiguous utterances. When k=3 and m=2, the set is ordered as ( $utter_{i-m}$ , …, $utter_{i}$ , …, $utter_{i+m}$ ), where $utter_{i}$ is usually the similar utterance.

Question-Interlocutor Scope Realized Graph Modeling

To guide models to further realize the question, speakers in the question, and the scope of the corresponding speakers in $\mathcal{D}$ , we construct a Question-Interlocutor Scope Realized Graph (QuISG) based on key utterances in $key$ . QuISG can be formulated as $\mathcal{G}$ = ( $\mathcal{V}$ , $\mathcal{A}$ ), where $\mathcal{V}$ denotes the set of nodes involved in $\mathcal{G}$ and $\mathcal{A}$ denotes the adjacent matrix to indicate edges. After the construction of QuISG, we utilize a node type realized graph attention network to process it. We elaborate the graph modeling below.

Nodes.

QuISG is directly constructed on the words of the question and key utterances. Therefore, we define several types of node based on words.

Question Node: Question node denotes the questioning word (e.g., “what”, “for what reason”) of the question. The node representation is initialized by the meanpooling on the representations of the question word: $\mathrm{v.rep}$ = $\mathrm{mean}(H_{Q}[$ for, what, reason $])$ . We denote this type of node as $\mathrm{v.type}$ =qw.

Question Speaker Node: Considering speakers in the question can help models to realize which speakers and their interactions are focused by the question. Question speaker node is derived from the speaker name recognized from the question. We use stanza ²²2https://github.com/stanfordnlp/stanza performing NER to recognize person names (e.g. “ross”) in the question and pick up those names appearing in the dialogue as interlocutors. The representation of node is initialized as $\mathrm{v.rep}$ = $H_{Q}[$ ross $]$ and the type is marked as $\mathrm{v.type}$ =qs. Additionally, if a question contains no speaker name or the picked name does not belong to interlocutors in the dialogue, no question speaker node will be involved.

Dialogue Speaker Node: Speakers appeared in the dialogue are crucial for dialogue modeling. We consider speakers of the key utterances to construct dialogue speaker nodes. As the speaker in the dialogue is identified by its full name (e.g., “Ross Gellar”), we compute the initial node representation by meanpooling on the full name and all utterances of the speaker will provide its speaker name for the computation: $\mathrm{v.rep}$ = $\mathrm{mean}(H_{C}[$ Ross₁, Gellar₁, …, Ross_x, Gellar ${}_{x}])$ , where $x$ is the number of utterance whose speaker name is “Ross Gellar”. This type is marked as $\mathrm{v.type}$ =ds.

Dialogue Word Node: As the main body where extracting answer is performed, words from utterances are positioned in the graph as dialogue word nodes. We take all words in the key utterances to form the group of dialogue word node. The representation is initialized from the corresponding item of $H_{C}$ . This type is denoted as $\mathrm{v.type}$ =dw.

Scene Node: In some datasets, there is a kind of special utterance that appears at the beginning of a dialogue and gives a brief description of where the dialogue happens or what it is about. If it is selected as a key utterance, we set all words in it as scene nodes. Although we define this type of node, it still acts as a word node with $\mathrm{v.type}$ =dw. The only difference is how to connect with dialogue speaker nodes, which is stated in the next subsection.

Edges.

We set edges between nodes to connect nodes from the question and key utterances and initialize the adjacent matrix as $\mathcal{A}=\mathrm{O}$ .

For the word node $\mathrm{v}_{x}\in utter_{i}$ , we connect it with other word nodes $\mathrm{v}_{y}\in utter_{i}$ ( $x-k_{w}$ $\leq$ $y$ $\leq$ $x+k_{w}$ ) within a window whose size is $k_{w}$ , i.e., $\mathcal{A}[\mathrm{v}_{x},\mathrm{v}_{y}]=1$ . For word nodes in other utterances (e.g., $\mathrm{v}_{z}\in utter_{i+1}$ ), there is no edge between $\mathrm{v}_{x}$ and $\mathrm{v}_{z}$ . To remind the model of the scope of speakers, we connect every word node with the speaker node $\mathrm{v}_{s_{i}}$ it belongs to, i.e., $\mathcal{A}[\mathrm{v}_{x},\mathrm{v}_{s_{i}}]=1$ and $\mathcal{A}[\mathrm{v}_{s_{i}},\mathrm{v}_{x}]=1$ . To realize the question, we connect all word nodes with the question node $\mathrm{v}_{q}$ , i.e., $\mathcal{A}[\mathrm{v}_{x},\mathrm{v}_{q}]=1$ and $\mathcal{A}[\mathrm{v}_{q},\mathrm{v}_{x}]=1$ .

For the speakers mentioned in the question, we fully connect their corresponding question speaker nodes to model the interactions between these speakers, e.g., $\mathcal{A}[\mathrm{v}_{qs_{m}},\mathrm{v}_{qs_{n}}]=1$ and $\mathcal{A}[\mathrm{v}_{qs_{n}},\mathrm{v}_{qs_{m}}]=1$ . To remind the model which speaker in dialogue is related, we further connect the question speaker node $\mathrm{v}_{qs_{m}}$ with its corresponding speaker node $\mathrm{v}_{s_{i}}$ , i.e., $\mathcal{A}[\mathrm{v}_{qs_{m}},\mathrm{v}_{s_{i}}]=1$ and $\mathcal{A}[\mathrm{v}_{s_{i}},\mathrm{v}_{qs_{m}}]=1$ . Furthermore, to realize the question, question speaker nodes should connect with the question node, e.g., $\mathcal{A}[\mathrm{v}_{qs_{m}},\mathrm{v}_{q}]=1$ and $\mathcal{A}[\mathrm{v}_{q},\mathrm{v}_{qs_{m}}]=1$ .

If the scene description is selected as a key utterance, it will be regarded as an utterance without speaker identification. We treat a scene node as a word node and follow the same edge construction as word nodes. As the scene description is likely to tell things about speakers, we utilize stanza to recognize speakers and connect all scene nodes with the dialogue speaker nodes the recognized speakers related to.

For every node in QuISG, we additionally add a self-connected edge, i.e., $\mathcal{A}[\mathrm{v},\mathrm{v}]=1$ . By realizing these connections mentioned above, items in $\mathcal{A}$ are assigned with 0 or 1 for edge existence.

Node Type Dependent Graph Attention Network.

Node Type Dependent Graph Attention Network (GAT) is a $T$ -layer stack of graph attention blocks (Velickovic et al. 2017). The input of GAT is a QuISG and GAT propagates and aggregates messages between nodes through edges. We initial the graph representation by $h_{v}^{0}=\mathrm{v.rep}$ . We then introduce how the $t$ - $th$ layer gathers messages for nodes from the neighbors.

A graph attention block mainly performs multi-head attention computing, in which each head has the similar operation. We exemplify the attention computing by one head. To measure how important the node $\mathrm{v}_{n}$ to the node $\mathrm{v}_{m}$ , the node type realized attentive weight is computed by:

	$\displaystyle\alpha_{mn}=\frac{\mathrm{exp}\left(\mathrm{LeakyReLU}\left(\mathrm{c}_{mn}\right)\right)}{\sum_{\mathrm{v}_{o}\in\mathcal{N}_{\mathrm{v}_{m}}}{\mathrm{exp}\left(\mathrm{LeakyReLU}\left(\mathrm{c}_{mo}\right)\right)}},$		(5)
	$\displaystyle\mathrm{c}_{mn}=\mathrm{a}\left[[h_{\mathrm{v}_{m}}^{t-1}\|\|{r_{\mathrm{v}_{m}.{\mathrm{type}}}}]\mathrm{W}_{q}\|\|[h_{\mathrm{v}_{n}}^{t-1}\|\|r_{{\mathrm{v}_{n}}.\mathrm{type}}]\mathrm{W}_{k}\right]^{\mathrm{T}},$		(6)

where $\mathrm{c}_{mn}$ is the attentive coefficient score between $\mathrm{v}_{m}$ and $\mathrm{v}_{n}$ , $[\cdot||\cdot]$ is the concatenating operation, $r_{\mathrm{v}_{m}.\mathrm{type}}\in\mathbb{R}^{1\times 4}$ is a one-hot vector denoting the node type of $\mathrm{v}_{m}$ , and $\mathrm{a}\in\mathbb{R}^{1\times 2d_{head}}$ , $\mathrm{W}_{q}\in\mathbb{R}^{(d_{head}+4)\times d_{head}}$ , $\mathrm{W}_{k}\in\mathbb{R}^{(d_{head}+4)\times d_{head}}$ are trainable parameters.

Once obtaining the importance of adjacent nodes, the graph attention block aggregates the weighted message by:

h_{\mathrm{v}_{m}}^{t\cdot head}=\sigma\left(\sum_{\mathrm{v}_{o}\in\mathcal{N}_{\mathrm{v}_{m}}}{\alpha_{mn}h_{\mathrm{v}_{o}}^{t-1}\mathrm{W}_{v}}\right),

(7)

where $\sigma$ is the ELU function, and $\mathrm{W}_{o}\in\mathbb{R}^{d_{head}\times d_{head}}$ is a trainable parameter. By concatenating weighted messages from all heads, the $t$ - $th$ graph attention block can update the node representation to $h_{\mathrm{v}_{m}}^{t}$ .

Answer Extraction Training and Inference.

After graph modeling, nodes in the QuISG are then mapped back into the original token sequence. As answers are extracted from the dialogue, only nodes in key utterances are considered. We locate the dialogue word (scene) node $v_{x}$ with its corresponding token representation $H_{C}[utter_{i}[x]]$ in $\mathcal{C}$ , we then update the token representation by $H_{C}[utter_{i}[x]]$ += $h_{\mathrm{v}_{x}}^{T}$ . For the speaker token representation $H_{C}[$ Ross_i, Gellar ${}_{i}]$ in key utterances, the mapped dialogue speaker node $\mathrm{v}_{s_{i}}$ updates it by $H_{C}[$ Ross_i, Gellar ${}_{i}]$ += [ $h_{\mathrm{v}_{s_{i}}}^{T}$ , $h_{\mathrm{v}_{s_{i}}}^{T}$ ]. As a speaker name $s_{i}$ may appear several times, we repeat adding duplicated $h_{\mathrm{v}_{s_{i}}}^{T}$ to the corresponding token representations. We denote the updated $H_{C}$ as $H_{C}^{{}^{\prime}}$ .

Training.

Given $H_{C}^{{}^{\prime}}$ , the model computes the start and the end distributions by:

	$\displaystyle Y_{srt}=\mathrm{softmax}(\mathrm{w}_{srt}{H_{C}^{{}^{\prime}}}^{\mathrm{T}}),$		(8)
	$\displaystyle Y_{end}=\mathrm{softmax}(\mathrm{w}_{end}{H_{C}^{{}^{\prime}}}^{\mathrm{T}}),$		(9)

where $\mathrm{w}_{srt}\in\mathbb{R}^{1\times L_{C}}$ , $\mathrm{w}_{end}\in\mathbb{R}^{1\times L_{C}}$ are trainable parameters. For the answer span $a$ , we denote its start index and end index as $a_{st}$ and $a_{ed}$ respectively. Therefore, the answer extracting objective is:

\mathcal{J}_{ax}=-\mathrm{log}\left(Y_{srt}(a_{st})\right)-\mathrm{log}\left(Y_{end}(a_{ed})\right).

(10)

If there are questions without any answers, another header is applied to predict whether a question is answerable. The header computes the probability by $p_{na}=\mathrm{sigmoid}(\mathrm{Linear}(H_{C}^{{}^{\prime}}[\mathrm{CLS}]))$ . By annotating every question with a label $q\in\{0,1\}$ to indicate answerability, another objective is added:

\mathcal{J}_{na}=-[(1-q)\mathrm{log}(1-p_{na})+q\mathrm{log}(p_{na})].

(11)

In this way, the overall training objective is $\mathcal{J}=\mathcal{J}_{ax}+0.5*\mathcal{J}_{na}$ .

Inference.

Following Li and Zhao (2021), we extract the answer span by performing beam search with the size of 5. We constrain the answer span in one utterance to avoid answers across utterances. To further emphasize the importance of key utterances, we construct a scaling vector $S\in\mathbb{R}^{1\times L_{C}}$ , where the token belonging to key utterances is kept with 1 and the token out of key utterances is assigned with a scale factor $0\leq f\leq 1$ . The scaling vector is multiplied on $Y_{srt}$ and $Y_{end}$ before softmax, and we then perform the ranking on the possibilities of answers for inference.

Experimental Settings

Datasets.

Following Li and Zhao (2021), we conduct experiments on FriendsQA (Yang and Choi 2019) and Molweni (Li et al. 2020). As our work does not focus on unanswerable questions, we construct an answerable version of Molweni (Molweni-A) by removing all unanswerable questions. FriendsQA is a open-domain question answering dataset collected from the TV-series Friends. It contains 977/122/123 (train/dev/test) dialogues and 8,535/1,010/1,065 questions. As we recognize person names in questions, we find about 76%/76%/75% of questions contain person names in FriendsQA. Molweni is another related dataset whose topic is about Ubuntu. It contains 8,771/883/100 dialogues and 24,682/2,513/2,871 questions, in which about 14% questions are unanswerable. Dialogues in Mowelni is much shorter than those in FriendsQA and contain no scene descriptions. Speaker names in Molweni are meaningless user ids (e.g., “nbx909”). Furthermore, questions containing user ids in Molweni, whose proportion is about 47%/49%/48%, are much less than those in FriendsQA. In Molweni-A, there are 20,873/2,346/2,560 questions.

Compared Methods.

We compare our method with existing methods in DRC. ULM+UOP (Li and Choi 2020) adapt token MLM, utterance MLM, and utterance order prediction to pretrain BERT, and finetune it in the multi-task setting. KnowledgeGraph (Liu et al. 2020) introduce and structurally model additional knowledge about speakers’ co-reference and relations between each other from other related datasets (Yu et al. 2020). DADGraph (Li et al. 2021) is another graph-based method that introduce external knowledge about the discourse structure of dialogues. ELECTRA (Clark et al. 2020) is the vanilla pretrained model used as the backbone of our method. SelfSuper (Li and Zhao 2021) is the SOTA method. It design two self-supervised tasks to capture speaker information and reduce noise in the dialogue.

Implementation.

Our medel is implemented based on ELECTRA-large-discriminator from Transformers. For key utterances extraction, the size of the sliding window (i.e., $m$ ) is set to 2 and top-3 units are considered. Other hyper-parameters are the same as those in the question answering training. For question answering, we search the size of word node window (i.e., $k_{w}$ ) in 1, 2, 3, and the number of attention heads in 1, 2, 4. We set the number of GAT layers to 5 for FriendsQA and 3 for Molweni; $f$ is set to 0.5 for FriendsQA and 0.9 for Molweni. For other hyper-parameters, we follow Li and Zhao (2021). We use Exact Matching (EM) score and F1 score as the metrics.

Results and Discussion

Main Results

	Model	EM	F1
BERT based	ULM+UOP (Li and Choi 2020)	46.80	63.10
	KnowledgeGraph (Liu et al. 2020)	46.40	64.30
	SelfSuper ${}_{\textrm{BERT}}$ (Li and Zhao 2021)	46.90	63.90
ELECTRA based	Our Reimpl. ELECTRA	54.62	71.29
	SelfSuper (Li and Zhao 2021)	55.80	72.30
	Ours	57.79^∗	75.22^∗

Table 1: Results on FriendsQA. ^∗ denotes significance against SelfSuper with paired t-test.

	Model	EM	F1
BERT based	DADGraph (Li et al. 2021)	46.50	61.50
BERT based	SelfSuper ${}_{\textrm{BERT}}$ (Li and Zhao 2021)	49.20	64.00
ELECTRA based	Our Reimpl. ELECTRA	57.85	72.17
	SelfSuper (Li and Zhao 2021)	58.00	72.90
	Ours	59.32^∗	72.86
	Human performance	64.30	80.20

Table 2: Results on Molweni.

Tab. 1 shows the results achieved by our method and other baselines on FriendsQA. Baselines listed in the first three rows are all based on BERT (Devlin et al. 2019). We can see that SelfSuper achieves better and competitive results compared with ULM+UOP and KnowledgeGraph, where ULM+UOP considers no speaker information and KnowledgeGraph introduces external knowledge. This indicates the effectiveness of the self-supervised tasks for speaker and key utterance modeling of SelfSuper. When it comes to ELECTRA, the performance reach a new elevated level, which shows that ELECTRA is more suitable for DRC. By comparing with SelfSuper based on ELECTRA, our method can achieve significantly better performance. This improvement shows the advantage of both the higher coverage of answer-contained utterances by our proposed key utterances extractor and better graph representations to consider the question and interlocutor scopes by QuISG.

Results on Molweni are listed in Tab. 2. Our approach still gives new state-of-the-art, especially a significantly improvement on EM scores. However, the absolute improvement is smaller compared to that on FriendsQA. This is mainly for two reasons. First, the baseline results are close to the human performance on Molweni, so the space for improvement is also smaller. Second, Molweni consists of unanswerable questions, which are not the main focus of our work. To see how the unanswerable questions affect the results, we further show the performance of our method and the ELECRTA-based baselines on Molweni-A in Tab. 3, i.e., the subset of Molweni with only answerable questions. We observe that our method still achieves better EM score against SelfSuper and gains slightly better F1 score, which indicates that our method can better deal with questions with answers. As for unanswerable questions, we believe that better performance can be achieved with related-techniques plugged in our method, which we leave to future work.

By comparing the performance gains of our method in FriendsQA and Molweni, we can observe that our method is more significant in FriendsQA. We think the reason may be that (1) our key utterance extractor can cover more answer-contained utterances in FriendsQA, as will be shown in Fig. 3; (2) questions with speaker mention show more frequently in FriendsQA than in Molweni, and therefore QuISG can help achieve better graph representations in FriendsQA. On all accounts, this further demonstrates that our method alleviates the problems that we focus on.

Model	EM	F1
Our Reimpl. ELECTRA	61.02	77.62
SelfSuper (Li and Zhao 2021)	61.13	78.30
Ours	62.54^∗	78.65

Table 3: Results on Molweni-A.

Model	FriendsQA		Molweni
Model	EM	F1	EM	F1
full model	57.79	75.22	59.32	72.86
w/o NodeType	56.79	74.01	58.38	72.75
w/o KeyUttExt	55.87	72.30	58.48	72.10
w/o Q	56.37	73.55	58.20	72.52
w/o SpkScope	57.29	74.26	58.62	72.29
w/o All	53.12	70.05	56.32	71.08

Table 4: Ablation Study.

Ablation Study

To demonstrate the importance of our proposed modules, we adapt ablation study on both datasets. The results are shown in Tab. 4. We study the effects of node type information (NodeType), key utterances extractor and its scaling factor on logits (KeyUttExt); question and question speaker nodes (Q); edges between dialogue word nodes and dialogue speaker nodes to model interlocutor scope (SpkScope). We further remove both KeyUttExt and QuISG, leading to fully connections between every two tokens in dialogues, and apply transformer layers to further process dialogues (w/o All).

By removing NodeType, the performance drops, which demonstrates minding different node behaviors can help better model graph representations. Our method w/o KeyUttExt decreases the performance, which demonstrates that key utterance extractor is a crucial module for our method to find more answer-contained utterances and guides our model to pay more attention to the key part in a dialogue for the question. As for the model w/o KeyUttExt shows more performance drop in FriendsQA, we think the reason may be that dialogues in FriendsQA are much longer than Molweni. Therefore, KeyUttExt can reduce more question unrelated parts of dialogues for further graph modeling in FriendsQA. Removing Q or SpkScope also shows performance decline, which indicates the importance of realizing question and interlocutor scope for DRC. The worse results of w/o Q indicates that questions and speakers mentioned in questions are more important to DRC due to the question-involved particularity of DRC. Replacing KeyUttExt and QuISG with transformer layers even performs worse than ELECTRA, which indicates that the further process of dialogues without speaker and question realized modeling can be redundant.

Accuracy of Utterance Extraction

As we claim that our method covers more answer-contained utterances compared with SelfSuper, in this section, we show the recall of answer-contained utterances by different methods. Besides our method and SelfSuper, we further consider retrieval methods appearing other reading comprehension tasks. As the similarity-based seeker is usually used, we apply the SOTA model SimCSE (Gao, Yao, and Chen 2021) to compute similarity score between utterances and the question. However, directly using top similar utterances produces a extremely low recall. Therefore, we also add utterances around every picked top utterance as key utterances like ours. We consider top 3 similar utterances and its historical 2 and future 2 utterances as key utterances. The results are illustrated in Fig. 3. As shown in Fig. 3, we choosing top 3 units for key utterances selection does not affect the recall a lot and can keep the average size of key utterance set to 4.13 for FriendsQA and 3.61 for Molweni. Compared with our method, SelfSuper achieves undesirable recall for answer-contained utterance extraction, which indicates the efficacy of our method. As for SimCSE, equipped with our enhancement, it can achieve competitive recall to ours. Especially on the dev and test sets of Molweni. However, the average size of key utterance set of SimCSE is 7.73, where the average length of dialogue in Molweni is 8.82. Additionally, SimCSE cannot judge whether a question is answerable and therefore extracts key utterances for every question, leading to a low recall for the training set in Molweni.

To further demonstrate that our method is more suitable for DRC against SimCSE, we run a variant with key utterances sets extracted by SimCSE. The results are shown in Tab. 5. Our method achieves better performance with high coverage of answer-contained utterances and fewer key utterances involved.

Improvement on Speaker-Contained Questions

Method	FriendsQA		Molweni
Method	EM	F1	EM	F1
ours	57.79	75.22	59.32	72.86
SimCSE	57.19	74.36	57.30	72.09

Table 5: Results of our method and the variant with SimCSE searching for key utterances.

As QuISG focuses on the question speaker information and dialogue interlocutor scope modeling, whether it can help answer questions mentioning speaker names is crucial to verify. We illustrate the F1 scores of questions containing different speakers in FriendsQA and questions with or without mentioning speakers in Fig. 4. We can see that SelfSuper outperforms our method only on “Rachel” and is slightly better on “Joey” and “Monica”. Our method can outperform SelfSuper by a great margin on “Ross”, “Phoebe”, “Chandler” and other casts. Specifically, our method can improve the F1 score of questions mentioning speaker by a wider margin compared to questions without speakers. This indicates that our speaker modeling is benefit from our proposed method.

Case Study

As shown in the very beginning of the paper, Fig. 1 provides two cases that SelfSuper fails. On the contrary, attributing to our proposed key utterances extractor and QuISG, our method can answer the two questions correctly.

Conclusion

To cover more answer-contained utterances and make the model realize speaker information in the question and interlocutor scopes in the dialogue for DRC, we propose a new pipeline method. The method firstly adapts a new key utterances extractor with contiguous utterances as a unit for prediction. Based on utterances in the extracted units, a Question-Interlocutor Scope Realized Graph (QuISG) is constructed. QuISG sets question-mentioning speakers as question speaker nodes and connects the speaker node in the dialogue with words from its scope. Our proposed method achieves decent performance on related benchmarks.

References

Bai et al. (2021) Bai, X.; Chen, Y.; Song, L.; and Zhang, Y. 2021. Semantic Representation for Dialogue Modeling. In ACL, 4430–4445. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.342. URL https://doi.org/10.18653/v1/2021.acl-long.342.
Banarescu et al. (2013) Banarescu, L.; Bonial, C.; Cai, S.; Georgescu, M.; Griffitt, K.; Hermjakob, U.; Knight, K.; Koehn, P.; Palmer, M.; and Schneider, N. 2013. Abstract Meaning Representation for Sembanking. In LAW-ID@ACL, 178–186. The Association for Computer Linguistics. URL https://aclanthology.org/W13-2322/.
Choi et al. (2018) Choi, E.; He, H.; Iyyer, M.; Yatskar, M.; Yih, W.; Choi, Y.; Liang, P.; and Zettlemoyer, L. 2018. QuAC: Question Answering in Context. In EMNLP, 2174–2184. Association for Computational Linguistics. doi:10.18653/v1/d18-1241. URL https://doi.org/10.18653/v1/d18-1241.
Clark et al. (2020) Clark, K.; Luong, M.; Le, Q. V.; and Manning, C. D. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In ICLR. OpenReview.net. URL https://openreview.net/forum?id=r1xMH1BtvB.
Cox et al. (2020) Cox, R. A. V.; Kumar, S.; Babcock, M.; and Carley, K. M. 2020. Stance in Replies and Quotes (SRQ): A New Dataset For Learning Stance in Twitter Conversations. CoRR abs/2006.00691. URL https://arxiv.org/abs/2006.00691.
Devlin et al. (2019) Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, 4171–4186. Association for Computational Linguistics. doi:10.18653/v1/n19-1423. URL https://doi.org/10.18653/v1/n19-1423.
Fang et al. (2020) Fang, Y.; Sun, S.; Gan, Z.; Pillai, R.; Wang, S.; and Liu, J. 2020. Hierarchical Graph Network for Multi-hop Question Answering. In EMNLP, 8823–8838. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.710. URL https://doi.org/10.18653/v1/2020.emnlp-main.710.
Gao, Yao, and Chen (2021) Gao, T.; Yao, X.; and Chen, D. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In EMNLP, 6894–6910. Association for Computational Linguistics. doi:10.18653/v1/2021.emnlp-main.552. URL https://doi.org/10.18653/v1/2021.emnlp-main.552.
Ghosal et al. (2019) Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; and Gelbukh, A. F. 2019. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. In EMNLP, 154–164. Association for Computational Linguistics. doi:10.18653/v1/D19-1015. URL https://doi.org/10.18653/v1/D19-1015.
Ishiwatari et al. (2020) Ishiwatari, T.; Yasuda, Y.; Miyazaki, T.; and Goto, J. 2020. Relation-aware Graph Attention Networks with Relational Position Encodings for Emotion Recognition in Conversations. In Webber, B.; Cohn, T.; He, Y.; and Liu, Y., eds., EMNLP, 7360–7370. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.597. URL https://doi.org/10.18653/v1/2020.emnlp-main.597.
Kociský et al. (2018) Kociský, T.; Schwarz, J.; Blunsom, P.; Dyer, C.; Hermann, K. M.; Melis, G.; and Grefenstette, E. 2018. The NarrativeQA Reading Comprehension Challenge. Trans. Assoc. Comput. Linguistics 6: 317–328. doi:10.1162/tacl“˙a“˙00023. URL https://doi.org/10.1162/tacl“˙a“˙00023.
Li and Choi (2020) Li, C.; and Choi, J. D. 2020. Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-based Question Answering. In ACL, 5709–5714. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.505. URL https://doi.org/10.18653/v1/2020.acl-main.505.
Li et al. (2020) Li, J.; Liu, M.; Kan, M.; Zheng, Z.; Wang, Z.; Lei, W.; Liu, T.; and Qin, B. 2020. Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure. In COLING, 2642–2652. International Committee on Computational Linguistics. doi:10.18653/v1/2020.coling-main.238. URL https://doi.org/10.18653/v1/2020.coling-main.238.
Li et al. (2021) Li, J.; Liu, M.; Zheng, Z.; Zhang, H.; Qin, B.; Kan, M.; and Liu, T. 2021. DADgraph: A Discourse-aware Dialogue Graph Neural Network for Multiparty Dialogue Machine Reading Comprehension. In IJCNN, 1–8. IEEE. doi:10.1109/IJCNN52387.2021.9533364. URL https://doi.org/10.1109/IJCNN52387.2021.9533364.
Li and Zhao (2021) Li, Y.; and Zhao, H. 2021. Self- and Pseudo-self-supervised Prediction of Speaker and Key-utterance for Multi-party Dialogue Reading Comprehension. In Findings of EMNLP 2021, 2053–2063. Association for Computational Linguistics. doi:10.18653/v1/2021.findings-emnlp.176. URL https://doi.org/10.18653/v1/2021.findings-emnlp.176.
Liu et al. (2020) Liu, J.; Sui, D.; Liu, K.; and Zhao, J. 2020. Graph-Based Knowledge Integration for Question Answering over Dialogue. In COLING, 2425–2435. International Committee on Computational Linguistics. doi:10.18653/v1/2020.coling-main.219. URL https://doi.org/10.18653/v1/2020.coling-main.219.
Ma, Zhang, and Zhao (2021) Ma, X.; Zhang, Z.; and Zhao, H. 2021. Enhanced Speaker-aware Multi-party Multi-turn Dialogue Comprehension. CoRR abs/2109.04066. URL https://arxiv.org/abs/2109.04066.
Poria et al. (2019) Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; and Mihalcea, R. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In ACL, 527–536. Association for Computational Linguistics. doi:10.18653/v1/p19-1050. URL https://doi.org/10.18653/v1/p19-1050.
Qin et al. (2021) Qin, L.; Li, Z.; Che, W.; Ni, M.; and Liu, T. 2021. Co-GAT: A Co-Interactive Graph Attention Network for Joint Dialog Act Recognition and Sentiment Classification. In AAAI, 13709–13717. AAAI Press. URL https://ojs.aaai.org/index.php/AAAI/article/view/17616.
Qiu et al. (2019) Qiu, L.; Xiao, Y.; Qu, Y.; Zhou, H.; Li, L.; Zhang, W.; and Yu, Y. 2019. Dynamically Fused Graph Network for Multi-hop Reasoning. In ACL, 6140–6150. Association for Computational Linguistics. doi:10.18653/v1/p19-1617. URL https://doi.org/10.18653/v1/p19-1617.
Rajpurkar et al. (2016) Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100, 000+ Questions for Machine Comprehension of Text. In EMNLP, 2383–2392. The Association for Computational Linguistics. doi:10.18653/v1/d16-1264. URL https://doi.org/10.18653/v1/d16-1264.
Reddy, Chen, and Manning (2018) Reddy, S.; Chen, D.; and Manning, C. D. 2018. CoQA: A Conversational Question Answering Challenge. CoRR abs/1808.07042. URL http://arxiv.org/abs/1808.07042.
Sang et al. (2022) Sang, Y.; Mou, X.; Yu, M.; Yao, S.; Li, J.; and Stanton, J. 2022. TVShowGuess: Character Comprehension in Stories as Speaker Guessing. In NAACL, 4267–4287. Association for Computational Linguistics. doi:10.18653/v1/2022.naacl-main.317. URL https://doi.org/10.18653/v1/2022.naacl-main.317.
Shen et al. (2021) Shen, W.; Wu, S.; Yang, Y.; and Quan, X. 2021. Directed Acyclic Graph Network for Conversational Emotion Recognition. In Zong, C.; Xia, F.; Li, W.; and Navigli, R., eds., ACL, 1551–1560. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.123. URL https://doi.org/10.18653/v1/2021.acl-long.123.
Sun et al. (2019) Sun, K.; Yu, D.; Chen, J.; Yu, D.; Choi, Y.; and Cardie, C. 2019. DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension. Trans. Assoc. Comput. Linguistics 7: 217–231. doi:10.1162/tacl“˙a“˙00264. URL https://doi.org/10.1162/tacl“˙a“˙00264.
Talmor et al. (2019) Talmor, A.; Herzig, J.; Lourie, N.; and Berant, J. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In NAACL, 4149–4158. Association for Computational Linguistics. doi:10.18653/v1/n19-1421. URL https://doi.org/10.18653/v1/n19-1421.
Velickovic et al. (2017) Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2017. Graph Attention Networks. CoRR abs/1710.10903. URL http://arxiv.org/abs/1710.10903.
Yang and Choi (2019) Yang, Z.; and Choi, J. D. 2019. FriendsQA: Open-Domain Question Answering on TV Show Transcripts. In SIGdial, 188–197. Association for Computational Linguistics. doi:10.18653/v1/W19-5923. URL https://doi.org/10.18653/v1/W19-5923.
Yang et al. (2018) Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W. W.; Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In EMNLP, 2369–2380. Association for Computational Linguistics. doi:10.18653/v1/d18-1259. URL https://doi.org/10.18653/v1/d18-1259.
Yasunaga et al. (2021) Yasunaga, M.; Ren, H.; Bosselut, A.; Liang, P.; and Leskovec, J. 2021. QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering. In NAACL, 535–546. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.45. URL https://doi.org/10.18653/v1/2021.naacl-main.45.
Yu et al. (2020) Yu, D.; Sun, K.; Cardie, C.; and Yu, D. 2020. Dialogue-Based Relation Extraction. In ACL, 4927–4940. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.444. URL https://doi.org/10.18653/v1/2020.acl-main.444.