
An Understanding-Oriented Robust Machine Reading Comprehension Model

Feiliang Ren ([email protected], 0000-0001-6824-1191), Yongkang Liu (0000-0003-3098-0225), Bochao Li (0000-0003-2897-3886), Shilei Liu (0000-0003-2976-6256), Bingchao Wang (0000-0002-3528-773X), Jiaqi Wang (0000-0001-9306-8757), Chunchao Liu (0000-0002-0028-8425), and Qi Ma (0000-0001-9548-1350). Northeastern University, 11 Wenhua Rd, Heping Qu, Shenyang Shi, China.
Abstract.

Although existing machine reading comprehension models are making rapid progress on many datasets, they are far from robust. In this paper, we propose an understanding-oriented machine reading comprehension model to address three kinds of robustness issues: over sensitivity, over stability, and generalization. Specifically, we first use a natural language inference module to help the model understand the accurate semantic meanings of input questions, so as to address the issues of over sensitivity and over stability. Then, in the machine reading comprehension module, we propose a memory-guided multi-head attention method that further deepens the understanding of the semantic meanings of input questions and passages. Third, we propose a multi-language learning mechanism to address the issue of generalization. Finally, these modules are integrated with a multi-task learning based method. We evaluate our model on three benchmark datasets that are designed to measure models' robustness, including DuReader (robust) and two SQuAD-related datasets. Extensive experiments show that our model can well address the mentioned three kinds of robustness issues, and it achieves much better results than the compared state-of-the-art models on all these datasets under different evaluation metrics, even under some extreme and unfair evaluations. The source code of our work is available at: https://github.com/neukg/RobustMRC.

question & answering, robust machine reading comprehension, over sensitivity, over stability, generalization, memory-guided multi-head attention, multi-task and multi-language learning, DuReader (robust), SQuAD
Copyright: ACM licensed. Journal: TALLIP, Volume 1, Number 1, Article 1, January 2022. DOI: 10.1145/3546190. CCS: Information systems, Question answering.

1. Introduction

Machine reading comprehension (MRC) aims to answer questions by reading given passages (or documents). It is considered one of the core abilities of artificial intelligence (AI) and the foundation of many AI-related applications like next-generation search engines and conversational agents.

At present, MRC is attracting more and more research attention and many novel models have been proposed. On some benchmark datasets like SQuAD (Rajpurkar et al., 2018) and MS MARCO (Nguyen et al., 2016), models like BERT (Devlin et al., 2018) have achieved higher performance than humans. However, recent work (Jia and Liang, 2017; Gan and Ng, 2019; Tang et al., 2021; Zhou et al., 2020; Liu et al., 2020; Wu and Xu, 2020; Si et al., 2020) shows that current MRC test sets tend to overestimate an MRC model's true ability on unseen data, for the following reason: the test set on which an MRC model is evaluated is typically randomly selected from the whole set of collected data and thus follows the same distribution as the training and development sets, while in the real world it is impossible to require unseen data to follow such known distributions. Thus it is necessary to evaluate MRC models on unseen test data to reveal their robustness.

In this study, we focus on the following three kinds of robustness issues as defined by (Tang et al., 2021): (i) the over sensitivity issue, in which semantically invariant text perturbations cause a model's prediction to change when it should not; (ii) the over stability issue, in which the input text is meaningfully changed but the model's prediction does not change, even though it should; (iii) the generalization issue, in which models perform well on in-domain test sets yet perform poorly on out-of-domain test sets. All these issues are widespread in the real world, and they lead to a significant performance drop for most existing state-of-the-art MRC models (Jia and Liang, 2017; Gan and Ng, 2019; Welbl et al., 2020; Tang et al., 2021; Liu et al., 2020). Fig. 1 shows two examples of the over sensitivity and over stability issues respectively, both extracted from the DuReader (robust) dataset (Tang et al., 2021); all the questions in them were really asked by users of the Baidu search engine.

Figure 1. Examples of over sensitivity and over stability (extracted from DuReader (robust)). The words with the same color have the same meaning. All the answers are generated by BERT (large).

Existing methods address the above issues mainly with data augmentation based methods or adversarial training based methods. For example, Gan and Ng (2019) use a neural paraphrasing model to generate multiple paraphrased questions for a given source question that is paired with a set of paraphrase suggestions. Then an MRC model is retrained on the training set into which the paraphrased samples are integrated. However, neither of these two kinds of methods can truly solve the mentioned issues. Essentially, they still focus on making a model "see" as many samples as possible so that the model can make decisions based on the "seen" knowledge. But both the paraphrased and adversarial questions are "generated", so it is very possible that they will not appear in the real world. Besides, many real questions cannot be fully "generated". Taking the questions in Fig. 1 as examples: for the first one, there are many different ways to express "how old" or "die" in Chinese; for the second one, there is only ONE different Chinese CHARACTER between the two questions, but they have completely different semantic meanings. Both examples are difficult to address with existing data augmentation or adversarial training based methods, for the following reasons. For the over sensitivity example, there are too many diverse and flexible ways of expressing a question with the same meaning for them all to be enumerated or paraphrased. As for the over stability example, it cannot be handled by paraphrasing, because the aim of data augmentation methods is to generate paraphrased questions that have the same semantic meaning as the source question. It also cannot be handled by adversarial training based methods, because these methods usually use context words near a wrong answer candidate to generate adversarial examples on which the models are trained (Gan and Ng, 2019). Thus these methods are not capable of distinguishing the slight perturbations between two questions that have a very high string-match similarity.

In contrast, humans can handle all three kinds of robustness issues effectively, mainly because they can precisely understand the semantic meaning of the given text. In fact, this understanding capability is also the key to solving these diverse robustness issues (Jia and Liang, 2017). Inspired by this, we propose an understanding-oriented MRC model that can address the mentioned three kinds of robustness issues well. First, we view both the over sensitivity and over stability issues as a semantic meaning understanding problem, which requires the MRC model to distinguish the semantic meanings of a question and its paraphrased expressions, which may have similar or dissimilar semantic meanings to the source question. To this end, we introduce a natural language inference (NLI) (Ido Dagan, 2006) module to judge whether two input sentences have the same semantic meaning. Second, in the MRC module, we propose a memory-guided multi-head attention method that can better understand the interactions between questions and passages. Third, we propose a multi-language learning mechanism to prevent the model from over-fitting in-domain data and to enhance its generalization ability. We introduce several language-specific MRC datasets to train the model together, which makes the distributions of the training set and the test set completely different. During training, each dataset can be viewed as an adversarial dataset of the others. Accordingly, no single dataset can dominate the training process, and the model is pushed to learn more generalized knowledge for predictions. Finally, these modules are jointly trained in a multi-task learning manner.

We evaluate the proposed method on three benchmark MRC datasets, including DuReader (robust) (Tang et al., 2021) and two SQuAD-related datasets (Gan and Ng, 2019). All of these datasets are designed to measure the capability of an MRC model to address the issues of over sensitivity, over stability and generalization. Extensive experiments show that our model achieves very competitive results on all of these datasets. On DuReader (robust), it outperforms the compared strong baselines by a large margin and ranks No.3 on the final test set leaderboard. On the other two SQuAD-related datasets, it achieves highly competitive results even under two kinds of extreme and unfair evaluations.

2. Related Work

Common MRC Research In the early study of MRC, researchers paid much attention to designing diverse attention methods to mine the interactions between questions and passages. These interactions have been proven very helpful for improving the performance of an MRC model. BiDAF (Seo et al., 2016) is one of the most representative works, where the authors design a context-to-query and query-to-context bi-directional attention method. (Yu et al., 2018) and (Clark and Gardner, 2018) also use a BiDAF-style attention method. Besides, researchers have proposed many other kinds of attention methods. For example, (Cui et al., 2017) design an attention-over-attention model that uses a 2-dimensional similarity matrix between the question and the context words to compute the weighted query-to-context attention. (Wang et al., 2018c) propose a multi-granularity hierarchical attention method. (Hu et al., 2019a) use a self-attention based method.

Recently, two research lines have become dominant in the MRC task, both of which achieve competitive results on many benchmark MRC datasets.

The first is to imitate reading patterns used by humans when designing an MRC model. For example, (Sun et al., 2019b) explicitly use three human reading strategies in their MRC model: (1) back and forth reading, (2) highlighting, and (3) self-assessment. (Wang et al., 2018a) imitate the following human reading pattern: first scan through the whole passage; then, with the question in mind, detect a rough answer span; finally, come back to the question and select the best answer. (Liu et al., 2018b) design their MRC model by simulating the human multi-step reasoning pattern: humans often re-read and re-digest given passages many times before a final answer is found. (Wang et al., 2018b) use an extract-then-select reading strategy. They further regard the candidate extraction as a latent variable and train the two-stage process jointly with reinforcement learning. (Peng et al., 2020) design their MRC model by simulating two ways of human thinking when answering questions: reverse thinking and inertial thinking. (Zhang et al., 2021) imitate the human "read + verify" reading pattern: first read through the full passage along with the question and grasp the general idea, then re-read the full text and verify the answer. Some other researchers (Hu et al., 2019b; Clark and Gardner, 2018; Yan et al., 2019; Wang et al., 2018c) also imitate this "read + verify" reading pattern. Other kinds of human reading patterns have been imitated as well. For example, (Tian et al., 2020) imitate the pattern of restoring a scene according to the text, (Malmaud et al., 2020) imitate the pattern of human gaze during reading comprehension, and (Chen et al., 2020) imitate the pattern of tactically comparing and reasoning over candidates while choosing the best answer.

The second is to use diverse large-scale pre-trained language models like BERT (Devlin et al., 2018) and its many variants, including XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2020). These language models have a strong capacity for capturing contextualized sentence-level language representations (Zhang et al., 2021). With these language models, researchers can design an MRC model very easily, because the language models can be used either as MRC models themselves or as the encoder part of an MRC model, so researchers only need to focus on designing the decoder part. For example, many recent MRC models (Gong et al., 2020; Zheng et al., 2020; Luo et al., 2020; Long et al., 2020; Banerjee et al., 2021; Li et al., 2020b; Zhang et al., 2020; Guo et al., 2020a, b; Li et al., 2020a; Huang et al., 2020; Chen and Wu, 2020) consist only of a language model based encoder module and a carefully designed decoder module.

Robust MRC Research (Jia and Liang, 2017) explore the over sensitivity and over stability issues by testing whether an MRC model can answer paraphrased questions that contain adversarial sentences. Recently, (Náplava et al., 2021) extensively evaluated several state-of-the-art AI-related downstream systems, including some MRC models, for their robustness to input noise. Their experiments show that no published open-source models are robust to the addition of adversarial sentences. Thus, the robustness issues in MRC are attracting more and more research attention, and many novel methods have been proposed. Generally, these existing methods can be classified into the following two kinds.

The first kind is data augmentation based methods, which address the robustness issues by training a model with additional, carefully generated training data. For example, (Wang and Bansal, 2018) augment the training datasets by incorporating adversarial examples and then train the MRC model on the augmented dataset. (Welbl et al., 2020) investigate data augmentation and adversarial training as defenses. (Liu et al., 2020) propose a model-driven approach to generate adversarial examples that can attack given MRC models; they then retrain and strengthen the MRC model using the generated adversarial examples. (Li et al., 2021) introduce a new knowledge distillation method that takes advantage of data augmentation and progressive training for a wide range of AI-related applications, including the MRC task. (Shinoda et al., 2021) focus on question-answer pair generation to mitigate the robustness issue. Unlike most existing methods that aim to improve the quality of synthetic examples, they try to generate multiple diverse question-answer pairs to mitigate the sparsity of training sets and thus improve the robustness of a model. Usually, these methods are simple and effective. However, (Liu et al., 2020) point out that the augmented datasets are often only capable of simulating known types of adversarial examples while ignoring other unobserved types. (Gan and Ng, 2019) further point out that the main deficiency of data augmentation based methods is that the adversarial examples created are unnatural and not expected to be present in the real world. (Rosenberg et al., 2021) draw a similar conclusion that data augmentation based methods cannot address the robustness issues effectively.

The second kind is adversarial training based methods, which explore designing better MRC models to improve robustness. For example, (Liu et al., 2018b) average multiple predictions to improve the model's robustness. (Min et al., 2018) notice that most questions can be answered using only a few sentences, without considering the context of the entire passage, so they design a sentence selector that feeds the minimal set of sentences to the MRC model to answer a question. Their method reduces the risk of adversarial attacks by reducing passage length, and is proven to be robust to adversarial inputs. (Baradaran and Amirkhani, 2021) investigate the effect of an ensemble learning approach on improving the generalization of MRC models. After separately training several base models with different structures on different datasets, these base models are ensembled using weighting and stacking approaches in probabilistic and non-probabilistic settings. Conversely, (Hu et al., 2018) train a robust single model based on ensemble ones through a distillation training approach. They first apply standard knowledge distillation to mimic the output distributions of answer boundaries from an ensemble model, then propose two distillation approaches to further transfer knowledge between the teacher model and the student model. (Bartolo et al., 2021) use synthetic adversarial data generation to make MRC models more robust.

Besides, some researchers explore introducing extra knowledge to address the robustness issues. For example, (Wang and Jiang, 2019) propose a data enrichment method that uses WordNet to extract inter-word semantic connections as general knowledge from each given passage-question pair, and then use the extracted knowledge to assist the attention mechanism in their MRC model. (Zhou et al., 2020) address the over confidence issue and the over sensitivity issue simultaneously with the help of external linguistic knowledge. Specifically, they first incorporate external knowledge to impose different linguistic constraints (entity constraint, lexical constraint, and predicate constraint), and then regularize MRC models through posterior regularization. The linguistic constraints induce more reasonable predictions for both semantically different and semantically equivalent adversarial examples, and posterior regularization provides an effective mechanism to incorporate these constraints. (Wu and Xu, 2020) address the robustness issues from two aspects. First, they enhance the representation of the model by leveraging hierarchical knowledge from external knowledge bases. Second, they introduce an auxiliary unanswerability prediction module and perform multi-task learning with a span prediction task. Some researchers explore using auxiliary tasks to address the robustness issues. For example, (Chen and Durrett, 2021) propose an MRC model based on sub-part alignment. Their basic idea is that if every aspect of the question is well supported by the answer context, then the produced answer should be trustworthy; if not, they suspect that the model is making an incorrect prediction. The sub-parts used are predicates and arguments obtained from a Semantic Role Labeling task.

Natural Language Inference Given two sentences, often called a premise and a hypothesis respectively, Natural Language Inference (NLI), also known as Recognizing Textual Entailment (Ido Dagan, 2006), is usually defined as the task of determining whether the premise has a relation of entailment, neutral, or contradiction with the hypothesis (Ido Dagan, 2006; Zhou and Bansal, 2020; Zylberajch et al., 2021). According to (Ido Dagan, 2006), a premise entails a hypothesis if a human reading the premise would infer that the hypothesis is most likely true. For example, given the premise "iPhone13 has seen strong sales in China." and the hypothesis "Strong sales for iPhone13 in China.", their relation should be entailment; that is, the given premise entails the hypothesis. In contrast, the relation would be contradiction if the hypothesis were "iPhone13 is not popular in China", and neutral if the hypothesis were "Strong sales for iPhone13 in Japan.".

Recently, many large-scale standard datasets have been released, such as SciTail (Khot Tushar, 2018), SNLI (Bowman et al., 2015), and Multi-NLI (Williams et al., 2018). These datasets greatly facilitate the study of NLI, and some state-of-the-art neural models have achieved very competitive performance on them (Zhou and Bansal, 2020; Chen et al., 2021b; Belinkov et al., 2019; Jiang et al., 2021; Meissner et al., 2021). From the definition of NLI we can see that it is based on (and assumes) common human understanding of language as well as common background knowledge, so it has been widely considered an important evaluation measure for language understanding (Ido Dagan, 2006; Williams et al., 2018; Bowman et al., 2015; Zylberajch et al., 2021). Accordingly, more and more researchers apply NLI to diverse downstream applications with the expectation that it will be useful for them. For example, (Welleck et al., 2019) use NLI models to improve the consistency of a dialogue model, where utterances are re-ranked using an NLI model. (Falke et al., 2019) use the entailment predictions of NLI models to re-rank the summaries generated by some state-of-the-art models. (Huang et al., 2021) use NLI models to improve unsupervised commonsense reasoning. (Koreeda and Manning, 2021) use NLI models to assist contract review.

Like our method, (Chen et al., 2021a) also use an NLI model together with an MRC model. However, they aim to train NLI models to evaluate the answers predicted by an MRC model. Specifically, they leverage large pre-trained models and recent prior datasets to construct powerful question conversion and decontextualization modules, which can reformulate question-answer instances as premise-hypothesis pairs with very high reliability. Then, they combine standard NLI datasets with NLI examples automatically derived from MRC training data to train the NLI model.

Essentially, our model makes full use of the advantages of diverse existing research lines, but pays more attention to precisely understanding the semantic meanings of questions and passages.

Figure 2. The Architecture of Our Model.

3. Methodology

In this study, both the over sensitivity and over stability issues are viewed as a semantic meaning understanding problem. On the other hand, from the task definition of NLI we can conclude that if two sentences have the same semantic meaning, they will always be assigned an entailment relation. Inspired by this, we convert the traditional NLI task into the task of judging whether two questions have the same semantic meaning. Concretely, we roughly regard the entailment relation between two sentences as an equivalent alternative to the relation of having the same semantic meaning. That is, we consider two sentences to have an entailment relation if they have the same semantic meaning, and a contradiction relation if they do not. Furthermore, to make our MRC model sensitive to subtle semantic changes, the converted NLI task is combined with the MRC task in a multi-task learning framework. The architecture of our model is shown in Fig. 2. It has three main modules: (i) an encoder module (Fig. 2(b)); (ii) an MRC module (Fig. 2(c)); (iii) an NLI module (Fig. 2(a)).

For convenience of description, we first give some basic notations for the MRC module. Given a passage with $n$ tokens (denoted as $p=\{p_i\}_{i=1}^{n}$) and a question with $m$ tokens (denoted as $q=\{q_j\}_{j=1}^{m}$), an MRC model predicts an answer $a$ that is constrained to be a contiguous span in $p$, i.e., $a=\{p_i\}_{s}^{e}$, where $s$ and $e$ indicate the start and end positions of the answer. The training set of an MRC task can be denoted as $\mathcal{D}^{M}=\{(p^{i},q^{i},a^{i})\}_{i=1}^{M}$, where $M$ is the number of samples. Each sample consists of a passage $p$, a question $q$, and an answer $a$.
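For concreteness, the following is a minimal sketch of how such a training sample might be represented in code; the class and field names are our own illustration and are not part of the original datasets.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MRCSample:
    """One MRC training sample (p, q, a): passage tokens, question tokens,
    and the answer span given as start/end token indices into the passage."""
    passage: List[str]   # p = {p_1, ..., p_n}
    question: List[str]  # q = {q_1, ..., q_m}
    answer_start: int    # s
    answer_end: int      # e (inclusive)

    def answer_tokens(self) -> List[str]:
        # a = {p_i} from position s to e: a contiguous span of the passage
        return self.passage[self.answer_start:self.answer_end + 1]
```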

3.1. Shared Encoder Module

In our model, we use RoBERTa (Liu et al., 2019) as the shared encoder module to generate representations for the inputs of both MRC and NLI. This encoder module outputs a context-aware representation for each token of the input text. Both NLI and MRC take text pairs as input (NLI takes question pairs and MRC takes passage-question pairs), so for simplicity, we use $S_{l}$ and $S_{r}$ to denote the two text parts of an input text pair and denote the output of this encoder module as $\mathbf{H}$, which is computed by Equation 1.

(1) $\mathbf{H}={\rm RB}([{<}CLS{>}\oplus S_{l}\oplus{<}Sp{>}\oplus S_{r}\oplus{<}Sp{>}])$

where RB denotes the RoBERTa model, <CLS> is a padding token, <Sp> is a separator token defined to separate the token sequences of $S_{l}$ and $S_{r}$, and $\oplus$ denotes the concatenation operation.

The padded question representation sequence $\mathbf{H}_{Q}=\{\boldsymbol{h}^{q}_{1},...,\boldsymbol{h}^{q}_{m},\boldsymbol{h}_{<Sp>},\boldsymbol{0}_{1},...,\boldsymbol{0}_{n+1}\}$ and the padded passage representation sequence $\mathbf{H}_{P}=\{\boldsymbol{0}_{1},...,\boldsymbol{0}_{m+1},\boldsymbol{h}^{p}_{1},...,\boldsymbol{h}^{p}_{n},\boldsymbol{h}_{<Sp>}\}$ are used for the subsequent operations in the MRC module. Here $\boldsymbol{0}_{i}$ is a padding vector whose items are all zeros. Obviously, after the padding operations, $\mathbf{H}_{P}$ and $\mathbf{H}_{Q}$ have the same dimension.
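A minimal sketch of this split-and-pad step is given below, assuming a PyTorch tensor H that holds the RoBERTa output for the sequence <CLS>, q_1..q_m, <Sp>, p_1..p_n, <Sp>; dropping the <CLS> position so that both outputs have length m + n + 2 is our own reading of the sequence lengths stated above.

```python
import torch

def build_HQ_HP(H: torch.Tensor, m: int, n: int):
    """Split the encoder output into the padded question sequence H_Q and the
    padded passage sequence H_P described in Section 3.1.  H has shape
    (m + n + 3, d) for <CLS> q_1..q_m <Sp> p_1..p_n <Sp>; both outputs have
    length m + n + 2 (dropping <CLS> here is our assumption)."""
    body = H[1:]                      # drop <CLS>: shape (m + n + 2, d)
    H_Q = torch.zeros_like(body)
    H_P = torch.zeros_like(body)
    H_Q[:m + 1] = body[:m + 1]        # q_1..q_m and the first <Sp>
    H_P[m + 1:] = body[m + 1:]        # p_1..p_n and the trailing <Sp>
    return H_Q, H_P
```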

Note that the notations in Fig. 2(b) (like segment encoding, positional encoding, etc.) have exactly the same meanings as those defined in the original papers of BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019), so one can refer to those papers for detailed information.

3.2. MRC Module

The structure of our MRC module is shown in Fig. 2(c). It consists of $N$ identical computation blocks followed by a multi-layer perceptron (MLP) based output layer. Each computation block has three components: (i) a multi-head self-attention component that computes self-aware representations for the input question and passage, with which the model can "see" the entire context; (ii) a "memory-guided interaction mining" component that mines rich interactions between the question and the passage; and (iii) a softmax based output component.

Our MRC module serves as the decoder on top of RoBERTa; thus, as shown in Fig. 2(c), it has a structure similar to that of the Transformer decoder (Vaswani et al., 2017). The main difference between our module and the original Transformer decoder lies in the second component.

Multi-head Self-Attention Component. We denote the outputs of this component at the $l$-th computation block as $\mathbf{R}_{Q}^{l}$ and $\mathbf{R}_{P}^{l}$, which are the generated vector representations of the input question and passage respectively. They are computed by Equation 2.

(2) $\mathbf{R}_{Q}^{l}={\rm MultiHeadSelfAttention}(\mathbf{H}_{Q});\qquad\mathbf{R}_{P}^{l}={\rm MultiHeadSelfAttention}(\mathbf{H}_{P})$

In the above equation, the computation method for the multi-head self-attention operation is the same as in the Transformer decoder (Vaswani et al., 2017); readers can refer to the original Transformer paper for more details if necessary.

Memory-Guided Interaction Mining Component. Here we design a memory-guided multi-head attention method to mine the interactions between the input question and passage. Its structure is shown in the right part of Fig. 3, where "query" and "key" denote the representations of the question and the passage respectively. During the interaction mining process, both "query" and "key" are dynamically updated under the guidance of "value". "value" stores "memory" information about the input question and passage, and it remains unchanged during the interaction mining process.

Figure 3. Illustration of the Memory-Guided Multi-Head Attention method.

The basic component of this proposed interaction mining method is a "memory-guided attention unit", whose structure is shown in the left part of Fig. 3. This component first computes two kinds of attention: a key-to-query attention (denoted as k2qAtt in Fig. 3) and a query-to-key attention (denoted as q2kAtt in Fig. 3). These two kinds of attention are then integrated with the "value" part.

Formally, we use $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ to denote the vector representations of "query", "key", and "value" respectively. Then the "memory-guided attention unit" computes the attention with Equation 3, where $d$ is the same scaling factor as the one defined in Transformer.

(3) ${\rm MemAtt}\left(\mathbf{Q},\mathbf{K},\mathbf{V}\right)=\left[{\rm softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right)+{\rm softmax}\left(\frac{\mathbf{K}\mathbf{Q}^{\top}}{\sqrt{d}}\right)\right]\mathbf{V}$

We denote $\mathbf{H}=\mathbf{H}_{P}+\mathbf{H}_{Q}$ (here "$+$" is an element-wise add operation; note that, as described in Section 3.1, $\mathbf{H}_{P}$ and $\mathbf{H}_{Q}$ have been padded to the same dimension). Based on $\mathbf{H}$, we define "value" as $\mathbf{H}\mathbf{W}^{V}$. Accordingly, the "memory-guided multi-head attention" at the $l$-th computation block can be written as Equations 4 and 5, where $h$ is the number of heads (we set $h$ to 8 in experiments), $\mathbf{W}_{i}^{Q}$, $\mathbf{W}_{i}^{K}$, $\mathbf{W}_{i}^{V}$, and $\mathbf{W}^{O}$ are trainable parameter matrices for the $i$-th head, and ReLU is the activation function widely used in diverse neural models.

(4) $\mathbf{O}_{mrc}^{l}={\rm ReLU}({\rm Concat}({\rm head}_{1}^{l},...,{\rm head}_{h}^{l})\mathbf{W}^{O})$
(5) ${\rm head}_{i}^{l}={\rm MemAtt}\left(\mathbf{R}_{Q}^{l}\mathbf{W}_{i}^{Q},\mathbf{R}_{P}^{l}\mathbf{W}_{i}^{K},\mathbf{H}\mathbf{W}^{V}\right)$

From the above process we can see that, during training, as the computation blocks iterate, both $\mathbf{Q}$ and $\mathbf{K}$ are repeatedly updated, while $\mathbf{V}$ remains unchanged and is reloaded at each iteration.
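To make Equations 3-5 concrete, the following is a minimal PyTorch sketch of the memory-guided attention unit and its multi-head wrapper. It follows Equation 5 in sharing a single value projection W^V across heads (a per-head W_i^V is also listed among the trainable parameters, so this is one possible reading); tensor shapes and the module name are our own choices.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def mem_att(Q, K, V):
    """Memory-guided attention unit (Equation 3): the query-to-key and
    key-to-query attention weights are summed and applied to the memory V."""
    d = Q.size(-1)
    q2k = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d), dim=-1)
    k2q = F.softmax(K @ Q.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return (q2k + k2q) @ V

class MemoryGuidedMultiHeadAttention(nn.Module):
    """Sketch of Equations 4-5: h heads over the question representation R_Q
    ("query"), the passage representation R_P ("key") and the fixed memory H
    ("value"), followed by concatenation, an output projection and ReLU."""
    def __init__(self, d_model: int, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_head = h, d_model // h
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)

    def _split(self, x):
        # (batch, seq, d_model) -> (batch, h, seq, d_head)
        b, s, _ = x.shape
        return x.view(b, s, self.h, self.d_head).transpose(1, 2)

    def forward(self, R_Q, R_P, H):
        Q = self._split(self.W_Q(R_Q))      # question side ("query")
        K = self._split(self.W_K(R_P))      # passage side ("key")
        V = self._split(self.W_V(H))        # memory ("value"), never updated here
        heads = mem_att(Q, K, V)            # (batch, h, seq, d_head)
        b, _, s, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, s, self.h * self.d_head)
        return F.relu(self.W_O(concat))     # O_mrc^l of Equation 4
```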

Compared with the multi-head attention method used in Transformer (Vaswani et al., 2017), our method makes two important improvements. First, it computes cross attentions that mine interactions in both directions, "query to key" and "key to query", which yields richer and more comprehensive interactions between questions and passages. This process is somewhat like the human reading pattern of understanding the semantic meaning of a question by taking the passage as context, and vice versa, so it is very helpful for understanding the semantic meanings of questions and passages. Second, the whole process is guided by a memory cell $\mathbf{H}$ in which the original information of the question and passage is stored. This is very necessary for an MRC task: without $\mathbf{H}$, the mined interactions would "forget" more and more of the original input information as the computation blocks iterate. However, the original information is the foundation of the interactions, so forgetting it increases the risk that the mined interactions are actually irrelevant to what is needed. In fact, setting a memory cell to store the original input information is in line with the human reading pattern of keeping the input text in mind while finding the answer. Thus our interaction mining method is by nature superior to existing methods like BiDAF (Seo et al., 2016) or the original Transformer decoder (Vaswani et al., 2017), since much existing research has shown that imitating human reading patterns in an MRC model consistently brings performance gains (see the Related Work section for details).

Output Component. As shown in Fig. 2(c), in each computation block the output of the memory-guided interaction mining component is fed into a feed-forward network; we denote the output of the last computation block as $\mathbf{O}_{mrc}^{N}$. Then we perform two softmax based operations to predict the probability of each token in the passage being the start or end position of an answer, as in Equation 6, where $\mathbf{W}_{s}$, $\mathbf{W}_{e}$, $b_{s}$, and $b_{e}$ are learnable parameters.

(6) $P_{s}={\rm softmax}(\mathbf{W}_{s}\mathbf{O}_{mrc}^{N}+b_{s});\qquad P_{e}={\rm softmax}(\mathbf{W}_{e}\mathbf{O}_{mrc}^{N}+b_{e})$

Finally, the following loss function is defined to train the MRC module, where $a^{s}_{i}$ and $a^{e}_{i}$ denote the start and end positions of the answer of the $i$-th sample.

(7) $\mathcal{L}_{MRC}=-\frac{1}{2M}\sum_{i}^{M}\left[\log(P_{s}^{a_{i}^{s}})+\log(P_{e}^{a_{i}^{e}})\right]$
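A small sketch of Equations 6-7 in PyTorch follows; mapping each position to a single logit before the softmax over positions is our reading of Equation 6, and the class name is our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanOutput(nn.Module):
    """Sketch of Equations 6-7: two linear scorers over the block output,
    a softmax over passage positions, and the averaged negative log-likelihood
    of the gold start/end positions."""
    def __init__(self, d_model: int):
        super().__init__()
        self.start_scorer = nn.Linear(d_model, 1)   # W_s, b_s
        self.end_scorer = nn.Linear(d_model, 1)     # W_e, b_e

    def forward(self, O, answer_start, answer_end):
        # O: (batch, seq_len, d_model); answer_start/answer_end: (batch,) gold indices
        start_logits = self.start_scorer(O).squeeze(-1)   # (batch, seq_len)
        end_logits = self.end_scorer(O).squeeze(-1)
        # cross_entropy applies log-softmax over positions, i.e. -log P_s^{a^s}
        loss_s = F.cross_entropy(start_logits, answer_start)
        loss_e = F.cross_entropy(end_logits, answer_end)
        return 0.5 * (loss_s + loss_e)                    # Equation 7, per batch
```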

3.3. NLI Module

As analyzed above, we convert the traditional NLI task into the task of judging whether two questions have the same semantic meaning. Accordingly, it is natural to design a classification based NLI module here. Specifically, the training set for the converted NLI task can be denoted as $\mathcal{D}^{N}=\{(s_{1}^{i},s_{2}^{i},s^{i})\}_{i=1}^{N}$, where $N$ is the number of samples. Each sample consists of the following three items: the first sentence $s_{1}$, the second sentence $s_{2}$, and the answer $s$ ($s\in\{0,1\}$), whose value indicates whether these two sentences have the same semantic meaning or not.

In BERT based models (and models based on other pre-trained language models like RoBERTa), the embedding representation of the padding token <CLS> is believed to contain the global information of the whole input text. Thus, here $\boldsymbol{h}_{<CLS>}$ (the context-aware vector representation of <CLS>; see Section 3.1) is used as a contextualized sentence-level representation of the two input questions. Taking $\boldsymbol{h}_{<CLS>}$ as input, the NLI module uses the following affine function to score the probability of the two input sentences having the same semantic meaning.

(8) $\hat{y}={\rm softmax}\left(\mathbf{w}\boldsymbol{h}_{<CLS>}+b\right)$

where $\mathbf{w}$ and $b$ are learnable parameters.

Finally, the following loss function is defined to train the NLI module, where $y_{i}\in\{0,1\}$ denotes the true label of the $i$-th sample.

(9) $\mathcal{L}_{NLI}=-\frac{1}{N}\sum_{i=1}^{N}[y_{i}\log\hat{y}_{i}+(1-y_{i})\log(1-\hat{y}_{i})]$
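A minimal sketch of Equations 8-9 follows; we implement the binary cross-entropy of Equation 9 with a two-class softmax and cross-entropy, which is equivalent, and the class name is our own.

```python
import torch
import torch.nn as nn

class NLIHead(nn.Module):
    """Sketch of the NLI module: a single affine layer on top of the <CLS>
    representation that predicts whether the two input questions have the
    same semantic meaning (Equation 8), trained with cross-entropy
    (equivalent to the binary cross-entropy in Equation 9)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 2)          # w, b of Equation 8
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, h_cls, labels=None):
        # h_cls: (batch, d_model) representation of <CLS>; labels: (batch,) in {0, 1}
        logits = self.proj(h_cls)
        y_hat = torch.softmax(logits, dim=-1)      # Equation 8
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return y_hat, loss
```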

3.4. Multi-Task and Multi-Language Learning

We use a multi-task learning framework to train the modules of MRC and NLI simultaneously. The whole loss function is defined with Equation 10.

(10) $\mathcal{L}=\alpha\mathcal{L}_{NLI}(\theta_{N},\theta_{S})+\beta\mathcal{L}_{MRC}(\theta_{M},\theta_{S})$

where $\alpha$ and $\beta$ are two hyperparameters, set to 0.5 and 1 respectively in our experiments; $\theta_{S}$ represents the task-independent parameter set shared by the MRC and NLI modules, and $\theta_{M}$ and $\theta_{N}$ are the task-dependent parameter sets of the MRC module and the NLI module respectively.
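For completeness, a minimal sketch of this weighted combination (the function name is our own):

```python
def joint_loss(loss_nli, loss_mrc, alpha: float = 0.5, beta: float = 1.0):
    """Equation 10: weighted combination of the NLI and MRC losses.
    alpha = 0.5 and beta = 1.0 are the values used in our experiments."""
    return alpha * loss_nli + beta * loss_mrc
```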

Algorithm 1 Algorithm of the Multi-Task and Multi-Language Learning.
Input: the NLI dataset $\mathcal{D}^{N}$ and the mixed MRC dataset $\mathcal{D}^{M}=\{\mathcal{D}^{M}_{lg_{1}},\mathcal{D}^{M}_{lg_{2}},...,\mathcal{D}^{M}_{lg_{k}}\}$, where each sub-dataset corresponds to an MRC dataset of a specific language.
1: Preprocessing: compute a Bernoulli distribution based sampling probability for each sub-dataset in the mixed MRC dataset according to the ratio of the number of samples in this sub-dataset to the total number of samples in the mixed MRC dataset.
2: Sampling: randomly select which task to train.
 If NLI is selected, randomly select a sample from $\mathcal{D}^{N}$.
 Else, select a sample from an MRC sub-dataset according to its sampling probability.
3: Generate a batch: repeat the above sampling step until a predefined number of samples has been selected to build a batch $B$.
4: Training: jointly train the MRC and NLI modules on $B$ with the multi-task learning based method.
5: Iteration: repeat steps 2-4 until the predefined number of training epochs is reached.

Besides, as analyzed above, an MRC model tends to perform well on in-domain test sets but poorly on out-of-domain test sets, which is mainly caused by the similar distributions of training sets and test sets. To overcome this issue, we generate a mixed MRC training set that consists of several MRC training sets of different languages. The MRC module is then trained on the samples in this mixed training set. For example, one can form a mixed training set by combining an English MRC training set with a Chinese MRC training set, while still testing the MRC model on an English test set (or a Chinese test set). In this way, the distributions of the training set and the test set are completely different, which pushes the MRC model to learn more generalized knowledge during training because no single language's training set can dominate the training process. Accordingly, the generalization issue is greatly alleviated. We call this new training method multi-language learning; it is combined with the multi-task learning method to form a new multi-task and multi-language learning mechanism.

Specifically, the detailed process of this new learning mechanism is shown in Algorithm 1. The input of the algorithm includes an NLI dataset and a mixed MRC dataset. In the multi-task learning mechanism, the samples of different tasks are randomly selected. However, for the mixed MRC dataset, if its sub-datasets of different languages had the same probability of being used as a data source, the samples in the smaller sub-datasets would be over-trained while the samples in the larger sub-datasets would be under-trained. To overcome this problem, we assign different selection probabilities to the sub-datasets of different languages. Our basic idea is that the larger a sub-dataset is, the more likely its samples should be selected. In our algorithm, we design a Bernoulli distribution based method to compute a sampling probability for each sub-dataset in the mixed MRC dataset, as shown in the Preprocessing step. Based on these probabilities, the samples in the different sub-datasets of the mixed MRC dataset are selected and batched together with the samples in the NLI dataset, and the batched samples are then jointly trained with the multi-task learning mechanism. To demonstrate Algorithm 1 more clearly, we further use Fig. 4 to illustrate the whole learning process with some concrete samples.
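The sampling and batching steps of Algorithm 1 can be sketched as follows; the 50/50 task-selection probability and the data structures are our own illustrative assumptions, since the algorithm only states that the task is selected randomly.

```python
import random

def build_batch(nli_data, mrc_subsets, batch_size, p_nli=0.5):
    """Sketch of Algorithm 1's Sampling and Generate-a-batch steps.
    mrc_subsets maps a language tag to its list of MRC samples; each
    sub-dataset is sampled in proportion to its size, so larger sub-datasets
    contribute more samples per batch."""
    total = sum(len(samples) for samples in mrc_subsets.values())
    langs = list(mrc_subsets.keys())
    weights = [len(mrc_subsets[lg]) / total for lg in langs]

    batch = []
    for _ in range(batch_size):
        if random.random() < p_nli:            # the NLI task is selected
            batch.append(("nli", random.choice(nli_data)))
        else:                                  # an MRC sub-dataset is selected by its probability
            lg = random.choices(langs, weights=weights, k=1)[0]
            batch.append(("mrc", random.choice(mrc_subsets[lg])))
    return batch
```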

It should be noted that our multi-language learning does not focus on making a model "see" samples that are expected to appear during testing, so it is quite different from existing data augmentation based methods. Besides, our multi-language learning is also quite different from existing models that use multi-lingual pre-trained language models (Hsu et al., 2019) or cross-lingual pre-trained language models (Nuo Chen, 2022). Although these existing models are also trained on datasets of a source language and tested on datasets of a target language, they depend heavily on the pre-trained multi-lingual or cross-lingual language models. In contrast, our method does not use any of these kinds of language models. During training, the tokens of different languages learn their own embedding representations, while the model parameters are shared among samples of different languages. If a token cannot be found in the vocabulary of a pre-trained language model, it is regarded as an out-of-vocabulary token, and its embedding is randomly initialized and updated during model training. The study of (Hsu et al., 2019) shows that tokens from different languages might be embedded into the same space with close spatial distributions. Their study further shows that even though only data of a specific language is used during fine-tuning, the embeddings of tokens in another language change accordingly. These results show that multi-language learning does have the capability of pushing the model to learn more generalized knowledge during training, which is very helpful for alleviating the generalization issue.

Figure 4. Illustration of the multi-task and multi-language learning.

4. Experiments

4.1. Experimental Settings

MRC Datasets In this study, we evaluate the performance of our model on the following benchmark MRC datasets, all of which are designed to measure the robustness of an MRC model, including its ability to address the issues of over sensitivity, over stability and generalization.

(i) DuReader (robust). DuReader (robust) (Tang et al., 2021) is a large benchmark Chinese MRC dataset released in the MRC competition of the "2020 Language and Intelligence Challenge" (LIC-2020, https://aistudio.baidu.com/aistudio/competition/detail/28). It is a variant of DuReader (He et al., 2018) and is designed to measure an MRC model's ability to address the issues of over sensitivity, over stability and generalization. Its training and development sets consist of 15K and 1.4K samples respectively. In these two sets, the questions are all real questions issued by users of the Baidu search engine. The passages are extracted from the search results of the Baidu search engine and Baidu Zhidao (a question answering community).

The test set of DuReader (robust) includes four subsets. (i) In-domain subset: the construction method and source of this subset are the same as those of the training and development sets. (ii) Over sensitivity subset: the samples in this subset are randomly sampled from the in-domain subset, and new samples are constructed by paraphrasing the questions; the generated questions are all real questions asked by users of the Baidu search engine. (iii) Over stability subset: the samples in this subset are sampled from the in-domain subset by rules and then annotated by human experts; the passages in this subset are all real passages. (iv) Generalization subset: the samples in this subset have a different distribution from that of the training set; they include real questions and passages extracted from educational and financial texts. These four subsets contain 1.3K, 1.3K, 0.8K, and 1.6K samples respectively.

DuReader (robust) divides its test set into two versions. The first consists of most of the samples in the in-domain subset and a small portion of the samples in the other three robustness subsets. The second consists of all four subsets. For convenience, we denote these two test sets as Test1 and Test2 respectively. It should be noted that DuReader (robust) DOES NOT release the answer-labeled test set, in order to prevent models from achieving overestimated performance via test-set-guided training. Instead, researchers have to submit their results to the competition organizers so that the performance of their models can be evaluated.

(ii) SQuAD-related Datasets. To explore the robustness of MRC models, (Gan and Ng, 2019) create two TEST sets consisting of paraphrased questions generated by taking some questions in the development set of SQuAD 1.0 (Rajpurkar et al., 2018) as source questions. The first test set is a non-adversarial paraphrased set, generated with a neural paraphrasing model trained on a dataset where each sample has the form (source question, multiple paraphrase suggestions); the paraphrased results in this test set are subsequently verified by human annotators. The second is an adversarial paraphrased test set, generated manually by going through question and context pairs from the SQuAD development set and re-writing each question using context words near a confusing answer candidate, if such a candidate exists and suitable nearby context words are available for paraphrasing. We denote these two test sets as SQuAD (Non-Adv-Paraphrased) and SQuAD (Adv-Paraphrased) respectively. The first test set consists of 1,062 questions and the second of 56 questions.

NLI Dataset Here LCQMC (Liu et al., 2018a) (http://icrc.hitsz.edu.cn/info/1037/1146.htm) is used as the training set for the NLI module. LCQMC is a dataset that focuses on the intent matching of two sentences, so it is suitable for training our NLI module. In total, this dataset contains 260,068 question pairs with manual annotations. It is divided into three parts: a training set of 238,766 question pairs, a development set of 8,802 question pairs, and a test set of 12,500 question pairs.

Multi-language Learning Setting In the multi-language learning, we use SQuAD 1.0 as an auxiliary MRC dataset. SQuAD 1.0 is a widely used large-scale English MRC benchmark dataset. We select it mainly for the following two reasons. First, both SQuAD 1.0 and DuReader (robust) are single-passage MRC datasets (in contrast to multi-passage, also called multi-document, MRC datasets that provide multiple passages or documents for each question, like DuReader, MS MARCO, and TriviaQA (web) (Joshi et al., 2017)). Second, the average passage lengths of the two datasets are close. These two characteristics of SQuAD 1.0 allow us to concentrate on the robustness issues rather than on data preprocessing. Besides, there are more than 100,000 questions in SQuAD 1.0, which makes the distributions of the training set and the test set differ greatly, and thus provides an ideal platform to evaluate the generalization of an MRC model.

Other Settings A Chinese RoBERTa (Cui et al., 2019) is used as the shared encoder module. In the subsequent sections, for all the mentioned language models, we use their large versions. The ensemble model is obtained by averaging the prediction probabilities of 4 single models. Exact match (EM) and F1 are used as evaluation metrics. During training, AdamW (Kingma and Ba, 2015) is used to train our model and word embeddings are not updated. Based on the results on the development sets, the learning rate, batch size, and training epochs are set to 0.001, 16, and 3 respectively.
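As a hedged illustration of these settings, an optimizer could be configured as follows; the parameter-name filter used to freeze word embeddings depends on the specific RoBERTa implementation and is an assumption on our part.

```python
import torch

def make_optimizer(model: torch.nn.Module, lr: float = 1e-3):
    """Build the AdamW optimizer with the reported learning rate (0.001);
    batch size 16 and 3 training epochs are handled by the training loop.
    Word embeddings are excluded so that they are not updated during training."""
    trainable = [p for name, p in model.named_parameters()
                 if "word_embeddings" not in name]
    return torch.optim.AdamW(trainable, lr=lr)
```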

Table 1. Main Results on DuReader (robust).

Test1:
Model | F1 | EM
BERT (zh) (Devlin et al., 2018) | 70.74 | 54.45
XLNet (zh) (Yang et al., 2019) | 68.79 | 53.70
RoBERTa (zh) (Cui et al., 2019) | 73.76 | 56.75
Our Model (Single) | 80.39 | 66.55
Our Model (Ensemble) | 82.70 | 68.50

Test2:
Model | F1 | EM
Our Model (Ensemble) | 79.45 | 64.76
Table 2. Ablation Experiments on Test1 of DuReader (robust). "- Memory-Guided Multi-head Attention" means replacing it with the common multi-head attention method used in the Transformer decoder. "zh" denotes that the corresponding model is a Chinese version.

Model | F1 | EM
Our Model (Single) | 80.39 | 66.55
- NLI module | 76.26 | 60.6
- Multi-language learning | 77.47 | 62.8
- Memory-Guided Multi-head Attention | 78.65 | 64.5

4.2. Experiments on DuReader (robust)

Main Results The main results on DuReader (robust) are shown in Table 1. On Test1, our model achieves much better results than all the compared strong baselines. In particular, it clearly outperforms RoBERTa, which itself has outperformed many existing Chinese pre-trained models (like BERT and ERNIE (Sun et al., 2019a)) on various natural language processing tasks, including several Chinese MRC tasks (Cui et al., 2019). These results show that our model is very effective and can better address the robustness issues.

In fact, we participated in the MRC competition of LIC-2020. Our model ranked No.2 on the leaderboard of Test1. On Test2, it achieved very competitive results again, ranking No.3 on the full DuReader (robust) test set leaderboard (we could not compare our model with the top 2 models because we could not find any papers describing their model details in conferences, journals, or on arXiv).

Note that there are no detailed comparison results on Test2. This is because the final competition results of LIC-2020 were decided by the results on Test2, and LIC-2020 adopted a closed evaluation: researchers had to submit their results before a specified deadline, and there was a submission limit per system per day. Thus, researchers usually tried to find the best model based on the results on Test1, and then tried to obtain the best results on Test2 by tuning the selected model. In other words, there was almost no chance for researchers to compare the performance of different models on Test2.

The LIC-2020 competition is now closed, so we cannot make more detailed comparisons with the latest state-of-the-art MRC models on this leaderboard. We leave such comparisons to the subsequent sections.

Ablation Results To demonstrate the contributions of different modules in our model, we conduct ablation experiments on Test1 and the results are shown in Table 2.

From these results we can draw the following conclusions. First, each component of our model helps to improve the performance. Second, the NLI module contributes more than the other two components, which indicates that the key to developing a robust MRC model is to precisely understand the semantic meanings of input questions. Third, introducing an MRC training set of a completely different language helps to improve the robustness of an MRC model. Fourth, the proposed memory-guided multi-head attention is effective for addressing the robustness issues and performs better than the traditional multi-head attention method.

Figure 5. A Case Study (the words with the same color have the same meaning).

Case Study Fig. 5 illustrates a case study of our model. In this example, the semantic meanings of two questions (Q1 is the source question and Q2 is the paraphrased question) are the same. We can see that if we do not use the NLI module, the model (“Single Task Model” in Fig. 5) outputs a wrong answer for the paraphrased question (Q2). In contrast, the full model precisely distinguishes the semantic meanings of these two questions and outputs correct answers for both of them. These results further confirm the effectiveness of the proposed model for addressing the robustness issues.

4.3. Experiments on SQuAD-related Datasets

On SQuAD (Non-Adv-Paraphrased) and SQuAD (Adv-Paraphrased), we evaluate the robustness of our model with the following two kinds of extreme evaluations: (i) testing the model that is designed for DuReader (robust) directly on these two English test sets; and (ii) comparing an English version of our model with models that are retrained by a data augmentation based method.

Obviously, both kinds of evaluations are significantly unfair to our model. For the first kind, although there is a multi-language learning mechanism, most of the modules in our model, including the shared encoder module and the NLI module, are trained on Chinese datasets. For the second kind, our model does not use any additional training data while some of the compared baselines are retrained by the data augmentation based method. However, we think both kinds of evaluations are very meaningful: the first provides an ideal scenario for evaluating the generalization ability of an MRC model because the distributions of the training sets and test sets are completely different, and the second provides an ideal scenario for evaluating the true potential of an MRC model for addressing the robustness issues, because data augmentation based methods are not always available, especially when constructing the augmented data is time-consuming and costly.

In the subsequent experiments, we use the single version of our model for evaluations, and the results of the baselines BiDAF, DrQA, and BERT are directly copied from (Gan and Ng, 2019).

Extreme Evaluation 1. The results of the first kind of extreme evaluation are shown in Tables 3 and 4. On both test sets, our model achieves very competitive results. First, compared with the models that are fully trained under the common settings, our model achieves significantly better results than DrQA and BiDAF, and results close to the fully trained BERT on the robust parts (denoted as P-Questions and A-Questions in Tables 3 and 4) of the two datasets. Second, the source questions (denoted as S-Questions in Tables 3 and 4) in SQuAD (Non-Adv-Paraphrased) and SQuAD (Adv-Paraphrased) form two common MRC datasets, and our model performs well on them too. These results indicate that our model is effective on both the robustness datasets and the common datasets, even under an extreme and unfair evaluation.

It should be noted that (Hsu et al., 2019) leverage the pre-trained multilingual BERT (multi-BERT) in cross-lingual zero-shot reading comprehension; that is, multi-BERT is fine-tuned on data of one language but tested on data of another language. Obviously, this zero-shot setting is similar to the setting of our first kind of extreme evaluation. However, we do not take multi-BERT as a baseline here, mainly because we could not provide a fair platform for comparison: if we fine-tuned multi-BERT on DuReader (robust) and tested it on the mentioned SQuAD-related datasets, it would be unfair to multi-BERT because our model uses the training set of SQuAD in the multi-language learning while multi-BERT does not. This is also the reason why we do not compare our model with other state-of-the-art MRC models under this kind of extreme evaluation.

Table 3. Extreme Evaluation 1 (a): on SQuAD (Non-Adv-Paraphrased). S-Questions and P-Questions refer to the source questions from SQuAD's development set and their corresponding paraphrased questions respectively.

Model | EM (S-Questions) | EM (P-Questions) | F1 (S-Questions) | F1 (P-Questions)
BiDAF (Seo et al., 2016) | 67.8 | 63.84 | 76.85 | 73.51
DrQA (Chen et al., 2017) | 67.33 | 65.25 | 76.25 | 74.25
BERT (Devlin et al., 2018) | 83.62 | 79.85 | 90.78 | 87.63
Our Model | 77.91 | 76.90 | 87.38 | 85.87
Table 4. Extreme Evaluation 1 (b): on SQuAD (Adv-Paraphrased). S-Questions and A-Questions refer to the source questions from SQuAD's development set and their corresponding adversarial paraphrased questions respectively.

Model | EM (S-Questions) | EM (A-Questions) | F1 (S-Questions) | F1 (A-Questions)
BiDAF (Seo et al., 2016) | 75 | 30.36 | 81.55 | 38.3
DrQA (Chen et al., 2017) | 71.43 | 39.29 | 81.02 | 48.94
BERT (Devlin et al., 2018) | 82.14 | 57.14 | 89.31 | 63.18
Our Model | 77.36 | 56.79 | 85.74 | 62.22

Extreme Evaluation 2. Based on the evaluation results on SQuAD (Non-Adv-Paraphrased) and SQuAD (Adv-Paraphrased), (Gan and Ng, 2019) argue that the original training dataset does not contain sufficiently diversely phrased questions, which prevents models from learning to respond correctly to various ways of asking the same question. They further argue that the capability of models to address the robustness issues can be improved by a data augmentation based method. So they use methods similar to those used to create the two mentioned test sets to generate two additional training sets: the first contains 25,000 non-adversarial paraphrased questions, and the second contains 25,000 adversarial paraphrased questions. For simplicity, we denote these two additional training sets as Non-Adv-Additional Data and Adv-Additional Data, and denote the original training set and development set of SQuAD as SQuAD (training) and SQuAD (dev). With these additional data, (Gan and Ng, 2019) report the following four kinds of comparison results to demonstrate the effectiveness of the data augmentation method: (i) on SQuAD (Non-Adv-Paraphrased), the results of models trained with SQuAD (training) versus "SQuAD (training) + Non-Adv-Additional Data"; (ii) on SQuAD (Adv-Paraphrased), the results of models trained with SQuAD (training) versus "SQuAD (training) + Adv-Additional Data"; (iii) on SQuAD (dev), the results of models trained with SQuAD (training) versus "SQuAD (training) + Non-Adv-Additional Data"; (iv) on SQuAD (dev), the results of models trained with SQuAD (training) versus "SQuAD (training) + Adv-Additional Data". However, (Gan and Ng, 2019) do not release these two additional training sets, so we cannot directly compare the performance of our model trained with these additional training sets against their reported results.

Table 5. Extreme Evaluation 2 (a): performance of different models before and after re-training on SQuAD (Non-Adv-Paraphrased) (the left part) and on SQuAD (Adv-Paraphrased) (the right part).
Model | EM Before / After (Non-Adv-Paraphrased) | F1 Before / After (Non-Adv-Paraphrased) | EM Before / After (Adv-Paraphrased) | F1 Before / After (Adv-Paraphrased)
BERT (Devlin et al., 2018) | 79.85 / 80.89 | 87.63 / 88.62 | 57.14 / 69.64 | 63.18 / 73.85
DrQA (Chen et al., 2017) | 65.25 / 67.33 | 74.25 / 75 | 39.29 / 41.07 | 48.94 / 49.86
BiDAF (Seo et al., 2016) | 63.84 / 66.2 | 73.51 / 75.94 | 30.36 / 39.24 | 38.3 / 47.49
ALBERT-large (Lan et al., 2020) | - / 83.24 | - / 89.85 | - / 69.64 | - / 74.68
Retro-Reader (Zhang et al., 2021) | - / 82.20 | - / 89.40 | - / 69.64 | - / 75.04
Our Model | 76.90 / 82.96 | 85.87 / 89.78 | 56.79 / 76.79 | 62.22 / 80.92
Table 6. Extreme Evaluation 2 (b): performance of different models on SQuAD (dev) before and after re-training. For the baselines re-trained with the augmented data, After1 and After2 denote the results after adding the Non-Adv-Additional Data and the Adv-Additional Data to the training data, respectively. For the other baselines and our model, After1 denotes the results of their English versions.
Model | EM Before | EM After1 | EM After2 | F1 Before | F1 After1 | F1 After2
BERT (Devlin et al., 2018) | 84.02 | 83.76 | 83.33 | 91 | 90.88 | 90.49
DrQA (Chen et al., 2017) | 69.04 | 68.74 | 67.93 | 78.38 | 77.86 | 77.45
BiDAF (Seo et al., 2016) | 67.67 | 67.49 | 66.23 | 77.46 | 77.1 | 76.19
ALBERT-large (Lan et al., 2020) | - | 84.56 | - | - | 91.63 | -
Retro-Reader (Zhang et al., 2021) | - | 84.09 | - | - | 91.09 | -
Our Model | 78.42 | 85.50 | - | 87.35 | 92.46 | -

Instead, we make another kind of extreme evaluation by comparing the English version of our model with the models retrained on the mentioned two additional training sets. To this end, we make two modifications to our model. First, we use the Quora Question-pair dataset (https://www.kaggle.com/c/quora-question-pairs/data?select=train.csv.zip) as the training set for the NLI module. Second, we change RoBERTa to its English version. Besides, we also take several state-of-the-art models as baselines. In the experiments, to make a fair comparison, we replace the ALBERT version used in Retro-Reader (Zhang et al., 2021) from xxlarge to large since, as mentioned previously, we use the large versions of all language models.
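For illustration, the first modification only requires turning the Kaggle file into question-pair/label triples for the NLI module, roughly as sketched below. Treating is_duplicate as the "same meaning" label is our own reading of the data, and this loader is only a sketch, not the released preprocessing code.

```python
# Sketch: read the Kaggle Quora Question-Pairs train.csv into NLI-style triples.
import csv

def load_quora_pairs(path: str = "train.csv"):
    """Return (question1, question2, label) triples from the Kaggle file."""
    triples = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # is_duplicate == 1 marks question pairs with the same meaning.
            triples.append((row["question1"],
                            row["question2"],
                            int(row["is_duplicate"])))
    return triples

pairs = load_quora_pairs()
print(len(pairs), pairs[0])
```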

Finally, the comparison results are shown in Tables 5 and 6. We can see that the English version of our model again achieves very competitive results on all three test sets. First, compared with the models retrained by the data augmentation based method, our model performs far better than models like DrQA and BiDAF on both SQuAD (Non-Adv-Paraphrased) and SQuAD (Adv-Paraphrased), and it also outperforms the data-augmentation retrained BERT. Second, compared with the models that likewise cannot use the additional training data, our model achieves highly competitive results on both SQuAD (Non-Adv-Paraphrased) and SQuAD (Adv-Paraphrased): its results are the best or very close to the best. Third, (Gan and Ng, 2019) point out that although the data augmentation based method usually helps models achieve better results on the paraphrased test sets, it tends to cause a slight performance drop on the original development set. However, the results in Table 6 show that on SQuAD (dev), a common MRC dataset, our model achieves better results than all the compared baselines. These results demonstrate an important merit of our model: although it is designed to address the robustness issues, it has almost no negative effect on performance when handling common MRC datasets. In short, our model is very strong and is competent for diverse application scenarios.
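All of the above comparisons are reported in terms of EM and F1. For reference, the sketch below shows the standard word-level SQuAD-style computation of these two metrics; the official scripts additionally take the maximum over multiple gold answers, and evaluation on Chinese datasets such as DuReader (robust) tokenizes differently (typically at the character level), so this is only an illustration.

```python
# Reference sketch of word-level SQuAD-style EM and F1.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))   # 1.0
print(round(f1_score("in Paris, France", "Paris"), 2))   # 0.5
```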

Furthermore, as mentioned above, (Hsu et al., 2019) leverage the pre-trained multi-BERT for cross-lingual zero-shot reading comprehension, a scenario similar to the robustness issues studied here because both aim to evaluate models when the distributions of the training and test sets differ. A natural concern is therefore whether multi-BERT would also perform well in addressing the robustness issues in MRC. To address this concern, we further conduct experiments to examine the performance of multi-BERT (https://github.com/google-research/bert) on these SQuAD-related datasets. Specifically, we fine-tune multi-BERT on SQuAD (training) and test it on the mentioned test sets. We consider this comparison with our model fair because both models use SQuAD (training) and some multi-lingual resources. The results are shown in Table 7. Our model achieves far better results than multi-BERT on all datasets, which indicates that a straightforward multilingual language model based method cannot address the robustness issues in MRC well.

Table 7. Results of multi-BERT. P-Questions and A-Questions refer to the paraphrased questions in SQuAD (Non-Adv-Paraphrased) and adversarial questions in SQuAD (Adv-Paraphrased) respectively.
Model | EM (P-Questions) | F1 (P-Questions) | EM (A-Questions) | F1 (A-Questions) | EM (SQuAD (dev)) | F1 (SQuAD (dev))
multi-BERT | 76.65 | 83.94 | 58.93 | 66.26 | 81.39 | 88.57
Our Model | 82.96 | 89.78 | 76.79 | 80.92 | 85.50 | 92.46

5. Conclusions

In this study, we propose an understanding-oriented MRC model that can well address the issues of over sensitivity, over stability, and generalization. We conduct extensive experiments on three benchmark MRC robustness datasets. The results show that our model achieves consistently better results not only on all of these robustness-oriented MRC datasets but also on some common MRC datasets. Even under some extreme and unfair evaluations, it still achieves much better results than the compared models.

The main novelties of our model are summarized as follows. First, to the best of our knowledge, this is the first work that systematically addresses all three kinds of robustness issues simultaneously at the model level. Second, we propose a memory-guided multi-head attention method that can mine better interactions between questions and passages. Third, we propose a multi-task and multi-language learning based method that integrates the NLI task and the multi-language MRC task, which proves highly effective for addressing the robustness issues in MRC.

Acknowledgements.
This work is supported by the National Science and Technology Major Project (J2019-IV-0002-0069), the National Natural Science Foundation of China (No.61572120), and the Fundamental Research Funds for the Central Universities (No.N181602013).

References

  • Banerjee et al. (2021) Pratyay Banerjee, Tejas Gokhale, and Chitta Baral. 2021. Self-Supervised Test-Time Learning for Reading Comprehension. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 1200–1211. https://doi.org/10.18653/v1/2021.naacl-main.95
  • Baradaran and Amirkhani (2021) Razieh Baradaran and Hossein Amirkhani. 2021. Ensemble Learning-Based Approach for Improving Generalization Capability of Machine Reading Comprehension Systems. (2021).
  • Bartolo et al. (2021) Max Bartolo, Tristan Thrush, Robin Jia, Sebastian Riedel, Pontus Stenetorp, and Douwe Kiela. 2021. Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 8830–8848. https://doi.org/10.18653/v1/2021.emnlp-main.696
  • Belinkov et al. (2019) Yonatan Belinkov, Adam Poliak, Stuart Shieber, Benjamin Van Durme, and Alexander Rush. 2019. Don’t Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 877–891. https://doi.org/10.18653/v1/P19-1084
  • Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 632–642. https://doi.org/10.18653/v1/D15-1075
  • Chen et al. (2021a) Jifan Chen, Eunsol Choi, and Greg Durrett. 2021a. Can NLI Models Verify QA Systems’ Predictions?. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, 3841–3854. https://doi.org/10.18653/v1/2021.findings-emnlp.324
  • Chen and Durrett (2021) Jifan Chen and Greg Durrett. 2021. Robust Question Answering Through Sub-part Alignment. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 1251–1263. https://doi.org/10.18653/v1/2021.naacl-main.98
  • Chen et al. (2020) Wuya Chen, Xiaojun Quan, Chunyu Kit, Zhengcheng Min, and Jiahai Wang. 2020. Multi-choice Relational Reasoning for Machine Reading Comprehension. In Proceedings of the 28th International Conference on Computational Linguistics. 6448–6458.
  • Chen et al. (2021b) Zeming Chen, Qiyue Gao, and Lawrence S. Moss. 2021b. NeuralLog: Natural Language Inference with Joint Neural and Logical Reasoning. In Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, Online, 78–88. https://doi.org/10.18653/v1/2021.starsem-1.7
  • Chen and Wu (2020) Zheng Chen and Kangjian Wu. 2020. ForceReader: a BERT-based Interactive Machine Reading Comprehension Model with Attention Separation. In Proceedings of the 28th International Conference on Computational Linguistics. 2676–2686.
  • Chen et al. (2017) Zheqian Chen, Rongqin Yang, Bin Cao, Zhou Zhao, Deng Cai, and Xiaofei He. 2017. Smarnet: Teaching Machines to Read and Comprehend Like Human. arXiv preprint arXiv:1710.02772 (2017).
  • Clark and Gardner (2018) Christopher Clark and Matt Gardner. 2018. Simple and Effective Multi-Paragraph Reading Comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 845–855.
  • Cui et al. (2019) Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-Training with Whole Word Masking for Chinese BERT. arXiv preprint arXiv:1906.08101 (2019).
  • Cui et al. (2017) Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-Attention Neural Networks for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 593–602.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina N. Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
  • Falke et al. (2019) Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 2214–2220. https://doi.org/10.18653/v1/P19-1213
  • Gan and Ng (2019) Wee Chung Gan and Hwee Tou Ng. 2019. Improving the Robustness of Question Answering Systems to Question Paraphrasing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 6065–6075.
  • Gong et al. (2020) Hongyu Gong, Yelong Shen, Dian Yu, Jianshu Chen, and Dong Yu. 2020. Recurrent Chunking Mechanisms for Long-Text Machine Reading Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6751–6761.
  • Guo et al. (2020a) Shaoru Guo, Yong Guan, Ru Li, Xiaoli Li, and Hongye Tan. 2020a. Incorporating Syntax and Frame Semantics in Neural Network for Machine Reading Comprehension. In Proceedings of the 28th International Conference on Computational Linguistics. 2635–2641.
  • Guo et al. (2020b) Shaoru Guo, Ru Li, Hongye Tan, Xiaoli Li, Yong Guan, Hongyan Zhao, and Yueping Zhang. 2020b. A Frame-based Sentence Representation for Machine Reading Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 891–896.
  • He et al. (2018) Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2018. DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. In Proceedings of the Workshop on Machine Reading for Question Answering. 37–46.
  • Hsu et al. (2019) Tsung-Yuan Hsu, Chi-Liang Liu, and Hung-yi Lee. 2019. Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 5933–5940. https://doi.org/10.18653/v1/D19-1607
  • Hu et al. (2019a) Minghao Hu, Yuxing Peng, Zhen Huang, and Dongsheng Li. 2019a. Retrieve, Read, Rerank: Towards End-to-End Multi-Document Reading Comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2285–2295.
  • Hu et al. (2018) Minghao Hu, Yuxing Peng, Furu Wei, Zhen Huang, Dongsheng Li, Nan Yang, and Ming Zhou. 2018. Attention-Guided Answer Distillation for Machine Reading Comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2077–2086.
  • Hu et al. (2019b) Minghao Hu, Furu Wei, Yuxing Peng, Zhen Huang, Nan Yang, and Dongsheng Li. 2019b. Read + Verify: Machine Reading Comprehension with Unanswerable Questions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6529–6537.
  • Huang et al. (2021) Canming Huang, Weinan He, and Yongmei Liu. 2021. Improving Unsupervised Commonsense Reasoning Using Knowledge-Enabled Natural Language Inference. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, 4875–4885. https://doi.org/10.18653/v1/2021.findings-emnlp.420
  • Huang et al. (2020) Rongtao Huang, Bowei Zou, Yu Hong, Wei Zhang, Ai Ti Aw, and Guodong Zhou. 2020. NUT-RC: Noisy User-generated Text-oriented Reading Comprehension. In Proceedings of the 28th International Conference on Computational Linguistics. 2687–2698.
  • Ido Dagan (2006) Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL Recognising Textual Entailment Challenge. In the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment, MLCW'05, Vol. 3944. 177–190.
  • Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial Examples for Evaluating Reading Comprehension Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2021–2031.
  • Jiang et al. (2021) Zhongtao Jiang, Yuanzhe Zhang, Zhao Yang, Jun Zhao, and Kang Liu. 2021. Alignment Rationale for Natural Language Inference. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 5372–5387. https://doi.org/10.18653/v1/2021.acl-long.417
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1601–1611.
  • Khot Tushar (2018) Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTaiL: A Textual Entailment Dataset from Science Question Answering. In Proceedings of AAAI.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.).
  • Koreeda and Manning (2021) Yuta Koreeda and Christopher Manning. 2021. ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, 1907–1919. https://doi.org/10.18653/v1/2021.findings-emnlp.164
  • Lan et al. (2020) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In ICLR 2020 : Eighth International Conference on Learning Representations.
  • Li et al. (2020b) Dongfang Li, Baotian Hu, Qingcai Chen, Weihua Peng, and Anqi Wang. 2020b. Towards Medical Machine Reading Comprehension with Structural Knowledge and Plain Text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1427–1438.
  • Li et al. (2020a) Hongyu Li, Tengyang Chen, Shuting Bai, Takehito Utsuro, and Yasuhide Kawada. 2020a. MRC Examples Answerable by BERT without a Question Are Less Effective in MRC Model Training. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop. 146–152.
  • Li et al. (2021) Tianda Li, Ahmad Rashid, Aref Jafari, Pranav Sharma, Ali Ghodsi, and Mehdi Rezagholizadeh. 2021. How to Select One Among All ? An Empirical Study Towards the Robustness of Knowledge Distillation in Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, 750–762. https://doi.org/10.18653/v1/2021.findings-emnlp.65
  • Liu et al. (2020) Kai Liu, Xin Liu, An Yang, Jing Liu, Jinsong Su, Sujian Li, and Qiaoqiao She. 2020. A Robust Adversarial Training Approach to Machine Reading Comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8392–8400.
  • Liu et al. (2018a) Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, and Buzhou Tang. 2018a. LCQMC:A Large-scale Chinese Question Matching Corpus. In Proceedings of the 27th International Conference on Computational Linguistics. 1952–1962.
  • Liu et al. (2018b) Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018b. Stochastic Answer Networks for Machine Reading Comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1694–1704.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019).
  • Long et al. (2020) Siyu Long, Ran Wang, Kun Tao, Jiali Zeng, and Xinyu Dai. 2020. Synonym Knowledge Enhanced Reader for Chinese Idiom Reading Comprehension. In Proceedings of the 28th International Conference on Computational Linguistics. 3684–3695.
  • Luo et al. (2020) Huaishao Luo, Yu Shi, Ming Gong, Linjun Shou, and Tianrui Li. 2020. MaP: A Matrix-based Prediction Approach to Improve Span Extraction in Machine Reading Comprehension. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. 687–695.
  • Malmaud et al. (2020) Jonathan Malmaud, Roger Levy, and Yevgeni Berzak. 2020. Bridging Information-Seeking Human Gaze and Machine Reading Comprehension. In Proceedings of the 24th Conference on Computational Natural Language Learning. 142–152.
  • Meissner et al. (2021) Johannes Mario Meissner, Napat Thumwanit, Saku Sugawara, and Akiko Aizawa. 2021. Embracing Ambiguity: Shifting the Training Target of NLI Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, Online, 862–869. https://doi.org/10.18653/v1/2021.acl-short.109
  • Min et al. (2018) Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and Robust Question Answering from Minimal Context over Documents. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1725–1735.
  • Náplava et al. (2021) Jakub Náplava, Martin Popel, Milan Straka, and Jana Straková. 2021. Understanding Model Robustness to User-generated Noisy Texts. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021). Association for Computational Linguistics, Online, 340–350. https://doi.org/10.18653/v1/2021.wnut-1.38
  • Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In CoCo@NIPS.
  • Nuo Chen (2022) Nuo Chen, Linjun Shou, Ming Gong, Jian Pei, and Daxin Jiang. 2022. From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension. In Proceedings of AAAI.
  • Peng et al. (2020) Wei Peng, Yue Hu, Luxi Xing, Yuqiang Xie, Jing Yu, Yajing Sun, and Xiangpeng Wei. 2020. Bi-directional Cognitive Thinking Network for Machine Reading Comprehension. In Proceedings of the 28th International Conference on Computational Linguistics.
  • Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2. 784–789.
  • Rosenberg et al. (2021) Daniel Rosenberg, Itai Gat, Amir Feder, and Roi Reichart. 2021. Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, Online, 61–70. https://doi.org/10.18653/v1/2021.acl-short.10
  • Seo et al. (2016) Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional Attention Flow for Machine Comprehension. In ICLR (Poster).
  • Shinoda et al. (2021) Kazutoshi Shinoda, Saku Sugawara, and Akiko Aizawa. 2021. Improving the Robustness of QA Models to Challenge Sets with Variational Question-Answer Pair Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop. Association for Computational Linguistics, Online, 197–214. https://doi.org/10.18653/v1/2021.acl-srw.21
  • Si et al. (2020) Chenglei Si, Ziqing Yang, Yiming Cui, Wentao Ma, Ting Liu, and Shijin Wang. 2020. Benchmarking Robustness of Machine Reading Comprehension Models. arXiv preprint arXiv:2004.14004 (2020).
  • Sun et al. (2019b) Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019b. Improving Machine Reading Comprehension with General Reading Strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2633–2643.
  • Sun et al. (2019a) Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019a. ERNIE: Enhanced Representation through Knowledge Integration. arXiv preprint arXiv:1904.09223 (2019).
  • Tang et al. (2021) Hongxuan Tang, Hongyu Li, Jing Liu, Yu Hong, Hua Wu, and Haifeng Wang. 2021. DuReader_robust: A Chinese Dataset Towards Evaluating Robustness and Generalization of Machine Reading Comprehension in Real-World Applications. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, Online, 955–963. https://aclanthology.org/2021.acl-short.120
  • Tian et al. (2020) Zhixing Tian, Yuanzhe Zhang, Kang Liu, Jun Zhao, Yantao Jia, and Zhicheng Sheng. 2020. Scene Restoring for Narrative Machine Reading Comprehension. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 3063–3073.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Vol. 30. 5998–6008.
  • Wang and Jiang (2019) Chao Wang and Hui Jiang. 2019. Explicit Utilization of General Knowledge in Machine Reading Comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2263–2272.
  • Wang et al. (2018c) Wei Wang, Ming Yan, and Chen Wu. 2018c. Multi-Granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1705–1714.
  • Wang and Bansal (2018) Yicheng Wang and Mohit Bansal. 2018. Robust Machine Comprehension Models via Adversarial Training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Vol. 2. 575–581.
  • Wang et al. (2018a) Yizhong Wang, Kai Liu, Jing Liu, Wei He, Yajuan Lyu, Hua Wu, Sujian Li, and Haifeng Wang. 2018a. Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1918–1927.
  • Wang et al. (2018b) Zhen Wang, Jiachen Liu, Xinyan Xiao, Yajuan Lyu, and Tian Wu. 2018b. Joint Training of Candidate Extraction and Answer Selection for Reading Comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1715–1724.
  • Welbl et al. (2020) Johannes Welbl, Pasquale Minervini, Max Bartolo, Pontus Stenetorp, and Sebastian Riedel. 2020. Undersensitivity in Neural Reading Comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2020. 1152–1165.
  • Welleck et al. (2019) Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. 2019. Dialogue Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 3731–3741. https://doi.org/10.18653/v1/P19-1363
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 1112–1122. https://doi.org/10.18653/v1/N18-1101
  • Wu and Xu (2020) Zhijing Wu and Hua Xu. 2020. Improving the robustness of machine reading comprehension model with hierarchical knowledge and auxiliary unanswerability prediction. Knowledge Based Systems 203 (2020), 106075.
  • Yan et al. (2019) Ming Yan, Jiangnan Xia, Chen Wu, Bin Bi, Zhongzhou Zhao, Ji Zhang, Luo Si, Rui Wang, Wei Wang, and Haiqing Chen. 2019. A Deep Cascade Model for Multi-Document Reading Comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 7354–7361.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems, Vol. 32. 5753–5763.
  • Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In International Conference on Learning Representations.
  • Zhang et al. (2020) Xuemiao Zhang, Kun Zhou, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Junfei Liu. 2020. Learn with Noisy Data via Unsupervised Loss Correction for Weakly Supervised Reading Comprehension. In Proceedings of the 28th International Conference on Computational Linguistics. 2624–2634.
  • Zhang et al. (2021) Zhuosheng Zhang, Junjie Yang, and Hai Zhao. 2021. Retrospective Reader for Machine Reading Comprehension. In AAAI 2021.
  • Zheng et al. (2020) Bo Zheng, Haoyang Wen, Yaobo Liang, Nan Duan, Wanxiang Che, Daxin Jiang, Ming Zhou, and Ting Liu. 2020. Document Modeling with Graph Attention Networks for Multi-grained Machine Reading Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6708–6718.
  • Zhou et al. (2020) Mantong Zhou, Minlie Huang, and Xiaoyan Zhu. 2020. Robust Reading Comprehension With Linguistic Constraints via Posterior Regularization. IEEE Transactions on Audio, Speech, and Language Processing 28 (2020), 2500–2510.
  • Zhou and Bansal (2020) Xiang Zhou and Mohit Bansal. 2020. Towards Robustifying NLI Models Against Lexical Dataset Biases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 8759–8771. https://doi.org/10.18653/v1/2020.acl-main.773
  • Zylberajch et al. (2021) Hugo Zylberajch, Piyawat Lertvittayakumjorn, and Francesca Toni. 2021. HILDIF: Interactive Debugging of NLI Models Using Influence Functions. In Proceedings of the First Workshop on Interactive Learning for Natural Language Processing. Association for Computational Linguistics, Online, 1–6. https://doi.org/10.18653/v1/2021.internlp-1.1