
PReGAN: Answer Oriented Passage Ranking with Weakly Supervised GAN

Pan Du (Thomson Reuters Labs, Toronto, Canada), Jian-Yun Nie (University of Montreal, Montreal, Canada), Yutao Zhu (University of Montreal, Montreal, Canada), Hao Jiang (Huawei Poisson Lab., Shenzhen, China), Lixin Zou (Tsinghua University, Beijing, China), and Xiaohui Yan (Huawei Poisson Lab., Shenzhen, China)
Abstract.

Beyond topical relevance, passage ranking for open-domain factoid question answering also requires a passage to contain an answer (answerability). While a few recent studies have incorporated some reading capability into a ranker to account for answerability, the ranker is still hindered by the noisy nature of the training data typically available in this area, which treats any passage containing the answer entity as a positive sample. However, the answer entity in a passage is not necessarily mentioned in relation to the given question. To address this problem, we propose PReGAN, an approach to Passage Reranking based on Generative Adversarial Neural networks, which incorporates a discriminator on answerability in addition to a discriminator on topical relevance. The goal is to force the generator to rank higher the passages that are both topically relevant and contain an answer. Experiments on five public datasets show that PReGAN can better rank appropriate passages, which, in turn, boosts the effectiveness of QA systems and outperforms existing approaches that do not use external data.

Passage Ranking, Question Answering, Generative Adversarial Network

1. Introduction

A common type of open-domain question answering (OpenQA) aims to find answers in a large collection of texts (Chen et al., 2017). It generally operates in two steps: finding a limited number of candidate passages using a retrieval method, and performing machine reading on these texts to extract answers. State-of-the-art machine reading methods have produced human-level performance (Zhang et al., 2020) when reading the right passage. However, when reading the passages retrieved for the question, performance drops dramatically (Min et al., 2020; Karpukhin et al., 2020), showing the critical importance of retrieving good candidate passages. Many studies have been devoted to improving the topical relevance of retrieved candidate passages (Lin et al., 2018; Karpukhin et al., 2020; Wang et al., 2018; Lee et al., 2019a), but few have investigated the problem of answerability, i.e., whether a retrieved passage may contain an answer. Both criteria, topical relevance and answerability, are critical in the context of OpenQA. In fact, a passage highly relevant to the question may not necessarily contain an answer, and a passage that contains an answer entity may not be relevant to the question. In both cases, the reader may be misled by these passages into selecting a wrong answer, since it relies heavily on the few passages provided to it. It is thus important that the passages selected for the reader both be relevant and contain an answer.

In this paper, we propose an approach to refine the list of candidate passages according to both criteria through an answer-oriented passage (re-)ranking.

Figure 1. Examples of noisy passages for answering a question from Quasar-T. R and A indicate if the passage is topically relevant and if it contains the answer, respectively. Underlined words are related topic words, and green words are the answer text. Red words are correct but unsupported answers in the passage. We also show the ranking scores computed by DPR and our PReGAN model.

Training an effective reranking process is not trivial due to the problem of noisy training data: in many cases (datasets), we only know the correct answer to a question, but not the correct passages from which it should be extracted. For example, as illustrated in Figure 1 (a sample from the Quasar-T dataset), for the question about the inventor of the helicopter, we only know the right answer "Igor". We can see that P2, P3, and P4 all contain the answer entity. However, P3 is obviously not relevant to the question. Nevertheless, such a passage has commonly been used as positive training data in previous studies. Another typical example is P1, which is highly relevant to the question and contains entities (two persons) that look like the answer. Trickier cases are P2 and P4: both passages are topically relevant and contain the answer entity, but P4 provides support for the answer while P2 does not. Using all the passages containing the answer as positive examples will obviously confuse the subsequent reader. Existing methods for passage ranking do not perform well in these cases. For example, a recent state-of-the-art QA model, DPR (Karpukhin et al., 2020), ranks P2 before P4. Our goal is to rerank the candidates to promote P4, by enhancing the ranker with some capability of machine reading to detect answerability.

The problem of answer-oriented passage ranking has been addressed in a few recent studies, with two typical approaches: incorporating a strong neural mechanism in the first retrieval step so that the question can be naturally expanded (e.g., DPR (Karpukhin et al., 2020)), or adding a reranking step between the retriever and the machine reader (e.g., DSQA (Lin et al., 2018; Choi et al., 2017)). While the dense passage retrieval model (DPR) greatly outperforms BM25-based retrieval, it is known to require large resources to build the index and to determine the candidate passages, which may not be available in real application settings. More importantly, it does not explicitly incorporate the criterion of answerability in its retriever. Our approach is more similar to DSQA (Lin et al., 2018), in which a ranker is used between the retriever and the reader. However, when training the ranker, none of the previous approaches explicitly distinguished true positives from false positives: any passage in which the answer text can be detected (e.g., P3) was used as a positive training sample for the ranker.

To better address the problem, we propose the PReGAN model for Passage Reranking based on Generative Adversarial Networks (GAN) (Goodfellow et al., 2014), which can force the generator to better assimilate truly positive examples. We extend the GAN framework by incorporating two discriminators, one for topical relevance and one for answerability. In so doing, we aim to build a stronger generator, which incorporates some reading capability, to generate passages satisfying both criteria. In particular, it should assign lower scores to passages of low relevance (P3), passages containing incorrect answers (P1) or lacking support for the answer (P2), and boost the right passage (P4), as shown in Figure 2.

The main contributions of this paper are threefold:

(1) We propose a lightweight answer-oriented ranking method to explicitly model answerability in addition to relevance.

(2) We mitigate the risk of noisy training data through a customized minimax gaming mechanism.

(3) Experiments on five public OpenQA datasets demonstrate the utility of combining answerability with relevance in passage ranking, which in turn, boosts the quality of extracted answers.

2. Related Work

We only review the studies most relevant to our work in this section.

Passage ranking is crucial for QA to select better and fewer passages for the reader. The naive approach used in many studies relies on an IR model (Chen et al., 2017; Yang et al., 2019b; Yang et al., 2019c; Nie et al., 2019; Min et al., 2019a; Wolfson et al., 2020; Min et al., 2019b; Asai et al., 2019). However, an IR model will only rank the passages according to their topical relevance to the question, without taking answerability into account. For example, previous studies (Lin et al., 2018) show that a better passage ranking performance on Hits@$N$ may not necessarily lead to better answer extraction for the whole QA system. More recent studies have applied dense vector representations to rank passages, based on which a more complex ranking model can naturally incorporate query expansion (Karpukhin et al., 2020; Gao and Zhang, 2021; Nogueira and Cho, 2019; Humeau et al., 2019; Khattab and Zaharia, 2020; Das et al., 2018). However, the neural retrieval process still focuses on topical relevance only.

Some recent studies have incorporated an intermediate ranking step on the search results, which considers feedback from the reader or relies on some reading capability incorporated into the ranker. Typical examples are R3 (Wang et al., 2018), DSQA (Lin et al., 2018), and Retro-Reader (Zhang et al., 2020). In these approaches, any passage containing the answer text (entity) is used as a positive sample for ranker training, i.e., the ranker is not explicitly trained to denoise the training data, and it relies on another component (usually a computationally expensive external reader) to provide training signals. The GAN-based approach proposed in this paper specifically addresses the problem of noisy data inside the ranker, so as to simultaneously improve the efficiency and effectiveness of the reader by feeding it a smaller number of passages with better answerability.

Generative adversarial networks (Goodfellow et al., 2014) have been widely used, including in IR (Wang et al., 2017; Zou et al., 2018; Zhang, 2018) and recommendation (Bharadhwaj et al., 2018; Cai et al., 2018; Fan et al., 2019; Yuan et al., 2020; Chong et al., 2020). The GAN framework aims at generating the positive samples while maintaining large differences from false positives; in so doing, a stronger generation model can be obtained. Recently, supervised GAN frameworks have also been explored in question answering tasks (Ramakrishnan et al., 2018a; Tang et al., 2018; Lewis and Fan, 2018; Lee et al., 2019b; Patro et al., 2020; Joty et al., 2017; Liu et al., 2020; Yang et al., 2017; Oh et al., 2019), such as community QA, visual QA (Ramakrishnan et al., 2018b), and multi-choice QA. GANs have been utilized to tackle either the class imbalance problem of data (Yang et al., 2019a) or the question clarification problem (Rao and III, 2019) in QA. However, none of them explored the use of GANs for answer-oriented passage ranking. Our study shows that GANs can be effectively used for this purpose.

The traditional GAN framework sets up a minimax competition between one generator and one discriminator. For passage ranking, we incorporate two discriminators, for relevance and answerability. Multiple discriminators have been used in some previous GAN frameworks. For example, Nguyen et al. (2017) used a similar approach to tackle the mode collapse problem in GAN. Others applied a similar structure to improve image quality (Durugkar et al., 2016) or for image-to-image translation (Yi et al., 2017). Our study is the first attempt to use such an extended GAN to cope with multiple criteria in passage ranking for open-domain question answering.

There are also several studies that exploit external resources and large pre-trained models for QA (Lee et al., 2019a; Xiong et al., 2020; Guu et al., 2020). The advantages of external resources and pre-trained language models are obvious: the external resources may contain passages that contain the right answer, or the pre-trained language model may already cover the question, i.e., it is able to generate the right answer for it. In these cases, it is difficult to determine where the improvements observed in QA quality come from: the external resources/pre-trained language model, or the passage ranking strategy using the same source of information. To avoid this confound, we do not include enhancement by external resources or pre-trained language models in this study, although our approach is orthogonal to them and can be combined with them (this will be investigated later).

3. Methodology

Our ranking model aims to re-rank the list of passages returned by some efficient retrieval component, such as BM25, so that the passages that can answer the given question will be highly ranked. We can use any reader to read the proposed passages to locate the answers within them. In the following, we only detail the process of reranking. Notice that our ranking model should be efficient so as not to hinder the efficiency of the whole QA process.

Figure 2. Our framework. $(x,y)$ is a passage-answer pair.

3.1. Ranking Framework

The general architecture of the PReGAN model is illustrated in Figure 2. It is a weakly supervised generative adversarial neural network containing three parts: a rank discriminator (for relevance), an answer discriminator (for answerability), and an answer-oriented rank generator.

The rank discriminator distinguishes the generated rank distributions from the ground-truth rank distribution. The answer discriminator tells whether a generated sample contains the answer to the given question. The rank generator reads the passages under the guidance of the two discriminators and learns to generate the answer-oriented distribution over those passages for the given question. The discriminators therefore guide the reading-based generator through a REINFORCE process from two different perspectives.

At the end of the training, it is expected that passages that are relevant to the question and contain the answer text can be ranked highly. In other words, optimizing the three objectives modeled in the GAN framework will help discard noise and promote good passages that can truly answer the question. The utilization of the GAN framework as an effective denoising means (Tran et al., 2020) is not new, but we show for the first time that it is also particularly adapted to the noisy training environment in OpenQA.

3.2. Overall Objective

Taking the answer-oriented ranking as a distribution over passages for a given question, the objective of the generator is to learn this distribution through the minimax game. The rank generator tries to select passages from which the answer to the given question can be reasoned out (by the reader). The rank discriminator tries to draw a clear distinction between the ground-truth answer passages and the ones selected by the generator. Since the positive training passages are noisy, the rank discriminator alone may mislead the machine reading-based generator. However, if the answer can be extracted from the passage and the answer discriminator confirms the correctness of the answer, the passage selected by the generator should be a good one. This principle is formulated as the following overall objective function of the minimax game:

(1)
\begin{split}
J=\min_{\theta}\max_{\phi,\xi}\sum_{n=1}^{N}\Big(&\mathbb{E}_{d\sim p_{true}(d|q_{n},a)}[\log D_{\phi}^{r}(d|q_{n})]\\
&+\mathbb{E}_{d\sim p_{\theta}(d|q_{n},a)}[\log(1-D_{\phi}^{r}(d|q_{n}))]\\
&-\lambda_{1}\cdot\mathbb{E}_{d\sim p_{\theta}(d|q_{n},a)}[\log D_{\xi}^{a}(d|q_{n})]\\
&+\lambda_{2}\cdot\mathbb{E}_{d\sim p_{true}(d|q_{n},a)}\Big[\log\frac{p_{true}(d|q_{n},a)}{p_{\theta}(d|q_{n},a)}\Big]\Big),
\end{split}

where the generative ranking model with parameters $\theta$ is written as $p_{\theta}(d|q_{n},a)$, and the rank discriminator $D_{\phi}^{r}$, parameterized by $\phi$, estimates the probability that a passage $d$ can answer the question $q_{n}$. The answer discriminator $D_{\xi}^{a}$ with parameters $\xi$ works as a regularizer of the generator, emphasizing passages in which the answer text appears, no matter whether it can answer the question or not. The last term in the objective is another commonly used regularizer (Lin et al., 2018; Wang et al., 2018), pushing the overall ranking score distribution produced by the generator towards the ground-truth distribution by minimizing the KL-divergence between them.
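For concreteness, below is a minimal sketch (our own illustration, not the authors' code) of the ground-truth distribution $p_{true}$ used in the last term, which Section 3.5 defines as uniform over the (noisy) answer-bearing passages, together with the resulting KL regularizer. We assume $p_{\theta}$ is given as a probability vector over the same candidate list with at least one answer-bearing passage.

```python
import torch

def kl_regularizer(p_theta, contains_answer):
    # p_theta: (M,) generator distribution over the M candidate passages
    # contains_answer: (M,) bool mask; p_true puts mass 1/|d+| on each
    # answer-bearing passage and zero elsewhere (see Section 3.5)
    p_true = contains_answer.float() / contains_answer.float().sum()
    mask = contains_answer
    # KL(p_true || p_theta), summed over the support of p_true
    return (p_true[mask] * torch.log(p_true[mask] / p_theta[mask])).sum()
```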

3.3. Rank Discriminator

As shown in Figure 2, the rank discriminator takes as input the passages generated (selected) by the current optimal generator $p_{\theta^{*}}$ together with the observed passages with answers, and predicts the source of the passages. Denoting $D_{\phi}^{r}(d|q_{n})=\sigma(f_{\phi}(d,q_{n}))$, where $\sigma$ is the sigmoid function applied to the rank discriminator score $f_{\phi}$, the loss function of the rank discriminator is:

(2)
\begin{split}
\mathcal{L}_{D_{\phi}^{r}}=-\sum_{n=1}^{N}\Big(&\mathbb{E}_{d\sim p_{true}}[\log(\sigma(f_{\phi}(d,q_{n})))]\\
&+\mathbb{E}_{d\sim p_{\theta^{*}}}[\log(1-\sigma(f_{\phi}(d,q_{n})))]\Big),
\end{split}

where the score function $f_{\phi}$ is implemented as a bi-encoder attention network between passages and questions, and is optimized by stochastic gradient descent.
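A minimal sketch of this loss, assuming $f_{\phi}$ is any scoring network producing raw (pre-sigmoid) scores; this illustrates Equation (2) and is not the released implementation:

```python
import torch
import torch.nn.functional as F

def rank_discriminator_loss(scores_true, scores_gen):
    # scores_true: (B,) f_phi(d, q) for passages d drawn from p_true
    # scores_gen:  (B,) f_phi(d, q) for passages sampled from p_theta*
    # -E_{p_true}[log sigma(f)]  ==  BCE-with-logits against target 1
    loss_true = F.binary_cross_entropy_with_logits(
        scores_true, torch.ones_like(scores_true))
    # -E_{p_theta*}[log(1 - sigma(f))]  ==  BCE-with-logits against target 0
    loss_gen = F.binary_cross_entropy_with_logits(
        scores_gen, torch.zeros_like(scores_gen))
    return loss_true + loss_gen
```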

3.4. Answer Discriminator

The answer discriminator is designed to cooperate with the generator to mitigate the effects of noise in the training data. The answer discriminator only tells whether the answer text appears in a passage for the given question, instead of trying to reason out a result. The generator, on the other hand, makes decisions based on a machine reading comprehension process and cares less about the existence of the answer text in the passages. The idea is that the answer discriminator and the generator make judgments from different perspectives, then cross-check these judgments and guide the training of the generator towards passages on which they mutually agree. The training of the answer discriminator, however, is independent of the generator: it can be trained directly on passage-answer pairs for each question with a binary cross-entropy loss:

\begin{split}
\mathcal{L}_{D_{\xi}^{a}}=-\sum_{n=1}^{N}\Big(\sum_{d\in A^{+}}\log\sigma(f_{\xi}(d,q_{n}))+\sum_{d\in A^{-}}\log(1-\sigma(f_{\xi}(d,q_{n})))\Big),
\end{split}

where, similar to the rank discriminator, the score function $f_{\xi}$ is also implemented as a bi-encoder attention network. The passages for question $q_{n}$ are divided into $A^{+}$ and $A^{-}$ depending on whether a passage contains the answer text or not.
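A sketch of this binary cross-entropy loss, assuming the $A^{+}/A^{-}$ split is derived by exact answer-string matching (illustrative, not the authors' code):

```python
import torch.nn.functional as F

def answer_discriminator_loss(scores, contains_answer):
    # scores: (M,) raw scores f_xi(d, q_n) over a question's candidates
    # contains_answer: (M,) float tensor, 1.0 if d is in A+, 0.0 if in A-
    return F.binary_cross_entropy_with_logits(scores, contains_answer)
```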

3.5. Generator

The generator $p_{\theta}$ aims to minimize the objective, through which it tries to fool the rank discriminator by generating samples that mimic the ground-truth answer distribution. Keeping the two discriminators $f_{\phi^{*}}$ and $f_{\xi^{*}}$ fixed after their maximization, the generator's loss function is as follows:

(3)
\begin{split}
\mathcal{L}_{p_{\theta}}=\sum_{n=1}^{N}\Big(~&\mathbb{E}_{d\sim p_{\theta}}[\log(1-\sigma(f_{\phi^{*}}(d,q_{n})))]\\
&-\lambda_{1}~\mathbb{E}_{d\sim p_{\theta}}[\log\sigma(f_{\xi^{*}}(d,q_{n}))]\\
&-\lambda_{2}~\mathbb{E}_{d\sim p_{true}}[\log p_{\theta}(d|q_{n},a)]\Big),
\end{split}

where minimizing the first term guides the generator $p_{\theta}$ toward generating passages which the rank discriminator considers likely ground-truth passages. Minimizing the second term drives the generator away from generating passages that the answer discriminator dislikes. The third term imposes a regularization on the overall answer distribution produced by the generator, forcing it to stay close to the ground-truth answer distribution. $\lambda_{1}$ and $\lambda_{2}$ are hyper-parameters controlling the impact of the answer discriminator and the distribution regularizer on the generator.

The generator is trained under weak supervision using the noisy training data. It predicts the ranking score of each passage using a simple machine reading comprehension (MRC) component. Specifically, the MRC component takes a question and the candidate passages as input, and produces a joint distribution over the start and end positions of the answer in a passage. It then takes the maximum of the joint probability as the predicted ranking score of the passage with respect to the question. The ranking scores are then normalized into a probability distribution indicating how likely each passage can answer the given question. Similar to $f_{\phi}$ and $f_{\xi}$, the ranking scores are generated by scoring networks, whose details are given in the next section.

From the loss function of the generator, its gradient can be derived as follows (the details of the derivation of Equation (4) are given in Appendix A.1):

(4)
\begin{split}
\nabla_{\theta}\,\mathcal{L}^{G}(q_{n})=&\nabla_{\theta}\mathbb{E}_{d\sim p_{\theta}}[\log(1+\exp(f_{\phi}(d,q_{n})))]\\
&+\lambda_{1}\nabla_{\theta}\mathbb{E}_{d\sim p_{\theta}}[\log(1+\exp(f_{\xi}(d,q_{n})))]\\
&-\lambda_{2}\nabla_{\theta}\mathbb{E}_{d\sim p_{true}}[\log p_{\theta}(d|q_{n},a)]\\
=&\frac{1}{K}\sum_{k=1}^{K}\nabla_{\theta}\log p_{\theta}(d_{k}|q_{n},a)\Big(\log(1+\exp(f_{\phi}(d_{k},q_{n})))\\
&+\lambda_{1}\log(1+\exp(f_{\xi}(d_{k},q_{n})))\Big)\\
&-\lambda_{2}\sum_{m=1}^{M}\frac{p_{true}(d_{m}|q_{n},a)}{p_{\theta}(d_{m}|q_{n},a)},
\end{split}

where $p_{true}=1/|d^{+}|$ ($|d^{+}|$ being the total number of passages containing correct answers) if the passage contains a correct answer, and 0 otherwise. This completes the approximation approach for generator optimization with policy-gradient-based reinforcement learning. The term $\log(1+\exp(f_{\phi}(d,q_{n})))+\lambda_{1}\log(1+\exp(f_{\xi}(d,q_{n})))$ acts as the balanced reward from the rank discriminator and the answer discriminator for the generator $p_{\theta}(d|q_{n},a)$ to select passage $d$ for a given question $q_{n}$.
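Below is a REINFORCE-style sketch of one generator step following the sampled form of Equation (4). This is our own reading rather than the authors' code: we treat the bracketed discriminator term as a reward to be maximized and the regularizer as a negative log-likelihood on the (noisy) ground-truth distribution; the $\lambda_{1}$ and $\lambda_{2}$ defaults follow Section 4.3.

```python
import torch
import torch.nn.functional as F

def generator_surrogate_loss(log_p_theta, f_phi, f_xi, is_true,
                             K=16, lambda1=0.25, lambda2=1.0):
    # log_p_theta: (M,) log p_theta(d|q,a) over all M candidates
    # f_phi, f_xi: (M,) frozen discriminator scores for the same candidates
    # is_true:     (M,) bool mask of answer-bearing passages (p_true support)
    dist = torch.distributions.Categorical(logits=log_p_theta)
    idx = dist.sample((K,))                       # K passages d_k ~ p_theta
    # balanced reward log(1+exp(f)) = softplus(f) from both discriminators
    reward = (F.softplus(f_phi) + lambda1 * F.softplus(f_xi)).detach()
    reinforce = -(log_p_theta[idx] * reward[idx]).mean()
    nll_true = -log_p_theta[is_true].mean()       # distribution regularizer
    return reinforce + lambda2 * nll_true
```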

3.6. Scoring Networks

In this section, we introduce the scoring functions in detail. As can be seen from Figure 2 and Equation (1), the three scoring functions lie at the core of our proposed GAN framework.

Formally, given a question $q=(q^{1},q^{2},\cdots,q^{|q|})$ and $m$ passages returned by some retrieval component, defined as $D=\{d_{1},d_{2},\cdots,d_{m}\}$, where $d_{i}=(d_{i}^{1},d_{i}^{2},\cdots,d_{i}^{|d_{i}|})$ is the $i$-th retrieved passage composed of a sequence of words, answer-oriented passage ranking aims to assign a score $P(d_{i}|q)$ to passage $d_{i}$ measuring the probability that it can answer the question $q$.

For $f_{\phi}$ in the rank discriminator $D_{r}$, we use a bi-encoder framework to build the scoring function $f_{\phi}(d,q)$. Each passage $d_{i}$ and the question $q$ are encoded with an RNN:

\begin{split}
<\hat{d}_{i}^{1},\hat{d}_{i}^{2},\cdots,\hat{d}_{i}^{|d_{i}|}>~&=~\text{RNN}(<d_{i}^{1},d_{i}^{2},\cdots,d_{i}^{|d_{i}|}>),\\
<\hat{q}^{1},\hat{q}^{2},\cdots,\hat{q}^{|q|}>~&=~\text{RNN}(<q^{1},q^{2},\cdots,q^{|q|}>),
\end{split}

where $\hat{d}_{i}^{j}$ and $\hat{q}^{j}$ are the hidden representations of the words in passage $d_{i}$ and question $q$, encoding their context through the RNN. We use a single-layer bidirectional LSTM (biLSTM) as our RNN unit. Following (Lin et al., 2018), the final representation of the question is obtained through a self-attention operation $\hat{q}=\sum_{j=1}^{|q|}\alpha^{j}\hat{q}^{j}$, where $\alpha^{j}=\text{softmax}(w\hat{q}^{j})$ is the importance of each word in the question and $w$ is a learned weight vector. Then the score of each passage is obtained via a max-pooling layer and a softmax layer:

f_{\phi}(d_{i},q)=p(d_{i}|q)=\text{softmax}(\max_{j}(\hat{d}_{i}^{j}W\hat{q})),

where $W$ is a weight matrix to be learned.
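The following sketch illustrates this bi-encoder scorer in PyTorch, with the layer sizes from Section 4.3 (single-layer biLSTM, hidden size 128). It is an illustration under our assumptions (e.g., one biLSTM shared between passage and question encoding, for brevity), not the released code:

```python
import torch
import torch.nn as nn

class BiEncoderScorer(nn.Module):
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True,
                           bidirectional=True)
        self.w = nn.Linear(2 * hidden, 1, bias=False)  # self-attention weights
        self.W = nn.Parameter(torch.randn(2 * hidden, 2 * hidden) * 0.01)

    def forward(self, passages, question):
        # passages: (m, L, emb_dim) candidates; question: (1, Lq, emb_dim)
        d_hat, _ = self.rnn(passages)                         # (m, L, 2h)
        q_hat, _ = self.rnn(question)                         # (1, Lq, 2h)
        alpha = torch.softmax(self.w(q_hat).squeeze(-1), -1)  # (1, Lq)
        q_vec = (alpha.unsqueeze(-1) * q_hat).sum(1).squeeze(0)  # (2h,)
        # bilinear interaction with the question, max-pooled over positions
        scores = (d_hat @ self.W @ q_vec).max(dim=1).values   # (m,)
        return torch.softmax(scores, dim=0)   # distribution over m passages
```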

For $f_{\xi}$ in the answer discriminator $D_{a}$, even though it serves a different purpose with a different objective function from the rank discriminator $D_{r}$, the network structure is similar, except that the last layer is a sigmoid function instead of a softmax function.

Finally, for $f_{\theta}$ in the rank generator $G$: similar to $f_{\phi}$, we encode the passage $d_{i}$ into a sequence of hidden vectors and apply a self-attention layer over the question vectors to obtain a question embedding. Different from the discriminators, the generator scores a passage by predicting the probabilities of the start and end positions of an answer span in the passage. The probability of position $j$ being the start position of the answer span in passage $d_{i}$ is calculated by $p_{s}^{j}=\text{softmax}(\hat{d}_{i}^{j}W_{s}\hat{q})$ with parameter $W_{s}$, and $p_{e}^{j}=\text{softmax}(\hat{d}_{i}^{j}W_{e}\hat{q})$ for the end position $p_{e}^{j}$ using $W_{e}$. The score of passage $d_{i}$ is then obtained as the maximum of the joint probability of the start and end positions:

f_{\theta}(d_{i},q)=p_{\theta}(a|q,d_{i})=\max_{j,k}\,p_{s}^{j}(a|q,d_{i})\,p_{e}^{k}(a|q,d_{i}),
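A sketch of this span-based score for one passage, reusing the shapes of the bi-encoder sketch above. We additionally mask spans with start > end, a standard MRC detail the formula leaves implicit; again this is illustrative, not the authors' code:

```python
import torch

def span_score(d_hat, q_vec, W_s, W_e):
    # d_hat: (L, 2h) passage word states; q_vec: (2h,) question vector
    # W_s, W_e: (2h, 2h) bilinear parameters for start / end positions
    p_start = torch.softmax(d_hat @ W_s @ q_vec, dim=0)  # (L,)
    p_end = torch.softmax(d_hat @ W_e @ q_vec, dim=0)    # (L,)
    joint = p_start.unsqueeze(1) * p_end.unsqueeze(0)    # (L, L) joint probs
    joint = torch.triu(joint)                            # keep start <= end
    return joint.max()                                   # f_theta(d_i, q)
```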

Our scoring functions are inspired by (Lin et al., 2018). We opt for this design because of its high efficiency, which our ranking requires. Other scoring schemes, such as transformer-based approaches, could also be adopted in our framework in the future.

3.7. Ranking Algorithm

With all the components in place, the overall ranking algorithm is summarized in Algorithm 1. Before the adversarial training, we initialize the three components with random parameter weights. The answer discriminator, rank discriminator, and generator are then pre-trained using the noisy training data.

From the ranked passages, the top-K (K=50 in our experiments) are passed to the subsequent reader to determine the answer span.

We use a BERT-based reader architecture in this paper, following (Karpukhin et al., 2020) (code at https://github.com/facebookresearch/DPR). Let $d_{i}\in\mathbb{R}^{L\times h}$ ($1\leq i\leq k$) be the BERT representation of the $i$-th passage, where $L$ is the maximum length of the passage and $h$ is the hidden dimension (we assume the reader is familiar with the BERT training process; more details can be found in (Karpukhin et al., 2020)). The probability of an answer span $(s,e)$ in passage $d_{i}$ is defined as:

\begin{split}
P(s,e,i)&=P(d_{i})\cdot P(s|d_{i})\cdot P(e|d_{i}),\\
P(s|d_{i})&=\text{softmax}(d_{i}w_{start})_{s},\\
P(e|d_{i})&=\text{softmax}(d_{i}w_{end})_{e},\\
P(d_{i})&=\text{softmax}(\hat{D}^{T}w_{doc})_{i},
\end{split}

where $\hat{D}=[d_{1}^{[CLS]},\cdots,d_{k}^{[CLS]}]\in\mathbb{R}^{h\times k}$, with $d_{i}^{[CLS]}$ the BERT-based sentence representation (12 transformer layers, hidden size 768), and $w_{start}$, $w_{end}$, $w_{doc}\in\mathbb{R}^{h}$ are learnable parameters. Notice that the final answer probability $P(s,e,i)$ also uses the passage ranking score $P(d_{i})$, so that the reader trusts more the answers from highly ranked passages.
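A sketch of how the reader combines these three distributions to pick the best answer span, assuming the BERT token representations and [CLS] vectors are precomputed (illustrative only):

```python
import torch

def best_answer_span(reps, cls, w_start, w_end, w_doc):
    # reps: (k, L, h) BERT token states; cls: (k, h) [CLS] vectors
    # w_start, w_end, w_doc: (h,) learnable vectors from the text
    p_start = torch.softmax(reps @ w_start, dim=1)   # (k, L) P(s|d_i)
    p_end = torch.softmax(reps @ w_end, dim=1)       # (k, L) P(e|d_i)
    p_doc = torch.softmax(cls @ w_doc, dim=0)        # (k,)   P(d_i)
    # P(s, e, i) = P(d_i) * P(s|d_i) * P(e|d_i)
    joint = p_doc[:, None, None] * p_start[:, :, None] * p_end[:, None, :]
    joint = torch.triu(joint)                        # keep spans with s <= e
    idx = int(joint.flatten().argmax())
    L = joint.shape[1]
    i, rem = divmod(idx, L * L)
    s, e = divmod(rem, L)
    return i, s, e        # best passage index and answer span boundaries
```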

Algorithm 1 Answer-Oriented Ranking Algorithm

Input: generator $f_{\theta}$, rank discriminator $f_{\phi}$, answer discriminator $f_{\xi}$, training dataset $D$

1:  Initialize $f_{\theta}$, $f_{\phi}$, $f_{\xi}$ with random weights $\theta,\phi,\xi$; set num-epochs, d-steps, g-steps.
2:  Pre-train the rank discriminator $f_{\phi}$, answer discriminator $f_{\xi}$, and the generator $f_{\theta}$ using $D$.
3:  while num-epochs $>0$ do
4:     num-epochs --
5:     while g-steps $>0$ do
6:        g-steps --
7:        Sample $K$ passages for each query $q$ by $f_{\theta}$;
8:        Collect rewards from the answer discriminator $f_{\xi}$ for the $K$ passages;
9:        Collect rewards from the rank discriminator $f_{\phi}$ for the $K$ passages;
10:        Update the generator $f_{\theta}$ via the policy gradient in Equation (4).
11:     end while
12:     while d-steps $>0$ do
13:        d-steps --
14:        Sample $K$ passages for each query $q$ by $f_{\theta}$ as negative passages;
15:        Combine negative passages with positive passages in $D$ for each question $q$;
16:        Update the rank discriminator $f_{\phi}$ via Equation (2).
17:     end while
18:  end while

4. Experiments

4.1. Datasets and Evaluation Metrics

We evaluate our ranking model on five commonly-used datasets for OpenQA tasks:

Quasar-T (Dhingra et al., 2017) contains 43,012 trivia questions, each with 100 passages retrieved from the ClueWeb09 corpus using Lucene.

SearchQA (Dunn et al., 2017) is a large-scale OpenQA dataset with question-answer pairs crawled from J! Archive and passages retrieved by Google. It contains 140K question-answer pairs; on average, each question has 49.6 passages.

TriviaQA (Joshi et al., 2017) includes 95K question-answer pairs authored by trivia enthusiasts, with independently gathered evidence passages. Each question has 100 webpages retrieved by the Bing Search API; we only use the first 50 for training the ranker and the reader.

Curated TREC (Baudiš and Šedivỳ, 2015) is based on the benchmark from the TREC QA tasks, which contains 2,180 questions extracted from datasets of TREC 1999, 2000, 2001, and 2002.

Natural Questions (Kwiatkowski et al., 2019) consists of real, anonymized, aggregated questions issued to the Google search engine. The answers are spans in Wikipedia articles identified by annotators. Each question is paired with up to five reference answers.

For Quasar-T, SearchQA, and TriviaQA, we use the same processed paragraphs provided by (Lin et al., 2018). For Curated TREC and Natural Questions, following existing work (Lin et al., 2018; Min et al., 2020), we determine a subset of 50 passages for each question and apply our model to them. (This is done due to our limited computation resources: training our model on a larger set of long passages (webpages) would require more memory than we have.) We call the generated dataset the "Natural Questions Subset" (NQ-Sub). The statistics are shown in Table 1.

Table 1. Statistics of the datasets.
Dataset Train Dev. Test #Passages/Question
Quasar-T 37,012 3,000 3,000 100
SearchQA 99,811 13,893 27,247 ~49.6
TriviaQA 87,291 11,274 10,790 100
Curated TREC 1,353 133 694 Wiki (50)
Natural Questions 79,168 8,757 3,610 Wiki (50)

The ranking results are evaluated by two common metrics:

Hits@$N$ evaluates the ranking results by indicating the proportion of questions for which at least one of the top-$N$ passages contains the answer text. Notice that this measure only provides an approximation of the true ranking quality, as a passage containing the answer text without supporting it would also be counted.

EM measures the percentage of answers found by the whole system (the reader) that exactly match the ground-truth answers.
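For reference, minimal implementations of the two metrics under simple lowercase string matching (actual evaluation scripts may normalize answers differently):

```python
def hits_at_n(ranked_passages_per_q, answers_per_q, n):
    # Fraction of questions with an answer string in the top-n passages.
    hit = 0
    for passages, answers in zip(ranked_passages_per_q, answers_per_q):
        top = " ".join(p.lower() for p in passages[:n])
        if any(a.lower() in top for a in answers):
            hit += 1
    return hit / len(answers_per_q)

def exact_match(predictions, answers_per_q):
    # Fraction of predicted answers that exactly match a reference answer.
    ok = sum(pred.strip().lower() in {a.strip().lower() for a in answers}
             for pred, answers in zip(predictions, answers_per_q))
    return ok / len(predictions)
```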

4.2. Baselines

For comparison, we select four representative ranking-based models as baselines. These models do not exploit external resources.

(1) BM25+BERT: The passages ranked by BM25 are directly submitted to the BERT-based reader. This is intended to test how a basic retriever+reader approach could perform on the datasets.

(2) DPR (Karpukhin et al., 2020) is one of the latest methods proposed in the literature, and it produces state-of-the-art performance. It adopts a two-step retriever-reader framework in which the retriever utilizes a neural matching model. The reader has the same BERT-based network structure as ours. Note that DPR requires large computational resources to run.

(3) DSQA (Lin et al., 2018) adopts a retriever-selector-reader framework similar to ours. However, its denoising is limited to exploiting the score of the reader as a confidence score. No explicit denoising is done within the selector. The comparison with DSQA will reveal the impact of the GAN framework for denoising in ranking.

(4) R3 (Wang et al., 2018) is a reinforced model using a ranker to select the most confident passage to pass to the reader. Interactions are allowed between the ranker and the reader – the reader provides feedback (rewards) to the ranker. This is similar to the principle of our ranker, but we use GAN to discriminate between different passages and force the ranker to meet multiple objectives. In addition, the same neural architecture is used in R3’s reader and ranker, while we employ a simplified reader in our ranker for higher efficiency.

For fairness, we do not include methods such as REALM (Guu et al., 2020) and ORQA (Lee et al., 2019a), which utilize either external resources or expensive additional pre-training tasks, while our method does not. These approaches are orthogonal to ours and can be added on top of ours; we leave this to future work.

4.3. Implementation Details

We tune our model on the development set. Restricted by the computation resources available (one Nvidia GTX Titan X GPU with 12GB VRAM), only limited parameter sets could be explored. The hidden size of the RNN used in the ranker is 128, and the number of RNN layers is set to 1 for both the passage and question encoders. The maximum passage length is set according to the passage length distributions: 150 for Quasar-T, SearchQA, TriviaQA, and Natural Questions, and 350 for Curated TREC. The batch size varies among {2, 4, 8, 16} across datasets, depending on the maximum passage length, in order to fit the model into the GPU RAM. We only use the top-50 passages retrieved by BM25 to train the ranker and the reader. The hyper-parameters $\lambda_{1}$ and $\lambda_{2}$ are set to 0.25 and 1. We follow (Lin et al., 2018) for the setting of other parameters. The complete optimal parameter settings and code will be made public in our GitHub repository later.

4.4. Ranking Results

We first test the ability of our PReGAN ranker to better rank passages containing the answers. The baseline models have been used on different datasets. To compare with them directly, we run two sets of experiments: one on Quasar-T and SearchQA, on which DSQA has been tested, and another on Curated TREC and TriviaQA, on which DPR has been tested. In addition, we also compare DPR with our method on the Natural Questions subset.

At the top of Table 2, we report the ranking results by BM25, DSQA, and PReGAN. It shows that PReGAN outperforms both baselines by a large margin. Recall that the main difference between DSQA’s selector and PReGAN lies in the denoising based on GAN. The improvements hence directly reflect the usefulness of the GAN mechanism.

Table 2. Ranker performance compared with DSQA on datasets Quasar-T and SearchQA, and compared with DPR on Curated TREC, TriviaQA, and NQ-Sub. The best results are in bold. Results marked with \Diamond are cited from (Lin et al., 2018), which does not provide Hits@20 and Hits@50. Results marked with \triangle are cited from (Karpukhin et al., 2020). Results marked with \heartsuit are obtained by running DPR against top-50 passages returned by BM25.
Quasar-T Hits@1 Hits@3 Hits@5 Hits@20 Hits@50
BM25 6.3 10.9 15.2 - -
DSQA 27.7 36.8 42.6 - -
PReGAN 35.2 52.0 59.5 72.3 74.8
SearchQA Hits@1 Hits@3 Hits@5 Hits@20 Hits@50
BM25 13.7 24.1 32.7 - -
DSQA 59.9 69.8 75.5 - -
PReGAN 63.9 83.0 88.8 97.5 99.8
Curated TREC Hits@1 Hits@3 Hits@5 Hits@20 Hits@50
BM25 23.9 45.1 54.3 73.7 82.8
DPR - - - 79.8 -
PReGAN (50) 30.3 51.2 60.8 78.6 82.8
PReGAN (100) 33.3 54.6 64.1 81.0 84.6
TriviaQA Hits@1 Hits@3 Hits@5 Hits@20 Hits@50
BM25 28.1 46.7 56.6 75.6 81.3
DPR - - - 79.4 -
PReGAN (50) 48.9 63.8 70.0 79.2 81.5
PReGAN (100) 48.5 64.1 70.0 79.2 81.7
NQ-Sub Hits@1 Hits@3 Hits@5 Hits@20 Hits@50
BM25 17.6 28.9 35.5 53.1 64.3
DPR (all) 39.3 53.6 58.8 68.9 73.7
DPR (50) 23.6 34.8 42.4 59.7 64.3
PReGAN 24.0 36.7 43.3 58.2 64.3

On Curated TREC and TriviaQA, as the candidate passages are not provided in the datasets but retrieved from Wikipedia, we consider two cases: retrieving 50 and 100 candidates with BM25 and submitting them to the ranker. Notice that DPR directly retrieves passages from the whole Wikipedia dump, and thus is not limited to the 50 or 100 candidate passages. Nevertheless, we can see that our ranker produces ranking results similar to DPR's (Hits@20) despite the more limited candidates. On the Natural Questions subset, we run the DPR retriever and our ranker on the top-50 results from BM25. We see that PReGAN can better rank passages at top positions than DPR. However, at lower rank positions (from Hits@20), DPR performs better. This can be attributed to the more sophisticated retrieval model in DPR. In the future, the scoring networks used in our current model could be replaced by more sophisticated ones, as in DPR. Another factor that explains the better ranking of DPR when $N\geq 20$ is that Hits@$N$ is only an approximation of the ranking quality. This measure should be considered together with the EM measure of answers.

The most important observation to make is the improvements of PReGAN over BM25. They reflect the capability of our ranker to favor the passages that may contain an answer. We expect that this will benefit the reader, as we will show.

4.5. Final Answer Results

We adopt a BERT-based reader structure as in (Karpukhin et al., 2020) to extract answers from the selected passages. For each question, the reader produces an answer from the top-50 passages (together with their ranking scores). The results are shown in Table 3.

Table 3. Answer results with exact matching (EM) rate. The best results are in bold. Numbers with * for DSQA and R3 are cited from (Lin et al., 2018), those for DPR are from (Karpukhin et al., 2020). For the Natural Question subset, our model is based on the top-50 passages retrieved by BM25, while DPR uses its own retriever to retrieve top-50 passages.
Quasar-T SearchQA Curated TREC TriviaQA NQ-Sub
BM25+BERT 41.6 57.9 21.3 47.1 26.7
R3 35.3 49.0 28.4 47.3 -
DSQA 42.2 58.8 29.1 48.7 -
DPR - - 28.0 57.0 27.4
PReGAN 45.5 61.2 29.3 60.7 29.5

We can observe that on all the datasets, PReGAN (combined with the DPR reader) leads to consistently better answer results than the baselines. As the BM25 ranking results are submitted to the same reader as PReGAN's, the differences between them are directly attributable to passage ranking. This comparison clearly shows the usefulness of adding a ranker to rerank the BM25 retrieval results.

The comparison with R3 shows the utility of adding discriminators. Recall that R3 also uses rewards from the reader to rerank the retrieval results, but without contrasting different passages. We can see that by adding discriminators, PReGAN is forced to better distinguish truly good passages from noisy ones.

In this experiment, DSQA uses its own selector to rerank the BM25 candidates and its own reader to find the answer, and DPR likewise uses both its retriever to retrieve candidates and its own reader. This setting may play in favor of these models, because their retriever, selector, and reader have been optimized together. More particularly, DPR is allowed to search for top-ranked candidate passages directly from the document collections in Curated TREC (top-100) and Natural Questions (top-50), instead of using the top-50 retrieved with BM25; the quality of these passages is better than that of BM25. Despite this, compared to DSQA and DPR, PReGAN still produces better results. This demonstrates that our QA approach can be competitive against state-of-the-art approaches that have many more parameters. Taking the experiments on passage ranking and answer identification together, we can see that a better passage ranking generally leads to better answers.

4.6. Further Analysis

In this section, we further analyze the impact of the components of PReGAN and the time efficiency.

4.6.1. Influence of the Number of Retrieved Passages

We test with different numbers of initial retrieval results on Curated TREC, on which we can retrieve the desired number of passages. In the experiment, our ranker is asked to select the top-50 passages from the initial retrieval list of variable sizes.

Figure 3 shows how the Hits measures vary with the number of retrieved passages (from 50 to 300). The general observation is that by increasing the size of the initial retrieval results from 50 to 100, all Hits measures improve. However, when we use a larger set of retrieval results, Hits at small $N$ are hurt, while Hits at larger $N$ keep improving. This observation suggests that the initial retrieval results can be increased to a reasonable size; however, an overly large set of initial retrieval results is more likely to include noisy passages, which may make it more difficult for the ranker to determine the right passages. The right choice of the initial retrieval size is an interesting question that we will investigate in the future.

Figure 3. Ranker performance with different numbers of retrieved passages on Curated TREC.

4.6.2. Impact of the Number of Passages to be Read

We select top-KK passages from the ranker, and submit them to the reader to find an answer. We want to test the impact of KK on the answer results.

Figure 4. Reader performance with different numbers of ranked passages on Quasar-T (top) and SearchQA (bottom).

Figure 4 shows the Exact Match results of our system compared with the baseline methods DSQA and BM25+DSQA on Quasar-T and SearchQA. In BM25+DSQA, the passages are ranked by BM25 scores, and we use the reader of DSQA to find the answer.

We can observe that submitting more passages to the reader generally leads to better answers. However, this comes at the cost of higher time complexity due to the expensive machine reading process. A good ranker should be able to select a small number of passages for the reader without penalizing the final results. This is what we observe with PReGAN: by selecting only the top-10 or top-20 passages for the reader, our model can already obtain the best answer results. This further confirms that PReGAN is able to rank good passages on top. We attribute this capability to the GAN mechanism, which helps discriminate between good and bad passages.

Note that BM25+DSQA and DSQA use the same reader and differ only in the additional selector in DSQA. PReGAN uses a different reader. To assess the contribution of the ranker on fair ground, we run another experiment using the same reader.

4.6.3. Effectiveness of Ranker

In this experiment, we test the respective contributions of the ranker and the reader. We compare our method with DSQA. To test the effectiveness of the ranker, we replace the ranker in DSQA by that of PReGAN, but still use the same DSQA reader to find the answer (PReGAN+DSQA). Table 4 shows the results.

Table 4. EM performance of Bi-LSTM-based reader on our ranking results on Quasar-T.
Number of Passages (kk) 1 3 5 10
DSQA 22.6 28.6 32.0 36.5
PReGAN+DSQA 25.9 31.2 34.7 38.8
PReGAN (DPR reader) 28.0 37.1 40.7 43.7

We can see that when the DSQA ranker is replaced by ours in PReGAN+DSQA, the EM measures are improved (over DSQA). This clearly demonstrates that our ranker is more effective than that of DSQA. The comparison between our full method and PReGAN+DSQA shows the impact of a better reader: both use the same list of passages, but PReGAN is followed by a BERT-based reader (similar to DPR's), while DSQA uses a biLSTM-based reader. This shows that the quality of both the ranker and the reader contributes to the global effectiveness of QA, and that they are complementary.

4.6.4. Time Efficiency

As an intermediate step, a ranker should be efficient. To show the time cost of different steps, we report in Table 5 the time required for each step on NQ-Sub, on which we can perform the whole retrieval-ranking-reading process. The experiment is run on a server equipped with one Nvidia GTX Titan X GPU (12GB VRAM), one 12-core Intel Core CPU, and 32GB of RAM.

Table 5. Total time of retriever, ranker, and reader on all questions in NQ-Sub, and the average time per question.
Total (s) Average ($10^{-3}$ s)
k=50 k=100 k=50 k=100
Retriever (BM25) 12 22 3.3 6.1
Ranker (PReGAN) 2 8 0.5 2.2
Reader 207.0 401.0 57.3 111.0

As we can see, the time cost of our ranker is very small compared to the retrieval and reading steps. The reader is the most expensive component: the more passages we submit to it, the more time it takes to find the answers. It is therefore advantageous to submit only a few passages to the reader without altering the quality of the final answers, which our ranker is able to do (see Figure 4).

4.6.5. Effects of Discriminators

We used two discriminators together for different criteria. We show in Table 6 the performance of the ranker without the answer discriminator component (w/o AD). We can see that the HITS measures generally decrease when the answer discriminator is removed. This confirms the utility of this additional discriminator in the GAN framework.

Table 6. Performance of ranker without answer discriminator (AD) on Quasar-T.
Hits@1 Hits@3 Hits@5 Hits@20 Hits@50
PReGAN 35.2 52.0 59.5 72.3 74.8
   w/o AD 33.0 50.1 58.1 72.0 74.8

4.6.6. Case Study

Let us analyze the example given at the beginning of the paper (Figure 1) to see the impact of our ranker on a concrete example from the Quasar-T dataset. We use the DPR retriever and our ranker (on top of BM25 results) to score the candidate passages. In this example, the passage P3 is easily ranked low by both approaches due to its low relevance to the question. The passage P4 is the best passage: it contains the right answer to the question and provides explicit support for it. P2 is topically relevant and also contains the answer, but does not provide support for it. Between P2 and P4, DPR prefers P2 because it contains more question-related words, i.e., it is more topically relevant. Topical relevance is indeed the major criterion used in DPR's retriever, as reflected in its ranking of the passages. On the other hand, our ranker successfully ranks the best passage P4 on top, thanks to its integration of some reading capability. When a reader goes through the two passages, it is expected to find the answer in P4 with a higher probability than in P2. This example makes the expected impact of our ranker concrete; the effect is exactly what we desire.

One could argue that the ranking difference of the desired passage is not so important as long as the passage is submitted to the reader, and we could simply rely on the capability of the reader to find the right answer. This is not true because: 1) ranking a good passage at a lower rank will make it more likely to be cut off (not submitted to the reader), especially when the reader can only read a small number of passages; 2) the final answer from the reader also takes into account the ranking score – the more the ranker is confident about the passage, the more the reader will have confidence about the answer in the passage. So the correct ranking score will directly impact the final answer of the reader.

5. Conclusion

In this paper, we addressed the problem of passage (re-)ranking for open-domain QA. The goal is to create a better and reduced ranking list for the reader. Previous attempts on passage ranking have focused on improving the relevance of the passages (Karpukhin et al., 2020), using explicit interaction between the ranker and the reader (Wang et al., 2018) or incorporating some reading capability into the ranker (Lin et al., 2018). However, given the noisy nature of the training data available, these approaches are hindered by the noise. In this paper, we proposed a model based on a customized GAN framework to support answer-oriented passage ranking. We utilized two discriminators – ranking discriminator and answer discriminator, to guide the reading-based generator to assign high probabilities to passages that are relevant and contain the answer. Experiments on five datasets demonstrated the effectiveness of our proposed method.

The work can be further improved in several respects. First, due to limited computation resources, we were unable to run experiments on large datasets (such as the full set of Natural Questions); we are looking for more powerful computation servers to run larger experiments. Second, our design choice has been inspired by previous studies (namely, DSQA), which use a simpler architecture (biLSTM) than the one commonly used now (based on BERT). More recent approaches are typically based on BERT. We expect that similar improvements will be obtained from the idea of incorporating both relevance and answerability into a ranker, even when a different scoring network is used. This will be tested in the future.

Appendix A Appendix

A.1. Derivation of the Reward Function

The detailed derivation of the reward functions and regularizers is as follows:

(5)
\begin{split}
\nabla_{\theta}\,\mathcal{L}^{G}(q_{n})
=~&\nabla_{\theta}\mathbb{E}_{d\sim p_{\theta}}[\log(1+\exp(f_{\phi}(d,q_{n})))]+\\
&\lambda_{1}\nabla_{\theta}\mathbb{E}_{d\sim p_{\theta}}[\log(1+\exp(f_{\xi}(d,q_{n})))]-\\
&\lambda_{2}\nabla_{\theta}\mathbb{E}_{d\sim p_{true}}[\log p_{\theta}(d|q_{n},a)]\\
=~&\sum_{i=1}^{M}\nabla_{\theta}p_{\theta}(d_{i}|q_{n},a)\log(1+\exp(f_{\phi}(d_{i},q_{n})))+\\
&\lambda_{1}\sum_{i=1}^{M}\nabla_{\theta}p_{\theta}(d_{i}|q_{n},a)\log(1+\exp(f_{\xi}(d_{i},q_{n})))-\\
&\lambda_{2}\sum_{i=1}^{M}\nabla_{\theta}p_{true}(d_{i}|q_{n},a)\log p_{\theta}(d_{i}|q_{n},a)\\
=~&\sum_{i=1}^{M}p_{\theta}(d_{i}|q_{n},a)\nabla_{\theta}\log p_{\theta}(d_{i}|q_{n},a)\log(1+\exp(f_{\phi}(d_{i},q_{n})))+\\
&\lambda_{1}\sum_{i=1}^{M}p_{\theta}(d_{i}|q_{n},a)\nabla_{\theta}\log p_{\theta}(d_{i}|q_{n},a)\log(1+\exp(f_{\xi}(d_{i},q_{n})))-\\
&\lambda_{2}\sum_{i=1}^{M}\nabla_{\theta}p_{true}(d_{i}|q_{n},a)\log p_{\theta}(d_{i}|q_{n},a)\\
=~&\mathbb{E}_{d\sim p_{\theta}}\Big[\nabla_{\theta}\log p_{\theta}(d|q_{n},a)\Big(\log(1+\exp(f_{\phi}(d,q_{n})))+\\
&\lambda_{1}\log(1+\exp(f_{\xi}(d,q_{n})))\Big)\Big]-\lambda_{2}\mathbb{E}_{d\sim p_{true}}\Big[\nabla_{\theta}\log p_{\theta}(d|q_{n},a)\Big]\\
=~&\frac{1}{K}\sum_{k=1}^{K}\nabla_{\theta}\log p_{\theta}(d_{k}|q_{n},a)\Big(\log(1+\exp(f_{\phi}(d_{k},q_{n})))+\\
&\lambda_{1}\log(1+\exp(f_{\xi}(d_{k},q_{n})))\Big)-\lambda_{2}\sum_{m=1}^{M}\frac{p_{true}(d_{m}|q_{n},a)}{p_{\theta}(d_{m}|q_{n},a)}
\end{split}

where the term $\log(1+\exp(f_{\phi}(d,q_{n})))+\lambda_{1}\log(1+\exp(f_{\xi}(d,q_{n})))$ acts as the balanced reward from both the rank discriminator $D_{\phi}^{r}$ and the answer discriminator $D_{\xi}^{a}$.

To reduce variance during the reinforcement learning process, the reward function $f_{rw}$ is, as is common in practice for policy gradients, modified as below:

(6)
\begin{split}
f_{rw}=&\log(1+\exp(f_{\phi}(d,q_{n})))\\
&+\lambda_{1}\log(1+\exp(f_{\xi}(d,q_{n})))\\
&-\mathbb{E}_{d\sim p_{\theta}}\Big[\log(1+\exp(f_{\phi}(d,q_{n})))\\
&+\lambda_{1}\log(1+\exp(f_{\xi}(d,q_{n})))\Big].
\end{split}
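A small sketch of this baseline-subtracted reward, using the mean reward over the sampled passages as the estimate of the expectation (a standard variance-reduction baseline for REINFORCE); tensors follow the generator sketch in Section 3.5:

```python
import torch.nn.functional as F

def baselined_reward(f_phi, f_xi, lambda1=0.25):
    # f_phi, f_xi: (K,) discriminator scores of the K sampled passages
    # softplus(f) = log(1 + exp(f)) is the per-passage reward
    reward = F.softplus(f_phi) + lambda1 * F.softplus(f_xi)
    return reward - reward.mean()  # subtract the baseline (mean reward)
```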

References

  • Asai et al. (2019) Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2019. Learning to retrieve reasoning paths over wikipedia graph for question answering. arXiv preprint arXiv:1911.10470 (2019).
  • Baudiš and Šedivỳ (2015) Petr Baudiš and Jan Šedivỳ. 2015. Modeling of the question answering task in the yodaqa system. In International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, 222–228.
  • Bharadhwaj et al. (2018) Homanga Bharadhwaj, Homin Park, and Brian Y Lim. 2018. RecGAN: recurrent generative adversarial networks for recommendation systems. In Proceedings of the 12th ACM Conference on Recommender Systems. 372–376.
  • Cai et al. (2018) Xiaoyan Cai, Junwei Han, and Libin Yang. 2018. Generative adversarial network based heterogeneous bibliographic network representation for personalized citation recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017. Association for Computational Linguistics (ACL), 1870–1879.
  • Choi et al. (2017) Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 209–220.
  • Chong et al. (2020) Xiaoya Chong, Qing Li, Howard Leung, Qianhui Men, and Xianjin Chao. 2020. Hierarchical Visual-aware Minimax Ranking Based on Co-purchase Data for Personalized Recommendation. In Proceedings of The Web Conference 2020. 2563–2569.
  • Das et al. (2018) Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2018. Multi-step Retriever-Reader Interaction for Scalable Open-domain Question Answering. In International Conference on Learning Representations.
  • Dhingra et al. (2017) Bhuwan Dhingra, Kathryn Mazaitis, and William W Cohen. 2017. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904 (2017).
  • Dunn et al. (2017) Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. Searchqa: A new q&a dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179 (2017).
  • Durugkar et al. (2016) Ishan Durugkar, Ian Gemp, and Sridhar Mahadevan. 2016. Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673 (2016).
  • Fan et al. (2019) Wenqi Fan, Tyler Derr, Yao Ma, Jianping Wang, Jiliang Tang, and Qing Li. 2019. Deep Adversarial Social Recommendation. In 28th International Joint Conference on Artificial Intelligence (IJCAI-19). International Joint Conferences on Artificial Intelligence, 1351–1357.
  • Gao and Zhang (2021) Luyu Gao and Yunyi Zhang. 2021. Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage. arXiv preprint arXiv:2101.06983 (2021).
  • Goodfellow et al. (2014) Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NIPS.
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909 (2020).
  • Humeau et al. (2019) Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. In International Conference on Learning Representations.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 1601–1611. https://doi.org/10.18653/v1/P17-1147
  • Joty et al. (2017) Shafiq Joty, Preslav Nakov, Lluís Màrquez, and Israa Jaradat. 2017. Cross-language Learning with Adversarial Neural Networks. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). 226–237.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550
  • Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 39–48.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7 (2019), 453–466.
  • Lee et al. (2019a) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019a. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 6086–6096.
  • Lee et al. (2019b) Seanie Lee, Donggyu Kim, and Jangwon Park. 2019b. Domain-agnostic Question-Answering with Adversarial Training. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering. 196–202.
  • Lewis and Fan (2018) Mike Lewis and Angela Fan. 2018. Generative question answering: Learning to answer the whole question. In International Conference on Learning Representations.
  • Lin et al. (2018) Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. 2018. Denoising Distantly Supervised Open-Domain Question Answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 1736–1745. https://doi.org/10.18653/v1/P18-1161
  • Liu et al. (2020) Zhuang Liu, Keli Xiao, Bo Jin, Kaiyu Huang, Degen Huang, and Yunxia Zhang. 2020. Unified generative adversarial networks for multiple-choice oriented machine comprehension. ACM Transactions on Intelligent Systems and Technology (TIST) 11, 3 (2020), 1–20.
  • Min et al. (2020) Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, et al. 2020. NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned. arXiv preprint arXiv:2101.00133 (2020).
  • Min et al. (2019a) Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. A Discrete Hard EM Approach for Weakly Supervised Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2844–2857.
  • Min et al. (2019b) Sewon Min, Danqi Chen, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Knowledge guided text retrieval and reading for open domain question answering. arXiv preprint arXiv:1911.03868 (2019).
  • Nguyen et al. (2017) Tu Dinh Nguyen, Trung Le, Hung Vu, and Dinh Phung. 2017. Dual discriminator generative adversarial nets. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 2667–2677.
  • Nie et al. (2019) Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. Revealing the Importance of Semantic Retrieval for Machine Reading at Scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2553–2566.
  • Nogueira and Cho (2019) Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019).
  • Oh et al. (2019) Jong-Hoon Oh, Kazuma Kadowaki, Julien Kloetzer, Ryu Iida, and Kentaro Torisawa. 2019. Open-Domain Why-Question Answering with Adversarial Learning to Encode Answer Texts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4227–4237.
  • Patro et al. (2020) Badri Patro, Shivansh Patel, and Vinay Namboodiri. 2020. Robust explanations for visual question answering. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1577–1586.
  • Ramakrishnan et al. (2018a) Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. 2018a. Overcoming Language Priors in Visual Question Answering with Adversarial Regularization. Advances in Neural Information Processing Systems 31 (2018).
  • Ramakrishnan et al. (2018b) Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. 2018b. Overcoming language priors in visual question answering with adversarial regularization. arXiv preprint arXiv:1810.03649 (2018).
  • Rao and Daumé III (2019) Sudha Rao and Hal Daumé III. 2019. Answer-based adversarial training for generating clarification questions. arXiv preprint arXiv:1904.02281 (2019). http://arxiv.org/abs/1904.02281
  • Tang et al. (2018) Duyu Tang, Nan Duan, Zhao Yan, Zhirui Zhang, Yibo Sun, Shujie Liu, Yuanhua Lv, and Ming Zhou. 2018. Learning to collaborate for question answering and asking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 1564–1574.
  • Tran et al. (2020) Linh Duy Tran, Son Minh Nguyen, and Masayuki Arai. 2020. GAN-based noise model for denoising real images. In Proceedings of the Asian Conference on Computer Vision.
  • Wang et al. (2017) Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. IRGAN: A minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 515–524.
  • Wang et al. (2018) Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018. R³: Reinforced ranker-reader for open-domain question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Wolfson et al. (2020) Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics 8 (2020), 183–198.
  • Xiong et al. (2020) Wenhan Xiong, Hong Wang, and William Yang Wang. 2020. Progressively pretrained dense corpus index for open-domain question answering. arXiv preprint arXiv:2005.00038 (2020).
  • Yang et al. (2019b) Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019b. End-to-End Open-Domain Question Answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). 72–77.
  • Yang et al. (2019c) Wei Yang, Yuqing Xie, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019c. Data augmentation for BERT fine-tuning in open-domain question answering. arXiv preprint arXiv:1904.06652 (2019).
  • Yang et al. (2019a) Xiao Yang, Madian Khabsa, Miaosen Wang, Wei Wang, Ahmed Hassan Awadallah, Daniel Kifer, and C Lee Giles. 2019a. Adversarial training for community question answer selection based on multi-scale matching. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 395–402.
  • Yang et al. (2017) Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov, and William Cohen. 2017. Semi-Supervised QA with Generative Domain-Adaptive Nets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1040–1050.
  • Yi et al. (2017) Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision. 2849–2857.
  • Yuan et al. (2020) Feng Yuan, Lina Yao, and Boualem Benatallah. 2020. Exploring Missing Interactions: A Convolutional Generative Adversarial Network for Collaborative Filtering. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1773–1782.
  • Zhang (2018) Weinan Zhang. 2018. Generative adversarial nets for information retrieval: Fundamentals and advances. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 1375–1378.
  • Zhang et al. (2020) Zhuosheng Zhang, Junjie Yang, and Hai Zhao. 2020. Retrospective reader for machine reading comprehension. arXiv preprint arXiv:2001.09694 (2020).
  • Zou et al. (2018) Shihao Zou, Guanyu Tao, Jun Wang, Weinan Zhang, and Dell Zhang. 2018. On the equilibrium of query reformulation and document retrieval. In Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval. 43–50.