
Joint Passage Ranking for Diverse Multi-Answer Retrieval

Sewon Min1,   Kenton Lee2,  Ming-Wei Chang2,  Kristina Toutanova2,  Hannaneh Hajishirzi1
1University of Washington   2Google Research
{sewon,hannaneh}@cs.washington.edu
{kentonl,mingweichang,kristout}@google.com
  Work done while interning at Google.
Abstract

We study multi-answer retrieval, an under-explored problem that requires retrieving passages to cover multiple distinct answers for a given question. This task requires joint modeling of retrieved passages, as models should not repeatedly retrieve passages containing the same answer at the cost of missing a different valid answer. In this paper, we introduce JPR, the first joint passage retrieval model for multi-answer retrieval. JPR makes use of an autoregressive reranker that selects a sequence of passages, each conditioned on previously selected passages. JPR  is trained to select passages that cover new answers at each timestep and uses a tree-decoding algorithm to enable flexibility in the degree of diversity. Compared to prior approaches, JPR achieves significantly better answer coverage on three multi-answer datasets. When combined with downstream question answering, the improved retrieval enables larger answer generation models since they need to consider fewer passages, establishing a new state-of-the-art.

1 Introduction

Passage retrieval is the problem of retrieving a set of passages relevant to a natural language question from a large text corpus. Most prior work focuses on single-answer retrieval, which scores passages independently from each other according to their relevance to the given question, assuming there is a single answer (Voorhees et al., 1999; Chen et al., 2017; Lee et al., 2019). However, questions posed by humans are often open-ended and ambiguous, leading to multiple valid answers (Min et al., 2020). For example, for the question in Figure 1, “What was Eli Whitney’s job?”, an ideal retrieval system should provide passages covering all professions of Eli Whitney. This introduces the problem of multi-answer retrieval—retrieval of multiple passages with maximal coverage of all distinct answers—which is a challenging yet understudied problem.

Figure 1: The problem of multi-answer retrieval. A retrieval system must retrieve a set of $k$ passages ($k=5$ in the figure) which has maximal coverage of diverse answers to the input question: inventor, farm laborer and school teacher in this example. This requires modeling the joint probability of the passages in the output set: $P(p_1...p_k|q)$. Our proposed model JPR achieves this by employing an autoregressive model.

Multi-answer retrieval poses two challenges that are not well represented in single-answer retrieval. First, the task requires scoring passages jointly to optimize for retrieving multiple relevant-yet-complementary passages. Second, the model needs to balance between two different goals: retrieving passages dissimilar to each other to increase the recall, and keeping passages relevant to the question.

In this work, we introduce Joint Passage Retrieval (JPR), a new model that addresses these challenges. To jointly score passages, JPR employs an encoder-decoder reranker and autoregressively generates passage references by modeling the probability of each passage as a function of previously retrieved passages. Since there is no ground truth ordering of passages, we employ a new training method that dynamically forms supervision to drive the model to prefer passages with answers not already covered by previously selected passages. Furthermore, we introduce a new tree-decoding algorithm to allow flexibility in the degree of diversity.

In experiments on three multi-answer datasets, WebQSP (Yih et al., 2016), AmbigQA (Min et al., 2020), and TREC (Baudiš and Šedivỳ, 2015), JPR achieves significantly improved recall over both a dense retrieval baseline (Guu et al., 2020; Karpukhin et al., 2020) and a state-of-the-art reranker that independently scores each passage (Nogueira et al., 2020). Improvements are particularly significant on questions with more than one answer, outperforming dense retrieval by up to 12% absolute and an independent reranker by up to 6% absolute.

We also evaluate the impact of JPR in downstream question answering, where an answer generation model takes the retrieved passages as input and generates short answers. Improved reranking leads to improved answer accuracy because we can supply fewer, higher-quality passages to a larger answer generation model that fits on the same hardware. This practice leads to a new state-of-the-art on three multi-answer QA datasets and NQ (Kwiatkowski et al., 2019). To summarize, our contributions are as follows:

  1. We study multi-answer retrieval, an underexplored problem that requires the top $k$ passages to maximally cover the set of distinct answers to a natural language question.

  2. We propose JPR, a joint passage retrieval model that integrates dependencies among selected passages, along with new training and decoding algorithms.

  3. On three multi-answer QA datasets, JPR significantly outperforms a range of baselines with independent scoring of passages, both in retrieval recall and answer accuracy.

2 Background

2.1 Review: Single-Answer Retrieval

In a typical single-answer retrieval problem, a model is given a natural language question $q$ and retrieves $k$ passages $\{p_1...p_k\}$ from a large text corpus $\mathcal{C}$ (Voorhees et al., 1999; Ramos et al., 2003; Robertson and Zaragoza, 2009; Chen et al., 2017; Lee et al., 2019; Karpukhin et al., 2020; Luan et al., 2020). The goal is to retrieve at least one passage that contains the answer to $q$. During training, question-answer pairs $(q,a)$ are given to the model.

Evaluation

Intrinsic evaluation directly evaluates the retrieved passages. The most commonly used metric is Recall @ $k$, which considers retrieval successful if the answer $a$ is included in $\{p_1...p_k\}$. Extrinsic evaluation uses the retrieved passages as input to an answer generation model such as the model in Izacard and Grave (2021) and evaluates final question answering performance.
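The following is a minimal sketch of Recall @ $k$; the substring-matching heuristic and function name are illustrative only, and the actual evaluation may normalize answers differently.

```python
def recall_at_k(answer: str, passages: list, k: int) -> bool:
    """Single-answer Recall @ k: success if the answer string
    appears in any of the top-k retrieved passages."""
    return any(answer.lower() in p.lower() for p in passages[:k])
```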

Task              | Single-answer Retrieval        | Multi-answer Retrieval
Train Data        | $(q, a)$                       | $(q, \{a_1...a_n\})$
Inference         | $q \rightarrow \{p_1...p_k\}$  | $q \rightarrow \{p_1...p_k\}$
Evaluation        | Recall$(a, \{p_1...p_k\})$     | MRecall$(\{a_1...a_n\}, \{p_1...p_k\})$
Appropriate Model | $P(p_i|q)$                     | $P(p_1...p_k|q)$
Table 1: A comparison of single-answer and multi-answer retrieval tasks. Previous work has used independent ranking models $P(p_i|q)$ for multi-answer retrieval because the inference-time inputs and outputs are the same. We propose JPR as an instance of $P(p_1...p_k|q)$.
Figure 2: An overview of JPR. We focus on reranking and propose: autoregressive generation of passage references (Section 3.1), training with dynamic oracle (Section 3.2), and inference with TreeDecode (Section 3.3).
Reranking

Much prior work (Liu, 2011; Asadi and Lin, 2013; Nogueira et al., 2020) found an effective strategy in using a two-step approach of (1) retrieving a set of candidate passages $\mathcal{B}$ from the corpus $\mathcal{C}$ ($k<|\mathcal{B}|\ll|\mathcal{C}|$) and (2) using another model to rerank the passages, obtaining a final top $k$. A reranker can be more expressive than the first-stage model (e.g., by using cross-attention), as it needs to process far fewer candidates. Most prior work in reranking, including the current state-of-the-art (Nogueira et al., 2020), scores each passage independently, modeling $P(p|q)$.
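A schematic sketch of this two-step pipeline; `first_stage` and `reranker` are placeholder callables rather than the models used in this paper.

```python
def retrieve_then_rerank(question, corpus, first_stage, reranker,
                         num_candidates=100, k=10):
    """Two-stage retrieval: a fast first-stage model narrows the corpus C to a
    candidate pool B (k < |B| << |C|), and a more expressive reranker picks the
    final top-k passages by scoring each one independently, i.e. P(p|q)."""
    candidates = first_stage(question, corpus, num_candidates)        # B
    scores = [reranker(question, passage) for passage in candidates]  # independent scores
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [p for p, _ in ranked[:k]]
```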

2.2 Multi-Answer Retrieval

We now formally define the task of multi-answer retrieval. A model is given a natural language question $q$ and needs to find $k$ passages $\{p_1...p_k\}$ from $\mathcal{C}$ that contain all distinct answers to $q$. Unlike in single-answer retrieval, question-and-answer-set pairs $(q,\{a_1...a_n\})$ are given during training.

Evaluation

Similar to single-answer retrieval, the intrinsic evaluation directly evaluates a set of $k$ passages. As the problem is underexplored, metrics for it are less studied. We propose MRecall @ $k$, a new metric which considers retrieval successful if all answers, or at least $k$ answers, in the answer set $\{a_1...a_n\}$ are recovered by $\{p_1...p_k\}$. Intuitively, MRecall is an extension of Recall that considers the completeness of the retrieval: the model must retrieve all $n$ answers when $n\leq k$, or at least $k$ answers when $n>k$. (The latter condition handles cases where $n$ is very large, e.g., over 100, so that a model covering a reasonable number of answers within the limit of $k$ passages still receives credit.)
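A corresponding sketch of MRecall @ $k$, under the same simplifying string-match assumption as the Recall sketch above.

```python
def mrecall_at_k(answers: list, passages: list, k: int) -> bool:
    """MRecall @ k: success if the top-k passages recover all n answers
    (when n <= k) or at least k distinct answers (when n > k).
    Substring matching is a simplification of the real answer matching."""
    top_k = [p.lower() for p in passages[:k]]
    covered = {a for a in answers if any(a.lower() in p for p in top_k)}
    return len(covered) >= min(len(answers), k)
```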

The extrinsic evaluation inputs the retrieved passages into an answer generation module that is designed for multiple answers, and measures multi-answer accuracy using an appropriate metric such as the one in Min et al. (2020).

Comparing to single-answer retrieval

We compare single-answer retrieval and multi-answer retrieval in Table 1. Prior work makes no distinction between these two problems, since they share the same interface during inference. However, while independently scoring each passage ($P(p_i|q)$) may be sufficient for single-answer retrieval, multi-answer retrieval inherently requires joint passage scoring $P(p_1...p_k|q)$. For example, models should not repeatedly retrieve the same answer at the cost of missing other valid answers, which can only be done by a joint model.

Choice of $k$ for downstream QA

Previous state-of-the-art models typically input a large number ($k\geq 100$) of passages to the answer generation model. For instance, Izacard and Grave (2021) claim the importance of using a larger value of $k$ to improve QA accuracy. In this paper, we argue that with reranking, using a smaller value of $k$ (5 or 10) and instead employing a larger answer generation model is advantageous given a fixed hardware budget. (We care about a fixed type of hardware since it is the hardest constraint and usually a bottleneck for performance; we do not control for running time in this comparison.) We show in Section 5 that, as retrieval performance improves, memory is better spent on larger answer generators rather than on more passages, ultimately leading to higher QA accuracy.

3 JPR: Joint Passage Retrieval

Figure 3: An illustration of TreeDecode, where passages that are chosen and passages that are being considered are indicated in orange and blue, respectively. See Section 3.3 and Algorithm 1 for details.

We propose JPR (Joint Passage Retrieval), which models the joint probability $P(p_1...p_k|q)$ for multi-answer retrieval. JPR uses an approach consisting of first-stage retrieval followed by reranking: the first-stage retrieval obtains candidate passages $\mathcal{B}$ from the corpus $\mathcal{C}$, and a reranker processes $\mathcal{B}$ to output $\{p_1...p_k\}\subset\mathcal{B}$. We refer to Section 4.2 for the first-stage retrieval, and focus on the reranking component of the model, which allows (1) efficiently modeling the joint probability $P(p_1...p_k|q)$, and (2) processing candidate passages with a more expressive model.

The overview of JPR is illustrated in Figure 2. The reranker of JPR leverages the encoder-decoder architecture for an autoregressive generation of passage references (Section 3.1). Unlike typical use cases of the encoder-decoder, (1) the ordering of passages to retrieve is not given as supervision, and (2) it is important to balance between exploring passages about new answers and finding passages that may cover previously selected answers. To this end, we introduce a new training method (Section 3.2) and a tree-based decoding algorithm (Section 3.3).

3.1 Autoregressive generation of passage references

JPR makes use of the encoder-decoder architecture, where the encoder processes candidate passages and the decoder autoregressively generates a sequence of $k$ passage references (indexes). Intuitively, dependencies between passages can be modeled by the autoregressive architecture.

We extend the architecture from Izacard and Grave (2021); we reuse the encoder but modify the decoder. Each candidate passage $p_i$ is concatenated with the question $q$ and the number $i$ (namely, the index). It is fed into the encoder and transformed to $\mathbf{p}_i\in\mathbb{R}^{L\times h}$, where $L$ is the length of the input text and $h$ is the hidden size. Next, $\mathbf{p}_1...\mathbf{p}_{|\mathcal{B}|}$ are concatenated to form $\bar{\mathbf{p}}\in\mathbb{R}^{L|\mathcal{B}|\times h}$, which is fed into the decoder. The decoder is trained to autoregressively output a sequence of indexes $i_1...i_k$, representing a sequence of passages $p_1...p_k$. As the generation at step $t$ ($1\leq t\leq k$) is conditioned on the generations at steps $1\dots t-1$, it can naturally capture dependencies between selected passages. As each index occupies one token, the length of the decoded sequence is $k$.
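A sketch of the encoder-side input formatting described above; the exact separator strings are an assumption for illustration, not the tokenization used in our implementation.

```python
def format_encoder_inputs(question: str, candidates: list) -> list:
    """Each candidate passage is concatenated with the question and its index i,
    encoded independently; the encoded blocks are then concatenated before being
    fed to the decoder, which emits a sequence of k index tokens."""
    return [f"question: {question} index: {i} passage: {p}"
            for i, p in enumerate(candidates, start=1)]
```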

3.2 Training with Dynamic Oracle

A standard way of training the encoder-decoder is teacher forcing which assumes a single correct sequence. However, in our task, a set of answers can be retrieved through many possible sequences of passages, and it is unknown which sequence is the best. To this end, we dynamically form the supervision data which pushes the model to assign high probability to a dynamic oracle—any positive passage covering a correct answer that is not in the prefix, i.e., previously selected passages.

We first precompute a set of positive passages $\tilde{\mathcal{O}}$ and a prefix $\tilde{p}_1...\tilde{p}_k$. The set of positive passages $\tilde{\mathcal{O}}$ includes up to $k$ passages with maximal coverage of the distinct answers ($|\tilde{\mathcal{O}}|<k$ if fewer than $k$ passages are sufficient to cover all distinct answers; $|\tilde{\mathcal{O}}|=k$ otherwise). The prefix $\tilde{p}_1...\tilde{p}_k$ is a simulated prediction of the model, consisting of $\tilde{\mathcal{O}}$ and $k-|\tilde{\mathcal{O}}|$ sampled negatives. At each step $t$ ($1\leq t\leq k$), given the set of positive passages $\tilde{\mathcal{O}}$ and the prefix $\tilde{p}_1...\tilde{p}_{t-1}$ (denoted $\mathcal{P}_{t-1}$), JPR is trained to assign high probability to the dynamic oracle $\tilde{\mathcal{O}}-\mathcal{P}_{t-1}$. The objective is defined as follows:

\[
\sum_{1\leq t\leq k}\;\sum_{o\in\tilde{\mathcal{O}}-\mathcal{P}_{t-1}} -\log P(o\mid q,\mathcal{B},\mathcal{P}_{t-1}).
\]
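A minimal sketch of this objective for a single question; `step_log_probs` is a placeholder for the decoder's per-step log-probabilities over candidate passages, and the prefix is the simulated prediction described above.

```python
def dynamic_oracle_loss(step_log_probs, oracle: set, prefix: list) -> float:
    """At each step t, sum -log P(o | q, B, P_{t-1}) over every positive passage o
    (the dynamic oracle) that is not already in the prefix P_{t-1}."""
    loss = 0.0
    for t in range(len(prefix)):
        prefix_t = prefix[:t]                  # P_{t-1}: previously selected passages
        log_p = step_log_probs(prefix_t)       # dict: passage -> log-probability at step t
        remaining = oracle - set(prefix_t)     # dynamic oracle: positives not yet covered
        loss += sum(-log_p[o] for o in remaining)
    return loss
```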
Algorithm 1 Decoding algorithms for JPR.
1: procedure SeqDecode($k$, $P(p|p_1...p_n)$)
2:     $O \leftarrow []$  // a list of selected passages
3:     for $i = 1,\dots,k$ do
4:         $\hat{p} \leftarrow \mathrm{argmax}_{p\in\mathcal{B}-O.\mathrm{toSet}()} \log P(p|O)$
5:         $O \leftarrow O :: \hat{p}$
6:     return $\mathrm{Set}(O)$
7: procedure TreeDecode($k$, $P(p|p_1...p_n)$, $l$)
8:     $\mathcal{O} \leftarrow \emptyset$  // a set of selected passages
9:     $\mathcal{S} \leftarrow [\texttt{Empty}]$  // a tree
10:     while $|\mathcal{O}| < k$ do
11:         $P'(p|s) \leftarrow P(p|s)\,\mathbb{I}[s :: p \notin \mathcal{S}]$
12:         $(\hat{s},\hat{p}) \leftarrow \mathrm{argmax}_{p\in\mathcal{B},\,s\in\mathcal{S}}\ l(|s|+1)\log P'(p|s)$
13:         $\mathcal{O} \leftarrow \mathcal{O}\cup\{\hat{p}\}$,  $\mathcal{S} \leftarrow \mathcal{S}.\mathrm{append}(\hat{s} :: \hat{p})$
14:     return $\mathcal{O}$

3.3 Inference with TreeDecode

A typical autoregressive decoder makes the top-1 prediction at each step to decode a sequence of length $k$ (SeqDecode in Algorithm 1), which, given our training scheme, asks the decoder to find a new answer at every step. (We explored beam search decoding, but it gives results that are the same as or only marginally different from SeqDecode.) However, when $k$ is larger than the number of correct answers, it is counter-productive to ask for $k$ passages that each cover a distinct answer. Instead, we want the flexibility of decoding fewer timesteps and taking multiple predictions from each timestep.

In this context, we introduce a new decoding algorithm, TreeDecode, which decodes a tree from an autoregressive model. TreeDecode iteratively chooses between the depth-wise and the width-wise expansion of the tree (going forward to the next step and taking the next best passage within the same step, respectively) until it reaches $k$ passages (Figure 3). Intuitively, if the model believes that there are many distinct answers covered by different passages, it will choose to take the next step, behaving more like SeqDecode. On the other hand, if the model believes that there are very few distinct answers, it will choose to take more predictions from the same step, resulting in behavior closer to independent scoring.

The formal algorithm is as follows. We represent a tree $\mathcal{S}$ as a list of ordered lists $[s_1...s_T]$, where $s_1$ is an empty list and $s_i$ is one element appended to any of $s_1...s_{i-1}$. The corresponding set $\mathcal{O}$ is $\cup_{s\in\mathcal{S}}\mathrm{Set}(s)$. We define the score of a tree $\mathcal{S}$ as

\[
f(\mathcal{S})=\sum_{p_1...p_{t_i}\in\mathcal{S}}\log P(p_{t_i}\mid p_1...p_{t_{i-1}}).
\]

We form $\mathcal{S}$ and $\mathcal{O}$ through an iterative process by (1) starting from $\mathcal{O}=\emptyset$ and $\mathcal{S}=[\texttt{Empty}]$, and (2) updating $\mathcal{O}$ and $\mathcal{S}$ by finding the addition of an element that maximizes the gain in $f(\mathcal{S})$, until $|\mathcal{O}|=k$, as described in Algorithm 1.
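A runnable sketch of TreeDecode; `log_prob` and `length_penalty` are placeholders for the reranker's conditional log-probability $P(p|s)$ and the penalty $l(\cdot)$, and candidates are assumed to be hashable passage identifiers.

```python
import math

def tree_decode(candidates, log_prob, k, length_penalty):
    """Grow a tree of prefixes, repeatedly adding the (prefix, passage) expansion
    with the highest penalized log-probability, until k passages are selected.
    Assumes k <= number of distinct candidates."""
    selected = set()        # O: set of selected passages
    tree = [()]             # S: list of prefixes, starting from the empty prefix
    existing = set(tree)
    while len(selected) < k:
        best, best_score = None, -math.inf
        for prefix in tree:
            for p in candidates:
                branch = prefix + (p,)
                if branch in existing:                  # s :: p already in the tree
                    continue
                score = length_penalty(len(prefix) + 1) * log_prob(prefix, p)
                if score > best_score:
                    best, best_score = branch, score
        selected.add(best[-1])
        tree.append(best)
        existing.add(best)
    return selected
```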

4 Experimental Setup

Dataset  | # questions (Train / Dev / Test) | # answers (Avg. / Median)
WebQSP   | 2,756 / 241 / 1,582              | 12.4 / 2.0
AmbigQA  | 10,036 / 2,002 / 2,004           | 2.2 / 2.0
TREC     | 1,250 / 119 / 654                | 4.1 / 2.0
Table 2: Number of questions, and the average & median number of answers on the development data. The data we use for TREC is a subset of the data from Baudiš and Šedivỳ (2015), as described in Appendix B.1.

We compare JPR with multiple baselines in a range of multi-answer QA datasets. We first present an intrinsic evaluation of passage retrieval by reporting MRecall based on answer coverage in the retrieved passages (Section 5.1). We then present an extrinsic evaluation through experiments in downstream question answering (Section 5.2).

$k$ | Models                        | WebQSP Dev | WebQSP Test | AmbigQA Dev | TREC Dev  | TREC Test
5   | DPR+ only                     | 56.4/37.8  | 57.0/38.9   | 55.2/36.3   | 53.8/29.9 | 57.8/36.6
5   | DPR+ + Nogueira et al. (2020) | 60.2/40.9  | 60.2/39.9   | 63.4/43.1   | 53.8/28.4 | 61.0/39.5
5   | IndepPR                       | 60.6/40.2  | 62.9/45.2   | 63.7/43.7   | 53.8/28.4 | 62.4/41.1
5   | JPR                           | 68.5/56.7  | 64.9/50.6   | 64.8/45.2   | 55.5/29.9 | 62.4/41.1
10  | DPR+ only                     | 61.4/42.5  | 59.0/38.6   | 59.3/39.6   | 55.5/28.4 | 60.1/38.4
10  | DPR+ + Nogueira et al. (2020) | 64.7/45.7  | 62.9/41.5   | 65.8/46.4   | 55.5/28.4 | 64.8/43.0
10  | IndepPR                       | 65.6/47.2  | 63.3/43.1   | 65.5/46.2   | 53.8/26.9 | 63.8/42.2
10  | JPR                           | 68.9/55.1  | 65.7/48.9   | 67.1/48.2   | 56.3/29.9 | 64.5/43.3
Table 3: Results on passage retrieval in MRecall. The two numbers in each cell indicate performance on all questions and on questions with more than one answer, respectively. Test-set metrics on AmbigQA are not available as its test set is hidden, but we report the test results on question answering in Section 5.2. Note: it is possible to have higher MRecall @ 5 than MRecall @ 10 based on our definition of MRecall (Section 2.2).

4.1 Datasets

We train and evaluate on three datasets that provide a set of distinct answers for each question. Statistics of each dataset are provided in Table 2.

WebQSP (Yih et al., 2016) consists of questions from Google Suggest API, originally from Berant et al. (2013). The answer is a set of distinct entities in Freebase; we recast this problem as textual question answering based on Wikipedia.

AmbigQA (Min et al., 2020) consists of questions mined from Google search queries, originally from NQ (Kwiatkowski et al., 2019). Each question is paired with an annotated set of distinct answers that are equally valid based on Wikipedia.

TREC (Baudiš and Šedivỳ, 2015) contains questions curated from TREC QA tracks, along with regular expressions as answers. Prior work uses this data as a task of finding a single answer (where retrieving any of the correct answers is sufficient), but we recast the problem as a task of finding all answers, and approximate a set of distinct answers. Details are described in Appendix B.1.

4.2 First-stage Retrieval

JPR can obtain candidate passages \mathcal{B} from any first-stage retrieval model. In this paper, we use DPR+, our own improved version of DPR (Karpukhin et al., 2020) combined with REALM (Guu et al., 2020). DPR and REALM are dual encoders with a supervised objective and an unsupervised, language modeling objective, respectively. We initialize the dual encoder with REALM and train on supervised datasets using the objective from DPR. More details are provided in Appendix A.

Training method              | MRecall
Dynamic oracle               | 67.6/56.7
Dynamic oracle w/o negatives | 65.1/52.0
Teacher forcing              | 66.4/51.2
Table 4: Ablations in training methods for JPR. Results on WebQSP ($k=5$). All rows use SeqDecode (instead of TreeDecode).
$k$ | Decoding   | WebQSP $d$ | WebQSP MRecall | AmbigQA $d$ | AmbigQA MRecall
5   | SeqDecode  | 5.0        | 67.6/56.7      | 5.0         | 63.1/42.5
5   | TreeDecode | 3.0        | 68.5/56.7      | 2.1         | 64.8/45.2
10  | SeqDecode  | 10.0       | 68.0/54.3      | 10.0        | 65.0/45.9
10  | TreeDecode | 5.4        | 68.9/55.1      | 2.9         | 67.1/48.2
Table 5: Ablations in decoding methods for JPR. $d$ refers to the average depth of the tree ($\mathrm{max}_{s\in\mathcal{S}}|s|$ in Algorithm 1).
Q: Who play Mark on the TV show Roseanne?
IndepPR:
#1 Glenn Quinn … He was best known for his portrayal of Mark Healy on the popular ’90s family sitcom Roseanne.
#2 Glenn Quinn, who played Becky’s husband, Mark, died in December 2002 of a heroin overdose at the age of 32 …
#3 Becky begins dating Mark Healy (Glenn Quinn) …
#4 Johnny Galecki … on the hit ABC sitcom Roseanne as the younger brother of Mark Healy (Glenn Quinn) …
JPR:
#1 Glenn Quinn … He was best known for his portrayal of Mark Healy on the popular ’90s family sitcom Roseanne.
#2 Becky begins dating Mark Healy (Glenn Quinn) …
#3 Glenn Quinn, who played Becky’s husband, Mark, died in December 2002 of a heroin overdose at the age of 32 …
#4 Roseanne (season 10) … In September 2017, Ames McNamara was announced to be cast as Mark Conner-Healy.
Table 6: An example prediction from IndepPR and JPR; answers to the input question highlighted. While IndepPR repeatedly retrieves passages supporting the same answer Glenn Quinn and fails to cover other answers, JPR successfully retrieves a passage covering a novel answer Ames McNamara.

4.3 Baselines

We compare JPR with three baselines, all of which are published models or enhanced versions of them. All baselines independently score each passage.

DPR+ only uses DPR+ without a reranker.

DPR+ + Nogueira et al. (2020) uses DPR+ followed by Nogueira et al. (2020), the state-of-the-art document ranker. It processes each passage $p_i$ in $\mathcal{B}$ independently and is trained to output yes if $p_i$ contains any valid answer to $q$, otherwise no. At inference, the probability for each $p_i$ is computed by taking a softmax over the logits of yes and no. The top $k$ passages are chosen based on the probabilities assigned to yes.
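A sketch of this scoring scheme; `yes_no_logits` is a placeholder for the reranker's output logits for the yes and no tokens.

```python
import math

def yes_no_score(yes_logit: float, no_logit: float) -> float:
    """Relevance score for one passage: softmax over the yes/no logits, keep P(yes)."""
    m = max(yes_logit, no_logit)
    exp_yes = math.exp(yes_logit - m)
    exp_no = math.exp(no_logit - m)
    return exp_yes / (exp_yes + exp_no)

def rerank_independent(candidates, yes_no_logits, k):
    """Rank each candidate independently by P(yes) and return the top k."""
    scored = [(p, yes_no_score(*yes_no_logits(p))) for p in candidates]
    scored.sort(key=lambda x: -x[1])
    return [p for p, _ in scored[:k]]
```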

IndepPR is our own baseline that is a strict non-autoregressive version of JPR, in which the prediction of a passage is independent from other passages in the retrieved set. It obtains candidate passages $\mathcal{B}$ through DPR+, and the encoder of the reranker processes $q$ and $\mathcal{B}$, as in JPR. Different from JPR, the decoder is trained to output a single token $i$ ($1\leq i\leq|\mathcal{B}|$) rather than a sequence. The objective is the sum of $-\log P(p|q,\mathcal{B})$ over the passages including any valid answer to $q$. At inference, IndepPR outputs the top $k$ passages based on the logit values of the passage indices. We compare mainly to IndepPR because it is the strict non-autoregressive version of JPR, and it is empirically better than or comparable to Nogueira et al. (2020) (Section 5.1).

4.4 Implementation Details

We use the English Wikipedia dump from 12/20/2018 as the retrieval corpus $\mathcal{C}$, where each article is split into passages with up to 288 wordpieces. All rerankers are based on T5 (Raffel et al., 2020), a pretrained encoder-decoder model; T5-base is used unless otherwise specified. We use $|\mathcal{B}|=100$ and $k=\{5,10\}$. Models are first trained on NQ (Kwiatkowski et al., 2019) and then finetuned on the multi-answer datasets, which we find helpful since all multi-answer datasets are relatively small. During dynamic oracle training, $k-|\tilde{\mathcal{O}}|$ negatives are sampled from $\mathcal{B}-\tilde{\mathcal{O}}$ based on $s(p_i)+\gamma g_i$, where $s(p_i)$ is a prior logit value from IndepPR, $g_i\sim\mathrm{Gumbel}(0,1)$, and $\gamma$ is a hyperparameter. In TreeDecode, to control the trade-off between the depth and the width of the tree, we use a length penalty function $l(y)=\left(\frac{5+y}{5+1}\right)^{\beta}$, where $\beta$ is a hyperparameter, following Wu et al. (2016). More details are in Appendix B.2.
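A sketch of the negative sampling and length penalty described above; `prior_logit` is a placeholder mapping from passages to IndepPR logits.

```python
import math
import random

def length_penalty(y: int, beta: float) -> float:
    """l(y) = ((5 + y) / (5 + 1)) ** beta, following Wu et al. (2016)."""
    return ((5 + y) / (5 + 1)) ** beta

def sample_negatives(candidates, positives, prior_logit, k, gamma):
    """Pick the k - |O~| negatives that pad the training prefix: rank passages in
    B - O~ by s(p_i) + gamma * g_i, where g_i is Gumbel(0, 1) noise."""
    pool = [p for p in candidates if p not in positives]
    num_negatives = max(0, k - len(positives))

    def perturbed(p):
        u = max(random.random(), 1e-12)
        gumbel = -math.log(-math.log(u))        # Gumbel(0, 1) sample
        return prior_logit[p] + gamma * gumbel

    return sorted(pool, key=perturbed, reverse=True)[:num_negatives]
```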

5 Experimental Results

5.1 Retrieval Experiments

Table 3 reports MRecall on all questions and on questions with more than one answer.

No reranking vs. reranking

Models with reranking (DPR+ + Nogueira et al. (2020), IndepPR, or JPR) are always better than DPR+ only, demonstrating the importance of reranking.

Independent vs. joint ranking

JPR consistently outperforms both DPR+ + Nogueira et al. (2020) and IndepPR on all datasets and all values of $k$. Gains are especially significant on questions with more than one answer, outperforming the two reranking baselines by up to 11% absolute and up to 6% absolute, respectively. WebQSP sees the largest gains out of the three datasets, likely because its average number of answers is the largest.

5.1.1 Ablations & Analysis

Training methods

Table 4 compares dynamic oracle training with alternatives. ‘Dynamic oracle w/o negatives’ is the same as dynamic oracle training, except the prefix only has positive passages. ‘Teacher forcing’ is the standard method for training an autoregressive model: given a target sequence $o_1...o_k$, the model is trained to maximize $\Pi_{1\leq t\leq k}P(o_t|o_1...o_{t-1})$. We form a target sequence using a set of positive passages $\tilde{\mathcal{O}}$, where the order is determined by following the ranking from IndepPR. Table 4 shows that our dynamic oracle training, which uses both positives and negatives, significantly outperforms the other methods.

Impact of TreeDecode

Table 5 compares JPR with SeqDecode and with TreeDecode. We find that TreeDecode consistently improves performance on both WebQSP and AmbigQA, for both $k=5$ and $k=10$. Gains are especially significant on AmbigQA: deciding whether to increase diversity is harder there, since questions are more specific and have fewer distinct answers, and TreeDecode handles this better than SeqDecode. The average depth of the tree is larger on WebQSP, likely because its average number of distinct answers is larger and thus requires more diversity.

Retrieval         | QA Model   | $k$ | Mem | WebQSP Dev | WebQSP Test | AmbigQA Dev | AmbigQA Test | TREC Dev  | TREC Test
DPR+ only         | T5-3B      | 10  | x1  | 50.7/45.3  | 51.9/45.0   | 43.5/34.6   | 39.6/31.4    | 46.2/32.2 | 44.7/32.1
IndepPR           | T5-3B      | 10  | x1  | 51.8/46.9  | 51.8/45.0   | 47.6/36.2   | 42.3/32.0    | 44.6/32.8 | 45.9/31.8
JPR               | T5-3B      | 10  | x1  | 53.6/49.5  | 53.1/47.2   | 48.5/37.6   | 43.5/34.2    | 48.6/32.8 | 46.8/33.3
DPR+ only         | T5-large   | 40  | x1  | 51.4/47.0  | 52.4/45.8   | 45.5/34.9   | 41.1/30.9    | 40.1/32.8 | 42.5/32.2
Gao et al. (2021) | BART-large | 100 | x3  | -          | -           | 48.3/37.3   | 42.1/33.3    | -         | -
Table 7: Question Answering results on multi-answer datasets. The two values in each cell indicate F1 on all questions and F1 on questions with multiple answers only, respectively. Mem compares the required hardware memory during training. Note that Gao et al. (2021) reranks 1000 passages instead of 100, and trains an answer generation model using x3 more memory than ours. Better retrieval enables using larger answer generation models on fewer retrieved passages.
An example prediction

Table 6 shows predictions from IndepPR and JPR given an example question from AmbigQA, “Who plays Mark on the TV show Roseanne?” One answer Glenn Quinn is easy to retrieve because there are many passages in Wikipedia providing evidence, while the other answer Ames McNamara is harder to find. While IndepPR repeatedly retrieves passages that mention Glenn Quinn and fails to cover Ames McNamara, JPR successfully retrieves both answers.

More analysis can be found in Appendix C.

5.2 QA Experiments

This section discusses experiments on downstream question answering: given a question and a set of passages from retrieval, the model outputs all valid answers to the question. We aim to answer two research questions: (1) whether the improvements in passage retrieval are transferred to improvements in downstream question answering, and (2) whether using a smaller number of passages through reranking is better than using the largest possible number of passages given fixed hardware memory.

We use an answer generation model based on Izacard and Grave (2021) which we train to generate a sequence of answers, separated by a [SEP] token, given a set of retrieved passages. Our main model uses JPR to obtain passages fed into the answer generation model. The baselines obtain passages from either DPR+ only or IndepPR, described in Section 4.3.

We compare different models that fit on the same hardware by varying the size of T5 (base, large, 3B) and using the maximum number of passages $k$; the memory requirement is $O(k\times\text{T5 size})$. This results in three settings: {$k=140$, base}, {$k=40$, large}, and {$k=10$, 3B}.

5.2.1 Main Result

Table 7 reports the performance on three multi-answer datasets in F1, following Min et al. (2020).

Impact of reranking

With {$k=10$, 3B}, JPR outperforms both baselines, indicating that the improvements in retrieval are successfully transferred to improvements in QA performance. We find, however, that our sequence-to-sequence answer generation model tends to undergenerate answers, presumably due to high variance in the length of the output sequence. This indicates the model is not fully benefiting from the retrieval of many answers, and we expect further improvements when it is combined with an answer generation model that is capable of generating many answers.

More passages vs. bigger model

With fixed memory during training, using fewer passages equipped with a larger answer generation model outperforms using more passages. This is only true when reranking is used; otherwise, using more passages is often better or comparable. This demonstrates that, as retrieval improves, memory is better spent on larger answer generators rather than more passages, leading to the best performance.

Finally, JPR establishes a new state-of-the-art, outperforming the previous state-of-the-art on AmbigQA (Gao et al., 2021), which relies on extensive reranking and an answer generation model trained using x3 more resources than ours. (Gao et al. (2021) reranks 1000 passages through independent scoring as in Nogueira et al. (2020); it is not a directly comparable baseline and serves as a point of reference.)

Model                    | T5    | $k$ | dev  | test
DPR+ only                | base  | 140 | 46.4 | -
DPR+ only                | large | 40  | 47.3 | -
DPR+ only                | 3B    | 10  | 46.5 | -
IndepPR                  | large | 40  | 49.4 | -
IndepPR                  | 3B    | 10  | 50.4 | 54.5
Izacard and Grave (2021) |       |     | -    | 51.4
Table 8: Question Answering results on NQ. We report Exact Match (EM) accuracy. The first five rows are from our own experiments, which all use the same hardware resources for training. The last row is the previous state-of-the-art, which requires x5 more resources than ours to train the model.

5.2.2 Single-answer QA result

While our main contributions are in multi-answer retrieval, we experiment on NQ to demonstrate that the value of good reranking extends to the single-answer scenario. Table 8 shows two observations consistent with the findings from multi-answer retrieval: (1) when compared within the same setting (same T5 size and $k$), IndepPR always outperforms DPR+ only, and (2) with reranking, {$k=10$, 3B} outperforms {$k=40$, large}. Finally, our best model outperforms the previous state-of-the-art (Izacard and Grave, 2021), which uses x5 more training resources. Altogether, this result (1) justifies our choice of focusing on reranking, and (2) shows that IndepPR is very competitive and thus our JPR results in multi-answer retrieval are very strong.

6 Related Work

We refer to Section 2 for related work focusing on single-answer retrieval.

Diverse retrieval

Diverse retrieval in the context of information retrieval (IR) requires finding documents covering many different sub-topics of a query topic (Zhai et al., 2003; Clarke et al., 2008). Queries are typically underspecified, and many documents (e.g., up to 56 in Zhai et al. (2003)) are considered relevant. In this problem space, effective models increase the distances between output passages post hoc during inference (Zhai et al., 2003; Abdool et al., 2020).

Our problem is closely related to diverse retrieval in IR, with two important differences. First, since questions represent more specific information needs, controlling the trade-off between relevance and diversity is harder, and simply increasing the distances between retrieved passages does not help. (In a preliminary experiment, we tried increasing diversity based on Maximal Marginal Relevance (Carbonell and Goldstein, 1998), following Zhai et al. (2003) and Abdool et al. (2020); it improves diversity but significantly hurts relevance to the input question, dropping the overall performance.) Second, multi-answer retrieval uses a clear notion of “answers”; “sub-topics” in diverse IR are more subjective and hard to enumerate fully.

Multi-hop passage retrieval

Recent work studies multi-hop passage retrieval, where a passage containing the answer is the destination of a chain of multiple hops (Asai et al., 2020; Xiong et al., 2021; Khattab et al., 2021). This is a difficult problem as passages in a chain are dissimilar to each other, but existing datasets often suffer from annotation artifacts (Chen and Durrett, 2019; Min et al., 2019), resulting in strong lexical cues for each hop. We study an orthogonal problem of finding multiple answers, where the challenge is in controlling the trade-off between relevance and diversity.

7 Conclusion

We introduce JPR, an autoregressive passage reranker designed to address the multi-answer retrieval problem. On three multi-answer datasets, JPR significantly outperforms a range of baselines in both retrieval recall and downstream QA accuracy, establishing a new state-of-the-art. Future work could extend the scope of the problem to other tasks that exhibit specific information need while requiring diversity.

Acknowledgements

We thank the Google AI Language members, the UW NLP members, and the anonymous reviewers for their valuable feedback. This work was supported in part by ONR N00014-18-1-2826 and DARPA N66001-19-2-403.

References

  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems.
  • Abdool et al. (2020) Mustafa Abdool, Malay Haldar, Prashant Ramanathan, Tyler Sax, Lanbo Zhang, Aamir Mansawala, Shulin Yang, and Thomas Legrand. 2020. Managing diversity in airbnb search. In ACM SIGKDD.
  • Asadi and Lin (2013) Nima Asadi and Jimmy Lin. 2013. Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In SIGIR.
  • Asai et al. (2020) Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over wikipedia graph for question answering. In ICLR.
  • Baudiš and Šedivỳ (2015) Petr Baudiš and Jan Šedivỳ. 2015. Modeling of the question answering task in the yodaqa system. In International Conference of the Cross-Language Evaluation Forum for European Languages.
  • Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In EMNLP.
  • Carbonell and Goldstein (1998) Jaime Carbonell and Jade Goldstein. 1998. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In SIGIR.
  • Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In ACL.
  • Chen and Durrett (2019) Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In NAACL.
  • Clarke et al. (2008) Charles LA Clarke, Maheedhar Kolla, Gordon V Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and diversity in information retrieval evaluation. In SIGIR.
  • Gao et al. (2021) Yifan Gao, Henghui Zhu, Patrick Ng, Cicero Nogueira dos Santos, Zhiguo Wang, Feng Nan, Dejiao Zhang, Ramesh Nallapati, Andrew O Arnold, and Bing Xiang. 2021. Answering ambiguous questions through generative evidence fusion and round-trip prediction. In ACL.
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. In ICML.
  • Izacard and Grave (2021) Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In EACL.
  • Järvelin and Kekäläinen (2002) Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of ir techniques. TOIS.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In EMNLP.
  • Khattab et al. (2021) Omar Khattab, Christopher Potts, and Matei Zaharia. 2021. Baleen: Robust multi-hop reasoning at scale via condensed retrieval. arXiv preprint arXiv:2101.00436.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a benchmark for question answering research. TACL.
  • Lee et al. (2019) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In ACL.
  • Liu (2011) Tie-Yan Liu. 2011. Learning to rank for information retrieval. Found. Trends Inf. Retr.
  • Luan et al. (2020) Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2020. Sparse, dense, and attentional representations for text retrieval. TACL.
  • Min et al. (2020) Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. In EMNLP.
  • Min et al. (2019) Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In ACL.
  • Nogueira et al. (2020) Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. In Findings of EMNLP.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research.
  • Ramos et al. (2003) Juan Ramos et al. 2003. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning.
  • Robertson and Zaragoza (2009) Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval.
  • Sakai and Zeng (2019) Tetsuya Sakai and Zhaohao Zeng. 2019. Which diversity evaluation measures are “good”? In SIGIR.
  • Shazeer et al. (2018) Noam M. Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, H. Lee, Mingsheng Hong, C. Young, Ryan Sepassi, and Blake A. Hechtman. 2018. Mesh-tensorflow: Deep learning for supercomputers. In NeurIPS.
  • Voorhees et al. (1999) Ellen M Voorhees et al. 1999. The trec-8 question answering track report. In Trec, volume 99, pages 77–82. Citeseer.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Xiong et al. (2021) Wenhan Xiong, Xiang Li, Srini Iyer, Jingfei Du, Patrick Lewis, William Yang Wang, Yashar Mehdad, Scott Yih, Sebastian Riedel, Douwe Kiela, and Barlas Oguz. 2021. Answering complex open-domain questions with multi-hop dense retrieval. In ICLR.
  • Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In ACL.
  • Zhai et al. (2003) Cheng Zhai, William W Cohen, and John Lafferty. 2003. Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In SIGIR.

Appendix A Details of DPR+

We use a pretrained dual encoder model from REALM (Guu et al., 2020) and further finetune it on the QA datasets using the objective from DPR (Karpukhin et al., 2020):

\[
\mathcal{L}=-\log\frac{\exp\!\left(f_q(q)^{T}f_p(p^{+})\right)}{\sum_{p\in\{p^{+}\}\cup\mathcal{B}^{-}}\exp\!\left(f_q(q)^{T}f_p(p)\right)},
\]

where $f_q$ and $f_p$ are trainable encoders for questions and passages, respectively, $p^{+}$ is a positive passage (i.e., a passage containing the answer), and $\mathcal{B}^{-}$ is a set of negative passages (i.e., passages without the answer). As shown in Karpukhin et al. (2020), the choice of $\mathcal{B}^{-}$ is significant for performance. We explore two methods:

Distant negatives follows DPR (Karpukhin et al., 2020) in using distantly obtained negative passages as $\mathcal{B}^{-}$. We obtain two distant negative passages per question: one hard negative, a top prediction from REALM without finetuning, and one random negative, drawn from a uniform distribution, both not containing the answer.

Full negatives considers all passages in Wikipedia except $p^{+}$ as $\mathcal{B}^{-}$, and instead freezes the passage encoder $f_p$ and only finetunes the question encoder $f_q$. This is appealing because (a) the number and quality of the negatives, both significant factors for training, are at their strict maximum, and (b) $f_p$ from REALM is already strong, producing high-quality passage representations without finetuning. Implementing this method is feasible by exploiting extensive model parallelism.

We use distant negatives for multi-answer datasets and full negatives for NQ as this combination gave the best result.
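A framework-free sketch of the contrastive objective above; plain Python lists stand in for encoder outputs, whereas the actual implementation uses Mesh TensorFlow.

```python
import math

def dpr_plus_loss(q_vec, pos_vec, neg_vecs):
    """Softmax over inner products of the question vector with the positive
    passage and the negative passages; return the negative log-likelihood
    of the positive passage."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    scores = [dot(q_vec, pos_vec)] + [dot(q_vec, n) for n in neg_vecs]
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[0] - log_norm)      # -log softmax probability of the positive
```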

Appendix B Experiment Details

B.1 Data processing for TREC

TREC from Baudiš and Šedivỳ (2015) contains regular expressions as answers. We approximate a set of semantically distinct answers as follows. We first run the regular expressions over Wikipedia to detect valid answer text. If no valid answer is found in Wikipedia, or there are more than 100 valid answers (in most such cases, the regular expressions are extremely permissive), we discard the question. We then only keep answers with up to five tokens, following the notion of short answers from Lee et al. (2019). Finally, we group the answers that are identical after normalization and white space removal. We find that this gives a reasonable approximation of a set of semantically distinct answers. Note that the data we use is a subset of the original data because we discarded a few questions. Statistics are reported in Section 4.1.

Here is an example: a regular expression from the original data is Long Island|New\s?York|Roosevelt Field. All matching answers over Wikipedia include roosevelt field, new york, new\xa0york, new\nyork, newyork, long island. Once the grouping is done, we have three semantically distinct answers: (1) roosevelt field, (2) new york|new\xa0york|new\nyork|newyork, and (3) long island.
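A sketch of the normalization and grouping step; the exact normalization below is our approximation of the procedure described above.

```python
import re
import unicodedata

def normalize(ans: str) -> str:
    """Normalize an answer string and drop whitespace, so that surface variants
    such as 'new york', 'new\xa0york' and 'newyork' collapse to one key."""
    ans = unicodedata.normalize("NFKC", ans).lower()
    return re.sub(r"\s+", "", ans)

def group_distinct_answers(matched_answers: list) -> list:
    """Group regex matches found over Wikipedia into semantically distinct answers."""
    groups = {}
    for ans in matched_answers:
        groups.setdefault(normalize(ans), []).append(ans)
    return list(groups.values())
```

On the example above, this yields the three groups: roosevelt field, the new york variants, and long island.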

B.2 Details of reranker training

All implementations are based on TensorFlow (Abadi et al., 2015) and Mesh TensorFlow (Shazeer et al., 2018). All experiments are run on Google Cloud TPUs. We use the maximum batch size that fits one instance of TPU v3-32 (for WebQSP and AmbigQA) or TPU v3-8 (for TREC). We use the same batch size for IndepPR; for Nogueira et al. (2020), we use a batch size of 1024. We use an encoder length of 360 and a decoder length of $k$ (JPR) or 1 (all others). We use $k=\{5,10\}$ for all experiments. We train JPR with $\gamma=\{0,0.5,1.0,1.5\}$ and choose the value with the best accuracy on the development data. We use a flat learning rate of $1\times10^{-3}$ with warm-up for the first 500 steps. Full hyperparameters are reported in Table 9.

Dataset | $k$ | $B$ | # train steps | $\gamma$ | $\beta$
WebQSP  | 5   | 256 | 10k           | 1.5      | 3.0
WebQSP  | 10  | 224 | 10k           | 1.5      | 1.5
AmbigQA | 5   | 256 | 6k            | 1.0      | 2.5
AmbigQA | 10  | 224 | 6k            | 1.0      | 2.0
TREC    | 5   | 64  | 3k            | 1.5      | 1.5
TREC    | 10  | 56  | 3k            | 1.5      | 2.0
Table 9: Full hyperparameters for training JPR.
Algorithm 2 An algorithm to obtain $\tilde{\mathcal{O}}$ from the answer set and $\mathcal{B}$.
1: procedure Preproc($k$, $\{a_1...a_n\}$, $\mathcal{B}$)
2:     $\tilde{\mathcal{O}} \leftarrow \emptyset$  // a set of positive passages
3:     $\mathcal{A}_{\text{left}} \leftarrow \{a_1...a_n\}$
4:     for $b$ in $\mathcal{B}$ do
5:         if $b$ covers any of $\mathcal{A}_{\text{left}}$ then
6:             $\tilde{\mathcal{O}} \leftarrow \tilde{\mathcal{O}}.\mathrm{add}(b)$
7:             $\mathcal{A}_{\text{left}} \leftarrow \mathcal{A}_{\text{left}} -$ answers in $b$
8:         if $|\tilde{\mathcal{O}}| == k$ then
9:             break
10:     return $\tilde{\mathcal{O}}$
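A runnable counterpart of Algorithm 2; `covers` is a placeholder answer-matching predicate.

```python
def preproc_positives(k, answers, candidates, covers):
    """Greedily collect up to k candidate passages from B, each covering at least
    one answer not covered by previously collected passages."""
    positives = []
    answers_left = set(answers)
    for b in candidates:
        found = {a for a in answers_left if covers(b, a)}
        if found:
            positives.append(b)
            answers_left -= found
        if len(positives) == k:
            break
    return positives
```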
$k$ | Models  | WebQSP Dev | WebQSP Test | AmbigQA Dev | TREC Dev  | TREC Test
5   | IndepPR | 62.4/59.0  | 65.1/60.9   | 73.6/69.5   | 70.7/61.1 | 74.9/66.4
5   | JPR     | 69.5/67.9  | 69.1/65.8   | 73.7/70.0   | 69.8/61.4 | 74.7/66.8
10  | IndepPR | 60.1/57.2  | 61.0/57.4   | 73.6/69.5   | 66.4/60.3 | 68.9/61.5
10  | JPR     | 70.3/67.2  | 68.9/65.4   | 73.7/69.4   | 70.1/62.6 | 74.3/66.2
Table 10: Results on passage retrieval in $\alpha$-NDCG.
Figure 4: IndepPR vs. JPR on the development data of three datasets. MRecall @ 5 is reported. Lines indicate % of questions in the data. JPR benefits more on questions with 2+ distinct answers.

For training IndepPR and JPR, instead of using all $|\mathcal{B}|$ passages, we use $|\mathcal{B}|/4$ passages by sampling $k$ positive passages and $|\mathcal{B}|/4-k$ negative passages. We find that this trick allows a larger batch size on the same hardware, ultimately leading to substantial performance gains. We also find that assigning passage indexes based on a prior, e.g., the ranking from dense retrieval, leads to significant bias, e.g., in 50% of the cases the top-1 passage from dense retrieval contains a correct answer. We therefore assign the indexes randomly, which gives significantly better performance.

Algorithm 2 describes how the set of positive passages $\tilde{\mathcal{O}}$ used in Section 3.2 is computed during preprocessing.

B.3 Details of answer generation training

We train the models using a batch size of 32. We use a decoder length of 20 and 40 for NQ and the multi-answer datasets, respectively. We only decode answers that appear in the retrieved passages, as we want the generated answers to be grounded in Wikipedia passages. Answers in the output sequence follow the order in which they appear in the passages, except on WebQSP, where shuffling the order of the answers improves accuracy. All other training details are the same as for reranker training.

Appendix C Additional Results

We additionally report retrieval performance in $\alpha$-NDCG @ $k$, one of the metrics for diverse retrieval in IR (Clarke et al., 2008; Sakai and Zeng, 2019). It is a variant of NDCG (Järvelin and Kekäläinen, 2002), but penalizes retrieval of the same answer. We refer to Clarke et al. (2008) for a complete definition. We use $\alpha=0.9$.

Results are reported in Table 10. JPR consistently outperforms IndepPR across all datasets, although the gains are less significant than the gains in MRecall. We report $\alpha$-NDCG following the IR literature, but we treat MRecall as the primary metric, because $\alpha$-NDCG does not use an explicit notion of completeness of retrieving all answers. It is also a less strict measure than recall because it gives partial credit for retrieving a subset of the answers.

Gains with respect to the number of answers

Figure 4 shows gains over IndepPR on three datasets with respect to the number of answers. Overall, gains are larger when the number of answers is larger, especially for WebQSP and TREC. For AmbigQA, the largest gains are when the number of answers is 2, which is responsible for over half of multi-answer questions.