
A Survey for Efficient Open Domain Question Answering

Qin Zhang1, Shangsi Chen1, Dongkuan Xu2, Qingqing Cao3,
Xiaojun Chen1, Trevor Cohn4, Meng Fang5
1
Shenzhen University; 2North Carolina State University; 3University of Washington;
4The University of Melbourne; 5University of Liverpool
[email protected]; [email protected]; [email protected]; [email protected];
[email protected]; [email protected]; [email protected]
Abstract

Open domain question answering (ODQA) is a longstanding task in natural language processing (NLP) that aims to answer factual questions from a large knowledge corpus without any explicit evidence being provided. Recent works have predominantly focused on improving answering accuracy and have achieved promising progress. However, higher accuracy often comes with more memory consumption and inference latency, which might not necessarily be efficient enough for direct deployment in the real world. Thus, a trade-off between accuracy, memory consumption and processing speed is pursued. In this paper, we provide a survey of recent advances in the efficiency of ODQA models. We walk through the ODQA models and summarize the core techniques for efficiency. Quantitative analyses of memory cost, processing speed and accuracy, together with an overall comparison, are given. We hope that this work will keep interested scholars informed of the advances and open challenges in ODQA efficiency research, and thus contribute to the further development of efficient ODQA.


1 Introduction

Open domain question answering (ODQA) Voorhees and Tice (2000) is a longstanding task in natural language processing (NLP) that answers factoid questions from a large corpus of knowledge such as Wikipedia Wikipedia (2004) or BookCorpus Zhu et al. (2015). Whereas traditional QA models take as part of their input an explicit evidence text in which the answer is located, ODQA models must quickly process large amounts of knowledge to answer the input question. Compared to search engines, ODQA models aim to provide better user-friendliness and efficiency by presenting the final answer to a question directly, rather than returning a list of relevant snippets or hyperlinks Zhu et al. (2021).

ODQA has been studied widely in recent years, and a classic ODQA framework comprises an information retriever (IR) and a reader, i.e., Retriever-Reader Chen et al. (2017). The task of the IR is to retrieve evidence-related text pieces from the large knowledge corpus. Popular IR methods include TF-IDF Chen et al. (2017), BM25 Mao et al. (2021) and DPR (dense passage retriever) Karpukhin et al. (2020). The goal of the reader is to understand and reason over the retrieved evidence to yield the answer. It is often implemented with transformer-based language models, such as BERT Devlin et al. (2019), RoBERTa Liu et al. (2019), ALBERT Lan et al. (2019), or sequence-to-sequence generators such as T5 Raffel et al. (2020), BART Lewis et al. (2020a) and GPT Brown et al. (2020). This two-module system enjoys a broad range of applications Zhu et al. (2021).
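
As a concrete illustration of the retriever side of this pipeline, the sketch below scores passages for a question with a DPR-style dual encoder using the public Hugging Face checkpoints; it is a minimal example, and in a real system the passage embeddings are pre-computed offline and stored in an ANN index rather than encoded per query.

```python
# Minimal sketch of DPR-style dual-encoder retrieval (illustrative only).
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
p_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
p_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

passages = ["Paris is the capital and largest city of France.",
            "The Nile is a major river in northeastern Africa."]
question = "What is the capital of France?"

with torch.no_grad():
    p_emb = p_enc(**p_tok(passages, return_tensors="pt", padding=True)).pooler_output  # (N, 768)
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output                # (1, 768)

scores = q_emb @ p_emb.T          # inner-product relevance scores
best = int(scores.argmax())       # index of the top passage
print(passages[best])
```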

However, most general-purpose ODQA models are computationally intensive, slow at inference, and expensive to train. One reason is the huge index/document size (see Table 2). Concretely, a corpus typically contains millions of long-form articles that need to be encoded and indexed for evidence retrieval. For example, Karpukhin et al. (2020) processed an English Wikipedia corpus of 21 million passages and built a dense index with a size of 65GB. Besides, the majority of general-purpose ODQA models are built on large pre-trained language models, which often contain millions of parameters. For instance, the state-of-the-art ODQA models on the Natural Questions dataset, R2-D2 Fajcik et al. (2021) and UnitedQA Cheng et al. (2021), have 1.29 billion and 2.09 billion model parameters, respectively. Storing the corpus index and pre-trained language models is memory-intensive Xia et al. (2022). As a result, evidence retrieval and reading are memory- and time-consuming, which makes real-time use of general-purpose ODQA models in the real world, such as on a mobile phone, a big challenge Seo et al. (2019).

Figure 1: The general pipeline of ODQA models is shown in (I), along with three different ODQA frameworks: Retriever-Reader (a), Retriever-Only (b), Generator-Only (c). Specifically, the retriever refers to information retrieval models (e.g., DPR Karpukhin et al. (2020)). The reader can be an extractive QA model (e.g., BERT Devlin et al. (2019)), or a generative QA model (e.g., T5 Roberts et al. (2020)). The retrieved candidate evidences can be in the form of documents, passages, sentences or phrases where the answer to the question can be found.

To address this challenge, various trade-offs have been explored in building ODQA models that meet real-world application needs, such as trade-offs among accuracy, memory consumption and inference speed Izacard et al. (2020); Wu et al. (2020); Mao et al. (2021). NeurIPS 2020 organized an EfficientQA competition Min et al. (2021), aiming to build open domain question answering systems that can predict correct answers while also satisfying strict on-disk memory budgets. For this purpose, a line of work has focused on building more efficient protocols: besides Retriever-Reader, Retriever-Only Lee et al. (2021b); Lewis et al. (2021) and Generator-Only Roberts et al. (2020); Lewis et al. (2020a) are newly proposed protocols (see Fig. 1). Various efficiency techniques have also been developed to achieve the desired reductions, such as index size reduction Yamada et al. (2021); Lewis et al. (2022), fast searching Lewis et al. (2021); Malkov and Yashunin (2020), omission of evidence retrieval or reading Roberts et al. (2020); Brown et al. (2020); Seonwoo et al. (2022); Lee et al. (2021b), and model size reduction Yang and Seo (2021); Singh et al. (2021).

In this survey, we provide a comprehensive introduction to the broad range of methods that aim to improve efficiency, with a focus on the ODQA task. In Section 2, we overview general-purpose ODQA models and discuss their strategies and limitations in terms of efficiency. In Section 3, we first walk through the key ODQA models which concentrate on efficiency, then summarize the core techniques used. Section 4 gives a quantitative analysis with an overall comparison of different frameworks along three specific aspects, i.e., memory cost, processing speed and accuracy. Finally, in Section 5, we discuss the remaining challenges, followed by the conclusion in Section 6.

To provide a practical guide to efficiency for ODQA researchers and users, we group resource limitations into two categories: (1) if the main limitation is processing speed, readers can refer to the methods introduced in Section 3.1.1; (2) if the bottleneck is storage/memory, the methods described in Section 3.1.2 are the most relevant. Further, readers who want a quantitative comparison of the state-of-the-art methods and the trade-offs between efficiency and accuracy will find it in Section 4. In general, for researchers who are interested in improving the state of the art in efficiency methods for the ODQA task, this survey can serve as an entry point to find opportunities for new research directions.

Related Surveys. ODQA has been discussed and summarized with a broad overview of techniques for NLP in several survey papers. However, they focus more on deep neural models for improving ODQA performance. Specifically, the survey by Huang et al. (2020) introduces deep learning based ODQA models proposed in the early years, which are mainly based on LSTMs or CNNs; modern Transformer-based ODQA models are not included. Zhu et al. (2021) provide a comprehensive literature review of ODQA models, with particular attention to techniques incorporating neural machine reading comprehension models. Guo et al. (2022) focus on the semantic models of first-stage retrieval. Shen et al. (2022) pay more attention to how to train dense retrievers effectively with less annotated training data. Treviso et al. (2022) review efficient methods in natural language processing (NLP), mainly involving upstream generic pre-trained language models and training methods. Etezadi and Shamsfard (2022) mainly concentrate on a comparison of ODQA methods for complex question answering. To the best of our knowledge, there is no survey summarizing ODQA methods from the efficiency perspective so far, which motivates us to overview efficient ODQA models in this paper.

Figure 2: Typology of ODQA systems and their main concerns in terms of accuracy, memory size and processing speed.

2 Overview of ODQA models

In this section, we introduce ODQA models by grouping them into three typical frameworks (see Fig. 1): Retriever-Reader, Retriever-Only and Generator-Only. Retriever-Reader models include two modules: a retriever to select the most related documents from a large corpus and a reader to yield the exact answer according to the selected evidence. According to the way answers are obtained, the reader in retriever-reader ODQA models can be further divided into two categories: extractive readers and generative readers. Extractive readers normally answer the question using a span from the context, and the goal is to classify the start and end positions of the answer in the retrieved evidence Devlin et al. (2019); Karpukhin et al. (2020). Generative readers are not restricted to the input context, and freely generate answers by autoregressively predicting tokens Raffel et al. (2020); Izacard and Grave (2021). Different from retriever-reader models, retriever-only models tackle ODQA tasks with a single retriever by converting long corpus documents into shorter phrases or QA pairs. Generator-only models directly generate answers for ODQA tasks, without any evidence retrieval or reading. To guide the reader, we present a diagram with the typology of ODQA methods considered in this section in Fig. 2, where their main concerns are also indicated. (We collected the available code of the methods discussed in this paper and provide the hyperlinks at https://github.com/hyintell/EfficientODQA for convenience.)

2.1 Retriever-Reader

Traditional ODQA models consist of many components such as question processing, document/passage retrieval and answer processing Ferrucci et al. (2010); Baudiš (2015). DrQA Chen et al. (2017) first simplified the traditional multi-component ODQA models into a two-stage retriever-reader framework. Most recent works follow this framework and further supersede the TF-IDF-based retriever, the CNN-based reader, or both modules in DrQA with stronger transformer-based models, such as BERT, T5 and BART Yang et al. (2019); Wu et al. (2020); Karpukhin et al. (2020); Guu et al. (2020); Mao et al. (2021); Izacard and Grave (2020); Singh et al. (2021). Further, the reader in retriever-reader ODQA models can be classified into extractive and generative readers according to how they obtain answers. We summarize these models in the following paragraphs.

Retriever&Extractive-Reader models find the answer by predicting the start and end positions of the answer span in the retrieved evidence. The extractive reader is usually implemented with a BERT-based model, and several works integrate such a reader into ODQA models to locate exact answers to open domain questions. For example, BERTserini Yang et al. (2019) and Skylinebuilder Wu et al. (2020) substitute the CNN reader in DrQA with a BERT-based reader. ORQA (Open Retrieval Question Answering system) Lee et al. (2019) supersedes both modules in DrQA with BERT-based models. It first pre-trains the retriever with an unsupervised Inverse Cloze Task (ICT) Lee et al. (2019), achieving end-to-end joint learning between the retriever and the reader. Further, DPR (Dense Passage Retriever) Karpukhin et al. (2020) directly leverages pre-trained BERT models to build a dual-encoder retriever without additional pre-training. It trains the retriever through contrastive learning as well as well-conceived negative sampling strategies. DPR has become a strong baseline in both the ODQA (Open Domain Question Answering) and IR (Information Retrieval) domains. Based on DPR, RocketQA Qu et al. (2021) trains the dual-encoder retriever using novel negative sampling methods: cross-batch negatives for multi-GPU training, and well-trained cross-encoders that select high-confidence negatives from the top-k retrievals of a dual-encoder as denoised hard negatives. Meanwhile, RocketQA constructs new training examples from a collection of unlabeled questions, using a cross-encoder to synthesize passage labels. RocketQA-v2 Ren et al. (2021) extends RocketQA by incorporating a joint training approach for both the retriever and the re-ranker via dynamic listwise distillation.
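
The sketch below illustrates the extractive reading step on a single retrieved passage: the reader scores every token as a possible answer start and end, and the highest-scoring span is returned. The checkpoint name is only an example of a SQuAD-finetuned reader, not a model used by the surveyed systems.

```python
# Minimal sketch of extractive span prediction with a BERT-style reader.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tok = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
reader = AutoModelForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

question = "What is the capital of France?"
passage = "Paris is the capital and largest city of France."
inputs = tok(question, passage, return_tensors="pt")

with torch.no_grad():
    out = reader(**inputs)

start = int(out.start_logits.argmax())        # most likely answer start token
end = int(out.end_logits.argmax())            # most likely answer end token
answer = tok.decode(inputs["input_ids"][0, start:end + 1])
print(answer)                                  # expected: "Paris"
```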

Retriever&Generative-Reader models directly generate free-form textual answers taking the question and retrieved evidence as input. Some ODQA models adopt this generative reader and explore diverse fusion methods between the retriever and the generative reader. For instance, FiD (Fusion-in-Decoder) Izacard and Grave (2021) is one of the typical methods under this framework. It takes the evidence retrieved by BM25 or DPR Karpukhin et al. (2020) as input to the generative reader T5 Roberts et al. (2020). It aggregates the multiple pieces of evidence by concatenating their representations, and decodes the merged representation to generate the answer. FiD-KD Izacard and Grave (2020) integrates knowledge distillation (KD) into FiD to perform iterative training between the retriever and the generator. RAG (Retrieval-Augmented Generation) Lewis et al. (2020b) trains the retriever and the generator in an end-to-end fashion by viewing the retrieved evidence as a latent variable. Further, EMDR2 (End-to-end training of Multi-Document Reader and Retriever) Singh et al. (2021) adopts mutual supervision and performs end-to-end training between a dual-encoder retriever and a T5-based generator.
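
The following sketch illustrates the fusion-in-decoder idea with a plain T5 checkpoint: each question-passage pair is encoded independently, the encoder outputs are concatenated, and a single decoding pass generates the answer over the merged representation. It is a simplified illustration of the mechanism under these assumptions, not the released FiD model.

```python
# Minimal sketch of the Fusion-in-Decoder mechanism (illustrative only).
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "question: What is the capital of France?"
passages = ["context: Paris is the capital of France.",
            "context: France is a country in Western Europe."]

# Encode each (question, passage) pair separately.
encoded = []
for p in passages:
    ids = tok(f"{question} {p}", return_tensors="pt")
    encoded.append(model.encoder(**ids).last_hidden_state)

# Fuse: concatenate along the sequence dimension, then decode once.
fused = torch.cat(encoded, dim=1)
answer_ids = model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=fused), max_length=16)
print(tok.decode(answer_ids[0], skip_special_tokens=True))
```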

In summary, transformer-based ODQA models like DPR, RocketQA and FiD may face a common challenge: training the retriever relies on strong supervision in the form of question and positive-document pairs Izacard and Grave (2020). Unfortunately, many ODQA datasets and applications lack such labeled pairs in sufficient quantity. To this end, many researchers have turned to end-to-end training between the retriever and the reader Lee et al. (2019); Guu et al. (2020); Singh et al. (2021); Izacard and Grave (2020, 2021). Meanwhile, other works Karpukhin et al. (2020); Qu et al. (2021); Ren et al. (2021) focus on constructing more training examples for the retriever through effective data augmentation or sampling techniques and achieve stronger empirical performance.

Dual-encoder retrievers like DPR encode questions and documents independently, ignoring the interaction between questions and documents and limiting their retrieval performance Khattab and Zaharia (2020); Humeau et al. (2020); Khattab et al. (2021); Lu et al. (2022). To remedy this issue, ColBERT Khattab and Zaharia (2020) adds late interaction between token embeddings on top of a dual-encoder, and ColBERT-QA Khattab et al. (2021) applies it to the ODQA domain to gain better performance.

Retriever-reader ODQA methods generally obtain good performance. However, due to dense representations for corpus passages and longer evidence for answer reasoning, retriever-reader ODQA models normally suffer from a larger index size and a slower processing speed. We introduce techniques to reduce the index size and speed up inference in Section 3.

2.2 Retriever-Only

Retriever-only systems tackle ODQA tasks with a single retriever, eliminating the reading or generating step. One typical category of Retriever-Only ODQA models is phrase-based systems Seo et al. (2019); Lee et al. (2020, 2021b, 2021c). They first split corpus documents into phrases and then directly retrieve related phrases. The phrase with the highest relevance to the question is output as the predicted answer. The key to phrase-based ODQA methods is how to represent the phrases. DenSPI (Dense-Sparse Phrase Index) Seo et al. (2019) is one of the representative methods for the Retriever-Only framework. It represents phrases with both dense and sparse vectors; the dense vectors are obtained by a BERT encoder and the sparse ones by the TF-IDF method. Further, DenSPI+Sparc Lee et al. (2020) achieves dynamic learning of sparse representations through a rectified self-attention mechanism, and uses them to replace the original static ones used in DenSPI. Conversely, DensePhrases Lee et al. (2021b) omits the sparse representations of phrases and only leverages the dense ones. TQR Sung et al. (2022) further pushes the performance of the DensePhrases system to a higher level by refining questions at test time.

Different from phrase-based ODQA models, Lewis et al. (2021); Seonwoo et al. (2022) first process a Wikipedia corpus into a knowledge base (KB) of question-answer (QA) pairs, using a generator model to create the QA pairs. Then, based on the QA-pair KB, they handle ODQA tasks by directly retrieving the most similar question and returning the answer of the retrieved question as the final answer to the input question. A large KB with 65 million QA pairs, i.e., Probably Asked Questions (PAQ), was recently created by Lewis et al. (2021). Based on PAQ, RePAQ Lewis et al. (2021) is designed to solve the ODQA task by retrieving the most similar QA pairs. SQuID Seonwoo et al. (2022) further improves the performance of RePAQ by adopting two dual-encoder retrievers.

To conclude, in retriever-only ODQA models, the omission of the reading/generating step greatly improves the speed of answering questions. But there are also a few limitations of Retriever-Only ODQA models: (1) lower performance on average compared to Retriever-Reader ODQA models, since less information is considered during answer inference; (2) high storage requirements for the indexes of fine-grained retrieval units such as phrases or QA pairs. For instance, the index of 65M QA pairs (220GB) is much larger than that of 21M passages (65GB).

Figure 3: Question processing time and memory cost for DPR on NQ test dataset. The experiment was tested on an Nvidia GeForce Rtx 2080 Ti GPU and shows the average results over 1000 examples.

2.3 Generator-Only

Generator-only ODQA models are normally based on a single generator, mainly seq2seq generative language models like T5 Roberts et al. (2020), GPT Brown et al. (2020) and BART Lewis et al. (2020a). They are pre-trained on large corpora such as Wikipedia and store the main knowledge of the corpus in their model parameters. Thus, they can directly generate answers based on this internal knowledge and skip the evidence retrieval process.

In this way, they achieve lower memory cost and shorter processing time than two-stage systems such as the Retriever-Reader and retriever-generator frameworks. However, the performance of generator-only ODQA methods is currently unsatisfactory. For example, on the Natural Questions dataset, the best generator-only method, GAR_generative Mao et al. (2021), achieves only 38.10% exact match (EM), while the retriever-reader method R2-D2_reranker Fajcik et al. (2021) reaches 55.9% (see Table 2). Another limitation of generator-only ODQA models is the uncontrollability of the results they generate, since generative models distribute corpus knowledge across their parameters in an inexplicable way and hallucinate realistic-looking answers when they are unsure Roberts et al. (2020). Additionally, since real-world knowledge is updated routinely, the huge training cost of generative language models makes it laborious and impractical to keep them up-to-date or retrain them frequently. Meanwhile, billions of parameters make them storage-unfriendly and hard to deploy on resource-constrained devices Roberts et al. (2020).

3 Efficient ODQA Models and Techniques

In this section, we first walk through the key ODQA models which concentrate on efficiency, and discuss their strengths, weaknesses and unique characteristics in Section 3.1. Then we summarize the core techniques used in these models to improve the efficiency of ODQA, from the data and model perspectives respectively, in Section 3.2.

Before we start, we take DPR on the NQ test set as an example to show the time each module needs during inference and the detailed memory costs in Fig. 3. The average total processing time DPR needs is 0.91s (the passages in the corpus are embedded offline), where the inference speed is mainly determined by the evidence searching (74.79%) and evidence reading (23.95%) modules. The total memory cost of DPR is 79.32GB, which is huge. The indexes take up 81.95% of the memory and the raw corpus takes 16.39%; the remaining 1.66% is for the models, where the retriever model is around twice the size of the reader model.

Based on these observations, efforts to improve the efficiency of ODQA models focus on reducing processing time and reducing memory cost. To reduce processing time, we can accelerate evidence searching and reading. To reduce memory cost, we can reduce the size of the index and the model. Besides, some works blaze new trails, such as generating answers directly to omit evidence retrieval, or retrieving answers directly to omit evidence reading. We introduce the details below.

3.1 Walk-through of Efficient ODQA Models

In this subsection, we delve into the details of efficient ODQA models. We categorize them into three classes according to how they achieve efficiency: reducing processing time, reducing memory cost, and blazing new trails.

3.1.1 Reducing Processing Time

The processing time of ODQA involves the time spent in three stages: question embedding, evidence searching and evidence reading. Since evidence searching and evidence reading occupy most of the processing time, research mainly focuses on reducing the time cost of these two stages.

By Accelerating Evidence Searching. Other than the traditional brute-force search Zhan et al. (2021), approximate nearest neighbor (ANN) search Johnson et al. (2021) and hierarchical navigable small world graphs (HNSW) Malkov and Yashunin (2020) have become increasingly popular due to their fast search. For example, the HNSW technique Malkov and Yashunin (2020) is adopted by DPR Yamada et al. (2021) and RePAQ Lewis et al. (2021), and brings much faster search without a significant decline in retrieval accuracy. However, the negative effect HNSW brings is a larger index: adding an HNSW module to DPR increases the index size from 65GB to 151GB Yamada et al. (2021).
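
A minimal sketch of HNSW-based approximate search with FAISS is shown below on random stand-in embeddings; the number of graph links per node is what trades index size for recall, which is why HNSW enlarges the DPR index.

```python
# Minimal sketch of HNSW approximate search with FAISS (random data for illustration).
import numpy as np
import faiss

d = 768                                                   # embedding dimension
passages = np.random.rand(100_000, d).astype("float32")   # stand-in passage embeddings

index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # 32 links per node
index.hnsw.efSearch = 128                                 # search-time effort vs. speed
index.add(passages)

query = np.random.rand(1, d).astype("float32")
scores, ids = index.search(query, 10)                     # approximate top-10 passages
```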

Besides, Product Quantization (PQ) Jégou et al. (2011), Locality Sensitive Hashing (LSH) Neyshabur and Srebro (2015) and Inverted File (IVF) Sivic and Zisserman (2003) are all efficient methods to speed up search Yamada et al. (2021); Lewis et al. (2022), but they often lead to a significant drop in retrieval accuracy Yamada et al. (2021); Lewis et al. (2021, 2022). Concretely, PQ focuses on reducing the embedding dimension, while LSH and IVF concentrate on approximate index searching. LSH generates the same hashkey for similar embeddings through suitable hash functions, and evidence retrieval is then based on hashkeys Wang et al. (2022). IVF constructs two-level indices using k-means clustering Lewis et al. (2022). Different from LSH, which can reduce the index size, IVF does not achieve this goal.
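
The sketch below combines IVF and PQ with FAISS on random stand-in embeddings: IVF restricts the search to a few clusters per query, and PQ compresses each vector to a few bytes; both approximations speed up search at some cost in retrieval accuracy.

```python
# Minimal sketch of IVF + PQ approximate search with FAISS (random data for illustration).
import numpy as np
import faiss

d, nlist, m = 768, 1024, 96                       # dim, #IVF clusters, #PQ sub-vectors
passages = np.random.rand(200_000, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)                  # coarse quantizer for the IVF lists
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits/sub-vector -> 96 bytes per passage
index.train(passages)                             # learn centroids and PQ codebooks
index.add(passages)

index.nprobe = 32                                 # clusters scanned per query
query = np.random.rand(1, d).astype("float32")
scores, ids = index.search(query, 10)
```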

Compared to LSH and IVF, the Learned Index for large-scale DEnse passage Retrieval (LIDER) Wang et al. (2022) makes a trade-off between search speed and retrieval accuracy by dynamically learning a corpus index during training. It achieves faster search with a smaller drop in retrieval accuracy compared to PQ and IVF, by predicting locations from the learned key-location distribution of the dataset. Specifically, LIDER builds two-level indices in a similar way to IVF. It further maps the documents into hashkeys using LSH and sorts them by hashkey. Meanwhile, the hashkeys are also used to train a multi-layer linear regression model that predicts the location of a hashkey in the sorted indexes. During inference, given a query embedded by DPR Karpukhin et al. (2020), LIDER first calculates its hashkey and finds its c nearest centroids. With these centroids, LIDER then searches the top-p nearest evidences in each subset in parallel. Finally it merges all the retrieved evidences and selects the top-k ones as output. In conclusion, LIDER is a powerful, efficient and practical method for evidence searching in the ODQA domain.

By Accelerating Evidence Reading. Accelerating the evidence reading module is another effective way to speed up the question processing of ODQA models. In fact, among the retrieved evidence, a high percentage of the content is not pertinent to the answer Min et al. (2018). However, the reader module still allocates the same amount of computation to this content, which involves many unnecessary computations and prolongs inference latency Wu et al. (2020). Thus, a jumping reading strategy has been proposed, and studies have found that it brings a certain inference speedup Wu et al. (2020); Guan et al. (2022). Concretely, the jumping reading strategy dynamically identifies less relevant text blocks at each layer of computation by calculating an importance score for each text block. Blocks with low scores (e.g., lower than a pre-defined threshold) are stopped early.

For example, adaptive computation (AC) Bengio et al. (2015); Graves (2016) is an efficient method that improves reading efficiency following the jumping reading strategy by manipulating how computation is allocated over the model input Wu et al. (2020, 2021). SkyLineBuilder Wu et al. (2020) applies AC to an extractive reader and dynamically decides which passage to allocate computation to at each layer during reading. Further, the AC strategy has been considered in Retriever-Generator ODQA models, such as the Adaptive Passage Encoder (APE) Wu et al. (2021), which combines the Fusion-in-Decoder (FiD) system with the AC strategy. In APE, the AC strategy is used to make the generator's encoder stop early on evidence that is less likely to include the answer.

Meanwhile, inspired by the idea of passage filtering before retrieval Yang and Seo (2021), Block-Skim Guan et al. (2022) skips question-irrelevant text blocks to optimize reading speed. It first slices an input sequence into text blocks of a fixed length. A CNN module is utilized to compute the importance score of each block in each transformer layer, and the unimportant blocks are then skipped. The CNN module takes the attention weights of each transformer layer as input and outputs the importance score of each block. Thus the transformer layers handle progressively less context, which leads to less computing overhead and faster inference. Block-Skim achieves an average 2.56x inference speedup over BERT-base models with little loss of accuracy on multiple extractive QA datasets. This suggests that BERT-based retriever-reader ODQA models could be optimized with Block-Skim to speed up their inference.
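
A toy version of this jumping-reading idea is sketched below: the input is split into fixed-length blocks, each block gets an importance score, and low-scoring blocks are dropped before the next layer. Block-Skim itself learns a CNN scorer over attention maps per layer; the mean-attention proxy and the threshold here are only for illustration.

```python
# Toy sketch of jumping reading / block skipping (illustrative scorer and threshold).
import torch

def skim_blocks(hidden, attn, block_len=32, threshold=1.0 / 256):
    """hidden: (seq, dim) layer input; attn: (heads, seq, seq) attention weights."""
    seq_len = hidden.size(0)
    keep = []
    for start in range(0, seq_len, block_len):
        end = min(start + block_len, seq_len)
        score = attn[:, :, start:end].mean()      # attention mass received by the block
        if score >= threshold:
            keep.append(torch.arange(start, end))
    idx = torch.cat(keep) if keep else torch.arange(min(block_len, seq_len))
    return hidden[idx], idx                       # only kept tokens reach the next layer

hidden = torch.randn(256, 768)
attn = torch.softmax(torch.randn(12, 256, 256), dim=-1)
pruned, kept = skim_blocks(hidden, attn)
print(hidden.shape, "->", pruned.shape)
```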

3.1.2 Reducing Memory Cost

For ODQA models, there are three kinds of memory cost: index memory cost, model memory cost and raw corpus memory cost. Normally, reducing the index size and reducing the model size are the two ways to achieve storage efficiency, while reducing the raw corpus size normally results in some loss of the knowledge source and leads to a significant drop in performance Yang and Seo (2021). (The raw corpus can, however, easily be compressed, at the cost of decompression when needed; for some fast compression methods this cost is similar to reading the file from disk.)

By Reducing Index Size. The index of a corpus accounts for a major proportion of the memory cost when running an ODQA system. The evidence searching module, which is strongly related to the index size, is also the module that takes the most time during inference. Thus, reducing the index size is key to improving the efficiency of ODQA models. A line of research has been proposed to achieve this goal. BPR Yamada et al. (2021) and DrBoost Lewis et al. (2022) are representative works in this direction, where BPR reduces the index size by sacrificing data precision and DrBoost downsizes the index by compacting the embedding dimension Lewis et al. (2022).

Specifically, BPR Yamada et al. (2021) reduces the index size through a learning-to-hash technique Cao et al. (2017); Wang et al. (2018), hashing continuous passage vectors into compact binary codes, in contrast to DPR Karpukhin et al. (2020), which uses dense continuous embeddings of corpus passages. It achieves a much smaller index, from 65GB down to 2GB, with competitive performance. In detail, BPR designs a hash layer with a scaled tanh function on top of the DPR retriever to reduce the index size. It optimizes the search efficiency of the retriever while maintaining accuracy through multi-target joint learning of evidence retrieval and reranking. During retrieval, the top-c evidence passages are retrieved using the Hamming distance between binary codes. Then, the retrieved evidence is reranked by maximum inner product search (MIPS) Shrivastava and Li (2014); Guo et al. (2016) between the dense query vector and the passage binary codes. Finally the top-k evidences are output, where k is much smaller than c.
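
The two-stage retrieval described above can be sketched as follows with random stand-in vectors: binary codes obtained by a sign function serve for cheap Hamming-distance candidate generation, and the dense query vector reranks the candidates. Real BPR trains the hash layer (a scaled tanh) end-to-end; this is only an illustration of the search procedure.

```python
# Minimal sketch of BPR-style binary retrieval followed by dense reranking.
import numpy as np

rng = np.random.default_rng(0)
p_dense = rng.standard_normal((100_000, 768)).astype("float32")   # passage embeddings
q_dense = rng.standard_normal(768).astype("float32")              # query embedding

p_codes = np.sign(p_dense)           # {-1, +1} codes: 1 bit per dim instead of 32
q_code = np.sign(q_dense)

# Stage 1: candidate generation, top-c by Hamming distance between codes.
hamming = (p_codes != q_code).sum(axis=1)
candidates = np.argsort(hamming)[:1000]

# Stage 2: rerank candidates by inner product between the dense query vector
# and the candidates' binary codes, keep the top-k (k << c).
scores = p_codes[candidates] @ q_dense
top_k = candidates[np.argsort(-scores)[:20]]
print(top_k[:5])
```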

DrBoost Lewis et al. (2022), a dense retrieval ensemble method inspired by boosting Freund and Schapire (1997), incrementally compacts the dimension of representations during training. DrBoost obtains an even smaller index than BPR Yamada et al. (2021), i.e., less than 1GB, while also achieving higher accuracy. Concretely, it sequentially builds multiple weak learners and integrates them into one stronger learner. Each weak learner consists of a BERT-based dual-encoder that encodes passages and questions into low-dimensional embeddings, normally 32-dim. Weak learners are trained iteratively using hard negative samples. The final embeddings for passages and questions are a linear combination of the embeddings from all weak learners. The dimension of the final embeddings can be controlled by the number of training rounds, which makes the total embedding dimension flexible and the index size adjustable. One limitation of DrBoost is that it must retain multiple encoders simultaneously to compute the full question representation at test time. To remedy this issue, DrBoost distills all R question encoders (32-dim each) into a single encoder (32*R dim). The single encoder then produces the full question embedding directly, which achieves the goal of low latency and low resource usage.

By Reducing Model Size. Besides reducing the memory cost of the index, reducing the model size is another way to cut the memory cost of ODQA models.

One way to reduce model size is to build a comprehensive model that is capable of retrieving and reading simultaneously Lee et al. (2021a); Yang and Seo (2021); it can then replace the multiple models in traditional ODQA systems with one comprehensive model and achieve model storage efficiency. YONO (You Only Need One model) Lee et al. (2021a) is a representative model of this kind, which integrates the retriever, reranker and generator into a single T5-large based transformer pipeline. In this way, YONO achieves a model size of less than 2GB, similar to EMDR2 Singh et al. (2021), and higher QA performance. This gives YONO the best performance among models under 2GB. Moreover, YONO can further manipulate its model size by adding or removing certain layers flexibly. To be specific, YONO first discards 18 decoder layers of the T5-large model and splits the rest of the model into four parts: the first 12 layers are for evidence retrieval; the middle 4 layers are for evidence reranking; the following 8 layers are for further encoding and the last 6 layers are for decoding. The hidden representations are progressively improved along the pipeline. Fully end-to-end training over all stages is performed to make full use of the capability of all modules. However, YONO still needs to do evidence retrieval, i.e., evidence indexing and searching, which is time-consuming. How to improve the processing speed of YONO thus remains an open problem.

Besides YONO, another attempt to reduce the model size is the minimal retriever and reader (Minimal R&R) method Yang and Seo (2021), which designs a single lightweight ODQA system based on MobileBERT Sun et al. (2020). Knowledge distillation and iterative fine-tuning are adopted to reduce the model size while keeping the ability to both retrieve and read. Minimal R&R achieves a small model size of under 500MB, but with a serious drop in exact match (EM) score on NQ compared to DPR.

3.1.3 Blazing New Directions

Besides methods that accelerate evidence searching and reading, and methods that reduce index and model size, some blazing new directions have been proposed as well, such as generating answers directly to omit evidence retrieval, or retrieving answers directly to omit evidence reading.

Generate Directly. Some researchers have blazed a brand new path that omits the whole evidence retrieval process, including corpus indexing and evidence searching, by leveraging generative language models (such as T5, BART and GPT) to tackle ODQA tasks Roberts et al. (2020); Brown et al. (2020); Lewis et al. (2020a). Generative models have learnt and stored the knowledge of a large corpus, such as Wikipedia. Given a question, they generate the answer directly. By eliminating the evidence retrieval process, they save a lot of processing time, making them inference efficient. The main advantage of Generator-Only methods is that they can answer open domain questions without any access to external knowledge Roberts et al. (2020), and they output the literal text of the answer in a more free-form fashion.

However, there is generally a significant gap in QA performance between generative models and Retriever-Reader ODQA models, as well as in the adequacy of explanations. Thus, single-generator ODQA models are often further combined with existing evidence retriever models Lewis et al. (2020b); Izacard and Grave (2021); Singh et al. (2021) to obtain better QA performance.

Retrieve Directly. Evidence reading takes non-negligible processing time; for example, 23.95% of the total processing time of DPR is occupied by this module. An innovative idea to improve the efficiency of ODQA is to omit evidence reading altogether. Without evidence reading, the document corpus is first preprocessed into a knowledge base offline. When a new question arrives, the model searches the knowledge base for the final answer directly Seo et al. (2019); Lee et al. (2021b); Lewis et al. (2021).

RePAQ Lewis et al. (2021) is representative of this framework. It first converts a large corpus into a knowledge base (KB) of question-answer (QA) pairs using a question generation model, then uses a lightweight QA-pair retriever to answer questions. Specifically, RePAQ automatically generates 65 million QA pairs from a Wikipedia corpus and builds a QA-pair knowledge base offline. It then retrieves the most similar QA pair from the knowledge base, computing similarity with the maximum inner product search (MIPS) technique Shrivastava and Li (2014); Guo et al. (2016). The answer of the most similar question is returned directly as the output answer. RePAQ's retrieval model can be smaller than 500MB, and it can answer over 1,000 questions per second with high accuracy. However, the 220GB index for the 65 million QA pairs is a major drawback of RePAQ. For further efficiency, RePAQ reduces the index size from 220GB to 16GB with product quantization (PQ) Jégou et al. (2011) and a flat index, with less than 1% accuracy drop. In addition, RePAQ indexes the QA pairs with the approximate search technique of hierarchical navigable small world graphs (HNSW) Malkov and Yashunin (2020) to speed up searching.

Different from RePAQ, phrase-based ODQA models such as DenSPI Seo et al. (2019) and DensePhrases Lee et al. (2021b) split the corpus documents into fine-grained phrases. They build an index for these phrases, which can be retrieved directly as predicted answers. Similar to RePAQ, omitting evidence reading makes phrase-based ODQA models faster than Retriever-Reader ODQA models when processing questions. For example, DensePhrases can answer an average of 20.6 questions per second, while the majority of retriever-reader methods handle fewer than 10.

3.2 Core Techniques

This section summarizes the core techniques commonly used in existing ODQA models to improve efficiency. They can be broadly divided into two categories: data-based and model-based techniques. Data-based techniques mainly focus on reducing the index, which can be downsized along different dimensions such as the number of corpus passages, the feature dimension and the storage per dimension. Model-based techniques try to reduce the model size while avoiding a significant drop in performance; model pruning and knowledge distillation are commonly used techniques.

3.2.1 Data-based techniques

Passage Filtering. In the huge corpora that ODQA models rely on, there are massive numbers of passages that contain little useful information and are unlikely to serve as evidence for answers. Thus, passage filtering, which discards unrelated passages, is a way to reduce the memory cost of corpus storage without a large negative impact. For example, some researchers designed a linear classifier to discriminate and discard unnecessary passages before evidence retrieval Izacard et al. (2020); Yang and Seo (2021).
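
A toy sketch of such a filter is given below: a linear classifier over passage embeddings predicts whether a passage is ever likely to serve as evidence, and only the retained passages are encoded and indexed. The embeddings, labels and threshold are random stand-ins for illustration.

```python
# Toy sketch of passage filtering with a linear classifier (random stand-in data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb = rng.standard_normal((10_000, 768))          # passage embeddings
labels = rng.integers(0, 2, size=10_000)          # 1 = passage was useful as evidence

clf = LogisticRegression(max_iter=1000).fit(emb, labels)
keep_prob = clf.predict_proba(emb)[:, 1]
kept = emb[keep_prob > 0.5]                       # only kept passages get indexed
print(f"kept {len(kept)} of {len(emb)} passages")
```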

Dimension Reduction. Another way to reduce memory cost is to reduce the dimension of the dense passage representations. To achieve this goal, Izacard et al. (2020) learn an additional feed-forward layer to project the high-dimensional embeddings to lower-dimensional ones. Concretely, a linear layer maps d-dimensional embeddings to d_R-dimensional vectors, where d_R is much smaller than d Luan et al. (2021).
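
A minimal sketch of this learned projection is shown below; in practice the linear layer is trained jointly with the retrieval objective rather than applied untrained as here.

```python
# Minimal sketch of learned dimension reduction with a linear projection layer.
import torch
import torch.nn as nn

d, d_R = 768, 128
project = nn.Linear(d, d_R)        # in practice trained with the retriever
emb = torch.randn(256, d)          # a batch of passage embeddings
small = project(emb)               # (256, 128): roughly a 6x smaller index footprint
```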

Principal component analysis (PCA) is another efficient technique commonly used to reduce the dimension of passage representations without losing important information Ma et al. (2021); Zouhar et al. (2022). In Ma et al. (2021), PCA is used to build a projection matrix that projects the raw data onto the principal components using an orthonormal basis.
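
A minimal sketch with scikit-learn is shown below: the PCA projection is fit once on the passage embeddings and then applied to both passages and queries.

```python
# Minimal sketch of PCA compression of passage embeddings (random data for illustration).
import numpy as np
from sklearn.decomposition import PCA

emb = np.random.rand(10_000, 768).astype("float32")
pca = PCA(n_components=256)
pca.fit(emb)                                         # learn the orthonormal projection
compressed = pca.transform(emb)                      # (10000, 256)
query_small = pca.transform(np.random.rand(1, 768))  # queries use the same projection
```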

Product Quantization. Product quantization (PQ) Jégou et al. (2011) further reduces the index size by reducing the storage cost of each dimension of the embeddings. It divides a d-dimensional vector into n sub-vectors of dimension d/n and quantizes these sub-vectors independently using k-means Izacard et al. (2020); Ma et al. (2021); Yang and Seo (2021). However, PQ also results in a significant drop in accuracy while it reduces the index size. Joint optimization of query encoding and Product Quantization (JPQ) further enhances the original PQ method.
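
The sketch below applies stand-alone product quantization with FAISS to random stand-in embeddings: a 768-dim float32 vector (3072 bytes) is stored as 96 one-byte codes, roughly a 32x compression of the index.

```python
# Minimal sketch of product quantization with FAISS (random data for illustration).
import numpy as np
import faiss

d, m, nbits = 768, 96, 8                     # dim, #sub-vectors, bits per sub-vector
pq = faiss.ProductQuantizer(d, m, nbits)

emb = np.random.rand(50_000, d).astype("float32")
pq.train(emb)                                # learn a k-means codebook per sub-vector
codes = pq.compute_codes(emb)                # (50000, 96) uint8 codes
approx = pq.decode(codes)                    # lossy reconstruction used at search time
print(codes.nbytes / emb.nbytes)             # ~0.03, i.e. ~32x smaller
```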

The three techniques introduced above are adopted jointly in Fusion-in-Decoder with Knowledge Distillation (FiD-KD) Izacard et al. (2020) to reduce the memory cost of the ODQA system. It obtains competitive performance on ODQA task compared to the original system with a reduction of memory cost from more than 70GB to less than 6GB.

3.2.2 Model-based techniques

Model Pruning. Most recent works on open domain question answering Chen et al. (2017); Guu et al. (2020) prefer to adopt large pre-trained language models Devlin et al. (2019); Raffel et al. (2020) as the passage retriever, reader or generator due to their powerful deep semantic understanding capability. These large models have millions or even billions of parameters, requiring large storage and long training time and leading to slow inference. Therefore, some researchers have turned to more lightweight language models Yang and Seo (2021). For example, a smaller pre-trained language model, MobileBERT Sun et al. (2020), has been used to reduce the size of an ODQA system to 972MB Yang and Seo (2021).

Parameter sharing is another way to constrain the model size. Skylinebuilder Wu et al. (2020) and RePAQ reduce their model size by using a parameter-sharing LM, i.e., ALBERT Lan et al. (2019). It has a structure similar to BERT Devlin et al. (2019), but keeps a smaller model size of 18MB compared to BERT's 110MB. More lightweight pre-trained language models have been proposed and verified on other natural language tasks, such as machine reading comprehension (MRC) Fan et al. (2019); Sajjad et al. (2020); Lagunas et al. (2021); Xia et al. (2022). They obtain smaller model sizes and achieve high accuracy on downstream tasks, including ODQA.

Knowledge Distillation. Compared to structured pruning, knowledge distillation pays more attention to improving question processing speed. Knowledge distillation, which transfers knowledge from a large model into a small one, has been widely used in several NLP tasks, including ODQA and MRC Sanh et al. (2019); Sun et al. (2020); Izacard and Grave (2020); Lewis et al. (2022); Yang and Seo (2021). For example, the Minimal R&R system Yang and Seo (2021) integrates the evidence retriever and the reader into a single model to reduce the model size through knowledge distillation. DrBoost Lewis et al. (2022) distills all of its question encoders into a single encoder that produces the full question embedding directly, further improving question processing speed and reducing the model size.
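
A minimal sketch of the distillation objective commonly used in these works is given below: the student matches the temperature-softened distribution of the teacher in addition to the gold labels. The shapes and loss weights are illustrative.

```python
# Minimal sketch of a knowledge distillation loss (soft teacher targets + hard labels).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # soft-target loss
    hard = F.cross_entropy(student_logits, labels)     # usual supervised loss
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 100, requires_grad=True)   # e.g., start-position logits
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```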

4 Quantitative Analysis

This section gives a quantitative analysis of the aforementioned ODQA models and techniques. We first introduce the corpora and the related partition and segmentation strategies, followed by the metrics. Then we give an overall comparison of the different frameworks and further discuss the methods quantitatively from three specific aspects: memory cost, processing speed and accuracy. At the end of the analysis, we summarize what has been analyzed and discussed.

4.1 Corpus and Metrics

Table 1: The statistical information of Wikipedia corpora used in ODQA models.
Wikipedia Corpus | Split Method | Retrieval Unit | Length of a Unit (tokens) | Number of Units (million) | Encoding Methods | Index Size (GB) | Related ODQA Models
2016-12-21 dump of English Wikipedia | - | article | - | 5.1 | TF-IDF | 26 | DrQA
2016-12-21 dump of English Wikipedia | - | article | - | 5.1 | BM25 | 2.4 | Skylinebuilder, GAR_extractive
2018-12-20 snapshot of English Wikipedia | BERT's tokenizer | block/passage | 288 | 13 | dense encoding | 18 | ORQA, REALM
2018-12-20 snapshot of English Wikipedia | - | block/passage | 100 | 21 | dense encoding | 65 | DPR, RocketQA, R2-D2, etc.
2018-12-20 snapshot of English Wikipedia | - | phrase | <=20 | 60000 | TF-IDF + dense encoding | 2000 | DenSPI
2018-12-20 snapshot of English Wikipedia | - | phrase | <=20 | 60000 | dense encoding | 320 | DensePhrases
2018-12-20 snapshot of English Wikipedia | generator | QA-pair | - | 65 | dense encoding | 220 | RePAQ

Corpus. The most commonly used corpus for open domain question answering systems is the 2018-12-20 dump of Wikipedia, which contains 21 million 100-word passages after removing semi-structured data (tables, infoboxes, lists and disambiguation pages) Karpukhin et al. (2020). Most ODQA models, such as RocketQA Qu et al. (2021), FiD Izacard and Grave (2021) and R2-D2 Fajcik et al. (2021), directly build the index for passages of this Wikipedia corpus; the index file is 65GB. Based on this Wikipedia corpus, RePAQ further generates 65 million QA pairs and indexes them into a 220GB file. Some other methods, e.g., DrQA Chen et al. (2017) and Skylinebuilder Wu et al. (2020), encode and build indexes for documents from the 2016-12-21 dump of English Wikipedia, which includes 5.1 million articles Chen et al. (2017); Wu et al. (2020), and the size of this index file is 26GB.

Besides the choice of the original corpus, there are also different partition and segmentation strategies. For example, ORQA Lee et al. (2019) and REALM Guu et al. (2020) segment the corpus documents into 13 million blocks of 288 tokens each. DenSPI Seo et al. (2019), DenSPI+Sparc Lee et al. (2020) and DensePhrases Lee et al. (2021b) divide corpus documents into 60 billion phrases, each with up to 20 tokens. The remaining ODQA models segment corpus documents into 21 million passages of 100 tokens, leading to a 65GB index Karpukhin et al. (2020); Lewis et al. (2021); Izacard and Grave (2021); Qu et al. (2021).

A comprehensive introduction is illustrated in Table 1. In general, the index size of the corpus is quite large, and the storage of the index is one of the main challenges for ODQA efficiency.

Metrics. There are various metrics to depict efficiency in different dimensions.

In terms of latency, training time Mao et al. (2021), indexing time Mao et al. (2021), query time Yamada et al. (2021) and reasoning time are normally considered. The metrics Q/s (questions per second) Seo et al. (2019) and FLOPs (floating point operations) Guan et al. (2022) are popular in measuring the total processing latency, where Q/s is the number of questions one ODQA system can answer per second and FLOPs is the number of floating point operations of the model.

In terms of memory, the model parameter size, passage corpus size, index size and training data size are the main factors influencing memory cost Yamada et al. (2021). We measure the memory consumption of ODQA models as the size in bytes of the corresponding data (corpus, index and model) after loading it into memory.

In terms of answering quality, EM (exact match accuracy) Chen et al. (2017), F1-score, MRR@k (mean reciprocal rank) Qu et al. (2021), precision@k, recall@k and retrieval accuracy@k Karpukhin et al. (2020) are normally used to measure the quality of ODQA models. Specifically, EM is the percentage of questions for which the predicted answer matches any one of the reference answers exactly, after string normalization Qu et al. (2021). MRR@k is the mean reciprocal of the rank at which the first relevant passage was retrieved Qu et al. (2021).
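
For concreteness, the sketch below computes EM in the common SQuAD-style way: predictions and references are normalized (lower-cased, punctuation and articles removed) before an exact string comparison.

```python
# Minimal sketch of the Exact Match (EM) metric with SQuAD-style normalization.
import re
import string

def normalize(text):
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    return float(any(normalize(prediction) == normalize(r) for r in references))

print(exact_match("The Eiffel Tower", ["Eiffel Tower", "eiffel tower, Paris"]))  # 1.0
```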

In this paper, we adopt metrics on latency, memory cost and answering quality to evaluate ODQA models comprehensively. Specifically, we use Q/s to measure the processing speed, use total memory overhead to evaluate the memory cost and use EM score to estimate the end-to-end answer prediction quality.

Figure 4: Comprehensive comparison of ODQA models in terms of memory cost, processing speed and EM accuracy on NQ evaluation dataset.
Table 2: Comprehensive analysis of memory cost (model size, index size and the total size), processing speed and EM accuracy on NQ. Results marked with symbol * were obtained on an Nvidia GeForce Rtx 2080 Ti GPU over 100 examples.
Framework | System | Memory Group (GB) | Model (GB) | Index (GB) | Total (GB) | Speed (Q/s) | NQ EM (%) | TriviaQA EM (%)
Retriever-Reader (Extractive) | Minimal R&R | (0, 10] | 0.17 | 0.15 | 0.31 | - | 32.60 | 48.75
Retriever-Reader (Extractive) | SkylineBuilder | (0, 10] | 0.07 | 2.40 | 2.47 | - | 34.20 | -
Retriever-Reader (Extractive) | BM25+BERT-base | (0, 10] | 0.44 | 2.40 | 2.84 | 4.68* | - | -
Retriever-Reader (Extractive) | GAR_extractive | (0, 10] | 0.44 | 2.40 | 2.84 | 2.61* | 41.80 | 74.80
Retriever-Reader (Extractive) | DrBoost+PQ(8-dim) | (0, 10] | 2.64 | 0.42 | 3.06 | - | - | -
Retriever-Reader (Extractive) | DPR+PQ | (0, 10] | 1.32 | 2.00 | 3.32 | 4.67* | 38.40 | 52.00
Retriever-Reader (Extractive) | BPR_BERT | (0, 10] | 1.32 | 2.10 | 3.42 | 4.81* | 41.60 | 56.80
Retriever-Reader (Extractive) | DrBoost+PQ(4-dim) | (0, 10] | 2.64 | 0.84 | 3.48 | - | - | -
Retriever-Reader (Extractive) | BPR_ELECTRA-large | (0, 10] | 2.22 | 2.10 | 4.32 | - | 49.00 | 65.60
Retriever-Reader (Extractive) | DrBoost | (10, 50] | 2.64 | 13.00 | 15.64 | - | - | -
Retriever-Reader (Extractive) | ORQA | (10, 50] | 1.32 | 18.00 | 19.32 | 8.60 | 33.30 | 45.00
Retriever-Reader (Extractive) | REALM | (10, 50] | 1.32 | 18.00 | 19.32 | 8.40 | 39.20 | -
Retriever-Reader (Extractive) | DrQA | (10, 50] | 0.27 | 26.00 | 26.27 | 1.80 | 35.70 | -
Retriever-Reader (Extractive) | ColBERT-QA-base | (50, 100] | 0.88 | 65.00 | 65.88 | - | 42.30 | 64.60
Retriever-Reader (Extractive) | ANCE | (50, 100] | 1.32 | 65.00 | 66.32 | 5.51* | 46.00 | 57.50
Retriever-Reader (Extractive) | DPR | (50, 100] | 1.32 | 65.00 | 66.32 | 1.60* | 41.50 | 56.80
Retriever-Reader (Extractive) | GAR+DPR_extractive | (50, 100] | 1.32 | 65.00 | 66.32 | 1.25* | 43.80 | -
Retriever-Reader (Extractive) | RocketQA | (50, 100] | 1.50 | 65.00 | 66.50 | - | 42.80 | -
Retriever-Reader (Extractive) | ColBERT-QA-large | (50, 100] | 1.76 | 65.00 | 66.76 | - | 47.80 | 70.10
Retriever-Reader (Extractive) | ERNIE-Search_base | (50, 100] | 1.76 | 65.00 | 66.76 | - | - | -
Retriever-Reader (Extractive) | R2-D2_reranker | (50, 100] | 5.16 | 65.00 | 70.16 | - | 55.90 | 69.90
Retriever-Reader (Extractive) | UnitedQA | (50, 100] | 8.36 | 65.00 | 73.36 | - | 54.70 | 70.50
Retriever-Reader (Extractive) | DPR+HNSW | (100, 500] | 1.32 | 151.00 | 152.32 | 5.82* | 41.20 | 56.60
Retriever-Reader (Extractive) | ERNIE-Search_2.4B | (100, 500] | 19.20 | 344.06 | 363.26 | - | - | -
Retriever-Reader (Generative) | EMDR2 | (10, 50] | 1.76 | 32.00 | 33.76 | - | 52.50 | 71.40
Retriever-Reader (Generative) | YONO_retriever | (50, 100] | 1.54 | 65.00 | 66.54 | - | 53.20 | 71.30
Retriever-Reader (Generative) | FiD-base | (50, 100] | 1.76 | 65.00 | 66.76 | 2.00 | 48.20 | 65.00
Retriever-Reader (Generative) | FiD-base+KD_DPR | (50, 100] | 1.76 | 65.00 | 66.76 | - | 49.60 | 68.80
Retriever-Reader (Generative) | YONO_reranker | (50, 100] | 1.76 | 65.00 | 66.76 | - | 53.20 | 71.90
Retriever-Reader (Generative) | GAR+DPR_generative | (50, 100] | 2.50 | 65.00 | 67.50 | 1.25* | 45.30 | -
Retriever-Reader (Generative) | RAG-seq | (50, 100] | 2.50 | 65.00 | 67.50 | 0.80 | 44.50 | 56.80
Retriever-Reader (Generative) | FiD-large | (50, 100] | 3.96 | 65.00 | 68.96 | 0.50 | 51.40 | 67.60
Retriever-Reader (Generative) | FiD-large+KD_DPR | (50, 100] | 3.96 | 65.00 | 68.96 | - | 53.70 | 72.10
Retriever-Reader (Generative) | RePAQ+FiD_large | (100, 500] | 3.32 | 220.00 | 223.32 | 2.30 | 52.30 | 67.30
Retriever-Only | RePAQ_base+PQ | (10, 50] | 0.04 | 48.00 | 48.04 | 100.00 | 41.20 | -
Retriever-Only | RePAQ_base | (100, 500] | 0.04 | 220.00 | 220.04 | 1400.00 | 40.90 | -
Retriever-Only | RePAQ_base+reranker_base | (100, 500] | 0.09 | 220.00 | 220.09 | 55.00 | 45.70 | -
Retriever-Only | RePAQ_XL | (100, 500] | 0.24 | 220.00 | 220.24 | 800.00 | 41.50 | -
Retriever-Only | RePAQ_XL+reranker_XXL | (100, 500] | 1.18 | 220.00 | 221.18 | 6.00 | 47.60 | 52.10
Retriever-Only | DensePhrases | (100, 500] | 0.88 | 320.00 | 320.88 | 20.60 | 40.90 | 50.70
Retriever-Only | DenSPI+Sparc | (1000, 2010] | 2.69 | 1547.00 | 1549.69 | 2.10 | 14.50 | 34.40
Retriever-Only | DenSPI | (1000, 2010] | 2.69 | 2000.00 | 2002.69 | 2.90 | 8.10 | 30.70
Generator-Only | T5-1.1-small+SSM | (0, 10] | 0.24 | 0.00 | 0.24 | 7.20* | 25.50 | -
Generator-Only | T5-base | (0, 10] | 0.88 | 0.00 | 0.88 | 7.53* | 25.90 | 29.10
Generator-Only | BART-large | (0, 10] | 1.62 | 0.00 | 1.62 | 5.88* | 26.50 | 26.70
Generator-Only | GAR_generative | (0, 10] | 1.62 | 0.00 | 1.62 | 2.94* | 38.10 | 62.20
Generator-Only | T5-large | (0, 10] | 3.08 | 0.00 | 3.08 | 3.85* | 28.50 | 35.90
Generator-Only | T5-1.1-XL+SSM | (10, 50] | 12.00 | 0.00 | 12.00 | - | 29.50 | 45.10
Generator-Only | T5-1.1-XXL+SSM | (10, 50] | 45.27 | 0.00 | 45.27 | - | 35.20 | 61.60
Generator-Only | GPT-3 | (500, 1000] | 700.00 | 0.00 | 700.00 | - | 29.90 | 71.20

4.2 Overall Comparison

Table 2 presents a comprehensive comparison of efficiency-related ODQA models from three aspects: memory cost, processing speed and answering quality. (All discussions in this section are based on the results that are available.) Specifically, the total memory cost, together with the detailed model size and index size, shows the memory details. The number of questions that can be answered per second (Q/s) demonstrates the processing speed. EM scores on the NQ and TriviaQA datasets indicate the answering quality.

With respect to the comparison between different frameworks, we can see that two-stage methods (retriever-reader) generally obtain better ODQA performance than one-stage methods (i.e., retriever-only and generator-only). The best end-to-end exact match performance on the NQ (55.9%) and TriviaQA (74.8%) datasets is obtained by R2-D2_reranker and GAR_extractive respectively, both under the retriever-reader framework. The second-best ODQA performances on NQ (54.7%) and TriviaQA (72.1%) are obtained by UnitedQA and FiD-large+KD_DPR, which are also two-stage methods.

In terms of total memory cost, i.e., the sum of model size and index size, generator-only systems generally have low memory overhead. Except for GPT-3, all generator-only systems take less than 50GB of memory, and five of the eight take less than 5GB. On the contrary, most retriever-only ODQA models require huge memory, normally greater than 200GB; DenSPI needs 2002.69GB, which is enormous. Retriever-reader ODQA models cover a wide range of memory costs, from 0.31GB to 363.26GB. Overall, Minimal R&R achieves the smallest memory overhead (0.31GB) while DenSPI has the largest (2002.69GB).

In terms of processing speed, which determines how fast an ODQA system can answer a given question, one-stage methods generally achieve higher processing speed than two-stage methods, especially retriever-only systems. Among the eight retriever-only methods, five can process more than 20 questions per second (Q/s), and RePAQ_XL and RePAQ_base can answer 800 and 1400 questions per second respectively, which is impressive. Among the methods with slow processing speed, FiD-large and RAG-seq from the retriever-reader framework are the two slowest systems, processing less than 1 question per second.

Finally, Fig. 4 gives a visual presentation of the comprehensive comparison of efficiency-related ODQA models. (We only visualize the ODQA models that have results on all three aspects: memory, processing speed and EM accuracy.) Using the NQ evaluation dataset as an example, it illustrates the detailed model size, index size, EM accuracy and processing speed. From Fig. 4, we can see that each framework has its own strengths and weaknesses. Retriever-only systems achieve significantly high processing speed, but cost enormous memory storage. Generator-only systems require the least memory storage; however, their main concern is answering quality, as the majority of these systems' EM scores are below 30% on the NQ dataset. Two-stage retriever-reader systems are relatively balanced: they achieve high EM accuracy with moderate memory cost and processing speed.

4.3 Details in Memory Cost

The total memory cost depends on the model size and the index size.

Index Size. In terms of index size, the two one-stage frameworks are two extremes. Generator-only methods need zero memory for indexes, while retriever-only methods generally need huge storage space for them. Most of the two-stage methods have a moderate index size of 65GB or less.

Specifically, the 65GB index of dense passage embeddings developed by DPR Karpukhin et al. (2020) is the most commonly adopted index; it is used by 17 methods listed in Table 2. In contrast, DrQA and GAR_extractive represent passages as sparse vectors, obtaining a much smaller index (26GB) Chen et al. (2017); Mao et al. (2021). DPR+PQ and RePAQ further compress their indexes using the product quantization (PQ) technique Lewis et al. (2021); Karpukhin et al. (2020): DPR+PQ compresses the index from 65GB to 2GB, and RePAQ compresses it from 220GB to 48GB.

On the other side, BPR Yamada et al. (2021) creates a small index of less than 2.1GB by integrating the learning-to-hash technique, rather than PQ, into DPR. It also improves the answering performance from 41.6% to 49% on the NQ dataset by replacing the BERT-based reader with an ELECTRA-large reader. Meanwhile, Minimal R&R Yang and Seo (2021) builds the smallest index, less than 0.15GB, through multiple techniques including passage filtering, dimension reduction and PQ, at the price of a significant drop in ODQA performance.

DenSPI+Sparc Lee et al. (2020) and DensePhrases Lee et al. (2021b) shrink the phrase index by pointer sharing, phrase filtering and PQ; however, the phrase index is still larger than 1000GB. DensePhrases further cuts the index size down to 320GB by omitting sparse representations and using the SpanBERT-base encoder, while retaining relatively high performance. SpanBERT-base represents phrases as 768-dim vectors Joshi et al. (2020), compared with the 1024-dim ones used in DenSPI+Sparc. DrBoost Lewis et al. (2022) builds an index under 1GB, where a passage is represented by a 190-dim vector, through the boosting technique and PQ.

Model Size. (The model size here is the size of all the models present in an ODQA system, including the retriever, reader and generator models, etc.) In terms of model size, there is a great range, from 0.04GB to 700GB. Among all of the ODQA models mentioned, a quarter have model sizes of less than 1GB; 40% of the systems are between 1GB and 2GB and 12.5% are between 2GB and 3GB; 7.5% have model sizes between 3GB and 4GB; the remaining 15% weigh more than 4GB. Specifically, GPT-3 Brown et al. (2020) has an extremely large model size of 700GB. Besides it, three other systems have relatively large models: T5-1.1-XXL+SSM (45.27GB) Roberts et al. (2020), UnitedQA (8.36GB) Cheng et al. (2021) and R2-D2_reranker (5.16GB) Fajcik et al. (2021), while the smallest model (0.04GB) is achieved by RePAQ_base Lewis et al. (2021). GPT-3 keeps the largest model (700GB) and achieves relatively high performance, i.e., 71.2% EM on TriviaQA (top 1) and 29.9% EM on NQ (top 3), compared to the seven other models under the same generator-only framework. Minimal R&R Yang and Seo (2021) cuts the total model size down to 0.17GB by adopting the lightweight encoder MobileBERT Sun et al. (2020) and sharing parameters among all encoders. DrQA Chen et al. (2017) has a small total model size of 0.27GB, since its retriever is a non-parametric sparse retriever (TF-IDF) and its reader relies on an LSTM with fewer parameters. GAR_extractive Mao et al. (2021) maintains a small total model size and achieves the best performance on TriviaQA (74.8%) and performance similar to DPR on NQ (41.8%). RePAQ Lewis et al. (2021) achieves the smallest model size of 0.04GB by utilizing the lightweight parameter-sharing encoder ALBERT Lan et al. (2019). It also attains competitive ODQA performance compared to DPR.

Most ODQA models are implemented with PLMs of less than 2GB. A few ODQA models keep total model sizes above 3GB to achieve higher performance, such as FiD-large+KD_DPR Izacard and Grave (2020), RePAQ+FiD_large Lewis et al. (2021), UnitedQA Cheng et al. (2021) and R2-D2_reranker Fajcik et al. (2021), as they employ either larger or more PLMs with the aim of improving performance.

4.4 Details on Latency

In terms of latency, i.e., processing speed, most ODQA models answer fewer than 10 questions per second. Retriever-only ODQA models offer faster processing than the other three frameworks: eliminating the step of reading long evidence makes them time-efficient. Among them, the QA-pair-based system RePAQ Lewis et al. (2021) and its variants achieve the fastest inference speed of the listed ODQA models, up to 1400 Q/s, outpacing phrase-based systems. Generator-only ODQA models also achieve higher Q/s than Retriever-Reader ODQA models, as they do not need to retrieve evidence from a large corpus, which is time-consuming.
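As a point of reference, throughput numbers such as Q/s are usually obtained by timing an end-to-end pass over a question set, roughly as in the sketch below; odqa_pipeline is a placeholder for any of the surveyed systems, not a real API.

```python
# Minimal sketch: measuring throughput (questions per second, Q/s).
# `odqa_pipeline` is a hypothetical callable standing in for a full system.
import time

def measure_qps(odqa_pipeline, questions):
    """Return answered questions per second over the given question set."""
    start = time.perf_counter()
    for q in questions:
        odqa_pipeline(q)      # retrieve (if any) + read/generate an answer
    elapsed = time.perf_counter() - start
    return len(questions) / elapsed

# Example (hypothetical): qps = measure_qps(my_system.answer, dev_questions)
```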

5 Challenges

Building an efficient ODQA system capable of answering any input question within the required memory and processing-speed budgets is regarded as the ultimate goal of QA research. However, the research community still has a long way to go. Here we discuss some salient challenges that need to be addressed along the way. By doing so, we hope to make the research gaps clearer and thereby accelerate progress in this field.

5.1 Low Power

One of the goals of efficient ODQA models is to run on low-power machines. Most open domain question answering approaches prove particularly effective when large amounts of data and ample computing resources are available; however, they are computation-heavy and energy-expensive. Deploying an ODQA system on low-power devices with limited compute resources, such as mobile devices, remains very challenging. Such machines are often constrained by battery power and do not usually come with GPUs. Deep learning on low-power machines is a worthwhile and growing area of research Goel et al. (2020); Cai et al. (2022), yet so far there is little work on such ODQA models. To deploy DNNs on small embedded computers, an efficient ODQA system may need to be designed with multiple factors in mind, such as energy, computation and memory.

5.2 Evaluation

Evaluating the efficiency of ODQA models is difficult, and there are often multiple factors that need to be traded off against each other. Existing research uses accuracy, EM score, F1-score, precision, recall, etc. to evaluate effectiveness, but this is not enough. It is also important to establish which resource, e.g., money, data, memory, time, power consumption, carbon emissions, etc., one attempts to constrain Treviso et al. (2022). For example, specific hardware such as an electricity meter can measure power consumption with high temporal accuracy. External energy costs such as cooling or networking should be covered as well, but they are difficult to measure precisely. Besides power and energy consumption, we should also pay attention to carbon emissions. They are normally computed from the power consumption and the carbon intensity of the marginal energy generation used to run the program; low energy does not mean low carbon. Last but not least, financial impact is another key metric in evaluation. Monetary cost is a resource that one typically prefers to be efficient with. Both fixed and running costs affect ODQA applications, depending on how one chooses to execute a model. As hardware configurations and their prices form discrete points on a typically non-linear scale, it is worth identifying cost-efficient configurations and fitting to them. Implementing pre-emptible processes that can recover from interruptions also often allows access to much cheaper resources. When calculating or amortizing hardware costs, one should also factor in downtime, maintenance, and configuration; measuring the total cost of ownership (TCO) provides a more useful metric.
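As a small illustration of the carbon-emission estimate mentioned above, the sketch below multiplies an energy-consumption figure by the carbon intensity of the electricity used; the numbers are placeholders rather than measurements of any particular ODQA system.

```python
# Minimal sketch: carbon emissions = energy consumed (kWh) x carbon intensity
# of the electricity used (kg CO2e per kWh). Placeholder values only.
def carbon_emissions_kg(device_power_watts: float,
                        hours: float,
                        carbon_intensity_kg_per_kwh: float) -> float:
    energy_kwh = device_power_watts * hours / 1000.0
    return energy_kwh * carbon_intensity_kg_per_kwh

# e.g., a 300 W GPU serving queries for 24 hours on a 0.4 kg CO2e/kWh grid:
print(carbon_emissions_kg(300, 24, 0.4))   # ~2.9 kg CO2e
```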

We also need to consider the trade-off between efficiency and performance. In some extreme cases, for example low-power machines, efficiency and performance may conflict, and how to develop fair metrics for such settings is challenging. Furthermore, unified metrics across different systems would be valuable: we currently compare different efficient ODQA models from different perspectives, and it would be convenient to have standard, unified metrics.

5.3 Model bias

There is little work on model bias in efficient ODQA models. Studies of bias in machine learning have become increasingly important as awareness of how deployed models contribute to inequity grows Blodgett et al. (2020). Previous work on bias shows gender discrimination in word embeddings Bolukbasi et al. (2016), coreference resolution Rudinger et al. (2018), and machine translation Stanovsky et al. (2019). Within question answering, prior work has studied differences in accuracy based on gender Gor et al. (2021) and differences in answers based on race and gender Li et al. (2020). However, efficient ODQA models often involve redesigned models, knowledge distillation, or small models, and it is unclear how model bias propagates through them. How to reduce bias in those systems, including bias in open domain passage retrieval and closed domain reading comprehension, is a challenge in the advancement of efficient ODQA models.

6 Conclusion

In this survey, we reviewed the typical literature according to three different frameworks of open domain question answering (ODQA) systems. Further, we provided a broad overview of existing methods for increasing the efficiency of ODQA models and discussed their limitations. In addition, we performed quantitative analysis in terms of efficiency and offered suggestions on method selection for open domain question answering. Finally, we discussed open challenges and potential future directions for efficient ODQA models.

Acknowledgments

We thank Fan Jiang and Jiaxu Zhao for their invaluable feedback.

References

  • Baudiš (2015) Petr Baudiš. 2015. Yodaqa: a modular question answering system pipeline. In POSTER 2015-19th International Student Conference on Electrical Engineering, pages 1156–1165.
  • Bengio et al. (2015) Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2015. Conditional computation in neural networks for faster models. CoRR, abs/1511.06297.
  • Blodgett et al. (2020) Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.
  • Bolukbasi et al. (2016) Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  • Cai et al. (2022) Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, and Song Han. 2022. Enable deep learning on mobile devices: Methods, systems, and applications. ACM Transactions on Design Automation of Electronic Systems, 27(3):1–50.
  • Cao et al. (2017) Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. 2017. Hashnet: Deep learning to hash by continuation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.
  • Cheng et al. (2021) Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2021. UnitedQA: A hybrid approach for open domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3080–3090, Online. Association for Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Etezadi and Shamsfard (2022) Romina Etezadi and Mehrnoush Shamsfard. 2022. The state of the art in open domain complex question answering: a survey. Applied Intelligence, pages 1–21.
  • Fajcik et al. (2021) Martin Fajcik, Martin Docekal, Karel Ondrej, and Pavel Smrz. 2021. R2-D2: A modular baseline for open-domain question answering. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 854–870, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Fan et al. (2019) Angela Fan, Edouard Grave, and Armand Joulin. 2019. Reducing transformer depth on demand with structured dropout. CoRR, abs/1909.11556.
  • Ferrucci et al. (2010) David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. 2010. Building watson: An overview of the deepqa project. AI magazine, 31(3):59–79.
  • Freund and Schapire (1997) Yoav Freund and Robert E Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.
  • Goel et al. (2020) Abhinav Goel, Caleb Tung, Yung-Hsiang Lu, and George K. Thiruvathukal. 2020. A survey of methods for low-power deep learning and computer vision. In 2020 IEEE 6th World Forum on Internet of Things (WF-IoT), pages 1–6.
  • Gor et al. (2021) Maharshi Gor, Kellie Webster, and Jordan Boyd-Graber. 2021. Toward deconfounding the influence of subject’s demographic characteristics in question answering. In Empirical Methods in Natural Language Processing, page 6.
  • Graves (2016) Alex Graves. 2016. Adaptive computation time for recurrent neural networks. ArXiv, abs/1603.08983.
  • Guan et al. (2022) Yue Guan, Zhengyi Li, Zhouhan Lin, Yuhao Zhu, Jingwen Leng, and Minyi Guo. 2022. Block-skim: Efficient question answering for transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):10710–10719.
  • Guo et al. (2022) Jiafeng Guo, Yinqiong Cai, Yixing Fan, Fei Sun, Ruqing Zhang, and Xueqi Cheng. 2022. Semantic models for the first-stage retrieval: A comprehensive review. ACM Trans. Inf. Syst., 40(4).
  • Guo et al. (2016) Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016. Quantization based fast inner product search. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 482–490, Cadiz, Spain. PMLR.
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938. PMLR.
  • Huang et al. (2020) Zhen Huang, Shiyi Xu, Minghao Hu, Xinyi Wang, Jinyan Qiu, Yongquan Fu, Yuncai Zhao, Yuxing Peng, and Changjian Wang. 2020. Recent trends in deep learning based open-domain textual question answering systems. IEEE Access, 8:94341–94356.
  • Humeau et al. (2020) Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In ICLR.
  • Izacard and Grave (2020) Gautier Izacard and Edouard Grave. 2020. Distilling knowledge from reader to retriever for question answering. CoRR, abs/2012.04584.
  • Izacard and Grave (2021) Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, Online. Association for Computational Linguistics.
  • Izacard et al. (2020) Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Sebastian Riedel, and Edouard Grave. 2020. A memory efficient baseline for open domain question answering. CoRR, abs/2012.15156.
  • Jégou et al. (2011) Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:117–128.
  • Johnson et al. (2021) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547.
  • Joshi et al. (2020) Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics, 8:64–77.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
  • Khattab et al. (2021) Omar Khattab, Christopher Potts, and Matei Zaharia. 2021. Relevance-guided Supervision for OpenQA with ColBERT. Transactions of the Association for Computational Linguistics, 9:929–944.
  • Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, page 39–48, New York, NY, USA. Association for Computing Machinery.
  • Lagunas et al. (2021) François Lagunas, Ella Charlaix, Victor Sanh, and Alexander Rush. 2021. Block pruning for faster transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10619–10629, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. CoRR, abs/1909.11942.
  • Lee et al. (2021a) Haejun Lee, Akhil Kedia, Jongwon Lee, Ashwin Paranjape, Christopher D. Manning, and Kyoung-Gu Woo. 2021a. You only need one model for open-domain question answering. CoRR, abs/2112.07381.
  • Lee et al. (2020) Jinhyuk Lee, Minjoon Seo, Hannaneh Hajishirzi, and Jaewoo Kang. 2020. Contextualized sparse representations for real-time open-domain question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 912–919, Online. Association for Computational Linguistics.
  • Lee et al. (2021b) Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, and Danqi Chen. 2021b. Learning dense representations of phrases at scale. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6634–6647, Online. Association for Computational Linguistics.
  • Lee et al. (2021c) Jinhyuk Lee, Alexander Wettig, and Danqi Chen. 2021c. Phrase retrieval learns passage retrieval, too. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3661–3672, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Lee et al. (2019) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy. Association for Computational Linguistics.
  • Lewis et al. (2020a) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020a. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
  • Lewis et al. (2022) Patrick Lewis, Barlas Oguz, Wenhan Xiong, Fabio Petroni, Scott Yih, and Sebastian Riedel. 2022. Boosted dense retriever. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3102–3117, Seattle, United States. Association for Computational Linguistics.
  • Lewis et al. (2020b) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020b. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.
  • Lewis et al. (2021) Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. 2021. PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them. Transactions of the Association for Computational Linguistics, 9:1098–1115.
  • Li et al. (2020) Tao Li, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Vivek Srikumar. 2020. UNQOVERing stereotyping biases via underspecified questions. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3475–3489, Online. Association for Computational Linguistics.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692.
  • Lu et al. (2022) Yuxiang Lu, Yiding Liu, Jiaxiang Liu, Yunsheng Shi, Zhengjie Huang, Shikun Feng Yu Sun, Hao Tian, Hua Wu, Shuaiqiang Wang, Dawei Yin, and Haifeng Wang. 2022. Ernie-search: Bridging cross-encoder with dual-encoder via self on-the-fly distillation for dense passage retrieval.
  • Luan et al. (2021) Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2021. Sparse, Dense, and Attentional Representations for Text Retrieval. Transactions of the Association for Computational Linguistics, 9:329–345.
  • Ma et al. (2021) Xueguang Ma, Minghan Li, Kai Sun, Ji Xin, and Jimmy Lin. 2021. Simple and effective unsupervised redundancy elimination to compress dense vectors for passage retrieval. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2854–2859, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Malkov and Yashunin (2020) Yu A. Malkov and D. A. Yashunin. 2020. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4):824–836.
  • Mao et al. (2021) Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. 2021. Generation-augmented retrieval for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4089–4100, Online. Association for Computational Linguistics.
  • Min et al. (2021) Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Edouard Grave, Ikuya Yamada, Sonse Shimaoka, Masatoshi Suzuki, Shumpei Miyawaki, Shun Sato, Ryo Takahashi, Jun Suzuki, Martin Fajcik, Martin Docekal, Karel Ondrej, Pavel Smrz, Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, and Wen-tau Yih. 2021. Neurips 2020 efficientqa competition: Systems, analyses and lessons learned. In Proceedings of the NeurIPS 2020 Competition and Demonstration Track, volume 133 of Proceedings of Machine Learning Research, pages 86–111. PMLR.
  • Min et al. (2018) Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1725–1735, Melbourne, Australia. Association for Computational Linguistics.
  • Neyshabur and Srebro (2015) Behnam Neyshabur and Nathan Srebro. 2015. On symmetric and asymmetric lshs for inner product search. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 1926–1934. JMLR.org.
  • Qu et al. (2021) Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2021. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5835–5847, Online. Association for Computational Linguistics.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67.
  • Ren et al. (2021) Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, QiaoQiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. 2021. RocketQAv2: A joint training method for dense passage retrieval and passage re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2825–2835, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Roberts et al. (2020) Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426, Online. Association for Computational Linguistics.
  • Rudinger et al. (2018) Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14, New Orleans, Louisiana. Association for Computational Linguistics.
  • Sajjad et al. (2020) Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. 2020. Poor man’s bert: Smaller and faster transformer models. ArXiv, abs/2004.03844.
  • Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.
  • Seo et al. (2019) Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-time open-domain question answering with dense-sparse phrase index. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4430–4441, Florence, Italy. Association for Computational Linguistics.
  • Seonwoo et al. (2022) Yeon Seonwoo, Juhee Son, Jiho Jin, Sang-Woo Lee, Ji-Hoon Kim, Jung-Woo Ha, and Alice Oh. 2022. Two-step question retrieval for open-domain QA. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1487–1492, Dublin, Ireland. Association for Computational Linguistics.
  • Shen et al. (2022) Xiaoyu Shen, Svitlana Vakulenko, Marco Del Tredici, Gianni Barlacchi, Bill Byrne, and A. Gispert. 2022. Low-resource dense retrieval for open-domain question answering: A comprehensive survey. ArXiv, abs/2208.03197.
  • Shrivastava and Li (2014) Anshumali Shrivastava and Ping Li. 2014. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.
  • Singh et al. (2021) Devendra Singh, Siva Reddy, Will Hamilton, Chris Dyer, and Dani Yogatama. 2021. End-to-end training of multi-document reader and retriever for open-domain question answering. In Advances in Neural Information Processing Systems, volume 34, pages 25968–25981. Curran Associates, Inc.
  • Sivic and Zisserman (2003) Sivic and Zisserman. 2003. Video google: a text retrieval approach to object matching in videos. In Proceedings Ninth IEEE International Conference on Computer Vision, pages 1470–1477 vol.2.
  • Stanovsky et al. (2019) Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. 2019. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679–1684, Florence, Italy. Association for Computational Linguistics.
  • Sun et al. (2020) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: a compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2158–2170, Online. Association for Computational Linguistics.
  • Sung et al. (2022) Mujeen Sung, Jungsoo Park, Jaewoo Kang, Danqi Chen, and Jinhyuk Lee. 2022. Refining query representations for dense retrieval at test time. ArXiv, abs/2205.12680.
  • Treviso et al. (2022) Marcos Vinícius Treviso, Tianchu Ji, Ji-Ung Lee, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Pedro Henrique Martins, André F. T. Martins, Peter Milder, Colin Raffel, Edwin Simpson, Noam Slonim, Niranjan Balasubramanian, Leon Derczynski, and Roy Schwartz. 2022. Efficient methods for natural language processing: A survey. ArXiv, abs/2209.00099.
  • Voorhees and Tice (2000) Ellen M. Voorhees and Dawn M. Tice. 2000. The TREC-8 question answering track. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00), Athens, Greece. European Language Resources Association (ELRA).
  • Wang et al. (2018) Jingdong Wang, Ting Zhang, jingkuan song, Nicu Sebe, and Heng Tao Shen. 2018. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):769–790.
  • Wang et al. (2022) Yifan Wang, Haodi Ma, and Daisy Zhe Wang. 2022. Lider: An efficient high-dimensional learned index for large-scale dense passage retrieval. ArXiv, abs/2205.00970.
  • Wikipedia (2004) Wikipedia. 2004. Wikipedia. PediaPress.
  • Wu et al. (2021) Yuxiang Wu, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2021. Training adaptive computation for open-domain question answering with computational constraints. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 447–453, Online. Association for Computational Linguistics.
  • Wu et al. (2020) Yuxiang Wu, Sebastian Riedel, Pasquale Minervini, and Pontus Stenetorp. 2020. Don’t read too much into it: Adaptive computation for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3029–3039, Online. Association for Computational Linguistics.
  • Xia et al. (2022) Mengzhou Xia, Zexuan Zhong, and Danqi Chen. 2022. Structured pruning learns compact and accurate models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1513–1528, Dublin, Ireland. Association for Computational Linguistics.
  • Yamada et al. (2021) Ikuya Yamada, Akari Asai, and Hannaneh Hajishirzi. 2021. Efficient passage retrieval with hashing for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 979–986, Online. Association for Computational Linguistics.
  • Yang and Seo (2021) Sohee Yang and Minjoon Seo. 2021. Designing a minimal retrieve-and-read system for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5856–5865, Online. Association for Computational Linguistics.
  • Yang et al. (2019) Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 72–77, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Zhan et al. (2021) Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2021. Jointly optimizing query encoder and product quantization to improve retrieval performance. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’21, page 2487–2496, New York, NY, USA. Association for Computing Machinery.
  • Zhu et al. (2021) Fengbin Zhu, Wenqiang Lei, Chao Wang, Jianming Zheng, Soujanya Poria, and Tat-Seng Chua. 2021. Retrieving and reading: A comprehensive survey on open-domain question answering. CoRR, abs/2101.00774.
  • Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 19–27.
  • Zouhar et al. (2022) Vilém Zouhar, Marius Mosbach, Miaoran Zhang, and Dietrich Klakow. 2022. Knowledge base index compression via dimensionality and precision reduction. In Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge, pages 41–53, Dublin, Ireland and Online. Association for Computational Linguistics.