
Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval

Luyu Gao1, Xueguang Ma2, Jimmy Lin2, and Jamie Callan1 1Language Technologies Institute
Carnegie Mellon University
2David R. Cheriton School of Computer Science
University of Waterloo
Abstract.

Recent rapid advances in deep pre-trained language models and the introduction of large datasets have powered research in embedding-based dense retrieval. While several strong lines of research have emerged, many come with their own software stacks, typically optimized for particular research goals rather than efficiency or code structure. In this paper, we present Tevatron, a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity. Tevatron provides a standardized pipeline for dense retrieval including text processing, model training, corpus/query encoding, and search. This paper presents an overview of Tevatron and demonstrates its effectiveness and efficiency across several IR and QA datasets. We also show how Tevatron’s flexible design enables easy generalization across datasets, model architectures, and accelerator platforms (GPU/TPU). We believe Tevatron can serve as an effective software foundation for dense retrieval system research including design, modeling, and optimization.

1. Introduction

Dense retrieval’s popularity in the research community has grown rapidly in recent years (Karpukhin et al., 2020; Xiong et al., 2021; Lin et al., 2021; Qu et al., 2021; Gao and Callan, 2021a). By modeling relevance with query-document vector products, dense retrievers can carry out efficient and effective semantic search.

While the idea of vector-based search is not new, the adoption of deep pre-trained language models as encoders (Devlin et al., 2019) has substantially boosted the effectiveness of dense retrieval (Karpukhin et al., 2020). Meanwhile, as with other research that relies on deep learning, the success of dense retrieval would not be possible without large data. Many recent research works are based on their own software with specialized support only for specific datasets and models (Karpukhin et al., 2020; Xiong et al., 2021). We, however, believe that flexible generalization across models and datasets is critical. Tevatron provides researchers with access to the latest state-of-the-art models and makes it easy for them to start a new research problem on a new dataset.

In our past research on dense retrieval (Gao and Callan, 2021a, b; Ma et al., 2021), we have run into several engineering challenges specific to dense systems. For example, in terms of resources, large corpora and training sets require large amounts of CPU memory; accelerator (GPU/TPU) memory usage also grows with model size. While orthogonal to the actual research, these engineering problems slow down and constrain researchers, especially those with limited hardware resources. With Tevatron, we aim to provide a unified solution to these common engineering problems.

Tevatron builds on several popular, widely used open-source packages: datasets (Lhoest et al., 2021) for data management, transformers (Wolf et al., 2020) for neural network modeling, and FAISS (Johnson et al., 2019) for embedding-based retrieval.

To accommodate different research needs, we support two deep learning frameworks in Tevatron: Pytorch (Paszke et al., 2019) and JAX (Bradbury et al., 2018). Pytorch’s eager execution and intuitive object-oriented design have earned it a massive user base in the research community. JAX, on the other hand, backed by just-in-time (JIT) XLA compilation, offers smooth transitions across hardware stacks with optimized performance.

The rest of the paper is organized as follows. Section 2 gives an overview of Tevatron. Section 3 demonstrates Tevatron usage and command-line interface. Section 4 shows the experimental results of running Tevatron with various models and datasets.

2. Toolkit Overview

Tevatron (http://tevatron.ai) is packaged as a Python module available on the Python Package Index. Tevatron can be installed via pip, as follows:

$ pip install tevatron==0.1.0

In this section, we give an overview of the core components of Tevatron. We demonstrate how these components respectively support the full pipeline of data preparation, training, encoding, and search. Code and documentation of Tevatron are available at its website, tevatron.ai.

2.1. Data Management

Having data ready to use is a critical preliminary step before training or encoding starts. Data access overhead and constraints can directly affect training/encoding performance. In Tevatron, we adopt two core design decisions: 1) text data are pre-tokenized before training or encoding happens, and 2) tokenized data are kept memory-mapped rather than lazy-loaded or held fully in memory. The former avoids the overhead of running sub-word/piece level tokenizers on the fly and also reduces data traffic compared to raw text. The latter allows random data access in the training/encoding loop without consuming a large amount of physical memory.

Tevatron defines two basic raw input format templates, for IR and QA contexts. As shown in Fig. 1, for an IR dataset (e.g., MS MARCO (Bajaj et al., 2018)), we organize a training instance into an anchor query, a list of positive target texts, and a list of negative target texts. The positive targets are usually human-judged, and the negatives are usually non-relevant texts drawn from the top results of a baseline retrieval system such as BM25.

    {
       "query_id": "<query id>",
       "query":    "<query text>",
       "positive_passages": [
         {"docid": "<passage id>",
          "title": "<passage title>",
          "text":  "<passage body>"}, ...
       ],
       "negative_passages": [
         {"docid": "<passage id>",
          "title": "<passage title>",
          "text":  "<passage body>"}, ...
       ]
    }
Figure 1. Tevatron raw data template for IR datasets.

The second format (not shown in a figure due to space limits) has an additional answers field for QA tasks (e.g., Natural Questions (Kwiatkowski et al., 2019)), since the positive passages for QA datasets are usually judged by answer exact match (Chen et al., 2017; Karpukhin et al., 2020).
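
For reference, a minimal instance of the QA variant could look like the following; all fields mirror the IR template in Fig. 1 apart from the added answers list.

    {
       "query_id": "<query id>",
       "query":    "<query text>",
       "answers":  ["<answer text>", ...],
       "positive_passages": [...],
       "negative_passages": [...]
    }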

Users can pass a raw data file pointer and processing specifications to Tevatron’s dataset classes (HFTrainDataset for training; HFQueryDataset and HFCorpusDataset for encoding), which perform fast parallel data formatting and tokenization. Processed data is represented internally as a datasets.Dataset object and stored in the Apache Arrow format, which can be memory-mapped and randomly accessed by offset.

For researchers focused on building new models, we make a collection of popular open-access datasets self-contained within the Tevatron toolkit. For instance, with a single command, one can load the training set of MS MARCO. Under the hood, Tevatron first downloads the raw dataset we host on Huggingface (https://huggingface.co/tevatron), then runs the corresponding pre-defined pre-processing script to format and tokenize the downloaded data.
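
As a rough sketch of what happens under the hood, the same data can be fetched and pre-tokenized with the Huggingface datasets and transformers libraries directly; the field names below follow the template in Fig. 1, and the exact processing Tevatron applies is defined by its pre-processing scripts.

from datasets import load_dataset
from transformers import AutoTokenizer

# Download (and cache) the self-contained MS MARCO training split.
train = load_dataset("Tevatron/msmarco-passage", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(example):
    # Pre-tokenize the query text once, before training starts.
    example["query_input_ids"] = tokenizer(example["query"],
                                           add_special_tokens=False)["input_ids"]
    return example

# map() writes the processed records to memory-mapped Arrow files on disk.
train = train.map(tokenize, num_proc=4)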

2.2. Dense Retrieval Model

Tevatron’s model class DenseModel is a Pytorch nn.Module subclass that defines the deep neural encoder of the dense retriever. Functionally, it interfaces with the underlying Transformer models and provides methods for text encoding and loss computation. Thanks to duck typing in Python, the DenseModel class supports any model in the Huggingface transformers library that returns the standard base model output. This means new Transformer models can be loaded into Tevatron as soon as they become available in the transformers library; it also spares Tevatron from maintaining copies of Transformer code and reduces code duplication. Internally, DenseModel wraps a Transformer module and, optionally, a pooler module that controls how the Transformer output tensor is mapped to the final representation. The DenseModel class also handles loss computation during training: it implements a contrastive loss with in-batch negatives and can share negatives across devices using the collective operations defined in the NCCL library.
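
To make the loss computation concrete, the following is a minimal sketch (not Tevatron’s exact implementation) of a contrastive loss with in-batch negatives; the function name and the assumed tensor layout (queries of shape (B, d), passages of shape (B * n_passages, d), with each query’s positive listed first) are illustrative.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_reps: torch.Tensor, p_reps: torch.Tensor,
                              n_passages: int = 2) -> torch.Tensor:
    # Dot-product scores between every query and every passage in the batch.
    scores = q_reps @ p_reps.T                                   # (B, B * n_passages)
    # Each query's positive passage sits at index i * n_passages; every other
    # passage in the batch serves as a negative.
    target = torch.arange(q_reps.size(0), device=scores.device) * n_passages
    return F.cross_entropy(scores, target)

When training on multiple devices, the passage representations can additionally be all-gathered across devices before computing the scores, which corresponds to the negative-sharing behavior described above.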

JAX Support

Tevatron has a sub-package tevax that implements its core functionality for JAX. Following JAX’s functional nature (Bradbury et al., 2018), tevax is designed with a different philosophy: we define loss functions that can be composed with other JAX transformations. In practice, they can be combined with Flax models in the transformers library for dense retriever training. Two classes for parameter management, TiedParams and DualParams, are registered as Pytrees that JAX can differentiate through. A RetrieverTrainState class manages parameters and model transformations. With JAX as the backend, tevax makes it possible for a single piece of code to run on a single GPU, multiple GPUs, or TPU systems.
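
As an illustration of this functional style, the following is a minimal sketch (not tevax’s actual code) of the same in-batch contrastive loss written as a pure JAX function; the names and tensor layout are illustrative assumptions.

import jax
import jax.numpy as jnp

def contrastive_loss(q_reps: jnp.ndarray, p_reps: jnp.ndarray,
                     n_passages: int = 2) -> jnp.ndarray:
    scores = q_reps @ p_reps.T                             # (B, B * n_passages) similarity scores
    labels = jnp.arange(q_reps.shape[0]) * n_passages      # index of each query's positive passage
    log_probs = jax.nn.log_softmax(scores, axis=-1)
    return -jnp.take_along_axis(log_probs, labels[:, None], axis=-1).mean()

Being a pure function, such a loss composes directly with jax.grad, jax.jit, and jax.pmap, which is what lets the same code run on single-GPU, multi-GPU, or TPU systems.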

2.3. Trainer

To complete the dense retriever training setup, we introduce a DenseTrainer which implements miscellaneous training utilities. It controls basic setups such as batch size and the number of training epochs. When running on multiple GPUs, the trainer will properly set up distributed training and wrap models for gradient reduction. During training, it will asynchronously load training data to overlap computation and I/O operations. At each training step, the trainer turns a batch of loaded data into tensors and passes them to the model. In this way, the trainer glues the datasets and models together.

DenseTrainer is a subclass of Trainer in the transformers library. It inherits a collection of advanced utilities including mixed-precision training and optimizer state sharding. It is also possible to further subclass DenseTrainer to create unique training behaviors. Concretely in Tevatron, we implement a subclass GCTrainer which uses gradient caching to support large batch training on memory-limited devices (Luyu Gao and Callan, 2021).

By combining the data processor, dense retrieval model, and trainer, Tevatron abstracts the training loop of a dense retrieval model into the code block shown in Fig. 2.

# initialize model
model = DenseModel.build(model_args)
# initialize dataset
train_dataset = HFTrainDataset(data_args).process()
train_dataset = TrainDataset(data_args, train_dataset)
# initialize trainer
trainer = GCTrainer if training_args.grad_cache else DenseTrainer
trainer = trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=QPCollator(data_args),
)
# start training
trainer.train()

Figure 2. Training process of Tevatron (Pytorch)

2.4. Retriever

The retriever classes in Tevatron build a dense retrieval index from text embeddings and execute search over the index. We use the FAISS library (Johnson et al., 2019) as the retriever backend; it implements several efficient indices in C++ and exposes them through Python interfaces. For users who want the best retrieval effectiveness, Tevatron provides a simple class BaseFaissIPRetriever which wraps a flat faiss.IndexFlatIP index for exact search. Those who want to trade off efficiency against effectiveness can use the more powerful FaissRetriever class. FaissRetriever takes an additional index_spec string argument in its initialization method and uses the faiss.index_factory method to flexibly build the specified index. Users can take advantage of this interface to build approximate search indices such as HNSW (Malkov and Yashunin, 2020) or PQ (Jégou et al., 2011).
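
As a small illustration of these two options, the sketch below uses the FAISS Python API directly rather than Tevatron’s retriever classes; the embedding dimensionality, the random embeddings, and the "HNSW32" spec string are illustrative.

import faiss
import numpy as np

dim = 768
corpus_emb = np.random.rand(10000, dim).astype("float32")   # stand-in for encoded passages
query_emb = np.random.rand(4, dim).astype("float32")        # stand-in for encoded queries

# Exact (flat) inner-product search, as wrapped by BaseFaissIPRetriever.
flat_index = faiss.IndexFlatIP(dim)
flat_index.add(corpus_emb)
scores, indices = flat_index.search(query_emb, 100)          # top-100 per query

# Approximate search built from an index_factory spec, as FaissRetriever allows.
hnsw_index = faiss.index_factory(dim, "HNSW32", faiss.METRIC_INNER_PRODUCT)
hnsw_index.add(corpus_emb)
scores, indices = hnsw_index.search(query_emb, 100)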

3. Toolkit Usage

On top of the core components, Tevatron provides a set of command-line interfaces (CLI) to drive the dense retrieval pipeline. With its flexible support for data and neural models, one can conduct research of various kinds without writing code. In this section, we give an example of using the Tevatron CLI to run the previously discussed components to train a model and perform open-domain retrieval on Natural Questions (Kwiatkowski et al., 2019).

3.1. Training

With Tevatron, we are able to replicate the training of the DPR model on the NQ dataset (see details in Section 4.1) with a single command:

python -m tevatron.driver.train \
    --do_train \
    --dataset_name Tevatron/wikipedia-nq \
    --model_name_or_path bert-base-uncased \
    --per_device_train_batch_size 128 \
    --train_n_passages 2 \
    --num_train_epochs 40 \
    --learning_rate 1e-5 \
    --fp16 \
    --grad_cache \
    --output_dir model_nq

As introduced in Section 2.1, Tevatron automatically handles the downloading and pre-processing of our self-contained training data Tevatron/wikipedia-nq. The preprocessed dataset and the initialized DenseModel are then fed into the trainer class as shown in Fig. 2. Since the above command enables the grad_cache option, GCTrainer is used during training. Here we also enable mixed-precision training (Micikevicius et al., 2018) via the --fp16 option to improve efficiency.

3.2. Encoding

Besides training data, Tevatron also bundles the corresponding corpus for each self-contained dataset. Again, we simplify the corpus encoding process into a single command:

python -m tevatron.driver.encode \
    --output_dir=temp \
    --model_name_or_path model_nq \
    --dataset_name Tevatron/wikipedia-nq-corpus \
    --encoded_save_path corpus_emb_00.pkl \
    --encode_num_shard 20 \
    --encode_shard_index 00 \
    --fp16

As encoding the entire corpus in a single process can require large amounts of RAM and take a long time, Tevatron supports encoding the corpus in shards. For example, the above command encodes the first 1/20 split of the entire corpus. Users can easily run multiple processes for multiple shards in parallel to speed up the encoding process.
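
For instance, the 20 shards could be encoded with a simple shell loop (shown sequentially here; in practice each shard can be dispatched to a separate GPU or machine):

for shard in $(seq -f "%02g" 0 19); do
  python -m tevatron.driver.encode \
      --output_dir=temp \
      --model_name_or_path model_nq \
      --dataset_name Tevatron/wikipedia-nq-corpus \
      --encoded_save_path corpus_emb_${shard}.pkl \
      --encode_num_shard 20 \
      --encode_shard_index ${shard} \
      --fp16
done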

3.3. Retrieval

By taking query and corpus embeddings, we can run retrieval with following command:

python -m tevatron.faiss_retriever \
    --query_reps query.pkl \
    --passage_reps corpus_emb_*.pkl \
    --depth 100 \
    --batch_size -1 \
    --save_text \
    --save_ranking_to result.txt

where --batch_size controls the number of queries passed to the FAISS index in each search call; -1 passes all queries in one call. Larger batches typically run faster due to better memory access patterns and hardware utilization. The results are saved in a text file where each line stores query_id passage_id score.
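
As a small illustration, the ranking file can be loaded with a few lines of Python; this sketch only assumes the query_id passage_id score line format described above.

from collections import defaultdict

# Collect the retrieved passages for each query from the ranking file.
ranking = defaultdict(list)
with open("result.txt") as f:
    for line in f:
        query_id, passage_id, score = line.split()
        ranking[query_id].append((passage_id, float(score)))

# Sort each query's hits by score to recover the ranked list.
for hits in ranking.values():
    hits.sort(key=lambda hit: hit[1], reverse=True)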

4. Experiments

In this section, we demonstrate the effectiveness and efficiency of Tevatron by running experiments on two commonly used collections for QA and IR tasks, Wikipedia and MS MARCO.

4.1. Comparison with DPR

The DPR work by Karpukhin et al. (Karpukhin et al., 2020) is among the first to show that text retrieval using learned dense representations outperforms traditional text retrieval using heuristic sparse representations (e.g., BM25) on open-domain question-answering tasks.

Model                          NQ            TriviaQA      SQuAD         CuratedTrec   WebQuestions
DPR (Karpukhin et al., 2020)   78.4 / 85.4   79.4 / 85.0   63.2 / 77.2   79.8 / 89.1   73.2 / 81.4
Tevatron                       79.8 / 86.9   80.2 / 85.5   62.3 / 77.0   84.0 / 90.7   75.4 / 82.9
Table 1. Top-20/top-100 retrieval accuracy of the DPR model replication on five open-domain QA datasets.

We evaluate the effectiveness of Tevatron by replicating the retrieval results on QA tasks (Kwiatkowski et al., 2019; Joshi et al., 2017; Rajpurkar et al., 2016; Voorhees and Tice, 2000; Berant et al., 2013) reported in the original DPR work (Karpukhin et al., 2020). We compare models trained under the "Single" setting defined in the original work, where each model is trained on the corresponding individual dataset. Following similar hyperparameter settings, we train the models with a learning rate of 1e-5 for 40 epochs with batch size 128. In Table 1, apart from slightly lower accuracy than the DPR paper on SQuAD, Tevatron gives slightly higher top-k accuracy on the other four datasets. Overall, all top-k accuracy results obtained via the Tevatron pipeline are at the same level as the original work. We therefore conclude that this is a successful replication, demonstrating that the Tevatron pipeline is effective.

Setting              RAM   GPU memory   Time
DPR-repo             60G   20G x 4      2.0 hours
Tevatron-default     17G   17G x 4      1.5 hours
Tevatron-GradCache   4G    15G x 1      7.0 hours
Tevatron-TPU         10G   --           1.0 hours
Table 2. Training efficiency comparison between the original DPR repo and different Tevatron settings.

We demonstrate the efficiency of Tevatron by comparing it with the original DPR repo (https://github.com/facebookresearch/DPR; the efficiency results are based on its master branch as of 2022-02-12) on three dimensions: RAM usage, GPU memory usage, and training time. The experiments are conducted on a machine with NVIDIA A100 GPUs. In both the DPR-repo and Tevatron-default settings, we train the dense retriever model on 4 GPUs in Pytorch's distributed data-parallel mode. Comparing the first two rows in Table 2, Tevatron is more efficient on all three dimensions than the original codebase. Concretely, Tevatron uses about 3/4 less RAM and 12G less GPU memory in total, and trains about 1/4 faster than the DPR repo. This means that, given the same resources, Tevatron can support larger training data, larger batch sizes, and faster training.

The gradient cache feature of Tevatron further improves GPU memory efficiency (Luyu Gao and Callan, 2021). DPR training requires a batch size of 128 to reach the level of retrieval accuracy reported above. With the original DPR repo, users with limited GPU resources cannot train with a sufficiently large batch, which results in a drop in retrieval accuracy. Tevatron gives users the option to train dense retrievers on limited GPU resources while keeping the same effective batch size for each optimization step. To illustrate this, we conduct experiments with Tevatron-GradCache on a single GPU. Tevatron-GradCache splits the batch of size 128 into sub-batches of size 32. By performing the two rounds of forward computation described in the gradient cache work (Luyu Gao and Callan, 2021), each model update of Tevatron-GradCache is mathematically equivalent to that of Tevatron-default. In this experiment, training a DPR model on the NQ dataset with the desired batch size costs only 4G RAM and 15G GPU memory. By reducing the sub-batch size, Tevatron-GradCache can save even more GPU memory.
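
To make the two-round procedure concrete, the following is a minimal sketch of the gradient caching idea (following Luyu Gao and Callan, 2021), not Tevatron's GCTrainer implementation; the encoder, loss_fn, and input tensors are illustrative stand-ins.

import torch

def grad_cache_step(encoder, loss_fn, queries, passages, sub_batch=32):
    # Round 1: forward all sub-batches without building graphs; cache embeddings.
    with torch.no_grad():
        q_reps = torch.cat([encoder(queries[i:i + sub_batch])
                            for i in range(0, len(queries), sub_batch)])
        p_reps = torch.cat([encoder(passages[i:i + sub_batch])
                            for i in range(0, len(passages), sub_batch)])

    # Compute the full-batch contrastive loss on the cached embeddings and
    # obtain gradients w.r.t. the embeddings only (cheap: no encoder graph).
    q_reps, p_reps = q_reps.requires_grad_(), p_reps.requires_grad_()
    loss = loss_fn(q_reps, p_reps)
    loss.backward()
    q_grads, p_grads = q_reps.grad, p_reps.grad

    # Round 2: re-encode each sub-batch with graphs enabled and backpropagate
    # the cached embedding gradients through the encoder via a surrogate product.
    for reps_grad, inputs in ((q_grads, queries), (p_grads, passages)):
        for i in range(0, len(inputs), sub_batch):
            reps = encoder(inputs[i:i + sub_batch])
            surrogate = (reps * reps_grad[i:i + sub_batch]).sum()
            surrogate.backward()
    return loss.detach()

An optimizer step after this routine updates the encoder exactly as if the full batch of 128 had been processed at once, while each forward/backward pass only holds a sub-batch of 32 in accelerator memory.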

We also evaluated the performance of training the dense retriever model using the JAX backend of Tevatron on a V3-8 TPU VM. When DPR (Karpukhin et al., 2020) first came out, it took around a day to train a dense retriever on the NQ dataset using the initial DPR repo with 8 NVIDIA V100 GPUs (as recorded by the DPR authors on the project's GitHub page). It is exciting to see that such training can now be done within one hour with Tevatron.

4.2. Supervised IR

To further show the flexibility of the Tevatron toolkit across model architectures and accelerator platforms, we train multiple dense retrieval baselines on the MS MARCO passage ranking task with different Transformer backbones from the HuggingFace hub (https://huggingface.co/models). The experiments are conducted on both GPU and TPU platforms. The models are trained with a learning rate of 5e-6 and batch size 64 for 3 epochs using our self-contained dataset Tevatron/msmarco-passage.

   Model                        MRR@10   TPU time    GPU time
1. distilbert-base-uncased      0.316    1.0 hours   1.5 hours
2. bert-base-uncased            0.322    2.0 hours   3.0 hours
3. co-condenser-marco           0.357    2.0 hours   3.0 hours
4. bert-large-uncased           0.327    6.0 hours   7.5 hours
5. roberta-large                0.339    6.0 hours   7.5 hours
6. roberta-large + HN           0.361    8.0 hours   10.0 hours
7. co-condenser-marco + HN      0.382    3.0 hours   4.0 hours
Table 3. Results of training dense retrievers with the Tevatron toolkit using different Transformer model initializations on the MS MARCO passage ranking task.

In Table 3, we show MRR@10 for each model, along with training time on 4x A100 GPUs and a V3-8 TPU. The model backbones we choose vary across:

  • model size: rows (1), (2), and (4) are models in the BERT family of different sizes {distil, base, large}.

  • model type: rows (4) and (5) are models of similar size with different backbone architectures {bert, roberta}.

  • model parameters: rows (2) and (3) share the same backbone, but the latter is further pre-trained from row (2), i.e. {original, fine-tuned}.

For all these variants of model initialization, Tevatron trains dense retrievers effectively and efficiently on different platforms using the Tevatron CLI commands.

Finally, we evaluate two models trained with hard negatives (Gao and Callan, 2021b) mined with the Tevatron retriever. We create the augmented training data by combining the hard negative passages mined by the first-round dense retriever with the original training dataset, then retrain the models on the augmented data. The last two rows in Table 3 show that augmenting training data with hard negatives further improves the effectiveness of the dense retriever. We also demonstrate that Tevatron can replicate the state-of-the-art coCondenser retriever (Gao and Callan, 2021b) on MS MARCO passage ranking.

4.3. Cross-lingual Retrieval

The success of dense retrieval also drives research in multilingual retrieval (Asai et al., 2021; Zhang et al., 2021; Clark et al., 2020). We additionally show that the Tevatron toolkit generalizes to multilingual retrieval tasks by replicating the dense retrieval baseline reported for the XOR-Retrieve task (Asai et al., 2021).

           Ar     Bn     Fi     Ja     Ko     Ru     Te     Avg
mDPR       50.4   57.7   58.9   37.3   42.8   44.0   44.9   48.0
Tevatron   50.5   64.1   57.3   41.9   60.4   48.5   58.4   54.4
Table 4. Recall@5kt of the mDPR replication on the dev set of the XOR-Retrieve task with Tevatron.

We train a dense retriever with Tevatron that encodes queries in seven languages and the English corpus into the same embedding space. Such a model can conduct retrieval in a single stage without the need for additional translation. Results in Table 4 show that the baseline model replicated with Tevatron gains about 6 points on average over the original baseline results across the seven languages.

5. Conclusion

This paper introduces Tevatron, an efficient and flexible toolkit for training and running dense retrievers with Transformers. The toolkit has a modularized design for easy research exploration and a set of command-line interfaces for fast development and evaluation. Our experiments show that Tevatron can be used to train dense retrieval models effectively and efficiently. Its flexible and generalizable functionality provides the IR community a convenient foundation for future dense retrieval research.

Acknowledgments

We would like to thank Google’s TPU Research Cloud (TRC) for access to Cloud TPUs and Compute Canada for access to GPU clusters.

References

  • Asai et al. (2021) Akari Asai, Jungo Kasai, Jonathan H. Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2021. XOR QA: Cross-lingual Open-Retrieval Question Answering. In NAACL-HLT.
  • Bajaj et al. (2018) Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v3 (2018).
  • Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, 1533–1544.
  • Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax
  • Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Association for Computational Linguistics (ACL).
  • Clark et al. (2020) Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. TACL (2020).
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  • Gao and Callan (2021a) Luyu Gao and Jamie Callan. 2021a. Condenser: a Pre-training Architecture for Dense Retrieval. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 981–993. https://doi.org/10.18653/v1/2021.emnlp-main.75
  • Gao and Callan (2021b) Luyu Gao and Jamie Callan. 2021b. Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval. arXiv:2108.05540 [cs.IR]
  • Jégou et al. (2011) Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2011), 117–128.
  • Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 1601–1611.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 7 (2019), 452–466.
  • Lhoest et al. (2021) Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. 2021. Datasets: A Community Library for Natural Language Processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 175–184. arXiv:2109.02846 [cs.CL] https://aclanthology.org/2021.emnlp-demo.21
  • Lin et al. (2021) Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2021. In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021). 163–173.
  • Luyu Gao and Callan (2021) Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. 2021. Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup. In Proceedings of the 6th Workshop on Representation Learning for NLP.
  • Ma et al. (2021) Xueguang Ma, Kai Sun, Ronak Pradeep, and Jimmy Lin. 2021. A Replication Study of Dense Passage Retriever. arXiv:2104.05740 (2021).
  • Malkov and Yashunin (2020) Yu A. Malkov and D. A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. Transactions on Pattern Analysis and Machine Intelligence 42, 4 (2020), 824–836.
  • Micikevicius et al. (2018) Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Frederick Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed Precision Training. ArXiv abs/1710.03740 (2018).
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  • Qu et al. (2021) Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2021. RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 5835–5847. https://doi.org/10.18653/v1/2021.naacl-main.466
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, 2383–2392.
  • Voorhees and Tice (2000) Ellen M. Voorhees and Dawn M. Tice. 2000. The TREC-8 Question Answering Track. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00). Athens, Greece.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
  • Xiong et al. (2021) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021).
  • Zhang et al. (2021) Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021. Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval. arXiv:2108.08787 (2021).