Traditional IR rivals neural models on the MS MARCO Document Ranking Leaderboard
Abstract
This short document describes a traditional IR system that achieved an MRR@100 of 0.298 on the MS MARCO Document Ranking leaderboard (as of 2020-12-06). Although inferior to most BERT-based models, it outperformed several neural runs (as well as all non-neural ones), including two submissions that used a large pretrained Transformer model for re-ranking. We provide software and data to reproduce our results.
1 Introduction and Motivation
A typical text retrieval system uses a multi-stage retrieval pipeline, where documents flow through a series of “funnels” that discard unpromising candidates using increasingly complex and accurate ranking components. Such systems have traditionally relied on simple term-matching techniques to generate an initial list of candidates [7, 9]. As a result, retrieval performance is adversely affected by a mismatch between query and document terms, which is known as the vocabulary gap problem [11, 21].
The vocabulary gap can be mitigated by learning dense or sparse representations for effective first-stage retrieval. Despite recent success in achieving this objective [14, 13, 20], existing studies have at least one of the following flaws:
- They often compare against a weak baseline such as untuned BM25 [17];
- They nearly always rely on exact brute-force k-NN search, which is not practical at scale. Moreover, they mostly ignore efficiency-effectiveness and scalability trade-offs related to using approximate k-NN search (§ 3.3 in [2]);
- They require expensive index-time computation using a large Transformer [18] model.
This motivated us to develop a carefully tuned traditional, i.e., non-neural, system, which we evaluated in the MS MARCO document ranking task [15, 8]. Our objectives are:
- To provide a stronger traditional baseline;
- To develop a first-stage retrieval system that is both efficient and effective without expensive index-time precomputation.
Our submission (dated 2020-12-06) achieved MRR=0.298 on the hidden validation set and outperformed all other traditional systems. It was the first traditional system on this leaderboard to outstrip several neural baselines. According to our own evaluation on TREC NIST data [8], our system achieves NDCG@10 equal to 0.584 and 0.558 on 2019 and 2020 queries, respectively. It thus outperforms a tuned BM25 system by 6-7% (whose NDCG@10 is 0.544 and 0.524 on 2019 and 2020 queries, respectively). We posted two notebooks that reproduce our results:
- The first notebook reproduces all steps necessary to download the data, preprocess it, and train all the models.
- The second notebook operates on preprocessed data in FlexNeuART JSONL format. It does not require running MGIZA to train IBM Model 1 (the models are provided pre-trained).
2 System Description
2.1 Overview
We use our FlexNeuART retrieval toolkit [4] (https://github.com/oaqa/FlexNeuART), which ingests data (queries and documents) in the form of multi-field JSON entries. A field can include any textual information associated with a document. In particular, each MS MARCO document has three text attributes: URL, title, and body (the main document text with HTML formatting removed). A field can be provided as is, i.e., without any modification, or it can be preprocessed and split into tokens. Given an original textual attribute, e.g., the title, it is possible to generate multiple fields, which are tokenized, lemmatized, stopped, or otherwise processed in different ways.
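To make the data format concrete, the following is a minimal sketch of what a multi-field JSONL entry may look like; the field names and values are illustrative assumptions rather than the exact schema used in our runs.

```python
# A minimal illustrative sketch (field names are hypothetical, not the exact schema we use):
# one JSONL line per document, with several differently processed fields derived from the
# same textual attributes (URL, title, body).
import json

doc_entry = {
    "DOCNO": "D0000001",                        # document identifier (placeholder)
    "url": "example com heart attack symptom",  # URL after heuristic cleanup and lemmatization
    "title": "symptoms of a heart attack",      # tokenized title
    "title_lemm": "symptom heart attack",       # lemmatized + stopped title
    "body": "...",                              # tokenized body text
    "body_bert_tok": "...",                     # body split into BERT word pieces
}

with open("docs.jsonl", "a", encoding="utf-8") as out:
    out.write(json.dumps(doc_entry) + "\n")
```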
In the current system, a textual attribute of an MS MARCO document can be tokenized using Spacy [12] or split into BERT word pieces [10, 19]. The tokenized field can be further lemmatized and/or stopped (stopping is not applied to BERT word pieces). In the case of the URL, we heuristically preprocess the input before applying Spacy: we remove prefixes such as http://, slashes, and other punctuation.
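Below is a minimal sketch of this kind of preprocessing; the specific regular expressions and the choice of Spacy model are assumptions for illustration, not the exact rules used in our pipeline.

```python
# A minimal sketch of heuristic URL cleanup (assumed regexes) followed by
# Spacy tokenization, lemmatization, and stopword removal.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # any English Spacy model

def clean_url(url: str) -> str:
    url = re.sub(r"^https?://", "", url)        # strip the protocol prefix
    url = re.sub(r"^www\.", "", url)            # strip a leading www.
    url = re.sub(r"[/\.\-_?&=#+%]", " ", url)   # replace slashes and punctuation with spaces
    return re.sub(r"\s+", " ", url).strip()

def lemmatize_and_stop(text: str) -> str:
    doc = nlp(text)
    return " ".join(t.lemma_.lower() for t in doc if not (t.is_stop or t.is_punct))

print(lemmatize_and_stop(clean_url("http://www.example.com/heart-attack/symptoms")))
```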
Our submission retrieves 1,000 candidate documents using Apache Lucene (https://lucene.apache.org/) with a tuned BM25 [17] ranking function and re-ranks them using a LambdaMART [6] model, which aggregates 13 features described below.
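Schematically, the pipeline looks as follows; the function names are hypothetical placeholders standing in for Lucene retrieval, FlexNeuART feature extraction, and the trained LambdaMART model, not actual APIs.

```python
# A schematic retrieve-then-re-rank pipeline: a first-stage BM25 retriever returns 1,000
# candidates, a feature extractor computes the 13 features, and a LambdaMART model
# produces the final ranking. All callables here are hypothetical placeholders.
from typing import Callable, List, Tuple

def rerank(query: str,
           bm25_search: Callable[[str, int], List[str]],
           compute_features: Callable[[str, str], List[float]],
           lambdamart_score: Callable[[List[float]], float],
           k: int = 1000) -> List[Tuple[str, float]]:
    candidates = bm25_search(query, k)                       # first stage: Lucene BM25
    scored = [(doc_id, lambdamart_score(compute_features(query, doc_id)))
              for doc_id in candidates]                      # second stage: 13-feature model
    return sorted(scored, key=lambda x: x[1], reverse=True)  # final ranking
```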
2.2 Features
Overall we compute 13 features, which include simple textual similarity features as well as lexical translation features, namely IBM Model 1 log-scores [1]. The corresponding JSON descriptor can be found online. BM25 and IBM Model 1 log-scores are the core ingredients. The full set of features includes (a small illustrative sketch of two of these features follows the list):
- BM25 [17] scores, which are normalized using the sum of query-term IDFs;
- cosine similarity scores;
- relative query-document overlap scores: the number of query terms that appear in the document, normalized by the total number of query terms;
- normalized proximity scores (similar to those used in our prior TREC experiments [3]): each scoring component generates two feature values;
- IBM Model 1 [1] log-scores, which are normalized by the number of query terms.
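As a concrete illustration of two of the simpler features, consider the following sketch; it is not the FlexNeuART implementation, and counting each distinct query term once is an assumption.

```python
# Illustrative sketches of (1) relative query-document overlap and
# (2) a BM25 score normalized by the sum of query-term IDFs.
from typing import Dict, List

def relative_overlap(query_terms: List[str], doc_terms: List[str]) -> float:
    """Fraction of (distinct) query terms that occur in the document."""
    q = set(query_terms)
    return len(q & set(doc_terms)) / max(1, len(q))

def normalized_bm25(bm25_score: float, query_terms: List[str], idf: Dict[str, float]) -> float:
    """Divide a raw BM25 score by the sum of query-term IDFs, making scores comparable across queries."""
    denom = sum(idf.get(t, 0.0) for t in set(query_terms))
    return bm25_score / denom if denom > 0 else 0.0
```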
IBM Model 1 [1] is a lexical translation model, which is trained on a parallel corpus (a bitext). The parallel corpus consists of queries paired with respective relevant documents. Training relies on the expectation-maximization algorithm [5] implemented in MGIZA [16] (https://github.com/moses-smt/mgiza/).
Overall we compute four IBM Model 1 log-scores. Three scores are computed for non-lemmatized attributes (URL, title, and body), which are tokenized with Spacy [12]. One score is computed for the body split into BERT word pieces [10, 19].
We take several measures to maximize the effectiveness of IBM Model 1 (see § 3.1.1.2 in [2] for details). Most importantly, it is necessary to have queries and documents of comparable sizes. This holds for the URL and the title, but not for the body, which typically has hundreds of tokens. To resolve this issue, for each pair of a query q and a respective relevant document d, we first split d into multiple short chunks d1, d2, .... Then, we replace the pair (q, d) with the set of pairs (q, d1), (q, d2), ....
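A minimal sketch of this bitext-preparation step is given below; the chunk size is an assumption (the chunk length we actually use may differ).

```python
# Sketch of bitext preparation: each relevant document is split into short chunks so that the
# "source" and "target" sides of the parallel corpus have comparable lengths, and the single
# pair (q, d) is replaced by the pairs (q, d1), (q, d2), ...
from typing import List, Tuple

def chunk_document(doc_terms: List[str], chunk_size: int = 16) -> List[List[str]]:
    return [doc_terms[i:i + chunk_size] for i in range(0, len(doc_terms), chunk_size)]

def make_bitext(query_terms: List[str], doc_terms: List[str],
                chunk_size: int = 16) -> List[Tuple[List[str], List[str]]]:
    return [(query_terms, chunk) for chunk in chunk_document(doc_terms, chunk_size)]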
2.3 Efficiency Considerations
Although our traditional system is reasonably efficient, it is research software, which lags somewhat in speed behind an optimized traditional system. In particular, it contains an inefficient module that computes proximity scores in time O(w · |d|), where w is the size of the window (used to count word pairs) and |d| is the document length. It can be implemented to run roughly w times faster.
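To illustrate where the w · |d| cost comes from, the following sketch counts query-term pairs co-occurring within a window of size w by scanning the document; the actual proximity scoring from [3] differs, and this only shows the naive access pattern.

```python
# Illustrative O(w * |d|) routine: for every document position holding a query term,
# inspect up to w - 1 following positions. This is NOT the exact proximity score we use.
from typing import List, Set

def count_query_pairs_in_window(doc_terms: List[str], query_terms: Set[str], w: int) -> int:
    count = 0
    for i, term in enumerate(doc_terms):
        if term not in query_terms:
            continue
        for j in range(i + 1, min(i + w, len(doc_terms))):  # the inner loop is the w factor
            if doc_terms[j] in query_terms:
                count += 1
    return count
```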
Furthermore, we can improve the computation of IBM Model 1. First, translation tables produced by MGIZA can be pruned more aggressively with a small loss in accuracy. Second, the scoring algorithm runs in time O(|q| · |d|), where |q| and |d| are the query and document lengths. This can be reduced to time roughly linear in |d| by precomputing a query-specific reverse-translation table (§ 3.1.2.1 in [2]). Precomputation requires extra time, but it is well worth the effort when we need to re-rank a large number of documents.
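The following sketch illustrates the idea; the smoothing scheme, table layout, and parameter values are assumptions rather than the exact FlexNeuART implementation. A map from document-side words to the query terms they can translate to is built once per query, after which scoring each document needs only a single pass over its terms.

```python
# Sketch of smoothed IBM Model 1 query-likelihood scoring with a query-specific
# reverse-translation table (smoothing scheme and parameter values are assumptions).
import math
from collections import defaultdict
from typing import Dict, List, Tuple

# trans_prob[q_term][d_term] = T(q_term | d_term): a sparse translation table keyed by query-side term
TransTable = Dict[str, Dict[str, float]]

def build_reverse_table(query_terms: List[str],
                        trans_prob: TransTable) -> Dict[str, List[Tuple[int, float]]]:
    """Built once per query: maps a document-side word to (query-term index, probability) pairs."""
    rev: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for i, qt in enumerate(query_terms):
        for d_term, p in trans_prob.get(qt, {}).items():
            rev[d_term].append((i, p))
    return rev

def model1_log_score(query_terms: List[str], doc_terms: List[str],
                     rev: Dict[str, List[Tuple[int, float]]],
                     collection_prob: Dict[str, float], lam: float = 0.1) -> float:
    """Model 1 log-score of the query given the document, normalized by the number of query terms."""
    sums = [0.0] * len(query_terms)
    for d_term in doc_terms:                  # single pass over the document: no |q| * |d| loop
        for i, p in rev.get(d_term, ()):      # only query terms reachable from this document word
            sums[i] += p
    log_score = 0.0
    for i, qt in enumerate(query_terms):
        prob = (1.0 - lam) * sums[i] / max(1, len(doc_terms)) + lam * collection_prob.get(qt, 1e-9)
        log_score += math.log(prob)
    return log_score / max(1, len(query_terms))
```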
Last but not least, re-ranking requires random access to data stored in a forward index. Reading forward-index entries from memory can be quite fast. However, storing several fields per document can be space inefficient. At the same time, if data is read from disk, reading can become a primary computational bottleneck. FlexNeuART generates and reads forward indices for each field separately, because it is a simpler and more flexible approach. In a production system, all field entries belonging to the same document can be stored jointly. Because recent consumer SSD drives can sustain about 25K random reads per second (https://ssd.userbenchmark.com/SpeedTest/693540/Samsung-SSD-970-EVO-Plus-1TB), it is quite practical to re-rank hundreds of documents per query by reading forward-index entries from a fast SSD rather than from main memory.
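As a rough, illustrative back-of-envelope calculation (assuming, say, four per-field forward-index reads per document, a number chosen only for illustration): re-ranking 1,000 candidates would require about 4,000 random reads, i.e., roughly 160 ms per query at 25K reads per second, while re-ranking a few hundred documents stays in the range of tens of milliseconds.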
References
- [1] Berger, A., Lafferty, J.: Information retrieval as statistical translation. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. pp. 222–229 (1999)
- [2] Boytsov, L.: Efficient and Accurate Non-Metric k-NN Search with Applications to Text Matching. Ph.D. thesis, Carnegie Mellon University (2018)
- [3] Boytsov, L., Belova, A.: Evaluating learning-to-rank methods in the web track adhoc task. In: TREC (2011)
- [4] Boytsov, L., Nyberg, E.: Flexible retrieval with NMSLIB and FlexNeuART. In: Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS). pp. 32–43 (2020)
- [5] Brown, P.F., Pietra, S.D., Pietra, V.J.D., Mercer, R.L.: The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2), 263–311 (1993)
- [6] Burges, C.J.: From RankNet to LambdaRank to LambdaMART: An overview. Microsoft Technical Report MSR-TR-2010-82 (2010)
- [7] Büttcher, S., Clarke, C.L., Cormack, G.V.: Information retrieval: Implementing and evaluating search engines. MIT Press (2016)
- [8] Craswell, N., Mitra, B., Yilmaz, E., Campos, D., Voorhees, E.M.: Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003.07820 (2020)
- [9] Croft, W.B., Metzler, D., Strohman, T.: Search engines: Information retrieval in practice, vol. 520. Addison-Wesley Reading (2010)
- [10] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT. pp. 4171–4186 (2019)
- [11] Furnas, G.W., Landauer, T.K., Gomez, L.M., Dumais, S.T.: The vocabulary problem in human-system communication. Commun. ACM 30(11), 964–971 (1987)
- [12] Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear (2017)
- [13] Karpukhin, V., Oğuz, B., Min, S., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020)
- [14] Lee, K., Chang, M.W., Toutanova, K.: Latent retrieval for weakly supervised open domain question answering. arXiv preprint arXiv:1906.00300 (2019)
- [15] Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., Deng, L.: MS MARCO: A human generated machine reading comprehension dataset (November 2016)
- [16] Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29(1), 19–51 (2003)
- [17] Robertson, S.: Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation 60(5), 503–520 (2004)
- [18] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NIPS. pp. 5998–6008 (2017)
- [19] Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., Dean, J.: Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016)
- [20] Xiong, L., Xiong, C., Li, Y., Tang, K.F., Liu, J., Bennett, P., Ahmed, J., Overwijk, A.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808 (2020)
- [21] Zhao, L., Callan, J.: Term necessity prediction. In: Huang, J., Koudas, N., Jones, G.J.F., Wu, X., Collins-Thompson, K., An, A. (eds.) Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010. pp. 259–268. ACM (2010)