Causal Retrieval with Semantic Consideration
Abstract
Recent advancements in large language models (LLMs) have significantly enhanced the performance of conversational AI systems. To extend their capabilities to knowledge-intensive domains such as the biomedical and legal fields, where accuracy is critical, LLMs are often combined with information retrieval (IR) systems to generate responses based on retrieved documents. However, for IR systems to effectively support such applications, they must go beyond simple semantic matching and accurately capture diverse query intents, including causal relationships. Existing IR models primarily focus on retrieving documents based on surface-level semantic similarity, overlooking deeper relational structures such as causality. To address this, we propose Cawai, a retrieval model trained with dual objectives: semantic and causal relations. Our extensive experiments demonstrate that Cawai outperforms various models on diverse causal retrieval tasks, especially under large-scale retrieval settings. We also show that Cawai exhibits strong zero-shot generalization across scientific-domain QA tasks.
Hyunseo Shin, University of Seoul ([email protected]); Wonseok Hwang, University of Seoul ([email protected])
1 Introduction
With recent advancements in large language models (LLMs), it has become standard practice to enhance their performance via retrieval-augmented generation (RAG). In RAG systems, the document retriever plays a critical role, as providing relevant documents for a given query is essential for generating correct answers. Indeed, a recent study analyzing the performance of RAG systems in the legal domain shows that around 40–50% of hallucinations originate from failures in the document retrieval step (Magesh et al., 2024).
The goal of information retrieval (IR) systems can be roughly defined as providing “relevant” information (documents) for given queries. However, the notion of “relevance” often encompasses various aspects (van Opijnen and Santos, 2017). For instance, if users want to find similar legal cases, “relevance” may indicate semantic similarity. On the other hand, if users want to investigate the consequences of specific events, the retriever may need to find documents that describe the causal consequences of the “cause” mentioned in the query. Ye et al. (2024) likewise categorize queries based on structure and intent, emphasizing the importance of context for understanding user intent.
However, many existing IR systems rely primarily on semantic similarity between texts to retrieve information. This can limit their ability to find relevant information that hinges on other relations, such as causality. For example, on the e-CARE dataset, a dataset for causal reasoning (Du et al., 2022), we observe that when the retrieval corpus is limited to the test set, the widely adopted dense passage retrieval (DPR) method (Karpukhin et al., 2020) effectively retrieves the next causal sentence. However, when the retrieval pool is expanded to include sentences from Wikipedia, simulating a real-world setting, the model often retrieves sentences based on semantic similarity rather than true causal relationships. For instance, given the query “An explosion of Sulfides occurred in the factory.”, the correct effect is “The workers were all injured due to eye irritation to suffocation.”. Yet when retrieving from this expanded pool, DPR selects “On 22 February 2003, one of the production facilities caught fire and was badly damaged.”, indicating a shift toward generic semantic similarity.
A manual analysis of 50 randomly sampled cases in which DPR retrieved an incorrect passage from the expanded pool shows that approximately 44% of errors result from retrieval based on semantic similarity rather than on the causal clues present in the query. This highlights a key limitation of existing retrieval models in causal tasks.
Based on this observation, we propose Cawai (Causality AWAre dense retrIever), a novel method for training a causal dense retriever. By incorporating dual constraints, a causal loss and a semantic loss, Cawai enables accurate retrieval of cause or effect sentences that are causally related to the input query. Evaluation results show that Cawai significantly outperforms existing baselines, such as BM25, DPR, and GTR (Ni et al., 2022), in causal retrieval, causal QA, and scientific QA tasks, while maintaining comparable performance on general QA tasks (Tables 2–5).
In summary, our contributions are as follows.

- We propose Cawai, a dense retriever specialized in causal tasks.
- Cawai achieves significantly better performance compared to existing dense retrieval baselines.
- We will open-source our work.
2 Related Work
2.1 Information Retrieval
Traditional keyword-based information retrieval methods like BM25 (Robertson and Zaragoza, 2009) rely heavily on lexical overlap between queries and documents, which limits their ability to capture deeper semantic relationships.
Karpukhin et al. (2020) address this limitation by proposing DPR that leverages a language model to convert queries and documents into dense vector representations, allowing them to incorporate semantic information beyond simple keyword matching.
Recent studies on dense retrievers show that leveraging LLMs can improve retrieval accuracy. Lee et al. (2024) show the potential of distilling knowledge from LLMs to create compact yet versatile text embedding models. Luo et al. (2024) demonstrate through comprehensive experiments that LLM-based dense retrievers significantly outperform traditional models.
Erker et al. (2024) introduce Triple-Encoders to compute distributed utterance representations. Their method encodes each sentence independently and creates contextualized embeddings by linearly combining representations from multiple subspaces. Similarly, our method also uses three encoders, each specializing in capturing a different aspect of the causal relationship between sentences.
2.2 Causal Relationship Identification
Recent works on causal discovery with LLMs focus on identifying cause-effect relationships by leveraging causal graphs. Zečević et al. (2023) introduce a framework for causal discovery in which LLMs can return causal graphs through conditional independence statements. Similarly, Zhang et al. (2024) introduce a RAG-based approach for causal graph recovery, dynamically retrieving domain-specific text chunks and inferring relationships between factors using LLMs. While these methods offer insights for post-hoc causal analysis, they apply causal reasoning only after retrieval. In contrast, this work incorporates causal cues directly into retrieval, enabling the model to identify causal relationships at an early stage.
3 Methods
3.1 Model architecture
(Figure 1: Overview of Cawai, consisting of the Cause Encoder, Effect Encoder, and frozen Semantic Encoder together with the associated loss terms.)
Cawai utilizes three encoders: a Cause Encoder $E_C$, an Effect Encoder $E_E$, and a Semantic Encoder $E_S$ (Figure 1). The Cause Encoder processes the text of a cause event (e.g., “Tom really has no energy to run.”), denoted as $c$, generating a vector representation $E_C(c)$. Similarly, the Effect Encoder processes the text of the effect event (e.g., “He takes a rest before running again.”), $e$, corresponding to $c$, producing an encoded representation $E_E(e)$. The Semantic Encoder, whose weights are frozen during training, processes the cause event and the effect event individually and outputs vector representations $E_S(c)$ and $E_S(e)$.
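To make the architecture concrete, the following is a minimal sketch of the three-encoder setup; it assumes BERT-style Hugging Face encoders with [CLS] pooling, and the actual checkpoints, pooling strategy, and implementation details of Cawai may differ.

```python
# Minimal sketch (not the authors' code): two trainable encoders and one frozen
# Semantic Encoder, all initialized from the same pretrained checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

class CawaiEncoders(torch.nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.cause_encoder = AutoModel.from_pretrained(model_name)     # trainable, E_C
        self.effect_encoder = AutoModel.from_pretrained(model_name)    # trainable, E_E
        self.semantic_encoder = AutoModel.from_pretrained(model_name)  # frozen, E_S
        for p in self.semantic_encoder.parameters():
            p.requires_grad = False

    def _encode(self, encoder, texts):
        batch = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return encoder(**batch).last_hidden_state[:, 0]  # [CLS] vector per sentence

    def forward(self, causes, effects):
        v_c = self._encode(self.cause_encoder, causes)        # E_C(c)
        v_e = self._encode(self.effect_encoder, effects)      # E_E(e)
        with torch.no_grad():                                  # Semantic Encoder stays fixed
            s_c = self._encode(self.semantic_encoder, causes)  # E_S(c)
            s_e = self._encode(self.semantic_encoder, effects) # E_S(e)
        return v_c, v_e, s_c, s_e
```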
3.2 Training
The Cause Encoder is trained to map an input cause event (text) to its corresponding effect event (vector), thereby learning the cause-to-effect relationship. Conversely, the Effect Encoder is trained to map the effect event (text) back to the cause event (vector), learning the effect-to-cause relationship. The Semantic Encoder remains fixed during training, while semantic preservation losses ($\mathcal{L}_{\text{sem},c}$ and $\mathcal{L}_{\text{sem},e}$ in Figure 1) ensure that the encoded representations $E_C(c)$ and $E_E(e)$ stay close to their original cause and effect events. This alignment preserves contextual nuances and maintains semantic consistency during training.
In-batch Negative Sampling
We use in-batch negative sampling across all three encoders (Cause Encoder, Effect Encoder, and Semantic Encoder). For the Cause Encoder, given the cause event $c_i$ and its corresponding effect event $e_i$ of the $i$-th example in a batch of size $B$, we treat the effect events $\{e_j\}_{j \neq i}$ of the other pairs in the batch as negatives. The resulting loss function can be written as
$$\mathcal{L}_{\text{cause}} = -\sum_{i=1}^{B} \log \frac{\exp\big(\mathrm{sim}(E_C(c_i), E_S(e_i))\big)}{\sum_{j=1}^{B} \exp\big(\mathrm{sim}(E_C(c_i), E_S(e_j))\big)} \quad (1)$$
Here, $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity between the output of the Cause Encoder ($E_C(c_i)$) and the output of the Semantic Encoder ($E_S(e_j)$).
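As a sketch, the in-batch negative sampling objective of Eq. (1) can be computed as a cross-entropy over the batch similarity matrix; dot-product similarity is assumed here, and the function name is ours.

```python
# Sketch of Eq. (1): the diagonal entries of the similarity matrix are the positive
# pairs, and the other in-batch targets act as negatives.
import torch
import torch.nn.functional as F

def in_batch_nce(queries: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """queries: (B, d) trainable encoder outputs; targets: (B, d) Semantic Encoder outputs."""
    sim = queries @ targets.T                                   # (B, B) similarity matrix
    labels = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(sim, labels)                         # -log softmax of the diagonal
```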
Similarly, for the Effect Encoder, given an effect event $e_i$, we define negative causes $\{c_j\}_{j \neq i}$ sampled from the batch, forcing the model to map effect-to-cause more accurately by distinguishing the true cause from irrelevant cause events.
$$\mathcal{L}_{\text{effect}} = -\sum_{i=1}^{B} \log \frac{\exp\big(\mathrm{sim}(E_E(e_i), E_S(c_i))\big)}{\sum_{j=1}^{B} \exp\big(\mathrm{sim}(E_E(e_i), E_S(c_j))\big)} \quad (2)$$
In addition to the two losses above, we introduce semantic losses in which the output of the Cause Encoder ($E_C(c_i)$) is contrasted with the Semantic Encoder's representation of the same cause sentence ($E_S(c_i)$). Likewise, the Effect Encoder's output ($E_E(e_i)$) is compared against its original effect sentence ($E_S(e_i)$) from the Semantic Encoder, ensuring that the outputs stay semantically close to their respective inputs.
$$\mathcal{L}_{\text{sem},c} = -\sum_{i=1}^{B} \log \frac{\exp\big(\mathrm{sim}(E_C(c_i), E_S(c_i))\big)}{\sum_{j=1}^{B} \exp\big(\mathrm{sim}(E_C(c_i), E_S(c_j))\big)} \quad (3)$$
We apply the same negative sampling technique to the effect events with the Semantic Encoder, yielding the analogous loss $\mathcal{L}_{\text{sem},e}$.
The total loss is then computed as follows:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cause}} + \mathcal{L}_{\text{effect}} + \lambda\big(\mathcal{L}_{\text{sem},c} + \mathcal{L}_{\text{sem},e}\big) \quad (4)$$

where $\lambda$ is the weight of the semantic loss terms (Section 5.5).
This final loss ensures that the cause-to-effect and effect-to-cause mappings are learned effectively, while also preserving the semantic consistency of the original inputs. The semantic loss terms impose dual constraints on each vector representation, which helps regularize the model. For instance, $E_C(c_i)$ should encode information about the corresponding effect text ($e_i$) while preserving its own representation (the cause text, $c_i$). We assume that preserving semantic similarity may help, since a keyword-based retrieval algorithm such as BM25 can sometimes retrieve an answer for a given question.
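Putting the four terms together, a sketch of Eq. (4) using the `in_batch_nce` helper above might look as follows; `lambda_sem` corresponds to the semantic loss weight studied in Section 5.5, and the specific value used by the authors is not reproduced here.

```python
# Sketch of the total loss of Eq. (4); v_c, v_e, s_c, s_e are the encoder outputs
# from the architecture sketch above.
import torch

def cawai_loss(v_c, v_e, s_c, s_e, lambda_sem: float) -> torch.Tensor:
    loss_cause = in_batch_nce(v_c, s_e)    # Eq. (1): cause text -> effect vector
    loss_effect = in_batch_nce(v_e, s_c)   # Eq. (2): effect text -> cause vector
    loss_sem_c = in_batch_nce(v_c, s_c)    # Eq. (3): stay close to the own cause sentence
    loss_sem_e = in_batch_nce(v_e, s_e)    # analogous semantic term for the effect side
    return loss_cause + loss_effect + lambda_sem * (loss_sem_c + loss_sem_e)
```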
4 Experiments
4.1 Datasets
4.1.1 e-CARE Evaluation
We utilize the e-CARE (Du et al., 2022) and BCOPA-CE (Han and Wang, 2021) datasets. The e-CARE dataset is split into training, validation, and test sets (6:1:1). BCOPA-CE is used only for training and validation, consisting of 500 triplets of <cause, premise, effect>, which we transform into 1,000 cause-effect pairs ((cause, premise), (premise, effect)). To prevent data leakage, pairs from the same triplet remain in the same split. The final dataset consists of 13,692 training, 2,232 validation, and 2,136 test examples.
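As a concrete illustration of this preprocessing, the sketch below converts BCOPA-CE <cause, premise, effect> triplets into the two cause-effect pairs and keeps both pairs of a triplet in the same split; the dictionary field names are illustrative assumptions, while the 9:1 train/validation ratio follows Table 1.

```python
# Sketch of the BCOPA-CE transformation: each triplet yields (cause, premise) and
# (premise, effect) pairs, and both pairs inherit the triplet's split to avoid leakage.
import random

def triplets_to_pairs(triplets, val_ratio=0.1, seed=0):
    rng = random.Random(seed)
    rng.shuffle(triplets)
    n_val = int(len(triplets) * val_ratio)
    splits = {"validation": triplets[:n_val], "train": triplets[n_val:]}
    pairs = {split: [] for split in splits}
    for split, items in splits.items():
        for t in items:
            pairs[split].append({"cause": t["cause"], "effect": t["premise"]})
            pairs[split].append({"cause": t["premise"], "effect": t["effect"]})
    return pairs
```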
For retrieval evaluation, we construct retrieval pools using large-scale text corpora from Wikipedia (https://huggingface.co/datasets/wikimedia/wikipedia) and RedPajama-Data-v2 (Computer, 2023; https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) to simulate real-world retrieval scenarios. We prepare retrieval pools of varying size, ranging from 2 million sentences (XL) to 20 million sentences.
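The paper does not specify the nearest-neighbor machinery behind these large pools; as one possible setup, the sketch below indexes precomputed sentence embeddings with a flat FAISS inner-product index and retrieves the top-k pool sentences per query.

```python
# Illustrative indexing/search sketch (assumption: FAISS with exact inner-product search).
import faiss
import numpy as np

def build_pool_index(embeddings: np.ndarray) -> faiss.Index:
    """embeddings: (N, d) matrix for the 2M-20M pool sentences."""
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings.astype(np.float32))
    return index

def retrieve(index: faiss.Index, query_vecs: np.ndarray, k: int = 10):
    scores, ids = index.search(query_vecs.astype(np.float32), k)
    return scores, ids  # top-k pool sentence ids per query
```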
4.1.2 Causal QA Evaluation
To evaluate Cawai on real-world causal question answering, we conduct experiments using the retrieval environments defined in Causal QA (Bondarenko et al., 2022). Specifically, we use MS MARCO (Bajaj et al., 2018), Natural Questions (Kwiatkowski et al., 2019), SQuAD v2.0 (Rajpurkar et al., 2018), and HotpotQA (Yang et al., 2018) for training, validation, and testing (Table 1). For training, we merge all training sets, resulting in 23,165 training, 2,895 validation, and 2,895 test samples.
Table 1: Dataset statistics.

| Dataset (% of Total) | Train | Validation | Test |
|---|---|---|---|
| e-CARE | 12,792 | 2,132 | 2,136 |
| BCOPA-CE | 900 | 100 | – |
| HotpotQA (0.4%) | 312 | 39 | 39 |
| MS MARCO (2.5%) | 19,318 | 2,415 | 2,415 |
| SQuAD v2.0 (2.3%) | 2,567 | 321 | 321 |
| Natural Questions (0.4%) | 968 | 120 | 120 |
4.2 Experimental Setups
4.2.1 Baseline
We used BM25 (Robertson and Zaragoza, 2009), DPR (Karpukhin et al., 2020), and GTR (Ni et al., 2022) as baseline models. We trained BERT-base-uncased, GTR-Base, and LLaMA-1.0B as the encoders for Cawai-DPR, Cawai-GTR, and Cawai-LLaMA, respectively. We selected the checkpoint with the highest accuracy on the validation set after 500 epochs. The batch size was set to 64 and the learning rate to 1e-5, using the AdamW optimizer. Experiments were conducted on NVIDIA A6000 or 3090 GPUs.
All models are initialized with the same weights for fair comparison: BERT-Base-Uncased (https://huggingface.co/google-bert/bert-base-uncased), GTR-Base (https://huggingface.co/sentence-transformers/gtr-t5-base), and LLaMA-1.0B (https://huggingface.co/knowledgator/Llama-encoder-1.0B). LLaMA-1.0B is trained following the same methodology as GTR, with LoRA adapters (Hu et al., 2022) applied for efficient fine-tuning.
For the experiments on general-domain QA tasks (Section 5.4), we trained Cawai-DPR under the same conditions as Karpukhin et al. (2020) for fair comparison: a batch size of 128, the English Wikipedia dump from December 2018 as the retrieval pool, and BM25 for negative sampling. We also did not employ a reader model, instead using simple fuzzy matching as a relaxed metric, focusing on retrieval accuracy.
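Since the exact matching criterion is not described, the following sketch illustrates one plausible form of this relaxed metric: a retrieved passage counts as a hit if it contains a near-match of any gold answer. The rapidfuzz library and the 80-point threshold are our assumptions, not the authors' implementation.

```python
# Relaxed fuzzy-matching hit criterion (illustrative, not the paper's exact setup).
from rapidfuzz import fuzz

def fuzzy_hit(passage: str, answers: list[str], threshold: float = 80.0) -> bool:
    # partial_ratio scores the best-matching substring of the passage against the answer
    passage = passage.lower()
    return any(fuzz.partial_ratio(ans.lower(), passage) >= threshold for ans in answers)

def hit_at_k(retrieved_passages: list[str], answers: list[str], k: int = 1) -> bool:
    # Hit@k under the relaxed metric: any of the top-k passages fuzzily contains an answer.
    return any(fuzzy_hit(p, answers) for p in retrieved_passages[:k])
```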
5 Results
5.1 Causal Retrieval Tasks
Table 2: Causal retrieval performance on e-CARE (each cell shows Hit@1 / Hit@10 / MRR@10). XL indicates that the retrieval pool is augmented with 2 million sentences from English Wikipedia or RedPajama.

Task 1. Cause to Effect

| Model | e-CARE | e-CARE + Wikipedia (XL) | e-CARE + RedPajama (XL) |
|---|---|---|---|
| BM25 | 8.9 / 21.8 / 12.7 | 4.6 / 8.5 / 5.8 | 4.9 / 9.3 / 6.1 |
| DPR | 36.3 / 66.0 / 45.5 | 16.0 / 28.2 / 19.4 | 13.7 / 25.8 / 17.3 |
| Cawai-DPR | 36.8 / 63.7 / 45.1 | 20.4 / 32.1 / 23.7 | 18.1 / 29.4 / 21.3 |
| GTR | 42.2 / 71.8 / 51.9 | 20.5 / 39.0 / 26.1 | 18.1 / 35.6 / 23.0 |
| Cawai-GTR | 43.2 / 70.0 / 51.2 | 21.6 / 36.2 / 26.2 | 19.6 / 34.6 / 24.1 |
| LLaMA-1.0B (LoRA) | 14.9 / 35.3 / 20.7 | 2.3 / 6.4 / 3.4 | 1.9 / 5.8 / 2.9 |
| Cawai-LLaMA-1.0B (LoRA) | 23.5 / 51.0 / 31.6 | 3.1 / 9.0 / 4.8 | 3.6 / 9.5 / 5.1 |

Task 2. Effect to Cause

| Model | e-CARE | e-CARE + Wikipedia (XL) | e-CARE + RedPajama (XL) |
|---|---|---|---|
| BM25 | 9.4 / 20.6 / 12.6 | 4.9 / 8.9 / 6.0 | 4.6 / 9.2 / 5.9 |
| DPR | 39.4 / 67.8 / 47.8 | 17.3 / 32.4 / 21.9 | 14.4 / 28.8 / 18.6 |
| Cawai-DPR | 37.6 / 64.2 / 45.7 | 22.3 / 35.5 / 26.4 | 20.9 / 34.0 / 24.8 |
| GTR | 42.8 / 72.4 / 52.2 | 21.8 / 39.5 / 27.1 | 18.3 / 36.0 / 23.5 |
| Cawai-GTR | 41.9 / 68.6 / 50.0 | 21.5 / 38.5 / 26.7 | 20.1 / 36.3 / 24.7 |
| LLaMA-1.0B (LoRA) | 15.1 / 35.4 / 20.8 | 3.3 / 9.7 / 5.1 | 3.0 / 8.9 / 4.8 |
| Cawai-LLaMA-1.0B (LoRA) | 22.1 / 50.7 / 30.5 | 4.4 / 12.3 / 6.5 | 4.6 / 12.4 / 6.7 |
We reformulate the e-CARE causal reasoning dataset into two causal retrieval tasks. In Task 1, the model retrieves the effect sentence given a cause query, while in Task 2, it retrieves the cause sentence given an effect query. Our results show that Cawai achieves performance comparable to two strong baselines, DPR and GTR, when the retrieval pool is small (2,136 sentences; the e-CARE columns of Table 2).
Next, we examine how performance changes when the size of the retrieval pool increases by adding 2 million sentences (XL) sampled from English Wikipedia or RedPajama. The results show that in most cases, Cawai outperforms the two baselines (the augmented-pool columns of Table 2).
Notably, although GTR demonstrates slightly better performance in Task 2 even with the Wikipedia-augmented XL pool, its performance declines when the retrieval pool increases to 20 million sentences. In this setting, GTR achieves 12.9% Hit@1, 25.3% Hit@10, and 16.6% MRR@10, while Cawai surpasses it with scores of 14.6%, 26.0%, and 17.8%. Similarly, in Task 1, Cawai maintains a performance advantage (GTR: 12.1%, 24.9%, and 16.0% vs. Cawai: 12.4%, 25.3%, and 16.4%). These results highlight the enhanced generalization capability of Cawai across diverse retrieval settings. Additional experiments with LLaMA-1.0B further confirm the superiority of Cawai in causal retrieval tasks (bottom two rows of Table 2).
5.2 Causal QA Tasks
Table 3: Causal QA results (each cell shows Hit@1 / Hit@10 / MRR@10).

| Model | MS MARCO | Natural Questions | SQuAD v2.0 | HotpotQA |
|---|---|---|---|---|
| DPR | 10.6 / 38.3 / 18.4 | 2.2 / 16.0 / 4.7 | 11.2 / 24.3 / 15.0 | 3.4 / 12.8 / 5.4 |
| Cawai-DPR | 11.7 / 40.5 / 19.8 | 4.7 / 16.5 / 7.9 | 13.8 / 29.3 / 18.4 | 3.4 / 13.7 / 5.9 |
| GTR | 21.3 / 71.5 / 33.7 | 7.5 / 27.0 / 13.8 | 23.7 / 44.4 / 30.3 | 9.4 / 35.9 / 16.4 |
| Cawai-GTR | 20.6 / 59.4 / 32.2 | 11.3 / 28.9 / 16.2 | 27.1 / 48.6 / 34.1 | 12.8 / 33.3 / 19.2 |
Next, we evaluate Cawai on Causal QA tasks. Cawai again consists of three separate encoders: one for questions (corresponding to the Cause or Effect Encoder), one for documents (corresponding to the complementary Effect or Cause Encoder), and the Semantic Encoder. The results show that Cawai outperforms the two baselines in most cases (Table 3). Notably, the largest gains are observed on Natural Questions and SQuAD v2.0, where Cawai-GTR outperforms the standard GTR model by 5.5% and 3.4% in Hit@1, respectively (Table 3, GTR vs. Cawai-GTR).
5.3 Science Domain QA Tasks
Table 4: Zero-shot retrieval performance on science-domain QA datasets (each cell shows nDCG@5 / nDCG@20).

| Model | NFCorpus | SciDocs | SciFact | SciQ |
|---|---|---|---|---|
| DPR | 6.2 / 11.5 | 10.1 / 16.2 | 23.5 / 28.8 | 50.0 / 55.2 |
| Cawai-DPR | 8.7 / 13.0 | 10.7 / 18.0 | 24.5 / 29.8 | 62.5 / 66.3 |
| GTR | 6.1 / 13.6 | 20.8 / 28.8 | 38.1 / 43.7 | 71.7 / 74.5 |
| Cawai-GTR | 6.6 / 14.4 | 21.3 / 30.3 | 44.0 / 48.6 | 83.8 / 84.8 |
Following Cai et al. (2024), we evaluate the zero-shot generalization capability of dense retrievers trained on Causal QA using four science QA datasets: NFCorpus (Boteva et al., 2016), SciDocs (Cohan et al., 2020), SciFact (Wadden et al., 2020), and SciQ (Welbl et al., 2017). As shown in Table 4, Cawai consistently achieves higher nDCG scores across all datasets compared to the standard dense retrievers, demonstrating its strong generalization performance.
5.4 General Domain QA tasks
Table 5: Retrieval performance on general-domain QA tasks (each cell shows Hit@1 / Hit@20 / Hit@100).

| Model | Natural Questions | SQuAD v1.1 |
|---|---|---|
| DPR | 30.6 / 75.2 / 86.3 | 26.7 / 67.7 / 81.9 |
| Cawai-DPR | 33.9 / 74.8 / 85.8 | 25.1 / 65.7 / 80.5 |
| DPR+BM25 | 35.1 / 74.7 / 86.7 | 27.1 / 69.4 / 83.7 |
| Cawai+BM25 | 36.6 / 75.9 / 86.1 | 25.1 / 65.9 / 81.2 |
Finally, to evaluate Cawai on general QA tasks, we conduct experiments on the Natural Questions and SQuAD v1.1 Rajpurkar et al. (2016) datasets. Cawai achieves performance comparable to DPR but without a clear advantage (Table 5). This suggests that the effectiveness of Cawai is more pronounced in Causal QA scenarios rather than general QA tasks.
5.5 Ablation study
Effects of Semantic Loss
Table 6: Effect of the semantic loss weight λ in Eq. (4) (each cell shows Hit@1 / Hit@10 / MRR@10).

| λ | Task 1: e-CARE | Task 1: e-CARE + Wikipedia (XL) | Task 2: e-CARE | Task 2: e-CARE + Wikipedia (XL) |
|---|---|---|---|---|
| 0 (w/o semantic loss) | 32.4 / 60.4 / 40.9 | 11.5 / 24.5 / 15.3 | 33.4 / 61.3 / 41.8 | 14.5 / 28.0 / 18.4 |
| – | 37.2 / 63.6 / 45.3 | 20.4 / 32.5 / 24.0 | 38.7 / 63.6 / 46.1 | 22.2 / 34.6 / 25.9 |
| – | 36.8 / 63.7 / 45.1 | 20.4 / 32.1 / 23.7 | 37.6 / 64.2 / 45.7 | 22.3 / 35.5 / 26.4 |
| – | 38.3 / 64.3 / 46.1 | 22.8 / 33.9 / 26.0 | 38.8 / 64.3 / 46.4 | 23.0 / 35.6 / 26.8 |
| – | 38.5 / 64.7 / 46.3 | 22.3 / 34.1 / 25.8 | 39.1 / 64.5 / 46.6 | 24.1 / 36.5 / 27.6 |
Next, we examine how the performance of Cawai depends on the weight λ of the semantic loss in Eq. (4). Integrating the semantic loss results in a noticeable improvement in accuracy (Table 6, row 1 vs. row 2), highlighting its importance. Beyond that, no clear improvements were observed (rows 3–5), and we therefore use the setting of row 3 in all other experiments.
6 Analysis
(Figure 2: t-SNE visualization of BCOPA-CE validation embeddings: (a) before fine-tuning, (b) DPR after training, (c) Cawai after training.)
We visualize the embeddings of BCOPA-CE validation set using the t-SNE method. In Cawai, causes are encoded by the Cause Encoder, while premises are encoded by the Effect Encoder. Before fine-tuning, the embeddings are randomly scattered (Figure 2a). After training, DPR embeddings show that cause and effect remain separated without shared semantics (b), indicating its limitation in capturing causal relationships. Unlike QA tasks, causal retrieval requires understanding temporal and logical dependencies. On the other hand, Cawai maps each cause-effect pair closely (c), demonstrating its ability to learn meaningful causal embeddings.
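For reference, a sketch of this visualization step is given below; it assumes the cause and premise vectors have already been produced by the trained Cause and Effect Encoders, and the t-SNE settings (perplexity, random seed) are our choices rather than the paper's.

```python
# Project Cause/Effect Encoder outputs of BCOPA-CE validation pairs into 2D with t-SNE.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_pairs(cause_vecs: np.ndarray, premise_vecs: np.ndarray, path: str = "tsne.png"):
    points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
        np.concatenate([cause_vecs, premise_vecs], axis=0)
    )
    n = cause_vecs.shape[0]
    plt.scatter(points[:n, 0], points[:n, 1], s=8, label="cause (Cause Encoder)")
    plt.scatter(points[n:, 0], points[n:, 1], s=8, label="premise (Effect Encoder)")
    plt.legend()
    plt.savefig(path, dpi=200)
```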
7 Conclusion
We proposed a novel retrieval method, Cawai, that integrates cause-effect relationships into the retrieval process. Our experiments demonstrate that Cawai outperforms two strong baselines, DPR and GTR, on causal retrieval tasks under real-world settings, causal QA, and scientific QA tasks, while achieving comparable performance on general QA tasks.
References
- Bajaj et al. (2018) Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. Ms marco: A human generated machine reading comprehension dataset. Preprint, arXiv:1611.09268.
- Bondarenko et al. (2022) Alexander Bondarenko, Magdalena Wolska, Stefan Heindorf, Lukas Blübaum, Axel-Cyrille Ngonga Ngomo, Benno Stein, Pavel Braslavski, Matthias Hagen, and Martin Potthast. 2022. Causalqa: A benchmark for causal question answering. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3296–3308.
- Boteva et al. (2016) Vera Boteva, Demian Gholipour, Artem Sokolov, and Stefan Riezler. 2016. A full-text learning to rank dataset for medical information retrieval.
- Cai et al. (2024) Fengyu Cai, Xinran Zhao, Tong Chen, Sihao Chen, Hongming Zhang, Iryna Gurevych, and Heinz Koeppl. 2024. MixGR: Enhancing retriever generalization for scientific domain through complementary granularity. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10369–10391, Miami, Florida, USA. Association for Computational Linguistics.
- Cohan et al. (2020) Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld. 2020. Specter: Document-level representation learning using citation-informed transformers. In ACL.
- Computer (2023) Together Computer. 2023. Redpajama: an open dataset for training large language models.
- Du et al. (2022) Li Du, Xiao Ding, Kai Xiong, Ting Liu, and Bing Qin. 2022. e-CARE: a new dataset for exploring explainable causal reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 432–446, Dublin, Ireland. Association for Computational Linguistics.
- Erker et al. (2024) Justus-Jonas Erker, Florian Mai, Nils Reimers, Gerasimos Spanakis, and Iryna Gurevych. 2024. Triple-encoders: Representations that fire together, wire together. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5317–5332, Bangkok, Thailand. Association for Computational Linguistics.
- Han and Wang (2021) Mingyue Han and Yinglin Wang. 2021. Doing good or doing right? exploring the weakness of commonsense causal reasoning models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 151–157, Online. Association for Computational Linguistics.
- Hu et al. (2022) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
- Lee et al. (2024) Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, and Iftekhar Naim. 2024. Gecko: Versatile text embeddings distilled from large language models. Preprint, arXiv:2403.20327.
- Luo et al. (2024) Kun Luo, Minghao Qin, Zheng Liu, Shitao Xiao, Jun Zhao, and Kang Liu. 2024. Large language models as foundations for next-gen dense retrieval: A comprehensive empirical assessment. Preprint, arXiv:2408.12194.
- Magesh et al. (2024) Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, and Daniel E. Ho. 2024. Hallucination-free? assessing the reliability of leading ai legal research tools. Preprint, arXiv:2405.20362.
- Ni et al. (2022) Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. 2022. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9844–9855, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don‘t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
- Robertson and Zaragoza (2009) Stephen E. Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: Bm25 and beyond. Found. Trends Inf. Retr., 3:333–389.
- van Opijnen and Santos (2017) Marc van Opijnen and Cristiana Santos. 2017. On the concept of relevance in legal information retrieval. Artificial Intelligence and Law, 25(1):65–87.
- Wadden et al. (2020) David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534–7550, Online. Association for Computational Linguistics.
- Welbl et al. (2017) Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. Preprint, arXiv:1707.06209.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
- Ye et al. (2024) Linhao Ye, Zhikai Lei, Jianghao Yin, Qin Chen, Jie Zhou, and Liang He. 2024. Boosting conversational question answering with fine-grained retrieval-augmentation and self-check. Preprint, arXiv:2403.18243.
- Zečević et al. (2023) Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. 2023. Causal parrots: Large language models may talk causality but are not causal. Preprint, arXiv:2308.13067.
- Zhang et al. (2024) Yuzhe Zhang, Yipeng Zhang, Yidong Gan, Lina Yao, and Chen Wang. 2024. Causal graph discovery with retrieval-augmented generation based large language models. arXiv preprint arXiv:2402.15301.