
Causal Retrieval with Semantic Consideration

Hyunseo Shin
University of Seoul
[email protected]
Wonseok Hwang
University of Seoul
[email protected]
Abstract

Recent advancements in large language models (LLMs) have significantly enhanced the performance of conversational AI systems. To extend their capabilities to knowledge-intensive domains such as the biomedical and legal fields, where accuracy is critical, LLMs are often combined with information retrieval (IR) systems to generate responses based on retrieved documents. However, for IR systems to effectively support such applications, they must go beyond simple semantic matching and accurately capture diverse query intents, including causal relationships. Existing IR models primarily focus on retrieving documents based on surface-level semantic similarity, overlooking deeper relational structures such as causality. To address this, we propose Cawai, a retrieval model trained with dual objectives: semantic and causal relations. Our extensive experiments demonstrate that Cawai outperforms various models on diverse causal retrieval tasks, especially under large-scale retrieval settings. We also show that Cawai exhibits strong zero-shot generalization across scientific domain QA tasks.

1 Introduction

With recent advancements in large language models (LLMs), it has become standard practice to enhance their performance via retrieval-augmented generation (RAG). In RAG systems, the document retriever plays a critical role, as it must provide relevant documents for given queries so that correct answers can be generated. Indeed, a recent study analyzing the performance of RAG systems in the legal domain shows that around 40–50% of hallucinations originate from failures in the document retrieval step (Magesh et al., 2024).

The goal of information retrieval (IR) systems can be roughly described as providing “relevant” information (documents) for given queries. However, the notion of “relevance” often encompasses various aspects (van Opijnen and Santos, 2017). For instance, if users want to find similar legal cases, “relevance” may indicate semantic similarity. On the other hand, if users want to investigate the consequences of specific events, the retriever may need to find documents that describe the causal consequences of the “cause” mentioned in the queries. Along these lines, Ye et al. (2024) categorize queries based on structure and intent, emphasizing the importance of context for understanding user intent.

However, many existing IR systems primarily rely on semantic similarity between texts to retrieve information. This approach can limit their ability to find relevant information involving, for example, causal relationships. In the e-CARE dataset, a benchmark for causal reasoning (Du et al., 2022), we observe that when the retrieval corpus is limited to the test set, the widely adopted dense passage retrieval (DPR) method (Karpukhin et al., 2020) effectively retrieves the next causal sentence. However, when the retrieval pool is expanded with sentences from Wikipedia, simulating a real-world setting, the model often retrieves sentences based on semantic similarity rather than true causal relationships. For instance, given the query “An explosion of Sulfides occurred in the factory.”, the correct effect is “The workers were all injured due to eye irritation to suffocation.” Yet when retrieving from $\text{wiki}_{XL}$, DPR selects “On 22 February 2003, one of the production facilities caught fire and was badly damaged.”, indicating a shift toward generic semantic similarity.

A manual analysis of 50 randomly sampled cases where DPR retrieved an incorrect passage from $\text{wiki}_{XL}$ shows that approximately 44% of errors result from retrieval based on semantic similarity rather than causal clues present in the query. This highlights a key limitation of existing retrieval models in causal tasks.

Based on this observation, we propose Cawai (Causality AWAre dense retrIever), a novel method for training a causal dense retriever. By incorporating dual constraints, a causal loss and a semantic loss, Cawai enables the accurate retrieval of cause or effect sentences that are causally related to the input query. Evaluation results show that Cawai significantly outperforms existing baselines, such as BM25, DPR, and GTR (Ni et al., 2022), in causal retrieval, causal QA, and scientific QA tasks, while maintaining comparable performance on general QA tasks (Tables 2–5).

In summary, our contributions are as follows.

  • We propose Cawai, a dense retriever specialized in causal tasks.

  • Cawai achieves significantly better performance compared to existing dense retrieval baselines.

We will open-source our work.

2 Related Work

2.1 Information Retrieval

Traditional keyword-based information retrieval methods like BM25 (Robertson and Zaragoza, 2009) rely heavily on lexical overlap between queries and documents, which limits their ability to capture deeper semantic relationships.

Karpukhin et al. (2020) address this limitation by proposing DPR that leverages a language model to convert queries and documents into dense vector representations, allowing them to incorporate semantic information beyond simple keyword matching.

Recent studies on dense retrievers show that leveraging LLMs can improve retrieval accuracy. Lee et al. (2024) show the potential of distilling knowledge from LLMs to create compact yet versatile text embedding models. Luo et al. (2024) demonstrate that LLM-based dense retrievers significantly outperform traditional models through comprehensive experiments.

Erker et al. (2024) introduce Triple-Encoders for computing distributed utterance representations. The method encodes each sentence independently and creates contextualized embeddings by linearly combining representations from multiple subspaces. Similarly, our method also uses three encoders, each specializing in capturing different aspects of causal relationships between sentences.

2.2 Causal Relationship Identification

Recent works in causal discovery with LLMs focus on identifying cause-effect relationships by leveraging causal graphs. Zečević et al. (2023) introduce a framework for causal discovery in which LLMs can return causal graphs through conditional independence statements. Similarly, Zhang et al. (2024) introduce a RAG-based approach for causal graph recovery, dynamically retrieving domain-specific text chunks and inferring relationships between factors using LLMs. While these methods offer insights for post-hoc causal analysis, they apply causal reasoning only after retrieval. In contrast, this work incorporates causal cues directly into retrieval, enabling the model to identify causal relationships at an early stage.

3 Methods

3.1 Model architecture

Figure 1: Architecture of Cawai. Cawai comprises three encoders: the Cause Encoder, the Semantic Encoder, and the Effect Encoder.

Cawai utilizes three encoders: the Cause Encoder, the Effect Encoder, and the Semantic Encoder (Figure 1). The Cause Encoder processes the text of a cause event (e.g., “Tom really has no energy to run.”), denoted as $e_1$, and generates a vector representation $e_1'$. Similarly, the Effect Encoder processes the text of the corresponding effect event (e.g., “He takes a rest before running again.”), denoted as $e_2$, and produces an encoded representation $e_2'$. The Semantic Encoder, whose weights are frozen during training, processes the cause event $e_1$ and the effect event $e_2$ individually and outputs vector representations $e_1''$ and $e_2''$.
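
For concreteness, the three-encoder layout can be sketched as follows. This is a minimal PyTorch sketch assuming BERT-style encoders from Hugging Face Transformers and [CLS] pooling; the model name, pooling choice, and class names are illustrative rather than the exact implementation.

```python
# Minimal sketch of the Cawai encoder layout (illustrative, not the authors' exact code).
import torch
from transformers import AutoModel, AutoTokenizer

class CawaiEncoders(torch.nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.cause_encoder = AutoModel.from_pretrained(model_name)     # trainable
        self.effect_encoder = AutoModel.from_pretrained(model_name)    # trainable
        self.semantic_encoder = AutoModel.from_pretrained(model_name)  # frozen
        for p in self.semantic_encoder.parameters():
            p.requires_grad = False

    def _embed(self, encoder, texts):
        batch = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling (one common choice)

    def forward(self, cause_texts, effect_texts):
        e1_prime = self._embed(self.cause_encoder, cause_texts)    # e1'
        e2_prime = self._embed(self.effect_encoder, effect_texts)  # e2'
        with torch.no_grad():
            e1_dprime = self._embed(self.semantic_encoder, cause_texts)   # e1''
            e2_dprime = self._embed(self.semantic_encoder, effect_texts)  # e2''
        return e1_prime, e2_prime, e1_dprime, e2_dprime
```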

3.2 Training

The Cause Encoder is trained to map an input cause event $e_1$ (text) to its corresponding effect event $e_2''$ (vector), thereby learning the cause-to-effect relationship. Conversely, the Effect Encoder is trained to map the effect event $e_2$ (text) back to the cause event $e_1''$ (vector), learning the effect-to-cause relationship. The Semantic Encoder remains fixed during training, while a semantic preservation loss ($\text{loss}_{\text{sem},c}$ in Figure 1) ensures that the encoded representations $e_1'$ and $e_2'$ stay close to their original cause and effect events. This alignment preserves contextual nuances and maintains semantic consistency during training.

In-batch Negative Sampling

We use in-batch negative sampling across all three encoders (Cause Encoder, Effect Encoder, and Semantic Encoder). For the Cause Encoder, given the cause event $e_1(i)$ and its corresponding effect event $e_2(i)$ of the $i$-th example, we define a set of negative effects $N(e_2(i))$ sampled from $\{e_2(j) \mid j \neq i\}$ over all pairs in the batch. The resulting loss function can be written as

$$\text{loss}_{c}(e_1(i), e_2(i)) = -\log\frac{\exp\big(s(e_1'(i),\, e_2''(i))\big)}{\sum_{j\in\text{Batch}}\exp\big(s(e_1'(i),\, e_2''(j))\big)} \qquad (1)$$

Here, $s(e_1'(i), e_2''(i))$ denotes the similarity between the output of the Cause Encoder ($e_1'(i)$) and the output of the Semantic Encoder ($e_2''(i)$).

Similarly, for the Effect Encoder, given an effect event $e_2(i)$, we define negative causes $N(e_1(i))$ sampled from $\{e_1(j) \mid j \neq i\}$, forcing the model to map effect-to-cause more accurately by distinguishing the true cause from irrelevant cause events.

$$\text{loss}_{e}(e_1(i), e_2(i)) = -\log\frac{\exp\big(s(e_2'(i),\, e_1''(i))\big)}{\sum_{j\in\text{Batch}}\exp\big(s(e_2'(i),\, e_1''(j))\big)} \qquad (2)$$

In addition to the two losses above, we introduce semantic losses in which the output of the Cause Encoder ($e_1'$) is contrasted with the output of the Semantic Encoder for the same cause sentence ($e_1''$). Likewise, the Effect Encoder’s output $e_2'$ is compared against the Semantic Encoder’s representation of the original effect sentence ($e_2''$), ensuring the outputs stay semantically close to their respective inputs.

$$\text{loss}_{\text{sem},c}(e_1(i)) = -\log\frac{\exp\big(s(e_1'(i),\, e_1''(i))\big)}{\sum_{j\in\text{Batch}}\exp\big(s(e_1'(i),\, e_1''(j))\big)} \qquad (3)$$

We apply the same negative sampling technique to the effect events, yielding the corresponding loss $\text{loss}_{\text{sem},e}$.

The total loss is then computed as follows:

$$\text{loss}_{\text{total}} = \text{loss}_{c} + \text{loss}_{e} + \beta\,(\text{loss}_{\text{sem},c} + \text{loss}_{\text{sem},e}) \qquad (4)$$

This final loss ensures that the cause-to-effect and effect-to-cause mappings are learned effectively while preserving the semantic consistency of the original inputs. The semantic loss terms impose dual constraints on each vector representation, which helps regularize the model. For instance, $e_1'$ should encode information about the corresponding effect text ($e_2''$) while preserving the representation of its own cause text ($e_1''$). We assume that preserving semantic similarity is helpful because a keyword-based retrieval algorithm, such as BM25, can sometimes retrieve the answer for a given question.
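
A minimal sketch of the full objective (Eqns. 1–4) with in-batch negatives is given below. It assumes dot-product similarity for $s(\cdot,\cdot)$ and the embeddings produced by the encoder sketch in Section 3.1; it is illustrative rather than the exact implementation.

```python
# Sketch of the Cawai training objective (Eqns. 1-4), assuming dot-product similarity.
import torch
import torch.nn.functional as F

def in_batch_nce(queries, keys):
    """InfoNCE with in-batch negatives: the i-th key is the positive for the i-th query."""
    scores = queries @ keys.T  # (B, B) similarity matrix over the batch
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(scores, targets)

def cawai_loss(e1_prime, e2_prime, e1_dprime, e2_dprime, beta=1.0):
    loss_c = in_batch_nce(e1_prime, e2_dprime)      # cause -> effect (Eqn. 1)
    loss_e = in_batch_nce(e2_prime, e1_dprime)      # effect -> cause (Eqn. 2)
    loss_sem_c = in_batch_nce(e1_prime, e1_dprime)  # semantic preservation, cause side (Eqn. 3)
    loss_sem_e = in_batch_nce(e2_prime, e2_dprime)  # semantic preservation, effect side
    return loss_c + loss_e + beta * (loss_sem_c + loss_sem_e)  # Eqn. 4
```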

4 Experiments

4.1 Datasets

4.1.1 e-CARE Evaluation

We utilize the e-CARE (Du et al., 2022) and BCOPA-CE (Han and Wang, 2021) datasets. The e-CARE dataset is split into training, validation, and test sets (6:1:1). BCOPA-CE is used only for training and validation, consisting of 500 triplets of <cause, premise, effect>, which we transform into 1,000 cause-effect pairs ((cause, premise), (premise, effect)). To prevent data leakage, pairs from the same triplet remain in the same split. The final dataset consists of 13,692 training, 2,232 validation, and 2,136 test examples.
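
The BCOPA-CE conversion described above can be sketched as follows: 500 triplets become 1,000 cause-effect pairs, with both pairs from a triplet kept in the same split. The split ratio and field handling shown here are assumptions beyond what is stated.

```python
# Turning BCOPA-CE <cause, premise, effect> triplets into cause-effect pairs,
# keeping pairs from the same triplet in the same split (illustrative field handling).
import random

def triplets_to_pairs(triplets, val_ratio=0.1, seed=0):
    random.seed(seed)
    random.shuffle(triplets)
    n_val = int(len(triplets) * val_ratio)
    splits = {"validation": triplets[:n_val], "train": triplets[n_val:]}
    pairs = {name: [] for name in splits}
    for name, rows in splits.items():
        for cause, premise, effect in rows:
            pairs[name].append((cause, premise))   # (cause, premise) pair
            pairs[name].append((premise, effect))  # (premise, effect) pair
    return pairs
```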

For retrieval evaluation, we construct retrieval pools using large-scale text corpora from Wikipedia (https://huggingface.co/datasets/wikimedia/wikipedia) and RedPajama-Data-v2 (Computer, 2023; https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) to simulate real-world retrieval scenarios. We prepare retrieval pools of varying sizes, ranging from 2 million sentences ($XL$) to 20 million sentences ($XXL$).
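
A sketch of how such a pool can be assembled from the Wikipedia dump on the Hugging Face Hub is shown below; the dump configuration and the naive sentence splitter are assumptions, not necessarily the exact preprocessing used here.

```python
# Building an XL-scale (~2M sentence) retrieval pool from a Wikipedia dump on the Hub.
from datasets import load_dataset

def build_pool(num_sentences=2_000_000, config="20231101.en"):
    wiki = load_dataset("wikimedia/wikipedia", config, split="train", streaming=True)
    pool = []
    for article in wiki:
        for sent in article["text"].split(". "):  # naive sentence splitting (assumption)
            sent = sent.strip()
            if sent:
                pool.append(sent)
                if len(pool) >= num_sentences:
                    return pool
    return pool
```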

4.1.2 Causal QA Evaluation

To evaluate Cawai on real-world causal question answering, we conduct experiments using the retrieval environments defined in Causal QA (Bondarenko et al., 2022). Specifically, we use MS MARCO (Bajaj et al., 2018), Natural Questions (Kwiatkowski et al., 2019), SQuAD v2.0 (Rajpurkar et al., 2018), and HotpotQA (Yang et al., 2018) for training, validation, and testing (Table 1). For training, we merge all training sets, resulting in 23,165 training, 2,895 validation, and 2,895 test samples.

Dataset (% of Total) Train Validation Test
e-CARE 12,792 2,132 2,136
BCOPA-CE 900 100 -
HotpotQA (0.4%) 312 39 39
MS MARCO (2.5%) 19,318 2,415 2,415
SQuAD v2.0 (2.3%) 2,567 321 321
Natural Questions (0.4%) 968 120 120
Table 1: Distribution of causal questions across datasets.

4.2 Experimental Setups

4.2.1 Baseline

We used BM25 (Robertson and Zaragoza, 2009), DPR (Karpukhin et al., 2020), and GTR (Ni et al., 2022) as baseline models. We trained BERT-base-uncased, GTR-Base, and LLaMA-1.0B as the encoders for Cawai-DPR, Cawai-GTR, and Cawai-LLaMA, respectively. We selected the checkpoint with the highest accuracy on the validation dataset after 500 epochs. The batch size was set to 64 and the learning rate to 1e-5, using the AdamW optimizer. Experiments were conducted on NVIDIA A6000 or 3090 GPUs.

All models are initialized with the same weights for fair comparison: BERT-Base-Uncased (https://huggingface.co/google-bert/bert-base-uncased), GTR-Base (https://huggingface.co/sentence-transformers/gtr-t5-base), and LLaMA-1.0B (https://huggingface.co/knowledgator/Llama-encoder-1.0B). LLaMA-1.0B is trained following the same methodology as GTR, with LoRA adapters (Hu et al., 2022) applied for efficient fine-tuning.

In the experiments on general domain QA tasks (Section 5.4), we trained Cawai-DPR under the same conditions as Karpukhin et al. (2020) for a fair comparison: a batch size of 128, the English Wikipedia dump from December 2018 as the retrieval pool, and BM25 for negative sampling. Also, we did not employ a reader model and instead used simple fuzzy matching as a relaxed metric, focusing on retrieval accuracy.
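
One plausible realization of such a relaxed fuzzy-matching check is sketched below; the matching rule and threshold are assumptions rather than the exact criterion used.

```python
# A relaxed retrieval-accuracy check via fuzzy matching (one plausible realization;
# the matching rule and the 0.8 threshold are assumptions).
from difflib import SequenceMatcher

def fuzzy_hit(answers, retrieved_passages, threshold=0.8):
    """Count a hit if a gold answer appears (approximately) inside any retrieved passage."""
    for passage in retrieved_passages:
        p = passage.lower()
        for answer in answers:
            a = answer.lower()
            if a in p:
                return True
            # longest common substring between answer and passage, relative to answer length
            match = SequenceMatcher(None, a, p).find_longest_match(0, len(a), 0, len(p))
            if match.size / max(len(a), 1) >= threshold:
                return True
    return False
```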

5 Results

5.1 Causal Retrieval Tasks

Model Task 1. Cause to Effect Task 2. Effect to Cause
 e-CARE e-CARE + $\text{wiki}_{XL}$ e-CARE + $\text{RedPajama}_{XL}$ e-CARE e-CARE + $\text{wiki}_{XL}$ e-CARE + $\text{RedPajama}_{XL}$
 Hit@1 Hit@10 MRR@10 Hit@1 Hit@10 MRR@10 Hit@1 Hit@10 MRR@10 Hit@1 Hit@10 MRR@10 Hit@1 Hit@10 MRR@10 Hit@1 Hit@10 MRR@10
BM25 8.9 21.8 12.7 4.6 8.5 5.8 4.9 9.3 6.1 9.4 20.6 12.6 4.9 8.9 6.0 4.6 9.2 5.9
DPR 36.3 66.0 45.5 16.0 28.2 19.4 13.7 25.8 17.3 39.4 67.8 47.8 17.3 32.4 21.9 14.4 28.8 18.6
Cawai-DPR 36.8 63.7 45.1 20.4 32.1 23.7 18.1 29.4 21.3 37.6 64.2 45.7 22.3 35.5 26.4 20.9 34.0 24.8
GTR 42.2 71.8 51.9 20.5 39.0 26.1 18.1 35.6 23.0 42.8 72.4 52.2 21.8 39.5 27.1 18.3 36.0 23.5
Cawai-GTR 43.2 70.0 51.2 21.6 36.2 26.2 19.6 34.6 24.1 41.9 68.6 50.0 21.5 38.5 26.7 20.1 36.3 24.7
LLaMA-1.0B (LoRA) 14.9 35.3 20.7 2.3 6.4 3.4 1.9 5.8 2.9 15.1 35.4 20.8 3.3 9.7 5.1 3.0 8.9 4.8
Cawai-LLaMA-1.0B (LoRA) 23.5 51.0 31.6 3.1 9.0 4.8 3.6 9.5 5.1 22.1 50.7 30.5 4.4 12.3 6.5 4.6 12.4 6.7
Table 2: Accuracy comparison on e-CARE.

We reformulate the e-CARE causal reasoning dataset into two causal retrieval tasks. In Task 1, the model retrieves the effect sentence given a cause query, while in Task 2, it retrieves the cause sentence given an effect query. Our results show that Cawai achieves performance comparable to two strong baselines–DPR and GTR–when the retrieval pool is small (2,136 sentences, Table 2 rows 2–5, cols 1–3 for Task 1, cols 10–12 for Task 2).

Next, we examine how performance changes when the size of the retrieval pool increases by adding 2 million sentences ($XL$) sampled from English Wikipedia or RedPajama. The results show that in most cases, Cawai outperforms the two baselines (rows 2–5, cols 4–9 for Task 1, cols 13–18 for Task 2).

Notably, although GTR demonstrates slightly better performance in Task 2 even with $\text{wiki}_{XL}$ (rows 4–5, cols 13–15), its performance declines when the retrieval pool increases to 20 million sentences ($\text{wiki}_{XXL}$). In this setting, GTR achieves 12.9% Hit@1, 25.3% Hit@10, and 16.6% MRR@10, while Cawai surpasses it with scores of 14.6%, 26.0%, and 17.8%. Similarly, in Task 1, Cawai maintains a performance advantage (GTR: 12.1%, 24.9%, and 16.0% vs. Cawai: 12.4%, 25.3%, and 16.4%). These results highlight the enhanced generalization capability of Cawai across diverse retrieval settings. Additional experiments with LLaMA-1.0B further confirm the superiority of Cawai in causal retrieval tasks (bottom two rows of Table 2).
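
The Hit@k and MRR@10 numbers reported above follow the standard definitions and can be computed as sketched below (document-ID handling is illustrative).

```python
# Standard retrieval metrics used throughout the paper (minimal sketch).
def hit_at_k(ranked_ids, gold_id, k=10):
    """1.0 if the gold document appears in the top-k ranked results, else 0.0."""
    return float(gold_id in ranked_ids[:k])

def mrr_at_k(ranked_ids, gold_id, k=10):
    """Reciprocal rank of the gold document within the top-k, or 0.0 if absent."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0
```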

5.2 Causal QA Tasks

Model MS MARCO Natural Questions SQuAD v2.0 HotpotQA
H@1 H@10 M@10 H@1 H@10 M@10 H@1 H@10 M@10 H@1 H@10 M@10
DPR 10.6 38.3 18.4 2.2 16.0 4.7 11.2 24.3 15.0 3.4 12.8 5.4
Cawai-DPR 11.7 40.5 19.8 4.7 16.5 7.9 13.8 29.3 18.4 3.4 13.7 5.9
GTR 21.3 71.5 33.7 7.5 27.0 13.8 23.7 44.4 30.3 9.4 35.9 16.4
Cawai-GTR 20.6 59.4 32.2 11.3 28.9 16.2 27.1 48.6 34.1 12.8 33.3 19.2
Table 3: Accuracy comparison on Causal QA. The scores were calculated by averaging the results of three independent experiments with different random seeds. H@$k$ and M@$k$ stand for Hit@$k$ and MRR@$k$, respectively.

Next, we evaluate Cawai on Causal QA tasks. Cawai again consists of three separate encoders: one for questions (corresponding to the Cause or Effect Encoder), one for documents (corresponding to the Effect or Cause Encoder), and the Semantic Encoder. The results show that Cawai outperforms the two baselines in most cases (Table 3). Notably, the largest gains are observed on Natural Questions and SQuAD v2.0, where Cawai-GTR outperforms the standard GTR model by 3.8 and 3.4 points in Hit@1, respectively (rows 3–4, cols 4, 7).

5.3 Science Domain QA Tasks

Model NFCorpus SciDocs SciFact SciQ
5 20 5 20 5 20 5 20
DPR 6.2 11.5 10.1 16.2 23.5 28.8 50.0 55.2
Cawai-DPR 8.7 13.0 10.7 18.0 24.5 29.8 62.5 66.3
GTR 6.1 13.6 20.8 28.8 38.1 43.7 71.7 74.5
Cawai-GTR 6.6 14.4 21.3 30.3 44.0 48.6 83.8 84.8
Table 4: Zero-shot document retrieval performance on science domain QA (nDCG@$k$ for $k$ = 5, 20). We report the mean over three different random seeds.

Following Cai et al. (2024), we evaluate the zero-shot generalization capability of dense retrievers trained on Causal QA using four science QA datasets: NFCorpus (Boteva et al., 2016), SciDocs (Cohan et al., 2020), SciFact (Wadden et al., 2020), and SciQ (Welbl et al., 2017). As shown in Table 4, Cawai consistently achieves higher nDCG scores across all datasets compared to standard dense retrieval methods, demonstrating its strong generalization performance.

5.4 General Domain QA tasks

Model Natural Questions SQuAD v1.1
Hit@1 Hit@20 Hit@100 Hit@1 Hit@20 Hit@100
DPR 30.6 75.2 86.3 26.7 67.7 81.9
Cawai-DPR 33.9 74.8 85.8 25.1 65.7 80.5
DPR+BM25 35.1 74.7 86.7 27.1 69.4 83.7
Cawai+BM25 36.6 75.9 86.1 25.1 65.9 81.2
Table 5: Accuracy comparison on General QA. We report the mean over three different random seeds.

Finally, to evaluate Cawai on general QA tasks, we conduct experiments on the Natural Questions and SQuAD v1.1 (Rajpurkar et al., 2016) datasets. Cawai achieves performance comparable to DPR but without a clear advantage (Table 5). This suggests that the effectiveness of Cawai is more pronounced in causal QA scenarios than in general QA tasks.

5.5 Ablation study

Effects of Semantic Loss
Loss Task 1. Cause to Effect Task 2. Effect to Cause
 e-CARE e-CARE + $\text{wiki}_{XL}$ e-CARE e-CARE + $\text{wiki}_{XL}$
 H@1 H@10 M@10 H@1 H@10 M@10 H@1 H@10 M@10 H@1 H@10 M@10
$\mathcal{L}_{c}+\mathcal{L}_{e}$ 32.4 60.4 40.9 11.5 24.5 15.3 33.4 61.3 41.8 14.5 28.0 18.4
$\mathcal{L}_{c}+\mathcal{L}_{e}+0.1\,\mathcal{L}_{s}$ 37.2 63.6 45.3 20.4 32.5 24.0 38.7 63.6 46.1 22.2 34.6 25.9
$\mathcal{L}_{c}+\mathcal{L}_{e}+1\,\mathcal{L}_{s}$ 36.8 63.7 45.1 20.4 32.1 23.7 37.6 64.2 45.7 22.3 35.5 26.4
$\mathcal{L}_{c}+\mathcal{L}_{e}+2\,\mathcal{L}_{s}$ 38.3 64.3 46.1 22.8 33.9 26.0 38.8 64.3 46.4 23.0 35.6 26.8
$\mathcal{L}_{c}+\mathcal{L}_{e}+5\,\mathcal{L}_{s}$ 38.5 64.7 46.3 22.3 34.1 25.8 39.1 64.5 46.6 24.1 36.5 27.6
Table 6: Accuracy comparison of semantic loss weights on e-CARE. $\mathcal{L}_{s}$, H@$k$, and M@$k$ stand for the semantic losses defined in Eqn. 4, Hit@$k$, and MRR@$k$, respectively.

Next, we examine how the performance of Cawai depends on the weight $\beta$ of the semantic loss in Eqn. 4. Integrating the semantic loss results in a noticeable improvement in accuracy (Table 6, row 1, $\beta = 0$ vs. row 2, $\beta = 0.1$), highlighting its importance. Beyond $\beta = 1$, no clear improvement was observed (rows 3–5), and thus we set $\beta = 1$ in all other experiments.

6 Analysis

Figure 2: t-SNE visualization of the BCOPA-CE validation set for (a) BERT, (b) DPR, and (c) Cawai. Orange: cause embeddings; blue: premise (effect) embeddings. Gradations indicate shared cause-effect relationships.

We visualize the embeddings of the BCOPA-CE validation set using t-SNE. In Cawai, causes are encoded by the Cause Encoder, while premises are encoded by the Effect Encoder. Before fine-tuning, the embeddings are randomly scattered (Figure 2a). After training, DPR embeddings show that causes and effects remain separated without shared structure (Figure 2b), indicating its limitation in capturing causal relationships. Unlike standard QA tasks, causal retrieval requires understanding temporal and logical dependencies. In contrast, Cawai maps each cause-effect pair closely together (Figure 2c), demonstrating its ability to learn meaningful causal embeddings.
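
A minimal sketch of the visualization procedure behind Figure 2 is given below, assuming the cause and premise embeddings are already available as NumPy arrays; the plotting details are illustrative.

```python
# Sketch of the t-SNE visualization of cause vs. premise (effect) embeddings (Figure 2).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_cause_effect_tsne(cause_embs, premise_embs, path="tsne_bcopa_ce.png"):
    X = np.concatenate([cause_embs, premise_embs], axis=0)
    X2d = TSNE(n_components=2, random_state=0).fit_transform(X)  # project to 2D
    n = len(cause_embs)
    plt.scatter(X2d[:n, 0], X2d[:n, 1], c="tab:orange", s=8, label="cause")
    plt.scatter(X2d[n:, 0], X2d[n:, 1], c="tab:blue", s=8, label="premise (effect)")
    plt.legend()
    plt.savefig(path, dpi=200)
```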

7 Conclusion

We proposed a novel retrieval method, Cawai, that integrates cause-effect relationships into the retrieval process. Our experiments demonstrate that Cawai outperforms two strong baselines, DPR and GTR, in causal retrieval under real-world settings, causal QA, and scientific QA tasks, while achieving comparable performance on general QA tasks.

References