Causality is all you need
Abstract
In the fundamental statistics course, students are taught to remember the well-known saying: “Correlation is not Causation”. To date, statistics (i.e., correlation-based learning) has produced various successful frameworks, such as the Transformer and pre-trained large-scale models, which stack multiple parallel self-attention blocks to handle a wide range of tasks. In the causation community, however, how to build an integrated causal framework remains an untouched domain despite its excellent intervention capabilities. In this paper, we propose the Causal Graph Routing (CGR) framework, an integrated causal scheme relying entirely on intervention mechanisms to reveal the cause-effect forces hidden in data. Specifically, CGR is composed of a stack of causal layers. Each layer includes a set of parallel deconfounding blocks from different causal graphs. We combine these blocks via the proposed concept of sufficient cause, which allows the model to dynamically select the suitable deconfounding methods in each layer. CGR is implemented as stacked networks, integrating no confounder, back-door adjustment, front-door adjustment, and the probability of sufficient cause. We evaluate this framework on two classical tasks in CV and NLP. Experiments show CGR surpasses the current state-of-the-art methods on both Visual Question Answering and Long Document Classification tasks. In particular, CGR has great potential for building the “causal” pre-trained large-scale model that generalizes effectively to diverse tasks, improving machines’ comprehension of causal relationships within a broader semantic space.
1 Introduction
Correlation is not Causation.
—Karl Pearson (1857–1936)
In the fundamental statistics course, students are taught to remember the famous phrase: “Correlation is not Causation”. A classic example is the correlation between the rooster’s crow and the sunrise: while the two are highly correlated, the rooster’s crow does not cause the sunrise. However, statistics alone cannot tell us what causation truly is. Unfortunately, many data scientists focus narrowly on interpreting data without considering the limitations of their models, mistakenly believing that all causal questions can be answered solely through data analysis and clever data-mining tricks.
Nowadays, thanks to the development of carefully crafted causal models [22, 23, 3, 33, 32], the deep learning community has paid more and more attention to causation. Mathematically, causal analysis studies the dynamic nature of the distributions between variables. In statistics, we study and estimate various distributions and their model parameters from data, while in causal analysis, we study how a change in the distribution of one variable affects the distributions of other variables. Formally, such a change in a variable’s distribution is produced by the do-operation, an active intervention mechanism on the data that precisely defines the causal effect between variables.
For example, given the environment Z, we take the input X to predict the output Y, which is denoted as P(Y|do(X), Z), not P(Y|X, Z). The former represents the probability of Y after X is implemented on the pre-decision environment Z. The latter is the probability of Y when X coexists with the post-implementation environment Z; this coexistence environment may be different from the environment before the decision. In short, statistics is observing something (i.e., seeing) and estimating what will happen, whereas causal analysis is intervening (i.e., doing) and predicting what will happen [22].
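To make the seeing/doing distinction concrete, here is a minimal numeric sketch with a binary confounder Z whose distribution tables are made up for illustration; conditioning on X lets Z shift with it, while intervening on X holds Z at its prior (which is exactly the back-door adjustment introduced later):

```python
import numpy as np

# Toy conditional tables for binary Z, X, Y (hypothetical numbers):
# Z influences both the treatment X and the outcome Y.
p_z = np.array([0.5, 0.5])                 # P(Z=z)
p_x_given_z = np.array([[0.8, 0.2],        # P(X=x | Z=0)
                        [0.2, 0.8]])       # P(X=x | Z=1)
p_y_given_xz = np.array([[[0.9, 0.1], [0.6, 0.4]],   # z=0: rows x, cols y
                         [[0.5, 0.5], [0.2, 0.8]]])  # z=1: rows x, cols y

# "Seeing": P(Y=1 | X=1) conditions on X, letting Z shift with it.
joint = p_z[:, None, None] * p_x_given_z[:, :, None] * p_y_given_xz  # P(z,x,y)
p_see = joint[:, 1, 1].sum() / joint[:, 1, :].sum()

# "Doing": P(Y=1 | do(X=1)) fixes X by intervention, so Z keeps its prior.
p_do = (p_z * p_y_given_xz[:, 1, 1]).sum()

print(f"P(Y=1|X=1)     = {p_see:.3f}")   # 0.720, inflated by the confounder
print(f"P(Y=1|do(X=1)) = {p_do:.3f}")    # 0.600, the causal effect
```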
Till now, statistics has developed various successful frameworks, such as the Transformer [28] and pre-trained large-scale models [8, 27]. However, in the causation community, how to build an integrated causal framework remains an untouched domain despite its excellent intervention capabilities. In this work, we propose Causal Graph Routing (CGR), an integrated causal framework relying entirely on intervention mechanisms to reveal the cause-effect forces hidden in data.
Specifically, the causal intervention aims to mitigate the effect of confounding, a causal concept describing the spurious correlation between input and output variables [22]. Because noncausal paths are the source of confounding, we use the do-operator to control (or erase) the influence of noncausal paths, i.e., P(Y|do(X)), to deconfound X and Y. Several classical deconfounding methods are presented in Fig.2: a) No Confounder: the effect of X on Y transmits via the mediator M, i.e., X → M → Y, where no confounder exists. b) Back-door Adjustment: the observable confounder Z influences both X and Y, creating a spurious correlation between them. The path X ← Z → Y is defined as the back-door path, which is blocked by controlling for Z. c) Front-door Adjustment: the causal effect of X on Y is confounded by the unobservable confounder Z and linked by the mediator M. Furthermore, M is observable and shielded from the effects of Z. To eliminate the spurious correlation brought by Z, the front-door path, i.e., X → M → Y, is adjusted by controlling for M.
However, in several Computer Vision (CV) and Natural Language Processing (NLP) tasks, the causal intervention often requires the use of multiple deconfounding methods from different causal graphs. As shown in Fig.1, in the Visual Question Answering (VQA) task [20], to answer the question “what days might I most commonly go to this building?”, the model first detects “building” via the visual context, which could be confounded by the training data (dataset bias). To address this, a method like front-door adjustment or no confounder is necessary. Then, the model correlates the object “building” with relevant facts from an external knowledge base [26], which could be confounded by irrelevant knowledge facts (language bias). To mitigate this, back- or front-door adjustments are required for deconfounding. Hence, relying on a single deconfounding method is insufficient to fulfill the requirement of deconfounding from diverse causal graphs. The same principle applies to the Long Document Classification (LDC) task [7].
Motivated by the transformer and its variants, which stack multiple parallel self-attention blocks to handle a wide range of tasks [36, 7, 27], we propose the Causal Graph Routing (CGR) framework, in which the above-mentioned deconfounding blocks are stacked in a similar manner. Specifically, our framework is composed of a stack of causal layers. Each layer includes a set of parallel deconfounding blocks from different causal graphs. We propose the concept of sufficient cause, which provides the formal semantics for the probability that one causal graph was a sufficient cause of another. It chains together the three candidate deconfounding methods, i.e., no confounder, back-door adjustment, and front-door adjustment, to obtain the overall causal effect of X on Y. We calculate the weight of every deconfounding block to approximate the probability of sufficient cause, which allows the model to dynamically select the suitable deconfounding methods in each layer. This facilitates the formulation of a causal routing path for each example. CGR is implemented as stacked networks that we assess on two classical tasks in CV and NLP. Experiments show CGR outperforms existing state-of-the-art methods on both VQA and LDC tasks with less computation cost. Notably, CGR exhibits significant potential for building the “causal” pre-trained large-scale model, which can effectively generalize to diverse tasks and enhance machines’ understanding of causal relationships within a broader semantic space.
[Figure 1: An example from the VQA task in which answering the question requires multiple deconfounding methods from different causal graphs.]
2 Related Work
Causality. Causal inference [22] is an important component of human cognition. Thanks to the development of carefully crafted causal methodologies, causality has been extensively studied and mathematized. For example, Pearl et al. [22] propose the front- and back-door adjustments, which remove unobservable or observable confounders by blocking noncausal paths; Peng et al. [23] design a causality-driven hierarchical reinforcement learning framework; Cai et al. [3] establish an algorithm to comprehensively characterize causal effects with multiple mediators; and Jaber et al. [13] propose a new causal do-calculus for the identification of interventional distributions in Partial Ancestral Graphs (PAGs). These methods allow researchers to uncover the potential causal relationships between inputs and outputs to improve deep networks.
Applications on CV and NLP tasks. Cause-effect science is well suited to CV and NLP tasks. Examples include using the front-door adjustment to remove dataset bias in attention mechanisms [33], discovering causal visual features in the Video-QA task using the back-door adjustment [39], equipping a pre-trained language model with a knowledge-guided intervention for text concept extraction [37], generating counterfactual samples to mitigate language priors [21], and constructing deconfounded frameworks for visual grounding [12] and image captioning [32].
3 Methodology
In this section, we discuss how to design the causal graph routing (Section 3.1), how to implement it as stacked networks (Section 3.2), and how to apply it to two classical CV and NLP tasks (Section 3.3).
3.1 Causal Graph Routing
To address the need for deconfounding from diverse causal graphs, we propose the causal graph routing framework, which integrates different deconfounding blocks by calculating the probabilities of sufficient cause between causal graphs. In particular, our objective is to dynamically select (route) the suitable deconfounding methods for the given task. We consider three candidate deconfounding methods: no confounder, back-door adjustment, and front-door adjustment. For no confounder, we make the input X predict the output Y via the mediator M without any confounder, denoted as P(Y|do(X)) = \sum_m P(m|X) P(Y|m). For back-door adjustment, we cut off the link Z → X to remove the spurious correlation caused by the observable confounder Z. It measures the average causal effect of X on Y, denoted as P(Y|do(X)) = \sum_z P(Y|X, z) P(z). For front-door adjustment, we block the path X ← Z → Y by controlling for the observable mediator M, to remove the spurious correlation caused by the unobservable confounder Z, denoted as P(Y|do(X)) = \sum_m P(m|X) \sum_{x'} P(Y|x', m) P(x').
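As a minimal illustration of these three estimands, the sketch below evaluates each adjustment formula on randomly generated toy conditional tables; all cardinalities and distributions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nm, nz = 3, 4, 2  # hypothetical cardinalities of X, M, Z (Y is binary)

def norm(a, axis=-1):  # normalize into a (conditional) probability table
    return a / a.sum(axis=axis, keepdims=True)

p_z = norm(rng.random(nz))                    # P(z)
p_x = norm(rng.random(nx))                    # P(x), for the front-door sum
p_m_given_x = norm(rng.random((nx, nm)))      # P(m|x)
p_y_given_m = norm(rng.random((nm, 2)))       # P(y|m), no-confounder graph
p_y_given_xz = norm(rng.random((nx, nz, 2)))  # P(y|x,z), back-door graph
p_y_given_xm = norm(rng.random((nx, nm, 2)))  # P(y|x',m), front-door graph

def no_confounder(x):  # P(y|do(x)) = sum_m P(m|x) P(y|m)
    return p_m_given_x[x] @ p_y_given_m

def back_door(x):      # P(y|do(x)) = sum_z P(y|x,z) P(z)
    return p_z @ p_y_given_xz[x]

def front_door(x):     # P(y|do(x)) = sum_m P(m|x) sum_x' P(y|x',m) P(x')
    inner = np.einsum("k,kmy->my", p_x, p_y_given_xm)  # marginalize x'
    return p_m_given_x[x] @ inner

for estimand in (no_confounder, back_door, front_door):
    print(estimand.__name__, estimand(0))
```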
Intuitively, we can think of “how to select the suitable causal graph” as a game of building blocks. In this game, a model is required to find the reasonable building method (i.e., causal graph) using the given units X, Y, M, and Z. If the model finds one graph unsuitable, it switches its building method and considers using another graph instead. Hence, there exists a hidden relevance among these graphs. To formalize this relevance, we design the concept of sufficient cause among graphs, which provides the formal semantics for the probability that one causal graph was a sufficient cause of another graph. As shown in Fig.2(a), consider the arrow from graph G_i to graph G_j as an example, where G_i serves as a sufficient cause for G_j. We represent the propositions that G_i and G_j are in effect as g_i and g_j, respectively, while their complements are denoted as \bar{g}_i and \bar{g}_j. The probability of sufficient cause from G_i to G_j is defined as:
PS_{i→j} = P(g_{j, g_i} | \bar{g}_i, \bar{g}_j)    (1)
where PS denotes the Probability of Sufficient cause, which measures the capacity of g_i to produce g_j, and g_{j, g_i} is the counterfactual proposition that g_j would hold had g_i been enforced. Given that the term “production” suggests a change from the absence to the presence of g_i and g_j, we calculate the probability by considering situations where neither g_i nor g_j is present. In other words, PS quantifies the effect of g_i to cause g_j: the probability of g_j occurring under do(g_i), given that both g_i and g_j did not occur. Considering the sufficient causes from the other two graphs, the total effect (TE) of each graph G_i for the causal routing can be defined as:
TE_i = PS_{j→i} · PS_{k→i},  j, k ≠ i    (2)
We estimate the total effect of X on Y, i.e., P(Y|do(X)), by dynamically routing all causal graphs:
P(Y|do(X)) = \sum_{i=1}^{3} TE_i · D_i    (3)
where {D_1, D_2, D_3} refers to the set of intervention operations, i.e., the no-confounder, back-door, and front-door estimates of P(Y|do(X)) given above. Till now, we have chained together the three deconfounding methods to get the overall causal effect of X on Y. Each deconfounding method is weighted by two terms, which stand for the probabilities of sufficient cause that the other two graphs would respond to the current graph.
As shown in Fig.2(b), we define the set of chained deconfounding methods as one causal layer. We employ multiple parallel causal layers, each of which produces output values; these values are then integrated to obtain the final values for the given task.
[Figure 2: (a) The three causal graphs (no confounder, back-door adjustment, and front-door adjustment) and the sufficient causes among them; (b) the chained deconfounding blocks that form one causal layer.]
3.2 Causal Stacked Networks
In this section, we illustrate how to implement our CGR in a deep framework. In practice, we adopt the stacked networks to perform the causal routing computation, integrating no confounder, back-door adjustment, front-door adjustment, and probability of sufficient cause.
3.2.1 Block of No Confounder
This block involves two stages: 1) extract the mediator M from the input X, i.e., P(M|X), and 2) predict the outcome Y based on M, i.e., P(Y|M). We have:
P(Y|do(X)) = \sum_m P(M = m|X) P(Y|M = m)    (4)
Considering that most CV and NLP tasks are formulated as classification problems, we compute P(Y|do(X)) using a transform function g(·) and a multi-layer perceptron MLP(·). The former extracts the mediator M from the input X, while the latter outputs classification probabilities through a softmax layer:
P(Y|do(X)) = Softmax(MLP(g(X)))    (5)
We use the classical attention layer to calculate g(X) = Attention(Q, K, V) = Softmax(QK^T/√d)V, where the queries Q, keys K, and values V all come from the input X, and d is the dimension of the queries and keys. As shown in Fig.3(a), in the l-th layer, we take the result of Eq.5 as the output of the no-confounder block.
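A minimal PyTorch sketch of this block, with nn.MultiheadAttention standing in for the classical attention layer and all dimensions chosen for illustration (the module name and hyper-parameters are assumptions, not the authors’ exact implementation):

```python
import torch
import torch.nn as nn

class NoConfounderBlock(nn.Module):
    """Sketch of the no-confounder block: g(.) realized as self-attention
    over the input features, followed by an MLP classification head."""
    def __init__(self, d=512, n_heads=8, n_classes=1000):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                 nn.Linear(d, n_classes))

    def forward(self, x):              # x: (B, T, d) input features
        m, _ = self.attn(x, x, x)      # queries, keys, values all come from X
        return m, self.mlp(m.mean(1))  # mediator features and class logits

mediator, logits = NoConfounderBlock()(torch.randn(2, 10, 512))
```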
3.2.2 Block of Back-door Adjustment
We assume that an observable confounder Z influences the relationship between the input X and the output Y. The link Z → X is blocked through the back-door adjustment, and then the causal effect of X on Y is identifiable and given by
P(Y|do(X)) = \sum_z P(Y|X = x, Z = z) P(z) = E_z[P(Y|x, z)]    (6)
where x and z denote the embeddings of the input and the confounder, respectively. To perform the back-door intervention operation, we parameterize P(Y|x, z) using a network whose final layer is a softmax function:
P(Y|x, z) = Softmax(f(x, z))    (7)
where f(·) is the fully connected layer predictor. However, computing Eq.6 would require an extensive number of samples of x and z to be passed through this network. We therefore employ the Normalized Weighted Geometric Mean (NWGM) [30] to approximate the expectation of the softmax as the softmax of the expectation:
P(Y|do(X)) = E_z[Softmax(f(x, z))] ≈ Softmax(f(E[x], E[z]))    (8)
We calculate two query sets to estimate the input expectation E[x] and the confounder expectation E[z], respectively, as:

E[x] = \sum_x P(x|q_1(X)) · x,   E[z] = \sum_z P(z|q_2(X)) · z    (9)
where q_1(·) and q_2(·) denote query embedding functions.
As shown in Fig.3(b), we use the classical attention layer to estimate the expectations of both variables, E[x] and E[z]. In the l-th layer, both of them are concatenated and passed through a multi-layer perceptron to produce the output of the back-door block.
[Figure 3: Implementations of (a) the no-confounder block, (b) the back-door adjustment block, and (c) the front-door adjustment block.]
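A possible PyTorch sketch of this block, where two attention streams estimate the expectations and an MLP plays the role of f(E[x], E[z]); the query embedding functions, pooling, and shapes are assumptions:

```python
import torch
import torch.nn as nn

class BackDoorBlock(nn.Module):
    """Sketch of the back-door block under the NWGM approximation."""
    def __init__(self, d=512, n_heads=8, n_classes=1000):
        super().__init__()
        self.q1 = nn.Linear(d, d)  # query embedding function q1(.)
        self.q2 = nn.Linear(d, d)  # query embedding function q2(.)
        self.attn_x = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.attn_z = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(),
                                 nn.Linear(d, n_classes))

    def forward(self, x, z):  # x: (B,T,d) inputs, z: (B,K,d) confounder set
        ex, _ = self.attn_x(self.q1(x), x, x)  # E[x] via attention over X
        ez, _ = self.attn_z(self.q2(x), z, z)  # E[z] via attention over Z
        h = torch.cat([ex.mean(1), ez.mean(1)], dim=-1)
        return self.mlp(h)  # logits approximating Softmax(f(E[x], E[z]))

out = BackDoorBlock()(torch.randn(2, 10, 512), torch.randn(2, 20, 512))
```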
3.2.3 Block of Front-door Adjustment
We assume that an unobservable confounder Z influences the relationship between the input X and the output Y, while an observable mediator M establishes a connection from X to Y. We block the path X ← Z → Y through the front-door adjustment, and the causal effect of X on Y is given by
P(Y|do(X)) = \sum_m P(M = m|X) \sum_{x'} P(Y|x', m) P(x')    (10)
where x and m denote the embeddings of the input and the mediator, respectively. Similar to the back-door adjustment, we parameterize P(Y|x', m) using a softmax-aware network and the NWGM approximation. We have:
P(Y|do(X)) ≈ Softmax(f(E[x], E[m]))    (11)
where f(·) is the fully connected layer. Similarly, we estimate the expectations of the variables by two query embedding functions h_1(·) and h_2(·):
E[x] = \sum_x P(x|h_1(X)) · x,   E[m] = \sum_m P(m|h_2(M)) · m    (12)
As shown in Fig.3(c), we employ the classical attention layer to estimate the expectations E[x] and E[m]. Different from the above two blocks, we utilize a global dictionary, initialized by conducting K-means clustering on all the sample features of the training dataset, to generate the keys and values. In the l-th layer, the concatenated expectations are fed into a multi-layer perceptron to generate the output of the front-door block.

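A corresponding sketch of the front-door block; the learnable dictionary parameter below is a placeholder for the K-means-initialized global dictionary, and the module layout mirrors the back-door sketch above:

```python
import torch
import torch.nn as nn

class FrontDoorBlock(nn.Module):
    """Sketch of the front-door block with a global dictionary."""
    def __init__(self, d=512, n_heads=8, n_clusters=64, n_classes=1000):
        super().__init__()
        # In the paper the dictionary is initialized by K-means over all
        # training features; a random learnable placeholder is used here.
        self.dictionary = nn.Parameter(torch.randn(n_clusters, d))
        self.h1 = nn.Linear(d, d)  # query embedding function h1(.)
        self.h2 = nn.Linear(d, d)  # query embedding function h2(.)
        self.attn_x = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.attn_m = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(),
                                 nn.Linear(d, n_classes))

    def forward(self, x, m):  # x: (B,T,d) inputs, m: (B,S,d) mediators
        dic = self.dictionary.unsqueeze(0).expand(x.size(0), -1, -1)
        ex, _ = self.attn_x(self.h1(x), dic, dic)  # E[x] from the dictionary
        em, _ = self.attn_m(self.h2(m), dic, dic)  # E[m] from the dictionary
        h = torch.cat([ex.mean(1), em.mean(1)], dim=-1)
        return self.mlp(h)  # logits approximating Softmax(f(E[x], E[m]))

out = FrontDoorBlock()(torch.randn(2, 10, 512), torch.randn(2, 5, 512))
```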
3.2.4 Probability of Sufficient Cause
In our framework, each layer consists of three deconfounding blocks from different causal graphs. Reexamining Eq.3, we find that evaluating the total effect is essentially searching for the optimal deconfounding graph among the three causal graphs, while the other two causal graphs serve as sufficient causal conditions for the optimal solution. For example, during the game of building blocks, if we discover that both graph G_1 and graph G_2 are ineffective, it naturally leads us to use graph G_3 as the optimal building method; that is, when both PS_{1→3} and PS_{2→3} have high values, indicating that both G_1 and G_2 are sufficient causes for G_3, we consider G_3 the optimal deconfounding method for achieving P(Y|do(X)). Therefore, in this paper, we calculate the weight of the causal graph G_3 to approximate the probability of sufficient cause where G_3 is the optimal solution. Similarly, we approximate the probabilities of sufficient causes for G_1 and G_2. We have:
O^(l) = \sum_{i=1}^{3} [σ(w^(l))]_i · B_i^(l)    (13)
where w^(l) is the weight vector of the l-th layer. Each of its elements reflects the probability of the i-th deconfounding block being the optimal solution of the l-th layer. σ(·) denotes the normalization function (described in Optimization), [·]_i denotes the i-th element of a given vector, B_i^(l) represents the output of the i-th deconfounding block in the l-th layer, and O^(l) is the output of the l-th causal layer. In this work, we employ L parallel causal layers and combine them as:
O = \sum_{l=1}^{L} [σ(v)]_l · O^(l)    (14)
where v is the layer-aware weight vector and O represents the final output, computed as a weighted sum of all causal layers. Both w^(l) and v are learnable parameters, initialized with equal constants, indicating that the routing weight learning starts without any prior bias towards a specific block or layer.
Stack. In the no-confounder block, the mediator estimated by the previous layer is used as the input of the current layer. In the back- and front-door adjustment blocks, the expectations estimated by the previous layer serve as the inputs of the current layer.
Optimization. To enable the dynamic fusion of causal blocks and causal layers, we design a sharpening softmax function to implement σ(·). Specifically, we equip the ordinary softmax with a temperature coefficient τ that converges during training:
[σ(w)]_i = exp(w_i/τ) / \sum_j exp(w_j/τ)    (15)
where σ(w) represents the normalized weight vector after the softmax, w_i is the i-th weight value, and τ denotes the temperature coefficient for sharpening the softmax function. At the initial stage of training, τ is set to 1, which makes the sharpening softmax identical to the regular softmax. As training progresses, τ gradually decreases, and as it converges to 0 the sharpening softmax approaches the argmax function. With the designed sharpening softmax, the block- and layer-aware weight vectors can be optimized through back-propagation. This optimization enhances these weights, yielding more pronounced differences after training.
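The sharpening softmax and the two-level fusion of Eqs.13-14 can be sketched as follows; the shapes, the mid-training temperature value, and the einsum-based fusion are illustrative assumptions:

```python
import torch

def sharpening_softmax(w, tau):
    """Eq.15: softmax with temperature tau, annealed from 1 toward 0
    during training so that it gradually approaches argmax."""
    return torch.softmax(w / tau, dim=-1)

n_layers, batch, n_classes = 6, 2, 1000
block_outs = torch.randn(n_layers, 3, batch, n_classes)  # B_i^(l), 3 blocks/layer
w = torch.zeros(n_layers, 3, requires_grad=True)         # block weights, uniform init
v = torch.zeros(n_layers, requires_grad=True)            # layer weights, uniform init
tau = 0.5                                                # mid-training temperature

alpha = sharpening_softmax(w, tau)  # approximated sufficient-cause probabilities
layer_outs = torch.einsum("li,libc->lbc", alpha, block_outs)               # Eq.13
final = torch.einsum("l,lbc->bc", sharpening_softmax(v, tau), layer_outs)  # Eq.14
print(final.shape)  # torch.Size([2, 1000])
```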
3.3 Application to Our Framework
Visual Question Answering (VQA) aims to predict an answer for a given question and image [1]. In this task, X represents the input image-question pair (e.g., an image involving “church” and the corresponding question “what days might I most commonly go to this building?”). Y represents the output predicted answer (e.g., “sunday”). M is the mediator extracted from X, which refers to question-attended visual regions or attributes (e.g., a visual region involving “church” and an attribute “building”). Additionally, in the front-door adjustment, Z denotes the unobservable confounder, while in the back-door adjustment, Z denotes the observable confounder that refers to question-attended external knowledge. This is because external knowledge comprises both “good” language context and “bad” language bias [21]. We use 6 parallel causal layers in the VQA task.
Long Document Classification (LDC) aims to classify a given long document [7]. In this task, X represents the input document (e.g., a legal-related document). Y represents the output classification result (e.g., “legal” or “politics”). M refers to segments extracted from the document X (e.g., a segment “…but in practice it too often becomes tyranny…” indicates the label “politics”). Similarly, Z in the front-door adjustment denotes the unobservable confounder, while Z in the back-door adjustment denotes the observable confounder that refers to the high-frequency words in each document. We use multiple parallel causal layers in the LDC task.
The confounder extraction process for back-door adjustment is explained in Section 4.2.
4 Experiment
4.1 Datasets and Metrics
VQA2.0 [1] is a widely-used benchmark VQA dataset, which uses images from MS-COCO. It comprises a total of 443,757, 214,254, and 447,793 samples for training, validation, and testing, respectively. Every image is paired with around 3 questions, and each question has 10 reference answers. We consider both the soft VQA accuracy [1] for each question type and the overall performance as the evaluation metrics.
ECtHR [5] is a popular dataset for the long document classification task. It comprises European Court of Human Rights cases, with annotations provided for paragraph-level rationales. The dataset consists of 11,000 ECtHR cases, where each case is associated with one or more provisions of the convention allegedly violated. The ECtHR dataset is divided into 8,866, 973, and 986 samples for training, validation, and testing, respectively. It is used to evaluate the performance of our framework on a multi-label classification task. Evaluation metrics include micro/macro average F1 scores and accuracy on the test set.
20 NewsGroups [29] is a popular dataset for the long document classification task. It consists of approximately 20,000 newsgroup documents that are evenly distributed across 20 different news topics. The dataset includes 10,314, 1,000, and 1,000 samples for training, validation, and testing, respectively. It is used to evaluate the performance of our framework on a multi-class classification task. We report the performance with the accuracy as the evaluation metric.
4.2 Implementation
Image and Text Processing. For the VQA task, we employ the image encoder of BLIP-2 [14] to extract grid features. We preprocess the question text and the knowledge text retrieved for the back-door adjustment by lower-casing, tokenizing, and removing special symbols. We truncate each sentence to a maximum length of 14 words and utilize the text encoder of CLIP ViT-L/14 [24] to extract word features, followed by a single-layer LSTM encoder with a hidden dimension of 512.
For the LDC task, we define the maximum sequence length as 4,096 tokens. We split a long document into overlapping segments of 256 tokens. These segments have a 1/4 overlap between them. To extract text features, we utilize the pre-trained RoBERTa [18] as the text encoder.
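A small sketch of this segmentation step (the helper name is hypothetical); 256-token windows with a 1/4 overlap correspond to a stride of 192 tokens:

```python
def split_into_segments(token_ids, seg_len=256, overlap=64):
    """Split a long token sequence (up to 4,096 tokens) into overlapping
    segments of seg_len tokens with a 1/4 (64-token) overlap."""
    stride = seg_len - overlap  # 192-token stride
    end = max(1, len(token_ids) - overlap)
    return [token_ids[i:i + seg_len] for i in range(0, end, stride)]

segments = split_into_segments(list(range(4096)))
print(len(segments), len(segments[0]))  # 21 segments of 256 tokens each
```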
Confounder Extraction. For the VQA task, we use the question-attended external knowledge as the observable confounder in the back-door adjustment. In this paper, we retrieve external knowledge from the ConceptNet knowledge base [26], which represents common sense as triplets and provides a statistical weight for each triplet, ensuring reliable retrieval of information. Specifically, we first extract 3 types of query words: (1) object labels of images obtained by GLIP [16]; (2) OCR text of images extracted by the EasyOCR toolkit (https://github.com/JaidedAI/EasyOCR); (3) n-gram question entity phrases. All of the words are filtered using a part-of-speech restriction tool [17], and the filtered words are combined to form the query set for searching common sense in ConceptNet. We use the pre-trained MPNet [25] to encode the returned common-sense triplets and the given question. Then, we calculate the cosine similarity between the encoded triplets and the question, and multiply it by the given statistical weight to obtain the final score of each triplet. We select the top-20 triplets as the observable confounder for each image-question pair.
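The triplet-scoring step can be sketched as follows, assuming the retrieved triplets arrive as (text, statistical weight) pairs; the sentence-transformers MPNet checkpoint and the example facts are illustrative stand-ins for the paper’s exact setup:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # a public MPNet encoder

def rank_triplets(question, triplets, top_k=20):
    """Score each (triplet_text, conceptnet_weight) pair by
    cosine(question, triplet) * weight and keep the top-k."""
    q_emb = model.encode(question, convert_to_tensor=True)
    t_embs = model.encode([t for t, _ in triplets], convert_to_tensor=True)
    sims = util.cos_sim(q_emb, t_embs)[0]
    scored = [(t, float(s) * w) for (t, w), s in zip(triplets, sims)]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]

facts = [("church is used for worship", 2.0),       # hypothetical triplets
         ("building is related to house", 1.0)]
print(rank_triplets("what days might I most commonly go to this building?",
                    facts, top_k=2))
```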
For the LDC task, we use the high-frequency words in each document as the observable confounder. Specifically, we employ the TF-IDF method to select the top-M words in each long document (M is set to 64 for ECtHR and 128 for 20 NewsGroups). TF-IDF measures the importance of a word within a document by weighing its frequency in that document against its frequency across all documents.
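A minimal sketch of this confounder extraction with scikit-learn’s TfidfVectorizer; the document contents are placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_m_words(documents, doc_idx, m=64):
    """Select the top-M TF-IDF words of one document as its observable
    confounder (M = 64 for ECtHR, 128 for 20 NewsGroups)."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(documents)       # (n_docs, vocab_size)
    row = tfidf[doc_idx].toarray().ravel()
    vocab = np.array(vec.get_feature_names_out())
    return vocab[np.argsort(row)[::-1][:m]].tolist()

docs = ["the court held that the applicant's rights were violated",
        "parliament debated the new tax policy"]  # placeholder documents
print(top_m_words(docs, doc_idx=0, m=5))
```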
Training Strategy. For the VQA task, we use the Adam optimizer, with the learning rate decaying at epochs 10 and 12 with a decay rate of 0.5. We adopt a warm-up strategy for the first 3 epochs, and the full model is trained for 13 epochs in total. The batch size is set to 64. For the LDC task, we use the AdamW optimizer and employ a linear decay strategy with a warm-up over 10% of the total number of steps. The model needs about 16 epochs to converge, and the batch size on each GPU is set to 2. Our method is implemented in PyTorch on two 3090Ti GPUs.
Table 1: Comparison with transformer-based models on VQA2.0 (accuracy, %).

| Method | Test-dev overall | Yes/No | Num | Others | Test-std overall |
|---|---|---|---|---|---|
| Transformer [28] | 69.53 | 86.25 | 50.70 | 59.90 | 69.82 |
| DFAF [10] | 70.22 | 86.09 | 53.32 | 60.49 | 70.34 |
| ReGAT [15] | 70.27 | 86.08 | 54.42 | 60.33 | 70.58 |
| MCAN [36] | 70.63 | 86.82 | 53.26 | 60.72 | 70.90 |
| TRRNet [31] | 70.80 | - | - | - | 71.20 |
| Transformer+CATT [33] | 70.95 | 87.40 | 53.45 | 61.3 | 71.27 |
| AGAN [41] | 71.16 | 86.87 | 54.29 | 61.56 | 71.50 |
| MMNAS [35] | 71.24 | 87.27 | 55.68 | 61.05 | 71.56 |
| TRARS [42] | 72.00 | 87.43 | 54.69 | 62.72 | - |
| TRARS(16*16) [42] | 72.62 | 88.11 | 55.33 | 63.31 | 72.93 |
| CGR | 75.46 | 90.24 | 57.16 | 67.01 | 75.47 |
Table 2: Comparison with pre-trained large-scale models on VQA2.0 (overall accuracy, %).

| Method | Test-dev | Test-std |
|---|---|---|
| LXMERT [27] | 72.42 | 72.54 |
| ERNIE-VIL [34] | 72.62 | 72.85 |
| UNITER [6] | 72.70 | 72.91 |
| 12IN1 [19] | - | 72.92 |
| LXMERT+CATT [33] | 72.81 | 73.04 |
| LXMERT+CATT(large) [33] | 73.54 | 73.63 |
| VILLA [9] | 73.59 | 73.67 |
| UNITER(large) [6] | 73.82 | 74.02 |
| VILLA(large) [9] | 74.69 | 74.87 |
| ERNIE-VIL(large) [34] | 74.95 | 75.10 |
| CGR | 75.46 | 75.47 |
4.3 Results
Visual Question Answering. We report the performance of our framework on the VQA task against transformer-based models (Tab.1) and pre-trained large-scale models (Tab.2), where test-dev and test-std are the online development-test and standard-test splits, respectively. As shown in Tab.1, our CGR significantly outperforms all transformer-based models across all metrics. Specifically, our method outperforms the best competitor, TRARS(16*16), by 3.91% and 3.48% on the test-dev and test-std splits, respectively, which validates the effectiveness of the proposed stacked deconfounding method. Meanwhile, the pre-trained large-scale models also use the “stacked” mechanism, stacking multiple self-attention layers for the VQA task, but they ignore the negative impact of confounding, i.e., the spurious correlations between input and output variables. Our method builds multiple deconfounding layers to eliminate these spurious correlations. The better results in Tab.2 showcase our advantage.
With less computation cost, our CGR achieves over a 0.4% improvement against the best competitor among the pre-trained large-scale models, ERNIE-VIL(large). Remarkably, CGR is equipped with just 3 deconfounding methods and 6 causal layers, while ERNIE-VIL(large) relies on a much larger number of attention layers (24 textual layers + 6 visual layers with 16 heads in each layer). Besides, the causation community can offer numerous powerful deconfounding methods to further enhance our framework. Hence, CGR has great potential for building the “causal” pre-trained large-scale model to handle a wide range of tasks, which would greatly enhance machines’ comprehension of causal relationships within a broader semantic space.
Table 3: Comparison with state-of-the-art methods on ECtHR (macro/micro F1, %) and 20 NewsGroups (accuracy, %).

| ECtHR | Macro F1 | Micro F1 |
|---|---|---|
| RoBERTa [18] | 68.9 | 77.3 |
| CaseLaw-BERT [40] | 70.3 | 78.8 |
| BigBird [38] | 70.9 | 78.8 |
| DeBERTa [11] | 71.0 | 78.8 |
| Longformer [2] | 71.7 | 79.4 |
| BERT [8] | 73.4 | 79.7 |
| Legal-BERT [4] | 74.7 | 80.4 |
| Hi-Transformer(RoBERTa) [7] | 76.5 | 81.1 |
| CGR | 76.6 | 81.3 |

| 20 NewsGroups | Accuracy |
|---|---|
| RoBERTa [18] | 83.8 |
| BERT [8] | 85.3 |
| Hi-Transformer(RoBERTa) [7] | 85.6 |
| CGR | 86.5 |
Long Document Classification. We compare the proposed framework with the state-of-the-art methods for the LDC task on the ECtHR and 20 NewsGroups datasets (Tab.3). Our method consistently achieves better performance across all metrics, which indicates that our deconfounding strategy remains effective on challenging multi-class and multi-label NLP tasks. Faced with complex long texts, CGR helps uncover the potential cause-effect relations and improves model performance through multi-step intervention routing. Moreover, CGR outperforms Legal-BERT by 2.54% under the Macro F1 score, suggesting that our method has advantages in deconfounding domain-specific knowledge.
4.4 Ablation Studies
We further validate the efficacy of the proposed framework by assessing several variants: 1) One deconfounding block reserved per causal layer: in our method, each causal layer consists of three deconfounding blocks. To verify their effectiveness, we retain only one block in each layer, i.e., no confounder, back-door adjustment, or front-door adjustment, respectively, and calculate their average performance for comparison. 2) Two deconfounding blocks reserved per causal layer: similarly, we retain two blocks in each layer and calculate their average performance for comparison. 3) An alternative strategy for calculating the sufficient cause: we designed the sharpening softmax function to calculate the weights of the deconfounding blocks for the sufficient-cause approximation; here we remove the sharpening mechanism and adopt the ordinary softmax in Eq.15. Tab.4 reports the performance of the ablation studies on the VQA and LDC tasks. Our full framework outperforms all variants, which shows the advantages of deconfounding from diverse causal graphs and of our sufficient-cause approximation method.
Table 4: Ablation studies on the VQA2.0 and 20 NewsGroups datasets (accuracy, %).

| Method | VQA2.0 | 20 NewsGroups |
|---|---|---|
| One deconfounding block reserved | 69.35 | 84.40 |
| Two deconfounding blocks reserved | 70.03 | 85.00 |
| CGR w/o sharpening softmax | 70.91 | 86.30 |
| CGR | 71.20 | 86.50 |
4.5 Qualitative Analysis
Fig.5 shows two qualitative examples of our method on the VQA and LDC tasks. To provide insight into which causal graph is dominant, we present the probabilities of sufficient cause in all layers, which reveal the explicit causal routing path within the framework. In the first example, we observe that the front-door adjustment in the first layer dominates the answer inference, which helps the model avoid unseen confounding effects such as dataset bias. As the routing progresses, the back-door adjustment is significantly enhanced, suggesting that the model starts to focus on how to use external knowledge without confounding for the answer inference.
[Figure 5: Qualitative examples of the causal routing paths on the VQA and LDC tasks, with the probabilities of sufficient cause in each layer.]
5 Conclusion
In this paper, we propose the novel Causal Graph Routing (CGR) framework, the first integrated causal scheme relying entirely on intervention mechanisms to address the need for deconfounding from diverse causal graphs. Specifically, CGR is composed of a stack of causal layers, each of which includes a set of parallel deconfounding blocks from different causal graphs. We propose the concept of sufficient cause, which chains together multiple deconfounding methods and allows the model to dynamically select the suitable deconfounding method in each layer. CGR is implemented as stacked networks. Experiments show our method surpasses the current state-of-the-art methods on both VQA and LDC tasks. CGR has great potential for building the “causal” pre-trained large-scale model. We plan to extend CGR with more powerful deconfounding methods and apply it to other tasks to reveal the cause-effect forces hidden in data.
References
- [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: visual question answering. In ICCV, pages 2425–2433, 2015.
- [2] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [3] H. Cai, R. Song, and W. Lu. ANOCE: analysis of causal effects with multiple mediators via constrained structural learning. In ICLR, 2021.
- [4] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, and I. Androutsopoulos. LEGAL-BERT: The muppets straight out of law school. In Findings of EMNLP, 2020.
- [5] I. Chalkidis, M. Fergadiotis, D. Tsarapatsanis, N. Aletras, I. Androutsopoulos, and P. Malakasiotis. Paragraph-level rationale extraction through regularization: A case study on european court of human rights cases. In NAACL-HLT, pages 226–241, 2021.
- [6] Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu. UNITER: universal image-text representation learning. In ECCV, volume 12375, pages 104–120, 2020.
- [7] X. Dai, I. Chalkidis, S. Darkner, and D. Elliott. Revisiting transformer-based models for long document classification. In EMNLP, pages 7212–7230, 2022.
- [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186, 2019.
- [9] Z. Gan, Y. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu. Large-scale adversarial training for vision-and-language representation learning. In NeurIPS, 2020.
- [10] P. Gao, Z. Jiang, H. You, P. Lu, S. C. H. Hoi, X. Wang, and H. Li. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In CVPR, pages 6639–6648, 2019.
- [11] P. He, X. Liu, J. Gao, and W. Chen. Deberta: decoding-enhanced bert with disentangled attention. In ICLR, 2021.
- [12] J. Huang, Y. Qin, J. Qi, Q. Sun, and H. Zhang. Deconfounded visual grounding. In AAAI, pages 998–1006, 2022.
- [13] A. Jaber, A. H. Ribeiro, J. Zhang, and E. Bareinboim. Causal identification under markov equivalence: Calculus, algorithm, and completeness. In NeurIPS, 2022.
- [14] J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, volume 202, pages 19730–19742, 2023.
- [15] L. Li, Z. Gan, Y. Cheng, and J. Liu. Relation-aware graph attention network for visual question answering. In ICCV, pages 10312–10321, 2019.
- [16] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J. Hwang, K. Chang, and J. Gao. Grounded language-image pre-training. In CVPR, pages 10955–10965, 2022.
- [17] B. Y. Lin, X. Chen, J. Chen, and X. Ren. KagNet: Knowledge-aware graph networks for commonsense reasoning. In EMNLP, pages 2829–2839, 2019.
- [18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- [19] J. Lu, V. Goswami, M. Rohrbach, D. Parikh, and S. Lee. 12-in-1: Multi-task vision and language representation learning. In CVPR, pages 10437–10446, 2020.
- [20] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, pages 3195–3204, 2019.
- [21] Y. Niu, K. Tang, H. Zhang, Z. Lu, X. Hua, and J. Wen. Counterfactual VQA: A cause-effect look at language bias. In CVPR, pages 12700–12710, 2021.
- [22] J. Pearl and D. Mackenzie. The book of why: the new science of cause and effect. Basic books, 2018.
- [23] S. Peng, X. Hu, R. Zhang, K. Tang, J. Guo, Q. Yi, R. Chen, X. Zhang, Z. Du, L. Li, Q. Guo, and Y. Chen. Causality-driven hierarchical structure discovery for reinforcement learning. In NeurIPS, 2022.
- [24] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, volume 139, pages 8748–8763, 2021.
- [25] K. Song, X. Tan, T. Qin, J. Lu, and T. Liu. Mpnet: Masked and permuted pre-training for language understanding. In NeurIPS, 2020.
- [26] R. Speer, J. Chin, and C. Havasi. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI, pages 4444–4451, 2017.
- [27] H. Tan and M. Bansal. LXMERT: learning cross-modality encoder representations from transformers. In EMNLP, pages 5099–5110, 2019.
- [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
- [29] Y. Wahba, N. H. Madhavji, and J. Steinbacher. A comparison of SVM against pre-trained language models (plms) for text classification tasks. In LOD, pages 304–313, 2022.
- [30] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057, 2015.
- [31] X. Yang, G. Lin, F. Lv, and F. Liu. Trrnet: Tiered relation reasoning for compositional visual question answering. In ECCV, volume 12366, pages 414–430, 2020.
- [32] X. Yang, H. Zhang, and J. Cai. Deconfounded image captioning: A causal retrospect. IEEE Trans. Pattern Anal. Mach. Intell., 45(11):12996–13010, 2023.
- [33] X. Yang, H. Zhang, G. Qi, and J. Cai. Causal attention for vision-language tasks. In CVPR, pages 9847–9857, 2021.
- [34] F. Yu, J. Tang, W. Yin, Y. Sun, H. Tian, H. Wu, and H. Wang. Ernie-vil: Knowledge enhanced vision-language representations through scene graphs. In AAAI, pages 3208–3216, 2021.
- [35] Z. Yu, Y. Cui, J. Yu, M. Wang, D. Tao, and Q. Tian. Deep multimodal neural architecture search. In ACM MM, pages 3743–3752, 2020.
- [36] Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian. Deep modular co-attention networks for visual question answering. In CVPR, pages 6281–6290, 2019.
- [37] S. Yuan, D. Yang, J. Liu, S. Tian, J. Liang, Y. Xiao, and R. Xie. Causality-aware concept extraction based on knowledge-guided prompting. In ACL, pages 9255–9272, 2023.
- [38] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big Bird: Transformers for longer sequences. In NeurIPS, 2020.
- [39] C. Zang, H. Wang, M. Pei, and W. Liang. Discovering the real association: Multimodal causal reasoning in video question answering. In CVPR, pages 19027–19036, 2023.
- [40] L. Zheng, N. Guha, B. R. Anderson, P. Henderson, and D. E. Ho. When does pretraining help? Assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings. In ICAIL, pages 159–168, 2021.
- [41] Y. Zhou, R. Ji, X. Sun, G. Luo, X. Hong, J. Su, X. Ding, and L. Shao. K-armed bandit based multi-modal network architecture search for visual question answering. In ACM MM, pages 1245–1254, 2020.
- [42] Y. Zhou, T. Ren, C. Zhu, X. Sun, J. Liu, X. Ding, M. Xu, and R. Ji. TRAR: routing the attention spans in transformer for visual question answering. In ICCV, pages 2054–2064, 2021.