
Causality is all you need

Ning Xu1, Yifei Gao1, Hongshuo Tian1, Yongdong Zhang2, An-An Liu1∗    1 Tianjin University, China    2 University of Science and Technology of China, China    {ningxu,2022234109,kellyeden,liuanan}@tju.edu.cn, {zhyd73}@ustc.edu.cn
Abstract

In introductory statistics courses, students are taught to remember the well-known saying: “Correlation is not Causation”. To date, statistics (i.e., correlation) has produced various successful frameworks, such as the Transformer and pre-trained large-scale models, which stack multiple parallel self-attention blocks to handle a wide range of tasks. However, in the causation community, how to build an integrated causal framework remains an untouched domain despite its excellent intervention capabilities. In this paper, we propose the Causal Graph Routing (CGR) framework, an integrated causal scheme that relies entirely on intervention mechanisms to reveal the cause-effect forces hidden in data. Specifically, CGR is composed of a stack of causal layers. Each layer includes a set of parallel deconfounding blocks derived from different causal graphs. We combine these blocks via the proposed concept of sufficient cause, which allows the model to dynamically select the suitable deconfounding method in each layer. CGR is implemented as stacked networks, integrating no confounder, back-door adjustment, front-door adjustment, and the probability of sufficient cause. We evaluate this framework on two classical tasks from CV and NLP. Experiments show that CGR surpasses current state-of-the-art methods on both Visual Question Answering and Long Document Classification tasks. In particular, CGR has great potential for building a “causal” pre-trained large-scale model that generalizes effectively to diverse tasks, improving machines’ comprehension of causal relationships within a broader semantic space.

1 Introduction

Correlation is not Causation.

—Karl Pearson (1857 - 1936)

In introductory statistics courses, students are taught to remember the famous phrase: “Correlation is not Causation”. A classic example is the correlation between the rooster’s crow and the sunrise: while the two are highly correlated, the rooster’s crow does not cause the sunrise. However, statistics alone cannot tell us what causation truly is. Unfortunately, many data scientists focus narrowly on interpreting data without considering the limitations of their models, mistakenly believing that all causal questions can be answered solely through data analysis and clever data-mining tricks.

Nowadays, thanks to the development of carefully crafted causal models [22, 23, 3, 33, 32], the deep learning community has paid increasing attention to causation. Mathematically, causal analysis studies the dynamic nature of distributions between variables. In statistics, we estimate various distributions and their model parameters from data, whereas in causal analysis, we study how a change in the distribution of one variable affects the distributions of other variables. This change in a variable's distribution is exactly the do-operation, an active intervention mechanism on the data that precisely defines the causal effect between variables.

For example, given the environment $\mathcal{D}$, we take the input $X$ to predict the output $Y$, which is denoted as $P(Y|do(X),\mathcal{D})$, not $P(Y|X,\mathcal{D})$. The former represents the probability of $Y$ after $X$ is implemented on the pre-decision environment $\mathcal{D}$. The latter is the probability of $Y$ when $X$ coexists with the post-implementation environment $\mathcal{D}$; this coexisting environment may differ from the environment before the decision. In short, statistics observes (i.e., seeing) and estimates what will happen; causal analysis intervenes (i.e., doing) and predicts what will happen [22].

To date, statistics has developed various successful frameworks, such as the Transformer [28] and pre-trained large-scale models [8, 27]. However, in the causation community, how to build an integrated causal framework remains an untouched domain despite its excellent intervention capabilities. In this work, we propose Causal Graph Routing (CGR), an integrated causal framework that relies entirely on intervention mechanisms to reveal the cause-effect forces hidden in data.

Specifically, causal intervention aims to mitigate the effect of confounding, a causal concept describing the spurious correlation between input and output variables [22]. Because noncausal paths are the source of confounding, we use the do-operator to control (or erase) their influence, i.e., $P(Y|do(X))$, to deconfound $X$ and $Y$. Several classical deconfounding methods are presented in Fig.2: a) No Confounder: the effect of $X$ on $Y$ flows through the mediator $M$, i.e., $X\rightarrow M\rightarrow Y$, and no confounder $Z$ exists. b) Back-door Adjustment: the observable confounder $Z$ influences both $X$ and $Y$, creating a spurious correlation $X\leftarrow Z\rightarrow Y$. The link $Z\rightarrow X$ is defined as the back-door path, which is blocked by controlling for $Z$. c) Front-door Adjustment: the causal effect of $X$ on $Y$ is confounded by the unobservable confounder $Z$ and mediated by $M$. Furthermore, $M$ is observable and shielded from the effects of $Z$. To eliminate the spurious correlation brought by $Z$, the front-door path, i.e., $M\rightarrow Y$, is blocked by controlling for $M$.

However, in several Computer Vision (CV) and Natural Language Processing (NLP) tasks, causal intervention often requires multiple deconfounding methods from different causal graphs. As shown in Fig.1, in the Visual Question Answering (VQA) task [20], to answer the question “what days might I most commonly go to this building?”, the model first detects “building” via the visual context, which could be confounded by the training data (dataset bias). To address this, methods such as front-door adjustment or no confounder are necessary. Then, the model correlates the object “building” with the facts $\langle church, RelatedTo, building\rangle$ and $\langle church, RelatedTo, sunday\rangle$ from an external knowledge base [26], which could be confounded by irrelevant knowledge facts (language bias). To mitigate this, back- or front-door adjustment is required for deconfounding. Hence, a single deconfounding method is insufficient to meet the requirement of deconfounding from diverse causal graphs. The same principle applies to the Long Document Classification (LDC) task [7].

Motivated by the Transformer and its variants, which stack multiple parallel self-attention blocks to handle a wide range of tasks [36, 7, 27], we propose the Causal Graph Routing (CGR) framework, in which the above-mentioned deconfounding blocks are stacked in the same spirit. Specifically, our framework is composed of a stack of causal layers. Each layer includes a set of parallel deconfounding blocks derived from different causal graphs. We propose the concept of sufficient cause, which provides the formal semantics for the probability that causal graph $A$ is a sufficient cause of another graph $B$. It chains together three candidate deconfounding methods, i.e., no confounder, back-door adjustment, and front-door adjustment, to obtain the overall causal effect of $X$ on $Y$. We calculate the weight of every deconfounding block to approximate the probability of sufficient cause, which allows the model to dynamically select the suitable deconfounding methods in each layer and thereby formulate a causal routing path for each example. CGR is implemented as stacked networks, which we assess on two classical tasks in CV and NLP. Experiments show that CGR outperforms existing state-of-the-art methods on both VQA and LDC tasks with less computation cost. Notably, CGR exhibits significant potential for building a “causal” pre-trained large-scale model that can effectively generalize to diverse tasks, enhancing machines’ understanding of causal relationships within a broader semantic space.

Figure 1: Two examples from (a) Visual Question Answering (VQA) task and (b) Long Document Classification (LDC) task.

2 Related Work

Causality. Causal inference [22] is an important component of human cognition. Thanks to the development of carefully crafted causal methodologies, causality has been extensively studied and mathematized. For example, Pearl et al. [22] propose the front- and back-door adjustments, which remove unobservable or observable confounders by blocking noncausal paths. Peng et al. [23] design a causality-driven hierarchical reinforcement learning framework. Cai et al. [3] establish an algorithm to comprehensively characterize causal effects with multiple mediators. Jaber et al. [13] propose a new causal do-calculus for identifying interventional distributions in Partial Ancestral Graphs (PAGs). These methods allow researchers to uncover potential causal relationships between inputs and outputs to improve deep networks.

Applications to CV and NLP tasks. Cause-effect science is well suited to CV and NLP tasks. Examples include using the front-door adjustment to remove dataset bias and improve attention mechanisms [33], discovering causal visual features in Video-QA with the back-door adjustment [39], equipping a pre-trained language model with knowledge-guided intervention for text concept extraction [37], generating counterfactual samples to mitigate language priors [21], and constructing deconfounded frameworks for visual grounding [12] and image captioning [32].

3 Methodology

In this section, we discuss how to design causal graph routing (Section 3.1), how to implement it as stacked networks (Section 3.2), and how to apply it to two classical CV and NLP tasks (Section 3.3).

3.1 Causal Graph Routing

To address the need for deconfounding from diverse causal graphs, we propose the causal graph routing framework, which integrates different deconfounding blocks by calculating the probabilities of sufficient cause between causal graphs. In particular, our objective is to dynamically select (route) the suitable deconfounding methods for the given task. We consider three candidate deconfounding methods: no confounder, back-door adjustment, and front-door adjustment. For no confounder, the input $X$ predicts the output $Y$ via the mediator $M$ without any confounder $Z$, denoted as $P_{0}\sim P(Y|do_{0}(X))$. For back-door adjustment, we cut off the link $Z\rightarrow X$ to remove the spurious correlation caused by the observable $Z$. It measures the average causal effect of $X$ on $Y$, denoted as $P_{1}\sim P(Y|do_{1}(X))$. For front-door adjustment, we block the path $M\rightarrow Y$ by controlling for the observable $M$, removing the spurious correlation caused by the unobservable $Z$, denoted as $P_{2}\sim P(Y|do_{2}(X))$.

Intuitively, we can think of “how to select the suitable causal graph” as a game of building blocks. In this game, a model is required to find the reasonable building method (i.e., causal graph) using the given units $X$, $Y$, $Z$, and $M$. If the model finds graph $P_{1}$ unsuitable, it will switch its building method and consider using either graph $P_{2}$ or $P_{0}$ instead. Hence, there exists a hidden relevance among these graphs. To formalize this relevance, we design the concept of sufficient cause among graphs, which provides the formal semantics for the probability that causal graph $A$ is a sufficient cause of another graph $B$. As shown in Fig.2(a), consider the arrow from $do_{1}(x)$ to $do_{2}(x)$ as an example, where graph $P_{1}$ serves as a sufficient cause for graph $P_{2}$. We represent the propositions $X=true$ and $Y=true$ as $x$ and $y$, respectively, and their complements as $x^{\prime}$ and $y^{\prime}$. The probability of sufficient cause from $P_{1}$ to $P_{2}$ is defined as:

$$ps_{P_{1}\rightarrow P_{2}} = P(y_{do_{2}(x)} \mid y^{\prime}, do_{1}(x^{\prime})) \qquad (1)$$

where $ps$ denotes the probability of sufficient cause, which measures the capacity of $do_{1}(x)$ to produce $do_{2}(x)$. Given that the term “production” suggests a change from the absence to the presence of $do_{2}(x)$ and $y$, we calculate the probability $P(y_{do_{2}(x)})$ by considering situations where neither $do_{1}(x)$ nor $y$ is present. In other words, $ps$ quantifies the effect of $P_{1}$ in causing $P_{2}$: it is the probability of $y_{do_{2}(x)}$ occurring (the occurrence of $do_{2}(x)$ and $y$), given that both $do_{1}(x)$ and $y$ did not occur. Considering the sufficient causes from the other two graphs ($P_{0}$ and $P_{1}$), the total effect (TE) of $P_{2}$ for the causal routing is defined as:

$$TE_{P_{2}} = P(y|do_{2}(x)) \cdot (ps_{P_{0}\rightarrow P_{2}} + ps_{P_{1}\rightarrow P_{2}}) \qquad (2)$$

We estimate the total effect of $X$ on $Y$, i.e., $P(Y|do(X))$, by dynamically routing all causal graphs:

$$\begin{aligned} P(Y|do(X)) &= TE_{P_{0}} + TE_{P_{1}} + TE_{P_{2}} \\ &= P(y|do_{0}(x)) \cdot (ps_{P_{1}\rightarrow P_{0}} + ps_{P_{2}\rightarrow P_{0}}) \\ &\quad + P(y|do_{1}(x)) \cdot (ps_{P_{0}\rightarrow P_{1}} + ps_{P_{2}\rightarrow P_{1}}) \\ &\quad + P(y|do_{2}(x)) \cdot (ps_{P_{0}\rightarrow P_{2}} + ps_{P_{1}\rightarrow P_{2}}) \end{aligned} \qquad (3)$$

where $do(X)$ refers to the set of intervention operations $\{do_{0}(x), do_{1}(x), do_{2}(x)\}$. We have thus chained together the three deconfounding methods to obtain the overall causal effect of $X$ on $Y$. Each deconfounding method is equipped with two $ps$ terms, which stand for the probabilities of sufficient cause with which the other graphs would respond to the current graph.

As shown in Fig.2(b), we define the set of chained deconfounding methods as one causal layer. We employ $L$ parallel causal layers, each of which produces output values. These values are then integrated to obtain the final values for the given task. To make the routing arithmetic of Eq.3 concrete, the following minimal sketch (our own illustration, with made-up probabilities) combines three per-graph predictions:
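```python
import torch

# Hypothetical per-graph predictions P(y | do_i(x)) over 4 answer classes.
p_do = torch.tensor([
    [0.10, 0.60, 0.20, 0.10],   # P0: no confounder
    [0.05, 0.70, 0.15, 0.10],   # P1: back-door adjustment
    [0.20, 0.50, 0.20, 0.10],   # P2: front-door adjustment
])

# Hypothetical ps[a][b]: probability that graph a is a sufficient cause of
# graph b (diagonal is zero; a graph is not chained to itself).
ps = torch.tensor([
    [0.0, 0.3, 0.2],
    [0.4, 0.0, 0.5],
    [0.1, 0.6, 0.0],
])

# Eq. 3: graph b's prediction is weighted by the sufficient-cause mass
# flowing into it from the other two graphs.
weights = ps.sum(dim=0)                                  # weight for graph b = sum_a ps[a][b]
total_effect = (p_do * weights.unsqueeze(1)).sum(dim=0)  # TE_P0 + TE_P1 + TE_P2
print(total_effect)                                      # unnormalized P(Y | do(X))
```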

Figure 2: (a) Select the suitable causal graph (i.e., intervention operation) via the sufficient causes. (b) Scheme of Causal Graph Routing.

3.2 Causal Stacked Networks

In this section, we illustrate how to implement our CGR in a deep framework. In practice, we adopt stacked networks to perform the causal routing computation, integrating no confounder, back-door adjustment, front-door adjustment, and the probability of sufficient cause.

3.2.1 Block of No Confounder

This block involves two stages: 1) extracting the mediator $M$ from the input $X$ ($X\rightarrow M$) and 2) predicting the outcome $Y$ based on $M$ ($M\rightarrow Y$). We have:

$$P(Y|X) = \sum_{m} P(M=m|X)\, P(Y|M=m) \qquad (4)$$

Considering that most CV and NLP tasks are formulated as classification problems, we compute $P(Y|X)$ using a transform function $f_{do_{0}}^{trans}(\cdot)$ and a multi-layer perceptron $\mathrm{MLP}(\cdot)$. The former extracts the mediator $M$ from the input $X$, while the latter outputs classification probabilities through a softmax layer.

$$\mathbb{E}_{X}(M) = f_{do_{0}}^{trans}(X), \qquad P(Y|X) = \mathrm{softmax}(\mathrm{MLP}(\mathbb{E}_{X}(M))) \qquad (5)$$

We use the classical attention layer $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}(QK^{T}/\sqrt{d})V$ to calculate $f^{trans}_{do_{0}}(X) = \mathrm{Attention}(X,X,X)$, where the queries $Q$, keys $K$, and values $V$ all come from the input $X$, and $d$ is the dimension of the queries and keys. As shown in Fig.3(a), in the $l$-th layer, we take the result of $\mathrm{MLP}(\mathbb{E}_{X}(M))$ as the output of the no-confounder block $C^{(l)}_{do_{0}}$. A minimal PyTorch sketch of this block follows; the MLP shape and the mean-pooling over tokens before classification are our assumptions, not details specified above, and the attention helper defined here is reused by the later block sketches.
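```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v):
    # Classical scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5
    return torch.matmul(F.softmax(scores, dim=-1), v)

class NoConfounderBlock(nn.Module):
    """Sketch of the no-confounder block: self-attention extracts the mediator
    expectation E_X(M) = Attention(X, X, X); an MLP head yields P(Y | X)."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, x):                      # x: (batch, tokens, dim)
        e_m = attention(x, x, x)               # E_X(M), first line of Eq. 5
        logits = self.mlp(e_m.mean(dim=1))     # mean-pool tokens (an assumption)
        return F.softmax(logits, dim=-1), e_m  # C^(l)_do0; e_m feeds the next layer's X
```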

3.2.2 Block of Back-door Adjustment

We assume that an observable confounder $Z$ influences the relationship between the input $X$ and the output $Y$. The link $Z\rightarrow X$ is blocked through the back-door adjustment, and the causal effect of $X$ on $Y$ then becomes identifiable and is given by

$$P(Y|do_{1}(X)) = \sum_{z \in Z} P(Z=z)\, P(Y|X, Z=z) \qquad (6)$$

where $X$ and $Z$ denote the embeddings of the inputs and confounders, respectively. To perform the back-door intervention, we parameterize $P(Y|X,Z)$ using a network whose final layer is a softmax function:

$$P(Y|X, Z=z) = \mathrm{softmax}(f^{pre}_{do_{1}}(X, Z)) \qquad (7)$$

where $f^{pre}_{do_{1}}(\cdot)$ is a fully connected predictor. However, computing $P(Y|do_{1}(X))$ directly would require an extensive number of samples of $X$ and $Z$ to be passed through this network. We therefore employ the Normalized Weighted Geometric Mean (NWGM) [30] to approximate the expectation of the softmax as the softmax of the expectation:

$$P(Y|do_{1}(X)) = \mathbb{E}_{Z}[\mathrm{softmax}(f^{pre}_{do_{1}}(X, Z))] \approx \mathrm{softmax}(f^{pre}_{do_{1}}(\mathbb{E}_{X}(X), \mathbb{E}_{X}(Z))) \qquad (8)$$

We calculate two query sets from $X$ and $Z$ to estimate the input expectation $\mathbb{E}_{X}(X)$ and the confounder expectation $\mathbb{E}_{X}(Z)$, respectively, as:

$$\begin{aligned} \mathbb{E}_{X}(X) &= \sum_{X=x} P\big(X=x \mid f_{do_{1}(X\rightarrow\mathbb{E}_{X}(X))}^{emb}(X)\big)\, x \\ \mathbb{E}_{X}(Z) &= \sum_{Z=z} P\big(Z=z \mid f_{do_{1}(X,Z\rightarrow\mathbb{E}_{X}(Z))}^{emb}(X)\big)\, z \end{aligned} \qquad (9)$$

where $f_{do_{1}(X\rightarrow\mathbb{E}_{X}(X))}^{emb}$ and $f_{do_{1}(X,Z\rightarrow\mathbb{E}_{X}(Z))}^{emb}$ denote query embedding functions.

As shown in Fig.3(b), we use the classical attention layer to estimate the expectations of both variables: $\mathbb{E}_{X}(X) = \mathrm{Attention}(X,X,X)$ and $\mathbb{E}_{X}(Z) = \mathrm{Attention}(Z,X,X)$. In the $l$-th layer, the two expectations are concatenated and passed through a multi-layer perceptron to produce the output of the back-door block $C^{(l)}_{do_{1}}$. Under the same assumptions as above (pooling and MLP shape are ours), a sketch of this block, reusing the attention helper defined earlier:
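```python
class BackDoorBlock(nn.Module):
    """Sketch of the back-door block: E_X(X) = Attention(X, X, X) and
    E_X(Z) = Attention(Z, X, X) are concatenated and fed to an MLP,
    approximating Eq. 8 under NWGM."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, x, z):                   # x: (B, Nx, D), z: (B, Nz, D)
        e_x = attention(x, x, x)               # input expectation E_X(X)
        e_z = attention(z, x, x)               # confounder expectation E_X(Z)
        pooled = torch.cat([e_x.mean(1), e_z.mean(1)], dim=-1)
        return F.softmax(self.mlp(pooled), dim=-1), e_x  # C^(l)_do1; e_x feeds next layer
```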

Figure 3: We adopt stacked networks to perform the causal graph routing computation.

3.2.3 Block of Front-door Adjustment

We assume that an unobservable confounder $Z$ influences the relationship between the input $X$ and the output $Y$, while an observable mediator $M$ establishes a connection from $X$ to $Y$. We block the link $M\rightarrow Y$ through the front-door adjustment, and the causal effect of $X$ on $Y$ is given by

$$P(Y|do_{2}(X)) = \sum_{M=m} P(M=m|X) \sum_{x} P(X=x)\, P(Y|X=x, M=m) \qquad (10)$$

where $X$ and $M$ denote the embeddings of the inputs and mediators, respectively. Similar to the back-door adjustment, we parameterize $P(Y|X=x,M=m)$ using the softmax-aware network and the NWGM approximation. We have:

$$P(Y|do_{2}(X)) \approx \mathrm{softmax}(f^{pre}_{do_{2}}(\mathbb{E}_{X}(X), \mathbb{E}_{X}(M))) \qquad (11)$$

where $f^{pre}_{do_{2}}(\cdot)$ is a fully connected layer. Similarly, we estimate the expectations of the variables by two query embedding functions $f_{do_{2}(X\rightarrow\mathbb{E}_{X}(X))}^{emb}$ and $f_{do_{2}(X\rightarrow\mathbb{E}_{X}(M))}^{emb}$:

$$\begin{aligned} \mathbb{E}_{X}(X) &= \sum_{X=x} P\big(X=x \mid f_{do_{2}(X\rightarrow\mathbb{E}_{X}(X))}^{emb}(X)\big)\, x \\ \mathbb{E}_{X}(M) &= \sum_{M=m} P\big(M=m \mid f_{do_{2}(X\rightarrow\mathbb{E}_{X}(M))}^{emb}(X)\big)\, m \end{aligned} \qquad (12)$$

As shown in Fig.3(c), we employ the classical attention layer to estimate the expectations, denoted as $\mathbb{E}_{X}(X) = \mathrm{Attention}(X,D_{X},D_{X})$ and $\mathbb{E}_{X}(M) = \mathrm{Attention}(X,X,X)$. Different from the above two blocks, we utilize a global dictionary $D_{X}$, initialized by K-means clustering over all sample features of the training dataset, to generate the keys and values. In the $l$-th layer, the concatenated expectations are fed into a multi-layer perceptron to generate the output of the front-door block $C^{(l)}_{do_{2}}$. A sketch under the same assumptions as the previous two blocks:
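```python
class FrontDoorBlock(nn.Module):
    """Sketch of the front-door block: a global dictionary D_X (K-means
    centroids over training features) supplies keys/values for
    E_X(X) = Attention(X, D_X, D_X), while E_X(M) = Attention(X, X, X)."""

    def __init__(self, dim, num_classes, dictionary):  # dictionary: (K, dim)
        super().__init__()
        self.register_buffer("d_x", dictionary)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, x):                      # x: (B, N, D)
        d = self.d_x.unsqueeze(0).expand(x.size(0), -1, -1)
        e_x = attention(x, d, d)               # E_X(X) via the global dictionary
        e_m = attention(x, x, x)               # mediator expectation E_X(M)
        pooled = torch.cat([e_x.mean(1), e_m.mean(1)], dim=-1)
        return F.softmax(self.mlp(pooled), dim=-1), e_x  # C^(l)_do2; e_x feeds next layer
```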

Figure 4: Routing architecture.

3.2.4 Probability of Sufficient Cause

In our framework, each layer consists of three deconfounding blocks from different causal graphs. Re-examining Eq.3, we find that evaluating the total effect $P(Y|do(X))$ is essentially a search for the optimal deconfounding graph among the three causal graphs, while the other two causal graphs are sufficient causal conditions for the optimal solution. For example, during the game of building blocks, if we discover that both graphs $P_{2}$ and $P_{0}$ are ineffective, it naturally leads us to use graph $P_{1}$ as the optimal building method. Take $TE_{P_{1}} = P(y|do_{1}(x)) \cdot (ps_{P_{0}\rightarrow P_{1}} + ps_{P_{2}\rightarrow P_{1}})$ as an example. When both $ps_{P_{0}\rightarrow P_{1}}$ and $ps_{P_{2}\rightarrow P_{1}}$ have high values, indicating that both $P_{0}$ and $P_{2}$ are sufficient causes for $P_{1}$, we consider $P_{1}$ to be the optimal deconfounding method for achieving $P(Y|do(X))$. Therefore, in this paper, we calculate the weight of causal graph $P_{1}$ to approximate the probability of sufficient cause when $P_{1}$ is the optimal solution. Similarly, we approximate the probabilities of sufficient cause for $P_{2}$ and $P_{0}$. We have:

$$C^{(l)} = \sum^{2}_{i=0} [f^{norm}(w^{(l)})]_{i} \cdot C^{(l)}_{do_{i}} \qquad (13)$$

where $w^{(l)}$ is the weight vector of the $l$-th layer. Each element $w^{(l)}_{i}$ ($i=0,1,2$) reflects the probability that the $i$-th deconfounding block is the optimal solution for the $l$-th layer. $f^{norm}(\cdot)$ denotes the normalization function (described in Optimization), $[\cdot]_{i}$ denotes the $i$-th element of a given vector, $C^{(l)}_{do_{i}}$ represents the output of the $i$-th deconfounding block in the $l$-th layer, and $C^{(l)}$ is the output of the $l$-th causal layer. In this work, we employ $L$ parallel causal layers and combine them as:

$$C = \sum^{L}_{l=1} [f^{norm}(w^{(c)})]_{l} \cdot C^{(l)} \qquad (14)$$

where $w^{(c)}$ is the layer-aware weight vector and $C$ represents the final output, computed as a weighted sum of all causal layers. Both $w^{(l)}$ and $w^{(c)}$ are learnable parameters initialized with equal constants, so routing weight learning starts without any prior bias towards a specific block or layer.

Stack. In the no-confounder block, the output expectation $\mathbb{E}_{X}(M)$ from the previous layer is used as the input $X$ for the current layer. In the back- and front-door adjustment blocks, the output expectation $\mathbb{E}_{X}(X)$ from the previous layer serves as the input $X$ for the current layer.

Optimization. To enable the dynamic fusion of causal blocks and causal layers, we design a sharpening softmax function to implement $f^{norm}(\cdot)$. Specifically, we equip the ordinary softmax with a temperature coefficient that decays as training converges:

$$[f^{norm}(\alpha)]_{i} = \frac{\exp(\log(\alpha_{i})/\tau)}{\sum_{j} \exp(\log(\alpha_{j})/\tau)} \qquad (15)$$

where $\alpha$ represents the weight vector normalized by an ordinary softmax, $\alpha_{i}$ is the $i$-th weight value, and $\tau$ denotes the temperature coefficient that sharpens the softmax. At the initial stage of training, $\tau$ is set to 1, so the sharpening softmax coincides with the regular softmax. As training progresses, $\tau$ gradually decreases; as it approaches 0, the sharpening softmax approaches the argmax function. With the designed sharpening softmax, the block- and layer-aware weight vectors can be optimized through back-propagation, yielding more pronounced weight differences after training. A minimal sketch covering Eqs.13-15 together is given below; the tensor shapes and the handling of the $\tau$ schedule (left to the caller) are our assumptions.
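```python
def sharpening_softmax(w, tau):
    # Eq. 15: alpha = softmax(w); re-normalize log(alpha) with temperature tau.
    # tau = 1 recovers alpha itself; tau -> 0 approaches a one-hot argmax.
    alpha = F.softmax(w, dim=-1)
    return F.softmax(torch.log(alpha) / tau, dim=-1)

class CausalRouting(nn.Module):
    """Sketch of Eqs. 13-14: learnable block weights w^(l) and layer weights
    w^(c), initialized with equal constants, fuse the deconfounding outputs."""

    def __init__(self, num_layers, num_blocks=3):
        super().__init__()
        self.w_block = nn.Parameter(torch.ones(num_layers, num_blocks))
        self.w_layer = nn.Parameter(torch.ones(num_layers))

    def forward(self, block_outputs, tau):
        # block_outputs: list of num_layers tensors, each (num_blocks, B, C).
        layer_outs = [
            torch.einsum("k,kbc->bc", sharpening_softmax(self.w_block[l], tau), outs)
            for l, outs in enumerate(block_outputs)   # Eq. 13 within each layer
        ]
        w = sharpening_softmax(self.w_layer, tau)     # Eq. 14 across layers
        return sum(w[l] * c for l, c in enumerate(layer_outs))  # final output C
```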

3.3 Application to Our Framework

Visual Question Answering (VQA) aims to predict an answer for a given question and image [1]. In this task, $X$ represents the input image-question pairs (e.g., an image involving “church” and the corresponding question “what days might I most commonly go to this building?”). $Y$ represents the output predicted answers (e.g., “sunday”). $M$ is the mediator extracted from $X$, which refers to question-attended visual regions or attributes (e.g., a visual region involving “church” and the attribute “building”). Additionally, in the front-door adjustment, $Z$ denotes the unobservable confounder, while in the back-door adjustment, $Z$ denotes the observable confounder, which refers to question-attended external knowledge (e.g., $\langle church, RelatedTo, building\rangle$ and $\langle church, RelatedTo, sunday\rangle$). This is because external knowledge comprises both “good” language context and “bad” language bias [21]. We use $L=6$ parallel causal layers in the VQA task.

Long Document Classification (LDC) aims to classify a given long document [7]. In this task, $X$ represents the input document collection (e.g., a legal documentation set). $Y$ represents the output classification results (e.g., “legal” or “politics”). $M$ refers to segments extracted from the document (e.g., the segment “…but in practice it too often becomes tyranny…” indicates the label “politics”). Similarly, $Z$ in the front-door adjustment denotes the unobservable confounder, while $Z$ in the back-door adjustment denotes the observable confounder, which refers to the high-frequency words in each document. We use $L=2$ parallel causal layers in the LDC task.

The confounder extraction process for back-door adjustment is explained in Section 4.2.

4 Experiment

4.1 Datasets and Metrics

VQA2.0 [1] is a widely-used benchmark VQA dataset, which uses images from MS-COCO. It comprises 443,757, 214,254, and 447,793 samples for training, validation, and testing, respectively. Every image is paired with around 3 questions, and each question has 10 reference answers. We adopt the soft VQA accuracy [1], reported per question type and overall, as the evaluation metric.

ECtHR [5] is a popular dataset for the long document classification task. It comprises European Court of Human Rights cases, with annotations provided for paragraph-level rationales. The dataset consists of 11,000 ECtHR cases, where each case is associated with one or more provisions of the convention allegedly violated. The ECtHR dataset is divided into 8,866, 973, and 986 samples for training, validation, and testing, respectively. It is used to evaluate the performance of our framework on a multi-label classification task. Evaluation metrics include micro/macro average F1 scores and accuracy on the test set.

20 NewsGroups [29] is a popular dataset for the long document classification task. It consists of approximately 20,000 newsgroup documents that are evenly distributed across 20 different news topics. The dataset includes 10,314, 1,000, and 1,000 samples for training, validation, and testing, respectively. It is used to evaluate the performance of our framework on a multi-class classification task. We report the performance with the accuracy as the evaluation metric.

4.2 Implementation

Image and Text Processing. For the VQA task, we employ the image encoder of BLIP-2 [14] to extract grid features. We preprocess the question text and the knowledge text obtained for the back-door adjustment by lower-casing, tokenizing the sentences, and removing special symbols. We truncate each sentence to a maximum length of 14 words and utilize the text encoder of CLIP ViT-L/14 [24] to extract word features, followed by a single-layer LSTM encoder with a hidden dimension of 512.

For the LDC task, we define the maximum sequence length as 4,096 tokens. We split a long document into overlapping segments of 256 tokens. These segments have a 1/4 overlap between them. To extract text features, we utilize the pre-trained RoBERTa [18] as the text encoder.

Confounder Extraction. For the VQA task, we use the question-attended external knowledge as the observable confounder in the back-door adjustment. In this paper, we retrieve external knowledge from the ConceptNet knowledge base [26], which represents common sense using $\langle subject, relation, object\rangle$ triplets, such as $\langle church, RelatedTo, building\rangle$. Besides, ConceptNet provides a statistical weight for each triplet, ensuring reliable retrieval of information. Specifically, we first extract 3 types of query words: (1) object labels of images obtained by GLIP [16]; (2) OCR text of images obtained by the EasyOCR toolkit (https://github.com/JaidedAI/EasyOCR); (3) n-gram question entity phrases. All words are filtered using a part-of-speech restriction tool [17], and the filtered words are combined to form the query set for searching common sense in ConceptNet. We use the pre-trained MPNet [25] to encode the returned common-sense triplets and the given question. Then, we calculate the cosine similarity between the encoded triplets and questions and multiply it by the given statistical weight to obtain the final score of a triplet. We select the top-20 triplets as the observable confounder $Z$ for each image-question pair. A sketch of this scoring step is shown below; the sentence-transformers checkpoint name and the `(subject, relation, object, weight)` tuple format for the retrieved triplets are our assumptions.
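```python
from sentence_transformers import SentenceTransformer, util

# Illustrative sketch of the triplet-scoring step. `triplets` is assumed to be
# a list of (subject, relation, object, conceptnet_weight) tuples already
# retrieved for the query words; the checkpoint name stands in for
# "pre-trained MPNet" and is not necessarily the authors' exact choice.
encoder = SentenceTransformer("all-mpnet-base-v2")

def top_k_confounders(question, triplets, k=20):
    texts = [f"{s} {r} {o}" for s, r, o, _ in triplets]
    q_emb = encoder.encode(question, convert_to_tensor=True)
    t_emb = encoder.encode(texts, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, t_emb).squeeze(0)          # cosine similarity
    # Final score = similarity * ConceptNet statistical weight.
    scores = [sims[i].item() * w for i, (_, _, _, w) in enumerate(triplets)]
    ranked = sorted(zip(scores, texts), reverse=True)
    return [t for _, t in ranked[:k]]                     # top-20 -> confounder Z
```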

For the LDC task, we use the high-frequency words in each document as the observable confounder. Specifically, we employ the TF-IDF method to select the top-$M$ words in each long document ($M$ is set to 64 for ECtHR and 128 for 20 NewsGroups). TF-IDF measures the importance of a word within a document by weighting its frequency in that document against its frequency across all documents. A minimal sketch of this step follows; the use of English stop-word removal is our assumption.
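```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def top_m_words(documents, doc_index, m=64):
    """Return the top-m TF-IDF words of one document as its confounder set
    (m = 64 for ECtHR, 128 for 20 NewsGroups)."""
    vectorizer = TfidfVectorizer(stop_words="english")    # stop-word removal assumed
    tfidf = vectorizer.fit_transform(documents)           # (num_docs, vocab_size)
    row = tfidf[doc_index].toarray().ravel()              # scores for one document
    vocab = np.array(vectorizer.get_feature_names_out())
    return vocab[np.argsort(row)[::-1][:m]].tolist()
```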

Training Strategy. For the VQA task, we use the Adam optimizer with an initial learning rate of $1\times 10^{-4}$, which decays at epochs 10 and 12 with a decay rate of 0.5. We adopt a warm-up strategy for the initial 3 epochs, and the full model is trained for 13 epochs in total. The batch size is set to 64. For the LDC task, we use the AdamW optimizer with an initial learning rate of $2\times 10^{-5}$ and a linear decay schedule with a 10% warm-up of the total number of steps. The model needs about 16 epochs to converge. The batch size on each GPU is set to 2. Our method is implemented in PyTorch on two 3090Ti GPUs. A pure-Python sketch of the VQA learning-rate schedule is given below; the 0-indexed epoch handling is our assumption.
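```python
def lr_at_epoch(epoch, base_lr=1e-4, warmup_epochs=3):
    # Linear warm-up over the first 3 epochs, then decay by 0.5 at epochs 10 and 12.
    scale = min(1.0, (epoch + 1) / warmup_epochs)  # warm-up ramp
    if epoch >= 12:
        scale *= 0.25                              # both 0.5 decays applied
    elif epoch >= 10:
        scale *= 0.5
    return base_lr * scale

print([round(lr_at_epoch(e), 6) for e in range(13)])  # 13 training epochs
```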

Table 1: Comparison with the transformer-based models on VQA2.0.
Method Test-dev (overall / Yes/No / Num / Others) Test-std (overall)
Transformer[28] 69.53 86.25 50.70 59.90 69.82
DFAF[10] 70.22 86.09 53.32 60.49 70.34
ReGAT[15] 70.27 86.08 54.42 60.33 70.58
MCAN[36] 70.63 86.82 53.26 60.72 70.90
TRRNet[31] 70.80 - - - 71.20
Transformer+CATT[33] 70.95 87.40 53.45 61.3 71.27
AGAN[41] 71.16 86.87 54.29 61.56 71.50
MMNAS[35] 71.24 87.27 55.68 61.05 71.56
TRARS[42] 72.00 87.43 54.69 62.72 -
TRARS(16*16)[42] 72.62 88.11 55.33 63.31 72.93
CGR 75.46 90.24 57.16 67.01 75.47
Table 2: Comparison with the pre-trained large-scale models on VQA2.0.
Method Test-dev Test-std
LXMERT[27] 72.42 72.54
ERNIE-VIL[34] 72.62 72.85
UNITER[6] 72.70 72.91
12IN1[19] - 72.92
LXMERT+CATT[33] 72.81 73.04
LXMERT+CATT(large)[33] 73.54 73.63
VILLA[9] 73.59 73.67
UNITER(large)[6] 73.82 74.02
VILLA(large)[9] 74.69 74.87
ERNIE-VIL(large)[34] 74.95 75.10
CGR 75.46 75.47

4.3 Results

Visual Question Answering. We report the performance of our framework on the VQA task against transformer-based models (Tab.1) and pre-trained large-scale models (Tab.2), where test-dev and test-std are the online development-test and standard-test splits, respectively. As shown in Tab.1, our CGR significantly outperforms all transformer-based models across all metrics. Specifically, our method outperforms the best competitor, TRARS(16*16), by a relative 3.91% and 3.48% under the test-dev and test-std metrics, respectively, which validates the effectiveness of the proposed stacked deconfounding method. Meanwhile, the pre-trained large-scale models also use the “stacked” mechanism, stacking multiple self-attention layers for the VQA task. However, they ignore the negative impact of confounding, i.e., the spurious correlation between input and output variables. Our method builds multiple deconfounding layers to eliminate this spurious correlation. The better results in Tab.2 showcase our advantage.

With less computation cost, our CGR achieves over 0.4% improvement against the best competitor among the pre-trained large-scale models, ERNIE-VIL(large). Remarkably, CGR is equipped with just 3 deconfounding methods across 6 causal layers, while ERNIE-VIL(large) relies on a much larger number of attention layers (24 textual layers + 6 visual layers with 16 heads each). Moreover, the causation community can offer numerous powerful deconfounding methods to further enhance our framework. Hence, CGR has great potential for building a “causal” pre-trained large-scale model that handles a wide range of tasks, which would greatly enhance machines’ comprehension of causal relationships within a broader semantic space.

Table 3: Comparison with the state-of-the-arts on ECtHR and 20 NewsGroups.
ECtHR
Method Macro-F1 Micro-F1
RoBERTa[18] 68.9 77.3
CaseLaw-BERT[40] 70.3 78.8
BigBird[38] 70.9 78.8
DeBERTa[11] 71.0 78.8
Longformer[2] 71.7 79.4
BERT[8] 73.4 79.7
Legal-BERT[4] 74.7 80.4
Hi-Transformer(RoBERTa)[7] 76.5 81.1
CGR 76.6 81.3
20 NewsGroups
Method Accuracy
RoBERTa[18] 83.8
BERT[8] 85.3
Hi-Transformer(RoBERTa)[7] 85.6
CGR 86.5

Long Document Classification. We compare the proposed framework with state-of-the-art methods for the LDC task on the ECtHR and 20 NewsGroups datasets (Tab.3). Our method consistently achieves better performance across all metrics, which indicates that our deconfounding strategy remains effective on challenging multi-class and multi-label NLP tasks. Faced with complex long texts, our CGR helps uncover potential cause-effect relationships and improves model performance through multiple intervention routing. Moreover, CGR outperforms Legal-BERT by a relative 2.54% under the Macro-F1 score, suggesting that our method has advantages in deconfounding domain-specific knowledge.

4.4 Ablation Studies

We further validate the efficacy of the proposed framework by assessing several variants: 1) One deconfounding block reserved per causal layer: in our method, each causal layer consists of three deconfounding blocks. To verify their effectiveness, we retain only one block in each layer, i.e., no confounder, back-door adjustment, or front-door adjustment, respectively, and calculate their average performance for comparison. 2) Two deconfounding blocks reserved per causal layer: similarly, we retain two blocks in each layer and calculate their average performance for comparison. 3) An alternative strategy for calculating sufficient cause: we designed the sharpening softmax function to calculate the deconfounding-block weights for the sufficient-cause approximation; here we remove the sharpening mechanism and adopt the ordinary softmax in Eq.15. Tab.4 reports the performance of the ablation studies on the VQA and LDC tasks. Our full framework outperforms all variants, which shows the advantages of deconfounding from diverse causal graphs and of our sufficient-cause approximation method.

Table 4: Ablation studies on VQA2.0 and 20 NewsGroups.
Method VQA2.0 20 NewsGroups
One deconfounding block reserved 69.35 84.40
Two deconfounding blocks reserved 70.03 85.00
CGR w/o sharpen softmax 70.91 86.30
CGR 71.20 86.50

4.5 Qualitative Analysis

Fig.5 shows two qualitative examples of our method on the VQA and LDC tasks. To provide insight into which causal graph is dominant, we present the probabilities of sufficient cause in all layers, which reveal the explicit causal routing path within the framework. In the first example, we observe that the front-door adjustment in the first layer dominates the answer inference, helping the model avoid unseen confounding effects such as dataset bias. As the routing progresses, the weight of the back-door adjustment increases significantly, suggesting that the model starts to focus on how to use external knowledge without confounding for the answer inference.

Figure 5: Qualitative examples from our method on VQA and LDC tasks. We present the probabilities of sufficient cause for all blocks in each layer and the corresponding confounder Z. The maximum value of each layer is highlighted with a red box.

4.6 Conclusion

In this paper, we propose the novel Causal Graph Routing (CGR) framework, the first integrated causal scheme relying entirely on intervention mechanisms to address the need for deconfounding from diverse causal graphs. Specifically, CGR is composed of a stack of causal layers, each of which includes a set of parallel deconfounding blocks from different causal graphs. We propose the concept of sufficient cause, which chains together multiple deconfounding methods and allows the model to dynamically select the suitable deconfounding method in each layer. CGR is implemented as stacked networks. Experiments show that our method surpasses current state-of-the-art methods on both VQA and LDC tasks. CGR has great potential for building a “causal” pre-trained large-scale model. We plan to extend CGR with more powerful deconfounding methods and apply it to other tasks to reveal the cause-effect forces hidden in data.

References

  • [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: visual question answering. In ICCV, pages 2425–2433, 2015.
  • [2] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
  • [3] H. Cai, R. Song, and W. Lu. ANOCE: analysis of causal effects with multiple mediators via constrained structural learning. In ICLR, 2021.
  • [4] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, and I. Androutsopoulos. LEGAL-BERT: The muppets straight out of law school. In Findings of EMNLP, 2020.
  • [5] I. Chalkidis, M. Fergadiotis, D. Tsarapatsanis, N. Aletras, I. Androutsopoulos, and P. Malakasiotis. Paragraph-level rationale extraction through regularization: A case study on european court of human rights cases. In NAACL-HLT, pages 226–241, 2021.
  • [6] Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu. UNITER: universal image-text representation learning. In ECCV, volume 12375, pages 104–120, 2020.
  • [7] X. Dai, I. Chalkidis, S. Darkner, and D. Elliott. Revisiting transformer-based models for long document classification. In EMNLP, pages 7212–7230, 2022.
  • [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186, 2019.
  • [9] Z. Gan, Y. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu. Large-scale adversarial training for vision-and-language representation learning. In NeurIPS, 2020.
  • [10] P. Gao, Z. Jiang, H. You, P. Lu, S. C. H. Hoi, X. Wang, and H. Li. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In CVPR, pages 6639–6648, 2019.
  • [11] P. He, X. Liu, J. Gao, and W. Chen. Deberta: decoding-enhanced bert with disentangled attention. In ICLR, 2021.
  • [12] J. Huang, Y. Qin, J. Qi, Q. Sun, and H. Zhang. Deconfounded visual grounding. In AAAI, pages 998–1006, 2022.
  • [13] A. Jaber, A. H. Ribeiro, J. Zhang, and E. Bareinboim. Causal identification under markov equivalence: Calculus, algorithm, and completeness. In NeurIPS, 2022.
  • [14] J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, volume 202, pages 19730–19742, 2023.
  • [15] L. Li, Z. Gan, Y. Cheng, and J. Liu. Relation-aware graph attention network for visual question answering. In ICCV, pages 10312–10321, 2019.
  • [16] L. Harold Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J. Hwang, K. Chang, and J. Gao. Grounded language-image pre-training. In CVPR, pages 10955–10965, 2022.
  • [17] B. Yuchen Lin, X. Chen, J. Chen, and X. Ren. Kagnet: Knowledge-aware graph networks for commonsense reasoning. In EMNLP, pages 2829–2839, 2019.
  • [18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • [19] J. Lu, V. Goswami, M. Rohrbach, D. Parikh, and S. Lee. 12-in-1: Multi-task vision and language representation learning. In CVPR, pages 10437–10446, 2020.
  • [20] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, pages 3195–3204, 2019.
  • [21] Y. Niu, K. Tang, H. Zhang, Z. Lu, X. Hua, and J. Wen. Counterfactual VQA: A cause-effect look at language bias. In CVPR, pages 12700–12710, 2021.
  • [22] J. Pearl and D. Mackenzie. The book of why: the new science of cause and effect. Basic books, 2018.
  • [23] S. Peng, X. Hu, R. Zhang, K. Tang, J. Guo, Q. Yi, R. Chen, X. Zhang, Z. Du, L. Li, Q. Guo, and Y. Chen. Causality-driven hierarchical structure discovery for reinforcement learning. In NeurIPS, 2022.
  • [24] A. Radford, J. Wook Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, volume 139, pages 8748–8763, 2021.
  • [25] K. Song, X. Tan, T. Qin, J. Lu, and T. Liu. Mpnet: Masked and permuted pre-training for language understanding. In NeurIPS, 2020.
  • [26] R. Speer, J. Chin, and C. Havasi. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI, pages 4444–4451, 2017.
  • [27] H. Tan and M. Bansal. LXMERT: learning cross-modality encoder representations from transformers. In EMNLP, pages 5099–5110, 2019.
  • [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
  • [29] Y. Wahba, N. H. Madhavji, and J. Steinbacher. A comparison of SVM against pre-trained language models (plms) for text classification tasks. In LOD, pages 304–313, 2022.
  • [30] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057, 2015.
  • [31] X. Yang, G. Lin, F. Lv, and F. Liu. Trrnet: Tiered relation reasoning for compositional visual question answering. In ECCV, volume 12366, pages 414–430, 2020.
  • [32] X. Yang, H. Zhang, and J. Cai. Deconfounded image captioning: A causal retrospect. IEEE Trans. Pattern Anal. Mach. Intell., 45(11):12996–13010, 2023.
  • [33] X. Yang, H. Zhang, G. Qi, and J. Cai. Causal attention for vision-language tasks. In CVPR, pages 9847–9857, 2021.
  • [34] F. Yu, J. Tang, W. Yin, Y. Sun, H. Tian, H. Wu, and H. Wang. Ernie-vil: Knowledge enhanced vision-language representations through scene graphs. In AAAI, pages 3208–3216, 2021.
  • [35] Z. Yu, Y. Cui, J. Yu, M. Wang, D. Tao, and Q. Tian. Deep multimodal neural architecture search. In ACM MM, pages 3743–3752, 2020.
  • [36] Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian. Deep modular co-attention networks for visual question answering. In CVPR, pages 6281–6290, 2019.
  • [37] S. Yuan, D. Yang, J. Liu, S. Tian, J. Liang, Y. Xiao, and R. Xie. Causality-aware concept extraction based on knowledge-guided prompting. In ACL, pages 9255–9272, 2023.
  • [38] M. Zaheer, G. Guruganesh, K. Avinava Dubey, J. Ainslie, C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big bird: Transformers for longer sequences. In NeurIPS, 2020.
  • [39] C. Zang, H. Wang, M. Pei, and W. Liang. Discovering the real association: Multimodal causal reasoning in video question answering. In CVPR, pages 19027–19036, 2023.
  • [40] L. Zheng, N. Guha, B. R. Anderson, P. Henderson, and D. E. Ho. When does pretraining help? Assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings. In ICAIL, pages 159–168, 2021.
  • [41] Y. Zhou, R. Ji, X. Sun, G. Luo, X. Hong, J. Su, X. Ding, and L. Shao. K-armed bandit based multi-modal network architecture search for visual question answering. In ACM MM, pages 1245–1254, 2020.
  • [42] Y. Zhou, T. Ren, C. Zhu, X. Sun, J. Liu, X. Ding, M. Xu, and R. Ji. TRAR: routing the attention spans in transformer for visual question answering. In ICCV, pages 2054–2064, 2021.