
Fluctuation-response theorem for Kullback-Leibler divergences to quantify causation

Andrea Auconi¹, Benjamin M. Friedrich¹, Andrea Giansanti²
¹ cfaed, Technische Universität Dresden, 01069 Dresden, Germany
² Dipartimento di Fisica, Sapienza Università di Roma, 00185 Rome, Italy
Abstract

We define a new measure of causation from a fluctuation-response theorem for Kullback-Leibler divergences, based on the information-theoretic cost of perturbations. This information response has both the invariance properties required for an information-theoretic measure and the physical interpretation of a propagation of perturbations. In linear systems, the information response reduces to the transfer entropy, providing a connection between Fisher and mutual information.

In the general framework of stochastic dynamical systems, the term causation refers to the influence that a variable $x$ exerts over the dynamics of another variable $y$. Measures of causation find application in neuroscience [1], climate studies [2], cancer research [3], and finance [4]. However, a widely accepted quantitative definition of causation is still missing.

Causation manifests itself in two inseparable forms: information flow [5, 6, 7, 8], and propagation of perturbations [9, 10, 11, 12]. Ideally, a quantitative measure of causation should connect both perspectives.

Information flow is commonly quantified by the transfer entropy [13, 14, 15, 16, 17], that is, the average conditional mutual information corresponding to the uncertainty reduction in forecasting the time evolution of $y$ that is achieved upon knowledge of $x$. The mutual information is a special case of Kullback-Leibler (KL) divergence, a dimensionless measure of distinguishability between probability distributions [18]. As such, the transfer entropy abstracts from the underlying physics to give an invariant description in terms of the strength of probabilistic dependencies.

From the interventional point of view [9, 10, 11, 12], causation is identified with how a perturbation applied to $x$ propagates through the system to affect $y$. Although a direct perturbation of observables is unfeasible in most real-world situations, the fluctuation-response theorem establishes a connection between the response to a small perturbation and the correlation of fluctuations in the natural (unperturbed) dynamics [19, 20, 21, 22].

The fluctuation-response theorem considers the first-order expansion of the response with respect to the perturbation. The corresponding linear response coefficient has been suggested as a measure of causation [12, 11]. However, it has the same physical units as $y/x$, and it can assume negative values; thus, it is not directly related to any information-theoretic measure.

In stochastic dynamical systems with nonlinear interactions, perturbing $x$ may not only affect the evolution of the expectation value of $y$, but it may also affect the evolution of the variance of $y$, and in fact its entire probability distribution. The KL divergence from the natural to the perturbed probability densities has recently been identified as the universal upper bound to the physical response of any observable relative to its natural fluctuations [23].

In this Letter, we define a new measure of causation in the form of a linear response coefficient between KL divergences, which we would like to call the information response. In particular, we consider the ratio of two KL divergences, one for the response and one for the perturbation, where the latter represents an information-theoretic cost of the perturbation. For small perturbations, we formulate a fluctuation-response theorem that expresses this ratio as a ratio of Fisher information.

In linear systems, this new information response reduces to the transfer entropy, which provides a connection between Fisher and mutual information, and thus a connection between fluctuation-response theory and information flows.

Kullback-Leibler (KL) divergence.

Consider two probability distributions $p(w)$ and $q(w)$ of a random variable $w$. The KL divergence from $q(w)$ to $p(w)$ is defined as

$$ D\left[p(w)\,\big\|\,q(w)\right] \equiv \int dw\, p(w)\ln\!\left(\frac{p(w)}{q(w)}\right); \qquad (1) $$

it is not symmetric in its arguments, and non-negative. Importantly, it is invariant under invertible transformations $w \rightarrow w^{\prime}$ [18], namely $D\left[p(w)\,\|\,q(w)\right] = D\left[p(w^{\prime})\,\|\,q(w^{\prime})\right]$.
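For later use in the linear (Gaussian) examples below, it is convenient to recall the textbook closed form of the KL divergence between two univariate Gaussians [18]:

$$ D\left[\mathcal{N}(\mu_1,\sigma_1^2)\,\big\|\,\mathcal{N}(\mu_2,\sigma_2^2)\right] = \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}. $$

For equal variances this reduces to $(\mu_1-\mu_2)^2/(2\sigma^2)$; these elementary expressions underlie the Gaussian calculations of Eqs. (11), (12), and (16) below.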

The problem of causation.

Consider a stochastic system of $n$ variables evolving with ergodic Markovian dynamics. Our goal is to define a quantitative measure of causation, i.e., the influence that a variable $x$ exerts over the dynamics of another variable $y$. We want this definition to have both the invariance property of KL divergences, and the physical interpretation of a propagation of perturbations.

Since the dynamics is ergodic, and therefore stationary, it suffices to consider the stochastic variables $x_0 \equiv x(t=0)$ and $y_0 \equiv y(t=0)$ at $t=0$, and $y_\tau \equiv y(t=\tau)$ a time interval $\tau$ later. To avoid cluttered notation, we will implicitly assume that the current values of the remaining $n-2$ variables are absorbed into $y_0$, e.g., $p(y_\tau|y_0) \equiv p(y_\tau|y_0,z_0)$. Conditioning on $z_0$ prevents confounding variables in $z$ from introducing spurious causal links between $x$ and $y$ [24].

Local response divergence.

Let us consider the system at $t=0$ with steady-state distribution $p(x_0,y_0)$. We make an ideal measurement of its actual state $(x_0,y_0)$. Immediately after the measurement, we perturb the state by introducing a small displacement $\epsilon>0$ of the variable $x$, namely $x_0 \Rightarrow x_0+\epsilon$. If the effect of this perturbation propagates to $y$, then it is reflected in the KL divergence from the natural to the perturbed prediction

$$ d^{x\rightarrow y}_{\tau}\left(x_0,y_0,\epsilon\right) \equiv D\left[p\left(y_\tau\,\big|\,x_0,y_0;\,x_0\Rightarrow x_0+\epsilon\right)\,\big\|\,p\left(y_\tau\,\big|\,x_0,y_0\right)\right], \qquad (2) $$

which is a function of the local condition $(x_0,y_0)$ and the perturbation strength $\epsilon$. We name it local response divergence, and denote its ensemble average by $\left\langle d^{x\rightarrow y}_{\tau}(x_0,y_0,\epsilon)\right\rangle$.

The concept of causation, interpreted in the framework of fluctuation-response theory, is only meaningful with respect to an arrow of time. This amounts to postulating that the perturbation cannot have effects at past times,

$$ p\left(y_\tau\,\big|\,x_0,y_0;\,x_0\Rightarrow x_0+\epsilon\right) \equiv \begin{cases} p\left(y_\tau\,\big|\,x_0+\epsilon,y_0\right) & \text{for } \tau\geq 0,\\ p\left(y_\tau\,\big|\,x_0,y_0\right) & \text{for } \tau<0. \end{cases} \qquad (3) $$

In writing the conditional probability $p\left(y_\tau\,\big|\,x_0+\epsilon,y_0\right)$, we implicitly assumed $p\left(x_0+\epsilon,y_0\right)>0$, meaning that the condition provoked by the perturbation is possible under the natural statistics. This implies that the response statistics can be predicted without actually perturbing the system, which is the main idea of fluctuation-response theory [19, 20, 21, 22].

Information-theoretic cost.

The mean local response divergence $\left\langle d^{x\rightarrow y}_{\tau}(x_0,y_0,\epsilon)\right\rangle$, like any response function in fluctuation-response theory, is defined in relation to a perturbation, irrespective of how difficult it may be to perform this perturbation. Intuitively, we expect that it takes more effort to perturb those variables that fluctuate less. Therefore, we consider the KL divergence from the natural to the perturbed ensemble of conditions

$$ c_x(\epsilon) \equiv D\left[p(x_0-\epsilon,y_0)\,\big\|\,p(x_0,y_0)\right], \qquad (4) $$

to quantify the information-theoretic cost of perturbations, and call it perturbation divergence.

For example, for an underdamped Brownian particle, the perturbation divergence is equivalent to the average thermodynamic work required to perform an $\epsilon$ perturbation of its velocity, up to a factor given by the temperature, see Supplementary Information (SI). For an equilibrium ensemble in a potential $U(x)$, with Boltzmann distribution $p(x)\sim\exp(-\beta U(x))$, the perturbation divergence is the average reversible work $c_x(\epsilon)=\beta\left\langle U(x+\epsilon)-U(x)\right\rangle$. Note that the definition of Eq. (4) is general, and can be applied to more abstract models where thermodynamic quantities are not clearly identified.
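As an elementary check of the Boltzmann statement above (the general discussion is in the SI), note that the normalization constant cancels inside the logarithm:

$$ D\left[p(x-\epsilon)\,\big\|\,p(x)\right] = \int dx\, p(x-\epsilon)\,\beta\left[U(x)-U(x-\epsilon)\right] = \int dx'\, p(x')\,\beta\left[U(x'+\epsilon)-U(x')\right] = \beta\left\langle U(x+\epsilon)-U(x)\right\rangle, $$

where the substitution $x' = x-\epsilon$ was used in the second step.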

Information response.

We introduce the information response as the ratio between mean local response divergence and perturbation divergence, in the limit of a small perturbation

$$ \Gamma^{x\rightarrow y}_{\tau} \equiv \lim_{\epsilon\rightarrow 0}\frac{\left\langle d^{x\rightarrow y}_{\tau}(x_0,y_0,\epsilon)\right\rangle}{c_x(\epsilon)}. \qquad (5) $$

We can interpret $\Gamma^{x\rightarrow y}_{\tau}$ as an information-theoretic linear response coefficient. This information response is our measure of $x\rightarrow y$ causation with respect to the timescale $\tau$, see Fig. 1. The time arrow requirement (Eq. (3)) implies $\Gamma^{x\rightarrow y}_{\tau}=0$ for $\tau<0$.

Introducing the local information response $\gamma^{x\rightarrow y}_{\tau}(x_0,y_0) \equiv \lim_{\epsilon\rightarrow 0} d^{x\rightarrow y}_{\tau}(x_0,y_0,\epsilon)/c_x(\epsilon)$, we can equivalently write $\Gamma^{x\rightarrow y}_{\tau} = \left\langle\gamma^{x\rightarrow y}_{\tau}(x_0,y_0)\right\rangle$.

Figure 1: Here we show, on a concrete example, the origin of the two KL divergences entering the information response of Eq. (5). (Upper) Response to the perturbation $x_0\Rightarrow x_0+\epsilon$ at the trajectory level. $x^{*}_{t}$ ($y^{*}_{t}$) is the perturbed trajectory of $x_t$ ($y_t$), for the same noise realization. (Lower Left) Local response divergence $d^{x\rightarrow y}_{\tau}(x_0,y_0,\epsilon)$: change of the predicted distribution of $y_\tau$ for the condition $(x_0,y_0)$ for a timescale $\tau=3$. (Lower Right) Perturbation divergence $c_x(\epsilon)$: instantaneous displacement of the steady-state ensemble conditional on a particular $y_0$. The dynamics follows the nonlinear stochastic model of Eq. (17) with parameters $t_R=10$, $q=0.1$, $\alpha=0.5$, $\beta=0.2$, for a perturbation $\epsilon=0.25$.

The information response in the form of Eq. (5) inherently relies on the concept of controlled perturbations. We can reformulate it in purely observational form, in the spirit of the fluctuation-response theorem [19, 20, 21, 22], provided $p(x_0,y_0,y_\tau)$ is sufficiently smooth.

Fisher information.

The one-parameter family $\{p(y_\tau|x_0,y_0)\}_{x_0}$ of probability densities parametrized by $x_0$ (for fixed $y_0$) can be equipped with a Riemannian metric having $d^{x\rightarrow y}_{\tau}(x_0,y_0,\epsilon)$ as squared line element. In fact, the leading order term in the Taylor expansion of a KL divergence between probabilities that differ only by a small perturbation of a parameter is of second order, with coefficients known as Fisher information [18, 25]. Explicitly, expanding the mean response divergence for $\tau>0$, we obtain

$$ \left\langle d^{x\rightarrow y}_{\tau}(x_0,y_0,\epsilon)\right\rangle = -\frac{1}{2}\epsilon^{2}\left\langle\partial^{2}_{x_0}\ln p(y_\tau|x_0,y_0)\right\rangle + \mathcal{O}(\epsilon^{3}), \qquad (6) $$

where we used the interventional causality requirement (Eq. (3)), and probability normalization. Similarly, for the perturbation divergence we have

$$ c_x(\epsilon) = -\frac{1}{2}\epsilon^{2}\left\langle\partial^{2}_{x_0}\ln p(x_0|y_0)\right\rangle + \mathcal{O}(\epsilon^{3}). \qquad (7) $$
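For completeness, we recall the standard one-parameter expansion behind both Eq. (6) and Eq. (7) [18, 25]: for a family $p_\theta(w)$ and a small parameter shift $\theta\rightarrow\theta+\epsilon$,

$$ D\left[p_{\theta+\epsilon}\,\big\|\,p_{\theta}\right] = \frac{\epsilon^{2}}{2}\left\langle\left(\partial_{\theta}\ln p_{\theta}\right)^{2}\right\rangle + \mathcal{O}(\epsilon^{3}) = -\frac{\epsilon^{2}}{2}\left\langle\partial^{2}_{\theta}\ln p_{\theta}\right\rangle + \mathcal{O}(\epsilon^{3}), $$

where the first-order term vanishes because $\int dw\,\partial_\theta p_\theta = \partial_\theta\int dw\, p_\theta = 0$, and the two equivalent forms of the Fisher information follow from the same normalization argument. Eq. (6) is obtained with $\theta=x_0$ at fixed $(x_0,y_0)$, averaged over $p(x_0,y_0)$; Eq. (7) is obtained by expanding Eq. (4), where $\ln p(x_0,y_0)=\ln p(x_0|y_0)+\ln p(y_0)$ leaves only the conditional term under $\partial^{2}_{x_0}$.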

Applying the Fisher information representation to the information response, we get for $\tau>0$

$$ \Gamma^{x\rightarrow y}_{\tau} = \frac{\left\langle\partial^{2}_{x_0}\ln p\left(y_\tau|x_0,y_0\right)\right\rangle}{\left\langle\partial^{2}_{x_0}\ln p\left(x_0|y_0\right)\right\rangle}, \qquad (8) $$

which is the fluctuation-response theorem for KL divergences. For generalizations and a discussion of the connection with the classical fluctuation-response theorem see [26] and the SI text. Eq. (8) is a ratio of averaged second derivatives with respect to the same physical variable $x_0$, and it can be regarded as an application of L'Hôpital's rule to Eq. (5).

In general, Fisher information is not easily connected to Shannon entropy and mutual information [27]. Below, we show that for linear stochastic systems, the information response, which is a ratio of Fisher information (Eq. (8)), is equivalent to the transfer entropy, a conditional form of mutual information.

Transfer entropy.

The most widely used measure of information flow is the conditional mutual information

$$ T^{x\rightarrow y}_{\tau} \equiv \left\langle D\left[p\left(x_0,y_\tau\,\big|\,y_0\right)\,\big\|\,p\left(x_0\,\big|\,y_0\right)p\left(y_\tau\,\big|\,y_0\right)\right]\right\rangle, \qquad (9) $$

which is generally called transfer entropy [13, 14, 15, 16, 17]. It is the average KL divergence from conditional independence of $x_0$ and $y_\tau$ given $y_0$.

The transfer entropy is used in nonequilibrium thermodynamics of measurement-feedback systems, where it is related to work extraction and dissipation through fluctuation theorems [28, 16, 29]; in data science, causal network reconstruction from time series is based on statistical significance tests for the presence of transfer entropy [24].

If uncertainty is measured by the Shannon entropy $S[p(x)] = -\int dx\, p(x)\ln p(x)$, then the transfer entropy quantifies how much, on average, the uncertainty in predicting $y_\tau$ from $y_0$ decreases if we additionally get to know $x_0$, $T^{x\rightarrow y}_{\tau} = \left\langle S\left[p\left(y_\tau|y_0\right)\right] - S\left[p\left(y_\tau|x_0,y_0\right)\right]\right\rangle$.

While the joint probability $p\left(x_0,y_0,y_\tau\right)$ contains all the physics of the interacting dynamics of $x$ and $y$, the description in terms of the scalar transfer entropy $T^{x\rightarrow y}_{\tau}$ represents a form of coarse-graining.

We introduce the local transfer entropy $t^{x\rightarrow y}_{\tau}(x_0,y_0) = D\left[p(y_\tau|x_0,y_0)\,\|\,p(y_\tau|y_0)\right]$; thus, for the (macroscopic) transfer entropy, $T^{x\rightarrow y}_{\tau} = \left\langle t^{x\rightarrow y}_{\tau}(x_0,y_0)\right\rangle$.

We next show that $T^{x\rightarrow y}_{\tau}$ and $\Gamma^{x\rightarrow y}_{\tau}$ are intimately related for linear systems.

Linear stochastic dynamics.

As an example of application, we study the information response in Ornstein-Uhlenbeck (OU) processes [30], i.e., linear stochastic systems of the type

$$ \frac{d\xi^{(i)}_{t}}{dt} + \sum_{j=1}^{n} A_{ij}\,\xi^{(j)}_{t} = \eta^{(i)}_{t}, \qquad (10) $$

where $\left\langle\eta^{(i)}_{t}\eta^{(j)}_{t^{\prime}}\right\rangle = q_{ij}\,\delta(t-t^{\prime})$ is Gaussian white noise with symmetric and constant covariance matrix. For the system to be stationary, we require the eigenvalues of the interaction matrix $A_{ij}$ to have positive real part. For our setting, we identify $x\equiv\xi^{(i)}$ and $y\equiv\xi^{(j)}$ for some particular $(i,j)$, and $z\equiv\{\xi^{(k)}\}_{k=1,\ldots,n}\setminus\{\xi^{(i)},\xi^{(j)}\}$ as the remaining variables. Here, probability densities are normal distributions, $p(y_\tau|x_0,y_0) = \mathcal{N}_{y_\tau}(\langle y_\tau|x_0,y_0\rangle,\sigma^{2}_{y_\tau|x_0,y_0})$, with mean $\langle y_\tau|x_0,y_0\rangle$ and variance $\sigma^{2}_{y_\tau|x_0,y_0} \equiv \langle y_\tau^{2}|x_0,y_0\rangle - \langle y_\tau|x_0,y_0\rangle^{2}$, and similarly for $p(y_\tau|y_0)$ and $p(x_0|y_0)$. Expectations depend linearly on the conditions, $\partial^{2}_{x_0}\langle y_\tau|x_0,y_0\rangle = 0$, and variances are independent of them, $\partial_{x_0}\sigma^{2}_{y_\tau|x_0,y_0} = 0$. Recall the implicit conditioning on the confounding variables $z_0$ through $y_0$.

Applying these Gaussian properties to Eq. (8), the information response becomes:

$$ \Gamma^{x\rightarrow y}_{\tau} = \frac{\left(\partial_{x_0}\langle y_\tau|x_0,y_0\rangle\right)^{2}\sigma^{2}_{x_0|y_0}}{\sigma^{2}_{y_\tau|x_0,y_0}}, \qquad (11) $$

where $\partial_{x_0}\langle y_\tau|x_0,y_0\rangle$ can be interpreted as the coefficient of $x_0$ in the linear regression for $y_\tau$ based on the predictors $(x_0,y_0)$, and $\sigma^{2}_{y_\tau|x_0,y_0}$ as its error variance. The variance $\sigma^{2}_{x_0|y_0}$ quantifies the strength of the natural fluctuations of $x_0$ (the variable to be perturbed) conditional on $y_0$ (the other variables). In fact, the information-theoretic cost of the perturbation, $c_x(\epsilon) = \epsilon^{2}/(2\sigma^{2}_{x_0|y_0}) + \mathcal{O}(\epsilon^{3})$, is higher if $x_0$ and $y_0$ are more correlated.
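For reference, the Gaussian step from Eq. (8) to Eq. (11) amounts to the two identities

$$ \partial^{2}_{x_0}\ln p(y_\tau|x_0,y_0) = -\frac{\left(\partial_{x_0}\langle y_\tau|x_0,y_0\rangle\right)^{2}}{\sigma^{2}_{y_\tau|x_0,y_0}}, \qquad \partial^{2}_{x_0}\ln p(x_0|y_0) = -\frac{1}{\sigma^{2}_{x_0|y_0}}, $$

both constant for linear dynamics, so that the averages in Eq. (8) are trivial and their ratio gives Eq. (11).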

In linear systems, the transfer entropy is equivalent to Granger causality [31]

$$ T^{x\rightarrow y}_{\tau} = \ln\left(\frac{\sigma_{y_\tau|y_0}}{\sigma_{y_\tau|x_0,y_0}}\right), \qquad (12) $$

as can be seen by substituting the Gaussian expressions for $p(y_\tau|x_0,y_0)$ and $p(y_\tau|y_0)$ into Eq. (9).

The decrease in uncertainty upon adding the predictor $x_0$ to the linear regression of $y_\tau$ based on $y_0$ reads

$$ \sigma^{2}_{y_\tau|y_0} - \sigma^{2}_{y_\tau|x_0,y_0} = \sigma^{2}_{x_0|y_0}\left(\partial_{x_0}\langle y_\tau|x_0,y_0\rangle\right)^{2}, \qquad (13) $$

see SI text. Comparing Eq. (11) with Eq. (12) and using Eq. (13), we obtain a non-trivial equivalence between information response and transfer entropy for OU processes,

$$ \Gamma^{x\rightarrow y}_{\tau} = e^{2T^{x\rightarrow y}_{\tau}} - 1. \qquad (14) $$
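Eq. (14) can also be verified numerically. The following Python sketch is our own illustration (not the authors' code): it integrates the two-variable OU process of Eq. (15) below with an Euler-Maruyama scheme, estimates the Gaussian conditional variances and the regression coefficient from sample covariances, and compares Eq. (11) with $e^{2T^{x\rightarrow y}_{\tau}}-1$. Parameter values follow Fig. 2; the variable and function names are ours.

# Numerical check of Eq. (14) for the linear OU model of Eq. (15).
# Illustrative sketch only: Gaussianity is assumed, so conditional
# variances and the regression coefficient follow from sample covariances.
import numpy as np

rng = np.random.default_rng(0)
tR, q, alpha, beta = 10.0, 0.1, 0.5, 0.2      # parameters as in Fig. 2
dt, n_steps, tau = 0.01, 2_000_000, 3.0
lag = int(round(tau / dt))

# Euler-Maruyama integration of Eq. (15)
x = np.zeros(n_steps)
y = np.zeros(n_steps)
xi = rng.normal(0.0, np.sqrt(q * dt), n_steps - 1)   # noise increments
for t in range(n_steps - 1):
    x[t + 1] = x[t] - (x[t] / tR) * dt + xi[t]
    y[t + 1] = y[t] + (alpha * x[t] - beta * y[t]) * dt

# Stationary samples of (x_0, y_0, y_tau)
burn = n_steps // 10
x0, y0, yt = x[burn:-lag], y[burn:-lag], y[burn + lag:]
C = np.cov(np.vstack([x0, y0, yt]))                  # 3x3 sample covariance

def cond_var(C, i, idx):
    """Variance of component i conditional on the components in idx."""
    c = C[i, idx]
    return C[i, i] - c @ np.linalg.solve(C[np.ix_(idx, idx)], c)

var_yt_xy = cond_var(C, 2, [0, 1])     # sigma^2_{y_tau | x_0, y_0}
var_yt_y  = cond_var(C, 2, [1])        # sigma^2_{y_tau | y_0}
var_x_y   = cond_var(C, 0, [1])        # sigma^2_{x_0 | y_0}

# Regression coefficient of x_0 in the prediction of y_tau from (x_0, y_0)
b = np.linalg.solve(C[np.ix_([0, 1], [0, 1])], C[[0, 1], 2])
slope = b[0]                           # d<y_tau | x_0, y_0> / dx_0

Gamma = slope**2 * var_x_y / var_yt_xy          # Eq. (11)
T = 0.5 * np.log(var_yt_y / var_yt_xy)          # Eq. (12)
print(f"Gamma = {Gamma:.4f}, exp(2T) - 1 = {np.exp(2 * T) - 1:.4f}")

Because the process is Gaussian, sample covariances are sufficient statistics for all quantities involved; for the nonlinear model of Eq. (17) this shortcut is no longer available.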

Remarkably, despite the equivalence of the macroscopic quantities $\Gamma^{x\rightarrow y}_{\tau}$ and $T^{x\rightarrow y}_{\tau}$, the corresponding local quantities are markedly different, see Fig. 2.

Figure 2: Local information response (Left) and local transfer entropy (Right) are different, although their expectation values agree in linear systems. The model is the OU process of Eq. (15) with parameters $t_R=10$, $q=0.1$, $\alpha=0.5$, $\beta=0.2$, observed with timescale $\tau=3$.

In Fig. 2, we show the local information response $\gamma^{x\rightarrow y}_{\tau}(x_0,y_0)$ and the local transfer entropy $t^{x\rightarrow y}_{\tau}(x_0,y_0)$ for the hierarchical OU process of two variables

$$ \begin{cases} \dfrac{dx}{dt} = -\dfrac{x}{t_R} + \eta_t, \\ \dfrac{dy}{dt} = \alpha x - \beta y, \end{cases} \qquad (15) $$

with $\left\langle\eta_t\eta_{t^{\prime}}\right\rangle = q\,\delta(t-t^{\prime})$, and parameters $\alpha$, $\beta>0$, $t_R>0$, $q>0$. This is possibly the simplest model of nonequilibrium stationary interacting dynamics with continuous variables [32]. However, the pattern of Fig. 2 is qualitatively the same for any linear OU process. In fact, the perturbation $x_0\Rightarrow x_0+\epsilon$ shifts the prediction $p(y_\tau|x_0,y_0)$ by the same amount along the $y$ axis, $\epsilon\,\partial_{x_0}\langle y_\tau|x_0,y_0\rangle$, independently of the condition $(x_0,y_0)$, without affecting the variance $\sigma^{2}_{y_\tau|x_0,y_0}$. Hence, $d^{x\rightarrow y}_{\tau}(x_0,y_0,\epsilon)$ is constant in space, and the local contribution only reflects the density $p(x_0,y_0)$, here a bivariate Gaussian. By contrast, the KL divergence corresponding to the change of the prediction $p(y_\tau|y_0)$ into $p(y_\tau|x_0,y_0)$ given by the knowledge of $x_0$ is strongly dependent on $(x_0,y_0)$. In fact, the local transfer entropy reads

$$ t^{x\rightarrow y}_{\tau}(x_0,y_0) = T^{x\rightarrow y}_{\tau} + \frac{\left(\partial_{x_0}\langle y_\tau|x_0,y_0\rangle\right)^{2}}{2\sigma^{2}_{y_\tau|y_0}}\left[\left(x_0-\left\langle x_0|y_0\right\rangle\right)^{2} - \sigma^{2}_{x_0|y_0}\right], \qquad (16) $$

see SI text. In particular, for likely values $x_0\approx\left\langle x_0|y_0\right\rangle$, the divergence $t^{x\rightarrow y}_{\tau}(x_0,y_0)$ is smaller than in the unlikely situations $x_0\gg\left\langle x_0|y_0\right\rangle$ and $x_0\ll\left\langle x_0|y_0\right\rangle$. Thus, when multiplied by the steady-state density $p(x_0,y_0)$, $t^{x\rightarrow y}_{\tau}(x_0,y_0)$ attains a bimodal shape.
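For the reader's convenience, Eq. (16) can also be obtained directly from the closed form of the Gaussian KL divergence quoted after Eq. (1):

$$ t^{x\rightarrow y}_{\tau}(x_0,y_0) = \ln\frac{\sigma_{y_\tau|y_0}}{\sigma_{y_\tau|x_0,y_0}} + \frac{\sigma^{2}_{y_\tau|x_0,y_0} + \left(\langle y_\tau|x_0,y_0\rangle - \langle y_\tau|y_0\rangle\right)^{2}}{2\sigma^{2}_{y_\tau|y_0}} - \frac{1}{2}; $$

using $\langle y_\tau|x_0,y_0\rangle - \langle y_\tau|y_0\rangle = \partial_{x_0}\langle y_\tau|x_0,y_0\rangle\,(x_0-\langle x_0|y_0\rangle)$, which holds for linear regressions, together with Eqs. (12) and (13), reproduces Eq. (16).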

Nonlinear example.

As a counterexample to the general validity of Eq. (14) in nonlinear systems, consider the following nonlinear Langevin equation for two variables

$$ \begin{cases} \dfrac{dx}{dt} = -\dfrac{x}{t_R} + \eta_t, \\ \dfrac{dy}{dt} = \alpha x^{2} - \beta y. \end{cases} \qquad (17) $$

Numerical simulations (same parameters as for Eq. (15)) show that Eq. (14) is violated, see SI for details. Hence, in general, the transfer entropy is not easily connected to the information response.

Ensemble information response.

We can define an analogous information response at the ensemble level. For the same perturbation $x_0\Rightarrow x_0+\epsilon$, we consider the unconditional response divergence

$$ \widetilde{d^{x\rightarrow y}_{\tau}}(\epsilon) \equiv D\left[p\left(y_\tau\,\big|\,x_0\Rightarrow x_0+\epsilon\right)\,\big\|\,p\left(y_\tau\right)\right], \qquad (18) $$

i.e., we evaluate the response at the ensemble level, without knowledge of the measurement $(x_0,y_0)$,

$$ p\left(y_\tau\,\big|\,x_0\Rightarrow x_0+\epsilon\right) = \left\langle p\left(y_\tau\,\big|\,x_0,y_0;\,x_0\Rightarrow x_0+\epsilon\right)\right\rangle. \qquad (19) $$

In general, $\widetilde{d^{x\rightarrow y}_{\tau}}(\epsilon) \neq \left\langle d^{x\rightarrow y}_{\tau}(x_0,y_0,\epsilon)\right\rangle$.

We define the ensemble information response as

$$ \widetilde{\Gamma^{x\rightarrow y}_{\tau}} \equiv \lim_{\epsilon\rightarrow 0}\frac{\widetilde{d^{x\rightarrow y}_{\tau}}(\epsilon)}{c_x(\epsilon)} = -\frac{\left\langle\left\langle\partial_{x_0}\ln p\left(y_\tau|x_0,y_0\right)\big|y_\tau\right\rangle^{2}\right\rangle}{\left\langle\partial^{2}_{x_0}\ln p\left(x_0|y_0\right)\right\rangle}, \qquad (20) $$

where the second equality, valid only for $\tau>0$, is the corresponding fluctuation-response theorem. A straightforward generalization to arbitrary perturbation profiles $\epsilon(x_0,y_0)$ is discussed in the SI text. Note that we could write $\widetilde{d^{x\rightarrow y}_{\tau}}(\epsilon)$ through the Fisher information $\left\langle\partial^{2}_{\epsilon}\ln\left\langle p(y_\tau|x_0+\epsilon,y_0)\right\rangle\right\rangle\big|_{\epsilon=0}$, but the partial derivative would be over the perturbation parameter $\epsilon$, and we found it more natural to consider the self-prediction quantity $\left\langle\left\langle\partial_{x_0}\ln p\left(y_\tau|x_0,y_0\right)\big|y_\tau\right\rangle^{2}\right\rangle$. See SI text for technical details on expectation brackets.

In linear systems, the ensemble information response takes the form

$$ \widetilde{\Gamma^{x\rightarrow y}_{\tau}} = \Gamma^{x\rightarrow y}_{\tau}\,e^{-2I_{\tau}^{xy,y}} = e^{-2I_{\tau}^{y,y}}\left(1 - e^{-2T^{x\rightarrow y}_{\tau}}\right), \qquad (21) $$

where $I_{\tau}^{y,y} \equiv D\left[p(y_0,y_\tau)\,\|\,p(y_0)p(y_\tau)\right]$ is the mutual information between $y_0$ and $y_\tau$, and $I_{\tau}^{xy,y} = I_{\tau}^{y,y} + T^{x\rightarrow y}_{\tau}$ is the mutual information that the two predictors $(x_0,y_0)$ together have on the output $y_\tau$, see SI text.
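As a consistency check (ours, in the spirit of the SI derivations), inserting the Gaussian identity $I_\tau^{y,y}=\ln(\sigma_{y_\tau}/\sigma_{y_\tau|y_0})$ together with Eqs. (12) and (13) into Eq. (21) gives

$$ \widetilde{\Gamma^{x\rightarrow y}_{\tau}} = \frac{\sigma^{2}_{y_\tau|y_0} - \sigma^{2}_{y_\tau|x_0,y_0}}{\sigma^{2}_{y_\tau}} = \frac{\left(\partial_{x_0}\langle y_\tau|x_0,y_0\rangle\right)^{2}\sigma^{2}_{x_0|y_0}}{\sigma^{2}_{y_\tau}}, $$

i.e., the numerator of Eq. (11) normalized by the unconditional variance $\sigma^{2}_{y_\tau}$, which makes the bound stated in the next paragraph immediate in the linear case.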

From the nonnegativity of informations, we obtain the bound $0\leq\widetilde{\Gamma^{x\rightarrow y}_{\tau}}\leq 1$. We see that $\widetilde{\Gamma^{x\rightarrow y}_{\tau}}$ increases with the transfer entropy $T^{x\rightarrow y}_{\tau}$, and decreases with the autocorrelation $I_{\tau}^{y,y}$. Since $I_{\tau}^{y,y}$ diverges for $\tau\rightarrow 0$ in continuous processes, the perturbation of the $x$ ensemble takes a finite time to fully propagate its effect to the $y$ ensemble. Since time-lagged informations vanish for $\tau\rightarrow\infty$ in ergodic processes, ensembles relax asymptotically towards the steady state after a perturbation, and correspondingly the ensemble information response vanishes. This gives $\widetilde{\Gamma^{x\rightarrow y}_{\tau}}$ a trade-off shape as a function of the timescale $\tau$. Note the asymptotics $\widetilde{\Gamma^{x\rightarrow y}_{\tau}}/\Gamma^{x\rightarrow y}_{\tau}\rightarrow 1$ for $\tau\rightarrow\infty$, also resulting from ergodicity.

Discussion.

In this Letter, we introduced a new measure of causation that has both the invariance properties required for an information-theoretic measure and the physical interpretation of a propagation of perturbations. It has the form of a linear response coefficient between Kullback-Leibler divergences, and it is based on the information-theoretic cost of perturbations. We would like to call it the information response.

We study the behavior of the information response analytically in linear stochastic systems, and show that it reduces to the known transfer entropy in this case. This establishes a first connection between fluctuation-response theory and information flow, i.e., the two main perspectives on the problem of causation at present. Additionally, it provides a new relation between Fisher and mutual information.

We suggest our information response for the design of new quantitative causal inference methods [24]. Its practical estimation from time series, as is normally the case for information-theoretic measures, depends on the learnability of probability distributions from a finite amount of data [33, 34].

Acknowledgments

We thank M. Scazzocchio for helpful discussions. AA is supported by the DFG through FR3429/3 to BMF; AA and BMF are supported through the Excellence Initiative by the German Federal and State Governments (Cluster of Excellence PoL EXC-2068).

References

  • Seth et al. [2015] A. K. Seth, A. B. Barrett, and L. Barnett, Granger causality analysis in neuroscience and neuroimaging, Journal of Neuroscience 35, 3293 (2015).
  • Runge et al. [2019] J. Runge, S. Bathiany, E. Bollt, G. Camps-Valls, D. Coumou, E. Deyle, C. Glymour, M. Kretschmer, M. D. Mahecha, J. Muñoz-Marí, et al., Inferring causation from time series in Earth system sciences, Nature Communications 10, 1 (2019).
  • Luzzatto and Pandolfi [2015] L. Luzzatto and P. P. Pandolfi, Causality and chance in the development of cancer, New England Journal of Medicine 373, 84 (2015).
  • Kwon and Yang [2008] O. Kwon and J.-S. Yang, Information flow between stock indices, EPL (Europhysics Letters) 82, 68003 (2008).
  • Ito and Sagawa [2013] S. Ito and T. Sagawa, Information thermodynamics on causal networks, Physical Review Letters 111, 180603 (2013).
  • Horowitz and Esposito [2014] J. M. Horowitz and M. Esposito, Thermodynamics with continuous information flow, Physical Review X 4, 031015 (2014).
  • James et al. [2016] R. G. James, N. Barnett, and J. P. Crutchfield, Information flows? a critique of transfer entropies, Physical Review Letters 116, 238701 (2016).
  • Auconi et al. [2017] A. Auconi, A. Giansanti, and E. Klipp, Causal influence in linear Langevin networks without feedback, Physical Review E 95, 042315 (2017).
  • Pearl [2009] J. Pearl, Causality (Cambridge University Press, 2009).
  • Janzing et al. [2013] D. Janzing, D. Balduzzi, M. Grosse-Wentrup, B. Schölkopf, et al., Quantifying causal influences, The Annals of Statistics 41, 2324 (2013).
  • Aurell and Del Ferraro [2016] E. Aurell and G. Del Ferraro, Causal analysis, correlation-response, and dynamic cavity, in Journal of Physics: Conference Series, Vol. 699 (2016) p. 012002.
  • Baldovin et al. [2020] M. Baldovin, F. Cecconi, and A. Vulpiani, Understanding causation via correlations and linear response theory, Physical Review Research 2, 043436 (2020).
  • Massey [1990] J. Massey, Causality, feedback and directed information, in Proc. Int. Symp. Inf. Theory Applic. (ISITA-90) (Citeseer, 1990) pp. 303–305.
  • Schreiber [2000] T. Schreiber, Measuring information transfer, Physical Review Letters 85, 461 (2000).
  • Ay and Polani [2008] N. Ay and D. Polani, Information flows in causal networks, Advances in Complex Systems 11, 17 (2008).
  • Parrondo et al. [2015] J. M. Parrondo, J. M. Horowitz, and T. Sagawa, Thermodynamics of information, Nature Physics 11, 131 (2015).
  • Cover [1999] T. M. Cover, Elements of information theory (John Wiley & Sons, 1999).
  • Amari [2016] S. I. Amari, Information geometry and its applications, Vol. 194 (Springer, 2016).
  • Kubo [1966] R. Kubo, The fluctuation-dissipation theorem, Reports on Progress in Physics 29, 255 (1966).
  • Kubo [1986] R. Kubo, Brownian motion and nonequilibrium statistical mechanics, Science 233, 330 (1986).
  • Marconi et al. [2008] U. M. B. Marconi, A. Puglisi, L. Rondoni, and A. Vulpiani, Fluctuation–dissipation: response theory in statistical physics, Physics Reports 461, 111 (2008).
  • Maes [2020] C. Maes, Response theory: A trajectory-based approach, Frontiers in Physics 8, 229 (2020).
  • Dechant and Sasa [2020] A. Dechant and S.-i. Sasa, Fluctuation–response inequality out of equilibrium, Proceedings of the National Academy of Sciences 117, 6430 (2020).
  • Runge [2018] J. Runge, Causal network reconstruction from time series: From theoretical assumptions to practical estimation, Chaos: An Interdisciplinary Journal of Nonlinear Science 28, 075310 (2018).
  • Ito and Dechant [2020] S. Ito and A. Dechant, Stochastic time evolution, information geometry, and the Cramér-Rao bound, Physical Review X 10, 021056 (2020).
  • [26] Eq. (8) holds for a larger class of divergences beyond the KL divergence, because the Fisher information is the unique invariant metric [18].
  • Wei and Stocker [2016] X.-X. Wei and A. A. Stocker, Mutual information, Fisher information, and efficient coding, Neural Computation 28, 305 (2016).
  • Sagawa and Ueda [2012] T. Sagawa and M. Ueda, Nonequilibrium thermodynamics of feedback control, Physical Review E 85, 021104 (2012).
  • Rosinberg and Horowitz [2016] M. L. Rosinberg and J. M. Horowitz, Continuous information flow fluctuations, EPL (Europhysics Letters) 116, 10007 (2016).
  • Risken [1996] H. Risken, Fokker-Planck equation, in The Fokker-Planck Equation (Springer, 1996) pp. 63–95.
  • Barnett et al. [2009] L. Barnett, A. B. Barrett, and A. K. Seth, Granger causality and transfer entropy are equivalent for Gaussian variables, Physical Review Letters 103, 238701 (2009).
  • Auconi et al. [2019] A. Auconi, A. Giansanti, and E. Klipp, Information thermodynamics for time series of signal-response models, Entropy 21, 177 (2019).
  • Bialek et al. [1996] W. Bialek, C. G. Callan, and S. P. Strong, Field theories for learning probability distributions, Physical Review Letters 77, 4693 (1996).
  • Bialek et al. [2020] W. Bialek, S. E. Palmer, and D. J. Schwab, What makes it possible to learn probability distributions in the natural world?, arXiv preprint arXiv:2008.12279  (2020).