
Finding emergence in data by maximizing effective information

Mingzhe Yang (School of Systems Science, Beijing Normal University, 100875, Beijing, China), Zhipeng Wang (School of Systems Science, Beijing Normal University, 100875, Beijing, China), Kaiwei Liu (School of Systems Science, Beijing Normal University, 100875, Beijing, China), Yingqi Rong (Johns Hopkins University, 21218, Baltimore, MD, USA), Bing Yuan (Swarma Research, 102300, Beijing, China), Jiang Zhang (School of Systems Science, Beijing Normal University, 100875, Beijing, China)
Abstract

Quantifying emergence and modeling emergent dynamics in a data-driven manner for complex dynamical systems is challenging due to the lack of direct observations at the micro-level. Thus, it is crucial to develop a framework that identifies emergent phenomena and captures emergent dynamics at the macro-level using available data. Inspired by the theory of causal emergence (CE), this paper introduces a machine learning framework that learns macro-dynamics in an emergent latent space and quantifies the degree of CE. The framework maximizes effective information, resulting in a macro-dynamics model with enhanced causal effects. Experimental results on simulated and real data demonstrate the effectiveness of the proposed framework. It quantifies degrees of CE effectively under various conditions and reveals the distinct influences of different noise types. It can learn a one-dimensional coarse-grained macro-state from fMRI data to represent the complex neural activity during movie clip viewing. Furthermore, improved generalization to different test environments is observed across all simulation data.

Keywords: causal emergence; dynamics learning; effective information; coarse-graining; invertible neural network

1.   Introduction

The climate system, ecosystems, bird flocks, ant colonies, cells, brains, and many other complex systems are composed of numerous interacting elements and exhibit a wide range of nonlinear dynamical behaviors [1, 2]. In the past few decades, data-driven modeling of complex systems has gained significant attention, driven by the increasing availability and accumulation of data from real dynamical systems [3, 4, 5, 6]. However, complex systems always exhibit emergent behaviors [7, 1], meaning that some interesting emergent patterns or dynamical behaviors, such as waves [8], periodic oscillations [9], and solitons [10], can hardly be observed and identified directly from micro-level behavioral data. Therefore, identifying and measuring emergence, and capturing emergent dynamical patterns solely from observational raw data, have become crucial challenges in complex systems research [11, 12, 13, 14]. To address these problems, it is necessary to first develop a quantitative understanding of emergence.

Emergence, a distinctive feature of complex systems [7], has historically been challenging to quantify and describe in quantitative terms [15, 16, 17, 18]. Most conventional measures or methods either rely on predefined macro-variables (e.g., [19, 20, 21]) or are tailored to specific scenarios in engineered systems (e.g., [17, 22, 23, 24]). However, there is a need for a unified method to quantify emergence across different contexts. The theory of causal emergence (CE), introduced by Erik Hoel in 2013, offers a framework to tackle this challenge [25, 26]. Hoel’s theory aims to understand emergence through the lens of causality. The connection between emergence and causality is implied in the descriptive definition of emergence stated in the work by J. Fromm [27]. According to this definition, a macro-level property, such as a pattern or dynamical behavior, is considered emergent if it cannot be explained by or directly attributed to the individuals in the system. The theory of causal emergence formalizes this concept within the framework of discrete Markov dynamical systems. As shown in Figure 1(a), Hoel et al.’s theory states that if a system exhibits stronger causal effects after a specific coarse-graining transformation compared to the original system, then CE has occurred. In Hoel’s framework, the degree of causal effect in a dynamical system is quantified using effective information (EI) [28]. EI can be understood as an intervention-based version of the mutual information between two successive states of a dynamical system over time. It is a measure that depends solely on the system’s dynamics. If a dynamical system is more deterministic and non-degenerate, meaning that temporally adjacent states can be inferred from each other in both directions of the arrow of time, then it has a larger EI [25]. This measure has been shown to be compatible with other well-known measures of causal effect [29]. Figure 1(b) gives an example of CE for a simple Markov chain.

Figure 1: (a) An illustration of the fundamental concept of Erik Hoel’s theory of causal emergence (CE). The effective information (EI) is denoted as 𝒥 in this paper. (b) A case demonstrating CE in a discrete Markov chain. The micro-dynamics consist of eight micro-states. During the coarse-graining process, the first seven states are grouped together as one macro-state, while the eighth micro-state corresponds to the second macro-state. As a result, a transition probability matrix is formed at the macro-scale, whose effective information 𝒥(fM)=1 (calculated using the EI measure defined in method section 2) is greater than 𝒥(fm)=0.55. This difference, Δ𝒥=0.45, indicates the occurrence of CE, as Δ𝒥>0.
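To make the quantities in Figure 1(b) concrete, the following Python sketch computes EI of a transition probability matrix as the mutual information between successive states under a uniform intervention on the current state. The macro TPM is a deterministic two-state dynamics as in the caption; the 8-state micro TPM is only a plausible reconstruction of the described chain (the first seven states mixing uniformly among themselves, the eighth fixed), and it yields EI ≈ 0.54 rather than the reported 0.55, so the exact micro matrix in the figure presumably differs slightly.

```python
import numpy as np

def effective_information(tpm):
    """EI of a Markov transition matrix (in bits): the mutual information
    between X_t and X_{t+1} when X_t is intervened to be uniform."""
    tpm = np.asarray(tpm, dtype=float)
    n = tpm.shape[0]
    avg = tpm.mean(axis=0)  # effect distribution under do(X_t ~ Uniform)
    ei = 0.0
    for i in range(n):
        for j in range(n):
            if tpm[i, j] > 0:
                ei += tpm[i, j] * np.log2(tpm[i, j] / avg[j])
    return ei / n

# Deterministic 2-state macro dynamics: EI = log2(2) = 1 bit.
macro = np.eye(2)

# Hypothetical micro TPM consistent with the description: states 1-7 mix
# uniformly among themselves, state 8 maps to itself deterministically.
micro = np.zeros((8, 8))
micro[:7, :7] = 1.0 / 7
micro[7, 7] = 1.0
```

A deterministic, non-degenerate TPM attains the maximum EI of log2(n)/n per state pair, which is why the coarse-grained dynamics can score higher than the noisy micro-dynamics.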

While the causal emergence theory has successfully quantified emergence using EI and has found applications in various fields [30, 31, 32], it has some drawbacks. Firstly, the Markov transition matrix of the micro-dynamics must be given rather than constructed from data. Secondly, a predefined coarse-graining strategy must be provided or optimized by maximizing EI, and this optimization process is computationally complex [33, 30]. Although Rosas et al. [11] proposed a new framework for CE based on partial information decomposition theory [34], which does not require a predefined coarse-graining strategy, it still involves iterating through all variable combinations on the information lattice to compare synergistic information, resulting in significant computational complexity. Rosas et al. also proposed an approximate method to mitigate this complexity [11], but it requires a predefined macro-variable.

Therefore, the challenge of finding emergence in data, which involves determining whether CE has occurred within a system, and to what extent, based solely on observational data of its behavior, remains unresolved. The most daunting task is that all the elements, including the Markov dynamics at both the micro- and macro-levels, as well as the coarse-graining strategy used to obtain the macro-variables, need to be learned from raw data and cannot be predefined in advance [12]. Once the learning process is completed, we can compare the strength of causal effects (measured by EI) in the dynamics at different scales to identify CE from the data. Therefore, the problem of finding CE within Hoel’s theoretical framework is essentially equivalent to the challenge of data-driven modeling within a coarse-grained space for complex systems [12]. Building models of complex systems at multiple coarse-grained levels within learned emergent spaces is of utmost importance both for identifying CE and for conducting data-driven modeling in complex systems.

Recently, several machine learning frameworks have emerged for learning and simulating the dynamics of complex systems within coarse-grained latent or hidden spaces [35, 14, 36, 37, 38]. By learning the dynamics in a hidden space, these frameworks enable the removal of noise and irrelevant factors from raw data, enhancing prediction capabilities and adaptability to diverse environments. While these learning systems can capture emergent dynamics, they may not directly address the fundamental nature of CE, which entails stronger causality. According to Judea Pearl’s hierarchy of causality, prediction-based learning is situated at the level of association and cannot address the challenges related to intervention and counterfactuals [39]. Empirically, dynamics learned solely through prediction may be influenced by the distribution of the input data, which can be limited by data diversity and the problem of overfitting [40]. However, what we truly desire is an invariant causal mechanism or dynamics that is independent of the input data. This allows the learned mechanism or dynamics to be adaptable to broader domains, generalizable beyond the distribution of the training data, and capable of accommodating diverse interventions [41, 42, 43]. Unfortunately, few studies have explored the integration of causality and latent space dynamics to address the challenges of data-driven modeling in complex systems [44].

Inspired by the theory of causal emergence, this paper aims to address the challenge of learning causal mechanisms within a learned coarse-grained macro-level (latent) space. The approach involves maximizing the EI of the emergent macro-level dynamics, which is equivalent to maximizing the degree of causal effect in the learned coarse-grained dynamics [29]. By artificially intervening on the distribution of the input data to make it uniform, this maximization process helps separate the invariant causal mechanisms from the variable data distribution as much as possible. To achieve this, a novel machine learning framework called Neural Information Squeezer Plus (NIS+) is proposed. NIS+ extends our previous framework (NIS) to solve the problem of maximizing EI under coarse-grained representations. As shown in Figure 2, NIS+ not only learns emergent macro-dynamics and coarse-graining strategies but also quantifies the degree of CE from time series data and captures emergent patterns. Mathematical theorems ensure the flexibility of our framework in different application scenarios, such as cellular automata and multi-agent systems. Numerical experiments demonstrate the effectiveness of NIS+ in capturing emergent behaviors and finding CE in various simulations, including SIR dynamics [45], bird flocking (the Boids model) [46], and the Game of Life [47]. Additionally, NIS+ is applied to find emergent features in real brain data from 830 subjects watching the same movie. Experiments also demonstrate that the learned dynamical model exhibits improved generalization compared to other models.

Figure 2: The workflow and architecture of our proposed framework, Neural Information Squeezer Plus (NIS+), for discovering causal emergence within data. (a) Various forms of input data from the studied systems, such as the Boids flocking model (a multi-agent system), Conway’s Game of Life (a two-dimensional cellular automaton), and real brain fMRI time series data. (b) The framework of our proposed model, NIS+, which incorporates our previous model, NIS. The boxes represent functions or neural networks, and the arrow pointing to a cross represents the operation of information discarding. 𝒙t and 𝒙t+1 represent the observational data of micro-states, while 𝒙^t+1 represents the predicted micro-state. 𝒚t=ϕ(𝒙t) and 𝒚t+1=ϕ(𝒙t+1) represent the macro-states obtained by encoding the micro-states using the encoder. 𝒚^t=ϕ(𝒙^t) and 𝒚^t+1=ϕ(𝒙^t+1) represent the predicted macro-states obtained by encoding the predictions of the micro-states. The mathematical problems that each framework aims to solve are also illustrated in the figure. (c) The various output forms of NIS+, including the degree of CE, the learned macro-dynamics, the captured emergent patterns, and the coarse-graining strategy.

2.   Finding causal emergence in data

Finding CE in time series data involves two sub-problems: emergent dynamics learning and causal emergence quantification. In the first problem, the dynamics at both the micro- and macro-levels, as well as the coarse-graining strategy, should be learned from the data. After that, we can measure the degree of CE and judge whether emergence occurs in the data according to the learned micro- and macro-dynamics.

2.1.   Problem definition

Suppose the behavioral data of a complex dynamical system is a time series {𝒙t} with time steps t=1,2,…,T and dimension p; these form the observable micro-states. We assume that there are no unobserved variables. The problem of emergent dynamics learning is to find three functions from the data: a coarse-graining strategy ϕ: ℛ^p → ℛ^q, where q ≤ p is the dimension of the macro-states and is given as a hyper-parameter; a corresponding anti-coarsening (decoding) strategy ϕ†: ℛ^q → ℛ^p; and a macro-level Markov dynamics fq, such that the EI of the macro-dynamics fq is maximized under the constraint that the prediction 𝒙^t+1 produced by ϕ, fq, and ϕ† is close to the real data 𝒙t+1:

\max_{\phi, f_q, \phi^{\dagger}} \mathcal{J}(f_q),   (1)
\mathrm{s.t.}\ \begin{cases} \|\hat{\boldsymbol{x}}_{t+1} - \boldsymbol{x}_{t+1}\| < \epsilon, \\ \hat{\boldsymbol{x}}_{t+1} = \phi^{\dagger}(f_q(\phi(\boldsymbol{x}_t))). \end{cases}

where 𝒥 is the appropriate version of EI to be maximized. For the continuous dynamical systems mainly considered in this paper, 𝒥 is the dimension-averaged effective information (dEI) [12] (see method section 2), and ϵ is a given small constant.

The objective of maximizing EI in Equation 1 is to enhance the causal effect of the macro-dynamics fq. However, the direct maximization of EI, as depicted in Figure 1, can result in trivial solutions, as highlighted by Zhang et al. [12]. Additionally, it can lead to issues of ambiguity and violations of commutativity, as pointed out by Eberhardt and Lee [48]. To address these issues, constraints are introduced. The constraints in Equation 1 ensure that the macro-dynamics fq can simulate the micro-dynamics in the data as accurately as possible, with the prediction error being less than a given threshold ϵ.

This is achieved by mapping the micro-state 𝒙t to the macro-state 𝒚t=ϕ(𝒙t) at each time step. The macro-dynamics fq then makes a one-step prediction to obtain 𝒚^t+1=fq(ϕ(𝒙t)). Subsequently, the anti-coarsening strategy ϕ† is employed to decode 𝒚^t+1 back into the micro-state, resulting in the reconstructed micro-state 𝒙^t+1=ϕ†(𝒚^t+1). The entire computational process is visualized in the upper part (NIS) of Figure 2b.

In practice, we determine ϵ by setting a threshold on the normalized MAE (the mean absolute error divided by the standard deviation of 𝒙) on the training data. The normalized MAE provides a consistent standard across different experiments, accounting for their varying numerical ranges.
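As a minimal illustration (the function and variable names are ours, not the paper's), the normalized MAE can be computed as:

```python
import numpy as np

def normalized_mae(x_true, x_pred):
    """Mean absolute error divided by the standard deviation of the
    ground-truth micro-states, giving a scale-free error measure."""
    x_true = np.asarray(x_true, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    return np.mean(np.abs(x_pred - x_true)) / np.std(x_true)
```

A trained model is then accepted when this value stays below the chosen threshold (0.3 in the experiments reported below).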

By varying q we can obtain macro-dynamics of various dimensions. If q=p, then fp becomes the learned micro-dynamics, and we can compare 𝒥q and 𝒥p for any q. The problem of causal emergence quantification can then be defined as the calculation of the following difference,

Δ𝒥𝒥(fq)𝒥(fp),\Delta\mathcal{J}\equiv\mathcal{J}(f_{q})-\mathcal{J}(f_{p}), (2)

where Δ𝒥 is defined as the degree of causal emergence. The micro-dynamics fp is also obtained by solving Equation 1 with dimension q=p. If Δ𝒥>0, we say that CE occurs within the data.

2.2.   Solution

However, directly solving the optimization problem defined in Equation 1 is difficult because the objective function 𝒥 is a mutual information measure under intervention, which requires special treatment.

To tackle this problem, we convert the problem defined in Equation 1 into a new optimization problem without constraints:

\min_{f, g, \phi, \phi^{\dagger}} \sum_{t=1}^{T-1} w(\boldsymbol{x}_t)\,\|\boldsymbol{y}_t - g(\boldsymbol{y}_{t+1})\| + \lambda\,\|\hat{\boldsymbol{x}}_{t+1} - \boldsymbol{x}_{t+1}\|,   (3)

where 𝒙^t+1=ϕ†(f(ϕ(𝒙t))), and 𝒚t=ϕ(𝒙t) and 𝒚t+1=ϕ(𝒙t+1) are the macro-states. g: ℛ^q → ℛ^q is a new function introduced to simulate the inverse macro-dynamics on the macro-state space, that is, to map each macro-state at time step t+1 back to the macro-state at time step t. λ is a Lagrange multiplier, treated as a hyper-parameter in the experiments. w(𝒙t) is the inverse probability weight, defined as:

w(\boldsymbol{x}_t) = \frac{\tilde{p}(\boldsymbol{y}_t)}{p(\boldsymbol{y}_t)} = \frac{\tilde{p}(\phi(\boldsymbol{x}_t))}{p(\phi(\boldsymbol{x}_t))},   (4)

where p̃ is the new distribution of the macro-states 𝒚t after the intervention do(𝒚t ∼ Uq), and p is the natural distribution of the data. In practice, p(𝒚t) is estimated through kernel density estimation (KDE) [49] (see method section 4). The target distribution p̃(𝒚t) is assumed to be uniform and is therefore characterized by a constant value. Consequently, the weight w is computed as the ratio of these two distributions. Mathematical theorems stated in method section 5.1.2 and proved in support information section 10.1 guarantee that this new optimization problem (Equation 3) is equivalent to the original one (Equation 1).
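The reweighting in Equation 4 can be sketched in a few lines of numpy. This is an illustrative implementation, not the paper's code: it uses a fixed-bandwidth Gaussian KDE for p(𝒚t), takes p̃ to be uniform over the bounding box of the observed macro-states, and normalizes the weights to average one; the bandwidth and these normalization choices are our own assumptions.

```python
import numpy as np

def ipw_weights(y, bandwidth=0.1):
    """Inverse probability weights w = p_tilde(y) / p(y), cf. Equation 4.
    p(y): Gaussian kernel density estimate of the natural macro-state
    distribution; p_tilde: uniform over the bounding box of the samples."""
    y = np.asarray(y, dtype=float)
    n, d = y.shape
    # pairwise squared distances for the Gaussian kernel
    sq = np.sum((y[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    kern = np.exp(-sq / (2.0 * bandwidth ** 2))
    p = kern.sum(axis=1) / (n * (np.sqrt(2.0 * np.pi) * bandwidth) ** d)
    # uniform target density over the occupied bounding box
    p_tilde = 1.0 / np.prod(y.max(axis=0) - y.min(axis=0))
    w = p_tilde / p
    return w / w.mean()  # normalize so the weights average to one

# Samples concentrated near the origin plus one rare outlier: the outlier
# has low density under p(y) and therefore receives the largest weight.
rng = np.random.default_rng(0)
y = np.vstack([rng.normal(0.0, 0.05, size=(100, 2)), [[1.0, 1.0]]])
w = ipw_weights(y)
```

Up-weighting rare macro-states in this way emulates the uniform intervention do(𝒚t ∼ Uq), which is what separates the learned mechanism from the training distribution.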

3.   Results

We will validate the effectiveness of the NIS+ framework through numerical experiments in which data is generated by different artificial models (dynamical systems, multi-agent systems, and cellular automata). Additionally, we will apply NIS+ to real fMRI data from human subjects to uncover interesting macro-level variables and dynamics. In these experiments, we evaluate the models’ prediction and generalization abilities. We also assess their capability to identify CE and compare it with Rosas’ Ψ indicator proposed in [11], an alternative measure for quantifying CE approximately. For the experiments, a normalized MAE threshold of 0.3 was selected, based on the observation that exceeding this value leads to a significant change in the relationship between the normalized MAE and the noise (see below).

3.1.   SIR

The first experiment revolves around the SIR (Susceptible, Infected, and Recovered or Died) model, a simple dynamical system. In this experiment, the SIR dynamics serve as the ground truth for the macro-level dynamics, while the micro-level variables are generated by introducing noise to the macro-variables. The primary objective is to evaluate our model’s ability to effectively remove noise, uncover meaningful macroscopic dynamics, identify CE, and demonstrate generalization beyond the distribution of the training dataset.

Formally, the macro-dynamics can be described as:

\begin{cases} \dfrac{\mathrm{d}S}{\mathrm{d}t} = -\beta S I, \\ \dfrac{\mathrm{d}I}{\mathrm{d}t} = \beta S I - \gamma I, \\ \dfrac{\mathrm{d}R}{\mathrm{d}t} = \gamma I, \end{cases}   (5)

where S, I, R ∈ [0,1] represent the proportions of healthy, infected, and recovered or dead individuals in the population, and β=1 and γ=0.5 are the infection and recovery rates, respectively. Figure 3(a) shows the phase space (S, I, R) of the SIR dynamics. Because the model has only two degrees of freedom, as S, I, and R satisfy S+I+R=1, all macro-states are distributed on a triangular plane in three dimensions. Therefore, only S and I are used to form the macro-state variable 𝒚=(S,I), and R is discarded.

We then expand 𝒚 into a four-dimensional vector and introduce Gaussian noise to form the microscopic state:

\begin{cases} \boldsymbol{S}' = (S, S) + \boldsymbol{\xi}_1, \\ \boldsymbol{I}' = (I, I) + \boldsymbol{\xi}_2. \end{cases}   (6)

where 𝝃1, 𝝃2 ∼ 𝒩(0, Σ) are independent two-dimensional Gaussian noise vectors and Σ is the covariance matrix. In this way, we obtain a micro-state sequence 𝒙t=(𝑺′t, 𝑰′t) as the training samples in this experiment. We randomly select initial conditions by sampling points within the triangular region depicted in Figure 3(a) and generate time series data using the aforementioned process. These generated data are then utilized to train the models.
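The data-generation pipeline of Equations 5 and 6 can be sketched as follows. The sketch uses simple Euler integration with dt = 0.01 and, for simplicity, independent noise components (the paper's Σ may encode correlations); the function names are illustrative.

```python
import numpy as np

def simulate_sir(S0, I0, beta=1.0, gamma=0.5, dt=0.01, steps=2000):
    """Euler integration of Equation 5; R = 1 - S - I is dropped,
    so the macro-state is y_t = (S, I)."""
    traj = np.empty((steps, 2))
    S, I = S0, I0
    for t in range(steps):
        traj[t] = (S, I)
        dS = -beta * S * I
        dI = beta * S * I - gamma * I
        S, I = S + dS * dt, I + dI * dt
    return traj

def macro_to_micro(macro, sigma, rng):
    """Expand y = (S, I) into a 4-d micro-state as in Equation 6:
    duplicate each macro variable, then add Gaussian noise."""
    S, I = macro[:, 0], macro[:, 1]
    clean = np.stack([S, S, I, I], axis=1)
    return clean + rng.normal(0.0, sigma, size=clean.shape)

rng = np.random.default_rng(0)
y = simulate_sir(S0=0.99, I0=0.01)
x = macro_to_micro(y, sigma=0.01, rng=rng)  # training micro-states
```

The model never sees y directly; recovering a two-dimensional macro-state from x is exactly the coarse-graining task NIS+ must solve.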

Figure 3: The experimental results of NIS+ and compared models on the SIR model with observational noise. (a) The phase space of the SIR model, along with four example trajectories with the same infection and recovery or death rates. The full dataset (blue area) and the partial dataset (dotted area) used for training are also displayed, consisting of 63,000 and 42,000 uniformly distributed data points, respectively. (b) The curves depict the change in the dimension-averaged effective information (𝒥) with training epochs for different models. The lines represent the means, while the band widths represent the standard deviations of five repeated experiments. (c) A comparison among the vector fields of the SIR dynamics, the learned macro-dynamics of NIS+, and the macro-dynamics transformed by the Jacobian of the learned encoder. Each arrow represents the direction and magnitude of the derivative of the dynamics at that coordinate point. For detailed procedures, please refer to support information section 11.3. (d) A comparison of the errors in multi-step predictions for different models trained on either partial datasets (with 42,000 missing data points) or complete datasets (see details in support information section 11.1). These models include NIS+, NIS, a feed-forward neural network (NN), a feed-forward neural network with inverse probability weighting and inverse dynamics learning techniques (NN+), a variational autoencoder (VAE), and its reweighted and inverse dynamics version (VAE+). Please refer to method section 6 for the details of the parameters. (e) The variations in the measure of CE (Δ𝒥) and the EIs of the micro-dynamics (𝒥(fm)) and macro-dynamics (𝒥(fM)) are plotted as the standard deviation σ of the observational noise changes. All these indicators are averaged across dimensions. Following Rosas’ definition and calculation method for CE (see method section 5), the yellow line shows the changes in Rosas’ Ψ [11]. The vertical line marks the threshold where the normalized MAE equals 0.3. When σ is larger than this threshold, the error constraint in Equation 1 is violated and the results are not reliable. (f) A comparison among the vector fields of the SIR dynamics, the learned macro-dynamics of NIS, and the macro-dynamics transformed by the encoder Jacobian matrix of NIS, for comparison with (c).

We conduct a comparative analysis between NIS+ and alternative models, including NIS (which lacks the EI maximization of NIS+), a feed-forward neural network (NN), and a variational autoencoder (VAE). To make a fair comparison, we ensure that all benchmark models have roughly equal numbers of parameters. Moreover, we apply the same probability reweighting and inverse dynamics learning techniques used in NIS+ to the feed-forward neural network (NN+) and the variational autoencoder (VAE+). We evaluate the performance of all candidate models by requiring them to predict future states for multiple time steps (10 steps) on a separate test dataset. As shown in Figure 3(d), NIS+ (green bars) outperforms all competitors on multi-step prediction, regardless of whether they use techniques such as probability reweighting to maximize EI. This demonstrates that an invertible neural network serving as the encoder and decoder is necessary (the details can be found in method section 6).

Further, to assess the models’ generalization ability beyond the region of the training dataset, in addition to the regular training and testing, we also conduct experiments in which each model is trained on a subset of the data and tested on the complete dataset. The training samples in this experiment are shown in the dotted area of Figure 3(a) (the area with S≤1/3 is missing), and the test samples span the whole blue triangle. As shown by the red bars in Figure 3(d), NIS+ generalizes out of distribution better than the other benchmarks even though the test region extends beyond the training region, and the differences among models are larger on the partial dataset.

To further test whether the models successfully learn the ground-truth macro-dynamics, we compare the vector field of the real SIR dynamics, d𝒚/dt, with the learned emergent dynamics d(h1,h2)/dt. This comparison is illustrated in Figure 3(c) for NIS+ and (f) for NIS. In both sub-figures, the learned vectors align with the ground-truth dynamics and match the theoretical predictions based on the Jacobian of the encoder (for more details, please refer to support information section 11.3). However, it is evident that NIS+ outperforms NIS in accurately capturing the underlying dynamics, especially in peripheral areas with limited training samples.

Next, we test NIS+ and the comparison models on EI maximization and CE quantification; the results are shown in Figure 3(b) and (e). First, to verify that EI is maximized by NIS+, Figure 3(b) illustrates the evolution of the dimension-averaged EI, 𝒥, over training epochs. The curves of NIS+ (red solid), NIS (black dashed), and VAE+ (green solid) all exhibit upward trends, but NIS+ demonstrates a faster rate of increase. This indicates that NIS+ can maximize 𝒥 more efficiently than the other models. Notably, NIS also exhibits a natural increase in EI as it strives to minimize prediction errors.

Second, to examine NIS+’s ability to detect and quantify CE, we compute Δ𝒥 and compare it with Rosas’ Ψ indicator as the noise level σ in the micro-states increases (see support information section 11.2 for details). We utilize the learned macro-states from NIS+ as the prerequisite variable V to implement Rosas’ method [11]. The results are depicted by the black and yellow solid lines in Figure 3(e).

Both indicators exhibit a slight increase with σ. Δ𝒥>0 always holds when σ is less than 0.01, whereas Rosas’ Ψ becomes positive only after σ=10^{-3}. Therefore, NIS+ indicates that CE consistently occurs at low noise levels, whereas Rosas’ method does not. NIS+’s result is more reasonable because it can extract macro-dynamics similar to the ground truth from noisy data, and these deterministic dynamics should have a larger EI than the noisy micro-dynamics. We also plot the curves 𝒥(fM) (red dashed line) and 𝒥(fm) (green dashed line) for the macro- and micro-dynamics, respectively. Both curves decrease as σ increases, but 𝒥(fm) decreases at a faster rate, leading to the observed occurrence of CE. However, when Rosas’ Ψ<0, we cannot make a definitive judgment, since Ψ only provides a sufficient condition for CE. Both indicators reach their peak at σ=10^{-2}, which corresponds to the magnitude of the time step (dt=0.01) used in our simulations and reflects the level of change in the micro-states.

On the other hand, if the noise becomes too large, the limited observational data makes it challenging for NIS+ to accurately identify the correct macro-dynamics from the data. Consequently, the degree of CE, Δ𝒥, decreases to zero. Although NIS+ determines that there is no CE when σ>10, this result is not reliable, since the normalized prediction errors exceed the selected threshold of 0.3 after σ=10^{2} (the vertical dashed and dotted line).

Therefore, these experiments demonstrate that, by maximizing EI and learning an independent causal mechanism, NIS+ can effectively disregard noise within the data, accurately learn the ground-truth macro-dynamics, and generalize to unseen data. Additionally, NIS+ demonstrates superior performance in quantifying CE. More details about the experimental settings are given in support information section 11.

3.2.   Boids

The second experiment is based on the Boids model, a well-known multi-agent model that simulates the collective behavior of birds [50, 51]. In this experiment, we test the ability of NIS+ to capture emergent collective behaviors and quantify CE in different environments with intrinsic and extrinsic noise. To increase the interpretability of the learned coarse-graining strategy, we also attempt to establish an explicit correspondence between the learned macro-states and the micro-states.

We conducted simulations following the methodology of [50] with N=16 boids on a 300×300 canvas to generate training data. The detailed dynamical rules of the Boids model can be found in support information section 12. To assess the ability of NIS+ to discover meaningful macro-states, we separated all boids into two groups and artificially modified the Boids model by introducing a distinct constant turning force for each group. This modification ensured that the two groups exhibited separate trajectories with different turning angles, as depicted in Figure 4a. The micro-state of each boid at each time step consists of its horizontal and vertical positions, as well as its two-dimensional velocity. The micro-states of all boids form a 4N-dimensional vector of real numbers, which is used as the input for training NIS+.

As depicted by the triangles in Figure 4a, the predicted emergent collective flying behaviors for 50 steps closely follow the ground-truth trajectories of the two groups, particularly in the initial stages. These predicted trajectories are generated by decoding the predicted macro-states into the corresponding micro-states, and the two solid lines represent their averages. The hyper-parameter q=8 is chosen for this experiment based on the observation that CE consistently reaches its highest value when q=8, as indicated in Figure 4c.

To enhance the interpretability of the learned macro-states and coarse-graining function in NIS+, we utilize the Integrated Gradients (IG) method [52] (see method section 7) to identify the most significant micro-states for each dimension of the learned emergent macro-state. We normalize the calculated IG, highlight the micro-state with the maximum gradient for each macro-state dimension, and disregard the velocity dimensions of each boid due to their weaker correlations with the macro-states. The normalized IG is drawn as a matrix diagram (Figure 4d). As depicted in Figure 4d, the 1st, 2nd, 5th, and 6th dimensions of the macro-state correspond to the boids in the first group (with ID<8), while the 3rd, 4th, 7th, and 8th dimensions correspond to the second group (with ID≥8). Thus, the learned coarse-graining strategy uses two positional coordinates to summarize all other information in each dimension of the macro-state.
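The IG attributions can be reproduced without an autodiff framework. The sketch below (our own illustrative code, not the paper's implementation) approximates the IG path integral with a midpoint Riemann sum and central finite differences, and verifies the completeness axiom, i.e. that the attributions sum to f(x) − f(baseline), on a toy scalar function standing in for one macro-state dimension of the encoder ϕ.

```python
import numpy as np

def integrated_gradients(f, x, baseline=None, steps=64, eps=1e-5):
    """Approximate IG_k = (x_k - b_k) * integral over a in [0,1] of
    df/dx_k at b + a*(x - b), using a midpoint Riemann sum and
    central finite differences for the partial derivatives."""
    x = np.asarray(x, dtype=float)
    b = np.zeros_like(x) if baseline is None else np.asarray(baseline, float)
    total = np.zeros_like(x)
    for a in (np.arange(steps) + 0.5) / steps:  # midpoints of [0, 1]
        point = b + a * (x - b)
        for k in range(x.size):
            e = np.zeros_like(x)
            e[k] = eps
            total[k] += (f(point + e) - f(point - e)) / (2.0 * eps)
    return (x - b) * total / steps

# Toy "macro-state" function: attributions should sum to f(x) - f(0).
f = lambda v: v[0] ** 2 + 3.0 * v[1]
ig = integrated_gradients(f, np.array([1.0, 2.0]))
```

In the experiment, f would be one output dimension of the trained encoder and x a 4N-dimensional micro-state, yielding one row of the attribution matrix in Figure 4d.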

To compare the learning and prediction effects of NIS+ and NIS, we assess their generalization abilities by testing their performances on initial conditions that differ from the training dataset. During the simulation for generating training data, the positions of all boids are constrained within a circle of radius r, as depicted in Figure 4a. However, we assess the prediction abilities of both models when the initial positions are located on larger circles. Figure 4b shows the MAEs of NIS+ and NIS, which increase with the radius r; smaller prediction errors indicate better generalization. The results clearly demonstrate NIS+'s superior generalization compared to NIS across all tested radii r.

Furthermore, to examine the impact of intrinsic and observational perturbations on CE, two types of noise are introduced. Intrinsic noise is incorporated into the rule by adding a random turning angle to each boid at each time step. These angles are uniformly distributed within the interval α·[−π, π], where α∈[0,1] is a parameter controlling the magnitude of the intrinsic noise. On the other hand, extrinsic noise is assumed to affect the observational micro-states. In this case, we assume that the micro-states of each boid cannot be directly observed; instead, only noisy data is obtained. The extrinsic or observational noise δ∼𝒩(0, δmax) is added to the micro-states, where δmax is the parameter determining the level of this noise.
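The two noise injections described above can be sketched as follows. This is a hedged illustration of the setup, not the authors' simulation code; in particular, we assume δmax plays the role of the standard deviation of the Gaussian observational noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_intrinsic_noise(angles, alpha):
    """Intrinsic noise: a random turning angle, uniform in alpha*[-pi, pi],
    added to each boid's heading at each time step (modifies the rule)."""
    return angles + alpha * rng.uniform(-np.pi, np.pi, size=angles.shape)

def add_extrinsic_noise(micro_states, delta_max):
    """Extrinsic (observational) noise: Gaussian noise with scale delta_max
    added to the observed micro-states; the dynamics itself is unchanged."""
    return micro_states + rng.normal(0.0, delta_max, size=micro_states.shape)
```

The distinction matters for the experiment: extrinsic noise only corrupts the observation channel (and can in principle be averaged away by coarse-graining), whereas intrinsic noise changes the underlying dynamical rule itself.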

The results are shown in Figure 4f and 4g, where the normalized MAE increases in both cases, indicating that the prediction task becomes more challenging as intrinsic and extrinsic noise increase. However, the differences between these two types of noise can be observed by examining the degree of CE (Δ𝒥). Figure 4f demonstrates that Δ𝒥 increases with the level of extrinsic noise (δmax), suggesting that coarse-graining can mitigate noise within a certain range and enhance causal effects. When δmax<0.1, the normalized MAE is smaller than 0.3 (black dashed horizontal line), satisfying the constraint in Equation LABEL:old_optimization. In this regime, the degree of CE increases with δmax. However, once the threshold of 0.3 is exceeded, even though Δ𝒥 decreases, we cannot draw any meaningful conclusion because the violation of the constraint in Equation LABEL:old_optimization undermines the reliability of the results.

On the other hand, Figure 4g demonstrates that Δ𝒥 decreases as the level of intrinsic noise (α) increases. This can be attributed to the fact that the macro-level dynamics learner attempts to capture the flocking behaviors of each group. As the intrinsic noise increases, the flocking behaviors gradually diminish, leading to a decrease in CE. We have not included cases where α>0.6 because the normalized MAE then exceeds the threshold of 0.3 and the constraint in Equation LABEL:old_optimization is violated. Figure 4e illustrates real trajectories and predictions for random deflection angle noise with α=0.4. In the early stage, the straight-line trend can be predicted, but as the noise-induced deviation gradually accumulates, the error also grows, which intuitively reflects the reduction in CE. For comparison, we also test the same curves for Rosas' Ψ; because all the values are negative with large magnitudes, the results are shown in support information section 12.

Therefore, we conclude that NIS+ learns the optimal macro-dynamics and coarse-graining strategy by maximizing EI. This maximization enhances its generalization ability for cases beyond the training data range. The learned macro-states effectively capture average group behavior and can be attributed to individual positions using the IG method. Additionally, the degree of CE increases with extrinsic noise but decreases with intrinsic noise. This observation suggests that extrinsic noise can be eliminated through coarse-graining, while intrinsic noise cannot. More details about the Boids experiments can be found in support information section 12.

Figure 4: The experimental results of NIS+ on learning the collective flocking behaviors of the Boids model. (a) and (e) present real and predicted boid trajectories under various conditions. Concretely, they present the comparison results for multi-step (50 steps) predictions under the conditions of two separating groups and random deflection angles, with intrinsic noise levels α of 0.001 and 0.4, respectively. (b) showcases the escalation of the mean absolute error (MAE) of multi-step predictions as the radius r, which represents the range of initial positions of boids in (a), extends beyond the limits of the training data. (c) depicts how the dimension-averaged causal emergence (Δ𝒥) changes with the training epochs of NIS+ for different hyperparameters q, which represent the scales of different macro-states. (d) presents the saliency map, which visually depicts the association between each macroscopic dimension and the spatial coordinates of each boid. We highlight the most significant corresponding micro-states for each macro-state dimension with orange dots, determined using the integrated gradients (IG) method applied to our model. The horizontal axis represents the x and y coordinates of the 16 boids in the microscopic state, while the vertical axis represents the 8 macroscopic dimensions. The dodger-blue dashed lines distinguish the coordinates of different individual boids, while the steel-blue solid line separates the boid groups. (f) and (g) show the changes in Δ𝒥 and the normalized MAE under different noise levels: (f) for the extrinsic (observational, added to micro-states) noise (δmax) and (g) for the intrinsic noise (α, added by modifying the dynamical rule of the Boids model). In both (f) and (g), the horizontal lines represent the threshold for the violation of the error constraint in Equation LABEL:old_optimization. When the normalized MAE is larger than the threshold 0.3, the constraint is violated and the results are not reliable.

3.3.   Real fMRI time series data for brains

We test our models on real fMRI time series data of brains for 830 subjects, called AOMIC ID1000 [53]. The fMRI scanning data is collected while the subjects watch the same movie clip. Thus, similar experiences of the subjects under the same natural stimulus are expected, which corresponds to time series of the same dynamics with different initial conditions. We pre-process the raw data with the Schaefer atlas method [54] to reduce the dimensionality of the time series for each subject from roughly 140,000 (it varies among subjects) to 100, such that NIS+ can operate and obtain clearer results. Then, the first 800 time series are selected for training and the remaining 30 for testing. We also compare our results with another fMRI dataset, AOMIC PIOP2 [53], for 50 subjects in the resting state. A further description of the datasets can be found in the Methods section.
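The atlas-based dimensionality reduction above can be sketched as follows. This is a hedged illustration under the assumption that the reduction averages the voxel time series within each atlas parcel (the common use of the Schaefer atlas); the actual preprocessing pipeline is described in the Methods section.

```python
import numpy as np

def parcellate(voxel_ts, labels, n_parcels=100):
    """Average voxel time series within each atlas parcel.

    voxel_ts: (T, V) array of V voxel time series of length T.
    labels:   (V,) integer array assigning each voxel to a parcel in
              [0, n_parcels). Returns a (T, n_parcels) array.
    """
    T, V = voxel_ts.shape
    out = np.zeros((T, n_parcels))
    for p in range(n_parcels):
        mask = labels == p
        if mask.any():
            out[:, p] = voxel_ts[:, mask].mean(axis=1)
    return out
```

Applied to a subject's data, this maps the roughly 140,000 voxel series down to 100 parcel series, which are then used as the micro-states fed to NIS+.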

To demonstrate the predictive capability of NIS+ for micro-states, Figure 5(a) illustrates the changes in normalized MAE with the prediction steps of the micro-dynamics on test data for different hyperparameters q. It is evident that NIS+ performs better in prediction when q=27 and q=1. Specifically, the curve for q=27 exhibits a slower rate of increase than the curve for q=1 as the prediction steps increase. This suggests that selecting the hyperparameter q as 27 may be more suitable than 1.

However, Figure 5(b) suggests a different outcome. When comparing the degree of CE (Δ𝒥) for different hyperparameters q (the green bars), the highest Δ𝒥 is observed when q=1. Conversely, a negative Δ𝒥 value is obtained when q=27. This indicates that the improved prediction results for q=27 may be attributed to overfitting. Thus, q=1 outperforms other values of q in terms of Δ𝒥. This finding is supported by the NIS framework (the red bars), despite a larger standard deviation in Δ𝒥 when q=1. Furthermore, we also compare the results of CE with the resting data and observe that peaks are reached at q=7, which is exactly the number of subsystems in the Schaefer atlas, for both NIS (deep blue bars) and NIS+ (yellow bars). Therefore, we can conclude that when subjects watch movies, the activities in different brain areas can be represented by a single real number at each time step. More analysis of the resting data can be found in support information section 14.1.

To investigate how NIS+ coarse-grains the input data into a single-dimensional macro-state, we also utilize the IG method to identify the most significant dimensions of the micro-state [52]. The results are depicted in Figure 5(c) and (d). We observe that the visual (VIS) sub-networks exhibit the highest attribution (Figure 5(c)). These visual sub-networks represent the functional system that subjects utilize while watching movie clips. Furthermore, we can visualize the active areas in finer detail on the brain map (Figure 5(d)), where darker colors indicate greater attribution to the single macro-state. Therefore, the regions exhibiting similarly dark colors identified by NIS+, which correspond to the deep visual processing brain region, could potentially represent the "synergistic core" [55] when the brain is actively engaged in watching movies. The numerous neurons in these areas may collaborate and function collectively. However, this conclusion should be further confirmed and quantified by decomposing the mutual information between micro-states and macro-states into synergistic, redundant, and unique information [34, 11].

In conclusion, NIS+ demonstrates its capability to learn and coarse-grain the intricate fMRI signals from the brain, allowing for the simulation of complex dynamics using a single macro-state. This discovery reveals that the complex neural activities during movie watching can be encoded by a one-dimensional macro-state, with a primary focus on the regions within the visual (VIS) sub-network. The robustness of our findings is further supported by comparable results obtained through alternative methods for data pre-processing, as demonstrated in the support information section 14.

Figure 5: The learning results, the degree of causal emergence, and the attribution analysis of NIS+ and NIS on the fMRI data for the brains. (a) The mean errors of the multi-step predictions increase with the prediction steps under different scales (q) on the test dataset. (b) Measures of CE (dimension-averaged, Δ𝒥) are compared among different models and different datasets, including movie-watching fMRI (visual fMRI) and resting fMRI. The bars show the averaged results of 10 repeated experiments, and the error bars represent the standard deviations. (c) The average attributions of the sub-networks under the Schaefer atlas, calculated using the integrated gradient (IG) analysis method on the encoder with a scale of q=1, are presented. The error bars represent the standard errors. (d) Attribution maps for movie-watching (visual) fMRI data are displayed. The maps show the left hemisphere viewed from the left, the left hemisphere viewed from the right, the right hemisphere viewed from the right, and the right hemisphere viewed from the left. The right column shows a detailed map of the visual areas: the upper panel shows the left visual areas and the lower panel the right visual areas. The colors represent the normalized absolute values of the integrated gradients.

4.   Concluding Remarks

Inspired by the theory of causal emergence, this paper introduces a novel machine learning framework called Neural Information Squeezer Plus (NIS+) to learn emergent macro-dynamics and suitable coarse-graining methods directly from data. Additionally, it aims to quantify the degree of CE under various conditions.

The distinguishing feature of our framework, compared to other machine learning frameworks, is its focus on maximizing the effective information (EI) of the learned macro-dynamics while maintaining effectiveness constraints. This enables the learned emergent macro-dynamics to capture an invariant causal mechanism that is as independent as possible of the distribution of the input data. This feature not only enables NIS+ to identify CE in data across different environments but also enhances its ability to generalize to environments distinct from the training data. By introducing the error constraint in Equation LABEL:old_optimization, we can address the issues raised in [48] and enhance the robustness of the EI maximization framework. As a result, NIS+ extends Hoel's theory of CE to be applicable to both discrete and continuous dynamical systems, as well as real data.

Several experiments were conducted to evaluate the capabilities of NIS+ in learning, generalization, and quantifying CE directly from data. Furthermore, we applied this framework to the Game of Life (refer to support information section 13). These experiments encompassed three simulation scenarios and one real fMRI dataset of 830 human subjects watching the same movie clip.

The experiments indicate that by maximizing EI, NIS+ outperforms other machine learning models in tasks such as multi-step prediction and pattern capturing, even in environments that were not encountered during the training process. Consequently, NIS+ enables the acquisition of more robust macro-dynamics in the latent space.

Furthermore, the experiments show that NIS+ can quantify CE in a more reasonable way than Rosas' Ψ indicator. With this framework, we can distinguish different scenarios in the data and identify which settings contain more regular patterns, as demonstrated in the experiment conducted on the Game of Life. The experiment on the Boids model also provides insights into how two types of noise can impact the degree of CE. The conclusion is that extrinsic noise may increase CE, while intrinsic noise may decrease it. This indicates that extrinsic noise, arising from observational uncertainty, can be mitigated by the learned coarse-graining strategy. On the other hand, intrinsic noise, stemming from inherent uncertainty in the dynamical rules, cannot be eliminated.

The process of coarse-graining a complex system can be effectively learned by NIS+ and illustrated using the Integrated Gradients (IG) method, as demonstrated in the experiments involving the Boids model and the brain dataset. Through this method, the relationship between macro-states and micro-states can be visualized, allowing for the identification of the most significant variables within the micro-states. In the brain experiment, it was observed that the most crucial information stored in the single macro-state exhibited a strong correlation with the micro-variables of the visual sub-network. This finding highlights the ability of NIS+ to identify an appropriate coarse-graining strategy that aligns with meaningful biological interpretations.

NIS+ holds potential for various applications in data-driven modeling of real complex systems, such as climate systems, collective behaviors, fluid dynamics, brain activities, and traffic flows. By learning more robust macro-dynamics, the predictive capabilities of these systems can be enhanced. For instance, El Niño, which arises from the intricate interplay of oceanic and atmospheric conditions, exemplifies the emergence of a major climatic pattern from underlying factors. Understanding these emergent macro-dynamics can be instrumental in modeling and predicting El Niño events. By leveraging NIS+ to capture and quantify the CE in such complex systems, we can gain valuable insights and improve our ability to forecast their behavior.

Further extensions of the EI maximization method to other problem domains, such as image classification and natural language understanding, in addition to dynamics learning, warrant attention. Specifically, applying this principle to reinforcement learning agents equipped with world models has the potential to enhance performance across various environments. By incorporating the EI maximization approach into these domains, we can potentially improve the ability of models to understand and interpret complex visual and textual information. This can lead to advancements in tasks such as image recognition, object detection, language understanding, and machine translation.

The relationship between causal emergence and causal representation learning (CRL)[41] deserves further exploration. The NIS+ framework can be seen as a form of CRL, where the macro-dynamics serve as the causal mechanism and the coarse-graining strategy acts as the representation. As such, techniques employed in CRL can be applied to find CE in data. Conversely, the concept of CE and coarse-graining can be leveraged in CRL to enhance the interpretability of the model. By integrating the principles and methodologies from both CE and CRL, we can potentially develop more powerful and interpretable models. This synergy can lead to a deeper understanding of the causal relationships within complex systems and enable the extraction of meaningful representations that capture the emergent patterns and causal mechanisms present in the data. Further research in this direction can contribute to advancements in both CE and causal representation learning.

Another interesting merit of NIS+ is its potential contribution to emergence theory by reconciling the debate on whether emergence is an objective concept or an epistemic notion dependent on the observer. By designing a machine to maximize EI, we can extract objective emergent features and dynamics. The machine serves as an observer, but an objective one. Therefore, if the machine observer detects interesting patterns in the data, emergence occurs.

However, there are several limitations in this paper that should be addressed in future studies. Firstly, the requirement of a large amount of training data for NIS+ to learn the macro-dynamics and coarse-graining strategy may not be feasible in many real-world cases. If the training is insufficient, it may lead to incorrect identification of CE. Therefore, it is necessary to incorporate other numerical methods, such as Rosas' ΦID [11], to make accurate judgments. One advantage of NIS+ is its ability to identify coarse-grained macro-states, which can then be used as input for Rosas' method. Secondly, the interpretability of neural networks, particularly for the macro-dynamics learner, remains a challenge. Enhancing the interpretability of the learned models can provide valuable insights into the underlying mechanisms and improve the trustworthiness of the results. Additionally, the learned macro-dynamics in this work are homogeneous, resulting in the same EI value across different locations in the macro-state space. However, in real-world applications, this may not always hold true. Therefore, it is necessary to develop local measures for CE, especially for systems with heterogeneous dynamics. This would facilitate the identification of local emergent patterns, making the analysis more nuanced and insightful. Lastly, the current framework is primarily designed for Markov dynamics, while many real complex systems exhibit long-term memory or involve unobservable variables. Extending the framework to accommodate non-Markov dynamics is an important area for future research. Addressing these limitations and exploring these avenues for improvement will contribute to the advancement of the field and enable the application of NIS+ to a wider range of complex systems.

5.   Methods and Data

To provide a comprehensive understanding of the article, we first introduce Erik Hoel's theory of causal emergence, and then present the details of the NIS and NIS+ frameworks and the other techniques used. After that, the details of the fMRI time series data are introduced.

5.1.   Machine Learning Frameworks

We will introduce the details of the two frameworks NIS and NIS+ in this section.

5.1.1.   Neural Information Squeezer (NIS)

NIS uses neural networks to parameterize all the functions to be optimized in Equation LABEL:old_optimization: the coarse-graining function ϕ and the anti-coarse-graining function ϕ† are called the encoder and decoder, respectively, and the macro-dynamics f_q is called the dynamics learner. Second, considering the symmetric positions of ϕ and ϕ†, an invertible neural network (with the RealNVP framework [56]; details can be found in the following sub-section) is used to reduce the model complexity and to make mathematical analysis possible [12].

Concretely,

ϕ ≡ Proj_q(ψ_ω),   (7)

where ψ_ω: ℛ^p → ℛ^p is an invertible neural network with parameters ω, and Proj_q represents the projection operation that keeps the first q dimensions to form the macro-state 𝒚, while the last p−q dimensional variable 𝒚′ is dropped. Empirically, 𝒚′ can be approximately treated as Gaussian noise independent of 𝒚, or we can force 𝒚′ to be independent Gaussian by training the neural network. Similarly, in a symmetric way, ϕ† can be approximated as follows: for any input 𝒚 ∈ ℛ^q,

ϕ†(𝒚) = ψ_ω⁻¹(𝒚 ⊕ ξ),   (8)

where ξ is a standard Gaussian random vector with p−q dimensions, and ⊕ represents the vector concatenation operation.
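Equations (7) and (8) can be sketched in a few lines. The block below is a toy illustration, not the trained RealNVP network of [56]: `psi` is a single additive coupling layer with a random, untrained parameter matrix `W` (an assumption for demonstration), chosen only because it is invertible in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 2
W = rng.normal(size=(p - q, q))  # toy coupling parameters (untrained)

def psi(x):
    """A minimal invertible map (additive coupling, RealNVP-style):
    the first q dims pass through; the rest are shifted by a function
    of the first q dims, so the map can be inverted exactly."""
    x1, x2 = x[:q], x[q:]
    return np.concatenate([x1, x2 + np.tanh(W @ x1)])

def psi_inv(z):
    z1, z2 = z[:q], z[q:]
    return np.concatenate([z1, z2 - np.tanh(W @ z1)])

def encoder(x):                      # phi = Proj_q(psi(x)), Eq. (7)
    return psi(x)[:q]

def decoder(y):                      # phi_dagger(y) = psi^{-1}(y ⊕ xi), Eq. (8)
    xi = rng.normal(size=p - q)      # xi ~ N(0, I_{p-q})
    return psi_inv(np.concatenate([y, xi]))
```

Note the asymmetry this construction captures: `psi` is exactly invertible, but the encoder is not, because the projection discards the last p−q dimensions; the decoder restores them with fresh Gaussian noise ξ.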

Last, the macro-dynamics f_q can be parameterized by a common feed-forward neural network f_θ with weight parameters θ. The numbers of input and output neurons equal the dimensionality of the macro-state, q. It has two hidden layers, each with 64 neurons, and the outputs are transformed using LeakyReLU. The detailed architectures and parameter settings of ψ_ω and f_θ can be found in the supporting information (section 15). To compute EI, this feed-forward neural network is regarded as a Gaussian distribution, which models the conditional probability p(ŷ_{t+1}|𝒚_t).
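A forward pass matching the stated q → 64 → 64 → q architecture can be sketched as follows. This is a toy numpy version with random, untrained weights (an assumption for illustration); the output is read as the mean of the Gaussian p(ŷ_{t+1}|𝒚_t).

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def make_dynamics_learner(q, hidden=64):
    """Build a toy forward pass for f_theta: q -> 64 -> 64 -> q,
    with LeakyReLU nonlinearities and randomly initialized weights."""
    W1 = rng.normal(scale=0.1, size=(hidden, q)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=(hidden, hidden)); b2 = np.zeros(hidden)
    W3 = rng.normal(scale=0.1, size=(q, hidden)); b3 = np.zeros(q)
    def f_theta(y):
        h = leaky_relu(W1 @ y + b1)
        h = leaky_relu(W2 @ h + b2)
        return W3 @ h + b3  # mean of the Gaussian p(y_{t+1} | y_t)
    return f_theta
```

Treating this output as a Gaussian mean (with some fitted variance) is what allows EI to be computed analytically for the learned macro-dynamics.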

Previous work [12] has proved that this framework possesses many desirable mathematical properties: for example, the macro-dynamics f_θ forms an information bottleneck, and the mutual information between 𝒚_t and ŷ_{t+1} approximates the mutual information between 𝒙_t and 𝒙_{t+1} when the neural networks converge so that the constraint in Equation LABEL:old_optimization is satisfied.

5.1.2.   Neural Information Squeezer Plus (NIS+)

To maximize the EI defined in Equation LABEL:old_optimization, we extend the NIS framework to form NIS+. In NIS+, we first use the formula of mutual information and a variational inequality to convert the maximization of mutual information into a machine learning problem. Second, we introduce a neural network g_θ′ to learn the inverse macro-dynamics, that is, to use 𝒚_{t+1} = ϕ(𝒙_{t+1}) to predict 𝒚_t, so that the maximization of mutual information can be guaranteed. Third, the probability re-weighting technique is employed to address the challenge of computing the intervention as a uniform distribution, thereby optimizing the EI. All of these techniques compose Neural Information Squeezer Plus (NIS+).

Formally, the maximization problem under the inequality constraint defined by Equation LABEL:old_optimization can be converted into the minimization of an unconstrained loss function, that is:

min_{ω,θ,θ′} Σ_{t=1}^{T−1} w(𝒙_t)‖𝒚_t − g_{θ′}(𝒚_{t+1})‖ + λ‖x̂_{t+1} − 𝒙_{t+1}‖,   (9)

where ω, θ, θ′ are the parameters of the neural networks ψ_ω, f_θ, and g_θ′, respectively; 𝒚_t = ϕ(𝒙_t) = Proj_q(ψ_ω(𝒙_t)) and 𝒚_{t+1} = ϕ(𝒙_{t+1}) = Proj_q(ψ_ω(𝒙_{t+1})) are the macro-states; λ is a Lagrangian multiplier, which is treated as a hyper-parameter in the experiments; and w(𝒙_t) are the inverse probability weights, defined as:

w(𝒙_t) = p̃(𝒚_t)/p(𝒚_t) = p̃(ϕ(𝒙_t))/p(ϕ(𝒙_t)),   (10)

where p̃ is the new distribution of the macro-states 𝒚_t after the intervention do(𝒚_t ∼ U_q), and p is the natural distribution of the data. In practice, p(𝒚_t) is estimated through kernel density estimation (KDE) [49] (see Methods). The target distribution p̃(𝒚_t) is assumed to be uniform, i.e., a constant. Consequently, the weight w is computed as the ratio of these two distributions.
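The re-weighted objective of Equation (9) can be sketched numerically as follows. This is a simplified illustration, not the implementation of [49]: the KDE uses a fixed, hand-picked bandwidth, p̃ is treated as a constant (uniform) so that w ∝ 1/p(𝒚_t), and the weights are normalized to unit mean for stability (an assumption of this sketch).

```python
import numpy as np

def gaussian_kde(samples, query, bw=0.3):
    """Simple Gaussian KDE estimate of the density p(y) at each query
    point, with a fixed isotropic bandwidth bw (an assumption here)."""
    d2 = ((query[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    dim = samples.shape[1]
    return np.exp(-d2 / (2 * bw ** 2)).mean(1) / (np.sqrt(2 * np.pi) * bw) ** dim

def nis_plus_loss(y_t, g_of_y_next, x_next, x_hat_next, lam=1.0):
    """Eq. (9) as a sketch: inverse-probability-weighted reverse
    prediction error plus lambda times the forward reconstruction error.
    Since p~(y) is uniform (constant), w is proportional to 1/p(y_t)."""
    p = gaussian_kde(y_t, y_t)
    w = 1.0 / p
    w = w / w.mean()  # normalize the weights (sketch-level choice)
    l1 = (w * np.linalg.norm(y_t - g_of_y_next, axis=1)).sum()
    l2 = np.linalg.norm(x_hat_next - x_next, axis=1).sum()
    return l1 + lam * l2
```

Intuitively, macro-states in low-density regions of the data receive larger weights, which imitates sampling 𝒚_t from the uniform interventional distribution required by the EI definition.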

We can prove the following mathematical theorem, which guarantees this problem transformation:

Theorem 5.1 (Problem Transformation Theorem).

For a given value of q, assume that ω∗, θ∗, and θ′∗ are the optimal solutions to the unconstrained objective optimization problem defined by Equation (9). Then ϕ∗ ≡ Proj_q(ψ_{ω∗}), f_q∗ ≡ f_{θ∗}, and ϕ†,∗(·) ≡ ψ_{ω∗}⁻¹(· ⊕ ξ), where ξ ∼ 𝒩(0, I_{p−q}), form the optimal solution to the constrained objective optimization problem Eq.(LABEL:old_optimization).

The detailed proof can be found in support information section 10.1.

Figure 6: The frameworks of NIS (a) and NIS+ (b). The boxes in the diagram represent functions or neural networks, while the arrow pointing to a cross signifies the operation of information dropping. 𝒙_t and 𝒙_{t+1} represent the observational data of micro-states, and x̂_{t+1} represents the predicted micro-state. The macro-states, denoted as 𝒚_t = ϕ(𝒙_t) and 𝒚_{t+1} = ϕ(𝒙_{t+1}), are obtained by encoding the micro-states using the encoder. Similarly, the predicted macro-states, ŷ_t = ϕ(x̂_t) and ŷ_{t+1} = ϕ(x̂_{t+1}), are obtained by encoding the predictions of micro-states.

In contrast to NIS (Figure 6(a)), the optimization problem defined in Equation 9 implies a new computational graph for NIS+, which is depicted in Figure 6(b). For a given pair of data 𝒙_t, 𝒙_{t+1}, two dynamics, namely the forward dynamics f_θ and the reverse dynamics g_θ′, are trained simultaneously by optimizing two objective functions: ℒ_1 = Σ_{t=1}^{T−1} w(𝒙_t)‖𝒚_t − g_{θ′}(𝒚_{t+1})‖ and ℒ_2 = Σ_{t=1}^{T−1} ‖x̂_{t+1} − 𝒙_{t+1}‖. In the training process, the encoder ϕ and the decoder ϕ† are shared.

This novel computational framework of NIS+ can realize the maximization of EI in the coarse-grained emergent space. Therefore, it can optimize an independent causal mechanism (f) represented by f_θ on the emergent space. It can also be used to quantify CE in raw data once the macro-dynamics f_θ is obtained for different q.

5.1.3.   Extensions for Practical Computations

When we apply NIS+ to data generated by various complex systems such as cellular automata and multi-agent systems, the architecture of the encoder (decoder) should be extended with stacked and parallel structures. Fortunately, NIS+ is very flexible to different real scenarios, which means that important properties such as Theorem 5.1 can be retained.

Figure 7: Three extended structures of the NIS+ encoder, denoted as (a) and (b), as well as the complete structure (c), are designed to facilitate the implementation of complex tasks. (a) represents a stacked encoder, where basic encoders are layered one on top of another. This arrangement allows for the filtering of input information in a hierarchical manner. (b) showcases a parallel encoder, where multiple basic encoders with shared parameters are combined to form a larger encoder. This parallel configuration enables the processing of input information in parallel, enhancing efficiency. (c) illustrates a multistory structure of the complete NIS+ framework. In this structure, the encoders and decoders are stacked, and the dynamics learners in different layers operate in parallel. This design facilitates simultaneous training for different dynamics learners with distinct dimensions and enables parallel searching for the optimal dimension qq.

Firstly, when dealing with high-dimensional complex systems, discarding many dimensions at once can pose challenges for training neural networks. Therefore, we replace the encoder in NIS+ with a stacked encoder. As shown in Figure 7(a), this improvement involves stacking a series of basic encoders together and gradually discarding dimensions, thereby reducing the training difficulty.
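The stacked encoder above amounts to composing basic encoders that each drop a few dimensions. A minimal sketch, with hypothetical projection functions standing in for the trained invertible-network-plus-projection blocks:

```python
import numpy as np

def stacked_encode(x, basic_encoders):
    """Pass the micro-state through a chain of basic encoders, each of
    which discards some dimensions, so the reduction is gradual."""
    for enc in basic_encoders:
        x = enc(x)
    return x

# hypothetical basic encoders: simple projections used only to show the
# 8 -> 4 -> 2 gradual dimension reduction
chain = [lambda v: v[:4], lambda v: v[:2]]
y = stacked_encode(np.arange(8.0), chain)
```

Each stage's output is an intermediate-scale representation, which is what the multistory NIS+ below reuses to train dynamics learners at several scales at once.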

In addition, complex systems like cellular automata often comprise multiple components with similar dynamical rules. Therefore, we introduce the parallel encoder to represent each component or group of components and combine the results. Concretely, as shown in Figure 7(b), inputs may be grouped based on their physical relationships (e.g., adjacent neighborhoods), and each group is encoded using a basic encoder. By sharing parameters among the encoders, the neural networks can capture homogeneous coarse-graining rules efficiently and accurately. Finally, the macro-variables obtained from all the encoders are concatenated into a vector to derive the overall macro-variables. Furthermore, we can combine this parallel structure with other well-known architectures such as convolutional neural networks.
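The parallel encoder can be sketched as one shared basic encoder applied to each group, with the group outputs concatenated. The mean-based `enc` below is a hypothetical stand-in for the shared, trained basic encoder.

```python
import numpy as np

def parallel_encode(x, group_size, basic_encoder):
    """Split the micro-state into equal-sized groups, apply one shared
    basic encoder to every group (parameter sharing), and concatenate
    the resulting macro-variables into one macro-state vector."""
    groups = x.reshape(-1, group_size)
    return np.concatenate([basic_encoder(g) for g in groups])

# hypothetical shared basic encoder: each group is summarized by its mean
enc = lambda g: np.array([g.mean()])
y = parallel_encode(np.arange(8.0), 4, enc)
```

Sharing one encoder across groups is what encodes the assumption of homogeneous local rules, and it keeps the number of parameters independent of the number of components.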

To enhance the efficiency of searching for the optimal scale, we leverage the multiple scales of hidden space obtained through the stacked encoder by training multiple dynamics learners at different scales simultaneously. This forms the framework of multistory NIS+, as shown in Figure 7(c). This approach is equivalent to searching for macro-dynamics under different q, thereby avoiding retraining the encoders.

Theorem 5.2 guarantees that important properties such as Theorem 5.1 are retained for the extensions with stacked and parallel encoders. In addition, a new universal approximation theorem (Theorem 5.3) can be proved, showing that multiple stacked encoders can approximate any coarse-graining function (any map defined on ℛ^p × ℛ^q).

Theorem 5.2 (Problem Transformation in Extensions of NIS+ Theorem).

When the encoder of NIS+ is replaced with an arbitrary combination of stacked encoders and parallel encoders, the conclusion of Theorem 2.1 still holds true.

Theorem 5.3 (Universal Approximating Theorem of Stacked Encoder).

For any continuous function ff which is defined on K×pK\times\mathcal{R}^{p}, where KpK\in\mathcal{R}^{p} is a compact set, and p>q𝒵+p>q\in\mathcal{Z^{+}}, there exists an integer ss and an extended stacked encoder ϕp,s,q:pq\phi_{p,s,q}:\mathcal{R}^{p}\rightarrow\mathcal{R}^{q} with ss hidden size and an expansion operator ηp,s\eta_{p,s} such that:

ϕp,s,qf,\phi_{p,s,q}\simeq f, (11)

Therefore, an extended stacked encoder with expansion operators has the universal approximation property, meaning that it can approximate any coarse-graining function defined on p×q\mathcal{R}^{p}\times\mathcal{R}^{q}.

The extended stacked encoder ϕp,s,q\phi_{p,s,q} refers to stacking two basic encoders together and gradually reducing dimensions, encoding the input from pp dimensions to qq dimensions with an intermediate dimension of ss. Besides basic operations such as invertible mapping and projection, a new operation, vector expansion (ηp,s\eta_{p,s}, e.g., η3,5(1,2,3)=(1,2,3,1,2)\eta_{3,5}(1,2,3)=(1,2,3,1,2)), must be introduced between the two encoders. The proofs of these theorems can be found in Support Information Sections 10.2 and 10.3.
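The expansion operator can be sketched in a few lines of Python (the `expand` helper below is illustrative, not taken from our released code): it cyclically repeats the entries of the input vector until the target dimension is reached.

```python
import numpy as np

def expand(x, s):
    """Vector expansion operator eta: cyclically repeat the entries of a vector
    until it has s dimensions, e.g. (1, 2, 3) -> (1, 2, 3, 1, 2) for s = 5."""
    x = np.asarray(x)
    p = x.shape[0]
    assert s >= p, "expansion only increases the dimension"
    reps = int(np.ceil(s / p))
    return np.tile(x, reps)[:s]

print(expand([1, 2, 3], 5))  # [1 2 3 1 2]
```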

5.1.4.   Training NIS+

In practice, training of NIS+ and all its extensions is separated into two stages. The first stage trains only the forward neural networks (the upper information channel shown in Figure 2) until the loss 1\mathcal{L}_{1} is small enough. The second stage then trains only the neural networks for the reverse dynamics (the lower information channel shown in Figure 2). Because the trained inverse dynamics gθg_{\theta^{\prime}} is never used afterwards, the second stage is, in essence, for training the encoder ϕω\phi_{\omega}.

6.   Acknowledgement

The authors would like to acknowledge all the members who participated in the "Causal Emergence Reading Group" organized by Swarma Research and the "Swarma-Kaifeng" Workshop. We are especially grateful to Professors Yizhuang You, Peng Cui, and Jing He, and to Dr. Chaochao Lu and Dr. Yanbo Zhang, for their invaluable contributions and insightful discussions throughout the research process. We also thank the large language model ChatGPT for its assistance in refining our paper writing.

Support Information

1.   A Brief Introduction of Causal Emergence Theory

The basic idea of Hoel’s theory is that we have two different views (micro and macro) of the same Markov dynamical system, and CE occurs if the macro-dynamics has a stronger causal effect than the micro-dynamics. As shown in Figure 1, for example, the micro-scale system is composed of many colliding balls, and the macro-scale system is a coarse-grained description of the colliding balls in which each colored box represents the number of balls within it. In Figure 1, the vertical axis represents the scale (micro and macro), and the horizontal axis represents time, depicting the evolution of the system’s dynamics at both scales. All the dynamical systems considered are Markovian; therefore, two temporally successive snapshots are sufficient to characterize them. The dynamics at both the micro- and macro-scales are represented by transition probability matrices (TPMs): fmf_{m} for the micro-dynamics and fMf_{M} for the macro-dynamics.

We can measure the strength of the causal effect of each dynamics (TPM) by effective information (EI). EI can be understood as an intervention-based version of the mutual information between two successive states of a dynamical system. For any TPM ff, EI can be calculated by:

𝒥=I(St;St1|do(St1U(S)))=1Ni,j=1N(fijlogfijfijlog(k=1NfkjN)),\mathcal{J}=I(S_{t};S_{t-1}|do(S_{t-1}\sim U(S)))=\frac{1}{N}\sum_{i,j=1}^{N}\left(f_{ij}\log f_{ij}-f_{ij}\log\left(\frac{\sum_{k=1}^{N}f_{kj}}{N}\right)\right), (1)

where StS_{t} is the random variable representing the state of the system at time tt, and do(St1U(S))do(S_{t-1}\sim U(S)) denotes the do-operator [39] that intervenes on the state of the system at time t1t-1, forcing St1S_{t-1} to follow a uniform distribution over the value space SS of StS_{t}. fijf_{ij} is the probability at the iith row and jjth column, and N=|S|N=|S| is the total number of states, which is also the number of rows or columns. log\log denotes the logarithm with base 2. Equation 1 can be further decomposed into two terms, determinism and non-degeneracy, which characterize how the future state is determined by the past state and vice versa, respectively [25].

With EI, we can compare two TPMs. If 𝒥(fM)\mathcal{J}(f_{M}) is larger than 𝒥(fm)\mathcal{J}(f_{m}), we say that CE occurs in the dynamical system. In practice, we calculate a quantity called causal emergence to judge whether CE takes place:

Δ𝒥=𝒥(fM)𝒥(fm).\Delta\mathcal{J}=\mathcal{J}(f_{M})-\mathcal{J}(f_{m}). (2)

Therefore, if Δ𝒥>0\Delta\mathcal{J}>0, then CE occurs, and Δ𝒥\Delta\mathcal{J} can be treated as a quantitative measure of emergence for Markov dynamics. An example of CE in a Markov chain is shown in Figure 1(b).
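Equations 1 and 2 can be sketched in a few lines of Python (the example TPMs below are illustrative, not taken from Figure 1):

```python
import numpy as np

def effective_information(tpm):
    """EI of a transition probability matrix (rows: current state, columns: next
    state), following Eq. 1 with base-2 logarithms."""
    f = np.asarray(tpm, dtype=float)
    n = f.shape[0]
    avg = f.mean(axis=0)  # (1/N) * sum_k f_kj: the effect distribution under do(U)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(f > 0, f * (np.log2(f) - np.log2(avg)), 0.0)
    return terms.sum() / n  # average KL divergence of each row from the mean row

# Micro: three nearly interchangeable states plus one fixed state;
# macro: a coarse-grained two-state TPM (a hypothetical example).
f_m = np.array([[1/3, 1/3, 1/3, 0],
                [1/3, 1/3, 1/3, 0],
                [1/3, 1/3, 1/3, 0],
                [0,   0,   0,   1]])
f_M = np.array([[1, 0],
                [0, 1]])
delta_J = effective_information(f_M) - effective_information(f_m)  # Eq. 2
print(delta_J > 0)  # True: CE occurs for this coarse-graining
```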

2.   Measures and calculations of EI and causal emergence for neural networks

The EI of a discrete Markov chain can be calculated by Equation 1; however, this is not suitable for neural networks, because a neural network is a deterministic function. Therefore, we need to extend the definition of EI to general neural networks. We continue to use the method defined in [12]. The basic idea is to treat a neural network as a Gaussian distribution conditional on the input vector, with the mean given by the deterministic function of the input values and the standard deviations given by the mean square errors of the neural network on the training or test datasets.

In general, suppose the input of a neural network is X=(x1,x2,,xn)[L,L]nX=(x_{1},x_{2},\cdot\cdot\cdot,x_{n})\in[-L,L]^{n}, which means XX is defined on a hyper-cube of size LL, where LL is a very large integer. The output is Y=(y1,y2,,ym)Y=(y_{1},y_{2},\cdot\cdot\cdot,y_{m}), with Y=μ(X)Y=\mu(X). Here μ\mu is the deterministic mapping implemented by the neural network, μ:nm\mu:\mathcal{R}^{n}\rightarrow\mathcal{R}^{m}, and its Jacobian matrix at XX is Xμ(X){μi(X)Xj|}X=Xnm\partial_{X^{\prime}}\mu(X)\equiv\left\{\frac{\partial\mu_{i}(X^{\prime})}{\partial X^{\prime}_{j}}\left|{}_{X^{\prime}=X}\right.\right\}_{nm}. If the neural network can be regarded as a Gaussian distribution conditional on a given XX, then its effective information (EI) can be calculated in the following way:

EIL(μ)=I(do(XU([L,L]n));Y)\displaystyle EI_{L}(\mu)=I(do(X\sim U([-L,L]^{n}));Y)\approx m+mln(2π)+i=1mσi22\displaystyle-\frac{m+m\ln(2\pi)+\sum_{i=1}^{m}\sigma_{i}^{2}}{2} (3)
+nln(2L)+𝐄XU([L,L]n)(ln|det(Xμ(X))|).\displaystyle+n\ln(2L)+\mathbf{E}_{X\sim U([-L,L]^{n})}\left(\ln|\det(\partial_{X^{\prime}}\mu(X))|\right).

where Σ=diag(σ12,σ22,,σm2)\Sigma=diag(\sigma_{1}^{2},\sigma_{2}^{2},\cdot\cdot\cdot,\sigma_{m}^{2}) is the covariance matrix, and σi\sigma_{i} is the standard deviation of the output yiy_{i}, which can be estimated by the mean square error of yiy_{i} under the given xix_{i}. U([L,L]n)U([-L,L]^{n}) is the uniform distribution on [L,L]n[-L,L]^{n}, |||\cdot| denotes the absolute value, and det\det the determinant. If det(Xμ(X))0\det(\partial_{X^{\prime}}\mu(X))\equiv 0 for all XX, then we set EI0EI\approx 0.

However, Equation 3 cannot be applied directly in real cases because it increases as the dimension of the input nn or output mm increases [12]; a larger EI is therefore always expected for models with higher dimensions. The solution is to divide by the input dimension, defining the dimension-averaged effective information (dEI), denoted 𝒥\mathcal{J}:

𝒥L=EIL(μ)n\mathcal{J}_{L}=\frac{EI_{L}(\mu)}{n} (4)

When the numbers of input and output dimensions are identical (m=nm=n; this condition holds for all the results reported in the main text), Equation 3 becomes:

𝒥L(μ)=1+ln(2π)+i=1nσi2/n2+ln(2L)+1n𝐄XU([L,L]n)(ln|det(Xμ(X))|).\mathcal{J}_{L}(\mu)=-\frac{1+\ln(2\pi)+\sum_{i=1}^{n}\sigma_{i}^{2}/n}{2}+\ln(2L)+\frac{1}{n}\mathbf{E}_{X\sim U([-L,L]^{n})}\left(\ln|\det(\partial_{X^{\prime}}\mu(X))|\right). (5)

We always use this measure to quantify the EI of neural networks in the main text. However, this measure is still not perfect because it depends on a free parameter LL, the range of the domain of the input data. Our solution is to calculate the dimension-averaged CE to eliminate the influence of LL. For the macro-dynamics fMf_{M} with dimension qq and the micro-dynamics fmf_{m} with dimension pp, we define the dimension-averaged CE as:

Δ𝒥L(fM,fm)=𝒥L(fM)𝒥L(fm)=EIL(fM)qEIL(fm)p.\Delta\mathcal{J}_{L}(f_{M},f_{m})=\mathcal{J}_{L}(f_{M})-\mathcal{J}_{L}(f_{m})=\frac{EI_{L}(f_{M})}{q}-\frac{EI_{L}(f_{m})}{p}. (6)

If fMf_{M} and fmf_{m} are parameterized by neural networks of μM\mu_{M} with dimension qq and μm\mu_{m} with dimension pp, then

Δ𝒥=\displaystyle\Delta\mathcal{J}= (1q𝐄XMln|detXMμM|1p𝐄Xmln|detXmμm|)\displaystyle\left(\frac{1}{q}\mathbf{E}_{X_{M}}\ln|\det\partial_{X_{M}}\mu_{M}|-\frac{1}{p}\mathbf{E}_{X_{m}}\ln|\det\partial_{X_{m}}\mu_{m}|\right) (7)
(1qi=1qlnσi,M21pi=1plnσi,m2),\displaystyle-\left(\frac{1}{q}\sum_{i=1}^{q}\ln\sigma_{i,M}^{2}-\frac{1}{p}\sum_{i=1}^{p}\ln\sigma_{i,m}^{2}\right),

where σi,M\sigma_{i,M} and σi,m\sigma_{i,m} are the standard deviations of μM\mu_{M} and μm\mu_{m} on the iith dimension, respectively. At this point, the influences of the input and output dimensions and of the parameter LL have been completely eliminated, making this a more reliable indicator. Therefore, the comparisons between the learned macro-dynamics reported in the main text are reliable because the dimension-averaged CE measure is used.

In practice, the first step is to select a sufficiently large value of LL to ensure that all macro-states are encompassed within the region [L,L]q[-L,L]^{q}. The determinants of the Jacobian matrices can then be estimated using Monte Carlo integration by sampling data points within this region. The standard deviations can be estimated on a test dataset, and the probability reweighting technique is employed to simulate the intervention that enforces a uniform distribution.
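As a minimal sketch of this Monte Carlo procedure (a toy deterministic dynamics with a finite-difference Jacobian; in practice the Jacobian would come from automatic differentiation), the expectation term in Equation 5 can be approximated as:

```python
import numpy as np

def log_det_jacobian(mu, x, eps=1e-6):
    """Finite-difference Jacobian of mu at x, then ln|det J|."""
    n = x.size
    jac = np.empty((n, n))
    for j in range(n):
        dx = np.zeros(n); dx[j] = eps
        jac[:, j] = (mu(x + dx) - mu(x - dx)) / (2 * eps)
    sign, logdet = np.linalg.slogdet(jac)
    return logdet  # -inf if the Jacobian is singular

def expected_log_det(mu, n, L, num_samples=100, rng=None):
    """Monte Carlo estimate of E_{X ~ U([-L, L]^n)} ln|det(d mu / dX)|."""
    rng = rng or np.random.default_rng(0)
    xs = rng.uniform(-L, L, size=(num_samples, n))
    return np.mean([log_det_jacobian(mu, x) for x in xs])

# Toy linear dynamics mu(x) = A x: the true value is ln|det A| = ln 1 = 0 everywhere.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
est = expected_log_det(lambda x: A @ x, n=2, L=1.0)
print(np.isclose(est, 0.0, atol=1e-4))  # True
```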

3.   Invertible neural network in the encoder (decoder)

In NIS, both the encoder (ϕ\phi) and the decoder (ϕ\phi^{{\dagger}}) use the invertible neural network module RealNVP [56].

Figure 1: Structure diagram of RealNVP.

Concretely, if the input of the module is 𝐱\mathbf{x} with dimension pp and the output is 𝐱\mathbf{x^{\prime}} with the same dimension, then the RealNVP module can perform the following computation steps:

{𝐱𝟏=𝐱1:m𝐱𝟐=𝐱m:p\left\{\begin{aligned} &\mathbf{x_{1}}=\mathbf{x}_{1:m}\\ &\mathbf{x_{2}}=\mathbf{x}_{m:p}\end{aligned}\right. (8)

where mm is an integer between 1 and pp.

{𝐱𝟏=𝐱𝟏s1(𝐱𝟐)+t1(𝐱𝟐)𝐱𝟐=𝐱𝟐s2(𝐱𝟏)+t2(𝐱𝟏)\left\{\begin{aligned} &\mathbf{x^{\prime}_{1}}=\mathbf{x_{1}}\bigotimes s_{1}(\mathbf{x_{2}})+t_{1}(\mathbf{x_{2}})\\ &\mathbf{x^{\prime}_{2}}=\mathbf{x_{2}}\bigotimes s_{2}(\mathbf{x^{\prime}_{1}})+t_{2}(\mathbf{x^{\prime}_{1}})\end{aligned}\right. (9)

where \bigotimes denotes element-wise multiplication, and s1,s2,t1s_{1},s_{2},t_{1} and t2t_{2} are feed-forward neural networks with arbitrary architectures, provided their input-output dimensions match the data. For the experiments in this article, each has two hidden layers. The input and output layers have the same number of neurons as the dimension of the micro-state samples, each hidden layer has 64 neurons, and the output of each hidden layer is transformed by the nonlinear activation function LeakyReLU. In practice, s1s_{1} and s2s_{2} always apply an exponential operation to the output of the feed-forward neural network to facilitate the inverse computation.

Finally,

𝐱=𝐱𝟏𝐱𝟐\mathbf{x^{\prime}}=\mathbf{x^{\prime}_{1}}\bigoplus\mathbf{x^{\prime}_{2}} (10)

It is not difficult to verify that all three steps are invertible. Equation 9 is invertible because solving it for 𝐱𝟏\mathbf{x_{1}} and 𝐱𝟐\mathbf{x_{2}} in terms of 𝐱𝟏\mathbf{x^{\prime}_{1}} and 𝐱𝟐\mathbf{x^{\prime}_{2}} yields expressions of the same form, with subtraction and division replacing addition and multiplication.
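The three steps and their inverses can be sketched as follows (simple closed-form functions stand in for the trained networks s1, s2, t1, t2; only the coupling structure matters here):

```python
import numpy as np

# Toy scale/translation nets; in NIS these are two-hidden-layer MLPs.
def s1(x): return np.exp(np.tanh(x))  # exp(...) keeps the scale positive, easing inversion
def t1(x): return 0.5 * x
def s2(x): return np.exp(np.tanh(x))
def t2(x): return -0.3 * x

def coupling_forward(x, m):
    x1, x2 = x[:m], x[m:]                 # Eq. 8: split the input
    y1 = x1 * s1(x2) + t1(x2)             # Eq. 9, first half
    y2 = x2 * s2(y1) + t2(y1)             # Eq. 9, second half, conditioned on y1
    return np.concatenate([y1, y2])       # Eq. 10: concatenation

def coupling_inverse(y, m):
    y1, y2 = y[:m], y[m:]
    x2 = (y2 - t2(y1)) / s2(y1)           # same form, subtraction and division
    x1 = (y1 - t1(x2)) / s1(x2)
    return np.concatenate([x1, x2])

x = np.array([0.2, -1.0, 0.7, 1.5])
y = coupling_forward(x, m=2)
print(np.allclose(coupling_inverse(y, m=2), x))  # True: the block is exactly invertible
```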

To simulate more complex invertible functions, we stack multiple basic RealNVP modules together. In the main text, we stack three basic RealNVP modules.

Due to the reversibility of RealNVP, feeding its output through the inverse computation recovers the original input. Therefore, the neural networks in the encoder and decoder can share parameters. Encoding to obtain the macro-state involves projecting the forward output and retaining only the first few dimensions, while decoding to obtain the micro-state involves concatenating the macro-state with normally distributed noise to match the micro-state dimension and then feeding the result backward through the entire invertible neural network.

4.   KDE for probability reweighting

To use the inverse probability weighting technique, we need to estimate the probability distribution of the samples. KDE (kernel density estimation) is a commonly used method that can effectively reduce the influence of outliers on the estimated distribution. In this experiment, we estimate the probability distribution of the macro-states 𝒚t\boldsymbol{y}_{t}. Below are the operational details of KDE. For a sample (x1,x2,,xn)(x_{1},x_{2},\cdot\cdot\cdot,x_{n}), the kernel density estimate is:

f^h(x)=1nhi=1nK(xxih).\displaystyle\hat{f}_{h}(x)=\frac{1}{nh}\sum_{i=1}^{n}K(\frac{x-x_{i}}{h}). (11)

Here, nn is the number of samples, and the hyperparameter hh is the bandwidth, which is chosen based on the rough range of the data and typically set to 0.05 in this article. KK is the kernel, specifically the standard normal density function. After obtaining the estimate, we evaluate it at each sample point to obtain the corresponding probability value. Dividing the probability of the target distribution (a uniform distribution) by the estimated probability value gives the inverse probability weight of each sample point. To cover all sample points, the range of the uniform distribution is limited by a parameter LL, chosen so that a square with side length 2L2L encompasses all the sample points in every dimension.

Note that the weights obtained at this point depend on the sample size, so they need to be normalized. However, normalization results in very small weight values, so we multiply them by the number of samples to restore them to a normal scale. Each weight is then multiplied by the training loss of the corresponding sample to strengthen training in sparse regions, achieving the purpose of inverse probability weighting. Since the encoder parameters change with each iteration, causing the distribution of 𝒚t\boldsymbol{y}_{t} to change, we re-estimate the probability distribution of the whole sample using KDE every 1,000 epochs (out of at least 30,000 epochs in total; the number of iterations may vary between experiments).
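The reweighting procedure can be sketched as follows (one-dimensional macro-states for clarity; the helper names are illustrative):

```python
import numpy as np

def kde_pdf(samples, points, h=0.05):
    """Gaussian kernel density estimate (Eq. 11) evaluated at `points`."""
    diffs = (points[:, None] - samples[None, :]) / h       # pairwise (x - x_i) / h
    kernel = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)  # standard normal kernel K
    return kernel.mean(axis=1) / h                         # (1/(n h)) * sum_i K(...)

def inverse_probability_weights(samples, L, h=0.05):
    """Weights proportional to uniform(target) / KDE(estimate), rescaled as in the text."""
    p_hat = kde_pdf(samples, samples, h)
    target = 1.0 / (2 * L)                 # uniform density on [-L, L]
    w = target / p_hat
    return w / w.sum() * len(samples)      # normalize, then restore the scale by n

rng = np.random.default_rng(0)
y = rng.normal(0.0, 0.3, size=500)         # macro-states clustered near 0
w = inverse_probability_weights(y, L=1.0)
# Sparse regions (large |y|) receive larger weights than the dense region near 0.
print(w[np.abs(y).argmax()] > w[np.abs(y).argmin()])  # True
```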

5.   Calculation method of Rosas’ Ψ\Psi

We present here the method for computing Rosas’ Ψ\Psi, a measurable indicator of CE proposed by Rosas [11]. It is defined as follows:

Ψ(Y)=I(Yt,Yt)jI(Xtj,Yt),\displaystyle\Psi(Y)=I(Y_{t},Y_{t^{\prime}})-\sum_{j}I(X_{t}^{j},Y_{t^{\prime}}), (12)

where YY represents the given macro-state, YtY_{t} and YtY_{t^{\prime}} represent the macro-states at two consecutive time points, and XtjX_{t}^{j} represents the jj-th dimension of the micro-state at time tt. According to Rosas’ theory, Rosas’ Ψ\Psi being greater than 0 is a sufficient but not necessary condition for emergence to occur. In this paper, we employ the method proposed by Lu et al. [57] to compute the mutual information between continuous variables in our experimental configuration.
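As a simplified sketch of Equation 12 (replacing the estimator of Lu et al. [57] with a closed-form joint-Gaussian approximation of mutual information, which is only valid for roughly Gaussian data):

```python
import numpy as np

def gaussian_mi(a, b):
    """Mutual information (in nats) between two 1-d variables under a
    joint-Gaussian approximation: I = -0.5 * ln(1 - rho^2)."""
    rho = np.corrcoef(a, b)[0, 1]
    return -0.5 * np.log(1 - rho**2)

def rosas_psi(X_t, X_next, macro):
    """Psi(Y) = I(Y_t; Y_t') - sum_j I(X_t^j; Y_t')  (Eq. 12), with Y = macro(X)."""
    y_t, y_next = macro(X_t), macro(X_next)
    psi = gaussian_mi(y_t, y_next)
    for j in range(X_t.shape[1]):
        psi -= gaussian_mi(X_t[:, j], y_next)
    return psi

# Toy system: the mean of the micro-variables is (noisily) conserved, so the mean
# carries predictive information that no single micro-dimension has on its own.
rng = np.random.default_rng(1)
X_t = rng.normal(size=(5000, 4))
X_next = X_t.mean(axis=1, keepdims=True) + 0.1 * rng.normal(size=(5000, 4))
psi = rosas_psi(X_t, X_next, macro=lambda X: X.mean(axis=1))
print(psi > 0)  # True: the macro-state exhibits emergence by this criterion
```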

6.   Comparison models in the SIR experiment

To evaluate the advantages of the NIS+ framework itself, we constructed a regular fully connected neural network (NN) with 33,668 parameters and a variational autoencoder (VAE) with 34,956 parameters for the SIR experiment (the NIS+ model has a total of 37,404 parameters, including the inverse dynamics component). The NN consists of an input layer, an output layer, and four hidden layers, with 4, 64, 128, 128, 64, and 4 neurons, respectively. NN+ refers to the NN combined with the inverse probability weighting technique, with the same architecture. For each data point 𝒙t\boldsymbol{x}_{t}, we estimate the KDE probability distribution and calculate the inverse probability weights. Finally, the weights are multiplied with the loss so that samples with different densities receive different training weights.

The VAE is a generative model that maps micro-level data to a normal distribution in latent space; during each prediction, it samples from this distribution and then decodes the sample [58]. It consists of an encoder and a decoder, each with 4 hidden layers of 64 neurons. The input and predicted output are both four-dimensional variables, and the latent space also has four dimensions, two for the mean and two for the variance. The macro-dynamics learner fits the dynamical changes of the two-dimensional mean variables. During training, the loss function consists of two parts: the reconstruction error, i.e., the mean absolute error (MAE) between the predicted output and the true label, and the KL divergence between the latent-space distribution and the standard normal distribution. These two parts are combined in a 2:1 ratio for gradient backpropagation. VAE+ combines inverse probability weighting and bidirectional dynamics learning in the training of the VAE, adopting the same training method as NIS+. We keep the variance variables obtained from encoding unchanged and use inverse probability weighting to obtain weight factors for the mean variables. Additionally, a reverse dynamics learner is constructed to predict 𝒚t\boldsymbol{y}_{t} given 𝒚t+1\boldsymbol{y}_{t+1}.

Because NN+ does not have a multi-scale encoding-decoding framework, it estimates and applies inverse probability weights directly on the micro-scale data and cannot improve the EI of the forward dynamics by training reverse dynamics. In VAE+, we focus on the dynamics trained in the latent space: inverse probability weights are estimated and applied in the latent space, and the encoder can be optimized by training the reverse dynamics in the latent space to improve EI. The comparison results are shown in Figure 3(b) and (d).

7.   Integrated Gradients for attribution

Here we introduce Integrated Gradients [59], a method used to explain the connection between macro-states and micro-states. To describe this technique formally, consider a function F:p[0,1]F:\mathcal{R}^{p}\to[0,1] representing a deep network. Let xpx\in\mathcal{R}^{p} denote the input under consideration and xpx^{\prime}\in\mathcal{R}^{p} the baseline input. In both the fMRI and Boids experiments, the baseline is set to 0.

We compute the gradients at each point along the straight-line path (in p\mathcal{R}^{p}) from the baseline xx^{\prime} to the input xx and accumulate them to obtain the integrated gradients (IG). More precisely, integrated gradients are defined as the path integral of the gradients along the straight-line path from the baseline xx^{\prime} to the input xx.

The integrated gradient along the iith dimension for an input xx and baseline xx^{\prime} is given in Equation 13, where F(x)/xi\partial F(x)/\partial x_{i} represents the gradient of F(x)F(x) along the iith dimension.

IGi(x)=(xixi)×α=01F(x+α×(xx))xidα.\displaystyle IG_{i}(x)=(x_{i}-x_{i}^{\prime})\times\int^{1}_{\alpha=0}\frac{\partial F(x^{\prime}+\alpha\times(x-x^{\prime}))}{\partial x_{i}}\mathrm{d}\alpha. (13)
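Equation 13 can be approximated with a Riemann sum along the path (finite-difference gradients stand in for backpropagation; the function below is a toy example):

```python
import numpy as np

def integrated_gradients(F, x, baseline=None, steps=100):
    """Approximate Eq. 13 with a Riemann sum over `steps` points on the straight-line
    path from the baseline x' to the input x, using finite-difference gradients."""
    x = np.asarray(x, dtype=float)
    xp = np.zeros_like(x) if baseline is None else np.asarray(baseline, dtype=float)
    ig = np.zeros_like(x)
    eps = 1e-6
    for alpha in (np.arange(steps) + 0.5) / steps:  # midpoint rule on alpha in (0, 1)
        point = xp + alpha * (x - xp)
        for i in range(x.size):
            d = np.zeros_like(x); d[i] = eps
            ig[i] += (F(point + d) - F(point - d)) / (2 * eps)
    return (x - xp) * ig / steps

# Sanity check: for a linear F, the attributions recover each term's contribution exactly.
w = np.array([0.5, -2.0, 1.0])
F = lambda z: float(w @ z)
x = np.array([1.0, 1.0, 2.0])
print(np.allclose(integrated_gradients(F, x), w * x, atol=1e-4))  # True
```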

8.   fMRI time series data description in detail

AOMIC is an fMRI collection composed of AOMIC PIOP1, AOMIC PIOP2, and AOMIC ID1000 [53].

AOMIC PIOP2 collects subjects’ data for multiple tasks, such as emotion matching and working memory. Here we use only 50 subjects’ resting-state fMRI data, because some time steps of the other subjects’ data were discarded when removing the effects of motion artifacts, the global signal, and the white-matter and cerebrospinal-fluid signals using fMRIPrep results [60, 61], which creates a difficulty in time alignment.

AOMIC ID1000 was collected while 881 subjects watched movie clips. It contains both raw and preprocessed data (see [53] for more experimental and preprocessing details). Note that the movie was edited to carry no clear semantic meaning, consisting merely of concatenated images from the movie; therefore, subjects’ brain activation patterns are not expected to reflect higher-order functions such as semantic understanding.

When working with fMRI data, the high dimensionality (over 140,000 dimensions) poses computational challenges. To address this, dimension reduction through a brain atlas is employed to make the calculations feasible. We first preprocessed the data by removing the effects of motion artifacts, the global signal, and the white-matter and cerebrospinal-fluid signals using the fMRIPrep results included in AOMIC ID1000 [60, 61], detrending and normalizing the data in the same step. During this preprocessing, we also discarded 51 subjects’ data because some of their time steps had been removed. Furthermore, to facilitate the investigation of the correlation between the brain functions required for watching movie clips and the emergent macro-dynamics, we employed the Schaefer atlas to divide the brain into 100 functional areas. We also followed the philosophy of the anatomical atlas, which appears to provide converging evidence (see Support Information Section 14.2). All of these preprocessing steps were performed using Nilearn [62, 63]. During training, the only difference is the use of PCA for dimension reduction when handling the KDE approximation required for reweighting if q>10q>10.

9.   Preliminaries

Before proving the theorems, let us first introduce the notation used in the following paragraphs. We use capital letters to denote the corresponding random variables. For example, XtX_{t} denotes the random variable of the micro-state 𝒙t\boldsymbol{x}_{t} at time tt, and Yt+1Y_{t+1} denotes the random variable corresponding to the macro-state 𝒚t+1\boldsymbol{y}_{t+1}. For any random variable XX, X~\tilde{X} denotes the same random variable after intervention, and X^\hat{X} denotes the prediction of XX given by the neural networks.

Second, to understand how an intervention on YtY_{t} affects other variables, a causal graph derived from the framework of NIS and NIS+, depicting the causal relations among the variables, is very useful; it is shown in Figure 2. Here, XtX_{t} and X^t+1\hat{X}_{t+1} denote the input and the predicted output random variables of NIS+ at the micro scale, while YtY_{t} and Y^t+1\hat{Y}_{t+1} denote the input and output of fqf_{q} at the macro scale. After intervening on YtY_{t}, the variables X~t\tilde{X}_{t}, X~t+1\tilde{X}_{t+1}, Y~t\tilde{Y}_{t}, and Y~t+1\tilde{Y}_{t+1} represent the micro and macro variables under the intervention do(YtU(Y))do(Y_{t}\sim U(Y)), corresponding to XtX_{t}, X^t+1\hat{X}_{t+1}, YtY_{t}, and Y^t+1\hat{Y}_{t+1}, respectively.

Figure 2: The causal graph among random variables after intervention on YtY_{t} according to the framework of NIS or NIS+. Because the intervention on YtY_{t} can only affect the variables in the upper part of NIS+ framework, we ignore the variables in the lower part. In the diagram, XtX_{t}^{\prime} represents the random variable obtained after reversible transformation of XtX_{t}, Xt′′X^{\prime\prime}_{t} represents the variable directly discarded during the projection process, Y~t+1\tilde{Y}_{t+1}^{\prime} represents a new variable composed of Y~t+1\tilde{Y}_{t+1} concatenated with a standard normal distribution, and ξpq\xi_{p-q} represents a pqp-q dimensional standard normal distribution. The dashed circular shape in the diagram represents the variable that is directly intervened to a uniform distribution, and the dashed arrow represents the causal relationship that is severed due to the intervention.

10.   Mathematical Proofs

In this sub-section, we present the mathematical proofs of the theorems mentioned in the main text.

10.1.   Proof of Theorem 5.1

To prove this theorem, we utilize a combination of the inverse macro-dynamics and the probability re-weighting technique, along with information theory and properties of neural networks. The proof requires three lemmas, stated as follows:

Lemma 10.1 (Bijection mapping does not affect mutual information).

For any given continuous random variables XX and ZZ, if there is a bijective (one-to-one) mapping ff and another random variable YY such that for any xDom(X)x\in Dom(X) there is a y=f(x)Dom(Y)y=f(x)\in Dom(Y), and vice versa, where Dom(X)Dom(X) denotes the domain of the variable XX, then the mutual information between XX and ZZ is equal to that between YY and ZZ, that is:

I(X;Z)=I(Y;Z),I(X;Z)=I(Y;Z), (14)
Lemma 10.2 (Mutual information will not be affected by concatenating independent variables).

If XDom(X)X\in Dom(X) and YDom(Y)Y\in Dom(Y) form a Markov chain XYX\rightarrow Y, and ZDom(Z)Z\in Dom(Z) is a random variable independent of both XX and YY, then:

I(X;Y)=I(X;YZ).I(X;Y)=I(X;Y\bigoplus Z). (15)

The proofs of Lemma 10.1 and Lemma 10.2 can be found in [12].

Lemma 10.3 (variational upper bound of a conditional entropy).

Given a conditional entropy H(𝐲|𝐱)H(\boldsymbol{y}|\boldsymbol{x}), where 𝐱s,𝐲q\boldsymbol{x}\in\mathcal{R}^{s},\boldsymbol{y}\in\mathcal{R}^{q}, there exists a variational upper bound on this conditional entropy:

H(Y|X)p(𝒚,𝒙)lng(𝒚|𝒙)d𝒚d𝒙,H(Y|X)\leq-\iint p(\boldsymbol{y},\boldsymbol{x})\ln g(\boldsymbol{y}|\boldsymbol{x})\mathrm{d}\boldsymbol{y}\mathrm{d}\boldsymbol{x}, (16)

where g(𝐲|𝐱)g(\boldsymbol{y}|\boldsymbol{x}) can be any conditional distribution defined on q×s\mathcal{R}^{q}\times\mathcal{R}^{s}.

Proof.

First, we unfold the conditional entropy

H(Y|X)=p(𝒙)p(𝒚|𝒙)lnp(𝒚|𝒙)d𝒚d𝒙\displaystyle H(Y|X)=-\iint p(\boldsymbol{x})p(\boldsymbol{y}|\boldsymbol{x})\ln p(\boldsymbol{y}|\boldsymbol{x})\mathrm{d}\boldsymbol{y}\mathrm{d}\boldsymbol{x} (17)

Due to the property of KL divergence[64], for any distribution gg,

DKL(p||g)=p(𝒙)lnp(𝒙)g(𝒙)d𝒙0.\displaystyle D_{KL}(p||g)=\int p(\boldsymbol{x})\ln\frac{p(\boldsymbol{x})}{g(\boldsymbol{x})}\mathrm{d}\boldsymbol{x}\geq 0. (18)

In other words,

p(𝒙)lnp(𝒙)d𝒙p(𝒙)lng(𝒙)d𝒙.\displaystyle\int p(\boldsymbol{x})\ln p(\boldsymbol{x})\mathrm{d}\boldsymbol{x}\geq\int p(\boldsymbol{x})\ln g(\boldsymbol{x})\mathrm{d}\boldsymbol{x}. (19)

So,

H(𝒚|𝒙)\displaystyle H(\boldsymbol{y}|\boldsymbol{x}) =p(𝒙)p(𝒚|𝒙)lnp(𝒚|𝒙)d𝒚d𝒙\displaystyle=-\int p(\boldsymbol{x})\int p(\boldsymbol{y}|\boldsymbol{x})\ln p(\boldsymbol{y}|\boldsymbol{x})\mathrm{d}\boldsymbol{y}\mathrm{d}\boldsymbol{x} (20)
p(𝒙)p(𝒚|𝒙)lng(𝒚|𝒙)d𝒚d𝒙\displaystyle\leq-\int p(\boldsymbol{x})\int p(\boldsymbol{y}|\boldsymbol{x})\ln g(\boldsymbol{y}|\boldsymbol{x})\mathrm{d}\boldsymbol{y}\mathrm{d}\boldsymbol{x}
=p(𝒚,𝒙)lng(𝒚|𝒙)d𝒚d𝒙.\displaystyle=-\iint p(\boldsymbol{y},\boldsymbol{x})\ln g(\boldsymbol{y}|\boldsymbol{x})\mathrm{d}\boldsymbol{y}\mathrm{d}\boldsymbol{x}.

To prove the theorem, we also use an assumption:

Assumption: The composition of the inverse dynamics gθg_{\theta^{\prime}} and the encoder ϕ\phi can be regarded as a conditional probability P(Y^t|Xt+1)P(\hat{Y}_{t}|X_{t+1}), and this probability can be approximated as a Gaussian distribution N(gθ(ϕ(𝒙t+1)),Σ)N(g_{\theta^{\prime}}(\phi(\boldsymbol{x}_{t+1})),\Sigma), where Σ=diag(σ1,σ2,,σq)\Sigma=diag(\sigma_{1},\sigma_{2},\cdot\cdot\cdot,\sigma_{q}), and σi\sigma_{i} is the MSE loss of the iith dimension of output Y^t+1\hat{Y}_{t+1}. Further, suppose σi\sigma_{i} is bounded, thus σi[σm,σM]\sigma_{i}\in[\sigma_{m},\sigma_{M}] for any ii, where σm\sigma_{m} and σM\sigma_{M} are the minimum and maximum values of MSEs.

This assumption is supported by the reference [65].

Next, we restate the theorem that needs to be proved:

Theorem 5.1 (Problem Transformation Theorem): For a given value of qq, assume that ω,θ\omega^{\ast},\theta^{\ast}, and θ\theta^{\prime\ast} are the optimal solutions to the unconstrained objective optimization problem defined by Equation (9). Then ϕProjq(ψω),fqfθ\phi^{\ast}\equiv Proj_{q}(\psi_{\omega^{\ast}}),f^{\ast}_{q}\equiv f_{\theta^{\ast}}, and ϕ,()ψω1(ξ)\phi^{{\dagger},\ast}(\cdot)\equiv\psi^{-1}_{\omega^{\ast}}(\cdot\bigoplus\xi), where ξ𝒩(0,Ipq)\xi\sim\mathcal{N}(0,I_{p-q}), are the optimal solution to the constrained objective optimization problem (Equation 21).

Proof.

The original constrained optimization problem is as follows:

maxϕ,fq,ϕ+𝒥(fq),\displaystyle\max_{\phi,f_{q},\phi^{+}}\mathcal{J}(f_{q}), (21)
s.t.{𝒙^t+1𝒙t+1<ϵ,𝒙^t+1=ϕ(fq(ϕ(𝒙t))).\displaystyle s.t.\begin{cases}||\hat{\boldsymbol{x}}_{t+1}-\boldsymbol{x}_{t+1}||<\epsilon,\\ \hat{\boldsymbol{x}}_{t+1}=\phi^{{\dagger}}(f_{q}(\phi(\boldsymbol{x}_{t}))).\end{cases}

We know that X^t+1=ψω1(Y^t+1ξ)\hat{X}_{t+1}=\psi_{\omega}^{-1}(\hat{Y}_{t+1}\bigoplus\xi), where ψω1\psi_{\omega}^{-1} is a reversible mapping that does not affect the mutual information according to Lemma 10.1. Therefore, based on Lemma 10.2, we can apply a transformation to the mutual information I(Yt,Y^t+1)I(Y_{t},\hat{Y}_{t+1}):

I(Yt,Y^t+1)=I(Yt,X^t+1).\displaystyle I(Y_{t},\hat{Y}_{t+1})=I(Y_{t},\hat{X}_{t+1}). (22)

Here, II represents mutual information. By utilizing the property of mutual information, we can derive the following equation:

I(Yt,X^t+1)=H(Yt)H(Yt|X^t+1).\displaystyle I(Y_{t},\hat{X}_{t+1})=H(Y_{t})-H(Y_{t}|\hat{X}_{t+1}). (23)

Now, H(Y~t)=H(Uq)H(\tilde{Y}_{t})=H(U_{q}), where UqU_{q} is a uniform distribution on a macro space. Therefore, we have:

𝒥(fθ,q)=H(Uq)H(Y~t|X~t+1).\displaystyle\mathcal{J}(f_{\theta,q})=H(U_{q})-H(\tilde{Y}_{t}|\tilde{X}_{t+1}). (24)

The optimization of the objective function 𝒥(fq)\mathcal{J}(f_{q}) can thus be reformulated as the optimization of the conditional entropy H(Y~t|X~t+1)H(\tilde{Y}_{t}|\tilde{X}_{t+1}), since H(Uq)H(U_{q}) is a constant. The variational upper bound on H(Y~t|X~t+1)H(\tilde{Y}_{t}|\tilde{X}_{t+1}) is obtained from Lemma 10.3:

H(Y~t|X~t+1)p~(𝒚t,𝒙t+1)lng(𝒚t|𝒙t+1)d𝒚td𝒙t+1,\displaystyle H(\tilde{Y}_{t}|\tilde{X}_{t+1})\leq-\iint\tilde{p}(\boldsymbol{y}_{t},\boldsymbol{x}_{t+1})\ln g(\boldsymbol{y}_{t}|\boldsymbol{x}_{t+1})\mathrm{d}\boldsymbol{y}_{t}\mathrm{d}\boldsymbol{x}_{t+1}, (25)

where p~\tilde{p} denotes the probability distribution of the random variables under the intervention on 𝒚t\boldsymbol{y}_{t}.

We use a neural network to fit the distribution g(𝒚t|𝒙t+1)g(\boldsymbol{y}_{t}|\boldsymbol{x}_{t+1}). Since Lemma 10.3 allows the conditional probability g(𝒚t|𝒙t+1)g(\boldsymbol{y}_{t}|\boldsymbol{x}_{t+1}) to be any distribution, we assume it to be a normal distribution for simplicity, in line with the assumption above. Predicting 𝒚t\boldsymbol{y}_{t} with 𝒙t+1\boldsymbol{x}_{t+1} as input can be divided into two steps: first, 𝒙t+1\boldsymbol{x}_{t+1} is encoded by the encoder ϕ\phi into the macro latent space; then, the reverse macro-dynamics is approximated by gθg_{\theta^{\prime}}. As a result, the expectation of g(𝒚t|𝒙t+1)g(\boldsymbol{y}_{t}|\boldsymbol{x}_{t+1}) can be obtained using the following equation:

E_{g}(\tilde{Y}_{t}|X_{t+1}=\boldsymbol{x}_{t+1})\equiv g_{\theta^{\prime}}(\phi(\boldsymbol{x}_{t+1})). (26)

Therefore, $g(\boldsymbol{y}_{t}|\boldsymbol{x}_{t+1})\sim N(\mu,\Sigma)$, where $\Sigma$ is a constant diagonal matrix and $\mu=g_{\theta^{\prime}}(\phi(\boldsymbol{x}_{t+1}))$. In order to exploit the intervention on $\boldsymbol{y}_{t}$, we need to separate out $\tilde{p}(\boldsymbol{y}_{t})$. Following Equation 25, we can further transform $H(\tilde{Y}_{t}|\tilde{X}_{t+1})$:

H(\tilde{Y}_{t}|\tilde{X}_{t+1})\leq-\iint\tilde{p}(\boldsymbol{y}_{t})\tilde{p}(\boldsymbol{x}_{t+1}|\boldsymbol{y}_{t})\ln g(\boldsymbol{y}_{t}|\boldsymbol{x}_{t+1})\mathrm{d}\boldsymbol{y}_{t}\mathrm{d}\boldsymbol{x}_{t+1}. (27)

Assuming the training is sufficient, we have

\tilde{p}(\boldsymbol{x}_{t+1}|\boldsymbol{y}_{t})\approx p(\boldsymbol{x}_{t+1}|\boldsymbol{y}_{t}). (28)

So,

H(\tilde{Y}_{t}|\tilde{X}_{t+1})\leq-\iint\tilde{p}(\boldsymbol{y}_{t})p(\boldsymbol{x}_{t+1}|\boldsymbol{y}_{t})\ln g(\boldsymbol{y}_{t}|\boldsymbol{x}_{t+1})\mathrm{d}\boldsymbol{y}_{t}\mathrm{d}\boldsymbol{x}_{t+1}. (29)

According to Equation 26, we obtain the logarithm of the probability density function of $g(\boldsymbol{y}_{t}|\boldsymbol{x}_{t+1})$:

\ln g(\boldsymbol{y}_{t}|\boldsymbol{x}_{t+1})\approx\ln\frac{1}{(2\pi)^{\frac{m}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{(\boldsymbol{y}_{t}-g_{\theta^{\prime}}(\phi(\boldsymbol{x}_{t+1})))^{2}}{2|\Sigma|}}=-\frac{(\boldsymbol{y}_{t}-g_{\theta^{\prime}}(\phi(\boldsymbol{x}_{t+1})))^{2}}{2|\Sigma|}+\ln\frac{1}{(2\pi)^{\frac{m}{2}}|\Sigma|^{\frac{1}{2}}}. (30)

Since $\ln\frac{1}{(2\pi)^{\frac{m}{2}}|\Sigma|^{\frac{1}{2}}}\geq\ln\frac{1}{(2\pi)^{\frac{m}{2}}|\Sigma|_{max}^{\frac{1}{2}}}$, we have

H(\tilde{Y}_{t}|\tilde{X}_{t+1})\leq\iint\tilde{p}(\boldsymbol{y}_{t})p(\boldsymbol{x}_{t+1}|\boldsymbol{y}_{t})\left[\frac{(\phi(\boldsymbol{x}_{t})-g_{\theta^{\prime}}(\phi(\boldsymbol{x}_{t+1})))^{2}}{2|\Sigma|}-\ln\frac{1}{(2\pi)^{\frac{m}{2}}|\Sigma|^{\frac{1}{2}}}\right]\mathrm{d}\boldsymbol{y}_{t}\mathrm{d}\boldsymbol{x}_{t+1} (31)
\leq\iint\tilde{p}(\boldsymbol{y}_{t})p(\boldsymbol{x}_{t+1}|\boldsymbol{y}_{t})\left[\frac{(\phi(\boldsymbol{x}_{t})-g_{\theta^{\prime}}(\phi(\boldsymbol{x}_{t+1})))^{2}}{2|\Sigma|_{min}}-\ln\frac{1}{(2\pi)^{\frac{m}{2}}|\Sigma|_{max}^{\frac{1}{2}}}\right]\mathrm{d}\boldsymbol{y}_{t}\mathrm{d}\boldsymbol{x}_{t+1},

where $|\Sigma|_{min}=\sigma_{min}^{q}$ and $|\Sigma|_{max}=\sigma_{max}^{q}$. Here, $\tilde{p}(\boldsymbol{y}_{t})p(\boldsymbol{x}_{t+1}|\boldsymbol{y}_{t})=\frac{\tilde{p}(\boldsymbol{y}_{t})}{p(\boldsymbol{y}_{t})}p(\boldsymbol{x}_{t+1},\boldsymbol{y}_{t})$, which isolates the ratio $\frac{\tilde{p}(\boldsymbol{y}_{t})}{p(\boldsymbol{y}_{t})}$, where $\tilde{p}(\boldsymbol{y}_{t})$ is the target distribution and $p(\boldsymbol{y}_{t})$ is the natural distribution obtained from the data. We define the inverse probability weights $w(\boldsymbol{x}_{t})$ as follows:

w(\boldsymbol{x}_{t})\equiv\frac{\tilde{p}(\boldsymbol{y}_{t})}{p(\boldsymbol{y}_{t})}. (32)
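As a concrete illustration of Equation 32, the weights can be estimated from encoded samples by comparing a uniform target density with a density estimate of the natural distribution. The paper uses KDE; the histogram estimator and one-dimensional setting below are simplifying assumptions:

```python
def inverse_probability_weights(y_samples, y_min, y_max, n_bins=10):
    """Estimate w(x_t) = p_tilde(y_t) / p(y_t) for 1-D macro-states.

    p_tilde is the uniform target density on [y_min, y_max]; the natural
    density p is approximated with a histogram of the encoded samples.
    """
    width = (y_max - y_min) / n_bins
    counts = [0] * n_bins
    for y in y_samples:
        idx = min(int((y - y_min) / width), n_bins - 1)
        counts[idx] += 1
    target_density = 1.0 / (y_max - y_min)  # uniform p_tilde(y)
    n = len(y_samples)
    weights = []
    for y in y_samples:
        idx = min(int((y - y_min) / width), n_bins - 1)
        natural_density = counts[idx] / (n * width)  # histogram estimate of p(y)
        weights.append(target_density / natural_density)
    return weights
```

Samples from over-represented regions of the latent space receive small weights, while samples from rare regions receive large ones, pushing the effective training distribution toward uniform.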

Let $z=\frac{(\phi(\boldsymbol{x}_{t})-g_{\theta^{\prime}}(\phi(\boldsymbol{x}_{t+1})))^{2}}{2|\Sigma|_{min}}-\ln\frac{1}{(2\pi)^{\frac{m}{2}}|\Sigma|_{max}^{\frac{1}{2}}}$. Then Equation 31 can be expressed as follows:

H(\tilde{Y}_{t}|\tilde{X}_{t+1})\leq\iint w(\boldsymbol{x}_{t})p(\boldsymbol{x}_{t+1},\boldsymbol{y}_{t})z\mathrm{d}\boldsymbol{y}_{t}\mathrm{d}\boldsymbol{x}_{t+1}. (33)

Because we train the neural networks on discrete samples $\{x_{t}\}$, we can use the sample mean as an estimate of the expectation in Equation 33. Therefore, the variational upper bound on $H(\tilde{Y}_{t}|\tilde{X}_{t+1})$ can be written as:

H(\tilde{Y}_{t}|\tilde{X}_{t+1})\leq\frac{1}{T}\sum_{t=0}^{T-1}w(\boldsymbol{x}_{t})z. (34)

Substituting this bound into Equation 24, we obtain a variational lower bound on the original objective function:

\mathcal{J}(f_{\theta,q})\geq H(U_{q})-\frac{1}{T}\sum_{t=0}^{T-1}w(\boldsymbol{x}_{t})z. (35)

Therefore, the original optimization problem is transformed into

\min_{\omega,\theta,\theta^{\prime}}\sum_{t=0}^{T-1}w(\boldsymbol{x}_{t})|\phi(\boldsymbol{x}_{t})-g_{\theta^{\prime}}(\phi(\boldsymbol{x}_{t+1}))|^{2} (36)
\mathrm{s.t.}\quad||\hat{\boldsymbol{x}}_{t+1}-\boldsymbol{x}_{t+1}||<\epsilon, (37)

where $\omega$, $\theta$, and $\theta^{\prime}$ respectively denote the parameters of the three neural networks $\psi$, $f_{\theta}$, and $g_{\theta^{\prime}}$ in the NIS+ framework.

Then we construct the Lagrangian:

L(\omega,\theta,\theta^{\prime},\lambda)=\sum_{t=0}^{T-1}w(\boldsymbol{x}_{t})|\phi(\boldsymbol{x}_{t})-g_{\theta^{\prime}}(\phi(\boldsymbol{x}_{t+1}))|^{2}+\lambda||\phi^{\dagger}(\boldsymbol{y}_{t+1})-\boldsymbol{x}_{t+1}||. (38)

Therefore, the optimization goal is transformed into

\min_{\omega,\theta,\theta^{\prime}}\sum_{t=0}^{T-1}w(\boldsymbol{x}_{t})||\boldsymbol{y}_{t}-g_{\theta^{\prime}}(\boldsymbol{y}_{t+1})||+\lambda||\hat{\boldsymbol{x}}_{t+1}-\boldsymbol{x}_{t+1}||. (39)
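As an illustration, the weighted objective above can be sketched in plain Python. The encoder, decoder, and forward/backward dynamics below are toy one-dimensional stand-in callables; the function names and the use of absolute errors are illustrative, not the paper's implementation:

```python
def nis_plus_loss(xs, xs_next, phi, phi_inv, f_theta, g_theta_prime,
                  weights, lam=1.0):
    """Sketch of the weighted objective of Equation 39 for 1-D states.

    First term: inverse-probability-weighted backward prediction error
    |y_t - g_theta'(y_{t+1})| on the macro level. Second term: forward
    reconstruction error |x_hat_{t+1} - x_{t+1}|, scaled by lambda.
    """
    total = 0.0
    for x_t, x_t1, w in zip(xs, xs_next, weights):
        y_t, y_t1 = phi(x_t), phi(x_t1)          # encode both time steps
        backward_err = abs(y_t - g_theta_prime(y_t1))
        x_hat_t1 = phi_inv(f_theta(y_t))         # decode the macro prediction
        forward_err = abs(x_hat_t1 - x_t1)
        total += w * backward_err + lam * forward_err
    return total
```

With an identity encoder, forward dynamics $y\mapsto y+1$, and backward dynamics $y\mapsto y-1$ on data $x_{t+1}=x_t+1$, both terms vanish, as expected for a perfectly trained pair of dynamics.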

10.2.   Proof of Theorem 5.2

We first prove two lemmas establishing the scalability of the stacked encoder and the parallel encoder, thereby demonstrating that any combination of the two leaves the conclusion of Theorem 5.1 unchanged.

Lemma 10.4 (Mutual information will not be affected by stacked encoder).

If $X\in Dom(X)$ and $Y\in Dom(Y)$ form a Markov chain $X\rightarrow Y$, and $\Phi_{L}$ and $\Phi^{\dagger}_{L}$ denote the $L$-layer stacked encoder and decoder, respectively, then:

I(X;Y)=I(X;\Phi^{\dagger}_{L}(Y)). (40)
Proof.

According to Figure 7(a), we have

\Phi^{\dagger}_{L}(Y)=\phi^{\dagger}_{q_{1}}\circ\phi^{\dagger}_{q_{2}}\circ\cdots\circ\phi^{\dagger}_{q_{L}}(Y). (41)

Here $\phi^{\dagger}_{q_{i}}:\mathcal{R}^{q_{i}}\rightarrow\mathcal{R}^{q_{i-1}}$ ($i=1,2,\cdots,L$) is a basic decoder as given in Equation 8, and $\circ$ denotes function composition. Therefore, by Lemma 10.2 and the fact that invertible mappings do not change mutual information, we obtain

I(X;Y)=I(X;\phi^{\dagger}_{L}(Y)). (42)

Letting $Y_{L-1}=\phi^{\dagger}_{L}(Y)$, we further obtain $I(X;Y_{L-1})=I(X;\phi^{\dagger}_{L-1}(Y_{L-1}))$, and so on, leading to the final result

I(X;Y)=I(X;\Phi^{\dagger}_{L}(Y)). (43)
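The invertibility argument used in this proof can be checked numerically on a discrete joint distribution: relabeling $Y$ with a bijection (a stand-in for one invertible decoder layer) leaves $I(X;Y)$ unchanged, and stacking $L$ such layers just repeats the step. A minimal sketch:

```python
from math import log

def mutual_information(joint):
    """I(X;Y) in nats for a discrete joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def relabel_y(joint, bijection):
    """Apply an invertible map to Y, e.g. one layer of a stacked decoder."""
    return {(x, bijection[y]): p for (x, y), p in joint.items()}
```

For example, relabeling $Y$ via $\{0\mapsto a,1\mapsto b\}$ yields exactly the same mutual information, which is the one-layer version of Equation 42.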

Lemma 10.5 (Mutual information will not be affected by parallel encoder).

If $X\in Dom(X)$ and $Y\in Dom(Y)$ form a Markov chain $X\rightarrow Y$, and $\Phi_{T}$ and $\Phi^{\dagger}_{T}$ denote a parallel encoder and decoder composed of $T$ ordinary encoders or decoders, respectively, then:

I(X;Y)=I(X;\Phi^{\dagger}_{T}(Y)). (44)
Proof.

The decoding process can be divided into two steps:

\Phi^{\dagger}_{T}(Y)=\Psi_{T}(Y\bigoplus\xi). (45)

$Y$ can be decomposed as $Y=Y_{1}\bigoplus Y_{2}\bigoplus\dots\bigoplus Y_{T}$, and all the introduced noise as $\xi=\xi_{1}\bigoplus\xi_{2}\bigoplus\dots\bigoplus\xi_{T}$; even if the order of concatenation changes, the mutual information between the overall variable and other variables is unaffected. Let $\Psi_{T}(Z)=\psi_{1}(Z_{1})\bigoplus\psi_{2}(Z_{2})\bigoplus\dots\bigoplus\psi_{T}(Z_{T})$, where $Z_{i}$ ($i=1,2,\dots,T$) is a partition of $Z$ and each $\psi_{i}:\mathcal{R}^{p}\rightarrow\mathcal{R}^{p}$ is an invertible function. Therefore, $\Psi_{T}$ is also invertible, so this transformation does not change the mutual information. According to Lemma 10.2, we have

I(X;Y)=I(X;Y\bigoplus\xi)=I(X;\Psi_{T}(Y\bigoplus\xi)). (46)

Finally, we obtain

I(X;Y)=I(X;\Phi^{\dagger}_{T}(Y)). (47)

Next, we provide the proof of Theorem 5.2.

Theorem 5.2 (Problem Transformation in Extensions of NIS+ Theorem). When the encoder of NIS+ is replaced with an arbitrary combination of stacked encoders and parallel encoders, the conclusion of Theorem 5.1 still holds.

Proof.

To demonstrate that Theorem 5.1 remains applicable to the extended NIS+ framework, we only need to prove that Equation 22 still holds, since the only difference between the extended NIS+ and the previous framework lies in the encoder. First, we restate Equation 22 as follows:

\mathcal{J}(f_{q})=I(Y_{t},\hat{Y}_{t+1})=I(Y_{t},\hat{X}_{t+1}). (48)

According to Lemma 10.4 and Lemma 10.5, Equation 22 holds when the encoder of NIS+ is replaced with either a stacked encoder or a parallel encoder. Any combination of these two encoders is simply an alternating nesting of stacking and parallelization, so the conclusion still holds. ∎

10.3.   Proof of Theorem 5.3

To prove the universality theorem, we extend the definition of the basic encoder by introducing a new operation $\eta_{p,s}:\mathcal{R}^{p}\rightarrow\mathcal{R}^{s}$, which represents the self-replication of the original variables:

\eta_{p,s}(\boldsymbol{x})=\boldsymbol{x}\bigoplus\boldsymbol{x}_{s-p}. (49)

The vector $\boldsymbol{x}_{s-p}$ consists of $s-p$ dimensions, each of which duplicates a specific dimension of $\boldsymbol{x}$. For example, if $\boldsymbol{x}=(0.1,0.2,0.3)$, then $\eta_{3,5}(\boldsymbol{x})=(0.1,0.2,0.3,0.1,0.2)$.
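The expansion operation can be written down directly; duplicating the leading $s-p$ dimensions, as in the worked example above, is one concrete choice of which dimensions to replicate:

```python
def eta(x, s):
    """Expansion operator eta_{p,s}: extend the length-p vector x to
    length s >= p by duplicating its leading dimensions cyclically."""
    p = len(x)
    assert s >= p, "eta_{p,s} requires s >= p"
    return list(x) + [x[i % p] for i in range(s - p)]
```

The operator is the identity when $s=p$, matching the convention used in the proofs below.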

The basic idea for proving Theorem 5.3 is to use the well-known universal approximation theorems for common feed-forward neural networks [66, 67] and for invertible neural networks [68, 69] as bridges, and then to prove that any feed-forward neural network can be simulated by a series of bijective mapping ($\psi$), projection ($\chi$), and vector expansion ($\eta$) procedures. The basic encoder, after the extension with vector expansion, can be expressed by the following equation:

\phi=Proj_{q}\circ\psi_{s}\circ\eta_{p,s}\circ\psi_{p}. (50)

The functions $\psi_{s}:\mathcal{R}^{s}\rightarrow\mathcal{R}^{s}$ and $\psi_{p}:\mathcal{R}^{p}\rightarrow\mathcal{R}^{p}$ are two invertible mappings. The final retained dimensionality $q$ may be larger than the initial dimensionality $p$. In the context of coarse-graining microscopic states to obtain macroscopic states, for convenience, we usually regard $\phi$ as a dimension-reduction operator.

First, we need to prove a lemma.

Lemma 10.6.

For any vector $X\in\mathcal{R}^{p}$ and matrix $W\in\mathcal{R}^{s\times p}$, where $s,p\in\mathcal{N}$, there exist an integer $s_{1}\leq\min(s,p)$ and two basic units of encoder, $\psi_{s}\circ\eta_{s_{1},s}$ and $\chi_{p,s_{1}}\circ\psi_{p}$, such that:

W\cdot X\simeq(\psi_{s}\circ\eta_{s_{1},s})\circ(\chi_{p,s_{1}}\circ\psi_{p})(X), (51)

where $\simeq$ denotes approximation (simulation).

Proof.

For any $W\in\mathcal{R}^{s\times p}$, we can decompose it by SVD as:

W=U\cdot\Lambda\cdot V, (52)

where $U\in\mathcal{R}^{s\times s}$ and $V\in\mathcal{R}^{p\times p}$ are orthogonal matrices, and $\Lambda=\begin{pmatrix}diag(\lambda_{1},\lambda_{2},\cdots,\lambda_{s_{1}})&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{pmatrix}_{s\times p}$ is a diagonal matrix composed of the nonzero singular values $\lambda_{i}$, $i\in[1,s_{1}]$, of $W$ and zeros, where $s_{1}$ is the rank of $W$. Because every matrix can be regarded as a function, we have:

W(X)=U(\eta_{s_{1},s}(\chi_{p,s_{1}}(\Lambda^{\prime}\cdot V)(X))), (53)

where $\Lambda^{\prime}=diag(\lambda_{1},\lambda_{2},\cdots,\lambda_{s_{1}},1,\cdots,1)$. That is, the functional mapping $W(X)$ can be decomposed into four steps: matrix multiplication by $\Lambda^{\prime}\cdot V$, projection $\chi_{p,s_{1}}$ (projecting a $p$-dimensional vector to an $s_{1}$-dimensional one), vector expansion $\eta_{s_{1},s}$ (expanding the $s_{1}$-dimensional vector to an $s$-dimensional one), and matrix multiplication by $U$. Notice that the first and last steps are invertible because the corresponding matrices are invertible. Therefore, according to [68, 69], there exist invertible neural networks $\psi_{s}$ and $\psi_{p}$ that can simulate the orthogonal matrix $U$ and the invertible matrix $\Lambda^{\prime}\cdot V$, respectively; thus $\psi_{s}\simeq U$ and $\psi_{p}\simeq\Lambda^{\prime}\cdot V$. Then we have:

(\psi_{s}\circ\eta_{s_{1},s})\circ(\chi_{p,s_{1}}\circ\psi_{p})\simeq W. (54)

That is, $W(X)$ can be approximated by an extended stacked encoder. ∎
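The four-step factorization in this proof can be verified on a small, hand-built example. To avoid computing an SVD, we construct $W$ from chosen factors (a 2-D rotation $U$, a diagonal $\Lambda^{\prime}$, and $V=I$); in this full-rank square case ($s_{1}=s=p=2$) the projection $\chi$ and expansion $\eta$ are identities, so only the two invertible steps act. All matrices here are illustrative:

```python
from math import cos, sin, pi

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

theta = pi / 6
U = [[cos(theta), -sin(theta)], [sin(theta), cos(theta)]]  # orthogonal factor
LamV = [[2.0, 0.0], [0.0, 3.0]]                            # Lambda' . V (V = I)
W = matmul(U, LamV)                                        # W = U . Lambda' . V

def encoder_pipeline(x):
    z = matvec(LamV, x)  # step 1: psi_p simulates Lambda' . V
    z = z[:2]            # step 2: chi_{p,s1} projection (identity, s1 = p)
    z = list(z)          # step 3: eta_{s1,s} expansion (identity, s1 = s)
    return matvec(U, z)  # step 4: psi_s simulates U
```

In the rank-deficient case, step 2 genuinely drops dimensions and step 3 re-expands the $s_{1}$-dimensional vector, with the invertible networks absorbing the remaining linear maps.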

With Lemma 10.6, we can prove Theorem 5.3. We first restate it here:

Theorem 5.3 (Universal Approximation Theorem of Stacked Encoder). For any continuous function $f$ defined on $K\times\mathcal{R}^{p}$, where $K\in\mathcal{R}^{p}$ is a compact set and $p>q\in\mathcal{Z^{+}}$, there exist an integer $s$, an extended stacked encoder $\phi_{p,s,q}:\mathcal{R}^{p}\rightarrow\mathcal{R}^{q}$ with hidden size $s$, and an expansion operator $\eta_{p,s}$ such that:

\phi_{p,s,q}\simeq f. (55)

Therefore, the extended stacked encoder has the universal approximation property, meaning that it can approximate (simulate) any coarse-graining function defined on $\mathcal{R}^{p}\times\mathcal{R}^{q}$.

Proof.

According to the universal approximation theorem [66, 67], for any function $f$ defined on $K\times\mathcal{R}^{p}$, where $K\in\mathcal{R}^{p}$ is a compact set, $p>q\in\mathcal{Z^{+}}$, and any small number $\epsilon$, there exist an integer $s$ and $W\in\mathcal{R}^{s\times p},W^{\prime}\in\mathcal{R}^{q\times s},b\in\mathcal{R}^{s}$ such that:

W^{\prime}\cdot\sigma(W\cdot+b)\simeq f, (56)

where $\sigma(\boldsymbol{x})=1/(1+\exp(-\boldsymbol{x}))$ is the sigmoid function applied element-wise to vectors.

According to Lemma 10.6, and since $+b$ and $\sigma(\cdot)$ are invertible operators, there exist invertible neural networks $\psi_{q},\psi_{s}^{\prime},\psi_{s},\psi_{p}$ and two integers $s_{1},s_{2}$, which are the ranks of the matrices $W$ and $W^{\prime}$, respectively, such that:

(\psi_{q}\circ\eta_{s_{2},q}\circ\chi_{s,s_{2}}\circ\psi_{s}^{\prime})\circ(\psi_{s}\circ\eta_{s_{1},s}\circ\chi_{p,s_{1}}\circ\psi_{p})\simeq W^{\prime}\cdot\sigma(W\cdot+b), (57)

where $\psi_{s}\circ\eta_{s_{1},s}\circ\chi_{p,s_{1}}\circ\psi_{p}$ approximates (simulates) the function $\sigma(W\cdot+b)$ and $\psi_{q}\circ\eta_{s_{2},q}\circ\chi_{s,s_{2}}\circ\psi_{s}^{\prime}$ approximates (simulates) the function $W^{\prime}\cdot$.

Therefore, if we let $\phi_{p,s,q}=(\psi_{q}\circ\eta_{s_{2},q}\circ\chi_{s,s_{2}}\circ\psi_{s}^{\prime})\circ(\psi_{s}\circ\eta_{s_{1},s}\circ\chi_{p,s_{1}}\circ\psi_{p})$, then $\phi_{p,s,q}\simeq f$. ∎

In real applications, although the basic encoder and its extended versions do not include the expansion operator, we always expand the input vector before feeding it to the encoder. Therefore, it is reasonable to believe that Theorem 5.3 still holds for stacked encoders.

11.   SIR Model

In this section, we will introduce various parameters and specific methods in the SIR experiment.

11.1.   Data generation

In Equation 6, $\boldsymbol{\xi}_{1},\boldsymbol{\xi}_{2}\sim N(0,\Sigma)$ are two-dimensional Gaussian noises, independent of each other, with $\Sigma=\begin{pmatrix}\sigma^{2}&-\frac{1}{2}\sigma^{2}\\ -\frac{1}{2}\sigma^{2}&\sigma^{2}\end{pmatrix}$. From this covariance matrix, it follows that each of $\boldsymbol{\xi}_{1}$ and $\boldsymbol{\xi}_{2}$ is a two-dimensional normal distribution whose two components have a correlation coefficient of $-\frac{1}{2}$. The parameter $\sigma$ controls the overall magnitude of the noise and is set to $\sigma=0.03$ in the experiments conducted in this paper. For the macroscopic state data, we discretize the original continuous model with $dt=0.01$ and run 7 time steps for each starting point. After sampling with different starting points, we concatenate equal amounts of $\boldsymbol{\xi}_{1}$ and $\boldsymbol{\xi}_{2}$ to obtain the four-dimensional microscopic data.
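The data-generation recipe above can be sketched as follows, assuming the standard SIR form $dS/dt=-\beta SI$, $dI/dt=\beta SI-\gamma I$ for Equation 6 with illustrative $\beta$ and $\gamma$ (the paper fixes the actual dynamics and parameters), and reading the micro-state as the $(S,S,I,I)$ layout of Equation 59 perturbed by the two correlated noises:

```python
import random
from math import sqrt

def correlated_noise(sigma, rng):
    """Sample a 2-D Gaussian with covariance sigma^2 * [[1, -1/2], [-1/2, 1]]
    via its Cholesky factor."""
    g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return sigma * g1, sigma * (-0.5 * g1 + sqrt(0.75) * g2)

def sir_microstate_trajectory(s0, i0, beta=1.0, gamma=0.5, sigma=0.03,
                              dt=0.01, steps=7, seed=0):
    """Generate one 7-step, 4-D micro-state trajectory: the macro state
    (S, I) is duplicated and perturbed by two independent correlated
    noises, x_t = (S + xi_{1,1}, S + xi_{1,2}, I + xi_{2,1}, I + xi_{2,2}).

    beta and gamma are illustrative placeholders, not the paper's values.
    """
    rng = random.Random(seed)
    s, i = s0, i0
    traj = []
    for _ in range(steps):
        n1 = correlated_noise(sigma, rng)
        n2 = correlated_noise(sigma, rng)
        traj.append((s + n1[0], s + n1[1], i + n2[0], i + n2[1]))
        ds = -beta * s * i          # assumed SIR dynamics (Euler step)
        di = beta * s * i - gamma * i
        s, i = s + dt * ds, i + dt * di
    return traj
```

Repeating this over many sampled initial points and concatenating the trajectories yields the micro-level dataset.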

In the SIR model experiment conducted in this article, we generated two datasets: a complete dataset and a partial dataset. For the complete dataset, we uniformly sampled 9000 initial points from the entire range of possible values of $S$ and $I$ ($S\geq 0,I\geq 0,S+I\leq 1$), resulting in a total of 63000 data points. For the partial dataset, we uniformly sampled 6000 initial points from the region where $S\geq\frac{1}{3}$. Additionally, to satisfy the support-set requirements of the KDE estimation, we randomly sampled 10 initial points from the region where $S\leq\frac{1}{3}$. In total, the partial dataset consists of 42070 sample points. The training and testing results of the NIS+ neural network on these two sample sets are presented in Figure 3.

11.2.   Training and testing details

For NIS+, the two training stages are separated at the 3000th epoch (see Method section 5.1.4). In the second stage (after 3000 epochs), NIS+ incorporates inverse probability weighting and bidirectional dynamics learning, distinguishing it from the NIS model.

In Figure 3(d), the test involves selecting 20 starting points within an area where training samples are missing, evolving them for 10 steps using the learned macro-dynamics, and then decoding the results back into micro-states. The mean of the multi-step prediction errors over these 20 starting points is displayed. In Figure 3(e), considering the significant differences in the range of the experimental data, the threshold $\epsilon$ measures the magnitude of the relative error, which is obtained by dividing the absolute error by the standard deviation of the data itself. We set $\epsilon=0.3$, meaning that training is considered sufficient when the relative error is at most 0.3.
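The sufficiency criterion can be written down directly; the mean-absolute-error aggregation and the population standard deviation below are illustrative choices, since the text does not pin them down:

```python
from math import sqrt

def is_training_sufficient(errors, data, eps=0.3):
    """Relative-error criterion: mean absolute prediction error divided by
    the standard deviation of the data must not exceed eps."""
    n = len(data)
    mean = sum(data) / n
    std = sqrt(sum((v - mean) ** 2 for v in data) / n)  # population std
    mae = sum(abs(e) for e in errors) / len(errors)
    return mae / std <= eps
```

Dividing by the data's own standard deviation makes the threshold comparable across experiments whose state variables span very different ranges.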

11.3.   Methods for the vector field results

The following explains, step by step, how the vector field results of the emergent dynamics in Figure 3 are averaged; the aim is to compute the vector field through extensive sampling and averaging. Initially, we generate grid points on the phase space with a step size of 0.007, as shown in Figure 3(a), in order to obtain a large number of samples for further analysis. We add noise to these grid points and to their subsequent time steps, extending them to four-dimensional data. Next, we use the well-trained NIS+ model to encode these extended data points, yielding samples in the latent space. These samples are then divided among grids on the latent space with a step size of 0.015, enabling a comprehensive analysis of the emergent dynamics. We then determine the average motion vector for each grid using the methodology outlined in Figure 3(d), aggregating the motion vectors across the grids into an average vector field. For comparison, we also generate sparser grid points on the phase space with a step size of 0.07. Applying the same noise and encoding techniques, we establish a one-to-one correspondence between the motion vectors of the grids and the new grid points, which allows us to depict the vector field of the macro latent space of NIS+ as shown in Figure 3(c). Similarly, by replacing the encoder of NIS+ with the encoder of NIS, we obtain the latent-space vector field of NIS, as illustrated in Figure 3(f). This approach, based on extensive sampling and averaging, provides insight into the macro features of the emergent dynamics.
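The binning-and-averaging step at the heart of this procedure can be sketched as follows; the grid step comes from the description above, while the 2-D latent layout and dictionary representation are illustrative:

```python
def average_vector_field(points, next_points, step=0.015):
    """Bin latent-space samples into square grid cells and average the
    per-sample motion vectors within each cell (the averaged motion
    vector per grid, as in Figure 3(d))."""
    sums, counts = {}, {}
    for (x, y), (nx, ny) in zip(points, next_points):
        cell = (int(x // step), int(y // step))
        dx, dy = nx - x, ny - y
        sx, sy = sums.get(cell, (0.0, 0.0))
        sums[cell] = (sx + dx, sy + dy)
        counts[cell] = counts.get(cell, 0) + 1
    return {c: (sx / counts[c], sy / counts[c])
            for c, (sx, sy) in sums.items()}
```

Each cell's averaged vector is then attached to the sparser grid of plotting points to draw the field.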

Figure 3: The method of generating the vector fields in Figure 3(c)(f). (a) Generate a sufficient number of grid points in phase space. (b) Divide the encoded latent space into grids with certain intervals, and each sample point in the latent space will fall into a certain grid. (c) Take the expectation of different direction vectors for each grid, obtaining the corresponding motion vector for that grid. (d) Zoom in on the local amplification schematic of the averaging process. The black dots in the figure represent the coordinates of sample points at a certain time, and the red dots represent the coordinates of these sample points at the next time step. If a sample point jumps from the i-th grid to the j-th grid in one time step, then a count is made in the direction of i→j, represented by a solid red arrow in the figure. Taking the expectation of all solid arrows yields the motion vector corresponding to that grid, represented by a dashed red arrow.

In Figure 3(c) and (f), we also provide a theoretic vector field, obtained by the following steps. First,

\frac{d\boldsymbol{y}_{t}}{dt}=\frac{\partial\phi(\boldsymbol{x}_{t})}{\partial t}=\frac{\partial\phi(\boldsymbol{x}_{t})}{\partial\boldsymbol{x}_{t}}\frac{d\boldsymbol{x}_{t}}{dt}, (58)

where $J_{\phi}\equiv\frac{d\phi(\boldsymbol{x}_{t})}{d\boldsymbol{x}_{t}}$ is the Jacobian matrix of the encoder. This relates the true dynamics to the latent-space dynamics and lets us obtain the latent-space vectors. We notice that, according to Equation 6:

\frac{d\boldsymbol{x}_{t}}{dt}=\frac{d(S_{t},S_{t},I_{t},I_{t})}{dt}, (59)

Because $\boldsymbol{x}_{t}=(\boldsymbol{S}^{\prime}_{t},\boldsymbol{I}^{\prime}_{t})$ and the noises are independent of time with zero mean, we can replace $\frac{d\boldsymbol{x}_{t}}{dt}$ by the noiseless derivative in the experiment. Finally, we obtain:

\frac{d\boldsymbol{y}_{t}}{dt}=J_{\phi}\cdot\frac{d(\boldsymbol{S}_{t},\boldsymbol{I}_{t})}{dt}, (60)

where $\boldsymbol{S}_{t}=(S_{t},S_{t})$ and $\boldsymbol{I}_{t}=(I_{t},I_{t})$. Therefore, after multiplying the ground-truth vector field by the Jacobian matrix $J_{\phi}$, we obtain the theoretic prediction of the vector field in the figure. The difference between this theoretic vector field and the true dynamics is due to the errors of the encoder $\phi$.

To obtain a better dynamical vector field, we specifically conducted multi-step prediction training when drawing Figure 3(c) and (f). Specifically, given the micro-state at a certain moment as input, we make multi-step predictions for the next 10 micro-states and combine them with certain weights to form a loss for gradient backpropagation. The weights decay at a rate of $e^{-0.2t}$ as the number of prediction steps increases. For reverse prediction, we take the target corresponding to the last step as the input and perform multi-step prediction training in the same way. In addition, we adopt interval sampling, taking every 100 sample points as one step, so that NIS+ can make long-term predictions on the SIR model.
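The weighted multi-step loss described here can be sketched as a plain function; combining the per-step errors as a weighted sum of absolute errors is an assumption, since the text only specifies the $e^{-0.2t}$ decay of the weights:

```python
from math import exp

def multistep_loss(pred_traj, true_traj, decay=0.2):
    """Combine per-step prediction errors over a rollout with weights
    that decay as exp(-decay * t) for prediction step t = 1, 2, ..."""
    loss = 0.0
    for t, (p, y) in enumerate(zip(pred_traj, true_traj), start=1):
        loss += exp(-decay * t) * abs(p - y)
    return loss
```

Later prediction steps thus contribute less, so the model is not dominated by the hardest, farthest-horizon targets.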

12.   Boids Model

In 1986, Craig Reynolds created a simulation of the collective behavior of birds, known as the Boids model [50]. The model uses only three simple rules to control the interactions between individuals, yet produces flock-like behavior.

12.1.   Method of Simulating the Trajectory of Boids

Figure 4: Boids model. By modifying $d_{\max}$ and $d_{\min}$ at time $t$, the aggregation of the bird flocks changes. Let $\Delta v_{1,i}^{t}$, $\Delta v_{2,i}^{t}$, $\Delta v_{3,i}^{t}$ be the variations of the separation, parallelism, and cohesion velocities, respectively. The $i$-th bird is marked in red. At time $t$, birds at a distance less than $d_{min}$ from bird $i$ affect $\Delta v_{1,i}^{t}$, and birds at a distance less than $d_{max}$ from bird $i$ affect $\Delta v_{2,i}^{t}$ and $\Delta v_{3,i}^{t}$. In this way, birds maintain a distance from each other while the flock flies in an orderly manner in one direction.

In the Boids model, the micro-state of each individual boid $i$ is a four-dimensional vector $(x_{i},y_{i},v_{x,i},v_{y,i})$, where $\boldsymbol{x}_{i}=(x_{i},y_{i})$ is the position vector and $\boldsymbol{v}_{i}=(v_{x,i},v_{y,i})$ is the velocity vector. Thus, the complete micro-state of all $N$ boids is a $4N$-dimensional vector. The micro-states follow the dynamical update equations:

\begin{cases}\boldsymbol{x}_{i}^{t+1}=\boldsymbol{x}_{i}^{t}+\boldsymbol{v}_{i}^{t},\\ \boldsymbol{v}_{i}^{t+1}=\frac{\boldsymbol{v}_{i}^{t}+\Delta\boldsymbol{v}_{i}^{t}+\varepsilon_{i}^{t}}{||\boldsymbol{v}_{i}^{t}+\Delta\boldsymbol{v}_{i}^{t}+\varepsilon_{i}^{t}||}||\boldsymbol{v}_{i}^{t}||.\end{cases} (61)

Here, $\Delta\boldsymbol{v}_{i}^{t}$ is the external force exerted on boid $i$ by its neighbors; according to the dynamical rules of the Boids model, it is composed of three components, called cohesion, alignment, and separation. $\varepsilon_{i}^{t}=(\cos\Delta\alpha_{i}^{t},\sin\Delta\alpha_{i}^{t})^{T}$ is a turning force exerted on each boid at each time step, where $\Delta\alpha\in[-\pi,\pi]$ is a fixed value or a random number.

The motion vector of bird $i$ at time $t$ can be expressed as $(x_{i}^{t},y_{i}^{t},v_{x,i}^{t},v_{y,i}^{t})$, where $x_{i}^{t}$ and $y_{i}^{t}$ are the position coordinates, and $v_{x,i}^{t}$ and $v_{y,i}^{t}$ are the projections of the velocity onto the two coordinate axes. With $N$ boids, $i=1,2,\dots,N$. The motion vector can be decomposed into a position vector $\textbf{x}_{i}^{t}=(x_{i}^{t},y_{i}^{t})$ and a velocity vector $\textbf{v}_{i}^{t}=(v_{x,i}^{t},v_{y,i}^{t})$. The distance between two boids $i$ and $j$ is $d_{ij}=||\textbf{x}_{i}^{t}-\textbf{x}_{j}^{t}||=\sqrt{(x_{i}-x_{j})^{2}+(y_{i}-y_{j})^{2}}$. Although the Boids model is a continuous system, to facilitate the simulation and subsequent experiments we uniformly set $\Delta t=1$. For a single boid, the acceleration $\textbf{a}_{i}^{t}=\textbf{v}_{i}^{t+1}-\textbf{v}_{i}^{t}$ can be divided into four parts: separation, parallelism, cohesion, and random deflection-angle noise. We stipulate that the speed of a boid remains stable during flight, so the acceleration mainly changes the direction of the boid's flight. Once the motion state of each boid in the flock at time $t$ is known, the motion state at time $t+1$ can be generated. Let $\Delta v_{1,i}^{t}$, $\Delta v_{2,i}^{t}$, $\Delta v_{3,i}^{t}$ be the variations of the separation, parallelism, and cohesion velocities, respectively, where

\begin{cases}\Delta v_{1,i}^{t}=\frac{\textbf{x}_{i}^{t}-\textbf{x}_{\Phi_{i}}^{t}}{||\textbf{x}_{i}^{t}-\textbf{x}_{\Phi_{i}}^{t}||},\\ \Delta v_{2,i}^{t}=\frac{\sum_{k\in\Psi_{i}^{t}}\textbf{v}_{k}^{t}}{||\sum_{k\in\Psi_{i}^{t}}\textbf{v}_{k}^{t}||},\\ \Delta v_{3,i}^{t}=-\frac{\textbf{x}_{i}^{t}-\textbf{x}_{\Psi_{i}}^{t}}{||\textbf{x}_{i}^{t}-\textbf{x}_{\Psi_{i}}^{t}||}.\end{cases} (62)

We stipulate that the maximum and minimum detection distances of the boids are $d_{\max}$ and $d_{\min}$. The sets of boids within the minimum and maximum detection ranges are $\Phi_{i}^{t}=\{j|d_{ij}^{t}<d_{\min}\}$ and $\Psi_{i}^{t}=\{j|d_{ij}^{t}<d_{\max}\}$. The centers of gravity of $\Phi_{i}$ and $\Psi_{i}$ are $\textbf{x}_{\Phi_{i}}^{t}=\sum_{k\in\Phi_{i}^{t}}\textbf{x}_{k}^{t}/|\Phi_{i}|$ and $\textbf{x}_{\Psi_{i}}^{t}=\sum_{k\in\Psi_{i}^{t}}\textbf{x}_{k}^{t}/|\Psi_{i}|$, i.e., the mean positions of the boids within the two sets. With the equation

\Delta\boldsymbol{v}_{i}^{t}=\Delta v_{1,i}^{t}+\Delta v_{2,i}^{t}+\Delta v_{3,i}^{t}, (63)

we obtain Equation 61 in Section 3.2. The relationships between the above variables are shown in Fig. 4. By modifying the starting vector $(x_{i}^{0},y_{i}^{0},v_{x,i}^{0},v_{y,i}^{0})$ or $d_{\max}$ and $d_{\min}$, the aggregation of the bird flocks changes, and we can adjust the trajectory and randomness of the boids by modifying the mean and variance of the deflection-angle noise $\varepsilon_{i}^{t}$ in Equation 61.
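The speed-preserving velocity update of Equation 61 can be sketched directly; this fragment covers only the normalization step, with the social force $\Delta\boldsymbol{v}$ and turning noise $\varepsilon$ supplied by the caller (per Equations 62 and 63):

```python
from math import sqrt

def norm(v):
    return sqrt(v[0] ** 2 + v[1] ** 2)

def update_velocity(v, dv, eps):
    """Velocity update of Equation 61: add the social force dv and the
    turning noise eps, then rescale so the speed ||v|| is unchanged."""
    raw = (v[0] + dv[0] + eps[0], v[1] + dv[1] + eps[1])
    scale = norm(v) / norm(raw)
    return (raw[0] * scale, raw[1] * scale)
```

The rescaling makes the forces act purely on heading, which is exactly the stipulation that a boid's speed stays stable during flight.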

12.2.   Methods for Collecting Experimental Data and Training

We record the velocity vector and position vector of each bird at each time $t$ during its flight. For each time step $t$, we record the data at that time and at the next time step $t+1$. We merge the data at each time step $t$ into a $4N$-dimensional vector:

X_{t}=(x_{1}^{t},y_{1}^{t},v_{x,1}^{t},v_{y,1}^{t},\dots,x_{N}^{t},y_{N}^{t},v_{x,N}^{t},v_{y,N}^{t}). (64)

This vector corresponds to our micro-state data $X_{t}$, with $p=64$. We repeat the preparation and recording stages multiple times until the data volume meets our expectations. The bird IDs $1,2,\dots,16$ are evenly divided into two sets, $\{1,2,\dots,8\}$ and $\{9,10,\dots,16\}$. The model we generated contains two groups of birds randomly generated near the center of the canvas.

We generate the observation data in two phases: preparation and recording. In the preparation phase, one group of birds is randomly generated within a circle of radius 5 centered at $(148,150)$, and the other group within a circle of radius 5 centered at $(152,150)$. Then, by modifying $(v_{x,i}^{0},v_{y,i}^{0})$ or the mean and variance of the deflection angle $\varepsilon_{i}^{t}$, the two groups of birds can fly different trajectories, as in Fig. 4a and 4e. The norm of the velocity vector is always 1, i.e., $||\textbf{v}_{i}^{t}||\equiv 1$ for $t=1,2,\dots$. We set $d_{max}=5$ and $d_{min}=1$ so that birds within the same group interact with each other, while the effects between birds of different groups can be ignored. Subsequently, we let the two groups of birds fly for 20 steps without affecting each other, resulting in two well-separated and internally stable groups. The positions and velocities of all birds then serve as the starting point for recording data.

In the recording phase, the birds continue to fly for 50 steps from the end of the preparation phase, and we record the velocity and position vectors of each bird separately. We randomly shuffle the generated data and input it into our NIS+ model for training, using the multistory structure of NIS+ shown in Fig. 7. We sample the data with a batch size of 4. After inputting X_t, the macro-states Y_t and Y_{t+1} and the predicted value X̂_{t+1} are output, and training can begin. The encoders and decoders are stacked, and the dynamics learners in the different layers, with q = 64, 32, 16, 8, 4, 2, 1, operate in parallel. This design allows dynamics learners with distinct dimensions to be trained simultaneously and enables a parallel search for the optimal dimension, which is q = 8. Because earlier experiments showed that the model performs better at q = 8 than at other scales, and too many layers also slow down learning, we discard the layers with q < 8 in the later experiments. All the boids are separated into two groups by forcing their deflection angles Δα_i^t to take two distinct ranges: Δα_i^t ~ U(0.0058π, 0.0098π) for boids with i ≤ 8, and Δα_i^t ~ U(−0.0124π, −0.0084π) for boids with i > 8. As a result, the two groups follow separating trajectories with different turning angles, as shown in Figure 4a.
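The group-dependent deflection angles above can be sampled as in the following sketch (the function name and generator seed are illustrative):

```python
import numpy as np

def sample_deflections(rng, N=16):
    # Boids 1..8 share one uniform turning range and boids 9..16 the
    # opposite one, matching the ranges U(0.0058pi, 0.0098pi) and
    # U(-0.0124pi, -0.0084pi) stated in the text.
    d = np.empty(N)
    d[:N // 2] = rng.uniform(0.0058 * np.pi, 0.0098 * np.pi, N // 2)
    d[N // 2:] = rng.uniform(-0.0124 * np.pi, -0.0084 * np.pi, N // 2)
    return d

rng = np.random.default_rng(0)
delta_alpha = sample_deflections(rng)  # positive for group 1, negative for group 2
```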

After training NIS for 800,000 epochs, 𝒥_q reaches its maximum value at q = 8. We then continue to optimize the existing model with NIS+ for 400,000 epochs; the results are shown in Fig. 4c. We can then obtain multi-step prediction data as shown in Figure 4a. The multi-step predicted trajectory matches the flight trajectory of the real bird flock. However, if the variance of ε_i^t is large, the training of the model deteriorates and the multi-step prediction horizon becomes shorter, as shown in Fig. 4e. Causal emergence occurs at all tested dimensions q (see Figure 4c). With q = 8 dimensional macro-state vectors, NIS+ best captures the emergent collective flying behaviors of the two groups by tracing the centers of their trajectories. This can be visualized by decoding the predicted macro-states into predicted micro-states, shown as the two solid lines in Figure 4a. With this method of generating data, we can observe the impact of noise. For different observation noises, when the noise is not too large, we can use the same procedure, training the model with NIS first and then optimizing it with NIS+, to obtain Fig. 4f. For different deflection-angle noises, after training from random initial parameter values, we optimized the models at the fixed dimension q = 8 in order to control variables uniformly, obtaining Fig. 4g.

We test Rosas’ Ψ using the same learned macro-state variable of NIS+. The results, shown in Figure 5, display Rosas’ Ψ for different extrinsic noises. In all cases, Ψ is far below 0. One possible explanation is that the approximation underlying Ψ ignores much redundant information. Another possible reason is that the Boids’ coordinate data have a large order of magnitude, which can greatly inflate the amount of information counted as lost. The resulting unreasonable values make it impossible for Rosas’ Ψ to determine whether CE occurs. Therefore, the Δ𝒥 proposed in this paper is a superior method for identifying CE.

Refer to caption
Figure 5: The results of Rosas’ Ψ for different extrinsic noises. In all cases, Ψ is far below 0, producing unreasonable results that make it impossible for Rosas’ Ψ to determine whether CE occurs. Therefore, the Δ𝒥 proposed in this paper is a superior method for identifying CE in the Boids model.

13.   Game of Life

Conway’s Game of Life is a famous two-dimensional cellular automaton on which various interesting dynamical patterns, such as gliders, squares, flowers, signal lights, honeycombs, and traffic lights, emerge. Different from the SIR and Boids models, the micro-states of the Game of Life at each time step are discrete (0 or 1) on a large regular grid, as shown in Figure 6. Furthermore, the micro-dynamics cannot be represented by differential or difference equations but only by rule tables (see Section 13.1 for details).

We train NIS+ using data generated by Game of Life simulations with random initial conditions, extracting the time series of states from the 100th to the 120th step. Figures 6(a), (b), and (c) show the dynamical patterns generated by the ground-truth simulations (first row) and the predictions by NIS+ (third row), as well as the emergent macro-states that produce those predictions (second row). We input two images at successive time steps into NIS+ and obtain another image pair for the next two successive time steps. Comparing the upper and lower rows, the patterns are similar. However, the learned and predicted patterns in the third column, specifically the “glider” pattern, appear vague because training samples containing this pattern occur rarely under random initial conditions. To enhance the quality of predictions, we can generate a new set of training samples whose initial conditions include two “gliders”. As a result, the predictions become clearer, as depicted in Figure 6(d), even though the number of gliders in this test environment is three. Thus, NIS+ can capture patterns including moving, static, and oscillating structures.

Furthermore, we evaluate the generalization ability of the models in test environments that differ from the training environments. We compare the multi-step prediction performance of NIS and NIS+ across eight different pattern types, all distinct from the initial random patterns. The results, depicted in Figure 6(f), show that NIS+ consistently achieves a higher AUC (area under the curve) than NIS for all pattern types. In the x-axis tick labels, we adopt the format “pattern name (quantity)” to denote the various initial conditions; for instance, “glider(2)” signifies an initial configuration comprising two gliders. This provides evidence that the enhanced NIS+ method possesses superior generalization ability in capturing these patterns.

We further test the degree of CE (Δ𝒥) and compare it with Rosas’ Ψ using the same learned macro-state variable of NIS+, with the same patterns as initial conditions. The results, shown in Figure 6(g), display the degree of CE, where each bar represents a combination of Δ𝒥 and Rosas’ Ψ for one initial pattern. Regarding Δ𝒥, all eight cases except the “random” case demonstrate the occurrence of CE. The case with “glider” patterns exhibits the lowest degree of CE due to poor prediction (see Figure 6(c)); the remaining seven patterns show similar Δ𝒥 values. These results indicate that Δ𝒥 provides a more reasonable indication of the occurrence of CE, aligning with our intuition. However, Rosas’ Ψ yields values less than or equal to 0 in all cases. One possible explanation is that the approximation underlying Ψ ignores much redundant information. The resulting unreasonable values make it impossible for Rosas’ Ψ to determine whether CE occurs. Therefore, the Δ𝒥 proposed in this paper is a superior method for identifying CE.

Additionally, this example highlights the versatility of NIS+. In order to conduct the aforementioned experiments, we needed to coarse-grain the micro-states of the cellular automata in both spatial and temporal dimensions. To address this challenge, we incorporated the concept of spatiotemporal convolution. The architecture used in this experiment is illustrated in Figure 6(e). The entire coarse-graining process can be divided into two steps: first, aggregating information within a fixed-size window (a 3x3 window in this paper) to obtain spatial coarse-grained results; and second, aggregating these results over multiple successive time steps to form a spatiotemporal coarse macro-state. All of these processes are implemented through parallel encoders in NIS+.
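The two-step coarse-graining described above can be illustrated with a toy mean-pooling analogue (a sketch only: the paper’s actual encoders are learned invertible networks, not fixed averages):

```python
import numpy as np

def coarse_grain(frames, win=3, t_win=2):
    # frames: (T, H, W) micro-states. Step 1: average each non-overlapping
    # win x win spatial window; step 2: average every t_win successive
    # coarse frames into one spatiotemporal macro-state.
    T, H, W = frames.shape
    spatial = frames.reshape(T, H // win, win, W // win, win).mean(axis=(2, 4))
    return spatial.reshape(T // t_win, t_win, H // win, W // win).mean(axis=1)

frames = np.ones((2, 18, 18))  # an 18x18 grid over 2 time steps
macro = coarse_grain(frames)   # shape (1, 6, 6), matching Table 2
```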

Refer to caption
Figure 6: Experiments on the Game of Life. (a-d) depict comparisons between the ground-truth simulations, the learned macro-states, and the predictions made by NIS+ for different patterns. In (a-c), the training data is generated by simulating the Game of Life for 120 steps with random initial conditions, and only the images from steps 100-120 are selected to form the training set. (d) depicts a generalization test in which the number of gliders is increased from the original two in the training input to three. In (a-d), the predicted images are the averaged results obtained by sampling from the predicted macro-states with the decoder 20 times; the colors represent the frequency of repetitions in the 20 samples. (e) The architecture of the encoder, which can be divided into a spatial encoding stage (mapping 3x3 grids to 1 grid) and a temporal encoding stage (mapping 2 time steps to 1 step); these processes are implemented using parallel encoders. (f) Comparisons of AUC (area under the curve) between NIS and NIS+ on different patterns, where each result is the average over multi-step predictions (2 steps). Each case represents a quantity combination of a selected pattern as the initial state. There are 9 patterns, which can be divided into four categories: static (flower, vessel, bread, aircraft carrier, and honeycomb), oscillating (signal light, flames of war), moving (glider), and random (random). The horizontal coordinate of each bar is given in the format “pattern name (quantity)”. (g) Comparisons of the degrees of CE (Δ𝒥, dimension-averaged EI, and Rosas’ Ψ) tested with various patterns.

13.1.   Data generation

In this paper, we use Conway’s Game of Life as the experimental object. Each cell of the two-dimensional grid has two states, alive (1) or dead (0), and each cell is affected by its eight neighbors. The evolution of the Game of Life is determined only by the input state and its update rules; there are four evolutionary rules, corresponding to cell reproduction, survival, and death. The update rules are shown in the following table:

Table 1: The update rules for the Game of Life.
x_t	x_{t+1}
0	1 (exactly three living cells around)
1	0 (fewer than two living cells around)
1	1 (two or three living cells around)
1	0 (more than three living cells around)
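The rules in Table 1 can be implemented compactly. The sketch below uses periodic boundaries for simplicity (the boundary convention is our assumption, not stated in the paper):

```python
import numpy as np

def life_step(x):
    # x: (H, W) 0/1 grid. Count the eight neighbors by summing the eight
    # shifted copies of the grid (periodic boundaries), then apply Table 1:
    # a dead cell with exactly 3 live neighbors is born; a live cell with
    # 2 or 3 live neighbors survives; every other cell is dead.
    n = sum(np.roll(np.roll(x, i, axis=0), j, axis=1)
            for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0))
    return ((n == 3) | ((x == 1) & (n == 2))).astype(int)

x = np.zeros((5, 5), dtype=int)
x[1:4, 2] = 1        # a vertical "blinker"
y = life_step(x)     # rotates into a horizontal blinker
```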

The training-sample generation process for the Game of Life is as follows. First, a state x_t is initialized. When considering temporal coarse-graining over two steps, the subsequent three states x_{t+1}, x_{t+2}, and x_{t+3} are generated according to the update rules and input to the machine learning model. The two input states are x_t and x_{t+1}, and the micro-dynamics outputs are x_{t+1} and x_{t+2}. Because spatiotemporal coarse-graining is used, the macro-dynamics outputs a macro-state that is decoded into the micro-states x_{t+2} and x_{t+3}. This process is repeated multiple times (50,000 samples) to generate the training data for Figure 6d; in the other experiments, we generate 500,000 samples.
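The sample-generation loop can be sketched as follows (`life_step` is a periodic-boundary implementation of the update rules; the grid size and random seed are illustrative):

```python
import numpy as np

def life_step(x):
    # One Game-of-Life update with periodic boundaries (see Table 1).
    n = sum(np.roll(np.roll(x, i, axis=0), j, axis=1)
            for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0))
    return ((n == 3) | ((x == 1) & (n == 2))).astype(int)

def make_sample(rng, size=18):
    # Roll the dynamics three steps forward from a random x_t and split the
    # four states into the input pair, the micro-dynamics target pair, and
    # the pair decoded from the macro-dynamics output.
    xs = [rng.integers(0, 2, (size, size))]
    for _ in range(3):
        xs.append(life_step(xs[-1]))
    inputs = (xs[0], xs[1])        # x_t, x_{t+1}
    micro_target = (xs[1], xs[2])  # x_{t+1}, x_{t+2}
    macro_target = (xs[2], xs[3])  # x_{t+2}, x_{t+3}
    return inputs, micro_target, macro_target

rng = np.random.default_rng(0)
inputs, micro_target, macro_target = make_sample(rng)
```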

13.2.   The model predictive ability on glider pattern

We then test the capability of capturing dynamical patterns on the glider pattern, where the model is trained on data containing two glider patterns. The model yields good predictions, and the results are shown in Figure 7.

In addition, please refer to Table 2 for more detailed information on the other model parameters.

Refer to caption
Figure 7: Experiments on the Game of Life. The results show the dynamical patterns generated by the ground-truth simulations (first row) and the predictions by NIS+ (third row), as well as the emergent macro-states that produce those predictions (second row). Here the input state size of the model test is 18×18×2, and it also consists of two gliders.

14.   Brain Results

14.1.   Resting state

We also apply integrated gradients to examine what the seven macro dimensions mean for the micro brain areas. Different from the results for the visual fMRI data, we can see some distinct patterns. However, we do not observe a one-to-one relationship between these seven dimensions and the seven systems defined by the Schaefer atlas (see Figure 8).

Refer to caption
Figure 8: The integrated-gradient results of the resting-state NIS+ analysis when q = 7. Each pair of left and right hemispheres corresponds to one macro dimension. Only the five percent of areas with the highest attribution values are drawn, to better show the distinct patterns among the seven macro dimensions.
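The integrated-gradient attribution used in this analysis [52] can be sketched for a generic differentiable macro-variable; the linear toy model below is purely illustrative (its gradient is constant, so the attributions are known in closed form):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline=None, steps=100):
    # Midpoint-rule approximation of integrated gradients:
    # IG(x) = (x - x') * integral over a in [0, 1] of grad f(x' + a (x - x')).
    if baseline is None:
        baseline = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([grad_f(baseline + a * (x - baseline)) for a in alphas],
                       axis=0)
    return (x - baseline) * avg_grad

# Toy macro-variable f(x) = w . x, whose gradient is w everywhere, so the
# attribution of each micro input i is exactly w_i * x_i.
w = np.array([0.5, -1.0, 2.0])
attr = integrated_gradients(lambda z: w, np.array([1.0, 2.0, 3.0]))
```

In the paper’s setting, `grad_f` would be the gradient of a learned macro dimension with respect to the micro brain-area inputs, obtained by automatic differentiation.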

14.2.   AAL Results

We repeat the analyses of sections 3.3 and 8, except that we utilize the AAL3 atlas and set λ to 0.5 for the layer-1 training of NIS+, since λ = 1 leads to a trivial solution in which the macro predictions for all areas remain the same, making the predictions meaningless. We see converging evidence that the NIS+ framework leads to an improvement in Δ𝒥 (see Figure 9(b)) and a stronger correlation with the visual areas (see Figure 9(a) and (c)).

Refer to caption
Figure 9: The learning results, the degree of causal emergence, and the attribution analysis of NIS+ and NIS on the fMRI data for the brains. (a) The mean errors of the multi-step predictions increase with the prediction steps under different scales (q) on the test dataset. (b) Measures of CE (dimension-averaged Δ𝒥) are compared among different models and different datasets, including movie-watching fMRI (visual fMRI) and resting fMRI. The bars show the averaged results over 10 repeated experiments, and the error bars represent the standard deviation. (c) The attributions of the top 8 significant and insignificant areas under the AAL atlas are presented. The error bars represent standard errors. (d) Attribution maps for the movie-watching (visual) fMRI data. The maps show the left hemisphere viewed from the left, the left hemisphere viewed from the right, the right hemisphere viewed from the right, and the right hemisphere viewed from the left. The right column gives a detailed map of the visual areas: the upper shows the left visual areas and the bottom the right visual areas. The colors represent the normalized absolute values of the integrated gradients.

15.   The Structure and Parameters of Neural Networks

Next, we list the neural network parameters used in our experimental studies, as shown in Table 2. The parameter λ in the table comes from Equation 9 and serves as the weight coefficient of the forward prediction error. L is an assumption about the range of the data distribution: it is the half-width of a hypercube with side length 2L, used to compute probability values under a uniform distribution. Different types of neural networks are used to handle different types of complex systems. Both the Boids and fMRI experiments employ the multistory NIS+ framework, which traverses different macroscopic scales and eventually selects an optimal scale (the input or output dimension in the Inverse Dynamics Learner rows).

Table 2: Parameter table for all experiments conducted using neural networks.
Experiment Module Parameter
SIR Encoder (Decoder) Three RealNVP modules; Input (Output) dimensions: 4; Macro-state dimensions: 2
Dynamics Learner Input (Output) dimensions: 2; Hidden Layers: 2; Hidden Units: 64; Activation Function: LeakyReLU; Total Epochs: 30,000; Batch Size: 700; L: 1
Inverse Dynamics Learner λ: 0.33; Input (Output) dimensions: 2; Hidden Layers: 2; Hidden Units: 64; Activation Function: LeakyReLU; Stage II Epochs: 27,000
Boids Encoder (Decoder) Three RealNVP modules; Input (Output) dimensions: 64; Macro-state dimensions: 32, 16, 8
Dynamics Learner Input (Output) dimensions: 64, 32, 16, 8; Hidden Layers: 2; Hidden Units: 32; Activation Function: LeakyReLU; Total Epochs: 800,000; Batch Size: 4; L: 100
Inverse Dynamics Learner λ: 1; Input (Output) dimensions: 8; Hidden Layers: 2; Hidden Units: 32; Activation Function: LeakyReLU; Stage II Epochs: 400,000
Game of Life Encoder (Decoder) Three RealNVP modules; Input (Output) dimensions: 18×18×2; Macro-state dimensions: 6×6×1
Dynamics Learner Input (Output) dimensions: 36; Hidden Layers: 2; Hidden Units: 64; Activation Function: LeakyReLU; Total Epochs: 300,000; Batch Size: 50; L: 100
Inverse Dynamics Learner λ: 1; Input (Output) dimensions: 18×18×2; Hidden Layers: 2; Hidden Units: 64; Activation Function: LeakyReLU; Stage II Epochs: 200,000
fMRI Encoder (Decoder) Six RealNVP modules; Input (Output) dimensions: 100; Macro-state dimensions: 52, 27, 14, 7, 3, 1
Dynamics Learner Input (Output) dimensions: 100, 52, 27, 14, 7, 3, 1; Hidden Layers: 5; Hidden Units: 256; Activation Function: LeakyReLU with a final tanh; Total Epochs: 60,000; Batch Size: 100; L: 1
Inverse Dynamics Learner λ: 1; Input (Output) dimensions: 1; Hidden Layers: 5; Hidden Units: 256; Activation Function: LeakyReLU; Stage II Epochs: 10,000

References

  • [1] H. Sayama, Introduction to the modeling and analysis of complex systems.   Open SUNY Textbooks, 2015.
  • [2] J. Odell, “Agents and complex systems,” Journal of Object Technology, vol. 1, no. 2, pp. 35–45, 2002.
  • [3] W.-X. Wang, Y.-C. Lai, and C. Grebogi, “Data based identification and prediction of nonlinear and complex dynamical systems,” Physics Reports, p. 1–76, Jul 2016. [Online]. Available: http://dx.doi.org/10.1016/j.physrep.2016.06.004
  • [4] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, “Neural relational inference for interacting systems,” in International conference on machine learning.   PMLR, 2018, pp. 2688–2697.
  • [5] J. Casadiego, M. Nitzan, S. Hallerberg, and M. Timme, “Model-free inference of direct network interactions from nonlinear collective dynamics,” Nature Communications, vol. 8, no. 1, Dec 2017. [Online]. Available: http://dx.doi.org/10.1038/s41467-017-02288-4
  • [6] Y. Zhang, Y. Guo, Z. Zhang, M. Chen, S. Wang, and J. Zhang, “Universal framework for reconstructing complex networks and node dynamics from discrete or continuous dynamics data,” Physical Review E, vol. 106, no. 3, p. 034315, 2022.
  • [7] J. H. Holland, Emergence: From chaos to order.   OUP Oxford, 2000.
  • [8] S. Ben Tahar, J. J. Muñoz, S. J. Shefelbine, and E. Comellas, “Turing pattern prediction in three-dimensional domains: the role of initial conditions and growth,” bioRxiv, pp. 2023–03, 2023.
  • [9] P. C. Matthews and S. H. Strogatz, “Phase diagram for the collective behavior of limit-cycle oscillators.” Physical Review Letters, vol. 65, no. 14, p. 1701–1704, Jul 2002. [Online]. Available: http://dx.doi.org/10.1103/physrevlett.65.1701
  • [10] Y. Du, Z. He, Q. Gao, H. Zhang, C. Zeng, D. Mao, and J. Zhao, “Emergent phenomena of vector solitons induced by the linear coupling,” Laser & Photonics Reviews, p. 2300076, 2023.
  • [11] F. E. Rosas, P. A. M. Mediano, H. J. Jensen, A. K. Seth, A. B. Barrett, R. L. Carhart-Harris, and D. Bor, “Reconciling emergences: An information-theoretic approach to identify causal emergence in multivariate data,” PLOS Computational Biology, vol. 16, no. 12, p. e1008289, Dec 2020. [Online]. Available: http://dx.doi.org/10.1371/journal.pcbi.1008289
  • [12] J. Zhang and K. Liu, “Neural information squeezer for causal emergence,” Entropy, vol. 25, no. 1, p. 26, 2022.
  • [13] E. O’toole, V. Nallur, and S. Clarke, “Decentralised detection of emergence in complex adaptive systems,” ACM Transactions on Autonomous and Adaptive Systems, vol. 12, no. 1, p. 1–31, Mar 2017. [Online]. Available: http://dx.doi.org/10.1145/3019597
  • [14] F. P. Kemeth, T. Bertalan, T. Thiem, F. Dietrich, S. J. Moon, C. R. Laing, and I. G. Kevrekidis, “Learning emergent partial differential equations in a learned emergent space,” Nature communications, vol. 13, no. 1, p. 3318, 2022.
  • [15] F. Keijzer, “Artificial life xi: Proceedings of the eleventh international conference on the simulation and synthesis of living systems,” MIT Press eBooks, Jan 2008.
  • [16] C. Shalizi and C. Moore, “What is a macrostate? subjective observations and objective dynamics,” arXiv: Statistical Mechanics, Mar 2003.
  • [17] A. K. Seth, “Measuring emergence via nonlinear granger causality.” in alife, vol. 2008, 2008, pp. 545–552.
  • [18] R. A. Haugen, N.-O. Skeie, G. Muller, and E. Syverud, “Detecting emergence in engineered systems: A literature review and synthesis approach,” Systems Engineering, 2023.
  • [19] D. Fisch, M. Jänicke, B. Sick, and C. Müller-Schloer, “Quantitative emergence–a refined approach based on divergence measures,” in 2010 Fourth IEEE International Conference on Self-Adaptive and Self-Organizing Systems.   IEEE Computer Society, 2010, pp. 94–103.
  • [20] M. Mnif and C. Müller-Schloer, “Quantitative emergence,” Organic Computing—A Paradigm Shift for Complex Systems, pp. 39–52, 2011.
  • [21] D. Fisch, M. Jänicke, C. Müller-Schloer, and B. Sick, “Divergence measures as a generalised approach to quantitative emergence,” Organic Computing—A Paradigm Shift for Complex Systems, pp. 53–66, 2011.
  • [22] J. S. Osmundson, T. V. Huynh, and G. O. Langford, “Kr14 emergent behavior in systems of systems,” in INCOSE International Symposium, vol. 18, no. 1.   Wiley Online Library, 2008, pp. 1557–1568.
  • [23] R. Raman and A. Murugesan, “Framework for complex sos emergent behavior evolution using deep reinforcement learning,” in INCOSE International Symposium, vol. 32, no. 1.   Wiley Online Library, 2022, pp. 809–823.
  • [24] Y. M. Teo, B. L. Luong, and C. Szabo, “Formalization of emergence in multi-agent systems,” in Proceedings of the 1st ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, 2013, pp. 231–240.
  • [25] E. P. Hoel, L. Albantakis, and G. Tononi, “Quantifying causal emergence shows that macro can beat micro,” Proceedings of the National Academy of Sciences, vol. 110, no. 49, pp. 19 790–19 795, 2013.
  • [26] E. P. Hoel, “When the map is better than the territory,” Entropy, vol. 19, no. 5, p. 188, 2017.
  • [27] J. Fromm, “Types and forms of emergence,” arXiv: Adaptation and Self-Organizing Systems, Jun 2005.
  • [28] G. Tononi and O. Sporns, “Measuring information integration,” BMC neuroscience, vol. 4, pp. 1–20, 2003.
  • [29] R. Comolatti and E. Hoel, “Causal emergence is widespread across measures of causation,” arXiv preprint arXiv:2202.01854, 2022.
  • [30] R. Griebenow, B. Klein, and E. Hoel, “Finding the right scale of a network: Efficient identification of causal emergence through spectral clustering,” arXiv: Social and Information Networks, Aug 2019.
  • [31] E. Hoel and M. Levin, “Emergence of informative higher scales in biological systems: a computational toolkit for optimal prediction and control,” Communicative & Integrative Biology, vol. 13, no. 1, p. 108–118, Jan 2020. [Online]. Available: http://dx.doi.org/10.1080/19420889.2020.1802914
  • [32] S. Marrow, E. J. Michaud, and E. Hoel, “Examining the causal structures of deep neural networks using information theory,” Entropy, p. 1429, Dec 2020. [Online]. Available: http://dx.doi.org/10.3390/e22121429
  • [33] B. Klein and E. Hoel, “The emergence of informative higher scales in complex networks,” Complexity, vol. 2020, pp. 1–12, 2020.
  • [34] P. L. Williams and R. D. Beer, “Nonnegative decomposition of multivariate information,” arXiv preprint arXiv:1004.2515, 2010.
  • [35] P. R. Vlachas, G. Arampatzis, C. Uhler, and P. Koumoutsakos, “Multiscale simulations of complex systems by learning their effective dynamics,” Nature Machine Intelligence, vol. 4, no. 4, pp. 359–366, 2022.
  • [36] D. Floryan and M. D. Graham, “Data-driven discovery of intrinsic dynamics,” Nature Machine Intelligence, vol. 4, no. 12, pp. 1113–1120, 2022.
  • [37] L. Cai and S. Ji, “A multi-scale approach for graph link prediction,” Proceedings of the AAAI Conference on Artificial Intelligence, p. 3308–3315, Jun 2020. [Online]. Available: http://dx.doi.org/10.1609/aaai.v34i04.5731
  • [38] Z. Chen, S. Li, B. Yang, Q. Li, and H. Liu, “Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition,” Proceedings of the AAAI Conference on Artificial Intelligence, p. 1113–1122, Sep 2022. [Online]. Available: http://dx.doi.org/10.1609/aaai.v35i2.16197
  • [39] L. R. Goldberg, “The book of why: The new science of cause and effect,” Notices of the American Mathematical Society, p. 1, Aug 2019. [Online]. Available: http://dx.doi.org/10.1090/noti1912
  • [40] Z. Shen, J. Liu, Y. He, X. Zhang, R. Xu, H. Yu, and P. Cui, “Towards out-of-distribution generalization: A survey,” Cornell University - arXiv,Cornell University - arXiv, Aug 2021.
  • [41] B. Schölkopf and J. von Kügelgen, “From statistical to causal learning,” arXiv preprint arXiv:2204.00607, 2022.
  • [42] D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, J. Mooij, and B. Schölkopf, “On causal and anticausal learning,” International Conference on Machine Learning, Jun 2012.
  • [43] J. Peters, P. Bühlmann, and N. Meinshausen, “Causal inference by using invariant prediction: identification and confidence intervals,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 78, no. 5, pp. 947–1012, 2016.
  • [44] A. Zhang, C. Lyle, S. Sodhani, A. Filos, M. Kwiatkowska, J. Pineau, Y. Gal, and D. Precup, “Invariant causal prediction for block mdps,” in International Conference on Machine Learning.   PMLR, 2020, pp. 11 214–11 224.
  • [45] W. O. Kermack and A. G. McKendrick, “A contribution to the mathematical theory of epidemics,” Proceedings of the royal society of london. Series A, Containing papers of a mathematical and physical character, vol. 115, no. 772, pp. 700–721, 1927.
  • [46] A. Attanasi, A. Cavagna, L. Del Castello, I. Giardina, T. S. Grigera, A. Jelić, S. Melillo, L. Parisi, O. Pohl, E. Shen et al., “Information transfer and behavioural inertia in starling flocks,” Nature physics, vol. 10, no. 9, pp. 691–696, 2014.
  • [47] M. Gardner, “The fantastic combinations of John Conway’s new solitaire game ‘Life’,” Sc. Am., vol. 223, pp. 20–123, 1970.
  • [48] F. Eberhardt and L. L. Lee, “Causal emergence: When distortions in a map obscure the territory,” Philosophies, vol. 7, no. 2, p. 30, 2022.
  • [49] M. Rosenblatt, “Remarks on some nonparametric estimates of a density function,” The annals of mathematical statistics, pp. 832–837, 1956.
  • [50] C. W. Reynolds, “Flocks, herds and schools: A distributed behavioral model,” in Proceedings of the 14th annual conference on Computer graphics and interactive techniques, 1987, pp. 25–34.
  • [51] C. W. Reynolds et al., “Steering behaviors for autonomous characters,” in Game developers conference, vol. 1999.   Citeseer, 1999, pp. 763–782.
  • [52] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” CoRR, vol. abs/1703.01365, 2017. [Online]. Available: http://arxiv.org/abs/1703.01365
  • [53] L. Snoek, M. van der Miesen, T. Beemsterboer, A. Leij, A. Eigenhuis, and H. Scholte, “The amsterdam open mri collection, a set of multimodal mri datasets for individual difference analyses,” Scientific Data, vol. 8, 03 2021.
  • [54] A. Schaefer, R. Kong, E. M. Gordon, T. O. Laumann, X.-N. Zuo, A. J. Holmes, S. B. Eickhoff, and B. T. T. Yeo, “Local-Global Parcellation of the Human Cerebral Cortex from Intrinsic Functional Connectivity MRI,” Cerebral Cortex, vol. 28, no. 9, pp. 3095–3114, 07 2017. [Online]. Available: https://doi.org/10.1093/cercor/bhx179
  • [55] A. Luppi, P. Mediano, F. Rosas, N. Holland, T. Fryer, J. O’Brien, J. Rowe, D. Menon, D. Bor, and E. Stamatakis, “A synergistic core for human brain evolution and cognition,” Nature Neuroscience, vol. 25, pp. 1–12, 06 2022.
  • [56] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real nvp,” International Conference on Learning Representations,International Conference on Learning Representations, May 2016.
  • [57] C. Lu and J. Peltonen, “Enhancing nearest neighbor based entropy estimator for high dimensional distributions via bootstrapping local ellipsoid,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5013–5020.
  • [58] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [59] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” Cornell University - arXiv,Cornell University - arXiv, Mar 2017.
  • [60] O. Esteban, C. Markiewicz, R. Blair, C. Moodie, A. Isik, A. Erramuzpe, J. Kent, M. Goncalves, E. DuPre, M. Snyder, H. Oya, S. Ghosh, J. Wright, J. Durnez, R. Poldrack, and K. Gorgolewski, “fmriprep: a robust preprocessing pipeline for functional mri,” Nature Methods, vol. 16, 01 2019.
  • [61] O. Esteban, R. Ciric, K. Finc, R. Blair, C. Markiewicz, C. Moodie, J. Kent, M. Goncalves, E. DuPre, D. Gomez, Z. Ye, T. Salo, R. Valabregue, I. Amlien, F. Liem, N. Jacoby, H. Stojić, M. Cieslak, S. Urchs, and K. Gorgolewski, “Analysis of task-based functional mri data preprocessed with fmriprep,” Nature Protocols, vol. 15, 06 2020.
  • [62] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [63] A. Abraham, F. Pedregosa, M. Eickenberg, P. Gervais, A. Mueller, J. Kossaifi, A. Gramfort, B. Thirion, and G. Varoquaux, “Machine learning for neuroimaging with scikit-learn,” Frontiers in neuroinformatics, vol. 8, p. 14, 02 2014.
  • [64] S. Fang and Q. Zhu, “Independent gaussian distributions minimize the kullback-leibler (kl) divergence from independent gaussian distributions.” Cornell University - arXiv, Nov 2020.
  • [65] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural networks,” arXiv: Machine Learning, May 2015.
  • [66] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
  • [67] S. Haykin, Neural Networks: A Comprehensive Foundation, Volume 2.   Prentice Hall, 1998.
  • [68] T. Teshima, I. Ishikawa, K. Tojo, K. Oono, M. Ikeda, and M. Sugiyama, “Coupling-based invertible neural networks are universal diffeomorphism approximators,” in Advances in Neural Information Processing Systems, vol. 33, 2020, p. 3362–3373.
  • [69] T. Teshima, K. Tojo, M. Ikeda, I. Ishikawa, and K. Oono, “Universal approximation property of neural ordinary differential equations,” arXiv preprint arXiv:2012.02414, 2017.