
Backpropagation on Dynamical Networks
Supplementary Material

Eugene Tan, Débora Corrêa, Thomas Stemler, Michael Small. E. Tan, T. Stemler and M. Small are with the Department of Mathematics & Statistics, The University of Western Australia, Crawley, WA 6009, Australia.
E-mail: [email protected] D. Corrêa is with the Department of Computer Science & Software Engineering, The University of Western Australia, Crawley, WA 6009. Manuscript received INSERT DATE; revised INSERT DATE.

Appendix A Chaotic Oscillator Equations

Lorenz

\begin{align*}
\dot{x}_{i} &= \gamma(y_{i}-x_{i}) + \sum_{i\neq j}c_{ij}(x_{j}-x_{i}), \\
\dot{y}_{i} &= x_{i}(\rho-z_{i}) - y_{i}, \\
\dot{z}_{i} &= x_{i}y_{i} - \beta z_{i},
\end{align*}

where $(\gamma,\beta,\rho)=(10,8/3,28)$.
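For concreteness, a minimal NumPy sketch of the coupled Lorenz right-hand side and an explicit Euler forward map consistent with Eq. (2) in Appendix B is given below. The vectorised $(N,3)$ state layout and the function names are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def lorenz_network_rhs(X, C, gamma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the coupled Lorenz network.

    X : (N, 3) array of node states (x_i, y_i, z_i); C : (N, N) coupling matrix c_ij.
    Diffusive coupling sum_{j != i} c_ij (x_j - x_i) acts on the x component only.
    """
    x, y, z = X[:, 0], X[:, 1], X[:, 2]
    coupling = C @ x - C.sum(axis=1) * x   # sum_j c_ij x_j - x_i sum_j c_ij
    dx = gamma * (y - x) + coupling
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return np.stack([dx, dy, dz], axis=1)

def euler_step(X, C, dt):
    """One explicit Euler step x(t_n) = x(t_{n-1}) + dt * x_dot(t_{n-1})."""
    return X + dt * lorenz_network_rhs(X, C)
```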

Chua

\begin{align*}
\dot{x}_{i} &= k(y_{i}-x_{i}+z_{i}) + \sum_{i\neq j}c_{ij}(x_{j}-x_{i}), \\
\dot{y}_{i} &= k\alpha(x_{i}-y_{i}-\phi(y_{i})), \\
\dot{z}_{i} &= k(-\beta x_{i}-\gamma z_{i}), \\
\phi(y_{i}) &= ay_{i}^{3}+by_{i},
\end{align*}

where $(k,\alpha,\beta,\gamma)=(-1,17,53.61,-0.75)$.
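A corresponding sketch for the coupled Chua nodes follows, using the same assumed $(N,3)$ layout as the Lorenz example above. The cubic coefficients $a$ and $b$ are passed in explicitly since their values are not restated in this appendix.

```python
import numpy as np

def chua_network_rhs(X, C, k, alpha, beta, gamma, a, b):
    """Right-hand side of the coupled Chua network with cubic nonlinearity.

    X : (N, 3) array of node states (x_i, y_i, z_i); C : (N, N) coupling matrix.
    phi(y) = a*y**3 + b*y; a and b must be supplied by the user.
    """
    x, y, z = X[:, 0], X[:, 1], X[:, 2]
    phi = a * y**3 + b * y
    coupling = C @ x - C.sum(axis=1) * x   # diffusive coupling on x only
    dx = k * (y - x + z) + coupling
    dy = k * alpha * (x - y - phi)
    dz = k * (-beta * x - gamma * z)
    return np.stack([dx, dy, dz], axis=1)
```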

Appendix B Recursive Partial Derivatives

The recursive relationship of the required partial derivatives for the backpropagation algorithm is given by:

\begin{align*}
\left(\frac{\partial\mathbf{x}_{i}^{(1)}}{\partial c_{jk}}\right)_{t_{n}} &= \left[\frac{\partial\hat{F}(\mathbf{x}_{i}^{(1)})}{\partial c_{jk}} + \delta t\sum_{h\neq i}\left(c_{ih}\frac{\partial g(\mathbf{x}_{i}^{(1)},\mathbf{x}_{h}^{(1)})}{\partial c_{jk}} + \frac{\partial c_{ih}}{\partial c_{jk}}\,g(\mathbf{x}_{i}^{(1)},\mathbf{x}_{h}^{(1)})\right)\right]_{t_{n-1}}, \\
\left(\frac{\partial g(\mathbf{x}_{i}^{(1)},\mathbf{x}_{h}^{(1)})}{\partial c_{jk}}\right)_{t_{n-1}} &= \left[\frac{\partial g(\mathbf{x}_{i}^{(1)},\mathbf{x}_{h}^{(1)})}{\partial\mathbf{x}_{i}^{(1)}}\frac{\partial\mathbf{x}_{i}^{(1)}}{\partial c_{jk}} + \frac{\partial g(\mathbf{x}_{i}^{(1)},\mathbf{x}_{h}^{(1)})}{\partial\mathbf{x}_{h}^{(1)}}\frac{\partial\mathbf{x}_{h}^{(1)}}{\partial c_{jk}}\right]_{t_{n-1}}, \\
\left(\frac{\partial\hat{F}(\mathbf{x}_{i}^{(1)})}{\partial c_{jk}}\right)_{t_{n-1}} &= \left[\frac{\partial\mathbf{x}_{i}^{(1)}}{\partial c_{jk}} + \delta t\sum_{d}\frac{\partial f^{(1)}}{\partial\mathbf{x}_{i}^{(d)}}\frac{\partial\mathbf{x}_{i}^{(d)}}{\partial c_{jk}}\right]_{t_{n-1}}, \\
\left(\frac{\partial\mathbf{x}_{i}^{(2)}}{\partial c_{jk}}\right)_{t_{n-1}} &= \left[\frac{\partial\mathbf{x}_{i}^{(2)}}{\partial c_{jk}} + \delta t\sum_{d}\frac{\partial f^{(2)}}{\partial\mathbf{x}_{i}^{(d)}}\frac{\partial\mathbf{x}_{i}^{(d)}}{\partial c_{jk}}\right]_{t_{n-2}}, \\
\left(\frac{\partial\mathbf{x}_{i}^{(3)}}{\partial c_{jk}}\right)_{t_{n-1}} &= \left[\frac{\partial\mathbf{x}_{i}^{(3)}}{\partial c_{jk}} + \delta t\sum_{d}\frac{\partial f^{(3)}}{\partial\mathbf{x}_{i}^{(d)}}\frac{\partial\mathbf{x}_{i}^{(d)}}{\partial c_{jk}}\right]_{t_{n-2}},
\end{align*}

where $\mathbf{x}_{i}^{(d)}$ corresponds to the $d$th component of the state of node $i$. Similarly, $f^{(d)}$ is the $d$th component of the local dynamics function and $\hat{F}$ corresponds to the local dynamics contribution of the forward evolution,

$\hat{F}(\mathbf{x}_{i}(t_{n}))=\mathbf{x}_{i}(t_{n})+\delta t\,\hat{\dot{\mathbf{x}}}_{i}(t_{n}).$ (2)
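These relations amount to a forward accumulation of sensitivities alongside the Euler forward map. The sketch below is a NumPy illustration only, not the authors' implementation: the helper names `f_jac`, `g` and `g_jac`, and the $(T, N, D)$ trajectory layout, are assumptions. It propagates $\partial\mathbf{x}_{i}/\partial c_{jk}$ for a single coupling entry $c_{jk}$ along a freerun trajectory.

```python
import numpy as np

def coupling_sensitivity(traj, C, j, k, f_jac, g, g_jac, dt):
    """Forward-accumulate S[n, i, d] = d x_i^(d) / d c_jk along a freerun trajectory.

    traj  : node states of shape (T, N, D) produced by the Euler forward map
    C     : (N, N) estimate of the coupling matrix
    f_jac(x)       : Jacobian of the local model at one node state, shape (D, D)
    g(xi, xh)      : coupling function (scalar, acting on the first component)
    g_jac(xi, xh)  : its derivatives w.r.t. xi and xh, each of shape (D,)
    """
    T, N, D = traj.shape
    S = np.zeros((T, N, D))                      # sensitivities are zero at t_0
    for n in range(1, T):
        Xp, Sp = traj[n - 1], S[n - 1]
        for i in range(N):
            # local contribution: d/dc_jk of F_hat(x_i) = x_i + dt * f(x_i)
            dF = Sp[i] + dt * f_jac(Xp[i]) @ Sp[i]
            # coupling contribution (first component only)
            dcpl = 0.0
            for h in range(N):
                if h == i:
                    continue
                dg_dxi, dg_dxh = g_jac(Xp[i], Xp[h])
                dcpl += C[i, h] * (dg_dxi @ Sp[i] + dg_dxh @ Sp[h])
                if i == j and h == k:            # dc_ih/dc_jk = 1 only for entry (j, k)
                    dcpl += g(Xp[i], Xp[h])
            dF[0] += dt * dcpl
            S[n, i] = dF
    # In the full algorithm these sensitivities feed the chain rule that
    # backpropagates the freerun prediction error to each coupling weight.
    return S
```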

Appendix C Other Tested Networks

C.1 FitzHugh-Nagumo Neuron Network

An additional system consisting of a network of FitzHugh-Nagumo neuron oscillators [hong2011synchronization] was used to test the capabilities of the backpropagation regression method.

The neuron network presents an additional challenge when constructing a data-driven model due to the presence of disparate time scales in the dynamics (fast spiking depolarisation and slow repolarisation). It also demonstrates the performance of the backpropagation method in a more realistic context, i.e. the inference of neuronal networks. We focus on the FitzHugh-Nagumo system operating under a chaotic regime as given by Hong [hong2011synchronization] with equations,

\begin{align*}
\dot{V} &= V(V-1)(1-b_{1}V) - w + \frac{\alpha I}{\omega}, \\
\dot{w} &= b_{2}V, \\
\ddot{I} &= -\omega^{2}I,
\end{align*}

with constant parameters $(\alpha,b_{1},b_{2},\omega)=(0.1,10,1,0.8105)$. Diffusive coupling was applied on $V$ with coupling weights normally distributed with $(\mu,\sigma^{2})=(0.15,0.02^{2})$, coupling probability $\log(N)/N$, and integration timestep $dt=0.02$.
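A minimal NumPy sketch of the coupled node dynamics is given below. Rewriting $\ddot{I}=-\omega^{2}I$ as two first-order equations and the $(V, w, I, \dot{I})$ state layout are illustrative choices; the coupling matrix would be drawn with the weights and connection probability described above.

```python
import numpy as np

def fhn_network_rhs(X, C, alpha=0.1, b1=10.0, b2=1.0, omega=0.8105):
    """Coupled chaotic FitzHugh-Nagumo network (Hong 2011 form).

    X : (N, 4) array holding (V, w, I, dI/dt) per node; the second-order forcing
    equation I'' = -omega**2 * I is rewritten as two first-order equations.
    C : (N, N) coupling matrix; diffusive coupling acts on V.
    """
    V, w, I, Idot = X[:, 0], X[:, 1], X[:, 2], X[:, 3]
    coupling = C @ V - C.sum(axis=1) * V
    dV = V * (V - 1.0) * (1.0 - b1 * V) - w + alpha * I / omega + coupling
    dw = b2 * V
    dI = Idot
    dIdot = -omega**2 * I
    return np.stack([dV, dw, dI, dIdot], axis=1)
```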

Figure 1: Regression results for the FitzHugh-Nagumo neuron network with normally distributed coupling weights $(\mu,\sigma^{2})=(0.15,0.02)$ and connection probability $p$ over 80 iterations.
Figure 2: Regression of weights (top) at different numbers of iterations compared to the normalised error in the weights (bottom). True weights (right) are given for comparison, showing good agreement with the regressed results.

C.2 Heterogeneous Networks

The formulation of the backpropagation regression algorithm assumes that the local dynamics $f$ are identical for all nodes. Whilst this is a useful property, such a strong assumption is unlikely to hold in real systems. In many cases, nodes in a dynamical network may exhibit slight differences in their local dynamics. To test the effect of network heterogeneity on regression performance, a 16-node Chua oscillator network was simulated with slightly differing bifurcation parameters for each node. We use the Chua system for this investigation as it shows similar chaotic dynamics over a wide parameter range (see Figure 3).

The Chua system contains multiple coexisting attractors for particular values of the bifurcation parameter $\alpha$ [kengne2017dynamics]. When operating in the single scroll regime ($\alpha\in[17,17.3]$), the isolated Chua system exhibits two separate chaotic attractors corresponding to the two scrolls. These two scrolls eventually merge into the characteristic double scroll Chua attractor for larger values of $\alpha$ (see Figure 3).

Figure 3: Bifurcation diagram of the Chua chaotic system with cubic nonlinearity. Two separate attractor scrolls (red and blue) with initial conditions $(\pm 0.5,0,0)$. Chaotic regime with separated scrolls for $\alpha\in[17,17.3]$. Chaotic double scroll regime for $\alpha>19.05$.

To simulate a heterogeneous network, the $\alpha$ parameter for each node in the network was randomly perturbed by an additional amount $\epsilon_{\alpha}\sim U(0,\xi_{\alpha})$, where $\xi_{\alpha}\in[0,0.3]$. The backpropagation regression algorithm was tested on 7 different configurations of increasing $\xi_{\alpha}$. Each configuration was tested on 8 randomly initialised 16-node Chua dynamical networks, with 40 regression iterations in each case.
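A minimal sketch of how such heterogeneous configurations can be generated is shown below. The even spacing of the seven $\xi_{\alpha}$ levels and the nominal $\alpha=17$ (the single scroll regime) are illustrative choices, not taken from the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def heterogeneous_alphas(alpha_base, xi_alpha, n_nodes=16):
    """Per-node bifurcation parameters alpha_i = alpha_base + eps_i, eps_i ~ U(0, xi_alpha)."""
    return alpha_base + rng.uniform(0.0, xi_alpha, size=n_nodes)

# Seven heterogeneity levels spanning xi_alpha in [0, 0.3] (even spacing assumed here).
configurations = [heterogeneous_alphas(17.0, xi) for xi in np.linspace(0.0, 0.3, 7)]
```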

The weight error performance was found to be robust to increasing levels of heterogeneity in the network dynamics (see Figure 4). The effect of weight filtration was also found to be unchanged with increasing heterogeneity. However, increasing heterogeneity $\xi_{\alpha}$ resulted in a gradual decrease in the mutual information of local model predictions (see Figure 5). The backpropagation algorithm assumes that all nodes have identical local dynamics, so heterogeneity in node dynamics introduces uncertainty in the true model parameters when regressing the local model in the training and refitting stages. Mutual information was also tested against the control case where the model is exactly known but evaluated with perturbed initial conditions $(\xi_{0}=0.005)$.

Figure 4: Regressed node weight errors after 40 refit iterations for configurations with increasing network heterogeneity $(\xi_{\alpha})$. Weight errors are given before (blue) and after (red) truncating weights with a magnitude less than 0.004 in order to remove spurious weights.
Figure 5: Mutual information of the local models for configurations with increasing network heterogeneity $(\xi_{\alpha})$. Model scores are compared at the beginning (blue) and end (red) of the 40 refit iterations. Mutual information is compared against the control case where the exact model is known but evaluated with perturbed initial conditions.

Appendix D Backpropagation Algorithm Hyperparameters

A list of hyperparameters for the algorithm is provided in Table I. The selection of hyperparameter values requires experimentation; the values here were selected based on those typically used for BPTT training of RNNs. As a general guide, $K_{init}$ affects the degree of averaging in the mean field approach when estimating the vector field of the local dynamics $\hat{f}$ during initialisation. The parameters $N_{epochs}$ and $N_{refit}$ directly control the amount of time spent in the backpropagation and retraining stages. The learning rates $\eta$ and $\eta^{\prime}$ are used for training the feedforward neural network local dynamics model. The selection of these values follows the same heuristics used for machine learning function approximation, with the additional criterion that $\eta^{\prime}<\eta$ to ensure that the local dynamics model does not change too much in each refit iteration. The parameters $\bar{\alpha}$, $\beta$, $d_{eff}$ and $r$ are defined similarly to those normally used in the regression and learning rate scheduler of regular BPTT training of RNNs. The freerun prediction length $t_{in}$ sets the length of the trajectory used to calculate the errors for backpropagation. Larger values of $t_{in}$ allow errors to accumulate over time and prioritise the adjustment of weights that have a larger impact on prediction performance, resulting in better convergence and stability at the expense of computational speed. However, $t_{in}$ should be selected to be smaller than the natural Lyapunov time scale of the system to prevent instability.

Hyper-parameter   Value   Description
$K_{init}$   8   Number of neighbours in the mean field approach for initial model training
$N_{epochs}$   30   Number of training epochs used in each neural network model training run
$N_{refit}$   40   Number of backpropagation-decoupling-refit alternations to run
$\eta$   0.001   Model training learning rate
$\eta^{\prime}$   0.0002   Model refit learning rate
$\bar{\alpha}$   0.0005   Average learning rate for each coupling weight in $\hat{C}$
$\beta$   0.9   Learning rate momentum parameter
$t_{in}$   10   Length of freerun predictions used to calculate and backpropagate error
$d_{eff}$   0.98   Effective learning rate decay after each decay-reset cycle in the scheduler
$r$   2.0   Factor by which the decayed learning rate is multiplied at the end of each decay-reset cycle
TABLE I: List of hyperparameters
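As an illustration only, one possible reading of the decay-reset scheduler implied by $d_{eff}$ and $r$ (the exact schedule is not specified in this supplement) is that the learning rate decays geometrically within a cycle and is multiplied by $r$ at the cycle boundary, so that one complete cycle changes it by the factor $d_{eff}$:

```python
def decay_reset_schedule(lr0, n_cycles, steps_per_cycle, d_eff=0.98, r=2.0):
    """Illustrative decay-reset learning rate schedule parameterised by d_eff and r."""
    per_step = (d_eff / r) ** (1.0 / steps_per_cycle)   # geometric decay within a cycle
    lr, schedule = lr0, []
    for _ in range(n_cycles):
        for _ in range(steps_per_cycle):
            schedule.append(lr)
            lr *= per_step
        lr *= r   # upward reset at the cycle end; net change per cycle is d_eff
    return schedule
```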

Notable Hyperparameters

The backpropagation algorithm requires the selection of various hyperparameters that govern the regression behaviour.

  • Momentum ($M$) - This hyperparameter introduces a notion of momentum into the gradient update by allowing previously calculated update steps to propagate into future steps with a decaying effect. This technique is also commonly used in RNN backpropagation to improve convergence (a sketch of this update, together with Eq. (4), is given after this list),

    $d\hat{C}(n+1)=M\,d\hat{C}(n-1)+(1-M)\,d\hat{C}(n).$ (3)
  • Model Learning Rate ($\eta$) - The step size of the feedforward network during the initial construction of the model and during retraining.

  • Node Learning Rate ($\bar{\alpha}$) - The average learning rate that would be applied to each coupling weight if all real link weights were equal in the calculated gradient and existed with probability $p$ (see also the sketch after this list). Its relationship to the real learning rate is given by,

    $\alpha_{LR}=\sqrt{p\cdot\frac{N(N-1)}{2}\cdot\bar{\alpha}}.$ (4)
  • Input History Length ($t_{in}$) - The number of steps over which to unfold the backpropagation regression algorithm. A longer history results in the accumulation of coupling weight effects over a longer period and provides better convergence at the cost of slower computation.
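Read directly, Eqs. (3) and (4) correspond to the following sketch (NumPy assumed; the function names and the `dC_prev`/`dC_curr` arguments are illustrative):

```python
import numpy as np

def momentum_update(dC_prev, dC_curr, M=0.9):
    """Momentum-smoothed update of the coupling matrix estimate, as in Eq. (3)."""
    return M * dC_prev + (1.0 - M) * dC_curr

def real_learning_rate(alpha_bar, N, p):
    """Convert the average node learning rate alpha_bar to alpha_LR, as in Eq. (4)."""
    return np.sqrt(p * N * (N - 1) / 2.0 * alpha_bar)
```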