
The inverse Kalman filter

Xinyi Fang and Mengyang Gu*

*The authors contributed equally. Correspondence should be addressed to Mengyang Gu ([email protected]).
Abstract

We introduce the inverse Kalman filter (IKF), which enables exact matrix-vector multiplication between a covariance matrix from a dynamic linear model and any real-valued vector with linear computational cost. We integrate the IKF with the conjugate gradient algorithm, which substantially accelerates the computation of matrix inversion for a general form of covariance matrix, where other approximation approaches may not be directly applicable. We demonstrate the scalability and efficiency of the IKF approach through applications in nonparametric estimation of particle interaction functions, using both simulations and real cell trajectory data.

1 Introduction

Dynamic linear models (DLMs) or linear state space models are ubiquitously used in modeling temporally correlated data [37, 5]. Each observation $y_t \in \mathbb{R}$ in DLMs is associated with a $q$-dimensional latent state vector $\bm{\theta}_t$, defined as:

\[
y_t = \mathbf{F}_t \bm{\theta}_t + v_t, \quad v_t \sim \mathcal{N}(0, V_t), \tag{1}
\]
\[
\bm{\theta}_t = \mathbf{G}_t \bm{\theta}_{t-1} + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{MN}(\mathbf{0}, \mathbf{W}_t), \tag{2}
\]

where $\mathbf{F}_t$ and $\mathbf{G}_t$ are matrices of dimensions $1 \times q$ and $q \times q$, respectively, $\mathbf{W}_t$ is a $q \times q$ covariance matrix for $t = 2, \dots, N$, and the initial state vector $\bm{\theta}_1 \sim \mathcal{MN}(\mathbf{b}_1, \mathbf{W}_1)$ with $\mathbf{b}_1 = \mathbf{0}$ assumed herein. The dependence structure of DLMs is illustrated in Fig. 1(a), where each observation $y_t$ is associated with a single latent state $\bm{\theta}_t$.

DLMs are a flexible class of models that include many widely used processes, such as autoregressive and moving average processes [26, 27]. Some Gaussian processes (GPs) with commonly used kernel functions, including the Matérn covariance with a half-integer roughness parameter [12], can also be represented as DLMs [38, 13], with closed-form expressions for $\mathbf{F}_t$, $\mathbf{G}_t$, and $\mathbf{W}_t$. This connection, summarized in the Supplementary Material, enables differentiable GPs within the DLM framework, allowing efficient estimation of smooth functions from noisy observations.

The Kalman filter (KF) provides a scalable approach for estimating latent states and computing the likelihood in DLMs, scaling linearly with the number of observations [16], as reviewed in Appendix A. In particular, the KF enables efficient computation of $\mathbf{L}^{-1}\mathbf{u}$ for any $N$-dimensional vector $\mathbf{u}$ in $\mathcal{O}(q^3 N)$ operations, where $\mathbf{L}$ is the Cholesky factor of the covariance matrix $\bm{\Sigma} = \mathrm{cov}[\mathbf{y}_{1:N}] = \mathbf{L}\mathbf{L}^T$, as summarized in Lemma 6 in the Appendix.

Figure 1: The dependence structure of the dynamic linear model (panel (a)) and of particle dynamics (panel (b)).

Computational challenges remain for high-dimensional state space models and for scenarios where the KF cannot be directly applied, such as the dependence structure shown in Fig. 1(b), where each observation $y_i$ can be associated with multiple latent states. This interaction structure, introduced in Section 3, is common in physical models, such as molecular dynamics simulation [28]. One way to overcome these challenges is to efficiently compute $\bm{\Sigma}\mathbf{u}$ for any $N$-dimensional vector $\mathbf{u}$ and utilize optimization methods such as the conjugate gradient (CG) algorithm for computing the predictive distribution [15]. This strategy has been used to approximate the maximum likelihood estimator of parameters in GPs [32, 25]. Yet each CG iteration requires a matrix-vector multiplication, which involves $\mathcal{O}(N^2)$ operations and storage, making it prohibitive for large $N$.

To address this issue, we introduce the inverse Kalman filter (IKF), which computes $\mathbf{L}\mathbf{u}$ and $\mathbf{L}^T\mathbf{u}$ in $\mathcal{O}(q^3 N)$ operations without approximation, where $\mathbf{L}$ is the Cholesky factor of any $N \times N$ DLM-induced covariance $\bm{\Sigma}$ and $\mathbf{u}$ is an $N$-dimensional vector. This cost is significantly smaller than the $\mathcal{O}(N^2)$ of direct computation. The latent state dimension $q$ is at most 3 in our applications, a setting that includes many commonly used models, such as twice-differentiable GPs with the Matérn covariance in (15), often a default choice in GP surrogate models.

The IKF algorithm can be extended to accelerate matrix multiplication and inversion, a key computational bottleneck in many problems. We integrate the IKF into a CG algorithm to scalably compute the predictive distribution of observations with a general covariance structure:

\[
\bm{\Sigma}_y = \sum_{j=1}^{J} \mathbf{A}_j \bm{\Sigma}_j \mathbf{A}_j^T + \bm{\Lambda}, \tag{3}
\]

where $\bm{\Sigma}_j$ is a DLM-induced covariance matrix, $\mathbf{A}_j$ is sparse, and $\bm{\Lambda}$ is a diagonal matrix. This structure appears in nonparametric estimation of particle interactions, which will be introduced in Section 3. Both real and simulated studies involve numerous particles over a long time period, where conventional approximation methods [34, 19, 3, 7, 10] are not applicable. This covariance structure also appears in other applications, including varying coefficient models [14] and estimating incomplete lattices of correlated data [31]. We apply our approach to the latter in Section S5 and provide numerical comparisons with various approximation methods in Section S9 of the Supplementary Material.

2 Inverse Kalman filter

In this section, we introduce an exact algorithm for computing $\bm{\Sigma}\mathbf{u}$ in $\mathcal{O}(q^3 N)$ operations, where $\mathbf{u}$ is an $N$-dimensional real-valued vector and $\bm{\Sigma} = \mathrm{cov}[\mathbf{y}_{1:N}]$ is the $N \times N$ covariance matrix of observations induced by any DLM with $q$ latent states specified in Eqs. (1)-(2). Write $\bm{\Sigma} = \mathbf{L}\mathbf{L}^T$; the Cholesky factor $\mathbf{L}$ need not be explicitly computed in our approach. We provide fast algorithms to compute $\tilde{\mathbf{x}} = \mathbf{L}^T\mathbf{u}$ and $\mathbf{x} = \mathbf{L}\tilde{\mathbf{x}}$ in Lemmas 1 and 2, respectively, each with $\mathcal{O}(q^3 N)$ operations. Detailed proofs of these lemmas are available in the Supplementary Material. In the following, $u_t$, $x_t$, and $\tilde{x}_t$ denote the $t$th elements of the vectors $\mathbf{u}$, $\mathbf{x}$, and $\tilde{\mathbf{x}}$, respectively, and $L_{t',t}$ denotes the $(t',t)$th element of $\mathbf{L}$ for $t, t' = 1, 2, \dots, N$.

Lemma 1 (Compute $\tilde{\mathbf{x}} = \mathbf{L}^T\mathbf{u}$ with linear computational cost).

For any $N$-dimensional vector $\mathbf{u}$, let $\tilde{x}_N = Q_N^{1/2} u_N$, $\tilde{x}_{N-1} = Q_{N-1}^{1/2}(\ell_{N,N-1}\, u_N + u_{N-1})$, and $\mathbf{g}_{N-1} = \mathbf{F}_N \mathbf{G}_N u_N$. For $t = N-2, \dots, 1$, iteratively compute $\tilde{x}_t$ by

\[
\tilde{x}_t = Q_t^{1/2}\left(\tilde{\ell}_{t+1,t} + \ell_{t+1,t}\, u_{t+1} + u_t\right), \tag{4}
\]
\[
\mathbf{g}_t = \mathbf{g}_{t+1}\mathbf{G}_{t+1} + \mathbf{F}_{t+1}\mathbf{G}_{t+1}\, u_{t+1}, \tag{5}
\]

where $\ell_{t+1,t} = \mathbf{F}_{t+1}\mathbf{G}_{t+1}\mathbf{K}_t$, $\tilde{\ell}_{t+1,t} = \mathbf{g}_{t+1}\mathbf{G}_{t+1}\mathbf{K}_t$, and $\mathbf{K}_t$ is the Kalman gain

\[
\mathbf{K}_t = \mathbf{B}_t \mathbf{F}_t^T Q_t^{-1}, \tag{6}
\]

with $\mathbf{B}_t = \mathrm{cov}[\bm{\theta}_t \mid \mathbf{y}_{1:t-1}]$ and $Q_t = \mathrm{var}[y_t \mid \mathbf{y}_{1:t-1}]$ given by the Kalman filter in (26) and (27) in Appendix A, respectively. Then $\tilde{\mathbf{x}} = \mathbf{L}^T\mathbf{u}$.
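To make the recursion concrete, below is a minimal Python sketch of Lemma 1. It assumes that $Q_t$ and $\mathbf{K}_t$ have already been computed by one Kalman filter pass (a sketch of that pass follows Algorithm 1 below); the function name ikf_Lt_u and the array layout are our own illustrative choices, not part of the paper.

```python
import numpy as np

def ikf_Lt_u(F, G, K, Q, u):
    """Lemma 1: compute x_tilde = L^T u in O(q^3 N).
    F: (N, q) rows F_t; G: (N, q, q) matrices G_t;
    K: (N, q) Kalman gains; Q: (N,) innovation variances."""
    N = len(u)
    x_tilde = np.empty(N)
    x_tilde[N - 1] = np.sqrt(Q[N - 1]) * u[N - 1]            # x_tilde_N
    if N == 1:
        return x_tilde
    FG = F[N - 1] @ G[N - 1]                                 # F_N G_N
    x_tilde[N - 2] = np.sqrt(Q[N - 2]) * (FG @ K[N - 2] * u[N - 1] + u[N - 2])
    g = FG * u[N - 1]                                        # g_{N-1}
    for t in range(N - 3, -1, -1):                           # t = N-2, ..., 1 (1-based)
        FG = F[t + 1] @ G[t + 1]
        ell = FG @ K[t]                                      # ell_{t+1,t}
        ell_tilde = (g @ G[t + 1]) @ K[t]                    # tilde-ell_{t+1,t}
        x_tilde[t] = np.sqrt(Q[t]) * (ell_tilde + ell * u[t + 1] + u[t])  # Eq. (4)
        g = g @ G[t + 1] + FG * u[t + 1]                     # Eq. (5)
    return x_tilde
```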

Lemma 2 (Compute $\mathbf{x} = \mathbf{L}\tilde{\mathbf{x}}$ with linear computational cost).

For any $N$-dimensional vector $\tilde{\mathbf{x}}$, let $x_1 = \mathbf{F}_1\mathbf{b}_1 + Q_1^{1/2}\tilde{x}_1$ and $\tilde{\mathbf{m}}_1 = \mathbf{b}_1 + \mathbf{K}_1(x_1 - \mathbf{F}_1\mathbf{b}_1)$. For $t = 2, \dots, N$, iteratively compute $x_t$ by

\[
\mathbf{b}_t = \mathbf{G}_t \tilde{\mathbf{m}}_{t-1}, \tag{7}
\]
\[
x_t = \mathbf{F}_t \mathbf{b}_t + Q_t^{1/2}\tilde{x}_t, \tag{8}
\]
\[
\tilde{\mathbf{m}}_t = \mathbf{b}_t + \mathbf{K}_t(x_t - \mathbf{F}_t\mathbf{b}_t), \tag{9}
\]

where $Q_t$ and $\mathbf{K}_t$ are from (27) and (6), respectively. Then $\mathbf{x} = \mathbf{L}\tilde{\mathbf{x}}$.
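A corresponding sketch of Lemma 2, under the same assumed precomputed quantities and with $\mathbf{b}_1 = \mathbf{0}$ as assumed herein:

```python
def ikf_L_xtilde(F, G, K, Q, x_tilde):
    """Lemma 2: compute x = L x_tilde in O(q^3 N), with b_1 = 0."""
    N = len(x_tilde)
    q = F.shape[1]
    b = np.zeros(q)                                   # b_1 = 0 assumed herein
    x = np.empty(N)
    x[0] = F[0] @ b + np.sqrt(Q[0]) * x_tilde[0]
    m = b + K[0] * (x[0] - F[0] @ b)                  # m_tilde_1
    for t in range(1, N):
        b = G[t] @ m                                  # Eq. (7)
        x[t] = F[t] @ b + np.sqrt(Q[t]) * x_tilde[t]  # Eq. (8)
        m = b + K[t] * (x[t] - F[t] @ b)              # Eq. (9)
    return x
```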

The algorithm in Lemma 1 is unconventional and nontrivial to derive. In contrast, Lemma 2 has a direct connection to the Kalman filter. Specifically, the one-step-ahead predictive distribution in step (ii) of the Kalman filter, given in Appendix A, enables computing $\mathbf{L}^{-1}\mathbf{x}$ for any vector $\mathbf{x}$ (Lemma 6). Lemma 2 essentially reverses this operation, computing $\mathbf{L}\tilde{\mathbf{x}}$ for any vector $\tilde{\mathbf{x}}$, which motivates the name inverse Kalman filter (IKF).

The IKF algorithm, outlined in Algorithm 1, sequentially applies Lemmas 1 and 2 to reduce the computational cost and storage for $\bm{\Sigma}\mathbf{u}$ from $\mathcal{O}(N^2)$ to $\mathcal{O}(q^3 N)$ without approximation. This approach applies to all DLMs, including GPs that admit equivalent DLM representations, such as GPs with Matérn kernels, where roughness parameters $0.5$ and $2.5$ correspond to $q = 1$ and $q = 3$, respectively. Details are provided in the Supplementary Material.

Algorithm 1 Inverse Kalman filter (IKF) for computing $\bm{\Sigma}\mathbf{u}$.
Input: $\mathbf{F}_t$, $\mathbf{G}_t$, $\mathbf{W}_t$ for $t = 1, \dots, N$, and an $N$-dimensional vector $\mathbf{u}$.
1: Use the Kalman filter from Lemma 5 to compute $\mathbf{B}_t$, $Q_t$, and $\mathbf{K}_t$ for $t = 1, \dots, N$.
2: Use Lemma 1 to compute $\tilde{\mathbf{x}} = \mathbf{L}^T\mathbf{u}$.
3: Use Lemma 2 to compute $\mathbf{x} = \mathbf{L}\tilde{\mathbf{x}}$.
Output: $\mathbf{x}$.
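A minimal end-to-end sketch of Algorithm 1 follows, combining a single Kalman filter pass (only the variance recursions of Lemma 5 are needed here, since $\mathbf{B}_t$, $Q_t$, and $\mathbf{K}_t$ do not depend on the data) with the two sketches above. The names kalman_gains and ikf_sigma_u are our own illustrative choices.

```python
def kalman_gains(F, G, W, V):
    """Variance recursions of Lemma 5: return Q_t and K_t for t = 1, ..., N,
    with b_1 = 0 and B_1 = W_1. V holds the noise variances V_t."""
    N, q = F.shape
    Q, K = np.empty(N), np.empty((N, q))
    C = np.zeros((q, q))
    for t in range(N):
        B = W[0] if t == 0 else G[t] @ C @ G[t].T + W[t]   # Eq. (26) variance
        Q[t] = F[t] @ B @ F[t] + V[t]                      # Eq. (27) variance
        K[t] = B @ F[t] / Q[t]                             # Kalman gain, Eq. (6)
        C = B - np.outer(K[t], F[t]) @ B                   # Eq. (28) variance
    return Q, K

def ikf_sigma_u(F, G, W, V, u):
    """Algorithm 1: Sigma @ u = L (L^T u) via Lemmas 1 and 2."""
    Q, K = kalman_gains(F, G, W, V)
    return ikf_L_xtilde(F, G, K, Q, ikf_Lt_u(F, G, K, Q, u))
```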
Figure 2: Panels (a) and (b) compare the IKF (blue dots) with direct computation (red triangles) for computing $\bm{\Sigma}\mathbf{u}$ on the original time scale and on a logarithmic scale (base 10), respectively. Panel (c) compares the predictive error of the non-robust IKF (light blue squares) and the robust IKF (blue dots).

While Algorithm 1 can be applied to a DLM regardless of the noise variance, we do not recommend directly computing $\bm{\Sigma}\mathbf{u}$ when the noise variance $V_t$ in Eq. (1) is zero or close to zero. When $V_t = 0$, $Q_t$ in Eq. (27), the variance of the one-step-ahead predictive distribution, can be extremely small, leading to large numerical errors in Eq. (28) due to the unstable inversion of $Q_t$. To ensure robustness, we suggest computing $\mathbf{x} = (\bm{\Sigma} + V\mathbf{I}_N)\mathbf{u}$ with a positive scalar $V$ and outputting $\mathbf{x} - V\mathbf{u}$ as the result of $\bm{\Sigma}\mathbf{u}$. Here $\bm{\Sigma} + V\mathbf{I}_N$ can be interpreted as the covariance matrix of a DLM with noise variance $V_t = V$. Essentially, we introduce artificial noise so that the $Q_t$ computed in the KF is at least $V$, ensuring numerical stability. The result remains exact and independent of the choice of $V$.
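The robust variant amounts to a short wrapper around the sketch above; in exact arithmetic the output is identical for any nugget $V > 0$:

```python
def robust_ikf_sigma_u(F, G, W, V, u, nugget=0.1):
    """Compute Sigma @ u as (Sigma + V I_N) u - V u, with artificial
    noise variance `nugget` added so Q_t >= nugget in the KF pass."""
    return ikf_sigma_u(F, G, W, V + nugget, u) - nugget * u
```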

We compare the IKF approach with direct matrix-vector multiplication of $\bm{\Sigma}\mathbf{u}$ in noise-free scenarios in Fig. 2. The experiments use $z_t = z(d_t)$ with inputs $d$ uniformly sampled from $[0,1]$ and the Matérn covariance with unit variance, roughness parameter $2.5$, and range parameter $\gamma = 0.1$ [12]. Panels (a) and (b) show that the IKF substantially reduces computational cost compared with direct computation, with computation time growing linearly in the number of observations. The IKF achieves the covariance matrix-vector multiplication for a $10^6$-dimensional output vector in about 2 seconds on a desktop, making it well suited for iterative algorithms such as the CG algorithm [15] and randomized log-determinant approximation [29]. Fig. 2(c) compares the maximum absolute error of the robust IKF with $V = 0.1$ and the non-robust IKF with $V = 0$. The non-robust IKF exhibits large numerical errors due to instability when $Q_t$ approaches zero, whereas the robust IKF remains stable, even for near-singular covariance matrices $\bm{\Sigma}$, by ensuring that $Q_t$ is no smaller than $V$.

Lemmas 1 and 2 also yield the explicit form of each element of $\mathbf{L}$ in terms of the DLM parameters and enable the linear-cost computation of $(\mathbf{L}^T)^{-1}\tilde{\mathbf{x}}$. These results are summarized in Lemmas 3 and 4, respectively, with proofs provided in the Supplementary Material.

Lemma 3 (Cholesky factor from the inverse Kalman filter).

Each entry $(t', t)$ of the lower triangular matrix $\mathbf{L}$ with $t' \geq t$ has the form

\[
L_{t',t} = Q_t^{1/2}\,\ell_{t',t}, \tag{10}
\]

where $\ell_{t',t'} = 1$, and for $t' > t$, $\ell_{t',t}$ is defined as

\[
\ell_{t',t} = \mathbf{F}_{t'}\left(\prod_{l=t+1}^{t'}\mathbf{G}_l\right)\mathbf{K}_t, \tag{11}
\]

with $\prod_{l=t+1}^{t'}\mathbf{G}_l = \mathbf{G}_{t'}\mathbf{G}_{t'-1}\cdots\mathbf{G}_{t+1}$, and $Q_t$ and $\mathbf{K}_t$ defined in (27) and (6), respectively.
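Lemma 3 is useful mainly for verification, since forming $\mathbf{L}$ explicitly would defeat the purpose of the IKF. A sketch for a single entry (0-based indices, our own convention):

```python
def chol_entry(F, G, K, Q, tp, t):
    """Lemma 3: L_{t',t} for tp >= t (0-based), via Eqs. (10)-(11)."""
    if tp == t:
        return np.sqrt(Q[t])                 # ell_{t,t} = 1
    v = K[t]
    for l in range(t + 1, tp + 1):           # apply G_{t+1}, ..., G_{t'} in order
        v = G[l] @ v
    return np.sqrt(Q[t]) * (F[tp] @ v)       # Q_t^{1/2} F_{t'} (prod G) K_t
```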

Lemma 4 (Compute $\mathbf{u} = (\mathbf{L}^T)^{-1}\tilde{\mathbf{x}}$ with linear computational cost).

For any $\tilde{\mathbf{x}}$, let $u_N = Q_N^{-1/2}\tilde{x}_N$ and $u_{N-1} = Q_{N-1}^{-1/2}\tilde{x}_{N-1} - \ell_{N,N-1}\, u_N$. For $t = N-2, \dots, 1$, iteratively compute $u_t$ by

\[
u_t = Q_t^{-1/2}\tilde{x}_t - \tilde{\ell}_{t+1,t} - \ell_{t+1,t}\, u_{t+1}, \tag{12}
\]

with $\tilde{\ell}_{t+1,t}$ and $\ell_{t+1,t}$ defined in Lemma 1. Then $\mathbf{u} = (\mathbf{L}^T)^{-1}\tilde{\mathbf{x}}$.
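Lemma 4 mirrors the recursion of Lemma 1, with $\mathbf{g}_t$ now accumulated from the already-computed entries of $\mathbf{u}$. A sketch under the same assumed inputs:

```python
def ikf_Lt_inv_xtilde(F, G, K, Q, x_tilde):
    """Lemma 4: compute u = (L^T)^{-1} x_tilde in O(q^3 N)."""
    N = len(x_tilde)
    u = np.empty(N)
    u[N - 1] = x_tilde[N - 1] / np.sqrt(Q[N - 1])
    if N == 1:
        return u
    FG = F[N - 1] @ G[N - 1]
    u[N - 2] = x_tilde[N - 2] / np.sqrt(Q[N - 2]) - FG @ K[N - 2] * u[N - 1]
    g = FG * u[N - 1]                                     # g_{N-1}
    for t in range(N - 3, -1, -1):
        FG = F[t + 1] @ G[t + 1]
        ell_tilde = (g @ G[t + 1]) @ K[t]                 # tilde-ell_{t+1,t}
        u[t] = x_tilde[t] / np.sqrt(Q[t]) - ell_tilde - FG @ K[t] * u[t + 1]  # Eq. (12)
        g = g @ G[t + 1] + FG * u[t + 1]                  # Eq. (5), with computed u
    return u
```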

3 Nonparametric estimation of particle interaction functions

The IKF algorithm is motivated by the problem of estimating interactions between particles, which is crucial for understanding complex behaviors of molecules and active matter, such as migrating cells and flocking birds. Minimal physical models, such as the Vicsek model [35] and its extensions [1, 9], provide a framework for explaining the collective motion of active matter.

Consider a physical model encompassing multiple types of latent interactions. Let $\mathbf{y}_i$ be a $D_y$-dimensional output vector for the $i$th particle, which is influenced by $J$ distinct interaction types. For the $j$th type of interaction, the $i$th particle interacts with a subset of $p_{i,j}$ particles rather than all particles, typically those within a radius $r_j$. This relationship is expressed as a latent factor model:

\[
\mathbf{y}_i = \sum_{j=1}^{J}\sum_{k=1}^{p_{i,j}} \mathbf{a}_{i,j,k}\, z_j(d_{i,j,k}) + \bm{\epsilon}_i, \tag{13}
\]

where $\bm{\epsilon}_i \sim \mathcal{MN}(\mathbf{0}, \sigma_0^2 \mathbf{I}_{D_y})$ denotes a Gaussian noise vector. The term $\mathbf{a}_{i,j,k}$ is a $D_y$-dimensional known factor loading that links the $i$th output to the $k$th neighbor in the $j$th interaction, with the unknown interaction function $z_j(\cdot)$ evaluated at a scalar-valued input $d_{i,j,k}$, such as the distance between particles $i$ and $k$, for $k = 1, \dots, p_{i,j}$, $i = 1, \dots, n$, and $j = 1, \dots, J$. For a dataset of $n_p$ particles over $n_\tau$ time points, the total number of observations is $\tilde{N} = n D_y = n_p n_\tau D_y$.

An illustrative example of this framework is the Vicsek model, a seminal approach for studying collective motion [35] (see Fig. S1(a) in the Supplementary Material). In this model, the 2-dimensional velocity of the $i'$th particle at time $\tau$, $\mathbf{v}_{i'}(\tau) = (v_{i',1}(\tau), v_{i',2}(\tau))^T$, has a constant magnitude $v = \|\mathbf{v}_{i'}(\tau)\|$, where $\|\cdot\|$ denotes the $L_2$ norm, and $v_{i',1}(\tau)$ and $v_{i',2}(\tau)$ are the velocity components along two orthogonal directions, for $i' = 1, \dots, n_p$ and $\tau = 1, \dots, n_\tau$. The velocity angle $\phi_{i'}(\tau) = \tan^{-1}(v_{i',2}(\tau)/v_{i',1}(\tau))$ is updated as

\[
\phi_{i'}(\tau) = \frac{1}{p_{i'}(\tau-1)}\sum_{k \in ne_{i'}(\tau-1)} \phi_k(\tau-1) + \epsilon_{i'}(\tau), \tag{14}
\]

where $\epsilon_{i'}(\tau)$ is zero-mean Gaussian noise with variance $\sigma_0^2$. The set of neighbors $ne_{i'}(\tau-1)$ includes the particles within a radius $r$ of particle $i'$ at time $\tau-1$, i.e., $ne_{i'}(\tau-1) = \{k : \|\mathbf{s}_{i'}(\tau-1) - \mathbf{s}_k(\tau-1)\| < r\}$, where $\mathbf{s}_{i'}(\tau-1)$ and $\mathbf{s}_k(\tau-1)$ are the 2-dimensional position vectors of particle $i'$ and its neighbor $k$, respectively, and $p_{i'}(\tau-1)$ denotes the total number of neighbors of the $i'$th particle, including itself, at time $\tau-1$.

The Vicsek model in Eq. (14) is a special case of the general framework in Eq. (13) with one-dimensional output $y_i = \phi_{i'}(\tau)$, i.e., $D_y = 1$, where the index of the $i$th observation, $i = (\tau-1)n_p + i'$, is a function of time $\tau$ and particle $i'$. The Vicsek model contains a single interaction, i.e., $J = 1$, with linear interaction function $z(d) = d$, where $d$ is the velocity angle of the neighboring particle and the factor loading is $1/p_{i'}(\tau-1)$.

Numerous other physical and mathematical models of self-propelled particle dynamics can also be formulated as in Eq. (13) with a parametric form of a particle interaction function [1]. Nonparametric estimation of particle interactions is preferred when the physical mechanism is unknown [17, 23], but the high computational cost limits its applicability to systems with large numbers of particles over long time periods.

In this work, we model each latent interaction function $z_j(\cdot)$ nonparametrically using a GP. By utilizing kernels with an equivalent DLM representation, we can leverage the proposed IKF algorithm to significantly expedite computation. Such kernels include the commonly used Matérn kernel with a half-integer roughness parameter [12], often the default setting in GP models for predicting nonlinear functions, as the smoothness of the process can be controlled by the roughness parameter [11].

For each interaction $j$, we form the input vector $\mathbf{d}_j^{(u)} = [d_{1,j}^{(u)}, \dots, d_{N_j,j}^{(u)}]^T$, where 'u' indicates that the input entries are unordered, $d_{t,j}^{(u)} = d_{i,j,k}$ with $t = \sum_{i'=1}^{i-1} p_{i',j} + k$ for any tuple $(i,j,k)$, $k = 1, \dots, p_{i,j}$, $i = 1, \dots, n$, $j = 1, \dots, J$, and $N_j = \sum_{i=1}^{n} p_{i,j}$. The marginal distribution of the $j$th factor follows $\mathbf{z}_j^{(u)} = [z_j(d_{1,j}^{(u)}), \dots, z_j(d_{N_j,j}^{(u)})]^T \sim \mathcal{MN}(\mathbf{0}, \bm{\Sigma}_j^{(u)})$, where the $(t,t')$th entry of the $N_j \times N_j$ covariance matrix $\bm{\Sigma}_j^{(u)}$ is $\mathrm{cov}[z_j(d_{t,j}^{(u)}), z_j(d_{t',j}^{(u)})] = \sigma_j^2 c_j(d_{t,j}^{(u)}, d_{t',j}^{(u)}) = \sigma_j^2 c_j(d)$, with $d = |d_{t,j}^{(u)} - d_{t',j}^{(u)}|$, and $\sigma_j^2$ and $c_j(\cdot)$ the variance parameter and the correlation function, respectively. We employ the Matérn correlation with roughness parameters $\nu_j = 0.5$ and $\nu_j = 2.5$. For $\nu_j = 0.5$, the Matérn correlation is the exponential correlation $c_j(d) = \exp(-d/\gamma_j)$, while for $\nu_j = 2.5$, the Matérn correlation follows:

\[
c_j(d) = \left(1 + \frac{5^{1/2} d}{\gamma_j} + \frac{5 d^2}{3\gamma_j^2}\right)\exp\left(-\frac{5^{1/2} d}{\gamma_j}\right), \tag{15}
\]

where $\gamma_j$ denotes the range parameter.
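As a small worked example, the Matérn correlation of Eq. (15) is a one-liner:

```python
def matern_2_5(d, gamma):
    """Matérn correlation with roughness parameter 2.5, Eq. (15)."""
    a = np.sqrt(5.0) * np.abs(d) / gamma
    return (1.0 + a + a ** 2 / 3.0) * np.exp(-a)
```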

By integrating out the latent factor processes $\mathbf{z}_j$, the marginal distribution of the observation vector $\mathbf{y} = (\mathbf{y}_1^T, \dots, \mathbf{y}_n^T)^T$, of dimension $\tilde{N} = n D_y$, is multivariate normal:

\[
\left(\mathbf{y} \mid \bm{\Sigma}_j^{(u)}, \sigma_0^2\right) \sim \mathcal{MN}\left(\mathbf{0}, \sum_{j=1}^{J}\mathbf{A}_j \bm{\Sigma}_j^{(u)}\mathbf{A}_j^T + \sigma_0^2 \mathbf{I}_{\tilde{N}}\right), \tag{16}
\]

where $\mathbf{A}_j$ is a sparse $\tilde{N} \times N_j$ block diagonal matrix whose $i$th diagonal block is the $D_y \times p_{i,j}$ matrix $\mathbf{A}_{i,j} = [\mathbf{a}_{i,j,1}, \dots, \mathbf{a}_{i,j,p_{i,j}}]$ for $i = 1, \dots, n$. The posterior predictive distribution of the latent variable $z_j(d^*)$ at any given input $d^*$ is also normal:

\[
\left(z_j(d^*) \mid \mathbf{y}, \sigma_0^2, \bm{\sigma}^2, \bm{\gamma}, \mathbf{r}\right) \sim \mathcal{N}(\hat{z}_j(d^*), c_j^*(d^*)), \tag{17}
\]

with the predictive mean and variance given by

\[
\hat{z}_j(d^*) = \bm{\Sigma}_j^{(u)}(d^*)^T \mathbf{A}_j^T \bm{\Sigma}_y^{-1}\mathbf{y}, \tag{18}
\]
\[
c_j^*(d^*) = c_j(d^*, d^*) - \bm{\Sigma}_j^{(u)}(d^*)^T \mathbf{A}_j^T \bm{\Sigma}_y^{-1}\mathbf{A}_j \bm{\Sigma}_j^{(u)}(d^*), \tag{19}
\]

where $\bm{\Sigma}_y = \sum_{j=1}^{J}\mathbf{A}_j\bm{\Sigma}_j^{(u)}\mathbf{A}_j^T + \sigma_0^2\mathbf{I}_{\tilde{N}}$, $\bm{\Sigma}_j^{(u)}(d^*) = [c_j(d_{1,j}^{(u)}, d^*), \dots, c_j(d_{N_j,j}^{(u)}, d^*)]^T$, and $\mathbf{r}$ represents additional system parameters.

The main computational challenge in calculating the predictive distribution (17) lies in inverting the covariance matrix $\bm{\Sigma}_y$, where $\tilde{N}$ and $N_j$ range between $10^5$ and $10^6$ in our applications. Though various GP approximation methods have been proposed [34, 30, 8, 2, 22, 10, 3, 18], they typically approximate the parametric covariance $\bm{\Sigma}_j^{(u)}$ rather than $\bm{\Sigma}_y$, and thus are not directly applicable for computing the distribution in Eq. (17).

To address this, we employ the CG algorithm [15] to accelerate the computation of the predictive distribution. Each CG iteration requires computing $\bm{\Sigma}_y\mathbf{u}$ for an $\tilde{N}$-dimensional vector $\mathbf{u}$. To employ the IKF, we first rearrange the input vector $\mathbf{d}_j^{(u)}$ into a non-decreasing input sequence $\mathbf{d}_j = [d_{1,j}, \dots, d_{N_j,j}]^T$. Denote the covariance $\mathrm{var}[\mathbf{z}_j] = \bm{\Sigma}_j$ with $\mathbf{z}_j = [z_j(d_{1,j}), \dots, z_j(d_{N_j,j})]^T$. The covariance of the observations can then be written as a weighted sum of the $\bm{\Sigma}_j$ by introducing a permutation matrix for each interaction: $\bm{\Sigma}_y = \sum_{j=1}^{J}\mathbf{A}_{\pi,j}\bm{\Sigma}_j\mathbf{A}_{\pi,j}^T + \sigma_0^2\mathbf{I}_{\tilde{N}}$, where $\mathbf{A}_{\pi,j} = \mathbf{A}_j\mathbf{P}_j^T$ with $\mathbf{P}_j$ a permutation matrix such that $\bm{\Sigma}_j = \mathbf{P}_j\bm{\Sigma}_j^{(u)}\mathbf{P}_j^T$. After this reordering, the computation of $\bm{\Sigma}_y\mathbf{u}$ can be broken into four steps: (1) $\mathbf{u}_j = \mathbf{A}_{\pi,j}^T\mathbf{u}$, (2) $\mathbf{x}_j = \bm{\Sigma}_j\mathbf{u}_j$, (3) $\hat{\mathbf{u}}_j = \mathbf{A}_{\pi,j}\mathbf{x}_j$, and (4) $\sum_{j=1}^{J}\hat{\mathbf{u}}_j + \sigma_0^2\mathbf{u}$. Here $\mathbf{A}_j$ is a sparse matrix with $N_j D_y$ non-zero entries. The IKF algorithm is used in step (2) to accelerate the most expensive computation, with all computations performed directly using terms from the KF algorithm, without explicitly constructing the covariance matrix.
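A matrix-free sketch of this four-step matvec, wrapped for SciPy's CG solver, is given below. Here A_list holds the (already permuted) sparse matrices $\mathbf{A}_{\pi,j}$ and sigma_matvecs[j] returns $\bm{\Sigma}_j\mathbf{u}_j$ via the IKF, e.g., the robust_ikf_sigma_u sketch of Section 2; all names are our own illustrative choices.

```python
from scipy.sparse.linalg import LinearOperator, cg

def sigma_y_matvec(A_list, sigma_matvecs, sigma0_sq, u):
    """One CG iteration's matvec Sigma_y @ u, Eq. (3)/(16),
    never forming Sigma_y. Each A in A_list is a scipy.sparse matrix."""
    out = sigma0_sq * u                    # step (4): noise contribution
    for A, sig_mv in zip(A_list, sigma_matvecs):
        u_j = A.T @ u                      # step (1): project to factor space
        x_j = sig_mv(u_j)                  # step (2): IKF matvec with Sigma_j
        out += A @ x_j                     # step (3): map back to observations
    return out

# Example use: solve Sigma_y^{-1} y for the predictive mean in Eq. (18).
# N_tilde = len(y)
# op = LinearOperator((N_tilde, N_tilde),
#                     matvec=lambda u: sigma_y_matvec(A_list, mvs, s0_sq, u))
# weights, info = cg(op, y)
```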

We refer to the resulting approach as the IKF-CG algorithm. This approach reduces the computational cost of computing the posterior distribution from $\mathcal{O}(\tilde{N}^3)$ operations to pseudolinear order in $N_j$, as shown in Table S1 in the Supplementary Material. Furthermore, the IKF-CG algorithm can accelerate parameter estimation via both cross-validation and maximum likelihood estimation, with the latter requiring an additional approximation of the log-determinant [29]. Details of the CG algorithm, the computation of the predictive distribution, parameter estimation procedures, and computational complexity are discussed in Sections S3, S4, S6, and S7 of the Supplementary Material, respectively.

4 Numerical results for particle interaction function estimation

4.1 Evaluation criteria

We conduct simulation studies in Sections 4.2-4.3 and analyze real cell trajectory data in Section 4.4 to estimate particle interaction functions in physical models. The code and data to reproduce all numerical results are available at https://github.com/UncertaintyQuantification/IKF. Simulation observations are generated at equally spaced time frames $\tau = 1, \dots, n_\tau$ with interval $h = 0.1$, though the proposed approach is applicable to both equally and unequally spaced time frames. For simplicity, the number of particles $n_p$ is held constant across all time frames in the simulations.

For each of the $J$ latent factors, predictive performance is assessed using the normalized root mean squared error ($\mathrm{NRMSE}_j$), the average length of the 95% posterior credible interval ($L_j(95\%)$), and the proportion of interaction function values covered by the 95% posterior credible interval ($P_j(95\%)$), based on $n^*$ test inputs $\mathbf{d}_j^* = (d_{1,j}^*, \dots, d_{n^*,j}^*)^T$:

\[
\mathrm{NRMSE}_j = \left(\frac{\sum_{i=1}^{n^*}(\hat{z}_j(d_{i,j}^*) - z_j(d_{i,j}^*))^2}{\sum_{i=1}^{n^*}(\bar{z}_j - z_j(d_{i,j}^*))^2}\right)^{1/2}, \tag{20}
\]
\[
L_j(95\%) = \frac{1}{n^*}\sum_{i=1}^{n^*}\mathrm{length}\{CI_{i,j}(95\%)\}, \tag{21}
\]
\[
P_j(95\%) = \frac{1}{n^*}\sum_{i=1}^{n^*} 1_{z_j(d_{i,j}^*) \in CI_{i,j}(95\%)}. \tag{22}
\]

Here, $\hat{z}_j(d_{i,j}^*)$ is the predictive mean at $d_{i,j}^*$, $\bar{z}_j$ is the average of the $j$th interaction function values, and $CI_{i,j}(95\%)$ denotes the 95% posterior credible interval of $z_j(d_{i,j}^*)$, for $j = 1, \dots, J$. A desirable method should have low $\mathrm{NRMSE}_j$, small $L_j(95\%)$, and $P_j(95\%)$ close to 95%.
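These three criteria translate directly into code; a short sketch assuming arrays of predictive means, true values, and credible interval endpoints at the $n^*$ test inputs:

```python
def evaluate(z_hat, z_true, lower, upper):
    """NRMSE, average 95% interval length, and empirical coverage,
    Eqs. (20)-(22), over the n* test inputs."""
    nrmse = np.sqrt(np.sum((z_hat - z_true) ** 2)
                    / np.sum((np.mean(z_true) - z_true) ** 2))
    avg_length = np.mean(upper - lower)
    coverage = np.mean((z_true >= lower) & (z_true <= upper))
    return nrmse, avg_length, coverage
```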

To account for variability due to initial particle positions, velocities, and noise, each simulation scenario is repeated $E = 20$ times to compute the average performance metrics. Unless otherwise specified, parameters are estimated by cross-validation, with 80% of the data used as the training set and the remaining 20% as the validation set. All computations are performed on a macOS Mojave system with an 8-core Intel i9 processor running at 3.60 GHz and 32 GB of RAM.

Figure 3: (a) Computational time of calculating predictive means using the IKF-CG algorithm (blue dots), the CG method (purple squares), and direct computation (red triangles) for the Vicsek model with different numbers of observations. (b, c) NRMSE of the predictive mean of the interaction function and the log-likelihood, comparing direct computation (red triangles) and the IKF-CG algorithm (blue dots), with differences (yellow diamonds) of order $10^{-5}$ to $10^{-4}$ and $10^{-4}$ to $10^{-3}$, respectively.

4.2 Vicsek model

We first consider the Vicsek model introduced in Section 3. For each particle $i'$ at time $\tau$, after obtaining its velocity angle $\phi_{i'}(\tau)$ from Eq. (14), its position is updated as:

\[
\mathbf{s}_{i'}(\tau) = \mathbf{s}_{i'}(\tau-1) + v\,[\cos(\phi_{i'}(\tau)), \sin(\phi_{i'}(\tau))]^T h, \quad i' = 1, \dots, n_p, \tag{23}
\]

where $v$ is the particle speed and $h$ is the time step size. Particles are initialized with velocity $[v\cos(\phi_{i'}(0)), v\sin(\phi_{i'}(0))]^T$, where $\phi_{i'}(0)$ is drawn from $\mathrm{Unif}[-\pi, \pi]$ and $v = 2^{1/2}/2 \approx 0.71$. Initial particle positions $\mathbf{s}_{i'}(0)$ are uniformly sampled from $[0, n_p^{1/2}] \times [0, n_p^{1/2}]$ to keep the particle density consistent across experiments. The goal is to estimate the interaction function nonparametrically, without assuming linearity. The interaction radius $r = 0.5$ is estimated alongside the other parameters, with results detailed in the Supplementary Material.
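A minimal simulator for this setup, following the angle update of Eq. (14) and the position update of Eq. (23), is sketched below; boundary handling is omitted, which is an assumption on our part rather than a detail stated in the paper.

```python
def simulate_vicsek(n_p, n_tau, r=0.5, sigma0=0.1, h=0.1, seed=0):
    """Simulate Vicsek dynamics: Eq. (14) angle update, Eq. (23) positions."""
    rng = np.random.default_rng(seed)
    side = np.sqrt(n_p)                                  # domain [0, sqrt(n_p)]^2
    s = rng.uniform(0.0, side, size=(n_p, 2))            # initial positions
    phi = rng.uniform(-np.pi, np.pi, size=n_p)           # initial angles
    v = np.sqrt(2.0) / 2.0                               # constant speed
    angles = [phi.copy()]
    for _ in range(n_tau):
        diff = s[:, None, :] - s[None, :, :]
        nbr = (np.linalg.norm(diff, axis=-1) < r).astype(float)  # includes self
        phi = nbr @ phi / nbr.sum(axis=1) \
              + sigma0 * rng.standard_normal(n_p)        # Eq. (14)
        s = s + v * h * np.column_stack([np.cos(phi), np.sin(phi)])  # Eq. (23)
        angles.append(phi.copy())
    return s, np.array(angles)
```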

Figure 4: Boxplots of NRMSE (Eq. (20)) for estimating the latent interaction function in the Vicsek model with $\sigma_0^2 = 0.1^2$, based on 20 experiments. Panel (a) uses the Matérn covariance with roughness parameter 2.5 and panel (b) uses the exponential covariance.

We first compare the computational cost and accuracy of IKF-CG, CG, and direct computation for calculating the predictive mean in Eq. (18) and the marginal likelihood in Eq. (16), using the covariance in Eq. (15). Simulations are conducted with $n_p = 100$ particles and noise variance $\sigma_0^2 = 0.2^2$. Panel (a) of Fig. 3 shows that the IKF-CG approach substantially outperforms both direct computation and the CG algorithm in computational time when predicting $n^* = 100$ test inputs, with time lengths $n_\tau$ ranging from 4 to 200. Fig. 3(b) compares the NRMSE of the predictive mean between direct computation and the IKF-CG method as the number of observations ranges from 400 to 2,000, corresponding to $n_\tau$ from 4 to 20. The two methods yield nearly identical predictive errors, with negligible differences in NRMSE. Fig. 3(c) compares log-likelihood values computed via direct computation and IKF-CG, where the latter employs the log-determinant approximation detailed in the Supplementary Material. Both methods produce similar log-likelihood values across different sample sizes.

Figure 5: Uncertainty assessment of the predicted interaction function in the Vicsek model with $\sigma_0^2 = 0.1^2$ using (a) the Matérn covariance function in Eq. (15) and (b) the exponential covariance function. Bars represent average lengths of 95% posterior credible intervals (Eq. (21)). Dots indicate the proportions covered by the 95% intervals (Eq. (22)), with optimal coverage (0.95) shown as the purple dashed line.
Figure 6: (a) Predicted (blue dashed lines) versus true (pink solid lines) interaction function with the 95% posterior credible interval (grey shaded area) for the Vicsek model with $n_\tau = 900$, $T = 10$, and $\sigma_0^2 = 0.1^2$, using the Matérn covariance function with $\nu = 2.5$. (b) Trajectories of 45 particles over 50 time frames using the estimated (blue dashed lines) and true (pink solid lines) interaction functions with identical initial positions and noise samples.

Next, we evaluate the performance of the IKF-CG algorithm across 12 scenarios with varying numbers of particles ($n_p = 100, 300, 900$), time frames ($n_\tau = 5, 10$), and noise variances ($\sigma_0^2 = 0.1^2, 0.2^2$). The predictive performance is evaluated using $n^* = 200$ test inputs evenly spaced across the domain of the interaction function, $[-\pi, \pi]$, averaged over $E = 20$ experiments.

Panels (a) and (b) of Figs. 4-5 show the predictive performance for particle interactions using the Matérn covariance with roughness parameter $\nu = 2.5$ and the exponential covariance (Matérn with $\nu = 0.5$), respectively, for noise variance $\sigma_0^2 = 0.1^2$. Results for $\sigma_0^2 = 0.2^2$ are provided in the Supplementary Material. Across all scenarios, NRMSEs remain low, with improved accuracy for larger datasets and smaller noise variances. The decrease in the average length of the 95% posterior credible interval with increasing sample size, along with the relatively small interval span compared to the output range $[-\pi, \pi]$, indicates improved prediction confidence with more observations. Moreover, the coverage proportion of the 95% posterior credible interval is close to the nominal 95% level in all cases, validating the reliability of the uncertainty quantification. Comparing panels (a) and (b) in Figs. 4-5, we find that the Matérn covariance in Eq. (15) yields lower NRMSE and narrower 95% posterior credible intervals than the exponential kernel. This improvement arises because the latent process induced by Eq. (15) is twice mean-squared differentiable, whereas the process with an exponential kernel is not differentiable.

Figure 6(a) shows close agreement between the predicted and true interaction functions over the domain $[-\pi, \pi]$. The 95% interval is narrow yet covers approximately 95% of the test samples of the interaction function. Figure 6(b) compares particle trajectories over 50 time steps using the predictive mean of one-step-ahead predictions and the true interaction function. The trajectories are visually similar.

4.3 A modified Vicsek model with multiple interactions

Various extensions of the Vicsek model have been studied to capture more complex particle dynamics [1, 9]. For illustration, we consider a modified Vicsek model with two interaction functions. The 2-dimensional velocity $\mathbf{v}_{i'}(\tau) = (v_{i',1}(\tau), v_{i',2}(\tau))^T$, corresponding to the output $\mathbf{y}_i$ in Eq. (13), is updated according to:

\[
\mathbf{v}_{i'}(\tau) = \frac{\sum_{k \in ne_{i'}(\tau-1)}\mathbf{v}_k(\tau-1)}{p_{i'}(\tau-1)} + \frac{\sum_{k \in ne'_{i'}(\tau-1)} f(d_{i',k}(\tau-1))\,\mathbf{e}_{i',k}(\tau-1)}{p'_{i'}(\tau-1)} + \bm{\epsilon}_{i'}(\tau), \tag{24}
\]

where $\bm{\epsilon}_{i'}(\tau) = (\epsilon_{i',1}(\tau), \epsilon_{i',2}(\tau))^T$ is a Gaussian noise vector with variance $\sigma_0^2$. The first term in (24) models velocity alignment with neighboring particles, and the second term introduces a distance-dependent interaction $f(\cdot)$. The definitions of the neighbor sets, the 2-dimensional vector $\mathbf{e}_{i',k}(\tau-1)$, and the form of $f(\cdot)$ are provided in Section S8.2 of the Supplementary Material.

We simulate 12 scenarios, each replicated $E = 20$ times, using the same numbers of particles, time frames, and noise levels as for the original Vicsek model in Section 4.2. The predictive performance of the latent factor model is evaluated using $n^* = 200$ test inputs evenly spaced across the training domain of each interaction function. We focus on the model with the Matérn kernel in Eq. (15); results for the exponential kernel are detailed in the Supplementary Material. Consistent with the results for the Vicsek model, we find that models with the Matérn kernel in Eq. (15) are more accurate, due to the smooth prior imposed on the interaction function.

                          First Interaction                 Second Interaction
                          NRMSE       L1(95%)   P1(95%)     NRMSE       L2(95%)   P2(95%)
n_p = 100, n_tau = 5      6.8×10^-3   0.098     92%         8.7×10^-2   0.29      94%
n_p = 100, n_tau = 10     6.0×10^-3   0.097     95%         3.8×10^-2   0.18      96%
n_p = 300, n_tau = 5      1.2×10^-2   0.23      87%         3.2×10^-2   0.21      96%
n_p = 300, n_tau = 10     4.2×10^-3   0.10      98%         2.0×10^-2   0.13      97%
n_p = 900, n_tau = 5      5.6×10^-3   0.10      96%         1.4×10^-2   0.12      98%
n_p = 900, n_tau = 10     4.4×10^-3   0.12      97%         1.4×10^-2   0.10      96%
Table 1: Predictive accuracy and uncertainty assessment by (20)-(22) for the modified Vicsek model with $\sigma_0^2 = 0.1^2$, using the Matérn covariance function with roughness parameter 2.5.

Table 1 summarizes the predictive performance for $\sigma_0^2 = 0.1^2$; results for $\sigma_0^2 = 0.2^2$ are included in the Supplementary Material. While both interaction functions exhibit relatively low NRMSEs, the second interaction function has a higher NRMSE than the first because the repulsion at short distances creates fewer training samples with close-proximity neighbors. This discrepancy can be mitigated by increasing the number of observations. The average length of the 95% posterior credible interval for the second interaction decreases as the sample size increases, and the coverage proportion remains close to the nominal 95% level across all scenarios.

Figure 7: Predictions (blue dashed curves), truth (pink solid curves), and the 95% posterior credible interval (shaded region) of the particle interaction functions when $n_p = 900$, $n_\tau = 10$, and $\sigma_0^2 = 0.1^2$.

In Fig. 7, we show the predictions of the interaction functions, where the shaded regions represent the 95% posterior credible intervals. The predictions closely match the true values, and the credible intervals, while narrow relative to the output range and nearly invisible in the plots, mostly cover the truth. These results suggest high confidence in the predictions.

4.4 Estimating cell-cell interactions on anisotropic substrates

We analyze video microscopy data from [24], which tracks the trajectories of over 2,000 human dermal fibroblasts moving on a nematically aligned, liquid-crystalline substrate. This experimental setup encourages cellular alignment along a horizontal axis, with the alignment order increasing over time, though the underlying mechanism remains largely unknown. Cells were imaged every 20 minutes over a 36-hour period, during which the cell count grew from 2,615 to 2,953 due to proliferation. Our objective is to estimate the latent interaction function between cells. Given the vast number of velocity observations ($N \approx 300{,}000$), directly forming and inverting the covariance matrix is impractical.

We apply the IKF-CG algorithm to estimate the latent interaction function. Due to the anisotropic substrate, cellular velocities differ between the horizontal and vertical directions, so we independently model the velocity of the $i'$th cell in each direction $l$ by

\[
v_{i',l}(\tau) = \frac{1}{p_{ne_{i',l}(\tau-1)}}\sum_{k \in ne_{i',l}(\tau-1)} z_l(d_{k,l}) + \epsilon_{i',l}(\tau), \quad i' = 1, \dots, n_p(\tau), \tag{25}
\]

where $n_p(\tau)$ is the cell count at time $\tau$, $l = 1, 2$ correspond to the horizontal and vertical directions, respectively, and $\epsilon_{i',l}(\tau) \sim \mathcal{N}(0, \sigma_{0,l}^2(\tau))$ denotes the noise. Inspired by the modified Vicsek model, we set $d_{k,l} = v_{k,l}(\tau-1)$. To account for velocity decay caused by increasing cell confluence, we model the noise variances as $\sigma_{0,l}^2(\tau) = \omega_l \sigma_{v,l}^2(\tau-1)$, where $\sigma_{v,l}^2(\tau-1)$ is the sample velocity variance at time $\tau-1$ and $\omega_l$ is a parameter estimated from the data, for $l = 1, 2$. The neighbor set $ne_{i',l}(\tau-1) = \{k : \|\mathbf{s}_{i'}(\tau-1) - \mathbf{s}_k(\tau-1)\| < r_l \text{ and } \mathbf{v}_{i'}(\tau-1)\cdot\mathbf{v}_k(\tau-1) > 0\}$, where $r_l$ denotes the interaction radius in direction $l$ for $l = 1, 2$, excludes particles moving in opposite directions, reflecting the observed gliding and intercalating behavior of cells [36].

Figure 8: The blue dashed curves show the predicted interaction function for the horizontal and vertical directions. The grey shaded area is the 95% posterior credible interval. The purple line of slope 1 is the prediction using the Vicsek model. The light pink histogram shows the velocity distribution of all training samples.

We use observations from the first half of the time frames as the training set and the latter half as the testing set. The predicted interaction functions in Fig. 8 show diminishing effects in both directions, likely due to cell-substrate interactions such as friction. The estimated uncertainty (obtained via residual bootstrap) increases when the absolute velocity of neighboring cells in the previous time frame is large, which is attributed to fewer observations at the boundaries of the input domain. Furthermore, the interaction in the vertical direction is weaker than in the horizontal direction, due to the confinement from the anisotropic substrate in the vertical direction.

Our nonparametric estimation of the interaction functions is compared with two models for one-step-ahead forecasts of directional velocities: the original Vicsek model introduced in Section 4.2 and the anisotropic Vicsek model, which predicts velocities using the average velocity of neighboring cells from the previous time frame. As shown in Table 2, our model outperforms both Vicsek models in RMSE for one-step-ahead forecasts of directional velocities, despite the large inherent stochasticity of cellular motion. Moreover, our model yields notably shorter average interval lengths that still cover approximately 95% of the held-out observations. These findings underscore the importance of the IKF-CG algorithm in enabling the use of large datasets to overcome high stochasticity and capture the underlying dynamics of alignment processes.

                           Horizontal direction             Vertical direction
                           RMSE        L(95%)   P(95%)      RMSE        L(95%)   P(95%)
Baseline Vicsek            4.5×10^-3   0.011    80%         5.3×10^-3   0.011    95%
Anisotropic Vicsek         3.9×10^-3   0.024    98%         2.6×10^-3   0.020    99%
Nonparametric estimation   3.5×10^-3   0.015    95%         2.3×10^-3   0.0092   95%
Table 2: One-step-ahead prediction performance on the testing dataset. Here $\mathrm{RMSE} = \{\sum_{\tau=1}^{n_\tau^*}\sum_{i'=1}^{n_p(\tau)}(\hat{v}_{i',l}(\tau) - v_{i',l}(\tau))^2 / \sum_{\tau=1}^{n_\tau^*} n_p(\tau)\}^{1/2}$, with $n_\tau^*$ denoting the number of testing time frames. L(95%) and P(95%) are computed as in Eqs. (21)-(22), but with the predictive interval of $z$ replaced by that of $y$, where $y$ represents the velocity in each direction.

5 Concluding remarks

Several future directions are worth exploring. First, computing large matrix-vector products is ubiquitous, and the approach can be extended to different models, including some large neural network models. Second, scalably computing the logarithm of the determinant of the covariance in Eq. (3) without approximation remains an open problem. Third, the new approach can be extended to estimate latent functions when forward equations are unavailable or computationally intensive in nonlinear or non-Gaussian dynamical systems, where the ensemble Kalman filter [6, 33] and the particle filter [20] are commonly used. Fourth, the proposed approach can be extended to estimating and predicting multivariate time series [21] and generalized factor processes for categorical observations [4].

Acknowledgement

Xinyi Fang acknowledges partial support from the BioPACIFIC Materials Innovation Platform of the National Science Foundation under Award No. DMR-1933487. We thank the editor, associate editor, and three anonymous referees for their comments that substantially improved this article.

Supplementary Material

The Supplementary Material contains (i) proofs of all lemmas; (ii) a summary of the connection between GPs with Matérn covariance (roughness parameters 0.5 and 2.5) and DLMs, with closed-form $\mathbf{F}_{t,j}$, $\mathbf{G}_{t,j}$, and $\mathbf{W}_{t,j}$; (iii) the conjugate gradient algorithm; (iv) procedures for the scalable computation of the predictive distribution in particle dynamics; (v) an application of the IKF-CG algorithm for predicting missing values in lattice data; (vi) the parameter estimation method; (vii) computational complexity analysis; (viii) additional numerical results for particle interaction; and (ix) numerical results for predicting missing values in lattice data.

Appendix A. Kalman Filter

The Kalman filter [16] for dynamic linear models in (1)-(2) is summarized in Lemma 5.

Lemma 5.

Let $\bm{\theta}_{t-1} \mid \mathbf{y}_{1:t-1} \sim \mathcal{MN}(\mathbf{m}_{t-1}, \mathbf{C}_{t-1})$. Recursively, for $t = 2, \dots, N$:

  • (i) The one-step-ahead predictive distribution of $\bm{\theta}_t$ given $\mathbf{y}_{1:t-1}$ is

\[
\bm{\theta}_t \mid \mathbf{y}_{1:t-1} \sim \mathcal{MN}(\mathbf{b}_t, \mathbf{B}_t), \tag{26}
\]

    with $\mathbf{b}_t = \mathbf{G}_t\mathbf{m}_{t-1}$ and $\mathbf{B}_t = \mathbf{G}_t\mathbf{C}_{t-1}\mathbf{G}_t^T + \mathbf{W}_t$.

  • (ii) The one-step-ahead predictive distribution of $y_t$ given $\mathbf{y}_{1:t-1}$ is

\[
y_t \mid \mathbf{y}_{1:t-1} \sim \mathcal{N}(f_t, Q_t), \tag{27}
\]

    with $f_t = \mathbf{F}_t\mathbf{b}_t$ and $Q_t = \mathbf{F}_t\mathbf{B}_t\mathbf{F}_t^T + V_t$.

  • (iii) The filtering distribution of $\bm{\theta}_t$ given $\mathbf{y}_{1:t}$ follows

\[
\bm{\theta}_t \mid \mathbf{y}_{1:t} \sim \mathcal{MN}(\mathbf{m}_t, \mathbf{C}_t), \tag{28}
\]

    with $\mathbf{m}_t = \mathbf{b}_t + \mathbf{B}_t\mathbf{F}_t^T Q_t^{-1}(y_t - f_t)$ and $\mathbf{C}_t = \mathbf{B}_t - \mathbf{B}_t\mathbf{F}_t^T Q_t^{-1}\mathbf{F}_t\mathbf{B}_t$.
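A direct transcription of these recursions, extending the data-independent kalman_gains sketch of Section 2 with the data-dependent means (a sketch; $\mathbf{b}_1 = \mathbf{0}$ and $\mathbf{B}_1 = \mathbf{W}_1$ match the paper's initialization):

```python
import numpy as np

def kalman_filter(F, G, W, V, y):
    """Lemma 5: one-step-ahead means f_t, variances Q_t, and gains K_t,
    for F: (N, q), G, W: (N, q, q), noise variances V: (N,)."""
    N, q = F.shape
    f, Q, K = np.empty(N), np.empty(N), np.empty((N, q))
    for t in range(N):
        if t == 0:
            b, B = np.zeros(q), W[0]                # theta_1 ~ MN(0, W_1)
        else:
            b = G[t] @ m                            # Eq. (26) mean
            B = G[t] @ C @ G[t].T + W[t]            # Eq. (26) variance
        f[t] = F[t] @ b                             # Eq. (27) mean
        Q[t] = F[t] @ B @ F[t] + V[t]               # Eq. (27) variance
        K[t] = B @ F[t] / Q[t]                      # Kalman gain, Eq. (6)
        m = b + K[t] * (y[t] - f[t])                # Eq. (28) mean
        C = B - np.outer(K[t], F[t]) @ B            # Eq. (28) variance
    return f, Q, K
```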

Lemma 6.

Denote $\tilde{\mathbf{y}} = (\tilde{y}_1, \dots, \tilde{y}_N)^T = \mathbf{L}^{-1}\mathbf{y}$, where $\mathbf{L}$ is the factor in the Cholesky decomposition of a positive definite covariance $\bm{\Sigma} = \mathrm{cov}[\mathbf{y}]$, with $\mathbf{y} = (y_1, \dots, y_N)^T$ defined as in (1). We have

\[
\tilde{y}_t = \frac{y_t - f_t}{Q_t^{1/2}}, \tag{29}
\]

where $f_t$ and $Q_t$ are defined in Eq. (27). Furthermore, the $t$th diagonal entry of $\mathbf{L}$ is $L_{t,t} = Q_t^{1/2}$.
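Lemma 6 then gives $\mathbf{L}^{-1}\mathbf{y}$ as the standardized innovations of the filter sketched above:

```python
def L_inv_y(F, G, W, V, y):
    """Lemma 6: y_tilde = L^{-1} y from the KF innovations, Eq. (29)."""
    f, Q, _ = kalman_filter(F, G, W, V, y)
    return (y - f) / np.sqrt(Q)
```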

References

  • [1] Hugues Chaté, Francesco Ginelli, Guillaume Grégoire, Fernando Peruani, and Franck Raynaud. Modeling collective motion: variations on the Vicsek model. The European Physical Journal B, 64(3):451–456, 2008.
  • [2] Noel Cressie and Gardar Johannesson. Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226, 2008.
  • [3] Abhirup Datta, Sudipto Banerjee, Andrew O Finley, and Alan E Gelfand. Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812, 2016.
  • [4] Daniele Durante and David B Dunson. Nonparametric Bayes dynamic modelling of relational data. Biometrika, 101(4):883–898, 2014.
  • [5] James Durbin and Siem Jan Koopman. Time series analysis by state space methods, volume 38. OUP Oxford, 2012.
  • [6] Geir Evensen. Data assimilation: the ensemble Kalman filter. Springer Science & Business Media, 2009.
  • [7] Andrew O. Finley, Abhirup Datta, and Sudipto Banerjee. spNNGP R Package for Nearest Neighbor Gaussian Process Models. Journal of Statistical Software, 103(5):1–40, 2022.
  • [8] Reinhard Furrer, Marc G Genton, and Douglas Nychka. Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15(3):502–523, 2006.
  • [9] Francesco Ginelli, Fernando Peruani, Markus Bär, and Hugues Chaté. Large-scale collective properties of self-propelled rods. Physical Review Letters, 104(18):184502, 2010.
  • [10] Robert B Gramacy and Daniel W Apley. Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578, 2015.
  • [11] Mengyang Gu, Xiaojing Wang, and James O Berger. Robust Gaussian stochastic process emulation. Annals of Statistics, 46(6A):3038–3066, 2018.
  • [12] Mark S Handcock and Michael L Stein. A Bayesian analysis of kriging. Technometrics, 35(4):403–410, 1993.
  • [13] Jouni Hartikainen and Simo Sarkka. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on, pages 379–384. IEEE, 2010.
  • [14] Trevor Hastie and Robert Tibshirani. Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Methodological), 55(4):757–779, 1993.
  • [15] Magnus R Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards, 49(6):409, 1952.
  • [16] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
  • [17] Yael Katz, Kolbjørn Tunstrøm, Christos C Ioannou, Cristián Huepe, and Iain D Couzin. Inferring the structure and dynamics of interactions in schooling fish. Proceedings of the National Academy of Sciences, 108(46):18720–18725, 2011.
  • [18] Matthias Katzfuss and Joseph Guinness. A general framework for Vecchia approximations of Gaussian processes. Statistical Science, 36(1):124–141, 2021.
  • [19] Matthias Katzfuss, Joseph Guinness, and Earl Lawrence. Scaled Vecchia approximation for fast computer-model emulation. SIAM/ASA Journal on Uncertainty Quantification, 10(2):537–554, 2022.
  • [20] Genshiro Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1–25, 1996.
  • [21] Clifford Lam, Qiwei Yao, and Neil Bathia. Estimation of latent factors for high-dimensional time series. Biometrika, 98(4):901–918, 2011.
  • [22] Finn Lindgren, Håvard Rue, and Johan Lindström. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011.
  • [23] Fei Lu, Ming Zhong, Sui Tang, and Mauro Maggioni. Nonparametric inference of interaction laws in systems of agents from trajectory data. Proceedings of the National Academy of Sciences, 116(29):14424–14433, 2019.
  • [24] Yimin Luo, Mengyang Gu, Minwook Park, Xinyi Fang, Younghoon Kwon, Juan Manuel Urueña, Javier Read de Alaniz, Matthew E Helgeson, Cristina M Marchetti, and Megan T Valentine. Molecular-scale substrate anisotropy, crowding and division drive collective behaviours in cell monolayers. Journal of the Royal Society Interface, 20(204):20230160, 2023.
  • [25] Suman Majumder, Yawen Guan, Brian J Reich, and Arvind K Saibaba. Kryging: geostatistical analysis of large-scale datasets using Krylov subspace methods. Statistics and Computing, 32(5):74, 2022.
  • [26] Giovanni Petris, Sonia Petrone, and Patrizia Campagnoli. Dynamic linear models. In Dynamic linear models with R. Springer, 2009.
  • [27] Raquel Prado and Mike West. Time series: modeling, computation, and inference. Chapman and Hall/CRC, 2010.
  • [28] Dennis C Rapaport. The art of molecular dynamics simulation. Cambridge University Press, 2004.
  • [29] Arvind K Saibaba, Alen Alexanderian, and Ilse CF Ipsen. Randomized matrix-free trace and log-determinant estimators. Numerische Mathematik, 137(2):353–395, 2017.
  • [30] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. Advances in neural information processing systems, 18:1257, 2006.
  • [31] Michael L Stein. Interpolation of spatial data: some theory for kriging. Springer Science & Business Media, 2012.
  • [32] Michael L Stein, Jie Chen, and Mihai Anitescu. Stochastic approximation of score functions for Gaussian processes. The Annals of Applied Statistics, 7(2):1162–1191, 2013.
  • [33] Jonathan R Stroud, Michael L Stein, Barry M Lesht, David J Schwab, and Dmitry Beletsky. An ensemble Kalman filter and smoother for satellite data assimilation. Journal of the American Statistical Association, 105(491):978–990, 2010.
  • [34] Aldo V Vecchia. Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society: Series B (Methodological), 50(2):297–312, 1988.
  • [35] Tamás Vicsek, András Czirók, Eshel Ben-Jacob, Inon Cohen, and Ofer Shochet. Novel type of phase transition in a system of self-driven particles. Physical Review Letters, 75(6):1226, 1995.
  • [36] Elise Walck-Shannon and Jeff Hardin. Cell intercalation from top to bottom. Nature Reviews Molecular Cell Biology, 15(1):34–48, 2014.
  • [37] M. West and P. J. Harrison. Bayesian Forecasting & Dynamic Models. Springer Verlag, 2nd edition, 1997.
  • [38] Peter Whittle. On stationary processes in the plane. Biometrika, pages 434–449, 1954.