

Learning Theory for Inferring Interaction Kernels in Second-Order Interacting Agent Systems

Jason Miller^{2,3}, [email protected]
Sui Tang^{1,4}, [email protected]
Ming Zhong^{2,3}, [email protected]
Mauro Maggioni^{1,2,3}, [email protected]

^{1}Departments of Mathematics, ^{2}Applied Mathematics and Statistics,
^{3}Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD 21218, USA
^{4}University of California, Santa Barbara, 552 University Rd, Isla Vista, CA 93117, USA
Abstract

Modeling the complex interactions of systems of particles or agents is a fundamental scientific and mathematical problem that is studied in diverse fields, ranging from physics and biology, to economics and machine learning. In this work, we describe a very general second-order, heterogeneous, multivariable, interacting agent model, with an environment, that encompasses a wide variety of known systems. We describe an inference framework that uses nonparametric regression and approximation theory based techniques to efficiently derive estimators of the interaction kernels which drive these dynamical systems. We develop a complete learning theory which establishes strong consistency and optimal nonparametric min-max rates of convergence for the estimators, as well as provably accurate predicted trajectories. The estimators exploit the structure of the equations in order to overcome the curse of dimensionality and we describe a fundamental coercivity condition on the inverse problem which ensures that the kernels can be learned and relates to the minimal singular value of the learning matrix. The numerical algorithm presented to build the estimators is parallelizable, performs well on high-dimensional problems, and is demonstrated on complex dynamical systems.

Keywords: Machine learning; dynamical systems; agent-based dynamics; inverse problems; regularized least squares; nonparametric statistics.

1 Introduction

Physical, biological, and social systems across all scales of complexity and size can often be described as dynamical systems written in terms of interacting agents (e.g. particles, cells, humans, planets, …). Rich theories have been developed to explain the collective behavior of these interacting agents across many fields including astronomy, particle physics, economics, social science, and biology. Examples include predator-prey systems, molecular dynamics, coupled harmonic oscillators, flocking birds or milling fish, human social interactions, and celestial mechanics, to name a few. In order to encompass many of these examples, we will describe a very general second-order, heterogeneous (the agents can be of different types), interacting (the acceleration of an agent is a function of properties of the other agents) agent system that includes external forces, masses of the agents, multivariable interaction kernels, and an additional environment variable that is a dynamical property of the agent (for example, a firefly having its luminescence varying in time). We propose a learning approach that combines machine learning and dynamical systems in order to provide highly accurate dynamical models of the observation data from these systems.

The model and learning framework presented in sections 2-5 encompass a very large number of relevant systems and allow for their modeling. Opinion dynamics [45, 27, 10, 56] is a simple first-order case that exhibits clustering. Flocking of birds [32, 29, 28] provides a simple example of a second-order system that exhibits an emergent shared velocity of all agents. Milling of fish [25, 1, 2, 24] is another second-order model; it presents both 2- and 3-dimensional milling patterns over long times and introduces a non-collective force from the environment. A model of oscillators (fireflies) that sync and swarm together, with dynamics governed by their positions and a phase variable ξ, was studied in [73, 59, 58, 57]. There are also models that include both energy and alignment interaction kernels; a particular case is the anticipation dynamics model from [71], which we study in this work.

One can also consider a collection of celestial bodies interacting via the gravitational potential, which was initially studied in [83] and is further studied in the upcoming [54]. All of these models fit into our framework, and we present detailed studies of these, and others, in this work as well as in [83, 51, 52]. These dynamics exhibit a wide range of emergent behaviors and, as shown in [79, 74, 29, 38, 23, 5, 56], these behaviors can be studied when the governing equations are known. However, if the equations are not known and the data consist only of trajectories, we still wish to develop a model that can both make accurate predictions of the trajectories and discover a dynamical form that accurately reflects their emergent properties. To achieve this, we present a theoretically optimal learning algorithm that is accurate, captures emergent behavior for large time, and, by exploiting the structure of the collective dynamical system, avoids the curse of dimensionality.

Applying machine learning to the sciences has experienced tremendous growth in recent years; a small selection of general applications related to the ideas in this work includes: learning PDEs ([4, 68, 46]), modeling dynamical systems ([37, 6, 47]), governing equations ([20, 80]), biology ([21]), fluid mechanics ([63, 41]), many-body problems in quantum systems ([19]), mean-field games ([67]), meteorology ([40]), and dynamical systems ([26, 17, 81, 11]). These, and the references therein, give a flavor of the diverse range of applications. A vast literature exists in the context of learning dynamical systems. For general nonlinear dynamical systems, symbolic regression has been developed to learn the underlying form of the equations from data, see [12, 69]. Sparse regression techniques use an extremely large dictionary of functions, often containing most major mathematical functions, which is fit to the data under a sparsity constraint that allows only a few terms to appear in the final model. Detailed studies and developments of these approaches can be found for SINDy in ([16, 66, 15]), a LASSO-type penalty ([42, 43]), and sparse Bayesian regression ([82]). Other approaches consider multiscale methods, statistical mechanics, or force-based models, see [3, 8]. Deep learning has also been applied to learn dynamical systems, for ODEs see [62, 65] and for PDEs see [60, 61, 50], as well as the references therein.

The majority of the earliest work on inferring interaction kernels in systems of the type (1.1), (2.2) occurred in the physics literature, going back to the works of Newton. From the viewpoint of purely data-driven analysis of the equations, requiring limited or no physical reasoning, foundational works are [53, 44]. In these works, the interaction kernels are assumed to be in the span of a known family of functions and the parameters are estimated. In statistics, the problem of parameter estimation in dynamical systems from observations is classical, e.g. [78, 14, 49, 18, 64]. The question of identifiability of the parameters arises, see e.g. [34, 55]. Our work is closely related to this viewpoint, but our parameter is now infinite-dimensional, with identifiability discussed in section 6.

Our learning approach is based on exploiting the structure of collective dynamical systems and nonparametric estimation techniques (see [30, 76, 39, 36, 9]). We focus here on second-order models and the form of the equations, generalizing the first order models (see discussion in Appendix F), derived from Newton’s second law: for i=1,,Ni=1,\ldots,N

\begin{aligned}
m_{i}\ddot{\boldsymbol{x}}_{i}(t)=F^{\dot{\boldsymbol{x}}}(\boldsymbol{x}_{i},\dot{\boldsymbol{x}}_{i})+\frac{1}{N}\sum_{\begin{subarray}{c}i^{\prime}=1,\\ i^{\prime}\neq i\end{subarray}}^{N} &\ \phi^{E}(\left\|\boldsymbol{x}_{i^{\prime}}(t)-\boldsymbol{x}_{i}(t)\right\|)(\boldsymbol{x}_{i^{\prime}}(t)-\boldsymbol{x}_{i}(t))\\
&+\phi^{A}(\left\|\boldsymbol{x}_{i^{\prime}}(t)-\boldsymbol{x}_{i}(t)\right\|)(\dot{\boldsymbol{x}}_{i^{\prime}}(t)-\dot{\boldsymbol{x}}_{i}(t)). \qquad (1.1)
\end{aligned}

Here, m_i is the mass of the i-th agent, 𝒙_i is its position, F^𝒙̇ is a non-collective force, and ϕ^E: ℝ⁺ → ℝ, ϕ^A: ℝ⁺ → ℝ are known as the interaction kernels. A significant amount of research on modeling collective dynamics is concerned with inducing desired collective behaviors (flocking, clustering, milling, etc.) from relatively simple, local and non-local interaction kernels, often from known or specific parametric families of relatively simple functions. Here we consider a non-parametric, inverse-problem-based approach to infer the interaction kernels from observations of trajectory data, especially over short time periods. In [13], a convergence study of learning unknown interaction kernels from observations of first-order models of homogeneous agents was carried out for increasing N, the number of agents. The estimation problem with N fixed, but the number of trajectories M varying, for first-order and second-order models of heterogeneous agents was numerically studied in [52], and a learning theory for these first-order models was developed in [51, 48]. Further extensions of the model and algorithm to more complicated second-order systems, with particular emphasis on emergent collective behaviors, were discussed in [83]. A big-data application to real celestial-motion ephemerides is developed and discussed in [54]. In this work, we provide a rigorous learning theory covering the models presented in [83], as well as the second-order models introduced in [52]. We consider generalizations of the models in [83] to include models with higher-dimensional interaction kernels that do not depend only on pairwise distances. Compared to the theories studied in [51, 48], our theory focuses on second-order models with interaction terms of the form ϕ^E(r)r + ϕ^A(r)ṙ (with r and ṙ representing norms of differences of positions and, respectively, velocities of pairs of agents); additionally, we discuss the identifiability and separability of ϕ^E and ϕ^A from the sum.
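To make the structure of (1.1) concrete, the following is a minimal simulation sketch (not the authors' code) for a homogeneous system with unit masses, no non-collective force, hypothetical kernels phiE and phiA, and a crude forward-Euler integrator; it only illustrates how trajectory data of the kind we observe can be generated.

```python
import numpy as np

def rhs(X, V, phiE, phiA):
    """Acceleration of each agent under (1.1), unit masses, no external force."""
    N = X.shape[0]
    A = np.zeros_like(X)
    for i in range(N):
        dx = X - X[i]                       # x_{i'} - x_i for all i'
        dv = V - V[i]                       # xdot_{i'} - xdot_i
        r = np.linalg.norm(dx, axis=1)
        r[i] = 1.0                          # placeholder; the i'=i term is zeroed below
        wE, wA = phiE(r), phiA(r)
        wE[i] = wA[i] = 0.0                 # exclude i' = i from the sum
        A[i] = (wE[:, None] * dx + wA[:, None] * dv).sum(axis=0) / N
    return A

def simulate(X0, V0, phiE, phiA, h=0.01, L=200):
    """Forward-Euler integration; returns L snapshots of (positions, velocities)."""
    X, V, traj = X0.copy(), V0.copy(), []
    for _ in range(L):
        traj.append((X.copy(), V.copy()))
        A = rhs(X, V, phiE, phiA)
        X, V = X + h * V, V + h * A
    return traj

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d = 10, 2
    X0 = rng.uniform(-1, 1, size=(N, d))
    V0 = rng.uniform(-1, 1, size=(N, d))
    phiE = lambda r: np.exp(-r)             # hypothetical energy kernel
    phiA = lambda r: 1.0 / (1.0 + r ** 2)   # hypothetical alignment kernel
    traj = simulate(X0, V0, phiE, phiA)
    print("final mean speed:", np.linalg.norm(traj[-1][1], axis=1).mean())
```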

The overall objective of the algorithm can be stated as: given trajectory data generated from an interacting agent system, we wish to learn the underlying interaction kernels, from which we will understand its long-term and emergent behavior, and ultimately build a highly accurate approximate model that faithfully captures the dynamics. We make minimal assumptions on the form of the interaction kernels and the various forces involved, and in some cases the assumptions are made for purely theoretical reasons and the algorithm can still perform well when they do not hold for a given system. We offer a learning approach to address these collective systems by first discovering the governing equations from the observational data, and then using the estimated equations for large-time predictions.

The approach in this paper builds on many of these ideas and uses observation data coming from collective dynamical systems of the form (2.2) to learn the underlying interaction functions. This variational approach was initially developed in [13, 52] and further studied and extended in [51, 83]. Our analysis of this system blends differential equations, inverse problems, nonparametric regression, and (statistical) learning theory. A central insight is that we exploit the form of (2.2) to move the inference task to just the unknown functions (ϕ^E, ϕ^A, ϕ^ξ), allowing us to avoid the curse of dimensionality that would be incurred if we were to directly perform regression against the high-dimensional phase space and trajectory data of the system. Independence of observations is obtained by treating each trajectory, started from an independently sampled initial condition, as a separate observation.

To use the trajectory data to derive estimators, we consider appropriate hypothesis spaces in which to build our estimators, measures adapted to the dynamics, norms, and other performance metrics, and ultimately an inverse problem built from these tools. Once we have obtained this estimated interaction kernel, we want to study its properties as a function of the amount of trajectory data we receive, which is the MM trajectories sampled from different initial conditions from the same underlying system. Here we study properties of the error functional, establish the uniqueness of its minimizers, and use the probability measures to define a dynamics-adapted norm to measure the error of our estimators over the hypothesis spaces. In comparing the estimators to the true interaction kernels, we first establish concentration estimates over the hypothesis space.

Our first main result is the strong consistency of our learned estimators asymptotically in M. For the relevant definitions see section 4, and for the full theorem see section 7.2; for the model (1.1) it yields

\lim_{M\rightarrow\infty}\|\widehat{\bm{\phi}}^{EA}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}=0\text{ with probability one,} \qquad (1.2)

where 𝝆_T^{EA,L} is a dynamics-adapted measure on pairwise distances, and the error is measured in a weighted 𝑳² space detailed in section 5, particularly (5.3). Perhaps most importantly, we give a rate of convergence in terms of the number M of trajectories: we achieve the minimax rate of convergence for any number of variables |𝓥| in the interaction kernels. See section 7.3 for the full theorem (and section 4 for the relevant definitions); it is given by

\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\Big{[}\|\widehat{\bm{\phi}}^{EA}-\bm{\phi}^{EA}\|^{2}_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}\Big{]}\leq C\left(\frac{\log M}{M}\right)^{\frac{2s}{2s+|\bm{\mathcal{V}}|}}\,. \qquad (1.3)

In the case of model (1.1), |𝓥| = 1, as in the results for first-order systems [13, 52, 51].

This means that our estimators converge at the same rate in MM as the best possible estimator (up to a logarithmic factor) one could construct when the initial conditions are randomly sampled from some underlying initial condition distribution denoted 𝝁𝒀\bm{\mu^{\boldsymbol{Y}}} throughout this work, see (section 7).

To solve the inverse problem, we give a detailed discussion of an essential link between these aspects: the notion of coercivity of the system, detailed in section 6. Coercivity plays a key role in the approximation properties of the hypothesis spaces, in the ability to learn the interaction kernels, and throughout the learning theory. We also present numerical examples (see the detailed numerical study in [83]) which help to explain why the particular norms we define are the right choice, and which show excellent performance on complex dynamical systems in section 9.

Our paper is structured as follows. The first part of the paper describes the model, learning framework, inference problem, and the basic tools needed for the learning theory; these ideas are explained in detail in sections 2-5. For readers who wish to jump directly to the theoretical sections and refer back to the definitions as needed, tables 1 and 3 respectively explain the model equations and outline the definitions and concepts needed for the learning theory and the general theoretical results. The theoretical part of the paper (sections 6-8) discusses fundamental questions of identifiability and solvability of the inverse problem, consistency and rate of convergence of the estimators, and the ability to control the error of trajectories evolved using our estimators. Some key highlights of our theoretical contributions are described in section 4.3, with full details in the corresponding sections. Lastly, we consider applications in section 9; many additional proofs and details appear in appendices A-H.

1.1 Comparison with existing work

Our learning approach discovers the governing structure of a particular subset of dynamical systems of the form,

𝒀˙(t)=𝑭ϕEA,ϕξ(𝒀(t)),t[0,T].\dot{\boldsymbol{Y}}(t)=\boldsymbol{F}_{\bm{\phi}^{EA},\bm{\phi}^{\xi}}(\boldsymbol{Y}(t)),\quad t\in[0,T].

from observations {𝒀(tl),𝒀˙(tl)}l=1L\{\boldsymbol{Y}(t_{l}),\dot{\boldsymbol{Y}}(t_{l})\}_{l=1}^{L}, by implicitly inferring the right hand side, 𝑭ϕEA,ϕξ\boldsymbol{F}_{\bm{\phi}^{EA},\bm{\phi}^{\xi}}. The main difficulties in establishing an effective theory of learning 𝑭ϕEA,ϕξ\boldsymbol{F}_{\bm{\phi}^{EA},\bm{\phi}^{\xi}} are the curse of dimensionality caused by the size of 𝒀\boldsymbol{Y}, which is often N(2d+1)N(2d+1), where NN is the number of agents, dd the dimension of physical space; and the dependence of the observation data, for example 𝒀(tl+1)\boldsymbol{Y}(t_{l+1}) depends on 𝒀(tl)\boldsymbol{Y}(t_{l}).

There are many techniques which can be used to tackle the high-dimension of the data set: sparsity assumptions, dimension reduction, reduced-order modeling, and machine learning techniques trained using gradient-based optimization. The dependent nature of the data prevents traditional regression-based approaches, see the discussion in [51], but many of the approaches above successfully address this. Our work, however, exploits the interacting-agent structure of collective dynamical systems, which is driven by a collection of two-body interactions where each interaction depends only on pairwise data between the states of agents, as in (1). With this structure in mind, we are able to reduce the ambient dimension of the data N(2d+1)N(2d+1) to the dimension of the variables in the interaction kernels, which is independent of NN. We also naturally incorporate the dependence in the data in an appropriate manner by considering trajectories generated from different initial conditions.

Our theoretical results focus on the joint learning of ϕE,ϕA{\bm{\phi}}^{E},{\bm{\phi}}^{A} that takes into account their natural weighted direct sum structure that is described in the following sections, which is different from the learning theory on single ϕE{\bm{\phi}}^{E}’s considered in [52, 51]. The current theoretical framework is not able to conclusively show that ϕE{\bm{\phi}}^{E} and ϕA{\bm{\phi}}^{A} can be learned separately; however we demonstrate in various numerical experiments that by learning ϕE{\bm{\phi}}^{E} and ϕA{\bm{\phi}}^{A} jointly, we still achieve strong performance. Finally, we note that the first-order theory developed in [51] is a special case of our second-order theory, see details in appendix F.

2 Model description

In order to motivate the choice of second-order models considered in this paper, we begin our discussion with a simple second-order model derived from the classical mechanics point of view. Let us consider a closed system of N homogeneous agents (or particles) equipped with a certain type of Lagrangian energy for the whole system, namely L(t), given as follows:

L(t)=\sum_{i=1}^{N}\frac{1}{2}m_{i}\left\|\dot{\boldsymbol{x}}_{i}(t)\right\|^{2}-\frac{1}{N}\sum_{1\leq i<i^{\prime}\leq N}U(\left\|\boldsymbol{x}_{i^{\prime}}(t)-\boldsymbol{x}_{i}(t)\right\|).

Here U is a potential energy depending on pairwise distance. From the Lagrange equations, \frac{d}{dt}\big(\partial_{\dot{\boldsymbol{x}}_{i}}L\big)=\partial_{\boldsymbol{x}_{i}}L, we obtain a simple second-order collective dynamics model

m_{i}\ddot{\boldsymbol{x}}_{i}(t)=\frac{1}{N}\sum_{i^{\prime}=1,i^{\prime}\neq i}^{N}\phi^{E}(\left\|\boldsymbol{x}_{i^{\prime}}(t)-\boldsymbol{x}_{i}(t)\right\|)(\boldsymbol{x}_{i^{\prime}}(t)-\boldsymbol{x}_{i}(t)),\quad i=1,\ldots,N. \qquad (2.1)
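For completeness, a brief sketch of the computation behind (2.1), writing r_{ii′} := ‖𝒙_{i′} − 𝒙_i‖ and using the Lagrange equations above:

\partial_{\dot{\boldsymbol{x}}_{i}}L=m_{i}\dot{\boldsymbol{x}}_{i},\qquad \partial_{\boldsymbol{x}_{i}}L=-\frac{1}{N}\sum_{i^{\prime}\neq i}U^{\prime}(r_{ii^{\prime}})\,\frac{\boldsymbol{x}_{i}-\boldsymbol{x}_{i^{\prime}}}{r_{ii^{\prime}}}=\frac{1}{N}\sum_{i^{\prime}\neq i}\frac{U^{\prime}(r_{ii^{\prime}})}{r_{ii^{\prime}}}(\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}),

so that \frac{d}{dt}\big(\partial_{\dot{\boldsymbol{x}}_{i}}L\big)=\partial_{\boldsymbol{x}_{i}}L yields (2.1) with \phi^{E}(r)=\frac{U^{\prime}(r)}{r}.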

Here, ϕ^E(r) = U′(r)/r represents an energy-based interaction between agents. For example, taking U(r) = NGm_{i′}m_i/r, the model becomes the celebrated model of Newton's universal gravitation. In order to incorporate more complicated behaviors into the model equation, we consider alignment-based interactions (to align velocities, so that short-range repulsion, mid-range alignment, and long-range attraction are all present), auxiliary variables describing internal states of agents (emotion, excitation, phases, etc.), and non-collective forces (interactions with the environment). We also consider a system of N heterogeneous agents, such that the agents belong to K disjoint types {C_k}_{k=1}^{K}, with N_k being the number of agents of type k. They interact according to the system of ODEs

\begin{dcases}m_{i}\ddot{\boldsymbol{x}}_{i}(t)&=F^{\dot{\boldsymbol{x}}}(\boldsymbol{x}_{i}(t),\dot{\boldsymbol{x}}_{i}(t),\xi_{i}(t))+\sum_{i^{\prime}=1}^{N}\frac{1}{N_{\mathpzc{k}_{i^{\prime}}}}\Big{[}\phi^{E}_{\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}}(r_{ii^{\prime}}(t),\boldsymbol{s}^{E}_{ii^{\prime}}(t))(\boldsymbol{x}_{i^{\prime}}(t)-\boldsymbol{x}_{i}(t))\\ &\qquad\qquad\qquad\qquad\qquad\qquad\qquad+\phi^{A}_{\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}}(r_{ii^{\prime}}(t),\boldsymbol{s}^{A}_{ii^{\prime}}(t))(\dot{\boldsymbol{x}}_{i^{\prime}}(t)-\dot{\boldsymbol{x}}_{i}(t))\Big{]}\\ \dot{\xi}_{i}(t)&=F^{\xi}(\boldsymbol{x}_{i}(t),\dot{\boldsymbol{x}}_{i}(t),\xi_{i}(t))+\sum_{i^{\prime}=1}^{N}\frac{1}{N_{\mathpzc{k}_{i^{\prime}}}}\phi^{\xi}_{\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}}(r_{ii^{\prime}}(t),\boldsymbol{s}^{\xi}_{ii^{\prime}}(t))(\xi_{i^{\prime}}(t)-\xi_{i}(t))\end{dcases} \qquad (2.2)

for i=1,,Ni=1,\ldots,N, where 𝓀𝒾,𝓀𝒾{1,,𝒦}\mathpzc{k}_{i},\mathpzc{k}_{i^{\prime}}\in\{1,\cdots,K\} are the indices of the agent type of the agents ii and ii^{\prime} respectively; the interaction kernels, ϕkkE,ϕkkA,ϕkkξ\phi^{E}_{kk^{\prime}},\phi^{A}_{kk^{\prime}},\phi^{\xi}_{kk^{\prime}}, are in general different for interacting agents of different types, and they not only depend on pairwise distance rii(t)r_{ii^{\prime}}(t), but also on other features, given by 𝒔iiE(t),𝒔iiA(t),𝒔iiξ(t)\boldsymbol{s}^{E}_{ii^{\prime}}(t),\boldsymbol{s}^{A}_{ii^{\prime}}(t),\boldsymbol{s}^{\xi}_{ii^{\prime}}(t). For example, the interactions between birds can also depend on the field of vision, not just the distance between pairs of birds. Note that we will often suppress the explicit dependence on time tt when it is clear from context. The unknowns, for which we will construct estimators, in these equations, are the functions ϕ𝓀𝒾𝓀𝒾E\phi^{E}_{\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}}, ϕ𝓀𝒾𝓀𝒾A\phi^{A}_{\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}} and ϕ𝓀𝒾𝓀𝒾ξ\phi^{\xi}_{\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}}; everything else is assumed given.

Table 1 gives a detailed explanation for the definition of the variables used in (2.2). We note that in what follows the notation, {E,A,ξ}{\{E,A,\xi\}}, attached to any expression means that there is one of those maps, functions, etc. for each element in the set {E,A,ξ}{\{E,A,\xi\}}. It is a convenient way to avoid excessive repetition of similar definitions.

Variable Definition
i, i′ : index of agent, from 1, …, N
m_i : mass of agent i
x_i(t), ẋ_i(t), ẍ_i(t) ∈ ℝ^d : position/velocity/acceleration vector of agent i at time t
ξ_i, ξ̇_i : auxiliary variable of agent i, and its derivative
‖·‖ : Euclidean norm in ℝ^d
K : number of types of agents
k, k′ : index of type of agents, from 1, …, K
N_k : number of agents of type k
𝓀_i : type index of agent i
C_k : set of indices of the type-k agents, a subset of {1, …, N}
ϕ_{kk′} (or ϕ_{kk′′}) : interaction kernel: influence of type k′ (or k′′) on type k
F^ẋ, F^ξ : non-collective forces on ẍ_i and ξ̇_i respectively
ϕ^E, ϕ^A, ϕ^ξ : energy-, alignment-, and environment-based interaction kernels respectively
𝒱 : features, 𝒱(x, ẋ, ξ, x′, ẋ′, ξ′) : ℝ^{4d+2} → ℝ^p
s_{(k,k′)}^{{E,A,ξ}} : feature map, π_{kk′}^{{E,A,ξ}} ∘ 𝒱(x, ẋ, ξ, x′, ẋ′, ξ′) : ℝ^{4d+2} → ℝ^{p_{(k,k′)}^{{E,A,ξ}}}
s_{ii′}^{{E,A,ξ}} : feature evaluation, s_{(𝓀_i,𝓀_{i′})}^{{E,A,ξ}}(x_i, ẋ_i, ξ_i, x_{i′}, ẋ_{i′}, ξ_{i′}) ∈ ℝ^{p_{𝓀_i𝓀_{i′}}^{{E,A,ξ}}}
Table 1: Notation for the variables in (2.2).

Specific instances of the feature map 𝒱, together with the corresponding projections π_{kk′}^{{E,A,ξ}}, yield a variety of systems that have found a wide range of applications in physics, biology, ecology, and social science; see the examples in the chart below. We assume that the function 𝒱 is Lipschitz and known, and so are all the π_{kk′}^{{E,A,ξ}}'s; the Lipschitz assumption is needed to control the trajectory error and to ensure the well-posedness of the system. The function 𝒱 is a uniform way to collect all of the different variables (functions of the inputs) used across the (k,k′) pairs and over all of the E, A, ξ functions in the system. This uniformity is helpful when discussing the rate of convergence, among other places. Examples of where this generality matters arise naturally, for instance when the interaction kernels for different pairs (k,k′) have different numbers of variables, or when the energy and alignment kernels depend on r and on additional but distinct variables. From this uniform set of variables we then project (which implies that the feature maps are all Lipschitz) to obtain the relevant function s_{(k,k′)}^{{E,A,ξ}} for each pair and for each element of {E,A,ξ}. Lastly, we evaluate this map at a specific pair of agents (i,i′), which leads to the feature evaluation s_{ii′}^{{E,A,ξ}} used in the model equation (2.2); a toy illustration follows.
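As a toy illustration of this construction (with hypothetical choices of features and projections, not taken from the paper), one can picture 𝒱 returning a common pool of pairwise features, with each projection selecting the subset relevant to one of the E, A, ξ kernels:

```python
import numpy as np

def V(x, xdot, xi, xp, xpdot, xip):
    """A hypothetical uniform feature pool for one pair of agents:
    pairwise distance r, speed-difference norm rdot, and the difference of the
    auxiliary variables.  Any known Lipschitz map of the pair's states would do."""
    r = np.linalg.norm(xp - x)
    rdot = np.linalg.norm(xpdot - xdot)
    return np.array([r, rdot, xip - xi])

# Hypothetical projections: the E kernel uses (r,), the A kernel (r, rdot),
# the xi kernel (r, xi' - xi); indices select coordinates of V's output.
proj = {"E": [0], "A": [0, 1], "xi": [0, 2]}

def s(kind, x, xdot, xi, xp, xpdot, xip):
    """Feature evaluation s_{ii'}^{kind} = pi^{kind} composed with V, at one pair."""
    return V(x, xdot, xi, xp, xpdot, xip)[proj[kind]]
```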

The models encompassed by the form (2.2) are quite diverse; for a concrete example, see section 9.2. We summarize many examples in table 2, with a shaded cell indicating that the model has that characteristic and an empty cell indicating that it does not. A numeric value gives the number of unique variables, |𝓥| or |𝓥_ξ|, used within the EA or ξ portions of the system; this number specifies the dimension in the minimax convergence rate, see section 7.3.

Properties
Model | ϕ^E | ϕ^A | m_i | F^ẋ | ϕ^ξ | F^ξ | K>1 | s^E | s^A | s^ξ | |𝓥| | |𝓥_ξ|
Anticipation Dynamics 2
Celestial Mechanics 1
Cucker-Smale 1
Fish Milling 2D 1
Fish Milling 3D 1
Flocking w. Ext. Poten. 1
Phototaxis 1 1
Predator-Swarm (2nd Order) 1
Lennard-Jones 1
Opinion Dynamics 1
Predator-Swarm (1st Order) 1
Synchronized Oscillator 2 2
Table 2: Summary of the models studied in this work and in [52, 51, 83, 54]

Our second-order model equations cover the first-order models considered in [52, 51, 83] as special cases, see Appendix F, which is why we choose second-order models as the main focus of this work. Furthermore, the dynamical characteristics produced by second-order models are much richer and can model more complicated collective motions and emergent behavior of the agents. Note that our second-order model in (2.2), even when written as a first-order system in more variables, is a strict generalization of the previous first-order analysis.

Rate of convergence notation

For the system (2.2), the rate of convergence depends on the number of variables, i.e., on the dimension of the function(s) we are learning. In order to present a unified theorem, we adopt the following notation. Let 𝓥 denote the set of features in the range of 𝒱 that are arguments of ϕ^E or ϕ^A across the (k,k′) pairs. For many collective dynamical systems 𝓥 = {r}. In the case where the system has both ϕ^E and ϕ^A but 𝓥 = {r}, it is easy to see from Theorem 9 below that we still only pay the 1-dimensional rate. Analogously, we define 𝓥_ξ to be the set of all features in the range of 𝒱 that are arguments of the ϕ^ξ part of (2.2) across all the (k,k′) pairs. This notation is used to frame the convergence rate theorem for the ξ variable as well.

3 Preliminaries and notation

We vectorize the models given in (2.2) in order to give them a more compact description. Letting v_i(t) = ẋ_i(t), we adopt the following notation:

\boldsymbol{X}(t)=\begin{bmatrix}\boldsymbol{x}_{1}(t)\\ \vdots\\ \boldsymbol{x}_{N}(t)\end{bmatrix}\in\mathbb{R}^{Nd},\quad\boldsymbol{V}(t):=\begin{bmatrix}\boldsymbol{v}_{1}(t)\\ \vdots\\ \boldsymbol{v}_{N}(t)\end{bmatrix}\in\mathbb{R}^{Nd},\quad\boldsymbol{\Xi}(t):=\begin{bmatrix}\xi_{1}(t)\\ \vdots\\ \xi_{N}(t)\end{bmatrix}\in\mathbb{R}^{N}.

We introduce a weighted norm to measure the system variables, denoted 𝒮()\left\|\cdot\right\|_{\mathcal{S}(\cdot)} and given by,

\left\|\boldsymbol{Z}\right\|_{\mathcal{S}}^{2}:=\sum_{i=1}^{N}\frac{1}{N_{\mathpzc{k}_{i}}}\left\|\boldsymbol{z}_{i}\right\|^{2}, \qquad (3.1)

for Z = [z_1^T, …, z_N^T]^T with each z_i ∈ ℝ^d or ℝ. Here ‖·‖ is the same norm used in the construction of the pairwise distance data for the interaction kernels. In the subsequent equations we drop the explicit dependence on t for simplicity. The weight factor 1/N_{𝓀_i} is introduced so that agents of different types are weighted equally, and we learn well even when the numbers of agents in the classes are highly non-uniform. With these vectorized notations, the model in (2.2) becomes

\begin{dcases}\vec{m}\circ\ddot{\boldsymbol{X}}&=\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})+\mathbf{f}^{\bm{\phi}^{E}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})+\mathbf{f}^{\bm{\phi}^{A}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\\ \dot{\boldsymbol{\Xi}}&=\mathbf{f}^{\text{nc},\xi}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})+\mathbf{f}^{{\bm{\phi}}^{\xi}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi}).\end{dcases}

Here m=[m1,,mN]TN\vec{m}=\begin{bmatrix}m_{1},&\ldots,&m_{N}\end{bmatrix}^{T}\in\mathbb{R}^{N}, \circ is the Hadamard product, and we use boldface fonts to denote the vectorized form of our estimators (with some once-for-all-fixed ordering of the pairs (k,k)k,k=1,,K(k,k^{\prime})_{k,k^{\prime}=1,\dots,K}):

ϕE={ϕkkE}k,k=1K,ϕA={ϕkkA}k,k=1K,ϕξ={ϕkkξ}k,k=1K.\bm{\phi}^{E}=\{\phi^{E}_{kk^{\prime}}\}_{k,k^{\prime}=1}^{K},\quad\bm{\phi}^{A}=\{\phi^{A}_{kk^{\prime}}\}_{k,k^{\prime}=1}^{K},\quad\bm{\phi}^{\xi}=\{\phi^{\xi}_{kk^{\prime}}\}_{k,k^{\prime}=1}^{K}. (3.2)

We also use the shorthand,

ϕEA=ϕEϕA,\bm{\phi}^{EA}=\bm{\phi}^{E}\oplus\bm{\phi}^{A}, (3.3)

to denote the element of the direct sum of the function spaces containing ϕE,ϕA\bm{\phi}^{E},\bm{\phi}^{A}. This notation will be used throughout on the energy and alignment (EAEA for short) portion of the system in order to simplify the notation.

We denote by 𝐟nc,𝒙˙\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}, the vectorized notation for the non-collective force defined as follows, 𝐟nc,𝒙˙(𝑿,𝑽,𝚵):=[F𝒙˙(𝒙1,𝒙˙1,ξ1),,F𝒙˙(𝒙N,𝒙˙N,ξN)]TNd\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi}):=\begin{bmatrix}F^{\dot{\boldsymbol{x}}}(\boldsymbol{x}_{1},\dot{\boldsymbol{x}}_{1},\xi_{1}),\ldots,F^{\dot{\boldsymbol{x}}}(\boldsymbol{x}_{N},\dot{\boldsymbol{x}}_{N},\xi_{N})\end{bmatrix}^{T}\in\mathbb{R}^{Nd}, and

𝐟ϕE:=[i=1N1N𝓀𝒾ϕ𝓀1𝓀𝒾E(r1i,𝒔1iE)(𝒙i𝒙1),,i=1N1N𝓀𝒾ϕ𝓀𝒩𝓀𝒾E(rNi,𝒔NiE)(𝒙i𝒙N)]TNd.\mathbf{f}^{\bm{\phi}^{E}}:=\bigg{[}\sum_{i^{\prime}=1}^{N}\frac{1}{N_{\mathpzc{k}_{i^{\prime}}}}\phi^{E}_{\mathpzc{k}_{1}\mathpzc{k}_{i^{\prime}}}(r_{1i^{\prime}},\boldsymbol{s}^{E}_{1i^{\prime}})(\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{1}),\ldots,\sum_{i^{\prime}=1}^{N}\frac{1}{N_{\mathpzc{k}_{i^{\prime}}}}\phi^{E}_{\mathpzc{k}_{N}\mathpzc{k}_{i^{\prime}}}(r_{Ni^{\prime}},\boldsymbol{s}^{E}_{Ni^{\prime}})(\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{N})\bigg{]}^{T}\in\mathbb{R}^{Nd}.

We omit the analogous definitions for 𝐟ϕA\mathbf{f}^{\bm{\phi}^{A}} and 𝐟ϕξ\mathbf{f}^{\bm{\phi}^{\xi}}.
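As a small illustration of the weighted norm (3.1), here is a minimal sketch (with a hypothetical row-wise array layout for the agents; not the authors' code) in which each agent's contribution is divided by the size of its own type class:

```python
import numpy as np

def S_norm(Z, types):
    """Weighted norm (3.1): ||Z||_S^2 = sum_i ||z_i||^2 / N_{k_i}, so unevenly
    sized type classes are weighted equally."""
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:                       # scalar per-agent variables such as xi_i
        Z = Z[:, None]
    types = np.asarray(types)
    counts = np.array([(types == k).sum() for k in types])   # N_{k_i} per agent
    return np.sqrt(((Z ** 2).sum(axis=1) / counts).sum())

# example: 3 agents of type 0 and 1 agent of type 1, in R^2
Z = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
print(S_norm(Z, [0, 0, 0, 1]))
```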

3.1 Trajectory Performance Measurement

We will also consider another measurement to assess the learning performance of the estimated kernels in terms of trajectory error. We compare the observed trajectories to the estimated trajectories evolved from the same initial conditions but with the estimated interaction kernels. Let 𝒀(t)=[𝑿T(t),𝑽T(t),𝚵T(t)]T\boldsymbol{Y}(t)=[\boldsymbol{X}^{T}(t),\boldsymbol{V}^{T}(t),\boldsymbol{\Xi}^{T}(t)]^{T} be the trajectory from dynamics generated by the true/unknown interaction kernels with initial condition, 𝒀(0)\boldsymbol{Y}(0); and 𝒀^(t)=[𝑿^T(t),𝑽^T(t),𝚵^T(t)]T\hat{\boldsymbol{Y}}(t)=[\hat{\boldsymbol{X}}^{T}(t),\hat{\boldsymbol{V}}^{T}(t),\hat{\boldsymbol{\Xi}}^{T}(t)]^{T} be the trajectory from dynamics generated by the estimated interaction kernels learned from observation of {𝒀(tl)}l=1L\{\boldsymbol{Y}(t_{l})\}_{l=1}^{L} with the same initial condition, 𝒀(0)\boldsymbol{Y}(0) (i.e., 𝒀^(0)=𝒀(0)\hat{\boldsymbol{Y}}(0)=\boldsymbol{Y}(0)). We define a norm for the difference between 𝒀\boldsymbol{Y} and 𝒀^\hat{\boldsymbol{Y}} at time tlt_{l}:

\left\|\boldsymbol{Y}(t_{l})-\hat{\boldsymbol{Y}}(t_{l})\right\|_{\mathcal{Y}}^{2}=\left\|\boldsymbol{X}(t_{l})-\hat{\boldsymbol{X}}(t_{l})\right\|_{\mathcal{S}}^{2}+\left\|\boldsymbol{V}(t_{l})-\hat{\boldsymbol{V}}(t_{l})\right\|_{\mathcal{S}}^{2}+\left\|\boldsymbol{\Xi}(t_{l})-\hat{\boldsymbol{\Xi}}(t_{l})\right\|_{\mathcal{S}}^{2}\,, \qquad (3.4)

and a corresponding norm on the trajectory 𝒀[0,T]={𝒀(tl)}l=1L\boldsymbol{Y}_{[0,T]}=\{\boldsymbol{Y}(t_{l})\}_{l=1}^{L} (0=t1<<tL=T0=t_{1}<\dots<t_{L}=T):

\left\|\boldsymbol{Y}_{[0,T]}-\hat{\boldsymbol{Y}}_{[0,T]}\right\|_{\text{traj}}=\max_{l=1,\ldots,L}\left\|\boldsymbol{Y}(t_{l})-\hat{\boldsymbol{Y}}(t_{l})\right\|_{\mathcal{Y}}\,. \qquad (3.5)

We also consider a relative version, invariant under changes of units of measure:

𝒀[0,T]𝒀^[0,T]traj=𝒀[0,T]𝒀^[0,T]traj𝒀[0,T]traj.\left\|\boldsymbol{Y}_{[0,T]}-\hat{\boldsymbol{Y}}_{[0,T]}\right\|_{\text{traj}^{*}}=\frac{\left\|\boldsymbol{Y}_{[0,T]}-\hat{\boldsymbol{Y}}_{[0,T]}\right\|_{\text{traj}}}{\left\|\boldsymbol{Y}_{[0,T]}\right\|_{\text{traj}}}\,.

Lastly, we report errors between 𝑿[0,T]\boldsymbol{X}_{[0,T]} and 𝑿^[0,T]\hat{\boldsymbol{X}}_{[0,T]},

𝑿[0,T]𝑿^[0,T]𝒮=maxl=1,,L{𝑿(tl)𝑿^(tl)𝒮}maxl=1,,L{𝑿(tl)𝒮}.\left\|\boldsymbol{X}_{[0,T]}-\hat{\boldsymbol{X}}_{[0,T]}\right\|_{\mathcal{S}^{*}}=\frac{\max_{l=1,\ldots,L}\{\left\|\boldsymbol{X}(t_{l})-\hat{\boldsymbol{X}}(t_{l})\right\|_{\mathcal{S}}\}}{\max_{l=1,\ldots,L}\{\left\|\boldsymbol{X}(t_{l})\right\|_{\mathcal{S}}\}}.

Similar re-scaled norms are used for the difference between 𝑽[0,T]\boldsymbol{V}_{[0,T]} and 𝑽^[0,T]\hat{\boldsymbol{V}}_{[0,T]}, and for the difference between 𝚵[0,T]\boldsymbol{\Xi}_{[0,T]} and 𝚵^[0,T]\hat{\boldsymbol{\Xi}}_{[0,T]}.
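A minimal sketch of the trajectory error metrics (3.4)-(3.5) and the relative version, assuming true and estimated trajectories are stored as lists of (X, V, Ξ) snapshots over the L observation times, and reusing the hypothetical S_norm helper from the sketch in the previous subsection:

```python
import numpy as np

def Y_dist(Xa, Va, Xia, Xb, Vb, Xib, types):
    """Squared state distance (3.4) between two system snapshots."""
    return (S_norm(Xa - Xb, types) ** 2
            + S_norm(Va - Vb, types) ** 2
            + S_norm(Xia - Xib, types) ** 2)

def traj_error(true_traj, est_traj, types):
    """Sup-in-time trajectory error (3.5) and its unit-invariant relative version."""
    errs = [np.sqrt(Y_dist(*a, *b, types)) for a, b in zip(true_traj, est_traj)]
    norms = [np.sqrt(Y_dist(*a, np.zeros_like(a[0]), np.zeros_like(a[1]),
                            np.zeros_like(a[2]), types)) for a in true_traj]
    abs_err = max(errs)
    return abs_err, abs_err / max(norms)
```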

3.2 Function spaces

We begin by describing some basic ideas about measures and function spaces. Consider a compact or precompact set 𝒰 ⊂ ℝ^p for some p; we define the infinity norm as

h=esssupx𝒰|h(x)|.\|h\|_{\infty}=\operatorname{ess}\sup_{x\in\mathcal{U}}|h(x)|.

Further define, L(𝒰)L^{\infty}(\mathcal{U}) as the space of real valued functions defined on 𝒰\mathcal{U} with finite \infty-norm. A key function space we need to consider is,

Cck,α(𝒰) for k,0<α1,C_{c}^{k,\alpha}(\mathcal{\mathcal{U}})\text{ for }k\in\mathbb{N},0<\alpha\leq 1,

defined as the space of compactly supported, k-times continuously differentiable functions whose k-th derivative is Hölder continuous of order α. We can then consider vectorizations of these spaces as

𝑳(𝒰):=k,k=1,1K,KL(𝒰),\boldsymbol{L}^{\infty}(\mathcal{U}):=\bigoplus_{k,k^{\prime}=1,1}^{K,K}L^{\infty}(\mathcal{U}),

which has the vectorized infinity norm given by

𝒇=maxk,kfkk,𝒇𝑳(𝒰).\|\boldsymbol{f}\|_{\infty}=\max_{k,k^{{}^{\prime}}}\left\|f_{kk^{\prime}}\right\|_{\infty},\forall\boldsymbol{f}\in\boldsymbol{L}^{\infty}(\mathcal{U}).

Similarly, we consider direct sums of measures, with corresponding vectorized function spaces. This is done explicitly in section 5.1 where we define a weighted L2L^{2} space (under a particular measure with an associated norm that induces a weighting) and consider vectorized versions of it.

In order to develop a theoretical foundation, and in line with the literature, we make assumptions on the class of functions that can arise as interaction kernels in the model (2.2). As the agents get farther and farther apart, they should eventually have no influence on each other; this approximates the vanishing, or rapidly decaying, nature of pairwise interactions at large distances that is observed in many physical models. We thus assume a maximum interaction radius for the interaction kernels, representing the maximal distance at which one agent can influence another. Similar assumptions will be made on the feature maps.

More precisely, for each pair (k,k)(k,k^{\prime}) we consider the following spaces,

𝒦kk{E,A,ξ}:=L([Rkkmin,Rkkmax]×𝕊kk{E,A,ξ}),\displaystyle\mathcal{K}_{kk^{\prime}}^{\{E,A,\xi\}}:=L^{\infty}([R_{kk^{\prime}}^{\min},R_{kk^{\prime}}^{\max}]\times\mathbb{S}^{{\{E,A,\xi\}}}_{kk^{\prime}})\,, (3.6)

for k, k′ = 1, …, K, where we recall that the {E,A,ξ} notation means that there is an admissible space (or, more generally, an expression) for each element of the set {E,A,ξ}. Here, R_{kk′}^{min} and R_{kk′}^{max} are, respectively, the minimum and maximum possible interaction radii for agents in C_{k′} influencing agents in C_k. Similarly, 𝕊^E_{kk′}, 𝕊^A_{kk′}, 𝕊^ξ_{kk′} are compact sets in ℝ^{p^E_{kk′}}, ℝ^{p^A_{kk′}}, ℝ^{p^ξ_{kk′}} which contain the ranges of the feature maps s^E_{kk′}, s^A_{kk′}, and s^ξ_{kk′}.

We can also define the vectorizations of these spaces, which we will look at subsets of when we define the admissible spaces below, given by

𝓚{E,A,ξ}:=k,k=1,1K,K𝒦kk{E,A,ξ}.\bm{\mathcal{K}}^{{\{E,A,\xi\}}}:=\bigoplus_{k,k^{\prime}=1,1}^{K,K}\mathcal{K}_{kk^{\prime}}^{\{E,A,\xi\}}.

In order to provide uniform bounds, we introduce the following sets:

𝐒E:=k,k𝕊kkE𝐒A:=k,k𝕊kkA𝐒ξ:=k,k𝕊kkξ\mathbf{S}^{E}:=\prod_{k,k^{\prime}}\mathbb{S}_{kk^{\prime}}^{E}\qquad\mathbf{S}^{A}:=\prod_{k,k^{\prime}}\mathbb{S}_{kk^{\prime}}^{A}\qquad\mathbf{S}^{\xi}:=\prod_{k,k^{\prime}}\mathbb{S}_{kk^{\prime}}^{\xi} (3.7)

Next, we introduce notation to bound the interaction radii on the pairwise distances and pairwise velocities.

Remark 1

Here we note an important distinction: there is the range of the norms of pairwise interactions generated by the dynamics, and there is the underlying support of the interaction kernels themselves. These two notions of interaction radius are distinct; we will comment on the estimation and subtleties of both in the numerical algorithm section.

We let

𝐑\displaystyle\mathbf{R} :=k,k[Rkkmin,Rkkmax]\displaystyle:=\prod_{k,k^{\prime}}[R_{kk^{\prime}}^{\min},R_{kk^{\prime}}^{\max}] (3.8)
R\displaystyle R :=maxk,kRkkmax.\displaystyle:=\max_{k,k^{\prime}}R_{kk^{\prime}}^{\max}. (3.9)

Notice that a uniform support for all interaction kernels on the pairwise distance variable rr is [0,R][0,R].

We denote the distribution of the initial conditions by 𝝁𝒀\bm{\mu^{\boldsymbol{Y}}}. This measure is unknown and is the source of randomness in our system. It reflects that we will observe trajectories which start at different initial conditions, but that evolve from the same dynamical system, which allows for learnability. For the numerical experiments we will choose our initial conditions to be sampled uniformly over a system dependent range.

Due to the form of (2.2), and the norms defined below, we consider the following dynamics induced ranges of the variables. Note that the first supremum is taken over the initial conditions, each of which generate different solutions 𝒙i(t)\boldsymbol{x}_{i}(t) which are used in the second supremum.

Rx˙\displaystyle R_{\dot{x}} :=sup𝒀(0)𝝁𝒀supt[0,T]maxi,i𝒙˙i(t)𝒙˙i(t)\displaystyle:=\sup_{\boldsymbol{Y}(0)\sim\bm{\mu^{\boldsymbol{Y}}}}\sup_{t\in[0,T]}\max_{i,i^{\prime}}\|\dot{\boldsymbol{x}}_{i}(t)-\dot{\boldsymbol{x}}_{i^{\prime}}(t)\| (3.10)
Rξ\displaystyle R_{\xi} :=sup𝒀(0)𝝁𝒀supt[0,T]maxi,iξi(t)ξi(t)\displaystyle:=\sup_{\boldsymbol{Y}(0)\sim\bm{\mu^{\boldsymbol{Y}}}}\sup_{t\in[0,T]}\max_{i,i^{\prime}}\|\xi_{i}(t)-\xi_{i^{\prime}}(t)\| (3.11)

We assume in this work that both of these quantities Rx˙,RξR_{\dot{x}},R_{\xi} are finite. This will be easily satisfied if the measures 𝝁𝒙˙,𝝁ξ\bm{\mu^{\dot{x}}},\bm{\mu}^{\xi} (specifying the distribution of the initial conditions on the velocities and the environment variable) are compactly supported. This follows by the assumptions on the interaction kernels below and that we only consider finite final time TT.

In order for the second-order systems given by (2.2) to be well-posed, we assume that the interaction kernels lie in admissible sets. For each of the kernels, let 𝒰kk:=[Rkkmin,Rkkmax]×𝕊kk{E,A,ξ}\mathcal{U}_{kk^{\prime}}:=[R_{kk^{\prime}}^{\min},R_{kk^{\prime}}^{\max}]\times\mathbb{S}^{{\{E,A,\xi\}}}_{kk^{\prime}} and define

\boldsymbol{\mathcal{K}}_{S_{{\{E,A,\xi\}}}}^{{\{E,A,\xi\}}}:=\Big{\{}\left(\phi^{{\{E,A,\xi\}}}_{kk^{\prime}}\right)_{k,k^{\prime}=1,1}^{K,K}:\ \phi^{{\{E,A,\xi\}}}_{kk^{\prime}}\in C_{c}^{0,1}\left(\mathcal{U}_{kk^{\prime}}\right),\ \left\|\phi^{{\{E,A,\xi\}}}_{kk^{\prime}}\right\|_{\infty}+\operatorname{Lip}\left[\phi^{{\{E,A,\xi\}}}_{kk^{\prime}}\right]\leq S_{{\{E,A,\xi\}}}\ \text{ for all }k,k^{\prime}=1,\dots,K\Big{\}}\,. \qquad (3.12)

When estimating the EAEA part of the system, we will consider the direct sum admissible space, for SEAmax{SE,SA}S_{EA}\geq\max\{S_{E},S_{A}\},

𝓚SEAEA:=𝓚SEE𝓚SAA\boldsymbol{\mathcal{K}}_{S_{EA}}^{EA}:=\boldsymbol{\mathcal{K}}_{S_{E}}^{E}\oplus\boldsymbol{\mathcal{K}}_{S_{A}}^{A} (3.13)

The admissibility assumptions allow us to establish properties such as existence and uniqueness of solutions to (2.2) as well as to have control on the trajectory errors in finite time [0,T][0,T]. It further allows us to show regularity and absolute continuity with respect to Lebesgue measure of the appropriate performance measures defined in section 5.1.

In the learning approach, we will consider hypothesis spaces that we will search in order to estimate the various interaction kernels. The hypothesis spaces corresponding to {ϕkk{E,A,ξ}}\{\phi_{kk^{\prime}}^{{\{E,A,\xi\}}}\} are denoted as {kk{E,A,ξ}}\{\mathcal{H}_{kk^{\prime}}^{{\{E,A,\xi\}}}\} and we vectorize them as,

𝓗{E,A,ξ}:=k,k=1,1K,Kkk{E,A,ξ}.\displaystyle\boldsymbol{\mathcal{H}}^{{\{E,A,\xi\}}}:=\bigoplus_{k,k^{\prime}=1,1}^{K,K}\mathcal{H}_{kk^{\prime}}^{{\{E,A,\xi\}}}. (3.14)

Analogous to our simplified notation for ϕEA,𝝋EA\bm{\phi}^{EA},\bm{\varphi}^{EA} described in (3.3), we define the direct sum of the hypothesis spaces as,

𝓗EA:=𝓗E𝓗A\boldsymbol{\mathcal{H}}^{EA}:=\boldsymbol{\mathcal{H}}^{E}\oplus\boldsymbol{\mathcal{H}}^{A} (3.15)

We will consider specific hypothesis spaces during the learning theory and numerical algorithm sections.

4 Inference problem and learning approach

In this section, we first introduce the problem of inferring the interaction kernels from observations of trajectory data and give a brief review and generalization of the learning approach proposed in the works [52] and [83].

4.1 Problem setting

Our observation data is given by {Y^{(m)}(t_l), Ẏ^{(m)}(t_l)}_{m=1,l=1}^{M,L} for 0 = t_1 < t_2 < ⋯ < t_L = T. Here Y = [y_1^T, ⋯, y_N^T]^T with y_i = [x_i^T(t), ẋ_i^T(t), ξ_i(t)]^T. For simplicity, we only consider equidistant observation times: t_l − t_{l−1} = h for l = 2, …, L; the proposed algorithm, with slight modifications, also works for non-equispaced times. The M sets of discrete trajectory data are generated by the system (2.2) with the unknown interaction kernels ϕ^E, ϕ^A, ϕ^ξ, and their initial conditions Y^{(m)}(0) are drawn i.i.d. from 𝝁^Y, a probability measure defined on ℝ^{N(2d+1)}. The goal is to infer the unknown interaction kernels directly from the data.

4.2 Loss functionals

Given observations, the references [52, 51, 83] proposed the following empirical error functionals (recalling the notational convention (3.3)):

\begin{aligned}
\boldsymbol{\mathcal{E}}_{M}^{EA}(\bm{\varphi}^{EA})&=\frac{1}{LM}\sum_{l=1,m=1}^{L,M}\Big{\|}\ddot{\boldsymbol{X}}^{(m)}(t_{l})-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X}^{(m)}(t_{l}),\boldsymbol{V}^{(m)}(t_{l}),\boldsymbol{\Xi}^{(m)}(t_{l}))\\
&\qquad\qquad-\mathbf{f}^{\bm{\varphi}^{E}}(\boldsymbol{X}^{(m)}(t_{l}),\boldsymbol{V}^{(m)}(t_{l}),\boldsymbol{\Xi}^{(m)}(t_{l}))-\mathbf{f}^{\bm{\varphi}^{A}}(\boldsymbol{X}^{(m)}(t_{l}),\boldsymbol{V}^{(m)}(t_{l}),\boldsymbol{\Xi}^{(m)}(t_{l}))\Big{\|}^{2}_{\mathcal{S}},\\
\boldsymbol{\mathcal{E}}_{M}^{\xi}(\bm{\varphi}^{\xi})&=\frac{1}{LM}\sum_{l=1,m=1}^{L,M}\Big{\|}\dot{\bm{\Xi}}^{(m)}(t_{l})-\mathbf{f}^{\text{nc},\xi}(\boldsymbol{X}^{(m)}(t_{l}),\boldsymbol{V}^{(m)}(t_{l}),\boldsymbol{\Xi}^{(m)}(t_{l}))\\
&\qquad\qquad-\mathbf{f}^{\bm{\varphi}^{\xi}}(\boldsymbol{X}^{(m)}(t_{l}),\boldsymbol{V}^{(m)}(t_{l}),\boldsymbol{\Xi}^{(m)}(t_{l}))\Big{\|}^{2}_{\mathcal{S}}\,.
\end{aligned}

The estimators of interaction kernels are defined as the minimizers of the error functionals 𝓔MEA\boldsymbol{\mathcal{E}}_{M}^{EA} and 𝓔Mξ\boldsymbol{\mathcal{E}}_{M}^{\xi} over 𝓗EA\boldsymbol{\mathcal{H}}^{EA} and 𝓗ξ\boldsymbol{\mathcal{H}}^{\xi} respectively, i.e.

\widehat{\bm{\phi}}^{EA}=\underset{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}}{\operatorname{arg\,min}}\;\boldsymbol{\mathcal{E}}_{M}^{EA}(\bm{\varphi}^{EA})\,,\qquad\widehat{\bm{\phi}}^{\xi}=\underset{\bm{\varphi}^{\xi}\in\boldsymbol{\mathcal{H}}^{\xi}}{\operatorname{arg\,min}}\;\boldsymbol{\mathcal{E}}_{M}^{\xi}(\bm{\varphi}^{\xi}). \qquad (4.3)

For the learning theory, we will consider the following error functionals. On the energy and alignment portion, we consider,

\begin{aligned}
\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA})&:=\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\frac{1}{L}\sum_{l=1}^{L}\Big{[}\Big{\|}\ddot{\boldsymbol{X}}(t_{l})-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X}(t_{l}),\boldsymbol{V}(t_{l}),\boldsymbol{\Xi}(t_{l}))\\
&\qquad\qquad-\mathbf{f}^{\bm{\varphi}^{E}}(\boldsymbol{X}(t_{l}),\boldsymbol{V}(t_{l}),\boldsymbol{\Xi}(t_{l}))-\mathbf{f}^{\bm{\varphi}^{A}}(\boldsymbol{X}(t_{l}),\boldsymbol{V}(t_{l}),\boldsymbol{\Xi}(t_{l}))\Big{\|}^{2}_{\mathcal{S}}\Big{]}. \qquad (4.4)
\end{aligned}

Similarly, on the ξ\xi portion, we consider,

\boldsymbol{\mathcal{E}}_{\infty}^{\xi}(\bm{\varphi}^{\xi})=\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\frac{1}{L}\sum_{l=1}^{L}\Big{[}\Big{\|}\dot{\bm{\Xi}}(t_{l})-\mathbf{f}^{\text{nc},\xi}(\boldsymbol{X}(t_{l}),\boldsymbol{V}(t_{l}),\boldsymbol{\Xi}(t_{l}))-\mathbf{f}^{\bm{\varphi}^{\xi}}(\boldsymbol{X}(t_{l}),\boldsymbol{V}(t_{l}),\boldsymbol{\Xi}(t_{l}))\Big{\|}^{2}_{\mathcal{S}}\Big{]}. \qquad (4.5)

We can relate these error functionals to the empirical error functionals introduced at the start of the section as follows: by the Strong Law of Large Numbers, with probability one,

𝓔EA(𝝋EA)=limM𝓔MEA(𝝋EA)and𝓔ξ(𝝋ξ)=limM𝓔Mξ(𝝋ξ).\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA})=\lim_{M\to\infty}\boldsymbol{\mathcal{E}}_{M}^{EA}(\bm{\varphi}^{EA})\quad\text{and}\quad\boldsymbol{\mathcal{E}}_{\infty}^{\xi}(\bm{\varphi}^{\xi})=\lim_{M\to\infty}\boldsymbol{\mathcal{E}}_{M}^{\xi}(\bm{\varphi}^{\xi}).

The relationship between the theoretical and empirical error functionals will play a key role in the learning theory.

4.3 Overview of theoretical contributions

The papers [83, 52, 51] have applied this learning approach to a variety of systems and the extensive numerical simulations demonstrate the effectiveness of the approach. However, theoretical guarantees of the proposed approach for second order systems had not been fully developed and will be the main focus of this paper. Our theory contains the first-order theory in [51] as a special case, as discussed in Appendix F. We focus on the regime where LL is fixed but MM\rightarrow\infty. We provide a learning theory that answers the fundamental questions:

  • Quantitative description of estimator errors. We will introduce measures to describe how close the estimators are to the true interaction kernels, that lead to novel dynamics-adapted norms. See section 5.

  • Identifiability of kernels. We will establish the existence and uniqueness of the estimators as well as relate the solvability of our inverse problem to a fundamental coercivity property. See section 6.

  • Consistency and optimal convergence rate of the estimators. We will prove theorems on strong consistency and optimal minimax rates of convergence of the estimators, which exploit the separability of the learning on the energy and alignment from the learning on the environment variable. See section 7.

  • Trajectory prediction. We prove a theorem that compares the dynamics evolved using the estimated kernels with the true dynamics. Our result demonstrates that the expected supremum error (over the entire time interval) of our trajectories is controlled by the norm of the difference between the true and estimated kernels, further justifying our choice of norms and estimation procedure. See section 8.

4.4 Hypothesis Space and Algorithm

First, we choose finite-dimensional subspaces of ℋ^E_{kk′}, i.e., ℋ̃^E_{kk′} ⊂ ℋ^E_{kk′}, whose basis functions are piecewise polynomials of varying degrees (other types of basis functions are also possible, e.g., clamped B-splines as shown in [52]); similarly for ℋ̃^A_{kk′} ⊂ ℋ^A_{kk′}. Hence, each test function φ^E_{kk′}, φ^A_{kk′} can be expressed as a linear combination of the basis functions as follows:

\begin{aligned}
\varphi^{E}_{kk^{\prime}}(r,\boldsymbol{s}^{E})&=\sum_{\eta_{kk^{\prime}}^{E}=1}^{n_{kk^{\prime}}^{E}}\alpha_{k,k^{\prime},\eta_{kk^{\prime}}^{E}}^{E}\psi^{\boldsymbol{x}}_{k,k^{\prime},\eta_{kk^{\prime}}^{E}}(r,\boldsymbol{s}^{E}),\\
\varphi^{A}_{kk^{\prime}}(r,\dot{r},\boldsymbol{s}^{A})&=\sum_{\eta_{kk^{\prime}}^{A}=1}^{n_{kk^{\prime}}^{A}}\alpha_{k,k^{\prime},\eta_{kk^{\prime}}^{A}}^{A}\psi^{\boldsymbol{x}}_{k,k^{\prime},\eta_{kk^{\prime}}^{A}}(r,\dot{r},\boldsymbol{s}^{A}).
\end{aligned}

Substituting this linear combination back into (4.2) and minimizing over the coefficients, we obtain a system of linear equations,

A^{EA}\vec{\alpha}^{EA}=\vec{b}^{EA}.

Here, α⃗^{EA} = [(α⃗^E)^T (α⃗^A)^T]^T ∈ ℝ^{n^{EA}}, with α⃗^E and α⃗^A being the collections of the coefficients α^E_{k,k′,η^E_{kk′}} and α^A_{k,k′,η^A_{kk′}} respectively. Moreover, A^{EA} ∈ ℝ^{n^{EA}×n^{EA}} and b⃗^{EA} ∈ ℝ^{n^{EA}}. See Appendix H for full details.
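To make the assembly concrete, here is a much-simplified sketch (one agent type, unit masses, kernels depending on r only, a piecewise-constant basis on [0,R]; not the authors' implementation) of how the least-squares problem behind A^{EA} α⃗^{EA} = b⃗^{EA} arises: each observed snapshot contributes one regression row per agent and per spatial coordinate, built by summing the basis functions over pairwise data.

```python
import numpy as np

def basis_matrix(r, n, R):
    """Piecewise-constant (indicator) basis on [0, R]: Psi[j, eta] = 1 iff r_j lies in bin eta."""
    idx = np.minimum((r / R * n).astype(int), n - 1)
    Psi = np.zeros((len(r), n))
    Psi[np.arange(len(r)), idx] = 1.0
    return Psi

def assemble_rows(X, V, n, R):
    """Regression rows for one snapshot; columns = [E-basis coeffs | A-basis coeffs]."""
    N, d = X.shape
    rows = np.zeros((N * d, 2 * n))
    for i in range(N):
        mask = np.arange(N) != i
        dx, dv = X[mask] - X[i], V[mask] - V[i]
        r = np.linalg.norm(dx, axis=1)
        Psi = basis_matrix(r, n, R)                                  # (N-1, n)
        rows[i*d:(i+1)*d, :n] = (Psi[:, None, :] * dx[:, :, None]).sum(0) / N
        rows[i*d:(i+1)*d, n:] = (Psi[:, None, :] * dv[:, :, None]).sum(0) / N
    return rows

def fit_kernels(snapshots, accels, n=20, R=5.0):
    """Least-squares estimate of the basis coefficients of (phi^E, phi^A) from
    snapshots [(X, V), ...] and the corresponding observed accelerations."""
    G = np.vstack([assemble_rows(X, V, n, R) for X, V in snapshots])
    y = np.concatenate([a.reshape(-1) for a in accels])
    # solves the least-squares problem whose normal equations play the role of A alpha = b
    coef, *_ = np.linalg.lstsq(G, y, rcond=None)
    return coef[:n], coef[n:]                                        # alpha^E, alpha^A
```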

The overhead memory storage needed is MLN(d(5+n^{EA}+n^ξ)+3): MLN(4d+2) for the trajectory data, MLNd(n^{EA}+n^ξ) (here n^{EA} = n^E + n^A, the sum of the numbers of basis functions for E and A) for the learning matrices, and MLN(d+1) for the right-hand-side vectors. Hence, if M ≫ 𝒪(1), we can parallelize over m in order to reduce the per-core overhead memory to M_{per core}(LN(d(5+n^{EA}+n^ξ)+3)) with M_{per core} = M/n_{cores}. The final storage of A and b⃗ only needs n^{EA}(n^{EA}+1) + n^ξ(n^ξ+1).

The computational cost of solving the system A^{EA} α⃗^{EA} = b⃗^{EA} is MLN² + MLd(n^{EA})² + (n^{EA})² log(n^{EA}): MLN² for computing the pairwise data, MLd(n^{EA})² for constructing the learning matrix and right-hand-side vector, and (n^{EA})² log(n^{EA}) for solving the linear system. When choosing the optimal number of basis functions, i.e., n^{EA}_{opt} = M^{|𝓥|/(2s+|𝓥|)}, we end up with a total computational cost of order M^{1+2|𝓥|/(2s+|𝓥|)}, which is slightly super-linear in M, but less than quadratic in the common situation when 2s+|𝓥| ≥ 2|𝓥|.

A similar analysis for solving A^ξ α⃗^ξ = b⃗^ξ shows that its computational cost is also slightly super-linear in M.

5 Learning theory

Given estimators, how to best measure the estimation error? This is the first question to address in order to have a full understanding of the performance. In earlier works, a set of probability measures that are adapted to the dynamical system and learning setting are introduced to describe how close the estimators are to the true interaction kernels. We will generalize these ideas and measures to our learning problem.

5.1 Probability measures

The variables we are working with have a natural distribution on the space of pairwise distances and features. Together with the fact that our functions have as arguments the variables (r,r˙,𝒔E,𝒔A,𝒔ξr,\dot{r},\boldsymbol{s}^{E},\boldsymbol{s}^{A},\boldsymbol{s}^{\xi}) which are functions of the state space, we are led to consider probability measures which account for the distribution of the data, while respecting the interaction structure of the system. For further intuition into these measures, see [51, 52, 13]. For each interacting pair (k,k)(k,k^{\prime}), we introduce the following probability measures,

{ρTEA,k,k(r,𝒔E,r˙,𝒔A):=𝔼𝒀0𝝁𝒀1TNkkt=0TiCk,iCkiiδii,t(r,𝒔E,r˙,𝒔A)dtρTEA,L,k,k(r,𝒔E,r˙,𝒔A):=𝔼𝒀0𝝁𝒀1LNkkl=1LiCk,iCkiiδii,tl(r,𝒔E,r˙,𝒔A)ρTEA,L,M,k,k(r,𝒔E,r˙,𝒔A):=1MLNkkl,m=1L,MiCk,iCkiiδii,tl,m(r,𝒔E,r˙,𝒔A)\begin{dcases}\rho_{T}^{EA,k,k^{\prime}}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A})&:=\mathbb{E}_{\boldsymbol{Y}_{0}\sim\bm{\mu^{\boldsymbol{Y}}}}\frac{1}{TN_{kk^{\prime}}}\int_{t=0}^{T}\sum_{\begin{subarray}{c}i\in C_{k},i^{\prime}\in C_{k^{\prime}}\\ i\neq i^{\prime}\end{subarray}}\delta_{ii^{\prime},t}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A})\,dt\\ \rho_{T}^{EA,L,k,k^{\prime}}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A})&:=\mathbb{E}_{\boldsymbol{Y}_{0}\sim\bm{\mu^{\boldsymbol{Y}}}}\frac{1}{LN_{kk^{\prime}}}\sum_{l=1}^{L}\sum_{\begin{subarray}{c}i\in C_{k},i^{\prime}\in C_{k^{\prime}}\\ i\neq i^{\prime}\end{subarray}}\delta_{ii^{\prime},t_{l}}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A})\\ \rho_{T}^{EA,L,M,k,k^{\prime}}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A})&:=\frac{1}{MLN_{kk^{\prime}}}\sum_{l,m=1}^{L,M}\sum_{\begin{subarray}{c}i\in C_{k},i^{\prime}\in C_{k^{\prime}}\\ i\neq i^{\prime}\end{subarray}}\delta_{ii^{\prime},t_{l},m}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A})\end{dcases} (5.1)

where Nkk=NkNkN_{kk^{\prime}}=N_{k}N_{k^{\prime}} for kkk\neq k^{\prime} and Nkk=(Nk2)N_{kk^{\prime}}={N_{k}\choose 2} for k=kk=k^{\prime}, and the Dirac measures are defined as

{δii,t(r,𝒔E,r˙,𝒔A):=δrii(t),𝒔iiE(t),r˙ii(t),𝒔iiA(t)(r,𝒔E,r˙,𝒔A)δii,t,m(r,𝒔E,r˙,𝒔A):=δrii(m)(t),𝒔iiE,(m)(t),r˙ii(m)(t),𝒔iiA,(m)(t)(r,𝒔E,r˙,𝒔A).\begin{dcases}\delta_{ii^{\prime},t}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A})&:=\delta_{r_{ii^{\prime}}(t),\boldsymbol{s}^{E}_{ii^{\prime}}(t),\dot{r}_{ii^{\prime}}(t),\boldsymbol{s}^{A}_{ii^{\prime}}(t)}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A})\\ \delta_{ii^{\prime},t,m}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A})&:=\delta_{r_{ii^{\prime}}^{(m)}(t),\boldsymbol{s}^{E,(m)}_{ii^{\prime}}(t),\dot{r}_{ii^{\prime}}^{(m)}(t),\boldsymbol{s}^{A,(m)}_{ii^{\prime}}(t)}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A}).\end{dcases}

We use a superscript (m)(m) to denote that the variable is calculated from the data from that mthm^{\text{th}} trajectory. For example, 𝒔iiE,(m)(t)\boldsymbol{s}^{E,(m)}_{ii^{\prime}}(t) denotes the feature maps between agents ii and ii^{\prime} at time tt along the mthm^{\text{th}} trajectory.

The measure ρTEA,L,k,k\rho_{T}^{EA,L,k,k^{\prime}} is the discrete counterpart of ρTEA,k,k\rho_{T}^{EA,k,k^{\prime}} at the observation time instances. In practice, we can use ρTEA,L,M,k,k\rho_{T}^{EA,L,M,k,k^{\prime}} to approximate ρTEA,L,k,k\rho_{T}^{EA,L,k,k^{\prime}} since it can be computed from observational data and will converge to ρTEA,L,k,k\rho_{T}^{EA,L,k,k^{\prime}} as MM\rightarrow\infty.
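To illustrate how this empirical measure is assembled from data, the Python sketch below collects the pairwise samples that constitute ρTEA,L,M\rho_{T}^{EA,L,M} in the simplest possible setting: a single agent type, no additional feature maps, and, as a simplifying assumption, rii=𝒙i𝒙ir_{ii^{\prime}}=\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\| and r˙ii=𝒗i𝒗i\dot{r}_{ii^{\prime}}=\|\boldsymbol{v}_{i^{\prime}}-\boldsymbol{v}_{i}\|. The arrays X and V are synthetic stand-ins for observed trajectories.

import numpy as np

rng = np.random.default_rng(0)
M, L, N, d = 50, 10, 8, 2              # trajectories, observation times, agents, dimension
X = rng.normal(size=(M, L, N, d))      # stand-in for positions x_i^{(m)}(t_l)
V = rng.normal(size=(M, L, N, d))      # stand-in for velocities

def pairwise_samples(X, V):
    # Return all (r, rdot) samples; each row is one atom of the empirical measure.
    samples = []
    M, L, N, _ = X.shape
    for m in range(M):
        for l in range(L):
            dx = X[m, l, None, :, :] - X[m, l, :, None, :]   # entry (i, i') is x_i' - x_i
            dv = V[m, l, None, :, :] - V[m, l, :, None, :]
            r = np.linalg.norm(dx, axis=-1)
            rdot = np.linalg.norm(dv, axis=-1)
            iu = np.triu_indices(N, k=1)                     # distinct pairs i != i'
            samples.append(np.stack([r[iu], rdot[iu]], axis=1))
    return np.concatenate(samples, axis=0)

rho_samples = pairwise_samples(X, V)
print(rho_samples.shape)               # (M * L * N * (N - 1) / 2, 2)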

We also consider the marginal distributions

ρTE,k,k(r,𝒔E):=r˙𝒔AρTEA,k,kd𝒔Adr˙,ρTA,k,k(r,r˙,𝒔A):=𝒔EρTEA,k,kd𝒔E\rho_{T}^{E,k,k^{\prime}}(r,\boldsymbol{s}^{E}):=\int_{\dot{r}}\int_{\boldsymbol{s}^{A}}\rho_{T}^{EA,k,k^{\prime}}\,d\boldsymbol{s}^{A}\,d\dot{r}\quad,\quad\rho_{T}^{A,k,k^{\prime}}(r,\dot{r},\boldsymbol{s}^{A}):=\int_{\boldsymbol{s}^{E}}\rho_{T}^{EA,k,k^{\prime}}\,d\boldsymbol{s}^{E} (5.2)

and ρTE,L,k,k(r,𝒔E)\rho_{T}^{E,L,k,k^{\prime}}(r,\boldsymbol{s}^{E}), ρTE,L,M,k,k(r,𝒔E)\rho_{T}^{E,L,M,k,k^{\prime}}(r,\boldsymbol{s}^{E}), ρTA,L,k,k(r,r˙,𝒔A)\rho_{T}^{A,L,k,k^{\prime}}(r,\dot{r},\boldsymbol{s}^{A}), ρTA,L,M,k,k(r,r˙,𝒔A)\rho_{T}^{A,L,M,k,k^{\prime}}(r,\dot{r},\boldsymbol{s}^{A}) defined analogously as above. The empirical measures, ρTE,L,M,k,k,ρTA,L,M,k,k\rho_{T}^{E,L,M,k,k^{\prime}},\rho_{T}^{A,L,M,k,k^{\prime}}, are the ones used in the actual algorithm to quantify the learning performances of the estimators ϕ^kkE\widehat{\phi}^{E}_{kk^{\prime}} and ϕ^kkA\widehat{\phi}^{A}_{kk^{\prime}} respectively. They are also crucial in discussing the separability of ϕ^kkE\widehat{\phi}^{E}_{kk^{\prime}} and ϕ^kkA\widehat{\phi}^{A}_{kk^{\prime}}.

For ease of notation, we introduce the following measures to handle the heterogeneity of the system; they are used to describe the error over all of the pairs (k,k)(k,k^{\prime}).

𝝆TEA,L=k,k=1,1K,KρTEA,L,kk,𝝆TEA=k,k=1,1K,KρTEA,kk,𝑳2(𝝆TEA,L)=k,k=1,1K,KL2(ρTEA,L,kk)\bm{\rho}_{T}^{EA,L}=\bigoplus_{k,k^{\prime}=1,1}^{K,K}\rho_{T}^{EA,L,kk^{\prime}},\quad\bm{\rho}_{T}^{EA}=\bigoplus_{k,k^{\prime}=1,1}^{K,K}\rho_{T}^{EA,kk^{\prime}},\quad\bm{L}^{2}\left(\bm{\rho}_{T}^{EA,L}\right)=\bigoplus_{k,k^{\prime}=1,1}^{K,K}L^{2}\left(\rho_{T}^{EA,L,kk^{\prime}}\right) (5.3)

Similar definitions apply for measures related to learning the ξ\xi-based interaction kernels, see Appendix G. We discuss some key properties of the measures in Appendix D.

5.2 Learning performance

We now discuss the performance measures for the estimated interaction kernels. We have already treated the trajectory estimation error in section 3.1. We use weighted L2L^{2}-norms (with mild abuse of notation, we omit the weight from the notation) based on the dynamics-adapted measures introduced above, with analogous definitions for the discrete-in-time versions over the LL observation times:

ϕ^kkEϕkkEL2(ρTE,k,k)2\displaystyle\left\|\widehat{\phi}^{E}_{kk^{\prime}}-\phi^{E}_{kk^{\prime}}\right\|_{L^{2}(\rho_{T}^{E,k,k^{\prime}})}^{2} :=\displaystyle:= (r,𝒔E)(ϕ^kkE(r,𝒔E)ϕkkE(r,𝒔E))2r2𝑑ρTE,k,k\displaystyle\int_{(r,\boldsymbol{s}^{E})}(\widehat{\phi}^{E}_{kk^{\prime}}(r,\boldsymbol{s}^{E})-\phi^{E}_{kk^{\prime}}(r,\boldsymbol{s}^{E}))^{2}r^{2}\,d\rho_{T}^{E,k,k^{\prime}}
ϕ^kkAϕkkAL2(ρTA,k,k)2\displaystyle\left\|\widehat{\phi}^{A}_{kk^{\prime}}-\phi^{A}_{kk^{\prime}}\right\|_{L^{2}(\rho_{T}^{A,k,k^{\prime}})}^{2} :=\displaystyle:= (r,r˙,𝒔A)(ϕ^kkA(r,r˙,𝒔A)ϕkkA(r,r˙,𝒔A))2r˙2𝑑ρTA,k,k\displaystyle\int_{(r,\dot{r},\boldsymbol{s}^{A})}(\widehat{\phi}^{A}_{kk^{\prime}}(r,\dot{r},\boldsymbol{s}^{A})-\phi^{A}_{kk^{\prime}}(r,\dot{r},\boldsymbol{s}^{A}))^{2}\dot{r}^{2}\,d\rho_{T}^{A,k,k^{\prime}}
ϕ^kkEAϕkkEAL2(ρTEA,k,k)2\displaystyle\left\|\widehat{\phi}_{kk^{\prime}}^{EA}-\phi_{kk^{\prime}}^{EA}\right\|_{L^{2}(\rho_{T}^{EA,k,k^{\prime}})}^{2} :=\displaystyle:= r,𝒔E,r˙,𝒔A[(ϕ^kkE(r,𝒔E)ϕkkE(r,𝒔E))r\displaystyle\int_{r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A}}\Big{[}(\widehat{\phi}^{E}_{kk^{\prime}}(r,\boldsymbol{s}^{E})-\phi^{E}_{kk^{\prime}}(r,\boldsymbol{s}^{E}))r (5.4)
+(ϕ^kkA(r,r˙,𝒔A)ϕkkA(r,r˙,𝒔A))r˙]2dρTEA,k,k.\qquad+(\widehat{\phi}^{A}_{kk^{\prime}}(r,\dot{r},\boldsymbol{s}^{A})-\phi^{A}_{kk^{\prime}}(r,\dot{r},\boldsymbol{s}^{A}))\dot{r}\Big{]}^{2}\,d\rho_{T}^{EA,k,k^{\prime}}\,.

Our learning theory focuses on minimizing the difference between ϕ^kkEϕ^kkA\widehat{\phi}^{E}_{kk^{\prime}}\oplus\widehat{\phi}^{A}_{kk^{\prime}} and ϕkkEϕkkA\phi^{E}_{kk^{\prime}}\oplus\phi^{A}_{kk^{\prime}} in the joint norm given by (5.4). As long as the joint norm is small, our estimators produce faithful approximations of the right-hand side of the original system, and hence of its trajectories. However, a small joint norm does not necessarily imply that both ϕ^kkEϕkkE\widehat{\phi}^{E}_{kk^{\prime}}-\phi^{E}_{kk^{\prime}} and ϕ^kkAϕkkA\widehat{\phi}^{A}_{kk^{\prime}}-\phi^{A}_{kk^{\prime}} are small in their corresponding energy- and alignment-based norms, since the joint norm is the weaker norm. It would be interesting to study whether there is any equivalence between these norms, but the problem appears to be quite delicate, and the theoretical investigation is still ongoing.
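For a concrete sense of how the weighted norms above are evaluated in practice, the sketch below approximates the energy- and alignment-based errors by averaging over pairwise samples drawn from the empirical measure; the kernels phi_true and phi_hat are hypothetical one-variable placeholders, not the estimators produced by our algorithm.

import numpy as np

def weighted_L2_error_E(phi_hat, phi_true, r):
    # || phi_hat - phi_true ||_{L^2(rho^E)} with the r^2 weight, via sample averages.
    return np.sqrt(np.mean((phi_hat(r) - phi_true(r))**2 * r**2))

def weighted_L2_error_A(phi_hat, phi_true, r, rdot):
    # || phi_hat - phi_true ||_{L^2(rho^A)} with the rdot^2 weight.
    return np.sqrt(np.mean((phi_hat(r) - phi_true(r))**2 * rdot**2))

rng = np.random.default_rng(1)
r    = np.abs(rng.normal(size=10_000))     # stand-ins for empirical pairwise samples
rdot = np.abs(rng.normal(size=10_000))
phi_true = lambda r: np.exp(-r)            # hypothetical true energy kernel
phi_hat  = lambda r: np.exp(-r) + 0.01*r   # hypothetical estimator
print(weighted_L2_error_E(phi_hat, phi_true, r))
print(weighted_L2_error_A(phi_hat, phi_true, r, rdot))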

Now, we have all the tools needed to establish a theoretical framework: dynamics induced probability measures, performance measurements in appropriate norms, and loss functionals. These will allow us to discuss the convergence properties of our estimators. Full details of the numerical algorithm are given in Appendix H.

Notational summary

A summary of the learning theory notation introduced in Sections 3 and 4, together with the notation above, is given in Table 3.

Notation Definition Ref
MM number of trajectories Sec. 1
LL number of times in [0,T][0,T] for each trajectory Sec. 2
𝒀(t)\boldsymbol{Y}(t) full state space vector containing 𝑿(t),𝑽(t),𝚵(t)\boldsymbol{X}(t),\boldsymbol{V}(t),\boldsymbol{\Xi}(t) Sec. 2
{E,A,ξ}{\{E,A,\xi\}} wildcard, means the notation applies for all 33 variables Sec. 2
𝑿𝒮\|\boldsymbol{X}\|_{\mathcal{S}} i=1N1N𝓀𝒾𝒙i2\sum_{i=1}^{N}\frac{1}{{N_{\mathpzc{k}_{i}}}}\left\|\boldsymbol{x}_{i}\right\|^{2} (3.1)
𝝁𝒀\bm{\mu^{\boldsymbol{Y}}} distribution on the initial conditions 𝒀(0)\boldsymbol{Y}(0) Sec. 3.2
ϕ{E,A,ξ}=(ϕkk{E,A,ξ})k,k{\bm{\phi}}^{{\{E,A,\xi\}}}=(\phi_{kk^{\prime}}^{{\{E,A,\xi\}}})_{k,k^{\prime}} vectorized true E,A,ξE,A,\xi interaction kernels (3.2)
𝝋{E,A,ξ}=(φkk{E,A,ξ})k,k{\bm{\varphi}}^{{\{E,A,\xi\}}}=(\varphi_{kk^{\prime}}^{{\{E,A,\xi\}}})_{k,k^{\prime}} 𝝋{E,A,ξ}𝓗{E,A,ξ}{\bm{\varphi}}^{{\{E,A,\xi\}}}\in\boldsymbol{\mathcal{H}}^{{\{E,A,\xi\}}} with φkk{E,A,ξ}kk{E,A,ξ}\varphi_{kk^{\prime}}^{{\{E,A,\xi\}}}\in\mathcal{H}_{kk^{\prime}}^{{\{E,A,\xi\}}} (3.2)
EAEA shorthand denoting energy and alignment part of system (3.3)
ϕEA\bm{\phi}^{EA} represents the joint function ϕEϕA𝓗EA{\bm{\phi}}^{E}\oplus{\bm{\phi}}^{A}\in\boldsymbol{\mathcal{H}}^{EA} (3.3)
𝒀𝒴2\left\|\boldsymbol{Y}\right\|_{\mathcal{Y}}^{2} 𝑿𝒮2+𝑽𝒮2+𝚵𝒮2.\left\|\boldsymbol{X}\right\|_{\mathcal{S}}^{2}+\left\|\boldsymbol{V}\right\|_{\mathcal{S}}^{2}+\left\|\boldsymbol{\Xi}\right\|_{\mathcal{S}}^{2}. (3.4)
S{E,A,ξ}\textbf{S}^{{\{E,A,\xi\}}} k,k𝕊kk{E,A,ξ}\prod_{k,k^{\prime}}\mathbb{S}_{kk^{\prime}}^{{\{E,A,\xi\}}} (3.7)
𝐑\mathbf{R} k,k[Rkkmin,Rkkmax]\prod_{k,k^{\prime}}[R_{kk^{\prime}}^{\min},R_{kk^{\prime}}^{\max}] (3.8)
RR maxk,kRkkmax\max_{k,k^{\prime}}R_{kk^{\prime}}^{\max} (3.8)
𝓚S{E,A,ξ}{E,A,ξ}\boldsymbol{\mathcal{K}}_{S_{{\{E,A,\xi\}}}}^{{\{E,A,\xi\}}} admissible spaces for the E,A,ξE,A,\xi kernels (3.12)
kk{E,A,ξ}\mathcal{H}_{kk^{\prime}}^{{\{E,A,\xi\}}} the hypothesis spaces for ϕkk{E,A,ξ}\phi_{kk^{\prime}}^{{\{E,A,\xi\}}} (3.14)
𝓗{E,A,ξ}=kkkk{E,A,ξ}\boldsymbol{\mathcal{H}}^{{\{E,A,\xi\}}}=\oplus_{kk^{\prime}}\mathcal{H}_{kk^{\prime}}^{{\{E,A,\xi\}}} the hypothesis spaces for ϕ{E,A,ξ}{\bm{\phi}}^{{\{E,A,\xi\}}} (3.14)
𝓗EA\boldsymbol{\mathcal{H}}^{EA} direct sum of hypothesis spaces 𝓗E𝓗A\boldsymbol{\mathcal{H}}^{E}\oplus\boldsymbol{\mathcal{H}}^{A} (3.15)
𝓔MEA(),𝓔Mξ()\boldsymbol{\mathcal{E}}_{M}^{EA}(\cdot),\boldsymbol{\mathcal{E}}_{M}^{\xi}(\cdot) empirical EAEA error functional, ξ\xi error functional Sec. 4.2
ϕ^MEA:=ϕ^L,M,𝓗EAEA\widehat{\bm{\phi}}_{M}^{EA}:=\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA} argmin𝝋EA𝓗EA𝓔MEA(𝝋EA)\mathrm{argmin}_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}}\bm{\mathcal{E}}_{M}^{EA}(\bm{\varphi}^{EA}) (4.3)
ϕ^Mξ:=ϕ^L,M,𝓗ξξ\widehat{\bm{\phi}}_{M}^{\xi}:=\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{\xi}}^{\xi} argmin𝝋ξ𝓗ξ𝓔Mξ(𝝋ξ)\mathrm{argmin}_{\bm{\varphi}^{\xi}\in\boldsymbol{\mathcal{H}}^{\xi}}\bm{\mathcal{E}}_{M}^{\xi}(\bm{\varphi}^{\xi}) (4.3)
{ψkk,p{E,A,ξ}}p=1nkk{E,A,ξ}\{\psi_{kk^{\prime},p}^{{\{E,A,\xi\}}}\}_{p=1}^{n_{kk^{\prime}}^{{\{E,A,\xi\}}}} basis for kk{E,A,ξ}\mathcal{H}_{kk^{\prime}}^{{\{E,A,\xi\}}} Sec. 4.4
𝝆TEA\bm{\rho}_{T}^{EA}, 𝝆Tξ\bm{\rho}_{T}^{\xi} measure for EAEA, ξ\xi with continuous time, infinite trajectories (5.3), G
𝝆TEA,L\bm{\rho}_{T}^{EA,L}, 𝝆Tξ,L\bm{\rho}_{T}^{\xi,L} measure for EAEA, ξ\xi discrete in time, infinite trajectories (5.3), G
𝑳2(𝝆TEA,L)\bm{L}^{2}\left(\bm{\rho}_{T}^{EA,L}\right) k,k=1,1K,KL2(ρTEA,L,kk)\bigoplus_{k,k^{\prime}=1,1}^{K,K}L^{2}\left(\rho_{T}^{EA,L,kk^{\prime}}\right) (5.3)
c𝓗EAc_{\boldsymbol{\mathcal{H}}^{EA}}, c𝓗ξc_{\boldsymbol{\mathcal{H}}^{\xi}} coercivity constant on the 𝓗EA\boldsymbol{\mathcal{H}}^{EA}, 𝓗ξ\boldsymbol{\mathcal{H}}^{\xi} hypothesis spaces Def.4
𝓗MEA\boldsymbol{\mathcal{H}}_{M}^{EA},𝓗Mξ\boldsymbol{\mathcal{H}}_{M}^{\xi} hypothesis spaces on EA,ξEA,\xi depending on MM Sec. 7
AEAA^{EA}, AξA^{\xi} Learning matrices for the inverse problem Sec. 9
𝒩(𝓗,δ)\mathcal{N}(\boldsymbol{\mathcal{H}},\delta) δ\delta-covering number, under the \infty-norm, of a set 𝓗\boldsymbol{\mathcal{H}} [77]
Table 3: Notation used throughout the paper

6 Identifiability of kernels from data

In this section we introduce a technical condition on the dynamical system that relates to the solvability of the inverse problem and plays a key role in the learning theory. We establish theorems in two directions:

  1. 1.

    Showing how identifiability of the kernels can be derived from the coercivity condition by relating the coercivity constant to the singular values of the learning matrices associated to our inverse problem, for both finitely and infinitely many trajectories.

  2. 2.

    Establishing the coercivity condition for a wide class of dynamical systems of the form (2.2), under mild assumptions on the distribution 𝝁𝒀\bm{\mu^{\boldsymbol{Y}}} of the initial conditions. From our numerical experiments, we expect that coercivity holds even more generally.

We will make the following assumptions on the hypothesis spaces used in the learning approach for the remainder of the paper:

Assumption 2

𝓗EA\boldsymbol{\mathcal{H}}^{EA} is a compact convex subset of 𝓚SEAEA:=𝓚SEE𝓚SAA\boldsymbol{\mathcal{K}}_{S_{EA}}^{EA}:=\boldsymbol{\mathcal{K}}_{S_{E}}^{E}\oplus\boldsymbol{\mathcal{K}}_{S_{A}}^{A} (see 3.13) which implies that the infinity norm of all elements in 𝓗EA\boldsymbol{\mathcal{H}}^{EA} is bounded above by a constant SEAmax{SE,SA}S_{EA}\geq\max\{{S_{E},S_{A}}\}.

Assumption 3

𝓗ξ\boldsymbol{\mathcal{H}}^{\xi} is a compact convex subset of 𝓚Sξξ\boldsymbol{\mathcal{K}}_{S_{\xi}}^{\xi} (see 3.12), which implies that the infinity norm of all elements in 𝓗ξ\boldsymbol{\mathcal{H}}^{\xi} is bounded above by a constant S0SξS_{0}\geq S_{\xi}.

It is easy to see that 𝓗EA\boldsymbol{\mathcal{H}}^{EA} can be naturally embedded as a compact subset of 𝑳2(𝝆TEA,L)\bm{L}^{2}(\bm{\rho}_{T}^{EA,L}) and that 𝓗ξ\boldsymbol{\mathcal{H}}^{\xi} can be naturally embedded as a compact subset of 𝑳2(𝝆Tξ,L)\bm{L}^{2}(\bm{\rho}_{T}^{\xi,L}) (recall these measures are defined in section 5.1). Assumptions 2, 3 ensure the existence of minimizers to the loss functionals 𝓔MEA,𝓔Mξ\boldsymbol{\mathcal{E}}_{M}^{EA},\boldsymbol{\mathcal{E}}_{M}^{\xi} defined in Sec. 4.2; this is proven in Appendix B.

In order to ensure learnability we introduce a coercivity condition, with terminology coming from the Lax-Milgram theorem. In the second-order case we will have two coercivity conditions, one for the energy and alignment, and the other for the ξ\xi variable. These conditions serve the same purpose in both cases: they first ensure that the minimizers of the error functionals are unique, and second that, when the expected error functional is small, the distance from the estimator to the true kernels is small in the appropriate 𝝆T\bm{\rho}_{T} norm.

Due to its connection to the error functionals and to the learnability of the kernels, coercivity plays an important role in the theorems of Section 7.

Definition 4 (Coercivity condition)

For the dynamical system (2.2) observed at time instants 0=t1<t2<<tL=T0=t_{1}<t_{2}<\dots<t_{L}=T and with initial condition distributed according to 𝛍𝐘\bm{\mu^{\boldsymbol{Y}}} on (2d+1)N\mathbb{R}^{(2d+1)N}, we say that the system satisfies the coercivity condition on the hypothesis space 𝓗EA\boldsymbol{\mathcal{H}}^{EA} with constant c𝓗EAc_{\boldsymbol{\mathcal{H}}^{EA}} if

c𝓗EA:=inf𝝋EA𝓗EA\{𝟎}1Ll=1L𝔼𝝁𝒀[𝐟𝝋EA(𝑿(tl),𝑽(tl),𝚵(tl))𝒮2]𝝋EA𝑳2(𝝆TEA,L)2>0.\displaystyle c_{\boldsymbol{\mathcal{H}}^{EA}}:=\inf_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}\backslash\{\boldsymbol{0}\}}\,\frac{\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{[}\big{\|}\mathbf{f}_{\bm{\varphi}^{EA}}(\boldsymbol{X}(t_{l}),\boldsymbol{V}(t_{l}),\boldsymbol{\Xi}(t_{l}))\big{\|}_{\mathcal{S}}^{2}\bigg{]}}{\|\bm{\varphi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}}>0. (6.1)

An analogous definition holds for continuous observations on the time interval [0,T][0,T], by replacing the sum over observations at discrete times with an integral over [0,T][0,T]. Similarly, the system satisfies the coercivity condition on the hypothesis space 𝓗ξ\boldsymbol{\mathcal{H}}^{\xi} with constant c𝓗ξc_{\boldsymbol{\mathcal{H}}^{\xi}} if

c𝓗ξ:=inf𝝋ξ𝓗ξ\{𝟎}1Ll=1L𝔼𝝁𝒀[𝐟𝝋ξ(𝑿(tl),𝑽(tl),𝚵(tl))𝒮2]𝝋ξ𝑳2(𝝆Tξ,L)2>0.\displaystyle c_{\boldsymbol{\mathcal{H}}^{\xi}}:=\inf_{\bm{\varphi}^{\xi}\in\boldsymbol{\mathcal{H}}^{\xi}\backslash\{\boldsymbol{0}\}}\,\frac{\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{[}\big{\|}\mathbf{f}_{\bm{\varphi}^{\xi}}(\boldsymbol{X}(t_{l}),\boldsymbol{V}(t_{l}),\boldsymbol{\Xi}(t_{l}))\big{\|}_{\mathcal{S}}^{2}\bigg{]}}{\|\bm{\varphi}^{\xi}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{\xi,L})}^{2}}>0. (6.2)

An analogous definition holds for continuous observations on the time interval [0,T][0,T], by replacing the sum over observations at discrete times with an integral over [0,T][0,T].

In the following, we prove the coercivity condition on general compact sets of 𝑳2([0,R],𝝆TEA,L)\bm{L}^{2}([0,R],\bm{\rho}_{T}^{EA,L}) under suitable hypotheses. Our result is independent of NN, which implies that the finite-sample bounds of Theorem 9 can be dimension free, in the sense that the coercivity constant has no dependence on NN. This result suggests that coercivity may be a fundamental property of the dynamical system, including in the mean field regime (NN\to\infty).

6.1 Identifiability from coercivity

By choosing the hypothesis space to be compact and convex, we are able to show that the error functional has a unique minimizer. However, many possible bases exist that could potentially yield good performance (in terms of the error functional and the 𝑳2\bm{L}^{2} error to the true kernel). We want to choose a basis such that the regression matrix AEAA^{EA}, defined in Appendix H, is well-conditioned, which ensures that the inverse problem can be solved and thus that an estimator with good performance can be learned. In the proposition below we establish two results in this direction. The key for both results is that the basis is chosen to be orthonormal in 𝑳2(𝝆TEA,L)\bm{L}^{2}(\bm{\rho}_{T}^{EA,L}), versus the naive choice of basis in the underlying direct sum of 𝑳\bm{L}^{\infty} spaces that the interaction kernels live in. The first result is theoretical and shows that, under appropriate assumptions on the basis, the minimal singular value of the expected regression matrix (denoted AEAA_{\infty}^{EA}) equals the coercivity constant. The second result is critical for the practical implementation, as it lower bounds the minimal singular value of the empirical regression matrix by the coercivity constant with high probability. In both cases, the numerical performance is affected by the size of the coercivity constant of 𝓗EA\boldsymbol{\mathcal{H}}^{EA}: if the hypothesis space is well-chosen, then the coercivity constant will be sufficiently large and the regression matrix will be well-conditioned.

To ease the notation, we introduce the bilinear functional ,\langle\langle{\cdot,\cdot}\rangle\rangle on 𝓗EA×𝓗EA\boldsymbol{\mathcal{H}}^{EA}\times\boldsymbol{\mathcal{H}}^{EA}, defined by

𝝋1EA,𝝋2EA:=\displaystyle\langle\langle{\bm{\varphi}^{EA}_{1},\bm{\varphi}^{EA}_{2}}\rangle\rangle:=
1Ll,i=1L,N1N𝓀𝒾𝔼𝝁𝒀[i=1N1N𝓀𝒾(φ1,𝓀𝒾𝓀𝒾E(rii(t),𝒔iiE)𝒓ii(t)+φ1,𝓀𝒾𝓀𝒾A(rii(t),𝒔iiA)𝒓˙ii(t)),\displaystyle\frac{1}{L}\sum_{l,i=1}^{L,N}\frac{1}{N_{\mathpzc{k}_{i}}}\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{[}\bigg{\langle}\sum_{i^{\prime}=1}^{N}\frac{1}{N_{\mathpzc{k}_{i^{\prime}}}}\Big{(}\varphi_{1,\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}}^{E}(r_{ii^{\prime}}(t),\boldsymbol{s}^{E}_{ii^{\prime}})\boldsymbol{r}_{ii^{\prime}}(t)+\varphi_{1,\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}}^{A}(r_{ii^{\prime}}(t),\boldsymbol{s}^{A}_{ii^{\prime}})\boldsymbol{\dot{r}}_{ii^{\prime}}(t)\Big{)},
i=1N1N𝓀𝒾(φ2,𝓀𝒾𝓀𝒾E(rii(t),𝒔iiE)𝒓ii(t)+φ2,𝓀𝒾𝓀𝒾A(rii(t),𝒔iiA)𝒓˙ii(t))]\displaystyle\sum_{i^{\prime}=1}^{N}\frac{1}{N_{\mathpzc{k}_{i^{\prime}}}}\Big{(}\varphi_{2,\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}}^{E}(r_{ii^{\prime}}(t),\boldsymbol{s}^{E}_{ii^{\prime}})\boldsymbol{r}_{ii^{\prime}}(t)+\varphi_{2,\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}}^{A}(r_{ii^{\prime}}(t),\boldsymbol{s}^{A}_{ii^{\prime}})\boldsymbol{\dot{r}}_{ii^{\prime}}(t)\Big{)}\bigg{\rangle}\bigg{]} (6.3)

for any 𝝋1EA=(φ1,kkEφ1,kkA)k,k=1,1K,K𝓗EA\bm{\varphi}^{EA}_{1}=(\varphi_{1,kk^{\prime}}^{E}\oplus\varphi_{1,kk^{\prime}}^{A})_{k,k^{\prime}=1,1}^{K,K}\in\boldsymbol{\mathcal{H}}^{EA}, and 𝝋2EA=(φ2,kkEφ2,kkA)k,k=1,1K,K𝓗EA\bm{\varphi}^{EA}_{2}=(\varphi_{2,kk^{\prime}}^{E}\oplus\varphi_{2,kk^{\prime}}^{A})_{k,k^{\prime}=1,1}^{K,K}\in\boldsymbol{\mathcal{H}}^{EA}. For every pair (k,k)(k,k^{\prime}) let (ψkk,iEψkk,iA)i=1nkk(\psi_{kk^{\prime},i}^{E}\oplus\psi_{kk^{\prime},i}^{A})_{i=1}^{n_{kk^{\prime}}} be a basis of

kkEAL([0,R]×𝕊kkE)L([0,R]×𝕊kkA)\mathcal{H}_{{kk^{\prime}}}^{EA}\subset L^{\infty}([0,R]\times\mathbb{S}_{kk^{\prime}}^{E})\oplus L^{\infty}([0,R]\times\mathbb{S}_{kk^{\prime}}^{A})

satisfying the orthonormality and boundedness conditions

ψkk,pEA,ψkk,pEAL2(ρTEA,L,kk)\displaystyle\langle\psi_{kk^{\prime},p}^{EA},\psi_{kk^{\prime},p^{\prime}}^{EA}\rangle_{L^{2}{(\rho_{T}^{EA,L,kk^{\prime}})}} =δp,p,ψkk,pEA\displaystyle=\delta_{p,p^{\prime}}\quad,\quad\|\psi_{kk^{\prime},p}^{EA}\|_{\infty} SEA.\displaystyle\leq S_{EA}. (6.4)

We note that multivariable basis functions arise naturally in this setting due to the model. A tensor product basis of splines or piecewise polynomials can be used, as one explicit example. The nkkn_{kk^{\prime}} notation allows multivariable functions, different choices for the number of basis functions across pairs (k,k)(k,k^{\prime}), and a different number of basis functions within a pair with respect to the underlying coordinates of the tensor product.
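As a minimal illustration of such a tensor-product construction, the sketch below builds piecewise-constant basis functions on a uniform partition of [0,R]×[0,Rx˙][0,R]\times[0,R_{\dot{x}}], for an alignment-type kernel depending only on (r,r˙)(r,\dot{r}); the ranges and knot counts are illustrative, and splines of higher degree would be handled analogously.

import numpy as np

def tensor_pw_constant_basis(R=5.0, R_dot=3.0, n_r=8, n_rdot=6):
    # Return a list of functions psi_p(r, rdot), p = 1..n_r * n_rdot, each the
    # indicator of one cell of the uniform tensor-product partition.
    r_edges = np.linspace(0.0, R, n_r + 1)
    s_edges = np.linspace(0.0, R_dot, n_rdot + 1)
    basis = []
    for i in range(n_r):
        for j in range(n_rdot):
            def psi(r, rdot, i=i, j=j):
                in_r = (r_edges[i] <= r) & (r < r_edges[i + 1])
                in_s = (s_edges[j] <= rdot) & (rdot < s_edges[j + 1])
                return (in_r & in_s).astype(float)
            basis.append(psi)
    return basis

basis = tensor_pw_constant_basis()
print(len(basis), basis[0](np.array([0.1, 4.9]), np.array([0.1, 2.9])))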

By convention, we use the lexicographic ordering to order within pairs (k,k)(k,k^{\prime}) (with order r,𝒔E,𝒔Ar,\boldsymbol{s}^{E},\boldsymbol{s}^{A}), and then across pairs (with the lexicographic ordering on pairs of integers). Set 𝐧=k,knkk=dim(𝓗EA)\mathbf{n}=\sum_{k,k^{\prime}}n_{kk^{\prime}}=dim(\boldsymbol{\mathcal{H}}^{EA}); then for any function 𝝋EA𝓗EA{\bm{\varphi}}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}, we can write

𝝋EA=p=1𝐧ap𝝍pEA.{\bm{\varphi}}^{EA}=\sum_{p=1}^{\mathbf{n}}a_{p}\boldsymbol{\psi}_{p}^{EA}.

Under the setting above, we have the following relationship between the coercivity constant and the minimal singular value of the empirical and expected learning matrix:

Proposition 5

Consider the matrices

AEA=(𝝍pEA,𝝍pEA)p,p𝐧×𝐧,Aξ=(𝝍pξ,𝝍pξ)p,p𝐧ξ×𝐧ξ,A_{\infty}^{EA}=\big{(}\langle\langle{\boldsymbol{\psi}_{p}^{EA},\boldsymbol{\psi}_{p^{\prime}}^{EA}}\rangle\rangle\big{)}_{p,p^{\prime}}\in\mathbb{R}^{\mathbf{n}\times\mathbf{n}}\,,\quad A_{\infty}^{\xi}=\big{(}\langle\langle{\boldsymbol{\psi}_{p}^{\xi},\boldsymbol{\psi}_{p^{\prime}}^{\xi}}\rangle\rangle\big{)}_{p,p^{\prime}}\in\mathbb{R}^{\mathbf{n}_{\xi}\times\mathbf{n}_{\xi}}\,,

and choose the hypothesis spaces as 𝓗EA=span{𝛙pE𝛙pA}p=1𝐧\boldsymbol{\mathcal{H}}^{EA}=\mathrm{span}\{\boldsymbol{\psi}_{p}^{E}\oplus\boldsymbol{\psi}_{p}^{A}\}_{p=1}^{\mathbf{n}} and 𝓗ξ=span{𝛙pξ}p=1𝐧ξ\boldsymbol{\mathcal{H}}^{\xi}=\mathrm{span}\{\boldsymbol{\psi}_{p}^{\xi}\}_{p=1}^{\mathbf{n}_{\xi}}. Then the coercivity constants for 𝓗EA,𝓗ξ\boldsymbol{\mathcal{H}}^{EA},\boldsymbol{\mathcal{H}}^{\xi} are the smallest singular value of AEAA_{\infty}^{EA}, AξA_{\infty}^{\xi}, respectively:

σmin(AEA)=c𝓗EAσmin(Aξ)=c𝓗ξ\displaystyle\sigma_{\min}(A_{\infty}^{EA})=c_{\boldsymbol{\mathcal{H}}^{EA}}\,\qquad\sigma_{\min}(A_{\infty}^{\xi})=c_{\boldsymbol{\mathcal{H}}^{\xi}} (6.5)

with c𝓗EAc_{\boldsymbol{\mathcal{H}}^{EA}} defined in (6.1) and c𝓗ξc_{\boldsymbol{\mathcal{H}}^{\xi}} defined in (6.2). Additionally, for large MM, the smallest singular value of AEAA^{EA} satisfies the inequality

σmin(AEA)0.8c𝓗EA\displaystyle\sigma_{\min}(A^{EA})\geq 0.8c_{\boldsymbol{\mathcal{H}}^{EA}}

with probability at least 12𝐧exp(c𝓗EA2M100𝐧2c12+203c1c𝓗EA𝐧)1-2\mathbf{n}\exp\bigg{(}-\frac{c_{\boldsymbol{\mathcal{H}}^{EA}}^{2}M}{100\mathbf{n}^{2}c_{1}^{2}+\frac{20}{3}\cdot c_{1}\cdot c_{\boldsymbol{\mathcal{H}}^{EA}}\cdot\mathbf{n}}\bigg{)} with c1=2K4max{R,Rx˙}2SEA2+2c_{1}=2K^{4}\max\{R,R_{\dot{x}}\}^{2}S_{EA}^{2}+2. Similarly, for large MM, the smallest singular value of AMξA_{M}^{\xi} satisfies the inequality

σmin(AMξ)0.8c𝓗ξ\displaystyle\sigma_{\min}(A_{M}^{\xi})\geq 0.8c_{\boldsymbol{\mathcal{H}}^{\xi}}

with probability at least 12𝐧ξexp(c𝓗ξ2M100n2c12+203c2c𝓗ξn)1-2\mathbf{n}_{\xi}\exp\bigg{(}-\frac{c_{\boldsymbol{\mathcal{H}}^{\xi}}^{2}M}{100n^{2}c_{1}^{2}+\frac{20}{3}c_{2}\cdot c_{\boldsymbol{\mathcal{H}}^{\xi}}\cdot n}\bigg{)} with c2=2K4Rξ2Sξ2+2c_{2}=2K^{4}R_{\xi}^{2}S_{\xi}^{2}+2. Therefore, for a system and an associated hypothesis space satisfying the coercivity condition, and for MM sufficiently large, the inverse problem is, with high probability, solvable with a condition number controlled by the coercivity constant.

Proof  We prove the result in the EAEA case; the proof for the ξ\xi part of the system is analogous. The orthonormality of the component functions given in (6.4) implies that 𝝍pEA,𝝍pEA𝑳2(𝝆TEA,L)=δpp\langle\boldsymbol{\psi}_{p}^{EA},\boldsymbol{\psi}_{p^{\prime}}^{EA}\rangle_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}=\delta_{pp^{\prime}}. Expand 𝝋EA𝓗EA\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA} in this basis as 𝝋EA=p=1𝐧ap𝝍pEA.\bm{\varphi}^{EA}=\sum_{p=1}^{\mathbf{n}}a_{p}\boldsymbol{\psi}_{p}^{EA}. Let the vector v=(a1,,a𝐧)𝐧v=(a_{1},\ldots,a_{\mathbf{n}})\in\mathbb{R}^{\mathbf{n}}, and notice that

1Ll=1L𝔼𝝁𝒀[𝐟𝝋EA(𝑿(tl),𝑽(tl),𝚵(tl))𝒮2]=p=1𝐧ap𝝍pEA,p=1𝐧ap𝝍pEA\displaystyle\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{[}\big{\|}\mathbf{f}_{\bm{\varphi}^{EA}}(\boldsymbol{X}(t_{l}),\boldsymbol{V}(t_{l}),\boldsymbol{\Xi}(t_{l}))\big{\|}_{\mathcal{S}}^{2}\bigg{]}=\langle\langle{\sum_{p=1}^{\mathbf{n}}a_{p}\boldsymbol{\psi}_{p}^{EA},\sum_{p=1}^{\mathbf{n}}a_{p}\boldsymbol{\psi}_{p}^{EA}}\rangle\rangle
=vTAEAvσmin(AEA)v2=σmin(AEA)𝝋EA𝑳2(𝝆TEA,L)2\displaystyle=v^{T}A_{\infty}^{EA}v\geq\sigma_{\min}(A_{\infty}^{EA})\|v\|^{2}=\sigma_{\min}(A_{\infty}^{EA})\big{\|}\bm{\varphi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}

This lower bound is achieved by the singular vector corresponding to the singular value σmin(AEA)\sigma_{\min}(A_{\infty}^{EA}), so that by definition (6.1) we have that σmin(AEA)=c𝓗EA\sigma_{\min}(A_{\infty}^{EA})=c_{\boldsymbol{\mathcal{H}}^{EA}}.

For the second statement, we consider the learning matrix AEAA^{EA} (defined in section H), which we can also write as AMEAA_{M}^{EA} to emphasize the dependence on MM as needed, built from the MM observed trajectories. By construction, for each mm, AEA=𝔼𝝁𝒀[AEA,(m)]A_{\infty}^{EA}=\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}[A^{EA,(m)}] and limMAMEA=AEA\lim_{M\rightarrow\infty}A_{M}^{EA}=A_{\infty}^{EA} by the Strong Law of Large Numbers. Next we will derive some important properties of the learning matrix that will allow us to apply the matrix Bernstein inequality (see [75], Theorem 6.1.1, Corollary 6.1.2); we will use the notation from this reference. First we note an elementary matrix analysis result (see [7], Problem III.6.13): for any two square matrices A,BA,B,

maxj|σj(A)σj(B)|AB.\max_{j}|\sigma_{j}(A)-\sigma_{j}(B)|\leq\|A-B\|.

All norms in this proof are the spectral norm, unless otherwise specified. Thus, if we obtain a concentration inequality bounding 𝝁𝒀{AEAAMEAt}\mathbb{P}_{\bm{\mu^{\boldsymbol{Y}}}}\{\|A_{\infty}^{EA}-A_{M}^{EA}\|\geq t\}, we will obtain the desired result relating the minimal singular value of AMEAA_{M}^{EA} to c𝓗EAc_{\boldsymbol{\mathcal{H}}^{EA}}. First, notice that 𝔼𝝁𝒀[AMEA]=AEA\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}[A_{M}^{EA}]=A_{\infty}^{EA}. Additionally, using the definition of the regression matrix, and our assumptions on the kernels and the dynamics, we can bound every entry by c1=2K4max{R,Rx˙}2SEA2+2c_{1}=2K^{4}\max\{R,R_{\dot{x}}\}^{2}S_{EA}^{2}+2. This immediately implies the bound

AMEAAEA2𝐧c1.\|A_{M}^{EA}-A_{\infty}^{EA}\|\leq 2\mathbf{n}c_{1}.

Next, we upper bound the matrix variance statistic (in our case Z=k=1MSkZ=\sum_{k=1}^{M}S_{k} where Sk=1MAEA,(k)S_{k}=\frac{1}{M}A^{EA,(k)}), defined as

v(Z)=max{𝔼[(Z𝔼Z)(Z𝔼Z)],𝔼[(Z𝔼Z)(Z𝔼Z)]}.v(Z)=\max\{\|\mathbb{E}[(Z-\mathbb{E}Z)(Z-\mathbb{E}Z)^{*}]\|,\|\mathbb{E}[(Z-\mathbb{E}Z)^{*}(Z-\mathbb{E}Z)]\|\}.

Using a similar analysis to bound each entry of the matrices, we can arrive at the result that v(Z)2𝐧2c12v(Z)\leq 2\mathbf{n}^{2}c_{1}^{2}. Now, we apply the matrix Bernstein inequality to see that

𝝁𝒀{AMEAAEAt}2𝐧exp(c𝓗EA2M100𝐧2c12+20c1c𝓗EA3𝐧).\mathbb{P}_{\bm{\mu^{\boldsymbol{Y}}}}\Big{\{}\|A_{M}^{EA}-A_{\infty}^{EA}\|\geq t\Big{\}}\leq 2\mathbf{n}\exp\bigg{(}-\frac{c_{\boldsymbol{\mathcal{H}}^{EA}}^{2}M}{100\mathbf{n}^{2}c_{1}^{2}+\frac{20c_{1}c_{\boldsymbol{\mathcal{H}}^{EA}}}{3}\mathbf{n}}\bigg{)}.

Note that the MM in the numerator appears because AMEAA_{M}^{EA} carries a factor of 1M\frac{1}{M}. Lastly, choose t=c𝓗EA5t=\frac{c_{\boldsymbol{\mathcal{H}}^{EA}}}{5}, which together with the results above yields the desired inequality.  

From Proposition 5 we see that, for each hypothesis space kk\mathcal{H}_{kk^{\prime}}, it is important to choose a basis that is well-conditioned in 𝑳2(𝝆TEA,L),𝑳2(𝝆Tξ,L)\bm{L}^{2}(\bm{\rho}_{T}^{EA,L}),\bm{L}^{2}(\bm{\rho}_{T}^{\xi,L}), instead of in the corresponding 𝑳\bm{L}^{\infty} spaces. If not, the learning matrices AEA,AξA^{EA},A^{\xi}, defined in Appendix H, may be ill-conditioned or even singular. This would lead to fundamental numerical challenges in solving for the kernels. In order to mitigate these issues, one can use piecewise polynomials on a partition of the support of the empirical measure and/or use the pseudo-inverse with an adaptive tolerance.
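The practical recipe suggested by Proposition 5 can be sketched as follows: monitor the smallest singular value of the assembled learning matrix and, when it is small relative to the largest one, solve the linear system with a pseudo-inverse whose tolerance is tied to the observed spectrum. The matrix and right-hand side below are random stand-ins, not the output of the assembly described in Appendix H.

import numpy as np

rng = np.random.default_rng(0)
n = 40
G = rng.normal(size=(n, n))
A_EA = G @ G.T / n                 # symmetric positive semi-definite stand-in for A^{EA}
b_EA = rng.normal(size=n)          # stand-in for the right-hand side b^{EA}

sigma = np.linalg.svd(A_EA, compute_uv=False)
print(f"sigma_min = {sigma[-1]:.2e}   (Proposition 5: compare with 0.8 * coercivity constant)")

# Pseudo-inverse with an adaptive tolerance: drop directions whose singular value
# falls below a small fraction of the largest one (the fraction is illustrative).
rcond = max(1e-8, 0.01 * sigma[-1] / sigma[0])
alpha = np.linalg.pinv(A_EA, rcond=rcond) @ b_EA
print("residual norm:", np.linalg.norm(A_EA @ alpha - b_EA))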

6.2 Discussions on the coercivity condition

The coercivity condition is key to the identifiability of the kernels from data. It is determined by the distribution of the solution to the agent system and introduces constraints on the hypothesis space. For the second-order system, it is therefore related to the distribution 𝝁𝒀\bm{\mu^{\boldsymbol{Y}}} of the initial conditions, the true interaction kernels, and the non-collective force. The coercivity condition has been studied for first-order systems in [52, 51, 48]. Below, we give a brief review.

For homogeneous systems, [52, 51] showed that the coercivity condition holds on any compact subset of the corresponding L2L^{2} space in the case L=1L=1. This result has been generalized to heterogeneous systems in [51], and to a few examples of stochastic homogeneous systems, including linear systems and nonlinear systems with stationary distributions, for general LL in [48].

In this paper, we employ a similar idea to that used for first-order systems and extend the result to second-order systems. One key step in the proof is to show the positivity of the integral operators that arise in the expectation in Eq. (6.1). We focus on a representative model of second-order homogeneous systems,

{mi𝒙¨i=F𝒙˙(𝒙i,𝒙˙i,ξi)+i=1N1N(ϕE(𝒙i𝒙i)(𝒙i𝒙i)+ϕA(𝒙i𝒙i)(𝒙˙i𝒙˙i))ξ˙i=Fξ(𝒙i,𝒙˙i,ξi)+i=1N1Nϕξ(𝒙i𝒙i)(ξiξi)\begin{dcases}m_{i}\ddot{\boldsymbol{x}}_{i}&=F^{\dot{\boldsymbol{x}}}(\boldsymbol{x}_{i},\dot{\boldsymbol{x}}_{i},\xi_{i})+\sum_{i^{\prime}=1}^{N}\frac{1}{N}\big{(}\phi^{E}(\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|)(\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i})+\phi^{A}(\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|)(\dot{\boldsymbol{x}}_{i^{\prime}}-\dot{\boldsymbol{x}}_{i})\big{)}\\ \dot{\xi}_{i}&=F^{\xi}(\boldsymbol{x}_{i},\dot{\boldsymbol{x}}_{i},\xi_{i})+\sum_{i^{\prime}=1}^{N}\frac{1}{N}\phi^{\xi}(\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|)(\xi_{i^{\prime}}-\xi_{i})\end{dcases}\, (6.6)

which includes the first-order systems considered in [13, 52, 51] as special cases and various second-order system examples in [52, 83] as specific applications. We shall prove that the coercivity condition holds true in the case L=1L=1:

Theorem 6

Consider the system (C.1) at time t1=0t_{1}=0 with the initial distribution μ0𝐘=[μ0𝐗μ0𝐗˙μ0𝚵]\mu_{0}^{\boldsymbol{Y}}=\begin{bmatrix}\mu_{0}^{\boldsymbol{X}}\\ \mu_{0}^{\dot{\boldsymbol{X}}}\\ \mu_{0}^{\boldsymbol{\Xi}}\end{bmatrix}, where μ0𝐗{\mu}_{0}^{\boldsymbol{X}} is exchangeable Gaussian with cov(𝐱i(t1))cov(𝐱i(t1),𝐱j(t1))=λId\mathrm{cov}(\boldsymbol{x}_{i}(t_{1}))-\mathrm{cov}(\boldsymbol{x}_{i}(t_{1}),\boldsymbol{x}_{j}(t_{1}))=\lambda I_{d}  for a constant λ>0\lambda>0, μ0𝐗˙,μ0𝚵{\mu}_{0}^{\dot{\boldsymbol{X}}},{\mu}_{0}^{\boldsymbol{\Xi}} are exchangeable with finite second moment, and they are independent of μ0𝐗{\mu}_{0}^{\boldsymbol{X}}. Then

𝔼μ0𝒀𝐟φEφA(𝑿(0),𝑽(0))𝒮2\displaystyle\mathbb{E}_{\mu_{0}^{\boldsymbol{Y}}}\|\mathbf{f}_{\varphi^{E}\oplus\varphi^{A}}(\boldsymbol{X}(0),\boldsymbol{V}(0))\|^{2}_{\mathcal{S}} c1,N,EAφEφAL2(ρTEA,1),\displaystyle\geq c_{1,N,\mathcal{H}^{EA}}\left\|\varphi^{E}\oplus\varphi^{A}\right\|_{L^{2}(\rho_{T}^{EA,1})},
𝔼μ0𝒀𝐟φξ(𝑿(0),𝚵(0))𝒮2\displaystyle\mathbb{E}_{\mu_{0}^{\boldsymbol{Y}}}\|\mathbf{f}_{\varphi^{\xi}}(\boldsymbol{X}(0),\boldsymbol{\Xi}(0))\|^{2}_{\mathcal{S}} c1,N,ξφξL2(ρTξ,1),\displaystyle\geq c_{1,N,\mathcal{H}^{\xi}}\left\|\varphi^{\xi}\right\|_{L^{2}(\rho_{T}^{\xi,1})},

where

  • c1,N,EA(N12N2+(N1)(N2)2N2c),c=min{cEAE,cEAAcμ0𝑿˙}c_{1,N,\mathcal{H}^{EA}}\geq(\frac{N-1}{2N^{2}}+\frac{(N-1)(N-2)}{2N^{2}}c),c=\min\bigg{\{}c_{\mathcal{H}^{EA}}^{E},c_{\mathcal{H}^{EA}}^{A}c_{\mu_{0}^{\dot{\boldsymbol{X}}}}\bigg{\}}, where c𝝁0𝑿˙=1𝔼𝒙˙i(0),𝒙˙i(0)𝔼𝒙˙i(0)2c_{{\bm{\mu}}_{0}^{\dot{\boldsymbol{X}}}}=1-\frac{\mathbb{E}\langle\dot{\boldsymbol{x}}_{i}(0),\dot{\boldsymbol{x}}_{i^{\prime}}(0)\rangle}{\mathbb{E}\|\dot{\boldsymbol{x}}_{i}(0)\|^{2}} (iii\neq i^{\prime}) and cEAEc_{\mathcal{H}^{EA}}^{E} and cEAAc_{\mathcal{H}^{EA}}^{A} are non-negative constants independent of NN, and are strictly positive for compact EA\mathcal{H}^{EA} of L2(ρTEA,1)L^{2}(\rho^{EA,1}_{T}).

  • c1,N,ξ(N1N2+(N1)(N2)N2c),c=cξcμ0𝚵c_{1,N,\mathcal{H}^{\xi}}\geq(\frac{N-1}{N^{2}}+\frac{(N-1)(N-2)}{N^{2}}c),c=c_{\mathcal{H}^{\xi}}c_{\mu_{0}^{\boldsymbol{\Xi}}} with cμ0𝚵=1𝔼ξi(0),ξi(0)𝔼ξi(0)2c_{{\mu}_{0}^{\boldsymbol{\Xi}}}=1-\frac{\mathbb{E}\langle\xi_{i}(0),\xi_{i^{\prime}}(0)\rangle}{\mathbb{E}\|\xi_{i}(0)\|^{2}} (iii\neq i^{\prime}) and cξc_{\mathcal{H}^{\xi}} is a non-negative constant independent of NN, which is strictly positive for compact ξ\mathcal{H}^{\xi} of L2(ρTξ,1)L^{2}(\rho^{\xi,1}_{T}).

This exhibits a particular case in which even the coercivity constant is independent of the number of agents NN. Consequently, the estimated errors of our estimators are independent of NN, and therefore not only is the convergence rate of our estimators independent of the dimension (2d+1)N(2d+1)N of the phase space, but even the constants in front of the rate term are independent of NN. Our results extend those for first-order systems from [52, 51, 83]. The numerical experiments on second-order systems, already conducted in [83], support that the coercivity condition is satisfied by large classes of second-order systems, and is “generally” satisfied for general LL on relevant hypothesis spaces, with a constant independent of the number of agents NN thanks to the exchangeability of the distribution of the initial conditions, and of the agents at any time tt. The proof of the result above is given in Appendix C.
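For readers who wish to experiment with the setting of Theorem 6, the following sketch generates trajectory data from the representative system (6.6) by forward Euler, under several simplifying assumptions: unit masses, no non-collective force, the ξ\xi-equation omitted, and hypothetical kernels ϕE,ϕA\phi^{E},\phi^{A}. It only illustrates how data with exchangeable Gaussian initial conditions can be produced; it is not the numerical scheme used in our experiments.

import numpy as np

def simulate(N=10, d=2, T=2.0, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(N, d))           # exchangeable Gaussian initial positions
    v = rng.normal(size=(N, d))           # exchangeable initial velocities
    dt = T / steps
    phi_E = lambda r: np.exp(-r)          # hypothetical energy kernel
    phi_A = lambda r: 1.0 / (1.0 + r**2)  # hypothetical alignment kernel
    traj = [(x.copy(), v.copy())]
    for _ in range(steps):
        dx = x[None, :, :] - x[:, None, :]    # entry (i, i') is x_i' - x_i
        dv = v[None, :, :] - v[:, None, :]
        r = np.linalg.norm(dx, axis=-1)       # pairwise distances; i = i' terms contribute zero
        acc = (phi_E(r)[..., None] * dx + phi_A(r)[..., None] * dv).sum(axis=1) / N
        x = x + dt * v
        v = v + dt * acc
        traj.append((x.copy(), v.copy()))
    return traj

traj = simulate()
print(len(traj), traj[-1][0].shape)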

7 Consistency and optimal convergence rate of estimators

The final preparatory results for our main theorems combine concentration with a union bound. Here we control the probability that the supremum of the difference between the expected and empirical normalized errors over the whole hypothesis space is large.

7.1 Concentration

Our first main result is a concentration estimate that relates the coercivity condition to an appropriate bias-variance tradeoff in our setting. Let 𝒩(𝓗,δ)\mathcal{N}(\boldsymbol{\mathcal{H}},\delta) be the δ\delta-covering number, with respect to the \infty-norm, of the set 𝓗\boldsymbol{\mathcal{H}}.

Theorem 7 (Concentration)

Suppose that ϕ{E,A,ξ}𝓚S{E,A,ξ}{E,A,ξ}{\bm{\phi}}^{\{E,A,\xi\}}\in\boldsymbol{\mathcal{K}}_{S_{\{E,A,\xi\}}}^{\{E,A,\xi\}}. Consider convex, compact (with respect to the \infty-norm) hypothesis spaces

𝓗MEA𝑳(𝐑×𝐒E)𝑳(𝐑×𝐒A),𝓗Mξ𝑳(𝐑×𝐒ξ)\boldsymbol{\mathcal{H}}_{M}^{EA}\subset\boldsymbol{L^{\infty}}(\mathbf{R}\times\mathbf{S}^{E})\oplus\boldsymbol{L^{\infty}}(\mathbf{R}\times\mathbf{S}^{A}),\quad\boldsymbol{\mathcal{H}}_{M}^{\xi}\subset\boldsymbol{L^{\infty}}(\mathbf{R}\times\mathbf{S}^{\xi})

bounded above by S0max{SE,SA,Sξ}S_{0}\geq\max\{S_{E},S_{A},S_{\xi}\} respectively. Additionally, assume that the coercivity condition (6.1) holds on 𝓗MEA\boldsymbol{\mathcal{H}}_{M}^{EA} and condition (6.2) on 𝓗Mξ\boldsymbol{\mathcal{H}}_{M}^{\xi}.

Then for all ϵ>0\epsilon>0, with probability (with respect to 𝛍𝐘\bm{\mu^{\boldsymbol{Y}}}) at least 1δ1-\delta, we have the estimates

c𝓗MEAϕ^MEAϕEA𝑳2(𝝆TEA,L)2\displaystyle c_{\boldsymbol{\mathcal{H}}_{M}^{EA}}\|\widehat{\bm{\phi}}^{EA}_{M}-\bm{\phi}^{EA}\|^{2}_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})} 2inf𝝋EA𝓗MEA𝝋EAϕEA𝑳2(𝝆TEA,L)2+2ϵ,\displaystyle\leq 2\inf_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}_{M}^{EA}}\|\bm{\varphi}^{EA}-\bm{\phi}^{EA}\|^{2}_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}+2\epsilon, (7.1)
c𝓗Mξϕ^Mξϕξ𝑳2(𝝆Tξ,L)2\displaystyle c_{\boldsymbol{\mathcal{H}}_{M}^{\xi}}\|\widehat{{\bm{\phi}}}_{M}^{\xi}-\bm{\phi}^{\xi}\|^{2}_{\bm{L}^{2}(\bm{\rho}_{T}^{\xi,L})} 2inf𝝋ξ𝓗Mξ𝝋ξϕξ𝑳2(𝝆Tξ,L)2+2ϵ,\displaystyle\leq 2\inf_{\bm{\varphi}^{\xi}\in\boldsymbol{\mathcal{H}}_{M}^{\xi}}\|\bm{\varphi}^{\xi}-\bm{\phi}^{\xi}\|^{2}_{\bm{L}^{2}(\bm{\rho}_{T}^{\xi,L})}+2\epsilon\,,

provided that, for the first bound to hold,

M1152S02max{R,Rx˙}2K4ϵc𝓗MEA(log(𝒩(𝓗MEA,ϵ48S0max{R,Rx˙}2K4))+log(1δ)),M\geq\frac{1152S_{0}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4}}{\epsilon c_{\boldsymbol{\mathcal{H}}_{M}^{EA}}}\bigg{(}\log\Big{(}\mathcal{N}\Big{(}\boldsymbol{\mathcal{H}}^{EA}_{M},\frac{\epsilon}{48S_{0}\max\{R,R_{\dot{x}}\}^{2}K^{4}}\Big{)}\Big{)}+\log\Big{(}\frac{1}{\delta}\Big{)}\bigg{)}\,,

and similarly for the second inequality, using 𝓗Mξ\boldsymbol{\mathcal{H}}_{M}^{\xi}.

Proof  [of Theorem 7] We start by setting α=16\alpha=\frac{1}{6} in Proposition 21; it is easy to see that this choice yields the tightest bound in the argument below. To ease the notation we let ϕ^L,M,𝓗EAEA=ϕ^L,M,𝓗EAEϕ^L,M,𝓗EAA\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA}=\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{E}\oplus\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{A} and similarly for ϕ^L,,𝓗EAEA\widehat{\bm{\phi}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA}. From the Proposition, we have that

sup𝝋EA𝓗EA𝒟(𝝋EA)𝒟M(𝝋EA)𝒟(𝝋EA)+ϵ<12,\sup_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}}\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})-\mathcal{D}_{M}(\bm{\varphi}^{EA})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon}<\frac{1}{2},

holds true with probability

𝒫1𝒩(𝓗EA,ϵ48SEAmax{R,Rx˙}2K4)exp(c𝓗EAMϵ1152SEA2max{R,Rx˙}2K6).\mathcal{P}\geq 1-\mathcal{N}\bigg{(}\boldsymbol{\mathcal{H}}^{EA},\frac{\epsilon}{48S_{EA}\max\{R,R_{\dot{x}}\}^{2}K^{4}}\bigg{)}\exp\bigg{(}-\frac{c_{\boldsymbol{\mathcal{H}}^{EA}}M\epsilon}{1152S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{6}}\bigg{)}. (7.2)

This immediately implies, by choosing 𝝋EA=ϕ^L,M,𝓗EAEA\bm{\varphi}^{EA}=\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA} and reorganizing, that with probability 𝒫\mathcal{P}

𝒟(ϕ^L,M,𝓗EAEA)<2𝒟M(ϕ^L,M,𝓗EAEA)+ϵ.\mathcal{D}_{\infty}(\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA})<2\mathcal{D}_{M}(\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA})+\epsilon\,.

By definition of ϕ^L,M,𝓗EAEA\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA} as the minimizer of the empirical error functional 𝓔MEA\bm{\mathcal{E}}_{M}^{EA}, we see that

𝒟M(ϕ^L,M,𝓗EAEA)=𝓔MEA(ϕ^L,M,𝓗EAEA)𝓔MEA(ϕ^L,,𝓗EAEA)0,\mathcal{D}_{M}(\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA})=\bm{\mathcal{E}}_{M}^{EA}(\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA})-\bm{\mathcal{E}}_{M}^{EA}(\widehat{\bm{\phi}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA})\leq 0,

and combining this result with equation (B.13) from Proposition 15, we have

c𝓗EAϕ^L,M,𝓗EAEAϕ^L,,𝓗EAEA𝑳2(𝝆TEA,L)2𝒟(ϕ^L,M,𝓗EAEA)<ϵ,c_{\boldsymbol{\mathcal{H}}^{EA}}\|\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA}-\widehat{\bm{\phi}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}\leq\mathcal{D}_{\infty}(\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA})<\epsilon, (7.3)

with the final inequality holding with probability 𝒫\mathcal{P}. Now we bound the 𝑳2\bm{L}^{2} error of the empirical estimator to the true interaction kernel, so that with probability 𝒫\mathcal{P}

ϕ^L,M,𝓗EAEAϕEA𝑳2(𝝆TEA,L)2\displaystyle\|\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA}-{\bm{\phi}}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2} 2ϕ^L,M,𝓗EAEAϕ^L,,𝓗EAEA𝑳2(𝝆TEA,L)2+2ϕ^L,,𝓗EAEAϕEA𝑳2(𝝆TEA,L)2\displaystyle\leq 2\|\widehat{\bm{\phi}}_{L,M,\boldsymbol{\mathcal{H}}^{EA}}^{EA}-\widehat{\bm{\phi}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}+2\|\widehat{\bm{\phi}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA}-{\bm{\phi}}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}
2c𝓗EA(ϵ+inf𝝋EA𝓗EAK2𝝋EAϕEA𝑳2(𝝆TEA,L)2)\displaystyle\leq\frac{2}{c_{\boldsymbol{\mathcal{H}}^{EA}}}\bigg{(}\epsilon+\inf_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}}K^{2}\|\bm{\varphi}^{EA}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}\bigg{)}
2c𝓗EA(ϵ+inf𝝋EA𝓗EAK2max{R,Rx˙}2𝝋EAϕEA2).\displaystyle\leq\frac{2}{c_{\boldsymbol{\mathcal{H}}^{EA}}}\bigg{(}\epsilon+\inf_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}}K^{2}\max\{R,R_{\dot{x}}\}^{2}\|\bm{\varphi}^{EA}-\bm{\phi}^{EA}\|_{\infty}^{2}\bigg{)}.

The first inequality follows from the coercivity condition (6.1) and the definition of ϕ^EA\widehat{\bm{\phi}}^{EA}_{\infty}. The second follows by the definition of the norms. Now for a chosen 0<δ<10<\delta<1, let

1𝒩(𝓗EA,ϵ48SEAmax{R,Rx˙}2K4)exp(c𝓗EAMϵ1152SEA2max{R,Rx˙}2K6)1δ1-\mathcal{N}\bigg{(}\boldsymbol{\mathcal{H}}^{EA},\frac{\epsilon}{48S_{EA}\max\{R,R_{\dot{x}}\}^{2}K^{4}}\bigg{)}\exp\bigg{(}-\frac{c_{\boldsymbol{\mathcal{H}}^{EA}}M\epsilon}{1152S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{6}}\bigg{)}\geq 1-\delta

and solve for MM. The proof for the ξ\xi part of the system result is similar.  

7.2 Consistency

In the regime where MM\to\infty, we will choose an increasing sequence of hypothesis spaces, each satisfying the conditions of Theorem 7. By our assumptions on the kernels, we can also choose the sequence of 𝓗MEA\boldsymbol{\mathcal{H}}_{M}^{EA}’s such that the approximation error goes to 0 as MM\to\infty. This enables us to control the infimum on the right hand side of (7.1). From here we can apply Theorem 7 for each MM to prove the consistency of our estimators with respect to the 𝑳2(𝝆TEA,L)\bm{L}^{2}(\bm{\rho}_{T}^{EA,L}) norm and derive the following consistency theorem.

Theorem 8 (Strong Consistency)

Suppose that

{𝓗MEA}M=1𝑳(𝐑×𝐒E)𝑳(𝐑×𝐒A)\{\boldsymbol{\mathcal{H}}_{M}^{EA}\}_{M=1}^{\infty}\subset\boldsymbol{L^{\infty}}(\mathbf{R}\times\mathbf{S}^{E})\oplus\boldsymbol{L^{\infty}}(\mathbf{R}\times\mathbf{S}^{A})

is a family of compact and convex subsets such that the approximation error goes to zero,

inf𝝋EA𝓗MEA𝝋EAϕEAM0.\inf_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}_{M}^{EA}}\|\bm{\varphi}^{EA}-\bm{\phi}^{EA}\|_{\infty}\xrightarrow{M\rightarrow\infty}0.

Further suppose that the coercivity condition holds on M𝓗MEA\bigcup_{M}\boldsymbol{\mathcal{H}}_{M}^{EA}, and that M𝓗MEA\bigcup_{M}\boldsymbol{\mathcal{H}}_{M}^{EA} is compact in 𝐋(𝐑×𝐒E)𝐋(𝐑×𝐒A)\boldsymbol{L^{\infty}}(\mathbf{R}\times\mathbf{S}^{E})\oplus\boldsymbol{L^{\infty}}(\mathbf{R}\times\mathbf{S}^{A}). Then the estimator is strongly consistent with respect to the 𝐋2(𝛒TEA,L)\bm{L}^{2}(\bm{\rho}_{T}^{EA,L}) norm:

limMϕ^MEAϕEA𝑳2(𝝆TEA,L)=0 with probability one.\lim_{M\rightarrow\infty}\|\widehat{\bm{\phi}}^{EA}_{M}-\bm{\phi}^{EA}\|_{\boldsymbol{L}^{2}(\bm{\rho}_{T}^{EA,L})}=0\text{ with probability one.}

An analogous consistency result holds for the estimator in the ξ\xi variable.

These two results together provide a consistency result on the full estimation of the triple (ϕξ^,ϕE^,ϕA^)(\widehat{{\bm{\phi}}^{\xi}},\widehat{{\bm{\phi}}^{E}},\widehat{{\bm{\phi}}^{A}}) and thus consistency of our estimation procedure on the full system (2.2).

Proof [of Theorem 8 ] To simplify the notation, we use the same conventions as the proof of Theorem 7 and let 𝒟=𝒟L,,𝓗MEA\mathcal{D}_{\infty}=\mathcal{D}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}_{M}}. By definition of the coercivity constant in (6.1), we have the inequality cM𝓗MEAc𝓗MEAc_{\cup_{M}{\boldsymbol{\mathcal{H}}^{EA}_{M}}}\leq c_{\boldsymbol{\mathcal{H}}^{EA}_{M}}. From an analogous argument used to arrive at equation (7.3) in the proof of Theorem 7, we obtain that

cM𝓗MEAϕ^MEAϕEA𝑳2(𝝆TEA,L)2\displaystyle c_{\cup_{M}{\boldsymbol{\mathcal{H}}^{EA}_{M}}}\|\widehat{\bm{\phi}}^{EA}_{M}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2} 𝒟(ϕ^MEA)+𝓔EA(ϕ^EA).\displaystyle\leq\mathcal{D}_{\infty}(\widehat{\bm{\phi}}^{EA}_{M})+\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{\bm{\phi}}^{EA}_{\infty}). (7.4)

Let ϵ>0\epsilon>0; then inequality (7.4) gives us that

P𝝁𝒀{cM𝓗MEAϕ^MEAϕEA𝑳2(𝝆TEA,L)2ϵ}\displaystyle P_{\bm{\mu^{\boldsymbol{Y}}}}\{c_{\cup_{M}{\boldsymbol{\mathcal{H}}^{EA}_{M}}}\|\widehat{\bm{\phi}}^{EA}_{M}-\bm{\phi}^{EA}\|_{\boldsymbol{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}\geq\epsilon\} P𝝁𝒀{𝒟(ϕ^MEA)+𝓔EA(ϕ^EA)ϵ}\displaystyle\leq P_{\bm{\mu^{\boldsymbol{Y}}}}\{\mathcal{D}_{\infty}(\widehat{\bm{\phi}}^{EA}_{M})+\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{\bm{\phi}}^{EA}_{\infty})\geq\epsilon\}
P𝝁𝒀{𝒟(ϕ^MEA)ϵ2}+P𝝁𝒀{𝓔EA(ϕ^EA)ϵ2}.\displaystyle\leq P_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{\{}\mathcal{D}_{\infty}(\widehat{\bm{\phi}}^{EA}_{M})\geq\frac{\epsilon}{2}\bigg{\}}+P_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{\{}\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{\bm{\phi}}^{EA}_{\infty})\geq\frac{\epsilon}{2}\bigg{\}}.

We now bound the two terms in the above expression separately. For the first term, the proof of Theorem 7 shows that

P𝝁𝒀{𝓓(ϕ^MEA)ϵ2}\displaystyle P_{\bm{\mu^{\boldsymbol{Y}}}}\{\boldsymbol{\mathcal{D}}_{\infty}(\widehat{\bm{\phi}}^{EA}_{M})\geq\frac{\epsilon}{2}\} 𝒩(𝓗MEA,ϵC1)exp(c𝓗MEAMϵC2)\displaystyle\leq\mathcal{N}\bigg{(}\boldsymbol{\mathcal{H}}^{EA}_{M},\frac{\epsilon}{C_{1}}\bigg{)}\exp\bigg{(}-\frac{c_{\boldsymbol{\mathcal{H}}^{EA}_{M}}M\epsilon}{C_{2}}\bigg{)}
𝒩(M𝓗MEA,ϵC1)exp(cM𝓗MEAMϵC2)\displaystyle\leq\mathcal{N}\bigg{(}\cup_{M}{\boldsymbol{\mathcal{H}}^{EA}_{M}},\frac{\epsilon}{C_{1}}\bigg{)}\exp\bigg{(}-\frac{c_{\cup_{M}\boldsymbol{\mathcal{H}}^{EA}_{M}}M\epsilon}{C_{2}}\bigg{)}

where C1=96SEA2max{R,Rx˙}2K4C_{1}=96S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4}, C2=2304SEA2max{R,Rx˙}2K4C_{2}=2304S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4}, and 𝒩(M𝓗MEA,ϵC1)\mathcal{N}(\cup_{M}{\boldsymbol{\mathcal{H}}^{EA}_{M}},\frac{\epsilon}{C_{1}}) is finite because of the compactness assumption on M𝓗MEA\cup_{M}\boldsymbol{\mathcal{H}}^{EA}_{M}.

Summing this bound in MM we get that,

M=1P𝝁𝒀{𝒟(ϕ^MEA)ϵ2}\displaystyle\sum_{M=1}^{\infty}P_{\bm{\mu^{\boldsymbol{Y}}}}\{\mathcal{D}_{\infty}(\widehat{\bm{\phi}}^{EA}_{M})\geq\frac{\epsilon}{2}\} 𝒩(M𝓗MEA,ϵC1)M=1exp(cM𝓗MEAMϵC2)<.\displaystyle\leq\mathcal{N}\bigg{(}\cup_{M}{\boldsymbol{\mathcal{H}}^{EA}_{M}},\frac{\epsilon}{C_{1}}\bigg{)}\sum_{M=1}^{\infty}\exp\bigg{(}-\frac{c_{\cup_{M}\boldsymbol{\mathcal{H}}^{EA}_{M}}M\epsilon}{C_{2}}\bigg{)}<\infty.

For the second term, the bound (B.3) yields that

𝓔EA(ϕ^EA)4K4SEAmax{R,Rx˙}2inf𝝋EA𝓗MEA𝝋EAϕEAM0.\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{\bm{\phi}}^{EA}_{\infty})\leq 4K^{4}S_{EA}\max\{R,R_{\dot{x}}\}^{2}\inf_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}_{M}}\|\bm{\varphi}^{EA}-\bm{\phi}^{EA}\|_{\infty}\xrightarrow{M\rightarrow\infty}0.

Since ϵ\epsilon is fixed, the above result, together with our assumption on the sequence of hypothesis spaces, implies that P𝝁𝒀{𝓔EA(ϕ^EA)ϵ2}=0P_{\bm{\mu^{\boldsymbol{Y}}}}\Big{\{}\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{\bm{\phi}}^{EA}_{\infty})\geq\frac{\epsilon}{2}\Big{\}}=0 for MM sufficiently large. So we have M=1P𝝁𝒀{𝓔EA(ϕ^EA)ϵ2}<\sum_{M=1}^{\infty}P_{\bm{\mu^{\boldsymbol{Y}}}}\{\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{\bm{\phi}}^{EA}_{\infty})\geq\frac{\epsilon}{2}\}<\infty. The finiteness of the two sums above implies, by the first Borel-Cantelli Lemma, that

P𝝁𝒀{lim supM{cM𝓗MEAϕ^MEAϕEA𝑳2(𝝆TEA,L)2ϵ}}=0.P_{\bm{\mu^{\boldsymbol{Y}}}}\big{\{}\limsup_{{M\rightarrow\infty}}\{c_{\cup_{M}{\boldsymbol{\mathcal{H}}^{EA}_{M}}}\|\widehat{\bm{\phi}}^{EA}_{M}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}\geq\epsilon\}\big{\}}=0.

As ϵ\epsilon was arbitrary, we have the desired strong consistency of the estimator. An exactly analogous argument gives the result on the ξ\xi part of the system.  

7.3 Rate of convergence

Given data collected from MM trajectories, we would like to choose the best hypothesis space to maximize the accuracy of the estimators. Theorem 7 highlights the classical bias-variance tradeoff in our setting. On the one hand, we would like the hypothesis space 𝓗MEA\boldsymbol{\mathcal{H}}^{EA}_{M} to be large so that the bias

inf𝝋EA𝓗MEA𝝋EAϕEA𝑳2(𝝆TEA,L)2, or inf𝝋EA𝓗EA𝝋EAϕEA2,\inf_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}_{M}}\|\bm{\varphi}^{EA}-\bm{\phi}^{EA}\|^{2}_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}\,,\text{ or }\,\,\inf_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}}\|\bm{\varphi}^{EA}-\bm{\phi}^{EA}\|^{2}_{\infty}\,,

is small. Simultaneously, we would like 𝓗MEA\boldsymbol{\mathcal{H}}^{EA}_{M} to be small enough so that the covering number 𝒩(𝓗MEA,ϵ)\mathcal{N}(\boldsymbol{\mathcal{H}}^{EA}_{M},\epsilon) is small. Just as in nonparametric regression, our rate of convergence depends on a regularity condition of the true interaction kernels and properties of the hypothesis space, as is demonstrated in the following theorem. We establish the optimal (up to a log factor) min-max rate of convergence by choosing a hypothesis space in a sample size dependent manner.

Comments

Parts (a) and (c) of the theorem concern an approximation theory type rate of convergence where MM plays no role in the choice of hypothesis space, whereas parts (b) and (d) of the theorem present a minimax rate of convergence that chooses an adaptive hypothesis space depending on MM to achieve the optimal rate.

The splitting of the convergence result, between EAEA and ξ\xi parts, emphasizes a common theme of the paper: we leverage that the system can be decoupled for the learning process to improve the rate of convergence and performance in learning the estimators, but analytically we study it as the full coupled system, see the trajectory prediction result in Theorem 10.

A final important comment is that even though the dimension of the space in which we measure the error may be large, namely the dimension of 𝝆TEA,L\bm{\rho}_{T}^{EA,L} (which can be calculated as d(𝝆TEA,L)=kkp(k,k)E+(k,k)p(k,k)Ad(\bm{\rho}_{T}^{EA,L})=\sum_{kk^{\prime}}p_{(k,k^{\prime})}^{E}+\sum_{(k,k^{\prime})}p_{(k,k^{\prime})}^{A}), we exploit the structure of the system in such a way that our convergence rate only depends on the number of unique variables across all of the estimators. This number is given by the number of variables in 𝓥\bm{\mathcal{V}}, denoted as |𝓥||\bm{\mathcal{V}}|, for the EAEA portion and by |𝓥ξ||\bm{\mathcal{V}}_{\xi}| for the ξ\xi portion of the system. We note that we are not predicting the number of variables nor their form; they are assumed known. This applies to both the pairwise interaction variables and the feature maps.

Theorem 9 (Rate of Convergence)

Let ϕ^EA:=ϕ^MEϕ^MA\widehat{\bm{\phi}}^{EA}:=\widehat{\bm{\phi}}_{M}^{E}\oplus\widehat{\bm{\phi}}_{M}^{A} denote the minimizer of the empirical error functional 𝓔MEA\bm{\mathcal{E}}_{M}^{EA} (defined in (4.2)) over the hypothesis space 𝓗MEA\boldsymbol{\mathcal{H}}^{EA}_{M}.
(a) Let the hypothesis space be chosen as the direct sum of the admissible spaces, namely 𝓗EA=𝓚SEE𝓚SAA,\boldsymbol{\mathcal{H}}^{EA}=\boldsymbol{\mathcal{K}}_{S_{E}}^{E}\oplus\boldsymbol{\mathcal{K}}_{S_{A}}^{A}, and assume that the coercivity condition (6.1) holds true on it.

Then, there exists a constant CC depending only on K,SEA,R,Rx˙K,S_{EA},R,R_{\dot{x}} such that

𝔼𝝁𝒀[ϕ^MEAϕEA𝑳2(𝝆TEA,L)2]Cc𝓗EAM1|𝓥|+1.\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\Big{[}\|\widehat{\bm{\phi}}^{EA}_{M}-\bm{\phi}^{EA}\|^{2}_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}\Big{]}\leq\frac{C}{c_{\boldsymbol{\mathcal{H}}^{EA}}}M^{-\frac{1}{|\bm{\mathcal{V}}|+1}}.

(b) Assume that {𝓛n}n=1\{\boldsymbol{\mathcal{L}}_{n}\}_{n=1}^{\infty} is a sequence of finite-dimensional linear subspaces of 𝐋(𝐑×𝐒E)𝐋(𝐑×𝐒A)\boldsymbol{L}^{\infty}(\mathbf{R}\times\mathbf{S}^{E})\oplus\boldsymbol{L}^{\infty}(\mathbf{R}\times\mathbf{S}^{A}) satisfying the dimension and approximation constraints

dim(𝓛n)c0K2n|𝓥|,inf𝝋EA𝓛n𝝋EAϕEAc1ns,\displaystyle\text{dim}(\boldsymbol{\mathcal{L}}_{n})\leq c_{0}K^{2}n^{|\bm{\mathcal{V}}|}\,,\quad\inf_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{L}}_{n}}\|\bm{\varphi}^{EA}-\bm{\phi}^{EA}\|_{\infty}\leq c_{1}n^{-s}, (7.5)

for some fixed constants c0,c1c_{0},c_{1} representing dimension-independent approximation characteristics of the linear subspaces, and s>0s>0 related to the regularity of the kernels. The value nn can be thought of as the number of basis functions along each of the |𝓥||\bm{\mathcal{V}}| axes. Suppose the coercivity condition holds true on the set n𝓛n\cup_{n}\boldsymbol{\mathcal{L}}_{n}. Define 𝓑n\boldsymbol{\mathcal{B}}_{n} to be the closed ball centered at the origin of radius (c1+SEA)(c_{1}+S_{EA}) in 𝓛n\boldsymbol{\mathcal{L}}_{n}. Let cMEA:=cM𝓗MEAc_{M}^{EA}:=c_{\bigcup_{M}\boldsymbol{\mathcal{H}}^{EA}_{M}}. If we choose the hypothesis space as 𝓗M=𝓑(MlogM)12s+|𝓥|\boldsymbol{\mathcal{H}}_{M}=\boldsymbol{\mathcal{B}}_{(\frac{M}{\log M})^{\frac{1}{2s+|\bm{\mathcal{V}}|}}}, then there exists a constant CC depending on K,R,Rx˙,SEA,c0,c1,sK,R,R_{\dot{x}},S_{EA},c_{0},c_{1},s such that we achieve the convergence rate,

𝔼𝝁𝒀[ϕ^MEAϕEA𝑳2(𝝆TEA,L)2]CcMEA(logMM)2s2s+|𝓥|.\displaystyle\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\Big{[}\|\widehat{\bm{\phi}}^{EA}_{M}-\bm{\phi}^{EA}\|^{2}_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}\Big{]}\leq\frac{C}{c_{M}^{EA}}\left(\frac{\log M}{M}\right)^{\frac{2s}{2s+|\bm{\mathcal{V}}|}}\,. (7.6)

(c) Under the corresponding assumptions as in (a), there exists a constant CC depending only on K,Sξ,RK,S_{\xi},R such that

𝔼𝝁𝒀[ϕ^Mξϕξ𝑳2(𝝆Tξ,L)2]Cc𝓗ξM1|𝓥ξ|+1.\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\Big{[}\|\widehat{\bm{\phi}}_{M}^{\xi}-{\bm{\phi}}^{\xi}\|^{2}_{\bm{L}^{2}(\bm{\rho}_{T}^{\xi,L})}\Big{]}\leq\frac{C}{c_{\boldsymbol{\mathcal{H}}^{\xi}}}M^{-\frac{1}{|\bm{\mathcal{V}}_{\xi}|+1}}.

(d) Under the corresponding assumptions as in (b), and with cMξ:=cM𝓗Mξc_{M}^{\xi}:=c_{\bigcup_{M}\boldsymbol{\mathcal{H}}^{\xi}_{M}}, there exists a constant CC depending only on K,R,Sξ,c0,c1,sK,R,S_{\xi},c_{0},c_{1},s such that

𝔼𝝁𝒀[ϕ^Mξϕξ𝑳2(𝝆Tξ,L)2]CcMξ(logMM)2s2s+|𝓥ξ|.\displaystyle\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\Big{[}\|\widehat{\bm{\phi}}^{\xi}_{M}-{\bm{\phi}}^{\xi}\|^{2}_{\bm{L}^{2}(\bm{\rho}_{T}^{\xi,L})}\Big{]}\leq\frac{C}{c_{M}^{\xi}}\left(\frac{\log M}{M}\right)^{\frac{2s}{2s+|\bm{\mathcal{V}}_{\xi}|}}\,. (7.7)

We in fact prove bounds not only in expectation, but also with high probability, for every fixed large-enough MM, as the proof will show.
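To illustrate the sample-size-dependent choice in part (b), the short sketch below computes the number of basis functions per variable, nM=(M/logM)1/(2s+|𝓥|)n_{M}=(M/\log M)^{1/(2s+|\bm{\mathcal{V}}|)}, and the corresponding predicted squared-error rate (logM/M)2s/(2s+|𝓥|)(\log M/M)^{2s/(2s+|\bm{\mathcal{V}}|)}; the values of ss and |𝓥||\bm{\mathcal{V}}| are illustrative.

import numpy as np

def adaptive_dimension(M, s=1.0, n_vars=2):
    # Basis functions per variable and the predicted squared-error rate from part (b).
    n_M = (M / np.log(M)) ** (1.0 / (2 * s + n_vars))
    rate = (np.log(M) / M) ** (2 * s / (2 * s + n_vars))
    return int(np.ceil(n_M)), rate

for M in [100, 1_000, 10_000]:
    n_M, rate = adaptive_dimension(M)
    print(f"M = {M:6d}:  n_M = {n_M:3d},  predicted squared-error rate ~ {rate:.3e}")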

Proof [of Theorem 9] For part (a), let 𝓗=𝓚SEE𝓚SAA\boldsymbol{\mathcal{H}}=\boldsymbol{\mathcal{K}}_{S_{E}}^{E}\oplus\boldsymbol{\mathcal{K}}_{S_{A}}^{A}. Standard results on covering numbers of function spaces (see Theorem 2.7.1 of [77]) give us that the covering number of 𝓗\bm{\mathcal{H}} satisfies

𝒩(𝓗,ϵ,)C𝓗exp(K2(1ϵ)|𝓥|)\mathcal{N}(\boldsymbol{\mathcal{H}},\epsilon,\|\cdot\|_{\infty})\leq C_{\boldsymbol{\mathcal{H}}}\exp\bigg{(}K^{2}\bigg{(}\frac{1}{\epsilon}\bigg{)}^{|\bm{\mathcal{V}}|}\bigg{)}

for some absolute constant C𝓗C_{\boldsymbol{\mathcal{H}}} depending only on 𝓗\boldsymbol{\mathcal{H}} and |𝓥||\bm{\mathcal{V}}|. By assumption on the hypothesis space, we have that

inf𝝋EA𝓗𝝋EAϕEA2=0.\inf_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}}\|\bm{\varphi}^{EA}-\bm{\phi}^{EA}\|^{2}_{\infty}=0.

From this, the concentration estimate (7.1) together with the covering number bound imply that,

P𝝁𝒀{ϕ^L,M,𝓗EAϕEA𝑳2(𝝆TEA,L)2>ϵ}\displaystyle P_{\bm{\mu^{\boldsymbol{Y}}}}\{\|\widehat{\bm{\phi}}^{EA}_{L,M,\boldsymbol{\mathcal{H}}}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}>\epsilon\} 𝒩(𝓗,C1ϵ,)exp(C2Mϵ)\displaystyle\leq\mathcal{N}(\boldsymbol{\mathcal{H}},C_{1}\epsilon,\|\cdot\|_{\infty})\exp(-C_{2}M\epsilon)
C𝓗exp(K2(C1ϵ)|𝓥|C2Mϵ)\displaystyle\leq C_{\boldsymbol{\mathcal{H}}}\exp(K^{2}(C_{1}\epsilon)^{-|\bm{\mathcal{V}}|}-C_{2}M\epsilon) (7.8)

where C1=c𝓗48SEAmax{R,Rx˙}2K4C_{1}=\frac{c_{\boldsymbol{\mathcal{H}}}}{48S_{EA}\max\{R,R_{\dot{x}}\}^{2}K^{4}} and C2=c𝓗1152SEA2max{R,Rx˙}2K4C_{2}=\frac{c_{\boldsymbol{\mathcal{H}}}}{1152S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4}}. Next, define the function

g(ϵ):=K2(C1ϵ)|𝓥|C2Mϵ2,g(\epsilon):=K^{2}(C_{1}\epsilon)^{-|\bm{\mathcal{V}}|}-\frac{C_{2}M\epsilon}{2},

whose zero will determine the threshold in the desired probability bound.

By direct calculation, $g(\epsilon)=0$ if we choose $\epsilon=\epsilon_{M}=(\frac{C_{3}}{M})^{\frac{1}{|\bm{\mathcal{V}}|+1}}$, where $C_{3}=\frac{2K^{2}}{C_{2}C_{1}^{|\bm{\mathcal{V}}|}}$. It is then an easy computation to see that the derivative of $g(\epsilon)$ is $\leq 0$ for all $\epsilon\geq\epsilon_{M}$, so that $g(\epsilon)\leq 0$ on $[\epsilon_{M},\infty)$. Thus, we can put this result into the bound (7.8) to arrive at the probability bound,

P𝝁𝒀{ϕ^L,M,𝓗EAϕEA𝑳2(𝝆TEA,L)2>ϵ}{exp(C2Mϵ2),ϵϵM1,ϵϵMP_{\bm{\mu^{\boldsymbol{Y}}}}\{\|\widehat{\bm{\phi}}^{EA}_{L,M,\boldsymbol{\mathcal{H}}}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}>\epsilon\}\leq\begin{cases}\exp\Big{(}\frac{-C_{2}M\epsilon}{2}\Big{)},&\epsilon\geq\epsilon_{M}\\ 1,&\epsilon\leq\epsilon_{M}\end{cases} (7.9)

Integrating this bound over ϵ(0,+)\epsilon\in(0,+\infty) and using the elementary inequality exx+1e^{-x}\leq x+1 for all x0x\geq 0, we get that

0P𝝁𝒀{ϕ^L,M,𝓗EAϕEA𝑳2(𝝆TEA,L)2>ϵ}𝑑ϵ(C4M)1|𝓥|+1+O(1M)\int_{0}^{\infty}P_{\bm{\mu^{\boldsymbol{Y}}}}\{\|\widehat{\bm{\phi}}^{EA}_{L,M,\boldsymbol{\mathcal{H}}}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}>\epsilon\}d\epsilon\leq\Big{(}\frac{C_{4}}{M}\Big{)}^{\frac{1}{|\bm{\mathcal{V}}|+1}}+O\Big{(}\frac{1}{M}\Big{)}

Now, bringing the coercivity part from (7.1) back in, we achieve the rate,

𝔼𝝁𝒀[ϕ^L,M,𝓗EAϕEA𝑳2(𝝆TEA,L)2]C4c𝓗M1|𝓥|+1,\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}[\|\widehat{\bm{\phi}}^{EA}_{L,M,\boldsymbol{\mathcal{H}}}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}]\leq\frac{C_{4}}{c_{\boldsymbol{\mathcal{H}}}}M^{-\frac{1}{|\bm{\mathcal{V}}|+1}},

where C4C_{4} is an absolute constant that only depends on K,SEA,R,Rx˙K,S_{EA},R,R_{\dot{x}}.

For part (b), we note the following basic result on the covering number of 𝓑n\boldsymbol{\mathcal{B}}_{n} by ϵ\epsilon-balls (see [31, Proposition 5]),

𝒩(𝓑n,ϵ,)(4(c1+SEA)ϵ)c0K2n|𝓥|.\mathcal{N}(\bm{\mathcal{B}}_{n},\epsilon,\|\cdot\|_{\infty})\leq\bigg{(}\frac{4(c_{1}+S_{EA})}{\epsilon}\bigg{)}^{c_{0}K^{2}n^{|\bm{\mathcal{V}}|}}.

Using (7.1), and the approximation assumption, we bound the probability as

P𝝁𝒀{ϕ^L,M,𝓑nEAϕEA𝑳2(𝝆TEA,L)2>ϵ+c2n2s}\displaystyle P_{\bm{\mu^{\boldsymbol{Y}}}}\{\|\widehat{\bm{\phi}}^{EA}_{L,M,\bm{\mathcal{B}}_{n}}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}>\epsilon+c_{2}n^{-2s}\} (7.10)
=P𝝁𝒀{ϕ^L,M,𝓑nEAϕEA𝑳2(𝝆TEA,L)2>tn2s+c2n2s}\displaystyle=P_{\bm{\mu^{\boldsymbol{Y}}}}\{\|\widehat{\bm{\phi}}^{EA}_{L,M,\bm{\mathcal{B}}_{n}}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}>t^{\prime}n^{-2s}+c_{2}n^{-2s}\}
=P𝝁𝒀{ϕ^L,M,𝓑nEAϕEA𝑳2(𝝆TEA,L)2>tn2s}\displaystyle=P_{\bm{\mu^{\boldsymbol{Y}}}}\{\|\widehat{\bm{\phi}}^{EA}_{L,M,\bm{\mathcal{B}}_{n}}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}>tn^{-2s}\}
𝒩(𝓑n,c3tn2s,)exp(c4Mtn2s)\displaystyle\leq\mathcal{N}\Big{(}\bm{\mathcal{B}}_{n},c_{3}^{\prime}tn^{-2s},\|\cdot\|_{\infty}\Big{)}\exp(-c_{4}Mtn^{-2s})
(c3tn2s)c0K2n|𝓥|exp(c4Mtn2s)\displaystyle\leq\Big{(}\frac{c_{3}}{tn^{-2s}}\Big{)}^{c_{0}K^{2}n^{|\bm{\mathcal{V}}|}}\exp(-c_{4}Mtn^{-2s})
exp(c0K2n|𝓥|log(c3)+c0K2n|𝓥||log(tn2s)|c4Mtn2s),\displaystyle\leq\exp(c_{0}K^{2}n^{|\bm{\mathcal{V}}|}\log(c_{3})+c_{0}K^{2}n^{|\bm{\mathcal{V}}|}|\log(tn^{-2s})|-c_{4}Mtn^{-2s}),

where c2=1cn𝓛nc1c_{2}=\frac{1}{c_{\cup_{n}\boldsymbol{\mathcal{L}}_{n}}}c_{1}, c3=cn𝓛n48(SEA+c1)max{R,Rx˙}2K4c_{3}^{\prime}=\frac{c_{\cup_{n}\boldsymbol{\mathcal{L}}_{n}}}{48(S_{EA}+c_{1})\max\{R,R_{\dot{x}}\}^{2}K^{4}}, c3=192(SEA+c1)2max{R,Rx˙}2K4cn𝓛nc_{3}=\frac{192(S_{EA}+c_{1})^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4}}{c_{\cup_{n}\boldsymbol{\mathcal{L}}_{n}}}, and c4=cn𝓛n1152(SEA+c1)2max{R,Rx˙}2K4c_{4}=\frac{c_{\cup_{n}\boldsymbol{\mathcal{L}}_{n}}}{1152(S_{EA}+c_{1})^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4}} are absolute constants independent of MM. Define

g(n):=c0n|𝓥|K2log(c3)+c0n|𝓥|K2|log(tn2s)|c42Mtn2s.g(n):=c_{0}n^{|\bm{\mathcal{V}}|}K^{2}\log(c_{3})+c_{0}n^{|\bm{\mathcal{V}}|}K^{2}|\log(tn^{-2s})|-\frac{c_{4}}{2}Mtn^{-2s}.

To find the optimal $n$ in terms of $M$, we minimize $g$ over $n$. Taking a derivative and solving the corresponding equation, we see that the optimal $n$ is

n=O((MlogM)12s+|𝓥|),n_{*}=O\bigg{(}\Big{(}\frac{M}{\log M}\Big{)}^{\frac{1}{2s+|\bm{\mathcal{V}}|}}\bigg{)},

with constant independent of $M$ and only depending on $c_{3},c_{4},c_{2}$. For convenience we will choose $n_{*}=\lfloor(\frac{M}{\log M})^{\frac{1}{2s+|\bm{\mathcal{V}}|}}\rfloor$. Now let $\epsilon_{M}=(\frac{\log M}{M})^{\frac{2s}{2s+|\bm{\mathcal{V}}|}}$ and consider

h(\epsilon)=c_{0}n_{*}^{|\bm{\mathcal{V}}|}K^{2}\log(c_{3})+c_{0}n_{*}^{|\bm{\mathcal{V}}|}K^{2}|\log(\epsilon)|-\frac{c_{4}}{2}M\epsilon.

As before, let ϵ=tn2s=tϵM\epsilon=tn_{*}^{-2s}=t\epsilon_{M} and consider h(tϵM)h(t\epsilon_{M}). It is easy to see that limt0+h(tϵM)=\lim_{t\rightarrow 0^{+}}h(t\epsilon_{M})=\infty and limth(tϵM)=\lim_{t\rightarrow\infty}h(t\epsilon_{M})=-\infty. Together with the continuity of hh, these facts imply that there exists a constant c5c_{5}, depending on K,c0,c2,c3,c4K,c_{0},c_{2},c_{3},c_{4} such that h(c5ϵM)=0h(c_{5}\epsilon_{M})=0. We further need that h(ϵ)0h^{\prime}(\epsilon)\leq 0 for all ϵc5ϵM\epsilon\geq c_{5}\epsilon_{M}. By taking the derivative of hh, setting it 0\leq 0, we find that this condition eventually holds by basic calculus on hh. Therefore, if needed to satisfy the derivative condition, we can enlarge the constant c5c_{5} to a constant c6c_{6} (independent of MM) such that h(ϵ)0h(\epsilon)\leq 0 and h(ϵ)0h^{\prime}(\epsilon)\leq 0 for all ϵc6ϵM\epsilon\geq c_{6}\epsilon_{M}. These results imply the probability bound,

P𝝁𝒀{ϕ^L,M,𝓑nEAϕEA𝑳2(𝝆TEA,L)2>ϵ}{exp(c4Mϵ2),ϵc6ϵM1,ϵc6ϵMP_{\bm{\mu^{\boldsymbol{Y}}}}\{\|\widehat{\bm{\phi}}^{EA}_{L,M,\boldsymbol{\mathcal{B}}_{n_{*}}}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}>\epsilon\}\leq\begin{cases}\exp\Big{(}\frac{-c_{4}M\epsilon}{2}\Big{)},&\epsilon\geq c_{6}\epsilon_{M}\\ 1,&\epsilon\leq c_{6}\epsilon_{M}\end{cases} (7.11)

Integrating this bound over ϵ(0,+)\epsilon\in(0,+\infty) and using the elementary inequality exx+1e^{-x}\leq x+1 for all x0x\geq 0, we get that

0P𝝁𝒀{ϕ^L,M,𝓑nEAϕEA𝑳2(𝝆TEA,L)2>ϵ}𝑑ϵC1(logMM)2s2s+|𝓥|,\int_{0}^{\infty}P_{\bm{\mu^{\boldsymbol{Y}}}}\{\|\widehat{\bm{\phi}}^{EA}_{L,M,\boldsymbol{\mathcal{B}}_{n_{*}}}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}>\epsilon\}d\epsilon\leq C_{1}\Big{(}\frac{\log M}{M}\Big{)}^{\frac{2s}{2s+|\bm{\mathcal{V}}|}},

where C1C_{1} is a constant depending on c0,c1,s,K,SEA,R,Rx˙c_{0},c_{1},s,K,S_{EA},R,R_{\dot{x}}. Now with 𝓗MEA=𝓑n\boldsymbol{\mathcal{H}}^{EA}_{M}=\boldsymbol{\mathcal{B}}_{n_{*}} and using (7.1), we have shown the convergence rate,

𝔼𝝁𝒀[ϕ^L,M,𝓗MEAEAϕEA𝑳2(𝝆TEA,L)2]c7cM𝓗MEA(MlogM)2s2s+|𝓥|,\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}[\|\widehat{\bm{\phi}}^{EA}_{L,M,\boldsymbol{\mathcal{H}}^{EA}_{M}}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}]\leq\frac{c_{7}}{c_{\bigcup_{M}\boldsymbol{\mathcal{H}}^{EA}_{M}}}\Big{(}\frac{M}{\log M}\Big{)}^{-\frac{2s}{2s+|\bm{\mathcal{V}}|}},

where c7c_{7} is an absolute constant that only depends on s,K,c0,c1,SEA,R,Rx˙s,K,c_{0},c_{1},S_{EA},R,R_{\dot{x}}.

 

In both theorems, the convergence rates $\frac{2s}{2s+|\bm{\mathcal{V}}|}$ and $\frac{2s}{2s+|\bm{\mathcal{V}}_{\xi}|}$ coincide with the min-max rate of convergence for nonparametric regression in the corresponding dimension, up to the logarithmic factor. This logarithmic factor could possibly be removed (see the techniques in Chapters 11-15 of [39]), but at the cost of considerable additional complexity in the proof. Achieving the same rate of convergence as if we had observed noisy values of the interaction kernels directly, rather than through the dynamics, is a major strength of our approach. The strong consistency results show the asymptotic optimality of our method, and the assumptions of the theorems apply to wide classes of systems. Specifically, for part (b) of the theorems, the dimension and approximation conditions can be explicitly achieved by piecewise polynomials or splines adapted to the regularity of the kernel. In the conditions of theorem 9, $n$ can be taken to be the number of partitions along each axis of the variables in $\bm{\mathcal{V}}$. Then, using multivariate splines or piecewise polynomials, the dimension of the linear space is a fixed constant $c_{0}$ (corresponding to the number of parameters to estimate for each function) times $K^{2}n^{|\bm{\mathcal{V}}|}$. Furthermore, by standard approximation theory results (see [70], Chapters 12-13, [33], [35]), for $s$ the regularity of the interaction kernels, piecewise polynomials of degree $\lfloor s\rfloor$ achieve the desired approximation condition. In our admissible spaces we have $s=1$; the theorems give faster rates for kernels of higher regularity.
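To make the choice of hypothesis space in Theorem 9(b) concrete, the following sketch (in Python; the constant $c_0$ and the inputs are illustrative assumptions, not values taken from our experiments) computes the number of partitions per axis $n_{*}$, the dimension of the resulting tensor-product piecewise-polynomial space, and the predicted squared-error rate.

import math

def hypothesis_space_size(M, s, card_V, K=1, c0=4):
    # Choose n_* = floor((M / log M)^(1/(2s + |V|))) as in Theorem 9(b),
    # report the dimension c0 * K^2 * n_*^|V| of the tensor-product space,
    # and the predicted squared-error rate (log M / M)^(2s/(2s + |V|)).
    # c0 is a basis-dependent constant; the value 4 is a placeholder.
    n_star = math.floor((M / math.log(M)) ** (1.0 / (2 * s + card_V)))
    dim = c0 * K ** 2 * n_star ** card_V
    rate = (math.log(M) / M) ** (2 * s / (2 * s + card_V))
    return n_star, dim, rate

# Example: M = 750 trajectories, Lipschitz kernels (s = 1), |V| = 2 shared variables.
print(hypothesis_space_size(750, s=1, card_V=2))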

We next briefly examine the convergence rate for a few systems of fundamental interest. Recall that the final two columns of table 2 list the values $|\bm{\mathcal{V}}|,|\bm{\mathcal{V}}_{\xi}|$; these correspond directly to the rate of convergence for each system under our learning approach.

Some specific highlights:

  • For Anticipation Dynamics (AD), even though we are learning both an energy and an alignment kernel, they share only 2 unique variables, so we learn at the 2-dimensional rate.

  • For the Synchronized Oscillator we achieve the 2-dimensional optimal learning rate on each of the $EA$ and $\xi$ portions (rather than a 4-dimensional rate) due to the decoupled nature of the system; similarly, we only pay the 1-dimensional rate twice for the Phototaxis system. This is a key reason for splitting our learning theory between $EA$- and $\xi$-interaction kernels and accounting for shared and non-shared variables: it substantially improves the performance guarantees in actual applications.

  • Due to the design of the measures, norms and the associated learning algorithm, even in the heterogeneous cases of celestial mechanics and predator-swarm dynamics we only pay the 1-dimensional learning rate, although the constants are of course affected by the heterogeneity and the algorithm requires a larger learning matrix.

  • The rates of convergence of our estimators for all previously-studied first-order systems (see [52, 51, 83]) can be derived from Theorem 9.

One downside of the results above is the lack of dependence on $L$: it seems natural that finer time sampling of each trajectory should improve the results. Indeed, the numerical experiments of [83, 52, 51] demonstrate that more data in $L$ can improve performance. One technique used in [83] for very long trajectory data (large $L$, medium to small $M$) is to split each trajectory into many shorter pieces, yielding a larger $M$ with a smaller $L$ each; a minimal sketch of this reshaping follows. In this way, one can explicitly obtain the convergence rates of the theorems above. We do not believe this leads to significantly different performance compared to using the original data; it simply transforms the data into a form to which the theorems apply, with no loss of performance. Explicit dependence on $N$ is not the objective of this work (see [13]), but further study of the mean-field regime is of interest to the authors and work is ongoing.
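A minimal sketch of this trajectory-splitting step, in Python with NumPy (the array layout is an assumption made for illustration):

import numpy as np

def split_trajectories(data, pieces):
    # Split M trajectories with L observation times each into M * pieces shorter
    # trajectories with L // pieces times each, so that the theorems, whose rates
    # improve in M, apply directly.  data: array of shape (M, L, D), where D is
    # the dimension of a single observed snapshot of the system.
    M, L, D = data.shape
    L_new = L // pieces
    trimmed = data[:, : L_new * pieces, :]   # drop any incomplete trailing piece
    return trimmed.reshape(M * pieces, L_new, D)

# Example: 10 long trajectories with 500 snapshots become 100 trajectories with 50 each.
Y = np.random.rand(10, 500, 40)
print(split_trajectories(Y, 10).shape)   # (100, 50, 40)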

8 Performance of trajectory prediction

Once estimators $\widehat{\bm{\varphi}}^{EA},\widehat{\bm{\varphi}}^{\xi}$ are obtained, a natural question is the accuracy of the trajectories evolved with these estimated kernels. The next theorem shows that the prediction error is (i) bounded trajectory-wise by a continuous-time version of the error functional, and (ii) bounded on average by the $\bm{L}^{2}(\bm{\rho}_{T}^{EA})$ and $\bm{L}^{2}(\bm{\rho}_{T}^{\xi})$ errors of the estimators, respectively. This further validates the effectiveness of our error functional and $\bm{L}^{2}(\bm{\rho}_{T})$-metric for assessing the quality of the estimator. In particular, it emphasizes that although the system is a coupled system of ODEs, our decoupled learning procedure with our choice of norm controls the expected supremum error of the trajectories, as long as we minimize the $\bm{L}^{2}(\bm{\rho}_{T}^{EA}),\bm{L}^{2}(\bm{\rho}_{T}^{\xi})$ norms in obtaining our estimators.

Theorem 10

Suppose that ϕ^E𝓚SEE\widehat{\bm{\phi}}^{E}\in\boldsymbol{\mathcal{K}}_{S_{E}}^{E}, ϕ^A𝓚SAA\widehat{\bm{\phi}}^{A}\in\boldsymbol{\mathcal{K}}_{S_{A}}^{A} and ϕ^ξ𝓚Sξξ\widehat{\bm{\phi}}^{\xi}\in\boldsymbol{\mathcal{K}}_{S_{\xi}}^{\xi}. Denote by 𝐘^(t)\widehat{\boldsymbol{Y}}(t) and 𝐘(t)\boldsymbol{Y}(t) the solutions of the systems with kernels ϕ^E=(ϕ^kkE)k,k=1K,K,ϕ^A=(ϕ^kkA)k,k=1K,K\widehat{\bm{\phi}}^{E}=(\widehat{\phi}_{kk^{\prime}}^{E})_{k,k^{\prime}=1}^{K,K},\widehat{\bm{\phi}}^{A}=(\widehat{\phi}_{kk^{\prime}}^{A})_{k,k^{\prime}=1}^{K,K}, and ϕ^ξ=(ϕ^kkξ)k,k=1K,K\widehat{\bm{\phi}}^{\xi}=(\widehat{\phi}_{kk^{\prime}}^{\xi})_{k,k^{\prime}=1}^{K,K} and ϕE,ϕA,ϕξ{\bm{\phi}}^{E},{\bm{\phi}}^{A},{\bm{\phi}}^{\xi} respectively, both with the same initial condition. Then

supt[0,T]𝒀^(t)𝒀(t)𝒴2\displaystyle\sup_{t\in[0,T]}\|\widehat{\boldsymbol{Y}}(t)-\boldsymbol{Y}(t)\|_{\mathcal{Y}}^{2} g(T)[2T2p=0ts=0p𝑿¨𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟ϕ^EA(𝑿,𝑽,𝚵)𝒮2 ds dp\displaystyle\leq g(T)\bigg{[}2T^{2}\int_{p=0}^{t}\int_{s=0}^{p}\|\ddot{\boldsymbol{X}}-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}\text{ ds dp}
+2Ts=0t𝑿¨𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟ϕ^EA(𝑿,𝑽,𝚵)𝒮2𝑑s\displaystyle+2T\int_{s=0}^{t}\|\ddot{\boldsymbol{X}}-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}ds
+2Ts=0t𝚵˙𝐟nc,ξ(𝑿,𝑽,𝚵)𝐟ϕ^ξ(𝑿,𝑽,𝚵)𝒮2ds]\displaystyle+2T\int_{s=0}^{t}\|\dot{\boldsymbol{\Xi}}-\mathbf{f}^{\text{nc},\xi}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{{\bm{\phi}}}_{\xi}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}ds\bigg{]}

where g(T)=1+(1+B1T)Texp(A1T+T2/2)g(T)=1+(1+B_{1}T)T\exp(A_{1}T+T^{2}/2). The constants are A1=2T(8KP++8QK+ξ)A_{1}=2T(8KP+\mathcal{L}+8QK+\mathcal{L}^{\xi}) and B1=2T2(8KP+)B_{1}=2T^{2}(8KP+\mathcal{L}), with any unspecified constants made precise in the proof and only depending on the Lipschitz constants of the noncollective forces and the feature maps, as well as the values SE,SA,SξS^{E},S^{A},S^{\xi} coming from the admissible spaces. It is bounded on average, with respect to the initial distribution 𝛍𝐘\bm{\mu^{\boldsymbol{Y}}}, by

𝔼𝝁𝒀[supt[0,T]𝒀^(t)𝒀(t)𝒴2]\displaystyle\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}[\sup_{t\in[0,T]}\|\widehat{\boldsymbol{Y}}(t)-\boldsymbol{Y}(t)\|_{\mathcal{Y}}^{2}] g(T)((T2K2+TK2)ϕ^EAϕEA𝑳2(𝝆TEA)2\displaystyle\leq g(T)\bigg{(}(T^{2}K^{2}+TK^{2})\|\widehat{\bm{\phi}}^{EA}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA})}^{2}
+TK2ϕ^ξϕξ𝑳2(𝝆Tξ)2)\displaystyle+TK^{2}\|\widehat{\bm{\phi}}^{\xi}-\bm{\phi}^{\xi}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{\xi})}^{2}\bigg{)} (8.1)

with the measures $\bm{\rho}_{T}^{EA},\bm{\rho}_{T}^{\xi}$ defined in (5.1, G.1). The bound (8.1) shows that by minimizing the right hand side, we can control the expected $\mathcal{Y}$-supremum error of the estimated trajectories.

We postpone the somewhat lengthy proof to Appendix A.
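As a practical counterpart to Theorem 10, the discrepancy on its left-hand side can be estimated on a discrete time grid; the following sketch assumes both trajectories are sampled at the same times and approximates the $\mathcal{Y}$-norm by a weighted Euclidean norm over agents (the uniform default weights are an assumption).

import numpy as np

def sup_trajectory_error(Y_true, Y_hat, weights=None):
    # Approximate sup_{t in [0,T]} ||Y_hat(t) - Y(t)||_Y^2 on a discrete grid.
    # Y_true, Y_hat: arrays of shape (L, N, D): time index, agent index, state
    # components of a single agent.  weights: optional per-agent weights
    # (e.g. 1/N_k for agents of type k); defaults to the uniform weight 1/N.
    L, N, D = Y_true.shape
    if weights is None:
        weights = np.full(N, 1.0 / N)
    diff = Y_true - Y_hat
    per_time = np.einsum("lnd,n->l", diff ** 2, weights)  # squared norm at each time
    return per_time.max()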

9 Applications

Our learning theory, as well as measures, norms, functionals etc. can be applied to study all the examples considered in the works [52, 51, 83]. These examples, particularly those of [83], can thus be considered as applications of the theoretical results as well as of the algorithm in section H.

We choose to study two new dynamics, not considered in [52, 83], because they exhibit some unique features of our generalized model; in particular, both have energy-based and alignment-based interactions. These are the flocking with external potential (FwEP) model in [72] and the anticipation dynamics (AD) model in [71].

Table 4 shows the value of learning parameters for these dynamics.

$M_{\rho}$ | $L$ | $T_{f}$ | $T$ | $\mu^{\boldsymbol{x}}$ | $\mu^{\dot{\boldsymbol{x}}}$ | Num. of learning trials
2000 | 500 | 10 | 5 | $\text{Unif.}([0,5]^{2})$ | $\text{Unif.}([0,5]^{2})$ | 10
Table 4: Values of parameters for the learning.

The setup of the learning experiment is as follows. We use $M_{\rho}$ different initial conditions to evolve the dynamics (using the built-in MATLAB integrator ode15s, with relative tolerance $10^{-8}$ and absolute tolerance $10^{-11}$) from $0$ to $T$, in order to generate a good approximation to $\rho_{T}^{L,EA},\rho_{T}^{L,E}$ and $\rho_{T}^{L,A}$. Then we use another set of $M$ initial conditions ($M=500$ for FwEP and $M=750$ for AD) to generate training data and learn the corresponding $\phi^{E}$ and $\phi^{A}$ from the empirical distributions $\rho_{T}^{L,M,EA}$, etc. We report the relative learning errors calculated via (5.4) for $\widehat{\phi}^{E}\oplus\widehat{\phi}^{A}$, for $\widehat{\phi}^{E}$, and for $\widehat{\phi}^{A}$, along with a pictorial comparison of the interaction kernels and a visualization of the pairwise data used to learn the estimated kernels. Then we evolve the dynamics, either from the training set of $M$ initial conditions or from another set of $M$ randomly chosen initial conditions, with $\phi^{E}\oplus\phi^{A}$ and with $\widehat{\phi}^{E}\oplus\widehat{\phi}^{A}$, from $0$ to $T_{f}>T$, and report the trajectory errors calculated using (3.5) on $\boldsymbol{y}$ (the whole system), on $\boldsymbol{x}$ (the position), and on $\boldsymbol{v}$ (the velocity). Pictorial comparisons of the trajectories are also shown. We report the trajectory errors over $[0,T]$ and $[T,T_{f}]$. The learning results are shown in the following sections.
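Before turning to the results, a minimal sketch of the data-generation step just described, using SciPy's solve_ivp as a stand-in for MATLAB's ode15s (the function rhs, returning the time derivative of the flattened state, and the state layout are assumptions made for illustration):

import numpy as np
from scipy.integrate import solve_ivp

def generate_trajectories(rhs, M, L, T, N, d, rng):
    # Sample M initial conditions with positions and velocities Unif([0,5]^d) and
    # integrate each from 0 to T, recording L equispaced snapshots.
    # rhs(t, y) must return dy/dt for the flattened state y of length 2*N*d.
    t_eval = np.linspace(0.0, T, L)
    data = np.empty((M, L, 2 * N * d))
    for m in range(M):
        y0 = rng.uniform(0.0, 5.0, size=2 * N * d)   # positions and velocities
        sol = solve_ivp(rhs, (0.0, T), y0, t_eval=t_eval,
                        method="LSODA", rtol=1e-8, atol=1e-11)
        data[m] = sol.y.T
    return data

# Example call (N and d are illustrative; rhs is supplied by the model being studied):
# data = generate_trajectories(rhs, M=500, L=500, T=5, N=10, d=2, rng=np.random.default_rng(0))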

9.1 Learning results for flocking with external potential

We consider the FwEP model for its simplicity and clustering behavior in both position and velocity (hence flocking occurs). The dynamics of the FwEP model is given as follows,

\ddot{\boldsymbol{x}}_{i}=\frac{1}{N}\sum_{i^{\prime}=1,i^{\prime}\neq i}^{N}a(\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i})+\frac{1}{N}\sum_{i^{\prime}=1,i^{\prime}\neq i}^{N}\phi(\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|)(\dot{\boldsymbol{x}}_{i^{\prime}}-\dot{\boldsymbol{x}}_{i}).

Here $a>0$ is a constant representing an attraction force, and $\phi(r)=\frac{1}{(1+r^{2})^{\beta}}$ with $\beta=\frac{1}{2}$ (this choice of alignment interaction is not mandatory; it allows a direct comparison between the FwEP model and the Cucker-Smale model, which uses the same interaction function). To fit into our learning regime, we take $m_{i}=1$, $K=1$, no $\xi_{i}$, no non-collective force, and

ϕE=aandϕA=1(1+r2)β.\phi^{E}=a\quad\text{and}\quad\phi^{A}=\frac{1}{(1+r^{2})^{\beta}}.
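For concreteness, a sketch of the right-hand side of the FwEP dynamics above, vectorized over agents (the value a = 1.0 is an illustrative choice; the model only requires a > 0):

import numpy as np

def fwep_acceleration(x, v, a=1.0, beta=0.5):
    # FwEP accelerations: for each agent i,
    #   xdd_i = (1/N) sum_{i' != i} [ a (x_i' - x_i) + phi(|x_i' - x_i|) (v_i' - v_i) ],
    # with phi(r) = 1 / (1 + r^2)^beta.  x, v: arrays of shape (N, d).
    N = x.shape[0]
    dx = x[None, :, :] - x[:, None, :]        # x_i' - x_i, shape (N, N, d)
    dv = v[None, :, :] - v[:, None, :]        # v_i' - v_i
    r = np.linalg.norm(dx, axis=-1)           # pairwise distances r_ii'
    phi = 1.0 / (1.0 + r ** 2) ** beta        # alignment kernel phi^A
    mask = 1.0 - np.eye(N)                    # exclude i' = i
    energy = (a * mask)[:, :, None] * dx      # constant energy kernel phi^E = a
    align = (phi * mask)[:, :, None] * dv     # alignment kernel phi^A
    return (energy + align).sum(axis=1) / N

Wrapping this acceleration together with $\dot{\boldsymbol{x}}$ into a flattened right-hand side yields the rhs assumed in the data-generation sketch of the previous section.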

For the FwEP model, we use the space of $1^{st}$ degree piecewise polynomials with dimension $n^{E}=122$ for learning $\phi^{E}$; for $\phi^{A}$, we use the same space. First, consider the comparison of energy-based interactions shown in Fig. 1.

Figure 1: $\phi^{E}$ vs. $\widehat{\phi}^{E}$, Err: $3.9\cdot 10^{-6}\pm 6.6\cdot 10^{-7}$. The lines shown in blue are the estimated interaction kernels, and the lines shown in black are the true interaction kernels. The colored areas shown in the background are the learned distributions of pairwise distance data.

Fig. 1 shows promising performance in learning a constant function with piecewise linear polynomials. However, we still have trouble learning the behavior of the interaction at $r=0$, partly because the kernel is weighted by the pairwise difference $\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}$, which vanishes as $r\rightarrow 0$, and partly because of the lack of available data near $r=0$. Next, Fig. 2 compares the alignment-based interactions, together with the distribution of the pairwise data.

(a) $\phi^{A}$ vs. $\widehat{\phi}^{A}$, Err: $9.1\cdot 10^{-3}\pm 2.5\cdot 10^{-4}$.
(b) $\rho_{T}^{A,L}$ vs. $\rho_{T}^{A,L,M}$.
Figure 2: The lines shown in blue are the estimated interaction kernels, and the lines shown in black are the true interaction kernels. The colored areas shown in the background are the learned distributions of pairwise distance data.

Fig. 2(a) again shows that our estimated kernel is a faithful approximation of the true one. The $\widehat{\phi}^{E}\oplus\widehat{\phi}^{A}$ error is $5.8\cdot 10^{-3}\pm 1.6\cdot 10^{-4}$. The comparison of trajectories is shown in Fig. 3.

Figure 3: Trajectory Comparison.

Fig. 3 shows little visual difference between the learned and observed trajectories. A more quantitative description of the trajectory errors is given in table 5.

 | $[0,T]$ | $[T,T_{f}]$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{x}$ | $7.2\cdot 10^{-4}\pm 1.9\cdot 10^{-5}$ | $6.7\cdot 10^{-4}\pm 1.8\cdot 10^{-5}$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{v}$ | $1.15\cdot 10^{-3}\pm 3.1\cdot 10^{-5}$ | $1.5\cdot 10^{-3}\pm 4.1\cdot 10^{-3}$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{y}$ | $6.2\cdot 10^{-6}\pm 1.7\cdot 10^{-7}$ | $2.22\cdot 10^{-6}\pm 6.8\cdot 10^{-8}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{x}$ | $1.28\cdot 10^{-4}\pm 4.2\cdot 10^{-6}$ | $1.22\cdot 10^{-4}\pm 4.0\cdot 10^{-6}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{v}$ | $2.20\cdot 10^{-4}\pm 7.0\cdot 10^{-6}$ | $2.5\cdot 10^{-4}\pm 1.0\cdot 10^{-5}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{y}$ | $1.52\cdot 10^{-6}\pm 5.9\cdot 10^{-8}$ | $6.0\cdot 10^{-7}\pm 2.6\cdot 10^{-8}$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{x}$ | $7.2\cdot 10^{-4}\pm 1.7\cdot 10^{-5}$ | $6.7\cdot 10^{-4}\pm 1.6\cdot 10^{-5}$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{v}$ | $1.15\cdot 10^{-3}\pm 2.7\cdot 10^{-5}$ | $1.46\cdot 10^{-3}\pm 3.3\cdot 10^{-5}$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{y}$ | $6.2\cdot 10^{-6}\pm 1.6\cdot 10^{-7}$ | $2.22\cdot 10^{-6}\pm 5.4\cdot 10^{-8}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{x}$ | $1.30\cdot 10^{-4}\pm 5.8\cdot 10^{-6}$ | $1.24\cdot 10^{-4}\pm 5.3\cdot 10^{-6}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{v}$ | $2.25\cdot 10^{-4}\pm 9.9\cdot 10^{-6}$ | $2.56\cdot 10^{-4}\pm 9.1\cdot 10^{-6}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{y}$ | $1.6\cdot 10^{-6}\pm 7.6\cdot 10^{-8}$ | $6.2\cdot 10^{-7}\pm 2.4\cdot 10^{-8}$
Table 5: Trajectory Errors. The first three rows of mean trajectory errors are from the training set of initial conditions. The next three rows are standard deviation of the trajectory errors from the training set of initial conditions. The following three rows are mean trajectory errors from a new set of initial conditions. Finally, the last three rows report the standard deviation of the trajectory errors from a new set of initial conditions.

We are maintaining a relative four-digit accuracy in estimating the position, and a relative three-digit accuracy in estimating the velocity of the agents in the system. Although we are able to reconstruct $\phi^{E}$ with 6-digit accuracy, we are not able to do the same for $\phi^{A}$. The error in $\widehat{\phi}^{E}\oplus\widehat{\phi}^{A}$ reflects this discrepancy by considering the two functions together.

9.2 Learning results for anticipation dynamics with $U(r)=\frac{r^{p}}{p}$

The energy-based interaction in the FwEP model is constant; to consider more complicated interactions, i.e., interactions depending on the pairwise distance and more, the AD model is a suitable candidate. The dynamics of the AD model is given as follows,

𝒙¨i\displaystyle\ddot{\boldsymbol{x}}_{i} =1Ni=1,iiNτU(𝒙i𝒙i)𝒙i𝒙i(𝒙˙i𝒙˙i)\displaystyle=\frac{1}{N}\sum_{i^{\prime}=1,i^{\prime}\neq i}^{N}\frac{\tau U^{\prime}(\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|)}{\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|}(\dot{\boldsymbol{x}}_{i^{\prime}}-\dot{\boldsymbol{x}}_{i})
+1Ni=1,iiN{τU(𝒙i𝒙i)(𝒙i𝒙i)(𝒙˙i𝒙˙i)𝒙i𝒙i3\displaystyle\quad+\frac{1}{N}\sum_{i^{\prime}=1,i^{\prime}\neq i}^{N}\Big{\{}\frac{-\tau U^{\prime}(\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|)(\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i})\cdot(\dot{\boldsymbol{x}}_{i^{\prime}}-\dot{\boldsymbol{x}}_{i})}{\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|^{3}}
+τU′′(𝒙i𝒙i)(𝒙i𝒙i)(𝒙˙i𝒙˙i)𝒙i𝒙i2+U(𝒙i𝒙i)𝒙i𝒙i}(𝒙i𝒙i).\displaystyle\quad+\frac{\tau U^{\prime\prime}(\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|)(\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i})\cdot(\dot{\boldsymbol{x}}_{i^{\prime}}-\dot{\boldsymbol{x}}_{i})}{\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|^{2}}+\frac{U^{\prime}(\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|)}{\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|}\Big{\}}(\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}). (9.1)

Compared to the original model in [71], we take τi,i=0\tau_{i,i^{\prime}}=0. In order to fit the model into our learning regime, we take

ϕA(r)=τU(r)randϕE(r,s)=τU(r)sr3+τU′′(r)sr2+U(r)r.\phi^{A}(r)=\frac{\tau U^{\prime}(r)}{r}\quad\text{and}\quad\phi^{E}(r,s)=\frac{-\tau U^{\prime}(r)s}{r^{3}}+\frac{\tau U^{\prime\prime}(r)s}{r^{2}}+\frac{U^{\prime}(r)}{r}.

Here we have no ξi\xi_{i}, K=1K=1, mi=1m_{i}=1, and

si,iE=si,iA=(𝒙i𝒙i)(𝒙˙i𝒙˙i).s^{E}_{i,i^{\prime}}=s^{A}_{i,i^{\prime}}=(\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i})\cdot(\dot{\boldsymbol{x}}_{i^{\prime}}-\dot{\boldsymbol{x}}_{i}).

We also use τ=0.1\tau=0.1.
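The true kernels used for comparison in this experiment follow directly from the formulas above; a short sketch, with $U(r)=\frac{r^{p}}{p}$, $p=1.5$ and $\tau=0.1$ (the sample distances in the example call are arbitrary):

import numpy as np

def ad_true_kernels(p=1.5, tau=0.1):
    # With U(r) = r^p / p we have U'(r) = r^(p-1) and U''(r) = (p-1) r^(p-2), so
    #   phi^A(r)    = tau * U'(r) / r,
    #   phi^E(r, s) = -tau * U'(r) * s / r^3 + tau * U''(r) * s / r^2 + U'(r) / r,
    # where s = (x_i' - x_i) . (v_i' - v_i) is the shared feature.
    def phi_A(r):
        return tau * r ** (p - 1) / r

    def phi_E(r, s):
        dU = r ** (p - 1)
        ddU = (p - 1) * r ** (p - 2)
        return -tau * dU * s / r ** 3 + tau * ddU * s / r ** 2 + dU / r

    return phi_E, phi_A

phi_E, phi_A = ad_true_kernels()
print(phi_A(np.array([0.5, 1.0, 2.0])))   # kernel values at a few pairwise distances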

It is shown in [71] that if $U^{\prime\prime}$ is bounded as $r\rightarrow\infty$ and $U(0)=U^{\prime}(0)=0$, then unconditional flocking occurs. We take $U(r)=\frac{r^{p}}{p}$ for $1<p\leq 2$, so that the system exhibits unconditional flocking. We choose $p=1.5$ for our learning trials ($p=2$ induces constant forces on the dynamics). We use a tensor grid of $1^{st}$ degree piecewise standard polynomials with $n^{E}=28^{2}$ for learning $\phi^{E}(r,s)$, and a set of $1^{st}$ degree piecewise standard polynomials with $n^{A}=138$ for learning $\phi^{A}(r)$. For the energy-based interactions we have the following results.

(a) $U(r)=\frac{r^{1.5}}{1.5}$: $\phi^{E}$ vs. $\widehat{\phi}^{E}$, Err: $6\cdot 10^{-1}\pm 2.6\cdot 10^{-1}$.
(b) $U(r)=\frac{r^{1.5}}{1.5}$: $\rho_{T}^{E,L}$ vs. $\rho_{T}^{E,L,M}$.
Figure 4: The lines shown in blue are the estimated interaction kernels, and the lines shown in black are the true interaction kernels. The colored areas shown in the background are the learned distributions of pairwise distance data.

As shown in Fig. 4(b), the pairwise distance data concentrates away from $0$, which makes estimating the behavior of $\phi^{E}(r,s)$ for $r$ close to $0$ extremely difficult; moreover, since $\phi^{E}$ is weighted by the pairwise difference $\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}$, the corresponding information is also lost when $r_{i,i^{\prime}}$ is close to $0$. Next, we present the alignment-based interaction kernels.

(a) $U(r)=\frac{r^{1.5}}{1.5}$: $\phi^{A}$ vs. $\widehat{\phi}^{A}$, Err: $1.7\cdot 10^{-1}\pm 3.9\cdot 10^{-2}$.
(b) $U(r)=\frac{r^{1.5}}{1.5}$: $\rho_{T}^{A,L}$ vs. $\rho_{T}^{A,L,M}$.
Figure 5: The lines shown in blue are the estimated interaction kernels, and the lines shown in black are the true interaction kernels. The colored areas shown in the background are the learned distributions of pairwise distance data.

We have less trouble estimating the behavior of $\phi^{A}$ at $r=0$, but more trouble at the other end of the range of $r$: there the agents have already aligned their velocities, so the weight $\boldsymbol{v}_{i^{\prime}}-\boldsymbol{v}_{i}$ is close to the zero vector. The overall learning performance for $\phi^{A}$ is better than for $\phi^{E}$. The $\widehat{\phi}^{E}\oplus\widehat{\phi}^{A}$ error is $6\cdot 10^{-1}\pm 3.0\cdot 10^{-1}$. The comparison of trajectories between the true kernels (LHS) and the estimators (RHS) is shown below.

Figure 6: $U(r)=\frac{r^{1.5}}{1.5}$: Trajectory Comparison.

Visually, there is no difference between the true dynamics and the estimated dynamics. We offer more quantitative insight into the difference between the two in table 6.

 | $[0,T]$ | $[T,T_{f}]$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{x}$ | $2.22\cdot 10^{-3}\pm 8.5\cdot 10^{-5}$ | $2.4\cdot 10^{-3}\pm 1.0\cdot 10^{-4}$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{v}$ | $7.8\cdot 10^{-3}\pm 3.9\cdot 10^{-4}$ | $1.78\cdot 10^{-2}\pm 8.9\cdot 10^{-4}$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{y}$ | $1.91\cdot 10^{-5}\pm 9.0\cdot 10^{-7}$ | $6.2\cdot 10^{-6}\pm 3.6\cdot 10^{-7}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{x}$ | $2.8\cdot 10^{-4}\pm 1.2\cdot 10^{-5}$ | $3.4\cdot 10^{-4}\pm 1.5\cdot 10^{-5}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{v}$ | $1.14\cdot 10^{-3}\pm 7.8\cdot 10^{-5}$ | $2.7\cdot 10^{-3}\pm 1.5\cdot 10^{-4}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{y}$ | $4.7\cdot 10^{-6}\pm 3.0\cdot 10^{-7}$ | $6.2\cdot 10^{-6}\pm 3.6\cdot 10^{-7}$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{x}$ | $2.22\cdot 10^{-3}\pm 8.4\cdot 10^{-5}$ | $2.4\cdot 10^{-3}\pm 1.0\cdot 10^{-4}$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{v}$ | $7.8\cdot 10^{-3}\pm 3.8\cdot 10^{-4}$ | $1.78\cdot 10^{-2}\pm 8.7\cdot 10^{-4}$
$\text{mean}_{\text{IC}}$ on $\boldsymbol{y}$ | $1.91\cdot 10^{-5}\pm 8.6\cdot 10^{-7}$ | $2.4\cdot 10^{-5}\pm 1.1\cdot 10^{-6}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{x}$ | $3.0\cdot 10^{-4}\pm 2.4\cdot 10^{-5}$ | $3.4\cdot 10^{-4}\pm 1.3\cdot 10^{-5}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{v}$ | $1.15\cdot 10^{-3}\pm 6.8\cdot 10^{-5}$ | $2.7\cdot 10^{-3}\pm 1.5\cdot 10^{-4}$
$\text{std}_{\text{IC}}$ on $\boldsymbol{y}$ | $4.7\cdot 10^{-6}\pm 2.6\cdot 10^{-7}$ | $6.2\cdot 10^{-6}\pm 3.1\cdot 10^{-7}$
Table 6: U(r)=r1.51.5U(r)=\frac{r^{1.5}}{1.5}: Trajectory Errors.

We maintain a 3-digit relative accuracy in estimating the position/velocity of the agents, even though for the interaction kernels we are only able to maintain a 1-digit relative accuracy.

10 Conclusion and further directions

We have described a second-order model of interacting agents that incorporates multiple agent types, an environment, external forces, and multivariable interaction kernels. The inference procedure exploits the structure of the system to achieve a learning rate that depends only on the dimension of the interaction kernels, which is much smaller than the full ambient dimension $(2d+1)N$. Our estimators are strongly consistent, and in fact have learning rates that are min-max optimal within the nonparametric class, under mild assumptions on the interaction kernels and the system. We described how one can relate the expected supremum error of the trajectories of the system driven by the estimated interaction kernels to the difference between the true interaction kernels and the estimated ones; this result gives strong support to the use of our weighted $L^{2}$ norms as the correct way to measure performance and derive estimators. A detailed discussion of the full numerical algorithm, including the inverse problem derived from data and a coercivity condition ensuring learnability, was presented along with complex examples, and we showed that the formulation covers a very wide range of systems coming from many disciplines.

There are various ways that one could build on this work to handle different systems and for many of these further directions, the theoretical framework, techniques, and theorems presented here would be directly useful. In particular, one could consider second-order stochastic systems or a similar system but on a manifold, more complex environments, having more unknowns within the model beyond just the interaction kernels (say estimating the non-collective forces as well), identifying the best feature maps to model the data, and considering semiparametric problems where there are hidden parameters within the interaction kernels or other parts of the model that we wish to estimate along with the interaction kernels. The generality of the model and its broad coverage of models across the sciences, together with the scalability and performance of the algorithm, could inspire new models – both explicit equations and nonparametric estimators learned from data – which are theoretically justified and highly practical.


Acknowledgments and contributions

MM is grateful for discussions with Fei Lu and Yannis Kevrekidis, and for partial support from NSF-1837991, NSF-1913243, NSF-1934979, NSF-Simons-2031985, AFOSR-FA9550-17-1-0280 and FA9550-20-1-0288, ARO W911NF-18-C-0082, and to the Simons Foundation for the Simons Fellowship for the year ’20-’21; ST for support from an AMS Simons travel grant; JM for support from NIH - T32GM11999. Please direct correspondence to any of the first three authors.

All authors jointly designed research and wrote the manuscript; JM and ST derived theoretical results; MZ developed algorithms and applications; JM and MZ analyzed data.

A Control of trajectory error

Proof [of Theorem 10] We introduce the function

F[φEA](𝒙,𝒙˙,𝒔E,𝒔A):=φE(𝒙,𝒔E)𝒙+φA(𝒙,𝒔A)𝒙˙F{[\varphi^{EA}]}(\boldsymbol{x},\dot{\boldsymbol{x}},\boldsymbol{s}^{E},\boldsymbol{s}^{A}):=\varphi^{E}(||\boldsymbol{x}||,\boldsymbol{s}^{E})\boldsymbol{x}+\varphi^{A}(||\boldsymbol{x}||,\boldsymbol{s}^{A})\dot{\boldsymbol{x}}

defined on $\mathbb{R}^{2d+p^{E}+p^{A}}$ for functions $\varphi^{E}\in L^{\infty}([0,R]\times\mathbb{S}^{E}),\varphi^{A}\in L^{\infty}([0,R]\times\mathbb{S}^{A})$. Similarly, let $F[\varphi^{\xi}](\boldsymbol{x},\xi,s^{\xi}):=\varphi^{\xi}(||\boldsymbol{x}||,\boldsymbol{s}^{\xi})\xi$. By assumption, all trajectories start from the same initial conditions on both the position and velocity, which implies that $\widehat{\boldsymbol{Y}}(0)=\boldsymbol{Y}(0)$ and $\dot{\widehat{\boldsymbol{Y}}}(0)=\dot{\boldsymbol{Y}}(0)$. For every $t\in[0,T]$ we have that, by the fundamental theorem of calculus and the triangle inequality,

𝑿(t)𝑿^(t)𝒮2\displaystyle\|\boldsymbol{X}(t)-\widehat{\boldsymbol{X}}(t)\|_{\mathcal{S}}^{2} =j=1kiCj1Nj0t0p(𝒙i¨𝒙^¨i)𝑑s𝑑p2\displaystyle=\sum_{j=1}^{k}\sum_{i\in C_{j}}\frac{1}{N_{j}}\bigg{\|}\int_{0}^{t}\int_{0}^{p}(\ddot{\boldsymbol{x}_{i}}-\ddot{\widehat{\boldsymbol{x}}}_{i})dsdp\bigg{\|}^{2}
tpj=1KiCj1Njp=0ts=0p𝒙i¨𝒙^¨i2𝑑s𝑑p\displaystyle\leq tp\sum_{j=1}^{K}\sum_{i\in C_{j}}\frac{1}{N_{j}}\int_{p=0}^{t}\int_{s=0}^{p}\|\ddot{\boldsymbol{x}_{i}}-\ddot{\widehat{\boldsymbol{x}}}_{i}\|^{2}dsdp
=tpp=0ts=0p𝑿¨𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟ϕ^EA(𝑿,𝑽,𝚵)+𝐟nc,𝒙˙(𝑿,𝑽,𝚵)\displaystyle=tp\int_{p=0}^{t}\int_{s=0}^{p}\|\ddot{\boldsymbol{X}}-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})+\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})
+𝐟ϕ^EA(𝑿,𝑽,𝚵)𝐟nc,𝒙˙(𝑿^,𝑽^,𝚵^)𝐟ϕ^EA(𝑿^,𝑽^,𝚵^)𝒮2dsdp\displaystyle\hskip 28.45274pt+\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\widehat{\boldsymbol{X}},\widehat{\boldsymbol{V}},\widehat{\boldsymbol{\Xi}})-\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\widehat{\boldsymbol{X}},\widehat{\boldsymbol{V}},\widehat{\boldsymbol{\Xi}})\|_{\mathcal{S}}^{2}dsdp
2T2p=0ts=0p𝑿¨𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟ϕ^EA(𝑿,𝑽,𝚵)𝒮2𝑑s𝑑p\displaystyle\leq 2T^{2}\int_{p=0}^{t}\int_{s=0}^{p}\|\ddot{\boldsymbol{X}}-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}dsdp (A.1)
+2T2p=0ts=0pI𝑑s𝑑p+2T2p=0ts=0p𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟nc,𝒙˙(𝑿^,𝑽^,𝚵^)𝒮2𝑑s𝑑p.\displaystyle+2T^{2}\int_{p=0}^{t}\int_{s=0}^{p}Idsdp+2T^{2}\int_{p=0}^{t}\int_{s=0}^{p}\|\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\widehat{\boldsymbol{X}},\widehat{\boldsymbol{V}},\widehat{\boldsymbol{\Xi}})\|_{\mathcal{S}}^{2}dsdp.

Here we have introduced the term,

I=𝐟ϕ^EA(𝑿,𝑽,𝚵)𝐟ϕ^EA(𝑿^,𝑽^,𝚵^)𝒮,I=\bigg{\|}\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\widehat{\boldsymbol{X}},\widehat{\boldsymbol{V}},\widehat{\boldsymbol{\Xi}})\bigg{\|}_{\mathcal{S}},

which can be expressed explicitly as,

I=(j=1KiCj1Nj(F[ϕ^jjEA](𝒓ii,𝒓˙ii,𝒔iiE,𝒔iiA)F[ϕ^jjEA](𝒓^ii,𝒓^˙ii,𝒔iiE^,𝒔iiA^)))i,j𝒮2.I=\Bigg{\|}\bigg{(}\sum_{j^{\prime}=1}^{K}\sum_{i^{\prime}\in C_{j^{\prime}}}\frac{1}{N_{j^{\prime}}}(F{[\widehat{\phi}_{jj^{\prime}}^{EA}]}(\boldsymbol{r}_{ii^{\prime}},\dot{\boldsymbol{r}}_{ii^{\prime}},\boldsymbol{s}^{E}_{ii^{\prime}},\boldsymbol{s}^{A}_{ii^{\prime}})-F{[\widehat{\phi}_{jj^{\prime}}^{EA}]}(\widehat{\boldsymbol{r}}_{ii^{\prime}},\dot{\widehat{\boldsymbol{r}}}_{ii^{\prime}},\widehat{\boldsymbol{s}^{E}_{ii^{\prime}}},\widehat{\boldsymbol{s}^{A}_{ii^{\prime}}}))\bigg{)}_{i,j}\Bigg{\|}_{\mathcal{S}}^{2}. (A.2)

Note that in $I$, $j$ is the index of the type among $\{1,\ldots,K\}$ and $i$ indexes agents within each type $C_{j}$; the same holds in the later expressions $I_{1},I_{2}$. For the third term of (A.1), we exploit the Lipschitz property of the non-collective force:

𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟nc,𝒙˙(𝑿^,𝑽^,𝚵^)𝒮2\displaystyle\|\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\widehat{\boldsymbol{X}},\widehat{\boldsymbol{V}},\widehat{\boldsymbol{\Xi}})\|_{\mathcal{S}}^{2} =j=1KiCj1NjFi𝒙˙(𝒙i,𝒙˙i,ξi)Fi𝒙˙(𝒙^i,𝒙^˙,ξ^i)2\displaystyle=\sum_{j=1}^{K}\sum_{i\in C_{j}}\frac{1}{N_{j}}\|F^{\dot{\boldsymbol{x}}}_{i}(\boldsymbol{x}_{i},\dot{\boldsymbol{x}}_{i},\xi_{i})-F^{\dot{\boldsymbol{x}}}_{i}(\widehat{\boldsymbol{x}}_{i},\dot{\widehat{\boldsymbol{x}}},\widehat{\xi}_{i})\|^{2}
j=1KiCj1NjLip2[Fi𝒙˙](𝒙i𝒙^i2+𝒙˙i𝒙^˙i2+ξiξ^i2)\displaystyle\leq\sum_{j=1}^{K}\sum_{i\in C_{j}}\frac{1}{N_{j}}\text{Lip}^{2}[F^{\dot{\boldsymbol{x}}}_{i}](\|\boldsymbol{x}_{i}-\widehat{\boldsymbol{x}}_{i}\|^{2}+\|\dot{\boldsymbol{x}}_{i}-\dot{\widehat{\boldsymbol{x}}}_{i}\|^{2}+\|\xi_{i}-\widehat{\xi}_{i}\|^{2})
maxiLip2[Fi𝒙˙]𝒀𝒀^𝒴2\displaystyle\leq\max_{i}\text{Lip}^{2}[F^{\dot{\boldsymbol{x}}}_{i}]\|\boldsymbol{Y}-\widehat{\boldsymbol{Y}}\|_{\mathcal{Y}}^{2}

So that we have the bound

2T2p=0ts=0p𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟nc,𝒙˙(𝑿^,𝑽^,𝚵^)𝒮2𝑑s𝑑p2T2p=0ts=0pmaxiLip2[Fi𝒙˙]𝒀𝒀^𝒴2𝑑s𝑑p2T^{2}\int_{p=0}^{t}\int_{s=0}^{p}\|\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\widehat{\boldsymbol{X}},\widehat{\boldsymbol{V}},\widehat{\boldsymbol{\Xi}})\|_{\mathcal{S}}^{2}dsdp\leq 2T^{2}\int_{p=0}^{t}\int_{s=0}^{p}\max_{i}\text{Lip}^{2}[F^{\dot{\boldsymbol{x}}}_{i}]\|\boldsymbol{Y}-\widehat{\boldsymbol{Y}}\|_{\mathcal{Y}}^{2}dsdp (A.3)

First we introduce the convenient notations of

𝒔ii^E=𝒔(𝓀𝒾,𝓀𝒾)E(𝒙i,𝒙˙i,ξi,𝒙^i,𝒙^˙i,ξ^i),𝒔^iiA=𝒔(𝓀𝒾,𝓀𝒾)A(𝒙^i,𝒙^˙i,ξ^i,𝒙^i,𝒙^˙i,ξ^i)\bm{s}_{i\widehat{i^{\prime}}}^{E}=\boldsymbol{s}^{E}_{(\mathpzc{k}_{i},\mathpzc{k}_{i^{\prime}})}(\boldsymbol{x}_{i},\dot{\boldsymbol{x}}_{i},\xi_{i},\widehat{\boldsymbol{x}}_{i^{\prime}},\dot{\widehat{\boldsymbol{x}}}_{i^{\prime}},\widehat{\xi}_{i^{\prime}}),\hskip 21.33955pt\widehat{\bm{s}}_{ii^{\prime}}^{A}=\boldsymbol{s}^{A}_{(\mathpzc{k}_{i},\mathpzc{k}_{i^{\prime}})}(\widehat{\boldsymbol{x}}_{i},\dot{\widehat{\boldsymbol{x}}}_{i},\widehat{\xi}_{i},\widehat{\boldsymbol{x}}_{i^{\prime}},\dot{\widehat{\boldsymbol{x}}}_{i^{\prime}},\widehat{\xi}_{i^{\prime}}) (A.4)

with analogous formulae for 𝒔i^iE,𝒔i^iA,𝒔ii^A,𝒔ii^ξ,𝒔i^iξ,𝒔^iiE,𝒔^iiA,𝒔^iiξ\bm{s}_{\widehat{i}i^{\prime}}^{E},\bm{s}_{\widehat{i}i^{\prime}}^{A},\bm{s}_{i\widehat{i^{\prime}}}^{A},\bm{s}_{i\widehat{i^{\prime}}}^{\xi},\bm{s}_{\widehat{i}i^{\prime}}^{\xi},\widehat{\bm{s}}_{ii^{\prime}}^{E},\widehat{\bm{s}}_{ii^{\prime}}^{A},\widehat{\bm{s}}_{ii^{\prime}}^{\xi}. Now we break up II using the triangle inequality and get that II1+I2I\leq I_{1}+I_{2} where

I1=(j=1KiCj1Nj(F[ϕ^jjEA](𝒓ii,𝒓˙ii,𝒔iiE,𝒔iiA)F[ϕ^jjEA](𝒙i𝒙i^,𝒙i˙𝒙i^˙,𝒔ii^E,𝒔ii^A)))i,j𝒮2I_{1}=\Bigg{\|}\bigg{(}\sum_{j^{\prime}=1}^{K}\sum_{i^{\prime}\in C_{j^{\prime}}}\frac{1}{N_{j^{\prime}}}(F{[\widehat{\phi}_{jj^{\prime}}^{EA}]}(\boldsymbol{r}_{ii^{\prime}},\dot{\boldsymbol{r}}_{ii^{\prime}},\boldsymbol{s}^{E}_{ii^{\prime}},\boldsymbol{s}^{A}_{ii^{\prime}})-F{[\widehat{\phi}_{jj^{\prime}}^{EA}]}(\boldsymbol{x}_{i}-\widehat{\boldsymbol{x}_{i^{\prime}}},\dot{\boldsymbol{x}_{i}}-\dot{\widehat{\boldsymbol{x}_{i}}},\bm{s}_{i\widehat{i^{\prime}}}^{E},\bm{s}_{i\widehat{i^{\prime}}}^{A}))\bigg{)}_{i,j}\Bigg{\|}_{\mathcal{S}}^{2}
I2=(j=1KiCj1Nj(F[ϕ^jjEA](𝒙i𝒙i^,𝒙i˙𝒙i^˙,𝒔ii^E,𝒔ii^A)F[ϕ^jjEA](𝒓ii^,𝒓ii^˙,𝒔iiE^,𝒔iiA^))))i,j𝒮2I_{2}=\Bigg{\|}\bigg{(}\sum_{j^{\prime}=1}^{K}\sum_{i^{\prime}\in C_{j^{\prime}}}\frac{1}{N_{j^{\prime}}}(F[\widehat{\phi}_{jj^{\prime}}^{EA}](\boldsymbol{x}_{i}-\widehat{\boldsymbol{x}_{i^{\prime}}},\dot{\boldsymbol{x}_{i}}-\dot{\widehat{\boldsymbol{x}_{i}}},\bm{s}_{i\widehat{i^{\prime}}}^{E},\bm{s}_{i\widehat{i^{\prime}}}^{A})-F_{[\widehat{\phi}_{jj^{\prime}}^{EA}]}(\widehat{\boldsymbol{r}_{ii^{\prime}}},\dot{\widehat{\boldsymbol{r}_{ii^{\prime}}}},\widehat{\boldsymbol{s}^{E}_{ii^{\prime}}},\widehat{\boldsymbol{s}^{A}_{ii^{\prime}}})))\bigg{)}_{i,j}\Bigg{\|}_{\mathcal{S}}^{2}

Using the Lipschitz property of $F[\widehat{\phi}_{jj^{\prime}}^{EA}]$: since

I1=j=1KiCj1Nj|j=1KiCj1Nj(F[ϕ^jjEA](𝒓ii,𝒓˙ii,𝒔iiE,𝒔iiA)F[ϕ^jjEA](𝒙i𝒙i^,𝒙i˙𝒙i^˙,𝒔ii^E,𝒔ii^A))|2,I_{1}=\sum_{j=1}^{K}\sum_{i\in C_{j}}\frac{1}{N_{j}}\Bigg{|}\sum_{j^{\prime}=1}^{K}\sum_{i^{\prime}\in C_{j^{\prime}}}\frac{1}{N_{j^{\prime}}}(F{[\widehat{\phi}_{jj^{\prime}}^{EA}]}(\boldsymbol{r}_{ii^{\prime}},\dot{\boldsymbol{r}}_{ii^{\prime}},\boldsymbol{s}^{E}_{ii^{\prime}},\boldsymbol{s}^{A}_{ii^{\prime}})-F{[\widehat{\phi}_{jj^{\prime}}^{EA}]}(\boldsymbol{x}_{i}-\widehat{\boldsymbol{x}_{i^{\prime}}},\dot{\boldsymbol{x}_{i}}-\dot{\widehat{\boldsymbol{x}_{i}}},\bm{s}_{i\widehat{i^{\prime}}}^{E},\bm{s}_{i\widehat{i^{\prime}}}^{A}))\Bigg{|}^{2},

then,

I1Kj=1KiCj1Njj=1KiCj1Nj\displaystyle I_{1}\leq K\sum_{j=1}^{K}\sum_{i\in C_{j}}\frac{1}{N_{j}}\sum_{j^{\prime}=1}^{K}\sum_{i^{\prime}\in C_{j^{\prime}}}\frac{1}{N_{j^{\prime}}} |(Lip[F[ϕ^jjEA]](𝒙i𝒙i^,𝒙˙i𝒙^˙i𝒔iiE𝒔ii^E,𝒔iiA𝒔ii^A))|2.\displaystyle\bigg{|}\bigg{(}\text{Lip}[F[\widehat{\phi}_{jj^{\prime}}^{EA}]]\|(\boldsymbol{x}_{i^{\prime}}-\widehat{\boldsymbol{x}_{i^{\prime}}},\dot{\boldsymbol{x}}_{i^{\prime}}-\dot{\widehat{\boldsymbol{x}}}_{i^{\prime}}\bm{s}_{ii^{\prime}}^{E}-\bm{s}_{i\widehat{i^{\prime}}}^{E},\bm{s}_{ii^{\prime}}^{A}-\bm{s}_{i\widehat{i^{\prime}}}^{A})\|\bigg{)}\bigg{|}^{2}.

By the assumptions on the feature maps, we have that

𝒔iiE𝒔ii^ELip[𝒔(𝓀𝒾,𝓀𝒾)E](𝒙i𝒙i^,𝒙˙i𝒙^˙i,ξiξ^i)\displaystyle\|\bm{s}_{ii^{\prime}}^{E}-\bm{s}_{i\widehat{i^{\prime}}}^{E}\|\leq\text{Lip}[\boldsymbol{s}^{E}_{(\mathpzc{k}_{i},\mathpzc{k}_{i^{\prime}})}]\|(\boldsymbol{x}_{i^{\prime}}-\widehat{\boldsymbol{x}_{i^{\prime}}},\dot{\boldsymbol{x}}_{i^{\prime}}-\dot{\widehat{\boldsymbol{x}}}_{i^{\prime}},\xi_{i^{\prime}}-\widehat{\xi}_{i^{\prime}})\|
𝒔iiA𝒔ii^ALip[𝒔(𝓀𝒾,𝓀𝒾)A](𝒙i𝒙i^,𝒙˙i𝒙^˙i,ξiξ^i)\displaystyle\|\boldsymbol{s}^{A}_{ii^{\prime}}-\bm{s}_{i\widehat{i^{\prime}}}^{A}\|\leq\text{Lip}[\boldsymbol{s}^{A}_{(\mathpzc{k}_{i},\mathpzc{k}_{i^{\prime}})}]\|(\boldsymbol{x}_{i^{\prime}}-\widehat{\boldsymbol{x}_{i^{\prime}}},\dot{\boldsymbol{x}}_{i^{\prime}}-\dot{\widehat{\boldsymbol{x}}}_{i^{\prime}},\xi_{i^{\prime}}-\widehat{\xi}_{i^{\prime}})\|

Combining these bounds we see that,

I1Kj=1KiCj1Njj=1KiCj1Nj\displaystyle I_{1}\leq K\sum_{j=1}^{K}\sum_{i\in C_{j}}\frac{1}{N_{j}}\sum_{j^{\prime}=1}^{K}\sum_{i^{\prime}\in C_{j^{\prime}}}\frac{1}{N_{j^{\prime}}} (maxj,j(Lip[F[ϕ^jjEA](Lip[𝒔(j,j)E]+1),Lip[F[ϕ^jjEA](Lip[𝒔(j,j)A]+1)))2\displaystyle\Big{(}\max_{j,j^{\prime}}\Big{(}\text{Lip}[F[\widehat{\phi}_{jj^{\prime}}^{EA}](\text{Lip}[\boldsymbol{s}^{E}_{(j,j^{\prime})}]+1),\text{Lip}[F[\widehat{\phi}_{jj^{\prime}}^{EA}](\text{Lip}[\boldsymbol{s}^{A}_{(j,j^{\prime})}]+1)\Big{)}\Big{)}^{2}
(𝒙i𝒙i^+𝒙˙i𝒙^˙i+ξiξ^i)2.\displaystyle(\|\boldsymbol{x}_{i^{\prime}}-\widehat{\boldsymbol{x}_{i^{\prime}}}\|+\|\dot{\boldsymbol{x}}_{i^{\prime}}-\dot{\widehat{\boldsymbol{x}}}_{i^{\prime}}\|+\|\xi_{i^{\prime}}-\widehat{\xi}_{i^{\prime}}\|)^{2}. (A.5)

Let S~=max(SE,SA)2\tilde{S}=\max(S_{E},S_{A})^{2}, J=(maxj,jLip[𝒔(j,j)E,𝒔(j,j)A]+1)2J=(\max_{j,j^{\prime}}\text{Lip}[\boldsymbol{s}^{E}_{(j,j^{\prime})},\boldsymbol{s}^{A}_{(j,j^{\prime})}]+1)^{2}, and then let P=S~JP=\tilde{S}J and we get by Young’s inequality that,

I14KP𝒀𝒀^𝒴2,I_{1}\leq 4KP\|\boldsymbol{Y}-\widehat{\boldsymbol{Y}}\|_{\mathcal{Y}}^{2}, (A.6)

and performing a similar analysis we get that

I24KP𝒀𝒀^𝒴2.I_{2}\leq 4KP\|\boldsymbol{Y}-\widehat{\boldsymbol{Y}}\|_{\mathcal{Y}}^{2}.

So gathering terms, we can reexpress (A.1) as

𝑿(t)𝑿^(t)𝒮2\displaystyle\|\boldsymbol{X}(t)-\widehat{\boldsymbol{X}}(t)\|_{\mathcal{S}}^{2} 2T2p=0ts=0p𝑿¨𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟ϕ^EA(𝑿,𝑽,𝚵)𝒮2𝑑s𝑑p\displaystyle\leq 2T^{2}\int_{p=0}^{t}\int_{s=0}^{p}\|\ddot{\boldsymbol{X}}-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}dsdp
+2T2(F+8KP)p=0ts=0p𝒀𝒀^𝒴2dsdp\displaystyle+2T^{2}(F+8KP)\int_{p=0}^{t}\int_{s=0}^{p}\|\boldsymbol{Y}-\widehat{\boldsymbol{Y}}\|_{\mathcal{Y}}^{2}dsdp (A.7)

where F=maxiLip[F𝒙˙i]F=\max_{i}\text{Lip}[F^{\dot{\boldsymbol{x}}}_{i}]. Performing an analogous analysis on 𝑽(t)𝑽^(t)𝒮2,𝚵(t)𝚵^(t)𝒮2\|\boldsymbol{V}(t)-\widehat{\boldsymbol{V}}(t)\|_{\mathcal{S}}^{2},\|\boldsymbol{\Xi}(t)-\widehat{\boldsymbol{\Xi}}(t)\|_{\mathcal{S}}^{2} , with some additional effort, one can get the following result on the phase variable

\displaystyle\|\boldsymbol{\Xi}(t)-\widehat{\boldsymbol{\Xi}}(t)\|_{\mathcal{S}}^{2} \leq 2T(8QK+F^{\xi})\int_{s=0}^{t}\|\boldsymbol{Y}-\widehat{\boldsymbol{Y}}\|_{\mathcal{Y}}^{2}ds
+2T\int_{s=0}^{t}\|\dot{\boldsymbol{\Xi}}-\mathbf{f}^{\text{nc},\xi}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{\bm{\phi}}^{\xi}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}ds (A.8)

where Fξ=maxiLip[Fξi]F^{\xi}=\max_{i}\text{Lip}[F^{\xi}_{i}] and Q=max(H,Sξ)Q=\max(H,S^{\xi}) where H=maxj,jLip[𝒔E(j,j),𝒔A(j,j)]H=\max_{j,j^{\prime}}\text{Lip}[\boldsymbol{s}^{E}_{(j,j^{\prime})},\boldsymbol{s}^{A}_{(j,j^{\prime})}]. Similarly, we have that,

𝑽(t)𝑽^(t)𝒮2\displaystyle\|\boldsymbol{V}(t)-\widehat{\boldsymbol{V}}(t)\|_{\mathcal{S}}^{2} 2Ts=0t𝑿¨𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟ϕ^EA(𝑿,𝑽,𝚵)𝒮2ds\displaystyle\leq 2T\int_{s=0}^{t}\|\ddot{\boldsymbol{X}}-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}ds
+2T(F+8KP)s=0t𝒀𝒀^𝒴2ds\displaystyle+2T(F+8KP)\int_{s=0}^{t}\|\boldsymbol{Y}-\widehat{\boldsymbol{Y}}\|_{\mathcal{Y}}^{2}ds (A.9)

Gathering the bounds (A.7, A.8, A.9), we have that

𝒀^(t)𝒀(t)𝒴2\displaystyle\|\widehat{\boldsymbol{Y}}(t)-\boldsymbol{Y}(t)\|_{\mathcal{Y}}^{2} 2T(8KP+F+8QK+Fξ)s=0t𝒀^𝒀𝒴2ds\displaystyle\leq 2T(8KP+F+8QK+F^{\xi})\int_{s=0}^{t}\|\widehat{\boldsymbol{Y}}-\boldsymbol{Y}\|_{\mathcal{Y}}^{2}ds
+2T2(8KP+F)p=0ts=0p𝒀^𝒀𝒴2dsdp\displaystyle+2T^{2}(8KP+F)\int_{p=0}^{t}\int_{s=0}^{p}\|\widehat{\boldsymbol{Y}}-\boldsymbol{Y}\|_{\mathcal{Y}}^{2}dsdp
a(t)\displaystyle a(t) {+2T2p=0ts=0p𝑿¨𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟ϕ^EA(𝑿,𝑽,𝚵)𝒮2dsdp+2Ts=0t𝑿¨𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟ϕ^EA(𝑿,𝑽,𝚵)𝒮2ds+2Ts=0t𝚵˙𝐟nc,ξ(𝑿,𝑽,𝚵)𝐟ϕ^ξ(𝑿,𝑽,𝚵)𝒮2ds]\displaystyle\begin{dcases}&+2T^{2}\int_{p=0}^{t}\int_{s=0}^{p}\|\ddot{\boldsymbol{X}}-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}dsdp\\ &+2T\int_{s=0}^{t}\|\ddot{\boldsymbol{X}}-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}ds\\ &+2T\int_{s=0}^{t}\|\dot{\boldsymbol{\Xi}}-\mathbf{f}^{\text{nc},\xi}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{{\bm{\phi}}}^{\xi}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}ds\bigg{]}\end{dcases}

where we denote the last three lines by $a(t)$ and notice that this is a nondecreasing function of $t$. We also denote $A_{1}=2T(8KP+F+8QK+F^{\xi})$ and $B_{1}=2T^{2}(8KP+F)$. We now apply theorem 30, an integral inequality stated in [22] and originally due to Bainov and Simeonov. With this notation, we can rewrite the above bound as

𝒀^(t)𝒀(t)𝒴2A1s=0t𝒀^𝒀𝒴2ds+B1p=0ts=0p𝒀^𝒀𝒴2dsdp+a(t)\|\widehat{\boldsymbol{Y}}(t)-\boldsymbol{Y}(t)\|_{\mathcal{Y}}^{2}\leq A_{1}\int_{s=0}^{t}\|\widehat{\boldsymbol{Y}}-\boldsymbol{Y}\|_{\mathcal{Y}}^{2}ds+B_{1}\int_{p=0}^{t}\int_{s=0}^{p}\|\widehat{\boldsymbol{Y}}-\boldsymbol{Y}\|_{\mathcal{Y}}^{2}dsdp+a(t) (A.10)

And so, in the notation of Theorem 30, we have $u(t)=\|\widehat{\boldsymbol{Y}}(t)-\boldsymbol{Y}(t)\|_{\mathcal{Y}}^{2}$, $b(t)=1$, $k_{1}(t,t_{1})=A_{1}$ and $k_{2}(t,t_{1},t_{2})=B_{1}$, so that for all $t$ we have

𝒀^(t)𝒀(t)𝒴2a(t)+0tR^[a](t,s)exp(stR^[b](t,τ)dτ)ds\displaystyle\|\widehat{\boldsymbol{Y}}(t)-\boldsymbol{Y}(t)\|_{\mathcal{Y}}^{2}\leq a(t)+\int_{0}^{t}\widehat{R}[a](t,s)\exp\bigg{(}\int_{s}^{t}\widehat{R}[b](t,\tau)d\tau\bigg{)}ds

and we have the simple bounds

R^[a](t,s)=a(t)+0sB1a(t2)dt2a(T)+B1Ta(T)\displaystyle\widehat{R}[a](t,s)=a(t)+\int_{0}^{s}B_{1}a(t_{2})dt_{2}\leq a(T)+B_{1}Ta(T) (A.11)
R^[b](t,τ)=A1+0τ1dy=A1+τ\displaystyle\widehat{R}[b](t,\tau)=A_{1}+\int_{0}^{\tau}1dy=A_{1}+\tau (A.12)

So that,

𝒀^(t)𝒀(t)𝒴2\displaystyle\|\widehat{\boldsymbol{Y}}(t)-\boldsymbol{Y}(t)\|_{\mathcal{Y}}^{2} a(T)+[a(T)+B1Ta(T)]s=0texp(st(A1+τ)dτ)ds\displaystyle\leq a(T)+[a(T)+B_{1}Ta(T)]\int_{s=0}^{t}\exp\bigg{(}\int_{s}^{t}(A_{1}+\tau)d\tau\bigg{)}ds
a(T)+[a(T)+B1Ta(T)]s=0Texp(0TA1+τdτ)ds\displaystyle\leq a(T)+[a(T)+B_{1}Ta(T)]\int_{s=0}^{T}\exp\bigg{(}\int_{0}^{T}A_{1}+\tau d\tau\bigg{)}ds
=a(T)+[a(T)+B1Ta(T)]Texp(A1T+T2/2)\displaystyle=a(T)+[a(T)+B_{1}Ta(T)]T\exp(A_{1}T+T^{2}/2)
=a(T)(1+(1+B1T)Texp(A1T+T2/2))\displaystyle=a(T)(1+(1+B_{1}T)T\exp(A_{1}T+T^{2}/2))

So that we can immediately conclude the first assertion of the theorem,

supt[0,T]𝒀^(t)𝒀(t)𝒴2a(T)(1+(T+B1T2)exp(A1T+T2/2))\sup_{t\in[0,T]}\|\widehat{\boldsymbol{Y}}(t)-\boldsymbol{Y}(t)\|_{\mathcal{Y}}^{2}\leq a(T)(1+(T+B_{1}T^{2})\exp(A_{1}T+T^{2}/2))

Lastly, we can use the results of section B.1 to get the key result on the expected supremum error. We take expectation on each of the three terms of a(T)a(T) and normalize them so they are in the form of the results of B.1.

1T2p=0Ts=0T𝔼𝝁𝒀𝑿¨𝐟nc,𝒙˙(𝑿,𝑽,𝚵)𝐟ϕ^EA(𝑿,𝑽,𝚵)𝒮2dsdp\displaystyle\frac{1}{T^{2}}\int_{p=0}^{T}\int_{s=0}^{T}\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\|\ddot{\boldsymbol{X}}-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{\bm{\phi}}^{EA}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}dsdp (A.13)
<K2ϕ^EAϕEA𝑳2(𝝆TEA)2\displaystyle<K^{2}\|\widehat{\bm{\phi}}^{EA}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA})}^{2} (A.14)

We similarly get that,

1Ts=0T𝔼𝝁𝒀𝚵˙𝐟nc,ξ(𝑿,𝑽,𝚵)𝐟ϕ^ξ(𝑿,𝑽,𝚵)𝒮2ds<K2ϕ^ξϕξ𝑳2(𝝆Tξ)2,\frac{1}{T}\int_{s=0}^{T}\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\|\dot{\boldsymbol{\Xi}}-\mathbf{f}^{\text{nc},\xi}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})-\mathbf{f}^{\widehat{{\bm{\phi}}}^{\xi}}(\boldsymbol{X},\boldsymbol{V},\boldsymbol{\Xi})\|_{\mathcal{S}}^{2}ds<K^{2}\|\widehat{\bm{\phi}}^{\xi}-\bm{\phi}^{\xi}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{\xi})}^{2}, (A.15)

and can get an analogous bound for the remaining term of a(T)a(T). These bounds together lead to

a(T)(2T4K2+2T2K2)ϕ^EAϕEA𝑳2(𝝆TEA)2+2T2K2ϕ^ξϕξ𝑳2(𝝆Tξ)2a(T)\leq(2T^{4}K^{2}+2T^{2}K^{2})\|\widehat{\bm{\phi}}^{EA}-\bm{\phi}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA})}^{2}+2T^{2}K^{2}\|\widehat{\bm{\phi}}^{\xi}-\bm{\phi}^{\xi}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{\xi})}^{2}

which implies the desired result.  

B Learning theory - technical tools

B.1 Continuity of the error functionals

For any t[0,T]t\in[0,T], consider the two random variables,

𝓔𝑿(t)EA(𝝋EA)\displaystyle\bm{\mathcal{E}}_{\boldsymbol{X}(t)}^{EA}(\bm{\varphi}^{EA}) =𝑿¨(t)𝐟nc,𝒙˙(𝑿(t),𝑽(t),𝚵(t))\displaystyle=\Big{\|}\ddot{\boldsymbol{X}}(t)-\mathbf{f}^{\text{nc},\dot{\boldsymbol{x}}}(\boldsymbol{X}(t),\boldsymbol{V}(t),\boldsymbol{\Xi}(t))
𝐟𝝋E(𝑿(t),𝑽(t),𝚵(t))𝐟𝝋A(𝑿(t),𝑽(t),𝚵(t))2𝒮\displaystyle-\mathbf{f}^{\bm{\varphi}^{E}}(\boldsymbol{X}(t),\boldsymbol{V}(t),\boldsymbol{\Xi}(t))-\mathbf{f}^{\bm{\varphi}^{A}}(\boldsymbol{X}(t),\boldsymbol{V}(t),\boldsymbol{\Xi}(t))\Big{\|}^{2}_{\mathcal{S}} (B.1)
𝓔𝚵(t)ξ(𝝋ξ)=𝚵˙(t)𝐟nc,ξ(𝑿(t),𝑽(t),𝚵(t))𝐟𝝋ξ(𝑿(t),𝑽(t),𝚵(t))2𝒮\displaystyle\bm{\mathcal{E}}_{\boldsymbol{\Xi}(t)}^{\xi}(\bm{\varphi}^{\xi})=\Big{\|}\dot{\bm{\Xi}}(t)-\mathbf{f}^{\text{nc},\xi}(\boldsymbol{X}(t),\boldsymbol{V}(t),\boldsymbol{\Xi}(t))-\mathbf{f}^{\bm{\varphi}^{\xi}}(\boldsymbol{X}(t),\boldsymbol{V}(t),\boldsymbol{\Xi}(t))\Big{\|}^{2}_{\mathcal{S}} (B.2)

These will be used in various places throughout the technical proofs and easily relate to the natural error functionals defined in (4.4), (4.5) as,

𝓔EA(𝝋EA)=1Ll=1L𝔼𝝁𝒀[𝓔𝑿(tl)EA(𝝋EA)],𝓔ξ(𝝋ξ)=1Ll=1L𝔼𝝁𝒀[𝓔𝚵(tl)ξ(𝝋ξ)].\bm{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA})=\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\left[\bm{\mathcal{E}}_{\boldsymbol{X}(t_{l})}^{EA}(\bm{\varphi}^{EA})\right],\qquad\bm{\mathcal{E}}_{\infty}^{\xi}(\bm{\varphi}^{\xi})=\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\left[\bm{\mathcal{E}}_{\boldsymbol{\Xi}(t_{l})}^{\xi}(\bm{\varphi}^{\xi})\right].
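For intuition, the random variable (B.1) can be evaluated directly from data; the following sketch handles the homogeneous case $K=1$ with no $\xi$ variable and no non-collective force, and assumes the candidate kernels accept arrays of pairwise distances and features.

import numpy as np

def error_EA_at_time(X, V, Xdd, phi_E, phi_A, s_E=None, s_A=None):
    # Squared S-norm of the residual in (B.1) for K = 1, no xi, no non-collective force:
    #   Xdd_i - (1/N) sum_{i'} [ phi_E(r_ii', s^E_ii') (x_i' - x_i)
    #                          + phi_A(r_ii', s^A_ii') (v_i' - v_i) ].
    # X, V, Xdd: arrays of shape (N, d); s_E, s_A: optional (N, N) feature arrays,
    # taken to be the pairwise distance itself if omitted (an illustrative default).
    N = X.shape[0]
    dx = X[None, :, :] - X[:, None, :]
    dv = V[None, :, :] - V[:, None, :]
    r = np.linalg.norm(dx, axis=-1)
    sE = r if s_E is None else s_E
    sA = r if s_A is None else s_A
    wE = phi_E(r, sE)
    wA = phi_A(r, sA)
    np.fill_diagonal(wE, 0.0)                 # exclude the i' = i terms
    np.fill_diagonal(wA, 0.0)
    pred = (wE[:, :, None] * dx + wA[:, :, None] * dv).sum(axis=1) / N
    res = Xdd - pred
    return (res ** 2).sum() / N               # the (1/N)-weighted squared S-norm

Averaging this quantity over the $L$ observation times and the $M$ observed trajectories gives, up to the exact normalization used in the definitions of section 4, the empirical error functional minimized by our estimators.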

We begin by establishing basic continuity results for our error functionals over the hypothesis space. The specific structure of the governing equations plays a critical role in the analysis.

Alignment and energy based kernels

Proposition 11

For 𝛗^EA,ϕ^EA𝓗EA\widehat{\bm{\varphi}}^{EA},\widehat{\bm{\phi}}^{EA}\in\boldsymbol{\mathcal{H}}^{EA} the true and empirical error functionals are bounded as follows,

|𝓔EA(𝝋^EA)𝓔EA(ϕ^EA)|\displaystyle|\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{\bm{\varphi}}^{EA})-\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{\bm{\phi}}^{EA})| K2𝝋^EAϕ^EA𝑳2(𝝆TEA,L)2ϕEA𝝋^EAϕ^EA𝑳2(𝝆TEA,L)\displaystyle\leq K^{2}\|\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}\|2\bm{\phi}^{EA}-\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})} (B.3)
|𝓔MEA(𝝋^EA)𝓔MEA(ϕ^EA)|\displaystyle|\boldsymbol{\mathcal{E}}_{M}^{EA}(\widehat{\bm{\varphi}}^{EA})-\boldsymbol{\mathcal{E}}_{M}^{EA}(\widehat{\bm{\phi}}^{EA})| K4max{R,Rx˙}2𝝋^EAϕ^EA2ϕEA𝝋^EAϕ^EA\displaystyle\leq K^{4}\max\{R,R_{\dot{x}}\}^{2}\|\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\infty}\|2\bm{\phi}^{EA}-\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\infty} (B.4)

Recall the definitions of R,Rx˙R,R_{\dot{x}} in equations (3.8), and (3.10).

Proof  Using Jensen’s inequality,

|𝓔𝑿(t)EA(𝝋^EA)𝓔𝑿(t)EA(ϕ^EA)|\displaystyle|\bm{\mathcal{E}}_{\boldsymbol{X}(t)}^{EA}(\widehat{\bm{\varphi}}^{EA})-\bm{\mathcal{E}}_{\boldsymbol{X}(t)}^{EA}(\widehat{\bm{\phi}}^{EA})|
=|k=1K1NkiCkk=1K1NkiCk(φ^kkEϕ^kkE)(rii,𝒔Eii)𝒓ii+(φ^kkAϕ^kkA)(rii,𝒔Aii)𝒓˙ii,\displaystyle=\bigg{|}\sum_{k=1}^{K}\frac{1}{N_{k}}\sum_{i\in C_{k}}\big{\langle}\sum_{k^{\prime}=1}^{K}\frac{1}{N_{k^{\prime}}}\sum_{i^{\prime}\in C_{k^{\prime}}}(\widehat{\varphi}_{kk^{\prime}}^{E}-\widehat{\phi}_{kk^{\prime}}^{E})(r_{ii^{\prime}},\boldsymbol{s}^{E}_{ii^{\prime}}){\boldsymbol{r}}_{ii^{\prime}}+(\widehat{\varphi}_{kk^{\prime}}^{A}-\widehat{\phi}_{kk^{\prime}}^{A})(r_{ii^{\prime}},\boldsymbol{s}^{A}_{ii^{\prime}}){\dot{\boldsymbol{r}}}_{ii^{\prime}},
k=1K1NkiCk(2ϕkkEφ^kkEϕ^kkE)(rii,𝒔Eii)𝒓ii+(2ϕkkAφ^kkAϕ^kkA)(rii,𝒔Aii)𝒓˙ii|\displaystyle\sum_{k^{\prime\prime}=1}^{K}\frac{1}{N_{k^{\prime\prime}}}\sum_{i^{\prime}\in C_{k^{\prime\prime}}}(2\phi_{kk^{\prime}}^{E}-\widehat{\varphi}_{kk^{\prime}}^{E}-\widehat{\phi}_{kk^{\prime}}^{E})(r_{ii^{\prime}},\boldsymbol{s}^{E}_{ii^{\prime}}){\boldsymbol{r}}_{ii^{\prime}}+(2\phi_{kk^{\prime}}^{A}-\widehat{\varphi}_{kk^{\prime}}^{A}-\widehat{\phi}_{kk^{\prime}}^{A})(r_{ii^{\prime}},\boldsymbol{s}^{A}_{ii^{\prime}})\dot{\boldsymbol{r}}_{ii^{\prime}}\big{\rangle}\bigg{|}
k=1Kk=1Kk=1K1NkiCk1NkiCk(φ^kkEϕ^kkE)(rii,𝒔Eii)𝒓ii+(φ^kkAϕ^kkA)(rii,𝒔Aii)𝒓˙ii\displaystyle\leq\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\sum_{k^{\prime\prime}=1}^{K}\frac{1}{N_{k}}\sum_{i\in C_{k}}\|\frac{1}{N_{k^{\prime}}}\sum_{i^{\prime}\in C_{k^{\prime}}}(\widehat{\varphi}_{kk^{\prime}}^{E}-\widehat{\phi}_{kk^{\prime}}^{E})(r_{ii^{\prime}},\boldsymbol{s}^{E}_{ii^{\prime}}){\boldsymbol{r}}_{ii^{\prime}}+(\widehat{\varphi}_{kk^{\prime}}^{A}-\widehat{\phi}_{kk^{\prime}}^{A})(r_{ii^{\prime}},\boldsymbol{s}^{A}_{ii^{\prime}}){\dot{\boldsymbol{r}}}_{ii^{\prime}}\| (B.5)
1NkiCk(2ϕkkEφ^kkEϕ^kkE)(rii,𝒔Eii)𝒓ii+(2ϕkkAφ^kkAϕ^kkA)(rii,𝒔Aii)𝒓˙ii\displaystyle\|\frac{1}{N_{k^{\prime\prime}}}\sum_{i^{\prime}\in C_{k^{\prime\prime}}}(2\phi_{kk^{\prime}}^{E}-\widehat{\varphi}_{kk^{\prime}}^{E}-\widehat{\phi}_{kk^{\prime}}^{E})(r_{ii^{\prime}},\boldsymbol{s}^{E}_{ii^{\prime}}){\boldsymbol{r}}_{ii^{\prime}}+(2\phi_{kk^{\prime}}^{A}-\widehat{\varphi}_{kk^{\prime}}^{A}-\widehat{\phi}_{kk^{\prime}}^{A})(r_{ii^{\prime}},\boldsymbol{s}^{A}_{ii^{\prime}})\dot{\boldsymbol{r}}_{ii^{\prime}}\|
<k=1Kk=1Kk=1K1NkNkiCk,iCk(φ^kkEϕ^kkE)(rii,𝒔Eii)𝒓ii+(φ^kkAϕ^kkA)(rii,𝒔Aii)𝒓˙ii2×\displaystyle<\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\sum_{k^{\prime\prime}=1}^{K}\sqrt{\frac{1}{N_{k}N_{k^{\prime}}}\sum_{i\in C_{k},i^{\prime}\in C_{k^{\prime}}}\|(\widehat{\varphi}_{kk^{\prime}}^{E}-\widehat{\phi}_{kk^{\prime}}^{E})(r_{ii^{\prime}},\boldsymbol{s}^{E}_{ii^{\prime}}){\boldsymbol{r}}_{ii^{\prime}}+(\widehat{\varphi}_{kk^{\prime}}^{A}-\widehat{\phi}_{kk^{\prime}}^{A})(r_{ii^{\prime}},\boldsymbol{s}^{A}_{ii^{\prime}}){\dot{\boldsymbol{r}}}_{ii^{\prime}}}\|^{2}\times
1NkNkiCk,iCk(2ϕkkEφ^kkEϕ^kkE)(rii,𝒔Eii)𝒓ii+(2ϕkkAφ^kkAϕ^kkA)(rii,𝒔Aii)𝒓˙ii2\displaystyle\qquad\sqrt{\frac{1}{N_{k}N_{k^{\prime\prime}}}\sum_{i\in C_{k},i^{\prime}\in C_{k^{\prime\prime}}}\|(2\phi_{kk^{\prime}}^{E}-\widehat{\varphi}_{kk^{\prime}}^{E}-\widehat{\phi}_{kk^{\prime}}^{E})(r_{ii^{\prime}},\boldsymbol{s}^{E}_{ii^{\prime}}){\boldsymbol{r}}_{ii^{\prime}}+(2\phi_{kk^{\prime}}^{A}-\widehat{\varphi}_{kk^{\prime}}^{A}-\widehat{\phi}_{kk^{\prime}}^{A})(r_{ii^{\prime}},\boldsymbol{s}^{A}_{ii^{\prime}})\dot{\boldsymbol{r}}_{ii^{\prime}}\|^{2}}
\displaystyle<\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\sum_{k^{\prime\prime}=1}^{K}\|\widehat{\varphi}_{kk^{\prime}}^{EA}-\widehat{\phi}_{kk^{\prime}}^{EA}\|_{L^{2}(\hat{\rho}_{T}^{t,kk^{\prime}})}\|2\phi_{kk^{\prime}}^{EA}-\widehat{\varphi}_{kk^{\prime}}^{EA}-\widehat{\phi}_{kk^{\prime}}^{EA}\|_{L^{2}(\hat{\rho}_{T}^{t,kk^{\prime\prime}})}
K2𝝋^EAϕ^EA𝑳2(𝝆^Tt)2ϕEA𝝋^EAϕ^EA𝑳2(𝝆^Tt),\displaystyle\leq K^{2}\|\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\bm{L}^{2}(\hat{\bm{\rho}}_{T}^{t})}\|2\bm{\phi}^{EA}-\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\bm{L}^{2}(\hat{\bm{\rho}}_{T}^{t})}, (B.6)

where

ρ^Tt,kk(r,𝒔E,r˙,𝒔A)=1LNkkl=1LiCk,iCkiiδrii(t),𝒔Eii(tl),r˙ii(tl),𝒔Aii(tl)(r,𝒔E,r˙,𝒔A)\smash{\widehat{\rho}}_{T}^{t,kk^{\prime}}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A})=\displaystyle\frac{1}{LN_{kk^{\prime}}}\sum_{l=1}^{L}\!\sum_{\begin{subarray}{c}i\in C_{k},i^{\prime}\in C_{k^{\prime}}\\ i\neq i^{\prime}\end{subarray}}\delta_{r_{ii^{\prime}}(t),\boldsymbol{s}^{E}_{ii^{\prime}}(t_{l}),\dot{r}_{ii^{\prime}}(t_{l}),\boldsymbol{s}^{A}_{ii^{\prime}}(t_{l})}(r,\boldsymbol{s}^{E},\dot{r},\boldsymbol{s}^{A})

and 𝝆^Tt=k,k=1,1K,Kρ^Tt,kk\widehat{\bm{\rho}}_{T}^{t}=\bigoplus_{k,k^{\prime}=1,1}^{K,K}\widehat{\rho}_{T}^{t,kk^{\prime}}. Therefore, we have that

|1Ll=1L𝓔𝑿(tl)EA(𝝋^EA)1Ll=1L𝓔𝑿(tl)EA(ϕ^EA)|1Ll=1L|𝓔𝑿(tl)EA(ϕ^EA)𝓔𝑿(tl)EA(𝝋^EA)|\displaystyle\big{|}\frac{1}{L}\sum_{l=1}^{L}\bm{\mathcal{E}}_{\boldsymbol{X}(t_{l})}^{EA}(\widehat{\bm{\varphi}}^{EA})-\frac{1}{L}\sum_{l=1}^{L}\bm{\mathcal{E}}_{\boldsymbol{X}(t_{l})}^{EA}(\widehat{\bm{\phi}}^{EA})\big{|}\leq\frac{1}{L}\sum_{l=1}^{L}\big{|}\bm{\mathcal{E}}_{\boldsymbol{X}(t_{l})}^{EA}(\widehat{\bm{\phi}}^{EA})-\bm{\mathcal{E}}_{\boldsymbol{X}(t_{l})}^{EA}(\widehat{\bm{\varphi}}^{EA})\big{|}
<K2Ll=1L𝝋^EAϕ^EA𝑳2(𝝆^Ttl)2ϕEA𝝋^EAϕ^EA𝑳2(𝝆^Ttl)\displaystyle<\frac{K^{2}}{L}\sum_{l=1}^{L}\|\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\bm{L}^{2}(\widehat{\bm{\rho}}_{T}^{t_{l}})}\|2\bm{\phi}^{EA}-\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\bm{L}^{2}(\widehat{\bm{\rho}}_{T}^{t_{l}})}
K21Ll=1L𝝋^EAϕ^EA2𝑳2(𝝆^Ttl)1Ll=1L2ϕEA𝝋^EAϕ^EA2𝑳2(𝝆^Ttl)\displaystyle\leq K^{2}\sqrt{\frac{1}{L}\sum_{l=1}^{L}\|\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|^{2}_{\bm{L}^{2}(\widehat{\bm{\rho}}_{T}^{t_{l}})}}\sqrt{\frac{1}{L}\sum_{l=1}^{L}\|2\bm{\phi}^{EA}-\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|^{2}_{\bm{L}^{2}(\widehat{\bm{\rho}}_{T}^{t_{l}})}}
=K2𝝋^EAϕ^EA𝑳2(𝝆^TL)2ϕEA𝝋^EAϕ^EA𝑳2(𝝆^TL)\displaystyle=K^{2}\|\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\boldsymbol{L}^{2}(\widehat{\boldsymbol{\rho}}_{T}^{L})}\|2\bm{\phi}^{EA}-\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\boldsymbol{L}^{2}(\widehat{\boldsymbol{\rho}}_{T}^{L})} (B.7)
K4max{R,Rx˙}2𝝋^EAϕ^EA2ϕEA𝝋^EAϕ^EA\displaystyle\leq K^{4}\max\{R,R_{\dot{x}}\}^{2}\|\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\infty}\|2\bm{\phi}^{EA}-\widehat{\bm{\varphi}}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\infty} (B.8)

Taking the expectation with respect to 𝝁𝒀\bm{\mu^{\boldsymbol{Y}}} on each side of (B.7) we get the first inequality. The second inequality follows by noticing that,

|𝓔MEA(𝝋^EA)𝓔MEA(ϕ^EA)|1Mm=1M|1Ll=1L𝓔𝑿(m)(tl)(𝝋^EA)1Ll=1L𝓔𝑿(m)(tl)(ϕ^EA)|.|\boldsymbol{\mathcal{E}}_{M}^{EA}(\widehat{\bm{\varphi}}^{EA})-\boldsymbol{\mathcal{E}}_{M}^{EA}(\widehat{\bm{\phi}}^{EA})|\leq\frac{1}{M}\sum_{m=1}^{M}\big{|}\frac{1}{L}\sum_{l=1}^{L}\boldsymbol{\mathcal{E}}_{\boldsymbol{X}^{(m)}(t_{l})}(\widehat{\bm{\varphi}}^{EA})-\frac{1}{L}\sum_{l=1}^{L}\boldsymbol{\mathcal{E}}_{\boldsymbol{X}^{(m)}(t_{l})}(\widehat{\bm{\phi}}^{EA})\big{|}.

 

Environment interaction kernels

Here we show an analogous result to the alignment and energy result above. The techniques are similar and the result serves an identical purpose in the theory. Recall the definition of RξR_{\xi} in (3.11).

Proposition 12

For 𝛗^,ϕ^𝓗ξ\widehat{\bm{\varphi}},\widehat{\bm{\phi}}\in\boldsymbol{\mathcal{H}}^{\xi}, we have

|𝓔ξ(𝝋^)𝓔ξ(ϕ^)|\displaystyle|\boldsymbol{\mathcal{E}}_{\infty}^{\xi}(\widehat{\bm{\varphi}})-\boldsymbol{\mathcal{E}}_{\infty}^{\xi}(\widehat{\bm{\phi}})| K2𝝋^ϕ^𝑳2(𝝆Tξ,L)2ϕξ𝝋^ϕ^𝑳2(𝝆Tξ,L)\displaystyle\leq K^{2}\|\widehat{\bm{\varphi}}-\widehat{\bm{\phi}}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{\xi,L})}\|2{\bm{\phi}}^{\xi}-\widehat{\bm{\varphi}}-\widehat{\bm{\phi}}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{\xi,L})} (B.9)
|𝓔Mξ(𝝋^)𝓔Mξ(ϕ^)|\displaystyle|\boldsymbol{\mathcal{E}}_{M}^{\xi}(\widehat{\bm{\varphi}})-\boldsymbol{\mathcal{E}}_{M}^{\xi}(\widehat{\bm{\phi}})| K4Rξ2𝝋^ϕ^2ϕ𝝋^ϕ^\displaystyle\leq K^{4}R_{\xi}^{2}\|\widehat{\bm{\varphi}}-\widehat{\bm{\phi}}\|_{\infty}\|2{\bm{\phi}}-\widehat{\bm{\varphi}}-\widehat{\bm{\phi}}\|_{\infty} (B.10)

The following lemma can be immediately deduced using (B.3), (B.4), and (B.8).

Lemma 13

For all 𝛗EA𝓗EA\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}, define the defect function LMEA(𝛗EA)L_{M}^{EA}(\bm{\varphi}^{EA}) as

LMEA(𝝋EA)=𝓔EA(𝝋EA)𝓔MEA(𝝋EA).L_{M}^{EA}(\bm{\varphi}^{EA})=\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA})-\boldsymbol{\mathcal{E}}_{M}^{EA}(\bm{\varphi}^{EA}). (B.11)

Then, given two functions 𝛗EA1,𝛗EA2𝓗EA\bm{\varphi}^{EA}_{1},\bm{\varphi}^{EA}_{2}\in\boldsymbol{\mathcal{H}}^{EA}, the defect function is bounded by

|LMEA(𝝋EA1)LMEA(𝝋EA2)|\displaystyle|L_{M}^{EA}(\bm{\varphi}^{EA}_{1})-L_{M}^{EA}(\bm{\varphi}^{EA}_{2})| 2K4max{R,Rx˙}2𝝋EA1𝝋EA2𝝋EA1+𝝋EA22ϕEA\displaystyle\leq 2K^{4}\max\{R,R_{\dot{x}}\}^{2}\|\bm{\varphi}^{EA}_{1}-\bm{\varphi}^{EA}_{2}\|_{\infty}\|\bm{\varphi}^{EA}_{1}+\bm{\varphi}^{EA}_{2}-2\bm{\phi}^{EA}\|_{\infty}

almost surely with respect to 𝛍𝐘\bm{\mu^{\boldsymbol{Y}}}.

A similar lemma can be immediately deduced for the \xi variable.

Lemma 14

For all 𝛗ξ𝓗ξ\bm{\varphi}^{\xi}\in\boldsymbol{\mathcal{H}}^{\xi}, define the defect function LMξ(𝛗ξ)L_{M}^{\xi}(\bm{\varphi}^{\xi}) as

LMξ(𝝋ξ)=𝓔ξ(𝝋ξ)𝓔Mξ(𝝋ξ).L_{M}^{\xi}(\bm{\varphi}^{\xi})=\boldsymbol{\mathcal{E}}_{\infty}^{\xi}(\bm{\varphi}^{\xi})-\boldsymbol{\mathcal{E}}_{M}^{\xi}(\bm{\varphi}^{\xi}). (B.12)

Then, given two functions 𝛗ξ1,𝛗ξ2𝓗ξ\bm{\varphi}^{\xi}_{1},\bm{\varphi}^{\xi}_{2}\in\boldsymbol{\mathcal{H}}^{\xi}, the defect function is bounded by

|LMξ(𝝋ξ1)LMξ(𝝋ξ2)|\displaystyle|L_{M}^{\xi}(\bm{\varphi}^{\xi}_{1})-L_{M}^{\xi}(\bm{\varphi}^{\xi}_{2})| 2K4Rξ2𝝋ξ1𝝋ξ2𝝋ξ1+𝝋ξ22ϕξ\displaystyle\leq 2K^{4}R_{\xi}^{2}\|\bm{\varphi}^{\xi}_{1}-\bm{\varphi}^{\xi}_{2}\|_{\infty}\|\bm{\varphi}^{\xi}_{1}+\bm{\varphi}^{\xi}_{2}-2\bm{\phi}^{\xi}\|_{\infty}

almost surely with respect to 𝛍𝐘\bm{\mu^{\boldsymbol{Y}}}.

B.2 Uniqueness of minimizers over a compact convex space

Recall the energy and alignment bilinear functional ,EA\langle\langle{\cdot,\cdot}\rangle\rangle_{EA}, previously defined in equation (6.1)

𝝋EA1,𝝋EA2EA:=1Ll=1L𝔼𝝁𝒀[𝐟𝝋EA1(𝑿(tl),𝑽(tl),𝚵(tl)),𝐟𝝋EA2(𝑿(tl),𝑽(tl),𝚵(tl))𝒮],\displaystyle\langle\langle{\bm{\varphi}^{EA}_{1},\bm{\varphi}^{EA}_{2}}\rangle\rangle_{EA}:=\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{[}\left\langle\mathbf{f}_{\bm{\varphi}^{EA}_{1}}(\boldsymbol{X}(t_{l}),\boldsymbol{V}(t_{l}),\boldsymbol{\Xi}(t_{l})),\mathbf{f}_{\bm{\varphi}^{EA}_{2}}(\boldsymbol{X}(t_{l}),\boldsymbol{V}(t_{l}),\boldsymbol{\Xi}(t_{l}))\right\rangle_{\mathcal{S}}\bigg{]},

for any \bm{\varphi}^{EA}_{1},\bm{\varphi}^{EA}_{2}\in\boldsymbol{\mathcal{H}}^{EA}. The \mathcal{S}-inner product is the inner product induced by the \|\cdot\|_{\mathcal{S}} norm via the polarization identity, which holds since we are working in an L^{2} space, where the parallelogram law holds. Our coercivity condition (6.1) can then be written in terms of this bilinear functional as: for all \bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA},

c𝓗EA𝝋EA2𝑳2(𝝆TEA,L)𝝋EA,𝝋EAc_{\boldsymbol{\mathcal{H}}^{EA}}\left\|\bm{\varphi}^{EA}\right\|^{2}_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}\leq\langle\langle{\bm{\varphi}^{EA},\bm{\varphi}^{EA}}\rangle\rangle
Proposition 15

Let the minimizer of the error functional be denoted

ϕ^L,,𝓗EAEA:=ϕ^L,,𝓗EAEϕ^L,,𝓗EAA:=argmin𝝋EA𝓗EA𝓔EA(𝝋EA);\widehat{{\bm{\phi}}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA}:=\widehat{{\bm{\phi}}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{E}\oplus\widehat{{\bm{\phi}}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{A}:=\underset{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}}{\operatorname{arg}\operatorname{min}}\;\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA});

then for all 𝛗EA𝓗EA\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}, the difference of the error functional at this element of 𝓗EA\boldsymbol{\mathcal{H}}^{EA} and the minimizer is lower bounded as,

𝓔EA(𝝋EA)𝓔EA(ϕ^L,,𝓗EAEA)c𝓗EA𝝋EAϕ^L,,𝓗EAEA𝑳2(𝝆TEA,L)2.\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA})-\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{{\bm{\phi}}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA})\geq c_{\boldsymbol{\mathcal{H}}^{EA}}\|\bm{\varphi}^{EA}-\widehat{{\bm{\phi}}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}\,. (B.13)

Thus, the minimizer of 𝓔EA\boldsymbol{\mathcal{E}}_{\infty}^{EA} over 𝓗EA\boldsymbol{\mathcal{H}}^{EA} is unique in 𝐋2(𝛒TEA,L)\bm{L}^{2}(\bm{\rho}_{T}^{EA,L}).

Proof  For 𝝋EA𝓗EA\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}, and to ease the notation let ϕ^EA:=ϕ^L,,𝓗EAEA\widehat{{\bm{\phi}}}^{EA}:=\widehat{{\bm{\phi}}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA}, we have

𝓔EA(𝝋EA)𝓔EA(ϕ^EA)\displaystyle\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA})-\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{{\bm{\phi}}}^{EA}) =(𝝋EAϕEA),(𝝋EAϕEA)\displaystyle=\langle\langle{(\bm{\varphi}^{EA}-\bm{\phi}^{EA}),(\bm{\varphi}^{EA}-\bm{\phi}^{EA})}\rangle\rangle
(ϕ^EAϕEA),(ϕ^EAϕEA)\displaystyle-\langle\langle{(\widehat{\bm{\phi}}^{EA}-\bm{\phi}^{EA}),(\widehat{\bm{\phi}}^{EA}-\bm{\phi}^{EA})}\rangle\rangle (B.14)

Using that X,XY,Y=XY,X+Y\langle\langle{X,X}\rangle\rangle-\langle\langle{Y,Y}\rangle\rangle=\langle\langle{X-Y,X+Y}\rangle\rangle, which holds by bilinearity and the definition of the form, we get that

(B.14)\displaystyle(\ref{interEq1}) =(𝝋EAϕ^EA),(𝝋EA+ϕ^EA2ϕEA\displaystyle=\langle\langle{(\bm{\varphi}^{EA}-\widehat{\bm{\phi}}^{EA}),(\bm{\varphi}^{EA}+\widehat{\bm{\phi}}^{EA}-2\bm{\phi}^{EA}}\rangle\rangle
=(𝝋EAϕ^EA),(𝝋EAϕ^EA+2(ϕ^EAϕEA))\displaystyle=\langle\langle{(\bm{\varphi}^{EA}-\widehat{\bm{\phi}}^{EA}),(\bm{\varphi}^{EA}-\widehat{\bm{\phi}}^{EA}+2(\widehat{\bm{\phi}}^{EA}-\bm{\phi}^{EA}))}\rangle\rangle
=(𝝋EAϕ^EA),(𝝋EAϕ^EA)+2(𝝋EAϕ^EA),(ϕ^EAϕEA)\displaystyle=\langle\langle{(\bm{\varphi}^{EA}-\widehat{\bm{\phi}}^{EA}),(\bm{\varphi}^{EA}-\widehat{\bm{\phi}}^{EA})}\rangle\rangle+2\langle\langle{(\bm{\varphi}^{EA}-\widehat{\bm{\phi}}^{EA}),(\widehat{\bm{\phi}}^{EA}-\bm{\phi}^{EA})}\rangle\rangle (B.15)

By the coercivity condition, the first term in (B.15) is at least as large as

c_{\boldsymbol{\mathcal{H}}^{EA}}\|\bm{\varphi}^{EA}-\widehat{\bm{\phi}}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{EA,L})}^{2}\geq 0

We are left to show the second term in (B.15) is nonnegative. Since 𝓗EA\boldsymbol{\mathcal{H}}^{EA} is convex, for all t[0,1]t\in[0,1], t(𝝋EA)+(1t)(ϕ^EA)𝓗EAt(\bm{\varphi}^{EA})+(1-t)(\widehat{\bm{\phi}}^{EA})\in\boldsymbol{\mathcal{H}}^{EA}. By definition of ϕ^EA\widehat{\bm{\phi}}^{EA} as an argmin,

𝓔EA(t𝝋EA+(1t)ϕ^EA)𝓔EA(ϕ^EA)0\boldsymbol{\mathcal{E}}_{\infty}^{EA}(t\bm{\varphi}^{EA}+(1-t)\widehat{\bm{\phi}}^{EA})-\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{\bm{\phi}}^{EA})\geq 0

which means, using a decomposition analogous to the one above in (B.15), that

\displaystyle\langle\langle(t\bm{\varphi}^{EA}+(1-t)\widehat{\bm{\phi}}^{EA}-\widehat{\bm{\phi}}^{EA}),(t\bm{\varphi}^{EA}+(1-t)\widehat{\bm{\phi}}^{EA}-\widehat{\bm{\phi}}^{EA}+2(\widehat{\bm{\phi}}^{EA}-\bm{\phi}^{EA}))\rangle\rangle
=t((𝝋EAϕ^EA)),(t𝝋EA+(2t)ϕ^EA2ϕEA)0\displaystyle=\langle\langle t((\bm{\varphi}^{EA}-\widehat{\bm{\phi}}^{EA})),(t\bm{\varphi}^{EA}+(2-t)\widehat{\bm{\phi}}^{EA}-2\bm{\phi}^{EA})\rangle\rangle\geq 0

So that,

t(𝝋EAϕ^EA),(t𝝋EA+(2t)ϕ^EA2ϕEA)0\displaystyle t\langle\langle(\bm{\varphi}^{EA}-\widehat{\bm{\phi}}^{EA}),(t\bm{\varphi}^{EA}+(2-t)\widehat{\bm{\phi}}^{EA}-2\bm{\phi}^{EA})\rangle\rangle\geq 0 (B.16)
(𝝋EAϕ^EA),(t𝝋EA+(2t)ϕ^EA2ϕEA)0\displaystyle\Leftrightarrow\langle\langle(\bm{\varphi}^{EA}-\widehat{\bm{\phi}}^{EA}),(t\bm{\varphi}^{EA}+(2-t)\widehat{\bm{\phi}}^{EA}-2\bm{\phi}^{EA})\rangle\rangle\geq 0 (B.17)

By the results of Section B.1, we have (Lipschitz) continuity of the bilinear functional \langle\langle{\cdot,\cdot}\rangle\rangle over \boldsymbol{\mathcal{H}}^{EA}\times\boldsymbol{\mathcal{H}}^{EA}. Taking the limit t\to 0^{+} in (B.17), and passing the limit through the expectations in \langle\langle{\cdot,\cdot}\rangle\rangle by the dominated convergence theorem (which applies thanks to the boundedness and continuity assumptions on the kernels), we obtain that the second term in (B.15) is nonnegative. Combined with the coercivity bound on the first term, this yields (B.13), and hence the uniqueness of the minimizer in \bm{L}^{2}(\bm{\rho}_{T}^{EA,L}).

Proposition 16

Let the minimizer of the error functional be denoted

ϕ^L,,𝓗ξ:=argmin𝝋ξ𝓗ξ𝓔ξ(𝝋ξ);\widehat{{\bm{\phi}}}_{L,\infty,\boldsymbol{\mathcal{H}}^{\xi}}:=\underset{\bm{\varphi}^{\xi}\in\boldsymbol{\mathcal{H}}^{\xi}}{\operatorname{arg}\operatorname{min}}\;\boldsymbol{\mathcal{E}}_{\infty}^{\xi}(\bm{\varphi}^{\xi});

then for all 𝛗ξ𝓗ξ\bm{\varphi}^{\xi}\in\boldsymbol{\mathcal{H}}^{\xi}, the difference of the error functional at this element of 𝓗ξ\boldsymbol{\mathcal{H}}^{\xi} and the minimizer is lower bounded as,

𝓔ξ(𝝋ξ)𝓔ξ(ϕ^L,,𝓗ξξ)c𝓗ξ𝝋ξϕ^L,,𝓗ξξ𝑳2(𝝆Tξ,L)2.\boldsymbol{\mathcal{E}}_{\infty}^{\xi}(\bm{\varphi}^{\xi})-\boldsymbol{\mathcal{E}}_{\infty}^{\xi}(\widehat{{\bm{\phi}}}_{L,\infty,\boldsymbol{\mathcal{H}}^{\xi}}^{\xi})\geq c_{\boldsymbol{\mathcal{H}}^{\xi}}\|\bm{\varphi}^{\xi}-\widehat{{\bm{\phi}}}_{L,\infty,\boldsymbol{\mathcal{H}}^{\xi}}^{\xi}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{\xi,L})}^{2}\,. (B.18)

Thus, the minimizer of 𝓔ξ\boldsymbol{\mathcal{E}}_{\infty}^{\xi} over 𝓗ξ\boldsymbol{\mathcal{H}}^{\xi} is unique in 𝐋2(𝛒Tξ,L)\bm{L}^{2}(\bm{\rho}_{T}^{\xi,L}).

B.3 Uniform estimates on defect functions

We start this section by introducing normalized errors of the estimators. Denote the minimizer of 𝓔EA()\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\cdot) over 𝓗EA\boldsymbol{\mathcal{H}}^{EA} by

ϕ^L,,𝓗EAEA:=ϕ^L,,𝓗EAEϕ^L,,𝓗EAA=argmin𝝋EA𝓗EA𝓔EA(𝝋EA).\displaystyle\widehat{\bm{\phi}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA}:=\widehat{\bm{\phi}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{E}\oplus\widehat{\bm{\phi}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{A}=\underset{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}}{\operatorname{arg}\operatorname{min}}\;\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA}). (B.19)

For any 𝝋EA𝓗EA\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}, define the normalized errors as

𝒟(𝝋EA):=𝓔EA(𝝋EA)𝓔EA(ϕ^L,,𝓗EAEA),\displaystyle\mathcal{D}_{\infty}(\bm{\varphi}^{EA}):=\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA})-\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{\bm{\phi}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA})\,, (B.20)
𝒟M(𝝋EA):=𝓔MEA(𝝋EA)𝓔MEA(ϕ^L,,𝓗EAEA).\displaystyle\mathcal{D}_{M}(\bm{\varphi}^{EA}):=\boldsymbol{\mathcal{E}}_{M}^{EA}(\bm{\varphi}^{EA})-\boldsymbol{\mathcal{E}}_{M}^{EA}(\widehat{\bm{\phi}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA})\,. (B.21)

These quantities capture the difference between the expected/empirical errors of the estimator and the function in the hypothesis space minimizing the expected error functional. We begin by proving a lemma that assumes the distance between the expected and empirical normalized errors are small for a given estimator. We then show that we have similar control on these distances for all points in a neighborhood of this particular one. This control enables us to apply a covering argument in the main proposition of this section due to the compactness of the hypothesis space.

Remark 17

Exactly analogous definitions hold for the ξ\xi variable and we will simply state the results in that case.

Lemma 18

For all ϵ>0\epsilon>0 and 0<α<10<\alpha<1, if the function 𝛗EA1𝓗EA\bm{\varphi}^{EA}_{1}\in\boldsymbol{\mathcal{H}}^{EA} satisfies

𝒟(𝝋EA1)𝒟M(𝝋EA1)𝒟(𝝋EA1)+ϵ<α,\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{1})-\mathcal{D}_{M}(\bm{\varphi}^{EA}_{1})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{1})+\epsilon}<\alpha\,,

then for all 𝛗EA2𝓗EA\bm{\varphi}^{EA}_{2}\in\boldsymbol{\mathcal{H}}^{EA} such that 𝛗EA1𝛗EA2αϵ8SEAmax{R,Rx˙}2K4\|\bm{\varphi}^{EA}_{1}-\bm{\varphi}^{EA}_{2}\|_{\infty}\leq\frac{\alpha\epsilon}{8S_{EA}\max{\{R,R_{\dot{x}}\}}^{2}K^{4}}, where SEA=max{SE,SA}S_{EA}=\max\{S_{E},S_{A}\} we have

𝒟(𝝋EA2)𝒟M(𝝋EA2)𝒟(𝝋EA2)+ϵ<3α.\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})-\mathcal{D}_{M}(\bm{\varphi}^{EA}_{2})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})+\epsilon}<3\alpha.

Proof  To ease the notation, write ϕ^EA:=ϕ^L,,𝓗EAEA\widehat{{\bm{\phi}}}^{EA}:=\widehat{\bm{\phi}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA}, and using definition (B.11), we have that

𝒟(𝝋EA2)𝒟M(𝝋EA2)𝒟(𝝋EA2)+ϵ\displaystyle\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})-\mathcal{D}_{M}(\bm{\varphi}^{EA}_{2})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})+\epsilon} =𝓔EA(𝝋EA2)𝓔EA(ϕ^EA)(𝓔MEA(𝝋EA2)𝓔MEA(ϕ^EA))𝒟(𝝋EA2)+ϵ\displaystyle=\frac{\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA}_{2})-\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{{\bm{\phi}}}^{EA})-(\boldsymbol{\mathcal{E}}_{M}^{EA}(\bm{\varphi}^{EA}_{2})-\boldsymbol{\mathcal{E}}_{M}^{EA}(\widehat{{\bm{\phi}}}^{EA}))}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})+\epsilon}
=LMEA(𝝋EA2)LMEA(𝝋EA1)𝒟(𝝋EA2)+ϵ+LMEA(𝝋EA1)LMEA(ϕ^EA)𝒟(𝝋EA2)+ϵ\displaystyle=\frac{L_{M}^{EA}(\bm{\varphi}^{EA}_{2})-L_{M}^{EA}(\bm{\varphi}^{EA}_{1})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})+\epsilon}+\frac{L_{M}^{EA}(\bm{\varphi}^{EA}_{1})-L_{M}^{EA}(\widehat{{\bm{\phi}}}^{EA})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})+\epsilon}\hskip 14.22636pt

By Lemma 13, we have

LMEA(𝝋EA2)LMEA(𝝋EA1)8SEAmax{R,Rx˙}2K4𝝋EA2𝝋EA1αϵ.L_{M}^{EA}(\bm{\varphi}^{EA}_{2})-L_{M}^{EA}(\bm{\varphi}^{EA}_{1})\leq 8S_{EA}\max\{R,R_{\dot{x}}\}^{2}K^{4}\|\bm{\varphi}^{EA}_{2}-\bm{\varphi}^{EA}_{1}\|_{\infty}\leq\alpha\epsilon.

By definition we have that 𝒟(𝝋EA2)0\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})\geq 0 implying that,

LM(𝝋EA1)LM(𝝋EA2)𝒟(𝝋EA2)+ϵα.\frac{L_{M}(\bm{\varphi}^{EA}_{1})-L_{M}(\bm{\varphi}^{EA}_{2})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})+\epsilon}\leq\alpha.

For the second term, by equation (B.4) and the assumption that α<1\alpha<1, we obtain that

𝓔EA(𝝋EA1)𝓔EA(𝝋EA2)<4SEAmax{R,Rx˙}2K4𝝋EA1𝝋EA2<ϵ.\bm{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA}_{1})-\bm{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA}_{2})<4S_{EA}\max\{R,R_{\dot{x}}\}^{2}K^{4}\|\bm{\varphi}^{EA}_{1}-\bm{\varphi}^{EA}_{2}\|_{\infty}<\epsilon.

Therefore

𝒟(𝝋EA1)𝒟(𝝋EA2)=𝓔EA(𝝋EA1)𝓔EA(𝝋EA2)<ϵϵ+𝒟(𝝋EA2),\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{1})-\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})=\bm{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA}_{1})-\bm{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA}_{2})<\epsilon\leq\epsilon+\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2}),

and thus

𝒟(𝝋EA1)+ϵ𝒟(𝝋EA2)+ϵ2.\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{1})+\epsilon}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})+\epsilon}\leq 2.

We conclude that

LMEA(𝝋EA1)LMEA(ϕ^EA)𝒟(𝝋EA2)+ϵ=𝒟(𝝋EA1)𝒟M(𝝋EA1)𝒟(𝝋EA1)+ϵ𝒟(𝝋EA1)+ϵ𝒟(𝝋EA2)+ϵ<2α,\frac{L_{M}^{EA}(\bm{\varphi}^{EA}_{1})-L_{M}^{EA}(\widehat{{\bm{\phi}}}^{EA})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})+\epsilon}=\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{1})-\mathcal{D}_{M}(\bm{\varphi}^{EA}_{1})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{1})+\epsilon}\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{1})+\epsilon}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{2})+\epsilon}<2\alpha,

and the result follows by summing the two estimates.  

Arguing in the same way as above, we can derive the lemma below using equation (B.10). We define \mathcal{D}_{\infty}^{\xi},\mathcal{D}_{M}^{\xi} analogously to (B.20), (B.21), using \bm{\mathcal{E}}_{\infty}^{\xi},\bm{\mathcal{E}}_{M}^{\xi} in the obvious way.

Lemma 19

For all ϵ>0\epsilon>0 and 0<α<10<\alpha<1, if the function ϕξ1𝓗ξ\bm{\phi}^{\xi}_{1}\in\boldsymbol{\mathcal{H}}^{\xi} satisfies

𝒟ξ(ϕξ1)𝒟Mξ(ϕξ1)𝒟ξ(ϕξ1)+ϵ<α,\frac{\mathcal{D}_{\infty}^{\xi}(\bm{\phi}^{\xi}_{1})-\mathcal{D}_{M}^{\xi}(\bm{\phi}^{\xi}_{1})}{\mathcal{D}_{\infty}^{\xi}(\bm{\phi}^{\xi}_{1})+\epsilon}<\alpha\,,

then for all ϕξ2𝓗ξ\bm{\phi}^{\xi}_{2}\in\boldsymbol{\mathcal{H}}^{\xi} such that ϕξ1ϕξ2αϵ8S0Rξ2K4\|\bm{\phi}^{\xi}_{1}-\bm{\phi}^{\xi}_{2}\|_{\infty}\leq\frac{\alpha\epsilon}{8S_{0}R_{\xi}^{2}K^{4}}, we have, for S0SξS_{0}\geq S_{\xi},

𝒟ξ(ϕξ2)𝒟Mξ(ϕξ2)𝒟ξ(ϕξ2)+ϵ<3α.\frac{\mathcal{D}_{\infty}^{\xi}(\bm{\phi}^{\xi}_{2})-\mathcal{D}_{M}^{\xi}(\bm{\phi}^{\xi}_{2})}{\mathcal{D}_{\infty}^{\xi}(\bm{\phi}^{\xi}_{2})+\epsilon}<3\alpha.

B.4 Concentration

Proposition 20

For all ϵ>0\epsilon>0, 0<α<10<\alpha<1, 𝛗EA𝓗EA\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}, the following concentration bound holds

P𝝁𝒀{𝒟(𝝋EA)𝒟M(𝝋EA)𝒟(𝝋EA)+ϵα}exp(c𝓗EAα2Mϵ32SEA2K4)P_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{\{}\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})-\mathcal{D}_{M}(\bm{\varphi}^{EA})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon}\geq\alpha\bigg{\}}\leq\exp\bigg{(}{\frac{-c_{\boldsymbol{\mathcal{H}}^{EA}}\alpha^{2}M\epsilon}{32S_{EA}^{2}K^{4}}}\bigg{)}

Proof  Consider the random variable Θ\Theta (with randomness coming from the random initial condition distributed 𝝁𝒀\bm{\mu^{\boldsymbol{Y}}}), and to ease the notation let ϕ^EA:=ϕ^L,,𝓗EAEA\widehat{{\bm{\phi}}}^{EA}:=\widehat{{\bm{\phi}}}_{L,\infty,\boldsymbol{\mathcal{H}}^{EA}}^{EA},

Θ=1Ll=1L(𝓔𝑿(tl)EA(𝝋EA)𝓔𝑿(tl)EA(ϕ^EA))\Theta=\frac{1}{L}\sum_{l=1}^{L}\big{(}\boldsymbol{\mathcal{E}}_{\boldsymbol{X}(t_{l})}^{EA}(\bm{\varphi}^{EA})-\boldsymbol{\mathcal{E}}_{\boldsymbol{X}(t_{l})}^{EA}(\widehat{{\bm{\phi}}}^{EA})\big{)}

The coercivity condition given in Definition (4), Proposition 15 and (B.3) allow us to bound the variance, denoted σ2\sigma^{2}, of Θ\Theta as follows.

σ2\displaystyle\sigma^{2} 𝔼𝝁𝒀[|1Ll=1L(𝓔𝑿(tl)EA(𝝋EA)𝓔𝑿(tl)EA(ϕ^EA))|2]\displaystyle\leq\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{[}\bigg{|}\frac{1}{L}\sum_{l=1}^{L}\big{(}\boldsymbol{\mathcal{E}}_{\boldsymbol{X}(t_{l})}^{EA}(\bm{\varphi}^{EA})-\boldsymbol{\mathcal{E}}_{\boldsymbol{X}(t_{l})}^{EA}(\widehat{{\bm{\phi}}}^{EA})\big{)}\bigg{|}^{2}\bigg{]}
1Ll=1L𝔼𝝁𝒀[|𝓔𝑿(tl)EA(𝝋EA)𝓔𝑿(tl)EA(ϕ^EA)|2]\displaystyle\leq\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{[}\bigg{|}\boldsymbol{\mathcal{E}}_{\boldsymbol{X}(t_{l})}^{EA}(\bm{\varphi}^{EA})-\boldsymbol{\mathcal{E}}_{\boldsymbol{X}(t_{l})}^{EA}(\widehat{{\bm{\phi}}}^{EA})\bigg{|}^{2}\bigg{]}
K4max{R,Rx˙}2𝝋EAϕ^EA𝑳2(𝝆TL)2𝝋EA+ϕ^EA2ϕEA2\displaystyle\leq K^{4}\max\{R,R_{\dot{x}}\}^{2}\|\bm{\varphi}^{EA}-\widehat{{\bm{\phi}}}^{EA}\|_{\bm{L}^{2}(\bm{\rho}_{T}^{L})}^{2}\|\bm{\varphi}^{EA}+\widehat{{\bm{\phi}}}^{EA}-2\bm{\phi}^{EA}\|_{\infty}^{2}
K4max{R,Rx˙}2c𝓗EA(𝓔EA(𝝋EA)𝓔EA(ϕ^EA))𝝋EA+ϕ^EA2ϕEA2\displaystyle\leq\frac{K^{4}\max\{R,R_{\dot{x}}\}^{2}}{c_{\boldsymbol{\mathcal{H}}^{EA}}}(\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA})-\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{{\bm{\phi}}}^{EA}))\|\bm{\varphi}^{EA}+\widehat{{\bm{\phi}}}^{EA}-2\bm{\phi}^{EA}\|_{\infty}^{2}
16SEA2max{R,Rx˙}2K4c𝓗EA(𝓔EA(𝝋EA)𝓔EA(ϕ^EA))\displaystyle\leq\frac{16S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4}}{c_{\boldsymbol{\mathcal{H}}^{EA}}}(\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\bm{\varphi}^{EA})-\boldsymbol{\mathcal{E}}_{\infty}^{EA}(\widehat{{\bm{\phi}}}^{EA}))
16SEA2max{R,Rx˙}2K4c𝓗EA𝒟(𝝋EA).\displaystyle\leq\frac{16S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4}}{c_{\boldsymbol{\mathcal{H}}^{EA}}}\mathcal{D}_{\infty}(\bm{\varphi}^{EA}). (B.22)

By applying equation (B.8) from the proof of Proposition 11, we have that \Theta\leq 8S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4} almost surely. Applying the one-sided Bernstein inequality to \Theta, and recalling the definitions (B.1) together with the definitions of the normalized errors in (B.20), (B.21), we get:

P𝝁𝒀{𝒟(𝝋EA)𝒟M(𝝋EA)𝒟(𝝋EA)+ϵα}exp(α2(𝒟(𝝋EA)+ϵ)2M2(σ2+8SEA2max{R,Rx˙}2K4α(𝒟(𝝋EA)+ϵ)3)).P_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{\{}\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})-\mathcal{D}_{M}(\bm{\varphi}^{EA})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon}\geq\alpha\bigg{\}}\leq\exp\Bigg{(}-\frac{\alpha^{2}(\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon)^{2}M}{2\Big{(}\sigma^{2}+\frac{8S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4}\alpha(\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon)}{3}\Big{)}}\Bigg{)}.

Now we provide a lower bound for the exponent to simplify the dependencies. We show that,

ϵc𝓗EA32SEA2max{R,Rx˙}2K6(𝒟(𝝋EA)+ϵ)22(σ2+8SEA2max{R,Rx˙}2K4α(𝒟(𝝋EA)+ϵ)3),\displaystyle\frac{\epsilon\cdot c_{\boldsymbol{\mathcal{H}}^{EA}}}{32S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{6}}\leq\frac{(\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon)^{2}}{2\Big{(}\sigma^{2}+\frac{8S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4}\alpha(\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon)}{3}\Big{)}}\,,

or equivalently,

ϵc𝓗EA16SEA2max{R,Rx˙}2K6(σ2+8SEA2max{R,Rx˙}2K4α(𝒟(𝝋EA)+ϵ)3)\displaystyle\frac{\epsilon\cdot c_{\boldsymbol{\mathcal{H}}^{EA}}}{16S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{6}}\bigg{(}\sigma^{2}+\frac{8S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{4}\alpha(\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon)}{3}\bigg{)}
(𝒟(𝝋EA)+ϵ)2.\displaystyle\hskip 113.81102pt\leq(\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon)^{2}\,.

By the estimate (B.4), since 0<α10<\alpha\leq 1, and 0<cL,N,𝓗EA<K20<c_{L,N,\boldsymbol{\mathcal{H}}^{EA}}<K^{2} it is sufficient to show that

𝒟(𝝋EA)ϵ+ϵ(𝒟(𝝋EA)+ϵ)6(𝒟(𝝋EA)+ϵ)2.\mathcal{D}_{\infty}(\bm{\varphi}^{EA})\epsilon+\frac{\epsilon(\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon)}{6}\leq(\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon)^{2}.

This follows from Young’s inequality as 2𝒟(𝝋EA)ϵ+ϵ2(𝒟(𝝋EA)+ϵ)22\mathcal{D}_{\infty}(\bm{\varphi}^{EA})\epsilon+\epsilon^{2}\leq(\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon)^{2}, and together these results give the desired bound of the proposition.  
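To illustrate how the concentration bound of Proposition 20 dictates the number of trajectories, the following is a minimal Python sketch that inverts the stated bound, exp(-c_{\mathcal{H}^{EA}} \alpha^{2} M \epsilon / (32 S_{EA}^{2} K^{4})), to find the smallest M guaranteeing a prescribed failure probability \delta. All numerical values of the structural constants below are assumed, purely illustrative choices, not values from the paper.

```python
import math

def sample_size_for_confidence(delta, eps, alpha, c_H, S_EA, K):
    """Smallest number of trajectories M making the right-hand side of
    Proposition 20, exp(-c_H * alpha**2 * M * eps / (32 * S_EA**2 * K**4)),
    at most delta.  All constants passed in are assumed illustrative values."""
    return math.ceil(32 * S_EA**2 * K**4 * math.log(1.0 / delta)
                     / (c_H * alpha**2 * eps))

# Illustrative (assumed) constants: failure probability 1e-3, tolerance eps = 0.05.
print(sample_size_for_confidence(delta=1e-3, eps=0.05, alpha=0.5, c_H=0.1, S_EA=1.0, K=2))
```

For fixed \alpha and \delta, the required M grows like 1/\epsilon, which follows directly from the form of the exponent.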

We can easily derive the desired supremum bound by a covering argument. The estimation of the covering numbers involved will play a critical role in the main theorems and will be done in a dimension-dependent way in order to obtain optimal minimax rates.

Proposition 21

In the notation of Proposition 20,

P𝝁𝒀{sup𝝋EA𝓗EA𝒟(𝝋EA)𝒟M(𝝋EA)𝒟(𝝋EA)+ϵ3α}𝒩(𝓗EA,αϵ8SEAmax{R,Rx˙}2K4)ec𝓗EAα2Mϵ32SEAK4\displaystyle P_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{\{}\sup_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}}\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})-\mathcal{D}_{M}(\bm{\varphi}^{EA})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon}\geq 3\alpha\bigg{\}}\leq\mathcal{N}\bigg{(}\boldsymbol{\mathcal{H}}^{EA},\frac{\alpha\epsilon}{8S_{EA}\max\{R,R_{\dot{x}}\}^{2}K^{4}}\bigg{)}e^{-\frac{c_{\boldsymbol{\mathcal{H}}^{EA}}\alpha^{2}M\epsilon}{32S_{EA}K^{4}}}

where 𝒩(𝓗EA,αϵ8SEAmax{R,Rx˙}2K4)\mathcal{N}\bigg{(}\boldsymbol{\mathcal{H}}^{EA},\frac{\alpha\epsilon}{8S_{EA}\max\{R,R_{\dot{x}}\}^{2}K^{4}}\bigg{)} denotes the covering number of 𝓗EA\boldsymbol{\mathcal{H}}^{EA} with radius
αϵ8SEAmax{R,Rx˙}2K4\frac{\alpha\epsilon}{8S_{EA}\max\{R,R_{\dot{x}}\}^{2}K^{4}}.

Proof  Let 𝝋EAi=𝝋Ei𝝋Ai𝓗EA\bm{\varphi}^{EA}_{i}=\bm{\varphi}^{E}_{i}\oplus\bm{\varphi}^{A}_{i}\in\boldsymbol{\mathcal{H}}^{EA}, for i=1,,𝒩(𝓗EA,αϵ8SEAmax{R,Rx˙}2K4)i=1,\ldots,\mathcal{N}\bigg{(}\boldsymbol{\mathcal{H}}^{EA},\frac{\alpha\epsilon}{8S_{EA}\max\{R,R_{\dot{x}}\}^{2}K^{4}}\bigg{)}, denote the center of disks DiD_{i} of radius αϵ8SEAmax{R,Rx˙}2K4\frac{\alpha\epsilon}{8S_{EA}\max\{R,R_{\dot{x}}\}^{2}K^{4}} covering 𝓗EA\boldsymbol{\mathcal{H}}^{EA}. The covering number is finite by the compactness assumption on the hypothesis space. By Lemma 18,

sup𝝋EADi𝒟(𝝋EA)𝒟M(𝝋EA)𝒟(𝝋EA)+ϵ3α𝒟(𝝋EAi)𝒟M(𝝋EAi)𝒟(𝝋EAi)+ϵα.\sup_{\bm{\varphi}^{EA}\in D_{i}}\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})-\mathcal{D}_{M}(\bm{\varphi}^{EA})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon}\geq 3\alpha\Rightarrow\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{i})-\mathcal{D}_{M}(\bm{\varphi}^{EA}_{i})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{i})+\epsilon}\geq\alpha.

Now, by Proposition 20, for each ii,

P𝝁𝒀{sup𝝋EADi𝒟(𝝋EA)𝒟M(𝝋EA)𝒟(𝝋EA)+ϵ3α}\displaystyle P_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{\{}\sup_{\bm{\varphi}^{EA}\in D_{i}}\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})-\mathcal{D}_{M}(\bm{\varphi}^{EA})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon}\geq 3\alpha\bigg{\}} P𝝁𝒀{𝒟(𝝋EAi)𝒟M(𝝋EAi)𝒟(𝝋EAi)+ϵα}\displaystyle\leq P_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{\{}\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{i})-\mathcal{D}_{M}(\bm{\varphi}^{EA}_{i})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA}_{i})+\epsilon}\geq\alpha\bigg{\}}
ec𝓗EAα2Mϵ32SEA2max{R,Rx˙}2K6.\displaystyle\leq e^{-\frac{c_{\boldsymbol{\mathcal{H}}^{EA}}\alpha^{2}M\epsilon}{32S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{6}}}.

By definition, 𝓗EAiDi\boldsymbol{\mathcal{H}}^{EA}\subseteq\bigcup_{i}D_{i}, so that

P𝝁𝒀{sup𝝋EA𝓗EA𝒟(𝝋EA)𝒟M(𝝋EA)𝒟(𝝋EA)+ϵ3α}\displaystyle P_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{\{}\sup_{\bm{\varphi}^{EA}\in\boldsymbol{\mathcal{H}}^{EA}}\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})-\mathcal{D}_{M}(\bm{\varphi}^{EA})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon}\geq 3\alpha\bigg{\}}
iP𝝁𝒀{sup𝝋EADi𝒟(𝝋EA)𝒟M(𝝋EA)𝒟(𝝋EA)+ϵ3α}\displaystyle\leq\sum_{i}P_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{\{}\sup_{\bm{\varphi}^{EA}\in D_{i}}\frac{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})-\mathcal{D}_{M}(\bm{\varphi}^{EA})}{\mathcal{D}_{\infty}(\bm{\varphi}^{EA})+\epsilon}\geq 3\alpha\bigg{\}}
𝒩(𝓗EA,αϵ8SEAmax{R,Rx˙}2K4)ec𝓗EAα2Mϵ32SEA2max{R,Rx˙}2K6.\displaystyle\leq\mathcal{N}\bigg{(}\boldsymbol{\mathcal{H}}^{EA},\frac{\alpha\epsilon}{8S_{EA}\max\{R,R_{\dot{x}}\}^{2}K^{4}}\bigg{)}e^{-\frac{c_{\boldsymbol{\mathcal{H}}^{EA}}\alpha^{2}M\epsilon}{32S_{EA}^{2}\max\{R,R_{\dot{x}}\}^{2}K^{6}}}.

 

Finally, we state the results for the \xi variable; the proofs are analogous. The advantage of splitting the theorems will become apparent: it allows us to control the covering numbers of the EA and \xi hypothesis spaces separately, enabling faster rates than if we viewed the task as estimating all the functions simultaneously. This is possible due to the fundamentally decoupled nature of the dynamical system.

Proposition 22

For all ϵ>0\epsilon>0, 0<α<10<\alpha<1, 𝛗ξ𝓗ξ\bm{\varphi}^{\xi}\in\boldsymbol{\mathcal{H}}^{\xi}, the following concentration bound holds

P𝝁𝒀{𝒟ξ(𝝋ξ)𝒟Mξ(𝝋ξ)𝒟ξ(𝝋ξ)+ϵα}ec𝓗ξα2Mϵ32S02K4.P_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{\{}\frac{\mathcal{D}_{\infty}^{\xi}(\bm{\varphi}^{\xi})-\mathcal{D}_{M}^{\xi}(\bm{\varphi}^{\xi})}{\mathcal{D}_{\infty}^{\xi}(\bm{\varphi}^{\xi})+\epsilon}\geq\alpha\bigg{\}}\leq e^{-\frac{c_{\boldsymbol{\mathcal{H}}^{\xi}}\alpha^{2}M\epsilon}{32S_{0}^{2}K^{4}}}\,.
Proposition 23

In the notation of Proposition 22,

P𝝁𝒀{sup𝝋ξ𝓗ξ𝒟ξ(𝝋ξ)𝒟Mξ(𝝋ξ)𝒟ξ(𝝋ξ)+ϵ3α}𝒩(𝓗ξ,αϵ8S0Rξ2K4)ec𝓗ξα2Mϵ32S0K4,\displaystyle P_{\bm{\mu^{\boldsymbol{Y}}}}\bigg{\{}\sup_{\bm{\varphi}^{\xi}\in\boldsymbol{\mathcal{H}}^{\xi}}\frac{\mathcal{D}_{\infty}^{\xi}(\bm{\varphi}^{\xi})-\mathcal{D}_{M}^{\xi}(\bm{\varphi}^{\xi})}{\mathcal{D}_{\infty}^{\xi}(\bm{\varphi}^{\xi})+\epsilon}\geq 3\alpha\bigg{\}}\leq\mathcal{N}\bigg{(}\boldsymbol{\mathcal{H}}^{\xi},\frac{\alpha\epsilon}{8S_{0}R_{\xi}^{2}K^{4}}\bigg{)}e^{-\frac{c_{\boldsymbol{\mathcal{H}}^{\xi}}\alpha^{2}M\epsilon}{32S_{0}K^{4}}}\,,

where 𝒩(𝓗ξ,αϵ8S0Rξ2K4)\mathcal{N}\bigg{(}\boldsymbol{\mathcal{H}}^{\xi},\frac{\alpha\epsilon}{8S_{0}R_{\xi}^{2}K^{4}}\bigg{)} denotes the covering number of 𝓗ξ\boldsymbol{\mathcal{H}}^{\xi} with radius αϵ8S0Rξ2K4\frac{\alpha\epsilon}{8S_{0}R_{\xi}^{2}K^{4}}.

C Further verification of coercivity condition

In this appendix, we study the coercivity condition for the second-order system of the form:

\begin{dcases}m_{i}\ddot{\boldsymbol{x}}_{i}&=F^{\dot{\boldsymbol{x}}}(\boldsymbol{x}_{i},\dot{\boldsymbol{x}}_{i},\xi_{i})+\sum_{i^{\prime}=1}^{N}\frac{1}{N}\big{(}\phi^{E}(\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|)(\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i})+\phi^{A}(\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|)(\dot{\boldsymbol{x}}_{i^{\prime}}-\dot{\boldsymbol{x}}_{i})\big{)}\\ \dot{\xi}_{i}&=F^{\xi}(\boldsymbol{x}_{i},\dot{\boldsymbol{x}}_{i},\xi_{i})+\sum_{i^{\prime}=1}^{N}\frac{1}{N}\phi^{\xi}(\left\|\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i}\right\|)(\xi_{i^{\prime}}-\xi_{i})\end{dcases}\, (C.1)

We prove the coercivity condition for the system (C.1) in the case of L=1L=1.

When the system does not have a \xi variable, the system (C.1) is related to the anticipation dynamics studied in [71]. When \phi^{A}\equiv 0 or \phi^{E}\equiv 0, the system is called energy-based or alignment-based, respectively, and has found applications in various disciplines, including opinion dynamics, particle dynamics, fish-milling dynamics, Cucker-Smale flocking dynamics, and phototaxis dynamics. We refer the reader to [52, 51, 83], where extensive numerical experiments demonstrate the effectiveness of the proposed learning approach on the aforementioned dynamics.

For conciseness, we only present the proof of coercivity for learning \phi^{E} and \phi^{A}. A similar argument proves the coercivity for learning \phi^{\xi}. Our arguments also cover the special cases \phi^{A}\equiv 0 and \phi^{E}\equiv 0; therefore we obtain a strict generalization of the coercivity results in [52, 51], which cover only first-order energy-based systems. One may go further and analyze the coercivity condition for heterogeneous systems. Compared to the homogeneous case, the coercivity condition would impose constraints on the angles between components of subspaces defined on the direct sum of measures, suggesting that the learning task is more difficult when K\geq 2.
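To make the structure of (C.1) concrete, the following is a minimal Python sketch of a forward simulation of the homogeneous system with unit masses, zero non-collective forces, and simple kernels; all of these choices are assumptions made only for illustration, not the kernels or parameters used elsewhere in the paper.

```python
import numpy as np

# Illustrative (assumed) kernel choices for the homogeneous system (C.1).
phi_E  = lambda r: 1.0 / (1.0 + r**2)      # energy-based kernel
phi_A  = lambda r: np.exp(-r)              # alignment-based kernel
phi_xi = lambda r: np.exp(-r**2)           # environment (xi) kernel

def rhs(X, V, Xi):
    """Collective part of (C.1): accelerations and xi-derivatives (F^xdot = F^xi = 0, m_i = 1)."""
    N = X.shape[0]
    diff = X[None, :, :] - X[:, None, :]     # x_{i'} - x_i, shape (N, N, d)
    vdiff = V[None, :, :] - V[:, None, :]    # xdot_{i'} - xdot_i
    xidiff = Xi[None, :] - Xi[:, None]       # xi_{i'} - xi_i
    r = np.linalg.norm(diff, axis=2)         # pairwise distances r_{ii'}
    acc = (phi_E(r)[:, :, None] * diff + phi_A(r)[:, :, None] * vdiff).sum(axis=1) / N
    dxi = (phi_xi(r) * xidiff).sum(axis=1) / N
    return acc, dxi

def simulate(N=10, d=2, T=5.0, L=200, seed=0):
    rng = np.random.default_rng(seed)
    X, V, Xi = rng.normal(size=(N, d)), rng.normal(size=(N, d)), rng.normal(size=N)
    dt, traj = T / L, []
    for _ in range(L):
        acc, dxi = rhs(X, V, Xi)
        X, V, Xi = X + dt * V, V + dt * acc, Xi + dt * dxi   # explicit Euler step
        traj.append((X.copy(), V.copy(), Xi.copy()))
    return traj

traj = simulate()
print("final mean speed:", np.linalg.norm(traj[-1][1], axis=1).mean())
```

Trajectory data generated in this fashion (positions, velocities, and \xi values at the observation times) is the type of input from which the estimators of \phi^{E}, \phi^{A}, \phi^{\xi} are constructed.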

Theorem 24

Consider the system (C.1) at time t1=0t_{1}=0 with the initial distribution μ0𝐘=[μ0𝐗μ0𝐗˙μ0𝚵]\mu_{0}^{\boldsymbol{Y}}=\begin{bmatrix}\mu_{0}^{\boldsymbol{X}}\\ \mu_{0}^{\dot{\boldsymbol{X}}}\\ \mu_{0}^{\boldsymbol{\Xi}}\end{bmatrix}, where μ0𝐗{\mu}_{0}^{\boldsymbol{X}} is exchangeable Gaussian with cov(𝐱i(t1))cov(𝐱i(t1),𝐱j(t1))=λId\mathrm{cov}(\boldsymbol{x}_{i}(t_{1}))-\mathrm{cov}(\boldsymbol{x}_{i}(t_{1}),\boldsymbol{x}_{j}(t_{1}))=\lambda I_{d}  for a constant λ>0\lambda>0, μ0𝐗˙,μ0𝚵{\mu}_{0}^{\dot{\boldsymbol{X}}},{\mu}_{0}^{\boldsymbol{\Xi}} are exchangeable with finite second moment, and they are independent of μ0𝐗{\mu}_{0}^{\boldsymbol{X}}. Then

\displaystyle\mathbb{E}_{\mu_{0}^{\boldsymbol{Y}}}\|\mathbf{f}_{\varphi^{E}\oplus\varphi^{A}}(\boldsymbol{X}(0),\boldsymbol{V}(0))\|^{2}_{\mathcal{S}}\geq c_{1,N,\mathcal{H}^{EA}}\left\|\varphi^{E}\oplus\varphi^{A}\right\|^{2}_{L^{2}(\rho_{T}^{EA,1})},
\displaystyle\mathbb{E}_{\mu_{0}^{\boldsymbol{Y}}}\|\mathbf{f}_{\varphi^{\xi}}(\boldsymbol{X}(0),\boldsymbol{\Xi}(0))\|^{2}_{\mathcal{S}}\geq c_{1,N,\mathcal{H}^{\xi}}\left\|\varphi^{\xi}\right\|^{2}_{L^{2}(\rho_{T}^{\xi,1})},
  • (1)

    where we have

    ρTEA,1(r,r˙)=𝔼μ𝟎𝐘[δr12(t1),r˙12(t1)(r,r˙)]=𝔼μ0𝑿δr12(t1)𝔼μ0𝑿˙δr˙12(t1),\displaystyle\rho_{T}^{EA,1}(r,\dot{r})=\mathbb{E}_{\mathbf{\mu_{0}^{\boldsymbol{Y}}}}\big{[}\delta_{r_{12}(t_{1}),\dot{r}_{12}(t_{1})}(r,\dot{r})\big{]}=\mathbb{E}_{\mu_{0}^{\boldsymbol{X}}}\delta_{r_{12}}(t_{1})\mathbb{E}_{\mu_{0}^{\dot{\boldsymbol{X}}}}\delta_{\dot{r}_{12}}(t_{1}),
    ρTξ,1(r,ξ)=𝔼μ𝟎𝐘[δr12,ξ12(t1)]\displaystyle\rho_{T}^{\xi,1}(r,\xi)=\mathbb{E}_{\mathbf{\mu_{0}^{\boldsymbol{Y}}}}\big{[}\delta_{r_{12},\xi_{12}}(t_{1})\big{]}
    ||φEφA||2L2(ρTEA,1)=φE(r)r+φA(r)r˙L2(ρTEA,1(r,r˙))2,\displaystyle{\left|\kern-1.07639pt\left|\kern-1.07639pt\varphi^{E}\oplus\varphi^{A}\right|\kern-1.07639pt\right|\kern-1.07639pt}^{2}_{L^{2}(\rho_{T}^{EA,1})}=\|\varphi^{E}(r)r+\varphi^{A}(r)\dot{r}\|_{L^{2}(\rho_{T}^{EA,1}(r,\dot{r}))}^{2},
    \displaystyle\left\|\varphi^{\xi}\right\|^{2}_{L^{2}(\rho_{T}^{\xi,1})}=\|\varphi^{\xi}(r)\xi\|_{L^{2}(\rho_{T}^{\xi,1}(r,\xi))}^{2},
  • (2)

    c1,N,EAN12N2+(N1)(N2)2N2cc_{1,N,\mathcal{H}^{EA}}\geq\frac{N-1}{2N^{2}}+\frac{(N-1)(N-2)}{2N^{2}}c, c=min{cEAE,cEAAcμ0𝑿˙}c=\min\bigg{\{}c_{\mathcal{H}^{EA}}^{E},c_{\mathcal{H}^{EA}}^{A}c_{\mu_{0}^{\dot{\boldsymbol{X}}}}\bigg{\}} with c𝝁0𝑿˙=1𝔼𝒙˙i(0),𝒙˙i(0)𝔼𝒙˙i(0)2c_{{\bm{\mu}}_{0}^{\dot{\boldsymbol{X}}}}=1-\frac{\mathbb{E}\langle\dot{\boldsymbol{x}}_{i}(0),\dot{\boldsymbol{x}}_{i^{\prime}}(0)\rangle}{\mathbb{E}\|\dot{\boldsymbol{x}}_{i}(0)\|^{2}} (iii\neq i^{\prime}) and cEAEc_{\mathcal{H}^{EA}}^{E} and cEAAc_{\mathcal{H}^{EA}}^{A} are non-negative constants and are positive for compact EA\mathcal{H}^{EA} of L2(ρEA,1T)L^{2}(\rho^{EA,1}_{T}) and independent of NN.

  • (3)

    c1,N,ξ(N1N2+(N1)(N2)N2c),c=cξcμ0𝚵c_{1,N,\mathcal{H}^{\xi}}\geq(\frac{N-1}{N^{2}}+\frac{(N-1)(N-2)}{N^{2}}c),c=c_{\mathcal{H}^{\xi}}c_{\mu_{0}^{\boldsymbol{\Xi}}} with cμ0𝚵=1𝔼ξi(0),ξi(0)𝔼ξi(0)2c_{{\mu}_{0}^{\boldsymbol{\Xi}}}=1-\frac{\mathbb{E}\langle\xi_{i}(0),\xi_{i^{\prime}}(0)\rangle}{\mathbb{E}\|\xi_{i}(0)\|^{2}} (iii\neq i^{\prime}) and cξc_{\mathcal{H}^{\xi}} is a non-negative constant and is positive for compact ξ\mathcal{H}^{\xi} of L2(ρξ,1T)L^{2}(\rho^{\xi,1}_{T}) and independent of NN.

Proof 

The proof of part (1) follows from the definition of measures, norms and the properties of the initial distributions. For part (2), we have

\displaystyle\mathbb{E}_{\mu_{0}^{\boldsymbol{Y}}}\|\mathbf{f}_{\varphi^{E}\oplus\varphi^{A}}(\boldsymbol{X}(0),\boldsymbol{V}(0))\|^{2}_{\mathcal{S}} =\frac{1}{N^{3}}\sum_{i=1}^{N}\left(\sum_{j=k=1}^{N}+\sum_{j\neq k=1}^{N}\right)\left(C_{i,j,k}^{E}+C_{i,j,k}^{A}+D_{i,j,k}\right)
=N1N2(φE(r)rL2(ρTEA,1)2+φA(r)r˙L2(ρTEA,1)2)+\displaystyle=\frac{N-1}{N^{2}}(\|\varphi^{E}(r)r\|_{L^{2}(\rho_{T}^{EA,1})}^{2}+\|\varphi^{A}(r)\dot{r}\|_{L^{2}(\rho_{T}^{EA,1})}^{2})+\mathcal{R}
N12N2φEφA2L2(ρTEA,1)+\displaystyle\geq\frac{N-1}{2N^{2}}\|\varphi^{E}\oplus\varphi^{A}\|^{2}_{L^{2}(\rho_{T}^{EA,1})}+\mathcal{R} (C.2)

where

Ci,j,kE\displaystyle C_{i,j,k}^{E} =𝔼μ0𝒀φE(𝒓ji(0))φE(𝒓ki(0))𝒓ji(0),𝒓ki(0),\displaystyle=\mathbb{E}_{\mu_{0}^{\boldsymbol{Y}}}\varphi^{E}(\|\boldsymbol{r}_{ji}(0)\|)\varphi^{E}(\|\boldsymbol{r}_{ki}(0)\|)\langle\boldsymbol{r}_{ji}(0),\boldsymbol{r}_{ki}(0)\big{\rangle},
Ci,j,kA\displaystyle C_{i,j,k}^{A} =𝔼μ0𝒀φA(𝒓ji(0))φA(𝒓ki(0))𝒓˙ji(0),𝒓˙ki(0),\displaystyle=\mathbb{E}_{\mu_{0}^{\boldsymbol{Y}}}\varphi^{A}(\|\boldsymbol{r}_{ji}(0)\|)\varphi^{A}(\|\boldsymbol{r}_{ki}(0)\|)\langle\dot{\boldsymbol{r}}_{ji}(0),\dot{\boldsymbol{r}}_{ki}(0)\big{\rangle},
Di,j,k\displaystyle D_{i,j,k} =𝔼μ0𝒀(φE(𝒓ji(0))φA(𝒓ki(0))𝒓ji(0),𝒓˙ki(0)\displaystyle=\mathbb{E}_{\mu_{0}^{\boldsymbol{Y}}}\big{(}\varphi^{E}(\|\boldsymbol{r}_{ji}(0)\|)\varphi^{A}(\|\boldsymbol{r}_{ki}(0)\|)\langle\boldsymbol{r}_{ji}(0),\dot{\boldsymbol{r}}_{ki}(0)\big{\rangle}
+φA(𝒓ji(0))φE(𝒓ki(0))𝒓˙ji(0),𝒓ki(0))=0,\displaystyle\quad\quad+\varphi^{A}(\|\boldsymbol{r}_{ji}(0)\|)\varphi^{E}(\|\boldsymbol{r}_{ki}(0)\|)\langle\dot{\boldsymbol{r}}_{ji}(0),\boldsymbol{r}_{ki}(0)\big{\rangle}\big{)}=0,
\displaystyle\mathcal{R} =1N3i=1Njk,ji,ki(CijkA+CijkE).\displaystyle=\frac{1}{N^{3}}\sum_{i=1}^{N}\sum_{j\neq k,j\neq i,k\neq i}(C_{ijk}^{A}+C_{ijk}^{E}).

By the property of μ0𝒀\mu_{0}^{\boldsymbol{Y}}, we have

CijkE\displaystyle C_{ijk}^{E} =𝔼[φE(X1X2)φE(X1X3)X2X1,X3X1]\displaystyle=\mathbb{E}\big{[}\varphi^{E}(\|X_{1}-X_{2}\|)\varphi^{E}(\|X_{1}-X_{3}\|)\left\langle X_{2}-X_{1},X_{3}-X_{1}\right\rangle\big{]}
CijkA\displaystyle C_{ijk}^{A} =𝔼[φA(X1X2)φA(X1X3)]𝔼[Y2Y1,Y3Y1],\displaystyle=\mathbb{E}\big{[}\varphi^{A}(\|X_{1}-X_{2}\|)\varphi^{A}(\|X_{1}-X_{3}\|)\big{]}\mathbb{E}\big{[}\left\langle Y_{2}-Y_{1},Y_{3}-Y_{1}\right\rangle\big{]},

for all (i,j,k), where the X_{i}'s are exchangeable Gaussian random vectors with \mathrm{cov}(X_{1})-\mathrm{cov}(X_{1},X_{2})=\lambda I_{d} and the Y_{i}'s are exchangeable random vectors with the same distribution as the agents' initial velocities \dot{\boldsymbol{x}}_{i}. From Lemma 10 in [51] and Lemma 25 below,

CijkE\displaystyle C_{ijk}^{E} cEAEφE02L2(ρTEA,1)\displaystyle\geq c_{\mathcal{H}^{EA}}^{E}\|\varphi^{E}\oplus 0\|^{2}_{L^{2}(\rho_{T}^{EA,1})}
CijkA\displaystyle C_{ijk}^{A} cEAAcμ0𝑿˙0φA2L2(ρTEA,1),cμ0𝑿˙=(1𝔼Y1,Y2𝔼Y12),\displaystyle\geq c_{\mathcal{H}^{EA}}^{A}c_{\mu_{0}^{\dot{\boldsymbol{X}}}}\|0\oplus\varphi^{A}\|^{2}_{L^{2}(\rho_{T}^{EA,1})}\,,\qquad c_{\mu_{0}^{\dot{\boldsymbol{X}}}}=(1-\frac{\mathbb{E}\langle Y_{1},Y_{2}\rangle}{\mathbb{E}\|Y_{1}\|^{2}}),

where the constants c_{\mathcal{H}^{EA}}^{E},c_{\mathcal{H}^{EA}}^{A}\geq 0 are independent of N and are positive for compact \mathcal{H}^{EA}, and we used the fact that

𝔼[Y2Y1,Y3Y1]=𝔼Y12(1𝔼Y1,Y2𝔼Y12)0.\mathbb{E}\big{[}\left\langle Y_{2}-Y_{1},Y_{3}-Y_{1}\right\rangle\big{]}=\mathbb{E}{\|Y_{1}\|^{2}}(1-\frac{\mathbb{E}\langle Y_{1},Y_{2}\rangle}{\mathbb{E}\|Y_{1}\|^{2}})\geq 0.

Therefore,

\displaystyle\mathbb{E}_{\mu_{0}}[\big{\|}\mathbf{f}_{\varphi}(\boldsymbol{X}(0),\dot{\boldsymbol{X}}(0))\big{\|}_{\mathcal{S}}^{2}]\geq c_{1,N,\mathcal{H}^{EA}}\|\varphi^{E}\oplus\varphi^{A}\|^{2}_{L^{2}(\rho_{T}^{EA,1})},\,
c1,N,EAN12N2+(N1)(N2)2N2min{cEAE,cEAAcμ0𝑿˙}.\displaystyle c_{1,N,\mathcal{H}^{EA}}\geq\frac{N-1}{2N^{2}}+\frac{(N-1)(N-2)}{2N^{2}}\min\bigg{\{}c_{\mathcal{H}^{EA}}^{E},c_{\mathcal{H}^{EA}}^{A}c_{\mu_{0}^{\dot{\boldsymbol{X}}}}\bigg{\}}.

For part (3), the proof follows a similar path as for part (2).  

From Theorem 24, we see that the coercivity constant c_{1,N,\mathcal{H}^{EA}} is positive uniformly in N whenever c_{\mu_{0}^{\dot{\boldsymbol{X}}}}>0. In fact, many distributions on \mathbb{R}^{dN} with non-i.i.d. components make the constant c_{\mu_{0}^{\dot{\boldsymbol{X}}}} positive; for example, this holds when the components of \dot{\boldsymbol{X}} are exchangeable Gaussian but not i.i.d. and d\geq 2. In this case, coercivity is also a property of the system in the limit N\rightarrow\infty, satisfying the mean-field equations. As a result, the estimation error of our estimators is independent of N.

The proof of Theorem 24 used the following lemma, whose proof is the same as that of Lemma 10 in [51]. To be self-contained, we state it here.

Lemma 25

Let X_{1},X_{2},X_{3} be exchangeable Gaussian random vectors in \mathbb{R}^{d} with \mathrm{cov}(X_{1})-\mathrm{cov}(X_{1},X_{2})=\lambda I_{d} for a constant \lambda>0.

  • The marginal distribution of ρTEA,1(r,r˙)\rho_{T}^{EA,1}(r,\dot{r}) with respect to rr, denoted by ρ(r)\rho(r), is a probability measure over + with density function Cλ1rd1e14λr2C_{\lambda}^{-1}r^{d-1}e^{-\frac{1}{4\lambda}r^{2}} where Cλ=12(4λ)d2Γ(d2)C_{\lambda}=\frac{1}{2}(4\lambda)^{\frac{d}{2}}\Gamma(\frac{d}{2}).

  • We have

    𝔼[φ(|X1X2|)φ(|X1X3|)]c𝒳φ2L2(ρ)\mathbb{E}\left[\varphi(|X_{1}-X_{2}|)\varphi(|X_{1}-X_{3}|)\right]\geq c_{\mathcal{\mathcal{X}}}\|\varphi\|^{2}_{L^{2}(\rho)} (C.3)

    for all φ𝒳L2(ρ)\varphi\in\mathcal{X}\subset L^{2}(\rho), with c𝒳>0c_{\mathcal{X}}>0 if 𝒳\mathcal{X} is compact and c𝒳=0c_{\mathcal{X}}=0 if 𝒳=L2(ρ)\mathcal{X}=L^{2}(\rho).
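The first claim of Lemma 25 can be checked by simulation: since X_{1}-X_{2} is Gaussian with covariance 2\lambda I_{d}, the stated density is that of a chi distribution with d degrees of freedom rescaled by \sqrt{2\lambda}. The following minimal Python sketch compares samples of \|X_{1}-X_{2}\| against this law; the particular exchangeable construction X_{i}=Z+W_{i} and all numerical values are assumptions made only for illustration.

```python
import numpy as np
from scipy.stats import chi, kstest

d, lam, n = 3, 0.7, 200_000
rng = np.random.default_rng(1)

# Exchangeable Gaussian construction (assumed for illustration):
# X_i = Z + W_i with Z ~ N(0, sigma2 * I_d) shared and W_i ~ N(0, lam * I_d) independent,
# so that cov(X_1) - cov(X_1, X_2) = lam * I_d, as required by Lemma 25.
sigma2 = 0.5
Z = rng.normal(scale=np.sqrt(sigma2), size=(n, d))
W1 = rng.normal(scale=np.sqrt(lam), size=(n, d))
W2 = rng.normal(scale=np.sqrt(lam), size=(n, d))
r = np.linalg.norm((Z + W1) - (Z + W2), axis=1)   # samples of |X_1 - X_2|

# The claimed density C_lam^{-1} r^{d-1} exp(-r^2/(4*lam)) is the chi law with d
# degrees of freedom rescaled by sqrt(2*lam); compare via a Kolmogorov-Smirnov test.
print(kstest(r, chi(df=d, scale=np.sqrt(2 * lam)).cdf))
print("E[r^2] empirical vs 2*lam*d:", (r**2).mean(), 2 * lam * d)
```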

D Existence, uniqueness and properties of the measures

In this section, we provide technical details on the analytic properties of the collective system under consideration, as well as of the measures defined in Section 5.1. We emphasize that for the analytic portion of the theory, as we saw with the trajectory prediction result, we view the system (2.2) as coupled (whereas for the learning theory we leverage the fact that the equations can be decoupled to improve the estimation performance).

We begin by showing that, under the assumption that the interaction kernels lie in the corresponding admissible spaces, the system is well-posed.

D.1 Well-posedness of second-order heterogeneous systems

Proposition 26

Suppose the kernels {\bm{\phi}}^{E}=(\phi_{kk^{\prime}}^{E})_{k,k^{\prime}=1}^{K,K},\ {\bm{\phi}}^{A}=(\phi_{kk^{\prime}}^{A})_{k,k^{\prime}=1}^{K,K},\ {\bm{\phi}}^{\xi}=(\phi_{kk^{\prime}}^{\xi})_{k,k^{\prime}=1}^{K,K} lie in the admissible sets \boldsymbol{\mathcal{K}}_{S_{E}}^{E}, \boldsymbol{\mathcal{K}}_{S_{A}}^{A}, \boldsymbol{\mathcal{K}}_{S_{\xi}}^{\xi}, respectively, where the admissible spaces are defined in (3.12). Then the second-order heterogeneous system (2.2) admits a unique global solution on [0,T] for every initial datum \boldsymbol{X}(0),\dot{\boldsymbol{X}}(0)\in\mathbb{R}^{dN}, \bm{\Xi}(0)\in\mathbb{R}^{N}, and the solution depends continuously on the initial condition.

The proof of Proposition 26 uses Lemma 27 and similar techniques used to prove the well-posedness of the first-order homogeneous system (see Section 6 in [13]) by rewriting the second-order system as a first-order system and then applying standard Caratheodory ODE results.

Lemma 27

For any φE𝒦SEE\varphi^{E}\in\mathcal{K}_{S_{E}}^{E}, φA𝒦SAA\varphi^{A}\in\mathcal{K}_{S_{A}}^{A}, the function

F[φEA](𝒙,𝒙˙,sE,sA):=φE(||𝒙||,sE)𝒙+φA(||𝒙||,sA)𝒙˙,F[\varphi^{EA}](\boldsymbol{x},\dot{\boldsymbol{x}},s^{E},s^{A}):=\varphi^{E}(||\boldsymbol{x}||,s^{E})\boldsymbol{x}+\varphi^{A}(||\boldsymbol{x}||,s^{A})\dot{\boldsymbol{x}},

for \boldsymbol{x},\dot{\boldsymbol{x}}\in\mathbb{R}^{d}, is Lipschitz continuous on \mathbb{R}^{2d+p^{E}+p^{A}}, where p^{E},p^{A} are the dimensions of the ranges of the functions s^{E},s^{A}, respectively. Additionally, for any \varphi^{\xi}\in\mathcal{K}_{S_{\xi}}^{\xi}, the function

F[φξ](𝒙,sξ,ξ):=φξ(||𝒙||,sξ)ξF[\varphi^{\xi}](\boldsymbol{x},s^{\xi},\xi):=\varphi^{\xi}(||\boldsymbol{x}||,s^{\xi})\xi

is Lipschitz continuous on d+1+pξ\mathbb{R}^{d+1+p^{\xi}}, where pξp^{\xi} is the dimension of the range of sξs^{\xi}.

D.2 Properties of measures

In this section we state and prove some technical properties of the measures described in Section 5.1.

Lemma 28

Suppose each of the interaction kernels lies in the respective admissible space, namely, {\bm{\phi}}^{E}\in\boldsymbol{\mathcal{K}}_{S_{E}}^{E}, {\bm{\phi}}^{A}\in\boldsymbol{\mathcal{K}}_{S_{A}}^{A}, {\bm{\phi}}^{\xi}\in\boldsymbol{\mathcal{K}}_{S_{\xi}}^{\xi}. Then, for each (k,k^{\prime}), the measures \rho_{T}^{EA,k,k^{\prime}},\rho_{T}^{EA,L,k,k^{\prime}} and \rho_{T}^{\xi,k,k^{\prime}},\rho_{T}^{\xi,L,k,k^{\prime}} defined in Section 5.1 are regular Borel probability measures. Furthermore, if \bm{\mu^{\boldsymbol{Y}}} is absolutely continuous with respect to the Lebesgue measure, then for each (k,k^{\prime}) the measures \rho_{T}^{EA,k,k^{\prime}},\rho_{T}^{EA,L,k,k^{\prime}},\rho_{T}^{\xi,k,k^{\prime}},\rho_{T}^{\xi,L,k,k^{\prime}} are absolutely continuous with respect to the Lebesgue measure. Consequently, the direct-sum measures \bm{\rho}_{T}^{EA},\bm{\rho}_{T}^{\xi},\bm{\rho}_{T}^{EA,L},\bm{\rho}_{T}^{\xi,L} are Borel regular and, under the absolute continuity of \bm{\mu^{\boldsymbol{Y}}}, absolutely continuous with respect to the Lebesgue measure.

Proposition 29

Suppose the distribution 𝛍𝐘\bm{\mu^{\boldsymbol{Y}}} of the initial condition is compactly supported. Then for each (k,k)(k,k^{\prime}), the support of the measures ρTEA,kk,ρTξ,kk\rho_{T}^{EA,kk^{\prime}},\rho_{T}^{\xi,kk^{\prime}} (and therefore ρTEA,L,kk,ρTξ,L,kk\rho_{T}^{EA,L,kk^{\prime}},\rho_{T}^{\xi,L,kk^{\prime}}) is also compact.

Proof  The compact support of the variables r,\dot{r},\xi and of the feature maps follows from the global well-posedness of the system in finite time, together with the Lipschitz assumptions on the non-collective forces. This compact support over a fixed, finite time interval is what is claimed in Proposition 29.
The main point is that, by making reasonable assumptions on the non-collective forces, feature maps, interaction kernels, and time interval, together with the assumption that the agents' initial conditions cannot be arbitrarily far apart, we can ensure that the pairwise distances, velocities, and \xi values remain controlled. Thus, given enough trajectories, the measures of Section 5.1 will be well approximated by their discretized versions obtained from the numerical approach described in Section 9. In other words, with a reasonable number of trajectories we can examine the set of pairwise distances, velocities, etc. that the agents explore, and bin them to set the support of the interaction kernels. Explicit values for the constants claimed in the proposition depend on the properties of the non-collective forces, the support and sup-norm of the interaction kernels, the time interval T, and the number of agents.
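A minimal Python sketch of the binning step just described: from observed positions we estimate the empirical range of the pairwise distance r and a histogram approximating its distribution, which can then be used to set the support of the estimated interaction kernels. The array shapes, function names, and toy data below are assumptions made only for illustration.

```python
import numpy as np

def empirical_range_and_histogram(trajectories, n_bins=50):
    """Estimate the empirical support [R^min, R^max] of the pairwise distance r and a
    binned approximation of its distribution from observed positions.
    `trajectories` has shape (M, L, N, d): M trajectories, L snapshots, N agents in R^d."""
    M, L, N, d = trajectories.shape
    X = trajectories.reshape(M * L, N, d)
    diff = X[:, :, None, :] - X[:, None, :, :]      # pairwise differences x_i - x_{i'}
    r = np.linalg.norm(diff, axis=-1)               # shape (M*L, N, N)
    mask = ~np.eye(N, dtype=bool)                   # drop the i = i' pairs
    r = r[:, mask].ravel()
    hist, edges = np.histogram(r, bins=n_bins, density=True)
    return r.min(), r.max(), hist, edges

# Toy (assumed) data just to exercise the routine.
rng = np.random.default_rng(0)
rmin, rmax, hist, edges = empirical_range_and_histogram(rng.normal(size=(3, 20, 8, 2)))
print(f"empirical support of r: [{rmin:.3f}, {rmax:.3f}]")
```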

E Background results

In this section, for the convenience of the reader, we gather a few of the technical tools used in the analysis of the system. These are fundamental results necessary for developing the trajectory prediction, the measure support, and the existence and uniqueness results. We also include some of the necessary results on covering numbers of function spaces used for the learning theory.

The first theorem we present is an iterated Grönwall type result that allows us to analyze the trajectory error of the full system 𝒀(t)\boldsymbol{Y}(t).

Theorem 30

Let u(t),a(t),u(t),a(t), and b(t)b(t) be nonnegative continuous functions in J=[α,β],J=[\alpha,\beta], and suppose that

u(t)\displaystyle u(t)\leq a(t)+b(t)[αtk1(t,t1)u(t1)dt1+\displaystyle a(t)+b(t)\left[\int_{\alpha}^{t}k_{1}\left(t,t_{1}\right)u\left(t_{1}\right)dt_{1}+\cdots\right.
+αt(αt1(αtn1kn(t,t1,,tn)u(tn)dtn))dt1]\displaystyle\left.+\int_{\alpha}^{t}\left(\int_{\alpha}^{t_{1}}\cdots\left(\int_{\alpha}^{t_{n-1}}k_{n}\left(t,t_{1},\ldots,t_{n}\right)u\left(t_{n}\right)dt_{n}\right)\cdots\right)dt_{1}\right]

for all tJ,t\in J, where ki(t,t1,,ti)k_{i}\left(t,t_{1},\ldots,t_{i}\right) are nonnegative continuous functions in Ji+1,i=J_{i+1},i= 1,2,,n,1,2,\ldots,n, which are nondecreasing in tJt\in J for all fixed (t1,,ti)Ji,i=\left(t_{1},\ldots,t_{i}\right)\in J_{i},i= 1,2,,n.1,2,\ldots,n. Then, for all tJt\in J

u(t)a(t)+b(t)αtR^[a](t,s)exp(stR^[b](t,τ)dτ)dsu(t)\leq a(t)+b(t)\int_{\alpha}^{t}\widehat{R}[a](t,s)\exp\left(\int_{s}^{t}\widehat{R}[b](t,\tau)d\tau\right)ds

where, for all (t,s)J2(t,s)\in J_{2}

\displaystyle\widehat{R}[w](t,s)=k_{1}(t,s)w(s)+\int_{\alpha}^{s}k_{2}\left(t,s,t_{2}\right)w\left(t_{2}\right)dt_{2}
\displaystyle+\sum_{i=3}^{n}\int_{\alpha}^{s}\left(\int_{\alpha}^{t_{2}}\cdots\left(\int_{\alpha}^{t_{i-1}}k_{i}\left(t,s,t_{2},\ldots,t_{i}\right)w\left(t_{i}\right)dt_{i}\right)\cdots\right)dt_{2}

for each continuous function w(t)w(t) in JJ.

Proof  See [22].  
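For n = 1 and constant a, b, k_{1}, Theorem 30 reduces to the classical Grönwall bound u(t)\leq a\,e^{bk(t-\alpha)}. The following minimal Python sketch illustrates both the hypothesis and the resulting bound; the test function u is an assumed example chosen to satisfy the integral inequality.

```python
import numpy as np

# Numerical check of the n = 1 (classical) case of Theorem 30 with constant data:
# if u(t) <= a + b * int_alpha^t k * u(s) ds, then u(t) <= a * exp(b * k * (t - alpha)).
alpha, beta = 0.0, 2.0
a, b, k = 1.0, 0.8, 1.5
t = np.linspace(alpha, beta, 2001)
u = a * np.exp(0.5 * b * k * (t - alpha))          # assumed test function

# Cumulative trapezoid approximation of int_alpha^t u(s) ds.
cumint = np.concatenate(([0.0], np.cumsum(0.5 * (u[1:] + u[:-1]) * np.diff(t))))
hypothesis = u <= a + b * k * cumint + 1e-9                 # the assumed integral inequality
conclusion = u <= a * np.exp(b * k * (t - alpha)) + 1e-9    # the Gronwall bound
print("hypothesis holds:", hypothesis.all(), " bound holds:", conclusion.all())
```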

F Additional comments on first-order models and theory

Our second-order model formulation covers the first-order equations of [51, 52] as a special case. When F𝒙˙(𝒙i,𝒙˙i,ξi)=νi𝒙˙i+𝐅𝒙(𝒙i,ξi)F^{\dot{\boldsymbol{x}}}(\boldsymbol{x}_{i},\dot{\boldsymbol{x}}_{i},\xi_{i})=-\nu_{i}\dot{\boldsymbol{x}}_{i}+\mathbf{F}^{\boldsymbol{x}}(\boldsymbol{x}_{i},\xi_{i}) for some constant νi>0\nu_{i}>0, Fξ(𝒙i,𝒙˙i,ξi)=Fξ(𝒙i,ξi)F^{\xi}(\boldsymbol{x}_{i},\dot{\boldsymbol{x}}_{i},\xi_{i})=F^{\xi}(\boldsymbol{x}_{i},\xi_{i}), ϕA𝓀𝒾𝓀𝒾0\phi^{A}_{\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}}\equiv 0 for all k,k=1,,Kk,k^{\prime}=1,\ldots,K, and mi1m_{i}\ll 1, (2.2) becomes,

{νi𝒙˙i=𝐅𝒙(𝒙i,ξi)+i=1N1N𝓀𝒾ϕE𝓀𝒾𝓀𝒾(rii,𝒔Eii)(𝒙i𝒙i)ξ˙i=Fξ(𝒙i,ξi)+i=1N1N𝓀𝒾ϕξ𝓀𝒾,𝓀𝒾(rii,𝒔ξii)(ξiξi)\begin{dcases}\nu_{i}\dot{\boldsymbol{x}}_{i}&=\mathbf{F}^{\boldsymbol{x}}(\boldsymbol{x}_{i},\xi_{i})+\sum_{i^{\prime}=1}^{N}\frac{1}{N_{\mathpzc{k}_{i^{\prime}}}}\phi^{E}_{\mathpzc{k}_{i}\mathpzc{k}_{i^{\prime}}}(r_{ii^{\prime}},\boldsymbol{s}^{E}_{ii^{\prime}})(\boldsymbol{x}_{i^{\prime}}-\boldsymbol{x}_{i})\\ \dot{\xi}_{i}&=F^{\xi}(\boldsymbol{x}_{i},\xi_{i})+\sum_{i^{\prime}=1}^{N}\frac{1}{N_{\mathpzc{k}_{i^{\prime}}}}\phi^{\xi}_{\mathpzc{k}_{i},\mathpzc{k}_{i^{\prime}}}(r_{ii^{\prime}},\boldsymbol{s}^{\xi}_{ii^{\prime}})(\xi_{i^{\prime}}-\xi_{i})\end{dcases}

It extends the first-order models considered in [52, 51, 83] by adding non-collective forces, 𝐅𝒙,Fξ\mathbf{F}^{\boldsymbol{x}},F^{\xi}, multi-dimensional interaction kernels, ϕEk,k,ϕξk,k\phi^{E}_{k,k^{\prime}},\phi^{\xi}_{k,k^{\prime}}, and auxiliary variables, ξi\xi_{i}.

The first-order theory considered in [52, 51] focused on the learnability of functions of the form \phi^{E}(r)r, which is the special case \phi^{A}(r)\equiv 0 of the functions of the form \phi^{E}(r)r+\phi^{A}(r)\dot{r} studied in our second-order theory.

G Additional performance measures

For measures related to learning the $\xi$-based interaction kernels, we take

\begin{dcases}\delta_{i,i^{\prime},t}^{\xi}(r,\boldsymbol{s}^{\xi},\xi)&:=\delta_{r_{ii^{\prime}}(t),\boldsymbol{s}^{\xi}_{ii^{\prime}}(t),\xi_{ii^{\prime}}(t)}(r,\boldsymbol{s}^{\xi},\xi)\\ \delta_{i,i^{\prime},t,m}^{\xi}(r,\boldsymbol{s}^{\xi},\xi)&:=\delta_{r_{ii^{\prime}}^{(m)}(t),\boldsymbol{s}^{\xi,(m)}_{ii^{\prime}}(t),\xi_{ii^{\prime}}^{(m)}(t)}(r,\boldsymbol{s}^{\xi},\xi)\end{dcases}

Then, the measures are given by

\rho_{T}^{\xi,k,k^{\prime}}(r,\boldsymbol{s}^{\xi},\xi):=\mathbb{E}_{\boldsymbol{Y}_{0}\sim\bm{\mu^{\boldsymbol{Y}}}}\frac{1}{TN_{kk^{\prime}}}\int_{t=0}^{T}\sum_{\begin{subarray}{c}i\in C_{k},i^{\prime}\in C_{k^{\prime}}\\ i\neq i^{\prime}\end{subarray}}\delta^{\xi}_{ii^{\prime},t}(r,\boldsymbol{s}^{\xi},\xi)\,dt (G.1)

and similarly for $\rho_{T}^{\xi,L,k,k^{\prime}}(r,\boldsymbol{s}^{\xi},\xi)$ and $\rho_{T}^{\xi,L,M,k,k^{\prime}}(r,\boldsymbol{s}^{\xi},\xi)$. As before, $\rho_{T}^{\xi,k,k^{\prime}}$ and its time-discretized version, $\rho_{T}^{\xi,L,k,k^{\prime}}$, are used only in the theoretical analysis, whereas the empirical $\rho_{T}^{\xi,L,M,k,k^{\prime}}$ is used in the actual algorithm. For ease of notation, we consider direct sums of the measures for the phase variable:

\bm{\rho}_{T}^{\xi,L}=\bigoplus_{k,k^{\prime}=1,1}^{K,K}\rho_{T}^{\xi,L,kk^{\prime}},\quad\bm{\rho}_{T}^{\xi}=\bigoplus_{k,k^{\prime}=1,1}^{K,K}\rho_{T}^{\xi,kk^{\prime}},\quad\bm{L}^{2}\left(\bm{\rho}_{T}^{\xi,L}\right)=\bigoplus_{k,k^{\prime}=1,1}^{K,K}L^{2}\left(\rho_{T}^{\xi,L,kk^{\prime}}\right) (G.2)

Lastly, for the $\xi$-based interaction kernels, i.e., $\widehat{\phi}^{\xi}_{kk^{\prime}}$ versus $\phi^{\xi}_{kk^{\prime}}$, we consider the following norm,

\left\|\widehat{\phi}^{\xi}_{kk^{\prime}}-\phi^{\xi}_{kk^{\prime}}\right\|_{L^{2}(\rho_{T}^{\xi,k,k^{\prime}})}^{2}=\int_{r}\int_{\xi}\int_{\boldsymbol{s}^{\xi}}\left(\widehat{\phi}^{\xi}_{kk^{\prime}}(r,\xi,\boldsymbol{s}^{\xi})-\phi^{\xi}_{kk^{\prime}}(r,\xi,\boldsymbol{s}^{\xi})\right)^{2}\xi^{2}\,d\rho_{T}^{\xi,k,k^{\prime}}(r,\xi,\boldsymbol{s}^{\xi}). (G.3)
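In practice, the norm (G.3) is evaluated with respect to the empirical measure $\rho_{T}^{\xi,L,M,k,k^{\prime}}$, i.e., as an average over the observed pairwise data. The following is a minimal sketch of such an evaluation (ours, not the paper's code), assuming the empirical measure is represented by arrays of pairwise samples $(r_{ii^{\prime}}(t_{l}),\boldsymbol{s}^{\xi}_{ii^{\prime}}(t_{l}),\xi_{ii^{\prime}}(t_{l}))$; the kernel callables and array names are hypothetical.

import numpy as np

def empirical_xi_kernel_error(phi_hat, phi_true, r, s_xi, xi):
    """Monte Carlo approximation of the weighted L^2(rho) error in (G.3).

    phi_hat, phi_true : callables (r, xi, s_xi) -> kernel values
    r, s_xi, xi       : 1-D arrays of pairwise samples collected from the observed
                        trajectories, representing rho_T^{xi,L,M,k,k'}
    """
    diff = phi_hat(r, xi, s_xi) - phi_true(r, xi, s_xi)
    # the xi**2 weight mirrors the xi^2 factor inside the integral in (G.3)
    return np.mean(diff**2 * xi**2)

# Example usage with hypothetical kernels on synthetic pairwise data:
rng = np.random.default_rng(0)
r, s_xi, xi = rng.uniform(0.1, 2.0, 1000), rng.normal(size=1000), rng.normal(size=1000)
phi_true = lambda r, xi, s: np.exp(-r)            # hypothetical true kernel
phi_hat  = lambda r, xi, s: np.exp(-r) + 0.01     # hypothetical estimator
print(empirical_xi_kernel_error(phi_hat, phi_true, r, s_xi, xi))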

H Numerical algorithm

In this section, we will detail the construction of the linear systems used to learn $\vec{\alpha}^{EA}$ and $\vec{\alpha}^{\xi}$.

We start with the procedure for solving for $\vec{\alpha}^{EA}$. First, we build the basis functions for the finite-dimensional subspace $\widehat{\mathcal{H}}^{EA}\subset\mathcal{H}^{EA}$.

Remark 31

The support of the unknown interaction kernels is not assumed to be known. We build our finite-dimensional subspaces, $\widehat{\mathcal{H}}^{EA}$ for example, based on the empirical observation data. For the support-detection capability of our estimators, see the examples of opinion dynamics in [52, 83].

We use a tensor grid of basis functions, i.e., a tensor product of basis functions in each dimension of $[R_{kk^{\prime}}^{\min,L,M},R_{kk^{\prime}}^{\max,L,M}]\times\mathbb{S}^{E,L,M}_{kk^{\prime}}$ or $[R_{kk^{\prime}}^{\min,L,M},R_{kk^{\prime}}^{\max,L,M}]\times\mathbb{S}^{A,L,M}_{kk^{\prime}}$, where $[R_{kk^{\prime}}^{\min,L,M},R_{kk^{\prime}}^{\max,L,M}]$ is the empirical range of $r$ given by the observation data, and similarly $\mathbb{S}^{E,L,M}_{kk^{\prime}}$ and $\mathbb{S}^{A,L,M}_{kk^{\prime}}$ are the empirical ranges of $\boldsymbol{s}^{E}_{kk^{\prime}}$ and $\boldsymbol{s}^{A}_{kk^{\prime}}$ given by the observation, respectively. In each dimension of $[R_{kk^{\prime}}^{\min,L,M},R_{kk^{\prime}}^{\max,L,M}]\times\mathbb{S}^{E,L,M}_{kk^{\prime}}$ or $[R_{kk^{\prime}}^{\min,L,M},R_{kk^{\prime}}^{\max,L,M}]\times\mathbb{S}^{A,L,M}_{kk^{\prime}}$ (a mixture of basis types across dimensions is possible; the algorithm does not require the basis functions in each dimension to be of the same kind, and we make this assumption only for simplicity), the basis functions are built as piecewise standard polynomials (or other functions, such as clamped B-splines, Fourier bases, etc.) on a uniform partition, with the number of basis functions in dimension $j$ being $n^{E,j}_{kk^{\prime}}$ or $n^{A,j}_{kk^{\prime}}$. Hence $n^{E}_{kk^{\prime}}=\prod_{j=1}^{1+p^{E}_{k,k^{\prime}}}n^{E,j}_{kk^{\prime}}$ and $n^{A}_{kk^{\prime}}=\prod_{j=1}^{1+p^{A}_{k,k^{\prime}}}n^{A,j}_{kk^{\prime}}$. Then, we assemble $\vec{d}^{EA,(m)}$ as follows,

\vec{d}^{EA,(m)}=\begin{bmatrix}\frac{1}{\sqrt{N_{\mathpzc{k}_{1}}}}\ddot{\boldsymbol{x}}_{1}(t_{1})\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{N}}}}\ddot{\boldsymbol{x}}_{N}(t_{1})\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{1}}}}\ddot{\boldsymbol{x}}_{1}(t_{L})\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{N}}}}\ddot{\boldsymbol{x}}_{N}(t_{L})\end{bmatrix}.

If $\ddot{\boldsymbol{x}}_{i}(t_{l})$ is not given, a finite difference scheme applied to $\boldsymbol{x}_{i}(t_{l})$ or $\dot{\boldsymbol{x}}_{i}(t_{l})$ is used to approximate $\ddot{\boldsymbol{x}}_{i}(t_{l})$; a minimal sketch of such an approximation follows the next display. Next, we build $\vec{f}^{EA,(m)}$ as follows,

\vec{f}^{EA,(m)}=\begin{bmatrix}\frac{1}{\sqrt{N_{\mathpzc{k}_{1}}}}F^{\dot{\boldsymbol{x}}}(\boldsymbol{x}_{1}(t_{1}),\dot{\boldsymbol{x}}_{1}(t_{1}),\xi_{1}(t_{1}))\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{N}}}}F^{\dot{\boldsymbol{x}}}(\boldsymbol{x}_{N}(t_{1}),\dot{\boldsymbol{x}}_{N}(t_{1}),\xi_{N}(t_{1}))\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{1}}}}F^{\dot{\boldsymbol{x}}}(\boldsymbol{x}_{1}(t_{L}),\dot{\boldsymbol{x}}_{1}(t_{L}),\xi_{1}(t_{L}))\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{N}}}}F^{\dot{\boldsymbol{x}}}(\boldsymbol{x}_{N}(t_{L}),\dot{\boldsymbol{x}}_{N}(t_{L}),\xi_{N}(t_{L}))\end{bmatrix}.
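As noted above, when accelerations are not observed they can be approximated by finite differences. Below is a minimal sketch of this step (ours, not the paper's implementation), assuming equally spaced observation times and that the velocities $\dot{\boldsymbol{x}}_{i}(t_{l})$ are available; the array names are hypothetical.

import numpy as np

def approx_accelerations(v, dt):
    """Approximate accelerations from velocity snapshots.

    v  : array of shape (L, N, d) -- velocities of N agents in d dimensions at L equispaced times
    dt : time step between consecutive observations
    Returns an array of shape (L, N, d): central differences in the interior and
    first-order one-sided differences at the endpoints (consistent with the O(T/L)
    error term discussed below when only first-order differences are used).
    """
    a = np.empty_like(v)
    a[1:-1] = (v[2:] - v[:-2]) / (2.0 * dt)   # second-order central differences
    a[0] = (v[1] - v[0]) / dt                  # first-order forward difference
    a[-1] = (v[-1] - v[-2]) / dt               # first-order backward difference
    return a

The entries of $\vec{d}^{EA,(m)}$ are then $\frac{1}{\sqrt{N_{\mathpzc{k}_{i}}}}$ times these approximate accelerations, stacked over $l$ and $i$ as in the display above.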

Next we form the learning matrix $\Psi^{EA,(m)}\in\mathbb{R}^{LNd\times n}$, with $n=n^{E}+n^{A}$. It is the concatenation of two sub-matrices, $\Psi^{E,(m)}$ and $\Psi^{A,(m)}$, i.e.,

\Psi^{EA,(m)}=\begin{bmatrix}\Psi^{E,(m)}&\Psi^{A,(m)}\end{bmatrix}.

For the energy-based learning matrix, $\Psi^{E,(m)}$, we use a lexicographic order on $(k,k^{\prime})$ for $k,k^{\prime}=1,\ldots,K$. We define $n^{E}_{k,k^{\prime},\text{prev}}=\sum_{(k_{1},k_{2})<(k,k^{\prime})}n^{E}_{k_{1},k_{2}}$; if $(k,k^{\prime})=(1,1)$, we take $n^{E}_{1,1,\text{prev}}=0$. Then, for $\eta_{kk^{\prime}}^{E}=1,\ldots,n^{E}_{k,k^{\prime}}$, $\Psi^{E,(m)}$ is given as follows,

\Psi^{E,(m)}(li(1:d),\eta_{kk^{\prime}}^{E})=\sum_{i^{\prime}\in C_{k^{\prime}}}\frac{1}{\sqrt{N_{\mathpzc{k}_{i}}}}\psi^{\boldsymbol{x}}_{k,k^{\prime},\eta_{kk^{\prime}}^{E}}(\left\|\boldsymbol{x}_{i^{\prime}}(t_{l})-\boldsymbol{x}_{i}(t_{l})\right\|,\boldsymbol{s}^{E}_{i,i^{\prime}}(t_{l}))(\boldsymbol{x}_{i^{\prime}}(t_{l})-\boldsymbol{x}_{i}(t_{l})),\quad i\in C_{k},

and for $l=1,\ldots,L$. A similar construction yields $\Psi^{A,(m)}$. We then define

A^{EA,(m)}=(\Psi^{EA,(m)})^{T}\Psi^{EA,(m)}\quad\text{and}\quad\vec{b}^{EA,(m)}=(\Psi^{EA,(m)})^{T}(\vec{d}^{EA,(m)}-\vec{f}^{EA,(m)}).
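As a concrete illustration of the assembly just described, here is a minimal sketch (ours, not the paper's implementation) for a single trajectory, a single type ($K=1$), and energy-based kernels depending only on $r$ (no $\boldsymbol{s}^{E}$ variables); the basis callables are hypothetical placeholders for the tensor-grid basis built above.

import numpy as np

def assemble_normal_equations(x, d_vec, f_vec, basis):
    """Assemble Psi^{E,(m)}, A^{EA,(m)}, b^{EA,(m)} for one trajectory, K = 1, kernels of r only.

    x     : array (L, N, dim) of agent positions at the L observation times
    d_vec : array (L*N*dim,) -- stacked (1/sqrt(N)) * accelerations, as in d^{EA,(m)}
    f_vec : array (L*N*dim,) -- stacked (1/sqrt(N)) * non-collective forces, as in f^{EA,(m)}
    basis : list of callables psi_eta(r), the energy basis functions
    """
    L, N, dim = x.shape
    n = len(basis)
    Psi = np.zeros((L * N * dim, n))
    for l in range(L):
        for i in range(N):
            rows = slice((l * N + i) * dim, (l * N + i) * dim + dim)
            for eta, psi in enumerate(basis):
                col = np.zeros(dim)
                for ip in range(N):
                    if ip == i:
                        continue
                    diff = x[l, ip] - x[l, i]
                    col += psi(np.linalg.norm(diff)) * diff
                # 1/sqrt(N) mirrors the 1/sqrt(N_{k_i}) factor in the Psi^{E,(m)} entry above
                Psi[rows, eta] = col / np.sqrt(N)
    A = Psi.T @ Psi                     # A^{EA,(m)}
    b = Psi.T @ (d_vec - f_vec)         # b^{EA,(m)}
    return Psi, A, b

This mirrors the row layout $\Psi^{E,(m)}(li(1:d),\eta^{E}_{kk^{\prime}})$ above; multiple types and the $\boldsymbol{s}^{E}$ variables change only the basis arguments and the column bookkeeping via $n^{E}_{k,k^{\prime},\text{prev}}$.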

And lastly,

A^{EA}=\frac{1}{LM}\sum_{m=1}^{M}A^{EA,(m)}\quad\text{and}\quad\vec{b}^{EA}=\frac{1}{LM}\sum_{m=1}^{M}\vec{b}^{EA,(m)}.

Then $\vec{\alpha}^{EA}=\begin{bmatrix}(\vec{\alpha}^{E})^{T}&(\vec{\alpha}^{A})^{T}\end{bmatrix}^{T}$ is obtained by solving

A^{EA}\vec{\alpha}^{EA}=\vec{b}^{EA}.
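A minimal sketch of this averaging and solve step follows (again ours, not the paper's code); whether one uses a direct solve, a pseudoinverse, or an explicitly regularized least-squares solve when $A^{EA}$ is ill-conditioned is an implementation choice not prescribed here.

import numpy as np

def solve_for_alpha(A_list, b_list, L):
    """Average the per-trajectory normal equations and solve for alpha^{EA}.

    A_list : list of M matrices A^{EA,(m)}, each n x n
    b_list : list of M vectors  b^{EA,(m)}, each of length n
    L      : number of observation times per trajectory
    """
    M = len(A_list)
    A = sum(A_list) / (L * M)   # A^{EA}
    b = sum(b_list) / (L * M)   # b^{EA}
    # lstsq returns a minimum-norm least-squares solution even if A^{EA} is (near-)singular
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha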

Then, we assemble

\widehat{\phi}^{E}_{kk^{\prime}}=\sum_{\eta_{kk^{\prime}}^{E}=1}^{n^{E}_{k,k^{\prime}}}\alpha_{k,k^{\prime},\eta_{kk^{\prime}}^{E}}^{E}\psi^{\boldsymbol{x}}_{k,k^{\prime},\eta_{kk^{\prime}}^{E}}.
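In code, this assembly step simply returns the coefficient-weighted sum of the basis functions; a brief sketch for one $(k,k^{\prime})$ pair, with hypothetical names, is

def make_phi_hat(alpha_E, basis):
    """Return phi_hat^E as a callable: the coefficient-weighted sum of the basis functions.

    alpha_E : sequence of n^E learned coefficients for one (k, k') pair
    basis   : list of n^E callables psi_eta(r) (or psi_eta(r, s^E) in the multivariable case)
    """
    def phi_hat(*args):
        return sum(a * psi(*args) for a, psi in zip(alpha_E, basis))
    return phi_hat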

A similar assembly from $\vec{\alpha}^{A}$ yields $\widehat{\phi}^{A}_{kk^{\prime}}$. When a finite difference approximation is used for the second derivatives of $\boldsymbol{x}_{i}$, we end up with

A^{EA}\vec{\alpha}^{EA}=\vec{b}^{EA}+\vec{\zeta},

where $\vec{\zeta}=\mathcal{O}(\frac{T}{L})$ when a first-order finite difference scheme is used.

Next, for $\vec{\alpha}^{\xi}$, we build the basis functions for the finite-dimensional subspace $\widehat{\mathcal{H}}^{\xi}_{kk^{\prime}}\subset\mathcal{H}^{\xi}_{kk^{\prime}}$. We again use a tensor grid of basis functions, i.e., a tensor product of basis functions in each dimension of $[R_{kk^{\prime}}^{\min,L,M},R_{kk^{\prime}}^{\max,L,M}]\times\mathbb{S}^{\xi,L,M}_{kk^{\prime}}$, where $[R_{kk^{\prime}}^{\min,L,M},R_{kk^{\prime}}^{\max,L,M}]$ is the empirical range of $r$ given by the observation data, and similarly $\mathbb{S}^{\xi,L,M}_{kk^{\prime}}$ is the empirical range of $\boldsymbol{s}^{\xi}_{kk^{\prime}}$ given by the observation. In each dimension of $[R_{kk^{\prime}}^{\min,L,M},R_{kk^{\prime}}^{\max,L,M}]\times\mathbb{S}^{\xi,L,M}_{kk^{\prime}}$, the basis functions are built as piecewise standard polynomials (or other functions, such as clamped B-splines, Fourier bases, etc.) on a uniform partition, with the number of basis functions in dimension $j$ being $n^{\xi,j}_{kk^{\prime}}$. Hence $n^{\xi}_{kk^{\prime}}=\prod_{j=1}^{1+p^{\xi}_{k,k^{\prime}}}n^{\xi,j}_{kk^{\prime}}$. We let

\vec{d}^{\xi,(m)}:=\begin{bmatrix}\frac{1}{\sqrt{N_{\mathpzc{k}_{1}}}}\dot{\xi}_{1}(t_{1})\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{N}}}}\dot{\xi}_{N}(t_{1})\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{1}}}}\dot{\xi}_{1}(t_{L})\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{N}}}}\dot{\xi}_{N}(t_{L})\end{bmatrix}\,,\quad\vec{f}^{\xi,(m)}:=\begin{bmatrix}\frac{1}{\sqrt{N_{\mathpzc{k}_{1}}}}F^{\xi}(\boldsymbol{x}_{1}(t_{1}),\dot{\boldsymbol{x}}_{1}(t_{1}),\xi_{1}(t_{1}))\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{N}}}}F^{\xi}(\boldsymbol{x}_{N}(t_{1}),\dot{\boldsymbol{x}}_{N}(t_{1}),\xi_{N}(t_{1}))\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{1}}}}F^{\xi}(\boldsymbol{x}_{1}(t_{L}),\dot{\boldsymbol{x}}_{1}(t_{L}),\xi_{1}(t_{L}))\\ \vdots\\ \frac{1}{\sqrt{N_{\mathpzc{k}_{N}}}}F^{\xi}(\boldsymbol{x}_{N}(t_{L}),\dot{\boldsymbol{x}}_{N}(t_{L}),\xi_{N}(t_{L}))\end{bmatrix},

and

\Psi^{\xi,(m)}(li(1:d),\eta_{kk^{\prime}}^{\xi})=\sum_{i^{\prime}\in C_{k^{\prime}}}\frac{1}{\sqrt{N_{\mathpzc{k}_{i}}}}\psi^{\xi}_{k,k^{\prime},\eta_{kk^{\prime}}^{\xi}}(\left\|\boldsymbol{x}_{i^{\prime}}(t_{l})-\boldsymbol{x}_{i}(t_{l})\right\|,\boldsymbol{s}^{\xi}_{i,i^{\prime}}(t_{l}))(\xi_{i^{\prime}}(t_{l})-\xi_{i}(t_{l})),\quad i\in C_{k},

for $l=1,\ldots,L$. Finally, we define

A^{\xi,(m)}=(\Psi^{\xi,(m)})^{T}\Psi^{\xi,(m)}\quad\text{and}\quad\vec{b}^{\xi,(m)}=(\Psi^{\xi,(m)})^{T}(\vec{d}^{\xi,(m)}-\vec{f}^{\xi,(m)}).

and

A^{\xi}=\frac{1}{LM}\sum_{m=1}^{M}A^{\xi,(m)}\quad\text{and}\quad\vec{b}^{\xi}=\frac{1}{LM}\sum_{m=1}^{M}\vec{b}^{\xi,(m)}.

Thus, $\vec{\alpha}^{\xi}$ is obtained by solving

A^{\xi}\vec{\alpha}^{\xi}=\vec{b}^{\xi}.

Then, we assemble

\widehat{\phi}^{\xi}_{kk^{\prime}}=\sum_{\eta_{kk^{\prime}}^{\xi}=1}^{n^{\xi}_{k,k^{\prime}}}\alpha_{k,k^{\prime},\eta_{kk^{\prime}}^{\xi}}^{\xi}\psi^{\xi}_{k,k^{\prime},\eta_{kk^{\prime}}^{\xi}}.

References

  • [1] N. Abaid and M. Porfiri, Fish in a ring: Spatio-temporal pattern formation in one-dimensional animal groups, Journal of the Royal Society Interface, 7 (2010), pp. 1441–1453.
  • [2] G. Albi, D. Balagué, J. A. Carrillo, and J. V. Brecht, Stability analysis of flock and mill rings for second order models in swarming, SIAM Journal on Applied Mathematics, 74 (2014), pp. 794–818.
  • [3] M. Ballerini, N. Cabibbo, R. Candelier, A. Cavagna, E. Cisbani, I. Giardina, V. Lecomte, A. Orlandi, G. Parisi, A. Procaccini, M. Viale, and V. Zdravkovic, Interaction ruling animal collective behavior depends on topological rather than metric distance: Evidence from a field study, Proc Natl Acad Sci USA, 105 (2008), pp. 1232–1237.
  • [4] Y. Bar-Sinai, S. Hoyer, J. Hickey, and M. P. Brenner, Learning data-driven discretizations for partial differential equations, Proceedings of the National Academy of Sciences, 116 (2019), pp. 15344–15349.
  • [5] N. Bellomo, P. Degond, and E. Tadmor, eds., Active Particles, Volume 1, Springer International Publishing AG, Switzerland, 2017.
  • [6] T. Bertalan, F. Dietrich, I. Mezić, and I. G. Kevrekidis, On learning hamiltonian systems from data, Chaos: An Interdisciplinary Journal of Nonlinear Science, 29 (2019), p. 121107.
  • [7] R. Bhatia, Matrix Analysis, vol. 169, Springer, 1997.
  • [8] W. Bialek, A. Cavagna, I. Giardina, T. Mora, E. Silvestri, M. Viale, and A. M. Walczak, Statistical mechanics for natural flocks of birds, Proc Natl Acad Sci USA, 109 (2012), pp. 4786 – 4791.
  • [9] P. Binev, A. Cohen, W. Dahmen, R. DeVore, and V. Temlyakov, Universal algorithms for learning theory part i: piecewise constant functions, Journal of Machine Learning Research, 6 (2005), pp. 1297–1321.
  • [10] V. Blondel, J. Hendrickx, and J. Tsitsiklis, On Krause’s multi-agent consensus model with state-dependent connectivity, Automatic Control, IEEE Transactions on, 54 (2009), pp. 2586 – 2597.
  • [11] J. Bongard and H. Lipson, Automated reverse engineering of nonlinear dynamical systems, Proceedings of the National Academy of Sciences, 104 (2007), pp. 9943–9948.
  • [12] J. Bongard and H. Lipson, Automated reverse engineering of nonlinear dynamical systems, Proceedings of the National Academy of Sciences of the United States of America, 104 (2007), pp. 9943–9948.
  • [13] M. Bongini, M. Fornasier, M. Hansen, and M. Maggioni, Inferring interaction rules from observations of evolutive systems I: The variational approach, Math Mod Methods Appl Sci, 27 (2017), pp. 909–951.
  • [14] N. Brunel, Parameter estimation of ODEs via nonparametric estimators, Electronic Journal of Statistics, 2 (2008), pp. 1242–1267.
  • [15] S. Brunton, N. Kutz, and J. Proctor, Data-driven discovery of governing physical laws, SIAM News, 50 (2017).
  • [16] S. Brunton, J. Proctor, and J. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences of the United States of America, 113 (2016), pp. 3932–3937.
  • [17] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences, 113 (2016), pp. 3932–3937.
  • [18] J. Cao, L. Wang, and J. Xu, Robust estimation for ordinary differential equation models, Biometrics, 67 (2011), pp. 1305–1313.
  • [19] G. Carleo and M. Troyer, Solving the quantum many-body problem with artificial neural networks, Science, 355 (2017), pp. 602–606.
  • [20] K. Champion, B. Lusch, J. N. Kutz, and S. L. Brunton, Data-driven discovery of coordinates and governing equations, Proceedings of the National Academy of Sciences, 116 (2019), pp. 22445–22451.
  • [21] H. J. Chiel, J. P. Gill, J. M. McManus, and K. M. Shaw, Learning biology by recreating and extending mathematical models, Science, 336 (2012), pp. 993–994.
  • [22] Y. Cho, S. Sever, and Y.-H. Kim, On some Gronwall type inequalities with iterated integrals, Mathematical Communications, 12 (2007), pp. 63–73.
  • [23] Y. Chuang, Y. Huang, M. D’Orsogna, and A. Bertozzi, Multi-vehicle flocking: scalability of cooperative control algorithms using pairwise potentials, IEEE International Conference on Robotics and Automation, (2007), pp. 2292 – 2299.
  • [24] Y.-L. Chuang, T. Chou, and M. R. D’Orsogna, Swarming in viscous fluids: Three-dimensional patterns in swimmer- and force-induced flows, Physical Review E, 93 (2016), pp. 1–12.
  • [25] Y.-l. Chuang, M. R. D’Orsogna, D. Marthaler, A. L. Bertozzi, and L. S. Chayes, State transitions and the continuum limit for a 2D interacting, self-propelled particle system, Physica D: Nonlinear Phenomena, 232 (2007), pp. 33–47.
  • [26] A. C. Costa, T. Ahamed, and G. J. Stephens, Adaptive, locally linear models of complex dynamics, Proceedings of the National Academy of Sciences, 116 (2019), pp. 1501–1510.
  • [27] I. Couzin, J. Krause, N. Franks, and S. Levin, Effective leadership and decision-making in animal groups on the move, Nature, 433 (2005), pp. 513 – 516.
  • [28] F. Cucker and J.-G. Dong, A general collision-avoiding flocking framework, IEEE Trans. Automat. Control, 56 (2011), pp. 1124 – 1129.
  • [29] F. Cucker and E. Mordecki, Flocking in noisy environments, J. Math. Pures Appl., 89 (2008), pp. 278 – 296.
  • [30] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc, 39 (2002), pp. 1–49.
  • [31]  , On the mathematical foundations of learning, Bulletin of the American Mathematical Society, 39 (2002), pp. 1–49.
  • [32]  , Emergent behavior in flocks, IEEE Transactions on automatic control, 52 (2007), p. 852.
  • [33] W. Dahmen, R. DeVore, and K. Scherer, Multi-Dimensional Spline Approximation, SIAM J. Numer. Anal., 17 (1980), pp. 380–402.
  • [34] I. Dattner and C. Klaassen, Optimal rate of direct estimators in systems of ordinary differential equations linear in functions of the parameters, Electronic Journal of Statistics, 9 (2015), pp. 1939–1973.
  • [35] C. de Boor and R. DeVore, Approximation by Smooth Multivariate Splines, Transactions of the American Mathematical Society, 276 (1983), p. 775.
  • [36] R. DeVore, G. Kerkyacharian, D. Picard, and V. Temlyakov, Approximation methods for supervised learning, Foundations of Computational Mathematics, 6 (2006), pp. 3–58.
  • [37] C. J. Dsilva, R. Talmon, R. R. Coifman, and I. G. Kevrekidis, Parsimonious representation of nonlinear dynamical systems through manifold learning: A chemotaxis case study, Applied and Computational Harmonic Analysis, 44 (2018), pp. 759 – 773.
  • [38] G. Grégoire and H. Chaté, Onset of collective and cohesive motion, Phy. Rev. Lett., 92 (2004).
  • [39] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk, A distribution-free theory of nonparametric regression, Springer, New York, 2002.
  • [40] Y.-G. Ham, J.-H. Kim, and J.-J. Luo, Deep learning for multi-year enso forecasts, Nature, 573 (2019), pp. 568–572.
  • [41] J. Han, C. Ma, Z. Ma, and W. E, Uniformly accurate machine learning-based hydrodynamic models for kinetic equations, Proceedings of the National Academy of Sciences, 116 (2019), pp. 21983–21991.
  • [42] X. Han, Z. Shen, W. Wang, and Z. Di, Robust reconstruction of complex networks from sparse data, Physical Review Letters, 114 (2015), p. 028701.
  • [43] S. Kang, W. Liao, and Y. Liu, Ident: Identifying differential equations with numerical time evolution, arXiv preprint arXiv:1904.03538, (2019).
  • [44] Y. Katz, K. Tunstrom, C. Ioannou, C. Huepe, and I. Couzin, Inferring the structure and dynamics of interactions in schooling fish, Proceedings of the National Academy of Sciences of the United States of America, 108 (2011), pp. 18720–18725.
  • [45] U. Krause, A discrete nonlinear and non-autonomous model of consensus formation, Communications in difference equations, 2000 (2000), pp. 227–236.
  • [46] S. Lee, M. Kooshkbaghi, K. Spiliotis, C. I. Siettos, and I. G. Kevrekidis, Coarse-scale pdes from fine-scale observations via machine learning, Chaos: An Interdisciplinary Journal of Nonlinear Science, 30 (2020), p. 013141.
  • [47] Q. Li, F. Dietrich, E. M. Bollt, and I. G. Kevrekidis, Extended dynamic mode decomposition with dictionary learning: A data-driven adaptive spectral decomposition of the koopman operator, Chaos: An Interdisciplinary Journal of Nonlinear Science, 27 (2017), p. 103111.
  • [48] Z. Li, F. Lu, M. Maggioni, S. Tang, and C. Zhang, On the identifiability of interaction functions in systems of interacting particles, arXiv preprint arXiv:1912.11965, (2019).
  • [49] H. Liang and H. Wu, Parameter estimation for differential equation models using a framework of measurement error in regression models, Journal of the American Statistical Association, 103 (2008), pp. 1570–1583.
  • [50] Z. Long, Y. Lu, X. Ma, and B. Dong, PDE-net: Learning PDEs from data, arXiv preprint arXiv:1710.09668, (2017).
  • [51] F. Lu, M. Maggioni, and S. Tang, Learning interaction kernels in heterogeneous systems of agents from multiple trajectories, 2019.
  • [52] F. Lu, M. Zhong, S. Tang, and M. Maggioni, Nonparametric inference of interaction laws in systems of agents from trajectory data, Proceedings of the National Academy of Sciences of the United States of America, 116 (2019), pp. 14424–14433.
  • [53] R. Lukeman, Y. Li, and L. Edelstein-Keshet, Inferring individual rules from collective behavior, Proceedings of the National Academy of Sciences of the United States of America, 107 (2010), pp. 12576 – 12580.
  • [54] M. Maggioni, J. Miller, and M. Zhong, Agent-based learning of celestial dynamics from ephemerides, In preparation, (2020).
  • [55] H. Miao, X. Xia, A. Perelson, and H. Wu, On identifiability of nonlinear ODE models and applications in viral dynamics, SIAM Review, 53 (2011), pp. 3–39.
  • [56] S. Motsch and E. Tadmor, Heterophilious dynamics enhances consensus, SIAM Review, 56 (2014), pp. 577 – 621.
  • [57] K. O’Keeffe and C. Bettstetter, A review of swarmalators and their potential in bio-inspired computing, (2019), p. 85.
  • [58] K. P. O’Keeffe, J. H. Evers, and T. Kolokolnikov, Ring states in swarmalator systems, Physical Review E, 98 (2018).
  • [59] K. P. O’Keeffe, H. Hong, and S. H. Strogatz, Oscillators that sync and swarm, Nature Communications, 8 (2017), pp. 1–12.
  • [60] M. Raissi, Deep hidden physics models: Deep learning of nonlinear partial differential equations, The Journal of Machine Learning Research, 19 (2018), pp. 932–955.
  • [61] M. Raissi and G. Karniadakis, Hidden physics models: Machine learning of nonlinear partial differential equations, Journal of Computational Physics, 357 (2018), pp. 125–141.
  • [62] M. Raissi, P. Perdikaris, and G. Karniadakis, Multistep neural networks for data-driven discovery of nonlinear dynamical systems, arXiv preprint arXiv:1801.01236, (2018).
  • [63] M. Raissi, A. Yazdani, and G. E. Karniadakis, Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations, Science, 367 (2020), pp. 1026–1030.
  • [64] J. Ramsay, G. Hooker, D. Campbell, and J. Cao, Parameter estimation for differential equations: a generalized smoothing approach, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69 (2007), pp. 741–796.
  • [65] H. Rudy, N. Kutz, and S. Brunton, Deep learning of dynamics and signal-noise decomposition with time-stepping constraints, Journal of Computational Physics, (2019).
  • [66] S. Rudy, S. Brunton, J. Proctor, and N. Kutz, Data-driven discovery of partial differential equations, Science Advances, 3 (2017), p. e1602614.
  • [67] L. Ruthotto, S. J. Osher, W. Li, L. Nurbekyan, and S. W. Fung, A machine learning framework for solving high-dimensional mean field game and mean field control problems, Proceedings of the National Academy of Sciences, (2020).
  • [68] H. Schaeffer, R. Caflisch, C. Hauck, and S. Osher, Sparse dynamics for partial differential equations, Proceedings of the National Academy of Sciences of the United States of America, 110 (2013), pp. 6634–6639.
  • [69] M. Schmidt and H. Lipson, Distilling free-form natural laws from experimental data, Science, 324 (2009), pp. 81–85.
  • [70] L. Schumaker, Spline Functions: Basic Theory, Cambridge University Press, 3rd ed., 2007.
  • [71] R. Shu and E. Tadmor, Anticipation breeds alignment, arXiv preprint arXiv:1905.00633, (2019).
  • [72]  , Flocking hydrodynamics with external potentials, Arch Rational Mech Anal, (2020), pp. 347 – 381.
  • [73] S. H. Strogatz, From Kuramoto to Crawford: exploring the onset of synchronization in populations of coupled oscillators, Physica D, 143 (2000), pp. 1 – 20.
  • [74] K. Tunstrom, Y. Katz, C. C. Ioannou, C. Huepe, M. J. Kutz, and I. D. Couzin, Collective states, multistability and transitional behavior in schooling fish, Computational Biology, 9 (2013).
  • [75] J. Tropp, An introduction to matrix concentration inequalities, Foundations and Trends in Machine Learning, 8 (2015), pp. 1–230.
  • [76] A. Tsybakov, Introduction to Nonparametric Estimation, Springer Publishing Company, Incorporated, 1st ed., 2008.
  • [77] A. van der Vaart and J. Wellner, Weak Convergence and Empirical Processes with Applications to Statistics, Springer Publishing Company, Incorporated, 1st ed., 1996.
  • [78] J. Varah, A spline least squares method for numerical parameter estimation in differential equations, SIAM Journal on Scientific and Statistical Computing, 3 (1982), pp. 28–46.
  • [79] T. Vicsek, A. Czirók, E. Ben-Jacob, I. Cohen, and O. Shochet, Novel Type of Phase Transition in a System of Self-Driven Particles, Physical Review Letters, 75 (1995), pp. 1226–1229.
  • [80] O. Yair, R. Talmon, R. R. Coifman, and I. G. Kevrekidis, Reconstruction of normal forms by learning informed observation geometries from data, Proceedings of the National Academy of Sciences, 114 (2017), pp. E7865–E7874.
  • [81]  , Reconstruction of normal forms by learning informed observation geometries from data, Proceedings of the National Academy of Sciences, 114 (2017), pp. E7865–E7874.
  • [82] S. Zhang and G. Lin, Robust data-driven discovery of governing physical laws with error bars, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474 (2018), p. 20180305.
  • [83] M. Zhong, J. Miller, and M. Maggioni, Data-driven discovery of emergent behaviors in collective dynamics, Physica D: Nonlinear Phenomena, (2020), p. 132542.