
1 Facultad de Matemática, Astronomía, Física y Computación, Universidad Nacional de Córdoba, (5000), Córdoba, Argentina
2 LIGO, California Institute of Technology, Pasadena, CA 91125, USA
3 Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas (CIFASIS, CONICET–UNR)
4 Instituto de Astronomía Teórica y Experimental – Observatorio Astronómico Córdoba (IATE–OAC–UNC–CONICET)

Arby - Fast data–driven surrogates

Aarón Villanueva 1, Martin Beroiz 2, Juan Cabral 3,4, Martín Chalela 4, Mariano Dominguez 4
Abstract

Context. The availability of fast-to-evaluate and reliable predictive models is highly relevant in multi-query scenarios, where evaluating some quantities in real or near-real time becomes crucial. As a result, reduced-order modeling techniques have gained traction in many areas in recent years.

Aims. We introduce Arby, an entirely data-driven Python package for building reduced-order or surrogate models. In contrast to standard approaches, which involve solving partial differential equations, Arby relies only on training data. The package encompasses several tools for building and interacting with surrogate models in a user-friendly manner. Furthermore, the surrogate model allows for fast evaluations at a minimum computational cost.

Methods. The package implements the Reduced Basis approach and the Empirical Interpolation Method, along with a classic regression stage, for surrogate modeling.

Results. We illustrate the simplicity of using Arby to build surrogates through a simple toy model: a damped pendulum. Then, for a real case scenario, we use Arby to describe CMB temperature anisotropies power spectra. In this multidimensional setting, we find that an initial set of 80,000 power spectrum solutions, with 3,000 multipole indices each, can be well described at a given tolerance error using just a subset of 84 solutions.

Key Words.:
Reduced Order Modeling – Surrogate Models – Reduced Basis – Empirical Interpolation – Python Package.

1 Introduction

Several problems arise in observational and theoretical contexts that demand the resolution of computationally intensive differential equations. From structural analysis in engineering to spacetime simulations in astrophysics, rapid and reliable evaluations of solutions to these equations are crucial due to the need for computing, in real or near-real time, quantities that depend on those solutions. Due to the pervasive computational costs driven by the inherent complexity of many problems, achieving fast responses becomes a ubiquitous bottleneck.

As a case study, let us take the example of an ongoing problem in the field of gravitational wave (GW) research, namely the template bank problem (Field et al. (2012)). To be able to detect the very faint gravitational wave signals among the noisy background of ground-based interferometers, the LIGO-Virgo collaboration uses matched filtering against a bank of GW signal templates to maximize the signal-to-noise ratio (SNR) (Cutler & Flanagan (1994); Abbott et al. (2016, 2020)).

The observed time series is filtered to decide whether a binary coalescence took place, and parameter estimation allows us to infer the properties of the progenitors of the remnant, such as masses and spins for binary black hole mergers.

It is convenient to have template banks of theoretical waveforms large enough to fill the parameter space of GWs. However, this poses an exceedingly challenging task due to the complexity of solving the Einstein equations of General Relativity. A single numerically generated GW waveform can take from days to weeks to become available for production (Lehner & Pretorius (2014)).

A palliative solution is the construction of approximate waveform models, which are more direct to build and deploy than numerical relativity ones. There are several methods to build these approximations. Analytical examples are the Post-Newtonian (Blanchet (2006)), Effective One Body (Damour & Nagar (2011)) and Phenomenological (Sturani et al. (2010); Hannam et al. (2014); Khan et al. (2019)) approximations. However, we focus here on a particular set of methods that have proven to be very fertile in GW research and produced some of the milestones in waveform modeling in recent years: Reduced Order methods.

Reduced Order Modeling (ROM) is an umbrella term that encompasses a variety of techniques for dimensional reduction, developed to address the problem of complexity in numerical simulations. In particular, Surrogate Models obtained through the application of ROM to ground-truth solutions are low-resolution representations intended to be fast to build and evaluate without compromising accuracy. We take a data-driven approach, i.e. driven only by data, as opposed to more standard and intrusive ones in which reduction methods are coupled to differential solvers to build solutions (Quarteroni et al. (2015); Hesthaven et al. (2015)).

In waveform modeling, the combination of two ROM methods originally posed for intrusive problems and recreated later for data-driven ones led to significant success in constructing surrogate models for GWs. Those methods are the Reduced Basis (RB) (Boyaval et al. (2009); Field et al. (2011, 2012)) and the Empirical Interpolation (EI) (Barrault et al. (2004); Chaturantabut & Sorensen (2010)) methods, which we describe in the next section.

The primary purpose of this work is to disseminate these tools in the astronomy/astrophysics community by introducing a single, user-friendly Python package for data-driven dimensional reduction: Arby.

Arby arises as a response to the lack of well-documented, tested, and actively maintained code for reduced-basis and surrogate modeling in the scientific community, while adhering to the data-driven and user-friendliness premise.

2 Theory Overview

This section briefly describes the basics of a reduced-order pipeline for building surrogate models. As we stated above, it merges two main ingredients, the Reduced Basis and the Empirical Interpolation methods, for dimensional reduction of raw data. This pipeline was first introduced in (Field et al. (2014)) followed by the construction of several surrogate waveform models based on this method (Blackman et al. (2015, 2017b, 2017a); Varma et al. (2019a, b); Rifat et al. (2020)). See also (Tiglio & Villanueva (2021)) for a review.

Representation.– We are interested in parametrized scalar (real or complex) functions, solutions or models of the form

h_{\lambda}(x) := h(\lambda; x)\,, \qquad (1)

where \lambda represents the parameter(s) of the model and x is the independent variable. Both are real and possibly multidimensional. In physical models, h_{\lambda} can represent a parametrized time series, with x being the time variable. For convenience, we denote the spaces for \lambda and x as the parameter and physical domains, respectively.

In the first stage, we look for a low-resolution representation of the model h_{\lambda}. The RB method consists of representing a whole set of solutions, usually called the training set,

{\cal K} := \{h_{\lambda_i}\}_{i=1}^{N}\,,

by linear combinations of basis elements of the form

h_{\lambda} \approx \sum_{i=1}^{n} \langle e_i, h_{\lambda}\rangle\, e_i\,, \qquad (2)

where

\langle e_i, h_{\lambda}\rangle := \int_{\Omega} \bar{e}_i(x)\, h_{\lambda}(x)\, dx \qquad (3)

defines the inner product between training functions, where \bar{e}_i is the complex conjugate of e_i (if we deal with complex functions) and \Omega is the physical domain. The set \{e_i\}_{i=1}^{n} is called the reduced basis and is composed of a subset of optimally chosen solutions from the training set. The construction of the reduced basis is iterative: at each step of the algorithm, the most dissimilar (orthogonal) element of the training set {\cal K} joins the current basis, and the process stops when a user-specified tolerance is met. This tolerance is related to the maximum error of the difference between solutions and their approximations. In consequence, the spaces {\cal X}_i (i = 1, 2, \ldots) spanned by the reduced bases built at each step are nested, i.e. {\cal X}_1 \subset {\cal X}_2 \subset \ldots The addition of a new element to the basis implies a fine-tuning of the previous approximation space.
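The projection in Eqs. (2)–(3) can be written out in a few lines. The following is a minimal NumPy sketch on synthetic data, assuming an equispaced grid with simple Riemann weights (this is an illustration, not Arby code):

import numpy as np

x = np.linspace(0, 2 * np.pi, 1001)              # discretized physical domain Omega
w = np.full_like(x, x[1] - x[0])                 # simple Riemann quadrature weights

def dot(a, b):
    return np.sum(w * np.conj(a) * b)            # discrete version of Eq. (3)

# a toy orthonormal "reduced basis" with two elements
e1 = np.sin(x) / np.sqrt(dot(np.sin(x), np.sin(x)).real)
e2 = np.sin(2 * x) / np.sqrt(dot(np.sin(2 * x), np.sin(2 * x)).real)

h = 0.3 * np.sin(x) + 0.7 * np.sin(2 * x)        # a "training" function
h_approx = dot(e1, h) * e1 + dot(e2, h) * e2     # Eq. (2) with n = 2
print(np.max(np.abs(h - h_approx)))              # tiny: h lies in the span of {e1, e2}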

The RB method amounts to a compression in the parameter space of solutions. The next step is to achieve compression in the physical domain (e.g. time). To this end, we turn to interpolation, replacing the projection-based approach described so far by an interpolation scheme. We pose the problem by looking for an efficient linear interpolation operator {\cal I}_n such that

h_{\lambda}(x) \approx {\cal I}_n[h_{\lambda}](x) = \sum_{i=1}^{n} C_i(\lambda)\, e_i(x)\,, \qquad (4)

subject to

{\cal I}_n[h_{\lambda}](X_i) = h_{\lambda}(X_i)\,, \quad i = 1, \ldots, n \qquad (5)

for strategically selected nodes X_i (i = 1, \ldots, n) taken from the physical domain. The EI method (Maday et al. (2009); Barrault et al. (2004); Chaturantabut & Sorensen (2010)) gives us an algorithm for building such an interpolant. The Empirical Interpolation (EI) algorithm, as described in Field et al. (2014), iteratively selects the nodes \{X_i\} from the physical domain following a local optimization criterion (Tiglio & Villanueva (2020)).

The EI algorithm receives the reduced basis as its only input and selects the interpolation nodes for building the interpolant. Note that there is no need for the whole training set {\cal K}, since the RB algorithm has already introspected it, and we assume that the relevant information about {\cal K} is encoded in the reduced basis.

It is possible to show that, under some conditions, the interpolation error is similar to the projection one, which in most applications has exponential decay in the number of basis elements (see (Tiglio & Villanueva (2021)) and citations therein). This leads to an efficient and (in most cases of interest) accurate representation of the training set by means of an empirical interpolation.

To summarize: first, a reduced basis is built from a training set using the RB method, which leads to the linear representation in Eq. (2). This step completes a compression in the parameter domain. Next, an empirical interpolant is built solely from the reduced basis. This step completes a compression in the physical domain. Finally, we end up with an empirical interpolation (Eq. (4)) which provides efficient and high-accuracy representations of all functions in the training set.

Predictive models.– We want our model to represent solutions that are not present in the training set. That is, we look for predictability. To this end, we perform parametric fits at each empirical node along the training values. Let us break this down.

Let us rewrite the interpolation in (4) as

{\cal I}_n[h_{\lambda}](x) = \sum_{i=1}^{n} B_i(x)\, h_{\lambda}(X_i)\,, \qquad (6)

where

B_i(x) := \sum_{j=1}^{n} ({\bf V}_n^{-1})_{ji}\, e_j(x) \qquad (7)

and

({\bf V}_n)_{ij} := e_j(X_i)\,.
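As an illustration of Eqs. (6)–(7), the following minimal NumPy sketch assembles {\bf V}_n and the functions B_i(x) for a toy basis and an assumed set of nodes, and checks the interpolation condition (5):

import numpy as np

x = np.linspace(-1, 1, 401)
E = np.array([np.ones_like(x), x, x ** 2])       # rows are a toy 3-element basis e_j(x)
nodes = [0, 200, 400]                            # assumed node indices X_1, X_2, X_3

V = E[:, nodes].T                                # (V_n)_{ij} = e_j(X_i)
B = np.linalg.inv(V).T @ E                       # rows are B_i(x), Eq. (7)

h = 2.0 - x + 0.5 * x ** 2                       # a function lying in the span of the basis
h_interp = h[nodes] @ B                          # Eq. (6)
print(np.allclose(h_interp[nodes], h[nodes]))    # interpolation condition (5): True
print(np.allclose(h_interp, h))                  # exact here, since h is in the span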

Recall that the approach is data-driven, so we do not fill the training set with more solutions to approximate new ones. Instead, we predict them by performing fits along data that we already know, that is, along h_{\lambda}(X_i) (i = 1, \ldots, n). For 1-D problems (\lambda \in \mathbb{R}), choosing problem-agnostic fitting methods that are well suited for most cases can be a challenging task, not to mention the high-dimensional case, which remains an open question (Tiglio & Villanueva (2021)). For a rough classification, we distinguish between regression and interpolation methods. The former deals with calibrating the free parameters of some model by optimizing an error function; the latter, with solving an interpolation problem, which essentially consists in solving algebraic systems possibly subject to constraints (e.g. Eqs. (4, 5)).

Here we address the second approach for the parametric fits. The procedure consists of interpolating the values

\{h(\lambda_j; X_i)\}_{j=1}^{N}

along the parameter samples for each empirical node X_i, i = 1, \ldots, n. As we describe in Section 4, as of now Arby uses splines for this step, though we expect to generalize this in the future. Once the fits are done, we end up with a surrogate model h_{\lambda}^{surr}(x) which can represent and predict h_{\lambda} at any \lambda in the parameter domain with high accuracy and low computational cost.
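For instance, the fit at a single empirical node can be sketched with SciPy splines as follows (synthetic data; the node value and the model are assumptions made only for illustration):

import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

params = np.linspace(1.0, 5.0, 101)              # training parameters lambda_j
X_1 = 0.7                                        # an assumed empirical node
h_at_node = np.sin(params * X_1)                 # training values {h(lambda_j; X_1)}

h_fit_1 = InterpolatedUnivariateSpline(params, h_at_node, k=5)   # degree-5 spline
print(h_fit_1(2.34))                             # prediction at an out-of-sample lambda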

3 Algorithms

The pipeline for building surrogates described in the previous section is valid in any number of parameter and physical dimensions if we consider arbitrary fits through parameter space. For surrogate modeling, Arby supports in its present version 1-D parameter and physical domains (real intervals of the form [a, b] for a, b \in \mathbb{R}) and real-valued functions. Again, this restriction applies only to building surrogate models. On the other hand, Arby supports multidimensional parameter domains (although still restricted to 1-D domains in the physical dimension) and complex-valued functions for building reduced bases and empirical interpolants.

Below we summarize the algorithm for surrogate modeling. We refer the reader to the Appendices for technical details about the RB and EI algorithms used in intermediate stages.

The inputs are the training set {\cal K} = \{h_{\lambda_i}\}_{i=1}^{N}, the parameter set {\cal T} := \{\lambda_i\}_{i=1}^{N}, and the greedy tolerance \epsilon \in \mathbb{R}.

Algorithm 1 Surrogate modeling
1: Input: {\cal K}, {\cal T}, \epsilon
2: Build the reduced basis \{e_i\}_{i=1}^{n} up to tolerance \epsilon.
3: Find the empirical nodes \{X_i\}_{i=1}^{n} and build the interpolant {\cal I}_n by assembling the functions B_i(x) (i = 1, \ldots, n) (see Eq. (7)).
4: for i = 1 \to n do
5:     Build a continuous function h_i^{fit}(\lambda) by doing fits along the values \{h(\lambda; X_i)\}_{\lambda \in {\cal T}}.
6: end for
7: Assemble the surrogate: h^{surr}(\lambda; x) := \sum_{i=1}^{n} B_i(x)\, h_i^{fit}(\lambda)
8: Output: surrogate model h^{surr}(\lambda; x)

Let’s make some remarks on Alg. 1.

  • The training set {\cal K} is built from a discretization {\cal T} of the parameter domain. In the current version of Arby, {\cal T} is a discretized real interval {\cal T} := \{\lambda_1, \ldots, \lambda_N\}.

  • The RB algorithm used for building the reduced basis is fully described in Alg. 2, see the Appendices. It selects from {\cal T} n points \lambda_i = \Lambda_i (i = 1, \ldots, n), called the greedy points, which label those functions in the training set that conform the reduced basis. For conditioning purposes, the basis is orthonormalized at each step, so the algorithm's final output is a set of orthonormal basis elements along with the set of greedy points. We therefore use the term reduced basis interchangeably for both the basis conformed by the greedy solutions and its orthonormal version, since they are equivalent (they span the same space).

    The number n depends on the greedy tolerance \epsilon. In Arby we must specify a discretization of the physical domain \Omega so as to be able to compute integrals (see Eq. (3)). In this context, \Omega is a real interval [x_a, x_b] and Arby must receive as input an equispaced discretization of it.

  • Step 3 implements the EI algorithm described in Alg. 3, see the Appendices. It receives the reduced basis as its only input and finds n empirical nodes X_i (i = 1, \ldots, n) to build the interpolant. In practice, the interpolant is specified by assembling the n functions B_i(x) defined in Eqs. (6, 7).

  • To achieve predictability, Steps 4-6 perform parametric fits along the training values for each empirical node X_i. Let us illustrate this by looking at the first iteration of the loop in Steps 4-6. For the first node X_1 we collect all the values

    \{h(\lambda_1; X_1), h(\lambda_2; X_1), \ldots, h(\lambda_N; X_1)\}

    and perform a fit along them. This results in a function h_1^{fit}(\lambda) that is continuous in the interval [\lambda_1, \lambda_N]. This is repeated n times, once for each empirical node. The resulting functions h_i^{fit}(\lambda), together with the reduced basis, constitute the building blocks for the final surrogate assembly. The current Arby version implements splines (J.H. Ahlberg & Walsh (1967)) for the parametric fits, i.e., piecewise polynomial interpolation of some degree arbitrarily set by the user.

  • From Eqs. (6) and (7), the empirical interpolant is defined through the n functions B_i(x), which comprise all the RB-EI information. Combining the functions B_i(x) with the fits h_i^{fit}(\lambda) built in the previous steps, Step 7 leads to the desired surrogate h^{surr}(\lambda; x), which is continuous in \lambda inside the real interval [\lambda_1, \lambda_N].

3.1 Related works

A previous implementation of the RB-EI approach is GreedyCpp (https://bitbucket.org/sfield83/greedycpp/), an MPI/OpenMP parallel code written in C++ (Antil et al. (2018)). Although it is not designed for building surrogates and training sets have to be loaded at runtime, it allows for building reduced bases, empirical interpolants and reduced-order quadratures. Another example is ROMPy (https://bitbucket.org/chadgalley/rompy/), a previous attempt written in pure Python which supports surrogate modeling.

Other ROM implementations in the Python ecosystem are not fully data-driven. Typically they are weakly or strongly coupled to solvers for differential equations. Mature examples are PyMOR (Milk et al. (2016)) and RBniCS (Ballarin et al. (2015)). The latter is built on top of the FEniCS (Logg et al. (2011)) library for differential equations, while the former allows for coupling with external solvers.

4 Arby

Arby is a Python package for data-driven surrogate modeling that satisfies standard software compliance and quality assurance (see Section 4.4). It allows the user to build reduced bases and empirical interpolants for any number of parameter dimensions. In the current release, Arby builds surrogate models for 1-D domains.

4.1 Implementation

Integrals and inner products (see Eq. (3)) must be discretized to implement Alg. 1. For this, the physical interval \Omega is sampled at L equispaced points \{x_1, \ldots, x_L\} to define a discrete inner product between two functions,

\langle h_1, h_2\rangle_d := \sum_{i=1}^{L} \bar{h}_1(x_i)\, h_2(x_i)\, \omega_i\,.

The bar represents complex conjugation in case h is complex. The \omega_i's are L positive real values called weights. The weights \{\omega_i\} and the sample points \{x_i\} constitute a quadrature rule. Arby uses quadrature rules to compute integrals.
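Written out in NumPy, the discrete inner product reads as follows (a minimal sketch with Riemann-like weights on an equispaced grid; Arby also offers other rules, see Section 4.3):

import numpy as np

L = 1000
x = np.linspace(0.0, np.pi, L)
w = np.full(L, x[1] - x[0])                      # quadrature weights omega_i

def dot(h1, h2):
    return np.sum(np.conj(h1) * h2 * w)

print(dot(np.sin(x), np.sin(x)).real)            # close to pi/2, the integral of sin^2 on [0, pi]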

4.2 Public API

Classes.– The main class in Arby is ReducedOrderModel, which implements the surrogate modeling pipeline of Alg. 1. There are three basic inputs for this class:

  • training_set: the training functions, stored as a 2-D array,

  • physical_points: the discretization of the physical domain used for the quadrature rules (see the discrete inner product above), and

  • parameter_points: the discretization of the parameter domain.

These inputs represent the minimum and indispensable for building surrogates. Optional parameters are:

  • greedy_tol: the greedy tolerance \epsilon for the reduced basis, and

  • poly_deg: the degree of the splines used for the parametric fits.

These parameters can be tuned for controlling the model accuracy. See the Arby documentation (https://arby.readthedocs.io) for a thorough tutorial on this. Once a ReducedOrderModel object is created, the workflow splits into offline-online stages (Field et al. (2014)). Thus, the offline stage corresponds to a (possibly) expensive first building; the online one corresponds to fast model evaluations.

There is a class to compute inner products and integrals, Integration, which implements the quadrature rules and exposes, among others, the integral and norm methods. The Basis class encompasses data utilities for handling arbitrary bases, whether they are reduced bases or user-specified ones, and is built from basis data and an Integration object. Available methods include project and projection_error; the latter computes squared projection errors due to projecting arrays onto the basis. Auxiliary classes are the RB and EIM containers, which store the RB/EIM information.

Functions.– The main function in Arby is reduced_basis, which implements the RB greedy algorithm (Tiglio & Villanueva (2021)). For conditioning purposes, there are two patterns related to the normalization of the training set, which lead to two different implementations of the greedy algorithm. There is also a function for Gram-Schmidt orthonormalization, gram_schmidt, which implements an iterated Gram-Schmidt procedure (Hoffmann (1989)) to orthonormalize a set of linearly independent arrays. Internally, this routine is used to build the reduced bases.
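To illustrate the lower-level API, the snippet below builds a reduced basis directly from a training array (a schematic example on synthetic data; the exact signatures and the returned attribute names are assumptions to be checked against the Arby documentation):

import numpy as np
from arby import Integration, reduced_basis

x = np.linspace(0, 1, 101)                                   # physical_points
training = np.array([np.cos(k * x) for k in np.linspace(1, 5, 20)])

integration = Integration(x, rule="riemann")                 # quadrature-aware inner products
print(integration.norm(training[0]))

rb = reduced_basis(training, x, greedy_tol=1e-10)            # offline stage
basis = rb.basis                                             # a Basis object (assumed attribute name)
print(basis.projection_error(training[3]))                   # squared projection error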

4.3 Benchmarks

It is interesting to present a performance analysis of the most important routine of the project: the reduced_basis function. We want to measure how different combinations of its parameters influence the general performance of the algorithm, and whether this adjusts to the theoretical estimates. These parameters are:

  • integration_rule – The quadrature rule used to compute integrals. Possible values are riemann, trapezoidal and euclidean.

  • normalize – True if the training data must be normalized before training, or False otherwise.

  • greedy_tol – The greedy tolerance. We tested the values 10^{-14} and 10^{-12}.

  • training_set – The training data as a 2-D array. We tested on square random arrays with sizes 11 × 11 and 101 × 101.

  • physical_points – Physical points for the quadrature rules. Must match the number of columns of the training set.

With these parameters, we simulated 100 training sets for each one of the 24 possible combinations, giving a total of 24,000 test cases (a minimal timing sketch of this setup is shown below, after the hardware specifications). The benchmark was then executed on a computer with the following specifications:

  • CPU – 4 x 2.4 GHz AMD Opteron(tm) Processor 6282 SE.

  • RAM – 251GB DDR3L.

  • OS – CentOS 7 Linux 3.10.0-514.el7.x86_64

  • Software – Python 3.9.0.final.0 (64 bit), NumPy 1.21.1 and SciPy 1.7.0
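For reference, a single test case of this benchmark can be reproduced with a few lines like the following (a sketch; the reduced_basis signature follows the parameter names listed above and may differ in detail from the released API):

import time
import numpy as np
from arby import reduced_basis

rng = np.random.default_rng(seed=0)
for n in (11, 101):                              # training-set sizes used in the benchmark
    training_set = rng.random((n, n))            # square random training set
    physical_points = np.linspace(0, 1, n)       # must match the number of columns
    start = time.perf_counter()
    reduced_basis(training_set, physical_points,
                  integration_rule="riemann",
                  greedy_tol=1e-12,
                  normalize=False)
    print(n, time.perf_counter() - start, "s")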

The results are presented in the figure below. As we can anticipate, the size of the training set is the most important factor impacting the execution times; all the other parameters leave the times relatively unchanged.

[Figure: Results of the benchmark on 24,000 test cases varying the values of integration_rule, normalize, greedy_tol and the training-set size. In all cases, the horizontal axis represents the different parameter values, and the vertical axis represents the execution time in seconds. The execution time increases with the size of the training set, while for all the other parameters the times remain relatively unchanged.]

To further explore the relationship between the training-set size and the execution time, we ran a second experiment fixing normalize=False, greedy_tol=1e-12 and the riemann integration rule, and generating random training sets of increasing size. The measured times (Fig. 1) show the anticipated behavior: for random training sets the cost grows as O(LN^2) as we increase the size of the input. The benchmark dataset is available at https://zenodo.org/record/5139187 (Villanueva et al., 2021).

Figure 1: Measured times for 28,900 test samples of training sets with sizes between 11 × 11 and 300 × 300. The samples keep integration_rule, greedy_tol and normalize fixed at riemann, 10^{-12} and False, respectively.

4.4 Quality assurance

To ensure the proper software quality of Arby, we provide standard quantitative and qualitative metrics, in particular i) unit testing and ii) code-coverage, and adhere to the PEP 8 style guide (Van Rossum et al., 2001) throughout the entire project.

  1. Unit testing: its purpose is to ensure that the individual software components work as expected (Jazayeri, 2007).

  2. Code-coverage: it measures the amount of code covered by the unit test suite, expressed as a percentage of executed sentences (Miller & Maloney, 1963). By providing comprehensive code-coverage we ensure code validation, expand the ability and efficiency of error handling, and increase confidence in the code.

Arby currently uses pytest (Okken, 2017) and Coverage.py (https://coverage.readthedocs.io) for unit testing and coverage, respectively, reaching up to 99% of code-coverage with Python versions 3.6, 3.7, 3.8 and 3.9. The PEP 8 - Style Guide for Python Code (Van Rossum et al., 2001) is one of a series of guidelines and practices on how to write Python code to improve code readability and consistency. There are several PEPs (Python Enhancement Proposals), including PEP 8. The latter has recommendations for code layout, whitespaces, comments, naming conventions and programming recommendations. In addition, there are tools, called linters, that can be used to automate compliance with PEP 8; Arby currently uses flake8 (https://pypi.org/project/flake8/), which checks for any deviation in code style. Finally, the entire source code is MIT-licensed (Initiative et al., 2006) and is publicly available from its GitHub repository (https://github.com/aaronuv/arby). All versions committed to this code are automatically tested in a continuous-integration service using Travis CI (https://travis-ci.com/github/aaronuv/arby) and GitHub Actions (https://github.com/features/actions). Documentation is automatically built from the repository and made public in the read-the-docs service at https://arby.readthedocs.io/en/latest/. Arby is built on top of the Python scientific stack: NumPy (Walt et al., 2011), to perform efficient numerical linear algebra operations, and SciPy (Virtanen et al., 2020), used in the current release for spline interpolation. The Arby package is available for installation from the Python-Package-Index (PyPI) and can be installed using the command pip install arby.

4.5 Toy model: a damped pendulum

We illustrate the construction of surrogate models by applying Arby to a classical problem in physics: the damped pendulum. This system is a simple pendulum of given length subject to gravity and a dissipative force such as friction, which allows the pendulum oscillations to damp at long times. We encode the generic dynamics of this system in the ordinary differential equation (ODE)

\ddot{\theta} = -b\dot{\theta} - \lambda\sin(\theta)\,, \qquad (8)

where \theta represents the time-dependent angle of the pendulum with respect to the equilibrium axis and dots represent time differentiation. The symbols b and \lambda represent friction and gravity strength per unit length, respectively, at a fixed value of the pendulum's length. Time units do not play any role in this example, so for practical purposes we choose time to be adimensional. This choice makes the two parameters of the model, b and \lambda, also adimensional. In order to make the model one-dimensional and be able to apply Arby for surrogate modeling, we fix b to a convenient value so as to widely cover the variation of solutions in the selected time range. With b fixed, our parametrized model consists of damped oscillations \theta_{\lambda}(t), with time t the physical variable and \lambda the parameter. We must solve the ODE in (8) numerically in order to generate the training set of solutions to feed Arby. To this end, we set b = 0.2 and choose the physical and parameter ranges as t \in [0, 50] and \lambda \in [1, 5]. The initial conditions are set to (\theta, \omega) = (\pi/2, 0), where \omega := \dot{\theta}, meaning the pendulum departs from rest at \theta = \pi/2 and falls under the action of gravity. We generate a training set of solutions using an ODE solver from the SciPy Python package (Virtanen et al., 2020). We discretize the parameter and time domains in 101 and 1,001 equispaced points, respectively, and generate 101 solutions using the same initial conditions (the code for this example can be found at https://arby.readthedocs.io/en/latest/). See Fig. 2.
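The training-set generation just described can be sketched as follows (a minimal script consistent with the setup above; the exact code is available in the documentation linked in the text):

import numpy as np
from scipy.integrate import odeint

b = 0.2
time = np.linspace(0.0, 50.0, 1001)              # physical domain, 1,001 points
param = np.linspace(1.0, 5.0, 101)               # parameter domain, 101 points

def pendulum(y, t, lam):
    theta, omega = y
    return [omega, -b * omega - lam * np.sin(theta)]     # Eq. (8) as a first-order system

# one training solution theta_lambda(t) per parameter value, all with theta(0)=pi/2, omega(0)=0
training = np.array(
    [odeint(pendulum, [np.pi / 2, 0.0], time, args=(lam,))[:, 0] for lam in param]
)

The resulting arrays training, time and param are the inputs used below.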

Figure 2: Graphical illustration of the training set. We plot a subset of training solutions.

To build a surrogate for solutions to the pendulum equation, we create a ReducedOrderModel object, passing training, time and param as the training set, the physical points and the parameter points, respectively. In addition, we modify the default values of optional class parameters to increase the surrogate's quality:

>>> from arby import ReducedOrderModel as ROM
# set the greedy tolerance to 1e-14 and
# splines degree to 5 and create a model
>>> pendulum = ROM(training, time, param, greedy_tol=1e-14, poly_deg=5)

Once the model for the pendulum is created, the next step is to build and evaluate the surrogate. For that, we just call it:

# define the parameter par for surrogate evaluation
>>> par = 2.
# evaluate
>>> pendulum.surrogate(par)
array([1.57079633, 1.5683046 , 1.56086267, ..., 0.00993474, 0.00975876, 0.009536])

In order to test the surrogate's accuracy, we build a test set composed of 1,001 solutions (10× denser than the training set) for the same parameter and physical domains. This means the test set contains the training set plus several solutions not used in the training stage. We use the L_2 norm to compute the relative error between the surrogate and the ground truth,

e(\lambda) := \frac{\|\theta_{\lambda}^{surr} - \theta_{\lambda}\|}{\|\theta_{\lambda}\|}\,, \qquad (9)

where

\|\theta_{\lambda}\| := \left(\int_{[0,50]} |\theta_{\lambda}(t)|^2\, dt\right)^{\frac{1}{2}}\,.
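In practice, this error can be computed with a quadrature over the time grid, for example as in the following sketch (using a trapezoidal rule; Arby's own integration tools can be used instead):

import numpy as np

def relative_error(theta_surr, theta_true, time):
    # relative L2 error of Eq. (9), with the integral over [0, 50] discretized on the time grid
    norm = lambda f: np.sqrt(np.trapz(np.abs(f) ** 2, time))
    return norm(theta_surr - theta_true) / norm(theta_true)

# e.g. relative_error(pendulum.surrogate(par), theta_true, time) for a test solution theta_true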

This metric allows us to quantify how well the surrogate globally matches the ground truth model at some parameter value. Furthermore, we compute these errors not only for the surrogate model based on 101 training parameters but for a set of surrogates built upon different discretizations, starting from very sparse ones until reaching the discretization of 101 training parameters. With the integration tools available in Arby, these computations become straightforward. In Fig. 3 we build a colormap of errors for different discretizations and parameter values. Naturally, the biggest errors correspond to very sparse training sets and fall below 10^{-4} for discretizations \gtrsim 50. For a specific model (a horizontal line in the colormap), bright-dark patterns describe the behavior of the model when it alternates between in-sample and out-of-sample parameter evaluations. The lowest errors usually correspond to in-sample evaluations, where splines become exact. The largest ones usually correspond to out-of-sample evaluations, where the errors due to the parametric fits become relevant.

Figure 3: Global errors for surrogate models built from different training discretizations N = 11, 12, 13, 15, 17, 21, 26, 34, 51, 101. All models are evaluated at the test parameters.

For the surrogate trained with 101 solutions, in Fig. 4 (top panel) we show the function curves for both the surrogate and the ground truth model, for the parameter value corresponding to the worst global error. The surrogate evaluation at this parameter is a prediction, i.e. the associated parameter does not correspond to a training one. Since the surrogate and the ground truth model are indistinguishable at eyeball resolution, we plot the absolute value of both functions in logarithmic scale to better locate the sectors where the curves differ. We conclude that even the worst-case scenario shows almost no difference between the surrogate and the ground truth solution. The bottom panel of Fig. 4 shows the absolute value of the point-wise difference between the surrogate and the test solutions. We removed from the test set those points which correspond to training parameters in order to focus only on the generalization errors of the surrogate. We see that the point-wise errors jump up at most to \sim 10^{-4}, whereas the bulk remains close to \sim 10^{-7}.

Figure 4: Top: absolute value of the function curves for both models, surrogate and ground truth; they correspond to the worst prediction in the test set. Bottom: point-wise difference errors for the test parameters. Training points, for which the errors are much smaller than pure test errors, are excluded.

5 Cosmic Microwave Background Anisotropies: a multidimensional case

A valuable application of surrogate models could stem from modeling the CMB temperature and polarization anisotropies power spectra as measured by satellites like Planck (Planck Collaboration et al., 2020). Such power spectra have a strong dependence on the underlying cosmological model, defined by a set of cosmological parameters [\Omega_b h^2, \Omega_m h^2, H_0, n, \tau, A_s, 10^9 A_s \exp(-2\tau)], here taken to be 7-dimensional. Using CAMB (Code for Anisotropies in the Microwave Background; Lewis et al., 2000), we generated the observed temperature anisotropies power spectrum corresponding to each cosmological model. We randomly sampled the cosmological parameter space using 80,000 points; as we will see, this is sufficiently dense. The independent or physical variable here turns out to be the angular multipole index \ell, which we sampled using 3,000 discrete points \ell = 1, \ldots, 3000. We compute a reduced basis using Arby. For a greedy tolerance of 10^{-4} we obtain a set of greedy parameters identifying those elements in the training set that conform the reduced basis, allowing us to describe any power spectrum in our sample as a linear combination of them. In particular, the training set of power spectra is equivalent to a set of just 84 reduced functions. Thus, for example, in Fig. 5 we show the distribution of the cosmological parameters for a training set of 3,000 CMB power spectra as blue points and the corresponding selected reduced basis as orange points. This methodology could accelerate the estimation of cosmological parameters from CMB anisotropies by using a surrogate model built with Machine Learning algorithms on the reduced basis set. The use of this approach to estimate cosmological parameters will be the subject of a forthcoming publication.
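Schematically, the reduction step for this dataset amounts to a single call to the reduced-basis routine of Section 4.2 (a hypothetical sketch: the file name is a placeholder and the returned attribute name is an assumption, to be checked against the documentation):

import numpy as np
from arby import reduced_basis

ells = np.arange(1, 3001)                        # multipole indices, used as the physical points
spectra = np.load("camb_tt_spectra.npy")         # placeholder: array of shape (80000, 3000)

rb = reduced_basis(spectra, ells, greedy_tol=1e-4)
print(len(rb.indices))                           # number of greedy elements (~84 at this tolerance)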

Figure 5: Cosmological parameter space of a training set (blue dots) and the corresponding reduced basis selected by Arby (orange points).

We report the convergence of the number of reduced basis elements n as a function of the training set size N in Fig. 6. This plot shows that n stabilizes close to n = 84, meaning that, for the specified greedy tolerance, the training set begins to saturate at N \sim 10^2-10^3. This shows that the full 80,000-point dataset involves highly redundant information, for which only 84 functions are enough to represent the entire set. With the reduced basis we reach a compression factor of 34 simply by computing projections of the training set.

Figure 6: Convergence of the number of basis elements with the size of the training set. We observe the curve saturates around 84 basis elements.

6 Conclusion

We have introduced Arby, an open source Python package that provides a set of tools to generate and handle fast and highly accurate surrogate models in a non-intrusive way. Arby can be used to construct continuous models from sparse data composed of functions generated, for instance, by differential equations, or to explore redundancies in data through dimensional reduction. The offline-online architecture of Arby allows for fast deployments of predictive models that can approximate functions which otherwise can be expensive to compute. We assessed the package with unit testing tools and ensured it satisfies proper software quality assurance, which improves the robustness and readability of the code. We also performed benchmarks measuring the computation times of reduced bases for different parameter combinations. To date, for surrogate modeling Arby works on 1-D domains for both spaces, the parametric and the physical, though it supports multidimensional parameter domains for reduced bases and empirical interpolation. In future releases we want to extend Arby to several dimensions for surrogate modeling and combine it with state-of-the-art regression methods at the fitting stages. The curse of dimensionality present in high-dimensional problems, like the CMB power spectra discussed in Section 5, surely becomes a bottleneck, so parallelized scenarios are worth exploring in the future. The user-friendly API of Arby expands the usability of the code to virtually anyone looking for a continuous model out of discrete data, even with little or no knowledge of ROM methods.

Acknowledgements.
The authors would like to thank their families and friends, as well as the IATE astronomers and Manuel Tiglio, for useful comments and suggestions. This work was partially supported by the Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET, Argentina). A.V., J.B.C. and M.Ch. are supported by fellowships from CONICET. This research has made use of the NASA Astrophysics Data System (http://adsabs.harvard.edu/), the Cornell University arXiv repository, adstex (https://github.com/yymao/adstex) and the Python programming language.

References

  • Abbott et al. (2016) Abbott, B. P., Abbott, R., Abbott, T. D., et al. 2016, Phys. Rev. Lett., 116, 241102
  • Abbott et al. (2020) Abbott, B. P., Abbott, R., Abbott, T. D., et al. 2020, Classical and Quantum Gravity, 37, 055002
  • Antil et al. (2018) Antil, H., Chen, D., & Field, S. E. 2018, Comput. Sci. Eng., 20, 10
  • Ballarin et al. (2015) Ballarin, F., Sartori, A., & Rozza, G. 2015, ScienceOpen Posters
  • Barrault et al. (2004) Barrault, M., Maday, Y., Nguyen, N. C., & Patera, A. T. 2004, Comptes Rendus Mathematique, 339, 667
  • Blackman et al. (2015) Blackman, J., Field, S. E., Galley, C. R., et al. 2015, Phys. Rev. Lett., 115, 121102
  • Blackman et al. (2017a) Blackman, J., Field, S. E., Scheel, M. A., et al. 2017a, Phys. Rev. D, 95, 104023
  • Blackman et al. (2017b) Blackman, J., Field, S. E., Scheel, M. A., et al. 2017b, Phys. Rev. D, 96, 024058
  • Blanchet (2006) Blanchet, L. 2006, Living Rev. Rel., 9, 4
  • Boyaval et al. (2009) Boyaval, S., Bris, C. L., Maday, Y., Nguyen, N. C., & Patera, A. T. 2009, Computer Methods in Applied Mechanics and Engineering, 198, 3187
  • Chaturantabut & Sorensen (2010) Chaturantabut, S. & Sorensen, D. 2010, SIAM J. Scientific Computing, 32, 2737
  • Cutler & Flanagan (1994) Cutler, C. & Flanagan, E. E. 1994, Phys. Rev. D, 49, 2658
  • Damour & Nagar (2011) Damour, T. & Nagar, A. 2011, Fundam. Theor. Phys., 162, 211
  • Field et al. (2011) Field, S. E., Galley, C. R., Herrmann, F., et al. 2011, Phys. Rev. Lett., 106, 221102
  • Field et al. (2014) Field, S. E., Galley, C. R., Hesthaven, J. S., Kaye, J., & Tiglio, M. 2014, Phys. Rev. X, 4, 031006
  • Field et al. (2012) Field, S. E., Galley, C. R., & Ochsner, E. 2012, Phys. Rev., D86, 084046
  • Hannam et al. (2014) Hannam, M., Schmidt, P., Bohé, A., et al. 2014, Phys. Rev. Lett., 113, 151101
  • Hesthaven et al. (2015) Hesthaven, J. S., Rozza, G., & Stamm, B. 2015, Certified Reduced Basis Methods for Parametrized Partial Differential Equations, 1st edn., Springer Briefs in Mathematics (Switzerland: Springer), 135
  • Hoffmann (1989) Hoffmann, W. 1989, Computing, 41, 335
  • Initiative et al. (2006) Initiative, O. S., et al. 2006, The MIT License. [Online]. Available: https://opensource.org/licenses/MIT [Accessed 27 March 2017]
  • Jazayeri (2007) Jazayeri, M. 2007, in 2007 Future of Software Engineering, FOSE ’07 (USA: IEEE Computer Society), 199–213
  • J.H. Ahlberg & Walsh (1967) Ahlberg, J. H., Nilson, E. N., & Walsh, J. L. 1967, The Theory of Splines and Their Applications (Academic Press)
  • Khan et al. (2019) Khan, S., Chatziioannou, K., Hannam, M., & Ohme, F. 2019, Phys. Rev. D, 100, 024059
  • Lehner & Pretorius (2014) Lehner, L. & Pretorius, F. 2014, Ann. Rev. Astron. Astrophys., 52, 661
  • Lewis et al. (2000) Lewis, A., Challinor, A., & Lasenby, A. 2000, ApJ, 538, 473
  • Logg et al. (2011) Logg, A., Wells, G., & Mardal, K.-A. 2011, Automated solution of differential equations by the finite element method. The FEniCS book, Vol. 84
  • Maday et al. (2009) Maday, Y., Nguyen, N. C., Patera, A. T., & Pau, S. H. 2009, Communications on Pure and Applied Analysis, 8, 383
  • Milk et al. (2016) Milk, R., Rave, S., & Schindler, F. 2016, SIAM Journal on Scientific Computing, 38, S194
  • Miller & Maloney (1963) Miller, J. C. & Maloney, C. J. 1963, Commun. ACM, 6, 58
  • Okken (2017) Okken, B. 2017, Python testing with Pytest: simple, rapid, effective, and scalable (Pragmatic Bookshelf)
  • Planck Collaboration et al. (2020) Planck Collaboration, Aghanim, N., Akrami, Y., et al. 2020, A&A, 641, A6
  • Quarteroni et al. (2015) Quarteroni, A., Manzoni, A., & Negri, F. 2015, Reduced Basis Methods for Partial Differential Equations: An Introduction, UNITEXT (Springer International Publishing)
  • Rifat et al. (2020) Rifat, N. E. M., Field, S. E., Khanna, G., & Varma, V. 2020, Phys. Rev. D, 101, 081502
  • Sturani et al. (2010) Sturani, R., Fischetti, S., Cadonati, L., et al. 2010, Journal of Physics: Conference Series, 243, 012007
  • Tiglio & Villanueva (2020) Tiglio, M. & Villanueva, A. 2020
  • Tiglio & Villanueva (2021) Tiglio, M. & Villanueva, A. 2021, Reduced Order and Surrogate Models for Gravitational Waves
  • Van Rossum et al. (2001) Van Rossum, G., Warsaw, B., & Coghlan, N. 2001, Python. org, 1565
  • Varma et al. (2019a) Varma, V., Field, S. E., Scheel, M. A., et al. 2019a, Phys. Rev. Research., 1, 033015
  • Varma et al. (2019b) Varma, V., Field, S. E., Scheel, M. A., et al. 2019b, Phys. Rev. D, 99, 064045
  • Villanueva et al. (2021) Villanueva, A., Beroiz, M., Cabral, J., Chalela, M., & Dominguez, M. 2021, Benchmark dataset for arby
  • Virtanen et al. (2020) Virtanen, P., Gommers, R., Oliphant, T. E., et al. 2020, Nature methods, 17, 261
  • Walt et al. (2011) Walt, S., Colbert, S. C., & Varoquaux, G. 2011, Computing in Science and Engineering, 13, 22

 

Appendix A Build reduced bases

By defining

  • the training set {\cal K} := \{h_{\lambda_i}\}_{i=1}^{N},

  • the projection error at parameter \lambda,

    \sigma_i(\lambda) := \|h_{\lambda} - {\cal P}_i h_{\lambda}\|^2\,,

    where {\cal P}_i is the projector operator associated with an i-sized basis,

  • \epsilon: the greedy tolerance,

  • GS(h, basis): orthonormalizes h against a basis through a Gram-Schmidt procedure,

  • gp: the greedy points, and

  • rb: the reduced basis,

the Reduced Basis-greedy algorithm proceeds as follows:

Algorithm 2 RB greedy algorithm
1: Input: {\cal K}, \epsilon
2: Seed choice (arbitrary): \Lambda_1 \in {\cal T}
3: e_1 = h_{\Lambda_1} / \|h_{\Lambda_1}\|
4: rb = \{e_1\}, gp = \{\Lambda_1\}
5: \Lambda_2 = \text{argmax}_{\lambda \in {\cal T}}\, \sigma_1(\lambda)
6: \sigma_1 = \sigma_1(\Lambda_2)
7: Initialize i = 1
8: while \sigma_i > \epsilon do
9:     i = i + 1
10:     gp = gp \cup \{\Lambda_i\}
11:     e_i = GS(h_{\Lambda_i}, rb)
12:     rb = rb \cup \{e_i\}
13:     \Lambda_{i+1} = \text{argmax}_{\lambda \in {\cal T}}\, \sigma_i(\lambda)
14:     \sigma_i = \sigma_i(\Lambda_{i+1})
15: end while
16: Output: rb = \{e_i\}_{i=1}^{n} and gp = \{\Lambda_i\}_{i=1}^{n}
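For concreteness, the following is a minimal NumPy sketch of Algorithm 2 on synthetic data (Riemann weights on an equispaced grid are assumed; this is an illustration, not the Arby implementation):

import numpy as np

def rb_greedy(K, w, tol, seed=0):
    # Greedy reduced-basis construction (Alg. 2) for a training set K of shape (N, L),
    # with quadrature weights w of shape (L,). Returns the orthonormal basis and the greedy indices.
    def dot(a, b):
        return np.sum(w * np.conj(a) * b)

    def gram_schmidt(h, basis):
        for e in basis:
            h = h - dot(e, h) * e
        return h / np.sqrt(dot(h, h).real)

    basis = [K[seed] / np.sqrt(dot(K[seed], K[seed]).real)]
    gp = [seed]
    while True:
        # squared projection errors of every training element onto the current basis (Eq. 10)
        errors = [dot(h, h).real - sum(abs(dot(e, h)) ** 2 for e in basis) for h in K]
        i_max = int(np.argmax(errors))
        if errors[i_max] <= tol:
            break
        gp.append(i_max)
        basis.append(gram_schmidt(K[i_max], basis))
    return np.array(basis), gp

# quick check on a toy training set
x = np.linspace(0, 1, 1001)
w = np.full_like(x, x[1] - x[0])
K = np.array([np.sin(lam * x) for lam in np.linspace(1, 10, 50)])
rb, greedy_points = rb_greedy(K, w, tol=1e-12)
print(len(rb), greedy_points[:5])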

Appendix B Build Empirical Interpolants

Algorithm 3 EIM algorithm
1: Input: rb = \{e_i\}_{i=1}^{n}
2: X_1 = \text{argmax}_x |e_1|
3: for j = 2 \to n do
4:     Build {\cal I}_{j-1}[e_j](x)
5:     r(x) = e_j(x) - {\cal I}_{j-1}[e_j](x)
6:     X_j = \text{argmax}_x |r|
7: end for
8: Output: EIM nodes \{X_i\}_{i=1}^{n} and interpolant {\cal I}_n
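A minimal NumPy sketch of Algorithm 3, again for illustration only (the basis is stored as rows of an array; the final matrix collects the functions B_i(x) of Eq. (7)):

import numpy as np

def eim(rb):
    # Return the EIM node indices and the matrix whose rows are B_i(x) (Eq. 7).
    n = rb.shape[0]
    nodes = [int(np.argmax(np.abs(rb[0])))]
    for j in range(1, n):
        V = rb[:j, nodes].T                        # (V)_{ik} = e_k(X_i) for the current nodes
        coeffs = np.linalg.solve(V, rb[j, nodes])  # interpolate e_j with the (j-1)-point interpolant
        residual = rb[j] - coeffs @ rb[:j]
        nodes.append(int(np.argmax(np.abs(residual))))
    V = rb[:, nodes].T
    B = np.linalg.inv(V).T @ rb                    # B_i(x) = sum_j (V^{-1})_{ji} e_j(x)
    return nodes, B

# quick demo with an orthonormalized polynomial basis on [0, 1]
x = np.linspace(0, 1, 500)
q, _ = np.linalg.qr(np.vander(x, 5, increasing=True))
nodes, B = eim(q.T)
h = np.exp(x)                                      # empirical interpolation of a generic function
print(np.max(np.abs(h - h[nodes] @ B)))            # small, since exp is well approximated by the basis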

Appendix C On computing projection coefficients

The most relevant step in terms of computational cost in the RB greedy algorithm is the computation of the projection coefficients, that is, step 13 of Alg. 2. Taking full advantage of the reduced basis orthonormality

\langle e_i, e_j\rangle = \delta_{ij}\,, \quad i, j = 1, \ldots, n\,,

we can write projection errors as

\sigma_n(\lambda) = \|h_{\lambda} - {\cal P}_n h_{\lambda}\|^2 = \|h_{\lambda}\|^2 - \sum_{i=1}^{n} |c_i(\lambda)|^2\,, \qquad (10)

where c_i(\lambda) = \langle e_i, h_{\lambda}\rangle are the projection coefficients. Note that

\sigma_{n+1} = \sigma_n - |c_{n+1}|^2\,.

We omitted the \lambda label for simplicity. This allows for a constant computational cost when adding a new element to the basis, since one only needs to compute the projection coefficients for the (n+1)-th basis element while storing those corresponding to the previous basis. In practice, the orthonormalization of the basis is not perfect and carries some error due to machine precision \epsilon. Therefore, the inner products between basis elements read

\langle e_i, e_j\rangle = \delta_{ij} + \epsilon\,,

and this error propagates to the projection error as

\sigma_n = \|h\|^2 - \sum_{i=1}^{n} |c_i|^2 + \epsilon \sum_{i,j=1}^{n} \bar{c}_i c_j\,.

Then, naive implementations of this rule (saving the projection coefficients to compute the next projection error) can lead to undesired error amplifications whenever |c_i| > 1, and therefore to wrong estimations of the projection errors. This is avoided simply by normalizing the training set. By doing so, we ensure |c_i| \leq 1 and keep all orthonormalization errors under control. Arby provides an option to normalize the training set before building the reduced basis.