
Greedy Recombination Interpolation Method (GRIM)

Terry Lyons and Andrew D. McLeod
Abstract

In this paper we develop the Greedy Recombination Interpolation Method (GRIM) for finding sparse approximations of functions initially given as linear combinations of some (large) number of simpler functions. In a similar spirit to the CoSaMP algorithm, GRIM combines dynamic growth-based interpolation techniques and thinning-based reduction techniques. The dynamic growth-based aspect is a modification of the greedy growth utilised in the Generalised Empirical Interpolation Method (GEIM). A consequence of the modification is that our growth is not restricted to being one-per-step as it is in GEIM. The thinning-based aspect is carried out by recombination, which is the crucial component of the recent ground-breaking convex kernel quadrature method. GRIM provides the first use of recombination outside the setting of reducing the support of a measure. The sparsity of the approximation found by GRIM is controlled by the geometric concentration of the data in a sense that is related to a particular packing number of the data. We apply GRIM to a kernel quadrature task for the radial basis function kernel, and verify that its performance matches that of other contemporary kernel quadrature techniques.

1 Introduction

This article considers finding sparse approximations of functions with the aim of reducing computational complexity. Applications of sparse representations are wide ranging and include the theory of compressed sensing [DE03, CT05, Don06, CRT06], image processing [EMS08, BMPSZ08, BMPSZ09], facial recognition [GMSWY09], data assimilation [MM13], explainability [FMMT14, DF15], sensor placement in nuclear reactors [ABGMM16, ABCGMM18], reinforcement learning [BS18, HS19], DNA denoising [KK20], and inference acceleration within machine learning [ABDHP21, NPS22].

Loosely speaking, the commonly shared aim of sparse approximation problems is to approximate a complex system using only a few elementary features. In this generality, the problem is commonly tackled via Least Absolute Shrinkage and Selection Operator (LASSO) regression, which involves solving a minimisation problem under l1l^{1}-norm constraints. Constraining the l1l^{1}-norm is a convex relaxation of constraining the l0l^{0}-pseudo-norm, which counts the number of non-zero components of a vector. Since p=1p=1 is the smallest positive real number for which the lpl^{p}-norm is convex, constraining the l1l^{1}-norm can be viewed as the best convex approximation of constraining the l0l^{0}-pseudo-norm (i.e. constraining the number of non-zero components). Implementations of LASSO techniques can lead to near optimally sparse solutions in certain contexts [DET06, Tro06, Ela10]. The terminology “LASSO” originates in [Tib96], though imposing l1l^{1}-norm constraints is considered in the earlier works [CM73, SS86]. More recent LASSO-type techniques may be found in, for example, [LY07, DGOY08, GO09, XZ16, TW19].

Alternatives to LASSO include the Matching Pursuit (MP) sparse approximation algorithm introduced in [MZ93] and the CoSaMP algorithm introduced in [NT09]. The MP algorithm greedily grows, one-by-one, the collection of non-zero weights used to construct the approximation. The CoSaMP algorithm operates in a similar greedy manner but with the addition of two key properties. The first is that the non-zero weights are no longer found one-by-one; groups of several non-zero weights may be added at each step. Secondly, CoSaMP incorporates a naive pruning procedure at each step. After adding in the new weights at a step, the collection is then pruned down by retaining only the mm-largest weights for a chosen integer m1m\in{\mathbb{Z}}_{\geq 1}. The ethos of CoSaMP is particularly noteworthy for our purposes.

In this paper we assume that the system of interest is known to be a linear combination of some (large) number of features. Within this setting the goal is to identify a linear combination of a strict sub-collection of the features (i.e. not all the features) that gives, in some sense, a good approximation of the system (i.e. of the original given linear combination of all the features). Depending on the particular context considered, techniques for finding such sparse approximations include the pioneering Empirical Interpolation Method [BMNP04, GMNP07, MNPP09], its subsequent generalisation the Generalised Empirical Interpolation Method (GEIM) [MM13, MMT14], Pruning [Ree93, AK13, CHXZ20, GLSWZ21], Kernel Herding [Wel09a, Wel09b, CSW10, BLL15, BCGMO18, TT21, PTT22], Convex Kernel Quadrature [HLO21], and Kernel Thinning [DM21a, DM21b, DMS21]. The LASSO approach based on l1l^{1}-regularisation can still be used within this framework.

It is convenient for our purposes to consider the following loose categorisation of techniques for finding sparse approximations in this setting; those that are growth-based, and those that are thinning-based. Growth-based methods seek to inductively increase the size of a sub-collection of features, until the sub-collection is rich enough to well-approximate the entire collection of features. Thinning-based methods seek to inductively identify features that may be discarded without significantly affecting how well the remaining features can approximate the original entire collection. Of the techniques mentioned above, MP, EIM, GEIM and Kernel Herding are growth-based, whilst LASSO, Convex Kernel Quadrature and Kernel Thinning are thinning-based. An important observation is that the CoSaMP algorithm combines both growth-based and thinning-based techniques.

In this paper we work in the setting considered by Maday et al. when introducing GEIM [MM13, MMT14, MMPY15]. Namely, we let XX be a real Banach space and 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1} be a (large) positive integer. Assume that ={f1,,f𝒩}X{\cal F}=\{f_{1},\ldots,f_{{\cal N}}\}\subset X is a collection of non-zero elements, and that a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\}. Consider the element φX\varphi\in X defined by φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Let XX^{\ast} denote the dual of XX and suppose that ΣX\Sigma\subset X^{\ast} is a finite subset with cardinality Λ1\Lambda\in{\mathbb{Z}}_{\geq 1}. We use the terminology that the set {\cal F} consists of the features whilst the set Σ\Sigma consists of data. Then we consider the following sparse approximation problem. Given ε>0\varepsilon>0, find an element u=i=1𝒩bifiSpan()Xu=\sum_{i=1}^{{\cal N}}b_{i}f_{i}\in\operatorname{Span}({\cal F})\subset X such that the cardinality of the set {i{1,,𝒩}:bi0}\{i\in\{1,\ldots,{\cal N}\}:b_{i}\neq 0\} is less than 𝒩{\cal N} and that uu is close to φ\varphi throughout Σ\Sigma in the sense that, for every σΣ\sigma\in\Sigma, we have |σ(φu)|ε|\sigma(\varphi-u)|\leq\varepsilon.

The tasks of approximating sums of continuous functions on finite subsets of Euclidean space, cubature for empirical measures, and kernel quadrature for kernels defined on finite sets are included as particular examples within this general mathematical framework (see Section 2 for full details). The inclusion of approximating a linear combination of continuous functions within this general framework ensures that several machine learning related tasks are included. For example, each layer of a neural network is typically given by a linear combination of continuous functions. Hence the task of approximating a layer within a neural network by a sparse layer is covered within this general framework. Observe that there is no requirement that the original layer is fully connected; in particular, it may itself already be a sparse layer. Consequently, any approach to the approximation problem covered in the general framework above could be used to carry out the reduction step (in which a fixed proportion of a model's weights are reduced to zero) in the recently proposed sparse training algorithms [GLMNS18, EJOPR20, LMPY21, JLPST21, FHLLMMPSWY23].

The GEIM approach developed by Maday et al. [MM13, MMT14, MMPY15] involves dynamically growing a subset FF\subset{\cal F} of the features and a subset LΣL\subset\Sigma of the data. At each step a new feature from {\cal F} is added to FF, and a new piece of data from Σ\Sigma is added to LL. An approximation of φ\varphi is then constructed via the following interpolation problem. Find a linear combination of the elements in FF that coincides with φ\varphi throughout the subset LΣL\subset\Sigma. This dynamic growth is feature-driven in the following sense. The new feature to be added to FF is selected first, and this new feature is subsequently used to determine how to extend the collection LL of linear functionals on which an approximation will be required to match the target.

The growth is greedy in the sense that the element chosen to be added to FF is, in some sense, the one worst approximated by the current collection FF. If we let ff\in{\cal F} be the newly selected feature and J[f]J[f] be the linear combination of the previously selected features in FF that coincides with ff on LL, then the element to be added to LL is the one achieving the maximum absolute value when acting on fJ[f]f-J[f]. A more detailed overview of GEIM can be found in Section 2 of this paper; full details of GEIM may be found in [MM13, MMT14, MMPY15].

Momentarily restricting to the particular task of kernel quadrature, the recent thinning-based Convex Kernel Quadrature approach proposed by Satoshi Hayakawa, the first author, and Harald Oberhauser in [HLO21] outperforms existing techniques such as Monte Carlo, Kernel Herding [Wel09a, Wel09b, CSW10, BLL15, BCGMO18, PTT22], and Kernel Thinning [DM21a, DM21b, DMS21], obtaining new state-of-the-art results. Central to this approach is the recombination algorithm [LL12, Tch15, ACO20]. Originating in [LL12] as a technique for reducing the support of a probability measure whilst preserving a specified list of moments, at its core recombination is a method of reducing the number of non-zero components in a solution of a system of linear equations whilst preserving convexity.

An improved implementation of recombination is given in [Tch15] that is significantly more efficient than the original implementation in [LL12]. A novel implementation of the method from [Tch15] was provided in [ACO20]. A modified variant of the implementation from [ACO20] is used in the convex kernel quadrature approach developed in [HLO21]. A method with the same complexity as the original implementation in [LL12] was introduced in the works [FJM19, FJM22].

Returning to our general framework, we develop the Greedy Recombination Interpolation Method (GRIM) which is a novel hybrid combination of the dynamic growth of a greedy selection algorithm, in a similar spirit to GEIM [MM13, MMT14, MMPY15], with the thinning reduction of recombination that underpins the successful convex kernel quadrature approach of [HLO21]. GRIM dynamically grows a collection of linear functionals LΣL\subset\Sigma. After each extension of LL, we apply recombination to find an approximation of φ\varphi that coincides with φ\varphi throughout LL (cf. the recombination thinning Lemma 3.1). Subsequently, for a chosen integer m1m\in{\mathbb{Z}}_{\geq 1}, we extend LL by adding the mm linear functionals from Σ\Sigma achieving the mm largest absolute values when applied to the difference between φ\varphi and the current approximation of φ\varphi (cf. Section 4). We inductively repeat these steps a set number of times. Evidently GRIM is a hybrid of growth-based and thinning-based techniques in a similar spirit to the CoSaMP algorithm [NT09].

Whilst the dynamic growth of GRIM is in a similar spirit to that of GEIM, there are several important distinctions between GRIM and GEIM. The growth in GRIM is data-driven rather than feature-driven. In GRIM, the extension of the collection of data with respect to which the approximation is required to interpolate φ\varphi does not involve making any choices of features from {\cal F}. The new information to be matched is determined by examining where in Σ\Sigma the current approximation is furthest from the target φ\varphi (cf. Section 4).

Expanding on this point, we only dynamically grow a subset LΣL\subset\Sigma of the data and do not grow a subset FF\subset{\cal F} of the features. In particular, we do not pre-determine the features that an approximation will be a linear combination of. Instead, the features to be used are determined by recombination (cf. the recombination thinning Lemma 3.1). Indeed, besides an upper bound on the number of features being used to construct an approximation (cf. Section 4), we have no control over the features used. Allowing recombination and the data Σ\Sigma to determine which features are used means, in particular, that none of the features used at one specific step are required to be used at any subsequent step.

GRIM is not limited to extending the subset LΣL\subset\Sigma by a single linear functional at each step. GRIM is capable of extending LL by mm linear functionals for any given integer m1m\in{\mathbb{Z}}_{\geq 1} (modulo restrictions to avoid adding more linear functionals than there are in the original subset ΣX\Sigma\subset X^{\ast} itself). The number of new linear functionals to be added at a step is a hyperparameter that can be optimised during numerical implementations.

Unlike [HLO21], our use of recombination is not restricted to the setting of reducing the support of a discrete measure. After each extension of LΣL\subset\Sigma, we use recombination [LL12, Tch15, ACO20] to find an element uSpan()u\in\operatorname{Span}({\cal F}) satisfying that uφu\equiv\varphi throughout LL. Recombination is applied to the linear system consisting of, for each σL\sigma\in L, the equation i=1𝒩aiσ(fi)=σ(φ)\sum_{i=1}^{{\cal N}}a_{i}\sigma(f_{i})=\sigma(\varphi), together with the equation recording the sum of the coefficients a1,,a𝒩a_{1},\ldots,a_{{\cal N}} (i.e. the trivial equation a1++a𝒩=i=1𝒩aia_{1}+\ldots+a_{{\cal N}}=\sum_{i=1}^{{\cal N}}a_{i}) (cf. the recombination thinning Lemma 3.1).

Our use of recombination means, in particular, that GRIM can be used for cubature and kernel quadrature (cf. Sections 2 and 4). Since recombination preserves convexity [LL12], the benefits of convex weights enjoyed by the convex kernel quadrature approach in [HLO21] are inherited by GRIM (cf. Section 2).

Moreover, at each step we optimise our use of recombination over multiple permutations of the orderings of the equations determining the linear system to which recombination is applied (cf. the Banach Recombination Step in Section 4). The number of permutations to be considered at each step gives a parameter that may be optimised during applications of GRIM.

Whilst we analyse the complexity cost of the Banach GRIM algorithm (cf. Section 5), computational efficiency is not our top priority. GRIM is designed to be a one-time tool; it is applied a single time to try and find a sparse approximation of the target system φX\varphi\in X. The cost of its implementation is then recouped through the repeated use of the resulting approximation for inference via new inputs. Thus GRIM is ideally suited for use in cases where the model will be repeatedly computed on new inputs for the purpose of inference/prediction. Examples of such models include, in particular, those trained for character recognition, medical diagnosis, and action recognition.

With this in mind, the primary aim of our complexity cost considerations is to verify that implementing GRIM is feasible. We verify this by proving that, at worst, the complexity cost of GRIM is 𝒪(Λ2𝒩+sΛ4log(𝒩)){\cal O}(\Lambda^{2}{\cal N}+s\Lambda^{4}\log({\cal N})) where 𝒩{\cal N} is the number of features in {\cal F}, Λ\Lambda is the number of linear functionals forming the data Σ\Sigma, and ss is the maximum number of shuffles considered during each application of recombination (cf. Lemma 5.3).

The remainder of the paper is structured as follows. In Section 2 we fix the mathematical framework that will be used throughout the article and motivate its consideration. In particular, we illustrate some specific examples covered by our framework. Additionally, we summarise the GEIM approach of Maday et al. [MM13, MMT14, MMPY15] and highlight the fundamental differences in the philosophies underlying GEIM and GRIM.

An explanation of the recombination algorithm and its use within our setting is provided in Section 3. In particular, we prove the Recombination Thinning Lemma 3.1 detailing our use of recombination to find uSpan()u\in\operatorname{Span}({\cal F}) coinciding with φ\varphi on a given finite subset LXL\subset X^{\ast}.

The Banach GRIM algorithm is both presented and discussed in Section 4. We formulate the Banach Extension Step governing how we extend an existing collection of linear functionals LΣL\subset\Sigma, the Banach Recombination Step specifying how we use the recombination thinning Lemma 3.1 in the Banach GRIM algorithm, and provide an upper bound for the number of features used to construct the approximation at each step.

The complexity cost of the Banach GRIM algorithm is considered in Section 5. We prove Lemma 5.3 establishing the complexity cost of any implementation of the Banach GRIM algorithm, and subsequently establish an upper bound complexity cost for the most expensive implementation.

The theoretical performance of the Banach GRIM algorithm is considered in Section 6. Theoretical guarantees in terms of a specific geometric property of Σ\Sigma are established for the Banach GRIM algorithm in which a single new linear functional is chosen at each step (cf. the Banach GRIM Convergence Theorem 6.2). The specific geometric property is related to a particular packing number of Σ\Sigma in XX^{\ast} (cf. Subsection 6.2). The packing number of a subset of a Banach space is closely related to the covering number of the subset. Covering and packing numbers, first studied by Kolmogorov [Kol56], arise in a variety of contexts including eigenvalue estimation [Car81, CS90, ET96], Gaussian Processes [LL99, LP04], and machine learning [EPP00, SSW01, Zho02, Ste03, SS07, Kuh11, MRT12, FS21].

In Section 7 we compare the performance of GRIM against other reduction techniques on three tasks. The first is an L2L^{2}-approximation task motivated by an example in [MMPY15]. The second is a kernel quadrature task using machine learning datasets as considered in [HLO21]. In particular, we illustrate that GRIM matches the performance of the tailor-made convex kernel quadrature technique developed in [HLO21]. The third task is approximating the action recognition model from [JLNSY17] for the purpose of inference acceleration. In particular, this task involves approximating a function in a pointwise sense that is outside the Hilbert space framework of the preceding two examples.

Acknowledgements: This work was supported by the DataSig Program under the EPSRC grant ES/S026347/1, the Alan Turing Institute under the EPSRC grant EP/N510129/1, the Data Centric Engineering Programme (under Lloyd’s Register Foundation grant G0095), the Defence and Security Programme (funded by the UK Government) and the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA).

2 Mathematical Framework & Motivation

In this section we rigorously formulate the sparse approximation problem that GRIM will be designed to tackle. Further, we briefly summarise the Generalised Empirical Interpolation Method (GEIM) of Maday et al. [MM13, MMT14, MMPY15] to both highlight some of the ideas we utilise in GRIM, and to additionally highlight the key novel properties not satisfied by GEIM that will be satisfied by GRIM. We first fix the mathematical framework in which we will work for the remainder of this paper.

Let XX be a Banach space and 𝒩>0{\cal N}\in{\mathbb{Z}}_{>0} be a (large) positive integer. Assume that ={f1,,f𝒩}X{\cal F}=\{f_{1},\ldots,f_{{\cal N}}\}\subset X is a collection of non-zero elements, and that a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\}. Consider the element φX\varphi\in X defined by

φ=i=1𝒩aifi.\varphi=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. (2.1)

Let XX^{\ast} denote the dual of XX and suppose that ΣX\Sigma\subset X^{\ast} is a finite subset of cardinality Λ1\Lambda\in{\mathbb{Z}}_{\geq 1}. We use the terminology that the set {\cal F} consists of features whilst the set Σ\Sigma consists of data. We consider the following sparse approximation problem. Given ε>0\varepsilon>0, find an element u=i=1𝒩bifiSpan()u=\sum_{i=1}^{{\cal N}}b_{i}f_{i}\in\operatorname{Span}({\cal F}) such that the cardinality of the set {i{1,,𝒩}:bi0}\left\{i\in\{1,\ldots,{\cal N}\}~{}:~{}b_{i}\neq 0\right\} is less than 𝒩{\cal N} and that uu is close to φ\varphi throughout Σ\Sigma in the sense that, for every σΣ\sigma\in\Sigma, we have |σ(φu)|ε|\sigma(\varphi-u)|\leq\varepsilon.

The task of finding a sparse approximation of a sum of continuous functions, defined on a finite set of Euclidean space, is within this framework. More precisely, let N,d,E1N,d,E\in{\mathbb{Z}}_{\geq 1}, a1,,aNa_{1},\ldots,a_{N}\in{\mathbb{R}}, Ωd\Omega\subset{\mathbb{R}}^{d} be a finite subset, and, for each i{1,,N}i\in\{1,\ldots,N\}, fiC0(Ω;E)f_{i}\in C^{0}\left(\Omega;{\mathbb{R}}^{E}\right) be a continuous function ΩE\Omega\to{\mathbb{R}}^{E}. Then finding a sparse approximation of the continuous function F:=i=1NaifiF:=\sum_{i=1}^{N}a_{i}f_{i} is within this framework. To see this, first let e1,,eEEe_{1},\ldots,e_{E}\in{\mathbb{R}}^{E} be the standard basis of E{\mathbb{R}}^{E} that is orthonormal with respect to the Euclidean dot product ,E\left<\cdot,\cdot\right>_{{\mathbb{R}}^{E}} on E{\mathbb{R}}^{E}. Then note that for each point pΩp\in\Omega and every j{1,,E}j\in\{1,\ldots,E\} the mapping ff(p),ejEf\mapsto\left<f(p),e_{j}\right>_{{\mathbb{R}}^{E}} determines a linear functional C0(Ω;E)C^{0}(\Omega;{\mathbb{R}}^{E})\to{\mathbb{R}} that is in the dual space C0(Ω;E)C^{0}(\Omega;{\mathbb{R}}^{E})^{\ast}. Denote this linear functional by δp,j:C0(Ω;E)\delta_{p,j}:C^{0}(\Omega;{\mathbb{R}}^{E})\to{\mathbb{R}}. Therefore by choosing X:=C0(Ω;E)X:=C^{0}(\Omega;{\mathbb{R}}^{E}), 𝒩:=N{\cal N}:=N, and Σ:={δp,j:pΩandj{1,,E}}X\Sigma:=\left\{\delta_{p,j}:p\in\Omega~{}\text{and}~{}j\in\{1,\ldots,E\}\right\}\subset X^{\ast}, we see that this problem is within our framework. Here we are also using the observation that if f,hC0(Ω;E)f,h\in C^{0}(\Omega;{\mathbb{R}}^{E}) satisfy, for every pΩp\in\Omega and every j{1,,E}j\in\{1,\ldots,E\}, that |δp,j[fh]|ε|\delta_{p,j}[f-h]|\leq\varepsilon, then we have fhC0(Ω;E)Cε||f-h||_{C^{0}(\Omega;{\mathbb{R}}^{E})}\leq C\varepsilon for a constant C1C\geq 1 depending on the particular norm chosen on E{\mathbb{R}}^{E}. Thus finding an approximation uu of FF that satisfies, for every σΣ\sigma\in\Sigma, that |σ(Fu)|ε/C|\sigma(F-u)|\leq\varepsilon/C allows us to conclude that FuC0(Ω;E)ε||F-u||_{C^{0}(\Omega;{\mathbb{R}}^{E})}\leq\varepsilon.
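To make this correspondence concrete, the following minimal Python sketch (assuming NumPy; all names such as features and sigma_of_f are hypothetical and the features are random purely for illustration) assembles the matrix of values σ(f_i) for the point-evaluation functionals δ_{p,j}, which is all the information about {\cal F} and Σ that the methods discussed below require.

```python
import numpy as np

# Minimal illustrative sketch of the C^0 example; all names are hypothetical.
rng = np.random.default_rng(0)
N, d, E, n_points = 50, 2, 3, 40
Omega = rng.normal(size=(n_points, d))                 # the finite subset Omega of R^d
# Features f_i : Omega -> R^E; random tanh features purely for illustration.
features = [lambda x, W=rng.normal(size=(d, E)): np.tanh(x @ W) for _ in range(N)]
a = rng.uniform(0.1, 1.0, size=N)                      # coefficients of phi = sum_i a_i f_i

# Sigma consists of the evaluation functionals delta_{p,j}; evaluating every feature
# at every pair (p, j) gives the matrix with entries sigma(f_i).
sigma_of_f = np.stack([f(Omega).reshape(-1) for f in features])   # shape (N, n_points * E)
sigma_of_phi = a @ sigma_of_f                                     # the values sigma(phi)
```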

The cubature problem [Str71] for empirical measures, which may be combined with sampling to offer an approach to the cubature problem for general signed measures, is within this framework. More precisely, let N1N\in{\mathbb{Z}}_{\geq 1}, a1,,aN>0a_{1},\ldots,a_{N}>0, Ω={x1,,xN}d\Omega=\{x_{1},\ldots,x_{N}\}\subset{\mathbb{R}}^{d}, and [Ω]{\cal M}[\Omega] denote the collection of finite signed measures on Ω\Omega. Recall that [Ω]{\cal M}[\Omega] can be viewed as a subset of C0(Ω)C^{0}(\Omega)^{\ast} by defining, for ν[Ω]\nu\in{\cal M}[\Omega] and ψC0(Ω)\psi\in C^{0}(\Omega), ν[ψ]:=Ωψ(x)𝑑ν(x)\nu[\psi]:=\int_{\Omega}\psi(x)d\nu(x). Consider the empirical measure μ:=i=1Naiδxi\mu:=\sum_{i=1}^{N}a_{i}\delta_{x_{i}} and, for e=(e1,,ed)0de=(e_{1},\ldots,e_{d})\in{\mathbb{Z}}_{\geq 0}^{d}, define pe:dp_{e}:{\mathbb{R}}^{d}\to{\mathbb{R}} by pe(x1,,xd):=x1e1xdedp_{e}(x_{1},\ldots,x_{d}):=x_{1}^{e_{1}}\ldots x_{d}^{e_{d}}. Then the choices that X:=[Ω]X:={\cal M}[\Omega], 𝒩:=N{\cal N}:=N, and, for a given K0K\in{\mathbb{Z}}_{\geq 0}, Σ:={pe:e=(e1,,ed) with e1++edK}C0(Ω)[Ω]\Sigma:=\left\{p_{e}:e=(e_{1},\ldots,e_{d})\text{ with }e_{1}+\ldots+e_{d}\leq K\right\}\subset C^{0}(\Omega)\subset{\cal M}[\Omega]^{\ast} illustrate that the cubature problem of reducing the support of μ\mu whilst preserving its moments of order no greater than KK is within our framework.
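A sketch of this cubature instance (again Python with NumPy, purely illustrative and with hypothetical names) is as follows: the functionals in Σ are the monomials of total degree at most K, and the quantities that must be preserved are the corresponding moments of the empirical measure μ.

```python
import numpy as np
from itertools import product

# Illustrative sketch of the cubature instance; all names are hypothetical.
rng = np.random.default_rng(1)
d, K, N = 2, 3, 100
points = rng.uniform(size=(N, d))                      # support {x_1, ..., x_N} of mu
a = rng.uniform(size=N); a /= a.sum()                  # weights of mu = sum_i a_i delta_{x_i}

# Sigma: the monomials p_e of total degree at most K.
exponents = [e for e in product(range(K + 1), repeat=d) if sum(e) <= K]
# moments_matrix[l, i] = p_{e_l}(x_i), i.e. the l-th functional applied to delta_{x_i}.
moments_matrix = np.stack([np.prod(points ** np.array(e), axis=1) for e in exponents])
mu_moments = moments_matrix @ a                        # the moments of mu to be preserved
```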

Moreover the Kernel Quadrature problem for empirical probability measures is within our framework. To be more precise, let 𝒳{\cal X} be a finite set and k{\cal H}_{k} be a Reproducing Kernel Hilbert Space (RKHS) associated to a positive semi-definite symmetric kernel function k:𝒳×𝒳k:{\cal X}\times{\cal X}\to{\mathbb{R}} (appropriate definitions can be found, for example, in [BT11]). In this case k=Span({kx:x𝒳}){\cal H}_{k}=\operatorname{Span}\left(\left\{k_{x}:x\in{\cal X}\right\}\right) where, for x𝒳x\in{\cal X}, the function kx:𝒳k_{x}:{\cal X}\to{\mathbb{R}} is defined by kx(z):=k(x,z)k_{x}(z):=k(x,z) (see, for example, [BT11]).

In this context one can consider the Kernel Quadrature problem, for which the Kernel Herding [Wel09a, Wel09b, CSW10, BLL15, BCGMO18, TT21, PTT22], Convex Kernel Quadrature [HLO21] and Kernel Thinning [DM21a, DM21b, DMS21] methods have been developed. Given a probability measure μ[𝒳]\mu\in{\mathbb{P}}[{\cal X}] the Kernel Quadrature problem involves finding, for some n1n\in{\mathbb{Z}}_{\geq 1}, points z1,,zn𝒳z_{1},\ldots,z_{n}\in{\cal X} and weights w1,,wnw_{1},\ldots,w_{n}\in{\mathbb{R}} such that the measure μn:=j=1nwjδzj\mu_{n}:=\sum_{j=1}^{n}w_{j}\delta_{z_{j}} approximates μ\mu in the sense that, for every fkf\in{\cal H}_{k}, we have μn(f)μ(f)\mu_{n}(f)\approx\mu(f). Additionally requiring the weights w1,,wnw_{1},\ldots,w_{n} to be convex in the sense that they are all positive and sum to one (i.e. w1,,wn>0w_{1},\ldots,w_{n}>0 and w1++wn=1w_{1}+\ldots+w_{n}=1) ensures both better robustness properties for the approximation μnμ\mu_{n}\approx\mu and better estimates when the mm-fold product of quadrature formulas is used to approximate μm\mu^{\otimes m} on 𝒳m{\cal X}^{\otimes m}; see [HLO21].

A consequence of k=Span({kx:x𝒳}){\cal H}_{k}=\operatorname{Span}\left(\left\{k_{x}:x\in{\cal X}\right\}\right) is that linearity ensures that this approximate equality will be valid for all fkf\in{\cal H}_{k} provided it is true for every f{kx:x𝒳}f\in\{k_{x}:x\in{\cal X}\}. The inclusion kC0(𝒳){\cal H}_{k}\subset C^{0}({\cal X}) means that the subset {kx:x𝒳}k\{k_{x}:x\in{\cal X}\}\subset{\cal H}_{k} can be viewed as a finite subset of the dual space [𝒳]{\cal M}[{\cal X}]^{\ast}. The choice of X:=[𝒳]X:={\cal M}[{\cal X}] and Σ:={kx:x𝒳}\Sigma:=\{k_{x}:x\in{\cal X}\} illustrates that the Kernel Quadrature problem for empirical probability distributions μ[𝒳]\mu\in{\mathbb{P}}[{\cal X}] is within the framework we consider.
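The data required by this instance can be sketched as follows (Python with NumPy, with a hypothetical RBF kernel purely for illustration): the functionals k_x evaluate to the kernel matrix, and their action on μ gives the values of the kernel mean embedding that a quadrature rule must reproduce.

```python
import numpy as np

# Illustrative sketch of the kernel quadrature instance; names are hypothetical.
rng = np.random.default_rng(2)
points = rng.normal(size=(200, 2))                     # the finite set on which mu lives
a = np.full(len(points), 1.0 / len(points))            # empirical probability measure mu

def rbf(x, y, bandwidth=1.0):
    """Radial basis function kernel k(x, y) evaluated pairwise."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

K = rbf(points, points)        # K[l, i] = k(x_l, x_i), i.e. the functional k_{x_l} applied to delta_{x_i}
mu_embedding = K @ a           # mu(k_{x_l}) for every functional in Sigma
```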

The framework we consider is the same as the setup for which GEIM was developed by Maday et al. [MM13, MMT14, MMPY15]. It is useful to briefly recall this method.

GEIM

  1. (A)
    • Find h1:=argmax{fX:f}h_{1}:=\operatorname{argmax}\left\{||f||_{X}:f\in{\cal F}\right\} and σ1:=argmax{|σ(h1)|:σΣ}\sigma_{1}:=\operatorname{argmax}\left\{|\sigma(h_{1})|:\sigma\in\Sigma\right\}.

    • Define q1:=h1/σ1(h1)q_{1}:=h_{1}/\sigma_{1}(h_{1}), S1:={q1}S_{1}:=\{q_{1}\}, L1:={σ1}L_{1}:=\{\sigma_{1}\}, and an operator 𝒥1:Span()Span(S1){\cal J}_{1}:\operatorname{Span}({\cal F})\to\operatorname{Span}(S_{1}) by setting 𝒥1[w]:=σ1(w)q1{\cal J}_{1}[w]:=\sigma_{1}(w)q_{1} for every wSpan()w\in\operatorname{Span}({\cal F}).

    • Observe that, given any wSpan()w\in\operatorname{Span}({\cal F}), we have σ1(w𝒥1[w])=0\sigma_{1}(w-{\cal J}_{1}[w])=0.

  2. (B)

    Proceed recursively via the following inductive step for n2n\geq 2.

    • Suppose we have found subsets Sn1={q1,,qn1}Span()S_{n-1}=\{q_{1},\ldots,q_{n-1}\}\subset\operatorname{Span}({\cal F}) and Ln1={σ1,,σn1}ΣL_{n-1}=\{\sigma_{1},\ldots,\sigma_{n-1}\}\subset\Sigma, and an operator 𝒥n1:Span()Span(Sn1){\cal J}_{n-1}:\operatorname{Span}({\cal F})\to\operatorname{Span}(S_{n-1}) satisfying, for every wSpan()w\in\operatorname{Span}({\cal F}), that σ(w𝒥n1[w])=0\sigma(w-{\cal J}_{n-1}[w])=0 for every σLn1\sigma\in L_{n-1}.

    • Find hn:=argmax{f𝒥n1[f]X:f}h_{n}:=\operatorname{argmax}\left\{||f-{\cal J}_{n-1}[f]||_{X}:f\in{\cal F}\right\} and σn:=argmax{|σ(hn𝒥n1[hn])|:σΣ}\sigma_{n}:=\operatorname{argmax}\left\{|\sigma(h_{n}-{\cal J}_{n-1}[h_{n}])|:\sigma\in\Sigma\right\}.

    • Define

      qn:=hn𝒥n1[hn]σn(hn𝒥n1[hn]),q_{n}:=\frac{h_{n}-{\cal J}_{n-1}[h_{n}]}{\sigma_{n}(h_{n}-{\cal J}_{n-1}[h_{n}])},

      Sn:=Sn1{qn}Span()S_{n}:=S_{n-1}\cup\{q_{n}\}\subset\operatorname{Span}({\cal F}) and Ln:=Ln1{σn}ΣL_{n}:=L_{n-1}\cup\{\sigma_{n}\}\subset\Sigma.

    • Construct operator 𝒥n:Span()Span(Sn){\cal J}_{n}:\operatorname{Span}({\cal F})\to\operatorname{Span}(S_{n}) by defining

      𝒥n[w]:=𝒥n1[w]+σn(w𝒥n1[w])qn{\cal J}_{n}[w]:={\cal J}_{n-1}[w]+\sigma_{n}(w-{\cal J}_{n-1}[w])q_{n}

      for wSpan()w\in\operatorname{Span}({\cal F}). Direct computation verifies that, whenever wSpan()w\in\operatorname{Span}({\cal F}) and σLn\sigma\in L_{n}, we have σ(w𝒥n[w])=0\sigma(w-{\cal J}_{n}[w])=0.
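The following is a minimal sketch of GEIM under simplifying assumptions: each feature is stored as its vector of values (σ(f))_{σ∈Σ}, and the X-norm appearing in the argmax is replaced by the maximum absolute value over Σ (the natural choice in the C^0 examples of Section 2). It is illustrative only, written in Python with NumPy, and omits degenerate cases (e.g. a residual that vanishes identically).

```python
import numpy as np

def interpolate(W, Q, sel):
    """Evaluate J_n[w] for each row w of W, where J_n is built from the GEIM
    basis Q = [q_1, ..., q_n] and functional indices sel = [sigma_1, ..., sigma_n]."""
    W = np.asarray(W, dtype=float)
    J = np.zeros_like(W)
    for q, j in zip(Q, sel):
        # J_n[w] := J_{n-1}[w] + sigma_n(w - J_{n-1}[w]) q_n  (cf. step (B))
        J = J + np.outer(W[:, j] - J[:, j], q)
    return J

def geim(F, n_terms):
    """Greedy selection of GEIM steps (A)-(B).  Rows of F hold the values of the
    features over Sigma; the X-norm is approximated by the sup over Sigma."""
    F = np.asarray(F, dtype=float)
    Q, sel = [], []
    for _ in range(n_terms):
        R = F - interpolate(F, Q, sel)           # residuals f - J_{n-1}[f]
        h = R[np.argmax(np.abs(R).max(axis=1))]  # residual of the worst approximated feature
        j = int(np.argmax(np.abs(h)))            # functional maximising |sigma(h_n - J_{n-1}[h_n])|
        Q.append(h / h[j])                       # q_n
        sel.append(j)
    return Q, sel
```

With this sketch, an approximation of φ over Σ would then be obtained as interpolate(sigma_of_phi[None, :], Q, sel) for a row vector of values sigma_of_phi.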

The algorithm provides a sequence 𝒥1[φ]{\cal J}_{1}[\varphi], 𝒥2[φ],{\cal J}_{2}[\varphi],\ldots of approximations of φ\varphi. However, GEIM is intended to control the XX-norm of the difference between φ\varphi and its approximation, i.e. to have φ𝒥n[φ]X||\varphi-{\cal J}_{n}[\varphi]||_{X} be small for a suitable integer nn. This aim requires additional assumptions to be made regarding the data ΣX\Sigma\subset X^{\ast} which we do not impose (see [MMT14] for details). Recall that we aim only to approximate φ\varphi over the data Σ\Sigma, i.e. we seek an approximation uu such that |σ(φu)||\sigma(\varphi-u)| is small for every σΣ\sigma\in\Sigma. A consequence of this difference is that certain aspects of GEIM are not necessarily ideal for our task.

Firstly, the growth of the subset Ln1L_{n-1} to LnL_{n} is feature-driven. That is, a new feature from {\cal F} to be used by the next approximation is selected, and then this new feature is used to determine the new functional from Σ\Sigma to be added to the collection on which we require the approximation to coincide with the target φ\varphi (cf. GEIM (B)). Since we only seek to approximate φ\varphi over the data Σ\Sigma, data-driven growth would be preferable. That is, we would rather select the new information from Σ\Sigma to be matched by the next approximation before any consideration is given to determining which features from {\cal F} will be used to construct the new approximation.

Secondly, related to the first aspect, the features from {\cal F} to be used to construct 𝒥n[φ]{\cal J}_{n}[\varphi] are predetermined. Further, the features used to construct 𝒥n[φ]{\cal J}_{n}[\varphi] are forced to be included in the features used to construct 𝒥m[φ]{\cal J}_{m}[\varphi] for any m>nm>n. This restriction is not guaranteed to be sensible; it is conceivable that the features that work well at one stage are disjoint from the features that work well at another. We would prefer that the features used to construct an approximation be determined by the target φ\varphi and the data selected from Σ\Sigma on which we require the approximation to match φ\varphi. This would avoid retaining features used at one step that become less effective at later steps, and could offer insight regarding which of the features in {\cal F} are sufficient to capture the behaviour of φ\varphi on Σ\Sigma.

Thirdly, at each step GEIM can provide an approximation for any wSpan()w\in\operatorname{Span}({\cal F}). That is, for n1n\in{\mathbb{Z}}_{\geq 1}, the element 𝒥n[w]{\cal J}_{n}[w] is a linear combination of the elements in SnS_{n} that coincides with ww on LnL_{n}. Requiring the operator 𝒥n:Span()Span(Sn){\cal J}_{n}:\operatorname{Span}({\cal F})\to\operatorname{Span}(S_{n}) to provide such an approximation for every wSpan()w\in\operatorname{Span}({\cal F}) stops the method from exploiting any advantages available for the particular choice of φSpan()\varphi\in\operatorname{Span}({\cal F}). As we only aim to approximate φ\varphi itself, we would prefer to remove this requirement and allow the method the opportunity to exploit advantages resulting from the specific choice of φSpan()\varphi\in\operatorname{Span}({\cal F}).

All three aspects are addressed in GRIM. The greedy growth of a subset LΣL\subset\Sigma, determining the functionals in Σ\Sigma at which an approximation is required to agree with φ\varphi, is data-driven. At each step the desired approximation is found using recombination [LL12, Tch15] so that the features used to construct the approximation are determined by φ\varphi and the subset LL and, in particular, are not predetermined. This use of recombination ensures both that GRIM only produces approximations of φ\varphi, and that GRIM can exploit advantages resulting from the specific choice of φSpan()\varphi\in\operatorname{Span}({\cal F}).

3 Recombination Thinning

In this section we illustrate how, given a finite collection LXL\subset X^{\ast} of linear functionals, recombination [LL12, Tch15] can be used to find an element uSpan()Xu\in\operatorname{Span}({\cal F})\subset X that coincides with φ\varphi throughout LL, provided we can compute the values σ(fi)\sigma(f_{i}) for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} and every linear functional σL\sigma\in L. The recombination algorithm was initially introduced by Christian Litterer and the first author in [LL12]; a substantially improved implementation was provided in the PhD thesis of Maria Tchernychova [Tch15]. A novel implementation of the method from [Tch15] was provided in [ACO20]. A method with the same complexity as the original implementation in [LL12] was introduced in the works [FJM19, FJM22]. Recombination has been applied in a number of contexts including particle filtering [LL16], kernel quadrature [HLO21], and mathematical finance [NS21].

For the reader's convenience we briefly overview the ideas involved in the recombination algorithm. For this illustrative purpose consider a linear system of equations 𝐀𝐱=𝐲{\bf A}{\bf x}={\bf y} where 𝐀k×m{\bf A}\in{\mathbb{R}}^{k\times m}, 𝐱m{\bf x}\in{\mathbb{R}}^{m}, 𝐲k{\bf y}\in{\mathbb{R}}^{k}, and k,m1k,m\in{\mathbb{Z}}_{\geq 1} with mkm\geq k. We assume, for every i{1,,m}i\in\{1,\ldots,m\}, that 𝐱i>0{\bf x}_{i}>0. Recombination relies on the simple observation that this linear system is preserved under translations of 𝐱{\bf x} by elements in the kernel of the matrix 𝐀{\bf A}. To be more precise, if 𝐱{\bf x} satisfies that 𝐀𝐱=𝐲{\bf A}{\bf x}={\bf y}, and if 𝐞ker(𝐀){\bf e}\in\ker({\bf A}) and θ\theta\in{\mathbb{R}}, then 𝐱+θ𝐞{\bf x}+\theta{\bf e} also satisfies that 𝐀(𝐱+θ𝐞)=𝐲{\bf A}\left({\bf x}+\theta{\bf e}\right)={\bf y}. The recombination algorithm makes use of linearly independent elements of ker(𝐀)\ker({\bf A}) to reduce the number of non-zero entries in the solution vector 𝐱{\bf x}.

As outlined in [LL12], this could in principle be done as follows. First, we choose a basis for the kernel of 𝐀{\bf A}. Computing the Singular Value Decomposition (SVD) of 𝐀{\bf A} gives a method of finding such a basis that is well-suited to dealing with numerical instabilities. Supposing that ker(𝐀){0}\ker({\bf A})\neq\{0\}, let 𝐞1,,𝐞l{\bf e}_{1},\ldots,{\bf e}_{l} be a basis of ker(𝐀)\ker({\bf A}) found via SVD. For each j{1,,l}j\in\{1,\ldots,l\} denote the coefficients of 𝐞j{\bf e}_{j} by 𝐞j,1,,𝐞j,m{\bf e}_{j,1},\ldots,{\bf e}_{j,m}\in{\mathbb{R}}.

Consider the element 𝐞1m{\bf e}_{1}\in{\mathbb{R}}^{m}. Choose i{1,,m}i\in\{1,\ldots,m\} such that

𝐱i𝐞1,i=min{𝐱j𝐞1,j:𝐞1,j>0}.\frac{{\bf x}_{i}}{{\bf e}_{1,i}}=\min\left\{\frac{{\bf x}_{j}}{{\bf e}_{1,j}}:{\bf e}_{1,j}>0\right\}. (3.1)

Replace 𝐱{\bf x} by the vector 𝐱(𝐱i/𝐞1,i)𝐞1{\bf x}-\left({\bf x}_{i}/{\bf e}_{1,i}\right){\bf e}_{1}. The new 𝐱{\bf x} remains a solution to 𝐀𝐱=𝐲{\bf A}{\bf x}={\bf y}, and now additionally satisfies that 𝐱i=0{\bf x}_{i}=0. Moreover, for every j{1,,m}j\in\{1,\ldots,m\} such that jij\neq i, our choice of ii in (3.1) ensures that 𝐱j0{\bf x}_{j}\geq 0. Finally, for j{2,,l}j\in\{2,\ldots,l\}, replace 𝐞j{\bf e}_{j} by 𝐞j(𝐞j,i/𝐞1,i)𝐞1{\bf e}_{j}-({\bf e}_{j,i}/{\bf e}_{1,i}){\bf e}_{1} to ensure that 𝐞j,i=0{\bf e}_{j,i}=0. This final alteration allows us to repeat the process for the new 𝐱{\bf x} using the new 𝐞2{\bf e}_{2} in place of 𝐞1{\bf e}_{1} since the addition of any scalar multiple of 𝐞2{\bf e}_{2} to 𝐱{\bf x} will not change the fact that 𝐱i=0{\bf x}_{i}=0.

After iteratively repeating this procedure for j=2,,lj=2,\ldots,l, we obtain a vector 𝐱m{\bf x}^{\prime}\in{\mathbb{R}}^{m} whose coefficients are all non-negative, with at most mlm-l of the coefficients being non-zero, and still satisfying that 𝐀𝐱=𝐲{\bf A}{\bf x}^{\prime}={\bf y}. That is, the original solution 𝐱{\bf x} has been reduced to a new solution 𝐱{\bf x}^{\prime} with at least ll fewer non-zero coefficients than the original solution 𝐱{\bf x}.
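The following is a minimal sketch, in Python with NumPy, of the naive elimination procedure just described: it extracts a basis of ker(𝐀) from the SVD and removes one positive coefficient per kernel direction. It is illustrative only; the tree-based implementations of [Tch15, ACO20] used in practice are substantially more efficient.

```python
import numpy as np

def naive_recombination(A, x, tol=1e-12):
    """Given A @ x = y with x >= 0, return a new non-negative solution with at
    least dim(ker(A)) additional zero entries, following the outline above."""
    A = np.asarray(A, dtype=float)
    x = np.asarray(x, dtype=float).copy()
    k, m = A.shape
    # Basis of ker(A) via the SVD: right singular vectors whose singular values
    # are (numerically) zero, plus those beyond the first k.
    _, s, Vt = np.linalg.svd(A)
    zero = np.concatenate([s <= tol * max(s.max(), 1.0),
                           np.ones(m - len(s), dtype=bool)])
    kernel = [Vt[row].copy() for row in np.flatnonzero(zero)]
    for idx, e in enumerate(kernel):
        if not (e > tol).any():
            e = -e                          # flip sign so that some component is positive
        pos = e > tol
        if not pos.any():
            continue
        # Choose i minimising x_i / e_i over the positive components (cf. (3.1)).
        ratios = np.full(m, np.inf)
        ratios[pos] = x[pos] / e[pos]
        i = int(np.argmin(ratios))
        x = x - (x[i] / e[i]) * e           # preserves A @ x = y and keeps x >= 0
        x[i] = 0.0
        # Force the remaining kernel vectors to vanish at coordinate i, so that
        # later eliminations do not reintroduce a non-zero entry there.
        for jdx in range(idx + 1, len(kernel)):
            kernel[jdx] = kernel[jdx] - (kernel[jdx][i] / e[i]) * e
    return x
```

This sketch eliminates one coefficient per kernel direction in a single pass over a basis of ker(𝐀); the implementations of [LL12, Tch15] instead work on sub-systems of manageable size in order to control the cost.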

Observe that the positivity of the original coefficients is weakly preserved. That is, if we let x1,,xmx^{\prime}_{1},\ldots,x^{\prime}_{m}\in{\mathbb{R}} denote the coefficients of the vector 𝐱m{\bf x}^{\prime}\in{\mathbb{R}}^{m}, then for each i{1,,m}i\in\{1,\ldots,m\} we have that xi0x^{\prime}_{i}\geq 0. One consequence of this preservation is that recombination can be used to reduce the support of an empirical measure whilst preserving a given finite collection of moments [LL12]. Moreover, this property is essential to the ground-breaking state-of-the-art convex kernel quadrature method developed by Satoshi Hayakawa, the first author, and Harald Oberhauser in [HLO21].

The implementation of recombination proposed in [LL12] iteratively makes use of the method outlined above, for m=2km=2k, applied to linear systems arising via sub-dividing the original system. The improved tree-based method developed by Maria Tchernychova in [Tch15] provides a significantly more efficient implementation. A geometrically greedy algorithm is proposed in [ACO20] to implement the algorithm of [Tch15]. In the setting of our example above, it follows from [Tch15] that the complexity of the recombination algorithm is O(mk+k3log(m/k))O(mk+k^{3}\log(m/k)).

Having outlined the recombination method, we turn our attention to establishing the following result regarding the use of recombination in our setting.

Lemma 3.1 (Recombination Thinning).

Assume XX is a Banach space with dual space XX^{\ast}, that 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1}, and that m0m\in{\mathbb{Z}}_{\geq 0}. Define :=min{𝒩,m+1}{\cal M}:=\min\{{\cal N},m+1\}. Let :={f1,,f𝒩}X{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X be a collection of non-zero elements and L={σ1,,σm}XL=\{\sigma_{1},\ldots,\sigma_{m}\}\subset X^{\ast}. Suppose a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 and consider the element φSpan()X\varphi\in\operatorname{Span}({\cal F})\subset X defined by φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Then recombination can be applied to find non-negative coefficients b1,,b0b_{1},\ldots,b_{{\cal M}}\geq 0 and indices e(1),,e(){1,,𝒩}e(1),\ldots,e({\cal M})\in\{1,\ldots,{\cal N}\} satisfying that

j=1bj=i=1𝒩ai,\sum_{j=1}^{{\cal M}}b_{j}=\sum_{i=1}^{{\cal N}}a_{i}, (3.2)

and such that the element uSpan()Xu\in\operatorname{Span}({\cal F})\subset X defined by

u:=j=1bjfe(j)satisfies, for every σL, thatσ(φu)=0.u:=\sum_{j=1}^{{\cal M}}b_{j}f_{e(j)}\quad\text{satisfies, for every }\sigma\in L,\text{ that}\quad\sigma(\varphi-u)=0. (3.3)

Finally, the complexity of this use of recombination is O(𝒩m+m3log(𝒩/m))O({\cal N}m+m^{3}\log({\cal N}/m)).

Proof of Lemma 3.1.

Assume XX is a Banach space with dual space XX^{\ast}, that 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1}, and that m0m\in{\mathbb{Z}}_{\geq 0}. Define :=min{𝒩,m+1}{\cal M}:=\min\{{\cal N},m+1\}. Let :={f1,,f𝒩}X{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X be a collection of non-zero elements and L={σ1,,σm}XL=\{\sigma_{1},\ldots,\sigma_{m}\}\subset X^{\ast}. Suppose a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 and consider the element φSpan()X\varphi\in\operatorname{Span}({\cal F})\subset X defined by φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}.

The values σ1(φ),,σm(φ)\sigma_{1}(\varphi),\ldots,\sigma_{m}(\varphi) and the sum of the coefficients i=1𝒩ai\sum_{i=1}^{{\cal N}}a_{i} give rise to the linear system of equations

(111σ1(f1)σ1(f2)σ1(f𝒩)σm(f1)σm(f2)σm(f𝒩))(a1a2a𝒩)=(i=1𝒩aiσ1(φ)σm(φ))\begin{pmatrix}1&1&\dots&1\\ \sigma_{1}(f_{1})&\sigma_{1}(f_{2})&\dots&\sigma_{1}(f_{{\cal N}})\\ \vdots&\vdots&\ddots&\vdots\\ \sigma_{m}(f_{1})&\sigma_{m}(f_{2})&\dots&\sigma_{m}(f_{{\cal N}})\end{pmatrix}\begin{pmatrix}a_{1}\\ a_{2}\\ \vdots\\ \vdots\\ a_{{\cal N}}\end{pmatrix}=\begin{pmatrix}\sum_{i=1}^{{\cal N}}a_{i}\\ \sigma_{1}(\varphi)\\ \vdots\\ \sigma_{m}(\varphi)\end{pmatrix} (3.4)

which we denote more succinctly by 𝐀𝐱=𝐲{\bf A}{\bf x}={\bf y}.

Since the coefficients a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 are positive, we are able to apply recombination [LL12, Tch15] to this linear system. Combined with the observation that dim(ker(𝐀))𝒩\dim\left(\ker({\bf A})\right)\geq{\cal N}-{\cal M}, an application of recombination returns an element 𝐱=(𝐱1,,𝐱𝒩)𝒩{\bf x}^{\prime}=\left({\bf x}^{\prime}_{1},\ldots,{\bf x}^{\prime}_{{\cal N}}\right)\in{\mathbb{R}}^{{\cal N}} satisfying the following properties. Firstly, for each j{1,,𝒩}j\in\{1,\ldots,{\cal N}\} the coefficient 𝐱j0{\bf x}^{\prime}_{j}\geq 0 is non-negative. Secondly, there are at most {\cal M} indices i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} for which 𝐱i>0{\bf x}^{\prime}_{i}>0.

Let D:=dim(ker(𝐀))𝒩D:=\dim\left(\ker({\bf A})\right)\geq{\cal N}-{\cal M}. Take e(1),,e(𝒩D){1,,𝒩}e(1),\ldots,e({\cal N}-D)\in\{1,\ldots,{\cal N}\} to be the indices i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} for which 𝐱i>0{\bf x}^{\prime}_{i}>0. Then, for each j{1,,𝒩D}j\in\{1,\ldots,{\cal N}-D\}, we set bj:=𝐱e(j)>0b_{j}:={\bf x}^{\prime}_{e(j)}>0. Define an element uSpan()u\in\operatorname{Span}({\cal F}) by u:=j=1𝒩Dbjfe(j)u:=\sum_{j=1}^{{\cal N}-D}b_{j}f_{e(j)} (cf. (3.3)). Recall that recombination ensures that 𝐱{\bf x}^{\prime} is a solution to the linear system (3.4). Hence the equation corresponding to the top row of the matrix 𝐀{\bf A} tells us that

j=1𝒩Dbj=j=1𝒩𝐱j=j=1𝒩aj.\sum_{j=1}^{{\cal N}-D}b_{j}=\sum_{j=1}^{{\cal N}}{\bf x}^{\prime}_{j}=\sum_{j=1}^{{\cal N}}a_{j}. (3.5)

Moreover, given any l{1,,m}l\in\{1,\ldots,m\}, the equation corresponding to row l+1l+1 of the matrix 𝐀{\bf A} tells us that

σl(u)=j=1𝒩Dbjσl(fe(j))=j=1𝒩𝐱jσl(fj)=σl(φ).\sigma_{l}(u)=\sum_{j=1}^{{\cal N}-D}b_{j}\sigma_{l}\left(f_{e(j)}\right)=\sum_{j=1}^{{\cal N}}{\bf x}^{\prime}_{j}\sigma_{l}\left(f_{j}\right)=\sigma_{l}(\varphi). (3.6)

If D=𝒩D={\cal N}-{\cal M} then (3.5) and (3.6) are precisely the equalities claimed in (3.2) and (3.3). If D>𝒩D>{\cal N}-{\cal M} then 𝒩D<{\cal N}-D<{\cal M}. Set b𝒩D+1==b=0b_{{\cal N}-D+1}=\ldots=b_{{\cal M}}=0 and choose any indices e(𝒩D+1),,e(){1,,𝒩}e({\cal N}-D+1),\ldots,e({\cal M})\in\{1,\ldots,{\cal N}\}. Evidently we have that j=1bj=j=1𝒩Dbj\sum_{j=1}^{{\cal M}}b_{j}=\sum_{j=1}^{{\cal N}-D}b_{j} and, for each l{1,,m}l\in\{1,\ldots,m\}, that j=1bjσl(fe(j))=j=1𝒩Dbjσl(fe(j))\sum_{j=1}^{{\cal M}}b_{j}\sigma_{l}\left(f_{e(j)}\right)=\sum_{j=1}^{{\cal N}-D}b_{j}\sigma_{l}\left(f_{e(j)}\right). Consequently, (3.5) and (3.6) are once again precisely the equalities claimed in (3.2) and (3.3).

It remains only to verify the claimed complexity of this application of recombination. For this purpose, we observe that recombination is applied to a linear system of m+1m+1 equations in 𝒩{\cal N} variables. Hence from [Tch15] we have that the complexity is O(𝒩(m+1)+(m+1)3log(𝒩/(m+1)))=O(𝒩m+m3log(𝒩/m))O\left({\cal N}(m+1)+(m+1)^{3}\log({\cal N}/(m+1))\right)=O\left({\cal N}m+m^{3}\log({\cal N}/m)\right) as claimed. This completes the proof of Lemma 3.1. ∎
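As an illustration only, the following Python sketch assembles the linear system (3.4) and applies a recombination routine (for instance the naive_recombination sketch from earlier in this section, or one of the efficient implementations of [Tch15, ACO20]) to produce the weights and indices of Lemma 3.1; the names here are hypothetical.

```python
import numpy as np

def recombination_thinning(sigma_of_f, a, recombine):
    """Sketch of the use of recombination in Lemma 3.1.

    sigma_of_f : (m, N) array with entries sigma_l(f_i) for the functionals in L.
    a          : (N,) array of strictly positive coefficients of phi.
    recombine  : routine reducing a non-negative solution of A @ x = y,
                 e.g. the naive_recombination sketch above.

    Returns (b, idx) with b >= 0 and sum(b) = sum(a), so that
    u = sum_j b[j] * f_{idx[j]} agrees with phi on L (cf. (3.2) and (3.3))."""
    sigma_of_f = np.asarray(sigma_of_f, dtype=float)
    a = np.asarray(a, dtype=float)
    _, N = sigma_of_f.shape
    # System (3.4): the row of ones encodes preservation of the sum of the a_i.
    A = np.vstack([np.ones((1, N)), sigma_of_f])
    x_new = recombine(A, a)
    idx = np.flatnonzero(x_new > 0)
    return x_new[idx], idx
```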

4 The Banach GRIM Algorithm

In this section we detail the Banach GRIM algorithm. The following Banach Extension Step is used to grow a collection of linear functionals from Σ\Sigma at which we require our next approximation of φ\varphi to agree with φ\varphi.

Banach Extension Step

Assume LΣL^{\prime}\subset\Sigma. Let uSpan()u\in\operatorname{Span}({\cal F}). Let m1m\in{\mathbb{Z}}_{\geq 1} such that #(L)+mΛ:=#(Σ)\#\left(L^{\prime}\right)+m\leq\Lambda:=\#\left(\Sigma\right). Take

σ1:=argmax{|σ(φu)|:σΣ}.\sigma_{1}:=\operatorname{argmax}\left\{|\sigma(\varphi-u)|~{}:~{}\sigma\in\Sigma\right\}. (4.1)

Inductively for j=2,3,,mj=2,3,\ldots,m take

σj:=argmax{|σ(φu)|:σΣ{σ1,,σj1}}.\sigma_{j}:=\operatorname{argmax}\left\{|\sigma(\varphi-u)|~{}:~{}\sigma\in\Sigma\setminus\{\sigma_{1},\ldots,\sigma_{j-1}\}\right\}. (4.2)

Once σ1,,σmΣ\sigma_{1},\ldots,\sigma_{m}\in\Sigma have been defined, we extend LL^{\prime} to L:=L{σ1,,σm}L:=L^{\prime}\cup\{\sigma_{1},\ldots,\sigma_{m}\}.

For each choice of subset LΣL\subset\Sigma, we use recombination (cf. Lemma 3.1) to find uSpan()u\in\operatorname{Span}({\cal F}) agreeing with φ\varphi throughout LL. Theoretically, Lemma 3.1 verifies that recombination can be used to find an approximation uu of φ\varphi for which σ(φu)=0\sigma(\varphi-u)=0 for all linear functionals σL\sigma\in L for a given subset LΣXL\subset\Sigma\subset X^{\ast}. However, implementations of recombination inevitably result in numerical errors. That is, the returned coefficients will only solve the equations modulo some (ideally) small error term. To account for this in our analysis, whenever we apply Lemma 3.1 we will only assume that the resulting approximation uSpan()u\in\operatorname{Span}({\cal F}) is close to φ\varphi at each functional σL\sigma\in L. That is, for each σL\sigma\in L, we have that |σ(φu)|ε0|\sigma(\varphi-u)|\leq\varepsilon_{0} for some (small) constant ε00\varepsilon_{0}\geq 0.
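For illustration, a sketch of the Banach Extension Step under the assumption that the values |σ(φ−u)| have been assembled into an array indexed by an enumeration of Σ is given below (Python with NumPy; all names are hypothetical).

```python
import numpy as np

def banach_extension_step(abs_residual, selected, m):
    """Sketch of the Banach Extension Step: extend the index set `selected`
    (representing L') by the m functionals in Sigma at which the current
    approximation u is furthest from phi (cf. (4.1) and (4.2))."""
    masked = np.asarray(abs_residual, dtype=float).copy()
    masked[list(selected)] = -np.inf        # never reselect a functional already in L'
    new = np.argsort(masked)[::-1][:m]      # indices of the m largest values |sigma(phi - u)|
    return list(selected) + [int(j) for j in new]
```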

Recall (cf. Section 3) that if recombination is applied to a linear system corresponding to a matrix AA, then a Singular Value Decomposition (SVD) of the matrix AA is used to find a basis for ker(A)\ker(A). Re-ordering the rows of the matrix AA (i.e. changing the order in which the equations are considered) can potentially result in a different basis for ker(A)\ker(A) being selected. Thus shuffling the order of the equations can affect the approximation returned by recombination via the recombination thinning Lemma 3.1. We exploit this by optimising the approximation returned by recombination over a chosen number of shuffles of the equations forming the linear system. This is made precise in the following Banach Recombination Step detailing how, for a given subset LΣL\subset\Sigma, we use recombination to find an element uSpan()u\in\operatorname{Span}({\cal F}) that is within ε0\varepsilon_{0} of φ\varphi throughout LL.

Banach Recombination Step

Assume LΣL\subset\Sigma. Let s1s\in{\mathbb{Z}}_{\geq 1}. For each j{1,,s}j\in\{1,\ldots,s\} we do the following.

  1. (A)

    Let LjΣL_{j}\subset\Sigma be the subset resulting from applying a random permutation to the ordering of the elements in LL.

  2. (B)

    Apply recombination (cf. the recombination thinning Lemma 3.1) to find an element ujSpan()u_{j}\in\operatorname{Span}({\cal F}) satisfying, for every σLj\sigma\in L_{j}, that |σ(φuj)|ε0|\sigma(\varphi-u_{j})|\leq\varepsilon_{0}.

  3. (C)

    Compute E[uj]:=max{|σ(φuj)|:σΣ}E[u_{j}]:=\max\left\{|\sigma(\varphi-u_{j})|:\sigma\in\Sigma\right\}.

After obtaining the elements u1,,usSpan()u_{1},\ldots,u_{s}\in\operatorname{Span}({\cal F}), define uSpan()u\in\operatorname{Span}({\cal F}) by

u:=argmin{E[w]:w{u1,,us}}.u:=\operatorname{argmin}\left\{E[w]:w\in\{u_{1},\ldots,u_{s}\}\right\}. (4.3)

Then uSpan()u\in\operatorname{Span}({\cal F}) is returned as our approximation of φ\varphi that satisfies, for every σL\sigma\in L, that |σ(φu)|ε0|\sigma(\varphi-u)|\leq\varepsilon_{0}.

We now detail our proposed GRIM algorithm to find an approximation uSpan()u\in\operatorname{Span}({\cal F}) of φSpan()\varphi\in\operatorname{Span}({\cal F}) that is close to φ\varphi at every linear functional in Σ\Sigma.

Banach GRIM

  1. (A)

    Fix ε>0\varepsilon>0 as the target accuracy threshold, ε0[0,ε)\varepsilon_{0}\in[0,\varepsilon) as the acceptable recombination error, and M1M\in{\mathbb{Z}}_{\geq 1} as the maximum number of steps. Choose integers s1,,sM1s_{1},\ldots,s_{M}\in{\mathbb{Z}}_{\geq 1} as the shuffle numbers, and integers k1,,kM1k_{1},\ldots,k_{M}\in{\mathbb{Z}}_{\geq 1} with

    κ:=k1++kMmin{𝒩1,Λ}.\kappa:=k_{1}+\ldots+k_{M}\leq\min\left\{{\cal N}-1,\Lambda\right\}. (4.4)
  2. (B)

    For each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, if ai<0a_{i}<0 then replace aia_{i} and fif_{i} by ai-a_{i} and fi-f_{i} respectively. This ensures that a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 whilst leaving the expansion φ=i=1𝒩aifi\varphi=\sum_{i=1}^{{\cal N}}a_{i}f_{i} unaltered. Additionally, rescale each fif_{i} to have unit XX norm. That is, for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} we replace fif_{i} by hi:=fifiXh_{i}:=\frac{f_{i}}{||f_{i}||_{X}}. Then φ=i=1𝒩αihi\varphi=\sum_{i=1}^{{\cal N}}\alpha_{i}h_{i} where, for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, we have αi:=aifiX>0\alpha_{i}:=a_{i}||f_{i}||_{X}>0.

  3. (C)

    Apply the Banach Extension Step, with L:=L^{\prime}:=\varnothing, u:=0u:=0 and m:=k1m:=k_{1}, to obtain a subset Σ1={σ1,1,,σ1,k1}Σ\Sigma_{1}=\{\sigma_{1,1},\ldots,\sigma_{1,k_{1}}\}\subset\Sigma. Apply the Banach Recombination Step, with L:=Σ1L:=\Sigma_{1} and s:=s1s:=s_{1}, to find an element u1Span()u_{1}\in\operatorname{Span}({\cal F}) satisfying, for every σΣ1\sigma\in\Sigma_{1}, that |σ(φu1)|ε0|\sigma(\varphi-u_{1})|\leq\varepsilon_{0}.

    If M=1M=1 then the algorithm terminates here and returns u1u_{1} as the final approximation of φ\varphi.

  4. (D)

    If M2M\geq 2 then we proceed inductively for t2t\geq 2 as follows. If |σ(φut1)|ε|\sigma(\varphi-u_{t-1})|\leq\varepsilon for every σΣ\sigma\in\Sigma then we stop and ut1u_{t-1} is an approximation of φ\varphi possessing the desired level of accuracy. Otherwise, we choose kt1k_{t}\in{\mathbb{Z}}_{\geq 1} and apply the Banach Extension Step, with L:=Σt1L^{\prime}:=\Sigma_{t-1}, u:=ut1u:=u_{t-1} and m:=ktm:=k_{t}, to obtain a subset Σt:=Σt1{σt,1,,σt,kt}Σ\Sigma_{t}:=\Sigma_{t-1}\cup\left\{\sigma_{t,1},\ldots,\sigma_{t,k_{t}}\right\}\subset\Sigma. Apply the Banach Recombination Step, with L:=ΣtL:=\Sigma_{t} and s:=sts:=s_{t}, to find an element utSpan()u_{t}\in\operatorname{Span}({\cal F}) satisfying, for every σΣt\sigma\in\Sigma_{t}, that |σ(φut)|ε0|\sigma(\varphi-u_{t})|\leq\varepsilon_{0}.

    The algorithm ends either by returning ut1u_{t-1} for the t{2,,M}t\in\{2,\ldots,M\} for which the stopping criterion was triggered as the final approximation of φ\varphi, or by returning uMu_{M} as the final approximation of φ\varphi if the stopping criterion is never triggered.
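Purely as an illustration of how the pieces fit together, the following Python sketch runs the loop of steps (C)-(D) on the matrix of values σ(f_i), assuming the pre-processing of step (B) has already been carried out (so the coefficients are positive), ignoring the recombination error ε0, and taking a recombine routine as in the naive_recombination sketch of Section 3; all names are hypothetical.

```python
import numpy as np

def banach_grim(sigma_of_f, sigma_of_phi, a, k_steps, shuffles, eps, recombine, seed=0):
    """Illustrative sketch of the Banach GRIM loop (steps (C)-(D)).

    sigma_of_f   : (Lambda, N) array with entries sigma_j(f_i).
    sigma_of_phi : (Lambda,) array with entries sigma_j(phi).
    a            : (N,) strictly positive coefficients (step (B) already applied).
    k_steps      : (k_1, ..., k_M); shuffles: (s_1, ..., s_M); eps: target accuracy.
    recombine    : routine reducing a non-negative solution of A @ x = y."""
    rng = np.random.default_rng(seed)
    Lam, N = sigma_of_f.shape
    selected = []                                  # indices of the functionals in Sigma_t
    b = np.zeros(N)                                # coefficients of the current approximation
    for k_t, s_t in zip(k_steps, shuffles):
        residual = np.abs(sigma_of_phi - sigma_of_f @ b)
        if residual.max() <= eps:                  # early stopping criterion of step (D)
            break
        # Banach Extension Step: add the k_t worst matched functionals.
        masked = residual.copy()
        masked[selected] = -np.inf
        selected = selected + [int(j) for j in np.argsort(masked)[::-1][:k_t]]
        # Banach Recombination Step: optimise over s_t shuffles of the equations.
        best_b, best_err = b, np.inf
        for _ in range(s_t):
            rows = [selected[p] for p in rng.permutation(len(selected))]
            A = np.vstack([np.ones((1, N)), sigma_of_f[rows]])
            candidate = recombine(A, a)
            err = np.abs(sigma_of_phi - sigma_of_f @ candidate).max()
            if err < best_err:
                best_b, best_err = candidate, err
        b = best_b
    return b
```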

If the algorithm terminates as a result of the early stopping criterion being triggered then we are guaranteed that the returned approximation uu satisfies, for every σΣ\sigma\in\Sigma, that |σ(φu)|ε|\sigma(\varphi-u)|\leq\varepsilon. In Section 6 we establish estimates for how large MM is required to be in order to guarantee that the algorithm returns an approximation of φ\varphi possessing this level of accuracy throughout Σ\Sigma (cf. the Banach GRIM Convergence Theorem 6.2).

GRIM uses data-driven growth. For each m{2,,M}m\in\{2,\ldots,M\}, GRIM first determines the new linear functionals in Σ\Sigma to be added to Σm1\Sigma_{m-1} to form ΣmΣ\Sigma_{m}\subset\Sigma before using recombination to find an approximation coinciding with φ\varphi on Σm\Sigma_{m}. That is, we first choose the new information that we want our approximation to match before using recombination to both select the elements from {\cal F} and use them to construct our approximation.

Evidently we have the nesting property that Σt1Σt2\Sigma_{t_{1}}\subset\Sigma_{t_{2}} for integers t1t2t_{1}\leq t_{2}, ensuring that at each step we are increasing the amount of information that we require our approximation to match. For each integer m{1,,M}m\in\{1,\ldots,M\} let SmS_{m}\subset{\cal F} denote the sub-collection of elements from {\cal F} used to form the approximation umu_{m}. Recombination is applied to a system of 1+k1++km1+k_{1}+\ldots+k_{m} linear equations when finding umu_{m}, hence we may conclude that #(Sm)min{1+k1++km,𝒩}\#(S_{m})\leq\min\left\{1+k_{1}+\ldots+k_{m},{\cal N}\right\} (cf. the recombination thinning Lemma 3.1). Besides this upper bound for #(Sm)\#(S_{m}), we have no control on the sets SmS_{m}. We impose only that the linear functionals are greedily chosen; the selection of the elements from {\cal F} to form the approximation umu_{m} is left up to recombination and determined by the data. In contrast to GEIM, there is no requirement that elements from {\cal F} used to form umu_{m} must also be used for ulu_{l} for l>ml>m.

The restriction on κ:=k1++kM\kappa:=k_{1}+\ldots+k_{M} in (4.4) is for the following reasons. As observed above, for m{1,,M}m\in\{1,\ldots,M\}, kmk_{m} is the number of new linear functionals to be chosen at the mthm^{\text{th}}-step. Hence, at the mthm^{\text{th}}-step, recombination is used to find an approximation uu that is within ε0\varepsilon_{0} of φ\varphi on a collection of κm:=k1++km\kappa_{m}:=k_{1}+\ldots+k_{m} linear functionals from Σ\Sigma. Evidently we have, for every m{1,,M}m\in\{1,\ldots,M\}, that κmκ\kappa_{m}\leq\kappa.

A first consequence of the restriction in (4.4) is that, for every m{1,,M}m\in\{1,\ldots,M\}, we have κm𝒩1\kappa_{m}\leq{\cal N}-1. Since the linear system that recombination is applied to at step mm consists of 1+κm1+\kappa_{m} equations (cf. the recombination thinning Lemma 3.1), this prevents the number of equations from exceeding 𝒩{\cal N}. When the number of equations is at least 𝒩{\cal N}, recombination returns the original coefficients without reducing the number that are non-zero (see Section 3). Consequently, κm𝒩1\kappa_{m}\geq{\cal N}-1 guarantees that step mm returns φ\varphi itself as the approximation of φ\varphi. The restriction in (4.4) ensures that the algorithm ends if this (essentially useless) stage is reached.

A second consequence of (4.4) is, for every m{1,,M}m\in\{1,\ldots,M\}, that κmΛ\kappa_{m}\leq\Lambda. Note the collection ΣmΣ\Sigma_{m}\subset\Sigma has cardinality κm\kappa_{m}. If κm=Λ\kappa_{m}=\Lambda then we necessarily have that Σm=Σ\Sigma_{m}=\Sigma, and so recombination is used to find an approximation of φ\varphi that is within ε0\varepsilon_{0} of φ\varphi at every σΣ\sigma\in\Sigma. The restriction (4.4) ensures that if this happens then the algorithm terminates without carrying out additional steps.

5 Complexity Cost

In this section we consider the complexity cost of the Banach GRIM algorithm presented in Section 4. We begin by recording the complexity cost of the Banach Extension Step.

Lemma 5.1 (Banach Extension Step Complexity Cost).

Let 𝒩,Λ,m,t1{\cal N},\Lambda,m,t\in{\mathbb{Z}}_{\geq 1} and XX be a Banach space with dual space XX^{\ast}. Assume ={f1,,f𝒩}X{0}{\cal F}=\left\{f_{1},\ldots,f_{{\cal N}}\right\}\subset X\setminus\{0\} and that ΣX\Sigma\subset X^{\ast} has cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Assume that the set {σ(φ):σΣ}\{\sigma(\varphi):\sigma\in\Sigma\} and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, the set {σ(fi):σΣ}\{\sigma(f_{i}):\sigma\in\Sigma\} have already been computed. Then for any choices of LΣL^{\prime}\subset\Sigma with #(L)Λm\#\left(L^{\prime}\right)\leq\Lambda-m and any uSpan()u\in\operatorname{Span}({\cal F}) with #support(u)=t\#{\rm support}(u)=t, the complexity cost of applying the Banach Extension Step, with the LL^{\prime}, uu and mm there as the LL^{\prime}, uu and mm here respectively, is 𝒪((m+t)Λ){\cal O}\left((m+t)\Lambda\right).

Proof of Lemma 5.1.

Let 𝒩,Λ,m,t1{\cal N},\Lambda,m,t\in{\mathbb{Z}}_{\geq 1} and XX be a Banach space with dual space XX^{\ast}. Suppose that ={f1,,f𝒩}X{0}{\cal F}=\left\{f_{1},\ldots,f_{{\cal N}}\right\}\subset X\setminus\{0\} and that ΣX\Sigma\subset X^{\ast} has cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Assume that the set {σ(φ):σΣ}\{\sigma(\varphi):\sigma\in\Sigma\} and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, the set {σ(fi):σΣ}\{\sigma(f_{i}):\sigma\in\Sigma\} have already been computed. Suppose that LΣL^{\prime}\subset\Sigma with #(L)Λm\#\left(L^{\prime}\right)\leq\Lambda-m and uSpan()u\in\operatorname{Span}({\cal F}) with #support(u)=t\#{\rm support}(u)=t. Recall our convention (cf. Section 6.1) that support(u){\rm support}(u) is the set of the fif_{i} that correspond to the non-zero coefficients in the expansion of uu in terms of the fif_{i}. That is, u=i=1𝒩uifiu=\sum_{i=1}^{{\cal N}}u_{i}f_{i} for some coefficients u1,,u𝒩u_{1},\ldots,u_{{\cal N}}\in{\mathbb{R}}, and

support(u):={fj:j{1,,𝒩} and uj0}.{\rm support}(u):=\left\{f_{j}~{}:~{}j\in\{1,\ldots,{\cal N}\}\text{ and }u_{j}\neq 0\right\}. (5.1)

Consider carrying out the Banach Extension Step with the L^{\prime}, u and m there as the L^{\prime}, u and m here respectively. Since we have access to \{\sigma(\varphi):\sigma\in\Sigma\} and \{\sigma(f_{i}):\sigma\in\Sigma\} for every i\in\{1,\ldots,{\cal N}\} without additional computation, and since \#{\rm support}(u)=t, the complexity cost of computing the set \{|\sigma(\varphi-u)|:\sigma\in\Sigma\} is no worse than {\cal O}(t\Lambda). The complexity cost of subsequently extracting the m largest values of this set is no worse than {\cal O}(m\Lambda). The complexity cost of appending the resulting m linear functionals in \Sigma to the collection L^{\prime} is {\cal O}(m). Therefore the entire Banach Extension Step has a complexity cost no worse than {\cal O}((m+t)\Lambda) as claimed. This completes the proof of Lemma 5.1. ∎
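To make the counting in Lemma 5.1 concrete, the following minimal Python sketch realises the Banach Extension Step when the precomputed values \sigma(f_{i}) and \sigma(\varphi) are stored as an array of shape (\Lambda,{\cal N}) and a vector of length \Lambda. The sketch is purely illustrative and all names in it are ours, not part of the formal specification of the algorithm.

```python
import numpy as np

def banach_extension_step(sigma_phi, sigma_F, u_coeffs, L_prime, m):
    """Illustrative sketch of the Banach Extension Step.

    sigma_phi : (Lambda,) precomputed values sigma(phi) for each sigma in Sigma.
    sigma_F   : (Lambda, N) precomputed values sigma(f_i).
    u_coeffs  : (N,) coefficients of the current approximation u (t non-zero entries).
    L_prime   : set of indices (into Sigma) of previously selected functionals.
    m         : number of new linear functionals to select.
    """
    support = np.flatnonzero(u_coeffs)
    # Computing sigma(phi - u) for every sigma in Sigma only touches the t
    # non-zero coefficients of u, giving the O(t * Lambda) term of Lemma 5.1.
    residual = np.abs(sigma_phi - sigma_F[:, support] @ u_coeffs[support])
    # Select the m functionals (not already in L') at which |sigma(phi - u)| is largest.
    order = np.argsort(-residual)
    new = [int(j) for j in order if j not in L_prime][:m]
    return L_prime | set(new)
```

In particular, when L^{\prime}=\emptyset and u=0 this reduces to selecting the m functionals at which |\sigma(\varphi)| is largest, which is how the step is used in Step (C) of the Banach GRIM algorithm.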

We next record the complexity cost of the Banach Recombination Step.

Lemma 5.2.

Let 𝒩,Λ,m,s1{\cal N},\Lambda,m,s\in{\mathbb{Z}}_{\geq 1} and XX be a Banach space with dual space XX^{\ast}. Assume ={f1,,f𝒩}X{0}{\cal F}=\left\{f_{1},\ldots,f_{{\cal N}}\right\}\subset X\setminus\{0\} and that ΣX\Sigma\subset X^{\ast} has cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Assume that the set {σ(φ):σΣ}\{\sigma(\varphi):\sigma\in\Sigma\} and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, the set {σ(fi):σΣ}\{\sigma(f_{i}):\sigma\in\Sigma\} have already been computed. Then for any LΣL\subset\Sigma with cardinality #(L)=m\#(L)=m the complexity cost of applying the Banach Recombination Step, with the subset LL and the integer ss there as LL and ss here respectively, is

𝒪(ms(𝒩+Λ)+m3slog(𝒩m)).{\cal O}\left(ms({\cal N}+\Lambda)+m^{3}s\log\left(\frac{{\cal N}}{m}\right)\right). (5.2)
Proof of Lemma 5.2.

Let 𝒩,Λ,m,s1{\cal N},\Lambda,m,s\in{\mathbb{Z}}_{\geq 1} and XX be a Banach space with dual space XX^{\ast}. Assume that ={f1,,f𝒩}X{0}{\cal F}=\left\{f_{1},\ldots,f_{{\cal N}}\right\}\subset X\setminus\{0\} and that ΣX\Sigma\subset X^{\ast} has cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Assume that the set {σ(φ):σΣ}\{\sigma(\varphi):\sigma\in\Sigma\} and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, the set {σ(fi):σΣ}\{\sigma(f_{i}):\sigma\in\Sigma\} have already been computed.

Consider applying the Banach Recombination Step with the subset L and the integer s there as L and s here respectively. Let j\in\{1,\ldots,s\}. The complexity cost of shuffling the elements in L to obtain L_{j} is {\cal O}(m). By appealing to the recombination thinning Lemma 3.1 we conclude that the complexity cost of applying recombination to find u_{j}\in\operatorname{Span}({\cal F}) satisfying, for every \sigma\in L_{j}, that |\sigma(\varphi-u_{j})|\leq\varepsilon_{0} is {\cal O}({\cal N}m+m^{3}\log({\cal N}/m)). Further recall that, since \#(L)=m, the recombination thinning Lemma 3.1 ensures that \#{\rm support}(u_{j})\leq m+1. Thus, since we already have access to \{\sigma(\varphi):\sigma\in\Sigma\} and \{\sigma(f_{i}):\sigma\in\Sigma\} for every i\in\{1,\ldots,{\cal N}\} without additional computation, and since \#{\rm support}(u_{j})\leq m+1, the complexity cost of computing E[u_{j}]:=\max\{|\sigma(\varphi-u_{j})|:\sigma\in\Sigma\} is {\cal O}(\Lambda m).

Therefore, summing over the s shuffles, the complexity cost of carrying out (A), (B), and (C) of the Banach Recombination Step is

𝒪(sm+sΛm+s𝒩m+m3slog(𝒩m))=𝒪(ms(𝒩+Λ)+m3slog(𝒩m)).{\cal O}\left(sm+s\Lambda m+s{\cal N}m+m^{3}s\log\left(\frac{{\cal N}}{m}\right)\right)={\cal O}\left(ms({\cal N}+\Lambda)+m^{3}s\log\left(\frac{{\cal N}}{m}\right)\right). (5.3)

The complexity cost of the final selection of u:=argmin{E[w]:w{u1,,us}}u:=\operatorname{argmin}\left\{E[w]:w\in\{u_{1},\ldots,u_{s}\}\right\} is 𝒪(s){\cal O}(s). Combined with (5.3), this yields that the complexity cost of the entire Banach Recombination Step is

𝒪(ms(𝒩+Λ)+m3slog(𝒩m)).{\cal O}\left(ms({\cal N}+\Lambda)+m^{3}s\log\left(\frac{{\cal N}}{m}\right)\right). (5.4)

as claimed in (5.2). This completes the proof of Lemma 5.2. ∎
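The structure of the count in Lemma 5.2 can likewise be illustrated by the following sketch, in which recombination itself (cf. the recombination thinning Lemma 3.1) is treated as an assumed black-box callable; the sketch is ours and only indicates where the {\cal O}(m), {\cal O}({\cal N}m+m^{3}\log({\cal N}/m)) and {\cal O}(\Lambda m) contributions arise for each of the s shuffles.

```python
import numpy as np

rng = np.random.default_rng(0)

def banach_recombination_step(sigma_phi, sigma_F, L, s, recombine):
    """Illustrative sketch of the Banach Recombination Step.

    L         : list of m indices (into Sigma) of the selected linear functionals.
    s         : number of random shuffles of L to try.
    recombine : assumed callable implementing the recombination thinning of Lemma 3.1;
                given the m rows of sigma_F indexed by a shuffle of L and the matching
                entries of sigma_phi, it returns coefficients u (length N) with at most
                m + 1 non-zero entries and |sigma(phi - u)| <= eps_0 on those functionals.
    """
    best_u, best_E = None, np.inf
    for _ in range(s):
        Lj = list(L)
        rng.shuffle(Lj)                                   # O(m) per shuffle
        u = recombine(sigma_F[Lj, :], sigma_phi[Lj])      # O(N m + m^3 log(N / m))
        supp = np.flatnonzero(u)                          # at most m + 1 entries
        E = np.max(np.abs(sigma_phi - sigma_F[:, supp] @ u[supp]))   # O(Lambda m)
        if E < best_E:
            best_u, best_E = u, E
    return best_u                                         # argmin of E over the s candidates
```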

We now establish an upper bound for the complexity cost of the Banach GRIM algorithm via repeated use of Lemmas 5.1 and 5.2. This is the content of the following result.

Lemma 5.3 (Banach GRIM Complexity Cost).

Let M,{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} and \varepsilon>\varepsilon_{0}\geq 0. Take s_{1},\ldots,s_{M}\in{\mathbb{Z}}_{\geq 1} and k_{1},\ldots,k_{M}\in{\mathbb{Z}}_{\geq 1} with k_{1}+\ldots+k_{M}\leq\min\{{\cal N}-1,\Lambda\}. For j\in\{1,\ldots,M\} let \kappa_{j}:=\sum_{i=1}^{j}k_{i}. Assume X is a Banach space with dual-space X^{\ast}. Suppose {\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\} and that \Sigma\subset X^{\ast} is finite with cardinality \#(\Sigma)=\Lambda. Let a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define \varphi\in\operatorname{Span}({\cal F}) by \varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. The complexity cost of completing up to M steps of the Banach GRIM algorithm to approximate \varphi, with \varepsilon as the target accuracy, \varepsilon_{0} as the acceptable recombination error, M as the maximum number of steps, s_{1},\ldots,s_{M} as the shuffle numbers, and the integers k_{1},\ldots,k_{M}\in{\mathbb{Z}}_{\geq 1} as the integers k_{1},\ldots,k_{M} chosen in Step (A) of the Banach GRIM algorithm, is

𝒪(𝒩Λ+j=1Mκjsj(𝒩+Λ)+κj3sjlog(𝒩κj)).{\cal O}\left({\cal N}\Lambda+\sum_{j=1}^{M}\kappa_{j}s_{j}\left({\cal N}+\Lambda\right)+\kappa_{j}^{3}s_{j}\log\left(\frac{{\cal N}}{\kappa_{j}}\right)\right). (5.5)
Proof of Lemma 5.3.

Let M,𝒩,Λ1M,{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} and ε>ε00\varepsilon>\varepsilon_{0}\geq 0. Take s1,,sM1s_{1},\ldots,s_{M}\in{\mathbb{Z}}_{\geq 1} and k1,,kM1k_{1},\ldots,k_{M}\in{\mathbb{Z}}_{\geq 1} with k1++kMmin{𝒩1,Λ}k_{1}+\ldots+k_{M}\leq\min\left\{{\cal N}-1,\Lambda\right\}. For j{1,,M}j\in\{1,\ldots,M\} define κj:=i=1jki\kappa_{j}:=\sum_{i=1}^{j}k_{i}. Assume XX is a Banach space with dual-space XX^{\ast}. Suppose :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\} and that ΣX\Sigma\subset X^{\ast} is finite with cardinality #(Σ)=Λ\#\left(\Sigma\right)=\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) by φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Consider applying the Banach GRIM algorithm to approximate φ\varphi, with ε\varepsilon as the target accuracy, ε0\varepsilon_{0} as the acceptable recombination error, MM as the maximum number of steps, s1,,sMs_{1},\ldots,s_{M} as the shuffle numbers, and the integers k1,,kM1k_{1},\ldots,k_{M}\in{\mathbb{Z}}_{\geq 1} as the integers k1,,kMk_{1},\ldots,k_{M} chosen in Step (A) of the Banach GRIM algorithm.

Since the cardinality of {\cal F} is 𝒩{\cal N}, the complexity cost of the rescaling and sign alterations required in Step (B) of the Banach GRIM algorithm is 𝒪(𝒩){\cal O}({\cal N}). The complexity cost of computing the sets {σ(fi):σΣ}\left\{\sigma(f_{i}):\sigma\in\Sigma\right\} for i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} is 𝒪(𝒩Λ){\cal O}({\cal N}\Lambda). Subsequently, the complexity cost of computing the set {σ(φ):σΣ}\left\{\sigma(\varphi):\sigma\in\Sigma\right\} is 𝒪(𝒩Λ){\cal O}({\cal N}\Lambda). Consequently, the total complexity cost of performing these computations is 𝒪(𝒩Λ){\cal O}({\cal N}\Lambda).

We appeal to Lemma 5.1 to conclude that the complexity cost of performing the Banach Extension Step as in Step (C) of the Banach GRIM algorithm (i.e. with L:=L^{\prime}:=\emptyset, u:=0u:=0, and m:=k1m:=k_{1}) is 𝒪(k1Λ){\cal O}\left(k_{1}\Lambda\right). By appealing to Lemma 5.2, we conclude that the complexity cost of the use of the Banach Recombination Step in Step (C) of the Banach GRIM algorithm (i.e. with L:=Σ1L:=\Sigma_{1} and s:=s1s:=s_{1}) is 𝒪(s1k1(𝒩+Λ)+k13s1log(𝒩/k1)){\cal O}\left(s_{1}k_{1}\left({\cal N}+\Lambda\right)+k_{1}^{3}s_{1}\log\left({\cal N}/k_{1}\right)\right). Hence the complexity cost of Step (C) of the Banach GRIM algorithm is 𝒪(s1k1(𝒩+Λ)+k13s1log(𝒩/k1)){\cal O}\left(s_{1}k_{1}\left({\cal N}+\Lambda\right)+k_{1}^{3}s_{1}\log\left({\cal N}/k_{1}\right)\right).

In the case that M=1M=1 we can already conclude that the total complexity cost of performing the Banach GRIM algorithm is 𝒪(k1s1(𝒩+Λ)+𝒩Λ+k13s1log(𝒩k1)){\cal O}\left(k_{1}s_{1}\left({\cal N}+\Lambda\right)+{\cal N}\Lambda+k_{1}^{3}s_{1}\log\left(\frac{{\cal N}}{k_{1}}\right)\right) as claimed in (5.5). Now suppose that M2M\geq 2. We assume that all MM steps of the Banach GRIM algorithm are completed without early termination since this is the case that will maximise the complexity cost. Under this assumption, for j{1,,M}j\in\{1,\ldots,M\} let ujSpan()u_{j}\in\operatorname{Span}({\cal F}) denote the approximation of φ\varphi returned after step jj of the Banach GRIM algorithm is completed. Recall that κj:=i=1jki\kappa_{j}:=\sum_{i=1}^{j}k_{i}. Examining the Banach GRIM algorithm, uju_{j} is obtained by applying recombination (cf. the recombination thinning Lemma 3.1) to find an approximation that is within ε0\varepsilon_{0} of φ\varphi on a subset of κj\kappa_{j} linear functionals from Σ\Sigma. Thus, by appealing to the recombination thinning Lemma 3.1, we have that #support(uj)1+κj\#{\rm support}(u_{j})\leq 1+\kappa_{j}. For the purpose of computing the maximal complexity cost we assume that #support(uj)=1+κj\#{\rm support}(u_{j})=1+\kappa_{j}.

Let t\in\{2,\ldots,M\} and consider performing Step (D) of the Banach GRIM algorithm for the s there as t here. Since \#{\rm support}(u_{t-1})=1+\kappa_{t-1}, Lemma 5.1 tells us that the complexity cost of performing the Banach Extension Step as in Step (D) of the Banach GRIM algorithm (i.e. with L^{\prime}:=\Sigma_{t-1}, u:=u_{t-1}, and m:=k_{t}) is {\cal O}(\kappa_{t}\Lambda). Lemma 5.2 yields that the complexity cost of the use of the Banach Recombination Step in Step (D) of the Banach GRIM algorithm (i.e. with L:=\Sigma_{t} and s:=s_{t}) is {\cal O}(\kappa_{t}s_{t}({\cal N}+\Lambda)+\kappa_{t}^{3}s_{t}\log({\cal N}/\kappa_{t})). Consequently, the total complexity cost of performing Step (D) of the Banach GRIM algorithm (for t here playing the role of s there) is

𝒪(κtst(𝒩+Λ)+κt3stlog(𝒩κt)).{\cal O}\left(\kappa_{t}s_{t}\left({\cal N}+\Lambda\right)+\kappa_{t}^{3}s_{t}\log\left(\frac{{\cal N}}{\kappa_{t}}\right)\right). (5.6)

By summing the complexity costs in (5.6) over t{2,,M}t\in\{2,\ldots,M\} we deduce that the complexity cost of Step (D) of the Banach GRIM algorithm is

𝒪(j=2Mκjsj(𝒩+Λ)+κj3sjlog(𝒩κj)).{\cal O}\left(\sum_{j=2}^{M}\kappa_{j}s_{j}\left({\cal N}+\Lambda\right)+\kappa_{j}^{3}s_{j}\log\left(\frac{{\cal N}}{\kappa_{j}}\right)\right). (5.7)

Since we have previously observed that the complexity cost of performing Steps (A), (B), and (C) of the Banach GRIM algorithm is 𝒪(k1s1(𝒩+Λ)+𝒩Λ+k13s1log(𝒩k1)){\cal O}\left(k_{1}s_{1}\left({\cal N}+\Lambda\right)+{\cal N}\Lambda+k_{1}^{3}s_{1}\log\left(\frac{{\cal N}}{k_{1}}\right)\right), it follows from (5.7) that the complexity cost of performing the entire Banach GRIM algorithm is (recalling that κ1:=k1\kappa_{1}:=k_{1})

𝒪(𝒩Λ+j=1Mκjsj(𝒩+Λ)+κj3sjlog(𝒩κj)){\cal O}\left({\cal N}\Lambda+\sum_{j=1}^{M}\kappa_{j}s_{j}\left({\cal N}+\Lambda\right)+\kappa_{j}^{3}s_{j}\log\left(\frac{{\cal N}}{\kappa_{j}}\right)\right) (5.8)

as claimed in (5.5). This completes the proof of Lemma 5.3. ∎

We end this section by explicitly recording the complexity cost estimates resulting from Lemma 5.3 for some particular choices of parameters. For both choices we assume that we are in the situation that {\cal N}\gg\Lambda.

First, consider the choices that M:=1M:=1, k1:=Λk_{1}:=\Lambda, and s1:=1s_{1}:=1. This corresponds to making a single application of recombination to find an approximation of φ\varphi that is within ε0\varepsilon_{0} of φ\varphi at every linear functional in Σ\Sigma. Lemma 5.3 tells us that the complexity cost of doing this is 𝒪(𝒩Λ+Λ2+Λ3log(𝒩Λ)){\cal O}\left({\cal N}\Lambda+\Lambda^{2}+\Lambda^{3}\log\left(\frac{{\cal N}}{\Lambda}\right)\right).

Secondly, consider the choices of an arbitrary M\in{\mathbb{Z}}_{\geq 1}, k_{1}=\ldots=k_{M}=1, and arbitrary fixed s_{1},\ldots,s_{M}\in{\mathbb{Z}}_{\geq 1}. Let L\subset\Sigma denote the collection of linear functionals that is inductively grown during the Banach GRIM algorithm. These choices then correspond to adding a single new linear functional to L at each step. It follows, for each j\in\{1,\ldots,M\}, that \kappa_{j}:=\sum_{i=1}^{j}k_{i}=j. Lemma 5.3 tells us that the complexity cost of doing this is {\cal O}({\cal N}\Lambda+\sum_{j=1}^{M}js_{j}({\cal N}+\Lambda)+j^{3}s_{j}\log({\cal N}/j)). If we restrict to a single application of recombination at each step (i.e. choosing s_{1}=\ldots=s_{M}=1), then the complexity cost becomes {\cal O}({\cal N}\Lambda+M^{2}({\cal N}+\Lambda)+\sum_{j=1}^{M}j^{3}\log({\cal N}/j)).

In particular, if we take M:=ΛM:=\Lambda (which corresponds to allowing for the possibility that the collection LL may grow to be the entirety of Σ\Sigma) then the complexity cost is

𝒪(Λ2(𝒩+Λ)+j=1Λj3log(𝒩j)).{\cal O}\left(\Lambda^{2}\left({\cal N}+\Lambda\right)+\sum_{j=1}^{\Lambda}j^{3}\log\left(\frac{{\cal N}}{j}\right)\right). (5.9)

If 𝒩{\cal N} is large enough that Λ<e1/3𝒩\Lambda<e^{-1/3}{\cal N}, then the function xx3log(𝒩/x)x\mapsto x^{3}\log({\cal N}/x) is increasing on the interval [1,Λ][1,\Lambda]. Under these conditions, the complexity cost in (5.9) is no worse than

𝒪(Λ2𝒩+Λ4log(𝒩Λ)).{\cal O}\left(\Lambda^{2}{\cal N}+\Lambda^{4}\log\left(\frac{{\cal N}}{\Lambda}\right)\right). (5.10)
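For orientation, the two parameter regimes above can be compared numerically by evaluating the bound (5.5) up to its implicit constant; the sizes below are hypothetical and chosen only to reflect the regime {\cal N}\gg\Lambda.

```python
import math

def grim_cost_bound(N, Lam, ks, ss):
    """Evaluate the right hand side of (5.5), up to the implicit constant."""
    total = N * Lam
    kappa = 0
    for k, s in zip(ks, ss):
        kappa += k
        total += kappa * s * (N + Lam) + kappa ** 3 * s * math.log(N / kappa)
    return total

N, Lam = 10 ** 6, 200                                     # hypothetical sizes with N >> Lambda
single_shot = grim_cost_bound(N, Lam, ks=[Lam], ss=[1])   # M = 1, k_1 = Lambda, s_1 = 1
one_per_step = grim_cost_bound(N, Lam, ks=[1] * Lam, ss=[1] * Lam)  # M = Lambda, k_j = s_j = 1
```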

6 Convergence Analysis

In this section we establish a convergence result for the Banach GRIM algorithm and discuss its consequences. The section is organised as follows. In Subsection 6.1 we fix notation and conventions that will be used throughout the entirety of Section 6. In Subsection 6.2 we state the Banach GRIM Convergence theorem 6.2. This result establishes, in particular, a non-trivial upper bound on the maximum number of steps for which the Banach GRIM algorithm, under the choice M:=\min\{{\cal N}-1,\Lambda\} and with k_{t}:=1 for every t\in\{1,\ldots,M\}, can run before terminating. We do not consider the potentially beneficial impact of considering multiple shuffles at each step, and so we assume throughout this section that s_{t}:=1 for every t\in\{1,\ldots,M\}. We additionally discuss some performance guarantees available as a consequence of the Banach GRIM Convergence theorem 6.2. In Subsection 6.3 we record several lemmata providing, in particular, estimates for the approximations found at each completed step of the algorithm (cf. Lemma 6.4), and properties regarding the distribution of the collection of linear functionals selected at each completed step of the algorithm (cf. Lemma 6.5). In Subsection 6.4 we use the supplementary lemmata from Subsection 6.3 to prove the Banach GRIM Convergence theorem 6.2.

6.1 Notation and Conventions

In this section we fix notation and conventions that will be used throughout the entirety of Section 6.

Given a real Banach space ZZ, an integer m1m\in{\mathbb{Z}}_{\geq 1}, a subset L={z1,,zm}ZL=\{z_{1},\ldots,z_{m}\}\subset Z, an element r0r\in{\mathbb{R}}_{\geq 0}, and a norm λ\lambda on m{\mathbb{R}}^{m}, we will refer to the set

Spanλ(L,r):={j=1mβjzj:β:=(β1,,βm)m and λ(β)r}Z\operatorname{Span}_{\lambda}\left(L,r\right):=\left\{\sum_{j=1}^{m}\beta_{j}z_{j}~{}:~{}\beta:=(\beta_{1},\ldots,\beta_{m})\in{\mathbb{R}}^{m}\text{ and }\lambda(\beta)\leq r\right\}\subset Z (6.1)

as the λ\lambda distance rr span of L={z1,,zm}L=\{z_{1},\ldots,z_{m}\} in ZZ. If r=r=\infty then we take Spanλ(L,)\operatorname{Span}_{\lambda}\left(L,\infty\right) to be the usual linear span of the set L={z1,,zm}L=\{z_{1},\ldots,z_{m}\} in ZZ. Further, given constants c>0c>0 and α>β0\alpha>\beta\geq 0, we refer to the set

Reachλ(L,c,α,β):={0ταβSpan¯λ(L,τ)ατβcif β>0Span(L)αcif β=0\operatorname{Reach}_{\lambda}\left(L,c,\alpha,\beta\right):=\left\{\begin{array}[]{ll}\bigcup_{0\leq\tau\leq\frac{\alpha}{\beta}}\operatorname{\overline{Span}}_{\lambda}\left(L,\tau\right)_{\frac{\alpha-\tau\beta}{c}}&\mbox{if }\beta>0\\ \operatorname{Span}(L)_{\frac{\alpha}{c}}&\mbox{if }\beta=0\end{array}\right. (6.2)

as the cc weighted β\beta based λ\lambda distance α\alpha reach of L={z1,,zm}L=\{z_{1},\ldots,z_{m}\} in ZZ. Given V={v1,,vm}ZV=\{v_{1},\ldots,v_{m}\}\subset Z we let conv(V){\mathrm{conv}}(V) denote the closed convex hull of {v1,,vm}\left\{v_{1},\ldots,v_{m}\right\} in ZZ. That is

conv(V):={j=1mβjvj:β1,,βm[0,1] and j=1mβj=1}Z.{\mathrm{conv}}(V):=\left\{\sum_{j=1}^{m}\beta_{j}v_{j}~{}:~{}\beta_{1},\ldots,\beta_{m}\in[0,1]\text{ and }\sum_{j=1}^{m}\beta_{j}=1\right\}\subset Z. (6.3)

For future use, we record that the ||\cdot||_{l^{1}({\mathbb{R}}^{m})} distance 1 span of L=\{z_{1},\ldots,z_{m}\} in Z coincides with the closed convex hull of the set \{z_{1},\ldots,z_{m},0,-z_{1},\ldots,-z_{m}\} in Z, and that consequently the (\alpha-\beta)/c-fattening of this convex hull is contained within the c weighted \beta based ||\cdot||_{l^{1}({\mathbb{R}}^{m})} distance \alpha reach of L, denoted \operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,c,\alpha,\beta), whenever c>0 and \alpha>\beta\geq 0. This is the content of the following lemma.

Lemma 6.1.

Let ZZ be a Banach space and m1m\in{\mathbb{Z}}_{\geq 1}. Assume that L={z1,,zm}ZL=\{z_{1},\ldots,z_{m}\}\subset Z and define V:={z1,,zm,0,z1,,zm}V:=\left\{z_{1},\ldots,z_{m},0,-z_{1},\ldots,-z_{m}\right\}. Then we have that

conv(V)=Span||||l1(m)(L,1).{\mathrm{conv}}(V)=\operatorname{Span}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}\left(L,1\right). (6.4)

Consequently, given any c>0c>0 and any α>β0\alpha>\beta\geq 0 we have that

conv(V)αβcReach||||l1(m)(L,c,α,β).{\mathrm{conv}}(V)_{\frac{\alpha-\beta}{c}}\subset\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,c,\alpha,\beta). (6.5)
Proof of Lemma 6.1.

Let Z be a Banach space, m\in{\mathbb{Z}}_{\geq 1}, and assume that L=\{z_{1},\ldots,z_{m}\}\subset Z. Define z_{0}:=0\in Z and, for each j\in\{1,\ldots,m\}, define z_{-j}:=-z_{j}. Let {\cal C}:={\mathrm{conv}}(\{z_{1},\ldots,z_{m},0,-z_{1},\ldots,-z_{m}\}) and {\cal D}:=\operatorname{Span}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,1). We establish (6.4) by proving both the inclusions {\cal C}\subset{\cal D} and {\cal D}\subset{\cal C}.

We first prove that 𝒞𝒟{\cal C}\subset{\cal D}. To do so, suppose that z𝒞z\in{\cal C}. Then there are coefficients βm,,βm[0,1]\beta_{-m},\ldots,\beta_{m}\in[0,1] such that z=j=mmβjzjz=\sum_{j=-m}^{m}\beta_{j}z_{j} and j=mmβj=1\sum_{j=-m}^{m}\beta_{j}=1. Then we have that z=j=1m(βjβj)zjz=\sum_{j=1}^{m}\left(\beta_{j}-\beta_{-j}\right)z_{j}, and moreover we may observe via the triangle inequality that j=1m|βjβj|j=mmβj=1\sum_{j=1}^{m}\left|\beta_{j}-\beta_{-j}\right|\leq\sum_{j=-m}^{m}\beta_{j}=1. Consequently, by taking cj:=βjβjc_{j}:=\beta_{j}-\beta_{-j} for j{1,,m}j\in\{1,\ldots,m\}, we see that z𝒟z\in{\cal D} by definition. The arbitrariness of z𝒞z\in{\cal C} allows us to conclude that 𝒞𝒟{\cal C}\subset{\cal D}.

In order to prove {\cal D}\subset{\cal C}, suppose now that z\in{\cal D}. Hence there exist c_{1},\ldots,c_{m}\in{\mathbb{R}}, with \sum_{j=1}^{m}|c_{j}|\leq 1, and such that z=\sum_{j=1}^{m}c_{j}z_{j}. For each j\in\{1,\ldots,m\}, if c_{j}>0 define \beta_{j}:=c_{j} and \beta_{-j}:=0, and if c_{j}\leq 0 define \beta_{j}:=0 and \beta_{-j}:=-c_{j}. Observe that, for each j\in\{1,\ldots,m\}, we have \beta_{j},\beta_{-j}\in[0,1]. Further, for each j\in\{1,\ldots,m\}, we have \beta_{j}+\beta_{-j}=|c_{j}|, and so \sum_{j=1}^{m}\beta_{j}+\beta_{-j}=\sum_{j=1}^{m}|c_{j}|. Finally, define \beta_{0}:=1-\sum_{j=1}^{m}|c_{j}|\in[0,1] so that \beta_{-m},\ldots,\beta_{m}\in[0,1] with \sum_{j=-m}^{m}\beta_{j}=1. We observe that z=\sum_{j=1}^{m}c_{j}z_{j}=\sum_{c_{j}\geq 0}c_{j}z_{j}+\sum_{c_{j}<0}(-c_{j})(-z_{j})=\sum_{j=1}^{m}\beta_{j}z_{j}+\beta_{-j}z_{-j}=\sum_{j=-m}^{m}\beta_{j}z_{j} where the last equality holds since z_{0}:=0. Consequently, z\in{\cal C}. The arbitrariness of z\in{\cal D} allows us to conclude that {\cal D}\subset{\cal C}.

Together, the inclusions {\cal C}\subset{\cal D} and {\cal D}\subset{\cal C} establish that {\cal C}={\cal D} as claimed in (6.4). We complete the proof of Lemma 6.1 by using (6.4) to establish (6.5).

Let c>0 and \alpha>\beta\geq 0. First suppose that \beta>0. Then \alpha/\beta\geq 1 and so

\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,c,\alpha,\beta)\overset{(6.2)}{=}\bigcup_{0\leq\tau\leq\frac{\alpha}{\beta}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}\left(L,\tau\right)_{\frac{\alpha-\tau\beta}{c}}\supset\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}\left(L,1\right)_{\frac{\alpha-\beta}{c}}\overset{(6.4)}{=}{\mathrm{conv}}(V)_{\frac{\alpha-\beta}{c}}. (6.6)

If β=0\beta=0 then

\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,c,\alpha,0)\overset{(6.2)}{=}\operatorname{Span}(L)_{\frac{\alpha}{c}}\supset{\mathrm{conv}}(V)_{\frac{\alpha}{c}} (6.7)

where the last inclusion follows from the observation that any element in V is a linear combination of the elements in L=\{z_{1},\ldots,z_{m}\}. Together, (6.6) and (6.7) establish that we always have {\mathrm{conv}}(V)_{\frac{\alpha-\beta}{c}}\subset\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,c,\alpha,\beta) as claimed in (6.5). This completes the proof of Lemma 6.1. ∎
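The two constructive correspondences used in the proof of Lemma 6.1 (passing from convex weights \beta to l^{1} coefficients c_{j}=\beta_{j}-\beta_{-j}, and splitting an l^{1} vector into its positive and negative parts) are elementary enough to check numerically. The following sketch does so in a Euclidean space standing in for Z; all names are ours and purely for exposition.

```python
import numpy as np

rng = np.random.default_rng(1)
m, dim = 4, 3
Z = rng.standard_normal((m, dim))                # z_1, ..., z_m in R^dim

# conv(V) -> l1 ball: start from convex weights on V = {z_1,...,z_m, 0, -z_1,...,-z_m}.
beta = rng.random(2 * m + 1); beta /= beta.sum()
V = np.vstack([Z, np.zeros((1, dim)), -Z])
z = beta @ V
c = beta[:m] - beta[m + 1:]                      # c_j = beta_j - beta_{-j}
assert np.abs(c).sum() <= 1 + 1e-12 and np.allclose(z, c @ Z)

# l1 ball -> conv(V): split c into positive and negative parts; the slack 1 - ||c||_1
# is placed on the zero element, so the weights are convex.
c = rng.standard_normal(m); c /= (np.abs(c).sum() + 0.5)     # ensures ||c||_1 < 1
beta_plus, beta_minus = np.maximum(c, 0.0), np.maximum(-c, 0.0)
assert np.allclose(beta_plus @ Z + beta_minus @ (-Z), c @ Z)
```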

Given a real Banach space Z, an integer m\in{\mathbb{Z}}_{\geq 1}, a subset S=\{z_{1},\ldots,z_{m}\}\subset Z, and w=\sum_{j=1}^{m}w_{j}z_{j}\in\operatorname{Span}(S) for w_{1},\ldots,w_{m}\in{\mathbb{R}}, we refer to the set {\rm support}(w):=\{z_{j}~:~j\in\{1,\ldots,m\}\text{ and }w_{j}\neq 0\} as the support of w. The cardinality of {\rm support}(w) is equal to the number of non-zero coefficients in the expansion of w in terms of z_{1},\ldots,z_{m}.

Finally, given a real Banach space ZZ, a r0r\in{\mathbb{R}}_{\geq 0} and a subset AZA\subset Z, we denote the rr-fattening of AA in ZZ by ArA_{r}. That is, we define Ar:={zZ:aA with zaZr}A_{r}:=\left\{z\in Z~{}:~{}\exists a\in A\text{ with }||z-a||_{Z}\leq r\right\}.

6.2 Main Theoretical Result

In this subsection we state our main theoretical result and then discuss its consequences. Our main theoretical result is the following Banach GRIM Convergence theorem for the Banach GRIM algorithm under the choice M:=min{𝒩1,Λ}M:=\min\left\{{\cal N}-1,\Lambda\right\} and that for every t{1,,M}t\in\{1,\ldots,M\} we have kt:=1k_{t}:=1 and st:=1s_{t}:=1.

Theorem 6.2 (Banach GRIM Convergence).

Assume XX is a Banach space with dual-space XX^{\ast}. Let ε>ε00\varepsilon>\varepsilon_{0}\geq 0. Let 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} and set M:=min{𝒩1,Λ}M:=\min\left\{{\cal N}-1,\Lambda\right\}. Let :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\} and ΣX\Sigma\subset X^{\ast} be a finite subset with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.8)

Then there is a non-negative integer N=N(Σ,C,ε,ε0)0N=N(\Sigma,C,\varepsilon,\varepsilon_{0})\in{\mathbb{Z}}_{\geq 0}, given by

N:=max{d:There exists σ1,,σdΣ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,ε,ε0)},N:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Sigma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\varepsilon,\varepsilon_{0}\right)\end{array}\right\}, (6.9)

for which the following is true.

Suppose N\leq M:=\min\{{\cal N}-1,\Lambda\} and consider applying the Banach GRIM algorithm to approximate \varphi on \Sigma with \varepsilon as the accuracy threshold, \varepsilon_{0} as the acceptable recombination error bound, M as the maximum number of steps, s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 1 (cf. Banach GRIM (A)). Then, after at most N steps the algorithm terminates. That is, there is some integer n\in\{1,\ldots,N\} for which the Banach GRIM algorithm terminates after completing n steps. Consequently, there are coefficients c_{1},\ldots,c_{n+1}\in{\mathbb{R}} and indices e(1),\ldots,e(n+1)\in\{1,\ldots,{\cal N}\} with

s=1n+1|cs|fe(s)X=C,\sum_{s=1}^{n+1}\left|c_{s}\right|||f_{e(s)}||_{X}=C, (6.10)

and such that the element uSpan()u\in\operatorname{Span}({\cal F}) defined by

u:=s=1n+1csfe(s)satisfies, for every σΣ, that|σ(φu)|ε.u:=\sum_{s=1}^{n+1}c_{s}f_{e(s)}\qquad\text{satisfies, for every }\sigma\in\Sigma,\text{ that}\qquad|\sigma(\varphi-u)|\leq\varepsilon. (6.11)

Further, given any A>1A>1, we have, for every σReach||||l1(Λ)(Σ,2C,Aε,ε)\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{\Lambda})}}\left(\Sigma,2C,A\varepsilon,\varepsilon\right), that

|σ(φu)|Aε.|\sigma(\varphi-u)|\leq A\varepsilon. (6.12)

Finally, if the coefficients a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} corresponding to φ\varphi (cf. (I) of (6.8)) are all positive (i.e. a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0) then the coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} corresponding to uu (cf. (6.11)) are all non-negative (i.e. c1,,cn+10c_{1},\ldots,c_{n+1}\geq 0).

Remark 6.3.

For the reader's convenience, we recall the specific notation from Subsection 6.1 utilised in the Banach GRIM Convergence Theorem 6.2. Firstly, given l\in{\mathbb{Z}}_{\geq 1}, a subset L=\{\sigma_{1},\ldots,\sigma_{l}\}\subset X^{\ast}, and s\in{\mathbb{R}}_{\geq 0} we have

Span¯||||l1(l)(L,s):={s=1lcsσs:c=(c1,,cl)l and cl1(l)s},\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right):=\left\{\sum_{s=1}^{l}c_{s}\sigma_{s}~{}:~{}c=(c_{1},\ldots,c_{l})\in{\mathbb{R}}^{l}\text{ and }||c||_{l^{1}({\mathbb{R}}^{l})}\leq s\right\}, (6.13)

whilst Span¯||||l1(l)(L,):=Span(L)\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,\infty\right):=\operatorname{Span}\left(L\right). Moreover, given r0r\in{\mathbb{R}}_{\geq 0} we define Span¯||||l1(l)(L,s)r\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right)_{r} to be the rr-fattening of Span¯||||l1(l)(L,s)\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right) in XX^{\ast}. That is

Span¯||||l1(l)(L,s)r:={σX:There exists σSpan¯||||l1(l)(L,s) with ||σσ||Xr}.\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right)_{r}:=\left\{\sigma\in X^{\ast}:\text{There exists }\sigma^{\prime}\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right)\text{ with }||\sigma-\sigma^{\prime}||_{X^{\ast}}\leq r\right\}. (6.14)

In particular, we have Span¯||||l1(l)(L,s)0:=Span¯||||l1(l)(L,s)\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right)_{0}:=\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right). Finally, given constants c>0c>0 and αβ0\alpha\geq\beta\geq 0 we have

Reach||||l1(l)(L,c,α,β):={0sαβSpan¯||||l1(l)(L,s)αsβcif β>0Span(L)αcif β=0.\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,c,\alpha,\beta\right):=\left\{\begin{array}[]{ll}\bigcup_{0\leq s\leq\frac{\alpha}{\beta}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right)_{\frac{\alpha-s\beta}{c}}&\mbox{if }\beta>0\\ \operatorname{Span}(L)_{\frac{\alpha}{c}}&\mbox{if }\beta=0.\end{array}\right. (6.15)

The Banach GRIM Convergence Theorem 6.2 tells us that the maximum number of steps that the Banach GRIM algorithm can complete before terminating is

N:=max{d:There exists σ1,,σdΣ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,ε,ε0)}.N:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Sigma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\varepsilon,\varepsilon_{0}\right)\end{array}\right\}. (6.16)

Recall from Theorem 6.2 (cf. (6.11)) that the number of elements from {\cal F} required to yield the desired approximation of \varphi is no greater than N+1. Consequently, via Theorem 6.2, we conclude not only that no more than N+1 elements from {\cal F} are needed to approximate \varphi throughout \Sigma with an accuracy of \varepsilon, but also that the Banach GRIM algorithm will actually construct such an approximation. In particular, if N<{\cal N}-1 then we are guaranteed that the algorithm will find a linear combination of fewer than {\cal N} of the elements f_{1},\ldots,f_{{\cal N}} that is within \varepsilon of \varphi throughout \Sigma.

Further we observe that NN defined in (6.16) depends on both the features {\cal F} through the constant CC defined in (6.8) and the data Σ\Sigma. The constant CC itself depends only on the weights a1,,a𝒩a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}} and the values f1X,,f𝒩X>0||f_{1}||_{X},\ldots,||f_{{\cal N}}||_{X}\in{\mathbb{R}}_{>0}. No additional constraints are imposed on the collection of features {\cal F}; in particular, we do not assume the existence of a linear combination of fewer than 𝒩{\cal N} of the features in {\cal F} giving a good approximation of φ\varphi throughout Σ\Sigma.

We now relate the quantity NN defined in (6.16) to other geometric properties of the subset ΣX\Sigma\subset X^{\ast}.

For this purpose, let σ1,,σNΣ\sigma_{1},\ldots,\sigma_{N}\in\Sigma and, for each j{1,,N}j\in\{1,\ldots,N\}, define Σj:={σ1,,σj}\Sigma_{j}:=\{\sigma_{1},\ldots,\sigma_{j}\}. Suppose N2N\geq 2 and that for every j{1,,N1}j\in\{1,\ldots,N-1\} we have σj+1Reach||||l1(j)(Σj,2C,ε,ε0)\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Sigma_{j},2C,\varepsilon,\varepsilon_{0}\right).

In the case that ε0=0\varepsilon_{0}=0 this means, recalling (6.15), that σj+1Reach||||l1(j)(Σj,2C,ε,0)=Span(Σj)ε/2C\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Sigma_{j},2C,\varepsilon,0\right)=\operatorname{Span}(\Sigma_{j})_{\varepsilon/2C}. Hence, when ε0=0\varepsilon_{0}=0, the integer NN defined in (6.16) is given by

N=max{d:There exists σ1,,σdΣ such that for every j{1,,d1}we have σj+1Span(Σj)ε2C}.N=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Sigma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Span}(\Sigma_{j})_{\frac{\varepsilon}{2C}}\end{array}\right\}. (6.17)

Thus the maximum number of steps before the Banach GRIM algorithm terminates is determined by the maximum dimension of a linear subspace ΠX\Pi\subset X^{\ast} that has a basis consisting of elements in Σ\Sigma, and yet its ε/2C\varepsilon/2C-fattening fails to capture Σ\Sigma in the sense that ΣΠε/2CΣ\Sigma\cap\Pi_{\varepsilon/2C}\neq\Sigma.
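When \varepsilon_{0}=0 the quantity N of (6.17) can be probed empirically in a finite-dimensional surrogate for X^{\ast} by greedily growing a chain of functionals, each lying further than \varepsilon/2C from the span of those already chosen. The sketch below is only an illustration under strong simplifying assumptions: the functionals are represented as Euclidean vectors, the Euclidean norm and a least-squares projection stand in for the dual norm and the distance to \operatorname{Span}(\Sigma_{j}), and the greedy chain only certifies a lower bound for N.

```python
import numpy as np

def greedy_chain_length(Sigma, eps, C):
    """Lower bound for N of (6.17): grow sigma_1, sigma_2, ... greedily, adding at each
    stage a functional lying further than eps/(2C) from the span of those already chosen.
    Sigma is a (Lambda, d) array whose rows represent the linear functionals."""
    chain = [int(np.argmax(np.linalg.norm(Sigma, axis=1)))]   # sigma_1 is unconstrained
    remaining = [i for i in range(Sigma.shape[0]) if i != chain[0]]
    while remaining:
        A = Sigma[chain].T                                    # columns span Span(Sigma_j)
        def dist_to_span(i):
            coeff, *_ = np.linalg.lstsq(A, Sigma[i], rcond=None)
            return np.linalg.norm(Sigma[i] - A @ coeff)
        j = max(remaining, key=dist_to_span)
        if dist_to_span(j) <= eps / (2 * C):                  # every remaining functional captured
            break
        chain.append(j)
        remaining.remove(j)
    return len(chain)
```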

Now suppose that ε0>0\varepsilon_{0}>0. In this case we have that (cf. (6.15))

Reach||||l1(j)(Σj,2C,ε,ε0)=0tεε0Span¯||||l1(j)(Σj,t)εtε02C\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Sigma_{j},2C,\varepsilon,\varepsilon_{0}\right)=\bigcup_{0\leq t\leq\frac{\varepsilon}{\varepsilon_{0}}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Sigma_{j},t\right)_{\frac{\varepsilon-t\varepsilon_{0}}{2C}} (6.18)

which no longer contains the full linear subspace Span(Σj)\operatorname{Span}(\Sigma_{j}). However, by appealing to Lemma 6.1, we conclude that (cf. (6.5))

conv(Vj)εε02CReach||||l1(j)(Σj,2C,ε,ε0)forVj:={σj,,σ1,0,σ1,,σj}X{\mathrm{conv}}(V_{j})_{\frac{\varepsilon-\varepsilon_{0}}{2C}}\subset\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Sigma_{j},2C,\varepsilon,\varepsilon_{0}\right)\quad\text{for}\quad V_{j}:=\left\{-\sigma_{j},\ldots,-\sigma_{1},0,\sigma_{1},\ldots,\sigma_{j}\right\}\subset X^{\ast} (6.19)

and we use the notation conv(Vj){\mathrm{conv}}(V_{j}) to denote the closed convex hull of VjV_{j} in XX^{\ast}. Consequently, if we define

Nconv:=max{d:There exists σ1,,σdΣ such that j{1,,d1} we haveσj+1conv(Vj)εε02C for Vj:={σj,,σ1,0,σ1,,σj}X},N_{{\mathrm{conv}}}:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Sigma\text{ such that }\forall j\in\{1,\ldots,d-1\}\text{ we have}\\ \sigma_{j+1}\notin{\mathrm{conv}}(V_{j})_{\frac{\varepsilon-\varepsilon_{0}}{2C}}\text{ for }V_{j}:=\left\{-\sigma_{j},\ldots,-\sigma_{1},0,\sigma_{1},\ldots,\sigma_{j}\right\}\subset X^{\ast}\end{array}\right\}, (6.20)

then the integer NN defined in (6.16) is bounded above by NconvN_{{\mathrm{conv}}}, i.e. NNconvN\leq N_{{\mathrm{conv}}}.

We can control NconvN_{{\mathrm{conv}}} defined in (6.20) by particular packing and covering numbers of Σ\Sigma. To elaborate, if ZZ is a Banach space and 𝒰Z{\cal U}\subset Z, then for any r>0r>0 the rr-packing number of 𝒰{\cal U}, denoted by Npack(𝒰,Z,r)N_{\operatorname{pack}}({\cal U},Z,r), and the rr-covering number of 𝒰{\cal U}, denoted by Ncov(𝒰,Z,r)N_{\operatorname{cov}}({\cal U},Z,r), are given by

Npack(𝒰,Z,r):=max{d:z1,,zd𝒰 such that zazbZ>r whenever ab}and\displaystyle N_{\operatorname{pack}}({\cal U},Z,r):=\max\left\{d\in{\mathbb{Z}}:\exists z_{1},\ldots,z_{d}\in{\cal U}\text{ such that }||z_{a}-z_{b}||_{Z}>r\text{ whenever }a\neq b\right\}\quad\text{and} (6.21)
Ncov(𝒰,Z,r):=min{d:z1,,zd𝒰 such that we have the inclusion 𝒰j=1d𝔹¯Z(zj,r)}\displaystyle N_{\operatorname{cov}}({\cal U},Z,r):=\min\left\{d\in{\mathbb{Z}}:\exists z_{1},\ldots,z_{d}\in{\cal U}\text{ such that we have the inclusion }{\cal U}\subset\bigcup_{j=1}^{d}{\overline{\mathbb{B}}}_{Z}(z_{j},r)\right\}

respectively. Our convention throughout this subsection is that balls denoted by 𝔹{\mathbb{B}} are taken to be open, whilst those denoted by 𝔹¯{\overline{\mathbb{B}}} are taken to be closed.

It is now an immediate consequence of (6.20) and (6.21) that NconvN_{{\mathrm{conv}}} is no greater than the (εε0)/2C(\varepsilon-\varepsilon_{0})/2C-packing number of Σ\Sigma, i.e. NconvNpack(Σ,X,(εε0)/2C)N_{{\mathrm{conv}}}\leq N_{\operatorname{pack}}\left(\Sigma,X^{\ast},(\varepsilon-\varepsilon_{0})/2C\right). Moreover, using the well-known fact that the quantities defined in (6.21) satisfy, for any r>0r>0, that Npack(𝒰,Z,2r)Ncov(𝒰,Z,r)Npack(𝒰,Z,r)N_{\operatorname{pack}}({\cal U},Z,2r)\leq N_{\operatorname{cov}}({\cal U},Z,r)\leq N_{\operatorname{pack}}({\cal U},Z,r), we may additionally conclude that NconvNpack(Σ,X,(εε0)/2C)Ncov(Σ,X,(εε0)/4C)N_{{\mathrm{conv}}}\leq N_{\operatorname{pack}}\left(\Sigma,X^{\ast},(\varepsilon-\varepsilon_{0})/2C\right)\leq N_{\operatorname{cov}}\left(\Sigma,X^{\ast},(\varepsilon-\varepsilon_{0})/4C\right). Consequently, both the (εε0)/2C(\varepsilon-\varepsilon_{0})/2C-packing number and the (εε0)/4C(\varepsilon-\varepsilon_{0})/4C-covering number of Σ\Sigma provide upper bounds on the number of steps that the Banach GRIM algorithm can complete before terminating.

In summary, we have that the maximum number of steps NN that the Banach GRIM algorithm can complete before terminating (cf. (6.16)) satisfies that

NNconvNpack(Σ,X,εε02C)Ncov(Σ,X,εε04C).N\leq N_{{\mathrm{conv}}}\leq N_{\operatorname{pack}}\left(\Sigma,X^{\ast},\frac{\varepsilon-\varepsilon_{0}}{2C}\right)\leq N_{\operatorname{cov}}\left(\Sigma,X^{\ast},\frac{\varepsilon-\varepsilon_{0}}{4C}\right). (6.22)

Thus the geometric quantities related to Σ\Sigma appearing in (6.22) provide upper bounds on the number of elements from {\cal F} appearing in the approximation found by the Banach GRIM algorithm. Explicit estimates for these geometric quantities allow us to deduce worst-case bounds for the performance of the Banach GRIM algorithm; we will momentarily establish such bounds for the task of kernel quadrature via estimates for the covering number of the unit ball of a Reproducing Kernel Hilbert Space (RKHS) covered in [JJWWY20].
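In practice the packing number appearing in (6.22) is easily estimated: a greedily grown maximal r-separated subset of \Sigma has cardinality at least N_{\operatorname{cov}}(\Sigma,X^{\ast},r) and at most N_{\operatorname{pack}}(\Sigma,X^{\ast},r). The sketch below does this for functionals represented as vectors, with the Euclidean norm again standing in for the dual norm; it is an illustration only.

```python
import numpy as np

def maximal_separated_subset(Sigma, r):
    """Greedy maximal r-separated subset of the rows of Sigma.
    Its size lies between the r-covering and the r-packing numbers of (6.21)."""
    centres = []
    for sigma in Sigma:
        if all(np.linalg.norm(sigma - c) > r for c in centres):
            centres.append(sigma)
    return len(centres)
```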

Before considering the particular setting of kernel quadrature, we observe how the direction of the bounds in Theorem 6.2 can be, in some sense, reversed. To make this precise, observe that Theorem 6.2 fixes an ε>0\varepsilon>0 as the accuracy we desire for the approximation, and then provides an upper bound on the number of features from the collection {\cal F} that are used by the Banach GRIM algorithm to construct an approximation of φ\varphi that is within ε\varepsilon of φ\varphi throughout Σ\Sigma. However it would be useful to know, for a given fixed n0{2,,𝒩}n_{0}\in\{2,\ldots,{\cal N}\}, how well the Banach GRIM algorithm can approximate φ\varphi throughout Σ\Sigma using no greater than n0n_{0} of the features from {\cal F}. We now illustrate how Theorem 6.2 provides such information.

Consider a fixed n0{2,,𝒩}n_{0}\in\{2,\ldots,{\cal N}\} and let β0=β0(n0,C,Σ,ε0)>0\beta_{0}=\beta_{0}(n_{0},C,\Sigma,\varepsilon_{0})>0 be defined by

β0:=min{λ>ε0:Nλn01}\beta_{0}:=\min\left\{\lambda>\varepsilon_{0}~{}:~{}N_{\lambda}\leq n_{0}-1\right\} (6.23)

where NλN_{\lambda} denotes the integer defined in (6.16) for the constant ε>ε0\varepsilon>\varepsilon_{0} there as λ\lambda here. Then, by applying the Banach GRIM algorithm under the same assumptions as in Theorem 6.2 and the additional choice that ε:=β0\varepsilon:=\beta_{0}, we may conclude from Theorem 6.2 that the Banach GRIM algorithm terminates after no more than n01n_{0}-1 steps. Consequently, the algorithm returns an approximation uu of φ\varphi that is a linear combination of at most n0n_{0} of the features in {\cal F}, and that is within β0\beta_{0} of φ\varphi on Σ\Sigma in the sense that for every σΣ\sigma\in\Sigma we have |σ(φu)|β0|\sigma(\varphi-u)|\leq\beta_{0}. Hence the relation given in (6.23) provides a guaranteed accuracy β0=β0(n0,C,Σ,ε0)>0\beta_{0}=\beta_{0}(n_{0},C,\Sigma,\varepsilon_{0})>0 for how well the Banach GRIM algorithm can approximate φ\varphi with the additional constraint that the approximation is a linear combination of no greater than n0n_{0} of the features in {\cal F}. This guarantee ensures both that there is a linear combination of at most n0n_{0} of the features in {\cal F} that is within β0\beta_{0} of φ\varphi throughout Σ\Sigma and that the Banach GRIM algorithm will find such a linear combination.
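Since N_{\lambda} is non-increasing in \lambda, the accuracy \beta_{0} of (6.23) can be approximated by bisection whenever one has (even approximate) access to the map \lambda\mapsto N_{\lambda}. The callable N_of in the sketch below is such an assumed oracle and is not something provided here; the sketch only records the monotone search.

```python
def guaranteed_accuracy(N_of, n0, eps0, lam_hi, tol=1e-6):
    """Approximate beta_0 of (6.23) by bisection.

    N_of   : assumed callable with N_of(lam) returning N_lambda of (6.16);
             it is non-increasing in lam.
    lam_hi : an accuracy known to satisfy N_of(lam_hi) <= n0 - 1.
    """
    lo, hi = eps0, lam_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if N_of(mid) <= n0 - 1:
            hi = mid                  # mid already achieves the target sparsity
        else:
            lo = mid
    return hi
```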

For the remainder of this subsection we turn our attention to the particular setting of kernel quadrature. Assume that {\cal X}=\{x_{1},\ldots,x_{{\cal N}}\} is a finite subset of some finite dimensional Euclidean space {\mathbb{R}}^{d}, and that k:{\cal X}\times{\cal X}\to{\mathbb{R}} is a continuous symmetric positive semi-definite kernel that is bounded above by 1. For each i\in\{1,\ldots,{\cal N}\} define a continuous function k_{x_{i}}:{\cal X}\to{\mathbb{R}} by setting, for z\in{\cal X}, k_{x_{i}}(z):=k(z,x_{i}). Define K:=\{k_{x_{i}}:i\in\{1,\ldots,{\cal N}\}\}\subset C^{0}({\cal X}) and let {\cal H}_{k} denote the RKHS associated to k. In this case it is known that {\cal H}_{k}=\operatorname{Span}(K) and hence K\subset{\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1).

Suppose a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 so that φ:=i=1𝒩aiδxi[𝒳]\varphi:=\sum_{i=1}^{{\cal N}}a_{i}\delta_{x_{i}}\in{\mathbb{P}}[{\cal X}]. Under the choice that X:=[𝒳]X:={\cal M}[{\cal X}], we observe that the constant CC corresponding to the definition in (6.8) satisfies that C=1C=1. Recall that C0(𝒳)[𝒳]C^{0}({\cal X})\subset{\cal M}[{\cal X}]^{\ast} via the identification of an element ψC0(𝒳)\psi\in C^{0}({\cal X}) with the linear functional [𝒳]{\cal M}[{\cal X}]\to{\mathbb{R}} given by νν[ψ]:=𝒳ψ(z)𝑑ν(z)\nu\mapsto\nu[\psi]:=\int_{{\cal X}}\psi(z)d\nu(z). We abuse notation slightly by referring to both the continuous function in C0(𝒳)C^{0}({\cal X}) and the associated linear functional [𝒳]{\cal M}[{\cal X}]\to{\mathbb{R}} as ψ\psi. By defining Σ:=K\Sigma:=K, the kernel quadrature problem in this setting is to find an empirical measure u[𝒳]u\in{\mathbb{P}}[{\cal X}] whose support is a strict subset of the support of φ\varphi and such that |f(φu)|ε\left|f(\varphi-u)\right|\leq\varepsilon for every fΣf\in\Sigma.
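As a concrete toy instance of this setup, the kernel quadrature data can be assembled directly from the kernel matrix: since the functional \sigma=k_{x_{j}} acts on the feature f_{i}=\delta_{x_{i}} via \sigma(f_{i})=k(x_{i},x_{j}), the (symmetric) kernel matrix itself plays the role of the array of values \sigma(f_{i}) used in the earlier sketches. All sizes below are hypothetical and the snippet is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
Npts = 500
X = rng.uniform(-1.0, 1.0, size=(Npts, 1))        # the finite set {x_1, ..., x_N} in R
K = np.exp(-(X - X.T) ** 2)                        # K[i, j] = k(x_i, x_j), RBF kernel
a = np.full(Npts, 1.0 / Npts)                      # phi = (1/N) sum_i delta_{x_i}, a probability measure

# Row j of K lists the values sigma = k_{x_j} takes on the point masses delta_{x_i},
# so K plays the role of sigma_F in the earlier sketches and sigma(phi) = K @ a.
sigma_F = K
sigma_phi = K @ a
```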

Recalling (6.22), the performance of the Banach GRIM algorithm for this task is controlled by the pointwise (εε0)/4(\varepsilon-\varepsilon_{0})/4-covering number of Σ\Sigma, i.e. by Ncov(Σ,C0(𝒳),(εε0)/4)N_{\operatorname{cov}}(\Sigma,C^{0}({\cal X}),(\varepsilon-\varepsilon_{0})/4). That is, there is an integer MNcov(Σ,C0(𝒳),(εε0)/4)M\leq N_{\operatorname{cov}}(\Sigma,C^{0}({\cal X}),(\varepsilon-\varepsilon_{0})/4) such that the Banach GRIM algorithm finds weights c1,,cM+10c_{1},\ldots,c_{M+1}\geq 0 and indices e(1),,e(M+1){1,,𝒩}e(1),\ldots,e(M+1)\in\{1,\ldots,{\cal N}\} such that u:=s=1M+1csδxe(s)u:=\sum_{s=1}^{M+1}c_{s}\delta_{x_{e(s)}} is a probability measure in [𝒳]{\mathbb{P}}[{\cal X}] satisfying, for every fΣf\in\Sigma, that |φ(f)u(f)|ε|\varphi(f)-u(f)|\leq\varepsilon. A consequence of Σ𝔹¯k(0,1)\Sigma\subset{\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1) is that the (εε0)/4(\varepsilon-\varepsilon_{0})/4-covering number of Σ\Sigma is itself controlled by the (εε0)/4(\varepsilon-\varepsilon_{0})/4-covering number of 𝔹¯k(0,1){\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1), denoted by Ncov(𝔹¯k(0,1),C0(𝒳),(εε0)/4)N_{\operatorname{cov}}\left({\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1),C^{0}({\cal X}),(\varepsilon-\varepsilon_{0})/4\right).

Many authors have considered estimating the covering number of the unit ball of a RKHS, see the works [Zho02, CSW11, Kuh11, SS13, HLLL18, Suz18, JJWWY20, FS21] for example. In our particular setting, the covering number estimates in [JJWWY20] yield explicit performance guarantees for the Banach GRIM as we now illustrate.

In our setting the kernel has a pointwise convergent Mercer decomposition k(x,y)=\sum_{m=1}^{\infty}\lambda_{m}e_{m}(x)e_{m}(y) with \lambda_{1}\geq\lambda_{2}\geq\ldots\geq 0 and \{e_{m}\}_{m=1}^{\infty}\subset L^{2}({\cal X}) being orthonormal [CS08, SS12]. The pairs \{(\lambda_{m},e_{m})\}_{m=1}^{\infty} are the eigenpairs for the operator T_{k}:L^{2}({\cal X})\to L^{2}({\cal X}) defined for f\in L^{2}({\cal X}) by T_{k}[f]:=\int_{{\cal X}}k(\cdot,y)f(y)dy. We assume that the eigenfunctions \{e_{m}\}_{m=1}^{\infty} are uniformly bounded in the sense that for every m\in{\mathbb{Z}}_{\geq 1} and every x\in{\cal X} we have |e_{m}(x)|\leq C_{0} for some constant C_{0}>0. Finally, we assume that the eigenvalues \{\lambda_{m}\}_{m=1}^{\infty} decay exponentially in the sense that for every m\in{\mathbb{Z}}_{\geq 1} we have \lambda_{m}\leq C_{1}e^{-C_{2}m} for constants C_{1},C_{2}>0. These assumptions are satisfied, for example, by the squared exponential (Radial Basis Function) kernel k(s,t):=e^{-(s-t)^{2}} [JJWWY20]; more explicit estimates for this particular choice of kernel may be found in [FS21].

Given any r(0,1)r\in(0,1), it is established in [JJWWY20] (cf. Lemma D.2 of [JJWWY20]) that under these assumptions we have

logNcov(𝔹¯k(0,1),C0(𝒳),r)C3(log(1r)+C4)2\log N_{\operatorname{cov}}({\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1),C^{0}({\cal X}),r)\leq C_{3}\left(\log\left(\frac{1}{r}\right)+C_{4}\right)^{2} (6.24)

for constants C3=C3(C0,C1,C2)>0C_{3}=C_{3}(C_{0},C_{1},C_{2})>0 and C4=C4(C0,C1,C2)>0C_{4}=C_{4}(C_{0},C_{1},C_{2})>0. Assuming that ε<4\varepsilon<4, by appealing to (6.24) for the choice r:=(εε0)/4r:=(\varepsilon-\varepsilon_{0})/4 we may conclude that if C3(log(4εε0)+C4)2<log(𝒩1)C_{3}\left(\log\left(\frac{4}{\varepsilon-\varepsilon_{0}}\right)+C_{4}\right)^{2}<\log({\cal N}-1) then the Banach GRIM algorithm will return a probability measure u[𝒳]u\in{\mathbb{P}}[{\cal X}] given by a linear combination of fewer than 𝒩{\cal N} of the point masses δx1,,δx𝒩\delta_{x_{1}},\ldots,\delta_{x_{{\cal N}}} satisfying, for every fΣf\in\Sigma, that |φ(f)u(f)|ε|\varphi(f)-u(f)|\leq\varepsilon.

Alternatively, given n0{2,,𝒩}n_{0}\in\{2,\ldots,{\cal N}\} define β0=β0(C0,C1,C2,n0,ε0)>0\beta_{0}=\beta_{0}(C_{0},C_{1},C_{2},n_{0},\varepsilon_{0})>0 by

β0:=4eC4e(log(n01)C3)12+ε0=4eC4(n01)(C3log(n01))12+ε0>ε0.\beta_{0}:=4e^{C_{4}}e^{-\left(\frac{\log(n_{0}-1)}{C_{3}}\right)^{\frac{1}{2}}}+\varepsilon_{0}=4e^{C_{4}}\left(n_{0}-1\right)^{-\left(C_{3}\log(n_{0}-1)\right)^{-\frac{1}{2}}}+\varepsilon_{0}>\varepsilon_{0}. (6.25)

Provided n0>1+eC3C42n_{0}>1+e^{C_{3}C_{4}^{2}} we have that β0ε04(0,1)\frac{\beta_{0}-\varepsilon_{0}}{4}\in(0,1). In this case we observe that (6.24) and (6.25) yield that

logNcov(𝔹¯k(0,1),C0(𝒳),β0ε04)log(n01).\log N_{\operatorname{cov}}\left({\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1),C^{0}({\cal X}),\frac{\beta_{0}-\varepsilon_{0}}{4}\right)\leq\log(n_{0}-1). (6.26)

We deduce from Theorem 6.2, with ε:=β0\varepsilon:=\beta_{0}, that the algorithm finds a probability measure u[𝒳]u\in{\mathbb{P}}[{\cal X}] given by a linear combination of no more than n0n_{0} of the point masses δx1,,δx𝒩\delta_{x_{1}},\ldots,\delta_{x_{{\cal N}}} satisfying, for every fΣf\in\Sigma, that |φ(f)u(f)|β0|\varphi(f)-u(f)|\leq\beta_{0}.

As n_{0} increases, \beta_{0} defined in (6.25) eventually decays slower than n_{0}^{-a} for any a>0. This poor asymptotic behaviour is not unexpected for an estimate that is itself a combination of worst-case scenario estimates. However, we may still observe, for any integer A\in{\mathbb{Z}}_{\geq 1} large enough to ensure that 1+e^{C_{3}C_{4}^{2}}<n_{0}\leq 1+e^{A^{2}/C_{3}} (which in particular requires A>C_{3}C_{4}), that \beta_{0}\leq 4e^{C_{4}}(n_{0}-1)^{-1/A}+\varepsilon_{0}.
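For a rough sense of scale, (6.25) is straightforward to evaluate once values for C_{3} and C_{4} are fixed; the constants used in the snippet below are purely hypothetical placeholders rather than values taken from [JJWWY20].

```python
import math

def beta0(n0, eps0, C3, C4):
    """Evaluate the guaranteed accuracy (6.25) for at most n0 point masses."""
    assert n0 > 1 + math.exp(C3 * C4 ** 2), "needed so that (beta0 - eps0)/4 lies in (0, 1)"
    return 4.0 * math.exp(C4) * math.exp(-math.sqrt(math.log(n0 - 1) / C3)) + eps0

print(beta0(n0=1000, eps0=0.0, C3=0.5, C4=1.0))   # hypothetical constants C3, C4
```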


6.3 Supplementary Lemmata

In this subsection we record several lemmata that will be used during our proof of the Banach GRIM Convergence Theorem 6.2 in Subsection 6.4. The following result records the consequences arising from knowing that an approximation uSpan()u\in\operatorname{Span}({\cal F}) is close to φ\varphi at a finite set of linear functionals in XX^{\ast}.

Lemma 6.4 (Finite Influenced Set).

Assume that XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1}, and ={f1,,f𝒩}X{0}{\cal F}=\{f_{1},\ldots,f_{\cal N}\}\subset X\setminus\{0\}. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.27)

Suppose that d1d\in{\mathbb{Z}}_{\geq 1} with L={σ1,,σd}XL=\left\{\sigma_{1},\ldots,\sigma_{d}\right\}\subset X^{\ast}, and that uSpan()u\in\operatorname{Span}({\cal F}) satisfies both that uXC||u||_{X}\leq C and, for every σL\sigma\in L, that |σ(φu)|θ0|\sigma(\varphi-u)|\leq\theta_{0}. Then, using our notation conventions from Section 6.1 (see also Remark 6.3), for every σReach||||l1(d)(L,2C,θ,θ0)\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,2C,\theta,\theta_{0}\right) we have that

|σ(φu)|θ.|\sigma(\varphi-u)|\leq\theta. (6.28)
Proof of Lemma 6.4.

Assume that XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1}, and ={f1,,f𝒩}X{0}{\cal F}=\{f_{1},\ldots,f_{\cal N}\}\subset X\setminus\{0\}. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by (I) and (II) in (6.27) respectively. Suppose that d1d\in{\mathbb{Z}}_{\geq 1} and that L={σ1,,σd}XL=\left\{\sigma_{1},\ldots,\sigma_{d}\right\}\subset X^{\ast}. Let uSpan()u\in\operatorname{Span}({\cal F}) and assume both that uXC||u||_{X}\leq C and that, for every σL\sigma\in L, we have |σ(φu)|θ0|\sigma(\varphi-u)|\leq\theta_{0}. It follows from (I) and (II) of (6.27) that φXC||\varphi||_{X}\leq C. Hence φuX2C||\varphi-u||_{X}\leq 2C.

We deal with the cases θ0=0\theta_{0}=0 and θ0(0,θ)\theta_{0}\in(0,\theta) separately.

First suppose that θ0=0\theta_{0}=0. In this case we have that σ1(φu)==σd(φu)=0\sigma_{1}(\varphi-u)=\ldots=\sigma_{d}(\varphi-u)=0, and that

Reach||||l1(d)(L,2C,θ,0)=def0tSpan¯||||l1(d)(L,t)θ2C=Span(L)θ2C.\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,2C,\theta,0\right)\stackrel{{\scriptstyle\text{def}}}{{=}}\bigcup_{0\leq t\leq\infty}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right)_{\frac{\theta}{2C}}=\operatorname{Span}(L)_{\frac{\theta}{2C}}. (6.29)

As a consequence of (6.29), we need only establish the estimate in (6.28) for σSpan(L)θ/2C\sigma\in\operatorname{Span}(L)_{\theta/2C}. Assuming σSpan(L)θ/2C\sigma\in\operatorname{Span}(L)_{\theta/2C}, then there is some ρSpan(L)\rho\in\operatorname{Span}(L) with σρXθ/2C||\sigma-\rho||_{X^{\ast}}\leq\theta/2C. Since ρSpan(L)\rho\in\operatorname{Span}(L) there are coefficients c1,,cdc_{1},\ldots,c_{d}\in{\mathbb{R}} for which ρ=j=1dcjσj\rho=\sum_{j=1}^{d}c_{j}\sigma_{j}. Evidently we have that ρ(φu)=j=1dcjσj(φu)=0\rho(\varphi-u)=\sum_{j=1}^{d}c_{j}\sigma_{j}(\varphi-u)=0. Consequently we may estimate that

|σ(φu)||(σρ)(φu)|+|ρ(φu)|σρXφuX2C(θ2C)=θ.|\sigma(\varphi-u)|\leq|(\sigma-\rho)(\varphi-u)|+|\rho(\varphi-u)|\leq||\sigma-\rho||_{X^{\ast}}||\varphi-u||_{X}\leq 2C\left(\frac{\theta}{2C}\right)=\theta. (6.30)

The arbitrariness of σ\sigma means we may conclude from (6.30) that (6.28) is valid as claimed when θ0=0\theta_{0}=0.

Now we suppose that θ0(0,θ)\theta_{0}\in(0,\theta).

Since Span¯||||l1(d)(L,0)={0}X\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}(L,0)=\{0\}\subset X^{\ast}, if σSpan¯||||l1(d)(L,0)θ/2C\sigma\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}(L,0)_{\theta/2C} then σXθ/2C||\sigma||_{X^{\ast}}\leq\theta/2C. Consequently |σ(φu)|σXφuXθ|\sigma(\varphi-u)|\leq||\sigma||_{X^{\ast}}||\varphi-u||_{X}\leq\theta as claimed in (6.28).

Now consider t(0,θ/θ0]t\in(0,\theta/\theta_{0}] and let σSpan¯||||l1(d)(L,t)θtθ02C\sigma\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right)_{\frac{\theta-t\theta_{0}}{2C}}. Then there is some ρSpan¯||||l1(d)(L,t)\rho\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right) with σρXθtθ02C||\sigma-\rho||_{X^{\ast}}\leq\frac{\theta-t\theta_{0}}{2C}. Further, there are coefficients c1,,cdc_{1},\ldots,c_{d}\in{\mathbb{R}} with |c1|++|cd|t|c_{1}|+\ldots+|c_{d}|\leq t and such that ρ=j=1dcjσj\rho=\sum_{j=1}^{d}c_{j}\sigma_{j}. Consequently, we may estimate that

|σ(φu)||(σρ)(φu)|+|ρ(φu)|σρXφuX+j=1d|cj||σj(φu)|2C(θtθ02C)+θ0j=1d|cj|θtθ0+tθ0=θ.|\sigma(\varphi-u)|\leq|(\sigma-\rho)(\varphi-u)|+|\rho(\varphi-u)|\leq||\sigma-\rho||_{X^{\ast}}||\varphi-u||_{X}+\sum_{j=1}^{d}|c_{j}||\sigma_{j}(\varphi-u)|\\ \leq 2C\left(\frac{\theta-t\theta_{0}}{2C}\right)+\theta_{0}\sum_{j=1}^{d}|c_{j}|\leq\theta-t\theta_{0}+t\theta_{0}=\theta. (6.31)

Since t(0,θ/θ0]t\in(0,\theta/\theta_{0}] and σSpan¯||||l1(d)(L,t)θtθ02C\sigma\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right)_{\frac{\theta-t\theta_{0}}{2C}} were both arbitrary, we may conclude that (6.31) is valid whenever σ0<tθ/θ0Span¯||||l1(d)(L,t)θtθ02C\sigma\in\bigcup_{0<t\leq\theta/\theta_{0}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right)_{\frac{\theta-t\theta_{0}}{2C}}.

The combination of the preceding two paragraphs allows us to conclude, for \theta_{0}\in(0,\theta), that whenever \sigma\in\bigcup_{0\leq t\leq\theta/\theta_{0}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right)_{\frac{\theta-t\theta_{0}}{2C}}=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,2C,\theta,\theta_{0}\right) we have |\sigma(\varphi-u)|\leq\theta. Consequently, (6.28) is valid in the case that \theta_{0}\in(0,\theta). Having earlier established the validity of (6.28) when \theta_{0}=0, this completes the proof of Lemma 6.4. ∎

We can use the result of Lemma 6.4 to prove that the linear functionals selected during the Banach GRIM algorithm, under the choice M:=min{𝒩1,Λ}M:=\min\left\{{\cal N}-1,\Lambda\right\} and that for every t{1,,M}t\in\{1,\ldots,M\} we have kt:=1k_{t}:=1 and st:=1s_{t}:=1, are separated by a definite XX^{\ast}-distance. The precise result is the following lemma.

Lemma 6.5 (Banach GRIM Functional Separation).

Assume XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, and 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} such that M:=min{𝒩1,Λ}2M:=\min\left\{{\cal N}-1,\Lambda\right\}\geq 2. Suppose that :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\}, and that ΓX\Gamma\subset X^{\ast} is finite with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.32)

Consider applying the Banach GRIM algorithm to approximate \varphi on \Gamma with \theta as the accuracy threshold, \theta_{0} as the acceptable recombination error bound, M as the maximum number of steps, s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 1 (cf. Banach GRIM (A)). Suppose that m\in{\mathbb{Z}}_{\geq 2} and the algorithm reaches and carries out the m^{th} step without terminating. For each l\in\{1,\ldots,m\} let \sigma_{l}\in\Gamma be the linear functional selected at the l^{\text{th}} step, and let u_{l}\in\operatorname{Span}({\cal F}) be the element found via recombination (cf. the recombination thinning Lemma 3.1) such that, for every s\in\{1,\ldots,l\}, we have |\sigma_{s}(\varphi-u_{l})|\leq\theta_{0} (cf. Banach GRIM (C) and (D)). Further, for each l\in\{1,\ldots,m\}, let \Gamma_{l}:=\{\sigma_{1},\ldots,\sigma_{l}\}\subset\Gamma and \Omega_{l}:=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(\Gamma_{l},2C,\theta,\theta_{0}\right) where we use our notation conventions from Section 6.1 (see also Remark 6.3). Then for any l\in\{1,\ldots,m\} and every \sigma\in X^{\ast} we have that

|σ(φul)|min{2CσX,2CdistX(σ,Ωl)+θ}.|\sigma(\varphi-u_{l})|\leq\min\left\{2C||\sigma||_{X^{\ast}}~{},~{}2C\operatorname{dist}_{X^{\ast}}(\sigma,\Omega_{l})+\theta\right\}. (6.33)

In particular, for every σΩl\sigma\in\Omega_{l} we have

|σ(φul)|θ.\left|\sigma(\varphi-u_{l})\right|\leq\theta. (6.34)

Finally, as a consequence we have, for every l{2,,m}l\in\{2,\ldots,m\}, that

σlΩl1:=Reach||||l1(l1)(Γl1,2C,θ,θ0).\sigma_{l}\notin\Omega_{l-1}:=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},2C,\theta,\theta_{0}\right). (6.35)
Remark 6.6.

Using the same notation as in Lemma 6.5, we claim that (6.35) ensures, for every j{1,,l1}j\in\{1,\ldots,l-1\}, that

σlσjXθθ02C.||\sigma_{l}-\sigma_{j}||_{X^{\ast}}\geq\frac{\theta-\theta_{0}}{2C}. (6.36)

To see this, recall that (cf. Subsection 6.1 or Remark 6.3)

Reach||||l1(l1)(Γl1,2C,θ,θ0)=def0τθθ0Span¯||||l1(l1)(Γl1,τ)θτθ02C.\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},2C,\theta,\theta_{0}\right)\stackrel{{\scriptstyle\text{def}}}{{=}}\bigcup_{0\leq\tau\leq\frac{\theta}{\theta_{0}}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},\tau\right)_{\frac{\theta-\tau\theta_{0}}{2C}}. (6.37)

Since θ>θ0\theta>\theta_{0} we may take τ:=1\tau:=1 in (6.37) to conclude via (6.35) that

σlSpan¯||||l1(l1)(Γl1,1)θθ02C.\sigma_{l}\notin\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},1\right)_{\frac{\theta-\theta_{0}}{2C}}. (6.38)

Recall that (cf. Subsection 6.1 or Remark 6.3)

Span¯||||l1(l1)(Γl1,1)=def{s=1l1csσs:c=(c1,,cl1)l1 with cl1(l1)1}.\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},1\right)\stackrel{{\scriptstyle\text{def}}}{{=}}\left\{\sum_{s=1}^{l-1}c_{s}\sigma_{s}~{}:~{}c=(c_{1},\ldots,c_{l-1})\in{\mathbb{R}}^{l-1}\text{ with }||c||_{l^{1}({\mathbb{R}}^{l-1})}\leq 1\right\}. (6.39)

A consequence of (6.39) is that for every j{1,,l1}j\in\{1,\ldots,l-1\} we have σjSpan¯||||l1(l1)(Γl1,1)\sigma_{j}\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},1\right), hence (6.38) means σlσjX>θθ02C||\sigma_{l}-\sigma_{j}||_{X^{\ast}}>\frac{\theta-\theta_{0}}{2C}, which in particular gives (6.36).

Proof of Lemma 6.5.

Assume XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, and 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} such that M:=min{𝒩1,Λ}2M:=\min\left\{{\cal N}-1,\Lambda\right\}\geq 2. Suppose that :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\}, and that ΓX\Gamma\subset X^{\ast} is finite with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()X\varphi\in\operatorname{Span}({\cal F})\subset X and C>0C>0 by (cf. (I) and (II) of (6.32) respectively)

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.40)

Consider applying the Banach GRIM algorithm to approximate φ\varphi on Γ\Gamma with θ\theta as the accuracy threshold, θ0\theta_{0} as the acceptable recombination error bound, MM as the maximum number of steps, s1==sM=1s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k1,,kMk_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 11 (cf. Banach GRIM (A)). We now follow the second step of the Banach GRIM algorithm, i.e. Banach GRIM (B). For each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} let a~i:=|ai|\tilde{a}_{i}:=|a_{i}| and f~i\tilde{f}_{i} be given by fif_{i} if ai>0a_{i}>0 and fi-f_{i} if ai<0a_{i}<0. Evidently, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} we have f~iX=fiX\left|\left|\tilde{f}_{i}\right|\right|_{X}=||f_{i}||_{X}. Moreover, we also have that a~1,,a~𝒩>0\tilde{a}_{1},\ldots,\tilde{a}_{{\cal N}}>0 and φ=i=1𝒩a~if~i\varphi=\sum_{i=1}^{{\cal N}}\tilde{a}_{i}\tilde{f}_{i}. We additionally rescale f~i\tilde{f}_{i} for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} to have unit XX norm. That is (cf. Banach GRIM (B)), for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} set hi:=f~ifiXh_{i}:=\frac{\tilde{f}_{i}}{||f_{i}||_{X}} and αi:=a~ifiX\alpha_{i}:=\tilde{a}_{i}||f_{i}||_{X}. Then observe both that CC satisfies

C:=i=1𝒩|ai|fiX=i=1𝒩a~ifiX=i=1𝒩αi,C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}=\sum_{i=1}^{{\cal N}}\tilde{a}_{i}||f_{i}||_{X}=\sum_{i=1}^{{\cal N}}\alpha_{i}, (6.41)

and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, that αihi=a~if~i=aifi\alpha_{i}h_{i}=\tilde{a}_{i}\tilde{f}_{i}=a_{i}f_{i}. Therefore the expansion for φ\varphi in (I) of (6.40) is equivalent to

φ=i=1𝒩αihi, and hence φXi=1𝒩αihiX=i=1𝒩αi=(6.41)C.\varphi=\sum_{i=1}^{{\cal N}}\alpha_{i}h_{i},\qquad\text{ and hence }\qquad||\varphi||_{X}\leq\sum_{i=1}^{{\cal N}}\alpha_{i}||h_{i}||_{X}=\sum_{i=1}^{{\cal N}}\alpha_{i}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_separation_varphi_const}})}}}{{=}}C. (6.42)

Turning our attention to steps Banach GRIM (C) and (D) of the Banach GRIM algorithm, suppose that m2m\in{\mathbb{Z}}_{\geq 2} and that the mthm^{\text{th}} step of the Banach GRIM algorithm is completed without triggering the early termination criterion. For each l{1,,m}l\in\{1,\ldots,m\} let σlΓ\sigma_{l}\in\Gamma be the linear functional selected at the lthl^{\text{th}} step, and let ulSpan()u_{l}\in\operatorname{Span}({\cal F}) be the element found via recombination (cf. the recombination thinning Lemma 3.1) such that, for every s{1,,l}s\in\{1,\ldots,l\}, we have |σs(φul)|θ0|\sigma_{s}(\varphi-u_{l})|\leq\theta_{0} (cf. Banach GRIM (C) and (D)). Define Γl:={σ1,,σl}Γ\Gamma_{l}:=\left\{\sigma_{1},\ldots,\sigma_{l}\right\}\subset\Gamma.

For each l{1,,m}l\in\{1,\ldots,m\} observe that min{𝒩,l+1}=l+1\min\left\{{\cal N},l+1\right\}=l+1. Hence the recombination thinning Lemma 3.1 additionally tells us that there are non-negative coefficients bl,1,,bl,l+10b_{l,1},\ldots,b_{l,l+1}\geq 0 and indices el(1),,el(l+1){1,,𝒩}e_{l}(1),\ldots,e_{l}(l+1)\in\{1,\ldots,{\cal N}\} for which

ul=s=1l+1bl,shel(s)ands=1l+1bl,s=i=1𝒩αi.u_{l}=\sum_{s=1}^{l+1}b_{l,s}h_{e_{l}(s)}\qquad\text{and}\qquad\sum_{s=1}^{l+1}b_{l,s}=\sum_{i=1}^{{\cal N}}\alpha_{i}. (6.43)

A consequence of (6.43) is that

ulXs=1l+1bl,shel(s)X=s=1l+1bl,s=(6.43)i=1𝒩αi=(6.41)C.||u_{l}||_{X}\leq\sum_{s=1}^{l+1}b_{l,s}\left|\left|h_{e_{l}(s)}\right|\right|_{X}=\sum_{s=1}^{l+1}b_{l,s}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_sep_lemma_u_n_coeff_sum}})}}}{{=}}\sum_{i=1}^{{\cal N}}\alpha_{i}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_separation_varphi_const}})}}}{{=}}C. (6.44)

Further, for each l{1,,m}l\in\{1,\ldots,m\}, let

Ωl:=Reach||||l1(l)(Γl,2C,θ,θ0)=def0tθθ0Span¯||||l1(l)(Γl,t)θtθ02C\Omega_{l}:=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(\Gamma_{l},2C,\theta,\theta_{0}\right)\stackrel{{\scriptstyle\text{def}}}{{=}}\bigcup_{0\leq t\leq\frac{\theta}{\theta_{0}}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(\Gamma_{l},t\right)_{\frac{\theta-t\theta_{0}}{2C}} (6.45)

where we use our notation conventions from Section 6.1. With our notation fixed, we turn our attention to verifying the claims made in (6.33), (6.34), and (6.35).

We begin by noting that (6.34) is an immediate consequence of appealing to Lemma 6.4 with ll and ΓlX\Gamma_{l}\subset X^{\ast} playing the roles of integer dd and finite subset LXL\subset X^{\ast} there respectively. Indeed by doing so we may conclude that (cf. (6.28))

for every σReach||||l1(l)(Γl,2C,θ,θ0)=:(6.45)Ωlwe have|σ(φul)|θ\text{for every }\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(\Gamma_{l},2C,\theta,\theta_{0}\right)\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{Omega_l_def_sep_lemma_pf}})}}}{{=:}}\Omega_{l}\quad\text{we have}\quad|\sigma(\varphi-u_{l})|\leq\theta (6.46)

as claimed in (6.34).

To establish (6.35), fix l{2,,m}l\in\{2,\ldots,m\} and consider σlΓ\sigma_{l}\in\Gamma. A consequence of the Banach GRIM algorithm completing the lthl^{\text{th}} step without terminating is that (cf. Banach GRIM algorithm steps (C) and (D))

|σl(φul1)|>θ.|\sigma_{l}(\varphi-u_{l-1})|>\theta. (6.47)

However, since we have established that (6.34) is true, we know that if σΩl1\sigma\in\Omega_{l-1} then |σ(φul1)|θ|\sigma(\varphi-u_{l-1})|\leq\theta. Consequently, (6.47) tells us that σlΩl1\sigma_{l}\notin\Omega_{l-1} as claimed in (6.35).

To establish (6.33), fix l{1,,m}l\in\{1,\ldots,m\} and consider any σX\sigma\in X^{\ast}. Then we may estimate that

|σ(φul)|σXφulX(6.42)&(6.44)2CσX.\left|\sigma(\varphi-u_{l})\right|\leq||\sigma||_{X^{\ast}}||\varphi-u_{l}||_{X}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_sep_lemma_vph_def&bd}})}~{}\&~{}{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_sep_lemma_ul_norm_bd}})}}}{{\leq}}2C||\sigma||_{X^{\ast}}. (6.48)

Alternatively, let ρΩl\rho\in\Omega_{l} and use the triangle inequality, together with the fact that by (6.34) we have |ρ(φul)|θ|\rho(\varphi-u_{l})|\leq\theta, to compute that

|σ(φul)|σρXφulX+θ(6.42)&(6.44)2CσρX+θ.|\sigma(\varphi-u_{l})|\leq||\sigma-\rho||_{X^{\ast}}||\varphi-u_{l}||_{X}+\theta\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_sep_lemma_vph_def&bd}})}~{}\&~{}{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_sep_lemma_ul_norm_bd}})}}}{{\leq}}2C||\sigma-\rho||_{X^{\ast}}+\theta. (6.49)

Taking the infimum over ρΩl\rho\in\Omega_{l} in (6.49) yields

|σ(φul)|2CdistX(σ,Ωl)+θ.|\sigma(\varphi-u_{l})|\leq 2C\operatorname{dist}_{X^{\ast}}(\sigma,\Omega_{l})+\theta. (6.50)

Together, (6.48) and (6.50) yield that for any σX\sigma\in X^{\ast} we have

|σ(φul)|min{2CσX,2CdistX(σ,Ωl)+θ}|\sigma(\varphi-u_{l})|\leq\min\left\{2C||\sigma||_{X^{\ast}}~{},~{}2C\operatorname{dist}_{X^{\ast}}(\sigma,\Omega_{l})+\theta\right\} (6.51)

as claimed in (6.33). This completes the proof of Lemma 6.5. ∎

We can use Lemma 6.5 to establish an upper bound on the maximum number of steps the Banach GRIM algorithm can run for before terminating. The precise statement is the following lemma.

Lemma 6.7 (Banach GRIM Number of Steps Bound).

Assume XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, and 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} such that M:=min{𝒩1,Λ}1M:=\min\left\{{\cal N}-1,\Lambda\right\}\geq 1. Suppose that :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\}, and that ΓX\Gamma\subset X^{\ast} is finite with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.52)

Then there is a non-negative integer N=N(Γ,C,θ,θ0)0N=N(\Gamma,C,\theta,\theta_{0})\in{\mathbb{Z}}_{\geq 0}, given by

N:=max{d:There exists σ1,,σdΓ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,θ,θ0)},N:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Gamma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\theta,\theta_{0}\right)\end{array}\right\}, (6.53)

where we use our notation conventions from Section 6.1 (see also Remark 6.3), for which the following is true.

Suppose NM:=min{𝒩1,Λ}N\leq M:=\min\left\{{\cal N}-1,\Lambda\right\} and consider applying the Banach GRIM algorithm to approximate φ\varphi on Γ\Gamma with θ\theta as the accuracy threshold, θ0\theta_{0} as the acceptable recombination error bound, MM as the maximum number of steps, s1==sM=1s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k1,,kMk_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 11 (cf. Banach GRIM (A)). Then, after at most NN steps the algorithm terminates. That is, there is some integer n{1,,N}n\in\{1,\ldots,N\} for which the Banach GRIM algorithm terminates after completing nn steps. Consequently, there are coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} and indices e(1),,e(n+1){1,,𝒩}e(1),\ldots,e(n+1)\in\{1,\ldots,{\cal N}\} with

s=1n+1|cs|fe(s)X=C,\sum_{s=1}^{n+1}\left|c_{s}\right|||f_{e(s)}||_{X}=C, (6.54)

and such that the element uSpan()u\in\operatorname{Span}({\cal F}) defined by

u:=s=1n+1csfe(s)satisfies, for every σΓ, that|σ(φu)|θ.u:=\sum_{s=1}^{n+1}c_{s}f_{e(s)}\qquad\text{satisfies, for every }\sigma\in\Gamma,\text{ that}\qquad|\sigma(\varphi-u)|\leq\theta. (6.55)

In fact there are linear functionals Γn={σ1,,σn}Γ\Gamma_{n}=\left\{\sigma_{1},\ldots,\sigma_{n}\right\}\subset\Gamma such that

for every σΩn:=Reach||||l1(n)(Γn,2C,θ,θ0)we have|σ(φu)|θ.\text{for every }\sigma\in\Omega_{n}:=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{n})}}\left(\Gamma_{n},2C,\theta,\theta_{0}\right)\quad\text{we have}\quad\left|\sigma(\varphi-u)\right|\leq\theta. (6.56)

Moreover, if the coefficients a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} corresponding to φ\varphi (cf. (I) of (6.52)) are all positive (i.e. a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0) then the coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} corresponding to uu (cf. (6.55)) are all non-negative (i.e. c1,,cn+10c_{1},\ldots,c_{n+1}\geq 0).

Remark 6.8.

Recall that the Banach GRIM algorithm is guaranteed to terminate after MM steps. Consequently, the restriction to the case that NMN\leq M is sensible since it is only in this case that terminating after no more than NN steps is a non-trivial statement.

If N<MN<M then Lemma 6.7 guarantees that the Banach GRIM algorithm will find an approximation uSpan()u\in\operatorname{Span}({\cal F}) of φ\varphi that is a linear combination of less than 𝒩{\cal N} of the elements f1,,f𝒩f_{1},\ldots,f_{{\cal N}} but is within θ\theta of φ\varphi throughout Γ\Gamma in the sense that |σ(φu)|θ|\sigma(\varphi-u)|\leq\theta for every σΓ\sigma\in\Gamma.

Remark 6.9.

By invoking Lemma 6.5, and using the same notation as in Lemma 6.7, we can additionally conclude that for every σX\sigma\in X^{\ast} we have

|σ(φu)|min{2CσX,2CdistX(σ,Ωn)+θ}.|\sigma(\varphi-u)|\leq\min\left\{2C||\sigma||_{X^{\ast}}~{},~{}2C\operatorname{dist}_{X^{\ast}}\left(\sigma,\Omega_{n}\right)+\theta\right\}. (6.57)
Proof of Lemma 6.7.

Assume XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, and 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} such that M:=min{𝒩1,Λ}1M:=\min\left\{{\cal N}-1,\Lambda\right\}\geq 1. Suppose that :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\}, and that ΓX\Gamma\subset X^{\ast} is finite with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by (I) and (II) of (6.52) respectively. That is

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.58)

With a view to later applying Banach GRIM to approximate φ\varphi on Γ\Gamma, for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} let a~i:=|ai|\tilde{a}_{i}:=|a_{i}| and f~i\tilde{f}_{i} be given by fif_{i} if ai>0a_{i}>0 and fi-f_{i} if ai<0a_{i}<0. For every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} we evidently have f~iX=fiX\left|\left|\tilde{f}_{i}\right|\right|_{X}=||f_{i}||_{X}. Moreover, we also have that a~1,,a~𝒩>0\tilde{a}_{1},\ldots,\tilde{a}_{{\cal N}}>0 and φ=i=1𝒩a~if~i\varphi=\sum_{i=1}^{{\cal N}}\tilde{a}_{i}\tilde{f}_{i}. Further, we rescale f~i\tilde{f}_{i} for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} to have unit XX norm. That is (cf. Banach GRIM (B)), for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} set hi:=f~ifiXh_{i}:=\frac{\tilde{f}_{i}}{||f_{i}||_{X}} and αi:=a~ifiX\alpha_{i}:=\tilde{a}_{i}||f_{i}||_{X}. Observe both that CC satisfies

C:=i=1𝒩|ai|fiX=i=1𝒩a~ifiX=i=1𝒩αi,C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}=\sum_{i=1}^{{\cal N}}\tilde{a}_{i}||f_{i}||_{X}=\sum_{i=1}^{{\cal N}}\alpha_{i}, (6.59)

and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, that αihi=a~if~i=aifi\alpha_{i}h_{i}=\tilde{a}_{i}\tilde{f}_{i}=a_{i}f_{i}. Therefore the expansion for φ\varphi in (I) of (6.58) is equivalent to

φ=i=1𝒩αihi, and hence φXi=1𝒩αihiX=i=1𝒩αi=(6.59)C.\varphi=\sum_{i=1}^{{\cal N}}\alpha_{i}h_{i},\qquad\text{ and hence }\qquad||\varphi||_{X}\leq\sum_{i=1}^{{\cal N}}\alpha_{i}||h_{i}||_{X}=\sum_{i=1}^{{\cal N}}\alpha_{i}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_step_num_varphi_const}})}}}{{=}}C. (6.60)

Define a non-negative integer N=N(Γ,C,θ,θ0)0N=N(\Gamma,C,\theta,\theta_{0})\in{\mathbb{Z}}_{\geq 0} by

N:=max{d:There exists σ1,,σdΓ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,θ,θ0)}.N:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Gamma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\theta,\theta_{0}\right)\end{array}\right\}. (6.61)

Suppose NM:=min{𝒩1,Λ}N\leq M:=\min\left\{{\cal N}-1,\Lambda\right\} and consider applying the Banach GRIM algorithm to approximate φ\varphi on Γ\Gamma with θ\theta as the accuracy threshold, θ0\theta_{0} as the acceptable recombination error bound, MM as the maximum number of steps, s1==sM=1s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k1,,kMk_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 11 (cf. Banach GRIM (A)).

We first prove that the algorithm terminates after at most NN steps have been completed. Let σ1Γ\sigma_{1}\in\Gamma be the linear functional chosen in the first step (cf. Banach GRIM (A)), and u1Span()u_{1}\in\operatorname{Span}({\cal F}) be the approximation found via recombination (cf. the recombination thinning Lemma 3.1) satisfying, in particular, that |σ1(φu1)|θ0|\sigma_{1}(\varphi-u_{1})|\leq\theta_{0}. Define Γ1:={σ1}Γ\Gamma_{1}:=\{\sigma_{1}\}\subset\Gamma. We conclude, via Lemma 6.5 (cf. (6.34)), that for every σReach||||l1()(Γ1,2C,θ,θ0)\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}})}}\left(\Gamma_{1},2C,\theta,\theta_{0}\right) we have |σ(φu1)|θ|\sigma(\varphi-u_{1})|\leq\theta.

If N=1N=1 then (6.61) means there is no σΓ\sigma\in\Gamma for which σReach||||l1()(Γ1,2C,θ,θ0)\sigma\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}})}}\left(\Gamma_{1},2C,\theta,\theta_{0}\right). Consequently, ΓReach||||l1()(Γ1,2C,θ,θ0)=Γ\Gamma\cap\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}})}}\left(\Gamma_{1},2C,\theta,\theta_{0}\right)=\Gamma, and so we have established that for every σΓ\sigma\in\Gamma we have |σ(φu1)|θ|\sigma(\varphi-u_{1})|\leq\theta. Recalling Banach GRIM (D), this means that the algorithm terminates before step 22 is completed. Hence the algorithm terminates after completing N=1N=1 step.

If N2N\geq 2 then we note that if the stopping criterion in Banach GRIM (D) is triggered at the start of step m{2,,N}m\in\{2,\ldots,N\} then we evidently have that the algorithm has terminated after carrying out no more than NN steps. Consequently, we need only deal with the case in which the algorithm reaches and carries out step NN without terminating. In this case, we claim that the algorithm terminates after completing step NN, i.e. that the termination criterion at the start of step N+1N+1 is triggered.

Before proving this we fix some notation. Recalling Banach GRIM (D), for l{2,,N}l\in\{2,\ldots,N\} let σlΓ\sigma_{l}\in\Gamma denote the new linear functional selected at step ll, define Γl:={σ1,,σl}\Gamma_{l}:=\left\{\sigma_{1},\ldots,\sigma_{l}\right\}, and let ulSpan()u_{l}\in\operatorname{Span}({\cal F}) be the approximation found by recombination (cf. the recombination thinning Lemma 3.1) satisfying, for every s{1,,l}s\in\{1,\ldots,l\}, that |σs(φul)|θ0|\sigma_{s}(\varphi-u_{l})|\leq\theta_{0}.

By appealing to Lemma 6.5 we deduce both that for any l{2,,N}l\in\{2,\ldots,N\} we have (cf. (6.35))

σlReach||||l1(l1)(Γl1,2C,θ,θ0),\sigma_{l}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},2C,\theta,\theta_{0}\right), (6.62)

and that for any σReach||||l1(l)(Γl,2C,θ,θ0)\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(\Gamma_{l},2C,\theta,\theta_{0}\right) we have (cf. (6.34))

|σ(φul)|θ.|\sigma(\varphi-u_{l})|\leq\theta. (6.63)

Consider step N+1N+1 of the algorithm at which we examine K:=max{|σ(φuN)|:σΓ}K:=\max\left\{|\sigma(\varphi-u_{N})|:\sigma\in\Gamma\right\}. If KθK\leq\theta then the algorithm terminates without carrying out step N+1N+1, and thus has terminated after carrying out NN steps as claimed.

Assume that K>θK>\theta so that σN+1:=argmax{|σ(φuN)|:σΓ}\sigma_{N+1}:=\operatorname{argmax}\left\{|\sigma(\varphi-u_{N})|:\sigma\in\Gamma\right\} satisfies that |σN+1(φuN)|>θ|\sigma_{N+1}(\varphi-u_{N})|>\theta. It follows from (6.63) for l:=Nl:=N that σN+1Reach||||l1(N)(ΓN,2C,θ,θ0)\sigma_{N+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{N})}}\left(\Gamma_{N},2C,\theta,\theta_{0}\right). But it then follows from this and (6.62) that σ1,,σN+1Γ\sigma_{1},\ldots,\sigma_{N+1}\in\Gamma satisfy, for every j{1,,N}j\in\{1,\ldots,N\}, that σj+1Reach||||l1(j)(Γj,2C,θ,θ0)\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Gamma_{j},2C,\theta,\theta_{0}\right). In which case (6.61) yields that

N:=(6.61)max{d:There exists σ1,,σdΓ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,θ,θ0)}N+1N\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_step_bound_lemma_r_packing_pf}})}}}{{:=}}\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Gamma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\theta,\theta_{0}\right)\end{array}\right\}\geq N+1

which is evidently a contradiction. Thus we must have that KθK\leq\theta, and hence that the algorithm must terminate before carrying out step N+1N+1.

Having established the claimed upper bound on the number of steps before the Banach GRIM algorithm terminates, we turn our attention to the properties claimed for the approximation returned after the algorithm terminates. Let n{1,,N}n\in\{1,\ldots,N\} be the integer for which the Banach GRIM algorithm terminates after step nn. Recalling Banach GRIM (C) and (D), let Γn={σ1,,σn}Γ\Gamma_{n}=\{\sigma_{1},\ldots,\sigma_{n}\}\subset\Gamma be the nn linear functionals selected by the end of the nthn^{\text{th}} step, and let uSpan()u\in\operatorname{Span}({\cal F}) denote the approximation found via recombination (cf. the recombination thinning Lemma 3.1) satisfying, for every s{1,,n}s\in\{1,\ldots,n\}, that |σs(φu)|θ0|\sigma_{s}(\varphi-u)|\leq\theta_{0}.

A consequence of Lemma 6.5 is that for every σReach||||l1(n)(Γn,2C,θ,θ0)\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{n})}}\left(\Gamma_{n},2C,\theta,\theta_{0}\right) we have (cf. (6.34))

|σ(φu)|θ|\sigma(\varphi-u)|\leq\theta (6.64)

as claimed in (6.56). Moreover, since we have established that the algorithm terminates by triggering the stopping criterion after completing step nn, we have, for every σΓ\sigma\in\Gamma, that |σ(φu)|θ|\sigma(\varphi-u)|\leq\theta as claimed in the second part of (6.55).

To establish (6.54) and the first part of (6.55), first note that min{𝒩,n+1}=n+1\min\left\{{\cal N},n+1\right\}=n+1. Thus Lemma 3.1 additionally tells us that recombination returns non-negative coefficients b1,,bn+10b_{1},\ldots,b_{n+1}\geq 0, with

s=1n+1bs=i=1𝒩αi=(6.59)C,\sum_{s=1}^{n+1}b_{s}=\sum_{i=1}^{{\cal N}}\alpha_{i}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_step_num_varphi_const}})}}}{{=}}C, (6.65)

and indices e(1),,e(n+1){1,,𝒩}e(1),\ldots,e(n+1)\in\{1,\ldots,{\cal N}\} for which

u=s=1n+1bshe(s)=s=1n+1bsfe(s)Xf~e(s).u=\sum_{s=1}^{n+1}b_{s}h_{e(s)}=\sum_{s=1}^{n+1}\frac{b_{s}}{\left|\left|f_{e(s)}\right|\right|_{X}}\tilde{f}_{e(s)}. (6.66)

For each s{1,,n+1}s\in\{1,\ldots,n+1\}, we define cs:=bsfe(s)Xc_{s}:=\frac{b_{s}}{\left|\left|f_{e(s)}\right|\right|_{X}} if f~e(s)=fe(s)\tilde{f}_{e(s)}=f_{e(s)} (which we recall is the case if ae(s)>0a_{e(s)}>0) and cs:=bsfe(s)Xc_{s}:=-\frac{b_{s}}{\left|\left|f_{e(s)}\right|\right|_{X}} if f~e(s)=fe(s)\tilde{f}_{e(s)}=-f_{e(s)} (which we recall is the case if ae(s)<0a_{e(s)}<0). Then (6.66) gives the expansion for uSpan()Xu\in\operatorname{Span}({\cal F})\subset X in terms of the elements f1,,f𝒩f_{1},\ldots,f_{{\cal N}} claimed in the first part of (6.55). Moreover, from (6.65) we have that

s=1n+1|cs|fe(s)X=s=1n+1bs=(6.65)C\sum_{s=1}^{n+1}\left|c_{s}\right|\left|\left|f_{e(s)}\right|\right|_{X}=\sum_{s=1}^{n+1}b_{s}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_step_bound_lemma_um_coeff_sum_pf}})}}}{{=}}C (6.67)

as claimed in (6.54).

It remains only to prove that if the coefficients a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} are all positive (i.e. a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0), then the resulting coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} are all non-negative (i.e. c1,,cn+10c_{1},\ldots,c_{n+1}\geq 0). To see this, observe that if a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 then, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, we have that f~i=fi\tilde{f}_{i}=f_{i}. Consequently, for every s{1,,n+1}s\in\{1,\ldots,n+1\} we have that f~e(s)=fe(s)\tilde{f}_{e(s)}=f_{e(s)}, and so by definition we have cs=bsfe(s)Xc_{s}=\frac{b_{s}}{\left|\left|f_{e(s)}\right|\right|_{X}}. Since bs0b_{s}\geq 0, it follows that cs0c_{s}\geq 0. This completes the proof of Lemma 6.7. ∎

6.4 Proof of Main Theoretical Result

In this subsection we prove the Banach GRIM Convergence Theorem 6.2 by combining Lemmas 6.4, 6.5, and 6.7.

Proof of Theorem 6.2.

Assume XX is a Banach space with dual-space XX^{\ast}. Let ε>ε00\varepsilon>\varepsilon_{0}\geq 0. Let 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} and set M:=min{𝒩1,Λ}M:=\min\left\{{\cal N}-1,\Lambda\right\}. Let :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\} and ΣX\Sigma\subset X^{\ast} be a finite subset with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.68)

Define a non-negative integer N=N(Σ,C,ε,ε0)0N=N(\Sigma,C,\varepsilon,\varepsilon_{0})\in{\mathbb{Z}}_{\geq 0}, given by

N:=max{d:There exists σ1,,σdΣ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,ε,ε0)}.N:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Sigma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\varepsilon,\varepsilon_{0}\right)\end{array}\right\}. (6.69)

Suppose NM:=min{𝒩1,Λ}N\leq M:=\min\left\{{\cal N}-1,\Lambda\right\} and consider applying the Banach GRIM algorithm to approximate φ\varphi on Σ\Sigma with ε\varepsilon as the accuracy threshold, ε0\varepsilon_{0} as the acceptable recombination error bound, MM as the maximum number of steps, s1==sM=1s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k1,,kMk_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 11 (cf. Banach GRIM (A)). We appeal to Lemma 6.7, with ε\varepsilon, ε0\varepsilon_{0} and Σ\Sigma here as the θ\theta, θ0\theta_{0} and Γ\Gamma there respectively, to conclude that there is some integer n{1,,N}n\in\{1,\ldots,N\} for which the algorithm terminates after step nn. Thus the Banach GRIM algorithm terminates after completing, at most, NN steps as claimed.

Lemma 6.7 additionally tells us that there are coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} and indices e(1),,e(n+1){1,,𝒩}e(1),\ldots,e(n+1)\in\{1,\ldots,{\cal N}\} with (cf. (6.54))

s=1n+1|cs|fe(s)X=C\sum_{s=1}^{n+1}|c_{s}|||f_{e(s)}||_{X}=C (6.70)

and such that the element uSpan()u\in\operatorname{Span}({\cal F}) defined by (cf. (6.55))

u:=s=1n+1csfe(s)satisfies, for every σΣ that|σ(φu)|ε.u:=\sum_{s=1}^{n+1}c_{s}f_{e(s)}\quad\text{satisfies, for every }\sigma\in\Sigma\text{ that}\quad|\sigma(\varphi-u)|\leq\varepsilon. (6.71)

Observe that (6.70) is precisely the claim (6.10), whilst (6.71) is precisely the claim (6.11).

The final consequence of Lemma 6.7 that we note is that if the coefficients a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} associated to φ\varphi (cf. (I) of (6.68)) are all positive (i.e. a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0) then the coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} associated with uu (cf. (6.71)) are all non-negative (i.e. c1,,cn+10c_{1},\ldots,c_{n+1}\geq 0). This is precisely the preservation of non-negative coefficients claimed in Theorem 6.2.

It only remains to verify the claim made in (6.12). For this purpose let A>1A>1 and define

Ω:=Reach||||l1(Λ)(Σ,2C,Aε,ε)=def0tASpan¯||||l1(Λ)(Σ,t)(At)ε2C.\Omega:=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{\Lambda})}}\left(\Sigma,2C,A\varepsilon,\varepsilon\right)\stackrel{{\scriptstyle\text{def}}}{{=}}\bigcup_{0\leq t\leq A}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{\Lambda})}}\left(\Sigma,t\right)_{\frac{(A-t)\varepsilon}{2C}}. (6.72)

Then observe that ΣX\Sigma\subset X^{\ast} is a finite subset of cardinality Λ\Lambda and that (6.70) and (6.71) mean that uSpan()u\in\operatorname{Span}({\cal F}) satisfies both that uXC||u||_{X}\leq C and that, for every σΣ\sigma\in\Sigma, we have |σ(φu)|ε|\sigma(\varphi-u)|\leq\varepsilon. Therefore we may apply Lemma 6.4, with the integer dd, the finite subset LXL\subset X^{\ast}, and the real numbers θ>θ00\theta>\theta_{0}\geq 0 of that result as the integer Λ\Lambda, the finite subset Σ\Sigma, and the real numbers Aε>ε>0A\varepsilon>\varepsilon>0 here, to deduce that (cf. (6.28)) for every σΩ\sigma\in\Omega we have |σ(φu)|Aε|\sigma(\varphi-u)|\leq A\varepsilon. Recalling the definition of the set Ω\Omega in (6.72), this is precisely the estimate claimed in (6.12). This completes the proof of Theorem 6.2. ∎

7 Numerical Examples

In this section we compare GRIM with several existing techniques on two different reduction tasks. The first task we consider is motivated by an example appearing in Section 4 of the work [MMPY15] by Maday et al. concerning GEIM.

We consider the Hilbert space X:=L2(0,1)X:=L^{2}(0,1), and given (a,b)[0.01,24.9]×[0,15](a,b)\in[0.01,24.9]\times[0,15] we define fa,bXf_{a,b}\in X by

fa,b(x):=11+(25+acos(bx))x2.f_{a,b}(x):=\frac{1}{\sqrt{1+(25+a\cos(bx))x^{2}}}. (7.1)

For a chosen integer 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1} we consider the collection X{\cal F}\subset X defined by

:={fa,b:(a,b)[0.01,24.9]𝒩×[0,15]𝒩}.{\cal F}:=\left\{f_{a,b}:(a,b)\in[0.01,24.9]_{{\cal N}}\times[0,15]_{{\cal N}}\right\}. (7.2)

Here, for real numbers c,dc,d\in{\mathbb{R}}, we use the notation [c,d]𝒩[c,d]_{{\cal N}} to denote a partition of the interval [c,d][c,d] by 𝒩{\cal N} equally spaced points.

For a chosen M1M\in{\mathbb{Z}}_{\geq 1} and s>0s>0, we fix an equally spaced partition y1,,yM[0,1]y_{1},\ldots,y_{M}\in[0,1] and consider the collection Σ={σk:k{1,,M}}X\Sigma=\left\{\sigma_{k}:k\in\{1,\ldots,M\}\right\}\subset X^{\ast} where, for k{1,,M}k\in\{1,\ldots,M\} and ψX\psi\in X,

σk(ψ):=01ψ(x)𝑑ρk(x)withdρk(x)=e(xyk)22s2dx.\sigma_{k}(\psi):=\fint_{0}^{1}\psi(x)d\rho_{k}(x)\qquad\text{with}\qquad d\rho_{k}(x)=e^{-\frac{(x-y_{k})^{2}}{2s^{2}}}dx. (7.3)

Finally, we fix the choice a1==a𝒩=1a_{1}=\ldots=a_{{\cal N}}=1 for the coefficients. The task is to approximate the function φ:=ff\varphi:=\sum_{f\in{\cal F}}f over the collection Σ\Sigma of linear functionals.
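To make this setup concrete, the following is a minimal NumPy sketch of the discretised problem. The grid resolution, the interpretation of the averaged integral in (7.3) as an average with respect to ρ_k, and all helper names are illustrative assumptions rather than details of our actual implementation.

```python
import numpy as np

# Uniform grid used to discretise [0,1]; the resolution is illustrative.
x = np.linspace(0.0, 1.0, 2000)

def f_ab(a, b):
    # f_{a,b}(x) = 1 / sqrt(1 + (25 + a cos(bx)) x^2), cf. (7.1).
    return 1.0 / np.sqrt(1.0 + (25.0 + a * np.cos(b * x)) * x**2)

# Parameter grid from (7.2): equally spaced values of a and of b.
N_param = 20
a_vals = np.linspace(0.01, 24.9, N_param)
b_vals = np.linspace(0.0, 15.0, N_param)
F = np.array([f_ab(a, b) for a in a_vals for b in b_vals])

# Target: the sum of the functions in F (all coefficients equal to 1).
varphi = F.sum(axis=0)

# Functionals sigma_k from (7.3): Gaussian-weighted averages centred at
# M equally spaced points y_1, ..., y_M with bandwidth s.
M, s = 1000, 5e-4
y = np.linspace(0.0, 1.0, M)
w = np.exp(-((x[None, :] - y[:, None]) ** 2) / (2.0 * s**2))  # shape (M, len(x))
mass = np.trapz(w, x, axis=1)                                  # rho_k([0,1])

def apply_sigmas(psi):
    # Approximates (sigma_1(psi), ..., sigma_M(psi)) by quadrature on the grid,
    # normalising by rho_k([0,1]) to realise the averaged integral in (7.3).
    return np.trapz(w * psi[None, :], x, axis=1) / mass

data = apply_sigmas(varphi)  # the values GRIM / GEIM / LASSO must reproduce
```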

For the tests we fix the values M:=1000M:=1000, s:=5×104s:=5\times 10^{-4}, and consider each of the values 𝒩=20,25,30{\cal N}=20,25,30 individually. We compare our implementation of GRIM against our own implementation of GEIM [MM13, MMT14, MMPY15] (which, in particular, makes use of the recursive relation established in [MMPY15]) and the Scikit-learn implementation of LASSO [BBCDDGGMPPPPTVVW11]. The results are summarised in Table 1 below. For each approximation uu we record the L2L^{2}-norm of the difference φu\varphi-u and the sup\sup-norm of the difference φu\varphi-u over Σ\Sigma, i.e. the values (01(φ(x)u(x))2𝑑x)1/2\left(\int_{0}^{1}(\varphi(x)-u(x))^{2}dx\right)^{1/2} and max{|σ(φu)|:σΣ}\max\left\{|\sigma(\varphi-u)|:\sigma\in\Sigma\right\}.

GRIM GEIM LASSO
𝒩=20{\cal N}=20 16non-zero weightsL2norm=0.15supnorm=0.49\begin{aligned} &16~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.15\\ &\sup-\text{norm}=0.49\end{aligned} 20non-zero weightsL2norm=0.15supnorm=0.64\begin{aligned} &20~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.15\\ &\sup-\text{norm}=0.64\end{aligned} 90non-zero weightsL2norm=0.19supnorm=0.66\begin{aligned} &90~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.19\\ &\sup-\text{norm}=0.66\end{aligned}
𝒩=25{\cal N}=25 20non-zero weightsL2norm=0.04supnorm=0.16\begin{aligned} &20~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.04\\ &\sup-\text{norm}=0.16\end{aligned} 27non-zero weightsL2norm=0.04supnorm=0.26\begin{aligned} &27~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.04\\ &\sup-\text{norm}=0.26\end{aligned} 135non-zero weightsL2norm=0.30supnorm=1.04\begin{aligned} &135~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.30\\ &\sup-\text{norm}=1.04\end{aligned}
𝒩=30{\cal N}=30 19non-zero weightsL2norm=0.07supnorm=0.23\begin{aligned} &19~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.07\\ &\sup-\text{norm}=0.23\end{aligned} 24non-zero weightsL2norm=0.15supnorm=0.72\begin{aligned} &24~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.15\\ &\sup-\text{norm}=0.72\end{aligned} 176non-zero weightsL2norm=0.43supnorm=1.51\begin{aligned} &176~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.43\\ &\sup-\text{norm}=1.51\end{aligned}
Table 1: The number of non-zero weights and the L2L^{2} and sup\sup norms of the difference between φ\varphi and the approximation are recorded for each technique. The L2L^{2} and sup\sup norm values are recorded to 2 decimal places. For each value of 𝒩{\cal N} we first find the LASSO approximation. Then we record the values for the first GEIM approximation that at least matches the LASSO approximation on both L2L^{2} and sup\sup norm values. Finally we record the values for the first GRIM approximation that at least matches the GEIM approximation on both L2L^{2} and sup\sup norm values.

In each case the GRIM approximation is a linear combination of fewer functions in {\cal F} than both the GEIM approximation and the LASSO approximation. Moreover, the GRIM approximation of φ\varphi is at least as good as both the GEIM approximation and the LASSO approximation in both the L2L^{2}-norm sense and the sup\sup-norm sense.

The second task we consider is a kernel quadrature problem appearing in Section 3.2 of [HLO21]. In particular, we consider the 3D Road Network data set [JKY13] of 434874 elements in 3{\mathbb{R}}^{3} and the Combined Cycle Power Plant data set [GKT12] of 9568 elements in 5{\mathbb{R}}^{5}. For the 3D Road Network data set we take a random subset Ω\Omega of size 43487434874/1043487\approx 434874/10, whilst for the Combined Cycle Power Plant data set we take Ω\Omega to be the full 9568 elements.

In both cases we consider X:=[Ω]X:={\cal M}[\Omega] to be the Banach space of signed measures on Ω\Omega, and take {\cal F} to be the collection of point masses supported at points in Ω\Omega, i.e. :={δp:pΩ}[Ω]{\cal F}:=\left\{\delta_{p}:p\in\Omega\right\}\subset{\cal M}[\Omega]. For the collection of linear functionals Σ[Ω]\Sigma\subset{\cal M}[\Omega]^{\ast} we take Σ\Sigma to be the closed unit ball of the Reproducing Kernel Hilbert Space (RKHS) k{\cal H}_{k} associated to the kernel k:Ω×Ωk:\Omega\times\Omega\to{\mathbb{R}} defined by

k(x,y):=exy22λ2.k(x,y):=e^{-\frac{||x-y||^{2}}{2\lambda^{2}}}. (7.4)

Here ||||||\cdot|| denotes the Euclidean norm on the appropriate Euclidean space, and λ\lambda is a hyperparameter determined by median heuristics [HLO21]. We let φ\varphi denote the equally weighted probability measure over Ω\Omega and consider the kernel quadrature problem for φ\varphi with respect to the RKHS k{\cal H}_{k}. By defining 𝒩:=#(Ω){\cal N}:=\#(\Omega) and a1==a𝒩=1/𝒩a_{1}=\ldots=a_{{\cal N}}=1/{\cal N}, we observe that this problem is within our framework.
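The sketch below indicates how this setup can be instantiated. The sampled-pairs version of the median heuristic and the synthetic stand-in for Ω are illustrative assumptions, not the exact choices made in [HLO21].

```python
import numpy as np

def rbf_kernel(X, Y, lam):
    # k(x, y) = exp(-||x - y||^2 / (2 lam^2)), cf. (7.4); X, Y have points as rows.
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lam**2))

def median_heuristic(X, n_pairs=100_000, seed=0):
    # lam taken as the median distance between randomly sampled pairs of points;
    # an approximation to the median heuristic referred to in the text.
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), n_pairs)
    j = rng.integers(0, len(X), n_pairs)
    return np.median(np.linalg.norm(X[i] - X[j], axis=1))

# Stand-in for Omega (rows are data points); the real data sets are described above.
Omega = np.random.default_rng(1).normal(size=(9568, 5))
lam = median_heuristic(Omega)

# varphi is the uniform probability measure on Omega; a quadrature rule is a sparse
# measure u = sum_s w_s * delta_{q_s}, with q_s in Omega, that should act like varphi
# on the unit ball of the RKHS H_k.
```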

In addition to GRIM, we also implement a modified version of GRIM, which we denote GRIM ++ opt. The ’++ opt’ refers to applying the convex optimisation detailed in [HLO21] to the weights returned by GRIM at each step. The performance of GRIM and GRIM ++ opt is compared with the performance of the methods N. ++ emp, N. ++ emp ++ opt, Monte Carlo, iid Bayes, Thinning, Thin ++ opt, Herding, and Herd ++ opt considered in [HLO21]. Details of these methods may be found in [HLO21] and the references therein. We make extensive use of the Python code associated with [HLO21] available via GitHub (Convex Kernel Quadrature GitHub).

We implement GRIM under the condition that, at each step, 4 new functions from Σ\Sigma are added to the collection of functions over which we require the approximation to coincide with the target φ\varphi. The performance of each approximation is measured by its Worst Case Error (WCE) with respect to φ\varphi over the RKHS k{\cal H}_{k}. This is defined as

WCE(u,φ,k):=supf𝔹¯k(0,1)|u(f)φ(f)|=supf𝔹¯k(0,1)|Ωf(x)𝑑u(x)Ωf(x)𝑑φ(x)|,\text{WCE}(u,\varphi,{\cal H}_{k}):=\sup_{f\in{\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1)}\left|u(f)-\varphi(f)\right|=\sup_{f\in{\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1)}\left|\int_{\Omega}f(x)du(x)-\int_{\Omega}f(x)d\varphi(x)\right|, (7.5)

and may be explicitly computed in this setting by the formula provided in [HLO21]. For each method we record the average log(WCE(u,φ,k)2)\log(\text{WCE}(u,\varphi,{\cal H}_{k})^{2}) over 20 trials. The results are illustrated in Figures 1 and 2.
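Since both uu and φ\varphi are discrete measures here, the worst case error admits a closed form. The sketch below uses the standard RKHS identity WCE² = ∬ k d(u−φ) d(u−φ) together with the rbf_kernel helper from the previous sketch; it is an illustration of this computation, not a reproduction of the exact routine used in [HLO21].

```python
def wce_squared(w, Q, Omega, lam):
    # Worst case error squared of u = sum_s w[s] * delta_{Q[s]} against the uniform
    # measure on Omega, expanding WCE^2 = iint k d(u - varphi) d(u - varphi) as
    #   w^T K(Q,Q) w  -  (2/N) w^T K(Q,Omega) 1  +  (1/N^2) 1^T K(Omega,Omega) 1.
    # Uses rbf_kernel from the previous sketch.
    N = len(Omega)
    term1 = w @ rbf_kernel(Q, Q, lam) @ w
    term2 = 2.0 * w @ rbf_kernel(Q, Omega, lam).sum(axis=1) / N
    term3 = rbf_kernel(Omega, Omega, lam).sum() / N**2  # compute blockwise for large Omega
    return term1 - term2 + term3
```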

For both the 3D Road Network and the Combined Cycle Power Plant data sets, the novel recombination-based convex kernel quadrature method developed by Satoshi Hayakawa, the first author, and Harald Oberhauser in [HLO21] and our GRIM approach comfortably outperform the other methods, each of which is either purely growth-based or purely thinning-based. Moreover, the convex kernel quadrature method of [HLO21] is specifically tailored to the task of kernel quadrature. Whilst being significantly slower, GRIM ++ opt nevertheless matches the performance of N. ++ emp ++ opt despite not being specially designed for the task of kernel quadrature. Further, even without the additional ’++ opt’ convex optimisation step, GRIM remains within the same class of performance as the N. ++ emp ++ opt method.

Figure 1: 3D Road Network Results - The average log(WCE(u,φ,k)2)\log(\text{WCE}(u,\varphi,{\cal H}_{k})^{2}) over 20 trials is plotted for each method. The shaded regions show their standard deviations.
Figure 2: Combined Cycle Power Plant Results - The average log(WCE(u,φ,k)2)\log(\text{WCE}(u,\varphi,{\cal H}_{k})^{2}) over 20 trials is plotted for each method. The shaded regions show their standard deviations.

The third and final task we consider is a machine learning inference acceleration task motivated by [JLNSY17]. The problem is outside the Hilbert space framework of the previous examples. We consider the path signature based Landmark Human Action Recognition (LHAR) model developed in [JLNSY17].

This model is trained to determine an action from video clips of the evolution of 15-40 markers placed on a person's body. It utilises signatures of streams generated by the locations of these markers. An introduction to the use of path signatures in machine learning contexts can be found in the survey papers [CK16] and [LM22]. We do not assume any familiarity with signatures and treat them as a ’black box’ tool used by the LHAR model of [JLNSY17].

We consider the LHAR model from [JLNSY17] on the JHMDB data set [BGJSZ13] consisting of 928 clips of 21 distinct actions, with each clip containing 15-40 frames. Following [JLNSY17], the dataset is split into 660 clips to be used for training and 268 clips to be used for testing.

The pipeline for the LHAR model from [JLNSY17] can be summarised as

ClipSignatures430920Linear Map21Bias21Softmax{0,,20}.\text{Clip}\quad\stackrel{{\scriptstyle\text{Signatures}}}{{\rightarrow}}\quad{\mathbb{R}}^{430920}\quad\stackrel{{\scriptstyle\text{Linear Map}}}{{\rightarrow}}\quad{\mathbb{R}}^{21}\quad\stackrel{{\scriptstyle\text{Bias}}}{{\rightarrow}}\quad{\mathbb{R}}^{21}\quad\stackrel{{\scriptstyle\text{Softmax}}}{{\rightarrow}}\quad\{0,\ldots,20\}. (7.6)

A variety of truncated path signatures are computed from augmentations of the stream of marker locations provided by the clip; see [JLNSY17] for full details. The result is that each clip is transformed into a vector of N:=430920N:=430920 features, i.e. an element in 430920=N{\mathbb{R}}^{430920}={\mathbb{R}}^{N}. Let Ωtrain430920\Omega_{{\mathrm{train}}}\subset{\mathbb{R}}^{430920} denote the collection of these vectors for the 660 training clips, and Ωtest430920\Omega_{{\mathrm{test}}}\subset{\mathbb{R}}^{430920} denote the collection of these vectors for the 268 testing clips. A single-hidden-layer neural network is trained as the classifier outputting a probability distribution given by a Softmax over the 21 class labels {0,,20}\{0,\ldots,20\}.

Let A:N21A:{\mathbb{R}}^{N}\to{\mathbb{R}}^{21} denote the Linear Map part of the model in (7.6). For later convenience, given any map v:N21v:{\mathbb{R}}^{N}\to{\mathbb{R}}^{21} we let [v]{\cal L}[v] denote the model from [JLNSY17] with the linear map AA replaced by vv (i.e. the pipeline described in (7.6) but with the Linear Map AA replaced by the function vv). We consider approximating the model [A]{\cal L}[A]. Our approach is to find an approximation vv of AA, and then use the model [v]{\cal L}[v] as the approximation of [A]{\cal L}[A]. For this purpose, we consider both N{\mathbb{R}}^{N} and 21{\mathbb{R}}^{21} to be equipped with their respective ll^{\infty} norms. We first observe that the task of approximating AA is within the mathematical framework of this paper.

Let (ai,j)i=1,j=121,N(a_{i,j})_{i=1,j=1}^{21,N} denote the coefficients of the matrix associated to AA, and for each i{1,,N}i\in\{1,\ldots,N\} define fiC0(N;21)f_{i}\in C^{0}\left({\mathbb{R}}^{N};{\mathbb{R}}^{21}\right) by, for x=(x1,,xN)Nx=(x_{1},\ldots,x_{N})\in{\mathbb{R}}^{N}, setting fi(x):=xi(a1,i,,a21,i)f_{i}(x):=x_{i}\left(a_{1,i},\ldots,a_{21,i}\right). Let e1,,e2121e_{1},\ldots,e_{21}\in{\mathbb{R}}^{21} be the standard basis of 21{\mathbb{R}}^{21} that is orthonormal with respect to the standard Euclidean dot product ,21\left<\cdot,\cdot\right>_{{\mathbb{R}}^{21}} on 21{\mathbb{R}}^{21}. For each pΩtrainp\in\Omega_{{\mathrm{train}}} and each j{1,,21}j\in\{1,\ldots,21\} let δp,j:C0(N;21)\delta_{p,j}:C^{0}({\mathbb{R}}^{N};{\mathbb{R}}^{21})\to{\mathbb{R}} be defined by δp,j[f]:=f(p),ej21\delta_{p,j}[f]:=\left<f(p),e_{j}\right>_{{\mathbb{R}}^{21}}. Then A=i=1NfiA=\sum_{i=1}^{N}f_{i} and hence, by choosing X:=C0(Ω;21)X:=C^{0}\left(\Omega;{\mathbb{R}}^{21}\right), 𝒩:=N(=430920){\cal N}:=N(=430920), φ:=A\varphi:=A, and Σ:={δp,j:pΩtrain and j{1,,21}}\Sigma:=\left\{\delta_{p,j}:p\in\Omega_{{\mathrm{train}}}\text{ and }j\in\{1,\ldots,21\}\right\}, we observe that we are within the framework for which GRIM is designed (cf. Section 2).
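The decomposition just described is straightforward to realise in code. The sketch below uses a placeholder array for the trained weight matrix; in practice the entries of AA are read off from the trained LHAR model.

```python
import numpy as np

# Placeholder for the trained weight matrix of the Linear Map in (7.6);
# rows index the 21 outputs, columns the N = 430920 signature features.
n_cls, N_feat = 21, 430920
A = np.zeros((n_cls, N_feat))

def f_i(i, x):
    # f_i(x) = x_i * (a_{1,i}, ..., a_{21,i}): feature i's contribution to A(x).
    return x[i] * A[:, i]

def delta_pj(p, j, f):
    # delta_{p,j}[f] = <f(p), e_j>: the j-th coordinate of f evaluated at the point p,
    # where f is any map from R^N to R^21.
    return f(p)[j]

# A(x) = sum_i f_i(i, x), so a sparse approximation of A amounts to keeping only a
# small subset of the f_i, which is exactly the problem GRIM is designed to address.
```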

To provide a benchmark we use the Scikit-Learn MultiTaskLasso [BBCDDGGMPPPPTVVW11] implementation to approximate φ\varphi. Several different choices for the regularisation parameter α\alpha are considered. For each α\alpha we consider the Scikit-Learn MultiTaskLasso [BBCDDGGMPPPPTVVW11] with the maximum number of iterations set to 10000. We record the number of non-zero weights in the resulting approximation uu, the training set accuracy achieved by [u]{\cal L}[u], and the testing set accuracy achieved by [u]{\cal L}[u]. These results are summarised in Table 2.
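A minimal sketch of this benchmark is given below, assuming the training features and the corresponding targets φ(p)=A(p) have been assembled into arrays; the names X_train and Y_train are illustrative.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

def lasso_benchmark(X_train, Y_train, alpha):
    # X_train: (660, 430920) signature features, Y_train: (660, 21) targets phi(p) = A(p).
    model = MultiTaskLasso(alpha=alpha, max_iter=10000)
    model.fit(X_train, Y_train)
    # MultiTaskLasso zeroes entire feature columns jointly across all 21 outputs,
    # so counting the retained features gives one natural count of non-zero weights.
    n_nonzero = int(np.any(model.coef_ != 0.0, axis=0).sum())
    return model, n_nonzero

# for alpha in (1e-4, 2e-4, 3e-4, 4e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2):
#     model, n_nonzero = lasso_benchmark(X_train, Y_train, alpha)
```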

Alpha Number of Non-zero Weights Train Accuracy (%) (2.d.p) Test Accuracy (%) (2.d.p)
0.0001 5464 100.00 81.34
0.0002 4611 100.00 82.09
0.0003 4217 100.00 83.21
0.0004 4021 100.00 82.84
0.0005 4009 100.00 82.46
0.001 3913 100.00 82.46
0.005 3157 99.70 81.72
0.01 2455 99.24 81.72
0.05 1012 95.30 79.85
Table 2: Each row contains information corresponding to one application of MultiTaskLasso [BBCDDGGMPPPPTVVW11] to find an approximation uu of φ\varphi. The model [u]{\cal L}[u] then gives an approximation to the model [φ]{\cal L}[\varphi] from [JLNSY17] on the JHMDB dataset [BGJSZ13]. The first column records the value of the regularisation parameter α\alpha for this application of MultiTaskLasso. The second column records the number of non-zero weights appearing in the returned approximation uu. The third column records the accuracy achieved on the training set (to 2 decimal places) by the model [u]{\cal L}[u]. The fourth column records the accuracy achieved on the test set (to 2 decimal places) by the model [u]{\cal L}[u].

In terms of the test accuracy score achieved by the model [u]{\cal L}[u], the choice of α:=3×104\alpha:=3\times 10^{-4} performed best. For this choice of α\alpha the corresponding model [u]{\cal L}[u] achieved an accuracy score of 83.21%83.21\% (2.d.p) on the test set whilst only using 4217 non-zero weights. In comparison, the original model [φ]{\cal L}[\varphi] developed in [JLNSY17] uses 430920 non-zero weights, and achieves an accuracy score of 83.58%83.58\% (2.d.p) on the test set.

With the Lasso performance acting as a benchmark, we consider applying GRIM to approximate φ\varphi. We consider several choices for the number of new points from Σ\Sigma to be added to the collection of points over which we require the approximation to coincide with the target φ\varphi. We fix the shuffle number for each step as 16. We apply GRIM to approximate φ\varphi over Ωtrain\Omega_{{\mathrm{train}}}, i.e. we use only the training set. The test set is not used for constructing the approximation uu, and is only used to record the accuracy achieved by model generated by the approximation [u]{\cal L}[u].

We make the following adaptation to the GRIM algorithm for this task. Our aim is to find an approximation uu of φ\varphi that is close in a pointwise sense to φ\varphi throughout Ωtrain\Omega_{{\mathrm{train}}}. Instead of considering each linear functional in Σ\Sigma individually, we consider each collection Δp:={δp,j:j{1,,21}}\Delta_{p}:=\left\{\delta_{p,j}:j\in\{1,\ldots,21\}\right\} for a point pΩtrainp\in\Omega_{{\mathrm{train}}}. We modify the Banach Extension Step (cf. Section 4) to the Modified Extension Step below.

Modified Extension Step

Assume that LΣL^{\prime}\subset\Sigma. Let uSpan()u\in\operatorname{Span}({\cal F}). Let m1m\in{\mathbb{Z}}_{\geq 1} such that #(L)+21m#(Σ)\#(L^{\prime})+21m\leq\#(\Sigma). Take

σ1:=argmax{|σ(φu)|:σΣ}.\sigma_{1}:=\operatorname{argmax}\left\{|\sigma(\varphi-u)|:\sigma\in\Sigma\right\}. (7.7)

Let p1Ωtrainp_{1}\in\Omega_{{\mathrm{train}}} denote the point for which σ1Δp1\sigma_{1}\in\Delta_{p_{1}}.

Inductively for j=2,,mj=2,\ldots,m take

σj:=argmax{|σ(φu)|:σΣ(Δp1Δpj1)}\sigma_{j}:=\operatorname{argmax}\left\{|\sigma(\varphi-u)|:\sigma\in\Sigma\setminus(\Delta_{p_{1}}\cup\dots\cup\Delta_{p_{j-1}})\right\} (7.8)

and let pjΩtrainp_{j}\in\Omega_{{\mathrm{train}}} denote the point for which σjΔpj\sigma_{j}\in\Delta_{p_{j}}.

Once the points p1,,pmΩtrainp_{1},\ldots,p_{m}\in\Omega_{{\mathrm{train}}} have been selected, extend LL^{\prime} to L:=LΔp1ΔpmL:=L^{\prime}\cup\Delta_{p_{1}}\cup\dots\cup\Delta_{p_{m}}. Since u(p)φ(p)l(21)=max{|σ(uφ)|:σΔp}||u(p)-\varphi(p)||_{l^{\infty}({\mathbb{R}}^{21})}=\max\left\{|\sigma(u-\varphi)|:\sigma\in\Delta_{p}\right\}, we observe that the Modified Extension Step is equivalent to taking

p1:=argmax{u(p)φ(p)l(21):pΩtrain},p_{1}:=\operatorname{argmax}\left\{||u(p)-\varphi(p)||_{l^{\infty}({\mathbb{R}}^{21})}:p\in\Omega_{{\mathrm{train}}}\right\}, (7.9)

then inductively taking for j=2,,mj=2,\ldots,m the point

pj:=argmax{u(p)φ(p)l(21):pΩtrain{p1,,pj1}},p_{j}:=\operatorname{argmax}\left\{||u(p)-\varphi(p)||_{l^{\infty}({\mathbb{R}}^{21})}:p\in\Omega_{{\mathrm{train}}}\setminus\{p_{1},\ldots,p_{j-1}\}\right\}, (7.10)

and then extending LL^{\prime} to L:=LΔp1ΔpmL:=L^{\prime}\cup\Delta_{p_{1}}\cup\dots\cup\Delta_{p_{m}}. Consequently, replacing the Banach Extension Step with the Modified Extension Step in the Banach GRIM algorithm in Section 4 results in an algorithm in which a collection of points, at which we require an approximation uu of φ\varphi to coincide with φ\varphi, is inductively greedily grown. It is important to keep in mind that each newly selected point pΩtrainp\in\Omega_{{\mathrm{train}}} corresponds to choosing 21 new linear functionals in Σ\Sigma. Thus the restriction on the total number of points KK to possibly be selected is that 21K#(Σ)21K\leq\#(\Sigma), which is equivalent to requiring K#(Ωtrain)=660K\leq\#(\Omega_{{\mathrm{train}}})=660. It is this modification of the Banach GRIM algorithm (cf. Section 4) that we consider for the task of approximating the linear map φ:N21\varphi:{\mathbb{R}}^{N}\to{\mathbb{R}}^{21}.
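A sketch of one Modified Extension Step in this pointwise form is given below, assuming the values of the current approximation uu and of the target φ\varphi have been pre-computed on Ωtrain\Omega_{{\mathrm{train}}}; the array names are illustrative.

```python
import numpy as np

def modified_extension_step(selected, u_vals, phi_vals, m):
    # One Modified Extension Step in the pointwise form (7.9)-(7.10).
    #   selected : indices of points of Omega_train already in L' (via their 21 functionals)
    #   u_vals   : (660, 21) values u(p) of the current approximation on Omega_train
    #   phi_vals : (660, 21) values phi(p) of the target on Omega_train
    #   m        : number of new points to add at this step
    errors = np.max(np.abs(u_vals - phi_vals), axis=1)   # l^infty error at each point
    errors[selected] = -np.inf                            # never re-select a point
    new_points = np.argsort(errors)[::-1][:m]             # m largest pointwise errors
    return list(selected) + list(new_points)
```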

Our aim here is that the model [u]{\cal L}[u] achieves a similar accuracy score as the original model [φ]{\cal L}[\varphi] from [JLNSY17]. Consequently, we alter the termination criteria to better reflect this goal. The model from [JLNSY17] achieves an accuracy score of 83.58%83.58\% on the test set. We choose δ:=103\delta:=10^{-3} and terminate if the approximation generates a model achieving an accuracy score of at least (1δ)(1-\delta) times the accuracy score of the original model on the testing set. Hence we require the approximation to generate a model achieving a testing set accuracy score of at least 83.50%83.50\% to 2 decimal places. We further record the accuracy achieved on the training set for comparison with the 100%100\% accuracy achieved by the original model here.

For each choice of number of points to add at each step, we run the GRIM algorithm (modified as outlined above) three times. We record the minimum number of non-zero weights in a returned approximation whose generated model achieves the required test set accuracy score. Subsequently, we record the training accuracy score and testing accuracy score of the models found by each run using this number of non-zero weights. In addition to recording these values for the best performing model, we record the variance of the training accuracy score and testing accuracy score over the three runs. The results are summarised in Table 3 below.

Points Added Per Step Number of Non-zero Weights Train Accuracy (%): Best, Variance Test Accuracy (%): Best, Variance
10 2101 100.00 0.04 83.58 1.12
20 2941 100.00 0.00 83.58 0.09
30 2521 100.00 0.02 83.58 0.49
40 2521 100.00 0.01 84.33 1.33
50 2101 99.55 0.04 83.96 0.65
Table 3: Each row contains information corresponding to our application of GRIM to approximate the action recognition model from [JLNSY17] on the JHMDB dataset [BGJSZ13]. The first column records the number of new points added at each step of GRIM. The second column records the minimum number of non-zero weights appearing in a returned approximation uu generating a model [u]{\cal L}[u] achieving a testing set accuracy score of at least 83.50%83.50\% (2.d.p). The third column summarises the training accuracy scores achieved during the three runs for each choice of number of points to add per step. The first half of the third column records the best training accuracy score (to 2 decimal places) achieved by the model [u]{\cal L}[u] for an approximation uu of φ\varphi returned by one of the runs of GRIM. The second half of the third column records the variance (to 2 decimal places) of the training accuracy scores achieved by the models [u]{\cal L}[u] for the approximations uu of φ\varphi returned by each of the three runs of GRIM. The fourth column summarises the testing accuracy scores achieved during the three runs for each choice of number of points to add per step. The first half of the fourth column records the best test accuracy score (to 2 decimal places) achieved by the model [u]{\cal L}[u] for an approximation uu of φ\varphi returned by one of the runs of GRIM. The second half of the fourth column records the variance (to 2 decimal places) of the test accuracy scores achieved by the models [u]{\cal L}[u] for the approximations uu of φ\varphi returned by each of the three runs of GRIM.

For each choice of the number of new points to be added at each step, GRIM successfully finds an approximation uu of φ\varphi which both uses fewer non-zero weights and achieves a higher test accuracy score than the best performing model found via Lasso. The best performance, in terms of accuracy score on the test set, was achieved by GRIM with the choice of adding 40 new points at each step. The best performing model uses 2521 non-zero weights and achieves an accuracy score of 84.33%84.33\% (2.d.p) on the test set. This accuracy score is actually higher than the accuracy score of 83.58%83.58\% (2.d.p) achieved by the original model [φ]{\cal L}[\varphi] from [JLNSY17] on the test set.

References

  • [ACO20] A. Abate, F. Cosentino, and H. Oberhauser, A randomized algorithm to reduce the support of discrete measures, In Advances in Neural Information Processing Systems, Volume 33, pages 15100-15110, 2020.
  • [ABDHP21] D. Alistarh, T. Ben-Nun, N. Dryden, T. Hoefler and A. Peste, Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks, J. Mach. Learn. Res., 22(241), 1-124, 2021.
  • [ABCGMM18] J.-P. Argaud, B. Bouriquet, F. de Caso, H. Gong, Y. Maday and O. Mula, Sensor Placement in Nuclear Reactors Based on the Generalized Empirical Interpolation Method, J. Comput. Phys., vol. 363, pp. 354-370, 2018.
  • [ABGMM16] J.-P. Argaud, B. Bouriquet, H. Gong, Y. Maday and O. Mula, Stabilization of (G)EIM in Presence of Measurement Noise: Application to Nuclear Reactor Physics, In Spectral and High Order Methods for Partial Differential Equations-ICOSAHOM 2016, vol. 119 of Lect. Notes Comput. Sci. Eng., pp.133-145, Springer, Cham, 2017.
  • [AK13] M. G. Augasta and T. Kathirvalavakumar, Pruning algorithms of neural networks - A comparative study, Open Computer Science 3, p.105-115, 2013.
  • [BLL15] Francis Bach, Simon Lacoste-Julien and Fredrik Lindsten, Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering, In Artificial Intelligence and Statistics, pages 544-552, PMLR, 2015.
  • [BMPSZ08] F. R. Bach, J. Mairal, J. Ponce, G. Sapiro and A. Zisserman, Discriminative learned dictionaries for local image analysis, Computer Vision and Pattern Recognition (CVPR), pp.1-8, IEEE 2008.
  • [BMPSZ09] F. R. Bach, J. Mairal, J. Ponce, G. Sapiro and A. Zisserman, Supervised dictionary learning, Advances in Neural Information Processing Systems, pp.1033-1040, 2009.
  • [BMNP04] M. Barrault, Y. Maday, N. C. Nguyen and A. T. Patera, An Empirical Interpolation Method: Application to Efficient Reduced-Basis Discretization of Partial Differential Equations, C. R. Acad. Sci. Paris, Série I., 339, pp.667-672, 2004
  • [BS18] A. G. Barto and R. S. Sutton, Reinforcement learning: an introduction, (2nd). Cambridge, MA, USA: MIT Press.
  • [BT11] Alain Berlinet and Christine Thomas-Agnan, Reproducing kernel Hilbert spaces in probability and statistics, Springer Science & Business Media, 2011.
  • [BBCDDGGMPPPPTVVW11] M. Blondel, M. Brucher, D. Cournapeau, V. Dubourg, E. Duchesnay, A. Gramfort, O. Grisel, V. Michel, A. Passos, F. Pedregosa, M. Perrot, P. Prettenhofer, B. Thirion, J. Vanderplas, G. Varoquaux and R. Weiss, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, 2825-2830, 2011.
  • [BCGMO18] F. -X. Briol, W. Y. Chen, J. Gorham, L. Mackey and C. J. Oates, Stein Points, In Proceedings of the 35th International Conference on Machine Learning, Volume 80, pp. 843-852, PMLR, 2018.
  • [BGJSZ13] M. Black, J. Gall, H. Jhuang, C. Schmid and S. Zuffi, Towards understanding action recognition, In IEEE International Conference on Computer Vision (ICCV), pp. 3192-3199, 2013.
  • [CT05] E. Candès and T. Tao, Decoding by linear programming, IEEE Trans. Inform. Theory, 51, 2005.
  • [CRT06] E. Candès, J. Romberg and T. Tao, Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information, IEEE Trans. Inform. Theory, 52, pp. 489-509, 2006.
  • [Car81] B. Carl, Entropy numbers of diagonal operators with an application to eigenvalue problems, J. Approx. Theory, 32, pp.135-150, 1981.
  • [CS90] B. Carl and I. Stephani, Entropy, Compactness, and the Approximation of Operators, Cambridge University Press, Cambridge, UK, 1990.
  • [CHXZ20] L. Chen, A. Huang, S. Xu and B. Zhang, Convolutional Neural Network Pruning: A Survey, 39th Chinese Control Conference (CCC), IEEE, p.7458-7463, 2020.
  • [CSW11] Zhixiang Chen, Baohuai Sheng and Jianli Wang, The Covering Number for Some Mercer Kernel Hilbert Spaces on the Unit Sphere, Taiwanese J. Math. 15(3), p.1325-1340, 2011.
  • [CSW10] Y. Chen, A. Smola and M. Welling, Super-Samples from Kernel Herding, In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010.
  • [CS08] A. Christmann and I. Steinwart, Support Vector Machines, Springer Science & Business Media, 2008.
  • [CM73] Jon F. Claerbout and Francis Muir, Robust Modeling with Erratic Data, Geophysics, Vol. 38, No. 5, pp. 826-844, 1973.
  • [CK16] Ilya Chevyrev and Andrey Kormilitzin, A Primer on the Signature Method in Machine Learning, arXiv preprint, 2016. https://arxiv.org/abs/1603.03788
  • [DGOY08] J. Darbon, D. Goldfarb, S. Osher and W. Yin, Bregman Iterative Algorithms for L1 Minimization with Applications to Compressed Sensing, SIAM J. Imaging Sci. 1, pp.143-168, 2008.
  • [Don06] D. Donoho, Compressed Sensing, IEEE Trans. Inform. Theory, 52, pp. 1289-1306, 2006.
  • [DE03] D. Donoho and M. Elad, Optimally Sparse Representation in General (Nonorthogonal) Dictionaries via $l^{1}$ Minimization, Proc. Natl. Acad. Sci. USA, 100, pp.2197-2202, 2003.
  • [DET06] D. L. Donoho, M. Elad and V. Temlyakov, Stable recovery of sparse overcomplete representations in the presence of noise, IEEE Transactions on Information Theory, 52(1): 6-18, 2006.
  • [DM21a] Raaz Dwivedi and Lester Mackey, Kernel Thinning, Proceedings of Machine Learning Research vol. 134:1–1, 2021.
  • [DM21b] Raaz Dwivedi and Lester Mackey, Generalized Kernel Thinning, Published in ICLR 2022.
  • [DMS21] Raaz Dwivedi, Lester Mackey and Abhishek Shetty, Distribution Compression in Near-Linear Time, Published in ICLR 2022.
  • [DF15] Chris Dyer and Manaal Faruqui, Non-distributional Word Vector Representations, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Beijing, China, Volume 2: Short Papers, pp.464-469, 2015.
  • [ET96] D. E. Edmunds and H. Triebel, Function Spaces, Entropy Numbers, and Differential Operators, Cambridge University Press, Cambridge, UK, 1996.
  • [Ela10] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, ISBN 978-1441970107, 2010.
  • [EMS08] M. Elad, J. Mairal and G. Sapiro, Sparse representation for color image restoration, IEEE Transactions on Image Processing, 17(1), 53-69, 2008.
  • [EJOPR20] E. Elsen, S. Jayakumar, S. Osindero, R. Pascanu and J. Rae, Top-KAST: Top-K always sparse training, Advances in Neural Information Processing Systems, 33:20744-20754, 2020.
  • [EPP00] T. Evgeniou, M. Pontil and T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math., 13, pp.1-50, 2000.
  • [FHLLMMPSWY23] Meng Fang, Tianjin Huang, Gen Li, Shiwei Liu, Xiaolong Ma, Vlado Menkovski, Mykola Pechenizkiy, Li Shen, Zhangyang Wang and Lu Yin, Dynamic Sparsity Is Channel-Level Sparsity Learner, NeurIPS 2023.
  • [FJM19] Dan Feldman, Ibrahim Jubran and Alaa Maalouf, Fast and Accurate Least-Mean-Squares Solvers, Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019.
  • [FJM22] Dan Feldman, Ibrahim Jubran and Alaa Maalouf, Fast and Accurate Least-Mean-Squares Solvers for High Dimensional Data, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 9977-9994, 2022.
  • [FS21] S. Fischer and I. Steinwart, A closer look at covering number bounds for Gaussian kernels, Journal of Complexity, Volume 62, 2021.
  • [FMMT14] Alona Fyshe, Tom M. Mitchell, Brian Murphy and Partha P. Talukdar, Interpretable semantic Vectors from a joint model of brain-and text-based meaning, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Baltimore, Maryland, Volume 1: Long Papers, pp.489-499, 2014.
  • [GMSWY09] A. Ganesh, Y. Ma, S. S. Sastry, J. Wright and A. Yang, Robust face recognition via sparse representation, TPAMI 31(2), 210-227, 2009.
  • [GLMNS18] M. Gibescu, A. Liotta, D. C. Mocanu, E. Mocanu, P. H. Nguyen and P. Stone, Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science, Nature communications, 9(1):2383, 2018.
  • [GLSWZ21] John Glossner, Tailin Liang, Shaobo Shi, Lei Wang and Xiaotong Zhang, Pruning and quantization for deep neural network acceleration: A survey, Neurocomputing, Volume 461, pages 370-403, 2021.
  • [GO09] T. Goldstein and S. Osher, The Split Bregman Method for L1-Regularized Problems, SIAM J. Imaging Sci. 2, pp.323-343, 2009.
  • [GMNP07] M. A. Grepl, Y. Maday, N. C. Nguyen and A. T. Patera, Efficient Reduced-Basis Treatment of Nonaffine and Nonlinear Partial Differential Equations, M2AN (Math. Model. Numer. Anal.), 2007.
  • [GKT12] F. S. Gürgen, H. Kaya and P. Tüfekci, Local and global learning methods for predicting power of a combined gas & steam turbine, In Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering, pages 13-18, 2012.
  • [HLO21] Satoshi Hayakawa, Terry Lyons and Harald Oberhauser, Positively Weighted Kernel Quadrature via Subsampling, In Advances in Neural Information Processing Systems (NeurIPS 2022).
  • [HS19] J. Hernandez-Garcia and R. Sutton, Learning Sparse Representations Incrementally in Deep Reinforcement Learning, https://arxiv.org/abs/1912.04002, [cs.LG], 9 Dec 2019.
  • [HLLL18] J. Huang, H. Lian, H. Lin and S. Lv, Oracle Inequalities for Sparse Additive Quantile Regression in Reproducing Kernel Hilbert Space, Annals of Statistics 46, p.781-813, 2018.
  • [JLPST21] S. Jayakumar, P. E. Latham, R. Pascanu, J. Schwarz and Y. Teh, Powerpropagation: A sparsity inducing weight reparameterisation, Advances in Neural Information Processing Systems, 34:28889-28903, 2021.
  • [JKY13] C. S. Jensen, M. Kaul and B. Yang, Building accurate 3d spatial networks to enable next generation intelligent transportation systems, In 2013 IEEE 14th International Conference on Mobile Data Management, Volume 1, pages 137-146, IEEE, 2013.
  • [JJWWY20] C. Jin, M. I. Jordan, M. Wang, Z. Wang and Z. Yang, On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada, 2020
  • [JLNSY17] Lianwen Jin, Terry Lyons, Hao Ni, Cordelia Schmid and Weixin Yang, Developing the Path Signature Methodology and its Application to Landmark-based Human Action Recognition, In: Yin, G., Zariphopoulou, T. (eds) Stochastic Analysis, Filtering and Stochastic Optimization, Springer, Cham. 2022. https://doi.org/10.1007/978-3-030-98519-6_18
  • [KK20] M. Kalinić and P. Kómár, Denoising DNA encoded library screens with sparse learning, ACS Combinatorial Science, Vol. 22, no. 8, pp.410-421, 2020.
  • [Kol56] A. N. Kolmogorov, Asymptotic characteristics of some completely bounded metric spaces, Dokl. Akad. Nauk. SSSR, 108, pp.585-589, 1956.
  • [Kuh11] T. Kühn, Covering numbers of Gaussian reproducing kernel Hilbert spaces, J. Complexity, 27, pp.489-499, 2011.
  • [LY07] Y. Lin and M. Yuan, On the Non-Negative Garrote Estimator, J. R. Stat. Soc. Ser. B 69, pp. 143-161, 2007
  • [LL16] W. Lee and T. Lyons, The Adaptive Patched Cubature Filter and its Implementation, Communications in Mathematical Sciences, 14(3), pp.799-829, 2016.
  • [LL99] W. V. Li and W. Linde, Approximation, metric entropy and small ball estimates for Gaussian measures, Ann. Probab., 27, pp.1556-1578, 1999.
  • [LL12] C. Litterer and T. Lyons, High order recombination and an application to cubature on Wiener space, The Annals of Applied Probability, Vol. 22, No. 4, 1301-1327, 2012.
  • [LMPY21] S. Liu, D. C. Mocanu, M. Pechenizkiy and L. Yin, Do we actually need dense over-parameterization? In-time over-parameterization in sparse training, In Proceedings of the 38th International Conference on Machine Learning, pages 6989-7000, PMLR, 2021.
  • [LP04] H. Luschgy and G. Pagès, Sharp asymptotics of the Kolmogorov entropy for Gaussian measures, J. Funct. Anal., 212, pp.89-120, 2004.
  • [LM22] Terry Lyons and Andrew D. McLeod, Signature Methods in Machine Learning, arXiv preprint, 2022. https://arxiv.org/abs/2206.14674
  • [MM13] Y. Maday, O. Mula, A generalized empirical interpolation method: Application of reduced basis techniques to data assimilation, F. Brezzi, P. Colli Franzone, U. Gianazza, G. Gilardi (Eds.), Analysis and Numerics of Partial Differential Equations, Vol. 4 of Springer INdAM Series, Springer Milan, 2013, pp. 221–235.
  • [MMPY15] Y. Maday, O. Mula, A. T. Patera and M. Yano, The Generalized Empirical Interpolation Method: Stability Theory On Hilbert Spaces With An Application To Stokes Equation, Comput. Methods Appl. Mech. Engrg. 287, 310-334, 2015.
  • [MMT14] Y. Maday, O. Mula and G. Turinici, Convergence analysis of the Generalized Empirical Interpolation Method, SIAM J. Numer. Anal., 54(3) 1713-1731, 2014.
  • [MNPP09] Y. Maday, N. C. Nguyen, A. T. Patera and G. S. H. Pau, A General Multipurpose Interpolation Procedure: The Magic Points, Commun. Pure Appl. Anal., 81, pp. 383-404, 2009.
  • [MZ93] S. G. Mallat and Z. Zhang, Matching Pursuits with Time-Frequency Dictionaries, IEEE Transactions on Signal Processing, 12: 3397-3415, 1993.
  • [MRT12] Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, Foundations of Machine Learning, Massachusetts: MIT Press, USA, 2012
  • [NT09] D. Needell and J. A. Tropp, CoSaMP: Iterative signal recovery from incomplete and inaccurate samples, Applied and Computational Harmonic Analysis, 26(3): 301-321, 2009.
  • [NPS22] M. Nikdast, S. Pasricha and F. Sunny, SONIC: A Sparse Neural Network Inference Accelerator with Silicon Photonics for Energy-Efficient Deep Learning, 27th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 214-219, 2022.
  • [NS21] S. Ninomiya and Y. Shinozaki, On implementation of high-order recombination and its application to weak approximations of stochastic differential equations, In Proceedings of the NFA 29th Annual Conference, 2021.
  • [PTT22] Sebastian Pokutta, Ken’ichiro Tanaka and Kazuma Tsuji, Sparser Kernel Herding with Pairwise Conditional Gradients without Swap Steps, https://arxiv.org/abs/2110.12650, [math.OC], 8 February 2022.
  • [Ree93] R. Reed, Pruning Algorithms - A Survey, IEEE Transactions on Neural Networks 4, pp. 740-747, 1993.
  • [SS86] Fadil Santosa and William W. Symes, Linear Inversion of Band-Limited Reflection Seismograms, SIAM J. Sci. Stat. Comput., Vol. 7, No. 4, 1986.
  • [SSW01] B. Schölkopf, A. J. Smola and R. C. Williamson, Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators, IEEE Trans. Inform. Theory, 47, pp.2516-2532, 2001.
  • [Ste03] I. Steinwart, Entropy numbers of convex hulls and an application to learning algorithms, Arch. Math., 80, pp.310-318, 2003.
  • [SS07] I. Steinwart and C. Scovel, Fast rates for support vector machines using Gaussian kernels, Ann. Statist. 35 (2), 2007.
  • [SS12] I. Steinwart and C. Scovel, Mercer’s theorem on general domains: On the interaction between measures, kernels, and RKHSs, Constructive Approximation, 35(3):363-417, 2012.
  • [Str71] A. H. Stroud, Approximate calculation of multiple integrals, Series in Automatic Computation, Englewood Cliffs, NJ: Prentice-Hall, 1971.
  • [Suz18] T. Suzuki, Fast Learning Rate of Non-Sparse Multiple Kernel Learning and Optimal Regularization Strategies, Electronic Journal of Statistics 12, p.2141-2192, 2018.
  • [SS13] M. Sugiyama and T. Suzuki, Fast Learning Rate of Multiple Kernel Learning: Tradeoff Between Sparsity and Smoothness, Annals of Statistics 41, p.1381-1405, 2013.
  • [TT21] Ken’ichiro Tanaka and Kazuma Tsuji, Acceleration of the Kernel Herding Algorithm by Improved Gradient Approximation, https://arxiv.org/abs/2105.07900, [math.NA], 17 May 2021.
  • [Tch15] M. Tchernychova, Carathéodory cubature measures, PhD thesis, University of Oxford, https://ora.ox.ac.uk/objects/uuid:a3a10980-d35d-467b-b3c0-d10d2e491f2d 2015.
  • [TW19] G.-L. Tian and M. Wang, Adaptive Group LASSO for High-Dimensional Generalized Linear Models, Stat Papers 60, pp. 1469-1486, 2019.
  • [Tib96] Robert Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society. Series B (Methodological), Vol. 58, No. 1, pp. 267-288, 1996.
  • [Tro06] J. A. Tropp, Just relax: Convex programming methods for identifying sparse signals in noise, IEEE Transactions on Information Theory, 52(3): 1030-1051, 2006.
  • [Wel09a] M. Welling, Herding Dynamical Weights to Learn, In Proceedings of the 21st International Conference on Machine Learning, Montreal, Quebec, CAN, 2009.
  • [Wel09b] M. Welling, Herding Dynamic Weights for Partially Observed Random Field Models, In Proc. of the Conf. on Uncertainty in Artificial Intelligence, Montreal, Quebec, CAN, 2009.
  • [XZ16] Y. Xiang and C. Zhang, On the Oracle Property of Adaptive Group LASSO in High-Dimensional Linear Models, Stat. Pap. 57, pp. 249-265, 2016.
  • [Zho02] D. -X. Zhou, The covering number in learning theory, J. Complexity, 18, pp.739-767, 2002.

University of Oxford, Andrew Wiles Building, Radcliffe Observatory Quarter, Woodstock Rd, Oxford, OX2 6GG, UK.
TL: [email protected], https://www.maths.ox.ac.uk/people/terry.lyons
AM: [email protected], https://www.maths.ox.ac.uk/people/andrew.mcleod