
Greedy Recombination Interpolation Method (GRIM)

Terry Lyons and Andrew D. McLeod
Abstract

In this paper we develop the Greedy Recombination Interpolation Method (GRIM) for finding sparse approximations of functions initially given as linear combinations of some (large) number of simpler functions. In a similar spirit to the CoSaMP algorithm, GRIM combines dynamic growth-based interpolation techniques and thinning-based reduction techniques. The dynamic growth-based aspect is a modification of the greedy growth utilised in the Generalised Empirical Interpolation Method (GEIM). A consequence of the modification is that our growth is not restricted to being one-per-step as it is in GEIM. The thinning-based aspect is carried out by recombination, which is the crucial component of the recent ground-breaking convex kernel quadrature method. GRIM provides the first use of recombination outside the setting of reducing the support of a measure. The sparsity of the approximation found by GRIM is controlled by the geometric concentration of the data in a sense that is related to a particular packing number of the data. We apply GRIM to a kernel quadrature task for the radial basis function kernel, and verify that its performance matches that of other contemporary kernel quadrature techniques.

1 Introduction

This article considers finding sparse approximations of functions with the aim of reducing computational complexity. Applications of sparse representations are wide ranging and include the theory of compressed sensing [DE03, CT05, Don06, CRT06], image processing [EMS08, BMPSZ08, BMPSZ09], facial recognition [GMSWY09], data assimilation [MM13], explainability [FMMT14, DF15], sensor placement in nuclear reactors [ABGMM16, ABCGMM18], reinforcement learning [BS18, HS19], DNA denoising [KK20], and inference acceleration within machine learning [ABDHP21, NPS22].

Loosely speaking, the commonly shared aim of sparse approximation problems is to approximate a complex system using only a few elementary features. In this generality, the problem is commonly tackled via Least Absolute Shrinkage and Selection Operator (LASSO) regression, which involves solving a minimisation problem under l1l^{1}-norm constraints. Constraining the l1l^{1}-norm is a convex relaxation of constraining the l0l^{0}-pseudo-norm, which counts the number of non-zero components of a vector. Since p=1p=1 is the smallest positive real number for which the lpl^{p}-norm is convex, constraining the l1l^{1}-norm can be viewed as the best convex approximation of constraining the l0l^{0}-pseudo-norm (i.e. constraining the number of non-zero components). Implementations of LASSO techniques can lead to near optimally sparse solutions in certain contexts [DET06, Tro06, Ela10]. The terminology “LASSO” originates in [Tib96], though imposing l1l^{1}-norm constraints is considered in the earlier works [CM73, SS86]. More recent LASSO-type techniques may be found in, for example, [LY07, DGOY08, GO09, XZ16, TW19].

Alternatives to LASSO include the Matching Pursuit (MP) sparse approximation algorithm introduced in [MZ93] and the CoSaMP algorithm introduced in [NT09]. The MP algorithm greedily grows, one-by-one, the collection of non-zero weights used to construct the approximation. The CoSaMP algorithm operates in a similar greedy manner but with the addition of two key properties. The first is that the non-zero weights are no longer found one-by-one; groups of several non-zero weights may be added at each step. Secondly, CoSaMP incorporates a naive pruning procedure at each step. After adding in the new weights at a step, the collection is then pruned down by retaining only the mm-largest weights for a chosen integer m1m\in{\mathbb{Z}}_{\geq 1}. The ethos of CoSaMP is particularly noteworthy for our purposes.

In this paper we assume that the system of interest is known to be a linear combination of some (large) number of features. Within this setting the goal is to identify a linear combination of a strict sub-collection of the features (i.e. not all the features) that gives, in some sense, a good approximation of the system (i.e. of the original given linear combination of all the features). Depending on the particular context considered, techniques for finding such sparse approximations include the pioneering Empirical Interpolation Method [BMNP04, GMNP07, MNPP09], its subsequent generalisation the Generalised Empirical Interpolation Method (GEIM) [MM13, MMT14], Pruning [Ree93, AK13, CHXZ20, GLSWZ21], Kernel Herding [Wel09a, Wel09b, CSW10, BLL15, BCGMO18, TT21, PTT22], Convex Kernel Quadrature [HLO21], and Kernel Thinning [DM21a, DM21b, DMS21]. The LASSO approach based on l1l^{1}-regularisation can still be used within this framework.

It is convenient for our purposes to consider the following loose categorisation of techniques for finding sparse approximations in this setting; those that are growth-based, and those that are thinning-based. Growth-based methods seek to inductively increase the size of a sub-collection of features, until the sub-collection is rich enough to well-approximate the entire collection of features. Thinning-based methods seek to inductively identify features that may be discarded without significantly affecting how well the remaining features can approximate the original entire collection. Of the techniques mentioned above, MP, EIM, GEIM and Kernel Herding are growth-based, whilst LASSO, Convex Kernel Quadrature and Kernel Thinning are thinning-based. An important observation is that the CoSaMP algorithm combines both growth-based and thinning-based techniques.

In this paper we work in the setting considered by Maday et al. when introducing GEIM [MM13, MMT14, MMPY15]. Namely, we let XX be a real Banach space and 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1} be a (large) positive integer. Assume that ={f1,,f𝒩}X{\cal F}=\{f_{1},\ldots,f_{{\cal N}}\}\subset X is a collection of non-zero elements, and that a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\}. Consider the element φX\varphi\in X defined by φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Let XX^{\ast} denote the dual of XX and suppose that ΣX\Sigma\subset X^{\ast} is a finite subset with cardinality Λ1\Lambda\in{\mathbb{Z}}_{\geq 1}. We use the terminology that the set {\cal F} consists of the features whilst the set Σ\Sigma consists of data. Then we consider the following sparse approximation problem. Given ε>0\varepsilon>0, find an element u=i=1𝒩bifiSpan()Xu=\sum_{i=1}^{{\cal N}}b_{i}f_{i}\in\operatorname{Span}({\cal F})\subset X such that the cardinality of the set {i{1,,𝒩}:bi0}\{i\in\{1,\ldots,{\cal N}\}:b_{i}\neq 0\} is less than 𝒩{\cal N} and that uu is close to φ\varphi throughout Σ\Sigma in the sense that, for every σΣ\sigma\in\Sigma, we have |σ(φu)|ε|\sigma(\varphi-u)|\leq\varepsilon.

The tasks of approximating sums of continuous functions on finite subsets of Euclidean space, cubature for empirical measures, and kernel quadrature for kernels defined on finite sets are included as particular examples within this general mathematical framework (see Section 2 for full details). The inclusion of approximating a linear combination of continuous functions within this general framework ensures that several machine learning related tasks are included. For example, each layer of a neural network is typically given by a linear combination of continuous functions. Hence the task of approximating a layer within a neural network by a sparse layer is covered within this general framework. Observe that there is no requirement that the original layer is fully connected; in particular, it may itself already be a sparse layer. Consequently, any approach to the approximation problem covered in the general framework above could be used to carry out the reduction step (in which a fixed proportion of a model's weights are reduced to zero) in the recently proposed sparse training algorithms [GLMNS18, EJOPR20, LMPY21, JLPST21, FHLLMMPSWY23].

The GEIM approach developed by Maday et al. [MM13, MMT14, MMPY15] involves dynamically growing a subset FF\subset{\cal F} of the features and a subset LΣL\subset\Sigma of the data. At each step a new feature from {\cal F} is added to FF, and a new piece of data from Σ\Sigma is added to LL. An approximation of φ\varphi is then constructed via the following interpolation problem. Find a linear combination of the elements in FF that coincides with φ\varphi throughout the subset LΣL\subset\Sigma. This dynamic growth is feature-driven in the following sense. The new feature to be added to FF is selected first, and this new feature is subsequently used to determine how to extend the collection LL of linear functionals on which an approximation will be required to match the target.

The growth is greedy in the sense that the element chosen to be added to FF is, in some sense, the one worst approximated by the current collection FF. If we let ff\in{\cal F} be the newly selected feature and J[f]J[f] be the linear combination of the previously selected features in FF that coincides with ff on LL, then the element to be added to LL is the one achieving the maximum absolute value when acting on fJ[f]f-J[f]. A more detailed overview of GEIM can be found in Section 2 of this paper; full details of GEIM may be found in [MM13, MMT14, MMPY15].

Momentarily restricting to the particular task of kernel quadrature, the recent thinning-based Convex Kernel Quadrature approach proposed by Satoshi Hayakawa, the first author, and Harald Oberhauser in [HLO21] outperforms existing techniques such as Monte Carlo, Kernel Herding [Wel09a, Wel09b, CSW10, BLL15, BCGMO18, PTT22], and Kernel Thinning [DM21a, DM21b, DMS21], obtaining new state-of-the-art results. Central to this approach is the recombination algorithm [LL12, Tch15, ACO20]. Originating in [LL12] as a technique for reducing the support of a probability measure whilst preserving a specified list of moments, at its core recombination is a method of reducing the number of non-zero components in a solution of a system of linear equations whilst preserving convexity.

An improved implementation of recombination is given in [Tch15] that is significantly more efficient than the original implementation in [LL12]. A novel implementation of the method from [Tch15] was provided in [ACO20]. A modified variant of the implementation from [ACO20] is used in the convex kernel quadrature approach developed in [HLO21]. A method with the same complexity as the original implementation in [LL12] was introduced in the works [FJM19, FJM22].

Returning to our general framework, we develop the Greedy Recombination Interpolation Method (GRIM) which is a novel hybrid combination of the dynamic growth of a greedy selection algorithm, in a similar spirit to GEIM [MM13, MMT14, MMPY15], with the thinning reduction of recombination that underpins the successful convex kernel quadrature approach of [HLO21]. GRIM dynamically grows a collection of linear functionals LΣL\subset\Sigma. After each extension of LL, we apply recombination to find an approximation of φ\varphi that coincides with φ\varphi throughout LL (cf. the recombination thinning Lemma 3.1). Subsequently, for a chosen integer m1m\in{\mathbb{Z}}_{\geq 1}, we extend LL by adding the mm linear functionals from Σ\Sigma achieving the mm largest absolute values when applied to the difference between φ\varphi and the current approximation of φ\varphi (cf. Section 4). We inductively repeat these steps a set number of times. Evidently GRIM is a hybrid of growth-based and thinning-based techniques in a similar spirit to the CoSaMP algorithm [NT09].

Whilst the dynamic growth of GRIM is in a similar spirit to that of GEIM, there are several important distinctions between GRIM and GEIM. The growth in GRIM is data-driven rather than feature-driven. In GRIM, the extension of the collection of data with respect to which the approximation is required to interpolate φ\varphi does not involve making any choices of features from {\cal F}. The new information to be matched is determined by examining where in Σ\Sigma the current approximation is furthest from the target φ\varphi (cf. Section 4).

Expanding on this point, we only dynamically grow a subset LΣL\subset\Sigma of the data and do not grow a subset FF\subset{\cal F} of the features. In particular, we do not pre-determine the features that an approximation will be a linear combination of. Instead, the features to be used are determined by recombination (cf. the recombination thinning Lemma 3.1). Indeed, besides an upper bound on the number of features being used to construct an approximation (cf. Section 4), we have no control over the features used. Allowing recombination and the data Σ\Sigma to determine which features are used means, in particular, that none of the features used at one specific step are required to be used at any subsequent step.

GRIM is not limited to extending the subset LΣL\subset\Sigma by a single linear functional at each step. GRIM is capable of extending LL by mm linear functionals for any given integer m1m\in{\mathbb{Z}}_{\geq 1} (modulo restrictions to avoid adding more linear functionals than there are in the original subset ΣX\Sigma\subset X^{\ast} itself). The number of new linear functionals to be added at a step is a hyperparameter that can be optimised during numerical implementations.

Unlike [HLO21], our use of recombination is not restricted to the setting of reducing the support of a discrete measure. After each extension of LΣL\subset\Sigma, we use recombination [LL12, Tch15, ACO20] to find an element uSpan()u\in\operatorname{Span}({\cal F}) satisfying that uφu\equiv\varphi throughout LL. Recombination is applied to the linear system consisting of, for each σL\sigma\in L, the equation i=1𝒩aiσ(fi)=σ(φ)\sum_{i=1}^{{\cal N}}a_{i}\sigma(f_{i})=\sigma(\varphi), together with the equation recording the sum of the coefficients a1,,a𝒩a_{1},\ldots,a_{{\cal N}} (i.e. the trivial equation a1++a𝒩=i=1𝒩aia_{1}+\ldots+a_{{\cal N}}=\sum_{i=1}^{{\cal N}}a_{i}) (cf. the recombination thinning Lemma 3.1).

Our use of recombination means, in particular, that GRIM can be used for cubature and kernel quadrature (cf. Sections 2 and 4). Since recombination preserves convexity [LL12], the benefits of convex weights enjoyed by the convex kernel quadrature approach in [HLO21] are inherited by GRIM (cf. Section 2).

Moreover, at each step we optimise our use of recombination over multiple permutations of the orderings of the equations determining the linear system to which recombination is applied (cf. the Banach Recombination Step in Section 4). The number of permutations to be considered at each step gives a parameter that may be optimised during applications of GRIM.

Whilst we analyse the complexity cost of the Banach GRIM algorithm (cf. Section 5), computational efficiency is not our top priority. GRIM is designed to be a one-time tool; it is applied a single time to try and find a sparse approximation of the target system φX\varphi\in X. The cost of its implementation is then recouped through the repeated use of the resulting approximation for inference via new inputs. Thus GRIM is ideally suited for use in cases where the model will be repeatedly computed on new inputs for the purpose of inference/prediction. Examples of such models include, in particular, those trained for character recognition, medical diagnosis, and action recognition.

With this in mind, the primary aim of our complexity cost considerations is to verify that implementing GRIM is feasible. We verify this by proving that, at worst, the complexity cost of GRIM is 𝒪(Λ2𝒩+sΛ4log(𝒩)){\cal O}(\Lambda^{2}{\cal N}+s\Lambda^{4}\log({\cal N})) where 𝒩{\cal N} is the number of features in {\cal F}, Λ\Lambda is the number of linear functionals forming the data Σ\Sigma, and ss is the maximum number of shuffles considered during each application of recombination (cf. Lemma 5.3).

The remainder of the paper is structured as follows. In Section 2 we fix the mathematical framework that will be used throughout the article and motivate its consideration. In particular, we illustrate some specific examples covered by our framework. Additionally, we summarise the GEIM approach of Maday et al. [MM13, MMT14, MMPY15] and highlight the fundamental differences in the philosophies underlying GEIM and GRIM.

An explanation of the recombination algorithm and its use within our setting is provided in Section 3. In particular, we prove the Recombination Thinning Lemma 3.1 detailing our use of recombination to find uSpan()u\in\operatorname{Span}({\cal F}) coinciding with φ\varphi on a given finite subset LXL\subset X^{\ast}.

The Banach GRIM algorithm is both presented and discussed in Section 4. We formulate the Banach Extension Step governing how we extend an existing collection of linear functionals LΣL\subset\Sigma, the Banach Recombination Step specifying how we use the recombination thinning Lemma 3.1 in the Banach GRIM algorithm, and provide an upper bound for the number of features used to construct the approximation at each step.

The complexity cost of the Banach GRIM algorithm is considered in Section 5. We prove Lemma 5.3 establishing the complexity cost of any implementation of the Banach GRIM algorithm, and subsequently establish an upper bound complexity cost for the most expensive implementation.

The theoretical performance of the Banach GRIM algorithm is considered in Section 6. Theoretical guarantees in terms of a specific geometric property of Σ\Sigma are established for the Banach GRIM algorithm in which a single new linear functional is chosen at each step (cf. the Banach GRIM Convergence Theorem 6.2). The specific geometric property is related to a particular packing number of Σ\Sigma in XX^{\ast} (cf. Subsection 6.2). The packing number of a subset of a Banach space is closely related to the covering number of the subset. Covering and packing numbers, first studied by Kolmogorov [Kol56], arise in a variety of contexts including eigenvalue estimation [Car81, CS90, ET96], Gaussian Processes [LL99, LP04], and machine learning [EPP00, SSW01, Zho02, Ste03, SS07, Kuh11, MRT12, FS21].

In Section 7 we compare the performance of GRIM against other reduction techniques on three tasks. The first is an L2L^{2}-approximation task motivated by an example in [MMPY15]. The second is a kernel quadrature task using machine learning datasets as considered in [HLO21]. In particular, we illustrate that GRIM matches the performance of the tailor-made convex kernel quadrature technique developed in [HLO21]. The third task is approximating the action recognition model from [JLNSY17] for the purpose of inference acceleration. In particular, this task involves approximating a function in a pointwise sense that is outside the Hilbert space framework of the preceding two examples.

Acknowledgements: This work was supported by the DataSig Program under the EPSRC grant ES/S026347/1, the Alan Turing Institute under the EPSRC grant EP/N510129/1, the Data Centric Engineering Programme (under Lloyd’s Register Foundation grant G0095), the Defence and Security Programme (funded by the UK Government) and the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA).

2 Mathematical Framework & Motivation

In this section we rigorously formulate the sparse approximation problem that GRIM will be designed to tackle. Further, we briefly summarise the Generalised Empirical Interpolation Method (GEIM) of Maday et al. [MM13, MMT14, MMPY15] to both highlight some of the ideas we utilise in GRIM, and to additionally highlight the key novel properties not satisfied by GEIM that will be satisfied by GRIM. We first fix the mathematical framework in which we will work for the remainder of this paper.

Let XX be a Banach space and 𝒩>0{\cal N}\in{\mathbb{Z}}_{>0} be a (large) positive integer. Assume that ={f1,,f𝒩}X{\cal F}=\{f_{1},\ldots,f_{{\cal N}}\}\subset X is a collection of non-zero elements, and that a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\}. Consider the element φX\varphi\in X defined by

φ=i=1𝒩aifi.\varphi=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. (2.1)

Let XX^{\ast} denote the dual of XX and suppose that ΣX\Sigma\subset X^{\ast} is a finite subset of cardinality Λ1\Lambda\in{\mathbb{Z}}_{\geq 1}. We use the terminology that the set {\cal F} consists of features whilst the set Σ\Sigma consists of data. We consider the following sparse approximation problem. Given ε>0\varepsilon>0, find an element u=i=1𝒩bifiSpan()u=\sum_{i=1}^{{\cal N}}b_{i}f_{i}\in\operatorname{Span}({\cal F}) such that the cardinality of the set {i{1,,𝒩}:bi0}\left\{i\in\{1,\ldots,{\cal N}\}~{}:~{}b_{i}\neq 0\right\} is less than 𝒩{\cal N} and that uu is close to φ\varphi throughout Σ\Sigma in the sense that, for every σΣ\sigma\in\Sigma, we have |σ(φu)|ε|\sigma(\varphi-u)|\leq\varepsilon.

The task of finding a sparse approximation of a sum of continuous functions, defined on a finite set of Euclidean space, is within this framework. More precisely, let N,d,E1N,d,E\in{\mathbb{Z}}_{\geq 1}, a1,,aNa_{1},\ldots,a_{N}\in{\mathbb{R}}, Ωd\Omega\subset{\mathbb{R}}^{d} be a finite subset, and, for each i{1,,N}i\in\{1,\ldots,N\}, fiC0(Ω;E)f_{i}\in C^{0}\left(\Omega;{\mathbb{R}}^{E}\right) be a continuous function ΩE\Omega\to{\mathbb{R}}^{E}. Then finding a sparse approximation of the continuous function F:=i=1NaifiF:=\sum_{i=1}^{N}a_{i}f_{i} is within this framework. To see this, first let e1,,eEEe_{1},\ldots,e_{E}\in{\mathbb{R}}^{E} be the standard basis of E{\mathbb{R}}^{E} that is orthonormal with respect to the Euclidean dot product ,E\left<\cdot,\cdot\right>_{{\mathbb{R}}^{E}} on E{\mathbb{R}}^{E}. Then note that for each point pΩp\in\Omega and every j{1,,E}j\in\{1,\ldots,E\} the mapping ff(p),ejEf\mapsto\left<f(p),e_{j}\right>_{{\mathbb{R}}^{E}} determines a linear functional C0(Ω;E)C^{0}(\Omega;{\mathbb{R}}^{E})\to{\mathbb{R}} that is in the dual space C0(Ω;E)C^{0}(\Omega;{\mathbb{R}}^{E})^{\ast}. Denote this linear functional by δp,j:C0(Ω;E)\delta_{p,j}:C^{0}(\Omega;{\mathbb{R}}^{E})\to{\mathbb{R}}. Therefore by choosing X:=C0(Ω;E)X:=C^{0}(\Omega;{\mathbb{R}}^{E}), 𝒩:=N{\cal N}:=N, and Σ:={δp,j:pΩandj{1,,E}}X\Sigma:=\left\{\delta_{p,j}:p\in\Omega~{}\text{and}~{}j\in\{1,\ldots,E\}\right\}\subset X^{\ast}, we see that this problem is within our framework. Here we are also using the observation that if f,hC0(Ω;E)f,h\in C^{0}(\Omega;{\mathbb{R}}^{E}) satisfy, for every pΩp\in\Omega and every j{1,,E}j\in\{1,\ldots,E\}, that |δp,j[fh]|ε|\delta_{p,j}[f-h]|\leq\varepsilon, then we have fhC0(Ω;E)Cε||f-h||_{C^{0}(\Omega;{\mathbb{R}}^{E})}\leq C\varepsilon for a constant C1C\geq 1 depending on the particular norm chosen on E{\mathbb{R}}^{E}. Thus finding an approximation uu of FF that satisfies, for every σΣ\sigma\in\Sigma, that |σ(Fu)|ε/C|\sigma(F-u)|\leq\varepsilon/C allows us to conclude that FuC0(Ω;E)ε||F-u||_{C^{0}(\Omega;{\mathbb{R}}^{E})}\leq\varepsilon.
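To make this correspondence concrete, the following minimal Python sketch (assuming NumPy; all names such as features and sigma_of_f are hypothetical and the features are random purely for illustration) assembles the matrix of values σ(f_i) for the point-evaluation functionals δ_{p,j}, which is all the information about {\cal F} and Σ that the methods discussed below require.

```python
import numpy as np

# Minimal illustrative sketch of the C^0 example; all names are hypothetical.
rng = np.random.default_rng(0)
N, d, E, n_points = 50, 2, 3, 40
Omega = rng.normal(size=(n_points, d))                 # the finite subset Omega of R^d
# Features f_i : Omega -> R^E; random tanh features purely for illustration.
features = [lambda x, W=rng.normal(size=(d, E)): np.tanh(x @ W) for _ in range(N)]
a = rng.uniform(0.1, 1.0, size=N)                      # coefficients of phi = sum_i a_i f_i

# Sigma consists of the evaluation functionals delta_{p,j}; evaluating every feature
# at every pair (p, j) gives the matrix with entries sigma(f_i).
sigma_of_f = np.stack([f(Omega).reshape(-1) for f in features])   # shape (N, n_points * E)
sigma_of_phi = a @ sigma_of_f                                     # the values sigma(phi)
```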

The cubature problem [Str71] for empirical measures, which may be combined with sampling to offer an approach to the cubature problem for general signed measures, is within this framework. More precisely, let N1N\in{\mathbb{Z}}_{\geq 1}, a1,,aN>0a_{1},\ldots,a_{N}>0, Ω={x1,,xN}d\Omega=\{x_{1},\ldots,x_{N}\}\subset{\mathbb{R}}^{d}, and [Ω]{\cal M}[\Omega] denote the collection of finite signed measures on Ω\Omega. Recall that [Ω]{\cal M}[\Omega] can be viewed as a subset of C0(Ω)C^{0}(\Omega)^{\ast} by defining, for ν[Ω]\nu\in{\cal M}[\Omega] and ψC0(Ω)\psi\in C^{0}(\Omega), ν[ψ]:=Ωψ(x)𝑑ν(x)\nu[\psi]:=\int_{\Omega}\psi(x)d\nu(x). Consider the empirical measure μ:=i=1Naiδxi\mu:=\sum_{i=1}^{N}a_{i}\delta_{x_{i}} and, for e=(e1,,ed)0de=(e_{1},\ldots,e_{d})\in{\mathbb{Z}}_{\geq 0}^{d}, define pe:dp_{e}:{\mathbb{R}}^{d}\to{\mathbb{R}} by pe(x1,,xd):=x1e1xdedp_{e}(x_{1},\ldots,x_{d}):=x_{1}^{e_{1}}\ldots x_{d}^{e_{d}}. Then the choices that X:=[Ω]X:={\cal M}[\Omega], 𝒩:=N{\cal N}:=N, and, for a given K0K\in{\mathbb{Z}}_{\geq 0}, Σ:={pe:e=(e1,,ed) with e1++edK}C0(Ω)[Ω]\Sigma:=\left\{p_{e}:e=(e_{1},\ldots,e_{d})\text{ with }e_{1}+\ldots+e_{d}\leq K\right\}\subset C^{0}(\Omega)\subset{\cal M}[\Omega]^{\ast} illustrate that the cubature problem of reducing the support of μ\mu whilst preserving its moments of order no greater than KK is within our framework.
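A sketch of this cubature instance (again Python with NumPy, purely illustrative and with hypothetical names) is as follows: the functionals in Σ are the monomials of total degree at most K, and the quantities that must be preserved are the corresponding moments of the empirical measure μ.

```python
import numpy as np
from itertools import product

# Illustrative sketch of the cubature instance; all names are hypothetical.
rng = np.random.default_rng(1)
d, K, N = 2, 3, 100
points = rng.uniform(size=(N, d))                      # support {x_1, ..., x_N} of mu
a = rng.uniform(size=N); a /= a.sum()                  # weights of mu = sum_i a_i delta_{x_i}

# Sigma: the monomials p_e of total degree at most K.
exponents = [e for e in product(range(K + 1), repeat=d) if sum(e) <= K]
# moments_matrix[l, i] = p_{e_l}(x_i), i.e. the l-th functional applied to delta_{x_i}.
moments_matrix = np.stack([np.prod(points ** np.array(e), axis=1) for e in exponents])
mu_moments = moments_matrix @ a                        # the moments of mu to be preserved
```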

Moreover the Kernel Quadrature problem for empirical probability measures is within our framework. To be more precise, let 𝒳{\cal X} be a finite set and k{\cal H}_{k} be a Reproducing Kernel Hilbert Space (RKHS) associated to a positive semi-definite symmetric kernel function k:𝒳×𝒳k:{\cal X}\times{\cal X}\to{\mathbb{R}} (appropriate definitions can be found, for example, in [BT11]). In this case k=Span({kx:x𝒳}){\cal H}_{k}=\operatorname{Span}\left(\left\{k_{x}:x\in{\cal X}\right\}\right) where, for x𝒳x\in{\cal X}, the function kx:𝒳k_{x}:{\cal X}\to{\mathbb{R}} is defined by kx(z):=k(x,z)k_{x}(z):=k(x,z) (see, for example, [BT11]).

In this context one can consider the Kernel Quadrature problem, for which the Kernel Herding [Wel09a, Wel09b, CSW10, BLL15, BCGMO18, TT21, PTT22], Convex Kernel Quadrature [HLO21] and Kernel Thinning [DM21a, DM21b, DMS21] methods have been developed. Given a probability measure μ[𝒳]\mu\in{\mathbb{P}}[{\cal X}] the Kernel Quadrature problem involves finding, for some n1n\in{\mathbb{Z}}_{\geq 1}, points z1,,zn𝒳z_{1},\ldots,z_{n}\in{\cal X} and weights w1,,wnw_{1},\ldots,w_{n}\in{\mathbb{R}} such that the measure μn:=j=1nwjδzj\mu_{n}:=\sum_{j=1}^{n}w_{j}\delta_{z_{j}} approximates μ\mu in the sense that, for every fkf\in{\cal H}_{k}, we have μn(f)μ(f)\mu_{n}(f)\approx\mu(f). Additionally requiring the weights w1,,wnw_{1},\ldots,w_{n} to be convex in the sense that they are all positive and sum to one (i.e. w1,,wn>0w_{1},\ldots,w_{n}>0 and w1++wn=1w_{1}+\ldots+w_{n}=1) ensures both better robustness properties for the approximation μnμ\mu_{n}\approx\mu and better estimates when the mm-fold product of quadrature formulas is used to approximate μm\mu^{\otimes m} on 𝒳m{\cal X}^{\otimes m}; see [HLO21].

A consequence of k=Span({kx:x𝒳}){\cal H}_{k}=\operatorname{Span}\left(\left\{k_{x}:x\in{\cal X}\right\}\right) is that linearity ensures that this approximate equality will be valid for all fkf\in{\cal H}_{k} provided it is true for every f{kx:x𝒳}f\in\{k_{x}:x\in{\cal X}\}. The inclusion kC0(𝒳){\cal H}_{k}\subset C^{0}({\cal X}) means that the subset {kx:x𝒳}k\{k_{x}:x\in{\cal X}\}\subset{\cal H}_{k} can be viewed as a finite subset of the dual space [𝒳]{\cal M}[{\cal X}]^{\ast}. The choice of X:=[𝒳]X:={\cal M}[{\cal X}] and Σ:={kx:x𝒳}\Sigma:=\{k_{x}:x\in{\cal X}\} illustrates that the Kernel Quadrature problem for empirical probability distributions μ[𝒳]\mu\in{\mathbb{P}}[{\cal X}] is within the framework we consider.
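The data required by this instance can be sketched as follows (Python with NumPy, with a hypothetical RBF kernel purely for illustration): the functionals k_x evaluate to the kernel matrix, and their action on μ gives the values of the kernel mean embedding that a quadrature rule must reproduce.

```python
import numpy as np

# Illustrative sketch of the kernel quadrature instance; names are hypothetical.
rng = np.random.default_rng(2)
points = rng.normal(size=(200, 2))                     # the finite set on which mu lives
a = np.full(len(points), 1.0 / len(points))            # empirical probability measure mu

def rbf(x, y, bandwidth=1.0):
    """Radial basis function kernel k(x, y) evaluated pairwise."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

K = rbf(points, points)        # K[l, i] = k(x_l, x_i), i.e. the functional k_{x_l} applied to delta_{x_i}
mu_embedding = K @ a           # mu(k_{x_l}) for every functional in Sigma
```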

The framework we consider is the same as the setup for which GEIM was developed by Maday et al. [MM13, MMT14, MMPY15]. It is useful to briefly recall this method.

GEIM

  1. (A)
    • Find h1:=argmax{fX:f}h_{1}:=\operatorname{argmax}\left\{||f||_{X}:f\in{\cal F}\right\} and σ1:=argmax{|σ(h1)|:σΣ}\sigma_{1}:=\operatorname{argmax}\left\{|\sigma(h_{1})|:\sigma\in\Sigma\right\}.

    • Define q1:=h1/σ1(h1)q_{1}:=h_{1}/\sigma_{1}(h_{1}), S1:={q1}S_{1}:=\{q_{1}\}, L1:={σ1}L_{1}:=\{\sigma_{1}\}, and an operator 𝒥1:Span()Span(S1){\cal J}_{1}:\operatorname{Span}({\cal F})\to\operatorname{Span}(S_{1}) by setting 𝒥1[w]:=σ1(w)q1{\cal J}_{1}[w]:=\sigma_{1}(w)q_{1} for every wSpan()w\in\operatorname{Span}({\cal F}).

    • Observe that, given any wSpan()w\in\operatorname{Span}({\cal F}), we have σ1(w𝒥1[w])=0\sigma_{1}(w-{\cal J}_{1}[w])=0.

  2. (B)

    Proceed recursively via the following inductive step for n2n\geq 2.

    • Suppose we have found subsets Sn1={q1,,qn1}Span()S_{n-1}=\{q_{1},\ldots,q_{n-1}\}\subset\operatorname{Span}({\cal F}) and Ln1={σ1,,σn1}ΣL_{n-1}=\{\sigma_{1},\ldots,\sigma_{n-1}\}\subset\Sigma, and an operator 𝒥n1:Span()Span(Sn1){\cal J}_{n-1}:\operatorname{Span}({\cal F})\to\operatorname{Span}(S_{n-1}) satisfying, for every wSpan()w\in\operatorname{Span}({\cal F}), that σ(w𝒥n1[w])=0\sigma(w-{\cal J}_{n-1}[w])=0 for every σLn1\sigma\in L_{n-1}.

    • Find hn:=argmax{f𝒥n1[f]X:f}h_{n}:=\operatorname{argmax}\left\{||f-{\cal J}_{n-1}[f]||_{X}:f\in{\cal F}\right\} and σn:=argmax{|σ(hn𝒥n1[hn])|:σΣ}\sigma_{n}:=\operatorname{argmax}\left\{|\sigma(h_{n}-{\cal J}_{n-1}[h_{n}])|:\sigma\in\Sigma\right\}.

    • Define

      qn:=hn𝒥n1[hn]σn(hn𝒥n1[hn]),q_{n}:=\frac{h_{n}-{\cal J}_{n-1}[h_{n}]}{\sigma_{n}(h_{n}-{\cal J}_{n-1}[h_{n}])},

      Sn:=Sn1{qn}Span()S_{n}:=S_{n-1}\cup\{q_{n}\}\subset\operatorname{Span}({\cal F}) and Ln:=Ln1{σn}ΣL_{n}:=L_{n-1}\cup\{\sigma_{n}\}\subset\Sigma.

    • Construct operator 𝒥n:Span()Span(Sn){\cal J}_{n}:\operatorname{Span}({\cal F})\to\operatorname{Span}(S_{n}) by defining

      𝒥n[w]:=𝒥n1[w]+σn(w𝒥n1[w])qn{\cal J}_{n}[w]:={\cal J}_{n-1}[w]+\sigma_{n}(w-{\cal J}_{n-1}[w])q_{n}

      for wSpan()w\in\operatorname{Span}({\cal F}). Direct computation verifies that, whenever wSpan()w\in\operatorname{Span}({\cal F}) and σLn\sigma\in L_{n}, we have σ(w𝒥n[w])=0\sigma(w-{\cal J}_{n}[w])=0.
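The following is a minimal sketch of GEIM under simplifying assumptions: each feature is stored as its vector of values (σ(f))_{σ∈Σ}, and the X-norm appearing in the argmax is replaced by the maximum absolute value over Σ (the natural choice in the C^0 examples of Section 2). It is illustrative only, written in Python with NumPy, and omits degenerate cases (e.g. a residual that vanishes identically).

```python
import numpy as np

def interpolate(W, Q, sel):
    """Evaluate J_n[w] for each row w of W, where J_n is built from the GEIM
    basis Q = [q_1, ..., q_n] and functional indices sel = [sigma_1, ..., sigma_n]."""
    W = np.asarray(W, dtype=float)
    J = np.zeros_like(W)
    for q, j in zip(Q, sel):
        # J_n[w] := J_{n-1}[w] + sigma_n(w - J_{n-1}[w]) q_n  (cf. step (B))
        J = J + np.outer(W[:, j] - J[:, j], q)
    return J

def geim(F, n_terms):
    """Greedy selection of GEIM steps (A)-(B).  Rows of F hold the values of the
    features over Sigma; the X-norm is approximated by the sup over Sigma."""
    F = np.asarray(F, dtype=float)
    Q, sel = [], []
    for _ in range(n_terms):
        R = F - interpolate(F, Q, sel)           # residuals f - J_{n-1}[f]
        h = R[np.argmax(np.abs(R).max(axis=1))]  # residual of the worst approximated feature
        j = int(np.argmax(np.abs(h)))            # functional maximising |sigma(h_n - J_{n-1}[h_n])|
        Q.append(h / h[j])                       # q_n
        sel.append(j)
    return Q, sel
```

With this sketch, an approximation of φ over Σ would then be obtained as interpolate(sigma_of_phi[None, :], Q, sel) for a row vector of values sigma_of_phi.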

The algorithm provides a sequence 𝒥1[φ]{\cal J}_{1}[\varphi], 𝒥2[φ],{\cal J}_{2}[\varphi],\ldots of approximations of φ\varphi. However, GEIM is intended to control the XX-norm of the difference between φ\varphi and its approximation, i.e. to have φ𝒥n[φ]X||\varphi-{\cal J}_{n}[\varphi]||_{X} be small for a suitable integer nn. This aim requires additional assumptions to be made regarding the data ΣX\Sigma\subset X^{\ast} which we do not impose (see [MMT14] for details). Recall that we aim only to approximate φ\varphi over the data Σ\Sigma, i.e. we seek an approximation uu such that |σ(φu)||\sigma(\varphi-u)| is small for every σΣ\sigma\in\Sigma. A consequence of this difference is that certain aspects of GEIM are not necessarily ideal for our task.

Firstly, the growth of the subset Ln1L_{n-1} to LnL_{n} is feature-driven. That is, a new feature from {\cal F} to be used by the next approximation is selected, and then this new feature is used to determine the new functional from Σ\Sigma to be added to the collection on which we require the approximation to coincide with the target φ\varphi (cf. GEIM (B)). Since we only seek to approximate φ\varphi over the data Σ\Sigma, data-driven growth would be preferable. That is, we would rather select the new information from Σ\Sigma to be matched by the next approximation before any consideration is given to determining which features from {\cal F} will be used to construct the new approximation.

Secondly, related to the first aspect, the features from {\cal F} to be used to construct 𝒥n[φ]{\cal J}_{n}[\varphi] are predetermined. Further, the features used to construct 𝒥n[φ]{\cal J}_{n}[\varphi] are forced to be included in the features used to construct 𝒥m[φ]{\cal J}_{m}[\varphi] for any m>nm>n. This restriction is not guaranteed to be sensible; it is conceivable that the features that work well at one stage are disjoint from the features that work well at another. We would prefer that the features used to construct an approximation be determined by the target φ\varphi and the data selected from Σ\Sigma on which we require the approximation to match φ\varphi. This would avoid retaining features used at one step that become less effective at later steps, and could offer insight regarding which of the features in {\cal F} are sufficient to capture the behaviour of φ\varphi on Σ\Sigma.

Thirdly, at each step GEIM can provide an approximation for any wSpan()w\in\operatorname{Span}({\cal F}). That is, for n1n\in{\mathbb{Z}}_{\geq 1}, the element 𝒥n[w]{\cal J}_{n}[w] is a linear combination of the elements in SnS_{n} that coincides with ww on LnL_{n}. Requiring the operator 𝒥n:Span()Span(Sn){\cal J}_{n}:\operatorname{Span}({\cal F})\to\operatorname{Span}(S_{n}) to provide such an approximation for every wSpan()w\in\operatorname{Span}({\cal F}) stops the method from exploiting any advantages available for the particular choice of φSpan()\varphi\in\operatorname{Span}({\cal F}). As we only aim to approximate φ\varphi itself, we would prefer to remove this requirement and allow the method the opportunity to exploit advantages resulting from the specific choice of φSpan()\varphi\in\operatorname{Span}({\cal F}).

All three aspects are addressed in GRIM. The greedy growth of a subset LΣL\subset\Sigma, determining the functionals in Σ\Sigma at which an approximation is required to agree with φ\varphi, is data-driven. At each step the desired approximation is found using recombination [LL12, Tch15] so that the features used to construct the approximation are determined by φ\varphi and the subset LL and, in particular, are not predetermined. This use of recombination ensures both that GRIM only produces approximations of φ\varphi, and that GRIM can exploit advantages resulting from the specific choice of φSpan()\varphi\in\operatorname{Span}({\cal F}).

3 Recombination Thinning

In this section we illustrate how, given a finite collection LXL\subset X^{\ast} of linear functionals, recombination [LL12, Tch15] can be used to find an element uSpan()Xu\in\operatorname{Span}({\cal F})\subset X that coincides with φ\varphi throughout LL, provided we can compute the values σ(fi)\sigma(f_{i}) for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} and every linear functional σL\sigma\in L. The recombination algorithm was initially introduced by Christian Litterer and the first author in [LL12]; a substantially improved implementation was provided in the PhD thesis of Maria Tchernychova [Tch15]. A novel implementation of the method from [Tch15] was provided in [ACO20]. A method with the same complexity as the original implementation in [LL12] was introduced in the works [FJM19, FJM22]. Recombination has been applied in a number of contexts including particle filtering [LL16], kernel quadrature [HLO21], and mathematical finance [NS21].

For the reader's convenience we briefly overview the ideas involved in the recombination algorithm. For this illustrative purpose consider a linear system of equations 𝐀𝐱=𝐲{\bf A}{\bf x}={\bf y} where 𝐀k×m{\bf A}\in{\mathbb{R}}^{k\times m}, 𝐱m{\bf x}\in{\mathbb{R}}^{m}, 𝐲k{\bf y}\in{\mathbb{R}}^{k}, and k,m1k,m\in{\mathbb{Z}}_{\geq 1} with mkm\geq k. We assume, for every i{1,,m}i\in\{1,\ldots,m\}, that 𝐱i>0{\bf x}_{i}>0. Recombination relies on the simple observation that this linear system is preserved under translations of 𝐱{\bf x} by elements in the kernel of the matrix 𝐀{\bf A}. To be more precise, if 𝐱{\bf x} satisfies that 𝐀𝐱=𝐲{\bf A}{\bf x}={\bf y}, and if 𝐞ker(𝐀){\bf e}\in\ker({\bf A}) and θ\theta\in{\mathbb{R}}, then 𝐱+θ𝐞{\bf x}+\theta{\bf e} also satisfies that 𝐀(𝐱+θ𝐞)=𝐲{\bf A}\left({\bf x}+\theta{\bf e}\right)={\bf y}. The recombination algorithm makes use of linearly independent elements of ker(𝐀)\ker({\bf A}) to reduce the number of non-zero entries in the solution vector 𝐱{\bf x}.

As outlined in [LL12], this could in principle be done as follows. First, we choose a basis for the kernel of 𝐀{\bf A}. Computing the Singular Value Decomposition (SVD) of 𝐀{\bf A} gives a method of finding such a basis that is well-suited to dealing with numerical instabilities. Supposing that ker(𝐀){0}\ker({\bf A})\neq\{0\}, let 𝐞1,,𝐞l{\bf e}_{1},\ldots,{\bf e}_{l} be a basis of ker(𝐀)\ker({\bf A}) found via SVD. For each j{1,,l}j\in\{1,\ldots,l\} denote the coefficients of 𝐞j{\bf e}_{j} by 𝐞j,1,,𝐞j,m{\bf e}_{j,1},\ldots,{\bf e}_{j,m}\in{\mathbb{R}}.

Consider the element 𝐞1m{\bf e}_{1}\in{\mathbb{R}}^{m}. Choose i{1,,m}i\in\{1,\ldots,m\} such that

𝐱i𝐞1,i=min{𝐱j𝐞1,j:𝐞1,j>0}.\frac{{\bf x}_{i}}{{\bf e}_{1,i}}=\min\left\{\frac{{\bf x}_{j}}{{\bf e}_{1,j}}:{\bf e}_{1,j}>0\right\}. (3.1)

Replace 𝐱{\bf x} by the vector 𝐱(𝐱i/𝐞1,i)𝐞1{\bf x}-\left({\bf x}_{i}/{\bf e}_{1,i}\right){\bf e}_{1}. The new 𝐱{\bf x} remains a solution to 𝐀𝐱=𝐲{\bf A}{\bf x}={\bf y}, and now additionally satisfies that 𝐱i=0{\bf x}_{i}=0. Moreover, for every j{1,,m}j\in\{1,\ldots,m\} such that jij\neq i, our choice of ii in (3.1) ensures that 𝐱j0{\bf x}_{j}\geq 0. Finally, for j{2,,l}j\in\{2,\ldots,l\}, replace 𝐞j{\bf e}_{j} by 𝐞j(𝐞j,i/𝐞1,i)𝐞1{\bf e}_{j}-({\bf e}_{j,i}/{\bf e}_{1,i}){\bf e}_{1} to ensure that 𝐞j,i=0{\bf e}_{j,i}=0. This final alteration allows us to repeat the process for the new 𝐱{\bf x} using the new 𝐞2{\bf e}_{2} in place of 𝐞1{\bf e}_{1} since the addition of any scalar multiple of 𝐞2{\bf e}_{2} to 𝐱{\bf x} will not change the fact that 𝐱i=0{\bf x}_{i}=0.

After iteratively repeating this procedure for j=2,,lj=2,\ldots,l, we obtain a vector 𝐱m{\bf x}^{\prime}\in{\mathbb{R}}^{m} whose coefficients are all non-negative, with at most mlm-l of the coefficients being non-zero, and still satisfying that 𝐀𝐱=𝐲{\bf A}{\bf x}^{\prime}={\bf y}. That is, the original solution 𝐱{\bf x} has been reduced to a new solution 𝐱{\bf x}^{\prime} with at least ll fewer non-zero coefficients than the original solution 𝐱{\bf x}.
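The following is a minimal sketch, in Python with NumPy, of the naive elimination procedure just described: it extracts a basis of ker(𝐀) from the SVD and removes one positive coefficient per kernel direction. It is illustrative only; the tree-based implementations of [Tch15, ACO20] used in practice are substantially more efficient.

```python
import numpy as np

def naive_recombination(A, x, tol=1e-12):
    """Given A @ x = y with x >= 0, return a new non-negative solution with at
    least dim(ker(A)) additional zero entries, following the outline above."""
    A = np.asarray(A, dtype=float)
    x = np.asarray(x, dtype=float).copy()
    k, m = A.shape
    # Basis of ker(A) via the SVD: right singular vectors whose singular values
    # are (numerically) zero, plus those beyond the first k.
    _, s, Vt = np.linalg.svd(A)
    zero = np.concatenate([s <= tol * max(s.max(), 1.0),
                           np.ones(m - len(s), dtype=bool)])
    kernel = [Vt[row].copy() for row in np.flatnonzero(zero)]
    for idx, e in enumerate(kernel):
        if not (e > tol).any():
            e = -e                          # flip sign so that some component is positive
        pos = e > tol
        if not pos.any():
            continue
        # Choose i minimising x_i / e_i over the positive components (cf. (3.1)).
        ratios = np.full(m, np.inf)
        ratios[pos] = x[pos] / e[pos]
        i = int(np.argmin(ratios))
        x = x - (x[i] / e[i]) * e           # preserves A @ x = y and keeps x >= 0
        x[i] = 0.0
        # Force the remaining kernel vectors to vanish at coordinate i, so that
        # later eliminations do not reintroduce a non-zero entry there.
        for jdx in range(idx + 1, len(kernel)):
            kernel[jdx] = kernel[jdx] - (kernel[jdx][i] / e[i]) * e
    return x
```

This sketch eliminates one coefficient per kernel direction in a single pass over a basis of ker(𝐀); the implementations of [LL12, Tch15] instead work on sub-systems of manageable size in order to control the cost.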

Observe that the positivity of the original coefficients is weakly preserved. That is, if we let x1,,xmx^{\prime}_{1},\ldots,x^{\prime}_{m}\in{\mathbb{R}} denote the coefficients of the vector 𝐱m{\bf x}^{\prime}\in{\mathbb{R}}^{m}, then for each i{1,,m}i\in\{1,\ldots,m\} we have that xi0x^{\prime}_{i}\geq 0. One consequence of this preservation is that recombination can be used to reduce the support of an empirical measure whilst preserving a given finite collection of moments [LL12]. Moreover, this property is essential to the ground-breaking state-of-the-art convex kernel quadrature method developed by Satoshi Hayakawa, the first author, and Harald Oberhauser in [HLO21].

The implementation of recombination proposed in [LL12] iteratively makes use of the method outlined above, for m=2km=2k, applied to linear systems arising via sub-dividing the original system. The improved tree-based method developed by Maria Tchernychova in [Tch15] provides a significantly more efficient implementation. A geometrically greedy algorithm is proposed in [ACO20] to implement the algorithm of [Tch15]. In the setting of our example above, it follows from [Tch15] that the complexity of the recombination algorithm is O(mk+k3log(m/k))O(mk+k^{3}\log(m/k)).

Having outlined the recombination method, we turn our attention to establishing the following result regarding the use of recombination in our setting.

Lemma 3.1 (Recombination Thinning).

Assume XX is a Banach space with dual space XX^{\ast}, that 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1}, and that m0m\in{\mathbb{Z}}_{\geq 0}. Define :=min{𝒩,m+1}{\cal M}:=\min\{{\cal N},m+1\}. Let :={f1,,f𝒩}X{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X be a collection of non-zero elements and L={σ1,,σm}XL=\{\sigma_{1},\ldots,\sigma_{m}\}\subset X^{\ast}. Suppose a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 and consider the element φSpan()X\varphi\in\operatorname{Span}({\cal F})\subset X defined by φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Then recombination can be applied to find non-negative coefficients b1,,b0b_{1},\ldots,b_{{\cal M}}\geq 0 and indices e(1),,e(){1,,𝒩}e(1),\ldots,e({\cal M})\in\{1,\ldots,{\cal N}\} satisfying that

j=1bj=i=1𝒩ai,\sum_{j=1}^{{\cal M}}b_{j}=\sum_{i=1}^{{\cal N}}a_{i}, (3.2)

and such that the element uSpan()Xu\in\operatorname{Span}({\cal F})\subset X defined by

u:=j=1bjfe(j)satisfies, for every σL, thatσ(φu)=0.u:=\sum_{j=1}^{{\cal M}}b_{j}f_{e(j)}\quad\text{satisfies, for every }\sigma\in L,\text{ that}\quad\sigma(\varphi-u)=0. (3.3)

Finally, the complexity of this use of recombination is O(𝒩m+m3log(𝒩/m))O({\cal N}m+m^{3}\log({\cal N}/m)).

Proof of Lemma 3.1.

Assume XX is a Banach space with dual space XX^{\ast}, that 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1}, and that m0m\in{\mathbb{Z}}_{\geq 0}. Define :=min{𝒩,m+1}{\cal M}:=\min\{{\cal N},m+1\}. Let :={f1,,f𝒩}X{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X be a collection of non-zero elements and L={σ1,,σm}XL=\{\sigma_{1},\ldots,\sigma_{m}\}\subset X^{\ast}. Suppose a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 and consider the element φSpan()X\varphi\in\operatorname{Span}({\cal F})\subset X defined by φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}.

The values σ1(φ),,σm(φ)\sigma_{1}(\varphi),\ldots,\sigma_{m}(\varphi) and the sum of the coefficients i=1𝒩ai\sum_{i=1}^{{\cal N}}a_{i} give rise to the linear system of equations

(111σ1(f1)σ1(f2)σ1(f𝒩)σm(f1)σm(f2)σm(f𝒩))(a1a2a𝒩)=(i=1𝒩aiσ1(φ)σm(φ))\begin{pmatrix}1&1&\dots&1\\ \sigma_{1}(f_{1})&\sigma_{1}(f_{2})&\dots&\sigma_{1}(f_{{\cal N}})\\ \vdots&\vdots&\ddots&\vdots\\ \sigma_{m}(f_{1})&\sigma_{m}(f_{2})&\dots&\sigma_{m}(f_{{\cal N}})\end{pmatrix}\begin{pmatrix}a_{1}\\ a_{2}\\ \vdots\\ \vdots\\ a_{{\cal N}}\end{pmatrix}=\begin{pmatrix}\sum_{i=1}^{{\cal N}}a_{i}\\ \sigma_{1}(\varphi)\\ \vdots\\ \sigma_{m}(\varphi)\end{pmatrix} (3.4)

which we denote more succinctly by 𝐀𝐱=𝐲{\bf A}{\bf x}={\bf y}.

Since the coefficients a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 are positive, we are able to apply recombination [LL12, Tch15] to this linear system. Combined with the observation that dim(ker(𝐀))𝒩\dim\left(\ker({\bf A})\right)\geq{\cal N}-{\cal M}, an application of recombination returns an element 𝐱=(𝐱1,,𝐱𝒩)𝒩{\bf x}^{\prime}=\left({\bf x}^{\prime}_{1},\ldots,{\bf x}^{\prime}_{{\cal N}}\right)\in{\mathbb{R}}^{{\cal N}} satisfying the following properties. Firstly, for each j{1,,𝒩}j\in\{1,\ldots,{\cal N}\} the coefficient 𝐱j0{\bf x}^{\prime}_{j}\geq 0 is non-negative. Secondly, there are at most {\cal M} indices i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} for which 𝐱i>0{\bf x}^{\prime}_{i}>0.

Let D:=dim(ker(𝐀))𝒩D:=\dim\left(\ker({\bf A})\right)\geq{\cal N}-{\cal M}. Take e(1),,e(𝒩D){1,,𝒩}e(1),\ldots,e({\cal N}-D)\in\{1,\ldots,{\cal N}\} to be the indices i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} for which 𝐱i>0{\bf x}^{\prime}_{i}>0. Then, for each j{1,,𝒩D}j\in\{1,\ldots,{\cal N}-D\}, we set bj:=𝐱e(j)>0b_{j}:={\bf x}^{\prime}_{e(j)}>0. Define an element uSpan()u\in\operatorname{Span}({\cal F}) by u:=j=1𝒩Dbjfe(j)u:=\sum_{j=1}^{{\cal N}-D}b_{j}f_{e(j)} (cf. (3.3)). Recall that recombination ensures that 𝐱{\bf x}^{\prime} is a solution to the linear system (3.4). Hence the equation corresponding to the top row of the matrix 𝐀{\bf A} tells us that

j=1𝒩Dbj=j=1𝒩𝐱j=j=1𝒩aj.\sum_{j=1}^{{\cal N}-D}b_{j}=\sum_{j=1}^{{\cal N}}{\bf x}^{\prime}_{j}=\sum_{j=1}^{{\cal N}}a_{j}. (3.5)

Moreover, given any l{1,,m}l\in\{1,\ldots,m\}, the equation corresponding to row l+1l+1 of the matrix 𝐀{\bf A} tells us that

σl(u)=j=1𝒩Dbjσl(fe(j))=j=1𝒩𝐱jσl(fj)=σl(φ).\sigma_{l}(u)=\sum_{j=1}^{{\cal N}-D}b_{j}\sigma_{l}\left(f_{e(j)}\right)=\sum_{j=1}^{{\cal N}}{\bf x}^{\prime}_{j}\sigma_{l}\left(f_{j}\right)=\sigma_{l}(\varphi). (3.6)

If D=𝒩D={\cal N}-{\cal M} then (3.5) and (3.6) are precisely the equalities claimed in (3.2) and (3.3). If D>𝒩D>{\cal N}-{\cal M} then 𝒩D<{\cal N}-D<{\cal M}. Set b𝒩D+1==b=0b_{{\cal N}-D+1}=\ldots=b_{{\cal M}}=0 and choose any indices e(𝒩D+1),,e(){1,,𝒩}e({\cal N}-D+1),\ldots,e({\cal M})\in\{1,\ldots,{\cal N}\}. Evidently we have that j=1bj=j=1𝒩Dbj\sum_{j=1}^{{\cal M}}b_{j}=\sum_{j=1}^{{\cal N}-D}b_{j} and, for each l{1,,m}l\in\{1,\ldots,m\}, that j=1bjσl(fe(j))=j=1𝒩Dbjσl(fe(j))\sum_{j=1}^{{\cal M}}b_{j}\sigma_{l}\left(f_{e(j)}\right)=\sum_{j=1}^{{\cal N}-D}b_{j}\sigma_{l}\left(f_{e(j)}\right). Consequently, (3.5) and (3.6) are once again precisely the equalities claimed in (3.2) and (3.3).

It remains only to verify the claimed complexity of this application of recombination. For this purpose, we observe that recombination is applied to a linear system of m+1m+1 equations in 𝒩{\cal N} variables. Hence from [Tch15] we have that the complexity is O(𝒩(m+1)+(m+1)3log(𝒩/(m+1)))=O(𝒩m+m3log(𝒩/m))O\left({\cal N}(m+1)+(m+1)^{3}\log({\cal N}/(m+1))\right)=O\left({\cal N}m+m^{3}\log({\cal N}/m)\right) as claimed. This completes the proof of Lemma 3.1. ∎
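As an illustration only, the following Python sketch assembles the linear system (3.4) and applies a recombination routine (for instance the naive_recombination sketch from earlier in this section, or one of the efficient implementations of [Tch15, ACO20]) to produce the weights and indices of Lemma 3.1; the names here are hypothetical.

```python
import numpy as np

def recombination_thinning(sigma_of_f, a, recombine):
    """Sketch of the use of recombination in Lemma 3.1.

    sigma_of_f : (m, N) array with entries sigma_l(f_i) for the functionals in L.
    a          : (N,) array of strictly positive coefficients of phi.
    recombine  : routine reducing a non-negative solution of A @ x = y,
                 e.g. the naive_recombination sketch above.

    Returns (b, idx) with b >= 0 and sum(b) = sum(a), so that
    u = sum_j b[j] * f_{idx[j]} agrees with phi on L (cf. (3.2) and (3.3))."""
    sigma_of_f = np.asarray(sigma_of_f, dtype=float)
    a = np.asarray(a, dtype=float)
    _, N = sigma_of_f.shape
    # System (3.4): the row of ones encodes preservation of the sum of the a_i.
    A = np.vstack([np.ones((1, N)), sigma_of_f])
    x_new = recombine(A, a)
    idx = np.flatnonzero(x_new > 0)
    return x_new[idx], idx
```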

4 The Banach GRIM Algorithm

In this section we detail the Banach GRIM algorithm. The following Banach Extension Step is used to grow a collection of linear functionals from Σ\Sigma at which we require our next approximation of φ\varphi to agree with φ\varphi.

Banach Extension Step

Assume LΣL^{\prime}\subset\Sigma. Let uSpan()u\in\operatorname{Span}({\cal F}). Let m1m\in{\mathbb{Z}}_{\geq 1} such that #(L)+mΛ:=#(Σ)\#\left(L^{\prime}\right)+m\leq\Lambda:=\#\left(\Sigma\right). Take

σ1:=argmax{|σ(φu)|:σΣ}.\sigma_{1}:=\operatorname{argmax}\left\{|\sigma(\varphi-u)|~{}:~{}\sigma\in\Sigma\right\}. (4.1)

Inductively for j=2,3,,mj=2,3,\ldots,m take

σj:=argmax{|σ(φu)|:σΣ{σ1,,σj1}}.\sigma_{j}:=\operatorname{argmax}\left\{|\sigma(\varphi-u)|~{}:~{}\sigma\in\Sigma\setminus\{\sigma_{1},\ldots,\sigma_{j-1}\}\right\}. (4.2)

Once σ1,,σmΣ\sigma_{1},\ldots,\sigma_{m}\in\Sigma have been defined, we extend LL^{\prime} to L:=L{σ1,,σm}L:=L^{\prime}\cup\{\sigma_{1},\ldots,\sigma_{m}\}.

For each choice of subset LΣL\subset\Sigma, we use recombination (cf. Lemma 3.1) to find uSpan()u\in\operatorname{Span}({\cal F}) agreeing with φ\varphi throughout LL. Theoretically, Lemma 3.1 verifies that recombination can be used to find an approximation uu of φ\varphi for which σ(φu)=0\sigma(\varphi-u)=0 for all linear functionals σL\sigma\in L for a given subset LΣXL\subset\Sigma\subset X^{\ast}. However, implementations of recombination inevitably result in numerical errors. That is, the returned coefficients will only solve the equations modulo some (ideally) small error term. To account for this in our analysis, whenever we apply Lemma 3.1 we will only assume that the resulting approximation uSpan()u\in\operatorname{Span}({\cal F}) is close to φ\varphi at each functional σL\sigma\in L. That is, for each σL\sigma\in L, we have that |σ(φu)|ε0|\sigma(\varphi-u)|\leq\varepsilon_{0} for some (small) constant ε00\varepsilon_{0}\geq 0.
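For illustration, a sketch of the Banach Extension Step under the assumption that the values |σ(φ−u)| have been assembled into an array indexed by an enumeration of Σ is given below (Python with NumPy; all names are hypothetical).

```python
import numpy as np

def banach_extension_step(abs_residual, selected, m):
    """Sketch of the Banach Extension Step: extend the index set `selected`
    (representing L') by the m functionals in Sigma at which the current
    approximation u is furthest from phi (cf. (4.1) and (4.2))."""
    masked = np.asarray(abs_residual, dtype=float).copy()
    masked[list(selected)] = -np.inf        # never reselect a functional already in L'
    new = np.argsort(masked)[::-1][:m]      # indices of the m largest values |sigma(phi - u)|
    return list(selected) + [int(j) for j in new]
```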

Recall (cf. Section 3) that if recombination is applied to a linear system corresponding to a matrix AA, then a Singular Value Decomposition (SVD) of the matrix AA is used to find a basis for ker(A)\ker(A). Re-ordering the rows of the matrix AA (i.e. changing the order in which the equations are considered) can potentially result in a different basis for ker(A)\ker(A) being selected. Thus shuffling the order of the equations can affect the approximation returned by recombination via the recombination thinning Lemma 3.1. We exploit this by optimising the approximation returned by recombination over a chosen number of shuffles of the equations forming the linear system. This is made precise in the following Banach Recombination Step detailing how, for a given subset LΣL\subset\Sigma, we use recombination to find an element uSpan()u\in\operatorname{Span}({\cal F}) that is within ε0\varepsilon_{0} of φ\varphi throughout LL.

Banach Recombination Step

Assume LΣL\subset\Sigma. Let s1s\in{\mathbb{Z}}_{\geq 1}. For each j{1,,s}j\in\{1,\ldots,s\} we do the following.

  1. (A)

    Let LjΣL_{j}\subset\Sigma be the subset resulting from applying a random permutation to the ordering of the elements in LL.

  2. (B)

    Apply recombination (cf. the recombination thinning Lemma 3.1) to find an element ujSpan()u_{j}\in\operatorname{Span}({\cal F}) satisfying, for every σLj\sigma\in L_{j}, that |σ(φuj)|ε0|\sigma(\varphi-u_{j})|\leq\varepsilon_{0}.

  3. (C)

    Compute E[uj]:=max{|σ(φuj)|:σΣ}E[u_{j}]:=\max\left\{|\sigma(\varphi-u_{j})|:\sigma\in\Sigma\right\}.

After obtaining the elements u1,,usSpan()u_{1},\ldots,u_{s}\in\operatorname{Span}({\cal F}), define uSpan()u\in\operatorname{Span}({\cal F}) by

u:=argmin{E[w]:w{u1,,us}}.u:=\operatorname{argmin}\left\{E[w]:w\in\{u_{1},\ldots,u_{s}\}\right\}. (4.3)

Then uSpan()u\in\operatorname{Span}({\cal F}) is returned as our approximation of φ\varphi that satisfies, for every σL\sigma\in L, that |σ(φu)|ε0|\sigma(\varphi-u)|\leq\varepsilon_{0}.

We now detail our proposed GRIM algorithm to find an approximation uSpan()u\in\operatorname{Span}({\cal F}) of φSpan()\varphi\in\operatorname{Span}({\cal F}) that is close to φ\varphi at every linear functional in Σ\Sigma.

Banach GRIM

  1. (A)

    Fix ε>0\varepsilon>0 as the target accuracy threshold, ε0[0,ε)\varepsilon_{0}\in[0,\varepsilon) as the acceptable recombination error, and M1M\in{\mathbb{Z}}_{\geq 1} as the maximum number of steps. Choose integers s1,,sM1s_{1},\ldots,s_{M}\in{\mathbb{Z}}_{\geq 1} as the shuffle numbers, and integers k1,,kM1k_{1},\ldots,k_{M}\in{\mathbb{Z}}_{\geq 1} with

    κ:=k1++kMmin{𝒩1,Λ}.\kappa:=k_{1}+\ldots+k_{M}\leq\min\left\{{\cal N}-1,\Lambda\right\}. (4.4)
  2. (B)

    For each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, if ai<0a_{i}<0 then replace aia_{i} and fif_{i} by ai-a_{i} and fi-f_{i} respectively. This ensures that a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 whilst leaving the expansion φ=i=1𝒩aifi\varphi=\sum_{i=1}^{{\cal N}}a_{i}f_{i} unaltered. Additionally, rescale each fif_{i} to have unit XX norm. That is, for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} we replace fif_{i} by hi:=fifiXh_{i}:=\frac{f_{i}}{||f_{i}||_{X}}. Then φ=i=1𝒩αihi\varphi=\sum_{i=1}^{{\cal N}}\alpha_{i}h_{i} where, for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, we have αi:=aifiX>0\alpha_{i}:=a_{i}||f_{i}||_{X}>0.

  3. (C)

    Apply the Banach Extension Step, with L:=L^{\prime}:=\varnothing, u:=0u:=0 and m:=k1m:=k_{1}, to obtain a subset Σ1={σ1,1,,σ1,k1}Σ\Sigma_{1}=\{\sigma_{1,1},\ldots,\sigma_{1,k_{1}}\}\subset\Sigma. Apply the Banach Recombination Step, with L:=Σ1L:=\Sigma_{1} and s:=s1s:=s_{1}, to find an element u1Span()u_{1}\in\operatorname{Span}({\cal F}) satisfying, for every σΣ1\sigma\in\Sigma_{1}, that |σ(φu1)|ε0|\sigma(\varphi-u_{1})|\leq\varepsilon_{0}.

    If M=1M=1 then the algorithm terminates here and returns u1u_{1} as the final approximation of φ\varphi.

  4. (D)

    If M2M\geq 2 then we proceed inductively for t2t\geq 2 as follows. If |σ(φut1)|ε|\sigma(\varphi-u_{t-1})|\leq\varepsilon for every σΣ\sigma\in\Sigma then we stop and ut1u_{t-1} is an approximation of φ\varphi possessing the desired level of accuracy. Otherwise, we choose kt1k_{t}\in{\mathbb{Z}}_{\geq 1} and apply the Banach Extension Step, with L:=Σt1L^{\prime}:=\Sigma_{t-1}, u:=ut1u:=u_{t-1} and m:=ktm:=k_{t}, to obtain a subset Σt:=Σt1{σt,1,,σt,kt}Σ\Sigma_{t}:=\Sigma_{t-1}\cup\left\{\sigma_{t,1},\ldots,\sigma_{t,k_{t}}\right\}\subset\Sigma. Apply the Banach Recombination Step, with L:=ΣtL:=\Sigma_{t} and s:=sts:=s_{t}, to find an element utSpan()u_{t}\in\operatorname{Span}({\cal F}) satisfying, for every σΣt\sigma\in\Sigma_{t}, that |σ(φut)|ε0|\sigma(\varphi-u_{t})|\leq\varepsilon_{0}.

    The algorithm ends either by returning ut1u_{t-1} for the t{2,,M}t\in\{2,\ldots,M\} for which the stopping criterion was triggered as the final approximation of φ\varphi, or by returning uMu_{M} as the final approximation of φ\varphi if the stopping criterion is never triggered.
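Purely as an illustration of how the pieces fit together, the following Python sketch runs the loop of steps (C)-(D) on the matrix of values σ(f_i), assuming the pre-processing of step (B) has already been carried out (so the coefficients are positive), ignoring the recombination error ε0, and taking a recombine routine as in the naive_recombination sketch of Section 3; all names are hypothetical.

```python
import numpy as np

def banach_grim(sigma_of_f, sigma_of_phi, a, k_steps, shuffles, eps, recombine, seed=0):
    """Illustrative sketch of the Banach GRIM loop (steps (C)-(D)).

    sigma_of_f   : (Lambda, N) array with entries sigma_j(f_i).
    sigma_of_phi : (Lambda,) array with entries sigma_j(phi).
    a            : (N,) strictly positive coefficients (step (B) already applied).
    k_steps      : (k_1, ..., k_M); shuffles: (s_1, ..., s_M); eps: target accuracy.
    recombine    : routine reducing a non-negative solution of A @ x = y."""
    rng = np.random.default_rng(seed)
    Lam, N = sigma_of_f.shape
    selected = []                                  # indices of the functionals in Sigma_t
    b = np.zeros(N)                                # coefficients of the current approximation
    for k_t, s_t in zip(k_steps, shuffles):
        residual = np.abs(sigma_of_phi - sigma_of_f @ b)
        if residual.max() <= eps:                  # early stopping criterion of step (D)
            break
        # Banach Extension Step: add the k_t worst matched functionals.
        masked = residual.copy()
        masked[selected] = -np.inf
        selected = selected + [int(j) for j in np.argsort(masked)[::-1][:k_t]]
        # Banach Recombination Step: optimise over s_t shuffles of the equations.
        best_b, best_err = b, np.inf
        for _ in range(s_t):
            rows = [selected[p] for p in rng.permutation(len(selected))]
            A = np.vstack([np.ones((1, N)), sigma_of_f[rows]])
            candidate = recombine(A, a)
            err = np.abs(sigma_of_phi - sigma_of_f @ candidate).max()
            if err < best_err:
                best_b, best_err = candidate, err
        b = best_b
    return b
```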

If the algorithm terminates as a result of the early stopping criterion being triggered then we are guaranteed that the returned approximation uu satisfies, for every σΣ\sigma\in\Sigma, that |σ(φu)|ε|\sigma(\varphi-u)|\leq\varepsilon. In Section 6 we establish estimates for how large MM is required to be in order to guarantee that the algorithm returns an approximation of φ\varphi possessing this level of accuracy throughout Σ\Sigma (cf. the Banach GRIM Convergence Theorem 6.2).

GRIM uses data-driven growth. For each m{2,,M}m\in\{2,\ldots,M\}, GRIM first determines the new linear functionals in Σ\Sigma to be added to Σm1\Sigma_{m-1} to form ΣmΣ\Sigma_{m}\subset\Sigma before using recombination to find an approximation coinciding with φ\varphi on Σm\Sigma_{m}. That is, we first choose the new information that we want our approximation to match before using recombination to both select the elements from {\cal F} and use them to construct our approximation.

Evidently we have the nesting property that Σt1Σt2\Sigma_{t_{1}}\subset\Sigma_{t_{2}} for integers t1t2t_{1}\leq t_{2}, ensuring that at each step we are increasing the amount of information that we require our approximation to match. For each integer m{1,,M}m\in\{1,\ldots,M\} let SmS_{m}\subset{\cal F} denote the sub-collection of elements from {\cal F} used to form the approximation umu_{m}. Recombination is applied to a system of 1+k1++km1+k_{1}+\ldots+k_{m} linear equations when finding umu_{m}, hence we may conclude that #(Sm)min{1+k1++km,𝒩}\#(S_{m})\leq\min\left\{1+k_{1}+\ldots+k_{m},{\cal N}\right\} (cf. the recombination thinning Lemma 3.1). Besides this upper bound for #(Sm)\#(S_{m}), we have no control on the sets SmS_{m}. We impose only that the linear functionals are greedily chosen; the selection of the elements from {\cal F} to form the approximation umu_{m} is left up to recombination and determined by the data. In contrast to GEIM, there is no requirement that elements from {\cal F} used to form umu_{m} must also be used for ulu_{l} for l>ml>m.

The restriction on κ:=k1++kM\kappa:=k_{1}+\ldots+k_{M} in (4.4) is for the following reasons. As observed above, for m{1,,M}m\in\{1,\ldots,M\}, kmk_{m} is the number of new linear functionals to be chosen at the mthm^{\text{th}}-step. Hence, at the mthm^{\text{th}}-step, recombination is used to find an approximation uu that is within ε0\varepsilon_{0} of φ\varphi on a collection of κm:=k1++km\kappa_{m}:=k_{1}+\ldots+k_{m} linear functionals from Σ\Sigma. Evidently we have, for every m{1,,M}m\in\{1,\ldots,M\}, that κmκ\kappa_{m}\leq\kappa.

A first consequence of the restriction in (4.4) is that, for every m{1,,M}m\in\{1,\ldots,M\}, we have κm𝒩1\kappa_{m}\leq{\cal N}-1. Since the linear system that recombination is applied to at step mm consists of 1+κm1+\kappa_{m} equations (cf. the recombination thinning Lemma 3.1), this prevents the number of equations from exceeding 𝒩{\cal N}. When the number of equations is at least 𝒩{\cal N}, recombination returns the original coefficients without reducing the number that are non-zero (see Section 3). Consequently, κm𝒩1\kappa_{m}\geq{\cal N}-1 guarantees that step mm returns φ\varphi itself as the approximation of φ\varphi. The restriction in (4.4) ensures that the algorithm ends if this (essentially useless) stage is reached.

A second consequence of (4.4) is, for every m{1,,M}m\in\{1,\ldots,M\}, that κmΛ\kappa_{m}\leq\Lambda. Note the collection ΣmΣ\Sigma_{m}\subset\Sigma has cardinality κm\kappa_{m}. If κm=Λ\kappa_{m}=\Lambda then we necessarily have that Σm=Σ\Sigma_{m}=\Sigma, and so recombination is used to find an approximation of φ\varphi that is within ε0\varepsilon_{0} of φ\varphi at every σΣ\sigma\in\Sigma. The restriction (4.4) ensures that if this happens then the algorithm terminates without carrying out additional steps.

5 Complexity Cost

In this section we consider the complexity cost of the Banach GRIM algorithm presented in Section 4. We begin by recording the complexity cost of the Banach Extension Step.

Lemma 5.1 (Banach Extension Step Complexity Cost).

Let 𝒩,Λ,m,t1{\cal N},\Lambda,m,t\in{\mathbb{Z}}_{\geq 1} and XX be a Banach space with dual space XX^{\ast}. Assume ={f1,,f𝒩}X{0}{\cal F}=\left\{f_{1},\ldots,f_{{\cal N}}\right\}\subset X\setminus\{0\} and that ΣX\Sigma\subset X^{\ast} has cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Assume that the set {σ(φ):σΣ}\{\sigma(\varphi):\sigma\in\Sigma\} and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, the set {σ(fi):σΣ}\{\sigma(f_{i}):\sigma\in\Sigma\} have already been computed. Then for any choices of LΣL^{\prime}\subset\Sigma with #(L)Λm\#\left(L^{\prime}\right)\leq\Lambda-m and any uSpan()u\in\operatorname{Span}({\cal F}) with #support(u)=t\#{\rm support}(u)=t, the complexity cost of applying the Banach Extension Step, with the LL^{\prime}, uu and mm there as the LL^{\prime}, uu and mm here respectively, is 𝒪((m+t)Λ){\cal O}\left((m+t)\Lambda\right).

Proof of Lemma 5.1.

Let 𝒩,Λ,m,t1{\cal N},\Lambda,m,t\in{\mathbb{Z}}_{\geq 1} and XX be a Banach space with dual space XX^{\ast}. Suppose that ={f1,,f𝒩}X{0}{\cal F}=\left\{f_{1},\ldots,f_{{\cal N}}\right\}\subset X\setminus\{0\} and that ΣX\Sigma\subset X^{\ast} has cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Assume that the set {σ(φ):σΣ}\{\sigma(\varphi):\sigma\in\Sigma\} and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, the set {σ(fi):σΣ}\{\sigma(f_{i}):\sigma\in\Sigma\} have already been computed. Suppose that LΣL^{\prime}\subset\Sigma with #(L)Λm\#\left(L^{\prime}\right)\leq\Lambda-m and uSpan()u\in\operatorname{Span}({\cal F}) with #support(u)=t\#{\rm support}(u)=t. Recall our convention (cf. Section 6.1) that support(u){\rm support}(u) is the set of the fif_{i} that correspond to the non-zero coefficients in the expansion of uu in terms of the fif_{i}. That is, u=i=1𝒩uifiu=\sum_{i=1}^{{\cal N}}u_{i}f_{i} for some coefficients u1,,u𝒩u_{1},\ldots,u_{{\cal N}}\in{\mathbb{R}}, and

support(u):={fj:j{1,,𝒩} and uj0}.{\rm support}(u):=\left\{f_{j}~{}:~{}j\in\{1,\ldots,{\cal N}\}\text{ and }u_{j}\neq 0\right\}. (5.1)

Consider carrying out the Banach Extension Step with the L^{\prime}, u and m there as the L^{\prime}, u and m here respectively. Since we have access to \{\sigma(\varphi):\sigma\in\Sigma\} and \{\sigma(f_{i}):\sigma\in\Sigma\} for every i\in\{1,\ldots,{\cal N}\} without additional computation, and since \#{\rm support}(u)=t, the complexity cost of computing the set \{|\sigma(\varphi-u)|:\sigma\in\Sigma\} is no worse than {\cal O}(t\Lambda). The complexity cost of subsequently extracting the m largest values of this set is no worse than {\cal O}(m\Lambda). The complexity cost of appending the resulting m linear functionals in \Sigma to the collection L^{\prime} is {\cal O}(m). Therefore the entire Banach Extension Step has a complexity cost no worse than {\cal O}((m+t)\Lambda) as claimed. This completes the proof of Lemma 5.1. ∎
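To make the counting in Lemma 5.1 concrete, the following minimal Python sketch realises the Banach Extension Step when the precomputed values \sigma(f_{i}) and \sigma(\varphi) are stored as an array of shape (\Lambda,{\cal N}) and a vector of length \Lambda. The sketch is purely illustrative and all names in it are ours, not part of the formal specification of the algorithm.

```python
import numpy as np

def banach_extension_step(sigma_phi, sigma_F, u_coeffs, L_prime, m):
    """Illustrative sketch of the Banach Extension Step.

    sigma_phi : (Lambda,) precomputed values sigma(phi) for each sigma in Sigma.
    sigma_F   : (Lambda, N) precomputed values sigma(f_i).
    u_coeffs  : (N,) coefficients of the current approximation u (t non-zero entries).
    L_prime   : set of indices (into Sigma) of previously selected functionals.
    m         : number of new linear functionals to select.
    """
    support = np.flatnonzero(u_coeffs)
    # Computing sigma(phi - u) for every sigma in Sigma only touches the t
    # non-zero coefficients of u, giving the O(t * Lambda) term of Lemma 5.1.
    residual = np.abs(sigma_phi - sigma_F[:, support] @ u_coeffs[support])
    # Select the m functionals (not already in L') at which |sigma(phi - u)| is largest.
    order = np.argsort(-residual)
    new = [int(j) for j in order if j not in L_prime][:m]
    return L_prime | set(new)
```

In particular, when L^{\prime}=\emptyset and u=0 this reduces to selecting the m functionals at which |\sigma(\varphi)| is largest, which is how the step is used in Step (C) of the Banach GRIM algorithm.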

We next record the complexity cost of the Banach Recombination Step.

Lemma 5.2.

Let 𝒩,Λ,m,s1{\cal N},\Lambda,m,s\in{\mathbb{Z}}_{\geq 1} and XX be a Banach space with dual space XX^{\ast}. Assume ={f1,,f𝒩}X{0}{\cal F}=\left\{f_{1},\ldots,f_{{\cal N}}\right\}\subset X\setminus\{0\} and that ΣX\Sigma\subset X^{\ast} has cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Assume that the set {σ(φ):σΣ}\{\sigma(\varphi):\sigma\in\Sigma\} and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, the set {σ(fi):σΣ}\{\sigma(f_{i}):\sigma\in\Sigma\} have already been computed. Then for any LΣL\subset\Sigma with cardinality #(L)=m\#(L)=m the complexity cost of applying the Banach Recombination Step, with the subset LL and the integer ss there as LL and ss here respectively, is

𝒪(ms(𝒩+Λ)+m3slog(𝒩m)).{\cal O}\left(ms({\cal N}+\Lambda)+m^{3}s\log\left(\frac{{\cal N}}{m}\right)\right). (5.2)
Proof of Lemma 5.2.

Let 𝒩,Λ,m,s1{\cal N},\Lambda,m,s\in{\mathbb{Z}}_{\geq 1} and XX be a Banach space with dual space XX^{\ast}. Assume that ={f1,,f𝒩}X{0}{\cal F}=\left\{f_{1},\ldots,f_{{\cal N}}\right\}\subset X\setminus\{0\} and that ΣX\Sigma\subset X^{\ast} has cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Assume that the set {σ(φ):σΣ}\{\sigma(\varphi):\sigma\in\Sigma\} and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, the set {σ(fi):σΣ}\{\sigma(f_{i}):\sigma\in\Sigma\} have already been computed.

Consider applying the Banach Recombination Step with the subset L and the integer s there as L and s here respectively. Let j\in\{1,\ldots,s\}. The complexity cost of shuffling the elements in L to obtain L_{j} is {\cal O}(m). By appealing to the recombination thinning Lemma 3.1 we conclude that the complexity cost of applying recombination to find u_{j}\in\operatorname{Span}({\cal F}) satisfying, for every \sigma\in L_{j}, that |\sigma(\varphi-u_{j})|\leq\varepsilon_{0} is {\cal O}({\cal N}m+m^{3}\log({\cal N}/m)). Further recall that, since \#(L)=m, the recombination thinning Lemma 3.1 ensures that \#{\rm support}(u_{j})\leq m+1. Thus, since we already have access to \{\sigma(\varphi):\sigma\in\Sigma\} and \{\sigma(f_{i}):\sigma\in\Sigma\} for every i\in\{1,\ldots,{\cal N}\} without additional computation, and since \#{\rm support}(u_{j})\leq m+1, the complexity cost of computing E[u_{j}]:=\max\{|\sigma(\varphi-u_{j})|:\sigma\in\Sigma\} is {\cal O}(\Lambda m).

Therefore, summing over the s shuffles, the complexity cost of carrying out (A), (B), and (C) of the Banach Recombination Step is

𝒪(sm+sΛm+s𝒩m+m3slog(𝒩m))=𝒪(ms(𝒩+Λ)+m3slog(𝒩m)).{\cal O}\left(sm+s\Lambda m+s{\cal N}m+m^{3}s\log\left(\frac{{\cal N}}{m}\right)\right)={\cal O}\left(ms({\cal N}+\Lambda)+m^{3}s\log\left(\frac{{\cal N}}{m}\right)\right). (5.3)

The complexity cost of the final selection of u:=argmin{E[w]:w{u1,,us}}u:=\operatorname{argmin}\left\{E[w]:w\in\{u_{1},\ldots,u_{s}\}\right\} is 𝒪(s){\cal O}(s). Combined with (5.3), this yields that the complexity cost of the entire Banach Recombination Step is

𝒪(ms(𝒩+Λ)+m3slog(𝒩m)).{\cal O}\left(ms({\cal N}+\Lambda)+m^{3}s\log\left(\frac{{\cal N}}{m}\right)\right). (5.4)

as claimed in (5.2). This completes the proof of Lemma 5.2. ∎
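The structure of the count in Lemma 5.2 can likewise be illustrated by the following sketch, in which recombination itself (cf. the recombination thinning Lemma 3.1) is treated as an assumed black-box callable; the sketch is ours and only indicates where the {\cal O}(m), {\cal O}({\cal N}m+m^{3}\log({\cal N}/m)) and {\cal O}(\Lambda m) contributions arise for each of the s shuffles.

```python
import numpy as np

rng = np.random.default_rng(0)

def banach_recombination_step(sigma_phi, sigma_F, L, s, recombine):
    """Illustrative sketch of the Banach Recombination Step.

    L         : list of m indices (into Sigma) of the selected linear functionals.
    s         : number of random shuffles of L to try.
    recombine : assumed callable implementing the recombination thinning of Lemma 3.1;
                given the m rows of sigma_F indexed by a shuffle of L and the matching
                entries of sigma_phi, it returns coefficients u (length N) with at most
                m + 1 non-zero entries and |sigma(phi - u)| <= eps_0 on those functionals.
    """
    best_u, best_E = None, np.inf
    for _ in range(s):
        Lj = list(L)
        rng.shuffle(Lj)                                   # O(m) per shuffle
        u = recombine(sigma_F[Lj, :], sigma_phi[Lj])      # O(N m + m^3 log(N / m))
        supp = np.flatnonzero(u)                          # at most m + 1 entries
        E = np.max(np.abs(sigma_phi - sigma_F[:, supp] @ u[supp]))   # O(Lambda m)
        if E < best_E:
            best_u, best_E = u, E
    return best_u                                         # argmin of E over the s candidates
```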

We now establish an upper bound for the complexity cost of the Banach GRIM algorithm via repeated use of Lemmas 5.1 and 5.2. This is the content of the following result.

Lemma 5.3 (Banach GRIM Complexity Cost).

Let M,{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} and \varepsilon>\varepsilon_{0}\geq 0. Take s_{1},\ldots,s_{M}\in{\mathbb{Z}}_{\geq 1} and k_{1},\ldots,k_{M}\in{\mathbb{Z}}_{\geq 1} with k_{1}+\ldots+k_{M}\leq\min\{{\cal N}-1,\Lambda\}. For j\in\{1,\ldots,M\} let \kappa_{j}:=\sum_{i=1}^{j}k_{i}. Assume X is a Banach space with dual-space X^{\ast}. Suppose {\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\} and that \Sigma\subset X^{\ast} is finite with cardinality \#(\Sigma)=\Lambda. Let a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define \varphi\in\operatorname{Span}({\cal F}) by \varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. The complexity cost of completing up to M steps of the Banach GRIM algorithm to approximate \varphi, with \varepsilon as the target accuracy, \varepsilon_{0} as the acceptable recombination error, M as the maximum number of steps, s_{1},\ldots,s_{M} as the shuffle numbers, and the integers k_{1},\ldots,k_{M}\in{\mathbb{Z}}_{\geq 1} as the integers k_{1},\ldots,k_{M} chosen in Step (A) of the Banach GRIM algorithm, is

𝒪(𝒩Λ+j=1Mκjsj(𝒩+Λ)+κj3sjlog(𝒩κj)).{\cal O}\left({\cal N}\Lambda+\sum_{j=1}^{M}\kappa_{j}s_{j}\left({\cal N}+\Lambda\right)+\kappa_{j}^{3}s_{j}\log\left(\frac{{\cal N}}{\kappa_{j}}\right)\right). (5.5)
Proof of Lemma 5.3.

Let M,𝒩,Λ1M,{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} and ε>ε00\varepsilon>\varepsilon_{0}\geq 0. Take s1,,sM1s_{1},\ldots,s_{M}\in{\mathbb{Z}}_{\geq 1} and k1,,kM1k_{1},\ldots,k_{M}\in{\mathbb{Z}}_{\geq 1} with k1++kMmin{𝒩1,Λ}k_{1}+\ldots+k_{M}\leq\min\left\{{\cal N}-1,\Lambda\right\}. For j{1,,M}j\in\{1,\ldots,M\} define κj:=i=1jki\kappa_{j}:=\sum_{i=1}^{j}k_{i}. Assume XX is a Banach space with dual-space XX^{\ast}. Suppose :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\} and that ΣX\Sigma\subset X^{\ast} is finite with cardinality #(Σ)=Λ\#\left(\Sigma\right)=\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) by φ:=i=1𝒩aifi\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}. Consider applying the Banach GRIM algorithm to approximate φ\varphi, with ε\varepsilon as the target accuracy, ε0\varepsilon_{0} as the acceptable recombination error, MM as the maximum number of steps, s1,,sMs_{1},\ldots,s_{M} as the shuffle numbers, and the integers k1,,kM1k_{1},\ldots,k_{M}\in{\mathbb{Z}}_{\geq 1} as the integers k1,,kMk_{1},\ldots,k_{M} chosen in Step (A) of the Banach GRIM algorithm.

Since the cardinality of {\cal F} is 𝒩{\cal N}, the complexity cost of the rescaling and sign alterations required in Step (B) of the Banach GRIM algorithm is 𝒪(𝒩){\cal O}({\cal N}). The complexity cost of computing the sets {σ(fi):σΣ}\left\{\sigma(f_{i}):\sigma\in\Sigma\right\} for i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} is 𝒪(𝒩Λ){\cal O}({\cal N}\Lambda). Subsequently, the complexity cost of computing the set {σ(φ):σΣ}\left\{\sigma(\varphi):\sigma\in\Sigma\right\} is 𝒪(𝒩Λ){\cal O}({\cal N}\Lambda). Consequently, the total complexity cost of performing these computations is 𝒪(𝒩Λ){\cal O}({\cal N}\Lambda).

We appeal to Lemma 5.1 to conclude that the complexity cost of performing the Banach Extension Step as in Step (C) of the Banach GRIM algorithm (i.e. with L:=L^{\prime}:=\emptyset, u:=0u:=0, and m:=k1m:=k_{1}) is 𝒪(k1Λ){\cal O}\left(k_{1}\Lambda\right). By appealing to Lemma 5.2, we conclude that the complexity cost of the use of the Banach Recombination Step in Step (C) of the Banach GRIM algorithm (i.e. with L:=Σ1L:=\Sigma_{1} and s:=s1s:=s_{1}) is 𝒪(s1k1(𝒩+Λ)+k13s1log(𝒩/k1)){\cal O}\left(s_{1}k_{1}\left({\cal N}+\Lambda\right)+k_{1}^{3}s_{1}\log\left({\cal N}/k_{1}\right)\right). Hence the complexity cost of Step (C) of the Banach GRIM algorithm is 𝒪(s1k1(𝒩+Λ)+k13s1log(𝒩/k1)){\cal O}\left(s_{1}k_{1}\left({\cal N}+\Lambda\right)+k_{1}^{3}s_{1}\log\left({\cal N}/k_{1}\right)\right).

In the case that M=1M=1 we can already conclude that the total complexity cost of performing the Banach GRIM algorithm is 𝒪(k1s1(𝒩+Λ)+𝒩Λ+k13s1log(𝒩k1)){\cal O}\left(k_{1}s_{1}\left({\cal N}+\Lambda\right)+{\cal N}\Lambda+k_{1}^{3}s_{1}\log\left(\frac{{\cal N}}{k_{1}}\right)\right) as claimed in (5.5). Now suppose that M2M\geq 2. We assume that all MM steps of the Banach GRIM algorithm are completed without early termination since this is the case that will maximise the complexity cost. Under this assumption, for j{1,,M}j\in\{1,\ldots,M\} let ujSpan()u_{j}\in\operatorname{Span}({\cal F}) denote the approximation of φ\varphi returned after step jj of the Banach GRIM algorithm is completed. Recall that κj:=i=1jki\kappa_{j}:=\sum_{i=1}^{j}k_{i}. Examining the Banach GRIM algorithm, uju_{j} is obtained by applying recombination (cf. the recombination thinning Lemma 3.1) to find an approximation that is within ε0\varepsilon_{0} of φ\varphi on a subset of κj\kappa_{j} linear functionals from Σ\Sigma. Thus, by appealing to the recombination thinning Lemma 3.1, we have that #support(uj)1+κj\#{\rm support}(u_{j})\leq 1+\kappa_{j}. For the purpose of computing the maximal complexity cost we assume that #support(uj)=1+κj\#{\rm support}(u_{j})=1+\kappa_{j}.

Let t\in\{2,\ldots,M\} and consider performing Step (D) of the Banach GRIM algorithm for the s there as t here. Since \#{\rm support}(u_{t-1})=1+\kappa_{t-1}, Lemma 5.1 tells us that the complexity cost of performing the Banach Extension Step as in Step (D) of the Banach GRIM algorithm (i.e. with L^{\prime}:=\Sigma_{t-1}, u:=u_{t-1}, and m:=k_{t}) is {\cal O}(\kappa_{t}\Lambda). Lemma 5.2 yields that the complexity cost of the use of the Banach Recombination Step in Step (D) of the Banach GRIM algorithm (i.e. with L:=\Sigma_{t} and s:=s_{t}) is {\cal O}(\kappa_{t}s_{t}({\cal N}+\Lambda)+\kappa_{t}^{3}s_{t}\log({\cal N}/\kappa_{t})). Consequently, the total complexity cost of performing Step (D) of the Banach GRIM algorithm (for t here playing the role of s there) is

𝒪(κtst(𝒩+Λ)+κt3stlog(𝒩κt)).{\cal O}\left(\kappa_{t}s_{t}\left({\cal N}+\Lambda\right)+\kappa_{t}^{3}s_{t}\log\left(\frac{{\cal N}}{\kappa_{t}}\right)\right). (5.6)

By summing the complexity costs in (5.6) over t{2,,M}t\in\{2,\ldots,M\} we deduce that the complexity cost of Step (D) of the Banach GRIM algorithm is

𝒪(j=2Mκjsj(𝒩+Λ)+κj3sjlog(𝒩κj)).{\cal O}\left(\sum_{j=2}^{M}\kappa_{j}s_{j}\left({\cal N}+\Lambda\right)+\kappa_{j}^{3}s_{j}\log\left(\frac{{\cal N}}{\kappa_{j}}\right)\right). (5.7)

Since we have previously observed that the complexity cost of performing Steps (A), (B), and (C) of the Banach GRIM algorithm is 𝒪(k1s1(𝒩+Λ)+𝒩Λ+k13s1log(𝒩k1)){\cal O}\left(k_{1}s_{1}\left({\cal N}+\Lambda\right)+{\cal N}\Lambda+k_{1}^{3}s_{1}\log\left(\frac{{\cal N}}{k_{1}}\right)\right), it follows from (5.7) that the complexity cost of performing the entire Banach GRIM algorithm is (recalling that κ1:=k1\kappa_{1}:=k_{1})

𝒪(𝒩Λ+j=1Mκjsj(𝒩+Λ)+κj3sjlog(𝒩κj)){\cal O}\left({\cal N}\Lambda+\sum_{j=1}^{M}\kappa_{j}s_{j}\left({\cal N}+\Lambda\right)+\kappa_{j}^{3}s_{j}\log\left(\frac{{\cal N}}{\kappa_{j}}\right)\right) (5.8)

as claimed in (5.5). This completes the proof of Lemma 5.3. ∎

We end this section by explicitly recording the complexity cost estimates resulting from Lemma 5.3 for some particular choices of parameters. For both choices we assume that we are in the situation that {\cal N}\gg\Lambda.

First, consider the choices that M:=1M:=1, k1:=Λk_{1}:=\Lambda, and s1:=1s_{1}:=1. This corresponds to making a single application of recombination to find an approximation of φ\varphi that is within ε0\varepsilon_{0} of φ\varphi at every linear functional in Σ\Sigma. Lemma 5.3 tells us that the complexity cost of doing this is 𝒪(𝒩Λ+Λ2+Λ3log(𝒩Λ)){\cal O}\left({\cal N}\Lambda+\Lambda^{2}+\Lambda^{3}\log\left(\frac{{\cal N}}{\Lambda}\right)\right).

Secondly, consider the choices of an arbitrary M\in{\mathbb{Z}}_{\geq 1}, k_{1}=\ldots=k_{M}=1, and arbitrary fixed s_{1},\ldots,s_{M}\in{\mathbb{Z}}_{\geq 1}. Let L\subset\Sigma denote the collection of linear functionals that is inductively grown during the Banach GRIM algorithm. These choices then correspond to adding a single new linear functional to L at each step. It follows, for each j\in\{1,\ldots,M\}, that \kappa_{j}:=\sum_{i=1}^{j}k_{i}=j. Lemma 5.3 tells us that the complexity cost of doing this is {\cal O}({\cal N}\Lambda+\sum_{j=1}^{M}js_{j}({\cal N}+\Lambda)+j^{3}s_{j}\log({\cal N}/j)). If we restrict to a single application of recombination at each step (i.e. choosing s_{1}=\ldots=s_{M}=1), then the complexity cost becomes {\cal O}({\cal N}\Lambda+M^{2}({\cal N}+\Lambda)+\sum_{j=1}^{M}j^{3}\log({\cal N}/j)).

In particular, if we take M:=ΛM:=\Lambda (which corresponds to allowing for the possibility that the collection LL may grow to be the entirety of Σ\Sigma) then the complexity cost is

𝒪(Λ2(𝒩+Λ)+j=1Λj3log(𝒩j)).{\cal O}\left(\Lambda^{2}\left({\cal N}+\Lambda\right)+\sum_{j=1}^{\Lambda}j^{3}\log\left(\frac{{\cal N}}{j}\right)\right). (5.9)

If 𝒩{\cal N} is large enough that Λ<e1/3𝒩\Lambda<e^{-1/3}{\cal N}, then the function xx3log(𝒩/x)x\mapsto x^{3}\log({\cal N}/x) is increasing on the interval [1,Λ][1,\Lambda]. Under these conditions, the complexity cost in (5.9) is no worse than

𝒪(Λ2𝒩+Λ4log(𝒩Λ)).{\cal O}\left(\Lambda^{2}{\cal N}+\Lambda^{4}\log\left(\frac{{\cal N}}{\Lambda}\right)\right). (5.10)
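For orientation, the two parameter regimes above can be compared numerically by evaluating the bound (5.5) up to its implicit constant; the sizes below are hypothetical and chosen only to reflect the regime {\cal N}\gg\Lambda.

```python
import math

def grim_cost_bound(N, Lam, ks, ss):
    """Evaluate the right hand side of (5.5), up to the implicit constant."""
    total = N * Lam
    kappa = 0
    for k, s in zip(ks, ss):
        kappa += k
        total += kappa * s * (N + Lam) + kappa ** 3 * s * math.log(N / kappa)
    return total

N, Lam = 10 ** 6, 200                                     # hypothetical sizes with N >> Lambda
single_shot = grim_cost_bound(N, Lam, ks=[Lam], ss=[1])   # M = 1, k_1 = Lambda, s_1 = 1
one_per_step = grim_cost_bound(N, Lam, ks=[1] * Lam, ss=[1] * Lam)  # M = Lambda, k_j = s_j = 1
```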

6 Convergence Analysis

In this section we establish a convergence result for the Banach GRIM algorithm and discuss its consequences. The section is organised as follows. In Subsection 6.1 we fix notation and conventions that will be used throughout the entirety of Section 6. In Subsection 6.2 we state the Banach GRIM Convergence theorem 6.2. This result establishes, in particular, a non-trivial upper bound on the maximum number of steps for which the Banach GRIM algorithm, under the choice M:=\min\{{\cal N}-1,\Lambda\} and with k_{t}:=1 for every t\in\{1,\ldots,M\}, can run before terminating. We do not consider the potentially beneficial impact of considering multiple shuffles at each step, and so we assume throughout this section that s_{t}:=1 for every t\in\{1,\ldots,M\}. We additionally discuss some performance guarantees available as a consequence of the Banach GRIM Convergence theorem 6.2. In Subsection 6.3 we record several lemmata providing, in particular, estimates for the approximations found at each completed step of the algorithm (cf. Lemma 6.4), and properties regarding the distribution of the collection of linear functionals selected at each completed step of the algorithm (cf. Lemma 6.5). In Subsection 6.4 we use the supplementary lemmata from Subsection 6.3 to prove the Banach GRIM Convergence theorem 6.2.

6.1 Notation and Conventions

In this section we fix notation and conventions that will be used throughout the entirety of Section 6.

Given a real Banach space ZZ, an integer m1m\in{\mathbb{Z}}_{\geq 1}, a subset L={z1,,zm}ZL=\{z_{1},\ldots,z_{m}\}\subset Z, an element r0r\in{\mathbb{R}}_{\geq 0}, and a norm λ\lambda on m{\mathbb{R}}^{m}, we will refer to the set

Spanλ(L,r):={j=1mβjzj:β:=(β1,,βm)m and λ(β)r}Z\operatorname{Span}_{\lambda}\left(L,r\right):=\left\{\sum_{j=1}^{m}\beta_{j}z_{j}~{}:~{}\beta:=(\beta_{1},\ldots,\beta_{m})\in{\mathbb{R}}^{m}\text{ and }\lambda(\beta)\leq r\right\}\subset Z (6.1)

as the λ\lambda distance rr span of L={z1,,zm}L=\{z_{1},\ldots,z_{m}\} in ZZ. If r=r=\infty then we take Spanλ(L,)\operatorname{Span}_{\lambda}\left(L,\infty\right) to be the usual linear span of the set L={z1,,zm}L=\{z_{1},\ldots,z_{m}\} in ZZ. Further, given constants c>0c>0 and α>β0\alpha>\beta\geq 0, we refer to the set

Reachλ(L,c,α,β):={0ταβSpan¯λ(L,τ)ατβcif β>0Span(L)αcif β=0\operatorname{Reach}_{\lambda}\left(L,c,\alpha,\beta\right):=\left\{\begin{array}[]{ll}\bigcup_{0\leq\tau\leq\frac{\alpha}{\beta}}\operatorname{\overline{Span}}_{\lambda}\left(L,\tau\right)_{\frac{\alpha-\tau\beta}{c}}&\mbox{if }\beta>0\\ \operatorname{Span}(L)_{\frac{\alpha}{c}}&\mbox{if }\beta=0\end{array}\right. (6.2)

as the cc weighted β\beta based λ\lambda distance α\alpha reach of L={z1,,zm}L=\{z_{1},\ldots,z_{m}\} in ZZ. Given V={v1,,vm}ZV=\{v_{1},\ldots,v_{m}\}\subset Z we let conv(V){\mathrm{conv}}(V) denote the closed convex hull of {v1,,vm}\left\{v_{1},\ldots,v_{m}\right\} in ZZ. That is

conv(V):={j=1mβjvj:β1,,βm[0,1] and j=1mβj=1}Z.{\mathrm{conv}}(V):=\left\{\sum_{j=1}^{m}\beta_{j}v_{j}~{}:~{}\beta_{1},\ldots,\beta_{m}\in[0,1]\text{ and }\sum_{j=1}^{m}\beta_{j}=1\right\}\subset Z. (6.3)

For future use, we record that the ||\cdot||_{l^{1}({\mathbb{R}}^{m})} distance 1 span of L=\{z_{1},\ldots,z_{m}\} in Z coincides with the closed convex hull of the set \{z_{1},\ldots,z_{m},0,-z_{1},\ldots,-z_{m}\} in Z, and that consequently the (\alpha-\beta)/c-fattening of this convex hull is contained within the c weighted \beta based ||\cdot||_{l^{1}({\mathbb{R}}^{m})} distance \alpha reach of L, denoted \operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,c,\alpha,\beta), whenever c>0 and \alpha>\beta\geq 0. This is the content of the following lemma.

Lemma 6.1.

Let ZZ be a Banach space and m1m\in{\mathbb{Z}}_{\geq 1}. Assume that L={z1,,zm}ZL=\{z_{1},\ldots,z_{m}\}\subset Z and define V:={z1,,zm,0,z1,,zm}V:=\left\{z_{1},\ldots,z_{m},0,-z_{1},\ldots,-z_{m}\right\}. Then we have that

conv(V)=Span||||l1(m)(L,1).{\mathrm{conv}}(V)=\operatorname{Span}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}\left(L,1\right). (6.4)

Consequently, given any c>0c>0 and any α>β0\alpha>\beta\geq 0 we have that

conv(V)αβcReach||||l1(m)(L,c,α,β).{\mathrm{conv}}(V)_{\frac{\alpha-\beta}{c}}\subset\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,c,\alpha,\beta). (6.5)
Proof of Lemma 6.1.

Let Z be a Banach space, m\in{\mathbb{Z}}_{\geq 1}, and assume that L=\{z_{1},\ldots,z_{m}\}\subset Z. Define z_{0}:=0\in Z and, for each j\in\{1,\ldots,m\}, define z_{-j}:=-z_{j}. Let {\cal C}:={\mathrm{conv}}(\{z_{1},\ldots,z_{m},0,-z_{1},\ldots,-z_{m}\}) and {\cal D}:=\operatorname{Span}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,1). We establish (6.4) by proving both the inclusions {\cal C}\subset{\cal D} and {\cal D}\subset{\cal C}.

We first prove that 𝒞𝒟{\cal C}\subset{\cal D}. To do so, suppose that z𝒞z\in{\cal C}. Then there are coefficients βm,,βm[0,1]\beta_{-m},\ldots,\beta_{m}\in[0,1] such that z=j=mmβjzjz=\sum_{j=-m}^{m}\beta_{j}z_{j} and j=mmβj=1\sum_{j=-m}^{m}\beta_{j}=1. Then we have that z=j=1m(βjβj)zjz=\sum_{j=1}^{m}\left(\beta_{j}-\beta_{-j}\right)z_{j}, and moreover we may observe via the triangle inequality that j=1m|βjβj|j=mmβj=1\sum_{j=1}^{m}\left|\beta_{j}-\beta_{-j}\right|\leq\sum_{j=-m}^{m}\beta_{j}=1. Consequently, by taking cj:=βjβjc_{j}:=\beta_{j}-\beta_{-j} for j{1,,m}j\in\{1,\ldots,m\}, we see that z𝒟z\in{\cal D} by definition. The arbitrariness of z𝒞z\in{\cal C} allows us to conclude that 𝒞𝒟{\cal C}\subset{\cal D}.

In order to prove {\cal D}\subset{\cal C}, suppose now that z\in{\cal D}. Hence there exist c_{1},\ldots,c_{m}\in{\mathbb{R}}, with \sum_{j=1}^{m}|c_{j}|\leq 1, and such that z=\sum_{j=1}^{m}c_{j}z_{j}. For each j\in\{1,\ldots,m\}, if c_{j}>0 define \beta_{j}:=c_{j} and \beta_{-j}:=0, and if c_{j}\leq 0 define \beta_{j}:=0 and \beta_{-j}:=-c_{j}. Observe that, for each j\in\{1,\ldots,m\}, we have \beta_{j},\beta_{-j}\in[0,1]. Further, for each j\in\{1,\ldots,m\}, we have \beta_{j}+\beta_{-j}=|c_{j}|, and so \sum_{j=1}^{m}\beta_{j}+\beta_{-j}=\sum_{j=1}^{m}|c_{j}|. Finally, define \beta_{0}:=1-\sum_{j=1}^{m}|c_{j}|\in[0,1] so that \beta_{-m},\ldots,\beta_{m}\in[0,1] with \sum_{j=-m}^{m}\beta_{j}=1. We observe that z=\sum_{j=1}^{m}c_{j}z_{j}=\sum_{c_{j}\geq 0}c_{j}z_{j}+\sum_{c_{j}<0}(-c_{j})(-z_{j})=\sum_{j=1}^{m}\beta_{j}z_{j}+\beta_{-j}z_{-j}=\sum_{j=-m}^{m}\beta_{j}z_{j} where the last equality holds since z_{0}:=0. Consequently, z\in{\cal C}. The arbitrariness of z\in{\cal D} allows us to conclude that {\cal D}\subset{\cal C}.

Together, the inclusions {\cal C}\subset{\cal D} and {\cal D}\subset{\cal C} establish that {\cal C}={\cal D} as claimed in (6.4). We complete the proof of Lemma 6.1 by using (6.4) to establish (6.5).

Let c>0 and \alpha>\beta\geq 0. First suppose that \beta>0. Then \alpha/\beta\geq 1 and so

\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,c,\alpha,\beta)\overset{(6.2)}{=}\bigcup_{0\leq\tau\leq\frac{\alpha}{\beta}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}\left(L,\tau\right)_{\frac{\alpha-\tau\beta}{c}}\supset\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}\left(L,1\right)_{\frac{\alpha-\beta}{c}}\overset{(6.4)}{=}{\mathrm{conv}}(V)_{\frac{\alpha-\beta}{c}}. (6.6)

If β=0\beta=0 then

\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,c,\alpha,0)\overset{(6.2)}{=}\operatorname{Span}(L)_{\frac{\alpha}{c}}\supset{\mathrm{conv}}(V)_{\frac{\alpha}{c}} (6.7)

where the last inclusion follows from the observation that any element in V is a linear combination of the elements in L=\{z_{1},\ldots,z_{m}\}. Together, (6.6) and (6.7) establish that we always have {\mathrm{conv}}(V)_{\frac{\alpha-\beta}{c}}\subset\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{m})}}(L,c,\alpha,\beta) as claimed in (6.5). This completes the proof of Lemma 6.1. ∎
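The two constructive correspondences used in the proof of Lemma 6.1 (passing from convex weights \beta to l^{1} coefficients c_{j}=\beta_{j}-\beta_{-j}, and splitting an l^{1} vector into its positive and negative parts) are elementary enough to check numerically. The following sketch does so in a Euclidean space standing in for Z; all names are ours and purely for exposition.

```python
import numpy as np

rng = np.random.default_rng(1)
m, dim = 4, 3
Z = rng.standard_normal((m, dim))                # z_1, ..., z_m in R^dim

# conv(V) -> l1 ball: start from convex weights on V = {z_1,...,z_m, 0, -z_1,...,-z_m}.
beta = rng.random(2 * m + 1); beta /= beta.sum()
V = np.vstack([Z, np.zeros((1, dim)), -Z])
z = beta @ V
c = beta[:m] - beta[m + 1:]                      # c_j = beta_j - beta_{-j}
assert np.abs(c).sum() <= 1 + 1e-12 and np.allclose(z, c @ Z)

# l1 ball -> conv(V): split c into positive and negative parts; the slack 1 - ||c||_1
# is placed on the zero element, so the weights are convex.
c = rng.standard_normal(m); c /= (np.abs(c).sum() + 0.5)     # ensures ||c||_1 < 1
beta_plus, beta_minus = np.maximum(c, 0.0), np.maximum(-c, 0.0)
assert np.allclose(beta_plus @ Z + beta_minus @ (-Z), c @ Z)
```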

Given a real Banach space Z, an integer m\in{\mathbb{Z}}_{\geq 1}, a subset S=\{z_{1},\ldots,z_{m}\}\subset Z, and w=\sum_{j=1}^{m}w_{j}z_{j}\in\operatorname{Span}(S) for w_{1},\ldots,w_{m}\in{\mathbb{R}}, we refer to the set {\rm support}(w):=\{z_{j}~:~j\in\{1,\ldots,m\}\text{ and }w_{j}\neq 0\} as the support of w. The cardinality of {\rm support}(w) is equal to the number of non-zero coefficients in the expansion of w in terms of z_{1},\ldots,z_{m}.

Finally, given a real Banach space ZZ, a r0r\in{\mathbb{R}}_{\geq 0} and a subset AZA\subset Z, we denote the rr-fattening of AA in ZZ by ArA_{r}. That is, we define Ar:={zZ:aA with zaZr}A_{r}:=\left\{z\in Z~{}:~{}\exists a\in A\text{ with }||z-a||_{Z}\leq r\right\}.

6.2 Main Theoretical Result

In this subsection we state our main theoretical result and then discuss its consequences. Our main theoretical result is the following Banach GRIM Convergence theorem for the Banach GRIM algorithm under the choice M:=min{𝒩1,Λ}M:=\min\left\{{\cal N}-1,\Lambda\right\} and that for every t{1,,M}t\in\{1,\ldots,M\} we have kt:=1k_{t}:=1 and st:=1s_{t}:=1.

Theorem 6.2 (Banach GRIM Convergence).

Assume XX is a Banach space with dual-space XX^{\ast}. Let ε>ε00\varepsilon>\varepsilon_{0}\geq 0. Let 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} and set M:=min{𝒩1,Λ}M:=\min\left\{{\cal N}-1,\Lambda\right\}. Let :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\} and ΣX\Sigma\subset X^{\ast} be a finite subset with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.8)

Then there is a non-negative integer N=N(Σ,C,ε,ε0)0N=N(\Sigma,C,\varepsilon,\varepsilon_{0})\in{\mathbb{Z}}_{\geq 0}, given by

N:=max{d:There exists σ1,,σdΣ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,ε,ε0)},N:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Sigma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\varepsilon,\varepsilon_{0}\right)\end{array}\right\}, (6.9)

for which the following is true.

Suppose N\leq M:=\min\{{\cal N}-1,\Lambda\} and consider applying the Banach GRIM algorithm to approximate \varphi on \Sigma with \varepsilon as the accuracy threshold, \varepsilon_{0} as the acceptable recombination error bound, M as the maximum number of steps, s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 1 (cf. Banach GRIM (A)). Then, after at most N steps the algorithm terminates. That is, there is some integer n\in\{1,\ldots,N\} for which the Banach GRIM algorithm terminates after completing n steps. Consequently, there are coefficients c_{1},\ldots,c_{n+1}\in{\mathbb{R}} and indices e(1),\ldots,e(n+1)\in\{1,\ldots,{\cal N}\} with

s=1n+1|cs|fe(s)X=C,\sum_{s=1}^{n+1}\left|c_{s}\right|||f_{e(s)}||_{X}=C, (6.10)

and such that the element uSpan()u\in\operatorname{Span}({\cal F}) defined by

u:=s=1n+1csfe(s)satisfies, for every σΣ, that|σ(φu)|ε.u:=\sum_{s=1}^{n+1}c_{s}f_{e(s)}\qquad\text{satisfies, for every }\sigma\in\Sigma,\text{ that}\qquad|\sigma(\varphi-u)|\leq\varepsilon. (6.11)

Further, given any A>1A>1, we have, for every σReach||||l1(Λ)(Σ,2C,Aε,ε)\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{\Lambda})}}\left(\Sigma,2C,A\varepsilon,\varepsilon\right), that

|σ(φu)|Aε.|\sigma(\varphi-u)|\leq A\varepsilon. (6.12)

Finally, if the coefficients a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} corresponding to φ\varphi (cf. (I) of (6.8)) are all positive (i.e. a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0) then the coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} corresponding to uu (cf. (6.11)) are all non-negative (i.e. c1,,cn+10c_{1},\ldots,c_{n+1}\geq 0).

Remark 6.3.

For the reader's convenience, we recall the specific notation from Subsection 6.1 utilised in the Banach GRIM Convergence Theorem 6.2. Firstly, given l\in{\mathbb{Z}}_{\geq 1}, a subset L=\{\sigma_{1},\ldots,\sigma_{l}\}\subset X^{\ast}, and s\in{\mathbb{R}}_{\geq 0} we have

Span¯||||l1(l)(L,s):={s=1lcsσs:c=(c1,,cl)l and cl1(l)s},\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right):=\left\{\sum_{s=1}^{l}c_{s}\sigma_{s}~{}:~{}c=(c_{1},\ldots,c_{l})\in{\mathbb{R}}^{l}\text{ and }||c||_{l^{1}({\mathbb{R}}^{l})}\leq s\right\}, (6.13)

whilst Span¯||||l1(l)(L,):=Span(L)\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,\infty\right):=\operatorname{Span}\left(L\right). Moreover, given r0r\in{\mathbb{R}}_{\geq 0} we define Span¯||||l1(l)(L,s)r\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right)_{r} to be the rr-fattening of Span¯||||l1(l)(L,s)\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right) in XX^{\ast}. That is

Span¯||||l1(l)(L,s)r:={σX:There exists σSpan¯||||l1(l)(L,s) with ||σσ||Xr}.\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right)_{r}:=\left\{\sigma\in X^{\ast}:\text{There exists }\sigma^{\prime}\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right)\text{ with }||\sigma-\sigma^{\prime}||_{X^{\ast}}\leq r\right\}. (6.14)

In particular, we have Span¯||||l1(l)(L,s)0:=Span¯||||l1(l)(L,s)\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right)_{0}:=\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right). Finally, given constants c>0c>0 and αβ0\alpha\geq\beta\geq 0 we have

Reach||||l1(l)(L,c,α,β):={0sαβSpan¯||||l1(l)(L,s)αsβcif β>0Span(L)αcif β=0.\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,c,\alpha,\beta\right):=\left\{\begin{array}[]{ll}\bigcup_{0\leq s\leq\frac{\alpha}{\beta}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(L,s\right)_{\frac{\alpha-s\beta}{c}}&\mbox{if }\beta>0\\ \operatorname{Span}(L)_{\frac{\alpha}{c}}&\mbox{if }\beta=0.\end{array}\right. (6.15)

The Banach GRIM Convergence Theorem 6.2 tells us that the maximum number of steps that the Banach GRIM algorithm can complete before terminating is

N:=max{d:There exists σ1,,σdΣ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,ε,ε0)}.N:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Sigma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\varepsilon,\varepsilon_{0}\right)\end{array}\right\}. (6.16)

Recall from Theorem 6.2 (cf. (6.11)) that the number of elements from {\cal F} required to yield the desired approximation of \varphi is no greater than N+1. Consequently, via Theorem 6.2, we conclude not only that no more than N+1 elements from {\cal F} are needed to approximate \varphi throughout \Sigma with an accuracy of \varepsilon, but also that the Banach GRIM algorithm will actually construct such an approximation. In particular, if N<{\cal N}-1 then we are guaranteed that the algorithm will find a linear combination of fewer than {\cal N} of the elements f_{1},\ldots,f_{{\cal N}} that is within \varepsilon of \varphi throughout \Sigma.

Further we observe that NN defined in (6.16) depends on both the features {\cal F} through the constant CC defined in (6.8) and the data Σ\Sigma. The constant CC itself depends only on the weights a1,,a𝒩a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}} and the values f1X,,f𝒩X>0||f_{1}||_{X},\ldots,||f_{{\cal N}}||_{X}\in{\mathbb{R}}_{>0}. No additional constraints are imposed on the collection of features {\cal F}; in particular, we do not assume the existence of a linear combination of fewer than 𝒩{\cal N} of the features in {\cal F} giving a good approximation of φ\varphi throughout Σ\Sigma.

We now relate the quantity NN defined in (6.16) to other geometric properties of the subset ΣX\Sigma\subset X^{\ast}.

For this purpose, let σ1,,σNΣ\sigma_{1},\ldots,\sigma_{N}\in\Sigma and, for each j{1,,N}j\in\{1,\ldots,N\}, define Σj:={σ1,,σj}\Sigma_{j}:=\{\sigma_{1},\ldots,\sigma_{j}\}. Suppose N2N\geq 2 and that for every j{1,,N1}j\in\{1,\ldots,N-1\} we have σj+1Reach||||l1(j)(Σj,2C,ε,ε0)\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Sigma_{j},2C,\varepsilon,\varepsilon_{0}\right).

In the case that ε0=0\varepsilon_{0}=0 this means, recalling (6.15), that σj+1Reach||||l1(j)(Σj,2C,ε,0)=Span(Σj)ε/2C\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Sigma_{j},2C,\varepsilon,0\right)=\operatorname{Span}(\Sigma_{j})_{\varepsilon/2C}. Hence, when ε0=0\varepsilon_{0}=0, the integer NN defined in (6.16) is given by

N=max{d:There exists σ1,,σdΣ such that for every j{1,,d1}we have σj+1Span(Σj)ε2C}.N=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Sigma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Span}(\Sigma_{j})_{\frac{\varepsilon}{2C}}\end{array}\right\}. (6.17)

Thus the maximum number of steps before the Banach GRIM algorithm terminates is determined by the maximum dimension of a linear subspace ΠX\Pi\subset X^{\ast} that has a basis consisting of elements in Σ\Sigma, and yet its ε/2C\varepsilon/2C-fattening fails to capture Σ\Sigma in the sense that ΣΠε/2CΣ\Sigma\cap\Pi_{\varepsilon/2C}\neq\Sigma.
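When \varepsilon_{0}=0 the quantity N of (6.17) can be probed empirically in a finite-dimensional surrogate for X^{\ast} by greedily growing a chain of functionals, each lying further than \varepsilon/2C from the span of those already chosen. The sketch below is only an illustration under strong simplifying assumptions: the functionals are represented as Euclidean vectors, the Euclidean norm and a least-squares projection stand in for the dual norm and the distance to \operatorname{Span}(\Sigma_{j}), and the greedy chain only certifies a lower bound for N.

```python
import numpy as np

def greedy_chain_length(Sigma, eps, C):
    """Lower bound for N of (6.17): grow sigma_1, sigma_2, ... greedily, adding at each
    stage a functional lying further than eps/(2C) from the span of those already chosen.
    Sigma is a (Lambda, d) array whose rows represent the linear functionals."""
    chain = [int(np.argmax(np.linalg.norm(Sigma, axis=1)))]   # sigma_1 is unconstrained
    remaining = [i for i in range(Sigma.shape[0]) if i != chain[0]]
    while remaining:
        A = Sigma[chain].T                                    # columns span Span(Sigma_j)
        def dist_to_span(i):
            coeff, *_ = np.linalg.lstsq(A, Sigma[i], rcond=None)
            return np.linalg.norm(Sigma[i] - A @ coeff)
        j = max(remaining, key=dist_to_span)
        if dist_to_span(j) <= eps / (2 * C):                  # every remaining functional captured
            break
        chain.append(j)
        remaining.remove(j)
    return len(chain)
```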

Now suppose that ε0>0\varepsilon_{0}>0. In this case we have that (cf. (6.15))

Reach||||l1(j)(Σj,2C,ε,ε0)=0tεε0Span¯||||l1(j)(Σj,t)εtε02C\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Sigma_{j},2C,\varepsilon,\varepsilon_{0}\right)=\bigcup_{0\leq t\leq\frac{\varepsilon}{\varepsilon_{0}}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Sigma_{j},t\right)_{\frac{\varepsilon-t\varepsilon_{0}}{2C}} (6.18)

which no longer contains the full linear subspace Span(Σj)\operatorname{Span}(\Sigma_{j}). However, by appealing to Lemma 6.1, we conclude that (cf. (6.5))

conv(Vj)εε02CReach||||l1(j)(Σj,2C,ε,ε0)forVj:={σj,,σ1,0,σ1,,σj}X{\mathrm{conv}}(V_{j})_{\frac{\varepsilon-\varepsilon_{0}}{2C}}\subset\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Sigma_{j},2C,\varepsilon,\varepsilon_{0}\right)\quad\text{for}\quad V_{j}:=\left\{-\sigma_{j},\ldots,-\sigma_{1},0,\sigma_{1},\ldots,\sigma_{j}\right\}\subset X^{\ast} (6.19)

and we use the notation conv(Vj){\mathrm{conv}}(V_{j}) to denote the closed convex hull of VjV_{j} in XX^{\ast}. Consequently, if we define

Nconv:=max{d:There exists σ1,,σdΣ such that j{1,,d1} we haveσj+1conv(Vj)εε02C for Vj:={σj,,σ1,0,σ1,,σj}X},N_{{\mathrm{conv}}}:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Sigma\text{ such that }\forall j\in\{1,\ldots,d-1\}\text{ we have}\\ \sigma_{j+1}\notin{\mathrm{conv}}(V_{j})_{\frac{\varepsilon-\varepsilon_{0}}{2C}}\text{ for }V_{j}:=\left\{-\sigma_{j},\ldots,-\sigma_{1},0,\sigma_{1},\ldots,\sigma_{j}\right\}\subset X^{\ast}\end{array}\right\}, (6.20)

then the integer NN defined in (6.16) is bounded above by NconvN_{{\mathrm{conv}}}, i.e. NNconvN\leq N_{{\mathrm{conv}}}.

We can control NconvN_{{\mathrm{conv}}} defined in (6.20) by particular packing and covering numbers of Σ\Sigma. To elaborate, if ZZ is a Banach space and 𝒰Z{\cal U}\subset Z, then for any r>0r>0 the rr-packing number of 𝒰{\cal U}, denoted by Npack(𝒰,Z,r)N_{\operatorname{pack}}({\cal U},Z,r), and the rr-covering number of 𝒰{\cal U}, denoted by Ncov(𝒰,Z,r)N_{\operatorname{cov}}({\cal U},Z,r), are given by

Npack(𝒰,Z,r):=max{d:z1,,zd𝒰 such that zazbZ>r whenever ab}and\displaystyle N_{\operatorname{pack}}({\cal U},Z,r):=\max\left\{d\in{\mathbb{Z}}:\exists z_{1},\ldots,z_{d}\in{\cal U}\text{ such that }||z_{a}-z_{b}||_{Z}>r\text{ whenever }a\neq b\right\}\quad\text{and} (6.21)
Ncov(𝒰,Z,r):=min{d:z1,,zd𝒰 such that we have the inclusion 𝒰j=1d𝔹¯Z(zj,r)}\displaystyle N_{\operatorname{cov}}({\cal U},Z,r):=\min\left\{d\in{\mathbb{Z}}:\exists z_{1},\ldots,z_{d}\in{\cal U}\text{ such that we have the inclusion }{\cal U}\subset\bigcup_{j=1}^{d}{\overline{\mathbb{B}}}_{Z}(z_{j},r)\right\}

respectively. Our convention throughout this subsection is that balls denoted by 𝔹{\mathbb{B}} are taken to be open, whilst those denoted by 𝔹¯{\overline{\mathbb{B}}} are taken to be closed.

It is now an immediate consequence of (6.20) and (6.21) that NconvN_{{\mathrm{conv}}} is no greater than the (εε0)/2C(\varepsilon-\varepsilon_{0})/2C-packing number of Σ\Sigma, i.e. NconvNpack(Σ,X,(εε0)/2C)N_{{\mathrm{conv}}}\leq N_{\operatorname{pack}}\left(\Sigma,X^{\ast},(\varepsilon-\varepsilon_{0})/2C\right). Moreover, using the well-known fact that the quantities defined in (6.21) satisfy, for any r>0r>0, that Npack(𝒰,Z,2r)Ncov(𝒰,Z,r)Npack(𝒰,Z,r)N_{\operatorname{pack}}({\cal U},Z,2r)\leq N_{\operatorname{cov}}({\cal U},Z,r)\leq N_{\operatorname{pack}}({\cal U},Z,r), we may additionally conclude that NconvNpack(Σ,X,(εε0)/2C)Ncov(Σ,X,(εε0)/4C)N_{{\mathrm{conv}}}\leq N_{\operatorname{pack}}\left(\Sigma,X^{\ast},(\varepsilon-\varepsilon_{0})/2C\right)\leq N_{\operatorname{cov}}\left(\Sigma,X^{\ast},(\varepsilon-\varepsilon_{0})/4C\right). Consequently, both the (εε0)/2C(\varepsilon-\varepsilon_{0})/2C-packing number and the (εε0)/4C(\varepsilon-\varepsilon_{0})/4C-covering number of Σ\Sigma provide upper bounds on the number of steps that the Banach GRIM algorithm can complete before terminating.

In summary, we have that the maximum number of steps NN that the Banach GRIM algorithm can complete before terminating (cf. (6.16)) satisfies that

NNconvNpack(Σ,X,εε02C)Ncov(Σ,X,εε04C).N\leq N_{{\mathrm{conv}}}\leq N_{\operatorname{pack}}\left(\Sigma,X^{\ast},\frac{\varepsilon-\varepsilon_{0}}{2C}\right)\leq N_{\operatorname{cov}}\left(\Sigma,X^{\ast},\frac{\varepsilon-\varepsilon_{0}}{4C}\right). (6.22)

Thus the geometric quantities related to Σ\Sigma appearing in (6.22) provide upper bounds on the number of elements from {\cal F} appearing in the approximation found by the Banach GRIM algorithm. Explicit estimates for these geometric quantities allow us to deduce worst-case bounds for the performance of the Banach GRIM algorithm; we will momentarily establish such bounds for the task of kernel quadrature via estimates for the covering number of the unit ball of a Reproducing Kernel Hilbert Space (RKHS) covered in [JJWWY20].
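In practice the packing number appearing in (6.22) is easily estimated: a greedily grown maximal r-separated subset of \Sigma has cardinality at least N_{\operatorname{cov}}(\Sigma,X^{\ast},r) and at most N_{\operatorname{pack}}(\Sigma,X^{\ast},r). The sketch below does this for functionals represented as vectors, with the Euclidean norm again standing in for the dual norm; it is an illustration only.

```python
import numpy as np

def maximal_separated_subset(Sigma, r):
    """Greedy maximal r-separated subset of the rows of Sigma.
    Its size lies between the r-covering and the r-packing numbers of (6.21)."""
    centres = []
    for sigma in Sigma:
        if all(np.linalg.norm(sigma - c) > r for c in centres):
            centres.append(sigma)
    return len(centres)
```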

Before considering the particular setting of kernel quadrature, we observe how the direction of the bounds in Theorem 6.2 can be, in some sense, reversed. To make this precise, observe that Theorem 6.2 fixes an ε>0\varepsilon>0 as the accuracy we desire for the approximation, and then provides an upper bound on the number of features from the collection {\cal F} that are used by the Banach GRIM algorithm to construct an approximation of φ\varphi that is within ε\varepsilon of φ\varphi throughout Σ\Sigma. However it would be useful to know, for a given fixed n0{2,,𝒩}n_{0}\in\{2,\ldots,{\cal N}\}, how well the Banach GRIM algorithm can approximate φ\varphi throughout Σ\Sigma using no greater than n0n_{0} of the features from {\cal F}. We now illustrate how Theorem 6.2 provides such information.

Consider a fixed n0{2,,𝒩}n_{0}\in\{2,\ldots,{\cal N}\} and let β0=β0(n0,C,Σ,ε0)>0\beta_{0}=\beta_{0}(n_{0},C,\Sigma,\varepsilon_{0})>0 be defined by

β0:=min{λ>ε0:Nλn01}\beta_{0}:=\min\left\{\lambda>\varepsilon_{0}~{}:~{}N_{\lambda}\leq n_{0}-1\right\} (6.23)

where NλN_{\lambda} denotes the integer defined in (6.16) for the constant ε>ε0\varepsilon>\varepsilon_{0} there as λ\lambda here. Then, by applying the Banach GRIM algorithm under the same assumptions as in Theorem 6.2 and the additional choice that ε:=β0\varepsilon:=\beta_{0}, we may conclude from Theorem 6.2 that the Banach GRIM algorithm terminates after no more than n01n_{0}-1 steps. Consequently, the algorithm returns an approximation uu of φ\varphi that is a linear combination of at most n0n_{0} of the features in {\cal F}, and that is within β0\beta_{0} of φ\varphi on Σ\Sigma in the sense that for every σΣ\sigma\in\Sigma we have |σ(φu)|β0|\sigma(\varphi-u)|\leq\beta_{0}. Hence the relation given in (6.23) provides a guaranteed accuracy β0=β0(n0,C,Σ,ε0)>0\beta_{0}=\beta_{0}(n_{0},C,\Sigma,\varepsilon_{0})>0 for how well the Banach GRIM algorithm can approximate φ\varphi with the additional constraint that the approximation is a linear combination of no greater than n0n_{0} of the features in {\cal F}. This guarantee ensures both that there is a linear combination of at most n0n_{0} of the features in {\cal F} that is within β0\beta_{0} of φ\varphi throughout Σ\Sigma and that the Banach GRIM algorithm will find such a linear combination.
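Since N_{\lambda} is non-increasing in \lambda, the accuracy \beta_{0} of (6.23) can be approximated by bisection whenever one has (even approximate) access to the map \lambda\mapsto N_{\lambda}. The callable N_of in the sketch below is such an assumed oracle and is not something provided here; the sketch only records the monotone search.

```python
def guaranteed_accuracy(N_of, n0, eps0, lam_hi, tol=1e-6):
    """Approximate beta_0 of (6.23) by bisection.

    N_of   : assumed callable with N_of(lam) returning N_lambda of (6.16);
             it is non-increasing in lam.
    lam_hi : an accuracy known to satisfy N_of(lam_hi) <= n0 - 1.
    """
    lo, hi = eps0, lam_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if N_of(mid) <= n0 - 1:
            hi = mid                  # mid already achieves the target sparsity
        else:
            lo = mid
    return hi
```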

For the remainder of this subsection we turn our attention to the particular setting of kernel quadrature. Assume that {\cal X}=\{x_{1},\ldots,x_{{\cal N}}\} is a finite subset of some finite dimensional Euclidean space {\mathbb{R}}^{d}, and that k:{\cal X}\times{\cal X}\to{\mathbb{R}} is a continuous symmetric positive semi-definite kernel that is bounded above by 1. For each i\in\{1,\ldots,{\cal N}\} define a continuous function k_{x_{i}}:{\cal X}\to{\mathbb{R}} by setting, for z\in{\cal X}, k_{x_{i}}(z):=k(z,x_{i}). Define K:=\{k_{x_{i}}:i\in\{1,\ldots,{\cal N}\}\}\subset C^{0}({\cal X}) and let {\cal H}_{k} denote the RKHS associated to k. In this case it is known that {\cal H}_{k}=\operatorname{Span}(K) and hence K\subset{\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1).

Suppose a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 so that φ:=i=1𝒩aiδxi[𝒳]\varphi:=\sum_{i=1}^{{\cal N}}a_{i}\delta_{x_{i}}\in{\mathbb{P}}[{\cal X}]. Under the choice that X:=[𝒳]X:={\cal M}[{\cal X}], we observe that the constant CC corresponding to the definition in (6.8) satisfies that C=1C=1. Recall that C0(𝒳)[𝒳]C^{0}({\cal X})\subset{\cal M}[{\cal X}]^{\ast} via the identification of an element ψC0(𝒳)\psi\in C^{0}({\cal X}) with the linear functional [𝒳]{\cal M}[{\cal X}]\to{\mathbb{R}} given by νν[ψ]:=𝒳ψ(z)𝑑ν(z)\nu\mapsto\nu[\psi]:=\int_{{\cal X}}\psi(z)d\nu(z). We abuse notation slightly by referring to both the continuous function in C0(𝒳)C^{0}({\cal X}) and the associated linear functional [𝒳]{\cal M}[{\cal X}]\to{\mathbb{R}} as ψ\psi. By defining Σ:=K\Sigma:=K, the kernel quadrature problem in this setting is to find an empirical measure u[𝒳]u\in{\mathbb{P}}[{\cal X}] whose support is a strict subset of the support of φ\varphi and such that |f(φu)|ε\left|f(\varphi-u)\right|\leq\varepsilon for every fΣf\in\Sigma.
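As a concrete toy instance of this setup, the kernel quadrature data can be assembled directly from the kernel matrix: since the functional \sigma=k_{x_{j}} acts on the feature f_{i}=\delta_{x_{i}} via \sigma(f_{i})=k(x_{i},x_{j}), the (symmetric) kernel matrix itself plays the role of the array of values \sigma(f_{i}) used in the earlier sketches. All sizes below are hypothetical and the snippet is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
Npts = 500
X = rng.uniform(-1.0, 1.0, size=(Npts, 1))        # the finite set {x_1, ..., x_N} in R
K = np.exp(-(X - X.T) ** 2)                        # K[i, j] = k(x_i, x_j), RBF kernel
a = np.full(Npts, 1.0 / Npts)                      # phi = (1/N) sum_i delta_{x_i}, a probability measure

# Row j of K lists the values sigma = k_{x_j} takes on the point masses delta_{x_i},
# so K plays the role of sigma_F in the earlier sketches and sigma(phi) = K @ a.
sigma_F = K
sigma_phi = K @ a
```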

Recalling (6.22), the performance of the Banach GRIM algorithm for this task is controlled by the pointwise (εε0)/4(\varepsilon-\varepsilon_{0})/4-covering number of Σ\Sigma, i.e. by Ncov(Σ,C0(𝒳),(εε0)/4)N_{\operatorname{cov}}(\Sigma,C^{0}({\cal X}),(\varepsilon-\varepsilon_{0})/4). That is, there is an integer MNcov(Σ,C0(𝒳),(εε0)/4)M\leq N_{\operatorname{cov}}(\Sigma,C^{0}({\cal X}),(\varepsilon-\varepsilon_{0})/4) such that the Banach GRIM algorithm finds weights c1,,cM+10c_{1},\ldots,c_{M+1}\geq 0 and indices e(1),,e(M+1){1,,𝒩}e(1),\ldots,e(M+1)\in\{1,\ldots,{\cal N}\} such that u:=s=1M+1csδxe(s)u:=\sum_{s=1}^{M+1}c_{s}\delta_{x_{e(s)}} is a probability measure in [𝒳]{\mathbb{P}}[{\cal X}] satisfying, for every fΣf\in\Sigma, that |φ(f)u(f)|ε|\varphi(f)-u(f)|\leq\varepsilon. A consequence of Σ𝔹¯k(0,1)\Sigma\subset{\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1) is that the (εε0)/4(\varepsilon-\varepsilon_{0})/4-covering number of Σ\Sigma is itself controlled by the (εε0)/4(\varepsilon-\varepsilon_{0})/4-covering number of 𝔹¯k(0,1){\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1), denoted by Ncov(𝔹¯k(0,1),C0(𝒳),(εε0)/4)N_{\operatorname{cov}}\left({\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1),C^{0}({\cal X}),(\varepsilon-\varepsilon_{0})/4\right).

Many authors have considered estimating the covering number of the unit ball of a RKHS, see the works [Zho02, CSW11, Kuh11, SS13, HLLL18, Suz18, JJWWY20, FS21] for example. In our particular setting, the covering number estimates in [JJWWY20] yield explicit performance guarantees for the Banach GRIM as we now illustrate.

In our setting the kernel has a pointwise convergent Mercer decomposition k(x,y)=\sum_{m=1}^{\infty}\lambda_{m}e_{m}(x)e_{m}(y) with \lambda_{1}\geq\lambda_{2}\geq\ldots\geq 0 and \{e_{m}\}_{m=1}^{\infty}\subset L^{2}({\cal X}) being orthonormal [CS08, SS12]. The pairs \{(\lambda_{m},e_{m})\}_{m=1}^{\infty} are the eigenpairs for the operator T_{k}:L^{2}({\cal X})\to L^{2}({\cal X}) defined for f\in L^{2}({\cal X}) by T_{k}[f]:=\int_{{\cal X}}k(\cdot,y)f(y)dy. We assume that the eigenfunctions \{e_{m}\}_{m=1}^{\infty} are uniformly bounded in the sense that for every m\in{\mathbb{Z}}_{\geq 1} and every x\in{\cal X} we have |e_{m}(x)|\leq C_{0} for some constant C_{0}>0. Finally, we assume that the eigenvalues \{\lambda_{m}\}_{m=1}^{\infty} decay exponentially in the sense that for every m\in{\mathbb{Z}}_{\geq 1} we have \lambda_{m}\leq C_{1}e^{-C_{2}m} for constants C_{1},C_{2}>0. These assumptions are satisfied, for example, by the squared exponential (Radial Basis Function) kernel k(s,t):=e^{-(s-t)^{2}} [JJWWY20]; more explicit estimates for this particular choice of kernel may be found in [FS21].

Given any r(0,1)r\in(0,1), it is established in [JJWWY20] (cf. Lemma D.2 of [JJWWY20]) that under these assumptions we have

logNcov(𝔹¯k(0,1),C0(𝒳),r)C3(log(1r)+C4)2\log N_{\operatorname{cov}}({\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1),C^{0}({\cal X}),r)\leq C_{3}\left(\log\left(\frac{1}{r}\right)+C_{4}\right)^{2} (6.24)

for constants C3=C3(C0,C1,C2)>0C_{3}=C_{3}(C_{0},C_{1},C_{2})>0 and C4=C4(C0,C1,C2)>0C_{4}=C_{4}(C_{0},C_{1},C_{2})>0. Assuming that ε<4\varepsilon<4, by appealing to (6.24) for the choice r:=(εε0)/4r:=(\varepsilon-\varepsilon_{0})/4 we may conclude that if C3(log(4εε0)+C4)2<log(𝒩1)C_{3}\left(\log\left(\frac{4}{\varepsilon-\varepsilon_{0}}\right)+C_{4}\right)^{2}<\log({\cal N}-1) then the Banach GRIM algorithm will return a probability measure u[𝒳]u\in{\mathbb{P}}[{\cal X}] given by a linear combination of fewer than 𝒩{\cal N} of the point masses δx1,,δx𝒩\delta_{x_{1}},\ldots,\delta_{x_{{\cal N}}} satisfying, for every fΣf\in\Sigma, that |φ(f)u(f)|ε|\varphi(f)-u(f)|\leq\varepsilon.

Alternatively, given n0{2,,𝒩}n_{0}\in\{2,\ldots,{\cal N}\} define β0=β0(C0,C1,C2,n0,ε0)>0\beta_{0}=\beta_{0}(C_{0},C_{1},C_{2},n_{0},\varepsilon_{0})>0 by

β0:=4eC4e(log(n01)C3)12+ε0=4eC4(n01)(C3log(n01))12+ε0>ε0.\beta_{0}:=4e^{C_{4}}e^{-\left(\frac{\log(n_{0}-1)}{C_{3}}\right)^{\frac{1}{2}}}+\varepsilon_{0}=4e^{C_{4}}\left(n_{0}-1\right)^{-\left(C_{3}\log(n_{0}-1)\right)^{-\frac{1}{2}}}+\varepsilon_{0}>\varepsilon_{0}. (6.25)

Provided n0>1+eC3C42n_{0}>1+e^{C_{3}C_{4}^{2}} we have that β0ε04(0,1)\frac{\beta_{0}-\varepsilon_{0}}{4}\in(0,1). In this case we observe that (6.24) and (6.25) yield that

logNcov(𝔹¯k(0,1),C0(𝒳),β0ε04)log(n01).\log N_{\operatorname{cov}}\left({\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1),C^{0}({\cal X}),\frac{\beta_{0}-\varepsilon_{0}}{4}\right)\leq\log(n_{0}-1). (6.26)

We deduce from Theorem 6.2, with ε:=β0\varepsilon:=\beta_{0}, that the algorithm finds a probability measure u[𝒳]u\in{\mathbb{P}}[{\cal X}] given by a linear combination of no more than n0n_{0} of the point masses δx1,,δx𝒩\delta_{x_{1}},\ldots,\delta_{x_{{\cal N}}} satisfying, for every fΣf\in\Sigma, that |φ(f)u(f)|β0|\varphi(f)-u(f)|\leq\beta_{0}.

As n_{0} increases, \beta_{0} defined in (6.25) eventually decays slower than n_{0}^{-a} for any a>0. This poor asymptotic behaviour is not unexpected for an estimate that is itself a combination of worst-case scenario estimates. However, we may still observe, for any integer A\in{\mathbb{Z}}_{\geq 1} large enough to ensure that 1+e^{C_{3}C_{4}^{2}}<n_{0}\leq 1+e^{A^{2}/C_{3}} (which in particular requires A>C_{3}C_{4}), that \beta_{0}\leq 4e^{C_{4}}(n_{0}-1)^{-1/A}+\varepsilon_{0}.
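For a rough sense of scale, (6.25) is straightforward to evaluate once values for C_{3} and C_{4} are fixed; the constants used in the snippet below are purely hypothetical placeholders rather than values taken from [JJWWY20].

```python
import math

def beta0(n0, eps0, C3, C4):
    """Evaluate the guaranteed accuracy (6.25) for at most n0 point masses."""
    assert n0 > 1 + math.exp(C3 * C4 ** 2), "needed so that (beta0 - eps0)/4 lies in (0, 1)"
    return 4.0 * math.exp(C4) * math.exp(-math.sqrt(math.log(n0 - 1) / C3)) + eps0

print(beta0(n0=1000, eps0=0.0, C3=0.5, C4=1.0))   # hypothetical constants C3, C4
```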


6.3 Supplementary Lemmata

In this subsection we record several lemmata that will be used during our proof of the Banach GRIM Convergence Theorem 6.2 in Subsection 6.4. The following result records the consequences arising from knowing that an approximation uSpan()u\in\operatorname{Span}({\cal F}) is close to φ\varphi at a finite set of linear functionals in XX^{\ast}.

Lemma 6.4 (Finite Influenced Set).

Assume that XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1}, and ={f1,,f𝒩}X{0}{\cal F}=\{f_{1},\ldots,f_{\cal N}\}\subset X\setminus\{0\}. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.27)

Suppose that d1d\in{\mathbb{Z}}_{\geq 1} with L={σ1,,σd}XL=\left\{\sigma_{1},\ldots,\sigma_{d}\right\}\subset X^{\ast}, and that uSpan()u\in\operatorname{Span}({\cal F}) satisfies both that uXC||u||_{X}\leq C and, for every σL\sigma\in L, that |σ(φu)|θ0|\sigma(\varphi-u)|\leq\theta_{0}. Then, using our notation conventions from Section 6.1 (see also Remark 6.3), for every σReach||||l1(d)(L,2C,θ,θ0)\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,2C,\theta,\theta_{0}\right) we have that

|σ(φu)|θ.|\sigma(\varphi-u)|\leq\theta. (6.28)
Proof of Lemma 6.4.

Assume that XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1}, and ={f1,,f𝒩}X{0}{\cal F}=\{f_{1},\ldots,f_{\cal N}\}\subset X\setminus\{0\}. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by (I) and (II) in (6.27) respectively. Suppose that d1d\in{\mathbb{Z}}_{\geq 1} and that L={σ1,,σd}XL=\left\{\sigma_{1},\ldots,\sigma_{d}\right\}\subset X^{\ast}. Let uSpan()u\in\operatorname{Span}({\cal F}) and assume both that uXC||u||_{X}\leq C and that, for every σL\sigma\in L, we have |σ(φu)|θ0|\sigma(\varphi-u)|\leq\theta_{0}. It follows from (I) and (II) of (6.27) that φXC||\varphi||_{X}\leq C. Hence φuX2C||\varphi-u||_{X}\leq 2C.

We deal with the cases θ0=0\theta_{0}=0 and θ0(0,θ)\theta_{0}\in(0,\theta) separately.

First suppose that θ0=0\theta_{0}=0. In this case we have that σ1(φu)==σd(φu)=0\sigma_{1}(\varphi-u)=\ldots=\sigma_{d}(\varphi-u)=0, and that

Reach||||l1(d)(L,2C,θ,0)=def0tSpan¯||||l1(d)(L,t)θ2C=Span(L)θ2C.\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,2C,\theta,0\right)\stackrel{{\scriptstyle\text{def}}}{{=}}\bigcup_{0\leq t\leq\infty}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right)_{\frac{\theta}{2C}}=\operatorname{Span}(L)_{\frac{\theta}{2C}}. (6.29)

As a consequence of (6.29), we need only establish the estimate in (6.28) for σSpan(L)θ/2C\sigma\in\operatorname{Span}(L)_{\theta/2C}. Assuming σSpan(L)θ/2C\sigma\in\operatorname{Span}(L)_{\theta/2C}, then there is some ρSpan(L)\rho\in\operatorname{Span}(L) with σρXθ/2C||\sigma-\rho||_{X^{\ast}}\leq\theta/2C. Since ρSpan(L)\rho\in\operatorname{Span}(L) there are coefficients c1,,cdc_{1},\ldots,c_{d}\in{\mathbb{R}} for which ρ=j=1dcjσj\rho=\sum_{j=1}^{d}c_{j}\sigma_{j}. Evidently we have that ρ(φu)=j=1dcjσj(φu)=0\rho(\varphi-u)=\sum_{j=1}^{d}c_{j}\sigma_{j}(\varphi-u)=0. Consequently we may estimate that

|σ(φu)||(σρ)(φu)|+|ρ(φu)|σρXφuX2C(θ2C)=θ.|\sigma(\varphi-u)|\leq|(\sigma-\rho)(\varphi-u)|+|\rho(\varphi-u)|\leq||\sigma-\rho||_{X^{\ast}}||\varphi-u||_{X}\leq 2C\left(\frac{\theta}{2C}\right)=\theta. (6.30)

The arbitrariness of σ\sigma means we may conclude from (6.30) that (6.28) is valid as claimed when θ0=0\theta_{0}=0.

Now we suppose that θ0(0,θ)\theta_{0}\in(0,\theta).

Since Span¯||||l1(d)(L,0)={0}X\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}(L,0)=\{0\}\subset X^{\ast}, if σSpan¯||||l1(d)(L,0)θ/2C\sigma\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}(L,0)_{\theta/2C} then σXθ/2C||\sigma||_{X^{\ast}}\leq\theta/2C. Consequently |σ(φu)|σXφuXθ|\sigma(\varphi-u)|\leq||\sigma||_{X^{\ast}}||\varphi-u||_{X}\leq\theta as claimed in (6.28).

Now consider t(0,θ/θ0]t\in(0,\theta/\theta_{0}] and let σSpan¯||||l1(d)(L,t)θtθ02C\sigma\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right)_{\frac{\theta-t\theta_{0}}{2C}}. Then there is some ρSpan¯||||l1(d)(L,t)\rho\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right) with σρXθtθ02C||\sigma-\rho||_{X^{\ast}}\leq\frac{\theta-t\theta_{0}}{2C}. Further, there are coefficients c1,,cdc_{1},\ldots,c_{d}\in{\mathbb{R}} with |c1|++|cd|t|c_{1}|+\ldots+|c_{d}|\leq t and such that ρ=j=1dcjσj\rho=\sum_{j=1}^{d}c_{j}\sigma_{j}. Consequently, we may estimate that

|σ(φu)||(σρ)(φu)|+|ρ(φu)|σρXφuX+j=1d|cj||σj(φu)|2C(θtθ02C)+θ0j=1d|cj|θtθ0+tθ0=θ.|\sigma(\varphi-u)|\leq|(\sigma-\rho)(\varphi-u)|+|\rho(\varphi-u)|\leq||\sigma-\rho||_{X^{\ast}}||\varphi-u||_{X}+\sum_{j=1}^{d}|c_{j}||\sigma_{j}(\varphi-u)|\\ \leq 2C\left(\frac{\theta-t\theta_{0}}{2C}\right)+\theta_{0}\sum_{j=1}^{d}|c_{j}|\leq\theta-t\theta_{0}+t\theta_{0}=\theta. (6.31)

Since t(0,θ/θ0]t\in(0,\theta/\theta_{0}] and σSpan¯||||l1(d)(L,t)θtθ02C\sigma\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right)_{\frac{\theta-t\theta_{0}}{2C}} were both arbitrary, we may conclude that (6.31) is valid whenever σ0<tθ/θ0Span¯||||l1(d)(L,t)θtθ02C\sigma\in\bigcup_{0<t\leq\theta/\theta_{0}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right)_{\frac{\theta-t\theta_{0}}{2C}}.

The combination of the preceding two paragraphs allows us to conclude, for \theta_{0}\in(0,\theta), that whenever \sigma\in\bigcup_{0\leq t\leq\theta/\theta_{0}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,t\right)_{\frac{\theta-t\theta_{0}}{2C}}=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{d})}}\left(L,2C,\theta,\theta_{0}\right) we have |\sigma(\varphi-u)|\leq\theta. Consequently, (6.28) is valid in the case that \theta_{0}\in(0,\theta). Having earlier established the validity of (6.28) when \theta_{0}=0, this completes the proof of Lemma 6.4. ∎

We can use the result of Lemma 6.4 to prove that the linear functionals selected during the Banach GRIM algorithm, under the choice M:=min{𝒩1,Λ}M:=\min\left\{{\cal N}-1,\Lambda\right\} and that for every t{1,,M}t\in\{1,\ldots,M\} we have kt:=1k_{t}:=1 and st:=1s_{t}:=1, are separated by a definite XX^{\ast}-distance. The precise result is the following lemma.

Lemma 6.5 (Banach GRIM Functional Separation).

Assume XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, and 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} such that M:=min{𝒩1,Λ}2M:=\min\left\{{\cal N}-1,\Lambda\right\}\geq 2. Suppose that :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\}, and that ΓX\Gamma\subset X^{\ast} is finite with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.32)

Consider applying the Banach GRIM algorithm to approximate \varphi on \Gamma with \theta as the accuracy threshold, \theta_{0} as the acceptable recombination error bound, M as the maximum number of steps, s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 1 (cf. Banach GRIM (A)). Suppose that m\in{\mathbb{Z}}_{\geq 2} and the algorithm reaches and carries out the m^{th} step without terminating. For each l\in\{1,\ldots,m\} let \sigma_{l}\in\Gamma be the linear functional selected at the l^{\text{th}} step, and let u_{l}\in\operatorname{Span}({\cal F}) be the element found via recombination (cf. the recombination thinning Lemma 3.1) such that, for every s\in\{1,\ldots,l\}, we have |\sigma_{s}(\varphi-u_{l})|\leq\theta_{0} (cf. Banach GRIM (C) and (D)). Further, for each l\in\{1,\ldots,m\}, let \Gamma_{l}:=\{\sigma_{1},\ldots,\sigma_{l}\}\subset\Gamma and \Omega_{l}:=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(\Gamma_{l},2C,\theta,\theta_{0}\right) where we use our notation conventions from Section 6.1 (see also Remark 6.3). Then for any l\in\{1,\ldots,m\} and every \sigma\in X^{\ast} we have that

|σ(φul)|min{2CσX,2CdistX(σ,Ωl)+θ}.|\sigma(\varphi-u_{l})|\leq\min\left\{2C||\sigma||_{X^{\ast}}~{},~{}2C\operatorname{dist}_{X^{\ast}}(\sigma,\Omega_{l})+\theta\right\}. (6.33)

In particular, for every σΩl\sigma\in\Omega_{l} we have

|σ(φul)|θ.\left|\sigma(\varphi-u_{l})\right|\leq\theta. (6.34)

Finally, as a consequence we have, for every l{2,,m}l\in\{2,\ldots,m\}, that

σlΩl1:=Reach||||l1(l1)(Γl1,2C,θ,θ0).\sigma_{l}\notin\Omega_{l-1}:=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},2C,\theta,\theta_{0}\right). (6.35)
Remark 6.6.

Using the same notation as in Lemma 6.5, we claim that (6.35) ensures, for every j{1,,l1}j\in\{1,\ldots,l-1\}, that

σlσjXθθ02C.||\sigma_{l}-\sigma_{j}||_{X^{\ast}}\geq\frac{\theta-\theta_{0}}{2C}. (6.36)

To see this, recall that (cf. Subsection 6.1 or Remark 6.3)

Reach||||l1(l1)(Γl1,2C,θ,θ0)=def0τθθ0Span¯||||l1(l1)(Γl1,τ)θτθ02C.\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},2C,\theta,\theta_{0}\right)\stackrel{{\scriptstyle\text{def}}}{{=}}\bigcup_{0\leq\tau\leq\frac{\theta}{\theta_{0}}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},\tau\right)_{\frac{\theta-\tau\theta_{0}}{2C}}. (6.37)

Since θ>θ0\theta>\theta_{0} we may take τ:=1\tau:=1 in (6.37) to conclude via (6.35) that

σlSpan¯||||l1(l1)(Γl1,1)θθ02C.\sigma_{l}\notin\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},1\right)_{\frac{\theta-\theta_{0}}{2C}}. (6.38)

Recall that (cf. Subsection 6.1 or Remark 6.3)

Span¯||||l1(l1)(Γl1,1)=def{s=1l1csσs:c=(c1,,cl1)l1 with cl1(l1)1}.\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},1\right)\stackrel{{\scriptstyle\text{def}}}{{=}}\left\{\sum_{s=1}^{l-1}c_{s}\sigma_{s}~{}:~{}c=(c_{1},\ldots,c_{l-1})\in{\mathbb{R}}^{l-1}\text{ with }||c||_{l^{1}({\mathbb{R}}^{l-1})}\leq 1\right\}. (6.39)

A consequence of (6.39) is that for every j{1,,l1}j\in\{1,\ldots,l-1\} we have σjSpan¯||||l1(l1)(Γl1,1)\sigma_{j}\in\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},1\right), hence (6.38) means σlσjX>θθ02C||\sigma_{l}-\sigma_{j}||_{X^{\ast}}>\frac{\theta-\theta_{0}}{2C}, which in particular gives (6.36).

Proof of Lemma 6.5.

Assume XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, and 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} such that M:=min{𝒩1,Λ}2M:=\min\left\{{\cal N}-1,\Lambda\right\}\geq 2. Suppose that :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\}, and that ΓX\Gamma\subset X^{\ast} is finite with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()X\varphi\in\operatorname{Span}({\cal F})\subset X and C>0C>0 by (cf. (I) and (II) of (6.32) respectively)

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.40)

Consider applying the Banach GRIM algorithm to approximate φ\varphi on Γ\Gamma with θ\theta as the accuracy threshold, θ0\theta_{0} as the acceptable recombination error bound, MM as the maximum number of steps, s1==sM=1s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k1,,kMk_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 11 (cf. Banach GRIM (A)). We now follow the second step of the Banach GRIM algorithm, i.e. Banach GRIM (B). For each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} let a~i:=|ai|\tilde{a}_{i}:=|a_{i}| and f~i\tilde{f}_{i} be given by fif_{i} if ai>0a_{i}>0 and fi-f_{i} if ai<0a_{i}<0. Evidently, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} we have f~iX=fiX\left|\left|\tilde{f}_{i}\right|\right|_{X}=||f_{i}||_{X}. Moreover, we also have that a~1,,a~𝒩>0\tilde{a}_{1},\ldots,\tilde{a}_{{\cal N}}>0 and φ=i=1𝒩a~if~i\varphi=\sum_{i=1}^{{\cal N}}\tilde{a}_{i}\tilde{f}_{i}. We additionally rescale f~i\tilde{f}_{i} for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} to have unit XX norm. That is (cf. Banach GRIM (B)), for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} set hi:=f~ifiXh_{i}:=\frac{\tilde{f}_{i}}{||f_{i}||_{X}} and αi:=a~ifiX\alpha_{i}:=\tilde{a}_{i}||f_{i}||_{X}. Then observe both that CC satisfies

C:=i=1𝒩|ai|fiX=i=1𝒩a~ifiX=i=1𝒩αi,C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}=\sum_{i=1}^{{\cal N}}\tilde{a}_{i}||f_{i}||_{X}=\sum_{i=1}^{{\cal N}}\alpha_{i}, (6.41)

and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, that αihi=a~if~i=aifi\alpha_{i}h_{i}=\tilde{a}_{i}\tilde{f}_{i}=a_{i}f_{i}. Therefore the expansion for φ\varphi in (I) of (6.40) is equivalent to

φ=i=1𝒩αihi, and hence φXi=1𝒩αihiX=i=1𝒩αi=(6.41)C.\varphi=\sum_{i=1}^{{\cal N}}\alpha_{i}h_{i},\qquad\text{ and hence }\qquad||\varphi||_{X}\leq\sum_{i=1}^{{\cal N}}\alpha_{i}||h_{i}||_{X}=\sum_{i=1}^{{\cal N}}\alpha_{i}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_separation_varphi_const}})}}}{{=}}C. (6.42)

Turning our attention to steps Banach GRIM (C) and (D) of the Banach GRIM algorithm, suppose that m2m\in{\mathbb{Z}}_{\geq 2} and that the mthm^{\text{th}} step of the Banach GRIM algorithm is completed without triggering the early termination criterion. For each l{1,,m}l\in\{1,\ldots,m\} let σlΓ\sigma_{l}\in\Gamma be the linear functional selected at the lthl^{\text{th}} step, and let ulSpan()u_{l}\in\operatorname{Span}({\cal F}) be the element found via recombination (cf. the recombination thinning Lemma 3.1) such that, for every s{1,,l}s\in\{1,\ldots,l\}, we have |σs(φul)|θ0|\sigma_{s}(\varphi-u_{l})|\leq\theta_{0} (cf. Banach GRIM (C) and (D)). Define Γl:={σ1,,σl}Γ\Gamma_{l}:=\left\{\sigma_{1},\ldots,\sigma_{l}\right\}\subset\Gamma.

For each l{1,,m}l\in\{1,\ldots,m\} observe that min{𝒩,l+1}=l+1\min\left\{{\cal N},l+1\right\}=l+1. Hence the recombination thinning Lemma 3.1 additionally tells us that there are non-negative coefficients bl,1,,bl,l+10b_{l,1},\ldots,b_{l,l+1}\geq 0 and indices el(1),,el(l+1){1,,𝒩}e_{l}(1),\ldots,e_{l}(l+1)\in\{1,\ldots,{\cal N}\} for which

ul=s=1l+1bl,shel(s)ands=1l+1bl,s=i=1𝒩αi.u_{l}=\sum_{s=1}^{l+1}b_{l,s}h_{e_{l}(s)}\qquad\text{and}\qquad\sum_{s=1}^{l+1}b_{l,s}=\sum_{i=1}^{{\cal N}}\alpha_{i}. (6.43)

A consequence of (6.43) is that

ulXs=1l+1bl,shel(s)X=s=1l+1bl,s=(6.43)i=1𝒩αi=(6.41)C.||u_{l}||_{X}\leq\sum_{s=1}^{l+1}b_{l,s}\left|\left|h_{e_{l}(s)}\right|\right|_{X}=\sum_{s=1}^{l+1}b_{l,s}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_sep_lemma_u_n_coeff_sum}})}}}{{=}}\sum_{i=1}^{{\cal N}}\alpha_{i}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_separation_varphi_const}})}}}{{=}}C. (6.44)

Further, for each l{1,,m}l\in\{1,\ldots,m\}, let

Ωl:=Reach||||l1(l)(Γl,2C,θ,θ0)=def0tθθ0Span¯||||l1(l)(Γl,t)θtθ02C\Omega_{l}:=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(\Gamma_{l},2C,\theta,\theta_{0}\right)\stackrel{{\scriptstyle\text{def}}}{{=}}\bigcup_{0\leq t\leq\frac{\theta}{\theta_{0}}}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(\Gamma_{l},t\right)_{\frac{\theta-t\theta_{0}}{2C}} (6.45)

where we use our notation conventions from Section 6.1. With our notation fixed, we turn our attention to verifying the claims made in (6.33), (6.34), and (6.35).

We begin by noting that (6.34) is an immediate consequence of appealing to Lemma 6.4 with ll and ΓlX\Gamma_{l}\subset X^{\ast} playing the roles of integer dd and finite subset LXL\subset X^{\ast} there respectively. Indeed by doing so we may conclude that (cf. (6.28))

for every σReach||||l1(l)(Γl,2C,θ,θ0)=:(6.45)Ωlwe have|σ(φul)|θ\text{for every }\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(\Gamma_{l},2C,\theta,\theta_{0}\right)\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{Omega_l_def_sep_lemma_pf}})}}}{{=:}}\Omega_{l}\quad\text{we have}\quad|\sigma(\varphi-u_{l})|\leq\theta (6.46)

as claimed in (6.34).

To establish (6.35), fix l{2,,m}l\in\{2,\ldots,m\} and consider σlΓ\sigma_{l}\in\Gamma. A consequence of the Banach GRIM algorithm completing the lthl^{\text{th}} step without terminating is that (cf. Banach GRIM algorithm steps (C) and (D))

|σl(φul1)|>θ.|\sigma_{l}(\varphi-u_{l-1})|>\theta. (6.47)

However, since we have established that (6.34) is true, we know that if σΩl1\sigma\in\Omega_{l-1} then |σ(φul1)|θ|\sigma(\varphi-u_{l-1})|\leq\theta. Consequently, (6.47) tells us that σlΩl1\sigma_{l}\notin\Omega_{l-1} as claimed in (6.35).

To establish (6.33), fix l{1,,m}l\in\{1,\ldots,m\} and consider any σX\sigma\in X^{\ast}. Then we may estimate that

|σ(φul)|σXφulX(6.42)&(6.44)2CσX.\left|\sigma(\varphi-u_{l})\right|\leq||\sigma||_{X^{\ast}}||\varphi-u_{l}||_{X}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_sep_lemma_vph_def&bd}})}~{}\&~{}{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_sep_lemma_ul_norm_bd}})}}}{{\leq}}2C||\sigma||_{X^{\ast}}. (6.48)

Alternatively, let ρΩl\rho\in\Omega_{l} and use the triangle inequality, together with the fact that by (6.34) we have |ρ(φul)|θ|\rho(\varphi-u_{l})|\leq\theta, to compute that

|σ(φul)|σρXφulX+θ(6.42)&(6.44)2CσρX+θ.|\sigma(\varphi-u_{l})|\leq||\sigma-\rho||_{X^{\ast}}||\varphi-u_{l}||_{X}+\theta\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_sep_lemma_vph_def&bd}})}~{}\&~{}{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_sep_lemma_ul_norm_bd}})}}}{{\leq}}2C||\sigma-\rho||_{X^{\ast}}+\theta. (6.49)

Taking the infimum over ρΩl\rho\in\Omega_{l} in (6.49) yields

|σ(φul)|2CdistX(σ,Ωl)+θ.|\sigma(\varphi-u_{l})|\leq 2C\operatorname{dist}_{X^{\ast}}(\sigma,\Omega_{l})+\theta. (6.50)

Together, (6.48) and (6.50) yield that for any σX\sigma\in X^{\ast} we have

|σ(φul)|min{2CσX,2CdistX(σ,Ωl)+θ}|\sigma(\varphi-u_{l})|\leq\min\left\{2C||\sigma||_{X^{\ast}}~{},~{}2C\operatorname{dist}_{X^{\ast}}(\sigma,\Omega_{l})+\theta\right\} (6.51)

as claimed in (6.33). This completes the proof of Lemma 6.5. ∎

We can use Lemma 6.5 to establish an upper bound on the maximum number of steps the Banach GRIM algorithm can run for before terminating. The precise statement is the following lemma.

Lemma 6.7 (Banach GRIM Number of Steps Bound).

Assume XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, and 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} such that M:=min{𝒩1,Λ}1M:=\min\left\{{\cal N}-1,\Lambda\right\}\geq 1. Suppose that :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\}, and that ΓX\Gamma\subset X^{\ast} is finite with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.52)

Then there is a non-negative integer N=N(Γ,C,θ,θ0)0N=N(\Gamma,C,\theta,\theta_{0})\in{\mathbb{Z}}_{\geq 0}, given by

N:=max{d:There exists σ1,,σdΓ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,θ,θ0)},N:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Gamma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\theta,\theta_{0}\right)\end{array}\right\}, (6.53)

where we use our notation conventions from Section 6.1 (see also Remark 6.3), for which the following is true.

Suppose NM:=min{𝒩1,Λ}N\leq M:=\min\left\{{\cal N}-1,\Lambda\right\} and consider applying the Banach GRIM algorithm to approximate φ\varphi on Γ\Gamma with θ\theta as the accuracy threshold, θ0\theta_{0} as the acceptable recombination error bound, MM as the maximum number of steps, s1==sM=1s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k1,,kMk_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 11 (cf. Banach GRIM (A)). Then, after at most NN steps the algorithm terminates. That is, there is some integer n{1,,N}n\in\{1,\ldots,N\} for which the Banach GRIM algorithm terminates after completing nn steps. Consequently, there are coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} and indices e(1),,e(n+1){1,,𝒩}e(1),\ldots,e(n+1)\in\{1,\ldots,{\cal N}\} with

s=1n+1|cs|fe(s)X=C,\sum_{s=1}^{n+1}\left|c_{s}\right|||f_{e(s)}||_{X}=C, (6.54)

and such that the element uSpan()u\in\operatorname{Span}({\cal F}) defined by

u:=s=1n+1csfe(s)satisfies, for every σΓ, that|σ(φu)|θ.u:=\sum_{s=1}^{n+1}c_{s}f_{e(s)}\qquad\text{satisfies, for every }\sigma\in\Gamma,\text{ that}\qquad|\sigma(\varphi-u)|\leq\theta. (6.55)

In fact there are linear functionals Γn={σ1,,σn}Γ\Gamma_{n}=\left\{\sigma_{1},\ldots,\sigma_{n}\right\}\subset\Gamma such that

for every σΩn:=Reach||||l1(n)(Γn,2C,θ,θ0)we have|σ(φu)|θ.\text{for every }\sigma\in\Omega_{n}:=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{n})}}\left(\Gamma_{n},2C,\theta,\theta_{0}\right)\quad\text{we have}\quad\left|\sigma(\varphi-u)\right|\leq\theta. (6.56)

Moreover, if the coefficients a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} corresponding to φ\varphi (cf. (I) of (6.52)) are all positive (i.e. a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0) then the coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} corresponding to uu (cf. (6.55)) are all non-negative (i.e. c1,,cn+10c_{1},\ldots,c_{n+1}\geq 0).

Remark 6.8.

Recall that the Banach GRIM algorithm is guaranteed to terminate after MM steps. Consequently, the restriction to the case that NMN\leq M is sensible since it is only in this case that terminating after no more than NN steps is a non-trivial statement.

If N<MN<M then Lemma 6.7 guarantees that the Banach GRIM algorithm will find an approximation uSpan()u\in\operatorname{Span}({\cal F}) of φ\varphi that is a linear combination of less than 𝒩{\cal N} of the elements f1,,f𝒩f_{1},\ldots,f_{{\cal N}} but is within θ\theta of φ\varphi throughout Γ\Gamma in the sense that |σ(φu)|θ|\sigma(\varphi-u)|\leq\theta for every σΓ\sigma\in\Gamma.

Remark 6.9.

By invoking Lemma 6.5, and using the same notation as in Lemma 6.7, we can additionally conclude that for every σX\sigma\in X^{\ast} we have

|σ(φu)|min{2CσX,2CdistX(σ,Ωn)+θ}.|\sigma(\varphi-u)|\leq\min\left\{2C||\sigma||_{X^{\ast}}~{},~{}2C\operatorname{dist}_{X^{\ast}}\left(\sigma,\Omega_{n}\right)+\theta\right\}. (6.57)
Proof of Lemma 6.7.

Assume XX is a Banach space with dual-space XX^{\ast}. Let θ>θ00\theta>\theta_{0}\geq 0, and 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} such that M:=min{𝒩1,Λ}1M:=\min\left\{{\cal N}-1,\Lambda\right\}\geq 1. Suppose that :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\}, and that ΓX\Gamma\subset X^{\ast} is finite with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by (I) and (II) of (6.52) respectively. That is

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.58)

With a view to later applying Banach GRIM to approximate φ\varphi on Γ\Gamma, for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} let a~i:=|ai|\tilde{a}_{i}:=|a_{i}| and f~i\tilde{f}_{i} be given by fif_{i} if ai>0a_{i}>0 and fi-f_{i} if ai<0a_{i}<0. For every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} we evidently have f~iX=fiX\left|\left|\tilde{f}_{i}\right|\right|_{X}=||f_{i}||_{X}. Moreover, we also have that a~1,,a~𝒩>0\tilde{a}_{1},\ldots,\tilde{a}_{{\cal N}}>0 and φ=i=1𝒩a~if~i\varphi=\sum_{i=1}^{{\cal N}}\tilde{a}_{i}\tilde{f}_{i}. Further, we rescale f~i\tilde{f}_{i} for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} to have unit XX norm. That is (cf. Banach GRIM (B)), for each i{1,,𝒩}i\in\{1,\ldots,{\cal N}\} set hi:=f~ifiXh_{i}:=\frac{\tilde{f}_{i}}{||f_{i}||_{X}} and αi:=a~ifiX\alpha_{i}:=\tilde{a}_{i}||f_{i}||_{X}. Observe both that CC satisfies

C:=i=1𝒩|ai|fiX=i=1𝒩a~ifiX=i=1𝒩αi,C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}=\sum_{i=1}^{{\cal N}}\tilde{a}_{i}||f_{i}||_{X}=\sum_{i=1}^{{\cal N}}\alpha_{i}, (6.59)

and, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, that αihi=a~if~i=aifi\alpha_{i}h_{i}=\tilde{a}_{i}\tilde{f}_{i}=a_{i}f_{i}. Therefore the expansion for φ\varphi in (I) of (6.58) is equivalent to

φ=i=1𝒩αihi, and hence φXi=1𝒩αihiX=i=1𝒩αi=(6.59)C.\varphi=\sum_{i=1}^{{\cal N}}\alpha_{i}h_{i},\qquad\text{ and hence }\qquad||\varphi||_{X}\leq\sum_{i=1}^{{\cal N}}\alpha_{i}||h_{i}||_{X}=\sum_{i=1}^{{\cal N}}\alpha_{i}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_step_num_varphi_const}})}}}{{=}}C. (6.60)

Define a non-negative integer N=N(Γ,C,θ,θ0)0N=N(\Gamma,C,\theta,\theta_{0})\in{\mathbb{Z}}_{\geq 0} by

N:=max{d:There exists σ1,,σdΓ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,θ,θ0)}.N:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Gamma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\theta,\theta_{0}\right)\end{array}\right\}. (6.61)

Suppose NM:=min{𝒩1,Λ}N\leq M:=\min\left\{{\cal N}-1,\Lambda\right\} and consider applying the Banach GRIM algorithm to approximate φ\varphi on Γ\Gamma with θ\theta as the accuracy threshold, θ0\theta_{0} as the acceptable recombination error bound, MM as the maximum number of steps, s1==sM=1s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k1,,kMk_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 11 (cf. Banach GRIM (A)).

We first prove that the algorithm terminates after at most NN steps have been completed. Let σ1Γ\sigma_{1}\in\Gamma be the linear functional chosen in the first step (cf. Banach GRIM (A)), and u1Span()u_{1}\in\operatorname{Span}({\cal F}) be the approximation found via recombination (cf. the recombination thinning Lemma 3.1) satisfying, in particular, that |σ1(φu1)|θ0|\sigma_{1}(\varphi-u_{1})|\leq\theta_{0}. Define Γ1:={σ1}Γ\Gamma_{1}:=\{\sigma_{1}\}\subset\Gamma. We conclude, via Lemma 6.5 (cf. (6.34)), that for every σReach||||l1()(Γ1,2C,θ,θ0)\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}})}}\left(\Gamma_{1},2C,\theta,\theta_{0}\right) we have |σ(φu1)|θ|\sigma(\varphi-u_{1})|\leq\theta.

If N=1N=1 then (6.61) means there is no σΓ\sigma\in\Gamma for which σReach||||l1()(Γ1,2C,θ,θ0)\sigma\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}})}}\left(\Gamma_{1},2C,\theta,\theta_{0}\right). Consequently, ΓReach||||l1()(Γ1,2C,θ,θ0)=Γ\Gamma\cap\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}})}}\left(\Gamma_{1},2C,\theta,\theta_{0}\right)=\Gamma, and so we have established that for every σΓ\sigma\in\Gamma we have |σ(φu1)|θ|\sigma(\varphi-u_{1})|\leq\theta. Recalling Banach GRIM (D), this means that the algorithm terminates before step 22 is completed. Hence the algorithm terminates after completing N=1N=1 step.

If N2N\geq 2 then we note that if the stopping criterion in Banach GRIM (D) is triggered at the start of step m{2,,N}m\in\{2,\ldots,N\} then we evidently have that the algorithm has terminated after carrying out no more than NN steps. Consequently, we need only deal with the case in which the algorithm reaches and carries out step NN without terminating. In this case, we claim that the algorithm terminates after completing step NN, i.e. that the termination criterion at the start of step N+1N+1 is triggered.

Before proving this we fix some notation. Recalling Banach GRIM (D), for l{2,,N}l\in\{2,\ldots,N\} let σlΓ\sigma_{l}\in\Gamma denote the new linear functional selected at step ll, define Γl:={σ1,,σl}\Gamma_{l}:=\left\{\sigma_{1},\ldots,\sigma_{l}\right\}, and let ulSpan()u_{l}\in\operatorname{Span}({\cal F}) be the approximation found by recombination (cf. the recombination thinning Lemma 3.1) satisfying, for every s{1,,l}s\in\{1,\ldots,l\}, that |σs(φul)|θ0|\sigma_{s}(\varphi-u_{l})|\leq\theta_{0}.

By appealing to Lemma 6.5 we deduce both that for any l{2,,N}l\in\{2,\ldots,N\} we have (cf. (6.35))

σlReach||||l1(l1)(Γl1,2C,θ,θ0),\sigma_{l}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l-1})}}\left(\Gamma_{l-1},2C,\theta,\theta_{0}\right), (6.62)

and that for any σReach||||l1(l)(Γl,2C,θ,θ0)\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{l})}}\left(\Gamma_{l},2C,\theta,\theta_{0}\right) we have (cf. (6.34))

|σ(φul)|θ.|\sigma(\varphi-u_{l})|\leq\theta. (6.63)

Consider step N+1N+1 of the algorithm at which we examine K:=max{|σ(φuN)|:σΓ}K:=\max\left\{|\sigma(\varphi-u_{N})|:\sigma\in\Gamma\right\}. If KθK\leq\theta then the algorithm terminates without carrying out step N+1N+1, and thus has terminated after carrying out NN steps as claimed.

Assume that K>θK>\theta so that σN+1:=argmax{|σ(φuN)|:σΓ}\sigma_{N+1}:=\operatorname{argmax}\left\{|\sigma(\varphi-u_{N})|:\sigma\in\Gamma\right\} satisfies that |σN+1(φuN)|>θ|\sigma_{N+1}(\varphi-u_{N})|>\theta. It follows from (6.63) for l:=Nl:=N that σN+1Reach||||l1(N)(ΓN,2C,θ,θ0)\sigma_{N+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{N})}}\left(\Gamma_{N},2C,\theta,\theta_{0}\right). But it then follows from this and (6.62) that σ1,,σN+1Γ\sigma_{1},\ldots,\sigma_{N+1}\in\Gamma satisfy, for every j{1,,N}j\in\{1,\ldots,N\}, that σj+1Reach||||l1(j)(Γj,2C,θ,θ0)\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\Gamma_{j},2C,\theta,\theta_{0}\right). In which case (6.61) yields that

N:=(6.61)max{d:There exists σ1,,σdΓ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,θ,θ0)}N+1N\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_step_bound_lemma_r_packing_pf}})}}}{{:=}}\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Gamma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\theta,\theta_{0}\right)\end{array}\right\}\geq N+1

which is evidently a contradiction. Thus we must have that KθK\leq\theta, and hence that the algorithm must terminate before carrying out step N+1N+1.

Having established the claimed upper bound on the number of steps before the Banach GRIM algorithm terminates, we turn our attention to the properties claimed for the approximation returned after the algorithm terminates. Let n{1,,N}n\in\{1,\ldots,N\} be the integer for which the Banach GRIM algorithm terminates after step nn. Recalling Banach GRIM (C) and (D), let Γn={σ1,,σn}Γ\Gamma_{n}=\{\sigma_{1},\ldots,\sigma_{n}\}\subset\Gamma be the nn linear functionals selected by the end of the nthn^{\text{th}} step, and let uSpan()u\in\operatorname{Span}({\cal F}) denote the approximation found via recombination (cf. the recombination thinning Lemma 3.1) satisfying, for every s{1,,n}s\in\{1,\ldots,n\}, that |σs(φu)|θ0|\sigma_{s}(\varphi-u)|\leq\theta_{0}.

A consequence of Lemma 6.5 is that for every σReach||||l1(n)(Γn,2C,θ,θ0)\sigma\in\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{n})}}\left(\Gamma_{n},2C,\theta,\theta_{0}\right) we have (cf. (6.34))

|σ(φu)|θ|\sigma(\varphi-u)|\leq\theta (6.64)

as claimed in (6.56). Moreover, since we have established that the algorithm terminates by triggering the stopping criterion after completing step nn, we have, for every σΓ\sigma\in\Gamma, that |σ(φu)|θ|\sigma(\varphi-u)|\leq\theta as claimed in the second part of (6.55).

To establish (6.54) and the first part of (6.55), first note that min{𝒩,n+1}=n+1\min\left\{{\cal N},n+1\right\}=n+1. Thus Lemma 3.1 additionally tells us that recombination returns non-negative coefficients b1,,bn+10b_{1},\ldots,b_{n+1}\geq 0, with

s=1n+1bs=i=1𝒩αi=(6.59)C,\sum_{s=1}^{n+1}b_{s}=\sum_{i=1}^{{\cal N}}\alpha_{i}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_step_num_varphi_const}})}}}{{=}}C, (6.65)

and indices e(1),,e(n+1){1,,𝒩}e(1),\ldots,e(n+1)\in\{1,\ldots,{\cal N}\} for which

u=s=1n+1bshe(s)=s=1n+1bsfe(s)Xf~e(s).u=\sum_{s=1}^{n+1}b_{s}h_{e(s)}=\sum_{s=1}^{n+1}\frac{b_{s}}{\left|\left|f_{e(s)}\right|\right|_{X}}\tilde{f}_{e(s)}. (6.66)

For each s{1,,n+1}s\in\{1,\ldots,n+1\}, we define cs:=bsfe(s)Xc_{s}:=\frac{b_{s}}{\left|\left|f_{e(s)}\right|\right|_{X}} if f~e(s)=fe(s)\tilde{f}_{e(s)}=f_{e(s)} (which we recall is the case if ae(s)>0a_{e(s)}>0) and cs:=bsfe(s)Xc_{s}:=-\frac{b_{s}}{\left|\left|f_{e(s)}\right|\right|_{X}} if f~e(s)=fe(s)\tilde{f}_{e(s)}=-f_{e(s)} (which we recall is the case if ae(s)<0a_{e(s)}<0). Then (6.66) gives the expansion for uSpan()Xu\in\operatorname{Span}({\cal F})\subset X in terms of the elements f1,,f𝒩f_{1},\ldots,f_{{\cal N}} claimed in the first part of (6.55). Moreover, from (6.65) we have that

s=1n+1|cs|fe(s)X=s=1n+1bs=(6.65)C\sum_{s=1}^{n+1}\left|c_{s}\right|\left|\left|f_{e(s)}\right|\right|_{X}=\sum_{s=1}^{n+1}b_{s}\stackrel{{\scriptstyle{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}({\ref{banach_step_bound_lemma_um_coeff_sum_pf}})}}}{{=}}C (6.67)

as claimed in (6.54).

It remains only to prove that if the coefficients a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} are all positive (i.e. a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0), then the resulting coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} are all non-negative (i.e. c1,,cn+10c_{1},\ldots,c_{n+1}\geq 0). To see this, observe that if a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0 then, for every i{1,,𝒩}i\in\{1,\ldots,{\cal N}\}, we have that f~i=fi\tilde{f}_{i}=f_{i}. Consequently, for every s{1,,n+1}s\in\{1,\ldots,n+1\} we have that f~e(s)=fe(s)\tilde{f}_{e(s)}=f_{e(s)}, and so by definition we have cs=bsfe(s)Xc_{s}=\frac{b_{s}}{\left|\left|f_{e(s)}\right|\right|_{X}}. Since bs0b_{s}\geq 0, it follows that cs0c_{s}\geq 0. This completes the proof of Lemma 6.7. ∎

6.4 Proof of Main Theoretical Result

In this subsection we prove the Banach GRIM Convergence Theorem 6.2 by combining Lemmas 6.4, 6.5, and 6.7.

Proof of Theorem 6.2.

Assume XX is a Banach space with dual-space XX^{\ast}. Let ε>ε00\varepsilon>\varepsilon_{0}\geq 0. Let 𝒩,Λ1{\cal N},\Lambda\in{\mathbb{Z}}_{\geq 1} and set M:=min{𝒩1,Λ}M:=\min\left\{{\cal N}-1,\Lambda\right\}. Let :={f1,,f𝒩}X{0}{\cal F}:=\{f_{1},\ldots,f_{{\cal N}}\}\subset X\setminus\{0\} and ΣX\Sigma\subset X^{\ast} be a finite subset with cardinality Λ\Lambda. Let a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} and define φSpan()\varphi\in\operatorname{Span}({\cal F}) and C>0C>0 by

(𝐈)φ:=i=1𝒩aifiand(𝐈𝐈)C:=i=1𝒩|ai|fiX>0.({\bf I})\quad\varphi:=\sum_{i=1}^{{\cal N}}a_{i}f_{i}\qquad\text{and}\qquad({\bf II})\quad C:=\sum_{i=1}^{{\cal N}}|a_{i}|||f_{i}||_{X}>0. (6.68)

Define a non-negative integer N=N(Σ,C,ε,ε0)0N=N(\Sigma,C,\varepsilon,\varepsilon_{0})\in{\mathbb{Z}}_{\geq 0}, given by

N:=max{d:There exists σ1,,σdΣ such that for every j{1,,d1}we have σj+1Reach||||l1(j)({σ1,,σj},2C,ε,ε0)}.N:=\max\left\{d\in{\mathbb{Z}}~{}:~{}\begin{array}[]{c}\text{There exists }\sigma_{1},\ldots,\sigma_{d}\in\Sigma\text{ such that for every }j\in\{1,\ldots,d-1\}\\ \text{we have }\sigma_{j+1}\notin\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{j})}}\left(\{\sigma_{1},\ldots,\sigma_{j}\},2C,\varepsilon,\varepsilon_{0}\right)\end{array}\right\}. (6.69)

Suppose NM:=min{𝒩1,Λ}N\leq M:=\min\left\{{\cal N}-1,\Lambda\right\} and consider applying the Banach GRIM algorithm to approximate φ\varphi on Σ\Sigma with ε\varepsilon as the accuracy threshold, ε0\varepsilon_{0} as the acceptable recombination error bound, MM as the maximum number of steps, s1==sM=1s_{1}=\ldots=s_{M}=1 as the shuffle numbers, and with the integers k1,,kMk_{1},\ldots,k_{M} in Step (A) of the Banach GRIM algorithm all being chosen equal to 11 (cf. Banach GRIM (A)). We appeal to Lemma 6.7, with ε\varepsilon, ε0\varepsilon_{0} and Σ\Sigma here as the θ\theta, θ0\theta_{0} and Γ\Gamma there respectively, to conclude that there is some integer n{1,,N}n\in\{1,\ldots,N\} for which the algorithm terminates after step nn. Thus the Banach GRIM algorithm terminates after completing, at most, NN steps as claimed.

Lemma 6.7 additionally tells us that there are coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} and indices e(1),,e(n+1){1,,𝒩}e(1),\ldots,e(n+1)\in\{1,\ldots,{\cal N}\} with (cf. (6.54))

s=1n+1|cs|fe(s)X=C\sum_{s=1}^{n+1}|c_{s}|||f_{e(s)}||_{X}=C (6.70)

and such that the element uSpan()u\in\operatorname{Span}({\cal F}) defined by (cf. (6.55))

u:=s=1n+1csfe(s)satisfies, for every σΣ that|σ(φu)|ε.u:=\sum_{s=1}^{n+1}c_{s}f_{e(s)}\quad\text{satisfies, for every }\sigma\in\Sigma\text{ that}\quad|\sigma(\varphi-u)|\leq\varepsilon. (6.71)

Observe that (6.70) is precisely the claim (6.10), whilst (6.71) is precisely the claim (6.11).

The final consequence of Lemma 6.7 that we note is that if the coefficients a1,,a𝒩{0}a_{1},\ldots,a_{{\cal N}}\in{\mathbb{R}}\setminus\{0\} associated to φ\varphi (cf. (I) of (6.68)) are all positive (i.e. a1,,a𝒩>0a_{1},\ldots,a_{{\cal N}}>0) then the coefficients c1,,cn+1c_{1},\ldots,c_{n+1}\in{\mathbb{R}} associated with uu (cf. (6.71)) are all non-negative (i.e. c1,,cn+10c_{1},\ldots,c_{n+1}\geq 0). This is precisely the preservation of non-negative coefficients claimed in Theorem 6.2.

It only remains to verify the claim made in (6.12). For this purpose let A>1A>1 and define

Ω:=Reach||||l1(Λ)(Σ,2C,Aε,ε)=def0tASpan¯||||l1(Λ)(Σ,t)(At)ε2C.\Omega:=\operatorname{Reach}_{||\cdot||_{l^{1}({\mathbb{R}}^{\Lambda})}}\left(\Sigma,2C,A\varepsilon,\varepsilon\right)\stackrel{{\scriptstyle\text{def}}}{{=}}\bigcup_{0\leq t\leq A}\operatorname{\overline{Span}}_{||\cdot||_{l^{1}({\mathbb{R}}^{\Lambda})}}\left(\Sigma,t\right)_{\frac{(A-t)\varepsilon}{2C}}. (6.72)

Then observe that ΣX\Sigma\subset X^{\ast} is a finite subset of cardinality Λ\Lambda and that (6.70) and (6.71) mean that uSpan()u\in\operatorname{Span}({\cal F}) satisfies both that uXC||u||_{X}\leq C and that, for every σΣ\sigma\in\Sigma, we have |σ(φu)|ε|\sigma(\varphi-u)|\leq\varepsilon. Therefore we may apply Lemma 6.4, with the integer dd, the finite subset LXL\subset X^{\ast}, and the real numbers θ>θ00\theta>\theta_{0}\geq 0 of that result as the integer Λ\Lambda, the finite subset Σ\Sigma, and the real numbers Aε>ε>0A\varepsilon>\varepsilon>0 here, to deduce that (cf. (6.28)) for every σΩ\sigma\in\Omega we have |σ(φu)|Aε|\sigma(\varphi-u)|\leq A\varepsilon. Recalling the definition of the set Ω\Omega in (6.72), this is precisely the estimate claimed in (6.12). This completes the proof of Theorem 6.2. ∎

7 Numerical Examples

In this section we compare GRIM with several existing techniques on two different reduction tasks. The first task we consider is motivated by an example appearing in Section 4 of the work [MMPY15] by Maday et al. concerning GEIM.

We consider the Hilbert space X:=L2(0,1)X:=L^{2}(0,1), and given (a,b)[0.01,24.9]×[0,15](a,b)\in[0.01,24.9]\times[0,15] we define fa,bXf_{a,b}\in X by

fa,b(x):=11+(25+acos(bx))x2.f_{a,b}(x):=\frac{1}{\sqrt{1+(25+a\cos(bx))x^{2}}}. (7.1)

For a chosen integer 𝒩1{\cal N}\in{\mathbb{Z}}_{\geq 1} we consider the collection X{\cal F}\subset X defined by

:={fa,b:(a,b)[0.01,24.9]𝒩×[0,15]𝒩}.{\cal F}:=\left\{f_{a,b}:(a,b)\in[0.01,24.9]_{{\cal N}}\times[0,15]_{{\cal N}}\right\}. (7.2)

Here, for real numbers c,dc,d\in{\mathbb{R}}, we use the notation [c,d]𝒩[c,d]_{{\cal N}} to denote a partition of the interval [c,d][c,d] by 𝒩{\cal N} equally spaced points.

For a chosen M1M\in{\mathbb{Z}}_{\geq 1} and s>0s>0, we fix an equally spaced partition y1,,yM[0,1]y_{1},\ldots,y_{M}\in[0,1] and consider the collection Σ={σk:k{1,,M}}X\Sigma=\left\{\sigma_{k}:k\in\{1,\ldots,M\}\right\}\subset X^{\ast} where, for k{1,,M}k\in\{1,\ldots,M\} and ψX\psi\in X,

σk(ψ):=01ψ(x)𝑑ρk(x)withdρk(x)=e(xyk)22s2dx.\sigma_{k}(\psi):=\fint_{0}^{1}\psi(x)d\rho_{k}(x)\qquad\text{with}\qquad d\rho_{k}(x)=e^{-\frac{(x-y_{k})^{2}}{2s^{2}}}dx. (7.3)

Finally, we fix the choice a1==a𝒩=1a_{1}=\ldots=a_{{\cal N}}=1 for the coefficients. The task is to approximate the function φ:=ff\varphi:=\sum_{f\in{\cal F}}f over the collection Σ\Sigma of linear functionals.
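To make this setup concrete, the following is a minimal NumPy sketch of the discretised problem. The grid resolution, the interpretation of the averaged integral in (7.3) as an average with respect to ρ_k, and all helper names are illustrative assumptions rather than details of our actual implementation.

```python
import numpy as np

# Uniform grid used to discretise [0,1]; the resolution is illustrative.
x = np.linspace(0.0, 1.0, 2000)

def f_ab(a, b):
    # f_{a,b}(x) = 1 / sqrt(1 + (25 + a cos(bx)) x^2), cf. (7.1).
    return 1.0 / np.sqrt(1.0 + (25.0 + a * np.cos(b * x)) * x**2)

# Parameter grid from (7.2): equally spaced values of a and of b.
N_param = 20
a_vals = np.linspace(0.01, 24.9, N_param)
b_vals = np.linspace(0.0, 15.0, N_param)
F = np.array([f_ab(a, b) for a in a_vals for b in b_vals])

# Target: the sum of the functions in F (all coefficients equal to 1).
varphi = F.sum(axis=0)

# Functionals sigma_k from (7.3): Gaussian-weighted averages centred at
# M equally spaced points y_1, ..., y_M with bandwidth s.
M, s = 1000, 5e-4
y = np.linspace(0.0, 1.0, M)
w = np.exp(-((x[None, :] - y[:, None]) ** 2) / (2.0 * s**2))  # shape (M, len(x))
mass = np.trapz(w, x, axis=1)                                  # rho_k([0,1])

def apply_sigmas(psi):
    # Approximates (sigma_1(psi), ..., sigma_M(psi)) by quadrature on the grid,
    # normalising by rho_k([0,1]) to realise the averaged integral in (7.3).
    return np.trapz(w * psi[None, :], x, axis=1) / mass

data = apply_sigmas(varphi)  # the values GRIM / GEIM / LASSO must reproduce
```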

For the tests we fix the values M:=1000M:=1000, s:=5×104s:=5\times 10^{-4}, and consider each of the values 𝒩=20,25,30{\cal N}=20,25,30 individually. We compare our implementation of GRIM against our own implementation of GEIM [MM13, MMT14, MMPY15] (which, in particular, makes use of the recursive relation established in [MMPY15]) and the Scikit-learn implementation of LASSO [BBCDDGGMPPPPTVVW11]. The results are summarised in Table 1 below. For each approximation uu we record the L2L^{2}-norm of the difference φu\varphi-u and the sup\sup-norm of the difference φu\varphi-u over Σ\Sigma, i.e. the values (01(φ(x)u(x))2𝑑x)1/2\left(\int_{0}^{1}(\varphi(x)-u(x))^{2}dx\right)^{1/2} and max{|σ(φu)|:σΣ}\max\left\{|\sigma(\varphi-u)|:\sigma\in\Sigma\right\}.

GRIM GEIM LASSO
𝒩=20{\cal N}=20 16non-zero weightsL2norm=0.15supnorm=0.49\begin{aligned} &16~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.15\\ &\sup-\text{norm}=0.49\end{aligned} 20non-zero weightsL2norm=0.15supnorm=0.64\begin{aligned} &20~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.15\\ &\sup-\text{norm}=0.64\end{aligned} 90non-zero weightsL2norm=0.19supnorm=0.66\begin{aligned} &90~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.19\\ &\sup-\text{norm}=0.66\end{aligned}
𝒩=25{\cal N}=25 20non-zero weightsL2norm=0.04supnorm=0.16\begin{aligned} &20~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.04\\ &\sup-\text{norm}=0.16\end{aligned} 27non-zero weightsL2norm=0.04supnorm=0.26\begin{aligned} &27~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.04\\ &\sup-\text{norm}=0.26\end{aligned} 135non-zero weightsL2norm=0.30supnorm=1.04\begin{aligned} &135~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.30\\ &\sup-\text{norm}=1.04\end{aligned}
𝒩=30{\cal N}=30 19non-zero weightsL2norm=0.07supnorm=0.23\begin{aligned} &19~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.07\\ &\sup-\text{norm}=0.23\end{aligned} 24non-zero weightsL2norm=0.15supnorm=0.72\begin{aligned} &24~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.15\\ &\sup-\text{norm}=0.72\end{aligned} 176non-zero weightsL2norm=0.43supnorm=1.51\begin{aligned} &176~{}\text{non-zero weights}\\ &L^{2}-\text{norm}=0.43\\ &\sup-\text{norm}=1.51\end{aligned}
Table 1: The number of non-zero weights and the L2L^{2} and sup\sup norms of the difference between φ\varphi and the approximation are recorded for each technique. The L2L^{2} and sup\sup norm values are recorded to 2 decimal places. For each value of 𝒩{\cal N} we first find the LASSO approximation. Then we record the values for the first GEIM approximation that at least matches the LASSO approximation on both L2L^{2} and sup\sup norm values. Finally we record the values for the first GRIM approximation that at least matches the GEIM approximation on both L2L^{2} and sup\sup norm values.

In each case the GRIM approximation is a linear combination of fewer functions in {\cal F} than both the GEIM approximation and the LASSO approximation. Moreover, the GRIM approximation of φ\varphi is at least as good as both the GEIM approximation and the LASSO approximation in both the L2L^{2}-norm sense and the sup\sup-norm sense.

The second task we consider is a kernel quadrature problem appearing in Section 3.2 of [HLO21]. In particular, we consider the 3D Road Network data set [JKY13] of 434874 elements in 3{\mathbb{R}}^{3} and the Combined Cycle Power Plant data set [GKT12] of 9568 elements in 5{\mathbb{R}}^{5}. For the 3D Road Network data set we take a random subset Ω\Omega of size 43487434874/1043487\approx 434874/10, whilst for the Combined Cycle Power Plant data set we take Ω\Omega to be the full 9568 elements.

In both cases we consider X:=[Ω]X:={\cal M}[\Omega] to be the Banach space of signed measures on Ω\Omega, and take {\cal F} to be the collection of point masses supported at points in Ω\Omega, i.e. :={δp:pΩ}[Ω]{\cal F}:=\left\{\delta_{p}:p\in\Omega\right\}\subset{\cal M}[\Omega]. For the collection of linear functionals Σ[Ω]\Sigma\subset{\cal M}[\Omega]^{\ast} we take Σ\Sigma to be the closed unit ball of the Reproducing Kernel Hilbert Space (RKHS) k{\cal H}_{k} associated to the kernel k:Ω×Ωk:\Omega\times\Omega\to{\mathbb{R}} defined by

k(x,y):=exy22λ2.k(x,y):=e^{-\frac{||x-y||^{2}}{2\lambda^{2}}}. (7.4)

Here ||||||\cdot|| denotes the Euclidean norm on the appropriate Euclidean space, and λ\lambda is a hyperparameter determined by median heuristics [HLO21]. We let φ\varphi denote the equally weighted probability measure over Ω\Omega and consider the kernel quadrature problem for φ\varphi with respect to the RKHS k{\cal H}_{k}. By defining 𝒩:=#(Ω){\cal N}:=\#(\Omega) and a1==a𝒩=1/𝒩a_{1}=\ldots=a_{{\cal N}}=1/{\cal N}, we observe that this problem is within our framework.
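The sketch below indicates how this setup can be instantiated. The sampled-pairs version of the median heuristic and the synthetic stand-in for Ω are illustrative assumptions, not the exact choices made in [HLO21].

```python
import numpy as np

def rbf_kernel(X, Y, lam):
    # k(x, y) = exp(-||x - y||^2 / (2 lam^2)), cf. (7.4); X, Y have points as rows.
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lam**2))

def median_heuristic(X, n_pairs=100_000, seed=0):
    # lam taken as the median distance between randomly sampled pairs of points;
    # an approximation to the median heuristic referred to in the text.
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), n_pairs)
    j = rng.integers(0, len(X), n_pairs)
    return np.median(np.linalg.norm(X[i] - X[j], axis=1))

# Stand-in for Omega (rows are data points); the real data sets are described above.
Omega = np.random.default_rng(1).normal(size=(9568, 5))
lam = median_heuristic(Omega)

# varphi is the uniform probability measure on Omega; a quadrature rule is a sparse
# measure u = sum_s w_s * delta_{q_s}, with q_s in Omega, that should act like varphi
# on the unit ball of the RKHS H_k.
```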

In addition to GRIM, we also implement a modified version of GRIM, which we denote GRIM ++ opt. The ’++ opt’ refers to applying the convex optimisation detailed in [HLO21] to the weights returned by GRIM at each step. The performance of GRIM and GRIM ++ opt is compared with the performance of the methods N. ++ emp, N. ++ emp ++ opt, Monte Carlo, iid Bayes, Thinning, Thin ++ opt, Herding, and Herd ++ opt considered in [HLO21]. Details of these methods may be found in [HLO21] and the references therein. We make extensive use of the Python code associated with [HLO21] available via GitHub (Convex Kernel Quadrature GitHub).

We implement GRIM under the condition that, at each step, 4 new functions from Σ\Sigma are added to the collection of functions over which we require the approximation to coincide with the target φ\varphi. The performance of each approximation is measured by its Worst Case Error (WCE) with respect to φ\varphi over the RKHS k{\cal H}_{k}. This is defined as

WCE(u,φ,k):=supf𝔹¯k(0,1)|u(f)φ(f)|=supf𝔹¯k(0,1)|Ωf(x)𝑑u(x)Ωf(x)𝑑φ(x)|,\text{WCE}(u,\varphi,{\cal H}_{k}):=\sup_{f\in{\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1)}\left|u(f)-\varphi(f)\right|=\sup_{f\in{\overline{\mathbb{B}}}_{{\cal H}_{k}}(0,1)}\left|\int_{\Omega}f(x)du(x)-\int_{\Omega}f(x)d\varphi(x)\right|, (7.5)

and may be explicitly computed in this setting by the formula provided in [HLO21]. For each method we record the average log(WCE(u,φ,k)2)\log(\text{WCE}(u,\varphi,{\cal H}_{k})^{2}) over 20 trials. The results are illustrated in Figures 1 and 2.
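Since both uu and φ\varphi are discrete measures here, the worst case error admits a closed form. The sketch below uses the standard RKHS identity WCE² = ∬ k d(u−φ) d(u−φ) together with the rbf_kernel helper from the previous sketch; it is an illustration of this computation, not a reproduction of the exact routine used in [HLO21].

```python
def wce_squared(w, Q, Omega, lam):
    # Worst case error squared of u = sum_s w[s] * delta_{Q[s]} against the uniform
    # measure on Omega, expanding WCE^2 = iint k d(u - varphi) d(u - varphi) as
    #   w^T K(Q,Q) w  -  (2/N) w^T K(Q,Omega) 1  +  (1/N^2) 1^T K(Omega,Omega) 1.
    # Uses rbf_kernel from the previous sketch.
    N = len(Omega)
    term1 = w @ rbf_kernel(Q, Q, lam) @ w
    term2 = 2.0 * w @ rbf_kernel(Q, Omega, lam).sum(axis=1) / N
    term3 = rbf_kernel(Omega, Omega, lam).sum() / N**2  # compute blockwise for large Omega
    return term1 - term2 + term3
```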

For both the 3D Road Network and the Combined Cycle Power Plant data sets, the novel recombination-based convex kernel quadrature method developed by Satoshi Hayakawa, the first author, and Harald Oberhauser in [HLO21] and our GRIM approach comfortably outperform the other methods, each of which is either purely growth-based or purely thinning-based. Moreover, the convex kernel quadrature method of [HLO21] is specifically tailored to the task of kernel quadrature. Whilst being significantly slower, GRIM ++ opt nevertheless matches the performance of N. ++ emp ++ opt despite not being specially designed for the task of kernel quadrature. Further, even without the additional ’++ opt’ convex optimisation step, GRIM remains within the same class of performance as the N. ++ emp ++ opt method.

Figure 1: 3D Road Network Results - The average log(WCE(u,φ,k)2)\log(\text{WCE}(u,\varphi,{\cal H}_{k})^{2}) over 20 trials is plotted for each method. The shaded regions show their standard deviations.
Figure 2: Combined Cycle Power Plant Results - The average log(WCE(u,φ,k)2)\log(\text{WCE}(u,\varphi,{\cal H}_{k})^{2}) over 20 trials is plotted for each method. The shaded regions show their standard deviations.

The third and final task we consider is a machine learning inference acceleration task motivated by [JLNSY17]. The problem is outside the Hilbert space framework of the previous examples. We consider the path signature based Landmark Human Action Recognition (LHAR) model developed in [JLNSY17].

This model is trained to determine an action from video clips of the evolution of 15-40 markers placed on a person's body. It utilises signatures of streams generated by the locations of these markers. An introduction to the use of path signatures in machine learning contexts can be found in the survey papers [CK16] and [LM22]. We do not assume any familiarity with signatures and treat them as a ’black box’ tool used by the LHAR model of [JLNSY17].

We consider the LHAR model from [JLNSY17] on the JHMDB data set [BGJSZ13] consisting of 928 clips of 21 distinct actions, with each clip containing 15-40 frames. Following [JLNSY17], the dataset is split into 660 clips to be used for training and 268 clips to be used for testing.

The pipeline for the LHAR model from [JLNSY17] can be summarised as

ClipSignatures430920Linear Map21Bias21Softmax{0,,20}.\text{Clip}\quad\stackrel{{\scriptstyle\text{Signatures}}}{{\rightarrow}}\quad{\mathbb{R}}^{430920}\quad\stackrel{{\scriptstyle\text{Linear Map}}}{{\rightarrow}}\quad{\mathbb{R}}^{21}\quad\stackrel{{\scriptstyle\text{Bias}}}{{\rightarrow}}\quad{\mathbb{R}}^{21}\quad\stackrel{{\scriptstyle\text{Softmax}}}{{\rightarrow}}\quad\{0,\ldots,20\}. (7.6)

A variety of truncated path signatures are computed from augmentations of the stream of marker locations provided by the clip; see [JLNSY17] for full details. The result is that each clip is transformed into a vector of N:=430920N:=430920 features, i.e. an element in 430920=N{\mathbb{R}}^{430920}={\mathbb{R}}^{N}. Let Ωtrain430920\Omega_{{\mathrm{train}}}\subset{\mathbb{R}}^{430920} denote the collection of these vectors for the 660 training clips, and Ωtest430920\Omega_{{\mathrm{test}}}\subset{\mathbb{R}}^{430920} denote the collection of these vectors for the 268 testing clips. A single-hidden-layer neural network is trained as the classifier outputting a probability distribution given by a Softmax over the 21 class labels {0,,20}\{0,\ldots,20\}.

Let A:N21A:{\mathbb{R}}^{N}\to{\mathbb{R}}^{21} denote the Linear Map part of the model in (7.6). For later convenience, given any map v:N21v:{\mathbb{R}}^{N}\to{\mathbb{R}}^{21} we let [v]{\cal L}[v] denote the model from [JLNSY17] with the linear map AA replaced by vv (i.e. the pipeline described in (7.6) but with the Linear Map AA replaced by the function vv). We consider approximating the model [A]{\cal L}[A]. Our approach is to find an approximation vv of AA, and then use the model [v]{\cal L}[v] as the approximation of [A]{\cal L}[A]. For this purpose, we consider both N{\mathbb{R}}^{N} and 21{\mathbb{R}}^{21} to be equipped with their respective ll^{\infty} norms. We first observe that the task of approximating AA is within the mathematical framework of this paper.

Let (ai,j)i=1,j=121,N(a_{i,j})_{i=1,j=1}^{21,N} denote the coefficients of the matrix associated to AA, and for each i{1,,N}i\in\{1,\ldots,N\} define fiC0(N;21)f_{i}\in C^{0}\left({\mathbb{R}}^{N};{\mathbb{R}}^{21}\right) by, for x=(x1,,xN)Nx=(x_{1},\ldots,x_{N})\in{\mathbb{R}}^{N}, setting fi(x):=xi(a1,i,,a21,i)f_{i}(x):=x_{i}\left(a_{1,i},\ldots,a_{21,i}\right). Let e1,,e2121e_{1},\ldots,e_{21}\in{\mathbb{R}}^{21} be the standard basis of 21{\mathbb{R}}^{21} that is orthonormal with respect to the standard Euclidean dot product ,21\left<\cdot,\cdot\right>_{{\mathbb{R}}^{21}} on 21{\mathbb{R}}^{21}. For each pΩtrainp\in\Omega_{{\mathrm{train}}} and each j{1,,21}j\in\{1,\ldots,21\} let δp,j:C0(N;21)\delta_{p,j}:C^{0}({\mathbb{R}}^{N};{\mathbb{R}}^{21})\to{\mathbb{R}} be defined by δp,j[f]:=f(p),ej21\delta_{p,j}[f]:=\left<f(p),e_{j}\right>_{{\mathbb{R}}^{21}}. Then A=i=1NfiA=\sum_{i=1}^{N}f_{i} and hence, by choosing X:=C0(Ω;21)X:=C^{0}\left(\Omega;{\mathbb{R}}^{21}\right), 𝒩:=N(=430920){\cal N}:=N(=430920), φ:=A\varphi:=A, and Σ:={δp,j:pΩtrain and j{1,,21}}\Sigma:=\left\{\delta_{p,j}:p\in\Omega_{{\mathrm{train}}}\text{ and }j\in\{1,\ldots,21\}\right\}, we observe that we are within the framework for which GRIM is designed (cf. Section 2).
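The decomposition just described is straightforward to realise in code. The sketch below uses a placeholder array for the trained weight matrix; in practice the entries of AA are read off from the trained LHAR model.

```python
import numpy as np

# Placeholder for the trained weight matrix of the Linear Map in (7.6);
# rows index the 21 outputs, columns the N = 430920 signature features.
n_cls, N_feat = 21, 430920
A = np.zeros((n_cls, N_feat))

def f_i(i, x):
    # f_i(x) = x_i * (a_{1,i}, ..., a_{21,i}): feature i's contribution to A(x).
    return x[i] * A[:, i]

def delta_pj(p, j, f):
    # delta_{p,j}[f] = <f(p), e_j>: the j-th coordinate of f evaluated at the point p,
    # where f is any map from R^N to R^21.
    return f(p)[j]

# A(x) = sum_i f_i(i, x), so a sparse approximation of A amounts to keeping only a
# small subset of the f_i, which is exactly the problem GRIM is designed to address.
```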

To provide a benchmark we use the Scikit-Learn MultiTaskLasso [BBCDDGGMPPPPTVVW11] implementation to approximate φ\varphi. Several different choices for the regularisation parameter α\alpha are considered. For each α\alpha we consider the Scikit-Learn MultiTaskLasso [BBCDDGGMPPPPTVVW11] with the maximum number of iterations set to 10000. We record the number of non-zero weights in the resulting approximation uu, the training set accuracy achieved by [u]{\cal L}[u], and the testing set accuracy achieved by [u]{\cal L}[u]. These results are summarised in Table 2.
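A minimal sketch of this benchmark is given below, assuming the training features and the corresponding targets φ(p)=A(p) have been assembled into arrays; the names X_train and Y_train are illustrative.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

def lasso_benchmark(X_train, Y_train, alpha):
    # X_train: (660, 430920) signature features, Y_train: (660, 21) targets phi(p) = A(p).
    model = MultiTaskLasso(alpha=alpha, max_iter=10000)
    model.fit(X_train, Y_train)
    # MultiTaskLasso zeroes entire feature columns jointly across all 21 outputs,
    # so counting the retained features gives one natural count of non-zero weights.
    n_nonzero = int(np.any(model.coef_ != 0.0, axis=0).sum())
    return model, n_nonzero

# for alpha in (1e-4, 2e-4, 3e-4, 4e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2):
#     model, n_nonzero = lasso_benchmark(X_train, Y_train, alpha)
```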

Alpha Number of Non-zero Weights Train Accuracy (%) (2.d.p) Test Accuracy (%) (2.d.p)
0.0001 5464 100.00 81.34
0.0002 4611 100.00 82.09
0.0003 4217 100.00 83.21
0.0004 4021 100.00 82.84
0.0005 4009 100.00 82.46
0.001 3913 100.00 82.46
0.005 3157 99.70 81.72
0.01 2455 99.24 81.72
0.05 1012 95.30 79.85
Table 2: Each row contains information corresponding to one application of MultiTaskLasso [BBCDDGGMPPPPTVVW11] to find an approximation uu of φ\varphi. The model [u]{\cal L}[u] then gives an approximation to the model [φ]{\cal L}[\varphi] from [JLNSY17] on the JHMDB dataset [BGJSZ13]. The first column records the value of the regularisation parameter α\alpha for this application of MultiTaskLasso. The second column records the number of non-zero weights appearing in the returned approximation uu. The third column records the accuracy achieved on the training set (to 2 decimal places) by the model [u]{\cal L}[u]. The fourth column records the accuracy achieved on the test set (to 2 decimal places) by the model [u]{\cal L}[u].

In terms of the test accuracy score achieved by the model [u]{\cal L}[u], the choice of α:=3×104\alpha:=3\times 10^{-4} performed best. For this choice of α\alpha the corresponding model [u]{\cal L}[u] achieved an accuracy score of 83.21%83.21\% (2.d.p) on the test set whilst only using 4217 non-zero weights. In comparison, the original model [φ]{\cal L}[\varphi] developed in [JLNSY17] uses 430920 non-zero weights, and achieves an accuracy score of 83.58%83.58\% (2.d.p) on the test set.

With the Lasso performance acting as a benchmark, we consider applying GRIM to approximate φ\varphi. We consider several choices for the number of new points from Σ\Sigma to be added to the collection of points over which we require the approximation to coincide with the target φ\varphi. We fix the shuffle number for each step as 16. We apply GRIM to approximate φ\varphi over Ωtrain\Omega_{{\mathrm{train}}}, i.e. we use only the training set. The test set is not used for constructing the approximation uu, and is only used to record the accuracy achieved by model generated by the approximation [u]{\cal L}[u].

We make the following adaptation to the GRIM algorithm for this task. Our aim is to find an approximation uu of φ\varphi that is close in a pointwise sense to φ\varphi throughout Ωtrain\Omega_{{\mathrm{train}}}. Instead of considering each linear functional in Σ\Sigma individually, we consider each collection Δp:={δp,j:j{1,,21}}\Delta_{p}:=\left\{\delta_{p,j}:j\in\{1,\ldots,21\}\right\} for a point pΩtrainp\in\Omega_{{\mathrm{train}}}. We modify the Banach Extension Step (cf. Section 4) to the Modified Extension Step below.

Modified Extension Step

Assume that LΣL^{\prime}\subset\Sigma. Let uSpan()u\in\operatorname{Span}({\cal F}). Let m1m\in{\mathbb{Z}}_{\geq 1} such that #(L)+21m#(Σ)\#(L^{\prime})+21m\leq\#(\Sigma). Take

σ1:=argmax{|σ(φu)|:σΣ}.\sigma_{1}:=\operatorname{argmax}\left\{|\sigma(\varphi-u)|:\sigma\in\Sigma\right\}. (7.7)

Let p1Ωtrainp_{1}\in\Omega_{{\mathrm{train}}} denote the point for which σ1Δp1\sigma_{1}\in\Delta_{p_{1}}.

Inductively for j=2,,mj=2,\ldots,m take

σj:=argmax{|σ(φu)|:σΣ(Δp1Δpj1)}\sigma_{j}:=\operatorname{argmax}\left\{|\sigma(\varphi-u)|:\sigma\in\Sigma\setminus(\Delta_{p_{1}}\cup\dots\cup\Delta_{p_{j-1}})\right\} (7.8)

and let pjΩtrainp_{j}\in\Omega_{{\mathrm{train}}} denote the point for which σjΔpj\sigma_{j}\in\Delta_{p_{j}}.

Once the points p1,,pmΩtrainp_{1},\ldots,p_{m}\in\Omega_{{\mathrm{train}}} have been selected, extend LL^{\prime} to L:=LΔp1ΔpmL:=L^{\prime}\cup\Delta_{p_{1}}\cup\dots\cup\Delta_{p_{m}}. Since u(p)φ(p)l(21)=max{|σ(uφ)|:σΔp}||u(p)-\varphi(p)||_{l^{\infty}({\mathbb{R}}^{21})}=\max\left\{|\sigma(u-\varphi)|:\sigma\in\Delta_{p}\right\}, we observe that the Modified Extension Step is equivalent to taking

p1:=argmax{u(p)φ(p)l(21):pΩtrain},p_{1}:=\operatorname{argmax}\left\{||u(p)-\varphi(p)||_{l^{\infty}({\mathbb{R}}^{21})}:p\in\Omega_{{\mathrm{train}}}\right\}, (7.9)

then inductively taking for j=2,,mj=2,\ldots,m the point

pj:=argmax{u(p)φ(p)l(21):pΩtrain{p1,,pj1}},p_{j}:=\operatorname{argmax}\left\{||u(p)-\varphi(p)||_{l^{\infty}({\mathbb{R}}^{21})}:p\in\Omega_{{\mathrm{train}}}\setminus\{p_{1},\ldots,p_{j-1}\}\right\}, (7.10)

and then extending LL^{\prime} to L:=LΔp1ΔpmL:=L^{\prime}\cup\Delta_{p_{1}}\cup\dots\cup\Delta_{p_{m}}. Consequently, replacing the Banach Extension Step with the Modified Extension Step in the Banach GRIM algorithm in Section 4 results in an algorithm in which a collection of points, at which we require an approximation uu of φ\varphi to coincide with φ\varphi, is inductively greedily grown. It is important to keep in mind that each newly selected point pΩtrainp\in\Omega_{{\mathrm{train}}} corresponds to choosing 21 new linear functionals in Σ\Sigma. Thus the restriction on the total number of points KK to possibly be selected is that 21K#(Σ)21K\leq\#(\Sigma), which is equivalent to requiring K#(Ωtrain)=660K\leq\#(\Omega_{{\mathrm{train}}})=660. It is this modification of the Banach GRIM algorithm (cf. Section 4) that we consider for the task of approximating the linear map φ:N21\varphi:{\mathbb{R}}^{N}\to{\mathbb{R}}^{21}.
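A sketch of one Modified Extension Step in this pointwise form is given below, assuming the values of the current approximation uu and of the target φ\varphi have been pre-computed on Ωtrain\Omega_{{\mathrm{train}}}; the array names are illustrative.

```python
import numpy as np

def modified_extension_step(selected, u_vals, phi_vals, m):
    # One Modified Extension Step in the pointwise form (7.9)-(7.10).
    #   selected : indices of points of Omega_train already in L' (via their 21 functionals)
    #   u_vals   : (660, 21) values u(p) of the current approximation on Omega_train
    #   phi_vals : (660, 21) values phi(p) of the target on Omega_train
    #   m        : number of new points to add at this step
    errors = np.max(np.abs(u_vals - phi_vals), axis=1)   # l^infty error at each point
    errors[selected] = -np.inf                            # never re-select a point
    new_points = np.argsort(errors)[::-1][:m]             # m largest pointwise errors
    return list(selected) + list(new_points)
```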

Our aim here is that the model [u]{\cal L}[u] achieves a similar accuracy score as the original model [φ]{\cal L}[\varphi] from [JLNSY17]. Consequently, we alter the termination criteria to better reflect this goal. The model from [JLNSY17] achieves an accuracy score of 83.58%83.58\% on the test set. We choose δ:=103\delta:=10^{-3} and terminate if the approximation generates a model achieving an accuracy score of at least (1δ)(1-\delta) times the accuracy score of the original model on the testing set. Hence we require the approximation to generate a model achieving a testing set accuracy score of at least 83.50%83.50\% to 2 decimal places. We further record the accuracy achieved on the training set for comparison with the 100%100\% accuracy achieved by the original model here.

For each choice of number of points to add at each step, we run the GRIM algorithm (modified as outlined above) three times. We record the minimum number of non-zero weights in a returned approximation whose generated model achieves the required test set accuracy score. Subsequently, we record the training accuracy score and testing accuracy score of the models found by each run using this number of non-zero weights. In addition to recording these values for the best performing model, we record the variance of the training accuracy score and testing accuracy score over the three runs. The results are summarised in Table 3 below.

Points Added Per Step Number of Non-zero Weights Train Accuracy (%): Best, Variance Test Accuracy (%): Best, Variance
10 2101 100.00 0.04 83.58 1.12
20 2941 100.00 0.00 83.58 0.09
30 2521 100.00 0.02 83.58 0.49
40 2521 100.00 0.01 84.33 1.33
50 2101 99.55 0.04 83.96 0.65
Table 3: Each row contains information corresponding to our application of GRIM to approximate the action recognition model from [JLNSY17] on the JHMDB dataset [BGJSZ13]. The first column records the number of new points added at each step of GRIM. The second column records the minimum number of non-zero weights appearing in a returned approximation uu generating a model [u]{\cal L}[u] achieving a testing set accuracy score of at least 83.50%83.50\% (2.d.p). The third column summarises the training accuracy scores achieved during the three runs for each choice of number of points to add per step. The first half of the third column records the best training accuracy score (to 2 decimal places) achieved by the model [u]{\cal L}[u] for an approximation uu of φ\varphi returned by one of the runs of GRIM. The second half of the third column records the variance (to 2 decimal places) of the training accuracy scores achieved by the models [u]{\cal L}[u] for the approximations uu of φ\varphi returned by each of the three runs of GRIM. The fourth column summarises the testing accuracy scores achieved during the three runs for each choice of number of points to add per step. The first half of the fourth column records the best test accuracy score (to 2 decimal places) achieved by the model [u]{\cal L}[u] for an approximation uu of φ\varphi returned by one of the runs of GRIM. The second half of the fourth column records the variance (to 2 decimal places) of the test accuracy scores achieved by the models [u]{\cal L}[u] for the approximations uu of φ\varphi returned by each of the three runs of GRIM.

For each choice of the number of new points to be added at each step, GRIM successfully finds an approximation uu of φ\varphi which both uses fewer non-zero weights and achieves a higher test accuracy score than the best performing model found via Lasso. The best performance, in terms of accuracy score on the test set, was achieved by GRIM with the choice of adding 40 new points at each step. The best performing model uses 2521 non-zero weights and achieves an accuracy score of 84.33%84.33\% (2.d.p) on the test set. This accuracy score is actually higher than the accuracy score of 83.58%83.58\% (2.d.p) achieved by the original model [φ]{\cal L}[\varphi] from [JLNSY17] on the test set.

References

  • [ACO20] A. Abate, F. Cosentino, and H. Oberhauser, A randomized algorithm to reduce the support of discrete measures, In Advances in Neural Information Processing Systems, Volume 33, pages 15100-15110, 2020.
  • [ABDHP21] D. Alistarh, T. Ben-Nun, N. Dryden, T. Hoefler and A. Peste, Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks, J. Mach. Learn. Res., 22(241), 1-124, 2021.
  • [ABCGMM18] J.-P. Argaud, B. Bouriquet, F. de Caso, H. Gong, Y. Maday and O. Mula, Sensor Placement in Nuclear Reactors Based on the Generalized Empirical Interpolation Method, J. Comput. Phys., vol. 363, pp. 354-370, 2018.
  • [ABGMM16] J.-P. Argaud, B. Bouriquet, H. Gong, Y. Maday and O. Mula, Stabilization of (G)EIM in Presence of Measurement Noise: Application to Nuclear Reactor Physics, In Spectral and High Order Methods for Partial Differential Equations-ICOSAHOM 2016, vol. 119 of Lect. Notes Comput. Sci. Eng., pp.133-145, Springer, Cham, 2017.
  • [AK13] M. G. Augasta and T. Kathirvalavakumar, Pruning algorithms of neural networks - A comparative study, Open Computer Science 3, p.105-115, 2013.
  • [BLL15] Francis Bach, Simon Lacoste-Julien and Fredrik Lindsten, Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering, In Artificial Intelligence and Statistics, pages 544-552, PMLR, 2015.
  • [BMPSZ08] F. R. Bach, J. Mairal, J. Ponce, G. Sapiro and A. Zisserman, Discriminative learned dictionaries for local image analysis, Computer Vision and Pattern Recognition (CVPR), pp.1-8, IEEE 2008.
  • [BMPSZ09] F. R. Bach, J. Mairal, J. Ponce, G. Sapiro and A. Zisserman, Supervised dictionary learning, Advances in Neural Information Processing Systems, pp.1033-1040, 2009.
  • [BMNP04] M. Barrault, Y. Maday, N. C. Nguyen and A. T. Patera, An Empirical Interpolation Method: Application to Efficient Reduced-Basis Discretization of Partial Differential Equations, C. R. Acad. Sci. Paris, Série I., 339, pp.667-672, 2004
  • [BS18] A. G. Barto and R. S. Sutton, Reinforcement learning: an introduction, (2nd). Cambridge, MA, USA: MIT Press.
  • [BT11] Alain Berlinet and Christine Thomas-Agnan, Reproducing kernel Hilbert spaces in probability and statistics, Springer Science & Business Media, 2011.
  • [BBCDDGGMPPPPTVVW11] M. Blondel, M. Brucher, D. Cournapeau, V. Dubourg, E. Duchesnay, A. Gramfort, O. Grisel, V. Michel, A. Passos, F. Pedregosa, M. Perrot, P. Prettenhofer, B. Thirion, J. Vanderplas, G. Varoquaux and R. Weiss, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, 2825-2830, 2011.
  • [BCGMO18] F. -X. Briol, W. Y. Chen, J. Gorham, L. Mackey and C. J. Oates, Stein Points, In Proceedings of the 35th International Conference on Machine Learning, Volume 80, pp. 843-852, PMLR, 2018.
  • [BGJSZ13] M. Black, J. Gall, H. Jhuang, C. Schmid and S. Zuffi, Towards understanding action recognition, In IEEE International Conference on Computer Vision (ICCV), pp. 3192-3199, 2013.
  • [CT05] E. Candès and T. Tao, Decoding by linear programming, IEEE Trans. Inform. Theory, 51, 2005.
  • [CRT06] E. Candès, J. Romberg and T. Tao, Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information, IEEE Trans. Inform. Theory, 52, pp. 489-509, 2006.
  • [Car81] B. Carl, Entropy numbers of diagonal operators with an application to eigenvalue problems, J. Approx. Theory, 32, pp.135-150, 1981.
  • [CS90] B. Carl and I. Stephani, Entropy, Compactness, and the Approximation of Operators, Cambridge University Press, Cambridge, UK, 1990.
  • [CHXZ20] L. Chen, A. Huang, S. Xu and B. Zhang, Convolutional Neural Network Pruning: A Survey, 39th Chinese Control Conference (CCC), IEEE, p.7458-7463, 2020.
  • [CSW11] Zhixiang Chen, Baohuai Sheng and Jianli Wang, The Covering Number for Some Mercer Kernel Hilbert Spaces on the Unit Sphere, Taiwanese J. Math. 15(3), p.1325-1340, 2011.
  • [CSW10] Y. Chen, A. Smola and M. Welling, Super-Samples from Kernel Herding, In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010.
  • [CS08] A. Christmann and I. Steinwart, Support Vector Machines, Springer Science & Business Media, 2008.
  • [CM73] Jon F. Claerbout and Francis Muir, Robust Modeling with Erratic Data, Geophysics, Vol. 38, No. 5, pp. 826-844, 1973.
  • [CK16] Ilya Chevyrev and Andrey Kormilitzin, A Primer on the Signature Method in Machine Learning, arXiv preprint, 2016. https://arxiv.org/abs/1603.03788
  • [DGOY08] J. Darbon, D. Goldfarb, S. Osher and W. Yin, Bregman Iterative Algorithms for L1 Minimization with Applications to Compressed Sensing, SIAM J. Imaging Sci. 1, pp.143-168, 2008.
  • [Don06] D. Donoho, Compressed Sensing, IEEE Trans. Inform. Theory, 52, pp. 1289-1306, 2006.
  • [DE03] D. Donoho and M. Elad, Optimally Sparse Representation in General (Nonorthogonal) Dictionaries via $l^{1}$ Minimization, Proc. Natl. Acad. Sci. USA, 100, pp.2197-2202, 2003.
  • [DET06] D. L. Donoho, M. Elad and V. Temlyakov, Stable recovery of sparse overcomplete representations in the presence of noise, IEEE Transactions on Information Theory, 52(1): 6-18, 2006.
  • [DM21a] Raaz Dwivedi and Lester Mackey, Kernel Thinning, Proceedings of Machine Learning Research vol. 134:1–1, 2021.
  • [DM21b] Raaz Dwivedi and Lester Mackey, Generalized Kernel Thinning, Published in ICLR 2022.
  • [DMS21] Raaz Dwivedi, Lester Mackey and Abhishek Shetty, Distribution Compression in Near-Linear Time, Published in ICLR 2022.
  • [DF15] Chris Dyer and Manaal Faruqui, Non-distributional Word Vector Representations, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Beijing, China, Volume 2: Short Papers, pp.464-469, 2015.
  • [ET96] D. E. Edmunds and H. Triebel, Function Spaces, Entropy Numbers, and Differential Operators, Cambridge University Press, Cambridge, UK, 1996.
  • [Ela10] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, ISBN 978-1441970107, 2010.
  • [EMS08] M. Elad, J. Mairal and G. Sapiro, Sparse representation for color image restoration, IEEE Transactions on Image Processing, 17(1), 53-69, 2008.
  • [EJOPR20] E. Elsen, S. Jayakumar, S. Osindero, R. Pascanu and J. Rae, Top-KAST: Top-K always sparse training, Advances in Neural Information Processing Systems, 33:20744-20754, 2020.
  • [EPP00] T. Evgeniou, M. Pontil and T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math., 13, pp.1-50, 2000.
  • [FHLLMMPSWY23] Meng Fang, Tianjin Huang, Gen Li, Shiwei Liu, Xiaolong Ma, Vlado Menkovski, Mykola Pechenizkiy, Li Shen, Zhangyang Wang and Lu Yin, Dynamic Sparsity Is Channel-Level Sparsity Learner, NeurIPS 2023.
  • [FJM19] Dan Feldman, Ibrahim Jubran and Alaa Maalouf, Fast and Accurate Least-Mean-Squares Solvers, Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019.
  • [FJM22] Dan Feldman, Ibrahim Jubran and Alaa Maalouf, Fast and Accurate Least-Mean-Squares Solvers for High Dimensional Data, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 9977-9994, 2022.
  • [FS21] S. Fischer and I. Steinwart, A closer look at covering number bounds for Gaussian kernels, Journal of Complexity, Volume 62, 2021.
  • [FMMT14] Alona Fyshe, Tom M. Mitchell, Brian Murphy and Partha P. Talukdar, Interpretable semantic Vectors from a joint model of brain-and text-based meaning, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Baltimore, Maryland, Volume 1: Long Papers, pp.489-499, 2014.
  • [GMSWY09] A. Ganesh, Y. Ma, S. S. Sastry, J. Wright and A. Yang, Robust face recognition via sparse representation, TPAMI 31(2), 210-227, 2009.
  • [GLMNS18] M. Gibescu, A. Liotta, D. C. Mocanu, E. Mocanu, P. H. Nguyen and P. Stone, Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science, Nature communications, 9(1):2383, 2018.
  • [GLSWZ21] John Glossner, Tailin Liang, Shaobo Shi, Lei Wang and Xiaotong Zhang, Pruning and quantization for deep neural network acceleration: A survey, Neurocomputing, Volume 461, pages 370-403, 2021.
  • [GO09] T. Goldstein and S. Osher, The Split Bregman Method for L1-Regularized Problems, SIAM J. Imaging Sci. 2, pp.323-343, 2009.
  • [GMNP07] M. A. Grepl, Y. Maday, N. C. Nguyen and A. T. Patera, Efficient Reduced-Basis Treatment of Nonaffine and Nonlinear Partial Differential Equations, M2AN (Math. Model. Numer. Anal.), 2007.
  • [GKT12] F. S. Gürgen, H. Kaya and P. Tüfekci, Local and global learning methods for predicting power of a combined gas & steam turbine, In Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering, pages 13-18, 2012.
  • [HLO21] Satoshi Hayakawa, Terry Lyons and Harald Oberhauser, Positively Weighted Kernel Quadrature via Subsampling, In Advances in Neural Information Processing Systems (NeurIPS 2022).
  • [HS19] J. Hernandez-Garcia and R. Sutton, Learning Sparse Representations Incrementally in Deep Reinforcement Learning, https://arxiv.org/abs/1912.04002, [cs.LG], 9 Dec 2019.
  • [HLLL18] J. Huang, H. Lian, H. Lin and S. Lv, Oracle Inequalities for Sparse Additive Quantile Regression in Reproducing Kernel Hilbert Space, Annals of Statistics 46, p.781-813, 2018.
  • [JLPST21] S. Jayakumar, P. E. Latham, R. Pascanu, J. Schwarz and Y. Teh, Powerpropagation: A sparsity inducing weight reparameterisation, Advances in Neural Information Processing Systems, 34:28889-28903, 2021.
  • [JKY13] C. S. Jensen, M. Kaul and B. Yang, Building accurate 3d spatial networks to enable next generation intelligent transportation systems, In 2013 IEEE 14th International Conference on Mobile Data Management, Volume 1, pages 137-146, IEEE, 2013.
  • [JJWWY20] C. Jin, M. I. Jordan, M. Wang, Z. Wang and Z. Yang, On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada, 2020
  • [JLNSY17] Lianwen Jin, Terry Lyons, Hao Ni, Cordelia Schmid and Weixin Yang, Developing the Path Signature Methodology and its Application to Landmark-based Human Action Recognition, In: Yin, G., Zariphopoulou, T. (eds) Stochastic Analysis, Filtering and Stochastic Optimization, Springer, Cham. 2022. https://doi.org/10.1007/978-3-030-98519-6_18
  • [KK20] M. Kalinić and P. Kómár, Denoising DNA encoded library screens with sparse learning, ACS Combinatorial Science, Vol. 22, no. 8, pp.410-421, 2020.
  • [Kol56] A. N. Kolmogorov, Asymptotic characteristics of some completely bounded metric spaces, Dokl. Akad. Nauk. SSSR, 108, pp.585-589, 1956.
  • [Kuh11] T. Kühn, Covering numbers of Gaussian reproducing kernel Hilbert spaces, J. Complexity, 27, pp.489-499, 2011.
  • [LY07] Y. Lin and M. Yuan, On the Non-Negative Garrote Estimator, J. R. Stat. Soc. Ser. B 69, pp. 143-161, 2007
  • [LL16] W. Lee and T. Lyons, The Adaptive Patched Cubature Filter and its Implementation, Communications in Mathematical Sciences, 14(3), pp.799-829, 2016.
  • [LL99] W. V. Li and W. Linde, Approximation, metric entropy and small ball estimates for Gaussian measures, Ann. Probab., 27, pp.1556-1578, 1999.
  • [LL12] C. Litterer and T. Lyons, High order recombination and an application to cubature on Wiener space, The Annals of Applied Probability, Vol. 22, No. 4, 1301-1327, 2012.
  • [LMPY21] S. Liu, D. C. Mocanu, M. Pechenizkiy and L. Yin, Do we actually need dense over-parameterization? In-time over-parameterization in sparse training, In Proceedings of the 38th International Conference on Machine Learning, pages 6989-7000, PMLR, 2021.
  • [LP04] H. Luschgy and G. Pagès, Sharp asymptotics of the Kolmogorov entropy for Gaussian measures, J. Funct. Anal., 212, pp.89-120, 2004.
  • [LM22] Terry Lyons and Andrew D. McLeod, Signature Methods in Machine Learning, arXiv preprint, 2022. https://arxiv.org/abs/2206.14674
  • [MM13] Y. Maday, O. Mula, A generalized empirical interpolation method: Application of reduced basis techniques to data assimilation, F. Brezzi, P. Colli Franzone, U. Gianazza, G. Gilardi (Eds.), Analysis and Numerics of Partial Differential Equations, Vol. 4 of Springer INdAM Series, Springer Milan, 2013, pp. 221–235.
  • [MMPY15] Y. Maday, O. Mula, A. T. Patera and M. Yano, The Generalized Empirical Interpolation Method: Stability Theory On Hilbert Spaces With An Application To Stokes Equation, Comput. Methods Appl. Mech. Engrg. 287, 310-334, 2015.
  • [MMT14] Y. Maday, O. Mula and G. Turinici, Convergence analysis of the Generalized Empirical Interpolation Method, SIAM J. Numer. Anal., 54(3) 1713-1731, 2014.
  • [MNPP09] Y. Maday, N. C. Nguyen, A. T. Patera and G. S. H. Pau, A General Multipurpose Interpolation Procedure: The Magic Points, Commun. Pure Appl. Anal., 81, pp. 383-404, 2009.
  • [MZ93] S. G. Mallat and Z. Zhang, Matching Pursuits with Time-Frequency Dictionaries, IEEE Transactions on Signal Processing, 12: 3397-3415, 1993.
  • [MRT12] Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, Foundations of Machine Learning, Massachusetts: MIT Press, USA, 2012
  • [NT09] D. Needell and J. A. Tropp, CoSaMP: Iterative signal recovery from incomplete and inaccurate samples, Applied and Computational Harmonic Analysis, 26(3): 301-321, 2009.
  • [NPS22] M. Nikdast, S. Pasricha and F. Sunny, SONIC: A Sparse Neural Network Inference Accelerator with Silicon Photonics for Energy-Efficient Deep Learning, 27th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 214-219, 2022.
  • [NS21] S. Ninomiya and Y. Shinozaki, On implementation of high-order recombination and its application to weak approximations of stochastic differential equations, In Proceedings of the NFA 29th Annual Conference, 2021.
  • [PTT22] Sebastian Pokutta, Ken’ichiro Tanaka and Kazuma Tsuji, Sparser Kernel Herding with Pairwise Conditional Gradients without Swap Steps, https://arxiv.org/abs/2110.12650, [math.OC], 8 February 2022.
  • [Ree93] R. Reed, Pruning Algorithms - A Survey, IEEE Transactions on Neural Networks 4, pp. 740-747, 1993.
  • [SS86] Fadil Santosa and William W. Symes, Linear Inversion of Band-Limited Reflection Seismograms, SIAM J. Sci. Stat. Comput., Vol. 7, No. 4, 1986.
  • [SSW01] B. Schölkopf, A. J. Smola and R. C. Williamson, Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators, IEEE Trans. Inform. Theory, 47, pp.2516-2532, 2001.
  • [Ste03] I. Steinwart, Entropy numbers of convex hulls and an application to learning algorithms, Arch. Math., 80, pp.310-318, 2003.
  • [SS07] I. Steinwart and C. Scovel, Fast rates for support vector machines using Gaussian kernels, Ann. Statist. 35 (2), 2007.
  • [SS12] I. Steinwart and C. Scovel, Mercer’s theorem on general domains: On the interaction between measures, kernels, and RKHSs, Constructive Approximation, 35(3):363-417, 2012.
  • [Str71] A. H. Stroud, Approximate calculation of multiple integrals, Series in Automatic Computation, Englewood Cliffs, NJ: Prentice-Hall, 1971.
  • [Suz18] T. Suzuki, Fast Learning Rate of Non-Sparse Multiple Kernel Learning and Optimal Regularization Strategies, Electronic Journal of Statistics 12, p.2141-2192, 2018.
  • [SS13] M. Sugiyama and T. Suzuki, Fast Learning Rate of Multiple Kernel Learning: Tradeoff Between Sparsity and Smoothness, Annals of Statistics 41, p.1381-1405, 2013.
  • [TT21] Ken’ichiro Tanaka and Kazuma Tsuji, Acceleration of the Kernel Herding Algorithm by Improved Gradient Approximation, https://arxiv.org/abs/2105.07900, [math.NA], 17 May 2021.
  • [Tch15] M. Tchernychova, Carathéodory cubature measures, PhD thesis, University of Oxford, https://ora.ox.ac.uk/objects/uuid:a3a10980-d35d-467b-b3c0-d10d2e491f2d 2015.
  • [TW19] G.-L. Tian and M. Wang, Adaptive Group LASSO for High-Dimensional Generalized Linear Models, Stat Papers 60, pp. 1469-1486, 2019.
  • [Tib96] Robert Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society. Series B (Methodological), Vol. 58, No. 1, pp. 267-288, 1996.
  • [Tro06] J. A. Tropp, Just relax: Convex programming methods for identifying sparse signals in noise, IEEE Transactions on Information Theory, 52(3): 1030-1051, 2006.
  • [Wel09a] M. Welling, Herding Dynamical Weights to Learn, In Proceedings of the 21st International Conference on Machine Learning, Montreal, Quebec, CAN, 2009.
  • [Wel09b] M. Welling, Herding Dynamic Weights for Partially Observed Random Field Models, In Proc. of the Conf. on Uncertainty in Artificial Intelligence, Montreal, Quebec, CAN, 2009.
  • [XZ16] Y. Xiang and C. Zhang, On the Oracle Property of Adaptive Group LASSO in High-Dimensional Linear Models, Stat. Pap. 57, pp. 249-265, 2016.
  • [Zho02] D. -X. Zhou, The covering number in learning theory, J. Complexity, 18, pp.739-767, 2002.

University of Oxford, Andrew Wiles Building, Radcliffe Observatory Quarter, Woodstock Rd, Oxford, OX2 6GG, UK.
TL: [email protected], https://www.maths.ox.ac.uk/people/terry.lyons
AM: [email protected], https://www.maths.ox.ac.uk/people/andrew.mcleod