
Nonparametric estimation of continuous DPPs
with kernel methods

Michaël Fanuel and Rémi Bardenet
Université de Lille, CNRS, Centrale Lille
UMR 9189 – CRIStAL, F-59000 Lille, France
{michael.fanuel, remi.bardenet}@univ-lille.fr
Abstract

Determinantal point processes (DPPs) are statistical models for repulsive point patterns. Both sampling and inference are tractable for DPPs, a rare feature among models with negative dependence that explains their popularity in machine learning and spatial statistics. Parametric and nonparametric inference methods have been proposed in the finite case, i.e. when the point patterns live in a finite ground set. In the continuous case, only parametric methods have been investigated, while nonparametric maximum likelihood for DPPs – an optimization problem over trace-class operators – has remained an open question. In this paper, we show that a restricted version of this maximum likelihood estimation (MLE) problem falls within the scope of a recent representer theorem for nonnegative functions in an RKHS. This leads to a finite-dimensional problem, with strong statistical ties to the original MLE. Moreover, we propose, analyze, and demonstrate a fixed point algorithm to solve this finite-dimensional problem. Finally, we also provide a controlled estimate of the correlation kernel of the DPP, thus improving interpretability.

1 Introduction

Determinantal point processes (DPPs) are a tractable family of models for repulsive point patterns, where interaction between points is parametrized by a positive semi-definite kernel. They were introduced by Macchi [1975] in the context of fermionic optics, and have gained a lot of interest since the 2000s, in particular in probability [Hough et al., 2006], spatial statistics [Lavancier et al., 2014], and machine learning [Kulesza and Taskar, 2012]. In machine learning at large, DPPs have been used essentially for two purposes: as statistical models for diverse subsets of items, like in recommendation systems [Gartrell et al., 2019], and as subsampling tools, like in experimental design [Derezinski et al., 2020], column subset selection [Belhadji et al., 2020b], or Monte Carlo integration [Gautier et al., 2019, Belhadji et al., 2019]. In this paper, we are concerned with DPPs used as statistical models for repulsion, and more specifically with inference for continuous DPPs.

DPP models in Machine Learning (ML) have so far mostly been finite DPPs: they are distributions over subsets of a (large) finite ground set, like subsets of sentences from a large corpus of documents [Kulesza and Taskar, 2012]. Since Affandi et al. [2014], a lot of effort has been put into designing efficient inference procedures for finite DPPs. In particular, the fixed point algorithm of Mariet and Sra [2015] allows for nonparametric inference of a finite DPP kernel, thus learning the features used for modelling diversity from the data. DPP models on infinite ground sets, say \mathbb{R}^{d}, while mathematically and algorithmically very similar to finite DPPs, have been less popular in ML than in spatial statistics. It is thus natural that work on inference for continuous DPPs has happened mostly in the latter community; see e.g. the seminal paper [Lavancier et al., 2014]. Inference for continuous DPPs has however focused on the parametric setting, where a handful of interpretable parameters are learned. Relatedly, spatial statisticians typically learn the correlation kernel of a DPP, which is more interpretable, while machine learners focus on the likelihood kernel, with structural assumptions to make learning scale to large ground sets.

In this paper, we tackle nonparametric inference for continuous DPPs using recent results on kernel methods. More precisely, maximum likelihood estimation (MLE) for continuous DPPs is an optimization problem over trace-class operators. Our first contribution is to show that a suitable modification of this problem is amenable to the representer theorem of Marteau-Ferey, Bach, and Rudi [2020]. Further drawing inspiration from the follow-up work [Rudi, Marteau-Ferey, and Bach, 2020], we derive an optimization problem over matrices, and we prove that its solution has a near optimal objective in the original MLE problem. We then propose, analyze, and demonstrate a fixed point algorithm for the resulting finite problem, in the spirit of the nonparametric inference for finite DPPs of Mariet and Sra [2015]. While our optimization pipeline focuses on the so-called likelihood kernel of a DPP, we also provide a controlled sampling approximation to its correlation kernel, thus making our estimated kernel operator more interpretable. A by-product contribution of independent interest is an analysis of a sampling approximation for Fredholm determinants.

The rest of the paper is organized as follows. Since the paper is notation-heavy, we first summarize our notation and give standard definitions in Section 1.1. In Section 2, we introduce DPPs and prior work on inference. In Section 3, we introduce our constrained maximum likelihood problem, and study its empirical counterpart. We analyze an algorithm to solve the latter in Section 4. Statistical guarantees are stated in Section 5, while Section 6 is devoted to numerically validating the whole pipeline. Our code is freely available at https://github.com/mrfanuel/LearningContinuousDPPs.jl.

1.1 Notation and background

Sets.

It is customary to define DPPs on a compact Polish space 𝒳\mathcal{X} endowed with a Radon measure μ\mu, so that we can define the space of square integrable functions L2(𝒳)L^{2}(\mathcal{X}) for this measure [Hough et al., 2009]. Outside of generalities in Section 2, we consider a compact \mathcal{X}\subset\mathbb{R}^{d} and μ\mu the uniform probability measure on 𝒳\mathcal{X}. Let (\mathcal{H},\langle\cdot,\cdot\rangle) be an RKHS of functions on 𝒳\mathcal{X} with a bounded continuous kernel k(x,y)k_{\mathcal{H}}(x,y), and let κ2=supx𝒳k(x,x)\kappa^{2}=\sup_{x\in\mathcal{X}}k_{\mathcal{H}}(x,x). Denote by ϕ(x)=k(x,)\phi(x)=k_{\mathcal{H}}(x,\cdot)\in\mathcal{H} the canonical feature map. For a Hilbert space \mathcal{F}, denote by 𝒮+()\mathcal{S}_{+}(\mathcal{F}) the space of symmetric and positive semi-definite trace-class operators on \mathcal{F}. By a slight abuse of notation, we denote by 𝒮+(n)\mathcal{S}_{+}(\mathbb{R}^{n}) the space of n×nn\times n real positive semi-definite matrices. Finally, all sets are denoted by calligraphic letters (e.g. 𝒞,\mathcal{C},\mathcal{I}).

Operators and matrices.

Trace-class endomorphisms of L2(𝒳)L^{2}(\mathcal{X}), seen as integral operators, are typeset as uppercase sans-serif (e.g. 𝖠,𝖪\mathsf{A},\mathsf{K}), and the corresponding integral kernels as lowercase sans-serif (e.g. 𝖺,𝗄\mathsf{a},\mathsf{k}). Notice that 𝗄(x,y)\mathsf{k}(x,y) and k(x,y)k_{\mathcal{H}}(x,y) are distinct functions. Other operators are written in standard fonts (e.g. A,SA,S), while we write matrices and finite-dimensional vectors in bold (e.g. 𝐊,𝐂,𝐯\mathbf{K},\mathbf{C},\mathbf{v}). The identity operator is written commonly as 𝕀\mathbb{I}, whereas the n×nn\times n identity matrix is denoted by 𝐈n\mathbf{I}_{n}. When 𝒞\mathcal{C} is a subset of {1,,n}\{1,\dots,n\} and 𝐊\mathbf{K} is an n×nn\times n matrix, the matrix 𝐊𝒞𝒞\mathbf{K}_{\mathcal{C}\mathcal{C}} is the square submatrix obtained by selecting the rows and columns of 𝐊\mathbf{K} indexed by 𝒞\mathcal{C}.

Restriction and reconstruction operators.

Following Rosasco et al. [2010, Section 3], we define the restriction operator S:L2(𝒳)S:\mathcal{H}\to L^{2}(\mathcal{X}) as (Sg)(x)=g(x).(Sg)(x)=g(x). Its adjoint S:L2(𝒳)S^{*}:L^{2}(\mathcal{X})\to\mathcal{H} is the reconstruction operator Sh=𝒳h(x)ϕ(x)dμ(x).S^{*}h=\int_{\mathcal{X}}h(x)\phi(x)\mathrm{d}\mu(x). The classical integral operator given by 𝖳kh=𝒳k(,x)h(x)dμ(x)\mathsf{T}_{k_{\mathcal{H}}}h=\int_{\mathcal{X}}k_{\mathcal{H}}(\cdot,x)h(x)\mathrm{d}\mu(x), seen as an endomorphism of L2(𝒳)L^{2}(\mathcal{X}), thus takes the simple expression 𝖳k=SS\mathsf{T}_{k_{\mathcal{H}}}=SS^{*}. Similarly, the so-called covariance operator C:C:\mathcal{H}\to\mathcal{H}, defined by C=𝒳ϕ(x)ϕ(x)¯dμ(x),C=\int_{\mathcal{X}}\phi(x)\otimes\overline{\phi(x)}\mathrm{d}\mu(x), writes C=SSC=S^{*}S. In the tensor product notation defining CC, ϕ(x)¯\overline{\phi(x)} is an element of the dual of \mathcal{H} and ϕ(x)ϕ(x)¯\phi(x)\otimes\overline{\phi(x)} is the endomorphism of \mathcal{H} defined by ((ϕ(x)ϕ(x)¯)(g)=g(x)ϕ(x)((\phi(x)\otimes\overline{\phi(x)})(g)=g(x)\phi(x); see e.g. Sterge et al. [2020]. Finally, for convenience, given a finite set {x1,,xn}𝒳\{x_{1},\dots,x_{n}\}\subset\mathcal{X}, we also define discrete restriction and reconstruction operators, respectively, as Sn:nS_{n}:\mathcal{H}\to\mathbb{R}^{n} such that Sng=(1/n)[g(x1),,g(xn)],S_{n}g=(1/\sqrt{n})[g(x_{1}),\dots,g(x_{n})]^{\top}, and Sn𝐯=(1/n)i=1n𝐯iϕ(xi)S_{n}^{*}\mathbf{v}=(1/\sqrt{n})\sum_{i=1}^{n}\mathbf{v}_{i}\phi(x_{i}) for any 𝐯n\mathbf{v}\in\mathbb{R}^{n}. In particular, we have SnSn=(1/n)𝐊S_{n}S_{n}^{*}=(1/n)\mathbf{K} where 𝐊=[k(xi,xj)]1i,jn\mathbf{K}=[k_{\mathcal{H}}(x_{i},x_{j})]_{1\leq i,j\leq n} is a kernel matrix, which is defined for a given ordering of the set {x1,,xn}\{x_{1},\dots,x_{n}\}. To avoid cumbersome expressions, when several discrete sets of different cardinalities, say nn and pp, are used, we simply write the respective sampling operators as SnS_{n} and SpS_{p}.
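To make these operators concrete, here is a minimal numerical sketch, not taken from the paper's code, that instantiates the discrete restriction and reconstruction operators for a Gaussian RKHS kernel on a few points of [0,1]^{2} and checks the identity S_{n}S_{n}^{*}=(1/n)\mathbf{K}; the kernel choice, the point set, and all function names are illustrative assumptions.

```python
import numpy as np

def k_H(x, y, sigma=0.1):
    # Gaussian RKHS kernel k_H(x, y) = exp(-||x - y||^2 / (2 sigma^2)) (illustrative choice)
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n = 5
X = rng.uniform(size=(n, 2))                            # points x_1, ..., x_n in [0, 1]^2
K = np.array([[k_H(xi, xj) for xj in X] for xi in X])   # kernel matrix [k_H(x_i, x_j)]

# Reconstruction operator: S_n^* v = (1/sqrt(n)) sum_i v_i phi(x_i), evaluated at a point x.
def reconstruct(v, x):
    return sum(v[i] * k_H(X[i], x) for i in range(n)) / np.sqrt(n)

# Restriction operator applied to g = S_n^* v: S_n g = (1/sqrt(n)) [g(x_1), ..., g(x_n)].
v = rng.normal(size=n)
Sn_Snstar_v = np.array([reconstruct(v, xi) for xi in X]) / np.sqrt(n)

# Check S_n S_n^* = (1/n) K on the vector v.
print(np.allclose(Sn_Snstar_v, K @ v / n))              # True
```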

2 Determinantal point processes and inference

Determinantal point processes and L-ensembles.

Consider a simple point process 𝒴\mathcal{Y} on 𝒳\mathcal{X}, that is, a random discrete subset of 𝒳\mathcal{X}. For 𝒟𝒳\mathcal{D}\subset\mathcal{X}, we denote by 𝒴(𝒟)\mathcal{Y}(\mathcal{D}) the number of points of this process that fall within 𝒟\mathcal{D}. Letting mm be a positive integer, we say that 𝒴\mathcal{Y} has mm-point correlation function ϱm\varrho_{m} w.r.t. the reference measure μ\mu if, for any mutually disjoint subsets 𝒟1,,𝒟m𝒳\mathcal{D}_{1},\dots,\mathcal{D}_{m}\subset\mathcal{X},

𝔼[i=1m𝒴(𝒟i)]=i=1m𝒟iϱm(x1,,xm)dμ(x1)dμ(xm).\mathbb{E}\left[\prod_{i=1}^{m}\mathcal{Y}(\mathcal{D}_{i})\right]=\int_{\prod_{i=1}^{m}\mathcal{D}_{i}}\varrho_{m}(x_{1},\dots,x_{m})\mathrm{d}\mu(x_{1})\dots\mathrm{d}\mu(x_{m}).

In most cases, a point process is characterized by its correlation functions (ρm)m1(\rho_{m})_{m\geq 1}. In particular, a determinantal point process (DPP) is defined as having correlation functions in the form of a determinant of a Gram matrix, i.e. ϱm(x1,,xm)=det[𝗄(xi,xj)]\varrho_{m}(x_{1},\dots,x_{m})=\det[\mathsf{k}(x_{i},x_{j})] for all m1m\geq 1. We then say that 𝗄\mathsf{k} is the correlation kernel of the DPP. Not all kernels yield a DPP: if 𝗄(x,y)\mathsf{k}(x,y) is the integral kernel of an operator 𝖪𝒮+(L2(𝒳))\mathsf{K}\in\mathcal{S}_{+}(L^{2}(\mathcal{X})), the Macchi-Soshnikov theorem [Macchi, 1975, Soshnikov, 2000] states that the corresponding DPP exists if and only if the eigenvalues of 𝖪\mathsf{K} are within [0,1][0,1]. In particular, for a finite ground set 𝒳\mathcal{X}, taking the reference measure to be the counting measure leads to conditions on the kernel matrix; see Kulesza and Taskar [2012].
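As a small illustration of this definition, the following sketch evaluates the two-point correlation function of the Gaussian correlation kernel used later in Section 6; the parameter values and test points are our own choices. The determinant vanishes as the two points approach each other, which is the repulsion characteristic of DPPs.

```python
import numpy as np

rho, alpha = 50.0, 0.05
def corr_kernel(x, y):
    # Gaussian correlation kernel k(x, y) = rho * exp(-||x - y||^2 / alpha^2), as in Section 6
    return rho * np.exp(-np.sum((x - y) ** 2) / alpha ** 2)

def varrho_m(points):
    # m-point correlation function: determinant of the Gram matrix [k(x_i, x_j)]
    G = np.array([[corr_kernel(xi, xj) for xj in points] for xi in points])
    return np.linalg.det(G)

x = np.array([0.5, 0.5])
for d in [0.2, 0.05, 0.01]:
    y = x + np.array([d, 0.0])
    print(d, varrho_m([x, y]))   # decreases towards 0 as the points get closer: repulsion
```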

A particular class of DPPs is formed by the so-called L-ensembles, for which the correlation kernel writes

𝖪=𝖠(𝖠+𝕀)1,\mathsf{K}=\mathsf{A}(\mathsf{A}+\mathbb{I})^{-1}, (1)

with the likelihood operator 𝖠𝒮+(L2(𝒳))\mathsf{A}\in\mathcal{S}_{+}(L^{2}(\mathcal{X})) taken to be of the form 𝖠f(x)=𝒳𝖺(x,y)f(y)dμ(y)\mathsf{A}f(x)=\int_{\mathcal{X}}\mathsf{a}(x,y)f(y)\mathrm{d}\mu(y). The kernel 𝖺\mathsf{a} of 𝖠\mathsf{A} is sometimes called the likelihood kernel of the L-ensemble, to distinguish it from its correlation kernel 𝗄\mathsf{k}. The interest of L-ensembles is that their Janossy densities can be computed in closed form. Informally, the mm-Janossy density describes the probability that the point process has cardinality mm, and that the points are located around a given set of distinct points x1,,xm𝒳x_{1},\dots,x_{m}\in\mathcal{X}. For the rest of the paper, we assume that 𝒳d\mathcal{X}\subset\mathbb{R}^{d} is compact, and that μ\mu is the uniform probability measure on 𝒳\mathcal{X}; our results straightforwardly extend to other densities w.r.t. Lebesgue. With these assumptions, the mm-Janossy density is proportional to

det(𝕀+𝖠)1det[𝖺(xi,xj)]1i,jm,\displaystyle\det(\mathbb{I}+\mathsf{A})^{-1}\cdot\det[\mathsf{a}(x_{i},x_{j})]_{1\leq i,j\leq m}, (2)

where the normalization constant is a Fredholm determinant that implicitly depends on 𝒳\mathcal{X}. Assume now that we are given ss i.i.d. samples of a DPP, denoted by the sets 𝒞1,,𝒞s\mathcal{C}_{1},\dots,\mathcal{C}_{s} in the window 𝒳\mathcal{X}. The associated maximum log-likelihood estimation problem reads

max𝖠𝒮+(L2(𝒳))1s=1slogdet[𝖺(xi,xj)]i,j𝒞logdet(𝕀+𝖠).\max_{\mathsf{A}\in\mathcal{S}_{+}(L^{2}(\mathcal{X}))}\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\mathsf{a}(x_{i},x_{j})\right]_{i,j\in\mathcal{C}_{\ell}}-\log\det(\mathbb{I}+\mathsf{A}). (3)

Solving (3) is a nontrivial problem. First, it is difficult to calculate the Fredholm determinant. Second, it is not straightforward to optimize over the space of operators 𝒮+(L2(𝒳))\mathcal{S}_{+}(L^{2}(\mathcal{X})) in a nonparametric setting. However, we shall see that the problem becomes tractable if we restrict the domain of (3) and impose regularity assumptions on the integral kernel 𝖺(x,y)\mathsf{a}(x,y) of the operator 𝖠\mathsf{A}. For more details on DPPs, we refer the interested reader to Hough et al. [2006], Kulesza and Taskar [2012], Lavancier et al. [2014].

Previous work on learning DPPs.

While continuous DPPs have been used in ML as sampling tools [Belhadji et al., 2019, 2020a] or models [Bardenet and Titsias, 2015, Ghosh and Rigollet, 2020], their systematic parametric estimation has been the work of spatial statisticians; see Lavancier et al. [2015], Biscio and Lavancier [2017], Poinas and Lavancier [2021] for general parametric estimation through (3) or so-called minimum-contrast inference. Still for the parametric case, a two-step estimation was recently proposed for non-stationary processes by Lavancier et al. [2021]. In a more general context, non-asymptotic risk bounds for estimating a DPP density are given in Baraud [2013].

Discrete DPPs have been more common in ML, and the study of their estimation started some years ago [Affandi et al., 2014]. Unlike for continuous DPPs, nonparametric estimation procedures have been investigated for finite DPPs by Mariet and Sra [2015], who proposed a fixed point algorithm. Moreover, the statistical properties of maximum likelihood estimation of discrete L-ensembles were studied by Brunel et al. [2017b]. We can also cite low-rank approaches [Dupuy and Bach, 2018, Gartrell et al., 2017], learning with negative sampling [Mariet et al., 2019], learning with moments and cycles [Urschel et al., 2017], or learning with Wasserstein training [Anquetil et al., 2020]. Learning non-symmetric finite DPPs [Gartrell et al., 2019, 2021] has also been proposed, motivated by recommender systems.

At a high level, our paper is a continuous counterpart to the nonparametric learning of finite DPPs with symmetric kernels in Mariet and Sra [2015]. Our treatment of the continuous case is made possible by recent advances in kernel methods.

3 A sampling approximation to a constrained MLE

Using the machinery of kernel methods, we develop a controlled approximation of the MLE problem (3). Let us outline the main landmarks of our approach. First, we restrict the domain of the MLE problem (3) to smooth operators. On the one hand, this restriction allows us to develop a sampling approximation of the Fredholm determinant. On the other hand, the new optimization problem now admits a finite rank solution that can be obtained by solving a finite-dimensional problem. This procedure is described in Algorithm 1 and yields an estimator for the likelihood kernel. Finally, we use another sampling approximation and solve a linear system to estimate the correlation kernel of the fitted DPP; see Algorithm 2.

Restricting to smooth operators.

In order to later apply the representer theorem of Marteau-Ferey et al. [2020], we restrict the original maximum likelihood problem (3) to “smooth” operators 𝖠=SAS\mathsf{A}=SAS^{*}, with A𝒮+()A\in\mathcal{S}_{+}(\mathcal{H}) and SS the restriction operator introduced in Section 1.1. Note that the kernel of 𝖠\mathsf{A} now writes

𝖺(x,y)=ϕ(x),Aϕ(y).\mathsf{a}(x,y)=\langle\phi(x),A\phi(y)\rangle. (4)

With this restriction on its domain, the optimization problem (3) now reads

minA𝒮+()f(A)=1s=1slogdet[ϕ(xi),Aϕ(xj)]i,j𝒞+logdet(𝕀+SAS).\min_{A\in\mathcal{S}_{+}(\mathcal{H})}f(A)=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\langle\phi(x_{i}),A\phi(x_{j})\rangle\right]_{i,j\in\mathcal{C}_{\ell}}+\log\det(\mathbb{I}+SAS^{*}). (5)
Approximating the Fredholm determinant.

We use a sampling approach to approximate the normalization constant in (5). We sample a set of points ={xi:1in}\mathcal{I}=\{x^{\prime}_{i}:1\leq i\leq n\} i.i.d. from the ambient probability measure μ\mu. For definiteness, we place ourselves on an event happening with probability one where all the points in \mathcal{I} and 𝒞\mathcal{C}_{\ell} for 1s1\leq\ell\leq s are distinct. We define the sample version of f(A)f(A) as

fn(A)=1s=1slogdet[ϕ(xi),Aϕ(xj)]i,j𝒞+logdet(𝐈n+SnASn),f_{n}(A)=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\langle\phi(x_{i}),A\phi(x_{j})\rangle\right]_{i,j\in\mathcal{C}_{\ell}}+\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*}),

where the Fredholm determinant of 𝖠=SAS\mathsf{A}=SAS^{*} has been replaced by the determinant of a finite matrix involving SnASn=1n[ϕ(xi),Aϕ(xj)]1i,jnS_{n}AS_{n}^{*}=\frac{1}{n}[\langle\phi(x^{\prime}_{i}),A\phi(x^{\prime}_{j})\rangle]_{1\leq i,j\leq n}, the 1/n1/n factor coming from the normalization of SnS_{n}.
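For AA of the representer form (8) introduced below, the entries of this matrix reduce to \langle\phi(x^{\prime}_{a}),A\phi(x^{\prime}_{b})\rangle=\sum_{i,j}\mathbf{C}_{ij}k_{\mathcal{H}}(x^{\prime}_{a},z_{i})k_{\mathcal{H}}(z_{j},x^{\prime}_{b}), so the determinant term can be assembled from a cross kernel matrix. The following Python sketch (ours, not the authors' released Julia code) illustrates the computation; the Gaussian kernel, the random positive semi-definite \mathbf{C}, and all names are assumptions made for illustration.

```python
import numpy as np

def gram(X, Y, sigma=0.1):
    # Gaussian RKHS kernel matrix [k_H(x, y)] (illustrative choice of kernel)
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
m, n = 8, 200
Z = rng.uniform(size=(m, 2))        # points z_1, ..., z_m carrying the representer form of A
Xp = rng.uniform(size=(n, 2))       # i.i.d. points x'_1, ..., x'_n for the Fredholm approximation

G = rng.normal(size=(m, m))
C = G @ G.T / m                     # arbitrary psd representer matrix (stand-in for an estimate)

K_nm = gram(Xp, Z)                  # cross kernel matrix [k_H(x'_a, z_i)]
M = K_nm @ C @ K_nm.T               # [<phi(x'_a), A phi(x'_b)>]_{a,b}

# Monte Carlo surrogate of the Fredholm determinant: log det(I_n + S_n A S_n^*)
sign, logdet = np.linalg.slogdet(np.eye(n) + M / n)
print(logdet)
```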

Theorem 1 (Approximation of the Fredholm determinant).

Let δ(0,1/2)\delta\in(0,1/2). With probability at least 12δ1-2\delta,

|logdet(𝐈n+SnASn)logdet(𝕀+SAS)|logdet(𝕀+cnA),\left|\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*})-\log\det(\mathbb{I}+SAS^{*})\right|\leq\log\det(\mathbb{I}+c_{n}A),

with

cn=4κ2log(2κ2δ)3n+2κ2log(2κ2δ)n,c_{n}=\frac{4\kappa^{2}\log\left(\frac{2\kappa^{2}}{\ell\delta}\right)}{3n}+\sqrt{\frac{2\kappa^{2}\ell\log\left(\frac{2\kappa^{2}}{\ell\delta}\right)}{n}},

where =λmax(𝖳k)\ell=\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}}) and κ2=supx𝒳k(x,x)<\kappa^{2}=\sup_{x\in\mathcal{X}}k_{\mathcal{H}}(x,x)<\infty.

The proof of Theorem 1 is given in Supplementary Material in Section C.2. Several remarks are in order. First, the high probability in the statement of Theorem 1 is that of the event \{\|S^{*}S-S^{*}_{n}S_{n}\|_{op}\lesssim c_{n}\}, where aba\lesssim b means that acba\leq cb for some constant c>0c>0. Importantly, all the results given in what follows for the approximation of the solution of (5) only depend on this event, so that we do not need any union bound. Second, we emphasize that 𝖳k\mathsf{T}_{k_{\mathcal{H}}}, defined in Section 1.1, should not be confused with the correlation kernel (1). Third, to interpret the bound in Theorem 1, it is useful to recall that logdet(𝕀+cnA)cnTr(A)\log\det(\mathbb{I}+c_{n}A)\leq c_{n}\operatorname{Tr}(A), since A𝒮+()A\in\mathcal{S}_{+}(\mathcal{H}). Thus, by penalizing Tr(A)\operatorname{Tr}(A), one also improves the upper bound on the Fredholm determinant approximation error. This remark motivates the following infinite-dimensional problem

minA𝒮+()fn(A)+λTr(A),\min_{A\in\mathcal{S}_{+}(\mathcal{H})}f_{n}(A)+\lambda\operatorname{Tr}(A), (6)

for some λ>0\lambda>0. The penalty on Tr(A)\operatorname{Tr}(A) is also intuitively needed so that the optimization problem selects a smooth solution, i.e., such a trace regularizer promotes a fast decay of the eigenvalues of AA. Note that this problem depends both on the data 𝒞1,,𝒞s𝒳\mathcal{C}_{1},\dots,\mathcal{C}_{s}\subset\mathcal{X} and on the subset \mathcal{I} used for approximating the Fredholm determinant.

Finite-dimensional representatives.

In an RKHS, there is a natural mapping between finite rank operators and matrices. For the sake of completeness, let 𝐊=[k(zi,zj)]1i,jm\mathbf{K}=[k_{\mathcal{H}}(z_{i},z_{j})]_{1\leq i,j\leq m} be a kernel matrix and let 𝐊=𝐑𝐑\mathbf{K}=\mathbf{R}^{\top}\mathbf{R} be a Cholesky factorization. Throughout the paper, kernel matrices are always assumed to be invertible. This is not a strong assumption: if kk_{\mathcal{H}} is the Laplace, Gaussian or Sobolev kernel, this is true almost surely if ziz_{i} for 1im1\leq i\leq m are sampled e.g. w.r.t. the Lebesgue measure; see Bochner’s classical theorem [Wendland, 2004, Theorem 6.6 and Corollary 6.9]. In this case, we can define a partial isometry V:mV:\mathcal{H}\to\mathbb{R}^{m} as V=m(𝐑1)Sm.V=\sqrt{m}(\mathbf{R}^{-1})^{\top}S_{m}. It satisfies VV=𝐈VV^{*}=\mathbf{I}, and VVV^{*}V is the orthogonal projector onto the span of ϕ(zi)\phi(z_{i}) for all 1im1\leq i\leq m. This is helpful to define

𝚽i=Vϕ(zi)m,\bm{\Phi}_{i}=V\phi(z_{i})\in\mathbb{R}^{m}, (7)

the finite-dimensional representative of ϕ(zi)\phi(z_{i})\in\mathcal{H} for all 1im1\leq i\leq m. This construction yields a useful mapping between an operator in 𝒮+()\mathcal{S}_{+}(\mathcal{H}) and a finite matrix, which is instrumental for obtaining our results.
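The following sketch checks these definitions numerically for a Gaussian kernel on random points, an illustrative setup of ours: since SmS_{m} applied to ϕ(zi)\phi(z_{i}) returns the iith column of \mathbf{K}/\sqrt{m}, the representative \bm{\Phi}_{i} is simply the iith column of \mathbf{R}, and VV=𝐈mVV^{*}=\mathbf{I}_{m}.

```python
import numpy as np

def gram(X, sigma=0.1):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
m = 6
Z = rng.uniform(size=(m, 2))
K = gram(Z)                                 # kernel matrix, assumed invertible
R = np.linalg.cholesky(K).T                 # K = R^T R, with R upper triangular

# Phi_i = V phi(z_i) = sqrt(m) R^{-T} S_m phi(z_i) = R^{-T} K e_i, i.e. the i-th column of R.
Phi = np.linalg.solve(R.T, K)               # columns Phi_1, ..., Phi_m
print(np.allclose(Phi, R))                  # Phi coincides with R
print(np.allclose(Phi.T @ Phi, K))          # <phi(z_i), phi(z_j)> = Phi_i^T Phi_j = K_ij

# V V^* = m R^{-T} (S_m S_m^*) R^{-1} = R^{-T} K R^{-1} = I_m
VVstar = np.linalg.solve(R.T, K) @ np.linalg.inv(R)
print(np.allclose(VVstar, np.eye(m)))       # partial isometry
```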

Lemma 2 (Finite dimensional representatives, extension of Lemma 3 in Rudi et al. [2020]).

Let A𝒮+()A\in\mathcal{S}_{+}(\mathcal{H}). Then, the matrix 𝐁¯=VAV\bar{\mathbf{B}}=VAV^{*} is such that 𝚽i𝐁¯𝚽j=ϕ(zi),Aϕ(zj) for all 1i,jm,\bm{\Phi}_{i}^{\top}\bar{\mathbf{B}}\bm{\Phi}_{j}=\langle\phi(z_{i}),A\phi(z_{j})\rangle\quad\text{ for all }1\leq i,j\leq m, and logdet(𝐈+𝐁¯)logdet(𝕀+A)\log\det(\mathbf{I}+\bar{\mathbf{B}})\leq\log\det(\mathbb{I}+A), as well as Tr(𝐁¯)Tr(A)\operatorname{Tr}(\bar{\mathbf{B}})\leq\operatorname{Tr}(A).

The proof of Lemma 2 is given in Section C.1. Notice that the partial isometry VV also helps to map a matrix in 𝒮+(m)\mathcal{S}_{+}(\mathbb{R}^{m}) to an operator in 𝒮+()\mathcal{S}_{+}(\mathcal{H}), as 𝐁V𝐁V,\mathbf{B}\mapsto V^{*}\mathbf{B}V, in such a way that we have the matrix element matching ϕ(zi),V𝐁Vϕ(zj)=𝚽i𝐁𝚽j\langle\phi(z_{i}),V^{*}\mathbf{B}V\phi(z_{j})\rangle=\bm{\Phi}_{i}^{\top}\mathbf{B}\bm{\Phi}_{j} for all 1i,jm1\leq i,j\leq m.

Finite rank solution thanks to a representer theorem.

The sampling approximation of the Fredholm determinant also yields a finite rank solution for (6). For simplicity, we define 𝒞=1s𝒞\mathcal{C}\triangleq\cup_{\ell=1}^{s}\mathcal{C}_{\ell} and recall ={x1,,xn}\mathcal{I}=\{x^{\prime}_{1},\dots,x^{\prime}_{n}\}. Then, write the set of points 𝒵𝒞\mathcal{Z}\triangleq\mathcal{C}\cup\mathcal{I} as {z1,,zm}\{z_{1},\dots,z_{m}\}, with m=|𝒞|+nm=|\mathcal{C}|+n, and denote the corresponding restriction operator Sm:mS_{m}:\mathcal{H}\to\mathbb{R}^{m}. Consider the kernel matrix 𝐊=[k(zi,zj)]1i,jm\mathbf{K}=[k_{\mathcal{H}}(z_{i},z_{j})]_{1\leq i,j\leq m}. In particular, since we used a trace regularizer, the representer theorem of Marteau-Ferey et al. [2020, Theorem 1] holds: the optimal operator is of the form

A=i,j=1m𝐂ijϕ(zi)ϕ(zj)¯ with 𝐂𝒮+(m).A=\sum_{i,j=1}^{m}\mathbf{C}_{ij}\phi(z_{i})\otimes\overline{\phi(z_{j})}\text{ with }\mathbf{C}\in\mathcal{S}_{+}(\mathbb{R}^{m}). (8)

In this paper, we call 𝐂\mathbf{C} the representer matrix of the operator AA. If we do the change of variables 𝐁=𝐑𝐂𝐑\mathbf{B}=\mathbf{R}\mathbf{C}\mathbf{R}^{\top}, we have the following identities: A=mSm𝐂Sm=V𝐁VA=mS_{m}^{*}\mathbf{C}S_{m}=V^{*}\mathbf{B}V and Tr(A)=Tr(𝐊𝐂)=Tr(𝐁)\operatorname{Tr}(A)=\operatorname{Tr}(\mathbf{K}\mathbf{C})=\operatorname{Tr}(\mathbf{B}), thanks to Lemma 7 in Supplementary Material. Therefore, the problem (6) boils down to the finite non-convex problem:

min𝐁0fn(V𝐁V)+λTr(𝐁),\min_{\mathbf{B}\succeq 0}f_{n}(V^{*}\mathbf{B}V)+\lambda\operatorname{Tr}(\mathbf{B}), (9)

where fn(V𝐁V)=1s=1slogdet[𝚽i𝐁𝚽j]i,j𝒞+logdet[δij+𝚽i𝐁𝚽j/||]i,jf_{n}(V^{*}\mathbf{B}V)=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\bm{\Phi}_{i}^{\top}\mathbf{B}\bm{\Phi}_{j}\right]_{i,j\in\mathcal{C}_{\ell}}+\log\det\left[\delta_{ij}+\bm{\Phi}_{i}^{\top}\mathbf{B}\bm{\Phi}_{j}/|\mathcal{I}|\right]_{i,j\in\mathcal{I}}. We assume that there is a global minimizer of (9) that we denote by 𝐁\mathbf{B}_{\star}. The final estimator of the integral kernel of the likelihood 𝖠\mathsf{A} depends on 𝐂=𝐑1𝐁𝐑1\mathbf{C}_{\star}=\mathbf{R}^{-1}\mathbf{B}_{\star}\mathbf{R}^{-1\top} and reads 𝖺^(x,y)=i,j=1m𝐂ijk(zi,x)k(zj,y).\hat{\mathsf{a}}(x,y)=\sum_{i,j=1}^{m}\mathbf{C}_{\star ij}k_{\mathcal{H}}(z_{i},x)k_{\mathcal{H}}(z_{j},y). The numerical strategy is summarized in Algorithm 1.

procedure Estimate𝖠\mathsf{A}(λ,𝒞1,,𝒞s\lambda,\mathcal{C}_{1},\dots,\mathcal{C}_{s})
     Sample ={x1,,xn}\mathcal{I}=\{x^{\prime}_{1},\dots,x^{\prime}_{n}\} i.i.d. from μ\mu \triangleright Sample nn points for Fredholm det. approx.
     Define 𝒵=1s𝒞\mathcal{Z}\triangleq\cup_{\ell=1}^{s}\mathcal{C}_{\ell}\cup\mathcal{I} \triangleright Collect all samples
     Compute 𝐊=𝐑𝐑\mathbf{K}=\mathbf{R}^{\top}\mathbf{R} with 𝐊=[k(zi,zj)]i,j𝒵\mathbf{K}=[k_{\mathcal{H}}(z_{i},z_{j})]_{i,j\in\mathcal{Z}} \triangleright Cholesky of kernel matrix
     Solve (9) with iteration (14) to obtain 𝐁\mathbf{B}_{\star} \triangleright Regularized Picard iteration
     Compute 𝐂=𝐑1𝐁𝐑1\mathbf{C}_{\star}=\mathbf{R}^{-1}\mathbf{B}_{\star}\mathbf{R}^{-1\top} \triangleright Representer matrix of 𝖺^(x,y)\hat{\mathsf{a}}(x,y)
     return 𝖺^(x,y)=i,j=1m𝐂ijk(zi,x)k(zj,y)\hat{\mathsf{a}}(x,y)=\sum_{i,j=1}^{m}\mathbf{C}_{\star ij}k_{\mathcal{H}}(z_{i},x)k_{\mathcal{H}}(z_{j},y) \triangleright Likelihood kernel
end procedure
Algorithm 1 Estimation of the integral kernel 𝖺(x,y)\mathsf{a}(x,y) of the DPP likelihood kernel 𝖠\mathsf{A}.
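As a concrete counterpart of Algorithm 1, the following Python sketch (ours; the authors' released code is in Julia) assembles the kernel matrix, evaluates the objective of (9) for a candidate \mathbf{B}, and forms the estimator \hat{\mathsf{a}}(x,y) from \mathbf{C}=\mathbf{R}^{-1}\mathbf{B}\mathbf{R}^{-1\top}. The toy data, the Gaussian kernel, and all names are illustrative assumptions; a sketch of the regularized Picard update that minimizes this objective is given after Section 4.

```python
import numpy as np

def gram(X, Y, sigma=0.1):
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def objective(B, Phi, C_idx, I_idx, lam):
    # Objective of (9): -(1/s) sum_l log det [Phi_i^T B Phi_j]_{C_l}
    #                   + log det (I + (Phi^T B Phi)_{II} / |I|) + lam Tr(B)
    M = Phi.T @ B @ Phi
    val = -np.mean([np.linalg.slogdet(M[np.ix_(c, c)])[1] for c in C_idx])
    n = len(I_idx)
    val += np.linalg.slogdet(np.eye(n) + M[np.ix_(I_idx, I_idx)] / n)[1]
    return val + lam * np.trace(B)

# Toy data: s = 2 "DPP samples" and n Fredholm points in [0, 1]^2 (illustrative only)
rng = np.random.default_rng(3)
C_samples = [rng.uniform(size=(6, 2)), rng.uniform(size=(5, 2))]
I_points = rng.uniform(size=(30, 2))
Z = np.vstack(C_samples + [I_points])                   # z_1, ..., z_m
m = Z.shape[0]
offsets = np.cumsum([0] + [len(c) for c in C_samples])
C_idx = [list(range(offsets[l], offsets[l + 1])) for l in range(len(C_samples))]
I_idx = list(range(offsets[-1], m))

K = gram(Z, Z) + 1e-10 * np.eye(m)                      # small jitter, as in Section 6
R = np.linalg.cholesky(K).T                             # K = R^T R
Phi = R                                                 # Phi_i = R e_i, see (7)

B = np.eye(m)                                           # feasible candidate; replace by the Picard solution B_star
print(objective(B, Phi, C_idx, I_idx, lam=0.1))

# Estimated likelihood kernel a_hat(x, y) = sum_ij C_ij k_H(z_i, x) k_H(z_j, y)
Rinv = np.linalg.inv(R)
C_star = Rinv @ B @ Rinv.T                              # C = R^{-1} B R^{-T}
def a_hat(x, y):
    return gram(np.atleast_2d(x), Z)[0] @ C_star @ gram(np.atleast_2d(y), Z)[0]
print(a_hat(np.array([0.2, 0.3]), np.array([0.25, 0.35])))
```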
Estimation of the correlation kernel.

The exact computation of the correlation kernel of the L-ensemble DPP

𝖪(γ)=𝖠(𝖠+γ𝕀)1,\mathsf{K}(\gamma)=\mathsf{A}(\mathsf{A}+\gamma\mathbb{I})^{-1}, (10)

requires the exact diagonalization of 𝖠=SAS\mathsf{A}=SAS^{*}. For more flexibility, we introduced a scale parameter γ>0\gamma>0 which often takes the value γ=1\gamma=1. It is instructive to approximate 𝖪\mathsf{K} in order to easily express the correlation functions of the estimated point process. We propose here an approximation scheme based once again on sampling. Recall the form of the solution A=mSm𝐂SmA=mS_{m}^{*}\mathbf{C}S_{m} of (6), and consider the factorization 𝐂=𝚲𝚲\mathbf{C}=\mathbf{\Lambda}^{\top}\mathbf{\Lambda} with 𝚲=𝐅𝐑1\mathbf{\Lambda}=\mathbf{F}\mathbf{R}^{-1\top} where 𝐅𝐅=𝐁\mathbf{F}^{\top}\mathbf{F}=\mathbf{B}_{\star} is the Cholesky factorization of 𝐁\mathbf{B}_{\star}. Let {x1′′,,xp′′}𝒳\{x^{\prime\prime}_{1},\dots,x^{\prime\prime}_{p}\}\subseteq\mathcal{X} be sampled i.i.d. from the probability measure μ\mu and denote by Sp:pS_{p}:\mathcal{H}\to\mathbb{R}^{p} the corresponding restriction operator. The following integral operator

𝖪^=mSSm𝚲(m𝚲SmSpSpSm𝚲+γ𝐈m)1𝚲SmS,\hat{\mathsf{K}}=mSS^{*}_{m}\mathbf{\Lambda}^{\top}(m\mathbf{\Lambda}S_{m}S^{*}_{p}S_{p}S^{*}_{m}\mathbf{\Lambda}^{\top}+\gamma\mathbf{I}_{m})^{-1}\mathbf{\Lambda}S_{m}S^{*}, (11)

gives an approximation of 𝖪\mathsf{K}. The numerical evaluation of (11) relies on the computation of the rectangular kernel matrix 𝐊mp=mpSmSp=[k(zi,xj′′)]\mathbf{K}_{mp}=\sqrt{mp}S_{m}S^{*}_{p}=[k_{\mathcal{H}}(z_{i},x^{\prime\prime}_{j})] with 1im1\leq i\leq m and 1jp1\leq j\leq p, associated with a fixed ordering of 𝒵={z1,,zm}\mathcal{Z}=\{z_{1},\dots,z_{m}\} and {x1′′,,xp′′}\{x^{\prime\prime}_{1},\dots,x^{\prime\prime}_{p}\}. Our strategy is described in Algorithm 2.

procedure Estimate𝖪\mathsf{K}(𝒵,𝐂\mathcal{Z},\mathbf{C}_{\star})
     Compute 𝐂=𝚲𝚲\mathbf{C}_{\star}=\mathbf{\Lambda}^{\top}\mathbf{\Lambda} \triangleright Factorization of representer matrix
     Sample {x1′′,,xp′′}𝒳\{x^{\prime\prime}_{1},\dots,x^{\prime\prime}_{p}\}\subseteq\mathcal{X} i.i.d. from μ\mu \triangleright Sample pp points
     Compute 𝐊mp=[k(zi,xj′′)]m×p\mathbf{K}_{mp}=[k_{\mathcal{H}}(z_{i},x^{\prime\prime}_{j})]\in\mathbb{R}^{m\times p} \triangleright Cross kernel matrix
     Compute 𝛀=𝚲(𝚲𝐊mp1p𝐊mp𝚲+𝐈m)1𝚲\bm{\Omega}=\mathbf{\Lambda}^{\top}(\mathbf{\Lambda}\mathbf{K}_{mp}\frac{1}{p}\mathbf{K}_{mp}^{\top}\mathbf{\Lambda}^{\top}+\mathbf{I}_{m})^{-1}\mathbf{\Lambda} \triangleright Representer matrix of 𝗄^(x,y)\hat{\mathsf{k}}(x,y)
     return 𝗄^(x,y)=i,j=1m𝛀ijk(zi,x)k(zj,y)\hat{\mathsf{k}}(x,y)=\sum_{i,j=1}^{m}\bm{\Omega}_{ij}k_{\mathcal{H}}(z_{i},x)k_{\mathcal{H}}(z_{j},y) \triangleright Correlation kernel
end procedure
Algorithm 2 Estimation of the integral kernel 𝗄(x,y)\mathsf{k}(x,y) of the DPP correlation kernel 𝖪=𝖠(𝖠+𝕀)1\mathsf{K}=\mathsf{A}(\mathsf{A}+\mathbb{I})^{-1}.
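Algorithm 2 can likewise be sketched in a few lines of Python (our own illustration, not the authors' code); the representer matrix below is a random stand-in for the output of Algorithm 1, and the Gaussian kernel and all names are assumptions.

```python
import numpy as np

def gram(X, Y, sigma=0.1):
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(4)
m, p = 20, 1000
Z = rng.uniform(size=(m, 2))                     # points z_1, ..., z_m from Algorithm 1
K = gram(Z, Z) + 1e-10 * np.eye(m)
R = np.linalg.cholesky(K).T                      # K = R^T R

# Stand-in for the output of Algorithm 1: a random psd matrix B_star (illustrative only)
G = rng.normal(size=(m, m))
B_star = G @ G.T / m
F = np.linalg.cholesky(B_star).T                 # B_star = F^T F
Lam = F @ np.linalg.inv(R).T                     # Lambda = F R^{-T}, so C_star = Lam^T Lam

# Algorithm 2: estimate the correlation kernel
Xpp = rng.uniform(size=(p, 2))                   # p i.i.d. points x''_1, ..., x''_p from mu
K_mp = gram(Z, Xpp)                              # cross kernel matrix [k_H(z_i, x''_j)]
middle = Lam @ K_mp @ K_mp.T @ Lam.T / p + np.eye(m)
Omega = Lam.T @ np.linalg.solve(middle, Lam)     # representer matrix of k_hat(x, y)

def k_hat(x, y):
    kx = gram(np.atleast_2d(x), Z)[0]
    ky = gram(np.atleast_2d(y), Z)[0]
    return kx @ Omega @ ky

# Intensity of the fitted kernel at (0.5, 0.5); meaningless here since B_star is a stand-in
print(k_hat(np.array([0.5, 0.5]), np.array([0.5, 0.5])))
```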

4 Implementation

We propose an algorithm for solving the discrete problem (9) associated to (6). To simplify the discussion and relate it to Mariet and Sra [2015], we define the objective g(𝐗)=fn(V𝐁(𝐗)V)+λTr(𝐁(𝐗))g(\mathbf{X})=f_{n}(V^{*}\mathbf{B}(\mathbf{X})V)+\lambda\operatorname{Tr}(\mathbf{B}(\mathbf{X})) with the change of variables 𝐁(𝐗)=𝐑1𝐗𝐑1\mathbf{B}(\mathbf{X})=\mathbf{R}^{-1\top}\mathbf{X}\mathbf{R}^{-1}. Then we can rephrase (9) as

min𝐗0g(𝐗)=1s=1slogdet(𝐗𝒞𝒞)+logdet(𝐈||+1n𝐗)+λTr(𝐗𝐊1),\min_{\mathbf{X}\succeq 0}g(\mathbf{X})=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})+\log\det\left(\mathbf{I}_{|\mathcal{I}|}+\frac{1}{n}\mathbf{X}_{\mathcal{I}\mathcal{I}}\right)+\lambda\operatorname{Tr}(\mathbf{X}\mathbf{K}^{-1}), (12)

where we recall that n=||n=|\mathcal{I}|. Define for convenience 𝐔\mathbf{U}_{\ell} as the matrix obtained by selecting the columns of the identity matrix which are indexed by 𝒞\mathcal{C}_{\ell}, so that, we have in particular 𝐗𝒞𝒞=𝐔𝐗𝐔\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}}=\mathbf{U}_{\ell}^{\top}\mathbf{X}\mathbf{U}_{\ell}. Similarly, define a sampling matrix 𝐔\mathbf{U}_{\mathcal{I}} associated to the subset \mathcal{I}. Recall the Cholesky decomposition 𝐊=𝐑𝐑\mathbf{K}=\mathbf{R}^{\top}\mathbf{R}. To minimize (12), we start at some 𝐗00\mathbf{X}_{0}\succ 0 and use the following iteration

𝐗k+1=12λ𝐑((𝐈m+4λ𝐑1p(𝐗k)𝐑1)1/2𝐈m)𝐑,\mathbf{X}_{k+1}=\frac{1}{2\lambda}\mathbf{R}^{\top}\left(\left(\mathbf{I}_{m}+4\lambda\mathbf{R}^{-1\top}p(\mathbf{X}_{k})\mathbf{R}^{-1}\right)^{1/2}-\mathbf{I}_{m}\right)\mathbf{R}, (13)

where p(𝐗)=𝐗+𝐗𝚫𝐗p(\mathbf{X})=\mathbf{X}+\mathbf{X}\mathbf{\Delta}\mathbf{X} and 𝚫(𝐗)=1s=1s𝐔𝐗𝒞𝒞1𝐔𝐔(𝐗+n𝐈||)1𝐔.\mathbf{\Delta}(\mathbf{X})=\frac{1}{s}\sum_{\ell=1}^{s}\mathbf{U}_{\ell}\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}}^{-1}\mathbf{U}_{\ell}^{\top}-\mathbf{U}_{\mathcal{I}}(\mathbf{X}_{\mathcal{I}\mathcal{I}}+n\mathbf{I}_{|\mathcal{I}|})^{-1}\mathbf{U}_{\mathcal{I}}^{\top}. We dub this sequence a regularized Picard iteration, as it is a generalization of the Picard iteration which was introduced by Mariet and Sra [2015] in the context of learning discrete L-ensemble DPPs. In Mariet and Sra [2015], the Picard iteration, defined as 𝐗k+1=p(𝐗k)\mathbf{X}_{k+1}=p(\mathbf{X}_{k}), is shown to be appropriate for minimizing a different objective given by: 1s=1slogdet(𝐗𝒞𝒞)+logdet(𝐈+𝐗)-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})+\log\det(\mathbf{I}+\mathbf{X}). The following theorem indicates that the iteration (13) is a good candidate for minimizing g(𝐗)g(\mathbf{X}).

Theorem 3.

Let 𝐗k\mathbf{X}_{k} for integer kk be the sequence generated by (13) and initialized with 𝐗00\mathbf{X}_{0}\succ 0. Then, the sequence g(𝐗k)g(\mathbf{X}_{k}) is monotonically decreasing.

For a proof, we refer to Section C.4. In practice, we use the iteration (13) with the inverse change of variables 𝐗(𝐁)=𝐑𝐁𝐑\mathbf{X}(\mathbf{B})=\mathbf{R}^{\top}\mathbf{B}\mathbf{R} and solve

𝐁k+1=12λ((𝐈m+4λq(𝐁k))1/2𝐈m), with q(𝐁)=𝐁+𝐁𝐑𝚫(𝐗(𝐁))𝐑𝐁,\mathbf{B}_{k+1}=\frac{1}{2\lambda}\left(\left(\mathbf{I}_{m}+4\lambda q(\mathbf{B}_{k})\right)^{1/2}-\mathbf{I}_{m}\right),\text{ with }q(\mathbf{B})=\mathbf{B}+\mathbf{B}\mathbf{R}\mathbf{\Delta}\big{(}\mathbf{X}(\mathbf{B})\big{)}\mathbf{R}^{\top}\mathbf{B}, (14)

where 𝚫(𝐗)\mathbf{\Delta}(\mathbf{X}) is given above. For the stopping criterion, we monitor the objective values of (9) and stop if the relative variation of two consecutive objective values is less than a predefined precision threshold 𝗍𝗈𝗅\mathsf{tol}. Contrary to (12), the objective (9) does not include the inverse of 𝐊\mathbf{K}, which might be ill-conditioned. The interplay between λ\lambda and nn is best understood by considering (9) with the change of variables 𝐁=𝐁/n\mathbf{B}^{\prime}=\mathbf{B}/n, yielding the equivalent problem

min𝐁01s=1slogdet(𝚽𝐁𝚽)𝒞𝒞+logdet(𝐈+𝚽𝐁𝚽)+λnTr(𝐁),\min_{\mathbf{B}^{\prime}\succeq 0}-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left(\bm{\Phi}^{\top}\mathbf{B}^{\prime}\bm{\Phi}\right)_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}}+\log\det\left(\mathbf{I}+\bm{\Phi}^{\top}\mathbf{B}^{\prime}\bm{\Phi}\right)_{\mathcal{I}\mathcal{I}}+\lambda n\operatorname{Tr}(\mathbf{B}^{\prime}),

where 𝚽=𝐑\bm{\Phi}=\mathbf{R} is a matrix whose ii-th column is 𝚽i\bm{\Phi}_{i} for 1im1\leq i\leq m as defined in (7). Notice that, up to a 1/n\nicefrac{{1}}{{n}} factor, 𝚽𝐁𝚽\bm{\Phi}^{\top}\mathbf{B}^{\prime}\bm{\Phi} is the in-sample Gram matrix of 𝖺^(x,y)\hat{\mathsf{a}}(x,y) evaluated on the data set 𝒵\mathcal{Z}; see Algorithm 1. Thus, in the limit λ0\lambda\to 0, the above expression corresponds to maximum likelihood estimation of a finite DPP if 𝒞\cup_{\ell}\mathcal{C}_{\ell}\subseteq\mathcal{I}. This is the intuitive connection with finite DPPs: the continuous DPP is well approximated by a finite DPP if the ground set \mathcal{I} is a dense enough sampling within 𝒳\mathcal{X}.
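The following sketch implements the regularized Picard iteration (14) on toy data, monitoring the objective of (9) as a stopping criterion. It is our own Python rendering (the authors' code is in Julia), with an illustrative Gaussian kernel, random data, and a dense matrix square root, so it is meant to convey the update rather than to be an efficient implementation.

```python
import numpy as np

def gram(X, Y, sigma=0.1):
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def sqrtm_psd(M):
    # symmetric psd matrix square root via eigendecomposition
    w, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def regularized_picard(R, C_idx, I_idx, lam, tol=1e-5, max_iter=500):
    # Iteration (14): B_{k+1} = (1/(2 lam)) ((I + 4 lam q(B_k))^{1/2} - I),
    # with q(B) = B + B R Delta(X) R^T B and X = R^T B R.
    m, n = R.shape[0], len(I_idx)
    B, prev = np.eye(m), None
    for _ in range(max_iter):
        X = R.T @ B @ R
        # Delta(X) = (1/s) sum_l U_l X_{C_l C_l}^{-1} U_l^T - U_I (X_II + n I)^{-1} U_I^T
        Delta = np.zeros((m, m))
        for c in C_idx:
            Delta[np.ix_(c, c)] += np.linalg.inv(X[np.ix_(c, c)]) / len(C_idx)
        Delta[np.ix_(I_idx, I_idx)] -= np.linalg.inv(X[np.ix_(I_idx, I_idx)] + n * np.eye(n))
        q = B + B @ R @ Delta @ R.T @ B
        B = (sqrtm_psd(np.eye(m) + 4 * lam * q) - np.eye(m)) / (2 * lam)
        # stopping criterion: relative variation of the objective of (9)
        M = R.T @ B @ R
        obj = -np.mean([np.linalg.slogdet(M[np.ix_(c, c)])[1] for c in C_idx]) \
              + np.linalg.slogdet(np.eye(n) + M[np.ix_(I_idx, I_idx)] / n)[1] \
              + lam * np.trace(B)
        if prev is not None and abs(prev - obj) <= tol * abs(prev):
            break
        prev = obj
    return B

# Toy run with the same kind of data as in the sketch following Algorithm 1 (illustrative)
rng = np.random.default_rng(3)
C_samples = [rng.uniform(size=(6, 2)), rng.uniform(size=(5, 2))]
I_points = rng.uniform(size=(30, 2))
Z = np.vstack(C_samples + [I_points])
offsets = np.cumsum([0] + [len(c) for c in C_samples])
C_idx = [list(range(offsets[l], offsets[l + 1])) for l in range(len(C_samples))]
I_idx = list(range(offsets[-1], Z.shape[0]))
K = gram(Z, Z) + 1e-10 * np.eye(Z.shape[0])
R = np.linalg.cholesky(K).T
B_star = regularized_picard(R, C_idx, I_idx, lam=0.1)
print(np.round(np.linalg.eigvalsh(R.T @ B_star @ R)[::-1], 4))  # spectrum of the in-sample matrix, cf. Figure 2 (left)
```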

5 Theoretical guarantees

We now describe the guarantees coming with the approximations presented in the previous section.

Statistical guarantees for approximating the maximum likelihood problem

Next, we give a statistical guarantee for the approximation of the log-likelihood by its sample version.

Theorem 4 (Discrete optimal objective approximates full MLE objective).

Let 𝐁\mathbf{B}_{\star} be the solution of (9). Let AA_{\star} be the solution of (5). Let δ(0,1/2)\delta\in(0,1/2). If λ2cn(δ)\lambda\geq 2c_{n}(\delta), then with probability at least 12δ1-2\delta, it holds that

|f(A)fn(V𝐁V)|32λTr(A),|f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V)|\leq\frac{3}{2}\lambda\operatorname{Tr}(A_{\star}),

with 0<cn1/n0<c_{n}\lesssim 1/\sqrt{n} given in Theorem 1.

The above result, proved in Section C.3, shows that, with high probability, the optimal objective value of the discrete problem is not far from the optimal log-likelihood provided nn is large enough. As a simple consequence, the discrete solution also yields a finite rank operator V𝐁VV^{*}\mathbf{B}_{\star}V whose likelihood is not far from the optimal likelihood f(A)f(A_{\star}), as can be shown using a triangle inequality.

Corollary 5 (Approximation of the full MLE optimizer by a finite rank operator).

Under the assumptions of Theorem 4, if λ2cn(δ)\lambda\geq 2c_{n}(\delta), with probability at least 12δ1-2\delta, it holds

|f(A)f(V𝐁V)|3λTr(A)|f(A_{\star})-f(V^{*}\mathbf{B}_{\star}V)|\leq 3\lambda\operatorname{Tr}(A_{\star})

with cn1/nc_{n}\lesssim 1/\sqrt{n} given in Theorem 1.

The proof of Corollary 5 is also provided in Section C.3.

Approximation of the correlation kernel

An important quantity for controlling the number of points necessary to approximate the correlation kernel well is the so-called effective dimension deff(γ)=Tr(𝖠(𝖠+γ𝕀)1),d_{\rm eff}(\gamma)=\operatorname{Tr}\left(\mathsf{A}(\mathsf{A}+\gamma\mathbb{I})^{-1}\right), which is the expected sample size under the DPP with correlation kernel 𝖪=𝖠(𝖠+γ𝕀)1\mathsf{K}=\mathsf{A}(\mathsf{A}+\gamma\mathbb{I})^{-1}.
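In the same Monte Carlo spirit as before, deff(γ)d_{\rm eff}(\gamma) can be approximated by replacing the nonzero spectrum of 𝖠=SAS\mathsf{A}=SAS^{*} with that of SpASpS_{p}AS_{p}^{*} for pp i.i.d. points; for the representer form (8), this again only involves cross kernel matrices. The sketch below uses a random stand-in representer matrix and a Gaussian kernel, both illustrative assumptions of ours.

```python
import numpy as np

def gram(X, Y, sigma=0.1):
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(5)
m, p, gamma = 20, 1000, 1.0
Z = rng.uniform(size=(m, 2))
G = rng.normal(size=(m, m))
C = G @ G.T / m + 1e-10 * np.eye(m)     # stand-in representer matrix of A (illustrative)

# Nonzero eigenvalues of S_p A S_p^* = (1/p) K_pm C K_pm^T, computed through an m x m matrix
Xpp = rng.uniform(size=(p, 2))          # p i.i.d. points from mu
K_pm = gram(Xpp, Z)
L = np.linalg.cholesky(C)               # C = L L^T
eigs = np.linalg.eigvalsh(L.T @ (K_pm.T @ K_pm / p) @ L)

d_eff = np.sum(eigs / (eigs + gamma))   # approximate expected sample size of the fitted DPP
print(d_eff)
```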

Theorem 6 (Correlation kernel approximation).

Let δ(0,1)\delta\in(0,1) be a failure probability, let ϵ(0,1)\epsilon\in(0,1) be an accuracy parameter and let γ>0\gamma>0 be a scale factor. Let 𝖪(γ)\mathsf{K}(\gamma) be the correlation kernel (10) defined with 𝖠=SAS\mathsf{A}=SAS^{*}. Consider 𝖪^(γ)\hat{\mathsf{K}}(\gamma) defined in (11) with i.i.d. sampling of pp points in 𝒳\mathcal{X} w.r.t. μ\mu. If we take p8κ2Aopγϵ2log(4deff(γ)δ𝖪op),p\geq\frac{8\kappa^{2}\|A\|_{op}}{\gamma\epsilon^{2}}\log\left(\frac{4d_{\rm eff}(\gamma)}{\delta\|\mathsf{K}\|_{op}}\right), then, with probability 1δ1-\delta, the following multiplicative error bound holds 11+ϵ𝖪(γ)𝖪^(γ)11ϵ𝖪(γ).\frac{1}{1+\epsilon}\mathsf{K}(\gamma)\preceq\hat{\mathsf{K}}(\gamma)\preceq\frac{1}{1-\epsilon}\mathsf{K}(\gamma).

The proof of Theorem 6, given in Section C.5, mainly relies on a matrix Bernstein inequality. Let us make a few comments. First, we can simply take γ=1\gamma=1 in Theorem 6 to recover the common definition of the correlation kernel (1). Second, the presence of deff(γ)d_{\rm eff}(\gamma) in the logarithm is welcome since it is the expected subset size of the L-ensemble. Third, the quantity Aop\|A\|_{op} directly influences the typical sample size to get an accurate approximation. A worst case bound is Aopλmax(𝐂)λmax(𝐊)\|A\|_{op}\leq\lambda_{\max}(\mathbf{C})\lambda_{\max}(\mathbf{K}) with 𝐊=[k(zi,zj)]1i,jm\mathbf{K}=[k_{\mathcal{H}}(z_{i},z_{j})]_{1\leq i,j\leq m} and where we used that A=mSm𝐂SmA=mS_{m}^{*}\mathbf{C}S_{m} in light of (8). Thus, the lower bound on pp may be large in practice. Although probably more costly, an approach inspired from approximate ridge leverage score sampling [Rudi et al., 2018] is likely to allow smaller values of pp. We leave this to future work.

6 Empirical evaluation

We consider an L-ensemble with correlation kernel 𝗄(x,y)=ρexp(xy22/α2)\mathsf{k}(x,y)=\rho\exp(-\|x-y\|_{2}^{2}/\alpha^{2}) defined on d\mathbb{R}^{d} with α=0.05\alpha=0.05. Following Lavancier et al. [2014], this is a valid kernel if ρ<(πα)d\rho<(\sqrt{\pi}\alpha)^{-d}. Note that the intensity, defined as x𝗄(x,x)x\mapsto\mathsf{k}(x,x), is constant equal to ρ\rho; we shall check that the fitted kernel recovers that property.

Figure 1: Intensity estimation with σ=0.1\sigma=0.1 and λ=0.1\lambda=0.1 from 1 DPP sample with ρ=50\rho=50 (top row) and ρ=100\rho=100 (bottom row). On the LHS, a DPP sample (hexagons) and n=1000n=1000 uniform samples (crosses); the color is the diagonal of 𝚽𝐁𝚽\bm{\Phi}^{\top}\mathbf{B}\bm{\Phi} (in-sample likelihood kernel). On the RHS, out-of-sample estimated intensity 𝗄^(x,x)\hat{\mathsf{k}}(x,x) of the learned process on a 100×100100\times 100 grid.

We draw samples from this continuous DPP in the window 𝒳=[0,1]2\mathcal{X}=[0,1]^{2}, using the code of Poinas and Lavancier [2021], available at https://github.com/APoinas/MLEDPP; it relies on the R package spatstat [Baddeley et al., 2015], available under the GPL-2 / GPL-3 licence. Two such samples are shown as hexagons in the first column of Figure 1, with respective intensity ρ=50\rho=50 and ρ=100\rho=100. For the estimation, we use a Gaussian kernel k(x,y)=exp(xy22/(2σ2))k_{\mathcal{H}}(x,y)=\exp\left(-\|x-y\|_{2}^{2}/(2\sigma^{2})\right) with σ>0\sigma>0. The computation of the correlation kernel always uses p=1000p=1000 uniform samples. Iteration (14) is run until the precision threshold 𝗍𝗈𝗅=105\mathsf{tol}=10^{-5} is achieved. For stability, we add 10^{-10} to the diagonal of the Gram matrix 𝐊\mathbf{K}. The remaining parameter values are given in captions. We empirically observe that the regularized Picard iteration returns a matrix 𝐁\mathbf{B}_{\star} such that 𝚽𝐁𝚽\bm{\Phi}^{\top}\mathbf{B}_{\star}\bm{\Phi} is low rank; see Figure 2 (left). A lesson from Figure 1 is that the sample size of the DPP has to be large enough to retrieve a constant intensity k^(x,x)\hat{k}(x,x). In particular, the top row of this figure illustrates a case where σ\sigma is too small. Also, due to the large regularization λ=0.1\lambda=0.1 and the use of only one DPP sample, the scale of ρ\rho is clearly underestimated in this example. On the contrary, in Figure 3, for a smaller regularization parameter λ=0.01\lambda=0.01, the intensity scale estimate is larger. We also observe that a large regularization parameter tends to smooth out the local variations of the intensity, which is not surprising.

Figure 2: Analysis of the solution corresponding to the example of Figure 1 with ρ=100\rho=100. Left: eigenvalues of 𝚽𝐁𝚽\bm{\Phi}^{\top}\mathbf{B}\bm{\Phi}. Middle: Gram matrix of 𝗄^(x,y)\hat{\mathsf{k}}(x,y) on a regular 30×3030\times 30 grid within [0,1]2[0,1]^{2}. Right: Gram matrix of 𝗄(x,y)\mathsf{k}(x,y) on the same grid.

A comparison between a Gram matrix of 𝗄^(x,y)\hat{\mathsf{k}}(x,y) and 𝗄(x,y)\mathsf{k}(x,y) is given in Figure 2 corresponding to the example of Figure 1. The short-range diagonal structure is recovered, while some long-range structures are smoothed out. More illustrative simulations are given in Section D, with a study of the influence of the hyperparameters, including the use of s>1s>1 DPP samples. In particular, the estimation of the intensity is improved if several DPP samples are used with a smaller value of λ\lambda.

Figure 3: Intensity estimation with σ=0.1\sigma=0.1 and λ=0.01\lambda=0.01 from 1 DPP sample with ρ=100\rho=100. On the LHS, a DPP sample (hexagons) and n=1000n=1000 uniform samples (crosses); the color is the diagonal of 𝚽𝐁𝚽\bm{\Phi}^{\top}\mathbf{B}\bm{\Phi} (in-sample likelihood kernel). On the RHS, out-of-sample estimated intensity 𝗄^(x,x)\hat{\mathsf{k}}(x,x) of the learned process on a 100×100100\times 100 grid.

7 Discussion

We leveraged recent progress on kernel methods to propose a nonparametric approach to learning continuous DPPs. We see three major limitations of our procedure. First, our final objective function is nonconvex, and our algorithm is only guaranteed to improve its objective monotonically. Experimental evidence suggests that our approach recovers the synthetic kernel, but more work is needed to study the maximizers of the likelihood, in the spirit of Brunel et al. [2017a] for finite DPPs, and the properties of our fixed point algorithm. Second, the estimated integral kernel does not have any explicit structure, other than being implicitly forced to be low-rank because of the trace penalty. Adding structural assumptions might be desirable, either for modelling or learning purposes. For modelling, it is not uncommon to assume that the underlying continuous DPP is stationary, for example, which implies that the correlation kernel 𝗄(x,y)\mathsf{k}(x,y) depends only on xyx-y. For learning, structural assumptions on the kernel may reduce the computational cost, or reduce the number of maximizers of the likelihood. The third limitation of our pipeline is that, like most nonparametric methods, it still requires tuning a handful of hyperparameters, and, in our experience, the final performance varies significantly with the lengthscale of the RKHS kernel or the coefficient of the trace penalty. An automatic tuning procedure with guarantees would make the pipeline turn-key.

Maybe unexpectedly, future work could also deal with transferring our pipeline to the finite DPP setting. Indeed, we saw in Section 4 that in some asymptotic regime, our regularized MLE objective is close to a regularized version of the MLE objective for a finite DPP. Investigating this maximum a posteriori inference problem may shed some new light on nonparametric inference for finite DPPs. Intuitively, regularization should improve learning and prediction when data is scarce.

Acknowledgements

We thank Arnaud Poinas and Guillaume Gautier for useful discussions on the manuscript. We acknowledge support from ERC grant Blackjack (ERC-2019-STG-851866) and ANR AI chair Baccarat (ANR-20-CHIA-0002).

Supplementary Material

Roadmap.

In Section A, we present useful technical results. Next, we show that in a specific case the problem (6) admits a closed-form solution, which we discuss in Section B. Notably, this special case allows us to understand the behaviour of the estimated DPP kernel in both the small and large regularization (λ\lambda) limits. In Section C, all the deferred proofs are given. Finally, Section D provides a finer analysis of the empirical results of Section 6 as well as a description of the convergence of the regularized Picard algorithm to the closed-form solution described in Section B.

Appendix A Useful technical results

A.1 Technical lemmata

We preface the proofs of our main results with three lemmata.

Lemma 7.

Let kk be a strictly positive definite kernel of a RKHS \mathcal{H} and let 𝐂\mathbf{C} be a n×nn\times n symmetric matrix. Assume that {xi}1in\{x_{i}\}_{1\leq i\leq n} are such that the Gram matrix 𝐊=[k(xi,xj)]1i,jn\mathbf{K}=[k_{\mathcal{H}}(x_{i},x_{j})]_{1\leq i,j\leq n} is non-singular. Then, the non-zero eigenvalues of i,j=1n𝐂ijϕ(xi)ϕ(xj)¯\sum_{i,j=1}^{n}\mathbf{C}_{ij}\phi(x_{i})\otimes\overline{\phi(x_{j})} correspond to the non-zero eigenvalues of 𝐊𝐂\mathbf{K}\mathbf{C}.

Proof.

By definition of the sampling operator (see Section 1.1), it holds that Sn𝐂Sn=1ni,j=1n𝐂ijϕ(xi)ϕ(xj)¯S_{n}^{*}\mathbf{C}S_{n}=\frac{1}{n}\sum_{i,j=1}^{n}\mathbf{C}_{ij}\phi(x_{i})\otimes\overline{\phi(x_{j})} and we have SnSn𝐂=1n𝐊𝐂S_{n}S_{n}^{*}\mathbf{C}=\frac{1}{n}\mathbf{K}\mathbf{C}. Thus, we need to show that the non-zero eigenvalues of Sn𝐂SnS_{n}^{*}\mathbf{C}S_{n} correspond to the non-zero eigenvalues of SnSn𝐂S_{n}S_{n}^{*}\mathbf{C}.

Let gλg_{\lambda}\in\mathcal{H} be an eigenvector of Sn𝐂SnS_{n}^{*}\mathbf{C}S_{n} with eigenvalue λ0\lambda\neq 0. First, we show that SngλS_{n}g_{\lambda} is an eigenvector of SnSn𝐂S_{n}S_{n}^{*}\mathbf{C} with eigenvalue λ\lambda. We have Sn𝐂Sngλ=λgλS_{n}^{*}\mathbf{C}S_{n}g_{\lambda}=\lambda g_{\lambda}. By acting on both sides of the latter equation with SnS_{n}, we find Sn(Sn𝐂Sn)gλ=λSngλS_{n}(S_{n}^{*}\mathbf{C}S_{n})g_{\lambda}=\lambda S_{n}g_{\lambda}. This is equivalent to SnSn𝐂(Sngλ)=λ(Sngλ)S_{n}S_{n}^{*}\mathbf{C}(S_{n}g_{\lambda})=\lambda(S_{n}g_{\lambda}). Second, since SnSn0S_{n}S_{n}^{*}\succ 0, remark that SnSn𝐂S_{n}S_{n}^{*}\mathbf{C} is related by a similarity to (SnSn)1/2𝐂(SnSn)1/2(S_{n}S_{n}^{*})^{1/2}\mathbf{C}(S_{n}S_{n}^{*})^{1/2}, which is diagonalizable. Since Sn𝐂SnS_{n}^{*}\mathbf{C}S_{n} is at most of rank nn, the non-zero eigenvalues of Sn𝐂SnS_{n}^{*}\mathbf{C}S_{n} match the non-zero eigenvalues of (SnSn)1/2𝐂(SnSn)1/2(S_{n}S_{n}^{*})^{1/2}\mathbf{C}(S_{n}S_{n}^{*})^{1/2}, which in turn are the same as the non-zero eigenvalues of SnSn𝐂S_{n}S_{n}^{*}\mathbf{C}. ∎
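A quick numerical check of the key similarity step in this proof, in a hedged sketch of ours with a Gaussian kernel and a random symmetric \mathbf{C}: \mathbf{K}\mathbf{C} is similar to \mathbf{K}^{1/2}\mathbf{C}\mathbf{K}^{1/2}, so the two share their (real) spectrum.

```python
import numpy as np

def gram(X, sigma=0.1):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(6)
n = 7
X = rng.uniform(size=(n, 2))
K = gram(X)                                       # non-singular Gram matrix
C = rng.normal(size=(n, n)); C = (C + C.T) / 2    # symmetric matrix

# K C = K^{1/2} (K^{1/2} C K^{1/2}) K^{-1/2}: a similarity, hence identical spectra.
w, V = np.linalg.eigh(K)
K_half = V @ np.diag(np.sqrt(w)) @ V.T
e1 = np.sort(np.linalg.eigvals(K @ C).real)
e2 = np.sort(np.linalg.eigvalsh(K_half @ C @ K_half))
print(np.allclose(e1, e2))                        # True (up to numerical error)
```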

Lemma 8.

Let 𝚺𝒮+(m)\mathbf{\Sigma}\in\mathcal{S}_{+}(\mathbb{R}^{m}) and let \mathcal{I} be a subset of {1,,m}\{1,\dots,m\}. Then, the function

𝚺logdet(𝚺)+logdet(𝐈m+𝚺1/||)\mathbf{\Sigma}\mapsto\log\det(\mathbf{\Sigma})+\log\det(\mathbf{I}_{m}+\mathbf{\Sigma}^{-1}/|\mathcal{I}|)_{\mathcal{I}\mathcal{I}} (15)

is strictly concave on {𝚺0}\{\mathbf{\Sigma}\succ 0\}.

Proof.

To simplify the expression, we do the change of variables 𝚺𝚺/||\mathbf{\Sigma}\mapsto\mathbf{\Sigma}/|\mathcal{I}| and analyse logdet(𝚺)+logdet(𝐈m+𝚺1)\log\det(\mathbf{\Sigma})+\log\det(\mathbf{I}_{m}+\mathbf{\Sigma}^{-1})_{\mathcal{I}\mathcal{I}} which differs from the original function by an additive constant. Let 𝐔\mathbf{U}_{\mathcal{I}} be the matrix obtained by selecting the columns of the identity matrix which are indexed by \mathcal{I}. We rewrite the second term in (15) as

logdet(𝐈m+𝚺1)=logdet(𝐈||+𝐔𝚺1𝐔)=logdet(𝐈m+𝚺1/2𝐔𝐔𝚺1/2),\log\det(\mathbf{I}_{m}+\mathbf{\Sigma}^{-1})_{\mathcal{I}\mathcal{I}}=\log\det(\mathbf{I}_{|\mathcal{I}|}+\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}_{\mathcal{I}})=\log\det(\mathbf{I}_{m}+\mathbf{\Sigma}^{-1/2}\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{\Sigma}^{-1/2}),

by Sylvester’s identity. This leads to

logdet(𝚺)+logdet(𝐈m+𝚺1)=logdet(𝚺+𝐔𝐔).\log\det(\mathbf{\Sigma})+\log\det(\mathbf{I}_{m}+\mathbf{\Sigma}^{-1})_{\mathcal{I}\mathcal{I}}=\log\det(\mathbf{\Sigma}+\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top}).

To check that logdet(𝚺+𝐔𝐔)\log\det(\mathbf{\Sigma}+\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top}) is strictly concave on {𝚺0}\{\mathbf{\Sigma}\succ 0\}, we verify that its Hessian is negative in any direction 𝐇\mathbf{H}. Its first order directional derivative reads Tr((𝚺+𝐔𝐔)1𝐇)\operatorname{Tr}\left((\mathbf{\Sigma}+\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top})^{-1}\mathbf{H}\right). Hence, the second order directional derivative in the direction of 𝐇\mathbf{H} writes

Tr((𝚺+𝐔𝐔)1𝐇(𝚺+𝐔𝐔)1𝐇),-\operatorname{Tr}\left((\mathbf{\Sigma}+\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top})^{-1}\mathbf{H}(\mathbf{\Sigma}+\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top})^{-1}\mathbf{H}\right),

which is indeed a negative number. ∎

The following result is borrowed from Mariet and Sra [2015, Lemma 2.3.].

Lemma 9.

Let 𝚺𝒮+(m)\mathbf{\Sigma}\in\mathcal{S}_{+}(\mathbb{R}^{m}) and let 𝐔m×\mathbf{U}\in\mathbb{R}^{m\times\ell} be a matrix with m\ell\leq m orthonormal columns. Then, the function logdet(𝐔𝚺1𝐔)-\log\det(\mathbf{U}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}) is concave on 𝚺0\mathbf{\Sigma}\succ 0.

Proof.

The function logdet(𝐔𝚺1𝐔)-\log\det(\mathbf{U}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}) is concave on 𝚺0\mathbf{\Sigma}\succ 0 since logdet(𝐔𝚺1𝐔)\log\det(\mathbf{U}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}) is convex on 𝚺0\mathbf{\Sigma}\succ 0 for any 𝐔\mathbf{U} such that 𝐔𝐔=𝐈\mathbf{U}^{\top}\mathbf{U}=\mathbf{I} as stated in Mariet and Sra [2015, Lemma 2.3.]. ∎

A.2 Use of the representer theorem

We here clarify the definition of the representer theorem used in this paper.

A.2.1 Extended representer theorem

In Section 3, we used a slight extension of the representer theorem of Marteau-Ferey et al. [2020] which we clarify here. Let us first define some notations. Let \mathcal{H} be a RKHS with feature map ϕ()\phi(\cdot) and {z1,,zm}\{z_{1},\dots,z_{m}\} be a data set such as defined in Section 1.1. Define

hA(z)=ϕ(z),Aϕ(z) and hA(z,z)=ϕ(z),Aϕ(z).h_{A}(z)=\langle\phi(z),A\phi(z)\rangle\text{ and }h_{A}(z,z^{\prime})=\langle\phi(z),A\phi(z^{\prime})\rangle.

In this paper, we consider the problem

minA𝒮+()L(hA(zi,zj))1i,jm+Tr(A),\min_{A\in\mathcal{S}_{+}(\mathcal{H})}L\left(h_{A}(z_{i},z_{j})\right)_{1\leq i,j\leq m}+\operatorname{Tr}(A), (16)

where LL is a loss function (specified below). In contrast, the first term of the problem considered in Marteau-Ferey et al. [2020] is of the following form: L(hA(zi))1imL(h_{A}(z_{i}))_{1\leq i\leq m}. In other words, the latter loss function involves only diagonal elements ϕ(zi),Aϕ(zi)\langle\phi(z_{i}),A\phi(z_{i})\rangle for 1im1\leq i\leq m while (16) also involves off-diagonal elements. Now, denote by Πm\Pi_{m} the projector on span{ϕ(zi),i=1,,m}\operatorname{\mathrm{span}}\{\phi(z_{i}),i=1,\dots,m\}, and define

𝒮m,+()={ΠmAΠm:A𝒮+()}.\mathcal{S}_{m,+}(\mathcal{H})=\{\Pi_{m}A\Pi_{m}:A\in\mathcal{S}_{+}(\mathcal{H})\}.

Then, we have the following proposition.

Proposition 10 (Extension of Proposition 7 in Marteau-Ferey et al. [2020]).

Let LL be a lower semi-continuous function such that L(hA(zi,zj))1i,jm+Tr(A)L(h_{A}(z_{i},z_{j}))_{1\leq i,j\leq m}+\operatorname{Tr}(A) is bounded below. Then (16) has a solution AA_{\star} which is in 𝒮m,+()\mathcal{S}_{m,+}(\mathcal{H}).

Proof sketch.

The key step is the following identity

hA(zi,zj)=ϕ(zi),Aϕ(zj)=ϕ(zi),ΠmAΠmϕ(zj)=hΠmAΠm(zi,zj),(1i,jm)h_{A}(z_{i},z_{j})=\langle\phi(z_{i}),A\phi(z_{j})\rangle=\langle\phi(z_{i}),\Pi_{m}A\Pi_{m}\phi(z_{j})\rangle=h_{\Pi_{m}A\Pi_{m}}(z_{i},z_{j}),\ (1\leq i,j\leq m)

which is a direct consequence of the definition of Πm\Pi_{m}. Also, we have Tr(ΠmAΠm)Tr(A)\operatorname{Tr}(\Pi_{m}A\Pi_{m})\leq\operatorname{Tr}(A). The remainder of the proof follows exactly the same lines as in Marteau-Ferey et al. [2020]. Notice that, compared with Marteau-Ferey et al. [2020, Proposition 7], we do not require the loss LL to be lower bounded but rather ask the full objective to be lower bounded, which is a weaker assumption but does not alter the argument. ∎

A.2.2 Applying the extended representer theorem

Now, let us prove that the objective of (6), given by fn(A)+λTr(A),f_{n}(A)+\lambda\operatorname{Tr}(A), is lower bounded. We recall that

fn(A)=1s=1slogdet[ϕ(xi),Aϕ(xj)]i,j𝒞+logdet(𝐈n+SnASn).f_{n}(A)=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\langle\phi(x_{i}),A\phi(x_{j})\rangle\right]_{i,j\in\mathcal{C}_{\ell}}+\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*}).

For clarity, we recall that 𝒞=1s𝒞\mathcal{C}\triangleq\cup_{\ell=1}^{s}\mathcal{C}_{\ell} and ={x1,,xn}\mathcal{I}=\{x^{\prime}_{1},\dots,x^{\prime}_{n}\}. Then, write the set of points 𝒵𝒞\mathcal{Z}\triangleq\mathcal{C}\cup\mathcal{I} as {z1,,zm}\{z_{1},\dots,z_{m}\} with m=|𝒞|+nm=|\mathcal{C}|+n. Recall that we placed ourselves on an event happening with probability one where the sets 𝒞1,,𝒞s,\mathcal{C}_{1},\dots,\mathcal{C}_{s},\mathcal{I} are disjoint. Now, we denote by Πm\Pi_{m} the orthogonal projector on span{ϕ(zi):1im}\operatorname{\mathrm{span}}\{\phi(z_{i}):1\leq i\leq m\}, which writes

Πm=i,j=1m(K1)ijϕ(zi)ϕ(zj)¯.\Pi_{m}=\sum_{i,j=1}^{m}(K^{-1})_{ij}\phi(z_{i})\otimes\overline{\phi(z_{j})}. (17)

First, we have logdet(𝐈n+SnASn)0\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*})\geq 0. Second, we use

Tr(A)=Tr(ΠmAΠm)+Tr(ΠmAΠm)Tr(ΠmAΠm)=Tr(ΠmA).\operatorname{Tr}(A)=\operatorname{Tr}\left(\Pi_{m}A\Pi_{m}\right)+\operatorname{Tr}\left(\Pi_{m}^{\perp}A\Pi_{m}^{\perp}\right)\geq\operatorname{Tr}\left(\Pi_{m}A\Pi_{m}\right)=\operatorname{Tr}\left(\Pi_{m}A\right).

Thus, a lower bound on the objective (6) is

fn(A)+λTr(A)1s=1slogdet[ϕ(xi),Aϕ(xj)]i,j𝒞+λTr(ΠmA).\displaystyle f_{n}(A)+\lambda\operatorname{Tr}(A)\geq-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\langle\phi(x_{i}),A\phi(x_{j})\rangle\right]_{i,j\in\mathcal{C}_{\ell}}+\lambda\operatorname{Tr}\left(\Pi_{m}A\right).

Now we define the matrix MM with elements Mij=ϕ(zi),Aϕ(zj)M_{ij}=\langle\phi(z_{i}),A\phi(z_{j})\rangle for 1i,jm1\leq i,j\leq m and notice that

Tr(ΠmA)=Tr(MK1)Tr(M)/λmax(K).\operatorname{Tr}\left(\Pi_{m}A\right)=\operatorname{Tr}(MK^{-1})\geq\operatorname{Tr}(M)/\lambda_{\max}(K).

Remark that the matrix defined by M()=[ϕ(xi),Aϕ(xj)]i,j𝒞M^{(\ell)}=\left[\langle\phi(x_{i}),A\phi(x_{j})\rangle\right]_{i,j\in\mathcal{C}_{\ell}} associated with the \ell-th DPP sample for 1s1\leq\ell\leq s is a principal submatrix of the m×mm\times m matrix M=[ϕ(zi),Aϕ(zj)]1i,jmM=\left[\langle\phi(z_{i}),A\phi(z_{j})\rangle\right]_{1\leq i,j\leq m}. Since the sets 𝒞\mathcal{C}_{\ell} are disjoint, we have =1sTr(M())Tr(M)\sum_{\ell=1}^{s}\operatorname{Tr}(M^{(\ell)})\leq\operatorname{Tr}(M). Denoting λ=λλmax(K)\lambda^{\prime}=\frac{\lambda}{\lambda_{\max}(K)}, by using the last inequality, we find

fn(A)+λTr(A)\displaystyle f_{n}(A)+\lambda\operatorname{Tr}(A) 1s=1slogdetM()+λ=1sTr(M())\displaystyle\geq-\frac{1}{s}\sum_{\ell=1}^{s}\log\det M^{(\ell)}+\lambda^{\prime}\sum_{\ell=1}^{s}\operatorname{Tr}(M^{(\ell)})
=1s=1s(logdetM()+sλTr(M())).\displaystyle=\frac{1}{s}\sum_{\ell=1}^{s}\left(-\log\det M^{(\ell)}+s\lambda^{\prime}\operatorname{Tr}(M^{(\ell)})\right).

Finally, we use the following proposition to show that each term in the above sum is bounded from below.

Proposition 11.

Let a>0a>0. The function h(σ)=log(σ)+aσh(\sigma)=-\log(\sigma)+a\sigma satisfies h(σ)h(1/a)=1+log(a)h(\sigma)\geq h(1/a)=1+\log(a) for all σ>0\sigma>0.

Thus a lower bound for the objective of (6) is obtained by applying Proposition 11 with a=sλ/λmax(K)a=s\lambda/\lambda_{\max}(K) to each eigenvalue of M()M^{(\ell)} for all 1s1\leq\ell\leq s.

A.3 Boundedness of the discrete objective function

At first sight, we may wonder if the objective of the optimization problem (12) is lower bounded. We show here that the optimization problem is well-defined for all regularization parameters λ>0\lambda>0. The lower boundedness of the discrete objective follows directly from Section A.2.2 in the case of a finite rank operator. For completeness, we repeat below the argument in the discrete case. Explicitly, this discrete objective reads

g(𝐗)=1s=1slogdet(𝐗𝒞𝒞)+logdet(𝐈||+1n𝐗)+λTr(𝐗𝐊1),g(\mathbf{X})=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})+\log\det\left(\mathbf{I}_{|\mathcal{I}|}+\frac{1}{n}\mathbf{X}_{\mathcal{I}\mathcal{I}}\right)+\lambda\operatorname{Tr}(\mathbf{X}\mathbf{K}^{-1}),

for all 𝐗0\mathbf{X}\succeq 0. First, since 𝐗0\mathbf{X}_{\mathcal{I}\mathcal{I}}\succeq 0, we have logdet(𝐈||+1n𝐗)0\log\det\left(\mathbf{I}_{|\mathcal{I}|}+\frac{1}{n}\mathbf{X}_{\mathcal{I}\mathcal{I}}\right)\geq 0. Next, we use that 𝐊0\mathbf{K}\succ 0 by assumption, which implies that Tr(𝐗𝐊1)Tr(𝐗)/λmax(𝐊)\operatorname{Tr}(\mathbf{X}\mathbf{K}^{-1})\geq\operatorname{Tr}(\mathbf{X})/\lambda_{\max}(\mathbf{K}). Denote λ=λλmax(𝐊)\lambda^{\prime}=\frac{\lambda}{\lambda_{\max}(\mathbf{K})}. Thus, we can lower bound the objective function as follows:

g(𝐗)\displaystyle g(\mathbf{X}) 1s=1slogdet(𝐗𝒞𝒞)+λTr(𝐗)\displaystyle\geq-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})+\lambda^{\prime}\operatorname{Tr}(\mathbf{X})
1s=1s(logdet(𝐗𝒞𝒞)+λsTr(𝐗𝒞𝒞)),\displaystyle\geq\frac{1}{s}\sum_{\ell=1}^{s}\big{(}-\log\det(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})+\lambda^{\prime}s\operatorname{Tr}(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})\big{)},

where we used that Tr(𝐗)=Tr(𝐗)+=1sTr(𝐗𝒞𝒞)\operatorname{Tr}(\mathbf{X})=\operatorname{Tr}(\mathbf{X}_{\mathcal{I}\mathcal{I}})+\sum_{\ell=1}^{s}\operatorname{Tr}(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}}) with Tr(𝐗)0\operatorname{Tr}(\mathbf{X}_{\mathcal{I}\mathcal{I}})\geq 0 for 𝐗0\mathbf{X}\succeq 0. Hence, by using Proposition 11, g(𝐗)g(\mathbf{X}) is bounded from below by a sum of functions which are lower bounded.

Appendix B Extra result: an explicit solution for the single-sample case

We note that if the dataset is made of only one sample (i.e., s=1s=1), and if that sample is also used to approximate the Fredholm determinant (i.e., =𝒞\mathcal{I}=\mathcal{C}), then the problem (6) admits an explicit solution.

Proposition 12.

Let 𝒞={xi𝒳}1im\mathcal{C}=\{x_{i}\in\mathcal{X}\}_{1\leq i\leq m} be such that the kernel matrix 𝐊=[k(xi,xj)]1i,jm\mathbf{K}=[k_{\mathcal{H}}(x_{i},x_{j})]_{1\leq i,j\leq m} is invertible. Consider the problem (6) with =𝒞\mathcal{I}=\mathcal{C} and λ>0\lambda>0. Then the solution of (6) has the form A=i,j=1m𝐂ijϕ(xi)ϕ(xj)¯A_{\star}=\sum_{i,j=1}^{m}\mathbf{C}_{\star ij}\phi(x_{i})\otimes\overline{\phi(x_{j})}, with

𝐂=𝐂(λ)=12𝐊2((m2𝐈m+4m𝐊/λ)1/2m𝐈m).\mathbf{C}_{\star}=\mathbf{C}_{\star}(\lambda)=\frac{1}{2}\mathbf{K}^{-2}\left((m^{2}\mathbf{I}_{m}+4m\mathbf{K}/\lambda)^{1/2}-m\mathbf{I}_{m}\right).
Proof.

By the representer theorem in Marteau-Ferey et al. [2020, Theorem 1], the solution is of the form A=i,j=1m𝐂ijϕ(xi)ϕ(xj)¯A=\sum_{i,j=1}^{m}\mathbf{C}_{ij}\phi(x_{i})\otimes\overline{\phi(x_{j})} where 𝐂\mathbf{C} is the solution of

min𝐂0logdet(𝐊𝐂𝐊)+logdet(𝐈m+𝐊𝐂𝐊/m)+λTr(𝐊𝐂),\min_{\mathbf{C}\succ 0}-\log\det\left(\mathbf{K}\mathbf{C}\mathbf{K}\right)+\log\det\left(\mathbf{I}_{m}+\mathbf{K}\mathbf{C}\mathbf{K}/m\right)+\lambda\operatorname{Tr}(\mathbf{K}\mathbf{C}),

Define 𝐗=𝐊𝐂𝐊\mathbf{X}=\mathbf{K}\mathbf{C}\mathbf{K} and denote the objective by g(𝐗)=logdet(𝐗)+logdet(𝐈m+1m𝐗)+λTr(𝐊1𝐗)g(\mathbf{X})=-\log\det(\mathbf{X})+\log\det(\mathbf{I}_{m}+\frac{1}{m}\mathbf{X})+\lambda\operatorname{Tr}(\mathbf{K}^{-1}\mathbf{X}). Let 𝐇\mathbf{H} be an m×mm\times m symmetric matrix. By taking a directional derivative of the objective in the direction 𝐇\mathbf{H}, we find the first-order condition Tr[D𝐇g(𝐗)𝐇]=0,\operatorname{Tr}\left[D_{\mathbf{H}}g(\mathbf{X})\mathbf{H}\right]=0, with D𝐇g(𝐗)=𝐗1+(𝐗+m𝐈m)1+λ𝐊1.D_{\mathbf{H}}g(\mathbf{X})=-\mathbf{X}^{-1}+(\mathbf{X}+m\mathbf{I}_{m})^{-1}+\lambda\mathbf{K}^{-1}. Since this condition should be satisfied for all 𝐇\mathbf{H}, we simply solve the equation D𝐇g(𝐗)=0D_{\mathbf{H}}g(\mathbf{X})=0. Thanks to the invertibility of 𝐊\mathbf{K}, we have equivalently

𝐗2+m𝐗mλ𝐊=0.\mathbf{X}^{2}+m\mathbf{X}-\frac{m}{\lambda}\mathbf{K}=0.

A simple algebraic manipulation yields 𝐗=12((m2𝐈m+4m𝐊/λ)1/2m𝐈m)0\mathbf{X}_{\star}=\frac{1}{2}\left((m^{2}\mathbf{I}_{m}+4m\mathbf{K}/\lambda)^{1/2}-m\mathbf{I}_{m}\right)\succ 0. To check that 𝐗\mathbf{X}_{\star} is a minimum, it is sufficient to analyse the Hessian of the objective in any direction 𝐇\mathbf{H}. Let t0t\geq 0 be small enough that 𝐗+t𝐇0\mathbf{X}+t\mathbf{H}\succ 0. The second-order derivative of g(𝐗+t𝐇)g(\mathbf{X}+t\mathbf{H}) reads

d2dt2g(𝐗+t𝐇)|t=0=Tr(𝐇𝐗1𝐇𝐗1)Tr(𝐇(𝐗+m𝐈m)1𝐇(𝐗+m𝐈m)1).\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}g(\mathbf{X}+t\mathbf{H})|_{t=0}=\operatorname{Tr}(\mathbf{H}\mathbf{X}^{-1}\mathbf{H}\mathbf{X}^{-1})-\operatorname{Tr}(\mathbf{H}(\mathbf{X}+m\mathbf{I}_{m})^{-1}\mathbf{H}(\mathbf{X}+m\mathbf{I}_{m})^{-1}).

By using the first order condition 𝐗1=(𝐗+m𝐈m)1+λ𝐊1\mathbf{X}^{-1}_{\star}=(\mathbf{X}_{\star}+m\mathbf{I}_{m})^{-1}+\lambda\mathbf{K}^{-1}, we find

d2dt2g(𝐗+t𝐇)|t=0\displaystyle\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}g(\mathbf{X}_{\star}+t\mathbf{H})|_{t=0} =Tr(𝐇(λ𝐊1)𝐇(λ𝐊1+2(𝐗+m𝐈m)1))\displaystyle=\operatorname{Tr}\Big{(}\mathbf{H}(\lambda\mathbf{K}^{-1})\mathbf{H}\big{(}\lambda\mathbf{K}^{-1}+2(\mathbf{X}_{\star}+m\mathbf{I}_{m})^{-1}\big{)}\Big{)}
=Tr(𝐌𝐍𝐌).\displaystyle=\operatorname{Tr}\Big{(}\mathbf{M}\mathbf{N}\mathbf{M}\Big{)}.

with 𝐌=(λ𝐊1)1/2𝐇(λ𝐊1)1/2\mathbf{M}=(\lambda\mathbf{K}^{-1})^{1/2}\mathbf{H}(\lambda\mathbf{K}^{-1})^{1/2} and

𝐍=𝐈+2(λ𝐊1)1/2(𝐗+m𝐈m)1(λ𝐊1)1/20.\mathbf{N}=\mathbf{I}+2(\lambda\mathbf{K}^{-1})^{-1/2}(\mathbf{X}_{\star}+m\mathbf{I}_{m})^{-1}(\lambda\mathbf{K}^{-1})^{-1/2}\succ 0.

Notice that 𝐌𝐍𝐌0\mathbf{M}\mathbf{N}\mathbf{M}\succeq 0 since 𝐍0\mathbf{N}\succ 0 and 𝐌\mathbf{M} is symmetric. Therefore, d2dt2g(𝐗+t𝐇)|t=00\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}g(\mathbf{X}_{\star}+t\mathbf{H})|_{t=0}\geq 0. Since this is true for any direction 𝐇\mathbf{H}, the matrix 𝐗\mathbf{X}_{\star} is a local minimum of g(𝐗)g(\mathbf{X}). ∎
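As a sanity check, the closed form of Proposition 12 is easy to verify numerically. The following minimal sketch (NumPy/SciPy) builds a hypothetical Gaussian kernel matrix — the data, bandwidth and jitter below are illustrative only, not taken from the experiments — and checks that X_* = K C_* K solves the stationarity equation X² + mX − (m/λ)K = 0 and is positive definite.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
m, lam, sigma = 20, 0.1, 0.2                                   # hypothetical sizes and parameters
x = rng.uniform(size=m)                                        # hypothetical 1D point set
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2)) + 1e-10 * np.eye(m)

# X_* = (1/2)((m^2 I + 4 m K / lam)^{1/2} - m I) and C_* = K^{-1} X_* K^{-1}
X = 0.5 * (np.real(sqrtm(m**2 * np.eye(m) + 4 * m * K / lam)) - m * np.eye(m))
Kinv = np.linalg.inv(K)
C = Kinv @ X @ Kinv

print(np.linalg.norm(X @ X + m * X - (m / lam) * K))           # stationarity residual, ~ 0
print(np.all(np.linalg.eigvalsh(X) > 0))                       # X_* is positive definite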

While the exact solution of Proposition 12 is hard to interpret, the estimated correlation kernel tends to that of a projection DPP as the regularization parameter goes to zero. In what follows, the notation f(λ)g(λ)f(\lambda)\sim g(\lambda) means that f(λ)/g(λ)1f(\lambda)/g(\lambda)\to 1 in the indicated limit (λ0\lambda\to 0 or λ+\lambda\to+\infty).

Proposition 13 (Projection DPP in the low regularization limit).

Under the assumptions of Proposition 12, the correlation kernel 𝖪(λ)=SA(λ)S(SA(λ)S+𝕀)1,\mathsf{K}(\lambda)=SA_{\star}(\lambda)S^{*}(SA_{\star}(\lambda)S^{*}+\mathbb{I})^{-1}, with A(λ)=i,j=1m𝐂ij(λ)ϕ(xi)ϕ(xj)¯A_{\star}(\lambda)=\sum_{i,j=1}^{m}\mathbf{C}_{\star ij}(\lambda)\phi(x_{i})\otimes\overline{\phi(x_{j})} and C(λ)C_{\star}(\lambda) given in Proposition 12, has a range that is independent of λ\lambda. Furthermore, 𝖪(λ)\mathsf{K}(\lambda) converges to the projection operator onto that range when λ0\lambda\to 0. In particular, the integral kernel of A(λ)A_{\star}(\lambda) has the following asymptotic expansion

𝖺(x,y)(m/λ)1/2𝐤x𝐊3/2𝐤y as λ0,\mathsf{a}_{\star}(x,y)\sim{\color[rgb]{0,0,0}(m/\lambda)^{1/2}}\mathbf{k}_{x}^{\top}\mathbf{K}^{{\color[rgb]{0,0,0}-3/2}}\mathbf{k}_{y}\text{ as }\lambda\to 0,

pointwise, with 𝐤x=[k(x,x1),,k(x,xm)]\mathbf{k}_{x}=[k_{\mathcal{H}}(x,x_{1}),\dots,k_{\mathcal{H}}(x,x_{m})]^{\top}.

Proof.

As shown in Lemma 7, the non-zero eigenvalues of A(λ)A_{\star}(\lambda) are the eigenvalues of 𝐊𝐂(λ)\mathbf{K}\mathbf{C}_{\star}(\lambda) since 𝐊\mathbf{K} is non-singular. The eigenvalues of 𝐊𝐂(λ)\mathbf{K}\mathbf{C}_{\star}(\lambda) are σ(λ)=m2ς(1+4ςmλ1)\sigma_{\ell}(\lambda)=\frac{m}{2\varsigma_{\ell}}(\sqrt{1+4\frac{\varsigma_{\ell}}{m\lambda}}-1) where ς\varsigma_{\ell} is the \ell-th eigenvalue of 𝐊\mathbf{K} for 1m1\leq\ell\leq m. Thus, the operator A(λ)A_{\star}(\lambda) is of finite rank, i.e., A(λ)==1mσ(λ)vv¯A_{\star}(\lambda)=\sum_{\ell=1}^{m}\sigma_{\ell}(\lambda)v_{\ell}\otimes\overline{v_{\ell}}, where the normalized eigenvectors vv_{\ell} are independent of λ\lambda. The eigenvalues satisfy σ(λ)+\sigma_{\ell}(\lambda)\to+\infty as λ0\lambda\to 0, while limλ0σ(λ)λ=m/ς\lim_{\lambda\to 0}\sigma_{\ell}(\lambda)\sqrt{\lambda}=\sqrt{m/\varsigma_{\ell}} is finite and positive. Note that SA(λ)S==1mσ(λ)(Sv)(Sv)¯SA_{\star}(\lambda)S^{*}=\sum_{\ell=1}^{m}\sigma_{\ell}(\lambda)(Sv_{\ell})\otimes\overline{(Sv_{\ell})} is of finite rank and SvSv_{\ell} does not depend on λ\lambda. Hence, the operator

𝖪(λ)=λSA(λ)S(λSA(λ)S+λ𝕀)1,\mathsf{K}(\lambda)=\sqrt{\lambda}SA_{\star}(\lambda)S^{*}\left(\sqrt{\lambda}SA_{\star}(\lambda)S^{*}+\sqrt{\lambda}\mathbb{I}\right)^{-1},

converges to the projector on the range of SA(λ)SSA_{\star}(\lambda)S^{*} as λ0\lambda\to 0. ∎

On the other hand, as the regularization parameter goes to infinity, the integral kernel of 𝖠(λ)\mathsf{A}_{\star}(\lambda) behaves asymptotically as λ1\lambda^{-1} times an out-of-sample Nyström approximation.

Proposition 14 (Large regularization limit).

Under the assumptions of Proposition 12, the integral kernel of 𝖠(λ)\mathsf{A}_{\star}(\lambda) has the following asymptotic expansion

𝖺(x,y)λ1𝐤x𝐊1𝐤y as λ+,\mathsf{a}_{\star}(x,y)\sim\lambda^{-1}\mathbf{k}_{x}^{\top}\mathbf{K}^{-1}\mathbf{k}_{y}\text{ as }\lambda\to+\infty,

pointwise, with 𝐤x=[k(x,x1),,k(x,xm)]\mathbf{k}_{x}=[k_{\mathcal{H}}(x,x_{1}),\dots,k_{\mathcal{H}}(x,x_{m})]^{\top}.

The proof of this result is a simple consequence of the series expansion 1+4x=1+2x+O(x2)\sqrt{1+4x}=1+2x+O(x^{2}).
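Explicitly, by the reproducing property, the integral kernel of A_*(λ) is 𝖺_*(x,y)=⟨φ(x),A_*(λ)φ(y)⟩=k_x^⊤ C_*(λ) k_y, and applying the expansion to C_*(λ) from Proposition 12 gives, as λ→+∞,

\mathbf{C}_{\star}(\lambda)=\frac{m}{2}\mathbf{K}^{-2}\left(\Big(\mathbf{I}_{m}+\frac{4\mathbf{K}}{m\lambda}\Big)^{1/2}-\mathbf{I}_{m}\right)=\frac{m}{2}\mathbf{K}^{-2}\left(\frac{2\mathbf{K}}{m\lambda}+O(\lambda^{-2})\right)=\lambda^{-1}\mathbf{K}^{-1}+O(\lambda^{-2}),

which yields the claimed expansion 𝖺_*(x,y) ∼ λ^{-1} k_x^⊤ K^{-1} k_y.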

Appendix C Deferred proofs

C.1 Proof of Lemma 2

Proof of Lemma 2.

Since VVV^{*}V is the orthogonal projector onto the span of ϕ(zi)\phi(z_{i}), 1im1\leq i\leq m, we have

𝚽i𝐁¯𝚽j=VVϕ(xi),AVVϕ(xj)=ϕ(xi),Aϕ(xj).\bm{\Phi}_{i}^{\top}\bar{\mathbf{B}}\bm{\Phi}_{j}=\langle V^{*}V\phi(x_{i}),AV^{*}V\phi(x_{j})\rangle=\langle\phi(x_{i}),A\phi(x_{j})\rangle.

By Sylvester’s identity, and since VVop1\|V^{*}V\|_{op}\leq 1, it holds

logdet(𝐈m+𝐁¯)=logdet(𝕀+VAV)=logdet(𝕀+VVA)logdet(𝕀+A).\log\det(\mathbf{I}_{m}+\bar{\mathbf{B}})=\log\det(\mathbb{I}+VAV^{*})=\log\det(\mathbb{I}+V^{*}VA)\leq\log\det(\mathbb{I}+A).

Similarly, it holds that Tr(𝐁¯)=Tr(VVA)Tr(A)\operatorname{Tr}(\bar{\mathbf{B}})=\operatorname{Tr}(V^{*}VA)\leq\operatorname{Tr}(A). ∎

C.2 Approximation of the Fredholm determinant

Before giving the proof of Theorem 1, we recall a well-known and useful formula for the Fredholm determinant of an operator. Let MM be a trace class endomorphism of a separable Hilbert space. Denote by λ(M)\lambda_{\ell}(M) its (possibly infinitely many) eigenvalues for 1\ell\geq 1, including multiplicities. Then, we have

det(𝕀+M)=(1+λ(M)),\det(\mathbb{I}+M)=\prod_{\ell}(1+\lambda_{\ell}(M)),

where the (infinite) product is uniformly convergent; see Bornemann [2010, Eq. (3.1)] and references therein. Now, let 1\mathcal{F}_{1} and 2\mathcal{F}_{2} be two separable Hilbert spaces. Let T:12T:\mathcal{F}_{1}\to\mathcal{F}_{2} be such that TTT^{*}T is trace class. Note that TTT^{*}T is self-adjoint and positive semi-definite. Then, TTT^{*}T and TTTT^{*} have the same non-zero eigenvalues (Jacobson’s lemma) and Sylvester’s identity

det(𝕀+TT)=det(𝕀+TT),\det(\mathbb{I}+TT^{*})=\det(\mathbb{I}+T^{*}T),

holds. This can be seen as a consequence of the singular value decomposition of a compact operator; see Weidmann [1980, page 170, Theorem 7.6]. We now give the proof of Theorem 1.
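Before doing so, we note that Sylvester’s identity is easy to check numerically in finite dimensions; a minimal NumPy sketch, with a random rectangular matrix standing in for a finite-rank operator (illustration only, not used in the argument):

import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((6, 3))                      # T : R^3 -> R^6, a finite-rank stand-in
lhs = np.linalg.det(np.eye(6) + T @ T.T)             # det(I + T T^*)
rhs = np.linalg.det(np.eye(3) + T.T @ T)             # det(I + T^* T)
print(lhs, rhs, np.isclose(lhs, rhs))                # the two determinants coincide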

Proof of Theorem 1.

To begin, we recall two useful facts. First, if MM and NN are two trace class endomorphisms of the same Hilbert space, then logdet(𝕀+M)+logdet(𝕀+N)=logdet((𝕀+M)(𝕀+N))\log\det(\mathbb{I}+M)+\log\det(\mathbb{I}+N)=\log\det\left((\mathbb{I}+M)(\mathbb{I}+N)\right); see e.g. Simon [1977, Thm 3.8]. Second, if MM is a trace class endomorphism of a Hilbert space, we have (𝕀+M)1=𝕀M(M+𝕀)1(\mathbb{I}+M)^{-1}=\mathbb{I}-M(M+\mathbb{I})^{-1}, and thus, det((𝕀+M)1)\det((\mathbb{I}+M)^{-1}) is a well-defined Fredholm determinant since M(M+𝕀)1M(M+\mathbb{I})^{-1} is trace class. Now, define d=logdet(𝕀+SAS)d=\log\det(\mathbb{I}+SAS^{*}) and dn=logdet(𝐈n+SnASn)d_{n}=\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*}). Using Sylvester’s identity and the aforementioned facts, we write

dnd\displaystyle d_{n}-d =logdet(𝐈n+SnASn)logdet(𝕀+SAS)\displaystyle=\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*})-\log\det(\mathbb{I}+SAS^{*})
=logdet(𝕀+A1/2SnSnA1/2)logdet(𝕀+A1/2SSA1/2) (Sylvester’s identity)\displaystyle=\log\det(\mathbb{I}+A^{1/2}S^{*}_{n}S_{n}A^{1/2})-\log\det(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})\text{ (Sylvester's identity) }
=logdet((𝕀+A1/2SnSnA1/2)(𝕀+A1/2SSA1/2)1).\displaystyle=\log\det\left((\mathbb{I}+A^{1/2}S^{*}_{n}S_{n}A^{1/2})(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1}\right). (18)

By a direct substitution, we verify that dnd=logdet(𝕀+En),d_{n}-d=\log\det\left(\mathbb{I}+E_{n}\right), with

En=(𝕀+A1/2SSA1/2)1/2A1/2(SnSnSS)A1/2(𝕀+A1/2SSA1/2)1/2.E_{n}=(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1/2}A^{1/2}(S_{n}^{*}S_{n}-S^{*}S)A^{1/2}(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1/2}.

Note that, in view of (18), det(𝕀+En)\det\left(\mathbb{I}+E_{n}\right) is a positive real number. Next, by using SnSnSSSnSnSSop𝕀S_{n}^{*}S_{n}-S^{*}S\preceq\|S_{n}^{*}S_{n}-S^{*}S\|_{op}\mathbb{I} and Sylvester’s identity, we find

dnd\displaystyle d_{n}-d logdet(𝕀+SnSnSSop(𝕀+A1/2SSA1/2)1/2A(𝕀+A1/2SSA1/2)1/2)\displaystyle\leq\log\det\left(\mathbb{I}+\|S_{n}^{*}S_{n}-S^{*}S\|_{op}(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1/2}A(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1/2}\right)
=logdet(𝕀+SnSnSSopA1/2(𝕀+A1/2SSA1/2)1A1/2)\displaystyle=\log\det\left(\mathbb{I}+\|S_{n}^{*}S_{n}-S^{*}S\|_{op}A^{1/2}(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1}A^{1/2}\right)
logdet(𝕀+SnSnSSopA).\displaystyle\leq\log\det(\mathbb{I}+\|S_{n}^{*}S_{n}-S^{*}S\|_{op}A).

The latter bound is finite, since for MM a trace-class operator, we have |det(𝕀+M)|exp(M)|\det(\mathbb{I}+M)|\leq\exp(\|M\|_{\star}), where M\|M\|_{\star} is the trace (or nuclear) norm. By exchanging the roles of SnSnS_{n}^{*}S_{n} and SSS^{*}S, we also find

ddnlogdet(𝕀+SnSnSSopA)\displaystyle d-d_{n}\leq\log\det(\mathbb{I}+\|S_{n}^{*}S_{n}-S^{*}S\|_{op}A)

and thus, by combining the two cases, we find

|ddn|logdet(𝕀+SnSnSSopA).\displaystyle|d-d_{n}|\leq\log\det(\mathbb{I}+\|S_{n}^{*}S_{n}-S^{*}S\|_{op}A).

Now, in order to upper bound SSSnSnop\|S^{*}S-S_{n}^{*}S_{n}\|_{op} with high probability, we use the following Bernstein inequality for a sum of random operators; see Rudi et al. [2015, Proposition 12] and Minsker [2017].

Proposition 15 (Bernstein’s inequality for a sum of i.i.d. random operators).

Let \mathcal{H} be a separable Hilbert space and let X1,,XnX_{1},\dots,X_{n} be a sequence of independent and identically distributed self-adjoint random operators on \mathcal{H}. Assume that 𝔼X=0\mathbb{E}X=0 and that there exists a real number >0\ell>0 such that λmax(X)\lambda_{\max}(X)\leq\ell almost surely. Let Σ\Sigma be a trace class positive operator such that 𝔼(X2)Σ\mathbb{E}(X^{2})\preceq\Sigma. Then, for any δ(0,1)\delta\in(0,1),

λmax(1ni=1nXi)2β3n+2Σopβn,where β=log(2Tr(Σ)Σopδ),\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n}X_{i}\right)\leq\frac{2\ell\beta}{3n}+\sqrt{\frac{2\|\Sigma\|_{op}\beta}{n}},\quad\text{where }\beta=\log\left(\frac{2\operatorname{Tr}(\Sigma)}{\|\Sigma\|_{op}\delta}\right),

with probability 1δ1-\delta. If there further exists an \ell^{\prime} such that Xop\|X\|_{op}\leq\ell^{\prime} almost surely, then, for any δ(0,1/2)\delta\in(0,1/2),

1ni=1nXiop2β3n+2Σopβn,where β=log(2Tr(Σ)Σopδ),\left\|\frac{1}{n}\sum_{i=1}^{n}X_{i}\right\|_{op}\leq\frac{2\ell^{\prime}\beta}{3n}+\sqrt{\frac{2\|\Sigma\|_{op}\beta}{n}},\quad\text{where }\beta=\log\left(\frac{2\operatorname{Tr}(\Sigma)}{\|\Sigma\|_{op}\delta}\right),

holds with probability 12δ1-2\delta.
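As an illustration of Proposition 15 (not needed for the proof), the scaling of the bound can be checked in finite dimension with rank-one operators built from vectors drawn uniformly on a sphere; the choices of d, κ and δ below are arbitrary, and the choice Σ = κ²·E[v v^⊤] mirrors the bound used in the proof that follows.

import numpy as np

rng = np.random.default_rng(0)
d, kappa, delta = 20, 1.0, 0.05
Q = (kappa**2 / d) * np.eye(d)                     # E[v v^T] for v uniform on the radius-kappa sphere
Sigma_op, Sigma_tr = kappa**4 / d, kappa**4        # Sigma = kappa^2 Q
beta = np.log(2 * Sigma_tr / (Sigma_op * delta))
for n in (100, 1000, 10000):
    V = rng.standard_normal((n, d))
    V *= kappa / np.linalg.norm(V, axis=1, keepdims=True)      # rows v_i uniform on the sphere
    emp = V.T @ V / n - Q                                      # (1/n) sum_i X_i with X_i = v_i v_i^T - Q
    bound = 2 * (2 * kappa**2) * beta / (3 * n) + np.sqrt(2 * Sigma_op * beta / n)
    print(n, np.linalg.norm(emp, 2), bound)                    # empirical operator norm vs. bound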

To apply Proposition 15 and conclude the proof, first recall the expression of the covariance operator

C=SS=𝒳ϕ(x)ϕ(x)¯dμ(x),C=S^{*}S=\int_{\mathcal{X}}\phi(x)\otimes\overline{\phi(x)}\mathrm{d}\mu(x),

and of its sample version

SnSn=1ni=1nϕ(xi)ϕ(xi)¯.S^{*}_{n}S_{n}=\frac{1}{n}\sum_{i=1}^{n}\phi(x_{i})\otimes\overline{\phi(x_{i})}.

Define Xi=ϕ(xi)ϕ(xi)¯𝒳ϕ(x)ϕ(x)¯dμ(x)X_{i}=\phi(x_{i})\otimes\overline{\phi(x_{i})}-\int_{\mathcal{X}}\phi(x)\otimes\overline{\phi(x)}\mathrm{d}\mu(x). It is easy to check that 𝔼Xi=0\mathbb{E}X_{i}=0 since 𝔼[ϕ(xi)ϕ(xi)¯]=𝒳ϕ(x)ϕ(x)¯dμ(x)\mathbb{E}[\phi(x_{i})\otimes\overline{\phi(x_{i})}]=\int_{\mathcal{X}}\phi(x)\otimes\overline{\phi(x)}\mathrm{d}\mu(x). Then, using the triangle inequality and that supx𝒳k(x,x)κ2\sup_{x\in\mathcal{X}}k_{\mathcal{H}}(x,x)\leq\kappa^{2}, we find the bound

Xiop\displaystyle\|X_{i}\|_{op} =𝒳(ϕ(xi)ϕ(xi)¯ϕ(x)ϕ(x)¯)dμ(x)op\displaystyle=\left\|\int_{\mathcal{X}}\left(\phi(x_{i})\otimes\overline{\phi(x_{i})}-\phi(x)\otimes\overline{\phi(x)}\right)\mathrm{d}\mu(x)\right\|_{op}
𝒳(ϕ(xi)ϕ(xi)¯op+ϕ(x)ϕ(x)¯op)dμ(x)\displaystyle\leq\int_{\mathcal{X}}\left(\left\|\phi(x_{i})\otimes\overline{\phi(x_{i})}\right\|_{op}+\left\|\phi(x)\otimes\overline{\phi(x)}\right\|_{op}\right)\mathrm{d}\mu(x)
2κ2.\displaystyle\leq 2\kappa^{2}\triangleq\ell^{\prime}.

Next, we compute a bound on the variance by bounding the second moment, namely

𝔼(Xi2)\displaystyle\mathbb{E}(X_{i}^{2}) =𝔼[(ϕ(xi)ϕ(xi)¯)2](𝔼[ϕ(xi)ϕ(xi)¯])2\displaystyle=\mathbb{E}\left[\left(\phi(x_{i})\otimes\overline{\phi(x_{i})}\right)^{2}\right]-\left(\mathbb{E}\left[\phi(x_{i})\otimes\overline{\phi(x_{i})}\right]\right)^{2}
𝔼[(ϕ(xi)ϕ(xi)¯)2]\displaystyle\preceq\mathbb{E}\left[\left(\phi(x_{i})\otimes\overline{\phi(x_{i})}\right)^{2}\right]
κ2𝔼[ϕ(xi)ϕ(xi)¯]Σ,\displaystyle\preceq\kappa^{2}\mathbb{E}\left[\phi(x_{i})\otimes\overline{\phi(x_{i})}\right]\triangleq\Sigma,

where we used (ϕ(xi)ϕ(xi)¯)2=k(xi,xi)(ϕ(xi)ϕ(xi)¯)\left(\phi(x_{i})\otimes\overline{\phi(x_{i})}\right)^{2}=k_{\mathcal{H}}(x_{i},x_{i})\left(\phi(x_{i})\otimes\overline{\phi(x_{i})}\right). Also, we have

Tr(Σ)=Tr(κ2𝔼[ϕ(x)ϕ(x)¯])=κ2𝒳k(x,x)dμ(x)κ4.\operatorname{Tr}(\Sigma)=\operatorname{Tr}\left(\kappa^{2}\mathbb{E}\left[\phi(x)\otimes\overline{\phi(x)}\right]\right)=\kappa^{2}\int_{\mathcal{X}}k_{\mathcal{H}}(x,x)\mathrm{d}\mu(x)\leq\kappa^{4}.

Moreover,

Σop=κ2𝔼[ϕ(x)ϕ(x)¯]op=κ2λmax(SS)=κ2λmax(𝖳k),\|\Sigma\|_{op}=\kappa^{2}\left\|\mathbb{E}\left[\phi(x)\otimes\overline{\phi(x)}\right]\right\|_{op}=\kappa^{2}\lambda_{\max}(S^{*}S)=\kappa^{2}\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}}),

where we used that SSS^{*}S and the integral operator 𝖳k\mathsf{T}_{k_{\mathcal{H}}} on L2(𝒳)L^{2}(\mathcal{X}) with integral kernel kk have the same non-zero eigenvalues. We are finally ready to apply Proposition 15. Since

βlog(2κ2λmax(𝖳k)δ),\beta\leq\log\left(\frac{2\kappa^{2}}{\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}})\delta}\right),

it holds, with probability at least 12δ1-2\delta,

SSSnSnop4κ2log(2κ2λmax(𝖳k)δ)3n+2κ2λmax(𝖳k)log(2κ2λmax(𝖳k)δ)n.\|S^{*}S-S^{*}_{n}S_{n}\|_{op}\leq\frac{4\kappa^{2}\log\left(\frac{2\kappa^{2}}{\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}})\delta}\right)}{3n}+\sqrt{\frac{2\kappa^{2}\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}})\log\left(\frac{2\kappa^{2}}{\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}})\delta}\right)}{n}}.

This concludes the proof of Theorem 1. ∎

C.3 Statistical guarantee for the approximation of the log-likelihood by its sampled version

Proof of Theorem 4.

The proof follows lines similar to Rudi et al. [2020, Proof of Thm 5], with several adaptations. Let 𝐁𝒮+(m)\mathbf{B}_{\star}\in\mathcal{S}_{+}(\mathbb{R}^{m}) be the solution of (9). Notice that AA_{\star} and V𝐁VV^{*}\mathbf{B}_{\star}V are distinct operators. Let 𝐁¯=VAV𝒮+(m)\bar{\mathbf{B}}=VA_{\star}V^{*}\in\mathcal{S}_{+}(\mathbb{R}^{m}) as in Lemma 2. Since 𝐁\mathbf{B}_{\star} is optimal for (9), we have

fn(V𝐁V)+Tr(λ𝐁)fn(V𝐁¯V)+Tr(λ𝐁¯).f_{n}(V^{*}\mathbf{B}_{\star}V)+\operatorname{Tr}(\lambda\mathbf{B}_{\star})\leq f_{n}(V^{*}\bar{\mathbf{B}}V)+\operatorname{Tr}(\lambda\bar{\mathbf{B}}).

Now we use the properties of 𝐁¯\bar{\mathbf{B}} given in Lemma 2, namely that fn(A)=fn(V𝐁¯V)f_{n}(A_{\star})=f_{n}(V^{*}\bar{\mathbf{B}}V) and Tr(λ𝐁¯)Tr(λA)\operatorname{Tr}(\lambda\bar{\mathbf{B}})\leq\operatorname{Tr}(\lambda A_{\star}). Then it holds

fn(V𝐁¯V)+Tr(λ𝐁¯)fn(A)+Tr(λA).f_{n}(V^{*}\bar{\mathbf{B}}V)+\operatorname{Tr}(\lambda\bar{\mathbf{B}})\leq f_{n}(A_{\star})+\operatorname{Tr}(\lambda A_{\star}).

By combining the last two inequalities, we have fn(V𝐁V)+Tr(λ𝐁)fn(A)+Tr(λA)f_{n}(V^{*}\mathbf{B}_{\star}V)+\operatorname{Tr}(\lambda\mathbf{B}_{\star})\leq f_{n}(A_{\star})+\operatorname{Tr}(\lambda A_{\star}) and therefore

fn(V𝐁V)fn(A)Tr(λA)Tr(λ𝐁).f_{n}(V^{*}\mathbf{B}_{\star}V)-f_{n}(A_{\star})\leq\operatorname{Tr}(\lambda A_{\star})-\operatorname{Tr}(\lambda\mathbf{B}_{\star}). (19)

We will use (19) to derive upper and lower bounds on the gap f(A)fn(V𝐁V)f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V).

Lower bound.

Using Theorem 1, we have with high probability

|f(A)fn(A)|logdet(𝕀+cnA),|f(A_{\star})-f_{n}(A_{\star})|\leq\log\det(\mathbb{I}+c_{n}A_{\star}),

and, in particular, this gives

f(A)fn(A)Tr(cnA).f(A_{\star})-f_{n}(A_{\star})\leq\operatorname{Tr}(c_{n}A_{\star}).

By combining this last inequality with (19) and by using Tr(λ𝐁)0\operatorname{Tr}(\lambda\mathbf{B}_{\star})\geq 0, we have the lower bound

Δ=f(A)fn(V𝐁V)\displaystyle\Delta=f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V) =f(A)fn(A)+fn(A)fn(V𝐁V)\displaystyle=f(A_{\star})-f_{n}(A_{\star})+f_{n}(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V)
Tr(cnA)+Tr(λ𝐁)Tr(λA)\displaystyle\geq-\operatorname{Tr}(c_{n}A_{\star})+\operatorname{Tr}(\lambda\mathbf{B}_{\star})-\operatorname{Tr}(\lambda A_{\star}) (20)
(cn+λ)Tr(A).\displaystyle\geq-(c_{n}+\lambda)\operatorname{Tr}(A_{\star}).
Upper bound.

We have the bound with high probability (for the same event as above)

Δ=f(A)fn(V𝐁V)\displaystyle\Delta=f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V) =f(A)f(V𝐁V)0+f(V𝐁V)fn(V𝐁V)\displaystyle=\underbrace{f(A_{\star})-f(V^{*}\mathbf{B}_{\star}V)}_{\leq 0}+f(V^{*}\mathbf{B}_{\star}V)-f_{n}(V^{*}\mathbf{B}_{\star}V)
logdet(𝕀+cnV𝐁V) (Theorem 1)\displaystyle\leq\log\det(\mathbb{I}+c_{n}V^{*}\mathbf{B}_{\star}V)\text{ \quad(Theorem~{}\ref{thm:formal_Fredholm})}
Tr(cnVV𝐁) (cyclicity)\displaystyle\leq\operatorname{Tr}(c_{n}VV^{*}\mathbf{B}_{\star})\text{ \quad(cyclicity)}
Tr(cn𝐁). (since VV is a projector)\displaystyle\leq\operatorname{Tr}(c_{n}\mathbf{B}_{\star}).\text{ \quad(since $VV^{*}$ is a projector)} (21)

By combining this with (20), we find

Tr(cnA)+Tr(λ𝐁)Tr(λA)Tr(cn𝐁).-\operatorname{Tr}(c_{n}A_{\star})+\operatorname{Tr}(\lambda\mathbf{B}_{\star})-\operatorname{Tr}(\lambda A_{\star})\leq\operatorname{Tr}(c_{n}\mathbf{B}_{\star}).

Since, by assumption, we have cnλcnc_{n}\leq\lambda-c_{n}, the latter inequality becomes

cnTr(𝐁)(cn+λ)Tr(A).c_{n}\operatorname{Tr}(\mathbf{B}_{\star})\leq(c_{n}+\lambda)\operatorname{Tr}(A_{\star}). (22)

By using (22) in the bound in (21), we obtain

f(A)fn(V𝐁V)(cn+λ)Tr(A),f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V)\leq(c_{n}+\lambda)\operatorname{Tr}(A_{\star}),

Thus, the upper and lower bounds together yield

|f(A)fn(V𝐁V)|(cn+λ)Tr(A)(λ2+λ)Tr(A),|f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V)|\leq(c_{n}+\lambda)\operatorname{Tr}(A_{\star})\leq\left(\frac{\lambda}{2}+\lambda\right)\operatorname{Tr}(A_{\star}),

where we used once more 2cnλ2c_{n}\leq\lambda. This is the desired result. ∎

Proof of Corollary 5.

We use the triangle inequality

|f(A)f(V𝐁V)||f(A)fn(V𝐁V)|+|fn(V𝐁V)f(V𝐁V)|.|f(A_{\star})-f(V^{*}\mathbf{B}_{\star}V)|\leq|f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V)|+|f_{n}(V^{*}\mathbf{B}_{\star}V)-f(V^{*}\mathbf{B}_{\star}V)|.

The first term is upper bounded with high probability by Theorem 4. The second term is bounded by Theorem 1 as follows

|fn(V𝐁V)f(V𝐁V)|Tr(cn𝐁)|f_{n}(V^{*}\mathbf{B}_{\star}V)-f(V^{*}\mathbf{B}_{\star}V)|\leq\operatorname{Tr}(c_{n}\mathbf{B}_{\star})

with Tr(cn𝐁)32λTr(A)\operatorname{Tr}(c_{n}\mathbf{B}_{\star})\leq\frac{3}{2}\lambda\operatorname{Tr}(A_{\star}) as in (22) in the proof of Theorem 4. ∎

C.4 Numerical approach: convergence study

Proof of Theorem 3.

This proof follows the same technique as in Mariet and Sra [2015], with some extensions. Let 𝚺=𝐗1\mathbf{\Sigma}=\mathbf{X}^{-1}. We decompose the objective

g(𝚺1)\displaystyle g(\mathbf{\Sigma}^{-1}) =logdet(𝐈m+1||𝚺1)1s=1slogdet(𝐔𝚺1𝐔)+λTr(𝚺1𝐊1)\displaystyle=\log\det\left(\mathbf{I}_{m}+\frac{1}{|\mathcal{I}|}\mathbf{\Sigma}^{-1}\right)_{\mathcal{I}\mathcal{I}}-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{U}_{\ell}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}_{\ell})+\lambda\operatorname{Tr}(\mathbf{\Sigma}^{-1}\mathbf{K}^{-1})

as the following sum: g(𝚺1)=h1(𝚺)+h2(𝚺)g(\mathbf{\Sigma}^{-1})=h_{1}(\mathbf{\Sigma})+h_{2}(\mathbf{\Sigma}), where h1(𝚺)=logdet(𝚺)+λTr(𝚺1𝐊1)h_{1}(\mathbf{\Sigma})=-\log\det(\mathbf{\Sigma})+\lambda\operatorname{Tr}(\mathbf{\Sigma}^{-1}\mathbf{K}^{-1}) is a strictly convex function on 𝚺0\mathbf{\Sigma}\succ 0 and

h2(𝚺)=logdet(𝚺)+logdet(𝐈m+1||𝚺1)1s=1slogdet(𝐔𝚺1𝐔)h_{2}(\mathbf{\Sigma})=\log\det(\mathbf{\Sigma})+\log\det\left(\mathbf{I}_{m}+\frac{1}{|\mathcal{I}|}\mathbf{\Sigma}^{-1}\right)_{\mathcal{I}\mathcal{I}}-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{U}_{\ell}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}_{\ell})

is concave on 𝚺0\mathbf{\Sigma}\succ 0. We refer to Lemmas 8 and 9 for the former and latter statements, respectively. Then, we use the concavity of h2h_{2} to write the following upper bound

h2(𝚺)h2(𝐘)+Tr(h2(𝐘)(𝚺𝐘)),h_{2}(\mathbf{\Sigma})\leq h_{2}(\mathbf{Y})+\operatorname{Tr}\Big{(}\nabla h_{2}(\mathbf{Y})(\mathbf{\Sigma}-\mathbf{Y})\Big{)},

where the matrix-valued gradient is

h2(𝐘)\displaystyle\nabla h_{2}(\mathbf{Y}) =𝐘1𝐘1𝐔(||𝐈||+𝐔𝐘1𝐔)1𝐔𝐘1\displaystyle=\mathbf{Y}^{-1}-\mathbf{Y}^{-1}\mathbf{U}_{\mathcal{I}}\left(|\mathcal{I}|\mathbf{I}_{|\mathcal{I}|}+\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{Y}^{-1}\mathbf{U}_{\mathcal{I}}\right)^{-1}\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{Y}^{-1}
+1s=1s𝐘1𝐔(𝐔𝐘1𝐔)1𝐔𝐘1.\displaystyle+\frac{1}{s}\sum_{\ell=1}^{s}\mathbf{Y}^{-1}\mathbf{U}_{\ell}\left(\mathbf{U}_{\ell}^{\top}\mathbf{Y}^{-1}\mathbf{U}_{\ell}\right)^{-1}\mathbf{U}_{\ell}^{\top}\mathbf{Y}^{-1}.

Define the auxiliary function ξ(𝚺,𝐘)h1(𝚺)+h2(𝐘)+Tr(h2(𝐘)(𝚺𝐘))\xi(\mathbf{\Sigma},\mathbf{Y})\triangleq h_{1}(\mathbf{\Sigma})+h_{2}(\mathbf{Y})+\operatorname{Tr}\left(\nabla h_{2}(\mathbf{Y})(\mathbf{\Sigma}-\mathbf{Y})\right), which satisfies g(𝚺1)ξ(𝚺,𝐘)g(\mathbf{\Sigma}^{-1})\leq\xi(\mathbf{\Sigma},\mathbf{Y}) and g(𝚺1)=ξ(𝚺,𝚺)g(\mathbf{\Sigma}^{-1})=\xi(\mathbf{\Sigma},\mathbf{\Sigma}). We define the iteration 𝐗k=𝚺k1\mathbf{X}_{k}=\mathbf{\Sigma}_{k}^{-1} where

𝚺k+1=argmin𝚺0ξ(𝚺,𝚺k),\mathbf{\Sigma}_{k+1}=\arg\min_{\mathbf{\Sigma}\succ 0}\xi(\mathbf{\Sigma},\mathbf{\Sigma}_{k}),

so that it holds g(𝚺k+11)ξ(𝚺k+1,𝚺k)ξ(𝚺k,𝚺k)=g(𝚺k1)g(\mathbf{\Sigma}^{-1}_{k+1})\leq\xi(\mathbf{\Sigma}_{k+1},\mathbf{\Sigma}_{k})\leq\xi(\mathbf{\Sigma}_{k},\mathbf{\Sigma}_{k})=g(\mathbf{\Sigma}^{-1}_{k}). Thus, the iteration produces a non-increasing sequence of objective values. It remains to show that this iteration corresponds to (13). The solution of the above minimization problem can be obtained by solving the first-order optimality condition since ξ(,𝐘)\xi(\cdot,\mathbf{Y}) is strictly convex. This gives

𝚺1λ𝚺1𝐊1𝚺1+𝚺k1𝚺k1𝐔(||𝐈||+𝐔𝚺k1𝐔)1𝐔𝚺k1\displaystyle-\mathbf{\Sigma}^{-1}-\lambda\mathbf{\Sigma}^{-1}\mathbf{K}^{-1}\mathbf{\Sigma}^{-1}+\mathbf{\Sigma}^{-1}_{k}-\mathbf{\Sigma}^{-1}_{k}\mathbf{U}_{\mathcal{I}}\left(|\mathcal{I}|\mathbf{I}_{|\mathcal{I}|}+\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{\Sigma}^{-1}_{k}\mathbf{U}_{\mathcal{I}}\right)^{-1}\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{\Sigma}^{-1}_{k}
+1s=1s𝚺k1𝐔(𝐔𝚺k1𝐔)1𝐔𝚺k1=0.\displaystyle+\frac{1}{s}\sum_{\ell=1}^{s}\mathbf{\Sigma}^{-1}_{k}\mathbf{U}_{\ell}\left(\mathbf{U}_{\ell}^{\top}\mathbf{\Sigma}^{-1}_{k}\mathbf{U}_{\ell}\right)^{-1}\mathbf{U}_{\ell}^{\top}\mathbf{\Sigma}^{-1}_{k}=0.

Now, we replace 𝐗=𝚺1\mathbf{X}=\mathbf{\Sigma}^{-1} in the above condition. After a simple algebraic manipulation, we obtain the following condition

𝐗+λ𝐗𝐊1𝐗=p(𝐗k),\mathbf{X}+\lambda\mathbf{X}\mathbf{K}^{-1}\mathbf{X}=p(\mathbf{X}_{k}),

where, as defined in (13), p(𝐗)=𝐗+𝐗𝚫𝐗p(\mathbf{X})=\mathbf{X}+\mathbf{X}\mathbf{\Delta}\mathbf{X} and

𝚫=1s=1s𝐔𝐗𝒞𝒞1𝐔𝐔(||𝐈||+𝐔𝐗𝐔)1𝐔.\mathbf{\Delta}=\frac{1}{s}\sum_{\ell=1}^{s}\mathbf{U}_{\ell}\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}}^{-1}\mathbf{U}_{\ell}^{\top}-\mathbf{U}_{\mathcal{I}}\left(|\mathcal{I}|\mathbf{I}_{|\mathcal{I}|}+\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{X}\mathbf{U}_{\mathcal{I}}\right)^{-1}\mathbf{U}_{\mathcal{I}}^{\top}.

Finally, we introduce the Cholesky decomposition 𝐊=𝐑𝐑\mathbf{K}=\mathbf{R}^{\top}\mathbf{R}, so that we have an equivalent identity

𝐑1𝐗𝐑1+λ(𝐑1𝐗𝐑1)2𝐑1p(𝐗k)𝐑1=0.\mathbf{R}^{-1\top}\mathbf{X}\mathbf{R}^{-1}+\lambda\left(\mathbf{R}^{-1\top}\mathbf{X}\mathbf{R}^{-1}\right)^{2}-\mathbf{R}^{-1\top}p(\mathbf{X}_{k})\mathbf{R}^{-1}=0.

Let 𝐗=𝐑1𝐗𝐑1\mathbf{X}^{\prime}=\mathbf{R}^{-1\top}\mathbf{X}\mathbf{R}^{-1} and p(𝐗k)=𝐑1p(𝐗k)𝐑1p^{\prime}(\mathbf{X}_{k})=\mathbf{R}^{-1\top}p(\mathbf{X}_{k})\mathbf{R}^{-1}. The positive definite solution of this second-order matrix equation reads

𝐗=𝐈m+(𝐈m+4λp(𝐗k))1/22λ,\mathbf{X}^{\prime}=\frac{-\mathbf{I}_{m}+\left(\mathbf{I}_{m}+4\lambda p^{\prime}(\mathbf{X}_{k})\right)^{1/2}}{2\lambda},

which directly yields (13). ∎
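For concreteness, one update of the regularized Picard iteration can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration of (13)–(14), assuming the index sets 𝒞_ℓ and ℐ are given as lists of integer positions into the m points of 𝒵 and that 𝐊 is positive definite; it is not the reference implementation of Algorithm 1.

import numpy as np
from scipy.linalg import sqrtm

def picard_step(X, K, C_sets, I_set, lam):
    """One regularized Picard update X_k -> X_{k+1}, following (13)-(14) (sketch)."""
    m = X.shape[0]
    s = len(C_sets)
    # Delta = (1/s) sum_l U_l X_{C_l C_l}^{-1} U_l^T - U_I (|I| I + X_{II})^{-1} U_I^T
    Delta = np.zeros((m, m))
    for C in C_sets:
        Delta[np.ix_(C, C)] += np.linalg.inv(X[np.ix_(C, C)]) / s
    I = list(I_set)
    Delta[np.ix_(I, I)] -= np.linalg.inv(len(I) * np.eye(len(I)) + X[np.ix_(I, I)])
    P = X + X @ Delta @ X                        # p(X_k)
    R = np.linalg.cholesky(K).T                  # K = R^T R with R upper triangular
    Rinv = np.linalg.inv(R)
    Pp = Rinv.T @ P @ Rinv                       # p'(X_k) = R^{-T} p(X_k) R^{-1}
    Xp = (-np.eye(m) + np.real(sqrtm(np.eye(m) + 4 * lam * Pp))) / (2 * lam)
    return R.T @ Xp @ R                          # map back: X_{k+1} = R^T X' R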

C.5 Approximation of the correlation kernel

We start by proving the following useful result.

Theorem 16 (Correlation kernel approximation, formal version).

Let δ(0,1]\delta\in(0,1] be the failure probability and let γ>0\gamma>0 be a scale factor. Let 𝖪^(γ)\hat{\mathsf{K}}(\gamma) be the approximation defined in (11) with i.i.d. sampling of pp points. Let pp be large enough so that t(p)<1t(p)<1 with t(p)=4c2β3γp+2c2βγpt(p)=\frac{4c^{2}\beta}{3\gamma p}+\sqrt{\frac{2c^{2}\beta}{\gamma p}} where c2=κ2Aopc^{2}=\kappa^{2}\|A\|_{op} and β=log(4deff(γ)δ𝖪(γ)op)\beta=\log\left(\frac{4d_{\rm eff}(\gamma)}{\delta\|\mathsf{K}(\gamma)\|_{op}}\right). Then, with probability 1δ1-\delta, it holds that

11+t(p)𝖪(γ)𝖪^(γ)11t(p)𝖪(γ).\frac{1}{1+t(p)}\mathsf{K}(\gamma)\preceq\hat{\mathsf{K}}(\gamma)\preceq\frac{1}{1-t(p)}\mathsf{K}(\gamma).

Furthermore, if we assume γλmax(𝖠)\gamma\leq\lambda_{\max}(\mathsf{A}), we can take β=log(8deff(γ)δ)8deff(γ)δ.\beta=\log\left(\frac{8d_{\rm eff}(\gamma)}{\delta}\right)\leq\frac{8d_{\rm eff}(\gamma)}{\delta}.

Proof.

For simplicity, define Ψ:m\Psi:\mathcal{H}\to\mathbb{R}^{m} as Ψ=m𝚲Sm\Psi=\sqrt{m}\mathbf{\Lambda}S_{m}, which is such that A=ΨΨA_{\star}=\Psi^{*}\Psi. Then, we write

𝖪^=SΨ(ΨSpSpΨ+γ𝐈m)1ΨS,\hat{\mathsf{K}}=S\Psi^{*}(\Psi S^{*}_{p}S_{p}\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\Psi S^{*},

where we recall that SpSp=1pi=1pϕ(xi′′)ϕ(xi′′)¯S^{*}_{p}S_{p}=\frac{1}{p}\sum_{i=1}^{p}\phi(x^{\prime\prime}_{i})\otimes\overline{\phi(x^{\prime\prime}_{i})}. Next, we use the following result.

Proposition 17 (Proposition 5 in Rudi et al. [2018] with minor adaptations).

Let γ>0\gamma>0 and v1,,vpv_{1},\dots,v_{p} with p1p\geq 1 be identically distributed random vectors on a separable Hilbert space HH, such that there exists c2>0c^{2}>0 for which vH2c2\|v\|_{H}^{2}\leq c^{2} almost surely. Denote by QQ the Hermitian operator Q=1pi=1p𝔼[vivi¯]Q=\frac{1}{p}\sum_{i=1}^{p}\mathbb{E}[v_{i}\otimes\overline{v_{i}}]. Let Qp=1pi=1pvivi¯Q_{p}=\frac{1}{p}\sum_{i=1}^{p}v_{i}\otimes\overline{v_{i}}. Then for any δ(0,1]\delta\in(0,1], the following holds

(Q+γ𝕀)1/2(QQp)(Q+γ𝕀)1/2op4c2β3γp+2c2βγp,\|(Q+\gamma\mathbb{I})^{-1/2}(Q-Q_{p})(Q+\gamma\mathbb{I})^{-1/2}\|_{op}\leq\frac{4c^{2}\beta}{3\gamma p}+\sqrt{\frac{2c^{2}\beta}{\gamma p}},

with probability 1δ1-\delta and β=log4Tr(Q(Q+γ𝕀)1)δQ(Q+γ𝕀)1op8c2/Qop+Tr(Q(Q+γ𝕀)1)δ\beta=\log\frac{4\operatorname{Tr}\left(Q(Q+\gamma\mathbb{I})^{-1}\right)}{\delta\|Q(Q+\gamma\mathbb{I})^{-1}\|_{op}}\leq 8\frac{c^{2}/\|Q\|_{op}+\operatorname{Tr}\left(Q(Q+\gamma\mathbb{I})^{-1}\right)}{\delta}.

Then, we merely define the following vector 𝐯i=Ψϕ(xi′′)\mathbf{v}_{i}=\Psi\phi(x^{\prime\prime}_{i}) for 1ip1\leq i\leq p so that

𝐐p=ΨSpSpΨ=1pi=1pΨ(ϕ(xi′′)ϕ(xi′′)¯)Ψ.\mathbf{Q}_{p}=\Psi S_{p}^{*}S_{p}\Psi^{*}=\frac{1}{p}\sum_{i=1}^{p}\Psi\left(\phi(x^{\prime\prime}_{i})\otimes\overline{\phi(x^{\prime\prime}_{i})}\right)\Psi^{*}.

Furthermore, we define 𝐐=ΨSSΨ\mathbf{Q}=\Psi S^{*}S\Psi^{*}. Also, we have SS=𝒳ϕ(x)ϕ(x)¯dμ(x)S^{*}S=\int_{\mathcal{X}}\phi(x)\otimes\overline{\phi(x)}\mathrm{d}\mu(x), so that 𝔼[SpSp]=SS\mathbb{E}[S^{*}_{p}S_{p}]=S^{*}S. Hence, it holds 𝔼[𝐐p]=𝐐\mathbb{E}[\mathbf{Q}_{p}]=\mathbf{Q}. First, by using ΨΨΨΨop𝕀\Psi^{*}\Psi\preceq\|\Psi^{*}\Psi\|_{op}\mathbb{I} , we have

𝐯22=ϕ(x),ΨΨϕ(x)k(x,x)ΨΨopκ2ΨΨop=κ2Aop,\|\mathbf{v}\|^{2}_{2}=\langle\phi(x),\Psi^{*}\Psi\phi(x)\rangle\leq k_{\mathcal{H}}(x,x)\|\Psi^{*}\Psi\|_{op}\leq\kappa^{2}\|\Psi^{*}\Psi\|_{op}=\kappa^{2}\|A_{\star}\|_{op},

almost surely. Next, we calculate the following quantity

Tr[𝐐(𝐐+γ𝐈m)1]\displaystyle\operatorname{Tr}\left[\mathbf{Q}(\mathbf{Q}+\gamma\mathbf{I}_{m})^{-1}\right] =Tr[ΨSSΨ(ΨSSΨ+γ𝐈m)1]\displaystyle=\operatorname{Tr}\left[\Psi S^{*}S\Psi^{*}(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\right]
=Tr[SΨ(ΨSSΨ+γ𝐈m)1ΨS]\displaystyle=\operatorname{Tr}\left[S\Psi^{*}(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\Psi S^{*}\right]
=Tr[SΨΨS(SΨΨS+γ𝕀)1]\displaystyle=\operatorname{Tr}\left[S\Psi^{*}\Psi S^{*}(S\Psi^{*}\Psi S^{*}+\gamma\mathbb{I})^{-1}\right]
=Tr[𝖠(𝖠+γ𝕀)1],\displaystyle=\operatorname{Tr}\left[\mathsf{A}(\mathsf{A}+\gamma\mathbb{I})^{-1}\right],

where we used the push-through identity at the next to last equality.

To obtain the bound on β\beta, we first write

𝐐(𝐐+γ𝐈m)1op=𝖠(𝖠+γ𝕀)1op=(1+γ/λmax(𝖠))1.\|\mathbf{Q}(\mathbf{Q}+\gamma\mathbf{I}_{m})^{-1}\|_{op}=\|\mathsf{A}(\mathsf{A}+\gamma\mathbb{I})^{-1}\|_{op}=(1+\gamma/\lambda_{\max}(\mathsf{A}))^{-1}.

To lower bound the latter quantity, we require γλmax(𝖠)\gamma\leq\lambda_{\max}(\mathsf{A}), and hence 𝐐(𝐐+γ𝐈m)1op1/2\|\mathbf{Q}(\mathbf{Q}+\gamma\mathbf{I}_{m})^{-1}\|_{op}\geq 1/2. In the remainder of the proof, we show the main matrix inequality. For convenience, define the upper bound in Proposition 17 as

t(p)=4c2β3γp+2c2βγp.t(p)=\frac{4c^{2}\beta}{3\gamma p}+\sqrt{\frac{2c^{2}\beta}{\gamma p}}. (23)

Thanks to Proposition 17, we know that with probability 1δ1-\delta, we have

t(ΨSSΨ+γ𝐈m)ΨSSΨΨSpSpΨt(ΨSSΨ+γ𝐈m),-t\left(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m}\right)\preceq\Psi S^{*}S\Psi^{*}-\Psi S_{p}^{*}S_{p}\Psi^{*}\preceq t\left(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m}\right),

or equivalently

ΨSSΨt(ΨSSΨ+γ𝐈m)ΨSpSpΨΨSSΨ+t(ΨSSΨ+γ𝐈m).\Psi S^{*}S\Psi^{*}-t(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})\preceq\Psi S_{p}^{*}S_{p}\Psi^{*}\preceq\Psi S^{*}S\Psi^{*}+t(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m}).

By simply adding γ𝐈m\gamma\mathbf{I}_{m} to these inequalities, we obtain

(1t)(ΨSSΨ+γ𝐈m)ΨSpSpΨ+γ𝐈m(1+t)(ΨSSΨ+γ𝐈m).(1-t)(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})\preceq\Psi S_{p}^{*}S_{p}\Psi^{*}+\gamma\mathbf{I}_{m}\preceq(1+t)(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m}).

Hence, if t<1t<1, by a simple manipulation, we find

(1+t)1(ΨSSΨ+γ𝐈m)1(ΨSpSpΨ+γ𝐈m)1(1t)1(ΨSSΨ+γ𝐈m)1.(1+t)^{-1}(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\preceq(\Psi S^{*}_{p}S_{p}\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\preceq(1-t)^{-1}(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}.

By acting with SΨS\Psi^{*} on the left and ΨS\Psi S^{*} on the right, and then, using the push-through identity

SΨ(ΨSSΨ+γ𝐈m)1ΨS=(SΨΨS+γ𝕀)1SΨΨS,S\Psi^{*}(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\Psi S^{*}=(S\Psi^{*}\Psi S^{*}+\gamma\mathbb{I})^{-1}S\Psi^{*}\Psi S^{*},

the desired result follows. ∎

We can now prove Theorem 6 by simplifying some of the bounds given in Theorem 16.

Proof of Theorem 6.

Consider the upper bound given in (23). We will simplify it to capture the dominant behavior as p+p\to+\infty. Assume 2c2βγp<1\sqrt{\frac{2c^{2}\beta}{\gamma p}}<1, or equivalently p>2c2βγ.p>\frac{2c^{2}\beta}{\gamma}. In this case, we give a simple upper bound on t(p)t(p) as follows

t(p)<(23+1)2c2βγp<8c2βγp,t(p)<\left(\frac{2}{3}+1\right)\sqrt{\frac{2c^{2}\beta}{\gamma p}}<\sqrt{\frac{8c^{2}\beta}{\gamma p}},

so that we avoid manipulating cumbersome expressions. Thus, if we want the latter bound to be smaller than ϵ(0,1)\epsilon\in(0,1), we require

p8c2βγϵ2,p\geq\frac{8c^{2}\beta}{\gamma\epsilon^{2}},

which is indeed larger than 2c2βγ\frac{2c^{2}\beta}{\gamma} since 1/ϵ>11/\epsilon>1. Thus, by using the same arguments as in the proof of Theorem 16, we have the multiplicative error bound

11+ϵ𝖪(γ)𝖪^(γ)11ϵ𝖪(γ),\frac{1}{1+\epsilon}\mathsf{K}(\gamma)\preceq\hat{\mathsf{K}}(\gamma)\preceq\frac{1}{1-\epsilon}\mathsf{K}(\gamma),

with probability at least 1δ1-\delta if

p8c2βγϵ2=8κ2Aopγϵ2log(4deff(γ)δ𝖪op)p\geq\frac{8c^{2}\beta}{\gamma\epsilon^{2}}=\frac{8\kappa^{2}\|A\|_{op}}{\gamma\epsilon^{2}}\log\left(\frac{4d_{\rm eff}(\gamma)}{\delta\|\mathsf{K}\|_{op}}\right)

where the last equality is obtained by substituting c2=κ2Aopc^{2}=\kappa^{2}\|A\|_{op} and β=log(4deff(γ)δ𝖪(γ)op)\beta=\log\left(\frac{4d_{\rm eff}(\gamma)}{\delta\|\mathsf{K}(\gamma)\|_{op}}\right) given in Theorem 16. ∎
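In practice, the sample size prescribed by Theorem 6 is straightforward to evaluate; a minimal sketch is given below. The inputs κ, ∥A_*∥_op, d_eff(γ) and ∥𝖪(γ)∥_op are assumed to be known or estimated beforehand, and the printed values are purely illustrative.

import numpy as np

def sample_size(kappa, A_op, gamma, eps, delta, d_eff, K_op):
    """Smallest p ensuring the (1 +/- eps) multiplicative bound of Theorem 6 (sketch)."""
    c2 = kappa**2 * A_op                          # c^2 = kappa^2 ||A_*||_op
    beta = np.log(4 * d_eff / (delta * K_op))     # beta as in Theorem 16
    return int(np.ceil(8 * c2 * beta / (gamma * eps**2)))

# hypothetical values, for illustration only
print(sample_size(kappa=1.0, A_op=5.0, gamma=0.1, eps=0.2, delta=0.05, d_eff=30.0, K_op=0.9))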

Appendix D Supplementary empirical results

D.1 Finer analysis of the Gaussian L-ensemble estimation problem of Section 6

In this section, we report results corresponding to the simulation setting of Section 6 with ρ=100\rho=100.

Intensity estimation from several DPP samples.

In Figure 4, we replicate the setting of Figure 1 with s=3s=3 and s=10s=10 DPP samples and a smaller regularization parameter. The estimated intensity is then closer to the ground truth (ρ=100\rho=100) for a large value of ss, although there are small areas of high intensity at the boundary of the domain [0,1]2[0,1]^{2}. A small improvement is also observed when increasing ss from 33 (left) to 1010 (right): the variance of the estimated intensity tends to decrease as ss increases.

Figure 4: Effect of the number of samples on intensity estimation (with σ=0.1\sigma=0.1 and λ=104\lambda=10^{-4}, as in Figure 1). Left column: estimation from s=3s=3 DPP samples. Right column: estimation from s=10s=10 DPP samples. The first row is a heatmap of the intensity 𝗄^(x,x)\hat{\mathsf{k}}(x,x) on a 100×100100\times 100 grid within [0,1]2[0,1]^{2}. The second row is the same data matrix in a flattened format, that is, each column of the 100×100100\times 100 data matrix is concatenated to form a 10000×110000\times 1 vector whose entries are plotted. Notice that the sharp peaks are due to boundary effects. These peaks are regularly spaced due to the column-wise unfolding.
Figure 5: Effect of the bandwidth σ\sigma on intensity estimation. Left column: large value (σ=0.15\sigma=0.15, λ=104\lambda=10^{-4}) with s=10s=10 DPP samples. Right column: small value (σ=0.05\sigma=0.05, λ=104\lambda=10^{-4}) with s=3s=3 DPP samples. The first row is a heatmap of the intensity on [0,1]2[0,1]^{2}. The second row is the same data matrix in a flattened format, that is, each column of the 100×100100\times 100 data matrix is concatenated to form a 10000×110000\times 1 vector whose entries are plotted. Notice that the sharp peaks in the bottom row are due to boundary effects. These peaks are regularly spaced due to the column-wise unfolding.

In Figure 5, we illustrate intensity estimation for a large and a small value of σ\sigma, on the left and right columns respectively. As expected, a large value of σ\sigma has a regularizing effect but also leads to an underestimation of the intensity. By contrast, a small value of σ\sigma seems to cause inhomogeneities in the estimated intensity.

Figure 6: Correlation kernel estimation without boundary effect. We display the Gram matrix of the estimated correlation kernel [𝗄^(x,x)]x,xgrid[\hat{\mathsf{k}}(x,x^{\prime})]_{x,x^{\prime}\in\text{grid}} (left column) and of the ground truth correlation kernel [𝗄(x,x)]x,xgrid[{\mathsf{k}}(x,x^{\prime})]_{x,x^{\prime}\in\text{grid}} (right column) on a 30×3030\times 30 grid within [0.2,0.8]2[0.2,0.8]^{2} for the example of Figure 4 with s=10s=10, σ=0.1\sigma=0.1 and λ=104\lambda=10^{-4}. The first row is a heatmap of the Gram matrices, while the second row is a one-dimensional slice of the above Gram matrices at index 400400. This second row of plots allows one to visually compare the bell shape of the approximate and exact correlation kernels. The apparent discontinuities in the Gaussian kernel shape are an artifact due to the way the grid points are indexed. Notice that the correlation kernels are evaluated on a smaller domain within [0,1]2[0,1]^{2} in order to remove boundary effects.
Correlation structure estimation.

The general form of the correlation kernel is also important. In order to visualize the shape of the correlation kernel 𝗄^(x,y)\hat{\mathsf{k}}(x,y), we display in Figure 6 the Gram matrices of the estimated [𝗄^(x,x)]x,xgrid[\hat{\mathsf{k}}(x,x^{\prime})]_{x,x^{\prime}\in\text{grid}} and ground truth correlation kernels [𝗄(x,x)]x,xgrid[\mathsf{k}(x,x^{\prime})]_{x,x^{\prime}\in\text{grid}} on a square grid, for the same parameters as in Figure 4 (right column). After removing the boundary effects, we observe that the shape of the estimated correlation kernel closely resembles the ground truth, although the decay of the estimated kernel seems to be slightly slower. Moreover, we observe some ‘noise’ in the tail of the estimated kernel. Again, the intensity of the estimated process is slightly underestimated.

In the context of point processes, it is common to compute summary statistics from samples to ‘understand’ the correlation structure of a stationary process. It is, strictly speaking, not possible to calculate e.g. Ripley’s K function (see Baddeley et al. [2015]) since our estimated correlation kernel is not stationary, that is, there exists no function t()t(\cdot) such that 𝗄^(x,y)=t(xy)\hat{\mathsf{k}}(x,y)=t(x-y).

D.2 Convergence of the regularized Picard iteration

Here, we illustrate the convergence of the regularized Picard iteration to the exact solution given in Proposition 12. In Figure 7, we solve problem (6) in the case =𝒞\mathcal{I}=\mathcal{C} with s=1s=1, where 𝒞\mathcal{C} is the unique DPP sample. For simplicity, we select the DPP sample used in Figure 1 (ρ=100\rho=100, bottom row).

Figure 7: Convergence towards the exact solution for the example of Figure 1 (bottom row) with the parameters σ=0.1\sigma=0.1 and λ=0.1\lambda=0.1. Left: Relative error with respect to the exact solution in Frobenius norm, 𝐁𝐁exactF/𝐁exactF\|\mathbf{B}-\mathbf{B}_{\rm exact}\|_{F}/\|\mathbf{B}_{\rm exact}\|_{F}, versus the iteration number. Right: Objective value (blue line) and optimal objective (red line) versus the iteration number. The stopping criterion is here 𝗍𝗈𝗅=107\mathsf{tol}=10^{-7}.

This illustrates that the regularized Picard iteration indeed converges to the unique solution in this special case.

D.3 Complexity and computing resources

Complexity.

The space complexity of our method is dominated by the storage of the kernel matrix 𝐊\mathbf{K} in Algorithm 1, namely O(m2)O(m^{2}), where we recall that m=|𝒵|m=|\mathcal{Z}| with 𝒵=1s𝒞\mathcal{Z}\triangleq\cup_{\ell=1}^{s}\mathcal{C}_{\ell}\cup\mathcal{I}. The time complexity of one iteration of (14) is dominated by the matrix square root, whose cost is comparable to an eigendecomposition, i.e., O(m3)O(m^{3}). The time complexity of Algorithm 2 is dominated by the Cholesky decomposition and the linear system solution, i.e., O(m3)O(m^{3}).

Computing resources.

A typical computation time is 65 minutes to solve the example of Figure 1 (bottom row) on 8 virtual cores of a server with two 18-core Intel Xeon E5-2695 v4 CPUs (2.1 GHz). The computational bottleneck is the regularized Picard iteration.

References

  • Affandi et al. [2014] R. H. Affandi, E. Fox, R. Adams, and B. Taskar. Learning the parameters of determinantal point process kernels. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1224–1232, Bejing, China, 22–24 Jun 2014. PMLR. URL http://proceedings.mlr.press/v32/affandi14.html.
  • Anquetil et al. [2020] L. Anquetil, M. Gartrell, A. Rakotomamonjy, U. Tanielian, and C. Calauzènes. Wasserstein learning of determinantal point processes. In Learning Meets Combinatorial Algorithms at NeurIPS2020, 2020. URL https://openreview.net/forum?id=fabfWf3JJQi.
  • Baddeley et al. [2015] A. Baddeley, E. Rubak, and R. Turner. Spatial Point Patterns: Methodology and Applications with R. Chapman and Hall/CRC Press, London, 2015. URL https://www.routledge.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/9781482210200/.
  • Baraud [2013] Y. Baraud. Estimation of the density of a determinantal process. Confluentes Mathematici, 5(1):3–21, 2013. doi: 10.5802/cml.1. URL http://www.numdam.org/articles/10.5802/cml.1/.
  • Bardenet and Titsias [2015] R. Bardenet and M. K. Titsias. Inference for determinantal point processes without spectral knowledge. In Advances in Neural Information Processing Systems (NIPS), pages 3375–3383, 2015. URL https://proceedings.neurips.cc/paper/2015/file/2f25f6e326adb93c5787175dda209ab6-Paper.pdf.
  • Belhadji et al. [2019] A. Belhadji, R. Bardenet, and P. Chainais. Kernel quadrature with determinantal point processes. In Advances in Neural Information Processing Systems (NeurIPS), 2019. URL https://proceedings.neurips.cc/paper/2019/file/7012ef0335aa2adbab58bd6d0702ba41-Paper.pdf.
  • Belhadji et al. [2020a] A. Belhadji, R. Bardenet, and P. Chainais. Kernel interpolation with continuous volume sampling. In International Conference on Machine Learning (ICML), 2020a. URL http://proceedings.mlr.press/v119/belhadji20a.html.
  • Belhadji et al. [2020b] A. Belhadji, R. Bardenet, and P. Chainais. A determinantal point process for column subset selection. Journal of Machine Learning Research (JMLR), 2020b. URL http://jmlr.org/papers/v21/19-080.html.
  • Biscio and Lavancier [2017] C. A. N. Biscio and F. Lavancier. Contrast estimation for parametric stationary determinantal point processes. Scandinavian Journal of Statistics, 44(1):204–229, 2017. doi: https://doi.org/10.1111/sjos.12249. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/sjos.12249.
  • Bornemann [2010] F. Bornemann. On the numerical evaluation of Fredholm determinants. Mathematics of Computation, 79(270):871–915, 2010. URL https://www.ams.org/journals/mcom/2010-79-270/S0025-5718-09-02280-7/.
  • Brunel et al. [2017a] V.-E. Brunel, A. Moitra, P. Rigollet, and J. Urschel. Maximum likelihood estimation of determinantal point processes. arXiv preprint arXiv:1701.06501, 2017a. URL https://arxiv.org/abs/1701.06501.
  • Brunel et al. [2017b] V.-E. Brunel, A. Moitra, P. Rigollet, and J. Urschel. Rates of estimation for determinantal point processes. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 343–345. PMLR, 2017b. URL http://proceedings.mlr.press/v65/brunel17a/brunel17a.pdf.
  • Derezinski et al. [2020] M. Derezinski, F. Liang, and M. Mahoney. Bayesian experimental design using regularized determinantal point processes. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3197–3207. PMLR, 26–28 Aug 2020. URL http://proceedings.mlr.press/v108/derezinski20a.html.
  • Dupuy and Bach [2018] C. Dupuy and F. Bach. Learning determinantal point processes in sublinear time. In A. Storkey and F. Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 244–257. PMLR, 09–11 Apr 2018. URL http://proceedings.mlr.press/v84/dupuy18a.html.
  • Gartrell et al. [2017] M. Gartrell, U. Paquet, and N. Koenigstein. Low-rank factorization of determinantal point processes. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, page 1912–1918. AAAI Press, 2017. URL https://ojs.aaai.org/index.php/AAAI/article/view/10869.
  • Gartrell et al. [2019] M. Gartrell, V.-E. Brunel, E. Dohmatob, and S. Krichene. Learning nonsymmetric determinantal point processes. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/cae82d4350cc23aca7fc9ae38dab38ab-Paper.pdf.
  • Gartrell et al. [2021] M. Gartrell, I. Han, E. Dohmatob, J. Gillenwater, and V.-E. Brunel. Scalable Learning and MAP Inference for Nonsymmetric Determinantal Point Processes. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=HajQFbx_yB.
  • Gautier et al. [2019] G. Gautier, R. Bardenet, and M. Valko. On two ways to use determinantal point processes for monte carlo integration. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/1d54c76f48f146c3b2d66daf9d7f845e-Paper.pdf.
  • Ghosh and Rigollet [2020] S. Ghosh and P. Rigollet. Gaussian determinantal processes: A new model for directionality in data. Proceedings of the National Academy of Sciences, 117(24):13207–13213, 2020. ISSN 0027-8424. doi: 10.1073/pnas.1917151117. URL https://www.pnas.org/content/117/24/13207.
  • Hough et al. [2006] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probability surveys, 2006. URL https://doi.org/10.1214/154957806000000078.
  • Hough et al. [2009] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Zeros of Gaussian analytic functions and determinantal point processes, volume 51. American Mathematical Society, 2009. URL https://doi.org/http://dx.doi.org/10.1090/ulect/051.
  • Kulesza and Taskar [2012] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 2012. URL http://dx.doi.org/10.1561/2200000044.
  • Lavancier et al. [2014] F. Lavancier, J. Møller, and E. Rubak. Determinantal point process models and statistical inference. Journal of the Royal Statistical Society, Series B, 2014. URL http://www.jstor.org/stable/24775312.
  • Lavancier et al. [2015] F. Lavancier, J. Møller, and E. Rubak. Determinantal point process models and statistical inference. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 77(4):853–877, 2015. ISSN 13697412, 14679868. URL http://www.jstor.org/stable/24775312.
  • Lavancier et al. [2021] F. Lavancier, A. Poinas, and R. Waagepetersen. Adaptive estimating function inference for non-stationary determinantal point processes. Scandinavian Journal of Statistics, 48(1):87–107, Mar. 2021. doi: 10.1111/sjos.12440. URL https://hal.archives-ouvertes.fr/hal-01816528.
  • Macchi [1975] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7:83–122, 1975. URL http://www.jstor.org/stable/1425855.
  • Mariet and Sra [2015] Z. Mariet and S. Sra. Fixed-point algorithms for learning determinantal point processes. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2389–2397, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/mariet15.html.
  • Mariet et al. [2019] Z. Mariet, M. Gartrell, and S. Sra. Learning determinantal point processes by corrective negative sampling. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2251–2260. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/mariet19b.html.
  • Marteau-Ferey et al. [2020] U. Marteau-Ferey, F. R. Bach, and A. Rudi. Non-parametric models for non-negative functions. In Advances in Neural Information Processing Systems 33, 2020. URL https://arxiv.org/abs/2007.03926.
  • Minsker [2017] S. Minsker. On some extensions of Bernstein’s inequality for self-adjoint operators. Statistics & Probability Letters, 127:111–119, 2017. URL https://www.sciencedirect.com/science/article/pii/S0167715217301207.
  • Poinas and Lavancier [2021] A. Poinas and F. Lavancier. Asymptotic approximation of the likelihood of stationary determinantal point processes. working paper or preprint, Mar. 2021. URL https://hal.archives-ouvertes.fr/hal-03157554.
  • Rosasco et al. [2010] L. Rosasco, M. Belkin, and E. D. Vito. On learning with integral operators. Journal of Machine Learning Research, 11(30):905–934, 2010. URL http://jmlr.org/papers/v11/rosasco10a.html.
  • Rudi et al. [2015] A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1657–1665. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5936-less-is-more-nystrom-computational-regularization.pdf.
  • Rudi et al. [2018] A. Rudi, D. Calandriello, L. Carratino, and L. Rosasco. On fast leverage score sampling and optimal learning. In Advances in Neural Information Processing Systems, pages 5673–5683, 2018. URL https://arxiv.org/abs/1810.13258.
  • Rudi et al. [2020] A. Rudi, U. Marteau-Ferey, and F. Bach. Finding global minima via kernel approximations. In Arxiv preprint arXiv:2012.11978, 2020. URL https://arxiv.org/abs/2012.11978.
  • Simon [1977] B. Simon. Notes on infinite determinants of Hilbert space operators. Advances in Mathematics, 24(3):244–273, 1977. ISSN 0001-8708. doi: https://doi.org/10.1016/0001-8708(77)90057-3. URL https://www.sciencedirect.com/science/article/pii/0001870877900573.
  • Soshnikov [2000] A. Soshnikov. Determinantal random point fields. Russian Mathematical Surveys, 55:923–975, 2000. URL https://doi.org/10.1070/rm2000v055n05abeh000321.
  • Sterge et al. [2020] N. Sterge, B. Sriperumbudur, L. Rosasco, and A. Rudi. Gain with no pain: Efficiency of kernel-pca by nyström sampling. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3642–3652. PMLR, 26–28 Aug 2020. URL http://proceedings.mlr.press/v108/sterge20a.html.
  • Urschel et al. [2017] J. Urschel, V.-E. Brunel, A. Moitra, and P. Rigollet. Learning determinantal point processes with moments and cycles. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3511–3520. PMLR, 06–11 Aug 2017. URL http://proceedings.mlr.press/v70/urschel17a.html.
  • Weidmann [1980] J. Weidmann. Linear Operators in Hilbert Spaces. 68. Springer-Verlag, 1980. doi: 10.1007/978-1-4612-6027-1. URL https://www.springer.com/gp/book/9781461260295.
  • Wendland [2004] H. Wendland. Scattered Data Approximation. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2004. doi: 10.1017/CBO9780511617539. URL https://doi.org/10.1017/CBO9780511617539.