
Nonparametric estimation of continuous DPPs
with kernel methods

Michaël Fanuel and Rémi Bardenet
Université de Lille, CNRS, Centrale Lille
UMR 9189 – CRIStAL, F-59000 Lille, France
{michael.fanuel, remi.bardenet}@univ-lille.fr
Abstract

Determinantal point processes (DPPs) are statistical models for repulsive point patterns. Both sampling and inference are tractable for DPPs, a rare feature among models with negative dependence that explains their popularity in machine learning and spatial statistics. Parametric and nonparametric inference methods have been proposed in the finite case, i.e. when the point patterns live in a finite ground set. In the continuous case, only parametric methods have been investigated, while nonparametric maximum likelihood for DPPs – an optimization problem over trace-class operators – has remained an open question. In this paper, we show that a restricted version of this maximum likelihood estimation (MLE) problem falls within the scope of a recent representer theorem for nonnegative functions in an RKHS. This leads to a finite-dimensional problem, with strong statistical ties to the original MLE. Moreover, we propose, analyze, and demonstrate a fixed point algorithm to solve this finite-dimensional problem. Finally, we also provide a controlled estimate of the correlation kernel of the DPP, thus improving interpretability.

1 Introduction

Determinantal point processes (DPPs) are a tractable family of models for repulsive point patterns, where interaction between points is parametrized by a positive semi-definite kernel. They were introduced by Macchi [1975] in the context of fermionic optics, and have gained a lot of interest since the 2000s, in particular in probability [Hough et al., 2006], spatial statistics [Lavancier et al., 2014], and machine learning [Kulesza and Taskar, 2012]. In machine learning at large, DPPs have been used essentially for two purposes: as statistical models for diverse subsets of items, like in recommendation systems [Gartrell et al., 2019], and as subsampling tools, like in experimental design [Derezinski et al., 2020], column subset selection [Belhadji et al., 2020b], or Monte Carlo integration [Gautier et al., 2019, Belhadji et al., 2019]. In this paper, we are concerned with DPPs used as statistical models for repulsion, and more specifically with inference for continuous DPPs.

DPP models in Machine Learning (ML) have so far mostly been finite DPPs: they are distributions over subsets of a (large) finite ground set, like subsets of sentences from a large corpus of documents [Kulesza and Taskar, 2012]. Since Affandi et al. [2014], a lot of effort has been put into designing efficient inference procedures for finite DPPs. In particular, the fixed point algorithm of Mariet and Sra [2015] allows for nonparametric inference of a finite DPP kernel, thus learning the features used for modelling diversity from the data. DPP models on infinite ground sets, say \mathbb{R}^{d}, while mathematically and algorithmically very similar to finite DPPs, have been less popular in ML than in spatial statistics. It is thus natural that work on inference for continuous DPPs has happened mostly in the latter community; see e.g. the seminal paper [Lavancier et al., 2014]. Inference for continuous DPPs has however focused on the parametric setting, where a handful of interpretable parameters are learned. Relatedly, spatial statisticians typically learn the correlation kernel of a DPP, which is more interpretable, while machine learners focus on the likelihood kernel, with structural assumptions to make learning scale to large ground sets.

In this paper, we tackle nonparametric inference for continuous DPPs using recent results on kernel methods. More precisely, maximum likelihood estimation (MLE) for continuous DPPs is an optimization problem over trace-class operators. Our first contribution is to show that a suitable modification of this problem is amenable to the representer theorem of Marteau-Ferey, Bach, and Rudi [2020]. Further drawing inspiration from the follow-up work [Rudi, Marteau-Ferey, and Bach, 2020], we derive an optimization problem over matrices, and we prove that its solution has a near optimal objective in the original MLE problem. We then propose, analyze, and demonstrate a fixed point algorithm for the resulting finite problem, in the spirit of the nonparametric inference for finite DPPs of Mariet and Sra [2015]. While our optimization pipeline focuses on the so-called likelihood kernel of a DPP, we also provide a controlled sampling approximation to its correlation kernel, thus making our estimated kernel operator more interpretable. A by-product contribution of independent interest is an analysis of a sampling approximation for Fredholm determinants.

The rest of the paper is organized as follows. Since the paper is notation-heavy, we first summarize our notation and give standard definitions in Section 1.1. In Section 2, we introduce DPPs and prior work on inference. In Section 3, we introduce our constrained maximum likelihood problem, and study its empirical counterpart. We analyze an algorithm to solve the latter in Section 4. Statistical guarantees are stated in Section 5, while Section 6 is devoted to numerically validating the whole pipeline. Our code is freely available at https://github.com/mrfanuel/LearningContinuousDPPs.jl.

1.1 Notation and background

Sets.

It is customary to define DPPs on a compact Polish space 𝒳\mathcal{X} endowed with a Radon measure μ\mu, so that we can define the space of square integrable functions L2(𝒳)L^{2}(\mathcal{X}) for this measure [Hough et al., 2009]. Outside of generalities in Section 2, we consider a compact \mathcal{X}\subset\mathbb{R}^{d} and μ\mu the uniform probability measure on 𝒳\mathcal{X}. Let (\mathcal{H},\langle\cdot,\cdot\rangle) be an RKHS of functions on 𝒳\mathcal{X} with a bounded continuous kernel k(x,y)k_{\mathcal{H}}(x,y), and let κ2=supx𝒳k(x,x)\kappa^{2}=\sup_{x\in\mathcal{X}}k_{\mathcal{H}}(x,x). Denote by ϕ(x)=k(x,)\phi(x)=k_{\mathcal{H}}(x,\cdot)\in\mathcal{H} the canonical feature map. For a Hilbert space \mathcal{F}, denote by 𝒮+()\mathcal{S}_{+}(\mathcal{F}) the space of symmetric and positive semi-definite trace-class operators on \mathcal{F}. By a slight abuse of notation, we denote by 𝒮+(n)\mathcal{S}_{+}(\mathbb{R}^{n}) the space of n×nn\times n real positive semi-definite matrices. Finally, all sets are denoted by calligraphic letters (e.g. 𝒞,\mathcal{C},\mathcal{I}).

Operators and matrices.

Trace-class endomorphisms of L2(𝒳)L^{2}(\mathcal{X}), seen as integral operators, are typeset as uppercase sans-serif (e.g. 𝖠,𝖪\mathsf{A},\mathsf{K}), and the corresponding integral kernels as lowercase sans-serif (e.g. 𝖺,𝗄\mathsf{a},\mathsf{k}). Notice that 𝗄(x,y)\mathsf{k}(x,y) and k(x,y)k_{\mathcal{H}}(x,y) are distinct functions. Other operators are written in standard fonts (e.g. A,SA,S), while we write matrices and finite-dimensional vectors in bold (e.g. 𝐊,𝐂,𝐯\mathbf{K},\mathbf{C},\mathbf{v}). The identity operator is written commonly as 𝕀\mathbb{I}, whereas the n×nn\times n identity matrix is denoted by 𝐈n\mathbf{I}_{n}. When 𝒞\mathcal{C} is a subset of {1,,n}\{1,\dots,n\} and 𝐊\mathbf{K} is an n×nn\times n matrix, the matrix 𝐊𝒞𝒞\mathbf{K}_{\mathcal{C}\mathcal{C}} is the square submatrix obtained by selecting the rows and columns of 𝐊\mathbf{K} indexed by 𝒞\mathcal{C}.

Restriction and reconstruction operators.

Following Rosasco et al. [2010, Section 3], we define the restriction operator S:L2(𝒳)S:\mathcal{H}\to L^{2}(\mathcal{X}) as (Sg)(x)=g(x).(Sg)(x)=g(x). Its adjoint S:L2(𝒳)S^{*}:L^{2}(\mathcal{X})\to\mathcal{H} is the reconstruction operator Sh=𝒳h(x)ϕ(x)dμ(x).S^{*}h=\int_{\mathcal{X}}h(x)\phi(x)\mathrm{d}\mu(x). The classical integral operator given by 𝖳kh=𝒳k(,x)h(x)dμ(x)\mathsf{T}_{k_{\mathcal{H}}}h=\int_{\mathcal{X}}k_{\mathcal{H}}(\cdot,x)h(x)\mathrm{d}\mu(x), seen as an endomorphism of L2(𝒳)L^{2}(\mathcal{X}), thus takes the simple expression 𝖳k=SS\mathsf{T}_{k_{\mathcal{H}}}=SS^{*}. Similarly, the so-called covariance operator C:C:\mathcal{H}\to\mathcal{H}, defined by C=𝒳ϕ(x)ϕ(x)¯dμ(x),C=\int_{\mathcal{X}}\phi(x)\otimes\overline{\phi(x)}\mathrm{d}\mu(x), writes C=SSC=S^{*}S. In the tensor product notation defining CC, ϕ(x)¯\overline{\phi(x)} is an element of the dual of \mathcal{H} and ϕ(x)ϕ(x)¯\phi(x)\otimes\overline{\phi(x)} is the endomorphism of \mathcal{H} defined by ((ϕ(x)ϕ(x)¯)(g)=g(x)ϕ(x)((\phi(x)\otimes\overline{\phi(x)})(g)=g(x)\phi(x); see e.g. Sterge et al. [2020]. Finally, for convenience, given a finite set {x1,,xn}𝒳\{x_{1},\dots,x_{n}\}\subset\mathcal{X}, we also define discrete restriction and reconstruction operators, respectively, as Sn:nS_{n}:\mathcal{H}\to\mathbb{R}^{n} such that Sng=(1/n)[g(x1),,g(xn)],S_{n}g=(1/\sqrt{n})[g(x_{1}),\dots,g(x_{n})]^{\top}, and Sn𝐯=(1/n)i=1n𝐯iϕ(xi)S_{n}^{*}\mathbf{v}=(1/\sqrt{n})\sum_{i=1}^{n}\mathbf{v}_{i}\phi(x_{i}) for any 𝐯n\mathbf{v}\in\mathbb{R}^{n}. In particular, we have SnSn=(1/n)𝐊S_{n}S_{n}^{*}=(1/n)\mathbf{K} where 𝐊=[k(xi,xj)]1i,jn\mathbf{K}=[k_{\mathcal{H}}(x_{i},x_{j})]_{1\leq i,j\leq n} is a kernel matrix, which is defined for a given ordering of the set {x1,,xn}\{x_{1},\dots,x_{n}\}. To avoid cumbersome expressions, when several discrete sets of different cardinalities, say nn and pp, are used, we simply write the respective sampling operators as SnS_{n} and SpS_{p}.
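To make these operators concrete, here is a minimal numerical sketch, not taken from the paper's code, that instantiates the discrete restriction and reconstruction operators for a Gaussian RKHS kernel on a few points of [0,1]^{2} and checks the identity S_{n}S_{n}^{*}=(1/n)\mathbf{K}; the kernel choice, the point set, and all function names are illustrative assumptions.

```python
import numpy as np

def k_H(x, y, sigma=0.1):
    # Gaussian RKHS kernel k_H(x, y) = exp(-||x - y||^2 / (2 sigma^2)) (illustrative choice)
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n = 5
X = rng.uniform(size=(n, 2))                            # points x_1, ..., x_n in [0, 1]^2
K = np.array([[k_H(xi, xj) for xj in X] for xi in X])   # kernel matrix [k_H(x_i, x_j)]

# Reconstruction operator: S_n^* v = (1/sqrt(n)) sum_i v_i phi(x_i), evaluated at a point x.
def reconstruct(v, x):
    return sum(v[i] * k_H(X[i], x) for i in range(n)) / np.sqrt(n)

# Restriction operator applied to g = S_n^* v: S_n g = (1/sqrt(n)) [g(x_1), ..., g(x_n)].
v = rng.normal(size=n)
Sn_Snstar_v = np.array([reconstruct(v, xi) for xi in X]) / np.sqrt(n)

# Check S_n S_n^* = (1/n) K on the vector v.
print(np.allclose(Sn_Snstar_v, K @ v / n))              # True
```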

2 Determinantal point processes and inference

Determinantal point processes and L-ensembles.

Consider a simple point process 𝒴\mathcal{Y} on 𝒳\mathcal{X}, that is, a random discrete subset of 𝒳\mathcal{X}. For 𝒟𝒳\mathcal{D}\subset\mathcal{X}, we denote by 𝒴(𝒟)\mathcal{Y}(\mathcal{D}) the number of points of this process that fall within 𝒟\mathcal{D}. Letting mm be a positive integer, we say that 𝒴\mathcal{Y} has mm-point correlation function ϱm\varrho_{m} w.r.t. the reference measure μ\mu if, for any mutually disjoint subsets 𝒟1,,𝒟m𝒳\mathcal{D}_{1},\dots,\mathcal{D}_{m}\subset\mathcal{X},

𝔼[i=1m𝒴(𝒟i)]=i=1m𝒟iϱm(x1,,xm)dμ(x1)dμ(xm).\mathbb{E}\left[\prod_{i=1}^{m}\mathcal{Y}(\mathcal{D}_{i})\right]=\int_{\prod_{i=1}^{m}\mathcal{D}_{i}}\varrho_{m}(x_{1},\dots,x_{m})\mathrm{d}\mu(x_{1})\dots\mathrm{d}\mu(x_{m}).

In most cases, a point process is characterized by its correlation functions (ρm)m1(\rho_{m})_{m\geq 1}. In particular, a determinantal point process (DPP) is defined as having correlation functions in the form of a determinant of a Gram matrix, i.e. ϱm(x1,,xm)=det[𝗄(xi,xj)]\varrho_{m}(x_{1},\dots,x_{m})=\det[\mathsf{k}(x_{i},x_{j})] for all m1m\geq 1. We then say that 𝗄\mathsf{k} is the correlation kernel of the DPP. Not all kernels yield a DPP: if 𝗄(x,y)\mathsf{k}(x,y) is the integral kernel of an operator 𝖪𝒮+(L2(𝒳))\mathsf{K}\in\mathcal{S}_{+}(L^{2}(\mathcal{X})), the Macchi-Soshnikov theorem [Macchi, 1975, Soshnikov, 2000] states that the corresponding DPP exists if and only if the eigenvalues of 𝖪\mathsf{K} are within [0,1][0,1]. In particular, for a finite ground set 𝒳\mathcal{X}, taking the reference measure to be the counting measure leads to conditions on the kernel matrix; see Kulesza and Taskar [2012].
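As a small illustration of this definition, the following sketch evaluates the two-point correlation function of the Gaussian correlation kernel used later in Section 6; the parameter values and test points are our own choices. The determinant vanishes as the two points approach each other, which is the repulsion characteristic of DPPs.

```python
import numpy as np

rho, alpha = 50.0, 0.05
def corr_kernel(x, y):
    # Gaussian correlation kernel k(x, y) = rho * exp(-||x - y||^2 / alpha^2), as in Section 6
    return rho * np.exp(-np.sum((x - y) ** 2) / alpha ** 2)

def varrho_m(points):
    # m-point correlation function: determinant of the Gram matrix [k(x_i, x_j)]
    G = np.array([[corr_kernel(xi, xj) for xj in points] for xi in points])
    return np.linalg.det(G)

x = np.array([0.5, 0.5])
for d in [0.2, 0.05, 0.01]:
    y = x + np.array([d, 0.0])
    print(d, varrho_m([x, y]))   # decreases towards 0 as the points get closer: repulsion
```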

A particular class of DPPs is formed by the so-called L-ensembles, for which the correlation kernel writes

𝖪=𝖠(𝖠+𝕀)1,\mathsf{K}=\mathsf{A}(\mathsf{A}+\mathbb{I})^{-1}, (1)

with the likelihood operator 𝖠𝒮+(L2(𝒳))\mathsf{A}\in\mathcal{S}_{+}(L^{2}(\mathcal{X})) taken to be of the form 𝖠f(x)=𝒳𝖺(x,y)f(y)dμ(y)\mathsf{A}f(x)=\int_{\mathcal{X}}\mathsf{a}(x,y)f(y)\mathrm{d}\mu(y). The kernel 𝖺\mathsf{a} of 𝖠\mathsf{A} is sometimes called the likelihood kernel of the L-ensemble, to distinguish it from its correlation kernel 𝗄\mathsf{k}. The interest of L-ensembles is that their Janossy densities can be computed in closed form. Informally, the mm-Janossy density describes the probability that the point process has cardinality mm, and that the points are located around a given set of distinct points x1,,xm𝒳x_{1},\dots,x_{m}\in\mathcal{X}. For the rest of the paper, we assume that 𝒳d\mathcal{X}\subset\mathbb{R}^{d} is compact, and that μ\mu is the uniform probability measure on 𝒳\mathcal{X}; our results straightforwardly extend to other densities w.r.t. Lebesgue. With these assumptions, the mm-Janossy density is proportional to

det(𝕀+𝖠)1det[𝖺(xi,xj)]1i,jm,\displaystyle\det(\mathbb{I}+\mathsf{A})^{-1}\cdot\det[\mathsf{a}(x_{i},x_{j})]_{1\leq i,j\leq m}, (2)

where the normalization constant is a Fredholm determinant that implicitly depends on 𝒳\mathcal{X}. Assume now that we are given ss i.i.d. samples of a DPP, denoted by the sets 𝒞1,,𝒞s\mathcal{C}_{1},\dots,\mathcal{C}_{s} in the window 𝒳\mathcal{X}. The associated maximum log-likelihood estimation problem reads

max𝖠𝒮+(L2(𝒳))1s=1slogdet[𝖺(xi,xj)]i,j𝒞logdet(𝕀+𝖠).\max_{\mathsf{A}\in\mathcal{S}_{+}(L^{2}(\mathcal{X}))}\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\mathsf{a}(x_{i},x_{j})\right]_{i,j\in\mathcal{C}_{\ell}}-\log\det(\mathbb{I}+\mathsf{A}). (3)

Solving (3) is a nontrivial problem. First, it is difficult to calculate the Fredholm determinant. Second, it is not straightforward to optimize over the space of operators 𝒮+(L2(𝒳))\mathcal{S}_{+}(L^{2}(\mathcal{X})) in a nonparametric setting. However, we shall see that the problem becomes tractable if we restrict the domain of (3) and impose regularity assumptions on the integral kernel 𝖺(x,y)\mathsf{a}(x,y) of the operator 𝖠\mathsf{A}. For more details on DPPs, we refer the interested reader to Hough et al. [2006], Kulesza and Taskar [2012], Lavancier et al. [2014].

Previous work on learning DPPs.

While continuous DPPs have been used in ML as sampling tools [Belhadji et al., 2019, 2020a] or models [Bardenet and Titsias, 2015, Ghosh and Rigollet, 2020], their systematic parametric estimation has been the work of spatial statisticians; see Lavancier et al. [2015], Biscio and Lavancier [2017], Poinas and Lavancier [2021] for general parametric estimation through (3) or so-called minimum-contrast inference. Still for the parametric case, a two-step estimation was recently proposed for non-stationary processes by Lavancier et al. [2021]. In a more general context, non-asymptotic risk bounds for estimating a DPP density are given in Baraud [2013].

Discrete DPPs have been more common in ML, and the study of their estimation started some years ago [Affandi et al., 2014]. Unlike for continuous DPPs, nonparametric estimation procedures have been investigated for finite DPPs by Mariet and Sra [2015], who proposed a fixed point algorithm. Moreover, the statistical properties of maximum likelihood estimation of discrete L-ensembles were studied by Brunel et al. [2017b]. We can also cite low-rank approaches [Dupuy and Bach, 2018, Gartrell et al., 2017], learning with negative sampling [Mariet et al., 2019], learning with moments and cycles [Urschel et al., 2017], or learning with Wasserstein training [Anquetil et al., 2020]. Learning non-symmetric finite DPPs [Gartrell et al., 2019, 2021] has also been proposed, motivated by recommender systems.

At a high level, our paper is a continuous counterpart to the nonparametric learning of finite DPPs with symmetric kernels in Mariet and Sra [2015]. Our treatment of the continuous case is made possible by recent advances in kernel methods.

3 A sampling approximation to a constrained MLE

Using the machinery of kernel methods, we develop a controlled approximation of the MLE problem (3). Let us outline the main landmarks of our approach. First, we restrict the domain of the MLE problem (3) to smooth operators. On the one hand, this restriction allows us to develop a sampling approximation of the Fredholm determinant. On the other hand, the new optimization problem now admits a finite rank solution that can be obtained by solving a finite-dimensional problem. This procedure is described in Algorithm 1 and yields an estimator for the likelihood kernel. Finally, we use another sampling approximation and solve a linear system to estimate the correlation kernel of the fitted DPP; see Algorithm 2.

Restricting to smooth operators.

In order to later apply the representer theorem of Marteau-Ferey et al. [2020], we restrict the original maximum likelihood problem (3) to “smooth” operators 𝖠=SAS\mathsf{A}=SAS^{*}, with A𝒮+()A\in\mathcal{S}_{+}(\mathcal{H}) and SS the restriction operator introduced in Section 1.1. Note that the kernel of 𝖠\mathsf{A} now writes

𝖺(x,y)=ϕ(x),Aϕ(y).\mathsf{a}(x,y)=\langle\phi(x),A\phi(y)\rangle. (4)

With this restriction on its domain, the optimization problem (3) now reads

minA𝒮+()f(A)=1s=1slogdet[ϕ(xi),Aϕ(xj)]i,j𝒞+logdet(𝕀+SAS).\min_{A\in\mathcal{S}_{+}(\mathcal{H})}f(A)=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\langle\phi(x_{i}),A\phi(x_{j})\rangle\right]_{i,j\in\mathcal{C}_{\ell}}+\log\det(\mathbb{I}+SAS^{*}). (5)
Approximating the Fredholm determinant.

We use a sampling approach to approximate the normalization constant in (5). We sample a set of points ={xi:1in}\mathcal{I}=\{x^{\prime}_{i}:1\leq i\leq n\} i.i.d. from the ambient probability measure μ\mu. For definiteness, we place ourselves on an event happening with probability one where all the points in \mathcal{I} and 𝒞\mathcal{C}_{\ell} for 1s1\leq\ell\leq s are distinct. We define the sample version of f(A)f(A) as

fn(A)=1s=1slogdet[ϕ(xi),Aϕ(xj)]i,j𝒞+logdet(𝐈n+SnASn),f_{n}(A)=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\langle\phi(x_{i}),A\phi(x_{j})\rangle\right]_{i,j\in\mathcal{C}_{\ell}}+\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*}),

where the Fredholm determinant of 𝖠=SAS\mathsf{A}=SAS^{*} has been replaced by the determinant of a finite matrix involving SnASn=1n[ϕ(xi),Aϕ(xj)]1i,jnS_{n}AS_{n}^{*}=\frac{1}{n}[\langle\phi(x^{\prime}_{i}),A\phi(x^{\prime}_{j})\rangle]_{1\leq i,j\leq n}, the 1/n1/n factor coming from the normalization of SnS_{n}.
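For AA of the representer form (8) introduced below, the entries of this matrix reduce to \langle\phi(x^{\prime}_{a}),A\phi(x^{\prime}_{b})\rangle=\sum_{i,j}\mathbf{C}_{ij}k_{\mathcal{H}}(x^{\prime}_{a},z_{i})k_{\mathcal{H}}(z_{j},x^{\prime}_{b}), so the determinant term can be assembled from a cross kernel matrix. The following Python sketch (ours, not the authors' released Julia code) illustrates the computation; the Gaussian kernel, the random positive semi-definite \mathbf{C}, and all names are assumptions made for illustration.

```python
import numpy as np

def gram(X, Y, sigma=0.1):
    # Gaussian RKHS kernel matrix [k_H(x, y)] (illustrative choice of kernel)
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
m, n = 8, 200
Z = rng.uniform(size=(m, 2))        # points z_1, ..., z_m carrying the representer form of A
Xp = rng.uniform(size=(n, 2))       # i.i.d. points x'_1, ..., x'_n for the Fredholm approximation

G = rng.normal(size=(m, m))
C = G @ G.T / m                     # arbitrary psd representer matrix (stand-in for an estimate)

K_nm = gram(Xp, Z)                  # cross kernel matrix [k_H(x'_a, z_i)]
M = K_nm @ C @ K_nm.T               # [<phi(x'_a), A phi(x'_b)>]_{a,b}

# Monte Carlo surrogate of the Fredholm determinant: log det(I_n + S_n A S_n^*)
sign, logdet = np.linalg.slogdet(np.eye(n) + M / n)
print(logdet)
```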

Theorem 1 (Approximation of the Fredholm determinant).

Let δ(0,1/2)\delta\in(0,1/2). With probability at least 12δ1-2\delta,

|logdet(𝐈n+SnASn)logdet(𝕀+SAS)|logdet(𝕀+cnA),\left|\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*})-\log\det(\mathbb{I}+SAS^{*})\right|\leq\log\det(\mathbb{I}+c_{n}A),

with

cn=4κ2log(2κ2δ)3n+2κ2log(2κ2δ)n,c_{n}=\frac{4\kappa^{2}\log\left(\frac{2\kappa^{2}}{\ell\delta}\right)}{3n}+\sqrt{\frac{2\kappa^{2}\ell\log\left(\frac{2\kappa^{2}}{\ell\delta}\right)}{n}},

where =λmax(𝖳k)\ell=\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}}) and κ2=supx𝒳k(x,x)<\kappa^{2}=\sup_{x\in\mathcal{X}}k_{\mathcal{H}}(x,x)<\infty.

The proof of Theorem 1 is given in Supplementary Material in Section C.2. Several remarks are in order. First, the high probability in the statement of Theorem 1 is that of the event \{\|S^{*}S-S^{*}_{n}S_{n}\|_{op}\lesssim c_{n}\}, where aba\lesssim b means that acba\leq cb for some constant c>0c>0. Importantly, all the results given in what follows for the approximation of the solution of (5) only depend on this event, so that we do not need any union bound. Second, we emphasize that 𝖳k\mathsf{T}_{k_{\mathcal{H}}}, defined in Section 1.1, should not be confused with the correlation kernel (1). Third, to interpret the bound in Theorem 1, it is useful to recall that logdet(𝕀+cnA)cnTr(A)\log\det(\mathbb{I}+c_{n}A)\leq c_{n}\operatorname{Tr}(A), since A𝒮+()A\in\mathcal{S}_{+}(\mathcal{H}). Thus, by penalizing Tr(A)\operatorname{Tr}(A), one also improves the upper bound on the Fredholm determinant approximation error. This remark motivates the following infinite-dimensional problem

minA𝒮+()fn(A)+λTr(A),\min_{A\in\mathcal{S}_{+}(\mathcal{H})}f_{n}(A)+\lambda\operatorname{Tr}(A), (6)

for some λ>0\lambda>0. The penalty on Tr(A)\operatorname{Tr}(A) is also intuitively needed so that the optimization problem selects a smooth solution, i.e., such a trace regularizer promotes a fast decay of the eigenvalues of AA. Note that this problem depends both on the data 𝒞1,,𝒞s𝒳\mathcal{C}_{1},\dots,\mathcal{C}_{s}\subset\mathcal{X} and on the subset \mathcal{I} used for approximating the Fredholm determinant.

Finite-dimensional representatives.

In an RKHS, there is a natural mapping between finite rank operators and matrices. For the sake of completeness, let 𝐊=[k(zi,zj)]1i,jm\mathbf{K}=[k_{\mathcal{H}}(z_{i},z_{j})]_{1\leq i,j\leq m} be a kernel matrix and let 𝐊=𝐑𝐑\mathbf{K}=\mathbf{R}^{\top}\mathbf{R} be a Cholesky factorization. Throughout the paper, kernel matrices are always assumed to be invertible. This is not a strong assumption: if kk_{\mathcal{H}} is the Laplace, Gaussian or Sobolev kernel, this is true almost surely if ziz_{i} for 1im1\leq i\leq m are sampled e.g. w.r.t. the Lebesgue measure; see Bochner’s classical theorem [Wendland, 2004, Theorem 6.6 and Corollary 6.9]. In this case, we can define a partial isometry V:mV:\mathcal{H}\to\mathbb{R}^{m} as V=m(𝐑1)Sm.V=\sqrt{m}(\mathbf{R}^{-1})^{\top}S_{m}. It satisfies VV=𝐈VV^{*}=\mathbf{I}, and VVV^{*}V is the orthogonal projector onto the span of ϕ(zi)\phi(z_{i}) for all 1im1\leq i\leq m. This is helpful to define

𝚽i=Vϕ(zi)m,\bm{\Phi}_{i}=V\phi(z_{i})\in\mathbb{R}^{m}, (7)

the finite-dimensional representative of ϕ(zi)\phi(z_{i})\in\mathcal{H} for all 1im1\leq i\leq m. This construction yields a useful mapping between an operator in 𝒮+()\mathcal{S}_{+}(\mathcal{H}) and a finite matrix, which is instrumental for obtaining our results.
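The following sketch checks these definitions numerically for a Gaussian kernel on random points, an illustrative setup of ours: since SmS_{m} applied to ϕ(zi)\phi(z_{i}) returns the iith column of \mathbf{K}/\sqrt{m}, the representative \bm{\Phi}_{i} is simply the iith column of \mathbf{R}, and VV=𝐈mVV^{*}=\mathbf{I}_{m}.

```python
import numpy as np

def gram(X, sigma=0.1):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
m = 6
Z = rng.uniform(size=(m, 2))
K = gram(Z)                                 # kernel matrix, assumed invertible
R = np.linalg.cholesky(K).T                 # K = R^T R, with R upper triangular

# Phi_i = V phi(z_i) = sqrt(m) R^{-T} S_m phi(z_i) = R^{-T} K e_i, i.e. the i-th column of R.
Phi = np.linalg.solve(R.T, K)               # columns Phi_1, ..., Phi_m
print(np.allclose(Phi, R))                  # Phi coincides with R
print(np.allclose(Phi.T @ Phi, K))          # <phi(z_i), phi(z_j)> = Phi_i^T Phi_j = K_ij

# V V^* = m R^{-T} (S_m S_m^*) R^{-1} = R^{-T} K R^{-1} = I_m
VVstar = np.linalg.solve(R.T, K) @ np.linalg.inv(R)
print(np.allclose(VVstar, np.eye(m)))       # partial isometry
```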

Lemma 2 (Finite dimensional representatives, extension of Lemma 3 in Rudi et al. [2020]).

Let A𝒮+()A\in\mathcal{S}_{+}(\mathcal{H}). Then, the matrix 𝐁¯=VAV\bar{\mathbf{B}}=VAV^{*} is such that 𝚽i𝐁¯𝚽j=ϕ(zi),Aϕ(zj) for all 1i,jm,\bm{\Phi}_{i}^{\top}\bar{\mathbf{B}}\bm{\Phi}_{j}=\langle\phi(z_{i}),A\phi(z_{j})\rangle\quad\text{ for all }1\leq i,j\leq m, and logdet(𝐈+𝐁¯)logdet(𝕀+A)\log\det(\mathbf{I}+\bar{\mathbf{B}})\leq\log\det(\mathbb{I}+A), as well as Tr(𝐁¯)Tr(A)\operatorname{Tr}(\bar{\mathbf{B}})\leq\operatorname{Tr}(A).

The proof of Lemma 2 is given in Section C.1. Notice that the partial isometry VV also helps to map a matrix in 𝒮+(m)\mathcal{S}_{+}(\mathbb{R}^{m}) to an operator in 𝒮+()\mathcal{S}_{+}(\mathcal{H}), as 𝐁V𝐁V,\mathbf{B}\mapsto V^{*}\mathbf{B}V, in such a way that we have the matrix element matching ϕ(zi),V𝐁Vϕ(zj)=𝚽i𝐁𝚽j\langle\phi(z_{i}),V^{*}\mathbf{B}V\phi(z_{j})\rangle=\bm{\Phi}_{i}^{\top}\mathbf{B}\bm{\Phi}_{j} for all 1i,jm1\leq i,j\leq m.

Finite rank solution thanks to a representer theorem.

The sampling approximation of the Fredholm determinant also yields a finite rank solution for (6). For simplicity, we define 𝒞=1s𝒞\mathcal{C}\triangleq\cup_{\ell=1}^{s}\mathcal{C}_{\ell} and recall ={x1,,xn}\mathcal{I}=\{x^{\prime}_{1},\dots,x^{\prime}_{n}\}. Then, write the set of points 𝒵𝒞\mathcal{Z}\triangleq\mathcal{C}\cup\mathcal{I} as {z1,,zm}\{z_{1},\dots,z_{m}\}, with m=|𝒞|+nm=|\mathcal{C}|+n, and denote the corresponding restriction operator Sm:mS_{m}:\mathcal{H}\to\mathbb{R}^{m}. Consider the kernel matrix 𝐊=[k(zi,zj)]1i,jm\mathbf{K}=[k_{\mathcal{H}}(z_{i},z_{j})]_{1\leq i,j\leq m}. In particular, since we used a trace regularizer, the representer theorem of Marteau-Ferey et al. [2020, Theorem 1] holds: the optimal operator is of the form

A=i,j=1m𝐂ijϕ(zi)ϕ(zj)¯ with 𝐂𝒮+(m).A=\sum_{i,j=1}^{m}\mathbf{C}_{ij}\phi(z_{i})\otimes\overline{\phi(z_{j})}\text{ with }\mathbf{C}\in\mathcal{S}_{+}(\mathbb{R}^{m}). (8)

In this paper, we call 𝐂\mathbf{C} the representer matrix of the operator AA. If we do the change of variables 𝐁=𝐑𝐂𝐑\mathbf{B}=\mathbf{R}\mathbf{C}\mathbf{R}^{\top}, we have the following identities: A=mSm𝐂Sm=V𝐁VA=mS_{m}^{*}\mathbf{C}S_{m}=V^{*}\mathbf{B}V and Tr(A)=Tr(𝐊𝐂)=Tr(𝐁)\operatorname{Tr}(A)=\operatorname{Tr}(\mathbf{K}\mathbf{C})=\operatorname{Tr}(\mathbf{B}), thanks to Lemma 7 in Supplementary Material. Therefore, the problem (6) boils down to the finite non-convex problem:

min𝐁0fn(V𝐁V)+λTr(𝐁),\min_{\mathbf{B}\succeq 0}f_{n}(V^{*}\mathbf{B}V)+\lambda\operatorname{Tr}(\mathbf{B}), (9)

where fn(V𝐁V)=1s=1slogdet[𝚽i𝐁𝚽j]i,j𝒞+logdet[δij+𝚽i𝐁𝚽j/||]i,jf_{n}(V^{*}\mathbf{B}V)=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\bm{\Phi}_{i}^{\top}\mathbf{B}\bm{\Phi}_{j}\right]_{i,j\in\mathcal{C}_{\ell}}+\log\det\left[\delta_{ij}+\bm{\Phi}_{i}^{\top}\mathbf{B}\bm{\Phi}_{j}/|\mathcal{I}|\right]_{i,j\in\mathcal{I}}. We assume that there is a global minimizer of (9) that we denote by 𝐁\mathbf{B}_{\star}. The final estimator of the integral kernel of the likelihood 𝖠\mathsf{A} depends on 𝐂=𝐑1𝐁𝐑1\mathbf{C}_{\star}=\mathbf{R}^{-1}\mathbf{B}_{\star}\mathbf{R}^{-1\top} and reads 𝖺^(x,y)=i,j=1m𝐂ijk(zi,x)k(zj,y).\hat{\mathsf{a}}(x,y)=\sum_{i,j=1}^{m}\mathbf{C}_{\star ij}k_{\mathcal{H}}(z_{i},x)k_{\mathcal{H}}(z_{j},y). The numerical strategy is summarized in Algorithm 1.

procedure Estimate𝖠\mathsf{A}(λ,𝒞1,,𝒞s\lambda,\mathcal{C}_{1},\dots,\mathcal{C}_{s})
     Sample ={x1,,xn}\mathcal{I}=\{x^{\prime}_{1},\dots,x^{\prime}_{n}\} i.i.d. from μ\mu \triangleright Sample nn points for Fredholm det. approx.
     Define 𝒵=1s𝒞\mathcal{Z}\triangleq\cup_{\ell=1}^{s}\mathcal{C}_{\ell}\cup\mathcal{I} \triangleright Collect all samples
     Compute 𝐊=𝐑𝐑\mathbf{K}=\mathbf{R}^{\top}\mathbf{R} with 𝐊=[k(zi,zj)]i,j𝒵\mathbf{K}=[k_{\mathcal{H}}(z_{i},z_{j})]_{i,j\in\mathcal{Z}} \triangleright Cholesky of kernel matrix
     Solve (9) with iteration (14) to obtain 𝐁\mathbf{B}_{\star} \triangleright Regularized Picard iteration
     Compute 𝐂=𝐑1𝐁𝐑1\mathbf{C}_{\star}=\mathbf{R}^{-1}\mathbf{B}_{\star}\mathbf{R}^{-1\top} \triangleright Representer matrix of 𝖺^(x,y)\hat{\mathsf{a}}(x,y)
     return 𝖺^(x,y)=i,j=1m𝐂ijk(zi,x)k(zj,y)\hat{\mathsf{a}}(x,y)=\sum_{i,j=1}^{m}\mathbf{C}_{\star ij}k_{\mathcal{H}}(z_{i},x)k_{\mathcal{H}}(z_{j},y) \triangleright Likelihood kernel
end procedure
Algorithm 1 Estimation of the integral kernel 𝖺(x,y)\mathsf{a}(x,y) of the DPP likelihood kernel 𝖠\mathsf{A}.
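As a concrete counterpart of Algorithm 1, the following Python sketch (ours; the authors' released code is in Julia) assembles the kernel matrix, evaluates the objective of (9) for a candidate \mathbf{B}, and forms the estimator \hat{\mathsf{a}}(x,y) from \mathbf{C}=\mathbf{R}^{-1}\mathbf{B}\mathbf{R}^{-1\top}. The toy data, the Gaussian kernel, and all names are illustrative assumptions; a sketch of the regularized Picard update that minimizes this objective is given after Section 4.

```python
import numpy as np

def gram(X, Y, sigma=0.1):
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def objective(B, Phi, C_idx, I_idx, lam):
    # Objective of (9): -(1/s) sum_l log det [Phi_i^T B Phi_j]_{C_l}
    #                   + log det (I + (Phi^T B Phi)_{II} / |I|) + lam Tr(B)
    M = Phi.T @ B @ Phi
    val = -np.mean([np.linalg.slogdet(M[np.ix_(c, c)])[1] for c in C_idx])
    n = len(I_idx)
    val += np.linalg.slogdet(np.eye(n) + M[np.ix_(I_idx, I_idx)] / n)[1]
    return val + lam * np.trace(B)

# Toy data: s = 2 "DPP samples" and n Fredholm points in [0, 1]^2 (illustrative only)
rng = np.random.default_rng(3)
C_samples = [rng.uniform(size=(6, 2)), rng.uniform(size=(5, 2))]
I_points = rng.uniform(size=(30, 2))
Z = np.vstack(C_samples + [I_points])                   # z_1, ..., z_m
m = Z.shape[0]
offsets = np.cumsum([0] + [len(c) for c in C_samples])
C_idx = [list(range(offsets[l], offsets[l + 1])) for l in range(len(C_samples))]
I_idx = list(range(offsets[-1], m))

K = gram(Z, Z) + 1e-10 * np.eye(m)                      # small jitter, as in Section 6
R = np.linalg.cholesky(K).T                             # K = R^T R
Phi = R                                                 # Phi_i = R e_i, see (7)

B = np.eye(m)                                           # feasible candidate; replace by the Picard solution B_star
print(objective(B, Phi, C_idx, I_idx, lam=0.1))

# Estimated likelihood kernel a_hat(x, y) = sum_ij C_ij k_H(z_i, x) k_H(z_j, y)
Rinv = np.linalg.inv(R)
C_star = Rinv @ B @ Rinv.T                              # C = R^{-1} B R^{-T}
def a_hat(x, y):
    return gram(np.atleast_2d(x), Z)[0] @ C_star @ gram(np.atleast_2d(y), Z)[0]
print(a_hat(np.array([0.2, 0.3]), np.array([0.25, 0.35])))
```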
Estimation of the correlation kernel.

The exact computation of the correlation kernel of the L-ensemble DPP

𝖪(γ)=𝖠(𝖠+γ𝕀)1,\mathsf{K}(\gamma)=\mathsf{A}(\mathsf{A}+\gamma\mathbb{I})^{-1}, (10)

requires the exact diagonalization of 𝖠=SAS\mathsf{A}=SAS^{*}. For more flexibility, we introduced a scale parameter γ>0\gamma>0 which often takes the value γ=1\gamma=1. It is instructive to approximate 𝖪\mathsf{K} in order to easily express the correlation functions of the estimated point process. We propose here an approximation scheme based once again on sampling. Recall the form of the solution A=mSm𝐂SmA=mS_{m}^{*}\mathbf{C}S_{m} of (6), and consider the factorization 𝐂=𝚲𝚲\mathbf{C}=\mathbf{\Lambda}^{\top}\mathbf{\Lambda} with 𝚲=𝐅𝐑1\mathbf{\Lambda}=\mathbf{F}\mathbf{R}^{-1\top} where 𝐅𝐅=𝐁\mathbf{F}^{\top}\mathbf{F}=\mathbf{B}_{\star} is the Cholesky factorization of 𝐁\mathbf{B}_{\star}. Let {x1′′,,xp′′}𝒳\{x^{\prime\prime}_{1},\dots,x^{\prime\prime}_{p}\}\subseteq\mathcal{X} be sampled i.i.d. from the probability measure μ\mu and denote by Sp:pS_{p}:\mathcal{H}\to\mathbb{R}^{p} the corresponding restriction operator. The following integral operator

𝖪^=mSSm𝚲(m𝚲SmSpSpSm𝚲+γ𝐈m)1𝚲SmS,\hat{\mathsf{K}}=mSS^{*}_{m}\mathbf{\Lambda}^{\top}(m\mathbf{\Lambda}S_{m}S^{*}_{p}S_{p}S^{*}_{m}\mathbf{\Lambda}^{\top}+\gamma\mathbf{I}_{m})^{-1}\mathbf{\Lambda}S_{m}S^{*}, (11)

gives an approximation of 𝖪\mathsf{K}. The numerical evaluation of (11) relies on the computation of the rectangular kernel matrix 𝐊mp=mpSmSp=[k(zi,xj′′)]\mathbf{K}_{mp}=\sqrt{mp}S_{m}S^{*}_{p}=[k_{\mathcal{H}}(z_{i},x^{\prime\prime}_{j})] with 1im1\leq i\leq m and 1jp1\leq j\leq p, associated with a fixed ordering of 𝒵={z1,,zm}\mathcal{Z}=\{z_{1},\dots,z_{m}\} and {x1′′,,xp′′}\{x^{\prime\prime}_{1},\dots,x^{\prime\prime}_{p}\}. Our strategy is described in Algorithm 2.

procedure Estimate𝖪\mathsf{K}(𝒵,𝐂\mathcal{Z},\mathbf{C}_{\star})
     Compute 𝐂=𝚲𝚲\mathbf{C}_{\star}=\mathbf{\Lambda}^{\top}\mathbf{\Lambda} \triangleright Factorization of representer matrix
     Sample {x1′′,,xp′′}𝒳\{x^{\prime\prime}_{1},\dots,x^{\prime\prime}_{p}\}\subseteq\mathcal{X} i.i.d. from μ\mu \triangleright Sample pp points
     Compute 𝐊mp=[k(zi,xj′′)]m×p\mathbf{K}_{mp}=[k_{\mathcal{H}}(z_{i},x^{\prime\prime}_{j})]\in\mathbb{R}^{m\times p} \triangleright Cross kernel matrix
     Compute 𝛀=𝚲(𝚲𝐊mp1p𝐊mp𝚲+𝐈m)1𝚲\bm{\Omega}=\mathbf{\Lambda}^{\top}(\mathbf{\Lambda}\mathbf{K}_{mp}\frac{1}{p}\mathbf{K}_{mp}^{\top}\mathbf{\Lambda}^{\top}+\mathbf{I}_{m})^{-1}\mathbf{\Lambda} \triangleright Representer matrix of 𝗄^(x,y)\hat{\mathsf{k}}(x,y)
     return 𝗄^(x,y)=i,j=1m𝛀ijk(zi,x)k(zj,y)\hat{\mathsf{k}}(x,y)=\sum_{i,j=1}^{m}\bm{\Omega}_{ij}k_{\mathcal{H}}(z_{i},x)k_{\mathcal{H}}(z_{j},y) \triangleright Correlation kernel
end procedure
Algorithm 2 Estimation of the integral kernel 𝗄(x,y)\mathsf{k}(x,y) of the DPP correlation kernel 𝖪=𝖠(𝖠+𝕀)1\mathsf{K}=\mathsf{A}(\mathsf{A}+\mathbb{I})^{-1}.
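Algorithm 2 can likewise be sketched in a few lines of Python (our own illustration, not the authors' code); the representer matrix below is a random stand-in for the output of Algorithm 1, and the Gaussian kernel and all names are assumptions.

```python
import numpy as np

def gram(X, Y, sigma=0.1):
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(4)
m, p = 20, 1000
Z = rng.uniform(size=(m, 2))                     # points z_1, ..., z_m from Algorithm 1
K = gram(Z, Z) + 1e-10 * np.eye(m)
R = np.linalg.cholesky(K).T                      # K = R^T R

# Stand-in for the output of Algorithm 1: a random psd matrix B_star (illustrative only)
G = rng.normal(size=(m, m))
B_star = G @ G.T / m
F = np.linalg.cholesky(B_star).T                 # B_star = F^T F
Lam = F @ np.linalg.inv(R).T                     # Lambda = F R^{-T}, so C_star = Lam^T Lam

# Algorithm 2: estimate the correlation kernel
Xpp = rng.uniform(size=(p, 2))                   # p i.i.d. points x''_1, ..., x''_p from mu
K_mp = gram(Z, Xpp)                              # cross kernel matrix [k_H(z_i, x''_j)]
middle = Lam @ K_mp @ K_mp.T @ Lam.T / p + np.eye(m)
Omega = Lam.T @ np.linalg.solve(middle, Lam)     # representer matrix of k_hat(x, y)

def k_hat(x, y):
    kx = gram(np.atleast_2d(x), Z)[0]
    ky = gram(np.atleast_2d(y), Z)[0]
    return kx @ Omega @ ky

# Intensity of the fitted kernel at (0.5, 0.5); meaningless here since B_star is a stand-in
print(k_hat(np.array([0.5, 0.5]), np.array([0.5, 0.5])))
```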

4 Implementation

We propose an algorithm for solving the discrete problem (9) associated to (6). To simplify the discussion and relate it to Mariet and Sra [2015], we define the objective g(𝐗)=fn(V𝐁(𝐗)V)+λTr(𝐁(𝐗))g(\mathbf{X})=f_{n}(V^{*}\mathbf{B}(\mathbf{X})V)+\lambda\operatorname{Tr}(\mathbf{B}(\mathbf{X})) with the change of variables 𝐁(𝐗)=𝐑1𝐗𝐑1\mathbf{B}(\mathbf{X})=\mathbf{R}^{-1\top}\mathbf{X}\mathbf{R}^{-1}. Then we can rephrase (9) as

min𝐗0g(𝐗)=1s=1slogdet(𝐗𝒞𝒞)+logdet(𝐈||+1n𝐗)+λTr(𝐗𝐊1),\min_{\mathbf{X}\succeq 0}g(\mathbf{X})=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})+\log\det\left(\mathbf{I}_{|\mathcal{I}|}+\frac{1}{n}\mathbf{X}_{\mathcal{I}\mathcal{I}}\right)+\lambda\operatorname{Tr}(\mathbf{X}\mathbf{K}^{-1}), (12)

where we recall that n=||n=|\mathcal{I}|. Define for convenience 𝐔\mathbf{U}_{\ell} as the matrix obtained by selecting the columns of the identity matrix which are indexed by 𝒞\mathcal{C}_{\ell}, so that, we have in particular 𝐗𝒞𝒞=𝐔𝐗𝐔\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}}=\mathbf{U}_{\ell}^{\top}\mathbf{X}\mathbf{U}_{\ell}. Similarly, define a sampling matrix 𝐔\mathbf{U}_{\mathcal{I}} associated to the subset \mathcal{I}. Recall the Cholesky decomposition 𝐊=𝐑𝐑\mathbf{K}=\mathbf{R}^{\top}\mathbf{R}. To minimize (12), we start at some 𝐗00\mathbf{X}_{0}\succ 0 and use the following iteration

𝐗k+1=12λ𝐑((𝐈m+4λ𝐑1p(𝐗k)𝐑1)1/2𝐈m)𝐑,\mathbf{X}_{k+1}=\frac{1}{2\lambda}\mathbf{R}^{\top}\left(\left(\mathbf{I}_{m}+4\lambda\mathbf{R}^{-1\top}p(\mathbf{X}_{k})\mathbf{R}^{-1}\right)^{1/2}-\mathbf{I}_{m}\right)\mathbf{R}, (13)

where p(𝐗)=𝐗+𝐗𝚫𝐗p(\mathbf{X})=\mathbf{X}+\mathbf{X}\mathbf{\Delta}\mathbf{X} and 𝚫(𝐗)=1s=1s𝐔𝐗𝒞𝒞1𝐔𝐔(𝐗+n𝐈||)1𝐔.\mathbf{\Delta}(\mathbf{X})=\frac{1}{s}\sum_{\ell=1}^{s}\mathbf{U}_{\ell}\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}}^{-1}\mathbf{U}_{\ell}^{\top}-\mathbf{U}_{\mathcal{I}}(\mathbf{X}_{\mathcal{I}\mathcal{I}}+n\mathbf{I}_{|\mathcal{I}|})^{-1}\mathbf{U}_{\mathcal{I}}^{\top}. We dub this sequence a regularized Picard iteration, as it is a generalization of the Picard iteration which was introduced by Mariet and Sra [2015] in the context of learning discrete L-ensemble DPPs. In Mariet and Sra [2015], the Picard iteration, defined as 𝐗k+1=p(𝐗k)\mathbf{X}_{k+1}=p(\mathbf{X}_{k}), is shown to be appropriate for minimizing a different objective given by: 1s=1slogdet(𝐗𝒞𝒞)+logdet(𝐈+𝐗)-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})+\log\det(\mathbf{I}+\mathbf{X}). The following theorem indicates that the iteration (13) is a good candidate for minimizing g(𝐗)g(\mathbf{X}).

Theorem 3.

Let 𝐗k\mathbf{X}_{k} for integer kk be the sequence generated by (13) and initialized with 𝐗00\mathbf{X}_{0}\succ 0. Then, the sequence g(𝐗k)g(\mathbf{X}_{k}) is monotonically decreasing.

For a proof, we refer to Section C.4. In practice, we use the iteration (13) with the inverse change of variables 𝐗(𝐁)=𝐑𝐁𝐑\mathbf{X}(\mathbf{B})=\mathbf{R}^{\top}\mathbf{B}\mathbf{R} and solve

𝐁k+1=12λ((𝐈m+4λq(𝐁k))1/2𝐈m), with q(𝐁)=𝐁+𝐁𝐑𝚫(𝐗(𝐁))𝐑𝐁,\mathbf{B}_{k+1}=\frac{1}{2\lambda}\left(\left(\mathbf{I}_{m}+4\lambda q(\mathbf{B}_{k})\right)^{1/2}-\mathbf{I}_{m}\right),\text{ with }q(\mathbf{B})=\mathbf{B}+\mathbf{B}\mathbf{R}\mathbf{\Delta}\big{(}\mathbf{X}(\mathbf{B})\big{)}\mathbf{R}^{\top}\mathbf{B}, (14)

where 𝚫(𝐗)\mathbf{\Delta}(\mathbf{X}) is given above. For the stopping criterion, we monitor the objective values of (9) and stop if the relative variation of two consecutive objective values is less than a predefined precision threshold 𝗍𝗈𝗅\mathsf{tol}. Contrary to (12), the objective (9) does not include the inverse of 𝐊\mathbf{K}, which might be ill-conditioned. The interplay between λ\lambda and nn is best understood by considering (9) with the change of variables 𝐁=𝐁/n\mathbf{B}^{\prime}=\mathbf{B}/n, yielding the equivalent problem

min𝐁01s=1slogdet(𝚽𝐁𝚽)𝒞𝒞+logdet(𝐈+𝚽𝐁𝚽)+λnTr(𝐁),\min_{\mathbf{B}^{\prime}\succeq 0}-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left(\bm{\Phi}^{\top}\mathbf{B}^{\prime}\bm{\Phi}\right)_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}}+\log\det\left(\mathbf{I}+\bm{\Phi}^{\top}\mathbf{B}^{\prime}\bm{\Phi}\right)_{\mathcal{I}\mathcal{I}}+\lambda n\operatorname{Tr}(\mathbf{B}^{\prime}),

where 𝚽=𝐑\bm{\Phi}=\mathbf{R} is a matrix whose ii-th column is 𝚽i\bm{\Phi}_{i} for 1im1\leq i\leq m as defined in (7). Notice that, up to a 1/n\nicefrac{{1}}{{n}} factor, 𝚽𝐁𝚽\bm{\Phi}^{\top}\mathbf{B}^{\prime}\bm{\Phi} is the in-sample Gram matrix of 𝖺^(x,y)\hat{\mathsf{a}}(x,y) evaluated on the data set 𝒵\mathcal{Z}; see Algorithm 1. Thus, in the limit λ0\lambda\to 0, the above expression corresponds to maximum likelihood estimation of a finite DPP if 𝒞\cup_{\ell}\mathcal{C}_{\ell}\subseteq\mathcal{I}. This is the intuitive connection with finite DPPs: the continuous DPP is well approximated by a finite DPP if the ground set \mathcal{I} is a dense enough sampling within 𝒳\mathcal{X}.
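The following sketch implements the regularized Picard iteration (14) on toy data, monitoring the objective of (9) as a stopping criterion. It is our own Python rendering (the authors' code is in Julia), with an illustrative Gaussian kernel, random data, and a dense matrix square root, so it is meant to convey the update rather than to be an efficient implementation.

```python
import numpy as np

def gram(X, Y, sigma=0.1):
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def sqrtm_psd(M):
    # symmetric psd matrix square root via eigendecomposition
    w, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def regularized_picard(R, C_idx, I_idx, lam, tol=1e-5, max_iter=500):
    # Iteration (14): B_{k+1} = (1/(2 lam)) ((I + 4 lam q(B_k))^{1/2} - I),
    # with q(B) = B + B R Delta(X) R^T B and X = R^T B R.
    m, n = R.shape[0], len(I_idx)
    B, prev = np.eye(m), None
    for _ in range(max_iter):
        X = R.T @ B @ R
        # Delta(X) = (1/s) sum_l U_l X_{C_l C_l}^{-1} U_l^T - U_I (X_II + n I)^{-1} U_I^T
        Delta = np.zeros((m, m))
        for c in C_idx:
            Delta[np.ix_(c, c)] += np.linalg.inv(X[np.ix_(c, c)]) / len(C_idx)
        Delta[np.ix_(I_idx, I_idx)] -= np.linalg.inv(X[np.ix_(I_idx, I_idx)] + n * np.eye(n))
        q = B + B @ R @ Delta @ R.T @ B
        B = (sqrtm_psd(np.eye(m) + 4 * lam * q) - np.eye(m)) / (2 * lam)
        # stopping criterion: relative variation of the objective of (9)
        M = R.T @ B @ R
        obj = -np.mean([np.linalg.slogdet(M[np.ix_(c, c)])[1] for c in C_idx]) \
              + np.linalg.slogdet(np.eye(n) + M[np.ix_(I_idx, I_idx)] / n)[1] \
              + lam * np.trace(B)
        if prev is not None and abs(prev - obj) <= tol * abs(prev):
            break
        prev = obj
    return B

# Toy run with the same kind of data as in the sketch following Algorithm 1 (illustrative)
rng = np.random.default_rng(3)
C_samples = [rng.uniform(size=(6, 2)), rng.uniform(size=(5, 2))]
I_points = rng.uniform(size=(30, 2))
Z = np.vstack(C_samples + [I_points])
offsets = np.cumsum([0] + [len(c) for c in C_samples])
C_idx = [list(range(offsets[l], offsets[l + 1])) for l in range(len(C_samples))]
I_idx = list(range(offsets[-1], Z.shape[0]))
K = gram(Z, Z) + 1e-10 * np.eye(Z.shape[0])
R = np.linalg.cholesky(K).T
B_star = regularized_picard(R, C_idx, I_idx, lam=0.1)
print(np.round(np.linalg.eigvalsh(R.T @ B_star @ R)[::-1], 4))  # spectrum of the in-sample matrix, cf. Figure 2 (left)
```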

5 Theoretical guarantees

We now describe the guarantees coming with the approximations presented in the previous section.

Statistical guarantees for approximating the maximum likelihood problem

Next, we give a statistical guarantee for the approximation of the log-likelihood by its sample version.

Theorem 4 (Discrete optimal objective approximates full MLE objective).

Let 𝐁\mathbf{B}_{\star} be the solution of (9). Let AA_{\star} be the solution of (5). Let δ(0,1/2)\delta\in(0,1/2). If λ2cn(δ)\lambda\geq 2c_{n}(\delta), then with probability at least 12δ1-2\delta, it holds that

|f(A)fn(V𝐁V)|32λTr(A),|f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V)|\leq\frac{3}{2}\lambda\operatorname{Tr}(A_{\star}),

with 0<cn1/n0<c_{n}\lesssim 1/\sqrt{n} given in Theorem 1.

The above result, proved in Section C.3, shows that, with high probability, the optimal objective value of the discrete problem is not far from the optimal log-likelihood provided nn is large enough. As a simple consequence, the discrete solution also yields a finite rank operator V𝐁VV^{*}\mathbf{B}_{\star}V whose likelihood is not far from the optimal likelihood f(A)f(A_{\star}), as can be shown using a triangle inequality.

Corollary 5 (Approximation of the full MLE optimizer by a finite rank operator).

Under the assumptions of Theorem 4, if λ2cn(δ)\lambda\geq 2c_{n}(\delta), with probability at least 12δ1-2\delta, it holds

|f(A)f(V𝐁V)|3λTr(A)|f(A_{\star})-f(V^{*}\mathbf{B}_{\star}V)|\leq 3\lambda\operatorname{Tr}(A_{\star})

with cn1/nc_{n}\lesssim 1/\sqrt{n} given in Theorem 1.

The proof of Corollary 5 is also provided in Section C.3.

Approximation of the correlation kernel

An important quantity for controlling the number of points necessary to approximate the correlation kernel well is the so-called effective dimension deff(γ)=Tr(𝖠(𝖠+γ𝕀)1),d_{\rm eff}(\gamma)=\operatorname{Tr}\left(\mathsf{A}(\mathsf{A}+\gamma\mathbb{I})^{-1}\right), which is the expected sample size under the DPP with correlation kernel 𝖪=𝖠(𝖠+γ𝕀)1\mathsf{K}=\mathsf{A}(\mathsf{A}+\gamma\mathbb{I})^{-1}.
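In the same Monte Carlo spirit as before, deff(γ)d_{\rm eff}(\gamma) can be approximated by replacing the nonzero spectrum of 𝖠=SAS\mathsf{A}=SAS^{*} with that of SpASpS_{p}AS_{p}^{*} for pp i.i.d. points; for the representer form (8), this again only involves cross kernel matrices. The sketch below uses a random stand-in representer matrix and a Gaussian kernel, both illustrative assumptions of ours.

```python
import numpy as np

def gram(X, Y, sigma=0.1):
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(5)
m, p, gamma = 20, 1000, 1.0
Z = rng.uniform(size=(m, 2))
G = rng.normal(size=(m, m))
C = G @ G.T / m + 1e-10 * np.eye(m)     # stand-in representer matrix of A (illustrative)

# Nonzero eigenvalues of S_p A S_p^* = (1/p) K_pm C K_pm^T, computed through an m x m matrix
Xpp = rng.uniform(size=(p, 2))          # p i.i.d. points from mu
K_pm = gram(Xpp, Z)
L = np.linalg.cholesky(C)               # C = L L^T
eigs = np.linalg.eigvalsh(L.T @ (K_pm.T @ K_pm / p) @ L)

d_eff = np.sum(eigs / (eigs + gamma))   # approximate expected sample size of the fitted DPP
print(d_eff)
```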

Theorem 6 (Correlation kernel approximation).

Let δ(0,1)\delta\in(0,1) be a failure probability, let ϵ(0,1)\epsilon\in(0,1) be an accuracy parameter and let γ>0\gamma>0 be a scale factor. Let 𝖪(γ)\mathsf{K}(\gamma) be the correlation kernel (10) defined with 𝖠=SAS\mathsf{A}=SAS^{*}. Consider 𝖪^(γ)\hat{\mathsf{K}}(\gamma) defined in (11) with i.i.d. sampling of pp points in 𝒳\mathcal{X} w.r.t. μ\mu. If we take p8κ2Aopγϵ2log(4deff(γ)δ𝖪op),p\geq\frac{8\kappa^{2}\|A\|_{op}}{\gamma\epsilon^{2}}\log\left(\frac{4d_{\rm eff}(\gamma)}{\delta\|\mathsf{K}\|_{op}}\right), then, with probability 1δ1-\delta, the following multiplicative error bound holds 11+ϵ𝖪(γ)𝖪^(γ)11ϵ𝖪(γ).\frac{1}{1+\epsilon}\mathsf{K}(\gamma)\preceq\hat{\mathsf{K}}(\gamma)\preceq\frac{1}{1-\epsilon}\mathsf{K}(\gamma).

The proof of Theorem 6, given in Section C.5, mainly relies on a matrix Bernstein inequality. Let us make a few comments. First, we can simply take γ=1\gamma=1 in Theorem 6 to recover the common definition of the correlation kernel (1). Second, the presence of deff(γ)d_{\rm eff}(\gamma) in the logarithm is welcome since it is the expected subset size of the L-ensemble. Third, the quantity Aop\|A\|_{op} directly influences the typical sample size to get an accurate approximation. A worst case bound is Aopλmax(𝐂)λmax(𝐊)\|A\|_{op}\leq\lambda_{\max}(\mathbf{C})\lambda_{\max}(\mathbf{K}) with 𝐊=[k(zi,zj)]1i,jm\mathbf{K}=[k_{\mathcal{H}}(z_{i},z_{j})]_{1\leq i,j\leq m} and where we used that A=mSm𝐂SmA=mS_{m}^{*}\mathbf{C}S_{m} in light of (8). Thus, the lower bound on pp may be large in practice. Although probably more costly, an approach inspired from approximate ridge leverage score sampling [Rudi et al., 2018] is likely to allow smaller values of pp. We leave this to future work.

6 Empirical evaluation

We consider an L-ensemble with correlation kernel 𝗄(x,y)=ρexp(xy22/α2)\mathsf{k}(x,y)=\rho\exp(-\|x-y\|_{2}^{2}/\alpha^{2}) defined on d\mathbb{R}^{d} with α=0.05\alpha=0.05. Following Lavancier et al. [2014], this is a valid kernel if ρ<(πα)d\rho<(\sqrt{\pi}\alpha)^{-d}. Note that the intensity, defined as x𝗄(x,x)x\mapsto\mathsf{k}(x,x), is constant equal to ρ\rho; we shall check that the fitted kernel recovers that property.

Figure 1: Intensity estimation with σ=0.1\sigma=0.1 and λ=0.1\lambda=0.1 from 1 DPP sample with ρ=50\rho=50 (top row) and ρ=100\rho=100 (bottom row). On the LHS, a DPP sample (hexagons) and n=1000n=1000 uniform samples (crosses); the color is the diagonal of 𝚽𝐁𝚽\bm{\Phi}^{\top}\mathbf{B}\bm{\Phi} (in-sample likelihood kernel). On the RHS, out-of-sample estimated intensity 𝗄^(x,x)\hat{\mathsf{k}}(x,x) of the learned process on a 100×100100\times 100 grid.

We draw samples from this continuous DPP in the window 𝒳=[0,1]2\mathcal{X}=[0,1]^{2}, using the code of Poinas and Lavancier [2021], available at https://github.com/APoinas/MLEDPP; it relies on the R package spatstat [Baddeley et al., 2015], available under the GPL-2 / GPL-3 licence. Two such samples are shown as hexagons in the first column of Figure 1, with respective intensity ρ=50\rho=50 and ρ=100\rho=100. For the estimation, we use a Gaussian kernel k(x,y)=exp(xy22/(2σ2))k_{\mathcal{H}}(x,y)=\exp\left(-\|x-y\|_{2}^{2}/(2\sigma^{2})\right) with σ>0\sigma>0. The computation of the correlation kernel always uses p=1000p=1000 uniform samples. Iteration (14) is run until the precision threshold 𝗍𝗈𝗅=105\mathsf{tol}=10^{-5} is achieved. For stability, we add 10^{-10} to the diagonal of the Gram matrix 𝐊\mathbf{K}. The remaining parameter values are given in captions. We empirically observe that the regularized Picard iteration returns a matrix 𝐁\mathbf{B}_{\star} such that 𝚽𝐁𝚽\bm{\Phi}^{\top}\mathbf{B}_{\star}\bm{\Phi} is low rank; see Figure 2 (left). A lesson from Figure 1 is that the sample size of the DPP has to be large enough to retrieve a constant intensity k^(x,x)\hat{k}(x,x). In particular, the top row of this figure illustrates a case where σ\sigma is too small. Also, due to the large regularization λ=0.1\lambda=0.1 and the use of only one DPP sample, the scale of ρ\rho is clearly underestimated in this example. On the contrary, in Figure 3, for a smaller regularization parameter λ=0.01\lambda=0.01, the intensity scale estimate is larger. We also observe that a large regularization parameter tends to smooth out the local variations of the intensity, which is not surprising.

Figure 2: Analysis of the solution corresponding to the example of Figure 1 with ρ=100\rho=100. Left: eigenvalues of 𝚽𝐁𝚽\bm{\Phi}^{\top}\mathbf{B}\bm{\Phi}. Middle: Gram matrix of 𝗄^(x,y)\hat{\mathsf{k}}(x,y) on a regular 30×3030\times 30 grid within [0,1]2[0,1]^{2}. Right: Gram matrix of 𝗄(x,y)\mathsf{k}(x,y) on the same grid.

A comparison between a Gram matrix of 𝗄^(x,y)\hat{\mathsf{k}}(x,y) and 𝗄(x,y)\mathsf{k}(x,y) is given in Figure 2 corresponding to the example of Figure 1. The short-range diagonal structure is recovered, while some long-range structures are smoothed out. More illustrative simulations are given in Section D, with a study of the influence of the hyperparameters, including the use of s>1s>1 DPP samples. In particular, the estimation of the intensity is improved if several DPP samples are used with a smaller value of λ\lambda.

Figure 3: Intensity estimation with σ=0.1\sigma=0.1 and λ=0.01\lambda=0.01 from 1 DPP sample with ρ=100\rho=100. On the LHS, a DPP sample (hexagons) and n=1000n=1000 uniform samples (crosses); the color is the diagonal of 𝚽𝐁𝚽\bm{\Phi}^{\top}\mathbf{B}\bm{\Phi} (in-sample likelihood kernel). On the RHS, out-of-sample estimated intensity 𝗄^(x,x)\hat{\mathsf{k}}(x,x) of the learned process on a 100×100100\times 100 grid.

7 Discussion

We leveraged recent progress on kernel methods to propose a nonparametric approach to learning continuous DPPs. We see three major limitations of our procedure. First, our final objective function is nonconvex, and our algorithm is only guaranteed to improve its objective monotonically. Experimental evidence suggests that our approach recovers the synthetic kernel, but more work is needed to study the maximizers of the likelihood, in the spirit of Brunel et al. [2017a] for finite DPPs, and the properties of our fixed point algorithm. Second, the estimated integral kernel does not have any explicit structure, other than being implicitly forced to be low-rank because of the trace penalty. Adding structural assumptions might be desirable, either for modelling or learning purposes. For modelling, it is not uncommon to assume that the underlying continuous DPP is stationary, for example, which implies that the correlation kernel 𝗄(x,y)\mathsf{k}(x,y) depends only on xyx-y. For learning, structural assumptions on the kernel may reduce the computational cost, or reduce the number of maximizers of the likelihood. The third limitation of our pipeline is that, like most nonparametric methods, it still requires tuning a handful of hyperparameters, and, in our experience, the final performance varies significantly with the lengthscale of the RKHS kernel or the coefficient of the trace penalty. An automatic tuning procedure with guarantees would make the pipeline turn-key.

Maybe unexpectedly, future work could also deal with transferring our pipeline to the finite DPP setting. Indeed, we saw in Section 4 that in some asymptotic regime, our regularized MLE objective is close to a regularized version of the MLE objective for a finite DPP. Investigating this maximum a posteriori inference problem may shed some new light on nonparametric inference for finite DPPs. Intuitively, regularization should improve learning and prediction when data is scarce.

Acknowledgements

We thank Arnaud Poinas and Guillaume Gautier for useful discussions on the manuscript. We acknowledge support from ERC grant Blackjack (ERC-2019-STG-851866) and ANR AI chair Baccarat (ANR-20-CHIA-0002).

Supplementary Material

Roadmap.

In Section A, we present useful technical results. Next, we show that in a specific case the problem (6) admits a closed-form solution, which we discuss in Section B. Notably, this special case allows us to understand the behaviour of the estimated DPP kernel in both the small and large regularization (λ\lambda) limits. In Section C, all the deferred proofs are given. Finally, Section D provides a finer analysis of the empirical results of Section 6 as well as a description of the convergence of the regularized Picard algorithm to the closed-form solution described in Section B.

Appendix A Useful technical results

A.1 Technical lemmata

We preface the proofs of our main results with three lemmata.

Lemma 7.

Let kk be a strictly positive definite kernel of a RKHS \mathcal{H} and let 𝐂\mathbf{C} be a n×nn\times n symmetric matrix. Assume that {xi}1in\{x_{i}\}_{1\leq i\leq n} are such that the Gram matrix 𝐊=[k(xi,xj)]1i,jn\mathbf{K}=[k_{\mathcal{H}}(x_{i},x_{j})]_{1\leq i,j\leq n} is non-singular. Then, the non-zero eigenvalues of i,j=1n𝐂ijϕ(xi)ϕ(xj)¯\sum_{i,j=1}^{n}\mathbf{C}_{ij}\phi(x_{i})\otimes\overline{\phi(x_{j})} correspond to the non-zero eigenvalues of 𝐊𝐂\mathbf{K}\mathbf{C}.

Proof.

By definition of the sampling operator (see Section 1.1), it holds that Sn𝐂Sn=1ni,j=1n𝐂ijϕ(xi)ϕ(xj)¯S_{n}^{*}\mathbf{C}S_{n}=\frac{1}{n}\sum_{i,j=1}^{n}\mathbf{C}_{ij}\phi(x_{i})\otimes\overline{\phi(x_{j})} and we have SnSn𝐂=1n𝐊𝐂S_{n}S_{n}^{*}\mathbf{C}=\frac{1}{n}\mathbf{K}\mathbf{C}. Thus, we need to show that the non-zero eigenvalues of Sn𝐂SnS_{n}^{*}\mathbf{C}S_{n} correspond to the non-zero eigenvalues of SnSn𝐂S_{n}S_{n}^{*}\mathbf{C}.

Let gλg_{\lambda}\in\mathcal{H} be an eigenvector of Sn𝐂SnS_{n}^{*}\mathbf{C}S_{n} with eigenvalue λ0\lambda\neq 0. First, we show that SngλS_{n}g_{\lambda} is an eigenvector of SnSn𝐂S_{n}S_{n}^{*}\mathbf{C} with eigenvalue λ\lambda. We have Sn𝐂Sngλ=λgλS_{n}^{*}\mathbf{C}S_{n}g_{\lambda}=\lambda g_{\lambda}. By acting on both sides of the latter equation with SnS_{n}, we find Sn(Sn𝐂Sn)gλ=λSngλS_{n}(S_{n}^{*}\mathbf{C}S_{n})g_{\lambda}=\lambda S_{n}g_{\lambda}. This is equivalent to SnSn𝐂(Sngλ)=λ(Sngλ)S_{n}S_{n}^{*}\mathbf{C}(S_{n}g_{\lambda})=\lambda(S_{n}g_{\lambda}). Second, since SnSn0S_{n}S_{n}^{*}\succ 0, remark that SnSn𝐂S_{n}S_{n}^{*}\mathbf{C} is related by a similarity to (SnSn)1/2𝐂(SnSn)1/2(S_{n}S_{n}^{*})^{1/2}\mathbf{C}(S_{n}S_{n}^{*})^{1/2}, which is diagonalizable. Since Sn𝐂SnS_{n}^{*}\mathbf{C}S_{n} is at most of rank nn, the non-zero eigenvalues of Sn𝐂SnS_{n}^{*}\mathbf{C}S_{n} match the non-zero eigenvalues of (SnSn)1/2𝐂(SnSn)1/2(S_{n}S_{n}^{*})^{1/2}\mathbf{C}(S_{n}S_{n}^{*})^{1/2}, which in turn are the same as the non-zero eigenvalues of SnSn𝐂S_{n}S_{n}^{*}\mathbf{C}. ∎
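A quick numerical check of the key similarity step in this proof, in a hedged sketch of ours with a Gaussian kernel and a random symmetric \mathbf{C}: \mathbf{K}\mathbf{C} is similar to \mathbf{K}^{1/2}\mathbf{C}\mathbf{K}^{1/2}, so the two share their (real) spectrum.

```python
import numpy as np

def gram(X, sigma=0.1):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(6)
n = 7
X = rng.uniform(size=(n, 2))
K = gram(X)                                       # non-singular Gram matrix
C = rng.normal(size=(n, n)); C = (C + C.T) / 2    # symmetric matrix

# K C = K^{1/2} (K^{1/2} C K^{1/2}) K^{-1/2}: a similarity, hence identical spectra.
w, V = np.linalg.eigh(K)
K_half = V @ np.diag(np.sqrt(w)) @ V.T
e1 = np.sort(np.linalg.eigvals(K @ C).real)
e2 = np.sort(np.linalg.eigvalsh(K_half @ C @ K_half))
print(np.allclose(e1, e2))                        # True (up to numerical error)
```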

Lemma 8.

Let 𝚺𝒮+(m)\mathbf{\Sigma}\in\mathcal{S}_{+}(\mathbb{R}^{m}) and let \mathcal{I} be a subset of {1,,m}\{1,\dots,m\}. Then, the function

𝚺logdet(𝚺)+logdet(𝐈m+𝚺1/||)\mathbf{\Sigma}\mapsto\log\det(\mathbf{\Sigma})+\log\det(\mathbf{I}_{m}+\mathbf{\Sigma}^{-1}/|\mathcal{I}|)_{\mathcal{I}\mathcal{I}} (15)

is strictly concave on {𝚺0}\{\mathbf{\Sigma}\succ 0\}.

Proof.

To simplify the expression, we do the change of variables 𝚺𝚺/||\mathbf{\Sigma}\mapsto\mathbf{\Sigma}/|\mathcal{I}| and analyse logdet(𝚺)+logdet(𝐈m+𝚺1)\log\det(\mathbf{\Sigma})+\log\det(\mathbf{I}_{m}+\mathbf{\Sigma}^{-1})_{\mathcal{I}\mathcal{I}} which differs from the original function by an additive constant. Let 𝐔\mathbf{U}_{\mathcal{I}} be the matrix obtained by selecting the columns of the identity matrix which are indexed by \mathcal{I}. We rewrite the second term in (15) as

logdet(𝐈m+𝚺1)=logdet(𝐈||+𝐔𝚺1𝐔)=logdet(𝐈m+𝚺1/2𝐔𝐔𝚺1/2),\log\det(\mathbf{I}_{m}+\mathbf{\Sigma}^{-1})_{\mathcal{I}\mathcal{I}}=\log\det(\mathbf{I}_{|\mathcal{I}|}+\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}_{\mathcal{I}})=\log\det(\mathbf{I}_{m}+\mathbf{\Sigma}^{-1/2}\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{\Sigma}^{-1/2}),

by Sylvester’s identity. This leads to

logdet(𝚺)+logdet(𝐈m+𝚺1)=logdet(𝚺+𝐔𝐔).\log\det(\mathbf{\Sigma})+\log\det(\mathbf{I}_{m}+\mathbf{\Sigma}^{-1})_{\mathcal{I}\mathcal{I}}=\log\det(\mathbf{\Sigma}+\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top}).

To check that logdet(𝚺+𝐔𝐔)\log\det(\mathbf{\Sigma}+\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top}) is strictly concave on {𝚺0}\{\mathbf{\Sigma}\succ 0\}, we verify that its Hessian is negative in any direction 𝐇\mathbf{H}. Its first order directional derivative reads Tr((𝚺+𝐔𝐔)1𝐇)\operatorname{Tr}\left((\mathbf{\Sigma}+\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top})^{-1}\mathbf{H}\right). Hence, the second order directional derivative in the direction of 𝐇\mathbf{H} writes

Tr((𝚺+𝐔𝐔)1𝐇(𝚺+𝐔𝐔)1𝐇),-\operatorname{Tr}\left((\mathbf{\Sigma}+\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top})^{-1}\mathbf{H}(\mathbf{\Sigma}+\mathbf{U}_{\mathcal{I}}\mathbf{U}_{\mathcal{I}}^{\top})^{-1}\mathbf{H}\right),

which is indeed a negative number. ∎

The following result is borrowed from Mariet and Sra [2015, Lemma 2.3.].

Lemma 9.

Let 𝚺𝒮+(m)\mathbf{\Sigma}\in\mathcal{S}_{+}(\mathbb{R}^{m}) and let 𝐔m×\mathbf{U}\in\mathbb{R}^{m\times\ell} be a matrix with m\ell\leq m orthonormal columns. Then, the function logdet(𝐔𝚺1𝐔)-\log\det(\mathbf{U}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}) is concave on 𝚺0\mathbf{\Sigma}\succ 0.

Proof.

The function logdet(𝐔𝚺1𝐔)-\log\det(\mathbf{U}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}) is concave on 𝚺0\mathbf{\Sigma}\succ 0 since logdet(𝐔𝚺1𝐔)\log\det(\mathbf{U}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}) is convex on 𝚺0\mathbf{\Sigma}\succ 0 for any 𝐔\mathbf{U} such that 𝐔𝐔=𝐈\mathbf{U}^{\top}\mathbf{U}=\mathbf{I} as stated in Mariet and Sra [2015, Lemma 2.3.]. ∎

A.2 Use of the representer theorem

We here clarify the definition of the representer theorem used in this paper.

A.2.1 Extended representer theorem

In Section 3, we used a slight extension of the representer theorem of Marteau-Ferey et al. [2020] which we clarify here. Let us first define some notations. Let \mathcal{H} be a RKHS with feature map ϕ()\phi(\cdot) and {z1,,zm}\{z_{1},\dots,z_{m}\} be a data set such as defined in Section 1.1. Define

hA(z)=ϕ(z),Aϕ(z) and hA(z,z)=ϕ(z),Aϕ(z).h_{A}(z)=\langle\phi(z),A\phi(z)\rangle\text{ and }h_{A}(z,z^{\prime})=\langle\phi(z),A\phi(z^{\prime})\rangle.

In this paper, we consider the problem

minA𝒮+()L(hA(zi,zj))1i,jm+Tr(A),\min_{A\in\mathcal{S}_{+}(\mathcal{H})}L\left(h_{A}(z_{i},z_{j})\right)_{1\leq i,j\leq m}+\operatorname{Tr}(A), (16)

where LL is a loss function (specified below). In contrast, the first term of the problem considered in Marteau-Ferey et al. [2020] is of the following form: L(hA(zi))1imL(h_{A}(z_{i}))_{1\leq i\leq m}. In other words, the latter loss function involves only diagonal elements ϕ(zi),Aϕ(zi)\langle\phi(z_{i}),A\phi(z_{i})\rangle for 1im1\leq i\leq m while (16) also involves off-diagonal elements. Now, denote by Πm\Pi_{m} the projector on span{ϕ(zi),i=1,,m}\operatorname{\mathrm{span}}\{\phi(z_{i}),i=1,\dots,m\}, and define

𝒮m,+()={ΠmAΠm:A𝒮+()}.\mathcal{S}_{m,+}(\mathcal{H})=\{\Pi_{m}A\Pi_{m}:A\in\mathcal{S}_{+}(\mathcal{H})\}.

Then, we have the following proposition.

Proposition 10 (Extension of Proposition 7 in Marteau-Ferey et al. [2020]).

Let LL be a lower semi-continuous function such that L(hA(zi,zj))1i,jm+Tr(A)L(h_{A}(z_{i},z_{j}))_{1\leq i,j\leq m}+\operatorname{Tr}(A) is bounded below. Then (16) has a solution AA_{\star} which is in 𝒮m,+()\mathcal{S}_{m,+}(\mathcal{H}).

Proof sketch.

The key step is the following identity

hA(zi,zj)=ϕ(zi),Aϕ(zj)=ϕ(zi),ΠmAΠmϕ(zj)=hΠmAΠm(zi,zj),(1i,jm)h_{A}(z_{i},z_{j})=\langle\phi(z_{i}),A\phi(z_{j})\rangle=\langle\phi(z_{i}),\Pi_{m}A\Pi_{m}\phi(z_{j})\rangle=h_{\Pi_{m}A\Pi_{m}}(z_{i},z_{j}),\ (1\leq i,j\leq m)

which is a direct consequence of the definition of Πm\Pi_{m}. Also, we have Tr(ΠmAΠm)Tr(A)\operatorname{Tr}(\Pi_{m}A\Pi_{m})\leq\operatorname{Tr}(A). The remainder of the proof follows exactly the same lines as in Marteau-Ferey et al. [2020]. Notice that, compared with Marteau-Ferey et al. [2020, Proposition 7], we do not require the loss LL to be lower bounded but rather ask the full objective to be lower bounded, which is a weaker assumption but does not alter the argument. ∎

A.2.2 Applying the extended representer theorem

Now, let us prove that the objective of (6), given by fn(A)+λTr(A),f_{n}(A)+\lambda\operatorname{Tr}(A), is lower bounded. We recall that

fn(A)=1s=1slogdet[ϕ(xi),Aϕ(xj)]i,j𝒞+logdet(𝐈n+SnASn).f_{n}(A)=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\langle\phi(x_{i}),A\phi(x_{j})\rangle\right]_{i,j\in\mathcal{C}_{\ell}}+\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*}).

For clarity, we recall that 𝒞=1s𝒞\mathcal{C}\triangleq\cup_{\ell=1}^{s}\mathcal{C}_{\ell} and ={x1,,xn}\mathcal{I}=\{x^{\prime}_{1},\dots,x^{\prime}_{n}\}. Then, write the set of points 𝒵𝒞\mathcal{Z}\triangleq\mathcal{C}\cup\mathcal{I} as {z1,,zm}\{z_{1},\dots,z_{m}\} with m=|𝒞|+nm=|\mathcal{C}|+n. Recall that we placed ourselves on an event happening with probability one where the sets 𝒞1,,𝒞s,\mathcal{C}_{1},\dots,\mathcal{C}_{s},\mathcal{I} are disjoint. Now, we denote by Πm\Pi_{m} the orthogonal projector on span{ϕ(zi):1im}\operatorname{\mathrm{span}}\{\phi(z_{i}):1\leq i\leq m\}, which writes

Πm=i,j=1m(K1)ijϕ(zi)ϕ(zj)¯.\Pi_{m}=\sum_{i,j=1}^{m}(K^{-1})_{ij}\phi(z_{i})\otimes\overline{\phi(z_{j})}. (17)

First, we have logdet(𝐈n+SnASn)0\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*})\geq 0. Second, we use

Tr(A)=Tr(ΠmAΠm)+Tr(ΠmAΠm)Tr(ΠmAΠm)=Tr(ΠmA).\operatorname{Tr}(A)=\operatorname{Tr}\left(\Pi_{m}A\Pi_{m}\right)+\operatorname{Tr}\left(\Pi_{m}^{\perp}A\Pi_{m}^{\perp}\right)\geq\operatorname{Tr}\left(\Pi_{m}A\Pi_{m}\right)=\operatorname{Tr}\left(\Pi_{m}A\right).

Thus, a lower bound on the objective (6) is

fn(A)+λTr(A)1s=1slogdet[ϕ(xi),Aϕ(xj)]i,j𝒞+λTr(ΠmA).\displaystyle f_{n}(A)+\lambda\operatorname{Tr}(A)\geq-\frac{1}{s}\sum_{\ell=1}^{s}\log\det\left[\langle\phi(x_{i}),A\phi(x_{j})\rangle\right]_{i,j\in\mathcal{C}_{\ell}}+\lambda\operatorname{Tr}\left(\Pi_{m}A\right).

Now we define the matrix MM with elements Mij=ϕ(zi),Aϕ(zj)M_{ij}=\langle\phi(z_{i}),A\phi(z_{j})\rangle for 1i,jm1\leq i,j\leq m and notice that

Tr(ΠmA)=Tr(MK1)Tr(M)/λmax(K).\operatorname{Tr}\left(\Pi_{m}A\right)=\operatorname{Tr}(MK^{-1})\geq\operatorname{Tr}(M)/\lambda_{\max}(K).

Remark that the matrix defined by M()=[ϕ(xi),Aϕ(xj)]i,j𝒞M^{(\ell)}=\left[\langle\phi(x_{i}),A\phi(x_{j})\rangle\right]_{i,j\in\mathcal{C}_{\ell}} associated with the \ell-th DPP sample for 1s1\leq\ell\leq s is a principal submatrix of the m×mm\times m matrix M=[ϕ(zi),Aϕ(zj)]1i,jmM=\left[\langle\phi(z_{i}),A\phi(z_{j})\rangle\right]_{1\leq i,j\leq m}. Since the sets 𝒞\mathcal{C}_{\ell} are disjoint, we have =1sTr(M())Tr(M)\sum_{\ell=1}^{s}\operatorname{Tr}(M^{(\ell)})\leq\operatorname{Tr}(M). Denoting λ=λλmax(K)\lambda^{\prime}=\frac{\lambda}{\lambda_{\max}(K)}, by using the last inequality, we find

fn(A)+λTr(A)\displaystyle f_{n}(A)+\lambda\operatorname{Tr}(A) 1s=1slogdetM()+λ=1sTr(M())\displaystyle\geq-\frac{1}{s}\sum_{\ell=1}^{s}\log\det M^{(\ell)}+\lambda^{\prime}\sum_{\ell=1}^{s}\operatorname{Tr}(M^{(\ell)})
=1s=1s(logdetM()+sλTr(M())).\displaystyle=\frac{1}{s}\sum_{\ell=1}^{s}\left(-\log\det M^{(\ell)}+s\lambda^{\prime}\operatorname{Tr}(M^{(\ell)})\right).

Finally, we use the following proposition to show that each term in the above sum is bounded from below.

Proposition 11.

Let a>0a>0. The function h(σ)=log(σ)+aσh(\sigma)=-\log(\sigma)+a\sigma satisfies h(σ)h(1/a)=1+log(a)h(\sigma)\geq h(1/a)=1+\log(a) for all σ>0\sigma>0.

Thus a lower bound for the objective of (6) is obtained by applying Proposition 11 with a=sλ/λmax(K)a=s\lambda/\lambda_{\max}(K) to each eigenvalue of M()M^{(\ell)} for all 1s1\leq\ell\leq s.

A.3 Boundedness of the discrete objective function

At first sight, we may wonder if the objective of the optimization problem (12) is lower bounded. We show here that the optimization problem is well-defined for all regularization parameters λ>0\lambda>0. The lower boundedness of the discrete objective follows directly from Section A.2.2 in the case of a finite rank operator. For completeness, we repeat below the argument in the discrete case. Explicitly, this discrete objective reads

g(𝐗)=1s=1slogdet(𝐗𝒞𝒞)+logdet(𝐈||+1n𝐗)+λTr(𝐗𝐊1),g(\mathbf{X})=-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})+\log\det\left(\mathbf{I}_{|\mathcal{I}|}+\frac{1}{n}\mathbf{X}_{\mathcal{I}\mathcal{I}}\right)+\lambda\operatorname{Tr}(\mathbf{X}\mathbf{K}^{-1}),

for all 𝐗0\mathbf{X}\succeq 0. First, since 𝐗0\mathbf{X}_{\mathcal{I}\mathcal{I}}\succeq 0, we have logdet(𝐈||+1n𝐗)0\log\det\left(\mathbf{I}_{|\mathcal{I}|}+\frac{1}{n}\mathbf{X}_{\mathcal{I}\mathcal{I}}\right)\geq 0. Next, we use that 𝐊0\mathbf{K}\succ 0 by assumption, which implies that Tr(𝐗𝐊1)Tr(𝐗)/λmax(𝐊)\operatorname{Tr}(\mathbf{X}\mathbf{K}^{-1})\geq\operatorname{Tr}(\mathbf{X})/\lambda_{\max}(\mathbf{K}). Denote λ=λλmax(𝐊)\lambda^{\prime}=\frac{\lambda}{\lambda_{\max}(\mathbf{K})}. Thus, we can lower bound the objective function as follows:

g(𝐗)\displaystyle g(\mathbf{X}) 1s=1slogdet(𝐗𝒞𝒞)+λTr(𝐗)\displaystyle\geq-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})+\lambda^{\prime}\operatorname{Tr}(\mathbf{X})
1s=1s(logdet(𝐗𝒞𝒞)+λsTr(𝐗𝒞𝒞)),\displaystyle\geq\frac{1}{s}\sum_{\ell=1}^{s}\big{(}-\log\det(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})+\lambda^{\prime}s\operatorname{Tr}(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}})\big{)},

where we used that Tr(𝐗)=Tr(𝐗)+=1sTr(𝐗𝒞𝒞)\operatorname{Tr}(\mathbf{X})=\operatorname{Tr}(\mathbf{X}_{\mathcal{I}\mathcal{I}})+\sum_{\ell=1}^{s}\operatorname{Tr}(\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}}) with Tr(𝐗)0\operatorname{Tr}(\mathbf{X}_{\mathcal{I}\mathcal{I}})\geq 0 for 𝐗0\mathbf{X}\succeq 0. Hence, by using Proposition 11, g(𝐗)g(\mathbf{X}) is bounded from below by a sum of functions which are lower bounded.

Appendix B Extra result: an explicit solution for the single-sample case

We note that if the dataset is made of only one sample (i.e., s=1s=1), and if that sample is also used to approximate the Fredholm determinant (i.e., =𝒞\mathcal{I}=\mathcal{C}), then the problem (6) admits an explicit solution.

Proposition 12.

Let 𝒞={xi𝒳}1im\mathcal{C}=\{x_{i}\in\mathcal{X}\}_{1\leq i\leq m} be such that the kernel matrix 𝐊=[k(xi,xj)]1i,jm\mathbf{K}=[k_{\mathcal{H}}(x_{i},x_{j})]_{1\leq i,j\leq m} is invertible. Consider the problem (6) with =𝒞\mathcal{I}=\mathcal{C} and λ>0\lambda>0. Then the solution of (6) has the form A=i,j=1m𝐂ijϕ(xi)ϕ(xj)¯A_{\star}=\sum_{i,j=1}^{m}\mathbf{C}_{\star ij}\phi(x_{i})\otimes\overline{\phi(x_{j})}, with

𝐂=𝐂(λ)=12𝐊2((m2𝐈m+4m𝐊/λ)1/2m𝐈m).\mathbf{C}_{\star}=\mathbf{C}_{\star}(\lambda)=\frac{1}{2}\mathbf{K}^{-2}\left((m^{2}\mathbf{I}_{m}+4m\mathbf{K}/\lambda)^{1/2}-m\mathbf{I}_{m}\right).
Proof.

By the representer theorem in Marteau-Ferey et al. [2020, Theorem 1], the solution is of the form A=i,j=1m𝐂ijϕ(xi)ϕ(xj)¯A=\sum_{i,j=1}^{m}\mathbf{C}_{ij}\phi(x_{i})\otimes\overline{\phi(x_{j})} where 𝐂\mathbf{C} is the solution of

min𝐂0logdet(𝐊𝐂𝐊)+logdet(𝐈m+𝐊𝐂𝐊/m)+λTr(𝐊𝐂),\min_{\mathbf{C}\succ 0}-\log\det\left(\mathbf{K}\mathbf{C}\mathbf{K}\right)+\log\det\left(\mathbf{I}_{m}+\mathbf{K}\mathbf{C}\mathbf{K}/m\right)+\lambda\operatorname{Tr}(\mathbf{K}\mathbf{C}),

Define 𝐗=𝐊𝐂𝐊\mathbf{X}=\mathbf{K}\mathbf{C}\mathbf{K} and denote the objective by g(𝐗)=logdet(𝐗)+logdet(𝐈m+1m𝐗)+λTr(𝐊1𝐗)g(\mathbf{X})=-\log\det(\mathbf{X})+\log\det(\mathbf{I}_{m}+\frac{1}{m}\mathbf{X})+\lambda\operatorname{Tr}(\mathbf{K}^{-1}\mathbf{X}). Let 𝐇\mathbf{H} be an m×mm\times m symmetric matrix. By taking a directional derivative of the objective in the direction 𝐇\mathbf{H}, we find the first-order condition Tr[D𝐇g(𝐗)𝐇]=0,\operatorname{Tr}\left[D_{\mathbf{H}}g(\mathbf{X})\mathbf{H}\right]=0, with D𝐇g(𝐗)=𝐗1+(𝐗+m𝐈m)1+λ𝐊1.D_{\mathbf{H}}g(\mathbf{X})=-\mathbf{X}^{-1}+(\mathbf{X}+m\mathbf{I}_{m})^{-1}+\lambda\mathbf{K}^{-1}. Since this condition should be satisfied for all 𝐇\mathbf{H}, we simply solve the equation D𝐇g(𝐗)=0D_{\mathbf{H}}g(\mathbf{X})=0. Thanks to the invertibility of 𝐊\mathbf{K}, we have equivalently

𝐗2+m𝐗mλ𝐊=0.\mathbf{X}^{2}+m\mathbf{X}-\frac{m}{\lambda}\mathbf{K}=0.

A simple algebraic manipulation yields 𝐗=12((m2𝐈m+4m𝐊/λ)1/2m𝐈m)0\mathbf{X}_{\star}=\frac{1}{2}\left((m^{2}\mathbf{I}_{m}+4m\mathbf{K}/\lambda)^{1/2}-m\mathbf{I}_{m}\right)\succ 0. To check that 𝐗\mathbf{X}_{\star} is a minimum, it is sufficient to analyse the Hessian of the objective in any direction 𝐇\mathbf{H}. Let t0t\geq 0 be small enough that 𝐗+t𝐇0\mathbf{X}+t\mathbf{H}\succ 0. The second-order derivative of g(𝐗+t𝐇)g(\mathbf{X}+t\mathbf{H}) reads

d2dt2g(𝐗+t𝐇)|t=0=Tr(𝐇𝐗1𝐇𝐗1)Tr(𝐇(𝐗+m𝐈m)1𝐇(𝐗+m𝐈m)1).\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}g(\mathbf{X}+t\mathbf{H})|_{t=0}=\operatorname{Tr}(\mathbf{H}\mathbf{X}^{-1}\mathbf{H}\mathbf{X}^{-1})-\operatorname{Tr}(\mathbf{H}(\mathbf{X}+m\mathbf{I}_{m})^{-1}\mathbf{H}(\mathbf{X}+m\mathbf{I}_{m})^{-1}).

By using the first order condition 𝐗1=(𝐗+m𝐈m)1+λ𝐊1\mathbf{X}^{-1}_{\star}=(\mathbf{X}_{\star}+m\mathbf{I}_{m})^{-1}+\lambda\mathbf{K}^{-1}, we find

d2dt2g(𝐗+t𝐇)|t=0\displaystyle\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}g(\mathbf{X}_{\star}+t\mathbf{H})|_{t=0} =Tr(𝐇(λ𝐊1)𝐇(λ𝐊1+2(𝐗+m𝐈m)1))\displaystyle=\operatorname{Tr}\Big{(}\mathbf{H}(\lambda\mathbf{K}^{-1})\mathbf{H}\big{(}\lambda\mathbf{K}^{-1}+2(\mathbf{X}_{\star}+m\mathbf{I}_{m})^{-1}\big{)}\Big{)}
=Tr(𝐌𝐍𝐌).\displaystyle=\operatorname{Tr}\Big{(}\mathbf{M}\mathbf{N}\mathbf{M}\Big{)}.

with 𝐌=(λ𝐊1)1/2𝐇(λ𝐊1)1/2\mathbf{M}=(\lambda\mathbf{K}^{-1})^{1/2}\mathbf{H}(\lambda\mathbf{K}^{-1})^{1/2} and

𝐍=𝐈+2(λ𝐊1)1/2(𝐗+m𝐈m)1(λ𝐊1)1/20.\mathbf{N}=\mathbf{I}+2(\lambda\mathbf{K}^{-1})^{-1/2}(\mathbf{X}_{\star}+m\mathbf{I}_{m})^{-1}(\lambda\mathbf{K}^{-1})^{-1/2}\succ 0.

Notice that 𝐌𝐍𝐌0\mathbf{M}\mathbf{N}\mathbf{M}\succeq 0 since 𝐍0\mathbf{N}\succ 0 and 𝐌\mathbf{M} is symmetric. Therefore, d2dt2g(𝐗+t𝐇)|t=00\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}g(\mathbf{X}_{\star}+t\mathbf{H})|_{t=0}\geq 0. Since this is true for any direction 𝐇\mathbf{H}, the matrix 𝐗\mathbf{X}_{\star} is a local minimum of g(𝐗)g(\mathbf{X}). ∎
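As a sanity check, the closed form of Proposition 12 is easy to verify numerically. The following minimal sketch (NumPy/SciPy) builds a hypothetical Gaussian kernel matrix — the data, bandwidth and jitter below are illustrative only, not taken from the experiments — and checks that X_* = K C_* K solves the stationarity equation X² + mX − (m/λ)K = 0 and is positive definite.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
m, lam, sigma = 20, 0.1, 0.2                                   # hypothetical sizes and parameters
x = rng.uniform(size=m)                                        # hypothetical 1D point set
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2)) + 1e-10 * np.eye(m)

# X_* = (1/2)((m^2 I + 4 m K / lam)^{1/2} - m I) and C_* = K^{-1} X_* K^{-1}
X = 0.5 * (np.real(sqrtm(m**2 * np.eye(m) + 4 * m * K / lam)) - m * np.eye(m))
Kinv = np.linalg.inv(K)
C = Kinv @ X @ Kinv

print(np.linalg.norm(X @ X + m * X - (m / lam) * K))           # stationarity residual, ~ 0
print(np.all(np.linalg.eigvalsh(X) > 0))                       # X_* is positive definite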

While the exact solution of Proposition 12 is hard to interpret, the estimated correlation kernel tends to that of a projection DPP as the regularization parameter goes to zero. In what follows, the notation f(λ)g(λ)f(\lambda)\sim g(\lambda) means that f(λ)/g(λ)1f(\lambda)/g(\lambda)\to 1 in the indicated limit (λ0\lambda\to 0 or λ+\lambda\to+\infty).

Proposition 13 (Projection DPP in the low regularization limit).

Under the assumptions of Proposition 12, the correlation kernel 𝖪(λ)=SA(λ)S(SA(λ)S+𝕀)1,\mathsf{K}(\lambda)=SA_{\star}(\lambda)S^{*}(SA_{\star}(\lambda)S^{*}+\mathbb{I})^{-1}, with A(λ)=i,j=1m𝐂ij(λ)ϕ(xi)ϕ(xj)¯A_{\star}(\lambda)=\sum_{i,j=1}^{m}\mathbf{C}_{\star ij}(\lambda)\phi(x_{i})\otimes\overline{\phi(x_{j})} and C(λ)C_{\star}(\lambda) given in Proposition 12, has a range that is independent of λ\lambda. Furthermore, 𝖪(λ)\mathsf{K}(\lambda) converges to the projection operator onto that range when λ0\lambda\to 0. In particular, the integral kernel of A(λ)A_{\star}(\lambda) has the following asymptotic expansion

𝖺(x,y)(m/λ)1/2𝐤x𝐊3/2𝐤y as λ0,\mathsf{a}_{\star}(x,y)\sim{\color[rgb]{0,0,0}(m/\lambda)^{1/2}}\mathbf{k}_{x}^{\top}\mathbf{K}^{{\color[rgb]{0,0,0}-3/2}}\mathbf{k}_{y}\text{ as }\lambda\to 0,

pointwise, with 𝐤x=[k(x,x1),,k(x,xm)]\mathbf{k}_{x}=[k_{\mathcal{H}}(x,x_{1}),\dots,k_{\mathcal{H}}(x,x_{m})]^{\top}.

Proof.

As shown in Lemma 7, the non-zero eigenvalues of A(λ)A_{\star}(\lambda) are the eigenvalues of 𝐊𝐂(λ)\mathbf{K}\mathbf{C}_{\star}(\lambda) since 𝐊\mathbf{K} is non-singular. The eigenvalues of 𝐊𝐂(λ)\mathbf{K}\mathbf{C}_{\star}(\lambda) are σ(λ)=m2ς(1+4ςmλ1)\sigma_{\ell}(\lambda)=\frac{m}{2\varsigma_{\ell}}(\sqrt{1+4\frac{\varsigma_{\ell}}{m\lambda}}-1) where ς\varsigma_{\ell} is the \ell-th eigenvalue of 𝐊\mathbf{K} for 1m1\leq\ell\leq m. Thus, the operator A(λ)A_{\star}(\lambda) is of finite rank, i.e., A(λ)==1mσ(λ)vv¯A_{\star}(\lambda)=\sum_{\ell=1}^{m}\sigma_{\ell}(\lambda)v_{\ell}\otimes\overline{v_{\ell}}, where the normalized eigenvectors vv_{\ell} are independent of λ\lambda. The eigenvalues satisfy σ(λ)+\sigma_{\ell}(\lambda)\to+\infty as λ0\lambda\to 0, while limλ0σ(λ)λ=m/ς\lim_{\lambda\to 0}\sigma_{\ell}(\lambda)\sqrt{\lambda}=\sqrt{m/\varsigma_{\ell}} is finite and positive. Note that SA(λ)S==1mσ(λ)(Sv)(Sv)¯SA_{\star}(\lambda)S^{*}=\sum_{\ell=1}^{m}\sigma_{\ell}(\lambda)(Sv_{\ell})\otimes\overline{(Sv_{\ell})} is of finite rank and SvSv_{\ell} does not depend on λ\lambda. Hence, the operator

𝖪(λ)=λSA(λ)S(λSA(λ)S+λ𝕀)1,\mathsf{K}(\lambda)=\sqrt{\lambda}SA_{\star}(\lambda)S^{*}\left(\sqrt{\lambda}SA_{\star}(\lambda)S^{*}+\sqrt{\lambda}\mathbb{I}\right)^{-1},

converges to the projector on the range of SA(λ)SSA_{\star}(\lambda)S^{*} as λ0\lambda\to 0. ∎

On the other hand, as the regularization parameter goes to infinity, the integral kernel of 𝖠(λ)\mathsf{A}_{\star}(\lambda) behaves asymptotically as λ1\lambda^{-1} times an out-of-sample Nyström approximation.

Proposition 14 (Large regularization limit).

Under the assumptions of Proposition 12, the integral kernel of 𝖠(λ)\mathsf{A}_{\star}(\lambda) has the following asymptotic expansion

𝖺(x,y)λ1𝐤x𝐊1𝐤y as λ+,\mathsf{a}_{\star}(x,y)\sim\lambda^{-1}\mathbf{k}_{x}^{\top}\mathbf{K}^{-1}\mathbf{k}_{y}\text{ as }\lambda\to+\infty,

pointwise, with 𝐤x=[k(x,x1),,k(x,xm)]\mathbf{k}_{x}=[k_{\mathcal{H}}(x,x_{1}),\dots,k_{\mathcal{H}}(x,x_{m})]^{\top}.

The proof of this result is a simple consequence of the series expansion 1+4x=1+2x+O(x2)\sqrt{1+4x}=1+2x+O(x^{2}).
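Explicitly, by the reproducing property, the integral kernel of A_*(λ) is 𝖺_*(x,y)=⟨φ(x),A_*(λ)φ(y)⟩=k_x^⊤ C_*(λ) k_y, and applying the expansion to C_*(λ) from Proposition 12 gives, as λ→+∞,

\mathbf{C}_{\star}(\lambda)=\frac{m}{2}\mathbf{K}^{-2}\left(\Big(\mathbf{I}_{m}+\frac{4\mathbf{K}}{m\lambda}\Big)^{1/2}-\mathbf{I}_{m}\right)=\frac{m}{2}\mathbf{K}^{-2}\left(\frac{2\mathbf{K}}{m\lambda}+O(\lambda^{-2})\right)=\lambda^{-1}\mathbf{K}^{-1}+O(\lambda^{-2}),

which yields the claimed expansion 𝖺_*(x,y) ∼ λ^{-1} k_x^⊤ K^{-1} k_y.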

Appendix C Deferred proofs

C.1 Proof of Lemma 2

Proof of Lemma 2.

Since VVV^{*}V is the orthogonal projector onto the span of ϕ(zi)\phi(z_{i}), 1im1\leq i\leq m, we have

𝚽i𝐁¯𝚽j=VVϕ(xi),AVVϕ(xj)=ϕ(xi),Aϕ(xj).\bm{\Phi}_{i}^{\top}\bar{\mathbf{B}}\bm{\Phi}_{j}=\langle V^{*}V\phi(x_{i}),AV^{*}V\phi(x_{j})\rangle=\langle\phi(x_{i}),A\phi(x_{j})\rangle.

By Sylvester’s identity, and since VVop1\|V^{*}V\|_{op}\leq 1, it holds

logdet(𝐈m+𝐁¯)=logdet(𝕀+VAV)=logdet(𝕀+VVA)logdet(𝕀+A).\log\det(\mathbf{I}_{m}+\bar{\mathbf{B}})=\log\det(\mathbb{I}+VAV^{*})=\log\det(\mathbb{I}+V^{*}VA)\leq\log\det(\mathbb{I}+A).

Similarly, it holds that Tr(𝐁¯)=Tr(VVA)Tr(A)\operatorname{Tr}(\bar{\mathbf{B}})=\operatorname{Tr}(V^{*}VA)\leq\operatorname{Tr}(A). ∎

C.2 Approximation of the Fredholm determinant

Before giving the proof of Theorem 1, we recall a well-known and useful formula for the Fredholm determinant of an operator. Let MM be a trace class endomorphism of a separable Hilbert space. Denote by λ(M)\lambda_{\ell}(M) its (possibly infinitely many) eigenvalues for 1\ell\geq 1, including multiplicities. Then, we have

det(𝕀+M)=(1+λ(M)),\det(\mathbb{I}+M)=\prod_{\ell}(1+\lambda_{\ell}(M)),

where the (infinite) product is uniformly convergent; see Bornemann [2010, Eq. (3.1)] and references therein. Now, let 1\mathcal{F}_{1} and 2\mathcal{F}_{2} be two separable Hilbert spaces. Let T:12T:\mathcal{F}_{1}\to\mathcal{F}_{2} be such that TTT^{*}T is trace class. Note that TTT^{*}T is self-adjoint and positive semi-definite. Then, TTT^{*}T and TTTT^{*} have the same non-zero eigenvalues (Jacobson’s lemma) and Sylvester’s identity

det(𝕀+TT)=det(𝕀+TT),\det(\mathbb{I}+TT^{*})=\det(\mathbb{I}+T^{*}T),

holds. This can be seen as a consequence of the singular value decomposition of a compact operator; see Weidmann [1980, page 170, Theorem 7.6]. We now give the proof of Theorem 1.
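Before doing so, we note that Sylvester’s identity is easy to check numerically in finite dimensions; a minimal NumPy sketch, with a random rectangular matrix standing in for a finite-rank operator (illustration only, not used in the argument):

import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((6, 3))                      # T : R^3 -> R^6, a finite-rank stand-in
lhs = np.linalg.det(np.eye(6) + T @ T.T)             # det(I + T T^*)
rhs = np.linalg.det(np.eye(3) + T.T @ T)             # det(I + T^* T)
print(lhs, rhs, np.isclose(lhs, rhs))                # the two determinants coincide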

Proof of Theorem 1.

To begin, we recall two useful facts. First, if MM and NN are two trace class endomorphisms of the same Hilbert space, then logdet(𝕀+M)+logdet(𝕀+N)=logdet((𝕀+M)(𝕀+N))\log\det(\mathbb{I}+M)+\log\det(\mathbb{I}+N)=\log\det\left((\mathbb{I}+M)(\mathbb{I}+N)\right); see e.g. Simon [1977, Thm 3.8]. Second, if MM is a trace class endomorphism of a Hilbert space, we have (𝕀+M)1=𝕀M(M+𝕀)1(\mathbb{I}+M)^{-1}=\mathbb{I}-M(M+\mathbb{I})^{-1}, and thus, det((𝕀+M)1)\det((\mathbb{I}+M)^{-1}) is a well-defined Fredholm determinant since M(M+𝕀)1M(M+\mathbb{I})^{-1} is trace class. Now, define d=logdet(𝕀+SAS)d=\log\det(\mathbb{I}+SAS^{*}) and dn=logdet(𝐈n+SnASn)d_{n}=\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*}). Using Sylvester’s identity and the aforementioned facts, we write

dnd\displaystyle d_{n}-d =logdet(𝐈n+SnASn)logdet(𝕀+SAS)\displaystyle=\log\det(\mathbf{I}_{n}+S_{n}AS_{n}^{*})-\log\det(\mathbb{I}+SAS^{*})
=logdet(𝕀+A1/2SnSnA1/2)logdet(𝕀+A1/2SSA1/2) (Sylvester’s identity)\displaystyle=\log\det(\mathbb{I}+A^{1/2}S^{*}_{n}S_{n}A^{1/2})-\log\det(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})\text{ (Sylvester's identity) }
=logdet((𝕀+A1/2SnSnA1/2)(𝕀+A1/2SSA1/2)1).\displaystyle=\log\det\left((\mathbb{I}+A^{1/2}S^{*}_{n}S_{n}A^{1/2})(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1}\right). (18)

By a direct substitution, we verify that dnd=logdet(𝕀+En),d_{n}-d=\log\det\left(\mathbb{I}+E_{n}\right), with

En=(𝕀+A1/2SSA1/2)1/2A1/2(SnSnSS)A1/2(𝕀+A1/2SSA1/2)1/2.E_{n}=(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1/2}A^{1/2}(S_{n}^{*}S_{n}-S^{*}S)A^{1/2}(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1/2}.

Note that, in view of (18), det(𝕀+En)\det\left(\mathbb{I}+E_{n}\right) is a positive real number. Next, by using SnSnSSSnSnSSop𝕀S_{n}^{*}S_{n}-S^{*}S\preceq\|S_{n}^{*}S_{n}-S^{*}S\|_{op}\mathbb{I} and Sylvester’s identity, we find

dnd\displaystyle d_{n}-d logdet(𝕀+SnSnSSop(𝕀+A1/2SSA1/2)1/2A(𝕀+A1/2SSA1/2)1/2)\displaystyle\leq\log\det\left(\mathbb{I}+\|S_{n}^{*}S_{n}-S^{*}S\|_{op}(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1/2}A(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1/2}\right)
=logdet(𝕀+SnSnSSopA1/2(𝕀+A1/2SSA1/2)1A1/2)\displaystyle=\log\det\left(\mathbb{I}+\|S_{n}^{*}S_{n}-S^{*}S\|_{op}A^{1/2}(\mathbb{I}+A^{1/2}S^{*}SA^{1/2})^{-1}A^{1/2}\right)
logdet(𝕀+SnSnSSopA).\displaystyle\leq\log\det(\mathbb{I}+\|S_{n}^{*}S_{n}-S^{*}S\|_{op}A).

The latter bound is finite, since for MM a trace-class operator, we have |det(𝕀+M)|exp(M)|\det(\mathbb{I}+M)|\leq\exp(\|M\|_{\star}), where M\|M\|_{\star} is the trace (or nuclear) norm. By exchanging the roles of SnSnS_{n}^{*}S_{n} and SSS^{*}S, we also find

ddnlogdet(𝕀+SnSnSSopA)\displaystyle d-d_{n}\leq\log\det(\mathbb{I}+\|S_{n}^{*}S_{n}-S^{*}S\|_{op}A)

and thus, by combining the two cases, we find

|ddn|logdet(𝕀+SnSnSSopA).\displaystyle|d-d_{n}|\leq\log\det(\mathbb{I}+\|S_{n}^{*}S_{n}-S^{*}S\|_{op}A).

Now, in order to upper bound SSSnSnop\|S^{*}S-S_{n}^{*}S_{n}\|_{op} with high probability, we use the following Bernstein inequality for a sum of random operators; see Rudi et al. [2015, Proposition 12] and Minsker [2017].

Proposition 15 (Bernstein’s inequality for a sum of i.i.d. random operators).

Let \mathcal{H} be a separable Hilbert space and let X1,,XnX_{1},\dots,X_{n} be a sequence of independent and identically distributed self-adjoint random operators on \mathcal{H}. Assume that 𝔼X=0\mathbb{E}X=0 and that there exists a real number >0\ell>0 such that λmax(X)\lambda_{\max}(X)\leq\ell almost surely. Let Σ\Sigma be a trace class positive operator such that 𝔼(X2)Σ\mathbb{E}(X^{2})\preceq\Sigma. Then, for any δ(0,1)\delta\in(0,1),

λmax(1ni=1nXi)2β3n+2Σopβn,where β=log(2Tr(Σ)Σopδ),\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n}X_{i}\right)\leq\frac{2\ell\beta}{3n}+\sqrt{\frac{2\|\Sigma\|_{op}\beta}{n}},\quad\text{where }\beta=\log\left(\frac{2\operatorname{Tr}(\Sigma)}{\|\Sigma\|_{op}\delta}\right),

with probability 1δ1-\delta. If there further exists an \ell^{\prime} such that Xop\|X\|_{op}\leq\ell^{\prime} almost surely, then, for any δ(0,1/2)\delta\in(0,1/2),

1ni=1nXiop2β3n+2Σopβn,where β=log(2Tr(Σ)Σopδ),\left\|\frac{1}{n}\sum_{i=1}^{n}X_{i}\right\|_{op}\leq\frac{2\ell^{\prime}\beta}{3n}+\sqrt{\frac{2\|\Sigma\|_{op}\beta}{n}},\quad\text{where }\beta=\log\left(\frac{2\operatorname{Tr}(\Sigma)}{\|\Sigma\|_{op}\delta}\right),

holds with probability 12δ1-2\delta.
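As an illustration of Proposition 15 (not needed for the proof), the scaling of the bound can be checked in finite dimension with rank-one operators built from vectors drawn uniformly on a sphere; the choices of d, κ and δ below are arbitrary, and the choice Σ = κ²·E[v v^⊤] mirrors the bound used in the proof that follows.

import numpy as np

rng = np.random.default_rng(0)
d, kappa, delta = 20, 1.0, 0.05
Q = (kappa**2 / d) * np.eye(d)                     # E[v v^T] for v uniform on the radius-kappa sphere
Sigma_op, Sigma_tr = kappa**4 / d, kappa**4        # Sigma = kappa^2 Q
beta = np.log(2 * Sigma_tr / (Sigma_op * delta))
for n in (100, 1000, 10000):
    V = rng.standard_normal((n, d))
    V *= kappa / np.linalg.norm(V, axis=1, keepdims=True)      # rows v_i uniform on the sphere
    emp = V.T @ V / n - Q                                      # (1/n) sum_i X_i with X_i = v_i v_i^T - Q
    bound = 2 * (2 * kappa**2) * beta / (3 * n) + np.sqrt(2 * Sigma_op * beta / n)
    print(n, np.linalg.norm(emp, 2), bound)                    # empirical operator norm vs. bound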

To apply Proposition 15 and conclude the proof, first recall the expression of the covariance operator

C=SS=𝒳ϕ(x)ϕ(x)¯dμ(x),C=S^{*}S=\int_{\mathcal{X}}\phi(x)\otimes\overline{\phi(x)}\mathrm{d}\mu(x),

and of its sample version

SnSn=1ni=1nϕ(xi)ϕ(xi)¯.S^{*}_{n}S_{n}=\frac{1}{n}\sum_{i=1}^{n}\phi(x_{i})\otimes\overline{\phi(x_{i})}.

Define Xi=ϕ(xi)ϕ(xi)¯𝒳ϕ(x)ϕ(x)¯dμ(x)X_{i}=\phi(x_{i})\otimes\overline{\phi(x_{i})}-\int_{\mathcal{X}}\phi(x)\otimes\overline{\phi(x)}\mathrm{d}\mu(x). It is easy to check that 𝔼Xi=0\mathbb{E}X_{i}=0 since 𝔼[ϕ(xi)ϕ(xi)¯]=𝒳ϕ(x)ϕ(x)¯dμ(x)\mathbb{E}[\phi(x_{i})\otimes\overline{\phi(x_{i})}]=\int_{\mathcal{X}}\phi(x)\otimes\overline{\phi(x)}\mathrm{d}\mu(x). Then, using the triangle inequality and that supx𝒳k(x,x)κ2\sup_{x\in\mathcal{X}}k_{\mathcal{H}}(x,x)\leq\kappa^{2}, we find the bound

Xiop\displaystyle\|X_{i}\|_{op} =𝒳(ϕ(xi)ϕ(xi)¯ϕ(x)ϕ(x)¯)dμ(x)op\displaystyle=\left\|\int_{\mathcal{X}}\left(\phi(x_{i})\otimes\overline{\phi(x_{i})}-\phi(x)\otimes\overline{\phi(x)}\right)\mathrm{d}\mu(x)\right\|_{op}
𝒳(ϕ(xi)ϕ(xi)¯op+ϕ(x)ϕ(x)¯op)dμ(x)\displaystyle\leq\int_{\mathcal{X}}\left(\left\|\phi(x_{i})\otimes\overline{\phi(x_{i})}\right\|_{op}+\left\|\phi(x)\otimes\overline{\phi(x)}\right\|_{op}\right)\mathrm{d}\mu(x)
2κ2.\displaystyle\leq 2\kappa^{2}\triangleq\ell^{\prime}.

Next, we compute a bound on the variance by bounding the second moment, namely

𝔼(Xi2)\displaystyle\mathbb{E}(X_{i}^{2}) =𝔼[(ϕ(xi)ϕ(xi)¯)2](𝔼[ϕ(xi)ϕ(xi)¯])2\displaystyle=\mathbb{E}\left[\left(\phi(x_{i})\otimes\overline{\phi(x_{i})}\right)^{2}\right]-\left(\mathbb{E}\left[\phi(x_{i})\otimes\overline{\phi(x_{i})}\right]\right)^{2}
𝔼[(ϕ(xi)ϕ(xi)¯)2]\displaystyle\preceq\mathbb{E}\left[\left(\phi(x_{i})\otimes\overline{\phi(x_{i})}\right)^{2}\right]
κ2𝔼[ϕ(xi)ϕ(xi)¯]Σ,\displaystyle\preceq\kappa^{2}\mathbb{E}\left[\phi(x_{i})\otimes\overline{\phi(x_{i})}\right]\triangleq\Sigma,

where we used (ϕ(xi)ϕ(xi)¯)2=k(xi,xi)(ϕ(xi)ϕ(xi)¯)\left(\phi(x_{i})\otimes\overline{\phi(x_{i})}\right)^{2}=k_{\mathcal{H}}(x_{i},x_{i})\left(\phi(x_{i})\otimes\overline{\phi(x_{i})}\right). Also, we have

Tr(Σ)=Tr(κ2𝔼[ϕ(x)ϕ(x)¯])=κ2𝒳k(x,x)dμ(x)κ4.\operatorname{Tr}(\Sigma)=\operatorname{Tr}\left(\kappa^{2}\mathbb{E}\left[\phi(x)\otimes\overline{\phi(x)}\right]\right)=\kappa^{2}\int_{\mathcal{X}}k_{\mathcal{H}}(x,x)\mathrm{d}\mu(x)\leq\kappa^{4}.

Moreover,

Σop=κ2𝔼[ϕ(x)ϕ(x)¯]op=κ2λmax(SS)=κ2λmax(𝖳k),\|\Sigma\|_{op}=\kappa^{2}\left\|\mathbb{E}\left[\phi(x)\otimes\overline{\phi(x)}\right]\right\|_{op}=\kappa^{2}\lambda_{\max}(S^{*}S)=\kappa^{2}\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}}),

where we used that SSS^{*}S and the integral operator 𝖳k\mathsf{T}_{k_{\mathcal{H}}} on L2(𝒳)L^{2}(\mathcal{X}) with integral kernel kk have the same non-zero eigenvalues. We are finally ready to apply Proposition 15. Since

βlog(2κ2λmax(𝖳k)δ),\beta\leq\log\left(\frac{2\kappa^{2}}{\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}})\delta}\right),

it holds, with probability at least 12δ1-2\delta,

SSSnSnop4κ2log(2κ2λmax(𝖳k)δ)3n+2κ2λmax(𝖳k)log(2κ2λmax(𝖳k)δ)n.\|S^{*}S-S^{*}_{n}S_{n}\|_{op}\leq\frac{4\kappa^{2}\log\left(\frac{2\kappa^{2}}{\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}})\delta}\right)}{3n}+\sqrt{\frac{2\kappa^{2}\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}})\log\left(\frac{2\kappa^{2}}{\lambda_{\max}(\mathsf{T}_{k_{\mathcal{H}}})\delta}\right)}{n}}.

This concludes the proof of Theorem 1. ∎

C.3 Statistical guarantee for the approximation of the log-likelihood by its sampled version

Proof of Theorem 4.

The proof follows lines similar to Rudi et al. [2020, Proof of Thm 5], with several adaptations. Let 𝐁𝒮+(m)\mathbf{B}_{\star}\in\mathcal{S}_{+}(\mathbb{R}^{m}) be the solution of (9). Notice that AA_{\star} and V𝐁VV^{*}\mathbf{B}_{\star}V are distinct operators. Let 𝐁¯=VAV𝒮+(m)\bar{\mathbf{B}}=VA_{\star}V^{*}\in\mathcal{S}_{+}(\mathbb{R}^{m}) as in Lemma 2. Since 𝐁\mathbf{B}_{\star} is optimal for (9), we have

fn(V𝐁V)+Tr(λ𝐁)fn(V𝐁¯V)+Tr(λ𝐁¯).f_{n}(V^{*}\mathbf{B}_{\star}V)+\operatorname{Tr}(\lambda\mathbf{B}_{\star})\leq f_{n}(V^{*}\bar{\mathbf{B}}V)+\operatorname{Tr}(\lambda\bar{\mathbf{B}}).

Now we use the properties of 𝐁¯\bar{\mathbf{B}} given in Lemma 2, namely that fn(A)=fn(V𝐁¯V)f_{n}(A_{\star})=f_{n}(V^{*}\bar{\mathbf{B}}V) and Tr(λ𝐁¯)Tr(λA)\operatorname{Tr}(\lambda\bar{\mathbf{B}})\leq\operatorname{Tr}(\lambda A_{\star}). Then it holds

fn(V𝐁¯V)+Tr(λ𝐁¯)fn(A)+Tr(λA).f_{n}(V^{*}\bar{\mathbf{B}}V)+\operatorname{Tr}(\lambda\bar{\mathbf{B}})\leq f_{n}(A_{\star})+\operatorname{Tr}(\lambda A_{\star}).

By combining the last two inequalities, we have fn(V𝐁V)+Tr(λ𝐁)fn(A)+Tr(λA)f_{n}(V^{*}\mathbf{B}_{\star}V)+\operatorname{Tr}(\lambda\mathbf{B}_{\star})\leq f_{n}(A_{\star})+\operatorname{Tr}(\lambda A_{\star}) and therefore

fn(V𝐁V)fn(A)Tr(λA)Tr(λ𝐁).f_{n}(V^{*}\mathbf{B}_{\star}V)-f_{n}(A_{\star})\leq\operatorname{Tr}(\lambda A_{\star})-\operatorname{Tr}(\lambda\mathbf{B}_{\star}). (19)

We will use (19) to derive upper and lower bounds on the gap f(A)fn(V𝐁V)f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V).

Lower bound.

Using Theorem 1, we have with high probability

|f(A)fn(A)|logdet(𝕀+cnA),|f(A_{\star})-f_{n}(A_{\star})|\leq\log\det(\mathbb{I}+c_{n}A_{\star}),

and, in particular, this gives

f(A)fn(A)Tr(cnA).f(A_{\star})-f_{n}(A_{\star})\leq\operatorname{Tr}(c_{n}A_{\star}).

By combining this last inequality with (19) and by using Tr(λ𝐁)0\operatorname{Tr}(\lambda\mathbf{B}_{\star})\geq 0, we have the lower bound

Δ=f(A)fn(V𝐁V)\displaystyle\Delta=f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V) =f(A)fn(A)+fn(A)fn(V𝐁V)\displaystyle=f(A_{\star})-f_{n}(A_{\star})+f_{n}(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V)
Tr(cnA)+Tr(λ𝐁)Tr(λA)\displaystyle\geq-\operatorname{Tr}(c_{n}A_{\star})+\operatorname{Tr}(\lambda\mathbf{B}_{\star})-\operatorname{Tr}(\lambda A_{\star}) (20)
(cn+λ)Tr(A).\displaystyle\geq-(c_{n}+\lambda)\operatorname{Tr}(A_{\star}).
Upper bound.

We have the bound with high probability (for the same event as above)

Δ=f(A)fn(V𝐁V)\displaystyle\Delta=f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V) =f(A)f(V𝐁V)0+f(V𝐁V)fn(V𝐁V)\displaystyle=\underbrace{f(A_{\star})-f(V^{*}\mathbf{B}_{\star}V)}_{\leq 0}+f(V^{*}\mathbf{B}_{\star}V)-f_{n}(V^{*}\mathbf{B}_{\star}V)
logdet(𝕀+cnV𝐁V) (Theorem 1)\displaystyle\leq\log\det(\mathbb{I}+c_{n}V^{*}\mathbf{B}_{\star}V)\text{ \quad(Theorem~{}\ref{thm:formal_Fredholm})}
Tr(cnVV𝐁) (cyclicity)\displaystyle\leq\operatorname{Tr}(c_{n}VV^{*}\mathbf{B}_{\star})\text{ \quad(cyclicity)}
Tr(cn𝐁). (since VV is a projector)\displaystyle\leq\operatorname{Tr}(c_{n}\mathbf{B}_{\star}).\text{ \quad(since $VV^{*}$ is a projector)} (21)

By combining this with (20), we find

Tr(cnA)+Tr(λ𝐁)Tr(λA)Tr(cn𝐁).-\operatorname{Tr}(c_{n}A_{\star})+\operatorname{Tr}(\lambda\mathbf{B}_{\star})-\operatorname{Tr}(\lambda A_{\star})\leq\operatorname{Tr}(c_{n}\mathbf{B}_{\star}).

Since, by assumption, we have cnλcnc_{n}\leq\lambda-c_{n}, the latter inequality becomes

cnTr(𝐁)(cn+λ)Tr(A).c_{n}\operatorname{Tr}(\mathbf{B}_{\star})\leq(c_{n}+\lambda)\operatorname{Tr}(A_{\star}). (22)

By using (22) in the bound in (21), we obtain

f(A)fn(V𝐁V)(cn+λ)Tr(A),f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V)\leq(c_{n}+\lambda)\operatorname{Tr}(A_{\star}),

Thus, the upper and lower bounds together yield

|f(A)fn(V𝐁V)|(cn+λ)Tr(A)(λ2+λ)Tr(A),|f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V)|\leq(c_{n}+\lambda)\operatorname{Tr}(A_{\star})\leq\left(\frac{\lambda}{2}+\lambda\right)\operatorname{Tr}(A_{\star}),

where we used once more 2cnλ2c_{n}\leq\lambda. This is the desired result. ∎

Proof of Corollary 5.

We use the triangle inequality

|f(A)f(V𝐁V)||f(A)fn(V𝐁V)|+|fn(V𝐁V)f(V𝐁V)|.|f(A_{\star})-f(V^{*}\mathbf{B}_{\star}V)|\leq|f(A_{\star})-f_{n}(V^{*}\mathbf{B}_{\star}V)|+|f_{n}(V^{*}\mathbf{B}_{\star}V)-f(V^{*}\mathbf{B}_{\star}V)|.

The first term is upper bounded with high probability by Theorem 4. The second term is bounded by Theorem 1 as follows

|fn(V𝐁V)f(V𝐁V)|Tr(cn𝐁)|f_{n}(V^{*}\mathbf{B}_{\star}V)-f(V^{*}\mathbf{B}_{\star}V)|\leq\operatorname{Tr}(c_{n}\mathbf{B}_{\star})

with Tr(cn𝐁)32λTr(A)\operatorname{Tr}(c_{n}\mathbf{B}_{\star})\leq\frac{3}{2}\lambda\operatorname{Tr}(A_{\star}) as in (22) in the proof of Theorem 4. ∎

C.4 Numerical approach: convergence study

Proof of Theorem 3.

This proof follows the same technique as in Mariet and Sra [2015], with some extensions. Let 𝚺=𝐗1\mathbf{\Sigma}=\mathbf{X}^{-1}. We decompose the objective

g(𝚺1)\displaystyle g(\mathbf{\Sigma}^{-1}) =logdet(𝐈m+1||𝚺1)1s=1slogdet(𝐔𝚺1𝐔)+λTr(𝚺1𝐊1)\displaystyle=\log\det\left(\mathbf{I}_{m}+\frac{1}{|\mathcal{I}|}\mathbf{\Sigma}^{-1}\right)_{\mathcal{I}\mathcal{I}}-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{U}_{\ell}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}_{\ell})+\lambda\operatorname{Tr}(\mathbf{\Sigma}^{-1}\mathbf{K}^{-1})

as the following sum: g(𝚺1)=h1(𝚺)+h2(𝚺)g(\mathbf{\Sigma}^{-1})=h_{1}(\mathbf{\Sigma})+h_{2}(\mathbf{\Sigma}), where h1(𝚺)=logdet(𝚺)+λTr(𝚺1𝐊1)h_{1}(\mathbf{\Sigma})=-\log\det(\mathbf{\Sigma})+\lambda\operatorname{Tr}(\mathbf{\Sigma}^{-1}\mathbf{K}^{-1}) is a strictly convex function on 𝚺0\mathbf{\Sigma}\succ 0 and

h2(𝚺)=logdet(𝚺)+logdet(𝐈m+1||𝚺1)1s=1slogdet(𝐔𝚺1𝐔)h_{2}(\mathbf{\Sigma})=\log\det(\mathbf{\Sigma})+\log\det\left(\mathbf{I}_{m}+\frac{1}{|\mathcal{I}|}\mathbf{\Sigma}^{-1}\right)_{\mathcal{I}\mathcal{I}}-\frac{1}{s}\sum_{\ell=1}^{s}\log\det(\mathbf{U}_{\ell}^{\top}\mathbf{\Sigma}^{-1}\mathbf{U}_{\ell})

is concave on 𝚺0\mathbf{\Sigma}\succ 0. We refer to Lemmas 8 and 9 for the former and latter statements, respectively. Then, we use the concavity of h2h_{2} to write the following upper bound

h2(𝚺)h2(𝐘)+Tr(h2(𝐘)(𝚺𝐘)),h_{2}(\mathbf{\Sigma})\leq h_{2}(\mathbf{Y})+\operatorname{Tr}\Big{(}\nabla h_{2}(\mathbf{Y})(\mathbf{\Sigma}-\mathbf{Y})\Big{)},

where the matrix-valued gradient is

h2(𝐘)\displaystyle\nabla h_{2}(\mathbf{Y}) =𝐘1𝐘1𝐔(||𝐈||+𝐔𝐘1𝐔)1𝐔𝐘1\displaystyle=\mathbf{Y}^{-1}-\mathbf{Y}^{-1}\mathbf{U}_{\mathcal{I}}\left(|\mathcal{I}|\mathbf{I}_{|\mathcal{I}|}+\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{Y}^{-1}\mathbf{U}_{\mathcal{I}}\right)^{-1}\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{Y}^{-1}
+1s=1s𝐘1𝐔(𝐔𝐘1𝐔)1𝐔𝐘1.\displaystyle+\frac{1}{s}\sum_{\ell=1}^{s}\mathbf{Y}^{-1}\mathbf{U}_{\ell}\left(\mathbf{U}_{\ell}^{\top}\mathbf{Y}^{-1}\mathbf{U}_{\ell}\right)^{-1}\mathbf{U}_{\ell}^{\top}\mathbf{Y}^{-1}.

Define the auxiliary function ξ(𝚺,𝐘)h1(𝚺)+h2(𝐘)+Tr(h2(𝐘)(𝚺𝐘))\xi(\mathbf{\Sigma},\mathbf{Y})\triangleq h_{1}(\mathbf{\Sigma})+h_{2}(\mathbf{Y})+\operatorname{Tr}\left(\nabla h_{2}(\mathbf{Y})(\mathbf{\Sigma}-\mathbf{Y})\right), which satisfies g(𝚺1)ξ(𝚺,𝐘)g(\mathbf{\Sigma}^{-1})\leq\xi(\mathbf{\Sigma},\mathbf{Y}) and g(𝚺1)=ξ(𝚺,𝚺)g(\mathbf{\Sigma}^{-1})=\xi(\mathbf{\Sigma},\mathbf{\Sigma}). We define the iteration 𝐗k=𝚺k1\mathbf{X}_{k}=\mathbf{\Sigma}_{k}^{-1} where

𝚺k+1=argmin𝚺0ξ(𝚺,𝚺k),\mathbf{\Sigma}_{k+1}=\arg\min_{\mathbf{\Sigma}\succ 0}\xi(\mathbf{\Sigma},\mathbf{\Sigma}_{k}),

so that it holds g(𝚺k+11)ξ(𝚺k+1,𝚺k)ξ(𝚺k,𝚺k)=g(𝚺k1)g(\mathbf{\Sigma}^{-1}_{k+1})\leq\xi(\mathbf{\Sigma}_{k+1},\mathbf{\Sigma}_{k})\leq\xi(\mathbf{\Sigma}_{k},\mathbf{\Sigma}_{k})=g(\mathbf{\Sigma}^{-1}_{k}). Thus, the iteration produces a non-increasing sequence of objective values. It remains to show that this iteration corresponds to (13). The solution of the above minimization problem can be obtained by solving the first-order optimality condition since ξ(,𝐘)\xi(\cdot,\mathbf{Y}) is strictly convex. This gives

𝚺1λ𝚺1𝐊1𝚺1+𝚺k1𝚺k1𝐔(||𝐈||+𝐔𝚺k1𝐔)1𝐔𝚺k1\displaystyle-\mathbf{\Sigma}^{-1}-\lambda\mathbf{\Sigma}^{-1}\mathbf{K}^{-1}\mathbf{\Sigma}^{-1}+\mathbf{\Sigma}^{-1}_{k}-\mathbf{\Sigma}^{-1}_{k}\mathbf{U}_{\mathcal{I}}\left(|\mathcal{I}|\mathbf{I}_{|\mathcal{I}|}+\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{\Sigma}^{-1}_{k}\mathbf{U}_{\mathcal{I}}\right)^{-1}\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{\Sigma}^{-1}_{k}
+1s=1s𝚺k1𝐔(𝐔𝚺k1𝐔)1𝐔𝚺k1=0.\displaystyle+\frac{1}{s}\sum_{\ell=1}^{s}\mathbf{\Sigma}^{-1}_{k}\mathbf{U}_{\ell}\left(\mathbf{U}_{\ell}^{\top}\mathbf{\Sigma}^{-1}_{k}\mathbf{U}_{\ell}\right)^{-1}\mathbf{U}_{\ell}^{\top}\mathbf{\Sigma}^{-1}_{k}=0.

Now, we replace 𝐗=𝚺1\mathbf{X}=\mathbf{\Sigma}^{-1} in the above condition. After a simple algebraic manipulation, we obtain the following condition

𝐗+λ𝐗𝐊1𝐗=p(𝐗k),\mathbf{X}+\lambda\mathbf{X}\mathbf{K}^{-1}\mathbf{X}=p(\mathbf{X}_{k}),

where, as defined in (13), p(𝐗)=𝐗+𝐗𝚫𝐗p(\mathbf{X})=\mathbf{X}+\mathbf{X}\mathbf{\Delta}\mathbf{X} and

𝚫=1s=1s𝐔𝐗𝒞𝒞1𝐔𝐔(||𝐈||+𝐔𝐗𝐔)1𝐔.\mathbf{\Delta}=\frac{1}{s}\sum_{\ell=1}^{s}\mathbf{U}_{\ell}\mathbf{X}_{\mathcal{C}_{\ell}\mathcal{C}_{\ell}}^{-1}\mathbf{U}_{\ell}^{\top}-\mathbf{U}_{\mathcal{I}}\left(|\mathcal{I}|\mathbf{I}_{|\mathcal{I}|}+\mathbf{U}_{\mathcal{I}}^{\top}\mathbf{X}\mathbf{U}_{\mathcal{I}}\right)^{-1}\mathbf{U}_{\mathcal{I}}^{\top}.

Finally, we introduce the Cholesky decomposition 𝐊=𝐑𝐑\mathbf{K}=\mathbf{R}^{\top}\mathbf{R}, so that we have an equivalent identity

𝐑1𝐗𝐑1+λ(𝐑1𝐗𝐑1)2𝐑1p(𝐗k)𝐑1=0.\mathbf{R}^{-1\top}\mathbf{X}\mathbf{R}^{-1}+\lambda\left(\mathbf{R}^{-1\top}\mathbf{X}\mathbf{R}^{-1}\right)^{2}-\mathbf{R}^{-1\top}p(\mathbf{X}_{k})\mathbf{R}^{-1}=0.

Let 𝐗=𝐑1𝐗𝐑1\mathbf{X}^{\prime}=\mathbf{R}^{-1\top}\mathbf{X}\mathbf{R}^{-1} and p(𝐗k)=𝐑1p(𝐗k)𝐑1p^{\prime}(\mathbf{X}_{k})=\mathbf{R}^{-1\top}p(\mathbf{X}_{k})\mathbf{R}^{-1}. The positive definite solution of this second-order matrix equation reads

𝐗=𝐈m+(𝐈m+4λp(𝐗k))1/22λ,\mathbf{X}^{\prime}=\frac{-\mathbf{I}_{m}+\left(\mathbf{I}_{m}+4\lambda p^{\prime}(\mathbf{X}_{k})\right)^{1/2}}{2\lambda},

which directly yields (13). ∎
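For concreteness, one update of the regularized Picard iteration can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration of (13)–(14), assuming the index sets 𝒞_ℓ and ℐ are given as lists of integer positions into the m points of 𝒵 and that 𝐊 is positive definite; it is not the reference implementation of Algorithm 1.

import numpy as np
from scipy.linalg import sqrtm

def picard_step(X, K, C_sets, I_set, lam):
    """One regularized Picard update X_k -> X_{k+1}, following (13)-(14) (sketch)."""
    m = X.shape[0]
    s = len(C_sets)
    # Delta = (1/s) sum_l U_l X_{C_l C_l}^{-1} U_l^T - U_I (|I| I + X_{II})^{-1} U_I^T
    Delta = np.zeros((m, m))
    for C in C_sets:
        Delta[np.ix_(C, C)] += np.linalg.inv(X[np.ix_(C, C)]) / s
    I = list(I_set)
    Delta[np.ix_(I, I)] -= np.linalg.inv(len(I) * np.eye(len(I)) + X[np.ix_(I, I)])
    P = X + X @ Delta @ X                        # p(X_k)
    R = np.linalg.cholesky(K).T                  # K = R^T R with R upper triangular
    Rinv = np.linalg.inv(R)
    Pp = Rinv.T @ P @ Rinv                       # p'(X_k) = R^{-T} p(X_k) R^{-1}
    Xp = (-np.eye(m) + np.real(sqrtm(np.eye(m) + 4 * lam * Pp))) / (2 * lam)
    return R.T @ Xp @ R                          # map back: X_{k+1} = R^T X' R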

C.5 Approximation of the correlation kernel

We start by proving the following useful result.

Theorem 16 (Correlation kernel approximation, formal version).

Let δ(0,1]\delta\in(0,1] be the failure probability and let γ>0\gamma>0 be a scale factor. Let 𝖪^(γ)\hat{\mathsf{K}}(\gamma) be the approximation defined in (11) with i.i.d. sampling of pp points. Let pp be large enough so that t(p)<1t(p)<1 with t(p)=4c2β3γp+2c2βγpt(p)=\frac{4c^{2}\beta}{3\gamma p}+\sqrt{\frac{2c^{2}\beta}{\gamma p}} where c2=κ2Aopc^{2}=\kappa^{2}\|A\|_{op} and β=log(4deff(γ)δ𝖪(γ)op)\beta=\log\left(\frac{4d_{\rm eff}(\gamma)}{\delta\|\mathsf{K}(\gamma)\|_{op}}\right). Then, with probability 1δ1-\delta, it holds that

11+t(p)𝖪(γ)𝖪^(γ)11t(p)𝖪(γ).\frac{1}{1+t(p)}\mathsf{K}(\gamma)\preceq\hat{\mathsf{K}}(\gamma)\preceq\frac{1}{1-t(p)}\mathsf{K}(\gamma).

Furthermore, if we assume γλmax(𝖠)\gamma\leq\lambda_{\max}(\mathsf{A}), we can take β=log(8deff(γ)δ)8deff(γ)δ.\beta=\log\left(\frac{8d_{\rm eff}(\gamma)}{\delta}\right)\leq\frac{8d_{\rm eff}(\gamma)}{\delta}.

Proof.

For simplicity, define Ψ:m\Psi:\mathcal{H}\to\mathbb{R}^{m} as Ψ=m𝚲Sm\Psi=\sqrt{m}\mathbf{\Lambda}S_{m}, which is such that A=ΨΨA_{\star}=\Psi^{*}\Psi. Then, we write

𝖪^=SΨ(ΨSpSpΨ+γ𝐈m)1ΨS,\hat{\mathsf{K}}=S\Psi^{*}(\Psi S^{*}_{p}S_{p}\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\Psi S^{*},

where we recall that SpSp=1pi=1pϕ(xi′′)ϕ(xi′′)¯S^{*}_{p}S_{p}=\frac{1}{p}\sum_{i=1}^{p}\phi(x^{\prime\prime}_{i})\otimes\overline{\phi(x^{\prime\prime}_{i})}. Next, we use the following result.

Proposition 17 (Proposition 5 in Rudi et al. [2018] with minor adaptations).

Let γ>0\gamma>0 and v1,,vpv_{1},\dots,v_{p} with p1p\geq 1 be identically distributed random vectors on a separable Hilbert space HH, such that there exists c2>0c^{2}>0 for which vH2c2\|v\|_{H}^{2}\leq c^{2} almost surely. Denote by QQ the Hermitian operator Q=1pi=1p𝔼[vivi¯]Q=\frac{1}{p}\sum_{i=1}^{p}\mathbb{E}[v_{i}\otimes\overline{v_{i}}]. Let Qp=1pi=1pvivi¯Q_{p}=\frac{1}{p}\sum_{i=1}^{p}v_{i}\otimes\overline{v_{i}}. Then for any δ(0,1]\delta\in(0,1], the following holds

(Q+γ𝕀)1/2(QQp)(Q+γ𝕀)1/2op4c2β3γp+2c2βγp,\|(Q+\gamma\mathbb{I})^{-1/2}(Q-Q_{p})(Q+\gamma\mathbb{I})^{-1/2}\|_{op}\leq\frac{4c^{2}\beta}{3\gamma p}+\sqrt{\frac{2c^{2}\beta}{\gamma p}},

with probability 1δ1-\delta and β=log4Tr(Q(Q+γ𝕀)1)δQ(Q+γ𝕀)1op8c2/Qop+Tr(Q(Q+γ𝕀)1)δ\beta=\log\frac{4\operatorname{Tr}\left(Q(Q+\gamma\mathbb{I})^{-1}\right)}{\delta\|Q(Q+\gamma\mathbb{I})^{-1}\|_{op}}\leq 8\frac{c^{2}/\|Q\|_{op}+\operatorname{Tr}\left(Q(Q+\gamma\mathbb{I})^{-1}\right)}{\delta}.

Then, we merely define the following vector 𝐯i=Ψϕ(xi′′)\mathbf{v}_{i}=\Psi\phi(x^{\prime\prime}_{i}) for 1ip1\leq i\leq p so that

𝐐p=ΨSpSpΨ=1pi=1pΨ(ϕ(xi′′)ϕ(xi′′)¯)Ψ.\mathbf{Q}_{p}=\Psi S_{p}^{*}S_{p}\Psi^{*}=\frac{1}{p}\sum_{i=1}^{p}\Psi\left(\phi(x^{\prime\prime}_{i})\otimes\overline{\phi(x^{\prime\prime}_{i})}\right)\Psi^{*}.

Furthermore, we define 𝐐=ΨSSΨ\mathbf{Q}=\Psi S^{*}S\Psi^{*}. Also, we have SS=𝒳ϕ(x)ϕ(x)¯dμ(x)S^{*}S=\int_{\mathcal{X}}\phi(x)\otimes\overline{\phi(x)}\mathrm{d}\mu(x), so that 𝔼[SpSp]=SS\mathbb{E}[S^{*}_{p}S_{p}]=S^{*}S. Hence, it holds 𝔼[𝐐p]=𝐐\mathbb{E}[\mathbf{Q}_{p}]=\mathbf{Q}. First, by using ΨΨΨΨop𝕀\Psi^{*}\Psi\preceq\|\Psi^{*}\Psi\|_{op}\mathbb{I} , we have

𝐯22=ϕ(x),ΨΨϕ(x)k(x,x)ΨΨopκ2ΨΨop=κ2Aop,\|\mathbf{v}\|^{2}_{2}=\langle\phi(x),\Psi^{*}\Psi\phi(x)\rangle\leq k_{\mathcal{H}}(x,x)\|\Psi^{*}\Psi\|_{op}\leq\kappa^{2}\|\Psi^{*}\Psi\|_{op}=\kappa^{2}\|A_{\star}\|_{op},

almost surely. Next, we calculate the following quantity

Tr[𝐐(𝐐+γ𝐈m)1]\displaystyle\operatorname{Tr}\left[\mathbf{Q}(\mathbf{Q}+\gamma\mathbf{I}_{m})^{-1}\right] =Tr[ΨSSΨ(ΨSSΨ+γ𝐈m)1]\displaystyle=\operatorname{Tr}\left[\Psi S^{*}S\Psi^{*}(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\right]
=Tr[SΨ(ΨSSΨ+γ𝐈m)1ΨS]\displaystyle=\operatorname{Tr}\left[S\Psi^{*}(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\Psi S^{*}\right]
=Tr[SΨΨS(SΨΨS+γ𝕀)1]\displaystyle=\operatorname{Tr}\left[S\Psi^{*}\Psi S^{*}(S\Psi^{*}\Psi S^{*}+\gamma\mathbb{I})^{-1}\right]
=Tr[𝖠(𝖠+γ𝕀)1],\displaystyle=\operatorname{Tr}\left[\mathsf{A}(\mathsf{A}+\gamma\mathbb{I})^{-1}\right],

where we used the push-through identity at the next to last equality.

To obtain the bound on β\beta, we first write

𝐐(𝐐+γ𝐈m)1op=𝖠(𝖠+γ𝕀)1op=(1+γ/λmax(𝖠))1.\|\mathbf{Q}(\mathbf{Q}+\gamma\mathbf{I}_{m})^{-1}\|_{op}=\|\mathsf{A}(\mathsf{A}+\gamma\mathbb{I})^{-1}\|_{op}=(1+\gamma/\lambda_{\max}(\mathsf{A}))^{-1}.

To lower bound the latter quantity, we require γλmax(𝖠)\gamma\leq\lambda_{\max}(\mathsf{A}), and hence 𝐐(𝐐+γ𝐈m)1op1/2\|\mathbf{Q}(\mathbf{Q}+\gamma\mathbf{I}_{m})^{-1}\|_{op}\geq 1/2. In the remainder of the proof, we show the main matrix inequality. For convenience, define the upper bound in Proposition 17 as

t(p)=4c2β3γp+2c2βγp.t(p)=\frac{4c^{2}\beta}{3\gamma p}+\sqrt{\frac{2c^{2}\beta}{\gamma p}}. (23)

Thanks to Proposition 17, we know that with probability 1δ1-\delta, we have

t(ΨSSΨ+γ𝐈m)ΨSSΨΨSpSpΨt(ΨSSΨ+γ𝐈m),-t\left(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m}\right)\preceq\Psi S^{*}S\Psi^{*}-\Psi S_{p}^{*}S_{p}\Psi^{*}\preceq t\left(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m}\right),

or equivalently

ΨSSΨt(ΨSSΨ+γ𝐈m)ΨSpSpΨΨSSΨ+t(ΨSSΨ+γ𝐈m).\Psi S^{*}S\Psi^{*}-t(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})\preceq\Psi S_{p}^{*}S_{p}\Psi^{*}\preceq\Psi S^{*}S\Psi^{*}+t(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m}).

By simply adding γ𝐈m\gamma\mathbf{I}_{m} to these inequalities, we obtain

(1t)(ΨSSΨ+γ𝐈m)ΨSpSpΨ+γ𝐈m(1+t)(ΨSSΨ+γ𝐈m).(1-t)(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})\preceq\Psi S_{p}^{*}S_{p}\Psi^{*}+\gamma\mathbf{I}_{m}\preceq(1+t)(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m}).

Hence, if t<1t<1, by a simple manipulation, we find

(1+t)1(ΨSSΨ+γ𝐈m)1(ΨSpSpΨ+γ𝐈m)1(1t)1(ΨSSΨ+γ𝐈m)1.(1+t)^{-1}(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\preceq(\Psi S^{*}_{p}S_{p}\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\preceq(1-t)^{-1}(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}.

By acting with SΨS\Psi^{*} on the left and ΨS\Psi S^{*} on the right, and then, using the push-through identity

SΨ(ΨSSΨ+γ𝐈m)1ΨS=(SΨΨS+γ𝕀)1SΨΨS,S\Psi^{*}(\Psi S^{*}S\Psi^{*}+\gamma\mathbf{I}_{m})^{-1}\Psi S^{*}=(S\Psi^{*}\Psi S^{*}+\gamma\mathbb{I})^{-1}S\Psi^{*}\Psi S^{*},

the desired result follows. ∎

We can now prove Theorem 6 by simplifying some of the bounds given in Theorem 16.

Proof of Theorem 6.

Consider the upper bound given in (23). We will simplify it to capture the dominant behavior as p+p\to+\infty. Assume 2c2βγp<1\sqrt{\frac{2c^{2}\beta}{\gamma p}}<1, or equivalently p>2c2βγ.p>\frac{2c^{2}\beta}{\gamma}. In this case, we give a simple upper bound on t(p)t(p) as follows

t(p)<(23+1)2c2βγp<8c2βγp,t(p)<\left(\frac{2}{3}+1\right)\sqrt{\frac{2c^{2}\beta}{\gamma p}}<\sqrt{\frac{8c^{2}\beta}{\gamma p}},

so that we avoid manipulating cumbersome expressions. Thus, if we want the latter bound to be smaller than ϵ(0,1)\epsilon\in(0,1), we require

p8c2βγϵ2,p\geq\frac{8c^{2}\beta}{\gamma\epsilon^{2}},

which is indeed larger than 2c2βγ\frac{2c^{2}\beta}{\gamma} since 1/ϵ>11/\epsilon>1. Thus, by using the same arguments as in the proof of Theorem 16, we have the multiplicative error bound

11+ϵ𝖪(γ)𝖪^(γ)11ϵ𝖪(γ),\frac{1}{1+\epsilon}\mathsf{K}(\gamma)\preceq\hat{\mathsf{K}}(\gamma)\preceq\frac{1}{1-\epsilon}\mathsf{K}(\gamma),

with probability at least 1δ1-\delta if

p8c2βγϵ2=8κ2Aopγϵ2log(4deff(γ)δ𝖪op)p\geq\frac{8c^{2}\beta}{\gamma\epsilon^{2}}=\frac{8\kappa^{2}\|A\|_{op}}{\gamma\epsilon^{2}}\log\left(\frac{4d_{\rm eff}(\gamma)}{\delta\|\mathsf{K}\|_{op}}\right)

where the last equality is obtained by substituting c2=κ2Aopc^{2}=\kappa^{2}\|A\|_{op} and β=log(4deff(γ)δ𝖪(γ)op)\beta=\log\left(\frac{4d_{\rm eff}(\gamma)}{\delta\|\mathsf{K}(\gamma)\|_{op}}\right) given in Theorem 16. ∎
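In practice, the sample size prescribed by Theorem 6 is straightforward to evaluate; a minimal sketch is given below. The inputs κ, ∥A_*∥_op, d_eff(γ) and ∥𝖪(γ)∥_op are assumed to be known or estimated beforehand, and the printed values are purely illustrative.

import numpy as np

def sample_size(kappa, A_op, gamma, eps, delta, d_eff, K_op):
    """Smallest p ensuring the (1 +/- eps) multiplicative bound of Theorem 6 (sketch)."""
    c2 = kappa**2 * A_op                          # c^2 = kappa^2 ||A_*||_op
    beta = np.log(4 * d_eff / (delta * K_op))     # beta as in Theorem 16
    return int(np.ceil(8 * c2 * beta / (gamma * eps**2)))

# hypothetical values, for illustration only
print(sample_size(kappa=1.0, A_op=5.0, gamma=0.1, eps=0.2, delta=0.05, d_eff=30.0, K_op=0.9))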

Appendix D Supplementary empirical results

D.1 Finer analysis of the Gaussian L-ensemble estimation problem of Section 6

In this section, we report results corresponding to the simulation setting of Section 6 with ρ=100\rho=100.

Intensity estimation from several DPP samples.

In Figure 4, we replicate the setting of Figure 1 with s=3s=3 and s=10s=10 DPP samples and a smaller regularization parameter. The estimated intensity is then closer to the ground truth (ρ=100\rho=100) for a large value of ss, although there are small areas of high intensity at the boundary of the domain [0,1]2[0,1]^{2}. A small improvement is also observed when increasing ss from 33 (left) to 1010 (right): the variance of the estimated intensity tends to decrease as ss increases.

Figure 4: Effect of the number of samples on intensity estimation (with σ=0.1\sigma=0.1 and λ=104\lambda=10^{-4}, as in Figure 1). Left column: estimation from s=3s=3 DPP samples. Right column: estimation from s=10s=10 DPP samples. The first row is a heatmap of the intensity 𝗄^(x,x)\hat{\mathsf{k}}(x,x) on a 100×100100\times 100 grid within [0,1]2[0,1]^{2}. The second row is the same data matrix in a flattened format, that is, each column of the 100×100100\times 100 data matrix is concatenated to form a 10000×110000\times 1 vector whose entries are plotted. Notice that the sharp peaks are due to boundary effects. These peaks are regularly spaced due to the column-wise unfolding.
Figure 5: Effect of the bandwidth σ\sigma on intensity estimation. Left column: large value (σ=0.15\sigma=0.15, λ=104\lambda=10^{-4}) with s=10s=10 DPP samples. Right column: small value (σ=0.05\sigma=0.05, λ=104\lambda=10^{-4}) with s=3s=3 DPP samples. The first row is a heatmap of the intensity on [0,1]2[0,1]^{2}. The second row is the same data matrix in a flattened format, that is, each column of the 100×100100\times 100 data matrix is concatenated to form a 10000×110000\times 1 vector whose entries are plotted. Notice that the sharp peaks in the bottom row are due to boundary effects. These peaks are regularly spaced due to the column-wise unfolding.

In Figure 5, we illustrate intensity estimation for a large and a small value of σ\sigma, on the left and right columns respectively. As expected, a large value of σ\sigma has a regularizing effect but also leads to an underestimation of the intensity. By contrast, a small value of σ\sigma seems to cause inhomogeneities in the estimated intensity.

Figure 6: Correlation kernel estimation without boundary effect. We display the Gram matrix of the estimated correlation kernel [𝗄^(x,x)]x,xgrid[\hat{\mathsf{k}}(x,x^{\prime})]_{x,x^{\prime}\in\text{grid}} (left column) and of the ground truth correlation kernel [𝗄(x,x)]x,xgrid[{\mathsf{k}}(x,x^{\prime})]_{x,x^{\prime}\in\text{grid}} (right column) on a 30×3030\times 30 grid within [0.2,0.8]2[0.2,0.8]^{2} for the example of Figure 4 with s=10s=10, σ=0.1\sigma=0.1 and λ=104\lambda=10^{-4}. The first row is a heatmap of the Gram matrices, while the second row is a one-dimensional slice of the above Gram matrices at index 400400. This second row of plots allows one to visually compare the bell shape of the approximate and exact correlation kernels. The apparent discontinuities in the Gaussian kernel shape are an artifact due to the way the grid points are indexed. Notice that the correlation kernels are evaluated on a smaller domain within [0,1]2[0,1]^{2} in order to remove boundary effects.
Correlation structure estimation.

The general form of the correlation kernel is also important. In order to visualize the shape of the correlation kernel 𝗄^(x,y)\hat{\mathsf{k}}(x,y), we display in Figure 6 the Gram matrices of the estimated [𝗄^(x,x)]x,xgrid[\hat{\mathsf{k}}(x,x^{\prime})]_{x,x^{\prime}\in\text{grid}} and ground truth correlation kernels [𝗄(x,x)]x,xgrid[\mathsf{k}(x,x^{\prime})]_{x,x^{\prime}\in\text{grid}} on a square grid, for the same parameters as in Figure 4 (right column). After removing the boundary effects, we observe that the shape of the estimated correlation kernel closely resembles the ground truth, although the decay of the estimated kernel seems to be slightly slower. Moreover, we observe some ‘noise’ in the tail of the estimated kernel. Again, the intensity of the estimated process is slightly underestimated.

In the context of point processes, it is common to compute summary statistics from samples to ‘understand’ the correlation structure of a stationary process. It is, strictly speaking, not possible to calculate e.g. Ripley’s K function (see Baddeley et al. [2015]) since our estimated correlation kernel is not stationary, that is, there exists no function t()t(\cdot) such that 𝗄^(x,y)=t(xy)\hat{\mathsf{k}}(x,y)=t(x-y).

D.2 Convergence of the regularized Picard iteration

Here, we illustrate the convergence of the regularized Picard iteration to the exact solution given in Proposition 12. In Figure 7, we solve problem (6) in the case =𝒞\mathcal{I}=\mathcal{C} with s=1s=1, where 𝒞\mathcal{C} is the unique DPP sample. For simplicity, we select the DPP sample used in Figure 1 (ρ=100\rho=100, bottom row).

Figure 7: Convergence towards the exact solution for the example of Figure 1 (bottom row) with the parameters σ=0.1\sigma=0.1 and λ=0.1\lambda=0.1. Left: Relative error with respect to the exact solution in Frobenius norm, 𝐁𝐁exactF/𝐁exactF\|\mathbf{B}-\mathbf{B}_{\rm exact}\|_{F}/\|\mathbf{B}_{\rm exact}\|_{F}, versus the iteration number. Right: Objective value (blue line) and optimal objective (red line) versus the iteration number. The stopping criterion is here 𝗍𝗈𝗅=107\mathsf{tol}=10^{-7}.

This illustrates that the regularized Picard iteration indeed converges to the unique solution in this special case.

D.3 Complexity and computing resources

Complexity.

The space complexity of our method is dominated by the storage of the kernel matrix 𝐊\mathbf{K} in Algorithm 1, namely O(m2)O(m^{2}), where we recall that m=|𝒵|m=|\mathcal{Z}| with 𝒵=1s𝒞\mathcal{Z}\triangleq\cup_{\ell=1}^{s}\mathcal{C}_{\ell}\cup\mathcal{I}. The time complexity of one iteration of (14) is dominated by the matrix square root, whose cost is comparable to an eigendecomposition, i.e., O(m3)O(m^{3}). The time complexity of Algorithm 2 is dominated by the Cholesky decomposition and the linear system solution, i.e., O(m3)O(m^{3}).

Computing resources.

A typical computation time is 65 minutes to solve the example of Figure 1 (bottom row) on 8 virtual cores of a server with two 18-core Intel Xeon E5-2695 v4 CPUs (2.1 GHz). The computational bottleneck is the regularized Picard iteration.

References

  • Affandi et al. [2014] R. H. Affandi, E. Fox, R. Adams, and B. Taskar. Learning the parameters of determinantal point process kernels. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1224–1232, Bejing, China, 22–24 Jun 2014. PMLR. URL http://proceedings.mlr.press/v32/affandi14.html.
  • Anquetil et al. [2020] L. Anquetil, M. Gartrell, A. Rakotomamonjy, U. Tanielian, and C. Calauzènes. Wasserstein learning of determinantal point processes. In Learning Meets Combinatorial Algorithms at NeurIPS2020, 2020. URL https://openreview.net/forum?id=fabfWf3JJQi.
  • Baddeley et al. [2015] A. Baddeley, E. Rubak, and R. Turner. Spatial Point Patterns: Methodology and Applications with R. Chapman and Hall/CRC Press, London, 2015. URL https://www.routledge.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/9781482210200/.
  • Baraud [2013] Y. Baraud. Estimation of the density of a determinantal process. Confluentes Mathematici, 5(1):3–21, 2013. doi: 10.5802/cml.1. URL http://www.numdam.org/articles/10.5802/cml.1/.
  • Bardenet and Titsias [2015] R. Bardenet and M. K. Titsias. Inference for determinantal point processes without spectral knowledge. In Advances in Neural Information Processing Systems (NIPS), pages 3375–3383, 2015. URL https://proceedings.neurips.cc/paper/2015/file/2f25f6e326adb93c5787175dda209ab6-Paper.pdf.
  • Belhadji et al. [2019] A. Belhadji, R. Bardenet, and P. Chainais. Kernel quadrature with determinantal point processes. In Advances in Neural Information Processing Systems (NeurIPS), 2019. URL https://proceedings.neurips.cc/paper/2019/file/7012ef0335aa2adbab58bd6d0702ba41-Paper.pdf.
  • Belhadji et al. [2020a] A. Belhadji, R. Bardenet, and P. Chainais. Kernel interpolation with continuous volume sampling. In International Conference on Machine Learning (ICML), 2020a. URL http://proceedings.mlr.press/v119/belhadji20a.html.
  • Belhadji et al. [2020b] A. Belhadji, R. Bardenet, and P. Chainais. A determinantal point process for column subset selection. Journal of Machine Learning Research (JMLR), 2020b. URL http://jmlr.org/papers/v21/19-080.html.
  • Biscio and Lavancier [2017] C. A. N. Biscio and F. Lavancier. Contrast estimation for parametric stationary determinantal point processes. Scandinavian Journal of Statistics, 44(1):204–229, 2017. doi: https://doi.org/10.1111/sjos.12249. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/sjos.12249.
  • Bornemann [2010] F. Bornemann. On the numerical evaluation of Fredholm determinants. Mathematics of Computation, 79(270):871–915, 2010. URL https://www.ams.org/journals/mcom/2010-79-270/S0025-5718-09-02280-7/.
  • Brunel et al. [2017a] V.-E. Brunel, A. Moitra, P. Rigollet, and J. Urschel. Maximum likelihood estimation of determinantal point processes. arXiv preprint arXiv:1701.06501, 2017a. URL https://arxiv.org/abs/1701.06501.
  • Brunel et al. [2017b] V.-E. Brunel, A. Moitra, P. Rigollet, and J. Urschel. Rates of estimation for determinantal point processes. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 343–345. PMLR, 2017b. URL http://proceedings.mlr.press/v65/brunel17a/brunel17a.pdf.
  • Derezinski et al. [2020] M. Derezinski, F. Liang, and M. Mahoney. Bayesian experimental design using regularized determinantal point processes. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3197–3207. PMLR, 26–28 Aug 2020. URL http://proceedings.mlr.press/v108/derezinski20a.html.
  • Dupuy and Bach [2018] C. Dupuy and F. Bach. Learning determinantal point processes in sublinear time. In A. Storkey and F. Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 244–257. PMLR, 09–11 Apr 2018. URL http://proceedings.mlr.press/v84/dupuy18a.html.
  • Gartrell et al. [2017] M. Gartrell, U. Paquet, and N. Koenigstein. Low-rank factorization of determinantal point processes. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, page 1912–1918. AAAI Press, 2017. URL https://ojs.aaai.org/index.php/AAAI/article/view/10869.
  • Gartrell et al. [2019] M. Gartrell, V.-E. Brunel, E. Dohmatob, and S. Krichene. Learning nonsymmetric determinantal point processes. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/cae82d4350cc23aca7fc9ae38dab38ab-Paper.pdf.
  • Gartrell et al. [2021] M. Gartrell, I. Han, E. Dohmatob, J. Gillenwater, and V.-E. Brunel. Scalable Learning and MAP Inference for Nonsymmetric Determinantal Point Processes. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=HajQFbx_yB.
  • Gautier et al. [2019] G. Gautier, R. Bardenet, and M. Valko. On two ways to use determinantal point processes for monte carlo integration. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/1d54c76f48f146c3b2d66daf9d7f845e-Paper.pdf.
  • Ghosh and Rigollet [2020] S. Ghosh and P. Rigollet. Gaussian determinantal processes: A new model for directionality in data. Proceedings of the National Academy of Sciences, 117(24):13207–13213, 2020. ISSN 0027-8424. doi: 10.1073/pnas.1917151117. URL https://www.pnas.org/content/117/24/13207.
  • Hough et al. [2006] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probability surveys, 2006. URL https://doi.org/10.1214/154957806000000078.
  • Hough et al. [2009] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Zeros of Gaussian analytic functions and determinantal point processes, volume 51. American Mathematical Society, 2009. URL https://doi.org/http://dx.doi.org/10.1090/ulect/051.
  • Kulesza and Taskar [2012] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 2012. URL http://dx.doi.org/10.1561/2200000044.
  • Lavancier et al. [2014] F. Lavancier, J. Møller, and E. Rubak. Determinantal point process models and statistical inference. Journal of the Royal Statistical Society, Series B, 2014. URL http://www.jstor.org/stable/24775312.
  • Lavancier et al. [2015] F. Lavancier, J. Møller, and E. Rubak. Determinantal point process models and statistical inference. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 77(4):853–877, 2015. ISSN 13697412, 14679868. URL http://www.jstor.org/stable/24775312.
  • Lavancier et al. [2021] F. Lavancier, A. Poinas, and R. Waagepetersen. Adaptive estimating function inference for non-stationary determinantal point processes. Scandinavian Journal of Statistics, 48(1):87–107, Mar. 2021. doi: 10.1111/sjos.12440. URL https://hal.archives-ouvertes.fr/hal-01816528.
  • Macchi [1975] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7:83–122, 1975. URL http://www.jstor.org/stable/1425855.
  • Mariet and Sra [2015] Z. Mariet and S. Sra. Fixed-point algorithms for learning determinantal point processes. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2389–2397, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/mariet15.html.
  • Mariet et al. [2019] Z. Mariet, M. Gartrell, and S. Sra. Learning determinantal point processes by corrective negative sampling. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2251–2260. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/mariet19b.html.
  • Marteau-Ferey et al. [2020] U. Marteau-Ferey, F. R. Bach, and A. Rudi. Non-parametric models for non-negative functions. In Advances in Neural Information Processing Systems 33, 2020. URL https://arxiv.org/abs/2007.03926.
  • Minsker [2017] S. Minsker. On some extensions of Bernstein’s inequality for self-adjoint operators. Statistics & Probability Letters, 127:111–119, 2017. URL https://www.sciencedirect.com/science/article/pii/S0167715217301207.
  • Poinas and Lavancier [2021] A. Poinas and F. Lavancier. Asymptotic approximation of the likelihood of stationary determinantal point processes. working paper or preprint, Mar. 2021. URL https://hal.archives-ouvertes.fr/hal-03157554.
  • Rosasco et al. [2010] L. Rosasco, M. Belkin, and E. D. Vito. On learning with integral operators. Journal of Machine Learning Research, 11(30):905–934, 2010. URL http://jmlr.org/papers/v11/rosasco10a.html.
  • Rudi et al. [2015] A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1657–1665. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5936-less-is-more-nystrom-computational-regularization.pdf.
  • Rudi et al. [2018] A. Rudi, D. Calandriello, L. Carratino, and L. Rosasco. On fast leverage score sampling and optimal learning. In Advances in Neural Information Processing Systems, pages 5673–5683, 2018. URL https://arxiv.org/abs/1810.13258.
  • Rudi et al. [2020] A. Rudi, U. Marteau-Ferey, and F. Bach. Finding global minima via kernel approximations. In Arxiv preprint arXiv:2012.11978, 2020. URL https://arxiv.org/abs/2012.11978.
  • Simon [1977] B. Simon. Notes on infinite determinants of Hilbert space operators. Advances in Mathematics, 24(3):244–273, 1977. ISSN 0001-8708. doi: https://doi.org/10.1016/0001-8708(77)90057-3. URL https://www.sciencedirect.com/science/article/pii/0001870877900573.
  • Soshnikov [2000] A. Soshnikov. Determinantal random point fields. Russian Mathematical Surveys, 55:923–975, 2000. URL https://doi.org/10.1070/rm2000v055n05abeh000321.
  • Sterge et al. [2020] N. Sterge, B. Sriperumbudur, L. Rosasco, and A. Rudi. Gain with no pain: Efficiency of kernel-pca by nyström sampling. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3642–3652. PMLR, 26–28 Aug 2020. URL http://proceedings.mlr.press/v108/sterge20a.html.
  • Urschel et al. [2017] J. Urschel, V.-E. Brunel, A. Moitra, and P. Rigollet. Learning determinantal point processes with moments and cycles. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3511–3520. PMLR, 06–11 Aug 2017. URL http://proceedings.mlr.press/v70/urschel17a.html.
  • Weidmann [1980] J. Weidmann. Linear Operators in Hilbert Spaces. 68. Springer-Verlag, 1980. doi: 10.1007/978-1-4612-6027-1. URL https://www.springer.com/gp/book/9781461260295.
  • Wendland [2004] H. Wendland. Scattered Data Approximation. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2004. doi: 10.1017/CBO9780511617539. URL https://doi.org/10.1017/CBO9780511617539.